Full Code of funstory-ai/BabelDOC for AI

main 34739ea88118 cached

156 files

1.9 MB

564.1k tokens

1723 symbols

Repository: funstory-ai/BabelDOC
Branch: main
Commit: 34739ea88118
Files: 156
Total size: 1.9 MB

Directory structure:
gitextract_4xv94fs_/

├── .cursorignore
├── .github/
│   ├── ISSUE_TEMPLATE/
│   │   ├── bug_report.yaml
│   │   └── feature_request.yaml
│   ├── PULL_REQUEST_TEMPLATE/
│   │   └── pr_form.yml
│   ├── PULL_REQUEST_TEMPLATE.md
│   ├── dependabot.yml
│   ├── labels.yml
│   ├── release-drafter.yml
│   └── workflows/
│       ├── codeql.yml
│       ├── docs.yml
│       ├── labeler.yml
│       ├── lint.yml
│       ├── pr-lint.yml
│       ├── publish-to-pypi.yml
│       └── test.yml
├── .gitignore
├── .pre-commit-config.yaml
├── LICENSE
├── README.md
├── babeldoc/
│   ├── __init__.py
│   ├── assets/
│   │   ├── assets.py
│   │   └── embedding_assets_metadata.py
│   ├── asynchronize/
│   │   └── __init__.py
│   ├── babeldoc_exception/
│   │   ├── BabelDOCException.py
│   │   └── __init__.py
│   ├── const.py
│   ├── docvision/
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── base_doclayout.py
│   │   ├── doclayout.py
│   │   ├── rpc_doclayout.py
│   │   ├── rpc_doclayout2.py
│   │   ├── rpc_doclayout3.py
│   │   ├── rpc_doclayout4.py
│   │   ├── rpc_doclayout5.py
│   │   ├── rpc_doclayout6.py
│   │   ├── rpc_doclayout7.py
│   │   └── table_detection/
│   │       └── rapidocr.py
│   ├── format/
│   │   ├── __init__.py
│   │   └── pdf/
│   │       ├── __init__.py
│   │       ├── babelpdf/
│   │       │   ├── base14.py
│   │       │   ├── cidfont.py
│   │       │   ├── cmap.py
│   │       │   ├── encoding.py
│   │       │   ├── type3.py
│   │       │   ├── utils.py
│   │       │   └── win_core.py
│   │       ├── converter.py
│   │       ├── document_il/
│   │       │   ├── __init__.py
│   │       │   ├── backend/
│   │       │   │   ├── __init__.py
│   │       │   │   └── pdf_creater.py
│   │       │   ├── frontend/
│   │       │   │   ├── __init__.py
│   │       │   │   └── il_creater.py
│   │       │   ├── il_version_1.py
│   │       │   ├── il_version_1.rnc
│   │       │   ├── il_version_1.rng
│   │       │   ├── il_version_1.xsd
│   │       │   ├── midend/
│   │       │   │   ├── __init__.py
│   │       │   │   ├── add_debug_information.py
│   │       │   │   ├── automatic_term_extractor.py
│   │       │   │   ├── detect_scanned_file.py
│   │       │   │   ├── il_translator.py
│   │       │   │   ├── il_translator_llm_only.py
│   │       │   │   ├── layout_parser.py
│   │       │   │   ├── paragraph_finder.py
│   │       │   │   ├── remove_descent.py
│   │       │   │   ├── styles_and_formulas.py
│   │       │   │   ├── table_parser.py
│   │       │   │   └── typesetting.py
│   │       │   ├── utils/
│   │       │   │   ├── __init__.py
│   │       │   │   ├── extract_char.py
│   │       │   │   ├── fontmap.py
│   │       │   │   ├── formular_helper.py
│   │       │   │   ├── layout_helper.py
│   │       │   │   ├── matrix_helper.py
│   │       │   │   ├── mupdf_helper.py
│   │       │   │   ├── paragraph_helper.py
│   │       │   │   ├── spatial_analyzer.py
│   │       │   │   ├── style_helper.py
│   │       │   │   └── zstd_helper.py
│   │       │   └── xml_converter.py
│   │       ├── high_level.py
│   │       ├── pdfinterp.py
│   │       ├── result_merger.py
│   │       ├── split_manager.py
│   │       └── translation_config.py
│   ├── glossary.py
│   ├── main.py
│   ├── pdfminer/
│   │   ├── LICENSE
│   │   ├── __init__.py
│   │   ├── _saslprep.py
│   │   ├── arcfour.py
│   │   ├── ascii85.py
│   │   ├── casting.py
│   │   ├── ccitt.py
│   │   ├── cmap/
│   │   │   └── README.txt
│   │   ├── cmapdb.py
│   │   ├── converter.py
│   │   ├── data_structures.py
│   │   ├── encodingdb.py
│   │   ├── fontmetrics.py
│   │   ├── glyphlist.py
│   │   ├── high_level.py
│   │   ├── image.py
│   │   ├── jbig2.py
│   │   ├── latin_enc.py
│   │   ├── layout.py
│   │   ├── lzw.py
│   │   ├── pdfcolor.py
│   │   ├── pdfdevice.py
│   │   ├── pdfdocument.py
│   │   ├── pdfexceptions.py
│   │   ├── pdffont.py
│   │   ├── pdfinterp.py
│   │   ├── pdfpage.py
│   │   ├── pdfparser.py
│   │   ├── pdftypes.py
│   │   ├── psexceptions.py
│   │   ├── psparser.py
│   │   ├── py.typed
│   │   ├── runlength.py
│   │   ├── settings.py
│   │   └── utils.py
│   ├── progress_monitor.py
│   ├── tools/
│   │   ├── generate_cmap_metadata.py
│   │   ├── generate_font_metadata.py
│   │   ├── italic_assistance.py
│   │   └── italic_recognize_tool.py
│   ├── translator/
│   │   ├── __init__.py
│   │   ├── cache.py
│   │   └── translator.py
│   └── utils/
│       ├── __init__.py
│       ├── atomic_integer.py
│       ├── memory.py
│       └── priority_thread_pool_executor.py
├── docs/
│   ├── CODE_OF_CONDUCT.md
│   ├── CONTRIBUTING.md
│   ├── CONTRIBUTOR_REWARD.md
│   ├── ImplementationDetails/
│   │   ├── AsyncTranslate/
│   │   │   └── AsyncTranslate.md
│   │   ├── ILTranslator/
│   │   │   └── ILTranslator.md
│   │   ├── PDFCreation/
│   │   │   └── PDFCreation.md
│   │   ├── PDFParsing/
│   │   │   └── PDFParsing.md
│   │   ├── ParagraphFinding/
│   │   │   └── ParagraphFinding.md
│   │   ├── README.md
│   │   ├── StylesAndFormulas/
│   │   │   └── StylesAndFormulas.md
│   │   └── Typesetting/
│   │       └── Typesetting.md
│   ├── README.md
│   ├── deploy.sh
│   ├── example/
│   │   └── demo_glossary.csv
│   ├── index.md
│   ├── intro-to-pdf-object.md
│   ├── requirements.txt
│   └── supported_languages.md
├── mkdocs.yml
├── pyproject.toml
└── tests/
    └── test_translation_cache_cleanup.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .cursorignore
================================================
# Project notes and templates
xnotes/


================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.yaml
================================================
name: "🐞 Bug Report"
description: Create a report to help us improve
labels: ['bug']
body:
  - type: checkboxes
    id: checks
    attributes:
      label: Before you submit
      options:
        - label: I have searched existing issues
          required: true
        - label: I spent at least 5 minutes investigating and preparing this report
          required: true
        - label: I confirmed this is not caused by a network issue
          required: true
        - label: I have fully read and understood the [README](https://github.com/funstory-ai/BabelDOC/blob/main/README.md)
          required: true
        - label: I am certain that this issue is with BabelDOC itself and can be reproduced through the BabelDOC cli
          required: true
        - label: I have uploaded the original file, or confirmed that this issue is unrelated to the original file
          required: true
        - label: I have uploaded the log.
          required: true
        - label: I confirm that the latest version of BabelDOC is being used.
          required: true
        - label: I am aware that the issue section of this project is only for submitting bugs that are clearly related to the BabelDOC core, with complete reproduction steps and relevant logs attached. Otherwise, issues will be closed directly.
          required: true

  - type: markdown
    attributes:
      value: |
        Thank you for using **BabelDOC** and helping us improve it! 🙏
        Please double-check that you have carefully completed every item in the checklist above. (If you have not, the issue will be closed directly without any response.)

        Please also note:
        - If you are using a downstream project such as pdf2zh-next, please submit the issue to that downstream project. Submit an issue here only after you have confirmed that the problem lies in the BabelDOC core library.
        - The CLI is intended for debugging only; we do not provide any technical support for CLI usage.

  - type: markdown
    attributes:
      value: |
        Please note! Users of the Immersive Translate online service should contact customer service and provide their translation ID. **Feedback related to online services is not handled here.**

  - type: textarea
    id: environment
    attributes:
      label: Environment
      description: Provide your system details (required)
      value: |
        - OS:
        - Python:
        - BabelDOC:
      render: markdown
    validations:
      required: true

  - type: textarea
    id: describe
    attributes:
      label: Describe the bug
      description: A clear and concise description of what the bug is.
    validations:
      required: true

  - type: textarea
    id: reproduce
    attributes:
      label: Steps to Reproduce
      description: Help us reproduce the issue. Issues that do not provide reproduction steps will be closed directly.
      value: |
        1. Go to '...'
        2. Click on '...'
        3. See error
    validations:
      required: false

  - type: textarea
    id: expected
    attributes:
      label: Expected Behavior
      description: What did you expect to happen?
    validations:
      required: false

  - type: textarea
    id: logs
    attributes:
      label: Relevant Log Output or Screenshots
      description: Copy and paste any logs or attach screenshots. This will be formatted automatically.
      render: text
    validations:
      required: false

  - type: textarea
    id: pdf
    attributes:
      label: Original PDF File
      description: Upload the input PDF if applicable. (Issues related to specific PDFs but without uploaded files will be closed directly.)
    validations:
      required: false

  - type: textarea
    id: others
    attributes:
      label: Additional Context
      description: Anything else we should know?
    validations:
      required: false


================================================
FILE: .github/ISSUE_TEMPLATE/feature_request.yaml
================================================
name: "✨ Feature Request"
description: Suggest a new idea or improvement for BabelDOC
labels: ['enhancement']
body:
  - type: markdown
    attributes:
      value: |
        Thank you for helping improve **BabelDOC**! Please fill out the form below to suggest a feature.

  - type: checkboxes
    id: checks
    attributes:
      label: Before you submit
      options:
        - label: I have searched existing issues
          required: true
        - label: I have fully read and understood the [README](https://github.com/funstory-ai/BabelDOC/blob/main/README.md)
          required: true
        - label: This feature is not related to the BabelDOC CLI. The CLI is intended for debugging only; we do not accept any feature requests related to the CLI.
          required: true
  
  - type: markdown
    attributes:
      value: |
        If you wish to self-host BabelDOC, please use [PDFMathTranslate-next](https://github.com/PDFMathTranslate-next/PDFMathTranslate-next) instead. If its features do not meet your needs, please submit a feature request to [PDFMathTranslate-next](https://github.com/PDFMathTranslate-next/PDFMathTranslate-next).


  - type: textarea
    id: describe
    attributes:
      label: Is your feature request related to a problem?
      description: If applicable, describe what problem this feature would solve.
      placeholder: Ex. I'm always frustrated when ...
    validations:
      required: false

  - type: textarea
    id: solution
    attributes:
      label: Describe the solution you'd like
      description: What would you like to see happen?
    validations:
      required: true

  - type: textarea
    id: alternatives
    attributes:
      label: Describe alternatives you've considered
      description: Have you thought of other ways to solve this?
    validations:
      required: false

  - type: textarea
    id: additional
    attributes:
      label: Additional context
      description: Any other context, examples, or screenshots?
    validations:
      required: false


================================================
FILE: .github/PULL_REQUEST_TEMPLATE/pr_form.yml
================================================
name: Pull Request
description: Submit a pull request to contribute to BabelDOC
title: "[PR] <Your concise title here>"
labels:
  - needs triage
body:
  - type: markdown
    attributes:
      value: |
        ## 👋 Thanks for contributing to **BabelDOC**!

        Please fill out this form to help us review your pull request effectively.

  - type: input
    id: issue
    attributes:
      label: Related Issue(s)
      description: If this pull request closes or is related to one or more issues, list them here (e.g., #37)
      placeholder: "#37"
    validations:
      required: false

  - type: textarea
    id: summary
    attributes:
      label: Description
      description: Describe the purpose of this pull request and what was changed.
      placeholder: |
        - What does this PR introduce or fix?
        - What is the motivation behind it?
    validations:
      required: true

  - type: dropdown
    id: pr_type
    attributes:
      label: PR Type
      description: What kind of change is this?
      multiple: true
      options:
        - enhancement
        - bug
        - documentation
        - refactor
        - test
        - chore
    validations:
      required: true

  - type: checkboxes
    id: checklist
    attributes:
      label: Contributor Checklist
      options:
        - label: I’ve fully read and understood the **[CONTRIBUTING.md](https://funstory-ai.github.io/BabelDOC/CONTRIBUTING/)** guide
          required: true
        - label: My changes follow the project’s code style and guidelines
          required: true
        - label: I’ve linked the related issue(s) in the description above
        - label: I’ve updated relevant documentation (if applicable)
        - label: I’ve added necessary tests (if applicable)
        - label: All new and existing tests passed locally
        - label: I understand that due to limited maintainer resources, only small pull requests are accepted. Suggestions with proof-of-concept patches are appreciated, and my patch may be rewritten if necessary.

  - type: textarea
    id: testing
    attributes:
      label: Testing Instructions
      description: Provide step-by-step instructions on how to test your changes
      placeholder: |
        1. Run `...`
        2. Visit `...`
        3. Click `...`
        4. Verify `...`
    validations:
      required: false

  - type: textarea
    id: screenshots
    attributes:
      label: Screenshots (if applicable)
      description: If UI changes were made, please attach before/after screenshots.
    validations:
      required: false

  - type: textarea
    id: notes
    attributes:
      label: Additional Notes
      description: Anything else the reviewer should know?
    validations:
      required: false


================================================
FILE: .github/PULL_REQUEST_TEMPLATE.md
================================================
### PR Title

<!-- Please fill in a concise and clear PR title below -->
[PR] <Your concise title here>

### Related Issue(s)

<!-- If this PR closes or is related to one or more issues, please list them here (e.g., #37) -->
<!-- e.g.: Closes #37, Relates to #42 -->

### Motivation and Context

<!-- Why is this change required? What problem does it solve? -->
<!-- If it fixes an open issue, please link to the issue here. -->

### Summary of Changes

<!-- What does this PR introduce or fix? Please describe concisely. -->

### PR Type

<!-- What kind of change is this? Please select one or more -->
- [ ] ✨ Enhancement
- [ ] 🐛 Bug Fix
- [ ] 📚 Documentation
- [ ] 🏗️ Refactor
- [ ] 🧪 Test
- [ ] 🧹 Chore

### Breaking Changes

<!-- Does this PR introduce any breaking changes? If so, please describe them. -->
<!-- - [ ] Yes, this PR introduces breaking changes. -->
<!-- - [ ] No, this PR does not introduce breaking changes. -->
<!-- Detailed description of breaking changes (if any): -->

### Contributor Checklist

- [ ] I have fully read and understood the **[CONTRIBUTING.md](https://funstory-ai.github.io/BabelDOC/CONTRIBUTING/)** guide.
- [ ] I have performed a self-review of my own code.
- [ ] My changes follow the project's code style and guidelines
- [ ] I have linked the related issue(s) in the description above (if applicable)
- [ ] I have updated relevant documentation (if applicable)
- [ ] I have added necessary tests that prove my fix is effective or that my feature works (if applicable)
- [ ] All new and existing tests passed locally with my changes
- [ ] My changes generate no new warnings or errors
- [ ] I understand that due to limited maintainer resources, only small PRs are accepted. Suggestions with proof-of-concept patches are appreciated, and my patch may be rewritten if necessary.

### Testing Instructions

<!-- Please provide clear and concise step-by-step instructions on how to test your changes. -->
<!-- e.g.: -->
<!-- 1. Check out this branch. -->
<!-- 2. Run `...` to install dependencies. -->
<!-- 3. Run `...` to start the application/run the script. -->
<!-- 4. Navigate to `...` or observe `...` -->
<!-- 5. Verify that `...` (expected outcome). -->

### Screenshots (if applicable)

<!-- If your changes include UI modifications, please add screenshots or GIFs to show the before and after. -->

### Additional Notes

<!-- Is there anything else the reviewer should know? For example, any dependencies, or potential impacts. --> 

================================================
FILE: .github/dependabot.yml
================================================
version: 2
updates:
  - package-ecosystem: github-actions
    directory: "/"
    schedule:
      interval: weekly
  # - package-ecosystem: pip
  #   directory: "/.github/workflows"
  #   schedule:
  #     interval: weekly
  # - package-ecosystem: pip
  #   directory: "/docs"
  #   schedule:
  #     interval: weekly
  - package-ecosystem: pip
    directory: "/"
    schedule:
      interval: weekly
    versioning-strategy: lockfile-only
    allow:
      - dependency-type: "all"

================================================
FILE: .github/labels.yml
================================================
---
# Label names matter: Release Drafter uses them to decide where to record
# a change in the changelog, or whether to skip it.
#
# The repository labels will be automatically configured using this file and
# the GitHub Action https://github.com/marketplace/actions/github-labeler.
- name: breaking
  description: Breaking Changes
  color: "bfd4f2"
- name: bug
  description: Something isn't working
  color: "d73a4a"
- name: build
  description: Build System and Dependencies
  color: "bfdadc"
- name: ci
  description: Continuous Integration
  color: "4a97d6"
- name: dependencies
  description: Pull requests that update a dependency file
  color: "0366d6"
- name: documentation
  description: Improvements or additions to documentation
  color: "0075ca"
- name: duplicate
  description: This issue or pull request already exists
  color: "cfd3d7"
- name: enhancement
  description: New feature or request
  color: "a2eeef"
- name: github_actions
  description: Pull requests that update GitHub Actions code
  color: "000000"
- name: good first issue
  description: Good for newcomers
  color: "7057ff"
- name: help wanted
  description: Extra attention is needed
  color: "008672"
- name: invalid
  description: This doesn't seem right
  color: "e4e669"
- name: performance
  description: Performance
  color: "016175"
- name: python
  description: Pull requests that update Python code
  color: "2b67c6"
- name: question
  description: Further information is requested
  color: "d876e3"
- name: refactoring
  description: Refactoring
  color: "ef67c4"
- name: removal
  description: Removals and Deprecations
  color: "9ae7ea"
- name: style
  description: Style
  color: "c120e5"
- name: testing
  description: Testing
  color: "b1fc6f"
- name: wontfix
  description: This will not be worked on
  color: "ffffff"

================================================
FILE: .github/release-drafter.yml
================================================
name-template: 'v$RESOLVED_VERSION'
tag-template: 'v$RESOLVED_VERSION'
categories:
  - title: '🚀 Features'
    labels:
      - 'feature'
      - 'enhancement'
  - title: '🐛 Bug Fixes'
    labels:
      - 'fix'
      - 'bugfix'
      - 'bug'
  - title: '🧰 Maintenance'
    labels:
      - 'chore'
      - 'maintenance'
      - 'refactor'
  - title: '📝 Documentation'
    labels:
      - 'docs'
      - 'documentation'
change-template: '- $TITLE @$AUTHOR (#$NUMBER)'
change-title-escapes: '\<*_&' # You can add # and @ to disable mentions
version-resolver:
  major:
    labels:
      - 'major'
  minor:
    labels:
      - 'minor'
  patch:
    labels:
      - 'patch'
  default: patch
template: |
  ## Changes

  $CHANGES

  ## Contributors
  
  $CONTRIBUTORS
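
Note on the configuration above: the `version-resolver` block maps pull request labels to a semantic-version bump and falls back to `patch` when no label matches. The Python sketch below illustrates that mapping only. It assumes the largest requested bump wins across all merged pull requests, and `resolve_bump` and `next_version` are hypothetical helper names, not part of Release Drafter.

```python
# Hypothetical sketch mirroring the `version-resolver` section above.
# It is NOT Release Drafter's actual implementation.
BUMP_LABELS = {"major": "major", "minor": "minor", "patch": "patch"}
DEFAULT_BUMP = "patch"                    # `default: patch` in the config
PRIORITY = ("major", "minor", "patch")    # assumption: largest bump wins


def resolve_bump(pr_labels):
    """Return the bump implied by the labels of a list of merged PRs."""
    requested = {
        BUMP_LABELS[label]
        for labels in pr_labels
        for label in labels
        if label in BUMP_LABELS
    }
    for bump in PRIORITY:
        if bump in requested:
            return bump
    return DEFAULT_BUMP


def next_version(current, bump):
    """Apply a bump to a 'MAJOR.MINOR.PATCH' string."""
    major, minor, patch = (int(part) for part in current.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


def release_name(current, pr_labels):
    """Format the result like `name-template: 'v$RESOLVED_VERSION'`."""
    return "v" + next_version(current, resolve_bump(pr_labels))
```

For example, under these assumptions `release_name("1.2.3", [["bug"], ["enhancement", "minor"]])` yields `v1.3.0`, while a set of pull requests carrying none of the three bump labels produces a patch release.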


================================================
FILE: .github/workflows/codeql.yml
================================================
# For most projects, this workflow file will not need changing; you simply need
# to commit it to your repository.
#
# You may wish to alter this file to override the set of languages analyzed,
# or to provide custom queries or build logic.
#
# ******** NOTE ********
# We have attempted to detect the languages in your repository. Please check
# the `language` matrix defined below to confirm you have the correct set of
# supported CodeQL languages.
#
name: "CodeQL Advanced"

on:
  push:
  pull_request:
    branches: [ "main" ]
  schedule:
    - cron: '36 14 * * 1'

jobs:
  analyze:
    name: Analyze (${{ matrix.language }})
    # Runner size impacts CodeQL analysis time. To learn more, please see:
    #   - https://gh.io/recommended-hardware-resources-for-running-codeql
    #   - https://gh.io/supported-runners-and-hardware-resources
    #   - https://gh.io/using-larger-runners (GitHub.com only)
    # Consider using larger runners or machines with greater resources for possible analysis time improvements.
    runs-on: ${{ (matrix.language == 'swift' && 'macos-latest') || 'ubuntu-latest' }}
    permissions:
      # required for all workflows
      security-events: write

      # required to fetch internal or private CodeQL packs
      packages: read

      # only required for workflows in private repositories
      actions: read
      contents: read

    strategy:
      fail-fast: false
      matrix:
        include:
        - language: python
          build-mode: none
        - language: actions
        # CodeQL supports the following values for 'language': 'c-cpp', 'csharp', 'go', 'java-kotlin', 'javascript-typescript', 'python', 'ruby', 'swift'
        # Use `c-cpp` to analyze code written in C, C++ or both
        # Use 'java-kotlin' to analyze code written in Java, Kotlin or both
        # Use 'javascript-typescript' to analyze code written in JavaScript, TypeScript or both
        # To learn more about changing the languages that are analyzed or customizing the build mode for your analysis,
        # see https://docs.github.com/en/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/customizing-your-advanced-setup-for-code-scanning.
        # If you are analyzing a compiled language, you can modify the 'build-mode' for that language to customize how
        # your codebase is analyzed, see https://docs.github.com/en/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/codeql-code-scanning-for-compiled-languages
    steps:
    - name: Checkout repository
      uses: actions/checkout@v5

    # Initializes the CodeQL tools for scanning.
    - name: Initialize CodeQL
      uses: github/codeql-action/init@v4
      with:
        languages: ${{ matrix.language }}
        build-mode: ${{ matrix.build-mode }}
        # If you wish to specify custom queries, you can do so here or in a config file.
        # By default, queries listed here will override any specified in a config file.
        # Prefix the list here with "+" to use these queries and those in the config file.

        # For more details on CodeQL's query packs, refer to: https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/configuring-code-scanning#using-queries-in-ql-packs
        # queries: security-extended,security-and-quality

    # If the analyze step fails for one of the languages you are analyzing with
    # "We were unable to automatically build your code", modify the matrix above
    # to set the build mode to "manual" for that language. Then modify this step
    # to build your code.
    # ℹ️ Command-line programs to run using the OS shell.
    # 📚 See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsrun
    - if: matrix.build-mode == 'manual'
      shell: bash
      run: |
        echo 'If you are using a "manual" build mode for one or more of the' \
          'languages you are analyzing, replace this with the commands to build' \
          'your code, for example:'
        echo '  make bootstrap'
        echo '  make release'
        exit 1

    - name: Perform CodeQL Analysis
      uses: github/codeql-action/analyze@v4
      with:
        category: "/language:${{matrix.language}}"


================================================
FILE: .github/workflows/docs.yml
================================================
name: docs
on:
  push:
    branches:
      - main
permissions:
  contents: write
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
        with:
          fetch-depth: 0
      - name: Configure Git Credentials
        run: |
          git config user.name github-actions[bot]
          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
      - name: Setup uv with Python 3.12
        uses: astral-sh/setup-uv@85856786d1ce8acfbcc2f13a5f3fbd6b938f9f41 # v7.1.2
        with:
          python-version: "3.12"
          enable-cache: true
          cache-dependency-glob: "uv.lock"
          activate-environment: true
      - run: echo "cache_id=$(date --utc '+%V')" >> "$GITHUB_ENV"
      - uses: actions/cache@v4
        with:
          key: mkdocs-material-${{ env.cache_id }}
          path: .cache
          restore-keys: |
            mkdocs-material-
      - run: uv sync
      - run: uv run mkdocs gh-deploy --force
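
Note on the workflow above: the `cache_id` step uses `date --utc '+%V'`, which prints the ISO 8601 week number (01 to 53), so the cache key `mkdocs-material-<week>` changes once per week and `restore-keys` can fall back to any older `mkdocs-material-` entry. The Python sketch below reproduces only the key computation; `weekly_cache_key` is a hypothetical name used for illustration.

```python
from datetime import datetime, timezone


def weekly_cache_key(now=None, prefix="mkdocs-material-"):
    """Build a cache key from the ISO 8601 week number in UTC.

    Mirrors `date --utc '+%V'`, which zero-pads the week to two digits.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    iso_week = now.isocalendar()[1]  # index 1 is the ISO week number
    return f"{prefix}{iso_week:02d}"


if __name__ == "__main__":
    print(weekly_cache_key())
```

Because the key carries only the week number and not the year, a key from the same week of a previous year would match exactly. That is harmless for a cache that is rebuilt regularly, but worth knowing when debugging stale cache hits.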

================================================
FILE: .github/workflows/labeler.yml
================================================
name: Labeler

on:
  push:
    branches:
      - 'main'
    paths:
      - '.github/labels.yml'
      - '.github/workflows/labeler.yml'
  pull_request:
    paths:
      - '.github/labels.yml'
      - '.github/workflows/labeler.yml'

permissions:
  contents: read
  issues: write
  pull-requests: write

jobs:
  labeler:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repository
        uses: actions/checkout@v5

      - name: Run Labeler
        uses: crazy-max/ghaction-github-labeler@24d110aa46a59976b8a7f35518cb7f14f434c916 # v5.3.0
        with:
          skip-delete: true
          dry-run: ${{ github.event_name == 'pull_request' }}
          github-token: ${{ secrets.GITHUB_TOKEN }}
          yaml-file: .github/labels.yml
          exclude: |
            help*
            *issue
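
Note on the `exclude` list above: the two patterns are meant to keep the labeler from managing labels such as `help wanted` and `good first issue`. The Python sketch below approximates that filtering with `fnmatch` from the standard library. This is an assumption for illustration, since the action's own matcher may differ in detail, and the label list is a small sample taken from `.github/labels.yml`.

```python
from fnmatch import fnmatchcase

# Patterns copied from the `exclude` input of the workflow above.
EXCLUDE_PATTERNS = ["help*", "*issue"]

# A sample of label names from .github/labels.yml.
LABELS = ["bug", "enhancement", "good first issue", "help wanted", "wontfix"]


def is_excluded(name, patterns=EXCLUDE_PATTERNS):
    """Return True if the label name matches any exclude pattern.

    Uses shell-style, case-sensitive matching as an approximation of the
    action's behaviour (assumption, not the action's actual code).
    """
    return any(fnmatchcase(name, pattern) for pattern in patterns)


excluded = [name for name in LABELS if is_excluded(name)]
managed = [name for name in LABELS if not is_excluded(name)]

if __name__ == "__main__":
    print("excluded:", excluded)
    print("managed:", managed)
```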

================================================
FILE: .github/workflows/lint.yml
================================================
name: Lint Code
permissions:
  contents: read
  pull-requests: write
on: [push]

jobs:
  lint:
    strategy:
      fail-fast: false
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - name: Ruff
        uses: astral-sh/ruff-action@v3
      - name: AutoCorrect
        uses: huacnlee/autocorrect-action@main


================================================
FILE: .github/workflows/pr-lint.yml
================================================
name: Lint Code and Review Dog Report

on: [pull_request]
permissions:
  contents: read
  pull-requests: write
jobs:
  ruff:
    name: runner / ruff
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      
      - name: Install Python
        uses: actions/setup-python@v6
        with:
          python-version: '3.11'
          
      - name: Install ruff
        run: pip install ruff
        
      - name: Install reviewdog
        uses: reviewdog/action-setup@d8edfce3dd5e1ec6978745e801f9c50b5ef80252 # v1.4.0
        with:
          reviewdog_version: latest
          
      - name: Run ruff with reviewdog
        env:
          REVIEWDOG_GITHUB_API_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          ruff check . --output-format=rdjson | reviewdog -f=rdjson -reporter=github-pr-review -fail-on-error
          
  autocorrect:
    name: runner / autocorrect
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - name: AutoCorrect
        uses: huacnlee/autocorrect-action@bf91ab3904c2908dd8e71312a8a83ed1eb632997 # v2.13.3
      - name: Report ReviewDog
        if: failure()
        uses: huacnlee/autocorrect-action@bf91ab3904c2908dd8e71312a8a83ed1eb632997 # v2.13.3
        env:
          REVIEWDOG_GITHUB_API_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          reviewdog: true

================================================
FILE: .github/workflows/publish-to-pypi.yml
================================================
name: Release

on:
  push:
    branches:
      - main
      - master

permissions:
  id-token: write
  contents: write
  pull-requests: write

jobs:
  check-repository:
    name: Check if running in main repository
    runs-on: ubuntu-latest
    outputs:
      is_main_repo: ${{ github.repository == 'funstory-ai/BabelDOC' }}
    steps:
      - run: echo "Running repository check"

  build:
    name: Build distribution 📦
    needs: check-repository
    if: needs.check-repository.outputs.is_main_repo == 'true'
    runs-on: ubuntu-latest
    outputs:
      is_release: ${{ steps.check-version.outputs.tag }}
    steps:
      - uses: actions/checkout@v5
        with:
          persist-credentials: true
          fetch-depth: 2
          token: ${{ secrets.GITHUB_TOKEN }}
          
      - name: Setup uv with Python 3.12
        uses: astral-sh/setup-uv@85856786d1ce8acfbcc2f13a5f3fbd6b938f9f41 # v7.1.2
        with:
          python-version: "3.12"
          enable-cache: true
          cache-dependency-glob: "uv.lock"
          activate-environment: true

      - name: Check if there is a parent commit
        id: check-parent-commit
        run: |
          echo "sha=$(git rev-parse --verify --quiet HEAD^)" >> "$GITHUB_OUTPUT"

      - name: Detect and tag new version
        id: check-version
        if: steps.check-parent-commit.outputs.sha
        uses: salsify/action-detect-and-tag-new-version@b1778166f13188a9d478e2d1198f993011ba9864 # v2.0.3
        with:
          version-command: |
            cat pyproject.toml | grep "version = " | head -n 1 | awk -F'"' '{print $2}'

      - name: Install Dependencies
        run: |
          uv sync

      - name: Bump version for developmental release
        if: "! steps.check-version.outputs.tag"
        run: |
          version=$(uv run bumpver update --patch --tag=final --dry 2>&1 | grep "New Version" | awk '{print $NF}') &&
          uv run bumpver update --set-version $version.dev$(date +%s)

      - name: Build package
        run: "uv build"

      - name: Store the distribution packages
        uses: actions/upload-artifact@v4.6.2
        with:
          name: python-package-distributions
          path: dist/

  publish-to-pypi:
    name: Publish Python 🐍 distribution 📦 to PyPI
    if: needs.build.outputs.is_release != ''
    needs:
      - check-repository
      - build
    runs-on: ubuntu-latest
    environment:
      name: pypi
      url: https://pypi.org/p/BabelDOC

    permissions:
      id-token: write

    steps:
      - name: Download all the dists
        uses: actions/download-artifact@634f93cb2916e3fdff6788551b99b062d0335ce0 # v5.0.0
        with:
          name: python-package-distributions
          path: dist/

      - name: Publish distribution 📦 to PyPI
        uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e # v1.13.0

  publish-to-testpypi:
    name: Publish Python 🐍 distribution 📦 to TestPyPI
    if: needs.build.outputs.is_release == ''
    needs:
      - check-repository
      - build
    runs-on: ubuntu-latest
    environment:
      name: testpypi
      url: https://test.pypi.org/p/BabelDOC

    permissions:
      id-token: write

    steps:
      - name: Download all the dists
        uses: actions/download-artifact@634f93cb2916e3fdff6788551b99b062d0335ce0 # v5.0.0
        with:
          name: python-package-distributions
          path: dist/

      - name: Publish distribution 📦 to TestPyPI
        uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e # v1.13.0
        with:
          repository-url: https://test.pypi.org/legacy/

  post-release:
    name: Post Release Tasks
    needs:
      - check-repository
      - build
      - publish-to-pypi
      - publish-to-testpypi
    if: |
      always() && needs.check-repository.outputs.is_main_repo == 'true' &&
      (needs.publish-to-pypi.result == 'success' || needs.publish-to-testpypi.result == 'success')
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
    steps:
      - uses: actions/checkout@v5
        with:
          persist-credentials: true
          fetch-depth: 2
          token: ${{ secrets.GITHUB_TOKEN }}

      - name: Publish the release notes
        uses: release-drafter/release-drafter@b1476f6e6eb133afa41ed8589daba6dc69b4d3f5 # v6.1.0
        with:
          publish: ${{ needs.build.outputs.is_release != '' }}
          tag: ${{ needs.build.outputs.is_release }}
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

================================================
FILE: .github/workflows/test.yml
================================================
name: Run Tests 🧪

on:
  push:
  pull_request:
    branches: ["main"]

permissions:
  contents: read
  pull-requests: read

jobs:
  test:
    name: Run Python Tests
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12", "3.13"]

    steps:
      - uses: actions/checkout@v5
        with:
          persist-credentials: false
      - name: Cached Assets
        id: cache-assets
        uses: actions/cache@v4.2.0
        with:
          path: ~/.cache/babeldoc
          key: babeldoc-assets-${{ hashFiles('babeldoc/assets/embedding_assets_metadata.py') }}
      - name: Setup uv with Python ${{ matrix.python-version }}
        uses: astral-sh/setup-uv@85856786d1ce8acfbcc2f13a5f3fbd6b938f9f41 # v7.1.2
        with:
          python-version: ${{ matrix.python-version }}
          enable-cache: true
          cache-dependency-glob: "uv.lock"
          activate-environment: true
      - name: Warm up cache
        run: |
          uv run babeldoc --warmup
      - name: Run tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAIAPIKEY }}
          OPENAI_BASE_URL: ${{ secrets.OPENAIAPIURL }}
          OPENAI_MODEL: ${{ secrets.OPENAIMODEL }}
        run: |
          uv run babeldoc --help
          uv run babeldoc --openai --files examples/ci/test.pdf --openai-api-key "$OPENAI_API_KEY" --openai-base-url "$OPENAI_BASE_URL" --openai-model "$OPENAI_MODEL"
      - name: Generate offline assets package
        run: |
          uv run babeldoc --generate-offline-assets /tmp/offline_assets
      - name: Restore offline assets package
        run: |
          rm -rf ~/.cache/babeldoc
          uv run babeldoc --restore-offline-assets /tmp/offline_assets
      - name: Clean up
        run: |
          rm -rf /tmp/offline_assets
          rm -rf ~/.cache/babeldoc/cache.v1.db
          rm -rf ~/.cache/babeldoc/working


================================================
FILE: .gitignore
================================================
# Logs
web/logs
web/*.log
web/npm-debug.log*
web/yarn-debug.log*
web/yarn-error.log*
web/pnpm-debug.log*
web/lerna-debug.log*

web/node_modules
web/dist
web/dist-ssr
web/*.local

memray*
**/*.so
*.pdf
*.docx
*.json
**/*.pyc
.venv
.idea
*.egg-info
.DS_Store
.vscode
__pycache__
.ruff_cache
yadt.toml
examples/
/make_gif.py
/dist
.cache
.cursor/rules/_*.mdc
/.cursor
/xnotes
/docs/workflow-rules.md
babeldoc/format/txt
/profile.svg


# uv
uv.lock

# Claude Code memory file
CLAUDE.md
/.claude
babeldoc/format/playground
temp.jpg
AGENTS.md


================================================
FILE: .pre-commit-config.yaml
================================================
files: '^.*\.py$'
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    # Ruff version.
    rev: v0.9.5
    hooks:
      # Run the linter.
      - id: ruff
        args: [ "--fix",
                "--ignore=E203,E261,E501,E741,F841" ]
      # Run the formatter.
      - id: ruff-format


================================================
FILE: LICENSE
================================================
                    GNU AFFERO GENERAL PUBLIC LICENSE
                       Version 3, 19 November 2007

 Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.

                            Preamble

  The GNU Affero General Public License is a free, copyleft license for
software and other kinds of works, specifically designed to ensure
cooperation with the community in the case of network server software.

  The licenses for most software and other practical works are designed
to take away your freedom to share and change the works.  By contrast,
our General Public Licenses are intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users.

  When we speak of free software, we are referring to freedom, not
price.  Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.

  Developers that use our General Public Licenses protect your rights
with two steps: (1) assert copyright on the software, and (2) offer
you this License which gives you legal permission to copy, distribute
and/or modify the software.

  A secondary benefit of defending all users' freedom is that
improvements made in alternate versions of the program, if they
receive widespread use, become available for other developers to
incorporate.  Many developers of free software are heartened and
encouraged by the resulting cooperation.  However, in the case of
software used on network servers, this result may fail to come about.
The GNU General Public License permits making a modified version and
letting the public access it on a server without ever releasing its
source code to the public.

  The GNU Affero General Public License is designed specifically to
ensure that, in such cases, the modified source code becomes available
to the community.  It requires the operator of a network server to
provide the source code of the modified version running there to the
users of that server.  Therefore, public use of a modified version, on
a publicly accessible server, gives the public access to the source
code of the modified version.

  An older license, called the Affero General Public License and
published by Affero, was designed to accomplish similar goals.  This is
a different license, not a version of the Affero GPL, but Affero has
released a new version of the Affero GPL which permits relicensing under
this license.

  The precise terms and conditions for copying, distribution and
modification follow.

                       TERMS AND CONDITIONS

  0. Definitions.

  "This License" refers to version 3 of the GNU Affero General Public License.

  "Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.

  "The Program" refers to any copyrightable work licensed under this
License.  Each licensee is addressed as "you".  "Licensees" and
"recipients" may be individuals or organizations.

  To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy.  The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.

  A "covered work" means either the unmodified Program or a work based
on the Program.

  To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy.  Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.

  To "convey" a work means any kind of propagation that enables other
parties to make or receive copies.  Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.

  An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License.  If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.

  1. Source Code.

  The "source code" for a work means the preferred form of the work
for making modifications to it.  "Object code" means any non-source
form of a work.

  A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.

  The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form.  A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.

  The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities.  However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work.  For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.

  The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.

  The Corresponding Source for a work in source code form is that
same work.

  2. Basic Permissions.

  All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met.  This License explicitly affirms your unlimited
permission to run the unmodified Program.  The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work.  This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.

  You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force.  You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright.  Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.

  Conveying under any other circumstances is permitted solely under
the conditions stated below.  Sublicensing is not allowed; section 10
makes it unnecessary.

  3. Protecting Users' Legal Rights From Anti-Circumvention Law.

  No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.

  When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.

  4. Conveying Verbatim Copies.

  You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.

  You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.

  5. Conveying Modified Source Versions.

  You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:

    a) The work must carry prominent notices stating that you modified
    it, and giving a relevant date.

    b) The work must carry prominent notices stating that it is
    released under this License and any conditions added under section
    7.  This requirement modifies the requirement in section 4 to
    "keep intact all notices".

    c) You must license the entire work, as a whole, under this
    License to anyone who comes into possession of a copy.  This
    License will therefore apply, along with any applicable section 7
    additional terms, to the whole of the work, and all its parts,
    regardless of how they are packaged.  This License gives no
    permission to license the work in any other way, but it does not
    invalidate such permission if you have separately received it.

    d) If the work has interactive user interfaces, each must display
    Appropriate Legal Notices; however, if the Program has interactive
    interfaces that do not display Appropriate Legal Notices, your
    work need not make them do so.

  A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit.  Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.

  6. Conveying Non-Source Forms.

  You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:

    a) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by the
    Corresponding Source fixed on a durable physical medium
    customarily used for software interchange.

    b) Convey the object code in, or embodied in, a physical product
    (including a physical distribution medium), accompanied by a
    written offer, valid for at least three years and valid for as
    long as you offer spare parts or customer support for that product
    model, to give anyone who possesses the object code either (1) a
    copy of the Corresponding Source for all the software in the
    product that is covered by this License, on a durable physical
    medium customarily used for software interchange, for a price no
    more than your reasonable cost of physically performing this
    conveying of source, or (2) access to copy the
    Corresponding Source from a network server at no charge.

    c) Convey individual copies of the object code with a copy of the
    written offer to provide the Corresponding Source.  This
    alternative is allowed only occasionally and noncommercially, and
    only if you received the object code with such an offer, in accord
    with subsection 6b.

    d) Convey the object code by offering access from a designated
    place (gratis or for a charge), and offer equivalent access to the
    Corresponding Source in the same way through the same place at no
    further charge.  You need not require recipients to copy the
    Corresponding Source along with the object code.  If the place to
    copy the object code is a network server, the Corresponding Source
    may be on a different server (operated by you or a third party)
    that supports equivalent copying facilities, provided you maintain
    clear directions next to the object code saying where to find the
    Corresponding Source.  Regardless of what server hosts the
    Corresponding Source, you remain obligated to ensure that it is
    available for as long as needed to satisfy these requirements.

    e) Convey the object code using peer-to-peer transmission, provided
    you inform other peers where the object code and Corresponding
    Source of the work are being offered to the general public at no
    charge under subsection 6d.

  A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.

  A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling.  In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage.  For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product.  A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.

  "Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source.  The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.

  If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information.  But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).

  The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed.  Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.

  Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.

  7. Additional Terms.

  "Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law.  If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.

  When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it.  (Additional permissions may be written to require their own
removal in certain cases when you modify the work.)  You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.

  Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:

    a) Disclaiming warranty or limiting liability differently from the
    terms of sections 15 and 16 of this License; or

    b) Requiring preservation of specified reasonable legal notices or
    author attributions in that material or in the Appropriate Legal
    Notices displayed by works containing it; or

    c) Prohibiting misrepresentation of the origin of that material, or
    requiring that modified versions of such material be marked in
    reasonable ways as different from the original version; or

    d) Limiting the use for publicity purposes of names of licensors or
    authors of the material; or

    e) Declining to grant rights under trademark law for use of some
    trade names, trademarks, or service marks; or

    f) Requiring indemnification of licensors and authors of that
    material by anyone who conveys the material (or modified versions of
    it) with contractual assumptions of liability to the recipient, for
    any liability that these contractual assumptions directly impose on
    those licensors and authors.

  All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10.  If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term.  If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.

  If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.

  Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.

  8. Termination.

  You may not propagate or modify a covered work except as expressly
provided under this License.  Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).

  However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.

  Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.

  Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License.  If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.

  9. Acceptance Not Required for Having Copies.

  You are not required to accept this License in order to receive or
run a copy of the Program.  Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance.  However,
nothing other than this License grants you permission to propagate or
modify any covered work.  These actions infringe copyright if you do
not accept this License.  Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.

  10. Automatic Licensing of Downstream Recipients.

  Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License.  You are not responsible
for enforcing compliance by third parties with this License.

  An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations.  If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.

  You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License.  For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.

  11. Patents.

  A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based.  The
work thus licensed is called the contributor's "contributor version".

  A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version.  For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.

  Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.

  In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement).  To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.

  If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients.  "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.

  If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.

  A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License.  You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.

  Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.

  12. No Surrender of Others' Freedom.

  If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License.  If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all.  For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.

  13. Remote Network Interaction; Use with the GNU General Public License.

  Notwithstanding any other provision of this License, if you modify the
Program, your modified version must prominently offer all users
interacting with it remotely through a computer network (if your version
supports such interaction) an opportunity to receive the Corresponding
Source of your version by providing access to the Corresponding Source
from a network server at no charge, through some standard or customary
means of facilitating copying of software.  This Corresponding Source
shall include the Corresponding Source for any work covered by version 3
of the GNU General Public License that is incorporated pursuant to the
following paragraph.

  Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU General Public License into a single
combined work, and to convey the resulting work.  The terms of this
License will continue to apply to the part which is the covered work,
but the work with which it is combined will remain governed by version
3 of the GNU General Public License.

  14. Revised Versions of this License.

  The Free Software Foundation may publish revised and/or new versions of
the GNU Affero General Public License from time to time.  Such new versions
will be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.

  Each version is given a distinguishing version number.  If the
Program specifies that a certain numbered version of the GNU Affero General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation.  If the Program does not specify a version number of the
GNU Affero General Public License, you may choose any version ever published
by the Free Software Foundation.

  If the Program specifies that a proxy can decide which future
versions of the GNU Affero General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.

  Later license versions may give you additional or different
permissions.  However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.

  15. Disclaimer of Warranty.

  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

  16. Limitation of Liability.

  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.

  17. Interpretation of Sections 15 and 16.

  If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.

                     END OF TERMS AND CONDITIONS

            How to Apply These Terms to Your New Programs

  If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.

  To do so, attach the following notices to the program.  It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.

    BabelDOC is a library for an ultimate document translation solution.
    Copyright (C) 2024  <funstory.ai limited>

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU Affero General Public License as published
    by the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU Affero General Public License for more details.

    You should have received a copy of the GNU Affero General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.

Also add information on how to contact you by electronic and paper mail.

  If your software can interact with users remotely through a computer
network, you should also make sure that it provides a way for users to
get its source.  For example, if your program is a web application, its
interface could display a "Source" link that leads users to an archive
of the code.  There are many ways you could offer source, and different
solutions will be better for different programs; see section 13 for the
specific requirements.

  You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU AGPL, see
<https://www.gnu.org/licenses/>.


================================================
FILE: README.md
================================================
<!-- # Yet Another Document Translator -->

<div align="center">
<!-- <img src="https://s.immersivetranslate.com/assets/r2-uploads/images/babeldoc-banner.png" width="320px"  alt="YADT"/> -->

<br/>

<picture>
  <source media="(prefers-color-scheme: dark)" srcset="https://s.immersivetranslate.com/assets/uploads/babeldoc-big-logo-darkmode-with-transparent-background-IKuNO1.svg" width="320px" alt="BabelDOC"/>
  <img src="https://s.immersivetranslate.com/assets/uploads/babeldoc-big-logo-with-transparent-background-2xweBr.svg" width="320px" alt="BabelDOC"/>
</picture>

<!-- <h2 id="title">BabelDOC</h2> -->

<p>
  <!-- PyPI -->
  <a href="https://pypi.org/project/BabelDOC/">
    <img src="https://img.shields.io/pypi/v/BabelDOC"></a>
  <a href="https://pepy.tech/projects/BabelDOC">
    <img src="https://static.pepy.tech/badge/BabelDOC"></a>
  <!-- <a href="https://github.com/funstory-ai/BabelDOC/pulls">
    <img src="https://img.shields.io/badge/contributions-welcome-green"></a> -->
  <!-- License -->
  <a href="./LICENSE">
    <img src="https://img.shields.io/github/license/funstory-ai/BabelDOC"></a>
  <a href="https://t.me/+Z9_SgnxmsmA5NzBl">
    <img src="https://img.shields.io/badge/Telegram-2CA5E0?style=flat-squeare&logo=telegram&logoColor=white"></a>
  <a href="https://deepwiki.com/funstory-ai/BabelDOC"><img src="https://deepwiki.com/badge.svg" alt="Ask DeepWiki"></a>
</p>

<a href="https://trendshift.io/repositories/13358" target="_blank"><img src="https://trendshift.io/api/badge/repositories/13358" alt="funstory-ai%2FBabelDOC | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

</div>

PDF scientific paper translation and bilingual comparison library.

- **Online Service**: The beta version is available at [Immersive Translate - BabelDOC](https://app.immersivetranslate.com/babel-doc/). A free usage quota is available; please refer to the FAQ section on the page for details.
- **Self-deployment**: [PDFMathTranslate-next](https://github.com/PDFMathTranslate-next/PDFMathTranslate-next) supports BabelDOC and offers self-deployment plus a WebUI with more translation services.
- Provides a simple [command line interface](#getting-started).
- Provides a [Python API](#python-api).
- Mainly designed to be embedded into other programs, but can also be used directly for simple translation tasks.

> [!TIP]
>
> How to use BabelDOC in Zotero
>
> 1. Immersive Translate Pro members can use the [immersive-translate/zotero-immersivetranslate](https://github.com/immersive-translate/zotero-immersivetranslate) plugin
>
> 2. PDFMathTranslate self-deployed users can use the [guaguastandup/zotero-pdf2zh](https://github.com/guaguastandup/zotero-pdf2zh) plugin

[Supported Languages](https://funstory-ai.github.io/BabelDOC/supported_languages/)

## Preview

<div align="center">
<img src="https://s.immersivetranslate.com/assets/r2-uploads/images/babeldoc-preview.png" width="80%"/>
</div>

## We are hiring

See details: [EN](https://github.com/funstory-ai/jobs) | [ZH](https://github.com/funstory-ai/jobs/blob/main/README_ZH.md)

## Getting Started

### Install from PyPI

We recommend using the Tool feature of [uv](https://github.com/astral-sh/uv) to install BabelDOC.

1. Follow the [uv installation](https://github.com/astral-sh/uv#installation) guide to install uv and set up the `PATH` environment variable as prompted.

2. Use the following command to install BabelDOC:

```bash
uv tool install --python 3.12 BabelDOC

babeldoc --help
```

3. Use the `babeldoc` command. For example:

```bash
babeldoc --openai --openai-model "gpt-4o-mini" --openai-base-url "https://api.openai.com/v1" --openai-api-key "your-api-key-here"  --files example.pdf

# multiple files
babeldoc --openai --openai-model "gpt-4o-mini" --openai-base-url "https://api.openai.com/v1" --openai-api-key "your-api-key-here"  --files example1.pdf --files example2.pdf
```

### Install from Source

We still recommend using [uv](https://github.com/astral-sh/uv) to manage virtual environments.

1. Follow the [uv installation](https://github.com/astral-sh/uv#installation) guide to install uv and set up the `PATH` environment variable as prompted.

2. Use the following commands to install BabelDOC:

```bash
# clone the project
git clone https://github.com/funstory-ai/BabelDOC

# enter the project directory
cd BabelDOC

# install dependencies and run babeldoc
uv run babeldoc --help
```

3. Use the `uv run babeldoc` command. For example:

```bash
uv run babeldoc --files example.pdf --openai --openai-model "gpt-4o-mini" --openai-base-url "https://api.openai.com/v1" --openai-api-key "your-api-key-here"

# multiple files
uv run babeldoc --files example.pdf --files example2.pdf --openai --openai-model "gpt-4o-mini" --openai-base-url "https://api.openai.com/v1" --openai-api-key "your-api-key-here"
```

> [!TIP]
> Absolute file paths are recommended.
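
As a minimal sketch of the tip above, the following Python helper assembles the same command line with every input resolved to an absolute path. The helper name `build_babeldoc_command` is hypothetical and not part of BabelDOC; it only uses flags that appear in the examples above.

```python
import shlex
from pathlib import Path


def build_babeldoc_command(files, api_key, model="gpt-4o-mini",
                           base_url="https://api.openai.com/v1"):
    """Return an argv list for the babeldoc CLI (hypothetical helper)."""
    argv = [
        "babeldoc", "--openai",
        "--openai-model", model,
        "--openai-base-url", base_url,
        "--openai-api-key", api_key,
    ]
    for name in files:
        # Resolve each input to an absolute path, as the tip recommends.
        argv += ["--files", str(Path(name).resolve())]
    return argv


cmd = build_babeldoc_command(["example.pdf"], "your-api-key-here")
# Print a shell-quoted command line that can be copied into a terminal.
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd)` unchanged, which avoids shell quoting problems with paths that contain spaces.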

## Advanced Options

> [!NOTE]
> This CLI is mainly for debugging purposes. Although end users can use this CLI to translate files, we do not provide any technical support for this purpose.
>
> End users should use the **Online Service** directly: the beta version is available at [Immersive Translate - BabelDOC](https://app.immersivetranslate.com/babel-doc/), with 1000 free pages per month.
>
> End users who need self-deployment should use [PDFMathTranslate 2.0](https://github.com/PDFMathTranslate/PDFMathTranslate-next)
> 
> Any option that is not listed below is a debugging option for maintainers. Please do not use such options.


### Language Options

- `--lang-in`, `-li`: Source language code (default: en)
- `--lang-out`, `-lo`: Target language code (default: zh)

> [!TIP]
> Currently, this project mainly focuses on English-to-Chinese translation, and other scenarios have not been tested yet.
> 
> (2025.3.1 update): Basic English target language support has been added, primarily to minimize line breaks within words (`[0-9A-Za-z]+`).
> 
> [HELP WANTED: Collecting word regular expressions for more languages](https://github.com/funstory-ai/BabelDOC/issues/129)

### PDF Processing Options

- `--files`: One or more file paths to input PDF documents.
- `--pages`, `-p`: Specify pages to translate (e.g., "1,2,1-,-3,3-5"). If not set, all pages are translated.
- `--split-short-lines`: Force split short lines into different paragraphs (may cause poor typesetting & bugs)
- `--short-line-split-factor`: Split threshold factor (default: 0.8). The actual threshold is the median length of all lines on the current page \* this factor
- `--skip-clean`: Skip PDF cleaning step
- `--dual-translate-first`: Put translated pages first in dual PDF mode (default: original pages first)
- `--disable-rich-text-translate`: Disable rich text translation (may help improve compatibility with some PDFs)
- `--enhance-compatibility`: Enable all compatibility enhancement options (equivalent to `--skip-clean --dual-translate-first --disable-rich-text-translate`)
- `--use-alternating-pages-dual`: Use alternating pages mode for dual PDF. When enabled, original and translated pages are arranged in alternate order. When disabled (default), original and translated pages are shown side by side on the same page.
- `--watermark-output-mode`: Control watermark output mode: 'watermarked' (default) adds watermark to translated PDF, 'no_watermark' doesn't add watermark, 'both' outputs both versions.
- `--max-pages-per-part`: Maximum number of pages per part for split translation. If not set, no splitting will be performed.
- `--no-watermark`: [DEPRECATED] Use --watermark-output-mode=no_watermark instead.
- `--translate-table-text`: Translate table text (experimental, default: False)
- `--formular-font-pattern`: Font pattern to identify formula text (default: None)
- `--formular-char-pattern`: Character pattern to identify formula text (default: None)
- `--show-char-box`: Show character bounding boxes (debug only, default: False)
- `--skip-scanned-detection`: Skip scanned document detection (default: False). When using split translation, only the first part performs detection if not skipped.
- `--ocr-workaround`: Use OCR workaround (default: False). Only suitable for documents with black text on white background. When enabled, white rectangular blocks will be added below the translation to cover the original text content, and all text will be forced to black color.
- `--auto-enable-ocr-workaround`: Enable automatic OCR workaround (default: False). If a document is detected as heavily scanned, this will attempt to enable OCR processing and skip further scan detection. See "Important Interaction Note" below for crucial details on how this interacts with `--ocr-workaround` and `--skip-scanned-detection`.
- `--primary-font-family`: Override primary font family for translated text. Choices: 'serif' for serif fonts, 'sans-serif' for sans-serif fonts, 'script' for script/italic fonts. If not specified, uses automatic font selection based on original text properties.
- `--only-include-translated-page`: Only include translated pages in the output PDF. This option is only effective when `--pages` is used. (default: False)
- `--merge-alternating-line-numbers`: Enable post-processing to merge alternating line-number layouts (keep the number paragraph as an independent paragraph b; merge adjacent text paragraphs a and c across it when `layout_id` and `xobj_id` match, digits are ASCII and spaces only). Default: off.
- `--skip-form-render`: Skip form rendering (default: False). When enabled, PDF forms will not be rendered in the output.
- `--skip-curve-render`: Skip curve rendering (default: False). When enabled, PDF curves will not be rendered in the output.
- `--only-parse-generate-pdf`: Only parse PDF and generate output PDF without translation (default: False). This skips all translation-related processing including layout analysis, paragraph finding, style processing, and translation itself. Useful for testing PDF parsing and reconstruction functionality.
- `--remove-non-formula-lines`: Remove non-formula lines from paragraph areas (default: False). This removes decorative lines that are not part of formulas, while protecting lines in figure/table areas. Useful for cleaning up documents with decorative elements that interfere with text flow.
- `--non-formula-line-iou-threshold`: IoU threshold for detecting paragraph overlap when removing non-formula lines (default: 0.9). Higher values are more conservative and will remove fewer lines.
- `--figure-table-protection-threshold`: IoU threshold for protecting lines in figure/table areas when removing non-formula lines (default: 0.9). Higher values provide more protection for structural elements in figures and tables.
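
Both thresholds above are expressed as IoU (intersection over union) values. The sketch below shows the standard IoU computation for two axis-aligned boxes; the `(x0, y0, x1, y1)` box layout and the function name `iou` are assumptions for illustration, and how BabelDOC applies the thresholds internally is not shown here.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


half_overlap = iou((0, 0, 10, 10), (0, 0, 10, 5))
no_overlap = iou((0, 0, 1, 1), (2, 2, 3, 3))
```

IoU ranges from 0 (disjoint boxes) to 1 (identical boxes), so a threshold close to 1 only matches boxes that nearly coincide.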

- `--rpc-doclayout`: RPC service host address for document layout analysis (default: None)
- `--working-dir`: Working directory for translation. If not set, use temp directory.
- `--no-auto-extract-glossary`: Disable automatic term extraction. If this flag is present, the step is skipped. Defaults to enabled.
- `--save-auto-extracted-glossary`: Save automatically extracted glossary to the specified file. If not set, the glossary will not be saved.
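
The `--pages` syntax can be read as a comma-separated list of single pages and ranges. A minimal sketch, assuming that an open-ended range such as `1-` runs to the last page and `-3` starts at the first page; the function `parse_pages` is hypothetical, and BabelDOC's own parser may behave differently:

```python
def parse_pages(spec, total_pages):
    """Expand a spec such as "1,2,1-,-3,3-5" into sorted 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            # Assumption: a missing bound means "first page" or "last page".
            start = int(start_s) if start_s else 1
            end = int(end_s) if end_s else total_pages
        else:
            start = end = int(part)
        # Clamp to the document; an empty range contributes nothing.
        start, end = max(start, 1), min(end, total_pages)
        pages.update(range(start, end + 1))
    return sorted(pages)


selected = parse_pages("-3,5", total_pages=6)
tail = parse_pages("4-", total_pages=6)
```

Using a set means that overlapping entries such as `1,1-` do not produce duplicate pages.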

> [!TIP]
> - Both `--skip-clean` and `--dual-translate-first` may help improve compatibility with some PDF readers
> - `--disable-rich-text-translate` can also help with compatibility by simplifying translation input
> - However, using `--skip-clean` will result in larger file sizes
> - If you encounter any compatibility issues, try using `--enhance-compatibility` first
> - Use `--max-pages-per-part` for large documents to split them into smaller parts for translation and automatically merge them back.
> - Use `--skip-scanned-detection` to speed up processing when you know your document is not a scanned PDF.
> - Use `--ocr-workaround` to fill the background of scanned PDFs. (Current assumption: the background is pure white and the text is pure black. This option also automatically enables `--skip-scanned-detection`.)
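
To illustrate what splitting with `--max-pages-per-part` means, here is a minimal sketch that divides a document into consecutive page ranges. The function `split_into_parts` is hypothetical; the actual splitting strategy in BabelDOC may differ, for example by balancing part sizes.

```python
def split_into_parts(total_pages, max_pages_per_part=None):
    """Return (first_page, last_page) tuples covering 1..total_pages."""
    if not max_pages_per_part:
        # No limit set: the whole document is a single part.
        return [(1, total_pages)]
    return [
        (start, min(start + max_pages_per_part - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages_per_part)
    ]


parts = split_into_parts(120, 50)
print(parts)  # [(1, 50), (51, 100), (101, 120)]
```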

### Translation Service Options

- `--qps`: QPS (Queries Per Second) limit for translation service (default: 4)
- `--ignore-cache`: Ignore translation cache and force retranslation
- `--no-dual`: Do not output bilingual PDF files
- `--no-mono`: Do not output monolingual PDF files
- `--min-text-length`: Minimum text length to translate (default: 5)
- `--openai`: Use OpenAI for translation (default: False)
- `--custom-system-prompt`: Custom system prompt for translation.
- `--add-formula-placehold-hint`: Add formula placeholder hint for translation. (Currently not recommended, it may affect translation quality, default: False)
- `--disable-same-text-fallback`: Disable fallback translation when LLM output matches input text. (default: False)
- `--pool-max-workers`: Maximum number of worker threads for internal task processing pools. If not specified, defaults to QPS value. This parameter directly sets the worker count, replacing previous QPS-based dynamic calculations.

> [!TIP]
>
> 1. Currently, only OpenAI-compatible LLM is supported. For more translator support, please use [PDFMathTranslate 2.0](https://github.com/PDFMathTranslate/PDFMathTranslate-next).
> 2. It is recommended to use models with strong compatibility with OpenAI, such as: `glm-4-flash`, `deepseek-chat`, etc.
> 3. BabelDOC has not yet been optimized for traditional translation engines such as Bing or Google; using LLMs is recommended.
> 4. You can use [litellm](https://github.com/BerriAI/litellm) to access multiple models.
> 5. `--custom-system-prompt`: It is mainly used to add the `/no_think` instruction of Qwen 3 in the prompt. For example: `--custom-system-prompt "/no_think You are a professional, authentic machine translation engine."`

### OpenAI Specific Options

- `--openai-model`: OpenAI model to use (default: gpt-4o-mini)
- `--openai-base-url`: Base URL for OpenAI API
- `--openai-api-key`: API key for OpenAI service
- `--enable-json-mode-if-requested`: Enable JSON mode for OpenAI requests (default: False)
- `--term-pool-max-workers`: Maximum number of worker threads dedicated to automatic term extraction. If not specified, this defaults to the value of `--pool-max-workers`, which itself defaults to the QPS value when unset.

> [!TIP]
>
> 1. This tool supports any OpenAI-compatible API endpoints. Just set the correct base URL and API key. (e.g. `https://xxx.custom.xxx/v1`)
> 2. For local models like Ollama, you can use any value as the API key (e.g. `--openai-api-key a`).

### Glossary Options

- `--glossary-files`: Comma-separated paths to glossary CSV files.
  - Each CSV file should have the columns: `source`, `target`, and an optional `tgt_lng`.
  - The `source` column contains the term in the original language.
  - The `target` column contains the term in the target language.
  - The `tgt_lng` column (optional) specifies the target language for that specific entry (e.g., "zh-CN", "en-US").
    - If `tgt_lng` is provided for an entry, that entry will only be loaded and used if its (normalized) `tgt_lng` matches the (normalized) overall target language specified by `--lang-out`. Normalization involves lowercasing and replacing hyphens (`-`) with underscores (`_`).
    - If `tgt_lng` is omitted for an entry, that entry is considered applicable for any `--lang-out`.
  - The name of each glossary (used in LLM prompts) is derived from its filename (without the .csv extension).
  - During translation, the system will check the input text against the loaded glossaries. If terms from a glossary are found in the current text segment, that glossary (with the relevant terms) will be included in the prompt to the language model, along with an instruction to adhere to it.
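
The `tgt_lng` filtering rules above can be sketched as follows. The function names are hypothetical, and the term lookup uses a plain case-sensitive substring test, which is an assumption; BabelDOC's actual matching may be more sophisticated.

```python
import csv
import io


def normalize_lang(code):
    """Lowercase and replace hyphens with underscores, as described above."""
    return code.strip().lower().replace("-", "_")


def load_glossary(csv_text, lang_out):
    """Keep entries whose tgt_lng is empty or matches lang_out."""
    entries = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        tgt_lng = (row.get("tgt_lng") or "").strip()
        if tgt_lng and normalize_lang(tgt_lng) != normalize_lang(lang_out):
            continue  # Entry targets a different language.
        entries.append((row["source"], row["target"]))
    return entries


def terms_in_text(entries, text):
    """Assumption: a term is relevant if its source appears in the text."""
    return [(src, tgt) for src, tgt in entries if src in text]


SAMPLE = """source,target,tgt_lng
transformer,变换器,zh-CN
attention,注意力,
transformer,トランスフォーマー,ja
"""

entries = load_glossary(SAMPLE, "zh_cn")
hits = terms_in_text(entries, "The transformer architecture")
```

With `--lang-out` set to `zh_cn`, the `ja` entry is skipped, the `zh-CN` entry matches after normalization, and the entry without `tgt_lng` is always kept.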

### Output Control

- `--output`, `-o`: Output directory for translated files. If not set, use current working directory.
- `--debug`: Enable debug logging level and export detailed intermediate results in `~/.cache/yadt/working`.
- `--report-interval`: Progress report interval in seconds (default: 0.1).

### General Options

- `--warmup`: Only download and verify required assets then exit (default: False)

### Offline Assets Management

- `--generate-offline-assets`: Generate an offline assets package in the specified directory. This creates a zip file containing all required models and fonts.
- `--restore-offline-assets`: Restore an offline assets package from the specified file. This extracts models and fonts from a previously generated package.

> [!TIP]
> 
> 1. Offline assets packages are useful for environments without internet access or to speed up installation on multiple machines.
> 2. Generate a package once with `babeldoc --generate-offline-assets /path/to/output/dir` and then distribute it.
> 3. Restore the package on target machines with `babeldoc --restore-offline-assets /path/to/offline_assets_*.zip`.
> 4. The offline assets package name cannot be modified because the file list hash is encoded in the name.
> 5. If you provide a directory path to `--restore-offline-assets`, the tool will automatically look for the correct offline assets package file in that directory.
> 6. The package contains all necessary fonts and models required for document processing, ensuring consistent results across different environments.
> 7. The integrity of all assets is verified using SHA3-256 hashes during both packaging and restoration.
> 8. If you're deploying in an air-gapped environment, make sure to generate the package on a machine with internet access first.
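
For readers who want to check a file by hand, the sketch below shows how a SHA3-256 digest can be computed and compared in Python with the standard `hashlib` module. The helper names and the demo file are assumptions for illustration; where BabelDOC stores the expected hashes is not covered here.

```python
import hashlib
import tempfile
from pathlib import Path


def sha3_256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha3_256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_asset(path, expected_hex):
    """Return True when the file's digest matches the expected hex string."""
    return sha3_256_of(path) == expected_hex.lower()


# Demo with a throwaway file standing in for a font or model asset.
with tempfile.TemporaryDirectory() as tmp:
    asset = Path(tmp) / "demo_asset.bin"
    asset.write_bytes(b"demo")
    expected = hashlib.sha3_256(b"demo").hexdigest()
    ok = verify_asset(asset, expected)
    tampered = verify_asset(asset, "0" * 64)
```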

### Configuration File

- `--config`, `-c`: Configuration file path. Use the TOML format.

Example Configuration:

```toml
[babeldoc]
# Basic settings
debug = true
lang-in = "en-US"
lang-out = "zh-CN"
qps = 10
output = "/path/to/output/dir"

# PDF processing options
split-short-lines = false
short-line-split-factor = 0.8
skip-clean = false
dual-translate-first = false
disable-rich-text-translate = false
use-alternating-pages-dual = false
watermark-output-mode = "watermarked"  # Choices: "watermarked", "no_watermark", "both"
max-pages-per-part = 50  # Automatically split the document for translation and merge it back.
only_include_translated_page = false # Only include translated pages in the output PDF. Effective only when `pages` is used.
# no-watermark = false  # DEPRECATED: Use watermark-output-mode instead
skip-scanned-detection = false  # Skip scanned document detection for faster processing
auto_extract_glossary = true # Set to false to disable automatic term extraction
formular_font_pattern = "" # Font pattern for formula text
formular_char_pattern = "" # Character pattern for formula text
show_char_box = false # Show character bounding boxes (debug)
ocr_workaround = false # Use OCR workaround for scanned PDFs
rpc_doclayout = "" # RPC service host for document layout analysis
working_dir = "" # Working directory for translation
auto_enable_ocr_workaround = false # Enable automatic OCR workaround for scanned PDFs. See docs for interaction with ocr_workaround and skip_scanned_detection.
skip_form_render = false # Skip form rendering (default: False)
skip_curve_render = false # Skip curve rendering (default: False)
only_parse_generate_pdf = false # Only parse PDF and generate output PDF without translation (default: False)
remove_non_formula_lines = false # Remove non-formula lines from paragraph areas (default: False)
non_formula_line_iou_threshold = 0.2 # IoU threshold for paragraph overlap detection (default: 0.2)
figure_table_protection_threshold = 0.3 # IoU threshold for figure/table protection (default: 0.3)

# Translation service
openai = true
openai-model = "gpt-4o-mini"
openai-base-url = "https://api.openai.com/v1"
openai-api-key = "your-api-key-here"
enable-json-mode-if-requested = false  # Enable JSON mode when requested (default: false)
disable_same_text_fallback = false # Disable fallback translation when LLM output matches input text (default: false)
pool-max-workers = 8  # Maximum worker threads for task processing (defaults to QPS value if not set)

# Glossary Options (Optional)
# glossary-files = "/path/to/glossary1.csv,/path/to/glossary2.csv"

# Output control
no-dual = false
no-mono = false
min-text-length = 5
report-interval = 0.5

# Offline assets management
# Uncomment one of these options as needed:
# generate-offline-assets = "/path/to/output/dir"
# restore-offline-assets = "/path/to/offline_assets_package.zip"
```

## Python API

The current recommended way to call BabelDOC in Python is to call the `high_level.do_translate_async_stream` function of [pdf2zh next](https://github.com/PDFMathTranslate/PDFMathTranslate-next).

> [!WARNING]
> **All APIs of BabelDOC should be considered as internal APIs, and any direct use of BabelDOC is not supported.**

## Background

Many projects and teams are working to make document editing and translation easier, for example:

- [mathpix](https://mathpix.com/)
- [Doc2X](https://doc2x.noedgeai.com/)
- [minerU](https://github.com/opendatalab/MinerU)
- [PDFMathTranslate](https://github.com/funstory-ai/yadt)

There are also solutions that address specific parts of the problem, such as:

- [layoutreader](https://github.com/microsoft/unilm/tree/master/layoutreader): the reading order of text blocks in a PDF
- [Surya](https://github.com/surya-is/surya): the structure of the PDF

This project hopes to promote a standard pipeline and interface to solve the problem.

In fact, a PDF parser or translator has two main stages:

- **Parsing**: Extract the structure of the PDF, such as text blocks, images, and tables.
- **Rendering**: Render that structure into a new PDF or another format.

A service like mathpix parses the PDF into a structure, possibly in an XML format, and then renders it in a single-column reading order, as [layoutreader](https://github.com/microsoft/unilm/tree/master/layoutreader) does. The bad news is that the original structure is lost.

Some people use Adobe PDF Parser because it generates a Word document that keeps the original structure, but it is somewhat expensive.
Moreover, a PDF or Word document is not a good format for reading on mobile devices.

We offer an intermediate representation of the parser's results that can be rendered into a new PDF or another format. The pipeline is also a plugin-based system to which anyone can add their own model, OCR, renderer, etc.
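
The idea of parsing into an intermediate representation, passing it through pluggable stages, and then rendering it can be sketched in a few lines. Every name below (`TextBlock`, `Document`, `run_pipeline`, and the toy stages) is hypothetical and far simpler than BabelDOC's real intermediate language; the sketch only shows the shape of such a pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TextBlock:
    page: int
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1)
    text: str


@dataclass
class Document:
    """Toy intermediate representation: a flat list of text blocks."""
    blocks: List[TextBlock] = field(default_factory=list)


# A stage is any callable that maps one Document to another.
Stage = Callable[[Document], Document]


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    for stage in stages:
        doc = stage(doc)
    return doc


def uppercase_stage(doc: Document) -> Document:
    """Stand-in for a translation plugin; layout fields are left untouched."""
    return Document([TextBlock(b.page, b.bbox, b.text.upper()) for b in doc.blocks])


def render_plain_text(doc: Document) -> str:
    """Stand-in for a renderer; a real one would write a PDF."""
    return "\n".join(b.text for b in doc.blocks)


parsed = Document([TextBlock(1, (0, 0, 10, 10), "hello")])
rendered = render_plain_text(run_pipeline(parsed, [uppercase_stage]))
```

Because each stage only sees and returns the intermediate representation, a new model or renderer can be added without changing the other stages.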

## Roadmap

- [ ] Add line support
- [ ] Add table support
- [ ] Add cross-page/cross-column paragraph support
- [ ] More advanced typesetting features
- [ ] Outline support
- [ ] ...

Our goal for version 1.0 is to complete a translation of [PDF Reference, Version 1.7](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf) into the following languages:

- Simplified Chinese
- Traditional Chinese
- Japanese
- Spanish

And meet the following requirements:

- layout error less than 1%
- content loss less than 1%

## Version Number Explanation

This project uses a combination of [Semantic Versioning](https://semver.org/) and [Pride Versioning](https://pridever.org/). The version number format is: "0.MAJOR.MINOR".

> [!NOTE]
>
> The API compatibility here mainly refers to the compatibility with [pdf2zh_next](https://github.com/PDFMathTranslate/PDFMathTranslate-next).


- MAJOR: Incremented by 1 when API incompatible changes are made or when proud improvements are implemented.

- MINOR: Incremented by 1 when any API compatible changes are made.
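
Under this scheme, two releases with the same MAJOR number are expected to be API compatible. A minimal sketch of that check, assuming plain `0.MAJOR.MINOR` strings without pre-release suffixes (the helper names are hypothetical):

```python
def parse_version(version):
    """Split "0.MAJOR.MINOR" into (major, minor) integers."""
    zero, major, minor = version.split(".")
    if zero != "0":
        raise ValueError(f"expected a leading 0 in {version!r}")
    return int(major), int(minor)


def is_api_compatible(installed, required):
    """Same MAJOR means only API compatible changes were made in between."""
    return parse_version(installed)[0] == parse_version(required)[0]


same_major = is_api_compatible("0.4.1", "0.4.9")
different_major = is_api_compatible("0.4.9", "0.5.0")
```

Note that a MAJOR bump does not always signal a breaking change, since MAJOR is also incremented for "proud" improvements; the check above is therefore conservative.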

## Known Issues

1. Parsing errors in the author and reference sections; they get merged into one paragraph after translation.
2. Lines are not supported.
3. Drop caps are not supported.
4. Large pages will be skipped.

## How to Contribute

We encourage you to contribute to YADT! Please check out the [CONTRIBUTING](https://github.com/funstory-ai/yadt/blob/main/docs/CONTRIBUTING.md) guide.

Everyone interacting in YADT and its sub-projects' codebases, issue trackers, chat rooms, and mailing lists is expected to follow the YADT [Code of Conduct](https://github.com/funstory-ai/yadt/blob/main/docs/CODE_OF_CONDUCT.md).

[Immersive Translation](https://immersivetranslate.com) sponsors monthly Pro membership redemption codes for active contributors to this project. See [CONTRIBUTOR_REWARD.md](https://github.com/funstory-ai/BabelDOC/blob/main/docs/CONTRIBUTOR_REWARD.md) for details.

## Acknowledgements

- [PDFMathTranslate](https://github.com/Byaidu/PDFMathTranslate)
- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
- [pdfminer](https://github.com/pdfminer/pdfminer.six)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [Asynchronize](https://github.com/multimeric/Asynchronize/tree/master?tab=readme-ov-file)
- [PriorityThreadPoolExecutor](https://github.com/oleglpts/PriorityThreadPoolExecutor)

<h2 id="star_hist">Star History</h2>

<a href="https://star-history.com/#funstory-ai/babeldoc&Date">
 <picture>
   <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=funstory-ai/babeldoc&type=Date&theme=dark" />
   <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=funstory-ai/babeldoc&type=Date" />
   <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=funstory-ai/babeldoc&type=Date"/>
 </picture>
</a>

> [!WARNING]
> **Important Interaction Note for `--auto-enable-ocr-workaround`:**
>
> When `--auto-enable-ocr-workaround` is set to `true` (either via command line or config file):
>
> 1.  During the initial setup, the values for `ocr_workaround` and `skip_scanned_detection` will be forced to `false` by `TranslationConfig`, regardless of whether you also set `--ocr-workaround` or `--skip-scanned-detection` flags.
> 2.  Then, during the scanned document detection phase (`DetectScannedFile` stage):
>     *   If the document is identified as heavily scanned (e.g., >80% scanned pages) and `auto_enable_ocr_workaround` is `true`, the system attempts to set both `ocr_workaround` and `skip_scanned_detection` to `true`.
>
> This means that `--auto-enable-ocr-workaround` effectively lets the system decide whether to enable OCR processing for scanned documents, potentially overriding manual settings for `--ocr-workaround` and `--skip-scanned-detection` based on its detection results. If the document is *not* detected as heavily scanned, the initial `false` values for `ocr_workaround` and `skip_scanned_detection` (forced during `TranslationConfig` initialization) remain in effect unless changed by other logic.
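
The interaction described in the warning can be modeled with a small sketch. This is a simplified stand-in, not the real `TranslationConfig` or `DetectScannedFile` code: the class `OcrFlags`, the function `apply_scan_detection`, and the `scanned_ratio` and `threshold` parameters are hypothetical, and the 0.8 threshold simply mirrors the ">80%" example above.

```python
from dataclasses import dataclass


@dataclass
class OcrFlags:
    """Hypothetical stand-in for the three related settings."""

    auto_enable_ocr_workaround: bool = False
    ocr_workaround: bool = False
    skip_scanned_detection: bool = False

    def __post_init__(self) -> None:
        # Step 1: with auto mode on, both manual flags are forced to False,
        # whatever the user passed.
        if self.auto_enable_ocr_workaround:
            self.ocr_workaround = False
            self.skip_scanned_detection = False


def apply_scan_detection(
    flags: OcrFlags, scanned_ratio: float, threshold: float = 0.8
) -> OcrFlags:
    """Step 2: enable both flags only for heavily scanned documents."""
    if flags.auto_enable_ocr_workaround and scanned_ratio > threshold:
        flags.ocr_workaround = True
        flags.skip_scanned_detection = True
    return flags
```

In this model, a user who passes all three flags and supplies a document that is only lightly scanned ends up with both `ocr_workaround` and `skip_scanned_detection` disabled, which is the overriding behavior the warning describes.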


================================================
FILE: babeldoc/__init__.py
================================================
__version__ = "0.5.23"


================================================
FILE: babeldoc/assets/assets.py
================================================
import asyncio
import hashlib
import json
import logging
import threading
import zipfile
from pathlib import Path

import httpx
from babeldoc.assets import embedding_assets_metadata
from babeldoc.assets.embedding_assets_metadata import CMAP_METADATA
from babeldoc.assets.embedding_assets_metadata import CMAP_URL_BY_UPSTREAM
from babeldoc.assets.embedding_assets_metadata import DOC_LAYOUT_ONNX_MODEL_URL
from babeldoc.assets.embedding_assets_metadata import (
    DOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256,
)
from babeldoc.assets.embedding_assets_metadata import EMBEDDING_FONT_METADATA
from babeldoc.assets.embedding_assets_metadata import FONT_METADATA_URL
from babeldoc.assets.embedding_assets_metadata import FONT_URL_BY_UPSTREAM
from babeldoc.assets.embedding_assets_metadata import (
    TABLE_DETECTION_RAPIDOCR_MODEL_SHA3_256,
)
from babeldoc.assets.embedding_assets_metadata import TABLE_DETECTION_RAPIDOCR_MODEL_URL
from babeldoc.assets.embedding_assets_metadata import TIKTOKEN_CACHES
from babeldoc.const import get_cache_file_path
from tenacity import retry
from tenacity import stop_after_attempt
from tenacity import wait_exponential

logger = logging.getLogger(__name__)


_FASTEST_FONT_UPSTREAM_LOCK = asyncio.Lock()
_FASTEST_FONT_UPSTREAM: str | None = None
_FASTEST_FONT_METADATA: dict | None = None


class ResultContainer:
    def __init__(self):
        self.result = None

    def set_result(self, result):
        self.result = result


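def run_in_another_thread(coro):
    """Run a coroutine to completion on a fresh event loop in a helper thread.

    Using a separate thread avoids the error that asyncio.run raises when the
    calling thread already has a running event loop. Exceptions raised inside
    the coroutine are not propagated to the caller; the result is then None.
    """
    result_container = ResultContainer()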

    def _wrapper():
        result_container.set_result(asyncio.run(coro))

    thread = threading.Thread(target=_wrapper)
    thread.start()
    thread.join()
    return result_container.result


def run_coro(coro):
    return run_in_another_thread(coro)


def _retry_if_not_cancelled_and_failed(retry_state):
    """Only retry if the exception is not CancelledError and the attempt failed."""
    if retry_state.outcome.failed:
        exception = retry_state.outcome.exception()
        # Don't retry on CancelledError
        if isinstance(exception, asyncio.CancelledError):
            logger.debug("Operation was cancelled, not retrying")
            return False
        # Retry on network related errors
        if isinstance(
            exception, httpx.HTTPError | ConnectionError | ValueError | TimeoutError
        ):
            logger.warning(f"Network error occurred: {exception}, will retry")
            return True
    # Don't retry on success
    return False


def verify_file(path: Path, sha3_256: str):
    """Return True if path exists and its SHA3-256 hex digest equals sha3_256."""
    if not path.exists():
        return False
    hash_ = hashlib.sha3_256()
    with path.open("rb") as f:
        while True:
            chunk = f.read(1024 * 1024)
            if not chunk:
                break
            hash_.update(chunk)
    return hash_.hexdigest() == sha3_256


@retry(
    retry=_retry_if_not_cancelled_and_failed,
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=15),
    before_sleep=lambda retry_state: logger.warning(
        f"Download file failed, retrying in {retry_state.next_action.sleep} seconds... "
        f"(Attempt {retry_state.attempt_number}/3)"
    ),
)
async def download_file(
    client: httpx.AsyncClient | None = None,
    url: str = None,
    path: Path = None,
    sha3_256: str = None,
):
    if client is None:
        async with httpx.AsyncClient() as client:
            response = await client.get(url, follow_redirects=True)
    else:
        response = await client.get(url, follow_redirects=True)

    response.raise_for_status()
    with path.open("wb") as f:
        f.write(response.content)
    if not verify_file(path, sha3_256):
        path.unlink(missing_ok=True)
        raise ValueError(f"File {path} is corrupted")


@retry(
    retry=_retry_if_not_cancelled_and_failed,
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=15),
    before_sleep=lambda retry_state: logger.warning(
        f"Get font metadata failed, retrying in {retry_state.next_action.sleep} seconds... "
        f"(Attempt {retry_state.attempt_number}/3)"
    ),
)
async def get_font_metadata(
    client: httpx.AsyncClient | None = None, upstream: str = None
):
    if upstream not in FONT_METADATA_URL:
        logger.critical(f"Invalid upstream: {upstream}")
        exit(1)

    if client is None:
        async with httpx.AsyncClient() as client:
            response = await client.get(
                FONT_METADATA_URL[upstream], follow_redirects=True
            )
    else:
        response = await client.get(FONT_METADATA_URL[upstream], follow_redirects=True)

    response.raise_for_status()
    logger.debug(f"Get font metadata from {upstream} success")
    return upstream, response.json()


async def _get_fastest_upstream_for_font_internal(
    client: httpx.AsyncClient | None = None, exclude_upstream: list[str] | None = None
) -> tuple[str | None, dict | None]:
    """Find the fastest upstream for font metadata without using cached result."""
    tasks: list[asyncio.Task[tuple[str, dict]]] = []
    for upstream in FONT_METADATA_URL:
        if exclude_upstream and upstream in exclude_upstream:
            continue
        tasks.append(asyncio.create_task(get_font_metadata(client, upstream)))
    for future in asyncio.as_completed(tasks):
        try:
            result = await future
            for task in tasks:
                if not task.done():
                    task.cancel()
            return result
        except Exception as e:
            logger.exception(f"Error getting font metadata: {e}")
    logger.error("All upstreams failed")
    return None, None


async def get_fastest_upstream_for_font(
    client: httpx.AsyncClient | None = None, exclude_upstream: list[str] | None = None
) -> tuple[str | None, dict | None]:
    """Get the fastest upstream for font metadata with cached result.

    The cached upstream is only used when exclude_upstream is None.
    """
    global _FASTEST_FONT_UPSTREAM, _FASTEST_FONT_METADATA

    if exclude_upstream is None and _FASTEST_FONT_UPSTREAM is not None:
        return _FASTEST_FONT_UPSTREAM, _FASTEST_FONT_METADATA

    if exclude_upstream is not None:
        # Do not use or update cache when exclude_upstream is provided.
        return await _get_fastest_upstream_for_font_internal(client, exclude_upstream)

    async with _FASTEST_FONT_UPSTREAM_LOCK:
        if _FASTEST_FONT_UPSTREAM is not None:
            return _FASTEST_FONT_UPSTREAM, _FASTEST_FONT_METADATA

        upstream, metadata = await _get_fastest_upstream_for_font_internal(client)
        if upstream is not None:
            _FASTEST_FONT_UPSTREAM = upstream
            _FASTEST_FONT_METADATA = metadata
            logger.info(f"Fastest font upstream determined: {upstream}")
        return upstream, metadata


async def get_fastest_upstream_for_model(client: httpx.AsyncClient | None = None):
    return await get_fastest_upstream_for_font(client, exclude_upstream=["github"])


async def get_fastest_upstream(client: httpx.AsyncClient | None = None):
    (
        fastest_upstream_for_font,
        online_font_metadata,
    ) = await get_fastest_upstream_for_font(client)
    if fastest_upstream_for_font is None:
        logger.error("Failed to get fastest upstream")
        exit(1)

    if fastest_upstream_for_font == "github":
        # GitHub only hosts fonts, so pick a separate fastest upstream for models
        fastest_upstream_for_model, _ = await get_fastest_upstream_for_model(client)
        if fastest_upstream_for_model is None:
            logger.error("Failed to get fastest upstream")
            exit(1)
    else:
        fastest_upstream_for_model = fastest_upstream_for_font

    return online_font_metadata, fastest_upstream_for_font, fastest_upstream_for_model


async def get_doclayout_onnx_model_path_async(client: httpx.AsyncClient | None = None):
    onnx_path = get_cache_file_path(
        "doclayout_yolo_docstructbench_imgsz1024.onnx", "models"
    )
    if verify_file(onnx_path, DOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256):
        return onnx_path

    logger.info("doclayout onnx model not found or corrupted, downloading...")
    fastest_upstream, _ = await get_fastest_upstream_for_model(client)
    if fastest_upstream is None:
        logger.error("Failed to get fastest upstream")
        exit(1)

    url = DOC_LAYOUT_ONNX_MODEL_URL[fastest_upstream]

    await download_file(
        client, url, onnx_path, DOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256
    )
    logger.info(f"Download doclayout onnx model from {fastest_upstream} success")
    return onnx_path


async def get_table_detection_rapidocr_model_path_async(
    client: httpx.AsyncClient | None = None,
):
    onnx_path = get_cache_file_path("ch_PP-OCRv4_det_infer.onnx", "models")
    if verify_file(onnx_path, TABLE_DETECTION_RAPIDOCR_MODEL_SHA3_256):
        return onnx_path

    logger.info("table detection rapidocr model not found or corrupted, downloading...")
    fastest_upstream, _ = await get_fastest_upstream_for_model(client)
    if fastest_upstream is None:
        logger.error("Failed to get fastest upstream")
        exit(1)

    url = TABLE_DETECTION_RAPIDOCR_MODEL_URL[fastest_upstream]

    await download_file(client, url, onnx_path, TABLE_DETECTION_RAPIDOCR_MODEL_SHA3_256)
    logger.info(
        f"Download table detection rapidocr model from {fastest_upstream} success"
    )
    return onnx_path


def get_doclayout_onnx_model_path():
    return run_coro(get_doclayout_onnx_model_path_async())


def get_table_detection_rapidocr_model_path():
    return run_coro(get_table_detection_rapidocr_model_path_async())


def get_font_url_by_name_and_upstream(font_file_name: str, upstream: str):
    if upstream not in FONT_URL_BY_UPSTREAM:
        logger.critical(f"Invalid upstream: {upstream}")
        exit(1)

    return FONT_URL_BY_UPSTREAM[upstream](font_file_name)


async def get_font_and_metadata_async(
    font_file_name: str,
    client: httpx.AsyncClient | None = None,
    fastest_upstream: str | None = None,
    font_metadata: dict | None = None,
):
    cache_file_path = get_cache_file_path(font_file_name, "fonts")
    if font_file_name in EMBEDDING_FONT_METADATA and verify_file(
        cache_file_path, EMBEDDING_FONT_METADATA[font_file_name]["sha3_256"]
    ):
        return cache_file_path, EMBEDDING_FONT_METADATA[font_file_name]

    logger.info(f"Font {cache_file_path} not found or corrupted, downloading...")
    if fastest_upstream is None:
        fastest_upstream, font_metadata = await get_fastest_upstream_for_font(client)
        if fastest_upstream is None:
            logger.critical("Failed to get fastest upstream")
            exit(1)

        if font_file_name not in font_metadata:
            logger.critical(f"Font {font_file_name} not found in {font_metadata}")
            exit(1)

        if verify_file(cache_file_path, font_metadata[font_file_name]["sha3_256"]):
            return cache_file_path, font_metadata[font_file_name]

    assert font_metadata is not None
    logger.info(f"download {font_file_name} from {fastest_upstream}")

    url = get_font_url_by_name_and_upstream(font_file_name, fastest_upstream)
    if "sha3_256" not in font_metadata[font_file_name]:
        logger.critical(f"Font {font_file_name} not found in {font_metadata}")
        exit(1)
    await download_file(
        client, url, cache_file_path, font_metadata[font_file_name]["sha3_256"]
    )
    return cache_file_path, font_metadata[font_file_name]


def get_font_and_metadata(font_file_name: str):
    return run_coro(get_font_and_metadata_async(font_file_name))


async def get_cmap_file_path_async(
    name: str, client: httpx.AsyncClient | None = None
) -> Path:
    """Get cached cmap file path, downloading it if necessary."""
    if name.endswith(".json"):
        file_name = name
    else:
        file_name = f"{name}.json"

    if file_name not in CMAP_METADATA:
        logger.critical(f"CMap {file_name} not found in CMAP_METADATA")
        exit(1)

    meta = CMAP_METADATA[file_name]
    cache_file_path = get_cache_file_path(file_name, "cmap")
    if verify_file(cache_file_path, meta["sha3_256"]):
        return cache_file_path

    logger.info(f"CMap {cache_file_path} not found or corrupted, downloading...")
    await download_cmap_file_async(file_name, client)
    if not verify_file(cache_file_path, meta["sha3_256"]):
        logger.critical(f"Failed to verify downloaded cmap file: {cache_file_path}")
        exit(1)
    return cache_file_path


async def download_cmap_file_async(
    file_name: str, client: httpx.AsyncClient | None = None
) -> Path:
    """Download a single cmap file to cache directory."""
    if file_name not in CMAP_METADATA:
        logger.critical(f"CMap {file_name} not found in CMAP_METADATA")
        exit(1)

    fastest_upstream, _ = await get_fastest_upstream_for_font(client)
    if fastest_upstream is None:
        logger.critical("Failed to get fastest upstream for cmap")
        exit(1)

    if fastest_upstream not in CMAP_URL_BY_UPSTREAM:
        logger.critical(f"Invalid fastest upstream for cmap: {fastest_upstream}")
        exit(1)

    url = CMAP_URL_BY_UPSTREAM[fastest_upstream](file_name)
    cache_file_path = get_cache_file_path(file_name, "cmap")
    sha3_256 = CMAP_METADATA[file_name]["sha3_256"]
    await download_file(client, url, cache_file_path, sha3_256)
    return cache_file_path


async def get_cmap_data_async(
    name: str, client: httpx.AsyncClient | None = None
) -> dict:
    """Load cmap json data from cached file, downloading it if necessary."""
    path = await get_cmap_file_path_async(name, client)
    return json.loads(path.read_text())


def get_cmap_file_path(name: str):
    return run_coro(get_cmap_file_path_async(name))


def get_cmap_data(name: str):
    return run_coro(get_cmap_data_async(name))


def get_font_family(lang_code: str):
    font_family = embedding_assets_metadata.get_font_family(lang_code)
    return font_family


async def download_all_fonts_async(client: httpx.AsyncClient | None = None):
    for font_file_name in EMBEDDING_FONT_METADATA:
        if not verify_file(
            get_cache_file_path(font_file_name, "fonts"),
            EMBEDDING_FONT_METADATA[font_file_name]["sha3_256"],
        ):
            break
    else:
        logger.debug("All fonts are already downloaded")
        return

    fastest_upstream, font_metadata = await get_fastest_upstream_for_font(client)
    if fastest_upstream is None:
        logger.error("Failed to get fastest upstream")
        exit(1)
    logger.info(f"Downloading fonts from {fastest_upstream}")

    font_tasks = [
        asyncio.create_task(
            get_font_and_metadata_async(
                font_file_name, client, fastest_upstream, font_metadata
            )
        )
        for font_file_name in EMBEDDING_FONT_METADATA
    ]
    await asyncio.gather(*font_tasks)


async def download_all_cmaps_async(client: httpx.AsyncClient | None = None):
    """Download all cmap files defined in CMAP_METADATA."""
    for cmap_file_name, meta in CMAP_METADATA.items():
        if not verify_file(
            get_cache_file_path(cmap_file_name, "cmap"),
            meta["sha3_256"],
        ):
            break
    else:
        logger.debug("All cmaps are already downloaded")
        return

    fastest_upstream, _ = await get_fastest_upstream_for_font(client)
    if fastest_upstream is None:
        logger.error("Failed to get fastest upstream for cmap")
        exit(1)
    logger.info(f"Downloading cmaps from {fastest_upstream}")

    cmap_tasks = [
        asyncio.create_task(get_cmap_file_path_async(cmap_file_name, client))
        for cmap_file_name in CMAP_METADATA
    ]
    await asyncio.gather(*cmap_tasks)


async def async_warmup():
    logger.info("Downloading all assets...")
    from tiktoken import encoding_for_model

    _ = encoding_for_model("gpt-4o")
    async with httpx.AsyncClient() as client:
        onnx_task = asyncio.create_task(get_doclayout_onnx_model_path_async(client))
        onnx_task2 = asyncio.create_task(
            get_table_detection_rapidocr_model_path_async(client)
        )
        font_tasks = asyncio.create_task(download_all_fonts_async(client))
        cmap_tasks = asyncio.create_task(download_all_cmaps_async(client))
        await asyncio.gather(onnx_task, onnx_task2, font_tasks, cmap_tasks)


def warmup():
    run_coro(async_warmup())


def generate_all_assets_file_list():
    result: dict[str, list[dict[str, str]]] = {}
    result["fonts"] = []
    result["models"] = []
    result["tiktoken"] = []
    result["cmap"] = []
    for font_file_name in EMBEDDING_FONT_METADATA:
        result["fonts"].append(
            {
                "name": font_file_name,
                "sha3_256": EMBEDDING_FONT_METADATA[font_file_name]["sha3_256"],
            }
        )
    for cmap_file_name in CMAP_METADATA:
        result["cmap"].append(
            {
                "name": cmap_file_name,
                "sha3_256": CMAP_METADATA[cmap_file_name]["sha3_256"],
            }
        )
    for tiktoken_file, sha3_256 in TIKTOKEN_CACHES.items():
        result["tiktoken"].append(
            {
                "name": tiktoken_file,
                "sha3_256": sha3_256,
            }
        )
    result["models"].append(
        {
            "name": "doclayout_yolo_docstructbench_imgsz1024.onnx",
            "sha3_256": DOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256,
        },
    )
    result["models"].append(
        {
            "name": "ch_PP-OCRv4_det_infer.onnx",
            "sha3_256": TABLE_DETECTION_RAPIDOCR_MODEL_SHA3_256,
        },
    )
    return result


async def generate_offline_assets_package_async(output_directory: Path | None = None):
    await async_warmup()
    logger.info("Generating offline assets package...")
    file_list = generate_all_assets_file_list()
    offline_assets_tag = get_offline_assets_tag(file_list)
    if output_directory is None:
        output_path = get_cache_file_path(
            f"offline_assets_{offline_assets_tag}.zip", "assets"
        )
    else:
        output_directory.mkdir(parents=True, exist_ok=True)
        output_path = output_directory / f"offline_assets_{offline_assets_tag}.zip"
    with zipfile.ZipFile(
        output_path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=9
    ) as zipf:
        for file_type, file_descs in file_list.items():
            # zipf.mkdir(file_type)
            for file_desc in file_descs:
                file_name = file_desc["name"]
                sha3_256 = file_desc["sha3_256"]
                file_path = get_cache_file_path(file_name, file_type)
                if not verify_file(file_path, sha3_256):
                    logger.error(f"File {file_path} is corrupted")
                    exit(1)

                with file_path.open("rb") as f:
                    zipf.writestr(f"{file_type}/{file_name}", f.read())
    logger.info(f"Offline assets package generated at {output_path}")


async def restore_offline_assets_package_async(input_path: Path | None = None):
    file_list = generate_all_assets_file_list()
    offline_assets_tag = get_offline_assets_tag(file_list)
    if input_path is None:
        input_path = get_cache_file_path(
            f"offline_assets_{offline_assets_tag}.zip", "assets"
        )
    else:
        if input_path.exists() and input_path.is_dir():
            input_path = input_path / f"offline_assets_{offline_assets_tag}.zip"
        if not input_path.exists():
            logger.critical(f"Offline assets package not found: {input_path}")
            exit(1)

        import re

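        # Guard against file names that do not follow the expected pattern;
        # a None tag falls through to the mismatch check below.
        tag_match = re.match(r"offline_assets_(.*)\.zip", input_path.name)
        offline_assets_tag_from_input_path = tag_match.group(1) if tag_match else None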
        if offline_assets_tag != offline_assets_tag_from_input_path:
            logger.critical(
                f"Offline assets tag mismatch: {offline_assets_tag} != {offline_assets_tag_from_input_path}"
            )
            exit(1)
    nothing_changed = True
    with zipfile.ZipFile(input_path, "r") as zipf:
        for file_type, file_descs in file_list.items():
            for file_desc in file_descs:
                file_name = file_desc["name"]
                file_path = get_cache_file_path(file_name, file_type)

                if verify_file(file_path, file_desc["sha3_256"]):
                    continue
                nothing_changed = False
                with zipf.open(f"{file_type}/{file_name}", "r") as f:
                    with file_path.open("wb") as f2:
                        f2.write(f.read())
                if not verify_file(file_path, file_desc["sha3_256"]):
                    logger.critical(
                        "Offline assets package is corrupted, please delete it and try again"
                    )
                    exit(1)
    if not nothing_changed:
        logger.info(f"Offline assets package restored from {input_path}")


def get_offline_assets_tag(file_list: dict | None = None):
    if file_list is None:
        file_list = generate_all_assets_file_list()
    import orjson

    # noinspection PyTypeChecker
    offline_assets_tag = hashlib.sha3_256(
        orjson.dumps(
            file_list,
            option=orjson.OPT_APPEND_NEWLINE
            | orjson.OPT_INDENT_2
            | orjson.OPT_SORT_KEYS,
        )
    ).hexdigest()
    return offline_assets_tag


def generate_offline_assets_package(output_directory: Path | None = None):
    return run_coro(generate_offline_assets_package_async(output_directory))


def restore_offline_assets_package(input_path: Path | None = None):
    return run_coro(restore_offline_assets_package_async(input_path))


if __name__ == "__main__":
    from rich.logging import RichHandler

    logging.basicConfig(level=logging.DEBUG, handlers=[RichHandler()])
    logging.getLogger("httpx").setLevel(logging.WARNING)
    logging.getLogger("httpcore").setLevel(logging.WARNING)
    # warmup()
    # generate_offline_assets_package()
    # restore_offline_assets_package(Path(
    #     '/Users/aw/.cache/babeldoc/assets/offline_assets_33971e4940e90ba0c35baacda44bbe83b214f4703a7bdb8b837de97d0383508c.zip'))
    # warmup()


================================================
FILE: babeldoc/assets/embedding_assets_metadata.py
================================================
import itertools

DOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256 = (
    "60be061226930524958b5465c8c04af3d7c03bcb0beb66454f5da9f792e3cf2a"
)

TABLE_DETECTION_RAPIDOCR_MODEL_SHA3_256 = (
    "062f4619afe91b33147c033acadecbb53f2a7b99ac703d157b96d5b10948da5e"
)

TIKTOKEN_CACHES = {
    "fb374d419588a4632f3f557e76b4b70aebbca790": "cb04bcda5782cfbbe77f2f991d92c0ea785d9496ef1137c91dfc3c8c324528d6"
}

FONT_METADATA_URL = {
    "github": "https://raw.githubusercontent.com/funstory-ai/BabelDOC-Assets/refs/heads/main/font_metadata.json",
    "huggingface": "https://huggingface.co/datasets/awwaawwa/BabelDOC-Assets/resolve/main/font_metadata.json?download=true",
    # "hf-mirror": "https://hf-mirror.com/datasets/awwaawwa/BabelDOC-Assets/resolve/main/font_metadata.json?download=true",
    "modelscope": "https://www.modelscope.cn/datasets/awwaawwa/BabelDOCAssets/resolve/master/font_metadata.json",
}

FONT_URL_BY_UPSTREAM = {
    "github": lambda name: f"https://raw.githubusercontent.com/funstory-ai/BabelDOC-Assets/refs/heads/main/fonts/{name}",
    "huggingface": lambda name: f"https://huggingface.co/datasets/awwaawwa/BabelDOC-Assets/resolve/main/fonts/{name}?download=true",
    "hf-mirror": lambda name: f"https://hf-mirror.com/datasets/awwaawwa/BabelDOC-Assets/resolve/main/fonts/{name}?download=true",
    "modelscope": lambda name: f"https://www.modelscope.cn/datasets/awwaawwa/BabelDOCAssets/resolve/master/fonts/{name}",
}

CMAP_URL_BY_UPSTREAM = {
    "github": lambda name: f"https://raw.githubusercontent.com/funstory-ai/BabelDOC-Assets/refs/heads/main/cmap/{name}",
    "huggingface": lambda name: f"https://huggingface.co/datasets/awwaawwa/BabelDOC-Assets/resolve/main/cmap/{name}?download=true",
    "hf-mirror": lambda name: f"https://hf-mirror.com/datasets/awwaawwa/BabelDOC-Assets/resolve/main/cmap/{name}?download=true",
    "modelscope": lambda name: f"https://www.modelscope.cn/datasets/awwaawwa/BabelDOCAssets/resolve/master/cmap/{name}",
}

DOC_LAYOUT_ONNX_MODEL_URL = {
    "huggingface": "https://huggingface.co/wybxc/DocLayout-YOLO-DocStructBench-onnx/resolve/main/doclayout_yolo_docstructbench_imgsz1024.onnx?download=true",
    "hf-mirror": "https://hf-mirror.com/wybxc/DocLayout-YOLO-DocStructBench-onnx/resolve/main/doclayout_yolo_docstructbench_imgsz1024.onnx?download=true",
    "modelscope": "https://www.modelscope.cn/models/AI-ModelScope/DocLayout-YOLO-DocStructBench-onnx/resolve/master/doclayout_yolo_docstructbench_imgsz1024.onnx",
}

TABLE_DETECTION_RAPIDOCR_MODEL_URL = {
    "huggingface": "https://huggingface.co/spaces/RapidAI/RapidOCR/resolve/main/models/text_det/ch_PP-OCRv4_det_infer.onnx",
    "hf-mirror": "https://hf-mirror.com/spaces/RapidAI/RapidOCR/resolve/main/models/text_det/ch_PP-OCRv4_det_infer.onnx",
    "modelscope": "https://www.modelscope.cn/models/RapidAI/RapidOCR/resolve/master/onnx/PP-OCRv4/det/ch_PP-OCRv4_det_infer.onnx",
}

# from https://github.com/funstory-ai/BabelDOC-Assets/blob/main/font_metadata.json
EMBEDDING_FONT_METADATA = {
    "GoNotoKurrent-Bold.ttf": {
        "ascent": 1069,
        "bold": 1,
        "descent": -293,
        "encoding_length": 2,
        "file_name": "GoNotoKurrent-Bold.ttf",
        "font_name": "Go Noto Kurrent-Bold Bold",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "000b37f592477945b27b7702dcad39f73e23e140e66ddff9847eb34f32389566",
        "size": 15303772,
    },
    "GoNotoKurrent-Regular.ttf": {
        "ascent": 1069,
        "bold": 0,
        "descent": -293,
        "encoding_length": 2,
        "file_name": "GoNotoKurrent-Regular.ttf",
        "font_name": "Go Noto Kurrent-Regular Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "4324a60d507c691e6efc97420647f4d2c2d86d9de35009d1c769861b76074ae6",
        "size": 15515760,
    },
    "KleeOne-Regular.ttf": {
        "ascent": 1160,
        "bold": 0,
        "descent": -288,
        "encoding_length": 2,
        "file_name": "KleeOne-Regular.ttf",
        "font_name": "Klee One Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "8585c29f89b322d937f83739f61ede5d84297873e1465cad9a120a208ac55ce0",
        "size": 8724704,
    },
    "LXGWWenKai-Regular.1.520.ttf": {
        "ascent": 928,
        "bold": 0,
        "descent": -256,
        "encoding_length": 2,
        "file_name": "LXGWWenKai-Regular.1.520.ttf",
        "font_name": "LXGW WenKai Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "708b4fd6cfae62a26f71016724d38e862210732f101b9225225a1d5e8205f94d",
        "size": 24744500,
    },
    "LXGWWenKaiGB-Regular.1.520.ttf": {
        "ascent": 928,
        "bold": 0,
        "descent": -256,
        "encoding_length": 2,
        "file_name": "LXGWWenKaiGB-Regular.1.520.ttf",
        "font_name": "LXGW WenKai GB Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "0671656b00992e317f9e20610e7145b024e664ada9f272d4f8e497196af98005",
        "size": 24903712,
    },
    "LXGWWenKaiGB-Regular.ttf": {
        "ascent": 928,
        "bold": 0,
        "descent": -256,
        "encoding_length": 2,
        "file_name": "LXGWWenKaiGB-Regular.ttf",
        "font_name": "LXGW WenKai GB Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "b563a5e8d9db4cd15602a3a3700b01925e80a21f99fb88e1b763b1fb8685f8ee",
        "size": 19558756,
    },
    "LXGWWenKaiMonoTC-Regular.ttf": {
        "ascent": 928,
        "bold": 0,
        "descent": -241,
        "encoding_length": 2,
        "file_name": "LXGWWenKaiMonoTC-Regular.ttf",
        "font_name": "LXGW WenKai Mono TC Regular",
        "italic": 0,
        "monospace": 1,
        "serif": 0,
        "sha3_256": "596b278d11418d374a1cfa3a50cbfb82b31db82d3650cfacae8f94311b27fdc5",
        "size": 13115416,
    },
    "LXGWWenKaiTC-Regular.1.520.ttf": {
        "ascent": 928,
        "bold": 0,
        "descent": -256,
        "encoding_length": 2,
        "file_name": "LXGWWenKaiTC-Regular.1.520.ttf",
        "font_name": "LXGW WenKai TC Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "347d3d4bd88c2afcb194eba186d2c6c0b95d18b2145220feb1c88abf761f1398",
        "size": 15348376,
    },
    "LXGWWenKaiTC-Regular.ttf": {
        "ascent": 928,
        "bold": 0,
        "descent": -256,
        "encoding_length": 2,
        "file_name": "LXGWWenKaiTC-Regular.ttf",
        "font_name": "LXGW WenKai TC Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "66ccd0ffe8e56cd585dabde8d1292c3f551b390d8ed85f81d7a844825f9c2379",
        "size": 13100328,
    },
    "MaruBuri-Regular.ttf": {
        "ascent": 800,
        "bold": 0,
        "descent": -200,
        "encoding_length": 2,
        "file_name": "MaruBuri-Regular.ttf",
        "font_name": "MaruBuri Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "abb672dde7b89e06914ce27c59159b7a2933f26207bfcc47981c67c11c41e6d1",
        "size": 3268988,
    },
    "NotoSans-Bold.ttf": {
        "ascent": 1069,
        "bold": 1,
        "descent": -293,
        "encoding_length": 2,
        "file_name": "NotoSans-Bold.ttf",
        "font_name": "Noto Sans Bold",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "ecd38d472c1cad07d8a5dffd2b5a0f72edcd40fff2b4e68d770da8f2ef343a82",
        "size": 630964,
    },
    "NotoSans-BoldItalic.ttf": {
        "ascent": 1069,
        "bold": 1,
        "descent": -293,
        "encoding_length": 2,
        "file_name": "NotoSans-BoldItalic.ttf",
        "font_name": "Noto Sans Bold Italic",
        "italic": 1,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "0b6c690a4a6b7d605b2ecbde00c7ac1a23e60feb17fa30d8b972d61ec3ff732b",
        "size": 644340,
    },
    "NotoSans-Italic.ttf": {
        "ascent": 1069,
        "bold": 0,
        "descent": -293,
        "encoding_length": 2,
        "file_name": "NotoSans-Italic.ttf",
        "font_name": "Noto Sans Italic",
        "italic": 1,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "830652f61724c017e5a29a96225b484a2ccbd25f69a1b3f47e5f466a2dbed1ad",
        "size": 642344,
    },
    "NotoSans-Regular.ttf": {
        "ascent": 1069,
        "bold": 0,
        "descent": -293,
        "encoding_length": 2,
        "file_name": "NotoSans-Regular.ttf",
        "font_name": "Noto Sans Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "7dfe2bbf97dc04c852d1223b220b63430e6ad03b0dbb28ebe6328a20a2d45eb8",
        "size": 629024,
    },
    "NotoSerif-Bold.ttf": {
        "ascent": 1069,
        "bold": 1,
        "descent": -293,
        "encoding_length": 2,
        "file_name": "NotoSerif-Bold.ttf",
        "font_name": "Noto Serif Bold",
        "italic": 0,
        "monospace": 0,
        "serif": 1,
        "sha3_256": "28d88d924285eadb9f9ce49f2d2b95473f89a307b226c5f6ebed87a654898312",
        "size": 506864,
    },
    "NotoSerif-BoldItalic.ttf": {
        "ascent": 1069,
        "bold": 1,
        "descent": -293,
        "encoding_length": 2,
        "file_name": "NotoSerif-BoldItalic.ttf",
        "font_name": "Noto Serif Bold Italic",
        "italic": 1,
        "monospace": 0,
        "serif": 1,
        "sha3_256": "b69ee56af6351b2fb4fbce623f8e1c1f9fb19170686a9e5db2cf260b8cf24ac7",
        "size": 535724,
    },
    "NotoSerif-Italic.ttf": {
        "ascent": 1069,
        "bold": 0,
        "descent": -293,
        "encoding_length": 2,
        "file_name": "NotoSerif-Italic.ttf",
        "font_name": "Noto Serif Italic",
        "italic": 1,
        "monospace": 0,
        "serif": 1,
        "sha3_256": "9b7773c24ab8a29e3c1c03efa4ab652d051e4c209134431953463aa946d62868",
        "size": 535340,
    },
    "NotoSerif-Regular.ttf": {
        "ascent": 1069,
        "bold": 0,
        "descent": -293,
        "encoding_length": 2,
        "file_name": "NotoSerif-Regular.ttf",
        "font_name": "Noto Serif Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 1,
        "sha3_256": "c2bbe984e65bafd3bcd38b3cb1e1344f3b7b79d6beffc7a3d883b57f8358559d",
        "size": 504932,
    },
    "SourceHanSansCN-Bold.ttf": {
        "ascent": 1160,
        "bold": 1,
        "descent": -288,
        "encoding_length": 2,
        "file_name": "SourceHanSansCN-Bold.ttf",
        "font_name": "Source Han Sans CN Bold",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "82314c11016a04ef03e7afd00abe0ccc8df54b922dee79abf6424f3002a31825",
        "size": 10174460,
    },
    "SourceHanSansCN-Regular.ttf": {
        "ascent": 1160,
        "bold": 0,
        "descent": -288,
        "encoding_length": 2,
        "file_name": "SourceHanSansCN-Regular.ttf",
        "font_name": "Source Han Sans CN Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "b45a80cf3650bfc62aa014e58243c6325e182c4b0c5819e41a583c699cce9a8f",
        "size": 10397552,
    },
    "SourceHanSansHK-Bold.ttf": {
        "ascent": 1160,
        "bold": 1,
        "descent": -288,
        "encoding_length": 2,
        "file_name": "SourceHanSansHK-Bold.ttf",
        "font_name": "Source Han Sans HK Bold",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "3eecd57457ba9a0fbad6c794f40e7ae704c4f825091aef2ac18902ffdde50608",
        "size": 6856692,
    },
    "SourceHanSansHK-Regular.ttf": {
        "ascent": 1160,
        "bold": 0,
        "descent": -288,
        "encoding_length": 2,
        "file_name": "SourceHanSansHK-Regular.ttf",
        "font_name": "Source Han Sans HK Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "5fe4141f9164c03616323400b2936ee4c8265314492e2b822c3a6fbfb63ffe08",
        "size": 6999792,
    },
    "SourceHanSansJP-Bold.ttf": {
        "ascent": 1160,
        "bold": 1,
        "descent": -288,
        "encoding_length": 2,
        "file_name": "SourceHanSansJP-Bold.ttf",
        "font_name": "Source Han Sans JP Bold",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "fb05bd84d62e8064117ee357ab6a4481e1cde931e8e984c0553c8c4b09dc3938",
        "size": 5603068,
    },
    "SourceHanSansJP-Regular.ttf": {
        "ascent": 1160,
        "bold": 0,
        "descent": -288,
        "encoding_length": 2,
        "file_name": "SourceHanSansJP-Regular.ttf",
        "font_name": "Source Han Sans JP Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "722cfbdcc0fd83fe07a3d1b10e9e64343c924a351d02cfe8dbb6ec4c6bc38230",
        "size": 5723960,
    },
    "SourceHanSansKR-Bold.ttf": {
        "ascent": 1160,
        "bold": 1,
        "descent": -288,
        "encoding_length": 2,
        "file_name": "SourceHanSansKR-Bold.ttf",
        "font_name": "Source Han Sans KR Bold",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "02959eb2c1eea0786a736aeb50b6e61f2ab873cd69c659389b7511f80f734838",
        "size": 5858892,
    },
    "SourceHanSansKR-Regular.ttf": {
        "ascent": 1160,
        "bold": 0,
        "descent": -288,
        "encoding_length": 2,
        "file_name": "SourceHanSansKR-Regular.ttf",
        "font_name": "Source Han Sans KR Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "aba70109eff718e8f796f0185f8dca38026c1661b43c195883c84577e501adf2",
        "size": 5961704,
    },
    "SourceHanSansTW-Bold.ttf": {
        "ascent": 1160,
        "bold": 1,
        "descent": -288,
        "encoding_length": 2,
        "file_name": "SourceHanSansTW-Bold.ttf",
        "font_name": "Source Han Sans TW Bold",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "4a92730e644a1348e87bba7c77e9b462f257f381bd6abbeac5860d8f8306aee6",
        "size": 6883224,
    },
    "SourceHanSansTW-Regular.ttf": {
        "ascent": 1160,
        "bold": 0,
        "descent": -288,
        "encoding_length": 2,
        "file_name": "SourceHanSansTW-Regular.ttf",
        "font_name": "Source Han Sans TW Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 0,
        "sha3_256": "6129b68ff4b0814624cac7edca61fbacf8f4d79db6f4c3cfc46b1c48ea2f81ac",
        "size": 7024812,
    },
    "SourceHanSerifCN-Bold.ttf": {
        "ascent": 1150,
        "bold": 1,
        "descent": -286,
        "encoding_length": 2,
        "file_name": "SourceHanSerifCN-Bold.ttf",
        "font_name": "Source Han Serif CN Bold",
        "italic": 0,
        "monospace": 0,
        "serif": 1,
        "sha3_256": "77816a54957616e140e25a36a41fc061ddb505a1107de4e6a65f561e5dcf8310",
        "size": 14134156,
    },
    "SourceHanSerifCN-Regular.ttf": {
        "ascent": 1150,
        "bold": 0,
        "descent": -286,
        "encoding_length": 2,
        "file_name": "SourceHanSerifCN-Regular.ttf",
        "font_name": "Source Han Serif CN Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 1,
        "sha3_256": "c8bf74da2c3b7457c9d887465b42fb6f80d3d84f361cfe5b0673a317fb1f85ad",
        "size": 14047768,
    },
    "SourceHanSerifHK-Bold.ttf": {
        "ascent": 1150,
        "bold": 1,
        "descent": -286,
        "encoding_length": 2,
        "file_name": "SourceHanSerifHK-Bold.ttf",
        "font_name": "Source Han Serif HK Bold",
        "italic": 0,
        "monospace": 0,
        "serif": 1,
        "sha3_256": "0f81296f22846b622a26f7342433d6c5038af708a32fc4b892420c150227f4bb",
        "size": 9532580,
    },
    "SourceHanSerifHK-Regular.ttf": {
        "ascent": 1150,
        "bold": 0,
        "descent": -286,
        "encoding_length": 2,
        "file_name": "SourceHanSerifHK-Regular.ttf",
        "font_name": "Source Han Serif HK Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 1,
        "sha3_256": "d5232ec3adf4fb8604bb4779091169ec9bd9d574b513e4a75752e614193afebe",
        "size": 9467292,
    },
    "SourceHanSerifJP-Bold.ttf": {
        "ascent": 1150,
        "bold": 1,
        "descent": -286,
        "encoding_length": 2,
        "file_name": "SourceHanSerifJP-Bold.ttf",
        "font_name": "Source Han Serif JP Bold",
        "italic": 0,
        "monospace": 0,
        "serif": 1,
        "sha3_256": "a4a8c22e8ec7bb6e66b9caaff1e12c7a52b5a4201eec3d074b35957c0126faef",
        "size": 7811832,
    },
    "SourceHanSerifJP-Regular.ttf": {
        "ascent": 1150,
        "bold": 0,
        "descent": -286,
        "encoding_length": 2,
        "file_name": "SourceHanSerifJP-Regular.ttf",
        "font_name": "Source Han Serif JP Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 1,
        "sha3_256": "3d1f9933c7f3abc8c285e317119a533e6dcfe6027d1f5f066ba71b3eb9161e9c",
        "size": 7748816,
    },
    "SourceHanSerifKR-Bold.ttf": {
        "ascent": 1150,
        "bold": 1,
        "descent": -286,
        "encoding_length": 2,
        "file_name": "SourceHanSerifKR-Bold.ttf",
        "font_name": "Source Han Serif KR Bold",
        "italic": 0,
        "monospace": 0,
        "serif": 1,
        "sha3_256": "b071b1aecb042aa779e1198767048438dc756d0da8f90660408abb421393f5cb",
        "size": 12387920,
    },
    "SourceHanSerifKR-Regular.ttf": {
        "ascent": 1150,
        "bold": 0,
        "descent": -286,
        "encoding_length": 2,
        "file_name": "SourceHanSerifKR-Regular.ttf",
        "font_name": "Source Han Serif KR Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 1,
        "sha3_256": "a85913439f0a49024ca77c02dfede4318e503ee6b2b7d8fef01eb42435f27b61",
        "size": 12459924,
    },
    "SourceHanSerifTW-Bold.ttf": {
        "ascent": 1150,
        "bold": 1,
        "descent": -286,
        "encoding_length": 2,
        "file_name": "SourceHanSerifTW-Bold.ttf",
        "font_name": "Source Han Serif TW Bold",
        "italic": 0,
        "monospace": 0,
        "serif": 1,
        "sha3_256": "562eea88895ab79ffefab7eabb4d322352a7b1963764c524c6d5242ca456bb6e",
        "size": 9551724,
    },
    "SourceHanSerifTW-Regular.ttf": {
        "ascent": 1150,
        "bold": 0,
        "descent": -286,
        "encoding_length": 2,
        "file_name": "SourceHanSerifTW-Regular.ttf",
        "font_name": "Source Han Serif TW Regular",
        "italic": 0,
        "monospace": 0,
        "serif": 1,
        "sha3_256": "85c1d6460b2e169b3d53ac60f6fb7a219fb99923027d78fb64b679475e2ddae4",
        "size": 9486772,
    },
}

CMAP_METADATA = {
    "78-EUC-H.json": {
        "file_name": "78-EUC-H.json",
        "sha3_256": "657006ae4360ac584316dbda94f2223d7dd4cf7c721021b78b470ed712d22a3d",
        "size": 15035,
    },
    "78-EUC-V.json": {
        "file_name": "78-EUC-V.json",
        "sha3_256": "ffd0610937d3893cd6b9f10007033dab4c846d6a50914b3e0b5b1a1d5a446483",
        "size": 704,
    },
    "78-H.json": {
        "file_name": "78-H.json",
        "sha3_256": "07960a71bd7f2dc8501bfff6ebacb5d179961accbb8d043837d6d213d4e7c43f",
        "size": 14993,
    },
    "78-RKSJ-H.json": {
        "file_name": "78-RKSJ-H.json",
        "sha3_256": "2cea4cbf474c08d99420790509473f48960d14df27e37155c0833150eff0310c",
        "size": 15054,
    },
    "78-RKSJ-V.json": {
        "file_name": "78-RKSJ-V.json",
        "sha3_256": "0005485dc7cb41b9911d651a31a008ff4d8f707f3a271f5eb900640415255f58",
        "size": 705,
    },
    "78-V.json": {
        "file_name": "78-V.json",
        "sha3_256": "6ec527dfdd6f8176719db47aea208d96c8427ff2c44bb6d6adcf215e3599c7dd",
        "size": 700,
    },
    "78ms-RKSJ-H.json": {
        "file_name": "78ms-RKSJ-H.json",
        "sha3_256": "781802e72f8e79d599d58a81445333d005df5117b10c9b8392459729e51bbec7",
        "size": 17125,
    },
    "78ms-RKSJ-V.json": {
        "file_name": "78ms-RKSJ-V.json",
        "sha3_256": "1854ff118f30bdee044813bf764f44123697cb2c2dfcfacb10e1aa161d7db16b",
        "size": 1928,
    },
    "83pv-RKSJ-H.json": {
        "file_name": "83pv-RKSJ-H.json",
        "sha3_256": "2b6dd0a63fc97f3b33767a1b16a49b30ba0cb97a1ff01deb6ca5592d90e79815",
        "size": 5277,
    },
    "90ms-RKSJ-H.json": {
        "file_name": "90ms-RKSJ-H.json",
        "sha3_256": "ebacf23e35e924a65b45afb6276f645289f68b122f1b32ab4dbc64f9c7903ccf",
        "size": 4117,
    },
    "90ms-RKSJ-V.json": {
        "file_name": "90ms-RKSJ-V.json",
        "sha3_256": "0e08ffc0c46d93912870ad12a863081bcea12db09038e3929e1e015cfc1663da",
        "size": 1928,
    },
    "90msp-RKSJ-H.json": {
        "file_name": "90msp-RKSJ-H.json",
        "sha3_256": "3098d897f17b1723d5915518d281d3c5d4f46f0b83dbde8b8001073e0f882d32",
        "size": 4096,
    },
    "90msp-RKSJ-V.json": {
        "file_name": "90msp-RKSJ-V.json",
        "sha3_256": "a7ad430c32de4dbce2667fff874efc5d4114c685107f026788eee4ec83992fc8",
        "size": 1929,
    },
    "90pv-RKSJ-H.json": {
        "file_name": "90pv-RKSJ-H.json",
        "sha3_256": "2c1720cc7343f95ccb87e073df0c7788d33bc8811b703b709a0230e79ecb2341",
        "size": 6314,
    },
    "90pv-RKSJ-V.json": {
        "file_name": "90pv-RKSJ-V.json",
        "sha3_256": "487bf100397d4f0bcfa86dbfea149cac54faa59c0b449d65284cc43123d99023",
        "size": 1283,
    },
    "Add-H.json": {
        "file_name": "Add-H.json",
        "sha3_256": "3bd6fbbe961dffa3a6395d1e3823da665efc74363f44ff6083d98fc5ae22433a",
        "size": 15174,
    },
    "Add-RKSJ-H.json": {
        "file_name": "Add-RKSJ-H.json",
        "sha3_256": "bde048bae5dc9c43570bff29ff4691e03372e029dde66edc5e8de64a891dd53b",
        "size": 15259,
    },
    "Add-RKSJ-V.json": {
        "file_name": "Add-RKSJ-V.json",
        "sha3_256": "1a81852c30ebf3101e1e0b0b5eff2e4f19211373c513d7c42b0933ded6b6e59b",
        "size": 1426,
    },
    "Add-V.json": {
        "file_name": "Add-V.json",
        "sha3_256": "6a4f7a4ee2d7a04ce0500b93453859faf3fc3f11b3f55cb61753ef79846b419b",
        "size": 1421,
    },
    "B5-H.json": {
        "file_name": "B5-H.json",
        "sha3_256": "f1b984aa231df737628663a56d380c93fe3172a243792db6d36921b964a118db",
        "size": 5960,
    },
    "B5-V.json": {
        "file_name": "B5-V.json",
        "sha3_256": "0fafc3f78a34f2bf2377a89b2679469505a35ae42df95bf6765f743344f9a94c",
        "size": 334,
    },
    "B5pc-H.json": {
        "file_name": "B5pc-H.json",
        "sha3_256": "07f0c25086768b9731971ba164d88cb10202a9d36e79a076c43233351f61c52f",
        "size": 6015,
    },
    "B5pc-V.json": {
        "file_name": "B5pc-V.json",
        "sha3_256": "f5e44d8eeeda40e8c3a81858dfb823eeed3f5e834e985544d1e56fb79260b8f8",
        "size": 336,
    },
    "CNS-EUC-H.json": {
        "file_name": "CNS-EUC-H.json",
        "sha3_256": "2add6b8cd4750db8bf6b029595232fecb8f1e54a0bad56590d4aa46401085e44",
        "size": 11342,
    },
    "CNS-EUC-V.json": {
        "file_name": "CNS-EUC-V.json",
        "sha3_256": "1ff26a35f10467a99957886c482de267658b9132a704b547381c90fc37c90820",
        "size": 12592,
    },
    "CNS1-H.json": {
        "file_name": "CNS1-H.json",
        "sha3_256": "e64c524f07718603b6bd84fd6799f875cc13c00137fbaa2b41215d518e96c87a",
        "size": 3728,
    },
    "CNS1-V.json": {
        "file_name": "CNS1-V.json",
        "sha3_256": "57a1d2aabe6ab9db9a323ab43c37e3aa1ba9b3eb71841dfec4d8568d657d503a",
        "size": 332,
    },
    "CNS2-H.json": {
        "file_name": "CNS2-H.json",
        "sha3_256": "90831af5d65fae9565d705fc8f1fccd091e33a67a1e544552410e39d7558daed",
        "size": 2053,
    },
    "CNS2-V.json": {
        "file_name": "CNS2-V.json",
        "sha3_256": "c4d2aae661b26120030754901abced51766fa4bce638433a7aa7130a3d5eabb0",
        "size": 54,
    },
    "ETHK-B5-H.json": {
        "file_name": "ETHK-B5-H.json",
        "sha3_256": "3ef2e9ef0364675c2fb9ccbfd37ed9227d416457ee8cadb9e59b2db4354d88ea",
        "size": 25660,
    },
    "ETHK-B5-V.json": {
        "file_name": "ETHK-B5-V.json",
        "sha3_256": "a12c5917b6f3400793e7d6ea2e217e9af05a28621a937cfef4da9f5184a03578",
        "size": 364,
    },
    "ETen-B5-H.json": {
        "file_name": "ETen-B5-H.json",
        "sha3_256": "57f29290c730277b221ad074709d4f76c429d5410931131c9da7157ebae76951",
        "size": 6205,
    },
    "ETen-B5-V.json": {
        "file_name": "ETen-B5-V.json",
        "sha3_256": "d07d9af9e30a8fc3ca7e52158f854226b831ab9ef552cda46219819e47950680",
        "size": 364,
    },
    "ETenms-B5-H.json": {
        "file_name": "ETenms-B5-H.json",
        "sha3_256": "0659f282182ebdaa6abb38062bc3428a3b7b5907513fd499980d1b49930a9b9e",
        "size": 72,
    },
    "ETenms-B5-V.json": {
        "file_name": "ETenms-B5-V.json",
        "sha3_256": "74b107f8950456b2df294a089091837bf802892c1bc3136c403da2a427130c33",
        "size": 429,
    },
    "EUC-H.json": {
        "file_name": "EUC-H.json",
        "sha3_256": "b6df6e254254eb5a2254b0d581f4820d2b3553cd372136ec88f605521683c44a",
        "size": 2910,
    },
    "EUC-V.json": {
        "file_name": "EUC-V.json",
        "sha3_256": "e81c0f409365f2fd60232f6e5c84bf52c8a6b9c6336d4c96fb554f213dbdfaf6",
        "size": 701,
    },
    "Ext-H.json": {
        "file_name": "Ext-H.json",
        "sha3_256": "629359cf115575acb68b59c82373a1a3958001212a854d0a5b98e6fe1efe81db",
        "size": 15891,
    },
    "Ext-RKSJ-H.json": {
        "file_name": "Ext-RKSJ-H.json",
        "sha3_256": "3336a4a77a75924588f13c5a24157680c9c5b6a46298063dcdb461b90bb55da0",
        "size": 15975,
    },
    "Ext-RKSJ-V.json": {
        "file_name": "Ext-RKSJ-V.json",
        "sha3_256": "f2915039ff32992094ff6521fa24c3f41c27f55f3f071730eea732e261a2a553",
        "size": 994,
    },
    "Ext-V.json": {
        "file_name": "Ext-V.json",
        "sha3_256": "e2fb58ec483aee0910b0733dcb6220f10f9f4d2553c8c139a523e3992363f93e",
        "size": 989,
    },
    "GB-EUC-H.json": {
        "file_name": "GB-EUC-H.json",
        "sha3_256": "4a0b5fda367993409663ec1d4be57c207a3500d778373546b729d143d789c191",
        "size": 2178,
    },
    "GB-EUC-V.json": {
        "file_name": "GB-EUC-V.json",
        "sha3_256": "b45a8a562304c2c388fd1574c3a1a0af6f49e4849f7904ba07d57967d9625917",
        "size": 520,
    },
    "GB-H.json": {
        "file_name": "GB-H.json",
        "sha3_256": "a50b5d6461c95a667ccbc44c507ff5e6686e4f1bbd8bfae69486396b4ed03510",
        "size": 2139,
    },
    "GB-V.json": {
        "file_name": "GB-V.json",
        "sha3_256": "1f043042065f2df4590ebbd27fbc8f93802ea66caeb0b8ba92823575842743e5",
        "size": 516,
    },
    "GBK-EUC-H.json": {
        "file_name": "GBK-EUC-H.json",
        "sha3_256": "4502e7abe2edfb6256b5a4308dfca940aaa92a2d951c4b44942ce7bdb9eda877",
        "size": 99532,
    },
    "GBK-EUC-V.json": {
        "file_name": "GBK-EUC-V.json",
        "sha3_256": "c71f6281bb59897dcf48f587136d002d5caa8a0ed89f9b490a6a288765ec674d",
        "size": 521,
    },
    "GBK2K-H.json": {
        "file_name": "GBK2K-H.json",
        "sha3_256": "0a2a975da25641067ea2743f15407df20895b28804a1e64c12cd9fd0f306b1a9",
        "size": 109298,
    },
    "GBK2K-V.json": {
        "file_name": "GBK2K-V.json",
        "sha3_256": "0febb4a13f8f73dc949d159b4f37e886d1c3d1514aaf53d3492e0b5e21523f52",
        "size": 1044,
    },
    "GBKp-EUC-H.json": {
        "file_name": "GBKp-EUC-H.json",
        "sha3_256": "50d628304aff1f13ded3790cc3b8bd48502267768cac5e72cb3be8a46f9a5436",
        "size": 99510,
    },
    "GBKp-EUC-V.json": {
        "file_name": "GBKp-EUC-V.json",
        "sha3_256": "8c540fc12dfed309896544f8153fa52b793708a85e3882985567dcae86fb1732",
        "size": 522,
    },
    "GBT-EUC-H.json": {
        "file_name": "GBT-EUC-H.json",
        "sha3_256": "5fbe99ec7638de5216ea452788d3ef40cfd8c110c8b8ae936b57db6221d9b9d9",
        "size": 54802,
    },
    "GBT-EUC-V.json": {
        "file_name": "GBT-EUC-V.json",
        "sha3_256": "4cc3a48b1f7c8ab088391aa78131289da3d68e2fe0071b380a10c19757356ab5",
        "size": 521,
    },
    "GBT-H.json": {
        "file_name": "GBT-H.json",
        "sha3_256": "8bbbbbdee2722751708dd66a7ed12fa54a08bbf0dcfaefca2b87f305ca591f32",
        "size": 54763,
    },
    "GBT-V.json": {
        "file_name": "GBT-V.json",
        "sha3_256": "32e4457c8b0edbeeec9445465ec40106603ad50003e1af98994c02020df1c59f",
        "size": 517,
    },
    "GBTpc-EUC-H.json": {
        "file_name": "GBTpc-EUC-H.json",
        "sha3_256": "7f7faa903850fc471948e284853a81ee2f4a32693e14131f3ab1fbc490c5695b",
        "size": 54820,
    },
    "GBTpc-EUC-V.json": {
        "file_name": "GBTpc-EUC-V.json",
        "sha3_256": "3cf85a97171567e08d0112b71ca4a0aef68c52918b7c635669ef7e25e1bcb818",
        "size": 523,
    },
    "GBpc-EUC-H.json": {
        "file_name": "GBpc-EUC-H.json",
        "sha3_256": "38332ce5be0b82e4010fbd05ceac92e9f05a784ccacf6a4f004cd8da734c47de",
        "size": 2196,
    },
    "GBpc-EUC-V.json": {
        "file_name": "GBpc-EUC-V.json",
        "sha3_256": "5a0b4e7db0aedd6b27f84b191791b527da3ea27ea1ca42460086cb0d294418bf",
        "size": 522,
    },
    "H.json": {
        "file_name": "H.json",
        "sha3_256": "5ee11fcc99897b769fd62238967954e957bb8079353abba815792aab6f3e329c",
        "size": 2868,
    },
    "HKdla-B5-H.json": {
        "file_name": "HKdla-B5-H.json",
        "sha3_256": "8f24808486e1d5363a66981021f3f8b136f1ec6231d48bda76344e1f7f1695aa",
        "size": 25384,
    },
    "HKdla-B5-V.json": {
        "file_name": "HKdla-B5-V.json",
        "sha3_256": "1e686a7f69d6b7a3c05a4be9e7e396cf81498ef48299341616e76805c1092733",
        "size": 340,
    },
    "HKdlb-B5-H.json": {
        "file_name": "HKdlb-B5-H.json",
        "sha3_256": "0ccae437017107059630d56c7e0e2d6f086d5fb512c9e60b1bd48c4a04b6652d",
        "size": 22501,
    },
    "HKdlb-B5-V.json": {
        "file_name": "HKdlb-B5-V.json",
        "sha3_256": "dad584337fd6e5e6ab5e1e30dc9b8cc1013985a04a159b3c108c4dfb5c10fb55",
        "size": 340,
    },
    "HKgccs-B5-H.json": {
        "file_name": "HKgccs-B5-H.json",
        "sha3_256": "f7da0854c355c51957de6e71ffa33fbc69414d52dcfc5a5cb50c8f8c6c6bd9c6",
        "size": 13642,
    },
    "HKgccs-B5-V.json": {
        "file_name": "HKgccs-B5-V.json",
        "sha3_256": "d7f89dc24162b624bc4d682484da315a4d39eaf9a8f63c1392e06d2aa46f015a",
        "size": 341,
    },
    "HKm314-B5-H.json": {
        "file_name": "HKm314-B5-H.json",
        "sha3_256": "febd4cb78048e012478df9fc91aa23e946304d63c5f7c64ea8e16277b64a359b",
        "size": 13405,
    },
    "HKm314-B5-V.json": {
        "file_name": "HKm314-B5-V.json",
        "sha3_256": "d310bbf5a975fe8e1f8bb4523b0db8e792043578f0c2a12735bbc24fc4a3721f",
        "size": 341,
    },
    "HKm471-B5-H.json": {
        "file_name": "HKm471-B5-H.json",
        "sha3_256": "fdb1368b1a6f4df20ab87e2a1045a579088645828d1168e39d6aa5b52c09bd8e",
        "size": 17079,
    },
    "HKm471-B5-V.json": {
        "file_name": "HKm471-B5-V.json",
        "sha3_256": "34c40c1bb1409942f12f66f1bcbc2be73406b4c5e626ea7a4ab7f73160ba2a88",
        "size": 341,
    },
    "HKscs-B5-H.json": {
        "file_name": "HKscs-B5-H.json",
        "sha3_256": "63fe2b09c05c8ef70fb937aad49698d4154e1d7bb75f94344fea4db522b87a88",
        "size": 25722,
    },
    "HKscs-B5-V.json": {
        "file_name": "HKscs-B5-V.json",
        "sha3_256": "14c864025ffca616fc173458162efe190bdace4700e2a7ad4869c66476534223",
        "size": 365,
    },
    "Hankaku.json": {
        "file_name": "Hankaku.json",
        "sha3_256": "befe81a2bbe191bcb8e0ff23706a51cb6a41a60f6bc508d5c0c19040c14afc06",
        "size": 238,
    },
    "Hiragana.json": {
        "file_name": "Hiragana.json",
        "sha3_256": "0e8ce0a48ec8c05f4c65d23ada539c4a2a236fcb7dd46e20874acd9362394525",
        "size": 200,
    },
    "Identity-H.json": {
        "file_name": "Identity-H.json",
        "sha3_256": "77cc630138b29b5acd4ab216cb1d173bb3e7b994ab932a4f3d8a9121be91fbab",
        "size": 6404,
    },
    "Identity-V.json": {
        "file_name": "Identity-V.json",
        "sha3_256": "067a8d390f2d99dfa94ff19009925e5815c8b54b65b39314a244cbbace494679",
        "size": 62,
    },
    "KSC-EUC-H.json": {
        "file_name": "KSC-EUC-H.json",
        "sha3_256": "79fb3c0bd9d2ce6b80da98d6f1ef4fd2776dfc3fb78c5ee4d6ee3a06aebc9fd0",
        "size": 11234,
    },
    "KSC-EUC-V.json": {
        "file_name": "KSC-EUC-V.json",
        "sha3_256": "a541a285c966105a92dba6939401ac8aaeb057e5200bdbf8c874ceecb9f37b01",
        "size": 441,
    },
    "KSC-H.json": {
        "file_name": "KSC-H.json",
        "sha3_256": "a0a20bce98ffe98036aa748d46c2921e17247827a22298edb59c778b8b776f24",
        "size": 11214,
    },
    "KSC-Johab-H.json": {
        "file_name": "KSC-Johab-H.json",
        "sha3_256": "3d7cd1473ddcf7c3bfb80c7eadf45a365389759b1df1f53e0bd5f31e31125e96",
        "size": 100922,
    },
    "KSC-Johab-V.json": {
        "file_name": "KSC-Johab-V.json",
        "sha3_256": "2f7cf1d05bd82d65e488fc3297aefc1c1f48f2c6972b01304c4be5f260fae86e",
        "size": 443,
    },
    "KSC-V.json": {
        "file_name": "KSC-V.json",
        "sha3_256": "f6f09bab60f802d61c22368ca8650cefa08851c2039c5825e37404c7047eb496",
        "size": 437,
    },
    "KSCms-UHC-H.json": {
        "file_name": "KSCms-UHC-H.json",
        "sha3_256": "6df55fd679239f3a6642c7690e89a85525fa6a8a3cf748aef247b2d06fdc1aca",
        "size": 16419,
    },
    "KSCms-UHC-HW-H.json": {
        "file_name": "KSCms-UHC-HW-H.json",
        "sha3_256": "a05183c5d7b6b6f62d11f8175e5749d5ad2913d469403905c8f01a403d715583",
        "size": 16422,
    },
    "KSCms-UHC-HW-V.json": {
        "file_name": "KSCms-UHC-HW-V.json",
        "sha3_256": "e2586795b094fade7e385ff1ce5570232edc791c456acf4c6e1c11bc501f82a4",
        "size": 446,
    },
    "KSCms-UHC-V.json": {
        "file_name": "KSCms-UHC-V.json",
        "sha3_256": "c09dc49c1afea5a5dc01bd6ac672d2af83b4821d74de7df71d4da3233513cefb",
        "size": 443,
    },
    "KSCpc-EUC-H.json": {
        "file_name": "KSCpc-EUC-H.json",
        "sha3_256": "b43448cb510c7f952a6affd0950db58063719f7499309c64f78fea6b2778fa11",
        "size": 12226,
    },
    "KSCpc-EUC-V.json": {
        "file_name": "KSCpc-EUC-V.json",
        "sha3_256": "1f4889c2e7278085738257e8097382ef5ac40b543b71751b75b155b056a46db2",
        "size": 443,
    },
    "Katakana.json": {
        "file_name": "Katakana.json",
        "sha3_256": "524b659bd0acc0fb4baa7633c3250683d6b3ba1685caadc9739240ccdbfd2ce2",
        "size": 86,
    },
    "NWP-H.json": {
        "file_name": "NWP-H.json",
        "sha3_256": "6c067655436fe89fb21a26e258973313bfe7cd5fbab3a2857b00ea92cc82c25d",
        "size": 18143,
    },
    "NWP-V.json": {
        "file_name": "NWP-V.json",
        "sha3_256": "b494038c72c63c6917ab3ed3f83a8b6bf21c65ba9ea47a4887833fffcc434763",
        "size": 1205,
    },
    "RKSJ-H.json": {
        "file_name": "RKSJ-H.json",
        "sha3_256": "eff868636f960b80d6923b77eb59d76acf6d7297bc74e1b7f3a13ff92a71c1cb",
        "size": 2953,
    },
    "RKSJ-V.json": {
        "file_name": "RKSJ-V.json",
        "sha3_256": "f3827bc17eb1172a5713d2d5c83a9b60f965894e3f2cb8dcb731b6f151abaa10",
        "size": 702,
    },
    "Roman.json": {
        "file_name": "Roman.json",
        "sha3_256": "620ab6ac0f4b487f19d44397b49612db57d164ddbff8e7d52fb5fd7e969e0cb9",
        "size": 67,
    },
    "UniAKR-UTF16-H.json": {
        "file_name": "UniAKR-UTF16-H.json",
        "sha3_256": "1204af593c62e5d10ace0db3b5ca0caecc80240f1c866bf1585fad405c204a54",
        "size": 232741,
    },
    "UniAKR-UTF32-H.json": {
        "file_name": "UniAKR-UTF32-H.json",
        "sha3_256": "cbbebc4b9b018109612dcfc0798f5c164d739a8b202017580301e0f27f76c35d",
        "size": 296773,
    },
    "UniAKR-UTF8-H.json": {
        "file_name": "UniAKR-UTF8-H.json",
        "sha3_256": "e08da06fc02a877abb02205fe0db3b61566d9ac41511a735ef2f12b5741d069a",
        "size": 266575,
    },
    "UniCNS-UCS2-H.json": {
        "file_name": "UniCNS-UCS2-H.json",
        "sha3_256": "48a0840498b90cf597c05ad2f63e26aaea778a49171f821d4b87b94424d7e640",
        "size": 400654,
    },
    "UniCNS-UCS2-V.json": {
        "file_name": "UniCNS-UCS2-V.json",
        "sha3_256": "014f9d86baea5fd13e460dd3735eab98dbbacf126922826ef0be9d7c8c605418",
        "size": 360,
    },
    "UniCNS-UTF16-H.json": {
        "file_name": "UniCNS-UTF16-H.json",
        "sha3_256": "c67980ebfb0d525365d0b5421548cc64ce9fb89afca1a0f6d04972f1e39b7f9c",
        "size": 320254,
    },
    "UniCNS-UTF16-V.json": {
        "file_name": "UniCNS-UTF16-V.json",
        "sha3_256": "98bd35d76997c0f3c443f130d44e814997cb0277183b7bf6571f92206d9a85a0",
        "size": 311,
    },
    "UniCNS-UTF32-H.json": {
        "file_name": "UniCNS-UTF32-H.json",
        "sha3_256": "6ab73cc531843f9bef915a949a0b79de1df288bb7ed6026db782ac446ed36c94",
        "size": 391690,
    },
    "UniCNS-UTF32-V.json": {
        "file_name": "UniCNS-UTF32-V.json",
        "sha3_256": "d94f8c3d7fe834d34f746b9404a4bb5dd8479353e3b9f95b308642a8be793a44",
        "size": 391,
    },
    "UniCNS-UTF8-H.json": {
        "file_name": "UniCNS-UTF8-H.json",
        "sha3_256": "3666cbe4d00de4038120c98472137857c93d44735c3a5def8c4ac7f84a59aa72",
        "size": 357287,
    },
    "UniCNS-UTF8-V.json": {
        "file_name": "UniCNS-UTF8-V.json",
        "sha3_256": "e410ed491c0e2f31ba30cfd60eb4e21c40d3ee82e2be1c06c7adb8772b175f10",
        "size": 350,
    },
    "UniGB-UCS2-H.json": {
        "file_name": "UniGB-UCS2-H.json",
        "sha3_256": "42a8e01b690cf2cd6b137c1eb94e7668899f0041b6e43b921252fe453486a96e",
        "size": 336533,
    },
    "UniGB-UCS2-V.json": {
        "file_name": "UniGB-UCS2-V.json",
        "sha3_256": "0a0aaf21f823546faf0971b7926724cc95b53b3da3f42a22ec0526ca8de1b237",
        "size": 617,
    },
    "UniGB-UTF16-H.json": {
        "file_name": "UniGB-UTF16-H.json",
        "sha3_256": "c306f093839fffe81e0c8597a24be508a64aa2a9c3e9b9eee858d55059530c0d",
        "size": 251806,
    },
    "UniGB-UTF16-V.json": {
        "file_name": "UniGB-UTF16-V.json",
        "sha3_256": "bd283b8c7e145e340db39868ec1a3b0a08d89acc2bfac672d41008a8195c7bb3",
        "size": 456,
    },
    "UniGB-UTF32-H.json": {
        "file_name": "UniGB-UTF32-H.json",
        "sha3_256": "a01a6a8b4b715f27c7e1866894240b0e1fd61a4eaca1c91df80c1f256ad06f72",
        "size": 319766,
    },
    "UniGB-UTF32-V.json": {
        "file_name": "UniGB-UTF32-V.json",
        "sha3_256": "8b31bba8b852a2c6c1f6d92aea633285e2f75237fbe87ecadff9f9312a0bfaa9",
        "size": 572,
    },
    "UniGB-UTF8-H.json": {
        "file_name": "UniGB-UTF8-H.json",
        "sha3_256": "87f7a6b0360d0f9bd0658cb7a67587e86c604be44292214622d972d85a474dbf",
        "size": 290481,
    },
    "UniGB-UTF8-V.json": {
        "file_name": "UniGB-UTF8-V.json",
        "sha3_256": "1378adf3ecd0bfbdb11dabbf2118cbb968a03aa2215780b77b07459e3b1df6e7",
        "size": 513,
    },
    "UniJIS-UCS2-H.json": {
        "file_name": "UniJIS-UCS2-H.json",
        "sha3_256": "a73e449136b46240ef86c9fb2b614e7d290b814130e9beb4b987c52fd7eda575",
        "size": 205924,
    },
    "UniJIS-UCS2-HW-H.json": {
        "file_name": "UniJIS-UCS2-HW-H.json",
        "sha3_256": "e58ec4fd06677ecfcef12d25f6456b7f80da706b2ac6ef915239e0b780b775a0",
        "size": 154,
    },
    "UniJIS-UCS2-HW-V.json": {
        "file_name": "UniJIS-UCS2-HW-V.json",
        "sha3_256": "bc3c81dbd6329d83cd71743a6985ed0cf516b0aa97a1c58c3cc3940e280b1e8e",
        "size": 4868,
    },
    "UniJIS-UCS2-V.json": {
        "file_name": "UniJIS-UCS2-V.json",
        "sha3_256": "276712ac66416538e859ad28e9f5b685fbc71e5d7d91e905a3489f03667ae4bc",
        "size": 4775,
    },
    "UniJIS-UTF16-H.json": {
        "file_name": "UniJIS-UTF16-H.json",
        "sha3_256": "afc923e268f22dcf09e0871ce0060c7588aa1304d4b26e781a261c14566f7642",
        "size": 238042,
    },
    "UniJIS-UTF16-V.json": {
        "file_name": "UniJIS-UTF16-V.json",
        "sha3_256": "0a044ab7015485c3b0f7f9e4d883a1d9e9f1d04235b13e2a17687e878ce3e9f0",
        "size": 3951,
    },
    "UniJIS-UTF32-H.json": {
        "file_name": "UniJIS-UTF32-H.json",
        "sha3_256": "1c27e2e595d659073e37e5ee22a9b39abe30af1483de33e1078ed174abdc723c",
        "size": 295294,
    },
    "UniJIS-UTF32-V.json": {
        "file_name": "UniJIS-UTF32-V.json",
        "sha3_256": "aa7a475ce5f85f79d73e17355c08e6aee21a949b596f2efe359913489a22117f",
        "size": 4983,
    },
    "UniJIS-UTF8-H.json": {
        "file_name": "UniJIS-UTF8-H.json",
        "sha3_256": "d91079b3f1671a7f4ace8b8f89478558f43f7782e666064ce1b53af563a87306",
        "size": 266367,
    },
    "UniJIS-UTF8-V.json": {
        "file_name": "UniJIS-UTF8-V.json",
        "sha3_256": "d0c8c94f7d54dafa40876ce7eb28845d8ac00b688cf4bac255694cb2f086d109",
        "size": 4483,
    },
    "UniJIS2004-UTF16-H.json": {
        "file_name": "UniJIS2004-UTF16-H.json",
        "sha3_256": "336660e87fc57ad166258d22f09690fcebb546840faee1e1b3f6cad3556bcf80",
        "size": 238119,
    },
    "UniJIS2004-UTF16-V.json": {
        "file_name": "UniJIS2004-UTF16-V.json",
        "sha3_256": "f6619a74b62f9986e9a74620b28e726b927dde5cd6184742f368ef4d686fe55c",
        "size": 3955,
    },
    "UniJIS2004-UTF32-H.json": {
        "file_name": "UniJIS2004-UTF32-H.json",
        "sha3_256": "2512690db880e0663f8208d22acda8daa98f1240ff14a038bf02e57c4908afb5",
        "size": 295371,
    },
    "UniJIS2004-UTF32-V.json": {
        "file_name": "UniJIS2004-UTF32-V.json",
        "sha3_256": "da1728a91845f1654457eaf0f15b75d1ace5cbf75486bca8523bd5edf20a8010",
        "size": 4987,
    },
    "UniJIS2004-UTF8-H.json": {
        "file_name": "UniJIS2004-UTF8-H.json",
        "sha3_256": "af36b0255a1ed15966670703ba8a48987a1cf7e43f5c94a4e86a41e5ee26b940",
        "size": 266444,
    },
    "UniJIS2004-UTF8-V.json": {
        "file_name": "UniJIS2004-UTF8-V.json",
        "sha3_256": "28bebdf1581c45f2e9b38caa2ff643abd561321bab45febb0f90d802d2290faa",
        "size": 4487,
    },
    "UniJISPro-UCS2-HW-V.json": {
        "file_name": "UniJISPro-UCS2-HW-V.json",
        "sha3_256": "21fd353a062b6c415389d6fde11718488f765ca31fd4ca481050c89633568009",
        "size": 4994,
    },
    "UniJISPro-UCS2-V.json": {
        "file_name": "UniJISPro-UCS2-V.json",
        "sha3_256": "8daa155869a35f3f629abb042790c59eb5cff342b83573c2ae4c87b3e865dc27",
        "size": 4901,
    },
    "UniJISPro-UTF8-V.json": {
        "file_name": "UniJISPro-UTF8-V.json",
        "sha3_256": "19b9a6d908f9fb7413d778c9cc912072314864225c38a3f5c345936fabcea650",
        "size": 5726,
    },
    "UniJISX0213-UTF32-H.json": {
        "file_name": "UniJISX0213-UTF32-H.json",
        "sha3_256": "e6a07453703f5070bf567c9d67aa20bc4b404bd311413fed45d9ba8c297a91d9",
        "size": 295246,
    },
    "UniJISX0213-UTF32-V.json": {
        "file_name": "UniJISX0213-UTF32-V.json",
        "sha3_256": "5f2dd4ff8045b2308a707e3d4ffb73e1ba7f5a1c1fdb43b17c5a322109897b9c",
        "size": 4908,
    },
    "UniJISX02132004-UTF32-H.json": {
        "file_name": "UniJISX02132004-UTF32-H.json",
        "sha3_256": "81427dc73cf9392c0c3e8eeeb1dedbc797b123059714bfcdcd1ecffec9f341e3",
        "size": 295323,
    },
    "UniJISX02132004-UTF32-V.json": {
        "file_name": "UniJISX02132004-UTF32-V.json",
        "sha3_256": "c0721298f3449f0c6f48ada1200ebcadbfc4020b10333871f6c0eea0be9f13ac",
        "size": 4912,
    },
    "UniKS-UCS2-H.json": {
        "file_name": "UniKS-UCS2-H.json",
        "sha3_256": "3a1c10535982d06dde447764f8e3dd82c6c87bec6c4272eaf449f67db6d50ab8",
        "size": 202706,
    },
    "UniKS-UCS2-V.json": {
        "file_name": "UniKS-UCS2-V.json",
        "sha3_256": "b915820ff4639f837e4d3b7e5a7c0810c26af1dcf3df9e56ed9a0a69e3cdba9d",
        "size": 492,
    },
    "UniKS-UTF16-H.json": {
        "file_name": "UniKS-UTF16-H.json",
        "sha3_256": "820f534efffcef15f0d3f270c078774febee31b451a1387b27f7225da321c12f",
        "size": 153894,
    },
    "UniKS-UTF16-V.json": {
        "file_name": "UniKS-UTF16-V.json",
        "sha3_256": "2b5be7641990cf79754a12309c6069c01b636cfc3308bc4dc8075da59c2d8d6b",
        "size": 403,
    },
    "UniKS-UTF32-H.json": {
        "file_name": "UniKS-UTF32-H.json",
        "sha3_256": "541515ed8ff15170b38fbe6587ff6c54f6fc75aeede9da110133dc335e4ddf0e",
        "size": 195998,
    },
    "UniKS-UTF32-V.json": {
        "file_name": "UniKS-UTF32-V.json",
        "sha3_256": "940e977d3927c8480c65dc4ad6be4f365f65b8d76707758a7696d40e2b3583ea",
        "size": 503,
    },
    "UniKS-UTF8-H.json": {
        "file_name": "UniKS-UTF8-H.json",
        "sha3_256": "81b5c336c1a20dee2e9592c6615a46cdd906edd242717c1807609b5687576252",
        "size": 177154,
    },
    "UniKS-UTF8-V.json": {
        "file_name": "UniKS-UTF8-V.json",
        "sha3_256": "9a282e8eee884f801a5518cc52ff240ee8635553661dd0ee7df952adbad7462a",
        "size": 452,
    },
    "V.json": {
        "file_name": "V.json",
        "sha3_256": "616f263e53079846a66efc861524a15c0a411e823c37fe08e62bad835745cbba",
        "size": 697,
    },
    "WP-Symbol.json": {
        "file_name": "WP-Symbol.json",
        "sha3_256": "533dfe497eab1f095039b6344217fc0ff6b1f7cdf9b406bb19c30b945fe78c21",
        "size": 588,
    },
}


FONT_NAMES = {v["font_name"] for v in EMBEDDING_FONT_METADATA.values()}

CN_FONT_FAMILY = {
    # Handwriting (script) fonts
    "script": [
        "LXGWWenKaiGB-Regular.1.520.ttf",
    ],
    # Body text fonts
    "normal": [
        "SourceHanSerifCN-Bold.ttf",
        "SourceHanSerifCN-Regular.ttf",
        "SourceHanSansCN-Bold.ttf",
        "SourceHanSansCN-Regular.ttf",
    ],
    # Fallback fonts
    "fallback": [
        "GoNotoKurrent-Regular.ttf",
        "GoNotoKurrent-Bold.ttf",
    ],
    "base": ["SourceHanSansCN-Regular.ttf"],
}

HK_FONT_FAMILY = {
    "script": ["LXGWWenKaiTC-Regular.1.520.ttf"],
    "normal": [
        "SourceHanSerifHK-Bold.ttf",
        "SourceHanSerifHK-Regular.ttf",
        "SourceHanSansHK-Bold.ttf",
        "SourceHanSansHK-Regular.ttf",
    ],
    "fallback": [
        "GoNotoKurrent-Regular.ttf",
        "GoNotoKurrent-Bold.ttf",
    ],
    "base": ["SourceHanSansCN-Regular.ttf"],
}

TW_FONT_FAMILY = {
    "script": ["LXGWWenKaiTC-Regular.1.520.ttf"],
    "normal": [
        "SourceHanSerifTW-Bold.ttf",
        "SourceHanSerifTW-Regular.ttf",
        "SourceHanSansTW-Bold.ttf",
        "SourceHanSansTW-Regular.ttf",
    ],
    "fallback": [
        "GoNotoKurrent-Regular.ttf",
        "GoNotoKurrent-Bold.ttf",
    ],
    "base": ["SourceHanSansCN-Regular.ttf"],
}

KR_FONT_FAMILY = {
    "script": ["MaruBuri-Regular.ttf"],
    "normal": [
        "SourceHanSerifKR-Bold.ttf",
        "SourceHanSerifKR-Regular.ttf",
        "SourceHanSansKR-Bold.ttf",
        "SourceHanSansKR-Regular.ttf",
    ],
    "fallback": [
        "GoNotoKurrent-Regular.ttf",
        "GoNotoKurrent-Bold.ttf",
    ],
    "base": ["SourceHanSansCN-Regular.ttf"],
}

JP_FONT_FAMILY = {
    "script": ["KleeOne-Regular.ttf"],
    "normal": [
        "SourceHanSerifJP-Bold.ttf",
        "SourceHanSerifJP-Regular.ttf",
        "SourceHanSansJP-Bold.ttf",
        "SourceHanSansJP-Regular.ttf",
    ],
    "fallback": [
        "GoNotoKurrent-Regular.ttf",
        "GoNotoKurrent-Bold.ttf",
    ],
    "base": ["SourceHanSansCN-Regular.ttf"],
}

EN_FONT_FAMILY = {
    "script": [
        "NotoSans-Italic.ttf",
        "NotoSans-BoldItalic.ttf",
        "NotoSerif-Italic.ttf",
        "NotoSerif-BoldItalic.ttf",
    ],
    "normal": [
        "NotoSerif-Regular.ttf",
        "NotoSerif-Bold.ttf",
        "NotoSans-Regular.ttf",
        "NotoSans-Bold.ttf",
    ],
    "fallback": [
        "GoNotoKurrent-Regular.ttf",
        "GoNotoKurrent-Bold.ttf",
    ],
    "base": [
        "NotoSans-Regular.ttf",
    ],
}

ALL_FONT_FAMILY = {
    "CN": CN_FONT_FAMILY,
    "TW": TW_FONT_FAMILY,
    "HK": HK_FONT_FAMILY,
    "KR": KR_FONT_FAMILY,
    "JP": JP_FONT_FAMILY,
    "EN": EN_FONT_FAMILY,
    "JA": JP_FONT_FAMILY,
}


def __add_fallback_to_font_family():
    for lang1, family1 in ALL_FONT_FAMILY.items():
        added_font = set()
        for font in itertools.chain.from_iterable(family1.values()):
            added_font.add(font)

        for lang2, family2 in ALL_FONT_FAMILY.items():
            if lang1 != lang2:
                for type_ in family1:
                    for font in family2[type_]:
                        if font not in added_font:
                            family1[type_].append(font)
                            added_font.add(font)


def __cleanup_unused_font_metadata():
    """Remove unused font metadata that are not referenced in any font family."""
    referenced_fonts = set()
    for family in ALL_FONT_FAMILY.values():
        for font_list in family.values():
            referenced_fonts.update(font_list)

    # Remove unreferenced fonts from EMBEDDING_FONT_METADATA
    unused_fonts = set(EMBEDDING_FONT_METADATA.keys()) - referenced_fonts
    for font_name in unused_fonts:
        del EMBEDDING_FONT_METADATA[font_name]


__add_fallback_to_font_family()
__cleanup_unused_font_metadata()


def get_font_family(lang_code: str):
    lang_code = lang_code.upper()
    if "KR" in lang_code:
        font_family = KR_FONT_FAMILY
    elif "JP" in lang_code or "JA" in lang_code:
        font_family = JP_FONT_FAMILY
    elif "HK" in lang_code:
        font_family = HK_FONT_FAMILY
    elif "TW" in lang_code:
        font_family = TW_FONT_FAMILY
    elif "EN" in lang_code:
        font_family = EN_FONT_FAMILY
    elif "CN" in lang_code:
        font_family = CN_FONT_FAMILY
    else:
        font_family = EN_FONT_FAMILY
    verify_font_family(font_family)
    return font_family


def verify_font_family(font_family: str | dict):
    if isinstance(font_family, str):
        font_family = ALL_FONT_FAMILY[font_family]
    for k in font_family:
        if k not in ["script", "normal", "fallback", "base"]:
            raise ValueError(f"Invalid font family: {font_family}")
        for font_file_name in font_family[k]:
            if font_file_name not in EMBEDDING_FONT_METADATA:
                raise ValueError(f"Invalid font file: {font_file_name}")


if __name__ == "__main__":
    for k in ALL_FONT_FAMILY:
        verify_font_family(k)
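
The cross-language merge done by `__add_fallback_to_font_family` may be easier to follow on a toy input. The sketch below repeats the same loop on two made-up families; the font file names and the `add_fallbacks` helper are hypothetical and are not part of BabelDOC.

```python
import itertools

# Toy stand-in for ALL_FONT_FAMILY; these font names are invented for illustration.
families = {
    "CN": {"normal": ["cn-serif.ttf"], "fallback": ["noto.ttf"]},
    "JP": {"normal": ["jp-serif.ttf"], "fallback": ["noto.ttf"]},
}


def add_fallbacks(all_families):
    """Append every other language's fonts to each family, skipping duplicates."""
    for lang1, family1 in all_families.items():
        # Fonts already present anywhere in this family, across all types.
        seen = set(itertools.chain.from_iterable(family1.values()))
        for lang2, family2 in all_families.items():
            if lang1 == lang2:
                continue
            for type_ in family1:
                for font in family2[type_]:
                    if font not in seen:
                        family1[type_].append(font)
                        seen.add(font)


add_fallbacks(families)
print(families["CN"]["normal"])  # the family's own fonts stay first in the list
```

Each family keeps its own fonts at the front of every list, so fonts borrowed from other languages only come into play as later candidates.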


================================================
FILE: babeldoc/asynchronize/__init__.py
================================================
import asyncio
import time


class Args:
    def __init__(self, args, kwargs):
        self.args = args
        self.kwargs = kwargs


class AsyncCallback:
    def __init__(self):
        self.queue = asyncio.Queue()
        self.finished = False
        self.loop = asyncio.get_event_loop()

    def step_callback(self, *args, **kwargs):
        # Whenever a step is called, add to the queue but don't set finished to True, so __anext__ will continue
        args = Args(args, kwargs)

        # We have to use the threadsafe call so that it wakes up the event loop, in case it's sleeping:
        # https://stackoverflow.com/a/49912853/2148718
        self.loop.call_soon_threadsafe(self.queue.put_nowait, args)

        # Add a small delay to release the GIL, ensuring the event loop has time to process messages
        time.sleep(0.01)

    def finished_callback(self, *args, **kwargs):
        # When the finished callback fires, enqueue the item as with a step, and also set finished to True so that
        # __anext__ terminates after processing the remaining items
        if self.finished:
            return
        self.step_callback(*args, **kwargs)
        self.finished = True

    def __await__(self):
        # Awaiting the object directly resolves to the next item in the queue
        return self.queue.get().__await__()

    def __aiter__(self):
        # Since this implements __anext__, this can return itself
        return self

    async def __anext__(self):
        # Keep waiting on the queue if a) we haven't finished, or b) the queue still has items. This lets us drain
        # the remaining items even after finished has been set
        if self.finished and self.queue.empty():
            raise StopAsyncIteration

        result = await self.queue.get()
        return result
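
As a rough illustration of the pattern above, the sketch below bridges a worker thread to an `async for` loop. `QueueBridge` is a hypothetical, simplified stand-in for `AsyncCallback`, not the class itself. It differs in one respect that is an assumption of this sketch: the final item and the `finished` flag are set inside a single loop-side callback, so the example does not rely on the `time.sleep` delay for ordering.

```python
import asyncio
import threading


class QueueBridge:
    """Simplified stand-in for AsyncCallback (hypothetical name)."""

    def __init__(self, loop):
        self.loop = loop
        self.queue = asyncio.Queue()
        self.finished = False

    def step(self, item):
        # Called from the worker thread; wakes the event loop to enqueue the item.
        self.loop.call_soon_threadsafe(self.queue.put_nowait, item)

    def finish(self, item):
        # Enqueue the last item and mark completion in one loop-side callback.
        self.loop.call_soon_threadsafe(self._put_and_finish, item)

    def _put_and_finish(self, item):
        self.queue.put_nowait(item)
        self.finished = True

    def __aiter__(self):
        return self

    async def __anext__(self):
        if self.finished and self.queue.empty():
            raise StopAsyncIteration
        return await self.queue.get()


async def collect():
    bridge = QueueBridge(asyncio.get_running_loop())

    def worker():
        for i in range(3):
            bridge.step(i)
        bridge.finish("done")

    threading.Thread(target=worker).start()
    return [item async for item in bridge]


items = asyncio.run(collect())
print(items)
```

Callbacks scheduled with `call_soon_threadsafe` run in the order they were submitted, which is why the consumer sees the items in the order the worker produced them.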


================================================
FILE: babeldoc/babeldoc_exception/BabelDOCException.py
================================================
class ScannedPDFError(Exception):
    def __init__(self, message):
        super().__init__(message)


class ExtractTextError(Exception):
    def __init__(self, message):
        super().__init__(message)


class InputFileGeneratedByBabelDOCError(Exception):
    def __init__(self, message):
        super().__init__(message)


class ContentFilterError(Exception):
    def __init__(self, message):
        super().__init__(message)
        self.message = message


================================================
FILE: babeldoc/babeldoc_exception/__init__.py
================================================


================================================
FILE: babeldoc/const.py
================================================
import itertools
import multiprocessing as mp
import os
import shutil
import subprocess
import threading
from pathlib import Path

__version__ = "0.5.23"

CACHE_FOLDER = Path.home() / ".cache" / "babeldoc"


def get_cache_file_path(filename: str, sub_folder: str | None = None) -> Path:
    if sub_folder is not None:
        sub_folder = sub_folder.strip("/")
        sub_folder_path = CACHE_FOLDER / sub_folder
        sub_folder_path.mkdir(parents=True, exist_ok=True)
        return sub_folder_path / filename
    return CACHE_FOLDER / filename


try:
    git_path = shutil.which("git")
    if git_path is None:
        raise FileNotFoundError("git executable not found")
    two_parent = Path(__file__).resolve().parent.parent
    md_ = two_parent / "docs" / "README.md"
    if two_parent.name == "site-packages" or not md_.exists():
        raise FileNotFoundError("not in git repo")
    WATERMARK_VERSION = (
        subprocess.check_output(  # noqa: S603
            [git_path, "describe", "--always"],
            cwd=Path(__file__).resolve().parent,
        )
        .strip()
        .decode()
    )
except (OSError, FileNotFoundError, subprocess.CalledProcessError):
    WATERMARK_VERSION = f"v{__version__}"

TIKTOKEN_CACHE_FOLDER = CACHE_FOLDER / "tiktoken"
TIKTOKEN_CACHE_FOLDER.mkdir(parents=True, exist_ok=True)
os.environ["TIKTOKEN_CACHE_DIR"] = str(TIKTOKEN_CACHE_FOLDER)


_process_pool = None
_process_pool_lock = threading.Lock()
_ENABLE_PROCESS_POOL = False


def enable_process_pool():
    # Development and Testing ONLY API
    global _ENABLE_PROCESS_POOL
    _ENABLE_PROCESS_POOL = True


# macOS and Windows use spawn mode
# Linux uses forkserver mode


def get_process_pool():
    if not _ENABLE_PROCESS_POOL:
        return None
    global _process_pool
    with _process_pool_lock:
        if _process_pool is None:
            # Create pool only in main process
            if mp.current_process().name != "MainProcess":
                return None

            _process_pool = mp.Pool()
        return _process_pool


def close_process_pool():
    if not _ENABLE_PROCESS_POOL:
        return None
    global _process_pool
    with _process_pool_lock:
        if _process_pool:
            _process_pool.close()
            _process_pool.join()
            _process_pool = None


def batched(iterable, n, *, strict=False):
    # batched('ABCDEFG', 3) → ABC DEF G
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    while batch := tuple(itertools.islice(iterator, n)):
        if strict and len(batch) != n:
            raise ValueError("batched(): incomplete batch")
        yield batch
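
A short usage sketch for `batched` follows. The function body is repeated so the example runs on its own; the sample input is only illustrative. Newer Python versions ship a comparable `itertools.batched`, so this helper can be read as a backport of that behavior.

```python
import itertools


def batched(iterable, n, *, strict=False):
    # Same logic as the helper above, copied here so the example is self-contained.
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    while batch := tuple(itertools.islice(iterator, n)):
        if strict and len(batch) != n:
            raise ValueError("batched(): incomplete batch")
        yield batch


chunks = list(batched("ABCDEFG", 3))
print(chunks)  # the last batch is shorter because 7 is not a multiple of 3

# With strict=True the short trailing batch raises instead of being yielded.
try:
    list(batched("ABCDEFG", 3, strict=True))
    strict_failed = False
except ValueError:
    strict_failed = True
```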


================================================
FILE: babeldoc/docvision/README.md
================================================


================================================
FILE: babeldoc/docvision/__init__.py
================================================


================================================
FILE: babeldoc/docvision/base_doclayout.py
================================================
import abc
import logging
from collections.abc import Generator

import pymupdf

from babeldoc.format.pdf.document_il.il_version_1 import Page

logger = logging.getLogger(__name__)


class YoloResult:
    """Helper class to store the detection boxes for one image, sorted by descending confidence."""

    def __init__(self, names, boxes=None, boxes_data=None):
        if boxes is not None:
            self.boxes = boxes
        else:
            assert boxes_data is not None
            self.boxes = [YoloBox(data=d) for d in boxes_data]
        self.boxes.sort(key=lambda x: x.conf, reverse=True)
        self.names = names


class YoloBox:
    """Helper class to store a single detection: box coordinates (xyxy), confidence, and class index."""

    def __init__(self, data=None, xyxy=None, conf=None, cls=None):
        if data is not None:
            self.xyxy = data[:4]
            self.conf = data[-2]
            self.cls = data[-1]
            return
        assert xyxy is not None and conf is not None and cls is not None
        self.xyxy = xyxy
        self.conf = conf
        self.cls = cls


class DocLayoutModel(abc.ABC):
    @staticmethod
    def load_onnx():
        logger.info("Loading ONNX model...")
        from babeldoc.docvision.doclayout import OnnxModel

        model = OnnxModel.from_pretrained()
        return model

    @staticmethod
    def load_available():
        return DocLayoutModel.load_onnx()

    @property
    @abc.abstractmethod
    def stride(self) -> int:
        """Stride of the model input."""

    @abc.abstractmethod
    def handle_document(
        self,
        pages: list[Page],
        mupdf_doc: pymupdf.Document,
        translate_config,
        save_debug_image,
    ) -> Generator[tuple[Page, YoloResult], None, None]:
        """
        Run layout detection on each page and yield (page, result) pairs.
        """


================================================
FILE: babeldoc/docvision/doclayout.py
================================================
import ast
import logging
import platform
import re
import threading
from collections.abc import Generator

import cv2
import numpy as np

from babeldoc.docvision.base_doclayout import DocLayoutModel
from babeldoc.docvision.base_doclayout import YoloResult
from babeldoc.format.pdf.document_il.utils.mupdf_helper import get_no_rotation_img

try:
    import onnx
    import onnxruntime
except ImportError as e:
    if "DLL load failed" in str(e):
        raise OSError(
            "Microsoft Visual C++ Redistributable is not installed. "
            "Download it at https://aka.ms/vs/17/release/vc_redist.x64.exe"
        ) from e
    raise
import pymupdf

import babeldoc.format.pdf.document_il.il_version_1
from babeldoc.assets.assets import get_doclayout_onnx_model_path

# from huggingface_hub import hf_hub_download

logger = logging.getLogger(__name__)


# Detect the operating system type
os_name = platform.system()


class OnnxModel(DocLayoutModel):
    def __init__(self, model_path: str):
        self.model_path = model_path

        model = onnx.load(model_path)
        metadata = {d.key: d.value for d in model.metadata_props}
        self._stride = ast.literal_eval(metadata["stride"])
        self._names = ast.literal_eval(metadata["names"])
        providers = []

        available_providers = onnxruntime.get_available_providers()
        for provider in available_providers:
            # Only CPU providers are enabled; DirectML and CUDA are skipped
            # because they may encounter problems under special circumstances
            if re.match(r"cpu", provider, re.IGNORECASE):
                logger.info(f"Available Provider: {provider}")
                providers.append(provider)
        self.model = onnxruntime.InferenceSession(
            model.SerializeToString(),
            providers=providers,
        )
        self.lock = threading.Lock()

    @staticmethod
    def from_pretrained():
        pth = get_doclayout_onnx_model_path()
        return OnnxModel(pth)

    @property
    def stride(self):
        return self._stride

    def resize_and_pad_image(self, image, new_shape):
        """
        Resize and pad the image to the specified size, ensuring dimensions are multiples of stride.

        Parameters:
        - image: Input image
        - new_shape: Target size (integer or (height, width) tuple)

        Padding is aligned to self.stride rather than to a separate argument.

        Returns:
        - Processed image
        """
        if isinstance(new_shape, int):
            new_shape = (new_shape, new_shape)

        h, w = image.shape[:2]
        new_h, new_w = new_shape

        # Calculate scaling ratio
        r = min(new_h / h, new_w / w)
        resized_h, resized_w = int(round(h * r)), int(round(w * r))

        # Resize image
        image = cv2.resize(
            image,
            (resized_w, resized_h),
            interpolation=cv2.INTER_LINEAR,
        )

        # Calculate padding size and align to stride multiple
        pad_w = (new_w - resized_w) % self.stride
        pad_h = (new_h - resized_h) % self.stride
        top, bottom = pad_h // 2, pad_h - pad_h // 2
        left, right = pad_w // 2, pad_w - pad_w // 2

        # Add padding
        image = cv2.copyMakeBorder(
            image,
            top,
            bottom,
            left,
            right,
            cv2.BORDER_CONSTANT,
            value=(114, 114, 114),
        )

        return image

    def scale_boxes(self, img1_shape, boxes, img0_shape):
        """
        Rescales bounding boxes (in the format of xyxy by default) from the shape of the image they were originally
        specified in (img1_shape) to the shape of a different image (img0_shape).

        Args:
            img1_shape (tuple): The shape of the image that the bounding boxes are for,
                in the format of (height, width).
            boxes (np.ndarray): the bounding boxes of the objects in the image, in the format of (x1, y1, x2, y2)
            img0_shape (tuple): the shape of the target image, in the format of (height, width).

        Returns:
            boxes (np.ndarray): The scaled bounding boxes, in the format of (x1, y1, x2, y2)
        """

        # Calculate scaling ratio
        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])

        # Calculate padding size
        pad_x = round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1)
        pad_y = round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)

        # Remove padding and scale boxes
        boxes[..., :4] = (boxes[..., :4] - [pad_x, pad_y, pad_x, pad_y]) / gain
        return boxes

    def predict(self, image, imgsz=800, batch_size=16, **kwargs):
        """
        Predict the layout of document pages.

        Args:
            image: A single image or a list of images of document pages.
            imgsz: Currently ignored; every image is resized to a fixed target size of 1024.
            batch_size: Currently ignored; images are processed one at a time.
            **kwargs: Additional arguments.

        Returns:
            A list of YoloResult objects, one for each input image.
        """
        # Handle single image input
        if isinstance(image, np.ndarray) and len(image.shape) == 3:
            image = [image]

        total_images = len(image)
        results = []
        batch_size = 1

        # Process images in batches
        for i in range(0, total_images, batch_size):
            batch_images = image[i : i + batch_size]
            batch_size_actual = len(batch_images)

            # The target size is fixed at 1024; max_height is computed but not used
            max_height = max(img.shape[0] for img in batch_images)
            target_imgsz = 1024

            # Preprocess batch
            processed_batch = []
            orig_shapes = []
            for img in batch_images:
                orig_h, orig_w = img.shape[:2]
                orig_shapes.append((orig_h, orig_w))

                pix = self.resize_and_pad_image(img, new_shape=target_imgsz)
                pix = np.transpose(pix, (2, 0, 1))  # CHW
                pix = pix.astype(np.float32) / 255.0  # Normalize to [0, 1]
                processed_batch.append(pix)

            # Stack batch
            batch_input = np.stack(processed_batch, axis=0)  # BCHW
            new_h, new_w = batch_input.shape[2:]

            # Run inference
            batch_preds = self.model.run(None, {"images": batch_input})[0]

            # Process each prediction in the batch
            for j in range(batch_size_actual):
                preds = batch_preds[j]
                preds = preds[preds[..., 4] > 0.25]
                if len(preds) > 0:
                    preds[..., :4] = self.scale_boxes(
                        (new_h, new_w),
                        preds[..., :4],
                        orig_shapes[j],
                    )
                results.append(YoloResult(boxes_data=preds, names=self._names))

        return results

    def handle_document(
        self,
        pages: list[babeldoc.format.pdf.document_il.il_version_1.Page],
        mupdf_doc: pymupdf.Document,
        translate_config,
        save_debug_image,
    ) -> Generator[
        tuple[babeldoc.format.pdf.document_il.il_version_1.Page, YoloResult], None, None
    ]:
        for page in pages:
            translate_config.raise_if_cancelled()
            with self.lock:
                # pix = mupdf_doc[page.page_number].get_pixmap(dpi=72)
                pix = get_no_rotation_img(mupdf_doc[page.page_number])
            image = np.frombuffer(pix.samples, np.uint8).reshape(
                pix.height,
                pix.width,
                3,
            )[:, :, ::-1]
            predict_result = self.predict(image)[0]
            save_debug_image(
                image,
                predict_result,
                page.page_number + 1,
            )
            yield page, predict_result
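
The geometry behind `resize_and_pad_image` and `scale_boxes` can be checked with plain arithmetic. The sketch below restates both calculations without OpenCV or NumPy. The helper names `letterbox_geometry` and `unletterbox_box` are hypothetical, and the 792 x 612 page size is just an example input.

```python
def letterbox_geometry(orig_hw, target, stride):
    """Return (resized_hw, padded_hw) for the resize-then-pad rule used above."""
    h, w = orig_hw
    r = min(target / h, target / w)
    resized_h, resized_w = int(round(h * r)), int(round(w * r))
    # Pad only as far as needed to stay aligned to the stride.
    pad_h = (target - resized_h) % stride
    pad_w = (target - resized_w) % stride
    return (resized_h, resized_w), (resized_h + pad_h, resized_w + pad_w)


def unletterbox_box(padded_hw, box, orig_hw):
    """Map an (x1, y1, x2, y2) box from the padded image back to the original."""
    gain = min(padded_hw[0] / orig_hw[0], padded_hw[1] / orig_hw[1])
    pad_x = round((padded_hw[1] - orig_hw[1] * gain) / 2 - 0.1)
    pad_y = round((padded_hw[0] - orig_hw[0] * gain) / 2 - 0.1)
    x1, y1, x2, y2 = box
    return (
        (x1 - pad_x) / gain,
        (y1 - pad_y) / gain,
        (x2 - pad_x) / gain,
        (y2 - pad_y) / gain,
    )


orig = (792, 612)  # (height, width) of an example page image
resized, padded = letterbox_geometry(orig, target=1024, stride=32)
# A box covering the whole resized page inside the padded image.
left = (padded[1] - resized[1]) // 2
full_page = (left, 0, left + resized[1], resized[0])
restored = unletterbox_box(padded, full_page, orig)
print(resized, padded, restored)
```

Mapping the full-page box back lands within a pixel of the original page bounds; the small residual comes from rounding the resized width to an integer.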


================================================
FILE: babeldoc/docvision/rpc_doclayout.py
================================================
import logging
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import cv2
import httpx
import msgpack
import numpy as np
import pymupdf
from tenacity import retry
from tenacity import retry_if_exception_type
from tenacity import stop_after_attempt
from tenacity import wait_exponential

import babeldoc
from babeldoc.docvision.base_doclayout import DocLayoutModel
from babeldoc.docvision.base_doclayout import YoloBox
from babeldoc.docvision.base_doclayout import YoloResult
from babeldoc.format.pdf.document_il.utils.mupdf_helper import get_no_rotation_img

logger = logging.getLogger(__name__)


def encode_image(image) -> bytes:
    """Read and encode image to bytes

    Args:
        image: Can be either a file path (str) or numpy array
    """
    if isinstance(image, str):
        if not Path(image).exists():
            raise FileNotFoundError(f"Image file not found: {image}")
        img = cv2.imread(image)
        if img is None:
            raise ValueError(f"Failed to read image: {image}")
    else:
        img = image

    # logger.debug(f"Image shape: {img.shape}")
    img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)

    encoded = cv2.imencode(".jpg", img)[1].tobytes()
    # logger.debug(f"Encoded image size: {len(encoded)} bytes")
    return encoded


@retry(
    stop=stop_after_attempt(3),  # Stop after 3 attempts in total
    wait=wait_exponential(
        multiplier=1, min=1, max=10
    ),  # Exponential backoff, starting at 1 second and capped at 10 seconds
    retry=retry_if_exception_type((httpx.HTTPError, Exception)),  # Exception types that trigger a retry
    before_sleep=lambda retry_state: logger.warning(
        f"Request failed, retrying in {retry_state.next_action.sleep} seconds... "
        f"(Attempt {retry_state.attempt_number}/3)"
    ),
)
def predict_layout(
    image,
    host: str = "http://localhost:8000",
    imgsz: int = 1024,
):
    """
    Predict document layout using the MOSEC service

    Args:
        image: Can be either a file path (str) or numpy array
        host: Service host URL
        imgsz: Image size for model input

    Returns:
        List of predictions containing bounding boxes and classes
    """
    # Prepare request data
    if not isinstance(image, list):
        image = [image]
    image_data = [encode_image(img) for img in image]
    data = {
        "image": image_data,
        "imgsz": imgsz,
    }

    # Pack data using msgpack
    packed_data = msgpack.packb(data, use_bin_type=True)
    # logger.debug(f"Packed data size: {len(packed_data)} bytes")

    # Send request
    # logger.debug(f"Sending request to {host}/inference")
    response = httpx.post(
        f"{host}/inference",
        data=packed_data,
        headers={
            "Content-Type": "application/msgpack",
            "Accept": "application/msgpack",
        },
        timeout=300,
        follow_redirects=True,
    )

    # logger.debug(f"Response status: {response.status_code}")
    # logger.debug(f"Response headers: {response.headers}")

    if response.status_code == 200:
        try:
            result = msgpack.unpackb(response.content, raw=False)
            return result
        except Exception as e:
            logger.exception(f"Failed to unpack response: {e!s}")
            raise
    else:
        logger.error(f"Request failed with status {response.status_code}")
        logger.error(f"Response content: {response.content}")
        raise Exception(
            f"Request failed with status {response.status_code}: {response.text}",
        )


class ResultContainer:
    def __init__(self):
        self.result = YoloResult(boxes_data=np.array([]), names=[])


class RpcDocLayoutModel(DocLayoutModel):
    """DocLayoutModel implementation that uses RPC service."""

    def __init__(self, host: str = "http://localhost:8000"):
        """Initialize RPC model with host address."""
        self.host = host
        self._stride = 32  # Default stride value
        self._names = ["text", "title", "list", "table", "figure"]
        self.lock = threading.Lock()

    @property
    def stride(self) -> int:
        """Stride of the model input."""
        return self._stride

    def resize_and_pad_image(self, image, new_shape):
        """
        Resize the image to fit inside the target size while preserving its
        aspect ratio, then pad it with gray (114, 114, 114) to exactly that size.

        Parameters:
        - image: Input image
        - new_shape: Target size (integer or (height, width) tuple)

        Returns:
        - Processed image
        """
        if isinstance(new_shape, int):
            new_shape = (new_shape, new_shape)

        h, w = image.shape[:2]
        new_h, new_w = new_shape

        # Calculate scaling ratio
        r = min(new_h / h, new_w / w)
        resized_h, resized_w = int(round(h * r)), int(round(w * r))

        # Resize image
        image = cv2.resize(
            image, (resized_w, resized_h), interpolation=cv2.INTER_LINEAR
        )

        # Calculate padding size
        pad_h = new_h - resized_h
        pad_w = new_w - resized_w
        top, bottom = pad_h // 2, pad_h - pad_h // 2
        left, right = pad_w // 2, pad_w - pad_w // 2

        # Add padding
        image = cv2.copyMakeBorder(
            image, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114, 114, 114)
        )

        return image

    def scale_boxes(self, img1_shape, boxes, img0_shape):
        """
        Rescales bounding boxes (in the format of xyxy by default) from the shape of the image they were originally
        specified in (img1_shape) to the shape of a different image (img0_shape).

        Args:
            img1_shape (tuple): The shape of the image that the bounding boxes are for,
                in the format of (height, width).
            boxes (np.ndarray): the bounding boxes of the objects in the image, in the format of (x1, y1, x2, y2)
            img0_shape (tuple): the shape of the target image, in the format of (height, width).

        Returns:
            boxes (np.ndarray): The scaled bounding boxes, in the format of (x1, y1, x2, y2)
        """

        # Calculate scaling ratio
        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])

        # Calculate padding size
        pad_x = round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1)
        pad_y = round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)

        # Remove padding and scale boxes
        boxes = (boxes - [pad_x, pad_y, pad_x, pad_y]) / gain
        return boxes

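    def predict_image(
        self,
        image,
        host: str | None = None,
        result_container: ResultContainer | None = None,
        imgsz: int = 1024,
    ) -> YoloResult:
        """Predict the layout of a single page image using the RPC service.

        `host` and `imgsz` are accepted but currently ignored: the request
        always goes to `self.host` with the image letterboxed to 800x800.
        """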
        if result_container is None:
            result_container = ResultContainer()
        target_imgsz = (800, 800)
        orig_h, orig_w = image.shape[:2]
        if image.shape[0] != target_imgsz[0] or image.shape[1] != target_imgsz[1]:
            image = self.resize_and_pad_image(image, new_shape=target_imgsz)
        preds = predict_layout([image], host=self.host, imgsz=800)

        if len(preds) > 0:
            for pred in preds:
                boxes = [
                    YoloBox(
                        None,
                        self.scale_boxes(
                            (800, 800), np.array(x["xyxy"]), (orig_h, orig_w)
                        ),
                        np.array(x["conf"]),
                        x["cls"],
                    )
                    for x in pred["boxes"]
                ]
                result_container.result = YoloResult(
                    boxes=boxes,
                    names={int(k): v for k, v in pred["names"].items()},
                )
        return result_container.result

    def predict(self, image, imgsz=1024, **kwargs) -> list[YoloResult]:
        """Predict the layout of document pages using RPC service."""
        # Handle single image input
        if isinstance(image, np.ndarray) and len(image.shape) == 3:
            image = [image]

        # ThreadPoolExecutor rejects max_workers=0, so return early on empty input
        if len(image) == 0:
            return []

        result_containers = [ResultContainer() for _ in image]
        # One request per image; leaving the `with` block waits for all of them
        with ThreadPoolExecutor(max_workers=len(image)) as predict_thread:
            for img, result_container in zip(image, result_containers, strict=True):
                predict_thread.submit(
                    self.predict_image, img, self.host, result_container, 800
                )
        return [result_container.result for result_container in result_containers]

    def predict_page(
        self, page, mupdf_doc: pymupdf.Document, translate_config, save_debug_image
    ):
        translate_config.raise_if_cancelled()
        with self.lock:
            # pix = mupdf_doc[page.page_number].get_pixmap(dpi=72)
            pix = get_no_rotation_img(mupdf_doc[page.page_number])
        image = np.frombuffer(pix.samples, np.uint8).reshape(
            pix.height,
            pix.width,
            3,
        )[:, :, ::-1]
        predict_result = self.predict_image(image, self.host, None, 800)
        save_debug_image(image, predict_result, page.page_number + 1)
        return page, predict_result

    def handle_document(
        self,
        pages: list[babeldoc.format.pdf.document_il.il_version_1.Page],
        mupdf_doc: pymupdf.Document,
        translate_config,
        save_debug_image,
    ):
        with ThreadPoolExecutor(max_workers=16) as executor:
            yield from executor.map(
                self.predict_page,
                pages,
                (mupdf_doc for _ in range(len(pages))),
                (translate_config for _ in range(len(pages))),
                (save_debug_image for _ in range(len(pages))),
            )

    @staticmethod
    def from_host(host: str) -> "RpcDocLayoutModel":
        """Create RpcDocLayoutModel from host address."""
        return RpcDocLayoutModel(host=host)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    # Test the service
    try:
        # Default to example/1.png; prompt for a path if it doesn't exist
        image_path = "example/1.png"
        if not Path(image_path).exists():
            print(f"Warning: {image_path} not found.")
            print("Please provide the path to a test image:")
            image_path = input("> ")

        logger.info(f"Processing image: {image_path}")
        result = predict_layout(image_path)
        print("Prediction results:")
        print(result)
    except Exception as e:
        print(f"Error: {e!s}")
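

The coordinate handling in `resize_and_pad_image` and `scale_boxes` above can be checked with a small round trip. The sketch below is illustrative only and is not part of the repository: `to_letterbox` and `scale_box` are hypothetical free functions that repeat the same arithmetic on plain Python lists, assuming a 1000x500 page letterboxed to 800x800.

```python
def to_letterbox(box, orig_shape, target_shape):
    """Map an xyxy box from the original image into the padded target image.

    Hypothetical helper: applies the resize/pad arithmetic of
    resize_and_pad_image to coordinates instead of pixels.
    """
    h, w = orig_shape
    new_h, new_w = target_shape
    r = min(new_h / h, new_w / w)  # scaling ratio
    resized_h, resized_w = int(round(h * r)), int(round(w * r))
    top = (new_h - resized_h) // 2  # padding above the resized image
    left = (new_w - resized_w) // 2  # padding left of the resized image
    offsets = (left, top, left, top)
    return [c * r + o for c, o in zip(box, offsets)]


def scale_box(img1_shape, box, img0_shape):
    """Same arithmetic as scale_boxes, written for a single box as a list."""
    gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])
    pad_x = round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1)
    pad_y = round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)
    pads = (pad_x, pad_y, pad_x, pad_y)
    return [(c - p) / gain for c, p in zip(box, pads)]


orig_shape = (1000, 500)  # (height, width) of the rendered page
target_shape = (800, 800)  # size the page is letterboxed to
box = [100, 200, 300, 400]  # xyxy in original page pixels

padded_box = to_letterbox(box, orig_shape, target_shape)
restored_box = scale_box(target_shape, padded_box, orig_shape)
```

With these inputs the scaling ratio is 0.8 and the horizontal padding is 200 pixels per side, so the box moves to roughly (280, 160, 440, 320) in the padded image and maps back to the original coordinates.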


================================================
FILE: babeldoc/docvision/rpc_doclayout2.py
================================================
import logging
import threading
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import cv2
import httpx
import msgpack
import numpy as np
import pymupdf
from tenacity import retry
from tenacity import retry_if_exception_type
from tenacity import stop_after_attempt
from tenacity import wait_exponential

import babeldoc
from babeldoc.docvision.base_doclayout import DocLayoutModel
from babeldoc.docvision.base_doclayout import YoloBox
from babeldoc.docvision.base_doclayout import YoloResult
from babeldoc.format.pdf.document_il.utils.mupdf_helper import get_no_rotation_img

logger = logging.getLogger(__name__)
DPI = 150


def encode_image(image) -> bytes:
    """Read and encode image to bytes

    Args:
        image: Can be either a file path (str) or numpy array
    """
    if isinstance(image, str):
        if not Path(image).exists():
            raise FileNotFoundError(f"Image file not found: {image}")
        img = cv2.imread(image)
        if img is None:
            raise ValueError(f"Failed to read image: {image}")
    else:
        img = image

    img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
    # logger.debug(f"Image shape: {img.shape}")
    encoded = cv2.imencode(".jpg", img)[1].tobytes()
    # logger.debug(f"Encoded image size: {len(encoded)} bytes")
    return encoded


@retry(
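    stop=stop_after_attempt(3),  # at most 3 attempts in total
    wait=wait_exponential(
        multiplier=1, min=1, max=10
    ),  # exponential backoff, starting at 1 second and capped at 10 seconds
    retry=retry_if_exception_type((httpx.HTTPError, Exception)),  # exception types that trigger a retry (Exception already covers httpx.HTTPError)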
    before_sleep=lambda retry_state: logger.warning(
        f"Request failed, retrying in {getattr(retry_state.next_action, 'sleep', 'unknown')} seconds... "
        f"(Attempt {retry_state.attempt_number}/3)"
    ),
)
def predict_layout(
    image,
    host: str = "http://localhost:8000",
    _imgsz: int = 1024,
):
    """
    Predict document layout using the MOSEC service

    Args:
        image: A file path (str), a numpy array, or a list of either
        host: Service host URL
        _imgsz: Unused by this client; the request carries only the images

    Returns:
        List of predictions containing bounding boxes and classes
    """
    # Prepare request data

    if not isinstance(image, list):
        image = [image]
    image_data = [encode_image(image) for image in image]
    data = {
        "image": image_data,
    }

    # Pack data using msgpack
    packed_data = msgpack.packb(data, use_bin_type=True)
    # logger.debug(f"Packed data size: {len(packed_data)} bytes")

    # Send request
    # logger.debug(f"Sending request to {host}/inference")
    response = httpx.post(
        # f"{host}/analyze?min_sim=0.7&early_stop=0.99&timeout=480",
        f"{host}/inference",
        data=packed_data,
        headers={
            "Content-Type": "application/msgpack",
            "Accept": "application/msgpack",
        },
        timeout=480,
        follow_redirects=True,
    )

    # logger.debug(f"Response status: {response.status_code}")
    # logger.debug(f"Response headers: {response.headers}")
    idx = 0
    id_lookup = {}
    if response.status_code == 200:
        try:
            result = msgpack.unpackb(response.content, raw=False)
            useful_result = []
            if isinstance(result, dict):
                names = {}
                for box in result["boxes"]:
                    # Drop low-confidence detections
                    if box["score"] < 0.7:
                        continue

                    # Mirror the service's field names onto the keys used downstream
                    box["xyxy"] = box["coordinate"]
                    box["conf"] = box["score"]
                    # Assign one integer class id per distinct label. `names` is
                    # keyed by id, so membership is checked against `id_lookup`.
                    if box["label"] not in id_lookup:
                        idx += 1
                        id_lookup[box["label"]] = idx
                    box["cls_id"] = id_lookup[box["label"]]
                    names[box["cls_id"]] = box["label"]
                    box["cls"] = box["cls_id"]
                    useful_result.append(box)
                if "names" not in result:
                    result["names"] = names
                result["boxes"] = useful_result
                # Wrap in a list so callers can iterate over per-image results
                result = [result]
            return result
        except Exception as e:
            logger.exception(f"Failed to unpack response: {e!s}")
            raise
    else:
        logger.error(f"Request failed with status {response.status_code}")
        logger.error(f"Response content: {response.content}")
        raise Exception(
            f"Request failed with status {response.status_code}: {response.text}",
        )


class ResultContainer:
    def __init__(self):
        self.result = YoloResult(boxes_data=np.array([]), names=[])


class RpcDocLayoutModel(DocLayoutModel):
    """DocLayoutModel implementation that uses RPC service."""

    def __init__(self, host: str = "http://localhost:8000"):
        """Initialize RPC model with host address."""
        self.host = host
        self._stride = 32  # Default stride value
        self._names = ["text", "title", "list", "table", "figure"]
        self.lock = threading.Lock()

    @property
    def stride(self) -> int:
        """Stride of the model input."""
        return self._stride

    def resize_and_pad_image(self, image, new_shape):
        """
        Resize the image to fit inside the target size while preserving its
        aspect ratio, then pad it with gray (114, 114, 114) to exactly that size.

        Parameters:
        - image: Input image
        - new_shape: Target size (integer or (height, width) tuple)

        Returns:
        - Processed image
        """
        if isinstance(new_shape, int):
            new_shape = (new_shape, new_shape)

        h, w = image.shape[:2]
        new_h, new_w = new_shape

        # Calculate scaling ratio
        r = min(new_h / h, new_w / w)
        resized_h, resized_w = int(round(h * r)), int(round(w * r))

        # Resize image
        image = cv2.resize(
            image, (resized_w, resized_h), interpolation=cv2.INTER_LINEAR
        )

        # Calculate padding size
        pad_h = new_h - resized_h
        pad_w = new_w - resized_w
        top, bottom = pad_h // 2, pad_h - pad_h // 2
        left, right = pad_w // 2, pad_w - pad_w // 2

        # Add padding
        image = cv2.copyMakeBorder(
            image, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114, 114, 114)
        )

        return image

    def scale_boxes(self, img1_shape, boxes, img0_shape):
        """
        Rescales bounding boxes (in the format of xyxy by default) from the shape of the image they were originally
        specified in (img1_shape) to the shape of a different image (img0_shape).

        Args:
            img1_shape (tupl

SYMBOL INDEX (1723 symbols across 93 files)

FILE: babeldoc/assets/assets.py
  class ResultContainer (line 38) | class ResultContainer:
    method __init__ (line 39) | def __init__(self):
    method set_result (line 42) | def set_result(self, result):
  function run_in_another_thread (line 46) | def run_in_another_thread(coro):
  function run_coro (line 58) | def run_coro(coro):
  function _retry_if_not_cancelled_and_failed (line 62) | def _retry_if_not_cancelled_and_failed(retry_state):
  function verify_file (line 80) | def verify_file(path: Path, sha3_256: str):
  function download_file (line 102) | async def download_file(
  function get_font_metadata (line 131) | async def get_font_metadata(
  function _get_fastest_upstream_for_font_internal (line 151) | async def _get_fastest_upstream_for_font_internal(
  function get_fastest_upstream_for_font (line 173) | async def get_fastest_upstream_for_font(
  function get_fastest_upstream_for_model (line 201) | async def get_fastest_upstream_for_model(client: httpx.AsyncClient | Non...
  function get_fastest_upstream (line 205) | async def get_fastest_upstream(client: httpx.AsyncClient | None = None):
  function get_doclayout_onnx_model_path_async (line 226) | async def get_doclayout_onnx_model_path_async(client: httpx.AsyncClient ...
  function get_table_detection_rapidocr_model_path_async (line 248) | async def get_table_detection_rapidocr_model_path_async(
  function get_doclayout_onnx_model_path (line 270) | def get_doclayout_onnx_model_path():
  function get_table_detection_rapidocr_model_path (line 274) | def get_table_detection_rapidocr_model_path():
  function get_font_url_by_name_and_upstream (line 278) | def get_font_url_by_name_and_upstream(font_file_name: str, upstream: str):
  function get_font_and_metadata_async (line 286) | async def get_font_and_metadata_async(
  function get_font_and_metadata (line 325) | def get_font_and_metadata(font_file_name: str):
  function get_cmap_file_path_async (line 329) | async def get_cmap_file_path_async(
  function download_cmap_file_async (line 355) | async def download_cmap_file_async(
  function get_cmap_data_async (line 379) | async def get_cmap_data_async(
  function get_cmap_file_path (line 387) | def get_cmap_file_path(name: str):
  function get_cmap_data (line 391) | def get_cmap_data(name: str):
  function get_font_family (line 395) | def get_font_family(lang_code: str):
  function download_all_fonts_async (line 400) | async def download_all_fonts_async(client: httpx.AsyncClient | None = No...
  function download_all_cmaps_async (line 428) | async def download_all_cmaps_async(client: httpx.AsyncClient | None = No...
  function async_warmup (line 453) | async def async_warmup():
  function warmup (line 468) | def warmup():
  function generate_all_assets_file_list (line 472) | def generate_all_assets_file_list():
  function generate_offline_assets_package_async (line 514) | async def generate_offline_assets_package_async(output_directory: Path |...
  function restore_offline_assets_package_async (line 544) | async def restore_offline_assets_package_async(input_path: Path | None =...
  function get_offline_assets_tag (line 590) | def get_offline_assets_tag(file_list: dict | None = None):
  function generate_offline_assets_package (line 607) | def generate_offline_assets_package(output_directory: Path | None = None):
  function restore_offline_assets_package (line 611) | def restore_offline_assets_package(input_path: Path | None = None):

FILE: babeldoc/assets/embedding_assets_metadata.py
  function __add_fallback_to_font_family (line 1395) | def __add_fallback_to_font_family():
  function __cleanup_unused_font_metadata (line 1410) | def __cleanup_unused_font_metadata():
  function get_font_family (line 1427) | def get_font_family(lang_code: str):
  function verify_font_family (line 1447) | def verify_font_family(font_family: str | dict):

FILE: babeldoc/asynchronize/__init__.py
  class Args (line 5) | class Args:
    method __init__ (line 6) | def __init__(self, args, kwargs):
  class AsyncCallback (line 11) | class AsyncCallback:
    method __init__ (line 12) | def __init__(self):
    method step_callback (line 17) | def step_callback(self, *args, **kwargs):
    method finished_callback (line 28) | def finished_callback(self, *args, **kwargs):
    method __await__ (line 36) | def __await__(self):
    method __aiter__ (line 40) | def __aiter__(self):
    method __anext__ (line 44) | async def __anext__(self):

FILE: babeldoc/babeldoc_exception/BabelDOCException.py
  class ScannedPDFError (line 1) | class ScannedPDFError(Exception):
    method __init__ (line 2) | def __init__(self, message):
  class ExtractTextError (line 6) | class ExtractTextError(Exception):
    method __init__ (line 7) | def __init__(self, message):
  class InputFileGeneratedByBabelDOCError (line 11) | class InputFileGeneratedByBabelDOCError(Exception):
    method __init__ (line 12) | def __init__(self, message):
  class ContentFilterError (line 16) | class ContentFilterError(Exception):
    method __init__ (line 17) | def __init__(self, message):

FILE: babeldoc/const.py
  function get_cache_file_path (line 14) | def get_cache_file_path(filename: str, sub_folder: str | None = None) ->...
  function enable_process_pool (line 52) | def enable_process_pool():
  function get_process_pool (line 62) | def get_process_pool():
  function close_process_pool (line 76) | def close_process_pool():
  function batched (line 87) | def batched(iterable, n, *, strict=False):

FILE: babeldoc/docvision/base_doclayout.py
  class YoloResult (line 12) | class YoloResult:
    method __init__ (line 15) | def __init__(self, names, boxes=None, boxes_data=None):
  class YoloBox (line 25) | class YoloBox:
    method __init__ (line 28) | def __init__(self, data=None, xyxy=None, conf=None, cls=None):
  class DocLayoutModel (line 40) | class DocLayoutModel(abc.ABC):
    method load_onnx (line 42) | def load_onnx():
    method load_available (line 50) | def load_available():
    method stride (line 55) | def stride(self) -> int:
    method handle_document (line 59) | def handle_document(

FILE: babeldoc/docvision/doclayout.py
  class OnnxModel (line 39) | class OnnxModel(DocLayoutModel):
    method __init__ (line 40) | def __init__(self, model_path: str):
    method from_pretrained (line 63) | def from_pretrained():
    method stride (line 68) | def stride(self):
    method resize_and_pad_image (line 71) | def resize_and_pad_image(self, image, new_shape):
    method scale_boxes (line 119) | def scale_boxes(self, img1_shape, boxes, img0_shape):
    method predict (line 145) | def predict(self, image, imgsz=800, batch_size=16, **kwargs):
    method handle_document (line 208) | def handle_document(

FILE: babeldoc/docvision/rpc_doclayout.py
  function encode_image (line 25) | def encode_image(image) -> bytes:
  function predict_layout (line 59) | def predict_layout(
  class ResultContainer (line 119) | class ResultContainer:
    method __init__ (line 120) | def __init__(self):
  class RpcDocLayoutModel (line 124) | class RpcDocLayoutModel(DocLayoutModel):
    method __init__ (line 127) | def __init__(self, host: str = "http://localhost:8000"):
    method stride (line 135) | def stride(self) -> int:
    method resize_and_pad_image (line 139) | def resize_and_pad_image(self, image, new_shape):
    method scale_boxes (line 180) | def scale_boxes(self, img1_shape, boxes, img0_shape):
    method predict_image (line 206) | def predict_image(
    method predict (line 241) | def predict(self, image, imgsz=1024, **kwargs) -> list[YoloResult]:
    method predict_page (line 257) | def predict_page(
    method handle_document (line 273) | def handle_document(
    method from_host (line 290) | def from_host(host: str) -> "RpcDocLayoutModel":

FILE: babeldoc/docvision/rpc_doclayout2.py
  function encode_image (line 26) | def encode_image(image) -> bytes:
  function predict_layout (line 59) | def predict_layout(
  class ResultContainer (line 144) | class ResultContainer:
    method __init__ (line 145) | def __init__(self):
  class RpcDocLayoutModel (line 149) | class RpcDocLayoutModel(DocLayoutModel):
    method __init__ (line 152) | def __init__(self, host: str = "http://localhost:8000"):
    method stride (line 160) | def stride(self) -> int:
    method resize_and_pad_image (line 164) | def resize_and_pad_image(self, image, new_shape):
    method scale_boxes (line 205) | def scale_boxes(self, img1_shape, boxes, img0_shape):
    method predict_image (line 231) | def predict_image(
    method predict (line 267) | def predict(self, image, imgsz=1024, **kwargs) -> list[YoloResult]:
    method predict_page (line 283) | def predict_page(
    method handle_document (line 299) | def handle_document(
    method from_host (line 316) | def from_host(host: str) -> "RpcDocLayoutModel":

FILE: babeldoc/docvision/rpc_doclayout3.py
  function encode_image (line 26) | def encode_image(image) -> bytes:
  function predict_layout (line 59) | def predict_layout(
  class ResultContainer (line 137) | class ResultContainer:
    method __init__ (line 138) | def __init__(self):
  class RpcDocLayoutModel (line 142) | class RpcDocLayoutModel(DocLayoutModel):
    method __init__ (line 145) | def __init__(self, host: str = "http://localhost:8000"):
    method stride (line 153) | def stride(self) -> int:
    method resize_and_pad_image (line 157) | def resize_and_pad_image(self, image, new_shape):
    method scale_boxes (line 198) | def scale_boxes(self, img1_shape, boxes, img0_shape):
    method predict_image (line 224) | def predict_image(
    method predict (line 260) | def predict(self, image, imgsz=1024, **kwargs) -> list[YoloResult]:
    method predict_page (line 276) | def predict_page(
    method handle_document (line 292) | def handle_document(
    method from_host (line 309) | def from_host(host: str) -> "RpcDocLayoutModel":

FILE: babeldoc/docvision/rpc_doclayout4.py
  function encode_image (line 26) | def encode_image(image) -> bytes:
  function predict_layout (line 59) | def predict_layout(
  class ResultContainer (line 144) | class ResultContainer:
    method __init__ (line 145) | def __init__(self):
  class RpcDocLayoutModel (line 149) | class RpcDocLayoutModel(DocLayoutModel):
    method __init__ (line 152) | def __init__(self, host: str = "http://localhost:8000"):
    method stride (line 160) | def stride(self) -> int:
    method resize_and_pad_image (line 164) | def resize_and_pad_image(self, image, new_shape):
    method scale_boxes (line 205) | def scale_boxes(self, img1_shape, boxes, img0_shape):
    method predict_image (line 231) | def predict_image(
    method predict (line 267) | def predict(self, image, imgsz=1024, **kwargs) -> list[YoloResult]:
    method predict_page (line 283) | def predict_page(
    method handle_document (line 299) | def handle_document(
    method from_host (line 316) | def from_host(host: str) -> "RpcDocLayoutModel":

FILE: babeldoc/docvision/rpc_doclayout5.py
  function encode_image (line 26) | def encode_image(image) -> bytes:
  function predict_layout (line 59) | def predict_layout(
  class ResultContainer (line 135) | class ResultContainer:
    method __init__ (line 136) | def __init__(self):
  class RpcDocLayoutModel (line 140) | class RpcDocLayoutModel(DocLayoutModel):
    method __init__ (line 143) | def __init__(self, host: str = "http://localhost:8000"):
    method stride (line 151) | def stride(self) -> int:
    method resize_and_pad_image (line 155) | def resize_and_pad_image(self, image, new_shape):
    method scale_boxes (line 196) | def scale_boxes(self, img1_shape, boxes, img0_shape):
    method predict_image (line 222) | def predict_image(
    method predict (line 258) | def predict(self, image, imgsz=1024, **kwargs) -> list[YoloResult]:
    method predict_page (line 274) | def predict_page(
    method handle_document (line 290) | def handle_document(
    method from_host (line 307) | def from_host(host: str) -> "RpcDocLayoutModel":

FILE: babeldoc/docvision/rpc_doclayout6.py
  function encode_image (line 39) | def encode_image(image) -> bytes:
  function clip_num (line 61) | def clip_num(num: float, min_value: float, max_value: float) -> float:
  function predict_layout (line 81) | def predict_layout(
  function predict_layout2 (line 199) | def predict_layout2(
  class ResultContainer (line 284) | class ResultContainer:
    method __init__ (line 285) | def __init__(self):
  function filter_text (line 289) | def filter_text(txt: str, font_mapper: FontMapper):
  class RpcDocLayoutModel (line 300) | class RpcDocLayoutModel(DocLayoutModel):
    method __init__ (line 303) | def __init__(self, host: str = "http://localhost:8000;http://localhost...
    method init_font_mapper (line 324) | def init_font_mapper(self, translation_config):
    method stride (line 328) | def stride(self) -> int:
    method resize_and_pad_image (line 332) | def resize_and_pad_image(self, image, new_shape):
    method scale_boxes (line 373) | def scale_boxes(self, img1_shape, boxes, img0_shape):
    method calculate_iou (line 399) | def calculate_iou(self, box1, box2):
    method is_subset (line 422) | def is_subset(self, inner_box, outer_box):
    method expand_box_to_contain (line 434) | def expand_box_to_contain(self, box_to_expand, box_to_contain):
    method post_process_boxes (line 446) | def post_process_boxes(self, merged_boxes: list[YoloBox], names: dict[...
    method predict_image (line 477) | def predict_image(
    method predict (line 560) | def predict(self, image, imgsz=1024, **kwargs) -> list[YoloResult]:  #...
    method predict_page (line 574) | def predict_page(self, page, pdf_bytes: Path, translate_config, save_d...
    method handle_document (line 593) | def handle_document(  # type: ignore[override]
    method from_host (line 612) | def from_host(host: str) -> "RpcDocLayoutModel":

FILE: babeldoc/docvision/rpc_doclayout7.py
  function encode_image (line 34) | def encode_image(image) -> bytes:
  function predict_layout (line 67) | def predict_layout(
  class ResultContainer (line 171) | class ResultContainer:
    method __init__ (line 172) | def __init__(self):
  class RpcDocLayoutModel (line 176) | class RpcDocLayoutModel(DocLayoutModel):
    method __init__ (line 179) | def __init__(self, host: str = "http://localhost:8000"):
    method stride (line 187) | def stride(self) -> int:
    method resize_and_pad_image (line 191) | def resize_and_pad_image(self, image, new_shape):
    method scale_boxes (line 232) | def scale_boxes(self, img1_shape, boxes, img0_shape):
    method predict_image (line 258) | def predict_image(
    method predict_page (line 299) | def predict_page(
    method handle_document (line 315) | def handle_document(
    method from_host (line 332) | def from_host(host: str) -> "RpcDocLayoutModel":

FILE: babeldoc/docvision/table_detection/rapidocr.py
  function convert_to_yolo_result (line 29) | def convert_to_yolo_result(predictions):
  function create_yolo_result_from_nested_coords (line 66) | def create_yolo_result_from_nested_coords(nested_coords: np.ndarray, nam...
  class RapidOCRModel (line 85) | class RapidOCRModel:
    method __init__ (line 86) | def __init__(self):
    method stride (line 105) | def stride(self):
    method resize_and_pad_image (line 108) | def resize_and_pad_image(self, image, new_shape):
    method scale_boxes (line 156) | def scale_boxes(self, img1_shape, boxes, img0_shape):
    method predict (line 182) | def predict(self, image, imgsz=800, batch_size=16, **kwargs):
    method handle_document (line 228) | def handle_document(
    method _is_box_in_table (line 277) | def _is_box_in_table(self, box_xyxy, table_box, page, img_width, img_h...

FILE: babeldoc/format/pdf/babelpdf/base14.py
  function get_cached_bbox (line 3311) | def get_cached_bbox(database, family, encoding):
  function get_base14_bbox (line 3321) | def get_base14_bbox(family, encoding_name="WinAnsiEncoding"):

FILE: babeldoc/format/pdf/babelpdf/cidfont.py
  function indirect (line 7) | def indirect(obj):
  function get_xref (line 12) | def get_xref(doc, xref, key):
  function get_font_file (line 18) | def get_font_file(doc, xref):
  function get_font_descriptor (line 27) | def get_font_descriptor(doc, xref):
  function get_descendant_fonts (line 32) | def get_descendant_fonts(doc, xref):
  function get_glyph_bbox (line 43) | def get_glyph_bbox(face, g):
  function get_face_bbox (line 56) | def get_face_bbox(blob):
  function get_cidfont_bbox (line 64) | def get_cidfont_bbox(doc, xref):

FILE: babeldoc/format/pdf/babelpdf/cmap.py
  function parse_blob_value (line 28) | def parse_blob_value(text):
  function parse_cmap_char (line 32) | def parse_cmap_char(text, store):
  function parse_cmap_range (line 39) | def parse_cmap_range(text, store):
  function parse_cmap (line 47) | def parse_cmap(text):
  function _normalize_cmap_name (line 63) | def _normalize_cmap_name(name: str) -> str:
  function use_cmap (line 70) | def use_cmap(name: str):
  function propagation (line 99) | def propagation(r, c):
  class CharacterMap (line 119) | class CharacterMap:
    method __init__ (line 120) | def __init__(self, text):
    method decode_one (line 132) | def decode_one(self, text):
    method decode (line 139) | def decode(self, text):

FILE: babeldoc/format/pdf/babelpdf/encoding.py
  function get_type1_encoding (line 1038) | def get_type1_encoding(name):

FILE: babeldoc/format/pdf/babelpdf/type3.py
  function merge_bbox (line 7) | def merge_bbox(bbox_list, factor=1):
  function get_type3_bbox (line 16) | def get_type3_bbox(doc, obj):

FILE: babeldoc/format/pdf/babelpdf/utils.py
  function guarded_bbox (line 4) | def guarded_bbox(bbox):

FILE: babeldoc/format/pdf/converter.py
  class PDFConverterEx (line 32) | class PDFConverterEx(PDFConverter):
    method __init__ (line 33) | def __init__(
    method begin_page (line 41) | def begin_page(self, page, ctm) -> None:
    method end_page (line 56) | def end_page(self, _page) -> None:
    method begin_figure (line 60) | def begin_figure(self, name, bbox, matrix) -> None:
    method end_figure (line 66) | def end_figure(self, _: str) -> None:
    method render_char (line 75) | def render_char(
  class AWLTChar (line 129) | class AWLTChar(LTChar):
    method __init__ (line 132) | def __init__(
    method __repr__ (line 190) | def __repr__(self) -> str:
    method get_text (line 193) | def get_text(self) -> str:
  class Paragraph (line 197) | class Paragraph:
    method __init__ (line 198) | def __init__(self, y, x, x0, x1, size, brk):
  class TranslateConverter (line 208) | class TranslateConverter(PDFConverterEx):
    method __init__ (line 209) | def __init__(
    method receive_layout (line 234) | def receive_layout(self, ltpage: LTPage):

FILE: babeldoc/format/pdf/document_il/backend/pdf_creater.py
  class RenderUnit (line 33) | class RenderUnit(ABC):
    method __init__ (line 36) | def __init__(
    method render (line 51) | def render(
    method get_sort_key (line 59) | def get_sort_key(self) -> tuple[int, int]:
  class CharacterRenderUnit (line 64) | class CharacterRenderUnit(RenderUnit):
    method __init__ (line 67) | def __init__(
    method render (line 76) | def render(self, draw_op: BitStream, context: "RenderContext") -> None:
  class FormRenderUnit (line 128) | class FormRenderUnit(RenderUnit):
    method __init__ (line 131) | def __init__(
    method render (line 140) | def render(self, draw_op: BitStream, context: "RenderContext") -> None:
  class RectangleRenderUnit (line 218) | class RectangleRenderUnit(RenderUnit):
    method __init__ (line 221) | def __init__(
    method render (line 232) | def render(self, draw_op: BitStream, context: "RenderContext") -> None:
  class CurveRenderUnit (line 261) | class CurveRenderUnit(RenderUnit):
    method __init__ (line 264) | def __init__(
    method render (line 273) | def render(self, draw_op: BitStream, context: "RenderContext") -> None:
  class RenderContext (line 335) | class RenderContext:
    method __init__ (line 338) | def __init__(
  function to_int (line 361) | def to_int(src):
  function parse_mapping (line 365) | def parse_mapping(text):
  function apply_normalization (line 372) | def apply_normalization(cmap, gid, code):
  function batched (line 385) | def batched(iterable, n, *, strict=False):
  function update_tounicode_cmap_pair (line 396) | def update_tounicode_cmap_pair(cmap, data):
  function update_tounicode_cmap_code (line 403) | def update_tounicode_cmap_code(cmap, data):
  function parse_tounicode_cmap (line 408) | def parse_tounicode_cmap(data):
  function parse_truetype_data (line 421) | def parse_truetype_data(data):
  function make_tounicode (line 448) | def make_tounicode(cmap, used):
  function reproduce_one_font (line 469) | def reproduce_one_font(doc, index):
  function reproduce_cmap (line 484) | def reproduce_cmap(doc):
  function _subset_fonts_process (line 500) | def _subset_fonts_process(pdf_path, output_path):
  function _save_pdf_clean_process (line 519) | def _save_pdf_clean_process(
  class PDFCreater (line 557) | class PDFCreater:
    method __init__ (line 560) | def __init__(
    method render_graphic_state (line 574) | def render_graphic_state(
    method render_paragraph_to_char (line 610) | def render_paragraph_to_char(
    method create_render_units_for_page (line 639) | def create_render_units_for_page(
    method render_units_to_stream (line 721) | def render_units_to_stream(
    method get_available_font_list (line 742) | def get_available_font_list(self, pdf, page):
    method get_xobj_available_fonts (line 746) | def get_xobj_available_fonts(self, page_xref_id, pdf):
    method _render_rectangle (line 772) | def _render_rectangle(
    method create_side_by_side_dual_pdf (line 811) | def create_side_by_side_dual_pdf(
    method create_alternating_pages_dual_pdf (line 895) | def create_alternating_pages_dual_pdf(
    method write_debug_info (line 925) | def write_debug_info(
    method subset_fonts_in_subprocess (line 1016) | def subset_fonts_in_subprocess(
    method save_pdf_with_timeout (line 1090) | def save_pdf_with_timeout(
    method restore_media_box (line 1231) | def restore_media_box(self, doc: pymupdf.Document, mediabox_data: dict...
    method write (line 1239) | def write(
    method update_page_content_stream (line 1425) | def update_page_content_stream(

FILE: babeldoc/format/pdf/document_il/frontend/il_creater.py
  function invert_matrix (line 44) | def invert_matrix(
  function batched (line 74) | def batched(iterable, n, *, strict=False):
  function indirect (line 111) | def indirect(obj):
  function get_char_cbox (line 116) | def get_char_cbox(face, idx):
  function get_name_cbox (line 121) | def get_name_cbox(face, name):
  function font_encoding_lookup (line 130) | def font_encoding_lookup(doc, idx, key):
  function parse_font_encoding (line 138) | def parse_font_encoding(doc, idx):
  function get_truetype_ansi_bbox_list (line 146) | def get_truetype_ansi_bbox_list(face):
  function collect_face_cmap (line 153) | def collect_face_cmap(face):
  function get_truetype_custom_bbox_list (line 164) | def get_truetype_custom_bbox_list(face):
  function parse_font_file (line 178) | def parse_font_file(doc, idx, encoding, differences):
  function parse_encoding (line 209) | def parse_encoding(obj_str):
  function parse_mapping (line 225) | def parse_mapping(text):
  function update_cmap_pair (line 232) | def update_cmap_pair(cmap, data):
  function update_cmap_code (line 244) | def update_cmap_code(cmap, data):
  function parse_cmap (line 254) | def parse_cmap(cmap_str):
  function get_code (line 267) | def get_code(cmap, c):
  function get_bbox (line 274) | def get_bbox(bbox, size, c, x, y):
  function get_rotation_angle (line 319) | def get_rotation_angle(matrix):
  class ILCreater (line 331) | class ILCreater:
    method __init__ (line 334) | def __init__(self, translation_config: TranslationConfig):
    method transform_clip_path (line 365) | def transform_clip_path(
    method get_render_order_and_increase (line 404) | def get_render_order_and_increase(self):
    method get_render_order (line 408) | def get_render_order(self):
    method on_finish (line 411) | def on_finish(self):
    method is_graphic_operation (line 414) | def is_graphic_operation(self, operator: str):
    method is_passthrough_per_char_operation (line 423) | def is_passthrough_per_char_operation(self, operator: str):
    method can_remove_old_passthrough_per_char_instruction (line 429) | def can_remove_old_passthrough_per_char_instruction(self, operator: str):
    method on_line_dash (line 435) | def on_line_dash(self, dash, phase):
    method on_passthrough_per_char (line 439) | def on_passthrough_per_char(self, operator: str, args: list[str]):
    method remove_latest_passthrough_per_char_instruction (line 460) | def remove_latest_passthrough_per_char_instruction(self):
    method parse_arg (line 464) | def parse_arg(self, arg: str):
    method pop_passthrough_per_char_instruction (line 473) | def pop_passthrough_per_char_instruction(self):
    method push_passthrough_per_char_instruction (line 490) | def push_passthrough_per_char_instruction(self):
    method on_stroking_color_space (line 497) | def on_stroking_color_space(self, color_space_name):
    method on_non_stroking_color_space (line 500) | def on_non_stroking_color_space(self, color_space_name):
    method on_new_stream (line 503) | def on_new_stream(self):
    method push_xobj (line 509) | def push_xobj(self):
    method pop_xobj (line 519) | def pop_xobj(self):
    method on_xobj_begin (line 524) | def on_xobj_begin(self, bbox, xref_id):
    method on_xobj_end (line 546) | def on_xobj_end(self, xobj_id, base_op):
    method on_page_start (line 554) | def on_page_start(self):
    method on_page_end (line 577) | def on_page_end(self):
    method on_page_crop_box (line 602) | def on_page_crop_box(
    method on_page_media_box (line 612) | def on_page_media_box(
    method on_page_number (line 622) | def on_page_number(self, page_number: int):
    method on_page_base_operation (line 627) | def on_page_base_operation(self, operation: str):
    method on_page_resource_font (line 631) | def on_page_resource_font(self, font: PDFFont, xref_id: int, font_id: ...
    method parse_font_xobj_id (line 768) | def parse_font_xobj_id(self, xobj_id: int):
    method create_graphic_state (line 801) | def create_graphic_state(
    method on_lt_char (line 870) | def on_lt_char(self, char: LTChar):
    method _collect_valid_char (line 1022) | def _collect_valid_char(self, ch: str):
    method on_lt_curve (line 1059) | def on_lt_curve(self, curve: babeldoc.pdfminer.layout.LTCurve):
    method on_xobj_form (line 1170) | def on_xobj_form(
    method on_pdf_clip_path (line 1225) | def on_pdf_clip_path(
    method create_il (line 1236) | def create_il(self):
    method on_total_pages (line 1245) | def on_total_pages(self, total_pages: int):
    method on_pdf_figure (line 1259) | def on_pdf_figure(self, figure: LTFigure):
    method on_inline_image_begin (line 1268) | def on_inline_image_begin(self):
    method on_inline_image_end (line 1276) | def on_inline_image_end(self, stream_obj, ctm):

FILE: babeldoc/format/pdf/document_il/il_version_1.py
  class BaseOperations (line 6) | class BaseOperations:
    class Meta (line 7) | class Meta:
  class Box (line 19) | class Box:
    class Meta (line 20) | class Meta:
  class GraphicState (line 54) | class GraphicState:
    class Meta (line 55) | class Meta:
  class PdfAffineTransform (line 68) | class PdfAffineTransform:
    class Meta (line 69) | class Meta:
  class PdfFontCharBoundingBox (line 117) | class PdfFontCharBoundingBox:
    class Meta (line 118) | class Meta:
  class PdfInlineForm (line 159) | class PdfInlineForm:
    class Meta (line 160) | class Meta:
  class PdfMatrix (line 180) | class PdfMatrix:
    class Meta (line 181) | class Meta:
  class PdfPath (line 229) | class PdfPath:
    class Meta (line 230) | class Meta:
  class PdfXobjForm (line 263) | class PdfXobjForm:
    class Meta (line 264) | class Meta:
  class Cropbox (line 286) | class Cropbox:
    class Meta (line 287) | class Meta:
  class Mediabox (line 300) | class Mediabox:
    class Meta (line 301) | class Meta:
  class PageLayout (line 314) | class PageLayout:
    class Meta (line 315) | class Meta:
  class PdfFigure (line 349) | class PdfFigure:
    class Meta (line 350) | class Meta:
  class PdfFont (line 363) | class PdfFont:
    class Meta (line 364) | class Meta:
  class PdfFormSubtype (line 444) | class PdfFormSubtype:
    class Meta (line 445) | class Meta:
  class PdfOriginalPath (line 465) | class PdfOriginalPath:
    class Meta (line 466) | class Meta:
  class PdfRectangle (line 480) | class PdfRectangle:
    class Meta (line 481) | class Meta:
  class PdfStyle (line 535) | class PdfStyle:
    class Meta (line 536) | class Meta:
  class VisualBbox (line 564) | class VisualBbox:
    class Meta (line 565) | class Meta:
  class PdfCharacter (line 578) | class PdfCharacter:
    class Meta (line 579) | class Meta:
  class PdfCurve (line 671) | class PdfCurve:
    class Meta (line 672) | class Meta:
  class PdfForm (line 761) | class PdfForm:
    class Meta (line 762) | class Meta:
  class PdfSameStyleUnicodeCharacters (line 847) | class PdfSameStyleUnicodeCharacters:
    class Meta (line 848) | class Meta:
  class PdfXobject (line 874) | class PdfXobject:
    class Meta (line 875) | class Meta:
  class PdfFormula (line 919) | class PdfFormula:
    class Meta (line 920) | class Meta:
  class PdfLine (line 988) | class PdfLine:
    class Meta (line 989) | class Meta:
  class PdfSameStyleCharacters (line 1017) | class PdfSameStyleCharacters:
    class Meta (line 1018) | class Meta:
  class PdfParagraphComposition (line 1047) | class PdfParagraphComposition:
    class Meta (line 1048) | class Meta:
  class PdfParagraph (line 1089) | class PdfParagraph:
    class Meta (line 1090) | class Meta:
  class Page (line 1182) | class Page:
    class Meta (line 1183) | class Meta:
  class Document (line 1290) | class Document:
    class Meta (line 1291) | class Meta:

FILE: babeldoc/format/pdf/document_il/midend/add_debug_information.py
  class AddDebugInformation (line 15) | class AddDebugInformation:
    method __init__ (line 18) | def __init__(self, translation_config: TranslationConfig):
    method process (line 22) | def process(self, docs: il_version_1.Document):
    method _create_rectangle (line 29) | def _create_rectangle(
    method _create_text (line 43) | def _create_text(
    method process_page (line 78) | def process_page(self, page: il_version_1.Page):

FILE: babeldoc/format/pdf/document_il/midend/automatic_term_extractor.py
  class BatchParagraph (line 69) | class BatchParagraph:
    method __init__ (line 70) | def __init__(
  class DocumentTermExtractTracker (line 79) | class DocumentTermExtractTracker:
    method __init__ (line 80) | def __init__(self):
    method new_page (line 83) | def new_page(self):
    method to_json (line 88) | def to_json(self):
  class PageTermExtractTracker (line 109) | class PageTermExtractTracker:
    method __init__ (line 110) | def __init__(self):
    method new_paragraph (line 113) | def new_paragraph(self):
  class ParagraphTermExtractTracker (line 119) | class ParagraphTermExtractTracker:
    method __init__ (line 120) | def __init__(self):
    method append_paragraph_unicode (line 123) | def append_paragraph_unicode(self, unicode: str):
    method set_output (line 126) | def set_output(self, output: str):
    method set_input (line 129) | def set_input(self, _input: str):
  class AutomaticTermExtractor (line 133) | class AutomaticTermExtractor:
    method __init__ (line 136) | def __init__(
    method calc_token_count (line 154) | def calc_token_count(self, text: str) -> int:
    method _snapshot_token_usage (line 160) | def _snapshot_token_usage(self) -> tuple[int, int, int, int]:
    method _clean_json_output (line 179) | def _clean_json_output(self, llm_output: str) -> str:
    method _process_llm_response (line 193) | def _process_llm_response(self, llm_response_text: str, request_id: str):
    method process_page (line 226) | def process_page(
    method extract_terms_from_paragraphs (line 274) | def extract_terms_from_paragraphs(
    method procress (line 357) | def procress(self, doc_il: ILDocument):

FILE: babeldoc/format/pdf/document_il/midend/detect_scanned_file.py
  class DetectScannedFile (line 19) | class DetectScannedFile:
    method __init__ (line 22) | def __init__(self, translation_config: TranslationConfig):
    method _save_debug_box_to_page (line 25) | def _save_debug_box_to_page(self, page: il_version_1.Page, similarity:...
    method fast_check (line 68) | def fast_check(self, doc: pymupdf.Document) -> bool:
    method process (line 84) | def process(
    method clean_render_order_for_chars (line 144) | def clean_render_order_for_chars(self, docs: il_version_1.Document):
    method detect_page_is_scanned (line 151) | def detect_page_is_scanned(

FILE: babeldoc/format/pdf/document_il/midend/il_translator.py
  class RichTextPlaceholder (line 77) | class RichTextPlaceholder:
    method __init__ (line 78) | def __init__(
    method to_dict (line 94) | def to_dict(self) -> dict:
  class FormulaPlaceholder (line 108) | class FormulaPlaceholder:
    method __init__ (line 109) | def __init__(
    method to_dict (line 121) | def to_dict(self) -> dict:
  class PbarContext (line 133) | class PbarContext:
    method __init__ (line 134) | def __init__(self, pbar):
    method __enter__ (line 137) | def __enter__(self):
    method __exit__ (line 140) | def __exit__(self, exc_type, exc_value, traceback):
  class DocumentTranslateTracker (line 144) | class DocumentTranslateTracker:
    method __init__ (line 145) | def __init__(self):
    method new_page (line 151) | def new_page(self):
    method new_cross_page (line 156) | def new_cross_page(self):
    method new_cross_column (line 161) | def new_cross_column(self):
    method to_json (line 167) | def to_json(self):
    method convert_paragraph (line 190) | def convert_paragraph(self, page):
  class PageTranslateTracker (line 234) | class PageTranslateTracker:
    method __init__ (line 235) | def __init__(self):
    method new_paragraph (line 238) | def new_paragraph(self):
  class ParagraphTranslateTracker (line 244) | class ParagraphTranslateTracker:
    method __init__ (line 245) | def __init__(self):
    method set_pdf_unicode (line 250) | def set_pdf_unicode(self, unicode: str):
    method set_input (line 253) | def set_input(self, input_text: str):
    method set_placeholders (line 256) | def set_placeholders(
    method set_original_placeholders (line 261) | def set_original_placeholders(self, placeholders: dict[str, int] | None):
    method record_multi_paragraph_id (line 265) | def record_multi_paragraph_id(self, mid):
    method record_multi_paragraph_index (line 268) | def record_multi_paragraph_index(self, index):
    method set_output (line 271) | def set_output(self, output: str):
    method record_removed_hallucinated_placeholder (line 274) | def record_removed_hallucinated_placeholder(self, token: str):
    method new_llm_translate_tracker (line 282) | def new_llm_translate_tracker(self) -> LLMTranslateTracker:
    method last_llm_translate_tracker (line 287) | def last_llm_translate_tracker(self) -> LLMTranslateTracker | None:
  class LLMTranslateTracker (line 293) | class LLMTranslateTracker:
    method __init__ (line 294) | def __init__(self):
    method set_input (line 302) | def set_input(self, input_text: str):
    method set_output (line 305) | def set_output(self, output_text: str):
    method set_error_message (line 308) | def set_error_message(self, error_message: str):
    method set_placeholder_full_match (line 312) | def set_placeholder_full_match(self):
    method set_fallback_to_translate (line 315) | def set_fallback_to_translate(self):
    method to_dict (line 318) | def to_dict(self):
  class ILTranslator (line 329) | class ILTranslator:
    method __init__ (line 332) | def __init__(
    method calc_token_count (line 382) | def calc_token_count(self, text: str) -> int:
    method translate (line 388) | def translate(self, docs: Document):
    method find_title_paragraph (line 426) | def find_title_paragraph(self, docs: Document) -> PdfParagraph | None:
    method process_page (line 442) | def process_page(
    class TranslateInput (line 479) | class TranslateInput:
      method __init__ (line 480) | def __init__(
      method set_original_placeholder_tokens (line 493) | def set_original_placeholder_tokens(self, tokens: dict[str, int] | N...
      method get_placeholders_hint (line 497) | def get_placeholders_hint(self) -> dict[str, str] | None:
    method create_formula_placeholder (line 515) | def create_formula_placeholder(
    method create_rich_text_placeholder (line 531) | def create_rich_text_placeholder(
    method get_translate_input (line 571) | def get_translate_input(
    method process_formula (line 732) | def process_formula(
    method process_composition (line 744) | def process_composition(
    method parse_translate_output (line 767) | def parse_translate_output(
    method pre_translate_paragraph (line 950) | def pre_translate_paragraph(
    method post_translate_paragraph (line 987) | def post_translate_paragraph(
    method _build_role_block (line 1017) | def _build_role_block(self) -> str:
    method _build_context_block (line 1038) | def _build_context_block(
    method _build_glossary_block (line 1086) | def _build_glossary_block(self, text: str) -> str:
    method generate_prompt_for_llm (line 1130) | def generate_prompt_for_llm(
    method add_content_filter_hint (line 1162) | def add_content_filter_hint(self, page: Page, paragraph: PdfParagraph):
    method _create_text (line 1180) | def _create_text(
    method translate_paragraph (line 1210) | def translate_paragraph(

FILE: babeldoc/format/pdf/document_il/midend/il_translator_llm_only.py
  class BatchParagraph (line 98) | class BatchParagraph:
    method __init__ (line 99) | def __init__(
  class ILTranslatorLLMOnly (line 110) | class ILTranslatorLLMOnly:
    method __init__ (line 113) | def __init__(
    method calc_token_count (line 153) | def calc_token_count(self, text: str) -> int:
    method find_title_paragraph (line 159) | def find_title_paragraph(self, docs: Document) -> PdfParagraph | None:
    method translate (line 175) | def translate(self, docs: Document) -> None:
    method _is_body_text_paragraph (line 257) | def _is_body_text_paragraph(self, paragraph: PdfParagraph) -> bool:
    method _should_translate_paragraph (line 272) | def _should_translate_paragraph(
    method _filter_paragraphs (line 310) | def _filter_paragraphs(
    method _build_font_maps (line 334) | def _build_font_maps(
    method process_cross_page_paragraph (line 357) | def process_cross_page_paragraph(
    method process_cross_column_paragraph (line 454) | def process_cross_column_paragraph(
    method process_page (line 526) | def process_page(
    method translate_paragraph (line 622) | def translate_paragraph(
    method _build_llm_prompt (line 882) | def _build_llm_prompt(
    method _clean_json_output (line 981) | def _clean_json_output(self, llm_output: str) -> str:

FILE: babeldoc/format/pdf/document_il/midend/layout_parser.py
  class LayoutParser (line 19) | class LayoutParser:
    method __init__ (line 22) | def __init__(self, translation_config: TranslationConfig):
    method _save_debug_image (line 26) | def _save_debug_image(self, image: np.ndarray, layout, page_number: int):
    method _save_debug_box_to_page (line 61) | def _save_debug_box_to_page(self, page: il_version_1.Page):
    method process (line 119) | def process(self, docs: il_version_1.Document, mupdf_doc: Document):
    method generate_fallback_line_layout_for_page (line 178) | def generate_fallback_line_layout_for_page(self, page: il_version_1.Pa...

FILE: babeldoc/format/pdf/document_il/midend/paragraph_finder.py
  function generate_base58_id (line 46) | def generate_base58_id(length: int = 5) -> str:
  class ParagraphFinder (line 51) | class ParagraphFinder:
    method __init__ (line 56) | def __init__(self, translation_config: TranslationConfig):
    method _preprocess_formula_layouts (line 60) | def _preprocess_formula_layouts(self, page: Page):
    method add_text_fill_background (line 89) | def add_text_fill_background(self, page: Page):
    method update_paragraph_data (line 124) | def update_paragraph_data(self, paragraph: PdfParagraph, update_unicod...
    method update_line_data (line 172) | def update_line_data(self, line: PdfLine):
    method add_debug_info (line 179) | def add_debug_info(self, page: Page):
    method process (line 196) | def process(self, document):
    method check_cid_paragraph (line 217) | def check_cid_paragraph(self, doc: Document):
    method bbox_overlap (line 227) | def bbox_overlap(self, bbox1: Box, bbox2: Box) -> bool:
    method process_page (line 235) | def process_page(self, page: Page):
    method _set_paragraph_render_order (line 312) | def _set_paragraph_render_order(self, page: Page):
    method is_isolated_formula (line 349) | def is_isolated_formula(self, char: PdfCharacter):
    method _paragraph_text_ascii (line 357) | def _paragraph_text_ascii(self, p: PdfParagraph) -> str:
    method _is_ascii_digit_or_space_paragraph (line 368) | def _is_ascii_digit_or_space_paragraph(self, p: PdfParagraph) -> bool:
    method _same_layout_and_xobj (line 383) | def _same_layout_and_xobj(a: PdfParagraph, c: PdfParagraph) -> bool:
    method merge_alternating_line_number_paragraphs (line 393) | def merge_alternating_line_number_paragraphs(self, paragraphs: list[Pd...
    method _group_characters_into_paragraphs (line 420) | def _group_characters_into_paragraphs(
    method _merge_overlapping_clusters (line 514) | def _merge_overlapping_clusters(
    method _get_effective_y_bounds (line 600) | def _get_effective_y_bounds(self, char: PdfCharacter) -> tuple[float, ...
    method _compute_collision_counts_histogram (line 616) | def _compute_collision_counts_histogram(
    method _split_paragraph_into_lines (line 652) | def _split_paragraph_into_lines(
    method process_paragraph_spacing (line 779) | def process_paragraph_spacing(self, paragraph: PdfParagraph):
    method create_line (line 815) | def create_line(self, chars: list[PdfCharacter]) -> PdfParagraphCompos...
    method calculate_median_line_width (line 822) | def calculate_median_line_width(self, paragraphs: list[PdfParagraph]) ...
    method process_independent_paragraphs (line 841) | def process_independent_paragraphs(
    method is_bbox_contain_in_vertical (line 931) | def is_bbox_contain_in_vertical(bbox1: Box, bbox2: Box) -> bool:
    method fix_overlapping_paragraphs (line 939) | def fix_overlapping_paragraphs(self, page: Page):
    method _sort_characters_in_lines (line 1032) | def _sort_characters_in_lines(self, page: Page):
    method _get_char_sort_key (line 1040) | def _get_char_sort_key(self, char: PdfCharacter):

FILE: babeldoc/format/pdf/document_il/midend/remove_descent.py
  class RemoveDescent (line 11) | class RemoveDescent:
    method __init__ (line 14) | def __init__(self, translation_config: TranslationConfig):
    method _remove_char_descent (line 17) | def _remove_char_descent(
    method process (line 50) | def process(self, document: il_version_1.Document):
    method process_page (line 65) | def process_page(self, page: il_version_1.Page):

FILE: babeldoc/format/pdf/document_il/midend/styles_and_formulas.py
  class StylesAndFormulas (line 44) | class StylesAndFormulas:
    method __init__ (line 47) | def __init__(self, translation_config: TranslationConfig):
    method update_formula_data (line 51) | def update_formula_data(self, formula: PdfFormula):
    method process (line 54) | def process(self, document: Document):
    method update_all_formula_data (line 64) | def update_all_formula_data(self, page: Page):
    method _calculate_element_formula_iou (line 70) | def _calculate_element_formula_iou(
    method _is_element_contained_exact (line 96) | def _is_element_contained_exact(
    method _calculate_element_formula_distance (line 119) | def _calculate_element_formula_distance(
    method _collect_element_formula_candidates (line 159) | def _collect_element_formula_candidates(
    method _resolve_assignment_conflicts (line 257) | def _resolve_assignment_conflicts(
    method collect_contained_elements (line 325) | def collect_contained_elements(self, page: Page):
    method process_page (line 362) | def process_page(self, page: Page):
    method update_line_data (line 384) | def update_line_data(self, line: PdfLine):
    method _classify_characters_in_composition (line 391) | def _classify_characters_in_composition(
    method _group_classified_characters (line 525) | def _group_classified_characters(
    method process_page_formulas (line 568) | def process_page_formulas(self, page: Page):
    method process_translatable_formulas (line 621) | def process_translatable_formulas(self, page: Page):
    method process_page_styles (line 650) | def process_page_styles(self, page: Page):
    method _calculate_base_style (line 710) | def _calculate_base_style(self, paragraph) -> PdfStyle:
    method _get_mode_value (line 738) | def _get_mode_value(self, values):
    method _merge_styles (line 747) | def _merge_styles(self, style1, style2):
    method _merge_graphic_states (line 767) | def _merge_graphic_states(self, state1, state2):
    method _create_same_style_composition (line 783) | def _create_same_style_composition(
    method process_page_offsets (line 807) | def process_page_offsets(self, page: Page):
    method calculate_line_spacing (line 905) | def calculate_line_spacing(self, paragraph) -> float:
    method create_composition (line 933) | def create_composition(
    method is_translatable_formula (line 950) | def is_translatable_formula(self, formula: PdfFormula) -> bool:
    method should_split_formula (line 960) | def should_split_formula(self, formula: PdfFormula) -> bool:
    method split_formula_by_comma (line 974) | def split_formula_by_comma(
    method merge_formulas (line 1010) | def merge_formulas(self, formula1: PdfFormula, formula2: PdfFormula) -...
    method is_x_axis_contained (line 1023) | def is_x_axis_contained(self, box1: Box, box2: Box) -> bool:
    method has_y_intersection (line 1029) | def has_y_intersection(self, box1: Box, box2: Box) -> bool:
    method is_x_axis_adjacent (line 1034) | def is_x_axis_adjacent(self, box1: Box, box2: Box, tolerance: float = ...
    method calculate_y_iou (line 1046) | def calculate_y_iou(self, box1: Box, box2: Box) -> float:
    method merge_overlapping_formulas (line 1064) | def merge_overlapping_formulas(self, page: Page):
    method _have_same_layout_ids (line 1156) | def _have_same_layout_ids(
    method process_comma_formulas (line 1185) | def process_comma_formulas(self, page: Page):
    method remove_non_formula_lines_from_paragraphs (line 1225) | def remove_non_formula_lines_from_paragraphs(self, page: Page):

FILE: babeldoc/format/pdf/document_il/midend/table_parser.py
  class TableParser (line 16) | class TableParser:
    method __init__ (line 19) | def __init__(self, translation_config: TranslationConfig):
    method _save_debug_image (line 23) | def _save_debug_image(self, image: np.ndarray, layouts, page_number: i...
    method _save_debug_box_to_page (line 62) | def _save_debug_box_to_page(self, page: il_version_1.Page):
    method process (line 116) | def process(self, docs: il_version_1.Document, mupdf_doc: Document):

FILE: babeldoc/format/pdf/document_il/midend/typesetting.py
  class TypesettingUnit (line 90) | class TypesettingUnit:
    method __str__ (line 91) | def __str__(self):
    method __init__ (line 94) | def __init__(
    method try_resue_cache (line 153) | def try_resue_cache(self, old_tu: TypesettingUnit):
    method try_get_unicode (line 179) | def try_get_unicode(self) -> str | None:
    method mixed_character_blacklist (line 188) | def mixed_character_blacklist(self):
    method calc_mixed_character_blacklist (line 194) | def calc_mixed_character_blacklist(self):
    method can_break_line (line 207) | def can_break_line(self):
    method calc_can_break_line (line 213) | def calc_can_break_line(self):
    method is_cjk_char (line 222) | def is_cjk_char(self):
    method calc_is_cjk_char (line 228) | def calc_is_cjk_char(self):
    method is_space (line 300) | def is_space(self):
    method calc_is_space (line 306) | def calc_is_space(self):
    method is_hung_punctuation (line 313) | def is_hung_punctuation(self):
    method calc_is_hung_punctuation (line 319) | def calc_is_hung_punctuation(self):
    method is_cannot_appear_in_line_end_punctuation (line 379) | def is_cannot_appear_in_line_end_punctuation(self):
    method calc_is_cannot_appear_in_line_end_punctuation (line 387) | def calc_is_cannot_appear_in_line_end_punctuation(self):
    method passthrough (line 413) | def passthrough(
    method can_passthrough (line 430) | def can_passthrough(self):
    method calc_can_passthrough (line 436) | def calc_can_passthrough(self):
    method calculate_box (line 439) | def calculate_box(self):
    method box (line 462) | def box(self):
    method width (line 469) | def width(self):
    method calc_width (line 475) | def calc_width(self):
    method height (line 480) | def height(self):
    method calc_height (line 486) | def calc_height(self):
    method relocate (line 490) | def relocate(
    method _transform_curve_for_relocation (line 657) | def _transform_curve_for_relocation(
    method _transform_form_for_relocation (line 716) | def _transform_form_for_relocation(
    method render (line 767) | def render(
  class Typesetting (line 824) | class Typesetting:
    method __init__ (line 827) | def __init__(self, translation_config: TranslationConfig):
    method preprocess_document (line 843) | def preprocess_document(self, document: il_version_1.Document, pbar):
    method _find_optimal_scale_and_layout (line 919) | def _find_optimal_scale_and_layout(
    method _get_optimal_scale (line 1056) | def _get_optimal_scale(
    method retypeset_with_precomputed_scale (line 1074) | def retypeset_with_precomputed_scale(
    method typesetting_document (line 1096) | def typesetting_document(self, document: il_version_1.Document):
    method render_page (line 1115) | def render_page(self, page: il_version_1.Page):
    method add_watermark (line 1198) | def add_watermark(self, page: il_version_1.Page):
    method render_paragraph (line 1232) | def render_paragraph(
    method _get_width_before_next_break_point (line 1263) | def _get_width_before_next_break_point(
    method _layout_typesetting_units (line 1278) | def _layout_typesetting_units(
    method create_typesetting_units (line 1436) | def create_typesetting_units(
    method create_passthrough_composition (line 1535) | def create_passthrough_composition(
    method get_max_right_space (line 1560) | def get_max_right_space(self, current_box: Box, page) -> float:
    method get_max_bottom_space (line 1596) | def get_max_bottom_space(self, current_box: Box, page: il_version_1.Pa...
    method _update_paragraph_render_order (line 1632) | def _update_paragraph_render_order(self, paragraph: il_version_1.PdfPa...

FILE: babeldoc/format/pdf/document_il/utils/extract_char.py
  function parse_pdf (line 56) | def parse_pdf(pdf_path, page_ranges=None) -> il_version_1.Document:
  class Line (line 90) | class Line:
    method __init__ (line 91) | def __init__(self, chars: list[tuple[il_version_1.Box, str, bool]]):
  function _recalculate_line_text_with_spacing (line 96) | def _recalculate_line_text_with_spacing(line, orientation):
  function extract_paragraph_line (line 146) | def extract_paragraph_line(
  function convert_page_to_char_boxes (line 158) | def convert_page_to_char_boxes(
  function _cluster_by_axis (line 167) | def _cluster_by_axis(chars: list[tuple[il_version_1.Box, str, bool]], or...
  function _merge_lines_on_page (line 355) | def _merge_lines_on_page(page_lines: list[Line]) -> list[Line]:
  function process_page_chars_to_lines (line 573) | def process_page_chars_to_lines(
  function process_page_chars_to_lines_internal (line 582) | def process_page_chars_to_lines_internal(
  function cluster_chars_to_lines (line 621) | def cluster_chars_to_lines(
  function draw_clustered_lines_to_image (line 635) | def draw_clustered_lines_to_image(pdf_path, clustered_lines: dict[int, l...
  function main (line 729) | def main():

FILE: babeldoc/format/pdf/document_il/utils/fontmap.py
  class PrimaryFontFamily (line 17) | class PrimaryFontFamily(enum.IntEnum):
    method from_str (line 24) | def from_str(cls, value: str):
  class FontMapper (line 35) | class FontMapper:
    method __init__ (line 38) | def __init__(self, translation_config: TranslationConfig):
    method has_char (line 119) | def has_char(self, char_unicode: str):
    method map_in_type (line 128) | def map_in_type(
    method map (line 154) | def map(self, original_font: PdfFont, char_unicode: str):
    method get_used_font_ids (line 215) | def get_used_font_ids(self, il: il_version_1.Document) -> set[str]:
    method add_font (line 228) | def add_font(self, doc_zh: pymupdf.Document, il: il_version_1.Document):

FILE: babeldoc/format/pdf/document_il/utils/formular_helper.py
  function is_formulas_start_char (line 16) | def is_formulas_start_char(
  function is_formulas_middle_char (line 54) | def is_formulas_middle_char(
  function collect_page_formula_font_ids (line 68) | def collect_page_formula_font_ids(
  function is_formulas_font (line 111) | def is_formulas_font(font_name: str, formular_font_pattern: str | None) ...
  function update_formula_data (line 312) | def update_formula_data(formula: PdfFormula):

FILE: babeldoc/format/pdf/document_il/utils/layout_helper.py
  function is_bullet_point (line 55) | def is_bullet_point(char: PdfCharacter) -> bool:
  function calculate_box_iou (line 68) | def calculate_box_iou(box1: Box, box2: Box) -> float:
  function formular_height_ignore_char (line 108) | def formular_height_ignore_char(char: PdfCharacter):
  function box_to_tuple (line 115) | def box_to_tuple(box: Box) -> tuple[float, float, float, float]:
  class Layout (line 122) | class Layout:
    method __init__ (line 123) | def __init__(self, layout_id, name):
    method is_newline (line 128) | def is_newline(prev_char: PdfCharacter, curr_char: PdfCharacter) -> bool:
  function get_paragraph_length_except (line 159) | def get_paragraph_length_except(
  function get_paragraph_unicode (line 200) | def get_paragraph_unicode(paragraph: PdfParagraph) -> str:
  function get_char_unicode_string (line 226) | def get_char_unicode_string(chars: list[PdfCharacter | str]) -> str:
  function get_paragraph_max_height (line 296) | def get_paragraph_max_height(paragraph: PdfParagraph) -> float:
  function is_same_style (line 344) | def is_same_style(style1, style2) -> bool:
  function is_same_style_except_size (line 356) | def is_same_style_except_size(style1, style2) -> bool:
  function is_same_style_except_font (line 368) | def is_same_style_except_font(style1, style2) -> bool:
  function is_same_graphic_state (line 378) | def is_same_graphic_state(state1: GraphicState, state2: GraphicState) ->...
  function add_space_dummy_chars (line 389) | def add_space_dummy_chars(paragraph: PdfParagraph) -> None:
  function _get_first_char_from_composition (line 458) | def _get_first_char_from_composition(
  function _get_last_char_from_composition (line 475) | def _get_last_char_from_composition(
  function _add_space_dummy_chars_to_list (line 492) | def _add_space_dummy_chars_to_list(chars: list[PdfCharacter]) -> None:
  function build_layout_index (line 553) | def build_layout_index(page):
  function calculate_iou_for_boxes (line 566) | def calculate_iou_for_boxes(box1: Box, box2: Box) -> float:
  function calculate_y_iou_for_boxes (line 589) | def calculate_y_iou_for_boxes(box1: Box, box2: Box) -> float:
  function calculate_y_true_iou_for_boxes (line 618) | def calculate_y_true_iou_for_boxes(box1: Box, box2: Box) -> float:
  function get_character_layout (line 650) | def get_character_layout(
  function is_text_layout (line 801) | def is_text_layout(layout: Layout):
  function is_character_in_formula_layout (line 852) | def is_character_in_formula_layout(
  function is_curve_in_figure_table_layout (line 883) | def is_curve_in_figure_table_layout(
  function is_curve_overlapping_with_paragraphs (line 932) | def is_curve_overlapping_with_paragraphs(
  function get_paragraph_bounding_box (line 958) | def get_paragraph_bounding_box(paragraph) -> Box | None:

FILE: babeldoc/format/pdf/document_il/utils/matrix_helper.py
  function decompose_ctm (line 22) | def decompose_ctm(m: Matrix | PdfMatrix) -> PdfAffineTransform:
  function compose_ctm (line 125) | def compose_ctm(transform: PdfAffineTransform) -> Matrix:
  function scale_and_set_translation (line 172) | def scale_and_set_translation(
  function create_translation_and_scale_matrix (line 224) | def create_translation_and_scale_matrix(
  function multiply_matrices (line 248) | def multiply_matrices(m1: Matrix | PdfMatrix, m2: Matrix | PdfMatrix) ->...
  function apply_transform_to_ctm (line 287) | def apply_transform_to_ctm(
  function matrix_to_bytes (line 329) | def matrix_to_bytes(m: Matrix | PdfMatrix) -> bytes:

FILE: babeldoc/format/pdf/document_il/utils/mupdf_helper.py
  function get_no_rotation_img (line 7) | def get_no_rotation_img(page: pymupdf.Page, dpi: int = 72) -> pymupdf.Pi...
  function get_no_rotation_img_multiprocess_internal (line 16) | def get_no_rotation_img_multiprocess_internal(
  function get_no_rotation_img_multiprocess (line 36) | def get_no_rotation_img_multiprocess(pdf_bytes: str, pagenum: int, dpi: ...

FILE: babeldoc/format/pdf/document_il/utils/paragraph_helper.py
  function is_cid_paragraph (line 9) | def is_cid_paragraph(paragraph: il_version_1.PdfParagraph):
  function is_pure_numeric_paragraph (line 42) | def is_pure_numeric_paragraph(paragraph) -> bool:
  function is_placeholder_only_paragraph (line 55) | def is_placeholder_only_paragraph(paragraph: il_version_1.PdfParagraph) ...

FILE: babeldoc/format/pdf/document_il/utils/spatial_analyzer.py
  function is_element_contained_in_formula (line 20) | def is_element_contained_in_formula(
  function find_contained_curves (line 53) | def find_contained_curves(
  function find_contained_forms (line 81) | def find_contained_forms(
  function find_all_contained_elements (line 109) | def find_all_contained_elements(
  function calculate_translation_and_scale (line 128) | def calculate_translation_and_scale(

FILE: babeldoc/format/pdf/document_il/utils/style_helper.py
  function create_pdf_style (line 4) | def create_pdf_style(r, g, b, font_id="base", font_size=6):

FILE: babeldoc/format/pdf/document_il/utils/zstd_helper.py
  function zstd_compress (line 6) | def zstd_compress(data) -> str:
  function zstd_decompress (line 15) | def zstd_decompress(data) -> str:

FILE: babeldoc/format/pdf/document_il/xml_converter.py
  class XMLConverter (line 13) | class XMLConverter:
    method __init__ (line 14) | def __init__(self):
    method write_xml (line 20) | def write_xml(self, document: il_version_1.Document, path: str):
    method read_xml (line 24) | def read_xml(self, path: str) -> il_version_1.Document:
    method to_xml (line 28) | def to_xml(self, document: il_version_1.Document) -> str:
    method from_xml (line 31) | def from_xml(self, xml: str) -> il_version_1.Document:
    method deepcopy (line 37) | def deepcopy(self, document: il_version_1.Document) -> il_version_1.Do...
    method to_json (line 41) | def to_json(self, document: il_version_1.Document) -> str:
    method write_json (line 49) | def write_json(self, document: il_version_1.Document, path: str):

FILE: babeldoc/format/pdf/high_level.py
  function safe_save (line 97) | def safe_save(doc, *args, **kwargs):
  function check_metadata (line 106) | def check_metadata(pdf: Document):
  function add_metadata (line 121) | def add_metadata(
  function fix_cmap (line 165) | def fix_cmap(translate_result: TranslateResult, translate_config: Transl...
  function verify_file_hash (line 185) | def verify_file_hash(file_path: str, expected_hash: str) -> bool:
  function translator_supports_llm (line 195) | def translator_supports_llm(translator) -> bool:
  function start_parse_il (line 208) | def start_parse_il(
  function translate (line 326) | def translate(translation_config: TranslationConfig) -> TranslateResult:
  function get_translation_stage (line 331) | def get_translation_stage(
  function async_translate (line 366) | async def async_translate(translation_config: TranslationConfig):
  class MemoryMonitor (line 446) | class MemoryMonitor:
    method __init__ (line 449) | def __init__(self, interval=0.1):
    method __enter__ (line 461) | def __enter__(self):
    method __exit__ (line 471) | def __exit__(self, exc_type, exc_val, exc_tb):
    method _monitor_memory_usage (line 480) | def _monitor_memory_usage(self):
    method get_peek_memory_psutil (line 503) | def get_peek_memory_psutil(self):
  function fix_null_page_content (line 508) | def fix_null_page_content(doc: Document) -> list[int]:
  function fix_null_xref (line 520) | def fix_null_xref(doc: Document) -> None:
  function fix_filter (line 543) | def fix_filter(doc):
  function update_page_bbox (line 589) | def update_page_bbox(doc, page, box, key):
  function do_translate (line 594) | def do_translate(
  function migrate_toc (line 803) | def migrate_toc(
  function fix_media_box (line 862) | def fix_media_box(doc: Document) -> None:
  function check_cid_char (line 890) | def check_cid_char(il: il_version_1.Document):
  function _do_translate_single (line 903) | def _do_translate_single(
  function generate_first_page_with_watermark (line 1146) | def generate_first_page_with_watermark(
  function merge_watermark_doc (line 1197) | def merge_watermark_doc(
  function download_font_assets (line 1232) | def download_font_assets():
  function create_cache_folder (line 1236) | def create_cache_folder():
  function init (line 1248) | def init():

FILE: babeldoc/format/pdf/pdfinterp.py
  function safe_float (line 48) | def safe_float(o: Any) -> float | None:
  class PDFContentParserEx (line 55) | class PDFContentParserEx(PDFContentParser):
    method __init__ (line 56) | def __init__(self, streams: Sequence[object]) -> None:
    method do_keyword (line 59) | def do_keyword(self, pos: int, token: PSKeyword) -> None:
  class PDFPageInterpreterEx (line 91) | class PDFPageInterpreterEx(PDFPageInterpreter):
    method __init__ (line 97) | def __init__(
    method dup (line 109) | def dup(self) -> "PDFPageInterpreterEx":
    method init_resources (line 117) | def init_resources(self, resources: dict[object, object]) -> None:
    method do_CS (line 170) | def do_CS(self, name: PDFStackT) -> None:
    method do_cs (line 183) | def do_cs(self, name: PDFStackT) -> None:
    method do_SCN (line 195) | def do_SCN(self) -> None:
    method do_scn (line 209) | def do_scn(self) -> None:
    method do_SC (line 223) | def do_SC(self) -> None:
    method do_sc (line 230) | def do_sc(self) -> None:
    method do_Do (line 239) | def do_Do(self, xobjid_arg: PDFStackT) -> None:
    method do_W (line 346) | def do_W(self) -> None:
    method do_W_a (line 350) | def do_W_a(self) -> None:
    method handle_w (line 354) | def handle_w(self, evenodd: bool):
    method process_page (line 358) | def process_page(self, page: PDFPage) -> None:
    method render_contents (line 394) | def render_contents(
    method do_q (line 415) | def do_q(self) -> None:
    method do_Q (line 421) | def do_Q(self) -> None:
    method do_TJ (line 428) | def do_TJ(self, seq: PDFStackT) -> None:
    method do_d (line 446) | def do_d(self, dash: PDFStackT, phase: PDFStackT) -> None:
    method do_BI (line 451) | def do_BI(self) -> None:
    method do_ID (line 455) | def do_ID(self) -> None:
    method do_EI (line 459) | def do_EI(self, obj: PDFStackT) -> None:
    method execute (line 466) | def execute(self, streams: Sequence[object]) -> None:

FILE: babeldoc/format/pdf/result_merger.py
  class ResultMerger (line 13) | class ResultMerger:
    method __init__ (line 16) | def __init__(self, translation_config: TranslationConfig):
    method merge_results (line 19) | def merge_results(
    method _merge_pdfs (line 173) | def _merge_pdfs(

FILE: babeldoc/format/pdf/split_manager.py
  class SplitPoint (line 8) | class SplitPoint:
  class BaseSplitStrategy (line 17) | class BaseSplitStrategy:
    method determine_split_points (line 20) | def determine_split_points(self, config) -> list[SplitPoint]:
  class PageCountStrategy (line 24) | class PageCountStrategy(BaseSplitStrategy):
    method __init__ (line 27) | def __init__(self, max_pages_per_part: int = 20):
    method determine_split_points (line 30) | def determine_split_points(self, config) -> list[SplitPoint]:
  class SplitManager (line 52) | class SplitManager:
    method __init__ (line 55) | def __init__(self, config=None):
    method determine_split_points (line 58) | def determine_split_points(self, config) -> list[SplitPoint]:
    method estimate_part_complexity (line 62) | def estimate_part_complexity(self, split_point: SplitPoint) -> float:

FILE: babeldoc/format/pdf/translation_config.py
  class WatermarkOutputMode (line 20) | class WatermarkOutputMode(enum.Enum):
  class SharedContextCrossSplitPart (line 26) | class SharedContextCrossSplitPart:
    method __init__ (line 27) | def __init__(self):
    method initialize_glossaries (line 39) | def initialize_glossaries(self, initial_glossaries: list[Glossary] | N...
    method add_raw_extracted_term_pair (line 55) | def add_raw_extracted_term_pair(self, src: str, tgt: str):
    method _generate_unique_auto_glossary_name (line 59) | def _generate_unique_auto_glossary_name(self) -> str:
    method contains_term (line 75) | def contains_term(self, term: str) -> bool:
    method finalize_auto_extracted_glossary (line 82) | def finalize_auto_extracted_glossary(self):
    method get_glossaries (line 106) | def get_glossaries(self) -> list[Glossary]:
    method get_glossaries_for_translation (line 113) | def get_glossaries_for_translation(
    method add_valid_counts (line 125) | def add_valid_counts(self, char_count: int, token_count: int):
  class TranslationConfig (line 136) | class TranslationConfig:
    method create_max_pages_per_part_split_strategy (line 138) | def create_max_pages_per_part_split_strategy(max_pages_per_part: int):
    method __init__ (line 143) | def __init__(
    method parse_pages (line 363) | def parse_pages(self, pages_str: str | None) -> list[tuple[int, int]] ...
    method should_translate_page (line 388) | def should_translate_page(self, page_number: int) -> bool:
    method get_output_file_path (line 405) | def get_output_file_path(self, filename: str) -> Path:
    method get_working_file_path (line 408) | def get_working_file_path(self, filename: str) -> Path:
    method get_part_working_dir (line 411) | def get_part_working_dir(self, part_index: int) -> Path:
    method get_part_output_dir (line 422) | def get_part_output_dir(self, part_index: int) -> Path:
    method cleanup_part_output_dir (line 430) | def cleanup_part_output_dir(self, part_index: int):
    method cleanup_part_working_dir (line 438) | def cleanup_part_working_dir(self, part_index: int):
    method cleanup_temp_files (line 446) | def cleanup_temp_files(self):
    method raise_if_cancelled (line 457) | def raise_if_cancelled(self):
    method cancel_translation (line 461) | def cancel_translation(self):
    method get_term_extraction_translator (line 465) | def get_term_extraction_translator(self) -> BaseTranslator:
    method record_term_extraction_usage (line 469) | def record_term_extraction_usage(
  class TranslateResult (line 489) | class TranslateResult:
    method __init__ (line 501) | def __init__(
    method __str__ (line 519) | def __str__(self):

FILE: babeldoc/glossary.py
  class GlossaryEntry (line 16) | class GlossaryEntry:
    method __init__ (line 17) | def __init__(self, source: str, target: str, target_language: str | No...
    method __repr__ (line 22) | def __repr__(self):
  function batched (line 26) | def batched(iterable, n, *, strict=False):
  class Glossary (line 40) | class Glossary:
    method __init__ (line 41) | def __init__(self, name: str, entries: list[GlossaryEntry]):
    method normalize_source (line 60) | def normalize_source(source_term: str) -> str:
    method _build_regex_and_lookup (line 68) | def _build_regex_and_lookup(self):
    method from_csv (line 124) | def from_csv(cls, file_path: Path, target_lang_out: str) -> "Glossary":
    method to_csv (line 172) | def to_csv(self) -> str:
    method __repr__ (line 190) | def __repr__(self):
    method get_active_entries_for_text (line 193) | def get_active_entries_for_text(self, text: str) -> list[tuple[str, st...

FILE: babeldoc/main.py
  function create_parser (line 32) | def create_parser():
  function main (line 461) | async def main():
  function create_progress_handler (line 786) | def create_progress_handler(
  function create_cache_folder (line 869) | def create_cache_folder():
  function download_font_assets (line 874) | def download_font_assets():
  class EvictQueue (line 878) | class EvictQueue(queue.Queue):
    method __init__ (line 879) | def __init__(self, maxsize):
    method put (line 883) | def put(self, item, block=False, timeout=None):
  function speed_up_logs (line 896) | def speed_up_logs():
  function cli (line 907) | def cli():

FILE: babeldoc/pdfminer/_saslprep.py
  function saslprep (line 46) | def saslprep(data: str, prohibit_unassigned_code_points: bool = True) ->...

FILE: babeldoc/pdfminer/arcfour.py
  class Arcfour (line 10) | class Arcfour:
    method __init__ (line 11) | def __init__(self, key: Sequence[int]) -> None:
    method process (line 22) | def process(self, data: bytes) -> bytes:

FILE: babeldoc/pdfminer/ascii85.py
  function ascii85decode (line 11) | def ascii85decode(data: bytes) -> bytes:
  function asciihexdecode (line 33) | def asciihexdecode(data: bytes) -> bytes:

FILE: babeldoc/pdfminer/casting.py
  function safe_int (line 11) | def safe_int(o: Any) -> int | None:
  function safe_float (line 18) | def safe_float(o: Any) -> float | None:
  function safe_matrix (line 25) | def safe_matrix(a: Any, b: Any, c: Any, d: Any, e: Any, f: Any) -> Matri...
  function safe_rgb (line 46) | def safe_rgb(r: Any, g: Any, b: Any) -> tuple[float, float, float] | None:
  function safe_cmyk (line 50) | def safe_cmyk(
  function safe_rect_list (line 56) | def safe_rect_list(value: Any) -> Rect | None:
  function safe_rect (line 68) | def safe_rect(a: Any, b: Any, c: Any, d: Any) -> Rect | None:
  function _safe_float_triple (line 72) | def _safe_float_triple(a: Any, b: Any, c: Any) -> _FloatTriple | None:
  function _safe_float_quadruple (line 83) | def _safe_float_quadruple(a: Any, b: Any, c: Any, d: Any) -> _FloatQuadr...

FILE: babeldoc/pdfminer/ccitt.py
  function get_bytes (line 26) | def get_bytes(data: bytes) -> Iterator[int]:
  class BitParser (line 36) | class BitParser:
    method __init__ (line 43) | def __init__(self) -> None:
    method add (line 47) | def add(cls, root: BitParserState, v: int | str, bits: str) -> None:
    method feedbytes (line 63) | def feedbytes(self, data: bytes) -> None:
    method _parse_bit (line 68) | def _parse_bit(self, x: object) -> None:
  class CCITTG4Parser (line 81) | class CCITTG4Parser(BitParser):
    class CCITTException (line 330) | class CCITTException(PDFException):
    class EOFB (line 333) | class EOFB(CCITTException):
    class InvalidData (line 336) | class InvalidData(CCITTException):
    class ByteSkip (line 339) | class ByteSkip(CCITTException):
    method __init__ (line 344) | def __init__(self, width: int, bytealign: bool = False) -> None:
    method feedbytes (line 350) | def feedbytes(self, data: bytes) -> None:
    method _parse_mode (line 361) | def _parse_mode(self, mode: object) -> BitParserState:
    method _parse_horiz1 (line 385) | def _parse_horiz1(self, n: Any) -> BitParserState:
    method _parse_horiz2 (line 398) | def _parse_horiz2(self, n: Any) -> BitParserState:
    method _parse_uncompressed (line 413) | def _parse_uncompressed(self, bits: str | None) -> BitParserState:
    method _get_bits (line 425) | def _get_bits(self) -> str:
    method _get_refline (line 428) | def _get_refline(self, i: int) -> str:
    method reset (line 442) | def reset(self) -> None:
    method output_line (line 449) | def output_line(self, y: int, bits: Sequence[int]) -> None:
    method _reset_line (line 452) | def _reset_line(self) -> None:
    method _flush_line (line 458) | def _flush_line(self) -> None:
    method _do_vertical (line 466) | def _do_vertical(self, dx: int) -> None:
    method _do_pass (line 490) | def _do_pass(self) -> None:
    method _do_horizontal (line 516) | def _do_horizontal(self, n1: int, n2: int) -> None:
    method _do_uncompressed (line 532) | def _do_uncompressed(self, bits: str) -> None:
  class CCITTFaxDecoder (line 539) | class CCITTFaxDecoder(CCITTG4Parser):
    method __init__ (line 540) | def __init__(
    method close (line 550) | def close(self) -> bytes:
    method output_line (line 553) | def output_line(self, y: int, bits: Sequence[int]) -> None:
  function ccittfaxdecode (line 563) | def ccittfaxdecode(data: bytes, params: dict[str, object]) -> bytes:
  function main (line 577) | def main(argv: list[str]) -> None:

FILE: babeldoc/pdfminer/cmapdb.py
  class CMapError (line 43) | class CMapError(PDFException):
  class CMapBase (line 47) | class CMapBase:
    method __init__ (line 50) | def __init__(self, **kwargs: object) -> None:
    method is_vertical (line 53) | def is_vertical(self) -> bool:
    method set_attr (line 56) | def set_attr(self, k: str, v: object) -> None:
    method add_code2cid (line 59) | def add_code2cid(self, code: str, cid: int) -> None:
    method add_cid2unichr (line 62) | def add_cid2unichr(self, cid: int, code: PSLiteral | bytes | int) -> N...
    method use_cmap (line 65) | def use_cmap(self, cmap: "CMapBase") -> None:
    method decode (line 68) | def decode(self, code: bytes) -> Iterable[int]:
  class CMap (line 72) | class CMap(CMapBase):
    method __init__ (line 73) | def __init__(self, **kwargs: str | int) -> None:
    method __repr__ (line 77) | def __repr__(self) -> str:
    method use_cmap (line 80) | def use_cmap(self, cmap: CMapBase) -> None:
    method decode (line 94) | def decode(self, code: bytes) -> Iterator[int]:
    method dump (line 108) | def dump(
  class IdentityCMap (line 125) | class IdentityCMap(CMapBase):
    method decode (line 126) | def decode(self, code: bytes) -> tuple[int, ...]:
  class IdentityCMapByte (line 134) | class IdentityCMapByte(IdentityCMap):
    method decode (line 135) | def decode(self, code: bytes) -> tuple[int, ...]:
  class UnicodeMap (line 143) | class UnicodeMap(CMapBase):
    method __init__ (line 144) | def __init__(self, **kwargs: str | int) -> None:
    method __repr__ (line 148) | def __repr__(self) -> str:
    method get_unichr (line 151) | def get_unichr(self, cid: int) -> str:
    method dump (line 155) | def dump(self, out: TextIO = sys.stdout) -> None:
  class IdentityUnicodeMap (line 160) | class IdentityUnicodeMap(UnicodeMap):
    method get_unichr (line 161) | def get_unichr(self, cid: int) -> str:
  class FileCMap (line 167) | class FileCMap(CMap):
    method add_code2cid (line 168) | def add_code2cid(self, code: str, cid: int) -> None:
  class FileUnicodeMap (line 185) | class FileUnicodeMap(UnicodeMap):
    method add_cid2unichr (line 186) | def add_cid2unichr(self, cid: int, code: PSLiteral | bytes | int) -> N...
  class PyCMap (line 206) | class PyCMap(CMap):
    method __init__ (line 207) | def __init__(self, name: str, module: Any) -> None:
  class PyUnicodeMap (line 214) | class PyUnicodeMap(UnicodeMap):
    method __init__ (line 215) | def __init__(self, name: str, module: Any, vertical: bool) -> None:
  class CMapDB (line 224) | class CMapDB:
    class CMapNotFound (line 228) | class CMapNotFound(CMapError):
    method _load_data (line 232) | def _load_data(cls, name: str) -> Any:
    method get_cmap (line 251) | def get_cmap(cls, name: str) -> CMapBase:
    method get_unicode_map (line 269) | def get_unicode_map(cls, name: str, vertical: bool = False) -> Unicode...
  class CMapParser (line 279) | class CMapParser(PSStackParser[PSKeyword]):
    method __init__ (line 280) | def __init__(self, cmap: CMapBase, fp: BinaryIO) -> None:
    method run (line 287) | def run(self) -> None:
    method do_keyword (line 310) | def do_keyword(self, pos: int, token: PSKeyword) -> None:
    method _warn_once (line 463) | def _warn_once(self, msg: str) -> None:

FILE: babeldoc/pdfminer/converter.py
  class PDFLayoutAnalyzer (line 56) | class PDFLayoutAnalyzer(PDFTextDevice):
    method __init__ (line 60) | def __init__(
    method begin_page (line 71) | def begin_page(self, page: PDFPage, ctm: Matrix) -> None:
    method end_page (line 78) | def end_page(self, page: PDFPage) -> None:
    method begin_figure (line 86) | def begin_figure(self, name: str, bbox: Rect, matrix: Matrix) -> None:
    method end_figure (line 90) | def end_figure(self, _: str) -> None:
    method render_image (line 96) | def render_image(self, name: str, stream: PDFStream) -> None:
    method paint_path (line 105) | def paint_path(
    method render_char (line 258) | def render_char(
    method handle_undefined_char (line 291) | def handle_undefined_char(self, font: PDFFont, cid: int) -> str:
    method receive_layout (line 295) | def receive_layout(self, ltpage: LTPage) -> None:
  class PDFPageAggregator (line 299) | class PDFPageAggregator(PDFLayoutAnalyzer):
    method __init__ (line 300) | def __init__(
    method receive_layout (line 309) | def receive_layout(self, ltpage: LTPage) -> None:
    method get_result (line 312) | def get_result(self) -> LTPage:
  class PDFConverter (line 321) | class PDFConverter(PDFLayoutAnalyzer, Generic[IOType]):
    method __init__ (line 322) | def __init__(
    method _is_binary_stream (line 336) | def _is_binary_stream(outfp: AnyIO) -> bool:
  class TextConverter (line 351) | class TextConverter(PDFConverter[AnyIO]):
    method __init__ (line 352) | def __init__(
    method write_text (line 366) | def write_text(self, text: str) -> None:
    method receive_layout (line 373) | def receive_layout(self, ltpage: LTPage) -> None:
    method render_image (line 394) | def render_image(self, name: str, stream: PDFStream) -> None:
    method paint_path (line 398) | def paint_path(
  class HTMLConverter (line 409) | class HTMLConverter(PDFConverter[AnyIO]):
    method __init__ (line 424) | def __init__(
    method write (line 477) | def write(self, text: str) -> None:
    method write_header (line 483) | def write_header(self) -> None:
    method write_footer (line 495) | def write_footer(self) -> None:
    method write_text (line 503) | def write_text(self, text: str) -> None:
    method place_rect (line 506) | def place_rect(
    method place_border (line 531) | def place_border(self, color: str, borderwidth: int, item: LTComponent...
    method place_image (line 534) | def place_image(
    method place_text (line 559) | def place_text(
    method begin_div (line 583) | def begin_div(
    method end_div (line 611) | def end_div(self, color: str) -> None:
    method put_text (line 617) | def put_text(self, text: str, fontname: str, fontsize: float) -> None:
    method put_newline (line 631) | def put_newline(self) -> None:
    method receive_layout (line 634) | def receive_layout(self, ltpage: LTPage) -> None:
    method close (line 720) | def close(self) -> None:
  class XMLConverter (line 724) | class XMLConverter(PDFConverter[AnyIO]):
    method __init__ (line 727) | def __init__(
    method write (line 754) | def write(self, text: str) -> None:
    method write_header (line 760) | def write_header(self) -> None:
    method write_footer (line 767) | def write_footer(self) -> None:
    method write_text (line 770) | def write_text(self, text: str) -> None:
    method receive_layout (line 775) | def receive_layout(self, ltpage: LTPage) -> None:
    method close (line 882) | def close(self) -> None:
  class HOCRConverter (line 886) | class HOCRConverter(PDFConverter[AnyIO]):
    method __init__ (line 905) | def __init__(
    method bbox_repr (line 926) | def bbox_repr(self, bbox: Rect) -> str:
    method write (line 935) | def write(self, text: str) -> None:
    method write_header (line 942) | def write_header(self) -> None:
    method write_footer (line 967) | def write_footer(self) -> None:
    method write_text (line 973) | def write_text(self, text: str) -> None:
    method write_word (line 978) | def write_word(self) -> None:
    method receive_layout (line 1003) | def receive_layout(self, ltpage: LTPage) -> None:
    method close (line 1061) | def close(self) -> None:

FILE: babeldoc/pdfminer/data_structures.py
  class NumberTree (line 12) | class NumberTree:
    method __init__ (line 18) | def __init__(self, obj: Any):
    method _parse (line 31) | def _parse(self) -> list[tuple[int, Any]]:
    method values (line 46) | def values(self) -> list[tuple[int, Any]]:

FILE: babeldoc/pdfminer/encodingdb.py
  function name2unicode (line 16) | def name2unicode(name: str) -> str:
  function raise_key_error_for_invalid_unicode (line 72) | def raise_key_error_for_invalid_unicode(unicode_digit: int) -> None:
  class EncodingDB (line 85) | class EncodingDB:
    method get_encoding (line 109) | def get_encoding(

FILE: babeldoc/pdfminer/fontmetrics.py
  function convert_font_metrics (line 33) | def convert_font_metrics(path: str) -> None:

FILE: babeldoc/pdfminer/glyphlist.py
  function convert_glyphlist (line 57) | def convert_glyphlist(path: str) -> None:

FILE: babeldoc/pdfminer/high_level.py
  function extract_text_to_fp (line 31) | def extract_text_to_fp(
  function extract_text (line 153) | def extract_text(
  function extract_pages (line 196) | def extract_pages(

FILE: babeldoc/pdfminer/image.py
  function align32 (line 29) | def align32(x: int) -> int:
  class BMPWriter (line 33) | class BMPWriter:
    method __init__ (line 34) | def __init__(self, fp: BinaryIO, bits: int, width: int, height: int) -...
    method write_line (line 88) | def write_line(self, y: int, data: bytes) -> None:
  class ImageWriter (line 93) | class ImageWriter:
    method __init__ (line 99) | def __init__(self, outdir: str) -> None:
    method export_image (line 104) | def export_image(self, image: LTImage) -> str:
    method _save_jpeg (line 142) | def _save_jpeg(self, image: LTImage) -> str:
    method _save_jpeg2000 (line 165) | def _save_jpeg2000(self, image: LTImage) -> str:
    method _save_jbig2 (line 185) | def _save_jbig2(self, image: LTImage) -> str:
    method _save_bmp (line 214) | def _save_bmp(
    method _save_bytes (line 233) | def _save_bytes(self, image: LTImage) -> str:
    method _save_raw (line 263) | def _save_raw(self, image: LTImage) -> str:
    method _is_jbig2_iamge (line 273) | def _is_jbig2_iamge(image: LTImage) -> bool:
    method _create_unique_image_name (line 280) | def _create_unique_image_name(self, image: LTImage, ext: str) -> tuple...

FILE: babeldoc/pdfminer/jbig2.py
  function bit_set (line 43) | def bit_set(bit_pos: int, value: int) -> bool:
  function check_flag (line 47) | def check_flag(flag: int, value: int) -> bool:
  function masked_value (line 51) | def masked_value(mask: int, value: int) -> int:
  function mask_value (line 59) | def mask_value(mask: int, value: int) -> int:
  function unpack_int (line 67) | def unpack_int(format: str, buffer: bytes) -> int:
  class JBIG2StreamReader (line 81) | class JBIG2StreamReader:
    method __init__ (line 84) | def __init__(self, stream: BinaryIO) -> None:
    method get_segments (line 87) | def get_segments(self) -> list[JBIG2Segment]:
    method is_eof (line 107) | def is_eof(self) -> bool:
    method parse_flags (line 114) | def parse_flags(
    method parse_retention_flags (line 126) | def parse_retention_flags(
    method parse_page_assoc (line 171) | def parse_page_assoc(self, segment: JBIG2Segment, page: int, field: by...
    method parse_data_length (line 177) | def parse_data_length(
  class JBIG2StreamWriter (line 197) | class JBIG2StreamWriter:
    method __init__ (line 206) | def __init__(self, stream: BinaryIO) -> None:
    method write_segments (line 209) | def write_segments(
    method write_file (line 244) | def write_file(
    method encode_segment (line 277) | def encode_segment(self, segment: JBIG2Segment) -> bytes:
    method encode_flags (line 289) | def encode_flags(self, value: JBIG2SegmentFlags, segment: JBIG2Segment...
    method encode_retention_flags (line 307) | def encode_retention_flags(
    method encode_data_length (line 354) | def encode_data_length(self, value: int, segment: JBIG2Segment) -> bytes:
    method get_eop_segment (line 359) | def get_eop_segment(self, seg_number: int, page_number: int) -> JBIG2S...
    method get_eof_segment (line 369) | def get_eof_segment(self, seg_number: int) -> JBIG2Segment:

FILE: babeldoc/pdfminer/layout.py
  class IndexAssigner (line 36) | class IndexAssigner:
    method __init__ (line 37) | def __init__(self, index: int = 0) -> None:
    method run (line 40) | def run(self, obj: "LTItem") -> None:
  class LAParams (line 49) | class LAParams:
    method __init__ (line 77) | def __init__(
    method _validate (line 97) | def _validate(self) -> None:
    method __repr__ (line 109) | def __repr__(self) -> str:
  class LTItem (line 117) | class LTItem:
    method analyze (line 120) | def analyze(self, laparams: LAParams) -> None:
  class LTText (line 124) | class LTText:
    method __repr__ (line 127) | def __repr__(self) -> str:
    method get_text (line 130) | def get_text(self) -> str:
  class LTComponent (line 135) | class LTComponent(LTItem):
    method __init__ (line 138) | def __init__(self, bbox: Rect) -> None:
    method __repr__ (line 142) | def __repr__(self) -> str:
    method __lt__ (line 146) | def __lt__(self, _: object) -> bool:
    method __le__ (line 149) | def __le__(self, _: object) -> bool:
    method __gt__ (line 152) | def __gt__(self, _: object) -> bool:
    method __ge__ (line 155) | def __ge__(self, _: object) -> bool:
    method set_bbox (line 158) | def set_bbox(self, bbox: Rect) -> None:
    method is_empty (line 168) | def is_empty(self) -> bool:
    method is_hoverlap (line 171) | def is_hoverlap(self, obj: "LTComponent") -> bool:
    method hdistance (line 175) | def hdistance(self, obj: "LTComponent") -> float:
    method hoverlap (line 182) | def hoverlap(self, obj: "LTComponent") -> float:
    method is_voverlap (line 189) | def is_voverlap(self, obj: "LTComponent") -> bool:
    method vdistance (line 193) | def vdistance(self, obj: "LTComponent") -> float:
    method voverlap (line 200) | def voverlap(self, obj: "LTComponent") -> float:
  class LTCurve (line 208) | class LTCurve(LTComponent):
    method __init__ (line 217) | def __init__(
    method get_pts (line 240) | def get_pts(self) -> str:
  class LTLine (line 244) | class LTLine(LTCurve):
    method __init__ (line 250) | def __init__(
  class LTRect (line 277) | class LTRect(LTCurve):
    method __init__ (line 283) | def __init__(
  class LTImage (line 310) | class LTImage(LTComponent):
    method __init__ (line 316) | def __init__(self, name: str, stream: PDFStream, bbox: Rect) -> None:
    method __repr__ (line 327) | def __repr__(self) -> str:
  class LTAnno (line 331) | class LTAnno(LTItem, LTText):
    method __init__ (line 339) | def __init__(self, text: str) -> None:
    method get_text (line 342) | def get_text(self) -> str:
  class LTChar (line 346) | class LTChar(LTComponent, LTText):
    method __init__ (line 349) | def __init__(
    method __repr__ (line 400) | def __repr__(self) -> str:
    method get_text (line 403) | def get_text(self) -> str:
  class LTContainer (line 410) | class LTContainer(LTComponent, Generic[LTItemT]):
    method __init__ (line 413) | def __init__(self, bbox: Rect) -> None:
    method __iter__ (line 417) | def __iter__(self) -> Iterator[LTItemT]:
    method __len__ (line 420) | def __len__(self) -> int:
    method add (line 423) | def add(self, obj: LTItemT) -> None:
    method extend (line 426) | def extend(self, objs: Iterable[LTItemT]) -> None:
    method analyze (line 430) | def analyze(self, laparams: LAParams) -> None:
  class LTExpandableContainer (line 435) | class LTExpandableContainer(LTContainer[LTItemT]):
    method __init__ (line 436) | def __init__(self) -> None:
    method add (line 441) | def add(self, obj: LTComponent) -> None:  # type: ignore[override]
  class LTTextContainer (line 453) | class LTTextContainer(LTExpandableContainer[LTItemT], LTText):
    method __init__ (line 454) | def __init__(self) -> None:
    method get_text (line 458) | def get_text(self) -> str:
  class LTTextLine (line 467) | class LTTextLine(LTTextContainer[TextLineElement]):
    method __init__ (line 474) | def __init__(self, word_margin: float) -> None:
    method __repr__ (line 478) | def __repr__(self) -> str:
    method analyze (line 481) | def analyze(self, laparams: LAParams) -> None:
    method find_neighbors (line 486) | def find_neighbors(
    method is_empty (line 493) | def is_empty(self) -> bool:
  class LTTextLineHorizontal (line 497) | class LTTextLineHorizontal(LTTextLine):
    method __init__ (line 498) | def __init__(self, word_margin: float) -> None:
    method add (line 504) | def add(self, obj: LTComponent) -> None:  # type: ignore[override]
    method find_neighbors (line 512) | def find_neighbors(
    method _is_left_aligned_with (line 540) | def _is_left_aligned_with(self, other: LTComponent, tolerance: float =...
    method _is_right_aligned_with (line 544) | def _is_right_aligned_with(self, other: LTComponent, tolerance: float ...
    method _is_centrally_aligned_with (line 548) | def _is_centrally_aligned_with(
    method _is_same_height_as (line 556) | def _is_same_height_as(self, other: LTComponent, tolerance: float = 0)...
  class LTTextLineVertical (line 560) | class LTTextLineVertical(LTTextLine):
    method __init__ (line 561) | def __init__(self, word_margin: float) -> None:
    method add (line 567) | def add(self, obj: LTComponent) -> None:  # type: ignore[override]
    method find_neighbors (line 575) | def find_neighbors(
    method _is_lower_aligned_with (line 603) | def _is_lower_aligned_with(self, other: LTComponent, tolerance: float ...
    method _is_upper_aligned_with (line 607) | def _is_upper_aligned_with(self, other: LTComponent, tolerance: float ...
    method _is_centrally_aligned_with (line 611) | def _is_centrally_aligned_with(
    method _is_same_width_as (line 619) | def _is_same_width_as(self, other: LTComponent, tolerance: float) -> b...
  class LTTextBox (line 623) | class LTTextBox(LTTextContainer[LTTextLine]):
    method __init__ (line 631) | def __init__(self) -> None:
    method __repr__ (line 635) | def __repr__(self) -> str:
    method get_writing_mode (line 638) | def get_writing_mode(self) -> str:
  class LTTextBoxHorizontal (line 642) | class LTTextBoxHorizontal(LTTextBox):
    method analyze (line 643) | def analyze(self, laparams: LAParams) -> None:
    method get_writing_mode (line 647) | def get_writing_mode(self) -> str:
  class LTTextBoxVertical (line 651) | class LTTextBoxVertical(LTTextBox):
    method analyze (line 652) | def analyze(self, laparams: LAParams) -> None:
    method get_writing_mode (line 656) | def get_writing_mode(self) -> str:
  class LTTextGroup (line 663) | class LTTextGroup(LTTextContainer[TextGroupElement]):
    method __init__ (line 664) | def __init__(self, objs: Iterable[TextGroupElement]) -> None:
  class LTTextGroupLRTB (line 669) | class LTTextGroupLRTB(LTTextGroup):
    method analyze (line 670) | def analyze(self, laparams: LAParams) -> None:
  class LTTextGroupTBRL (line 681) | class LTTextGroupTBRL(LTTextGroup):
    method analyze (line 682) | def analyze(self, laparams: LAParams) -> None:
  class LTLayoutContainer (line 693) | class LTLayoutContainer(LTContainer[LTComponent]):
    method __init__ (line 694) | def __init__(self, bbox: Rect) -> None:
    method group_objects (line 699) | def group_objects(
    method group_textlines (line 776) | def group_textlines(
    method group_textboxes (line 810) | def group_textboxes(
    method analyze (line 903) | def analyze(self, laparams: LAParams) -> None:
  class LTFigure (line 941) | class LTFigure(LTLayoutContainer):
    method __init__ (line 949) | def __init__(self, name: str, bbox: Rect, matrix: Matrix) -> None:
    method __repr__ (line 957) | def __repr__(self) -> str:
    method analyze (line 960) | def analyze(self, laparams: LAParams) -> None:
  class LTPage (line 966) | class LTPage(LTLayoutContainer):
    method __init__ (line 973) | def __init__(self, pageid: int, bbox: Rect, rotate: float = 0) -> None:
    method __repr__ (line 978) | def __repr__(self) -> str:

FILE: babeldoc/pdfminer/lzw.py
  class CorruptDataError (line 13) | class CorruptDataError(PDFException):
  class LZWDecoder (line 17) | class LZWDecoder:
    method __init__ (line 18) | def __init__(self, fp: BinaryIO) -> None:
    method readbits (line 27) | def readbits(self, bits: int) -> int:
    method feed (line 52) | def feed(self, code: int) -> bytes:
    method run (line 83) | def run(self) -> Iterator[bytes]:
  function lzwdecode (line 105) | def lzwdecode(data: bytes) -> bytes:

FILE: babeldoc/pdfminer/pdfcolor.py
  class PDFColorSpace (line 14) | class PDFColorSpace:
    method __init__ (line 15) | def __init__(self, name: str, ncomponents: int) -> None:
    method __repr__ (line 19) | def __repr__(self) -> str:

FILE: babeldoc/pdfminer/pdfdevice.py
  class PDFDevice (line 33) | class PDFDevice:
    method __init__ (line 36) | def __init__(self, rsrcmgr: "PDFResourceManager") -> None:
    method __repr__ (line 40) | def __repr__(self) -> str:
    method __enter__ (line 43) | def __enter__(self) -> "PDFDevice":
    method __exit__ (line 46) | def __exit__(self, exc_type: object, exc_val: object, exc_tb: object) ...
    method close (line 49) | def close(self) -> None:
    method set_ctm (line 52) | def set_ctm(self, ctm: Matrix) -> None:
    method begin_tag (line 55) | def begin_tag(self, tag: PSLiteral, props: Optional["PDFStackT"] = Non...
    method end_tag (line 58) | def end_tag(self) -> None:
    method do_tag (line 61) | def do_tag(self, tag: PSLiteral, props: Optional["PDFStackT"] = None) ...
    method begin_page (line 64) | def begin_page(self, page: PDFPage, ctm: Matrix) -> None:
    method end_page (line 67) | def end_page(self, page: PDFPage) -> None:
    method begin_figure (line 70) | def begin_figure(self, name: str, bbox: Rect, matrix: Matrix) -> None:
    method end_figure (line 73) | def end_figure(self, name: str) -> None:
    method paint_path (line 76) | def paint_path(
    method render_image (line 86) | def render_image(self, name: str, stream: PDFStream) -> None:
    method render_string (line 89) | def render_string(
  class PDFTextDevice (line 99) | class PDFTextDevice(PDFDevice):
    method render_string (line 100) | def render_string(
    method render_string_horizontal (line 151) | def render_string_horizontal(
    method render_string_vertical (line 195) | def render_string_vertical(
    method render_char (line 239) | def render_char(
  class TagExtractor (line 253) | class TagExtractor(PDFDevice):
    method __init__ (line 254) | def __init__(
    method render_string (line 266) | def render_string(
    method begin_page (line 290) | def begin_page(self, page: PDFPage, ctm: Matrix) -> None:
    method end_page (line 298) | def end_page(self, page: PDFPage) -> None:
    method begin_tag (line 302) | def begin_tag(self, tag: PSLiteral, props: Optional["PDFStackT"] = Non...
    method end_tag (line 315) | def end_tag(self) -> None:
    method do_tag (line 321) | def do_tag(self, tag: PSLiteral, props: Optional["PDFStackT"] = None) ...
    method _write (line 325) | def _write(self, s: str) -> None:

FILE: babeldoc/pdfminer/pdfdocument.py
  class PDFNoValidXRef (line 55) | class PDFNoValidXRef(PDFSyntaxError):
  class PDFNoValidXRefWarning (line 59) | class PDFNoValidXRefWarning(SyntaxWarning):
  class PDFNoOutlines (line 66) | class PDFNoOutlines(PDFException):
  class PDFNoPageLabels (line 70) | class PDFNoPageLabels(PDFException):
  class PDFDestinationNotFound (line 74) | class PDFDestinationNotFound(PDFException):
  class PDFEncryptionError (line 78) | class PDFEncryptionError(PDFException):
  class PDFPasswordIncorrect (line 82) | class PDFPasswordIncorrect(PDFEncryptionError):
  class PDFEncryptionWarning (line 86) | class PDFEncryptionWarning(UserWarning):
  class PDFTextExtractionNotAllowedWarning (line 93) | class PDFTextExtractionNotAllowedWarning(UserWarning):
  class PDFTextExtractionNotAllowed (line 100) | class PDFTextExtractionNotAllowed(PDFEncryptionError):
  class PDFBaseXRef (line 110) | class PDFBaseXRef:
    method get_trailer (line 111) | def get_trailer(self) -> dict[str, Any]:
    method get_objids (line 114) | def get_objids(self) -> Iterable[int]:
    method get_pos (line 120) | def get_pos(self, objid: int) -> tuple[int | None, int, int]:
    method load (line 123) | def load(self, parser: PDFParser) -> None:
  class PDFXRef (line 127) | class PDFXRef(PDFBaseXRef):
    method __init__ (line 128) | def __init__(self) -> None:
    method __repr__ (line 132) | def __repr__(self) -> str:
    method load (line 135) | def load(self, parser: PDFParser) -> None:
    method load_trailer (line 183) | def load_trailer(self, parser: PDFParser) -> None:
    method get_trailer (line 196) | def get_trailer(self) -> dict[str, Any]:
    method get_objids (line 199) | def get_objids(self) -> KeysView[int]:
    method get_pos (line 202) | def get_pos(self, objid: int) -> tuple[int | None, int, int]:
  class PDFXRefFallback (line 206) | class PDFXRefFallback(PDFXRef):
    method __repr__ (line 207) | def __repr__(self) -> str:
    method load (line 212) | def load(self, parser: PDFParser) -> None:
  class PDFXRefStream (line 257) | class PDFXRefStream(PDFBaseXRef):
    method __init__ (line 258) | def __init__(self) -> None:
    method __repr__ (line 266) | def __repr__(self) -> str:
    method load (line 269) | def load(self, parser: PDFParser) -> None:
    method get_trailer (line 294) | def get_trailer(self) -> dict[str, Any]:
    method get_objids (line 297) | def get_objids(self) -> Iterator[int]:
    method get_pos (line 308) | def get_pos(self, objid: int) -> tuple[int | None, int, int]:
  class PDFStandardSecurityHandler (line 335) | class PDFStandardSecurityHandler:
    method __init__ (line 341) | def __init__(
    method init (line 352) | def init(self) -> None:
    method init_params (line 359) | def init_params(self) -> None:
    method init_key (line 367) | def init_key(self) -> None:
    method is_printable (line 372) | def is_printable(self) -> bool:
    method is_modifiable (line 375) | def is_modifiable(self) -> bool:
    method is_extractable (line 378) | def is_extractable(self) -> bool:
    method compute_u (line 381) | def compute_u(self, key: bytes) -> bytes:
    method compute_encryption_key (line 396) | def compute_encryption_key(self, password: bytes) -> bytes:
    method authenticate (line 415) | def authenticate(self, password: str) -> bytes | None:
    method authenticate_user_password (line 422) | def authenticate_user_password(self, password: bytes) -> bytes | None:
    method verify_encryption_key (line 429) | def verify_encryption_key(self, key: bytes) -> bool:
    method authenticate_owner_password (line 436) | def authenticate_owner_password(self, password: bytes) -> bytes | None:
    method decrypt (line 456) | def decrypt(
    method decrypt_rc4 (line 465) | def decrypt_rc4(self, objid: int, genno: int, data: bytes) -> bytes:
  class PDFStandardSecurityHandlerV4 (line 473) | class PDFStandardSecurityHandlerV4(PDFStandardSecurityHandler):
    method init_params (line 476) | def init_params(self) -> None:
    method get_cfm (line 498) | def get_cfm(self, name: str) -> Callable[[int, int, bytes], bytes] | N...
    method decrypt (line 506) | def decrypt(
    method decrypt_identity (line 522) | def decrypt_identity(self, objid: int, genno: int, data: bytes) -> bytes:
    method decrypt_aes128 (line 525) | def decrypt_aes128(self, objid: int, genno: int, data: bytes) -> bytes:
  class PDFStandardSecurityHandlerV5 (line 545) | class PDFStandardSecurityHandlerV5(PDFStandardSecurityHandlerV4):
    method init_params (line 548) | def init_params(self) -> None:
    method get_cfm (line 560) | def get_cfm(self, name: str) -> Callable[[int, int, bytes], bytes] | N...
    method authenticate (line 566) | def authenticate(self, password: str) -> bytes | None:
    method _normalize_password (line 588) | def _normalize_password(self, password: str) -> bytes:
    method _password_hash (line 598) | def _password_hash(
    method _r5_password (line 609) | def _r5_password(
    method _r6_password (line 622) | def _r6_password(
    method _bytes_mod_3 (line 648) | def _bytes_mod_3(input_bytes: bytes) -> int:
    method _aes_cbc_encrypt (line 652) | def _aes_cbc_encrypt(self, key: bytes, iv: bytes, data: bytes) -> bytes:
    method decrypt_aes256 (line 657) | def decrypt_aes256(self, objid: int, genno: int, data: bytes) -> bytes:
  class PDFDocument (line 669) | class PDFDocument:
    method __init__ (line 689) | def __init__(
    method _initialize_password (line 752) | def _initialize_password(self, password: str = "") -> None:
    method _getobj_objstm (line 769) | def _getobj_objstm(self, stream: PDFStream, index: int, objid: int) ->...
    method _get_objects (line 784) | def _get_objects(self, stream: PDFStream) -> tuple[list[object], int]:
    method _getobj_parse (line 805) | def _getobj_parse(self, pos: int, objid: int) -> object:
    method getobj (line 833) | def getobj(self, objid: int) -> object:
    method get_outlines (line 873) | def get_outlines(self) -> Iterator[OutlineType]:
    method get_page_labels (line 893) | def get_page_labels(self) -> Iterator[str]:
    method lookup_name (line 910) | def lookup_name(self, cat: str, key: str | bytes) -> Any:
    method get_dest (line 938) | def get_dest(self, name: str | bytes) -> Any:
    method find_xref (line 953) | def find_xref(self, parser: PDFParser) -> int:
    method read_xref_from (line 980) | def read_xref_from(
  class PageLabels (line 1017) | class PageLabels(NumberTree):
    method labels (line 1024) | def labels(self) -> Iterator[str]:
    method _format_page_label (line 1055) | def _format_page_label(value: int, style: Any) -> str:

FILE: babeldoc/pdfminer/pdfexceptions.py
  class PDFException (line 4) | class PDFException(PSException):
  class PDFTypeError (line 8) | class PDFTypeError(PDFException, TypeError):
  class PDFValueError (line 12) | class PDFValueError(PDFException, ValueError):
  class PDFObjectNotFound (line 16) | class PDFObjectNotFound(PDFException):
  class PDFNotImplementedError (line 20) | class PDFNotImplementedError(PDFException, NotImplementedError):
  class PDFKeyError (line 24) | class PDFKeyError(PDFException, KeyError):
  class PDFEOFError (line 28) | class PDFEOFError(PDFException, EOFError):
  class PDFIOError (line 32) | class PDFIOError(PDFException, IOError):

FILE: babeldoc/pdfminer/pdffont.py
  function get_widths (line 58) | def get_widths(seq: Iterable[object]) -> dict[str | int, float]:
  function get_widths2 (line 89) | def get_widths2(seq: Iterable[object]) -> dict[int, tuple[float, Point]]:
  class FontMetricsDB (line 110) | class FontMetricsDB:
    method get_metrics (line 112) | def get_metrics(cls, fontname: str) -> tuple[dict[str, object], dict[s...
  class Type1FontHeaderParser (line 117) | class Type1FontHeaderParser(PSStackParser[int]):
    method __init__ (line 127) | def __init__(self, data: BinaryIO) -> None:
    method get_encoding (line 131) | def get_encoding(self) -> dict[int, str]:
    method do_keyword (line 156) | def do_keyword(self, pos: int, token: PSKeyword) -> None:
  function getdict (line 173) | def getdict(data: bytes) -> dict[int, list[float | int]]:
  class CFFFont (line 219) | class CFFFont:
    class INDEX (line 614) | class INDEX:
      method __init__ (line 615) | def __init__(self, fp: BinaryIO) -> None:
      method __repr__ (line 624) | def __repr__(self) -> str:
      method __len__ (line 627) | def __len__(self) -> int:
      method __getitem__ (line 630) | def __getitem__(self, i: int) -> bytes:
      method __iter__ (line 634) | def __iter__(self) -> Iterator[bytes]:
    method __init__ (line 637) | def __init__(self, name: str, fp: BinaryIO) -> None:
    method getstr (line 717) | def getstr(self, sid: int) -> str | bytes:
  class TrueTypeFont (line 725) | class TrueTypeFont:
    class CMapNotFound (line 726) | class CMapNotFound(PDFException):
    method __init__ (line 729) | def __init__(self, name: str, fp: BinaryIO) -> None:
    method create_unicode_map (line 751) | def create_unicode_map(self) -> FileUnicodeMap:
  class PDFFontError (line 768) | class PDFFontError(PDFException):
  class PDFUnicodeNotDefined (line 772) | class PDFUnicodeNotDefined(PDFFontError):
  class PDFFont (line 784) | class PDFFont:
    method __init__ (line 785) | def __init__(
    method __repr__ (line 816) | def __repr__(self) -> str:
    method is_vertical (line 819) | def is_vertical(self) -> bool:
    method is_multibyte (line 822) | def is_multibyte(self) -> bool:
    method decode (line 825) | def decode(self, bytes: bytes) -> Iterable[int]:
    method get_ascent (line 828) | def get_ascent(self) -> float:
    method get_descent (line 832) | def get_descent(self) -> float:
    method get_width (line 836) | def get_width(self) -> float:
    method get_height (line 842) | def get_height(self) -> float:
    method char_width (line 848) | def char_width(self, cid: int) -> float:
    method char_disp (line 866) | def char_disp(self, cid: int) -> float | tuple[float | None, float]:
    method string_width (line 870) | def string_width(self, s: bytes) -> float:
    method to_unichr (line 873) | def to_unichr(self, cid: int) -> str:
    method _parse_bbox (line 877) | def _parse_bbox(descriptor: Mapping[str, Any]) -> Rect:
  class PDFSimpleFont (line 889) | class PDFSimpleFont(PDFFont):
    method __init__ (line 890) | def __init__(
    method to_unichr (line 916) | def to_unichr(self, cid: int) -> str:
  class PDFType1Font (line 928) | class PDFType1Font(PDFSimpleFont):
    method __init__ (line 929) | def __init__(self, rsrcmgr: "PDFResourceManager", spec: Mapping[str, A...
    method __repr__ (line 960) | def __repr__(self) -> str:
  class PDFTrueTypeFont (line 964) | class PDFTrueTypeFont(PDFType1Font):
    method __repr__ (line 965) | def __repr__(self) -> str:
  class PDFType3Font (line 969) | class PDFType3Font(PDFSimpleFont):
    method __init__ (line 970) | def __init__(self, rsrcmgr: "PDFResourceManager", spec: Mapping[str, A...
    method __repr__ (line 986) | def __repr__(self) -> str:
  class PDFCIDFont (line 990) | class PDFCIDFont(PDFFont):
    method __init__ (line 993) | def __init__(
    method get_cmap_from_spec (line 1088) | def get_cmap_from_spec(self, spec: Mapping[str, Any], strict: bool) ->...
    method _get_cmap_name (line 1107) | def _get_cmap_name(spec: Mapping[str, Any], strict: bool) -> str:
    method __repr__ (line 1130) | def __repr__(self) -> str:
    method is_vertical (line 1133) | def is_vertical(self) -> bool:
    method is_multibyte (line 1136) | def is_multibyte(self) -> bool:
    method decode (line 1139) | def decode(self, bytes: bytes) -> Iterable[int]:
    method char_disp (line 1150) | def char_disp(self, cid: int) -> float | tuple[float | None, float]:
    method to_unichr (line 1154) | def to_unichr(self, cid: int) -> str:

FILE: babeldoc/pdfminer/pdfinterp.py
  class PDFResourceError (line 59) | class PDFResourceError(PDFException):
  class PDFInterpreterError (line 63) | class PDFInterpreterError(PDFException):
  class PDFTextState (line 74) | class PDFTextState:
    method __init__ (line 78) | def __init__(self) -> None:
    method __repr__ (line 91) | def __repr__(self) -> str:
    method copy (line 110) | def copy(self) -> "PDFTextState":
    method reset (line 125) | def reset(self) -> None:
  class PDFGraphicState (line 137) | class PDFGraphicState:
    method __init__ (line 138) | def __init__(self) -> None:
    method copy (line 153) | def copy(self) -> "PDFGraphicState":
    method __repr__ (line 166) | def __repr__(self) -> str:
  class PDFResourceManager (line 185) | class PDFResourceManager:
    method __init__ (line 193) | def __init__(self, caching: bool = True) -> None:
    method get_procset (line 197) | def get_procset(self, procs: Sequence[object]) -> None:
    method get_cmap (line 204) | def get_cmap(self, cmapname: str, strict: bool = False) -> CMapBase:
    method get_font (line 212) | def get_font(self, objid: object, spec: Mapping[str, object]) -> PDFFont:
  class PDFContentParser (line 257) | class PDFContentParser(PSStackParser[Union[PSKeyword, PDFStream]]):
    method __init__ (line 258) | def __init__(self, streams: Sequence[object]) -> None:
    method fillfp (line 266) | def fillfp(self) -> None:
    method seek (line 275) | def seek(self, pos: int) -> None:
    method fillbuf (line 279) | def fillbuf(self) -> None:
    method get_inline_data (line 291) | def get_inline_data(self, pos: int, target: bytes = b"EI") -> tuple[in...
    method flush (line 324) | def flush(self) -> None:
    method do_keyword (line 331) | def do_keyword(self, pos: int, token: PSKeyword) -> None:
  class PDFPageInterpreter (line 367) | class PDFPageInterpreter:
    method __init__ (line 373) | def __init__(self, rsrcmgr: PDFResourceManager, device: PDFDevice) -> ...
    method dup (line 377) | def dup(self) -> "PDFPageInterpreter":
    method init_resources (line 380) | def init_resources(self, resources: dict[object, object]) -> None:
    method init_state (line 421) | def init_state(self, ctm: Matrix) -> None:
    method push (line 438) | def push(self, obj: PDFStackT) -> None:
    method pop (line 441) | def pop(self, n: int) -> list[PDFStackT]:
    method get_current_state (line 448) | def get_current_state(self) -> tuple[Matrix, PDFTextState, PDFGraphicS...
    method set_current_state (line 451) | def set_current_state(
    method do_q (line 458) | def do_q(self) -> None:
    method do_Q (line 462) | def do_Q(self) -> None:
    method do_cm (line 467) | def do_cm(
    method do_w (line 487) | def do_w(self, linewidth: PDFStackT) -> None:
    method do_J (line 497) | def do_J(self, linecap: PDFStackT) -> None:
    method do_j (line 501) | def do_j(self, linejoin: PDFStackT) -> None:
    method do_M (line 505) | def do_M(self, miterlimit: PDFStackT) -> None:
    method do_d (line 509) | def do_d(self, dash: PDFStackT, phase: PDFStackT) -> None:
    method do_ri (line 513) | def do_ri(self, intent: PDFStackT) -> None:
    method do_i (line 517) | def do_i(self, flatness: PDFStackT) -> None:
    method do_gs (line 521) | def do_gs(self, name: PDFStackT) -> None:
    method do_m (line 525) | def do_m(self, x: PDFStackT, y: PDFStackT) -> None:
    method do_l (line 539) | def do_l(self, x: PDFStackT, y: PDFStackT) -> None:
    method do_c (line 552) | def do_c(
    method do_v (line 584) | def do_v(self, x2: PDFStackT, y2: PDFStackT, x3: PDFStackT, y3: PDFSta...
    method do_y (line 599) | def do_y(self, x1: PDFStackT, y1: PDFStackT, x3: PDFStackT, y3: PDFSta...
    method do_h (line 614) | def do_h(self) -> None:
    method do_re (line 618) | def do_re(self, x: PDFStackT, y: PDFStackT, w: PDFStackT, h: PDFStackT...
    method do_S (line 637) | def do_S(self) -> None:
    method do_s (line 642) | def do_s(self) -> None:
    method do_f (line 647) | def do_f(self) -> None:
    method do_F (line 652) | def do_F(self) -> None:
    method do_f_a (line 655) | def do_f_a(self) -> None:
    method do_B (line 660) | def do_B(self) -> None:
    method do_B_a (line 665) | def do_B_a(self) -> None:
    method do_b (line 670) | def do_b(self) -> None:
    method do_b_a (line 675) | def do_b_a(self) -> None:
    method do_n (line 680) | def do_n(self) -> None:
    method do_W (line 684) | def do_W(self) -> None:
    method do_W_a (line 688) | def do_W_a(self) -> None:
    method do_CS (line 692) | def do_CS(self, name: PDFStackT) -> None:
    method do_cs (line 703) | def do_cs(self, name: PDFStackT) -> None:
    method do_G (line 711) | def do_G(self, gray: PDFStackT) -> None:
    method do_g (line 723) | def do_g(self, gray: PDFStackT) -> None:
    method do_RG (line 735) | def do_RG(self, r: PDFStackT, g: PDFStackT, b: PDFStackT) -> None:
    method do_rg (line 747) | def do_rg(self, r: PDFStackT, g: PDFStackT, b: PDFStackT) -> None:
    method do_K (line 759) | def do_K(self, c: PDFStackT, m: PDFStackT, y: PDFStackT, k: PDFStackT)...
    method do_k (line 771) | def do_k(self, c: PDFStackT, m: PDFStackT, y: PDFStackT, k: PDFStackT)...
    method do_SCN (line 783) | def do_SCN(self) -> None:
    method do_scn (line 828) | def do_scn(self) -> None:
    method do_SC (line 874) | def do_SC(self) -> None:
    method do_sc (line 878) | def do_sc(self) -> None:
    method do_sh (line 882) | def do_sh(self, name: object) -> None:
    method do_BT (line 885) | def do_BT(self) -> None:
    method do_ET (line 894) | def do_ET(self) -> None:
    method do_BX (line 897) | def do_BX(self) -> None:
    method do_EX (line 900) | def do_EX(self) -> None:
    method do_MP (line 903) | def do_MP(self, tag: PDFStackT) -> None:
    method do_DP (line 912) | def do_DP(self, tag: PDFStackT, props: PDFStackT) -> None:
    method do_BMC (line 921) | def do_BMC(self, tag: PDFStackT) -> None:
    method do_BDC (line 930) | def do_BDC(self, tag: PDFStackT, props: PDFStackT) -> None:
    method do_EMC (line 939) | def do_EMC(self) -> None:
    method do_Tc (line 943) | def do_Tc(self, space: PDFStackT) -> None:
    method do_Tw (line 958) | def do_Tw(self, space: PDFStackT) -> None:
    method do_Tz (line 973) | def do_Tz(self, scale: PDFStackT) -> None:
    method do_TL (line 987) | def do_TL(self, leading: PDFStackT) -> None:
    method do_Tf (line 1002) | def do_Tf(self, fontid: PDFStackT, fontsize: PDFStackT) -> None:
    method do_Tr (line 1025) | def do_Tr(self, render: PDFStackT) -> None:
    method do_Ts (line 1036) | def do_Ts(self, rise: PDFStackT) -> None:
    method do_Td (line 1050) | def do_Td(self, tx: PDFStackT, ty: PDFStackT) -> None:
    method do_TD (line 1068) | def do_TD(self, tx: PDFStackT, ty: PDFStackT) -> None:
    method do_Tm (line 1091) | def do_Tm(
    method do_T_a (line 1112) | def do_T_a(self) -> None:
    method do_TJ (line 1125) | def do_TJ(self, seq: PDFStackT) -> None:
    method do_Tj (line 1139) | def do_Tj(self, s: PDFStackT) -> None:
    method do__q (line 1143) | def do__q(self, s: PDFStackT) -> None:
    method do__w (line 1151) | def do__w(self, aw: PDFStackT, ac: PDFStackT, s: PDFStackT) -> None:
    method do_BI (line 1160) | def do_BI(self) -> None:
    method do_ID (line 1163) | def do_ID(self) -> None:
    method do_EI (line 1166) | def do_EI(self, obj: PDFStackT) -> None:
    method do_Do (line 1174) | def do_Do(self, xobjid_arg: PDFStackT) -> None:
    method process_page (line 1212) | def process_page(self, page: PDFPage) -> None:
    method render_contents (line 1227) | def render_contents(
    method execute (line 1247) | def execute(self, streams: Sequence[object]) -> None:

FILE: babeldoc/pdfminer/pdfpage.py
  class PDFPage (line 30) | class PDFPage:
    method __init__ (line 54) | def __init__(
    method __repr__ (line 93) | def __repr__(self) -> str:
    method create_pages (line 99) | def create_pages(cls, document: PDFDocument) -> Iterator["PDFPage"]:
    method get_pages (line 161) | def get_pages(
    method _parse_mediabox (line 197) | def _parse_mediabox(self, value: Any) -> Rect:
    method _parse_cropbox (line 214) | def _parse_cropbox(self, value: Any, mediabox: Rect) -> Rect:
    method _parse_contents (line 226) | def _parse_contents(self, value: Any) -> list[Any]:

FILE: babeldoc/pdfminer/pdfparser.py
  class PDFSyntaxError (line 25) | class PDFSyntaxError(PDFException):
  class PDFParser (line 30) | class PDFParser(PSStackParser[Union[PSKeyword, PDFStream, PDFObjRef, Non...
    method __init__ (line 46) | def __init__(self, fp: BinaryIO) -> None:
    method set_document (line 51) | def set_document(self, doc: "PDFDocument") -> None:
    method do_keyword (line 62) | def do_keyword(self, pos: int, token: PSKeyword) -> None:
  class PDFStreamParser (line 139) | class PDFStreamParser(PDFParser):
    method __init__ (line 147) | def __init__(self, data: bytes) -> None:
    method flush (line 150) | def flush(self) -> None:
    method do_keyword (line 155) | def do_keyword(self, pos: int, token: PSKeyword) -> None:

FILE: babeldoc/pdfminer/pdftypes.py
  class DecipherCallable (line 42) | class DecipherCallable(Protocol):
    method __call__ (line 45) | def __call__(
  class PDFObject (line 55) | class PDFObject(PSObject):
  class PDFObjRef (line 69) | class PDFObjRef(PDFObject):
    method __init__ (line 70) | def __init__(
    method __repr__ (line 96) | def __repr__(self) -> str:
    method resolve (line 99) | def resolve(self, default: object = None) -> Any:
  function resolve1 (line 107) | def resolve1(x: object, default: object = None) -> Any:
  function resolve_all (line 118) | def resolve_all(x: object, default: object = None) -> Any:
  function decipher_all (line 134) | def decipher_all(decipher: DecipherCallable, objid: int, genno: int, x: ...
  function int_value (line 148) | def int_value(x: object) -> int:
  function float_value (line 157) | def float_value(x: object) -> float:
  function num_value (line 166) | def num_value(x: object) -> float:
  function uint_value (line 175) | def uint_value(x: object, n_bits: int) -> int:
  function str_value (line 184) | def str_value(x: object) -> bytes:
  function list_value (line 193) | def list_value(x: object) -> list[Any] | tuple[Any, ...]:
  function dict_value (line 202) | def dict_value(x: object) -> dict[Any, Any]:
  function stream_value (line 212) | def stream_value(x: object) -> "PDFStream":
  function decompress_corrupted (line 221) | def decompress_corrupted(data: bytes) -> bytes:
  class PDFStream (line 242) | class PDFStream(PDFObject):
    method __init__ (line 243) | def __init__(
    method set_objid (line 257) | def set_objid(self, objid: int, genno: int) -> None:
    method __repr__ (line 261) | def __repr__(self) -> str:
    method __contains__ (line 277) | def __contains__(self, name: object) -> bool:
    method __getitem__ (line 280) | def __getitem__(self, name: str) -> Any:
    method get (line 283) | def get(self, name: str, default: object = None) -> Any:
    method get_any (line 286) | def get_any(self, names: Iterable[str], default: object = None) -> Any:
    method get_filters (line 292) | def get_filters(self) -> list[tuple[Any, Any]]:
    method decode (line 309) | def decode(self) -> None:
    method get_data (line 387) | def get_data(self) -> bytes:
    method get_rawdata (line 393) | def get_rawdata(self) -> bytes | None:

FILE: babeldoc/pdfminer/psexceptions.py
  class PSException (line 1) | class PSException(Exception):
  class PSEOF (line 5) | class PSEOF(PSException):
  class PSSyntaxError (line 9) | class PSSyntaxError(PSException):
  class PSTypeError (line 13) | class PSTypeError(PSException):
  class PSValueError (line 17) | class PSValueError(PSException):

FILE: babeldoc/pdfminer/psparser.py
  class PSObject (line 27) | class PSObject:
  class PSLiteral (line 31) | class PSLiteral(PSObject):
    method __init__ (line 45) | def __init__(self, name: NameType) -> None:
    method __repr__ (line 48) | def __repr__(self) -> str:
  class PSKeyword (line 53) | class PSKeyword(PSObject):
    method __init__ (line 64) | def __init__(self, name: bytes) -> None:
    method __repr__ (line 67) | def __repr__(self) -> str:
  class PSSymbolTable (line 75) | class PSSymbolTable(Generic[_SymbolT]):
    method __init__ (line 81) | def __init__(self, klass: type[_SymbolT]) -> None:
    method intern (line 85) | def intern(self, name: PSLiteral.NameType) -> _SymbolT:
  function literal_name (line 108) | def literal_name(x: Any) -> str:
  function keyword_name (line 122) | def keyword_name(x: Any) -> Any:
  class PSBaseParser (line 159) | class PSBaseParser:
    method __init__ (line 164) | def __init__(self, fp: BinaryIO) -> None:
    method __repr__ (line 169) | def __repr__(self) -> str:
    method flush (line 172) | def flush(self) -> None:
    method close (line 175) | def close(self) -> None:
    method tell (line 178) | def tell(self) -> int:
    method poll (line 181) | def poll(self, pos: int | None = None, n: int = 80) -> None:
    method seek (line 189) | def seek(self, pos: int) -> None:
    method fillbuf (line 204) | def fillbuf(self) -> None:
    method nextline (line 214) | def nextline(self) -> tuple[int, bytes]:
    method revreadlines (line 243) | def revreadlines(self) -> Iterator[bytes]:
    method _parse_main (line 267) | def _parse_main(self, s: bytes, i: int) -> int:
    method _add_token (line 313) | def _add_token(self, obj: PSBaseParserToken) -> None:
    method _parse_comment (line 316) | def _parse_comment(self, s: bytes, i: int) -> int:
    method _parse_literal (line 328) | def _parse_literal(self, s: bytes, i: int) -> int:
    method _parse_literal_hex (line 348) | def _parse_literal_hex(self, s: bytes, i: int) -> int:
    method _parse_number (line 358) | def _parse_number(self, s: bytes, i: int) -> int:
    method _parse_float (line 377) | def _parse_float(self, s: bytes, i: int) -> int:
    method _parse_keyword (line 391) | def _parse_keyword(self, s: bytes, i: int) -> int:
    method _parse_string (line 409) | def _parse_string(self, s: bytes, i: int) -> int:
    method _parse_string_1 (line 435) | def _parse_string_1(self, s: bytes, i: int) -> int:
    method _parse_wopen (line 464) | def _parse_wopen(self, s: bytes, i: int) -> int:
    method _parse_wclose (line 474) | def _parse_wclose(self, s: bytes, i: int) -> int:
    method _parse_hexstring (line 482) | def _parse_hexstring(self, s: bytes, i: int) -> int:
    method nexttoken (line 497) | def nexttoken(self) -> tuple[int, PSBaseParserToken]:
  class PSStackParser (line 530) | class PSStackParser(PSBaseParser, Generic[ExtraT]):
    method __init__ (line 531) | def __init__(self, fp: BinaryIO) -> None:
    method reset (line 535) | def reset(self) -> None:
    method seek (line 541) | def seek(self, pos: int) -> None:
    method push (line 545) | def push(self, *objs: PSStackEntry[ExtraT]) -> None:
    method pop (line 548) | def pop(self, n: int) -> list[PSStackEntry[ExtraT]]:
    method popall (line 553) | def popall(self) -> list[PSStackEntry[ExtraT]]:
    method add_results (line 558) | def add_results(self, *objs: PSStackEntry[ExtraT]) -> None:
    method start_type (line 565) | def start_type(self, pos: int, type: str) -> None:
    method end_type (line 570) | def end_type(self, type: str) -> tuple[int, list[PSStackType[ExtraT]]]:
    method do_keyword (line 578) | def do_keyword(self, pos: int, token: PSKeyword) -> None:
    method nextobject (line 581) | def nextobject(self) -> PSStackEntry[ExtraT]:

FILE: babeldoc/pdfminer/runlength.py
  function rldecode (line 9) | def rldecode(data: bytes) -> bytes:

FILE: babeldoc/pdfminer/utils.py
  class open_filename (line 36) | class open_filename:
    method __init__ (line 42) | def __init__(self, filename: FileOrName, *args: Any, **kwargs: Any) ->...
    method __enter__ (line 54) | def __enter__(self) -> AnyIO:
    method __exit__ (line 57) | def __exit__(self, exc_type: object, exc_val: object, exc_tb: object) ...
  function make_compat_bytes (line 62) | def make_compat_bytes(in_str: str) -> bytes:
  function make_compat_str (line 68) | def make_compat_str(o: object) -> str:
  function shorten_str (line 80) | def shorten_str(s: str, size: int) -> str:
  function compatible_encode_method (line 90) | def compatible_encode_method(
  function paeth_predictor (line 105) | def paeth_predictor(left: int, above: int, upper_left: int) -> int:
  function apply_png_predictor (line 123) | def apply_png_predictor(
  function parse_rect (line 238) | def parse_rect(o: Any) -> Rect:
  function mult_matrix (line 246) | def mult_matrix(m1: Matrix, m0: Matrix) -> Matrix:
  function translate_matrix (line 260) | def translate_matrix(m: Matrix, v: Point) -> Matrix:
  function apply_matrix_pt (line 267) | def apply_matrix_pt(m: Matrix, v: Point) -> Point:
  function apply_matrix_norm (line 274) | def apply_matrix_norm(m: Matrix, v: Point) -> Point:
  function isnumber (line 284) | def isnumber(x: object) -> bool:
  function uniq (line 291) | def uniq(objs: Iterable[_T]) -> Iterator[_T]:
  function fsplit (line 301) | def fsplit(pred: Callable[[_T], bool], objs: Iterable[_T]) -> tuple[list...
  function drange (line 313) | def drange(v0: float, v1: float, d: int) -> range:
  function get_bound (line 318) | def get_bound(pts: Iterable[Point]) -> Rect:
  function pick (line 330) | def pick(
  function choplist (line 344) | def choplist(n: int, seq: Iterable[_T]) -> Iterator[tuple[_T, ...]]:
  function nunpack (line 354) | def nunpack(s: bytes, default: int = 0) -> int:
  function decode_text (line 626) | def decode_text(s: bytes) -> str:
  function enc (line 634) | def enc(x: str) -> str:
  function bbox2str (line 641) | def bbox2str(bbox: Rect) -> str:
  function matrix2str (line 646) | def matrix2str(m: Matrix) -> str:
  function vecBetweenBoxes (line 651) | def vecBetweenBoxes(obj1: "LTComponent", obj2: "LTComponent") -> Point:
  class Plane (line 680) | class Plane(Generic[LTComponentT]):
    method __init__ (line 688) | def __init__(self, bbox: Rect, gridsize: int = 50) -> None:
    method __repr__ (line 695) | def __repr__(self) -> str:
    method __iter__ (line 698) | def __iter__(self) -> Iterator[LTComponentT]:
    method __len__ (line 701) | def __len__(self) -> int:
    method __contains__ (line 704) | def __contains__(self, obj: object) -> bool:
    method _getrange (line 707) | def _getrange(self, bbox: Rect) -> Iterator[Point]:
    method extend (line 719) | def extend(self, objs: Iterable[LTComponentT]) -> None:
    method add (line 723) | def add(self, obj: LTComponentT) -> None:
    method remove (line 735) | def remove(self, obj: LTComponentT) -> None:
    method find (line 744) | def find(self, bbox: Rect) -> Iterator[LTComponentT]:
  function format_int_roman (line 764) | def format_int_roman(value: int) -> str:
  function format_int_alpha (line 789) | def format_int_alpha(value: int) -> str:

FILE: babeldoc/progress_monitor.py
  class ProgressMonitor (line 12) | class ProgressMonitor:
    method __init__ (line 13) | def __init__(
    method create_part_monitor (line 72) | def create_part_monitor(
    method _handle_part_progress (line 88) | def _handle_part_progress(self, **kwargs):
    method _handle_part_finish (line 96) | def _handle_part_finish(self, **kwargs):
    method stage_start (line 110) | def stage_start(self, stage_name: str, total: int):
    method __enter__ (line 133) | def __enter__(self):
    method __exit__ (line 136) | def __exit__(self, exc_type, exc_val, exc_tb):
    method on_finish (line 139) | def on_finish(self):
    method stage_done (line 149) | def stage_done(self, stage):
    method calculate_current_progress (line 175) | def calculate_current_progress(self, stage=None):
    method _calculate_current_progress (line 187) | def _calculate_current_progress(self, stage=None):
    method stage_update (line 214) | def stage_update(self, stage, n: int):
    method translate_done (line 237) | def translate_done(self, translate_result):
    method translate_error (line 243) | def translate_error(self, error):
    method raise_if_cancelled (line 250) | def raise_if_cancelled(self):
    method cancel (line 254) | def cancel(self):
  class TranslationStage (line 262) | class TranslationStage:
    method __init__ (line 263) | def __init__(
    method __enter__ (line 280) | def __enter__(self):
    method __exit__ (line 283) | def __exit__(self, exc_type, exc_val, exc_tb):
    method advance (line 294) | def advance(self, n: int = 1):
  class DummyTranslationStage (line 300) | class DummyTranslationStage:
    method __init__ (line 301) | def __init__(self, name: str, total: int, pm: ProgressMonitor, weight:...
    method __enter__ (line 308) | def __enter__(self):
    method __exit__ (line 311) | def __exit__(self, exc_type, exc_val, exc_tb):
    method advance (line 314) | def advance(self, n: int = 1):

FILE: babeldoc/tools/generate_cmap_metadata.py
  function _calc_sha3_256 (line 17) | def _calc_sha3_256(path: Path) -> str:
  function main (line 30) | def main() -> None:

FILE: babeldoc/tools/generate_font_metadata.py
  function get_font_metadata (line 29) | def get_font_metadata(font_path) -> PdfFont:
  function main (line 60) | def main():

FILE: babeldoc/tools/italic_assistance.py
  function find_latest_il_json (line 16) | def find_latest_il_json() -> Path | None:
  function extract_fonts_from_paragraph (line 34) | def extract_fonts_from_paragraph(
  function find_fonts_by_debug_id (line 121) | def find_fonts_by_debug_id(json_path: Path, debug_id_regex: str) -> dict...
  function main (line 163) | def main():

FILE: babeldoc/translator/cache.py
  class _TranslationCache (line 31) | class _TranslationCache(Model):
    class Meta (line 38) | class Meta:
  class TranslationCache (line 54) | class TranslationCache:
    method _sort_dict_recursively (line 56) | def _sort_dict_recursively(obj):
    method __init__ (line 67) | def __init__(self, translate_engine: str, translate_engine_params: dic...
    method replace_params (line 74) | def replace_params(self, params: dict = None):
    method update_params (line 81) | def update_params(self, params: dict = None):
    method add_params (line 87) | def add_params(self, k: str, v):
    method get (line 93) | def get(self, original_text: str) -> str | None:
    method set (line 111) | def set(self, original_text: str, translation: str):
    method _cleanup (line 128) | def _cleanup(self) -> None:
  function init_db (line 148) | def init_db(remove_exists=False):
  function init_test_db (line 165) | def init_test_db():
  function clean_test_db (line 185) | def clean_test_db(test_db):

FILE: babeldoc/translator/translator.py
  function remove_control_characters (line 24) | def remove_control_characters(s):
  class RateLimiter (line 28) | class RateLimiter:
    method __init__ (line 34) | def __init__(self, max_qps: int):
    method wait (line 43) | def wait(self, _rate_limit_params: dict = None):
    method set_max_qps (line 61) | def set_max_qps(self, max_qps: int):
  function set_translate_rate_limiter (line 75) | def set_translate_rate_limiter(max_qps):
  class BaseTranslator (line 79) | class BaseTranslator(ABC):
    method __init__ (line 85) | def __init__(self, lang_in, lang_out, ignore_cache):
    method __del__ (line 103) | def __del__(self):
    method add_cache_impact_parameters (line 112) | def add_cache_impact_parameters(self, k: str, v):
    method translate (line 120) | def translate(self, text, ignore_cache=False, rate_limit_params: dict ...
    method llm_translate (line 141) | def llm_translate(self, text, ignore_cache=False, rate_limit_params: d...
    method do_llm_translate (line 168) | def do_llm_translate(self, text, rate_limit_params: dict = None):
    method do_translate (line 177) | def do_translate(self, text, rate_limit_params: dict = None):
    method __str__ (line 190) | def __str__(self):
    method get_rich_text_left_placeholder (line 193) | def get_rich_text_left_placeholder(self, placeholder_id: int | str):
    method get_rich_text_right_placeholder (line 196) | def get_rich_text_right_placeholder(self, placeholder_id: int | str):
    method get_formular_placeholder (line 199) | def get_formular_placeholder(self, placeholder_id: int | str):
  class OpenAITranslator (line 203) | class OpenAITranslator(BaseTranslator):
    method __init__ (line 207) | def __init__(
    method do_translate (line 265) | def do_translate(self, text, rate_limit_params: dict = None) -> str:
    method prompt (line 279) | def prompt(self, text):
    method do_llm_translate (line 297) | def do_llm_translate(self, text, rate_limit_params: dict = None):
    method update_token_count (line 339) | def update_token_count(self, response):
    method get_formular_placeholder (line 360) | def get_formular_placeholder(self, placeholder_id: int | str):
    method get_rich_text_left_placeholder (line 364) | def get_rich_text_left_placeholder(self, placeholder_id: int | str):
    method get_rich_text_right_placeholder (line 370) | def get_rich_text_right_placeholder(self, placeholder_id: int | str):

FILE: babeldoc/utils/atomic_integer.py
  class AtomicInteger (line 4) | class AtomicInteger:
    method __init__ (line 5) | def __init__(self, value=0):
    method inc (line 9) | def inc(self, d=1):
    method dec (line 14) | def dec(self, d=1):
    method value (line 18) | def value(self):
    method value (line 23) | def value(self, v):

FILE: babeldoc/utils/memory.py
  function _parse_pss_from_smaps_rollup (line 12) | def _parse_pss_from_smaps_rollup(pid: int) -> int | None:
  function _parse_pss_from_smaps (line 32) | def _parse_pss_from_smaps(pid: int) -> int | None:
  function _get_pss_linux (line 54) | def _get_pss_linux(pid: int) -> int | None:
  function _get_rss_psutil (line 73) | def _get_rss_psutil(pid: int) -> int | None:
  function _get_single_process_memory (line 88) | def _get_single_process_memory(
  function get_memory_usage_bytes (line 119) | def get_memory_usage_bytes(
  function get_memory_usage_with_throttle (line 192) | def get_memory_usage_with_throttle(

FILE: babeldoc/utils/priority_thread_pool_executor.py
  function python_exit (line 36) | def python_exit():
  class PriorityQueue (line 58) | class PriorityQueue(queue.Queue):
    method _init (line 67) | def _init(self, maxsize):
    method _qsize (line 72) | def _qsize(self):
    method _put (line 75) | def _put(self, item):
    method remove (line 87) | def remove(self, task):
    method _get (line 95) | def _get(self):
  function _worker (line 104) | def _worker(executor_reference, work_queue, initializer, initargs):
  class PriorityThreadPoolExecutor (line 150) | class PriorityThreadPoolExecutor(ThreadPoolExecutor):
    method __init__ (line 155) | def __init__(self, *args, **kwargs):
    method submit (line 162) | def submit(self, fn, *args, **kwargs):
    method _adjust_thread_count (line 202) | def _adjust_thread_count(self):
    method shutdown (line 229) | def shutdown(self, wait=True, *, cancel_futures=False):
    method __del__ (line 263) | def __del__(self):

FILE: tests/test_translation_cache_cleanup.py
  function _prepare_records (line 9) | def _prepare_records(cache: TranslationCache, num_records: int) -> None:
  function test_cleanup_under_limit (line 15) | def test_cleanup_under_limit(monkeypatch):
  function test_cleanup_over_limit (line 33) | def test_cleanup_over_limit(monkeypatch):
  function test_cleanup_thread_safety (line 50) | def test_cleanup_thread_safety(monkeypatch):

Condensed preview: 156 files, each showing path, character count, and a content snippet (full structured content: 2,165K chars).

[
  {
    "path": ".cursorignore",
    "chars": 38,
    "preview": "# Project notes and templates\nxnotes/\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.yaml",
    "chars": 3912,
    "preview": "name: \"🐞 Bug Report\"\ndescription: Create a report to help us improve\nlabels: ['bug']\nbody:\n  - type: checkboxes\n    id: "
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature_request.yaml",
    "chars": 2262,
    "preview": "name: \"✨ Feature Request\"\ndescription: Suggest a new idea or improvement for BabelDOC\nlabels: ['enhancement']\nbody:\n  - "
  },
  {
    "path": ".github/PULL_REQUEST_TEMPLATE/pr_form.yml",
    "chars": 2763,
    "preview": "name: Pull Request\ndescription: Submit a pull request to contribute to BabelDOC\ntitle: \"[PR] <Your concise title here>\"\n"
  },
  {
    "path": ".github/PULL_REQUEST_TEMPLATE.md",
    "chars": 2480,
    "preview": "### PR Title\n\n<!-- Please fill in a concise and clear PR title below -->\n[PR] <Your concise title here>\n\n### Related Iss"
  },
  {
    "path": ".github/dependabot.yml",
    "chars": 480,
    "preview": "version: 2\nupdates:\n  - package-ecosystem: github-actions\n    directory: \"/\"\n    schedule:\n      interval: weekly\n  # - "
  },
  {
    "path": ".github/labels.yml",
    "chars": 1834,
    "preview": "---\n# Labels names are important as they are used by Release Drafter to decide\n# regarding where to record them in chang"
  },
  {
    "path": ".github/release-drafter.yml",
    "chars": 758,
    "preview": "name-template: 'v$RESOLVED_VERSION'\ntag-template: 'v$RESOLVED_VERSION'\ncategories:\n  - title: '🚀 Features'\n    labels:\n "
  },
  {
    "path": ".github/workflows/codeql.yml",
    "chars": 4298,
    "preview": "# For most projects, this workflow file will not need changing; you simply need\n# to commit it to your repository.\n#\n# Y"
  },
  {
    "path": ".github/workflows/docs.yml",
    "chars": 983,
    "preview": "name: docs\non:\n  push:\n    branches:\n      - main\npermissions:\n  contents: write\njobs:\n  deploy:\n    runs-on: ubuntu-lat"
  },
  {
    "path": ".github/workflows/labeler.yml",
    "chars": 807,
    "preview": "name: Labeler\n\non:\n  push:\n    branches:\n      - 'main'\n    paths:\n      - '.github/labels.yml'\n      - '.github/workflo"
  },
  {
    "path": ".github/workflows/lint.yml",
    "chars": 335,
    "preview": "name: Lint Code\npermissions:\n  contents: read\n  pull-requests: write\non: [push]\n\njobs:\n  lint:\n    strategy:\n      fail-"
  },
  {
    "path": ".github/workflows/pr-lint.yml",
    "chars": 1353,
    "preview": "name: Lint Code and Review Dog Report\n\non: [pull_request]\npermissions:\n  contents: read\n  pull-requests: write\njobs:\n  r"
  },
  {
    "path": ".github/workflows/publish-to-pypi.yml",
    "chars": 4536,
    "preview": "name: Release\n\non:\n  push:\n    branches:\n      - main\n      - master\n\npermissions:\n  id-token: write\n  contents: write\n "
  },
  {
    "path": ".github/workflows/test.yml",
    "chars": 1909,
    "preview": "name: Run Tests 🧪\n\non:\n  push:\n  pull_request:\n    branches: [\"main\"]\n\npermissions:\n  contents: read\n  pull-requests: re"
  },
  {
    "path": ".gitignore",
    "chars": 537,
    "preview": "# Logs\nweb/logs\nweb/*.log\nweb/npm-debug.log*\nweb/yarn-debug.log*\nweb/yarn-error.log*\nweb/pnpm-debug.log*\nweb/lerna-debug"
  },
  {
    "path": ".pre-commit-config.yaml",
    "chars": 298,
    "preview": "files: '^.*\\.py$'\nrepos:\n  - repo: https://github.com/astral-sh/ruff-pre-commit\n    # Ruff version.\n    rev: v0.9.5\n    "
  },
  {
    "path": "LICENSE",
    "chars": 34519,
    "preview": "                    GNU AFFERO GENERAL PUBLIC LICENSE\n                       Version 3, 19 November 2007\n\n Copyright (C)"
  },
  {
    "path": "README.md",
    "chars": 26782,
    "preview": "<!-- # Yet Another Document Translator -->\n\n<div align=\"center\">\n<!-- <img src=\"https://s.immersivetranslate.com/assets/"
  },
  {
    "path": "babeldoc/__init__.py",
    "chars": 23,
    "preview": "__version__ = \"0.5.23\"\n"
  },
  {
    "path": "babeldoc/assets/assets.py",
    "chars": 22340,
    "preview": "import asyncio\nimport hashlib\nimport json\nimport logging\nimport threading\nimport zipfile\nfrom pathlib import Path\n\nimpor"
  },
  {
    "path": "babeldoc/assets/embedding_assets_metadata.py",
    "chars": 50533,
    "preview": "import itertools\n\nDOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256 = (\n    \"60be061226930524958b5465c8c04af3d7c03bcb"
  },
  {
    "path": "babeldoc/asynchronize/__init__.py",
    "chars": 1802,
    "preview": "import asyncio\nimport time\n\n\nclass Args:\n    def __init__(self, args, kwargs):\n        self.args = args\n        self.kwa"
  },
  {
    "path": "babeldoc/babeldoc_exception/BabelDOCException.py",
    "chars": 463,
    "preview": "class ScannedPDFError(Exception):\n    def __init__(self, message):\n        super().__init__(message)\n\n\nclass ExtractText"
  },
  {
    "path": "babeldoc/babeldoc_exception/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "babeldoc/const.py",
    "chars": 2662,
    "preview": "import itertools\nimport multiprocessing as mp\nimport os\nimport shutil\nimport subprocess\nimport threading\nfrom pathlib im"
  },
  {
    "path": "babeldoc/docvision/README.md",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "babeldoc/docvision/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "babeldoc/docvision/base_doclayout.py",
    "chars": 1766,
    "preview": "import abc\nimport logging\nfrom collections.abc import Generator\n\nimport pymupdf\n\nfrom babeldoc.format.pdf.document_il.il"
  },
  {
    "path": "babeldoc/docvision/doclayout.py",
    "chars": 7879,
    "preview": "import ast\nimport logging\nimport platform\nimport re\nimport threading\nfrom collections.abc import Generator\n\nimport cv2\ni"
  },
  {
    "path": "babeldoc/docvision/rpc_doclayout.py",
    "chars": 10599,
    "preview": "import logging\nimport threading\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\n\nimport cv2\ni"
  },
  {
    "path": "babeldoc/docvision/rpc_doclayout2.py",
    "chars": 11771,
    "preview": "import logging\nimport threading\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\n\nimport cv2\ni"
  },
  {
    "path": "babeldoc/docvision/rpc_doclayout3.py",
    "chars": 11593,
    "preview": "import json\nimport logging\nimport threading\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\n\n"
  },
  {
    "path": "babeldoc/docvision/rpc_doclayout4.py",
    "chars": 11770,
    "preview": "import logging\nimport threading\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\n\nimport cv2\ni"
  },
  {
    "path": "babeldoc/docvision/rpc_doclayout5.py",
    "chars": 11525,
    "preview": "import json\nimport logging\nimport threading\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\n\n"
  },
  {
    "path": "babeldoc/docvision/rpc_doclayout6.py",
    "chars": 21860,
    "preview": "import base64\nimport json\nimport logging\nimport threading\nimport unicodedata\nfrom concurrent.futures import ThreadPoolEx"
  },
  {
    "path": "babeldoc/docvision/rpc_doclayout7.py",
    "chars": 12120,
    "preview": "import base64\nimport json\nimport logging\nimport threading\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib"
  },
  {
    "path": "babeldoc/docvision/table_detection/rapidocr.py",
    "chars": 11432,
    "preview": "import logging\nimport re\nimport threading\nfrom collections.abc import Generator\n\nimport cv2\nimport numpy as np\nfrom babe"
  },
  {
    "path": "babeldoc/format/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "babeldoc/format/pdf/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "babeldoc/format/pdf/babelpdf/base14.py",
    "chars": 127441,
    "preview": "from .encoding import get_type1_encoding\nfrom .win_core import win_core\n\nbase14_bbox = {\n    \"Courier-BoldOblique\": {\n  "
  },
  {
    "path": "babeldoc/format/pdf/babelpdf/cidfont.py",
    "chars": 1851,
    "preview": "import re\nfrom io import BytesIO\n\nimport freetype\n\n\ndef indirect(obj):\n    if isinstance(obj, tuple) and obj[0] == \"xref"
  },
  {
    "path": "babeldoc/format/pdf/babelpdf/cmap.py",
    "chars": 3701,
    "preview": "import re\nimport struct\n\npattern_map_r = (\n    r\"\\s+begincidrange\\s*\"\n    r\"(?P<cidrange>(<[a-fA-F0-9]+>\\s*<[a-fA-F0-9]+"
  },
  {
    "path": "babeldoc/format/pdf/babelpdf/encoding.py",
    "chars": 16977,
    "preview": "adobe_standard = [\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n "
  },
  {
    "path": "babeldoc/format/pdf/babelpdf/type3.py",
    "chars": 1489,
    "preview": "import io\nimport re\n\nimport pymupdf\n\n\ndef merge_bbox(bbox_list, factor=1):\n    if bbox_list:\n        base = bbox_list[0]"
  },
  {
    "path": "babeldoc/format/pdf/babelpdf/utils.py",
    "chars": 358,
    "preview": "from babeldoc.pdfminer.pdftypes import PDFObjRef\n\n\ndef guarded_bbox(bbox):\n    bbox_guarded = []\n    for v in bbox:\n    "
  },
  {
    "path": "babeldoc/format/pdf/babelpdf/win_core.py",
    "chars": 100126,
    "preview": "win_core = {\n    \"Arial\": {\n        \"space\": (0, 0, 0, 0),\n        \"exclam\": (85, 0, 194, 715),\n        \"quotedbl\": (45,"
  },
  {
    "path": "babeldoc/format/pdf/converter.py",
    "chars": 22585,
    "preview": "import logging\nimport re\nimport unicodedata\n\nimport numpy as np\nfrom pymupdf import Font\n\nfrom babeldoc.format.pdf.docum"
  },
  {
    "path": "babeldoc/format/pdf/document_il/__init__.py",
    "chars": 2783,
    "preview": "from babeldoc.format.pdf.document_il.il_version_1 import BaseOperations\nfrom babeldoc.format.pdf.document_il.il_version_"
  },
  {
    "path": "babeldoc/format/pdf/document_il/backend/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "babeldoc/format/pdf/document_il/backend/pdf_creater.py",
    "chars": 55812,
    "preview": "import io\nimport itertools\nimport logging\nimport os\nimport re\nimport time\nimport unicodedata\nfrom abc import ABC\nfrom ab"
  },
  {
    "path": "babeldoc/format/pdf/document_il/frontend/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "babeldoc/format/pdf/document_il/frontend/il_creater.py",
    "chars": 47980,
    "preview": "import base64\nimport functools\nimport logging\nimport math\nimport re\nimport unicodedata\nfrom io import BytesIO\nfrom itert"
  },
  {
    "path": "babeldoc/format/pdf/document_il/il_version_1.py",
    "chars": 27991,
    "preview": "from dataclasses import dataclass\nfrom dataclasses import field\n\n\n@dataclass(slots=True)\nclass BaseOperations:\n    class"
  },
  {
    "path": "babeldoc/format/pdf/document_il/il_version_1.rnc",
    "chars": 6264,
    "preview": "start = Document\nDocument =\n  element document {\n    Page+,\n    attribute totalPages { xsd:int }\n  }\nPage =\n  element pa"
  },
  {
    "path": "babeldoc/format/pdf/document_il/il_version_1.rng",
    "chars": 16414,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<grammar xmlns=\"http://relaxng.org/ns/structure/1.0\" datatypeLibrary=\"http://www."
  },
  {
    "path": "babeldoc/format/pdf/document_il/il_version_1.xsd",
    "chars": 14494,
    "preview": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\" elementFormDefault=\"qualif"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/add_debug_information.py",
    "chars": 6387,
    "preview": "import logging\n\nimport babeldoc.format.pdf.document_il.il_version_1 as il_version_1\nfrom babeldoc.format.pdf.document_il"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/automatic_term_extractor.py",
    "chars": 16291,
    "preview": "from __future__ import annotations\n\nimport json\nimport logging\nfrom pathlib import Path\nfrom typing import TYPE_CHECKING"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/detect_scanned_file.py",
    "chars": 7051,
    "preview": "import logging\n\nimport cv2\nimport numpy as np\nimport pymupdf\nimport regex\nfrom skimage.metrics import structural_similar"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/il_translator.py",
    "chars": 48745,
    "preview": "from __future__ import annotations\n\nimport copy\nimport json\nimport logging\nimport re\nimport threading\nfrom pathlib impor"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/il_translator_llm_only.py",
    "chars": 38719,
    "preview": "import copy\nimport json\nimport logging\nimport re\nfrom pathlib import Path\nfrom string import Template\n\nimport Levenshtei"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/layout_parser.py",
    "chars": 8173,
    "preview": "import logging\nimport math\nimport os\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\n\nimport "
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/paragraph_finder.py",
    "chars": 42590,
    "preview": "import logging\nimport random\nimport re\n\nimport numpy as np\n\nfrom babeldoc.babeldoc_exception.BabelDOCException import Ex"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/remove_descent.py",
    "chars": 6832,
    "preview": "import logging\nfrom collections import Counter\nfrom functools import cache\n\nfrom babeldoc.format.pdf.document_il import "
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/styles_and_formulas.py",
    "chars": 50061,
    "preview": "import math\nimport re\n\nfrom babeldoc.format.pdf.document_il.il_version_1 import Box\nfrom babeldoc.format.pdf.document_il"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/table_parser.py",
    "chars": 6206,
    "preview": "import logging\nfrom pathlib import Path\n\nimport cv2\nimport numpy as np\nfrom pymupdf import Document\n\nfrom babeldoc.forma"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/typesetting.py",
    "chars": 58085,
    "preview": "from __future__ import annotations\n\nimport copy\nimport logging\nimport re\nimport statistics\nimport unicodedata\nfrom funct"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/extract_char.py",
    "chars": 27750,
    "preview": "import logging\nimport shutil\nfrom collections import defaultdict\nfrom pathlib import Path\n\nimport cv2\nimport numpy as np"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/fontmap.py",
    "chars": 12019,
    "preview": "import enum\nimport functools\nimport logging\nimport re\nfrom pathlib import Path\n\nimport pymupdf\n\nfrom babeldoc.assets imp"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/formular_helper.py",
    "chars": 10157,
    "preview": "import base64\nimport functools\nimport re\nimport unicodedata\n\nfrom babeldoc.format.pdf.document_il.il_version_1 import Bo"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/layout_helper.py",
    "chars": 31414,
    "preview": "import logging\nimport math\nimport re\nimport unicodedata\nfrom typing import Literal\n\nimport regex\nfrom pymupdf import Fon"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/matrix_helper.py",
    "chars": 10315,
    "preview": "\"\"\"Matrix helper utilities for CTM decomposition and composition.\n\nThis module provides functions to:\n- Decompose a PDF "
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/mupdf_helper.py",
    "chars": 1229,
    "preview": "import numpy as np\nimport pymupdf\n\nfrom babeldoc.const import get_process_pool\n\n\ndef get_no_rotation_img(page: pymupdf.P"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/paragraph_helper.py",
    "chars": 3257,
    "preview": "import logging\nimport re\n\nfrom babeldoc.format.pdf.document_il import il_version_1\n\nlogger = logging.getLogger(__name__)"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/spatial_analyzer.py",
    "chars": 5816,
    "preview": "\"\"\"Spatial relationship analyzer for PDF elements.\n\nThis module provides functions to analyze spatial relationships betw"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/style_helper.py",
    "chars": 3185,
    "preview": "from babeldoc.format.pdf.document_il import il_version_1\n\n\ndef create_pdf_style(r, g, b, font_id=\"base\", font_size=6):\n "
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/zstd_helper.py",
    "chars": 557,
    "preview": "import base64\n\nimport pyzstd\n\n\ndef zstd_compress(data) -> str:\n    if isinstance(data, str):\n        data = data.encode("
  },
  {
    "path": "babeldoc/format/pdf/document_il/xml_converter.py",
    "chars": 1786,
    "preview": "import copy\nfrom pathlib import Path\n\nimport orjson\nfrom xsdata.formats.dataclass.context import XmlContext\nfrom xsdata."
  },
  {
    "path": "babeldoc/format/pdf/high_level.py",
    "chars": 47350,
    "preview": "import asyncio\nimport copy\nimport hashlib\nimport io\nimport logging\nimport pathlib\nimport re\nimport shutil\nimport threadi"
  },
  {
    "path": "babeldoc/format/pdf/pdfinterp.py",
    "chars": 21949,
    "preview": "import logging\nfrom collections.abc import Sequence\nfrom typing import Any\nfrom typing import cast\n\nimport numpy as np\n\n"
  },
  {
    "path": "babeldoc/format/pdf/result_merger.py",
    "chars": 7436,
    "preview": "import logging\nfrom pathlib import Path\n\nfrom pymupdf import Document\n\nfrom babeldoc.format.pdf.document_il.backend.pdf_"
  },
  {
    "path": "babeldoc/format/pdf/split_manager.py",
    "chars": 1942,
    "preview": "import logging\nfrom dataclasses import dataclass\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass SplitPoint:\n  "
  },
  {
    "path": "babeldoc/format/pdf/translation_config.py",
    "chars": 22275,
    "preview": "import enum\nimport logging\nimport shutil\nimport tempfile\nimport threading\nfrom collections import Counter\nfrom pathlib i"
  },
  {
    "path": "babeldoc/glossary.py",
    "chars": 7865,
    "preview": "import csv\nimport io\nimport itertools\nimport logging\nimport re\nimport time\nfrom pathlib import Path\n\nimport chardet\nimpo"
  },
  {
    "path": "babeldoc/main.py",
    "chars": 35423,
    "preview": "import asyncio\nimport logging\nimport multiprocessing as mp\nimport queue\nimport random\nimport sys\nfrom pathlib import Pat"
  },
  {
    "path": "babeldoc/pdfminer/LICENSE",
    "chars": 1092,
    "preview": "Copyright (c) 2004-2016  Yusuke Shinyama <yusuke at shinyama dot jp>\n\nPermission is hereby granted, free of charge, to a"
  },
  {
    "path": "babeldoc/pdfminer/__init__.py",
    "chars": 290,
    "preview": "from importlib.metadata import PackageNotFoundError\nfrom importlib.metadata import version\n\ntry:\n    __version__ = versi"
  },
  {
    "path": "babeldoc/pdfminer/_saslprep.py",
    "chars": 3714,
    "preview": "# Copyright 2016-present MongoDB, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not"
  },
  {
    "path": "babeldoc/pdfminer/arcfour.py",
    "chars": 936,
    "preview": "\"\"\"Python implementation of Arcfour encryption algorithm.\nSee https://en.wikipedia.org/wiki/RC4\nThis code is in the publ"
  },
  {
    "path": "babeldoc/pdfminer/ascii85.py",
    "chars": 1850,
    "preview": "\"\"\"Python implementation of ASCII85/ASCIIHex decoder (Adobe version).\"\"\"\n\nimport re\nfrom base64 import a85decode\nfrom bi"
  },
  {
    "path": "babeldoc/pdfminer/casting.py",
    "chars": 2055,
    "preview": "import itertools\nfrom typing import Any\n\nfrom babeldoc.pdfminer.utils import Matrix\nfrom babeldoc.pdfminer.utils import "
  },
  {
    "path": "babeldoc/pdfminer/ccitt.py",
    "chars": 21280,
    "preview": "# CCITT Fax decoder\n#\n# Bugs: uncompressed mode untested.\n#\n# cf.\n#  ITU-T Recommendation T.4\n#    \"Standardization of G"
  },
  {
    "path": "babeldoc/pdfminer/cmap/README.txt",
    "chars": 3917,
    "preview": "README.txt for cmap\n\nThis directory contains *.pickle.gz files converted from Adobe CMap resources.\nCMaps are required t"
  },
  {
    "path": "babeldoc/pdfminer/cmapdb.py",
    "chars": 15750,
    "preview": "\"\"\"Adobe character mapping (CMap) support.\n\nCMaps provide the mapping between character codes and Unicode\ncode-points to"
  },
  {
    "path": "babeldoc/pdfminer/converter.py",
    "chars": 38289,
    "preview": "import io\nimport logging\nimport re\nfrom collections.abc import Sequence\nfrom typing import BinaryIO\nfrom typing import G"
  },
  {
    "path": "babeldoc/pdfminer/data_structures.py",
    "chars": 1775,
    "preview": "from collections.abc import Iterable\nfrom typing import Any\n\nfrom babeldoc.pdfminer.pdfparser import PDFSyntaxError\nfrom"
  },
  {
    "path": "babeldoc/pdfminer/encodingdb.py",
    "chars": 4040,
    "preview": "import logging\nimport re\nfrom collections.abc import Iterable\nfrom typing import cast\n\nfrom babeldoc.pdfminer.glyphlist "
  },
  {
    "path": "babeldoc/pdfminer/fontmetrics.py",
    "chars": 112593,
    "preview": "\"\"\"Font metrics for the Adobe core 14 fonts.\n\nFont metrics are used to compute the boundary of each character\nwritten wi"
  },
  {
    "path": "babeldoc/pdfminer/glyphlist.py",
    "chars": 130838,
    "preview": "\"\"\"Mappings from Adobe glyph names to Unicode characters.\n\nIn some CMap tables, Adobe glyph names are used for specifyin"
  },
  {
    "path": "babeldoc/pdfminer/high_level.py",
    "chars": 7997,
    "preview": "\"\"\"Functions that can be used for the most common use-cases for pdfminer.six\"\"\"\n\nimport logging\nimport sys\nfrom collecti"
  },
  {
    "path": "babeldoc/pdfminer/image.py",
    "chars": 9784,
    "preview": "import os\nimport os.path\nimport struct\nfrom io import BytesIO\nfrom typing import BinaryIO\nfrom typing import Literal\n\nfr"
  },
  {
    "path": "babeldoc/pdfminer/jbig2.py",
    "chars": 11602,
    "preview": "import math\nimport os\nfrom collections.abc import Iterable\nfrom struct import calcsize\nfrom struct import pack\nfrom stru"
  },
  {
    "path": "babeldoc/pdfminer/latin_enc.py",
    "chars": 8476,
    "preview": "\"\"\"Standard encoding tables used in PDF.\n\nThis table is extracted from PDF Reference Manual 1.6, pp.925\n  \"D.1 Latin Cha"
  },
  {
    "path": "babeldoc/pdfminer/layout.py",
    "chars": 34290,
    "preview": "import heapq\nimport logging\nfrom collections.abc import Iterable\nfrom collections.abc import Iterator\nfrom collections.a"
  },
  {
    "path": "babeldoc/pdfminer/lzw.py",
    "chars": 3320,
    "preview": "import logging\nfrom collections.abc import Iterator\nfrom io import BytesIO\nfrom typing import BinaryIO\nfrom typing impor"
  },
  {
    "path": "babeldoc/pdfminer/pdfcolor.py",
    "chars": 964,
    "preview": "import collections\n\nfrom babeldoc.pdfminer.psparser import LIT\n\nLITERAL_DEVICE_GRAY = LIT(\"DeviceGray\")\nLITERAL_DEVICE_R"
  },
  {
    "path": "babeldoc/pdfminer/pdfdevice.py",
    "chars": 9579,
    "preview": "import logging\nfrom collections.abc import Iterable\nfrom collections.abc import Sequence\nfrom typing import TYPE_CHECKIN"
  },
  {
    "path": "babeldoc/pdfminer/pdfdocument.py",
    "chars": 38347,
    "preview": "import itertools\nimport logging\nimport re\nimport struct\nfrom collections.abc import Callable\nfrom collections.abc import"
  },
  {
    "path": "babeldoc/pdfminer/pdfexceptions.py",
    "chars": 499,
    "preview": "from babeldoc.pdfminer.psexceptions import PSException\n\n\nclass PDFException(PSException):\n    pass\n\n\nclass PDFTypeError("
  },
  {
    "path": "babeldoc/pdfminer/pdffont.py",
    "chars": 36621,
    "preview": "import logging\nimport struct\nfrom collections.abc import Iterable\nfrom collections.abc import Iterator\nfrom collections."
  },
  {
    "path": "babeldoc/pdfminer/pdfinterp.py",
    "chars": 44217,
    "preview": "import logging\nimport re\nfrom collections.abc import Mapping\nfrom collections.abc import Sequence\nfrom io import BytesIO"
  },
  {
    "path": "babeldoc/pdfminer/pdfpage.py",
    "chars": 8978,
    "preview": "import itertools\nimport logging\nfrom collections.abc import Container\nfrom collections.abc import Iterator\nfrom typing i"
  },
  {
    "path": "babeldoc/pdfminer/pdfparser.py",
    "chars": 5887,
    "preview": "import logging\nfrom io import BytesIO\nfrom typing import TYPE_CHECKING\nfrom typing import BinaryIO\nfrom typing import Un"
  },
  {
    "path": "babeldoc/pdfminer/pdftypes.py",
    "chars": 12568,
    "preview": "import io\nimport logging\nimport zlib\nfrom collections.abc import Iterable\nfrom typing import TYPE_CHECKING\nfrom typing i"
  },
  {
    "path": "babeldoc/pdfminer/psexceptions.py",
    "chars": 208,
    "preview": "class PSException(Exception):\n    pass\n\n\nclass PSEOF(PSException):\n    pass\n\n\nclass PSSyntaxError(PSException):\n    pass"
  },
  {
    "path": "babeldoc/pdfminer/psparser.py",
    "chars": 20388,
    "preview": "#!/usr/bin/env python3\nimport io\nimport logging\nimport re\nfrom collections.abc import Iterator\nfrom typing import Any\nfr"
  },
  {
    "path": "babeldoc/pdfminer/py.typed",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "babeldoc/pdfminer/runlength.py",
    "chars": 1305,
    "preview": "#\n# RunLength decoder (Adobe version) implementation based on PDF Reference\n# version 1.4 section 3.3.4.\n#\n#  * public d"
  },
  {
    "path": "babeldoc/pdfminer/settings.py",
    "chars": 15,
    "preview": "STRICT = False\n"
  },
  {
    "path": "babeldoc/pdfminer/utils.py",
    "chars": 20871,
    "preview": "\"\"\"Miscellaneous Routines.\"\"\"\n\nimport io\nimport pathlib\nimport string\nfrom collections.abc import Callable\nfrom collecti"
  },
  {
    "path": "babeldoc/progress_monitor.py",
    "chars": 11299,
    "preview": "import asyncio\nimport logging\nimport threading\nimport time\nfrom asyncio import CancelledError\nfrom collections.abc impor"
  },
  {
    "path": "babeldoc/tools/generate_cmap_metadata.py",
    "chars": 2306,
    "preview": "\"\"\"\nThis script is used to automatically generate the following file:\nhttps://github.com/funstory-ai/BabelDOC-Assets/blo"
  },
  {
    "path": "babeldoc/tools/generate_font_metadata.py",
    "chars": 3961,
    "preview": "# This script is used to automatically generate the following files:\n# https://github.com/funstory-ai/BabelDOC-Assets/bl"
  },
  {
    "path": "babeldoc/tools/italic_assistance.py",
    "chars": 10455,
    "preview": "import argparse\nimport json\nimport re\nfrom pathlib import Path\n\nimport orjson\nfrom babeldoc.const import CACHE_FOLDER\nfr"
  },
  {
    "path": "babeldoc/tools/italic_recognize_tool.py",
    "chars": 2863,
    "preview": "# Identify non-formula italic fonts that were incorrectly classified as formulas in BableDOC translation results (interm"
  },
  {
    "path": "babeldoc/translator/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "babeldoc/translator/cache.py",
    "chars": 6455,
    "preview": "import json\nimport logging\nimport random\nimport threading\nfrom pathlib import Path\n\nimport peewee\nfrom peewee import SQL"
  },
  {
    "path": "babeldoc/translator/translator.py",
    "chars": 13753,
    "preview": "import contextlib\nimport logging\nimport threading\nimport time\nimport unicodedata\nfrom abc import ABC\nfrom abc import abs"
  },
  {
    "path": "babeldoc/utils/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "babeldoc/utils/atomic_integer.py",
    "chars": 536,
    "preview": "import threading\n\n\nclass AtomicInteger:\n    def __init__(self, value=0):\n        self._value = int(value)\n        self._"
  },
  {
    "path": "babeldoc/utils/memory.py",
    "chars": 8210,
    "preview": "import os\nimport sys\nimport time\nfrom pathlib import Path\n\ntry:\n    import psutil\nexcept ImportError:\n    psutil = None\n"
  },
  {
    "path": "babeldoc/utils/priority_thread_pool_executor.py",
    "chars": 9238,
    "preview": "# thanks to:\n# https://github.com/oleglpts/PriorityThreadPoolExecutor/blob/master/PriorityThreadPoolExecutor/__init__.py"
  },
  {
    "path": "docs/CODE_OF_CONDUCT.md",
    "chars": 5217,
    "preview": "# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nWe as members, contributors, and leaders pledge to make participa"
  },
  {
    "path": "docs/CONTRIBUTING.md",
    "chars": 8095,
    "preview": "# Contributing to BabelDOC\n\n## How to contribute to BabelDOC\n\n### **About Language**\n\n- Issues can be in Chinese or Engl"
  },
  {
    "path": "docs/CONTRIBUTOR_REWARD.md",
    "chars": 2874,
    "preview": "# BabelDOC/PDFMathTranslate/OneAIFW 贡献者奖励规则\n\n## 月度活跃贡献者奖励规则\n\n### 一、资格标准\n#### **贡献类型要求**\n   - 需提交 **至少 1 个有效 PR**（Pull Re"
  },
  {
    "path": "docs/ImplementationDetails/AsyncTranslate/AsyncTranslate.md",
    "chars": 5443,
    "preview": "# Async Translation API\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we strive for accuracy, "
  },
  {
    "path": "docs/ImplementationDetails/ILTranslator/ILTranslator.md",
    "chars": 3017,
    "preview": "# Intermediate Layer Translator\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we strive for ac"
  },
  {
    "path": "docs/ImplementationDetails/PDFCreation/PDFCreation.md",
    "chars": 3525,
    "preview": "# PDF Creation\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we strive for accuracy, there mig"
  },
  {
    "path": "docs/ImplementationDetails/PDFParsing/PDFParsing.md",
    "chars": 3817,
    "preview": "# PDF Parsing and Intermediate Layer Creation\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we"
  },
  {
    "path": "docs/ImplementationDetails/ParagraphFinding/ParagraphFinding.md",
    "chars": 2772,
    "preview": "# Paragraph Finding\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we strive for accuracy, ther"
  },
  {
    "path": "docs/ImplementationDetails/README.md",
    "chars": 1362,
    "preview": "# Implementation Details\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we strive for accuracy,"
  },
  {
    "path": "docs/ImplementationDetails/StylesAndFormulas/StylesAndFormulas.md",
    "chars": 2798,
    "preview": "# Styles and Formulas Processing\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we strive for a"
  },
  {
    "path": "docs/ImplementationDetails/Typesetting/Typesetting.md",
    "chars": 4102,
    "preview": "# Typography\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we strive for accuracy, there might"
  },
  {
    "path": "docs/README.md",
    "chars": 356,
    "preview": "YADT Spec\n===\n\n## YADT Document Intermediate Language\n\n[il_version_1.rnc](https://github.com/funstory-ai/yadt/blob/main/"
  },
  {
    "path": "docs/deploy.sh",
    "chars": 851,
    "preview": "#!/bin/bash\nset -e\n\ncommand_exists() {\n  command -v \"$1\" >/dev/null 2>&1\n}\n\necho \"check uv installed ……\"\nif command_exis"
  },
  {
    "path": "docs/example/demo_glossary.csv",
    "chars": 69,
    "preview": "source,target,tgt_lng\nAutoML,自动ML,zh-CN\n\"a,a\",a,zh-CN\n\"\"\"\",\"\"\"\",zh-CN"
  },
  {
    "path": "docs/index.md",
    "chars": 15,
    "preview": "\n{!README.md!}\n"
  },
  {
    "path": "docs/intro-to-pdf-object.md",
    "chars": 6281,
    "preview": "An Introduction to PDF Object Definitions in dpml\n===\n\n## 1. Understanding PDF Structure\nA PDF file is fundamentally an "
  },
  {
    "path": "docs/requirements.txt",
    "chars": 104,
    "preview": "sphinx>=8.2.0\nsphinx-click>=5.1.0\nfuro>=2024.1.29\nmyst-parser[linkify,html_meta,html_admonition]>=2.0.0 "
  },
  {
    "path": "docs/supported_languages.md",
    "chars": 12503,
    "preview": "# Supported Languages\n\nFor languages in the table below that do not rely on ligature support, BabelDOC provides good sup"
  },
  {
    "path": "mkdocs.yml",
    "chars": 5138,
    "preview": "# Copyright (c) 2016-2025 Martin Donath <martin.donath@squidfunk.com>\n\n# Permission is hereby granted, free of charge, t"
  },
  {
    "path": "pyproject.toml",
    "chars": 4726,
    "preview": "[project]\nname = \"BabelDOC\"\nversion = \"0.5.23\"\ndescription = \"Yet Another Document Translator\"\nlicense = \"AGPL-3.0\"\nread"
  },
  {
    "path": "tests/test_translation_cache_cleanup.py",
    "chars": 2569,
    "preview": "from concurrent.futures import ThreadPoolExecutor\n\nfrom babeldoc.translator.cache import TranslationCache\nfrom babeldoc."
  }
]
