[
  {
    "path": ".cursorignore",
    "content": "# Project notes and templates\nxnotes/\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.yaml",
    "content": "name: \"🐞 Bug Report\"\ndescription: Create a report to help us improve\nlabels: ['bug']\nbody:\n  - type: checkboxes\n    id: checks\n    attributes:\n      label: Before you submit\n      options:\n        - label: I have searched existing issues\n          required: true\n        - label: I spent at least 5 minutes investigating and preparing this report\n          required: true\n        - label: I confirmed this is not caused by a network issue\n          required: true\n        - label: I have fully read and understood the [README](https://github.com/funstory-ai/BabelDOC/blob/main/README.md)\n          required: true\n        - label: I am certain that this issue is with BabelDOC itself and can be reproduced through the BabelDOC cli\n          required: true\n        - label: I have uploaded the original file, or confirmed that this issue is unrelated to the original file\n          required: true\n        - label: I have uploaded the log.\n          required: true\n        - label: I confirm that the latest version of BabelDOC is being used.\n          required: true\n        - label: I am aware that the issue section of this project is only for submitting bugs that are clearly related to the BabelDOC core, with complete reproduction steps and relevant logs attached.** Otherwise, issues will be closed directly.\n          required: true\n\n  - type: markdown\n    attributes:\n      value: |\n        Thank you for using **BabelDOC** and helping us improve it! 🙏\n        Please confirm again that the above checklist items have been carefully executed! (If you have not carefully executed them, the issue will be closed directly without any response)\n\n        Please also note:\n        - If you are using a downstream project like pdf2zh-next, please submit an issue directly to the downstream application. 
Submit an issue here only after confirming that the problem lies in the BabelDOC core library.\n        - The CLI is intended for debugging only; we do not provide any technical support for CLI usage.\n\n  - type: markdown\n    attributes:\n      value: |\n        Please note! Users of immersive translate online services should contact customer service and provide their translation ID. **Feedback related to online services is not handled here.**\n\n  - type: textarea\n    id: environment\n    attributes:\n      label: Environment\n      description: Provide your system details (required)\n      value: |\n        - OS:\n        - Python:\n        - BabelDOC:\n      render: markdown\n    validations:\n      required: true\n\n  - type: textarea\n    id: describe\n    attributes:\n      label: Describe the bug\n      description: A clear and concise description of what the bug is.\n    validations:\n      required: true\n\n  - type: textarea\n    id: reproduce\n    attributes:\n      label: Steps to Reproduce\n      description: Help us reproduce the issue. Issues that do not provide reproduction steps will be closed directly.\n      value: |\n        1. Go to '...'\n        2. Click on '...'\n        3. See error\n    validations:\n      required: false\n\n  - type: textarea\n    id: expected\n    attributes:\n      label: Expected Behavior\n      description: What did you expect to happen?\n    validations:\n      required: false\n\n  - type: textarea\n    id: logs\n    attributes:\n      label: Relevant Log Output or Screenshots\n      description: Copy and paste any logs or attach screenshots. This will be formatted automatically.\n      render: text\n    validations:\n      required: false\n\n  - type: textarea\n    id: pdf\n    attributes:\n      label: Original PDF File\n      description: Upload the input PDF if applicable. 
(Issues related to specific PDFs but without uploaded files will be closed directly.)\n    validations:\n      required: false\n\n  - type: textarea\n    id: others\n    attributes:\n      label: Additional Context\n      description: Anything else we should know?\n    validations:\n      required: false\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature_request.yaml",
    "content": "name: \"✨ Feature Request\"\ndescription: Suggest a new idea or improvement for BabelDOC\nlabels: ['enhancement']\nbody:\n  - type: markdown\n    attributes:\n      value: |\n        Thank you for helping improve **BabelDOC**! Please fill out the form below to suggest a feature.\n\n  - type: checkboxes\n    id: checks\n    attributes:\n      label: Before you submit\n      options:\n        - label: I have searched existing issues\n          required: true\n        - label: I have fully read and understood the [README](https://github.com/funstory-ai/BabelDOC/blob/main/README.md)\n          required: true\n        - label: This feature is not related to BabelDOC CLI. The CLI is only used for debugging purposes, we do not accept any feature requests related to the CLI.\n          required: true\n  \n  - type: markdown\n    attributes:\n      value: |\n        如果您想自部署 BabelDOC，请使用 [PDFMathTranslate-next](https://github.com/PDFMathTranslate-next/PDFMathTranslate-next) 代替。若其功能无法满足，请向 [PDFMathTranslate-next](https://github.com/PDFMathTranslate-next/PDFMathTranslate-next) 提交功能请求。\n        If you wish to self-host BabelDOC, please use [PDFMathTranslate-next](https://github.com/PDFMathTranslate-next/PDFMathTranslate-next) instead. If its features do not meet your needs, please submit a feature request to [PDFMathTranslate-next](https://github.com/PDFMathTranslate-next/PDFMathTranslate-next).\n\n\n  - type: textarea\n    id: describe\n    attributes:\n      label: Is your feature request related to a problem?\n      description: If applicable, describe what problem this feature would solve.\n      placeholder: Ex. 
I'm always frustrated when ...\n    validations:\n      required: false\n\n  - type: textarea\n    id: solution\n    attributes:\n      label: Describe the solution you'd like\n      description: What would you like to see happen?\n    validations:\n      required: true\n\n  - type: textarea\n    id: alternatives\n    attributes:\n      label: Describe alternatives you've considered\n      description: Have you thought of other ways to solve this?\n    validations:\n      required: false\n\n  - type: textarea\n    id: additional\n    attributes:\n      label: Additional context\n      description: Any other context, examples, or screenshots?\n    validations:\n      required: false\n"
  },
  {
    "path": ".github/PULL_REQUEST_TEMPLATE/pr_form.yml",
    "content": "name: Pull Request\ndescription: Submit a pull request to contribute to BabelDOC\ntitle: \"[PR] <Your concise title here>\"\nlabels:\n  - needs triage\nbody:\n  - type: markdown\n    attributes:\n      value: |\n        ## 👋 Thanks for contributing to **BabelDOC**!\n\n        Please fill out this form to help us review your pull request effectively.\n\n  - type: input\n    id: issue\n    attributes:\n      label: Related Issue(s)\n      description: If this pull request closes or is related to one or more issues, list them here (e.g., #37)\n      placeholder: \"#37\"\n    validations:\n      required: false\n\n  - type: textarea\n    id: summary\n    attributes:\n      label: Description\n      description: Describe the purpose of this pull request and what was changed.\n      placeholder: |\n        - What does this PR introduce or fix?\n        - What is the motivation behind it?\n    validations:\n      required: true\n\n  - type: dropdown\n    id: pr_type\n    attributes:\n      label: PR Type\n      description: What kind of change is this?\n      multiple: true\n      options:\n        - enhancement\n        - bug\n        - documentation\n        - refactor\n        - test\n        - chore\n    validations:\n      required: true\n\n  - type: checkboxes\n    id: checklist\n    attributes:\n      label: Contributor Checklist\n      options:\n        - label: I’ve fully read and understood the **[CONTRIBUTING.md](https://funstory-ai.github.io/BabelDOC/CONTRIBUTING/)** guide\n          required: true\n        - label: My changes follow the project’s code style and guidelines\n          required: true\n        - label: I’ve linked the related issue(s) in the description above\n        - label: I’ve updated relevant documentation (if applicable)\n        - label: I’ve added necessary tests (if applicable)\n        - label: All new and existing tests passed locally\n        - label: I understand that due to limited maintainer resources, only small 
pull requests are accepted. Suggestions with proof-of-concept patches are appreciated, and my patch may be rewritten if necessary.\n\n  - type: textarea\n    id: testing\n    attributes:\n      label: Testing Instructions\n      description: Provide step-by-step instructions on how to test your changes\n      placeholder: |\n        1. Run `...`\n        2. Visit `...`\n        3. Click `...`\n        4. Verify `...`\n    validations:\n      required: false\n\n  - type: textarea\n    id: screenshots\n    attributes:\n      label: Screenshots (if applicable)\n      description: If UI changes were made, please attach before/after screenshots.\n    validations:\n      required: false\n\n  - type: textarea\n    id: notes\n    attributes:\n      label: Additional Notes\n      description: Anything else the reviewer should know?\n    validations:\n      required: false\n"
  },
  {
    "path": ".github/PULL_REQUEST_TEMPLATE.md",
    "content": "### PR Title\n\n<!-- Please fill in a concise and clear PR title below -->\n[PR] <Your concise title here>\n\n### Related Issue(s)\n\n<!-- If this PR closes or is related to one or more issues, please list them here (e.g., #37) -->\n<!-- e.g.: Closes #37, Relates to #42 -->\n\n### Motivation and Context\n\n<!-- Why is this change required? What problem does it solve? -->\n<!-- If it fixes an open issue, please link to the issue here. -->\n\n### Summary of Changes\n\n<!-- What does this PR introduce or fix? Please describe concisely. -->\n\n### PR Type\n\n<!-- What kind of change is this? Please select one or more -->\n- [ ] ✨ Enhancement\n- [ ] 🐛 Bug Fix\n- [ ] 📚 Documentation\n- [ ] 🏗️ Refactor\n- [ ] 🧪 Test\n- [ ] 🧹 Chore\n\n### Breaking Changes\n\n<!-- Does this PR introduce any breaking changes? If so, please describe them. -->\n<!-- - [ ] Yes, this PR introduces breaking changes.\n<!-- - [ ] No, this PR does not introduce breaking changes. -->\n<!-- Detailed description of breaking changes (if any): -->\n\n### Contributor Checklist\n\n- [ ] I have fully read and understood the **[CONTRIBUTING.md](https://funstory-ai.github.io/BabelDOC/CONTRIBUTING/)** guide.\n- [ ] I have performed a self-review of my own code.\n- [ ] My changes follow the project's code style and guidelines\n- [ ] I have linked the related issue(s) in the description above (if applicable)\n- [ ] I have updated relevant documentation (if applicable)\n- [ ] I have added necessary tests that prove my fix is effective or that my feature works (if applicable)\n- [ ] All new and existing tests passed locally with my changes\n- [ ] My changes generate no new warnings or errors\n- [ ] I understand that due to limited maintainer resources, only small PRs are accepted. 
Suggestions with proof-of-concept patches are appreciated, and my patch may be rewritten if necessary.\n\n### Testing Instructions\n\n<!-- Please provide clear and concise step-by-step instructions on how to test your changes. -->\n<!-- e.g.: -->\n<!-- 1. Check out this branch. -->\n<!-- 2. Run `...` to install dependencies. -->\n<!-- 3. Run `...` to start the application/run the script. -->\n<!-- 4. Navigate to `...` or observe `...` -->\n<!-- 5. Verify that `...` (expected outcome). -->\n\n### Screenshots (if applicable)\n\n<!-- If your changes include UI modifications, please add screenshots or GIFs to show the before and after. -->\n\n### Additional Notes\n\n<!-- Is there anything else the reviewer should know? For example, any dependencies, or potential impacts. --> "
  },
  {
    "path": ".github/dependabot.yml",
    "content": "version: 2\nupdates:\n  - package-ecosystem: github-actions\n    directory: \"/\"\n    schedule:\n      interval: weekly\n  # - package-ecosystem: pip\n  #   directory: \"/.github/workflows\"\n  #   schedule:\n  #     interval: weekly\n  # - package-ecosystem: pip\n  #   directory: \"/docs\"\n  #   schedule:\n  #     interval: weekly\n  - package-ecosystem: pip\n    directory: \"/\"\n    schedule:\n      interval: weekly\n    versioning-strategy: lockfile-only\n    allow:\n      - dependency-type: \"all\""
  },
  {
    "path": ".github/labels.yml",
    "content": "---\n# Labels names are important as they are used by Release Drafter to decide\n# regarding where to record them in changelog or if to skip them.\n#\n# The repository labels will be automatically configured using this file and\n# the GitHub Action https://github.com/marketplace/actions/github-labeler.\n- name: breaking\n  description: Breaking Changes\n  color: \"bfd4f2\"\n- name: bug\n  description: Something isn't working\n  color: \"d73a4a\"\n- name: build\n  description: Build System and Dependencies\n  color: \"bfdadc\"\n- name: ci\n  description: Continuous Integration\n  color: \"4a97d6\"\n- name: dependencies\n  description: Pull requests that update a dependency file\n  color: \"0366d6\"\n- name: documentation\n  description: Improvements or additions to documentation\n  color: \"0075ca\"\n- name: duplicate\n  description: This issue or pull request already exists\n  color: \"cfd3d7\"\n- name: enhancement\n  description: New feature or request\n  color: \"a2eeef\"\n- name: github_actions\n  description: Pull requests that update Github_actions code\n  color: \"000000\"\n- name: good first issue\n  description: Good for newcomers\n  color: \"7057ff\"\n- name: help wanted\n  description: Extra attention is needed\n  color: \"008672\"\n- name: invalid\n  description: This doesn't seem right\n  color: \"e4e669\"\n- name: performance\n  description: Performance\n  color: \"016175\"\n- name: python\n  description: Pull requests that update Python code\n  color: \"2b67c6\"\n- name: question\n  description: Further information is requested\n  color: \"d876e3\"\n- name: refactoring\n  description: Refactoring\n  color: \"ef67c4\"\n- name: removal\n  description: Removals and Deprecations\n  color: \"9ae7ea\"\n- name: style\n  description: Style\n  color: \"c120e5\"\n- name: testing\n  description: Testing\n  color: \"b1fc6f\"\n- name: wontfix\n  description: This will not be worked on\n  color: \"ffffff\""
  },
  {
    "path": ".github/release-drafter.yml",
    "content": "name-template: 'v$RESOLVED_VERSION'\ntag-template: 'v$RESOLVED_VERSION'\ncategories:\n  - title: '🚀 Features'\n    labels:\n      - 'feature'\n      - 'enhancement'\n  - title: '🐛 Bug Fixes'\n    labels:\n      - 'fix'\n      - 'bugfix'\n      - 'bug'\n  - title: '🧰 Maintenance'\n    labels:\n      - 'chore'\n      - 'maintenance'\n      - 'refactor'\n  - title: '📝 Documentation'\n    labels:\n      - 'docs'\n      - 'documentation'\nchange-template: '- $TITLE @$AUTHOR (#$NUMBER)'\nchange-title-escapes: '\\<*_&' # You can add # and @ to disable mentions\nversion-resolver:\n  major:\n    labels:\n      - 'major'\n  minor:\n    labels:\n      - 'minor'\n  patch:\n    labels:\n      - 'patch'\n  default: patch\ntemplate: |\n  ## Changes\n\n  $CHANGES\n\n  ## Contributors\n  \n  $CONTRIBUTORS\n"
  },
  {
    "path": ".github/workflows/codeql.yml",
    "content": "# For most projects, this workflow file will not need changing; you simply need\n# to commit it to your repository.\n#\n# You may wish to alter this file to override the set of languages analyzed,\n# or to provide custom queries or build logic.\n#\n# ******** NOTE ********\n# We have attempted to detect the languages in your repository. Please check\n# the `language` matrix defined below to confirm you have the correct set of\n# supported CodeQL languages.\n#\nname: \"CodeQL Advanced\"\n\non:\n  push:\n  pull_request:\n    branches: [ \"main\" ]\n  schedule:\n    - cron: '36 14 * * 1'\n\njobs:\n  analyze:\n    name: Analyze (${{ matrix.language }})\n    # Runner size impacts CodeQL analysis time. To learn more, please see:\n    #   - https://gh.io/recommended-hardware-resources-for-running-codeql\n    #   - https://gh.io/supported-runners-and-hardware-resources\n    #   - https://gh.io/using-larger-runners (GitHub.com only)\n    # Consider using larger runners or machines with greater resources for possible analysis time improvements.\n    runs-on: ${{ (matrix.language == 'swift' && 'macos-latest') || 'ubuntu-latest' }}\n    permissions:\n      # required for all workflows\n      security-events: write\n\n      # required to fetch internal or private CodeQL packs\n      packages: read\n\n      # only required for workflows in private repositories\n      actions: read\n      contents: read\n\n    strategy:\n      fail-fast: false\n      matrix:\n        include:\n        - language: python\n          build-mode: none\n        - language: actions\n        # CodeQL supports the following values keywords for 'language': 'c-cpp', 'csharp', 'go', 'java-kotlin', 'javascript-typescript', 'python', 'ruby', 'swift'\n        # Use `c-cpp` to analyze code written in C, C++ or both\n        # Use 'java-kotlin' to analyze code written in Java, Kotlin or both\n        # Use 'javascript-typescript' to analyze code written in JavaScript, TypeScript or both\n        
# To learn more about changing the languages that are analyzed or customizing the build mode for your analysis,\n        # see https://docs.github.com/en/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/customizing-your-advanced-setup-for-code-scanning.\n        # If you are analyzing a compiled language, you can modify the 'build-mode' for that language to customize how\n        # your codebase is analyzed, see https://docs.github.com/en/code-security/code-scanning/creating-an-advanced-setup-for-code-scanning/codeql-code-scanning-for-compiled-languages\n    steps:\n    - name: Checkout repository\n      uses: actions/checkout@v5\n\n    # Initializes the CodeQL tools for scanning.\n    - name: Initialize CodeQL\n      uses: github/codeql-action/init@v4\n      with:\n        languages: ${{ matrix.language }}\n        build-mode: ${{ matrix.build-mode }}\n        # If you wish to specify custom queries, you can do so here or in a config file.\n        # By default, queries listed here will override any specified in a config file.\n        # Prefix the list here with \"+\" to use these queries and those in the config file.\n\n        # For more details on CodeQL's query packs, refer to: https://docs.github.com/en/code-security/code-scanning/automatically-scanning-your-code-for-vulnerabilities-and-errors/configuring-code-scanning#using-queries-in-ql-packs\n        # queries: security-extended,security-and-quality\n\n    # If the analyze step fails for one of the languages you are analyzing with\n    # \"We were unable to automatically build your code\", modify the matrix above\n    # to set the build mode to \"manual\" for that language. 
Then modify this step\n    # to build your code.\n    # ℹ️ Command-line programs to run using the OS shell.\n    # 📚 See https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#jobsjob_idstepsrun\n    - if: matrix.build-mode == 'manual'\n      shell: bash\n      run: |\n        echo 'If you are using a \"manual\" build mode for one or more of the' \\\n          'languages you are analyzing, replace this with the commands to build' \\\n          'your code, for example:'\n        echo '  make bootstrap'\n        echo '  make release'\n        exit 1\n\n    - name: Perform CodeQL Analysis\n      uses: github/codeql-action/analyze@v4\n      with:\n        category: \"/language:${{matrix.language}}\"\n"
  },
  {
    "path": ".github/workflows/docs.yml",
    "content": "name: docs\non:\n  push:\n    branches:\n      - main\npermissions:\n  contents: write\njobs:\n  deploy:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v5\n        with:\n          fetch-depth: 0\n      - name: Configure Git Credentials\n        run: |\n          git config user.name github-actions[bot]\n          git config user.email 41898282+github-actions[bot]@users.noreply.github.com\n      - name: Setup uv with Python 3.12\n        uses: astral-sh/setup-uv@85856786d1ce8acfbcc2f13a5f3fbd6b938f9f41 # v7.1.2\n        with:\n          python-version: \"3.12\"\n          enable-cache: true\n          cache-dependency-glob: \"uv.lock\"\n          activate-environment: true\n      - run: echo \"cache_id=$(date --utc '+%V')\" >> $GITHUB_ENV \n      - uses: actions/cache@v4\n        with:\n          key: mkdocs-material-${{ env.cache_id }}\n          path: .cache\n          restore-keys: |\n            mkdocs-material-\n      - run: uv sync\n      - run: uv run mkdocs gh-deploy --force"
  },
  {
    "path": ".github/workflows/labeler.yml",
    "content": "name: Labeler\n\non:\n  push:\n    branches:\n      - 'main'\n    paths:\n      - '.github/labels.yml'\n      - '.github/workflows/labels.yml'\n  pull_request:\n    paths:\n      - '.github/labels.yml'\n      - '.github/workflows/labels.yml'\n\npermissions:\n  contents: read\n  issues: write\n  pull-requests: write\n\njobs:\n  labeler:\n    runs-on: ubuntu-latest\n    steps:\n      - name: Check out the repository\n        uses: actions/checkout@v5\n\n      - name: Run Labeler\n        uses: crazy-max/ghaction-github-labeler@24d110aa46a59976b8a7f35518cb7f14f434c916 # v5.3.0\n        with:\n          skip-delete: true\n          dry-run: ${{ github.event_name == 'pull_request' }}\n          github-token: ${{ secrets.GITHUB_TOKEN }}\n          yaml-file: .github/labels.yml\n          exclude: |\n            help*\n            *issue"
  },
  {
    "path": ".github/workflows/lint.yml",
    "content": "name: Lint Code\npermissions:\n  contents: read\n  pull-requests: write\non: [push]\n\njobs:\n  lint:\n    strategy:\n      fail-fast: false\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v5\n      - name: Ruff\n        uses: astral-sh/ruff-action@v3\n      - name: AutoCorrect\n        uses: huacnlee/autocorrect-action@main\n"
  },
  {
    "path": ".github/workflows/pr-lint.yml",
    "content": "name: Lint Code and Review Dog Report\n\non: [pull_request]\npermissions:\n  contents: read\n  pull-requests: write\njobs:\n  ruff:\n    name: runner / ruff\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v5\n      \n      - name: Install Python\n        uses: actions/setup-python@v6\n        with:\n          python-version: '3.11'\n          \n      - name: Install ruff\n        run: pip install ruff\n        \n      - name: Install reviewdog\n        uses: reviewdog/action-setup@d8edfce3dd5e1ec6978745e801f9c50b5ef80252 # v1.4.0\n        with:\n          reviewdog_version: latest\n          \n      - name: Run ruff with reviewdog\n        env:\n          REVIEWDOG_GITHUB_API_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n        run: |\n          ruff check . --output-format=rdjson | reviewdog -f=rdjson -reporter=github-pr-review -fail-on-error\n          \n  autocorrect:\n    name: runner / autocorrect\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v5\n      - name: AutoCorrect\n        uses: huacnlee/autocorrect-action@bf91ab3904c2908dd8e71312a8a83ed1eb632997 # v2.13.3\n      - name: Report ReviewDog\n        if: failure()\n        uses: huacnlee/autocorrect-action@bf91ab3904c2908dd8e71312a8a83ed1eb632997 # v2.13.3\n        env:\n          REVIEWDOG_GITHUB_API_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n        with:\n          reviewdog: true"
  },
  {
    "path": ".github/workflows/publish-to-pypi.yml",
    "content": "name: Release\n\non:\n  push:\n    branches:\n      - main\n      - master\n\npermissions:\n  id-token: write\n  contents: write\n  pull-requests: write\n\njobs:\n  check-repository:\n    name: Check if running in main repository\n    runs-on: ubuntu-latest\n    outputs:\n      is_main_repo: ${{ github.repository == 'funstory-ai/BabelDOC' }}\n    steps:\n      - run: echo \"Running repository check\"\n\n  build:\n    name: Build distribution 📦\n    needs: check-repository\n    if: needs.check-repository.outputs.is_main_repo == 'true'\n    runs-on: ubuntu-latest\n    outputs:\n      is_release: ${{ steps.check-version.outputs.tag }}\n    steps:\n      - uses: actions/checkout@v5\n        with:\n          persist-credentials: true\n          fetch-depth: 2\n          token: ${{ secrets.GITHUB_TOKEN }}\n          \n      - name: Setup uv with Python 3.12\n        uses: astral-sh/setup-uv@85856786d1ce8acfbcc2f13a5f3fbd6b938f9f41 # v7.1.2\n        with:\n          python-version: \"3.12\"\n          enable-cache: true\n          cache-dependency-glob: \"uv.lock\"\n          activate-environment: true\n\n      - name: Check if there is a parent commit\n        id: check-parent-commit\n        run: |\n          echo \"sha=$(git rev-parse --verify --quiet HEAD^)\" >> $GITHUB_OUTPUT\n\n      - name: Detect and tag new version\n        id: check-version\n        if: steps.check-parent-commit.outputs.sha\n        uses: salsify/action-detect-and-tag-new-version@b1778166f13188a9d478e2d1198f993011ba9864 # v2.0.3\n        with:\n          version-command: |\n            cat pyproject.toml | grep \"version = \" | head -n 1 | awk -F'\"' '{print $2}'\n\n      - name: Install Dependencies\n        run: |\n          uv sync\n\n      - name: Bump version for developmental release\n        if: \"! 
steps.check-version.outputs.tag\"\n        run: |\n          version=$(uv run bumpver update --patch --tag=final --dry 2>&1 | grep \"New Version\" | awk '{print $NF}') &&\n          uv run bumpver update --set-version $version.dev$(date +%s)\n\n      - name: Build package\n        run: \"uv build\"\n\n      - name: Store the distribution packages\n        uses: actions/upload-artifact@v4.6.2\n        with:\n          name: python-package-distributions\n          path: dist/\n\n  publish-to-pypi:\n    name: Publish Python 🐍 distribution 📦 to PyPI\n    if: needs.build.outputs.is_release != ''\n    needs:\n      - check-repository\n      - build\n    runs-on: ubuntu-latest\n    environment:\n      name: pypi\n      url: https://pypi.org/p/BabelDOC\n\n    permissions:\n      id-token: write\n\n    steps:\n      - name: Download all the dists\n        uses: actions/download-artifact@634f93cb2916e3fdff6788551b99b062d0335ce0 # v5.0.0\n        with:\n          name: python-package-distributions\n          path: dist/\n\n      - name: Publish distribution 📦 to PyPI\n        uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e # v1.13.0\n\n  publish-to-testpypi:\n    name: Publish Python 🐍 distribution 📦 to TestPyPI\n    if: needs.build.outputs.is_release == ''\n    needs:\n      - check-repository\n      - build\n    runs-on: ubuntu-latest\n    environment:\n      name: testpypi\n      url: https://test.pypi.org/p/BabelDOC\n\n    permissions:\n      id-token: write\n\n    steps:\n      - name: Download all the dists\n        uses: actions/download-artifact@634f93cb2916e3fdff6788551b99b062d0335ce0 # v5.0.0\n        with:\n          name: python-package-distributions\n          path: dist/\n\n      - name: Publish distribution 📦 to TestPyPI\n        uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e # v1.13.0\n        with:\n          repository-url: https://test.pypi.org/legacy/\n\n  post-release:\n    name: Post Release Tasks\n  
  needs:\n      - check-repository\n      - build\n      - publish-to-pypi\n      - publish-to-testpypi\n    if: |\n      always() && needs.check-repository.outputs.is_main_repo == 'true' && \n      (needs.publish-to-pypi.result == 'success' || needs.publish-to-testpypi.result == 'success')\n    runs-on: ubuntu-latest\n    permissions:\n      contents: write\n      pull-requests: write\n    steps:\n      - uses: actions/checkout@v5\n        with:\n          persist-credentials: true\n          fetch-depth: 2\n          token: ${{ secrets.GITHUB_TOKEN }}\n\n      - name: Publish the release notes\n        uses: release-drafter/release-drafter@b1476f6e6eb133afa41ed8589daba6dc69b4d3f5 # v6.1.0\n        with:\n          publish: ${{ needs.build.outputs.is_release != '' }}\n          tag: ${{ needs.build.outputs.is_release }}\n        env:\n          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}"
  },
  {
    "path": ".github/workflows/test.yml",
    "content": "name: Run Tests 🧪\n\non:\n  push:\n  pull_request:\n    branches: [\"main\"]\n\npermissions:\n  contents: read\n  pull-requests: read\n\njobs:\n  test:\n    name: Run Python Tests\n    runs-on: ubuntu-latest\n    strategy:\n      matrix:\n        python-version: [\"3.10\", \"3.11\", \"3.12\", \"3.13\"]\n\n    steps:\n      - uses: actions/checkout@v5\n        with:\n          persist-credentials: false\n      - name: Cached Assets\n        id: cache-assets\n        uses: actions/cache@v4.2.0\n        with:\n          path: ~/.cache/babeldoc\n          key: babeldoc-assets-${{ hashFiles('babeldoc/assets/embedding_assets_metadata.py') }}\n      - name: Setup uv with Python ${{ matrix.python-version }}\n        uses: astral-sh/setup-uv@85856786d1ce8acfbcc2f13a5f3fbd6b938f9f41 # v7.1.2\n        with:\n          python-version: ${{ matrix.python-version }}\n          enable-cache: true\n          cache-dependency-glob: \"uv.lock\"\n          activate-environment: true\n      - name: Warm up cache\n        run: |\n          uv run babeldoc --warmup\n      - name: Run tests\n        env:\n          OPENAI_API_KEY: ${{ secrets.OPENAIAPIKEY }}\n          OPENAI_BASE_URL: ${{ secrets.OPENAIAPIURL }}\n          OPENAI_MODEL: ${{ secrets.OPENAIMODEL }}\n        run: |\n          uv run babeldoc --help\n          uv run babeldoc --openai --files examples/ci/test.pdf --openai-api-key ${{ env.OPENAI_API_KEY }} --openai-base-url ${{ env.OPENAI_BASE_URL }} --openai-model ${{ env.OPENAI_MODEL }}\n      - name: Generate offline assets package\n        run: |\n          uv run babeldoc --generate-offline-assets /tmp/offline_assets\n      - name: Restore offline assets package\n        run: |\n          rm -rf ~/.cache/babeldoc\n          uv run babeldoc --restore-offline-assets /tmp/offline_assets\n      - name: Clean up\n        run: |\n          rm -rf /tmp/offline_assets\n          rm -rf ~/.cache/babeldoc/cache.v1.db\n          rm -rf ~/.cache/babeldoc/working\n"
  },
  {
    "path": ".gitignore",
    "content": "# Logs\nweb/logs\nweb/*.log\nweb/npm-debug.log*\nweb/yarn-debug.log*\nweb/yarn-error.log*\nweb/pnpm-debug.log*\nweb/lerna-debug.log*\n\nweb/node_modules\nweb/dist\nweb/dist-ssr\nweb/*.local\n\nmemray*\n**/*.so\n*.pdf\n*.docx\n*.json\n**/*.pyc\n.venv\n.idea\n*.egg-info\n.DS_Store\n.vscode\n__pycache__\n.ruff_cache\nyadt.toml\nexamples/\n/make_gif.py\n/dist\n.cache\n.cursor/rules/_*.mdc\n/.cursor\n/xnotes\n/docs/workflow-rules.md\nbabeldoc/format/txt\n/profile.svg\n\n\n# uv\nuv.lock\n\n# Claude Code memory file\nCLAUDE.md\n/.claude\nbabeldoc/format/playground\ntemp.jpg\nAGENTS.md\n"
  },
  {
    "path": ".pre-commit-config.yaml",
    "content": "files: '^.*\\.py$'\nrepos:\n  - repo: https://github.com/astral-sh/ruff-pre-commit\n    # Ruff version.\n    rev: v0.9.5\n    hooks:\n      # Run the linter.\n      - id: ruff\n        args: [ \"--fix\",\n                \"--ignore=E203,E261,E501,E741,F841\" ]\n      # Run the formatter.\n      - id: ruff-format\n"
  },
  {
    "path": "LICENSE",
    "content": "                    GNU AFFERO GENERAL PUBLIC LICENSE\n                       Version 3, 19 November 2007\n\n Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>\n Everyone is permitted to copy and distribute verbatim copies\n of this license document, but changing it is not allowed.\n\n                            Preamble\n\n  The GNU Affero General Public License is a free, copyleft license for\nsoftware and other kinds of works, specifically designed to ensure\ncooperation with the community in the case of network server software.\n\n  The licenses for most software and other practical works are designed\nto take away your freedom to share and change the works.  By contrast,\nour General Public Licenses are intended to guarantee your freedom to\nshare and change all versions of a program--to make sure it remains free\nsoftware for all its users.\n\n  When we speak of free software, we are referring to freedom, not\nprice.  Our General Public Licenses are designed to make sure that you\nhave the freedom to distribute copies of free software (and charge for\nthem if you wish), that you receive source code or can get it if you\nwant it, that you can change the software or use pieces of it in new\nfree programs, and that you know you can do these things.\n\n  Developers that use our General Public Licenses protect your rights\nwith two steps: (1) assert copyright on the software, and (2) offer\nyou this License which gives you legal permission to copy, distribute\nand/or modify the software.\n\n  A secondary benefit of defending all users' freedom is that\nimprovements made in alternate versions of the program, if they\nreceive widespread use, become available for other developers to\nincorporate.  Many developers of free software are heartened and\nencouraged by the resulting cooperation.  
However, in the case of\nsoftware used on network servers, this result may fail to come about.\nThe GNU General Public License permits making a modified version and\nletting the public access it on a server without ever releasing its\nsource code to the public.\n\n  The GNU Affero General Public License is designed specifically to\nensure that, in such cases, the modified source code becomes available\nto the community.  It requires the operator of a network server to\nprovide the source code of the modified version running there to the\nusers of that server.  Therefore, public use of a modified version, on\na publicly accessible server, gives the public access to the source\ncode of the modified version.\n\n  An older license, called the Affero General Public License and\npublished by Affero, was designed to accomplish similar goals.  This is\na different license, not a version of the Affero GPL, but Affero has\nreleased a new version of the Affero GPL which permits relicensing under\nthis license.\n\n  The precise terms and conditions for copying, distribution and\nmodification follow.\n\n                       TERMS AND CONDITIONS\n\n  0. Definitions.\n\n  \"This License\" refers to version 3 of the GNU Affero General Public License.\n\n  \"Copyright\" also means copyright-like laws that apply to other kinds of\nworks, such as semiconductor masks.\n\n  \"The Program\" refers to any copyrightable work licensed under this\nLicense.  Each licensee is addressed as \"you\".  \"Licensees\" and\n\"recipients\" may be individuals or organizations.\n\n  To \"modify\" a work means to copy from or adapt all or part of the work\nin a fashion requiring copyright permission, other than the making of an\nexact copy.  
The resulting work is called a \"modified version\" of the\nearlier work or a work \"based on\" the earlier work.\n\n  A \"covered work\" means either the unmodified Program or a work based\non the Program.\n\n  To \"propagate\" a work means to do anything with it that, without\npermission, would make you directly or secondarily liable for\ninfringement under applicable copyright law, except executing it on a\ncomputer or modifying a private copy.  Propagation includes copying,\ndistribution (with or without modification), making available to the\npublic, and in some countries other activities as well.\n\n  To \"convey\" a work means any kind of propagation that enables other\nparties to make or receive copies.  Mere interaction with a user through\na computer network, with no transfer of a copy, is not conveying.\n\n  An interactive user interface displays \"Appropriate Legal Notices\"\nto the extent that it includes a convenient and prominently visible\nfeature that (1) displays an appropriate copyright notice, and (2)\ntells the user that there is no warranty for the work (except to the\nextent that warranties are provided), that licensees may convey the\nwork under this License, and how to view a copy of this License.  If\nthe interface presents a list of user commands or options, such as a\nmenu, a prominent item in the list meets this criterion.\n\n  1. Source Code.\n\n  The \"source code\" for a work means the preferred form of the work\nfor making modifications to it.  
\"Object code\" means any non-source\nform of a work.\n\n  A \"Standard Interface\" means an interface that either is an official\nstandard defined by a recognized standards body, or, in the case of\ninterfaces specified for a particular programming language, one that\nis widely used among developers working in that language.\n\n  The \"System Libraries\" of an executable work include anything, other\nthan the work as a whole, that (a) is included in the normal form of\npackaging a Major Component, but which is not part of that Major\nComponent, and (b) serves only to enable use of the work with that\nMajor Component, or to implement a Standard Interface for which an\nimplementation is available to the public in source code form.  A\n\"Major Component\", in this context, means a major essential component\n(kernel, window system, and so on) of the specific operating system\n(if any) on which the executable work runs, or a compiler used to\nproduce the work, or an object code interpreter used to run it.\n\n  The \"Corresponding Source\" for a work in object code form means all\nthe source code needed to generate, install, and (for an executable\nwork) run the object code and to modify the work, including scripts to\ncontrol those activities.  However, it does not include the work's\nSystem Libraries, or general-purpose tools or generally available free\nprograms which are used unmodified in performing those activities but\nwhich are not part of the work.  
For example, Corresponding Source\nincludes interface definition files associated with source files for\nthe work, and the source code for shared libraries and dynamically\nlinked subprograms that the work is specifically designed to require,\nsuch as by intimate data communication or control flow between those\nsubprograms and other parts of the work.\n\n  The Corresponding Source need not include anything that users\ncan regenerate automatically from other parts of the Corresponding\nSource.\n\n  The Corresponding Source for a work in source code form is that\nsame work.\n\n  2. Basic Permissions.\n\n  All rights granted under this License are granted for the term of\ncopyright on the Program, and are irrevocable provided the stated\nconditions are met.  This License explicitly affirms your unlimited\npermission to run the unmodified Program.  The output from running a\ncovered work is covered by this License only if the output, given its\ncontent, constitutes a covered work.  This License acknowledges your\nrights of fair use or other equivalent, as provided by copyright law.\n\n  You may make, run and propagate covered works that you do not\nconvey, without conditions so long as your license otherwise remains\nin force.  You may convey covered works to others for the sole purpose\nof having them make modifications exclusively for you, or provide you\nwith facilities for running those works, provided that you comply with\nthe terms of this License in conveying all material for which you do\nnot control copyright.  Those thus making or running the covered works\nfor you must do so exclusively on your behalf, under your direction\nand control, on terms that prohibit them from making any copies of\nyour copyrighted material outside their relationship with you.\n\n  Conveying under any other circumstances is permitted solely under\nthe conditions stated below.  Sublicensing is not allowed; section 10\nmakes it unnecessary.\n\n  3. 
Protecting Users' Legal Rights From Anti-Circumvention Law.\n\n  No covered work shall be deemed part of an effective technological\nmeasure under any applicable law fulfilling obligations under article\n11 of the WIPO copyright treaty adopted on 20 December 1996, or\nsimilar laws prohibiting or restricting circumvention of such\nmeasures.\n\n  When you convey a covered work, you waive any legal power to forbid\ncircumvention of technological measures to the extent such circumvention\nis effected by exercising rights under this License with respect to\nthe covered work, and you disclaim any intention to limit operation or\nmodification of the work as a means of enforcing, against the work's\nusers, your or third parties' legal rights to forbid circumvention of\ntechnological measures.\n\n  4. Conveying Verbatim Copies.\n\n  You may convey verbatim copies of the Program's source code as you\nreceive it, in any medium, provided that you conspicuously and\nappropriately publish on each copy an appropriate copyright notice;\nkeep intact all notices stating that this License and any\nnon-permissive terms added in accord with section 7 apply to the code;\nkeep intact all notices of the absence of any warranty; and give all\nrecipients a copy of this License along with the Program.\n\n  You may charge any price or no price for each copy that you convey,\nand you may offer support or warranty protection for a fee.\n\n  5. Conveying Modified Source Versions.\n\n  You may convey a work based on the Program, or the modifications to\nproduce it from the Program, in the form of source code under the\nterms of section 4, provided that you also meet all of these conditions:\n\n    a) The work must carry prominent notices stating that you modified\n    it, and giving a relevant date.\n\n    b) The work must carry prominent notices stating that it is\n    released under this License and any conditions added under section\n    7.  
This requirement modifies the requirement in section 4 to\n    \"keep intact all notices\".\n\n    c) You must license the entire work, as a whole, under this\n    License to anyone who comes into possession of a copy.  This\n    License will therefore apply, along with any applicable section 7\n    additional terms, to the whole of the work, and all its parts,\n    regardless of how they are packaged.  This License gives no\n    permission to license the work in any other way, but it does not\n    invalidate such permission if you have separately received it.\n\n    d) If the work has interactive user interfaces, each must display\n    Appropriate Legal Notices; however, if the Program has interactive\n    interfaces that do not display Appropriate Legal Notices, your\n    work need not make them do so.\n\n  A compilation of a covered work with other separate and independent\nworks, which are not by their nature extensions of the covered work,\nand which are not combined with it such as to form a larger program,\nin or on a volume of a storage or distribution medium, is called an\n\"aggregate\" if the compilation and its resulting copyright are not\nused to limit the access or legal rights of the compilation's users\nbeyond what the individual works permit.  Inclusion of a covered work\nin an aggregate does not cause this License to apply to the other\nparts of the aggregate.\n\n  6. 
Conveying Non-Source Forms.\n\n  You may convey a covered work in object code form under the terms\nof sections 4 and 5, provided that you also convey the\nmachine-readable Corresponding Source under the terms of this License,\nin one of these ways:\n\n    a) Convey the object code in, or embodied in, a physical product\n    (including a physical distribution medium), accompanied by the\n    Corresponding Source fixed on a durable physical medium\n    customarily used for software interchange.\n\n    b) Convey the object code in, or embodied in, a physical product\n    (including a physical distribution medium), accompanied by a\n    written offer, valid for at least three years and valid for as\n    long as you offer spare parts or customer support for that product\n    model, to give anyone who possesses the object code either (1) a\n    copy of the Corresponding Source for all the software in the\n    product that is covered by this License, on a durable physical\n    medium customarily used for software interchange, for a price no\n    more than your reasonable cost of physically performing this\n    conveying of source, or (2) access to copy the\n    Corresponding Source from a network server at no charge.\n\n    c) Convey individual copies of the object code with a copy of the\n    written offer to provide the Corresponding Source.  This\n    alternative is allowed only occasionally and noncommercially, and\n    only if you received the object code with such an offer, in accord\n    with subsection 6b.\n\n    d) Convey the object code by offering access from a designated\n    place (gratis or for a charge), and offer equivalent access to the\n    Corresponding Source in the same way through the same place at no\n    further charge.  You need not require recipients to copy the\n    Corresponding Source along with the object code.  
If the place to\n    copy the object code is a network server, the Corresponding Source\n    may be on a different server (operated by you or a third party)\n    that supports equivalent copying facilities, provided you maintain\n    clear directions next to the object code saying where to find the\n    Corresponding Source.  Regardless of what server hosts the\n    Corresponding Source, you remain obligated to ensure that it is\n    available for as long as needed to satisfy these requirements.\n\n    e) Convey the object code using peer-to-peer transmission, provided\n    you inform other peers where the object code and Corresponding\n    Source of the work are being offered to the general public at no\n    charge under subsection 6d.\n\n  A separable portion of the object code, whose source code is excluded\nfrom the Corresponding Source as a System Library, need not be\nincluded in conveying the object code work.\n\n  A \"User Product\" is either (1) a \"consumer product\", which means any\ntangible personal property which is normally used for personal, family,\nor household purposes, or (2) anything designed or sold for incorporation\ninto a dwelling.  In determining whether a product is a consumer product,\ndoubtful cases shall be resolved in favor of coverage.  For a particular\nproduct received by a particular user, \"normally used\" refers to a\ntypical or common use of that class of product, regardless of the status\nof the particular user or of the way in which the particular user\nactually uses, or expects or is expected to use, the product.  
A product\nis a consumer product regardless of whether the product has substantial\ncommercial, industrial or non-consumer uses, unless such uses represent\nthe only significant mode of use of the product.\n\n  \"Installation Information\" for a User Product means any methods,\nprocedures, authorization keys, or other information required to install\nand execute modified versions of a covered work in that User Product from\na modified version of its Corresponding Source.  The information must\nsuffice to ensure that the continued functioning of the modified object\ncode is in no case prevented or interfered with solely because\nmodification has been made.\n\n  If you convey an object code work under this section in, or with, or\nspecifically for use in, a User Product, and the conveying occurs as\npart of a transaction in which the right of possession and use of the\nUser Product is transferred to the recipient in perpetuity or for a\nfixed term (regardless of how the transaction is characterized), the\nCorresponding Source conveyed under this section must be accompanied\nby the Installation Information.  But this requirement does not apply\nif neither you nor any third party retains the ability to install\nmodified object code on the User Product (for example, the work has\nbeen installed in ROM).\n\n  The requirement to provide Installation Information does not include a\nrequirement to continue to provide support service, warranty, or updates\nfor a work that has been modified or installed by the recipient, or for\nthe User Product in which it has been modified or installed.  
Access to a\nnetwork may be denied when the modification itself materially and\nadversely affects the operation of the network or violates the rules and\nprotocols for communication across the network.\n\n  Corresponding Source conveyed, and Installation Information provided,\nin accord with this section must be in a format that is publicly\ndocumented (and with an implementation available to the public in\nsource code form), and must require no special password or key for\nunpacking, reading or copying.\n\n  7. Additional Terms.\n\n  \"Additional permissions\" are terms that supplement the terms of this\nLicense by making exceptions from one or more of its conditions.\nAdditional permissions that are applicable to the entire Program shall\nbe treated as though they were included in this License, to the extent\nthat they are valid under applicable law.  If additional permissions\napply only to part of the Program, that part may be used separately\nunder those permissions, but the entire Program remains governed by\nthis License without regard to the additional permissions.\n\n  When you convey a copy of a covered work, you may at your option\nremove any additional permissions from that copy, or from any part of\nit.  (Additional permissions may be written to require their own\nremoval in certain cases when you modify the work.)  
You may place\nadditional permissions on material, added by you to a covered work,\nfor which you have or can give appropriate copyright permission.\n\n  Notwithstanding any other provision of this License, for material you\nadd to a covered work, you may (if authorized by the copyright holders of\nthat material) supplement the terms of this License with terms:\n\n    a) Disclaiming warranty or limiting liability differently from the\n    terms of sections 15 and 16 of this License; or\n\n    b) Requiring preservation of specified reasonable legal notices or\n    author attributions in that material or in the Appropriate Legal\n    Notices displayed by works containing it; or\n\n    c) Prohibiting misrepresentation of the origin of that material, or\n    requiring that modified versions of such material be marked in\n    reasonable ways as different from the original version; or\n\n    d) Limiting the use for publicity purposes of names of licensors or\n    authors of the material; or\n\n    e) Declining to grant rights under trademark law for use of some\n    trade names, trademarks, or service marks; or\n\n    f) Requiring indemnification of licensors and authors of that\n    material by anyone who conveys the material (or modified versions of\n    it) with contractual assumptions of liability to the recipient, for\n    any liability that these contractual assumptions directly impose on\n    those licensors and authors.\n\n  All other non-permissive additional terms are considered \"further\nrestrictions\" within the meaning of section 10.  If the Program as you\nreceived it, or any part of it, contains a notice stating that it is\ngoverned by this License along with a term that is a further\nrestriction, you may remove that term.  
If a license document contains\na further restriction but permits relicensing or conveying under this\nLicense, you may add to a covered work material governed by the terms\nof that license document, provided that the further restriction does\nnot survive such relicensing or conveying.\n\n  If you add terms to a covered work in accord with this section, you\nmust place, in the relevant source files, a statement of the\nadditional terms that apply to those files, or a notice indicating\nwhere to find the applicable terms.\n\n  Additional terms, permissive or non-permissive, may be stated in the\nform of a separately written license, or stated as exceptions;\nthe above requirements apply either way.\n\n  8. Termination.\n\n  You may not propagate or modify a covered work except as expressly\nprovided under this License.  Any attempt otherwise to propagate or\nmodify it is void, and will automatically terminate your rights under\nthis License (including any patent licenses granted under the third\nparagraph of section 11).\n\n  However, if you cease all violation of this License, then your\nlicense from a particular copyright holder is reinstated (a)\nprovisionally, unless and until the copyright holder explicitly and\nfinally terminates your license, and (b) permanently, if the copyright\nholder fails to notify you of the violation by some reasonable means\nprior to 60 days after the cessation.\n\n  Moreover, your license from a particular copyright holder is\nreinstated permanently if the copyright holder notifies you of the\nviolation by some reasonable means, this is the first time you have\nreceived notice of violation of this License (for any work) from that\ncopyright holder, and you cure the violation prior to 30 days after\nyour receipt of the notice.\n\n  Termination of your rights under this section does not terminate the\nlicenses of parties who have received copies or rights from you under\nthis License.  
If your rights have been terminated and not permanently\nreinstated, you do not qualify to receive new licenses for the same\nmaterial under section 10.\n\n  9. Acceptance Not Required for Having Copies.\n\n  You are not required to accept this License in order to receive or\nrun a copy of the Program.  Ancillary propagation of a covered work\noccurring solely as a consequence of using peer-to-peer transmission\nto receive a copy likewise does not require acceptance.  However,\nnothing other than this License grants you permission to propagate or\nmodify any covered work.  These actions infringe copyright if you do\nnot accept this License.  Therefore, by modifying or propagating a\ncovered work, you indicate your acceptance of this License to do so.\n\n  10. Automatic Licensing of Downstream Recipients.\n\n  Each time you convey a covered work, the recipient automatically\nreceives a license from the original licensors, to run, modify and\npropagate that work, subject to this License.  You are not responsible\nfor enforcing compliance by third parties with this License.\n\n  An \"entity transaction\" is a transaction transferring control of an\norganization, or substantially all assets of one, or subdividing an\norganization, or merging organizations.  If propagation of a covered\nwork results from an entity transaction, each party to that\ntransaction who receives a copy of the work also receives whatever\nlicenses to the work the party's predecessor in interest had or could\ngive under the previous paragraph, plus a right to possession of the\nCorresponding Source of the work from the predecessor in interest, if\nthe predecessor has it or can get it with reasonable efforts.\n\n  You may not impose any further restrictions on the exercise of the\nrights granted or affirmed under this License.  
For example, you may\nnot impose a license fee, royalty, or other charge for exercise of\nrights granted under this License, and you may not initiate litigation\n(including a cross-claim or counterclaim in a lawsuit) alleging that\nany patent claim is infringed by making, using, selling, offering for\nsale, or importing the Program or any portion of it.\n\n  11. Patents.\n\n  A \"contributor\" is a copyright holder who authorizes use under this\nLicense of the Program or a work on which the Program is based.  The\nwork thus licensed is called the contributor's \"contributor version\".\n\n  A contributor's \"essential patent claims\" are all patent claims\nowned or controlled by the contributor, whether already acquired or\nhereafter acquired, that would be infringed by some manner, permitted\nby this License, of making, using, or selling its contributor version,\nbut do not include claims that would be infringed only as a\nconsequence of further modification of the contributor version.  For\npurposes of this definition, \"control\" includes the right to grant\npatent sublicenses in a manner consistent with the requirements of\nthis License.\n\n  Each contributor grants you a non-exclusive, worldwide, royalty-free\npatent license under the contributor's essential patent claims, to\nmake, use, sell, offer for sale, import and otherwise run, modify and\npropagate the contents of its contributor version.\n\n  In the following three paragraphs, a \"patent license\" is any express\nagreement or commitment, however denominated, not to enforce a patent\n(such as an express permission to practice a patent or covenant not to\nsue for patent infringement).  
To \"grant\" such a patent license to a\nparty means to make such an agreement or commitment not to enforce a\npatent against the party.\n\n  If you convey a covered work, knowingly relying on a patent license,\nand the Corresponding Source of the work is not available for anyone\nto copy, free of charge and under the terms of this License, through a\npublicly available network server or other readily accessible means,\nthen you must either (1) cause the Corresponding Source to be so\navailable, or (2) arrange to deprive yourself of the benefit of the\npatent license for this particular work, or (3) arrange, in a manner\nconsistent with the requirements of this License, to extend the patent\nlicense to downstream recipients.  \"Knowingly relying\" means you have\nactual knowledge that, but for the patent license, your conveying the\ncovered work in a country, or your recipient's use of the covered work\nin a country, would infringe one or more identifiable patents in that\ncountry that you have reason to believe are valid.\n\n  If, pursuant to or in connection with a single transaction or\narrangement, you convey, or propagate by procuring conveyance of, a\ncovered work, and grant a patent license to some of the parties\nreceiving the covered work authorizing them to use, propagate, modify\nor convey a specific copy of the covered work, then the patent license\nyou grant is automatically extended to all recipients of the covered\nwork and works based on it.\n\n  A patent license is \"discriminatory\" if it does not include within\nthe scope of its coverage, prohibits the exercise of, or is\nconditioned on the non-exercise of one or more of the rights that are\nspecifically granted under this License.  
You may not convey a covered\nwork if you are a party to an arrangement with a third party that is\nin the business of distributing software, under which you make payment\nto the third party based on the extent of your activity of conveying\nthe work, and under which the third party grants, to any of the\nparties who would receive the covered work from you, a discriminatory\npatent license (a) in connection with copies of the covered work\nconveyed by you (or copies made from those copies), or (b) primarily\nfor and in connection with specific products or compilations that\ncontain the covered work, unless you entered into that arrangement,\nor that patent license was granted, prior to 28 March 2007.\n\n  Nothing in this License shall be construed as excluding or limiting\nany implied license or other defenses to infringement that may\notherwise be available to you under applicable patent law.\n\n  12. No Surrender of Others' Freedom.\n\n  If conditions are imposed on you (whether by court order, agreement or\notherwise) that contradict the conditions of this License, they do not\nexcuse you from the conditions of this License.  If you cannot convey a\ncovered work so as to satisfy simultaneously your obligations under this\nLicense and any other pertinent obligations, then as a consequence you may\nnot convey it at all.  For example, if you agree to terms that obligate you\nto collect a royalty for further conveying from those to whom you convey\nthe Program, the only way you could satisfy both those terms and this\nLicense would be to refrain entirely from conveying the Program.\n\n  13. 
Remote Network Interaction; Use with the GNU General Public License.\n\n  Notwithstanding any other provision of this License, if you modify the\nProgram, your modified version must prominently offer all users\ninteracting with it remotely through a computer network (if your version\nsupports such interaction) an opportunity to receive the Corresponding\nSource of your version by providing access to the Corresponding Source\nfrom a network server at no charge, through some standard or customary\nmeans of facilitating copying of software.  This Corresponding Source\nshall include the Corresponding Source for any work covered by version 3\nof the GNU General Public License that is incorporated pursuant to the\nfollowing paragraph.\n\n  Notwithstanding any other provision of this License, you have\npermission to link or combine any covered work with a work licensed\nunder version 3 of the GNU General Public License into a single\ncombined work, and to convey the resulting work.  The terms of this\nLicense will continue to apply to the part which is the covered work,\nbut the work with which it is combined will remain governed by version\n3 of the GNU General Public License.\n\n  14. Revised Versions of this License.\n\n  The Free Software Foundation may publish revised and/or new versions of\nthe GNU Affero General Public License from time to time.  Such new versions\nwill be similar in spirit to the present version, but may differ in detail to\naddress new problems or concerns.\n\n  Each version is given a distinguishing version number.  If the\nProgram specifies that a certain numbered version of the GNU Affero General\nPublic License \"or any later version\" applies to it, you have the\noption of following the terms and conditions either of that numbered\nversion or of any later version published by the Free Software\nFoundation.  
If the Program does not specify a version number of the\nGNU Affero General Public License, you may choose any version ever published\nby the Free Software Foundation.\n\n  If the Program specifies that a proxy can decide which future\nversions of the GNU Affero General Public License can be used, that proxy's\npublic statement of acceptance of a version permanently authorizes you\nto choose that version for the Program.\n\n  Later license versions may give you additional or different\npermissions.  However, no additional obligations are imposed on any\nauthor or copyright holder as a result of your choosing to follow a\nlater version.\n\n  15. Disclaimer of Warranty.\n\n  THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY\nAPPLICABLE LAW.  EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT\nHOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM \"AS IS\" WITHOUT WARRANTY\nOF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,\nTHE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR\nPURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM\nIS WITH YOU.  SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF\nALL NECESSARY SERVICING, REPAIR OR CORRECTION.\n\n  16. Limitation of Liability.\n\n  IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING\nWILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS\nTHE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY\nGENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE\nUSE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF\nDATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD\nPARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),\nEVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF\nSUCH DAMAGES.\n\n  17. 
Interpretation of Sections 15 and 16.\n\n  If the disclaimer of warranty and limitation of liability provided\nabove cannot be given local legal effect according to their terms,\nreviewing courts shall apply local law that most closely approximates\nan absolute waiver of all civil liability in connection with the\nProgram, unless a warranty or assumption of liability accompanies a\ncopy of the Program in return for a fee.\n\n                     END OF TERMS AND CONDITIONS\n\n            How to Apply These Terms to Your New Programs\n\n  If you develop a new program, and you want it to be of the greatest\npossible use to the public, the best way to achieve this is to make it\nfree software which everyone can redistribute and change under these terms.\n\n  To do so, attach the following notices to the program.  It is safest\nto attach them to the start of each source file to most effectively\nstate the exclusion of warranty; and each file should have at least\nthe \"copyright\" line and a pointer to where the full notice is found.\n\n    BabelDOC is a library for an ultimate document translation solution.\n    Copyright (C) 2024  <funstory.ai limited>\n\n    This program is free software: you can redistribute it and/or modify\n    it under the terms of the GNU Affero General Public License as published\n    by the Free Software Foundation, either version 3 of the License, or\n    (at your option) any later version.\n\n    This program is distributed in the hope that it will be useful,\n    but WITHOUT ANY WARRANTY; without even the implied warranty of\n    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\n    GNU Affero General Public License for more details.\n\n    You should have received a copy of the GNU Affero General Public License\n    along with this program.  
If not, see <https://www.gnu.org/licenses/>.\n\nAlso add information on how to contact you by electronic and paper mail.\n\n  If your software can interact with users remotely through a computer\nnetwork, you should also make sure that it provides a way for users to\nget its source.  For example, if your program is a web application, its\ninterface could display a \"Source\" link that leads users to an archive\nof the code.  There are many ways you could offer source, and different\nsolutions will be better for different programs; see section 13 for the\nspecific requirements.\n\n  You should also get your employer (if you work as a programmer) or school,\nif any, to sign a \"copyright disclaimer\" for the program, if necessary.\nFor more information on this, and how to apply and follow the GNU AGPL, see\n<https://www.gnu.org/licenses/>.\n"
  },
  {
    "path": "README.md",
    "content": "<!-- # Yet Another Document Translator -->\n\n<div align=\"center\">\n<!-- <img src=\"https://s.immersivetranslate.com/assets/r2-uploads/images/babeldoc-banner.png\" width=\"320px\"  alt=\"YADT\"/> -->\n\n<br/>\n\n<picture>\n  <source media=\"(prefers-color-scheme: dark)\" srcset=\"https://s.immersivetranslate.com/assets/uploads/babeldoc-big-logo-darkmode-with-transparent-background-IKuNO1.svg\" width=\"320px\" alt=\"BabelDOC\"/>\n  <img src=\"https://s.immersivetranslate.com/assets/uploads/babeldoc-big-logo-with-transparent-background-2xweBr.svg\" width=\"320px\" alt=\"BabelDOC\"/>\n</picture>\n\n<!-- <h2 id=\"title\">BabelDOC</h2> -->\n\n<p>\n  <!-- PyPI -->\n  <a href=\"https://pypi.org/project/BabelDOC/\">\n    <img src=\"https://img.shields.io/pypi/v/BabelDOC\"></a>\n  <a href=\"https://pepy.tech/projects/BabelDOC\">\n    <img src=\"https://static.pepy.tech/badge/BabelDOC\"></a>\n  <!-- <a href=\"https://github.com/funstory-ai/BabelDOC/pulls\">\n    <img src=\"https://img.shields.io/badge/contributions-welcome-green\"></a> -->\n  <!-- License -->\n  <a href=\"./LICENSE\">\n    <img src=\"https://img.shields.io/github/license/funstory-ai/BabelDOC\"></a>\n  <a href=\"https://t.me/+Z9_SgnxmsmA5NzBl\">\n    <img src=\"https://img.shields.io/badge/Telegram-2CA5E0?style=flat-squeare&logo=telegram&logoColor=white\"></a>\n  <a href=\"https://deepwiki.com/funstory-ai/BabelDOC\"><img src=\"https://deepwiki.com/badge.svg\" alt=\"Ask DeepWiki\"></a>\n</p>\n\n<a href=\"https://trendshift.io/repositories/13358\" target=\"_blank\"><img src=\"https://trendshift.io/api/badge/repositories/13358\" alt=\"funstory-ai%2FBabelDOC | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"/></a>\n\n</div>\n\nPDF scientific paper translation and bilingual comparison library.\n\n- **Online Service**: Beta version launched [Immersive Translate - BabelDOC](https://app.immersivetranslate.com/babel-doc/) Free usage quota is available; please refer to 
the FAQ section on the page for details.\n- **Self-deployment**: [PDFMathTranslate-next](https://github.com/PDFMathTranslate-next/PDFMathTranslate-next) supports BabelDOC and is available for self-deployment, with a WebUI and more translation services.\n- Provides a simple [command line interface](#getting-started).\n- Provides a [Python API](#python-api).\n- Mainly designed to be embedded into other programs, but can also be used directly for simple translation tasks.\n\n> [!TIP]\n>\n> How to use BabelDOC in Zotero\n>\n> 1. Immersive Translate Pro members can use the [immersive-translate/zotero-immersivetranslate](https://github.com/immersive-translate/zotero-immersivetranslate) plugin\n>\n> 2. PDFMathTranslate self-deployed users can use the [guaguastandup/zotero-pdf2zh](https://github.com/guaguastandup/zotero-pdf2zh) plugin\n\n[Supported Languages](https://funstory-ai.github.io/BabelDOC/supported_languages/)\n\n## Preview\n\n<div align=\"center\">\n<img src=\"https://s.immersivetranslate.com/assets/r2-uploads/images/babeldoc-preview.png\" width=\"80%\"/>\n</div>\n\n## We are hiring\n\nSee details: [EN](https://github.com/funstory-ai/jobs) | [ZH](https://github.com/funstory-ai/jobs/blob/main/README_ZH.md)\n\n## Getting Started\n\n### Install from PyPI\n\nWe recommend using the Tool feature of [uv](https://github.com/astral-sh/uv) to install BabelDOC.\n\n1. First, follow the [uv installation](https://github.com/astral-sh/uv#installation) guide to install uv and set up the `PATH` environment variable as prompted.\n\n2. Use the following command to install BabelDOC:\n\n```bash\nuv tool install --python 3.12 BabelDOC\n\nbabeldoc --help\n```\n\n3. Use the `babeldoc` command. 
For example:\n\n```bash\nbabeldoc --openai --openai-model \"gpt-4o-mini\" --openai-base-url \"https://api.openai.com/v1\" --openai-api-key \"your-api-key-here\"  --files example.pdf\n\n# multiple files\nbabeldoc --openai --openai-model \"gpt-4o-mini\" --openai-base-url \"https://api.openai.com/v1\" --openai-api-key \"your-api-key-here\"  --files example1.pdf --files example2.pdf\n```\n\n### Install from Source\n\nWe still recommend using [uv](https://github.com/astral-sh/uv) to manage virtual environments.\n\n1. First, follow the [uv installation](https://github.com/astral-sh/uv#installation) guide to install uv and set up the `PATH` environment variable as prompted.\n\n2. Use the following commands to fetch the source and run BabelDOC:\n\n```bash\n# clone the project\ngit clone https://github.com/funstory-ai/BabelDOC\n\n# enter the project directory\ncd BabelDOC\n\n# install dependencies and run babeldoc\nuv run babeldoc --help\n```\n\n3. Use the `uv run babeldoc` command. For example:\n\n```bash\nuv run babeldoc --files example.pdf --openai --openai-model \"gpt-4o-mini\" --openai-base-url \"https://api.openai.com/v1\" --openai-api-key \"your-api-key-here\"\n\n# multiple files\nuv run babeldoc --files example.pdf --files example2.pdf --openai --openai-model \"gpt-4o-mini\" --openai-base-url \"https://api.openai.com/v1\" --openai-api-key \"your-api-key-here\"\n```\n\n> [!TIP]\n> Absolute file paths are recommended.\n\n## Advanced Options\n\n> [!NOTE]\n> This CLI is mainly for debugging purposes. 
Although end users can use this CLI to translate files, we do not provide any technical support for this purpose.\n>\n> End users should directly use **Online Service**: Beta version launched [Immersive Translate - BabelDOC](https://app.immersivetranslate.com/babel-doc/) 1000 free pages per month.\n>\n> End users who need self-deployment should use [PDFMathTranslate 2.0](https://github.com/PDFMathTranslate/PDFMathTranslate-next)\n> \n> If you find that an option is not listed below, it means that this option is a debugging option for maintainers. Please do not use these options.\n\n\n### Language Options\n\n- `--lang-in`, `-li`: Source language code (default: en)\n- `--lang-out`, `-lo`: Target language code (default: zh)\n\n> [!TIP]\n> Currently, this project mainly focuses on English-to-Chinese translation, and other scenarios have not been tested yet.\n> \n> (2025.3.1 update): Basic English target language support has been added, primarily to minimize line breaks within words([0-9A-Za-z]+).\n> \n> [HELP WANTED: Collecting word regular expressions for more languages](https://github.com/funstory-ai/BabelDOC/issues/129)\n\n### PDF Processing Options\n\n- `--files`: One or more file paths to input PDF documents.\n- `--pages`, `-p`: Specify pages to translate (e.g., \"1,2,1-,-3,3-5\"). If not set, translate all pages\n- `--split-short-lines`: Force split short lines into different paragraphs (may cause poor typesetting & bugs)\n- `--short-line-split-factor`: Split threshold factor (default: 0.8). 
The actual threshold is the median length of all lines on the current page \\* this factor\n- `--skip-clean`: Skip PDF cleaning step\n- `--dual-translate-first`: Put translated pages first in dual PDF mode (default: original pages first)\n- `--disable-rich-text-translate`: Disable rich text translation (may help improve compatibility with some PDFs)\n- `--enhance-compatibility`: Enable all compatibility enhancement options (equivalent to --skip-clean --dual-translate-first --disable-rich-text-translate)\n- `--use-alternating-pages-dual`: Use alternating pages mode for dual PDF. When enabled, original and translated pages are arranged in alternate order. When disabled (default), original and translated pages are shown side by side on the same page.\n- `--watermark-output-mode`: Control watermark output mode: 'watermarked' (default) adds watermark to translated PDF, 'no_watermark' doesn't add watermark, 'both' outputs both versions.\n- `--max-pages-per-part`: Maximum number of pages per part for split translation. If not set, no splitting will be performed.\n- `--no-watermark`: [DEPRECATED] Use --watermark-output-mode=no_watermark instead.\n- `--translate-table-text`: Translate table text (experimental, default: False)\n- `--formular-font-pattern`: Font pattern to identify formula text (default: None)\n- `--formular-char-pattern`: Character pattern to identify formula text (default: None)\n- `--show-char-box`: Show character bounding boxes (debug only, default: False)\n- `--skip-scanned-detection`: Skip scanned document detection (default: False). When using split translation, only the first part performs detection if not skipped.\n- `--ocr-workaround`: Use OCR workaround (default: False). Only suitable for documents with black text on white background. 
When enabled, white rectangular blocks will be added below the translation to cover the original text content, and all text will be forced to black color.\n- `--auto-enable-ocr-workaround`: Enable automatic OCR workaround (default: False). If a document is detected as heavily scanned, this will attempt to enable OCR processing and skip further scan detection. See \"Important Interaction Note\" below for crucial details on how this interacts with `--ocr-workaround` and `--skip-scanned-detection`.\n- `--primary-font-family`: Override primary font family for translated text. Choices: 'serif' for serif fonts, 'sans-serif' for sans-serif fonts, 'script' for script/italic fonts. If not specified, uses automatic font selection based on original text properties.\n- `--only-include-translated-page`: Only include translated pages in the output PDF. This option is only effective when `--pages` is used. (default: False)\n- `--merge-alternating-line-numbers`: Enable post-processing to merge alternating line-number layouts (keep the number paragraph as an independent paragraph b; merge adjacent text paragraphs a and c across it when `layout_id` and `xobj_id` match, digits are ASCII and spaces only). Default: off.\n- `--skip-form-render`: Skip form rendering (default: False). When enabled, PDF forms will not be rendered in the output.\n- `--skip-curve-render`: Skip curve rendering (default: False). When enabled, PDF curves will not be rendered in the output.\n- `--only-parse-generate-pdf`: Only parse PDF and generate output PDF without translation (default: False). This skips all translation-related processing including layout analysis, paragraph finding, style processing, and translation itself. Useful for testing PDF parsing and reconstruction functionality.\n- `--remove-non-formula-lines`: Remove non-formula lines from paragraph areas (default: False). This removes decorative lines that are not part of formulas, while protecting lines in figure/table areas. 
Useful for cleaning up documents with decorative elements that interfere with text flow.\n- `--non-formula-line-iou-threshold`: IoU threshold for detecting paragraph overlap when removing non-formula lines (default: 0.9). Higher values are more conservative and will remove fewer lines.\n- `--figure-table-protection-threshold`: IoU threshold for protecting lines in figure/table areas when removing non-formula lines (default: 0.9). Higher values provide more protection for structural elements in figures and tables.\n\n- `--rpc-doclayout`: RPC service host address for document layout analysis (default: None)\n- `--working-dir`: Working directory for translation. If not set, use temp directory.\n- `--no-auto-extract-glossary`: Disable automatic term extraction. If this flag is present, the step is skipped. Defaults to enabled.\n- `--save-auto-extracted-glossary`: Save automatically extracted glossary to the specified file. If not set, the glossary will not be saved.\n\n> [!TIP]\n> - Both `--skip-clean` and `--dual-translate-first` may help improve compatibility with some PDF readers\n> - `--disable-rich-text-translate` can also help with compatibility by simplifying translation input\n> - However, using `--skip-clean` will result in larger file sizes\n> - If you encounter any compatibility issues, try using `--enhance-compatibility` first\n> - Use `--max-pages-per-part` for large documents to split them into smaller parts for translation and automatically merge them back.\n> - Use `--skip-scanned-detection` to speed up processing when you know your document is not a scanned PDF.\n> - Use `--ocr-workaround` to fill background for scanned PDF. 
(Current assumption: the background is pure white and the text is pure black; this option also automatically enables `--skip-scanned-detection`.)\n\n### Translation Service Options\n\n- `--qps`: QPS (Queries Per Second) limit for translation service (default: 4)\n- `--ignore-cache`: Ignore translation cache and force retranslation\n- `--no-dual`: Do not output bilingual PDF files\n- `--no-mono`: Do not output monolingual PDF files\n- `--min-text-length`: Minimum text length to translate (default: 5)\n- `--openai`: Use OpenAI for translation (default: False)\n- `--custom-system-prompt`: Custom system prompt for translation.\n- `--add-formula-placehold-hint`: Add formula placeholder hint for translation. (Currently not recommended, it may affect translation quality, default: False)\n- `--disable-same-text-fallback`: Disable fallback translation when LLM output matches input text. (default: False)\n- `--pool-max-workers`: Maximum number of worker threads for internal task processing pools. If not specified, defaults to QPS value. This parameter directly sets the worker count, replacing previous QPS-based dynamic calculations.\n- `--no-auto-extract-glossary`: Disable automatic term extraction. If this flag is present, the step is skipped. Defaults to enabled.\n\n> [!TIP]\n>\n> 1. Currently, only OpenAI-compatible LLMs are supported. For more translator support, please use [PDFMathTranslate 2.0](https://github.com/PDFMathTranslate/PDFMathTranslate-next).\n> 2. It is recommended to use models with strong compatibility with OpenAI, such as: `glm-4-flash`, `deepseek-chat`, etc.\n> 3. BabelDOC has not been optimized for traditional translation engines such as Bing/Google; using LLMs is recommended.\n> 4. You can use [litellm](https://github.com/BerriAI/litellm) to access multiple models.\n> 5. `--custom-system-prompt`: It is mainly used to add the `/no_think` instruction for Qwen 3 to the prompt. 
For example: `--custom-system-prompt \"/no_think You are a professional, authentic machine translation engine.\"`\n\n### OpenAI Specific Options\n\n- `--openai-model`: OpenAI model to use (default: gpt-4o-mini)\n- `--openai-base-url`: Base URL for OpenAI API\n- `--openai-api-key`: API key for OpenAI service\n- `--enable-json-mode-if-requested`: Enable JSON mode for OpenAI requests (default: False)\n- `--term-pool-max-workers`: Maximum number of worker threads dedicated to automatic term extraction. If not specified, this defaults to the value of `--pool-max-workers`, which itself defaults to the QPS value when unset.\n\n> [!TIP]\n>\n> 1. This tool supports any OpenAI-compatible API endpoints. Just set the correct base URL and API key. (e.g. `https://xxx.custom.xxx/v1`)\n> 2. For local models like Ollama, you can use any value as the API key (e.g. `--openai-api-key a`).\n\n### Glossary Options\n\n- `--glossary-files`: Comma-separated paths to glossary CSV files.\n  - Each CSV file should have the columns: `source`, `target`, and an optional `tgt_lng`.\n  - The `source` column contains the term in the original language.\n  - The `target` column contains the term in the target language.\n  - The `tgt_lng` column (optional) specifies the target language for that specific entry (e.g., \"zh-CN\", \"en-US\").\n    - If `tgt_lng` is provided for an entry, that entry will only be loaded and used if its (normalized) `tgt_lng` matches the (normalized) overall target language specified by `--lang-out`. Normalization involves lowercasing and replacing hyphens (`-`) with underscores (`_`).\n    - If `tgt_lng` is omitted for an entry, that entry is considered applicable for any `--lang-out`.\n  - The name of each glossary (used in LLM prompts) is derived from its filename (without the .csv extension).\n  - During translation, the system will check the input text against the loaded glossaries. 
If terms from a glossary are found in the current text segment, that glossary (with the relevant terms) will be included in the prompt to the language model, along with an instruction to adhere to it.\n\n### Output Control\n\n- `--output`, `-o`: Output directory for translated files. If not set, use current working directory.\n- `--debug`: Enable debug logging level and export detailed intermediate results in `~/.cache/yadt/working`.\n- `--report-interval`: Progress report interval in seconds (default: 0.1).\n\n### General Options\n\n- `--warmup`: Only download and verify required assets then exit (default: False)\n\n### Offline Assets Management\n\n- `--generate-offline-assets`: Generate an offline assets package in the specified directory. This creates a zip file containing all required models and fonts.\n- `--restore-offline-assets`: Restore an offline assets package from the specified file. This extracts models and fonts from a previously generated package.\n\n> [!TIP]\n> \n> 1. Offline assets packages are useful for environments without internet access or to speed up installation on multiple machines.\n> 2. Generate a package once with `babeldoc --generate-offline-assets /path/to/output/dir` and then distribute it.\n> 3. Restore the package on target machines with `babeldoc --restore-offline-assets /path/to/offline_assets_*.zip`.\n> 4. The offline assets package name cannot be modified because the file list hash is encoded in the name.\n> 5. If you provide a directory path to `--restore-offline-assets`, the tool will automatically look for the correct offline assets package file in that directory.\n> 6. The package contains all necessary fonts and models required for document processing, ensuring consistent results across different environments.\n> 7. The integrity of all assets is verified using SHA3-256 hashes during both packaging and restoration.\n> 8. 
If you're deploying in an air-gapped environment, make sure to generate the package on a machine with internet access first.\n\n### Configuration File\n\n- `--config`, `-c`: Configuration file path. Use the TOML format.\n\nExample Configuration:\n\n```toml\n[babeldoc]\n# Basic settings\ndebug = true\nlang-in = \"en-US\"\nlang-out = \"zh-CN\"\nqps = 10\noutput = \"/path/to/output/dir\"\n\n# PDF processing options\nsplit-short-lines = false\nshort-line-split-factor = 0.8\nskip-clean = false\ndual-translate-first = false\ndisable-rich-text-translate = false\nuse-alternating-pages-dual = false\nwatermark-output-mode = \"watermarked\"  # Choices: \"watermarked\", \"no_watermark\", \"both\"\nmax-pages-per-part = 50  # Automatically split the document for translation and merge it back.\nonly_include_translated_page = false # Only include translated pages in the output PDF. Effective only when `pages` is used.\n# no-watermark = false  # DEPRECATED: Use watermark-output-mode instead\nskip-scanned-detection = false  # Skip scanned document detection for faster processing\nauto_extract_glossary = true # Set to false to disable automatic term extraction\nformular_font_pattern = \"\" # Font pattern for formula text\nformular_char_pattern = \"\" # Character pattern for formula text\nshow_char_box = false # Show character bounding boxes (debug)\nocr_workaround = false # Use OCR workaround for scanned PDFs\nrpc_doclayout = \"\" # RPC service host for document layout analysis\nworking_dir = \"\" # Working directory for translation\nauto_enable_ocr_workaround = false # Enable automatic OCR workaround for scanned PDFs. 
See docs for interaction with ocr_workaround and skip_scanned_detection.\nskip_form_render = false # Skip form rendering (default: False)\nskip_curve_render = false # Skip curve rendering (default: False)\nonly_parse_generate_pdf = false # Only parse PDF and generate output PDF without translation (default: False)\nremove_non_formula_lines = false # Remove non-formula lines from paragraph areas (default: False)\nnon_formula_line_iou_threshold = 0.2 # IoU threshold for paragraph overlap detection (default: 0.2)\nfigure_table_protection_threshold = 0.3 # IoU threshold for figure/table protection (default: 0.3)\n\n# Translation service\nopenai = true\nopenai-model = \"gpt-4o-mini\"\nopenai-base-url = \"https://api.openai.com/v1\"\nopenai-api-key = \"your-api-key-here\"\nenable-json-mode-if-requested = false  # Enable JSON mode when requested (default: false)\ndisable_same_text_fallback = false # Disable fallback translation when LLM output matches input text (default: false)\npool-max-workers = 8  # Maximum worker threads for task processing (defaults to QPS value if not set)\n\n# Glossary Options (Optional)\n# glossary-files = \"/path/to/glossary1.csv,/path/to/glossary2.csv\"\n\n# Output control\nno-dual = false\nno-mono = false\nmin-text-length = 5\nreport-interval = 0.5\n\n# Offline assets management\n# Uncomment one of these options as needed:\n# generate-offline-assets = \"/path/to/output/dir\"\n# restore-offline-assets = \"/path/to/offline_assets_package.zip\"\n```\n\n## Python API\n\nThe currently recommended way to call BabelDOC in Python is to call the `high_level.do_translate_async_stream` function of [pdf2zh next](https://github.com/PDFMathTranslate/PDFMathTranslate-next).\n\n> [!WARNING]\n> **All APIs of BabelDOC should be considered internal APIs, and any direct use of BabelDOC is not supported.**\n\n## Background\n\nMany projects and teams are working to make document editing and translation easier, such as:\n\n- [mathpix](https://mathpix.com/)\n- 
[Doc2X](https://doc2x.noedgeai.com/)\n- [minerU](https://github.com/opendatalab/MinerU)\n- [PDFMathTranslate](https://github.com/funstory-ai/yadt)\n\nThere are also solutions that address specific parts of the problem, such as:\n\n- [layoutreader](https://github.com/microsoft/unilm/tree/master/layoutreader): the reading order of the text blocks in a PDF\n- [Surya](https://github.com/surya-is/surya): the structure of the PDF\n\nThis project hopes to promote a standard pipeline and interface to solve the problem.\n\nA PDF parser or translator has two main stages:\n\n- **Parsing**: extract the structure of the PDF, such as text blocks, images, and tables.\n- **Rendering**: render that structure into a new PDF or another format.\n\nA service like mathpix parses the PDF into a structure, possibly in an XML format, and then renders it in a single-column reading order, as [layoutreader](https://github.com/microsoft/unilm/tree/master/layoutreader) does. The bad news is that the original structure is lost.\n\nSome people use Adobe PDF Parser because it generates a Word document that keeps the original structure, but it is somewhat expensive.\nMoreover, a PDF or Word document is not a good format for reading on mobile devices.\n\nWe offer an intermediate representation of the parser's results that can be rendered into a new PDF or another format. 
The pipeline is also a plugin-based system to which anyone can add their own model, OCR, renderer, etc.\n\n## Roadmap\n\n- [ ] Add line support\n- [ ] Add table support\n- [ ] Add cross-page/cross-column paragraph support\n- [ ] More advanced typesetting features\n- [ ] Outline support\n- [ ] ...\n\nOur goal for the first 1.0 version is to translate the [PDF Reference, Version 1.7](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf) into the following languages:\n\n- Simplified Chinese\n- Traditional Chinese\n- Japanese\n- Spanish\n\nThe translations must also meet the following requirements:\n\n- layout error less than 1%\n- content loss less than 1%\n\n## Version Number Explanation\n\nThis project uses a combination of [Semantic Versioning](https://semver.org/) and [Pride Versioning](https://pridever.org/). The version number format is: \"0.MAJOR.MINOR\".\n\n> [!NOTE]\n>\n> The API compatibility here mainly refers to the compatibility with [pdf2zh_next](https://github.com/PDFMathTranslate/PDFMathTranslate-next).\n\n\n- MAJOR: Incremented by 1 when API-incompatible changes are made or when proud improvements are implemented.\n\n- MINOR: Incremented by 1 when any API-compatible changes are made.\n\n## Known Issues\n\n1. Parsing errors in the author and reference sections; they get merged into one paragraph after translation.\n2. Lines are not supported.\n3. Drop caps are not supported.\n4. Large pages will be skipped.\n\n## How to Contribute\n\nWe encourage you to contribute to YADT! 
Please check out the [CONTRIBUTING](https://github.com/funstory-ai/yadt/blob/main/docs/CONTRIBUTING.md) guide.\n\nEveryone interacting in YADT and its sub-projects' codebases, issue trackers, chat rooms, and mailing lists is expected to follow the YADT [Code of Conduct](https://github.com/funstory-ai/yadt/blob/main/docs/CODE_OF_CONDUCT.md).\n\n[Immersive Translation](https://immersivetranslate.com) sponsors monthly Pro membership redemption codes for active contributors to this project, see details at: [CONTRIBUTOR_REWARD.md](https://github.com/funstory-ai/BabelDOC/blob/main/docs/CONTRIBUTOR_REWARD.md)\n\n## Acknowledgements\n\n- [PDFMathTranslate](https://github.com/Byaidu/PDFMathTranslate)\n- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)\n- [pdfminer](https://github.com/pdfminer/pdfminer.six)\n- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)\n- [Asynchronize](https://github.com/multimeric/Asynchronize/tree/master?tab=readme-ov-file)\n- [PriorityThreadPoolExecutor](https://github.com/oleglpts/PriorityThreadPoolExecutor)\n\n<h2 id=\"star_hist\">Star History</h2>\n\n<a href=\"https://star-history.com/#funstory-ai/babeldoc&Date\">\n <picture>\n   <source media=\"(prefers-color-scheme: dark)\" srcset=\"https://api.star-history.com/svg?repos=funstory-ai/babeldoc&type=Date&theme=dark\" />\n   <source media=\"(prefers-color-scheme: light)\" srcset=\"https://api.star-history.com/svg?repos=funstory-ai/babeldoc&type=Date\" />\n   <img alt=\"Star History Chart\" src=\"https://api.star-history.com/svg?repos=funstory-ai/babeldoc&type=Date\"/>\n </picture>\n</a>\n\n> [!WARNING]\n> **Important Interaction Note for `--auto-enable-ocr-workaround`:**\n>\n> When `--auto-enable-ocr-workaround` is set to `true` (either via command line or config file):\n>\n> 1.  
During the initial setup, the values for `ocr_workaround` and `skip_scanned_detection` will be forced to `false` by `TranslationConfig`, regardless of whether you also set the `--ocr-workaround` or `--skip-scanned-detection` flags.\n> 2.  Then, during the scanned document detection phase (`DetectScannedFile` stage):\n>     *   If the document is identified as heavily scanned (e.g., >80% scanned pages) AND `auto_enable_ocr_workaround` is `true` (i.e., `translation_config.auto_enable_ocr_workaround` is true), the system will then attempt to set both `ocr_workaround` and `skip_scanned_detection` to `true`.\n>\n> This means that `--auto-enable-ocr-workaround` effectively gives the system control to enable OCR processing for scanned documents, potentially overriding manual settings for `--ocr-workaround` and `--skip-scanned-detection` based on its detection results. If the document is *not* detected as heavily scanned, then the initial `false` values for `ocr_workaround` and `skip_scanned_detection` (forced by `--auto-enable-ocr-workaround` at the `TranslationConfig` initialization stage) will remain in effect unless changed by other logic.\n"
  },
  {
    "path": "babeldoc/__init__.py",
    "content": "__version__ = \"0.5.23\"\n"
  },
  {
    "path": "babeldoc/assets/assets.py",
    "content": "import asyncio\nimport hashlib\nimport json\nimport logging\nimport threading\nimport zipfile\nfrom pathlib import Path\n\nimport httpx\nfrom babeldoc.assets import embedding_assets_metadata\nfrom babeldoc.assets.embedding_assets_metadata import CMAP_METADATA\nfrom babeldoc.assets.embedding_assets_metadata import CMAP_URL_BY_UPSTREAM\nfrom babeldoc.assets.embedding_assets_metadata import DOC_LAYOUT_ONNX_MODEL_URL\nfrom babeldoc.assets.embedding_assets_metadata import (\n    DOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256,\n)\nfrom babeldoc.assets.embedding_assets_metadata import EMBEDDING_FONT_METADATA\nfrom babeldoc.assets.embedding_assets_metadata import FONT_METADATA_URL\nfrom babeldoc.assets.embedding_assets_metadata import FONT_URL_BY_UPSTREAM\nfrom babeldoc.assets.embedding_assets_metadata import (\n    TABLE_DETECTION_RAPIDOCR_MODEL_SHA3_256,\n)\nfrom babeldoc.assets.embedding_assets_metadata import TABLE_DETECTION_RAPIDOCR_MODEL_URL\nfrom babeldoc.assets.embedding_assets_metadata import TIKTOKEN_CACHES\nfrom babeldoc.const import get_cache_file_path\nfrom tenacity import retry\nfrom tenacity import stop_after_attempt\nfrom tenacity import wait_exponential\n\nlogger = logging.getLogger(__name__)\n\n\n_FASTEST_FONT_UPSTREAM_LOCK = asyncio.Lock()\n_FASTEST_FONT_UPSTREAM: str | None = None\n_FASTEST_FONT_METADATA: dict | None = None\n\n\nclass ResultContainer:\n    def __init__(self):\n        self.result = None\n\n    def set_result(self, result):\n        self.result = result\n\n\ndef run_in_another_thread(coro):\n    result_container = ResultContainer()\n\n    def _wrapper():\n        result_container.set_result(asyncio.run(coro))\n\n    thread = threading.Thread(target=_wrapper)\n    thread.start()\n    thread.join()\n    return result_container.result\n\n\ndef run_coro(coro):\n    return run_in_another_thread(coro)\n\n\ndef _retry_if_not_cancelled_and_failed(retry_state):\n    \"\"\"Only retry if the exception is not CancelledError and 
the attempt failed.\"\"\"\n    if retry_state.outcome.failed:\n        exception = retry_state.outcome.exception()\n        # Don't retry on CancelledError\n        if isinstance(exception, asyncio.CancelledError):\n            logger.debug(\"Operation was cancelled, not retrying\")\n            return False\n        # Retry on network related errors\n        if isinstance(\n            exception, httpx.HTTPError | ConnectionError | ValueError | TimeoutError\n        ):\n            logger.warning(f\"Network error occurred: {exception}, will retry\")\n            return True\n    # Don't retry on success\n    return False\n\n\ndef verify_file(path: Path, sha3_256: str):\n    if not path.exists():\n        return False\n    hash_ = hashlib.sha3_256()\n    with path.open(\"rb\") as f:\n        while True:\n            chunk = f.read(1024 * 1024)\n            if not chunk:\n                break\n            hash_.update(chunk)\n    return hash_.hexdigest() == sha3_256\n\n\n@retry(\n    retry=_retry_if_not_cancelled_and_failed,\n    stop=stop_after_attempt(3),\n    wait=wait_exponential(multiplier=1, min=1, max=15),\n    before_sleep=lambda retry_state: logger.warning(\n        f\"Download file failed, retrying in {retry_state.next_action.sleep} seconds... 
\"\n        f\"(Attempt {retry_state.attempt_number}/3)\"\n    ),\n)\nasync def download_file(\n    client: httpx.AsyncClient | None = None,\n    url: str = None,\n    path: Path = None,\n    sha3_256: str = None,\n):\n    if client is None:\n        async with httpx.AsyncClient() as client:\n            response = await client.get(url, follow_redirects=True)\n    else:\n        response = await client.get(url, follow_redirects=True)\n\n    response.raise_for_status()\n    with path.open(\"wb\") as f:\n        f.write(response.content)\n    if not verify_file(path, sha3_256):\n        path.unlink(missing_ok=True)\n        raise ValueError(f\"File {path} is corrupted\")\n\n\n@retry(\n    retry=_retry_if_not_cancelled_and_failed,\n    stop=stop_after_attempt(3),\n    wait=wait_exponential(multiplier=1, min=1, max=15),\n    before_sleep=lambda retry_state: logger.warning(\n        f\"Get font metadata failed, retrying in {retry_state.next_action.sleep} seconds... \"\n        f\"(Attempt {retry_state.attempt_number}/3)\"\n    ),\n)\nasync def get_font_metadata(\n    client: httpx.AsyncClient | None = None, upstream: str = None\n):\n    if upstream not in FONT_METADATA_URL:\n        logger.critical(f\"Invalid upstream: {upstream}\")\n        exit(1)\n\n    if client is None:\n        async with httpx.AsyncClient() as client:\n            response = await client.get(\n                FONT_METADATA_URL[upstream], follow_redirects=True\n            )\n    else:\n        response = await client.get(FONT_METADATA_URL[upstream], follow_redirects=True)\n\n    response.raise_for_status()\n    logger.debug(f\"Get font metadata from {upstream} success\")\n    return upstream, response.json()\n\n\nasync def _get_fastest_upstream_for_font_internal(\n    client: httpx.AsyncClient | None = None, exclude_upstream: list[str] | None = None\n) -> tuple[str | None, dict | None]:\n    \"\"\"Find the fastest upstream for font metadata without using cached result.\"\"\"\n    tasks: 
list[asyncio.Task[tuple[str, dict]]] = []\n    for upstream in FONT_METADATA_URL:\n        if exclude_upstream and upstream in exclude_upstream:\n            continue\n        tasks.append(asyncio.create_task(get_font_metadata(client, upstream)))\n    for future in asyncio.as_completed(tasks):\n        try:\n            result = await future\n            for task in tasks:\n                if not task.done():\n                    task.cancel()\n            return result\n        except Exception as e:\n            logger.exception(f\"Error getting font metadata: {e}\")\n    logger.error(\"All upstreams failed\")\n    return None, None\n\n\nasync def get_fastest_upstream_for_font(\n    client: httpx.AsyncClient | None = None, exclude_upstream: list[str] | None = None\n) -> tuple[str | None, dict | None]:\n    \"\"\"Get the fastest upstream for font metadata with cached result.\n\n    The cached upstream is only used when exclude_upstream is None.\n    \"\"\"\n    global _FASTEST_FONT_UPSTREAM, _FASTEST_FONT_METADATA\n\n    if exclude_upstream is None and _FASTEST_FONT_UPSTREAM is not None:\n        return _FASTEST_FONT_UPSTREAM, _FASTEST_FONT_METADATA\n\n    if exclude_upstream is not None:\n        # Do not use or update cache when exclude_upstream is provided.\n        return await _get_fastest_upstream_for_font_internal(client, exclude_upstream)\n\n    async with _FASTEST_FONT_UPSTREAM_LOCK:\n        if _FASTEST_FONT_UPSTREAM is not None:\n            return _FASTEST_FONT_UPSTREAM, _FASTEST_FONT_METADATA\n\n        upstream, metadata = await _get_fastest_upstream_for_font_internal(client)\n        if upstream is not None:\n            _FASTEST_FONT_UPSTREAM = upstream\n            _FASTEST_FONT_METADATA = metadata\n            logger.info(f\"Fastest font upstream determined: {upstream}\")\n        return upstream, metadata\n\n\nasync def get_fastest_upstream_for_model(client: httpx.AsyncClient | None = None):\n    return await 
get_fastest_upstream_for_font(client, exclude_upstream=[\"github\"])\n\n\nasync def get_fastest_upstream(client: httpx.AsyncClient | None = None):\n    (\n        fastest_upstream_for_font,\n        online_font_metadata,\n    ) = await get_fastest_upstream_for_font(client)\n    if fastest_upstream_for_font is None:\n        logger.error(\"Failed to get fastest upstream\")\n        exit(1)\n\n    if fastest_upstream_for_font == \"github\":\n        # GitHub only hosts fonts, so find the fastest upstream for models separately\n        fastest_upstream_for_model, _ = await get_fastest_upstream_for_model(client)\n        if fastest_upstream_for_model is None:\n            logger.error(\"Failed to get fastest upstream\")\n            exit(1)\n    else:\n        fastest_upstream_for_model = fastest_upstream_for_font\n\n    return online_font_metadata, fastest_upstream_for_font, fastest_upstream_for_model\n\n\nasync def get_doclayout_onnx_model_path_async(client: httpx.AsyncClient | None = None):\n    onnx_path = get_cache_file_path(\n        \"doclayout_yolo_docstructbench_imgsz1024.onnx\", \"models\"\n    )\n    if verify_file(onnx_path, DOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256):\n        return onnx_path\n\n    logger.info(\"doclayout onnx model not found or corrupted, downloading...\")\n    fastest_upstream, _ = await get_fastest_upstream_for_model(client)\n    if fastest_upstream is None:\n        logger.error(\"Failed to get fastest upstream\")\n        exit(1)\n\n    url = DOC_LAYOUT_ONNX_MODEL_URL[fastest_upstream]\n\n    await download_file(\n        client, url, onnx_path, DOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256\n    )\n    logger.info(f\"Download doclayout onnx model from {fastest_upstream} success\")\n    return onnx_path\n\n\nasync def get_table_detection_rapidocr_model_path_async(\n    client: httpx.AsyncClient | None = None,\n):\n    onnx_path = get_cache_file_path(\"ch_PP-OCRv4_det_infer.onnx\", \"models\")\n    if 
verify_file(onnx_path, TABLE_DETECTION_RAPIDOCR_MODEL_SHA3_256):\n        return onnx_path\n\n    logger.info(\"table detection rapidocr model not found or corrupted, downloading...\")\n    fastest_upstream, _ = await get_fastest_upstream_for_model(client)\n    if fastest_upstream is None:\n        logger.error(\"Failed to get fastest upstream\")\n        exit(1)\n\n    url = TABLE_DETECTION_RAPIDOCR_MODEL_URL[fastest_upstream]\n\n    await download_file(client, url, onnx_path, TABLE_DETECTION_RAPIDOCR_MODEL_SHA3_256)\n    logger.info(\n        f\"Download table detection rapidocr model from {fastest_upstream} success\"\n    )\n    return onnx_path\n\n\ndef get_doclayout_onnx_model_path():\n    return run_coro(get_doclayout_onnx_model_path_async())\n\n\ndef get_table_detection_rapidocr_model_path():\n    return run_coro(get_table_detection_rapidocr_model_path_async())\n\n\ndef get_font_url_by_name_and_upstream(font_file_name: str, upstream: str):\n    if upstream not in FONT_URL_BY_UPSTREAM:\n        logger.critical(f\"Invalid upstream: {upstream}\")\n        exit(1)\n\n    return FONT_URL_BY_UPSTREAM[upstream](font_file_name)\n\n\nasync def get_font_and_metadata_async(\n    font_file_name: str,\n    client: httpx.AsyncClient | None = None,\n    fastest_upstream: str | None = None,\n    font_metadata: dict | None = None,\n):\n    cache_file_path = get_cache_file_path(font_file_name, \"fonts\")\n    if font_file_name in EMBEDDING_FONT_METADATA and verify_file(\n        cache_file_path, EMBEDDING_FONT_METADATA[font_file_name][\"sha3_256\"]\n    ):\n        return cache_file_path, EMBEDDING_FONT_METADATA[font_file_name]\n\n    logger.info(f\"Font {cache_file_path} not found or corrupted, downloading...\")\n    if fastest_upstream is None:\n        fastest_upstream, font_metadata = await get_fastest_upstream_for_font(client)\n        if fastest_upstream is None:\n            logger.critical(\"Failed to get fastest upstream\")\n            exit(1)\n\n        if 
font_file_name not in font_metadata:\n            logger.critical(f\"Font {font_file_name} not found in {font_metadata}\")\n            exit(1)\n\n        if verify_file(cache_file_path, font_metadata[font_file_name][\"sha3_256\"]):\n            return cache_file_path, font_metadata[font_file_name]\n\n    assert font_metadata is not None\n    logger.info(f\"download {font_file_name} from {fastest_upstream}\")\n\n    url = get_font_url_by_name_and_upstream(font_file_name, fastest_upstream)\n    if \"sha3_256\" not in font_metadata[font_file_name]:\n        logger.critical(f\"Font {font_file_name} not found in {font_metadata}\")\n        exit(1)\n    await download_file(\n        client, url, cache_file_path, font_metadata[font_file_name][\"sha3_256\"]\n    )\n    return cache_file_path, font_metadata[font_file_name]\n\n\ndef get_font_and_metadata(font_file_name: str):\n    return run_coro(get_font_and_metadata_async(font_file_name))\n\n\nasync def get_cmap_file_path_async(\n    name: str, client: httpx.AsyncClient | None = None\n) -> Path:\n    \"\"\"Get cached cmap file path, downloading it if necessary.\"\"\"\n    if name.endswith(\".json\"):\n        file_name = name\n    else:\n        file_name = f\"{name}.json\"\n\n    if file_name not in CMAP_METADATA:\n        logger.critical(f\"CMap {file_name} not found in CMAP_METADATA\")\n        exit(1)\n\n    meta = CMAP_METADATA[file_name]\n    cache_file_path = get_cache_file_path(file_name, \"cmap\")\n    if verify_file(cache_file_path, meta[\"sha3_256\"]):\n        return cache_file_path\n\n    logger.info(f\"CMap {cache_file_path} not found or corrupted, downloading...\")\n    await download_cmap_file_async(file_name, client)\n    if not verify_file(cache_file_path, meta[\"sha3_256\"]):\n        logger.critical(f\"Failed to verify downloaded cmap file: {cache_file_path}\")\n        exit(1)\n    return cache_file_path\n\n\nasync def download_cmap_file_async(\n    file_name: str, client: httpx.AsyncClient | None = 
None\n) -> Path:\n    \"\"\"Download a single cmap file to cache directory.\"\"\"\n    if file_name not in CMAP_METADATA:\n        logger.critical(f\"CMap {file_name} not found in CMAP_METADATA\")\n        exit(1)\n\n    fastest_upstream, _ = await get_fastest_upstream_for_font(client)\n    if fastest_upstream is None:\n        logger.critical(\"Failed to get fastest upstream for cmap\")\n        exit(1)\n\n    if fastest_upstream not in CMAP_URL_BY_UPSTREAM:\n        logger.critical(f\"Invalid fastest upstream for cmap: {fastest_upstream}\")\n        exit(1)\n\n    url = CMAP_URL_BY_UPSTREAM[fastest_upstream](file_name)\n    cache_file_path = get_cache_file_path(file_name, \"cmap\")\n    sha3_256 = CMAP_METADATA[file_name][\"sha3_256\"]\n    await download_file(client, url, cache_file_path, sha3_256)\n    return cache_file_path\n\n\nasync def get_cmap_data_async(\n    name: str, client: httpx.AsyncClient | None = None\n) -> dict:\n    \"\"\"Load cmap json data from cached file, downloading it if necessary.\"\"\"\n    path = await get_cmap_file_path_async(name, client)\n    return json.loads(path.read_text())\n\n\ndef get_cmap_file_path(name: str):\n    return run_coro(get_cmap_file_path_async(name))\n\n\ndef get_cmap_data(name: str):\n    return run_coro(get_cmap_data_async(name))\n\n\ndef get_font_family(lang_code: str):\n    font_family = embedding_assets_metadata.get_font_family(lang_code)\n    return font_family\n\n\nasync def download_all_fonts_async(client: httpx.AsyncClient | None = None):\n    for font_file_name in EMBEDDING_FONT_METADATA:\n        if not verify_file(\n            get_cache_file_path(font_file_name, \"fonts\"),\n            EMBEDDING_FONT_METADATA[font_file_name][\"sha3_256\"],\n        ):\n            break\n    else:\n        logger.debug(\"All fonts are already downloaded\")\n        return\n\n    fastest_upstream, font_metadata = await get_fastest_upstream_for_font(client)\n    if fastest_upstream is None:\n        
logger.error(\"Failed to get fastest upstream\")\n        exit(1)\n    logger.info(f\"Downloading fonts from {fastest_upstream}\")\n\n    font_tasks = [\n        asyncio.create_task(\n            get_font_and_metadata_async(\n                font_file_name, client, fastest_upstream, font_metadata\n            )\n        )\n        for font_file_name in EMBEDDING_FONT_METADATA\n    ]\n    await asyncio.gather(*font_tasks)\n\n\nasync def download_all_cmaps_async(client: httpx.AsyncClient | None = None):\n    \"\"\"Download all cmap files defined in CMAP_METADATA.\"\"\"\n    for cmap_file_name, meta in CMAP_METADATA.items():\n        if not verify_file(\n            get_cache_file_path(cmap_file_name, \"cmap\"),\n            meta[\"sha3_256\"],\n        ):\n            break\n    else:\n        logger.debug(\"All cmaps are already downloaded\")\n        return\n\n    fastest_upstream, _ = await get_fastest_upstream_for_font(client)\n    if fastest_upstream is None:\n        logger.error(\"Failed to get fastest upstream for cmap\")\n        exit(1)\n    logger.info(f\"Downloading cmaps from {fastest_upstream}\")\n\n    cmap_tasks = [\n        asyncio.create_task(get_cmap_file_path_async(cmap_file_name, client))\n        for cmap_file_name in CMAP_METADATA\n    ]\n    await asyncio.gather(*cmap_tasks)\n\n\nasync def async_warmup():\n    logger.info(\"Downloading all assets...\")\n    from tiktoken import encoding_for_model\n\n    _ = encoding_for_model(\"gpt-4o\")\n    async with httpx.AsyncClient() as client:\n        onnx_task = asyncio.create_task(get_doclayout_onnx_model_path_async(client))\n        onnx_task2 = asyncio.create_task(\n            get_table_detection_rapidocr_model_path_async(client)\n        )\n        font_tasks = asyncio.create_task(download_all_fonts_async(client))\n        cmap_tasks = asyncio.create_task(download_all_cmaps_async(client))\n        await asyncio.gather(onnx_task, onnx_task2, font_tasks, cmap_tasks)\n\n\ndef warmup():\n    
run_coro(async_warmup())\n\n\ndef generate_all_assets_file_list():\n    result: dict[str, list[dict[str, str]]] = {}\n    result[\"fonts\"] = []\n    result[\"models\"] = []\n    result[\"tiktoken\"] = []\n    result[\"cmap\"] = []\n    for font_file_name in EMBEDDING_FONT_METADATA:\n        result[\"fonts\"].append(\n            {\n                \"name\": font_file_name,\n                \"sha3_256\": EMBEDDING_FONT_METADATA[font_file_name][\"sha3_256\"],\n            }\n        )\n    for cmap_file_name in CMAP_METADATA:\n        result[\"cmap\"].append(\n            {\n                \"name\": cmap_file_name,\n                \"sha3_256\": CMAP_METADATA[cmap_file_name][\"sha3_256\"],\n            }\n        )\n    for tiktoken_file, sha3_256 in TIKTOKEN_CACHES.items():\n        result[\"tiktoken\"].append(\n            {\n                \"name\": tiktoken_file,\n                \"sha3_256\": sha3_256,\n            }\n        )\n    result[\"models\"].append(\n        {\n            \"name\": \"doclayout_yolo_docstructbench_imgsz1024.onnx\",\n            \"sha3_256\": DOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256,\n        },\n    )\n    result[\"models\"].append(\n        {\n            \"name\": \"ch_PP-OCRv4_det_infer.onnx\",\n            \"sha3_256\": TABLE_DETECTION_RAPIDOCR_MODEL_SHA3_256,\n        },\n    )\n    return result\n\n\nasync def generate_offline_assets_package_async(output_directory: Path | None = None):\n    await async_warmup()\n    logger.info(\"Generating offline assets package...\")\n    file_list = generate_all_assets_file_list()\n    offline_assets_tag = get_offline_assets_tag(file_list)\n    if output_directory is None:\n        output_path = get_cache_file_path(\n            f\"offline_assets_{offline_assets_tag}.zip\", \"assets\"\n        )\n    else:\n        output_directory.mkdir(parents=True, exist_ok=True)\n        output_path = output_directory / f\"offline_assets_{offline_assets_tag}.zip\"\n    with 
zipfile.ZipFile(\n        output_path, \"w\", compression=zipfile.ZIP_DEFLATED, compresslevel=9\n    ) as zipf:\n        for file_type, file_descs in file_list.items():\n            # zipf.mkdir(file_type)\n            for file_desc in file_descs:\n                file_name = file_desc[\"name\"]\n                sha3_256 = file_desc[\"sha3_256\"]\n                file_path = get_cache_file_path(file_name, file_type)\n                if not verify_file(file_path, sha3_256):\n                    logger.error(f\"File {file_path} is corrupted\")\n                    exit(1)\n\n                with file_path.open(\"rb\") as f:\n                    zipf.writestr(f\"{file_type}/{file_name}\", f.read())\n    logger.info(f\"Offline assets package generated at {output_path}\")\n\n\nasync def restore_offline_assets_package_async(input_path: Path | None = None):\n    file_list = generate_all_assets_file_list()\n    offline_assets_tag = get_offline_assets_tag(file_list)\n    if input_path is None:\n        input_path = get_cache_file_path(\n            f\"offline_assets_{offline_assets_tag}.zip\", \"assets\"\n        )\n    else:\n        if input_path.exists() and input_path.is_dir():\n            input_path = input_path / f\"offline_assets_{offline_assets_tag}.zip\"\n        if not input_path.exists():\n            logger.critical(f\"Offline assets package not found: {input_path}\")\n            exit(1)\n\n        import re\n\n        offline_assets_tag_from_input_path = re.match(\n            r\"offline_assets_(.*)\\.zip\", input_path.name\n        ).group(1)\n        if offline_assets_tag != offline_assets_tag_from_input_path:\n            logger.critical(\n                f\"Offline assets tag mismatch: {offline_assets_tag} != {offline_assets_tag_from_input_path}\"\n            )\n            exit(1)\n    nothing_changed = True\n    with zipfile.ZipFile(input_path, \"r\") as zipf:\n        for file_type, file_descs in file_list.items():\n            for file_desc in 
file_descs:\n                file_name = file_desc[\"name\"]\n                file_path = get_cache_file_path(file_name, file_type)\n\n                if verify_file(file_path, file_desc[\"sha3_256\"]):\n                    continue\n                nothing_changed = False\n                with zipf.open(f\"{file_type}/{file_name}\", \"r\") as f:\n                    with file_path.open(\"wb\") as f2:\n                        f2.write(f.read())\n                if not verify_file(file_path, file_desc[\"sha3_256\"]):\n                    logger.critical(\n                        \"Offline assets package is corrupted, please delete it and try again\"\n                    )\n                    exit(1)\n    if not nothing_changed:\n        logger.info(f\"Offline assets package restored from {input_path}\")\n\n\ndef get_offline_assets_tag(file_list: dict | None = None):\n    if file_list is None:\n        file_list = generate_all_assets_file_list()\n    import orjson\n\n    # noinspection PyTypeChecker\n    offline_assets_tag = hashlib.sha3_256(\n        orjson.dumps(\n            file_list,\n            option=orjson.OPT_APPEND_NEWLINE\n            | orjson.OPT_INDENT_2\n            | orjson.OPT_SORT_KEYS,\n        )\n    ).hexdigest()\n    return offline_assets_tag\n\n\ndef generate_offline_assets_package(output_directory: Path | None = None):\n    return run_coro(generate_offline_assets_package_async(output_directory))\n\n\ndef restore_offline_assets_package(input_path: Path | None = None):\n    return run_coro(restore_offline_assets_package_async(input_path))\n\n\nif __name__ == \"__main__\":\n    from rich.logging import RichHandler\n\n    logging.basicConfig(level=logging.DEBUG, handlers=[RichHandler()])\n    logging.getLogger(\"httpx\").setLevel(logging.WARNING)\n    logging.getLogger(\"httpcore\").setLevel(logging.WARNING)\n    # warmup()\n    # generate_offline_assets_package()\n    # restore_offline_assets_package(Path(\n    #     
'/Users/aw/.cache/babeldoc/assets/offline_assets_33971e4940e90ba0c35baacda44bbe83b214f4703a7bdb8b837de97d0383508c.zip'))\n    # warmup()\n"
  },
  {
    "path": "babeldoc/assets/embedding_assets_metadata.py",
    "content": "import itertools\n\nDOCLAYOUT_YOLO_DOCSTRUCTBENCH_IMGSZ1024ONNX_SHA3_256 = (\n    \"60be061226930524958b5465c8c04af3d7c03bcb0beb66454f5da9f792e3cf2a\"\n)\n\nTABLE_DETECTION_RAPIDOCR_MODEL_SHA3_256 = (\n    \"062f4619afe91b33147c033acadecbb53f2a7b99ac703d157b96d5b10948da5e\"\n)\n\nTIKTOKEN_CACHES = {\n    \"fb374d419588a4632f3f557e76b4b70aebbca790\": \"cb04bcda5782cfbbe77f2f991d92c0ea785d9496ef1137c91dfc3c8c324528d6\"\n}\n\nFONT_METADATA_URL = {\n    \"github\": \"https://raw.githubusercontent.com/funstory-ai/BabelDOC-Assets/refs/heads/main/font_metadata.json\",\n    \"huggingface\": \"https://huggingface.co/datasets/awwaawwa/BabelDOC-Assets/resolve/main/font_metadata.json?download=true\",\n    # \"hf-mirror\": \"https://hf-mirror.com/datasets/awwaawwa/BabelDOC-Assets/resolve/main/font_metadata.json?download=true\",\n    \"modelscope\": \"https://www.modelscope.cn/datasets/awwaawwa/BabelDOCAssets/resolve/master/font_metadata.json\",\n}\n\nFONT_URL_BY_UPSTREAM = {\n    \"github\": lambda name: f\"https://raw.githubusercontent.com/funstory-ai/BabelDOC-Assets/refs/heads/main/fonts/{name}\",\n    \"huggingface\": lambda name: f\"https://huggingface.co/datasets/awwaawwa/BabelDOC-Assets/resolve/main/fonts/{name}?download=true\",\n    \"hf-mirror\": lambda name: f\"https://hf-mirror.com/datasets/awwaawwa/BabelDOC-Assets/resolve/main/fonts/{name}?download=true\",\n    \"modelscope\": lambda name: f\"https://www.modelscope.cn/datasets/awwaawwa/BabelDOCAssets/resolve/master/fonts/{name}\",\n}\n\nCMAP_URL_BY_UPSTREAM = {\n    \"github\": lambda name: f\"https://raw.githubusercontent.com/funstory-ai/BabelDOC-Assets/refs/heads/main/cmap/{name}\",\n    \"huggingface\": lambda name: f\"https://huggingface.co/datasets/awwaawwa/BabelDOC-Assets/resolve/main/cmap/{name}?download=true\",\n    \"hf-mirror\": lambda name: f\"https://hf-mirror.com/datasets/awwaawwa/BabelDOC-Assets/resolve/main/cmap/{name}?download=true\",\n    \"modelscope\": lambda name: 
f\"https://www.modelscope.cn/datasets/awwaawwa/BabelDOCAssets/resolve/master/cmap/{name}\",\n}\n\nDOC_LAYOUT_ONNX_MODEL_URL = {\n    \"huggingface\": \"https://huggingface.co/wybxc/DocLayout-YOLO-DocStructBench-onnx/resolve/main/doclayout_yolo_docstructbench_imgsz1024.onnx?download=true\",\n    \"hf-mirror\": \"https://hf-mirror.com/wybxc/DocLayout-YOLO-DocStructBench-onnx/resolve/main/doclayout_yolo_docstructbench_imgsz1024.onnx?download=true\",\n    \"modelscope\": \"https://www.modelscope.cn/models/AI-ModelScope/DocLayout-YOLO-DocStructBench-onnx/resolve/master/doclayout_yolo_docstructbench_imgsz1024.onnx\",\n}\n\nTABLE_DETECTION_RAPIDOCR_MODEL_URL = {\n    \"huggingface\": \"https://huggingface.co/spaces/RapidAI/RapidOCR/resolve/main/models/text_det/ch_PP-OCRv4_det_infer.onnx\",\n    \"hf-mirror\": \"https://hf-mirror.com/spaces/RapidAI/RapidOCR/resolve/main/models/text_det/ch_PP-OCRv4_det_infer.onnx\",\n    \"modelscope\": \"https://www.modelscope.cn/models/RapidAI/RapidOCR/resolve/master/onnx/PP-OCRv4/det/ch_PP-OCRv4_det_infer.onnx\",\n}\n\n# from https://github.com/funstory-ai/BabelDOC-Assets/blob/main/font_metadata.json\nEMBEDDING_FONT_METADATA = {\n    \"GoNotoKurrent-Bold.ttf\": {\n        \"ascent\": 1069,\n        \"bold\": 1,\n        \"descent\": -293,\n        \"encoding_length\": 2,\n        \"file_name\": \"GoNotoKurrent-Bold.ttf\",\n        \"font_name\": \"Go Noto Kurrent-Bold Bold\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"000b37f592477945b27b7702dcad39f73e23e140e66ddff9847eb34f32389566\",\n        \"size\": 15303772,\n    },\n    \"GoNotoKurrent-Regular.ttf\": {\n        \"ascent\": 1069,\n        \"bold\": 0,\n        \"descent\": -293,\n        \"encoding_length\": 2,\n        \"file_name\": \"GoNotoKurrent-Regular.ttf\",\n        \"font_name\": \"Go Noto Kurrent-Regular Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": 
\"4324a60d507c691e6efc97420647f4d2c2d86d9de35009d1c769861b76074ae6\",\n        \"size\": 15515760,\n    },\n    \"KleeOne-Regular.ttf\": {\n        \"ascent\": 1160,\n        \"bold\": 0,\n        \"descent\": -288,\n        \"encoding_length\": 2,\n        \"file_name\": \"KleeOne-Regular.ttf\",\n        \"font_name\": \"Klee One Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"8585c29f89b322d937f83739f61ede5d84297873e1465cad9a120a208ac55ce0\",\n        \"size\": 8724704,\n    },\n    \"LXGWWenKai-Regular.1.520.ttf\": {\n        \"ascent\": 928,\n        \"bold\": 0,\n        \"descent\": -256,\n        \"encoding_length\": 2,\n        \"file_name\": \"LXGWWenKai-Regular.1.520.ttf\",\n        \"font_name\": \"LXGW WenKai Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"708b4fd6cfae62a26f71016724d38e862210732f101b9225225a1d5e8205f94d\",\n        \"size\": 24744500,\n    },\n    \"LXGWWenKaiGB-Regular.1.520.ttf\": {\n        \"ascent\": 928,\n        \"bold\": 0,\n        \"descent\": -256,\n        \"encoding_length\": 2,\n        \"file_name\": \"LXGWWenKaiGB-Regular.1.520.ttf\",\n        \"font_name\": \"LXGW WenKai GB Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"0671656b00992e317f9e20610e7145b024e664ada9f272d4f8e497196af98005\",\n        \"size\": 24903712,\n    },\n    \"LXGWWenKaiGB-Regular.ttf\": {\n        \"ascent\": 928,\n        \"bold\": 0,\n        \"descent\": -256,\n        \"encoding_length\": 2,\n        \"file_name\": \"LXGWWenKaiGB-Regular.ttf\",\n        \"font_name\": \"LXGW WenKai GB Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"b563a5e8d9db4cd15602a3a3700b01925e80a21f99fb88e1b763b1fb8685f8ee\",\n        \"size\": 19558756,\n    },\n    \"LXGWWenKaiMonoTC-Regular.ttf\": {\n        \"ascent\": 
928,\n        \"bold\": 0,\n        \"descent\": -241,\n        \"encoding_length\": 2,\n        \"file_name\": \"LXGWWenKaiMonoTC-Regular.ttf\",\n        \"font_name\": \"LXGW WenKai Mono TC Regular\",\n        \"italic\": 0,\n        \"monospace\": 1,\n        \"serif\": 0,\n        \"sha3_256\": \"596b278d11418d374a1cfa3a50cbfb82b31db82d3650cfacae8f94311b27fdc5\",\n        \"size\": 13115416,\n    },\n    \"LXGWWenKaiTC-Regular.1.520.ttf\": {\n        \"ascent\": 928,\n        \"bold\": 0,\n        \"descent\": -256,\n        \"encoding_length\": 2,\n        \"file_name\": \"LXGWWenKaiTC-Regular.1.520.ttf\",\n        \"font_name\": \"LXGW WenKai TC Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"347d3d4bd88c2afcb194eba186d2c6c0b95d18b2145220feb1c88abf761f1398\",\n        \"size\": 15348376,\n    },\n    \"LXGWWenKaiTC-Regular.ttf\": {\n        \"ascent\": 928,\n        \"bold\": 0,\n        \"descent\": -256,\n        \"encoding_length\": 2,\n        \"file_name\": \"LXGWWenKaiTC-Regular.ttf\",\n        \"font_name\": \"LXGW WenKai TC Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"66ccd0ffe8e56cd585dabde8d1292c3f551b390d8ed85f81d7a844825f9c2379\",\n        \"size\": 13100328,\n    },\n    \"MaruBuri-Regular.ttf\": {\n        \"ascent\": 800,\n        \"bold\": 0,\n        \"descent\": -200,\n        \"encoding_length\": 2,\n        \"file_name\": \"MaruBuri-Regular.ttf\",\n        \"font_name\": \"MaruBuri Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"abb672dde7b89e06914ce27c59159b7a2933f26207bfcc47981c67c11c41e6d1\",\n        \"size\": 3268988,\n    },\n    \"NotoSans-Bold.ttf\": {\n        \"ascent\": 1069,\n        \"bold\": 1,\n        \"descent\": -293,\n        \"encoding_length\": 2,\n        \"file_name\": \"NotoSans-Bold.ttf\",\n        \"font_name\": \"Noto Sans 
Bold\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"ecd38d472c1cad07d8a5dffd2b5a0f72edcd40fff2b4e68d770da8f2ef343a82\",\n        \"size\": 630964,\n    },\n    \"NotoSans-BoldItalic.ttf\": {\n        \"ascent\": 1069,\n        \"bold\": 1,\n        \"descent\": -293,\n        \"encoding_length\": 2,\n        \"file_name\": \"NotoSans-BoldItalic.ttf\",\n        \"font_name\": \"Noto Sans Bold Italic\",\n        \"italic\": 1,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"0b6c690a4a6b7d605b2ecbde00c7ac1a23e60feb17fa30d8b972d61ec3ff732b\",\n        \"size\": 644340,\n    },\n    \"NotoSans-Italic.ttf\": {\n        \"ascent\": 1069,\n        \"bold\": 0,\n        \"descent\": -293,\n        \"encoding_length\": 2,\n        \"file_name\": \"NotoSans-Italic.ttf\",\n        \"font_name\": \"Noto Sans Italic\",\n        \"italic\": 1,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"830652f61724c017e5a29a96225b484a2ccbd25f69a1b3f47e5f466a2dbed1ad\",\n        \"size\": 642344,\n    },\n    \"NotoSans-Regular.ttf\": {\n        \"ascent\": 1069,\n        \"bold\": 0,\n        \"descent\": -293,\n        \"encoding_length\": 2,\n        \"file_name\": \"NotoSans-Regular.ttf\",\n        \"font_name\": \"Noto Sans Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"7dfe2bbf97dc04c852d1223b220b63430e6ad03b0dbb28ebe6328a20a2d45eb8\",\n        \"size\": 629024,\n    },\n    \"NotoSerif-Bold.ttf\": {\n        \"ascent\": 1069,\n        \"bold\": 1,\n        \"descent\": -293,\n        \"encoding_length\": 2,\n        \"file_name\": \"NotoSerif-Bold.ttf\",\n        \"font_name\": \"Noto Serif Bold\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 1,\n        \"sha3_256\": \"28d88d924285eadb9f9ce49f2d2b95473f89a307b226c5f6ebed87a654898312\",\n        \"size\": 506864,\n    },\n    
\"NotoSerif-BoldItalic.ttf\": {\n        \"ascent\": 1069,\n        \"bold\": 1,\n        \"descent\": -293,\n        \"encoding_length\": 2,\n        \"file_name\": \"NotoSerif-BoldItalic.ttf\",\n        \"font_name\": \"Noto Serif Bold Italic\",\n        \"italic\": 1,\n        \"monospace\": 0,\n        \"serif\": 1,\n        \"sha3_256\": \"b69ee56af6351b2fb4fbce623f8e1c1f9fb19170686a9e5db2cf260b8cf24ac7\",\n        \"size\": 535724,\n    },\n    \"NotoSerif-Italic.ttf\": {\n        \"ascent\": 1069,\n        \"bold\": 0,\n        \"descent\": -293,\n        \"encoding_length\": 2,\n        \"file_name\": \"NotoSerif-Italic.ttf\",\n        \"font_name\": \"Noto Serif Italic\",\n        \"italic\": 1,\n        \"monospace\": 0,\n        \"serif\": 1,\n        \"sha3_256\": \"9b7773c24ab8a29e3c1c03efa4ab652d051e4c209134431953463aa946d62868\",\n        \"size\": 535340,\n    },\n    \"NotoSerif-Regular.ttf\": {\n        \"ascent\": 1069,\n        \"bold\": 0,\n        \"descent\": -293,\n        \"encoding_length\": 2,\n        \"file_name\": \"NotoSerif-Regular.ttf\",\n        \"font_name\": \"Noto Serif Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 1,\n        \"sha3_256\": \"c2bbe984e65bafd3bcd38b3cb1e1344f3b7b79d6beffc7a3d883b57f8358559d\",\n        \"size\": 504932,\n    },\n    \"SourceHanSansCN-Bold.ttf\": {\n        \"ascent\": 1160,\n        \"bold\": 1,\n        \"descent\": -288,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSansCN-Bold.ttf\",\n        \"font_name\": \"Source Han Sans CN Bold\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"82314c11016a04ef03e7afd00abe0ccc8df54b922dee79abf6424f3002a31825\",\n        \"size\": 10174460,\n    },\n    \"SourceHanSansCN-Regular.ttf\": {\n        \"ascent\": 1160,\n        \"bold\": 0,\n        \"descent\": -288,\n        \"encoding_length\": 2,\n        \"file_name\": 
\"SourceHanSansCN-Regular.ttf\",\n        \"font_name\": \"Source Han Sans CN Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"b45a80cf3650bfc62aa014e58243c6325e182c4b0c5819e41a583c699cce9a8f\",\n        \"size\": 10397552,\n    },\n    \"SourceHanSansHK-Bold.ttf\": {\n        \"ascent\": 1160,\n        \"bold\": 1,\n        \"descent\": -288,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSansHK-Bold.ttf\",\n        \"font_name\": \"Source Han Sans HK Bold\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"3eecd57457ba9a0fbad6c794f40e7ae704c4f825091aef2ac18902ffdde50608\",\n        \"size\": 6856692,\n    },\n    \"SourceHanSansHK-Regular.ttf\": {\n        \"ascent\": 1160,\n        \"bold\": 0,\n        \"descent\": -288,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSansHK-Regular.ttf\",\n        \"font_name\": \"Source Han Sans HK Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"5fe4141f9164c03616323400b2936ee4c8265314492e2b822c3a6fbfb63ffe08\",\n        \"size\": 6999792,\n    },\n    \"SourceHanSansJP-Bold.ttf\": {\n        \"ascent\": 1160,\n        \"bold\": 1,\n        \"descent\": -288,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSansJP-Bold.ttf\",\n        \"font_name\": \"Source Han Sans JP Bold\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"fb05bd84d62e8064117ee357ab6a4481e1cde931e8e984c0553c8c4b09dc3938\",\n        \"size\": 5603068,\n    },\n    \"SourceHanSansJP-Regular.ttf\": {\n        \"ascent\": 1160,\n        \"bold\": 0,\n        \"descent\": -288,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSansJP-Regular.ttf\",\n        \"font_name\": \"Source Han Sans JP Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        
\"serif\": 0,\n        \"sha3_256\": \"722cfbdcc0fd83fe07a3d1b10e9e64343c924a351d02cfe8dbb6ec4c6bc38230\",\n        \"size\": 5723960,\n    },\n    \"SourceHanSansKR-Bold.ttf\": {\n        \"ascent\": 1160,\n        \"bold\": 1,\n        \"descent\": -288,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSansKR-Bold.ttf\",\n        \"font_name\": \"Source Han Sans KR Bold\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"02959eb2c1eea0786a736aeb50b6e61f2ab873cd69c659389b7511f80f734838\",\n        \"size\": 5858892,\n    },\n    \"SourceHanSansKR-Regular.ttf\": {\n        \"ascent\": 1160,\n        \"bold\": 0,\n        \"descent\": -288,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSansKR-Regular.ttf\",\n        \"font_name\": \"Source Han Sans KR Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"aba70109eff718e8f796f0185f8dca38026c1661b43c195883c84577e501adf2\",\n        \"size\": 5961704,\n    },\n    \"SourceHanSansTW-Bold.ttf\": {\n        \"ascent\": 1160,\n        \"bold\": 1,\n        \"descent\": -288,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSansTW-Bold.ttf\",\n        \"font_name\": \"Source Han Sans TW Bold\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"4a92730e644a1348e87bba7c77e9b462f257f381bd6abbeac5860d8f8306aee6\",\n        \"size\": 6883224,\n    },\n    \"SourceHanSansTW-Regular.ttf\": {\n        \"ascent\": 1160,\n        \"bold\": 0,\n        \"descent\": -288,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSansTW-Regular.ttf\",\n        \"font_name\": \"Source Han Sans TW Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 0,\n        \"sha3_256\": \"6129b68ff4b0814624cac7edca61fbacf8f4d79db6f4c3cfc46b1c48ea2f81ac\",\n        \"size\": 7024812,\n    },\n    
\"SourceHanSerifCN-Bold.ttf\": {\n        \"ascent\": 1150,\n        \"bold\": 1,\n        \"descent\": -286,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSerifCN-Bold.ttf\",\n        \"font_name\": \"Source Han Serif CN Bold\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 1,\n        \"sha3_256\": \"77816a54957616e140e25a36a41fc061ddb505a1107de4e6a65f561e5dcf8310\",\n        \"size\": 14134156,\n    },\n    \"SourceHanSerifCN-Regular.ttf\": {\n        \"ascent\": 1150,\n        \"bold\": 0,\n        \"descent\": -286,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSerifCN-Regular.ttf\",\n        \"font_name\": \"Source Han Serif CN Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 1,\n        \"sha3_256\": \"c8bf74da2c3b7457c9d887465b42fb6f80d3d84f361cfe5b0673a317fb1f85ad\",\n        \"size\": 14047768,\n    },\n    \"SourceHanSerifHK-Bold.ttf\": {\n        \"ascent\": 1150,\n        \"bold\": 1,\n        \"descent\": -286,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSerifHK-Bold.ttf\",\n        \"font_name\": \"Source Han Serif HK Bold\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 1,\n        \"sha3_256\": \"0f81296f22846b622a26f7342433d6c5038af708a32fc4b892420c150227f4bb\",\n        \"size\": 9532580,\n    },\n    \"SourceHanSerifHK-Regular.ttf\": {\n        \"ascent\": 1150,\n        \"bold\": 0,\n        \"descent\": -286,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSerifHK-Regular.ttf\",\n        \"font_name\": \"Source Han Serif HK Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 1,\n        \"sha3_256\": \"d5232ec3adf4fb8604bb4779091169ec9bd9d574b513e4a75752e614193afebe\",\n        \"size\": 9467292,\n    },\n    \"SourceHanSerifJP-Bold.ttf\": {\n        \"ascent\": 1150,\n        \"bold\": 1,\n        \"descent\": -286,\n        \"encoding_length\": 
2,\n        \"file_name\": \"SourceHanSerifJP-Bold.ttf\",\n        \"font_name\": \"Source Han Serif JP Bold\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 1,\n        \"sha3_256\": \"a4a8c22e8ec7bb6e66b9caaff1e12c7a52b5a4201eec3d074b35957c0126faef\",\n        \"size\": 7811832,\n    },\n    \"SourceHanSerifJP-Regular.ttf\": {\n        \"ascent\": 1150,\n        \"bold\": 0,\n        \"descent\": -286,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSerifJP-Regular.ttf\",\n        \"font_name\": \"Source Han Serif JP Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 1,\n        \"sha3_256\": \"3d1f9933c7f3abc8c285e317119a533e6dcfe6027d1f5f066ba71b3eb9161e9c\",\n        \"size\": 7748816,\n    },\n    \"SourceHanSerifKR-Bold.ttf\": {\n        \"ascent\": 1150,\n        \"bold\": 1,\n        \"descent\": -286,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSerifKR-Bold.ttf\",\n        \"font_name\": \"Source Han Serif KR Bold\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 1,\n        \"sha3_256\": \"b071b1aecb042aa779e1198767048438dc756d0da8f90660408abb421393f5cb\",\n        \"size\": 12387920,\n    },\n    \"SourceHanSerifKR-Regular.ttf\": {\n        \"ascent\": 1150,\n        \"bold\": 0,\n        \"descent\": -286,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSerifKR-Regular.ttf\",\n        \"font_name\": \"Source Han Serif KR Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 1,\n        \"sha3_256\": \"a85913439f0a49024ca77c02dfede4318e503ee6b2b7d8fef01eb42435f27b61\",\n        \"size\": 12459924,\n    },\n    \"SourceHanSerifTW-Bold.ttf\": {\n        \"ascent\": 1150,\n        \"bold\": 1,\n        \"descent\": -286,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSerifTW-Bold.ttf\",\n        \"font_name\": \"Source Han Serif TW Bold\",\n        \"italic\": 0,\n   
     \"monospace\": 0,\n        \"serif\": 1,\n        \"sha3_256\": \"562eea88895ab79ffefab7eabb4d322352a7b1963764c524c6d5242ca456bb6e\",\n        \"size\": 9551724,\n    },\n    \"SourceHanSerifTW-Regular.ttf\": {\n        \"ascent\": 1150,\n        \"bold\": 0,\n        \"descent\": -286,\n        \"encoding_length\": 2,\n        \"file_name\": \"SourceHanSerifTW-Regular.ttf\",\n        \"font_name\": \"Source Han Serif TW Regular\",\n        \"italic\": 0,\n        \"monospace\": 0,\n        \"serif\": 1,\n        \"sha3_256\": \"85c1d6460b2e169b3d53ac60f6fb7a219fb99923027d78fb64b679475e2ddae4\",\n        \"size\": 9486772,\n    },\n}\n\nCMAP_METADATA = {\n    \"78-EUC-H.json\": {\n        \"file_name\": \"78-EUC-H.json\",\n        \"sha3_256\": \"657006ae4360ac584316dbda94f2223d7dd4cf7c721021b78b470ed712d22a3d\",\n        \"size\": 15035,\n    },\n    \"78-EUC-V.json\": {\n        \"file_name\": \"78-EUC-V.json\",\n        \"sha3_256\": \"ffd0610937d3893cd6b9f10007033dab4c846d6a50914b3e0b5b1a1d5a446483\",\n        \"size\": 704,\n    },\n    \"78-H.json\": {\n        \"file_name\": \"78-H.json\",\n        \"sha3_256\": \"07960a71bd7f2dc8501bfff6ebacb5d179961accbb8d043837d6d213d4e7c43f\",\n        \"size\": 14993,\n    },\n    \"78-RKSJ-H.json\": {\n        \"file_name\": \"78-RKSJ-H.json\",\n        \"sha3_256\": \"2cea4cbf474c08d99420790509473f48960d14df27e37155c0833150eff0310c\",\n        \"size\": 15054,\n    },\n    \"78-RKSJ-V.json\": {\n        \"file_name\": \"78-RKSJ-V.json\",\n        \"sha3_256\": \"0005485dc7cb41b9911d651a31a008ff4d8f707f3a271f5eb900640415255f58\",\n        \"size\": 705,\n    },\n    \"78-V.json\": {\n        \"file_name\": \"78-V.json\",\n        \"sha3_256\": \"6ec527dfdd6f8176719db47aea208d96c8427ff2c44bb6d6adcf215e3599c7dd\",\n        \"size\": 700,\n    },\n    \"78ms-RKSJ-H.json\": {\n        \"file_name\": \"78ms-RKSJ-H.json\",\n        \"sha3_256\": \"781802e72f8e79d599d58a81445333d005df5117b10c9b8392459729e51bbec7\",\n     
   \"size\": 17125,\n    },\n    \"78ms-RKSJ-V.json\": {\n        \"file_name\": \"78ms-RKSJ-V.json\",\n        \"sha3_256\": \"1854ff118f30bdee044813bf764f44123697cb2c2dfcfacb10e1aa161d7db16b\",\n        \"size\": 1928,\n    },\n    \"83pv-RKSJ-H.json\": {\n        \"file_name\": \"83pv-RKSJ-H.json\",\n        \"sha3_256\": \"2b6dd0a63fc97f3b33767a1b16a49b30ba0cb97a1ff01deb6ca5592d90e79815\",\n        \"size\": 5277,\n    },\n    \"90ms-RKSJ-H.json\": {\n        \"file_name\": \"90ms-RKSJ-H.json\",\n        \"sha3_256\": \"ebacf23e35e924a65b45afb6276f645289f68b122f1b32ab4dbc64f9c7903ccf\",\n        \"size\": 4117,\n    },\n    \"90ms-RKSJ-V.json\": {\n        \"file_name\": \"90ms-RKSJ-V.json\",\n        \"sha3_256\": \"0e08ffc0c46d93912870ad12a863081bcea12db09038e3929e1e015cfc1663da\",\n        \"size\": 1928,\n    },\n    \"90msp-RKSJ-H.json\": {\n        \"file_name\": \"90msp-RKSJ-H.json\",\n        \"sha3_256\": \"3098d897f17b1723d5915518d281d3c5d4f46f0b83dbde8b8001073e0f882d32\",\n        \"size\": 4096,\n    },\n    \"90msp-RKSJ-V.json\": {\n        \"file_name\": \"90msp-RKSJ-V.json\",\n        \"sha3_256\": \"a7ad430c32de4dbce2667fff874efc5d4114c685107f026788eee4ec83992fc8\",\n        \"size\": 1929,\n    },\n    \"90pv-RKSJ-H.json\": {\n        \"file_name\": \"90pv-RKSJ-H.json\",\n        \"sha3_256\": \"2c1720cc7343f95ccb87e073df0c7788d33bc8811b703b709a0230e79ecb2341\",\n        \"size\": 6314,\n    },\n    \"90pv-RKSJ-V.json\": {\n        \"file_name\": \"90pv-RKSJ-V.json\",\n        \"sha3_256\": \"487bf100397d4f0bcfa86dbfea149cac54faa59c0b449d65284cc43123d99023\",\n        \"size\": 1283,\n    },\n    \"Add-H.json\": {\n        \"file_name\": \"Add-H.json\",\n        \"sha3_256\": \"3bd6fbbe961dffa3a6395d1e3823da665efc74363f44ff6083d98fc5ae22433a\",\n        \"size\": 15174,\n    },\n    \"Add-RKSJ-H.json\": {\n        \"file_name\": \"Add-RKSJ-H.json\",\n        \"sha3_256\": \"bde048bae5dc9c43570bff29ff4691e03372e029dde66edc5e8de64a891dd53b\",\n   
     \"size\": 15259,\n    },\n    \"Add-RKSJ-V.json\": {\n        \"file_name\": \"Add-RKSJ-V.json\",\n        \"sha3_256\": \"1a81852c30ebf3101e1e0b0b5eff2e4f19211373c513d7c42b0933ded6b6e59b\",\n        \"size\": 1426,\n    },\n    \"Add-V.json\": {\n        \"file_name\": \"Add-V.json\",\n        \"sha3_256\": \"6a4f7a4ee2d7a04ce0500b93453859faf3fc3f11b3f55cb61753ef79846b419b\",\n        \"size\": 1421,\n    },\n    \"B5-H.json\": {\n        \"file_name\": \"B5-H.json\",\n        \"sha3_256\": \"f1b984aa231df737628663a56d380c93fe3172a243792db6d36921b964a118db\",\n        \"size\": 5960,\n    },\n    \"B5-V.json\": {\n        \"file_name\": \"B5-V.json\",\n        \"sha3_256\": \"0fafc3f78a34f2bf2377a89b2679469505a35ae42df95bf6765f743344f9a94c\",\n        \"size\": 334,\n    },\n    \"B5pc-H.json\": {\n        \"file_name\": \"B5pc-H.json\",\n        \"sha3_256\": \"07f0c25086768b9731971ba164d88cb10202a9d36e79a076c43233351f61c52f\",\n        \"size\": 6015,\n    },\n    \"B5pc-V.json\": {\n        \"file_name\": \"B5pc-V.json\",\n        \"sha3_256\": \"f5e44d8eeeda40e8c3a81858dfb823eeed3f5e834e985544d1e56fb79260b8f8\",\n        \"size\": 336,\n    },\n    \"CNS-EUC-H.json\": {\n        \"file_name\": \"CNS-EUC-H.json\",\n        \"sha3_256\": \"2add6b8cd4750db8bf6b029595232fecb8f1e54a0bad56590d4aa46401085e44\",\n        \"size\": 11342,\n    },\n    \"CNS-EUC-V.json\": {\n        \"file_name\": \"CNS-EUC-V.json\",\n        \"sha3_256\": \"1ff26a35f10467a99957886c482de267658b9132a704b547381c90fc37c90820\",\n        \"size\": 12592,\n    },\n    \"CNS1-H.json\": {\n        \"file_name\": \"CNS1-H.json\",\n        \"sha3_256\": \"e64c524f07718603b6bd84fd6799f875cc13c00137fbaa2b41215d518e96c87a\",\n        \"size\": 3728,\n    },\n    \"CNS1-V.json\": {\n        \"file_name\": \"CNS1-V.json\",\n        \"sha3_256\": \"57a1d2aabe6ab9db9a323ab43c37e3aa1ba9b3eb71841dfec4d8568d657d503a\",\n        \"size\": 332,\n    },\n    \"CNS2-H.json\": {\n        \"file_name\": 
\"CNS2-H.json\",\n        \"sha3_256\": \"90831af5d65fae9565d705fc8f1fccd091e33a67a1e544552410e39d7558daed\",\n        \"size\": 2053,\n    },\n    \"CNS2-V.json\": {\n        \"file_name\": \"CNS2-V.json\",\n        \"sha3_256\": \"c4d2aae661b26120030754901abced51766fa4bce638433a7aa7130a3d5eabb0\",\n        \"size\": 54,\n    },\n    \"ETHK-B5-H.json\": {\n        \"file_name\": \"ETHK-B5-H.json\",\n        \"sha3_256\": \"3ef2e9ef0364675c2fb9ccbfd37ed9227d416457ee8cadb9e59b2db4354d88ea\",\n        \"size\": 25660,\n    },\n    \"ETHK-B5-V.json\": {\n        \"file_name\": \"ETHK-B5-V.json\",\n        \"sha3_256\": \"a12c5917b6f3400793e7d6ea2e217e9af05a28621a937cfef4da9f5184a03578\",\n        \"size\": 364,\n    },\n    \"ETen-B5-H.json\": {\n        \"file_name\": \"ETen-B5-H.json\",\n        \"sha3_256\": \"57f29290c730277b221ad074709d4f76c429d5410931131c9da7157ebae76951\",\n        \"size\": 6205,\n    },\n    \"ETen-B5-V.json\": {\n        \"file_name\": \"ETen-B5-V.json\",\n        \"sha3_256\": \"d07d9af9e30a8fc3ca7e52158f854226b831ab9ef552cda46219819e47950680\",\n        \"size\": 364,\n    },\n    \"ETenms-B5-H.json\": {\n        \"file_name\": \"ETenms-B5-H.json\",\n        \"sha3_256\": \"0659f282182ebdaa6abb38062bc3428a3b7b5907513fd499980d1b49930a9b9e\",\n        \"size\": 72,\n    },\n    \"ETenms-B5-V.json\": {\n        \"file_name\": \"ETenms-B5-V.json\",\n        \"sha3_256\": \"74b107f8950456b2df294a089091837bf802892c1bc3136c403da2a427130c33\",\n        \"size\": 429,\n    },\n    \"EUC-H.json\": {\n        \"file_name\": \"EUC-H.json\",\n        \"sha3_256\": \"b6df6e254254eb5a2254b0d581f4820d2b3553cd372136ec88f605521683c44a\",\n        \"size\": 2910,\n    },\n    \"EUC-V.json\": {\n        \"file_name\": \"EUC-V.json\",\n        \"sha3_256\": \"e81c0f409365f2fd60232f6e5c84bf52c8a6b9c6336d4c96fb554f213dbdfaf6\",\n        \"size\": 701,\n    },\n    \"Ext-H.json\": {\n        \"file_name\": \"Ext-H.json\",\n        \"sha3_256\": 
\"629359cf115575acb68b59c82373a1a3958001212a854d0a5b98e6fe1efe81db\",\n        \"size\": 15891,\n    },\n    \"Ext-RKSJ-H.json\": {\n        \"file_name\": \"Ext-RKSJ-H.json\",\n        \"sha3_256\": \"3336a4a77a75924588f13c5a24157680c9c5b6a46298063dcdb461b90bb55da0\",\n        \"size\": 15975,\n    },\n    \"Ext-RKSJ-V.json\": {\n        \"file_name\": \"Ext-RKSJ-V.json\",\n        \"sha3_256\": \"f2915039ff32992094ff6521fa24c3f41c27f55f3f071730eea732e261a2a553\",\n        \"size\": 994,\n    },\n    \"Ext-V.json\": {\n        \"file_name\": \"Ext-V.json\",\n        \"sha3_256\": \"e2fb58ec483aee0910b0733dcb6220f10f9f4d2553c8c139a523e3992363f93e\",\n        \"size\": 989,\n    },\n    \"GB-EUC-H.json\": {\n        \"file_name\": \"GB-EUC-H.json\",\n        \"sha3_256\": \"4a0b5fda367993409663ec1d4be57c207a3500d778373546b729d143d789c191\",\n        \"size\": 2178,\n    },\n    \"GB-EUC-V.json\": {\n        \"file_name\": \"GB-EUC-V.json\",\n        \"sha3_256\": \"b45a8a562304c2c388fd1574c3a1a0af6f49e4849f7904ba07d57967d9625917\",\n        \"size\": 520,\n    },\n    \"GB-H.json\": {\n        \"file_name\": \"GB-H.json\",\n        \"sha3_256\": \"a50b5d6461c95a667ccbc44c507ff5e6686e4f1bbd8bfae69486396b4ed03510\",\n        \"size\": 2139,\n    },\n    \"GB-V.json\": {\n        \"file_name\": \"GB-V.json\",\n        \"sha3_256\": \"1f043042065f2df4590ebbd27fbc8f93802ea66caeb0b8ba92823575842743e5\",\n        \"size\": 516,\n    },\n    \"GBK-EUC-H.json\": {\n        \"file_name\": \"GBK-EUC-H.json\",\n        \"sha3_256\": \"4502e7abe2edfb6256b5a4308dfca940aaa92a2d951c4b44942ce7bdb9eda877\",\n        \"size\": 99532,\n    },\n    \"GBK-EUC-V.json\": {\n        \"file_name\": \"GBK-EUC-V.json\",\n        \"sha3_256\": \"c71f6281bb59897dcf48f587136d002d5caa8a0ed89f9b490a6a288765ec674d\",\n        \"size\": 521,\n    },\n    \"GBK2K-H.json\": {\n        \"file_name\": \"GBK2K-H.json\",\n        \"sha3_256\": 
\"0a2a975da25641067ea2743f15407df20895b28804a1e64c12cd9fd0f306b1a9\",\n        \"size\": 109298,\n    },\n    \"GBK2K-V.json\": {\n        \"file_name\": \"GBK2K-V.json\",\n        \"sha3_256\": \"0febb4a13f8f73dc949d159b4f37e886d1c3d1514aaf53d3492e0b5e21523f52\",\n        \"size\": 1044,\n    },\n    \"GBKp-EUC-H.json\": {\n        \"file_name\": \"GBKp-EUC-H.json\",\n        \"sha3_256\": \"50d628304aff1f13ded3790cc3b8bd48502267768cac5e72cb3be8a46f9a5436\",\n        \"size\": 99510,\n    },\n    \"GBKp-EUC-V.json\": {\n        \"file_name\": \"GBKp-EUC-V.json\",\n        \"sha3_256\": \"8c540fc12dfed309896544f8153fa52b793708a85e3882985567dcae86fb1732\",\n        \"size\": 522,\n    },\n    \"GBT-EUC-H.json\": {\n        \"file_name\": \"GBT-EUC-H.json\",\n        \"sha3_256\": \"5fbe99ec7638de5216ea452788d3ef40cfd8c110c8b8ae936b57db6221d9b9d9\",\n        \"size\": 54802,\n    },\n    \"GBT-EUC-V.json\": {\n        \"file_name\": \"GBT-EUC-V.json\",\n        \"sha3_256\": \"4cc3a48b1f7c8ab088391aa78131289da3d68e2fe0071b380a10c19757356ab5\",\n        \"size\": 521,\n    },\n    \"GBT-H.json\": {\n        \"file_name\": \"GBT-H.json\",\n        \"sha3_256\": \"8bbbbbdee2722751708dd66a7ed12fa54a08bbf0dcfaefca2b87f305ca591f32\",\n        \"size\": 54763,\n    },\n    \"GBT-V.json\": {\n        \"file_name\": \"GBT-V.json\",\n        \"sha3_256\": \"32e4457c8b0edbeeec9445465ec40106603ad50003e1af98994c02020df1c59f\",\n        \"size\": 517,\n    },\n    \"GBTpc-EUC-H.json\": {\n        \"file_name\": \"GBTpc-EUC-H.json\",\n        \"sha3_256\": \"7f7faa903850fc471948e284853a81ee2f4a32693e14131f3ab1fbc490c5695b\",\n        \"size\": 54820,\n    },\n    \"GBTpc-EUC-V.json\": {\n        \"file_name\": \"GBTpc-EUC-V.json\",\n        \"sha3_256\": \"3cf85a97171567e08d0112b71ca4a0aef68c52918b7c635669ef7e25e1bcb818\",\n        \"size\": 523,\n    },\n    \"GBpc-EUC-H.json\": {\n        \"file_name\": \"GBpc-EUC-H.json\",\n        \"sha3_256\": 
\"38332ce5be0b82e4010fbd05ceac92e9f05a784ccacf6a4f004cd8da734c47de\",\n        \"size\": 2196,\n    },\n    \"GBpc-EUC-V.json\": {\n        \"file_name\": \"GBpc-EUC-V.json\",\n        \"sha3_256\": \"5a0b4e7db0aedd6b27f84b191791b527da3ea27ea1ca42460086cb0d294418bf\",\n        \"size\": 522,\n    },\n    \"H.json\": {\n        \"file_name\": \"H.json\",\n        \"sha3_256\": \"5ee11fcc99897b769fd62238967954e957bb8079353abba815792aab6f3e329c\",\n        \"size\": 2868,\n    },\n    \"HKdla-B5-H.json\": {\n        \"file_name\": \"HKdla-B5-H.json\",\n        \"sha3_256\": \"8f24808486e1d5363a66981021f3f8b136f1ec6231d48bda76344e1f7f1695aa\",\n        \"size\": 25384,\n    },\n    \"HKdla-B5-V.json\": {\n        \"file_name\": \"HKdla-B5-V.json\",\n        \"sha3_256\": \"1e686a7f69d6b7a3c05a4be9e7e396cf81498ef48299341616e76805c1092733\",\n        \"size\": 340,\n    },\n    \"HKdlb-B5-H.json\": {\n        \"file_name\": \"HKdlb-B5-H.json\",\n        \"sha3_256\": \"0ccae437017107059630d56c7e0e2d6f086d5fb512c9e60b1bd48c4a04b6652d\",\n        \"size\": 22501,\n    },\n    \"HKdlb-B5-V.json\": {\n        \"file_name\": \"HKdlb-B5-V.json\",\n        \"sha3_256\": \"dad584337fd6e5e6ab5e1e30dc9b8cc1013985a04a159b3c108c4dfb5c10fb55\",\n        \"size\": 340,\n    },\n    \"HKgccs-B5-H.json\": {\n        \"file_name\": \"HKgccs-B5-H.json\",\n        \"sha3_256\": \"f7da0854c355c51957de6e71ffa33fbc69414d52dcfc5a5cb50c8f8c6c6bd9c6\",\n        \"size\": 13642,\n    },\n    \"HKgccs-B5-V.json\": {\n        \"file_name\": \"HKgccs-B5-V.json\",\n        \"sha3_256\": \"d7f89dc24162b624bc4d682484da315a4d39eaf9a8f63c1392e06d2aa46f015a\",\n        \"size\": 341,\n    },\n    \"HKm314-B5-H.json\": {\n        \"file_name\": \"HKm314-B5-H.json\",\n        \"sha3_256\": \"febd4cb78048e012478df9fc91aa23e946304d63c5f7c64ea8e16277b64a359b\",\n        \"size\": 13405,\n    },\n    \"HKm314-B5-V.json\": {\n        \"file_name\": \"HKm314-B5-V.json\",\n        \"sha3_256\": 
\"d310bbf5a975fe8e1f8bb4523b0db8e792043578f0c2a12735bbc24fc4a3721f\",\n        \"size\": 341,\n    },\n    \"HKm471-B5-H.json\": {\n        \"file_name\": \"HKm471-B5-H.json\",\n        \"sha3_256\": \"fdb1368b1a6f4df20ab87e2a1045a579088645828d1168e39d6aa5b52c09bd8e\",\n        \"size\": 17079,\n    },\n    \"HKm471-B5-V.json\": {\n        \"file_name\": \"HKm471-B5-V.json\",\n        \"sha3_256\": \"34c40c1bb1409942f12f66f1bcbc2be73406b4c5e626ea7a4ab7f73160ba2a88\",\n        \"size\": 341,\n    },\n    \"HKscs-B5-H.json\": {\n        \"file_name\": \"HKscs-B5-H.json\",\n        \"sha3_256\": \"63fe2b09c05c8ef70fb937aad49698d4154e1d7bb75f94344fea4db522b87a88\",\n        \"size\": 25722,\n    },\n    \"HKscs-B5-V.json\": {\n        \"file_name\": \"HKscs-B5-V.json\",\n        \"sha3_256\": \"14c864025ffca616fc173458162efe190bdace4700e2a7ad4869c66476534223\",\n        \"size\": 365,\n    },\n    \"Hankaku.json\": {\n        \"file_name\": \"Hankaku.json\",\n        \"sha3_256\": \"befe81a2bbe191bcb8e0ff23706a51cb6a41a60f6bc508d5c0c19040c14afc06\",\n        \"size\": 238,\n    },\n    \"Hiragana.json\": {\n        \"file_name\": \"Hiragana.json\",\n        \"sha3_256\": \"0e8ce0a48ec8c05f4c65d23ada539c4a2a236fcb7dd46e20874acd9362394525\",\n        \"size\": 200,\n    },\n    \"Identity-H.json\": {\n        \"file_name\": \"Identity-H.json\",\n        \"sha3_256\": \"77cc630138b29b5acd4ab216cb1d173bb3e7b994ab932a4f3d8a9121be91fbab\",\n        \"size\": 6404,\n    },\n    \"Identity-V.json\": {\n        \"file_name\": \"Identity-V.json\",\n        \"sha3_256\": \"067a8d390f2d99dfa94ff19009925e5815c8b54b65b39314a244cbbace494679\",\n        \"size\": 62,\n    },\n    \"KSC-EUC-H.json\": {\n        \"file_name\": \"KSC-EUC-H.json\",\n        \"sha3_256\": \"79fb3c0bd9d2ce6b80da98d6f1ef4fd2776dfc3fb78c5ee4d6ee3a06aebc9fd0\",\n        \"size\": 11234,\n    },\n    \"KSC-EUC-V.json\": {\n        \"file_name\": \"KSC-EUC-V.json\",\n        \"sha3_256\": 
\"a541a285c966105a92dba6939401ac8aaeb057e5200bdbf8c874ceecb9f37b01\",\n        \"size\": 441,\n    },\n    \"KSC-H.json\": {\n        \"file_name\": \"KSC-H.json\",\n        \"sha3_256\": \"a0a20bce98ffe98036aa748d46c2921e17247827a22298edb59c778b8b776f24\",\n        \"size\": 11214,\n    },\n    \"KSC-Johab-H.json\": {\n        \"file_name\": \"KSC-Johab-H.json\",\n        \"sha3_256\": \"3d7cd1473ddcf7c3bfb80c7eadf45a365389759b1df1f53e0bd5f31e31125e96\",\n        \"size\": 100922,\n    },\n    \"KSC-Johab-V.json\": {\n        \"file_name\": \"KSC-Johab-V.json\",\n        \"sha3_256\": \"2f7cf1d05bd82d65e488fc3297aefc1c1f48f2c6972b01304c4be5f260fae86e\",\n        \"size\": 443,\n    },\n    \"KSC-V.json\": {\n        \"file_name\": \"KSC-V.json\",\n        \"sha3_256\": \"f6f09bab60f802d61c22368ca8650cefa08851c2039c5825e37404c7047eb496\",\n        \"size\": 437,\n    },\n    \"KSCms-UHC-H.json\": {\n        \"file_name\": \"KSCms-UHC-H.json\",\n        \"sha3_256\": \"6df55fd679239f3a6642c7690e89a85525fa6a8a3cf748aef247b2d06fdc1aca\",\n        \"size\": 16419,\n    },\n    \"KSCms-UHC-HW-H.json\": {\n        \"file_name\": \"KSCms-UHC-HW-H.json\",\n        \"sha3_256\": \"a05183c5d7b6b6f62d11f8175e5749d5ad2913d469403905c8f01a403d715583\",\n        \"size\": 16422,\n    },\n    \"KSCms-UHC-HW-V.json\": {\n        \"file_name\": \"KSCms-UHC-HW-V.json\",\n        \"sha3_256\": \"e2586795b094fade7e385ff1ce5570232edc791c456acf4c6e1c11bc501f82a4\",\n        \"size\": 446,\n    },\n    \"KSCms-UHC-V.json\": {\n        \"file_name\": \"KSCms-UHC-V.json\",\n        \"sha3_256\": \"c09dc49c1afea5a5dc01bd6ac672d2af83b4821d74de7df71d4da3233513cefb\",\n        \"size\": 443,\n    },\n    \"KSCpc-EUC-H.json\": {\n        \"file_name\": \"KSCpc-EUC-H.json\",\n        \"sha3_256\": \"b43448cb510c7f952a6affd0950db58063719f7499309c64f78fea6b2778fa11\",\n        \"size\": 12226,\n    },\n    \"KSCpc-EUC-V.json\": {\n        \"file_name\": \"KSCpc-EUC-V.json\",\n        \"sha3_256\": 
\"1f4889c2e7278085738257e8097382ef5ac40b543b71751b75b155b056a46db2\",\n        \"size\": 443,\n    },\n    \"Katakana.json\": {\n        \"file_name\": \"Katakana.json\",\n        \"sha3_256\": \"524b659bd0acc0fb4baa7633c3250683d6b3ba1685caadc9739240ccdbfd2ce2\",\n        \"size\": 86,\n    },\n    \"NWP-H.json\": {\n        \"file_name\": \"NWP-H.json\",\n        \"sha3_256\": \"6c067655436fe89fb21a26e258973313bfe7cd5fbab3a2857b00ea92cc82c25d\",\n        \"size\": 18143,\n    },\n    \"NWP-V.json\": {\n        \"file_name\": \"NWP-V.json\",\n        \"sha3_256\": \"b494038c72c63c6917ab3ed3f83a8b6bf21c65ba9ea47a4887833fffcc434763\",\n        \"size\": 1205,\n    },\n    \"RKSJ-H.json\": {\n        \"file_name\": \"RKSJ-H.json\",\n        \"sha3_256\": \"eff868636f960b80d6923b77eb59d76acf6d7297bc74e1b7f3a13ff92a71c1cb\",\n        \"size\": 2953,\n    },\n    \"RKSJ-V.json\": {\n        \"file_name\": \"RKSJ-V.json\",\n        \"sha3_256\": \"f3827bc17eb1172a5713d2d5c83a9b60f965894e3f2cb8dcb731b6f151abaa10\",\n        \"size\": 702,\n    },\n    \"Roman.json\": {\n        \"file_name\": \"Roman.json\",\n        \"sha3_256\": \"620ab6ac0f4b487f19d44397b49612db57d164ddbff8e7d52fb5fd7e969e0cb9\",\n        \"size\": 67,\n    },\n    \"UniAKR-UTF16-H.json\": {\n        \"file_name\": \"UniAKR-UTF16-H.json\",\n        \"sha3_256\": \"1204af593c62e5d10ace0db3b5ca0caecc80240f1c866bf1585fad405c204a54\",\n        \"size\": 232741,\n    },\n    \"UniAKR-UTF32-H.json\": {\n        \"file_name\": \"UniAKR-UTF32-H.json\",\n        \"sha3_256\": \"cbbebc4b9b018109612dcfc0798f5c164d739a8b202017580301e0f27f76c35d\",\n        \"size\": 296773,\n    },\n    \"UniAKR-UTF8-H.json\": {\n        \"file_name\": \"UniAKR-UTF8-H.json\",\n        \"sha3_256\": \"e08da06fc02a877abb02205fe0db3b61566d9ac41511a735ef2f12b5741d069a\",\n        \"size\": 266575,\n    },\n    \"UniCNS-UCS2-H.json\": {\n        \"file_name\": \"UniCNS-UCS2-H.json\",\n        \"sha3_256\": 
\"48a0840498b90cf597c05ad2f63e26aaea778a49171f821d4b87b94424d7e640\",\n        \"size\": 400654,\n    },\n    \"UniCNS-UCS2-V.json\": {\n        \"file_name\": \"UniCNS-UCS2-V.json\",\n        \"sha3_256\": \"014f9d86baea5fd13e460dd3735eab98dbbacf126922826ef0be9d7c8c605418\",\n        \"size\": 360,\n    },\n    \"UniCNS-UTF16-H.json\": {\n        \"file_name\": \"UniCNS-UTF16-H.json\",\n        \"sha3_256\": \"c67980ebfb0d525365d0b5421548cc64ce9fb89afca1a0f6d04972f1e39b7f9c\",\n        \"size\": 320254,\n    },\n    \"UniCNS-UTF16-V.json\": {\n        \"file_name\": \"UniCNS-UTF16-V.json\",\n        \"sha3_256\": \"98bd35d76997c0f3c443f130d44e814997cb0277183b7bf6571f92206d9a85a0\",\n        \"size\": 311,\n    },\n    \"UniCNS-UTF32-H.json\": {\n        \"file_name\": \"UniCNS-UTF32-H.json\",\n        \"sha3_256\": \"6ab73cc531843f9bef915a949a0b79de1df288bb7ed6026db782ac446ed36c94\",\n        \"size\": 391690,\n    },\n    \"UniCNS-UTF32-V.json\": {\n        \"file_name\": \"UniCNS-UTF32-V.json\",\n        \"sha3_256\": \"d94f8c3d7fe834d34f746b9404a4bb5dd8479353e3b9f95b308642a8be793a44\",\n        \"size\": 391,\n    },\n    \"UniCNS-UTF8-H.json\": {\n        \"file_name\": \"UniCNS-UTF8-H.json\",\n        \"sha3_256\": \"3666cbe4d00de4038120c98472137857c93d44735c3a5def8c4ac7f84a59aa72\",\n        \"size\": 357287,\n    },\n    \"UniCNS-UTF8-V.json\": {\n        \"file_name\": \"UniCNS-UTF8-V.json\",\n        \"sha3_256\": \"e410ed491c0e2f31ba30cfd60eb4e21c40d3ee82e2be1c06c7adb8772b175f10\",\n        \"size\": 350,\n    },\n    \"UniGB-UCS2-H.json\": {\n        \"file_name\": \"UniGB-UCS2-H.json\",\n        \"sha3_256\": \"42a8e01b690cf2cd6b137c1eb94e7668899f0041b6e43b921252fe453486a96e\",\n        \"size\": 336533,\n    },\n    \"UniGB-UCS2-V.json\": {\n        \"file_name\": \"UniGB-UCS2-V.json\",\n        \"sha3_256\": \"0a0aaf21f823546faf0971b7926724cc95b53b3da3f42a22ec0526ca8de1b237\",\n        \"size\": 617,\n    },\n    \"UniGB-UTF16-H.json\": {\n        
\"file_name\": \"UniGB-UTF16-H.json\",\n        \"sha3_256\": \"c306f093839fffe81e0c8597a24be508a64aa2a9c3e9b9eee858d55059530c0d\",\n        \"size\": 251806,\n    },\n    \"UniGB-UTF16-V.json\": {\n        \"file_name\": \"UniGB-UTF16-V.json\",\n        \"sha3_256\": \"bd283b8c7e145e340db39868ec1a3b0a08d89acc2bfac672d41008a8195c7bb3\",\n        \"size\": 456,\n    },\n    \"UniGB-UTF32-H.json\": {\n        \"file_name\": \"UniGB-UTF32-H.json\",\n        \"sha3_256\": \"a01a6a8b4b715f27c7e1866894240b0e1fd61a4eaca1c91df80c1f256ad06f72\",\n        \"size\": 319766,\n    },\n    \"UniGB-UTF32-V.json\": {\n        \"file_name\": \"UniGB-UTF32-V.json\",\n        \"sha3_256\": \"8b31bba8b852a2c6c1f6d92aea633285e2f75237fbe87ecadff9f9312a0bfaa9\",\n        \"size\": 572,\n    },\n    \"UniGB-UTF8-H.json\": {\n        \"file_name\": \"UniGB-UTF8-H.json\",\n        \"sha3_256\": \"87f7a6b0360d0f9bd0658cb7a67587e86c604be44292214622d972d85a474dbf\",\n        \"size\": 290481,\n    },\n    \"UniGB-UTF8-V.json\": {\n        \"file_name\": \"UniGB-UTF8-V.json\",\n        \"sha3_256\": \"1378adf3ecd0bfbdb11dabbf2118cbb968a03aa2215780b77b07459e3b1df6e7\",\n        \"size\": 513,\n    },\n    \"UniJIS-UCS2-H.json\": {\n        \"file_name\": \"UniJIS-UCS2-H.json\",\n        \"sha3_256\": \"a73e449136b46240ef86c9fb2b614e7d290b814130e9beb4b987c52fd7eda575\",\n        \"size\": 205924,\n    },\n    \"UniJIS-UCS2-HW-H.json\": {\n        \"file_name\": \"UniJIS-UCS2-HW-H.json\",\n        \"sha3_256\": \"e58ec4fd06677ecfcef12d25f6456b7f80da706b2ac6ef915239e0b780b775a0\",\n        \"size\": 154,\n    },\n    \"UniJIS-UCS2-HW-V.json\": {\n        \"file_name\": \"UniJIS-UCS2-HW-V.json\",\n        \"sha3_256\": \"bc3c81dbd6329d83cd71743a6985ed0cf516b0aa97a1c58c3cc3940e280b1e8e\",\n        \"size\": 4868,\n    },\n    \"UniJIS-UCS2-V.json\": {\n        \"file_name\": \"UniJIS-UCS2-V.json\",\n        \"sha3_256\": \"276712ac66416538e859ad28e9f5b685fbc71e5d7d91e905a3489f03667ae4bc\",\n        
\"size\": 4775,\n    },\n    \"UniJIS-UTF16-H.json\": {\n        \"file_name\": \"UniJIS-UTF16-H.json\",\n        \"sha3_256\": \"afc923e268f22dcf09e0871ce0060c7588aa1304d4b26e781a261c14566f7642\",\n        \"size\": 238042,\n    },\n    \"UniJIS-UTF16-V.json\": {\n        \"file_name\": \"UniJIS-UTF16-V.json\",\n        \"sha3_256\": \"0a044ab7015485c3b0f7f9e4d883a1d9e9f1d04235b13e2a17687e878ce3e9f0\",\n        \"size\": 3951,\n    },\n    \"UniJIS-UTF32-H.json\": {\n        \"file_name\": \"UniJIS-UTF32-H.json\",\n        \"sha3_256\": \"1c27e2e595d659073e37e5ee22a9b39abe30af1483de33e1078ed174abdc723c\",\n        \"size\": 295294,\n    },\n    \"UniJIS-UTF32-V.json\": {\n        \"file_name\": \"UniJIS-UTF32-V.json\",\n        \"sha3_256\": \"aa7a475ce5f85f79d73e17355c08e6aee21a949b596f2efe359913489a22117f\",\n        \"size\": 4983,\n    },\n    \"UniJIS-UTF8-H.json\": {\n        \"file_name\": \"UniJIS-UTF8-H.json\",\n        \"sha3_256\": \"d91079b3f1671a7f4ace8b8f89478558f43f7782e666064ce1b53af563a87306\",\n        \"size\": 266367,\n    },\n    \"UniJIS-UTF8-V.json\": {\n        \"file_name\": \"UniJIS-UTF8-V.json\",\n        \"sha3_256\": \"d0c8c94f7d54dafa40876ce7eb28845d8ac00b688cf4bac255694cb2f086d109\",\n        \"size\": 4483,\n    },\n    \"UniJIS2004-UTF16-H.json\": {\n        \"file_name\": \"UniJIS2004-UTF16-H.json\",\n        \"sha3_256\": \"336660e87fc57ad166258d22f09690fcebb546840faee1e1b3f6cad3556bcf80\",\n        \"size\": 238119,\n    },\n    \"UniJIS2004-UTF16-V.json\": {\n        \"file_name\": \"UniJIS2004-UTF16-V.json\",\n        \"sha3_256\": \"f6619a74b62f9986e9a74620b28e726b927dde5cd6184742f368ef4d686fe55c\",\n        \"size\": 3955,\n    },\n    \"UniJIS2004-UTF32-H.json\": {\n        \"file_name\": \"UniJIS2004-UTF32-H.json\",\n        \"sha3_256\": \"2512690db880e0663f8208d22acda8daa98f1240ff14a038bf02e57c4908afb5\",\n        \"size\": 295371,\n    },\n    \"UniJIS2004-UTF32-V.json\": {\n        \"file_name\": 
\"UniJIS2004-UTF32-V.json\",\n        \"sha3_256\": \"da1728a91845f1654457eaf0f15b75d1ace5cbf75486bca8523bd5edf20a8010\",\n        \"size\": 4987,\n    },\n    \"UniJIS2004-UTF8-H.json\": {\n        \"file_name\": \"UniJIS2004-UTF8-H.json\",\n        \"sha3_256\": \"af36b0255a1ed15966670703ba8a48987a1cf7e43f5c94a4e86a41e5ee26b940\",\n        \"size\": 266444,\n    },\n    \"UniJIS2004-UTF8-V.json\": {\n        \"file_name\": \"UniJIS2004-UTF8-V.json\",\n        \"sha3_256\": \"28bebdf1581c45f2e9b38caa2ff643abd561321bab45febb0f90d802d2290faa\",\n        \"size\": 4487,\n    },\n    \"UniJISPro-UCS2-HW-V.json\": {\n        \"file_name\": \"UniJISPro-UCS2-HW-V.json\",\n        \"sha3_256\": \"21fd353a062b6c415389d6fde11718488f765ca31fd4ca481050c89633568009\",\n        \"size\": 4994,\n    },\n    \"UniJISPro-UCS2-V.json\": {\n        \"file_name\": \"UniJISPro-UCS2-V.json\",\n        \"sha3_256\": \"8daa155869a35f3f629abb042790c59eb5cff342b83573c2ae4c87b3e865dc27\",\n        \"size\": 4901,\n    },\n    \"UniJISPro-UTF8-V.json\": {\n        \"file_name\": \"UniJISPro-UTF8-V.json\",\n        \"sha3_256\": \"19b9a6d908f9fb7413d778c9cc912072314864225c38a3f5c345936fabcea650\",\n        \"size\": 5726,\n    },\n    \"UniJISX0213-UTF32-H.json\": {\n        \"file_name\": \"UniJISX0213-UTF32-H.json\",\n        \"sha3_256\": \"e6a07453703f5070bf567c9d67aa20bc4b404bd311413fed45d9ba8c297a91d9\",\n        \"size\": 295246,\n    },\n    \"UniJISX0213-UTF32-V.json\": {\n        \"file_name\": \"UniJISX0213-UTF32-V.json\",\n        \"sha3_256\": \"5f2dd4ff8045b2308a707e3d4ffb73e1ba7f5a1c1fdb43b17c5a322109897b9c\",\n        \"size\": 4908,\n    },\n    \"UniJISX02132004-UTF32-H.json\": {\n        \"file_name\": \"UniJISX02132004-UTF32-H.json\",\n        \"sha3_256\": \"81427dc73cf9392c0c3e8eeeb1dedbc797b123059714bfcdcd1ecffec9f341e3\",\n        \"size\": 295323,\n    },\n    \"UniJISX02132004-UTF32-V.json\": {\n        \"file_name\": \"UniJISX02132004-UTF32-V.json\",\n        
\"sha3_256\": \"c0721298f3449f0c6f48ada1200ebcadbfc4020b10333871f6c0eea0be9f13ac\",\n        \"size\": 4912,\n    },\n    \"UniKS-UCS2-H.json\": {\n        \"file_name\": \"UniKS-UCS2-H.json\",\n        \"sha3_256\": \"3a1c10535982d06dde447764f8e3dd82c6c87bec6c4272eaf449f67db6d50ab8\",\n        \"size\": 202706,\n    },\n    \"UniKS-UCS2-V.json\": {\n        \"file_name\": \"UniKS-UCS2-V.json\",\n        \"sha3_256\": \"b915820ff4639f837e4d3b7e5a7c0810c26af1dcf3df9e56ed9a0a69e3cdba9d\",\n        \"size\": 492,\n    },\n    \"UniKS-UTF16-H.json\": {\n        \"file_name\": \"UniKS-UTF16-H.json\",\n        \"sha3_256\": \"820f534efffcef15f0d3f270c078774febee31b451a1387b27f7225da321c12f\",\n        \"size\": 153894,\n    },\n    \"UniKS-UTF16-V.json\": {\n        \"file_name\": \"UniKS-UTF16-V.json\",\n        \"sha3_256\": \"2b5be7641990cf79754a12309c6069c01b636cfc3308bc4dc8075da59c2d8d6b\",\n        \"size\": 403,\n    },\n    \"UniKS-UTF32-H.json\": {\n        \"file_name\": \"UniKS-UTF32-H.json\",\n        \"sha3_256\": \"541515ed8ff15170b38fbe6587ff6c54f6fc75aeede9da110133dc335e4ddf0e\",\n        \"size\": 195998,\n    },\n    \"UniKS-UTF32-V.json\": {\n        \"file_name\": \"UniKS-UTF32-V.json\",\n        \"sha3_256\": \"940e977d3927c8480c65dc4ad6be4f365f65b8d76707758a7696d40e2b3583ea\",\n        \"size\": 503,\n    },\n    \"UniKS-UTF8-H.json\": {\n        \"file_name\": \"UniKS-UTF8-H.json\",\n        \"sha3_256\": \"81b5c336c1a20dee2e9592c6615a46cdd906edd242717c1807609b5687576252\",\n        \"size\": 177154,\n    },\n    \"UniKS-UTF8-V.json\": {\n        \"file_name\": \"UniKS-UTF8-V.json\",\n        \"sha3_256\": \"9a282e8eee884f801a5518cc52ff240ee8635553661dd0ee7df952adbad7462a\",\n        \"size\": 452,\n    },\n    \"V.json\": {\n        \"file_name\": \"V.json\",\n        \"sha3_256\": \"616f263e53079846a66efc861524a15c0a411e823c37fe08e62bad835745cbba\",\n        \"size\": 697,\n    },\n    \"WP-Symbol.json\": {\n        \"file_name\": 
\"WP-Symbol.json\",\n        \"sha3_256\": \"533dfe497eab1f095039b6344217fc0ff6b1f7cdf9b406bb19c30b945fe78c21\",\n        \"size\": 588,\n    },\n}\n\n\nFONT_NAMES = {v[\"font_name\"] for v in EMBEDDING_FONT_METADATA.values()}\n\nCN_FONT_FAMILY = {\n    # 手写体\n    \"script\": [\n        \"LXGWWenKaiGB-Regular.1.520.ttf\",\n    ],\n    # 正文字体\n    \"normal\": [\n        \"SourceHanSerifCN-Bold.ttf\",\n        \"SourceHanSerifCN-Regular.ttf\",\n        \"SourceHanSansCN-Bold.ttf\",\n        \"SourceHanSansCN-Regular.ttf\",\n    ],\n    # 备用字体\n    \"fallback\": [\n        \"GoNotoKurrent-Regular.ttf\",\n        \"GoNotoKurrent-Bold.ttf\",\n    ],\n    \"base\": [\"SourceHanSansCN-Regular.ttf\"],\n}\n\nHK_FONT_FAMILY = {\n    \"script\": [\"LXGWWenKaiTC-Regular.1.520.ttf\"],\n    \"normal\": [\n        \"SourceHanSerifHK-Bold.ttf\",\n        \"SourceHanSerifHK-Regular.ttf\",\n        \"SourceHanSansHK-Bold.ttf\",\n        \"SourceHanSansHK-Regular.ttf\",\n    ],\n    \"fallback\": [\n        \"GoNotoKurrent-Regular.ttf\",\n        \"GoNotoKurrent-Bold.ttf\",\n    ],\n    \"base\": [\"SourceHanSansCN-Regular.ttf\"],\n}\n\nTW_FONT_FAMILY = {\n    \"script\": [\"LXGWWenKaiTC-Regular.1.520.ttf\"],\n    \"normal\": [\n        \"SourceHanSerifTW-Bold.ttf\",\n        \"SourceHanSerifTW-Regular.ttf\",\n        \"SourceHanSansTW-Bold.ttf\",\n        \"SourceHanSansTW-Regular.ttf\",\n    ],\n    \"fallback\": [\n        \"GoNotoKurrent-Regular.ttf\",\n        \"GoNotoKurrent-Bold.ttf\",\n    ],\n    \"base\": [\"SourceHanSansCN-Regular.ttf\"],\n}\n\nKR_FONT_FAMILY = {\n    \"script\": [\"MaruBuri-Regular.ttf\"],\n    \"normal\": [\n        \"SourceHanSerifKR-Bold.ttf\",\n        \"SourceHanSerifKR-Regular.ttf\",\n        \"SourceHanSansKR-Bold.ttf\",\n        \"SourceHanSansKR-Regular.ttf\",\n    ],\n    \"fallback\": [\n        \"GoNotoKurrent-Regular.ttf\",\n        \"GoNotoKurrent-Bold.ttf\",\n    ],\n    \"base\": [\"SourceHanSansCN-Regular.ttf\"],\n}\n\nJP_FONT_FAMILY = 
{\n    \"script\": [\"KleeOne-Regular.ttf\"],\n    \"normal\": [\n        \"SourceHanSerifJP-Bold.ttf\",\n        \"SourceHanSerifJP-Regular.ttf\",\n        \"SourceHanSansJP-Bold.ttf\",\n        \"SourceHanSansJP-Regular.ttf\",\n    ],\n    \"fallback\": [\n        \"GoNotoKurrent-Regular.ttf\",\n        \"GoNotoKurrent-Bold.ttf\",\n    ],\n    \"base\": [\"SourceHanSansCN-Regular.ttf\"],\n}\n\nEN_FONT_FAMILY = {\n    \"script\": [\n        \"NotoSans-Italic.ttf\",\n        \"NotoSans-BoldItalic.ttf\",\n        \"NotoSerif-Italic.ttf\",\n        \"NotoSerif-BoldItalic.ttf\",\n    ],\n    \"normal\": [\n        \"NotoSerif-Regular.ttf\",\n        \"NotoSerif-Bold.ttf\",\n        \"NotoSans-Regular.ttf\",\n        \"NotoSans-Bold.ttf\",\n    ],\n    \"fallback\": [\n        \"GoNotoKurrent-Regular.ttf\",\n        \"GoNotoKurrent-Bold.ttf\",\n    ],\n    \"base\": [\n        \"NotoSans-Regular.ttf\",\n    ],\n}\n\nALL_FONT_FAMILY = {\n    \"CN\": CN_FONT_FAMILY,\n    \"TW\": TW_FONT_FAMILY,\n    \"HK\": HK_FONT_FAMILY,\n    \"KR\": KR_FONT_FAMILY,\n    \"JP\": JP_FONT_FAMILY,\n    \"EN\": EN_FONT_FAMILY,\n    \"JA\": JP_FONT_FAMILY,\n}\n\n\ndef __add_fallback_to_font_family():\n    for lang1, family1 in ALL_FONT_FAMILY.items():\n        added_font = set()\n        for font in itertools.chain.from_iterable(family1.values()):\n            added_font.add(font)\n\n        for lang2, family2 in ALL_FONT_FAMILY.items():\n            if lang1 != lang2:\n                for type_ in family1:\n                    for font in family2[type_]:\n                        if font not in added_font:\n                            family1[type_].append(font)\n                            added_font.add(font)\n\n\ndef __cleanup_unused_font_metadata():\n    \"\"\"Remove unused font metadata that are not referenced in any font family.\"\"\"\n    referenced_fonts = set()\n    for family in ALL_FONT_FAMILY.values():\n        for font_list in family.values():\n            
referenced_fonts.update(font_list)\n\n    # Remove unreferenced fonts from EMBEDDING_FONT_METADATA\n    unused_fonts = set(EMBEDDING_FONT_METADATA.keys()) - referenced_fonts\n    for font_name in unused_fonts:\n        del EMBEDDING_FONT_METADATA[font_name]\n\n\n__add_fallback_to_font_family()\n__cleanup_unused_font_metadata()\n\n\ndef get_font_family(lang_code: str):\n    lang_code = lang_code.upper()\n    if \"KR\" in lang_code:\n        font_family = KR_FONT_FAMILY\n    elif \"JP\" in lang_code or \"JA\" in lang_code:\n        font_family = JP_FONT_FAMILY\n    elif \"HK\" in lang_code:\n        font_family = HK_FONT_FAMILY\n    elif \"TW\" in lang_code:\n        font_family = TW_FONT_FAMILY\n    elif \"EN\" in lang_code:\n        font_family = EN_FONT_FAMILY\n    elif \"CN\" in lang_code:\n        font_family = CN_FONT_FAMILY\n    else:\n        font_family = EN_FONT_FAMILY\n    verify_font_family(font_family)\n    return font_family\n\n\ndef verify_font_family(font_family: str | dict):\n    if isinstance(font_family, str):\n        font_family = ALL_FONT_FAMILY[font_family]\n    for k in font_family:\n        if k not in [\"script\", \"normal\", \"fallback\", \"base\"]:\n            raise ValueError(f\"Invalid font family: {font_family}\")\n        for font_file_name in font_family[k]:\n            if font_file_name not in EMBEDDING_FONT_METADATA:\n                raise ValueError(f\"Invalid font file: {font_file_name}\")\n\n\nif __name__ == \"__main__\":\n    for k in ALL_FONT_FAMILY:\n        verify_font_family(k)\n"
  },
  {
    "path": "babeldoc/asynchronize/__init__.py",
    "content": "import asyncio\nimport time\n\n\nclass Args:\n    def __init__(self, args, kwargs):\n        self.args = args\n        self.kwargs = kwargs\n\n\nclass AsyncCallback:\n    def __init__(self):\n        self.queue = asyncio.Queue()\n        self.finished = False\n        self.loop = asyncio.get_event_loop()\n\n    def step_callback(self, *args, **kwargs):\n        # Whenever a step is called, add to the queue but don't set finished to True, so __anext__ will continue\n        args = Args(args, kwargs)\n\n        # We have to use the threadsafe call so that it wakes up the event loop, in case it's sleeping:\n        # https://stackoverflow.com/a/49912853/2148718\n        self.loop.call_soon_threadsafe(self.queue.put_nowait, args)\n\n        # Add a small delay to release the GIL, ensuring the event loop has time to process messages\n        time.sleep(0.01)\n\n    def finished_callback(self, *args, **kwargs):\n        # Whenever a finished is called, add to the queue as with step, but also set finished to True, so __anext__\n        # will terminate after processing the remaining items\n        if self.finished:\n            return\n        self.step_callback(*args, **kwargs)\n        self.finished = True\n\n    def __await__(self):\n        # Since this implements __anext__, this can return itself\n        return self.queue.get().__await__()\n\n    def __aiter__(self):\n        # Since this implements __anext__, this can return itself\n        return self\n\n    async def __anext__(self):\n        # Keep waiting for the queue if a) we haven't finished, or b) if the queue is still full. This lets us finish\n        # processing the remaining items even after we've finished\n        if self.finished and self.queue.empty():\n            raise StopAsyncIteration\n\n        result = await self.queue.get()\n        return result\n"
  },
  {
    "path": "babeldoc/babeldoc_exception/BabelDOCException.py",
    "content": "class ScannedPDFError(Exception):\n    def __init__(self, message):\n        super().__init__(message)\n\n\nclass ExtractTextError(Exception):\n    def __init__(self, message):\n        super().__init__(message)\n\n\nclass InputFileGeneratedByBabelDOCError(Exception):\n    def __init__(self, message):\n        super().__init__(message)\n\n\nclass ContentFilterError(Exception):\n    def __init__(self, message):\n        super().__init__(message)\n        self.message = message\n"
  },
  {
    "path": "babeldoc/babeldoc_exception/__init__.py",
    "content": ""
  },
  {
    "path": "babeldoc/const.py",
    "content": "import itertools\nimport multiprocessing as mp\nimport os\nimport shutil\nimport subprocess\nimport threading\nfrom pathlib import Path\n\n__version__ = \"0.5.23\"\n\nCACHE_FOLDER = Path.home() / \".cache\" / \"babeldoc\"\n\n\ndef get_cache_file_path(filename: str, sub_folder: str | None = None) -> Path:\n    if sub_folder is not None:\n        sub_folder = sub_folder.strip(\"/\")\n        sub_folder_path = CACHE_FOLDER / sub_folder\n        sub_folder_path.mkdir(parents=True, exist_ok=True)\n        return sub_folder_path / filename\n    return CACHE_FOLDER / filename\n\n\ntry:\n    git_path = shutil.which(\"git\")\n    if git_path is None:\n        raise FileNotFoundError(\"git executable not found\")\n    two_parent = Path(__file__).resolve().parent.parent\n    md_ = two_parent / \"docs\" / \"README.md\"\n    if two_parent.name == \"site-packages\" or not md_.exists():\n        raise FileNotFoundError(\"not in git repo\")\n    WATERMARK_VERSION = (\n        subprocess.check_output(  # noqa: S603\n            [git_path, \"describe\", \"--always\"],\n            cwd=Path(__file__).resolve().parent,\n        )\n        .strip()\n        .decode()\n    )\nexcept (OSError, FileNotFoundError, subprocess.CalledProcessError):\n    WATERMARK_VERSION = f\"v{__version__}\"\n\nTIKTOKEN_CACHE_FOLDER = CACHE_FOLDER / \"tiktoken\"\nTIKTOKEN_CACHE_FOLDER.mkdir(parents=True, exist_ok=True)\nos.environ[\"TIKTOKEN_CACHE_DIR\"] = str(TIKTOKEN_CACHE_FOLDER)\n\n\n_process_pool = None\n_process_pool_lock = threading.Lock()\n_ENABLE_PROCESS_POOL = False\n\n\ndef enable_process_pool():\n    # Development and Testing ONLY API\n    global _ENABLE_PROCESS_POOL\n    _ENABLE_PROCESS_POOL = True\n\n\n# macos & windows use spawn mode\n# linux use forkserver mode\n\n\ndef get_process_pool():\n    if not _ENABLE_PROCESS_POOL:\n        return None\n    global _process_pool\n    with _process_pool_lock:\n        if _process_pool is None:\n            # Create pool only in main 
process\n            if mp.current_process().name != \"MainProcess\":\n                return None\n\n            _process_pool = mp.Pool()\n        return _process_pool\n\n\ndef close_process_pool():\n    if not _ENABLE_PROCESS_POOL:\n        return None\n    global _process_pool\n    with _process_pool_lock:\n        if _process_pool:\n            _process_pool.close()\n            _process_pool.join()\n            _process_pool = None\n\n\ndef batched(iterable, n, *, strict=False):\n    # batched('ABCDEFG', 3) → ABC DEF G\n    if n < 1:\n        raise ValueError(\"n must be at least one\")\n    iterator = iter(iterable)\n    while batch := tuple(itertools.islice(iterator, n)):\n        if strict and len(batch) != n:\n            raise ValueError(\"batched(): incomplete batch\")\n        yield batch\n"
  },
  {
    "path": "babeldoc/docvision/README.md",
    "content": ""
  },
  {
    "path": "babeldoc/docvision/__init__.py",
    "content": ""
  },
  {
    "path": "babeldoc/docvision/base_doclayout.py",
    "content": "import abc\nimport logging\nfrom collections.abc import Generator\n\nimport pymupdf\n\nfrom babeldoc.format.pdf.document_il.il_version_1 import Page\n\nlogger = logging.getLogger(__name__)\n\n\nclass YoloResult:\n    \"\"\"Helper class to store detection results from ONNX model.\"\"\"\n\n    def __init__(self, names, boxes=None, boxes_data=None):\n        if boxes is not None:\n            self.boxes = boxes\n        else:\n            assert boxes_data is not None\n            self.boxes = [YoloBox(data=d) for d in boxes_data]\n        self.boxes.sort(key=lambda x: x.conf, reverse=True)\n        self.names = names\n\n\nclass YoloBox:\n    \"\"\"Helper class to store detection results from ONNX model.\"\"\"\n\n    def __init__(self, data=None, xyxy=None, conf=None, cls=None):\n        if data is not None:\n            self.xyxy = data[:4]\n            self.conf = data[-2]\n            self.cls = data[-1]\n            return\n        assert xyxy is not None and conf is not None and cls is not None\n        self.xyxy = xyxy\n        self.conf = conf\n        self.cls = cls\n\n\nclass DocLayoutModel(abc.ABC):\n    @staticmethod\n    def load_onnx():\n        logger.info(\"Loading ONNX model...\")\n        from babeldoc.docvision.doclayout import OnnxModel\n\n        model = OnnxModel.from_pretrained()\n        return model\n\n    @staticmethod\n    def load_available():\n        return DocLayoutModel.load_onnx()\n\n    @property\n    @abc.abstractmethod\n    def stride(self) -> int:\n        \"\"\"Stride of the model input.\"\"\"\n\n    @abc.abstractmethod\n    def handle_document(\n        self,\n        pages: list[Page],\n        mupdf_doc: pymupdf.Document,\n        translate_config,\n        save_debug_image,\n    ) -> Generator[tuple[Page, YoloResult], None, None]:\n        \"\"\"\n        Handle a document.\n        \"\"\"\n"
  },
  {
    "path": "babeldoc/docvision/doclayout.py",
    "content": "import ast\nimport logging\nimport platform\nimport re\nimport threading\nfrom collections.abc import Generator\n\nimport cv2\nimport numpy as np\n\nfrom babeldoc.docvision.base_doclayout import DocLayoutModel\nfrom babeldoc.docvision.base_doclayout import YoloResult\nfrom babeldoc.format.pdf.document_il.utils.mupdf_helper import get_no_rotation_img\n\ntry:\n    import onnx\n    import onnxruntime\nexcept ImportError as e:\n    if \"DLL load failed\" in str(e):\n        raise OSError(\n            \"Microsoft Visual C++ Redistributable is not installed. \"\n            \"Download it at https://aka.ms/vs/17/release/vc_redist.x64.exe\"\n        ) from e\n    raise\nimport pymupdf\n\nimport babeldoc.format.pdf.document_il.il_version_1\nfrom babeldoc.assets.assets import get_doclayout_onnx_model_path\n\n# from huggingface_hub import hf_hub_download\n\nlogger = logging.getLogger(__name__)\n\n\n# 检测操作系统类型\nos_name = platform.system()\n\n\nclass OnnxModel(DocLayoutModel):\n    def __init__(self, model_path: str):\n        self.model_path = model_path\n\n        model = onnx.load(model_path)\n        metadata = {d.key: d.value for d in model.metadata_props}\n        self._stride = ast.literal_eval(metadata[\"stride\"])\n        self._names = ast.literal_eval(metadata[\"names\"])\n        providers = []\n\n        available_providers = onnxruntime.get_available_providers()\n        for provider in available_providers:\n            # disable dml|cuda|\n            # directml/cuda may encounter problems under special circumstances\n            if re.match(r\"cpu\", provider, re.IGNORECASE):\n                logger.info(f\"Available Provider: {provider}\")\n                providers.append(provider)\n        self.model = onnxruntime.InferenceSession(\n            model.SerializeToString(),\n            providers=providers,\n        )\n        self.lock = threading.Lock()\n\n    @staticmethod\n    def from_pretrained():\n        pth = 
get_doclayout_onnx_model_path()\n        return OnnxModel(pth)\n\n    @property\n    def stride(self):\n        return self._stride\n\n    def resize_and_pad_image(self, image, new_shape):\n        \"\"\"\n        Resize and pad the image to the specified size, ensuring dimensions are multiples of stride.\n\n        Parameters:\n        - image: Input image\n        - new_shape: Target size (integer or (height, width) tuple)\n        - stride: Padding alignment stride, default 32\n\n        Returns:\n        - Processed image\n        \"\"\"\n        if isinstance(new_shape, int):\n            new_shape = (new_shape, new_shape)\n\n        h, w = image.shape[:2]\n        new_h, new_w = new_shape\n\n        # Calculate scaling ratio\n        r = min(new_h / h, new_w / w)\n        resized_h, resized_w = int(round(h * r)), int(round(w * r))\n\n        # Resize image\n        image = cv2.resize(\n            image,\n            (resized_w, resized_h),\n            interpolation=cv2.INTER_LINEAR,\n        )\n\n        # Calculate padding size and align to stride multiple\n        pad_w = (new_w - resized_w) % self.stride\n        pad_h = (new_h - resized_h) % self.stride\n        top, bottom = pad_h // 2, pad_h - pad_h // 2\n        left, right = pad_w // 2, pad_w - pad_w // 2\n\n        # Add padding\n        image = cv2.copyMakeBorder(\n            image,\n            top,\n            bottom,\n            left,\n            right,\n            cv2.BORDER_CONSTANT,\n            value=(114, 114, 114),\n        )\n\n        return image\n\n    def scale_boxes(self, img1_shape, boxes, img0_shape):\n        \"\"\"\n        Rescales bounding boxes (in the format of xyxy by default) from the shape of the image they were originally\n        specified in (img1_shape) to the shape of a different image (img0_shape).\n\n        Args:\n            img1_shape (tuple): The shape of the image that the bounding boxes are for,\n                in the format of (height, width).\n        
    boxes (torch.Tensor): the bounding boxes of the objects in the image, in the format of (x1, y1, x2, y2)\n            img0_shape (tuple): the shape of the target image, in the format of (height, width).\n\n        Returns:\n            boxes (torch.Tensor): The scaled bounding boxes, in the format of (x1, y1, x2, y2)\n        \"\"\"\n\n        # Calculate scaling ratio\n        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])\n\n        # Calculate padding size\n        pad_x = round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1)\n        pad_y = round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)\n\n        # Remove padding and scale boxes\n        boxes[..., :4] = (boxes[..., :4] - [pad_x, pad_y, pad_x, pad_y]) / gain\n        return boxes\n\n    def predict(self, image, imgsz=800, batch_size=16, **kwargs):\n        \"\"\"\n        Predict the layout of document pages.\n\n        Args:\n            image: A single image or a list of images of document pages.\n            imgsz: Resize the image to this size. 
Must be a multiple of the stride.\n            batch_size: Number of images to process in one batch.\n            **kwargs: Additional arguments.\n\n        Returns:\n            A list of YoloResult objects, one for each input image.\n        \"\"\"\n        # Handle single image input\n        if isinstance(image, np.ndarray) and len(image.shape) == 3:\n            image = [image]\n\n        total_images = len(image)\n        results = []\n        batch_size = 1\n\n        # Process images in batches\n        for i in range(0, total_images, batch_size):\n            batch_images = image[i : i + batch_size]\n            batch_size_actual = len(batch_images)\n\n            # Calculate target size based on the maximum height in the batch\n            max_height = max(img.shape[0] for img in batch_images)\n            target_imgsz = 1024\n\n            # Preprocess batch\n            processed_batch = []\n            orig_shapes = []\n            for img in batch_images:\n                orig_h, orig_w = img.shape[:2]\n                orig_shapes.append((orig_h, orig_w))\n\n                pix = self.resize_and_pad_image(img, new_shape=target_imgsz)\n                pix = np.transpose(pix, (2, 0, 1))  # CHW\n                pix = pix.astype(np.float32) / 255.0  # Normalize to [0, 1]\n                processed_batch.append(pix)\n\n            # Stack batch\n            batch_input = np.stack(processed_batch, axis=0)  # BCHW\n            new_h, new_w = batch_input.shape[2:]\n\n            # Run inference\n            batch_preds = self.model.run(None, {\"images\": batch_input})[0]\n\n            # Process each prediction in the batch\n            for j in range(batch_size_actual):\n                preds = batch_preds[j]\n                preds = preds[preds[..., 4] > 0.25]\n                if len(preds) > 0:\n                    preds[..., :4] = self.scale_boxes(\n                        (new_h, new_w),\n                        preds[..., :4],\n                        
orig_shapes[j],\n                    )\n                results.append(YoloResult(boxes_data=preds, names=self._names))\n\n        return results\n\n    def handle_document(\n        self,\n        pages: list[babeldoc.format.pdf.document_il.il_version_1.Page],\n        mupdf_doc: pymupdf.Document,\n        translate_config,\n        save_debug_image,\n    ) -> Generator[\n        tuple[babeldoc.format.pdf.document_il.il_version_1.Page, YoloResult], None, None\n    ]:\n        for page in pages:\n            translate_config.raise_if_cancelled()\n            with self.lock:\n                # pix = mupdf_doc[page.page_number].get_pixmap(dpi=72)\n                pix = get_no_rotation_img(mupdf_doc[page.page_number])\n            image = np.frombuffer(pix.samples, np.uint8).reshape(\n                pix.height,\n                pix.width,\n                3,\n            )[:, :, ::-1]\n            predict_result = self.predict(image)[0]\n            save_debug_image(\n                image,\n                predict_result,\n                page.page_number + 1,\n            )\n            yield page, predict_result\n"
  },
  {
    "path": "babeldoc/docvision/rpc_doclayout.py",
    "content": "import logging\nimport threading\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\n\nimport cv2\nimport httpx\nimport msgpack\nimport numpy as np\nimport pymupdf\nfrom tenacity import retry\nfrom tenacity import retry_if_exception_type\nfrom tenacity import stop_after_attempt\nfrom tenacity import wait_exponential\n\nimport babeldoc\nfrom babeldoc.docvision.base_doclayout import DocLayoutModel\nfrom babeldoc.docvision.base_doclayout import YoloBox\nfrom babeldoc.docvision.base_doclayout import YoloResult\nfrom babeldoc.format.pdf.document_il.utils.mupdf_helper import get_no_rotation_img\n\nlogger = logging.getLogger(__name__)\n\n\ndef encode_image(image) -> bytes:\n    \"\"\"Read and encode image to bytes\n\n    Args:\n        image: Can be either a file path (str) or numpy array\n    \"\"\"\n    if isinstance(image, str):\n        if not Path(image).exists():\n            raise FileNotFoundError(f\"Image file not found: {image}\")\n        img = cv2.imread(image)\n        if img is None:\n            raise ValueError(f\"Failed to read image: {image}\")\n    else:\n        img = image\n\n    # logger.debug(f\"Image shape: {img.shape}\")\n    img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)\n\n    encoded = cv2.imencode(\".jpg\", img)[1].tobytes()\n    # logger.debug(f\"Encoded image size: {len(encoded)} bytes\")\n    return encoded\n\n\n@retry(\n    stop=stop_after_attempt(3),  # 最多重试 3 次\n    wait=wait_exponential(\n        multiplier=1, min=1, max=10\n    ),  # 指数退避策略，初始 1 秒，最大 10 秒\n    retry=retry_if_exception_type((httpx.HTTPError, Exception)),  # 针对哪些异常重试\n    before_sleep=lambda retry_state: logger.warning(\n        f\"Request failed, retrying in {retry_state.next_action.sleep} seconds... 
\"\n        f\"(Attempt {retry_state.attempt_number}/3)\"\n    ),\n)\ndef predict_layout(\n    image,\n    host: str = \"http://localhost:8000\",\n    imgsz: int = 1024,\n):\n    \"\"\"\n    Predict document layout using the MOSEC service\n\n    Args:\n        image: Can be either a file path (str) or numpy array\n        host: Service host URL\n        imgsz: Image size for model input\n\n    Returns:\n        List of predictions containing bounding boxes and classes\n    \"\"\"\n    # Prepare request data\n    if not isinstance(image, list):\n        image = [image]\n    image_data = [encode_image(image) for image in image]\n    data = {\n        \"image\": image_data,\n        \"imgsz\": imgsz,\n    }\n\n    # Pack data using msgpack\n    packed_data = msgpack.packb(data, use_bin_type=True)\n    # logger.debug(f\"Packed data size: {len(packed_data)} bytes\")\n\n    # Send request\n    # logger.debug(f\"Sending request to {host}/inference\")\n    response = httpx.post(\n        f\"{host}/inference\",\n        data=packed_data,\n        headers={\n            \"Content-Type\": \"application/msgpack\",\n            \"Accept\": \"application/msgpack\",\n        },\n        timeout=300,\n        follow_redirects=True,\n    )\n\n    # logger.debug(f\"Response status: {response.status_code}\")\n    # logger.debug(f\"Response headers: {response.headers}\")\n\n    if response.status_code == 200:\n        try:\n            result = msgpack.unpackb(response.content, raw=False)\n            return result\n        except Exception as e:\n            logger.exception(f\"Failed to unpack response: {e!s}\")\n            raise\n    else:\n        logger.error(f\"Request failed with status {response.status_code}\")\n        logger.error(f\"Response content: {response.content}\")\n        raise Exception(\n            f\"Request failed with status {response.status_code}: {response.text}\",\n        )\n\n\nclass ResultContainer:\n    def __init__(self):\n        self.result = 
YoloResult(boxes_data=np.array([]), names=[])\n\n\nclass RpcDocLayoutModel(DocLayoutModel):\n    \"\"\"DocLayoutModel implementation that uses RPC service.\"\"\"\n\n    def __init__(self, host: str = \"http://localhost:8000\"):\n        \"\"\"Initialize RPC model with host address.\"\"\"\n        self.host = host\n        self._stride = 32  # Default stride value\n        self._names = [\"text\", \"title\", \"list\", \"table\", \"figure\"]\n        self.lock = threading.Lock()\n\n    @property\n    def stride(self) -> int:\n        \"\"\"Stride of the model input.\"\"\"\n        return self._stride\n\n    def resize_and_pad_image(self, image, new_shape):\n        \"\"\"\n        Resize and pad the image to the specified size,\n        ensuring dimensions are multiples of stride.\n\n        Parameters:\n        - image: Input image\n        - new_shape: Target size (integer or (height, width) tuple)\n        - stride: Padding alignment stride, default 32\n\n        Returns:\n        - Processed image\n        \"\"\"\n        if isinstance(new_shape, int):\n            new_shape = (new_shape, new_shape)\n\n        h, w = image.shape[:2]\n        new_h, new_w = new_shape\n\n        # Calculate scaling ratio\n        r = min(new_h / h, new_w / w)\n        resized_h, resized_w = int(round(h * r)), int(round(w * r))\n\n        # Resize image\n        image = cv2.resize(\n            image, (resized_w, resized_h), interpolation=cv2.INTER_LINEAR\n        )\n\n        # Calculate padding size\n        pad_h = new_h - resized_h\n        pad_w = new_w - resized_w\n        top, bottom = pad_h // 2, pad_h - pad_h // 2\n        left, right = pad_w // 2, pad_w - pad_w // 2\n\n        # Add padding\n        image = cv2.copyMakeBorder(\n            image, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114, 114, 114)\n        )\n\n        return image\n\n    def scale_boxes(self, img1_shape, boxes, img0_shape):\n        \"\"\"\n        Rescales bounding boxes (in the format 
of xyxy by default) from the shape of the image they were originally\n        specified in (img1_shape) to the shape of a different image (img0_shape).\n\n        Args:\n            img1_shape (tuple): The shape of the image that the bounding boxes are for,\n                in the format of (height, width).\n            boxes (torch.Tensor): the bounding boxes of the objects in the image, in the format of (x1, y1, x2, y2)\n            img0_shape (tuple): the shape of the target image, in the format of (height, width).\n\n        Returns:\n            boxes (torch.Tensor): The scaled bounding boxes, in the format of (x1, y1, x2, y2)\n        \"\"\"\n\n        # Calculate scaling ratio\n        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])\n\n        # Calculate padding size\n        pad_x = round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1)\n        pad_y = round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)\n\n        # Remove padding and scale boxes\n        boxes = (boxes - [pad_x, pad_y, pad_x, pad_y]) / gain\n        return boxes\n\n    def predict_image(\n        self,\n        image,\n        host: str = None,\n        result_container: ResultContainer | None = None,\n        imgsz: int = 1024,\n    ) -> ResultContainer:\n        \"\"\"Predict the layout of document pages using RPC service.\"\"\"\n        if result_container is None:\n            result_container = ResultContainer()\n        target_imgsz = (800, 800)\n        orig_h, orig_w = image.shape[:2]\n        if image.shape[0] != target_imgsz[0] or image.shape[1] != target_imgsz[1]:\n            image = self.resize_and_pad_image(image, new_shape=target_imgsz)\n        preds = predict_layout([image], host=self.host, imgsz=800)\n\n        if len(preds) > 0:\n            for pred in preds:\n                boxes = [\n                    YoloBox(\n                        None,\n                        self.scale_boxes(\n                            (800, 800), 
np.array(x[\"xyxy\"]), (orig_h, orig_w)\n                        ),\n                        np.array(x[\"conf\"]),\n                        x[\"cls\"],\n                    )\n                    for x in pred[\"boxes\"]\n                ]\n                result_container.result = YoloResult(\n                    boxes=boxes,\n                    names={int(k): v for k, v in pred[\"names\"].items()},\n                )\n        return result_container.result\n\n    def predict(self, image, imgsz=1024, **kwargs) -> list[YoloResult]:\n        \"\"\"Predict the layout of document pages using RPC service.\"\"\"\n        # Handle single image input\n        if isinstance(image, np.ndarray) and len(image.shape) == 3:\n            image = [image]\n\n        result_containers = [ResultContainer() for _ in image]\n        predict_thread = ThreadPoolExecutor(max_workers=len(image))\n        for img, result_container in zip(image, result_containers, strict=True):\n            predict_thread.submit(\n                self.predict_image, img, self.host, result_container, 800\n            )\n        predict_thread.shutdown(wait=True)\n        result = [result_container.result for result_container in result_containers]\n        return result\n\n    def predict_page(\n        self, page, mupdf_doc: pymupdf.Document, translate_config, save_debug_image\n    ):\n        translate_config.raise_if_cancelled()\n        with self.lock:\n            # pix = mupdf_doc[page.page_number].get_pixmap(dpi=72)\n            pix = get_no_rotation_img(mupdf_doc[page.page_number])\n        image = np.frombuffer(pix.samples, np.uint8).reshape(\n            pix.height,\n            pix.width,\n            3,\n        )[:, :, ::-1]\n        predict_result = self.predict_image(image, self.host, None, 800)\n        save_debug_image(image, predict_result, page.page_number + 1)\n        return page, predict_result\n\n    def handle_document(\n        self,\n        pages: 
list[babeldoc.format.pdf.document_il.il_version_1.Page],\n        mupdf_doc: pymupdf.Document,\n        translate_config,\n        save_debug_image,\n    ):\n        with ThreadPoolExecutor(max_workers=16) as executor:\n            yield from executor.map(\n                self.predict_page,\n                pages,\n                (mupdf_doc for _ in range(len(pages))),\n                (translate_config for _ in range(len(pages))),\n                (save_debug_image for _ in range(len(pages))),\n            )\n\n    @staticmethod\n    def from_host(host: str) -> \"RpcDocLayoutModel\":\n        \"\"\"Create RpcDocLayoutModel from host address.\"\"\"\n        return RpcDocLayoutModel(host=host)\n\n\nif __name__ == \"__main__\":\n    logging.basicConfig(level=logging.DEBUG)\n    # Test the service\n    try:\n        # Use a default test image if example/1.png doesn't exist\n        image_path = \"example/1.png\"\n        if not Path(image_path).exists():\n            print(f\"Warning: {image_path} not found.\")\n            print(\"Please provide the path to a test image:\")\n            image_path = input(\"> \")\n\n        logger.info(f\"Processing image: {image_path}\")\n        result = predict_layout(image_path)\n        print(\"Prediction results:\")\n        print(result)\n    except Exception as e:\n        print(f\"Error: {e!s}\")\n"
  },
  {
    "path": "babeldoc/docvision/rpc_doclayout2.py",
    "content": "import logging\nimport threading\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\n\nimport cv2\nimport httpx\nimport msgpack\nimport numpy as np\nimport pymupdf\nfrom tenacity import retry\nfrom tenacity import retry_if_exception_type\nfrom tenacity import stop_after_attempt\nfrom tenacity import wait_exponential\n\nimport babeldoc\nfrom babeldoc.docvision.base_doclayout import DocLayoutModel\nfrom babeldoc.docvision.base_doclayout import YoloBox\nfrom babeldoc.docvision.base_doclayout import YoloResult\nfrom babeldoc.format.pdf.document_il.utils.mupdf_helper import get_no_rotation_img\n\nlogger = logging.getLogger(__name__)\nDPI = 150\n\n\ndef encode_image(image) -> bytes:\n    \"\"\"Read and encode image to bytes\n\n    Args:\n        image: Can be either a file path (str) or numpy array\n    \"\"\"\n    if isinstance(image, str):\n        if not Path(image).exists():\n            raise FileNotFoundError(f\"Image file not found: {image}\")\n        img = cv2.imread(image)\n        if img is None:\n            raise ValueError(f\"Failed to read image: {image}\")\n    else:\n        img = image\n\n    img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)\n    # logger.debug(f\"Image shape: {img.shape}\")\n    encoded = cv2.imencode(\".jpg\", img)[1].tobytes()\n    # logger.debug(f\"Encoded image size: {len(encoded)} bytes\")\n    return encoded\n\n\n@retry(\n    stop=stop_after_attempt(3),  # 最多重试 3 次\n    wait=wait_exponential(\n        multiplier=1, min=1, max=10\n    ),  # 指数退避策略，初始 1 秒，最大 10 秒\n    retry=retry_if_exception_type((httpx.HTTPError, Exception)),  # 针对哪些异常重试\n    before_sleep=lambda retry_state: logger.warning(\n        f\"Request failed, retrying in {getattr(retry_state.next_action, 'sleep', 'unknown')} seconds... 
\"\n        f\"(Attempt {retry_state.attempt_number}/3)\"\n    ),\n)\ndef predict_layout(\n    image,\n    host: str = \"http://localhost:8000\",\n    _imgsz: int = 1024,\n):\n    \"\"\"\n    Predict document layout using the MOSEC service\n\n    Args:\n        image: Can be either a file path (str) or numpy array\n        host: Service host URL\n        imgsz: Image size for model input\n\n    Returns:\n        List of predictions containing bounding boxes and classes\n    \"\"\"\n    # Prepare request data\n\n    if not isinstance(image, list):\n        image = [image]\n    image_data = [encode_image(image) for image in image]\n    data = {\n        \"image\": image_data,\n    }\n\n    # Pack data using msgpack\n    packed_data = msgpack.packb(data, use_bin_type=True)\n    # logger.debug(f\"Packed data size: {len(packed_data)} bytes\")\n\n    # Send request\n    # logger.debug(f\"Sending request to {host}/inference\")\n    response = httpx.post(\n        # f\"{host}/analyze?min_sim=0.7&early_stop=0.99&timeout=480\",\n        f\"{host}/inference\",\n        data=packed_data,\n        headers={\n            \"Content-Type\": \"application/msgpack\",\n            \"Accept\": \"application/msgpack\",\n        },\n        timeout=480,\n        follow_redirects=True,\n    )\n\n    # logger.debug(f\"Response status: {response.status_code}\")\n    # logger.debug(f\"Response headers: {response.headers}\")\n    idx = 0\n    id_lookup = {}\n    if response.status_code == 200:\n        try:\n            result = msgpack.unpackb(response.content, raw=False)\n            useful_result = []\n            if isinstance(result, dict):\n                names = {}\n                for box in result[\"boxes\"]:\n                    if box[\"score\"] < 0.7:\n                        continue\n\n                    box[\"xyxy\"] = box[\"coordinate\"]\n                    box[\"conf\"] = box[\"score\"]\n                    if box[\"label\"] not in names:\n                        idx += 
1\n                        names[idx] = box[\"label\"]\n                        box[\"cls_id\"] = idx\n                        id_lookup[box[\"label\"]] = idx\n                    else:\n                        box[\"cls_id\"] = id_lookup[box[\"label\"]]\n                    names[box[\"cls_id\"]] = box[\"label\"]\n                    box[\"cls\"] = box[\"cls_id\"]\n                    useful_result.append(box)\n                if \"names\" not in result:\n                    result[\"names\"] = names\n                result[\"boxes\"] = useful_result\n                result = [result]\n            return result\n        except Exception as e:\n            logger.exception(f\"Failed to unpack response: {e!s}\")\n            raise\n    else:\n        logger.error(f\"Request failed with status {response.status_code}\")\n        logger.error(f\"Response content: {response.content}\")\n        raise Exception(\n            f\"Request failed with status {response.status_code}: {response.text}\",\n        )\n\n\nclass ResultContainer:\n    def __init__(self):\n        self.result = YoloResult(boxes_data=np.array([]), names=[])\n\n\nclass RpcDocLayoutModel(DocLayoutModel):\n    \"\"\"DocLayoutModel implementation that uses RPC service.\"\"\"\n\n    def __init__(self, host: str = \"http://localhost:8000\"):\n        \"\"\"Initialize RPC model with host address.\"\"\"\n        self.host = host\n        self._stride = 32  # Default stride value\n        self._names = [\"text\", \"title\", \"list\", \"table\", \"figure\"]\n        self.lock = threading.Lock()\n\n    @property\n    def stride(self) -> int:\n        \"\"\"Stride of the model input.\"\"\"\n        return self._stride\n\n    def resize_and_pad_image(self, image, new_shape):\n        \"\"\"\n        Resize and pad the image to the specified size,\n        ensuring dimensions are multiples of stride.\n\n        Parameters:\n        - image: Input image\n        - new_shape: Target size (integer or (height, width) 
tuple)\n        - stride: Padding alignment stride, default 32\n\n        Returns:\n        - Processed image\n        \"\"\"\n        if isinstance(new_shape, int):\n            new_shape = (new_shape, new_shape)\n\n        h, w = image.shape[:2]\n        new_h, new_w = new_shape\n\n        # Calculate scaling ratio\n        r = min(new_h / h, new_w / w)\n        resized_h, resized_w = int(round(h * r)), int(round(w * r))\n\n        # Resize image\n        image = cv2.resize(\n            image, (resized_w, resized_h), interpolation=cv2.INTER_LINEAR\n        )\n\n        # Calculate padding size\n        pad_h = new_h - resized_h\n        pad_w = new_w - resized_w\n        top, bottom = pad_h // 2, pad_h - pad_h // 2\n        left, right = pad_w // 2, pad_w - pad_w // 2\n\n        # Add padding\n        image = cv2.copyMakeBorder(\n            image, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114, 114, 114)\n        )\n\n        return image\n\n    def scale_boxes(self, img1_shape, boxes, img0_shape):\n        \"\"\"\n        Rescales bounding boxes (in the format of xyxy by default) from the shape of the image they were originally\n        specified in (img1_shape) to the shape of a different image (img0_shape).\n\n        Args:\n            img1_shape (tuple): The shape of the image that the bounding boxes are for,\n                in the format of (height, width).\n            boxes (torch.Tensor): the bounding boxes of the objects in the image, in the format of (x1, y1, x2, y2)\n            img0_shape (tuple): the shape of the target image, in the format of (height, width).\n\n        Returns:\n            boxes (torch.Tensor): The scaled bounding boxes, in the format of (x1, y1, x2, y2)\n        \"\"\"\n\n        # Calculate scaling ratio\n        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])\n\n        # Calculate padding size\n        pad_x = round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1)\n        pad_y = 
round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)\n\n        # Remove padding and scale boxes\n        boxes = (boxes - [pad_x, pad_y, pad_x, pad_y]) / gain\n        return boxes\n\n    def predict_image(\n        self,\n        image,\n        host: str | None = None,\n        result_container: ResultContainer | None = None,\n        imgsz: int = 1024,\n    ) -> ResultContainer:\n        \"\"\"Predict the layout of document pages using RPC service.\"\"\"\n        if result_container is None:\n            result_container = ResultContainer()\n        target_imgsz = (800, 800)\n        orig_h, orig_w = image.shape[:2]\n        target_imgsz = (orig_h, orig_w)\n        if image.shape[0] != target_imgsz[0] or image.shape[1] != target_imgsz[1]:\n            image = self.resize_and_pad_image(image, new_shape=target_imgsz)\n        preds = predict_layout(image, host=self.host)\n        orig_h, orig_w = orig_h / DPI * 72, orig_w / DPI * 72\n        if len(preds) > 0:\n            for pred in preds:\n                boxes = [\n                    YoloBox(\n                        None,\n                        self.scale_boxes(\n                            target_imgsz, np.array(x[\"xyxy\"]), (orig_h, orig_w)\n                        ),\n                        np.array(x[\"conf\"]),\n                        x[\"cls\"],\n                    )\n                    for x in pred[\"boxes\"]\n                ]\n                result_container.result = YoloResult(\n                    boxes=boxes,\n                    names={int(k): v for k, v in pred[\"names\"].items()},\n                )\n        return result_container.result\n\n    def predict(self, image, imgsz=1024, **kwargs) -> list[YoloResult]:\n        \"\"\"Predict the layout of document pages using RPC service.\"\"\"\n        # Handle single image input\n        if isinstance(image, np.ndarray) and len(image.shape) == 3:\n            image = [image]\n\n        result_containers = [ResultContainer() for _ in 
image]\n        predict_thread = ThreadPoolExecutor(max_workers=len(image))\n        for img, result_container in zip(image, result_containers, strict=True):\n            predict_thread.submit(\n                self.predict_image, img, self.host, result_container, 800\n            )\n        predict_thread.shutdown(wait=True)\n        result = [result_container.result for result_container in result_containers]\n        return result\n\n    def predict_page(\n        self, page, mupdf_doc: pymupdf.Document, translate_config, save_debug_image\n    ):\n        translate_config.raise_if_cancelled()\n        with self.lock:\n            # pix = mupdf_doc[page.page_number].get_pixmap(dpi=72)\n            pix = get_no_rotation_img(mupdf_doc[page.page_number], dpi=DPI)\n        image = np.frombuffer(pix.samples, np.uint8).reshape(\n            pix.height,\n            pix.width,\n            3,\n        )[:, :, ::-1]\n        predict_result = self.predict_image(image, self.host, None, 800)\n        save_debug_image(image, predict_result, page.page_number + 1)\n        return page, predict_result\n\n    def handle_document(\n        self,\n        pages: list[babeldoc.format.pdf.document_il.il_version_1.Page],\n        mupdf_doc: pymupdf.Document,\n        translate_config,\n        save_debug_image,\n    ):\n        with ThreadPoolExecutor(max_workers=16) as executor:\n            yield from executor.map(\n                self.predict_page,\n                pages,\n                (mupdf_doc for _ in range(len(pages))),\n                (translate_config for _ in range(len(pages))),\n                (save_debug_image for _ in range(len(pages))),\n            )\n\n    @staticmethod\n    def from_host(host: str) -> \"RpcDocLayoutModel\":\n        \"\"\"Create RpcDocLayoutModel from host address.\"\"\"\n        return RpcDocLayoutModel(host=host)\n\n\nif __name__ == \"__main__\":\n    logging.basicConfig(level=logging.DEBUG)\n    # Test the service\n    try:\n        # Use a 
default test image if example/1.png doesn't exist\n        image_path = \"example/1.png\"\n        if not Path(image_path).exists():\n            print(f\"Warning: {image_path} not found.\")\n            print(\"Please provide the path to a test image:\")\n            image_path = input(\"> \")\n\n        logger.info(f\"Processing image: {image_path}\")\n        result = predict_layout(image_path)\n        print(\"Prediction results:\")\n        print(result)\n    except Exception as e:\n        print(f\"Error: {e!s}\")\n"
  },
  {
    "path": "babeldoc/docvision/rpc_doclayout3.py",
    "content": "import json\nimport logging\nimport threading\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\n\nimport cv2\nimport httpx\nimport numpy as np\nimport pymupdf\nfrom tenacity import retry\nfrom tenacity import retry_if_exception_type\nfrom tenacity import stop_after_attempt\nfrom tenacity import wait_exponential\n\nimport babeldoc\nfrom babeldoc.docvision.base_doclayout import DocLayoutModel\nfrom babeldoc.docvision.base_doclayout import YoloBox\nfrom babeldoc.docvision.base_doclayout import YoloResult\nfrom babeldoc.format.pdf.document_il.utils.mupdf_helper import get_no_rotation_img\n\nlogger = logging.getLogger(__name__)\nDPI = 150\n\n\ndef encode_image(image) -> bytes:\n    \"\"\"Read and encode image to bytes\n\n    Args:\n        image: Can be either a file path (str) or numpy array\n    \"\"\"\n    if isinstance(image, str):\n        if not Path(image).exists():\n            raise FileNotFoundError(f\"Image file not found: {image}\")\n        img = cv2.imread(image)\n        if img is None:\n            raise ValueError(f\"Failed to read image: {image}\")\n    else:\n        img = image\n\n    img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)\n    # logger.debug(f\"Image shape: {img.shape}\")\n    encoded = cv2.imencode(\".jpg\", img)[1].tobytes()\n    # logger.debug(f\"Encoded image size: {len(encoded)} bytes\")\n    return encoded\n\n\n@retry(\n    stop=stop_after_attempt(3),  # 最多重试 3 次\n    wait=wait_exponential(\n        multiplier=1, min=1, max=10\n    ),  # 指数退避策略，初始 1 秒，最大 10 秒\n    retry=retry_if_exception_type((httpx.HTTPError, Exception)),  # 针对哪些异常重试\n    before_sleep=lambda retry_state: logger.warning(\n        f\"Request failed, retrying in {getattr(retry_state.next_action, 'sleep', 'unknown')} seconds... 
\"\n        f\"(Attempt {retry_state.attempt_number}/3)\"\n    ),\n)\ndef predict_layout(\n    image,\n    host: str = \"http://localhost:8000\",\n    _imgsz: int = 1024,\n):\n    \"\"\"\n    Predict document layout using the MOSEC service\n\n    Args:\n        image: Can be either a file path (str) or numpy array\n        host: Service host URL\n        imgsz: Image size for model input\n\n    Returns:\n        List of predictions containing bounding boxes and classes\n    \"\"\"\n    # Prepare request data\n\n    image_data = encode_image(image)\n\n    # Pack data using msgpack\n    # packed_data = msgpack.packb(data, use_bin_type=True)\n    # logger.debug(f\"Packed data size: {len(packed_data)} bytes\")\n\n    # Send request\n    # logger.debug(f\"Sending request to {host}/inference\")\n    response = httpx.post(\n        f\"{host}/analyze?min_sim=0.7&early_stop=0.99&timeout=1800\",\n        files={\"file\": (\"image.jpg\", image_data, \"image/jpeg\")},\n        headers={\n            \"Accept\": \"application/json\",\n        },\n        timeout=1800,\n        follow_redirects=True,\n    )\n\n    # logger.debug(f\"Response status: {response.status_code}\")\n    # logger.debug(f\"Response headers: {response.headers}\")\n    idx = 0\n    id_lookup = {}\n    if response.status_code == 200:\n        try:\n            result = json.loads(response.text)\n            useful_result = []\n            if isinstance(result, dict):\n                names = {}\n                for box in result[\"boxes\"]:\n                    if box[\"ocr_match_score\"] < 0.7:\n                        continue\n\n                    box[\"xyxy\"] = box[\"coords\"]\n                    box[\"conf\"] = box[\"ocr_match_score\"]\n                    if box[\"label\"] not in names:\n                        idx += 1\n                        names[idx] = box[\"label\"]\n                        box[\"cls_id\"] = idx\n                        id_lookup[box[\"label\"]] = idx\n                    
else:\n                        box[\"cls_id\"] = id_lookup[box[\"label\"]]\n                    names[box[\"cls_id\"]] = box[\"label\"]\n                    box[\"cls\"] = box[\"cls_id\"]\n                    useful_result.append(box)\n                if \"names\" not in result:\n                    result[\"names\"] = names\n                result[\"boxes\"] = useful_result\n                result = [result]\n            return result\n        except Exception as e:\n            logger.exception(f\"Failed to unpack response: {e!s}\")\n            raise\n    else:\n        logger.error(f\"Request failed with status {response.status_code}\")\n        logger.error(f\"Response content: {response.content}\")\n        raise Exception(\n            f\"Request failed with status {response.status_code}: {response.text}\",\n        )\n\n\nclass ResultContainer:\n    def __init__(self):\n        self.result = YoloResult(boxes_data=np.array([]), names=[])\n\n\nclass RpcDocLayoutModel(DocLayoutModel):\n    \"\"\"DocLayoutModel implementation that uses RPC service.\"\"\"\n\n    def __init__(self, host: str = \"http://localhost:8000\"):\n        \"\"\"Initialize RPC model with host address.\"\"\"\n        self.host = host\n        self._stride = 32  # Default stride value\n        self._names = [\"text\", \"title\", \"list\", \"table\", \"figure\"]\n        self.lock = threading.Lock()\n\n    @property\n    def stride(self) -> int:\n        \"\"\"Stride of the model input.\"\"\"\n        return self._stride\n\n    def resize_and_pad_image(self, image, new_shape):\n        \"\"\"\n        Resize and pad the image to the specified size,\n        ensuring dimensions are multiples of stride.\n\n        Parameters:\n        - image: Input image\n        - new_shape: Target size (integer or (height, width) tuple)\n        - stride: Padding alignment stride, default 32\n\n        Returns:\n        - Processed image\n        \"\"\"\n        if isinstance(new_shape, int):\n            
new_shape = (new_shape, new_shape)\n\n        h, w = image.shape[:2]\n        new_h, new_w = new_shape\n\n        # Calculate scaling ratio\n        r = min(new_h / h, new_w / w)\n        resized_h, resized_w = int(round(h * r)), int(round(w * r))\n\n        # Resize image\n        image = cv2.resize(\n            image, (resized_w, resized_h), interpolation=cv2.INTER_LINEAR\n        )\n\n        # Calculate padding size\n        pad_h = new_h - resized_h\n        pad_w = new_w - resized_w\n        top, bottom = pad_h // 2, pad_h - pad_h // 2\n        left, right = pad_w // 2, pad_w - pad_w // 2\n\n        # Add padding\n        image = cv2.copyMakeBorder(\n            image, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114, 114, 114)\n        )\n\n        return image\n\n    def scale_boxes(self, img1_shape, boxes, img0_shape):\n        \"\"\"\n        Rescales bounding boxes (in the format of xyxy by default) from the shape of the image they were originally\n        specified in (img1_shape) to the shape of a different image (img0_shape).\n\n        Args:\n            img1_shape (tuple): The shape of the image that the bounding boxes are for,\n                in the format of (height, width).\n            boxes (torch.Tensor): the bounding boxes of the objects in the image, in the format of (x1, y1, x2, y2)\n            img0_shape (tuple): the shape of the target image, in the format of (height, width).\n\n        Returns:\n            boxes (torch.Tensor): The scaled bounding boxes, in the format of (x1, y1, x2, y2)\n        \"\"\"\n\n        # Calculate scaling ratio\n        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])\n\n        # Calculate padding size\n        pad_x = round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1)\n        pad_y = round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)\n\n        # Remove padding and scale boxes\n        boxes = (boxes - [pad_x, pad_y, pad_x, pad_y]) / gain\n        return 
boxes\n\n    def predict_image(\n        self,\n        image,\n        host: str | None = None,\n        result_container: ResultContainer | None = None,\n        imgsz: int = 1024,\n    ) -> ResultContainer:\n        \"\"\"Predict the layout of document pages using RPC service.\"\"\"\n        if result_container is None:\n            result_container = ResultContainer()\n        target_imgsz = (800, 800)\n        orig_h, orig_w = image.shape[:2]\n        target_imgsz = (orig_h, orig_w)\n        if image.shape[0] != target_imgsz[0] or image.shape[1] != target_imgsz[1]:\n            image = self.resize_and_pad_image(image, new_shape=target_imgsz)\n        preds = predict_layout(image, host=self.host)\n        orig_h, orig_w = orig_h / DPI * 72, orig_w / DPI * 72\n        if len(preds) > 0:\n            for pred in preds:\n                boxes = [\n                    YoloBox(\n                        None,\n                        self.scale_boxes(\n                            target_imgsz, np.array(x[\"xyxy\"]), (orig_h, orig_w)\n                        ),\n                        np.array(x[\"conf\"]),\n                        x[\"cls\"],\n                    )\n                    for x in pred[\"boxes\"]\n                ]\n                result_container.result = YoloResult(\n                    boxes=boxes,\n                    names={int(k): v for k, v in pred[\"names\"].items()},\n                )\n        return result_container.result\n\n    def predict(self, image, imgsz=1024, **kwargs) -> list[YoloResult]:\n        \"\"\"Predict the layout of document pages using RPC service.\"\"\"\n        # Handle single image input\n        if isinstance(image, np.ndarray) and len(image.shape) == 3:\n            image = [image]\n\n        result_containers = [ResultContainer() for _ in image]\n        predict_thread = ThreadPoolExecutor(max_workers=len(image))\n        for img, result_container in zip(image, result_containers, strict=True):\n            
predict_thread.submit(\n                self.predict_image, img, self.host, result_container, 800\n            )\n        predict_thread.shutdown(wait=True)\n        result = [result_container.result for result_container in result_containers]\n        return result\n\n    def predict_page(\n        self, page, mupdf_doc: pymupdf.Document, translate_config, save_debug_image\n    ):\n        translate_config.raise_if_cancelled()\n        with self.lock:\n            # pix = mupdf_doc[page.page_number].get_pixmap(dpi=72)\n            pix = get_no_rotation_img(mupdf_doc[page.page_number], dpi=DPI)\n        image = np.frombuffer(pix.samples, np.uint8).reshape(\n            pix.height,\n            pix.width,\n            3,\n        )[:, :, ::-1]\n        predict_result = self.predict_image(image, self.host, None, 800)\n        save_debug_image(image, predict_result, page.page_number + 1)\n        return page, predict_result\n\n    def handle_document(\n        self,\n        pages: list[babeldoc.format.pdf.document_il.il_version_1.Page],\n        mupdf_doc: pymupdf.Document,\n        translate_config,\n        save_debug_image,\n    ):\n        with ThreadPoolExecutor(max_workers=4) as executor:\n            yield from executor.map(\n                self.predict_page,\n                pages,\n                (mupdf_doc for _ in range(len(pages))),\n                (translate_config for _ in range(len(pages))),\n                (save_debug_image for _ in range(len(pages))),\n            )\n\n    @staticmethod\n    def from_host(host: str) -> \"RpcDocLayoutModel\":\n        \"\"\"Create RpcDocLayoutModel from host address.\"\"\"\n        return RpcDocLayoutModel(host=host)\n\n\nif __name__ == \"__main__\":\n    logging.basicConfig(level=logging.DEBUG)\n    # Test the service\n    try:\n        # Use a default test image if example/1.png doesn't exist\n        image_path = \"example/1.png\"\n        if not Path(image_path).exists():\n            print(f\"Warning: 
{image_path} not found.\")\n            print(\"Please provide the path to a test image:\")\n            image_path = input(\"> \")\n\n        logger.info(f\"Processing image: {image_path}\")\n        result = predict_layout(image_path)\n        print(\"Prediction results:\")\n        print(result)\n    except Exception as e:\n        print(f\"Error: {e!s}\")\n"
  },
  {
    "path": "babeldoc/docvision/rpc_doclayout4.py",
    "content": "import logging\nimport threading\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\n\nimport cv2\nimport httpx\nimport msgpack\nimport numpy as np\nimport pymupdf\nfrom tenacity import retry\nfrom tenacity import retry_if_exception_type\nfrom tenacity import stop_after_attempt\nfrom tenacity import wait_exponential\n\nimport babeldoc\nfrom babeldoc.docvision.base_doclayout import DocLayoutModel\nfrom babeldoc.docvision.base_doclayout import YoloBox\nfrom babeldoc.docvision.base_doclayout import YoloResult\nfrom babeldoc.format.pdf.document_il.utils.mupdf_helper import get_no_rotation_img\n\nlogger = logging.getLogger(__name__)\nDPI = 150\n\n\ndef encode_image(image) -> bytes:\n    \"\"\"Read and encode image to bytes\n\n    Args:\n        image: Can be either a file path (str) or numpy array\n    \"\"\"\n    if isinstance(image, str):\n        if not Path(image).exists():\n            raise FileNotFoundError(f\"Image file not found: {image}\")\n        img = cv2.imread(image)\n        if img is None:\n            raise ValueError(f\"Failed to read image: {image}\")\n    else:\n        img = image\n\n    img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)\n    # logger.debug(f\"Image shape: {img.shape}\")\n    encoded = cv2.imencode(\".jpg\", img)[1].tobytes()\n    # logger.debug(f\"Encoded image size: {len(encoded)} bytes\")\n    return encoded\n\n\n@retry(\n    stop=stop_after_attempt(3),  # 最多重试 3 次\n    wait=wait_exponential(\n        multiplier=1, min=1, max=10\n    ),  # 指数退避策略，初始 1 秒，最大 10 秒\n    retry=retry_if_exception_type((httpx.HTTPError, Exception)),  # 针对哪些异常重试\n    before_sleep=lambda retry_state: logger.warning(\n        f\"Request failed, retrying in {getattr(retry_state.next_action, 'sleep', 'unknown')} seconds... 
\"\n        f\"(Attempt {retry_state.attempt_number}/3)\"\n    ),\n)\ndef predict_layout(\n    image,\n    host: str = \"http://localhost:8000\",\n    _imgsz: int = 1024,\n):\n    \"\"\"\n    Predict document layout using the MOSEC service\n\n    Args:\n        image: Can be either a file path (str) or numpy array\n        host: Service host URL\n        imgsz: Image size for model input\n\n    Returns:\n        List of predictions containing bounding boxes and classes\n    \"\"\"\n    # Prepare request data\n\n    if not isinstance(image, list):\n        image = [image]\n    image_data = [encode_image(image) for image in image]\n    data = {\n        \"image\": image_data,\n    }\n\n    # Pack data using msgpack\n    packed_data = msgpack.packb(data, use_bin_type=True)\n    # logger.debug(f\"Packed data size: {len(packed_data)} bytes\")\n\n    # Send request\n    # logger.debug(f\"Sending request to {host}/inference\")\n    response = httpx.post(\n        # f\"{host}/analyze?min_sim=0.7&early_stop=0.99&timeout=480\",\n        f\"{host}/inference\",\n        data=packed_data,\n        headers={\n            \"Content-Type\": \"application/msgpack\",\n            \"Accept\": \"application/msgpack\",\n        },\n        timeout=480,\n        follow_redirects=True,\n    )\n\n    # logger.debug(f\"Response status: {response.status_code}\")\n    # logger.debug(f\"Response headers: {response.headers}\")\n    idx = 0\n    id_lookup = {}\n    if response.status_code == 200:\n        try:\n            result = msgpack.unpackb(response.content, raw=False)\n            useful_result = []\n            if isinstance(result, dict):\n                names = {}\n                for box in result[\"boxes\"]:\n                    if box[\"score\"] < 0.7:\n                        continue\n\n                    box[\"xyxy\"] = box[\"coordinate\"]\n                    box[\"conf\"] = box[\"score\"]\n                    if box[\"label\"] not in names:\n                        idx += 
1\n                        names[idx] = box[\"label\"]\n                        box[\"cls_id\"] = idx\n                        id_lookup[box[\"label\"]] = idx\n                    else:\n                        box[\"cls_id\"] = id_lookup[box[\"label\"]]\n                    names[box[\"cls_id\"]] = box[\"label\"]\n                    box[\"cls\"] = box[\"cls_id\"]\n                    useful_result.append(box)\n                if \"names\" not in result:\n                    result[\"names\"] = names\n                result[\"boxes\"] = useful_result\n                result = [result]\n            return result\n        except Exception as e:\n            logger.exception(f\"Failed to unpack response: {e!s}\")\n            raise\n    else:\n        logger.error(f\"Request failed with status {response.status_code}\")\n        logger.error(f\"Response content: {response.content}\")\n        raise Exception(\n            f\"Request failed with status {response.status_code}: {response.text}\",\n        )\n\n\nclass ResultContainer:\n    def __init__(self):\n        self.result = YoloResult(boxes_data=np.array([]), names=[])\n\n\nclass RpcDocLayoutModel(DocLayoutModel):\n    \"\"\"DocLayoutModel implementation that uses RPC service.\"\"\"\n\n    def __init__(self, host: str = \"http://localhost:8000\"):\n        \"\"\"Initialize RPC model with host address.\"\"\"\n        self.host = host\n        self._stride = 32  # Default stride value\n        self._names = [\"text\", \"title\", \"list\", \"table\", \"figure\"]\n        self.lock = threading.Lock()\n\n    @property\n    def stride(self) -> int:\n        \"\"\"Stride of the model input.\"\"\"\n        return self._stride\n\n    def resize_and_pad_image(self, image, new_shape):\n        \"\"\"\n        Resize and pad the image to the specified size,\n        ensuring dimensions are multiples of stride.\n\n        Parameters:\n        - image: Input image\n        - new_shape: Target size (integer or (height, width) 
tuple)\n        - stride: Padding alignment stride, default 32\n\n        Returns:\n        - Processed image\n        \"\"\"\n        if isinstance(new_shape, int):\n            new_shape = (new_shape, new_shape)\n\n        h, w = image.shape[:2]\n        new_h, new_w = new_shape\n\n        # Calculate scaling ratio\n        r = min(new_h / h, new_w / w)\n        resized_h, resized_w = int(round(h * r)), int(round(w * r))\n\n        # Resize image\n        image = cv2.resize(\n            image, (resized_w, resized_h), interpolation=cv2.INTER_LINEAR\n        )\n\n        # Calculate padding size\n        pad_h = new_h - resized_h\n        pad_w = new_w - resized_w\n        top, bottom = pad_h // 2, pad_h - pad_h // 2\n        left, right = pad_w // 2, pad_w - pad_w // 2\n\n        # Add padding\n        image = cv2.copyMakeBorder(\n            image, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114, 114, 114)\n        )\n\n        return image\n\n    def scale_boxes(self, img1_shape, boxes, img0_shape):\n        \"\"\"\n        Rescales bounding boxes (in the format of xyxy by default) from the shape of the image they were originally\n        specified in (img1_shape) to the shape of a different image (img0_shape).\n\n        Args:\n            img1_shape (tuple): The shape of the image that the bounding boxes are for,\n                in the format of (height, width).\n            boxes (torch.Tensor): the bounding boxes of the objects in the image, in the format of (x1, y1, x2, y2)\n            img0_shape (tuple): the shape of the target image, in the format of (height, width).\n\n        Returns:\n            boxes (torch.Tensor): The scaled bounding boxes, in the format of (x1, y1, x2, y2)\n        \"\"\"\n\n        # Calculate scaling ratio\n        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])\n\n        # Calculate padding size\n        pad_x = round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1)\n        pad_y = 
round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)\n\n        # Remove padding and scale boxes\n        boxes = (boxes - [pad_x, pad_y, pad_x, pad_y]) / gain\n        return boxes\n\n    def predict_image(\n        self,\n        image,\n        host: str | None = None,\n        result_container: ResultContainer | None = None,\n        imgsz: int = 1024,\n    ) -> ResultContainer:\n        \"\"\"Predict the layout of document pages using RPC service.\"\"\"\n        if result_container is None:\n            result_container = ResultContainer()\n        target_imgsz = (800, 800)\n        orig_h, orig_w = image.shape[:2]\n        target_imgsz = (orig_h, orig_w)\n        if image.shape[0] != target_imgsz[0] or image.shape[1] != target_imgsz[1]:\n            image = self.resize_and_pad_image(image, new_shape=target_imgsz)\n        preds = predict_layout(image, host=self.host)\n        orig_h, orig_w = orig_h / DPI * 72, orig_w / DPI * 72\n        if len(preds) > 0:\n            for pred in preds:\n                boxes = [\n                    YoloBox(\n                        None,\n                        self.scale_boxes(\n                            target_imgsz, np.array(x[\"xyxy\"]), (orig_h, orig_w)\n                        ),\n                        np.array(x[\"conf\"]),\n                        x[\"cls\"],\n                    )\n                    for x in pred[\"boxes\"]\n                ]\n                result_container.result = YoloResult(\n                    boxes=boxes,\n                    names={int(k): v for k, v in pred[\"names\"].items()},\n                )\n        return result_container.result\n\n    def predict(self, image, imgsz=1024, **kwargs) -> list[YoloResult]:\n        \"\"\"Predict the layout of document pages using RPC service.\"\"\"\n        # Handle single image input\n        if isinstance(image, np.ndarray) and len(image.shape) == 3:\n            image = [image]\n\n        result_containers = [ResultContainer() for _ in 
image]\n        predict_thread = ThreadPoolExecutor(max_workers=len(image))\n        for img, result_container in zip(image, result_containers, strict=True):\n            predict_thread.submit(\n                self.predict_image, img, self.host, result_container, 800\n            )\n        predict_thread.shutdown(wait=True)\n        result = [result_container.result for result_container in result_containers]\n        return result\n\n    def predict_page(\n        self, page, mupdf_doc: pymupdf.Document, translate_config, save_debug_image\n    ):\n        translate_config.raise_if_cancelled()\n        with self.lock:\n            # pix = mupdf_doc[page.page_number].get_pixmap(dpi=72)\n            pix = get_no_rotation_img(mupdf_doc[page.page_number], dpi=DPI)\n        image = np.frombuffer(pix.samples, np.uint8).reshape(\n            pix.height,\n            pix.width,\n            3,\n        )[:, :, ::-1]\n        predict_result = self.predict_image(image, self.host, None, 800)\n        save_debug_image(image, predict_result, page.page_number + 1)\n        return page, predict_result\n\n    def handle_document(\n        self,\n        pages: list[babeldoc.format.pdf.document_il.il_version_1.Page],\n        mupdf_doc: pymupdf.Document,\n        translate_config,\n        save_debug_image,\n    ):\n        with ThreadPoolExecutor(max_workers=1) as executor:\n            yield from executor.map(\n                self.predict_page,\n                pages,\n                (mupdf_doc for _ in range(len(pages))),\n                (translate_config for _ in range(len(pages))),\n                (save_debug_image for _ in range(len(pages))),\n            )\n\n    @staticmethod\n    def from_host(host: str) -> \"RpcDocLayoutModel\":\n        \"\"\"Create RpcDocLayoutModel from host address.\"\"\"\n        return RpcDocLayoutModel(host=host)\n\n\nif __name__ == \"__main__\":\n    logging.basicConfig(level=logging.DEBUG)\n    # Test the service\n    try:\n        # Use a 
default test image if example/1.png doesn't exist\n        image_path = \"example/1.png\"\n        if not Path(image_path).exists():\n            print(f\"Warning: {image_path} not found.\")\n            print(\"Please provide the path to a test image:\")\n            image_path = input(\"> \")\n\n        logger.info(f\"Processing image: {image_path}\")\n        result = predict_layout(image_path)\n        print(\"Prediction results:\")\n        print(result)\n    except Exception as e:\n        print(f\"Error: {e!s}\")\n"
  },
  {
    "path": "babeldoc/docvision/rpc_doclayout5.py",
    "content": "import json\nimport logging\nimport threading\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\n\nimport cv2\nimport httpx\nimport numpy as np\nimport pymupdf\nfrom tenacity import retry\nfrom tenacity import retry_if_exception_type\nfrom tenacity import stop_after_attempt\nfrom tenacity import wait_exponential\n\nimport babeldoc\nfrom babeldoc.docvision.base_doclayout import DocLayoutModel\nfrom babeldoc.docvision.base_doclayout import YoloBox\nfrom babeldoc.docvision.base_doclayout import YoloResult\nfrom babeldoc.format.pdf.document_il.utils.mupdf_helper import get_no_rotation_img\n\nlogger = logging.getLogger(__name__)\nDPI = 150\n\n\ndef encode_image(image) -> bytes:\n    \"\"\"Read and encode image to bytes\n\n    Args:\n        image: Can be either a file path (str) or numpy array\n    \"\"\"\n    if isinstance(image, str):\n        if not Path(image).exists():\n            raise FileNotFoundError(f\"Image file not found: {image}\")\n        img = cv2.imread(image)\n        if img is None:\n            raise ValueError(f\"Failed to read image: {image}\")\n    else:\n        img = image\n\n    img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)\n    # logger.debug(f\"Image shape: {img.shape}\")\n    encoded = cv2.imencode(\".jpg\", img)[1].tobytes()\n    # logger.debug(f\"Encoded image size: {len(encoded)} bytes\")\n    return encoded\n\n\n@retry(\n    stop=stop_after_attempt(3),  # stop after 3 attempts\n    wait=wait_exponential(\n        multiplier=1, min=1, max=10\n    ),  # exponential backoff, starting at 1 second, capped at 10 seconds\n    retry=retry_if_exception_type((httpx.HTTPError, Exception)),  # exception types that trigger a retry\n    before_sleep=lambda retry_state: logger.warning(\n        f\"Request failed, retrying in {getattr(retry_state.next_action, 'sleep', 'unknown')} seconds... 
\"\n        f\"(Attempt {retry_state.attempt_number}/3)\"\n    ),\n)\ndef predict_layout(\n    image,\n    host: str = \"http://localhost:8000\",\n    _imgsz: int = 1024,\n):\n    \"\"\"\n    Predict document layout using the MOSEC service\n\n    Args:\n        image: Can be either a file path (str) or numpy array\n        host: Service host URL\n        imgsz: Image size for model input\n\n    Returns:\n        List of predictions containing bounding boxes and classes\n    \"\"\"\n    # Prepare request data\n\n    image_data = encode_image(image)\n\n    # Pack data using msgpack\n    # packed_data = msgpack.packb(data, use_bin_type=True)\n    # logger.debug(f\"Packed data size: {len(packed_data)} bytes\")\n\n    # Send request\n    # logger.debug(f\"Sending request to {host}/inference\")\n    response = httpx.post(\n        f\"{host}/analyze_hybrid?min_sim=0.7&early_stop=0.99&timeout=1800\",\n        files={\"file\": (\"image.jpg\", image_data, \"image/jpeg\")},\n        headers={\n            \"Accept\": \"application/json\",\n        },\n        timeout=1800,\n        follow_redirects=True,\n    )\n\n    # logger.debug(f\"Response status: {response.status_code}\")\n    # logger.debug(f\"Response headers: {response.headers}\")\n    idx = 0\n    id_lookup = {}\n    if response.status_code == 200:\n        try:\n            result = json.loads(response.text)\n            useful_result = []\n            if isinstance(result, dict):\n                names = {}\n                clusters = result[\"clusters\"]\n                for box in clusters:\n                    box[\"xyxy\"] = box[\"box\"]\n                    box[\"conf\"] = 1\n                    if box[\"label\"] not in names:\n                        idx += 1\n                        names[idx] = box[\"label\"]\n                        box[\"cls_id\"] = idx\n                        id_lookup[box[\"label\"]] = idx\n                    else:\n                        box[\"cls_id\"] = 
id_lookup[box[\"label\"]]\n                    names[box[\"cls_id\"]] = box[\"label\"]\n                    box[\"cls\"] = box[\"cls_id\"]\n                    useful_result.append(box)\n                if \"names\" not in result:\n                    result[\"names\"] = names\n                result[\"boxes\"] = useful_result\n                result = [result]\n            return result\n        except Exception as e:\n            logger.exception(f\"Failed to unpack response: {e!s}\")\n            raise\n    else:\n        logger.error(f\"Request failed with status {response.status_code}\")\n        logger.error(f\"Response content: {response.text}\")\n        raise Exception(\n            f\"Request failed with status {response.status_code}: {response.text}\",\n        )\n\n\nclass ResultContainer:\n    def __init__(self):\n        self.result = YoloResult(boxes_data=np.array([]), names=[])\n\n\nclass RpcDocLayoutModel(DocLayoutModel):\n    \"\"\"DocLayoutModel implementation that uses RPC service.\"\"\"\n\n    def __init__(self, host: str = \"http://localhost:8000\"):\n        \"\"\"Initialize RPC model with host address.\"\"\"\n        self.host = host\n        self._stride = 32  # Default stride value\n        self._names = [\"text\", \"title\", \"list\", \"table\", \"figure\"]\n        self.lock = threading.Lock()\n\n    @property\n    def stride(self) -> int:\n        \"\"\"Stride of the model input.\"\"\"\n        return self._stride\n\n    def resize_and_pad_image(self, image, new_shape):\n        \"\"\"\n        Resize and pad the image to the specified size,\n        ensuring dimensions are multiples of stride.\n\n        Parameters:\n        - image: Input image\n        - new_shape: Target size (integer or (height, width) tuple)\n        - stride: Padding alignment stride, default 32\n\n        Returns:\n        - Processed image\n        \"\"\"\n        if isinstance(new_shape, int):\n            new_shape = (new_shape, new_shape)\n\n        h, w = 
image.shape[:2]\n        new_h, new_w = new_shape\n\n        # Calculate scaling ratio\n        r = min(new_h / h, new_w / w)\n        resized_h, resized_w = int(round(h * r)), int(round(w * r))\n\n        # Resize image\n        image = cv2.resize(\n            image, (resized_w, resized_h), interpolation=cv2.INTER_LINEAR\n        )\n\n        # Calculate padding size\n        pad_h = new_h - resized_h\n        pad_w = new_w - resized_w\n        top, bottom = pad_h // 2, pad_h - pad_h // 2\n        left, right = pad_w // 2, pad_w - pad_w // 2\n\n        # Add padding\n        image = cv2.copyMakeBorder(\n            image, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114, 114, 114)\n        )\n\n        return image\n\n    def scale_boxes(self, img1_shape, boxes, img0_shape):\n        \"\"\"\n        Rescales bounding boxes (in the format of xyxy by default) from the shape of the image they were originally\n        specified in (img1_shape) to the shape of a different image (img0_shape).\n\n        Args:\n            img1_shape (tuple): The shape of the image that the bounding boxes are for,\n                in the format of (height, width).\n            boxes (torch.Tensor): the bounding boxes of the objects in the image, in the format of (x1, y1, x2, y2)\n            img0_shape (tuple): the shape of the target image, in the format of (height, width).\n\n        Returns:\n            boxes (torch.Tensor): The scaled bounding boxes, in the format of (x1, y1, x2, y2)\n        \"\"\"\n\n        # Calculate scaling ratio\n        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])\n\n        # Calculate padding size\n        pad_x = round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1)\n        pad_y = round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)\n\n        # Remove padding and scale boxes\n        boxes = (boxes - [pad_x, pad_y, pad_x, pad_y]) / gain\n        return boxes\n\n    def predict_image(\n        self,\n        
image,\n        host: str | None = None,\n        result_container: ResultContainer | None = None,\n        imgsz: int = 1024,\n    ) -> ResultContainer:\n        \"\"\"Predict the layout of document pages using RPC service.\"\"\"\n        if result_container is None:\n            result_container = ResultContainer()\n        target_imgsz = (800, 800)\n        orig_h, orig_w = image.shape[:2]\n        target_imgsz = (orig_h, orig_w)\n        if image.shape[0] != target_imgsz[0] or image.shape[1] != target_imgsz[1]:\n            image = self.resize_and_pad_image(image, new_shape=target_imgsz)\n        preds = predict_layout(image, host=self.host)\n        orig_h, orig_w = orig_h / DPI * 72, orig_w / DPI * 72\n        if len(preds) > 0:\n            for pred in preds:\n                boxes = [\n                    YoloBox(\n                        None,\n                        self.scale_boxes(\n                            target_imgsz, np.array(x[\"xyxy\"]), (orig_h, orig_w)\n                        ),\n                        np.array(x[\"conf\"]),\n                        x[\"cls\"],\n                    )\n                    for x in pred[\"boxes\"]\n                ]\n                result_container.result = YoloResult(\n                    boxes=boxes,\n                    names={int(k): v for k, v in pred[\"names\"].items()},\n                )\n        return result_container.result\n\n    def predict(self, image, imgsz=1024, **kwargs) -> list[YoloResult]:\n        \"\"\"Predict the layout of document pages using RPC service.\"\"\"\n        # Handle single image input\n        if isinstance(image, np.ndarray) and len(image.shape) == 3:\n            image = [image]\n\n        result_containers = [ResultContainer() for _ in image]\n        predict_thread = ThreadPoolExecutor(max_workers=len(image))\n        for img, result_container in zip(image, result_containers, strict=True):\n            predict_thread.submit(\n                self.predict_image, img, 
self.host, result_container, 800\n            )\n        predict_thread.shutdown(wait=True)\n        result = [result_container.result for result_container in result_containers]\n        return result\n\n    def predict_page(\n        self, page, mupdf_doc: pymupdf.Document, translate_config, save_debug_image\n    ):\n        translate_config.raise_if_cancelled()\n        with self.lock:\n            # pix = mupdf_doc[page.page_number].get_pixmap(dpi=72)\n            pix = get_no_rotation_img(mupdf_doc[page.page_number], dpi=DPI)\n        image = np.frombuffer(pix.samples, np.uint8).reshape(\n            pix.height,\n            pix.width,\n            3,\n        )[:, :, ::-1]\n        predict_result = self.predict_image(image, self.host, None, 800)\n        save_debug_image(image, predict_result, page.page_number + 1)\n        return page, predict_result\n\n    def handle_document(\n        self,\n        pages: list[babeldoc.format.pdf.document_il.il_version_1.Page],\n        mupdf_doc: pymupdf.Document,\n        translate_config,\n        save_debug_image,\n    ):\n        with ThreadPoolExecutor(max_workers=1) as executor:\n            yield from executor.map(\n                self.predict_page,\n                pages,\n                (mupdf_doc for _ in range(len(pages))),\n                (translate_config for _ in range(len(pages))),\n                (save_debug_image for _ in range(len(pages))),\n            )\n\n    @staticmethod\n    def from_host(host: str) -> \"RpcDocLayoutModel\":\n        \"\"\"Create RpcDocLayoutModel from host address.\"\"\"\n        return RpcDocLayoutModel(host=host)\n\n\nif __name__ == \"__main__\":\n    logging.basicConfig(level=logging.DEBUG)\n    # Test the service\n    try:\n        # Use a default test image if example/1.png doesn't exist\n        image_path = \"example/1.png\"\n        if not Path(image_path).exists():\n            print(f\"Warning: {image_path} not found.\")\n            print(\"Please provide the path 
to a test image:\")\n            image_path = input(\"> \")\n\n        logger.info(f\"Processing image: {image_path}\")\n        result = predict_layout(image_path)\n        print(\"Prediction results:\")\n        print(result)\n    except Exception as e:\n        print(f\"Error: {e!s}\")\n"
  },
  {
    "path": "babeldoc/docvision/rpc_doclayout6.py",
    "content": "import base64\nimport json\nimport logging\nimport threading\nimport unicodedata\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\n\nimport cv2\nimport httpx\nimport msgpack\nimport numpy as np\nimport pymupdf\nfrom tenacity import retry\nfrom tenacity import retry_if_exception_type\nfrom tenacity import stop_after_attempt\nfrom tenacity import wait_exponential\n\nimport babeldoc\nfrom babeldoc.docvision.base_doclayout import DocLayoutModel\nfrom babeldoc.docvision.base_doclayout import YoloBox\nfrom babeldoc.docvision.base_doclayout import YoloResult\nfrom babeldoc.format.pdf.document_il.utils.extract_char import (\n    convert_page_to_char_boxes,\n)\nfrom babeldoc.format.pdf.document_il.utils.extract_char import (\n    process_page_chars_to_lines,\n)\nfrom babeldoc.format.pdf.document_il.utils.fontmap import FontMapper\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import SPACE_REGEX\nfrom babeldoc.format.pdf.document_il.utils.mupdf_helper import (\n    get_no_rotation_img_multiprocess,\n)\n\nlogger = logging.getLogger(__name__)\nDPI = 150\n\n\ndef encode_image(image) -> bytes:\n    \"\"\"Read and encode image to bytes\n\n    Args:\n        image: Can be either a file path (str) or numpy array\n    \"\"\"\n    if isinstance(image, str):\n        if not Path(image).exists():\n            raise FileNotFoundError(f\"Image file not found: {image}\")\n        img = cv2.imread(image)\n        if img is None:\n            raise ValueError(f\"Failed to read image: {image}\")\n    else:\n        img = image\n\n    img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)\n    # logger.debug(f\"Image shape: {img.shape}\")\n    encoded = cv2.imencode(\".jpg\", img)[1].tobytes()\n    # logger.debug(f\"Encoded image size: {len(encoded)} bytes\")\n    return encoded\n\n\ndef clip_num(num: float, min_value: float, max_value: float) -> float:\n    \"\"\"Clip a number to a specified range.\"\"\"\n    if num < min_value:\n        return 
min_value\n    elif num > max_value:\n        return max_value\n    return num\n\n\n@retry(\n    stop=stop_after_attempt(5),  # stop after 5 attempts\n    wait=wait_exponential(\n        multiplier=1, min=1, max=10\n    ),  # exponential backoff, starting at 1 second, capped at 10 seconds\n    retry=retry_if_exception_type((httpx.HTTPError, Exception)),  # exception types that trigger a retry\n    before_sleep=lambda retry_state: logger.warning(\n        f\"Request failed VLM, retrying in {getattr(retry_state.next_action, 'sleep', 'unknown')} seconds... \"\n        f\"(Attempt {retry_state.attempt_number}/5)\"\n    ),\n)\ndef predict_layout(\n    image,\n    host: str = \"http://localhost:8000\",\n    _imgsz: int = 1024,\n    lines=None,\n    font_mapper: FontMapper | None = None,\n):\n    \"\"\"Predict document layout using OCR line information (RPC service).\"\"\"\n\n    if lines is None:\n        lines = []\n\n    image_data = encode_image(image)\n\n    def convert_line(line):\n        if not line.text:\n            return None\n        boxes = [c[0] for c in line.chars]\n        min_x = min(b.x for b in boxes)\n        max_x = max(b.x2 for b in boxes)\n        min_y = min(b.y for b in boxes)\n        max_y = max(b.y2 for b in boxes)\n\n        image_height, image_width = image.shape[:2]\n\n        # Transform to image pixel coordinates\n        min_x = min_x / 72 * DPI\n        max_x = max_x / 72 * DPI\n        min_y = min_y / 72 * DPI\n        max_y = max_y / 72 * DPI\n\n        min_y, max_y = image_height - max_y, image_height - min_y\n\n        box_volume = (max_x - min_x) * (max_y - min_y)\n        if box_volume < 1:\n            return None\n\n        min_x = clip_num(min_x, 0, image_width - 1)\n        max_x = clip_num(max_x, 0, image_width - 1)\n        min_y = clip_num(min_y, 0, image_height - 1)\n        max_y = clip_num(max_y, 0, image_height - 1)\n\n        filtered_text = filter_text(line.text, font_mapper)\n        if not filtered_text:\n            return None\n\n        return {\"box\": [min_x, min_y, max_x, max_y], \"text\": 
filtered_text}\n\n    formatted_results = [convert_line(l) for l in lines]\n    formatted_results = [r for r in formatted_results if r is not None]\n    if not formatted_results:\n        return None\n\n    image_b64 = base64.b64encode(image_data).decode(\"utf-8\")\n\n    request_data = {\n        \"image\": image_b64,\n        \"ocr_results\": formatted_results,\n        \"image_size\": list(image.shape[:2])[::-1],  # (width, height)\n    }\n\n    response = httpx.post(\n        f\"{host}/inference\",\n        json=request_data,\n        headers={\"Accept\": \"application/json\", \"Content-Type\": \"application/json\"},\n        timeout=30,\n        follow_redirects=True,\n    )\n\n    idx = 0\n    id_lookup = {}\n    if response.status_code == 200:\n        try:\n            result = json.loads(response.text)\n            useful_result = []\n            if isinstance(result, dict):\n                names = {}\n                clusters = result[\"clusters\"]\n                for box in clusters:\n                    box[\"xyxy\"] = box[\"box\"]\n                    box[\"conf\"] = 1\n                    if box[\"label\"] not in names:\n                        idx += 1\n                        names[idx] = box[\"label\"]\n                        box[\"cls_id\"] = idx\n                        id_lookup[box[\"label\"]] = idx\n                    else:\n                        box[\"cls_id\"] = id_lookup[box[\"label\"]]\n                    names[box[\"cls_id\"]] = box[\"label\"]\n                    box[\"cls\"] = box[\"cls_id\"]\n                    useful_result.append(box)\n                if \"names\" not in result:\n                    result[\"names\"] = names\n                result[\"boxes\"] = useful_result\n                result = [result]\n            return result\n        except Exception as e:\n            logger.exception(f\"Failed to unpack response: {e!s}\")\n            raise\n    else:\n        logger.error(f\"Request failed with status 
{response.status_code}\")\n        logger.error(f\"Response content: {response.text}\")\n        raise Exception(\n            f\"Request failed with status {response.status_code}: {response.text}\",\n        )\n\n\n@retry(\n    stop=stop_after_attempt(5),  # stop after 5 attempts\n    wait=wait_exponential(\n        multiplier=1, min=1, max=10\n    ),  # exponential backoff, starting at 1 second, capped at 10 seconds\n    retry=retry_if_exception_type((httpx.HTTPError, Exception)),  # exception types that trigger a retry\n    before_sleep=lambda retry_state: logger.warning(\n        f\"Request failed PADDLE, retrying in {getattr(retry_state.next_action, 'sleep', 'unknown')} seconds... \"\n        f\"(Attempt {retry_state.attempt_number}/5)\"\n    ),\n)\ndef predict_layout2(\n    image,\n    host: str = \"http://localhost:8000\",\n    _imgsz: int = 1024,\n):\n    \"\"\"\n    Predict document layout using the MOSEC service\n\n    Args:\n        image: Can be either a file path (str) or numpy array\n        host: Service host URL\n        _imgsz: Image size for model input (currently unused)\n\n    Returns:\n        List of predictions containing bounding boxes and classes\n    \"\"\"\n    # Prepare request data\n\n    if not isinstance(image, list):\n        image = [image]\n    image_data = [encode_image(image) for image in image]\n    data = {\n        \"image\": image_data,\n    }\n\n    # Pack data using msgpack\n    packed_data = msgpack.packb(data, use_bin_type=True)\n    # logger.debug(f\"Packed data size: {len(packed_data)} bytes\")\n\n    # Send request\n    # logger.debug(f\"Sending request to {host}/inference\")\n    response = httpx.post(\n        # f\"{host}/analyze?min_sim=0.7&early_stop=0.99&timeout=480\",\n        f\"{host}/inference\",\n        data=packed_data,\n        headers={\n            \"Content-Type\": \"application/msgpack\",\n            \"Accept\": \"application/msgpack\",\n        },\n        timeout=30,\n        follow_redirects=True,\n    )\n\n    # logger.debug(f\"Response status: {response.status_code}\")\n    # 
logger.debug(f\"Response headers: {response.headers}\")\n    idx = 0\n    id_lookup = {}\n    if response.status_code == 200:\n        try:\n            result = msgpack.unpackb(response.content, raw=False)\n            useful_result = []\n            if isinstance(result, dict):\n                names = {}\n                for box in result[\"boxes\"]:\n                    if box[\"score\"] < 0.7:\n                        continue\n\n                    box[\"xyxy\"] = box[\"coordinate\"]\n                    box[\"conf\"] = box[\"score\"]\n                    if box[\"label\"] not in names:\n                        idx += 1\n                        names[idx] = box[\"label\"]\n                        box[\"cls_id\"] = idx\n                        id_lookup[box[\"label\"]] = idx\n                    else:\n                        box[\"cls_id\"] = id_lookup[box[\"label\"]]\n                    names[box[\"cls_id\"]] = box[\"label\"]\n                    box[\"cls\"] = box[\"cls_id\"]\n                    useful_result.append(box)\n                if \"names\" not in result:\n                    result[\"names\"] = names\n                result[\"boxes\"] = useful_result\n                result = [result]\n            return result\n        except Exception as e:\n            logger.exception(f\"Failed to unpack response: {e!s}\")\n            raise\n    else:\n        logger.error(f\"Request failed with status {response.status_code}\")\n        logger.error(f\"Response content: {response.content}\")\n        raise Exception(\n            f\"Request failed with status {response.status_code}: {response.text}\",\n        )\n\n\nclass ResultContainer:\n    def __init__(self):\n        self.result = YoloResult(boxes_data=np.array([]), names=[])\n\n\ndef filter_text(txt: str, font_mapper: FontMapper):\n    normalize = unicodedata.normalize(\"NFKC\", txt)\n    unicodes = []\n    for c in normalize:\n        if font_mapper.has_char(c):\n            unicodes.append(c)\n    
normalize = \"\".join(unicodes)\n    result = SPACE_REGEX.sub(\" \", normalize).strip()\n    return result\n\n\nclass RpcDocLayoutModel(DocLayoutModel):\n    \"\"\"DocLayoutModel implementation that uses RPC service.\"\"\"\n\n    def __init__(self, host: str = \"http://localhost:8000;http://localhost:8001\"):\n        \"\"\"Initialize RPC model with host address.\n\n        Args:\n            host: Two RPC service hosts separated by ';', e.g. \"host1;host2\".\n        \"\"\"\n        if \";\" not in host:\n            raise ValueError(\n                \"RpcDocLayoutModel host must be two hosts separated by ';' (e.g. 'http://h1;http://h2')\"\n            )\n\n        self.host1, self.host2 = [h.strip() for h in host.split(\";\", 1)]\n\n        # keep the raw host string for logging/debugging purposes\n        self.host = host\n\n        self._stride = 32  # Default stride value\n        self._names = [\"text\", \"title\", \"list\", \"table\", \"figure\"]\n        self.lock = threading.Lock()\n        self.font_mapper = None\n\n    def init_font_mapper(self, translation_config):\n        self.font_mapper = FontMapper(translation_config)\n\n    @property\n    def stride(self) -> int:\n        \"\"\"Stride of the model input.\"\"\"\n        return self._stride\n\n    def resize_and_pad_image(self, image, new_shape):\n        \"\"\"\n        Resize and pad the image to the specified size,\n        ensuring dimensions are multiples of stride.\n\n        Parameters:\n        - image: Input image\n        - new_shape: Target size (integer or (height, width) tuple)\n        - stride: Padding alignment stride, default 32\n\n        Returns:\n        - Processed image\n        \"\"\"\n        if isinstance(new_shape, int):\n            new_shape = (new_shape, new_shape)\n\n        h, w = image.shape[:2]\n        new_h, new_w = new_shape\n\n        # Calculate scaling ratio\n        r = min(new_h / h, new_w / w)\n        resized_h, resized_w = int(round(h * r)), int(round(w * 
r))\n\n        # Resize image\n        image = cv2.resize(\n            image, (resized_w, resized_h), interpolation=cv2.INTER_LINEAR\n        )\n\n        # Calculate padding size\n        pad_h = new_h - resized_h\n        pad_w = new_w - resized_w\n        top, bottom = pad_h // 2, pad_h - pad_h // 2\n        left, right = pad_w // 2, pad_w - pad_w // 2\n\n        # Add padding\n        image = cv2.copyMakeBorder(\n            image, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114, 114, 114)\n        )\n\n        return image\n\n    def scale_boxes(self, img1_shape, boxes, img0_shape):\n        \"\"\"\n        Rescales bounding boxes (in the format of xyxy by default) from the shape of the image they were originally\n        specified in (img1_shape) to the shape of a different image (img0_shape).\n\n        Args:\n            img1_shape (tuple): The shape of the image that the bounding boxes are for,\n                in the format of (height, width).\n            boxes (torch.Tensor): the bounding boxes of the objects in the image, in the format of (x1, y1, x2, y2)\n            img0_shape (tuple): the shape of the target image, in the format of (height, width).\n\n        Returns:\n            boxes (torch.Tensor): The scaled bounding boxes, in the format of (x1, y1, x2, y2)\n        \"\"\"\n\n        # Calculate scaling ratio\n        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])\n\n        # Calculate padding size\n        pad_x = round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1)\n        pad_y = round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)\n\n        # Remove padding and scale boxes\n        boxes = (boxes - [pad_x, pad_y, pad_x, pad_y]) / gain\n        return boxes\n\n    def calculate_iou(self, box1, box2):\n        \"\"\"Calculate IoU between two boxes in xyxy format.\"\"\"\n        x1_1, y1_1, x2_1, y2_1 = box1\n        x1_2, y1_2, x2_2, y2_2 = box2\n\n        # Calculate intersection area\n        
x1_inter = max(x1_1, x1_2)\n        y1_inter = max(y1_1, y1_2)\n        x2_inter = min(x2_1, x2_2)\n        y2_inter = min(y2_1, y2_2)\n\n        if x2_inter <= x1_inter or y2_inter <= y1_inter:\n            return 0.0\n\n        intersection = (x2_inter - x1_inter) * (y2_inter - y1_inter)\n\n        # Calculate union area\n        area1 = (x2_1 - x1_1) * (y2_1 - y1_1)\n        area2 = (x2_2 - x1_2) * (y2_2 - y1_2)\n        union = area1 + area2 - intersection\n\n        return intersection / union if union > 0 else 0.0\n\n    def is_subset(self, inner_box, outer_box):\n        \"\"\"Check if inner_box is a subset of outer_box.\"\"\"\n        x1_inner, y1_inner, x2_inner, y2_inner = inner_box\n        x1_outer, y1_outer, x2_outer, y2_outer = outer_box\n\n        return (\n            x1_inner >= x1_outer\n            and y1_inner >= y1_outer\n            and x2_inner <= x2_outer\n            and y2_inner <= y2_outer\n        )\n\n    def expand_box_to_contain(self, box_to_expand, box_to_contain):\n        \"\"\"Expand box_to_expand to fully contain box_to_contain.\"\"\"\n        x1_expand, y1_expand, x2_expand, y2_expand = box_to_expand\n        x1_contain, y1_contain, x2_contain, y2_contain = box_to_contain\n\n        return [\n            min(x1_expand, x1_contain),\n            min(y1_expand, y1_contain),\n            max(x2_expand, x2_contain),\n            max(y2_expand, y2_contain),\n        ]\n\n    def post_process_boxes(self, merged_boxes: list[YoloBox], names: dict[int, str]):\n        \"\"\"Post-process merged boxes to handle text and paragraph_hybrid overlaps.\"\"\"\n        for i, text_box in enumerate(merged_boxes):\n            text_label = names.get(text_box.cls, \"\")\n            if \"text\" not in text_label:\n                continue\n\n            for j, para_box in enumerate(merged_boxes):\n                if i == j:\n                    continue\n\n                para_label = names.get(para_box.cls, \"\")\n                if 
\"paragraph_hybrid\" not in para_label:\n                    continue\n\n                # Calculate IoU\n                iou = self.calculate_iou(text_box.xyxy, para_box.xyxy)\n\n                # Check if IoU > 0.95 and paragraph is not subset of text\n                if iou > 0.95 and not self.is_subset(para_box.xyxy, text_box.xyxy):\n                    # Expand text box to contain paragraph_hybrid\n                    expanded_box = self.expand_box_to_contain(\n                        text_box.xyxy, para_box.xyxy\n                    )\n                    merged_boxes[i] = YoloBox(\n                        None,\n                        np.array(expanded_box),\n                        text_box.conf,\n                        text_box.cls,\n                    )\n\n    def predict_image(\n        self,\n        image,\n        imgsz: int = 1024,\n        lines=None,\n    ) -> YoloResult:\n        \"\"\"Predict the layout of a single page and fuse results from two RPC services.\"\"\"\n\n        # Resize/pad image if needed – use original size to avoid extra scaling artefacts\n        orig_h, orig_w = image.shape[:2]\n        target_imgsz = (orig_h, orig_w)\n        if image.shape[0] != target_imgsz[0] or image.shape[1] != target_imgsz[1]:\n            image_proc = self.resize_and_pad_image(image, new_shape=target_imgsz)\n        else:\n            image_proc = image\n\n        # Parallel calls to both services; exceptions propagate if either fails\n        with ThreadPoolExecutor(max_workers=2) as ex:\n            if lines:\n                future1 = ex.submit(\n                    predict_layout,\n                    image_proc,\n                    self.host1,\n                    imgsz,\n                    lines,\n                    self.font_mapper,\n                )\n            future2 = ex.submit(predict_layout2, image_proc, self.host2, imgsz)\n\n            # .result() will re-raise any exception occurred in worker thread.\n            if lines:\n     
           preds1 = future1.result()\n            else:\n                preds1 = None\n            preds2 = future2.result()\n\n        # Convert DPI to PDF points (72 dpi)\n        pdf_h, pdf_w = orig_h / DPI * 72, orig_w / DPI * 72\n\n        merged_boxes: list[YoloBox] = []\n        names: dict[int, str] = {}\n\n        def _process_preds(preds, id_offset: int, label_suffix: str | None):\n            for pred in preds or []:\n                for box in pred[\"boxes\"]:\n                    # scale coords back to PDF space\n                    scaled_xyxy = self.scale_boxes(\n                        target_imgsz, np.array(box[\"xyxy\"]), (pdf_h, pdf_w)\n                    )\n\n                    new_cls_id = box[\"cls\"] + id_offset\n\n                    # derive label – fall back gracefully if missing\n                    label = pred[\"names\"].get(box[\"cls\"], str(box[\"cls\"]))\n                    if label_suffix:\n                        label = f\"{label}{label_suffix}\"\n\n                    names[new_cls_id] = label\n\n                    merged_boxes.append(\n                        YoloBox(\n                            None,\n                            scaled_xyxy,\n                            np.array(box.get(\"conf\", box.get(\"score\", 1.0))),\n                            new_cls_id,\n                        )\n                    )\n\n        # service-1: +1000 id, add \"_hybrid\" suffix\n        if preds1:\n            _process_preds(preds1, 1000, \"_hybrid\")\n\n        # service-2: +2000 id, label unchanged\n        _process_preds(preds2, 2000, None)\n\n        # Sort boxes by confidence desc (YoloResult expects sorted list)\n        merged_boxes.sort(key=lambda b: b.conf, reverse=True)\n\n        # Post-process boxes to handle text and paragraph_hybrid overlaps\n        self.post_process_boxes(merged_boxes, names)\n\n        return YoloResult(boxes=merged_boxes, names=names)\n\n    def predict(self, image, imgsz=1024, **kwargs) -> 
list[YoloResult]:  # type: ignore[override]\n        \"\"\"Predict the layout for one or multiple images.\"\"\"\n\n        # Normalize to list\n        if isinstance(image, np.ndarray) and len(image.shape) == 3:\n            image = [image]\n\n        # Sequential processing is sufficient; keep simple\n        results: list[YoloResult] = []\n        for img in image:\n            results.append(self.predict_image(img, imgsz))\n\n        return results\n\n    def predict_page(self, page, pdf_bytes: Path, translate_config, save_debug_image):\n        translate_config.raise_if_cancelled()\n        # doc = pymupdf.open(io.BytesIO(pdf_bytes))\n        # with self.lock:\n        # pix = mupdf_doc[page.page_number].get_pixmap(dpi=72)\n        image = get_no_rotation_img_multiprocess(\n            pdf_bytes.as_posix(), page.page_number, dpi=DPI\n        )\n        # image = np.frombuffer(pix.samples, np.uint8).reshape(\n        #     pix.height,\n        #     pix.width,\n        #     3,\n        # )[:, :, ::-1]\n        char_boxes = convert_page_to_char_boxes(page)\n        lines = process_page_chars_to_lines(char_boxes)\n        predict_result = self.predict_image(image, 800, lines)\n        save_debug_image(image, predict_result, page.page_number + 1)\n        return page, predict_result\n\n    def handle_document(  # type: ignore[override]\n        self,\n        pages: list[\"babeldoc.format.pdf.document_il.il_version_1.Page\"],\n        mupdf_doc: pymupdf.Document,\n        translate_config,\n        save_debug_image,\n    ):\n        layout_temp_path = translate_config.get_working_file_path(\"layout.temp.pdf\")\n        mupdf_doc.save(layout_temp_path.as_posix())\n        with ThreadPoolExecutor(max_workers=32) as executor:\n            yield from executor.map(\n                self.predict_page,\n                pages,\n                (layout_temp_path for _ in range(len(pages))),\n                (translate_config for _ in range(len(pages))),\n                
(save_debug_image for _ in range(len(pages))),\n            )\n\n    @staticmethod\n    def from_host(host: str) -> \"RpcDocLayoutModel\":\n        \"\"\"Create RpcDocLayoutModel from host address.\"\"\"\n        return RpcDocLayoutModel(host=host)\n\n\nif __name__ == \"__main__\":\n    logging.basicConfig(level=logging.DEBUG)\n    # Test the service\n    try:\n        # Use a default test image if example/1.png doesn't exist\n        image_path = \"example/1.png\"\n        if not Path(image_path).exists():\n            print(f\"Warning: {image_path} not found.\")\n            print(\"Please provide the path to a test image:\")\n            image_path = input(\"> \")\n\n        logger.info(f\"Processing image: {image_path}\")\n        result = predict_layout(image_path)\n        print(\"Prediction results:\")\n        print(result)\n    except Exception as e:\n        print(f\"Error: {e!s}\")\n"
  },
  {
    "path": "babeldoc/docvision/rpc_doclayout7.py",
    "content": "import base64\nimport json\nimport logging\nimport threading\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\n\nimport cv2\nimport httpx\nimport numpy as np\nimport pymupdf\nfrom tenacity import retry\nfrom tenacity import retry_if_exception_type\nfrom tenacity import stop_after_attempt\nfrom tenacity import wait_exponential\n\nimport babeldoc\nfrom babeldoc.docvision.base_doclayout import DocLayoutModel\nfrom babeldoc.docvision.base_doclayout import YoloBox\nfrom babeldoc.docvision.base_doclayout import YoloResult\nfrom babeldoc.format.pdf.document_il import il_version_1\nfrom babeldoc.format.pdf.document_il.utils.extract_char import (\n    convert_page_to_char_boxes,\n)\nfrom babeldoc.format.pdf.document_il.utils.extract_char import (\n    process_page_chars_to_lines,\n)\nfrom babeldoc.format.pdf.document_il.utils.mupdf_helper import get_no_rotation_img\n\nlogger = logging.getLogger(__name__)\nDPI = 150\n\n\ndef encode_image(image) -> bytes:\n    \"\"\"Read and encode image to bytes\n\n    Args:\n        image: Can be either a file path (str) or numpy array\n    \"\"\"\n    if isinstance(image, str):\n        if not Path(image).exists():\n            raise FileNotFoundError(f\"Image file not found: {image}\")\n        img = cv2.imread(image)\n\n        if img is None:\n            raise ValueError(f\"Failed to read image: {image}\")\n    else:\n        img = image\n\n    img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)\n    # logger.debug(f\"Image shape: {img.shape}\")\n    encoded = cv2.imencode(\".jpg\", img)[1].tobytes()\n    return encoded\n\n\n@retry(\n    stop=stop_after_attempt(3),  # stop after 3 attempts\n    wait=wait_exponential(\n        multiplier=1, min=1, max=10\n    ),  # exponential backoff: starts at 1 second, capped at 10 seconds\n    retry=retry_if_exception_type((httpx.HTTPError, Exception)),  # exception types that trigger a retry\n    before_sleep=lambda retry_state: logger.warning(\n        f\"Request failed, retrying in {getattr(retry_state.next_action, 'sleep', 'unknown')} 
seconds... \"\n        f\"(Attempt {retry_state.attempt_number}/3)\"\n    ),\n)\ndef predict_layout(\n    image,\n    host: str = \"http://localhost:8000\",\n    _imgsz: int = 1024,\n    lines: list[babeldoc.format.pdf.document_il.utils.extract_char.Line] | None = None,\n):\n    \"\"\"\n    Predict document layout using the MOSEC service\n\n    Args:\n        image: Can be either a file path (str) or numpy array\n        host: Service host URL\n        _imgsz: Image size for model input (currently unused)\n        lines: Optional text lines sent to the service as OCR results\n\n    Returns:\n        List of predictions containing bounding boxes and classes\n    \"\"\"\n    # Prepare request data\n\n    image_data = encode_image(image)\n\n    def convert_line(line: babeldoc.format.pdf.document_il.utils.extract_char.Line):\n        \"\"\"Extract bounding box from a line object.\"\"\"\n        boxes = [c[0] for c in line.chars]\n        min_x = min([b.x for b in boxes])\n        max_x = max([b.x2 for b in boxes])\n        min_y = min([b.y for b in boxes])\n        max_y = max([b.y2 for b in boxes])\n        # min_y, max_y = max_y, min_y\n\n        min_x = min_x / 72 * DPI\n        max_x = max_x / 72 * DPI\n        min_y = min_y / 72 * DPI\n        max_y = max_y / 72 * DPI\n\n        image_height = image.shape[0]\n        min_y, max_y = image_height - max_y, image_height - min_y\n\n        return {\"box\": [min_x, min_y, max_x, max_y], \"text\": line.text}\n\n    formatted_results = [convert_line(l) for l in lines or []]\n\n    image_b64 = base64.b64encode(image_data).decode(\"utf-8\")\n\n    request_data = {\n        \"image\": image_b64,\n        \"ocr_results\": formatted_results,\n        \"image_size\": list(image.shape[:2])[::-1],  # (width, height)\n    }\n\n    # Pack data using msgpack\n    # packed_data = msgpack.packb(data, use_bin_type=True)\n    # logger.debug(f\"Packed data size: {len(packed_data)} bytes\")\n\n    # Send request\n    # logger.debug(f\"Sending request to {host}/inference\")\n    response = httpx.post(\n        f\"{host}/inference\",\n 
       json=request_data,\n        headers={\"Accept\": \"application/json\", \"Content-Type\": \"application/json\"},\n        timeout=1800,\n        follow_redirects=True,\n    )\n\n    # logger.debug(f\"Response status: {response.status_code}\")\n    # logger.debug(f\"Response headers: {response.headers}\")\n    idx = 0\n    id_lookup = {}\n    if response.status_code == 200:\n        try:\n            result = json.loads(response.text)\n            useful_result = []\n            if isinstance(result, dict):\n                names = {}\n                clusters = result[\"clusters\"]\n                for box in clusters:\n                    box[\"xyxy\"] = box[\"box\"]\n                    box[\"conf\"] = 1\n                    # Reuse one class id per distinct label (id_lookup maps label -> id)\n                    if box[\"label\"] not in id_lookup:\n                        idx += 1\n                        names[idx] = box[\"label\"]\n                        box[\"cls_id\"] = idx\n                        id_lookup[box[\"label\"]] = idx\n                    else:\n                        box[\"cls_id\"] = id_lookup[box[\"label\"]]\n                    names[box[\"cls_id\"]] = box[\"label\"]\n                    box[\"cls\"] = box[\"cls_id\"]\n                    useful_result.append(box)\n                if \"names\" not in result:\n                    result[\"names\"] = names\n                result[\"boxes\"] = useful_result\n                result = [result]\n            return result\n        except Exception as e:\n            logger.exception(f\"Failed to unpack response: {e!s}\")\n            raise\n    else:\n        logger.error(f\"Request failed with status {response.status_code}\")\n        logger.error(f\"Response content: {response.text}\")\n        raise Exception(\n            f\"Request failed with status {response.status_code}: {response.text}\",\n        )\n\n\nclass ResultContainer:\n    def __init__(self):\n        self.result = YoloResult(boxes_data=np.array([]), names=[])\n\n\nclass RpcDocLayoutModel(DocLayoutModel):\n    
\"\"\"DocLayoutModel implementation that uses RPC service.\"\"\"\n\n    def __init__(self, host: str = \"http://localhost:8000\"):\n        \"\"\"Initialize RPC model with host address.\"\"\"\n        self.host = host\n        self._stride = 32  # Default stride value\n        self._names = [\"text\", \"title\", \"list\", \"table\", \"figure\"]\n        self.lock = threading.Lock()\n\n    @property\n    def stride(self) -> int:\n        \"\"\"Stride of the model input.\"\"\"\n        return self._stride\n\n    def resize_and_pad_image(self, image, new_shape):\n        \"\"\"\n        Resize and pad the image to the specified size,\n        ensuring dimensions are multiples of stride.\n\n        Parameters:\n        - image: Input image\n        - new_shape: Target size (integer or (height, width) tuple)\n        - stride: Padding alignment stride, default 32\n\n        Returns:\n        - Processed image\n        \"\"\"\n        if isinstance(new_shape, int):\n            new_shape = (new_shape, new_shape)\n\n        h, w = image.shape[:2]\n        new_h, new_w = new_shape\n\n        # Calculate scaling ratio\n        r = min(new_h / h, new_w / w)\n        resized_h, resized_w = int(round(h * r)), int(round(w * r))\n\n        # Resize image\n        image = cv2.resize(\n            image, (resized_w, resized_h), interpolation=cv2.INTER_LINEAR\n        )\n\n        # Calculate padding size\n        pad_h = new_h - resized_h\n        pad_w = new_w - resized_w\n        top, bottom = pad_h // 2, pad_h - pad_h // 2\n        left, right = pad_w // 2, pad_w - pad_w // 2\n\n        # Add padding\n        image = cv2.copyMakeBorder(\n            image, top, bottom, left, right, cv2.BORDER_CONSTANT, value=(114, 114, 114)\n        )\n\n        return image\n\n    def scale_boxes(self, img1_shape, boxes, img0_shape):\n        \"\"\"\n        Rescales bounding boxes (in the format of xyxy by default) from the shape of the image they were originally\n        specified in 
(img1_shape) to the shape of a different image (img0_shape).\n\n        Args:\n            img1_shape (tuple): The shape of the image that the bounding boxes are for,\n                in the format of (height, width).\n            boxes (torch.Tensor): the bounding boxes of the objects in the image, in the format of (x1, y1, x2, y2)\n            img0_shape (tuple): the shape of the target image, in the format of (height, width).\n\n        Returns:\n            boxes (torch.Tensor): The scaled bounding boxes, in the format of (x1, y1, x2, y2)\n        \"\"\"\n\n        # Calculate scaling ratio\n        gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])\n\n        # Calculate padding size\n        pad_x = round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1)\n        pad_y = round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)\n\n        # Remove padding and scale boxes\n        boxes = (boxes - [pad_x, pad_y, pad_x, pad_y]) / gain\n        return boxes\n\n    def predict_image(\n        self,\n        image,\n        host: str | None = None,\n        result_container: ResultContainer | None = None,\n        imgsz: int = 1024,\n        page: il_version_1.Page | None = None,\n    ) -> YoloResult:\n        \"\"\"Predict the layout of document pages using RPC service.\"\"\"\n        if result_container is None:\n            result_container = ResultContainer()\n        target_imgsz = (800, 800)\n        orig_h, orig_w = image.shape[:2]\n        target_imgsz = (orig_h, orig_w)\n        if image.shape[0] != target_imgsz[0] or image.shape[1] != target_imgsz[1]:\n            image = self.resize_and_pad_image(image, new_shape=target_imgsz)\n\n        char_boxes = convert_page_to_char_boxes(page)\n        lines = process_page_chars_to_lines(char_boxes)\n\n        preds = predict_layout(image, host=self.host, lines=lines)\n        orig_h, orig_w = orig_h / DPI * 72, orig_w / DPI * 72\n        if len(preds) > 0:\n            for pred in preds:\n     
           boxes = [\n                    YoloBox(\n                        None,\n                        self.scale_boxes(\n                            target_imgsz, np.array(x[\"xyxy\"]), (orig_h, orig_w)\n                        ),\n                        np.array(x[\"conf\"]),\n                        x[\"cls\"],\n                    )\n                    for x in pred[\"boxes\"]\n                ]\n                result_container.result = YoloResult(\n                    boxes=boxes,\n                    names={int(k): v for k, v in pred[\"names\"].items()},\n                )\n        return result_container.result\n\n    def predict_page(\n        self, page, mupdf_doc: pymupdf.Document, translate_config, save_debug_image\n    ):\n        translate_config.raise_if_cancelled()\n        with self.lock:\n            # pix = mupdf_doc[page.page_number].get_pixmap(dpi=72)\n            pix = get_no_rotation_img(mupdf_doc[page.page_number], dpi=DPI)\n        image = np.frombuffer(pix.samples, np.uint8).reshape(\n            pix.height,\n            pix.width,\n            3,\n        )[:, :, ::-1]\n        predict_result = self.predict_image(image, self.host, None, 800, page)\n        save_debug_image(image, predict_result, page.page_number + 1)\n        return page, predict_result\n\n    def handle_document(\n        self,\n        pages: list[il_version_1.Page],\n        mupdf_doc: pymupdf.Document,\n        translate_config,\n        save_debug_image,\n    ):\n        with ThreadPoolExecutor(max_workers=1) as executor:\n            yield from executor.map(\n                self.predict_page,\n                pages,\n                (mupdf_doc for _ in range(len(pages))),\n                (translate_config for _ in range(len(pages))),\n                (save_debug_image for _ in range(len(pages))),\n            )\n\n    @staticmethod\n    def from_host(host: str) -> \"RpcDocLayoutModel\":\n        \"\"\"Create RpcDocLayoutModel from host address.\"\"\"\n       
 return RpcDocLayoutModel(host=host)\n\n\nif __name__ == \"__main__\":\n    logging.basicConfig(level=logging.DEBUG)\n    # Test the service\n    try:\n        # Use a default test image if example/1.png doesn't exist\n        image_path = \"example/1.png\"\n        if not Path(image_path).exists():\n            print(f\"Warning: {image_path} not found.\")\n            print(\"Please provide the path to a test image:\")\n            image_path = input(\"> \")\n\n        logger.info(f\"Processing image: {image_path}\")\n        result = predict_layout(image_path)\n        print(\"Prediction results:\")\n        print(result)\n    except Exception as e:\n        print(f\"Error: {e!s}\")\n"
  },
  {
    "path": "babeldoc/docvision/table_detection/rapidocr.py",
    "content": "import logging\nimport re\nimport threading\nfrom collections.abc import Generator\n\nimport cv2\nimport numpy as np\nfrom babeldoc.assets.assets import get_table_detection_rapidocr_model_path\nfrom babeldoc.docvision.base_doclayout import YoloBox\nfrom babeldoc.docvision.base_doclayout import YoloResult\nfrom babeldoc.format.pdf.document_il.utils.mupdf_helper import get_no_rotation_img\nfrom rapidocr_onnxruntime import RapidOCR\n\ntry:\n    import onnxruntime\nexcept ImportError as e:\n    if \"DLL load failed\" in str(e):\n        raise OSError(\n            \"Microsoft Visual C++ Redistributable is not installed. \"\n            \"Download it at https://aka.ms/vs/17/release/vc_redist.x64.exe\"\n        ) from e\n    raise\nimport babeldoc.format.pdf.document_il.il_version_1\nimport pymupdf\n\nlogger = logging.getLogger(__name__)\n\n\ndef convert_to_yolo_result(predictions):\n    \"\"\"\n    Convert RapidOCR predictions to YoloResult format.\n\n    Args:\n        predictions (list): List of predictions, where each prediction is a list of coordinates\n                           in format [[x1, y1], [x2, y2], [x3, y3], [x4, y4], (text, confidence)]\n                           or a numpy array of format [x1, y1, x2, y2, ...]\n\n    Returns:\n        YoloResult: Converted predictions in YoloResult format\n    \"\"\"\n    boxes = []\n\n    for pred in predictions:\n        # Check if the prediction is in the format of 4 corner points\n        if isinstance(pred, list) and len(pred) >= 5 and isinstance(pred[0], list):\n            # Convert 4 corner points to xyxy format (min x, min y, max x, max y)\n            points = np.array(pred[:4])\n            x1, y1 = points[:, 0].min(), points[:, 1].min()\n            x2, y2 = points[:, 0].max(), points[:, 1].max()\n            xyxy = [x1, y1, x2, y2]\n            box = YoloBox(xyxy=xyxy, conf=1.0, cls=\"text\")\n        # Check if the prediction is already in xyxy format\n        elif isinstance(pred, list | 
np.ndarray) and len(pred) >= 4:\n            if isinstance(pred, np.ndarray):\n                pred = pred.tolist()\n            xyxy = pred[:4]\n            box = YoloBox(xyxy=xyxy, conf=1.0, cls=\"text\")\n        else:\n            continue\n\n        boxes.append(box)\n\n    return YoloResult(names=[\"text\"], boxes=boxes)\n\n\ndef create_yolo_result_from_nested_coords(nested_coords: np.ndarray, names: dict):\n    boxes = []\n\n    for quad in nested_coords.tolist():\n        if len(quad) != 4:\n            continue\n\n        # Convert quad coordinates to xyxy format (min x, min y, max x, max y)\n        x1, y1, x2, y2 = quad\n\n        # Create YoloBox with confidence 1.0 and class 'text'\n        box = YoloBox(\n            xyxy=[float(x1), float(y1), float(x2), float(y2)], conf=np.array(1.0), cls=0\n        )\n        boxes.append(box)\n\n    return YoloResult(names=names, boxes=boxes)\n\n\nclass RapidOCRModel:\n    def __init__(self):\n        self.use_cuda = False\n        self.use_dml = False\n        available_providers = onnxruntime.get_available_providers()\n        for provider in available_providers:\n            if re.match(r\"dml\", provider, re.IGNORECASE):\n                self.use_dml = True\n            elif re.match(r\"cuda\", provider, re.IGNORECASE):\n                self.use_cuda = True\n        self.use_dml = False  # force disable directml\n        self.model = RapidOCR(\n            det_model_path=get_table_detection_rapidocr_model_path(),\n            det_use_cuda=self.use_cuda,\n            det_use_dml=False,\n        )\n        self.names = {0: \"table_text\"}\n        self.lock = threading.Lock()\n\n    @property\n    def stride(self):\n        return 32\n\n    def resize_and_pad_image(self, image, new_shape):\n        \"\"\"\n        Resize and pad the image to the specified size, ensuring dimensions are multiples of stride.\n\n        Parameters:\n        - image: Input image\n        - new_shape: Target size (integer or (height, 
width) tuple)\n        - stride: Padding alignment stride, default 32\n\n        Returns:\n        - Processed image\n        \"\"\"\n        if isinstance(new_shape, int):\n            new_shape = (new_shape, new_shape)\n\n        h, w = image.shape[:2]\n        new_h, new_w = new_shape\n\n        # Calculate scaling ratio\n        r = min(new_h / h, new_w / w)\n        resized_h, resized_w = int(round(h * r)), int(round(w * r))\n\n        # Resize image\n        image = cv2.resize(\n            image,\n            (resized_w, resized_h),\n            interpolation=cv2.INTER_LINEAR,\n        )\n\n        # Calculate padding size and align to stride multiple\n        pad_w = (new_w - resized_w) % self.stride\n        pad_h = (new_h - resized_h) % self.stride\n        top, bottom = pad_h // 2, pad_h - pad_h // 2\n        left, right = pad_w // 2, pad_w - pad_w // 2\n\n        # Add padding\n        image = cv2.copyMakeBorder(\n            image,\n            top,\n            bottom,\n            left,\n            right,\n            cv2.BORDER_CONSTANT,\n            value=(114, 114, 114),\n        )\n\n        return image\n\n    def scale_boxes(self, img1_shape, boxes, img0_shape):\n        \"\"\"\n        Rescales bounding boxes (in the format of xyxy by default) from the shape of the image they were originally\n        specified in (img1_shape) to the shape of a different image (img0_shape).\n\n        Args:\n            img1_shape (tuple): The shape of the image that the bounding boxes are for,\n                in the format of (height, width).\n            boxes (torch.Tensor): the bounding boxes of the objects in the image, in the format of (x1, y1, x2, y2)\n            img0_shape (tuple): the shape of the target image, in the format of (height, width).\n\n        Returns:\n            boxes (torch.Tensor): The scaled bounding boxes, in the format of (x1, y1, x2, y2)\n        \"\"\"\n\n        # Calculate scaling ratio\n        gain = min(img1_shape[0] / 
img0_shape[0], img1_shape[1] / img0_shape[1])\n\n        # Calculate padding size\n        pad_x = round((img1_shape[1] - img0_shape[1] * gain) / 2 - 0.1)\n        pad_y = round((img1_shape[0] - img0_shape[0] * gain) / 2 - 0.1)\n\n        # Remove padding and scale boxes\n        boxes[..., :4] = (boxes[..., :4] - [pad_x, pad_y, pad_x, pad_y]) / gain\n        return boxes\n\n    def predict(self, image, imgsz=800, batch_size=16, **kwargs):\n        \"\"\"\n        Predict the layout of document pages.\n\n        Args:\n            image: A single image or a list of images of document pages.\n            imgsz: Resize the image to this size. Must be a multiple of the stride.\n            batch_size: Number of images to process in one batch.\n            **kwargs: Additional arguments.\n\n        Returns:\n            A YoloResult object containing the detected boxes.\n        \"\"\"\n        # Handle single image input\n        assert isinstance(image, np.ndarray) and len(image.shape) == 3\n\n        # Calculate target size based on the maximum height in the batch\n        target_imgsz = 1024\n\n        orig_shape = (image.shape[0], image.shape[1])\n\n        pix = self.resize_and_pad_image(image, new_shape=target_imgsz)\n        # pix = np.transpose(pix, (2, 0, 1))  # CHW\n        # pix = pix.astype(np.float32) / 255.0  # Normalize to [0, 1]\n        input_ = pix\n\n        new_h, new_w = input_.shape[:2]\n\n        # Run inference\n        preds = self.model(input_, use_det=True, use_cls=False, use_rec=False)\n\n        # Process each prediction in the batch\n        if len(preds) > 0:\n            preds_np = np.array(preds[0])[:, [0, 2], :].reshape([-1, 4])\n            preds_np[..., :4] = self.scale_boxes(\n                (new_h, new_w),\n                preds_np[..., :4],\n                orig_shape,\n            )\n\n            # Convert predictions to YoloResult format\n            return create_yolo_result_from_nested_coords(preds_np, self.names)\n        
else:\n            # Return empty YoloResult if no predictions\n            return YoloResult(names=self.names, boxes=[])\n\n    def handle_document(\n        self,\n        pages: list[babeldoc.format.pdf.document_il.il_version_1.Page],\n        mupdf_doc: pymupdf.Document,\n        translate_config,\n        save_debug_image,\n    ) -> Generator[\n        tuple[babeldoc.format.pdf.document_il.il_version_1.Page, YoloResult], None, None\n    ]:\n        for page in pages:\n            translate_config.raise_if_cancelled()\n            with self.lock:\n                # pix = mupdf_doc[page.page_number].get_pixmap(dpi=72)\n                pix = get_no_rotation_img(mupdf_doc[page.page_number])\n            image = np.frombuffer(pix.samples, np.uint8).reshape(\n                pix.height,\n                pix.width,\n                3,\n            )[:, :, ::-1]\n\n            table_boxes = []\n            for layout in page.page_layout:\n                if layout.class_name == \"table\":\n                    table_boxes.append(layout.box)\n\n            predict_result = self.predict(image)\n\n            ok_boxes = []\n            for box in predict_result.boxes:\n                # Convert the box coordinates to float for proper comparison\n                box_xyxy = [float(coord) for coord in box.xyxy]\n\n                # Check if this box is inside any of the table boxes\n                for table_box in table_boxes:\n                    # Determine if box is inside or overlapping with table_box with image dimensions\n                    if self._is_box_in_table(\n                        box_xyxy, table_box, page, image.shape[1], image.shape[0]\n                    ):\n                        ok_boxes.append(box)\n                        break\n\n            yolo_result = YoloResult(names=self.names, boxes=ok_boxes)\n            save_debug_image(\n                image,\n                yolo_result,\n                page.page_number + 1,\n            )\n           
 yield page, yolo_result\n\n    def _is_box_in_table(self, box_xyxy, table_box, page, img_width, img_height):\n        \"\"\"\n        Check if a box from image coordinates is inside a table box from PDF coordinates.\n\n        Args:\n            box_xyxy (list): Box coordinates in image coordinate system [x1, y1, x2, y2]\n            table_box (Box): Table box in PDF coordinate system\n            page: The page object containing information for coordinate conversion\n            img_width: Width of the image\n            img_height: Height of the image\n\n        Returns:\n            bool: True if the box is inside or significantly overlapping with the table box\n        \"\"\"\n\n        # Get table box coordinates in PDF coordinate system\n        table_pdf_x1 = table_box.x\n        table_pdf_y1 = table_box.y\n        table_pdf_x2 = table_box.x2\n        table_pdf_y2 = table_box.y2\n\n        # Convert table box to image coordinates\n        table_img_x1 = table_pdf_x1\n        table_img_y1 = img_height - table_pdf_y2\n        table_img_x2 = table_pdf_x2\n        table_img_y2 = img_height - table_pdf_y1\n\n        # Now check for overlap between the boxes\n        # Calculate the area of overlap\n        x_overlap = max(\n            0, min(box_xyxy[2], table_img_x2) - max(box_xyxy[0], table_img_x1)\n        )\n        y_overlap = max(\n            0, min(box_xyxy[3], table_img_y2) - max(box_xyxy[1], table_img_y1)\n        )\n        overlap_area = x_overlap * y_overlap\n\n        # Calculate area of the detected box\n        box_area = (box_xyxy[2] - box_xyxy[0]) * (box_xyxy[3] - box_xyxy[1])\n\n        # If overlap area is significant relative to the box area, consider it inside\n        if box_area > 0 and overlap_area / box_area > 0.5:\n            return True\n\n        return False\n"
  },
  {
    "path": "babeldoc/format/__init__.py",
    "content": ""
  },
  {
    "path": "babeldoc/format/pdf/__init__.py",
    "content": ""
  },
  {
    "path": "babeldoc/format/pdf/babelpdf/base14.py",
    "content": "from .encoding import get_type1_encoding\nfrom .win_core import win_core\n\nbase14_bbox = {\n    \"Courier-BoldOblique\": {\n        \".notdef\": (0, 0, 0, 0),\n        \"exclam\": (216, -15, 495, 572),\n        \"quotedbl\": (212, 277, 584, 562),\n        \"numbersign\": (88, -45, 640, 651),\n        \"dollar\": (87, -126, 629, 666),\n        \"percent\": (102, -15, 624, 616),\n        \"ampersand\": (62, -15, 594, 543),\n        \"quoteright\": (230, 277, 542, 562),\n        \"parenleft\": (266, -102, 592, 616),\n        \"parenright\": (117, -102, 443, 616),\n        \"asterisk\": (179, 219, 597, 601),\n        \"plus\": (114, 39, 596, 478),\n        \"comma\": (99, -111, 430, 174),\n        \"hyphen\": (143, 203, 567, 313),\n        \"period\": (207, -15, 426, 171),\n        \"slash\": (91, -77, 626, 626),\n        \"zero\": (137, -15, 591, 616),\n        \"one\": (93, 0, 561, 616),\n        \"two\": (61, 0, 593, 616),\n        \"three\": (72, -15, 570, 616),\n        \"four\": (82, 0, 558, 616),\n        \"five\": (77, -15, 621, 601),\n        \"six\": (136, -15, 652, 616),\n        \"seven\": (147, 0, 622, 601),\n        \"eight\": (116, -15, 603, 616),\n        \"nine\": (76, -15, 591, 616),\n        \"colon\": (206, -15, 479, 425),\n        \"semicolon\": (99, -111, 480, 425),\n        \"less\": (121, 15, 612, 501),\n        \"equal\": (96, 118, 614, 398),\n        \"greater\": (97, 15, 589, 501),\n        \"question\": (183, -14, 591, 580),\n        \"at\": (67, -15, 641, 616),\n        \"A\": (-9, 0, 631, 562),\n        \"B\": (30, 0, 628, 562),\n        \"C\": (75, -18, 674, 580),\n        \"D\": (30, 0, 663, 562),\n        \"E\": (25, 0, 669, 562),\n        \"F\": (39, 0, 683, 562),\n        \"G\": (75, -18, 674, 580),\n        \"H\": (20, 0, 699, 562),\n        \"I\": (77, 0, 642, 562),\n        \"J\": (59, -18, 720, 562),\n        \"K\": (21, 0, 691, 562),\n        \"L\": (39, 0, 635, 562),\n        \"M\": (-2, 0, 721, 562),\n        
\"N\": (8, -12, 729, 562),\n        \"O\": (75, -18, 645, 580),\n        \"P\": (48, 0, 642, 562),\n        \"Q\": (84, -138, 635, 580),\n        \"R\": (24, 0, 617, 562),\n        \"S\": (54, -22, 672, 582),\n        \"T\": (86, 0, 678, 562),\n        \"U\": (101, -18, 715, 562),\n        \"V\": (84, 0, 732, 562),\n        \"W\": (84, 0, 737, 562),\n        \"X\": (12, 0, 689, 562),\n        \"Y\": (109, 0, 708, 562),\n        \"Z\": (62, 0, 636, 562),\n        \"bracketleft\": (223, -102, 606, 616),\n        \"backslash\": (223, -77, 496, 626),\n        \"bracketright\": (103, -102, 486, 616),\n        \"asciicircum\": (171, 250, 555, 616),\n        \"underscore\": (-27, -125, 584, -75),\n        \"quoteleft\": (297, 277, 487, 562),\n        \"a\": (62, -15, 592, 454),\n        \"b\": (13, -15, 635, 626),\n        \"c\": (82, -15, 631, 459),\n        \"d\": (61, -15, 644, 626),\n        \"e\": (82, -15, 604, 454),\n        \"f\": (83, 0, 677, 626),\n        \"g\": (41, -146, 673, 454),\n        \"h\": (18, 0, 614, 626),\n        \"i\": (77, 0, 545, 658),\n        \"j\": (37, -146, 580, 658),\n        \"k\": (33, 0, 642, 626),\n        \"l\": (77, 0, 545, 626),\n        \"m\": (-22, 0, 648, 454),\n        \"n\": (18, 0, 614, 454),\n        \"o\": (72, -15, 622, 454),\n        \"p\": (-31, -142, 621, 454),\n        \"q\": (61, -142, 684, 454),\n        \"r\": (47, 0, 654, 454),\n        \"s\": (67, -17, 607, 459),\n        \"t\": (118, -15, 566, 562),\n        \"u\": (70, -15, 591, 439),\n        \"v\": (70, 0, 694, 439),\n        \"w\": (53, 0, 711, 439),\n        \"x\": (6, 0, 670, 439),\n        \"y\": (-20, -142, 694, 439),\n        \"z\": (81, 0, 613, 439),\n        \"braceleft\": (204, -102, 595, 616),\n        \"bar\": (202, -250, 504, 750),\n        \"braceright\": (114, -102, 506, 616),\n        \"asciitilde\": (120, 153, 589, 356),\n        \"exclamdown\": (197, -146, 476, 449),\n        \"cent\": (122, -49, 604, 614),\n        \"sterling\": (107, -28, 
650, 611),\n        \"fraction\": (22, -60, 707, 661),\n        \"yen\": (98, 0, 709, 562),\n        \"florin\": (-56, -131, 701, 616),\n        \"section\": (74, -70, 619, 580),\n        \"currency\": (77, 49, 643, 517),\n        \"quotesingle\": (304, 277, 492, 562),\n        \"quotedblleft\": (190, 277, 594, 562),\n        \"guillemotleft\": (63, 70, 638, 446),\n        \"guilsinglleft\": (196, 70, 544, 446),\n        \"guilsinglright\": (166, 70, 514, 446),\n        \"fi\": (12, 0, 643, 626),\n        \"fl\": (12, 0, 643, 626),\n        \"endash\": (108, 203, 602, 313),\n        \"dagger\": (176, -70, 586, 580),\n        \"daggerdbl\": (122, -70, 586, 580),\n        \"periodcentered\": (250, 165, 460, 351),\n        \"paragraph\": (61, -70, 699, 580),\n        \"bullet\": (197, 132, 523, 430),\n        \"quotesinglbase\": (145, -142, 457, 143),\n        \"quotedblbase\": (35, -142, 559, 143),\n        \"quotedblright\": (120, 277, 644, 562),\n        \"guillemotright\": (72, 70, 647, 446),\n        \"ellipsis\": (36, -15, 586, 116),\n        \"perthousand\": (-44, -15, 742, 616),\n        \"questiondown\": (102, -146, 509, 449),\n        \"grave\": (272, 508, 503, 661),\n        \"acute\": (313, 508, 608, 661),\n        \"circumflex\": (212, 483, 606, 657),\n        \"tilde\": (200, 493, 642, 636),\n        \"macron\": (195, 505, 636, 585),\n        \"breve\": (217, 468, 651, 631),\n        \"dotaccent\": (347, 485, 489, 625),\n        \"dieresis\": (245, 485, 591, 625),\n        \"ring\": (319, 481, 527, 678),\n        \"cedilla\": (169, -206, 366, 0),\n        \"hungarumlaut\": (172, 488, 728, 661),\n        \"ogonek\": (144, -199, 350, 0),\n        \"caron\": (238, 493, 632, 667),\n        \"emdash\": (33, 203, 677, 313),\n        \"AE\": (-29, 0, 707, 562),\n        \"ordfeminine\": (189, 196, 526, 580),\n        \"Lslash\": (39, 0, 635, 562),\n        \"Oslash\": (48, -22, 672, 584),\n        \"OE\": (27, 0, 700, 562),\n        \"ordmasculine\": (189, 196, 
542, 580),\n        \"ae\": (22, -15, 651, 454),\n        \"dotlessi\": (77, 0, 545, 439),\n        \"lslash\": (77, 0, 578, 626),\n        \"oslash\": (55, -24, 637, 463),\n        \"oe\": (19, -15, 661, 454),\n        \"germandbls\": (22, -15, 628, 626),\n        \"Scedilla\": (54, -206, 672, 582),\n        \"multiply\": (105, 39, 606, 478),\n        \"logicalnot\": (135, 103, 617, 413),\n        \"format\": (-26, -146, 243, 601),\n        \"tab\": (19, 0, 641, 562),\n        \"overscore\": (123, 579, 734, 629),\n        \"IJ\": (-8, -18, 741, 562),\n        \"trademark\": (86, 230, 868, 562),\n        \"onequarter\": (14, -60, 706, 661),\n        \"mu\": (50, -142, 591, 439),\n        \"minus\": (114, 203, 596, 313),\n        \"brokenbar\": (218, -175, 488, 675),\n        \"arrowleft\": (40, 143, 708, 455),\n        \"LL\": (-45, 0, 694, 562),\n        \"arrowright\": (20, 143, 688, 455),\n        \"thorn\": (-31, -142, 621, 626),\n        \"lira\": (107, -28, 650, 611),\n        \"arrowboth\": (40, 143, 688, 455),\n        \"indent\": (99, 45, 579, 372),\n        \"threesuperior\": (193, 222, 525, 616),\n        \"onehalf\": (23, -60, 715, 661),\n        \"graybox\": (76, 0, 652, 599),\n        \"Idot\": (77, 0, 642, 748),\n        \"ll\": (1, 0, 653, 626),\n        \"Thorn\": (48, 0, 619, 562),\n        \"Ccedilla\": (75, -206, 674, 580),\n        \"notegraphic\": (91, -15, 619, 572),\n        \"arrowup\": (244, 3, 556, 626),\n        \"down\": (168, -15, 496, 439),\n        \"plusminus\": (76, 24, 614, 515),\n        \"threequarters\": (8, -60, 698, 661),\n        \"scedilla\": (67, -206, 607, 459),\n        \"ij\": (6, -146, 714, 658),\n        \"eth\": (94, -27, 661, 626),\n        \"merge\": (168, -15, 533, 487),\n        \"twosuperior\": (192, 230, 540, 616),\n        \"arrowdown\": (174, -15, 486, 608),\n        \"left\": (109, 44, 589, 371),\n        \"return\": (79, 0, 700, 562),\n        \"Eth\": (30, 0, 663, 562),\n        \"up\": (196, 0, 523, 
447),\n        \"divide\": (114, 16, 596, 500),\n        \"prescription\": (24, -15, 632, 562),\n        \"square\": (19, 0, 700, 562),\n        \"stop\": (19, 0, 700, 562),\n        \"degree\": (174, 243, 569, 616),\n        \"ccedilla\": (82, -206, 631, 459),\n        \"onesuperior\": (213, 230, 514, 616),\n        \"largebullet\": (307, 229, 413, 333),\n        \"center\": (103, 14, 623, 580),\n        \"registered\": (54, -18, 666, 580),\n        \"copyright\": (54, -18, 666, 580),\n        \"dectab\": (8, 0, 615, 320),\n        \"space\": (0, 0, 0, 0),\n        \"Aacute\": (-9, 0, 665, 784),\n        \"Acircumflex\": (-9, 0, 631, 780),\n        \"Adieresis\": (-9, 0, 631, 748),\n        \"Agrave\": (-9, 0, 631, 784),\n        \"Aring\": (-9, 0, 631, 801),\n        \"Atilde\": (-9, 0, 638, 759),\n        \"Eacute\": (25, 0, 669, 784),\n        \"Ecircumflex\": (25, 0, 669, 780),\n        \"Edieresis\": (25, 0, 669, 748),\n        \"Egrave\": (25, 0, 669, 784),\n        \"Gcaron\": (75, -18, 674, 790),\n        \"Iacute\": (77, 0, 642, 784),\n        \"Icircumflex\": (77, 0, 642, 780),\n        \"Idieresis\": (77, 0, 642, 748),\n        \"Igrave\": (77, 0, 642, 784),\n        \"Ntilde\": (8, -12, 729, 759),\n        \"Oacute\": (75, -18, 645, 784),\n        \"Ocircumflex\": (75, -18, 645, 780),\n        \"Odieresis\": (75, -18, 645, 748),\n        \"Ograve\": (75, -18, 645, 784),\n        \"Otilde\": (75, -18, 668, 759),\n        \"Scaron\": (54, -22, 672, 790),\n        \"Uacute\": (101, -18, 715, 784),\n        \"Ucircumflex\": (101, -18, 715, 780),\n        \"Udieresis\": (101, -18, 715, 748),\n        \"Ugrave\": (101, -18, 715, 784),\n        \"Yacute\": (109, 0, 708, 784),\n        \"Ydieresis\": (109, 0, 708, 748),\n        \"Zcaron\": (62, 0, 659, 790),\n        \"aacute\": (62, -15, 608, 661),\n        \"acircumflex\": (62, -15, 592, 657),\n        \"adieresis\": (62, -15, 592, 625),\n        \"agrave\": (62, -15, 592, 661),\n        \"aring\": (62, 
-15, 592, 678),\n        \"atilde\": (62, -15, 642, 636),\n        \"eacute\": (82, -15, 608, 661),\n        \"ecircumflex\": (82, -15, 606, 657),\n        \"edieresis\": (82, -15, 604, 625),\n        \"egrave\": (82, -15, 604, 661),\n        \"gcaron\": (41, -146, 673, 667),\n        \"iacute\": (77, 0, 608, 661),\n        \"icircumflex\": (77, 0, 566, 657),\n        \"idieresis\": (77, 0, 551, 625),\n        \"igrave\": (77, 0, 545, 661),\n        \"ntilde\": (18, 0, 642, 636),\n        \"oacute\": (72, -15, 622, 661),\n        \"ocircumflex\": (72, -15, 622, 657),\n        \"odieresis\": (72, -15, 622, 625),\n        \"ograve\": (72, -15, 622, 661),\n        \"otilde\": (72, -15, 642, 636),\n        \"scaron\": (67, -17, 632, 667),\n        \"uacute\": (70, -15, 608, 661),\n        \"ucircumflex\": (70, -15, 591, 657),\n        \"udieresis\": (70, -15, 591, 625),\n        \"ugrave\": (70, -15, 591, 661),\n        \"yacute\": (-20, -142, 694, 661),\n        \"ydieresis\": (-20, -142, 694, 625),\n        \"zcaron\": (81, 0, 632, 667),\n    },\n    \"Courier-Bold\": {\n        \".notdef\": (0, 0, 0, 0),\n        \"exclam\": (202, -15, 398, 572),\n        \"quotedbl\": (135, 277, 465, 562),\n        \"numbersign\": (56, -45, 544, 651),\n        \"dollar\": (82, -126, 519, 666),\n        \"percent\": (5, -15, 595, 616),\n        \"ampersand\": (36, -15, 546, 543),\n        \"quoteright\": (171, 277, 423, 562),\n        \"parenleft\": (219, -102, 461, 616),\n        \"parenright\": (139, -102, 381, 616),\n        \"asterisk\": (91, 219, 509, 601),\n        \"plus\": (71, 39, 529, 478),\n        \"comma\": (123, -111, 393, 174),\n        \"hyphen\": (100, 203, 500, 313),\n        \"period\": (192, -15, 408, 171),\n        \"slash\": (98, -77, 502, 626),\n        \"zero\": (87, -15, 513, 616),\n        \"one\": (81, 0, 539, 616),\n        \"two\": (61, 0, 499, 616),\n        \"three\": (63, -15, 501, 616),\n        \"four\": (53, 0, 507, 616),\n        \"five\": (70, 
-15, 521, 601),\n        \"six\": (90, -15, 521, 616),\n        \"seven\": (55, 0, 494, 601),\n        \"eight\": (83, -15, 517, 616),\n        \"nine\": (79, -15, 510, 616),\n        \"colon\": (191, -15, 407, 425),\n        \"semicolon\": (123, -111, 408, 425),\n        \"less\": (66, 15, 523, 501),\n        \"equal\": (71, 118, 529, 398),\n        \"greater\": (77, 15, 534, 501),\n        \"question\": (98, -14, 501, 580),\n        \"at\": (16, -15, 584, 616),\n        \"A\": (-9, 0, 609, 562),\n        \"B\": (30, 0, 573, 562),\n        \"C\": (22, -18, 560, 580),\n        \"D\": (30, 0, 594, 562),\n        \"E\": (25, 0, 560, 562),\n        \"F\": (39, 0, 570, 562),\n        \"G\": (22, -18, 594, 580),\n        \"H\": (20, 0, 580, 562),\n        \"I\": (77, 0, 523, 562),\n        \"J\": (37, -18, 601, 562),\n        \"K\": (21, 0, 599, 562),\n        \"L\": (39, 0, 578, 562),\n        \"M\": (-2, 0, 602, 562),\n        \"N\": (8, -12, 610, 562),\n        \"O\": (22, -18, 578, 580),\n        \"P\": (48, 0, 559, 562),\n        \"Q\": (32, -138, 578, 580),\n        \"R\": (24, 0, 599, 562),\n        \"S\": (47, -22, 553, 582),\n        \"T\": (21, 0, 579, 562),\n        \"U\": (4, -18, 596, 562),\n        \"V\": (-13, 0, 613, 562),\n        \"W\": (-18, 0, 618, 562),\n        \"X\": (12, 0, 588, 562),\n        \"Y\": (12, 0, 589, 562),\n        \"Z\": (62, 0, 539, 562),\n        \"bracketleft\": (245, -102, 475, 616),\n        \"backslash\": (99, -77, 503, 626),\n        \"bracketright\": (125, -102, 355, 616),\n        \"asciicircum\": (108, 250, 492, 616),\n        \"underscore\": (0, -125, 600, -75),\n        \"quoteleft\": (178, 277, 428, 562),\n        \"a\": (35, -15, 570, 454),\n        \"b\": (0, -15, 584, 626),\n        \"c\": (40, -15, 545, 459),\n        \"d\": (20, -15, 591, 626),\n        \"e\": (40, -15, 563, 454),\n        \"f\": (83, 0, 547, 626),\n        \"g\": (30, -146, 580, 454),\n        \"h\": (5, 0, 592, 626),\n        \"i\": (77, 0, 523, 
658),\n        \"j\": (63, -146, 440, 658),\n        \"k\": (20, 0, 585, 626),\n        \"l\": (77, 0, 523, 626),\n        \"m\": (-22, 0, 626, 454),\n        \"n\": (18, 0, 592, 454),\n        \"o\": (30, -15, 570, 454),\n        \"p\": (-1, -142, 570, 454),\n        \"q\": (20, -142, 591, 454),\n        \"r\": (47, 0, 580, 454),\n        \"s\": (68, -17, 535, 459),\n        \"t\": (47, -15, 532, 562),\n        \"u\": (-1, -15, 569, 439),\n        \"v\": (-1, 0, 601, 439),\n        \"w\": (-18, 0, 618, 439),\n        \"x\": (6, 0, 594, 439),\n        \"y\": (-4, -142, 601, 439),\n        \"z\": (81, 0, 520, 439),\n        \"braceleft\": (160, -102, 464, 616),\n        \"bar\": (255, -250, 345, 750),\n        \"braceright\": (136, -102, 440, 616),\n        \"asciitilde\": (71, 153, 530, 356),\n        \"exclamdown\": (202, -146, 398, 449),\n        \"cent\": (66, -49, 518, 614),\n        \"sterling\": (72, -28, 558, 611),\n        \"fraction\": (25, -60, 576, 661),\n        \"yen\": (10, 0, 590, 562),\n        \"florin\": (-30, -131, 572, 616),\n        \"section\": (83, -70, 517, 580),\n        \"currency\": (54, 49, 546, 517),\n        \"quotesingle\": (227, 277, 373, 562),\n        \"quotedblleft\": (71, 277, 535, 562),\n        \"guillemotleft\": (8, 70, 553, 446),\n        \"guilsinglleft\": (141, 70, 459, 446),\n        \"guilsinglright\": (141, 70, 459, 446),\n        \"fi\": (12, 0, 593, 626),\n        \"fl\": (12, 0, 593, 626),\n        \"endash\": (65, 203, 535, 313),\n        \"dagger\": (106, -70, 494, 580),\n        \"daggerdbl\": (106, -70, 494, 580),\n        \"periodcentered\": (196, 165, 404, 351),\n        \"paragraph\": (6, -70, 576, 580),\n        \"bullet\": (140, 132, 460, 430),\n        \"quotesinglbase\": (175, -142, 427, 143),\n        \"quotedblbase\": (65, -142, 529, 143),\n        \"quotedblright\": (61, 277, 525, 562),\n        \"guillemotright\": (47, 70, 592, 446),\n        \"ellipsis\": (26, -15, 574, 116),\n        \"perthousand\": 
(-113, -15, 713, 616),\n        \"questiondown\": (99, -146, 502, 449),\n        \"grave\": (132, 508, 395, 661),\n        \"acute\": (205, 508, 468, 661),\n        \"circumflex\": (103, 483, 497, 657),\n        \"tilde\": (89, 493, 512, 636),\n        \"macron\": (88, 505, 512, 585),\n        \"breve\": (83, 468, 517, 631),\n        \"dotaccent\": (230, 485, 370, 625),\n        \"dieresis\": (128, 485, 472, 625),\n        \"ring\": (198, 481, 402, 678),\n        \"cedilla\": (205, -206, 387, 0),\n        \"hungarumlaut\": (68, 488, 588, 661),\n        \"ogonek\": (169, -199, 367, 0),\n        \"caron\": (103, 493, 497, 667),\n        \"emdash\": (-10, 203, 610, 313),\n        \"AE\": (-29, 0, 602, 562),\n        \"ordfeminine\": (147, 196, 453, 580),\n        \"Lslash\": (39, 0, 578, 562),\n        \"Oslash\": (22, -22, 578, 584),\n        \"OE\": (-25, 0, 595, 562),\n        \"ordmasculine\": (147, 196, 453, 580),\n        \"ae\": (-4, -15, 601, 454),\n        \"dotlessi\": (77, 0, 523, 439),\n        \"lslash\": (77, 0, 523, 626),\n        \"oslash\": (30, -24, 570, 463),\n        \"oe\": (-18, -15, 611, 454),\n        \"germandbls\": (22, -15, 596, 626),\n        \"Scedilla\": (47, -206, 553, 582),\n        \"multiply\": (81, 39, 520, 478),\n        \"logicalnot\": (71, 103, 529, 413),\n        \"format\": (5, -146, 115, 601),\n        \"tab\": (19, 0, 581, 562),\n        \"overscore\": (0, 579, 600, 629),\n        \"IJ\": (-8, -18, 622, 562),\n        \"trademark\": (-9, 230, 749, 562),\n        \"onequarter\": (-56, -60, 656, 661),\n        \"mu\": (-1, -142, 569, 439),\n        \"minus\": (71, 203, 529, 313),\n        \"brokenbar\": (255, -175, 345, 675),\n        \"arrowleft\": (-24, 143, 634, 455),\n        \"LL\": (-45, 0, 645, 562),\n        \"arrowright\": (-34, 143, 624, 455),\n        \"thorn\": (-14, -142, 570, 626),\n        \"lira\": (72, -28, 558, 611),\n        \"arrowboth\": (-24, 143, 624, 455),\n        \"indent\": (65, 45, 535, 372),\n        
\"threesuperior\": (138, 222, 433, 616),\n        \"onehalf\": (-47, -60, 648, 661),\n        \"graybox\": (76, 0, 525, 599),\n        \"Idot\": (77, 0, 523, 748),\n        \"ll\": (-12, 0, 600, 626),\n        \"Thorn\": (48, 0, 557, 562),\n        \"Ccedilla\": (22, -206, 560, 580),\n        \"notegraphic\": (77, -15, 523, 572),\n        \"arrowup\": (144, 3, 456, 626),\n        \"down\": (137, -15, 464, 439),\n        \"plusminus\": (71, 24, 529, 515),\n        \"threequarters\": (-47, -60, 648, 661),\n        \"scedilla\": (68, -206, 535, 459),\n        \"ij\": (6, -146, 574, 658),\n        \"eth\": (58, -27, 543, 626),\n        \"merge\": (137, -15, 464, 487),\n        \"twosuperior\": (143, 230, 436, 616),\n        \"arrowdown\": (144, -15, 456, 608),\n        \"left\": (65, 44, 535, 371),\n        \"return\": (19, 0, 581, 562),\n        \"Eth\": (30, 0, 594, 562),\n        \"up\": (136, 0, 463, 447),\n        \"divide\": (71, 16, 529, 500),\n        \"prescription\": (24, -15, 599, 562),\n        \"square\": (19, 0, 581, 562),\n        \"stop\": (19, 0, 581, 562),\n        \"degree\": (86, 243, 474, 616),\n        \"ccedilla\": (40, -206, 545, 459),\n        \"onesuperior\": (153, 230, 447, 616),\n        \"largebullet\": (248, 229, 352, 333),\n        \"center\": (40, 14, 560, 580),\n        \"registered\": (0, -18, 600, 580),\n        \"copyright\": (0, -18, 600, 580),\n        \"dectab\": (8, 0, 592, 320),\n        \"space\": (0, 0, 0, 0),\n        \"Aacute\": (-9, 0, 609, 784),\n        \"Acircumflex\": (-9, 0, 609, 780),\n        \"Adieresis\": (-9, 0, 609, 748),\n        \"Agrave\": (-9, 0, 609, 784),\n        \"Aring\": (-9, 0, 609, 801),\n        \"Atilde\": (-9, 0, 609, 759),\n        \"Eacute\": (25, 0, 560, 784),\n        \"Ecircumflex\": (25, 0, 560, 780),\n        \"Edieresis\": (25, 0, 560, 748),\n        \"Egrave\": (25, 0, 560, 784),\n        \"Gcaron\": (22, -18, 594, 790),\n        \"Iacute\": (77, 0, 523, 784),\n        \"Icircumflex\": 
(77, 0, 523, 780),\n        \"Idieresis\": (77, 0, 523, 748),\n        \"Igrave\": (77, 0, 523, 784),\n        \"Ntilde\": (8, -12, 610, 759),\n        \"Oacute\": (22, -18, 578, 784),\n        \"Ocircumflex\": (22, -18, 578, 780),\n        \"Odieresis\": (22, -18, 578, 748),\n        \"Ograve\": (22, -18, 578, 784),\n        \"Otilde\": (22, -18, 578, 759),\n        \"Scaron\": (47, -22, 553, 790),\n        \"Uacute\": (4, -18, 596, 784),\n        \"Ucircumflex\": (4, -18, 596, 780),\n        \"Udieresis\": (4, -18, 596, 748),\n        \"Ugrave\": (4, -18, 596, 784),\n        \"Yacute\": (12, 0, 589, 784),\n        \"Ydieresis\": (12, 0, 589, 748),\n        \"Zcaron\": (62, 0, 539, 790),\n        \"aacute\": (35, -15, 570, 661),\n        \"acircumflex\": (35, -15, 570, 657),\n        \"adieresis\": (35, -15, 570, 625),\n        \"agrave\": (35, -15, 570, 661),\n        \"aring\": (35, -15, 570, 678),\n        \"atilde\": (35, -15, 570, 636),\n        \"eacute\": (40, -15, 563, 661),\n        \"ecircumflex\": (40, -15, 563, 657),\n        \"edieresis\": (40, -15, 563, 625),\n        \"egrave\": (40, -15, 563, 661),\n        \"gcaron\": (30, -146, 580, 667),\n        \"iacute\": (77, 0, 523, 661),\n        \"icircumflex\": (63, 0, 523, 657),\n        \"idieresis\": (77, 0, 523, 625),\n        \"igrave\": (77, 0, 523, 661),\n        \"ntilde\": (18, 0, 592, 636),\n        \"oacute\": (30, -15, 570, 661),\n        \"ocircumflex\": (30, -15, 570, 657),\n        \"odieresis\": (30, -15, 570, 625),\n        \"ograve\": (30, -15, 570, 661),\n        \"otilde\": (30, -15, 570, 636),\n        \"scaron\": (68, -17, 535, 667),\n        \"uacute\": (-1, -15, 569, 661),\n        \"ucircumflex\": (-1, -15, 569, 657),\n        \"udieresis\": (-1, -15, 569, 625),\n        \"ugrave\": (-1, -15, 569, 661),\n        \"yacute\": (-4, -142, 601, 661),\n        \"ydieresis\": (-4, -142, 601, 625),\n        \"zcaron\": (81, 0, 520, 667),\n    },\n    \"Courier\": {\n        \".notdef\": 
(0, 0, 0, 0),\n        \"exclam\": (236, -15, 364, 572),\n        \"quotedbl\": (187, 328, 413, 562),\n        \"numbersign\": (93, -32, 507, 639),\n        \"dollar\": (105, -126, 496, 662),\n        \"percent\": (81, -15, 518, 622),\n        \"ampersand\": (63, -15, 538, 543),\n        \"quoteright\": (213, 328, 376, 562),\n        \"parenleft\": (269, -108, 440, 622),\n        \"parenright\": (160, -108, 331, 622),\n        \"asterisk\": (116, 257, 484, 607),\n        \"plus\": (80, 44, 520, 470),\n        \"comma\": (181, -112, 344, 122),\n        \"hyphen\": (103, 231, 497, 285),\n        \"period\": (229, -15, 371, 109),\n        \"slash\": (125, -80, 475, 629),\n        \"zero\": (106, -15, 494, 622),\n        \"one\": (96, 0, 505, 622),\n        \"two\": (70, 0, 471, 622),\n        \"three\": (75, -15, 466, 622),\n        \"four\": (78, 0, 500, 622),\n        \"five\": (92, -15, 497, 607),\n        \"six\": (111, -15, 497, 622),\n        \"seven\": (82, 0, 483, 607),\n        \"eight\": (102, -15, 498, 622),\n        \"nine\": (96, -15, 489, 622),\n        \"colon\": (229, -15, 371, 385),\n        \"semicolon\": (181, -112, 371, 385),\n        \"less\": (41, 42, 519, 472),\n        \"equal\": (80, 138, 520, 376),\n        \"greater\": (66, 42, 544, 472),\n        \"question\": (129, -15, 492, 572),\n        \"at\": (77, -15, 533, 622),\n        \"A\": (3, 0, 597, 562),\n        \"B\": (43, 0, 559, 562),\n        \"C\": (41, -18, 540, 580),\n        \"D\": (43, 0, 574, 562),\n        \"E\": (53, 0, 550, 562),\n        \"F\": (53, 0, 545, 562),\n        \"G\": (31, -18, 575, 580),\n        \"H\": (32, 0, 568, 562),\n        \"I\": (96, 0, 504, 562),\n        \"J\": (34, -18, 566, 562),\n        \"K\": (38, 0, 582, 562),\n        \"L\": (47, 0, 554, 562),\n        \"M\": (4, 0, 596, 562),\n        \"N\": (7, -13, 593, 562),\n        \"O\": (43, -18, 557, 580),\n        \"P\": (79, 0, 558, 562),\n        \"Q\": (43, -138, 557, 580),\n        \"R\": (38, 0, 588, 
562),\n        \"S\": (72, -20, 529, 580),\n        \"T\": (38, 0, 563, 562),\n        \"U\": (17, -18, 583, 562),\n        \"V\": (-4, -13, 604, 562),\n        \"W\": (-3, -13, 603, 562),\n        \"X\": (23, 0, 577, 562),\n        \"Y\": (24, 0, 576, 562),\n        \"Z\": (86, 0, 514, 562),\n        \"bracketleft\": (269, -108, 442, 622),\n        \"backslash\": (118, -80, 482, 629),\n        \"bracketright\": (158, -108, 331, 622),\n        \"asciicircum\": (94, 354, 506, 622),\n        \"underscore\": (0, -125, 600, -75),\n        \"quoteleft\": (224, 328, 387, 562),\n        \"a\": (53, -15, 559, 441),\n        \"b\": (14, -15, 575, 629),\n        \"c\": (66, -15, 529, 441),\n        \"d\": (45, -15, 591, 629),\n        \"e\": (66, -15, 548, 441),\n        \"f\": (114, 0, 531, 629),\n        \"g\": (45, -157, 566, 441),\n        \"h\": (18, 0, 582, 629),\n        \"i\": (95, 0, 505, 657),\n        \"j\": (82, -157, 410, 657),\n        \"k\": (43, 0, 580, 629),\n        \"l\": (95, 0, 505, 629),\n        \"m\": (-5, 0, 605, 441),\n        \"n\": (26, 0, 575, 441),\n        \"o\": (62, -15, 538, 441),\n        \"p\": (9, -157, 555, 441),\n        \"q\": (45, -157, 591, 441),\n        \"r\": (60, 0, 559, 441),\n        \"s\": (80, -15, 513, 441),\n        \"t\": (87, -15, 530, 561),\n        \"u\": (21, -15, 562, 426),\n        \"v\": (10, -10, 590, 426),\n        \"w\": (-4, -10, 604, 426),\n        \"x\": (20, 0, 580, 426),\n        \"y\": (7, -157, 592, 426),\n        \"z\": (99, 0, 502, 426),\n        \"braceleft\": (182, -108, 437, 622),\n        \"bar\": (275, -250, 326, 750),\n        \"braceright\": (163, -108, 418, 622),\n        \"asciitilde\": (63, 197, 540, 320),\n        \"exclamdown\": (236, -157, 364, 430),\n        \"cent\": (96, -49, 500, 614),\n        \"sterling\": (84, -21, 521, 611),\n        \"fraction\": (92, -57, 509, 665),\n        \"yen\": (26, 0, 574, 562),\n        \"florin\": (4, -143, 539, 622),\n        \"section\": (113, -78, 488, 
580),\n        \"currency\": (73, 58, 527, 506),\n        \"quotesingle\": (259, 328, 341, 562),\n        \"quotedblleft\": (143, 328, 471, 562),\n        \"guillemotleft\": (37, 70, 563, 446),\n        \"guilsinglleft\": (149, 70, 451, 446),\n        \"guilsinglright\": (149, 70, 451, 446),\n        \"fi\": (3, 0, 597, 629),\n        \"fl\": (3, 0, 597, 629),\n        \"endash\": (75, 231, 525, 285),\n        \"dagger\": (141, -78, 459, 580),\n        \"daggerdbl\": (141, -78, 459, 580),\n        \"periodcentered\": (222, 189, 378, 327),\n        \"paragraph\": (50, -78, 511, 562),\n        \"bullet\": (172, 130, 428, 383),\n        \"quotesinglbase\": (213, -134, 376, 100),\n        \"quotedblbase\": (143, -134, 457, 100),\n        \"quotedblright\": (143, 328, 457, 562),\n        \"guillemotright\": (37, 70, 563, 446),\n        \"ellipsis\": (37, -15, 563, 111),\n        \"perthousand\": (3, -15, 600, 622),\n        \"questiondown\": (108, -157, 471, 430),\n        \"grave\": (151, 497, 378, 672),\n        \"acute\": (242, 497, 469, 672),\n        \"circumflex\": (124, 477, 476, 654),\n        \"tilde\": (105, 489, 503, 606),\n        \"macron\": (120, 525, 480, 565),\n        \"breve\": (153, 501, 447, 609),\n        \"dotaccent\": (249, 477, 352, 580),\n        \"dieresis\": (148, 492, 453, 595),\n        \"ring\": (218, 463, 382, 627),\n        \"cedilla\": (224, -151, 362, 10),\n        \"hungarumlaut\": (133, 497, 540, 672),\n        \"ogonek\": (227, -151, 370, 0),\n        \"caron\": (124, 492, 476, 669),\n        \"emdash\": (0, 231, 600, 285),\n        \"AE\": (3, 0, 550, 562),\n        \"ordfeminine\": (156, 249, 442, 580),\n        \"Lslash\": (47, 0, 554, 562),\n        \"Oslash\": (43, -80, 557, 629),\n        \"OE\": (7, 0, 567, 562),\n        \"ordmasculine\": (157, 249, 443, 580),\n        \"ae\": (19, -15, 570, 441),\n        \"dotlessi\": (95, 0, 505, 426),\n        \"lslash\": (95, 0, 505, 629),\n        \"oslash\": (62, -80, 538, 506),\n      
  \"oe\": (19, -15, 559, 441),\n        \"germandbls\": (48, -15, 588, 629),\n        \"Scedilla\": (72, -151, 529, 580),\n        \"multiply\": (87, 43, 515, 470),\n        \"logicalnot\": (87, 108, 513, 369),\n        \"format\": (5, -157, 56, 607),\n        \"tab\": (19, 0, 581, 562),\n        \"overscore\": (0, 579, 600, 629),\n        \"IJ\": (32, -18, 583, 562),\n        \"trademark\": (-23, 263, 623, 562),\n        \"onequarter\": (0, -57, 600, 665),\n        \"mu\": (21, -157, 562, 426),\n        \"minus\": (80, 232, 520, 283),\n        \"brokenbar\": (275, -175, 326, 675),\n        \"arrowleft\": (-24, 115, 624, 483),\n        \"LL\": (8, 0, 592, 562),\n        \"arrowright\": (-24, 115, 624, 483),\n        \"thorn\": (-6, -157, 555, 629),\n        \"lira\": (73, -21, 521, 611),\n        \"arrowboth\": (-28, 115, 628, 483),\n        \"indent\": (70, 68, 530, 348),\n        \"threesuperior\": (155, 240, 406, 622),\n        \"onehalf\": (0, -57, 611, 665),\n        \"graybox\": (76, 0, 525, 599),\n        \"Idot\": (96, 0, 504, 716),\n        \"ll\": (18, 0, 567, 629),\n        \"Thorn\": (79, 0, 538, 562),\n        \"Ccedilla\": (41, -151, 540, 580),\n        \"notegraphic\": (136, -15, 464, 572),\n        \"arrowup\": (116, 0, 484, 623),\n        \"down\": (160, -15, 440, 426),\n        \"plusminus\": (87, 44, 513, 558),\n        \"threequarters\": (8, -56, 593, 666),\n        \"scedilla\": (80, -151, 513, 441),\n        \"ij\": (37, -157, 490, 657),\n        \"eth\": (62, -15, 538, 629),\n        \"merge\": (160, -15, 440, 436),\n        \"twosuperior\": (177, 249, 424, 622),\n        \"arrowdown\": (116, -15, 484, 608),\n        \"left\": (70, 68, 530, 348),\n        \"return\": (19, 0, 581, 562),\n        \"Eth\": (30, 0, 574, 562),\n        \"up\": (160, 0, 440, 437),\n        \"divide\": (87, 48, 513, 467),\n        \"prescription\": (27, -15, 577, 562),\n        \"square\": (19, 0, 581, 562),\n        \"stop\": (19, 0, 581, 562),\n        \"degree\": 
(123, 269, 477, 622),\n        \"ccedilla\": (66, -151, 529, 441),\n        \"onesuperior\": (172, 249, 428, 622),\n        \"largebullet\": (261, 220, 339, 297),\n        \"center\": (40, 14, 560, 580),\n        \"registered\": (0, -18, 600, 580),\n        \"copyright\": (0, -18, 600, 580),\n        \"dectab\": (18, 0, 582, 227),\n        \"space\": (0, 0, 0, 0),\n        \"Aacute\": (3, 0, 597, 793),\n        \"Acircumflex\": (3, 0, 597, 775),\n        \"Adieresis\": (3, 0, 597, 731),\n        \"Agrave\": (3, 0, 597, 793),\n        \"Aring\": (3, 0, 597, 753),\n        \"Atilde\": (3, 0, 597, 732),\n        \"Eacute\": (53, 0, 550, 793),\n        \"Ecircumflex\": (53, 0, 550, 775),\n        \"Edieresis\": (53, 0, 550, 731),\n        \"Egrave\": (53, 0, 550, 793),\n        \"Gcaron\": (31, -18, 575, 805),\n        \"Iacute\": (96, 0, 504, 793),\n        \"Icircumflex\": (96, 0, 504, 775),\n        \"Idieresis\": (96, 0, 504, 731),\n        \"Igrave\": (96, 0, 504, 793),\n        \"Ntilde\": (7, -13, 593, 732),\n        \"Oacute\": (43, -18, 557, 793),\n        \"Ocircumflex\": (43, -18, 557, 775),\n        \"Odieresis\": (43, -18, 557, 731),\n        \"Ograve\": (43, -18, 557, 793),\n        \"Otilde\": (43, -18, 557, 732),\n        \"Scaron\": (72, -20, 529, 805),\n        \"Uacute\": (17, -18, 583, 793),\n        \"Ucircumflex\": (17, -18, 583, 775),\n        \"Udieresis\": (17, -18, 583, 731),\n        \"Ugrave\": (17, -18, 583, 793),\n        \"Yacute\": (24, 0, 576, 793),\n        \"Ydieresis\": (24, 0, 576, 731),\n        \"Zcaron\": (86, 0, 514, 805),\n        \"aacute\": (53, -15, 559, 672),\n        \"acircumflex\": (53, -15, 559, 654),\n        \"adieresis\": (53, -15, 559, 595),\n        \"agrave\": (53, -15, 559, 672),\n        \"aring\": (53, -15, 559, 627),\n        \"atilde\": (53, -15, 559, 606),\n        \"eacute\": (66, -15, 548, 672),\n        \"ecircumflex\": (66, -15, 548, 654),\n        \"edieresis\": (66, -15, 548, 595),\n        \"egrave\": 
(66, -15, 548, 672),\n        \"gcaron\": (45, -157, 566, 669),\n        \"iacute\": (95, 0, 505, 672),\n        \"icircumflex\": (94, 0, 505, 654),\n        \"idieresis\": (95, 0, 505, 595),\n        \"igrave\": (95, 0, 505, 672),\n        \"ntilde\": (26, 0, 575, 606),\n        \"oacute\": (62, -15, 538, 672),\n        \"ocircumflex\": (62, -15, 538, 654),\n        \"odieresis\": (62, -15, 538, 595),\n        \"ograve\": (62, -15, 538, 672),\n        \"otilde\": (62, -15, 538, 606),\n        \"scaron\": (80, -15, 513, 669),\n        \"uacute\": (21, -15, 562, 672),\n        \"ucircumflex\": (21, -15, 562, 654),\n        \"udieresis\": (21, -15, 562, 595),\n        \"ugrave\": (21, -15, 562, 672),\n        \"yacute\": (7, -157, 592, 672),\n        \"ydieresis\": (7, -157, 592, 595),\n        \"zcaron\": (99, 0, 502, 669),\n    },\n    \"Courier-Oblique\": {\n        \".notdef\": (0, 0, 0, 0),\n        \"exclam\": (244, -15, 464, 572),\n        \"quotedbl\": (273, 328, 532, 562),\n        \"numbersign\": (133, -32, 596, 639),\n        \"dollar\": (108, -126, 596, 662),\n        \"percent\": (134, -15, 599, 622),\n        \"ampersand\": (87, -15, 580, 543),\n        \"quoteright\": (283, 328, 495, 562),\n        \"parenleft\": (314, -108, 572, 622),\n        \"parenright\": (137, -108, 396, 622),\n        \"asterisk\": (212, 257, 580, 607),\n        \"plus\": (129, 44, 580, 470),\n        \"comma\": (157, -112, 370, 122),\n        \"hyphen\": (152, 231, 558, 285),\n        \"period\": (238, -15, 382, 109),\n        \"slash\": (112, -80, 604, 629),\n        \"zero\": (155, -15, 574, 622),\n        \"one\": (98, 0, 515, 622),\n        \"two\": (70, 0, 568, 622),\n        \"three\": (82, -15, 537, 622),\n        \"four\": (108, 0, 541, 622),\n        \"five\": (99, -15, 589, 607),\n        \"six\": (155, -15, 629, 622),\n        \"seven\": (182, 0, 612, 607),\n        \"eight\": (133, -15, 588, 622),\n        \"nine\": (93, -15, 574, 622),\n        \"colon\": (238, 
-15, 441, 385),\n        \"semicolon\": (157, -112, 441, 385),\n        \"less\": (96, 42, 610, 472),\n        \"equal\": (109, 138, 600, 376),\n        \"greater\": (85, 42, 599, 472),\n        \"question\": (222, -15, 583, 572),\n        \"at\": (127, -15, 582, 622),\n        \"A\": (3, 0, 607, 562),\n        \"B\": (43, 0, 615, 562),\n        \"C\": (94, -18, 655, 580),\n        \"D\": (43, 0, 645, 562),\n        \"E\": (53, 0, 660, 562),\n        \"F\": (53, 0, 660, 562),\n        \"G\": (84, -18, 645, 580),\n        \"H\": (32, 0, 687, 562),\n        \"I\": (96, 0, 623, 562),\n        \"J\": (52, -18, 685, 562),\n        \"K\": (38, 0, 671, 562),\n        \"L\": (47, 0, 607, 562),\n        \"M\": (4, 0, 715, 562),\n        \"N\": (7, -13, 712, 562),\n        \"O\": (95, -18, 625, 580),\n        \"P\": (79, 0, 643, 562),\n        \"Q\": (95, -138, 625, 580),\n        \"R\": (38, 0, 598, 562),\n        \"S\": (76, -20, 650, 580),\n        \"T\": (108, 0, 665, 562),\n        \"U\": (125, -18, 702, 562),\n        \"V\": (105, -13, 723, 562),\n        \"W\": (106, -13, 722, 562),\n        \"X\": (23, 0, 675, 562),\n        \"Y\": (133, 0, 695, 562),\n        \"Z\": (86, 0, 610, 562),\n        \"bracketleft\": (246, -108, 574, 622),\n        \"backslash\": (249, -80, 468, 629),\n        \"bracketright\": (135, -108, 463, 622),\n        \"asciicircum\": (175, 354, 587, 622),\n        \"underscore\": (-27, -125, 584, -75),\n        \"quoteleft\": (343, 328, 457, 562),\n        \"a\": (77, -15, 569, 441),\n        \"b\": (29, -15, 625, 629),\n        \"c\": (106, -15, 608, 441),\n        \"d\": (86, -15, 640, 629),\n        \"e\": (107, -15, 597, 441),\n        \"f\": (114, 0, 662, 629),\n        \"g\": (61, -157, 657, 441),\n        \"h\": (33, 0, 592, 629),\n        \"i\": (95, 0, 515, 657),\n        \"j\": (52, -157, 550, 657),\n        \"k\": (58, 0, 633, 629),\n        \"l\": (95, 0, 515, 629),\n        \"m\": (-5, 0, 615, 441),\n        \"n\": (26, 0, 585, 
441),\n        \"o\": (102, -15, 588, 441),\n        \"p\": (-24, -157, 605, 441),\n        \"q\": (86, -157, 682, 441),\n        \"r\": (60, 0, 636, 441),\n        \"s\": (78, -15, 584, 441),\n        \"t\": (167, -15, 561, 561),\n        \"u\": (101, -15, 572, 426),\n        \"v\": (90, -10, 681, 426),\n        \"w\": (76, -10, 695, 426),\n        \"x\": (20, 0, 655, 426),\n        \"y\": (-4, -157, 683, 426),\n        \"z\": (99, 0, 593, 426),\n        \"braceleft\": (233, -108, 569, 622),\n        \"bar\": (222, -250, 485, 750),\n        \"braceright\": (140, -108, 477, 622),\n        \"asciitilde\": (116, 197, 600, 320),\n        \"exclamdown\": (225, -157, 445, 430),\n        \"cent\": (152, -49, 588, 614),\n        \"sterling\": (124, -21, 621, 611),\n        \"fraction\": (84, -57, 646, 665),\n        \"yen\": (120, 0, 693, 562),\n        \"florin\": (-26, -143, 671, 622),\n        \"section\": (104, -78, 590, 580),\n        \"currency\": (94, 58, 628, 506),\n        \"quotesingle\": (345, 328, 460, 562),\n        \"quotedblleft\": (262, 328, 541, 562),\n        \"guillemotleft\": (92, 70, 652, 446),\n        \"guilsinglleft\": (204, 70, 540, 446),\n        \"guilsinglright\": (170, 70, 506, 446),\n        \"fi\": (3, 0, 619, 629),\n        \"fl\": (3, 0, 619, 629),\n        \"endash\": (124, 231, 586, 285),\n        \"dagger\": (217, -78, 546, 580),\n        \"daggerdbl\": (163, -78, 546, 580),\n        \"periodcentered\": (276, 189, 434, 327),\n        \"paragraph\": (100, -78, 630, 562),\n        \"bullet\": (225, 130, 485, 383),\n        \"quotesinglbase\": (185, -134, 397, 100),\n        \"quotedblbase\": (115, -134, 478, 100),\n        \"quotedblright\": (213, 328, 576, 562),\n        \"guillemotright\": (58, 70, 618, 446),\n        \"ellipsis\": (46, -15, 574, 111),\n        \"perthousand\": (59, -15, 626, 622),\n        \"questiondown\": (106, -157, 466, 430),\n        \"grave\": (294, 497, 484, 672),\n        \"acute\": (348, 497, 612, 672),\n      
  \"circumflex\": (229, 477, 581, 654),\n        \"tilde\": (212, 489, 629, 606),\n        \"macron\": (232, 525, 600, 565),\n        \"breve\": (279, 501, 576, 609),\n        \"dotaccent\": (360, 477, 465, 580),\n        \"dieresis\": (263, 492, 570, 595),\n        \"ring\": (333, 463, 499, 627),\n        \"cedilla\": (197, -151, 344, 10),\n        \"hungarumlaut\": (239, 497, 683, 672),\n        \"ogonek\": (207, -151, 348, 0),\n        \"caron\": (262, 492, 614, 669),\n        \"emdash\": (49, 231, 661, 285),\n        \"AE\": (3, 0, 655, 562),\n        \"ordfeminine\": (209, 249, 512, 580),\n        \"Lslash\": (47, 0, 607, 562),\n        \"Oslash\": (95, -80, 625, 629),\n        \"OE\": (60, 0, 672, 562),\n        \"ordmasculine\": (210, 249, 534, 580),\n        \"ae\": (42, -15, 626, 441),\n        \"dotlessi\": (95, 0, 515, 426),\n        \"lslash\": (95, 0, 583, 629),\n        \"oslash\": (102, -80, 588, 506),\n        \"oe\": (55, -15, 615, 441),\n        \"germandbls\": (48, -15, 617, 629),\n        \"Scedilla\": (76, -151, 650, 580),\n        \"multiply\": (103, 43, 607, 470),\n        \"logicalnot\": (155, 108, 591, 369),\n        \"format\": (-28, -157, 185, 607),\n        \"tab\": (19, 0, 641, 562),\n        \"overscore\": (123, 579, 734, 629),\n        \"IJ\": (32, -18, 702, 562),\n        \"trademark\": (75, 263, 742, 562),\n        \"onequarter\": (65, -57, 674, 665),\n        \"mu\": (72, -157, 572, 426),\n        \"minus\": (129, 232, 580, 283),\n        \"brokenbar\": (238, -175, 469, 675),\n        \"arrowleft\": (40, 115, 693, 483),\n        \"LL\": (8, 0, 647, 562),\n        \"arrowright\": (34, 115, 688, 483),\n        \"thorn\": (-24, -157, 605, 629),\n        \"lira\": (118, -21, 621, 611),\n        \"arrowboth\": (36, 115, 692, 483),\n        \"indent\": (108, 68, 574, 348),\n        \"threesuperior\": (213, 240, 500, 622),\n        \"onehalf\": (65, -57, 669, 665),\n        \"graybox\": (76, 0, 652, 599),\n        \"Idot\": (96, 0, 623, 
716),\n        \"ll\": (33, 0, 616, 629),\n        \"Thorn\": (79, 0, 605, 562),\n        \"Ccedilla\": (94, -151, 658, 580),\n        \"notegraphic\": (144, -15, 564, 572),\n        \"arrowup\": (209, 0, 577, 623),\n        \"down\": (187, -15, 467, 426),\n        \"plusminus\": (96, 44, 594, 558),\n        \"threequarters\": (73, -56, 659, 666),\n        \"scedilla\": (78, -151, 584, 441),\n        \"ij\": (37, -157, 630, 657),\n        \"eth\": (102, -15, 639, 629),\n        \"merge\": (187, -15, 503, 436),\n        \"twosuperior\": (230, 249, 534, 622),\n        \"arrowdown\": (152, -15, 520, 608),\n        \"left\": (114, 68, 580, 348),\n        \"return\": (79, 0, 700, 562),\n        \"Eth\": (43, 0, 645, 562),\n        \"up\": (223, 0, 503, 437),\n        \"divide\": (136, 48, 573, 467),\n        \"prescription\": (27, -15, 617, 562),\n        \"square\": (19, 0, 700, 562),\n        \"stop\": (19, 0, 700, 562),\n        \"degree\": (214, 269, 575, 622),\n        \"ccedilla\": (106, -151, 614, 441),\n        \"onesuperior\": (231, 249, 491, 622),\n        \"largebullet\": (316, 220, 394, 297),\n        \"center\": (103, 14, 623, 580),\n        \"registered\": (54, -18, 666, 580),\n        \"copyright\": (54, -18, 666, 580),\n        \"dectab\": (18, 0, 593, 227),\n        \"space\": (0, 0, 0, 0),\n        \"Aacute\": (3, 0, 658, 793),\n        \"Acircumflex\": (3, 0, 607, 775),\n        \"Adieresis\": (3, 0, 607, 731),\n        \"Agrave\": (3, 0, 607, 793),\n        \"Aring\": (3, 0, 607, 753),\n        \"Atilde\": (3, 0, 656, 732),\n        \"Eacute\": (53, 0, 668, 793),\n        \"Ecircumflex\": (53, 0, 660, 775),\n        \"Edieresis\": (53, 0, 660, 731),\n        \"Egrave\": (53, 0, 660, 793),\n        \"Gcaron\": (84, -18, 645, 805),\n        \"Iacute\": (96, 0, 638, 793),\n        \"Icircumflex\": (96, 0, 623, 775),\n        \"Idieresis\": (96, 0, 623, 731),\n        \"Igrave\": (96, 0, 623, 793),\n        \"Ntilde\": (7, -13, 712, 732),\n        
\"Oacute\": (95, -18, 638, 793),\n        \"Ocircumflex\": (95, -18, 625, 775),\n        \"Odieresis\": (95, -18, 625, 731),\n        \"Ograve\": (95, -18, 625, 793),\n        \"Otilde\": (95, -18, 656, 732),\n        \"Scaron\": (76, -20, 673, 805),\n        \"Uacute\": (125, -18, 702, 793),\n        \"Ucircumflex\": (125, -18, 702, 775),\n        \"Udieresis\": (125, -18, 702, 731),\n        \"Ugrave\": (125, -18, 702, 793),\n        \"Yacute\": (133, 0, 695, 793),\n        \"Ydieresis\": (133, 0, 695, 731),\n        \"Zcaron\": (86, 0, 643, 805),\n        \"aacute\": (77, -15, 612, 672),\n        \"acircumflex\": (77, -15, 581, 654),\n        \"adieresis\": (77, -15, 570, 595),\n        \"agrave\": (77, -15, 569, 672),\n        \"aring\": (77, -15, 569, 627),\n        \"atilde\": (77, -15, 629, 606),\n        \"eacute\": (107, -15, 612, 672),\n        \"ecircumflex\": (107, -15, 597, 654),\n        \"edieresis\": (107, -15, 597, 595),\n        \"egrave\": (107, -15, 597, 672),\n        \"gcaron\": (61, -157, 657, 669),\n        \"iacute\": (95, 0, 612, 672),\n        \"icircumflex\": (95, 0, 551, 654),\n        \"idieresis\": (95, 0, 540, 595),\n        \"igrave\": (95, 0, 515, 672),\n        \"ntilde\": (26, 0, 629, 606),\n        \"oacute\": (102, -15, 612, 672),\n        \"ocircumflex\": (102, -15, 588, 654),\n        \"odieresis\": (102, -15, 588, 595),\n        \"ograve\": (102, -15, 588, 672),\n        \"otilde\": (102, -15, 629, 606),\n        \"scaron\": (78, -15, 614, 669),\n        \"uacute\": (101, -15, 602, 672),\n        \"ucircumflex\": (101, -15, 572, 654),\n        \"udieresis\": (101, -15, 572, 595),\n        \"ugrave\": (101, -15, 572, 672),\n        \"yacute\": (-4, -157, 683, 672),\n        \"ydieresis\": (-4, -157, 683, 595),\n        \"zcaron\": (99, 0, 624, 669),\n    },\n    \"Helvetica-BoldOblique\": {\n        \".notdef\": (0, 0, 0, 0),\n        \"exclam\": (94, 0, 397, 718),\n        \"quotedbl\": (193, 447, 529, 718),\n        
\"numbersign\": (60, 0, 644, 698),\n        \"dollar\": (67, -115, 621, 775),\n        \"percent\": (137, -19, 900, 710),\n        \"ampersand\": (89, -19, 732, 718),\n        \"quoteright\": (167, 445, 362, 718),\n        \"parenleft\": (76, -208, 470, 734),\n        \"parenright\": (-25, -208, 368, 734),\n        \"asterisk\": (146, 387, 481, 718),\n        \"plus\": (82, 0, 610, 506),\n        \"comma\": (28, -168, 245, 146),\n        \"hyphen\": (73, 215, 379, 345),\n        \"period\": (64, 0, 245, 146),\n        \"slash\": (-37, -19, 468, 737),\n        \"zero\": (87, -19, 617, 710),\n        \"one\": (173, 0, 529, 710),\n        \"two\": (26, 0, 619, 710),\n        \"three\": (66, -19, 608, 710),\n        \"four\": (60, 0, 598, 710),\n        \"five\": (64, -19, 636, 698),\n        \"six\": (86, -19, 619, 710),\n        \"seven\": (125, 0, 676, 698),\n        \"eight\": (70, -19, 615, 710),\n        \"nine\": (78, -19, 615, 710),\n        \"colon\": (92, 0, 351, 512),\n        \"semicolon\": (56, -168, 351, 512),\n        \"less\": (82, -8, 655, 514),\n        \"equal\": (58, 87, 633, 419),\n        \"greater\": (36, -8, 609, 514),\n        \"question\": (165, 0, 670, 727),\n        \"at\": (186, -19, 953, 737),\n        \"A\": (20, 0, 702, 718),\n        \"B\": (76, 0, 763, 718),\n        \"C\": (107, -19, 788, 737),\n        \"D\": (76, 0, 777, 718),\n        \"E\": (76, 0, 757, 718),\n        \"F\": (76, 0, 740, 718),\n        \"G\": (108, -19, 816, 737),\n        \"H\": (71, 0, 804, 718),\n        \"I\": (64, 0, 367, 718),\n        \"J\": (60, -18, 637, 718),\n        \"K\": (87, 0, 858, 718),\n        \"L\": (76, 0, 611, 718),\n        \"M\": (69, 0, 918, 718),\n        \"N\": (69, 0, 807, 718),\n        \"O\": (108, -19, 823, 737),\n        \"P\": (76, 0, 737, 718),\n        \"Q\": (108, -52, 823, 737),\n        \"R\": (76, 0, 778, 718),\n        \"S\": (81, -19, 717, 737),\n        \"T\": (140, 0, 751, 718),\n        \"U\": (116, -19, 804, 718),\n     
   \"V\": (172, 0, 801, 718),\n        \"W\": (169, 0, 1082, 718),\n        \"X\": (14, 0, 791, 718),\n        \"Y\": (168, 0, 806, 718),\n        \"Z\": (25, 0, 737, 718),\n        \"bracketleft\": (21, -196, 462, 722),\n        \"backslash\": (124, -19, 307, 737),\n        \"bracketright\": (-18, -196, 423, 722),\n        \"asciicircum\": (131, 323, 591, 698),\n        \"underscore\": (-27, -125, 540, -75),\n        \"quoteleft\": (165, 454, 361, 727),\n        \"a\": (55, -14, 582, 546),\n        \"b\": (61, -14, 645, 718),\n        \"c\": (79, -14, 599, 546),\n        \"d\": (83, -14, 704, 718),\n        \"e\": (71, -14, 592, 546),\n        \"f\": (87, 0, 469, 727),\n        \"g\": (39, -217, 666, 546),\n        \"h\": (65, 0, 629, 718),\n        \"i\": (69, 0, 363, 725),\n        \"j\": (-42, -214, 363, 725),\n        \"k\": (69, 0, 670, 718),\n        \"l\": (69, 0, 362, 718),\n        \"m\": (64, 0, 909, 546),\n        \"n\": (65, 0, 629, 546),\n        \"o\": (83, -14, 643, 546),\n        \"p\": (18, -207, 645, 546),\n        \"q\": (81, -207, 665, 546),\n        \"r\": (64, 0, 489, 546),\n        \"s\": (63, -14, 584, 546),\n        \"t\": (101, -6, 422, 676),\n        \"u\": (99, -14, 658, 532),\n        \"v\": (126, 0, 656, 532),\n        \"w\": (123, 0, 882, 532),\n        \"x\": (15, 0, 648, 532),\n        \"y\": (42, -214, 652, 532),\n        \"z\": (20, 0, 583, 532),\n        \"braceleft\": (94, -196, 518, 722),\n        \"bar\": (80, -19, 353, 737),\n        \"braceright\": (-18, -196, 407, 722),\n        \"asciitilde\": (115, 163, 577, 343),\n        \"exclamdown\": (50, -186, 353, 532),\n        \"cent\": (79, -118, 599, 628),\n        \"sterling\": (50, -16, 635, 718),\n        \"fraction\": (-174, -19, 487, 710),\n        \"yen\": (60, 0, 713, 698),\n        \"florin\": (-50, -210, 669, 737),\n        \"section\": (61, -184, 598, 727),\n        \"currency\": (27, 76, 680, 636),\n        \"quotesingle\": (165, 447, 321, 718),\n        
\"quotedblleft\": (160, 454, 588, 727),\n        \"guillemotleft\": (135, 76, 571, 484),\n        \"guilsinglleft\": (130, 76, 353, 484),\n        \"guilsinglright\": (99, 76, 322, 484),\n        \"fi\": (87, 0, 696, 727),\n        \"fl\": (87, 0, 695, 727),\n        \"endash\": (48, 227, 627, 333),\n        \"dagger\": (118, -171, 626, 718),\n        \"daggerdbl\": (46, -171, 628, 718),\n        \"periodcentered\": (111, 172, 275, 334),\n        \"paragraph\": (99, -191, 688, 700),\n        \"bullet\": (84, 194, 420, 524),\n        \"quotesinglbase\": (41, -146, 236, 127),\n        \"quotedblbase\": (36, -146, 463, 127),\n        \"quotedblright\": (162, 445, 589, 718),\n        \"guillemotright\": (104, 76, 540, 484),\n        \"ellipsis\": (92, 0, 939, 146),\n        \"perthousand\": (76, -19, 1038, 710),\n        \"questiondown\": (54, -195, 559, 532),\n        \"grave\": (136, 604, 353, 750),\n        \"acute\": (236, 604, 515, 750),\n        \"circumflex\": (118, 604, 471, 750),\n        \"tilde\": (113, 610, 507, 737),\n        \"macron\": (122, 604, 483, 678),\n        \"breve\": (156, 604, 494, 750),\n        \"dotaccent\": (235, 614, 385, 729),\n        \"dieresis\": (137, 614, 482, 729),\n        \"ring\": (200, 568, 420, 776),\n        \"cedilla\": (-37, -228, 219, 0),\n        \"hungarumlaut\": (137, 604, 645, 750),\n        \"ogonek\": (41, -228, 264, 0),\n        \"caron\": (149, 604, 502, 750),\n        \"emdash\": (48, 227, 1071, 333),\n        \"AE\": (5, 0, 1100, 718),\n        \"ordfeminine\": (92, 276, 464, 737),\n        \"Lslash\": (34, 0, 611, 718),\n        \"Oslash\": (35, -27, 894, 745),\n        \"OE\": (99, -19, 1114, 737),\n        \"ordmasculine\": (92, 276, 484, 737),\n        \"ae\": (56, -14, 922, 546),\n        \"dotlessi\": (69, 0, 322, 532),\n        \"lslash\": (40, 0, 407, 718),\n        \"oslash\": (22, -29, 701, 560),\n        \"oe\": (83, -14, 976, 546),\n        \"germandbls\": (69, -14, 657, 731),\n        
\"onesuperior\": (148, 283, 388, 710),\n        \"logicalnot\": (105, 108, 633, 419),\n        \"mu\": (22, -207, 658, 532),\n        \"trademark\": (179, 306, 1109, 718),\n        \"Eth\": (62, 0, 777, 718),\n        \"onehalf\": (132, -19, 858, 710),\n        \"plusminus\": (40, 0, 625, 506),\n        \"Thorn\": (76, 0, 715, 718),\n        \"onequarter\": (132, -19, 806, 710),\n        \"divide\": (82, -42, 610, 548),\n        \"brokenbar\": (80, -19, 353, 737),\n        \"degree\": (175, 426, 467, 712),\n        \"thorn\": (18, -208, 645, 718),\n        \"threequarters\": (100, -19, 839, 710),\n        \"twosuperior\": (69, 283, 448, 710),\n        \"registered\": (56, -19, 834, 737),\n        \"minus\": (82, 197, 610, 309),\n        \"eth\": (82, -14, 670, 737),\n        \"multiply\": (57, 1, 635, 505),\n        \"threesuperior\": (92, 271, 440, 710),\n        \"copyright\": (57, -19, 835, 737),\n        \"space\": (0, 0, 0, 0),\n        \"Aacute\": (20, 0, 750, 936),\n        \"Acircumflex\": (20, 0, 706, 936),\n        \"Adieresis\": (20, 0, 716, 915),\n        \"Agrave\": (20, 0, 702, 936),\n        \"Aring\": (20, 0, 702, 962),\n        \"Atilde\": (20, 0, 741, 923),\n        \"Ccedilla\": (107, -228, 788, 737),\n        \"Eacute\": (76, 0, 757, 936),\n        \"Ecircumflex\": (76, 0, 757, 936),\n        \"Edieresis\": (76, 0, 757, 915),\n        \"Egrave\": (76, 0, 757, 936),\n        \"Iacute\": (64, 0, 528, 936),\n        \"Icircumflex\": (64, 0, 484, 936),\n        \"Idieresis\": (64, 0, 494, 915),\n        \"Igrave\": (64, 0, 367, 936),\n        \"Ntilde\": (69, 0, 807, 923),\n        \"Oacute\": (108, -19, 823, 936),\n        \"Ocircumflex\": (108, -19, 823, 936),\n        \"Odieresis\": (108, -19, 823, 915),\n        \"Ograve\": (108, -19, 823, 936),\n        \"Otilde\": (108, -19, 823, 923),\n        \"Scaron\": (81, -19, 717, 936),\n        \"Uacute\": (116, -19, 804, 936),\n        \"Ucircumflex\": (116, -19, 804, 936),\n        \"Udieresis\": 
(116, -19, 804, 915),\n        \"Ugrave\": (116, -19, 804, 936),\n        \"Yacute\": (168, 0, 806, 936),\n        \"Ydieresis\": (168, 0, 806, 915),\n        \"Zcaron\": (25, 0, 737, 936),\n        \"aacute\": (55, -14, 627, 750),\n        \"acircumflex\": (55, -14, 583, 750),\n        \"adieresis\": (55, -14, 594, 729),\n        \"agrave\": (55, -14, 582, 750),\n        \"aring\": (55, -14, 582, 776),\n        \"atilde\": (55, -14, 619, 737),\n        \"ccedilla\": (79, -228, 599, 546),\n        \"eacute\": (71, -14, 627, 750),\n        \"ecircumflex\": (71, -14, 592, 750),\n        \"edieresis\": (71, -14, 594, 729),\n        \"egrave\": (71, -14, 592, 750),\n        \"iacute\": (69, 0, 488, 750),\n        \"icircumflex\": (69, 0, 444, 750),\n        \"idieresis\": (69, 0, 455, 729),\n        \"igrave\": (69, 0, 326, 750),\n        \"ntilde\": (65, 0, 646, 737),\n        \"oacute\": (83, -14, 654, 750),\n        \"ocircumflex\": (83, -14, 643, 750),\n        \"odieresis\": (83, -14, 643, 729),\n        \"ograve\": (83, -14, 643, 750),\n        \"otilde\": (83, -14, 646, 737),\n        \"scaron\": (63, -14, 614, 750),\n        \"uacute\": (99, -14, 658, 750),\n        \"ucircumflex\": (99, -14, 658, 750),\n        \"udieresis\": (99, -14, 658, 729),\n        \"ugrave\": (99, -14, 658, 750),\n        \"yacute\": (42, -214, 652, 750),\n        \"ydieresis\": (42, -214, 652, 729),\n        \"zcaron\": (20, 0, 586, 750),\n    },\n    \"Helvetica-Bold\": {\n        \".notdef\": (0, 0, 0, 0),\n        \"exclam\": (90, 0, 244, 718),\n        \"quotedbl\": (98, 447, 376, 718),\n        \"numbersign\": (18, 0, 538, 698),\n        \"dollar\": (30, -115, 523, 775),\n        \"percent\": (28, -19, 861, 710),\n        \"ampersand\": (54, -19, 701, 718),\n        \"quoteright\": (69, 445, 209, 718),\n        \"parenleft\": (35, -208, 314, 734),\n        \"parenright\": (19, -208, 298, 734),\n        \"asterisk\": (27, 387, 362, 718),\n        \"plus\": (40, 0, 544, 506),\n     
   \"comma\": (64, -168, 214, 146),\n        \"hyphen\": (27, 215, 306, 345),\n        \"period\": (64, 0, 214, 146),\n        \"slash\": (-33, -19, 311, 737),\n        \"zero\": (32, -19, 524, 710),\n        \"one\": (69, 0, 378, 710),\n        \"two\": (26, 0, 511, 710),\n        \"three\": (27, -19, 516, 710),\n        \"four\": (27, 0, 526, 710),\n        \"five\": (27, -19, 516, 698),\n        \"six\": (31, -19, 520, 710),\n        \"seven\": (25, 0, 528, 698),\n        \"eight\": (32, -19, 524, 710),\n        \"nine\": (30, -19, 522, 710),\n        \"colon\": (92, 0, 242, 512),\n        \"semicolon\": (92, -168, 242, 512),\n        \"less\": (38, -8, 546, 514),\n        \"equal\": (40, 87, 544, 419),\n        \"greater\": (38, -8, 546, 514),\n        \"question\": (60, 0, 556, 727),\n        \"at\": (118, -19, 856, 737),\n        \"A\": (20, 0, 702, 718),\n        \"B\": (76, 0, 669, 718),\n        \"C\": (44, -19, 684, 737),\n        \"D\": (76, 0, 685, 718),\n        \"E\": (76, 0, 621, 718),\n        \"F\": (76, 0, 587, 718),\n        \"G\": (44, -19, 713, 737),\n        \"H\": (71, 0, 651, 718),\n        \"I\": (64, 0, 214, 718),\n        \"J\": (22, -18, 484, 718),\n        \"K\": (87, 0, 722, 718),\n        \"L\": (76, 0, 583, 718),\n        \"M\": (69, 0, 765, 718),\n        \"N\": (69, 0, 654, 718),\n        \"O\": (44, -19, 734, 737),\n        \"P\": (76, 0, 627, 718),\n        \"Q\": (44, -52, 737, 737),\n        \"R\": (76, 0, 677, 718),\n        \"S\": (39, -19, 629, 737),\n        \"T\": (14, 0, 598, 718),\n        \"U\": (72, -19, 651, 718),\n        \"V\": (19, 0, 648, 718),\n        \"W\": (16, 0, 929, 718),\n        \"X\": (14, 0, 653, 718),\n        \"Y\": (15, 0, 653, 718),\n        \"Z\": (25, 0, 586, 718),\n        \"bracketleft\": (63, -196, 309, 722),\n        \"backslash\": (-33, -19, 311, 737),\n        \"bracketright\": (24, -196, 270, 722),\n        \"asciicircum\": (62, 323, 522, 698),\n        \"underscore\": (0, -125, 556, 
-75),\n        \"quoteleft\": (69, 454, 209, 727),\n        \"a\": (29, -14, 527, 546),\n        \"b\": (61, -14, 578, 718),\n        \"c\": (34, -14, 524, 546),\n        \"d\": (34, -14, 551, 718),\n        \"e\": (23, -14, 528, 546),\n        \"f\": (10, 0, 318, 727),\n        \"g\": (40, -217, 553, 546),\n        \"h\": (65, 0, 546, 718),\n        \"i\": (69, 0, 209, 725),\n        \"j\": (3, -214, 209, 725),\n        \"k\": (69, 0, 562, 718),\n        \"l\": (69, 0, 209, 718),\n        \"m\": (64, 0, 826, 546),\n        \"n\": (65, 0, 546, 546),\n        \"o\": (34, -14, 578, 546),\n        \"p\": (62, -207, 578, 546),\n        \"q\": (34, -207, 552, 546),\n        \"r\": (64, 0, 373, 546),\n        \"s\": (30, -14, 519, 546),\n        \"t\": (10, -6, 309, 676),\n        \"u\": (66, -14, 545, 532),\n        \"v\": (13, 0, 543, 532),\n        \"w\": (10, 0, 769, 532),\n        \"x\": (15, 0, 541, 532),\n        \"y\": (10, -214, 539, 532),\n        \"z\": (20, 0, 480, 532),\n        \"braceleft\": (48, -196, 365, 722),\n        \"bar\": (84, -19, 196, 737),\n        \"braceright\": (24, -196, 341, 722),\n        \"asciitilde\": (61, 163, 523, 343),\n        \"exclamdown\": (90, -186, 244, 532),\n        \"cent\": (34, -118, 524, 628),\n        \"sterling\": (28, -16, 541, 718),\n        \"fraction\": (-170, -19, 336, 710),\n        \"yen\": (-9, 0, 565, 698),\n        \"florin\": (-10, -210, 516, 737),\n        \"section\": (34, -184, 522, 727),\n        \"currency\": (-3, 76, 559, 636),\n        \"quotesingle\": (70, 447, 168, 718),\n        \"quotedblleft\": (64, 454, 436, 727),\n        \"guillemotleft\": (88, 76, 468, 484),\n        \"guilsinglleft\": (83, 76, 250, 484),\n        \"guilsinglright\": (83, 76, 250, 484),\n        \"fi\": (10, 0, 542, 727),\n        \"fl\": (10, 0, 542, 727),\n        \"endash\": (0, 227, 556, 333),\n        \"dagger\": (36, -171, 520, 718),\n        \"daggerdbl\": (36, -171, 520, 718),\n        \"periodcentered\": (58, 172, 
220, 334),\n        \"paragraph\": (-8, -191, 539, 700),\n        \"bullet\": (10, 194, 340, 524),\n        \"quotesinglbase\": (69, -146, 209, 127),\n        \"quotedblbase\": (64, -146, 436, 127),\n        \"quotedblright\": (64, 445, 436, 718),\n        \"guillemotright\": (88, 76, 468, 484),\n        \"ellipsis\": (92, 0, 908, 146),\n        \"perthousand\": (-3, -19, 1003, 710),\n        \"questiondown\": (55, -195, 551, 532),\n        \"grave\": (-23, 604, 225, 750),\n        \"acute\": (108, 604, 356, 750),\n        \"circumflex\": (-10, 604, 343, 750),\n        \"tilde\": (-17, 610, 350, 737),\n        \"macron\": (-6, 604, 339, 678),\n        \"breve\": (-2, 604, 335, 750),\n        \"dotaccent\": (104, 614, 230, 729),\n        \"dieresis\": (6, 614, 327, 729),\n        \"ring\": (59, 568, 275, 776),\n        \"cedilla\": (6, -228, 245, 0),\n        \"hungarumlaut\": (9, 604, 486, 750),\n        \"ogonek\": (71, -228, 304, 0),\n        \"caron\": (-10, 604, 343, 750),\n        \"emdash\": (0, 227, 1000, 333),\n        \"AE\": (5, 0, 954, 718),\n        \"ordfeminine\": (22, 276, 347, 737),\n        \"Lslash\": (-20, 0, 583, 718),\n        \"Oslash\": (33, -27, 744, 745),\n        \"OE\": (37, -19, 961, 737),\n        \"ordmasculine\": (6, 276, 360, 737),\n        \"ae\": (29, -14, 858, 546),\n        \"dotlessi\": (69, 0, 209, 532),\n        \"lslash\": (-18, 0, 296, 718),\n        \"oslash\": (22, -29, 589, 560),\n        \"oe\": (34, -14, 912, 546),\n        \"germandbls\": (69, -14, 579, 731),\n        \"onesuperior\": (26, 283, 237, 710),\n        \"logicalnot\": (40, 108, 544, 419),\n        \"mu\": (66, -207, 545, 532),\n        \"trademark\": (44, 306, 956, 718),\n        \"Eth\": (-5, 0, 685, 718),\n        \"onehalf\": (26, -19, 794, 710),\n        \"plusminus\": (40, 0, 544, 506),\n        \"Thorn\": (76, 0, 627, 718),\n        \"onequarter\": (26, -19, 766, 710),\n        \"divide\": (40, -42, 544, 548),\n        \"brokenbar\": (84, -19, 196, 
737),\n        \"degree\": (57, 426, 343, 712),\n        \"thorn\": (62, -208, 578, 718),\n        \"threequarters\": (16, -19, 799, 710),\n        \"twosuperior\": (9, 283, 324, 710),\n        \"registered\": (-11, -19, 748, 737),\n        \"minus\": (40, 197, 544, 309),\n        \"eth\": (34, -14, 578, 737),\n        \"multiply\": (40, 1, 545, 505),\n        \"threesuperior\": (8, 271, 326, 710),\n        \"copyright\": (-11, -19, 749, 737),\n        \"space\": (0, 0, 0, 0),\n        \"Aacute\": (20, 0, 702, 936),\n        \"Acircumflex\": (20, 0, 702, 936),\n        \"Adieresis\": (20, 0, 702, 915),\n        \"Agrave\": (20, 0, 702, 936),\n        \"Aring\": (20, 0, 702, 962),\n        \"Atilde\": (20, 0, 702, 923),\n        \"Ccedilla\": (44, -228, 684, 737),\n        \"Eacute\": (76, 0, 621, 936),\n        \"Ecircumflex\": (76, 0, 621, 936),\n        \"Edieresis\": (76, 0, 621, 915),\n        \"Egrave\": (76, 0, 621, 936),\n        \"Iacute\": (64, 0, 329, 936),\n        \"Icircumflex\": (-37, 0, 316, 936),\n        \"Idieresis\": (-21, 0, 300, 915),\n        \"Igrave\": (-50, 0, 214, 936),\n        \"Ntilde\": (69, 0, 654, 923),\n        \"Oacute\": (44, -19, 734, 936),\n        \"Ocircumflex\": (44, -19, 734, 936),\n        \"Odieresis\": (44, -19, 734, 915),\n        \"Ograve\": (44, -19, 734, 936),\n        \"Otilde\": (44, -19, 734, 923),\n        \"Scaron\": (39, -19, 629, 936),\n        \"Uacute\": (72, -19, 651, 936),\n        \"Ucircumflex\": (72, -19, 651, 936),\n        \"Udieresis\": (72, -19, 651, 915),\n        \"Ugrave\": (72, -19, 651, 936),\n        \"Yacute\": (15, 0, 653, 936),\n        \"Ydieresis\": (15, 0, 653, 915),\n        \"Zcaron\": (25, 0, 586, 936),\n        \"aacute\": (29, -14, 527, 750),\n        \"acircumflex\": (29, -14, 527, 750),\n        \"adieresis\": (29, -14, 527, 729),\n        \"agrave\": (29, -14, 527, 750),\n        \"aring\": (29, -14, 527, 776),\n        \"atilde\": (29, -14, 527, 737),\n        \"ccedilla\": (34, 
-228, 524, 546),\n        \"eacute\": (23, -14, 528, 750),\n        \"ecircumflex\": (23, -14, 528, 750),\n        \"edieresis\": (23, -14, 528, 729),\n        \"egrave\": (23, -14, 528, 750),\n        \"iacute\": (69, 0, 329, 750),\n        \"icircumflex\": (-37, 0, 316, 750),\n        \"idieresis\": (-21, 0, 300, 729),\n        \"igrave\": (-50, 0, 209, 750),\n        \"ntilde\": (65, 0, 546, 737),\n        \"oacute\": (34, -14, 578, 750),\n        \"ocircumflex\": (34, -14, 578, 750),\n        \"odieresis\": (34, -14, 578, 729),\n        \"ograve\": (34, -14, 578, 750),\n        \"otilde\": (34, -14, 578, 737),\n        \"scaron\": (30, -14, 519, 750),\n        \"uacute\": (66, -14, 545, 750),\n        \"ucircumflex\": (66, -14, 545, 750),\n        \"udieresis\": (66, -14, 545, 729),\n        \"ugrave\": (66, -14, 545, 750),\n        \"yacute\": (10, -214, 539, 750),\n        \"ydieresis\": (10, -214, 539, 729),\n        \"zcaron\": (20, 0, 480, 750),\n    },\n    \"Helvetica-Oblique\": {\n        \".notdef\": (0, 0, 0, 0),\n        \"exclam\": (90, 0, 340, 718),\n        \"quotedbl\": (168, 463, 438, 718),\n        \"numbersign\": (73, 0, 631, 688),\n        \"dollar\": (69, -115, 617, 775),\n        \"percent\": (147, -19, 888, 703),\n        \"ampersand\": (78, -15, 647, 718),\n        \"quoteright\": (151, 463, 310, 718),\n        \"parenleft\": (108, -207, 454, 733),\n        \"parenright\": (-9, -207, 336, 733),\n        \"asterisk\": (165, 431, 475, 718),\n        \"plus\": (85, 0, 606, 505),\n        \"comma\": (56, -147, 214, 106),\n        \"hyphen\": (93, 232, 357, 322),\n        \"period\": (87, 0, 214, 106),\n        \"slash\": (-21, -19, 452, 737),\n        \"zero\": (94, -19, 607, 703),\n        \"one\": (207, 0, 508, 703),\n        \"two\": (26, 0, 617, 703),\n        \"three\": (75, -19, 609, 703),\n        \"four\": (61, 0, 576, 703),\n        \"five\": (68, -19, 621, 688),\n        \"six\": (91, -19, 615, 703),\n        \"seven\": (137, 0, 
669, 688),\n        \"eight\": (74, -19, 606, 703),\n        \"nine\": (83, -19, 608, 703),\n        \"colon\": (87, 0, 301, 516),\n        \"semicolon\": (56, -147, 301, 516),\n        \"less\": (94, 11, 641, 495),\n        \"equal\": (63, 115, 628, 390),\n        \"greater\": (50, 11, 597, 495),\n        \"question\": (161, 0, 610, 727),\n        \"at\": (215, -19, 964, 737),\n        \"A\": (14, 0, 654, 718),\n        \"B\": (74, 0, 711, 718),\n        \"C\": (108, -19, 781, 737),\n        \"D\": (81, 0, 763, 718),\n        \"E\": (86, 0, 762, 718),\n        \"F\": (86, 0, 736, 718),\n        \"G\": (111, -19, 798, 737),\n        \"H\": (77, 0, 799, 718),\n        \"I\": (91, 0, 341, 718),\n        \"J\": (47, -19, 581, 718),\n        \"K\": (76, 0, 808, 718),\n        \"L\": (76, 0, 555, 718),\n        \"M\": (73, 0, 914, 718),\n        \"N\": (76, 0, 799, 718),\n        \"O\": (105, -19, 825, 737),\n        \"P\": (86, 0, 736, 718),\n        \"Q\": (105, -56, 825, 737),\n        \"R\": (88, 0, 773, 718),\n        \"S\": (90, -19, 712, 737),\n        \"T\": (148, 0, 750, 718),\n        \"U\": (124, -19, 797, 718),\n        \"V\": (173, 0, 800, 718),\n        \"W\": (169, 0, 1081, 718),\n        \"X\": (19, 0, 790, 718),\n        \"Y\": (167, 0, 806, 718),\n        \"Z\": (23, 0, 741, 718),\n        \"bracketleft\": (21, -196, 403, 722),\n        \"backslash\": (140, -19, 291, 737),\n        \"bracketright\": (-14, -196, 368, 722),\n        \"asciicircum\": (42, 264, 539, 688),\n        \"underscore\": (-27, -125, 540, -75),\n        \"quoteleft\": (165, 470, 323, 725),\n        \"a\": (62, -15, 558, 538),\n        \"b\": (58, -15, 584, 718),\n        \"c\": (75, -15, 553, 538),\n        \"d\": (84, -15, 652, 718),\n        \"e\": (85, -15, 578, 538),\n        \"f\": (86, 0, 416, 728),\n        \"g\": (42, -220, 610, 538),\n        \"h\": (65, 0, 572, 718),\n        \"i\": (67, 0, 308, 718),\n        \"j\": (-60, -210, 308, 718),\n        \"k\": (67, 0, 600, 
718),\n        \"l\": (67, 0, 308, 718),\n        \"m\": (65, 0, 851, 538),\n        \"n\": (65, 0, 572, 538),\n        \"o\": (84, -14, 584, 538),\n        \"p\": (14, -207, 584, 538),\n        \"q\": (84, -207, 605, 538),\n        \"r\": (77, 0, 446, 538),\n        \"s\": (64, -15, 529, 538),\n        \"t\": (103, -7, 368, 669),\n        \"u\": (95, -15, 600, 523),\n        \"v\": (119, 0, 603, 523),\n        \"w\": (125, 0, 820, 523),\n        \"x\": (11, 0, 594, 523),\n        \"y\": (15, -214, 600, 523),\n        \"z\": (31, 0, 571, 523),\n        \"braceleft\": (92, -196, 445, 722),\n        \"bar\": (90, -19, 324, 737),\n        \"braceright\": (0, -196, 354, 722),\n        \"asciitilde\": (111, 180, 580, 326),\n        \"exclamdown\": (77, -195, 326, 523),\n        \"cent\": (96, -115, 583, 623),\n        \"sterling\": (49, -16, 633, 718),\n        \"fraction\": (-170, -19, 482, 703),\n        \"yen\": (81, 0, 699, 688),\n        \"florin\": (-52, -207, 654, 737),\n        \"section\": (77, -191, 583, 737),\n        \"currency\": (60, 99, 646, 603),\n        \"quotesingle\": (157, 463, 285, 718),\n        \"quotedblleft\": (138, 470, 461, 725),\n        \"guillemotleft\": (146, 108, 554, 446),\n        \"guilsinglleft\": (137, 108, 340, 446),\n        \"guilsinglright\": (111, 108, 314, 446),\n        \"fi\": (86, 0, 587, 728),\n        \"fl\": (86, 0, 585, 728),\n        \"endash\": (51, 240, 623, 313),\n        \"dagger\": (135, -159, 622, 718),\n        \"daggerdbl\": (52, -159, 623, 718),\n        \"periodcentered\": (130, 190, 257, 315),\n        \"paragraph\": (126, -173, 650, 718),\n        \"bullet\": (91, 202, 412, 517),\n        \"quotesinglbase\": (21, -149, 180, 106),\n        \"quotedblbase\": (-6, -149, 318, 106),\n        \"quotedblright\": (124, 463, 448, 718),\n        \"guillemotright\": (120, 108, 528, 446),\n        \"ellipsis\": (115, 0, 908, 106),\n        \"perthousand\": (88, -19, 1029, 703),\n        \"questiondown\": (85, -201, 
534, 525),\n        \"grave\": (170, 593, 337, 734),\n        \"acute\": (248, 593, 475, 734),\n        \"circumflex\": (147, 593, 438, 734),\n        \"tilde\": (125, 606, 490, 722),\n        \"macron\": (143, 627, 468, 684),\n        \"breve\": (167, 595, 476, 731),\n        \"dotaccent\": (249, 604, 362, 706),\n        \"dieresis\": (168, 604, 443, 706),\n        \"ring\": (214, 572, 402, 756),\n        \"cedilla\": (2, -225, 232, 0),\n        \"hungarumlaut\": (157, 593, 565, 734),\n        \"ogonek\": (44, -225, 249, 0),\n        \"caron\": (177, 593, 468, 734),\n        \"emdash\": (51, 240, 1067, 313),\n        \"AE\": (8, 0, 1097, 718),\n        \"ordfeminine\": (100, 304, 448, 737),\n        \"Lslash\": (41, 0, 555, 718),\n        \"Oslash\": (43, -19, 890, 737),\n        \"OE\": (99, -19, 1116, 737),\n        \"ordmasculine\": (100, 304, 467, 737),\n        \"ae\": (62, -15, 909, 538),\n        \"dotlessi\": (95, 0, 294, 523),\n        \"lslash\": (41, 0, 347, 718),\n        \"oslash\": (29, -22, 647, 545),\n        \"oe\": (84, -15, 964, 538),\n        \"germandbls\": (67, -15, 657, 728),\n        \"onesuperior\": (166, 281, 371, 703),\n        \"logicalnot\": (106, 108, 628, 390),\n        \"mu\": (24, -207, 600, 523),\n        \"trademark\": (186, 306, 1056, 718),\n        \"Eth\": (69, 0, 763, 718),\n        \"onehalf\": (114, -19, 838, 703),\n        \"plusminus\": (39, 0, 618, 506),\n        \"Thorn\": (86, 0, 711, 718),\n        \"onequarter\": (150, -19, 802, 703),\n        \"divide\": (85, -19, 606, 524),\n        \"brokenbar\": (90, -19, 324, 737),\n        \"degree\": (169, 411, 467, 703),\n        \"thorn\": (14, -207, 584, 718),\n        \"threequarters\": (130, -19, 861, 703),\n        \"twosuperior\": (64, 281, 448, 703),\n        \"registered\": (55, -19, 837, 737),\n        \"minus\": (85, 216, 606, 289),\n        \"eth\": (82, -15, 617, 737),\n        \"multiply\": (50, 0, 642, 506),\n        \"threesuperior\": (90, 270, 436, 703),\n     
   \"copyright\": (55, -19, 837, 737),\n        \"space\": (0, 0, 0, 0),\n        \"Aacute\": (14, 0, 683, 929),\n        \"Acircumflex\": (14, 0, 654, 929),\n        \"Adieresis\": (14, 0, 654, 901),\n        \"Agrave\": (14, 0, 654, 929),\n        \"Aring\": (14, 0, 654, 931),\n        \"Atilde\": (14, 0, 699, 917),\n        \"Ccedilla\": (108, -225, 781, 737),\n        \"Eacute\": (86, 0, 762, 929),\n        \"Ecircumflex\": (86, 0, 762, 929),\n        \"Edieresis\": (86, 0, 762, 901),\n        \"Egrave\": (86, 0, 762, 929),\n        \"Iacute\": (91, 0, 489, 929),\n        \"Icircumflex\": (91, 0, 452, 929),\n        \"Idieresis\": (91, 0, 458, 901),\n        \"Igrave\": (91, 0, 351, 929),\n        \"Ntilde\": (76, 0, 799, 917),\n        \"Oacute\": (105, -19, 825, 929),\n        \"Ocircumflex\": (105, -19, 825, 929),\n        \"Odieresis\": (105, -19, 825, 901),\n        \"Ograve\": (105, -19, 825, 929),\n        \"Otilde\": (105, -19, 825, 917),\n        \"Scaron\": (90, -19, 712, 929),\n        \"Uacute\": (124, -19, 797, 929),\n        \"Ucircumflex\": (124, -19, 797, 929),\n        \"Udieresis\": (124, -19, 797, 901),\n        \"Ugrave\": (124, -19, 797, 929),\n        \"Yacute\": (167, 0, 806, 929),\n        \"Ydieresis\": (167, 0, 806, 901),\n        \"Zcaron\": (23, 0, 741, 929),\n        \"aacute\": (62, -15, 587, 734),\n        \"acircumflex\": (62, -15, 558, 734),\n        \"adieresis\": (62, -15, 558, 706),\n        \"agrave\": (62, -15, 558, 734),\n        \"aring\": (62, -15, 558, 756),\n        \"atilde\": (62, -15, 592, 722),\n        \"ccedilla\": (75, -225, 553, 538),\n        \"eacute\": (85, -15, 587, 734),\n        \"ecircumflex\": (85, -15, 578, 734),\n        \"edieresis\": (85, -15, 578, 706),\n        \"egrave\": (85, -15, 578, 734),\n        \"iacute\": (95, 0, 448, 734),\n        \"icircumflex\": (95, 0, 411, 734),\n        \"idieresis\": (95, 0, 416, 706),\n        \"igrave\": (95, 0, 310, 734),\n        \"ntilde\": (65, 0, 592, 
722),\n        \"oacute\": (84, -14, 587, 734),\n        \"ocircumflex\": (84, -14, 584, 734),\n        \"odieresis\": (84, -14, 584, 706),\n        \"ograve\": (84, -14, 584, 734),\n        \"otilde\": (84, -14, 602, 722),\n        \"scaron\": (64, -15, 552, 734),\n        \"uacute\": (95, -15, 600, 734),\n        \"ucircumflex\": (95, -15, 600, 734),\n        \"udieresis\": (95, -15, 600, 706),\n        \"ugrave\": (95, -15, 600, 734),\n        \"yacute\": (15, -214, 600, 734),\n        \"ydieresis\": (15, -214, 600, 706),\n        \"zcaron\": (31, 0, 571, 734),\n    },\n    \"Helvetica\": {\n        \".notdef\": (0, 0, 0, 0),\n        \"exclam\": (90, 0, 187, 718),\n        \"quotedbl\": (70, 463, 285, 718),\n        \"numbersign\": (28, 0, 529, 688),\n        \"dollar\": (32, -115, 520, 775),\n        \"percent\": (39, -19, 850, 703),\n        \"ampersand\": (44, -15, 645, 718),\n        \"quoteright\": (53, 463, 157, 718),\n        \"parenleft\": (68, -207, 299, 733),\n        \"parenright\": (34, -207, 265, 733),\n        \"asterisk\": (39, 431, 349, 718),\n        \"plus\": (39, 0, 545, 505),\n        \"comma\": (87, -147, 191, 106),\n        \"hyphen\": (44, 232, 289, 322),\n        \"period\": (87, 0, 191, 106),\n        \"slash\": (-17, -19, 295, 737),\n        \"zero\": (37, -19, 519, 703),\n        \"one\": (101, 0, 359, 703),\n        \"two\": (26, 0, 507, 703),\n        \"three\": (34, -19, 522, 703),\n        \"four\": (25, 0, 523, 703),\n        \"five\": (32, -19, 514, 688),\n        \"six\": (38, -19, 518, 703),\n        \"seven\": (37, 0, 523, 688),\n        \"eight\": (38, -19, 517, 703),\n        \"nine\": (42, -19, 514, 703),\n        \"colon\": (87, 0, 191, 516),\n        \"semicolon\": (87, -147, 191, 516),\n        \"less\": (48, 11, 536, 495),\n        \"equal\": (39, 115, 545, 390),\n        \"greater\": (48, 11, 536, 495),\n        \"question\": (56, 0, 492, 727),\n        \"at\": (147, -19, 868, 737),\n        \"A\": (14, 0, 654, 
718),\n        \"B\": (74, 0, 627, 718),\n        \"C\": (44, -19, 681, 737),\n        \"D\": (81, 0, 674, 718),\n        \"E\": (86, 0, 616, 718),\n        \"F\": (86, 0, 583, 718),\n        \"G\": (48, -19, 704, 737),\n        \"H\": (77, 0, 646, 718),\n        \"I\": (91, 0, 188, 718),\n        \"J\": (17, -19, 428, 718),\n        \"K\": (76, 0, 663, 718),\n        \"L\": (76, 0, 537, 718),\n        \"M\": (73, 0, 761, 718),\n        \"N\": (76, 0, 646, 718),\n        \"O\": (39, -19, 739, 737),\n        \"P\": (86, 0, 622, 718),\n        \"Q\": (39, -56, 739, 737),\n        \"R\": (88, 0, 684, 718),\n        \"S\": (49, -19, 620, 737),\n        \"T\": (14, 0, 597, 718),\n        \"U\": (79, -19, 644, 718),\n        \"V\": (20, 0, 647, 718),\n        \"W\": (16, 0, 928, 718),\n        \"X\": (19, 0, 648, 718),\n        \"Y\": (14, 0, 653, 718),\n        \"Z\": (23, 0, 588, 718),\n        \"bracketleft\": (63, -196, 250, 722),\n        \"backslash\": (-17, -19, 295, 737),\n        \"bracketright\": (28, -196, 215, 722),\n        \"asciicircum\": (-14, 264, 483, 688),\n        \"underscore\": (0, -125, 556, -75),\n        \"quoteleft\": (65, 470, 169, 725),\n        \"a\": (36, -15, 530, 538),\n        \"b\": (58, -15, 517, 718),\n        \"c\": (30, -15, 477, 538),\n        \"d\": (35, -15, 499, 718),\n        \"e\": (40, -15, 516, 538),\n        \"f\": (14, 0, 262, 728),\n        \"g\": (40, -220, 499, 538),\n        \"h\": (65, 0, 491, 718),\n        \"i\": (67, 0, 155, 718),\n        \"j\": (-16, -210, 155, 718),\n        \"k\": (67, 0, 501, 718),\n        \"l\": (67, 0, 155, 718),\n        \"m\": (65, 0, 769, 538),\n        \"n\": (65, 0, 491, 538),\n        \"o\": (35, -14, 521, 538),\n        \"p\": (58, -207, 517, 538),\n        \"q\": (35, -207, 494, 538),\n        \"r\": (77, 0, 332, 538),\n        \"s\": (32, -15, 464, 538),\n        \"t\": (14, -7, 257, 669),\n        \"u\": (68, -15, 489, 523),\n        \"v\": (8, 0, 492, 523),\n        \"w\": (14, 0, 
709, 523),\n        \"x\": (11, 0, 490, 523),\n        \"y\": (11, -214, 489, 523),\n        \"z\": (31, 0, 469, 523),\n        \"braceleft\": (42, -196, 292, 722),\n        \"bar\": (94, -19, 167, 737),\n        \"braceright\": (42, -196, 292, 722),\n        \"asciitilde\": (61, 180, 523, 326),\n        \"exclamdown\": (118, -195, 215, 523),\n        \"cent\": (51, -115, 513, 623),\n        \"sterling\": (33, -16, 539, 718),\n        \"fraction\": (-166, -19, 333, 703),\n        \"yen\": (3, 0, 553, 688),\n        \"florin\": (-11, -207, 501, 737),\n        \"section\": (43, -191, 512, 737),\n        \"currency\": (28, 99, 528, 603),\n        \"quotesingle\": (59, 463, 132, 718),\n        \"quotedblleft\": (38, 470, 307, 725),\n        \"guillemotleft\": (97, 108, 459, 446),\n        \"guilsinglleft\": (88, 108, 245, 446),\n        \"guilsinglright\": (88, 108, 245, 446),\n        \"fi\": (14, 0, 434, 728),\n        \"fl\": (14, 0, 432, 728),\n        \"endash\": (0, 240, 556, 313),\n        \"dagger\": (43, -159, 514, 718),\n        \"daggerdbl\": (43, -159, 514, 718),\n        \"periodcentered\": (77, 190, 202, 315),\n        \"paragraph\": (18, -173, 497, 718),\n        \"bullet\": (18, 202, 333, 517),\n        \"quotesinglbase\": (53, -149, 157, 106),\n        \"quotedblbase\": (26, -149, 295, 106),\n        \"quotedblright\": (26, 463, 295, 718),\n        \"guillemotright\": (97, 108, 459, 446),\n        \"ellipsis\": (115, 0, 885, 106),\n        \"perthousand\": (7, -19, 994, 703),\n        \"questiondown\": (91, -201, 527, 525),\n        \"grave\": (14, 593, 211, 734),\n        \"acute\": (122, 593, 319, 734),\n        \"circumflex\": (21, 593, 312, 734),\n        \"tilde\": (-4, 606, 337, 722),\n        \"macron\": (10, 627, 323, 684),\n        \"breve\": (13, 595, 321, 731),\n        \"dotaccent\": (121, 604, 212, 706),\n        \"dieresis\": (40, 604, 293, 706),\n        \"ring\": (75, 572, 259, 756),\n        \"cedilla\": (45, -225, 259, 0),\n        
\"hungarumlaut\": (31, 593, 409, 734),\n        \"ogonek\": (73, -225, 287, 0),\n        \"caron\": (21, 593, 312, 734),\n        \"emdash\": (0, 240, 1000, 313),\n        \"AE\": (8, 0, 951, 718),\n        \"ordfeminine\": (24, 304, 346, 737),\n        \"Lslash\": (-20, 0, 537, 718),\n        \"Oslash\": (39, -19, 740, 737),\n        \"OE\": (36, -19, 965, 737),\n        \"ordmasculine\": (25, 304, 341, 737),\n        \"ae\": (36, -15, 847, 538),\n        \"dotlessi\": (95, 0, 183, 523),\n        \"lslash\": (-20, 0, 242, 718),\n        \"oslash\": (28, -22, 537, 545),\n        \"oe\": (35, -15, 902, 538),\n        \"germandbls\": (67, -15, 571, 728),\n        \"onesuperior\": (43, 281, 222, 703),\n        \"logicalnot\": (39, 108, 545, 390),\n        \"mu\": (68, -207, 489, 523),\n        \"trademark\": (46, 306, 903, 718),\n        \"Eth\": (0, 0, 674, 718),\n        \"onehalf\": (43, -19, 773, 703),\n        \"plusminus\": (39, 0, 545, 506),\n        \"Thorn\": (86, 0, 622, 718),\n        \"onequarter\": (73, -19, 756, 703),\n        \"divide\": (39, -19, 545, 524),\n        \"brokenbar\": (94, -19, 167, 737),\n        \"degree\": (54, 411, 346, 703),\n        \"thorn\": (58, -207, 517, 718),\n        \"threequarters\": (45, -19, 810, 703),\n        \"twosuperior\": (4, 281, 323, 703),\n        \"registered\": (-14, -19, 752, 737),\n        \"minus\": (39, 216, 545, 289),\n        \"eth\": (35, -15, 522, 737),\n        \"multiply\": (39, 0, 545, 506),\n        \"threesuperior\": (5, 270, 325, 703),\n        \"copyright\": (-14, -19, 752, 737),\n        \"space\": (0, 0, 0, 0),\n        \"Aacute\": (14, 0, 654, 929),\n        \"Acircumflex\": (14, 0, 654, 929),\n        \"Adieresis\": (14, 0, 654, 901),\n        \"Agrave\": (14, 0, 654, 929),\n        \"Aring\": (14, 0, 654, 931),\n        \"Atilde\": (14, 0, 654, 917),\n        \"Ccedilla\": (44, -225, 681, 737),\n        \"Eacute\": (86, 0, 616, 929),\n        \"Ecircumflex\": (86, 0, 616, 929),\n        
\"Edieresis\": (86, 0, 616, 901),\n        \"Egrave\": (86, 0, 616, 929),\n        \"Iacute\": (91, 0, 292, 929),\n        \"Icircumflex\": (-6, 0, 285, 929),\n        \"Idieresis\": (13, 0, 266, 901),\n        \"Igrave\": (-13, 0, 188, 929),\n        \"Ntilde\": (76, 0, 646, 917),\n        \"Oacute\": (39, -19, 739, 929),\n        \"Ocircumflex\": (39, -19, 739, 929),\n        \"Odieresis\": (39, -19, 739, 901),\n        \"Ograve\": (39, -19, 739, 929),\n        \"Otilde\": (39, -19, 739, 917),\n        \"Scaron\": (49, -19, 620, 929),\n        \"Uacute\": (79, -19, 644, 929),\n        \"Ucircumflex\": (79, -19, 644, 929),\n        \"Udieresis\": (79, -19, 644, 901),\n        \"Ugrave\": (79, -19, 644, 929),\n        \"Yacute\": (14, 0, 653, 929),\n        \"Ydieresis\": (14, 0, 653, 901),\n        \"Zcaron\": (23, 0, 588, 929),\n        \"aacute\": (36, -15, 530, 734),\n        \"acircumflex\": (36, -15, 530, 734),\n        \"adieresis\": (36, -15, 530, 706),\n        \"agrave\": (36, -15, 530, 734),\n        \"aring\": (36, -15, 530, 756),\n        \"atilde\": (36, -15, 530, 722),\n        \"ccedilla\": (30, -225, 477, 538),\n        \"eacute\": (40, -15, 516, 734),\n        \"ecircumflex\": (40, -15, 516, 734),\n        \"edieresis\": (40, -15, 516, 706),\n        \"egrave\": (40, -15, 516, 734),\n        \"iacute\": (95, 0, 292, 734),\n        \"icircumflex\": (-6, 0, 285, 734),\n        \"idieresis\": (13, 0, 266, 706),\n        \"igrave\": (-13, 0, 184, 734),\n        \"ntilde\": (65, 0, 491, 722),\n        \"oacute\": (35, -14, 521, 734),\n        \"ocircumflex\": (35, -14, 521, 734),\n        \"odieresis\": (35, -14, 521, 706),\n        \"ograve\": (35, -14, 521, 734),\n        \"otilde\": (35, -14, 521, 722),\n        \"scaron\": (32, -15, 464, 734),\n        \"uacute\": (68, -15, 489, 734),\n        \"ucircumflex\": (68, -15, 489, 734),\n        \"udieresis\": (68, -15, 489, 706),\n        \"ugrave\": (68, -15, 489, 734),\n        \"yacute\": (11, -214, 
489, 734),\n        \"ydieresis\": (11, -214, 489, 706),\n        \"zcaron\": (31, 0, 469, 734),\n    },\n    \"Symbol\": {\n        \".notdef\": (0, 0, 0, 0),\n        \"exclam\": (128, -17, 240, 672),\n        \"universal\": (31, 0, 681, 705),\n        \"numbersign\": (20, -16, 481, 673),\n        \"existential\": (25, 0, 478, 707),\n        \"percent\": (64, -35, 771, 655),\n        \"ampersand\": (42, -17, 750, 661),\n        \"suchthat\": (48, -17, 414, 499),\n        \"parenleft\": (53, -191, 300, 673),\n        \"parenright\": (30, -191, 277, 673),\n        \"asteriskmath\": (65, 134, 427, 551),\n        \"plus\": (10, 0, 539, 533),\n        \"comma\": (56, -152, 194, 104),\n        \"minus\": (11, 233, 535, 288),\n        \"period\": (69, -17, 181, 95),\n        \"slash\": (0, -18, 254, 646),\n        \"zero\": (24, -17, 470, 685),\n        \"one\": (117, 0, 390, 673),\n        \"two\": (25, 0, 475, 685),\n        \"three\": (39, -17, 435, 685),\n        \"four\": (16, 0, 469, 685),\n        \"five\": (29, -17, 443, 685),\n        \"six\": (36, -17, 467, 685),\n        \"seven\": (24, -16, 448, 673),\n        \"eight\": (55, -17, 440, 684),\n        \"nine\": (32, -18, 459, 684),\n        \"colon\": (81, -17, 193, 460),\n        \"semicolon\": (83, -152, 221, 460),\n        \"less\": (26, 0, 523, 522),\n        \"equal\": (11, 141, 537, 390),\n        \"greater\": (26, 0, 523, 522),\n        \"question\": (71, -17, 411, 686),\n        \"congruent\": (11, 0, 537, 475),\n        \"Alpha\": (4, 0, 684, 673),\n        \"Beta\": (29, 0, 592, 673),\n        \"Chi\": (-9, 0, 704, 673),\n        \"Delta\": (6, 0, 608, 688),\n        \"Epsilon\": (32, 0, 617, 673),\n        \"Phi\": (26, 0, 741, 673),\n        \"Gamma\": (24, 0, 609, 673),\n        \"Eta\": (39, 0, 729, 673),\n        \"Iota\": (32, 0, 316, 673),\n        \"theta1\": (18, -17, 623, 689),\n        \"Kappa\": (35, 0, 722, 673),\n        \"Lambda\": (6, 0, 680, 688),\n        \"Mu\": (28, 0, 887, 
673),\n        \"Nu\": (29, -8, 720, 673),\n        \"Omicron\": (41, -17, 715, 685),\n        \"Pi\": (25, 0, 745, 673),\n        \"Theta\": (41, -17, 715, 685),\n        \"Rho\": (28, 0, 562, 673),\n        \"Sigma\": (5, 0, 589, 673),\n        \"Tau\": (33, 0, 607, 673),\n        \"Upsilon\": (-8, 0, 694, 673),\n        \"sigma1\": (40, -233, 436, 500),\n        \"Omega\": (34, 0, 736, 688),\n        \"Xi\": (40, 0, 599, 673),\n        \"Psi\": (15, 0, 781, 684),\n        \"Zeta\": (44, 0, 636, 673),\n        \"bracketleft\": (86, -155, 299, 674),\n        \"therefore\": (163, 0, 701, 478),\n        \"bracketright\": (33, -155, 246, 674),\n        \"perpendicular\": (15, 0, 652, 674),\n        \"underscore\": (-2, -252, 502, -206),\n        \"radicalex\": (480, 881, 1090, 917),\n        \"alpha\": (41, -18, 622, 500),\n        \"beta\": (61, -223, 515, 740),\n        \"chi\": (12, -231, 522, 499),\n        \"delta\": (40, -18, 481, 739),\n        \"epsilon\": (22, -19, 427, 501),\n        \"phi\": (28, -224, 490, 671),\n        \"gamma\": (6, -225, 484, 498),\n        \"eta\": (0, -202, 527, 513),\n        \"iota\": (0, -17, 301, 503),\n        \"phi1\": (37, -224, 587, 499),\n        \"kappa\": (33, 0, 558, 501),\n        \"lambda\": (24, -17, 548, 739),\n        \"mu\": (33, -223, 567, 500),\n        \"nu\": (-9, -16, 474, 507),\n        \"omicron\": (35, -18, 501, 498),\n        \"pi\": (10, -19, 530, 487),\n        \"theta\": (43, -17, 485, 690),\n        \"rho\": (50, -230, 490, 498),\n        \"sigma\": (31, -21, 588, 500),\n        \"tau\": (10, -18, 418, 500),\n        \"upsilon\": (7, -18, 535, 507),\n        \"omega1\": (12, -17, 671, 583),\n        \"omega\": (43, -17, 683, 500),\n        \"xi\": (28, -224, 469, 765),\n        \"psi\": (12, -228, 701, 500),\n        \"zeta\": (60, -225, 467, 756),\n        \"braceleft\": (58, -183, 397, 673),\n        \"bar\": (65, -177, 135, 673),\n        \"braceright\": (79, -183, 418, 673),\n        \"similar\": 
(17, 203, 529, 307),\n        \"Upsilon1\": (-1, 0, 610, 685),\n        \"minute\": (27, 459, 228, 734),\n        \"lessequal\": (29, 0, 526, 639),\n        \"fraction\": (-180, -12, 340, 677),\n        \"infinity\": (26, 125, 688, 404),\n        \"florin\": (2, -193, 494, 686),\n        \"club\": (86, -26, 660, 533),\n        \"diamond\": (142, -36, 600, 550),\n        \"heart\": (117, -33, 631, 532),\n        \"spade\": (114, -36, 628, 548),\n        \"arrowboth\": (24, -15, 1024, 511),\n        \"arrowleft\": (32, -15, 942, 511),\n        \"arrowup\": (45, 0, 571, 910),\n        \"arrowright\": (49, -15, 959, 511),\n        \"arrowdown\": (45, -22, 571, 888),\n        \"degree\": (50, 385, 350, 685),\n        \"plusminus\": (10, 0, 539, 645),\n        \"second\": (20, 459, 413, 736),\n        \"greaterequal\": (29, 0, 526, 639),\n        \"multiply\": (17, 8, 533, 524),\n        \"proportional\": (27, 124, 639, 404),\n        \"partialdiff\": (27, -20, 462, 745),\n        \"bullet\": (50, 113, 410, 473),\n        \"divide\": (10, 71, 536, 456),\n        \"notequal\": (15, -25, 540, 549),\n        \"equivalence\": (14, 82, 538, 443),\n        \"approxequal\": (14, 135, 527, 394),\n        \"ellipsis\": (111, -17, 889, 95),\n        \"arrowvertex\": (280, -120, 336, 1010),\n        \"arrowhorizex\": (-60, 220, 1050, 276),\n        \"carriagereturn\": (15, -16, 602, 629),\n        \"aleph\": (175, -18, 661, 658),\n        \"Ifraktur\": (10, -53, 578, 740),\n        \"Rfraktur\": (26, -15, 759, 733),\n        \"weierstrass\": (159, -211, 870, 573),\n        \"circlemultiply\": (43, -17, 733, 673),\n        \"circleplus\": (43, -15, 733, 675),\n        \"emptyset\": (39, -24, 781, 719),\n        \"intersection\": (40, 0, 732, 509),\n        \"union\": (40, -17, 732, 492),\n        \"propersuperset\": (20, 0, 673, 470),\n        \"reflexsuperset\": (20, -125, 673, 470),\n        \"notsubset\": (36, -70, 690, 540),\n        \"propersubset\": (37, 0, 690, 470),\n        
\"reflexsubset\": (37, -125, 690, 470),\n        \"element\": (45, 0, 505, 468),\n        \"notelement\": (45, -58, 505, 555),\n        \"angle\": (26, 0, 738, 673),\n        \"gradient\": (36, -19, 681, 718),\n        \"registerserif\": (50, -17, 740, 673),\n        \"copyrightserif\": (51, -15, 741, 675),\n        \"trademarkserif\": (18, 293, 855, 673),\n        \"product\": (25, -101, 803, 751),\n        \"radical\": (10, -38, 515, 917),\n        \"dotmath\": (69, 210, 169, 310),\n        \"logicalnot\": (15, 0, 680, 288),\n        \"logicaland\": (23, 0, 583, 454),\n        \"logicalor\": (30, 0, 578, 477),\n        \"arrowdblboth\": (27, -20, 1023, 510),\n        \"arrowdblleft\": (30, -15, 939, 513),\n        \"arrowdblup\": (39, 2, 567, 911),\n        \"arrowdblright\": (45, -20, 954, 508),\n        \"arrowdbldown\": (44, -19, 572, 890),\n        \"lozenge\": (18, 0, 466, 745),\n        \"angleleft\": (25, -198, 306, 746),\n        \"registersans\": (50, -20, 740, 670),\n        \"copyrightsans\": (49, -15, 739, 675),\n        \"trademarksans\": (5, 293, 725, 673),\n        \"summation\": (14, -108, 695, 752),\n        \"parenlefttp\": (40, -293, 436, 926),\n        \"parenleftex\": (40, -85, 92, 925),\n        \"parenleftbt\": (40, -293, 436, 926),\n        \"bracketlefttp\": (0, -80, 341, 926),\n        \"bracketleftex\": (0, -79, 55, 925),\n        \"bracketleftbt\": (0, -80, 340, 926),\n        \"bracelefttp\": (201, -75, 439, 926),\n        \"braceleftmid\": (14, -85, 255, 935),\n        \"braceleftbt\": (201, -70, 439, 926),\n        \"braceex\": (201, -80, 255, 935),\n        \"angleright\": (21, -198, 302, 746),\n        \"integral\": (2, -107, 290, 915),\n        \"integraltp\": (332, -83, 715, 921),\n        \"integralex\": (332, -88, 415, 975),\n        \"integralbt\": (39, -81, 415, 921),\n        \"parenrighttp\": (54, -293, 450, 926),\n        \"parenrightex\": (398, -85, 450, 925),\n        \"parenrightbt\": (54, -293, 450, 926),\n        
\"bracketrighttp\": (22, -80, 360, 926),\n        \"bracketrightex\": (305, -79, 360, 925),\n        \"bracketrightbt\": (20, -80, 360, 926),\n        \"bracerighttp\": (17, -75, 255, 926),\n        \"bracerightmid\": (201, -85, 442, 935),\n        \"bracerightbt\": (17, -70, 255, 926),\n        \"apple\": (56, -2, 733, 808),\n        \"space\": (0, 0, 0, 0),\n    },\n    \"Times-BoldItalic\": {\n        \".notdef\": (0, 0, 0, 0),\n        \"exclam\": (67, -13, 370, 684),\n        \"quotedbl\": (136, 398, 536, 685),\n        \"numbersign\": (-33, 0, 533, 700),\n        \"dollar\": (-20, -100, 497, 733),\n        \"percent\": (39, -10, 793, 692),\n        \"ampersand\": (5, -19, 699, 682),\n        \"quoteright\": (98, 369, 302, 685),\n        \"parenleft\": (28, -179, 344, 685),\n        \"parenright\": (-44, -179, 271, 685),\n        \"asterisk\": (65, 249, 456, 685),\n        \"plus\": (33, 0, 537, 506),\n        \"comma\": (-60, -182, 144, 134),\n        \"hyphen\": (2, 166, 271, 282),\n        \"period\": (-9, -13, 139, 135),\n        \"slash\": (-64, -18, 342, 685),\n        \"zero\": (17, -14, 477, 683),\n        \"one\": (5, 0, 419, 683),\n        \"two\": (-27, 0, 446, 683),\n        \"three\": (-15, -13, 450, 683),\n        \"four\": (-15, 0, 503, 683),\n        \"five\": (-11, -13, 487, 669),\n        \"six\": (23, -15, 509, 679),\n        \"seven\": (52, 0, 525, 669),\n        \"eight\": (3, -13, 476, 683),\n        \"nine\": (-12, -10, 475, 683),\n        \"colon\": (23, -13, 264, 459),\n        \"semicolon\": (-25, -183, 264, 459),\n        \"less\": (31, -8, 539, 514),\n        \"equal\": (33, 107, 537, 399),\n        \"greater\": (31, -8, 539, 514),\n        \"question\": (79, -13, 470, 684),\n        \"at\": (63, -18, 770, 685),\n        \"A\": (-67, 0, 593, 683),\n        \"B\": (-24, 0, 624, 669),\n        \"C\": (32, -18, 677, 685),\n        \"D\": (-46, 0, 685, 669),\n        \"E\": (-27, 0, 653, 669),\n        \"F\": (-13, 0, 660, 669),\n       
 \"G\": (21, -18, 706, 685),\n        \"H\": (-24, 0, 799, 669),\n        \"I\": (-32, 0, 406, 669),\n        \"J\": (-46, -99, 524, 669),\n        \"K\": (-21, 0, 702, 669),\n        \"L\": (-22, 0, 590, 669),\n        \"M\": (-29, -12, 917, 669),\n        \"N\": (-27, -15, 748, 669),\n        \"O\": (27, -18, 691, 685),\n        \"P\": (-27, 0, 613, 669),\n        \"Q\": (27, -208, 691, 685),\n        \"R\": (-29, 0, 623, 669),\n        \"S\": (2, -18, 526, 685),\n        \"T\": (50, 0, 650, 669),\n        \"U\": (67, -18, 744, 669),\n        \"V\": (65, -18, 715, 669),\n        \"W\": (65, -18, 940, 669),\n        \"X\": (-24, 0, 694, 669),\n        \"Y\": (73, 0, 659, 669),\n        \"Z\": (-11, 0, 590, 669),\n        \"bracketleft\": (-37, -159, 362, 674),\n        \"backslash\": (-1, -18, 279, 685),\n        \"bracketright\": (-56, -157, 343, 674),\n        \"asciicircum\": (67, 304, 503, 669),\n        \"underscore\": (0, -125, 500, -75),\n        \"quoteleft\": (128, 369, 332, 685),\n        \"a\": (-21, -14, 455, 462),\n        \"b\": (-14, -13, 444, 699),\n        \"c\": (-5, -13, 392, 462),\n        \"d\": (-21, -13, 517, 699),\n        \"e\": (5, -13, 398, 462),\n        \"f\": (-169, -205, 446, 698),\n        \"g\": (-52, -203, 478, 462),\n        \"h\": (-13, -9, 498, 699),\n        \"i\": (2, -9, 263, 684),\n        \"j\": (-189, -207, 279, 684),\n        \"k\": (-23, -8, 483, 699),\n        \"l\": (2, -9, 290, 699),\n        \"m\": (-14, -9, 722, 462),\n        \"n\": (-6, -9, 493, 462),\n        \"o\": (-3, -13, 441, 462),\n        \"p\": (-120, -205, 446, 462),\n        \"q\": (1, -205, 471, 462),\n        \"r\": (-21, 0, 389, 462),\n        \"s\": (-19, -13, 333, 462),\n        \"t\": (-11, -9, 281, 594),\n        \"u\": (15, -9, 492, 462),\n        \"v\": (16, -13, 401, 462),\n        \"w\": (16, -13, 614, 462),\n        \"x\": (-46, -13, 469, 462),\n        \"y\": (-94, -205, 392, 462),\n        \"z\": (-43, -78, 368, 449),\n        
\"braceleft\": (5, -187, 436, 686),\n        \"bar\": (66, -18, 154, 685),\n        \"braceright\": (-129, -187, 302, 686),\n        \"asciitilde\": (54, 173, 516, 333),\n        \"exclamdown\": (19, -205, 322, 492),\n        \"cent\": (42, -143, 439, 576),\n        \"sterling\": (-32, -12, 510, 683),\n        \"fraction\": (-169, -14, 324, 683),\n        \"yen\": (33, 0, 628, 669),\n        \"florin\": (-87, -156, 537, 707),\n        \"section\": (36, -143, 459, 685),\n        \"currency\": (-26, 34, 526, 586),\n        \"quotesingle\": (128, 398, 268, 685),\n        \"quotedblleft\": (53, 369, 513, 685),\n        \"guillemotleft\": (12, 32, 468, 415),\n        \"guilsinglleft\": (32, 32, 303, 415),\n        \"guilsinglright\": (10, 32, 281, 415),\n        \"fi\": (-188, -205, 514, 703),\n        \"fl\": (-186, -205, 553, 704),\n        \"endash\": (-40, 178, 477, 269),\n        \"dagger\": (91, -145, 494, 685),\n        \"daggerdbl\": (10, -139, 493, 685),\n        \"periodcentered\": (51, 257, 199, 405),\n        \"paragraph\": (-57, -193, 562, 669),\n        \"bullet\": (0, 175, 350, 525),\n        \"quotesinglbase\": (-5, -182, 199, 134),\n        \"quotedblbase\": (-57, -182, 403, 134),\n        \"quotedblright\": (53, 369, 513, 685),\n        \"guillemotright\": (12, 32, 468, 415),\n        \"ellipsis\": (40, -13, 852, 135),\n        \"perthousand\": (7, -29, 996, 706),\n        \"questiondown\": (30, -205, 421, 492),\n        \"grave\": (85, 516, 297, 697),\n        \"acute\": (139, 516, 379, 697),\n        \"circumflex\": (40, 516, 367, 690),\n        \"tilde\": (48, 536, 407, 655),\n        \"macron\": (51, 553, 393, 623),\n        \"breve\": (71, 516, 387, 678),\n        \"dotaccent\": (163, 525, 293, 655),\n        \"dieresis\": (55, 525, 397, 655),\n        \"ring\": (127, 516, 340, 729),\n        \"cedilla\": (-80, -218, 156, 5),\n        \"hungarumlaut\": (69, 516, 498, 697),\n        \"ogonek\": (-40, -173, 189, 44),\n        \"caron\": (79, 516, 
411, 690),\n        \"emdash\": (-40, 178, 977, 269),\n        \"AE\": (-64, 0, 918, 669),\n        \"ordfeminine\": (16, 399, 330, 685),\n        \"Lslash\": (-22, 0, 590, 669),\n        \"Oslash\": (27, -125, 691, 764),\n        \"OE\": (23, -8, 946, 677),\n        \"ordmasculine\": (56, 400, 347, 685),\n        \"ae\": (-5, -13, 673, 462),\n        \"dotlessi\": (2, -9, 238, 462),\n        \"lslash\": (-13, -9, 301, 699),\n        \"oslash\": (-3, -119, 441, 560),\n        \"oe\": (6, -13, 674, 462),\n        \"germandbls\": (-200, -200, 473, 705),\n        \"onesuperior\": (30, 274, 301, 683),\n        \"logicalnot\": (51, 108, 555, 399),\n        \"mu\": (-60, -207, 516, 449),\n        \"trademark\": (32, 263, 968, 669),\n        \"Eth\": (-31, 0, 700, 669),\n        \"onehalf\": (-9, -14, 723, 683),\n        \"plusminus\": (33, 0, 537, 506),\n        \"Thorn\": (-27, 0, 573, 669),\n        \"onequarter\": (7, -14, 721, 683),\n        \"divide\": (33, -29, 537, 535),\n        \"brokenbar\": (66, -18, 154, 685),\n        \"degree\": (83, 397, 369, 683),\n        \"thorn\": (-120, -205, 446, 699),\n        \"threequarters\": (7, -14, 726, 683),\n        \"twosuperior\": (2, 274, 313, 683),\n        \"registered\": (30, -18, 718, 685),\n        \"minus\": (51, 209, 555, 297),\n        \"eth\": (-3, -13, 454, 699),\n        \"multiply\": (48, 16, 522, 490),\n        \"threesuperior\": (17, 265, 321, 683),\n        \"copyright\": (30, -18, 718, 685),\n        \"space\": (0, 0, 0, 0),\n        \"Aacute\": (-67, 0, 593, 904),\n        \"Acircumflex\": (-67, 0, 593, 897),\n        \"Adieresis\": (-67, 0, 593, 862),\n        \"Agrave\": (-67, 0, 593, 904),\n        \"Aring\": (-67, 0, 593, 921),\n        \"Atilde\": (-67, 0, 593, 862),\n        \"Ccedilla\": (32, -218, 677, 685),\n        \"Eacute\": (-27, 0, 653, 904),\n        \"Ecircumflex\": (-27, 0, 653, 897),\n        \"Edieresis\": (-27, 0, 653, 862),\n        \"Egrave\": (-27, 0, 653, 904),\n        \"Iacute\": 
(-32, 0, 412, 904),\n        \"Icircumflex\": (-32, 0, 420, 897),\n        \"Idieresis\": (-32, 0, 445, 862),\n        \"Igrave\": (-32, 0, 406, 904),\n        \"Ntilde\": (-27, -15, 748, 862),\n        \"Oacute\": (27, -18, 691, 904),\n        \"Ocircumflex\": (27, -18, 691, 897),\n        \"Odieresis\": (27, -18, 691, 862),\n        \"Ograve\": (27, -18, 691, 904),\n        \"Otilde\": (27, -18, 691, 862),\n        \"Scaron\": (2, -18, 526, 897),\n        \"Uacute\": (67, -18, 744, 904),\n        \"Ucircumflex\": (67, -18, 744, 897),\n        \"Udieresis\": (67, -18, 744, 862),\n        \"Ugrave\": (67, -18, 744, 904),\n        \"Yacute\": (73, 0, 659, 904),\n        \"Ydieresis\": (73, 0, 659, 862),\n        \"Zcaron\": (-11, 0, 590, 897),\n        \"aacute\": (-21, -14, 463, 697),\n        \"acircumflex\": (-21, -14, 455, 690),\n        \"adieresis\": (-21, -14, 471, 655),\n        \"agrave\": (-21, -14, 455, 697),\n        \"aring\": (-21, -14, 455, 729),\n        \"atilde\": (-21, -14, 491, 655),\n        \"ccedilla\": (-24, -218, 392, 462),\n        \"eacute\": (5, -13, 435, 697),\n        \"ecircumflex\": (5, -13, 423, 690),\n        \"edieresis\": (5, -13, 443, 655),\n        \"egrave\": (5, -13, 398, 697),\n        \"iacute\": (2, -9, 352, 697),\n        \"icircumflex\": (-2, -9, 325, 690),\n        \"idieresis\": (2, -9, 360, 655),\n        \"igrave\": (2, -9, 260, 697),\n        \"ntilde\": (-6, -9, 504, 655),\n        \"oacute\": (-3, -13, 463, 697),\n        \"ocircumflex\": (-3, -13, 451, 690),\n        \"odieresis\": (-3, -13, 466, 655),\n        \"ograve\": (-3, -13, 441, 697),\n        \"otilde\": (-3, -13, 491, 655),\n        \"scaron\": (-19, -13, 439, 690),\n        \"uacute\": (15, -9, 492, 697),\n        \"ucircumflex\": (15, -9, 492, 690),\n        \"udieresis\": (15, -9, 494, 655),\n        \"ugrave\": (15, -9, 492, 697),\n        \"yacute\": (-94, -205, 435, 697),\n        \"ydieresis\": (-94, -205, 438, 655),\n        \"zcaron\": (-43, 
-78, 424, 690),\n    },\n    \"Times-Bold\": {\n        \".notdef\": (0, 0, 0, 0),\n        \"exclam\": (81, -13, 251, 691),\n        \"quotedbl\": (83, 404, 472, 691),\n        \"numbersign\": (4, 0, 496, 700),\n        \"dollar\": (29, -99, 472, 750),\n        \"percent\": (124, -14, 877, 692),\n        \"ampersand\": (62, -16, 787, 691),\n        \"quoteright\": (79, 356, 263, 691),\n        \"parenleft\": (46, -168, 306, 694),\n        \"parenright\": (27, -168, 287, 694),\n        \"asterisk\": (56, 255, 447, 691),\n        \"plus\": (33, 0, 537, 506),\n        \"comma\": (39, -180, 223, 155),\n        \"hyphen\": (44, 171, 287, 287),\n        \"period\": (41, -13, 210, 156),\n        \"slash\": (-24, -19, 302, 691),\n        \"zero\": (24, -13, 476, 688),\n        \"one\": (65, 0, 442, 688),\n        \"two\": (17, 0, 478, 688),\n        \"three\": (16, -14, 468, 688),\n        \"four\": (19, 0, 475, 688),\n        \"five\": (22, -8, 470, 676),\n        \"six\": (28, -13, 475, 688),\n        \"seven\": (17, 0, 477, 676),\n        \"eight\": (28, -13, 472, 688),\n        \"nine\": (26, -13, 473, 688),\n        \"colon\": (82, -13, 251, 472),\n        \"semicolon\": (82, -180, 266, 472),\n        \"less\": (31, -8, 539, 514),\n        \"equal\": (33, 107, 537, 399),\n        \"greater\": (31, -8, 539, 514),\n        \"question\": (57, -13, 445, 689),\n        \"at\": (108, -19, 822, 691),\n        \"A\": (9, 0, 689, 690),\n        \"B\": (16, 0, 619, 676),\n        \"C\": (49, -19, 687, 691),\n        \"D\": (14, 0, 690, 676),\n        \"E\": (16, 0, 641, 676),\n        \"F\": (16, 0, 583, 676),\n        \"G\": (37, -19, 755, 691),\n        \"H\": (21, 0, 759, 676),\n        \"I\": (20, 0, 370, 676),\n        \"J\": (3, -96, 479, 676),\n        \"K\": (30, 0, 769, 676),\n        \"L\": (19, 0, 638, 676),\n        \"M\": (14, 0, 921, 676),\n        \"N\": (16, -18, 701, 676),\n        \"O\": (35, -19, 743, 691),\n        \"P\": (16, 0, 600, 676),\n        \"Q\": 
(35, -176, 743, 691),\n        \"R\": (26, 0, 715, 676),\n        \"S\": (35, -19, 513, 692),\n        \"T\": (31, 0, 636, 676),\n        \"U\": (16, -19, 701, 676),\n        \"V\": (16, -18, 701, 676),\n        \"W\": (19, -15, 981, 676),\n        \"X\": (16, 0, 699, 676),\n        \"Y\": (15, 0, 699, 676),\n        \"Z\": (28, 0, 634, 676),\n        \"bracketleft\": (67, -149, 301, 678),\n        \"backslash\": (-25, -19, 303, 691),\n        \"bracketright\": (32, -149, 266, 678),\n        \"asciicircum\": (73, 311, 509, 676),\n        \"underscore\": (0, -125, 500, -75),\n        \"quoteleft\": (70, 356, 254, 691),\n        \"a\": (25, -14, 488, 473),\n        \"b\": (17, -14, 521, 676),\n        \"c\": (25, -14, 430, 473),\n        \"d\": (25, -14, 534, 676),\n        \"e\": (25, -14, 426, 473),\n        \"f\": (14, 0, 389, 691),\n        \"g\": (28, -206, 483, 473),\n        \"h\": (16, 0, 534, 676),\n        \"i\": (16, 0, 255, 691),\n        \"j\": (-57, -203, 263, 691),\n        \"k\": (22, 0, 543, 676),\n        \"l\": (16, 0, 255, 676),\n        \"m\": (16, 0, 814, 473),\n        \"n\": (21, 0, 539, 473),\n        \"o\": (25, -14, 476, 473),\n        \"p\": (19, -205, 524, 473),\n        \"q\": (34, -205, 536, 473),\n        \"r\": (29, 0, 434, 473),\n        \"s\": (25, -14, 361, 473),\n        \"t\": (20, -12, 332, 630),\n        \"u\": (16, -14, 537, 461),\n        \"v\": (21, -14, 485, 461),\n        \"w\": (23, -14, 707, 461),\n        \"x\": (12, 0, 484, 461),\n        \"y\": (16, -205, 480, 461),\n        \"z\": (21, 0, 420, 461),\n        \"braceleft\": (22, -175, 340, 698),\n        \"bar\": (66, -19, 154, 691),\n        \"braceright\": (54, -175, 372, 698),\n        \"asciitilde\": (29, 173, 491, 333),\n        \"exclamdown\": (82, -203, 252, 501),\n        \"cent\": (53, -140, 458, 588),\n        \"sterling\": (21, -14, 477, 684),\n        \"fraction\": (-168, -12, 329, 688),\n        \"yen\": (-64, 0, 547, 676),\n        \"florin\": (0, -155, 
498, 706),\n        \"section\": (57, -132, 443, 691),\n        \"currency\": (-26, 61, 526, 613),\n        \"quotesingle\": (75, 404, 204, 691),\n        \"quotedblleft\": (32, 356, 486, 691),\n        \"guillemotleft\": (23, 36, 473, 415),\n        \"guilsinglleft\": (51, 36, 305, 415),\n        \"guilsinglright\": (28, 36, 282, 415),\n        \"fi\": (14, 0, 536, 691),\n        \"fl\": (14, 0, 536, 691),\n        \"endash\": (0, 181, 500, 271),\n        \"dagger\": (47, -134, 453, 691),\n        \"daggerdbl\": (45, -132, 456, 691),\n        \"periodcentered\": (41, 248, 210, 417),\n        \"paragraph\": (0, -186, 519, 676),\n        \"bullet\": (35, 198, 315, 478),\n        \"quotesinglbase\": (79, -180, 263, 155),\n        \"quotedblbase\": (14, -180, 468, 155),\n        \"quotedblright\": (14, 356, 468, 691),\n        \"guillemotright\": (27, 36, 477, 415),\n        \"ellipsis\": (82, -13, 917, 156),\n        \"perthousand\": (7, -29, 995, 706),\n        \"questiondown\": (55, -201, 443, 501),\n        \"grave\": (8, 528, 246, 713),\n        \"acute\": (86, 528, 324, 713),\n        \"circumflex\": (-2, 528, 335, 704),\n        \"tilde\": (-16, 547, 349, 674),\n        \"macron\": (1, 565, 331, 637),\n        \"breve\": (15, 528, 318, 691),\n        \"dotaccent\": (103, 537, 230, 667),\n        \"dieresis\": (-2, 537, 335, 667),\n        \"ring\": (60, 527, 273, 740),\n        \"cedilla\": (68, -218, 294, 0),\n        \"hungarumlaut\": (-13, 528, 425, 713),\n        \"ogonek\": (90, -173, 319, 44),\n        \"caron\": (-2, 528, 335, 704),\n        \"emdash\": (0, 181, 1000, 271),\n        \"AE\": (4, 0, 951, 676),\n        \"ordfeminine\": (-1, 397, 301, 688),\n        \"Lslash\": (19, 0, 638, 676),\n        \"Oslash\": (35, -74, 743, 737),\n        \"OE\": (22, -5, 981, 684),\n        \"ordmasculine\": (18, 397, 312, 688),\n        \"ae\": (33, -14, 693, 473),\n        \"dotlessi\": (16, 0, 255, 461),\n        \"lslash\": (-22, 0, 303, 676),\n        
\"oslash\": (25, -92, 476, 549),\n        \"oe\": (22, -14, 696, 473),\n        \"germandbls\": (19, -12, 517, 691),\n        \"onesuperior\": (28, 275, 273, 688),\n        \"logicalnot\": (33, 108, 537, 399),\n        \"mu\": (33, -206, 536, 461),\n        \"trademark\": (24, 271, 977, 676),\n        \"Eth\": (6, 0, 690, 676),\n        \"onehalf\": (-7, -12, 775, 688),\n        \"plusminus\": (33, 0, 537, 506),\n        \"Thorn\": (16, 0, 600, 676),\n        \"onequarter\": (28, -12, 743, 688),\n        \"divide\": (33, -31, 537, 537),\n        \"brokenbar\": (66, -19, 154, 691),\n        \"degree\": (57, 402, 343, 688),\n        \"thorn\": (19, -205, 524, 676),\n        \"threequarters\": (23, -12, 733, 688),\n        \"twosuperior\": (0, 275, 300, 688),\n        \"registered\": (26, -19, 721, 691),\n        \"minus\": (33, 209, 537, 297),\n        \"eth\": (25, -14, 476, 691),\n        \"multiply\": (48, 16, 522, 490),\n        \"threesuperior\": (3, 268, 297, 688),\n        \"copyright\": (26, -19, 721, 691),\n        \"space\": (0, 0, 0, 0),\n        \"Aacute\": (9, 0, 689, 923),\n        \"Acircumflex\": (9, 0, 689, 914),\n        \"Adieresis\": (9, 0, 689, 877),\n        \"Agrave\": (9, 0, 689, 923),\n        \"Aring\": (9, 0, 689, 935),\n        \"Atilde\": (9, 0, 689, 884),\n        \"Ccedilla\": (49, -218, 687, 691),\n        \"Eacute\": (16, 0, 641, 923),\n        \"Ecircumflex\": (16, 0, 641, 914),\n        \"Edieresis\": (16, 0, 641, 877),\n        \"Egrave\": (16, 0, 641, 923),\n        \"Iacute\": (20, 0, 370, 923),\n        \"Icircumflex\": (20, 0, 370, 914),\n        \"Idieresis\": (20, 0, 370, 877),\n        \"Igrave\": (20, 0, 370, 923),\n        \"Ntilde\": (16, -18, 701, 884),\n        \"Oacute\": (35, -19, 743, 923),\n        \"Ocircumflex\": (35, -19, 743, 914),\n        \"Odieresis\": (35, -19, 743, 877),\n        \"Ograve\": (35, -19, 743, 923),\n        \"Otilde\": (35, -19, 743, 884),\n        \"Scaron\": (35, -19, 513, 914),\n        
\"Uacute\": (16, -19, 701, 923),\n        \"Ucircumflex\": (16, -19, 701, 914),\n        \"Udieresis\": (16, -19, 701, 877),\n        \"Ugrave\": (16, -19, 701, 923),\n        \"Yacute\": (15, 0, 699, 928),\n        \"Ydieresis\": (15, 0, 699, 877),\n        \"Zcaron\": (28, 0, 634, 914),\n        \"aacute\": (25, -14, 488, 713),\n        \"acircumflex\": (25, -14, 488, 704),\n        \"adieresis\": (25, -14, 488, 667),\n        \"agrave\": (25, -14, 488, 713),\n        \"aring\": (25, -14, 488, 740),\n        \"atilde\": (25, -14, 488, 674),\n        \"ccedilla\": (25, -218, 430, 473),\n        \"eacute\": (25, -14, 426, 713),\n        \"ecircumflex\": (25, -14, 426, 704),\n        \"edieresis\": (25, -14, 426, 667),\n        \"egrave\": (25, -14, 426, 713),\n        \"iacute\": (16, 0, 290, 713),\n        \"icircumflex\": (-36, 0, 301, 704),\n        \"idieresis\": (-36, 0, 301, 667),\n        \"igrave\": (-26, 0, 255, 713),\n        \"ntilde\": (21, 0, 539, 674),\n        \"oacute\": (25, -14, 476, 713),\n        \"ocircumflex\": (25, -14, 476, 704),\n        \"odieresis\": (25, -14, 476, 667),\n        \"ograve\": (25, -14, 476, 713),\n        \"otilde\": (25, -14, 476, 674),\n        \"scaron\": (25, -14, 363, 704),\n        \"uacute\": (16, -14, 537, 713),\n        \"ucircumflex\": (16, -14, 537, 704),\n        \"udieresis\": (16, -14, 537, 667),\n        \"ugrave\": (16, -14, 537, 713),\n        \"yacute\": (16, -205, 480, 713),\n        \"ydieresis\": (16, -205, 480, 667),\n        \"zcaron\": (21, 0, 420, 704),\n    },\n    \"Times-Italic\": {\n        \".notdef\": (0, 0, 0, 0),\n        \"exclam\": (39, -11, 302, 667),\n        \"quotedbl\": (144, 421, 432, 666),\n        \"numbersign\": (2, 0, 540, 676),\n        \"dollar\": (31, -89, 497, 731),\n        \"percent\": (79, -13, 790, 676),\n        \"ampersand\": (76, -18, 723, 666),\n        \"quoteright\": (151, 436, 290, 666),\n        \"parenleft\": (42, -181, 315, 669),\n        \"parenright\": (16, 
-180, 289, 669),\n        \"asterisk\": (128, 255, 492, 666),\n        \"plus\": (86, 0, 590, 506),\n        \"comma\": (-4, -129, 135, 101),\n        \"hyphen\": (49, 192, 282, 255),\n        \"period\": (27, -11, 138, 100),\n        \"slash\": (-65, -18, 386, 666),\n        \"zero\": (32, -7, 497, 676),\n        \"one\": (49, 0, 409, 676),\n        \"two\": (12, 0, 452, 676),\n        \"three\": (15, -7, 465, 676),\n        \"four\": (1, 0, 479, 676),\n        \"five\": (15, -7, 491, 666),\n        \"six\": (30, -7, 521, 686),\n        \"seven\": (75, -8, 537, 666),\n        \"eight\": (30, -7, 493, 676),\n        \"nine\": (23, -17, 492, 676),\n        \"colon\": (50, -11, 261, 441),\n        \"semicolon\": (27, -129, 261, 441),\n        \"less\": (84, -8, 592, 514),\n        \"equal\": (86, 120, 590, 386),\n        \"greater\": (84, -8, 592, 514),\n        \"question\": (132, -12, 472, 664),\n        \"at\": (118, -18, 806, 666),\n        \"A\": (-51, 0, 564, 668),\n        \"B\": (-8, 0, 588, 653),\n        \"C\": (66, -18, 689, 666),\n        \"D\": (-8, 0, 700, 653),\n        \"E\": (-1, 0, 634, 653),\n        \"F\": (8, 0, 645, 653),\n        \"G\": (52, -18, 722, 666),\n        \"H\": (-8, 0, 767, 653),\n        \"I\": (-8, 0, 384, 653),\n        \"J\": (-6, -18, 491, 653),\n        \"K\": (7, 0, 722, 653),\n        \"L\": (-8, 0, 559, 653),\n        \"M\": (-18, 0, 873, 653),\n        \"N\": (-20, -15, 727, 653),\n        \"O\": (60, -18, 699, 666),\n        \"P\": (0, 0, 605, 653),\n        \"Q\": (59, -182, 699, 666),\n        \"R\": (-13, 0, 588, 653),\n        \"S\": (17, -18, 508, 667),\n        \"T\": (59, 0, 633, 653),\n        \"U\": (102, -18, 765, 653),\n        \"V\": (76, -18, 688, 653),\n        \"W\": (71, -18, 906, 653),\n        \"X\": (-29, 0, 655, 653),\n        \"Y\": (78, 0, 633, 653),\n        \"Z\": (-6, 0, 606, 653),\n        \"bracketleft\": (21, -153, 391, 663),\n        \"backslash\": (-41, -18, 319, 666),\n        
\"bracketright\": (12, -153, 382, 663),\n        \"asciicircum\": (0, 301, 422, 666),\n        \"underscore\": (0, -125, 500, -75),\n        \"quoteleft\": (171, 436, 310, 666),\n        \"a\": (17, -11, 476, 441),\n        \"b\": (23, -11, 473, 683),\n        \"c\": (30, -11, 425, 441),\n        \"d\": (15, -13, 527, 683),\n        \"e\": (31, -11, 412, 441),\n        \"f\": (-147, -207, 424, 678),\n        \"g\": (8, -206, 472, 441),\n        \"h\": (19, -9, 478, 683),\n        \"i\": (49, -11, 264, 654),\n        \"j\": (-124, -207, 276, 654),\n        \"k\": (14, -11, 461, 683),\n        \"l\": (41, -11, 279, 683),\n        \"m\": (12, -9, 704, 441),\n        \"n\": (14, -9, 474, 441),\n        \"o\": (27, -11, 468, 441),\n        \"p\": (-75, -205, 469, 441),\n        \"q\": (25, -209, 483, 441),\n        \"r\": (45, 0, 412, 441),\n        \"s\": (16, -13, 366, 442),\n        \"t\": (37, -11, 296, 546),\n        \"u\": (42, -11, 475, 441),\n        \"v\": (21, -18, 426, 441),\n        \"w\": (16, -18, 648, 441),\n        \"x\": (-27, -11, 447, 441),\n        \"y\": (-24, -206, 426, 441),\n        \"z\": (-2, -81, 380, 428),\n        \"braceleft\": (51, -177, 407, 687),\n        \"bar\": (105, -18, 171, 666),\n        \"braceright\": (-7, -177, 349, 687),\n        \"asciitilde\": (40, 183, 502, 323),\n        \"exclamdown\": (59, -205, 322, 473),\n        \"cent\": (77, -143, 472, 560),\n        \"sterling\": (10, -6, 517, 670),\n        \"fraction\": (-169, -10, 337, 676),\n        \"yen\": (27, 0, 603, 653),\n        \"florin\": (25, -182, 507, 682),\n        \"section\": (53, -162, 461, 666),\n        \"currency\": (-22, 53, 522, 597),\n        \"quotesingle\": (132, 421, 241, 666),\n        \"quotedblleft\": (166, 436, 514, 666),\n        \"guillemotleft\": (53, 37, 445, 403),\n        \"guilsinglleft\": (51, 37, 281, 403),\n        \"guilsinglright\": (52, 37, 282, 403),\n        \"fi\": (-141, -207, 481, 681),\n        \"fl\": (-141, -204, 517, 682),\n    
    \"endash\": (-6, 197, 505, 243),\n        \"dagger\": (101, -159, 488, 666),\n        \"daggerdbl\": (22, -143, 491, 666),\n        \"periodcentered\": (70, 199, 181, 310),\n        \"paragraph\": (55, -123, 616, 653),\n        \"bullet\": (40, 191, 310, 461),\n        \"quotesinglbase\": (44, -129, 183, 101),\n        \"quotedblbase\": (57, -129, 405, 101),\n        \"quotedblright\": (151, 436, 499, 666),\n        \"guillemotright\": (55, 37, 447, 403),\n        \"ellipsis\": (57, -11, 762, 100),\n        \"perthousand\": (25, -19, 1010, 706),\n        \"questiondown\": (28, -205, 368, 471),\n        \"grave\": (121, 492, 311, 664),\n        \"acute\": (180, 494, 403, 664),\n        \"circumflex\": (91, 492, 385, 661),\n        \"tilde\": (100, 517, 427, 624),\n        \"macron\": (99, 532, 411, 583),\n        \"breve\": (117, 492, 418, 650),\n        \"dotaccent\": (207, 508, 305, 606),\n        \"dieresis\": (107, 508, 405, 606),\n        \"ring\": (155, 492, 355, 691),\n        \"cedilla\": (-30, -217, 182, 0),\n        \"hungarumlaut\": (93, 494, 486, 664),\n        \"ogonek\": (-20, -169, 200, 40),\n        \"caron\": (121, 492, 426, 661),\n        \"emdash\": (-6, 197, 894, 243),\n        \"AE\": (-27, 0, 911, 653),\n        \"ordfeminine\": (42, 406, 352, 676),\n        \"Lslash\": (-8, 0, 559, 653),\n        \"Oslash\": (60, -105, 699, 722),\n        \"OE\": (49, -8, 964, 666),\n        \"ordmasculine\": (67, 406, 362, 676),\n        \"ae\": (23, -11, 640, 441),\n        \"dotlessi\": (49, -11, 235, 441),\n        \"lslash\": (37, -11, 307, 683),\n        \"oslash\": (28, -135, 469, 554),\n        \"oe\": (20, -12, 646, 441),\n        \"germandbls\": (-168, -207, 493, 679),\n        \"onesuperior\": (43, 271, 283, 676),\n        \"logicalnot\": (86, 108, 590, 386),\n        \"mu\": (-30, -209, 497, 428),\n        \"trademark\": (30, 247, 957, 653),\n        \"Eth\": (-8, 0, 700, 653),\n        \"onehalf\": (34, -10, 749, 676),\n        \"plusminus\": 
(86, 0, 590, 506),\n        \"Thorn\": (0, 0, 569, 653),\n        \"onequarter\": (33, -10, 736, 676),\n        \"divide\": (86, -11, 590, 517),\n        \"brokenbar\": (105, -18, 171, 666),\n        \"degree\": (101, 390, 387, 676),\n        \"thorn\": (-75, -205, 469, 683),\n        \"threequarters\": (23, -10, 736, 676),\n        \"twosuperior\": (33, 271, 324, 676),\n        \"registered\": (41, -18, 719, 666),\n        \"minus\": (86, 220, 590, 286),\n        \"eth\": (27, -11, 482, 683),\n        \"multiply\": (93, 8, 582, 497),\n        \"threesuperior\": (43, 268, 339, 676),\n        \"copyright\": (41, -18, 719, 666),\n        \"space\": (0, 0, 0, 0),\n        \"Aacute\": (-51, 0, 564, 876),\n        \"Acircumflex\": (-51, 0, 564, 873),\n        \"Adieresis\": (-51, 0, 564, 818),\n        \"Agrave\": (-51, 0, 564, 876),\n        \"Aring\": (-51, 0, 564, 883),\n        \"Atilde\": (-51, 0, 566, 836),\n        \"Ccedilla\": (66, -217, 689, 666),\n        \"Eacute\": (-1, 0, 634, 876),\n        \"Ecircumflex\": (-1, 0, 634, 873),\n        \"Edieresis\": (-1, 0, 634, 818),\n        \"Egrave\": (-1, 0, 634, 876),\n        \"Iacute\": (-8, 0, 413, 876),\n        \"Icircumflex\": (-8, 0, 425, 873),\n        \"Idieresis\": (-8, 0, 435, 818),\n        \"Igrave\": (-8, 0, 384, 876),\n        \"Ntilde\": (-20, -15, 727, 836),\n        \"Oacute\": (60, -18, 699, 876),\n        \"Ocircumflex\": (60, -18, 699, 873),\n        \"Odieresis\": (60, -18, 699, 818),\n        \"Ograve\": (60, -18, 699, 876),\n        \"Otilde\": (60, -18, 699, 836),\n        \"Scaron\": (17, -18, 520, 873),\n        \"Uacute\": (102, -18, 765, 876),\n        \"Ucircumflex\": (102, -18, 765, 873),\n        \"Udieresis\": (102, -18, 765, 818),\n        \"Ugrave\": (102, -18, 765, 876),\n        \"Yacute\": (78, 0, 633, 876),\n        \"Ydieresis\": (78, 0, 633, 818),\n        \"Zcaron\": (-6, 0, 606, 873),\n        \"aacute\": (17, -11, 487, 664),\n        \"acircumflex\": (17, -11, 476, 661),\n 
       \"adieresis\": (17, -11, 489, 606),\n        \"agrave\": (17, -11, 476, 664),\n        \"aring\": (17, -11, 476, 691),\n        \"atilde\": (17, -11, 511, 624),\n        \"ccedilla\": (26, -217, 425, 441),\n        \"eacute\": (31, -11, 459, 664),\n        \"ecircumflex\": (31, -11, 441, 661),\n        \"edieresis\": (31, -11, 451, 606),\n        \"egrave\": (31, -11, 412, 664),\n        \"iacute\": (49, -11, 356, 664),\n        \"icircumflex\": (34, -11, 328, 661),\n        \"idieresis\": (49, -11, 353, 606),\n        \"igrave\": (49, -11, 284, 664),\n        \"ntilde\": (14, -9, 476, 624),\n        \"oacute\": (27, -11, 487, 664),\n        \"ocircumflex\": (27, -11, 468, 661),\n        \"odieresis\": (27, -11, 489, 606),\n        \"ograve\": (27, -11, 468, 664),\n        \"otilde\": (27, -11, 496, 624),\n        \"scaron\": (16, -13, 454, 661),\n        \"uacute\": (42, -11, 477, 664),\n        \"ucircumflex\": (42, -11, 475, 661),\n        \"udieresis\": (42, -11, 479, 606),\n        \"ugrave\": (42, -11, 475, 664),\n        \"yacute\": (-24, -206, 459, 664),\n        \"ydieresis\": (-24, -206, 441, 606),\n        \"zcaron\": (-2, -81, 434, 661),\n    },\n    \"Times-Roman\": {\n        \".notdef\": (0, 0, 0, 0),\n        \"exclam\": (130, -9, 238, 676),\n        \"quotedbl\": (77, 431, 331, 676),\n        \"numbersign\": (5, 0, 496, 662),\n        \"dollar\": (44, -87, 457, 727),\n        \"percent\": (61, -13, 772, 676),\n        \"ampersand\": (42, -13, 750, 676),\n        \"quoteright\": (79, 433, 218, 676),\n        \"parenleft\": (48, -177, 304, 676),\n        \"parenright\": (29, -177, 285, 676),\n        \"asterisk\": (69, 265, 432, 676),\n        \"plus\": (30, 0, 534, 506),\n        \"comma\": (56, -141, 195, 102),\n        \"hyphen\": (39, 194, 285, 257),\n        \"period\": (70, -11, 181, 100),\n        \"slash\": (-9, -14, 287, 676),\n        \"zero\": (24, -14, 476, 676),\n        \"one\": (111, 0, 394, 676),\n        \"two\": (30, 0, 475, 
676),\n        \"three\": (43, -14, 431, 676),\n        \"four\": (12, 0, 472, 676),\n        \"five\": (32, -14, 438, 688),\n        \"six\": (34, -14, 468, 684),\n        \"seven\": (20, -8, 449, 662),\n        \"eight\": (56, -14, 445, 676),\n        \"nine\": (30, -22, 459, 676),\n        \"colon\": (81, -11, 192, 459),\n        \"semicolon\": (80, -141, 219, 459),\n        \"less\": (28, -8, 536, 514),\n        \"equal\": (30, 120, 534, 386),\n        \"greater\": (28, -8, 536, 514),\n        \"question\": (68, -8, 414, 676),\n        \"at\": (116, -14, 809, 676),\n        \"A\": (15, 0, 706, 674),\n        \"B\": (17, 0, 593, 662),\n        \"C\": (28, -14, 633, 676),\n        \"D\": (16, 0, 685, 662),\n        \"E\": (12, 0, 597, 662),\n        \"F\": (12, 0, 546, 662),\n        \"G\": (32, -14, 709, 676),\n        \"H\": (19, 0, 702, 662),\n        \"I\": (18, 0, 315, 662),\n        \"J\": (10, -14, 370, 662),\n        \"K\": (34, 0, 723, 662),\n        \"L\": (12, 0, 598, 662),\n        \"M\": (12, 0, 863, 662),\n        \"N\": (12, -11, 707, 662),\n        \"O\": (34, -14, 688, 676),\n        \"P\": (16, 0, 542, 662),\n        \"Q\": (34, -178, 701, 676),\n        \"R\": (17, 0, 659, 662),\n        \"S\": (42, -14, 491, 676),\n        \"T\": (17, 0, 593, 662),\n        \"U\": (14, -14, 705, 662),\n        \"V\": (16, -11, 697, 662),\n        \"W\": (5, -11, 932, 662),\n        \"X\": (10, 0, 704, 662),\n        \"Y\": (22, 0, 703, 662),\n        \"Z\": (9, 0, 597, 662),\n        \"bracketleft\": (88, -156, 299, 662),\n        \"backslash\": (-9, -14, 287, 676),\n        \"bracketright\": (34, -156, 245, 662),\n        \"asciicircum\": (24, 297, 446, 662),\n        \"underscore\": (0, -125, 500, -75),\n        \"quoteleft\": (115, 433, 254, 676),\n        \"a\": (37, -10, 442, 460),\n        \"b\": (3, -10, 468, 683),\n        \"c\": (25, -10, 412, 460),\n        \"d\": (27, -10, 491, 683),\n        \"e\": (25, -10, 424, 460),\n        \"f\": (20, 0, 383, 
683),\n        \"g\": (28, -218, 470, 460),\n        \"h\": (9, 0, 487, 683),\n        \"i\": (16, 0, 253, 683),\n        \"j\": (-70, -218, 194, 683),\n        \"k\": (7, 0, 505, 683),\n        \"l\": (19, 0, 257, 683),\n        \"m\": (16, 0, 775, 460),\n        \"n\": (16, 0, 485, 460),\n        \"o\": (29, -10, 470, 460),\n        \"p\": (5, -217, 470, 460),\n        \"q\": (24, -217, 488, 460),\n        \"r\": (5, 0, 335, 460),\n        \"s\": (51, -10, 348, 460),\n        \"t\": (13, -10, 279, 579),\n        \"u\": (9, -10, 479, 450),\n        \"v\": (19, -14, 477, 450),\n        \"w\": (21, -14, 694, 450),\n        \"x\": (17, 0, 479, 450),\n        \"y\": (14, -218, 475, 450),\n        \"z\": (27, 0, 418, 450),\n        \"braceleft\": (100, -181, 350, 680),\n        \"bar\": (67, -14, 133, 676),\n        \"braceright\": (130, -181, 380, 680),\n        \"asciitilde\": (40, 183, 502, 323),\n        \"exclamdown\": (97, -218, 205, 467),\n        \"cent\": (53, -138, 448, 579),\n        \"sterling\": (12, -8, 490, 676),\n        \"fraction\": (-168, -14, 331, 676),\n        \"yen\": (-53, 0, 512, 662),\n        \"florin\": (7, -189, 490, 676),\n        \"section\": (70, -148, 426, 676),\n        \"currency\": (-22, 58, 522, 602),\n        \"quotesingle\": (48, 431, 133, 676),\n        \"quotedblleft\": (43, 433, 414, 676),\n        \"guillemotleft\": (42, 33, 456, 416),\n        \"guilsinglleft\": (63, 33, 285, 416),\n        \"guilsinglright\": (48, 33, 270, 416),\n        \"fi\": (31, 0, 521, 683),\n        \"fl\": (32, 0, 521, 683),\n        \"endash\": (0, 201, 500, 250),\n        \"dagger\": (59, -149, 442, 676),\n        \"daggerdbl\": (58, -153, 442, 676),\n        \"periodcentered\": (70, 199, 181, 310),\n        \"paragraph\": (-22, -154, 450, 662),\n        \"bullet\": (40, 196, 310, 466),\n        \"quotesinglbase\": (79, -141, 218, 102),\n        \"quotedblbase\": (45, -141, 416, 102),\n        \"quotedblright\": (30, 433, 401, 676),\n        
\"guillemotright\": (44, 33, 458, 416),\n        \"ellipsis\": (111, -11, 888, 100),\n        \"perthousand\": (7, -19, 994, 706),\n        \"questiondown\": (30, -218, 376, 466),\n        \"grave\": (19, 507, 242, 678),\n        \"acute\": (93, 507, 317, 678),\n        \"circumflex\": (11, 507, 322, 674),\n        \"tilde\": (1, 532, 331, 638),\n        \"macron\": (11, 547, 322, 601),\n        \"breve\": (26, 507, 307, 664),\n        \"dotaccent\": (118, 523, 216, 623),\n        \"dieresis\": (18, 523, 315, 623),\n        \"ring\": (67, 512, 266, 711),\n        \"cedilla\": (52, -215, 261, 0),\n        \"hungarumlaut\": (-3, 507, 377, 678),\n        \"ogonek\": (64, -165, 249, 0),\n        \"caron\": (11, 507, 322, 674),\n        \"emdash\": (0, 201, 1000, 250),\n        \"AE\": (0, 0, 863, 662),\n        \"ordfeminine\": (4, 394, 270, 676),\n        \"Lslash\": (12, 0, 598, 662),\n        \"Oslash\": (34, -80, 688, 734),\n        \"OE\": (30, -6, 885, 668),\n        \"ordmasculine\": (6, 394, 304, 676),\n        \"ae\": (38, -10, 632, 460),\n        \"dotlessi\": (16, 0, 253, 460),\n        \"lslash\": (19, 0, 259, 683),\n        \"oslash\": (29, -112, 470, 551),\n        \"oe\": (30, -10, 690, 460),\n        \"germandbls\": (12, -9, 468, 683),\n        \"onesuperior\": (57, 270, 248, 676),\n        \"logicalnot\": (30, 108, 534, 386),\n        \"mu\": (36, -218, 512, 450),\n        \"trademark\": (30, 256, 957, 662),\n        \"Eth\": (16, 0, 685, 662),\n        \"onehalf\": (31, -14, 746, 676),\n        \"plusminus\": (30, 0, 534, 506),\n        \"Thorn\": (16, 0, 542, 662),\n        \"onequarter\": (37, -14, 718, 676),\n        \"divide\": (30, -10, 534, 516),\n        \"brokenbar\": (67, -14, 133, 676),\n        \"degree\": (57, 390, 343, 676),\n        \"thorn\": (5, -217, 470, 683),\n        \"threequarters\": (15, -14, 718, 676),\n        \"twosuperior\": (1, 270, 296, 676),\n        \"registered\": (38, -14, 722, 676),\n        \"minus\": (30, 220, 534, 
286),\n        \"eth\": (29, -10, 471, 686),\n        \"multiply\": (38, 8, 527, 497),\n        \"threesuperior\": (15, 262, 291, 676),\n        \"copyright\": (38, -14, 722, 676),\n        \"space\": (0, 0, 0, 0),\n        \"Aacute\": (15, 0, 706, 890),\n        \"Acircumflex\": (15, 0, 706, 886),\n        \"Adieresis\": (15, 0, 706, 835),\n        \"Agrave\": (15, 0, 706, 890),\n        \"Aring\": (15, 0, 706, 898),\n        \"Atilde\": (15, 0, 706, 850),\n        \"Ccedilla\": (28, -215, 633, 676),\n        \"Eacute\": (12, 0, 597, 890),\n        \"Ecircumflex\": (12, 0, 597, 886),\n        \"Edieresis\": (12, 0, 597, 835),\n        \"Egrave\": (12, 0, 597, 890),\n        \"Iacute\": (18, 0, 317, 890),\n        \"Icircumflex\": (11, 0, 322, 886),\n        \"Idieresis\": (18, 0, 315, 835),\n        \"Igrave\": (18, 0, 315, 890),\n        \"Ntilde\": (12, -11, 707, 850),\n        \"Oacute\": (34, -14, 688, 890),\n        \"Ocircumflex\": (34, -14, 688, 886),\n        \"Odieresis\": (34, -14, 688, 835),\n        \"Ograve\": (34, -14, 688, 890),\n        \"Otilde\": (34, -14, 688, 850),\n        \"Scaron\": (42, -14, 491, 886),\n        \"Uacute\": (14, -14, 705, 890),\n        \"Ucircumflex\": (14, -14, 705, 886),\n        \"Udieresis\": (14, -14, 705, 835),\n        \"Ugrave\": (14, -14, 705, 890),\n        \"Yacute\": (22, 0, 703, 890),\n        \"Ydieresis\": (22, 0, 703, 835),\n        \"Zcaron\": (9, 0, 597, 886),\n        \"aacute\": (37, -10, 442, 678),\n        \"acircumflex\": (37, -10, 442, 674),\n        \"adieresis\": (37, -10, 442, 623),\n        \"agrave\": (37, -10, 442, 678),\n        \"aring\": (37, -10, 442, 711),\n        \"atilde\": (37, -10, 442, 638),\n        \"ccedilla\": (25, -215, 412, 460),\n        \"eacute\": (25, -10, 424, 678),\n        \"ecircumflex\": (25, -10, 424, 674),\n        \"edieresis\": (25, -10, 424, 623),\n        \"egrave\": (25, -10, 424, 678),\n        \"iacute\": (16, 0, 290, 678),\n        \"icircumflex\": (-16, 0, 
295, 674),\n        \"idieresis\": (-9, 0, 288, 623),\n        \"igrave\": (-8, 0, 253, 678),\n        \"ntilde\": (16, 0, 485, 638),\n        \"oacute\": (29, -10, 470, 678),\n        \"ocircumflex\": (29, -10, 470, 674),\n        \"odieresis\": (29, -10, 470, 623),\n        \"ograve\": (29, -10, 470, 678),\n        \"otilde\": (29, -10, 470, 638),\n        \"scaron\": (39, -10, 350, 674),\n        \"uacute\": (9, -10, 479, 678),\n        \"ucircumflex\": (9, -10, 479, 674),\n        \"udieresis\": (9, -10, 479, 623),\n        \"ugrave\": (9, -10, 479, 678),\n        \"yacute\": (14, -218, 475, 678),\n        \"ydieresis\": (14, -218, 475, 623),\n        \"zcaron\": (27, 0, 418, 674),\n    },\n    \"ZapfDingbats\": {\n        \".notdef\": (0, 0, 0, 0),\n        \"a1\": (35, 72, 939, 621),\n        \"a2\": (35, 81, 927, 611),\n        \"a202\": (35, 72, 939, 621),\n        \"a3\": (35, 0, 945, 692),\n        \"a4\": (34, 139, 685, 566),\n        \"a5\": (35, -14, 755, 705),\n        \"a119\": (35, -14, 755, 705),\n        \"a118\": (35, -13, 761, 705),\n        \"a117\": (35, 138, 655, 553),\n        \"a11\": (35, 123, 925, 568),\n        \"a12\": (35, 134, 904, 559),\n        \"a13\": (29, -11, 516, 705),\n        \"a14\": (34, 59, 820, 632),\n        \"a15\": (35, 50, 876, 642),\n        \"a16\": (35, 139, 899, 550),\n        \"a105\": (35, 50, 876, 642),\n        \"a17\": (35, 139, 909, 553),\n        \"a18\": (35, 104, 938, 587),\n        \"a19\": (34, -13, 721, 705),\n        \"a20\": (36, -14, 811, 705),\n        \"a21\": (35, 0, 727, 692),\n        \"a22\": (35, 0, 727, 692),\n        \"a23\": (-1, -68, 571, 661),\n        \"a24\": (36, -13, 642, 705),\n        \"a25\": (35, 0, 728, 692),\n        \"a26\": (35, 0, 726, 692),\n        \"a27\": (35, 0, 725, 692),\n        \"a28\": (35, 0, 720, 692),\n        \"a6\": (35, 0, 460, 692),\n        \"a7\": (35, 0, 517, 692),\n        \"a8\": (35, 0, 503, 692),\n        \"a9\": (35, 96, 542, 596),\n        \"a10\": 
(35, -14, 657, 705),\n        \"a29\": (35, -14, 751, 705),\n        \"a30\": (35, -14, 752, 705),\n        \"a31\": (35, -14, 753, 705),\n        \"a32\": (35, -14, 756, 705),\n        \"a33\": (35, -13, 759, 705),\n        \"a34\": (35, -13, 759, 705),\n        \"a35\": (35, -14, 782, 705),\n        \"a36\": (35, -14, 787, 705),\n        \"a37\": (35, -14, 754, 705),\n        \"a38\": (35, -14, 807, 705),\n        \"a39\": (35, -14, 789, 705),\n        \"a40\": (35, -14, 798, 705),\n        \"a41\": (35, -13, 782, 705),\n        \"a42\": (35, -14, 796, 705),\n        \"a43\": (35, -14, 888, 705),\n        \"a44\": (35, 0, 710, 692),\n        \"a45\": (35, 0, 688, 692),\n        \"a46\": (35, 0, 714, 692),\n        \"a47\": (34, -14, 756, 705),\n        \"a48\": (35, -14, 758, 705),\n        \"a49\": (35, -14, 661, 706),\n        \"a50\": (35, -6, 741, 699),\n        \"a51\": (35, -7, 734, 699),\n        \"a52\": (35, -14, 757, 705),\n        \"a53\": (35, 0, 725, 692),\n        \"a54\": (35, -13, 672, 704),\n        \"a55\": (35, -14, 672, 705),\n        \"a56\": (35, -14, 647, 705),\n        \"a57\": (35, -14, 666, 705),\n        \"a58\": (35, -14, 791, 705),\n        \"a59\": (35, -14, 780, 705),\n        \"a60\": (35, -14, 754, 705),\n        \"a61\": (35, -14, 754, 705),\n        \"a62\": (34, -14, 673, 705),\n        \"a63\": (36, 0, 651, 692),\n        \"a64\": (35, 1, 661, 690),\n        \"a65\": (35, 0, 655, 692),\n        \"a66\": (34, -14, 751, 705),\n        \"a67\": (35, -14, 752, 705),\n        \"a68\": (35, -14, 678, 705),\n        \"a69\": (35, -14, 756, 705),\n        \"a70\": (36, -14, 751, 705),\n        \"a71\": (35, -14, 757, 705),\n        \"a72\": (35, -14, 838, 705),\n        \"a73\": (35, 0, 726, 692),\n        \"a74\": (35, 0, 727, 692),\n        \"a203\": (35, 0, 727, 692),\n        \"a75\": (35, 0, 725, 692),\n        \"a204\": (35, 0, 725, 692),\n        \"a76\": (35, 0, 858, 705),\n        \"a77\": (35, -14, 858, 692),\n        
\"a78\": (35, -14, 754, 705),\n        \"a79\": (35, -14, 749, 705),\n        \"a81\": (35, -14, 403, 705),\n        \"a82\": (35, 0, 104, 692),\n        \"a83\": (35, 0, 242, 692),\n        \"a84\": (35, 0, 380, 692),\n        \"a97\": (35, 263, 357, 705),\n        \"a98\": (34, 263, 357, 705),\n        \"a99\": (35, 263, 633, 705),\n        \"a100\": (36, 263, 634, 705),\n        \"a101\": (35, -143, 697, 806),\n        \"a102\": (56, -14, 488, 706),\n        \"a103\": (34, -14, 508, 705),\n        \"a104\": (35, 40, 875, 651),\n        \"a106\": (35, -14, 633, 705),\n        \"a107\": (35, -14, 726, 705),\n        \"a108\": (0, 121, 758, 569),\n        \"a112\": (35, 0, 741, 705),\n        \"a111\": (34, -14, 560, 705),\n        \"a110\": (35, -14, 659, 705),\n        \"a109\": (34, 0, 591, 705),\n        \"a120\": (35, -14, 754, 705),\n        \"a121\": (35, -14, 754, 705),\n        \"a122\": (35, -14, 754, 705),\n        \"a123\": (35, -14, 754, 705),\n        \"a124\": (35, -14, 754, 705),\n        \"a125\": (35, -14, 754, 705),\n        \"a126\": (35, -14, 754, 705),\n        \"a127\": (35, -14, 754, 705),\n        \"a128\": (35, -14, 754, 705),\n        \"a129\": (35, -14, 754, 705),\n        \"a130\": (35, -14, 754, 705),\n        \"a131\": (35, -14, 754, 705),\n        \"a132\": (35, -14, 754, 705),\n        \"a133\": (35, -14, 754, 705),\n        \"a134\": (35, -14, 754, 705),\n        \"a135\": (35, -14, 754, 705),\n        \"a136\": (35, -14, 754, 705),\n        \"a137\": (35, -14, 754, 705),\n        \"a138\": (35, -14, 754, 705),\n        \"a139\": (35, -14, 754, 705),\n        \"a140\": (35, -14, 754, 705),\n        \"a141\": (35, -14, 754, 705),\n        \"a142\": (35, -14, 754, 705),\n        \"a143\": (35, -14, 754, 705),\n        \"a144\": (35, -14, 754, 705),\n        \"a145\": (35, -14, 754, 705),\n        \"a146\": (35, -14, 754, 705),\n        \"a147\": (35, -14, 754, 705),\n        \"a148\": (35, -14, 754, 705),\n        \"a149\": (35, -14, 
754, 705),\n        \"a150\": (35, -14, 754, 705),\n        \"a151\": (35, -14, 754, 705),\n        \"a152\": (35, -14, 754, 705),\n        \"a153\": (35, -14, 754, 705),\n        \"a154\": (35, -14, 754, 705),\n        \"a155\": (35, -14, 754, 705),\n        \"a156\": (35, -14, 754, 705),\n        \"a157\": (35, -14, 754, 705),\n        \"a158\": (35, -14, 754, 705),\n        \"a159\": (35, -14, 754, 705),\n        \"a160\": (35, 58, 860, 634),\n        \"a161\": (35, 152, 803, 540),\n        \"a163\": (34, 152, 981, 540),\n        \"a164\": (35, -127, 422, 820),\n        \"a196\": (35, 94, 698, 597),\n        \"a165\": (35, 140, 890, 552),\n        \"a192\": (35, 94, 698, 597),\n        \"a166\": (35, 166, 884, 526),\n        \"a167\": (35, 32, 892, 660),\n        \"a168\": (35, 129, 891, 562),\n        \"a169\": (35, 128, 893, 563),\n        \"a170\": (35, 155, 799, 537),\n        \"a171\": (35, 93, 838, 599),\n        \"a172\": (35, 104, 791, 588),\n        \"a173\": (35, 98, 889, 594),\n        \"a162\": (35, 98, 889, 594),\n        \"a174\": (35, 0, 882, 692),\n        \"a175\": (35, 84, 896, 608),\n        \"a176\": (35, 84, 896, 608),\n        \"a177\": (35, -99, 429, 791),\n        \"a178\": (35, 71, 848, 623),\n        \"a179\": (35, 44, 802, 648),\n        \"a193\": (35, 44, 802, 648),\n        \"a180\": (35, 101, 832, 591),\n        \"a199\": (35, 101, 832, 591),\n        \"a181\": (35, 44, 661, 648),\n        \"a200\": (35, 44, 661, 648),\n        \"a182\": (35, 77, 840, 619),\n        \"a201\": (35, 73, 840, 615),\n        \"a183\": (35, 0, 725, 692),\n        \"a184\": (35, 160, 911, 533),\n        \"a197\": (34, 37, 736, 655),\n        \"a185\": (35, 207, 830, 481),\n        \"a194\": (34, 37, 736, 655),\n        \"a198\": (34, -19, 853, 712),\n        \"a186\": (35, 124, 932, 568),\n        \"a195\": (34, -19, 853, 712),\n        \"a187\": (35, 113, 796, 579),\n        \"a188\": (36, 118, 838, 578),\n        \"a189\": (35, 150, 891, 542),\n        
\"a190\": (35, 76, 931, 616),\n        \"a191\": (34, 99, 884, 593),\n        \"a86\": (35, 0, 375, 692),\n        \"a85\": (35, 0, 475, 692),\n        \"a95\": (35, 0, 299, 692),\n        \"a205\": (35, 0, 475, 692),\n        \"a89\": (35, -14, 356, 705),\n        \"a87\": (35, -14, 199, 705),\n        \"a91\": (35, 0, 242, 692),\n        \"a90\": (35, -14, 355, 705),\n        \"a206\": (35, 0, 375, 692),\n        \"a94\": (35, 0, 283, 692),\n        \"a93\": (35, 0, 283, 692),\n        \"a92\": (35, 0, 242, 692),\n        \"a96\": (35, 0, 299, 692),\n        \"a88\": (35, -14, 199, 705),\n        \"space\": (0, 0, 0, 0),\n    },\n}\n\nbase14_alias = {\n    \"Times New Roman\": \"Times-Roman\",\n    \"Times New Roman,Bold\": \"Times-Bold\",\n    \"Times New Roman,Italic\": \"Times-Italic\",\n}\n\n\ndef get_cached_bbox(database, family, encoding):\n    bbox = [(0, 0, 0, 0)] * 256\n    base_font = database[family]\n    for index, name in enumerate(encoding):\n        if name:\n            if cur_bbox := base_font.get(name, None):\n                bbox[index] = cur_bbox\n    return bbox\n\n\ndef get_base14_bbox(family, encoding_name=\"WinAnsiEncoding\"):\n    bbox = [(0, 0, 0, 0)] * 256\n    encoding = get_type1_encoding(encoding_name)\n    if not encoding:\n        return [(0, 0, 0, 0)] * 256\n\n    if family in base14_alias:\n        family = base14_alias[family]\n\n    if family in base14_bbox:\n        bbox = get_cached_bbox(base14_bbox, family, encoding)\n\n    if family in win_core:\n        bbox = get_cached_bbox(win_core, family, encoding)\n\n    return bbox\n"
  },
  {
    "path": "babeldoc/format/pdf/babelpdf/cidfont.py",
    "content": "import re\nfrom io import BytesIO\n\nimport freetype\n\n\ndef indirect(obj):\n    if isinstance(obj, tuple) and obj[0] == \"xref\":\n        return int(obj[1].split(\" \")[0])\n\n\ndef get_xref(doc, xref, key):\n    obj = doc.xref_get_key(xref, key)\n    if obj[0] == \"xref\":\n        return indirect(obj)\n\n\ndef get_font_file(doc, xref):\n    if idx := get_xref(doc, xref, \"FontFile\"):\n        return doc.xref_stream(idx)\n    if idx := get_xref(doc, xref, \"FontFile2\"):\n        return doc.xref_stream(idx)\n    if idx := get_xref(doc, xref, \"FontFile3\"):\n        return doc.xref_stream(idx)\n\n\ndef get_font_descriptor(doc, xref):\n    if idx := get_xref(doc, xref, \"FontDescriptor\"):\n        return get_font_file(doc, idx)\n\n\ndef get_descendant_fonts(doc, xref):\n    obj = doc.xref_get_key(xref, \"DescendantFonts\")\n    array_text = \"\"\n    if obj[0] == \"xref\":\n        array_text = doc.xref_object(indirect(obj))\n    elif obj[0] == \"array\":\n        array_text = obj[1]\n    if m := re.search(r\"\\d+\", array_text):\n        return get_font_descriptor(doc, int(m.group(0)))\n\n\ndef get_glyph_bbox(face, g):\n    try:\n        face.load_glyph(g, freetype.FT_LOAD_NO_SCALE)\n        outline = face.glyph.outline\n        if outline.contours:\n            cbox = outline.get_bbox()\n            return cbox.xMin, cbox.yMin, cbox.xMax, cbox.yMax\n        else:\n            return 0, 0, 0, 0\n    except Exception:\n        return 0, 0, 0, 0\n\n\ndef get_face_bbox(blob):\n    face = freetype.Face(BytesIO(blob))\n    scale = 1000 / face.units_per_EM\n    bbox_list = [get_glyph_bbox(face, code) for code in range(face.num_glyphs)]\n    bbox_list = [[v * scale for v in bbox] for bbox in bbox_list]\n    return bbox_list\n\n\ndef get_cidfont_bbox(doc, xref):\n    if doc.xref_get_key(xref, \"Subtype\")[1] == \"/Type0\":\n        if blob := get_descendant_fonts(doc, xref):\n            return get_face_bbox(blob)\n"
  },
  {
    "path": "babeldoc/format/pdf/babelpdf/cmap.py",
    "content": "import re\nimport struct\n\npattern_map_r = (\n    r\"\\s+begincidrange\\s*\"\n    r\"(?P<cidrange>(<[a-fA-F0-9]+>\\s*<[a-fA-F0-9]+>\\s*\\d+\\s*)+)\"\n    r\"\\s+endcidrange\\s+\"\n)\npattern_map_c = (\n    r\"\\s+begincidchar\\s*\"\n    r\"(?P<cidchar>(<[a-fA-F0-9]+>\\s*\\d+\\s*)+)\"\n    r\"\\s+endcidchar\\s+\"\n)\npattern_one_c = (\n    r\"<(?P<pat>[a-fA-F0-9]+)>\"\n    r\"\\s*\"\n    r\"(?P<val>\\d+)\"\n)\npattern_one_r = (\n    r\"<(?P<pat>[a-fA-F0-9]+)>\"\n    r\"\\s*\"\n    r\"<(?P<end>[a-fA-F0-9]+)>\"\n    r\"\\s*\"\n    r\"(?P<val>\\d+)\"\n)\n\n\ndef parse_blob_value(text):\n    return int(text, 16), len(text) // 2\n\n\ndef parse_cmap_char(text, store):\n    for m in re.finditer(pattern_one_c, text):\n        pat = m[\"pat\"]\n        val = m[\"val\"]\n        store.append((pat, int(val)))\n\n\ndef parse_cmap_range(text, store):\n    for m in re.finditer(pattern_one_r, text):\n        pat = m[\"pat\"]\n        end = m[\"end\"]\n        val = m[\"val\"]\n        store.append((pat, end, int(val)))\n\n\ndef parse_cmap(text):\n    usecmap = \"\"\n    if m := re.search(r\"/(?P<usecmap>[a-zA-Z0-9-]+)\\s+usecmap\\s+\", text):\n        usecmap = m[\"usecmap\"]\n    cidrange = []\n    for m in re.finditer(pattern_map_r, text):\n        parse_cmap_range(m[\"cidrange\"], cidrange)\n    cidchar = []\n    for m in re.finditer(pattern_map_c, text):\n        parse_cmap_char(m[\"cidchar\"], cidchar)\n    return usecmap, cidrange, cidchar\n\n\n_CMAP_CACHE: dict[str, tuple[list, list]] = {}\n\n\ndef _normalize_cmap_name(name: str) -> str:\n    \"\"\"Normalize cmap name for internal cache key.\"\"\"\n    if name.endswith(\".json\"):\n        return name[: -len(\".json\")]\n    return name\n\n\ndef use_cmap(name: str):\n    key = _normalize_cmap_name(name)\n    if key in _CMAP_CACHE:\n        return _CMAP_CACHE[key]\n\n    # Lazy import to avoid circular dependency at import time.\n    from babeldoc.assets.assets import get_cmap_data\n\n    data = 
get_cmap_data(key)\n    if not isinstance(data, dict):\n        raise TypeError(f\"Invalid cmap data type for {key}: {type(data)!r}\")\n\n    cid_u = data.get(\"u\") or \"\"\n    cid_r = data.get(\"r\") or []\n    cid_c = data.get(\"c\") or []\n\n    store_r: list = []\n    store_c: list = []\n    if cid_u:\n        use_r, use_c = use_cmap(cid_u)\n        store_r += use_r\n        store_c += use_c\n    store_r += cid_r\n    store_c += cid_c\n\n    _CMAP_CACHE[key] = (store_r, store_c)\n    return store_r, store_c\n\n\ndef propagation(r, c):\n    encoding = {}\n    len_set = set()\n    for one_r in r:\n        val_l, len_l = parse_blob_value(one_r[0])\n        val_r, len_r = parse_blob_value(one_r[1])\n        if len_l != len_r:\n            continue\n        len_set.add(len_l)\n        for i, v in enumerate(range(val_l, val_r + 1)):\n            val_b = struct.pack(\">L\", v)\n            fin_b = val_b[4 - len_l :]\n            encoding[fin_b] = one_r[2] + i\n    for one_c in c:\n        encoding[one_c[0]] = one_c[1]\n    len_list = list(len_set)\n    len_list.sort(reverse=True)\n    return encoding, len_list\n\n\nclass CharacterMap:\n    def __init__(self, text):\n        cid_r = []\n        cid_c = []\n        usecmap, cidrange, cidchar = parse_cmap(text)\n        if usecmap:\n            use_r, use_c = use_cmap(usecmap)\n            cid_r += use_r\n            cid_c += use_c\n        cid_r += cidrange\n        cid_c += cidchar\n        self.encoding, self.len_list = propagation(cid_r, cid_c)\n\n    def decode_one(self, text):\n        for l in self.len_list:\n            pat = text[:l]\n            if pat in self.encoding:\n                return self.encoding[pat], l\n        return 0, 1\n\n    def decode(self, text):\n        index = 0\n        size = len(text)\n        gstr = []\n        while index < size:\n            g, l = self.decode_one(text[index:])\n            gstr.append(g)\n            index += l\n        return gstr\n"
  },
  {
    "path": "babeldoc/format/pdf/babelpdf/encoding.py",
    "content": "adobe_standard = [\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    \"space\",\n    \"exclam\",\n    \"quotedbl\",\n    \"numbersign\",\n    \"dollar\",\n    \"percent\",\n    \"ampersand\",\n    \"quoteright\",\n    \"parenleft\",\n    \"parenright\",\n    \"asterisk\",\n    \"plus\",\n    \"comma\",\n    \"hyphen\",\n    \"period\",\n    \"slash\",\n    \"zero\",\n    \"one\",\n    \"two\",\n    \"three\",\n    \"four\",\n    \"five\",\n    \"six\",\n    \"seven\",\n    \"eight\",\n    \"nine\",\n    \"colon\",\n    \"semicolon\",\n    \"less\",\n    \"equal\",\n    \"greater\",\n    \"question\",\n    \"at\",\n    \"A\",\n    \"B\",\n    \"C\",\n    \"D\",\n    \"E\",\n    \"F\",\n    \"G\",\n    \"H\",\n    \"I\",\n    \"J\",\n    \"K\",\n    \"L\",\n    \"M\",\n    \"N\",\n    \"O\",\n    \"P\",\n    \"Q\",\n    \"R\",\n    \"S\",\n    \"T\",\n    \"U\",\n    \"V\",\n    \"W\",\n    \"X\",\n    \"Y\",\n    \"Z\",\n    \"bracketleft\",\n    \"backslash\",\n    \"bracketright\",\n    \"asciicircum\",\n    \"underscore\",\n    \"quoteleft\",\n    \"a\",\n    \"b\",\n    \"c\",\n    \"d\",\n    \"e\",\n    \"f\",\n    \"g\",\n    \"h\",\n    \"i\",\n    \"j\",\n    \"k\",\n    \"l\",\n    \"m\",\n    \"n\",\n    \"o\",\n    \"p\",\n    \"q\",\n    \"r\",\n    \"s\",\n    \"t\",\n    \"u\",\n    \"v\",\n    \"w\",\n    \"x\",\n    \"y\",\n    \"z\",\n    \"braceleft\",\n    \"bar\",\n    \"braceright\",\n    \"asciitilde\",\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    
None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    \"exclamdown\",\n    \"cent\",\n    \"sterling\",\n    \"fraction\",\n    \"yen\",\n    \"florin\",\n    \"section\",\n    \"currency\",\n    \"quotesingle\",\n    \"quotedblleft\",\n    \"guillemotleft\",\n    \"guilsinglleft\",\n    \"guilsinglright\",\n    \"fi\",\n    \"fl\",\n    None,\n    \"endash\",\n    \"dagger\",\n    \"daggerdbl\",\n    \"periodcentered\",\n    None,\n    \"paragraph\",\n    \"bullet\",\n    \"quotesinglbase\",\n    \"quotedblbase\",\n    \"quotedblright\",\n    \"guillemotright\",\n    \"ellipsis\",\n    \"perthousand\",\n    None,\n    \"questiondown\",\n    None,\n    \"grave\",\n    \"acute\",\n    \"circumflex\",\n    \"tilde\",\n    \"macron\",\n    \"breve\",\n    \"dotaccent\",\n    \"dieresis\",\n    None,\n    \"ring\",\n    \"cedilla\",\n    None,\n    \"hungarumlaut\",\n    \"ogonek\",\n    \"caron\",\n    \"emdash\",\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    \"AE\",\n    None,\n    \"ordfeminine\",\n    None,\n    None,\n    None,\n    None,\n    \"Lslash\",\n    \"Oslash\",\n    \"OE\",\n    \"ordmasculine\",\n    None,\n    None,\n    None,\n    None,\n    None,\n    \"ae\",\n    None,\n    None,\n    None,\n    \"dotlessi\",\n    None,\n    None,\n    \"lslash\",\n    \"oslash\",\n    \"oe\",\n    \"germandbls\",\n    None,\n    None,\n    None,\n    None,\n]\n\nmac_expert = [\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    
\"space\",\n    \"exclamsmall\",\n    \"Hungarumlautsmall\",\n    \"centoldstyle\",\n    \"dollaroldstyle\",\n    \"dollarsuperior\",\n    \"ampersandsmall\",\n    \"Acutesmall\",\n    \"parenleftsuperior\",\n    \"parenrightsuperior\",\n    \"twodotenleader\",\n    \"onedotenleader\",\n    \"comma\",\n    \"hyphen\",\n    \"period\",\n    \"fraction\",\n    \"zerooldstyle\",\n    \"oneoldstyle\",\n    \"twooldstyle\",\n    \"threeoldstyle\",\n    \"fouroldstyle\",\n    \"fiveoldstyle\",\n    \"sixoldstyle\",\n    \"sevenoldstyle\",\n    \"eightoldstyle\",\n    \"nineoldstyle\",\n    \"colon\",\n    \"semicolon\",\n    None,\n    \"threequartersemdash\",\n    None,\n    \"questionsmall\",\n    None,\n    None,\n    None,\n    None,\n    \"Ethsmall\",\n    None,\n    None,\n    \"onequarter\",\n    \"onehalf\",\n    \"threequarters\",\n    \"oneeighth\",\n    \"threeeighths\",\n    \"fiveeighths\",\n    \"seveneighths\",\n    \"onethird\",\n    \"twothirds\",\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    \"ff\",\n    \"fi\",\n    \"fl\",\n    \"ffi\",\n    \"ffl\",\n    \"parenleftinferior\",\n    None,\n    \"parenrightinferior\",\n    \"Circumflexsmall\",\n    \"hypheninferior\",\n    \"Gravesmall\",\n    \"Asmall\",\n    \"Bsmall\",\n    \"Csmall\",\n    \"Dsmall\",\n    \"Esmall\",\n    \"Fsmall\",\n    \"Gsmall\",\n    \"Hsmall\",\n    \"Ismall\",\n    \"Jsmall\",\n    \"Ksmall\",\n    \"Lsmall\",\n    \"Msmall\",\n    \"Nsmall\",\n    \"Osmall\",\n    \"Psmall\",\n    \"Qsmall\",\n    \"Rsmall\",\n    \"Ssmall\",\n    \"Tsmall\",\n    \"Usmall\",\n    \"Vsmall\",\n    \"Wsmall\",\n    \"Xsmall\",\n    \"Ysmall\",\n    \"Zsmall\",\n    \"colonmonetary\",\n    \"onefitted\",\n    \"rupiah\",\n    \"Tildesmall\",\n    None,\n    None,\n    \"asuperior\",\n    \"centsuperior\",\n    None,\n    None,\n    None,\n    None,\n    \"Aacutesmall\",\n    \"Agravesmall\",\n    \"Acircumflexsmall\",\n    \"Adieresissmall\",\n    \"Atildesmall\",\n  
  \"Aringsmall\",\n    \"Ccedillasmall\",\n    \"Eacutesmall\",\n    \"Egravesmall\",\n    \"Ecircumflexsmall\",\n    \"Edieresissmall\",\n    \"Iacutesmall\",\n    \"Igravesmall\",\n    \"Icircumflexsmall\",\n    \"Idieresissmall\",\n    \"Ntildesmall\",\n    \"Oacutesmall\",\n    \"Ogravesmall\",\n    \"Ocircumflexsmall\",\n    \"Odieresissmall\",\n    \"Otildesmall\",\n    \"Uacutesmall\",\n    \"Ugravesmall\",\n    \"Ucircumflexsmall\",\n    \"Udieresissmall\",\n    None,\n    \"eightsuperior\",\n    \"fourinferior\",\n    \"threeinferior\",\n    \"sixinferior\",\n    \"eightinferior\",\n    \"seveninferior\",\n    \"Scaronsmall\",\n    None,\n    \"centinferior\",\n    \"twoinferior\",\n    None,\n    \"Dieresissmall\",\n    None,\n    \"Caronsmall\",\n    \"osuperior\",\n    \"fiveinferior\",\n    None,\n    \"commainferior\",\n    \"periodinferior\",\n    \"Yacutesmall\",\n    None,\n    \"dollarinferior\",\n    None,\n    None,\n    \"Thornsmall\",\n    None,\n    \"nineinferior\",\n    \"zeroinferior\",\n    \"Zcaronsmall\",\n    \"AEsmall\",\n    \"Oslashsmall\",\n    \"questiondownsmall\",\n    \"oneinferior\",\n    \"Lslashsmall\",\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    \"Cedillasmall\",\n    None,\n    None,\n    None,\n    None,\n    None,\n    \"OEsmall\",\n    \"figuredash\",\n    \"hyphensuperior\",\n    None,\n    None,\n    None,\n    None,\n    \"exclamdownsmall\",\n    None,\n    \"Ydieresissmall\",\n    None,\n    \"onesuperior\",\n    \"twosuperior\",\n    \"threesuperior\",\n    \"foursuperior\",\n    \"fivesuperior\",\n    \"sixsuperior\",\n    \"sevensuperior\",\n    \"ninesuperior\",\n    \"zerosuperior\",\n    None,\n    \"esuperior\",\n    \"rsuperior\",\n    \"tsuperior\",\n    None,\n    None,\n    \"isuperior\",\n    \"ssuperior\",\n    \"dsuperior\",\n    None,\n    None,\n    None,\n    None,\n    None,\n    \"lsuperior\",\n    \"Ogoneksmall\",\n    \"Brevesmall\",\n    \"Macronsmall\",\n    
\"bsuperior\",\n    \"nsuperior\",\n    \"msuperior\",\n    \"commasuperior\",\n    \"periodsuperior\",\n    \"Dotaccentsmall\",\n    \"Ringsmall\",\n    None,\n    None,\n    None,\n    None,\n]\n\nmac_roman = [\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    \"space\",\n    \"exclamsmall\",\n    \"Hungarumlautsmall\",\n    \"centoldstyle\",\n    \"dollaroldstyle\",\n    \"dollarsuperior\",\n    \"ampersandsmall\",\n    \"Acutesmall\",\n    \"parenleftsuperior\",\n    \"parenrightsuperior\",\n    \"twodotenleader\",\n    \"onedotenleader\",\n    \"comma\",\n    \"hyphen\",\n    \"period\",\n    \"fraction\",\n    \"zerooldstyle\",\n    \"oneoldstyle\",\n    \"twooldstyle\",\n    \"threeoldstyle\",\n    \"fouroldstyle\",\n    \"fiveoldstyle\",\n    \"sixoldstyle\",\n    \"sevenoldstyle\",\n    \"eightoldstyle\",\n    \"nineoldstyle\",\n    \"colon\",\n    \"semicolon\",\n    None,\n    \"threequartersemdash\",\n    None,\n    \"questionsmall\",\n    None,\n    None,\n    None,\n    None,\n    \"Ethsmall\",\n    None,\n    None,\n    \"onequarter\",\n    \"onehalf\",\n    \"threequarters\",\n    \"oneeighth\",\n    \"threeeighths\",\n    \"fiveeighths\",\n    \"seveneighths\",\n    \"onethird\",\n    \"twothirds\",\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    \"ff\",\n    \"fi\",\n    \"fl\",\n    \"ffi\",\n    \"ffl\",\n    \"parenleftinferior\",\n    None,\n    \"parenrightinferior\",\n    \"Circumflexsmall\",\n    \"hypheninferior\",\n    \"Gravesmall\",\n    \"Asmall\",\n    \"Bsmall\",\n    \"Csmall\",\n    \"Dsmall\",\n    \"Esmall\",\n    \"Fsmall\",\n    \"Gsmall\",\n    \"Hsmall\",\n    \"Ismall\",\n    \"Jsmall\",\n    
\"Ksmall\",\n    \"Lsmall\",\n    \"Msmall\",\n    \"Nsmall\",\n    \"Osmall\",\n    \"Psmall\",\n    \"Qsmall\",\n    \"Rsmall\",\n    \"Ssmall\",\n    \"Tsmall\",\n    \"Usmall\",\n    \"Vsmall\",\n    \"Wsmall\",\n    \"Xsmall\",\n    \"Ysmall\",\n    \"Zsmall\",\n    \"colonmonetary\",\n    \"onefitted\",\n    \"rupiah\",\n    \"Tildesmall\",\n    None,\n    None,\n    \"asuperior\",\n    \"centsuperior\",\n    None,\n    None,\n    None,\n    None,\n    \"Aacutesmall\",\n    \"Agravesmall\",\n    \"Acircumflexsmall\",\n    \"Adieresissmall\",\n    \"Atildesmall\",\n    \"Aringsmall\",\n    \"Ccedillasmall\",\n    \"Eacutesmall\",\n    \"Egravesmall\",\n    \"Ecircumflexsmall\",\n    \"Edieresissmall\",\n    \"Iacutesmall\",\n    \"Igravesmall\",\n    \"Icircumflexsmall\",\n    \"Idieresissmall\",\n    \"Ntildesmall\",\n    \"Oacutesmall\",\n    \"Ogravesmall\",\n    \"Ocircumflexsmall\",\n    \"Odieresissmall\",\n    \"Otildesmall\",\n    \"Uacutesmall\",\n    \"Ugravesmall\",\n    \"Ucircumflexsmall\",\n    \"Udieresissmall\",\n    None,\n    \"eightsuperior\",\n    \"fourinferior\",\n    \"threeinferior\",\n    \"sixinferior\",\n    \"eightinferior\",\n    \"seveninferior\",\n    \"Scaronsmall\",\n    None,\n    \"centinferior\",\n    \"twoinferior\",\n    None,\n    \"Dieresissmall\",\n    None,\n    \"Caronsmall\",\n    \"osuperior\",\n    \"fiveinferior\",\n    None,\n    \"commainferior\",\n    \"periodinferior\",\n    \"Yacutesmall\",\n    None,\n    \"dollarinferior\",\n    None,\n    None,\n    \"Thornsmall\",\n    None,\n    \"nineinferior\",\n    \"zeroinferior\",\n    \"Zcaronsmall\",\n    \"AEsmall\",\n    \"Oslashsmall\",\n    \"questiondownsmall\",\n    \"oneinferior\",\n    \"Lslashsmall\",\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    \"Cedillasmall\",\n    None,\n    None,\n    None,\n    None,\n    None,\n    \"OEsmall\",\n    \"figuredash\",\n    \"hyphensuperior\",\n    None,\n    None,\n    None,\n    None,\n    
\"exclamdownsmall\",\n    None,\n    \"Ydieresissmall\",\n    None,\n    \"onesuperior\",\n    \"twosuperior\",\n    \"threesuperior\",\n    \"foursuperior\",\n    \"fivesuperior\",\n    \"sixsuperior\",\n    \"sevensuperior\",\n    \"ninesuperior\",\n    \"zerosuperior\",\n    None,\n    \"esuperior\",\n    \"rsuperior\",\n    \"tsuperior\",\n    None,\n    None,\n    \"isuperior\",\n    \"ssuperior\",\n    \"dsuperior\",\n    None,\n    None,\n    None,\n    None,\n    None,\n    \"lsuperior\",\n    \"Ogoneksmall\",\n    \"Brevesmall\",\n    \"Macronsmall\",\n    \"bsuperior\",\n    \"nsuperior\",\n    \"msuperior\",\n    \"commasuperior\",\n    \"periodsuperior\",\n    \"Dotaccentsmall\",\n    \"Ringsmall\",\n    None,\n    None,\n    None,\n    None,\n]\n\nwin_ansi = [\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    None,\n    \"space\",\n    \"exclam\",\n    \"quotedbl\",\n    \"numbersign\",\n    \"dollar\",\n    \"percent\",\n    \"ampersand\",\n    \"quotesingle\",\n    \"parenleft\",\n    \"parenright\",\n    \"asterisk\",\n    \"plus\",\n    \"comma\",\n    \"hyphen\",\n    \"period\",\n    \"slash\",\n    \"zero\",\n    \"one\",\n    \"two\",\n    \"three\",\n    \"four\",\n    \"five\",\n    \"six\",\n    \"seven\",\n    \"eight\",\n    \"nine\",\n    \"colon\",\n    \"semicolon\",\n    \"less\",\n    \"equal\",\n    \"greater\",\n    \"question\",\n    \"at\",\n    \"A\",\n    \"B\",\n    \"C\",\n    \"D\",\n    \"E\",\n    \"F\",\n    \"G\",\n    \"H\",\n    \"I\",\n    \"J\",\n    \"K\",\n    \"L\",\n    \"M\",\n    \"N\",\n    \"O\",\n    \"P\",\n    \"Q\",\n    \"R\",\n    \"S\",\n    \"T\",\n    \"U\",\n    \"V\",\n    \"W\",\n    \"X\",\n    \"Y\",\n    
\"Z\",\n    \"bracketleft\",\n    \"backslash\",\n    \"bracketright\",\n    \"asciicircum\",\n    \"underscore\",\n    \"grave\",\n    \"a\",\n    \"b\",\n    \"c\",\n    \"d\",\n    \"e\",\n    \"f\",\n    \"g\",\n    \"h\",\n    \"i\",\n    \"j\",\n    \"k\",\n    \"l\",\n    \"m\",\n    \"n\",\n    \"o\",\n    \"p\",\n    \"q\",\n    \"r\",\n    \"s\",\n    \"t\",\n    \"u\",\n    \"v\",\n    \"w\",\n    \"x\",\n    \"y\",\n    \"z\",\n    \"braceleft\",\n    \"bar\",\n    \"braceright\",\n    \"asciitilde\",\n    \"bullet\",\n    \"Euro\",\n    \"bullet\",\n    \"quotesinglbase\",\n    \"florin\",\n    \"quotedblbase\",\n    \"ellipsis\",\n    \"dagger\",\n    \"daggerdbl\",\n    \"circumflex\",\n    \"perthousand\",\n    \"Scaron\",\n    \"guilsinglleft\",\n    \"OE\",\n    \"bullet\",\n    \"Zcaron\",\n    \"bullet\",\n    \"bullet\",\n    \"quoteleft\",\n    \"quoteright\",\n    \"quotedblleft\",\n    \"quotedblright\",\n    \"bullet\",\n    \"endash\",\n    \"emdash\",\n    \"tilde\",\n    \"trademark\",\n    \"scaron\",\n    \"guilsinglright\",\n    \"oe\",\n    \"bullet\",\n    \"zcaron\",\n    \"Ydieresis\",\n    \"space\",\n    \"exclamdown\",\n    \"cent\",\n    \"sterling\",\n    \"currency\",\n    \"yen\",\n    \"brokenbar\",\n    \"section\",\n    \"dieresis\",\n    \"copyright\",\n    \"ordfeminine\",\n    \"guillemotleft\",\n    \"logicalnot\",\n    \"hyphen\",\n    \"registered\",\n    \"macron\",\n    \"degree\",\n    \"plusminus\",\n    \"twosuperior\",\n    \"threesuperior\",\n    \"acute\",\n    \"mu\",\n    \"paragraph\",\n    \"periodcentered\",\n    \"cedilla\",\n    \"onesuperior\",\n    \"ordmasculine\",\n    \"guillemotright\",\n    \"onequarter\",\n    \"onehalf\",\n    \"threequarters\",\n    \"questiondown\",\n    \"Agrave\",\n    \"Aacute\",\n    \"Acircumflex\",\n    \"Atilde\",\n    \"Adieresis\",\n    \"Aring\",\n    \"AE\",\n    \"Ccedilla\",\n    \"Egrave\",\n    \"Eacute\",\n    \"Ecircumflex\",\n    \"Edieresis\",\n    
\"Igrave\",\n    \"Iacute\",\n    \"Icircumflex\",\n    \"Idieresis\",\n    \"Eth\",\n    \"Ntilde\",\n    \"Ograve\",\n    \"Oacute\",\n    \"Ocircumflex\",\n    \"Otilde\",\n    \"Odieresis\",\n    \"multiply\",\n    \"Oslash\",\n    \"Ugrave\",\n    \"Uacute\",\n    \"Ucircumflex\",\n    \"Udieresis\",\n    \"Yacute\",\n    \"Thorn\",\n    \"germandbls\",\n    \"agrave\",\n    \"aacute\",\n    \"acircumflex\",\n    \"atilde\",\n    \"adieresis\",\n    \"aring\",\n    \"ae\",\n    \"ccedilla\",\n    \"egrave\",\n    \"eacute\",\n    \"ecircumflex\",\n    \"edieresis\",\n    \"igrave\",\n    \"iacute\",\n    \"icircumflex\",\n    \"idieresis\",\n    \"eth\",\n    \"ntilde\",\n    \"ograve\",\n    \"oacute\",\n    \"ocircumflex\",\n    \"otilde\",\n    \"odieresis\",\n    \"divide\",\n    \"oslash\",\n    \"ugrave\",\n    \"uacute\",\n    \"ucircumflex\",\n    \"udieresis\",\n    \"yacute\",\n    \"thorn\",\n    \"ydieresis\",\n]\n\n\ndef get_type1_encoding(name):\n    match name:\n        case \"StandardEncoding\":\n            return adobe_standard\n        case \"MacRomanEncoding\":\n            return mac_roman\n        case \"WinAnsiEncoding\":\n            return win_ansi\n        case \"MacExpertEncoding\":\n            return mac_expert\n\n\nWinAnsiEncoding = [\n    0,\n    1,\n    2,\n    3,\n    4,\n    5,\n    6,\n    7,\n    8,\n    9,\n    10,\n    11,\n    12,\n    13,\n    14,\n    15,\n    16,\n    17,\n    18,\n    19,\n    20,\n    21,\n    22,\n    23,\n    24,\n    25,\n    26,\n    27,\n    28,\n    29,\n    30,\n    31,\n    32,\n    33,\n    34,\n    35,\n    36,\n    37,\n    38,\n    39,\n    40,\n    41,\n    42,\n    43,\n    44,\n    45,\n    46,\n    47,\n    48,\n    49,\n    50,\n    51,\n    52,\n    53,\n    54,\n    55,\n    56,\n    57,\n    58,\n    59,\n    60,\n    61,\n    62,\n    63,\n    64,\n    65,\n    66,\n    67,\n    68,\n    69,\n    70,\n    71,\n    72,\n    73,\n    74,\n    75,\n    76,\n    77,\n    78,\n    
79,\n    80,\n    81,\n    82,\n    83,\n    84,\n    85,\n    86,\n    87,\n    88,\n    89,\n    90,\n    91,\n    92,\n    93,\n    94,\n    95,\n    96,\n    97,\n    98,\n    99,\n    100,\n    101,\n    102,\n    103,\n    104,\n    105,\n    106,\n    107,\n    108,\n    109,\n    110,\n    111,\n    112,\n    113,\n    114,\n    115,\n    116,\n    117,\n    118,\n    119,\n    120,\n    121,\n    122,\n    123,\n    124,\n    125,\n    126,\n    127,\n    8364,\n    0,\n    8218,\n    402,\n    8222,\n    8230,\n    8224,\n    8225,\n    710,\n    8240,\n    352,\n    8249,\n    338,\n    0,\n    381,\n    0,\n    0,\n    8216,\n    8217,\n    8220,\n    8221,\n    8226,\n    8211,\n    8212,\n    732,\n    8482,\n    353,\n    8250,\n    339,\n    0,\n    382,\n    376,\n    160,\n    161,\n    162,\n    163,\n    164,\n    165,\n    166,\n    167,\n    168,\n    169,\n    170,\n    171,\n    172,\n    173,\n    174,\n    175,\n    176,\n    177,\n    178,\n    179,\n    180,\n    181,\n    182,\n    183,\n    184,\n    185,\n    186,\n    187,\n    188,\n    189,\n    190,\n    191,\n    192,\n    193,\n    194,\n    195,\n    196,\n    197,\n    198,\n    199,\n    200,\n    201,\n    202,\n    203,\n    204,\n    205,\n    206,\n    207,\n    208,\n    209,\n    210,\n    211,\n    212,\n    213,\n    214,\n    215,\n    216,\n    217,\n    218,\n    219,\n    220,\n    221,\n    222,\n    223,\n    224,\n    225,\n    226,\n    227,\n    228,\n    229,\n    230,\n    231,\n    232,\n    233,\n    234,\n    235,\n    236,\n    237,\n    238,\n    239,\n    240,\n    241,\n    242,\n    243,\n    244,\n    245,\n    246,\n    247,\n    248,\n    249,\n    250,\n    251,\n    252,\n    253,\n    254,\n    255,\n]\n"
  },
  {
    "path": "babeldoc/format/pdf/babelpdf/type3.py",
    "content": "import io\nimport re\n\nimport pymupdf\n\n\ndef merge_bbox(bbox_list, factor=1):\n    if bbox_list:\n        base = bbox_list[0]\n        for bbox in bbox_list[1:]:\n            base.include_rect(bbox)\n        x0, y0, x1, y1 = [v / factor for v in tuple(base)]\n        return x0, -y1, x1, -y0\n\n\ndef get_type3_bbox(doc, obj):\n    bbox_list = [(0, 0, 0, 0)] * 256\n    first = int(doc.xref_get_key(obj, \"FirstChar\")[1])\n    last = int(doc.xref_get_key(obj, \"LastChar\")[1])\n    factor_text = doc.xref_get_key(obj, \"FontMatrix\")[1]\n    factor = 1\n    if factor_m := re.search(r\"(\\d+)?\\.\\d+\", factor_text):\n        factor = float(factor_m.group(0))\n    page = doc.new_page(width=10, height=10)\n    doc.xref_set_key(page.xref, \"Resources\", \"<<>>\")\n    doc.xref_set_key(page.xref, \"Resources/Font\", f\"<</T0 {obj} 0 R>>\")\n    text = doc.get_new_xref()\n    doc.update_object(text, \"<<>>\")\n    for x in range(first, last + 1):\n        doc.update_stream(text, b\"1 0 0 1 0 10 cm BT /T0 1 Tf <%02X> Tj ET\" % x)\n        doc.xref_set_key(page.xref, \"Contents\", f\"{text} 0 R\")\n        char_data = page.get_svg_image(text_as_path=True)\n        char_doc = pymupdf.Document(stream=io.BytesIO(char_data.encode(\"U8\")))\n        char_bbox = []\n        for element in char_doc:\n            for item in element.get_drawings():\n                char_bbox.append(item[\"rect\"])\n        if char_bbox_merged := merge_bbox(char_bbox, factor):\n            bbox_list[x] = char_bbox_merged\n    doc.delete_page(-1)\n    return bbox_list\n"
  },
  {
    "path": "babeldoc/format/pdf/babelpdf/utils.py",
    "content": "from babeldoc.pdfminer.pdftypes import PDFObjRef\n\n\ndef guarded_bbox(bbox):\n    \"\"\"Return a list copy of bbox with any PDFObjRef entries resolved.\"\"\"\n    bbox_guarded = []\n    for v in bbox:\n        # Resolve indirect references; every other value passes through unchanged.\n        u = v.resolve() if isinstance(v, PDFObjRef) else v\n        bbox_guarded.append(u)\n    return bbox_guarded\n"
  },
  {
    "path": "babeldoc/format/pdf/babelpdf/win_core.py",
    "content": "win_core = {\n    \"Arial\": {\n        \"space\": (0, 0, 0, 0),\n        \"exclam\": (85, 0, 194, 715),\n        \"quotedbl\": (45, 462, 308, 715),\n        \"numbersign\": (10, -12, 543, 728),\n        \"dollar\": (35, -103, 509, 781),\n        \"percent\": (58, -26, 827, 728),\n        \"ampersand\": (42, -16, 644, 728),\n        \"quotesingle\": (43, 462, 144, 715),\n        \"parenleft\": (60, -210, 296, 728),\n        \"parenright\": (60, -210, 296, 728),\n        \"asterisk\": (31, 423, 354, 728),\n        \"plus\": (55, 115, 528, 588),\n        \"comma\": (83, -141, 188, 100),\n        \"hyphen\": (31, 214, 301, 303),\n        \"period\": (90, 0, 190, 100),\n        \"slash\": (0, -12, 277, 728),\n        \"zero\": (41, -12, 508, 718),\n        \"one\": (108, 0, 372, 718),\n        \"two\": (30, 0, 503, 718),\n        \"three\": (41, -12, 510, 718),\n        \"four\": (12, 0, 507, 715),\n        \"five\": (41, -12, 516, 706),\n        \"six\": (37, -12, 510, 718),\n        \"seven\": (47, 0, 510, 706),\n        \"eight\": (40, -12, 512, 718),\n        \"nine\": (41, -12, 512, 718),\n        \"colon\": (90, 0, 190, 518),\n        \"semicolon\": (83, -141, 188, 518),\n        \"less\": (54, 110, 528, 595),\n        \"equal\": (55, 203, 528, 502),\n        \"greater\": (54, 110, 528, 595),\n        \"question\": (43, 0, 505, 728),\n        \"at\": (54, -210, 979, 729),\n        \"A\": (-1, 0, 668, 715),\n        \"B\": (73, 0, 613, 715),\n        \"C\": (49, -12, 682, 728),\n        \"D\": (77, 0, 668, 715),\n        \"E\": (79, 0, 613, 715),\n        \"F\": (82, 0, 564, 715),\n        \"G\": (53, -12, 715, 728),\n        \"H\": (80, 0, 641, 715),\n        \"I\": (93, 0, 187, 715),\n        \"J\": (28, -12, 422, 715),\n        \"K\": (73, 0, 665, 715),\n        \"L\": (73, 0, 520, 715),\n        \"M\": (74, 0, 757, 715),\n        \"N\": (76, 0, 640, 715),\n        \"O\": (48, -12, 732, 728),\n        \"P\": (77, 0, 623, 715),\n        \"Q\": 
(42, -55, 741, 728),\n        \"R\": (78, 0, 709, 715),\n        \"S\": (44, -12, 614, 728),\n        \"T\": (23, 0, 590, 715),\n        \"U\": (78, -12, 641, 715),\n        \"V\": (4, 0, 659, 715),\n        \"W\": (12, 0, 932, 715),\n        \"X\": (4, 0, 660, 715),\n        \"Y\": (2, 0, 659, 715),\n        \"Z\": (20, 0, 585, 715),\n        \"bracketleft\": (67, -198, 261, 715),\n        \"backslash\": (0, -12, 277, 728),\n        \"bracketright\": (19, -198, 212, 715),\n        \"asciicircum\": (26, 336, 442, 728),\n        \"underscore\": (-15, -198, 567, -135),\n        \"grave\": (43, 583, 227, 719),\n        \"a\": (36, -11, 513, 530),\n        \"b\": (65, -11, 515, 715),\n        \"c\": (39, -11, 490, 530),\n        \"d\": (34, -11, 483, 715),\n        \"e\": (36, -11, 514, 530),\n        \"f\": (9, 0, 312, 728),\n        \"g\": (32, -210, 489, 530),\n        \"h\": (65, 0, 488, 715),\n        \"i\": (66, 0, 154, 715),\n        \"j\": (-45, -210, 153, 715),\n        \"k\": (66, 0, 496, 715),\n        \"l\": (63, 0, 151, 715),\n        \"m\": (65, 0, 768, 530),\n        \"n\": (65, 0, 487, 530),\n        \"o\": (33, -11, 519, 530),\n        \"p\": (65, -198, 516, 530),\n        \"q\": (35, -198, 484, 530),\n        \"r\": (64, 0, 346, 530),\n        \"s\": (30, -11, 461, 530),\n        \"t\": (17, -6, 270, 699),\n        \"u\": (63, -11, 484, 518),\n        \"v\": (12, 0, 488, 518),\n        \"w\": (2, 0, 714, 518),\n        \"x\": (7, 0, 492, 518),\n        \"y\": (16, -210, 491, 518),\n        \"z\": (19, 0, 478, 518),\n        \"braceleft\": (27, -210, 310, 728),\n        \"bar\": (91, -210, 168, 728),\n        \"braceright\": (22, -210, 305, 728),\n        \"asciitilde\": (42, 271, 541, 432),\n        \"bullet\": (53, 226, 300, 474),\n        \"Euro\": (-13, -12, 540, 728),\n        \"quotesinglbase\": (52, -132, 154, 102),\n        \"florin\": (22, -210, 529, 728),\n        \"quotedblbase\": (34, -132, 288, 102),\n        \"ellipsis\": (116, 0, 883, 
100),\n        \"dagger\": (35, -168, 514, 699),\n        \"daggerdbl\": (35, -168, 516, 706),\n        \"circumflex\": (12, 583, 321, 719),\n        \"perthousand\": (18, -26, 981, 728),\n        \"Scaron\": (44, -12, 614, 893),\n        \"guilsinglleft\": (44, 35, 271, 480),\n        \"OE\": (62, -12, 968, 728),\n        \"Zcaron\": (20, 0, 585, 893),\n        \"quoteleft\": (62, 493, 164, 728),\n        \"quoteright\": (52, 488, 154, 723),\n        \"quotedblleft\": (40, 493, 293, 728),\n        \"quotedblright\": (34, 488, 288, 723),\n        \"endash\": (-1, 223, 554, 294),\n        \"emdash\": (0, 223, 1000, 294),\n        \"tilde\": (3, 595, 330, 708),\n        \"trademark\": (109, 317, 870, 715),\n        \"scaron\": (30, -11, 461, 719),\n        \"guilsinglright\": (44, 35, 266, 480),\n        \"oe\": (40, -11, 906, 530),\n        \"zcaron\": (19, 0, 478, 719),\n        \"Ydieresis\": (2, 0, 659, 859),\n        \"exclamdown\": (113, -197, 222, 518),\n        \"cent\": (52, -199, 504, 715),\n        \"sterling\": (13, -13, 528, 728),\n        \"currency\": (36, 114, 516, 593),\n        \"yen\": (-1, 0, 553, 715),\n        \"brokenbar\": (91, -210, 168, 728),\n        \"section\": (39, -210, 510, 728),\n        \"dieresis\": (29, 620, 303, 720),\n        \"copyright\": (1, -8, 738, 728),\n        \"ordfeminine\": (22, 364, 350, 728),\n        \"guillemotleft\": (65, 35, 483, 480),\n        \"logicalnot\": (55, 207, 528, 502),\n        \"registered\": (1, -8, 738, 728),\n        \"macron\": (-15, 764, 567, 827),\n        \"degree\": (62, 457, 333, 728),\n        \"plusminus\": (38, 0, 510, 600),\n        \"twosuperior\": (12, 357, 316, 724),\n        \"threesuperior\": (16, 349, 315, 724),\n        \"acute\": (108, 583, 288, 719),\n        \"mu\": (78, -198, 497, 518),\n        \"paragraph\": (0, -198, 540, 715),\n        \"periodcentered\": (116, 311, 216, 411),\n        \"cedilla\": (52, -205, 263, 11),\n        \"onesuperior\": (52, 357, 232, 724),\n       
 \"ordmasculine\": (21, 361, 342, 728),\n        \"guillemotright\": (68, 35, 486, 480),\n        \"onequarter\": (52, -27, 819, 728),\n        \"onehalf\": (52, -27, 816, 728),\n        \"threequarters\": (16, -27, 819, 728),\n        \"questiondown\": (77, -209, 538, 518),\n        \"Agrave\": (-1, 0, 668, 896),\n        \"Aacute\": (-1, 0, 668, 896),\n        \"Acircumflex\": (-1, 0, 668, 896),\n        \"Atilde\": (-1, 0, 668, 872),\n        \"Adieresis\": (-1, 0, 668, 859),\n        \"Aring\": (-1, 0, 668, 869),\n        \"AE\": (0, 0, 945, 715),\n        \"Ccedilla\": (49, -205, 682, 728),\n        \"Egrave\": (79, 0, 613, 896),\n        \"Eacute\": (79, 0, 613, 896),\n        \"Ecircumflex\": (79, 0, 613, 896),\n        \"Edieresis\": (79, 0, 613, 859),\n        \"Igrave\": (26, 0, 209, 896),\n        \"Iacute\": (68, 0, 249, 896),\n        \"Icircumflex\": (-15, 0, 293, 896),\n        \"Idieresis\": (1, 0, 275, 859),\n        \"Eth\": (-1, 0, 668, 715),\n        \"Ntilde\": (76, 0, 640, 872),\n        \"Ograve\": (48, -12, 732, 896),\n        \"Oacute\": (48, -12, 732, 896),\n        \"Ocircumflex\": (48, -12, 732, 896),\n        \"Otilde\": (48, -12, 732, 872),\n        \"Odieresis\": (48, -12, 732, 859),\n        \"multiply\": (78, 140, 504, 566),\n        \"Oslash\": (40, -28, 740, 742),\n        \"Ugrave\": (78, -12, 641, 896),\n        \"Uacute\": (78, -12, 641, 896),\n        \"Ucircumflex\": (78, -12, 641, 896),\n        \"Udieresis\": (78, -12, 641, 859),\n        \"Yacute\": (2, 0, 659, 896),\n        \"Thorn\": (77, 0, 623, 715),\n        \"germandbls\": (74, -12, 579, 728),\n        \"agrave\": (36, -11, 513, 719),\n        \"aacute\": (36, -11, 513, 719),\n        \"acircumflex\": (36, -11, 513, 719),\n        \"atilde\": (36, -11, 513, 708),\n        \"adieresis\": (36, -11, 513, 720),\n        \"aring\": (36, -11, 513, 740),\n        \"ae\": (33, -11, 848, 530),\n        \"ccedilla\": (39, -195, 490, 530),\n        \"egrave\": (36, -11, 514, 
719),\n        \"eacute\": (36, -11, 514, 719),\n        \"ecircumflex\": (36, -11, 514, 719),\n        \"edieresis\": (36, -11, 514, 720),\n        \"igrave\": (17, 0, 200, 719),\n        \"iacute\": (92, 0, 272, 719),\n        \"icircumflex\": (-8, 0, 300, 719),\n        \"idieresis\": (4, 0, 278, 720),\n        \"eth\": (35, -12, 516, 715),\n        \"ntilde\": (65, 0, 487, 708),\n        \"ograve\": (33, -11, 519, 719),\n        \"oacute\": (33, -11, 519, 719),\n        \"ocircumflex\": (33, -11, 519, 719),\n        \"otilde\": (33, -11, 519, 708),\n        \"odieresis\": (33, -11, 519, 720),\n        \"divide\": (38, 155, 510, 550),\n        \"oslash\": (62, -38, 548, 550),\n        \"ugrave\": (63, -11, 484, 719),\n        \"uacute\": (63, -11, 484, 719),\n        \"ucircumflex\": (63, -11, 484, 719),\n        \"udieresis\": (63, -11, 484, 720),\n        \"yacute\": (16, -210, 491, 719),\n        \"thorn\": (65, -198, 516, 715),\n        \"ydieresis\": (16, -210, 491, 720),\n    },\n    \"Arial,Bold\": {\n        \"space\": (0, 0, 0, 0),\n        \"exclam\": (89, 0, 238, 715),\n        \"quotedbl\": (54, 461, 424, 715),\n        \"numbersign\": (8, -12, 544, 728),\n        \"dollar\": (34, -100, 511, 773),\n        \"percent\": (43, -28, 842, 728),\n        \"ampersand\": (43, -18, 706, 728),\n        \"quotesingle\": (44, 461, 194, 715),\n        \"parenleft\": (52, -210, 300, 728),\n        \"parenright\": (32, -210, 281, 728),\n        \"asterisk\": (13, 386, 367, 728),\n        \"plus\": (41, 103, 541, 603),\n        \"comma\": (57, -159, 205, 137),\n        \"hyphen\": (56, 190, 325, 328),\n        \"period\": (71, 0, 208, 137),\n        \"slash\": (-1, -12, 278, 728),\n        \"zero\": (41, -12, 506, 718),\n        \"one\": (79, 0, 393, 718),\n        \"two\": (24, 0, 505, 718),\n        \"three\": (37, -12, 513, 718),\n        \"four\": (18, 0, 533, 718),\n        \"five\": (44, -12, 525, 706),\n        \"six\": (42, -12, 520, 718),\n        
\"seven\": (42, 0, 511, 706),\n        \"eight\": (40, -12, 511, 718),\n        \"nine\": (31, -12, 509, 718),\n        \"colon\": (98, 0, 235, 518),\n        \"semicolon\": (83, -159, 231, 518),\n        \"less\": (46, 81, 537, 625),\n        \"equal\": (41, 181, 541, 524),\n        \"greater\": (46, 81, 537, 624),\n        \"question\": (51, 0, 565, 723),\n        \"at\": (29, -210, 971, 728),\n        \"A\": (0, 0, 718, 715),\n        \"B\": (73, 0, 672, 715),\n        \"C\": (47, -12, 670, 728),\n        \"D\": (72, 0, 672, 715),\n        \"E\": (72, 0, 617, 715),\n        \"F\": (73, 0, 564, 715),\n        \"G\": (47, -12, 717, 728),\n        \"H\": (73, 0, 645, 715),\n        \"I\": (68, 0, 212, 715),\n        \"J\": (17, -12, 475, 715),\n        \"K\": (74, 0, 720, 715),\n        \"L\": (76, 0, 580, 709),\n        \"M\": (70, 0, 762, 715),\n        \"N\": (74, 0, 642, 715),\n        \"O\": (43, -12, 737, 728),\n        \"P\": (72, 0, 621, 715),\n        \"Q\": (43, -71, 764, 728),\n        \"R\": (73, 0, 716, 715),\n        \"S\": (36, -12, 618, 728),\n        \"T\": (21, 0, 590, 715),\n        \"U\": (71, -12, 642, 715),\n        \"V\": (0, 0, 666, 715),\n        \"W\": (3, 0, 942, 715),\n        \"X\": (0, 0, 665, 715),\n        \"Y\": (-1, 0, 667, 715),\n        \"Z\": (10, 0, 592, 715),\n        \"bracketleft\": (71, -201, 314, 715),\n        \"backslash\": (-1, -12, 278, 728),\n        \"bracketright\": (18, -201, 261, 715),\n        \"asciicircum\": (56, 337, 527, 728),\n        \"underscore\": (-9, -197, 561, -108),\n        \"grave\": (20, 582, 241, 728),\n        \"a\": (35, -11, 522, 530),\n        \"b\": (65, -11, 572, 715),\n        \"c\": (41, -11, 530, 530),\n        \"d\": (41, -11, 547, 715),\n        \"e\": (31, -11, 516, 530),\n        \"f\": (11, 0, 362, 728),\n        \"g\": (41, -210, 546, 530),\n        \"h\": (71, 0, 543, 715),\n        \"i\": (71, 0, 208, 715),\n        \"j\": (-45, -210, 206, 715),\n        \"k\": (66, 0, 546, 
715),\n        \"l\": (71, 0, 208, 715),\n        \"m\": (61, 0, 824, 530),\n        \"n\": (70, 0, 543, 530),\n        \"o\": (40, -11, 575, 530),\n        \"p\": (67, -197, 573, 530),\n        \"q\": (44, -197, 547, 530),\n        \"r\": (65, 0, 401, 530),\n        \"s\": (23, -11, 507, 530),\n        \"t\": (15, -11, 320, 701),\n        \"u\": (68, -11, 540, 518),\n        \"v\": (5, 0, 543, 518),\n        \"w\": (4, 0, 777, 518),\n        \"x\": (5, 0, 546, 518),\n        \"y\": (6, -210, 540, 518),\n        \"z\": (16, 0, 479, 518),\n        \"braceleft\": (29, -210, 363, 728),\n        \"bar\": (85, -210, 194, 728),\n        \"braceright\": (21, -210, 355, 728),\n        \"asciitilde\": (32, 253, 551, 451),\n        \"bullet\": (32, 208, 320, 497),\n        \"Euro\": (-15, -12, 524, 728),\n        \"quotesinglbase\": (57, -159, 205, 137),\n        \"florin\": (-9, -210, 557, 728),\n        \"quotedblbase\": (51, -160, 430, 137),\n        \"ellipsis\": (98, 0, 902, 137),\n        \"dagger\": (33, -170, 517, 707),\n        \"daggerdbl\": (33, -170, 517, 707),\n        \"circumflex\": (1, 583, 332, 728),\n        \"perthousand\": (0, -28, 999, 728),\n        \"Scaron\": (36, -12, 618, 903),\n        \"guilsinglleft\": (36, 34, 298, 479),\n        \"OE\": (35, -12, 969, 728),\n        \"Zcaron\": (10, 0, 592, 903),\n        \"quoteleft\": (74, 425, 222, 722),\n        \"quoteright\": (57, 416, 205, 713),\n        \"quotedblleft\": (64, 425, 441, 722),\n        \"quotedblright\": (51, 418, 430, 715),\n        \"endash\": (-1, 208, 554, 310),\n        \"emdash\": (0, 208, 1000, 310),\n        \"tilde\": (-6, 588, 331, 712),\n        \"trademark\": (105, 315, 877, 715),\n        \"scaron\": (23, -11, 507, 728),\n        \"guilsinglright\": (36, 34, 298, 479),\n        \"oe\": (42, -11, 902, 530),\n        \"zcaron\": (16, 0, 479, 728),\n        \"Ydieresis\": (-1, 0, 667, 874),\n        \"exclamdown\": (95, -198, 243, 518),\n        \"cent\": (41, -196, 530, 710),\n 
       \"sterling\": (6, -12, 540, 728),\n        \"currency\": (21, 100, 530, 610),\n        \"yen\": (0, 0, 551, 715),\n        \"brokenbar\": (85, -210, 194, 728),\n        \"section\": (28, -210, 521, 728),\n        \"dieresis\": (2, 610, 330, 728),\n        \"copyright\": (-4, -17, 743, 730),\n        \"ordfeminine\": (18, 362, 345, 728),\n        \"guillemotleft\": (46, 34, 500, 479),\n        \"logicalnot\": (41, 183, 541, 524),\n        \"registered\": (-4, -17, 743, 730),\n        \"macron\": (-9, 757, 561, 847),\n        \"degree\": (41, 416, 353, 728),\n        \"plusminus\": (24, 0, 524, 674),\n        \"twosuperior\": (12, 354, 308, 724),\n        \"threesuperior\": (19, 349, 312, 724),\n        \"acute\": (91, 582, 312, 728),\n        \"mu\": (54, -198, 525, 518),\n        \"paragraph\": (0, -196, 551, 715),\n        \"periodcentered\": (97, 279, 234, 416),\n        \"cedilla\": (18, -204, 284, -5),\n        \"onesuperior\": (44, 354, 241, 724),\n        \"ordmasculine\": (12, 361, 351, 728),\n        \"guillemotright\": (51, 34, 505, 479),\n        \"onequarter\": (44, -26, 824, 724),\n        \"onehalf\": (44, -26, 808, 724),\n        \"threequarters\": (19, -26, 824, 724),\n        \"questiondown\": (49, -205, 563, 518),\n        \"Agrave\": (0, 0, 718, 902),\n        \"Aacute\": (0, 0, 718, 902),\n        \"Acircumflex\": (0, 0, 718, 900),\n        \"Atilde\": (0, 0, 718, 879),\n        \"Adieresis\": (0, 0, 718, 874),\n        \"Aring\": (0, 0, 718, 858),\n        \"AE\": (-41, 0, 951, 715),\n        \"Ccedilla\": (47, -204, 670, 728),\n        \"Egrave\": (72, 0, 617, 902),\n        \"Eacute\": (72, 0, 617, 902),\n        \"Ecircumflex\": (72, 0, 617, 900),\n        \"Edieresis\": (72, 0, 617, 874),\n        \"Igrave\": (-4, 0, 216, 902),\n        \"Iacute\": (51, 0, 272, 902),\n        \"Icircumflex\": (-20, 0, 310, 900),\n        \"Idieresis\": (-21, 0, 306, 874),\n        \"Eth\": (-1, 0, 672, 715),\n        \"Ntilde\": (74, 0, 642, 879),\n   
     \"Ograve\": (43, -12, 737, 902),\n        \"Oacute\": (43, -12, 737, 902),\n        \"Ocircumflex\": (43, -12, 737, 900),\n        \"Otilde\": (43, -12, 737, 879),\n        \"Odieresis\": (43, -12, 737, 874),\n        \"multiply\": (53, 114, 529, 591),\n        \"Oslash\": (30, -40, 750, 750),\n        \"Ugrave\": (71, -12, 642, 902),\n        \"Uacute\": (71, -12, 642, 902),\n        \"Ucircumflex\": (71, -12, 642, 900),\n        \"Udieresis\": (71, -12, 642, 874),\n        \"Yacute\": (-1, 0, 667, 902),\n        \"Thorn\": (72, 0, 621, 715),\n        \"germandbls\": (67, -11, 575, 728),\n        \"agrave\": (35, -11, 522, 728),\n        \"aacute\": (35, -11, 522, 728),\n        \"acircumflex\": (35, -11, 522, 728),\n        \"atilde\": (35, -11, 522, 712),\n        \"adieresis\": (35, -11, 522, 728),\n        \"aring\": (35, -11, 522, 750),\n        \"ae\": (42, -11, 841, 530),\n        \"ccedilla\": (41, -204, 530, 530),\n        \"egrave\": (31, -11, 516, 728),\n        \"eacute\": (31, -11, 516, 728),\n        \"ecircumflex\": (31, -11, 516, 728),\n        \"edieresis\": (31, -11, 516, 728),\n        \"igrave\": (-11, 0, 209, 728),\n        \"iacute\": (61, 0, 282, 728),\n        \"icircumflex\": (-24, 0, 305, 728),\n        \"idieresis\": (-23, 0, 304, 728),\n        \"eth\": (40, -12, 573, 715),\n        \"ntilde\": (70, 0, 543, 712),\n        \"ograve\": (40, -11, 575, 728),\n        \"oacute\": (40, -11, 575, 728),\n        \"ocircumflex\": (40, -11, 575, 728),\n        \"otilde\": (40, -11, 575, 712),\n        \"odieresis\": (40, -11, 575, 728),\n        \"divide\": (23, 90, 524, 616),\n        \"oslash\": (42, -35, 577, 546),\n        \"ugrave\": (68, -11, 540, 728),\n        \"uacute\": (68, -11, 540, 728),\n        \"ucircumflex\": (68, -11, 540, 728),\n        \"udieresis\": (68, -11, 540, 728),\n        \"yacute\": (6, -210, 540, 728),\n        \"thorn\": (67, -197, 573, 715),\n        \"ydieresis\": (6, -210, 540, 728),\n    },\n    
\"Arial,BoldItalic\": {\n        \"space\": (0, 0, 0, 0),\n        \"exclam\": (61, 0, 353, 715),\n        \"quotedbl\": (151, 461, 506, 715),\n        \"numbersign\": (47, -12, 583, 728),\n        \"dollar\": (43, -99, 576, 770),\n        \"percent\": (90, -30, 864, 728),\n        \"ampersand\": (83, -16, 706, 728),\n        \"quotesingle\": (151, 461, 329, 715),\n        \"parenleft\": (65, -210, 435, 728),\n        \"parenright\": (-78, -210, 291, 728),\n        \"asterisk\": (98, 386, 452, 728),\n        \"plus\": (80, 103, 581, 603),\n        \"comma\": (10, -155, 212, 135),\n        \"hyphen\": (38, 190, 338, 325),\n        \"period\": (43, 0, 210, 135),\n        \"slash\": (-43, -12, 408, 728),\n        \"zero\": (64, -12, 571, 718),\n        \"one\": (118, 0, 510, 720),\n        \"two\": (60, 0, 570, 718),\n        \"three\": (50, -12, 560, 718),\n        \"four\": (27, 0, 560, 715),\n        \"five\": (63, -12, 577, 706),\n        \"six\": (81, -12, 575, 718),\n        \"seven\": (103, 0, 602, 706),\n        \"eight\": (65, -12, 566, 718),\n        \"nine\": (63, -12, 558, 718),\n        \"colon\": (70, 0, 316, 518),\n        \"semicolon\": (40, -155, 319, 518),\n        \"less\": (85, 81, 576, 625),\n        \"equal\": (80, 181, 581, 524),\n        \"greater\": (85, 81, 576, 624),\n        \"question\": (123, 0, 618, 728),\n        \"at\": (64, -210, 1006, 728),\n        \"A\": (-11, 0, 673, 715),\n        \"B\": (40, 0, 709, 715),\n        \"C\": (94, -12, 745, 728),\n        \"D\": (43, 0, 724, 715),\n        \"E\": (41, 0, 721, 715),\n        \"F\": (39, 0, 689, 715),\n        \"G\": (88, -12, 785, 728),\n        \"H\": (43, 0, 764, 715),\n        \"I\": (34, 0, 331, 715),\n        \"J\": (28, -12, 599, 715),\n        \"K\": (39, 0, 801, 715),\n        \"L\": (44, 0, 581, 715),\n        \"M\": (40, 0, 878, 715),\n        \"N\": (44, 0, 762, 715),\n        \"O\": (87, -12, 784, 728),\n        \"P\": (40, 0, 702, 715),\n        \"Q\": (87, -95, 783, 
728),\n        \"R\": (43, 0, 741, 715),\n        \"S\": (63, -12, 676, 728),\n        \"T\": (120, 0, 708, 715),\n        \"U\": (91, -12, 765, 715),\n        \"V\": (113, 0, 793, 715),\n        \"W\": (117, 0, 1067, 715),\n        \"X\": (-30, 0, 783, 715),\n        \"Y\": (114, 0, 784, 715),\n        \"Z\": (24, 0, 667, 715),\n        \"bracketleft\": (9, -197, 438, 715),\n        \"backslash\": (78, -12, 287, 728),\n        \"bracketright\": (-55, -197, 375, 715),\n        \"asciicircum\": (104, 337, 576, 728),\n        \"underscore\": (-9, -197, 561, -108),\n        \"grave\": (133, 585, 331, 731),\n        \"a\": (44, -12, 533, 530),\n        \"b\": (36, -12, 601, 715),\n        \"c\": (60, -12, 564, 530),\n        \"d\": (59, -12, 668, 715),\n        \"e\": (58, -12, 554, 530),\n        \"f\": (53, 0, 470, 728),\n        \"g\": (31, -210, 622, 530),\n        \"h\": (41, 0, 590, 715),\n        \"i\": (40, 0, 329, 715),\n        \"j\": (-109, -210, 331, 715),\n        \"k\": (37, 0, 614, 715),\n        \"l\": (39, 0, 328, 715),\n        \"m\": (35, 0, 868, 530),\n        \"n\": (41, 0, 591, 530),\n        \"o\": (60, -12, 599, 530),\n        \"p\": (-5, -197, 605, 530),\n        \"q\": (59, -197, 625, 530),\n        \"r\": (32, 0, 474, 530),\n        \"s\": (21, -12, 551, 530),\n        \"t\": (75, -12, 390, 698),\n        \"u\": (70, -12, 619, 518),\n        \"v\": (74, 0, 618, 518),\n        \"w\": (71, 0, 840, 518),\n        \"x\": (-21, 0, 612, 518),\n        \"y\": (6, -210, 620, 518),\n        \"z\": (16, 0, 518, 518),\n        \"braceleft\": (41, -210, 490, 728),\n        \"bar\": (85, -210, 194, 728),\n        \"braceright\": (-84, -210, 363, 728),\n        \"asciitilde\": (66, 253, 585, 451),\n        \"bullet\": (81, 208, 369, 497),\n        \"Euro\": (26, -12, 639, 728),\n        \"quotesinglbase\": (10, -155, 212, 135),\n        \"florin\": (-9, -210, 557, 728),\n        \"quotedblbase\": (3, -155, 441, 135),\n        \"ellipsis\": (92, 0, 907, 
135),\n        \"dagger\": (84, -170, 594, 706),\n        \"daggerdbl\": (0, -170, 599, 706),\n        \"circumflex\": (56, 584, 391, 731),\n        \"perthousand\": (67, -28, 1021, 728),\n        \"Scaron\": (63, -12, 676, 905),\n        \"guilsinglleft\": (59, 34, 378, 477),\n        \"OE\": (68, -12, 1078, 728),\n        \"Zcaron\": (24, 0, 667, 905),\n        \"quoteleft\": (108, 433, 311, 724),\n        \"quoteright\": (123, 424, 325, 715),\n        \"quotedblleft\": (125, 433, 562, 724),\n        \"quotedblright\": (128, 424, 566, 715),\n        \"endash\": (-1, 208, 554, 310),\n        \"emdash\": (0, 208, 1000, 310),\n        \"tilde\": (92, 592, 428, 710),\n        \"trademark\": (144, 315, 916, 715),\n        \"scaron\": (21, -12, 551, 731),\n        \"guilsinglright\": (9, 34, 318, 477),\n        \"oe\": (58, -12, 943, 530),\n        \"zcaron\": (16, 0, 527, 731),\n        \"Ydieresis\": (114, 0, 784, 875),\n        \"exclamdown\": (11, -197, 304, 518),\n        \"cent\": (58, -192, 562, 713),\n        \"sterling\": (20, -18, 610, 728),\n        \"currency\": (65, 100, 574, 610),\n        \"yen\": (23, 0, 666, 715),\n        \"brokenbar\": (85, -210, 194, 728),\n        \"section\": (21, -211, 560, 728),\n        \"dieresis\": (84, 597, 435, 716),\n        \"copyright\": (43, -17, 791, 730),\n        \"ordfeminine\": (82, 362, 412, 728),\n        \"guillemotleft\": (82, 34, 590, 477),\n        \"logicalnot\": (80, 183, 581, 524),\n        \"registered\": (43, -17, 791, 730),\n        \"macron\": (68, 757, 638, 847),\n        \"degree\": (109, 416, 421, 728),\n        \"plusminus\": (63, 0, 563, 674),\n        \"twosuperior\": (82, 354, 395, 724),\n        \"threesuperior\": (76, 349, 389, 724),\n        \"acute\": (183, 583, 435, 730),\n        \"mu\": (-37, -200, 584, 518),\n        \"paragraph\": (43, -196, 596, 715),\n        \"periodcentered\": (136, 290, 303, 425),\n        \"cedilla\": (6, -207, 267, -12),\n        \"onesuperior\": (114, 354, 361, 
725),\n        \"ordmasculine\": (72, 362, 414, 728),\n        \"guillemotright\": (22, 34, 531, 477),\n        \"onequarter\": (99, -29, 839, 724),\n        \"onehalf\": (84, -29, 835, 724),\n        \"threequarters\": (75, -29, 851, 724),\n        \"questiondown\": (26, -209, 521, 518),\n        \"Agrave\": (-11, 0, 673, 905),\n        \"Aacute\": (-11, 0, 686, 903),\n        \"Acircumflex\": (-11, 0, 673, 905),\n        \"Atilde\": (-11, 0, 673, 874),\n        \"Adieresis\": (-11, 0, 680, 875),\n        \"Aring\": (-11, -9, 673, 854),\n        \"AE\": (-32, 0, 1059, 715),\n        \"Ccedilla\": (94, -204, 745, 728),\n        \"Egrave\": (41, 0, 721, 905),\n        \"Eacute\": (41, 0, 721, 903),\n        \"Ecircumflex\": (41, 0, 721, 905),\n        \"Edieresis\": (41, 0, 721, 875),\n        \"Igrave\": (34, 0, 382, 905),\n        \"Iacute\": (34, 0, 451, 903),\n        \"Icircumflex\": (34, 0, 426, 905),\n        \"Idieresis\": (34, 0, 452, 875),\n        \"Eth\": (36, 0, 725, 715),\n        \"Ntilde\": (44, 0, 762, 874),\n        \"Ograve\": (87, -12, 784, 905),\n        \"Oacute\": (87, -12, 784, 903),\n        \"Ocircumflex\": (87, -12, 784, 905),\n        \"Otilde\": (87, -12, 784, 874),\n        \"Odieresis\": (87, -12, 784, 875),\n        \"multiply\": (92, 114, 568, 591),\n        \"Oslash\": (77, -59, 786, 766),\n        \"Ugrave\": (91, -12, 765, 905),\n        \"Uacute\": (91, -12, 765, 903),\n        \"Ucircumflex\": (91, -12, 765, 905),\n        \"Udieresis\": (91, -12, 765, 875),\n        \"Yacute\": (114, 0, 784, 903),\n        \"Thorn\": (40, 0, 673, 715),\n        \"germandbls\": (35, -12, 581, 728),\n        \"agrave\": (44, -12, 533, 731),\n        \"aacute\": (44, -12, 567, 730),\n        \"acircumflex\": (44, -12, 533, 731),\n        \"atilde\": (44, -12, 549, 710),\n        \"adieresis\": (44, -12, 553, 716),\n        \"aring\": (44, -12, 533, 753),\n        \"ae\": (30, -12, 865, 530),\n        \"ccedilla\": (60, -203, 564, 530),\n        
\"egrave\": (58, -12, 554, 731),\n        \"eacute\": (58, -12, 562, 730),\n        \"ecircumflex\": (58, -12, 554, 731),\n        \"edieresis\": (58, -12, 554, 716),\n        \"igrave\": (40, 0, 347, 731),\n        \"iacute\": (40, 0, 413, 730),\n        \"icircumflex\": (40, 0, 389, 731),\n        \"idieresis\": (40, 0, 417, 716),\n        \"eth\": (60, -12, 607, 715),\n        \"ntilde\": (41, 0, 591, 710),\n        \"ograve\": (60, -12, 599, 731),\n        \"oacute\": (60, -12, 599, 730),\n        \"ocircumflex\": (60, -12, 599, 731),\n        \"otilde\": (60, -12, 599, 710),\n        \"odieresis\": (60, -12, 599, 716),\n        \"divide\": (63, 90, 563, 616),\n        \"oslash\": (52, -52, 604, 571),\n        \"ugrave\": (70, -12, 619, 731),\n        \"uacute\": (70, -12, 619, 730),\n        \"ucircumflex\": (70, -12, 619, 731),\n        \"udieresis\": (70, -12, 619, 716),\n        \"yacute\": (6, -210, 620, 730),\n        \"thorn\": (-9, -197, 602, 715),\n        \"ydieresis\": (6, -210, 620, 716),\n    },\n    \"Arial,Italic\": {\n        \"space\": (0, 0, 0, 0),\n        \"exclam\": (56, 0, 303, 715),\n        \"quotedbl\": (135, 462, 428, 715),\n        \"numbersign\": (46, -12, 579, 728),\n        \"dollar\": (51, -95, 572, 763),\n        \"percent\": (97, -26, 852, 728),\n        \"ampersand\": (78, -17, 651, 728),\n        \"quotesingle\": (126, 462, 258, 715),\n        \"parenleft\": (84, -210, 413, 728),\n        \"parenright\": (-53, -210, 275, 728),\n        \"asterisk\": (115, 423, 437, 728),\n        \"plus\": (89, 115, 562, 588),\n        \"comma\": (23, -144, 175, 100),\n        \"hyphen\": (46, 214, 334, 303),\n        \"period\": (57, 0, 178, 100),\n        \"slash\": (-50, -11, 410, 728),\n        \"zero\": (70, -12, 565, 718),\n        \"one\": (147, 0, 479, 718),\n        \"two\": (58, 0, 562, 718),\n        \"three\": (54, -12, 557, 718),\n        \"four\": (45, 0, 542, 715),\n        \"five\": (69, -12, 572, 706),\n        \"six\": (83, 
-12, 567, 718),\n        \"seven\": (121, 0, 595, 706),\n        \"eight\": (74, -12, 564, 718),\n        \"nine\": (67, -12, 551, 718),\n        \"colon\": (57, 0, 265, 518),\n        \"semicolon\": (23, -144, 262, 518),\n        \"less\": (89, 110, 563, 595),\n        \"equal\": (89, 203, 562, 502),\n        \"greater\": (89, 110, 563, 595),\n        \"question\": (126, 0, 560, 728),\n        \"at\": (54, -210, 979, 729),\n        \"A\": (-20, 0, 616, 715),\n        \"B\": (43, 0, 654, 715),\n        \"C\": (90, -12, 730, 728),\n        \"D\": (44, 0, 711, 715),\n        \"E\": (44, 0, 711, 715),\n        \"F\": (45, 0, 660, 715),\n        \"G\": (97, -12, 766, 728),\n        \"H\": (41, 0, 753, 715),\n        \"I\": (57, 0, 302, 715),\n        \"J\": (33, -12, 535, 715),\n        \"K\": (44, 0, 741, 715),\n        \"L\": (40, 0, 524, 715),\n        \"M\": (43, 0, 872, 715),\n        \"N\": (48, 0, 756, 715),\n        \"O\": (91, -12, 772, 728),\n        \"P\": (42, 0, 697, 715),\n        \"Q\": (92, -82, 773, 728),\n        \"R\": (46, 0, 729, 715),\n        \"S\": (70, -12, 671, 728),\n        \"T\": (124, 0, 705, 715),\n        \"U\": (96, -12, 754, 715),\n        \"V\": (124, 0, 756, 715),\n        \"W\": (125, 0, 1061, 715),\n        \"X\": (-31, 0, 769, 715),\n        \"Y\": (116, 0, 772, 715),\n        \"Z\": (24, 0, 636, 715),\n        \"bracketleft\": (6, -195, 391, 715),\n        \"backslash\": (84, -11, 273, 728),\n        \"bracketright\": (-58, -195, 329, 715),\n        \"asciicircum\": (70, 336, 486, 728),\n        \"underscore\": (-63, -198, 519, -135),\n        \"grave\": (145, 581, 309, 715),\n        \"a\": (43, -11, 526, 530),\n        \"b\": (33, -11, 535, 715),\n        \"c\": (56, -11, 510, 530),\n        \"d\": (52, -11, 598, 715),\n        \"e\": (51, -11, 531, 530),\n        \"f\": (45, 0, 407, 728),\n        \"g\": (25, -207, 564, 530),\n        \"h\": (33, 0, 528, 715),\n        \"i\": (29, 0, 267, 715),\n        \"j\": (-121, -207, 
267, 715),\n        \"k\": (34, 0, 553, 715),\n        \"l\": (26, 0, 264, 715),\n        \"m\": (32, 0, 812, 530),\n        \"n\": (33, 0, 527, 530),\n        \"o\": (48, -11, 540, 530),\n        \"p\": (-10, -198, 535, 530),\n        \"q\": (51, -198, 552, 530),\n        \"r\": (33, 0, 419, 530),\n        \"s\": (41, -11, 501, 530),\n        \"t\": (56, -8, 321, 707),\n        \"u\": (62, -11, 557, 518),\n        \"v\": (79, 0, 559, 518),\n        \"w\": (77, 0, 776, 518),\n        \"x\": (-1, 0, 537, 518),\n        \"y\": (0, -210, 561, 518),\n        \"z\": (19, 0, 512, 518),\n        \"braceleft\": (52, -210, 445, 728),\n        \"bar\": (91, -210, 168, 728),\n        \"braceright\": (-83, -210, 309, 728),\n        \"asciitilde\": (80, 271, 579, 432),\n        \"bullet\": (53, 226, 300, 474),\n        \"Euro\": (39, -12, 645, 728),\n        \"quotesinglbase\": (-7, -144, 144, 100),\n        \"florin\": (22, -210, 529, 728),\n        \"quotedblbase\": (-19, -144, 291, 100),\n        \"ellipsis\": (143, 0, 932, 100),\n        \"dagger\": (90, -170, 583, 706),\n        \"daggerdbl\": (5, -170, 588, 706),\n        \"circumflex\": (100, 581, 387, 715),\n        \"perthousand\": (66, -26, 1003, 728),\n        \"Scaron\": (70, -12, 671, 894),\n        \"guilsinglleft\": (47, 35, 313, 478),\n        \"OE\": (80, -12, 1043, 728),\n        \"Zcaron\": (24, 0, 636, 894),\n        \"quoteleft\": (128, 482, 280, 728),\n        \"quoteright\": (125, 467, 276, 712),\n        \"quotedblleft\": (105, 482, 413, 728),\n        \"quotedblright\": (104, 467, 417, 712),\n        \"endash\": (-1, 223, 554, 294),\n        \"emdash\": (0, 223, 1000, 294),\n        \"tilde\": (93, 596, 423, 706),\n        \"trademark\": (136, 317, 897, 715),\n        \"scaron\": (41, -11, 503, 715),\n        \"guilsinglright\": (16, 35, 288, 478),\n        \"oe\": (62, -11, 918, 530),\n        \"zcaron\": (19, 0, 512, 715),\n        \"Ydieresis\": (116, 0, 772, 858),\n        \"exclamdown\": (57, -197, 
305, 518),\n        \"cent\": (75, -198, 529, 725),\n        \"sterling\": (31, -12, 607, 728),\n        \"currency\": (80, 114, 560, 593),\n        \"yen\": (36, 0, 666, 715),\n        \"brokenbar\": (91, -210, 168, 728),\n        \"section\": (30, -210, 555, 728),\n        \"dieresis\": (115, 599, 408, 699),\n        \"copyright\": (40, -8, 777, 728),\n        \"ordfeminine\": (81, 359, 409, 728),\n        \"guillemotleft\": (78, 35, 537, 478),\n        \"logicalnot\": (89, 207, 562, 502),\n        \"registered\": (40, -8, 777, 728),\n        \"macron\": (88, 764, 670, 827),\n        \"degree\": (133, 457, 404, 728),\n        \"plusminus\": (60, 0, 533, 600),\n        \"twosuperior\": (74, 357, 400, 724),\n        \"threesuperior\": (82, 349, 399, 724),\n        \"acute\": (168, 581, 372, 715),\n        \"mu\": (5, -200, 571, 518),\n        \"paragraph\": (69, -198, 609, 715),\n        \"periodcentered\": (151, 307, 272, 407),\n        \"cedilla\": (37, -207, 287, 5),\n        \"onesuperior\": (136, 357, 354, 724),\n        \"ordmasculine\": (69, 360, 411, 728),\n        \"guillemotright\": (40, 35, 504, 478),\n        \"onequarter\": (83, -29, 850, 728),\n        \"onehalf\": (60, -29, 827, 728),\n        \"threequarters\": (82, -29, 865, 728),\n        \"questiondown\": (83, -209, 517, 518),\n        \"Agrave\": (-20, 0, 616, 894),\n        \"Aacute\": (-20, 0, 616, 894),\n        \"Acircumflex\": (-20, 0, 616, 894),\n        \"Atilde\": (-20, 0, 616, 867),\n        \"Adieresis\": (-20, 0, 616, 859),\n        \"Aring\": (-20, 0, 616, 863),\n        \"AE\": (-40, 0, 1043, 715),\n        \"Ccedilla\": (90, -210, 730, 728),\n        \"Egrave\": (44, 0, 711, 894),\n        \"Eacute\": (44, 0, 711, 894),\n        \"Ecircumflex\": (44, 0, 711, 894),\n        \"Edieresis\": (44, 0, 711, 858),\n        \"Igrave\": (57, 0, 340, 894),\n        \"Iacute\": (57, 0, 389, 894),\n        \"Icircumflex\": (57, 0, 407, 894),\n        \"Idieresis\": (57, 0, 413, 859),\n        
\"Eth\": (44, 0, 720, 715),\n        \"Ntilde\": (48, 0, 756, 867),\n        \"Ograve\": (91, -12, 772, 894),\n        \"Oacute\": (91, -12, 772, 894),\n        \"Ocircumflex\": (91, -12, 772, 894),\n        \"Otilde\": (91, -12, 772, 867),\n        \"Odieresis\": (91, -12, 772, 859),\n        \"multiply\": (127, 140, 553, 566),\n        \"Oslash\": (84, -50, 776, 764),\n        \"Ugrave\": (96, -12, 754, 894),\n        \"Uacute\": (96, -12, 754, 894),\n        \"Ucircumflex\": (96, -12, 754, 894),\n        \"Udieresis\": (96, -12, 754, 859),\n        \"Yacute\": (116, 0, 772, 894),\n        \"Thorn\": (42, 0, 666, 715),\n        \"germandbls\": (36, -12, 567, 728),\n        \"agrave\": (43, -11, 526, 715),\n        \"aacute\": (43, -11, 526, 715),\n        \"acircumflex\": (43, -11, 526, 715),\n        \"atilde\": (43, -11, 540, 706),\n        \"adieresis\": (43, -11, 526, 699),\n        \"aring\": (43, -11, 526, 733),\n        \"ae\": (42, -12, 865, 530),\n        \"ccedilla\": (56, -198, 510, 530),\n        \"egrave\": (51, -11, 531, 715),\n        \"eacute\": (51, -11, 531, 715),\n        \"ecircumflex\": (51, -11, 531, 715),\n        \"edieresis\": (51, -11, 531, 699),\n        \"igrave\": (61, 0, 310, 715),\n        \"iacute\": (61, 0, 349, 715),\n        \"icircumflex\": (61, 0, 361, 715),\n        \"idieresis\": (61, 0, 377, 699),\n        \"eth\": (48, -12, 545, 715),\n        \"ntilde\": (33, 0, 532, 706),\n        \"ograve\": (48, -11, 540, 715),\n        \"oacute\": (48, -11, 540, 715),\n        \"ocircumflex\": (48, -11, 540, 715),\n        \"otilde\": (48, -11, 540, 706),\n        \"odieresis\": (48, -11, 540, 699),\n        \"divide\": (62, 155, 535, 550),\n        \"oslash\": (74, -49, 583, 565),\n        \"ugrave\": (62, -11, 557, 715),\n        \"uacute\": (62, -11, 557, 715),\n        \"ucircumflex\": (62, -11, 557, 715),\n        \"udieresis\": (62, -11, 557, 699),\n        \"yacute\": (0, -210, 561, 715),\n        \"thorn\": (-10, -198, 535, 
715),\n        \"ydieresis\": (0, -210, 561, 699),\n    },\n    \"ArialNarrow\": {\n        \"space\": (0, 0, 0, 0),\n        \"exclam\": (72, 0, 161, 715),\n        \"quotedbl\": (37, 462, 252, 715),\n        \"numbersign\": (7, -12, 444, 728),\n        \"dollar\": (27, -103, 416, 781),\n        \"percent\": (45, -26, 676, 728),\n        \"ampersand\": (35, -16, 528, 728),\n        \"quotesingle\": (34, 462, 116, 715),\n        \"parenleft\": (49, -210, 243, 728),\n        \"parenright\": (29, -210, 223, 728),\n        \"asterisk\": (24, 423, 289, 728),\n        \"plus\": (44, 115, 432, 588),\n        \"comma\": (69, -141, 156, 100),\n        \"hyphen\": (25, 214, 247, 303),\n        \"period\": (75, 0, 157, 100),\n        \"slash\": (0, -12, 228, 728),\n        \"zero\": (32, -12, 415, 718),\n        \"one\": (87, 0, 304, 718),\n        \"two\": (24, 0, 412, 718),\n        \"three\": (33, -12, 417, 718),\n        \"four\": (9, 0, 415, 715),\n        \"five\": (32, -12, 421, 706),\n        \"six\": (29, -12, 416, 718),\n        \"seven\": (37, 0, 417, 706),\n        \"eight\": (32, -12, 418, 718),\n        \"nine\": (32, -12, 418, 718),\n        \"colon\": (75, 0, 157, 518),\n        \"semicolon\": (69, -141, 156, 518),\n        \"less\": (43, 110, 432, 595),\n        \"equal\": (44, 203, 432, 502),\n        \"greater\": (43, 110, 432, 595),\n        \"question\": (34, 0, 413, 728),\n        \"at\": (43, -210, 801, 729),\n        \"A\": (0, 0, 548, 715),\n        \"B\": (60, 0, 503, 715),\n        \"C\": (39, -12, 558, 728),\n        \"D\": (61, 0, 546, 715),\n        \"E\": (64, 0, 502, 715),\n        \"F\": (68, 0, 464, 715),\n        \"G\": (45, -12, 588, 728),\n        \"H\": (62, 0, 523, 715),\n        \"I\": (78, 0, 155, 715),\n        \"J\": (21, -12, 344, 715),\n        \"K\": (60, 0, 545, 715),\n        \"L\": (58, 0, 425, 715),\n        \"M\": (61, 0, 621, 715),\n        \"N\": (61, 0, 523, 715),\n        \"O\": (41, -12, 603, 728),\n        \"P\": (63, 
0, 511, 715),\n        \"Q\": (37, -55, 609, 728),\n        \"R\": (62, 0, 580, 715),\n        \"S\": (37, -12, 504, 728),\n        \"T\": (20, 0, 485, 715),\n        \"U\": (62, -12, 524, 715),\n        \"V\": (3, 0, 540, 715),\n        \"W\": (11, 0, 766, 715),\n        \"X\": (3, 0, 541, 715),\n        \"Y\": (2, 0, 540, 715),\n        \"Z\": (17, 0, 481, 715),\n        \"bracketleft\": (57, -198, 216, 715),\n        \"backslash\": (0, -12, 228, 728),\n        \"bracketright\": (17, -198, 176, 715),\n        \"asciicircum\": (21, 336, 363, 728),\n        \"underscore\": (-5, -125, 460, -75),\n        \"grave\": (35, 583, 186, 719),\n        \"a\": (28, -11, 419, 530),\n        \"b\": (52, -11, 420, 715),\n        \"c\": (31, -11, 402, 530),\n        \"d\": (26, -11, 395, 715),\n        \"e\": (28, -11, 420, 530),\n        \"f\": (9, 0, 257, 728),\n        \"g\": (24, -210, 399, 530),\n        \"h\": (52, 0, 398, 715),\n        \"i\": (52, 0, 124, 715),\n        \"j\": (-39, -210, 124, 715),\n        \"k\": (54, 0, 406, 715),\n        \"l\": (50, 0, 122, 715),\n        \"m\": (53, 0, 629, 530),\n        \"n\": (52, 0, 398, 530),\n        \"o\": (25, -11, 424, 530),\n        \"p\": (52, -198, 421, 530),\n        \"q\": (27, -198, 395, 530),\n        \"r\": (52, 0, 283, 530),\n        \"s\": (25, -12, 378, 530),\n        \"t\": (16, -6, 223, 699),\n        \"u\": (51, -11, 395, 518),\n        \"v\": (10, 0, 400, 518),\n        \"w\": (0, 0, 584, 518),\n        \"x\": (5, 0, 403, 518),\n        \"y\": (13, -210, 402, 518),\n        \"z\": (16, 0, 392, 518),\n        \"braceleft\": (22, -210, 254, 728),\n        \"bar\": (76, -210, 139, 728),\n        \"braceright\": (18, -210, 250, 728),\n        \"asciitilde\": (35, 271, 444, 432),\n        \"bullet\": (44, 226, 247, 474),\n        \"Euro\": (-11, -12, 443, 728),\n        \"quotesinglbase\": (41, -132, 125, 102),\n        \"florin\": (17, -210, 433, 728),\n        \"quotedblbase\": (28, -132, 236, 102),\n        
\"ellipsis\": (95, 0, 724, 100),\n        \"dagger\": (27, -168, 420, 699),\n        \"daggerdbl\": (27, -168, 422, 706),\n        \"circumflex\": (9, 583, 263, 719),\n        \"perthousand\": (14, -26, 805, 728),\n        \"Scaron\": (37, -12, 504, 901),\n        \"guilsinglleft\": (36, 35, 222, 480),\n        \"OE\": (51, -12, 793, 728),\n        \"Zcaron\": (17, 0, 481, 901),\n        \"quoteleft\": (49, 481, 133, 715),\n        \"quoteright\": (41, 481, 125, 715),\n        \"quotedblleft\": (33, 481, 241, 715),\n        \"quotedblright\": (28, 481, 236, 715),\n        \"endash\": (-2, 223, 453, 294),\n        \"emdash\": (0, 223, 819, 294),\n        \"tilde\": (2, 595, 270, 708),\n        \"trademark\": (90, 317, 713, 715),\n        \"scaron\": (25, -12, 378, 719),\n        \"guilsinglright\": (50, 35, 235, 480),\n        \"oe\": (34, -11, 744, 530),\n        \"zcaron\": (16, 0, 392, 719),\n        \"Ydieresis\": (2, 0, 540, 901),\n        \"exclamdown\": (91, -197, 181, 518),\n        \"cent\": (41, -199, 413, 715),\n        \"sterling\": (9, -13, 432, 728),\n        \"currency\": (28, 114, 421, 593),\n        \"yen\": (-2, 0, 452, 715),\n        \"brokenbar\": (76, -210, 139, 728),\n        \"section\": (31, -210, 417, 728),\n        \"dieresis\": (24, 620, 249, 720),\n        \"copyright\": (1, -8, 606, 728),\n        \"ordfeminine\": (20, 364, 289, 728),\n        \"guillemotleft\": (52, 35, 395, 480),\n        \"logicalnot\": (44, 207, 432, 502),\n        \"registered\": (1, -8, 606, 728),\n        \"macron\": (-5, 790, 505, 840),\n        \"degree\": (62, 457, 333, 728),\n        \"plusminus\": (38, 0, 510, 600),\n        \"twosuperior\": (9, 357, 259, 724),\n        \"threesuperior\": (12, 349, 258, 724),\n        \"acute\": (88, 583, 236, 719),\n        \"mu\": (78, -198, 497, 518),\n        \"paragraph\": (2, -198, 444, 715),\n        \"periodcentered\": (95, 311, 177, 411),\n        \"cedilla\": (42, -205, 216, 11),\n        \"onesuperior\": (41, 357, 
189, 724),\n        \"ordmasculine\": (18, 361, 280, 728),\n        \"guillemotright\": (54, 35, 397, 480),\n        \"onequarter\": (41, -27, 671, 728),\n        \"onehalf\": (41, -27, 669, 728),\n        \"threequarters\": (12, -27, 669, 728),\n        \"questiondown\": (64, -209, 443, 518),\n        \"Agrave\": (0, 0, 548, 901),\n        \"Aacute\": (0, 0, 548, 901),\n        \"Acircumflex\": (0, 0, 548, 901),\n        \"Atilde\": (0, 0, 548, 878),\n        \"Adieresis\": (0, 0, 548, 901),\n        \"Aring\": (0, 0, 548, 921),\n        \"AE\": (0, 0, 775, 715),\n        \"Ccedilla\": (39, -205, 558, 728),\n        \"Egrave\": (64, 0, 502, 901),\n        \"Eacute\": (64, 0, 502, 901),\n        \"Ecircumflex\": (64, 0, 502, 901),\n        \"Edieresis\": (64, 0, 502, 901),\n        \"Igrave\": (23, 0, 174, 901),\n        \"Iacute\": (73, 0, 220, 901),\n        \"Icircumflex\": (-11, 0, 241, 901),\n        \"Idieresis\": (4, 0, 229, 901),\n        \"Eth\": (-2, 0, 546, 715),\n        \"Ntilde\": (61, 0, 523, 878),\n        \"Ograve\": (41, -12, 603, 901),\n        \"Oacute\": (41, -12, 603, 901),\n        \"Ocircumflex\": (41, -12, 603, 901),\n        \"Otilde\": (41, -12, 603, 878),\n        \"Odieresis\": (41, -12, 603, 901),\n        \"multiply\": (63, 140, 412, 566),\n        \"Oslash\": (35, -28, 609, 742),\n        \"Ugrave\": (62, -12, 524, 901),\n        \"Uacute\": (62, -12, 524, 901),\n        \"Ucircumflex\": (62, -12, 524, 901),\n        \"Udieresis\": (62, -12, 524, 901),\n        \"Yacute\": (2, 0, 540, 901),\n        \"Thorn\": (63, 0, 511, 715),\n        \"germandbls\": (62, -12, 476, 728),\n        \"agrave\": (28, -11, 419, 719),\n        \"aacute\": (28, -11, 419, 719),\n        \"acircumflex\": (28, -11, 419, 719),\n        \"atilde\": (28, -11, 419, 696),\n        \"adieresis\": (28, -11, 419, 720),\n        \"aring\": (28, -11, 419, 762),\n        \"ae\": (25, -11, 694, 530),\n        \"ccedilla\": (31, -205, 402, 530),\n        \"egrave\": 
(28, -11, 420, 719),\n        \"eacute\": (28, -11, 420, 719),\n        \"ecircumflex\": (28, -11, 420, 719),\n        \"edieresis\": (28, -11, 420, 720),\n        \"igrave\": (9, 0, 160, 719),\n        \"iacute\": (62, 0, 210, 719),\n        \"icircumflex\": (-6, 0, 246, 719),\n        \"idieresis\": (1, 0, 226, 720),\n        \"eth\": (27, -12, 421, 715),\n        \"ntilde\": (52, 0, 398, 696),\n        \"ograve\": (25, -11, 424, 719),\n        \"oacute\": (25, -11, 424, 719),\n        \"ocircumflex\": (25, -11, 424, 719),\n        \"otilde\": (25, -11, 424, 696),\n        \"odieresis\": (25, -11, 424, 720),\n        \"divide\": (38, 155, 510, 550),\n        \"oslash\": (55, -38, 453, 550),\n        \"ugrave\": (51, -11, 395, 719),\n        \"uacute\": (51, -11, 395, 719),\n        \"ucircumflex\": (51, -11, 395, 719),\n        \"udieresis\": (51, -11, 395, 720),\n        \"yacute\": (13, -210, 402, 719),\n        \"thorn\": (52, -198, 421, 715),\n        \"ydieresis\": (13, -210, 402, 720),\n    },\n    \"ArialNarrow,Bold\": {\n        \"space\": (0, 0, 0, 0),\n        \"exclam\": (73, 0, 194, 715),\n        \"quotedbl\": (44, 461, 348, 715),\n        \"numbersign\": (7, -12, 446, 728),\n        \"dollar\": (28, -100, 419, 773),\n        \"percent\": (35, -28, 690, 728),\n        \"ampersand\": (36, -18, 579, 728),\n        \"quotesingle\": (36, 461, 159, 715),\n        \"parenleft\": (42, -210, 246, 728),\n        \"parenright\": (26, -210, 230, 728),\n        \"asterisk\": (11, 386, 301, 728),\n        \"plus\": (33, 103, 444, 603),\n        \"comma\": (46, -159, 168, 137),\n        \"hyphen\": (25, 190, 247, 328),\n        \"period\": (59, 0, 171, 137),\n        \"slash\": (0, -12, 229, 728),\n        \"zero\": (34, -12, 416, 718),\n        \"one\": (64, 0, 322, 718),\n        \"two\": (20, 0, 415, 718),\n        \"three\": (30, -12, 420, 718),\n        \"four\": (15, 0, 437, 718),\n        \"five\": (36, -12, 431, 706),\n        \"six\": (35, -12, 427, 
718),\n        \"seven\": (34, 0, 419, 706),\n        \"eight\": (33, -12, 419, 718),\n        \"nine\": (25, -12, 417, 718),\n        \"colon\": (80, 0, 192, 518),\n        \"semicolon\": (67, -159, 189, 518),\n        \"less\": (38, 81, 440, 625),\n        \"equal\": (33, 181, 444, 524),\n        \"greater\": (37, 81, 440, 624),\n        \"question\": (42, 0, 463, 723),\n        \"at\": (24, -210, 796, 728),\n        \"A\": (0, 0, 588, 715),\n        \"B\": (59, 0, 551, 715),\n        \"C\": (39, -12, 550, 728),\n        \"D\": (59, 0, 551, 715),\n        \"E\": (60, 0, 506, 715),\n        \"F\": (60, 0, 462, 715),\n        \"G\": (39, -12, 588, 728),\n        \"H\": (60, 0, 529, 715),\n        \"I\": (56, 0, 174, 715),\n        \"J\": (14, -12, 390, 715),\n        \"K\": (61, 0, 590, 715),\n        \"L\": (62, 0, 476, 709),\n        \"M\": (58, 0, 625, 715),\n        \"N\": (61, 0, 526, 715),\n        \"O\": (36, -12, 605, 728),\n        \"P\": (59, 0, 509, 715),\n        \"Q\": (35, -71, 626, 728),\n        \"R\": (60, 0, 587, 715),\n        \"S\": (29, -12, 506, 728),\n        \"T\": (17, 0, 483, 715),\n        \"U\": (58, -12, 526, 715),\n        \"V\": (0, 0, 546, 715),\n        \"W\": (2, 0, 772, 715),\n        \"X\": (0, 0, 546, 715),\n        \"Y\": (0, 0, 547, 715),\n        \"Z\": (8, 0, 485, 715),\n        \"bracketleft\": (58, -201, 257, 715),\n        \"backslash\": (0, -12, 229, 728),\n        \"bracketright\": (15, -201, 214, 715),\n        \"asciicircum\": (46, 337, 433, 728),\n        \"underscore\": (-5, -125, 462, -75),\n        \"grave\": (17, 582, 198, 728),\n        \"a\": (29, -11, 428, 530),\n        \"b\": (53, -11, 469, 715),\n        \"c\": (34, -11, 435, 530),\n        \"d\": (33, -11, 449, 715),\n        \"e\": (26, -11, 423, 530),\n        \"f\": (9, 0, 296, 728),\n        \"g\": (33, -210, 448, 530),\n        \"h\": (58, 0, 445, 715),\n        \"i\": (59, 0, 171, 715),\n        \"j\": (-37, -210, 169, 715),\n        \"k\": (55, 0, 
448, 715),\n        \"l\": (59, 0, 171, 715),\n        \"m\": (50, 0, 675, 530),\n        \"n\": (58, 0, 445, 530),\n        \"o\": (32, -11, 471, 530),\n        \"p\": (55, -197, 470, 530),\n        \"q\": (36, -197, 449, 530),\n        \"r\": (54, 0, 329, 530),\n        \"s\": (19, -11, 416, 530),\n        \"t\": (12, -11, 262, 701),\n        \"u\": (56, -11, 442, 518),\n        \"v\": (4, 0, 445, 518),\n        \"w\": (3, 0, 637, 518),\n        \"x\": (4, 0, 448, 518),\n        \"y\": (5, -210, 442, 518),\n        \"z\": (13, 0, 393, 518),\n        \"braceleft\": (23, -210, 297, 728),\n        \"bar\": (70, -210, 160, 728),\n        \"braceright\": (18, -210, 291, 728),\n        \"asciitilde\": (26, 253, 452, 451),\n        \"bullet\": (26, 208, 263, 497),\n        \"Euro\": (-13, -12, 431, 728),\n        \"quotesinglbase\": (46, -159, 168, 137),\n        \"florin\": (-7, -210, 457, 728),\n        \"quotedblbase\": (44, -159, 355, 137),\n        \"ellipsis\": (80, 0, 739, 137),\n        \"dagger\": (27, -170, 423, 707),\n        \"daggerdbl\": (27, -170, 423, 707),\n        \"circumflex\": (0, 583, 271, 728),\n        \"perthousand\": (0, -28, 819, 728),\n        \"Scaron\": (29, -12, 506, 909),\n        \"guilsinglleft\": (30, 34, 245, 479),\n        \"OE\": (28, -12, 794, 728),\n        \"Zcaron\": (8, 0, 485, 909),\n        \"quoteleft\": (61, 418, 182, 715),\n        \"quoteright\": (45, 418, 166, 715),\n        \"quotedblleft\": (52, 418, 362, 715),\n        \"quotedblright\": (41, 418, 352, 715),\n        \"endash\": (-1, 208, 454, 310),\n        \"emdash\": (0, 208, 819, 310),\n        \"tilde\": (-5, 588, 271, 712),\n        \"trademark\": (86, 315, 719, 715),\n        \"scaron\": (19, -11, 416, 728),\n        \"guilsinglright\": (29, 34, 244, 479),\n        \"oe\": (35, -11, 740, 530),\n        \"zcaron\": (13, 0, 393, 728),\n        \"Ydieresis\": (0, 0, 547, 909),\n        \"exclamdown\": (78, -198, 199, 518),\n        \"cent\": (33, -196, 434, 
710),\n        \"sterling\": (5, -12, 443, 728),\n        \"currency\": (18, 100, 435, 610),\n        \"yen\": (0, 0, 452, 715),\n        \"brokenbar\": (70, -210, 160, 728),\n        \"section\": (23, -210, 427, 728),\n        \"dieresis\": (1, 610, 270, 728),\n        \"copyright\": (-3, -17, 609, 730),\n        \"ordfeminine\": (15, 362, 283, 728),\n        \"guillemotleft\": (38, 34, 410, 479),\n        \"logicalnot\": (33, 183, 444, 524),\n        \"registered\": (-3, -17, 609, 730),\n        \"macron\": (-5, 790, 505, 840),\n        \"degree\": (41, 416, 353, 728),\n        \"plusminus\": (24, 0, 524, 674),\n        \"twosuperior\": (9, 354, 252, 724),\n        \"threesuperior\": (15, 349, 255, 724),\n        \"acute\": (74, 582, 256, 728),\n        \"mu\": (54, -198, 525, 518),\n        \"paragraph\": (0, -196, 452, 715),\n        \"periodcentered\": (80, 279, 192, 416),\n        \"cedilla\": (15, -204, 233, -5),\n        \"onesuperior\": (36, 354, 198, 724),\n        \"ordmasculine\": (10, 361, 288, 728),\n        \"guillemotright\": (42, 34, 414, 479),\n        \"onequarter\": (36, -26, 675, 724),\n        \"onehalf\": (36, -26, 663, 724),\n        \"threequarters\": (16, -26, 676, 724),\n        \"questiondown\": (40, -205, 462, 518),\n        \"Agrave\": (0, 0, 588, 909),\n        \"Aacute\": (0, 0, 588, 909),\n        \"Acircumflex\": (0, 0, 588, 909),\n        \"Atilde\": (0, 0, 588, 894),\n        \"Adieresis\": (0, 0, 588, 909),\n        \"Aring\": (0, 0, 588, 932),\n        \"AE\": (-34, 0, 780, 715),\n        \"Ccedilla\": (39, -210, 550, 728),\n        \"Egrave\": (60, 0, 506, 909),\n        \"Eacute\": (60, 0, 506, 909),\n        \"Ecircumflex\": (60, 0, 506, 909),\n        \"Edieresis\": (60, 0, 506, 909),\n        \"Igrave\": (-3, 0, 177, 909),\n        \"Iacute\": (53, 0, 235, 909),\n        \"Icircumflex\": (-20, 0, 250, 909),\n        \"Idieresis\": (-19, 0, 250, 909),\n        \"Eth\": (-1, 0, 551, 715),\n        \"Ntilde\": (61, 0, 526, 
894),\n        \"Ograve\": (36, -12, 605, 909),\n        \"Oacute\": (36, -12, 605, 909),\n        \"Ocircumflex\": (36, -12, 605, 909),\n        \"Otilde\": (36, -12, 605, 894),\n        \"Odieresis\": (36, -12, 605, 909),\n        \"multiply\": (43, 114, 434, 591),\n        \"Oslash\": (25, -40, 615, 750),\n        \"Ugrave\": (58, -12, 526, 909),\n        \"Uacute\": (58, -12, 526, 909),\n        \"Ucircumflex\": (58, -12, 526, 909),\n        \"Udieresis\": (58, -12, 526, 909),\n        \"Yacute\": (0, 0, 547, 909),\n        \"Thorn\": (59, 0, 509, 715),\n        \"germandbls\": (55, -11, 472, 728),\n        \"agrave\": (29, -11, 428, 728),\n        \"aacute\": (29, -11, 428, 728),\n        \"acircumflex\": (29, -11, 428, 728),\n        \"atilde\": (29, -11, 428, 712),\n        \"adieresis\": (29, -11, 428, 728),\n        \"aring\": (29, -11, 428, 750),\n        \"ae\": (35, -11, 690, 530),\n        \"ccedilla\": (34, -204, 435, 530),\n        \"egrave\": (26, -11, 423, 728),\n        \"eacute\": (26, -11, 423, 728),\n        \"ecircumflex\": (26, -11, 423, 728),\n        \"edieresis\": (26, -11, 423, 728),\n        \"igrave\": (-9, 0, 172, 728),\n        \"iacute\": (58, 0, 240, 728),\n        \"icircumflex\": (-20, 0, 250, 728),\n        \"idieresis\": (-19, 0, 250, 728),\n        \"eth\": (33, -12, 470, 715),\n        \"ntilde\": (58, 0, 445, 712),\n        \"ograve\": (32, -11, 471, 728),\n        \"oacute\": (32, -11, 471, 728),\n        \"ocircumflex\": (32, -11, 471, 728),\n        \"otilde\": (32, -11, 471, 712),\n        \"odieresis\": (32, -11, 471, 728),\n        \"divide\": (23, 90, 524, 616),\n        \"oslash\": (35, -35, 474, 546),\n        \"ugrave\": (56, -11, 442, 728),\n        \"uacute\": (56, -11, 442, 728),\n        \"ucircumflex\": (56, -11, 442, 728),\n        \"udieresis\": (56, -11, 442, 728),\n        \"yacute\": (5, -210, 442, 728),\n        \"thorn\": (55, -197, 470, 715),\n        \"ydieresis\": (5, -210, 442, 728),\n    },\n    
\"ArialNarrow,BoldItalic\": {\n        \"space\": (0, 0, 0, 0),\n        \"exclam\": (50, 0, 289, 715),\n        \"quotedbl\": (121, 461, 447, 715),\n        \"numbersign\": (7, -12, 446, 728),\n        \"dollar\": (36, -99, 472, 770),\n        \"percent\": (74, -30, 708, 728),\n        \"ampersand\": (67, -16, 578, 728),\n        \"quotesingle\": (124, 461, 269, 715),\n        \"parenleft\": (53, -210, 356, 728),\n        \"parenright\": (-64, -210, 238, 728),\n        \"asterisk\": (78, 382, 368, 721),\n        \"plus\": (33, 103, 444, 603),\n        \"comma\": (8, -155, 173, 135),\n        \"hyphen\": (31, 190, 277, 325),\n        \"period\": (36, 0, 172, 135),\n        \"slash\": (-35, -12, 335, 728),\n        \"zero\": (52, -12, 468, 718),\n        \"one\": (97, 0, 418, 720),\n        \"two\": (49, 0, 468, 718),\n        \"three\": (41, -12, 459, 718),\n        \"four\": (22, 0, 458, 715),\n        \"five\": (52, -12, 474, 706),\n        \"six\": (66, -12, 471, 718),\n        \"seven\": (84, 0, 494, 706),\n        \"eight\": (54, -12, 464, 718),\n        \"nine\": (52, -12, 457, 718),\n        \"colon\": (57, 0, 259, 518),\n        \"semicolon\": (33, -155, 262, 518),\n        \"less\": (38, 81, 440, 625),\n        \"equal\": (33, 181, 444, 524),\n        \"greater\": (37, 81, 440, 624),\n        \"question\": (100, 0, 506, 728),\n        \"at\": (24, -210, 796, 728),\n        \"A\": (-9, 0, 551, 715),\n        \"B\": (32, 0, 582, 715),\n        \"C\": (77, -12, 611, 728),\n        \"D\": (35, 0, 594, 715),\n        \"E\": (33, 0, 591, 715),\n        \"F\": (31, 0, 565, 715),\n        \"G\": (72, -12, 644, 728),\n        \"H\": (35, 0, 626, 715),\n        \"I\": (28, 0, 271, 715),\n        \"J\": (23, -12, 492, 715),\n        \"K\": (32, 0, 657, 715),\n        \"L\": (37, 0, 477, 715),\n        \"M\": (33, 0, 720, 715),\n        \"N\": (36, 0, 625, 715),\n        \"O\": (71, -12, 643, 728),\n        \"P\": (33, 0, 576, 715),\n        \"Q\": (72, -95, 643, 
728),\n        \"R\": (36, 0, 607, 715),\n        \"S\": (51, -12, 554, 728),\n        \"T\": (98, 0, 581, 715),\n        \"U\": (74, -12, 626, 715),\n        \"V\": (93, 0, 650, 715),\n        \"W\": (96, 0, 875, 715),\n        \"X\": (-24, 0, 642, 715),\n        \"Y\": (94, 0, 643, 715),\n        \"Z\": (20, 0, 547, 715),\n        \"bracketleft\": (7, -197, 359, 715),\n        \"backslash\": (63, -12, 235, 728),\n        \"bracketright\": (-45, -197, 307, 715),\n        \"asciicircum\": (46, 337, 433, 728),\n        \"underscore\": (-5, -125, 462, -75),\n        \"grave\": (109, 585, 271, 731),\n        \"a\": (37, -11, 437, 530),\n        \"b\": (29, -11, 493, 715),\n        \"c\": (49, -11, 462, 530),\n        \"d\": (48, -11, 548, 715),\n        \"e\": (47, -11, 454, 530),\n        \"f\": (43, 0, 385, 728),\n        \"g\": (25, -210, 510, 530),\n        \"h\": (34, 0, 484, 715),\n        \"i\": (33, 0, 270, 715),\n        \"j\": (-89, -210, 271, 715),\n        \"k\": (31, 0, 503, 715),\n        \"l\": (32, 0, 270, 715),\n        \"m\": (29, 0, 712, 530),\n        \"n\": (34, 0, 484, 530),\n        \"o\": (49, -11, 491, 530),\n        \"p\": (-4, -197, 496, 530),\n        \"q\": (48, -197, 512, 530),\n        \"r\": (26, 0, 388, 530),\n        \"s\": (18, -11, 452, 530),\n        \"t\": (61, -11, 320, 698),\n        \"u\": (57, -11, 507, 518),\n        \"v\": (61, 0, 506, 518),\n        \"w\": (59, 0, 689, 518),\n        \"x\": (-18, 0, 501, 518),\n        \"y\": (5, -210, 509, 518),\n        \"z\": (13, 0, 425, 518),\n        \"braceleft\": (44, -210, 412, 728),\n        \"bar\": (70, -210, 160, 728),\n        \"braceright\": (-70, -210, 296, 728),\n        \"asciitilde\": (26, 253, 452, 451),\n        \"bullet\": (26, 208, 263, 497),\n        \"Euro\": (21, -12, 523, 728),\n        \"quotesinglbase\": (8, -155, 173, 135),\n        \"florin\": (-7, -210, 457, 728),\n        \"quotedblbase\": (2, -155, 361, 135),\n        \"ellipsis\": (76, 0, 744, 135),\n      
  \"dagger\": (69, -170, 487, 706),\n        \"daggerdbl\": (0, -170, 491, 706),\n        \"circumflex\": (45, 584, 320, 731),\n        \"perthousand\": (55, -28, 837, 728),\n        \"Scaron\": (51, -12, 554, 912),\n        \"guilsinglleft\": (48, 34, 309, 477),\n        \"OE\": (56, -12, 884, 728),\n        \"Zcaron\": (20, 0, 547, 912),\n        \"quoteleft\": (87, 424, 253, 715),\n        \"quoteright\": (101, 424, 267, 715),\n        \"quotedblleft\": (101, 424, 459, 715),\n        \"quotedblright\": (104, 424, 463, 715),\n        \"endash\": (-1, 208, 454, 310),\n        \"emdash\": (0, 208, 819, 310),\n        \"tilde\": (76, 592, 351, 710),\n        \"trademark\": (86, 315, 719, 715),\n        \"scaron\": (18, -11, 452, 731),\n        \"guilsinglright\": (7, 34, 261, 477),\n        \"oe\": (47, -11, 773, 530),\n        \"zcaron\": (13, 0, 433, 731),\n        \"Ydieresis\": (94, 0, 643, 898),\n        \"exclamdown\": (9, -197, 250, 518),\n        \"cent\": (48, -192, 461, 713),\n        \"sterling\": (17, -18, 500, 728),\n        \"currency\": (18, 100, 435, 610),\n        \"yen\": (20, 0, 546, 715),\n        \"brokenbar\": (70, -210, 160, 728),\n        \"section\": (18, -211, 459, 728),\n        \"dieresis\": (69, 597, 356, 716),\n        \"copyright\": (-3, -17, 609, 730),\n        \"ordfeminine\": (66, 362, 337, 728),\n        \"guillemotleft\": (43, 34, 460, 477),\n        \"logicalnot\": (33, 183, 444, 524),\n        \"registered\": (-3, -17, 609, 730),\n        \"macron\": (94, 790, 605, 840),\n        \"degree\": (41, 416, 353, 728),\n        \"plusminus\": (24, 0, 524, 674),\n        \"twosuperior\": (66, 354, 324, 724),\n        \"threesuperior\": (62, 349, 319, 724),\n        \"acute\": (150, 583, 356, 730),\n        \"mu\": (-37, -200, 584, 518),\n        \"paragraph\": (0, -196, 452, 715),\n        \"periodcentered\": (108, 290, 245, 425),\n        \"cedilla\": (5, -207, 218, -12),\n        \"onesuperior\": (93, 354, 296, 725),\n        
\"ordmasculine\": (59, 362, 339, 728),\n        \"guillemotright\": (18, 34, 435, 477),\n        \"onequarter\": (81, -29, 688, 725),\n        \"onehalf\": (69, -29, 684, 725),\n        \"threequarters\": (62, -29, 698, 724),\n        \"questiondown\": (21, -209, 428, 518),\n        \"Agrave\": (-9, 0, 551, 913),\n        \"Aacute\": (-9, 0, 562, 912),\n        \"Acircumflex\": (-9, 0, 551, 912),\n        \"Atilde\": (-9, 0, 556, 892),\n        \"Adieresis\": (-9, 0, 562, 898),\n        \"Aring\": (-9, 0, 551, 935),\n        \"AE\": (-26, 0, 868, 715),\n        \"Ccedilla\": (77, -204, 611, 728),\n        \"Egrave\": (33, 0, 591, 913),\n        \"Eacute\": (33, 0, 591, 912),\n        \"Ecircumflex\": (33, 0, 591, 912),\n        \"Edieresis\": (33, 0, 591, 898),\n        \"Igrave\": (28, 0, 297, 913),\n        \"Iacute\": (28, 0, 368, 912),\n        \"Icircumflex\": (28, 0, 347, 912),\n        \"Idieresis\": (28, 0, 383, 898),\n        \"Eth\": (29, 0, 594, 715),\n        \"Ntilde\": (36, 0, 625, 892),\n        \"Ograve\": (71, -12, 643, 913),\n        \"Oacute\": (71, -12, 643, 912),\n        \"Ocircumflex\": (71, -12, 643, 912),\n        \"Otilde\": (71, -12, 643, 892),\n        \"Odieresis\": (71, -12, 643, 898),\n        \"multiply\": (43, 114, 434, 591),\n        \"Oslash\": (63, -59, 645, 766),\n        \"Ugrave\": (74, -12, 626, 913),\n        \"Uacute\": (74, -12, 626, 912),\n        \"Ucircumflex\": (74, -12, 626, 912),\n        \"Udieresis\": (74, -12, 626, 898),\n        \"Yacute\": (94, 0, 643, 912),\n        \"Thorn\": (33, 0, 552, 715),\n        \"germandbls\": (28, -11, 476, 728),\n        \"agrave\": (37, -11, 437, 731),\n        \"aacute\": (37, -11, 437, 730),\n        \"acircumflex\": (37, -11, 437, 731),\n        \"atilde\": (37, -11, 447, 710),\n        \"adieresis\": (37, -11, 454, 716),\n        \"aring\": (37, -11, 437, 753),\n        \"ae\": (25, -11, 709, 530),\n        \"ccedilla\": (49, -203, 462, 530),\n        \"egrave\": (47, -11, 454, 
731),\n        \"eacute\": (47, -11, 454, 730),\n        \"ecircumflex\": (47, -11, 454, 731),\n        \"edieresis\": (47, -11, 454, 716),\n        \"igrave\": (33, 0, 258, 731),\n        \"iacute\": (33, 0, 319, 730),\n        \"icircumflex\": (33, 0, 319, 731),\n        \"idieresis\": (33, 0, 342, 716),\n        \"eth\": (49, -11, 498, 715),\n        \"ntilde\": (34, 0, 484, 710),\n        \"ograve\": (49, -11, 491, 731),\n        \"oacute\": (49, -11, 491, 730),\n        \"ocircumflex\": (49, -11, 491, 731),\n        \"otilde\": (49, -11, 491, 710),\n        \"odieresis\": (49, -11, 491, 716),\n        \"divide\": (23, 90, 524, 616),\n        \"oslash\": (42, -52, 495, 571),\n        \"ugrave\": (57, -11, 507, 731),\n        \"uacute\": (57, -11, 507, 730),\n        \"ucircumflex\": (57, -11, 507, 731),\n        \"udieresis\": (57, -11, 507, 716),\n        \"yacute\": (5, -210, 509, 730),\n        \"thorn\": (-7, -197, 494, 715),\n        \"ydieresis\": (5, -210, 509, 716),\n    },\n    \"ArialNarrow,Italic\": {\n        \"space\": (0, 0, 0, 0),\n        \"exclam\": (46, 0, 249, 715),\n        \"quotedbl\": (106, 462, 346, 715),\n        \"numbersign\": (7, -12, 444, 728),\n        \"dollar\": (41, -95, 469, 763),\n        \"percent\": (79, -26, 698, 728),\n        \"ampersand\": (64, -17, 534, 728),\n        \"quotesingle\": (104, 462, 212, 715),\n        \"parenleft\": (69, -210, 338, 728),\n        \"parenright\": (-43, -210, 225, 728),\n        \"asterisk\": (92, 422, 357, 727),\n        \"plus\": (45, 115, 433, 588),\n        \"comma\": (20, -144, 144, 100),\n        \"hyphen\": (37, 214, 273, 303),\n        \"period\": (47, 0, 146, 100),\n        \"slash\": (-41, -11, 336, 728),\n        \"zero\": (58, -12, 463, 718),\n        \"one\": (121, 0, 393, 718),\n        \"two\": (48, 0, 460, 718),\n        \"three\": (44, -12, 457, 718),\n        \"four\": (37, 0, 445, 715),\n        \"five\": (57, -12, 469, 706),\n        \"six\": (68, -12, 465, 718),\n        
\"seven\": (99, 0, 488, 706),\n        \"eight\": (61, -12, 462, 718),\n        \"nine\": (55, -12, 452, 718),\n        \"colon\": (46, 0, 217, 518),\n        \"semicolon\": (20, -144, 215, 518),\n        \"less\": (44, 110, 433, 595),\n        \"equal\": (45, 203, 433, 502),\n        \"greater\": (44, 110, 433, 595),\n        \"question\": (104, 0, 459, 728),\n        \"at\": (44, -210, 803, 729),\n        \"A\": (-16, 0, 505, 715),\n        \"B\": (35, 0, 537, 715),\n        \"C\": (74, -12, 598, 728),\n        \"D\": (36, 0, 583, 715),\n        \"E\": (37, 0, 583, 715),\n        \"F\": (37, 0, 541, 715),\n        \"G\": (79, -12, 628, 728),\n        \"H\": (34, 0, 618, 715),\n        \"I\": (46, 0, 248, 715),\n        \"J\": (26, -12, 438, 715),\n        \"K\": (36, 0, 607, 715),\n        \"L\": (32, 0, 429, 715),\n        \"M\": (36, 0, 715, 715),\n        \"N\": (39, 0, 620, 715),\n        \"O\": (75, -12, 633, 728),\n        \"P\": (35, 0, 572, 715),\n        \"Q\": (76, -82, 634, 728),\n        \"R\": (38, 0, 599, 715),\n        \"S\": (58, -12, 551, 728),\n        \"T\": (102, 0, 578, 715),\n        \"U\": (79, -12, 618, 715),\n        \"V\": (101, 0, 620, 715),\n        \"W\": (102, 0, 870, 715),\n        \"X\": (-25, 0, 630, 715),\n        \"Y\": (96, 0, 634, 715),\n        \"Z\": (20, 0, 521, 715),\n        \"bracketleft\": (5, -195, 320, 715),\n        \"backslash\": (69, -11, 224, 728),\n        \"bracketright\": (-47, -195, 270, 715),\n        \"asciicircum\": (21, 336, 363, 728),\n        \"underscore\": (-5, -125, 460, -75),\n        \"grave\": (119, 581, 254, 715),\n        \"a\": (36, -11, 431, 530),\n        \"b\": (27, -11, 438, 715),\n        \"c\": (45, -11, 418, 530),\n        \"d\": (42, -11, 490, 715),\n        \"e\": (42, -11, 436, 530),\n        \"f\": (37, 0, 334, 728),\n        \"g\": (21, -207, 462, 530),\n        \"h\": (27, 0, 433, 715),\n        \"i\": (24, 0, 219, 715),\n        \"j\": (-99, -207, 218, 715),\n        \"k\": (27, 0, 
454, 715),\n        \"l\": (21, 0, 216, 715),\n        \"m\": (26, 0, 666, 530),\n        \"n\": (27, 0, 433, 530),\n        \"o\": (40, -11, 442, 530),\n        \"p\": (-8, -198, 438, 530),\n        \"q\": (42, -198, 453, 530),\n        \"r\": (27, 0, 344, 530),\n        \"s\": (31, -11, 408, 530),\n        \"t\": (45, -8, 263, 707),\n        \"u\": (51, -11, 457, 518),\n        \"v\": (64, 0, 458, 518),\n        \"w\": (63, 0, 636, 518),\n        \"x\": (-1, 0, 440, 518),\n        \"y\": (0, -210, 459, 518),\n        \"z\": (16, 0, 419, 518),\n        \"braceleft\": (55, -210, 376, 728),\n        \"bar\": (75, -210, 138, 728),\n        \"braceright\": (-68, -210, 253, 728),\n        \"asciitilde\": (35, 271, 444, 432),\n        \"bullet\": (43, 226, 246, 474),\n        \"Euro\": (33, -12, 528, 728),\n        \"quotesinglbase\": (-5, -144, 118, 100),\n        \"florin\": (18, -210, 434, 728),\n        \"quotedblbase\": (-16, -144, 238, 100),\n        \"ellipsis\": (117, 0, 764, 100),\n        \"dagger\": (74, -170, 478, 706),\n        \"daggerdbl\": (4, -170, 482, 706),\n        \"circumflex\": (82, 581, 317, 715),\n        \"perthousand\": (54, -26, 822, 728),\n        \"Scaron\": (58, -12, 551, 896),\n        \"guilsinglleft\": (39, 35, 257, 478),\n        \"OE\": (65, -12, 856, 728),\n        \"Zcaron\": (20, 0, 521, 896),\n        \"quoteleft\": (103, 470, 228, 715),\n        \"quoteright\": (103, 470, 227, 715),\n        \"quotedblleft\": (83, 470, 336, 715),\n        \"quotedblright\": (85, 470, 342, 715),\n        \"endash\": (-1, 223, 454, 294),\n        \"emdash\": (0, 223, 819, 294),\n        \"tilde\": (76, 596, 347, 706),\n        \"trademark\": (90, 317, 713, 715),\n        \"scaron\": (31, -11, 410, 715),\n        \"guilsinglright\": (13, 35, 235, 478),\n        \"oe\": (51, -11, 752, 530),\n        \"zcaron\": (16, 0, 419, 715),\n        \"Ydieresis\": (96, 0, 634, 880),\n        \"exclamdown\": (24, -197, 228, 518),\n        \"cent\": (62, -198, 
434, 725),\n        \"sterling\": (25, -12, 498, 728),\n        \"currency\": (28, 114, 421, 593),\n        \"yen\": (29, 0, 546, 715),\n        \"brokenbar\": (75, -210, 138, 728),\n        \"section\": (24, -210, 455, 728),\n        \"dieresis\": (95, 599, 335, 699),\n        \"copyright\": (0, -8, 605, 728),\n        \"ordfeminine\": (66, 359, 335, 728),\n        \"guillemotleft\": (64, 35, 440, 478),\n        \"logicalnot\": (45, 207, 433, 502),\n        \"registered\": (0, -8, 605, 728),\n        \"macron\": (88, 790, 600, 840),\n        \"degree\": (133, 457, 404, 728),\n        \"plusminus\": (38, 0, 510, 600),\n        \"twosuperior\": (61, 357, 329, 724),\n        \"threesuperior\": (67, 349, 327, 724),\n        \"acute\": (138, 581, 304, 715),\n        \"mu\": (5, -200, 571, 518),\n        \"paragraph\": (2, -198, 444, 715),\n        \"periodcentered\": (124, 307, 223, 407),\n        \"cedilla\": (30, -207, 235, 5),\n        \"onesuperior\": (111, 357, 290, 724),\n        \"ordmasculine\": (57, 360, 337, 728),\n        \"guillemotright\": (33, 35, 414, 478),\n        \"onequarter\": (68, -29, 697, 728),\n        \"onehalf\": (48, -29, 677, 728),\n        \"threequarters\": (67, -29, 708, 728),\n        \"questiondown\": (46, -209, 401, 518),\n        \"Agrave\": (-16, 0, 505, 896),\n        \"Aacute\": (-16, 0, 505, 896),\n        \"Acircumflex\": (-16, 0, 505, 896),\n        \"Atilde\": (-16, 0, 514, 887),\n        \"Adieresis\": (-16, 0, 505, 880),\n        \"Aring\": (-16, 0, 505, 914),\n        \"AE\": (-33, 0, 855, 715),\n        \"Ccedilla\": (74, -210, 598, 728),\n        \"Egrave\": (37, 0, 583, 896),\n        \"Eacute\": (37, 0, 583, 896),\n        \"Ecircumflex\": (37, 0, 583, 896),\n        \"Edieresis\": (37, 0, 583, 880),\n        \"Igrave\": (46, 0, 262, 896),\n        \"Iacute\": (46, 0, 312, 896),\n        \"Icircumflex\": (46, 0, 326, 896),\n        \"Idieresis\": (46, 0, 343, 880),\n        \"Eth\": (29, 0, 583, 715),\n        
\"Ntilde\": (39, 0, 620, 887),\n        \"Ograve\": (75, -12, 633, 896),\n        \"Oacute\": (75, -12, 633, 896),\n        \"Ocircumflex\": (75, -12, 633, 896),\n        \"Otilde\": (75, -12, 633, 887),\n        \"Odieresis\": (75, -12, 633, 880),\n        \"multiply\": (63, 140, 412, 566),\n        \"Oslash\": (69, -50, 636, 764),\n        \"Ugrave\": (79, -12, 618, 896),\n        \"Uacute\": (79, -12, 618, 896),\n        \"Ucircumflex\": (79, -12, 618, 896),\n        \"Udieresis\": (79, -12, 618, 880),\n        \"Yacute\": (96, 0, 634, 896),\n        \"Thorn\": (35, 0, 547, 715),\n        \"germandbls\": (29, -12, 465, 728),\n        \"agrave\": (36, -11, 431, 715),\n        \"aacute\": (36, -11, 431, 715),\n        \"acircumflex\": (36, -11, 431, 715),\n        \"atilde\": (36, -11, 443, 706),\n        \"adieresis\": (36, -11, 431, 699),\n        \"aring\": (36, -11, 431, 733),\n        \"ae\": (34, -12, 708, 530),\n        \"ccedilla\": (45, -207, 418, 530),\n        \"egrave\": (42, -11, 436, 715),\n        \"eacute\": (42, -11, 436, 715),\n        \"ecircumflex\": (42, -11, 436, 715),\n        \"edieresis\": (42, -11, 436, 699),\n        \"igrave\": (50, 0, 254, 715),\n        \"iacute\": (50, 0, 270, 715),\n        \"icircumflex\": (50, 0, 305, 715),\n        \"idieresis\": (50, 0, 310, 699),\n        \"eth\": (40, -11, 447, 715),\n        \"ntilde\": (27, 0, 436, 706),\n        \"ograve\": (40, -11, 442, 715),\n        \"oacute\": (40, -11, 442, 715),\n        \"ocircumflex\": (40, -11, 442, 715),\n        \"otilde\": (40, -11, 442, 706),\n        \"odieresis\": (40, -11, 442, 699),\n        \"divide\": (38, 155, 510, 550),\n        \"oslash\": (58, -49, 476, 565),\n        \"ugrave\": (51, -11, 457, 715),\n        \"uacute\": (51, -11, 457, 715),\n        \"ucircumflex\": (51, -11, 457, 715),\n        \"udieresis\": (51, -11, 457, 699),\n        \"yacute\": (0, -210, 459, 715),\n        \"thorn\": (-8, -198, 438, 715),\n        \"ydieresis\": (0, -210, 
459, 699),\n    },\n    \"Arial,Black\": {\n        \"space\": (0, 0, 0, 0),\n        \"exclam\": (60, 0, 272, 715),\n        \"quotedbl\": (23, 452, 476, 715),\n        \"numbersign\": (29, -11, 627, 728),\n        \"dollar\": (26, -104, 631, 770),\n        \"percent\": (48, -36, 951, 728),\n        \"ampersand\": (74, -11, 848, 728),\n        \"quotesingle\": (41, 452, 239, 715),\n        \"parenleft\": (54, -210, 350, 728),\n        \"parenright\": (39, -210, 334, 728),\n        \"asterisk\": (86, 370, 465, 728),\n        \"plus\": (62, 91, 594, 624),\n        \"comma\": (60, -201, 272, 197),\n        \"hyphen\": (21, 184, 311, 337),\n        \"period\": (60, 0, 272, 199),\n        \"slash\": (0, -11, 280, 728),\n        \"zero\": (41, -12, 625, 728),\n        \"one\": (81, 0, 491, 728),\n        \"two\": (26, 0, 623, 728),\n        \"three\": (35, -12, 626, 728),\n        \"four\": (20, 0, 645, 728),\n        \"five\": (32, -12, 627, 715),\n        \"six\": (41, -12, 631, 728),\n        \"seven\": (44, 0, 625, 715),\n        \"eight\": (41, -12, 625, 728),\n        \"nine\": (34, -12, 624, 728),\n        \"colon\": (60, 0, 272, 518),\n        \"semicolon\": (60, -201, 272, 518),\n        \"less\": (52, 54, 607, 660),\n        \"equal\": (61, 158, 594, 557),\n        \"greater\": (52, 54, 607, 660),\n        \"question\": (35, 0, 575, 728),\n        \"at\": (-2, -113, 741, 728),\n        \"A\": (0, 0, 780, 715),\n        \"B\": (73, 0, 735, 715),\n        \"C\": (47, -12, 743, 728),\n        \"D\": (76, 0, 734, 715),\n        \"E\": (72, 0, 676, 715),\n        \"F\": (74, 0, 621, 715),\n        \"G\": (45, -12, 774, 728),\n        \"H\": (74, 0, 759, 715),\n        \"I\": (82, 0, 303, 715),\n        \"J\": (17, -12, 592, 715),\n        \"K\": (74, 0, 833, 715),\n        \"L\": (73, 0, 639, 715),\n        \"M\": (70, 0, 875, 715),\n        \"N\": (74, 0, 759, 715),\n        \"O\": (45, -12, 787, 728),\n        \"P\": (72, 0, 679, 715),\n        \"Q\": (45, -80, 
814, 728),\n        \"R\": (76, 0, 780, 715),\n        \"S\": (34, -12, 684, 728),\n        \"T\": (22, 0, 695, 715),\n        \"U\": (73, -12, 759, 715),\n        \"V\": (2, 0, 778, 715),\n        \"W\": (0, 0, 1000, 715),\n        \"X\": (1, 0, 779, 715),\n        \"Y\": (0, 0, 779, 715),\n        \"Z\": (16, 0, 695, 715),\n        \"bracketleft\": (65, -198, 366, 715),\n        \"backslash\": (-2, -11, 277, 728),\n        \"bracketright\": (22, -198, 323, 715),\n        \"asciicircum\": (61, 331, 595, 728),\n        \"underscore\": (-5, -125, 505, -75),\n        \"grave\": (0, 582, 250, 728),\n        \"a\": (35, -11, 632, 530),\n        \"b\": (61, -11, 631, 715),\n        \"c\": (36, -12, 635, 530),\n        \"d\": (35, -11, 605, 715),\n        \"e\": (35, -11, 635, 530),\n        \"f\": (7, 0, 418, 728),\n        \"g\": (35, -210, 607, 530),\n        \"h\": (60, 0, 608, 715),\n        \"i\": (67, 0, 266, 715),\n        \"j\": (-48, -210, 267, 715),\n        \"k\": (60, 0, 666, 715),\n        \"l\": (66, 0, 266, 715),\n        \"m\": (61, 0, 941, 530),\n        \"n\": (60, 0, 608, 530),\n        \"o\": (35, -11, 631, 530),\n        \"p\": (61, -197, 631, 530),\n        \"q\": (35, -197, 605, 530),\n        \"r\": (62, 0, 470, 530),\n        \"s\": (24, -12, 576, 530),\n        \"t\": (27, -11, 416, 715),\n        \"u\": (58, -11, 606, 518),\n        \"v\": (0, 0, 613, 518),\n        \"w\": (1, 0, 945, 518),\n        \"x\": (5, 0, 661, 518),\n        \"y\": (2, -210, 614, 518),\n        \"z\": (18, 0, 534, 518),\n        \"braceleft\": (12, -210, 377, 728),\n        \"bar\": (78, -197, 202, 715),\n        \"braceright\": (11, -210, 376, 728),\n        \"asciitilde\": (48, 240, 608, 475),\n        \"bullet\": (87, 189, 412, 514),\n        \"Euro\": (8, -12, 641, 728),\n        \"quotesinglbase\": (34, -201, 246, 197),\n        \"florin\": (18, -210, 651, 728),\n        \"quotedblbase\": (26, -201, 486, 197),\n        \"ellipsis\": (60, 0, 939, 199),\n        
\"dagger\": (68, -198, 604, 715),\n        \"daggerdbl\": (68, -198, 604, 715),\n        \"circumflex\": (-13, 582, 347, 721),\n        \"perthousand\": (0, -36, 1000, 728),\n        \"Scaron\": (34, -12, 684, 898),\n        \"guilsinglleft\": (11, 34, 319, 486),\n        \"OE\": (34, -12, 968, 728),\n        \"Zcaron\": (16, 0, 695, 898),\n        \"quoteleft\": (34, 329, 246, 728),\n        \"quoteright\": (34, 329, 246, 728),\n        \"quotedblleft\": (26, 329, 486, 728),\n        \"quotedblright\": (26, 329, 486, 728),\n        \"endash\": (-5, 207, 505, 315),\n        \"emdash\": (-5, 207, 1005, 315),\n        \"tilde\": (-9, 580, 342, 715),\n        \"trademark\": (17, 317, 910, 715),\n        \"scaron\": (24, -12, 576, 721),\n        \"guilsinglright\": (13, 34, 321, 486),\n        \"oe\": (28, -11, 972, 530),\n        \"zcaron\": (18, 0, 534, 721),\n        \"Ydieresis\": (0, 0, 779, 883),\n        \"exclamdown\": (60, -197, 272, 518),\n        \"cent\": (36, -190, 635, 706),\n        \"sterling\": (55, -12, 662, 728),\n        \"currency\": (47, 0, 607, 560),\n        \"yen\": (0, 0, 667, 715),\n        \"brokenbar\": (78, -197, 202, 715),\n        \"section\": (31, -210, 628, 728),\n        \"dieresis\": (0, 583, 334, 706),\n        \"copyright\": (28, -17, 773, 728),\n        \"ordfeminine\": (16, 363, 371, 728),\n        \"guillemotleft\": (46, 34, 607, 486),\n        \"logicalnot\": (61, 154, 594, 553),\n        \"registered\": (28, -17, 773, 728),\n        \"macron\": (-5, 780, 505, 830),\n        \"degree\": (58, 449, 337, 728),\n        \"plusminus\": (62, 0, 594, 705),\n        \"twosuperior\": (10, 361, 386, 728),\n        \"threesuperior\": (15, 352, 384, 728),\n        \"acute\": (79, 582, 332, 728),\n        \"mu\": (58, -196, 607, 518),\n        \"paragraph\": (65, -198, 789, 715),\n        \"periodcentered\": (60, 258, 272, 457),\n        \"cedilla\": (8, -210, 304, -11),\n        \"onesuperior\": (68, 361, 306, 728),\n        
\"ordmasculine\": (11, 362, 384, 728),\n        \"guillemotright\": (59, 34, 620, 486),\n        \"onequarter\": (76, -25, 962, 728),\n        \"onehalf\": (76, -25, 971, 728),\n        \"threequarters\": (34, -25, 962, 728),\n        \"questiondown\": (35, -209, 575, 518),\n        \"Agrave\": (0, 0, 780, 905),\n        \"Aacute\": (0, 0, 780, 905),\n        \"Acircumflex\": (0, 0, 780, 898),\n        \"Atilde\": (0, 0, 780, 893),\n        \"Adieresis\": (0, 0, 780, 883),\n        \"Aring\": (0, 0, 780, 892),\n        \"AE\": (-37, 0, 964, 715),\n        \"Ccedilla\": (47, -210, 743, 728),\n        \"Egrave\": (72, 0, 676, 905),\n        \"Eacute\": (72, 0, 676, 905),\n        \"Ecircumflex\": (72, 0, 676, 898),\n        \"Edieresis\": (72, 0, 676, 883),\n        \"Igrave\": (28, 0, 303, 905),\n        \"Iacute\": (82, 0, 360, 905),\n        \"Icircumflex\": (14, 0, 375, 898),\n        \"Idieresis\": (27, 0, 362, 883),\n        \"Eth\": (0, 0, 734, 715),\n        \"Ntilde\": (74, 0, 759, 893),\n        \"Ograve\": (45, -12, 787, 905),\n        \"Oacute\": (45, -12, 787, 905),\n        \"Ocircumflex\": (45, -12, 787, 898),\n        \"Otilde\": (45, -12, 787, 893),\n        \"Odieresis\": (45, -12, 787, 883),\n        \"multiply\": (61, 90, 595, 625),\n        \"Oslash\": (17, -25, 815, 740),\n        \"Ugrave\": (73, -12, 759, 905),\n        \"Uacute\": (73, -12, 759, 905),\n        \"Ucircumflex\": (73, -12, 759, 898),\n        \"Udieresis\": (73, -12, 759, 883),\n        \"Yacute\": (0, 0, 779, 905),\n        \"Thorn\": (72, 0, 679, 715),\n        \"germandbls\": (58, -11, 631, 728),\n        \"agrave\": (35, -11, 632, 728),\n        \"aacute\": (35, -11, 632, 728),\n        \"acircumflex\": (35, -11, 632, 721),\n        \"atilde\": (35, -11, 632, 715),\n        \"adieresis\": (35, -11, 632, 706),\n        \"aring\": (35, -11, 632, 802),\n        \"ae\": (33, -11, 971, 530),\n        \"ccedilla\": (36, -210, 635, 530),\n        \"egrave\": (35, -11, 635, 728),\n  
      \"eacute\": (35, -11, 635, 728),\n        \"ecircumflex\": (35, -11, 635, 721),\n        \"edieresis\": (35, -11, 635, 706),\n        \"igrave\": (0, 0, 266, 728),\n        \"iacute\": (67, 0, 332, 728),\n        \"icircumflex\": (-13, 0, 347, 721),\n        \"idieresis\": (0, 0, 334, 706),\n        \"eth\": (36, -11, 629, 715),\n        \"ntilde\": (60, 0, 608, 715),\n        \"ograve\": (35, -11, 631, 728),\n        \"oacute\": (35, -11, 631, 728),\n        \"ocircumflex\": (35, -11, 631, 721),\n        \"otilde\": (35, -11, 631, 715),\n        \"odieresis\": (35, -11, 631, 706),\n        \"divide\": (62, 51, 594, 662),\n        \"oslash\": (35, -47, 630, 564),\n        \"ugrave\": (58, -11, 606, 728),\n        \"uacute\": (58, -11, 606, 728),\n        \"ucircumflex\": (58, -11, 606, 721),\n        \"udieresis\": (58, -11, 606, 706),\n        \"yacute\": (2, -210, 614, 728),\n        \"thorn\": (61, -197, 631, 715),\n        \"ydieresis\": (2, -210, 614, 706),\n    },\n    \"Garamond\": {\n        \"space\": (0, 0, 0, 0),\n        \"exclam\": (61, -12, 160, 638),\n        \"quotedbl\": (64, 392, 341, 677),\n        \"numbersign\": (45, -22, 620, 666),\n        \"dollar\": (41, -133, 404, 655),\n        \"percent\": (36, -32, 789, 637),\n        \"ampersand\": (26, -14, 713, 594),\n        \"quotesingle\": (39, 392, 137, 677),\n        \"parenleft\": (76, -245, 309, 639),\n        \"parenright\": (-21, -244, 213, 640),\n        \"asterisk\": (28, 240, 393, 631),\n        \"plus\": (70, 49, 595, 572),\n        \"comma\": (41, -173, 189, 68),\n        \"hyphen\": (37, 171, 275, 217),\n        \"period\": (58, -14, 160, 93),\n        \"slash\": (56, -135, 443, 696),\n        \"zero\": (35, -14, 437, 636),\n        \"one\": (75, 0, 354, 633),\n        \"two\": (21, 0, 441, 633),\n        \"three\": (38, -13, 424, 636),\n        \"four\": (26, -11, 456, 636),\n        \"five\": (51, -16, 418, 638),\n        \"six\": (48, -13, 427, 639),\n        \"seven\": (45, 
-12, 431, 619),\n        \"eight\": (56, -13, 429, 633),\n        \"nine\": (43, -14, 421, 638),\n        \"colon\": (57, -13, 161, 387),\n        \"semicolon\": (42, -156, 188, 391),\n        \"less\": (71, 70, 594, 551),\n        \"equal\": (71, 176, 595, 445),\n        \"greater\": (71, 70, 594, 551),\n        \"question\": (43, -14, 330, 640),\n        \"at\": (47, -215, 896, 694),\n        \"A\": (-7, 0, 669, 655),\n        \"B\": (13, 0, 568, 633),\n        \"C\": (43, -13, 601, 640),\n        \"D\": (10, -8, 722, 635),\n        \"E\": (23, -6, 632, 622),\n        \"F\": (28, -9, 540, 631),\n        \"G\": (46, -12, 758, 640),\n        \"H\": (19, -10, 734, 629),\n        \"I\": (20, 0, 324, 624),\n        \"J\": (-84, -252, 277, 624),\n        \"K\": (28, -8, 759, 625),\n        \"L\": (5, -2, 574, 622),\n        \"M\": (6, -4, 826, 629),\n        \"N\": (12, -22, 732, 627),\n        \"O\": (45, -9, 733, 630),\n        \"P\": (18, -9, 536, 632),\n        \"Q\": (47, -217, 748, 642),\n        \"R\": (20, -2, 641, 629),\n        \"S\": (37, -16, 437, 642),\n        \"T\": (-1, -12, 602, 649),\n        \"U\": (18, -16, 675, 627),\n        \"V\": (-8, -19, 686, 628),\n        \"W\": (-9, -27, 891, 624),\n        \"X\": (4, -10, 707, 623),\n        \"Y\": (-9, -6, 664, 629),\n        \"Z\": (35, -7, 608, 657),\n        \"bracketleft\": (101, -231, 295, 627),\n        \"backslash\": (55, -135, 444, 696),\n        \"bracketright\": (-20, -232, 174, 627),\n        \"asciicircum\": (32, 382, 469, 670),\n        \"underscore\": (-5, -125, 505, -75),\n        \"grave\": (97, 479, 261, 631),\n        \"a\": (32, -11, 399, 398),\n        \"b\": (16, -20, 471, 658),\n        \"c\": (38, -15, 390, 398),\n        \"d\": (32, -18, 487, 658),\n        \"e\": (38, -12, 392, 401),\n        \"f\": (46, 0, 402, 653),\n        \"g\": (6, -257, 460, 400),\n        \"h\": (14, -3, 497, 650),\n        \"i\": (0, -2, 221, 639),\n        \"j\": (20, -263, 153, 634),\n        \"k\": 
(25, 0, 477, 654),\n        \"l\": (4, 0, 227, 648),\n        \"m\": (17, 0, 753, 417),\n        \"n\": (17, 0, 500, 411),\n        \"o\": (35, -13, 474, 400),\n        \"p\": (11, -256, 474, 434),\n        \"q\": (34, -255, 498, 412),\n        \"r\": (18, -1, 332, 422),\n        \"s\": (55, -15, 321, 404),\n        \"t\": (27, -10, 295, 482),\n        \"u\": (16, -9, 483, 383),\n        \"v\": (-5, -20, 477, 387),\n        \"w\": (-10, -22, 675, 385),\n        \"x\": (13, 0, 444, 385),\n        \"y\": (3, -246, 430, 386),\n        \"z\": (26, -2, 389, 422),\n        \"braceleft\": (138, -215, 410, 694),\n        \"bar\": (228, -257, 271, 653),\n        \"braceright\": (86, -215, 358, 694),\n        \"asciitilde\": (73, 243, 593, 378),\n        \"bullet\": (54, 208, 299, 453),\n        \"Euro\": (-13, -13, 454, 640),\n        \"quotesinglbase\": (45, -173, 188, 68),\n        \"florin\": (0, -256, 615, 642),\n        \"quotedblbase\": (31, -172, 406, 71),\n        \"ellipsis\": (114, -9, 885, 96),\n        \"dagger\": (0, -243, 422, 640),\n        \"daggerdbl\": (15, -240, 411, 643),\n        \"circumflex\": (71, 477, 286, 650),\n        \"perthousand\": (35, -32, 987, 637),\n        \"Scaron\": (37, -16, 437, 859),\n        \"guilsinglleft\": (6, 6, 190, 393),\n        \"OE\": (46, -8, 909, 629),\n        \"Zcaron\": (35, -7, 608, 859),\n        \"quoteleft\": (51, 393, 199, 637),\n        \"quoteright\": (49, 393, 193, 636),\n        \"quotedblleft\": (43, 392, 418, 635),\n        \"quotedblright\": (35, 395, 412, 643),\n        \"endash\": (-5, 168, 505, 213),\n        \"emdash\": (-5, 168, 1005, 213),\n        \"tilde\": (42, 504, 322, 604),\n        \"trademark\": (14, 268, 963, 662),\n        \"scaron\": (55, -15, 321, 650),\n        \"guilsinglright\": (8, 7, 190, 395),\n        \"oe\": (38, -16, 666, 400),\n        \"zcaron\": (26, -2, 389, 650),\n        \"Ydieresis\": (-9, -6, 664, 770),\n        \"exclamdown\": (59, -240, 159, 408),\n        \"cent\": 
(38, -168, 389, 580),\n        \"sterling\": (29, -235, 591, 633),\n        \"currency\": (98, 89, 564, 555),\n        \"yen\": (-9, -6, 664, 629),\n        \"brokenbar\": (228, -257, 271, 653),\n        \"section\": (56, -243, 369, 641),\n        \"dieresis\": (64, 515, 316, 600),\n        \"copyright\": (33, -15, 726, 677),\n        \"ordfeminine\": (13, 377, 264, 630),\n        \"guillemotleft\": (5, 5, 365, 390),\n        \"logicalnot\": (71, 180, 595, 461),\n        \"registered\": (33, -15, 726, 677),\n        \"macron\": (-5, 743, 505, 793),\n        \"degree\": (47, 376, 348, 676),\n        \"plusminus\": (70, -18, 595, 660),\n        \"twosuperior\": (24, 305, 284, 635),\n        \"threesuperior\": (35, 297, 274, 636),\n        \"acute\": (119, 479, 284, 630),\n        \"mu\": (22, -216, 497, 383),\n        \"paragraph\": (-6, -215, 454, 662),\n        \"periodcentered\": (115, 284, 217, 391),\n        \"cedilla\": (0, -210, 146, 6),\n        \"onesuperior\": (56, 305, 231, 635),\n        \"ordmasculine\": (18, 376, 314, 630),\n        \"guillemotright\": (0, 5, 360, 390),\n        \"onequarter\": (56, -34, 785, 635),\n        \"onehalf\": (56, -32, 776, 637),\n        \"threequarters\": (35, -32, 791, 637),\n        \"questiondown\": (16, -245, 302, 408),\n        \"Agrave\": (-7, 0, 669, 837),\n        \"Aacute\": (-7, 0, 669, 836),\n        \"Acircumflex\": (-7, 0, 669, 859),\n        \"Atilde\": (-7, 0, 669, 785),\n        \"Adieresis\": (-7, 0, 669, 770),\n        \"Aring\": (-7, 0, 669, 807),\n        \"AE\": (-62, -4, 828, 627),\n        \"Ccedilla\": (43, -210, 601, 640),\n        \"Egrave\": (23, -6, 632, 837),\n        \"Eacute\": (23, -6, 632, 836),\n        \"Ecircumflex\": (23, -6, 632, 859),\n        \"Edieresis\": (23, -6, 632, 770),\n        \"Igrave\": (20, 0, 324, 837),\n        \"Iacute\": (20, 0, 324, 836),\n        \"Icircumflex\": (20, 0, 324, 859),\n        \"Idieresis\": (20, 0, 324, 770),\n        \"Eth\": (7, -8, 722, 635),\n      
  \"Ntilde\": (12, -22, 732, 785),\n        \"Ograve\": (45, -9, 733, 837),\n        \"Oacute\": (45, -9, 733, 836),\n        \"Ocircumflex\": (45, -9, 733, 859),\n        \"Otilde\": (45, -9, 733, 785),\n        \"Odieresis\": (45, -9, 733, 770),\n        \"multiply\": (96, 73, 571, 548),\n        \"Oslash\": (45, -30, 733, 651),\n        \"Ugrave\": (18, -16, 675, 837),\n        \"Uacute\": (18, -16, 675, 836),\n        \"Ucircumflex\": (18, -16, 675, 859),\n        \"Udieresis\": (18, -16, 675, 770),\n        \"Yacute\": (-9, -6, 664, 836),\n        \"Thorn\": (18, -9, 536, 625),\n        \"germandbls\": (7, -15, 469, 643),\n        \"agrave\": (32, -11, 399, 631),\n        \"aacute\": (32, -11, 399, 630),\n        \"acircumflex\": (32, -11, 399, 650),\n        \"atilde\": (32, -11, 399, 604),\n        \"adieresis\": (32, -11, 399, 600),\n        \"aring\": (32, -11, 399, 614),\n        \"ae\": (36, -15, 561, 399),\n        \"ccedilla\": (38, -210, 390, 398),\n        \"egrave\": (38, -12, 392, 631),\n        \"eacute\": (38, -12, 392, 630),\n        \"ecircumflex\": (38, -12, 392, 650),\n        \"edieresis\": (38, -12, 392, 600),\n        \"igrave\": (-1, -2, 219, 631),\n        \"iacute\": (-1, -2, 231, 630),\n        \"icircumflex\": (-1, -2, 224, 650),\n        \"idieresis\": (-1, -2, 250, 600),\n        \"eth\": (44, -11, 485, 642),\n        \"ntilde\": (17, 0, 500, 604),\n        \"ograve\": (35, -13, 474, 631),\n        \"oacute\": (35, -13, 474, 630),\n        \"ocircumflex\": (35, -13, 474, 650),\n        \"otilde\": (35, -13, 474, 604),\n        \"odieresis\": (35, -13, 474, 600),\n        \"divide\": (11, 136, 537, 524),\n        \"oslash\": (38, -23, 476, 412),\n        \"ugrave\": (16, -9, 483, 631),\n        \"uacute\": (16, -9, 483, 630),\n        \"ucircumflex\": (16, -9, 483, 650),\n        \"udieresis\": (16, -9, 483, 600),\n        \"yacute\": (3, -246, 430, 630),\n        \"thorn\": (11, -256, 474, 648),\n        \"ydieresis\": (3, -246, 
430, 600),\n    },\n    \"Garamond,Bold\": {\n        \"space\": (0, 0, 0, 0),\n        \"exclam\": (61, -8, 202, 649),\n        \"quotedbl\": (85, 352, 465, 677),\n        \"numbersign\": (41, -21, 625, 675),\n        \"dollar\": (39, -94, 437, 635),\n        \"percent\": (31, -12, 800, 653),\n        \"ampersand\": (45, -10, 762, 613),\n        \"quotesingle\": (68, 352, 212, 677),\n        \"parenleft\": (68, -236, 350, 647),\n        \"parenright\": (11, -236, 294, 647),\n        \"asterisk\": (32, 213, 457, 649),\n        \"plus\": (65, 50, 601, 584),\n        \"comma\": (45, -179, 221, 134),\n        \"hyphen\": (34, 158, 302, 251),\n        \"period\": (61, -8, 202, 132),\n        \"slash\": (57, -135, 495, 696),\n        \"zero\": (27, -10, 438, 645),\n        \"one\": (25, 3, 368, 644),\n        \"two\": (19, 1, 449, 642),\n        \"three\": (14, -13, 437, 642),\n        \"four\": (23, -10, 445, 644),\n        \"five\": (31, -10, 428, 641),\n        \"six\": (29, -10, 439, 648),\n        \"seven\": (34, -10, 430, 628),\n        \"eight\": (42, -10, 434, 641),\n        \"nine\": (30, -14, 442, 644),\n        \"colon\": (57, -8, 199, 423),\n        \"semicolon\": (48, -178, 224, 424),\n        \"less\": (66, 59, 600, 576),\n        \"equal\": (66, 164, 600, 471),\n        \"greater\": (66, 59, 600, 576),\n        \"question\": (48, -9, 375, 650),\n        \"at\": (44, -215, 908, 677),\n        \"A\": (-12, 3, 676, 647),\n        \"B\": (35, 0, 627, 639),\n        \"C\": (45, -6, 645, 649),\n        \"D\": (24, 3, 736, 645),\n        \"E\": (17, 0, 670, 635),\n        \"F\": (29, 0, 585, 638),\n        \"G\": (45, -8, 711, 646),\n        \"H\": (31, 4, 826, 639),\n        \"I\": (40, 1, 352, 639),\n        \"J\": (-58, -235, 345, 638),\n        \"K\": (26, 2, 709, 639),\n        \"L\": (19, 1, 632, 641),\n        \"M\": (20, 0, 894, 637),\n        \"N\": (3, -13, 814, 636),\n        \"O\": (43, -5, 744, 647),\n        \"P\": (23, 0, 587, 639),\n        
\"Q\": (43, -170, 750, 648),\n        \"R\": (39, 1, 710, 640),\n        \"S\": (49, -6, 476, 649),\n        \"T\": (0, 1, 657, 664),\n        \"U\": (17, -13, 718, 634),\n        \"V\": (-11, -4, 675, 640),\n        \"W\": (0, -14, 898, 633),\n        \"X\": (4, 1, 687, 635),\n        \"Y\": (-18, 2, 672, 635),\n        \"Z\": (21, 1, 620, 660),\n        \"bracketleft\": (122, -225, 340, 631),\n        \"backslash\": (58, -135, 494, 696),\n        \"bracketright\": (20, -224, 240, 633),\n        \"asciicircum\": (73, 325, 511, 675),\n        \"underscore\": (-5, -125, 505, -75),\n        \"grave\": (59, 468, 242, 625),\n        \"a\": (48, -2, 468, 415),\n        \"b\": (20, -8, 516, 646),\n        \"c\": (38, -7, 447, 419),\n        \"d\": (38, -11, 543, 652),\n        \"e\": (35, -8, 435, 418),\n        \"f\": (26, 1, 393, 648),\n        \"g\": (24, -250, 539, 415),\n        \"h\": (18, 0, 540, 646),\n        \"i\": (14, 2, 268, 645),\n        \"j\": (21, -229, 199, 645),\n        \"k\": (15, 0, 539, 647),\n        \"l\": (3, 1, 260, 647),\n        \"m\": (20, 3, 833, 434),\n        \"n\": (19, 0, 539, 440),\n        \"o\": (36, -8, 484, 418),\n        \"p\": (-1, -246, 515, 447),\n        \"q\": (38, -248, 545, 443),\n        \"r\": (17, 3, 343, 437),\n        \"s\": (43, -8, 374, 417),\n        \"t\": (27, -1, 301, 497),\n        \"u\": (20, -8, 536, 401),\n        \"v\": (-6, -6, 466, 402),\n        \"w\": (-6, -6, 717, 400),\n        \"x\": (9, 2, 485, 400),\n        \"y\": (-7, -237, 471, 400),\n        \"z\": (29, 3, 426, 447),\n        \"braceleft\": (80, -202, 351, 677),\n        \"bar\": (231, -249, 309, 644),\n        \"braceright\": (44, -202, 315, 677),\n        \"asciitilde\": (67, 238, 599, 396),\n        \"bullet\": (37, 190, 316, 469),\n        \"Euro\": (-17, -5, 448, 649),\n        \"quotesinglbase\": (40, -179, 216, 134),\n        \"florin\": (0, -236, 708, 645),\n        \"quotedblbase\": (43, -177, 457, 134),\n        \"ellipsis\": (94, -7, 
904, 135),\n        \"dagger\": (14, -236, 486, 648),\n        \"daggerdbl\": (21, -232, 479, 652),\n        \"circumflex\": (32, 460, 322, 633),\n        \"perthousand\": (31, -12, 998, 653),\n        \"Scaron\": (49, -6, 476, 848),\n        \"guilsinglleft\": (11, 13, 251, 402),\n        \"OE\": (50, 0, 943, 646),\n        \"Zcaron\": (21, 1, 620, 848),\n        \"quoteleft\": (45, 326, 223, 640),\n        \"quoteright\": (34, 326, 210, 639),\n        \"quotedblleft\": (46, 325, 461, 640),\n        \"quotedblright\": (33, 326, 450, 639),\n        \"endash\": (-5, 205, 505, 295),\n        \"emdash\": (-5, 205, 1005, 295),\n        \"tilde\": (10, 486, 334, 615),\n        \"trademark\": (-1, 268, 1005, 662),\n        \"scaron\": (43, -8, 374, 635),\n        \"guilsinglright\": (22, 10, 262, 399),\n        \"oe\": (36, -6, 699, 419),\n        \"zcaron\": (29, 3, 428, 635),\n        \"Ydieresis\": (-18, 2, 672, 822),\n        \"exclamdown\": (58, -238, 199, 419),\n        \"cent\": (27, -171, 436, 584),\n        \"sterling\": (46, -229, 645, 647),\n        \"currency\": (81, 78, 581, 578),\n        \"yen\": (-18, 2, 672, 635),\n        \"brokenbar\": (231, -249, 309, 644),\n        \"section\": (41, -241, 463, 647),\n        \"dieresis\": (33, 488, 319, 609),\n        \"copyright\": (28, -15, 721, 677),\n        \"ordfeminine\": (22, 393, 303, 645),\n        \"guillemotleft\": (2, 12, 430, 396),\n        \"logicalnot\": (65, 168, 601, 483),\n        \"registered\": (28, -15, 721, 677),\n        \"macron\": (-5, 682, 505, 732),\n        \"degree\": (28, 337, 366, 675),\n        \"plusminus\": (65, -23, 601, 676),\n        \"twosuperior\": (23, 310, 287, 644),\n        \"threesuperior\": (20, 302, 282, 644),\n        \"acute\": (114, 467, 298, 625),\n        \"mu\": (25, -186, 453, 401),\n        \"paragraph\": (0, -215, 541, 662),\n        \"periodcentered\": (96, 253, 237, 394),\n        \"cedilla\": (43, -228, 291, 7),\n        \"onesuperior\": (43, 311, 258, 
645),\n        \"ordmasculine\": (17, 389, 316, 647),\n        \"guillemotright\": (17, 12, 444, 396),\n        \"onequarter\": (46, -12, 804, 653),\n        \"onehalf\": (46, -12, 805, 653),\n        \"threequarters\": (23, -12, 804, 653),\n        \"questiondown\": (42, -239, 369, 421),\n        \"Agrave\": (-12, 3, 676, 837),\n        \"Aacute\": (-12, 3, 676, 837),\n        \"Acircumflex\": (-12, 3, 676, 846),\n        \"Atilde\": (-12, 3, 676, 828),\n        \"Adieresis\": (-12, 3, 676, 822),\n        \"Aring\": (-12, 3, 676, 802),\n        \"AE\": (-44, -2, 841, 633),\n        \"Ccedilla\": (45, -228, 645, 649),\n        \"Egrave\": (17, 0, 670, 837),\n        \"Eacute\": (17, 0, 670, 837),\n        \"Ecircumflex\": (17, 0, 670, 846),\n        \"Edieresis\": (17, 0, 670, 822),\n        \"Igrave\": (40, 1, 352, 837),\n        \"Iacute\": (40, 1, 352, 837),\n        \"Icircumflex\": (40, 1, 354, 846),\n        \"Idieresis\": (40, 1, 352, 822),\n        \"Eth\": (24, 3, 736, 645),\n        \"Ntilde\": (3, -13, 814, 828),\n        \"Ograve\": (43, -5, 744, 837),\n        \"Oacute\": (43, -5, 744, 837),\n        \"Ocircumflex\": (43, -5, 744, 846),\n        \"Otilde\": (43, -5, 744, 828),\n        \"Odieresis\": (43, -5, 744, 822),\n        \"multiply\": (85, 70, 582, 565),\n        \"Oslash\": (43, -7, 744, 650),\n        \"Ugrave\": (17, -13, 718, 837),\n        \"Uacute\": (17, -13, 718, 837),\n        \"Ucircumflex\": (17, -13, 718, 846),\n        \"Udieresis\": (17, -13, 718, 822),\n        \"Yacute\": (-18, 2, 672, 837),\n        \"Thorn\": (23, 0, 588, 639),\n        \"germandbls\": (17, -1, 514, 647),\n        \"agrave\": (48, -2, 468, 625),\n        \"aacute\": (48, -2, 468, 625),\n        \"acircumflex\": (48, -2, 468, 633),\n        \"atilde\": (48, -2, 468, 615),\n        \"adieresis\": (48, -2, 468, 609),\n        \"aring\": (48, -2, 468, 629),\n        \"ae\": (41, -8, 664, 416),\n        \"ccedilla\": (38, -228, 447, 419),\n        \"egrave\": (35, 
-8, 435, 625),\n        \"eacute\": (35, -8, 435, 625),\n        \"ecircumflex\": (35, -8, 435, 633),\n        \"edieresis\": (35, -8, 435, 609),\n        \"igrave\": (16, 2, 268, 625),\n        \"iacute\": (16, 2, 271, 625),\n        \"icircumflex\": (5, 2, 296, 633),\n        \"idieresis\": (7, 2, 292, 609),\n        \"eth\": (33, -8, 482, 648),\n        \"ntilde\": (19, 0, 539, 615),\n        \"ograve\": (36, -8, 484, 625),\n        \"oacute\": (36, -8, 484, 625),\n        \"ocircumflex\": (36, -8, 484, 633),\n        \"otilde\": (36, -8, 484, 615),\n        \"odieresis\": (36, -8, 484, 609),\n        \"divide\": (65, 69, 601, 569),\n        \"oslash\": (36, -38, 485, 449),\n        \"ugrave\": (20, -8, 536, 625),\n        \"uacute\": (20, -8, 536, 625),\n        \"ucircumflex\": (20, -8, 536, 633),\n        \"udieresis\": (20, -8, 536, 609),\n        \"yacute\": (-7, -237, 471, 625),\n        \"thorn\": (-1, -246, 515, 647),\n        \"ydieresis\": (-7, -237, 471, 609),\n    },\n    \"Garamond,Italic\": {\n        \"space\": (0, 0, 0, 0),\n        \"exclam\": (49, -11, 299, 623),\n        \"quotedbl\": (124, 392, 465, 677),\n        \"numbersign\": (81, -22, 656, 666),\n        \"dollar\": (11, -105, 460, 629),\n        \"percent\": (71, -32, 734, 633),\n        \"ampersand\": (91, -9, 978, 655),\n        \"quotesingle\": (131, 392, 261, 677),\n        \"parenleft\": (95, -255, 428, 651),\n        \"parenright\": (-78, -253, 257, 652),\n        \"asterisk\": (95, 245, 490, 631),\n        \"plus\": (105, 49, 630, 572),\n        \"comma\": (-17, -160, 154, 119),\n        \"hyphen\": (51, 169, 269, 219),\n        \"period\": (41, -14, 142, 93),\n        \"slash\": (56, -135, 443, 696),\n        \"zero\": (52, -11, 471, 633),\n        \"one\": (148, 0, 407, 631),\n        \"two\": (16, 0, 485, 632),\n        \"three\": (21, -11, 453, 632),\n        \"four\": (16, 0, 443, 631),\n        \"five\": (15, -11, 499, 640),\n        \"six\": (56, -11, 505, 633),\n        
\"seven\": (81, -11, 518, 613),\n        \"eight\": (45, -13, 475, 631),\n        \"nine\": (28, -12, 478, 633),\n        \"colon\": (42, -10, 238, 396),\n        \"semicolon\": (0, -157, 251, 398),\n        \"less\": (106, 69, 629, 551),\n        \"equal\": (106, 175, 630, 445),\n        \"greater\": (106, 69, 629, 551),\n        \"question\": (110, -12, 416, 635),\n        \"at\": (47, -215, 896, 694),\n        \"A\": (-55, -8, 746, 641),\n        \"B\": (12, -7, 544, 640),\n        \"C\": (70, -15, 702, 646),\n        \"D\": (18, -6, 734, 639),\n        \"E\": (-2, -8, 673, 636),\n        \"F\": (7, -8, 648, 640),\n        \"G\": (70, -16, 708, 641),\n        \"H\": (16, -7, 833, 639),\n        \"I\": (7, -8, 393, 640),\n        \"J\": (-117, -248, 390, 639),\n        \"K\": (14, -8, 677, 637),\n        \"L\": (1, -4, 674, 632),\n        \"M\": (-25, -19, 883, 646),\n        \"N\": (-9, -18, 865, 640),\n        \"O\": (81, -13, 674, 648),\n        \"P\": (12, -6, 574, 643),\n        \"Q\": (-97, -235, 690, 643),\n        \"R\": (30, -5, 673, 636),\n        \"S\": (28, -15, 523, 645),\n        \"T\": (69, -10, 682, 652),\n        \"U\": (115, -15, 784, 641),\n        \"V\": (118, -19, 925, 638),\n        \"W\": (106, -18, 1003, 637),\n        \"X\": (-10, -8, 826, 645),\n        \"Y\": (71, -3, 760, 643),\n        \"Z\": (41, 0, 631, 635),\n        \"bracketleft\": (47, -229, 479, 625),\n        \"backslash\": (55, -135, 444, 696),\n        \"bracketright\": (-104, -229, 322, 625),\n        \"asciicircum\": (67, 382, 504, 670),\n        \"underscore\": (-5, -125, 505, -75),\n        \"grave\": (194, 461, 357, 612),\n        \"a\": (38, -12, 426, 387),\n        \"b\": (66, -14, 429, 646),\n        \"c\": (48, -10, 334, 400),\n        \"d\": (44, -20, 509, 656),\n        \"e\": (50, -16, 315, 395),\n        \"f\": (-182, -256, 434, 642),\n        \"g\": (-92, -246, 380, 400),\n        \"h\": (35, -16, 422, 649),\n        \"i\": (37, -11, 291, 621),\n        \"j\": 
(-216, -245, 284, 606),\n        \"k\": (32, -23, 512, 645),\n        \"l\": (35, -13, 334, 649),\n        \"m\": (24, -13, 649, 396),\n        \"n\": (45, -14, 434, 403),\n        \"o\": (55, -11, 354, 399),\n        \"p\": (-141, -252, 409, 516),\n        \"q\": (38, -252, 450, 402),\n        \"r\": (55, -11, 397, 400),\n        \"s\": (25, -8, 331, 399),\n        \"t\": (38, -8, 335, 522),\n        \"u\": (38, -12, 452, 400),\n        \"v\": (52, -15, 379, 407),\n        \"w\": (35, -18, 577, 401),\n        \"x\": (8, -9, 556, 397),\n        \"y\": (-215, -243, 350, 399),\n        \"z\": (58, -253, 486, 399),\n        \"braceleft\": (138, -215, 410, 694),\n        \"bar\": (263, -246, 307, 641),\n        \"braceright\": (133, -215, 406, 694),\n        \"asciitilde\": (108, 243, 628, 377),\n        \"bullet\": (102, 208, 347, 453),\n        \"Euro\": (44, -16, 611, 645),\n        \"quotesinglbase\": (7, -137, 151, 119),\n        \"florin\": (0, -256, 615, 642),\n        \"quotedblbase\": (6, -162, 357, 95),\n        \"ellipsis\": (114, -9, 886, 96),\n        \"dagger\": (84, -242, 499, 644),\n        \"daggerdbl\": (-18, -254, 499, 654),\n        \"circumflex\": (163, 439, 390, 622),\n        \"perthousand\": (70, -32, 891, 633),\n        \"Scaron\": (28, -15, 600, 856),\n        \"guilsinglleft\": (61, -5, 317, 404),\n        \"OE\": (80, -4, 963, 642),\n        \"Zcaron\": (41, 0, 648, 853),\n        \"quoteleft\": (177, 386, 326, 650),\n        \"quoteright\": (152, 393, 297, 650),\n        \"quotedblleft\": (188, 385, 536, 646),\n        \"quotedblright\": (146, 388, 495, 645),\n        \"endash\": (-5, 168, 505, 213),\n        \"emdash\": (-5, 168, 1005, 213),\n        \"tilde\": (158, 489, 437, 589),\n        \"trademark\": (61, 268, 1010, 662),\n        \"scaron\": (25, -8, 455, 624),\n        \"guilsinglright\": (-19, -7, 236, 404),\n        \"oe\": (52, -11, 493, 398),\n        \"zcaron\": (58, -253, 522, 624),\n        \"Ydieresis\": (71, -3, 760, 
786),\n        \"exclamdown\": (-17, -227, 232, 408),\n        \"cent\": (-7, -121, 351, 534),\n        \"sterling\": (31, -235, 593, 633),\n        \"currency\": (133, 89, 600, 555),\n        \"yen\": (45, -9, 741, 638),\n        \"brokenbar\": (263, -246, 307, 641),\n        \"section\": (-4, -227, 464, 644),\n        \"dieresis\": (179, 494, 422, 574),\n        \"copyright\": (81, -15, 773, 677),\n        \"ordfeminine\": (103, 392, 365, 638),\n        \"guillemotleft\": (52, -7, 458, 403),\n        \"logicalnot\": (106, 180, 630, 461),\n        \"registered\": (81, -15, 773, 677),\n        \"macron\": (80, 669, 591, 719),\n        \"degree\": (104, 378, 404, 678),\n        \"plusminus\": (105, -18, 630, 660),\n        \"twosuperior\": (49, 303, 338, 632),\n        \"threesuperior\": (52, 297, 319, 632),\n        \"acute\": (242, 460, 404, 611),\n        \"mu\": (-62, -215, 481, 383),\n        \"paragraph\": (-6, -215, 454, 662),\n        \"periodcentered\": (162, 263, 264, 371),\n        \"cedilla\": (23, -223, 147, 7),\n        \"onesuperior\": (127, 303, 293, 632),\n        \"ordmasculine\": (115, 392, 321, 645),\n        \"guillemotright\": (-12, -6, 394, 404),\n        \"onequarter\": (127, -32, 729, 633),\n        \"onehalf\": (127, -32, 754, 633),\n        \"threequarters\": (52, -32, 729, 633),\n        \"questiondown\": (-4, -237, 301, 409),\n        \"Agrave\": (-55, -8, 762, 845),\n        \"Aacute\": (-55, -8, 853, 845),\n        \"Acircumflex\": (-55, -8, 827, 861),\n        \"Atilde\": (-55, -8, 890, 801),\n        \"Adieresis\": (-55, -8, 844, 786),\n        \"Aring\": (-55, -8, 758, 791),\n        \"AE\": (-32, -6, 869, 637),\n        \"Ccedilla\": (70, -226, 702, 646),\n        \"Egrave\": (-2, -8, 673, 845),\n        \"Eacute\": (-2, -8, 673, 845),\n        \"Ecircumflex\": (-2, -8, 673, 861),\n        \"Edieresis\": (-2, -8, 673, 786),\n        \"Igrave\": (7, -8, 393, 845),\n        \"Iacute\": (7, -8, 408, 845),\n        \"Icircumflex\": (7, 
-8, 393, 861),\n        \"Idieresis\": (7, -8, 446, 786),\n        \"Eth\": (33, -6, 750, 639),\n        \"Ntilde\": (-9, -18, 865, 801),\n        \"Ograve\": (81, -13, 674, 845),\n        \"Oacute\": (81, -13, 674, 845),\n        \"Ocircumflex\": (81, -13, 674, 861),\n        \"Otilde\": (81, -13, 674, 801),\n        \"Odieresis\": (81, -13, 674, 786),\n        \"multiply\": (131, 73, 606, 548),\n        \"Oslash\": (81, -16, 674, 650),\n        \"Ugrave\": (115, -15, 784, 845),\n        \"Uacute\": (115, -15, 784, 845),\n        \"Ucircumflex\": (115, -15, 784, 861),\n        \"Udieresis\": (115, -15, 784, 786),\n        \"Yacute\": (71, -3, 760, 845),\n        \"Thorn\": (22, -6, 556, 642),\n        \"germandbls\": (-145, -250, 538, 648),\n        \"agrave\": (38, -12, 445, 612),\n        \"aacute\": (38, -12, 444, 611),\n        \"acircumflex\": (38, -12, 429, 622),\n        \"atilde\": (38, -12, 476, 589),\n        \"adieresis\": (38, -12, 495, 574),\n        \"aring\": (38, -12, 426, 616),\n        \"ae\": (26, -13, 514, 406),\n        \"ccedilla\": (-7, -223, 334, 400),\n        \"egrave\": (50, -16, 335, 612),\n        \"eacute\": (50, -16, 382, 611),\n        \"ecircumflex\": (50, -16, 367, 622),\n        \"edieresis\": (50, -16, 399, 574),\n        \"igrave\": (38, -9, 302, 612),\n        \"iacute\": (38, -9, 349, 611),\n        \"icircumflex\": (38, -9, 341, 622),\n        \"idieresis\": (38, -9, 378, 574),\n        \"eth\": (58, -13, 425, 642),\n        \"ntilde\": (45, -14, 536, 589),\n        \"ograve\": (55, -11, 369, 612),\n        \"oacute\": (55, -11, 416, 611),\n        \"ocircumflex\": (55, -11, 401, 622),\n        \"otilde\": (55, -11, 448, 589),\n        \"odieresis\": (55, -11, 433, 574),\n        \"divide\": (106, 81, 630, 543),\n        \"oslash\": (43, -10, 373, 400),\n        \"ugrave\": (38, -12, 452, 612),\n        \"uacute\": (38, -12, 455, 611),\n        \"ucircumflex\": (38, -12, 452, 622),\n        \"udieresis\": (38, -12, 472, 
574),\n        \"yacute\": (-215, -243, 404, 611),\n        \"thorn\": (-141, -252, 409, 648),\n        \"ydieresis\": (-215, -243, 363, 574),\n    },\n}\n"
  },
  {
    "path": "babeldoc/format/pdf/converter.py",
    "content": "import logging\nimport re\nimport unicodedata\n\nimport numpy as np\nfrom pymupdf import Font\n\nfrom babeldoc.format.pdf.document_il.frontend.il_creater import ILCreater\nfrom babeldoc.pdfminer.converter import PDFConverter\nfrom babeldoc.pdfminer.layout import LTChar\nfrom babeldoc.pdfminer.layout import LTComponent\nfrom babeldoc.pdfminer.layout import LTCurve\nfrom babeldoc.pdfminer.layout import LTFigure\nfrom babeldoc.pdfminer.layout import LTLine\nfrom babeldoc.pdfminer.layout import LTPage\nfrom babeldoc.pdfminer.layout import LTText\nfrom babeldoc.pdfminer.pdfcolor import PDFColorSpace\nfrom babeldoc.pdfminer.pdffont import PDFCIDFont\nfrom babeldoc.pdfminer.pdffont import PDFFont\nfrom babeldoc.pdfminer.pdffont import PDFUnicodeNotDefined\nfrom babeldoc.pdfminer.pdfinterp import PDFGraphicState\nfrom babeldoc.pdfminer.pdfinterp import PDFResourceManager\nfrom babeldoc.pdfminer.utils import Matrix\nfrom babeldoc.pdfminer.utils import apply_matrix_pt\nfrom babeldoc.pdfminer.utils import bbox2str\nfrom babeldoc.pdfminer.utils import matrix2str\nfrom babeldoc.pdfminer.utils import mult_matrix\n\nlog = logging.getLogger(__name__)\n\n\nclass PDFConverterEx(PDFConverter):\n    def __init__(\n        self,\n        rsrcmgr: PDFResourceManager,\n        il_creater: ILCreater | None = None,\n    ) -> None:\n        PDFConverter.__init__(self, rsrcmgr, None, \"utf-8\", 1, None)\n        self.il_creater = il_creater\n\n    def begin_page(self, page, ctm) -> None:\n        # 重载替换 cropbox\n        (x0, y0, x1, y1) = page.cropbox\n        (x0, y0) = apply_matrix_pt(ctm, (x0, y0))\n        (x1, y1) = apply_matrix_pt(ctm, (x1, y1))\n        mediabox = (0, 0, abs(x0 - x1), abs(y0 - y1))\n        self.il_creater.on_page_media_box(\n            mediabox[0],\n            mediabox[1],\n            mediabox[2],\n            mediabox[3],\n        )\n        self.il_creater.on_page_number(page.pageno)\n        self.cur_item = LTPage(page.pageno, mediabox)\n\n    
def end_page(self, _page) -> None:\n        # 重载返回指令流\n        return self.receive_layout(self.cur_item)\n\n    def begin_figure(self, name, bbox, matrix) -> None:\n        # 重载设置 pageid\n        self._stack.append(self.cur_item)\n        self.cur_item = LTFigure(name, bbox, mult_matrix(matrix, self.ctm))\n        self.cur_item.pageid = self._stack[-1].pageid\n\n    def end_figure(self, _: str) -> None:\n        # 重载返回指令流\n        fig = self.cur_item\n        if not isinstance(self.cur_item, LTFigure):\n            raise ValueError(f\"Unexpected item type: {type(self.cur_item)}\")\n        self.cur_item = self._stack.pop()\n        self.cur_item.add(fig)\n        return self.receive_layout(fig)\n\n    def render_char(\n        self,\n        matrix,\n        font,\n        fontsize: float,\n        scaling: float,\n        rise: float,\n        cid: int,\n        ncs,\n        graphicstate: PDFGraphicState,\n    ) -> float:\n        # 重载设置 cid 和 font\n        try:\n            text = font.to_unichr(cid)\n            if not isinstance(text, str):\n                raise TypeError(f\"Expected string, got {type(text)}\")\n        except PDFUnicodeNotDefined:\n            text = self.handle_undefined_char(font, cid)\n        textwidth = font.char_width(cid)\n        textdisp = font.char_disp(cid)\n        font_id = font.font_id_temp\n        if font_id is not None:\n            pass\n        elif not hasattr(font, \"xobj_id\"):\n            log.debug(\n                f\"Font {font.fontname} does not have xobj_id attribute.\",\n            )\n            font_id = \"UNKNOW\"\n        else:\n            font_id = self.il_creater.current_page_font_name_id_map.get(\n                font.xobj_id, None\n            )\n\n        item = AWLTChar(\n            matrix,\n            font,\n            fontsize,\n            scaling,\n            rise,\n            text,\n            textwidth,\n            textdisp,\n            ncs,\n            graphicstate,\n            
self.il_creater.xobj_id,\n            font_id,\n            self.il_creater.get_render_order_and_increase(),\n        )\n        self.cur_item.add(item)\n        item.cid = cid  # hack 插入原字符编码\n        item.font = font  # hack 插入原字符字体\n        return item.adv\n\n\nclass AWLTChar(LTChar):\n    \"\"\"Actual letter in the text as a Unicode string.\"\"\"\n\n    def __init__(\n        self,\n        matrix: Matrix,\n        font: PDFFont,\n        fontsize: float,\n        scaling: float,\n        rise: float,\n        text: str,\n        textwidth: float,\n        textdisp: float | tuple[float | None, float],\n        ncs: PDFColorSpace,\n        graphicstate: PDFGraphicState,\n        xobj_id: int,\n        font_id: str,\n        render_order: int,\n    ) -> None:\n        LTText.__init__(self)\n        self._text = text\n        self.matrix = matrix\n        self.fontname = font.fontname\n        self.ncs = ncs\n        self.graphicstate = graphicstate\n        self.xobj_id = xobj_id\n        self.adv = textwidth * fontsize * scaling\n        self.aw_font_id = font_id\n        self.render_order = render_order\n        # compute the boundary rectangle.\n        if font.is_vertical():\n            # vertical\n            assert isinstance(textdisp, tuple)\n            (vx, vy) = textdisp\n            if vx is None:\n                vx = fontsize * 0.5\n            else:\n                vx = vx * fontsize * 0.001\n            vy = (1000 - vy) * fontsize * 0.001\n            bbox_lower_left = (-vx, vy + rise + self.adv)\n            bbox_upper_right = (-vx + fontsize, vy + rise)\n        else:\n            # horizontal\n            descent = font.get_descent() * fontsize\n            bbox_lower_left = (0, descent + rise)\n            bbox_upper_right = (self.adv, descent + rise + fontsize)\n        (a, b, c, d, e, f) = self.matrix\n        self.upright = a * d * scaling > 0 and b * c <= 0\n        (x0, y0) = apply_matrix_pt(self.matrix, bbox_lower_left)\n        (x1, 
y1) = apply_matrix_pt(self.matrix, bbox_upper_right)\n        if x1 < x0:\n            (x0, x1) = (x1, x0)\n        if y1 < y0:\n            (y0, y1) = (y1, y0)\n        LTComponent.__init__(self, (x0, y0, x1, y1))\n        if font.is_vertical() or matrix[0] == 0:\n            self.size = self.width\n        else:\n            self.size = self.height\n        return\n\n    def __repr__(self) -> str:\n        return f\"<{self.__class__.__name__} {bbox2str(self.bbox)} matrix={matrix2str(self.matrix)} font={self.fontname!r} adv={self.adv} text={self.get_text()!r}>\"\n\n    def get_text(self) -> str:\n        return self._text\n\n\nclass Paragraph:\n    def __init__(self, y, x, x0, x1, size, brk):\n        self.y: float = y  # 初始纵坐标\n        self.x: float = x  # 初始横坐标\n        self.x0: float = x0  # 左边界\n        self.x1: float = x1  # 右边界\n        self.size: float = size  # 字体大小\n        self.brk: bool = brk  # 换行标记\n\n\n# fmt: off\nclass TranslateConverter(PDFConverterEx):\n    def __init__(\n        self,\n        rsrcmgr,\n        vfont: str | None = None,\n        vchar: str | None = None,\n        thread: int = 0,\n        layout: dict | None = None,\n        lang_in: str = \"\",  # 保留参数但添加未使用标记\n        _lang_out: str = \"\",  # 改为未使用参数\n        _service: str = \"\",  # 改为未使用参数\n        resfont: str = \"\",\n        noto: Font | None = None,\n        envs: dict | None = None,\n        _prompt: list | None = None,  # 改为未使用参数\n        il_creater: ILCreater | None = None,\n    ):\n        layout = layout or {}\n        super().__init__(rsrcmgr, il_creater)\n        self.vfont = vfont\n        self.vchar = vchar\n        self.thread = thread\n        self.layout = layout\n        self.resfont = resfont\n        self.noto = noto\n\n    def receive_layout(self, ltpage: LTPage):\n        # 段落\n        sstk: list[str] = []            # 段落文字栈\n        pstk: list[Paragraph] = []      # 段落属性栈\n        vbkt: int = 0                   # 段落公式括号计数\n        # 公式组\n        vstk: 
list[LTChar] = []         # 公式符号组\n        vlstk: list[LTLine] = []        # 公式线条组\n        vfix: float = 0                 # 公式纵向偏移\n        # 公式组栈\n        var: list[list[LTChar]] = []    # 公式符号组栈\n        varl: list[list[LTLine]] = []   # 公式线条组栈\n        varf: list[float] = []          # 公式纵向偏移栈\n        vlen: list[float] = []          # 公式宽度栈\n        # 全局\n        lstk: list[LTLine] = []         # 全局线条栈\n        xt: LTChar = None               # 上一个字符\n        xt_cls: int = -1                # 上一个字符所属段落，保证无论第一个字符属于哪个类别都可以触发新段落\n        vmax: float = ltpage.width / 4  # 行内公式最大宽度\n        ops: str = \"\"                   # 渲染结果\n\n        def vflag(font: str, char: str):    # 匹配公式（和角标）字体\n            if isinstance(font, bytes):     # 不一定能 decode，直接转 str\n                font = str(font)\n            font = font.split(\"+\")[-1]      # 字体名截断\n            if re.match(r\"\\(cid:\", char):\n                return True\n            # 基于字体名规则的判定\n            if self.vfont:\n                if re.match(self.vfont, font):\n                    return True\n            else:\n                if re.match(                                            # latex 字体\n                    r\"(CM[^R]|(MS|XY|MT|BL|RM|EU|LA|RS)[A-Z]|LINE|LCIRCLE|TeX-|rsfs|txsy|wasy|stmary|.*Mono|.*Code|.*Ital|.*Sym|.*Math)\",\n                    font,\n                ):\n                    return True\n            # 基于字符集规则的判定\n            if self.vchar:\n                if re.match(self.vchar, char):\n                    return True\n            else:\n                if (\n                    char\n                    and char != \" \"                                     # 非空格\n                    and (\n                        unicodedata.category(char[0])\n                        in [\"Lm\", \"Mn\", \"Sk\", \"Sm\", \"Zl\", \"Zp\", \"Zs\"]   # 文字修饰符、数学符号、分隔符号\n                        or ord(char[0]) in range(0x370, 0x400)          # 希腊字母\n                    )\n                ):\n               
     return True\n            return False\n\n        ############################################################\n        # A. 原文档解析\n        for child in ltpage:\n            if isinstance(child, LTChar):\n                try:\n                    self.il_creater.on_lt_char(child)\n                except Exception:\n                    log.exception(\n                        'Error processing LTChar',\n                    )\n                continue\n                cur_v = False\n                layout = self.layout[ltpage.pageid]\n                # ltpage.height 可能是 fig 里面的高度，这里统一用 layout.shape\n                h, w = layout.shape\n                # 读取当前字符在 layout 中的类别\n                cx, cy = np.clip(int(child.x0), 0, w - 1), np.clip(int(child.y0), 0, h - 1)\n                cls = layout[cy, cx]\n                # 锚定文档中 bullet 的位置\n                if child.get_text() == \"•\":\n                    cls = 0\n                # 判定当前字符是否属于公式\n                if (                                                                                        # 判定当前字符是否属于公式\n                    cls == 0                                                                                # 1. 类别为保留区域\n                    or (cls == xt_cls and len(sstk[-1].strip()) > 1 and child.size < pstk[-1].size * 0.79)  # 2. 角标字体，有 0.76 的角标和 0.799 的大写，这里用 0.79 取中，同时考虑首字母放大的情况\n                    or vflag(child.fontname, child.get_text())                                              # 3. 公式字体\n                    or (child.matrix[0] == 0 and child.matrix[3] == 0)                                      # 4. 
垂直字体\n                ):\n                    cur_v = True\n                # 判定括号组是否属于公式\n                if not cur_v:\n                    if vstk and child.get_text() == \"(\":\n                        cur_v = True\n                        vbkt += 1\n                    if vbkt and child.get_text() == \")\":\n                        cur_v = True\n                        vbkt -= 1\n                if (                                                        # 判定当前公式是否结束\n                    not cur_v                                               # 1. 当前字符不属于公式\n                    or cls != xt_cls                                        # 2. 当前字符与前一个字符不属于同一段落\n                    # or (abs(child.x0 - xt.x0) > vmax and cls != 0)        # 3. 段落内换行，可能是一长串斜体的段落，也可能是段内分式换行，这里设个阈值进行区分\n                    # 禁止纯公式（代码）段落换行，直到文字开始再重开文字段落，保证只存在两种情况\n                    # A. 纯公式（代码）段落（锚定绝对位置）sstk[-1]==\"\" -> sstk[-1]==\"{v*}\"\n                    # B. 文字开头段落（排版相对位置）sstk[-1]!=\"\"\n                    or (sstk[-1] != \"\" and abs(child.x0 - xt.x0) > vmax)    # 因为 cls==xt_cls==0 一定有 sstk[-1]==\"\"，所以这里不需要再判定 cls!=0\n                ):\n                    if vstk:\n                        if (                                                # 根据公式右侧的文字修正公式的纵向偏移\n                            not cur_v                                       # 1. 当前字符不属于公式\n                            and cls == xt_cls                               # 2. 当前字符与前一个字符属于同一段落\n                            and child.x0 > max([vch.x0 for vch in vstk])    # 3. 
当前字符在公式右侧\n                        ):\n                            vfix = vstk[0].y0 - child.y0\n                        if sstk[-1] == \"\":\n                            xt_cls = -1 # 禁止纯公式段落（sstk[-1]==\"{v*}\"）的后续连接，但是要考虑新字符和后续字符的连接，所以这里修改的是上个字符的类别\n                        sstk[-1] += f\"{{v{len(var)}}}\"\n                        var.append(vstk)\n                        varl.append(vlstk)\n                        varf.append(vfix)\n                        vstk = []\n                        vlstk = []\n                        vfix = 0\n                # 当前字符不属于公式或当前字符是公式的第一个字符\n                if not vstk:\n                    if cls == xt_cls:               # 当前字符与前一个字符属于同一段落\n                        if child.x0 > xt.x1 + 1:    # 添加行内空格\n                            sstk[-1] += \" \"\n                        elif child.x1 < xt.x0:      # 添加换行空格并标记原文段落存在换行\n                            sstk[-1] += \" \"\n                            pstk[-1].brk = True\n                    else:                           # 根据当前字符构建一个新的段落\n                        sstk.append(\"\")\n                        pstk.append(Paragraph(child.y0, child.x0, child.x0, child.x0, child.size, False))\n                if not cur_v:                                               # 文字入栈\n                    if (                                                    # 根据当前字符修正段落属性\n                        child.size > pstk[-1].size / 0.79                   # 1. 当前字符显著比段落字体大\n                        or len(sstk[-1].strip()) == 1                       # 2. 当前字符为段落第二个文字（考虑首字母放大的情况）\n                    ) and child.get_text() != \" \":                          # 3. 
当前字符不是空格\n                        pstk[-1].y -= child.size - pstk[-1].size            # 修正段落初始纵坐标，假设两个不同大小字符的上边界对齐\n                        pstk[-1].size = child.size\n                    sstk[-1] += child.get_text()\n                else:                                                       # 公式入栈\n                    if (                                                    # 根据公式左侧的文字修正公式的纵向偏移\n                        not vstk                                            # 1. 当前字符是公式的第一个字符\n                        and cls == xt_cls                                   # 2. 当前字符与前一个字符属于同一段落\n                        and child.x0 > xt.x0                                # 3. 前一个字符在公式左侧\n                    ):\n                        vfix = child.y0 - xt.y0\n                    vstk.append(child)\n                # 更新段落边界，因为段落内换行之后可能是公式开头，所以要在外边处理\n                pstk[-1].x0 = min(pstk[-1].x0, child.x0)\n                pstk[-1].x1 = max(pstk[-1].x1, child.x1)\n                # 更新上一个字符\n                xt = child\n                xt_cls = cls\n            elif isinstance(child, LTFigure):\n                # 图表\n                self.il_creater.on_pdf_figure(child)\n                pass\n            # elif isinstance(child, LTLine):     # 线条\n            #     continue\n            #     layout = self.layout[ltpage.pageid]\n            #     # ltpage.height 可能是 fig 里面的高度，这里统一用 layout.shape\n            #     h, w = layout.shape\n            #     # 读取当前线条在 layout 中的类别\n            #     cx, cy = np.clip(int(child.x0), 0, w - 1), np.clip(int(child.y0), 0, h - 1)\n            #     cls = layout[cy, cx]\n            #     if vstk and cls == xt_cls:      # 公式线条\n            #         vlstk.append(child)\n            #     else:                           # 全局线条\n            #         lstk.append(child)\n            elif isinstance(child, LTCurve):\n                self.il_creater.on_lt_curve(child)\n                pass\n            else:\n                pass\n        return\n 
       # 处理结尾\n        if vstk:    # 公式出栈\n            sstk[-1] += f\"{{v{len(var)}}}\"\n            var.append(vstk)\n            varl.append(vlstk)\n            varf.append(vfix)\n        log.debug(\"\\n==========[VSTACK]==========\\n\")\n        for var_id, v in enumerate(var):  # 计算公式宽度\n            l = max([vch.x1 for vch in v]) - v[0].x0\n            log.debug(f'< {l:.1f} {v[0].x0:.1f} {v[0].y0:.1f} {v[0].cid} {v[0].fontname} {len(varl[var_id])} > v{var_id} = {\"\".join([ch.get_text() for ch in v])}')\n            vlen.append(l)\n\n        ############################################################\n        # B. 段落翻译\n        log.debug(\"\\n==========[SSTACK]==========\\n\")\n\n        news = sstk.copy()\n\n        ############################################################\n        # C. 新文档排版\n        def raw_string(fcur: str, cstk: str):  # 编码字符串\n            if fcur == 'noto':\n                return \"\".join([f\"{self.noto.has_glyph(ord(c)):04x}\" for c in cstk])\n            elif isinstance(self.fontmap[fcur], PDFCIDFont):  # 判断编码长度\n                return \"\".join([f\"{ord(c):04x}\" for c in cstk])\n            else:\n                return \"\".join([f\"{ord(c):02x}\" for c in cstk])\n\n        _x, _y = 0, 0\n        for para_id, new in enumerate(news):\n            x: float = pstk[para_id].x           # 段落初始横坐标\n            y: float = pstk[para_id].y           # 段落初始纵坐标\n            x0: float = pstk[para_id].x0         # 段落左边界\n            x1: float = pstk[para_id].x1         # 段落右边界\n            size: float = pstk[para_id].size     # 段落字体大小\n            brk: bool = pstk[para_id].brk        # 段落换行标记\n            cstk: str = \"\"                  # 当前文字栈\n            fcur: str = None                # 当前字体 ID\n            tx = x\n            fcur_ = fcur\n            ptr = 0\n            log.debug(f\"< {y} {x} {x0} {x1} {size} {brk} > {sstk[para_id]} | {new}\")\n            while ptr < len(new):\n                vy_regex = re.match(\n                
    r\"\\{\\s*v([\\d\\s]+)\\}\", new[ptr:], re.IGNORECASE,\n                )  # 匹配 {vn} 公式标记\n                mod = 0  # 文字修饰符\n                if vy_regex:  # 加载公式\n                    ptr += len(vy_regex.group(0))\n                    try:\n                        vid = int(vy_regex.group(1).replace(\" \", \"\"))\n                        adv = vlen[vid]\n                    except Exception as e:\n                        log.debug(\"Skipping formula placeholder due to: %s\", e)\n                        continue  # 翻译器可能会自动补个越界的公式标记\n                    if var[vid][-1].get_text() and unicodedata.category(var[vid][-1].get_text()[0]) in [\"Lm\", \"Mn\", \"Sk\"]:  # 文字修饰符\n                        mod = var[vid][-1].width\n                else:  # 加载文字\n                    ch = new[ptr]\n                    fcur_ = None\n                    try:\n                        if fcur_ is None and self.fontmap[\"tiro\"].to_unichr(ord(ch)) == ch:\n                            fcur_ = \"tiro\"  # 默认拉丁字体\n                    except Exception:\n                        pass\n                    if fcur_ is None:\n                        fcur_ = self.resfont  # 默认非拉丁字体\n                    if fcur_ == 'noto':\n                        adv = self.noto.char_lengths(ch, size)[0]\n                    else:\n                        adv = self.fontmap[fcur_].char_width(ord(ch)) * size\n                    ptr += 1\n                if (                                # 输出文字缓冲区\n                    fcur_ != fcur                   # 1. 字体更新\n                    or vy_regex                     # 2. 插入公式\n                    or x + adv > x1 + 0.1 * size    # 3. 
到达右边界（可能一整行都被符号化，这里需要考虑浮点误差）\n                ):\n                    if cstk:\n                        ops += f\"/{fcur} {size:f} Tf 1 0 0 1 {tx:f} {y:f} Tm [<{raw_string(fcur, cstk)}>] TJ \"\n                        cstk = \"\"\n                if brk and x + adv > x1 + 0.1 * size:  # 到达右边界且原文段落存在换行\n                    x = x0\n                    lang_space = {\"zh-cn\": 1.4, \"zh-tw\": 1.4, \"zh-hans\": 1.4, \"zh-hant\": 1.4, \"zh\": 1.4, \"ja\": 1.1, \"ko\": 1.2, \"en\": 1.2, \"ar\": 1.0, \"ru\": 0.8, \"uk\": 0.8, \"ta\": 0.8}\n                    # y -= size * lang_space.get(self.translator.lang_out.lower(), 1.1)  # 小语种大多适配 1.1\n                    y -= size * 1.4\n                if vy_regex:  # 插入公式\n                    fix = 0\n                    if fcur is not None:  # 段落内公式修正纵向偏移\n                        fix = varf[vid]\n                    for vch in var[vid]:  # 排版公式字符\n                        vc = chr(vch.cid)\n                        ops += f\"/{self.fontid[vch.font]} {vch.size:f} Tf 1 0 0 1 {x + vch.x0 - var[vid][0].x0:f} {fix + y + vch.y0 - var[vid][0].y0:f} Tm <{raw_string(self.fontid[vch.font], vc)}> TJ \"\n                        if log.isEnabledFor(logging.DEBUG):\n                            lstk.append(LTLine(0.1, (_x, _y), (x + vch.x0 - var[vid][0].x0, fix + y + vch.y0 - var[vid][0].y0)))\n                            _x, _y = x + vch.x0 - var[vid][0].x0, fix + y + vch.y0 - var[vid][0].y0\n                    for l in varl[vid]:  # 排版公式线条\n                        if l.linewidth < 5:  # hack 有的文档会用粗线条当图片背景\n                            ops += f\"ET q 1 0 0 1 {l.pts[0][0] + x - var[vid][0].x0:f} {l.pts[0][1] + fix + y - var[vid][0].y0:f} cm [] 0 d 0 J {l.linewidth:f} w 0 0 m {l.pts[1][0] - l.pts[0][0]:f} {l.pts[1][1] - l.pts[0][1]:f} l S Q BT \"\n                else:  # 插入文字缓冲区\n                    if not cstk:  # 单行开头\n                        tx = x\n                        if x == x0 and ch == \" \":  # 消除段落换行空格\n                            
adv = 0\n                        else:\n                            cstk += ch\n                    else:\n                        cstk += ch\n                adv -= mod  # subtract the text-modifier width\n                fcur = fcur_\n                x += adv\n                if log.isEnabledFor(logging.DEBUG):\n                    lstk.append(LTLine(0.1, (_x, _y), (x, y)))\n                    _x, _y = x, y\n            # flush the remaining text buffer at the end of the paragraph\n            if cstk:\n                ops += f\"/{fcur} {size:f} Tf 1 0 0 1 {tx:f} {y:f} Tm <{raw_string(fcur, cstk)}> TJ \"\n        for l in lstk:  # typeset global lines\n            if l.linewidth < 5:  # hack: some documents use thick lines as image backgrounds\n                ops += f\"ET q 1 0 0 1 {l.pts[0][0]:f} {l.pts[0][1]:f} cm [] 0 d 0 J {l.linewidth:f} w 0 0 m {l.pts[1][0] - l.pts[0][0]:f} {l.pts[1][1] - l.pts[0][1]:f} l S Q BT \"\n        ops = f\"BT {ops}ET \"\n        return ops\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/__init__.py",
    "content": "from babeldoc.format.pdf.document_il.il_version_1 import BaseOperations\nfrom babeldoc.format.pdf.document_il.il_version_1 import Box\nfrom babeldoc.format.pdf.document_il.il_version_1 import Cropbox\nfrom babeldoc.format.pdf.document_il.il_version_1 import Document\nfrom babeldoc.format.pdf.document_il.il_version_1 import GraphicState\nfrom babeldoc.format.pdf.document_il.il_version_1 import Mediabox\nfrom babeldoc.format.pdf.document_il.il_version_1 import Page\nfrom babeldoc.format.pdf.document_il.il_version_1 import PageLayout\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfAffineTransform\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfCharacter\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfCurve\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfFigure\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfFont\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfFontCharBoundingBox\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfForm\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfFormSubtype\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfFormula\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfInlineForm\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfLine\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfMatrix\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfOriginalPath\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfParagraph\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfParagraphComposition\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfPath\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfRectangle\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfSameStyleCharacters\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfSameStyleUnicodeCharacters\nfrom babeldoc.format.pdf.document_il.il_version_1 import 
PdfStyle\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfXobject\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfXobjForm\nfrom babeldoc.format.pdf.document_il.il_version_1 import VisualBbox\n\n__all__ = [\n    \"BaseOperations\",\n    \"Box\",\n    \"Cropbox\",\n    \"Document\",\n    \"GraphicState\",\n    \"Mediabox\",\n    \"Page\",\n    \"PageLayout\",\n    \"PdfAffineTransform\",\n    \"PdfCharacter\",\n    \"PdfCurve\",\n    \"PdfFigure\",\n    \"PdfFont\",\n    \"PdfFontCharBoundingBox\",\n    \"PdfForm\",\n    \"PdfFormSubtype\",\n    \"PdfFormula\",\n    \"PdfInlineForm\",\n    \"PdfLine\",\n    \"PdfMatrix\",\n    \"PdfOriginalPath\",\n    \"PdfParagraph\",\n    \"PdfParagraphComposition\",\n    \"PdfPath\",\n    \"PdfRectangle\",\n    \"PdfSameStyleCharacters\",\n    \"PdfSameStyleUnicodeCharacters\",\n    \"PdfStyle\",\n    \"PdfXobjForm\",\n    \"PdfXobject\",\n    \"VisualBbox\",\n]\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/backend/__init__.py",
    "content": ""
  },
  {
    "path": "babeldoc/format/pdf/document_il/backend/pdf_creater.py",
    "content": "import io\nimport itertools\nimport logging\nimport os\nimport re\nimport time\nimport unicodedata\nfrom abc import ABC\nfrom abc import abstractmethod\nfrom multiprocessing import Process\nfrom pathlib import Path\n\nimport freetype\nimport pymupdf\nfrom bitstring import BitStream\n\nfrom babeldoc.assets.embedding_assets_metadata import FONT_NAMES\nfrom babeldoc.format.pdf.document_il import PdfOriginalPath\nfrom babeldoc.format.pdf.document_il import il_version_1\nfrom babeldoc.format.pdf.document_il.utils.fontmap import FontMapper\nfrom babeldoc.format.pdf.document_il.utils.matrix_helper import matrix_to_bytes\nfrom babeldoc.format.pdf.document_il.utils.zstd_helper import zstd_decompress\nfrom babeldoc.format.pdf.translation_config import TranslateResult\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\nfrom babeldoc.format.pdf.translation_config import WatermarkOutputMode\n\nlogger = logging.getLogger(__name__)\n\nSUBSET_FONT_STAGE_NAME = \"Subset font\"\nSAVE_PDF_STAGE_NAME = \"Save PDF\"\n\n\nclass RenderUnit(ABC):\n    \"\"\"Abstract base class for all renderable units.\"\"\"\n\n    def __init__(\n        self,\n        render_order: int,\n        sub_render_order: int = 0,\n        xobj_id: str | None = None,\n    ):\n        self.render_order = render_order\n        self.sub_render_order = sub_render_order\n        self.xobj_id = xobj_id\n        if self.render_order is None:\n            self.render_order = 9999999999999999\n        if self.sub_render_order is None:\n            self.sub_render_order = 9999999999999999\n\n    @abstractmethod\n    def render(\n        self,\n        draw_op: BitStream,\n        context: \"RenderContext\",\n    ) -> None:\n        \"\"\"Render this unit to the draw_op BitStream.\"\"\"\n        pass\n\n    def get_sort_key(self) -> tuple[int, int]:\n        \"\"\"Get the sort key for ordering render units.\"\"\"\n        return (self.render_order, self.sub_render_order)\n\n\nclass 
CharacterRenderUnit(RenderUnit):\n    \"\"\"Render unit for PDF characters.\"\"\"\n\n    def __init__(\n        self,\n        char: il_version_1.PdfCharacter,\n        render_order: int,\n        sub_render_order: int = 0,\n    ):\n        super().__init__(render_order, sub_render_order, char.xobj_id)\n        self.char = char\n\n    def render(self, draw_op: BitStream, context: \"RenderContext\") -> None:\n        char = self.char\n        if char.char_unicode == \"\\n\":\n            return\n        if char.pdf_character_id is None:\n            return\n\n        char_size = char.pdf_style.font_size\n        font_id = char.pdf_style.font_id\n\n        # Get encoding length map based on xobj_id\n        if self.xobj_id in context.xobj_encoding_length_map:\n            encoding_length_map = context.xobj_encoding_length_map[self.xobj_id]\n        else:\n            encoding_length_map = context.page_encoding_length_map\n\n        # Check font exists if needed\n        if context.check_font_exists:\n            if self.xobj_id in context.xobj_available_fonts:\n                if font_id not in context.xobj_available_fonts[self.xobj_id]:\n                    return\n            elif font_id not in context.available_font_list:\n                return\n\n        draw_op.append(b\"q \")\n        context.pdf_creator.render_graphic_state(draw_op, char.pdf_style.graphic_state)\n\n        if char.vertical:\n            draw_op.append(\n                f\"BT /{font_id} {char_size:f} Tf 0 1 -1 0 {char.box.x2:f} {char.box.y:f} Tm \".encode(),\n            )\n        else:\n            draw_op.append(\n                f\"BT /{font_id} {char_size:f} Tf 1 0 0 1 {char.box.x:f} {char.box.y:f} Tm \".encode(),\n            )\n\n        encoding_length = encoding_length_map.get(font_id, None)\n        if encoding_length is None:\n            if font_id in context.all_encoding_length_map:\n                encoding_length = context.all_encoding_length_map[font_id]\n            else:\n   
             logger.debug(\n                    f\"Font {font_id} not found in encoding length map for page {context.page.page_number}\"\n                )\n                return\n\n        draw_op.append(\n            f\"<{char.pdf_character_id:0{encoding_length * 2}x}>\".upper().encode(),\n        )\n        draw_op.append(b\" Tj ET Q \\n\")\n\n\nclass FormRenderUnit(RenderUnit):\n    \"\"\"Render unit for PDF forms.\"\"\"\n\n    def __init__(\n        self,\n        form: il_version_1.PdfForm,\n        render_order: int,\n        sub_render_order: int = 0,\n    ):\n        super().__init__(render_order, sub_render_order, form.xobj_id)\n        self.form = form\n\n    def render(self, draw_op: BitStream, context: \"RenderContext\") -> None:\n        form = self.form\n        draw_op.append(b\"q \")\n\n        # Apply relocation transform first if present (before passthrough instructions)\n        # This ensures masks in passthrough_per_char_instruction use the correct coordinate system\n        assert form.pdf_matrix is not None\n        if form.relocation_transform and len(form.relocation_transform) == 6:\n            try:\n                relocation_matrix = tuple(float(x) for x in form.relocation_transform)\n                draw_op.append(matrix_to_bytes(relocation_matrix))\n            except (ValueError, TypeError):\n                # If relocation transform conversion fails, skip it and use original matrix later\n                pass\n\n        draw_op.append(matrix_to_bytes(form.pdf_matrix))\n\n        draw_op.append(b\" \")\n\n        draw_op.append(\n            form.graphic_state.passthrough_per_char_instruction.encode(),\n        )\n\n        draw_op.append(b\" \")\n\n        assert form.pdf_form_subtype is not None\n        if form.pdf_form_subtype.pdf_xobj_form:\n            draw_op.append(\n                f\" /{form.pdf_form_subtype.pdf_xobj_form.do_args} Do \".encode()\n            )\n        elif form.pdf_form_subtype.pdf_inline_form:\n          
  # Handle inline form (inline image)\n            inline_form = form.pdf_form_subtype.pdf_inline_form\n\n            # Start inline image\n            draw_op.append(b\" BI \")\n\n            # Add image parameters if available\n            if inline_form.image_parameters:\n                import json\n\n                try:\n                    params = json.loads(inline_form.image_parameters)\n                    for key, value in params.items():\n                        if key.startswith(\"/\"):\n                            key = key[1:]  # Remove leading slash\n                        # Convert Python boolean to PDF boolean\n                        if value is True:\n                            value = \"true\"\n                        elif value is False:\n                            value = \"false\"\n                        elif isinstance(value, str) and value in (\n                            \"True\",\n                            \"False\",\n                        ):\n                            value = value.lower()\n                        draw_op.append(f\"/{key} {value} \".encode())\n                except json.JSONDecodeError:\n                    pass\n\n            # Start image data\n            draw_op.append(b\"ID \")\n\n            # Add image data if available (base64 decode it first)\n            if inline_form.form_data:\n                import base64\n\n                try:\n                    image_data = base64.b64decode(inline_form.form_data)\n                    draw_op.append(image_data)\n                except Exception:\n                    pass\n\n            # End inline image\n            draw_op.append(b\" EI \")\n        draw_op.append(b\" Q\\n\")\n\n\nclass RectangleRenderUnit(RenderUnit):\n    \"\"\"Render unit for PDF rectangles.\"\"\"\n\n    def __init__(\n        self,\n        rectangle: il_version_1.PdfRectangle,\n        render_order: int,\n        sub_render_order: int = 0,\n        line_width: float = 0.4,\n    ):\n 
       super().__init__(render_order, sub_render_order, rectangle.xobj_id)\n        self.rectangle = rectangle\n        self.line_width = line_width\n\n    def render(self, draw_op: BitStream, context: \"RenderContext\") -> None:\n        rectangle = self.rectangle\n        x1 = rectangle.box.x\n        y1 = rectangle.box.y\n        x2 = rectangle.box.x2\n        y2 = rectangle.box.y2\n        width = x2 - x1\n        height = y2 - y1\n\n        draw_op.append(b\"q n \")\n        draw_op.append(\n            rectangle.graphic_state.passthrough_per_char_instruction.encode(),\n        )\n\n        line_width = self.line_width\n        if rectangle.line_width is not None:\n            line_width = rectangle.line_width\n        if line_width > 0:\n            draw_op.append(f\" {line_width:.6f} w \".encode())\n\n        draw_op.append(f\"{x1:.6f} {y1:.6f} {width:.6f} {height:.6f} re \".encode())\n        if rectangle.fill_background:\n            draw_op.append(b\" f \")\n        else:\n            draw_op.append(b\" S \")\n\n        draw_op.append(b\"Q\\n\")\n\n\nclass CurveRenderUnit(RenderUnit):\n    \"\"\"Render unit for PDF curves.\"\"\"\n\n    def __init__(\n        self,\n        curve: il_version_1.PdfCurve,\n        render_order: int,\n        sub_render_order: int = 0,\n    ):\n        super().__init__(render_order, sub_render_order, curve.xobj_id)\n        self.curve = curve\n\n    def render(self, draw_op: BitStream, context: \"RenderContext\") -> None:\n        curve = self.curve\n        draw_op.append(b\"q n \")\n\n        # Apply relocation transform first if present (before passthrough instructions)\n        # This ensures masks in passthrough_per_char_instruction use the correct coordinate system\n        if curve.relocation_transform and len(curve.relocation_transform) == 6:\n            try:\n                relocation_matrix = tuple(float(x) for x in curve.relocation_transform)\n                draw_op.append(matrix_to_bytes(relocation_matrix))\n   
         except (ValueError, TypeError):\n                # If relocation transform conversion fails, skip it and use original CTM later\n                pass\n\n        draw_op.append(b\" \")\n\n        # Apply original CTM if present\n        if curve.ctm and len(curve.ctm) == 6:\n            ctm = curve.ctm\n            draw_op.append(\n                f\"{ctm[0]:.6f} {ctm[1]:.6f} {ctm[2]:.6f} {ctm[3]:.6f} {ctm[4]:.6f} {ctm[5]:.6f} cm \".encode()\n            )\n\n        draw_op.append(b\" \")\n\n        draw_op.append(\n            curve.graphic_state.passthrough_per_char_instruction.encode(),\n        )\n\n        draw_op.append(b\" \")\n        path_op = BitStream(b\" \")\n\n        # Use original path if available, otherwise fall back to transformed path\n        path_to_use = (\n            curve.pdf_original_path\n            if curve.pdf_original_path is not None\n            else curve.pdf_path\n        )\n        for path in path_to_use:\n            if isinstance(path, PdfOriginalPath):\n                path = path.pdf_path\n            if path.has_xy:\n                path_op.append(f\"{path.x:F} {path.y:F} {path.op} \".encode())\n            else:\n                path_op.append(f\"{path.op} \".encode())\n\n        if curve.fill_background:\n            draw_op.append(path_op)\n            draw_op.append(b\" f\")\n        if curve.evenodd:\n            draw_op.append(b\"* \")\n        else:\n            draw_op.append(b\" \")\n        if curve.stroke_path:\n            draw_op.append(path_op)\n            draw_op.append(b\"S \")\n\n        # final_op = b' B '\n\n        draw_op.append(b\" n Q\\n\")\n\n\nclass RenderContext:\n    \"\"\"Context object containing shared state for rendering.\"\"\"\n\n    def __init__(\n        self,\n        pdf_creator: \"PDFCreater\",\n        page: il_version_1.Page,\n        available_font_list: set[str],\n        page_encoding_length_map: dict[str, int],\n        all_encoding_length_map: dict[str, int],\n        
xobj_available_fonts: dict[str, set[str]],\n        xobj_encoding_length_map: dict[str, dict[str, int]],\n        ctm_for_ops: bytes,\n        check_font_exists: bool = False,\n    ):\n        self.pdf_creator = pdf_creator\n        self.page = page\n        self.available_font_list = available_font_list\n        self.page_encoding_length_map = page_encoding_length_map\n        self.all_encoding_length_map = all_encoding_length_map\n        self.xobj_available_fonts = xobj_available_fonts\n        self.xobj_encoding_length_map = xobj_encoding_length_map\n        self.ctm_for_ops = ctm_for_ops\n        self.check_font_exists = check_font_exists\n\n\ndef to_int(src):\n    return int(re.search(r\"\\d+\", src).group(0))\n\n\ndef parse_mapping(text):\n    mapping = []\n    for x in re.finditer(rb\"<(?P<num>[a-fA-F0-9]+)>\", text):\n        mapping.append(int(x.group(\"num\"), 16))\n    return mapping\n\n\ndef apply_normalization(cmap, gid, code):\n    need = False\n    if 0x2F00 <= code <= 0x2FD5:  # Kangxi Radicals\n        need = True\n    if 0xF900 <= code <= 0xFAFF:  # CJK Compatibility Ideographs\n        need = True\n    if need:\n        norm = unicodedata.normalize(\"NFD\", chr(code))\n        cmap[gid] = ord(norm)\n    else:\n        cmap[gid] = code\n\n\ndef batched(iterable, n, *, strict=False):\n    # batched('ABCDEFG', 3) → ABC DEF G\n    if n < 1:\n        raise ValueError(\"n must be at least one\")\n    iterator = iter(iterable)\n    while batch := tuple(itertools.islice(iterator, n)):\n        if strict and len(batch) != n:\n            raise ValueError(\"batched(): incomplete batch\")\n        yield batch\n\n\ndef update_tounicode_cmap_pair(cmap, data):\n    for start, stop, value in batched(data, 3):\n        for gid in range(start, stop + 1):\n            code = value + gid - start\n            apply_normalization(cmap, gid, code)\n\n\ndef update_tounicode_cmap_code(cmap, data):\n    for gid, code in batched(data, 2):\n        
apply_normalization(cmap, gid, code)\n\n\ndef parse_tounicode_cmap(data):\n    cmap = {}\n    for x in re.finditer(\n        rb\"\\s+beginbfrange\\s*(?P<r>(<[0-9a-fA-F]+>\\s*)+)endbfrange\\s+\", data\n    ):\n        update_tounicode_cmap_pair(cmap, parse_mapping(x.group(\"r\")))\n    for x in re.finditer(\n        rb\"\\s+beginbfchar\\s*(?P<c>(<[0-9a-fA-F]+>\\s*)+)endbfchar\", data\n    ):\n        update_tounicode_cmap_code(cmap, parse_mapping(x.group(\"c\")))\n    return cmap\n\n\ndef parse_truetype_data(data):\n    glyph_in_use = []\n    face = freetype.Face(io.BytesIO(data))\n    for i in range(face.num_glyphs):\n        face.load_glyph(i)\n        if face.glyph.outline.contours:\n            glyph_in_use.append(i)\n    return glyph_in_use\n\n\nTOUNICODE_HEAD = \"\"\"\\\n/CIDInit /ProcSet findresource begin\n12 dict begin\nbegincmap\n/CIDSystemInfo <</Registry(Adobe)/Ordering(UCS)/Supplement 0>> def\n/CMapName /Adobe-Identity-UCS def\n/CMapType 2 def\n1 begincodespacerange\n<0000> <FFFF>\nendcodespacerange\"\"\"\nTOUNICODE_TAIL = \"\"\"\\\nendcmap\nCMapName currentdict /CMap defineresource pop\nend\nend\"\"\"\n\n\ndef make_tounicode(cmap, used):\n    short = []\n    for x in used:\n        if x in cmap:\n            short.append((x, cmap[x]))\n    line = [TOUNICODE_HEAD]\n    for block in batched(short, 100):\n        line.append(f\"{len(block)} beginbfchar\")\n        for glyph, code in block:\n            if code < 0x10000:\n                line.append(f\"<{glyph:04x}><{code:04x}>\")\n            else:\n                # Encode code points above the BMP as a UTF-16 surrogate pair\n                code -= 0x10000\n                high = 0xD800 + (code >> 10)\n                low = 0xDC00 + (code & 0b1111111111)\n                line.append(f\"<{glyph:04x}><{high:04x}{low:04x}>\")\n        line.append(\"endbfchar\")\n    line.append(TOUNICODE_TAIL)\n    return \"\\n\".join(line)\n\n\ndef reproduce_one_font(doc, index):\n    m = doc.xref_get_key(index, \"ToUnicode\")\n    f = doc.xref_get_key(index, \"DescendantFonts\")\n    if m[0] == 
\"xref\" and f[0] == \"array\":\n        mi = to_int(m[1])\n        fi = to_int(f[1])\n        ff = doc.xref_get_key(fi, \"FontDescriptor/FontFile2\")\n        ms = doc.xref_stream(mi)\n        fs = doc.xref_stream(to_int(ff[1]))\n        cmap = parse_tounicode_cmap(ms)\n        used = parse_truetype_data(fs)\n        text = make_tounicode(cmap, used)\n        doc.update_stream(mi, bytes(text, \"U8\"))\n\n\ndef reproduce_cmap(doc):\n    assert doc\n    font_set = set()\n    for page in doc:\n        try:\n            font_list = page.get_fonts()\n            for font in font_list:\n                if font[1] == \"ttf\" and font[3] in FONT_NAMES and \".ttf\" in font[4]:\n                    font_set.add(font)\n        except Exception as e:\n            logger.error(f\"Error in getting page fonts: {e}\")\n    for font in font_set:\n        reproduce_one_font(doc, font[0])\n    return doc\n\n\ndef _subset_fonts_process(pdf_path, output_path):\n    \"\"\"Function to run in subprocess for font subsetting.\n\n    Args:\n        pdf_path: Path to the PDF file to subset\n        output_path: Path where to save the result\n    \"\"\"\n    try:\n        pdf = pymupdf.open(pdf_path)\n        pdf.subset_fonts(fallback=False)\n        pdf.save(output_path)\n        # Exit code 0 indicates success\n        os._exit(0)\n    except Exception as e:\n        logger.error(f\"Error in font subsetting subprocess: {e}\")\n        # Exit code 1 indicates failure\n        os._exit(1)\n\n\ndef _save_pdf_clean_process(\n    pdf_path,\n    output_path,\n    garbage=1,\n    deflate=True,\n    clean=True,\n    deflate_fonts=True,\n    linear=False,\n):\n    \"\"\"Function to run in subprocess for saving PDF with clean=True which can be time-consuming.\n\n    Args:\n        pdf_path: Path to the PDF file to save\n        output_path: Path where to save the result\n        garbage: Garbage collection level (0, 1, 2, 3, 4)\n        deflate: Whether to deflate the PDF\n        clean: Whether to clean the PDF\n        deflate_fonts: 
Whether to deflate fonts\n        linear: Whether to linearize the PDF\n    \"\"\"\n    try:\n        pdf = pymupdf.open(pdf_path)\n        pdf.save(\n            output_path,\n            garbage=garbage,\n            deflate=deflate,\n            clean=clean,\n            deflate_fonts=deflate_fonts,\n            linear=linear,\n        )\n        # Exit code 0 indicates success\n        os._exit(0)\n    except Exception as e:\n        logger.error(f\"Error in save PDF with clean=True subprocess: {e}\")\n        # Exit code 1 indicates failure\n        os._exit(1)\n\n\nclass PDFCreater:\n    stage_name = \"Generate drawing instructions\"\n\n    def __init__(\n        self,\n        original_pdf_path: str,\n        document: il_version_1.Document,\n        translation_config: TranslationConfig,\n        mediabox_data: dict,\n    ):\n        self.original_pdf_path = original_pdf_path\n        self.docs = document\n        self.font_path = translation_config.font\n        self.font_mapper = FontMapper(translation_config)\n        self.translation_config = translation_config\n        self.mediabox_data = mediabox_data\n\n    def render_graphic_state(\n        self,\n        draw_op: BitStream,\n        graphic_state: il_version_1.GraphicState,\n    ):\n        if graphic_state is None:\n            return\n        # if graphic_state.stroking_color_space_name:\n        #     draw_op.append(\n        #         f\"/{graphic_state.stroking_color_space_name} CS \\n\".encode()\n        #     )\n        # if graphic_state.non_stroking_color_space_name:\n        #     draw_op.append(\n        #         f\"/{graphic_state.non_stroking_color_space_name}\"\n        #         f\" cs \\n\".encode()\n        #     )\n        # if graphic_state.ncolor is not None:\n        #     if len(graphic_state.ncolor) == 1:\n        #         draw_op.append(f\"{graphic_state.ncolor[0]} g \\n\".encode())\n        #     elif len(graphic_state.ncolor) == 3:\n        #         draw_op.append(\n        #             f\"{' '.join((str(x) 
for x in graphic_state.ncolor))} sc \\n\".encode()\n        #         )\n        # if graphic_state.scolor is not None:\n        #     if len(graphic_state.scolor) == 1:\n        #         draw_op.append(f\"{graphic_state.scolor[0]} G \\n\".encode())\n        #     elif len(graphic_state.scolor) == 3:\n        #         draw_op.append(\n        #             f\"{' '.join((str(x) for x in graphic_state.scolor))} SC \\n\".encode()\n        #         )\n\n        if graphic_state.passthrough_per_char_instruction:\n            draw_op.append(\n                f\"{graphic_state.passthrough_per_char_instruction} \\n\".encode(),\n            )\n\n    def render_paragraph_to_char(\n        self,\n        paragraph: il_version_1.PdfParagraph,\n    ) -> list[il_version_1.PdfCharacter]:\n        chars = []\n        for composition in paragraph.pdf_paragraph_composition:\n            if composition.pdf_character:\n                chars.append(composition.pdf_character)\n            elif composition.pdf_formula:\n                # Flatten formula: extract all characters from the formula\n                chars.extend(composition.pdf_formula.pdf_character)\n            else:\n                logger.error(\n                    f\"Unknown composition type. \"\n                    f\"This type only appears in the IL \"\n                    f\"after the translation is completed. \"\n                    f\"During PDF rendering, this type is not supported. \"\n                    f\"Composition: {composition}. \"\n                    f\"Paragraph: {paragraph}. 
\",\n                )\n                continue\n        if not chars and paragraph.unicode and paragraph.debug_id:\n            logger.error(\n                f\"Unable to export paragraphs that have \"\n                f\"not yet been formatted: {paragraph}\",\n            )\n            return chars\n        return chars\n\n    def create_render_units_for_page(\n        self,\n        page: il_version_1.Page,\n        translation_config: TranslationConfig,\n    ) -> list[RenderUnit]:\n        \"\"\"Convert all renderable objects in a page to render units.\"\"\"\n        render_units = []\n\n        # Collect all characters (from page and paragraphs)\n        chars = []\n        if page.pdf_character:\n            chars.extend(page.pdf_character)\n        for paragraph in page.pdf_paragraph:\n            chars.extend(self.render_paragraph_to_char(paragraph))\n\n        # Convert characters to render units\n        for i, char in enumerate(chars):\n            render_order = getattr(char, \"render_order\", 100)  # Default render order\n            sub_render_order = getattr(char, \"sub_render_order\", i)\n            render_units.append(\n                CharacterRenderUnit(char, render_order, sub_render_order)\n            )\n\n        # Collect forms from formulas within paragraphs\n        formula_forms = []\n        for paragraph in page.pdf_paragraph:\n            for composition in paragraph.pdf_paragraph_composition:\n                if composition.pdf_formula:\n                    formula_forms.extend(composition.pdf_formula.pdf_form)\n\n        # Convert forms to render units (page-level forms + forms from formulas)\n        if not translation_config.skip_form_render:\n            all_forms = list(page.pdf_form) + formula_forms\n            for i, form in enumerate(all_forms):\n                render_order = getattr(\n                    form, \"render_order\", 50\n                )  # Forms render before characters\n                sub_render_order = 
getattr(form, \"sub_render_order\", i)\n                render_units.append(\n                    FormRenderUnit(form, render_order, sub_render_order)\n                )\n\n        # Convert rectangles to render units (only for OCR workaround or debug)\n        for i, rect in enumerate(page.pdf_rectangle):\n            if (\n                translation_config.ocr_workaround\n                and not rect.debug_info\n                and rect.fill_background\n            ) or (translation_config.debug and rect.debug_info):\n                render_order = getattr(\n                    rect, \"render_order\", 10\n                )  # Rectangles render first\n                sub_render_order = getattr(rect, \"sub_render_order\", i)\n                line_width = 0.1 if translation_config.ocr_workaround else 0.4\n                render_units.append(\n                    RectangleRenderUnit(\n                        rect, render_order, sub_render_order, line_width\n                    )\n                )\n\n        # Collect curves from formulas within paragraphs\n        formula_curves = []\n        for paragraph in page.pdf_paragraph:\n            for composition in paragraph.pdf_paragraph_composition:\n                if composition.pdf_formula:\n                    formula_curves.extend(composition.pdf_formula.pdf_curve)\n\n        # Convert curves to render units (page-level curves + curves from formulas, only for debug)\n        if not translation_config.skip_curve_render:\n            all_curves = list(page.pdf_curve) + formula_curves\n            for i, curve in enumerate(all_curves):\n                if curve.debug_info or translation_config.debug:\n                    render_order = getattr(\n                        curve, \"render_order\", 20\n                    )  # Curves render after rectangles\n                    sub_render_order = getattr(curve, \"sub_render_order\", i)\n                    render_units.append(\n                        
CurveRenderUnit(curve, render_order, sub_render_order)\n                    )\n\n        return render_units\n\n    def render_units_to_stream(\n        self,\n        render_units: list[RenderUnit],\n        context: RenderContext,\n        page_op: BitStream,\n        xobj_draw_ops: dict[str, BitStream],\n    ) -> None:\n        \"\"\"Render sorted render units to appropriate draw streams.\"\"\"\n        # Sort render units by (render_order, sub_render_order)\n        sorted_units = sorted(render_units, key=lambda unit: unit.get_sort_key())\n\n        for unit in sorted_units:\n            # Determine which draw_op to use based on xobj_id\n            if unit.xobj_id in xobj_draw_ops:\n                draw_op = xobj_draw_ops[unit.xobj_id]\n            else:\n                draw_op = page_op\n\n            # Render the unit\n            unit.render(draw_op, context)\n\n    def get_available_font_list(self, pdf, page):\n        page_xref_id = pdf[page.page_number].xref\n        return self.get_xobj_available_fonts(page_xref_id, pdf)\n\n    def get_xobj_available_fonts(self, page_xref_id, pdf):\n        try:\n            resources_type, r_id = pdf.xref_get_key(page_xref_id, \"Resources\")\n            if resources_type == \"xref\":\n                resource_xref_id = re.search(\"(\\\\d+) 0 R\", r_id).group(1)\n                r_id = pdf.xref_object(int(resource_xref_id))\n                resources_type = \"dict\"\n            if resources_type == \"dict\":\n                xref_id = re.search(\"/Font (\\\\d+) 0 R\", r_id)\n                if xref_id is not None:\n                    xref_id = xref_id.group(1)\n                    font_dict = pdf.xref_object(int(xref_id))\n                else:\n                    search = re.search(\"/Font *<<(.+?)>>\", r_id.replace(\"\\n\", \" \"))\n                    if search is None:\n                        # Have resources but no fonts\n                        return set()\n                    font_dict = search.group(1)\n  
          else:\n                r_id = int(r_id.split(\" \")[0])\n                _, font_dict = pdf.xref_get_key(r_id, \"Font\")\n            fonts = re.findall(\"/([^ ]+?) \", font_dict)\n            return set(fonts)\n        except Exception:\n            return set()\n\n    def _render_rectangle(\n        self,\n        draw_op: BitStream,\n        rectangle: il_version_1.PdfRectangle,\n        line_width: float = 0.4,\n    ):\n        \"\"\"Draw a rectangle in PDF for visualization purposes.\n\n        Args:\n            draw_op: BitStream to append PDF drawing operations\n            rectangle: Rectangle object containing position information\n            line_width: Line width\n        \"\"\"\n        x1 = rectangle.box.x\n        y1 = rectangle.box.y\n        x2 = rectangle.box.x2\n        y2 = rectangle.box.y2\n        width = x2 - x1\n        height = y2 - y1\n        # Save graphics state\n        draw_op.append(b\"q \")\n\n        # Set green color for debug visibility\n        draw_op.append(\n            rectangle.graphic_state.passthrough_per_char_instruction.encode(),\n        )  # Green stroke\n        if rectangle.line_width is not None:\n            line_width = rectangle.line_width\n        if line_width > 0:\n            draw_op.append(f\" {line_width:.6f} w \".encode())  # Line width\n        draw_op.append(f\"{x1:.6f} {y1:.6f} {width:.6f} {height:.6f} re \".encode())\n        if rectangle.fill_background:\n            draw_op.append(b\" f \")\n        else:\n            draw_op.append(b\" S \")\n\n        # Restore graphics state\n        draw_op.append(b\" n Q\\n\")\n\n    def create_side_by_side_dual_pdf(\n        self,\n        original_pdf: pymupdf.Document,\n        translated_pdf: pymupdf.Document,\n        dual_out_path: str,\n        translation_config: TranslationConfig,\n    ) -> pymupdf.Document:\n        \"\"\"Create a dual PDF with side-by-side pages (original and translation).\n\n        Args:\n            original_pdf: 
Original PDF document\n            translated_pdf: Translated PDF document\n            dual_out_path: Output path for the dual PDF\n            translation_config: Translation configuration\n\n        Returns:\n            The created dual PDF document\n        \"\"\"\n        # Create a new PDF for side-by-side pages\n        dual = pymupdf.open()\n        page_count = min(original_pdf.page_count, translated_pdf.page_count)\n\n        for page_id in range(page_count):\n            # Get pages from both PDFs\n            orig_page = original_pdf[page_id]\n            trans_page = translated_pdf[page_id]\n            rotate_angle = orig_page.rotation\n            total_width = orig_page.rect.width + trans_page.rect.width\n            max_height = max(orig_page.rect.height, trans_page.rect.height)\n            left_width = (\n                orig_page.rect.width\n                if not translation_config.dual_translate_first\n                else trans_page.rect.width\n            )\n\n            orig_page.set_rotation(0)\n            trans_page.set_rotation(0)\n\n            # Create new page with combined width\n            dual_page = dual.new_page(width=total_width, height=max_height)\n\n            # Define rectangles for left and right sides\n            rect_left = pymupdf.Rect(0, 0, left_width, max_height)\n            rect_right = pymupdf.Rect(left_width, 0, total_width, max_height)\n\n            # Show pages according to dual_translate_first setting\n            if translation_config.dual_translate_first:\n                # Show translated page on left and original on right\n                rect_left, rect_right = rect_right, rect_left\n            try:\n                # Show original page on left and translated on right (default)\n                dual_page.show_pdf_page(\n                    rect_left,\n                    original_pdf,\n                    page_id,\n                    keep_proportion=True,\n                    rotate=-rotate_angle,\n 
               )\n            except Exception as e:\n                logger.warning(\n                    f\"Failed to show the original page in the dual PDF. \"\n                    f\"Page ID: {page_id}. \"\n                    f\"Original PDF: {self.original_pdf_path}. \"\n                    f\"Translated PDF: {translation_config.input_file}. \",\n                    exc_info=e,\n                )\n            try:\n                dual_page.show_pdf_page(\n                    rect_right,\n                    translated_pdf,\n                    page_id,\n                    keep_proportion=True,\n                    rotate=-rotate_angle,\n                )\n            except Exception as e:\n                logger.warning(\n                    f\"Failed to show the translated page in the dual PDF. \"\n                    f\"Page ID: {page_id}. \"\n                    f\"Original PDF: {self.original_pdf_path}. \"\n                    f\"Translated PDF: {translation_config.input_file}. 
\",\n                    exc_info=e,\n                )\n        return dual\n\n    def create_alternating_pages_dual_pdf(\n        self,\n        original_pdf: pymupdf.Document,\n        translated_pdf: pymupdf.Document,\n        translation_config: TranslationConfig,\n    ) -> pymupdf.Document:\n        \"\"\"Create a dual PDF with alternating pages (original and translation).\n\n        Args:\n            original_pdf: Original PDF document (modified in place)\n            translated_pdf: Translated PDF document\n            translation_config: Translation configuration\n\n        Returns:\n            The created dual PDF document\n        \"\"\"\n        # Reuse the original PDF and append the translated pages to it\n        dual = original_pdf\n        dual.insert_file(translated_pdf)\n\n        # Rearrange pages to alternate between original and translated\n        page_count = translated_pdf.page_count\n        for page_id in range(page_count):\n            if translation_config.dual_translate_first:\n                dual.move_page(page_count + page_id, page_id * 2)\n            else:\n                dual.move_page(page_count + page_id, page_id * 2 + 1)\n\n        return dual\n\n    def write_debug_info(\n        self,\n        pdf: pymupdf.Document,\n        translation_config: TranslationConfig,\n    ):\n        self.font_mapper.add_font(pdf, self.docs)\n\n        for page in self.docs.page:\n            _, r_id = pdf.xref_get_key(pdf[page.page_number].xref, \"Contents\")\n            resource_xref_id = re.search(\"(\\\\d+) 0 R\", r_id).group(1)\n            base_op = pdf.xref_stream(int(resource_xref_id))\n            translation_config.raise_if_cancelled()\n            xobj_available_fonts = {}\n            xobj_draw_ops = {}\n            xobj_encoding_length_map = {}\n            available_font_list = self.get_available_font_list(pdf, page)\n\n            page_encoding_length_map = {\n                f.font_id: f.encoding_length for f in page.pdf_font\n            }\n         
   page_op = BitStream()\n            # q {ops_base}Q 1 0 0 1 {x0} {y0} cm {ops_new}\n            page_op.append(b\"q \")\n            if base_op is not None:\n                page_op.append(base_op)\n            page_op.append(b\" Q \")\n            page_op.append(\n                f\"q Q 1 0 0 1 {page.cropbox.box.x:.6f} {page.cropbox.box.y:.6f} cm \\n\".encode(),\n            )\n            # Collect all characters\n            chars = []\n            # First add page-level characters\n            if page.pdf_character:\n                chars.extend(page.pdf_character)\n            # Then add characters from paragraphs\n            for paragraph in page.pdf_paragraph:\n                chars.extend(self.render_paragraph_to_char(paragraph))\n\n            # Render all characters\n            for char in chars:\n                if not getattr(char, \"debug_info\", False):\n                    continue\n                if char.char_unicode == \"\\n\":\n                    continue\n                if char.pdf_character_id is None:\n                    # dummy char\n                    continue\n                char_size = char.pdf_style.font_size\n                font_id = char.pdf_style.font_id\n\n                if font_id not in available_font_list:\n                    continue\n                draw_op = page_op\n                encoding_length_map = page_encoding_length_map\n\n                draw_op.append(b\"q \")\n                self.render_graphic_state(draw_op, char.pdf_style.graphic_state)\n                if char.vertical:\n                    draw_op.append(\n                        f\"BT /{font_id} {char_size:f} Tf 0 1 -1 0 {char.box.x2:f} {char.box.y:f} Tm \".encode(),\n                    )\n                else:\n                    draw_op.append(\n                        f\"BT /{font_id} {char_size:f} Tf 1 0 0 1 {char.box.x:f} {char.box.y:f} Tm \".encode(),\n                    )\n\n                encoding_length = encoding_length_map[font_id]\n                # pdf32000-2008 page14:\n                # As hexadecimal 
data enclosed in angle brackets < >\n                # see 7.3.4.3, \"Hexadecimal Strings.\"\n                draw_op.append(\n                    f\"<{char.pdf_character_id:0{encoding_length * 2}x}>\".upper().encode(),\n                )\n\n                draw_op.append(b\" Tj ET Q \\n\")\n            for rect in page.pdf_rectangle:\n                if not rect.debug_info:\n                    continue\n                self._render_rectangle(page_op, rect)\n            draw_op = page_op\n            # Since this is a draw instruction container,\n            # no additional information is needed\n            pdf.update_stream(int(resource_xref_id), draw_op.tobytes())\n        translation_config.raise_if_cancelled()\n\n        # Subset fonts in a subprocess\n        if not translation_config.skip_clean:\n            pdf = self.subset_fonts_in_subprocess(pdf, translation_config, tag=\"debug\")\n        return pdf\n\n    @staticmethod\n    def subset_fonts_in_subprocess(\n        pdf: pymupdf.Document, translation_config: TranslationConfig, tag: str\n    ) -> pymupdf.Document:\n        \"\"\"Run font subsetting in a subprocess with timeout.\n\n        Args:\n            pdf: The PDF document object\n            translation_config: Translation configuration\n            tag: Suffix used to name the temporary input and output files\n\n        Returns:\n            The PDF document with subsetted fonts, or the original document if subsetting failed or timed out\n        \"\"\"\n        original_pdf = pdf\n        # Create temporary file paths\n        temp_input = str(\n            translation_config.get_working_file_path(f\"temp_subset_input_{tag}.pdf\")\n        )\n        temp_output = str(\n            translation_config.get_working_file_path(f\"temp_subset_output_{tag}.pdf\")\n        )\n\n        # Save PDF to temporary file without subsetting\n        pdf.save(temp_input)\n\n        # Create and start subprocess\n        process = Process(target=_subset_fonts_process, args=(temp_input, temp_output))\n        process.start()\n\n        # Wait for subprocess with 
timeout (1 minute)\n        timeout = 60  # 1 minute in seconds\n        start_time = time.time()\n\n        while process.is_alive():\n            if time.time() - start_time > timeout:\n                logger.warning(\n                    f\"Font subsetting timeout after {timeout} seconds, terminating subprocess\"\n                )\n                process.terminate()\n                try:\n                    process.join(5)  # Give it 5 seconds to clean up\n                    if process.is_alive():\n                        logger.warning(\"Subprocess did not terminate, killing it\")\n                        process.kill()\n                        process.terminate()\n                        process.kill()\n                        process.terminate()\n                        process.kill()\n                        process.terminate()\n                except Exception as e:\n                    logger.error(f\"Error terminating font subsetting process: {e}\")\n\n                return original_pdf\n\n            time.sleep(0.5)  # Check every half second\n\n        # Process completed, check exit code\n        exit_code = process.exitcode\n        success = exit_code == 0\n\n        # Check if subsetting was successful\n        if (\n            success\n            and Path(temp_output).exists()\n            and Path(temp_output).stat().st_size > 0\n        ):\n            logger.info(\"Font subsetting completed successfully\")\n            return pymupdf.open(temp_output)\n        else:\n            logger.warning(\n                f\"Font subsetting failed with exit code {exit_code} or produced empty file\"\n            )\n            return original_pdf\n\n    @staticmethod\n    def save_pdf_with_timeout(\n        pdf: pymupdf.Document,\n        output_path: str,\n        translation_config: TranslationConfig,\n        garbage: int = 1,\n        deflate: bool = True,\n        clean: bool = True,\n        deflate_fonts: bool = True,\n        linear: bool = 
False,\n        timeout: int = 120,\n        tag: str = \"\",\n    ) -> bool:\n        \"\"\"Save a PDF document with a timeout for the clean=True operation.\n\n        Args:\n            pdf: The PDF document object\n            output_path: Path where to save the PDF\n            translation_config: Translation configuration\n            garbage: Garbage collection level (0, 1, 2, 3, 4)\n            deflate: Whether to deflate the PDF\n            clean: Whether to clean the PDF\n            deflate_fonts: Whether to deflate fonts\n            linear: Whether to linearize the PDF\n            timeout: Timeout in seconds (default: 2 minutes)\n\n        Returns:\n            True if saved with clean=True successfully, False if fallback to clean=False was used\n        \"\"\"\n        # Create temporary file paths\n        temp_input = str(\n            translation_config.get_working_file_path(f\"temp_save_input_{tag}.pdf\")\n        )\n        temp_output = str(\n            translation_config.get_working_file_path(f\"temp_save_output_{tag}.pdf\")\n        )\n\n        # Save PDF to temporary file first\n        pdf.save(temp_input)\n\n        # Try to save with clean=True in a subprocess\n        process = Process(\n            target=_save_pdf_clean_process,\n            args=(\n                temp_input,\n                temp_output,\n                garbage,\n                deflate,\n                clean,\n                deflate_fonts,\n                linear,\n            ),\n        )\n        process.start()\n\n        # Wait for subprocess with timeout\n        start_time = time.time()\n\n        while process.is_alive():\n            if time.time() - start_time > timeout:\n                logger.warning(\n                    f\"PDF save with clean={clean} timeout after {timeout} seconds, terminating subprocess\"\n                )\n                process.terminate()\n                try:\n                    process.join(5)  # Give it 5 seconds to 
clean up\n                    if process.is_alive():\n                        logger.warning(\"Subprocess did not terminate, killing it\")\n                        process.kill()\n                        process.terminate()\n                        process.kill()\n                        process.terminate()\n                        process.kill()\n                        process.terminate()\n                except Exception as e:\n                    logger.error(f\"Error terminating PDF save process: {e}\")\n\n                # Fallback to save without clean parameter\n                logger.info(\"Falling back to save with clean=False\")\n                try:\n                    pdf.save(\n                        output_path,\n                        garbage=garbage,\n                        deflate=deflate,\n                        clean=False,\n                        deflate_fonts=deflate_fonts,\n                        linear=linear,\n                    )\n                    return False\n                except Exception as e:\n                    logger.error(f\"Error in fallback save: {e}\")\n                    # Last resort: basic save\n                    pdf.save(output_path)\n                    return False\n\n            time.sleep(0.5)  # Check every half second\n\n        # Process completed, check exit code\n        exit_code = process.exitcode\n        success = exit_code == 0\n\n        # Check if save was successful\n        if (\n            success\n            and Path(temp_output).exists()\n            and Path(temp_output).stat().st_size > 0\n        ):\n            logger.info(f\"PDF save with clean={clean} completed successfully\")\n            # Copy the successfully created file to the target path\n            try:\n                import shutil\n\n                shutil.copy2(temp_output, output_path)\n                return True\n            except Exception as e:\n                logger.error(f\"Error copying saved PDF: {e}\")\n  
              pdf.save(output_path)  # Fallback to direct save\n                return False\n            finally:\n                Path(temp_input).unlink()\n                Path(temp_output).unlink()\n        else:\n            logger.warning(\n                f\"PDF save with clean={clean} failed with exit code {exit_code} or produced empty file\"\n            )\n            # Fallback to save without clean parameter\n            try:\n                pdf.save(\n                    output_path,\n                    garbage=garbage,\n                    deflate=deflate,\n                    clean=False,\n                    deflate_fonts=deflate_fonts,\n                    linear=linear,\n                )\n            except Exception as e:\n                logger.error(f\"Error in fallback save: {e}\")\n                # Last resort: basic save\n                pdf.save(output_path)\n\n            return False\n\n    def restore_media_box(self, doc: pymupdf.Document, mediabox_data: dict) -> None:\n        for xref, page_box_data in mediabox_data.items():\n            for name, box in page_box_data.items():\n                try:\n                    doc.xref_set_key(xref, name, box)\n                except Exception:\n                    logger.debug(f\"Error restoring media box {name} from PDF\")\n\n    def write(\n        self,\n        translation_config: TranslationConfig,\n        check_font_exists: bool = False,\n    ) -> TranslateResult:\n        try:\n            basename = Path(translation_config.input_file).stem\n            debug_suffix = \".debug\" if translation_config.debug else \"\"\n            if (\n                translation_config.watermark_output_mode\n                != WatermarkOutputMode.Watermarked\n            ):\n                debug_suffix += \".no_watermark\"\n            mono_out_path = translation_config.get_output_file_path(\n                f\"{basename}{debug_suffix}.{translation_config.lang_out}.mono.pdf\",\n            )\n    
        pdf = pymupdf.open(self.original_pdf_path)\n            self.font_mapper.add_font(pdf, self.docs)\n            with self.translation_config.progress_monitor.stage_start(\n                self.stage_name,\n                len(self.docs.page),\n            ) as pbar:\n                for page in self.docs.page:\n                    self.update_page_content_stream(\n                        check_font_exists, page, pdf, translation_config\n                    )\n                    pbar.advance()\n            translation_config.raise_if_cancelled()\n            gc_level = 1\n            if self.translation_config.ocr_workaround:\n                gc_level = 4\n            with self.translation_config.progress_monitor.stage_start(\n                SUBSET_FONT_STAGE_NAME,\n                1,\n            ) as pbar:\n                if not translation_config.skip_clean:\n                    pdf = self.subset_fonts_in_subprocess(\n                        pdf, translation_config, tag=\"mono\"\n                    )\n\n                pbar.advance()\n            try:\n                self.restore_media_box(pdf, self.mediabox_data)\n            except Exception:\n                logger.exception(\"restore media box failed\")\n\n            if translation_config.only_include_translated_page:\n                total_page = set(range(0, len(pdf)))\n\n                pages_to_translate = {\n                    page.page_number\n                    for page in self.docs.page\n                    if self.translation_config.should_translate_page(\n                        page.page_number + 1\n                    )\n                }\n\n                should_removed_page = list(total_page - pages_to_translate)\n\n                pdf.delete_pages(should_removed_page)\n\n            with self.translation_config.progress_monitor.stage_start(\n                SAVE_PDF_STAGE_NAME,\n                2,\n            ) as pbar:\n                if not translation_config.no_mono:\n      
              if translation_config.debug:\n                        translation_config.raise_if_cancelled()\n                        pdf.save(\n                            f\"{mono_out_path}.decompressed.pdf\",\n                            expand=True,\n                            pretty=True,\n                        )\n                    translation_config.raise_if_cancelled()\n                    self.save_pdf_with_timeout(\n                        pdf,\n                        mono_out_path,\n                        translation_config,\n                        garbage=gc_level,\n                        deflate=True,\n                        clean=not translation_config.skip_clean,\n                        deflate_fonts=True,\n                        linear=False,\n                        tag=\"mono\",\n                    )\n                pbar.advance()\n                dual_out_path = None\n                if not translation_config.no_dual:\n                    dual_out_path = translation_config.get_output_file_path(\n                        f\"{basename}{debug_suffix}.{translation_config.lang_out}.dual.pdf\",\n                    )\n                    translation_config.raise_if_cancelled()\n                    original_pdf = pymupdf.open(self.original_pdf_path)\n\n                    if translation_config.debug:\n                        translation_config.raise_if_cancelled()\n                        try:\n                            original_pdf = self.write_debug_info(\n                                original_pdf, translation_config\n                            )\n                        except Exception:\n                            logger.warning(\n                                \"Failed to write debug info to dual PDF\",\n                                exc_info=True,\n                            )\n\n                    if (\n                        self.translation_config.only_include_translated_page\n                        and 
should_removed_page\n                    ):\n                        original_pdf.delete_pages(should_removed_page)\n                    translated_pdf = pdf\n\n                    # Choose between alternating pages and side-by-side format\n                    # Default to side-by-side if not specified\n                    use_alternating_pages = (\n                        translation_config.use_alternating_pages_dual\n                    )\n\n                    if use_alternating_pages:\n                        # Create a dual PDF with alternating pages (original and translation)\n                        dual = self.create_alternating_pages_dual_pdf(\n                            original_pdf,\n                            translated_pdf,\n                            translation_config,\n                        )\n                    else:\n                        # Create a dual PDF with side-by-side pages (original and translation)\n                        dual = self.create_side_by_side_dual_pdf(\n                            original_pdf,\n                            translated_pdf,\n                            dual_out_path,\n                            translation_config,\n                        )\n\n                    self.save_pdf_with_timeout(\n                        dual,\n                        dual_out_path,\n                        translation_config,\n                        garbage=gc_level,\n                        deflate=True,\n                        clean=not translation_config.skip_clean,\n                        deflate_fonts=True,\n                        linear=False,\n                        tag=\"dual\",\n                    )\n                    if translation_config.debug:\n                        translation_config.raise_if_cancelled()\n                        dual.save(\n                            f\"{dual_out_path}.decompressed.pdf\",\n                            expand=True,\n                            pretty=True,\n            
            )\n                pbar.advance()\n            if self.translation_config.no_mono:\n                mono_out_path = None\n            if self.translation_config.no_dual:\n                dual_out_path = None\n            auto_extracted_glossary_path = None\n            if (\n                self.translation_config.save_auto_extracted_glossary\n                and self.translation_config.shared_context_cross_split_part.auto_extracted_glossary\n            ):\n                auto_extracted_glossary_path = self.translation_config.get_output_file_path(\n                    f\"{basename}{debug_suffix}.{translation_config.lang_out}.glossary.csv\"\n                )\n                with auto_extracted_glossary_path.open(\"w\", encoding=\"utf-8\") as f:\n                    logger.info(\n                        f\"save auto extracted glossary to {auto_extracted_glossary_path}\"\n                    )\n                    f.write(\n                        self.translation_config.shared_context_cross_split_part.auto_extracted_glossary.to_csv()\n                    )\n\n            return TranslateResult(\n                mono_out_path, dual_out_path, auto_extracted_glossary_path\n            )\n        except Exception:\n            logger.exception(\n                \"Failed to create PDF: %s\",\n                translation_config.input_file,\n            )\n            if not check_font_exists:\n                return self.write(translation_config, True)\n            raise\n\n    def update_page_content_stream(\n        self, check_font_exists, page, pdf, translation_config, skip_char: bool = False\n    ):\n        assert page.cropbox is not None and page.cropbox.box is not None\n        page_crop_box = page.cropbox.box\n        ctm_for_ops = (\n            1,\n            0,\n            0,\n            1,\n            -page_crop_box.x,\n            -page_crop_box.y,\n        )\n        ctm_for_ops = f\" {' '.join(f'{x:f}' for x in ctm_for_ops)} cm 
\".encode()\n        translation_config.raise_if_cancelled()\n        xobj_available_fonts = {}\n        xobj_draw_ops = {}\n        xobj_encoding_length_map = {}\n        available_font_list = self.get_available_font_list(pdf, page)\n        page_encoding_length_map: dict[str | None, int | None] = {\n            f.font_id: f.encoding_length for f in page.pdf_font\n        }\n        all_encoding_length_map = page_encoding_length_map.copy()\n        for xobj in page.pdf_xobject:\n            xobj_available_fonts[xobj.xobj_id] = available_font_list.copy()\n            try:\n                xobj_available_fonts[xobj.xobj_id].update(\n                    self.get_xobj_available_fonts(xobj.xref_id, pdf),\n                )\n            except Exception:\n                pass\n            xobj_encoding_length_map[xobj.xobj_id] = {\n                f.font_id: f.encoding_length for f in xobj.pdf_font\n            }\n            all_encoding_length_map.update(xobj_encoding_length_map[xobj.xobj_id])\n            xobj_encoding_length_map[xobj.xobj_id].update(page_encoding_length_map)\n            xobj_op = BitStream()\n            base_op = xobj.base_operations.value\n            base_op = zstd_decompress(base_op)\n            xobj_op.append(base_op.encode())\n            xobj_draw_ops[xobj.xobj_id] = xobj_op\n        page_op = BitStream()\n        # q {ops_base}Q 1 0 0 1 {x0} {y0} cm {ops_new}\n        # page_op.append(b\"q \")\n        # base_op = page.base_operations.value\n        # base_op = zstd_decompress(base_op)\n        # page_op.append(base_op.encode())\n        # page_op.append(b\" \\n\")\n        page_op.append(ctm_for_ops)\n        page_op.append(b\" \\n\")\n        # Create render context\n        context = RenderContext(\n            pdf_creator=self,\n            page=page,\n            available_font_list=available_font_list,\n            page_encoding_length_map=page_encoding_length_map,\n            all_encoding_length_map=all_encoding_length_map,\n       
     xobj_available_fonts=xobj_available_fonts,\n            xobj_encoding_length_map=xobj_encoding_length_map,\n            ctm_for_ops=ctm_for_ops,\n            check_font_exists=check_font_exists,\n        )\n        # Create render units for all renderable objects\n        render_units = self.create_render_units_for_page(page, translation_config)\n        if skip_char:\n            render_units = [\n                unit\n                for unit in render_units\n                if not isinstance(unit, CharacterRenderUnit)\n            ]\n        # Render all units to their appropriate streams\n        self.render_units_to_stream(render_units, context, page_op, xobj_draw_ops)\n        # Update xobject streams\n        for xobj in page.pdf_xobject:\n            draw_op = xobj_draw_ops[xobj.xobj_id]\n            try:\n                pdf.update_stream(xobj.xref_id, draw_op.tobytes())\n            except Exception:\n                logger.warning(f\"update xref {xobj.xref_id} stream fail, continue\")\n        draw_op = page_op\n        op_container = pdf.get_new_xref()\n        # Since this is a draw instruction container,\n        # no additional information is needed\n        pdf.update_object(op_container, \"<<>>\")\n        pdf.update_stream(op_container, draw_op.tobytes())\n        pdf[page.page_number].set_contents(op_container)\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/frontend/__init__.py",
    "content": ""
  },
  {
    "path": "babeldoc/format/pdf/document_il/frontend/il_creater.py",
    "content": "import base64\nimport functools\nimport logging\nimport math\nimport re\nimport unicodedata\nfrom io import BytesIO\nfrom itertools import islice\nfrom typing import Literal\n\nimport freetype\nimport pymupdf\nimport tiktoken\n\nimport babeldoc.pdfminer.pdfinterp\nfrom babeldoc.format.pdf.babelpdf.base14 import get_base14_bbox\nfrom babeldoc.format.pdf.babelpdf.cidfont import get_cidfont_bbox\nfrom babeldoc.format.pdf.babelpdf.cidfont import get_glyph_bbox\nfrom babeldoc.format.pdf.babelpdf.encoding import WinAnsiEncoding\nfrom babeldoc.format.pdf.babelpdf.encoding import get_type1_encoding\nfrom babeldoc.format.pdf.babelpdf.type3 import get_type3_bbox\nfrom babeldoc.format.pdf.babelpdf.utils import guarded_bbox\nfrom babeldoc.format.pdf.document_il import il_version_1\nfrom babeldoc.format.pdf.document_il.utils import zstd_helper\nfrom babeldoc.format.pdf.document_il.utils.fontmap import FontMapper\nfrom babeldoc.format.pdf.document_il.utils.matrix_helper import decompose_ctm\nfrom babeldoc.format.pdf.document_il.utils.style_helper import BLACK\nfrom babeldoc.format.pdf.document_il.utils.style_helper import YELLOW\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\nfrom babeldoc.pdfminer.layout import LTChar\nfrom babeldoc.pdfminer.layout import LTFigure\nfrom babeldoc.pdfminer.pdffont import PDFCIDFont\nfrom babeldoc.pdfminer.pdffont import PDFFont\n\n# from babeldoc.pdfminer.pdfpage import PDFPage as PDFMinerPDFPage\n# from babeldoc.pdfminer.pdftypes import PDFObjRef as PDFMinerPDFObjRef\n# from babeldoc.pdfminer.pdftypes import resolve1 as pdftypes_resolve1\nfrom babeldoc.pdfminer.psparser import PSLiteral\nfrom babeldoc.pdfminer.utils import apply_matrix_pt\nfrom babeldoc.pdfminer.utils import get_bound\nfrom babeldoc.pdfminer.utils import mult_matrix\n\n\ndef invert_matrix(\n    ctm: tuple[float, float, float, float, float, float],\n) -> tuple[float, float, float, float, float, float]:\n    \"\"\"\n    Calculate the inverse 
of a 2D transformation matrix.\n    Matrix format: (a, b, c, d, e, f) representing:\n    [a c e]\n    [b d f]\n    [0 0 1]\n    \"\"\"\n    a, b, c, d, e, f = ctm\n\n    # Calculate determinant\n    det = a * d - b * c\n\n    if abs(det) < 1e-10:\n        # Matrix is singular, return identity matrix\n        return (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)\n\n    # Calculate inverse matrix elements\n    inv_a = d / det\n    inv_b = -b / det\n    inv_c = -c / det\n    inv_d = a / det\n    inv_e = (c * f - d * e) / det\n    inv_f = (b * e - a * f) / det\n\n    return (inv_a, inv_b, inv_c, inv_d, inv_e, inv_f)\n\n\ndef batched(iterable, n, *, strict=False):\n    # batched('ABCDEFG', 3) → ABC DEF G\n    if n < 1:\n        raise ValueError(\"n must be at least one\")\n    iterator = iter(iterable)\n    while batch := tuple(islice(iterator, n)):\n        if strict and len(batch) != n:\n            raise ValueError(\"batched(): incomplete batch\")\n        yield batch\n\n\nlogger = logging.getLogger(__name__)\n\n#\n# def create_hook(func, hook):\n#     @wraps(func)\n#     def wrapper(*args, **kwargs):\n#         hook(*args, **kwargs)\n#         return func(*args, **kwargs)\n#\n#     return wrapper\n#\n#\n# def hook_pdfminer_pdf_page_init(*args):\n#     attrs = args[3]\n#     try:\n#         while isinstance(attrs[\"MediaBox\"], PDFMinerPDFObjRef):\n#             attrs[\"MediaBox\"] = pdftypes_resolve1(attrs[\"MediaBox\"])\n#     except Exception:\n#         logger.exception(f\"try to fix mediabox failed: {attrs}\")\n#\n#\n# PDFMinerPDFPage.__init__ = create_hook(\n#     PDFMinerPDFPage.__init__, hook_pdfminer_pdf_page_init\n# )\n\n\ndef indirect(obj):\n    if isinstance(obj, tuple) and obj[0] == \"xref\":\n        return int(obj[1].split(\" \")[0])\n\n\ndef get_char_cbox(face, idx):\n    g = face.get_char_index(idx)\n    return get_glyph_bbox(face, g)\n\n\ndef get_name_cbox(face, name):\n    if name:\n        if isinstance(name, str):\n            name = name.encode(\"utf-8\")\n     
   g = face.get_name_index(name)\n        return get_glyph_bbox(face, g)\n    return (0, 0, 0, 0)\n\n\ndef font_encoding_lookup(doc, idx, key):\n    obj = doc.xref_get_key(idx, key)\n    if obj[0] == \"name\":\n        enc_name = obj[1][1:]\n        if enc_vector := get_type1_encoding(enc_name):\n            return enc_name, enc_vector\n\n\ndef parse_font_encoding(doc, idx):\n    if encoding := font_encoding_lookup(doc, idx, \"Encoding/BaseEncoding\"):\n        return encoding\n    if encoding := font_encoding_lookup(doc, idx, \"Encoding\"):\n        return encoding\n    return (\"Custom\", get_type1_encoding(\"StandardEncoding\"))\n\n\ndef get_truetype_ansi_bbox_list(face):\n    scale = 1000 / face.units_per_EM\n    bbox_list = [get_char_cbox(face, code) for code in WinAnsiEncoding]\n    bbox_list = [[v * scale for v in bbox] for bbox in bbox_list]\n    return bbox_list\n\n\ndef collect_face_cmap(face):\n    umap = []  # unicode maps\n    lmap = []  # legacy maps\n    for cmap in face.charmaps:\n        if cmap.encoding_name == \"FT_ENCODING_UNICODE\":\n            umap.append(cmap)\n        else:\n            lmap.append(cmap)\n    return umap, lmap\n\n\ndef get_truetype_custom_bbox_list(face):\n    umap, lmap = collect_face_cmap(face)\n    if umap:\n        face.set_charmap(umap[0])\n    elif lmap:\n        face.set_charmap(lmap[0])\n    else:\n        return []\n    scale = 1000 / face.units_per_EM\n    bbox_list = [get_char_cbox(face, code) for code in range(256)]\n    bbox_list = [[v * scale for v in bbox] for bbox in bbox_list]\n    return bbox_list\n\n\ndef parse_font_file(doc, idx, encoding, differences):\n    bbox_list = []\n    data = doc.xref_stream(idx)\n    face = freetype.Face(BytesIO(data))\n    if face.get_format() == b\"TrueType\":\n        if encoding[0] == \"WinAnsiEncoding\":\n            return get_truetype_ansi_bbox_list(face)\n        elif encoding[0] == \"Custom\":\n            return get_truetype_custom_bbox_list(face)\n    glyph_name_set 
= set()\n    for x in range(0, face.num_glyphs):\n        glyph_name_set.add(face.get_glyph_name(x).decode(\"U8\"))\n    scale = 1000 / face.units_per_EM\n    enc_name, enc_vector = encoding\n    _, lmap = collect_face_cmap(face)\n    abbr = enc_name.removesuffix(\"Encoding\")\n    if lmap and abbr in [\"Custom\", \"MacRoman\", \"Standard\", \"WinAnsi\", \"MacExpert\"]:\n        face.set_charmap(lmap[0])\n    for i, x in enumerate(enc_vector):\n        if x in glyph_name_set:\n            v = get_name_cbox(face, x.encode(\"U8\"))\n        else:\n            v = get_char_cbox(face, i)\n        bbox_list.append(v)\n    if differences:\n        for code, name in differences:\n            bbox_list[code] = get_name_cbox(face, name.encode(\"U8\"))\n    norm_bbox_list = [[v * scale for v in box] for box in bbox_list]\n    return norm_bbox_list\n\n\ndef parse_encoding(obj_str):\n    delta = []\n    current = 0\n    for x in re.finditer(\n        r\"(?P<p>[\\[\\]])|(?P<c>\\d+)|(?P<n>/[^\\s/\\[\\]()<>]+)|(?P<s>.)\", obj_str\n    ):\n        key = x.lastgroup\n        val = x.group()\n        if key == \"c\":\n            current = int(val)\n        if key == \"n\":\n            delta.append((current, val[1:]))\n            current += 1\n    return delta\n\n\ndef parse_mapping(text):\n    mapping = []\n    for x in re.finditer(r\"<(?P<num>[a-fA-F0-9]+)>\", text):\n        mapping.append(x.group(\"num\"))\n    return mapping\n\n\ndef update_cmap_pair(cmap, data):\n    for start_str, stop_str, value_str in batched(data, 3):\n        start = int(start_str, 16)\n        stop = int(stop_str, 16)\n        try:\n            value = base64.b16decode(value_str, True).decode(\"UTF-16-BE\")\n            for code in range(start, stop + 1):\n                cmap[code] = value\n        except Exception:\n            pass  # to skip surrogate pairs (D800-DFFF)\n\n\ndef update_cmap_code(cmap, data):\n    for code_str, value_str in batched(data, 2):\n        code = int(code_str, 16)\n        
try:\n            value = base64.b16decode(value_str, True).decode(\"UTF-16-BE\")\n            cmap[code] = value\n        except Exception:\n            pass  # to skip surrogate pairs (D800-DFFF)\n\n\ndef parse_cmap(cmap_str):\n    cmap = {}\n    for x in re.finditer(\n        r\"\\s+beginbfrange\\s*(?P<r>(<[0-9a-fA-F]+>\\s*)+)endbfrange\\s+\", cmap_str\n    ):\n        update_cmap_pair(cmap, parse_mapping(x.group(\"r\")))\n    for x in re.finditer(\n        r\"\\s+beginbfchar\\s*(?P<c>(<[0-9a-fA-F]+>\\s*)+)endbfchar\", cmap_str\n    ):\n        update_cmap_code(cmap, parse_mapping(x.group(\"c\")))\n    return cmap\n\n\ndef get_code(cmap, c):\n    for k, v in cmap.items():\n        if v == c:\n            return k\n    return -1\n\n\ndef get_bbox(bbox, size, c, x, y):\n    x_min, y_min, x_max, y_max = bbox[c]\n    factor = 1 / 1000 * size\n    x_min = x_min * factor\n    y_min = -y_min * factor\n    x_max = x_max * factor\n    y_max = -y_max * factor\n    ll = (x + x_min, y + y_min)\n    lr = (x + x_max, y + y_min)\n    ul = (x + x_min, y + y_max)\n    ur = (x + x_max, y + y_max)\n    return pymupdf.Quad(ll, lr, ul, ur)\n\n\n# Code points of common Unicode space characters\nunicode_spaces = [\n    \"\\u0020\",  # Space\n    \"\\u00a0\",  # No-break space\n    \"\\u1680\",  # Ogham space mark\n    \"\\u2000\",  # En quad\n    \"\\u2001\",  # Em quad\n    \"\\u2002\",  # En space\n    \"\\u2003\",  # Em space\n    \"\\u2004\",  # Three-per-em space\n    \"\\u2005\",  # Four-per-em space\n    \"\\u2006\",  # Six-per-em space\n    \"\\u2007\",  # Figure space\n    \"\\u2008\",  # Punctuation space\n    \"\\u2009\",  # Thin space\n    \"\\u200a\",  # Hair space\n    \"\\u202f\",  # Narrow no-break space\n    \"\\u205f\",  # Medium mathematical space\n    \"\\u3000\",  # Ideographic (full-width) space\n    \"\\u200b\",  # Zero-width space\n    \"\\u2060\",  # Word joiner (zero-width no-break)\n    \"\\t\",  # Horizontal tab\n]\n\n# Build the regular expression\npattern = \"^[\" + \"\".join(unicode_spaces) + \"]+$\"\n\n# Compile the regex\nspace_regex = re.compile(pattern)\n\n\ndef get_rotation_angle(matrix):\n    \"\"\"\n    Compute the rotation angle (in degrees) from a PDF character matrix.\n    matrix: tuple/list, in the form (a, b, c, d, e, 
f)\n    \"\"\"\n    a, b, c, d, e, f = matrix\n    # Rotation angle: arctan2(b, a)\n    angle_rad = math.atan2(b, a)\n    angle_deg = math.degrees(angle_rad)\n    return angle_deg\n\n\nclass ILCreater:\n    stage_name = \"Parse PDF and Create Intermediate Representation\"\n\n    def __init__(self, translation_config: TranslationConfig):\n        self.progress = None\n        self.current_page: il_version_1.Page = None\n        self.mupdf: pymupdf.Document = None\n        self.model = translation_config.doc_layout_model\n        self.docs = il_version_1.Document(page=[])\n        self.stroking_color_space_name = None\n        self.non_stroking_color_space_name = None\n        self.passthrough_per_char_instruction: list[tuple[str, str]] = []\n        self.translation_config = translation_config\n        self.passthrough_per_char_instruction_stack: list[list[tuple[str, str]]] = []\n        self.xobj_id = 0\n        self.xobj_inc = 0\n        self.xobj_map: dict[int, il_version_1.PdfXobject] = {}\n        self.xobj_stack = []\n        self.current_page_font_name_id_map = {}\n        self.current_page_font_char_bounding_box_map = {}\n        self.current_available_fonts = {}\n        self.mupdf_font_map: dict[int, pymupdf.Font] = {}\n        self.graphic_state_pool = {}\n        self.enable_graphic_element_process = (\n            translation_config.enable_graphic_element_process\n        )\n        self.render_order = 0\n        self.current_clip_paths: list[tuple] = []\n        self.clip_paths_stack: list[list[tuple]] = []\n        # For valid character collection\n        self.font_mapper = FontMapper(translation_config)\n        self.tokenizer = tiktoken.encoding_for_model(\"gpt-4o\")\n        self._page_valid_chars_buffer: list[str] | None = None\n\n    def transform_clip_path(\n        self,\n        clip_path,\n        source_ctm: tuple[float, float, float, float, float, float],\n        target_ctm: tuple[float, float, float, float, float, float],\n    ):\n        
\"\"\"Transform clip path coordinates from source CTM to target CTM.\"\"\"\n        if source_ctm == target_ctm:\n            return clip_path\n\n        # Calculate transformation matrix: inverse(target_ctm) * source_ctm\n        inv_target_ctm = invert_matrix(target_ctm)\n        transform_matrix = mult_matrix(source_ctm, inv_target_ctm)\n\n        transformed_path = []\n        for path_element in clip_path:\n            if len(path_element) == 1:\n                # Path operation without coordinates (e.g., 'h' for close path)\n                transformed_path.append(path_element)\n            else:\n                # Path operation with coordinates\n                op = path_element[0]\n                coords = path_element[1:]\n                transformed_coords = []\n\n                # Transform coordinate pairs\n                for i in range(0, len(coords), 2):\n                    if i + 1 < len(coords):\n                        x, y = coords[i], coords[i + 1]\n                        transformed_point = apply_matrix_pt(transform_matrix, (x, y))\n                        transformed_coords.extend(transformed_point)\n                    else:\n                        # Handle odd number of coordinates (shouldn't happen in well-formed paths)\n                        transformed_coords.append(coords[i])\n\n                transformed_path.append([op] + transformed_coords)\n\n        return transformed_path\n\n    def get_render_order_and_increase(self):\n        self.render_order += 1\n        return self.render_order\n\n    def get_render_order(self):\n        return self.render_order\n\n    def on_finish(self):\n        self.progress.__exit__(None, None, None)\n\n    def is_graphic_operation(self, operator: str):\n        if not self.enable_graphic_element_process:\n            return False\n\n        return re.match(\n            \"^(m|l|c|v|y|re|h|S|s|f|f*|F|B|B*|b|b*|n|Do)$\",\n            operator,\n        )\n\n    def 
is_passthrough_per_char_operation(self, operator: str):\n        return re.match(\n            \"^(sc|SC|sh|scn|SCN|g|G|rg|RG|k|K|cs|CS|gs|ri|w|J|j|M|i)$\",\n            operator,\n        )\n\n    def can_remove_old_passthrough_per_char_instruction(self, operator: str):\n        return re.match(\n            \"^(sc|SC|sh|scn|SCN|g|G|rg|RG|k|K|cs|CS|ri|w|J|j|M|i|d)$\",\n            operator,\n        )\n\n    def on_line_dash(self, dash, phase):\n        dash_str = f\"[{' '.join(f'{arg}' for arg in dash)}]\"\n        self.on_passthrough_per_char(\"d\", [dash_str, str(phase)])\n\n    def on_passthrough_per_char(self, operator: str, args: list[str]):\n        if not self.is_passthrough_per_char_operation(operator) and operator not in (\n            \"W n\",\n            \"W* n\",\n            \"d\",\n            \"W\",\n            \"W*\",\n        ):\n            logger.error(\"Unknown passthrough_per_char operation: %s\", operator)\n            return\n        # logger.debug(\"xobj_id: %d, on_passthrough_per_char: %s ( %s )\", self.xobj_id, operator, args)\n        args = [self.parse_arg(arg) for arg in args]\n        if self.can_remove_old_passthrough_per_char_instruction(operator):\n            for _i, value in enumerate(self.passthrough_per_char_instruction.copy()):\n                op, arg = value\n                if op == operator:\n                    self.passthrough_per_char_instruction.remove(value)\n                    break\n        self.passthrough_per_char_instruction.append((operator, \" \".join(args)))\n        pass\n\n    def remove_latest_passthrough_per_char_instruction(self):\n        if self.passthrough_per_char_instruction:\n            self.passthrough_per_char_instruction.pop()\n\n    def parse_arg(self, arg: str):\n        if isinstance(arg, PSLiteral):\n            return f\"/{arg.name}\"\n        elif isinstance(arg, float):\n            return f\"{arg:f}\"\n        elif not isinstance(arg, str):\n            return str(arg)\n        
return arg\n\n    def pop_passthrough_per_char_instruction(self):\n        if self.passthrough_per_char_instruction_stack:\n            self.passthrough_per_char_instruction = (\n                self.passthrough_per_char_instruction_stack.pop()\n            )\n        else:\n            self.passthrough_per_char_instruction = []\n            logging.error(\n                \"pop_passthrough_per_char_instruction error on page: %s\",\n                self.current_page.page_number,\n            )\n\n        if self.clip_paths_stack:\n            self.current_clip_paths = self.clip_paths_stack.pop()\n        else:\n            self.current_clip_paths = []\n\n    def push_passthrough_per_char_instruction(self):\n        self.passthrough_per_char_instruction_stack.append(\n            self.passthrough_per_char_instruction.copy(),\n        )\n        self.clip_paths_stack.append(self.current_clip_paths.copy())\n\n    # pdf32000 page 171\n    def on_stroking_color_space(self, color_space_name):\n        self.stroking_color_space_name = color_space_name\n\n    def on_non_stroking_color_space(self, color_space_name):\n        self.non_stroking_color_space_name = color_space_name\n\n    def on_new_stream(self):\n        self.stroking_color_space_name = None\n        self.non_stroking_color_space_name = None\n        self.passthrough_per_char_instruction = []\n        self.current_clip_paths = []\n\n    def push_xobj(self):\n        self.xobj_stack.append(\n            (\n                self.xobj_id,\n                self.current_clip_paths.copy(),\n                self.current_available_fonts.copy(),\n            ),\n        )\n        self.current_clip_paths = []\n\n    def pop_xobj(self):\n        (self.xobj_id, self.current_clip_paths, self.current_available_fonts) = (\n            self.xobj_stack.pop()\n        )\n\n    def on_xobj_begin(self, bbox, xref_id):\n        logger.debug(f\"on_xobj_begin: {bbox} @ {xref_id}\")\n        
self.push_passthrough_per_char_instruction()\n        self.push_xobj()\n        self.xobj_inc += 1\n        self.xobj_id = self.xobj_inc\n        xobject = il_version_1.PdfXobject(\n            box=il_version_1.Box(\n                x=float(bbox[0]),\n                y=float(bbox[1]),\n                x2=float(bbox[2]),\n                y2=float(bbox[3]),\n            ),\n            xobj_id=self.xobj_id,\n            xref_id=xref_id,\n            pdf_font=[],\n        )\n        self.current_page.pdf_xobject.append(xobject)\n        self.xobj_map[self.xobj_id] = xobject\n        xobject.pdf_font.extend(self.current_available_fonts.values())\n        return self.xobj_id\n\n    def on_xobj_end(self, xobj_id, base_op):\n        self.pop_passthrough_per_char_instruction()\n        self.pop_xobj()\n        xobj = self.xobj_map[xobj_id]\n        base_op = zstd_helper.zstd_compress(base_op)\n        xobj.base_operations = il_version_1.BaseOperations(value=base_op)\n        self.xobj_inc += 1\n\n    def on_page_start(self):\n        self.current_page = il_version_1.Page(\n            pdf_font=[],\n            pdf_character=[],\n            page_layout=[],\n            pdf_curve=[],\n            pdf_form=[],\n            # currently don't support UserUnit page parameter\n            # pdf32000 page 79\n            unit=\"point\",\n        )\n        self.current_page_font_name_id_map = {}\n        self.current_page_font_char_bounding_box_map = {}\n        self.passthrough_per_char_instruction_stack = []\n        self.xobj_stack = []\n        self.non_stroking_color_space_name = None\n        self.stroking_color_space_name = None\n        self.current_clip_paths = []\n        self.clip_paths_stack = []\n        self.docs.page.append(self.current_page)\n        # Prepare per-page buffer for valid characters on translated pages\n        self._page_valid_chars_buffer = []\n\n    def on_page_end(self):\n        # Accumulate this page's valid characters and tokens into shared 
context\n        try:\n            if (\n                self._page_valid_chars_buffer is not None\n                and len(self._page_valid_chars_buffer) > 0\n            ):\n                page_text = \"\".join(self._page_valid_chars_buffer)\n                char_count = len(page_text)\n                try:\n                    token_count = len(\n                        self.tokenizer.encode(page_text, disallowed_special=())\n                    )\n                except Exception as e:\n                    logger.warning(\"Failed to compute token count for page: %s\", e)\n                    token_count = 0\n                self.translation_config.shared_context_cross_split_part.add_valid_counts(\n                    char_count, token_count\n                )\n        except Exception as e:\n            logger.warning(\"Failed to accumulate page valid stats: %s\", e)\n        finally:\n            self._page_valid_chars_buffer = []\n        self.progress.advance(1)\n\n    def on_page_crop_box(\n        self,\n        x0: float | int,\n        y0: float | int,\n        x1: float | int,\n        y1: float | int,\n    ):\n        box = il_version_1.Box(x=float(x0), y=float(y0), x2=float(x1), y2=float(y1))\n        self.current_page.cropbox = il_version_1.Cropbox(box=box)\n\n    def on_page_media_box(\n        self,\n        x0: float | int,\n        y0: float | int,\n        x1: float | int,\n        y1: float | int,\n    ):\n        box = il_version_1.Box(x=float(x0), y=float(y0), x2=float(x1), y2=float(y1))\n        self.current_page.mediabox = il_version_1.Mediabox(box=box)\n\n    def on_page_number(self, page_number: int):\n        assert isinstance(page_number, int)\n        assert page_number >= 0\n        self.current_page.page_number = page_number\n\n    def on_page_base_operation(self, operation: str):\n        operation = zstd_helper.zstd_compress(operation)\n        self.current_page.base_operations = il_version_1.BaseOperations(value=operation)\n\n    
def on_page_resource_font(self, font: PDFFont, xref_id: int, font_id: str):\n        font_name = font.fontname\n        logger.debug(f\"handle font {font_name} @ {xref_id} in {self.xobj_id}\")\n        if isinstance(font_name, bytes):\n            try:\n                font_name = font_name.decode(\"utf-8\")\n            except UnicodeDecodeError:\n                font_name = \"BASE64:\" + base64.b64encode(font_name).decode(\"utf-8\")\n        encoding_length = 1\n        if isinstance(font, PDFCIDFont):\n            try:\n                # pdf 32000:2008 page 273\n                # Table 118 - Predefined CJK CMap names\n                _, encoding = self.mupdf.xref_get_key(xref_id, \"Encoding\")\n                if encoding == \"/Identity-H\" or encoding == \"/Identity-V\":\n                    encoding_length = 2\n                elif encoding == \"/WinAnsiEncoding\":\n                    encoding_length = 1\n                else:\n                    _, to_unicode_id = self.mupdf.xref_get_key(xref_id, \"ToUnicode\")\n                    if to_unicode_id is not None:\n                        to_unicode_bytes = self.mupdf.xref_stream(\n                            int(to_unicode_id.split(\" \")[0]),\n                        )\n                        code_range = re.search(\n                            b\"begincodespacerange\\n?.*<(\\\\d+?)>.*\",\n                            to_unicode_bytes,\n                        ).group(1)\n                        encoding_length = len(code_range) // 2\n            except Exception:\n                if (\n                    font.unicode_map\n                    and font.unicode_map.cid2unichr\n                    and max(font.unicode_map.cid2unichr.keys()) > 255\n                ):\n                    encoding_length = 2\n                else:\n                    encoding_length = 1\n        try:\n            if xref_id in self.mupdf_font_map:\n                mupdf_font = self.mupdf_font_map[xref_id]\n            else:\n   
             mupdf_font = pymupdf.Font(\n                    fontbuffer=self.mupdf.extract_font(xref_id)[3]\n                )\n                mupdf_font.has_glyph = functools.lru_cache(maxsize=10240, typed=True)(\n                    mupdf_font.has_glyph,\n                )\n            bold = mupdf_font.is_bold\n            italic = mupdf_font.is_italic\n            monospaced = mupdf_font.is_monospaced\n            serif = mupdf_font.is_serif\n            self.mupdf_font_map[xref_id] = mupdf_font\n        except Exception:\n            bold = None\n            italic = None\n            monospaced = None\n            serif = None\n        il_font_metadata = il_version_1.PdfFont(\n            name=font_name,\n            xref_id=xref_id,\n            font_id=font_id,\n            encoding_length=encoding_length,\n            bold=bold,\n            italic=italic,\n            monospace=monospaced,\n            serif=serif,\n            ascent=font.ascent,\n            descent=font.descent,\n            pdf_font_char_bounding_box=[],\n        )\n        try:\n            if xref_id is None:\n                logger.warning(\"xref_id is None for font %s\", font_name)\n                raise ValueError(\"xref_id is None for font %s\", font_name)\n            bbox_list, cmap = self.parse_font_xobj_id(xref_id)\n            font_char_bounding_box_map = {}\n            if not cmap:\n                cmap = {x: x for x in range(257)}\n            for char_id, char_bbox in enumerate(bbox_list):\n                font_char_bounding_box_map[char_id] = char_bbox\n            for char_id in cmap:\n                if char_id < 0 or char_id >= len(bbox_list):\n                    continue\n                bbox = bbox_list[char_id]\n                x, y, x2, y2 = bbox\n                if (\n                    x == 0\n                    and y == 0\n                    and x2 == 500\n                    and y2 == 698\n                    or x == 0\n                    and y == 0\n  
                  and x2 == 0\n                    and y2 == 0\n                ):\n                    # ignore default bounding box\n                    continue\n                il_font_metadata.pdf_font_char_bounding_box.append(\n                    il_version_1.PdfFontCharBoundingBox(\n                        x=x,\n                        y=y,\n                        x2=x2,\n                        y2=y2,\n                        char_id=char_id,\n                    )\n                )\n                font_char_bounding_box_map[char_id] = bbox\n            if self.xobj_id in self.xobj_map:\n                if self.xobj_id not in self.current_page_font_char_bounding_box_map:\n                    self.current_page_font_char_bounding_box_map[self.xobj_id] = {}\n                self.current_page_font_char_bounding_box_map[self.xobj_id][xref_id] = (\n                    font_char_bounding_box_map\n                )\n            else:\n                self.current_page_font_char_bounding_box_map[xref_id] = (\n                    font_char_bounding_box_map\n                )\n        except Exception as e:\n            if xref_id is None:\n                logger.error(\"failed to parse font xobj id None: %s\", e)\n            else:\n                logger.error(\"failed to parse font xobj id %d: %s\", xref_id, e)\n        self.current_page_font_name_id_map[xref_id] = font_id\n        self.current_available_fonts[font_id] = il_font_metadata\n\n        fonts = self.current_page.pdf_font\n        if self.xobj_id in self.xobj_map:\n            fonts = self.xobj_map[self.xobj_id].pdf_font\n        should_remove = []\n        for f in fonts:\n            if f.font_id == font_id:\n                should_remove.append(f)\n        for sr in should_remove:\n            fonts.remove(sr)\n        fonts.append(il_font_metadata)\n\n    def parse_font_xobj_id(self, xobj_id: int):\n        if xobj_id is None:\n            return [], {}\n\n        bbox_list = []\n        encoding 
= parse_font_encoding(self.mupdf, xobj_id)\n        differences = []\n        font_differences = self.mupdf.xref_get_key(xobj_id, \"Encoding/Differences\")\n        if font_differences:\n            differences = parse_encoding(font_differences[1])\n        for file_key in [\"FontFile\", \"FontFile2\", \"FontFile3\"]:\n            font_file = self.mupdf.xref_get_key(xobj_id, f\"FontDescriptor/{file_key}\")\n            if file_idx := indirect(font_file):\n                bbox_list = parse_font_file(\n                    self.mupdf,\n                    file_idx,\n                    encoding,\n                    differences,\n                )\n        cmap = {}\n        to_unicode = self.mupdf.xref_get_key(xobj_id, \"ToUnicode\")\n        if to_unicode_idx := indirect(to_unicode):\n            cmap = parse_cmap(self.mupdf.xref_stream(to_unicode_idx).decode(\"U8\"))\n        if not bbox_list:\n            obj_type, obj_val = self.mupdf.xref_get_key(xobj_id, \"BaseFont\")\n            if obj_type == \"name\":\n                bbox_list = get_base14_bbox(obj_val[1:])\n        if cid_bbox := get_cidfont_bbox(self.mupdf, xobj_id):\n            bbox_list = cid_bbox\n        if self.mupdf.xref_get_key(xobj_id, \"Subtype\")[1] == \"/Type3\":\n            bbox_list = get_type3_bbox(self.mupdf, xobj_id)\n        return bbox_list, cmap\n\n    def create_graphic_state(\n        self,\n        gs: babeldoc.pdfminer.pdfinterp.PDFGraphicState | list[tuple[str, str]],\n        include_clipping: bool = False,\n        target_ctm: tuple[float, float, float, float, float, float] = None,\n        clip_paths=None,\n    ):\n        if clip_paths is None:\n            clip_paths = self.current_clip_paths\n        passthrough_instruction = getattr(gs, \"passthrough_instruction\", gs)\n\n        def filter_clipping(op):\n            return op not in (\"W n\", \"W* n\")\n\n        def pass_all(_op):\n            return True\n\n        if include_clipping:\n            filter_clipping = 
pass_all\n\n        passthrough_per_char_instruction_parts = [\n            f\"{arg} {op}\" for op, arg in passthrough_instruction if filter_clipping(op)\n        ]\n\n        # Add transformed clipping paths if requested and target CTM is provided\n        if include_clipping and target_ctm and clip_paths:\n            for clip_path, source_ctm, evenodd in clip_paths:\n                try:\n                    # Transform clip path from source CTM to target CTM\n                    transformed_path = self.transform_clip_path(\n                        clip_path, source_ctm, target_ctm\n                    )\n\n                    # Generate clipping instruction\n                    op = \"W* n\" if evenodd else \"W n\"\n                    args = []\n                    for p in transformed_path:\n                        if len(p) == 1:\n                            args.append(p[0])\n                        elif len(p) > 1:\n                            args.extend([f\"{x:F}\" for x in p[1:]])\n                            args.append(p[0])\n\n                    if args:\n                        clipping_instruction = f\"{' '.join(args)} {op}\"\n                        passthrough_per_char_instruction_parts.append(\n                            clipping_instruction\n                        )\n\n                except Exception as e:\n                    logger.warning(\"Error transforming clip path: %s\", e)\n\n        passthrough_per_char_instruction = \" \".join(\n            passthrough_per_char_instruction_parts\n        )\n\n        # Pooling may reduce the accuracy of some graphic states. However, BabelDOC\n        # only uses passthrough_per_char_instruction, so the impact should be negligible.\n        # In return, pooling graphic states reduces memory usage.\n        if passthrough_per_char_instruction not in self.graphic_state_pool:\n            self.graphic_state_pool[passthrough_per_char_instruction] = (\n                il_version_1.GraphicState(\n                    passthrough_per_char_instruction=passthrough_per_char_instruction\n                
)\n            )\n        graphic_state = self.graphic_state_pool[passthrough_per_char_instruction]\n\n        return graphic_state\n\n    def on_lt_char(self, char: LTChar):\n        if char.aw_font_id is None:\n            return\n        try:\n            rotation_angle = get_rotation_angle(char.matrix)\n            if not (-0.1 <= rotation_angle <= 0.1 or 89.9 <= rotation_angle <= 90.1):\n                return\n        except Exception:\n            logger.warning(\n                \"Failed to get rotation angle for char %s\",\n                char.get_text(),\n            )\n        # Collect valid characters for statistics\n        try:\n            self._collect_valid_char(char.get_text())\n        except Exception as e:\n            logger.warning(\"Error collecting valid char: %s\", e)\n        gs = self.create_graphic_state(char.graphicstate)\n        # Get font from current page or xobject\n        font = None\n        pdf_font = None\n        for pdf_font in self.xobj_map.get(char.xobj_id, self.current_page).pdf_font:\n            if pdf_font.font_id == char.aw_font_id:\n                font = pdf_font\n                break\n\n        # Get descent from font\n        descent = 0\n        if font and hasattr(font, \"descent\"):\n            descent = font.descent * char.size / 1000\n\n        char_id = char.cid\n\n        char_bounding_box = None\n        try:\n            if (\n                font_bounding_box_map\n                := self.current_page_font_char_bounding_box_map.get(\n                    char.xobj_id, self.current_page_font_char_bounding_box_map\n                ).get(font.xref_id)\n            ):\n                char_bounding_box = font_bounding_box_map.get(char_id, None)\n            else:\n                char_bounding_box = None\n        except Exception:\n            # logger.debug(\n            #     \"Failed to get font bounding box for char %s\",\n            #     char.get_text(),\n            # )\n            
char_bounding_box = None\n\n        char_unicode = char.get_text()\n        # if \"(cid:\" not in char_unicode and len(char_unicode) > 1:\n        #     return\n        if space_regex.match(char_unicode):\n            char_unicode = \" \"\n        advance = char.adv\n        bbox = il_version_1.Box(\n            x=char.bbox[0],\n            y=char.bbox[1],\n            x2=char.bbox[2],\n            y2=char.bbox[3],\n        )\n        if bbox.x2 < bbox.x or bbox.y2 < bbox.y:\n            logger.warning(\n                \"Invalid bounding box for character %s: %s\",\n                char_unicode,\n                bbox,\n            )\n\n        if char.matrix[0] == 0 and char.matrix[3] == 0:\n            vertical = True\n            visual_bbox = il_version_1.Box(\n                x=char.bbox[0] - descent,\n                y=char.bbox[1],\n                x2=char.bbox[2] - descent,\n                y2=char.bbox[3],\n            )\n        else:\n            vertical = False\n            # Add descent to y coordinates\n            visual_bbox = il_version_1.Box(\n                x=char.bbox[0],\n                y=char.bbox[1] + descent,\n                x2=char.bbox[2],\n                y2=char.bbox[3] + descent,\n            )\n        visual_bbox = il_version_1.VisualBbox(box=visual_bbox)\n        pdf_style = il_version_1.PdfStyle(\n            font_id=char.aw_font_id,\n            font_size=char.size,\n            graphic_state=gs,\n        )\n\n        if font:\n            font_xref_id = font.xref_id\n            if font_xref_id in self.mupdf_font_map:\n                mupdf_font = self.mupdf_font_map[font_xref_id]\n                # if \"(cid:\" not in char_unicode:\n                #     if mupdf_cid := mupdf_font.has_glyph(ord(char_unicode)):\n                #         char_id = mupdf_cid\n\n        pdf_char = il_version_1.PdfCharacter(\n            box=bbox,\n            pdf_character_id=char_id,\n            advance=advance,\n            
char_unicode=char_unicode,\n            vertical=vertical,\n            pdf_style=pdf_style,\n            xobj_id=char.xobj_id,\n            visual_bbox=visual_bbox,\n            render_order=char.render_order,\n            sub_render_order=0,\n        )\n        if self.translation_config.ocr_workaround:\n            pdf_char.pdf_style.graphic_state = BLACK\n            pdf_char.render_order = None\n        if pdf_style.font_size == 0.0:\n            logger.warning(\n                \"Font size is 0.0 for character %s. Skip it.\",\n                char_unicode,\n            )\n            return\n\n        if char_bounding_box and len(char_bounding_box) == 4:\n            x_min, y_min, x_max, y_max = char_bounding_box\n            factor = 1 / 1000 * pdf_style.font_size\n            x_min = x_min * factor\n            y_min = y_min * factor\n            x_max = x_max * factor\n            y_max = y_max * factor\n            ll = (char.bbox[0] + x_min, char.bbox[1] + y_min)\n            ur = (char.bbox[0] + x_max, char.bbox[1] + y_max)\n\n            volume = (ur[0] - ll[0]) * (ur[1] - ll[1])\n            if volume > 1:\n                pdf_char.visual_bbox = il_version_1.VisualBbox(\n                    il_version_1.Box(ll[0], ll[1], ur[0], ur[1])\n                )\n\n        self.current_page.pdf_character.append(pdf_char)\n\n        if self.translation_config.show_char_box:\n            self.current_page.pdf_rectangle.append(\n                il_version_1.PdfRectangle(\n                    box=pdf_char.visual_bbox.box,\n                    graphic_state=YELLOW,\n                    debug_info=True,\n                    line_width=0.2,\n                )\n            )\n\n    def _collect_valid_char(self, ch: str):\n        \"\"\"Append a valid character into the current page buffer according to rules.\n        Rules:\n        - Include whitespace matched by space_regex directly.\n        - Ignore categories that are never normal text: {Cc, Cs, Co, Cn}.\n        
- Apply inverted criteria from formular_helper.py (21-28):\n          empty -> invalid, contains '(cid:' -> invalid,\n          not has_char(ch) -> invalid unless len(ch) > 1 and all(has_char(x)).\n        \"\"\"\n        if self._page_valid_chars_buffer is None:\n            return\n        if space_regex.match(ch):\n            self._page_valid_chars_buffer.append(ch)\n            return\n        try:\n            cat = unicodedata.category(ch[0]) if ch else None\n        except Exception:\n            cat = None\n        if cat in {\"Cc\", \"Cs\", \"Co\", \"Cn\"}:\n            return\n        is_invalid = False\n        if not ch:\n            is_invalid = True\n        elif \"(cid:\" in ch:\n            is_invalid = True\n        else:\n            try:\n                if not self.font_mapper.has_char(ch):\n                    if len(ch) > 1 and all(self.font_mapper.has_char(x) for x in ch):\n                        is_invalid = False\n                    else:\n                        is_invalid = True\n            except Exception:\n                is_invalid = True\n        if not is_invalid:\n            self._page_valid_chars_buffer.append(ch)\n\n    def on_lt_curve(self, curve: babeldoc.pdfminer.layout.LTCurve):\n        if not self.enable_graphic_element_process:\n            return\n        bbox = il_version_1.Box(\n            x=curve.bbox[0],\n            y=curve.bbox[1],\n            x2=curve.bbox[2],\n            y2=curve.bbox[3],\n        )\n        # Extract CTM from curve object if it exists\n        curve_ctm = getattr(curve, \"ctm\", None)\n        gs = self.create_graphic_state(\n            curve.passthrough_instruction,\n            include_clipping=True,\n            target_ctm=curve_ctm,\n            clip_paths=curve.clip_paths,\n        )\n        paths = []\n        for point in curve.original_path:\n            op = point[0]\n            if len(point) == 1:\n                paths.append(\n                    il_version_1.PdfPath(\n     
                   op=op,\n                        x=None,\n                        y=None,\n                        has_xy=False,\n                    )\n                )\n                continue\n            for p in point[1:-1]:\n                paths.append(\n                    il_version_1.PdfPath(\n                        op=\"\",\n                        x=p[0],\n                        y=p[1],\n                        has_xy=True,\n                    )\n                )\n            paths.append(\n                il_version_1.PdfPath(\n                    op=point[0],\n                    x=point[-1][0],\n                    y=point[-1][1],\n                    has_xy=True,\n                )\n            )\n\n        fill_background = curve.fill\n        stroke_path = curve.stroke\n        evenodd = curve.evenodd\n        # Extract CTM from curve object if it exists\n        ctm = getattr(curve, \"ctm\", None)\n\n        # Extract raw path from curve object if it exists\n        raw_path = getattr(curve, \"raw_path\", None)\n        raw_pdf_paths = None\n        if raw_path is not None:\n            raw_pdf_paths = []\n            for path in raw_path:\n                if path[0] == \"h\":  # h command (close path)\n                    raw_pdf_paths.append(\n                        il_version_1.PdfOriginalPath(\n                            pdf_path=il_version_1.PdfPath(\n                                x=0.0,\n                                y=0.0,\n                                op=path[0],\n                                has_xy=False,\n                            )\n                        )\n                    )\n                else:  # commands with coordinates (m, l, c, v, y, etc.)\n                    for p in batched(path[1:-2], 2, strict=True):\n                        raw_pdf_paths.append(\n                            il_version_1.PdfOriginalPath(\n                                pdf_path=il_version_1.PdfPath(\n                            
        x=float(p[0]),\n                                    y=float(p[1]),\n                                    op=\"\",\n                                    has_xy=True,\n                                )\n                            )\n                        )\n                    # Last point in the path\n                    raw_pdf_paths.append(\n                        il_version_1.PdfOriginalPath(\n                            pdf_path=il_version_1.PdfPath(\n                                x=float(path[-2]),\n                                y=float(path[-1]),\n                                op=path[0],\n                                has_xy=True,\n                            )\n                        )\n                    )\n\n        curve_obj = il_version_1.PdfCurve(\n            box=bbox,\n            graphic_state=gs,\n            pdf_path=paths,\n            fill_background=fill_background,\n            stroke_path=stroke_path,\n            evenodd=evenodd,\n            debug_info=\"a\",\n            xobj_id=curve.xobj_id,\n            render_order=curve.render_order,\n            ctm=list(ctm) if ctm is not None else None,\n            pdf_original_path=raw_pdf_paths,\n        )\n        self.current_page.pdf_curve.append(curve_obj)\n        pass\n\n    def on_xobj_form(\n        self,\n        ctm: tuple[float, float, float, float, float, float],\n        xobj_id: int,\n        xref_id: int,\n        form_type: Literal[\"image\", \"form\"],\n        do_args: str,\n        bbox: tuple[float, float, float, float],\n        matrix: tuple[float, float, float, float, float, float],\n    ):\n        logger.debug(f\"on_xobj_form: {do_args}[{bbox}] @ {xref_id} in {self.xobj_id}\")\n        matrix = mult_matrix(matrix, ctm)\n        (x, y, w, h) = guarded_bbox(bbox)\n        bounds = ((x, y), (x + w, y), (x, y + h), (x + w, y + h))\n        bbox = get_bound(apply_matrix_pt(matrix, (p, q)) for (p, q) in bounds)\n\n        gs = self.create_graphic_state(\n    
        self.passthrough_per_char_instruction, include_clipping=True, target_ctm=ctm\n        )\n\n        figure_bbox = il_version_1.Box(\n            x=bbox[0],\n            y=bbox[1],\n            x2=bbox[2],\n            y2=bbox[3],\n        )\n        pdf_matrix = il_version_1.PdfMatrix(\n            a=ctm[0],\n            b=ctm[1],\n            c=ctm[2],\n            d=ctm[3],\n            e=ctm[4],\n            f=ctm[5],\n        )\n        affine_transform = decompose_ctm(ctm)\n        xobj_form = il_version_1.PdfXobjForm(\n            xref_id=xref_id,\n            do_args=do_args,\n        )\n        pdf_form_subtype = il_version_1.PdfFormSubtype(\n            pdf_xobj_form=xobj_form,\n        )\n        new_form = il_version_1.PdfForm(\n            xobj_id=xobj_id,\n            box=figure_bbox,\n            pdf_matrix=pdf_matrix,\n            graphic_state=gs,\n            pdf_affine_transform=affine_transform,\n            render_order=self.get_render_order_and_increase(),\n            form_type=form_type,\n            pdf_form_subtype=pdf_form_subtype,\n            ctm=list(ctm),\n        )\n        self.current_page.pdf_form.append(new_form)\n\n    def on_pdf_clip_path(\n        self,\n        clip_path,\n        evenodd: bool,\n        ctm: tuple[float, float, float, float, float, float],\n    ):\n        try:\n            self.current_clip_paths.append((clip_path.copy(), ctm, evenodd))\n        except Exception as e:\n            logger.warning(\"Error in on_pdf_clip_path: %s\", e)\n\n    def create_il(self):\n        pages = [\n            page\n            for page in self.docs.page\n            if self.translation_config.should_translate_page(page.page_number + 1)\n        ]\n        self.docs.page = pages\n        return self.docs\n\n    def on_total_pages(self, total_pages: int):\n        assert isinstance(total_pages, int)\n        assert total_pages > 0\n        self.docs.total_pages = total_pages\n        total = 0\n        for page in 
range(total_pages):\n            if self.translation_config.should_translate_page(page + 1) is False:\n                continue\n            total += 1\n        self.progress = self.translation_config.progress_monitor.stage_start(\n            self.stage_name,\n            total,\n        )\n\n    def on_pdf_figure(self, figure: LTFigure):\n        box = il_version_1.Box(\n            figure.bbox[0],\n            figure.bbox[1],\n            figure.bbox[2],\n            figure.bbox[3],\n        )\n        self.current_page.pdf_figure.append(il_version_1.PdfFigure(box=box))\n\n    def on_inline_image_begin(self):\n        \"\"\"Begin processing inline image\"\"\"\n        # Store current state for inline image processing\n        self._inline_image_state = {\n            \"ctm\": None,\n            \"parameters\": {},\n        }\n\n    def on_inline_image_end(self, stream_obj, ctm):\n        \"\"\"End processing inline image and create PdfForm\"\"\"\n        import base64\n        import json\n\n        from babeldoc.format.pdf.babelpdf.utils import guarded_bbox\n        from babeldoc.format.pdf.document_il.utils.matrix_helper import decompose_ctm\n        from babeldoc.pdfminer.utils import apply_matrix_pt\n        from babeldoc.pdfminer.utils import get_bound\n\n        # Extract image parameters from stream dictionary\n        image_dict = stream_obj.attrs if hasattr(stream_obj, \"attrs\") else {}\n\n        # Build parameters dictionary\n        parameters = {}\n        for key, value in image_dict.items():\n            if hasattr(value, \"name\"):\n                parameters[key] = value.name\n            else:\n                parameters[key] = str(value)\n\n        # Get image data (encoded as base64)\n        image_data = \"\"\n        if hasattr(stream_obj, \"data\") and stream_obj.data is not None:\n            image_data = base64.b64encode(stream_obj.data).decode(\"ascii\")\n        elif hasattr(stream_obj, \"rawdata\") and stream_obj.rawdata is not 
None:\n            image_data = base64.b64encode(stream_obj.rawdata).decode(\"ascii\")\n\n        # Create inline form with parameters as JSON string\n        inline_form = il_version_1.PdfInlineForm(\n            form_data=image_data, image_parameters=json.dumps(parameters)\n        )\n\n        # Calculate bounding box - inline images are typically 1x1 unit square in user space\n        bbox = (0, 0, 1, 1)\n        (x, y, w, h) = guarded_bbox(bbox)\n        bounds = ((x, y), (x + w, y), (x, y + h), (x + w, y + h))\n        final_bbox = get_bound(apply_matrix_pt(ctm, (p, q)) for (p, q) in bounds)\n\n        # Create graphics state\n        gs = self.create_graphic_state(\n            self.passthrough_per_char_instruction, include_clipping=True, target_ctm=ctm\n        )\n\n        # Create PdfMatrix from CTM\n        pdf_matrix = il_version_1.PdfMatrix(\n            a=ctm[0], b=ctm[1], c=ctm[2], d=ctm[3], e=ctm[4], f=ctm[5]\n        )\n\n        # Create affine transform\n        affine_transform = decompose_ctm(ctm)\n\n        # Create PdfFormSubtype with inline form\n        pdf_form_subtype = il_version_1.PdfFormSubtype(pdf_inline_form=inline_form)\n\n        # Create PdfForm for the inline image\n        pdf_form = il_version_1.PdfForm(\n            box=il_version_1.Box(\n                x=final_bbox[0],\n                y=final_bbox[1],\n                x2=final_bbox[2],\n                y2=final_bbox[3],\n            ),\n            graphic_state=gs,\n            pdf_matrix=pdf_matrix,\n            pdf_affine_transform=affine_transform,\n            pdf_form_subtype=pdf_form_subtype,\n            xobj_id=self.xobj_id,\n            ctm=list(ctm),\n            render_order=self.get_render_order_and_increase(),\n            form_type=\"image\",\n        )\n\n        # Add to current page\n        self.current_page.pdf_form.append(pdf_form)\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/il_version_1.py",
    "content": "from dataclasses import dataclass\nfrom dataclasses import field\n\n\n@dataclass(slots=True)\nclass BaseOperations:\n    class Meta:\n        name = \"baseOperations\"\n\n    value: str = field(\n        default=\"\",\n        metadata={\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass Box:\n    class Meta:\n        name = \"box\"\n\n    x: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    y: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    x2: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    y2: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass GraphicState:\n    class Meta:\n        name = \"graphicState\"\n\n    passthrough_per_char_instruction: str | None = field(\n        default=None,\n        metadata={\n            \"name\": \"passthroughPerCharInstruction\",\n            \"type\": \"Attribute\",\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfAffineTransform:\n    class Meta:\n        name = \"pdfAffineTransform\"\n\n    translation_x: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    translation_y: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    rotation: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    
scale_x: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    scale_y: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    shear: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfFontCharBoundingBox:\n    class Meta:\n        name = \"pdfFontCharBoundingBox\"\n\n    x: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    y: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    x2: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    y2: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    char_id: int | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfInlineForm:\n    class Meta:\n        name = \"pdfInlineForm\"\n\n    form_data: str | None = field(\n        default=None,\n        metadata={\n            \"name\": \"formData\",\n            \"type\": \"Attribute\",\n        },\n    )\n    image_parameters: str | None = field(\n        default=None,\n        metadata={\n            \"name\": \"imageParameters\",\n            \"type\": \"Attribute\",\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfMatrix:\n    class Meta:\n        name = 
\"pdfMatrix\"\n\n    a: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    b: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    c: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    d: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    e: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    f: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfPath:\n    class Meta:\n        name = \"pdfPath\"\n\n    x: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    y: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    op: str | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    has_xy: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfXobjForm:\n    class Meta:\n        name = \"pdfXobjForm\"\n\n    xref_id: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"xrefId\",\n            \"type\": \"Attribute\",\n            \"required\": True,\n        
},\n    )\n    do_args: str | None = field(\n        default=None,\n        metadata={\n            \"name\": \"doArgs\",\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass Cropbox:\n    class Meta:\n        name = \"cropbox\"\n\n    box: Box | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass Mediabox:\n    class Meta:\n        name = \"mediabox\"\n\n    box: Box | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PageLayout:\n    class Meta:\n        name = \"pageLayout\"\n\n    box: Box | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    id: int | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    conf: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    class_name: str | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfFigure:\n    class Meta:\n        name = \"pdfFigure\"\n\n    box: Box | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfFont:\n    class Meta:\n        name = \"pdfFont\"\n\n    pdf_font_char_bounding_box: list[PdfFontCharBoundingBox] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfFontCharBoundingBox\",\n     
       \"type\": \"Element\",\n        },\n    )\n    name: str | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    font_id: str | None = field(\n        default=None,\n        metadata={\n            \"name\": \"fontId\",\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    xref_id: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"xrefId\",\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    encoding_length: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"encodingLength\",\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    bold: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    italic: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    monospace: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    serif: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    ascent: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    descent: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfFormSubtype:\n    class Meta:\n        name = \"pdfFormSubtype\"\n\n    pdf_inline_form: PdfInlineForm | None = field(\n        default=None,\n        metadata={\n            \"name\": \"pdfInlineForm\",\n            \"type\": \"Element\",\n        },\n    )\n    pdf_xobj_form: PdfXobjForm | None = field(\n        
default=None,\n        metadata={\n            \"name\": \"pdfXobjForm\",\n            \"type\": \"Element\",\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfOriginalPath:\n    class Meta:\n        name = \"pdfOriginalPath\"\n\n    pdf_path: PdfPath | None = field(\n        default=None,\n        metadata={\n            \"name\": \"pdfPath\",\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfRectangle:\n    class Meta:\n        name = \"pdfRectangle\"\n\n    box: Box | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    graphic_state: GraphicState | None = field(\n        default=None,\n        metadata={\n            \"name\": \"graphicState\",\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    debug_info: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    fill_background: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    xobj_id: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"xobjId\",\n            \"type\": \"Attribute\",\n        },\n    )\n    line_width: float | None = field(\n        default=None,\n        metadata={\n            \"name\": \"lineWidth\",\n            \"type\": \"Attribute\",\n        },\n    )\n    render_order: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"renderOrder\",\n            \"type\": \"Attribute\",\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfStyle:\n    class Meta:\n        name = \"pdfStyle\"\n\n    graphic_state: GraphicState | None = field(\n        default=None,\n        metadata={\n            \"name\": \"graphicState\",\n            \"type\": \"Element\",\n      
      \"required\": True,\n        },\n    )\n    font_id: str | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    font_size: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass VisualBbox:\n    class Meta:\n        name = \"visual_bbox\"\n\n    box: Box | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfCharacter:\n    class Meta:\n        name = \"pdfCharacter\"\n\n    pdf_style: PdfStyle | None = field(\n        default=None,\n        metadata={\n            \"name\": \"pdfStyle\",\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    box: Box | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    visual_bbox: VisualBbox | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n        },\n    )\n    vertical: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    scale: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    pdf_character_id: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"pdfCharacterId\",\n            \"type\": \"Attribute\",\n        },\n    )\n    char_unicode: str | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    advance: float | None = field(\n        default=None,\n        metadata={\n            \"type\": 
\"Attribute\",\n        },\n    )\n    xobj_id: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"xobjId\",\n            \"type\": \"Attribute\",\n        },\n    )\n    debug_info: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    formula_layout_id: int | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    render_order: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"renderOrder\",\n            \"type\": \"Attribute\",\n        },\n    )\n    sub_render_order: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"subRenderOrder\",\n            \"type\": \"Attribute\",\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfCurve:\n    class Meta:\n        name = \"pdfCurve\"\n\n    box: Box | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    graphic_state: GraphicState | None = field(\n        default=None,\n        metadata={\n            \"name\": \"graphicState\",\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    pdf_path: list[PdfPath] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfPath\",\n            \"type\": \"Element\",\n        },\n    )\n    pdf_original_path: list[PdfOriginalPath] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfOriginalPath\",\n            \"type\": \"Element\",\n        },\n    )\n    debug_info: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    fill_background: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    
stroke_path: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    evenodd: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    xobj_id: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"xobjId\",\n            \"type\": \"Attribute\",\n        },\n    )\n    render_order: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"renderOrder\",\n            \"type\": \"Attribute\",\n        },\n    )\n    ctm: list[object] = field(\n        default_factory=list,\n        metadata={\n            \"type\": \"Attribute\",\n            \"length\": 6,\n            \"tokens\": True,\n        },\n    )\n    relocation_transform: list[object] = field(\n        default_factory=list,\n        metadata={\n            \"type\": \"Attribute\",\n            \"length\": 6,\n            \"tokens\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfForm:\n    class Meta:\n        name = \"pdfForm\"\n\n    box: Box | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    graphic_state: GraphicState | None = field(\n        default=None,\n        metadata={\n            \"name\": \"graphicState\",\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    pdf_matrix: PdfMatrix | None = field(\n        default=None,\n        metadata={\n            \"name\": \"pdfMatrix\",\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    pdf_affine_transform: PdfAffineTransform | None = field(\n        default=None,\n        metadata={\n            \"name\": \"pdfAffineTransform\",\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    pdf_form_subtype: PdfFormSubtype | 
None = field(\n        default=None,\n        metadata={\n            \"name\": \"pdfFormSubtype\",\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    xobj_id: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"xobjId\",\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    ctm: list[object] = field(\n        default_factory=list,\n        metadata={\n            \"type\": \"Attribute\",\n            \"length\": 6,\n            \"tokens\": True,\n        },\n    )\n    relocation_transform: list[object] = field(\n        default_factory=list,\n        metadata={\n            \"type\": \"Attribute\",\n            \"length\": 6,\n            \"tokens\": True,\n        },\n    )\n    render_order: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"renderOrder\",\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    form_type: str | None = field(\n        default=None,\n        metadata={\n            \"name\": \"formType\",\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfSameStyleUnicodeCharacters:\n    class Meta:\n        name = \"pdfSameStyleUnicodeCharacters\"\n\n    pdf_style: PdfStyle | None = field(\n        default=None,\n        metadata={\n            \"name\": \"pdfStyle\",\n            \"type\": \"Element\",\n        },\n    )\n    unicode: str | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    debug_info: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfXobject:\n    class Meta:\n        name = \"pdfXobject\"\n\n    box: Box | None = field(\n        default=None,\n        
metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    pdf_font: list[PdfFont] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfFont\",\n            \"type\": \"Element\",\n        },\n    )\n    base_operations: BaseOperations | None = field(\n        default=None,\n        metadata={\n            \"name\": \"baseOperations\",\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    xobj_id: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"xobjId\",\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    xref_id: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"xrefId\",\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfFormula:\n    class Meta:\n        name = \"pdfFormula\"\n\n    box: Box | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    pdf_character: list[PdfCharacter] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfCharacter\",\n            \"type\": \"Element\",\n            \"min_occurs\": 1,\n        },\n    )\n    pdf_curve: list[PdfCurve] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfCurve\",\n            \"type\": \"Element\",\n        },\n    )\n    pdf_form: list[PdfForm] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfForm\",\n            \"type\": \"Element\",\n        },\n    )\n    x_offset: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    y_offset: float | None = field(\n        default=None,\n        
metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    x_advance: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    line_id: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"lineId\",\n            \"type\": \"Attribute\",\n        },\n    )\n    is_corner_mark: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfLine:\n    class Meta:\n        name = \"pdfLine\"\n\n    box: Box | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    pdf_character: list[PdfCharacter] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfCharacter\",\n            \"type\": \"Element\",\n            \"min_occurs\": 1,\n        },\n    )\n    render_order: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"renderOrder\",\n            \"type\": \"Attribute\",\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfSameStyleCharacters:\n    class Meta:\n        name = \"pdfSameStyleCharacters\"\n\n    box: Box | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    pdf_style: PdfStyle | None = field(\n        default=None,\n        metadata={\n            \"name\": \"pdfStyle\",\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    pdf_character: list[PdfCharacter] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfCharacter\",\n            \"type\": \"Element\",\n            \"min_occurs\": 1,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfParagraphComposition:\n    class 
Meta:\n        name = \"pdfParagraphComposition\"\n\n    pdf_line: PdfLine | None = field(\n        default=None,\n        metadata={\n            \"name\": \"pdfLine\",\n            \"type\": \"Element\",\n        },\n    )\n    pdf_formula: PdfFormula | None = field(\n        default=None,\n        metadata={\n            \"name\": \"pdfFormula\",\n            \"type\": \"Element\",\n        },\n    )\n    pdf_same_style_characters: PdfSameStyleCharacters | None = field(\n        default=None,\n        metadata={\n            \"name\": \"pdfSameStyleCharacters\",\n            \"type\": \"Element\",\n        },\n    )\n    pdf_character: PdfCharacter | None = field(\n        default=None,\n        metadata={\n            \"name\": \"pdfCharacter\",\n            \"type\": \"Element\",\n        },\n    )\n    pdf_same_style_unicode_characters: PdfSameStyleUnicodeCharacters | None = field(\n        default=None,\n        metadata={\n            \"name\": \"pdfSameStyleUnicodeCharacters\",\n            \"type\": \"Element\",\n        },\n    )\n\n\n@dataclass(slots=True)\nclass PdfParagraph:\n    class Meta:\n        name = \"pdfParagraph\"\n\n    box: Box | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    pdf_style: PdfStyle | None = field(\n        default=None,\n        metadata={\n            \"name\": \"pdfStyle\",\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    pdf_paragraph_composition: list[PdfParagraphComposition] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfParagraphComposition\",\n            \"type\": \"Element\",\n        },\n    )\n    xobj_id: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"xobjId\",\n            \"type\": \"Attribute\",\n        },\n    )\n    unicode: str | None = field(\n        default=None,\n        
metadata={\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    scale: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    optimal_scale: float | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    vertical: bool | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    first_line_indent: bool | None = field(\n        default=None,\n        metadata={\n            \"name\": \"FirstLineIndent\",\n            \"type\": \"Attribute\",\n        },\n    )\n    debug_id: str | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    layout_label: str | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    layout_id: int | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Attribute\",\n        },\n    )\n    render_order: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"renderOrder\",\n            \"type\": \"Attribute\",\n        },\n    )\n\n\n@dataclass(slots=True)\nclass Page:\n    class Meta:\n        name = \"page\"\n\n    mediabox: Mediabox | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    cropbox: Cropbox | None = field(\n        default=None,\n        metadata={\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    pdf_xobject: list[PdfXobject] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfXobject\",\n            \"type\": \"Element\",\n        },\n    )\n    page_layout: list[PageLayout] = field(\n        default_factory=list,\n        
metadata={\n            \"name\": \"pageLayout\",\n            \"type\": \"Element\",\n        },\n    )\n    pdf_rectangle: list[PdfRectangle] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfRectangle\",\n            \"type\": \"Element\",\n        },\n    )\n    pdf_font: list[PdfFont] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfFont\",\n            \"type\": \"Element\",\n        },\n    )\n    pdf_paragraph: list[PdfParagraph] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfParagraph\",\n            \"type\": \"Element\",\n        },\n    )\n    pdf_figure: list[PdfFigure] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfFigure\",\n            \"type\": \"Element\",\n        },\n    )\n    pdf_character: list[PdfCharacter] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfCharacter\",\n            \"type\": \"Element\",\n        },\n    )\n    pdf_curve: list[PdfCurve] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfCurve\",\n            \"type\": \"Element\",\n        },\n    )\n    pdf_form: list[PdfForm] = field(\n        default_factory=list,\n        metadata={\n            \"name\": \"pdfForm\",\n            \"type\": \"Element\",\n        },\n    )\n    base_operations: BaseOperations | None = field(\n        default=None,\n        metadata={\n            \"name\": \"baseOperations\",\n            \"type\": \"Element\",\n            \"required\": True,\n        },\n    )\n    page_number: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"pageNumber\",\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n    unit: str | None = field(\n        default=None,\n        metadata={\n            \"name\": \"Unit\",\n            \"type\": 
\"Attribute\",\n            \"required\": True,\n        },\n    )\n\n\n@dataclass(slots=True)\nclass Document:\n    class Meta:\n        name = \"document\"\n\n    page: list[Page] = field(\n        default_factory=list,\n        metadata={\n            \"type\": \"Element\",\n            \"min_occurs\": 1,\n        },\n    )\n    total_pages: int | None = field(\n        default=None,\n        metadata={\n            \"name\": \"totalPages\",\n            \"type\": \"Attribute\",\n            \"required\": True,\n        },\n    )\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/il_version_1.rnc",
    "content": "start = Document\nDocument =\n  element document {\n    Page+,\n    attribute totalPages { xsd:int }\n  }\nPage =\n  element page {\n    element mediabox { Box },\n    element cropbox { Box },\n    PDFXobject*,\n    PageLayout*,\n    PDFRectangle*,\n    PDFFont*,\n    PDFParagraph*,\n    PDFFigure*,\n    PDFCharacter*,\n    PDFCurve*,\n    PDFForm*,\n    attribute pageNumber { xsd:int },\n    attribute Unit { xsd:string },\n    element baseOperations { xsd:string }\n  }\nBox =\n  element box {\n    # from (x,y) to (x2,y2)\n    attribute x { xsd:float },\n    attribute y { xsd:float },\n    attribute x2 { xsd:float },\n    attribute y2 { xsd:float }\n  }\nPDFXrefId = xsd:int\nPDFFont =\n  element pdfFont {\n    attribute name { xsd:string },\n    attribute fontId { xsd:string },\n    attribute xrefId { PDFXrefId },\n    attribute encodingLength { xsd:int },\n    attribute bold { xsd:boolean }?,\n    attribute italic { xsd:boolean }?,\n    attribute monospace { xsd:boolean }?,\n    attribute serif { xsd:boolean }?,\n    attribute ascent { xsd:float }?,\n    attribute descent { xsd:float }?,\n    PDFFontCharBoundingBox*\n  }\nPDFFontCharBoundingBox =\n  element pdfFontCharBoundingBox {\n    attribute x { xsd:float },\n    attribute y { xsd:float },\n    attribute x2 { xsd:float },\n    attribute y2 { xsd:float },\n    attribute char_id { xsd:int }\n  }\nPDFXobject =\n  element pdfXobject {\n    attribute xobjId { xsd:int },\n    attribute xrefId { PDFXrefId },\n    Box,\n    PDFFont*,\n    element baseOperations { xsd:string }\n  }\nPDFCharacter =\n  element pdfCharacter {\n    attribute vertical { xsd:boolean }?,\n    attribute scale { xsd:float }?,\n    attribute pdfCharacterId { xsd:int }?,\n    attribute char_unicode { xsd:string },\n    attribute advance { xsd:float }?,\n    # xobject nesting depth\n    attribute xobjId { xsd:int }?,\n    attribute debug_info { xsd:boolean }?,\n    attribute formula_layout_id { xsd:int }?,\n    attribute 
renderOrder { xsd:int }?,\n    attribute subRenderOrder { xsd:int }?,\n    PDFStyle,\n    Box,\n    element visual_bbox { Box }?\n  }\nPageLayout =\n  element pageLayout {\n    attribute id { xsd:int },\n    attribute conf { xsd:float },\n    attribute class_name { xsd:string },\n    Box\n  }\nGraphicState =\n  element graphicState {\n    attribute passthroughPerCharInstruction { xsd:string }?\n  }\nPDFStyle =\n  element pdfStyle {\n    attribute font_id { xsd:string },\n    attribute font_size { xsd:float },\n    GraphicState\n  }\nPDFParagraph =\n  element pdfParagraph {\n    attribute xobjId { xsd:int }?,\n    attribute unicode { xsd:string },\n    attribute scale { xsd:float }?,\n    attribute optimal_scale { xsd:float }?,\n    attribute vertical { xsd:boolean }?,\n    attribute FirstLineIndent { xsd:boolean }?,\n    attribute debug_id { xsd:string }?,\n    attribute layout_label { xsd:string }?,\n    attribute layout_id { xsd:int }?,\n    attribute renderOrder { xsd:int }?,\n    Box,\n    PDFStyle,\n    PDFParagraphComposition*\n  }\nPDFParagraphComposition =\n  element pdfParagraphComposition {\n    PDFLine\n    | PDFFormula\n    | PDFSameStyleCharacters\n    | PDFCharacter\n    | PDFSameStyleUnicodeCharacters\n  }\nPDFLine =\n  element pdfLine {\n    Box,\n    PDFCharacter+,\n    attribute renderOrder { xsd:int }?\n  }\nPDFSameStyleCharacters =\n  element pdfSameStyleCharacters { Box, PDFStyle, PDFCharacter+ }\nPDFSameStyleUnicodeCharacters =\n  element pdfSameStyleUnicodeCharacters {\n    PDFStyle?,\n    attribute unicode { xsd:string },\n    attribute debug_info { xsd:boolean }?\n  }\nPDFFormula =\n  element pdfFormula {\n    Box,\n    PDFCharacter+,\n    PDFCurve*,\n    PDFForm*,\n    attribute x_offset { xsd:float },\n    attribute y_offset { xsd:float },\n    attribute x_advance { xsd:float }?,\n    attribute lineId { xsd:int }?,\n    attribute is_corner_mark { xsd:boolean }?\n  }\nPDFFigure = element pdfFigure { Box }\nPDFRectangle =\n  element 
pdfRectangle {\n    Box,\n    GraphicState,\n    attribute debug_info { xsd:boolean }?,\n    attribute fill_background { xsd:boolean }?,\n    attribute xobjId { xsd:int }?,\n    attribute lineWidth { xsd:float }?,\n    attribute renderOrder { xsd:int }?\n  }\nPDFCurve =\n  element pdfCurve {\n    Box,\n    GraphicState,\n    PDFPath*,\n    PDFOriginalPath*,\n    attribute debug_info { xsd:boolean }?,\n    attribute fill_background { xsd:boolean }?,\n    attribute stroke_path { xsd:boolean }?,\n    attribute evenodd { xsd:boolean }?,\n    attribute xobjId { xsd:int }?,\n    attribute renderOrder { xsd:int }?,\n    attribute ctm {\n      list {\n        xsd:float, xsd:float, xsd:float, xsd:float, xsd:float, xsd:float\n      }\n    }?,\n    attribute relocation_transform {\n      list {\n        xsd:float, xsd:float, xsd:float, xsd:float, xsd:float, xsd:float\n      }\n    }?\n  }\nPDFOriginalPath = element pdfOriginalPath { PDFPath }\nPDFPath =\n  element pdfPath {\n    attribute x { xsd:float },\n    attribute y { xsd:float },\n    attribute op { xsd:string },\n    attribute has_xy { xsd:boolean }?\n  }\nPDFForm =\n  element pdfForm {\n    attribute xobjId { xsd:int },\n    Box,\n    GraphicState,\n    PDFMatrix,\n    PDFAffineTransform,\n    attribute ctm {\n      list {\n        xsd:float, xsd:float, xsd:float, xsd:float, xsd:float, xsd:float\n      }\n    }?,\n    attribute relocation_transform {\n      list {\n        xsd:float, xsd:float, xsd:float, xsd:float, xsd:float, xsd:float\n      }\n    }?,\n    attribute renderOrder { xsd:int },\n    attribute formType { xsd:string },\n    PDFFormSubtype\n  }\nPDFFormSubtype = element pdfFormSubtype { PDFInlineForm | PDFXobjForm }\nPDFInlineForm =\n  element pdfInlineForm {\n    attribute formData { xsd:string }?,\n    attribute imageParameters { xsd:string }?\n  }\nPDFXobjForm =\n  element pdfXobjForm {\n    attribute xrefId { PDFXrefId },\n    attribute doArgs { xsd:string }\n  }\nPDFMatrix =\n  element pdfMatrix {\n 
   attribute a { xsd:float },\n    attribute b { xsd:float },\n    attribute c { xsd:float },\n    attribute d { xsd:float },\n    attribute e { xsd:float },\n    attribute f { xsd:float }\n  }\n# Decomposed transform parameters for a CTM\nPDFAffineTransform =\n  element pdfAffineTransform {\n    attribute translation_x { xsd:float },\n    attribute translation_y { xsd:float },\n    attribute rotation { xsd:float },\n    attribute scale_x { xsd:float },\n    attribute scale_y { xsd:float },\n    attribute shear { xsd:float }\n  }\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/il_version_1.rng",
    "content": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<grammar xmlns=\"http://relaxng.org/ns/structure/1.0\" datatypeLibrary=\"http://www.w3.org/2001/XMLSchema-datatypes\">\n  <start>\n    <ref name=\"Document\"/>\n  </start>\n  <define name=\"Document\">\n    <element name=\"document\">\n      <oneOrMore>\n        <ref name=\"Page\"/>\n      </oneOrMore>\n      <attribute name=\"totalPages\">\n        <data type=\"int\"/>\n      </attribute>\n    </element>\n  </define>\n  <define name=\"Page\">\n    <element name=\"page\">\n      <element name=\"mediabox\">\n        <ref name=\"Box\"/>\n      </element>\n      <element name=\"cropbox\">\n        <ref name=\"Box\"/>\n      </element>\n      <zeroOrMore>\n        <ref name=\"PDFXobject\"/>\n      </zeroOrMore>\n      <zeroOrMore>\n        <ref name=\"PageLayout\"/>\n      </zeroOrMore>\n      <zeroOrMore>\n        <ref name=\"PDFRectangle\"/>\n      </zeroOrMore>\n      <zeroOrMore>\n        <ref name=\"PDFFont\"/>\n      </zeroOrMore>\n      <zeroOrMore>\n        <ref name=\"PDFParagraph\"/>\n      </zeroOrMore>\n      <zeroOrMore>\n        <ref name=\"PDFFigure\"/>\n      </zeroOrMore>\n      <zeroOrMore>\n        <ref name=\"PDFCharacter\"/>\n      </zeroOrMore>\n      <zeroOrMore>\n        <ref name=\"PDFCurve\"/>\n      </zeroOrMore>\n      <zeroOrMore>\n        <ref name=\"PDFForm\"/>\n      </zeroOrMore>\n      <attribute name=\"pageNumber\">\n        <data type=\"int\"/>\n      </attribute>\n      <attribute name=\"Unit\">\n        <data type=\"string\"/>\n      </attribute>\n      <element name=\"baseOperations\">\n        <data type=\"string\"/>\n      </element>\n    </element>\n  </define>\n  <define name=\"Box\">\n    <element name=\"box\">\n      <!-- from (x,y) to (x2,y2) -->\n      <attribute name=\"x\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"y\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"x2\">\n        <data 
type=\"float\"/>\n      </attribute>\n      <attribute name=\"y2\">\n        <data type=\"float\"/>\n      </attribute>\n    </element>\n  </define>\n  <define name=\"PDFXrefId\">\n    <data type=\"int\"/>\n  </define>\n  <define name=\"PDFFont\">\n    <element name=\"pdfFont\">\n      <attribute name=\"name\">\n        <data type=\"string\"/>\n      </attribute>\n      <attribute name=\"fontId\">\n        <data type=\"string\"/>\n      </attribute>\n      <attribute name=\"xrefId\">\n        <ref name=\"PDFXrefId\"/>\n      </attribute>\n      <attribute name=\"encodingLength\">\n        <data type=\"int\"/>\n      </attribute>\n      <optional>\n        <attribute name=\"bold\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"italic\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"monospace\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"serif\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"ascent\">\n          <data type=\"float\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"descent\">\n          <data type=\"float\"/>\n        </attribute>\n      </optional>\n      <zeroOrMore>\n        <ref name=\"PDFFontCharBoundingBox\"/>\n      </zeroOrMore>\n    </element>\n  </define>\n  <define name=\"PDFFontCharBoundingBox\">\n    <element name=\"pdfFontCharBoundingBox\">\n      <attribute name=\"x\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"y\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"x2\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"y2\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute 
name=\"char_id\">\n        <data type=\"int\"/>\n      </attribute>\n    </element>\n  </define>\n  <define name=\"PDFXobject\">\n    <element name=\"pdfXobject\">\n      <attribute name=\"xobjId\">\n        <data type=\"int\"/>\n      </attribute>\n      <attribute name=\"xrefId\">\n        <ref name=\"PDFXrefId\"/>\n      </attribute>\n      <ref name=\"Box\"/>\n      <zeroOrMore>\n        <ref name=\"PDFFont\"/>\n      </zeroOrMore>\n      <element name=\"baseOperations\">\n        <data type=\"string\"/>\n      </element>\n    </element>\n  </define>\n  <define name=\"PDFCharacter\">\n    <element name=\"pdfCharacter\">\n      <optional>\n        <attribute name=\"vertical\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"scale\">\n          <data type=\"float\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"pdfCharacterId\">\n          <data type=\"int\"/>\n        </attribute>\n      </optional>\n      <attribute name=\"char_unicode\">\n        <data type=\"string\"/>\n      </attribute>\n      <optional>\n        <attribute name=\"advance\">\n          <data type=\"float\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <!-- xobject nesting depth -->\n        <attribute name=\"xobjId\">\n          <data type=\"int\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"debug_info\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"formula_layout_id\">\n          <data type=\"int\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"renderOrder\">\n          <data type=\"int\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"subRenderOrder\">\n          <data type=\"int\"/>\n        </attribute>\n      </optional>\n      <ref 
name=\"PDFStyle\"/>\n      <ref name=\"Box\"/>\n      <optional>\n        <element name=\"visual_bbox\">\n          <ref name=\"Box\"/>\n        </element>\n      </optional>\n    </element>\n  </define>\n  <define name=\"PageLayout\">\n    <element name=\"pageLayout\">\n      <attribute name=\"id\">\n        <data type=\"int\"/>\n      </attribute>\n      <attribute name=\"conf\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"class_name\">\n        <data type=\"string\"/>\n      </attribute>\n      <ref name=\"Box\"/>\n    </element>\n  </define>\n  <define name=\"GraphicState\">\n    <element name=\"graphicState\">\n      <optional>\n        <attribute name=\"passthroughPerCharInstruction\">\n          <data type=\"string\"/>\n        </attribute>\n      </optional>\n    </element>\n  </define>\n  <define name=\"PDFStyle\">\n    <element name=\"pdfStyle\">\n      <attribute name=\"font_id\">\n        <data type=\"string\"/>\n      </attribute>\n      <attribute name=\"font_size\">\n        <data type=\"float\"/>\n      </attribute>\n      <ref name=\"GraphicState\"/>\n    </element>\n  </define>\n  <define name=\"PDFParagraph\">\n    <element name=\"pdfParagraph\">\n      <optional>\n        <attribute name=\"xobjId\">\n          <data type=\"int\"/>\n        </attribute>\n      </optional>\n      <attribute name=\"unicode\">\n        <data type=\"string\"/>\n      </attribute>\n      <optional>\n        <attribute name=\"scale\">\n          <data type=\"float\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"optimal_scale\">\n          <data type=\"float\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"vertical\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"FirstLineIndent\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n      <optional>\n      
  <attribute name=\"debug_id\">\n          <data type=\"string\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"layout_label\">\n          <data type=\"string\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"layout_id\">\n          <data type=\"int\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"renderOrder\">\n          <data type=\"int\"/>\n        </attribute>\n      </optional>\n      <ref name=\"Box\"/>\n      <ref name=\"PDFStyle\"/>\n      <zeroOrMore>\n        <ref name=\"PDFParagraphComposition\"/>\n      </zeroOrMore>\n    </element>\n  </define>\n  <define name=\"PDFParagraphComposition\">\n    <element name=\"pdfParagraphComposition\">\n      <choice>\n        <ref name=\"PDFLine\"/>\n        <ref name=\"PDFFormula\"/>\n        <ref name=\"PDFSameStyleCharacters\"/>\n        <ref name=\"PDFCharacter\"/>\n        <ref name=\"PDFSameStyleUnicodeCharacters\"/>\n      </choice>\n    </element>\n  </define>\n  <define name=\"PDFLine\">\n    <element name=\"pdfLine\">\n      <ref name=\"Box\"/>\n      <oneOrMore>\n        <ref name=\"PDFCharacter\"/>\n      </oneOrMore>\n      <optional>\n        <attribute name=\"renderOrder\">\n          <data type=\"int\"/>\n        </attribute>\n      </optional>\n    </element>\n  </define>\n  <define name=\"PDFSameStyleCharacters\">\n    <element name=\"pdfSameStyleCharacters\">\n      <ref name=\"Box\"/>\n      <ref name=\"PDFStyle\"/>\n      <oneOrMore>\n        <ref name=\"PDFCharacter\"/>\n      </oneOrMore>\n    </element>\n  </define>\n  <define name=\"PDFSameStyleUnicodeCharacters\">\n    <element name=\"pdfSameStyleUnicodeCharacters\">\n      <optional>\n        <ref name=\"PDFStyle\"/>\n      </optional>\n      <attribute name=\"unicode\">\n        <data type=\"string\"/>\n      </attribute>\n      <optional>\n        <attribute name=\"debug_info\">\n          <data type=\"boolean\"/>\n    
    </attribute>\n      </optional>\n    </element>\n  </define>\n  <define name=\"PDFFormula\">\n    <element name=\"pdfFormula\">\n      <ref name=\"Box\"/>\n      <oneOrMore>\n        <ref name=\"PDFCharacter\"/>\n      </oneOrMore>\n      <zeroOrMore>\n        <ref name=\"PDFCurve\"/>\n      </zeroOrMore>\n      <zeroOrMore>\n        <ref name=\"PDFForm\"/>\n      </zeroOrMore>\n      <attribute name=\"x_offset\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"y_offset\">\n        <data type=\"float\"/>\n      </attribute>\n      <optional>\n        <attribute name=\"x_advance\">\n          <data type=\"float\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"lineId\">\n          <data type=\"int\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"is_corner_mark\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n    </element>\n  </define>\n  <define name=\"PDFFigure\">\n    <element name=\"pdfFigure\">\n      <ref name=\"Box\"/>\n    </element>\n  </define>\n  <define name=\"PDFRectangle\">\n    <element name=\"pdfRectangle\">\n      <ref name=\"Box\"/>\n      <ref name=\"GraphicState\"/>\n      <optional>\n        <attribute name=\"debug_info\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"fill_background\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"xobjId\">\n          <data type=\"int\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"lineWidth\">\n          <data type=\"float\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"renderOrder\">\n          <data type=\"int\"/>\n        </attribute>\n      </optional>\n    </element>\n  </define>\n  <define name=\"PDFCurve\">\n    <element 
name=\"pdfCurve\">\n      <ref name=\"Box\"/>\n      <ref name=\"GraphicState\"/>\n      <zeroOrMore>\n        <ref name=\"PDFPath\"/>\n      </zeroOrMore>\n      <zeroOrMore>\n        <ref name=\"PDFOriginalPath\"/>\n      </zeroOrMore>\n      <optional>\n        <attribute name=\"debug_info\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"fill_background\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"stroke_path\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"evenodd\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"xobjId\">\n          <data type=\"int\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"renderOrder\">\n          <data type=\"int\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"ctm\">\n          <list>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n          </list>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"relocation_transform\">\n          <list>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n          </list>\n        </attribute>\n      </optional>\n    </element>\n  </define>\n  <define name=\"PDFOriginalPath\">\n    <element name=\"pdfOriginalPath\">\n      <ref name=\"PDFPath\"/>\n    </element>\n  </define>\n  <define name=\"PDFPath\">\n    <element name=\"pdfPath\">\n      <attribute 
name=\"x\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"y\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"op\">\n        <data type=\"string\"/>\n      </attribute>\n      <optional>\n        <attribute name=\"has_xy\">\n          <data type=\"boolean\"/>\n        </attribute>\n      </optional>\n    </element>\n  </define>\n  <define name=\"PDFForm\">\n    <element name=\"pdfForm\">\n      <attribute name=\"xobjId\">\n        <data type=\"int\"/>\n      </attribute>\n      <ref name=\"Box\"/>\n      <ref name=\"GraphicState\"/>\n      <ref name=\"PDFMatrix\"/>\n      <ref name=\"PDFAffineTransform\"/>\n      <optional>\n        <attribute name=\"ctm\">\n          <list>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n          </list>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"relocation_transform\">\n          <list>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n            <data type=\"float\"/>\n          </list>\n        </attribute>\n      </optional>\n      <attribute name=\"renderOrder\">\n        <data type=\"int\"/>\n      </attribute>\n      <attribute name=\"formType\">\n        <data type=\"string\"/>\n      </attribute>\n      <ref name=\"PDFFormSubtype\"/>\n    </element>\n  </define>\n  <define name=\"PDFFormSubtype\">\n    <element name=\"pdfFormSubtype\">\n      <choice>\n        <ref name=\"PDFInlineForm\"/>\n        <ref name=\"PDFXobjForm\"/>\n      </choice>\n    </element>\n  </define>\n  <define name=\"PDFInlineForm\">\n    <element name=\"pdfInlineForm\">\n      <optional>\n        <attribute name=\"formData\">\n          <data 
type=\"string\"/>\n        </attribute>\n      </optional>\n      <optional>\n        <attribute name=\"imageParameters\">\n          <data type=\"string\"/>\n        </attribute>\n      </optional>\n    </element>\n  </define>\n  <define name=\"PDFXobjForm\">\n    <element name=\"pdfXobjForm\">\n      <attribute name=\"xrefId\">\n        <ref name=\"PDFXrefId\"/>\n      </attribute>\n      <attribute name=\"doArgs\">\n        <data type=\"string\"/>\n      </attribute>\n    </element>\n  </define>\n  <define name=\"PDFMatrix\">\n    <element name=\"pdfMatrix\">\n      <attribute name=\"a\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"b\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"c\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"d\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"e\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"f\">\n        <data type=\"float\"/>\n      </attribute>\n    </element>\n  </define>\n  <!-- Decomposed transform parameters for a CTM -->\n  <define name=\"PDFAffineTransform\">\n    <element name=\"pdfAffineTransform\">\n      <attribute name=\"translation_x\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"translation_y\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"rotation\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"scale_x\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"scale_y\">\n        <data type=\"float\"/>\n      </attribute>\n      <attribute name=\"shear\">\n        <data type=\"float\"/>\n      </attribute>\n    </element>\n  </define>\n</grammar>\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/il_version_1.xsd",
    "content": "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\" elementFormDefault=\"qualified\">\n  <xs:element name=\"document\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element maxOccurs=\"unbounded\" ref=\"page\"/>\n      </xs:sequence>\n      <xs:attribute name=\"totalPages\" use=\"required\" type=\"xs:int\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"page\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"mediabox\"/>\n        <xs:element ref=\"cropbox\"/>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pdfXobject\"/>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pageLayout\"/>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pdfRectangle\"/>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pdfFont\"/>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pdfParagraph\"/>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pdfFigure\"/>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pdfCharacter\"/>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pdfCurve\"/>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pdfForm\"/>\n        <xs:element ref=\"baseOperations\"/>\n      </xs:sequence>\n      <xs:attribute name=\"pageNumber\" use=\"required\" type=\"xs:int\"/>\n      <xs:attribute name=\"Unit\" use=\"required\" type=\"xs:string\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"mediabox\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"box\"/>\n      </xs:sequence>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"cropbox\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"box\"/>\n      </xs:sequence>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"baseOperations\" type=\"xs:string\"/>\n  <xs:element 
name=\"box\">\n    <xs:complexType>\n      <xs:attribute name=\"x\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"y\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"x2\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"y2\" use=\"required\" type=\"xs:float\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:simpleType name=\"PDFXrefId\">\n    <xs:restriction base=\"xs:int\"/>\n  </xs:simpleType>\n  <xs:element name=\"pdfFont\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pdfFontCharBoundingBox\"/>\n      </xs:sequence>\n      <xs:attribute name=\"name\" use=\"required\" type=\"xs:string\"/>\n      <xs:attribute name=\"fontId\" use=\"required\" type=\"xs:string\"/>\n      <xs:attribute name=\"xrefId\" use=\"required\" type=\"PDFXrefId\"/>\n      <xs:attribute name=\"encodingLength\" use=\"required\" type=\"xs:int\"/>\n      <xs:attribute name=\"bold\" type=\"xs:boolean\"/>\n      <xs:attribute name=\"italic\" type=\"xs:boolean\"/>\n      <xs:attribute name=\"monospace\" type=\"xs:boolean\"/>\n      <xs:attribute name=\"serif\" type=\"xs:boolean\"/>\n      <xs:attribute name=\"ascent\" type=\"xs:float\"/>\n      <xs:attribute name=\"descent\" type=\"xs:float\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfFontCharBoundingBox\">\n    <xs:complexType>\n      <xs:attribute name=\"x\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"y\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"x2\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"y2\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"char_id\" use=\"required\" type=\"xs:int\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfXobject\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"box\"/>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pdfFont\"/>\n        
<xs:element ref=\"baseOperations\"/>\n      </xs:sequence>\n      <xs:attribute name=\"xobjId\" use=\"required\" type=\"xs:int\"/>\n      <xs:attribute name=\"xrefId\" use=\"required\" type=\"PDFXrefId\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfCharacter\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"pdfStyle\"/>\n        <xs:element ref=\"box\"/>\n        <xs:element minOccurs=\"0\" ref=\"visual_bbox\"/>\n      </xs:sequence>\n      <xs:attribute name=\"vertical\" type=\"xs:boolean\"/>\n      <xs:attribute name=\"scale\" type=\"xs:float\"/>\n      <xs:attribute name=\"pdfCharacterId\" type=\"xs:int\"/>\n      <xs:attribute name=\"char_unicode\" use=\"required\" type=\"xs:string\"/>\n      <xs:attribute name=\"advance\" type=\"xs:float\"/>\n      <xs:attribute name=\"xobjId\" type=\"xs:int\"/>\n      <xs:attribute name=\"debug_info\" type=\"xs:boolean\"/>\n      <xs:attribute name=\"formula_layout_id\" type=\"xs:int\"/>\n      <xs:attribute name=\"renderOrder\" type=\"xs:int\"/>\n      <xs:attribute name=\"subRenderOrder\" type=\"xs:int\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"visual_bbox\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"box\"/>\n      </xs:sequence>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pageLayout\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"box\"/>\n      </xs:sequence>\n      <xs:attribute name=\"id\" use=\"required\" type=\"xs:int\"/>\n      <xs:attribute name=\"conf\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"class_name\" use=\"required\" type=\"xs:string\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"graphicState\">\n    <xs:complexType>\n      <xs:attribute name=\"passthroughPerCharInstruction\" type=\"xs:string\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfStyle\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element 
ref=\"graphicState\"/>\n      </xs:sequence>\n      <xs:attribute name=\"font_id\" use=\"required\" type=\"xs:string\"/>\n      <xs:attribute name=\"font_size\" use=\"required\" type=\"xs:float\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfParagraph\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"box\"/>\n        <xs:element ref=\"pdfStyle\"/>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pdfParagraphComposition\"/>\n      </xs:sequence>\n      <xs:attribute name=\"xobjId\" type=\"xs:int\"/>\n      <xs:attribute name=\"unicode\" use=\"required\" type=\"xs:string\"/>\n      <xs:attribute name=\"scale\" type=\"xs:float\"/>\n      <xs:attribute name=\"optimal_scale\" type=\"xs:float\"/>\n      <xs:attribute name=\"vertical\" type=\"xs:boolean\"/>\n      <xs:attribute name=\"FirstLineIndent\" type=\"xs:boolean\"/>\n      <xs:attribute name=\"debug_id\" type=\"xs:string\"/>\n      <xs:attribute name=\"layout_label\" type=\"xs:string\"/>\n      <xs:attribute name=\"layout_id\" type=\"xs:int\"/>\n      <xs:attribute name=\"renderOrder\" type=\"xs:int\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfParagraphComposition\">\n    <xs:complexType>\n      <xs:choice>\n        <xs:element ref=\"pdfLine\"/>\n        <xs:element ref=\"pdfFormula\"/>\n        <xs:element ref=\"pdfSameStyleCharacters\"/>\n        <xs:element ref=\"pdfCharacter\"/>\n        <xs:element ref=\"pdfSameStyleUnicodeCharacters\"/>\n      </xs:choice>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfLine\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"box\"/>\n        <xs:element maxOccurs=\"unbounded\" ref=\"pdfCharacter\"/>\n      </xs:sequence>\n      <xs:attribute name=\"renderOrder\" type=\"xs:int\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfSameStyleCharacters\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"box\"/>\n        
<xs:element ref=\"pdfStyle\"/>\n        <xs:element maxOccurs=\"unbounded\" ref=\"pdfCharacter\"/>\n      </xs:sequence>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfSameStyleUnicodeCharacters\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element minOccurs=\"0\" ref=\"pdfStyle\"/>\n      </xs:sequence>\n      <xs:attribute name=\"unicode\" use=\"required\" type=\"xs:string\"/>\n      <xs:attribute name=\"debug_info\" type=\"xs:boolean\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfFormula\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"box\"/>\n        <xs:element maxOccurs=\"unbounded\" ref=\"pdfCharacter\"/>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pdfCurve\"/>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pdfForm\"/>\n      </xs:sequence>\n      <xs:attribute name=\"x_offset\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"y_offset\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"x_advance\" type=\"xs:float\"/>\n      <xs:attribute name=\"lineId\" type=\"xs:int\"/>\n      <xs:attribute name=\"is_corner_mark\" type=\"xs:boolean\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfFigure\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"box\"/>\n      </xs:sequence>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfRectangle\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"box\"/>\n        <xs:element ref=\"graphicState\"/>\n      </xs:sequence>\n      <xs:attribute name=\"debug_info\" type=\"xs:boolean\"/>\n      <xs:attribute name=\"fill_background\" type=\"xs:boolean\"/>\n      <xs:attribute name=\"xobjId\" type=\"xs:int\"/>\n      <xs:attribute name=\"lineWidth\" type=\"xs:float\"/>\n      <xs:attribute name=\"renderOrder\" type=\"xs:int\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfCurve\">\n    
<xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"box\"/>\n        <xs:element ref=\"graphicState\"/>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pdfPath\"/>\n        <xs:element minOccurs=\"0\" maxOccurs=\"unbounded\" ref=\"pdfOriginalPath\"/>\n      </xs:sequence>\n      <xs:attribute name=\"debug_info\" type=\"xs:boolean\"/>\n      <xs:attribute name=\"fill_background\" type=\"xs:boolean\"/>\n      <xs:attribute name=\"stroke_path\" type=\"xs:boolean\"/>\n      <xs:attribute name=\"evenodd\" type=\"xs:boolean\"/>\n      <xs:attribute name=\"xobjId\" type=\"xs:int\"/>\n      <xs:attribute name=\"renderOrder\" type=\"xs:int\"/>\n      <xs:attribute name=\"ctm\">\n        <xs:simpleType>\n          <xs:restriction>\n            <xs:simpleType>\n              <xs:list>\n                <xs:simpleType>\n                  <xs:union memberTypes=\"xs:float xs:float xs:float xs:float xs:float xs:float\"/>\n                </xs:simpleType>\n              </xs:list>\n            </xs:simpleType>\n            <xs:length value=\"6\"/>\n          </xs:restriction>\n        </xs:simpleType>\n      </xs:attribute>\n      <xs:attribute name=\"relocation_transform\">\n        <xs:simpleType>\n          <xs:restriction>\n            <xs:simpleType>\n              <xs:list>\n                <xs:simpleType>\n                  <xs:union memberTypes=\"xs:float xs:float xs:float xs:float xs:float xs:float\"/>\n                </xs:simpleType>\n              </xs:list>\n            </xs:simpleType>\n            <xs:length value=\"6\"/>\n          </xs:restriction>\n        </xs:simpleType>\n      </xs:attribute>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfOriginalPath\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"pdfPath\"/>\n      </xs:sequence>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfPath\">\n    <xs:complexType>\n      <xs:attribute name=\"x\" use=\"required\" 
type=\"xs:float\"/>\n      <xs:attribute name=\"y\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"op\" use=\"required\" type=\"xs:string\"/>\n      <xs:attribute name=\"has_xy\" type=\"xs:boolean\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfForm\">\n    <xs:complexType>\n      <xs:sequence>\n        <xs:element ref=\"box\"/>\n        <xs:element ref=\"graphicState\"/>\n        <xs:element ref=\"pdfMatrix\"/>\n        <xs:element ref=\"pdfAffineTransform\"/>\n        <xs:element ref=\"pdfFormSubtype\"/>\n      </xs:sequence>\n      <xs:attribute name=\"xobjId\" use=\"required\" type=\"xs:int\"/>\n      <xs:attribute name=\"ctm\">\n        <xs:simpleType>\n          <xs:restriction>\n            <xs:simpleType>\n              <xs:list>\n                <xs:simpleType>\n                  <xs:union memberTypes=\"xs:float xs:float xs:float xs:float xs:float xs:float\"/>\n                </xs:simpleType>\n              </xs:list>\n            </xs:simpleType>\n            <xs:length value=\"6\"/>\n          </xs:restriction>\n        </xs:simpleType>\n      </xs:attribute>\n      <xs:attribute name=\"relocation_transform\">\n        <xs:simpleType>\n          <xs:restriction>\n            <xs:simpleType>\n              <xs:list>\n                <xs:simpleType>\n                  <xs:union memberTypes=\"xs:float xs:float xs:float xs:float xs:float xs:float\"/>\n                </xs:simpleType>\n              </xs:list>\n            </xs:simpleType>\n            <xs:length value=\"6\"/>\n          </xs:restriction>\n        </xs:simpleType>\n      </xs:attribute>\n      <xs:attribute name=\"renderOrder\" use=\"required\" type=\"xs:int\"/>\n      <xs:attribute name=\"formType\" use=\"required\" type=\"xs:string\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfFormSubtype\">\n    <xs:complexType>\n      <xs:choice>\n        <xs:element ref=\"pdfInlineForm\"/>\n        <xs:element ref=\"pdfXobjForm\"/>\n      
</xs:choice>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfInlineForm\">\n    <xs:complexType>\n      <xs:attribute name=\"formData\" type=\"xs:string\"/>\n      <xs:attribute name=\"imageParameters\" type=\"xs:string\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfXobjForm\">\n    <xs:complexType>\n      <xs:attribute name=\"xrefId\" use=\"required\" type=\"PDFXrefId\"/>\n      <xs:attribute name=\"doArgs\" use=\"required\" type=\"xs:string\"/>\n    </xs:complexType>\n  </xs:element>\n  <xs:element name=\"pdfMatrix\">\n    <xs:complexType>\n      <xs:attribute name=\"a\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"b\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"c\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"d\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"e\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"f\" use=\"required\" type=\"xs:float\"/>\n    </xs:complexType>\n  </xs:element>\n  <!-- Decomposed transform parameters for a CTM -->\n  <xs:element name=\"pdfAffineTransform\">\n    <xs:complexType>\n      <xs:attribute name=\"translation_x\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"translation_y\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"rotation\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"scale_x\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"scale_y\" use=\"required\" type=\"xs:float\"/>\n      <xs:attribute name=\"shear\" use=\"required\" type=\"xs:float\"/>\n    </xs:complexType>\n  </xs:element>\n</xs:schema>\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/__init__.py",
    "content": ""
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/add_debug_information.py",
    "content": "import logging\n\nimport babeldoc.format.pdf.document_il.il_version_1 as il_version_1\nfrom babeldoc.format.pdf.document_il import GraphicState\nfrom babeldoc.format.pdf.document_il.utils.style_helper import BLUE\nfrom babeldoc.format.pdf.document_il.utils.style_helper import ORANGE\nfrom babeldoc.format.pdf.document_il.utils.style_helper import PINK\nfrom babeldoc.format.pdf.document_il.utils.style_helper import TEAL\nfrom babeldoc.format.pdf.document_il.utils.style_helper import YELLOW\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\n\nlogger = logging.getLogger(__name__)\n\n\nclass AddDebugInformation:\n    stage_name = \"Add Debug Information\"\n\n    def __init__(self, translation_config: TranslationConfig):\n        self.translation_config = translation_config\n        self.model = translation_config.doc_layout_model\n\n    def process(self, docs: il_version_1.Document):\n        if not self.translation_config.debug:\n            return\n\n        for page in docs.page:\n            self.process_page(page)\n\n    def _create_rectangle(\n        self,\n        box: il_version_1.Box,\n        color: GraphicState,\n        line_width: float | None = None,\n    ):\n        rect = il_version_1.PdfRectangle(\n            box=box,\n            graphic_state=color,\n            debug_info=True,\n            line_width=line_width,\n        )\n        return rect\n\n    def _create_text(\n        self,\n        text: str,\n        color: GraphicState,\n        box: il_version_1.Box,\n        font_size: float = 4,\n    ):\n        style = il_version_1.PdfStyle(\n            font_id=\"base\",\n            font_size=font_size,\n            graphic_state=color,\n        )\n        return il_version_1.PdfParagraph(\n            first_line_indent=False,\n            box=il_version_1.Box(\n                x=box.x,\n                y=box.y2,\n                x2=box.x2,\n                y2=box.y2 + 5,\n            ),\n            
vertical=False,\n            pdf_style=style,\n            unicode=text,\n            pdf_paragraph_composition=[\n                il_version_1.PdfParagraphComposition(\n                    pdf_same_style_unicode_characters=il_version_1.PdfSameStyleUnicodeCharacters(\n                        unicode=text,\n                        pdf_style=style,\n                        debug_info=True,\n                    ),\n                ),\n            ],\n            xobj_id=-1,\n        )\n\n    def process_page(self, page: il_version_1.Page):\n        # Add page number text at top-left corner\n        page_width = page.cropbox.box.x2 - page.cropbox.box.x\n        page_height = page.cropbox.box.y2 - page.cropbox.box.y\n        page_number_text = f\"pagenumber: {page.page_number + 1}\"\n        page_number_box = il_version_1.Box(\n            x=page.cropbox.box.x + page_width * 0.02,\n            y=page.cropbox.box.y,\n            x2=page.cropbox.box.x2,\n            y2=page.cropbox.box.y2 - page_height * 0.02,\n        )\n        page_number_paragraph = self._create_text(\n            page_number_text,\n            BLUE,\n            page_number_box,\n        )\n        page.pdf_paragraph.append(page_number_paragraph)\n\n        new_paragraphs = []\n\n        for paragraph in page.pdf_paragraph:\n            if not paragraph.pdf_paragraph_composition:\n                continue\n            if any(\n                x.pdf_same_style_unicode_characters.debug_info\n                for x in paragraph.pdf_paragraph_composition\n                if x.pdf_same_style_unicode_characters\n            ):\n                continue\n            # Create a rectangle box\n            rect = self._create_rectangle(paragraph.box, BLUE)\n\n            page.pdf_rectangle.append(rect)\n\n            # Create text label at top-left corner\n            # Note: PDF coordinates are from bottom-left,\n            # so we use y2 for top position\n\n            debug_text = \"paragraph\"\n            
if hasattr(paragraph, \"debug_id\") and paragraph.debug_id:\n                debug_text = (\n                    f\"paragraph[{paragraph.debug_id}]-[{paragraph.layout_label}]\"\n                )\n            new_paragraphs.append(self._create_text(debug_text, BLUE, paragraph.box))\n\n            for composition in paragraph.pdf_paragraph_composition:\n                if composition.pdf_formula:\n                    new_paragraphs.append(\n                        self._create_text(\n                            \"formula\",\n                            ORANGE,\n                            composition.pdf_formula.box,\n                        ),\n                    )\n                    page.pdf_rectangle.append(\n                        self._create_rectangle(\n                            composition.pdf_formula.box,\n                            ORANGE,\n                        ),\n                    )\n                    for char in composition.pdf_formula.pdf_character:\n                        page.pdf_rectangle.append(\n                            self._create_rectangle(\n                                char.visual_bbox.box, TEAL, line_width=0.2\n                            ),\n                        )\n                        # page.pdf_rectangle.append(\n                        #     self._create_rectangle(char.box, CYAN, line_width=0.2),\n                        # )\n\n            for xobj in page.pdf_xobject:\n                # new_paragraphs.append(\n                #     self._create_text(\n                #         \"xobj\",\n                #         YELLOW,\n                #         xobj.box,\n                #     ),\n                # )\n                page.pdf_rectangle.append(\n                    self._create_rectangle(\n                        xobj.box,\n                        YELLOW,\n                    ),\n                )\n\n            for form in page.pdf_form:\n                debug_text = \"Form\"\n                if 
form.pdf_form_subtype.pdf_xobj_form:\n                    debug_text += f\"[{form.pdf_form_subtype.pdf_xobj_form.do_args}]\"\n                elif form.pdf_form_subtype.pdf_inline_form:\n                    debug_text += \"[inline]\"\n\n                new_paragraphs.append(\n                    self._create_text(debug_text, PINK, form.box, font_size=0.4),\n                )\n                page.pdf_rectangle.append(\n                    self._create_rectangle(\n                        form.box,\n                        PINK,\n                    ),\n                )\n\n        page.pdf_paragraph.extend(new_paragraphs)\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/automatic_term_extractor.py",
    "content": "from __future__ import annotations\n\nimport json\nimport logging\nfrom pathlib import Path\nfrom typing import TYPE_CHECKING\n\nimport tiktoken\nfrom tqdm import tqdm\n\nfrom babeldoc.format.pdf.document_il import (\n    Document as ILDocument,  # Renamed to avoid conflict\n)\nfrom babeldoc.format.pdf.document_il import PdfParagraph  # Renamed to avoid conflict\nfrom babeldoc.format.pdf.document_il.midend.il_translator import Page\nfrom babeldoc.format.pdf.document_il.utils.paragraph_helper import is_cid_paragraph\nfrom babeldoc.format.pdf.document_il.utils.paragraph_helper import (\n    is_placeholder_only_paragraph,\n)\nfrom babeldoc.format.pdf.document_il.utils.paragraph_helper import (\n    is_pure_numeric_paragraph,\n)\nfrom babeldoc.utils.priority_thread_pool_executor import PriorityThreadPoolExecutor\n\nif TYPE_CHECKING:\n    from babeldoc.format.pdf.translation_config import TranslationConfig\n    from babeldoc.translator.translator import BaseTranslator\n\nlogger = logging.getLogger(__name__)\n\nLLM_PROMPT_TEMPLATE: str = \"\"\"\nYou are an expert multilingual terminologist. Extract key terms from the text and translate them into {target_language}.\n\n### Extraction Rules\n1. Include only: named entities (people, orgs, locations, theorem/algorithm names, dates) and domain-specific nouns/noun phrases essential to meaning.\n2. No full sentences. Ignore function words.\n3. Use minimal noun phrases (≤5 words unless a named entity). No generic academic nouns (e.g., model, case, property) unless part of a standard term.\n4. No mathematical items: variables (X1, a, ε), symbols (=, +, →, ⊥⊥, ∈), subscripts/superscripts, formula fragments, mappings (T: H1→H2), etc. Keep only natural-language concepts.\n5. Extract each term once. Keep order of first appearance.\n\n### Translation Rules\n1. Translate each term into {target_language}.\n2. If in the reference glossary, use its translation exactly.\n3. 
Keep proper names in original language unless a well-known translation exists.\n4. Ensure consistent translations.\n\n{reference_glossary_section}\n\n### Output Format\n- Return ONLY a valid JSON array.\n- Each element: {{\"src\": \"...\", \"tgt\": \"...\"}}.\n- No comments, no backticks, no extra text.\n- If no terms: [].\n\n### Example\nFor terms “LLM”, “GPT”:\n{example_output}\n\nInput Text:\n```\n{text_to_process}\n```\n\nReturn JSON ONLY. NO OTHER TEXT.\nResult:\n\"\"\"\n\n\nclass BatchParagraph:\n    def __init__(\n        self,\n        paragraphs: list[PdfParagraph],\n        page_tracker: PageTermExtractTracker,\n    ):\n        self.paragraphs = paragraphs\n        self.tracker = page_tracker.new_paragraph()\n\n\nclass DocumentTermExtractTracker:\n    def __init__(self):\n        self.page = []\n\n    def new_page(self):\n        page = PageTermExtractTracker()\n        self.page.append(page)\n        return page\n\n    def to_json(self):\n        pages = []\n        for page in self.page:\n            paragraphs = []\n            for para in page.paragraph:\n                o_str = getattr(para, \"output\", None)\n                i_str = getattr(para, \"input\", None)\n                pdf_unicodes = getattr(para, \"pdf_unicodes\", None)\n                if not pdf_unicodes:\n                    continue\n                paragraphs.append(\n                    {\n                        \"pdf_unicodes\": pdf_unicodes,\n                        \"output\": o_str,\n                        \"input\": i_str,\n                    },\n                )\n            pages.append({\"paragraph\": paragraphs})\n        return json.dumps({\"page\": pages}, ensure_ascii=False, indent=2)\n\n\nclass PageTermExtractTracker:\n    def __init__(self):\n        self.paragraph = []\n\n    def new_paragraph(self):\n        paragraph = ParagraphTermExtractTracker()\n        self.paragraph.append(paragraph)\n        return paragraph\n\n\nclass ParagraphTermExtractTracker:\n    
def __init__(self):\n        self.pdf_unicodes = []\n\n    def append_paragraph_unicode(self, unicode: str):\n        self.pdf_unicodes.append(unicode)\n\n    def set_output(self, output: str):\n        self.output = output\n\n    def set_input(self, _input: str):\n        self.input = _input\n\n\nclass AutomaticTermExtractor:\n    stage_name = \"Automatic Term Extraction\"\n\n    def __init__(\n        self,\n        translate_engine: BaseTranslator,\n        translation_config: TranslationConfig,\n    ):\n        self.translate_engine = translate_engine\n        self.translation_config = translation_config\n        self.shared_context = translation_config.shared_context_cross_split_part\n        self.tokenizer = tiktoken.encoding_for_model(\"gpt-4o\")\n\n        # Check if the translate_engine has llm_translate capability\n        if not hasattr(self.translate_engine, \"llm_translate\") or not callable(\n            self.translate_engine.llm_translate\n        ):\n            raise ValueError(\n                \"The provided translate_engine does not support LLM-based translation, which is required for AutomaticTermExtractor.\"\n            )\n\n    def calc_token_count(self, text: str) -> int:\n        try:\n            return len(self.tokenizer.encode(text, disallowed_special=()))\n        except Exception:\n            return 0\n\n    def _snapshot_token_usage(self) -> tuple[int, int, int, int]:\n        if not self.translate_engine:\n            return 0, 0, 0, 0\n        token_counter = getattr(self.translate_engine, \"token_count\", None)\n        prompt_counter = getattr(self.translate_engine, \"prompt_token_count\", None)\n        completion_counter = getattr(\n            self.translate_engine, \"completion_token_count\", None\n        )\n        cache_hit_prompt_counter = getattr(\n            self.translate_engine, \"cache_hit_prompt_token_count\", None\n        )\n        total_tokens = token_counter.value if token_counter else 0\n        
prompt_tokens = prompt_counter.value if prompt_counter else 0\n        completion_tokens = completion_counter.value if completion_counter else 0\n        cache_hit_prompt_tokens = (\n            cache_hit_prompt_counter.value if cache_hit_prompt_counter else 0\n        )\n        return total_tokens, prompt_tokens, completion_tokens, cache_hit_prompt_tokens\n\n    def _clean_json_output(self, llm_output: str) -> str:\n        llm_output = llm_output.strip()\n        if llm_output.startswith(\"<json>\"):\n            llm_output = llm_output[6:]\n        if llm_output.endswith(\"</json>\"):\n            llm_output = llm_output[:-7]\n        if llm_output.startswith(\"```json\"):\n            llm_output = llm_output[7:]\n        if llm_output.startswith(\"```\"):\n            llm_output = llm_output[3:]\n        if llm_output.endswith(\"```\"):\n            llm_output = llm_output[:-3]\n        return llm_output.strip()\n\n    def _process_llm_response(self, llm_response_text: str, request_id: str):\n        try:\n            cleaned_response_text = self._clean_json_output(llm_response_text)\n            extracted_data = json.loads(cleaned_response_text)\n\n            if not isinstance(extracted_data, list):\n                logger.warning(\n                    f\"Request ID {request_id}: LLM response was not a JSON list, but type: {type(extracted_data)}. 
Content: {cleaned_response_text[:200]}\"\n                )\n                return\n\n            for item in extracted_data:\n                if isinstance(item, dict) and \"src\" in item and \"tgt\" in item:\n                    src_term = str(item[\"src\"]).strip()\n                    tgt_term = str(item[\"tgt\"]).strip()\n                    if (\n                        src_term and tgt_term and len(src_term) < 100\n                    ):  # Basic validation\n                        self.shared_context.add_raw_extracted_term_pair(\n                            src_term, tgt_term\n                        )\n                else:\n                    logger.warning(\n                        f\"Request ID {request_id}: Skipping malformed item in LLM JSON response: {item}\"\n                    )\n\n        except json.JSONDecodeError as e:\n            logger.error(\n                f\"Request ID {request_id}: JSON Parsing Error: {e}. Problematic LLM Response after cleaning (start): {cleaned_response_text[:200]}...\"\n            )\n        except Exception as e:\n            logger.error(f\"Request ID {request_id}: Error processing LLM response: {e}\")\n\n    def process_page(\n        self,\n        page: Page,\n        executor: PriorityThreadPoolExecutor,\n        pbar: tqdm | None = None,\n        tracker: PageTermExtractTracker = None,\n    ):\n        self.translation_config.raise_if_cancelled()\n        paragraphs = []\n        total_token_count = 0\n        for paragraph in page.pdf_paragraph:\n            if paragraph.debug_id is None or paragraph.unicode is None:\n                pbar.advance(1)\n                continue\n            if is_cid_paragraph(paragraph):\n                pbar.advance(1)\n                continue\n            if is_pure_numeric_paragraph(paragraph):\n                pbar.advance(1)\n                continue\n            if is_placeholder_only_paragraph(paragraph):\n                pbar.advance(1)\n                continue\n  
          # if len(paragraph.unicode) < self.translation_config.min_text_length:\n            #     pbar.advance(1)\n            #     continue\n            total_token_count += self.calc_token_count(paragraph.unicode)\n            paragraphs.append(paragraph)\n            if total_token_count > 600 or len(paragraphs) > 12:\n                executor.submit(\n                    self.extract_terms_from_paragraphs,\n                    BatchParagraph(paragraphs, tracker),\n                    pbar,\n                    total_token_count,\n                    priority=1048576 - total_token_count,\n                )\n                paragraphs = []\n                total_token_count = 0\n\n        if paragraphs:\n            executor.submit(\n                self.extract_terms_from_paragraphs,\n                BatchParagraph(paragraphs, tracker),\n                pbar,\n                total_token_count,\n                priority=1048576 - total_token_count,\n            )\n\n    def extract_terms_from_paragraphs(\n        self,\n        paragraphs: BatchParagraph,\n        pbar: tqdm | None = None,\n        paragraph_token_count: int = 0,\n    ):\n        self.translation_config.raise_if_cancelled()\n        try:\n            inputs = [p.unicode for p in paragraphs.paragraphs if p.unicode]\n            tracker = paragraphs.tracker\n            for u in inputs:\n                tracker.append_paragraph_unicode(u)\n            if not inputs:\n                return\n\n            # Build reference glossary section\n            reference_glossary_section = \"\"\n            user_glossaries = self.shared_context.user_glossaries\n            if user_glossaries:\n                text_for_glossary = \"\\n\\n\".join(inputs)\n\n                # Group entries by glossary name\n                glossary_entries = {}\n                for glossary in user_glossaries:\n                    active_entries = glossary.get_active_entries_for_text(\n                        
text_for_glossary\n                    )\n                    if active_entries:\n                        glossary_entries[glossary.name] = active_entries\n\n                if glossary_entries:\n                    reference_glossary_section = (\n                        \"Reference Glossaries (for consistency and quality):\\n\"\n                    )\n\n                    # Add entries grouped by glossary name\n                    for glossary_name, entries in glossary_entries.items():\n                        reference_glossary_section += f\"\\n{glossary_name}:\\n\"\n                        for src, tgt in sorted(set(entries)):\n                            reference_glossary_section += f\"- {src} → {tgt}\\n\"\n\n                    reference_glossary_section += \"\\nPlease consider these existing translations for consistency when extracting new terms. IMPORTANT: You should also extract terms that appear in the reference glossaries above if they are found in the input text - don't skip them just because they already exist in the reference.\"\n\n            prompt = LLM_PROMPT_TEMPLATE.format(\n                target_language=self.translation_config.lang_out,\n                text_to_process=\"\\n\\n\".join(inputs),\n                reference_glossary_section=reference_glossary_section,\n                example_output=\"\"\"[\n  {\"src\": \"LLM\", \"tgt\": \"大语言模型\"},\n  {\"src\": \"GPT\", \"tgt\": \"GPT\"}\n]\"\"\",\n            )\n            tracker.set_input(prompt)\n            output = self.translate_engine.llm_translate(\n                prompt,\n                rate_limit_params={\n                    \"paragraph_token_count\": paragraph_token_count,\n                    \"request_json_mode\": True,\n                },\n            )\n            tracker.set_output(output)\n            cleaned_output = self._clean_json_output(output)\n            response = json.loads(cleaned_output)\n            if not isinstance(response, list):\n                response 
= [response]  # Ensure we have a list\n\n            for term in response:\n                if isinstance(term, dict) and \"src\" in term and \"tgt\" in term:\n                    src_term = str(term[\"src\"]).strip()\n                    tgt_term = str(term[\"tgt\"]).strip()\n                    if src_term == tgt_term and len(src_term) < 3:\n                        continue\n                    if src_term and tgt_term and len(src_term) < 100:\n                        self.shared_context.add_raw_extracted_term_pair(\n                            src_term, tgt_term\n                        )\n\n        except Exception as e:\n            logger.warning(f\"Error during automatic terms extract: {e}\")\n            return\n        finally:\n            pbar.advance(len(paragraphs.paragraphs))\n\n    def procress(self, doc_il: ILDocument):\n        logger.info(f\"{self.stage_name}: Starting term extraction for document.\")\n        start_total, start_prompt, start_completion, start_cache_hit_prompt = (\n            self._snapshot_token_usage()\n        )\n        tracker = DocumentTermExtractTracker()\n        total = sum(len(page.pdf_paragraph) for page in doc_il.page)\n        with self.translation_config.progress_monitor.stage_start(\n            self.stage_name,\n            total,\n        ) as pbar:\n            max_workers = self.translation_config.term_pool_max_workers\n            logger.info(\n                f\"Using {max_workers} worker threads for automatic term extraction.\"\n            )\n            with PriorityThreadPoolExecutor(\n                max_workers=max_workers,\n            ) as executor:\n                for page in doc_il.page:\n                    self.process_page(page, executor, pbar, tracker.new_page())\n\n        self.shared_context.finalize_auto_extracted_glossary()\n        end_total, end_prompt, end_completion, end_cache_hit_prompt = (\n            self._snapshot_token_usage()\n        )\n        
self.translation_config.record_term_extraction_usage(\n            end_total - start_total,\n            end_prompt - start_prompt,\n            end_completion - start_completion,\n            end_cache_hit_prompt - start_cache_hit_prompt,\n        )\n\n        if (\n            self.translation_config.debug\n            or self.translation_config.working_dir is not None\n        ):\n            path = self.translation_config.get_working_file_path(\n                \"term_extractor_tracking.json\"\n            )\n            logger.debug(f\"Saving term extraction tracking to {path}\")\n            with Path(path).open(\"w\", encoding=\"utf-8\") as f:\n                f.write(tracker.to_json())\n\n            path = self.translation_config.get_working_file_path(\n                \"term_extractor_freq.json\"\n            )\n            logger.debug(f\"Saving term frequencies to {path}\")\n            with Path(path).open(\"w\", encoding=\"utf-8\") as f:\n                json.dump(\n                    self.shared_context.raw_extracted_terms,\n                    f,\n                    ensure_ascii=False,\n                    indent=2,\n                )\n\n            path = self.translation_config.get_working_file_path(\n                \"auto_extractor_glossary.csv\"\n            )\n            logger.debug(f\"Saving auto-extracted glossary to {path}\")\n            with Path(path).open(\"w\", encoding=\"utf-8\") as f:\n                auto_extracted_glossary = self.shared_context.auto_extracted_glossary\n                if auto_extracted_glossary:\n                    f.write(auto_extracted_glossary.to_csv())\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/detect_scanned_file.py",
    "content": "import logging\n\nimport cv2\nimport numpy as np\nimport pymupdf\nimport regex\nfrom skimage.metrics import structural_similarity\n\nfrom babeldoc.babeldoc_exception.BabelDOCException import ScannedPDFError\nfrom babeldoc.format.pdf.document_il import il_version_1\nfrom babeldoc.format.pdf.document_il.backend.pdf_creater import PDFCreater\nfrom babeldoc.format.pdf.document_il.utils.style_helper import BLACK\nfrom babeldoc.format.pdf.document_il.utils.style_helper import GREEN\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\n\nlogger = logging.getLogger(__name__)\n\n\nclass DetectScannedFile:\n    stage_name = \"DetectScannedFile\"\n\n    def __init__(self, translation_config: TranslationConfig):\n        self.translation_config = translation_config\n\n    def _save_debug_box_to_page(self, page: il_version_1.Page, similarity: float):\n        \"\"\"Save debug boxes and text labels to the PDF page.\"\"\"\n        if not self.translation_config.debug:\n            return\n\n        color = GREEN\n\n        # Create text label at top-left corner\n        # Note: PDF coordinates are from bottom-left,\n        # so we use y2 for top position\n        style = il_version_1.PdfStyle(\n            font_id=\"base\",\n            font_size=4,\n            graphic_state=color,\n        )\n        page_width = page.cropbox.box.x2 - page.cropbox.box.x\n        page_height = page.cropbox.box.y2 - page.cropbox.box.y\n        unicode = f\"scanned score: {similarity * 100:.2f} %\"\n        page.pdf_paragraph.append(\n            il_version_1.PdfParagraph(\n                first_line_indent=False,\n                box=il_version_1.Box(\n                    x=page.cropbox.box.x + page_width * 0.03,\n                    y=page.cropbox.box.y,\n                    x2=page.cropbox.box.x2,\n                    y2=page.cropbox.box.y2 - page_height * 0.03,\n                ),\n                vertical=False,\n                pdf_style=style,\n             
   unicode=unicode,\n                pdf_paragraph_composition=[\n                    il_version_1.PdfParagraphComposition(\n                        pdf_same_style_unicode_characters=il_version_1.PdfSameStyleUnicodeCharacters(\n                            unicode=unicode,\n                            pdf_style=style,\n                            debug_info=True,\n                        ),\n                    ),\n                ],\n                xobj_id=-1,\n            ),\n        )\n\n    def fast_check(self, doc: pymupdf.Document) -> bool:\n        if doc:\n            hit_list = [0] * len(doc)\n            for page in doc:\n                contents_list = page.get_contents()\n                for index in contents_list:\n                    contents = doc.xref_stream(index)\n                    if regex.search(\n                        rb\"(/Artifact|/P)(\\s*\\<\\<\\s*/MCID\\s+|\\s+BDC)\", contents\n                    ):\n                        hit_list[page.number] += 1\n                    if regex.search(rb\"\\s3\\s+Tr\\s\", contents):\n                        hit_list[page.number] += 1\n            return bool(sum(hit_list) > len(doc) * 0.8)\n        return False\n\n    def process(\n        self, docs: il_version_1.Document, original_pdf_path, mediabox_data: dict\n    ):\n        \"\"\"Generate layouts for all pages that need to be translated.\"\"\"\n        # Get pages that need to be translated\n\n        pdf_creater = PDFCreater(\n            original_pdf_path, docs, self.translation_config, mediabox_data\n        )\n\n        pages_to_translate = [\n            page\n            for page in docs.page\n            if self.translation_config.should_translate_page(page.page_number + 1)\n        ]\n        if not pages_to_translate:\n            return\n        mupdf = pymupdf.open(self.translation_config.get_working_file_path(\"input.pdf\"))\n        total = len(pages_to_translate)\n        threshold = 0.8 * total\n        threshold = max(threshold, 
1)\n        scanned = 0\n        non_scanned = 0\n        non_scanned_threshold = total - threshold\n        with self.translation_config.progress_monitor.stage_start(\n            self.stage_name,\n            total,\n        ) as progress:\n            for page in pages_to_translate:\n                if scanned < threshold and non_scanned < non_scanned_threshold:\n                    # Only continue detection if both counts are below thresholds\n                    is_scanned = self.detect_page_is_scanned(page, mupdf, pdf_creater)\n                    if is_scanned:\n                        scanned += 1\n                    else:\n                        non_scanned += 1\n                else:\n                    # We have enough information to determine document type\n                    non_scanned += 1\n                progress.advance(1)\n\n        if scanned >= threshold:\n            if self.translation_config.auto_enable_ocr_workaround:\n                logger.warning(\n                    f\"Detected {scanned} scanned pages, which is more than 80% of the total pages. \"\n                    \"Turning on OCR workaround.\",\n                )\n                self.translation_config.shared_context_cross_split_part.auto_enabled_ocr_workaround = True\n                self.translation_config.ocr_workaround = True\n                self.translation_config.skip_scanned_detection = True\n                self.translation_config.disable_rich_text_translate = True\n                self.clean_render_order_for_chars(docs)\n                self.translation_config.remove_non_formula_lines = False\n            else:\n                logger.warning(\n                    f\"Detected {scanned} scanned pages, which is more than 80% of the total pages. 
\"\n                    \"Please check the input PDF file.\",\n                )\n                raise ScannedPDFError(\"Scanned PDF detected.\")\n\n    def clean_render_order_for_chars(self, docs: il_version_1.Document):\n        for page in docs.page:\n            for char in page.pdf_character:\n                char.render_order = None\n                if not char.debug_info:\n                    char.pdf_style.graphic_state = BLACK\n\n    def detect_page_is_scanned(\n        self, page: il_version_1.Page, pdf: pymupdf.Document, pdf_creater: PDFCreater\n    ) -> bool:\n        before_page_image = pdf[page.page_number].get_pixmap()\n        before_page_image = np.frombuffer(before_page_image.samples, np.uint8).reshape(\n            before_page_image.height,\n            before_page_image.width,\n            3,\n        )[:, :, ::-1]\n\n        pdf_creater.update_page_content_stream(\n            False, page, pdf, self.translation_config, True\n        )\n\n        after_page_image = pdf[page.page_number].get_pixmap()\n        after_page_image = np.frombuffer(after_page_image.samples, np.uint8).reshape(\n            after_page_image.height,\n            after_page_image.width,\n            3,\n        )[:, :, ::-1]\n        before_page_image = cv2.cvtColor(before_page_image, cv2.COLOR_RGB2GRAY)\n        after_page_image = cv2.cvtColor(after_page_image, cv2.COLOR_RGB2GRAY)\n        similarity = structural_similarity(before_page_image, after_page_image)\n        return similarity > 0.95\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/il_translator.py",
    "content": "from __future__ import annotations\n\nimport copy\nimport json\nimport logging\nimport re\nimport threading\nfrom pathlib import Path\nfrom string import Template\n\nimport tiktoken\nfrom tqdm import tqdm\n\nimport babeldoc.format.pdf.document_il.il_version_1 as il_version_1\nfrom babeldoc.babeldoc_exception.BabelDOCException import ContentFilterError\nfrom babeldoc.format.pdf.document_il import Document\nfrom babeldoc.format.pdf.document_il import GraphicState\nfrom babeldoc.format.pdf.document_il import Page\nfrom babeldoc.format.pdf.document_il import PdfFont\nfrom babeldoc.format.pdf.document_il import PdfFormula\nfrom babeldoc.format.pdf.document_il import PdfParagraph\nfrom babeldoc.format.pdf.document_il import PdfParagraphComposition\nfrom babeldoc.format.pdf.document_il import PdfSameStyleCharacters\nfrom babeldoc.format.pdf.document_il import PdfSameStyleUnicodeCharacters\nfrom babeldoc.format.pdf.document_il import PdfStyle\nfrom babeldoc.format.pdf.document_il.utils.fontmap import FontMapper\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import get_char_unicode_string\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import get_paragraph_unicode\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import is_same_style\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import (\n    is_same_style_except_font,\n)\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import (\n    is_same_style_except_size,\n)\nfrom babeldoc.format.pdf.document_il.utils.paragraph_helper import (\n    is_placeholder_only_paragraph,\n)\nfrom babeldoc.format.pdf.document_il.utils.paragraph_helper import (\n    is_pure_numeric_paragraph,\n)\nfrom babeldoc.format.pdf.document_il.utils.style_helper import GRAY80\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\nfrom babeldoc.translator.translator import BaseTranslator\nfrom babeldoc.utils.priority_thread_pool_executor import 
PriorityThreadPoolExecutor\n\nlogger = logging.getLogger(__name__)\n\n\nPROMPT_TEMPLATE = Template(\n    \"\"\"$role_block\n\n## Rules\n\n1. Keep the structure exactly unchanged: do NOT add/remove/reorder any tags, placeholders, or tokens.\n2. Keep all tags unchanged (e.g., <style>, <b>, </style>).\n   - Translate human-readable text inside tags.\n   - Do NOT translate text inside <code>…</code>.\n3. Do NOT translate or alter placeholders: {v1}, {name}, %s, %d, [[...]], %%...%%.\n4. If the entire input is pure code/identifiers, return it unchanged.\n5. Translate ALL human-readable content into $lang_out.\n\n$glossary_block\n\n$context_block\n\n## Output\n\nOutput ONLY the translated $lang_out text. No explanations, no backticks, no extra text.\n\nNow translate the following text:\n\n$text_to_translate\"\"\"\n)\n\n\nclass RichTextPlaceholder:\n    def __init__(\n        self,\n        placeholder_id: int,\n        composition: PdfSameStyleCharacters,\n        left_placeholder: str,\n        right_placeholder: str,\n        left_regex_pattern: str = None,\n        right_regex_pattern: str = None,\n    ):\n        self.id = placeholder_id\n        self.composition = composition\n        self.left_placeholder = left_placeholder\n        self.right_placeholder = right_placeholder\n        self.left_regex_pattern = left_regex_pattern\n        self.right_regex_pattern = right_regex_pattern\n\n    def to_dict(self) -> dict:\n        return {\n            \"type\": \"rich_text\",\n            \"id\": self.id,\n            \"left_placeholder\": self.left_placeholder,\n            \"right_placeholder\": self.right_placeholder,\n            \"left_regex_pattern\": self.left_regex_pattern,\n            \"right_regex_pattern\": self.right_regex_pattern,\n            \"composition_chars\": get_char_unicode_string(self.composition.pdf_character)\n            if self.composition and self.composition.pdf_character\n            else None,\n        }\n\n\nclass FormulaPlaceholder:\n   
 def __init__(\n        self,\n        placeholder_id: int,\n        formula: PdfFormula,\n        placeholder: str,\n        regex_pattern: str,\n    ):\n        self.id = placeholder_id\n        self.formula = formula\n        self.placeholder = placeholder\n        self.regex_pattern = regex_pattern\n\n    def to_dict(self) -> dict:\n        return {\n            \"type\": \"formula\",\n            \"id\": self.id,\n            \"placeholder\": self.placeholder,\n            \"regex_pattern\": self.regex_pattern,\n            \"formula_chars\": get_char_unicode_string(self.formula.pdf_character)\n            if self.formula and self.formula.pdf_character\n            else None,\n        }\n\n\nclass PbarContext:\n    def __init__(self, pbar):\n        self.pbar = pbar\n\n    def __enter__(self):\n        return self.pbar\n\n    def __exit__(self, exc_type, exc_value, traceback):\n        self.pbar.advance()\n\n\nclass DocumentTranslateTracker:\n    def __init__(self):\n        self.page = []\n        self.cross_page = []\n        # Track paragraphs that are combined due to cross-column detection within the same page\n        self.cross_column = []\n\n    def new_page(self):\n        page = PageTranslateTracker()\n        self.page.append(page)\n        return page\n\n    def new_cross_page(self):\n        page = PageTranslateTracker()\n        self.cross_page.append(page)\n        return page\n\n    def new_cross_column(self):\n        \"\"\"Create and return a new PageTranslateTracker dedicated to cross-column merging.\"\"\"\n        page = PageTranslateTracker()\n        self.cross_column.append(page)\n        return page\n\n    def to_json(self):\n        pages = []\n        for page in self.page:\n            paragraphs = self.convert_paragraph(page)\n            pages.append({\"paragraph\": paragraphs})\n        cross_page = []\n        for page in self.cross_page:\n            paragraphs = self.convert_paragraph(page)\n            
cross_page.append({\"paragraph\": paragraphs})\n        cross_column = []\n        for page in self.cross_column:\n            paragraphs = self.convert_paragraph(page)\n            cross_column.append({\"paragraph\": paragraphs})\n        return json.dumps(\n            {\n                \"cross_page\": cross_page,\n                \"cross_column\": cross_column,\n                \"page\": pages,\n            },\n            ensure_ascii=False,\n            indent=2,\n        )\n\n    def convert_paragraph(self, page):\n        paragraphs = []\n        for para in page.paragraph:\n            i_str = getattr(para, \"input\", None)\n            o_str = getattr(para, \"output\", None)\n            pdf_unicode = getattr(para, \"pdf_unicode\", None)\n            llm_translate_trackers = getattr(para, \"llm_translate_trackers\", None)\n            placeholders = getattr(para, \"placeholders\", None)\n            original_placeholders = getattr(para, \"original_placeholders\", None)\n            removed_hallucinated_placeholders = getattr(\n                para,\n                \"removed_hallucinated_placeholders\",\n                None,\n            )\n\n            llm_translate_trackers_json = []\n            if llm_translate_trackers:\n                for tracker in llm_translate_trackers:\n                    llm_translate_trackers_json.append(tracker.to_dict())\n\n            placeholders_json = []\n            if placeholders:\n                for placeholder in placeholders:\n                    placeholders_json.append(placeholder.to_dict())\n\n            if pdf_unicode is None or i_str is None:\n                continue\n            paragraph_json = {\n                \"input\": i_str,\n                \"output\": o_str,\n                \"pdf_unicode\": pdf_unicode,\n                \"llm_translate_trackers\": llm_translate_trackers_json,\n                \"placeholders\": placeholders_json,\n                \"multi_paragraph_id\": getattr(para, 
\"multi_paragraph_id\", None),\n                \"multi_paragraph_index\": getattr(para, \"multi_paragraph_index\", None),\n                \"original_placeholders\": original_placeholders,\n                \"removed_hallucinated_placeholders\": removed_hallucinated_placeholders,\n            }\n            paragraphs.append(\n                paragraph_json,\n            )\n        return paragraphs\n\n\nclass PageTranslateTracker:\n    def __init__(self):\n        self.paragraph = []\n\n    def new_paragraph(self):\n        paragraph = ParagraphTranslateTracker()\n        self.paragraph.append(paragraph)\n        return paragraph\n\n\nclass ParagraphTranslateTracker:\n    def __init__(self):\n        self.llm_translate_trackers = []\n        self.original_placeholders: dict[str, int] = {}\n        self.removed_hallucinated_placeholders: dict[str, int] = {}\n\n    def set_pdf_unicode(self, unicode: str):\n        self.pdf_unicode = unicode\n\n    def set_input(self, input_text: str):\n        self.input = input_text\n\n    def set_placeholders(\n        self, placeholders: list[RichTextPlaceholder | FormulaPlaceholder]\n    ):\n        self.placeholders = placeholders\n\n    def set_original_placeholders(self, placeholders: dict[str, int] | None):\n        \"\"\"Record original placeholder-like tokens from the source text.\"\"\"\n        self.original_placeholders = placeholders or {}\n\n    def record_multi_paragraph_id(self, mid):\n        self.multi_paragraph_id = mid\n\n    def record_multi_paragraph_index(self, index):\n        self.multi_paragraph_index = index\n\n    def set_output(self, output: str):\n        self.output = output\n\n    def record_removed_hallucinated_placeholder(self, token: str):\n        \"\"\"Record placeholder-like tokens removed from translated text.\"\"\"\n        if not token:\n            return\n        self.removed_hallucinated_placeholders[token] = (\n            self.removed_hallucinated_placeholders.get(token, 0) + 1\n        
)\n\n    def new_llm_translate_tracker(self) -> LLMTranslateTracker:\n        tracker = LLMTranslateTracker()\n        self.llm_translate_trackers.append(tracker)\n        return tracker\n\n    def last_llm_translate_tracker(self) -> LLMTranslateTracker | None:\n        if self.llm_translate_trackers:\n            return self.llm_translate_trackers[-1]\n        return None\n\n\nclass LLMTranslateTracker:\n    def __init__(self):\n        self.input = \"\"\n        self.output = \"\"\n        self.has_error = False\n        self.error_message = \"\"\n        self.placeholder_full_match = False\n        self.fallback_to_translate = False\n\n    def set_input(self, input_text: str):\n        self.input = input_text\n\n    def set_output(self, output_text: str):\n        self.output = output_text\n\n    def set_error_message(self, error_message: str):\n        self.has_error = True\n        self.error_message = error_message\n\n    def set_placeholder_full_match(self):\n        self.placeholder_full_match = True\n\n    def set_fallback_to_translate(self):\n        self.fallback_to_translate = True\n\n    def to_dict(self):\n        return {\n            \"input\": self.input,\n            \"output\": self.output,\n            \"has_error\": self.has_error,\n            \"error_message\": self.error_message,\n            \"placeholder_full_match\": self.placeholder_full_match,\n            \"fallback_to_translate\": self.fallback_to_translate,\n        }\n\n\nclass ILTranslator:\n    stage_name = \"Translate Paragraphs\"\n\n    def __init__(\n        self,\n        translate_engine: BaseTranslator,\n        translation_config: TranslationConfig,\n        tokenizer=None,\n    ):\n        self.translate_engine = translate_engine\n        self.translation_config = translation_config\n        self.font_mapper = FontMapper(translation_config)\n        self.shared_context_cross_split_part = (\n            translation_config.shared_context_cross_split_part\n        )\n        
if tokenizer is None:\n            self.tokenizer = tiktoken.encoding_for_model(\"gpt-4o\")\n        else:\n            self.tokenizer = tokenizer\n\n        # Cache glossaries at initialization\n        self._cached_glossaries = (\n            self.shared_context_cross_split_part.get_glossaries_for_translation(\n                self.translation_config.auto_extract_glossary\n            )\n        )\n\n        self.support_llm_translate = False\n        try:\n            if translate_engine and hasattr(translate_engine, \"do_llm_translate\"):\n                translate_engine.do_llm_translate(None)\n                self.support_llm_translate = True\n        except NotImplementedError:\n            self.support_llm_translate = False\n\n        self.use_as_fallback = False\n        self.add_content_filter_hint_lock = threading.Lock()\n        self.docs = None\n\n        # Pre-compile patterns for placeholder-like tokens that may be hallucinated by LLM.\n        # We only consider the same shapes as our own formula & rich-text placeholders.\n        self._formula_placeholder_pattern = re.compile(\n            self.translate_engine.get_formular_placeholder(r\"\\d+\")[1], re.IGNORECASE\n        )\n        self._style_left_placeholder_pattern = re.compile(\n            self.translate_engine.get_rich_text_left_placeholder(r\"\\d+\")[1],\n            re.IGNORECASE,\n        )\n        self._style_right_placeholder_pattern = re.compile(\n            self.translate_engine.get_rich_text_right_placeholder(r\"\\d+\")[1],\n            re.IGNORECASE,\n        )\n\n    def calc_token_count(self, text: str) -> int:\n        try:\n            return len(self.tokenizer.encode(text, disallowed_special=()))\n        except Exception:\n            return 0\n\n    def translate(self, docs: Document):\n        self.docs = docs\n        tracker = DocumentTranslateTracker()\n\n        if not self.translation_config.shared_context_cross_split_part.first_paragraph:\n            # Try to find 
the first title paragraph\n            title_paragraph = self.find_title_paragraph(docs)\n            self.translation_config.shared_context_cross_split_part.first_paragraph = (\n                copy.deepcopy(title_paragraph)\n            )\n            self.translation_config.shared_context_cross_split_part.recent_title_paragraph = copy.deepcopy(\n                title_paragraph\n            )\n            if title_paragraph:\n                logger.info(f\"Found first title paragraph: {title_paragraph.unicode}\")\n\n        # count total paragraph\n        total = sum(len(page.pdf_paragraph) for page in docs.page)\n        with self.translation_config.progress_monitor.stage_start(\n            self.stage_name,\n            total,\n        ) as pbar:\n            with PriorityThreadPoolExecutor(\n                max_workers=self.translation_config.pool_max_workers,\n            ) as executor:\n                for page in docs.page:\n                    self.process_page(page, executor, pbar, tracker.new_page())\n\n        path = self.translation_config.get_working_file_path(\"translate_tracking.json\")\n\n        if (\n            self.translation_config.debug\n            or self.translation_config.working_dir is not None\n        ):\n            logger.debug(f\"save translate tracking to {path}\")\n            with Path(path).open(\"w\", encoding=\"utf-8\") as f:\n                f.write(tracker.to_json())\n\n    def find_title_paragraph(self, docs: Document) -> PdfParagraph | None:\n        \"\"\"Find the first paragraph with layout_label 'title' in the document.\n\n        Args:\n            docs: The document to search in\n\n        Returns:\n            The first title paragraph found, or None if no title paragraph exists\n        \"\"\"\n        for page in docs.page:\n            for paragraph in page.pdf_paragraph:\n                if paragraph.layout_label == \"title\":\n                    logger.info(f\"Found title paragraph: {paragraph.unicode}\")\n   
                 return paragraph\n        return None\n\n    def process_page(\n        self,\n        page: Page,\n        executor: PriorityThreadPoolExecutor,\n        pbar: tqdm | None = None,\n        tracker: PageTranslateTracker = None,\n    ):\n        self.translation_config.raise_if_cancelled()\n        for paragraph in page.pdf_paragraph:\n            page_font_map = {}\n            for font in page.pdf_font:\n                page_font_map[font.font_id] = font\n            page_xobj_font_map = {}\n            for xobj in page.pdf_xobject:\n                page_xobj_font_map[xobj.xobj_id] = page_font_map.copy()\n                for font in xobj.pdf_font:\n                    page_xobj_font_map[xobj.xobj_id][font.font_id] = font\n            # self.translate_paragraph(paragraph, pbar,tracker.new_paragraph(), page_font_map, page_xobj_font_map)\n            paragraph_token_count = self.calc_token_count(paragraph.unicode)\n            if paragraph.layout_label == \"title\":\n                self.shared_context_cross_split_part.recent_title_paragraph = (\n                    copy.deepcopy(paragraph)\n                )\n            executor.submit(\n                self.translate_paragraph,\n                paragraph,\n                page,\n                pbar,\n                tracker.new_paragraph(),\n                page_font_map,\n                page_xobj_font_map,\n                priority=1048576 - paragraph_token_count,\n                paragraph_token_count=paragraph_token_count,\n                title_paragraph=self.translation_config.shared_context_cross_split_part.first_paragraph,\n                local_title_paragraph=self.translation_config.shared_context_cross_split_part.recent_title_paragraph,\n            )\n\n    class TranslateInput:\n        def __init__(\n            self,\n            unicode: str,\n            placeholders: list[RichTextPlaceholder | FormulaPlaceholder],\n            base_style: PdfStyle = None,\n        ):\n           
 self.unicode = unicode\n            self.placeholders = placeholders\n            self.base_style = base_style\n            # Original placeholder-like tokens extracted from the source text.\n            # Key: exact matched token string; Value: occurrence count.\n            self.original_placeholder_tokens: dict[str, int] = {}\n\n        def set_original_placeholder_tokens(self, tokens: dict[str, int] | None):\n            \"\"\"Attach original placeholder-like tokens from source text.\"\"\"\n            self.original_placeholder_tokens = tokens or {}\n\n        def get_placeholders_hint(self) -> dict[str, str] | None:\n            hint = {}\n            for placeholder in self.placeholders:\n                if isinstance(placeholder, FormulaPlaceholder):\n                    cid_count = 0\n                    for char in placeholder.formula.pdf_character:\n                        if re.match(r\"^\\(cid:\\d+\\)$\", char.char_unicode):\n                            cid_count += 1\n                    if cid_count > len(placeholder.formula.pdf_character) * 0.8:\n                        continue\n\n                    hint[placeholder.placeholder] = get_char_unicode_string(\n                        placeholder.formula.pdf_character\n                    )\n            if hint:\n                return hint\n            return None\n\n    def create_formula_placeholder(\n        self,\n        formula: PdfFormula,\n        formula_id: int,\n        paragraph: PdfParagraph,\n    ):\n        placeholder = self.translate_engine.get_formular_placeholder(formula_id)\n        if isinstance(placeholder, tuple):\n            placeholder, regex_pattern = placeholder\n        else:\n            regex_pattern = re.escape(placeholder)\n        if re.match(regex_pattern, paragraph.unicode, re.IGNORECASE):\n            return self.create_formula_placeholder(formula, formula_id + 1, paragraph)\n\n        return FormulaPlaceholder(formula_id, formula, placeholder, regex_pattern)\n\n   
 def create_rich_text_placeholder(\n        self,\n        composition: PdfSameStyleCharacters,\n        composition_id: int,\n        paragraph: PdfParagraph,\n    ):\n        left_placeholder = self.translate_engine.get_rich_text_left_placeholder(\n            composition_id,\n        )\n        right_placeholder = self.translate_engine.get_rich_text_right_placeholder(\n            composition_id,\n        )\n        if isinstance(left_placeholder, tuple):\n            left_placeholder, left_placeholder_regex_pattern = left_placeholder\n        else:\n            left_placeholder_regex_pattern = re.escape(left_placeholder)\n        if isinstance(right_placeholder, tuple):\n            right_placeholder, right_placeholder_regex_pattern = right_placeholder\n        else:\n            right_placeholder_regex_pattern = re.escape(right_placeholder)\n        if re.match(\n            f\"{left_placeholder_regex_pattern}|{right_placeholder_regex_pattern}\",\n            paragraph.unicode,\n            re.IGNORECASE,\n        ):\n            return self.create_rich_text_placeholder(\n                composition,\n                composition_id + 1,\n                paragraph,\n            )\n\n        return RichTextPlaceholder(\n            composition_id,\n            composition,\n            left_placeholder,\n            right_placeholder,\n            left_placeholder_regex_pattern,\n            right_placeholder_regex_pattern,\n        )\n\n    def get_translate_input(\n        self,\n        paragraph: PdfParagraph,\n        page_font_map: dict[str, PdfFont] = None,\n        disable_rich_text_translate: bool | None = None,\n    ):\n        if not paragraph.pdf_paragraph_composition:\n            return\n\n        # Skip pure numeric paragraphs\n        if is_pure_numeric_paragraph(paragraph):\n            return None\n\n        # Skip paragraphs with only placeholders\n        if is_placeholder_only_paragraph(paragraph):\n            return None\n\n        # 
Extract original placeholder-like tokens from the raw paragraph text\n        original_placeholder_tokens: dict[str, int] = {}\n\n        def scan_placeholder_tokens(text: str, tokens: dict[str, int]):\n            for pattern in (\n                self._formula_placeholder_pattern,\n                self._style_left_placeholder_pattern,\n                self._style_right_placeholder_pattern,\n            ):\n                for match in pattern.finditer(text):\n                    token = match.group(0)\n                    tokens[token] = tokens.get(token, 0) + 1\n\n        if paragraph.unicode:\n            scan_placeholder_tokens(paragraph.unicode, original_placeholder_tokens)\n        if len(paragraph.pdf_paragraph_composition) == 1:\n            # If the whole paragraph has only one composition, return it directly; no placeholders are needed\n            composition = paragraph.pdf_paragraph_composition[0]\n            if (\n                composition.pdf_line\n                or composition.pdf_same_style_characters\n                or composition.pdf_character\n            ):\n                translate_input = self.TranslateInput(\n                    paragraph.unicode,\n                    [],\n                    paragraph.pdf_style,\n                )\n                translate_input.set_original_placeholder_tokens(\n                    original_placeholder_tokens,\n                )\n                return translate_input\n            elif composition.pdf_formula:\n                # Pure formulas do not need translation\n                return None\n            elif composition.pdf_same_style_unicode_characters:\n                # Debug-inserted characters; do not translate\n                return None\n            else:\n                logger.error(\n                    f\"Unknown composition type. \"\n                    f\"Composition: {composition}. \"\n                    f\"Paragraph: {paragraph}. 
\",\n                )\n                return None\n\n        # 如果没有指定 disable_rich_text_translate，使用配置中的值\n        if disable_rich_text_translate is None:\n            disable_rich_text_translate = (\n                self.translation_config.disable_rich_text_translate\n            )\n\n        placeholder_id = 1\n        placeholders = []\n        chars = []\n        for composition in paragraph.pdf_paragraph_composition:\n            if composition.pdf_line:\n                chars.extend(composition.pdf_line.pdf_character)\n            elif composition.pdf_formula:\n                formula_placeholder = self.create_formula_placeholder(\n                    composition.pdf_formula,\n                    placeholder_id,\n                    paragraph,\n                )\n                placeholders.append(formula_placeholder)\n                # 公式只需要一个占位符，所以 id+1\n                placeholder_id = formula_placeholder.id + 1\n                chars.extend(formula_placeholder.placeholder)\n            elif composition.pdf_character:\n                chars.append(composition.pdf_character)\n            elif composition.pdf_same_style_characters:\n                if disable_rich_text_translate:\n                    # 如果禁用富文本翻译，直接添加字符\n                    chars.extend(composition.pdf_same_style_characters.pdf_character)\n                    continue\n\n                fonta = self.font_mapper.map(\n                    page_font_map[\n                        composition.pdf_same_style_characters.pdf_style.font_id\n                    ],\n                    \"1\",\n                )\n                fontb = self.font_mapper.map(\n                    page_font_map[paragraph.pdf_style.font_id],\n                    \"1\",\n                )\n                if (\n                    # 样式和段落基准样式一致，无需占位符\n                    is_same_style(\n                        composition.pdf_same_style_characters.pdf_style,\n                        paragraph.pdf_style,\n                  
  )\n                    # Font size difference is within 0.7-1.3, possibly a drop-cap (enlarged initial) effect; no placeholder needed\n                    or is_same_style_except_size(\n                        composition.pdf_same_style_characters.pdf_style,\n                        paragraph.pdf_style,\n                    )\n                    or (\n                        # Everything except the font matches the base style, and both fonts map to the same font; no placeholder needed\n                        is_same_style_except_font(\n                            composition.pdf_same_style_characters.pdf_style,\n                            paragraph.pdf_style,\n                        )\n                        and fonta\n                        and fontb\n                        and fonta.font_id == fontb.font_id\n                    )\n                    # or len(composition.pdf_same_style_characters.pdf_character) == 1\n                ):\n                    chars.extend(composition.pdf_same_style_characters.pdf_character)\n                    continue\n                placeholder = self.create_rich_text_placeholder(\n                    composition.pdf_same_style_characters,\n                    placeholder_id,\n                    paragraph,\n                )\n                placeholders.append(placeholder)\n                # A style needs a left and a right placeholder, so id+2\n                placeholder_id = placeholder.id + 2\n                chars.append(placeholder.left_placeholder)\n                chars.extend(composition.pdf_same_style_characters.pdf_character)\n                chars.append(placeholder.right_placeholder)\n            else:\n                logger.error(\n                    \"Unexpected PdfParagraphComposition type \"\n                    \"in PdfParagraph during translation. \"\n                    f\"Composition: {composition}. \"\n                    f\"Paragraph: {paragraph}. 
\",\n                )\n                return None\n\n            # 如果占位符数量超过阈值，且未禁用富文本翻译，则递归调用并禁用富文本翻译\n            if len(placeholders) > 40 and not disable_rich_text_translate:\n                logger.warning(\n                    f\"Too many placeholders ({len(placeholders)}) in paragraph[{paragraph.debug_id}], \"\n                    \"disabling rich text translation for this paragraph\",\n                )\n                return self.get_translate_input(paragraph, page_font_map, True)\n\n        text = get_char_unicode_string(chars)\n        translate_input = self.TranslateInput(text, placeholders, paragraph.pdf_style)\n        translate_input.set_original_placeholder_tokens(original_placeholder_tokens)\n        return translate_input\n\n    def process_formula(\n        self,\n        formula: PdfFormula,\n        formula_id: int,\n        paragraph: PdfParagraph,\n    ):\n        placeholder = self.create_formula_placeholder(formula, formula_id, paragraph)\n        if placeholder.placeholder in paragraph.unicode:\n            return self.process_formula(formula, formula_id + 1, paragraph)\n\n        return placeholder\n\n    def process_composition(\n        self,\n        composition: PdfSameStyleCharacters,\n        composition_id: int,\n        paragraph: PdfParagraph,\n    ):\n        placeholder = self.create_rich_text_placeholder(\n            composition,\n            composition_id,\n            paragraph,\n        )\n        if (\n            placeholder.left_placeholder in paragraph.unicode\n            or placeholder.right_placeholder in paragraph.unicode\n        ):\n            return self.process_composition(\n                composition,\n                composition_id + 1,\n                paragraph,\n            )\n\n        return placeholder\n\n    def parse_translate_output(\n        self,\n        input_text: TranslateInput,\n        output: str,\n        tracker: ParagraphTranslateTracker | None = None,\n        
llm_translate_tracker: LLMTranslateTracker | None = None,\n    ) -> [PdfParagraphComposition]:\n        result = []\n\n        # 如果没有占位符，直接返回整个文本\n        if not input_text.placeholders:\n            comp = PdfParagraphComposition()\n            comp.pdf_same_style_unicode_characters = PdfSameStyleUnicodeCharacters()\n            comp.pdf_same_style_unicode_characters.unicode = output\n            comp.pdf_same_style_unicode_characters.pdf_style = input_text.base_style\n            if llm_translate_tracker:\n                llm_translate_tracker.set_placeholder_full_match()\n            return [comp]\n\n        # 构建正则表达式模式\n        patterns = []\n        placeholder_patterns = []\n        placeholder_map = {}\n\n        for placeholder in input_text.placeholders:\n            if isinstance(placeholder, FormulaPlaceholder):\n                # 转义特殊字符\n                # pattern = re.escape(placeholder.placeholder)\n                pattern = placeholder.regex_pattern\n                patterns.append(f\"({pattern})\")\n                placeholder_patterns.append(f\"({pattern})\")\n                placeholder_map[placeholder.placeholder] = placeholder\n            else:\n                left = placeholder.left_regex_pattern\n                right = placeholder.right_regex_pattern\n                patterns.append(f\"({left}.*?{right})\")\n                placeholder_patterns.append(f\"({left})\")\n                placeholder_patterns.append(f\"({right})\")\n                placeholder_map[placeholder.left_placeholder] = placeholder\n        all_match = True\n        for pattern in patterns:\n            if not re.search(pattern, output, flags=re.IGNORECASE):\n                all_match = False\n                break\n        if all_match:\n            if llm_translate_tracker:\n                llm_translate_tracker.set_placeholder_full_match()\n        else:\n            logger.debug(f\"Failed to match all placeholder for {input_text.unicode}\")\n        # 合并所有模式\n        
combined_pattern = \"|\".join(patterns)\n        combined_placeholder_pattern = \"|\".join(placeholder_patterns)\n        # Build allowed placeholder tokens: originals from source + placeholders we injected.\n        allowed_placeholder_tokens: set[str] = set()\n        if getattr(input_text, \"original_placeholder_tokens\", None):\n            allowed_placeholder_tokens.update(input_text.original_placeholder_tokens)\n        for placeholder in input_text.placeholders:\n            if isinstance(placeholder, FormulaPlaceholder):\n                allowed_placeholder_tokens.add(placeholder.placeholder)\n            else:\n                allowed_placeholder_tokens.add(placeholder.left_placeholder)\n                allowed_placeholder_tokens.add(placeholder.right_placeholder)\n\n        def remove_placeholder(text: str):\n            \"\"\"Remove placeholder artifacts and hallucinated placeholder-like tokens.\"\"\"\n            # First, remove any leftover placeholders built from our own regex patterns.\n            if combined_placeholder_pattern:\n                text = re.sub(\n                    combined_placeholder_pattern,\n                    \"\",\n                    text,\n                    flags=re.IGNORECASE,\n                )\n\n            # Then, detect placeholder-like tokens of the same shapes as our own\n            # formula and rich-text placeholders. 
Only keep those in the allowed set.\n            def _replace_token(match: re.Match) -> str:\n                token = match.group(0)\n                if token in allowed_placeholder_tokens:\n                    return token\n                if tracker is not None:\n                    tracker.record_removed_hallucinated_placeholder(token)\n                return \"\"\n\n            text = self._formula_placeholder_pattern.sub(_replace_token, text)\n            text = self._style_left_placeholder_pattern.sub(_replace_token, text)\n            text = self._style_right_placeholder_pattern.sub(_replace_token, text)\n            return text\n\n        # Find all matches\n        last_end = 0\n        for match in re.finditer(combined_pattern, output, flags=re.IGNORECASE):\n            # Handle plain text before the match\n            if match.start() > last_end:\n                text = output[last_end : match.start()]\n                if text:\n                    comp = PdfParagraphComposition()\n                    comp.pdf_same_style_unicode_characters = (\n                        PdfSameStyleUnicodeCharacters()\n                    )\n                    comp.pdf_same_style_unicode_characters.unicode = remove_placeholder(\n                        text,\n                    )\n                    comp.pdf_same_style_unicode_characters.pdf_style = (\n                        input_text.base_style\n                    )\n                    result.append(comp)\n\n            matched_text = match.group(0)\n\n            # Handle the placeholder\n            if any(\n                isinstance(p, FormulaPlaceholder)\n                and re.match(f\"^{p.regex_pattern}$\", matched_text, re.IGNORECASE)\n                for p in input_text.placeholders\n            ):\n                # Handle a formula placeholder\n                placeholder = next(\n                    p\n                    for p in input_text.placeholders\n                    if isinstance(p, FormulaPlaceholder)\n                    and re.match(f\"^{p.regex_pattern}$\", 
matched_text, re.IGNORECASE)\n                )\n                comp = PdfParagraphComposition()\n                comp.pdf_formula = placeholder.formula\n                result.append(comp)\n            else:\n                # 处理富文本占位符\n                placeholder = next(\n                    p\n                    for p in input_text.placeholders\n                    if not isinstance(p, FormulaPlaceholder)\n                    and re.match(\n                        f\"^{p.left_regex_pattern}\", matched_text, re.IGNORECASE\n                    )\n                )\n                text = re.match(\n                    f\"^{placeholder.left_regex_pattern}(.*){placeholder.right_regex_pattern}$\",\n                    matched_text,\n                    re.IGNORECASE,\n                ).group(1)\n\n                if isinstance(\n                    placeholder.composition,\n                    PdfSameStyleCharacters,\n                ) and text.replace(\" \", \"\") == \"\".join(\n                    x.char_unicode for x in placeholder.composition.pdf_character\n                ).replace(\n                    \" \",\n                    \"\",\n                ):\n                    comp = PdfParagraphComposition(\n                        pdf_same_style_characters=placeholder.composition,\n                    )\n                else:\n                    comp = PdfParagraphComposition()\n                    comp.pdf_same_style_unicode_characters = (\n                        PdfSameStyleUnicodeCharacters()\n                    )\n                    comp.pdf_same_style_unicode_characters.pdf_style = (\n                        placeholder.composition.pdf_style\n                    )\n                    comp.pdf_same_style_unicode_characters.unicode = remove_placeholder(\n                        text,\n                    )\n                result.append(comp)\n\n            last_end = match.end()\n\n        # 处理最后的普通文本\n        if last_end < len(output):\n            
text = output[last_end:]\n            if text:\n                comp = PdfParagraphComposition()\n                comp.pdf_same_style_unicode_characters = PdfSameStyleUnicodeCharacters()\n                comp.pdf_same_style_unicode_characters.unicode = remove_placeholder(\n                    text,\n                )\n                comp.pdf_same_style_unicode_characters.pdf_style = input_text.base_style\n                result.append(comp)\n\n        return result\n\n    def pre_translate_paragraph(\n        self,\n        paragraph: PdfParagraph,\n        tracker: ParagraphTranslateTracker,\n        page_font_map: dict[str, PdfFont],\n        xobj_font_map: dict[int, dict[str, PdfFont]],\n    ):\n        \"\"\"Pre-translation processing: prepare text for translation.\"\"\"\n        if paragraph.vertical:\n            return None, None\n        tracker.set_pdf_unicode(paragraph.unicode)\n        if paragraph.xobj_id in xobj_font_map:\n            page_font_map = xobj_font_map[paragraph.xobj_id]\n        disable_rich_text_translate = (\n            self.translation_config.disable_rich_text_translate\n        )\n        if not self.support_llm_translate:\n            disable_rich_text_translate = True\n\n        translate_input = self.get_translate_input(\n            paragraph, page_font_map, disable_rich_text_translate\n        )\n        if not translate_input:\n            return None, None\n        tracker.set_input(translate_input.unicode)\n        tracker.set_placeholders(translate_input.placeholders)\n        tracker.set_original_placeholders(\n            getattr(translate_input, \"original_placeholder_tokens\", None),\n        )\n        text = translate_input.unicode\n        if len(text) < self.translation_config.min_text_length:\n            logger.debug(\n                f\"Text too short to translate, skip. Text: {text}. 
Paragraph id: {paragraph.debug_id}.\"\n            )\n            return None, None\n        return text, translate_input\n\n    def post_translate_paragraph(\n        self,\n        paragraph: PdfParagraph,\n        tracker: ParagraphTranslateTracker,\n        translate_input,\n        translated_text: str,\n    ):\n        \"\"\"Post-translation processing: update paragraph with translated text.\"\"\"\n        tracker.set_output(translated_text)\n        # Compare against the source text. translate_input is normally a\n        # TranslateInput object, which would not compare equal to a str by default.\n        source_text = getattr(translate_input, \"unicode\", translate_input)\n        if translated_text == source_text:\n            if llm_translate_tracker := tracker.last_llm_translate_tracker():\n                llm_translate_tracker.set_placeholder_full_match()\n            return False\n        paragraph.unicode = translated_text\n        paragraph.pdf_paragraph_composition = self.parse_translate_output(\n            translate_input,\n            translated_text,\n            tracker,\n            tracker.last_llm_translate_tracker(),\n        )\n        for composition in paragraph.pdf_paragraph_composition:\n            if (\n                composition.pdf_same_style_unicode_characters\n                and composition.pdf_same_style_unicode_characters.pdf_style is None\n            ):\n                composition.pdf_same_style_unicode_characters.pdf_style = (\n                    paragraph.pdf_style\n                )\n        return True\n\n    def _build_role_block(self) -> str:\n        \"\"\"Build the role block for LLM prompt.\n\n        Returns:\n            Role block string with custom_system_prompt or default role description.\n        \"\"\"\n        custom_prompt = getattr(self.translation_config, \"custom_system_prompt\", None)\n        if custom_prompt:\n            role_block = custom_prompt.strip()\n            if \"Follow all rules strictly.\" not in role_block:\n                if not role_block.endswith(\"\\n\"):\n                    role_block += \"\\n\"\n                role_block += \"Follow all rules strictly.\"\n        else:\n            role_block = (\n      
          f\"You are a professional {self.translation_config.lang_out} native translator who needs to fluently translate text \"\n                f\"into {self.translation_config.lang_out}.\\n\\n\"\n                \"Follow all rules strictly.\"\n            )\n        return role_block\n\n    def _build_context_block(\n        self,\n        title_paragraph: PdfParagraph | None = None,\n        local_title_paragraph: PdfParagraph | None = None,\n        translate_input: TranslateInput | None = None,\n    ) -> str:\n        \"\"\"Build the context/hints block for LLM prompt.\n\n        Args:\n            title_paragraph: First title paragraph in the document\n            local_title_paragraph: Most recent title paragraph\n            translate_input: TranslateInput containing placeholder hints\n\n        Returns:\n            Context block string, empty if no context hints available\n        \"\"\"\n        context_lines: list[str] = []\n        hint_idx = 1\n\n        if title_paragraph:\n            context_lines.append(\n                f\"{hint_idx}. First title in the full text: {title_paragraph.unicode}\"\n            )\n            hint_idx += 1\n\n        if local_title_paragraph:\n            is_different_from_global = True\n            if title_paragraph:\n                if local_title_paragraph.debug_id == title_paragraph.debug_id:\n                    is_different_from_global = False\n\n            if is_different_from_global:\n                context_lines.append(\n                    f\"{hint_idx}. The most recent title is: {local_title_paragraph.unicode}\"\n                )\n                hint_idx += 1\n\n        if translate_input and self.translation_config.add_formula_placehold_hint:\n            placeholders_hint = translate_input.get_placeholders_hint()\n            if placeholders_hint:\n                context_lines.append(\n                    f\"{hint_idx}. 
Formula placeholder hint:\\n{placeholders_hint}\"\n                )\n\n        if context_lines:\n            return \"## Context / Hints\\n\" + \"\\n\".join(context_lines) + \"\\n\"\n        return \"\"\n\n    def _build_glossary_block(self, text: str) -> str:\n        \"\"\"Build the glossary block for LLM prompt.\n\n        Args:\n            text: Text to match against glossary entries\n\n        Returns:\n            Glossary block string with tables, empty if no active glossary entries\n        \"\"\"\n        if not self._cached_glossaries:\n            return \"\"\n\n        glossary_entries_per_glossary: dict[str, list[tuple[str, str]]] = {}\n\n        for glossary in self._cached_glossaries:\n            active_entries = glossary.get_active_entries_for_text(text)\n            if active_entries:\n                glossary_entries_per_glossary[glossary.name] = sorted(active_entries)\n\n        if not glossary_entries_per_glossary:\n            return \"\"\n\n        glossary_block_lines: list[str] = [\n            \"## Glossary\",\n            \"\",\n            \"Always use the glossary's **Target Term** for any occurrence of its **Source Term** \"\n            \"(including variants, inside tags, or broken across lines).\",\n            \"\",\n            \"Unlisted terms are translated naturally.\",\n            \"\",\n        ]\n\n        for glossary_name, entries in glossary_entries_per_glossary.items():\n            glossary_block_lines.append(f\"### Glossary: {glossary_name}\")\n            glossary_block_lines.append(\"\")\n            glossary_block_lines.append(\n                \"| Source Term | Target Term |\\n|-------------|-------------|\"\n            )\n            for original_source, target_text in entries:\n                glossary_block_lines.append(f\"| {original_source} | {target_text} |\")\n            glossary_block_lines.append(\"\")\n\n        return \"\\n\".join(glossary_block_lines)\n\n    def generate_prompt_for_llm(\n        
self,\n        text: str,\n        title_paragraph: PdfParagraph | None = None,\n        local_title_paragraph: PdfParagraph | None = None,\n        translate_input: TranslateInput | None = None,\n    ):\n        \"\"\"Generate LLM prompt using template-based approach.\n\n        Args:\n            text: Text to be translated\n            title_paragraph: First title paragraph in the document\n            local_title_paragraph: Most recent title paragraph\n            translate_input: TranslateInput containing placeholder information\n\n        Returns:\n            Final LLM prompt string\n        \"\"\"\n        role_block = self._build_role_block()\n        context_block = self._build_context_block(\n            title_paragraph, local_title_paragraph, translate_input\n        )\n        glossary_block = self._build_glossary_block(text)\n\n        return PROMPT_TEMPLATE.substitute(\n            role_block=role_block,\n            glossary_block=glossary_block,\n            context_block=context_block,\n            lang_out=self.translation_config.lang_out,\n            text_to_translate=text,\n        )\n\n    def add_content_filter_hint(self, page: Page, paragraph: PdfParagraph):\n        with self.add_content_filter_hint_lock:\n            new_box = il_version_1.Box(\n                x=paragraph.box.x,\n                y=paragraph.box.y2,\n                x2=paragraph.box.x2,\n                y2=paragraph.box.y2 + 1.1,\n            )\n            page.pdf_paragraph.append(\n                self._create_text(\n                    \"翻译服务检测到内容可能包含不安全或敏感内容，请您避免翻译敏感内容，感谢您的配合。\",\n                    GRAY80,\n                    new_box,\n                    1,\n                )\n            )\n            logger.info(\"success add content filter hint\")\n\n    def _create_text(\n        self,\n        text: str,\n        color: GraphicState,\n        box: il_version_1.Box,\n        font_size: float = 4,\n    ):\n        style = il_version_1.PdfStyle(\n            
font_id=\"base\",\n            font_size=font_size,\n            graphic_state=color,\n        )\n        return il_version_1.PdfParagraph(\n            first_line_indent=False,\n            box=box,\n            vertical=False,\n            pdf_style=style,\n            unicode=text,\n            pdf_paragraph_composition=[\n                il_version_1.PdfParagraphComposition(\n                    pdf_same_style_unicode_characters=il_version_1.PdfSameStyleUnicodeCharacters(\n                        unicode=text,\n                        pdf_style=style,\n                        debug_info=True,\n                    ),\n                ),\n            ],\n            xobj_id=-1,\n        )\n\n    def translate_paragraph(\n        self,\n        paragraph: PdfParagraph,\n        page: Page,\n        pbar: tqdm | None = None,\n        tracker: ParagraphTranslateTracker = None,\n        page_font_map: dict[str, PdfFont] = None,\n        xobj_font_map: dict[int, dict[str, PdfFont]] = None,\n        paragraph_token_count: int = 0,\n        title_paragraph: PdfParagraph | None = None,\n        local_title_paragraph: PdfParagraph | None = None,\n    ):\n        \"\"\"Translate a paragraph using pre and post processing functions.\"\"\"\n        self.translation_config.raise_if_cancelled()\n        with PbarContext(pbar):\n            try:\n                if self.use_as_fallback:\n                    # il translator llm only modifies unicode in some situations\n                    paragraph.unicode = get_paragraph_unicode(paragraph)\n                # Pre-translation processing\n                text, translate_input = self.pre_translate_paragraph(\n                    paragraph, tracker, page_font_map, xobj_font_map\n                )\n                if text is None:\n                    return\n                llm_translate_tracker = tracker.new_llm_translate_tracker()\n                # Perform translation\n                if self.support_llm_translate:\n               
     llm_prompt = self.generate_prompt_for_llm(\n                        text,\n                        title_paragraph,\n                        local_title_paragraph,\n                        translate_input,\n                    )\n                    llm_translate_tracker.set_input(llm_prompt)\n                    translated_text = self.translate_engine.llm_translate(\n                        llm_prompt,\n                        rate_limit_params={\n                            \"paragraph_token_count\": paragraph_token_count\n                        },\n                    )\n                    llm_translate_tracker.set_output(translated_text)\n                else:\n                    translated_text = self.translate_engine.translate(\n                        text,\n                        rate_limit_params={\n                            \"paragraph_token_count\": paragraph_token_count\n                        },\n                    )\n                translated_text = re.sub(r\"[. 。…，]{20,}\", \".\", translated_text)\n\n                # Post-translation processing\n                self.post_translate_paragraph(\n                    paragraph, tracker, translate_input, translated_text\n                )\n            except ContentFilterError as e:\n                logger.warning(f\"ContentFilterError: {e.message}\")\n                self.add_content_filter_hint(page, paragraph)\n                return\n            except Exception as e:\n                logger.exception(\n                    f\"Error translating paragraph. Paragraph: {paragraph.debug_id} ({paragraph.unicode}). Error: {e}. \",\n                )\n                # ignore error and continue\n                return\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/il_translator_llm_only.py",
    "content": "import copy\nimport json\nimport logging\nimport re\nfrom pathlib import Path\nfrom string import Template\n\nimport Levenshtein\nimport tiktoken\nfrom tqdm import tqdm\n\nfrom babeldoc.format.pdf.document_il import Document\nfrom babeldoc.format.pdf.document_il import Page\nfrom babeldoc.format.pdf.document_il import PdfFont\nfrom babeldoc.format.pdf.document_il import PdfParagraph\nfrom babeldoc.format.pdf.document_il.midend import il_translator\nfrom babeldoc.format.pdf.document_il.midend.il_translator import (\n    DocumentTranslateTracker,\n)\nfrom babeldoc.format.pdf.document_il.midend.il_translator import ILTranslator\nfrom babeldoc.format.pdf.document_il.midend.il_translator import PageTranslateTracker\nfrom babeldoc.format.pdf.document_il.midend.il_translator import (\n    ParagraphTranslateTracker,\n)\nfrom babeldoc.format.pdf.document_il.utils.fontmap import FontMapper\nfrom babeldoc.format.pdf.document_il.utils.paragraph_helper import is_cid_paragraph\nfrom babeldoc.format.pdf.document_il.utils.paragraph_helper import (\n    is_placeholder_only_paragraph,\n)\nfrom babeldoc.format.pdf.document_il.utils.paragraph_helper import (\n    is_pure_numeric_paragraph,\n)\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\nfrom babeldoc.translator.translator import BaseTranslator\nfrom babeldoc.utils.priority_thread_pool_executor import PriorityThreadPoolExecutor\n\nlogger = logging.getLogger(__name__)\n\n\nPROMPT_TEMPLATE = Template(\n    \"\"\"$role_block\n\n## Structure Rules\n1. Keep **the same number of paragraphs as the input**.\n2. Input paragraphs may be **sliced pieces of the same original paragraph**.  \n   → You MUST treat each input paragraph **as an independent, fixed unit**.  \n   → Do NOT merge paragraphs, split paragraphs, or move content between paragraphs.\n3. 
Inside each paragraph, you may adjust word order for fluency, but:\n   - Do NOT change the meaning.\n   - Do NOT move placeholders, tags, or code outside their paragraph.\n4. Translate ALL human-readable content into $lang_out.\n\n## Do NOT Modify\n- Tags (e.g., <style>, <b>, <code>): keep them exactly the same.  \n  *Translate tag-internal text except code blocks (<code>…</code>)*.\n- Placeholders: `{v1}`, `{name}`, `%s`, `%d`, `[[...]]`, `%%...%%` — keep exactly unchanged.\n- JSON keys or structure.\n\n$glossary_usage_rules_block\n## Output Format\nReturn a JSON array of the same length.  \nFor each item:\n- Keep the same \"id\" and remove other fields like \"input\" and \"layout_label\".\n- Add \"output\" with the translated text only.\n- No extra text, no ```json blocks.\n\n## Style\n- Produce fluent, professional $lang_out.\n- Preserve punctuation unless needed for target language fluency.\n\n### Example\nInput:\n[\n    {\n    \"id\": 0,\n    \"input\": \"{v1}<style id='2'>hello</style>, world!\",\n    \"layout_label\": \"text\"\n    }\n]\nOutput:\n[\n    {\n    \"id\": 0,\n    \"output\": \"{v1}<style id='2'>你好</style>，世界！\"\n    }\n]\n\n$contextual_hints_block\n\n$glossary_tables_block\n\n## Here is the input:\n\n$json_input_str\"\"\"\n)\n\n\nclass BatchParagraph:\n    def __init__(\n        self,\n        paragraphs: list[PdfParagraph],\n        pages: list[Page],\n        page_tracker: PageTranslateTracker,\n    ):\n        self.paragraphs = paragraphs\n        self.pages = pages\n        self.trackers = [page_tracker.new_paragraph() for _ in paragraphs]\n\n\nclass ILTranslatorLLMOnly:\n    stage_name = \"Translate Paragraphs\"\n\n    def __init__(\n        self,\n        translate_engine: BaseTranslator,\n        translation_config: TranslationConfig,\n        tokenizer=None,\n    ):\n        self.translate_engine = translate_engine\n        self.translation_config = translation_config\n        self.font_mapper = FontMapper(translation_config)\n        
self.shared_context_cross_split_part = (\n            translation_config.shared_context_cross_split_part\n        )\n\n        if tokenizer is None:\n            self.tokenizer = tiktoken.encoding_for_model(\"gpt-4o\")\n        else:\n            self.tokenizer = tokenizer\n\n        # Cache glossaries at initialization\n        self._cached_glossaries = (\n            self.shared_context_cross_split_part.get_glossaries_for_translation(\n                translation_config.auto_extract_glossary\n            )\n        )\n\n        self.il_translator = ILTranslator(\n            translate_engine=translate_engine,\n            translation_config=translation_config,\n            tokenizer=self.tokenizer,\n        )\n        self.il_translator.use_as_fallback = True\n        try:\n            self.translate_engine.do_llm_translate(None)\n        except NotImplementedError as e:\n            raise ValueError(\"LLM translator not supported\") from e\n\n        self.ok_count = 0\n        self.fallback_count = 0\n        self.total_count = 0\n\n    def calc_token_count(self, text: str) -> int:\n        try:\n            return len(self.tokenizer.encode(text, disallowed_special=()))\n        except Exception:\n            return 0\n\n    def find_title_paragraph(self, docs: Document) -> PdfParagraph | None:\n        \"\"\"Find the first paragraph with layout_label 'title' in the document.\n\n        Args:\n            docs: The document to search in\n\n        Returns:\n            The first title paragraph found, or None if no title paragraph exists\n        \"\"\"\n        for page in docs.page:\n            for paragraph in page.pdf_paragraph:\n                if paragraph.layout_label == \"title\":\n                    logger.info(f\"Found title paragraph: {paragraph.unicode}\")\n                    return paragraph\n        return None\n\n    def translate(self, docs: Document) -> None:\n        self.il_translator.docs = docs\n        tracker = 
DocumentTranslateTracker()\n        self.mid = 0\n\n        if not self.translation_config.shared_context_cross_split_part.first_paragraph:\n            # Try to find the first title paragraph\n            title_paragraph = self.find_title_paragraph(docs)\n            self.translation_config.shared_context_cross_split_part.first_paragraph = (\n                copy.deepcopy(title_paragraph)\n            )\n            self.translation_config.shared_context_cross_split_part.recent_title_paragraph = copy.deepcopy(\n                title_paragraph\n            )\n            if title_paragraph:\n                logger.info(f\"Found first title paragraph: {title_paragraph.unicode}\")\n\n        # count total paragraph\n        total = sum(\n            [\n                len(\n                    [\n                        p\n                        for p in page.pdf_paragraph\n                        if p.debug_id is not None and p.unicode is not None\n                    ]\n                )\n                for page in docs.page\n            ]\n        )\n        translated_ids = set()\n        with self.translation_config.progress_monitor.stage_start(\n            self.stage_name,\n            total,\n        ) as pbar:\n            with PriorityThreadPoolExecutor(\n                max_workers=self.translation_config.pool_max_workers,\n            ) as executor2:\n                with PriorityThreadPoolExecutor(\n                    max_workers=self.translation_config.pool_max_workers,\n                ) as executor:\n                    self.process_cross_page_paragraph(\n                        docs,\n                        executor,\n                        pbar,\n                        tracker,\n                        executor2,\n                        translated_ids,\n                    )\n                    # Cross-column detection per page (after cross-page processing)\n                    for page in docs.page:\n                        
self.process_cross_column_paragraph(\n                            page,\n                            executor,\n                            pbar,\n                            tracker,\n                            executor2,\n                            translated_ids,\n                        )\n                    for page in docs.page:\n                        self.process_page(\n                            page,\n                            executor,\n                            pbar,\n                            tracker.new_page(),\n                            executor2,\n                            translated_ids,\n                        )\n\n        path = self.translation_config.get_working_file_path(\"translate_tracking.json\")\n\n        if (\n            self.translation_config.debug\n            or self.translation_config.working_dir is not None\n        ):\n            logger.debug(f\"save translate tracking to {path}\")\n            with Path(path).open(\"w\", encoding=\"utf-8\") as f:\n                f.write(tracker.to_json())\n        logger.info(\n            f\"Translation completed. 
Total: {self.total_count}, Successful: {self.ok_count}, Fallback: {self.fallback_count}\"\n        )\n\n    def _is_body_text_paragraph(self, paragraph: PdfParagraph) -> bool:\n        \"\"\"Check whether a paragraph is body text.\n\n        Body text currently means a layout_label of 'text', 'plain text',\n        or 'paragraph_hybrid'.\n\n        Args:\n            paragraph: PDF paragraph to check\n\n        Returns:\n            True if this is a body text paragraph, False otherwise\n        \"\"\"\n        return paragraph.layout_label in (\n            \"text\",\n            \"plain text\",\n            \"paragraph_hybrid\",\n        )\n\n    def _should_translate_paragraph(\n        self,\n        paragraph: PdfParagraph,\n        translated_ids: set[int] | None = None,\n        require_body_text: bool = False,\n    ) -> bool:\n        \"\"\"Check if a paragraph should be translated based on common filtering criteria.\n\n        Args:\n            paragraph: PDF paragraph to check\n            translated_ids: Set of already translated paragraph IDs\n            require_body_text: Whether to additionally check if paragraph is body text\n\n        Returns:\n            True if paragraph should be translated, False otherwise\n        \"\"\"\n        # Basic validation checks\n        if paragraph.debug_id is None or paragraph.unicode is None:\n            return False\n\n        # Check if already translated\n        if translated_ids is not None and id(paragraph) in translated_ids:\n            return False\n\n        # CID paragraph check\n        if is_cid_paragraph(paragraph):\n            return False\n\n        # Minimum length check\n        if len(paragraph.unicode) < self.translation_config.min_text_length:\n            return False\n\n        # Body text check if requested\n        if require_body_text and not self._is_body_text_paragraph(paragraph):\n            return False\n\n        return True\n\n    def _filter_paragraphs(\n        self,\n        page: Page,\n        translated_ids: set[int] | None = None,\n        require_body_text: bool = False,\n    ) 
-> list[PdfParagraph]:\n        \"\"\"Get list of paragraphs that should be translated from a page.\n\n        Args:\n            page: Page to get paragraphs from\n            translated_ids: Set of already translated paragraph IDs\n            require_body_text: Whether to filter for body text paragraphs only\n\n        Returns:\n            List of paragraphs that should be translated\n        \"\"\"\n        return [\n            paragraph\n            for paragraph in page.pdf_paragraph\n            if self._should_translate_paragraph(\n                paragraph, translated_ids, require_body_text\n            )\n        ]\n\n    def _build_font_maps(\n        self, page: Page\n    ) -> tuple[dict[str, PdfFont], dict[int, dict[str, PdfFont]]]:\n        \"\"\"Build font maps for a page.\n\n        Args:\n            page: The page to build font maps for\n\n        Returns:\n            Tuple of (page_font_map, page_xobj_font_map)\n        \"\"\"\n        page_font_map = {}\n        for font in page.pdf_font:\n            page_font_map[font.font_id] = font\n\n        page_xobj_font_map = {}\n        for xobj in page.pdf_xobject:\n            page_xobj_font_map[xobj.xobj_id] = page_font_map.copy()\n            for font in xobj.pdf_font:\n                page_xobj_font_map[xobj.xobj_id][font.font_id] = font\n\n        return page_font_map, page_xobj_font_map\n\n    def process_cross_page_paragraph(\n        self,\n        docs: Document,\n        executor: PriorityThreadPoolExecutor,\n        pbar: tqdm | None = None,\n        tracker: DocumentTranslateTracker | None = None,\n        executor2: PriorityThreadPoolExecutor | None = None,\n        translated_ids: set[int] | None = None,\n    ):\n        \"\"\"Process cross-page paragraphs by combining last body text paragraph of current page\n        with first body text paragraph of next page.\n\n        Args:\n            docs: Document containing pages to process\n            executor: Thread pool executor for 
translation tasks\n            pbar: Progress bar for tracking translation progress\n            tracker: Page translation tracker\n            executor2: Secondary executor for fallback translation\n            translated_ids: Set of already translated paragraph IDs\n        \"\"\"\n        self.translation_config.raise_if_cancelled()\n\n        if tracker is None:\n            tracker = DocumentTranslateTracker()\n\n        if translated_ids is None:\n            translated_ids = set()\n\n        # Process adjacent page pairs\n        for i in range(len(docs.page) - 1):\n            page_curr = docs.page[i]\n            page_next = docs.page[i + 1]\n\n            # Find body text paragraphs in current page\n            curr_body_paragraphs = self._filter_paragraphs(\n                page_curr, translated_ids, require_body_text=True\n            )\n\n            # Find body text paragraphs in next page\n            next_body_paragraphs = self._filter_paragraphs(\n                page_next, translated_ids, require_body_text=True\n            )\n\n            # Get last paragraph from current page and first paragraph from next page\n            if not curr_body_paragraphs or not next_body_paragraphs:\n                continue\n\n            last_curr_paragraph = curr_body_paragraphs[-1]\n            first_next_paragraph = next_body_paragraphs[0]\n\n            # Skip if either paragraph is already translated\n            if (\n                id(last_curr_paragraph) in translated_ids\n                or id(first_next_paragraph) in translated_ids\n            ):\n                continue\n\n            # Build font maps for both pages\n            curr_font_map, curr_xobj_font_map = self._build_font_maps(page_curr)\n            next_font_map, next_xobj_font_map = self._build_font_maps(page_next)\n\n            # Merge font maps\n            merged_font_map = {**curr_font_map, **next_font_map}\n            merged_xobj_font_map = {**curr_xobj_font_map, 
**next_xobj_font_map}\n\n            # Calculate total token count\n            total_token_count = self.calc_token_count(\n                last_curr_paragraph.unicode\n            ) + self.calc_token_count(first_next_paragraph.unicode)\n\n            # Create batch with both paragraphs\n            cross_page_paragraphs = [last_curr_paragraph, first_next_paragraph]\n            cross_page_pages = [page_curr, page_next]\n            batch_paragraph = BatchParagraph(\n                cross_page_paragraphs, cross_page_pages, tracker.new_cross_page()\n            )\n\n            self.mid += 1\n            # Submit translation task (force submit regardless of token count)\n            executor.submit(\n                self.translate_paragraph,\n                batch_paragraph,\n                pbar,\n                merged_font_map,\n                merged_xobj_font_map,\n                self.translation_config.shared_context_cross_split_part.first_paragraph,\n                self.translation_config.shared_context_cross_split_part.recent_title_paragraph,\n                executor2,\n                priority=1048576 - total_token_count,\n                paragraph_token_count=total_token_count,\n                mp_id=self.mid,\n            )\n\n            # Mark paragraphs as translated\n            translated_ids.add(id(last_curr_paragraph))\n            translated_ids.add(id(first_next_paragraph))\n\n    def process_cross_column_paragraph(\n        self,\n        page: Page,\n        executor: PriorityThreadPoolExecutor,\n        pbar: tqdm | None = None,\n        tracker: DocumentTranslateTracker | None = None,\n        executor2: PriorityThreadPoolExecutor | None = None,\n        translated_ids: set[int] | None = None,\n    ):\n        \"\"\"Process cross-column paragraphs within the same page.\n\n        If two adjacent body-text paragraphs have a gap in their y2 coordinate\n        greater than 20 units, they are considered split across columns and\n        will 
be translated together.\n        \"\"\"\n        self.translation_config.raise_if_cancelled()\n\n        if tracker is None:\n            tracker = DocumentTranslateTracker()\n        if translated_ids is None:\n            translated_ids = set()\n\n        # Filter body-text paragraphs maintaining original order\n        body_paragraphs = self._filter_paragraphs(\n            page, translated_ids, require_body_text=True\n        )\n        if len(body_paragraphs) < 2:\n            return\n\n        # Build font maps once for the whole page\n        page_font_map, page_xobj_font_map = self._build_font_maps(page)\n\n        for idx in range(len(body_paragraphs) - 1):\n            p1 = body_paragraphs[idx]\n            p2 = body_paragraphs[idx + 1]\n\n            # Skip already translated\n            if id(p1) in translated_ids or id(p2) in translated_ids:\n                continue\n\n            # Safety checks for box information\n            if not (\n                p1.box and p2.box and p1.box.y2 is not None and p2.box.y2 is not None\n            ):\n                continue\n\n            if p2.box.y2 - p1.box.y2 <= 20:\n                continue\n\n            total_token_count = self.calc_token_count(\n                p1.unicode\n            ) + self.calc_token_count(p2.unicode)\n\n            batch = BatchParagraph([p1, p2], [page, page], tracker.new_cross_column())\n            self.mid += 1\n            executor.submit(\n                self.translate_paragraph,\n                batch,\n                pbar,\n                page_font_map,\n                page_xobj_font_map,\n                self.translation_config.shared_context_cross_split_part.first_paragraph,\n                self.translation_config.shared_context_cross_split_part.recent_title_paragraph,\n                executor2,\n                priority=1048576 - total_token_count,\n                paragraph_token_count=total_token_count,\n                mp_id=self.mid,\n            )\n\n         
   translated_ids.add(id(p1))\n            translated_ids.add(id(p2))\n\n    def process_page(\n        self,\n        page: Page,\n        executor: PriorityThreadPoolExecutor,\n        pbar: tqdm | None = None,\n        tracker: PageTranslateTracker = None,\n        executor2: PriorityThreadPoolExecutor | None = None,\n        translated_ids: set | None = None,\n    ):\n        self.translation_config.raise_if_cancelled()\n        # Guard against the default None so the membership checks below do not fail\n        if translated_ids is None:\n            translated_ids = set()\n        page_font_map = {}\n        for font in page.pdf_font:\n            page_font_map[font.font_id] = font\n        page_xobj_font_map = {}\n        for xobj in page.pdf_xobject:\n            page_xobj_font_map[xobj.xobj_id] = page_font_map.copy()\n            for font in xobj.pdf_font:\n                page_xobj_font_map[xobj.xobj_id][font.font_id] = font\n\n        paragraphs = []\n\n        total_token_count = 0\n        for paragraph in page.pdf_paragraph:\n            # Check if already translated\n            if id(paragraph) in translated_ids:\n                continue\n\n            # Check basic validation\n            if paragraph.debug_id is None or paragraph.unicode is None:\n                continue\n\n            # Check CID paragraph - advance progress bar if filtered out\n            if is_cid_paragraph(paragraph):\n                if pbar:\n                    pbar.advance(1)\n                continue\n\n            # Check minimum length - advance progress bar if filtered out\n            if len(paragraph.unicode) < self.translation_config.min_text_length:\n                if pbar:\n                    pbar.advance(1)\n                continue\n\n            if is_pure_numeric_paragraph(paragraph):\n                if pbar:\n                    pbar.advance(1)\n                continue\n\n            if is_placeholder_only_paragraph(paragraph):\n                if pbar:\n                    pbar.advance(1)\n                continue\n\n            # self.translate_paragraph(paragraph, pbar,tracker.new_paragraph(), 
page_font_map, page_xobj_font_map)\n            total_token_count += self.calc_token_count(paragraph.unicode)\n            paragraphs.append(paragraph)\n            translated_ids.add(id(paragraph))\n            if paragraph.layout_label == \"title\":\n                self.shared_context_cross_split_part.recent_title_paragraph = (\n                    copy.deepcopy(paragraph)\n                )\n\n            if total_token_count > 200 or len(paragraphs) > 5:\n                self.mid += 1\n                executor.submit(\n                    self.translate_paragraph,\n                    BatchParagraph(paragraphs, [page] * len(paragraphs), tracker),\n                    pbar,\n                    page_font_map,\n                    page_xobj_font_map,\n                    self.translation_config.shared_context_cross_split_part.first_paragraph,\n                    self.translation_config.shared_context_cross_split_part.recent_title_paragraph,\n                    executor2,\n                    priority=1048576 - total_token_count,\n                    paragraph_token_count=total_token_count,\n                    mp_id=self.mid,\n                )\n                paragraphs = []\n                total_token_count = 0\n\n        if paragraphs:\n            self.mid += 1\n            executor.submit(\n                self.translate_paragraph,\n                BatchParagraph(paragraphs, [page] * len(paragraphs), tracker),\n                pbar,\n                page_font_map,\n                page_xobj_font_map,\n                self.translation_config.shared_context_cross_split_part.first_paragraph,\n                self.translation_config.shared_context_cross_split_part.recent_title_paragraph,\n                executor2,\n                priority=1048576 - total_token_count,\n                paragraph_token_count=total_token_count,\n                mp_id=self.mid,\n            )\n\n    def translate_paragraph(\n        self,\n        batch_paragraph: 
BatchParagraph,\n        pbar: tqdm | None = None,\n        page_font_map: dict[str, PdfFont] = None,\n        xobj_font_map: dict[int, dict[str, PdfFont]] = None,\n        title_paragraph: PdfParagraph | None = None,\n        local_title_paragraph: PdfParagraph | None = None,\n        executor: PriorityThreadPoolExecutor | None = None,\n        paragraph_token_count: int = 0,\n        mp_id: int = 0,\n    ):\n        \"\"\"Translate a batch of paragraphs in one LLM call, falling back to per-paragraph translation on failure.\"\"\"\n        self.translation_config.raise_if_cancelled()\n        should_translate_paragraph = []\n        try:\n            inputs = []\n            llm_translate_trackers = []\n            paragraph_unicodes = []\n            for i in range(len(batch_paragraph.paragraphs)):\n                paragraph = batch_paragraph.paragraphs[i]\n                tracker = batch_paragraph.trackers[i]\n                text, translate_input = self.il_translator.pre_translate_paragraph(\n                    paragraph, tracker, page_font_map, xobj_font_map\n                )\n                if text is None:\n                    # pbar is optional, so guard before advancing\n                    if pbar:\n                        pbar.advance(1)\n                    continue\n\n                tracker.record_multi_paragraph_id(mp_id)\n\n                llm_translate_tracker = tracker.new_llm_translate_tracker()\n                should_translate_paragraph.append(i)\n                llm_translate_trackers.append(llm_translate_tracker)\n                inputs.append(\n                    (\n                        text,\n                        translate_input,\n                        paragraph,\n                        tracker,\n                        llm_translate_tracker,\n                        paragraph_unicodes,\n                    )\n                )\n                paragraph_unicodes.append(paragraph.unicode)\n            if not inputs:\n                return\n            json_format_input = []\n\n            for id_, input_text in enumerate(inputs):\n                ti: 
il_translator.ILTranslator.TranslateInput = input_text[1]\n                tracker: ParagraphTranslateTracker = input_text[3]\n                tracker.record_multi_paragraph_index(id_)\n                placeholders_hint = ti.get_placeholders_hint()\n                obj = {\n                    \"id\": id_,\n                    \"input\": input_text[0],\n                    \"layout_label\": input_text[2].layout_label,\n                }\n                if (\n                    placeholders_hint\n                    and self.translation_config.add_formula_placehold_hint\n                ):\n                    obj[\"formula_placeholders_hint\"] = placeholders_hint\n                json_format_input.append(obj)\n\n            json_format_input_str = json.dumps(\n                json_format_input, ensure_ascii=False, indent=2\n            )\n\n            batch_text_for_glossary_matching = \"\\n\".join(\n                item.get(\"input\", \"\") for item in json_format_input\n            )\n\n            final_input = self._build_llm_prompt(\n                json_input_str=json_format_input_str,\n                title_paragraph=title_paragraph,\n                local_title_paragraph=local_title_paragraph,\n                batch_text_for_glossary_matching=batch_text_for_glossary_matching,\n            )\n\n            for llm_translate_tracker in llm_translate_trackers:\n                llm_translate_tracker.set_input(final_input)\n            llm_output = self.translate_engine.llm_translate(\n                final_input,\n                rate_limit_params={\n                    \"paragraph_token_count\": paragraph_token_count,\n                    \"request_json_mode\": True,\n                },\n            )\n            for llm_translate_tracker in llm_translate_trackers:\n                llm_translate_tracker.set_output(llm_output)\n            llm_output = llm_output.strip()\n\n            llm_output = self._clean_json_output(llm_output)\n\n            
parsed_output = json.loads(llm_output)\n\n            if isinstance(parsed_output, dict) and parsed_output.get(\n                \"output\", parsed_output.get(\"input\", False)\n            ):\n                parsed_output = [parsed_output]\n\n            translation_results = {\n                item[\"id\"]: item.get(\"output\", item.get(\"input\"))\n                for item in parsed_output\n            }\n\n            if len(translation_results) != len(inputs):\n                raise Exception(\n                    f\"Translation results length mismatch. Expected: {len(inputs)}, Got: {len(translation_results)}\"\n                )\n\n            for id_, output in translation_results.items():\n                should_fallback = True\n                try:\n                    if not isinstance(output, str):\n                        logger.warning(\n                            f\"Translation result is not a string. Output: {output}\"\n                        )\n                        continue\n\n                    id_ = int(id_)  # Ensure id is an integer\n                    if id_ >= len(inputs):\n                        logger.warning(f\"Invalid id {id_}, skipping\")\n                        continue\n\n                    # Clean up any excessive punctuation in the translated text\n                    translated_text = re.sub(r\"[. 。…，]{20,}\", \".\", output)\n\n                    # Get the original input for this translation\n                    translate_input = inputs[id_][1]\n                    llm_translate_tracker = inputs[id_][4]\n\n                    input_unicode = inputs[id_][0]\n                    output_unicode = translated_text\n\n                    trimed_input = re.sub(r\"[. 
。…，]{20,}\", \".\", input_unicode)\n\n                    input_token_count = self.calc_token_count(trimed_input)\n                    output_token_count = self.calc_token_count(output_unicode)\n\n                    same_as_input = trimed_input == output_unicode\n                    if (\n                        same_as_input\n                        and input_token_count > 10\n                        and not self.translation_config.disable_same_text_fallback\n                    ):\n                        llm_translate_tracker.set_error_message(\n                            \"Translation result is the same as input, fallback.\"\n                        )\n                        llm_translate_tracker.set_placeholder_full_match()\n                        logger.warning(\n                            \"Translation result is the same as input, fallback.\"\n                        )\n                        continue\n\n                    if not (0.3 < output_token_count / input_token_count < 3):\n                        llm_translate_tracker.set_error_message(\n                            f\"Translation result is too long or too short. Input: {input_token_count}, Output: {output_token_count}\"\n                        )\n                        logger.warning(\n                            f\"Translation result is too long or too short. 
Input: {input_token_count}, Output: {output_token_count}\"\n                        )\n                        llm_translate_tracker.set_placeholder_full_match()\n                        continue\n\n                    if not self.translation_config.disable_same_text_fallback:\n                        edit_distance = Levenshtein.distance(\n                            input_unicode, output_unicode\n                        )\n                        if edit_distance < 5 and input_token_count > 20:\n                            llm_translate_tracker.set_error_message(\n                                f\"Translation result edit distance is too small. distance: {edit_distance}, input: {input_unicode}, output: {output_unicode}\"\n                            )\n                            logger.warning(\n                                f\"Translation result edit distance is too small. distance: {edit_distance}, input: {input_unicode}, output: {output_unicode}\"\n                            )\n                            llm_translate_tracker.set_placeholder_full_match()\n                            continue\n                    # Apply the translation to the paragraph\n                    self.il_translator.post_translate_paragraph(\n                        inputs[id_][2],\n                        inputs[id_][3],\n                        translate_input,\n                        translated_text,\n                    )\n                    should_fallback = False\n                    if pbar:\n                        pbar.advance(1)\n                except Exception as e:\n                    error_message = f\"Error translating paragraph. 
Error: {e}.\"\n                    logger.exception(error_message)\n                    # Ignore error and continue\n                    for llm_translate_tracker in llm_translate_trackers:\n                        llm_translate_tracker.set_error_message(error_message)\n                    continue\n                finally:\n                    self.total_count += 1\n                    if should_fallback:\n                        self.fallback_count += 1\n                        inputs[id_][4].set_fallback_to_translate()\n                        logger.warning(\n                            f\"Fallback to simple translation. paragraph id: {inputs[id_][2].debug_id}\"\n                        )\n                        paragraph_token_count = self.calc_token_count(\n                            inputs[id_][2].unicode\n                        )\n                        paragraph_unicodes = inputs[id_][5]\n                        inputs[id_][2].unicode = paragraph_unicodes[id_]\n                        executor.submit(\n                            self.il_translator.translate_paragraph,\n                            inputs[id_][2],\n                            batch_paragraph.pages[id_],\n                            pbar,\n                            inputs[id_][3],\n                            page_font_map,\n                            xobj_font_map,\n                            priority=1048576 - paragraph_token_count,\n                            paragraph_token_count=paragraph_token_count,\n                            title_paragraph=title_paragraph,\n                            local_title_paragraph=local_title_paragraph,\n                        )\n                    else:\n                        self.ok_count += 1\n\n        except Exception as e:\n            error_message = f\"Error {e} during translation. 
try fallback\"\n            logger.warning(error_message)\n            for llm_translate_tracker in llm_translate_trackers:\n                llm_translate_tracker.set_error_message(error_message)\n                llm_translate_tracker.set_fallback_to_translate()\n            self.total_count += len(llm_translate_trackers)\n            self.fallback_count += len(llm_translate_trackers)\n            # Restore each paragraph's original text; input_[5] is the shared\n            # list of original unicode strings, indexed like inputs\n            for idx, input_ in enumerate(inputs):\n                input_[2].unicode = input_[5][idx]\n            if not should_translate_paragraph:\n                should_translate_paragraph = list(\n                    range(len(batch_paragraph.paragraphs))\n                )\n            for i in should_translate_paragraph:\n                paragraph = batch_paragraph.paragraphs[i]\n                tracker = batch_paragraph.trackers[i]\n                if paragraph.debug_id is None:\n                    continue\n                paragraph_token_count = self.calc_token_count(paragraph.unicode)\n                executor.submit(\n                    self.il_translator.translate_paragraph,\n                    paragraph,\n                    batch_paragraph.pages[i],\n                    pbar,\n                    tracker,\n                    page_font_map,\n                    xobj_font_map,\n                    priority=1048576 - paragraph_token_count,\n                    paragraph_token_count=paragraph_token_count,\n                    title_paragraph=title_paragraph,\n                    local_title_paragraph=local_title_paragraph,\n                )\n\n    def _build_llm_prompt(\n        self,\n        json_input_str: str,\n        title_paragraph: PdfParagraph | None,\n        local_title_paragraph: PdfParagraph | None,\n        batch_text_for_glossary_matching: str,\n    ) -> str:\n        \"\"\"Build LLM prompt using a single template for easier maintenance.\"\"\"\n        # Build role block, honoring custom_system_prompt if provided.\n        custom_prompt = getattr(self.translation_config, 
\"custom_system_prompt\", None)\n        if custom_prompt:\n            role_block = custom_prompt.strip()\n            if \"Follow all rules strictly.\" not in role_block:\n                if not role_block.endswith(\"\\n\"):\n                    role_block += \"\\n\"\n                role_block += \"Follow all rules strictly.\"\n        else:\n            role_block = (\n                f\"You are a professional {self.translation_config.lang_out} native translator who needs to fluently translate text \"\n                f\"into {self.translation_config.lang_out}.\\n\\n\"\n                \"Follow all rules strictly.\"\n            )\n\n        # Build contextual hints section.\n        contextual_lines: list[str] = []\n        hint_idx = 1\n        if title_paragraph:\n            contextual_lines.append(\n                f\"{hint_idx}. First title in full text: {title_paragraph.unicode}\"\n            )\n            hint_idx += 1\n\n        if local_title_paragraph:\n            is_different_from_global = True\n            if title_paragraph:\n                if local_title_paragraph.debug_id == title_paragraph.debug_id:\n                    is_different_from_global = False\n\n            if is_different_from_global:\n                contextual_lines.append(\n                    f\"{hint_idx}. 
The most recent title is: {local_title_paragraph.unicode}\"\n                )\n\n        if contextual_lines:\n            contextual_hints_block = (\n                \"## Contextual Hints for Better Translation\\n\"\n                + \"\\n\".join(contextual_lines)\n                + \"\\n\"\n            )\n        else:\n            contextual_hints_block = \"\"\n\n        # Build glossary usage rules and glossary tables.\n        glossary_usage_rules_block = \"\"\n        glossary_tables_block = \"\"\n        glossary_entries_per_glossary: dict[str, list[tuple[str, str]]] = {}\n\n        if self._cached_glossaries:\n            for glossary in self._cached_glossaries:\n                active_entries = glossary.get_active_entries_for_text(\n                    batch_text_for_glossary_matching\n                )\n                if active_entries:\n                    glossary_entries_per_glossary[glossary.name] = sorted(\n                        active_entries\n                    )\n\n        if glossary_entries_per_glossary:\n            glossary_usage_rules_block = (\n                \"## Glossary\\n\"\n                \"If a glossary is provided:\\n\"\n                \"- Always use the exact target term.\\n\"\n                \"- Apply glossary items even inside tags or when broken by hyphens/line breaks.\\n\"\n                \"- If glossary does NOT include a term, translate it naturally.\\n\\n\"\n            )\n\n            glossary_table_lines: list[str] = [\"## Glossary Tables\", \"\"]\n            for glossary_name, entries in glossary_entries_per_glossary.items():\n                glossary_table_lines.append(f\"### Glossary: {glossary_name}\")\n                glossary_table_lines.append(\"\")\n                glossary_table_lines.append(\n                    \"| Source Term | Target Term |\\n|-------------|-------------|\"\n                )\n                for original_source, target_text in entries:\n                    
glossary_table_lines.append(\n                        f\"| {original_source} | {target_text} |\"\n                    )\n                glossary_table_lines.append(\"\")\n            glossary_tables_block = \"\\n\".join(glossary_table_lines)\n\n        return PROMPT_TEMPLATE.substitute(\n            role_block=role_block,\n            glossary_usage_rules_block=glossary_usage_rules_block,\n            contextual_hints_block=contextual_hints_block,\n            json_input_str=json_input_str,\n            glossary_tables_block=glossary_tables_block,\n            lang_out=self.translation_config.lang_out,\n        )\n\n    def _clean_json_output(self, llm_output: str) -> str:\n        # Clean up JSON output by removing common wrapper tags\n        llm_output = llm_output.strip()\n        if llm_output.startswith(\"<json>\"):\n            llm_output = llm_output[6:]\n        if llm_output.endswith(\"</json>\"):\n            llm_output = llm_output[:-7]\n        if llm_output.startswith(\"```json\"):\n            llm_output = llm_output[7:]\n        if llm_output.startswith(\"```\"):\n            llm_output = llm_output[3:]\n        if llm_output.endswith(\"```\"):\n            llm_output = llm_output[:-3]\n        return llm_output.strip()\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/layout_parser.py",
    "content": "import logging\nimport math\nimport os\nfrom concurrent.futures import ThreadPoolExecutor\nfrom pathlib import Path\n\nimport cv2\nimport numpy as np\nfrom pymupdf import Document\n\nimport babeldoc.format.pdf.document_il.utils.extract_char\nfrom babeldoc.format.pdf.document_il import il_version_1\nfrom babeldoc.format.pdf.document_il.utils.style_helper import GREEN\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\n\nlogger = logging.getLogger(__name__)\n\n\nclass LayoutParser:\n    stage_name = \"Parse Page Layout\"\n\n    def __init__(self, translation_config: TranslationConfig):\n        self.translation_config = translation_config\n        self.model = translation_config.doc_layout_model\n\n    def _save_debug_image(self, image: np.ndarray, layout, page_number: int):\n        \"\"\"Save debug image with drawn boxes if debug mode is enabled.\"\"\"\n        if not self.translation_config.debug:\n            return\n\n        debug_dir = Path(self.translation_config.get_working_file_path(\"ocr-box-image\"))\n        debug_dir.mkdir(parents=True, exist_ok=True)\n\n        # Draw boxes on the image\n        debug_image = image.copy()\n        for box in layout.boxes:\n            x0, y0, x1, y1 = box.xyxy\n            cv2.rectangle(\n                debug_image,\n                (int(x0), int(y0)),\n                (int(x1), int(y1)),\n                (0, 255, 0),\n                2,\n            )\n            # Add text label\n            cv2.putText(\n                debug_image,\n                layout.names[box.cls],\n                (int(x0), int(y0) - 5),\n                cv2.FONT_HERSHEY_SIMPLEX,\n                0.5,\n                (0, 255, 0),\n                1,\n            )\n        img_bgr = cv2.cvtColor(debug_image, cv2.COLOR_RGB2BGR)\n\n        # Save the image\n        output_path = debug_dir / f\"{page_number}.jpg\"\n        cv2.imwrite(str(output_path), img_bgr)\n\n    def _save_debug_box_to_page(self, page: 
il_version_1.Page):\n        \"\"\"Save debug boxes and text labels to the PDF page.\"\"\"\n        if not self.translation_config.debug:\n            return\n\n        color = GREEN\n\n        for layout in page.page_layout:\n            # Create a rectangle box\n            scale_factor = 1\n            if layout.class_name == \"fallback_line\":\n                scale_factor = 0.1\n            rect = il_version_1.PdfRectangle(\n                box=il_version_1.Box(\n                    x=layout.box.x,\n                    y=layout.box.y,\n                    x2=layout.box.x2,\n                    y2=layout.box.y2,\n                ),\n                graphic_state=color,\n                debug_info=True,\n                line_width=0.4 * scale_factor,\n            )\n            page.pdf_rectangle.append(rect)\n\n            # Create text label at top-left corner\n            # Note: PDF coordinates are from bottom-left,\n            # so we use y2 for top position\n            style = il_version_1.PdfStyle(\n                font_id=\"base\",\n                font_size=4 * scale_factor,\n                graphic_state=color,\n            )\n            page.pdf_paragraph.append(\n                il_version_1.PdfParagraph(\n                    first_line_indent=False,\n                    box=il_version_1.Box(\n                        x=layout.box.x,\n                        y=layout.box.y2,\n                        x2=layout.box.x2,\n                        y2=layout.box.y2 + 5,\n                    ),\n                    vertical=False,\n                    pdf_style=style,\n                    unicode=layout.class_name,\n                    pdf_paragraph_composition=[\n                        il_version_1.PdfParagraphComposition(\n                            pdf_same_style_unicode_characters=il_version_1.PdfSameStyleUnicodeCharacters(\n                                unicode=layout.class_name,\n                                pdf_style=style,\n                  
              debug_info=True,\n                            ),\n                        ),\n                    ],\n                    xobj_id=-1,\n                ),\n            )\n\n    def process(self, docs: il_version_1.Document, mupdf_doc: Document):\n        \"\"\"Generate layouts for all pages that need to be translated.\"\"\"\n        # Get pages that need to be translated\n        total = len(docs.page)\n        with self.translation_config.progress_monitor.stage_start(\n            self.stage_name,\n            total * 2,\n        ) as progress:\n            # Process predictions for each page\n            for page, layouts in self.model.handle_document(\n                docs.page,\n                mupdf_doc,\n                self.translation_config,\n                self._save_debug_image,\n            ):\n                page_layouts = []\n                for layout in layouts.boxes:\n                    # Convert the box from the picture coordinate\n                    # system to the il coordinate system (y axis flipped)\n                    x0, y0, x1, y1 = layout.xyxy\n                    # pix = get_no_rotation_img(mupdf_doc[page.page_number])\n                    # pix = mupdf_doc[page.page_number].get_pixmap()\n                    # h, w = pix.height, pix.width\n                    box = mupdf_doc[page.page_number].mediabox_size\n                    b_h = math.ceil(box.y)\n                    b_w = math.ceil(box.x)\n                    # if b_h != h or b_w != w:\n                    #     logger.warning(f\"page {page.page_number} mediabox is not correct, b_h: {b_h}, h: {h}, b_w: {b_w}, w: {w}\")\n                    h, w = b_h, b_w\n                    x0, y0, x1, y1 = (\n                        np.clip(int(x0 - 1), 0, w - 1),\n                        np.clip(int(h - y1 - 1), 0, h - 1),\n                        np.clip(int(x1 + 1), 0, w - 1),\n                        np.clip(int(h - y0 + 1), 0, h - 1),\n                    )\n                    page_layout = 
il_version_1.PageLayout(\n                        id=len(page_layouts) + 1,\n                        box=il_version_1.Box(\n                            x0.item(),\n                            y0.item(),\n                            x1.item(),\n                            y1.item(),\n                        ),\n                        conf=layout.conf.item(),\n                        class_name=layouts.names[layout.cls],\n                    )\n                    page_layouts.append(page_layout)\n\n                page.page_layout = page_layouts\n                # self.generate_fallback_line_layout_for_page(page)\n                # self._save_debug_box_to_page(page)\n                progress.advance(1)\n            with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor:\n                for page in docs.page:\n                    executor.submit(\n                        self.generate_fallback_line_layout_for_page, page, progress\n                    )\n        return docs\n\n    def generate_fallback_line_layout_for_page(self, page: il_version_1.Page, progress):\n        try:\n            exists_page_layouts = page.page_layout\n            char_boxes = babeldoc.format.pdf.document_il.utils.extract_char.convert_page_to_char_boxes(\n                page\n            )\n            if not char_boxes:\n                return\n\n            clusters = babeldoc.format.pdf.document_il.utils.extract_char.process_page_chars_to_lines(\n                char_boxes\n            )\n            for cluster in clusters:\n                boxes = [c[0] for c in cluster.chars]\n                min_x = min(b.x for b in boxes)\n                max_x = max(b.x2 for b in boxes)\n                min_y = min(b.y for b in boxes)\n                max_y = max(b.y2 for b in boxes)\n                cluster.chars = il_version_1.Box(min_x, min_y, max_x, max_y)\n                page_layout = il_version_1.PageLayout(\n                    id=len(exists_page_layouts) + 1,\n                  
  box=il_version_1.Box(\n                        min_x,\n                        min_y,\n                        max_x,\n                        max_y,\n                    ),\n                    conf=1,\n                    class_name=\"fallback_line\",\n                )\n                exists_page_layouts.append(page_layout)\n            self._save_debug_box_to_page(page)\n        finally:\n            progress.advance(1)\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/paragraph_finder.py",
    "content": "import logging\nimport random\nimport re\n\nimport numpy as np\n\nfrom babeldoc.babeldoc_exception.BabelDOCException import ExtractTextError\nfrom babeldoc.format.pdf.document_il import Box\nfrom babeldoc.format.pdf.document_il import Document\nfrom babeldoc.format.pdf.document_il import Page\nfrom babeldoc.format.pdf.document_il import PdfCharacter\nfrom babeldoc.format.pdf.document_il import PdfLine\nfrom babeldoc.format.pdf.document_il import PdfParagraph\nfrom babeldoc.format.pdf.document_il import PdfParagraphComposition\nfrom babeldoc.format.pdf.document_il import PdfRectangle\nfrom babeldoc.format.pdf.document_il.utils.fontmap import FontMapper\nfrom babeldoc.format.pdf.document_il.utils.formular_helper import (\n    collect_page_formula_font_ids,\n)\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import (\n    HEIGHT_NOT_USFUL_CHAR_IN_CHAR,\n)\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import SPACE_REGEX\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import Layout\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import add_space_dummy_chars\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import build_layout_index\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import calculate_iou_for_boxes\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import get_char_unicode_string\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import get_character_layout\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import is_bullet_point\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import (\n    is_character_in_formula_layout,\n)\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import is_text_layout\nfrom babeldoc.format.pdf.document_il.utils.paragraph_helper import is_cid_paragraph\nfrom babeldoc.format.pdf.document_il.utils.style_helper import INDIGO\nfrom babeldoc.format.pdf.document_il.utils.style_helper import WHITE\nfrom 
babeldoc.format.pdf.translation_config import TranslationConfig\n\nlogger = logging.getLogger(__name__)\n\n# Base58 alphabet (Bitcoin style, without the ambiguous characters 0, O, I, l)\nBASE58_ALPHABET = \"123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz\"\n\n\ndef generate_base58_id(length: int = 5) -> str:\n    \"\"\"Generate a random base58 ID of specified length.\"\"\"\n    return \"\".join(random.choice(BASE58_ALPHABET) for _ in range(length))\n\n\nclass ParagraphFinder:\n    stage_name = \"Parse Paragraphs\"\n\n    # Regular expression pattern for bullet points\n\n    def __init__(self, translation_config: TranslationConfig):\n        self.translation_config = translation_config\n        self.font_mapper = FontMapper(translation_config)\n\n    def _preprocess_formula_layouts(self, page: Page):\n        \"\"\"\n        Identifies 'formula' layouts that do not significantly overlap with any text layouts\n        and re-labels them as 'isolate_formula'.\n        \"\"\"\n        # Use a simplified Layout object for is_text_layout check\n        text_layouts = [\n            layout\n            for layout in page.page_layout\n            if is_text_layout(Layout(layout.id, layout.class_name))\n        ]\n        formula_layouts = [\n            layout for layout in page.page_layout if layout.class_name == \"formula\"\n        ]\n\n        if not text_layouts or not formula_layouts:\n            return\n\n        for formula_layout in formula_layouts:\n            is_isolated = True\n            for text_layout in text_layouts:\n                iou = calculate_iou_for_boxes(formula_layout.box, text_layout.box)\n                if iou >= 0.5:\n                    is_isolated = False\n                    break\n\n            if is_isolated:\n                formula_layout.class_name = \"isolate_formula\"\n\n    def add_text_fill_background(self, page: Page):\n        layout_map = {layout.id: layout for layout in page.page_layout}\n        for paragraph in page.pdf_paragraph:\n            layout_id = 
paragraph.layout_id\n            if layout_id is None:\n                continue\n            layout = layout_map[layout_id]\n            if paragraph.box is None:\n                continue\n            x1, y1, x2, y2 = (\n                paragraph.box.x,\n                paragraph.box.y,\n                paragraph.box.x2,\n                paragraph.box.y2,\n            )\n            layout_box = layout.box\n            if layout_box.x < x1:\n                x1 = layout_box.x\n            if layout_box.y < y1:\n                y1 = layout_box.y\n            if layout_box.x2 > x2:\n                x2 = layout_box.x2\n            if layout_box.y2 > y2:\n                y2 = layout_box.y2\n            assert x2 > x1 and y2 > y1\n            page.pdf_rectangle.append(\n                PdfRectangle(\n                    box=Box(x1, y1, x2, y2),\n                    fill_background=True,\n                    graphic_state=WHITE,\n                    debug_info=False,\n                    xobj_id=paragraph.xobj_id,\n                )\n            )\n\n    def update_paragraph_data(self, paragraph: PdfParagraph, update_unicode=False):\n        if not paragraph.pdf_paragraph_composition:\n            return\n\n        chars = []\n        for composition in paragraph.pdf_paragraph_composition:\n            if composition.pdf_line:\n                chars.extend(composition.pdf_line.pdf_character)\n            elif composition.pdf_formula:\n                chars.extend(composition.pdf_formula.pdf_character)\n            elif composition.pdf_character:\n                chars.append(composition.pdf_character)\n            elif composition.pdf_same_style_unicode_characters:\n                continue\n            else:\n                logger.error(\n                    \"Unexpected composition type\"\n                    \" in PdfParagraphComposition. 
\"\n                    \"This type only appears in the IL \"\n                    \"after the translation is completed.\",\n                )\n                continue\n\n        if update_unicode and chars:\n            paragraph.unicode = get_char_unicode_string(chars)\n        if not chars:\n            return\n        # 更新边界框\n        min_x = min(char.visual_bbox.box.x for char in chars)\n        min_y = min(char.visual_bbox.box.y for char in chars)\n        max_x = max(char.visual_bbox.box.x2 for char in chars)\n        max_y = max(char.visual_bbox.box.y2 for char in chars)\n        paragraph.box = Box(min_x, min_y, max_x, max_y)\n        paragraph.vertical = chars[0].vertical\n        paragraph.xobj_id = chars[0].xobj_id\n\n        paragraph.first_line_indent = False\n        if (\n            paragraph.pdf_paragraph_composition\n            and paragraph.pdf_paragraph_composition[0].pdf_line\n            and paragraph.pdf_paragraph_composition[0]\n            .pdf_line.pdf_character[0]\n            .visual_bbox.box.x\n            - paragraph.box.x\n            > 1\n        ):\n            paragraph.first_line_indent = True\n\n    def update_line_data(self, line: PdfLine):\n        min_x = min(char.visual_bbox.box.x for char in line.pdf_character)\n        min_y = min(char.visual_bbox.box.y for char in line.pdf_character)\n        max_x = max(char.visual_bbox.box.x2 for char in line.pdf_character)\n        max_y = max(char.visual_bbox.box.y2 for char in line.pdf_character)\n        line.box = Box(min_x, min_y, max_x, max_y)\n\n    def add_debug_info(self, page: Page):\n        if not self.translation_config.debug:\n            return\n        for paragraph in page.pdf_paragraph:\n            for composition in paragraph.pdf_paragraph_composition:\n                if composition.pdf_line:\n                    line = composition.pdf_line\n                    page.pdf_rectangle.append(\n                        PdfRectangle(\n                            
box=line.box,\n                            fill_background=False,\n                            graphic_state=INDIGO,\n                            debug_info=True,\n                            line_width=0.2,\n                        )\n                    )\n\n    def process(self, document):\n        with self.translation_config.progress_monitor.stage_start(\n            self.stage_name,\n            len(document.page),\n        ) as pbar:\n            if not document.page:\n                return\n            for page in document.page:\n                self.translation_config.raise_if_cancelled()\n                self.process_page(page)\n                pbar.advance()\n\n            total_paragraph_count = 0\n            for page in document.page:\n                total_paragraph_count += len(page.pdf_paragraph)\n            if total_paragraph_count == 0:\n                raise ExtractTextError(\"The document contains no paragraphs.\")\n\n            if self.check_cid_paragraph(document):\n                raise ExtractTextError(\"The document contains too many CID paragraphs.\")\n\n    def check_cid_paragraph(self, doc: Document):\n        cid_para_count = 0\n        para_total = 0\n        for page in doc.page:\n            para_total += len(page.pdf_paragraph)\n            for para in page.pdf_paragraph:\n                if is_cid_paragraph(para):\n                    cid_para_count += 1\n        return cid_para_count / para_total > 0.8\n\n    def bbox_overlap(self, bbox1: Box, bbox2: Box) -> bool:\n        return (\n            bbox1.x < bbox2.x2\n            and bbox1.x2 > bbox2.x\n            and bbox1.y < bbox2.y2\n            and bbox1.y2 > bbox2.y\n        )\n\n    def process_page(self, page: Page):\n        layout_index, layout_map = build_layout_index(page)\n        # Preprocess the labels of formula layouts\n        self._preprocess_formula_layouts(page)\n\n        # Step 1: create paragraphs according to the layout\n        # During this step, characters are removed from page.pdf_character\n        paragraphs = 
self._group_characters_into_paragraphs(\n            page, layout_index, layout_map\n        )\n        page.pdf_paragraph = paragraphs\n\n        page_level_formula_font_ids, xobj_specific_formula_font_ids = (\n            collect_page_formula_font_ids(\n                page, self.translation_config.formular_font_pattern\n            )\n        )\n\n        # for para in paragraphs:\n        #     if not para.debug_id:\n        #         continue\n        #     new_line = PdfLine(\n        #         pdf_character=[x.pdf_character for x in para.pdf_paragraph_composition]\n        #     )\n        #     self.update_line_data(new_line)\n        #     para.pdf_paragraph_composition = [\n        #         PdfParagraphComposition(pdf_line=new_line)\n        #     ]\n\n        # Step 2: split the characters in each paragraph into lines\n        for paragraph in paragraphs:\n            if (\n                paragraph.xobj_id\n                and paragraph.xobj_id in xobj_specific_formula_font_ids\n            ):\n                current_formula_font_ids = xobj_specific_formula_font_ids[\n                    paragraph.xobj_id\n                ]\n            else:\n                current_formula_font_ids = page_level_formula_font_ids\n            self._split_paragraph_into_lines(paragraph, current_formula_font_ids)\n\n        # Step 3: handle spaces within paragraphs\n        for paragraph in paragraphs:\n            add_space_dummy_chars(paragraph)\n            self.process_paragraph_spacing(paragraph)\n            self.update_paragraph_data(paragraph)\n\n        # Step 4: compute the median width of all lines\n        median_width = self.calculate_median_line_width(paragraphs)\n\n        # Step 5: process independent paragraphs\n        self.process_independent_paragraphs(paragraphs, median_width)\n\n        # Extra post-processing: merge body paragraphs interleaved with line numbers (a body, b line number, c body -> merge a and c, keep b)\n        if getattr(self.translation_config, \"merge_alternating_line_numbers\", True):\n            self.merge_alternating_line_number_paragraphs(paragraphs)\n\n        for paragraph in paragraphs:\n            
self.update_paragraph_data(paragraph, update_unicode=True)\n\n        if self.translation_config.ocr_workaround:\n            self.add_text_fill_background(page)\n            # since this is ocr file,\n            # image characters are not needed\n            page.pdf_character = []\n\n        self.fix_overlapping_paragraphs(page)\n\n        # Step 6: sort the characters in each line\n        # self._sort_characters_in_lines(page)\n\n        self.add_debug_info(page)\n\n        # New stage: set each paragraph's render order to the smallest render order among its components\n        self._set_paragraph_render_order(page)\n\n    def _set_paragraph_render_order(self, page: Page):\n        \"\"\"\n        Set each paragraph's render order to the smallest render order among all of its components.\n        \"\"\"\n        for paragraph in page.pdf_paragraph:\n            min_render_order = 9999999999999999\n\n            # Iterate over all components of the paragraph\n            for composition in paragraph.pdf_paragraph_composition:\n                # Check the characters in a PdfLine\n                if composition.pdf_line:\n                    for char in composition.pdf_line.pdf_character:\n                        if (\n                            hasattr(char, \"render_order\")\n                            and char.render_order is not None\n                        ):\n                            min_render_order = min(min_render_order, char.render_order)\n\n                # Check a single character\n                elif composition.pdf_character:\n                    char = composition.pdf_character\n                    if hasattr(char, \"render_order\") and char.render_order is not None:\n                        min_render_order = min(min_render_order, char.render_order)\n\n                # Check the characters in a formula\n                elif composition.pdf_formula:\n                    for char in composition.pdf_formula.pdf_character:\n                        if (\n                            hasattr(char, \"render_order\")\n                            and char.render_order is not None\n                        ):\n                            min_render_order = 
min(min_render_order, char.render_order)\n\n            # If a valid render order was found, set it on the paragraph\n            if min_render_order != 9999999999999999:\n                paragraph.render_order = min_render_order\n\n    def is_isolated_formula(self, char: PdfCharacter):\n        return char.char_unicode in (\n            \"(cid:122)\",\n            \"(cid:123)\",\n            \"(cid:124)\",\n            \"(cid:125)\",\n        )\n\n    def _paragraph_text_ascii(self, p: PdfParagraph) -> str:\n        parts: list[str] = []\n        for comp in p.pdf_paragraph_composition or []:\n            if comp.pdf_line:\n                for ch in comp.pdf_line.pdf_character or []:\n                    if ch.char_unicode is not None:\n                        parts.append(ch.char_unicode)\n            elif comp.pdf_character and comp.pdf_character.char_unicode is not None:\n                parts.append(comp.pdf_character.char_unicode)\n        return \"\".join(parts)\n\n    def _is_ascii_digit_or_space_paragraph(self, p: PdfParagraph) -> bool:\n        text = self._paragraph_text_ascii(p)\n        if not text:\n            return True\n        has_digit = False\n        for c in text:\n            if c.isdigit() and ord(c) < 128:\n                has_digit = True\n                continue\n            if c.isspace():\n                continue\n            return False\n        return True if has_digit or text.strip() == \"\" else False\n\n    @staticmethod\n    def _same_layout_and_xobj(a: PdfParagraph, c: PdfParagraph) -> bool:\n        return (\n            a.layout_id is not None\n            and c.layout_id is not None\n            and a.layout_id == c.layout_id\n            and a.xobj_id is not None\n            and c.xobj_id is not None\n            and a.xobj_id == c.xobj_id\n        )\n\n    def merge_alternating_line_number_paragraphs(self, paragraphs: list[PdfParagraph]):\n        # a denotes body text\n        # l denotes a line number\n        if not paragraphs or len(paragraphs) < 3:\n            
return\n        i = 0\n        while i < len(paragraphs) - 2:\n            a = paragraphs[i]\n            # Consume one or more consecutive line-number paragraphs l\n            j = i + 1\n            saw_l = False\n            while j < len(paragraphs) and self._is_ascii_digit_or_space_paragraph(\n                paragraphs[j]\n            ):\n                saw_l = True\n                j += 1\n            # j now points at the candidate c\n            if saw_l and j < len(paragraphs):\n                c = paragraphs[j]\n                if self._same_layout_and_xobj(a, c):\n                    a.pdf_paragraph_composition.extend(c.pdf_paragraph_composition)\n                    self.update_paragraph_data(a)\n                    del paragraphs[j]\n                    # Do not advance i; keep trying to append more body text to a, which yields chained merging a l+ a l+ a ...\n                    continue\n            i += 1\n\n    def _group_characters_into_paragraphs(\n        self, page: Page, layout_index, layout_map\n    ) -> list[PdfParagraph]:\n        paragraphs: list[PdfParagraph] = []\n        if page.pdf_paragraph:\n            paragraphs.extend(page.pdf_paragraph)\n            page.pdf_paragraph = []\n\n        char_areas = [\n            (char.visual_bbox.box.x2 - char.visual_bbox.box.x)\n            * (char.visual_bbox.box.y2 - char.visual_bbox.box.y)\n            for char in page.pdf_character\n        ]\n        median_char_area = 0.0\n        if char_areas:\n            char_areas.sort()\n            mid = len(char_areas) // 2\n            median_char_area = (\n                char_areas[mid]\n                if len(char_areas) % 2 == 1\n                else (char_areas[mid - 1] + char_areas[mid]) / 2\n            )\n\n        current_paragraph: PdfParagraph | None = None\n        current_layout: Layout | None = None\n        skip_chars = []\n\n        for char in page.pdf_character:\n            char_layout = get_character_layout(char, layout_index, layout_map)\n            # Check if character is in any formula layout and set formula_layout_id\n            
char.formula_layout_id = is_character_in_formula_layout(\n                char, page, layout_index, layout_map\n            )\n\n            if not is_text_layout(char_layout) or self.is_isolated_formula(char):\n                skip_chars.append(char)\n                continue\n\n            char_box = char.visual_bbox.box\n            # char_pdf_box = char.box\n            # if calculate_iou_for_boxes(char_box, char_pdf_box) < 0.2:\n            #     char_box = char_pdf_box\n            char_area = (char_box.x2 - char_box.x) * (char_box.y2 - char_box.y)\n            is_small_char = char_area < median_char_area * 0.05\n\n            is_new_paragraph = False\n            if current_paragraph is None:\n                is_new_paragraph = True\n            elif (\n                not (\n                    is_small_char\n                    and current_paragraph.pdf_paragraph_composition\n                    and char_layout.id == current_layout.id\n                )\n                and char.char_unicode not in HEIGHT_NOT_USFUL_CHAR_IN_CHAR\n            ):\n                if (\n                    (\n                        char_layout.id != current_layout.id\n                        and not SPACE_REGEX.match(char.char_unicode)\n                    )\n                    or (  # not same xobject\n                        current_paragraph.pdf_paragraph_composition\n                        and current_paragraph.pdf_paragraph_composition[\n                            -1\n                        ].pdf_character.xobj_id\n                        != char.xobj_id\n                    )\n                    or (\n                        is_bullet_point(char)\n                        and not current_paragraph.pdf_paragraph_composition\n                    )\n                ):\n                    is_new_paragraph = True\n\n            if is_new_paragraph:\n                current_layout = char_layout\n                current_paragraph = PdfParagraph(\n                    
pdf_paragraph_composition=[],\n                    layout_id=current_layout.id,\n                    debug_id=generate_base58_id(),\n                    layout_label=current_layout.name,\n                )\n                paragraphs.append(current_paragraph)\n\n            current_paragraph.pdf_paragraph_composition.append(\n                PdfParagraphComposition(pdf_character=char)\n            )\n\n        page.pdf_character = skip_chars\n        for para in paragraphs:\n            self.update_paragraph_data(para)\n        return paragraphs\n\n    def _merge_overlapping_clusters(\n        self, lines: dict[int, list[PdfCharacter]], char_height_average: float\n    ) -> dict[int, list[PdfCharacter]]:\n        \"\"\"\n        Merge clusters that have significant y-axis overlap.\n        If y_intersection / min_height > 0.3 or the distance between y-midlines is less than char_height_average, merge the two clusters.\n        \"\"\"\n        if len(lines) <= 1:\n            return lines\n\n        # Calculate y-axis ranges for each cluster\n        cluster_ranges = {}\n        cluster_midlines = {}\n        for label, chars in lines.items():\n            y_values = [char.visual_bbox.box.y for char in chars] + [\n                char.visual_bbox.box.y2 for char in chars\n            ]\n            y_min, y_max = min(y_values), max(y_values)\n            cluster_ranges[label] = (y_min, y_max)\n            cluster_midlines[label] = (y_min + y_max) / 2\n\n        # Keep merging until no more merges are possible\n        changed = True\n        while changed:\n            changed = False\n            labels_to_check = list(lines.keys())\n\n            for i in range(len(labels_to_check)):\n                if not changed:  # Only continue if no merge happened in this iteration\n                    for j in range(i + 1, len(labels_to_check)):\n                        label1, label2 = labels_to_check[i], labels_to_check[j]\n\n                        # Skip if either label 
has been merged away\n                        if label1 not in lines or label2 not in lines:\n                            continue\n\n                        y1_min, y1_max = cluster_ranges[label1]\n                        y2_min, y2_max = cluster_ranges[label2]\n\n                        # Calculate intersection\n                        intersection_start = max(y1_min, y2_min)\n                        intersection_end = min(y1_max, y2_max)\n\n                        # Calculate midline distance\n                        midline_distance = abs(\n                            cluster_midlines[label1] - cluster_midlines[label2]\n                        )\n\n                        should_merge = False\n                        if (\n                            intersection_end > intersection_start\n                        ):  # There is intersection\n                            intersection_height = intersection_end - intersection_start\n                            height1 = y1_max - y1_min\n                            height2 = y2_max - y2_min\n                            min_height = min(height1, height2)\n\n                            # Check if intersection ratio exceeds threshold\n                            if (\n                                min_height > 0\n                                and intersection_height / min_height > 0.3\n                            ):\n                                should_merge = True\n\n                        # Check if midline distance is less than char_height_average\n                        if midline_distance < char_height_average:\n                            should_merge = True\n\n                        if should_merge:\n                            # Merge label2 into label1\n                            lines[label1].extend(lines[label2])\n                            del lines[label2]\n\n                            # Update cluster range and midline for the merged cluster\n                            new_y_min = min(y1_min, 
y2_min)\n                            new_y_max = max(y1_max, y2_max)\n                            cluster_ranges[label1] = (new_y_min, new_y_max)\n                            cluster_midlines[label1] = (new_y_min + new_y_max) / 2\n                            del cluster_ranges[label2]\n                            del cluster_midlines[label2]\n\n                            changed = True\n                            break\n\n        return lines\n\n    def _get_effective_y_bounds(self, char: PdfCharacter) -> tuple[float, float]:\n        \"\"\"\n        Determines the effective vertical boundaries (y1, y2) for a character.\n\n        Currently this always returns the visual bounding box: the early return\n        below bypasses the fallback that would otherwise use the PDF bounding\n        box when its Intersection over Union (IoU) with the visual bounding box\n        is below 0.5.\n        \"\"\"\n        visual_box = char.visual_bbox.box\n        return visual_box.y, visual_box.y2\n        pdf_box = char.box\n        if calculate_iou_for_boxes(visual_box, pdf_box) >= 0.5:\n            return visual_box.y, visual_box.y2\n        return pdf_box.y, pdf_box.y2\n\n    @staticmethod\n    def _compute_collision_counts_histogram(\n        y1_arr: np.ndarray,\n        y2_arr: np.ndarray,\n        para_y_min: float,\n        para_y_max: float,\n        step: float,\n    ) -> np.ndarray:\n        \"\"\"Compute overlap counts at each scan line using a difference-array histogram.\n\n        Args:\n            y1_arr: 1-D array with lower y bounds of characters (inclusive).\n            y2_arr: 1-D array with upper y bounds of characters (exclusive).\n            para_y_min: Minimum y of the paragraph.\n            para_y_max: Maximum y of the paragraph.\n            step: Scan step size.\n\n        Returns:\n            1-D NumPy int32 array where index i corresponds to y = para_y_max - i × step.\n        \"\"\"\n        # Number of scan positions\n        m = 
int(np.ceil((para_y_max - para_y_min) / step))\n        if m <= 0:\n            return np.array([], dtype=np.int32)\n\n        # Map character bounds to discrete indices (top inclusive, bottom exclusive)\n        starts = np.floor((para_y_max - y2_arr) / step).astype(np.int32)\n        ends = np.floor((para_y_max - y1_arr) / step).astype(np.int32) + 1\n        # Clip ends to the valid range [0, m]\n        np.clip(ends, 0, m, out=ends)\n\n        hist = np.zeros(m + 1, dtype=np.int32)\n        np.add.at(hist, starts, 1)\n        np.add.at(hist, ends, -1)\n\n        return np.cumsum(hist[:-1])\n\n    def _split_paragraph_into_lines(\n        self, paragraph: PdfParagraph, formula_font_ids: set[str]\n    ):\n        \"\"\"\n        Splits a paragraph into lines using a \"line-threading\" method.\n\n        This method works by scanning vertically across the paragraph's bounding\n        box and counting how many characters intersect with a horizontal line\n        at each y-coordinate. The regions with no intersections (a collision\n        count below 1) are identified as gaps between lines. The characters\n        are then partitioned into lines based on these identified gaps.\n        \"\"\"\n        if not paragraph.pdf_paragraph_composition:\n            return\n\n        # 1. Extract all characters and other compositions from the paragraph.\n        all_chars: list[PdfCharacter] = []\n        other_compositions: list[PdfParagraphComposition] = []\n        for comp in paragraph.pdf_paragraph_composition:\n            if comp.pdf_character:\n                all_chars.append(comp.pdf_character)\n            else:\n                other_compositions.append(comp)\n\n        if not all_chars:\n            return\n\n        # 2. 
Determine effective y-bounds for each character and the paragraph's total vertical range.\n        char_y_bounds = [\n            {\"char\": char, \"y1\": y1, \"y2\": y2}\n            for char in all_chars\n            for y1, y2 in [self._get_effective_y_bounds(char)]\n        ]\n\n        if not char_y_bounds:\n            paragraph.pdf_paragraph_composition = other_compositions\n            self.update_paragraph_data(paragraph)\n            return\n\n        para_y_min = min(b[\"y1\"] for b in char_y_bounds)\n        para_y_max = max(b[\"y2\"] for b in char_y_bounds)\n\n        # If the paragraph is vertically flat, treat it as a single line.\n        if (para_y_max - para_y_min) < 5:  # Using a small threshold\n            # all_chars.sort(key=lambda c: c.visual_bbox.box.x)\n            single_line_composition = self.create_line(all_chars)\n            paragraph.pdf_paragraph_composition = [\n                single_line_composition\n            ] + other_compositions\n            self.update_paragraph_data(paragraph)\n            return\n\n        # 3. Perform \"threading\" scan to create a collision histogram.\n        # Scan from top (max y) to bottom (min y) with a step of 0.25.\n        scan_y_min = para_y_min\n        scan_y_max = para_y_max\n        step = 0.25\n\n        y_coordinates = np.arange(scan_y_max, scan_y_min, -step)\n\n        # Compute collision counts using NumPy histogram (O(m + n))\n        y1_arr = np.array([b[\"y1\"] for b in char_y_bounds], dtype=np.float32)\n        y2_arr = np.array([b[\"y2\"] for b in char_y_bounds], dtype=np.float32)\n        collision_counts = self._compute_collision_counts_histogram(\n            y1_arr,\n            y2_arr,\n            scan_y_min,\n            scan_y_max,\n            step,\n        )\n\n        # 4. 
Find gaps (regions with zero collisions) from the histogram.\n        gaps = []\n        in_gap = False\n        for i, count in enumerate(collision_counts):\n            if count < 1 and not in_gap:\n                in_gap = True\n                gap_start_index = i\n            elif count >= 1 and in_gap:\n                in_gap = False\n                gaps.append((gap_start_index, i - 1))\n        if in_gap:\n            gaps.append((gap_start_index, len(collision_counts) - 1))\n\n        # If no gaps are found, treat it as a single line.\n        if not gaps:\n            # all_chars.sort(key=lambda c: c.visual_bbox.box.x)\n            single_line_composition = self.create_line(all_chars)\n            paragraph.pdf_paragraph_composition = [\n                single_line_composition\n            ] + other_compositions\n            self.update_paragraph_data(paragraph)\n            return\n\n        # 5. Assign characters to lines based on the identified gaps.\n        # Use the top edge (first scan position) of each gap as the separator y-coordinate.\n        separator_y_coords = sorted(\n            [y_coordinates[start_idx] for start_idx, end_idx in gaps],\n            reverse=True,\n        )\n\n        lines: list[list[PdfCharacter]] = [\n            [] for _ in range(len(separator_y_coords) + 1)\n        ]\n\n        for b in char_y_bounds:\n            char_y_center = (b[\"y1\"] + b[\"y2\"]) / 2\n            line_idx = 0\n            # Find which line bucket the character belongs to.\n            for sep_y in separator_y_coords:\n                if char_y_center > sep_y:\n                    break\n                line_idx += 1\n            lines[line_idx].append(b[\"char\"])\n\n        # 6. 
Rebuild the paragraph's composition list from the new lines.\n        new_line_compositions = []\n        for line_chars in lines:\n            if line_chars:\n                # Sort characters within each line by x-coordinate (left-to-right).\n                # line_chars.sort(key=lambda c: c.visual_bbox.box.x)\n                new_line_compositions.append(self.create_line(line_chars))\n\n        # The lines are already sorted vertically due to the scanning process.\n        paragraph.pdf_paragraph_composition = new_line_compositions + other_compositions\n        self.update_paragraph_data(paragraph)\n\n    def process_paragraph_spacing(self, paragraph: PdfParagraph):\n        if not paragraph.pdf_paragraph_composition:\n            return\n\n        # Handle whitespace at the line level\n        processed_lines = []\n        for composition in paragraph.pdf_paragraph_composition:\n            if not composition.pdf_line:\n                processed_lines.append(composition)\n                continue\n\n            line = composition.pdf_line\n            if not \"\".join(\n                x.char_unicode for x in line.pdf_character\n            ).strip():  # Skip lines that are entirely whitespace\n                continue\n\n            # Drop leading whitespace within the line\n            processed_chars = []\n            for char in line.pdf_character:\n                if not char.char_unicode.isspace():\n                    processed_chars.append(char)\n                elif processed_chars:  # Keep a space only after a non-space character has been seen\n                    processed_chars.append(char)\n\n            # Remove trailing spaces\n            while processed_chars and processed_chars[-1].char_unicode.isspace():\n                processed_chars.pop()\n\n            if processed_chars:  # If the line still has characters\n                line = self.create_line(processed_chars)\n                processed_lines.append(line)\n\n        paragraph.pdf_paragraph_composition = processed_lines\n        self.update_paragraph_data(paragraph)\n\n    def create_line(self, chars: list[PdfCharacter]) -> 
PdfParagraphComposition:\n        assert chars\n\n        line = PdfLine(pdf_character=chars)\n        self.update_line_data(line)\n        return PdfParagraphComposition(pdf_line=line)\n\n    def calculate_median_line_width(self, paragraphs: list[PdfParagraph]) -> float:\n        # Collect the widths of all lines\n        line_widths = []\n        for paragraph in paragraphs:\n            for composition in paragraph.pdf_paragraph_composition:\n                if composition.pdf_line:\n                    line = composition.pdf_line\n                    line_widths.append(line.box.x2 - line.box.x)\n\n        if not line_widths:\n            return 0.0\n\n        # Compute the median\n        line_widths.sort()\n        mid = len(line_widths) // 2\n        if len(line_widths) % 2 == 0:\n            return (line_widths[mid - 1] + line_widths[mid]) / 2\n        return line_widths[mid]\n\n    def process_independent_paragraphs(\n        self,\n        paragraphs: list[PdfParagraph],\n        median_width: float,\n    ):\n        i = 0\n        while i < len(paragraphs):\n            paragraph = paragraphs[i]\n            if len(paragraph.pdf_paragraph_composition) <= 1:  # Skip paragraphs with only one line\n                i += 1\n                continue\n\n            j = 1\n            while j < len(paragraph.pdf_paragraph_composition):\n                prev_composition = paragraph.pdf_paragraph_composition[j - 1]\n                if not prev_composition.pdf_line:\n                    j += 1\n                    continue\n\n                prev_line = prev_composition.pdf_line\n                prev_width = prev_line.box.x2 - prev_line.box.x\n                prev_text = \"\".join([c.char_unicode for c in prev_line.pdf_character])\n\n                # Check whether the line contains a run of consecutive dots (at least 20)\n                # A run of at least 20 consecutive dots indicates a table-of-contents entry\n                if re.search(r\"\\.{20,}\", prev_text):\n                    # Create a new paragraph\n                    new_paragraph = PdfParagraph(\n                        box=Box(0, 0, 0, 0),  # Temporary bounding box\n                       
 pdf_paragraph_composition=(\n                            paragraph.pdf_paragraph_composition[j:]\n                        ),\n                        unicode=\"\",\n                        debug_id=generate_base58_id(),\n                        layout_label=paragraph.layout_label,\n                        layout_id=paragraph.layout_id,\n                    )\n                    # Update the original paragraph\n                    paragraph.pdf_paragraph_composition = (\n                        paragraph.pdf_paragraph_composition[:j]\n                    )\n\n                    # Update the data of both paragraphs\n                    self.update_paragraph_data(paragraph)\n                    self.update_paragraph_data(new_paragraph)\n\n                    # Insert the new paragraph after the original one\n                    paragraphs.insert(i + 1, new_paragraph)\n                    break\n\n                # If the previous line is narrower than median_width * short_line_split_factor,\n                # or the current line starts with a bullet point, split the current line and\n                # all following lines into a new paragraph\n                if (\n                    self.translation_config.split_short_lines\n                    and prev_width\n                    < median_width * self.translation_config.short_line_split_factor\n                ) or (\n                    paragraph.pdf_paragraph_composition\n                    and (current_line := paragraph.pdf_paragraph_composition[j])\n                    and (line := current_line.pdf_line)\n                    and (chars := line.pdf_character)\n                    and (char := chars[0])\n                    and is_bullet_point(char)\n                ):\n                    # Create a new paragraph\n                    new_paragraph = PdfParagraph(\n                        box=Box(0, 0, 0, 0),  # Temporary bounding box\n                        pdf_paragraph_composition=(\n                            paragraph.pdf_paragraph_composition[j:]\n                        ),\n                        unicode=\"\",\n                        debug_id=generate_base58_id(),\n                        layout_label=paragraph.layout_label,\n                        layout_id=paragraph.layout_id,\n                    )\n           
         # Update the original paragraph\n                    paragraph.pdf_paragraph_composition = (\n                        paragraph.pdf_paragraph_composition[:j]\n                    )\n\n                    # Update the data of both paragraphs\n                    self.update_paragraph_data(paragraph)\n                    self.update_paragraph_data(new_paragraph)\n\n                    # Insert the new paragraph after the original one\n                    paragraphs.insert(i + 1, new_paragraph)\n                    break\n                j += 1\n            i += 1\n\n    @staticmethod\n    def is_bbox_contain_in_vertical(bbox1: Box, bbox2: Box) -> bool:\n        \"\"\"Check if one bounding box is vertically contained within the other.\"\"\"\n        # Check if bbox1 is contained in bbox2\n        bbox1_in_bbox2 = bbox1.y >= bbox2.y and bbox1.y2 <= bbox2.y2\n        # Check if bbox2 is contained in bbox1\n        bbox2_in_bbox1 = bbox2.y >= bbox1.y and bbox2.y2 <= bbox1.y2\n        return bbox1_in_bbox2 or bbox2_in_bbox1\n\n    def fix_overlapping_paragraphs(self, page: Page):\n        \"\"\"\n        Adjusts the bounding boxes of paragraphs on a page to resolve vertical overlaps.\n\n        Iteratively checks pairs of paragraphs and adjusts their vertical boundaries\n        (y and y2) if they overlap, aiming to place the boundary at the midpoint\n        of the vertical overlap.\n        \"\"\"\n        paragraphs = page.pdf_paragraph\n        if not paragraphs or len(paragraphs) < 2:\n            return\n\n        max_iterations = len(paragraphs) * len(paragraphs)  # Safety break\n        iterations = 0\n\n        while iterations < max_iterations:\n            iterations += 1\n            overlap_found_in_pass = False\n\n            for i in range(len(paragraphs)):\n                for j in range(i + 1, len(paragraphs)):\n                    para1 = paragraphs[i]\n                    para2 = paragraphs[j]\n\n                    if para1.box is None or para2.box is None:\n                        continue\n\n                    if 
para1.xobj_id != para2.xobj_id:\n                        continue\n\n                    # Check for overlap using the existing method\n                    if self.bbox_overlap(para1.box, para2.box):\n                        if self.is_bbox_contain_in_vertical(para1.box, para2.box):\n                            continue\n                        # Calculate vertical overlap details\n                        overlap_y_start = max(para1.box.y, para2.box.y)\n                        overlap_y_end = min(para1.box.y2, para2.box.y2)\n                        overlap_height = overlap_y_end - overlap_y_start\n\n                        # Calculate horizontal overlap details\n                        overlap_x_start = max(para1.box.x, para2.box.x)\n                        overlap_x_end = min(para1.box.x2, para2.box.x2)\n                        overlap_width = overlap_x_end - overlap_x_start\n\n                        # Ensure there's a real 2D overlap, focusing on vertical adjustment\n                        if overlap_height > 1e-6 and overlap_width > 1e-6:\n                            overlap_found_in_pass = True\n\n                            # Determine which paragraph is visually higher\n                            if para1.box.y2 > para2.box.y and para1.box.y < para2.box.y:\n                                lower_para = para1\n                                higher_para = para2\n                            # Handle cases where y values are identical (or very close)\n                            # Prefer the one with smaller y2 as the higher one, or break tie arbitrarily\n                            elif para1.box.y2 < para2.box.y2:\n                                lower_para = para1\n                                higher_para = para2\n                            else:\n                                lower_para = para2\n                                higher_para = para1\n\n                            # Calculate the midpoint of the vertical overlap\n                         
   mid_y = overlap_y_start + overlap_height / 2\n\n                            # Adjust boxes, ensuring they remain valid (y2 > y)\n                            if mid_y > higher_para.box.y and mid_y < lower_para.box.y2:\n                                higher_para.box.y = mid_y + 1\n                                lower_para.box.y2 = mid_y - 1\n                            else:\n                                # This might happen if one box is fully contained vertically\n                                # within another, or due to floating point issues.\n                                # Log a warning and skip adjustment for this pair in this iteration.\n                                # A more complex strategy might be needed for full containment.\n                                logger.warning(\n                                    \"Could not resolve overlap between paragraphs\"\n                                    f\" {higher_para.debug_id} and {lower_para.debug_id}\"\n                                    \" using simple midpoint strategy.\"\n                                    f\" Midpoint: {mid_y},\"\n                                    f\" Higher Box: {higher_para.box},\"\n                                    f\" Lower Box: {lower_para.box}\"\n                                )\n\n            # If no overlaps were found and adjusted in this pass, we're done.\n            if not overlap_found_in_pass:\n                break\n\n        if iterations == max_iterations:\n            logger.warning(\n                f\"Maximum iterations ({max_iterations}) reached in\"\n                f\" fix_overlapping_paragraphs for page {page.page_number}.\"\n                \" Some overlaps might remain.\"\n            )\n\n    def _sort_characters_in_lines(self, page: Page):\n        \"\"\"Sort characters in each line from left to right, top to bottom.\"\"\"\n        for paragraph in page.pdf_paragraph:\n            for composition in paragraph.pdf_paragraph_composition:\n      
          if composition.pdf_line:\n                    line = composition.pdf_line\n                    line.pdf_character.sort(key=self._get_char_sort_key)\n\n    def _get_char_sort_key(self, char: PdfCharacter):\n        \"\"\"Get sort key for character positioning (left to right, then top to bottom).\"\"\"\n        visual_box = char.visual_bbox.box\n        pdf_box = char.box\n\n        # Use visual box if IoU with bbox is >= 0.1, otherwise use bbox\n        if calculate_iou_for_boxes(visual_box, pdf_box) >= 0.1:\n            box = visual_box\n        else:\n            box = pdf_box\n\n        # Sort by x coordinate first (left to right), then by y coordinate (top to bottom)\n        # Note: In PDF coordinate system, y increases upward, so we negate y for top-to-bottom ordering\n        return (box.x, -box.y)\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/remove_descent.py",
    "content": "import logging\nfrom collections import Counter\nfrom functools import cache\n\nfrom babeldoc.format.pdf.document_il import il_version_1\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\n\nlogger = logging.getLogger(__name__)\n\n\nclass RemoveDescent:\n    stage_name = \"Remove Char Descent\"\n\n    def __init__(self, translation_config: TranslationConfig):\n        self.translation_config = translation_config\n\n    def _remove_char_descent(\n        self,\n        char: il_version_1.PdfCharacter,\n        font: il_version_1.PdfFont,\n    ) -> float | None:\n        \"\"\"Remove descent from a single character and return the descent value.\n\n        Args:\n            char: The character to process\n            font: The font used by this character\n\n        Returns:\n            The descent value if it was removed, None otherwise\n        \"\"\"\n        if (\n            char.box\n            and char.box.y is not None\n            and char.box.y2 is not None\n            and font\n            and hasattr(font, \"descent\")\n        ):\n            descent = font.descent * char.pdf_style.font_size / 1000\n            if char.vertical:\n                # For vertical text, remove descent from x coordinates\n                char.box.x += descent\n                char.box.x2 += descent\n            else:\n                # For horizontal text, remove descent from y coordinates\n                char.box.y -= descent\n                char.box.y2 -= descent\n            return descent\n        return None\n\n    def process(self, document: il_version_1.Document):\n        \"\"\"Process the document to remove descent adjustments from character boxes.\n\n        Args:\n            document: The document to process\n        \"\"\"\n        with self.translation_config.progress_monitor.stage_start(\n            self.stage_name,\n            len(document.page),\n        ) as pbar:\n            for page in document.page:\n              
  self.translation_config.raise_if_cancelled()\n                self.process_page(page)\n                pbar.advance()\n\n    def process_page(self, page: il_version_1.Page):\n        \"\"\"Process a single page to remove descent adjustments.\n\n        Args:\n            page: The page to process\n        \"\"\"\n        # Build font map including xobjects\n        fonts: dict[\n            str | int,\n            il_version_1.PdfFont | dict[str, il_version_1.PdfFont],\n        ] = {f.font_id: f for f in page.pdf_font}\n        page_fonts = {f.font_id: f for f in page.pdf_font}\n\n        # Add xobject fonts\n        for xobj in page.pdf_xobject:\n            fonts[xobj.xobj_id] = page_fonts.copy()\n            for font in xobj.pdf_font:\n                fonts[xobj.xobj_id][font.font_id] = font\n\n        @cache\n        def get_font(\n            font_id: str,\n            xobj_id: int | None = None,\n        ) -> il_version_1.PdfFont | None:\n            if xobj_id is not None and xobj_id in fonts:\n                font_map = fonts[xobj_id]\n                if isinstance(font_map, dict) and font_id in font_map:\n                    return font_map[font_id]\n            return (\n                fonts.get(font_id)\n                if isinstance(fonts.get(font_id), il_version_1.PdfFont)\n                else None\n            )\n\n        # Process all standalone characters in the page\n        for char in page.pdf_character:\n            if font := get_font(char.pdf_style.font_id, char.xobj_id):\n                self._remove_char_descent(char, font)\n\n        # Process all paragraphs\n        for paragraph in page.pdf_paragraph:\n            descent_values = []\n            vertical_chars = []\n\n            # Process all characters in paragraph compositions\n            for comp in paragraph.pdf_paragraph_composition:\n                # Handle direct characters\n                if comp.pdf_character:\n                    font = get_font(\n                      
  comp.pdf_character.pdf_style.font_id,\n                        comp.pdf_character.xobj_id,\n                    )\n                    if font:\n                        descent = self._remove_char_descent(comp.pdf_character, font)\n                        if descent is not None:\n                            descent_values.append(descent)\n                            vertical_chars.append(comp.pdf_character.vertical)\n\n                # Handle characters in PdfLine\n                elif comp.pdf_line:\n                    for char in comp.pdf_line.pdf_character:\n                        if font := get_font(char.pdf_style.font_id, char.xobj_id):\n                            descent = self._remove_char_descent(char, font)\n                            if descent is not None:\n                                descent_values.append(descent)\n                                vertical_chars.append(char.vertical)\n\n                # Handle characters in PdfFormula\n                elif comp.pdf_formula:\n                    for char in comp.pdf_formula.pdf_character:\n                        if font := get_font(char.pdf_style.font_id, char.xobj_id):\n                            descent = self._remove_char_descent(char, font)\n                            if descent is not None:\n                                descent_values.append(descent)\n                                vertical_chars.append(char.vertical)\n\n                # Handle characters in PdfSameStyleCharacters\n                elif comp.pdf_same_style_characters:\n                    for char in comp.pdf_same_style_characters.pdf_character:\n                        if font := get_font(char.pdf_style.font_id, char.xobj_id):\n                            descent = self._remove_char_descent(char, font)\n                            if descent is not None:\n                                descent_values.append(descent)\n                                vertical_chars.append(char.vertical)\n\n            # Adjust 
paragraph box based on most common descent value\n            if descent_values and paragraph.box:\n                # Calculate mode of descent values\n                descent_counter = Counter(descent_values)\n                most_common_descent = descent_counter.most_common(1)[0][0]\n\n                # Check if paragraph is vertical (all characters are vertical)\n                is_vertical = all(vertical_chars) if vertical_chars else False\n\n                # Adjust paragraph box\n                if paragraph.box.y is not None and paragraph.box.y2 is not None:\n                    if is_vertical:\n                        # For vertical paragraphs, adjust x coordinates\n                        paragraph.box.x += most_common_descent\n                        paragraph.box.x2 += most_common_descent\n                    else:\n                        # For horizontal paragraphs, adjust y coordinates\n                        paragraph.box.y -= most_common_descent\n                        paragraph.box.y2 -= most_common_descent\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/styles_and_formulas.py",
    "content": "import math\nimport re\n\nfrom babeldoc.format.pdf.document_il.il_version_1 import Box\nfrom babeldoc.format.pdf.document_il.il_version_1 import Document\nfrom babeldoc.format.pdf.document_il.il_version_1 import GraphicState\nfrom babeldoc.format.pdf.document_il.il_version_1 import Page\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfCharacter\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfFormula\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfLine\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfParagraphComposition\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfSameStyleCharacters\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfStyle\nfrom babeldoc.format.pdf.document_il.utils.fontmap import FontMapper\nfrom babeldoc.format.pdf.document_il.utils.formular_helper import (\n    collect_page_formula_font_ids,\n)\nfrom babeldoc.format.pdf.document_il.utils.formular_helper import (\n    is_formulas_middle_char,\n)\nfrom babeldoc.format.pdf.document_il.utils.formular_helper import is_formulas_start_char\nfrom babeldoc.format.pdf.document_il.utils.formular_helper import update_formula_data\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import LEFT_BRACKET\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import RIGHT_BRACKET\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import build_layout_index\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import calculate_iou_for_boxes\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import (\n    calculate_y_true_iou_for_boxes,\n)\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import is_bullet_point\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import (\n    is_curve_in_figure_table_layout,\n)\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import (\n    is_curve_overlapping_with_paragraphs,\n)\nfrom babeldoc.format.pdf.document_il.utils.layout_helper 
import is_same_style\nfrom babeldoc.format.pdf.document_il.utils.spatial_analyzer import (\n    is_element_contained_in_formula,\n)\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\n\n\nclass StylesAndFormulas:\n    stage_name = \"Parse Formulas and Styles\"\n\n    def __init__(self, translation_config: TranslationConfig):\n        self.translation_config = translation_config\n        self.font_mapper = FontMapper(translation_config)\n\n    def update_formula_data(self, formula: PdfFormula):\n        update_formula_data(formula)\n\n    def process(self, document: Document):\n        with self.translation_config.progress_monitor.stage_start(\n            self.stage_name,\n            len(document.page),\n        ) as pbar:\n            for page in document.page:\n                self.translation_config.raise_if_cancelled()\n                self.process_page(page)\n                pbar.advance()\n\n    def update_all_formula_data(self, page: Page):\n        for para in page.pdf_paragraph:\n            for comp in para.pdf_paragraph_composition:\n                if comp.pdf_formula:\n                    self.update_formula_data(comp.pdf_formula)\n\n    def _calculate_element_formula_iou(\n        self, element_box: Box, formula_box: Box, tolerance: float = 2.0\n    ) -> float:\n        \"\"\"Calculate precise IoU between an element and a formula with tolerance.\n\n        Args:\n            element_box: Bounding box of the element (curve/form)\n            formula_box: Bounding box of the formula\n            tolerance: Tolerance to expand formula box for containment check\n\n        Returns:\n            IoU value between element and expanded formula box\n        \"\"\"\n        if element_box is None or formula_box is None:\n            return 0.0\n\n        # Expand formula box by tolerance for more lenient containment check\n        expanded_formula_box = Box(\n            x=formula_box.x - tolerance,\n            y=formula_box.y - 
tolerance,\n            x2=formula_box.x2 + tolerance,\n            y2=formula_box.y2 + tolerance,\n        )\n\n        return calculate_iou_for_boxes(element_box, expanded_formula_box)\n\n    def _is_element_contained_exact(\n        self,\n        element_box: Box,\n        formula_box: Box,\n        containment_threshold: float = 0.95,\n    ) -> bool:\n        \"\"\"Check if an element is contained within a formula with zero tolerance.\n\n        Args:\n            element_box: Bounding box of the element (curve/form)\n            formula_box: Bounding box of the formula\n            containment_threshold: Minimum IoU ratio to consider as contained\n\n        Returns:\n            True if the element is contained within the formula (exact match)\n        \"\"\"\n        if element_box is None or formula_box is None:\n            return False\n\n        # Use formula box without any tolerance expansion\n        iou = calculate_iou_for_boxes(element_box, formula_box)\n        return iou >= containment_threshold\n\n    def _calculate_element_formula_distance(\n        self, element_box: Box, formula_box: Box\n    ) -> float:\n        \"\"\"Calculate the shortest distance between an element and a formula.\n\n        Args:\n            element_box: Bounding box of the element (curve/form)\n            formula_box: Bounding box of the formula\n\n        Returns:\n            Shortest distance between the element and formula boxes\n        \"\"\"\n        if element_box is None or formula_box is None:\n            return float(\"inf\")\n\n        # Calculate horizontal distance\n        if element_box.x2 < formula_box.x:\n            # Element is to the left of formula\n            dx = formula_box.x - element_box.x2\n        elif element_box.x > formula_box.x2:\n            # Element is to the right of formula\n            dx = element_box.x - formula_box.x2\n        else:\n            # Horizontal overlap\n            dx = 0.0\n\n        # Calculate vertical 
distance\n        if element_box.y2 < formula_box.y:\n            # Element is above formula\n            dy = formula_box.y - element_box.y2\n        elif element_box.y > formula_box.y2:\n            # Element is below formula\n            dy = element_box.y - formula_box.y2\n        else:\n            # Vertical overlap\n            dy = 0.0\n\n        # Return Euclidean distance\n        return (dx * dx + dy * dy) ** 0.5\n\n    def _collect_element_formula_candidates(\n        self, page: Page\n    ) -> tuple[list, dict, dict]:\n        \"\"\"Collect all potential assignments of elements to formulas.\n\n        Uses two-level IoU matching strategy:\n        1. Exact IoU matching (zero tolerance) - highest priority\n        2. Tolerant IoU matching (2.0 tolerance, distance-sorted) - second priority\n\n        Returns:\n            Tuple of (all_formulas, curve_candidates, form_candidates) where:\n            - all_formulas: list of (formula, paragraph_xobj_id) tuples\n            - curve_candidates: dict mapping curve index to (curve, candidates) tuples\n            - form_candidates: dict mapping form index to (form, candidates) tuples\n            where candidates is a list of (formula_index, score, match_type) tuples\n        \"\"\"\n        curve_candidates = {}\n        form_candidates = {}\n\n        # Configuration parameters\n        max_tolerant_distance = 100.0  # Maximum distance for tolerant matching scoring\n\n        if not page.pdf_paragraph:\n            return [], curve_candidates, form_candidates\n\n        # Collect all formulas from all paragraphs with their index\n        all_formulas = []\n        for paragraph in page.pdf_paragraph:\n            for composition in paragraph.pdf_paragraph_composition:\n                if composition.pdf_formula:\n                    all_formulas.append((composition.pdf_formula, paragraph.xobj_id))\n\n        # Check each curve against all formulas\n        for curve_idx, curve in enumerate(page.pdf_curve):\n 
           if not curve.box:\n                continue\n\n            candidates = []\n            for formula_idx, (formula, paragraph_xobj_id) in enumerate(all_formulas):\n                if not formula.box:\n                    continue\n\n                # Check xobj_id compatibility\n                if paragraph_xobj_id is not None and curve.xobj_id != paragraph_xobj_id:\n                    continue\n\n                # Level 1: Exact IoU matching (zero tolerance) - highest priority\n                if self._is_element_contained_exact(curve.box, formula.box):\n                    iou = calculate_iou_for_boxes(curve.box, formula.box)\n                    candidates.append((formula_idx, iou, \"iou_exact\"))\n                # Level 2: Tolerant IoU matching (with tolerance) - distance sorted\n                elif is_element_contained_in_formula(curve.box, formula.box):\n                    distance = self._calculate_element_formula_distance(\n                        curve.box, formula.box\n                    )\n                    # Convert distance to score (closer = higher score)\n                    # Score range: 0.5-0.9 to ensure lower than exact IoU\n                    distance_factor = max(0.0, 1.0 - distance / max_tolerant_distance)\n                    score = 0.5 + 0.4 * distance_factor\n                    candidates.append((formula_idx, score, \"iou_tolerant\"))\n\n            if candidates:\n                curve_candidates[curve_idx] = (curve, candidates)\n\n        # Check each form against all formulas\n        for form_idx, form in enumerate(page.pdf_form):\n            if not form.box:\n                continue\n\n            candidates = []\n            for formula_idx, (formula, paragraph_xobj_id) in enumerate(all_formulas):\n                if not formula.box:\n                    continue\n\n                # Check xobj_id compatibility\n                if paragraph_xobj_id is not None and form.xobj_id != paragraph_xobj_id:\n              
      continue\n\n                # Level 1: Exact IoU matching (zero tolerance) - highest priority\n                if self._is_element_contained_exact(form.box, formula.box):\n                    iou = calculate_iou_for_boxes(form.box, formula.box)\n                    candidates.append((formula_idx, iou, \"iou_exact\"))\n                # Level 2: Tolerant IoU matching (with tolerance) - distance sorted\n                elif is_element_contained_in_formula(form.box, formula.box):\n                    distance = self._calculate_element_formula_distance(\n                        form.box, formula.box\n                    )\n                    # Convert distance to score (closer = higher score)\n                    # Score range: 0.5-0.9 to ensure lower than exact IoU\n                    distance_factor = max(0.0, 1.0 - distance / max_tolerant_distance)\n                    score = 0.5 + 0.4 * distance_factor\n                    candidates.append((formula_idx, score, \"iou_tolerant\"))\n\n            if candidates:\n                form_candidates[form_idx] = (form, candidates)\n\n        return all_formulas, curve_candidates, form_candidates\n\n    def _resolve_assignment_conflicts(\n        self, curve_candidates: dict, form_candidates: dict\n    ) -> tuple[dict, list, list]:\n        \"\"\"Resolve assignment conflicts using prioritized matching strategy.\n\n        Args:\n            curve_candidates: dict mapping curve index to (curve, candidates) tuples\n            form_candidates: dict mapping form index to (form, candidates) tuples\n            where candidates is a list of (formula_index, score, match_type) tuples\n\n        Returns:\n            Tuple of (formula_assignments, curves_to_remove, forms_to_remove) where:\n            - formula_assignments: dict mapping formula_index to (curves, forms) tuples\n            - curves_to_remove: list of curves to remove from page level\n            - forms_to_remove: list of forms to remove from page level\n    
    \"\"\"\n        formula_assignments = {}\n        curves_to_remove = []\n        forms_to_remove = []\n\n        def _get_best_candidate(candidates):\n            \"\"\"Get the best candidate using priority: Exact IoU > Tolerant IoU, then by score.\"\"\"\n            if not candidates:\n                return None\n\n            # Sort by match_type priority and then by score (descending)\n            def sort_key(candidate):\n                formula_idx, score, match_type = candidate\n                # Exact IoU matches get priority 1, tolerant IoU matches get priority 2\n                priority = 1 if match_type == \"iou_exact\" else 2\n                # Return tuple for sorting: (priority, -score) for descending score within priority\n                return (priority, -score)\n\n            sorted_candidates = sorted(candidates, key=sort_key)\n            return sorted_candidates[0]\n\n        # Resolve curve assignments\n        for _curve_idx, (curve, candidates) in curve_candidates.items():\n            if not candidates:\n                continue\n\n            best_candidate = _get_best_candidate(candidates)\n            if best_candidate:\n                best_formula_idx, best_score, match_type = best_candidate\n\n                # Add to assignments\n                if best_formula_idx not in formula_assignments:\n                    formula_assignments[best_formula_idx] = ([], [])\n                formula_assignments[best_formula_idx][0].append(curve)\n                curves_to_remove.append(curve)\n\n        # Resolve form assignments\n        for _form_idx, (form, candidates) in form_candidates.items():\n            if not candidates:\n                continue\n\n            best_candidate = _get_best_candidate(candidates)\n            if best_candidate:\n                best_formula_idx, best_score, match_type = best_candidate\n\n                # Add to assignments\n                if best_formula_idx not in formula_assignments:\n               
     formula_assignments[best_formula_idx] = ([], [])\n                formula_assignments[best_formula_idx][1].append(form)\n                forms_to_remove.append(form)\n\n        return formula_assignments, curves_to_remove, forms_to_remove\n\n    def collect_contained_elements(self, page: Page):\n        \"\"\"Collect curves and forms that are contained within formulas.\n\n        Uses a two-phase assignment strategy to ensure each element is assigned\n        to only one formula based on the highest IoU value.\n        \"\"\"\n        if not page.pdf_paragraph:\n            return\n\n        # Phase 1: Collect all potential element-formula assignments\n        all_formulas, curve_candidates, form_candidates = (\n            self._collect_element_formula_candidates(page)\n        )\n\n        # Phase 2: Resolve conflicts using IoU maximization\n        formula_assignments, curves_to_remove, forms_to_remove = (\n            self._resolve_assignment_conflicts(curve_candidates, form_candidates)\n        )\n\n        # Apply the resolved assignments using formula indices\n        for formula_idx, (\n            assigned_curves,\n            assigned_forms,\n        ) in formula_assignments.items():\n            formula = all_formulas[formula_idx][0]  # Extract formula from tuple\n            formula.pdf_curve.extend(assigned_curves)\n            formula.pdf_form.extend(assigned_forms)\n\n        # Remove assigned elements from page level\n        for curve in curves_to_remove:\n            if curve in page.pdf_curve:\n                page.pdf_curve.remove(curve)\n\n        for form in forms_to_remove:\n            if form in page.pdf_form:\n                page.pdf_form.remove(form)\n\n    def process_page(self, page: Page):\n        \"\"\"Process a page, including formula recognition and offset calculation.\"\"\"\n        self.process_page_formulas(page)\n        # self.process_page_offsets(page)\n        self.process_comma_formulas(page)\n        self.merge_overlapping_formulas(page)\n        if not 
self.translation_config.skip_formula_offset_calculation:\n            self.process_page_offsets(page)\n        self.process_translatable_formulas(page)\n        self.update_all_formula_data(page)\n        if not self.translation_config.ocr_workaround:\n            self.collect_contained_elements(page)\n\n        # Process remaining non-formula lines after formula assignment is complete\n        if self.translation_config.remove_non_formula_lines:\n            self.remove_non_formula_lines_from_paragraphs(page)\n\n        if not self.translation_config.skip_formula_offset_calculation:\n            self.process_page_offsets(page)\n        self.update_all_formula_data(page)\n        self.process_page_styles(page)\n\n    def update_line_data(self, line: PdfLine):\n        min_x = min(char.visual_bbox.box.x for char in line.pdf_character)\n        min_y = min(char.visual_bbox.box.y for char in line.pdf_character)\n        max_x = max(char.visual_bbox.box.x2 for char in line.pdf_character)\n        max_y = max(char.visual_bbox.box.y2 for char in line.pdf_character)\n        line.box = Box(min_x, min_y, max_x, max_y)\n\n    def _classify_characters_in_composition(\n        self,\n        composition: PdfParagraphComposition,\n        formula_font_ids: set[int],\n        first_is_bullet_so_far: bool,\n        line_index: int,\n    ) -> tuple[list[tuple[PdfCharacter, bool, bool]], bool]:\n        \"\"\"\n        Phase 1: Classify every character in a composition as either formula or text.\n        This preserves the original logic, including the sticky `first_is_bullet` flag.\n        \"\"\"\n        tagged_chars = []\n        is_formula_tags = []\n\n        line = composition.pdf_line\n        if not line or not line.pdf_character:\n            return [], first_is_bullet_so_far\n\n        first_is_bullet = first_is_bullet_so_far\n        in_formula_state = False\n        in_corner_mark_state = False\n        corner_mark_info = []\n\n        # Determine the `is_formula` tag for 
each character\n        for i, char in enumerate(line.pdf_character):\n            # The original logic for `first_is_bullet`: it is set if any segment starts with a bullet.\n            # A \"segment\" started when `current_chars` was empty.\n            # We determine the start of a segment by looking at the previous char's tag.\n            is_start_of_segment = i == 0 or (\n                len(is_formula_tags) > 0 and is_formula_tags[-1] != in_formula_state\n            )\n            if not first_is_bullet and is_start_of_segment and is_bullet_point(char):\n                first_is_bullet = True\n\n            is_formula = (\n                (  # Distinguish formula-start characters from formula-middle characters: a comma cannot start a formula but may appear inside one.\n                    char.formula_layout_id\n                    or (\n                        is_formulas_start_char(\n                            char.char_unicode,\n                            self.font_mapper,\n                            self.translation_config,\n                        )\n                        and not in_formula_state\n                    )\n                    or (\n                        is_formulas_middle_char(\n                            char.char_unicode,\n                            self.font_mapper,\n                            self.translation_config,\n                        )\n                        and in_formula_state\n                    )\n                )  # formula character\n                or char.pdf_style.font_id in formula_font_ids  # formula font\n                or char.vertical  # vertical text\n                or (\n                    #   A dummy space inserted by the program\n                    char.char_unicode is None and in_formula_state\n                )\n                or (\n                    # If the character's visual box and actual box are inconsistent (they do not overlap), treat it as a formula character\n                    char.box.x > char.visual_bbox.box.x2\n                    or char.box.x2 < char.visual_bbox.box.x\n                    or char.box.y > char.visual_bbox.box.y2\n                    or char.box.y2 < char.visual_bbox.box.y\n       
         )\n            )\n\n            previous_char = line.pdf_character[i - 1] if i > 0 else None\n            next_char = (\n                line.pdf_character[i + 1] if i < len(line.pdf_character) - 1 else None\n            )\n            isspace = char.char_unicode.isspace() if char.char_unicode else False\n            prev_is_space = (\n                previous_char.char_unicode.isspace()\n                if previous_char and previous_char.char_unicode\n                else False\n            )\n\n            is_corner_mark = (\n                (\n                    previous_char is not None\n                    and not isspace\n                    and not prev_is_space\n                    and not first_is_bullet\n                    # Corner-mark font size: corner marks occur at a 0.76 ratio and capitals at 0.799, so 0.79 is used as a middle value; enlarged initial letters are also taken into account\n                    and char.pdf_style.font_size\n                    < previous_char.pdf_style.font_size * 0.79\n                    and not in_corner_mark_state\n                )\n                or (\n                    previous_char is not None\n                    and not isspace\n                    and not prev_is_space\n                    and not first_is_bullet\n                    # Corner-mark font size: corner marks occur at a 0.76 ratio and capitals at 0.799, so 0.79 is used as a middle value; enlarged initial letters are also taken into account\n                    and char.pdf_style.font_size\n                    < previous_char.pdf_style.font_size * 1.1\n                    and in_corner_mark_state\n                )\n                or (\n                    # Check for a corner mark at the start of a paragraph: with no previous character, decide using the next character\n                    previous_char is None\n                    and next_char is not None\n                    and not isspace\n                    and not prev_is_space\n                    and not first_is_bullet\n                    # The current character's font size is clearly smaller than the next character's, so treat it as a corner mark\n                    and char.pdf_style.font_size < next_char.pdf_style.font_size * 0.79\n                    and not in_corner_mark_state\n                )\n            )\n\n            is_formula = is_formula or 
is_corner_mark\n\n            if char.char_unicode == \" \":\n                is_formula = in_formula_state\n\n            # This simulates the state change for the next iteration\n            if is_formula != in_formula_state:\n                in_formula_state = is_formula\n\n            in_corner_mark_state = is_corner_mark\n            is_formula_tags.append(is_formula)\n            corner_mark_info.append(is_corner_mark)\n\n        for char, is_formula, is_corner_mark in zip(\n            line.pdf_character, is_formula_tags, corner_mark_info, strict=False\n        ):\n            tagged_chars.append((char, is_formula, is_corner_mark))\n\n        return tagged_chars, first_is_bullet\n\n    def _group_classified_characters(\n        self,\n        tagged_chars: list[tuple[PdfCharacter, bool, bool]],\n        line_index: int,\n    ) -> list[PdfParagraphComposition]:\n        \"\"\"\n        Phase 2: Group consecutive characters with the same tag into new compositions.\n        \"\"\"\n        if not tagged_chars:\n            return []\n\n        new_compositions = []\n        current_chars = []\n        current_tag = tagged_chars[0][1]\n        current_corner_mark_flags = []\n\n        for char, is_formula_tag, is_corner_mark in tagged_chars:\n            if is_formula_tag == current_tag:\n                current_chars.append(char)\n                current_corner_mark_flags.append(is_corner_mark)\n            else:\n                # Check if any character in current group is a corner mark\n                has_corner_mark = any(current_corner_mark_flags)\n                new_compositions.append(\n                    self.create_composition(\n                        current_chars, current_tag, line_index, has_corner_mark\n                    ),\n                )\n                current_chars = [char]\n                current_tag = is_formula_tag\n                current_corner_mark_flags = [is_corner_mark]\n\n        if current_chars:\n            # Check if any 
character in final group is a corner mark\n            has_corner_mark = any(current_corner_mark_flags)\n            new_compositions.append(\n                self.create_composition(\n                    current_chars, current_tag, line_index, has_corner_mark\n                ),\n            )\n\n        return new_compositions\n\n    def process_page_formulas(self, page: Page):\n        if not page.pdf_paragraph:\n            return\n\n        page_level_formula_font_ids, xobj_specific_formula_font_ids = (\n            collect_page_formula_font_ids(\n                page, self.translation_config.formular_font_pattern\n            )\n        )\n\n        for paragraph in page.pdf_paragraph:\n            if not paragraph.pdf_paragraph_composition:\n                continue\n\n            current_formula_font_ids: set[int]\n            if (\n                paragraph.xobj_id\n                and paragraph.xobj_id in xobj_specific_formula_font_ids\n            ):\n                current_formula_font_ids = xobj_specific_formula_font_ids[\n                    paragraph.xobj_id\n                ]\n            else:\n                current_formula_font_ids = page_level_formula_font_ids\n\n            new_paragraph_compositions = []\n            # This flag is carried through all compositions in a paragraph, as in the original implementation.\n            first_is_bullet = False\n\n            for line_index, composition in enumerate(\n                paragraph.pdf_paragraph_composition\n            ):\n                (\n                    tagged_chars,\n                    first_is_bullet,\n                ) = self._classify_characters_in_composition(\n                    composition,\n                    current_formula_font_ids,\n                    first_is_bullet,\n                    line_index,\n                )\n\n                if not tagged_chars:\n                    new_paragraph_compositions.append(composition)\n                    continue\n\n          
      grouped_compositions = self._group_classified_characters(\n                    tagged_chars, line_index\n                )\n                new_paragraph_compositions.extend(grouped_compositions)\n\n            paragraph.pdf_paragraph_composition = new_paragraph_compositions\n\n    def process_translatable_formulas(self, page: Page):\n        \"\"\"Convert formulas that should be translated normally (e.g. pure numbers, numbers with commas) into ordinary text lines.\"\"\"\n        if not page.pdf_paragraph:\n            return\n\n        for paragraph in page.pdf_paragraph:\n            if not paragraph.pdf_paragraph_composition:\n                continue\n\n            new_compositions = []\n            for composition in paragraph.pdf_paragraph_composition:\n                if (\n                    composition.pdf_formula is not None\n                    and not composition.pdf_formula.is_corner_mark\n                    and self.is_translatable_formula(\n                        composition.pdf_formula,\n                    )\n                ):\n                    # Convert the translatable formula into an ordinary text line\n                    new_line = PdfLine(\n                        pdf_character=composition.pdf_formula.pdf_character,\n                    )\n                    self.update_line_data(new_line)\n                    new_compositions.append(PdfParagraphComposition(pdf_line=new_line))\n                else:\n                    new_compositions.append(composition)\n\n            paragraph.pdf_paragraph_composition = new_compositions\n\n    def process_page_styles(self, page: Page):\n        \"\"\"Process text styles on the page and identify text that shares the same style.\"\"\"\n        if not page.pdf_paragraph:\n            return\n\n        for paragraph in page.pdf_paragraph:\n            if not paragraph.pdf_paragraph_composition:\n                continue\n\n            # Compute the base style (the intersection of all text styles, excluding formulas)\n            base_style = self._calculate_base_style(paragraph)\n            paragraph.pdf_style = base_style\n\n            # Reorganize the paragraph's text, grouping text with the same style together\n            new_compositions = []\n            current_chars = []\n      
      current_style = None\n\n            for comp in paragraph.pdf_paragraph_composition:\n                if comp.pdf_formula is not None:\n                    if current_chars:\n                        new_comp = self._create_same_style_composition(\n                            current_chars,\n                            current_style,\n                        )\n                        new_compositions.append(new_comp)\n                        current_chars = []\n                    new_compositions.append(comp)\n                    continue\n\n                if not comp.pdf_line:\n                    new_compositions.append(comp)\n                    continue\n\n                for char in comp.pdf_line.pdf_character:\n                    char_style = char.pdf_style\n                    if current_style is None:\n                        current_style = char_style\n                        current_chars.append(char)\n                    elif is_same_style(char_style, current_style):\n                        current_chars.append(char)\n                    else:\n                        if current_chars:\n                            new_comp = self._create_same_style_composition(\n                                current_chars,\n                                current_style,\n                            )\n                            new_compositions.append(new_comp)\n                        current_chars = [char]\n                        current_style = char_style\n\n            if current_chars:\n                new_comp = self._create_same_style_composition(\n                    current_chars,\n                    current_style,\n                )\n                new_compositions.append(new_comp)\n\n            paragraph.pdf_paragraph_composition = new_compositions\n\n    def _calculate_base_style(self, paragraph) -> PdfStyle:\n        \"\"\"Compute the paragraph's base style (the intersection of all text styles, excluding formulas).\"\"\"\n        styles = []\n        for comp in paragraph.pdf_paragraph_composition:\n          
  if isinstance(comp, PdfFormula):\n                continue\n            if not comp.pdf_line:\n                continue\n            for char in comp.pdf_line.pdf_character:\n                styles.append(char.pdf_style)\n\n        if not styles:\n            return None\n\n        # Return the intersection of all styles\n        base_style = styles[0]\n        for style in styles[1:]:\n            # Update the base style to the intersection of all styles\n            base_style = self._merge_styles(base_style, style)\n\n        # If font_id or font_size is None, fall back to the mode\n        if base_style.font_id is None:\n            base_style.font_id = self._get_mode_value([s.font_id for s in styles])\n        if base_style.font_size is None:\n            base_style.font_size = self._get_mode_value([s.font_size for s in styles])\n\n        return base_style\n\n    def _get_mode_value(self, values):\n        \"\"\"Compute the mode of a list.\"\"\"\n        if not values:\n            return None\n        from collections import Counter\n\n        counter = Counter(values)\n        return counter.most_common(1)[0][0]\n\n    def _merge_styles(self, style1, style2):\n        \"\"\"Merge two styles and return their intersection.\"\"\"\n        if style1 is None or style1.font_size is None:\n            return style2\n        if style2 is None or style2.font_size is None:\n            return style1\n\n        return PdfStyle(\n            font_id=style1.font_id if style1.font_id == style2.font_id else None,\n            font_size=(\n                style1.font_size\n                if math.fabs(style1.font_size - style2.font_size) < 0.02\n                else None\n            ),\n            graphic_state=self._merge_graphic_states(\n                style1.graphic_state,\n                style2.graphic_state,\n            ),\n        )\n\n    def _merge_graphic_states(self, state1, state2):\n        \"\"\"Merge two GraphicState objects and return their intersection.\"\"\"\n        if state1 is None:\n            return state2\n        if state2 is None:\n            return state1\n\n        return GraphicState(\n            
passthrough_per_char_instruction=(\n                state1.passthrough_per_char_instruction\n                if state1.passthrough_per_char_instruction\n                == state2.passthrough_per_char_instruction\n                else None\n            ),\n        )\n\n    def _create_same_style_composition(\n        self,\n        chars: list[PdfCharacter],\n        style,\n    ) -> PdfParagraphComposition:\n        \"\"\"Create a text composition whose characters share the same style.\"\"\"\n        if not chars:\n            return None\n\n        # Compute the bounding box\n        min_x = min(char.visual_bbox.box.x for char in chars)\n        min_y = min(char.visual_bbox.box.y for char in chars)\n        max_x = max(char.visual_bbox.box.x2 for char in chars)\n        max_y = max(char.visual_bbox.box.y2 for char in chars)\n        box = Box(min_x, min_y, max_x, max_y)\n\n        return PdfParagraphComposition(\n            pdf_same_style_characters=PdfSameStyleCharacters(\n                box=box,\n                pdf_style=style,\n                pdf_character=chars,\n            ),\n        )\n\n    def process_page_offsets(self, page: Page):\n        \"\"\"Compute the x and y offsets of formulas.\"\"\"\n        if not page.pdf_paragraph:\n            return\n\n        for paragraph in page.pdf_paragraph:\n            if paragraph.debug_id is None:\n                continue\n            if not paragraph.pdf_paragraph_composition:\n                continue\n\n            # Compute the paragraph's line spacing and use 80% of it as the tolerance\n            # line_spacing = self.calculate_line_spacing(paragraph)\n            # y_tolerance = line_spacing * 0.8\n\n            for i, composition in enumerate(paragraph.pdf_paragraph_composition):\n                if not composition.pdf_formula:\n                    continue\n\n                formula = composition.pdf_formula\n                left_char = None\n                right_char = None\n\n                left_iou = 0\n                right_iou = 0\n\n                # Find the nearest text on the same line to the left\n                for j in range(i - 1, -1, -1):\n                    
comp = paragraph.pdf_paragraph_composition[j]\n                    if comp.pdf_line:\n                        for char in reversed(comp.pdf_line.pdf_character):\n                            if not char.pdf_character_id:\n                                continue\n                            # Check whether the y coordinates are close to decide if both are on the same line\n                            left_iou = calculate_y_true_iou_for_boxes(\n                                formula.box, char.box\n                            )\n                            if left_iou > 0.6:\n                                left_char = char\n                                break\n                    break\n\n                # Find the nearest text on the same line to the right\n                for j in range(i + 1, len(paragraph.pdf_paragraph_composition)):\n                    comp = paragraph.pdf_paragraph_composition[j]\n                    if comp.pdf_line:\n                        for char in comp.pdf_line.pdf_character:\n                            if not char.pdf_character_id:\n                                continue\n                            # Check whether the y coordinates are close to decide if both are on the same line\n                            right_iou = calculate_y_true_iou_for_boxes(\n                                formula.box, char.box\n                            )\n                            if right_iou > 0.6:\n                                right_char = char\n                                break\n                    break\n\n                # If both text segments exist, keep the one with higher IOU\n                if left_char and right_char:\n                    if left_iou < right_iou:\n                        left_char = None\n                    elif right_iou < left_iou:\n                        right_char = None\n                    # If IOUs are equal, keep both\n\n                # Compute the x offset (relative to the text on the left)\n                if left_char:\n                    formula.x_offset = formula.box.x - left_char.box.x2\n                else:\n                    formula.x_offset = 0  # With no text on the left, x_offset should be 0\n     
           if abs(formula.x_offset) < 0.1:\n                    formula.x_offset = 0\n                if formula.x_offset > 10:\n                    formula.x_offset = 0\n                # if formula.x_offset > 0:\n                #     formula.x_offset = 0\n                if formula.x_offset < -5:\n                    formula.x_offset = 0\n\n                # Compute the y offset\n                if left_char:\n                    # Use the bottom coordinate to compute the offset\n                    formula.y_offset = formula.box.y - left_char.box.y\n                elif right_char:\n                    formula.y_offset = formula.box.y - right_char.box.y\n                else:\n                    formula.y_offset = 0\n\n                if abs(formula.y_offset) < 0.1:\n                    formula.y_offset = 0\n\n                if max(abs(formula.y_offset), abs(formula.x_offset)) > 10:\n                    pass\n                    # logging.debug(\n                    #     f\"Formula {formula.box} has an excessive offset: {formula.x_offset}, {formula.y_offset}\"\n                    # )\n\n    def calculate_line_spacing(self, paragraph) -> float:\n        \"\"\"Compute the paragraph's typical line spacing (the median gap between adjacent lines).\"\"\"\n        if not paragraph.pdf_paragraph_composition:\n            return 0.0\n\n        # Collect the y coordinate of every text line\n        line_y_positions = []\n        for comp in paragraph.pdf_paragraph_composition:\n            if comp.pdf_line:\n                line_y_positions.append(comp.pdf_line.box.y)\n\n        if len(line_y_positions) < 2:\n            return 10.0  # With one line or none, return a default value\n\n        # Compute the y difference between adjacent lines\n        line_spacings = []\n        for i in range(len(line_y_positions) - 1):\n            spacing = abs(line_y_positions[i] - line_y_positions[i + 1])\n            if spacing > 0:  # Ignore overlapping lines\n                line_spacings.append(spacing)\n\n        if not line_spacings:\n            return 10.0  # With no valid line spacing, return the default value\n\n        # Use the median to reduce the effect of outliers\n        median_spacing = sorted(line_spacings)[len(line_spacings) // 2]\n        return median_spacing\n\n    def 
create_composition(\n        self,\n        chars: list[PdfCharacter],\n        is_formula: bool,\n        line_index: int,\n        is_corner_mark: bool = False,\n    ) -> PdfParagraphComposition:\n        if is_formula:\n            formula = PdfFormula(pdf_character=chars, line_id=line_index)\n            formula.is_corner_mark = is_corner_mark\n            self.update_formula_data(formula)\n            return PdfParagraphComposition(pdf_formula=formula)\n        else:\n            new_line = PdfLine(pdf_character=chars)\n            self.update_line_data(new_line)\n            return PdfParagraphComposition(pdf_line=new_line)\n\n    def is_translatable_formula(self, formula: PdfFormula) -> bool:\n        \"\"\"Return whether the formula contains only characters that should be translated normally (digits, spaces, ASCII commas, and periods).\"\"\"\n        if all(char.formula_layout_id for char in formula.pdf_character):\n            return False\n\n        text = \"\".join(char.char_unicode for char in formula.pdf_character)\n        if formula.y_offset > 0.1:\n            return False\n        return bool(re.match(r\"^[0-9, .]+$\", text))\n\n    def should_split_formula(self, formula: PdfFormula) -> bool:\n        \"\"\"Return whether the formula should be split at commas (it contains a comma and other special symbols).\"\"\"\n\n        if all(x.formula_layout_id for x in formula.pdf_character):\n            return False\n\n        text = \"\".join(char.char_unicode for char in formula.pdf_character)\n        # Must contain a comma\n        if \",\" not in text:\n            return False\n        # Check for symbols other than digits and []\n        text_without_basic = re.sub(r\"[0-9\\[\\],\\s]\", \"\", text)\n        return bool(text_without_basic)\n\n    def split_formula_by_comma(\n        self,\n        formula: PdfFormula,\n    ) -> list[tuple[list[PdfCharacter], PdfCharacter]]:\n        \"\"\"Split the formula's characters at commas and return a list of (character group, comma character) pairs; the comma character of the last group is None.\n        Only commas outside brackets act as separators. Supported bracket pairs:\n        - (cid:8) and (cid:9)\n        - ( and )\n        - (cid:16) and (cid:17)\n        \"\"\"\n        result = []\n        current_chars = []\n        bracket_level = 0  # Track the bracket nesting depth\n\n      
  for char in formula.pdf_character:\n            # Check for a left bracket\n            if char.char_unicode in LEFT_BRACKET:\n                bracket_level += 1\n                current_chars.append(char)\n            # Check for a right bracket\n            elif char.char_unicode in RIGHT_BRACKET:\n                bracket_level = max(0, bracket_level - 1)  # Guard against unbalanced brackets\n                current_chars.append(char)\n            # Check for a comma that is outside any brackets\n            elif char.char_unicode == \",\" and bracket_level == 0:\n                if current_chars:\n                    result.append((current_chars, char))\n                    current_chars = []\n            else:\n                current_chars.append(char)\n\n        if current_chars:\n            result.append((current_chars, None))  # The last group has no comma\n\n        return result\n\n    def merge_formulas(self, formula1: PdfFormula, formula2: PdfFormula) -> PdfFormula:\n        \"\"\"Merge two formulas, preserving the relative positions of their characters.\"\"\"\n        # Merge all characters\n        all_chars = formula1.pdf_character + formula2.pdf_character\n        # Sort by y and x coordinates to ensure the characters are in the correct order\n        # sorted_chars = sorted(\n        #     all_chars, key=lambda c: (c.visual_bbox.box.y, c.visual_bbox.box.x))\n\n        # Inherit the line ID of the first formula\n        merged_formula = PdfFormula(pdf_character=all_chars, line_id=formula1.line_id)\n        self.update_formula_data(merged_formula)\n        return merged_formula\n\n    def is_x_axis_contained(self, box1: Box, box2: Box) -> bool:\n        \"\"\"Return whether box1's x extent is fully contained in box2's x extent, or vice versa.\"\"\"\n        return (box1.x >= box2.x and box1.x2 <= box2.x2) or (\n            box2.x >= box1.x and box2.x2 <= box1.x2\n        )\n\n    def has_y_intersection(self, box1: Box, box2: Box) -> bool:\n        \"\"\"Return whether the two boxes intersect on the y axis.\"\"\"\n        tolerance = 1.0\n        return not (box1.y2 < box2.y - tolerance or box2.y2 < box1.y - tolerance)\n\n    def is_x_axis_adjacent(self, box1: Box, box2: Box, tolerance: float = 2.0) -> bool:\n        \"\"\"Return whether the two boxes are adjacent or intersect on the x axis.\"\"\"\n        # Check for an intersection\n        
has_intersection = not (box1.x2 < box2.x or box2.x2 < box1.x)\n\n        # Check whether box1 is to the left of box2 and adjacent\n        left_adjacent = abs(box1.x2 - box2.x) <= tolerance\n        # Check whether box2 is to the left of box1 and adjacent\n        right_adjacent = abs(box2.x2 - box1.x) <= tolerance\n\n        return has_intersection or left_adjacent or right_adjacent\n\n    def calculate_y_iou(self, box1: Box, box2: Box) -> float:\n        \"\"\"Compute the IoU (Intersection over Union) of the two boxes on the y axis.\"\"\"\n        # Compute the intersection\n        intersection_start = max(box1.y, box2.y)\n        intersection_end = min(box1.y2, box2.y2)\n        intersection_length = max(0, intersection_end - intersection_start)\n\n        # Compute the union\n        box1_height = box1.y2 - box1.y\n        box2_height = box2.y2 - box2.y\n        union_length = box1_height + box2_height - intersection_length\n\n        # Avoid division by zero\n        if union_length <= 0:\n            return 0.0\n\n        return intersection_length / union_length\n\n    def merge_overlapping_formulas(self, page: Page):\n        \"\"\"\n        Merge formulas that meet any of the following conditions:\n        1. adjacent formulas that overlap on the x axis and intersect on the y axis, or\n        2. adjacent formulas that are adjacent on the x axis with y-axis IoU > 0.5, or\n        3. adjacent formulas whose characters all share the same layout id, or\n        4. 
任意两个公式的 IOU > 0.8\n        角标可能会被识别成单独的公式，需要合并\n        \"\"\"\n        if not page.pdf_paragraph:\n            return\n\n        for paragraph in page.pdf_paragraph:\n            if not paragraph.pdf_paragraph_composition:\n                continue\n\n            # 重复执行合并过程，直到没有更多可以合并的公式\n            merged = True\n            while merged:\n                merged = False\n                for i in range(len(paragraph.pdf_paragraph_composition)):\n                    if merged:\n                        break\n                    comp1 = paragraph.pdf_paragraph_composition[i]\n                    if comp1.pdf_formula is None:\n                        continue\n\n                    for j in range(i + 1, len(paragraph.pdf_paragraph_composition)):\n                        comp2 = paragraph.pdf_paragraph_composition[j]\n                        if comp2.pdf_formula is None:\n                            continue\n\n                        formula1 = comp1.pdf_formula\n                        formula2 = comp2.pdf_formula\n\n                        # 检查合并条件：\n                        # 0. 必须在同一行（line_id 相同），以及\n                        # 1. x 轴重叠且 y 轴有交集，或者\n                        # 2. x 轴相邻且 y 轴 IOU > 0.5，或者\n                        # 3. 所有字符的 layout id 都相同，或者\n                        # 4. 
任意两个公式的 IOU > 0.8\n\n                        # 检查是否在同一行\n                        same_line = formula1.line_id == formula2.line_id\n\n                        should_merge = same_line and (\n                            (\n                                j == i + 1\n                                and (\n                                    (\n                                        self.is_x_axis_contained(\n                                            formula1.box, formula2.box\n                                        )\n                                        and self.has_y_intersection(\n                                            formula1.box, formula2.box\n                                        )\n                                    )\n                                    or (\n                                        self.is_x_axis_adjacent(\n                                            formula1.box, formula2.box\n                                        )\n                                        and self.calculate_y_iou(\n                                            formula1.box, formula2.box\n                                        )\n                                        > 0.5\n                                    )\n                                )\n                            )\n                            or (self._have_same_layout_ids(formula1, formula2, page))\n                            or (\n                                calculate_iou_for_boxes(formula1.box, formula2.box)\n                                > 0.8\n                            )\n                            or (\n                                calculate_iou_for_boxes(formula2.box, formula1.box)\n                                > 0.8\n                            )\n                        )\n\n                        if should_merge:\n                            # 合并公式\n                            merged_formula = self.merge_formulas(formula1, formula2)\n                            
paragraph.pdf_paragraph_composition[i] = (\n                                PdfParagraphComposition(\n                                    pdf_formula=merged_formula,\n                                )\n                            )\n                            # 删除第二个公式\n                            del paragraph.pdf_paragraph_composition[j]\n                            merged = True\n                            break\n\n    def _have_same_layout_ids(\n        self, formula1: PdfFormula, formula2: PdfFormula, page: Page\n    ) -> bool:\n        \"\"\"检查两个公式的所有字符是否具有相同的 layout id\"\"\"\n        # 获取 formula1 中所有字符的 layout id\n        formula1_layout_ids = set()\n        for char in formula1.pdf_character:\n            if char.char_unicode == \" \":\n                continue\n            layout = char.formula_layout_id\n            if layout:\n                formula1_layout_ids.add(layout)\n\n        # 获取 formula2 中所有字符的 layout id\n        formula2_layout_ids = set()\n        for char in formula2.pdf_character:\n            if char.char_unicode == \" \":\n                continue\n            layout = char.formula_layout_id\n            if layout:\n                formula2_layout_ids.add(layout)\n\n        # 如果任一公式没有有效的 layout id，则不合并\n        if not (len(formula1_layout_ids) == len(formula2_layout_ids) == 1):\n            return False\n\n        # 检查两个公式的 layout id 集合是否相同\n        return formula1_layout_ids == formula2_layout_ids\n\n    def process_comma_formulas(self, page: Page):\n        \"\"\"处理包含逗号的复杂公式，将其按逗号拆分\"\"\"\n        if not page.pdf_paragraph:\n            return\n\n        for paragraph in page.pdf_paragraph:\n            if not paragraph.pdf_paragraph_composition:\n                continue\n\n            new_compositions = []\n            for composition in paragraph.pdf_paragraph_composition:\n                if composition.pdf_formula is not None and self.should_split_formula(\n                    composition.pdf_formula,\n                ):\n      
              # 按逗号拆分公式\n                    char_groups = self.split_formula_by_comma(composition.pdf_formula)\n                    for chars, comma in char_groups:\n                        if chars:  # 忽略空组（连续的逗号）\n                            # 继承原公式的行 ID\n                            formula = PdfFormula(\n                                pdf_character=chars,\n                                line_id=composition.pdf_formula.line_id,\n                            )\n                            self.update_formula_data(formula)\n                            new_compositions.append(\n                                PdfParagraphComposition(pdf_formula=formula),\n                            )\n\n                            # 如果有逗号，添加为文本行\n                            if comma:\n                                comma_line = PdfLine(pdf_character=[comma])\n                                self.update_line_data(comma_line)\n                                new_compositions.append(\n                                    PdfParagraphComposition(pdf_line=comma_line),\n                                )\n                else:\n                    new_compositions.append(composition)\n\n            paragraph.pdf_paragraph_composition = new_compositions\n\n    def remove_non_formula_lines_from_paragraphs(self, page: Page):\n        \"\"\"Remove non-formula lines from paragraphs.\n\n        This method processes curves that remain in page.pdf_curve after\n        collect_contained_elements() has assigned formula-related curves to formulas.\n        All remaining curves are non-formula lines, but we need to be careful\n        not to remove lines from figure/table areas.\n\n        Args:\n            page: The page to process\n        \"\"\"\n        if not page.pdf_curve:\n            return\n\n        # Build layout index for efficient spatial queries\n        layout_index, layout_map = build_layout_index(page)\n\n        curves_to_remove = []\n\n        # Get configuration thresholds\n  
      protection_threshold = getattr(\n            self.translation_config, \"figure_table_protection_threshold\", 0.9\n        )\n        overlap_threshold = getattr(\n            self.translation_config, \"non_formula_line_iou_threshold\", 0.9\n        )\n\n        for curve in page.pdf_curve:\n            # Skip if curve is in figure/table layout areas\n            if is_curve_in_figure_table_layout(\n                curve, layout_index, layout_map, protection_threshold\n            ):\n                continue\n\n            # Only remove if curve overlaps with text paragraph areas\n            if is_curve_overlapping_with_paragraphs(\n                curve, page.pdf_paragraph, overlap_threshold\n            ):\n                curves_to_remove.append(curve)\n\n        # Remove identified curves\n        removed_count = 0\n        for curve in curves_to_remove:\n            if curve in page.pdf_curve:\n                page.pdf_curve.remove(curve)\n                removed_count += 1\n\n        if removed_count > 0:\n            import logging\n\n            logger = logging.getLogger(__name__)\n            logger.debug(f\"Removed {removed_count} non-formula lines from paragraphs\")\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/table_parser.py",
    "content": "import logging\nfrom pathlib import Path\n\nimport cv2\nimport numpy as np\nfrom pymupdf import Document\n\nfrom babeldoc.format.pdf.document_il import il_version_1\nfrom babeldoc.format.pdf.document_il.utils.mupdf_helper import get_no_rotation_img\nfrom babeldoc.format.pdf.document_il.utils.style_helper import GREEN\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\n\nlogger = logging.getLogger(__name__)\n\n\nclass TableParser:\n    stage_name = \"Parse Table\"\n\n    def __init__(self, translation_config: TranslationConfig):\n        self.translation_config = translation_config\n        self.model = translation_config.table_model\n\n    def _save_debug_image(self, image: np.ndarray, layouts, page_number: int):\n        \"\"\"Save debug image with drawn boxes if debug mode is enabled.\"\"\"\n        if not self.translation_config.debug:\n            return\n\n        if not isinstance(layouts, list):\n            layouts = [layouts]\n        debug_dir = Path(\n            self.translation_config.get_working_file_path(\"table-ocr-box-image\")\n        )\n        debug_dir.mkdir(parents=True, exist_ok=True)\n\n        # Draw boxes on the image\n        debug_image = image.copy()\n        for layout in layouts:\n            for box in layout.boxes:\n                x0, y0, x1, y1 = box.xyxy\n                cv2.rectangle(\n                    debug_image,\n                    (int(x0), int(y0)),\n                    (int(x1), int(y1)),\n                    (0, 255, 0),\n                    2,\n                )\n                # Add text label\n                cv2.putText(\n                    debug_image,\n                    layout.names[box.cls],\n                    (int(x0), int(y0) - 5),\n                    cv2.FONT_HERSHEY_SIMPLEX,\n                    0.5,\n                    (0, 255, 0),\n                    1,\n                )\n\n        # Save the image\n        output_path = debug_dir / f\"{page_number}.jpg\"\n     
   cv2.imwrite(str(output_path), debug_image)\n\n    def _save_debug_box_to_page(self, page: il_version_1.Page):\n        \"\"\"Save debug boxes and text labels to the PDF page.\"\"\"\n        if not self.translation_config.debug:\n            return\n\n        color = GREEN\n\n        for layout in page.page_layout:\n            # Create a rectangle box\n            rect = il_version_1.PdfRectangle(\n                box=il_version_1.Box(\n                    x=layout.box.x,\n                    y=layout.box.y,\n                    x2=layout.box.x2,\n                    y2=layout.box.y2,\n                ),\n                graphic_state=color,\n                debug_info=True,\n            )\n            page.pdf_rectangle.append(rect)\n\n            # Create text label at top-left corner\n            # Note: PDF coordinates are from bottom-left,\n            # so we use y2 for top position\n            style = il_version_1.PdfStyle(\n                font_id=\"base\",\n                font_size=4,\n                graphic_state=color,\n            )\n            page.pdf_paragraph.append(\n                il_version_1.PdfParagraph(\n                    first_line_indent=False,\n                    box=il_version_1.Box(\n                        x=layout.box.x,\n                        y=layout.box.y2,\n                        x2=layout.box.x2,\n                        y2=layout.box.y2 + 5,\n                    ),\n                    vertical=False,\n                    pdf_style=style,\n                    unicode=layout.class_name,\n                    pdf_paragraph_composition=[\n                        il_version_1.PdfParagraphComposition(\n                            pdf_same_style_unicode_characters=il_version_1.PdfSameStyleUnicodeCharacters(\n                                unicode=layout.class_name,\n                                pdf_style=style,\n                                debug_info=True,\n                            ),\n                        
),\n                    ],\n                    xobj_id=-1,\n                ),\n            )\n\n    def process(self, docs: il_version_1.Document, mupdf_doc: Document):\n        \"\"\"Generate layouts for all pages that need to be translated.\"\"\"\n        # Get pages that need to be translated\n        have_table_pages = {}\n        for page in docs.page:\n            for layout in page.page_layout:\n                if layout.class_name == \"table\":\n                    have_table_pages[page.page_number] = page\n        with self.translation_config.progress_monitor.stage_start(\n            self.stage_name,\n            len(have_table_pages),\n        ) as progress:\n            # Process predictions for each page\n            for page, layouts in self.model.handle_document(\n                have_table_pages.values(),\n                mupdf_doc,\n                self.translation_config,\n                self._save_debug_image,\n            ):\n                page_layouts = []\n                for layout in layouts.boxes:\n                    # Convert coordinate system from picture to il\n                    # system to the il coordinate system\n                    x0, y0, x1, y1 = layout.xyxy\n                    # pix = mupdf_doc[page.page_number].get_pixmap()\n                    pix = get_no_rotation_img(mupdf_doc[page.page_number])\n                    h, w = pix.height, pix.width\n                    x0, y0, x1, y1 = (\n                        np.clip(int(x0 - 1), 0, w - 1),\n                        np.clip(int(h - y1 - 1), 0, h - 1),\n                        np.clip(int(x1 + 1), 0, w - 1),\n                        np.clip(int(h - y0 + 1), 0, h - 1),\n                    )\n                    page_layout = il_version_1.PageLayout(\n                        id=len(page_layouts) + 1,\n                        box=il_version_1.Box(\n                            x0.item(),\n                            y0.item(),\n                            x1.item(),\n       
                     y1.item(),\n                        ),\n                        conf=layout.conf.item(),\n                        class_name=layouts.names[layout.cls],\n                    )\n                    page_layouts.append(page_layout)\n\n                page.page_layout.extend(page_layouts)\n                self._save_debug_box_to_page(page)\n                progress.advance(1)\n\n        return docs\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/midend/typesetting.py",
    "content": "from __future__ import annotations\n\nimport copy\nimport logging\nimport re\nimport statistics\nimport unicodedata\nfrom functools import cache\n\nimport pymupdf\nimport regex\nfrom rtree import index\n\nfrom babeldoc.const import WATERMARK_VERSION\nfrom babeldoc.format.pdf.document_il import Box\nfrom babeldoc.format.pdf.document_il import PdfCharacter\nfrom babeldoc.format.pdf.document_il import PdfCurve\nfrom babeldoc.format.pdf.document_il import PdfForm\nfrom babeldoc.format.pdf.document_il import PdfFormula\nfrom babeldoc.format.pdf.document_il import PdfParagraphComposition\nfrom babeldoc.format.pdf.document_il import PdfStyle\nfrom babeldoc.format.pdf.document_il import il_version_1\nfrom babeldoc.format.pdf.document_il.utils.fontmap import FontMapper\nfrom babeldoc.format.pdf.document_il.utils.formular_helper import update_formula_data\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import box_to_tuple\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\nfrom babeldoc.format.pdf.translation_config import WatermarkOutputMode\n\nlogger = logging.getLogger(__name__)\n\nLINE_BREAK_REGEX = regex.compile(\n    r\"^[\"\n    r\"a-z\"\n    r\"A-Z\"\n    r\"0-9\"\n    r\"\\u00C0-\\u00FF\"  # Latin-1 Supplement\n    r\"\\u0100-\\u017F\"  # Latin Extended A\n    r\"\\u0180-\\u024F\"  # Latin Extended B\n    r\"\\u1E00-\\u1EFF\"  # Latin Extended Additional\n    r\"\\u2C60-\\u2C7F\"  # Latin Extended C\n    r\"\\uA720-\\uA7FF\"  # Latin Extended D\n    r\"\\uAB30-\\uAB6F\"  # Latin Extended E\n    r\"\\u0250-\\u02A0\"  # IPA Extensions\n    r\"\\u0400-\\u04FF\"  # Cyrillic\n    r\"\\u0300-\\u036F\"  # Combining Diacritical Marks\n    r\"\\u0500-\\u052F\"  # Cyrillic Supplement\n    r\"\\u0370-\\u03FF\"  # Greek and Coptic\n    r\"\\u2DE0-\\u2DFF\"  # Cyrillic Extended-A\n    r\"\\uA650-\\uA69F\"  # Cyrillic Extended-B\n    r\"\\u1200-\\u137F\"  # Ethiopic\n    r\"\\u1380-\\u139F\"  # Ethiopic Supplement\n    
r\"\\u2D80-\\u2DDF\"  # Ethiopic Extended\n    r\"\\uAB00-\\uAB2F\"  # Ethiopic Extended-A\n    r\"\\U0001E7E0-\\U0001E7FF\"  # Ethiopic Extended-B\n    r\"\\u0E80-\\u0EFF\"  # Lao\n    r\"\\u0D00-\\u0D7F\"  # Malayalam\n    r\"\\u0A80-\\u0AFF\"  # Gujarati\n    r\"\\u0E00-\\u0E7F\"  # Thai\n    r\"\\u1000-\\u109F\"  # Myanmar\n    r\"\\uAA60-\\uAA7F\"  # Myanmar Extended-A\n    r\"\\uA9E0-\\uA9FF\"  # Myanmar Extended-B\n    r\"\\U000116D0-\\U000116FF\"  # Myanmar Extended-C\n    r\"\\u0B80-\\u0BFF\"  # Tamil\n    r\"\\u0C00-\\u0C7F\"  # Telugu\n    r\"\\u0B00-\\u0B7F\"  # Oriya\n    r\"\\u0530-\\u058F\"  # Armenian\n    r\"\\u10A0-\\u10FF\"  # Georgian\n    r\"\\u1C90-\\u1CBF\"  # Georgian Extended\n    r\"\\u2D00-\\u2D2F\"  # Georgian Supplement\n    r\"\\u1780-\\u17FF\"  # Khmer\n    r\"\\u19E0-\\u19FF\"  # Khmer Symbols\n    r\"\\U00010B00-\\U00010B3F\"  # Avestan\n    r\"\\u1D00-\\u1D7F\"  # Phonetic Extensions\n    r\"\\u1400-\\u167F\"  # Unified Canadian Aboriginal Syllabics\n    r\"\\u0B00-\\u0B7F\"  # Oriya\n    r\"\\u0780-\\u07BF\"  # Thaana\n    r\"\\U0001E900-\\U0001E95F\"  # Adlam\n    r\"\\u1C80-\\u1C8F\"  # Cyrillic Extended-C\n    r\"\\U0001E030-\\U0001E08F\"  # Cyrillic Extended-D\n    r\"\\uA000-\\uA48F\"  # Yi Syllables\n    r\"\\uA490-\\uA4CF\"  # Yi Radicals\n    r\"'\"\n    r\"-\"  # Hyphen\n    r\"·\"  # Middle Dot (U+00B7) For Català\n    r\"ʻ\"  # Spacing Modifier Letters U+02BB\n    r\"]+$\"\n)\n\n\nclass TypesettingUnit:\n    def __str__(self):\n        return self.try_get_unicode() or \"\"\n\n    def __init__(\n        self,\n        char: PdfCharacter | None = None,\n        formular: PdfFormula | None = None,\n        unicode: str | None = None,\n        font: pymupdf.Font | None = None,\n        original_font: il_version_1.PdfFont | None = None,\n        font_size: float | None = None,\n        style: PdfStyle | None = None,\n        xobj_id: int | None = None,\n        debug_info: bool = False,\n    ):\n        assert (char is not 
None) + (formular is not None) + (\n            unicode is not None\n        ) == 1, \"Only one of chars and formular can be not None\"\n        self.char = char\n        self.formular = formular\n        self.unicode = unicode\n        self.x = None\n        self.y = None\n        self.scale = None\n        self.debug_info = debug_info\n\n        # Cache variables\n        self.box_cache: Box | None = None\n        self.can_break_line_cache: bool | None = None\n        self.is_cjk_char_cache: bool | None = None\n        self.mixed_character_blacklist_cache: bool | None = None\n        self.is_space_cache: bool | None = None\n        self.is_hung_punctuation_cache: bool | None = None\n        self.is_cannot_appear_in_line_end_punctuation_cache: bool | None = None\n        self.can_passthrough_cache: bool | None = None\n        self.width_cache: float | None = None\n        self.height_cache: float | None = None\n\n        self.font_size: float | None = None\n\n        if unicode:\n            assert font_size, \"Font size must be provided when unicode is provided\"\n            assert style, \"Style must be provided when unicode is provided\"\n            assert len(unicode) == 1, \"Unicode must be a single character\"\n            assert xobj_id is not None, (\n                \"Xobj id must be provided when unicode is provided\"\n            )\n\n            self.font = font\n            if font is not None and hasattr(font, \"font_id\"):\n                self.font_id = font.font_id\n            else:\n                self.font_id = \"base\"\n            if original_font:\n                self.original_font = original_font\n            else:\n                self.original_font = None\n\n            self.font_size = font_size\n            self.style = style\n            self.xobj_id = xobj_id\n\n    def try_resue_cache(self, old_tu: TypesettingUnit):\n        if old_tu.is_cjk_char_cache is not None:\n            self.is_cjk_char_cache = 
old_tu.is_cjk_char_cache\n\n        if old_tu.can_break_line_cache is not None:\n            self.can_break_line_cache = old_tu.can_break_line_cache\n\n        if old_tu.is_space_cache is not None:\n            self.is_space_cache = old_tu.is_space_cache\n\n        if old_tu.is_hung_punctuation_cache is not None:\n            self.is_hung_punctuation_cache = old_tu.is_hung_punctuation_cache\n\n        if old_tu.is_cannot_appear_in_line_end_punctuation_cache is not None:\n            self.is_cannot_appear_in_line_end_punctuation_cache = (\n                old_tu.is_cannot_appear_in_line_end_punctuation_cache\n            )\n\n        if old_tu.can_passthrough_cache is not None:\n            self.can_passthrough_cache = old_tu.can_passthrough_cache\n\n        if old_tu.mixed_character_blacklist_cache is not None:\n            self.mixed_character_blacklist_cache = (\n                old_tu.mixed_character_blacklist_cache\n            )\n\n    def try_get_unicode(self) -> str | None:\n        if self.char:\n            return self.char.char_unicode\n        elif self.formular:\n            return None\n        elif self.unicode:\n            return self.unicode\n\n    @property\n    def mixed_character_blacklist(self):\n        if self.mixed_character_blacklist_cache is None:\n            self.mixed_character_blacklist_cache = self.calc_mixed_character_blacklist()\n\n        return self.mixed_character_blacklist_cache\n\n    def calc_mixed_character_blacklist(self):\n        unicode = self.try_get_unicode()\n        if unicode:\n            return unicode in [\n                \"。\",\n                \"，\",\n                \"：\",\n                \"？\",\n                \"！\",\n            ]\n        return False\n\n    @property\n    def can_break_line(self):\n        if self.can_break_line_cache is None:\n            self.can_break_line_cache = self.calc_can_break_line()\n\n        return self.can_break_line_cache\n\n    def calc_can_break_line(self):\n        
unicode = self.try_get_unicode()\n        if not unicode:\n            return True\n        if LINE_BREAK_REGEX.match(unicode):\n            return False\n        return True\n\n    @property\n    def is_cjk_char(self):\n        if self.is_cjk_char_cache is None:\n            self.is_cjk_char_cache = self.calc_is_cjk_char()\n\n        return self.is_cjk_char_cache\n\n    def calc_is_cjk_char(self):\n        if self.formular:\n            return False\n        unicode = self.try_get_unicode()\n        if not unicode:\n            return False\n        if \"(cid\" in unicode:\n            return False\n        if len(unicode) > 1:\n            return False\n        assert len(unicode) == 1, \"Unicode must be a single character\"\n        if unicode in [\n            \"（\",\n            \"）\",\n            \"【\",\n            \"】\",\n            \"《\",\n            \"》\",\n            \"〔\",\n            \"〕\",\n            \"〈\",\n            \"〉\",\n            \"〖\",\n            \"〗\",\n            \"「\",\n            \"」\",\n            \"『\",\n            \"』\",\n            \"、\",\n            \"。\",\n            \"：\",\n            \"？\",\n            \"！\",\n            \"，\",\n        ]:\n            return True\n        if unicode:\n            if re.match(\n                r\"^[\"\n                r\"\\u3000-\\u303f\"  # CJK Symbols and Punctuation\n                r\"\\u3040-\\u309f\"  # Hiragana\n                r\"\\u30a0-\\u30ff\"  # Katakana\n                r\"\\u3100-\\u312f\"  # Bopomofo\n                r\"\\uac00-\\ud7af\"  # Hangul Syllables\n                r\"\\u1100-\\u11ff\"  # Hangul Jamo\n                r\"\\u3130-\\u318f\"  # Hangul Compatibility Jamo\n                r\"\\ua960-\\ua97f\"  # Hangul Jamo Extended-A\n                r\"\\ud7b0-\\ud7ff\"  # Hangul Jamo Extended-B\n                r\"\\u3190-\\u319f\"  # Kanbun\n                r\"\\u3200-\\u32ff\"  # Enclosed CJK Letters and Months\n                r\"\\u3300-\\u33ff\"  # 
CJK Compatibility\n                r\"\\ufe30-\\ufe4f\"  # CJK Compatibility Forms\n                r\"\\u4e00-\\u9fff\"  # CJK Unified Ideographs\n                r\"\\u2e80-\\u2eff\"  # CJK Radicals Supplement\n                r\"\\u31c0-\\u31ef\"  # CJK Strokes\n                r\"\\u2f00-\\u2fdf\"  # Kangxi Radicals\n                r\"\\ufe10-\\ufe1f\"  # Vertical Forms\n                r\"]+$\",\n                unicode,\n            ):\n                return True\n            try:\n                unicodedata_name = unicodedata.name(unicode)\n                return (\n                    \"CJK UNIFIED IDEOGRAPH\" in unicodedata_name\n                    or \"FULLWIDTH\" in unicodedata_name\n                )\n            except ValueError:\n                return False\n        return False\n\n    @property\n    def is_space(self):\n        if self.is_space_cache is None:\n            self.is_space_cache = self.calc_is_space()\n\n        return self.is_space_cache\n\n    def calc_is_space(self):\n        if self.formular:\n            return False\n        unicode = self.try_get_unicode()\n        return unicode == \" \"\n\n    @property\n    def is_hung_punctuation(self):\n        if self.is_hung_punctuation_cache is None:\n            self.is_hung_punctuation_cache = self.calc_is_hung_punctuation()\n\n        return self.is_hung_punctuation_cache\n\n    def calc_is_hung_punctuation(self):\n        if self.formular:\n            return False\n        unicode = self.try_get_unicode()\n\n        if unicode:\n            return unicode in [\n                # 英文标点\n                \",\",\n                \".\",\n                \":\",\n                \";\",\n                \"?\",\n                \"!\",\n                # 中文点号\n                \"，\",  # 逗号\n                \"。\",  # 句号\n                \"．\",  # 全角句号\n                \"、\",  # 顿号\n                \"：\",  # 冒号\n                \"；\",  # 分号\n                \"！\",  # 叹号\n                
\"‼\",  # 双叹号\n                \"？\",  # 问号\n                \"⁇\",  # 双问号\n                # 结束引号\n                \"”\",  # 右双引号\n                \"’\",  # 右单引号\n                \"」\",  # 右直角单引号\n                \"』\",  # 右直角双引号\n                # 结束括号\n                \")\",  # 右圆括号\n                \"]\",  # 右方括号\n                \"}\",  # 右花括号\n                \"）\",  # 右圆括号\n                \"〕\",  # 右龟甲括号\n                \"〉\",  # 右单书名号\n                \"】\",  # 右黑色方头括号\n                \"〗\",  # 右空白方头括号\n                \"］\",  # 全角右方括号\n                \"｝\",  # 全角右花括号\n                # 结束双书名号\n                \"》\",  # 右双书名号\n                # 连接号\n                \"～\",  # 全角波浪号\n                \"-\",  # 连字符减号\n                \"–\",  # 短破折号 (EN DASH)\n                \"—\",  # 长破折号 (EM DASH)\n                # 间隔号\n                \"·\",  # 中间点\n                \"・\",  # 片假名中间点\n                \"‧\",  # 连字点\n                # 分隔号\n                \"/\",  # 斜杠\n                \"／\",  # 全角斜杠\n                \"⁄\",  # 分数斜杠\n            ]\n        return False\n\n    @property\n    def is_cannot_appear_in_line_end_punctuation(self):\n        if self.is_cannot_appear_in_line_end_punctuation_cache is None:\n            self.is_cannot_appear_in_line_end_punctuation_cache = (\n                self.calc_is_cannot_appear_in_line_end_punctuation()\n            )\n\n        return self.is_cannot_appear_in_line_end_punctuation_cache\n\n    def calc_is_cannot_appear_in_line_end_punctuation(self):\n        if self.formular:\n            return False\n        unicode = self.try_get_unicode()\n        if not unicode:\n            return False\n        return unicode in [\n            # 开始引号\n            \"“\",  # 左双引号\n            \"‘\",  # 左单引号\n            \"「\",  # 左直角单引号\n            \"『\",  # 左直角双引号\n            # 开始括号\n            \"(\",  # 左圆括号\n            \"[\",  # 左方括号\n            \"{\",  # 左花括号\n            \"（\",  # 左圆括号\n            \"〔\",  # 
left tortoise-shell bracket\n            \"〈\",  # left single book-title mark\n            \"《\",  # left double book-title mark\n            # Other opening brackets\n            \"〖\",  # left white lenticular bracket\n            \"〘\",  # left white tortoise-shell bracket\n            \"〚\",  # left white square bracket\n        ]\n\n    def passthrough(\n        self,\n    ) -> tuple[list[PdfCharacter], list[PdfCurve], list[PdfForm]]:\n        if self.char:\n            return [self.char], [], []\n        elif self.formular:\n            return (\n                self.formular.pdf_character,\n                self.formular.pdf_curve,\n                self.formular.pdf_form,\n            )\n        elif self.unicode:\n            logger.error(f\"Cannot passthrough unicode. TypesettingUnit: {self}. \")\n            return [], [], []\n\n    @property\n    def can_passthrough(self):\n        if self.can_passthrough_cache is None:\n            self.can_passthrough_cache = self.calc_can_passthrough()\n\n        return self.can_passthrough_cache\n\n    def calc_can_passthrough(self):\n        return self.unicode is None\n\n    def calculate_box(self):\n        if self.char:\n            box = copy.deepcopy(self.char.box)\n            if self.char.visual_bbox and self.char.visual_bbox.box:\n                box.y = self.char.visual_bbox.box.y\n                box.y2 = self.char.visual_bbox.box.y2\n                # return self.char.visual_bbox.box\n\n            return box\n        elif self.formular:\n            return self.formular.box\n            # if self.formular.x_offset <= 0.5:\n            #     return self.formular.box\n            # formular_box = copy.copy(self.formular.box)\n            # formular_box.x2 += self.formular.x_advance\n            # return formular_box\n        elif self.unicode:\n            char_width = self.font.char_lengths(self.unicode, self.font_size)[0]\n            if self.x is None or self.y is None or self.scale is None:\n                return Box(0, 0, char_width, self.font_size)\n            
return Box(self.x, self.y, self.x + char_width, self.y + self.font_size)\n\n    @property\n    def box(self):\n        if not self.box_cache:\n            self.box_cache = self.calculate_box()\n\n        return self.box_cache\n\n    @property\n    def width(self):\n        if self.width_cache is None:\n            self.width_cache = self.calc_width()\n\n        return self.width_cache\n\n    def calc_width(self):\n        box = self.box\n        return box.x2 - box.x\n\n    @property\n    def height(self):\n        if self.height_cache is None:\n            self.height_cache = self.calc_height()\n\n        return self.height_cache\n\n    def calc_height(self):\n        box = self.box\n        return box.y2 - box.y\n\n    def relocate(\n        self,\n        x: float,\n        y: float,\n        scale: float,\n    ) -> TypesettingUnit:\n        \"\"\"Relocate and scale the typesetting unit.\n\n        Args:\n            x: New x coordinate\n            y: New y coordinate\n            scale: Scale factor\n\n        Returns:\n            The new typesetting unit\n        \"\"\"\n        if self.char:\n            # Create a new character object\n            new_char = PdfCharacter(\n                pdf_character_id=self.char.pdf_character_id,\n                char_unicode=self.char.char_unicode,\n                box=Box(\n                    x=x,\n                    y=y,\n                    x2=x + self.width * scale,\n                    y2=y + self.height * scale,\n                ),\n                pdf_style=PdfStyle(\n                    font_id=self.char.pdf_style.font_id,\n                    font_size=self.char.pdf_style.font_size * scale,\n                    graphic_state=self.char.pdf_style.graphic_state,\n                ),\n                scale=scale,\n                vertical=self.char.vertical,\n                advance=self.char.advance * scale if self.char.advance else None,\n                debug_info=self.debug_info,\n                xobj_id=self.char.xobj_id,\n            )\n            new_tu = TypesettingUnit(char=new_char)\n            
new_tu.try_resue_cache(self)\n            return new_tu\n\n        elif self.formular:\n            # 创建新的公式对象，保持内部字符的相对位置\n            new_chars = []\n            min_x = self.formular.box.x\n            min_y = self.formular.box.y\n\n            for char in self.formular.pdf_character:\n                # 计算相对位置\n                rel_x = char.box.x - min_x\n                rel_y = char.box.y - min_y\n\n                visual_rel_x = char.visual_bbox.box.x - min_x\n                visual_rel_y = char.visual_bbox.box.y - min_y\n\n                # 创建新的字符对象\n                new_char = PdfCharacter(\n                    pdf_character_id=char.pdf_character_id,\n                    char_unicode=char.char_unicode,\n                    box=Box(\n                        x=x + (rel_x + self.formular.x_offset) * scale,\n                        y=y + (rel_y + self.formular.y_offset) * scale,\n                        x2=x\n                        + (rel_x + (char.box.x2 - char.box.x) + self.formular.x_offset)\n                        * scale,\n                        y2=y\n                        + (rel_y + (char.box.y2 - char.box.y) + self.formular.y_offset)\n                        * scale,\n                    ),\n                    visual_bbox=il_version_1.VisualBbox(\n                        box=Box(\n                            x=x + (visual_rel_x + self.formular.x_offset) * scale,\n                            y=y + (visual_rel_y + self.formular.y_offset) * scale,\n                            x2=x\n                            + (\n                                visual_rel_x\n                                + (char.visual_bbox.box.x2 - char.visual_bbox.box.x)\n                                + self.formular.x_offset\n                            )\n                            * scale,\n                            y2=y\n                            + (\n                                visual_rel_y\n                                + (char.visual_bbox.box.y2 - 
char.visual_bbox.box.y)\n                                + self.formular.y_offset\n                            )\n                            * scale,\n                        ),\n                    ),\n                    pdf_style=PdfStyle(\n                        font_id=char.pdf_style.font_id,\n                        font_size=char.pdf_style.font_size * scale,\n                        graphic_state=char.pdf_style.graphic_state,\n                    ),\n                    scale=scale,\n                    vertical=char.vertical,\n                    advance=char.advance * scale if char.advance else None,\n                    xobj_id=char.xobj_id,\n                )\n                new_chars.append(new_char)\n\n            # Calculate bounding box from new_chars\n            min_x = min(char.visual_bbox.box.x for char in new_chars)\n            min_y = min(char.visual_bbox.box.y for char in new_chars)\n            max_x = max(char.visual_bbox.box.x2 for char in new_chars)\n            max_y = max(char.visual_bbox.box.y2 for char in new_chars)\n\n            new_formula = PdfFormula(\n                box=Box(\n                    x=min_x,\n                    y=min_y,\n                    x2=max_x,\n                    y2=max_y,\n                ),\n                pdf_character=new_chars,\n                x_offset=self.formular.x_offset * scale,\n                y_offset=self.formular.y_offset * scale,\n                x_advance=self.formular.x_advance * scale,\n            )\n\n            # Handle contained curves\n            new_curves = []\n            for curve in self.formular.pdf_curve:\n                new_curve = self._transform_curve_for_relocation(\n                    curve,\n                    self.formular.box.x,\n                    self.formular.box.y,\n                    x,\n                    y,\n                    scale,\n                )\n                new_curves.append(new_curve)\n            new_formula.pdf_curve = 
new_curves\n\n            # Handle contained forms\n            new_forms = []\n            for form in self.formular.pdf_form:\n                new_form = self._transform_form_for_relocation(\n                    form, self.formular.box.x, self.formular.box.y, x, y, scale\n                )\n                new_forms.append(new_form)\n            new_formula.pdf_form = new_forms\n\n            update_formula_data(new_formula)\n\n            new_tu = TypesettingUnit(formular=new_formula)\n            new_tu.try_resue_cache(self)\n            return new_tu\n\n        elif self.unicode:\n            # 对于 Unicode 字符，我们存储新的位置信息\n            new_unit = TypesettingUnit(\n                unicode=self.unicode,\n                font=self.font,\n                original_font=self.original_font,\n                font_size=self.font_size * scale,\n                style=self.style,\n                xobj_id=self.xobj_id,\n                debug_info=self.debug_info,\n            )\n            new_unit.x = x\n            new_unit.y = y\n            new_unit.scale = scale\n            new_unit.try_resue_cache(self)\n            return new_unit\n\n    def _transform_curve_for_relocation(\n        self,\n        curve,\n        original_formula_x: float,\n        original_formula_y: float,\n        new_x: float,\n        new_y: float,\n        scale: float,\n    ):\n        \"\"\"Transform a curve for formula relocation.\"\"\"\n        import copy\n\n        new_curve = copy.deepcopy(curve)\n\n        if new_curve.box:\n            # Calculate relative position to formula's original position (same as chars)\n            rel_x = new_curve.box.x - original_formula_x\n            rel_y = new_curve.box.y - original_formula_y\n\n            # Apply same transformation as characters\n            new_curve.box = Box(\n                x=new_x + (rel_x + self.formular.x_offset) * scale,\n                y=new_y + (rel_y + self.formular.y_offset) * scale,\n                x2=new_x\n           
     + (\n                    rel_x\n                    + (new_curve.box.x2 - new_curve.box.x)\n                    + self.formular.x_offset\n                )\n                * scale,\n                y2=new_y\n                + (\n                    rel_y\n                    + (new_curve.box.y2 - new_curve.box.y)\n                    + self.formular.y_offset\n                )\n                * scale,\n            )\n\n        # Set relocation transform instead of modifying original CTM\n        translation_x = (\n            new_x + self.formular.x_offset * scale - original_formula_x * scale\n        )\n        translation_y = (\n            new_y + self.formular.y_offset * scale - original_formula_y * scale\n        )\n\n        # Create relocation transformation matrix\n        from babeldoc.format.pdf.document_il.utils.matrix_helper import (\n            create_translation_and_scale_matrix,\n        )\n\n        relocation_matrix = create_translation_and_scale_matrix(\n            translation_x, translation_y, scale\n        )\n        new_curve.relocation_transform = list(relocation_matrix)\n\n        return new_curve\n\n    def _transform_form_for_relocation(\n        self,\n        form,\n        original_formula_x: float,\n        original_formula_y: float,\n        new_x: float,\n        new_y: float,\n        scale: float,\n    ):\n        \"\"\"Transform a form for formula relocation.\"\"\"\n        import copy\n\n        new_form = copy.deepcopy(form)\n\n        if new_form.box:\n            # Calculate relative position to formula's original position (same as chars)\n            rel_x = new_form.box.x - original_formula_x\n            rel_y = new_form.box.y - original_formula_y\n\n            # Apply same transformation as characters\n            new_form.box = Box(\n                x=new_x + (rel_x + self.formular.x_offset) * scale,\n                y=new_y + (rel_y + self.formular.y_offset) * scale,\n                x2=new_x\n                + 
(rel_x + (new_form.box.x2 - new_form.box.x) + self.formular.x_offset)\n                * scale,\n                y2=new_y\n                + (rel_y + (new_form.box.y2 - new_form.box.y) + self.formular.y_offset)\n                * scale,\n            )\n\n        # Set relocation transform instead of modifying original matrices\n        translation_x = (\n            new_x + self.formular.x_offset * scale - original_formula_x * scale\n        )\n        translation_y = (\n            new_y + self.formular.y_offset * scale - original_formula_y * scale\n        )\n\n        # Create relocation transformation matrix\n        from babeldoc.format.pdf.document_il.utils.matrix_helper import (\n            create_translation_and_scale_matrix,\n        )\n\n        relocation_matrix = create_translation_and_scale_matrix(\n            translation_x, translation_y, scale\n        )\n        new_form.relocation_transform = list(relocation_matrix)\n\n        return new_form\n\n    def render(\n        self,\n    ) -> tuple[list[PdfCharacter], list[PdfCurve], list[PdfForm]]:\n        \"\"\"Render this typesetting unit.\n\n        Returns:\n            A tuple of (PdfCharacter list, PdfCurve list, PdfForm list).\n        \"\"\"\n        if self.can_passthrough:\n            return self.passthrough()\n        elif self.unicode:\n            assert self.x is not None, (\n                \"x position must be set, should be set by `relocate`\"\n            )\n            assert self.y is not None, (\n                \"y position must be set, should be set by `relocate`\"\n            )\n            assert self.scale is not None, (\n                \"scale must be set, should be set by `relocate`\"\n            )\n            x = self.x\n            y = self.y\n            # if self.original_font and self.font and hasattr(self.original_font, \"descent\") and hasattr(self.font, \"descent_fontmap\"):\n            #     original_descent = self.original_font.descent\n            #     new_descent = self.font.descent_fontmap\n            # 
    y -= (original_descent - new_descent) * self.font_size / 1000\n\n            # Compute the character width\n            char_width = self.width\n\n            new_char = PdfCharacter(\n                pdf_character_id=self.font.has_glyph(ord(self.unicode)),\n                char_unicode=self.unicode,\n                box=Box(\n                    x=x,  # use the stored position\n                    y=y,\n                    x2=x + char_width,\n                    y2=y + self.font_size,\n                ),\n                pdf_style=PdfStyle(\n                    font_id=self.font_id,\n                    font_size=self.font_size,\n                    graphic_state=self.style.graphic_state,\n                ),\n                scale=self.scale,\n                vertical=False,\n                advance=char_width,\n                xobj_id=self.xobj_id,\n                debug_info=self.debug_info,\n            )\n            return [new_char], [], []\n        else:\n            logger.error(f\"Unknown typesetting unit. TypesettingUnit: {self}. 
\")\n            return [], [], []\n\n\nclass Typesetting:\n    stage_name = \"Typesetting\"\n\n    def __init__(self, translation_config: TranslationConfig):\n        self.font_mapper = FontMapper(translation_config)\n        self.translation_config = translation_config\n        self.lang_code = self.translation_config.lang_out.upper()\n        self.is_cjk = (\n            # Why zh-CN/zh-HK/zh-TW here but not zh-Hans and so on?\n            # See https://funstory-ai.github.io/BabelDOC/supported_languages/\n            (\"ZH\" in self.lang_code)  # C\n            or (\"JA\" in self.lang_code)\n            or (\"JP\" in self.lang_code)  # J\n            or (\"KR\" in self.lang_code)  # K\n            or (\"CN\" in self.lang_code)\n            or (\"HK\" in self.lang_code)\n            or (\"TW\" in self.lang_code)\n        )\n\n    def preprocess_document(self, document: il_version_1.Document, pbar):\n        \"\"\"预处理文档，获取每个段落的最优缩放因子，不执行实际排版\"\"\"\n        all_scales: list[float] = []\n        all_paragraphs: list[il_version_1.PdfParagraph] = []\n\n        for page in document.page:\n            pbar.advance()\n            # 准备字体信息（复制自 render_page 的逻辑）\n            fonts: dict[\n                str | int,\n                il_version_1.PdfFont | dict[str, il_version_1.PdfFont],\n            ] = {f.font_id: f for f in page.pdf_font if f.font_id}\n            page_fonts = {f.font_id: f for f in page.pdf_font if f.font_id}\n            for k, v in self.font_mapper.fontid2font.items():\n                fonts[k] = v\n            for xobj in page.pdf_xobject:\n                if xobj.xobj_id is not None:\n                    fonts[xobj.xobj_id] = page_fonts.copy()\n                    for font in xobj.pdf_font:\n                        if (\n                            xobj.xobj_id in fonts\n                            and isinstance(fonts[xobj.xobj_id], dict)\n                            and font.font_id\n                        ):\n                            
fonts[xobj.xobj_id][font.font_id] = font\n\n            # 处理每个段落\n            for paragraph in page.pdf_paragraph:\n                all_paragraphs.append(paragraph)\n                unit_count = 0\n                try:\n                    typesetting_units = self.create_typesetting_units(paragraph, fonts)\n                    unit_count = len(typesetting_units)\n                    for unit in typesetting_units:\n                        if unit.formular:\n                            unit_count += len(unit.formular.pdf_character) - 1\n\n                    # 如果所有单元都可以直接传递，则 scale = 1.0\n                    if all(unit.can_passthrough for unit in typesetting_units):\n                        paragraph.optimal_scale = 1.0\n                    else:\n                        # 获取最优缩放因子\n                        optimal_scale = self._get_optimal_scale(\n                            paragraph, page, typesetting_units\n                        )\n                        paragraph.optimal_scale = optimal_scale\n                except Exception as e:\n                    # 如果预处理出错，默认使用 1.0 缩放因子\n                    logger.warning(f\"预处理段落时出错：{e}\")\n                    paragraph.optimal_scale = 1.0\n\n                if paragraph.optimal_scale is not None:\n                    all_scales.extend([paragraph.optimal_scale] * unit_count)\n\n        # 获取缩放因子的众数\n        if all_scales:\n            try:\n                modes = statistics.multimode(all_scales)\n                mode_scale = min(modes)\n            except statistics.StatisticsError:\n                logger.warning(\n                    \"Could not find a mode for paragraph scales. 
Falling back to median.\"\n                )\n                mode_scale = statistics.median(all_scales)\n            # Clamp every scale above the mode down to the mode\n            for paragraph in all_paragraphs:\n                if (\n                    paragraph.optimal_scale is not None\n                    and paragraph.optimal_scale > mode_scale\n                ):\n                    paragraph.optimal_scale = mode_scale\n        else:\n            logger.error(\n                \"all_scales is empty; there seems to be no paragraph in this PDF\"\n            )\n\n    def _find_optimal_scale_and_layout(\n        self,\n        paragraph: il_version_1.PdfParagraph,\n        page: il_version_1.Page,\n        typesetting_units: list[TypesettingUnit],\n        initial_scale: float = 1.0,\n        use_english_line_break: bool = True,\n        apply_layout: bool = False,\n    ) -> tuple[float, list[TypesettingUnit] | None]:\n        \"\"\"Find the optimal scale factor and optionally apply the layout.\n\n        Args:\n            paragraph: The paragraph object.\n            page: The page object.\n            typesetting_units: List of typesetting units.\n            initial_scale: Initial scale factor.\n            use_english_line_break: Whether to apply English line-breaking rules.\n            apply_layout: Whether to apply the layout to the paragraph (when True, the actual typesetting is performed).\n\n        Returns:\n            tuple[float, list[TypesettingUnit] | None]: (final scale factor, list of typeset units or None)\n        \"\"\"\n        if not paragraph.box:\n            return initial_scale, None\n\n        box = paragraph.box\n        scale = initial_scale\n        line_skip = 1.50 if self.is_cjk else 1.3\n        min_scale = 0.1\n        expand_space_flag = 0\n        final_typeset_units = None\n\n        while scale >= min_scale:\n            try:\n                # Try to lay out the typesetting units\n                typeset_units, all_units_fit = self._layout_typesetting_units(\n                    typesetting_units,\n                    box,\n                    scale,\n                    line_skip,\n                    paragraph,\n                    use_english_line_break,\n                )\n\n                # 
如果所有单元都放得下\n                if all_units_fit:\n                    if apply_layout:\n                        # 实际应用排版结果\n                        paragraph.scale = scale\n                        paragraph.pdf_paragraph_composition = []\n                        for unit in typeset_units:\n                            chars, curves, forms = unit.render()\n                            for char in chars:\n                                paragraph.pdf_paragraph_composition.append(\n                                    PdfParagraphComposition(pdf_character=char),\n                                )\n                            for curve in curves:\n                                page.pdf_curve.append(curve)\n                            for form in forms:\n                                page.pdf_form.append(form)\n                        final_typeset_units = typeset_units\n                    return scale, final_typeset_units\n            except Exception:\n                # 如果布局检查出错，继续尝试下一个缩放因子\n                pass\n\n            # 添加与原 retypeset 一致的逻辑检查\n            if not hasattr(paragraph, \"debug_id\") or not paragraph.debug_id:\n                return scale, final_typeset_units\n\n            # 减小缩放因子\n            if scale > 0.6:\n                scale -= 0.05\n            else:\n                scale -= 0.1\n\n            if scale < 0.7:\n                space_expanded = False  # 标记是否成功扩展了空间\n\n                if expand_space_flag == 0:\n                    # 尝试向下扩展\n                    try:\n                        min_y = self.get_max_bottom_space(box, page) + 2\n                        if min_y < box.y:\n                            expanded_box = Box(x=box.x, y=min_y, x2=box.x2, y2=box.y2)\n                            box = expanded_box\n                            if apply_layout:\n                                # 更新段落的边界框\n                                paragraph.box = expanded_box\n                            space_expanded = True\n                    except 
Exception:\n                        pass\n                    expand_space_flag = 1\n\n                    # 只有成功扩展空间时才 continue，否则继续减小 scale\n                    if space_expanded:\n                        continue\n\n                elif expand_space_flag == 1:\n                    # 尝试向右扩展\n                    try:\n                        max_x = self.get_max_right_space(box, page) - 5\n                        if max_x > box.x2:\n                            expanded_box = Box(x=box.x, y=box.y, x2=max_x, y2=box.y2)\n                            box = expanded_box\n                            if apply_layout:\n                                # 更新段落的边界框\n                                paragraph.box = expanded_box\n                            space_expanded = True\n                    except Exception:\n                        pass\n                    expand_space_flag = 2\n\n                    # 只有成功扩展空间时才 continue，否则继续减小 scale\n                    if space_expanded:\n                        continue\n\n                # 只有在扩展尝试阶段 (expand_space_flag < 2) 且扩展失败时才重置 scale\n                # 当 expand_space_flag >= 2 时，说明已经尝试过所有扩展，应该继续正常的 scale 减小\n                if expand_space_flag < 2:\n                    # 如果无法扩展空间，重置 scale 并继续循环\n                    scale = 1.0\n\n        # 如果仍然放不下，尝试去除英文换行限制\n        if use_english_line_break:\n            return self._find_optimal_scale_and_layout(\n                paragraph,\n                page,\n                typesetting_units,\n                initial_scale,\n                use_english_line_break=False,\n                apply_layout=apply_layout,\n            )\n\n        # 最后返回最小缩放因子\n        return min_scale, final_typeset_units\n\n    def _get_optimal_scale(\n        self,\n        paragraph: il_version_1.PdfParagraph,\n        page: il_version_1.Page,\n        typesetting_units: list[TypesettingUnit],\n        use_english_line_break: bool = True,\n    ) -> float:\n        \"\"\"获取段落的最优缩放因子，不执行实际排版\"\"\"\n        
scale, _ = self._find_optimal_scale_and_layout(\n            paragraph,\n            page,\n            typesetting_units,\n            1.0,\n            use_english_line_break,\n            apply_layout=False,\n        )\n        return scale\n\n    def retypeset_with_precomputed_scale(\n        self,\n        paragraph: il_version_1.PdfParagraph,\n        page: il_version_1.Page,\n        typesetting_units: list[TypesettingUnit],\n        precomputed_scale: float,\n        use_english_line_break: bool = True,\n    ):\n        \"\"\"使用预计算的缩放因子进行排版\"\"\"\n        if not paragraph.box:\n            return\n\n        # 使用通用方法进行排版，传入预计算的缩放因子作为初始值\n        self._find_optimal_scale_and_layout(\n            paragraph,\n            page,\n            typesetting_units,\n            precomputed_scale,\n            use_english_line_break,\n            apply_layout=True,\n        )\n\n    def typesetting_document(self, document: il_version_1.Document):\n        # 原有的排版逻辑\n        if self.translation_config.progress_monitor:\n            with self.translation_config.progress_monitor.stage_start(\n                self.stage_name,\n                len(document.page) * 2,\n            ) as pbar:\n                # 预处理：获取所有段落的最优缩放因子\n                self.preprocess_document(document, pbar)\n\n                for page in document.page:\n                    self.translation_config.raise_if_cancelled()\n                    self.render_page(page)\n                    pbar.advance()\n        else:\n            for page in document.page:\n                self.translation_config.raise_if_cancelled()\n                self.render_page(page)\n\n    def render_page(self, page: il_version_1.Page):\n        fonts: dict[\n            str | int,\n            il_version_1.PdfFont | dict[str, il_version_1.PdfFont],\n        ] = {f.font_id: f for f in page.pdf_font if f.font_id}\n        page_fonts = {f.font_id: f for f in page.pdf_font if f.font_id}\n        for k, v in 
self.font_mapper.fontid2font.items():\n            fonts[k] = v\n        for xobj in page.pdf_xobject:\n            if xobj.xobj_id is not None:\n                fonts[xobj.xobj_id] = page_fonts.copy()\n                for font in xobj.pdf_font:\n                    if font.font_id:\n                        fonts[xobj.xobj_id][font.font_id] = font\n        if (\n            page.page_number == 0\n            and self.translation_config.watermark_output_mode\n            == WatermarkOutputMode.Watermarked\n        ):\n            self.add_watermark(page)\n        try:\n            para_index = index.Index()\n            para_map = {}\n            #\n            valid_paras = [\n                p\n                for p in page.pdf_paragraph\n                if p.box\n                and all(c is not None for c in [p.box.x, p.box.y, p.box.x2, p.box.y2])\n            ]\n\n            for i, para in enumerate(valid_paras):\n                para_map[i] = para\n                para_index.insert(i, box_to_tuple(para.box))\n\n            for i, p_upper in para_map.items():\n                if not (p_upper.box and p_upper.box.y is not None):\n                    continue\n\n                # Calculate paragraph height and set required gap accordingly\n                para_height = p_upper.box.y2 - p_upper.box.y\n                required_gap = 0.5 if para_height < 36 else 3\n\n                check_area = il_version_1.Box(\n                    x=p_upper.box.x,\n                    y=p_upper.box.y - required_gap,\n                    x2=p_upper.box.x2,\n                    y2=p_upper.box.y,\n                )\n\n                candidate_ids = list(para_index.intersection(box_to_tuple(check_area)))\n\n                conflicting_paras = []\n                for para_id in candidate_ids:\n                    if para_id == i:\n                        continue\n                    p_lower = para_map[para_id]\n                    if not (\n                        p_lower.box\n      
                  and p_upper.box\n                        and p_lower.box.x2 < p_upper.box.x\n                        or p_lower.box.x > p_upper.box.x2\n                    ):\n                        conflicting_paras.append(p_lower)\n\n                if conflicting_paras:\n                    max_y2 = max(\n                        p.box.y2\n                        for p in conflicting_paras\n                        if p.box and p.box.y2 is not None\n                    )\n\n                    new_y = max_y2 + required_gap\n                    if p_upper.box and new_y < p_upper.box.y2:\n                        p_upper.box.y = new_y\n        except Exception as e:\n            logger.warning(\n                f\"Failed to adjust paragraph positions on page {page.page_number}: {e}\"\n            )\n        # 开始实际的渲染过程\n        for paragraph in page.pdf_paragraph:\n            self.render_paragraph(paragraph, page, fonts)\n\n    def add_watermark(self, page: il_version_1.Page):\n        page_width = page.cropbox.box.x2 - page.cropbox.box.x\n        page_height = page.cropbox.box.y2 - page.cropbox.box.y\n        style = il_version_1.PdfStyle(\n            font_id=\"base\",\n            font_size=6,\n            graphic_state=il_version_1.GraphicState(),\n        )\n        text = f\"本文档由 funstory.ai 的开源 PDF 翻译库 BabelDOC {WATERMARK_VERSION} (http://yadt.io) 翻译，本仓库正在积极的建设当中，欢迎 star 和关注。\"\n        if self.translation_config.debug:\n            text += \"\\n 当前为 DEBUG 模式，将显示更多辅助信息。请注意，部分框的位置对应原文，但在译文中可能不正确。\"\n        page.pdf_paragraph.append(\n            il_version_1.PdfParagraph(\n                first_line_indent=False,\n                box=il_version_1.Box(\n                    x=page.cropbox.box.x + page_width * 0.05,\n                    y=page.cropbox.box.y,\n                    x2=page.cropbox.box.x2,\n                    y2=page.cropbox.box.y2 - page_height * 0.05,\n                ),\n                vertical=False,\n                pdf_style=style,\n      
          pdf_paragraph_composition=[\n                    il_version_1.PdfParagraphComposition(\n                        pdf_same_style_unicode_characters=il_version_1.PdfSameStyleUnicodeCharacters(\n                            unicode=text,\n                            pdf_style=style,\n                        ),\n                    ),\n                ],\n                xobj_id=-1,\n            ),\n        )\n\n    def render_paragraph(\n        self,\n        paragraph: il_version_1.PdfParagraph,\n        page: il_version_1.Page,\n        fonts: dict[\n            str | int,\n            il_version_1.PdfFont | dict[str, il_version_1.PdfFont],\n        ],\n    ):\n        typesetting_units = self.create_typesetting_units(paragraph, fonts)\n        # 如果所有单元都可以直接传递，则直接传递\n        if all(unit.can_passthrough for unit in typesetting_units):\n            paragraph.scale = 1.0\n            paragraph.pdf_paragraph_composition = self.create_passthrough_composition(\n                typesetting_units,\n            )\n        else:\n            # 使用预计算的缩放因子进行重排版\n            precomputed_scale = (\n                paragraph.optimal_scale if paragraph.optimal_scale is not None else 1.0\n            )\n\n            # 如果有单元无法直接传递，则进行重排版\n            paragraph.pdf_paragraph_composition = []\n            self.retypeset_with_precomputed_scale(\n                paragraph, page, typesetting_units, precomputed_scale\n            )\n\n            # 重排版后，重新设置段落各字符的 render order\n            self._update_paragraph_render_order(paragraph)\n\n    def _get_width_before_next_break_point(\n        self, typesetting_units: list[TypesettingUnit], scale: float\n    ) -> float:\n        if not typesetting_units:\n            return 0\n        if typesetting_units[0].can_break_line:\n            return 0\n\n        total_width = 0\n        for unit in typesetting_units:\n            if unit.can_break_line:\n                return total_width * scale\n            total_width += unit.width\n   
     return total_width * scale\n\n    def _layout_typesetting_units(\n        self,\n        typesetting_units: list[TypesettingUnit],\n        box: Box,\n        scale: float,\n        line_skip: float,\n        paragraph: il_version_1.PdfParagraph,\n        use_english_line_break: bool = True,\n    ) -> tuple[list[TypesettingUnit], bool]:\n        \"\"\"布局排版单元。\n\n        Args:\n            typesetting_units: 要布局的排版单元列表\n            box: 布局边界框\n            scale: 缩放因子\n\n        Returns:\n            tuple[list[TypesettingUnit], bool]: (已布局的排版单元列表，是否所有单元都放得下)\n        \"\"\"\n        # 计算字号众数\n        font_sizes = []\n        for unit in typesetting_units:\n            if unit.font_size:\n                font_sizes.append(unit.font_size)\n            if unit.char and unit.char.pdf_style and unit.char.pdf_style.font_size:\n                font_sizes.append(unit.char.pdf_style.font_size)\n        font_sizes.sort()\n        font_size = statistics.mode(font_sizes)\n\n        space_width = (\n            self.font_mapper.base_font.char_lengths(\"你\", font_size * scale)[0] * 0.5\n        )\n\n        # 计算行高（使用众数）\n        unit_heights = (\n            [unit.height for unit in typesetting_units] if typesetting_units else []\n        )\n        if not unit_heights:\n            avg_height = 0\n        elif len(unit_heights) == 1:\n            avg_height = unit_heights[0] * scale\n        else:\n            try:\n                avg_height = statistics.mode(unit_heights) * scale\n            except statistics.StatisticsError:\n                # 如果没有众数（所有值都出现相同次数），则使用平均值\n                avg_height = sum(unit_heights) / len(unit_heights) * scale\n\n        # 初始化位置为右上角，并减去一个平均行高\n        current_x = box.x\n        current_y = box.y2 - avg_height\n        box = copy.deepcopy(box)\n        # box.y -= avg_height * (line_spacing - 1.01) # line_spacing 已被替换为 line_skip\n        line_height = 0\n        current_line_heights = []  # 存储当前行所有元素的高度\n\n        # 存储已排版的单元\n        
typeset_units = []\n        all_units_fit = True\n        last_unit: TypesettingUnit | None = None\n        line_ys = [current_y]\n        if paragraph.first_line_indent:\n            current_x += space_width * 4\n        # 遍历所有排版单元\n        for i, unit in enumerate(typesetting_units):\n            # 计算当前单元在当前缩放下的尺寸\n            unit_width = unit.width * scale\n            unit_height = unit.height * scale\n\n            # 跳过行首的空格\n            if current_x == box.x and unit.is_space:\n                continue\n\n            if (\n                last_unit  # 有上一个单元\n                and last_unit.is_cjk_char ^ unit.is_cjk_char  # 中英文交界处\n                and (\n                    last_unit.box\n                    and last_unit.box.y\n                    and current_y - 0.1\n                    <= last_unit.box.y2\n                    <= current_y + line_height + 0.1\n                )  # 在同一行，且有垂直重叠\n                and not last_unit.mixed_character_blacklist  # 不是混排空格黑名单字符\n                and not unit.mixed_character_blacklist  # 同上\n                and current_x > box.x  # 不是行首\n                and unit.try_get_unicode() != \" \"  # 不是空格\n                and last_unit.try_get_unicode() != \" \"  # 不是空格\n                and last_unit.try_get_unicode()\n                not in [\n                    \"。\",\n                    \"！\",\n                    \"？\",\n                    \"；\",\n                    \"：\",\n                    \"，\",\n                ]\n            ):\n                current_x += space_width * 0.5\n            if use_english_line_break:\n                width_before_next_break_point = self._get_width_before_next_break_point(\n                    typesetting_units[i:], scale\n                )\n            else:\n                width_before_next_break_point = 0\n\n            # 如果当前行放不下这个元素，换行\n            if not unit.is_hung_punctuation and (\n                (current_x + unit_width > box.x2)\n                or (\n                    
use_english_line_break\n                    and current_x + unit_width + width_before_next_break_point > box.x2\n                )\n                or (\n                    unit.is_cannot_appear_in_line_end_punctuation\n                    and current_x + unit_width * 2 > box.x2\n                )\n            ):\n                # Line break\n                current_x = box.x\n                if not current_line_heights:\n                    return [], False\n                max_height = max(current_line_heights)\n                mode_height = statistics.mode(current_line_heights)\n\n                current_y -= max(mode_height * line_skip, max_height * 1.05)\n                line_ys.append(current_y)\n                line_height = 0.0\n                current_line_heights = []  # reset the height list for the new line\n\n                # Check whether we have passed the bottom edge\n                # if current_y - unit_height < box.y:\n                if current_y < box.y:\n                    all_units_fit = False\n                    # Do not break here; keep laying out the remaining content\n\n                if unit.is_space:\n                    line_height = max(line_height, unit_height)\n                    continue\n\n            # Place the current unit\n            relocated_unit = unit.relocate(current_x, current_y, scale)\n            typeset_units.append(relocated_unit)\n\n            # Record this unit's height for the current line\n            if not unit.is_space:\n                current_line_heights.append(unit_height)\n\n            prev_x = current_x\n            # Update the x coordinate\n            current_x = relocated_unit.box.x2\n            if prev_x > current_x:\n                logger.warning(f\"Coordinate wraparound! TypesettingUnit: {unit.box}, \")\n\n            last_unit = relocated_unit\n\n        return typeset_units, all_units_fit\n\n    def create_typesetting_units(\n        self,\n        paragraph: il_version_1.PdfParagraph,\n        fonts: dict[str, il_version_1.PdfFont],\n    ) -> list[TypesettingUnit]:\n        if not paragraph.pdf_paragraph_composition:\n            return []\n        result = 
[]\n\n        @cache\n        def get_font(font_id: str, xobj_id: int | None):\n            if xobj_id in fonts:\n                font = fonts[xobj_id][font_id]\n            else:\n                font = fonts[font_id]\n            return font\n\n        for composition in paragraph.pdf_paragraph_composition:\n            if composition is None:\n                continue\n            if composition.pdf_line:\n                result.extend(\n                    [\n                        TypesettingUnit(char=char)\n                        for char in composition.pdf_line.pdf_character\n                    ],\n                )\n            elif composition.pdf_character:\n                result.append(\n                    TypesettingUnit(\n                        char=composition.pdf_character,\n                        debug_info=paragraph.debug_info,\n                    ),\n                )\n            elif composition.pdf_same_style_characters:\n                result.extend(\n                    [\n                        TypesettingUnit(char=char)\n                        for char in composition.pdf_same_style_characters.pdf_character\n                    ],\n                )\n            elif composition.pdf_same_style_unicode_characters:\n                style = composition.pdf_same_style_unicode_characters.pdf_style\n                if style is None:\n                    logger.warning(\n                        f\"Style is None. \"\n                        f\"Composition: {composition}. \"\n                        f\"Paragraph: {paragraph}. \",\n                    )\n                    continue\n                font_id = style.font_id\n                if font_id is None:\n                    logger.warning(\n                        f\"Font ID is None. \"\n                        f\"Composition: {composition}. \"\n                        f\"Paragraph: {paragraph}. 
\",\n                    )\n                    continue\n                font = get_font(font_id, paragraph.xobj_id)\n                if composition.pdf_same_style_unicode_characters.unicode:\n                    result.extend(\n                        [\n                            TypesettingUnit(\n                                unicode=char_unicode,\n                                font=self.font_mapper.map(\n                                    font,\n                                    char_unicode,\n                                ),\n                                original_font=font,\n                                font_size=style.font_size,\n                                style=style,\n                                xobj_id=paragraph.xobj_id,\n                                debug_info=composition.pdf_same_style_unicode_characters.debug_info\n                                or False,\n                            )\n                            for char_unicode in composition.pdf_same_style_unicode_characters.unicode\n                            if char_unicode not in (\"\\n\",)\n                        ],\n                    )\n            elif composition.pdf_formula:\n                result.extend([TypesettingUnit(formular=composition.pdf_formula)])\n            else:\n                logger.error(\n                    f\"Unknown composition type. \"\n                    f\"Composition: {composition}. \"\n                    f\"Paragraph: {paragraph}. 
\",\n                )\n                continue\n        result = list(\n            filter(\n                lambda x: x.unicode is None or x.font is not None,\n                result,\n            ),\n        )\n\n        if any(x.width < 0 for x in result):\n            logger.warning(\"有排版单元宽度小于 0，请检查字体映射是否正确。\")\n        return result\n\n    def create_passthrough_composition(\n        self,\n        typesetting_units: list[TypesettingUnit],\n    ) -> list[PdfParagraphComposition]:\n        \"\"\"从排版单元创建直接传递的段落组合。\n\n        Args:\n            typesetting_units: 排版单元列表\n\n        Returns:\n            段落组合列表\n        \"\"\"\n        composition = []\n        for unit in typesetting_units:\n            if unit.formular:\n                # 对于公式单元，直接创建包含完整公式的组合\n                composition.append(PdfParagraphComposition(pdf_formula=unit.formular))\n            else:\n                # 对于字符单元，使用原有逻辑\n                chars, curves, forms = unit.passthrough()\n                composition.extend(\n                    [PdfParagraphComposition(pdf_character=char) for char in chars],\n                )\n        return composition\n\n    def get_max_right_space(self, current_box: Box, page) -> float:\n        \"\"\"获取段落右侧最大可用空间\n\n        Args:\n            current_box: 当前段落的边界框\n            page: 当前页面\n\n        Returns:\n            可以扩展到的最大 x 坐标\n        \"\"\"\n        # 获取页面的裁剪框作为初始最大限制\n        max_x = page.cropbox.box.x2 * 0.9\n\n        # 检查所有可能的阻挡元素\n        for para in page.pdf_paragraph:\n            if para.box == current_box or para.box is None:  # 跳过当前段落\n                continue\n            # 只考虑在当前段落右侧且有垂直重叠的元素\n            if para.box.x > current_box.x and not (\n                para.box.y >= current_box.y2 or para.box.y2 <= current_box.y\n            ):\n                max_x = min(max_x, para.box.x)\n        for char in page.pdf_character:\n            if char.box.x > current_box.x and not (\n                char.box.y >= current_box.y2 or 
char.box.y2 <= current_box.y\n            ):\n                max_x = min(max_x, char.box.x)\n        # Check figures\n        for figure in page.pdf_figure:\n            if figure.box.x > current_box.x and not (\n                figure.box.y >= current_box.y2 or figure.box.y2 <= current_box.y\n            ):\n                max_x = min(max_x, figure.box.x)\n\n        return max_x\n\n    def get_max_bottom_space(self, current_box: Box, page: il_version_1.Page) -> float:\n        \"\"\"Get the maximum available space below the paragraph.\n\n        Args:\n            current_box: Bounding box of the current paragraph\n            page: The current page\n\n        Returns:\n            The smallest y coordinate the paragraph can extend to\n        \"\"\"\n        # Start from the page crop box as the initial lower limit\n        min_y = page.cropbox.box.y * 1.1\n\n        # Check every element that could block the expansion\n        for para in page.pdf_paragraph:\n            if para.box == current_box or para.box is None:  # skip the current paragraph\n                continue\n            # Only consider elements below the current paragraph that overlap it horizontally\n            if para.box.y2 < current_box.y and not (\n                para.box.x >= current_box.x2 or para.box.x2 <= current_box.x\n            ):\n                min_y = max(min_y, para.box.y2)\n        for char in page.pdf_character:\n            if char.box.y2 < current_box.y and not (\n                char.box.x >= current_box.x2 or char.box.x2 <= current_box.x\n            ):\n                min_y = max(min_y, char.box.y2)\n        # Check figures\n        for figure in page.pdf_figure:\n            if figure.box.y2 < current_box.y and not (\n                figure.box.x >= current_box.x2 or figure.box.x2 <= current_box.x\n            ):\n                min_y = max(min_y, figure.box.y2)\n\n        return min_y\n\n    def _update_paragraph_render_order(self, paragraph: il_version_1.PdfParagraph):\n        \"\"\"\n        Reset the render order of every character in the paragraph.\n        The main render order equals the paragraph's render_order; the sub render order starts at 1 and increments.\n        \"\"\"\n        if not hasattr(paragraph, \"render_order\") or paragraph.render_order is None:\n            return\n\n        main_render_order = 
paragraph.render_order\n        sub_render_order = 1\n\n        # Iterate over all compositions in the paragraph\n        for composition in paragraph.pdf_paragraph_composition:\n            # Handle single characters\n            if composition.pdf_character:\n                char = composition.pdf_character\n                char.render_order = main_render_order\n                char.sub_render_order = sub_render_order\n                sub_render_order += 1\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/__init__.py",
    "content": ""
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/extract_char.py",
    "content": "import logging\nimport shutil\nfrom collections import defaultdict\nfrom pathlib import Path\n\nimport cv2\nimport numpy as np\nimport pymupdf\nfrom rich.logging import RichHandler\nfrom sklearn.cluster import DBSCAN\n\nimport babeldoc.format.pdf.high_level\nimport babeldoc.format.pdf.translation_config\nfrom babeldoc.const import get_process_pool\nfrom babeldoc.format.pdf.document_il import il_version_1\n\nlogger = logging.getLogger(__name__)\n\n# --- Algorithm Tuning Parameters ---\n\n# --- Band Creation ---\n# Minimum vertical overlap ratio for a character to be added to an existing band.\nBAND_CREATION_OVERLAP_THRESHOLD = 0.5\n\n# --- Line Clustering (within a band) ---\n# Epsilon for DBSCAN, as a multiplier of the average character width/height.\nLINE_CLUSTERING_EPS_MULTIPLIER = 3.5\n\n# --- Line Splitting (for tall/wide lines) ---\n# A line is considered for splitting if its height/width is > X times the max char size.\nLINE_SPLIT_SIZE_RATIO_THRESHOLD = 1.5\n# Epsilon for DBSCAN when splitting lines, as a multiplier of the max char size.\nLINE_SPLIT_DBSCAN_EPS_MULTIPLIER = 0.5\n\n# --- Space Insertion (in a finalized line) ---\n# A space is inserted if the gap between chars is > X times the average char width.\nSPACE_INSERTION_GAP_MULTIPLIER = 0.45\n\n# --- Line Merging (across the page) ---\n# --- Optimization ---\n# Maximum vertical gap to search for potential merges, as a multiplier of avg char height.\nMERGE_VERTICAL_GAP_MULTIPLIER = 1.5\n# --- Containment Merge ---\n# Intersection-over-area threshold to consider one line as contained within another.\nMERGE_CONTAINMENT_IOU_THRESHOLD = 0.6\n# --- Adjacency Merge ---\n# Minimum vertical/horizontal overlap for adjacent lines to be considered for merging.\nMERGE_ADJACENCY_OVERLAP_THRESHOLD = 0.7\n# Maximum gap between adjacent lines to merge, as a multiplier of avg char size.\nMERGE_ADJACENCY_GAP_MULTIPLIER = 1.5\n\n\n# --- End of Parameters ---\n\n\ndef parse_pdf(pdf_path, page_ranges=None) 
-> il_version_1.Document:\n    translation_config = babeldoc.format.pdf.translation_config.TranslationConfig(\n        *[None for _ in range(4)], doc_layout_model=None\n    )\n    if page_ranges:\n        translation_config.page_ranges = [page_ranges]\n    translation_config.progress_monitor = (\n        babeldoc.format.pdf.high_level.ProgressMonitor(\n            babeldoc.format.pdf.high_level.TRANSLATE_STAGES\n        )\n    )\n    try:\n        shutil.copy(pdf_path, translation_config.get_working_file_path(\"input.pdf\"))\n        doc = pymupdf.open(pdf_path)\n        il_creater = babeldoc.format.pdf.high_level.ILCreater(translation_config)\n        il_creater.mupdf = doc\n        with Path(translation_config.get_working_file_path(\"input.pdf\")).open(\n            \"rb\"\n        ) as f:\n            babeldoc.format.pdf.high_level.start_parse_il(\n                f,\n                doc_zh=doc,\n                resfont=\"test_font\",\n                il_creater=il_creater,\n                translation_config=translation_config,\n            )\n        il = il_creater.create_il()\n        doc.close()\n        return il\n    finally:\n        translation_config.cleanup_temp_files()\n    return None\n\n\nclass Line:\n    def __init__(self, chars: list[tuple[il_version_1.Box, str, bool]]):\n        self.chars = chars\n        self.text = \"\".join([c[1] for c in chars])\n\n\ndef _recalculate_line_text_with_spacing(line, orientation):\n    if not line.chars:\n        line.text = \"\"\n        return\n\n    if orientation == \"horizontal\":\n\n        def get_main_start(c):\n            return c[0].x\n\n        def get_main_end(c):\n            return c[0].x2\n\n        def get_main_size(c):\n            return c[0].x2 - c[0].x\n\n    else:  # vertical\n\n        def get_main_start(c):\n            return c[0].y\n\n        def get_main_end(c):\n            return c[0].y2\n\n        def get_main_size(c):\n            return c[0].y2 - c[0].y\n\n    line_text = \"\"\n   
 avg_width = np.mean(\n        [get_main_size(c) for c in line.chars if get_main_size(c) > 0] or [0]\n    )\n\n    if len(line.chars) > 1 and avg_width > 0:\n        for i in range(len(line.chars) - 1):\n            c1, c2 = line.chars[i], line.chars[i + 1]\n            gap = get_main_start(c2) - get_main_end(c1)\n\n            if gap > avg_width * SPACE_INSERTION_GAP_MULTIPLIER:\n                line_text += c1[1] + \" \"\n            else:\n                line_text += c1[1]\n\n    if line.chars:\n        line_text += line.chars[-1][1]\n\n    line.text = line_text\n\n\n# [box, char_unicode, vertical]\n# vertical: True if the char is vertical, False if the char is horizontal\ndef extract_paragraph_line(\n    pdf_path,\n) -> dict[int, list[tuple[il_version_1.Box, str, bool]]]:\n    il = parse_pdf(pdf_path)\n    if il is None:\n        return None\n    line_boxes = {}\n    for page in il.page:\n        line_boxes[page.page_number] = convert_page_to_char_boxes(page)\n    return line_boxes\n\n\ndef convert_page_to_char_boxes(\n    page: il_version_1.Page,\n) -> list[tuple[il_version_1.Box, str, bool]]:\n    return [\n        (char.visual_bbox.box, char.char_unicode, char.vertical)\n        for char in page.pdf_character\n    ]\n\n\ndef _cluster_by_axis(chars: list[tuple[il_version_1.Box, str, bool]], orientation: str):\n    \"\"\"\n    A generalized function to cluster characters into lines based on main and secondary axes.\n    \"\"\"\n    if not chars:\n        return []\n\n    # Define main and secondary axes based on orientation\n    if orientation == \"horizontal\":\n\n        def get_secondary_start(c):\n            return c[0].y\n\n        def get_secondary_end(c):\n            return c[0].y2\n\n        def get_main_start(c):\n            return c[0].x\n\n        def get_main_end(c):\n            return c[0].x2\n\n        def get_main_size(c):\n            return c[0].x2 - c[0].x\n\n    else:  # vertical\n\n        def get_secondary_start(c):\n            
return c[0].x\n\n        def get_secondary_end(c):\n            return c[0].x2\n\n        def get_main_start(c):\n            return c[0].y\n\n        def get_main_end(c):\n            return c[0].y2\n\n        def get_main_size(c):\n            return c[0].y2 - c[0].y\n\n    # Step 1: Group chars into bands along the secondary axis based on overlap.\n    # This is an optimized version of the band clustering algorithm.\n    # It avoids the O(N^2) complexity of the naive approach by making\n    # assumptions based on the sorted order of characters.\n    chars.sort(key=get_secondary_start)\n\n    # Each band is a tuple: (list_of_chars, min_secondary_coord, max_secondary_coord)\n    bands_data: list[tuple[list, float, float]] = []\n\n    for char in chars:\n        char_secondary_start = get_secondary_start(char)\n        char_secondary_end = get_secondary_end(char)\n        char_secondary_size = char_secondary_end - char_secondary_start\n\n        best_band_index = -1\n        max_overlap_ratio = (\n            BAND_CREATION_OVERLAP_THRESHOLD  # Minimum overlap ratio to be considered\n        )\n\n        # Iterate backwards over bands, as recent bands are more likely to overlap.\n        for i in range(len(bands_data) - 1, -1, -1):\n            band_chars, band_secondary_start, band_secondary_end = bands_data[i]\n\n            # Optimization: If the band is already far above the current char,\n            # and since chars are sorted by start, no further bands will match.\n            if band_secondary_end < char_secondary_start:\n                break\n\n            overlap = max(\n                0,\n                min(char_secondary_end, band_secondary_end)\n                - max(char_secondary_start, band_secondary_start),\n            )\n\n            if char_secondary_size > 0:\n                overlap_ratio = overlap / char_secondary_size\n                if overlap_ratio > max_overlap_ratio:\n                    max_overlap_ratio = overlap_ratio\n           
         best_band_index = i\n\n        if best_band_index != -1:\n            # Add char to the best matching band and update its boundaries\n            band_chars, band_start, band_end = bands_data[best_band_index]\n            band_chars.append(char)\n            updated_band = (\n                band_chars,\n                min(band_start, char_secondary_start),\n                max(band_end, char_secondary_end),\n            )\n            bands_data[best_band_index] = updated_band\n            # Move the updated band to the end to maintain rough locality\n            bands_data.append(bands_data.pop(best_band_index))\n        else:\n            # No suitable band found, create a new one\n            bands_data.append(([char], char_secondary_start, char_secondary_end))\n\n    # Extract final bands from the data structure\n    bands = [b[0] for b in bands_data]\n\n    # Step 2: For each band, cluster along the main axis using DBSCAN\n    final_lines = []\n    for band in bands:\n        if len(band) < 1:\n            continue\n\n        main_axis_sizes = [get_main_size(c) for c in band if get_main_size(c) > 0]\n        avg_main_size = np.mean(main_axis_sizes) if main_axis_sizes else 10\n\n        # Epsilon for main-axis clustering is LINE_CLUSTERING_EPS_MULTIPLIER times the average character size in that dimension\n        eps = avg_main_size * LINE_CLUSTERING_EPS_MULTIPLIER\n\n        centroids = np.array(\n            [((c[0].x + c[0].x2) / 2, (c[0].y + c[0].y2) / 2) for c in band]\n        )\n\n        if centroids.size > 0:\n            db = DBSCAN(eps=eps, min_samples=1, metric=\"manhattan\").fit(centroids)\n\n            line_groups = defaultdict(list)\n            for i, label in enumerate(db.labels_):\n                if label != -1:\n                    line_groups[label].append(band[i])\n\n            for _, line in line_groups.items():\n                line.sort(key=get_main_start)\n                final_lines.append(Line(line))\n\n    # Step 3: Split lines that are too 
tall/wide, which likely contain multiple distinct lines from different columns\n    processed_lines = []\n    for line in final_lines:\n        if not line.chars:\n            continue\n\n        line_secondary_start = min(get_secondary_start(c) for c in line.chars)\n        line_secondary_end = max(get_secondary_end(c) for c in line.chars)\n        line_secondary_size = line_secondary_end - line_secondary_start\n\n        char_secondary_sizes = [\n            get_secondary_end(c) - get_secondary_start(c)\n            for c in line.chars\n            if get_secondary_end(c) - get_secondary_start(c) > 0\n        ]\n        if not char_secondary_sizes:\n            processed_lines.append(line)\n            continue\n\n        max_char_secondary_size = np.max(char_secondary_sizes)\n\n        if (\n            line_secondary_size\n            > max_char_secondary_size * LINE_SPLIT_SIZE_RATIO_THRESHOLD\n            and len(line.chars) > 1\n        ):\n            # logger.debug(\n            #     f\"Splitting line '{line.text}' which seems to contain multiple lines.\"\n            # )\n\n            # Use DBSCAN on the secondary axis centers to split the line\n            centers = np.array(\n                [\n                    [(get_secondary_start(c) + get_secondary_end(c)) / 2]\n                    for c in line.chars\n                ]\n            )\n            db = DBSCAN(\n                eps=max_char_secondary_size * LINE_SPLIT_DBSCAN_EPS_MULTIPLIER,\n                min_samples=1,\n            ).fit(centers)\n\n            sub_lines = defaultdict(list)\n            for i, label in enumerate(db.labels_):\n                sub_lines[label].append(line.chars[i])\n\n            for _, sub_line_chars in sub_lines.items():\n                sub_line_chars.sort(key=get_main_start)\n                processed_lines.append(Line(sub_line_chars))\n        else:\n            processed_lines.append(line)\n    final_lines = processed_lines\n\n    for line in final_lines:\n 
       _recalculate_line_text_with_spacing(line, orientation)\n\n    return final_lines\n\n\ndef _merge_lines_on_page(page_lines: list[Line]) -> list[Line]:\n    \"\"\"\n    Merge lines on a page that are either contained within or adjacent to each other.\n    This function contains both containment and adjacency merge logic.\n    \"\"\"\n    if not page_lines:\n        return []\n\n    merged_lines = []\n    lines_to_skip = set()\n\n    for i in range(len(page_lines)):\n        if i in lines_to_skip:\n            continue\n\n        line1 = page_lines[i]\n        if not line1.chars:\n            merged_lines.append(line1)\n            continue\n\n        bbox1 = (\n            min(c[0].x for c in line1.chars),\n            min(c[0].y for c in line1.chars),\n            max(c[0].x2 for c in line1.chars),\n            max(c[0].y2 for c in line1.chars),\n        )\n\n        # Optimization: Calculate a vertical gap threshold to prune the search space.\n        # Based on the vertical adjacency merge condition.\n        line1_avg_char_height = np.mean(\n            [c[0].y2 - c[0].y for c in line1.chars if c[0].y2 > c[0].y] or [0]\n        )\n        max_v_gap = line1_avg_char_height * MERGE_VERTICAL_GAP_MULTIPLIER\n\n        merged = False\n        for j in range(i + 1, len(page_lines)):\n            if j in lines_to_skip:\n                continue\n\n            line2 = page_lines[j]\n            if not line2.chars:\n                continue\n\n            bbox2 = (\n                min(c[0].x for c in line2.chars),\n                min(c[0].y for c in line2.chars),\n                max(c[0].x2 for c in line2.chars),\n                max(c[0].y2 for c in line2.chars),\n            )\n\n            # Optimization: if line2 is too far below line1, no more merges with line1 are possible.\n            # The list is sorted top-to-bottom, so we can break early.\n            v_gap = bbox1[1] - bbox2[3]  # y_min_1 - y_max_2\n            if v_gap > max_v_gap:\n               
 break\n\n            # Check for \"mostly contained\" by checking intersection over area\n            inter_x0 = max(bbox1[0], bbox2[0])\n            inter_y0 = max(bbox1[1], bbox2[1])\n            inter_x1 = min(bbox1[2], bbox2[2])\n            inter_y1 = min(bbox1[3], bbox2[3])\n\n            inter_area = max(0, inter_x1 - inter_x0) * max(0, inter_y1 - inter_y0)\n\n            area1 = (\n                (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])\n                if (bbox1[2] > bbox1[0] and bbox1[3] > bbox1[1])\n                else 0\n            )\n            area2 = (\n                (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])\n                if (bbox2[2] > bbox2[0] and bbox2[3] > bbox2[1])\n                else 0\n            )\n\n            # Heuristic for merging:\n            # 1. By containment: if one line is mostly inside another.\n            # 2. By adjacency: if two lines are close and aligned.\n            if (\n                area2 > 0\n                and area1 >= area2\n                and (inter_area / area2) > MERGE_CONTAINMENT_IOU_THRESHOLD\n            ):\n                # Case 1: Merge line2 (smaller) into line1 (larger) by containment\n                # logger.debug(\n                #     f\"Merging line '{line2.text}' into '{line1.text}' (mostly contained)\"\n                # )\n                line1.chars.extend(line2.chars)\n                lines_to_skip.add(j)\n                merged = True\n                bbox1 = (\n                    min(bbox1[0], bbox2[0]),\n                    min(bbox1[1], bbox2[1]),\n                    max(bbox1[2], bbox2[2]),\n                    max(bbox1[3], bbox2[3]),\n                )\n\n            elif (\n                area1 > 0\n                and area2 > area1\n                and (inter_area / area1) > MERGE_CONTAINMENT_IOU_THRESHOLD\n            ):\n                # Case 2: Merge line1 (smaller) into line2 (larger) by containment\n                # logger.debug(\n                #   
  f\"Merging line '{line1.text}' into '{line2.text}' (mostly contained)\"\n                # )\n                line2.chars.extend(line1.chars)\n                page_lines[i], page_lines[j] = page_lines[j], page_lines[i]\n                line1 = page_lines[i]\n                lines_to_skip.add(j)\n                merged = True\n                bbox1 = (\n                    min(bbox1[0], bbox2[0]),\n                    min(bbox1[1], bbox2[1]),\n                    max(bbox1[2], bbox2[2]),\n                    max(bbox1[3], bbox2[3]),\n                )\n\n            else:\n                # Case 3: Merge by adjacency for lines that are close to each other\n                orientation = \"horizontal\" if not line1.chars[0][2] else \"vertical\"\n                if orientation == \"horizontal\":\n                    height1 = bbox1[3] - bbox1[1]\n                    height2 = bbox2[3] - bbox2[1]\n                    if height1 > 0 and height2 > 0:\n                        v_overlap = max(\n                            0,\n                            min(bbox1[3], bbox2[3]) - max(bbox1[1], bbox2[1]),\n                        )\n                        if (\n                            v_overlap / height1\n                        ) > MERGE_ADJACENCY_OVERLAP_THRESHOLD and (\n                            v_overlap / height2\n                        ) > MERGE_ADJACENCY_OVERLAP_THRESHOLD:\n                            h_gap = max(bbox1[0], bbox2[0]) - min(bbox1[2], bbox2[2])\n                            if h_gap >= 0:\n                                avg_char_width = np.mean(\n                                    [\n                                        c[0].x2 - c[0].x\n                                        for c in (line1.chars + line2.chars)\n                                        if c[0].x2 > c[0].x\n                                    ]\n                                    or [0]\n                                )\n                                if (\n               
                     avg_char_width > 0\n                                    and h_gap\n                                    < avg_char_width * MERGE_ADJACENCY_GAP_MULTIPLIER\n                                ):\n                                    # logger.debug(\n                                    #     f\"Merging adjacent lines '{line1.text}' and '{line2.text}'\"\n                                    # )\n                                    line1.chars.extend(line2.chars)\n                                    lines_to_skip.add(j)\n                                    merged = True\n                                    bbox1 = (\n                                        min(bbox1[0], bbox2[0]),\n                                        min(bbox1[1], bbox2[1]),\n                                        max(bbox1[2], bbox2[2]),\n                                        max(bbox1[3], bbox2[3]),\n                                    )\n                else:  # Vertical\n                    width1 = bbox1[2] - bbox1[0]\n                    width2 = bbox2[2] - bbox2[0]\n                    if width1 > 0 and width2 > 0:\n                        h_overlap = max(\n                            0,\n                            min(bbox1[2], bbox2[2]) - max(bbox1[0], bbox2[0]),\n                        )\n                        if (\n                            h_overlap / width1\n                        ) > MERGE_ADJACENCY_OVERLAP_THRESHOLD and (\n                            h_overlap / width2\n                        ) > MERGE_ADJACENCY_OVERLAP_THRESHOLD:\n                            v_gap = max(bbox1[1], bbox2[1]) - min(bbox1[3], bbox2[3])\n                            if v_gap >= 0:\n                                avg_char_height = np.mean(\n                                    [\n                                        c[0].y2 - c[0].y\n                                        for c in (line1.chars + line2.chars)\n                                        if c[0].y2 > c[0].y\n         
                           ]\n                                    or [0]\n                                )\n                                if (\n                                    avg_char_height > 0\n                                    and v_gap\n                                    < avg_char_height * MERGE_ADJACENCY_GAP_MULTIPLIER\n                                ):\n                                    # logger.debug(\n                                    #     f\"Merging adjacent vertical lines '{line1.text}' and '{line2.text}'\"\n                                    # )\n                                    line1.chars.extend(line2.chars)\n                                    lines_to_skip.add(j)\n                                    merged = True\n                                    bbox1 = (\n                                        min(bbox1[0], bbox2[0]),\n                                        min(bbox1[1], bbox2[1]),\n                                        max(bbox1[2], bbox2[2]),\n                                        max(bbox1[3], bbox2[3]),\n                                    )\n\n        if merged:\n            # Re-sort and recalculate text for the merged line\n            orientation = (\n                \"horizontal\" if not line1.chars[0][2] else \"vertical\"\n            )  # Guess orientation from first char\n            if orientation == \"horizontal\":\n                line1.chars.sort(key=lambda c: c[0].x)\n            else:  # vertical\n                line1.chars.sort(key=lambda c: c[0].y)\n            _recalculate_line_text_with_spacing(line1, orientation)\n\n        merged_lines.append(line1)\n\n    return merged_lines\n\n\ndef process_page_chars_to_lines(\n    chars: list[tuple[il_version_1.Box, str, bool]],\n) -> list[Line]:\n    pool = get_process_pool()\n    if pool is None:\n        return process_page_chars_to_lines_internal(chars)\n    return pool.apply(process_page_chars_to_lines_internal, (chars,))\n\n\ndef 
process_page_chars_to_lines_internal(\n    chars: list[tuple[il_version_1.Box, str, bool]],\n) -> list[Line]:\n    \"\"\"\n    Process characters on a single page to cluster them into lines.\n\n    Args:\n        chars: List of character tuples (box, char_unicode, is_vertical)\n\n    Returns:\n        List of Line objects representing clustered and merged lines\n    \"\"\"\n    if not chars:\n        return []\n\n    horizontal_chars = [c for c in chars if not c[2]]\n    vertical_chars = [c for c in chars if c[2]]\n\n    horizontal_lines = _cluster_by_axis(horizontal_chars, \"horizontal\")\n    vertical_lines = _cluster_by_axis(vertical_chars, \"vertical\")\n\n    page_lines = horizontal_lines + vertical_lines\n\n    # Sort all found lines by their position on the page (top-to-bottom, left-to-right)\n    def get_line_position(line):\n        if not line:\n            return (0, 0)\n        # PDF coordinate system: Y increases upwards. We negate it for top-to-bottom sort.\n        avg_y = np.mean([(c[0].y + c[0].y2) / 2 for c in line])\n        avg_x = np.mean([(c[0].x + c[0].x2) / 2 for c in line])\n        return (-avg_y, avg_x)\n\n    page_lines.sort(key=lambda line: get_line_position(line.chars))\n\n    # Merge lines on the page\n    merged_page_lines = _merge_lines_on_page(page_lines)\n    return merged_page_lines\n\n\ndef cluster_chars_to_lines(\n    char_boxes: dict[int, list[tuple[il_version_1.Box, str, bool]]],\n) -> dict[int, list[Line]]:\n    clustered_lines = {}\n    if not char_boxes:\n        return clustered_lines\n\n    for page_num, chars in char_boxes.items():\n        merged_page_lines = process_page_chars_to_lines(chars)\n        clustered_lines[page_num] = merged_page_lines\n\n    return clustered_lines\n\n\ndef draw_clustered_lines_to_image(pdf_path, clustered_lines: dict[int, list[Line]]):\n    doc = pymupdf.open(pdf_path)\n    debug_dir = Path(\"ocr-box-image-clustered\") / Path(pdf_path).stem\n    debug_dir.mkdir(parents=True, 
exist_ok=True)\n\n    for page_number, lines in clustered_lines.items():\n        if not lines:\n            continue\n\n        page = doc[page_number]\n        pixmap = page.get_pixmap(dpi=300)\n        image_height = pixmap.height\n        image_width = pixmap.width\n\n        samples = bytearray(pixmap.samples)\n        image_array = np.frombuffer(samples, dtype=np.uint8).reshape(\n            image_height, image_width, pixmap.n\n        )\n\n        if pixmap.n in [3, 4]:\n            image_array = cv2.cvtColor(image_array, cv2.COLOR_RGB2BGR)\n\n        # cv2.imwrite(str(debug_dir / f\"{page_number}.png\"), image_array)\n\n        annotated_image = image_array.copy()\n\n        page_rect = page.rect\n        x_scale = image_width / page_rect.width\n        y_scale = image_height / page_rect.height\n\n        for i, line in enumerate(lines):\n            if not line:\n                continue\n\n            # Draw the encompassing line box first (red)\n            char_boxes_in_line = [item[0] for item in line.chars]\n            min_x = min(b.x for b in char_boxes_in_line)\n            min_y = min(b.y for b in char_boxes_in_line)\n            max_x2 = max(b.x2 for b in char_boxes_in_line)\n            max_y2 = max(b.y2 for b in char_boxes_in_line)\n\n            img_x0_line = int(min_x * x_scale)\n            img_y1_line = int(image_height - (max_y2 * y_scale))\n            img_x1_line = int(max_x2 * x_scale)\n            img_y0_line = int(image_height - (min_y * y_scale))\n\n            cv2.rectangle(\n                annotated_image,\n                (img_x0_line, img_y1_line),\n                (img_x1_line, img_y0_line),\n                (0, 0, 255),  # Red for lines\n                2,\n            )\n\n            cv2.putText(\n                annotated_image,\n                f\"line {i}: {line.text}\",\n                (img_x0_line, img_y1_line - 10),\n                cv2.FONT_HERSHEY_SIMPLEX,\n                0.7,\n                (0, 0, 255),\n        
        2,\n            )\n\n            # Then, draw the individual character boxes on top (green)\n            for char_box, _, _ in line.chars:\n                pdf_x0, pdf_y0, pdf_x1, pdf_y1 = (\n                    char_box.x,\n                    char_box.y,\n                    char_box.x2,\n                    char_box.y2,\n                )\n\n                img_x0_char = int(pdf_x0 * x_scale)\n                img_y0_char_pdf = int(pdf_y0 * y_scale)\n                img_x1_char = int(pdf_x1 * x_scale)\n                img_y1_char_pdf = int(pdf_y1 * y_scale)\n\n                img_y0_char = image_height - img_y0_char_pdf\n                img_y1_char = image_height - img_y1_char_pdf\n\n                cv2.rectangle(\n                    annotated_image,\n                    (img_x0_char, img_y1_char),\n                    (img_x1_char, img_y0_char),\n                    (0, 255, 0),  # Green for characters\n                    1,  # Thinner line\n                )\n\n        cv2.imwrite(str(debug_dir / f\"{page_number}_annotated.png\"), annotated_image)\n\n    doc.close()\n\n\ndef main():\n    logging.basicConfig(level=logging.INFO, handlers=[RichHandler()])\n    for pdf_path in (\n        \"2404.16109v1.pdf\",\n        \"2022 - Bortoli_Valentin De, Mathieu_Emile - Riemannian Score-Based Generative Modelling.pdf\",\n        \"2024 - Regev_Oded - On Lattices, Learning with Errors, Random Linear Codes, and Cryptography.pdf\",\n        \"2024 - Yang_Tian-Le, Lee_Kuang-Yao - Functional Linear Non-Gaussian Acyclic Model for Causal Discovery.pdf\",\n    ):\n        logger.info(f\"Processing {pdf_path}\")\n        char_boxes = extract_paragraph_line(pdf_path)\n        if not char_boxes:\n            logger.warning(f\"No character boxes extracted from {pdf_path}\")\n            continue\n\n        logger.info(\n            f\"Extracted {sum(len(c) for c in char_boxes.values())} characters. 
Clustering them into lines...\"\n        )\n        lines = cluster_chars_to_lines(char_boxes)\n\n        total_lines = sum(len(l) for l in lines.values())\n        logger.info(f\"Clustered into {total_lines} lines. Drawing boxes...\")\n\n        # logger.info(\"--- Clustered Lines Text ---\")\n        # for page_num, page_lines in lines.items():\n        #     logger.info(f\"Page {page_num}:\")\n        #     for i, line in enumerate(page_lines):\n        #         logger.info(f\"  Line {i}: {line.text}\")\n        # logger.info(\"----------------------------\")\n\n        draw_clustered_lines_to_image(pdf_path, lines)\n        logger.info(\"Annotated images saved in 'ocr-box-image-clustered' directory.\")\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/fontmap.py",
    "content": "import enum\nimport functools\nimport logging\nimport re\nfrom pathlib import Path\n\nimport pymupdf\n\nfrom babeldoc.assets import assets\nfrom babeldoc.format.pdf.document_il import PdfFont\nfrom babeldoc.format.pdf.document_il import il_version_1\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\n\nlogger = logging.getLogger(__name__)\n\n\nclass PrimaryFontFamily(enum.IntEnum):\n    SERIF = 1\n    SANS_SERIF = 2\n    SCRIPT = 3\n    NONE = 4\n\n    @classmethod\n    def from_str(cls, value: str):\n        if value == \"serif\":\n            return cls.SERIF\n        elif value == \"sans-serif\":\n            return cls.SANS_SERIF\n        elif value == \"script\":\n            return cls.SCRIPT\n        else:\n            return cls.NONE\n\n\nclass FontMapper:\n    stage_name = \"Add Fonts\"\n\n    def __init__(self, translation_config: TranslationConfig):\n        self.translation_config = translation_config\n        assert translation_config.primary_font_family in [\n            None,\n            \"serif\",\n            \"sans-serif\",\n            \"script\",\n        ]\n        self.primary_font_family = PrimaryFontFamily.from_str(\n            translation_config.primary_font_family,\n        )\n\n        font_family = assets.get_font_family(translation_config.lang_out)\n        self.font_file_names = []\n        for k in (\n            \"normal\",\n            \"script\",\n            \"fallback\",\n            \"base\",\n        ):\n            self.font_file_names.extend(font_family[k])\n\n        self.fonts: dict[str, pymupdf.Font] = {}\n        self.fontid2fontpath: dict[str, Path] = {}\n        for font_file_name in self.font_file_names:\n            if font_file_name in self.fontid2fontpath:\n                continue\n            font_path, font_metadata = assets.get_font_and_metadata(font_file_name)\n            pymupdf_font = pymupdf.Font(fontfile=str(font_path))\n            pymupdf_font.has_glyph = 
functools.lru_cache(maxsize=10240, typed=True)(\n                pymupdf_font.has_glyph,\n            )\n            pymupdf_font.char_lengths = functools.lru_cache(maxsize=10240, typed=True)(\n                pymupdf_font.char_lengths,\n            )\n            self.fonts[font_file_name] = pymupdf_font\n            self.fontid2fontpath[font_file_name] = font_path\n            self.fonts[font_file_name].font_id = font_file_name\n            self.fonts[font_file_name].font_path = font_path\n            self.fonts[font_file_name].ascent_fontmap = font_metadata[\"ascent\"]\n            self.fonts[font_file_name].descent_fontmap = font_metadata[\"descent\"]\n            self.fonts[font_file_name].encoding_length = font_metadata[\n                \"encoding_length\"\n            ]\n\n        self.normal_font_ids: list[str] = font_family[\"normal\"]\n        self.script_font_ids: list[str] = font_family[\"script\"]\n        self.fallback_font_ids: list[str] = font_family[\"fallback\"]\n        self.base_font_ids: list[str] = font_family[\"base\"]\n        self.fontid2fontpath[\"base\"] = self.fontid2fontpath[font_family[\"base\"][0]]\n\n        self.fontid2font: dict[str, pymupdf.Font] = {\n            f.font_id: f for f in self.fonts.values()\n        }\n\n        self.fontid2font[\"base\"] = self.fontid2font[self.base_font_ids[0]]\n\n        self.normal_fonts: list[pymupdf.Font] = [\n            self.fontid2font[font_id] for font_id in self.normal_font_ids\n        ]\n        self.script_fonts: list[pymupdf.Font] = [\n            self.fontid2font[font_id] for font_id in self.script_font_ids\n        ]\n        self.fallback_fonts: list[pymupdf.Font] = [\n            self.fontid2font[font_id] for font_id in self.fallback_font_ids\n        ]\n\n        self.base_font = self.fontid2font[\"base\"]\n\n        self.type2font: dict[str, list[pymupdf.Font]] = {\n            \"normal\": self.normal_fonts,\n            \"script\": self.script_fonts,\n            \"fallback\": 
self.fallback_fonts,\n            \"base\": [self.base_font],\n        }\n\n        self.has_char = functools.lru_cache(maxsize=10240, typed=True)(self.has_char)\n        self.map_in_type = functools.lru_cache(maxsize=10240, typed=True)(\n            self.map_in_type\n        )\n\n    def has_char(self, char_unicode: str):\n        if len(char_unicode) != 1:\n            return False\n        current_char = ord(char_unicode)\n        for font in self.fonts.values():\n            if font.has_glyph(current_char):\n                return True\n        return False\n\n    def map_in_type(\n        self,\n        bold: bool,\n        italic: bool,\n        monospaced: bool,\n        serif: bool,\n        char_unicode: str,\n        font_type: str,\n    ):\n        if font_type == \"script\" and not italic:\n            return None\n        current_char = ord(char_unicode)\n        for font in self.type2font[font_type]:\n            if not font.has_glyph(current_char):\n                continue\n            if bool(bold) != bool(font.is_bold):\n                continue\n            # For reasons unknown, Source Han Sans reports its serif attribute as 1,\n            # so work around it by matching \"serif\" against the font id instead.\n            if bool(serif) and \"serif\" not in font.font_id.lower():\n                continue\n            if not bool(serif) and \"serif\" in font.font_id.lower():\n                continue\n            return font\n\n        return None\n\n    def map(self, original_font: PdfFont, char_unicode: str):\n        current_char = ord(char_unicode)\n        if isinstance(original_font, pymupdf.Font):\n            bold = original_font.is_bold\n            italic = original_font.is_italic\n            monospaced = original_font.is_monospaced\n            serif = original_font.is_serif\n        elif isinstance(original_font, PdfFont):\n            bold = original_font.bold\n            italic = original_font.italic\n            monospaced = original_font.monospace\n            serif = original_font.serif\n        else:\n            logger.error(\n            
    f\"Unknown font type: {type(original_font)}. \"\n                f\"Original font: {original_font}. \"\n                f\"Char unicode: {char_unicode}. \",\n            )\n            return None\n\n        if self.primary_font_family == PrimaryFontFamily.SERIF:\n            serif = True\n        elif self.primary_font_family == PrimaryFontFamily.SANS_SERIF:\n            serif = False\n        elif self.primary_font_family == PrimaryFontFamily.SCRIPT:\n            serif = False\n            italic = True\n\n        script_font_map_result = self.map_in_type(\n            bold, italic, monospaced, serif, char_unicode, \"script\"\n        )\n        if script_font_map_result:\n            return script_font_map_result\n\n        for script_font in self.script_fonts:\n            if italic and script_font.has_glyph(current_char):\n                return script_font\n\n        normal_font_map_result = self.map_in_type(\n            bold, italic, monospaced, serif, char_unicode, \"normal\"\n        )\n        if normal_font_map_result is not None:\n            return normal_font_map_result\n\n        fallback_font_map_result = self.map_in_type(\n            bold, italic, monospaced, serif, char_unicode, \"fallback\"\n        )\n        if fallback_font_map_result is not None:\n            return fallback_font_map_result\n\n        for font in self.fallback_fonts:\n            if font.has_glyph(current_char):\n                return font\n\n        logger.warning(\n            f\"Can't find font for {char_unicode}({current_char}). \"\n            f\"Original font: {original_font.name}[{original_font.font_id}]. \"\n            f\"Char unicode: {char_unicode}. 
\",\n        )\n        return None\n\n    def get_used_font_ids(self, il: il_version_1.Document) -> set[str]:\n        result = set()\n        for page in il.page:\n            for char in page.pdf_character:\n                if char.pdf_style and char.pdf_style.font_id:\n                    result.add(char.pdf_style.font_id)\n            for para in page.pdf_paragraph:\n                for comp in para.pdf_paragraph_composition:\n                    if char := comp.pdf_character:\n                        if char.pdf_style and char.pdf_style.font_id:\n                            result.add(char.pdf_style.font_id)\n        return result\n\n    def add_font(self, doc_zh: pymupdf.Document, il: il_version_1.Document):\n        used_font_ids = self.get_used_font_ids(il)\n        font_list = [\n            (k, v) for k, v in self.fontid2fontpath.items() if k in used_font_ids\n        ]\n\n        font_id = {}\n        xreflen = doc_zh.xref_length()\n        total = xreflen - 1 + len(font_list) + len(il.page) + len(font_list)\n        with self.translation_config.progress_monitor.stage_start(\n            self.stage_name,\n            total,\n        ) as pbar:\n            if not il.page:\n                pbar.advance(total)\n                return\n            for font in font_list:\n                if font[0] in font_id:\n                    continue\n                font_id[font[0]] = doc_zh[0].insert_font(font[0], font[1])\n                pbar.advance(1)\n            for xref in range(1, xreflen):\n                pbar.advance(1)\n                # xref_type = doc_zh.xref_get_key(xref, \"Type\")\n                # if xref_type[1] == \"/Page\":\n                #     resources_xref = doc_zh.xref_get_key(xref, \"Resources\")\n                #     if resources_xref[0] == 'null':\n                #         doc_zh.xref_set_key(xref, \"Resources\", f\"<</Font<<>>>>\")\n                for label in [\"Resources/\", \"\"]:  # 可能是基于 xobj 的 res\n                    try:  # 
xref reads and writes may fail\n                        font_res = doc_zh.xref_get_key(xref, f\"{label}Font\")\n                        if font_res is None:\n                            continue\n                        target_key_prefix = f\"{label}Font/\"\n                        if font_res[0] == \"xref\":\n                            resource_xref_id = re.search(\n                                \"(\\\\d+) 0 R\",\n                                font_res[1],\n                            ).group(1)\n                            xref = int(resource_xref_id)\n                            font_res = (\"dict\", doc_zh.xref_object(xref))\n                            target_key_prefix = \"\"\n                        if font_res[0] == \"dict\":\n                            for font in font_list:\n                                target_key = f\"{target_key_prefix}{font[0]}\"\n                                font_exist = doc_zh.xref_get_key(xref, target_key)\n                                if font_exist[0] == \"null\":\n                                    doc_zh.xref_set_key(\n                                        xref,\n                                        target_key,\n                                        f\"{font_id[font[0]]} 0 R\",\n                                    )\n                    except Exception:\n                        pass\n\n            # Create PdfFont for each font\n            # Build all font objects up front\n            pdf_fonts = []\n            for font_name, _ in font_list:\n                # Get descent_fontmap from fontid2font\n                assert font_name in self.fontid2font, f\"Font {font_name} not found\"\n                mupdf_font = self.fontid2font[font_name]\n                descent_fontmap = mupdf_font.descent_fontmap\n                ascent_fontmap = mupdf_font.ascent_fontmap\n                encoding_length = mupdf_font.encoding_length\n\n                pdf_fonts.append(\n                    il_version_1.PdfFont(\n                        name=font_name,\n    
                    xref_id=font_id[font_name],\n                        font_id=font_name,\n                        encoding_length=encoding_length,\n                        bold=mupdf_font.is_bold,\n                        italic=mupdf_font.is_italic,\n                        monospace=mupdf_font.is_monospaced,\n                        serif=mupdf_font.is_serif,\n                        descent=descent_fontmap,\n                        ascent=ascent_fontmap,\n                    ),\n                )\n                pbar.advance(1)\n\n            # Add the fonts to every page and XObject in bulk\n            for page in il.page:\n                page.pdf_font.extend(pdf_fonts)\n                for xobj in page.pdf_xobject:\n                    xobj.pdf_font.extend(pdf_fonts)\n                pbar.advance(1)\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/formular_helper.py",
    "content": "import base64\nimport functools\nimport re\nimport unicodedata\n\nfrom babeldoc.format.pdf.document_il.il_version_1 import Box\nfrom babeldoc.format.pdf.document_il.il_version_1 import Page\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfFormula\nfrom babeldoc.format.pdf.document_il.utils.fontmap import FontMapper\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import (\n    formular_height_ignore_char,\n)\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\n\n\ndef is_formulas_start_char(\n    char: str,\n    font_mapper: FontMapper,\n    translation_config: TranslationConfig,\n) -> bool:\n    if not char:\n        return False\n    if \"(cid:\" in char:\n        return True\n    if not font_mapper.has_char(char):\n        if len(char) > 1 and all(font_mapper.has_char(x) for x in char):\n            return False\n        return True\n    if translation_config.formular_char_pattern:\n        pattern = translation_config.formular_char_pattern\n        if re.match(pattern, char):\n            return True\n    if char != \" \" and (\n        unicodedata.category(char[0])\n        in [\n            # \"Lm\",\n            \"Mn\",\n            \"Sk\",\n            \"Sm\",\n            \"Zl\",\n            \"Zp\",\n            \"Zs\",\n            \"Co\",  # private use character\n            # \"So\",  # symbol\n        ]  # modifier marks, math symbols, separators\n        or ord(char[0]) in range(0x370, 0x400)  # Greek letters\n    ):\n        return True\n    if re.match(\"[0-9\\\\[\\\\]•]\", char):\n        return True\n    return False\n\n\ndef is_formulas_middle_char(\n    char: str,\n    font_mapper: FontMapper,\n    translation_config: TranslationConfig,\n) -> bool:\n    if is_formulas_start_char(char, font_mapper, translation_config):\n        return True\n\n    if re.match(\",\", char):\n        return True\n\n    return False\n\n\ndef collect_page_formula_font_ids(\n    page: Page, formular_font_pattern: str | None\n) -> tuple[set[int], 
dict[str, set[int]]]:\n    \"\"\"\n    Collects formula font IDs from page fonts and XObject fonts.\n\n    Args:\n        page: The Page object to process.\n        formular_font_pattern: The regex pattern to identify formula fonts by name.\n\n    Returns:\n        A tuple containing:\n            - A set of font_ids considered formula fonts at the page level.\n            - A dictionary mapping xobj_id to a set of font_ids considered\n              formula fonts for that specific XObject.\n    \"\"\"\n    # Page-level formula font IDs\n    page_formula_font_ids = set()\n    if page.pdf_font:\n        for font in page.pdf_font:\n            if is_formulas_font(font.name, formular_font_pattern):\n                page_formula_font_ids.add(font.font_id)\n\n    # XObject-level formula font IDs\n    xobj_formula_font_ids_map = {}\n    if page.pdf_xobject:\n        for xobj in page.pdf_xobject:\n            # Start with a copy of page-level formula fonts for this XObject\n            current_xobj_fonts = page_formula_font_ids.copy()\n            if xobj.pdf_font:\n                for font in xobj.pdf_font:\n                    if is_formulas_font(font.name, formular_font_pattern):\n                        current_xobj_fonts.add(font.font_id)\n                    else:\n                        # If a font within an XObject is explicitly not a formula font,\n                        # remove it from this XObject's set.\n                        current_xobj_fonts.discard(font.font_id)\n            xobj_formula_font_ids_map[xobj.xobj_id] = current_xobj_fonts\n\n    return page_formula_font_ids, xobj_formula_font_ids_map\n\n\n@functools.cache\ndef is_formulas_font(font_name: str, formular_font_pattern: str | None) -> bool:\n    pattern_text = (\n        r\"^(\"\n        r\"|BLKFort.*\"\n        r\"|Cambria.*\"\n        r\"|EUAlbertina.*\"\n        r\"|NimbusRomNo9L.*\"\n        r\"|GlosaMath.*\"\n        r\"|URWPalladioL.*\"\n        r\"|CMSS.+\"\n        r\"|Arial.*\"\n       
 r\"|TimesNewRoman.*\"\n        r\"|SegoeUI.*\"\n        r\"|CMTT9.*\"\n        r\"|CMSL10.*\"\n        r\"|CMTI10.*\"\n        r\"|CMTT10.*\"\n        r\"|CMTI12.*\"\n        r\"|CMR12.*\"\n        r\"|MeridienLTStd.*\"\n        r\"|Calibri.*\"\n        r\"|STIXMathJax_Main.*\"\n        r\"|.*NewBaskerville.*\"\n        r\"|.*FranklinGothic.*\"\n        r\"|.*AGaramondPro.*\"\n        r\"|.*PalatinoItalCOR.*\"\n        r\"|.*ITCSymbolStd.*\"\n        r\"|.*PlantinStd.*\"\n        r\"|.*DJ5EscrowCond.*\"\n        r\"|.*ExchangeBook.*\"\n        r\"|.*DJ5Exchange.*\"\n        r\"|.*Times.*\"\n        r\"|.*PalatinoLTStd.*\"\n        r\"|.*Times New Roman,Italic.*\"\n        r\"|.*EhrhardtMT.*\"\n        r\"|.*GillSansMTStd.*\"\n        r\"|.*MedicineSymbols3.*\"\n        r\"|.*HardingText.*\"\n        r\"|.*GraphikNaturel.*\"\n        r\"|.*HelveticaNeue.*\"\n        r\"|.*GoudyOldStyleT.*\"\n        r\"|.*Symbol.*\"\n        r\"|.*ScalaSansLF.*\"\n        r\"|.*ScalaLF.*\"\n        r\"|.*ScalaSansPro.*\"\n        r\"|.*PetersburgC.*\"\n        r\"|.*ColiseumC.*\"\n        r\"|.*Gantari.*\"\n        r\"|.*OptimaLTStd.*\"\n        r\"|.*CronosPro.*\"\n        r\"|.*ACaslon.*\"\n        r\"|.*Frutiger.*\"\n        r\"|.*BrandonGrotesque.*\"\n        r\"|.*FairfieldLH.*\"\n        r\"|.*CaeciliaLTStd.*\"\n        r\"|.*Whitney.*\"\n        r\"|.*Mercury.*\"\n        r\"|.*SabonLTStd.*\"\n        r\"|.*AnonymousPro.*\"\n        r\"|.*SabonLTPro.*\"\n        r\"|.*ArnoPro.*\"\n        r\"|.*CharisSIL.*\"\n        r\"|.*MSReference.*\"\n        r\"|.*CMUSerif-Roman.*\"\n        r\"|.*CourierNewPS.*\"\n        r\"|.*XCharter.*\"\n        r\"|.*GillSans.*\"\n        r\"|.*Perpetua.*\"\n        r\"|.*GEInspira.*\"\n        r\"|.*AGaramond.*\"\n        r\"|.*BMath.*\"\n        r\"|.*MSTT.*\"\n        r\"|.*Bookinsanity.*\"\n        r\"|.*ScalySans.*\"\n        r\"|.*Code2000.*\"\n        r\"|.*Minion.*\"\n        r\"|.*JansonTextLT.*\"\n        r\"|.*MathPack.*\"\n        
r\"|.*Macmillan.*\"\n        r\"|.*NimbusSan.*\"\n        r\"|.*Mincho.*\"\n        r\"|.*Amerigo.*\"\n        r\"|.*MSGloriolaIIStd.*\"\n        r\"|.*CMU.+\"\n        r\"|.*LinLibertine.*\"\n        r\"|.*txsys.*\"\n        r\")$\"\n    )\n    precise_formula_font_pattern = (\n        r\"^(\"\n        # r\"|.*CambriaMath.*\"\n        # r\"|.*Cambria Math.*\"\n        r\"|.*Asana.*\"\n        r\"|.*MiriamMonoCLM-BookOblique.*\"\n        r\"|.*Miriam Mono CLM.*\"\n        r\"|.*Logix.*\"\n        r\"|.*AeBonum.*\"\n        r\"|.*AeMRoman.*\"\n        r\"|.*AePagella.*\"\n        r\"|.*AeSchola.*\"\n        r\"|.*Concrete.*\"\n        r\"|.*LatinModernMathCompanion.*\"\n        r\"|.*Latin Modern Math Companion.*\"\n        r\"|.*RalphSmithsFormalScriptCompanion.*\"\n        r\"|.*Ralph Smiths Formal Script Companion.*\"\n        r\"|.*TeXGyreBonumMathCompanion.*\"\n        r\"|.*TeX Gyre Bonum Companion.*\"\n        r\"|.*TeXGyrePagellaMathCompanion.*\"\n        r\"|.*TeX Gyre Pagella Math Companion.*\"\n        r\"|.*TeXGyreTermesMathCompanion.*\"\n        r\"|.*TeX Gyre Termes Math Companion.*\"\n        r\"|.*XITSMathCompanion.*\"\n        r\"|.*XITS Math Companion.*\"\n        r\"|.*Erewhon.*\"\n        r\"|.*Euler-Math.*\"\n        r\"|.*Euler Math.*\"\n        r\"|.*FiraMath-Regular.*\"\n        r\"|.*Fira Math.*\"\n        r\"|.*Garamond-Math.*\"\n        r\"|.*GFSNeohellenicMath.*\"\n        r\"|.*KpMath.*\"\n        r\"|.*Lete Sans Math.*\"\n        r\"|.*LeteSansMath.*\"\n        # r\"|.*LinLibertineO.*\"\n        r\"|.*Linux Libertine O.*\"\n        r\"|.*LibertinusMath-Regular.*\"\n        r\"|.*Libertinus Math.*\"\n        r\"|.*LatinModernMath-Regular.*\"\n        r\"|.*Latin Modern Math.*\"\n        r\"|.*Luciole.*\"\n        r\"|.*NewCM.*\"\n        r\"|.*NewComputerModern.*\"\n        r\"|.*OldStandard-Math.*\"\n        r\"|.*STIXMath-Regular.*\"\n        r\"|.*STIX Math.*\"\n        r\"|.*STIXTwoMath-Regular.*\"\n        r\"|.*STIX Two Math.*\"\n  
      r\"|.*TeXGyreBonumMath.*\"\n        r\"|.*TeX Gyre Bonum Math.*\"\n        r\"|.*TeXGyreDejaVuMath.*\"\n        r\"|.*TeX Gyre DejaVu Math.*\"\n        r\"|.*TeXGyrePagellaMath.*\"\n        r\"|.*TeX Gyre Pagella Math.*\"\n        r\"|.*TeXGyreScholaMath.*\"\n        r\"|.*TeX Gyre Schola Math.*\"\n        r\"|.*TeXGyreTermesMath.*\"\n        r\"|.*TeX Gyre Termes Math.*\"\n        r\"|.*XCharter-Math.*\"\n        r\"|.*XCharter Math.*\"\n        r\"|.*XITSMath-Bold.*\"\n        r\"|.*XITS Math.*\"\n        r\"|.*XITSMath.*\"\n        r\"|.*IBMPlexMath.*\"\n        r\"|.*IBM Plex Math.*\"\n        r\")$\"\n    )\n    if formular_font_pattern:\n        broad_formula_font_pattern = formular_font_pattern\n    else:\n        broad_formula_font_pattern = (\n            r\"(CM[^RB]\"\n            r\"|(MS|XY|MT|BL|RM|EU|LA|RS)[A-Z]\"\n            r\"|LINE\"\n            r\"|LCIRCLE\"\n            r\"|TeX-\"\n            r\"|rsfs\"\n            r\"|txsy\"\n            r\"|wasy\"\n            r\"|stmary\"\n            r\"|.*Mono\"\n            r\"|.*Code\"\n            # r\"|.*Ital\"\n            r\"|.*Sym\"\n            r\"|.*Math\"\n            r\"|AdvP4C4E74\"\n            r\"|AdvPSSym\"\n            r\"|AdvP4C4E59\"\n            r\")\"\n        )\n\n    if font_name.startswith(\"BASE64:\"):\n        font_name_bytes = base64.b64decode(font_name[7:])\n        font = font_name_bytes.split(b\"+\")[-1]\n        # The decoded name is bytes, so every pattern matched against it must be\n        # bytes too; a str pattern on a bytes subject raises TypeError.\n        precise_formula_font_pattern = precise_formula_font_pattern.encode()\n        pattern_text = pattern_text.encode()\n        broad_formula_font_pattern = broad_formula_font_pattern.encode()\n    else:\n        font = font_name.split(\"+\")[-1]\n\n    if not font:\n        return False\n\n    if re.match(precise_formula_font_pattern, font):\n        return True\n    elif re.match(pattern_text, font):\n        return False\n    elif re.match(broad_formula_font_pattern, font):\n        return True\n\n    return False\n\n\ndef update_formula_data(formula: PdfFormula):\n    min_x = min(char.visual_bbox.box.x for char in formula.pdf_character)\n 
   max_x = max(char.visual_bbox.box.x2 for char in formula.pdf_character)\n    if not all(map(formular_height_ignore_char, formula.pdf_character)):\n        min_y = min(\n            char.visual_bbox.box.y\n            for char in formula.pdf_character\n            if not formular_height_ignore_char(char)\n        )\n        max_y = max(\n            char.visual_bbox.box.y2\n            for char in formula.pdf_character\n            if not formular_height_ignore_char(char)\n        )\n    else:\n        min_y = min(char.visual_bbox.box.y for char in formula.pdf_character)\n        max_y = max(char.visual_bbox.box.y2 for char in formula.pdf_character)\n    formula.box = Box(min_x, min_y, max_x, max_y)\n    if not formula.y_offset:\n        formula.y_offset = 0\n    if not formula.x_offset:\n        formula.x_offset = 0\n    if not formula.x_advance:\n        formula.x_advance = 0\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/layout_helper.py",
    "content": "import logging\nimport math\nimport re\nimport unicodedata\nfrom typing import Literal\n\nimport regex\nfrom pymupdf import Font\n\nfrom babeldoc.format.pdf.document_il import GraphicState\nfrom babeldoc.format.pdf.document_il import il_version_1\nfrom babeldoc.format.pdf.document_il.il_version_1 import Box\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfCharacter\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfParagraph\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfParagraphComposition\n\nlogger = logging.getLogger(__name__)\n# HEIGHT_NOT_USFUL_CHAR_IN_CHAR = (\n#     \"∑︁\",\n#     # 暂时假设 cid:17 和 cid 16 是特殊情况\n#     # 来源于 arXiv:2310.18608v2 第九页公式大括号\n#     \"(cid:17)\",\n#     \"(cid:16)\",\n#     # arXiv:2411.19509v2 第四页 []\n#     \"(cid:104)\",\n#     \"(cid:105)\",\n#     # arXiv:2411.19509v2 第四页 公式的 | 竖线\n#     \"(cid:13)\",\n#     \"∑︁\",\n#     # arXiv:2412.05265 27 页 累加号\n#     \"(cid:88)\",\n#     # arXiv:2412.05265 16 页 累乘号\n#     \"(cid:89)\",\n#     # arXiv:2412.05265 27 页 积分\n#     \"(cid:90)\",\n#     # arXiv:2412.05265 32 页 公式左右的中括号\n#     \"(cid:2)\",\n#     \"(cid:3)\",\n#     \"·\",\n#     \"√\",\n# )\n\n# 由于我们有一套 bbox 解析机制了，所以现在不需要这个东西了。\nHEIGHT_NOT_USFUL_CHAR_IN_CHAR = (None,)\n\n\nLEFT_BRACKET = (\"(cid:8)\", \"(\", \"(cid:16)\", \"{\", \"[\", \"(cid:104)\", \"(cid:2)\")\nRIGHT_BRACKET = (\"(cid:9)\", \")\", \"(cid:17)\", \"}\", \"]\", \"(cid:105)\", \"(cid:3)\")\n\nBULLET_POINT_PATTERN = re.compile(\n    r\"[■•⚫⬤◆◇○●◦‣⁃▪▫∗†‡¹²³⁴⁵⁶⁷⁸⁹⁰₁₂₃₄₅₆₇₈₉₀ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖʳˢᵗᵘᵛʷˣʸᶻ¶※⁑⁂⁕⁎⁜❧☙⁋‖‽·]\"\n)\n\n\ndef is_bullet_point(char: PdfCharacter) -> bool:\n    \"\"\"Check if the character is a bullet point.\n\n    Args:\n        char: The character to check\n\n    Returns:\n        bool: True if the character is a bullet point\n    \"\"\"\n    is_bullet = bool(BULLET_POINT_PATTERN.match(char.char_unicode))\n    return is_bullet\n\n\ndef calculate_box_iou(box1: Box, box2: Box) -> float:\n    
\"\"\"Calculate the Intersection over Union (IOU) between two boxes.\n\n    Args:\n        box1: First box\n        box2: Second box\n\n    Returns:\n        float: IOU value between 0 and 1\n    \"\"\"\n    if box1 is None or box2 is None:\n        return 0.0\n\n    # Calculate intersection\n    x_left = max(box1.x, box2.x)\n    y_top = max(box1.y, box2.y)\n    x_right = min(box1.x2, box2.x2)\n    y_bottom = min(box1.y2, box2.y2)\n\n    # Check if there's no intersection\n    if x_left >= x_right or y_top >= y_bottom:\n        return 0.0\n\n    # Calculate intersection area\n    intersection_area = (x_right - x_left) * (y_bottom - y_top)\n\n    # Calculate areas of both boxes\n    box1_area = (box1.x2 - box1.x) * (box1.y2 - box1.y)\n    box2_area = (box2.x2 - box2.x) * (box2.y2 - box2.y)\n\n    # Calculate union area\n    union_area = box1_area + box2_area - intersection_area\n\n    # Avoid division by zero\n    if union_area <= 0:\n        return 0.0\n\n    return intersection_area / union_area\n\n\ndef formular_height_ignore_char(char: PdfCharacter):\n    return (\n        char.pdf_character_id is None\n        or char.char_unicode in HEIGHT_NOT_USFUL_CHAR_IN_CHAR\n    )\n\n\ndef box_to_tuple(box: Box) -> tuple[float, float, float, float]:\n    \"\"\"Converts a Box object to a tuple of its coordinates.\"\"\"\n    if box is None:\n        return (0, 0, 0, 0)\n    return (box.x, box.y, box.x2, box.y2)\n\n\nclass Layout:\n    def __init__(self, layout_id, name):\n        self.id = layout_id\n        self.name = name\n\n    @staticmethod\n    def is_newline(prev_char: PdfCharacter, curr_char: PdfCharacter) -> bool:\n        # Without a previous character there is no line break\n        if prev_char is None:\n            return False\n\n        # Center y coordinates of the two characters\n        # prev_y = (prev_char.box.y + prev_char.box.y2) / 2\n        # curr_y = (curr_char.box.y + curr_char.box.y2) / 2\n\n        # If the current character sits clearly below the previous one, a line break occurred\n        # Half the character height is used as the threshold here\n        char_height = max(\n            
curr_char.box.y2 - curr_char.box.y,\n            prev_char.box.y2 - prev_char.box.y,\n        )\n        char_width = max(\n            curr_char.box.x2 - curr_char.box.x,\n            prev_char.box.x2 - prev_char.box.x,\n        )\n        should_new_line = (\n            curr_char.box.y2 < prev_char.box.y\n            or curr_char.box.x2 < prev_char.box.x - char_width * 10\n        )\n        if should_new_line and (\n            formular_height_ignore_char(curr_char)\n            or formular_height_ignore_char(prev_char)\n        ):\n            return False\n        return should_new_line\n\n\ndef get_paragraph_length_except(\n    paragraph: PdfParagraph,\n    except_chars: str,\n    font: Font,\n) -> int:\n    length = 0\n    for composition in paragraph.pdf_paragraph_composition:\n        if composition.pdf_character:\n            length += (\n                composition.pdf_character[0].box.x2 - composition.pdf_character[0].box.x\n            )\n        elif composition.pdf_same_style_characters:\n            for pdf_char in composition.pdf_same_style_characters.pdf_character:\n                if pdf_char.char_unicode in except_chars:\n                    continue\n                length += pdf_char.box.x2 - pdf_char.box.x\n        elif composition.pdf_same_style_unicode_characters:\n            for char_unicode in composition.pdf_same_style_unicode_characters.unicode:\n                if char_unicode in except_chars:\n                    continue\n                length += font.char_lengths(\n                    char_unicode,\n                    composition.pdf_same_style_unicode_characters.pdf_style.font_size,\n                )[0]\n        elif composition.pdf_line:\n            for pdf_char in composition.pdf_line.pdf_character:\n                if pdf_char.char_unicode in except_chars:\n                    continue\n                length += pdf_char.box.x2 - pdf_char.box.x\n        elif composition.pdf_formula:\n            length += 
composition.pdf_formula.box.x2 - composition.pdf_formula.box.x\n        else:\n            logger.error(\n                f\"Unknown composition type. \"\n                f\"Composition: {composition}. \"\n                f\"Paragraph: {paragraph}. \",\n            )\n            continue\n    return length\n\n\ndef get_paragraph_unicode(paragraph: PdfParagraph) -> str:\n    chars = []\n    for composition in paragraph.pdf_paragraph_composition:\n        if composition.pdf_line:\n            chars.extend(composition.pdf_line.pdf_character)\n        elif composition.pdf_same_style_characters:\n            chars.extend(composition.pdf_same_style_characters.pdf_character)\n        elif composition.pdf_same_style_unicode_characters:\n            chars.extend(composition.pdf_same_style_unicode_characters.unicode)\n        elif composition.pdf_formula:\n            chars.extend(composition.pdf_formula.pdf_character)\n        elif composition.pdf_character:\n            chars.append(composition.pdf_character)\n        else:\n            logger.error(\n                f\"Unknown composition type. \"\n                f\"Composition: {composition}. \"\n                f\"Paragraph: {paragraph}. 
\",\n            )\n            continue\n    return get_char_unicode_string(chars)\n\n\nSPACE_REGEX = regex.compile(r\"\\s+\", regex.UNICODE)\n\n\ndef get_char_unicode_string(chars: list[PdfCharacter | str]) -> str:\n    \"\"\"\n    Convert a list of characters to a Unicode string, inserting spaces based on character spacing.\n    Some PDFs do not encode spaces explicitly, so spaces are inferred from the gaps.\n\n    Args:\n        chars: List of characters; each item is a PdfCharacter object or a string\n\n    Returns:\n        str: The resulting Unicode string\n    \"\"\"\n    # Collect the gaps between adjacent characters to derive a spacing threshold\n    distances = []\n    for i in range(len(chars) - 1):\n        if not (\n            isinstance(chars[i], PdfCharacter)\n            and isinstance(chars[i + 1], PdfCharacter)\n        ):\n            continue\n        distance = chars[i + 1].box.x - chars[i].box.x2\n        if distance > 1:  # only consider positive gaps larger than 1\n            distances.append(distance)\n\n    # Distinct gaps, sorted ascending\n    distinct_distances = sorted(set(distances))\n\n    if not distinct_distances:\n        median_distance = 1\n    elif len(distinct_distances) == 1:\n        median_distance = distinct_distances[0]\n    else:\n        median_distance = distinct_distances[1]\n\n    # Build the Unicode string, inserting spaces based on the gaps\n    unicode_chars = []\n    for i in range(len(chars)):\n        # Not a PdfCharacter: append as-is; in practice chars[i] is a plain string here\n        if not isinstance(chars[i], PdfCharacter):\n            unicode_chars.append(chars[i])\n            continue\n\n        # use unicode regex to replace all space with \" \"\n        unicode_chars.append(\n            regex.sub(\n                r\"\\s+\",\n                \" \",\n                unicodedata.normalize(\"NFKC\", chars[i].char_unicode),\n            )\n        )\n\n        # Skip the gap check after an explicit space\n        if chars[i].char_unicode == \" \":\n            continue\n\n        # If both characters are PdfCharacter objects, check the gap\n        if i < len(chars) - 1 and isinstance(chars[i + 1], PdfCharacter):\n            distance = chars[i + 1].box.x - chars[i].box.x2\n            if distance >= median_distance or Layout.is_newline(  # gap reaches the threshold\n                chars[i],\n                chars[i + 1],\n          
  ):  # line break\n                unicode_chars.append(\" \")  # insert a space\n\n    result = \"\".join(unicode_chars)\n    # use unicode regex to replace all space with \" \"\n    normalize = unicodedata.normalize(\"NFKC\", result)\n    result = SPACE_REGEX.sub(\" \", normalize).strip()\n    return result\n\n\ndef get_paragraph_max_height(paragraph: PdfParagraph) -> float:\n    \"\"\"\n    Get the height of the tallest typesetting unit in the paragraph.\n\n    Args:\n        paragraph: The PDF paragraph object\n\n    Returns:\n        float: The maximum height\n    \"\"\"\n    max_height = 0.0\n    for composition in paragraph.pdf_paragraph_composition:\n        if composition is None:\n            continue\n        if composition.pdf_character:\n            char_height = (\n                composition.pdf_character[0].box.y2 - composition.pdf_character[0].box.y\n            )\n            max_height = max(max_height, char_height)\n        elif composition.pdf_same_style_characters:\n            for pdf_char in composition.pdf_same_style_characters.pdf_character:\n                char_height = pdf_char.box.y2 - pdf_char.box.y\n                max_height = max(max_height, char_height)\n        elif composition.pdf_same_style_unicode_characters:\n            # For pure Unicode characters, use the font size from their style as the height estimate\n            font_size = (\n                composition.pdf_same_style_unicode_characters.pdf_style.font_size\n            )\n            max_height = max(max_height, font_size)\n        elif composition.pdf_line:\n            for pdf_char in composition.pdf_line.pdf_character:\n                char_height = pdf_char.box.y2 - pdf_char.box.y\n                max_height = max(max_height, char_height)\n        elif composition.pdf_formula:\n            formula_height = (\n                composition.pdf_formula.box.y2 - composition.pdf_formula.box.y\n            )\n            max_height = max(max_height, formula_height)\n        else:\n            logger.error(\n                f\"Unknown composition type. \"\n                f\"Composition: {composition}. 
\"\n                f\"Paragraph: {paragraph}. \",\n            )\n            continue\n    return max_height\n\n\ndef is_same_style(style1, style2) -> bool:\n    \"\"\"判断两个样式是否相同\"\"\"\n    if style1 is None or style2 is None:\n        return style1 is style2\n\n    return (\n        style1.font_id == style2.font_id\n        and math.fabs(style1.font_size - style2.font_size) < 0.02\n        and is_same_graphic_state(style1.graphic_state, style2.graphic_state)\n    )\n\n\ndef is_same_style_except_size(style1, style2) -> bool:\n    \"\"\"判断两个样式是否相同\"\"\"\n    if style1 is None or style2 is None:\n        return style1 is style2\n\n    return (\n        style1.font_id == style2.font_id\n        and 0.7 < math.fabs(style1.font_size / style2.font_size) < 1.3\n        and is_same_graphic_state(style1.graphic_state, style2.graphic_state)\n    )\n\n\ndef is_same_style_except_font(style1, style2) -> bool:\n    \"\"\"判断两个样式是否相同\"\"\"\n    if style1 is None or style2 is None:\n        return style1 is style2\n\n    return math.fabs(\n        style1.font_size - style2.font_size,\n    ) < 0.02 and is_same_graphic_state(style1.graphic_state, style2.graphic_state)\n\n\ndef is_same_graphic_state(state1: GraphicState, state2: GraphicState) -> bool:\n    \"\"\"判断两个 GraphicState 是否相同\"\"\"\n    if state1 is None or state2 is None:\n        return state1 is state2\n\n    return (\n        state1.passthrough_per_char_instruction\n        == state2.passthrough_per_char_instruction\n    )\n\n\ndef add_space_dummy_chars(paragraph: PdfParagraph) -> None:\n    \"\"\"\n    在 PDF 段落中添加表示空格的 dummy 字符。\n    这个函数会直接修改传入的 paragraph 对象，在需要空格的地方添加 dummy 字符。\n    同时也会处理不同组成部分之间的空格。\n\n    Args:\n        paragraph: 需要处理的 PDF 段落对象\n    \"\"\"\n    # 首先处理每个组成部分内部的空格\n    for composition in paragraph.pdf_paragraph_composition:\n        if composition.pdf_line:\n            chars = composition.pdf_line.pdf_character\n            _add_space_dummy_chars_to_list(chars)\n        elif 
composition.pdf_same_style_characters:\n            chars = composition.pdf_same_style_characters.pdf_character\n            _add_space_dummy_chars_to_list(chars)\n        elif composition.pdf_same_style_unicode_characters:\n            # 对于 unicode 字符，不需要处理。\n            # 这种类型只会出现在翻译好的结果中\n            continue\n        elif composition.pdf_formula:\n            chars = composition.pdf_formula.pdf_character\n            _add_space_dummy_chars_to_list(chars)\n\n    # 然后处理组成部分之间的空格\n    for i in range(len(paragraph.pdf_paragraph_composition) - 1):\n        curr_comp = paragraph.pdf_paragraph_composition[i]\n        next_comp = paragraph.pdf_paragraph_composition[i + 1]\n\n        # 获取当前组成部分的最后一个字符\n        curr_last_char = _get_last_char_from_composition(curr_comp)\n        if not curr_last_char:\n            continue\n\n        # 获取下一个组成部分的第一个字符\n        next_first_char = _get_first_char_from_composition(next_comp)\n        if not next_first_char:\n            continue\n\n        # 检查两个组成部分之间是否需要添加空格\n        distance = next_first_char.box.x - curr_last_char.box.x2\n        if distance > 1:  # 只考虑正向距离\n            # 创建一个 dummy 字符作为空格\n            space_box = Box(\n                x=curr_last_char.box.x2,\n                y=curr_last_char.box.y,\n                x2=curr_last_char.box.x2 + distance,\n                y2=curr_last_char.box.y2,\n            )\n\n            space_char = PdfCharacter(\n                pdf_style=curr_last_char.pdf_style,\n                box=space_box,\n                char_unicode=\" \",\n                scale=curr_last_char.scale,\n                advance=space_box.x2 - space_box.x,\n                visual_bbox=il_version_1.VisualBbox(box=space_box),\n            )\n\n            # 将空格添加到当前组成部分的末尾\n            if curr_comp.pdf_line:\n                curr_comp.pdf_line.pdf_character.append(space_char)\n            elif curr_comp.pdf_same_style_characters:\n                
curr_comp.pdf_same_style_characters.pdf_character.append(space_char)\n            elif curr_comp.pdf_formula:\n                curr_comp.pdf_formula.pdf_character.append(space_char)\n\n\ndef _get_first_char_from_composition(\n    comp: PdfParagraphComposition,\n) -> PdfCharacter | None:\n    \"\"\"获取组成部分的第一个字符\"\"\"\n    if comp.pdf_line and comp.pdf_line.pdf_character:\n        return comp.pdf_line.pdf_character[0]\n    elif (\n        comp.pdf_same_style_characters and comp.pdf_same_style_characters.pdf_character\n    ):\n        return comp.pdf_same_style_characters.pdf_character[0]\n    elif comp.pdf_formula and comp.pdf_formula.pdf_character:\n        return comp.pdf_formula.pdf_character[0]\n    elif comp.pdf_character:\n        return comp.pdf_character\n    return None\n\n\ndef _get_last_char_from_composition(\n    comp: PdfParagraphComposition,\n) -> PdfCharacter | None:\n    \"\"\"获取组成部分的最后一个字符\"\"\"\n    if comp.pdf_line and comp.pdf_line.pdf_character:\n        return comp.pdf_line.pdf_character[-1]\n    elif (\n        comp.pdf_same_style_characters and comp.pdf_same_style_characters.pdf_character\n    ):\n        return comp.pdf_same_style_characters.pdf_character[-1]\n    elif comp.pdf_formula and comp.pdf_formula.pdf_character:\n        return comp.pdf_formula.pdf_character[-1]\n    elif comp.pdf_character:\n        return comp.pdf_character\n    return None\n\n\ndef _add_space_dummy_chars_to_list(chars: list[PdfCharacter]) -> None:\n    \"\"\"\n    在字符列表中的适当位置添加表示空格的 dummy 字符。\n\n    Args:\n        chars: PdfCharacter 对象列表\n    \"\"\"\n    if not chars:\n        return\n\n    # 计算字符间距的中位数\n    distances = []\n    for i in range(len(chars) - 1):\n        distance = chars[i + 1].box.x - chars[i].box.x2\n        if distance > 1:  # 只考虑正向距离\n            distances.append(distance)\n\n    # 去重后的距离\n    distinct_distances = sorted(set(distances))\n\n    if not distinct_distances:\n        median_distance = 1\n    elif len(distinct_distances) == 1:\n       
 median_distance = distinct_distances[0]\n    else:\n        median_distance = distinct_distances[1]\n\n    # 在需要的地方插入空格字符\n    i = 0\n    while i < len(chars) - 1:\n        curr_char = chars[i]\n        next_char = chars[i + 1]\n\n        distance = next_char.box.x - curr_char.box.x2\n        if distance >= median_distance or Layout.is_newline(curr_char, next_char):\n            if distance < 0:\n                distance = -distance\n            # 创建一个 dummy 字符作为空格\n            space_box = Box(\n                x=curr_char.box.x2,\n                y=curr_char.box.y,\n                x2=curr_char.box.x2 + min(distance, median_distance),\n                y2=curr_char.box.y2,\n            )\n\n            space_char = PdfCharacter(\n                pdf_style=curr_char.pdf_style,\n                box=space_box,\n                char_unicode=\" \",\n                scale=curr_char.scale,\n                advance=space_box.x2 - space_box.x,\n                visual_bbox=il_version_1.VisualBbox(box=space_box),\n            )\n\n            # 在当前位置后插入空格字符\n            chars.insert(i + 1, space_char)\n            i += 2  # 跳过刚插入的空格\n        else:\n            i += 1\n\n\ndef build_layout_index(page):\n    \"\"\"Builds an R-tree index for all layouts on the page.\"\"\"\n    from rtree import index\n\n    layout_index = index.Index()\n    layout_map = {}\n    for i, layout in enumerate(page.page_layout):\n        layout_map[i] = layout\n        if layout.box:\n            layout_index.insert(i, box_to_tuple(layout.box))\n    return layout_index, layout_map\n\n\ndef calculate_iou_for_boxes(box1: Box, box2: Box) -> float:\n    \"\"\"Calculate the intersection area divided by the first box area.\"\"\"\n    x_left = max(box1.x, box2.x)\n    y_bottom = max(box1.y, box2.y)\n    x_right = min(box1.x2, box2.x2)\n    y_top = min(box1.y2, box2.y2)\n\n    if x_right <= x_left or y_top <= y_bottom:\n        return 0.0\n\n    # Calculate intersection area\n    intersection_area = (x_right 
- x_left) * (y_top - y_bottom)\n\n    # Calculate area of first box\n    first_box_area = (box1.x2 - box1.x) * (box1.y2 - box1.y)\n\n    # Return intersection divided by first box area, handle division by zero\n    if first_box_area <= 0:\n        return 0.0\n\n    return intersection_area / first_box_area\n\n\ndef calculate_y_iou_for_boxes(box1: Box, box2: Box) -> float:\n    \"\"\"Calculate the intersection ratio in y-axis direction divided by the first box height.\n\n    Args:\n        box1: First box\n        box2: Second box\n\n    Returns:\n        float: Intersection ratio in y-axis direction between 0 and 1\n    \"\"\"\n    y_bottom = max(box1.y, box2.y)\n    y_top = min(box1.y2, box2.y2)\n\n    if y_top <= y_bottom:\n        return 0.0\n\n    # Calculate intersection height\n    intersection_height = y_top - y_bottom\n\n    # Calculate height of first box\n    first_box_height = box1.y2 - box1.y\n\n    # Return intersection divided by first box height, handle division by zero\n    if first_box_height <= 0:\n        return 0.0\n\n    return intersection_height / first_box_height\n\n\ndef calculate_y_true_iou_for_boxes(box1: Box, box2: Box) -> float:\n    \"\"\"Calculate the intersection ratio in y-axis direction divided by the smaller of the two box heights.\n\n    Args:\n        box1: First box\n        box2: Second box\n\n    Returns:\n        float: Intersection ratio in y-axis direction between 0 and 1\n    \"\"\"\n    y_bottom = max(box1.y, box2.y)\n    y_top = min(box1.y2, box2.y2)\n\n    if y_top <= y_bottom:\n        return 0.0\n\n    # Calculate intersection height\n    intersection_height = y_top - y_bottom\n\n    # Calculate heights of both boxes\n    first_box_height = box1.y2 - box1.y\n    second_box_height = box2.y2 - box2.y\n\n    min_height = min(first_box_height, second_box_height)\n\n    # Return intersection divided by the smaller box height, handle division by zero\n    if min_height <= 0:\n        return 0.0\n\n    return intersection_height / 
min_height\n\n\ndef get_character_layout(\n    char,\n    layout_index,\n    layout_map,\n    layout_priority=None,\n    _bbox_mode: Literal[\"auto\", \"visual\", \"box\"] = \"auto\",\n):\n    \"\"\"Get the layout for a character based on priority and IoU.\"\"\"\n    if layout_priority is None:\n        layout_priority = [\n            \"number\",\n            \"reference\",\n            \"reference_content\",\n            \"algorithm\",\n            \"formula_caption\",\n            \"isolate_formula\",\n            \"table_footnote\",\n            \"table_caption\",\n            \"figure_caption\",\n            \"figure_title\",\n            \"chart_title\",\n            \"table_title\",\n            \"table_cell_hybrid\",\n            \"table_text\",\n            \"wireless_table_cell\",\n            \"wired_table_cell\",\n            \"abandon\",\n            \"title\",\n            \"abstract\",\n            \"paragraph_title\",\n            \"content\",\n            \"doc_title\",\n            \"footnote\",\n            \"header\",\n            \"footer\",\n            \"seal\",\n            \"plain text\",\n            \"tiny text\",\n            \"author_info_hybrid\",\n            \"list_item_hybrid\",\n            \"text\",\n            \"paragraph_hybrid\",\n            \"paragraph\",\n            \"table_cell\",\n            \"figure_text\",\n            \"list_item\",\n            \"title\",\n            \"caption\",\n            \"footnote_hybrid\",\n            \"footnote\",\n            \"formula\",\n            \"formula_hybrid\",\n            \"page_header\",\n            \"page_footer\",\n            # --- hybrid labels ---\n            \"reference_hybrid\",\n            \"document_hybrid\",\n            \"academic_paper_hybrid\",\n            \"form_or_table_hybrid\",\n            \"presentation_slide_hybrid\",\n            \"webpage_screenshot_hybrid\",\n            \"manga_or_comic_hybrid\",\n            \"advertisement_hybrid\",\n            
\"magazine_or_newspaper_hybrid\",\n            \"other_hybrid\",\n            \"table_cell_hybrid\",\n            \"figure_text_hybrid\",\n            \"title_hybrid\",\n            \"caption_hybrid\",\n            \"code_algo_hybrid\",\n            \"line_number_hybrid\",\n            \"page_header_hybrid\",\n            \"page_footer_hybrid\",\n            \"page_number_hybrid\",\n            \"unknown_hybrid\",\n            \"fallback_line\",\n            \"table\",\n            \"figure\",\n            \"image\",\n        ]\n\n    char_box = char.visual_bbox.box\n    # char_box2 = char.box\n    # if bbox_mode == \"auto\":\n    #     # Calculate IOU to decide which box to use\n    #     intersection_area = max(\n    #         0, min(char_box.x2, char_box2.x2) - max(char_box.x, char_box2.x)\n    #     ) * max(0, min(char_box.y2, char_box2.y2) - max(char_box.y, char_box2.y))\n    #     char_box_area = (char_box.x2 - char_box.x) * (char_box.y2 - char_box.y)\n    #\n    #     if char_box_area > 0:\n    #         iou = intersection_area / char_box_area\n    #         if iou < 0.2:\n    #             char_box = char_box2\n    # elif bbox_mode == \"box\":\n    #     char_box = char_box2\n\n    # Collect all intersecting layouts and their IoU values\n    matching_layouts = []\n    candidate_ids = list(layout_index.intersection(box_to_tuple(char_box)))\n    candidate_layouts = [layout_map[i] for i in candidate_ids]\n\n    for layout in candidate_layouts:\n        # Calculate IoU\n        intersection_area = max(\n            0, min(char_box.x2, layout.box.x2) - max(char_box.x, layout.box.x)\n        ) * max(0, min(char_box.y2, layout.box.y2) - max(char_box.y, layout.box.y))\n        char_area = (char_box.x2 - char_box.x) * (char_box.y2 - char_box.y)\n\n        if char_area > 0:\n            iou = intersection_area / char_area\n            if iou > 0:\n                matching_layouts.append(\n                    {\n                        \"layout\": Layout(layout.id, 
layout.class_name),\n                        \"priority\": (\n                            layout_priority.index(layout.class_name)\n                            if layout.class_name in layout_priority\n                            else len(layout_priority)\n                        ),\n                        \"iou\": iou,\n                    }\n                )\n\n    if not matching_layouts:\n        return None\n\n    # Sort by priority (ascending) and IoU value (descending)\n    matching_layouts.sort(key=lambda x: (x[\"priority\"], -x[\"iou\"]))\n\n    # non_hybrid_table_label = None\n    # for layout in matching_layouts:\n    #     layout = layout[\"layout\"]\n    #     label = layout.name\n    #     if is_text_layout(layout) and label not in (\n    #         \"table_cell_hybrid\",\n    #         \"table_text\",\n    #         \"wireless_table_cell\",\n    #         \"wired_table_cell\",\n    #         \"fallback_line\",\n    #         \"unknown_hybrid\",\n    #     ):\n    #         non_hybrid_table_label = layout\n    #         break\n    #\n    # if non_hybrid_table_label:\n    #     return non_hybrid_table_label\n\n    return matching_layouts[0][\"layout\"]\n\n\ndef is_text_layout(layout: Layout):\n    \"\"\"Check if a layout is a text layout.\"\"\"\n    return layout is not None and layout.name in [\n        \"plain text\",\n        \"tiny text\",\n        \"title\",\n        \"abandon\",\n        \"figure_caption\",\n        \"table_caption\",\n        \"table_text\",\n        \"table_footnote\",\n        # \"reference\",\n        \"title\",\n        \"paragraph_title\",\n        \"abstract\",\n        \"content\",\n        \"figure_title\",\n        \"table_title\",\n        \"doc_title\",\n        \"footnote\",\n        \"header\",\n        \"footer\",\n        \"seal\",\n        \"text\",\n        \"chart_title\",\n        \"paragraph\",\n        \"table_cell\",\n        \"figure_text\",\n        \"list_item\",\n        \"title\",\n        
\"caption\",\n        \"footnote\",\n        \"page_header\",\n        \"page_footer\",\n        \"wired_table_cell\",\n        \"wireless_table_cell\",\n        \"paragraph_hybrid\",\n        \"table_cell_hybrid\",\n        \"caption_hybrid\",\n        \"unknown_hybrid\",\n        \"figure_text_hybrid\",\n        \"list_item_hybrid\",\n        \"title_hybrid\",\n        \"fallback_line\",\n        \"author_info_hybrid\",\n        \"page_header_hybrid\",\n        \"page_footer_hybrid\",\n        \"footnote_hybrid\",\n    ]\n\n\ndef is_character_in_formula_layout(\n    char: il_version_1.PdfCharacter,\n    _page: il_version_1.Page,\n    layout_index,\n    layout_map,\n) -> int | None:\n    \"\"\"Check if character is contained within any formula-related layout.\"\"\"\n    formula_layout_types = {\"formula\"}\n\n    char_box = char.visual_bbox.box\n    char_box2 = char.box\n\n    if calculate_iou_for_boxes(char_box, char_box2) < 0.2:\n        char_box = char_box2\n\n    # Get all candidate layouts that intersect with the character\n    candidate_ids = list(layout_index.intersection(box_to_tuple(char_box)))\n    candidate_layouts: list[il_version_1.PageLayout] = [\n        layout_map[i] for i in candidate_ids\n    ]\n\n    # Check if any intersecting layout is a formula type\n    for layout in candidate_layouts:\n        if layout.class_name in formula_layout_types:\n            iou = calculate_iou_for_boxes(char_box, layout.box)\n            if iou > 0.4:  # Character has overlap with formula layout\n                return layout.id\n\n    return None\n\n\ndef is_curve_in_figure_table_layout(\n    curve, layout_index, layout_map, protection_threshold: float = 0.3\n) -> bool:\n    \"\"\"Check if curve is within figure/table layout areas.\n\n    Args:\n        curve: The curve object to check\n        layout_index: Spatial index for layouts\n        layout_map: Mapping from layout IDs to layout objects\n        protection_threshold: IoU threshold for figure/table 
protection\n\n    Returns:\n        True if curve is within figure/table layout areas\n    \"\"\"\n    if not curve.box:\n        return False\n\n    # Figure/table related layout types\n    figure_table_layouts = {\n        \"figure\",\n        \"table\",\n        \"figure_text\",\n        \"table_text\",\n        \"figure_caption\",\n        \"table_caption\",\n        \"figure_title\",\n        \"table_title\",\n        \"chart_title\",\n        \"table_cell\",\n        \"table_cell_hybrid\",\n        \"wired_table_cell\",\n        \"wireless_table_cell\",\n        \"table_footnote\",\n    }\n\n    # Get candidate layouts that intersect with curve\n    candidate_ids = list(layout_index.intersection(box_to_tuple(curve.box)))\n    candidate_layouts = [layout_map[i] for i in candidate_ids]\n\n    for layout in candidate_layouts:\n        if layout.class_name in figure_table_layouts:\n            # Check if curve has significant overlap with figure/table layout\n            iou = calculate_iou_for_boxes(curve.box, layout.box)\n            if iou > protection_threshold:\n                return True\n\n    return False\n\n\ndef is_curve_overlapping_with_paragraphs(\n    curve, paragraphs: list, overlap_threshold: float = 0.2\n) -> bool:\n    \"\"\"Check if curve overlaps with text paragraph areas.\n\n    Args:\n        curve: The curve object to check\n        paragraphs: List of paragraph objects\n        overlap_threshold: IoU threshold for paragraph overlap detection\n\n    Returns:\n        True if curve overlaps with any paragraph area\n    \"\"\"\n    if not curve.box:\n        return False\n\n    for paragraph in paragraphs:\n        para_box = get_paragraph_bounding_box(paragraph)\n        if para_box:\n            iou = calculate_iou_for_boxes(curve.box, para_box)\n            if iou > overlap_threshold:\n                return True\n\n    return False\n\n\ndef get_paragraph_bounding_box(paragraph) -> Box | None:\n    \"\"\"Calculate the bounding box of a 
paragraph from its compositions.\n\n    Args:\n        paragraph: The paragraph object\n\n    Returns:\n        Box object representing the paragraph bounds, or None if no valid bounds\n    \"\"\"\n    if not paragraph.pdf_paragraph_composition:\n        return None\n\n    min_x = float(\"inf\")\n    min_y = float(\"inf\")\n    max_x = float(\"-inf\")\n    max_y = float(\"-inf\")\n\n    has_valid_box = False\n\n    for composition in paragraph.pdf_paragraph_composition:\n        comp_box = None\n\n        if composition.pdf_line and composition.pdf_line.box:\n            comp_box = composition.pdf_line.box\n        elif composition.pdf_formula and composition.pdf_formula.box:\n            comp_box = composition.pdf_formula.box\n        elif (\n            composition.pdf_same_style_characters\n            and composition.pdf_same_style_characters.box\n        ):\n            comp_box = composition.pdf_same_style_characters.box\n        elif composition.pdf_character and len(composition.pdf_character) > 0:\n            # Calculate box from character list\n            char_boxes = [\n                char.visual_bbox.box\n                for char in composition.pdf_character\n                if char.visual_bbox and char.visual_bbox.box\n            ]\n            if char_boxes:\n                comp_min_x = min(box.x for box in char_boxes)\n                comp_min_y = min(box.y for box in char_boxes)\n                comp_max_x = max(box.x2 for box in char_boxes)\n                comp_max_y = max(box.y2 for box in char_boxes)\n                comp_box = Box(comp_min_x, comp_min_y, comp_max_x, comp_max_y)\n\n        if comp_box:\n            min_x = min(min_x, comp_box.x)\n            min_y = min(min_y, comp_box.y)\n            max_x = max(max_x, comp_box.x2)\n            max_y = max(max_y, comp_box.y2)\n            has_valid_box = True\n\n    if not has_valid_box:\n        return None\n\n    return Box(min_x, min_y, max_x, max_y)\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/matrix_helper.py",
    "content": "\"\"\"Matrix helper utilities for CTM decomposition and composition.\n\nThis module provides functions to:\n- Decompose a PDF CTM into translation, rotation, scale, and shear\n- Compose a CTM back from translation, rotation, scale, and shear\n\nAll comments and docstrings are in English per project guidelines.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport math\n\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfAffineTransform\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfMatrix\n\n# Local type aliases to avoid importing from pdfminer\nPoint = tuple[float, float]\nMatrix = tuple[float, float, float, float, float, float]\n\n\ndef decompose_ctm(m: Matrix | PdfMatrix) -> PdfAffineTransform:\n    \"\"\"Decompose a PDF CTM into a PdfAffineTransform.\n\n    The PDF current transformation matrix (CTM) is represented as\n    ``(a, b, c, d, e, f)`` corresponding to the affine matrix:\n    ``[[a, c, e], [b, d, f], [0, 0, 1]]``.\n\n    This function decomposes it into:\n    - translation: (tx, ty)\n    - rotation: angle in radians (counter-clockwise)\n    - scale: (sx, sy)\n    - shear: x-shear factor (dimensionless, equals tan(shear_angle))\n\n    The decomposition is based on a QR-like approach commonly used for 2D\n    affine matrices. 
If the linear part is degenerate, sensible fallbacks are\n    applied.\n\n    Args:\n        m: CTM as ``(a, b, c, d, e, f)``.\n\n    Returns:\n        A ``PdfAffineTransform`` instance with fields populated.\n    \"\"\"\n    if isinstance(m, PdfMatrix):\n        a = m.a\n        b = m.b\n        c = m.c\n        d = m.d\n        e = m.e\n        f = m.f\n        assert a is not None\n        assert b is not None\n        assert c is not None\n        assert d is not None\n        assert e is not None\n        assert f is not None\n    else:\n        (a, b, c, d, e, f) = m\n\n    tx, ty = e, f\n\n    # Linear part\n    m00, m01 = a, c\n    m10, m11 = b, d\n\n    # Scale X is the length of the first column\n    sx = math.hypot(m00, m10)\n\n    eps = 1e-12\n    if sx < eps:\n        # Degenerate first column. Choose rotation = 0, shear = 0, sx = 0.\n        rotation = 0.0\n        shear = 0.0\n        # Then sy is the length of the second column\n        sy = math.hypot(m01, m11)\n        # Handle reflection\n        det = m00 * m11 - m01 * m10\n        if det < 0:\n            sy = -sy if sy != 0 else -0.0\n        return PdfAffineTransform(\n            translation_x=tx,\n            translation_y=ty,\n            rotation=rotation,\n            scale_x=sx,\n            scale_y=sy,\n            shear=shear,\n        )\n\n    # Normalize first column to get rotation axis\n    r0x = m00 / sx\n    r0y = m10 / sx\n\n    # Shear is the projection of the second column onto the first column\n    shear = r0x * m01 + r0y * m11\n\n    # Remove the shear component from the second column\n    m01_ortho = m01 - shear * r0x\n    m11_ortho = m11 - shear * r0y\n\n    # Scale Y is the length of the orthogonalized second column\n    sy = math.hypot(m01_ortho, m11_ortho)\n\n    # Determine reflection by determinant sign\n    det = m00 * m11 - m01 * m10\n    if det < 0:\n        sy = -sy if sy != 0 else -0.0\n        shear = -shear\n        m01_ortho = -m01_ortho\n        m11_ortho = 
-m11_ortho\n\n    # Rotation is the angle of the first column\n    rotation = math.atan2(m10, m00)\n\n    return PdfAffineTransform(\n        translation_x=tx,\n        translation_y=ty,\n        rotation=rotation,\n        scale_x=sx,\n        scale_y=sy,\n        shear=shear,\n    )\n\n\ndef compose_ctm(transform: PdfAffineTransform) -> Matrix:\n    \"\"\"Compose a PDF CTM from a PdfAffineTransform.\n\n    This composes the 2x2 linear part using the following model:\n    - First column: ``sx * r0`` where ``r0 = (cos(theta), sin(theta))``\n    - Second column: ``shear * r0 + sy * r1`` where ``r1`` is the unit vector\n      orthogonal to ``r0``: ``r1 = (-sin(theta), cos(theta))``\n    - Translation is appended as (e, f) = (tx, ty)\n\n    Args:\n        transform: A ``PdfAffineTransform`` with translation, rotation,\n            scale, and shear populated.\n\n    Returns:\n        The CTM matrix ``(a, b, c, d, e, f)``.\n    \"\"\"\n    # Extract and validate required values from the dataclass\n    tx = float(transform.translation_x if transform.translation_x is not None else 0.0)\n    ty = float(transform.translation_y if transform.translation_y is not None else 0.0)\n    theta = float(transform.rotation if transform.rotation is not None else 0.0)\n    sx = float(transform.scale_x if transform.scale_x is not None else 1.0)\n    sy = float(transform.scale_y if transform.scale_y is not None else 1.0)\n    shear = float(transform.shear if transform.shear is not None else 0.0)\n\n    cos_t = math.cos(theta)\n    sin_t = math.sin(theta)\n\n    # Unit basis aligned with rotation\n    r0x, r0y = cos_t, sin_t\n    r1x, r1y = -sin_t, cos_t\n\n    # Columns of the linear matrix\n    col0x = sx * r0x\n    col0y = sx * r0y\n    col1x = shear * r0x + sy * r1x\n    col1y = shear * r0y + sy * r1y\n\n    a = col0x\n    b = col0y\n    c = col1x\n    d = col1y\n    e = tx\n    f = ty\n\n    return a, b, c, d, e, f\n\n\ndef scale_and_set_translation(\n    m: Matrix | PdfMatrix, 
scale_factor: float, tx: float, ty: float\n) -> Matrix | PdfMatrix:\n    \"\"\"Uniformly scale a CTM and set its translation to a position.\n\n    This function performs an isotropic scale in X and Y by ``scale_factor`` and\n    then sets the translation components to ``(tx, ty)``. It preserves the\n    input type: if a ``PdfMatrix`` is provided, a ``PdfMatrix`` is returned;\n    if a tuple is provided, a tuple is returned.\n\n    Args:\n        m: Input CTM as ``(a, b, c, d, e, f)`` or ``PdfMatrix``.\n        scale_factor: Scale factor. ``1.0`` keeps size unchanged, ``0.5``\n            halves it, ``2.0`` doubles it.\n        tx: New translation X.\n        ty: New translation Y.\n\n    Returns:\n        A CTM of the same type as the input, scaled and with translation set.\n    \"\"\"\n\n    if isinstance(m, PdfMatrix):\n        a = m.a\n        b = m.b\n        c = m.c\n        d = m.d\n        # e, f will be overridden by tx, ty\n        assert a is not None\n        assert b is not None\n        assert c is not None\n        assert d is not None\n\n        return PdfMatrix(\n            a=a * scale_factor,\n            b=b * scale_factor,\n            c=c * scale_factor,\n            d=d * scale_factor,\n            e=float(tx),\n            f=float(ty),\n        )\n\n    a, b, c, d, _, _ = m\n    return (\n        a * scale_factor,\n        b * scale_factor,\n        c * scale_factor,\n        d * scale_factor,\n        float(tx),\n        float(ty),\n    )\n\n\ndef create_translation_and_scale_matrix(\n    translation_x: float, translation_y: float, scale_factor: float\n) -> Matrix:\n    \"\"\"Create a transformation matrix for translation and uniform scaling.\n\n    This creates a CTM that first scales uniformly by scale_factor, then translates\n    by (translation_x, translation_y).\n\n    Args:\n        translation_x: Translation in X direction\n        translation_y: Translation in Y direction\n        scale_factor: Uniform scale factor for both X and 
Y\n\n    Returns:\n        The CTM matrix (a, b, c, d, e, f)\n    \"\"\"\n    # Matrix for uniform scaling and translation:\n    # [scale  0      tx]\n    # [0      scale  ty]\n    # [0      0      1 ]\n    # Which maps to CTM (scale, 0, 0, scale, tx, ty)\n    return (scale_factor, 0.0, 0.0, scale_factor, translation_x, translation_y)\n\n\ndef multiply_matrices(m1: Matrix | PdfMatrix, m2: Matrix | PdfMatrix) -> Matrix:\n    \"\"\"Multiply two transformation matrices (m1 * m2).\n\n    Args:\n        m1: Left matrix in multiplication\n        m2: Right matrix in multiplication\n\n    Returns:\n        Result matrix as tuple (a, b, c, d, e, f)\n    \"\"\"\n    # Extract components from first matrix\n    if isinstance(m1, PdfMatrix):\n        a1, b1, c1, d1, e1, f1 = m1.a, m1.b, m1.c, m1.d, m1.e, m1.f\n        assert all(x is not None for x in [a1, b1, c1, d1, e1, f1])\n    else:\n        a1, b1, c1, d1, e1, f1 = m1\n\n    # Extract components from second matrix\n    if isinstance(m2, PdfMatrix):\n        a2, b2, c2, d2, e2, f2 = m2.a, m2.b, m2.c, m2.d, m2.e, m2.f\n        assert all(x is not None for x in [a2, b2, c2, d2, e2, f2])\n    else:\n        a2, b2, c2, d2, e2, f2 = m2\n\n    # Matrix multiplication for 2D affine transformations:\n    # [a1 c1 e1]   [a2 c2 e2]   [a1*a2+c1*b2  a1*c2+c1*d2  a1*e2+c1*f2+e1]\n    # [b1 d1 f1] * [b2 d2 f2] = [b1*a2+d1*b2  b1*c2+d1*d2  b1*e2+d1*f2+f1]\n    # [0  0  1 ]   [0  0  1 ]   [0            0            1              ]\n\n    a = a1 * a2 + c1 * b2\n    b = b1 * a2 + d1 * b2\n    c = a1 * c2 + c1 * d2\n    d = b1 * c2 + d1 * d2\n    e = a1 * e2 + c1 * f2 + e1\n    f = b1 * e2 + d1 * f2 + f1\n\n    return (a, b, c, d, e, f)\n\n\ndef apply_transform_to_ctm(\n    existing_ctm: list[object],\n    translation_x: float,\n    translation_y: float,\n    scale_factor: float,\n) -> list[object]:\n    \"\"\"Apply translation and scale transformation to an existing CTM.\n\n    Args:\n        existing_ctm: Existing CTM as list of 6 
floats\n        translation_x: Translation in X direction\n        translation_y: Translation in Y direction\n        scale_factor: Uniform scale factor\n\n    Returns:\n        New CTM as list of objects\n    \"\"\"\n    if len(existing_ctm) != 6:\n        # If CTM is invalid, create a new identity matrix with the transform\n        transform_matrix = create_translation_and_scale_matrix(\n            translation_x, translation_y, scale_factor\n        )\n        return list(transform_matrix)\n\n    # Convert existing CTM to Matrix format\n    try:\n        existing_matrix = tuple(float(x) for x in existing_ctm)\n    except (ValueError, TypeError):\n        # If conversion fails, use identity matrix\n        existing_matrix = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)\n\n    # Create the transform matrix\n    transform_matrix = create_translation_and_scale_matrix(\n        translation_x, translation_y, scale_factor\n    )\n\n    # Left-multiply: new_ctm = transform_matrix * existing_matrix\n    result_matrix = multiply_matrices(transform_matrix, existing_matrix)\n\n    return list(result_matrix)\n\n\ndef matrix_to_bytes(m: Matrix | PdfMatrix) -> bytes:\n    if isinstance(m, PdfMatrix):\n        return (\n            f\" {m.a:.6f} {m.b:.6f} {m.c:.6f} {m.d:.6f} {m.e:.6f} {m.f:.6f} cm \".encode()\n        )\n    else:\n        return f\" {m[0]:.6f} {m[1]:.6f} {m[2]:.6f} {m[3]:.6f} {m[4]:.6f} {m[5]:.6f} cm \".encode()\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/mupdf_helper.py",
    "content": "import numpy as np\nimport pymupdf\n\nfrom babeldoc.const import get_process_pool\n\n\ndef get_no_rotation_img(page: pymupdf.Page, dpi: int = 72) -> pymupdf.Pixmap:\n    # return page.get_pixmap(dpi=72)\n    original_rotation = page.rotation\n    page.set_rotation(0)\n    pix = page.get_pixmap(dpi=dpi)\n    page.set_rotation(original_rotation)\n    return pix\n\n\ndef get_no_rotation_img_multiprocess_internal(\n    pdf_bytes: str, pagenum: int, dpi: int = 72\n) -> np.ndarray:\n    # return page.get_pixmap(dpi=72)\n    doc = pymupdf.open(pdf_bytes)\n    try:\n        page = doc[pagenum]\n        original_rotation = page.rotation\n        page.set_rotation(0)\n        pix = page.get_pixmap(dpi=dpi)\n        page.set_rotation(original_rotation)\n        return np.frombuffer(pix.samples, np.uint8).reshape(\n            pix.height,\n            pix.width,\n            3,\n        )[:, :, ::-1]\n    finally:\n        doc.close()\n\n\ndef get_no_rotation_img_multiprocess(pdf_bytes: str, pagenum: int, dpi: int = 72):\n    pool = get_process_pool()\n    if pool is None:\n        return get_no_rotation_img_multiprocess_internal(pdf_bytes, pagenum, dpi)\n    return pool.apply(\n        get_no_rotation_img_multiprocess_internal, (pdf_bytes, pagenum, dpi)\n    )\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/paragraph_helper.py",
    "content": "import logging\nimport re\n\nfrom babeldoc.format.pdf.document_il import il_version_1\n\nlogger = logging.getLogger(__name__)\n\n\ndef is_cid_paragraph(paragraph: il_version_1.PdfParagraph):\n    chars: list[il_version_1.PdfCharacter] = []\n    for composition in paragraph.pdf_paragraph_composition:\n        if composition.pdf_line:\n            chars.extend(composition.pdf_line.pdf_character)\n        elif composition.pdf_same_style_characters:\n            chars.extend(composition.pdf_same_style_characters.pdf_character)\n        elif composition.pdf_same_style_unicode_characters:\n            continue\n        #     chars.extend(composition.pdf_same_style_unicode_characters.unicode)\n        elif composition.pdf_formula:\n            chars.extend(composition.pdf_formula.pdf_character)\n        elif composition.pdf_character:\n            chars.append(composition.pdf_character)\n        else:\n            logger.error(\n                f\"Unknown composition type. \"\n                f\"Composition: {composition}. \"\n                f\"Paragraph: {paragraph}. 
\",\n            )\n            continue\n\n    cid_count = 0\n    for char in chars:\n        if re.match(r\"^\\(cid:\\d+\\)$\", char.char_unicode):\n            cid_count += 1\n\n    return cid_count > len(chars) * 0.8\n\n\nNUMERIC_PATTERN = re.compile(r\"^-?\\d+(\\.\\d+)?$\")\n\n\ndef is_pure_numeric_paragraph(paragraph) -> bool:\n    \"\"\"只检查段落是否为纯数字（支持整数、小数、负数）\"\"\"\n\n    if not paragraph or not getattr(paragraph, \"unicode\", None):\n        return False\n\n    text = paragraph.unicode.strip()\n    if not text:\n        return False\n\n    return bool(NUMERIC_PATTERN.match(text))\n\n\ndef is_placeholder_only_paragraph(paragraph: il_version_1.PdfParagraph) -> bool:\n    \"\"\"Check if a paragraph contains only placeholders and whitespace.\n\n    Args:\n        paragraph: PDF paragraph to check\n\n    Returns:\n        True if the paragraph contains only placeholders (formula or style tags)\n        and whitespace, False otherwise\n    \"\"\"\n    if not paragraph or not paragraph.unicode:\n        return False\n\n    for composition in paragraph.pdf_paragraph_composition:\n        if composition.pdf_formula:\n            # Formula composition is allowed\n            continue\n        elif composition.pdf_character:\n            # Check if single character is whitespace\n            if not composition.pdf_character.char_unicode.isspace():\n                return False\n        elif composition.pdf_line:\n            # Check if all characters in the line are whitespace\n            for char in composition.pdf_line.pdf_character:\n                if not char.char_unicode.isspace():\n                    return False\n        elif composition.pdf_same_style_characters:\n            # Check if all characters in the group are whitespace\n            for char in composition.pdf_same_style_characters.pdf_character:\n                if not char.char_unicode.isspace():\n                    return False\n        elif composition.pdf_same_style_unicode_characters:\n     
       # Check if the unicode content is only whitespace\n            if not composition.pdf_same_style_unicode_characters.unicode.isspace():\n                return False\n        else:\n            # Unknown composition type, conservatively return False\n            return False\n\n    return True\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/spatial_analyzer.py",
    "content": "\"\"\"Spatial relationship analyzer for PDF elements.\n\nThis module provides functions to analyze spatial relationships between PDF elements,\nparticularly for detecting containment relationships between formulas and other elements\nlike curves and forms.\n\nAll comments and docstrings are in English per project guidelines.\n\"\"\"\n\nfrom __future__ import annotations\n\nfrom babeldoc.format.pdf.document_il.il_version_1 import Box\nfrom babeldoc.format.pdf.document_il.il_version_1 import Page\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfCurve\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfForm\nfrom babeldoc.format.pdf.document_il.il_version_1 import PdfFormula\nfrom babeldoc.format.pdf.document_il.utils.layout_helper import calculate_iou_for_boxes\n\n\ndef is_element_contained_in_formula(\n    element_box: Box,\n    formula_box: Box,\n    containment_threshold: float = 0.95,\n    tolerance: float = 2.0,\n) -> bool:\n    \"\"\"Check if an element is completely contained within a formula with tolerance.\n\n    Args:\n        element_box: The bounding box of the element to check\n        formula_box: The bounding box of the formula\n        containment_threshold: Minimum IoU ratio to consider as contained (default: 0.95)\n        tolerance: Tolerance in units to expand formula box for containment check (default: 2.0)\n\n    Returns:\n        True if the element is considered contained within the formula\n    \"\"\"\n    if element_box is None or formula_box is None:\n        return False\n\n    # Expand formula box by tolerance for more lenient containment check\n    expanded_formula_box = Box(\n        x=formula_box.x - tolerance,\n        y=formula_box.y - tolerance,\n        x2=formula_box.x2 + tolerance,\n        y2=formula_box.y2 + tolerance,\n    )\n\n    # Calculate IoU of element box with respect to expanded formula box\n    iou = calculate_iou_for_boxes(element_box, expanded_formula_box)\n    return iou >= 
containment_threshold\n\n\ndef find_contained_curves(\n    formula: PdfFormula, page: Page, paragraph_xobj_id: int | None = None\n) -> list[PdfCurve]:\n    \"\"\"Find all curves that are contained within the given formula.\n\n    Args:\n        formula: The formula to check for contained curves\n        page: The page containing the curves\n        paragraph_xobj_id: The xobj_id of the paragraph containing the formula.\n                          If provided, only curves with matching xobj_id will be returned.\n\n    Returns:\n        List of curves that are contained within the formula\n    \"\"\"\n    if not formula.box or not page.pdf_curve:\n        return []\n\n    contained_curves = []\n    for curve in page.pdf_curve:\n        if curve.box and is_element_contained_in_formula(curve.box, formula.box):\n            # If paragraph_xobj_id is specified, only include curves with matching xobj_id\n            if paragraph_xobj_id is not None and curve.xobj_id != paragraph_xobj_id:\n                continue\n            contained_curves.append(curve)\n\n    return contained_curves\n\n\ndef find_contained_forms(\n    formula: PdfFormula, page: Page, paragraph_xobj_id: int | None = None\n) -> list[PdfForm]:\n    \"\"\"Find all forms that are contained within the given formula.\n\n    Args:\n        formula: The formula to check for contained forms\n        page: The page containing the forms\n        paragraph_xobj_id: The xobj_id of the paragraph containing the formula.\n                          If provided, only forms with matching xobj_id will be returned.\n\n    Returns:\n        List of forms that are contained within the formula\n    \"\"\"\n    if not formula.box or not page.pdf_form:\n        return []\n\n    contained_forms = []\n    for form in page.pdf_form:\n        if form.box and is_element_contained_in_formula(form.box, formula.box):\n            # If paragraph_xobj_id is specified, only include forms with matching xobj_id\n            if 
paragraph_xobj_id is not None and form.xobj_id != paragraph_xobj_id:\n                continue\n            contained_forms.append(form)\n\n    return contained_forms\n\n\ndef find_all_contained_elements(\n    formula: PdfFormula, page: Page, paragraph_xobj_id: int | None = None\n) -> tuple[list[PdfCurve], list[PdfForm]]:\n    \"\"\"Find all curves and forms that are contained within the given formula.\n\n    Args:\n        formula: The formula to check for contained elements\n        page: The page containing the elements\n        paragraph_xobj_id: The xobj_id of the paragraph containing the formula.\n                          If provided, only elements with matching xobj_id will be returned.\n\n    Returns:\n        Tuple of (contained_curves, contained_forms)\n    \"\"\"\n    contained_curves = find_contained_curves(formula, page, paragraph_xobj_id)\n    contained_forms = find_contained_forms(formula, page, paragraph_xobj_id)\n    return contained_curves, contained_forms\n\n\ndef calculate_translation_and_scale(\n    old_box: Box, new_box: Box\n) -> tuple[float, float, float]:\n    \"\"\"Calculate translation and scale factors between two boxes.\n\n    Args:\n        old_box: The original bounding box\n        new_box: The new bounding box\n\n    Returns:\n        Tuple of (translation_x, translation_y, scale_factor)\n    \"\"\"\n    if old_box is None or new_box is None:\n        return 0.0, 0.0, 1.0\n\n    # Calculate translation (difference in top-left corners)\n    translation_x = new_box.x - old_box.x\n    translation_y = new_box.y - old_box.y\n\n    # Calculate scale factor (using width ratio, fallback to height if needed)\n    old_width = old_box.x2 - old_box.x\n    new_width = new_box.x2 - new_box.x\n\n    if old_width > 0:\n        scale_factor = new_width / old_width\n    else:\n        old_height = old_box.y2 - old_box.y\n        new_height = new_box.y2 - new_box.y\n        scale_factor = new_height / old_height if old_height > 0 else 1.0\n\n    
return translation_x, translation_y, scale_factor\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/style_helper.py",
    "content": "from babeldoc.format.pdf.document_il import il_version_1\n\n\ndef create_pdf_style(r, g, b, font_id=\"base\", font_size=6):\n    \"\"\"\n    Create a PdfStyle object from RGB values.\n\n    Args:\n        r: Red component in range 0-255\n        g: Green component in range 0-255\n        b: Blue component in range 0-255\n        font_id: Font identifier\n        font_size: Font size\n\n    Returns:\n        PdfStyle object with the specified color\n    \"\"\"\n    r, g, b = [x / 255.0 for x in (r, g, b)]\n    return il_version_1.PdfStyle(\n        font_id=font_id,\n        font_size=font_size,\n        graphic_state=il_version_1.GraphicState(\n            passthrough_per_char_instruction=f\"{r:.10f} {g:.10f} {b:.10f} rg\",\n        ),\n    )\n\n\nBLACK = il_version_1.GraphicState(passthrough_per_char_instruction=\"0 g 0 G\")\n\nWHITE = il_version_1.GraphicState(passthrough_per_char_instruction=\"1 g 1 G\")\n\nGRAY80 = il_version_1.GraphicState(passthrough_per_char_instruction=\"0.80 g 0.80 G\")\nGRAY67 = il_version_1.GraphicState(passthrough_per_char_instruction=\"0.67 g 0.67 G\")\nGRAY33 = il_version_1.GraphicState(passthrough_per_char_instruction=\"0.33 g 0.33 G\")\n\n# Generate all color styles\nRED = il_version_1.GraphicState(\n    passthrough_per_char_instruction=\"1.0000000000 0.2313725490 0.1882352941 rg \"\n    \"1.0000000000 0.2313725490 0.1882352941 RG\",\n)\n\nORANGE = il_version_1.GraphicState(\n    passthrough_per_char_instruction=\"1.0000000000 0.5843137255 0.0000000000 rg \"\n    \"1.0000000000 0.5843137255 0.0000000000 RG\",\n)\nYELLOW = il_version_1.GraphicState(\n    passthrough_per_char_instruction=\"1.0000000000 0.8000000000 0.0000000000 rg \"\n    \"1.0000000000 0.8000000000 0.0000000000 RG\",\n)\n\nGREEN = il_version_1.GraphicState(\n    passthrough_per_char_instruction=\"0.2039215686 0.7803921569 0.3490196078 rg \"\n    \"0.2039215686 0.7803921569 0.3490196078 RG\",\n)\n\nMINT = il_version_1.GraphicState(\n    
passthrough_per_char_instruction=\"0.0000000000 0.7803921569 0.7450980392 rg \"\n    \"0.0000000000 0.7803921569 0.7450980392 RG\",\n)\n\nTEAL = il_version_1.GraphicState(\n    passthrough_per_char_instruction=\"0.1882352941 0.6901960784 0.7803921569 rg \"\n    \"0.1882352941 0.6901960784 0.7803921569 RG\",\n)\n\nCYAN = il_version_1.GraphicState(\n    passthrough_per_char_instruction=\"0.1960784314 0.6784313725 0.9019607843 rg \"\n    \"0.1960784314 0.6784313725 0.9019607843 RG\",\n)\n\nBLUE = il_version_1.GraphicState(\n    passthrough_per_char_instruction=\"0.0000000000 0.4784313725 1.0000000000 rg \"\n    \"0.0000000000 0.4784313725 1.0000000000 RG\",\n)\n\nINDIGO = il_version_1.GraphicState(\n    passthrough_per_char_instruction=\"0.3450980392 0.3372549020 0.8392156863 rg \"\n    \"0.3450980392 0.3372549020 0.8392156863 RG\",\n)\n\nPURPLE = il_version_1.GraphicState(\n    passthrough_per_char_instruction=\"0.6862745098 0.3215686275 0.8705882353 rg \"\n    \"0.6862745098 0.3215686275 0.8705882353 RG\",\n)\n\nPINK = il_version_1.GraphicState(\n    passthrough_per_char_instruction=\"1.0000000000 0.1764705882 0.3333333333 rg \"\n    \"1.0000000000 0.1764705882 0.3333333333 RG\",\n)\n\nBROWN = il_version_1.GraphicState(\n    passthrough_per_char_instruction=\"0.6352941176 0.5176470588 0.3686274510 rg \"\n    \"0.6352941176 0.5176470588 0.3686274510 RG\",\n)\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/utils/zstd_helper.py",
    "content": "import base64\n\nimport pyzstd\n\n\ndef zstd_compress(data) -> str:\n    if isinstance(data, str):\n        data = data.encode()\n    if not isinstance(data, bytes):\n        raise TypeError(f\"data must be str or bytes, not {type(data)}\")\n\n    return base64.b85encode(pyzstd.compress(data)).decode()\n\n\ndef zstd_decompress(data) -> str:\n    if isinstance(data, str):\n        data = data.encode()\n    if not isinstance(data, bytes):\n        raise TypeError(f\"data must be str or bytes, not {type(data)}\")\n\n    return pyzstd.decompress(base64.b85decode(data)).decode()\n"
  },
  {
    "path": "babeldoc/format/pdf/document_il/xml_converter.py",
    "content": "import copy\nfrom pathlib import Path\n\nimport orjson\nfrom xsdata.formats.dataclass.context import XmlContext\nfrom xsdata.formats.dataclass.parsers import XmlParser\nfrom xsdata.formats.dataclass.serializers import XmlSerializer\nfrom xsdata.formats.dataclass.serializers.config import SerializerConfig\n\nfrom babeldoc.format.pdf.document_il import il_version_1\n\n\nclass XMLConverter:\n    def __init__(self):\n        self.parser = XmlParser()\n        config = SerializerConfig(indent=\"  \")\n        context = XmlContext()\n        self.serializer = XmlSerializer(context=context, config=config)\n\n    def write_xml(self, document: il_version_1.Document, path: str):\n        with Path(path).open(\"w\", encoding=\"utf-8\") as f:\n            f.write(self.to_xml(document))\n\n    def read_xml(self, path: str) -> il_version_1.Document:\n        with Path(path).open(encoding=\"utf-8\") as f:\n            return self.from_xml(f.read())\n\n    def to_xml(self, document: il_version_1.Document) -> str:\n        return self.serializer.render(document)\n\n    def from_xml(self, xml: str) -> il_version_1.Document:\n        return self.parser.from_string(\n            xml,\n            il_version_1.Document,\n        )\n\n    def deepcopy(self, document: il_version_1.Document) -> il_version_1.Document:\n        return copy.deepcopy(document)\n        # return self.from_xml(self.to_xml(document))\n\n    def to_json(self, document: il_version_1.Document) -> str:\n        return orjson.dumps(\n            document,\n            option=orjson.OPT_APPEND_NEWLINE\n            | orjson.OPT_INDENT_2\n            | orjson.OPT_SORT_KEYS,\n        ).decode()\n\n    def write_json(self, document: il_version_1.Document, path: str):\n        with Path(path).open(\"w\", encoding=\"utf-8\") as f:\n            f.write(self.to_json(document))\n"
  },
  {
    "path": "babeldoc/format/pdf/high_level.py",
    "content": "import asyncio\nimport copy\nimport hashlib\nimport io\nimport logging\nimport pathlib\nimport re\nimport shutil\nimport threading\nimport time\nfrom asyncio import CancelledError\nfrom pathlib import Path\nfrom typing import Any\nfrom typing import BinaryIO\n\nimport pymupdf\nfrom pymupdf import Document\nfrom pymupdf import Font\n\nfrom babeldoc import asynchronize\nfrom babeldoc.assets.assets import warmup\nfrom babeldoc.babeldoc_exception.BabelDOCException import ExtractTextError\nfrom babeldoc.babeldoc_exception.BabelDOCException import (\n    InputFileGeneratedByBabelDOCError,\n)\nfrom babeldoc.const import CACHE_FOLDER\nfrom babeldoc.const import WATERMARK_VERSION\nfrom babeldoc.const import close_process_pool\nfrom babeldoc.format.pdf.converter import TranslateConverter\nfrom babeldoc.format.pdf.document_il import il_version_1\nfrom babeldoc.format.pdf.document_il.backend.pdf_creater import SAVE_PDF_STAGE_NAME\nfrom babeldoc.format.pdf.document_il.backend.pdf_creater import SUBSET_FONT_STAGE_NAME\nfrom babeldoc.format.pdf.document_il.backend.pdf_creater import PDFCreater\nfrom babeldoc.format.pdf.document_il.backend.pdf_creater import reproduce_cmap\nfrom babeldoc.format.pdf.document_il.frontend.il_creater import ILCreater\nfrom babeldoc.format.pdf.document_il.midend.add_debug_information import (\n    AddDebugInformation,\n)\nfrom babeldoc.format.pdf.document_il.midend.automatic_term_extractor import (\n    AutomaticTermExtractor,\n)\nfrom babeldoc.format.pdf.document_il.midend.detect_scanned_file import DetectScannedFile\nfrom babeldoc.format.pdf.document_il.midend.il_translator import ILTranslator\nfrom babeldoc.format.pdf.document_il.midend.il_translator_llm_only import (\n    ILTranslatorLLMOnly,\n)\nfrom babeldoc.format.pdf.document_il.midend.layout_parser import LayoutParser\nfrom babeldoc.format.pdf.document_il.midend.paragraph_finder import ParagraphFinder\nfrom babeldoc.format.pdf.document_il.midend.styles_and_formulas import 
StylesAndFormulas\nfrom babeldoc.format.pdf.document_il.midend.table_parser import TableParser\nfrom babeldoc.format.pdf.document_il.midend.typesetting import Typesetting\nfrom babeldoc.format.pdf.document_il.utils.fontmap import FontMapper\nfrom babeldoc.format.pdf.document_il.xml_converter import XMLConverter\nfrom babeldoc.format.pdf.pdfinterp import PDFPageInterpreterEx\nfrom babeldoc.format.pdf.result_merger import ResultMerger\nfrom babeldoc.format.pdf.split_manager import SplitManager\nfrom babeldoc.format.pdf.translation_config import TranslateResult\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\nfrom babeldoc.format.pdf.translation_config import WatermarkOutputMode\nfrom babeldoc.pdfminer.pdfdocument import PDFDocument\nfrom babeldoc.pdfminer.pdfinterp import PDFResourceManager\nfrom babeldoc.pdfminer.pdfpage import PDFPage\nfrom babeldoc.pdfminer.pdfparser import PDFParser\nfrom babeldoc.progress_monitor import ProgressMonitor\nfrom babeldoc.utils import memory\n\nlogger = logging.getLogger(__name__)\n\nTRANSLATE_STAGES = [\n    (ILCreater.stage_name, 14.12),  # Parse PDF and Create IR\n    (DetectScannedFile.stage_name, 2.45),  # DetectScannedFile\n    (LayoutParser.stage_name, 14.03),  # Parse Page Layout\n    (TableParser.stage_name, 1.0),  # Parse Table\n    (ParagraphFinder.stage_name, 6.26),  # Parse Paragraphs\n    (StylesAndFormulas.stage_name, 1.66),  # Parse Formulas and Styles\n    # (RemoveDescent.stage_name, 0.15),  # Remove Char Descent\n    (AutomaticTermExtractor.stage_name, 30.0),  # Extract Terms\n    (ILTranslator.stage_name, 46.96),  # Translate Paragraphs\n    (Typesetting.stage_name, 4.71),  # Typesetting\n    (FontMapper.stage_name, 0.61),  # Add Fonts\n    (PDFCreater.stage_name, 1.96),  # Generate drawing instructions\n    (SUBSET_FONT_STAGE_NAME, 0.92),  # Subset font\n    (SAVE_PDF_STAGE_NAME, 6.34),  # Save PDF\n]\n\nresfont_map = {\n    \"zh-cn\": \"china-ss\",\n    \"zh-tw\": \"china-ts\",\n    
\"zh-hans\": \"china-ss\",\n    \"zh-hant\": \"china-ts\",\n    \"zh\": \"china-ss\",\n    \"ja\": \"japan-s\",\n    \"ko\": \"korea-s\",\n}\n\n\ndef safe_save(doc, *args, **kwargs):\n    try:\n        # first try, saving without options\n        doc.save(*args, **kwargs)\n    except Exception:\n        # second try, saving with 'garbage=3' for object missing\n        doc.ez_save(*args, **kwargs)\n\n\ndef check_metadata(pdf: Document):\n    meta = pdf.metadata\n    if not meta:\n        return\n    producer = meta.get(\"producer\", None)\n    if (\n        producer\n        and \"BabelDOC\" in producer\n        and \"Translation_generated_by_AI,please_carefully_discern\" in producer\n    ):\n        raise InputFileGeneratedByBabelDOCError(\n            \"Input file is generated by BabelDOC, Cannot translate files that have already been translated.\"\n        )\n\n\ndef add_metadata(\n    translate_result: TranslateResult, translate_config: TranslationConfig\n):\n    processed = []\n    for attr in (\n        \"mono_pdf_path\",\n        \"dual_pdf_path\",\n        \"no_watermark_mono_pdf_path\",\n        \"no_watermark_dual_pdf_path\",\n    ):\n        path = getattr(translate_result, attr)\n        if not path or path in processed:\n            continue\n        processed.append(path)\n\n        temp_path = translate_config.get_working_file_path(f\"{path.stem}.cmap.pdf\")\n        pdf = pymupdf.open(path)\n        meta = pdf.metadata\n        if not meta:\n            meta = {}\n        creator = meta.get(\"creator\", None)\n        producer = meta.get(\"producer\", None)\n        if producer:\n            if not creator:\n                creator = producer\n            else:\n                creator += f\", {producer}\"\n\n        translated_by = f\"BabelDOC{WATERMARK_VERSION}_{time.time()}_Translation_generated_by_AI,please_carefully_discern\"\n        if translate_config.metadata_extra_data:\n            translated_by += 
f\"_{translate_config.metadata_extra_data}\"\n        meta[\"producer\"] = translated_by\n        meta[\"creator\"] = creator\n\n        for k, v in meta.items():\n            if v:\n                # Strip characters in the surrogate range with a regex\n                meta[k] = re.sub(r\"[\\uD800-\\uDFFF]\", \"\", v)\n\n        pdf.set_metadata(meta)\n        safe_save(pdf, temp_path)\n        shutil.move(temp_path, path)\n\n\ndef fix_cmap(translate_result: TranslateResult, translate_config: TranslationConfig):\n    processed = []\n    for attr in (\n        \"mono_pdf_path\",\n        \"dual_pdf_path\",\n        \"no_watermark_mono_pdf_path\",\n        \"no_watermark_dual_pdf_path\",\n    ):\n        path = getattr(translate_result, attr)\n        if not path or path in processed:\n            continue\n        processed.append(path)\n\n        temp_path = translate_config.get_working_file_path(f\"{path.stem}.cmap.pdf\")\n        pdf = pymupdf.open(path)\n        reproduce_cmap(pdf)\n        safe_save(pdf, temp_path)\n        shutil.move(temp_path, path)\n\n\ndef verify_file_hash(file_path: str, expected_hash: str) -> bool:\n    \"\"\"Verify the SHA256 hash of a file.\"\"\"\n    sha256_hash = hashlib.sha256()\n    with Path(file_path).open(\"rb\") as f:\n        # Read the file in chunks to handle large files efficiently\n        for byte_block in iter(lambda: f.read(4096), b\"\"):\n            sha256_hash.update(byte_block)\n    return sha256_hash.hexdigest() == expected_hash\n\n\ndef translator_supports_llm(translator) -> bool:\n    if not translator or not hasattr(translator, \"do_llm_translate\"):\n        return False\n    try:\n        translator.do_llm_translate(None)\n        return True\n    except NotImplementedError:\n        return False\n    except Exception as exc:  # pragma: no cover - defensive logging\n        logger.debug(\"translator %s failed llm detection: %s\", translator, exc)\n        return False\n\n\ndef start_parse_il(\n    inf: BinaryIO,\n    pages: list[int] | 
None = None,\n    vfont: str = \"\",\n    vchar: str = \"\",\n    thread: int = 0,\n    doc_zh: Document = None,\n    lang_in: str = \"\",\n    lang_out: str = \"\",\n    service: str = \"\",\n    resfont: str = \"\",\n    noto: Font = None,\n    cancellation_event: asyncio.Event = None,\n    il_creater: ILCreater = None,\n    translation_config: TranslationConfig = None,\n    **kwarg: Any,\n) -> None:\n    rsrcmgr = PDFResourceManager()\n    layout = {}\n    device = TranslateConverter(\n        rsrcmgr,\n        vfont,\n        vchar,\n        thread,\n        layout,\n        lang_in,\n        lang_out,\n        service,\n        resfont,\n        noto,\n        kwarg.get(\"envs\", {}),\n        kwarg.get(\"prompt\", []),\n        il_creater=il_creater,\n    )\n    # model = DocLayoutModel.load_available()\n\n    assert device is not None\n    assert il_creater is not None\n    assert translation_config is not None\n    obj_patch = {}\n    interpreter = PDFPageInterpreterEx(rsrcmgr, device, obj_patch, il_creater)\n    if pages:\n        total_pages = len(pages)\n    else:\n        total_pages = doc_zh.page_count\n\n    il_creater.on_total_pages(total_pages)\n\n    parser = PDFParser(inf)\n    doc = PDFDocument(parser)\n\n    for pageno, page in enumerate(PDFPage.create_pages(doc)):\n        if cancellation_event and cancellation_event.is_set():\n            raise CancelledError(\"task cancelled\")\n        if pages and (pageno not in pages):\n            continue\n        page.pageno = pageno\n\n        if not translation_config.should_translate_page(pageno + 1):\n            continue\n\n        height, width = (\n            page.cropbox[3] - page.cropbox[1],\n            page.cropbox[2] - page.cropbox[0],\n        )\n        if height > 1200 or width > 2000:\n            logger.warning(f\"page {pageno + 1} is too large, maybe unable to translate\")\n            # continue\n\n        translation_config.raise_if_cancelled()\n        # The current program no 
longer relies on\n        # the following layout recognition results,\n        # but in order to facilitate the migration of pdf2zh,\n        # the relevant code is temporarily retained.\n        # pix = doc_zh[page.pageno].get_pixmap()\n        # image = np.frombuffer(pix.samples, np.uint8).reshape(\n        #     pix.height, pix.width, 3\n        # )[:, :, ::-1]\n        # page_layout = model.predict(\n        #     image, imgsz=int(pix.height / 32) * 32)[0]\n        # # A k-d tree is out of the question; render straight to an image instead, trading space for time\n        # box = np.ones((pix.height, pix.width))\n        # h, w = box.shape\n        # vcls = [\"abandon\", \"figure\", \"table\",\n        #         \"isolate_formula\", \"formula_caption\"]\n        # for i, d in enumerate(page_layout.boxes):\n        #     if page_layout.names[int(d.cls)] not in vcls:\n        #         x0, y0, x1, y1 = d.xyxy.squeeze()\n        #         x0, y0, x1, y1 = (\n        #             np.clip(int(x0 - 1), 0, w - 1),\n        #             np.clip(int(h - y1 - 1), 0, h - 1),\n        #             np.clip(int(x1 + 1), 0, w - 1),\n        #             np.clip(int(h - y0 + 1), 0, h - 1),\n        #         )\n        #         box[y0:y1, x0:x1] = i + 2\n        # for i, d in enumerate(page_layout.boxes):\n        #     if page_layout.names[int(d.cls)] in vcls:\n        #         x0, y0, x1, y1 = d.xyxy.squeeze()\n        #         x0, y0, x1, y1 = (\n        #             np.clip(int(x0 - 1), 0, w - 1),\n        #             np.clip(int(h - y1 - 1), 0, h - 1),\n        #             np.clip(int(x1 + 1), 0, w - 1),\n        #             np.clip(int(h - y0 + 1), 0, h - 1),\n        #         )\n        #         box[y0:y1, x0:x1] = 0\n        # layout[page.pageno] = box\n        # Create a new xref to hold the new instruction stream\n        # page.page_xref = doc_zh.get_new_xref()  # hack: new xref inserted for the page\n        # doc_zh.update_object(page.page_xref, \"<<>>\")\n        # doc_zh.update_stream(page.page_xref, b\"\")\n        # 
doc_zh[page.pageno].set_contents(page.page_xref)\n        ops_base = interpreter.process_page(page)\n        il_creater.on_page_base_operation(ops_base)\n        il_creater.on_page_end()\n    il_creater.on_finish()\n    device.close()\n\n\ndef translate(translation_config: TranslationConfig) -> TranslateResult:\n    with ProgressMonitor(get_translation_stage(translation_config)) as pm:\n        return do_translate(pm, translation_config)\n\n\ndef get_translation_stage(\n    translation_config: TranslationConfig,\n) -> list[tuple[str, float]]:\n    result = copy.deepcopy(TRANSLATE_STAGES)\n    should_remove = []\n\n    # If only parsing and generating PDF, skip all translation-related stages\n    if translation_config.only_parse_generate_pdf:\n        should_remove.extend(\n            [\n                DetectScannedFile.stage_name,\n                LayoutParser.stage_name,\n                TableParser.stage_name,\n                ParagraphFinder.stage_name,\n                StylesAndFormulas.stage_name,\n                AutomaticTermExtractor.stage_name,\n                ILTranslator.stage_name,\n                Typesetting.stage_name,\n            ]\n        )\n    else:\n        # Original logic for selective removal\n        if not translation_config.table_model:\n            should_remove.append(TableParser.stage_name)\n        if translation_config.skip_scanned_detection:\n            should_remove.append(DetectScannedFile.stage_name)\n        if not translation_config.auto_extract_glossary:\n            should_remove.append(AutomaticTermExtractor.stage_name)\n        if translation_config.skip_translation:\n            should_remove.append(ILTranslator.stage_name)\n\n    result = [x for x in result if x[0] not in should_remove]\n    return result\n\n\nasync def async_translate(translation_config: TranslationConfig):\n    \"\"\"Asynchronously translate a PDF file with real-time progress reporting.\n\n    This function yields progress events that can be used 
to update progress bars\n    or other UI elements. The events are dictionaries with the following structure:\n\n    - progress_start: {\n        \"type\": \"progress_start\",\n        \"stage\": str,              # Stage name\n        \"stage_progress\": float,   # Always 0.0\n        \"stage_current\": int,      # Current count (0)\n        \"stage_total\": int         # Total items in stage\n    }\n    - progress_update: {\n        \"type\": \"progress_update\",\n        \"stage\": str,              # Stage name\n        \"stage_progress\": float,   # Stage progress (0-100)\n        \"stage_current\": int,      # Current items processed\n        \"stage_total\": int,        # Total items in stage\n        \"overall_progress\": float  # Overall progress (0-100)\n    }\n    - progress_end: {\n        \"type\": \"progress_end\",\n        \"stage\": str,              # Stage name\n        \"stage_progress\": float,   # Always 100.0\n        \"stage_current\": int,      # Equal to stage_total\n        \"stage_total\": int,        # Total items processed\n        \"overall_progress\": float  # Overall progress (0-100)\n    }\n    - finish: {\n        \"type\": \"finish\",\n        \"translate_result\": TranslateResult\n    }\n    - error: {\n        \"type\": \"error\",\n        \"error\": str\n    }\n\n    Args:\n        translation_config: Configuration for the translation process\n\n    Yields:\n        dict: Progress events during translation\n\n    Raises:\n        CancelledError: If the translation is cancelled\n        Exception: Any other errors during translation\n    \"\"\"\n    loop = asyncio.get_running_loop()\n    callback = asynchronize.AsyncCallback()\n\n    finish_event = asyncio.Event()\n    cancel_event = threading.Event()\n    with ProgressMonitor(\n        get_translation_stage(translation_config),\n        progress_change_callback=callback.step_callback,\n        finish_callback=callback.finished_callback,\n        finish_event=finish_event,\n      
  cancel_event=cancel_event,\n        loop=loop,\n        report_interval=translation_config.report_interval,\n    ) as pm:\n        future = loop.run_in_executor(None, do_translate, pm, translation_config)\n        try:\n            async for event in callback:\n                event = event.kwargs\n                yield event\n                if event[\"type\"] == \"error\":\n                    break\n        except CancelledError:\n            cancel_event.set()\n        except KeyboardInterrupt:\n            logger.info(\"Translation cancelled by user through keyboard interrupt\")\n            cancel_event.set()\n    if cancel_event.is_set():\n        future.cancel()\n    logger.info(\"Waiting for translation to finish...\")\n    await finish_event.wait()\n\n\nclass MemoryMonitor:\n    \"\"\"Monitor memory usage of current process and all child processes.\"\"\"\n\n    def __init__(self, interval=0.1):\n        \"\"\"Initialize memory monitor.\n\n        Args:\n            interval: Monitoring interval in seconds, defaults to 0.1s (100ms)\n        \"\"\"\n        self.interval = interval\n        self.peak_memory_usage = 0\n        self.monitor_thread = None\n        self.stop_event = None\n        self.last_pss_check_time = None\n\n    def __enter__(self):\n        \"\"\"Start memory monitoring.\"\"\"\n        self.stop_event = threading.Event()\n        self.monitor_thread = threading.Thread(\n            target=self._monitor_memory_usage, daemon=True\n        )\n        self.monitor_thread.start()\n        logger.debug(\"Memory monitoring started\")\n        return self\n\n    def __exit__(self, exc_type, exc_val, exc_tb):\n        \"\"\"Stop monitoring and log peak memory usage.\"\"\"\n        if not self.monitor_thread:\n            return\n\n        self.stop_event.set()\n        self.monitor_thread.join(timeout=2.0)\n        logger.info(f\"Peak memory usage: {self.peak_memory_usage:.2f} MB\")\n\n    def _monitor_memory_usage(self):\n        
\"\"\"Background thread that periodically checks memory usage.\"\"\"\n        while not self.stop_event.is_set():\n            try:\n                # Use throttled memory check with 2-second PSS throttle\n                total_memory, self.last_pss_check_time = (\n                    memory.get_memory_usage_with_throttle(\n                        include_children=True,\n                        prefer_pss=True,\n                        last_pss_check_time=self.last_pss_check_time,\n                        pss_throttle_seconds=2.0,\n                    )\n                )\n\n                # Convert to MB for better readability\n                total_memory_mb = total_memory / (1024 * 1024)\n                if total_memory_mb > self.peak_memory_usage:\n                    self.peak_memory_usage = total_memory_mb\n            except Exception as e:\n                logger.warning(f\"Error monitoring memory: {e}\")\n\n            time.sleep(self.interval)\n\n    def get_peek_memory_psutil(self):\n        \"\"\"Get peak memory usage using psutil (for backwards compatibility).\"\"\"\n        return memory.get_memory_usage_bytes(include_children=True, prefer_pss=True)\n\n\ndef fix_null_page_content(doc: Document) -> list[int]:\n    invalid_page = []\n    for x in range(len(doc)):\n        xref = doc[x].xref\n        if doc.xref_object(xref) == \"null\":\n            invalid_page.append(x)\n    for x in invalid_page:\n        doc.delete_page(x)\n        doc.insert_page(x)\n    return invalid_page\n\n\ndef fix_null_xref(doc: Document) -> None:\n    \"\"\"Fix null xref in PDF file by replacing them with empty arrays.\n\n    Args:\n        doc: PyMuPDF Document object to fix\n    \"\"\"\n    for i in range(1, doc.xref_length()):\n        try:\n            obj = doc.xref_object(i)\n            if obj == \"null\":\n                doc.update_object(i, \"[]\")\n            elif obj and \"/ASCII85Decode\" in obj:  # make pdfminer happy\n                data = 
doc.xref_stream(i)\n                doc.update_stream(i, data)\n            elif obj and \"/LZWDecode\" in obj:\n                data = doc.xref_stream(i)\n                doc.update_stream(i, data)\n            elif obj and \"/Annots\" in obj:\n                doc.xref_set_key(i, \"Annots\", \"null\")\n        except Exception:\n            doc.update_object(i, \"[]\")\n\n\ndef fix_filter(doc):\n    page_contents = []\n    for page in doc:\n        page_contents.extend(page.get_contents())\n    for page_piece in page_contents:\n        f = doc.xref_get_key(page_piece, \"Filter\")\n        if f[0] == \"xref\":\n            data = doc.xref_stream(page_piece)\n            doc.update_stream(page_piece, data)\n    for page in doc:\n        contents = page.get_contents()\n        if len(contents) > 1:\n            page_streams = [doc.xref_stream(i) for i in contents]\n            r = doc.get_new_xref()\n            doc.update_object(r, \"<<>>\")\n            doc.update_stream(r, b\" \".join(page_streams))\n            doc.xref_set_key(page.xref, \"Contents\", f\"{r} 0 R\")\n    return\n    # skip rotate for now\n    for page in doc:\n        contents = page.get_contents()\n        t, v = doc.xref_get_key(page.xref, \"Rotate\")\n        rotate = -int(v) if t == \"int\" else 0\n        if len(contents) > 1 or rotate:\n            page_streams = [doc.xref_stream(i) for i in contents]\n            r = doc.get_new_xref()\n            page_prefix = b\"\"\n            page_suffix = b\"\"\n            if rotate:\n                m0 = pymupdf.Matrix(rotate)\n                b0 = page.mediabox * m0\n                m1 = m0 * pymupdf.Matrix(1, 0, 0, 1, b0.x0, -b0.y0)\n                page_prefix = (\n                    f\" {m1.a} {m1.b} {m1.c} {m1.d} {m1.e} {m1.f} cm q \".encode()\n                )\n                page_suffix = b\" Q \"\n                update_page_bbox(doc, page, page.cropbox * m1, \"CropBox\")\n                update_page_bbox(doc, page, page.artbox * m1, 
\"ArtBox\")\n                update_page_bbox(doc, page, page.bleedbox * m1, \"BleedBox\")\n                update_page_bbox(doc, page, page.mediabox * m1, \"MediaBox\")\n                doc.xref_set_key(page.xref, \"Rotate\", \"0\")\n            doc.update_object(r, \"<<>>\")\n            doc.update_stream(r, page_prefix + b\" \".join(page_streams) + page_suffix)\n            doc.xref_set_key(page.xref, \"Contents\", f\"{r} 0 R\")\n\n\ndef update_page_bbox(doc, page, box, key):\n    if doc.xref_get_key(page.xref, key)[0] == \"array\":\n        doc.xref_set_key(page.xref, key, f\"[{box.x0} {box.y0} {box.x1} {box.y1}]\")\n\n\ndef do_translate(\n    pm: ProgressMonitor, translation_config: TranslationConfig\n) -> TranslateResult:\n    try:\n        translation_config.progress_monitor = pm\n        original_pdf_path = translation_config.input_file\n        logger.info(f\"start to translate: {original_pdf_path}\")\n        try:\n            check_metadata(Document(original_pdf_path))\n        except InputFileGeneratedByBabelDOCError as e:\n            logger.error(\n                f\"input file {original_pdf_path} is generated by BabelDOC, Cannot translate files that have already been translated.\"\n            )\n            raise e\n        except Exception as e:\n            logger.warning(f\"Error in check metadata, continue: {e}\")\n        start_time = time.time()\n        peak_memory_usage = 0\n        with MemoryMonitor() as memory_monitor:\n            # Check if split translation is enabled\n            if not translation_config.split_strategy:\n                result = _do_translate_single(pm, translation_config)\n            else:\n                # Initialize split manager and determine split points\n                split_manager = SplitManager(translation_config)\n                split_points = split_manager.determine_split_points(translation_config)\n\n                if not split_points:\n                    logger.warning(\n                        
\"No split points determined, falling back to single translation\"\n                    )\n                    result = _do_translate_single(pm, translation_config)\n                else:\n                    logger.info(f\"Split points determined: {len(split_points)} parts\")\n\n                    if len(split_points) == 1:\n                        logger.info(\"Only one part, use single translation\")\n                        result = _do_translate_single(pm, translation_config)\n                    else:\n                        pm.total_parts = len(split_points)\n\n                        # Process parts serially\n                        results: dict[int, TranslateResult | None] = {}\n                        original_watermark_mode = (\n                            translation_config.watermark_output_mode\n                        )\n                        original_doc = Document(original_pdf_path)\n                        for i, split_point in enumerate(split_points):\n                            try:\n                                # Create a copy of config for this part\n                                part_config = copy.copy(translation_config)\n                                part_config.skip_clean = True\n                                should_translate_pages = []\n                                for page in range(\n                                    split_point.start_page, split_point.end_page + 1\n                                ):\n                                    if translation_config.should_translate_page(\n                                        page + 1\n                                    ):\n                                        should_translate_pages.append(\n                                            page - split_point.start_page + 1\n                                        )\n                                part_config.pages = None\n                                part_config.page_ranges = [\n                                    (x, x) 
for x in should_translate_pages\n                                ]\n                                if (\n                                    translation_config.only_include_translated_page\n                                    and not should_translate_pages\n                                ):\n                                    results[i] = None\n                                    continue\n\n                                # Only first part should do scanned detection if enabled\n                                if i > 0:\n                                    part_config.skip_scanned_detection = True\n\n                                part_config.working_dir = (\n                                    translation_config.get_part_working_dir(i)\n                                )\n                                part_config.output_dir = (\n                                    translation_config.get_part_output_dir(i)\n                                )\n\n                                assert id(\n                                    part_config.shared_context_cross_split_part\n                                ) == id(\n                                    translation_config.shared_context_cross_split_part\n                                ), \"shared_context_cross_split_part must be the same\"\n\n                                part_temp_input_path = (\n                                    part_config.get_working_file_path(\n                                        f\"input.part{i}.pdf\"\n                                    )\n                                )\n                                part_config.input_file = part_temp_input_path\n\n                                temp_doc = Document()\n                                for x in range(\n                                    split_point.start_page, split_point.end_page + 1\n                                ):\n                                    xref = original_doc[x].xref\n                                    if (\n        
                                original_doc.xref_get_key(xref, \"Annots\")[0]\n                                        != \"null\"\n                                    ):\n                                        original_doc.xref_set_key(\n                                            xref, \"Annots\", \"null\"\n                                        )\n                                temp_doc.insert_pdf(\n                                    original_doc,\n                                    from_page=split_point.start_page,\n                                    to_page=split_point.end_page,\n                                )\n                                safe_save(temp_doc, part_temp_input_path)\n                                assert (\n                                    temp_doc.page_count\n                                    == split_point.end_page - split_point.start_page + 1\n                                )\n\n                                # Only first part should have watermark\n                                if i > 0:\n                                    part_config.watermark_output_mode = (\n                                        WatermarkOutputMode.NoWatermark\n                                    )\n\n                                # Create progress monitor for this part\n                                part_monitor = pm.create_part_monitor(\n                                    i, len(split_points)\n                                )\n\n                                # Process this part\n                                result = _do_translate_single(\n                                    part_monitor,\n                                    part_config,\n                                )\n                                results[i] = result\n\n                            except Exception as e:\n                                logger.error(f\"Error in part {i}: {e}\")\n                                pm.translate_error(e)\n                              
  raise\n                            finally:\n                                # Clean up part working directory\n                                translation_config.cleanup_part_working_dir(i)\n\n                        # Restore original watermark mode\n                        translation_config.watermark_output_mode = (\n                            original_watermark_mode\n                        )\n\n                        # Merge results\n                        merger = ResultMerger(translation_config)\n                        logger.info(\"start merge results\")\n                        result = merger.merge_results(results)\n                        logger.info(\"finish merge results\")\n            peak_memory_usage = memory_monitor.peak_memory_usage\n\n        finish_time = time.time()\n        result.total_seconds = finish_time - start_time\n\n        logger.info(\n            f\"finish translate: {original_pdf_path}, cost: {finish_time - start_time} s\",\n        )\n        # Populate aggregate valid text statistics into result\n        try:\n            sc = translation_config.shared_context_cross_split_part\n            result.total_valid_character_count = getattr(\n                sc, \"valid_char_count_total\", 0\n            )\n            token_total = getattr(sc, \"total_valid_text_token_count\", None)\n            result.total_valid_text_token_count = (\n                token_total if isinstance(token_total, int) else 0\n            )\n        except Exception as e:\n            logger.warning(\"Failed to populate valid text statistics: %s\", e)\n            try:\n                result.total_valid_character_count = 0\n                result.total_valid_text_token_count = 0\n            except Exception:\n                pass\n        result.original_pdf_path = translation_config.input_file\n        result.peak_memory_usage = peak_memory_usage\n\n        fix_cmap(result, translation_config)\n        add_metadata(result, translation_config)\n      
  try:\n            migrate_toc(translation_config, result)\n        except Exception as e:\n            logger.error(\n                f\"Failed to migrate TOC from {translation_config.input_file}: {e}\"\n            )\n        pm.translate_done(result)\n        return result\n\n    except Exception as e:\n        if translation_config.debug:\n            logger.exception(\"translate error:\")\n        else:\n            logger.error(f\"translate error: {e}\")\n        pm.disable = False\n        pm.translate_error(e)\n        raise\n    finally:\n        logger.debug(\"do_translate finally\")\n        pm.on_finish()\n        translation_config.cleanup_temp_files()\n\n\ndef migrate_toc(\n    translation_config: TranslationConfig, translate_result: TranslateResult\n):\n    if translation_config.use_alternating_pages_dual:\n        logger.info('skipping TOC migration for \"use_alternating_pages_dual\" mode')\n        return\n    old_doc = Document(translation_config.input_file)\n    if not old_doc:\n        return\n    try:\n        fix_filter(old_doc)\n        fix_null_xref(old_doc)\n    except Exception:\n        logger.exception(\"auto fix failed, please check the pdf file\")\n\n    toc_data = old_doc.get_toc()\n\n    if not toc_data:\n        logger.info(\"No TOC found in the original PDF, skipping migration.\")\n        return\n\n    if translation_config.only_include_translated_page:\n        total_page = set(range(0, len(old_doc)))\n\n        # Iterate over page indices; len() alone returns an int and is not iterable\n        pages_to_translate = {\n            i\n            for i in range(len(old_doc))\n            if translation_config.should_translate_page(i + 1)\n        }\n\n        should_removed_page = list(total_page - pages_to_translate)\n\n    files = {\n        translate_result.dual_pdf_path,\n        # translate_result.mono_pdf_path,\n        translate_result.no_watermark_dual_pdf_path,\n        # translate_result.no_watermark_mono_pdf_path\n    }\n\n    for f in files:\n        if not f:\n            continue\n        mig_toc_temp_input = 
translation_config.get_working_file_path(\n            \"mig_toc_temp.pdf\"\n        )\n        shutil.copy(f, mig_toc_temp_input)\n        new_doc = Document(mig_toc_temp_input.as_posix())\n        if not new_doc:\n            continue\n\n        new_doc.set_toc(toc_data)\n        PDFCreater.save_pdf_with_timeout(\n            new_doc,\n            f.as_posix(),\n            translation_config=translation_config,\n            clean=not translation_config.skip_clean,\n            tag=\"mig_toc\",\n        )\n\n\n# mediabox -> '[0 nul 792]'\ndef fix_media_box(doc: Document) -> dict[int, dict[str, str]]:\n    mediabox_data = {}\n    for x in range(1, doc.xref_length()):\n        t = doc.xref_get_key(x, \"Type\")\n        box_set = {}\n        if t[1] in [\"/Pages\", \"/Page\"]:\n            mediabox = doc.xref_get_key(x, \"MediaBox\")\n            if mediabox[0] == \"array\":\n                try:\n                    _, _, x1, y1 = (\n                        mediabox[1].replace(\"[\", \"\").replace(\"]\", \"\").split(\" \")\n                    )\n                    doc.xref_set_key(x, \"MediaBox\", f\"[0 0 {x1} {y1}]\")\n                    box_set[\"MediaBox\"] = mediabox[1]\n                except Exception:\n                    logger.warning(\n                        \"Attempt to fix media box failed; some pages may not have been processed correctly.\"\n                    )\n            for k in [\"CropBox\", \"BleedBox\", \"TrimBox\", \"ArtBox\"]:\n                box = doc.xref_get_key(x, k)\n                if box[0] != \"null\":\n                    box_set[k] = box[1]\n                    doc.xref_set_key(x, k, \"null\")\n        if box_set:\n            mediabox_data[x] = box_set\n    return mediabox_data\n\n\ndef check_cid_char(il: il_version_1.Document):\n    chars = []\n    for page in il.page:\n        chars.extend(page.pdf_character)\n\n    cid_count = 0\n    for char in chars:\n        if re.match(r\"^\\(cid:\\d+\\)$\", char.char_unicode):\n            cid_count += 
1\n\n    return cid_count > len(chars) * 0.8\n\n\ndef _do_translate_single(\n    pm: ProgressMonitor,\n    translation_config: TranslationConfig,\n) -> TranslateResult:\n    \"\"\"Original translation logic for a single document or part\"\"\"\n    translation_config.progress_monitor = pm\n\n    if translation_config.shared_context_cross_split_part.auto_enabled_ocr_workaround:\n        translation_config.ocr_workaround = True\n        translation_config.skip_scanned_detection = True\n\n    original_pdf_path = translation_config.input_file\n    if translation_config.debug:\n        doc_input = Document(original_pdf_path)\n        logger.debug(\"debug mode, save decompressed input pdf\")\n        output_path = translation_config.get_working_file_path(\n            \"input.decompressed.pdf\",\n        )\n        # Fix null xref in PDF file\n        try:\n            _ = fix_null_page_content(doc_input)\n            fix_filter(doc_input)\n            fix_null_xref(doc_input)\n        except Exception:\n            logger.exception(\"auto fix failed, please check the pdf file\")\n        safe_save(doc_input, output_path, expand=True, pretty=True)\n        del doc_input\n\n    # Continue with original processing\n    temp_pdf_path = translation_config.get_working_file_path(\"input.pdf\")\n    doc_pdf2zh = Document(original_pdf_path)\n    safe_save(doc_pdf2zh, temp_pdf_path)\n\n    # Fix null xref in PDF file\n    invalid_pages = []\n    try:\n        invalid_pages = fix_null_page_content(doc_pdf2zh)\n        fix_filter(doc_pdf2zh)\n        fix_null_xref(doc_pdf2zh)\n    except Exception:\n        logger.exception(\"auto fix failed, please check the pdf file\")\n\n    mediabox_data = fix_media_box(doc_pdf2zh)\n\n    # for page in doc_pdf2zh:\n    #     page.insert_font(resfont, None)\n\n    resfont = None\n    safe_save(doc_pdf2zh, temp_pdf_path)\n\n    # if not translation_config.skip_scanned_detection and DetectScannedFile(\n    #     translation_config\n    # 
).fast_check(doc_pdf2zh):\n    #     if translation_config.auto_enable_ocr_workaround:\n    #         logger.warning(\n    #             \"Fast scanned check hit, Turning on OCR workaround.\",\n    #         )\n    #         translation_config.shared_context_cross_split_part.auto_enabled_ocr_workaround = True\n    #         translation_config.ocr_workaround = True\n    #         translation_config.skip_scanned_detection = True\n    #     else:\n    #         logger.warning(\n    #             \"Fast scanned check hit, Please check the input PDF file.\",\n    #         )\n    #         raise ScannedPDFError(\"Scanned PDF detected.\")\n\n    il_creater = ILCreater(translation_config)\n    il_creater.mupdf = doc_pdf2zh\n    xml_converter = XMLConverter()\n    logger.debug(f\"start parse il from {temp_pdf_path}\")\n    with Path(temp_pdf_path).open(\"rb\") as f:\n        start_parse_il(\n            f,\n            doc_zh=doc_pdf2zh,\n            resfont=resfont,\n            il_creater=il_creater,\n            translation_config=translation_config,\n        )\n    logger.debug(f\"finish parse il from {temp_pdf_path}\")\n    docs = il_creater.create_il()\n    logger.debug(f\"finish create il from {temp_pdf_path}\")\n    del il_creater\n    if translation_config.only_include_translated_page and not docs.page:\n        return None\n\n    if translation_config.debug:\n        xml_converter.write_json(\n            docs,\n            translation_config.get_working_file_path(\"create_il.debug.json\"),\n        )\n\n    if check_cid_char(docs):\n        raise ExtractTextError(\"The document contains too many CID chars.\")\n\n    # Skip all translation processing if only_parse_generate_pdf is enabled\n    if translation_config.only_parse_generate_pdf:\n        logger.debug(\"only_parse_generate_pdf enabled, skipping translation processing\")\n        # Skip directly to PDF generation\n        pdf_creater = PDFCreater(temp_pdf_path, docs, translation_config, mediabox_data)\n   
     result = pdf_creater.write(translation_config)\n        result.original_pdf_path = translation_config.input_file\n        return result\n\n    # Rest of the original translation logic...\n    # [Previous implementation of do_translate continues here]\n\n    # Detect whether the input is a scanned file\n    if translation_config.skip_scanned_detection:\n        logger.debug(\"skipping scanned file detection\")\n    else:\n        logger.debug(\"start detect scanned file\")\n        DetectScannedFile(translation_config).process(\n            docs, temp_pdf_path, mediabox_data\n        )\n        logger.debug(\"finish detect scanned file\")\n        if translation_config.debug:\n            xml_converter.write_json(\n                docs,\n                translation_config.get_working_file_path(\"detect_scanned_file.json\"),\n            )\n\n    # Generate layouts for all pages\n    logger.debug(\"start generating layouts\")\n    docs = LayoutParser(translation_config).process(docs, doc_pdf2zh)\n    logger.debug(\"finish generating layouts\")\n    close_process_pool()\n    if translation_config.debug:\n        xml_converter.write_json(\n            docs,\n            translation_config.get_working_file_path(\"layout_generator.json\"),\n        )\n\n    if translation_config.table_model:\n        docs = TableParser(translation_config).process(docs, doc_pdf2zh)\n        logger.debug(\"finish table parser\")\n        if translation_config.debug:\n            xml_converter.write_json(\n                docs,\n                translation_config.get_working_file_path(\"table_parser.json\"),\n            )\n    ParagraphFinder(translation_config).process(docs)\n    logger.debug(f\"finish paragraph finder from {temp_pdf_path}\")\n    if translation_config.debug:\n        xml_converter.write_json(\n            docs,\n            translation_config.get_working_file_path(\"paragraph_finder.json\"),\n        )\n    StylesAndFormulas(translation_config).process(docs)\n    logger.debug(f\"finish styles and 
formulas from {temp_pdf_path}\")\n    if translation_config.debug:\n        xml_converter.write_json(\n            docs,\n            translation_config.get_working_file_path(\"styles_and_formulas.json\"),\n        )\n\n    translate_engine = translation_config.translator\n    term_extraction_engine = translation_config.get_term_extraction_translator()\n\n    support_llm_translate = translator_supports_llm(translate_engine)\n    support_llm_term_extraction = translator_supports_llm(term_extraction_engine)\n\n    if support_llm_term_extraction and translation_config.auto_extract_glossary:\n        AutomaticTermExtractor(term_extraction_engine, translation_config).procress(\n            docs\n        )\n\n    if not translation_config.skip_translation:\n        if support_llm_translate:\n            il_translator = ILTranslatorLLMOnly(translate_engine, translation_config)\n        else:\n            il_translator = ILTranslator(translate_engine, translation_config)\n\n        il_translator.translate(docs)\n        del il_translator\n        logger.debug(f\"finish ILTranslator from {temp_pdf_path}\")\n    else:\n        logger.info(\"skip ILTranslator\")\n\n    if translation_config.debug:\n        xml_converter.write_json(\n            docs,\n            translation_config.get_working_file_path(\"il_translated.json\"),\n        )\n\n    if translation_config.debug:\n        AddDebugInformation(translation_config).process(docs)\n        xml_converter.write_json(\n            docs,\n            translation_config.get_working_file_path(\"add_debug_information.json\"),\n        )\n    mono_watermark_first_page_doc_bytes = None\n    dual_watermark_first_page_doc_bytes = None\n    try:\n        if translation_config.watermark_output_mode == WatermarkOutputMode.Both:\n            mono_watermark_first_page_doc_bytes, dual_watermark_first_page_doc_bytes = (\n                generate_first_page_with_watermark(\n                    doc_pdf2zh, translation_config, docs, 
mediabox_data\n                )\n            )\n    except Exception:\n        logger.warning(\n            \"Failed to generate watermark for first page, using no watermark\"\n        )\n        translation_config.watermark_output_mode = WatermarkOutputMode.NoWatermark\n        mono_watermark_first_page_doc_bytes = None\n        dual_watermark_first_page_doc_bytes = None\n\n    Typesetting(translation_config).typsetting_document(docs)\n    logger.debug(f\"finish typsetting from {temp_pdf_path}\")\n    if translation_config.debug:\n        xml_converter.write_json(\n            docs,\n            translation_config.get_working_file_path(\"typsetting.json\"),\n        )\n\n    pdf_creater = PDFCreater(temp_pdf_path, docs, translation_config, mediabox_data)\n    result = pdf_creater.write(translation_config)\n    try:\n        if mono_watermark_first_page_doc_bytes:\n            mono_watermark_pdf = merge_watermark_doc(\n                result.mono_pdf_path,\n                mono_watermark_first_page_doc_bytes,\n                translation_config,\n            )\n            result.mono_pdf_path = mono_watermark_pdf\n    except Exception:\n        result.mono_pdf_path = result.no_watermark_mono_pdf_path\n    try:\n        if dual_watermark_first_page_doc_bytes:\n            dual_watermark_pdf = merge_watermark_doc(\n                result.dual_pdf_path,\n                dual_watermark_first_page_doc_bytes,\n                translation_config,\n            )\n            result.dual_pdf_path = dual_watermark_pdf\n    except Exception:\n        result.dual_pdf_path = result.no_watermark_dual_pdf_path\n\n    result.original_pdf_path = translation_config.input_file\n\n    return result\n\n\ndef generate_first_page_with_watermark(\n    mupdf: Document,\n    translation_config: TranslationConfig,\n    doc_il: il_version_1.Document,\n    mediabox_data: dict[int, Any] | None = None,\n) -> (io.BytesIO, io.BytesIO):\n    first_page_doc = Document()\n    
first_page_doc.insert_pdf(mupdf, from_page=0, to_page=0)\n\n    il_only_first_page_doc = il_version_1.Document()\n    il_only_first_page_doc.total_pages = 1\n    il_only_first_page_doc.page = [copy.deepcopy(doc_il.page[0])]\n\n    watermarked_config = copy.copy(translation_config)\n    watermarked_config.watermark_output_mode = WatermarkOutputMode.Watermarked\n    try:\n        watermarked_config.progress_monitor.disable = True\n        watermarked_temp_pdf_path = watermarked_config.get_working_file_path(\n            \"watermarked_temp_input.pdf\"\n        )\n        safe_save(first_page_doc, watermarked_temp_pdf_path)\n\n        Typesetting(watermarked_config).typsetting_document(il_only_first_page_doc)\n        pdf_creater = PDFCreater(\n            watermarked_temp_pdf_path.as_posix(),\n            il_only_first_page_doc,\n            watermarked_config,\n            mediabox_data,\n        )\n        result = pdf_creater.write(watermarked_config)\n        mono_pdf_bytes = None\n        dual_pdf_bytes = None\n        if result.mono_pdf_path:\n            mono_pdf_bytes = io.BytesIO()\n            with Path(result.mono_pdf_path).open(\"rb\") as f:\n                mono_pdf_bytes.write(f.read())\n            result.mono_pdf_path.unlink()\n            mono_pdf_bytes.seek(0)\n\n        if result.dual_pdf_path:\n            dual_pdf_bytes = io.BytesIO()\n            with Path(result.dual_pdf_path).open(\"rb\") as f:\n                dual_pdf_bytes.write(f.read())\n            result.dual_pdf_path.unlink()\n            dual_pdf_bytes.seek(0)\n\n        return mono_pdf_bytes, dual_pdf_bytes\n    finally:\n        watermarked_config.progress_monitor.disable = False\n\n\ndef merge_watermark_doc(\n    no_watermark_pdf_path: pathlib.PosixPath,\n    watermark_first_page_pdf_bytes: io.BytesIO,\n    translation_config: TranslationConfig,\n) -> pathlib.PosixPath:\n    if not no_watermark_pdf_path.exists():\n        raise FileNotFoundError(\n            
f\"no_watermark_pdf_path not found: {no_watermark_pdf_path}\"\n        )\n    if not watermark_first_page_pdf_bytes:\n        raise FileNotFoundError(\n            f\"watermark_first_page_pdf_bytes not found: {watermark_first_page_pdf_bytes}\"\n        )\n\n    no_watermark_pdf = Document(no_watermark_pdf_path.as_posix())\n    no_watermark_pdf.delete_page(0)\n\n    watermark_first_page_pdf = Document(\"pdf\", watermark_first_page_pdf_bytes)\n    no_watermark_pdf.insert_pdf(\n        watermark_first_page_pdf, from_page=0, to_page=0, start_at=0\n    )\n\n    new_save_path = no_watermark_pdf_path.with_name(\n        no_watermark_pdf_path.name.replace(\".no_watermark\", \"\")\n    )\n\n    PDFCreater.save_pdf_with_timeout(\n        no_watermark_pdf,\n        new_save_path.as_posix(),\n        translation_config=translation_config,\n        clean=not translation_config.skip_clean,\n    )\n    return new_save_path\n\n\ndef download_font_assets():\n    warmup()\n\n\ndef create_cache_folder():\n    try:\n        logger.debug(f\"create cache folder at {CACHE_FOLDER}\")\n        Path(CACHE_FOLDER).mkdir(parents=True, exist_ok=True)\n    except OSError:\n        logger.critical(\n            f\"Failed to create cache folder at {CACHE_FOLDER}\",\n            exc_info=True,\n        )\n        exit(1)\n\n\ndef init():\n    create_cache_folder()\n"
  },
  {
    "path": "babeldoc/format/pdf/pdfinterp.py",
    "content": "import logging\nfrom collections.abc import Sequence\nfrom typing import Any\nfrom typing import cast\n\nimport numpy as np\n\nfrom babeldoc.format.pdf.babelpdf.utils import guarded_bbox\nfrom babeldoc.format.pdf.document_il.frontend.il_creater import ILCreater\nfrom babeldoc.pdfminer import settings\nfrom babeldoc.pdfminer.pdfcolor import PREDEFINED_COLORSPACE\nfrom babeldoc.pdfminer.pdfcolor import PDFColorSpace\nfrom babeldoc.pdfminer.pdfdevice import PDFDevice\nfrom babeldoc.pdfminer.pdfdevice import PDFTextSeq\nfrom babeldoc.pdfminer.pdffont import PDFFont\nfrom babeldoc.pdfminer.pdfinterp import LITERAL_FORM\nfrom babeldoc.pdfminer.pdfinterp import LITERAL_IMAGE\nfrom babeldoc.pdfminer.pdfinterp import Color\nfrom babeldoc.pdfminer.pdfinterp import PDFContentParser\nfrom babeldoc.pdfminer.pdfinterp import PDFInterpreterError\nfrom babeldoc.pdfminer.pdfinterp import PDFPageInterpreter\nfrom babeldoc.pdfminer.pdfinterp import PDFResourceManager\nfrom babeldoc.pdfminer.pdfinterp import PDFStackT\nfrom babeldoc.pdfminer.pdfpage import PDFPage\nfrom babeldoc.pdfminer.pdftypes import LITERALS_ASCII85_DECODE\nfrom babeldoc.pdfminer.pdftypes import PDFObjRef\nfrom babeldoc.pdfminer.pdftypes import PDFStream\nfrom babeldoc.pdfminer.pdftypes import dict_value\nfrom babeldoc.pdfminer.pdftypes import list_value\nfrom babeldoc.pdfminer.pdftypes import resolve1\nfrom babeldoc.pdfminer.pdftypes import stream_value\nfrom babeldoc.pdfminer.psexceptions import PSEOF\nfrom babeldoc.pdfminer.psexceptions import PSTypeError\nfrom babeldoc.pdfminer.psparser import PSKeyword\nfrom babeldoc.pdfminer.psparser import PSLiteral\nfrom babeldoc.pdfminer.psparser import keyword_name\nfrom babeldoc.pdfminer.psparser import literal_name\nfrom babeldoc.pdfminer.utils import MATRIX_IDENTITY\nfrom babeldoc.pdfminer.utils import Matrix\nfrom babeldoc.pdfminer.utils import Rect\nfrom babeldoc.pdfminer.utils import apply_matrix_pt\nfrom babeldoc.pdfminer.utils import 
choplist\nfrom babeldoc.pdfminer.utils import mult_matrix\n\nlog = logging.getLogger(__name__)\n\n\ndef safe_float(o: Any) -> float | None:\n    try:\n        return float(o)\n    except (TypeError, ValueError):\n        return None\n\n\nclass PDFContentParserEx(PDFContentParser):\n    def __init__(self, streams: Sequence[object]) -> None:\n        super().__init__(streams)\n\n    def do_keyword(self, pos: int, token: PSKeyword) -> None:\n        if token is self.KEYWORD_BI:\n            # inline image within a content stream\n            self.start_type(pos, \"inline\")\n        elif token is self.KEYWORD_ID:\n            try:\n                (_, objs) = self.end_type(\"inline\")\n                if len(objs) % 2 != 0:\n                    error_msg = f\"Invalid dictionary construct: {objs!r}\"\n                    raise PSTypeError(error_msg)\n                d = {literal_name(k): resolve1(v) for (k, v) in choplist(2, objs)}\n                eos = b\"EI\"\n                filter_ = d.get(\"F\", None)\n                if filter_:\n                    if isinstance(filter_, PSLiteral):\n                        filter_ = [filter_]\n                    if filter_[0] in LITERALS_ASCII85_DECODE:\n                        eos = b\"~>\"\n                (pos, data) = self.get_inline_data(pos + len(b\"ID \"), target=eos)\n                if eos != b\"EI\":  # it may be necessary for decoding\n                    data += eos\n                obj = PDFStream(d, data)\n                self.push((pos, obj))\n                if eos == b\"EI\":  # otherwise it is still in the stream\n                    self.push((pos, self.KEYWORD_EI))\n            except PSTypeError:\n                if settings.STRICT:\n                    raise\n        else:\n            self.push((pos, token))\n\n\nclass PDFPageInterpreterEx(PDFPageInterpreter):\n    \"\"\"Processor for the content of a PDF page\n\n    Reference: PDF Reference, Appendix A, Operator Summary\n    \"\"\"\n\n    def 
__init__(\n        self,\n        rsrcmgr: PDFResourceManager,\n        device: PDFDevice,\n        obj_patch,\n        il_creater: ILCreater,\n    ) -> None:\n        self.rsrcmgr = rsrcmgr\n        self.device = device\n        self.obj_patch = obj_patch\n        self.il_creater = il_creater\n\n    def dup(self) -> \"PDFPageInterpreterEx\":\n        return self.__class__(\n            self.rsrcmgr,\n            self.device,\n            self.obj_patch,\n            self.il_creater,\n        )\n\n    def init_resources(self, resources: dict[object, object]) -> None:\n        # Override: set fontid and descent\n        \"\"\"Prepare the fonts and XObjects listed in the Resource attribute.\"\"\"\n        self.resources = resources\n        self.fontmap: dict[object, PDFFont] = {}\n        self.fontid: dict[PDFFont, object] = {}\n        self.xobjmap = {}\n        self.csmap: dict[str, PDFColorSpace] = PREDEFINED_COLORSPACE.copy()\n        if not resources:\n            return\n\n        def get_colorspace(spec: object) -> PDFColorSpace | None:\n            if isinstance(spec, list):\n                name = literal_name(spec[0])\n            else:\n                name = literal_name(spec)\n            if name == \"ICCBased\" and isinstance(spec, list) and len(spec) >= 2:\n                val = stream_value(spec[1])\n                if \"N\" in val:\n                    return PDFColorSpace(name, val[\"N\"])\n                elif \"Alternate\" in val:\n                    return PREDEFINED_COLORSPACE[val[\"Alternate\"].name]\n            elif name == \"DeviceN\" and isinstance(spec, list) and len(spec) >= 2:\n                return PDFColorSpace(name, len(list_value(spec[1])))\n            else:\n                return PREDEFINED_COLORSPACE.get(name)\n\n        for k, v in dict_value(resources).items():\n            # log.debug(\"Resource: %r: %r\", k, v)\n            if k == \"Font\":\n                for fontid, spec in dict_value(v).items():\n                    objid = 
None\n                    if isinstance(spec, PDFObjRef):\n                        objid = spec.objid\n                    spec = dict_value(spec)\n                    font = self.rsrcmgr.get_font(objid, spec)\n                    font.xobj_id = objid\n                    self.il_creater.on_page_resource_font(font, objid, fontid)\n                    self.fontmap[fontid] = font\n                    self.fontmap[fontid].descent = 0  # hack fix descent\n                    self.fontid[self.fontmap[fontid]] = fontid\n            elif k == \"ColorSpace\":\n                for csid, spec in dict_value(v).items():\n                    colorspace = get_colorspace(resolve1(spec))\n                    if colorspace is not None:\n                        self.csmap[csid] = colorspace\n            elif k == \"ProcSet\":\n                self.rsrcmgr.get_procset(list_value(v))\n            elif k == \"XObject\":\n                for xobjid, xobjstrm in dict_value(v).items():\n                    self.xobjmap[xobjid] = xobjstrm\n        pass\n\n    def do_CS(self, name: PDFStackT) -> None:\n        \"\"\"Set color space for stroking operations\n\n        Introduced in PDF 1.1\n        \"\"\"\n        try:\n            self.il_creater.on_stroking_color_space(literal_name(name))\n            self.scs = self.csmap[literal_name(name)]\n        except KeyError:\n            if settings.STRICT:\n                raise PDFInterpreterError(f\"Undefined ColorSpace: {name!r}\") from None\n        return\n\n    def do_cs(self, name: PDFStackT) -> None:\n        \"\"\"Set color space for nonstroking operations\"\"\"\n        try:\n            self.il_creater.on_non_stroking_color_space(literal_name(name))\n            self.ncs = self.csmap[literal_name(name)]\n        except KeyError:\n            if settings.STRICT:\n                raise PDFInterpreterError(f\"Undefined ColorSpace: {name!r}\") from None\n        return\n\n    ############################################################\n   
 # Override: return the call arguments (SCN)\n    def do_SCN(self) -> None:\n        \"\"\"Set color for stroking operations.\"\"\"\n        if self.scs:\n            n = self.scs.ncomponents\n        else:\n            if settings.STRICT:\n                raise PDFInterpreterError(\"No colorspace specified!\")\n            n = 1\n        n = len(self.argstack)\n        args = self.pop(n)\n        self.il_creater.on_passthrough_per_char(\"SCN\", args)\n        self.graphicstate.scolor = cast(Color, args)\n        return args\n\n    def do_scn(self) -> None:\n        \"\"\"Set color for nonstroking operations\"\"\"\n        if self.ncs:\n            n = self.ncs.ncomponents\n        else:\n            if settings.STRICT:\n                raise PDFInterpreterError(\"No colorspace specified!\")\n            n = 1\n        n = len(self.argstack)\n        args = self.pop(n)\n        self.il_creater.on_passthrough_per_char(\"scn\", args)\n        self.graphicstate.ncolor = cast(Color, args)\n        return args\n\n    def do_SC(self) -> None:\n        \"\"\"Set color for stroking operations\"\"\"\n        args = self.do_SCN()\n        self.il_creater.remove_latest_passthrough_per_char_instruction()\n        self.il_creater.on_passthrough_per_char(\"SC\", args)\n        return args\n\n    def do_sc(self) -> None:\n        \"\"\"Set color for nonstroking operations\"\"\"\n        args = self.do_scn()\n        self.il_creater.remove_latest_passthrough_per_char_instruction()\n        self.il_creater.on_passthrough_per_char(\"sc\", args)\n        return args\n\n    # Ensure bbox has four numbers, otherwise determine it as an illegal image\n    # For example, some Form's bbox is '[ null -.00487 1.00412 .99393 ]'\n    def do_Do(self, xobjid_arg: PDFStackT) -> None:\n        # Override: set obj_patch for the xobj\n        \"\"\"Invoke named XObject\"\"\"\n        xobjid = literal_name(xobjid_arg)\n        try:\n            xobj = stream_value(self.xobjmap[xobjid])\n        except KeyError:\n            if 
settings.STRICT:\n                raise PDFInterpreterError(f\"Undefined xobject id: {xobjid!r}\") from None\n            return\n        # log.debug(\"Processing xobj: %r\", xobj)\n        subtype = xobj.get(\"Subtype\")\n        if subtype is LITERAL_FORM and \"BBox\" in xobj:\n            interpreter = self.dup()\n\n            # In extremely rare cases, a none might be mixed in the bbox, for example\n            # /BBox [ 0 3.052 null 274.9 157.3 ]\n            bbox = list(\n                filter(lambda x: x is not None, cast(Rect, list_value(xobj[\"BBox\"])))\n            )\n            if len(bbox) < 4:\n                return\n\n            matrix = cast(Matrix, list_value(xobj.get(\"Matrix\", MATRIX_IDENTITY)))\n            # According to PDF reference 1.7 section 4.9.1, XObjects in\n            # earlier PDFs (prior to v1.2) use the page's Resources entry\n            # instead of having their own Resources entry.\n            xobjres = xobj.get(\"Resources\")\n            if xobjres:\n                resources = dict_value(xobjres)\n            else:\n                resources = self.resources.copy()\n\n            self.il_creater.on_xobj_form(\n                self.ctm,\n                self.il_creater.xobj_id,\n                xobj.objid,\n                \"form\",\n                xobjid,\n                bbox,\n                matrix,\n            )\n\n            self.device.begin_figure(xobjid, bbox, matrix)\n            ctm = mult_matrix(matrix, self.ctm)\n            (x, y, x2, y2) = guarded_bbox(bbox)\n            (x, y) = apply_matrix_pt(ctm, (x, y))\n            (x2, y2) = apply_matrix_pt(ctm, (x2, y2))\n            x_id = self.il_creater.on_xobj_begin((x, y, x2, y2), xobj.objid)\n            try:\n                ctm_inv = np.linalg.inv(np.array(ctm[:4]).reshape(2, 2))\n            except Exception:\n                self.il_creater.on_xobj_end(x_id, \" \")\n                return\n            np_version = np.__version__\n            if 
np_version.split(\".\")[0] >= \"2\":\n                pos_inv = -np.asmatrix(ctm[4:]) * ctm_inv\n            else:\n                pos_inv = -np.mat(ctm[4:]) * ctm_inv\n            a, b, c, d = ctm_inv.reshape(4).tolist()\n            e, f = pos_inv.tolist()[0]\n            ops_base = interpreter.render_contents(\n                resources,\n                [xobj],\n                ctm=ctm,\n            )\n            self.ncs = interpreter.ncs\n            self.scs = interpreter.scs\n            self.il_creater.on_xobj_end(\n                x_id,\n                # f\"q {ops_base} Q {a} {b} {c} {d} {e} {f} cm \",\n                f\"{a:.6f} {b:.6f} {c:.6f} {d:.6f} {e:.6f} {f:.6f} cm \",\n            )\n            try:  # sometimes the form's fonts cannot be added and this breaks\n                self.device.fontid = interpreter.fontid\n                self.device.fontmap = interpreter.fontmap\n                ops_new = self.device.end_figure(xobjid)\n                ctm_inv = np.linalg.inv(np.array(ctm[:4]).reshape(2, 2))\n                np_version = np.__version__\n                if np_version.split(\".\")[0] >= \"2\":\n                    pos_inv = -np.asmatrix(ctm[4:]) * ctm_inv\n                else:\n                    pos_inv = -np.mat(ctm[4:]) * ctm_inv\n                a, b, c, d = ctm_inv.reshape(4).tolist()\n                e, f = pos_inv.tolist()[0]\n                self.obj_patch[self.xobjmap[xobjid].objid] = (\n                    f\"q {ops_base}Q {a:.6f} {b:.6f} {c:.6f} {d:.6f} {e:.6f} {f:.6f} cm {ops_new}\"\n                )\n            except Exception:\n                pass\n        elif subtype is LITERAL_IMAGE and \"Width\" in xobj and \"Height\" in xobj:\n            self.il_creater.on_xobj_form(\n                self.ctm,\n                self.il_creater.xobj_id,\n                xobj.objid,\n                \"image\",\n                xobjid,\n                (0, 0, 1, 1),\n                MATRIX_IDENTITY,\n            )\n            self.device.begin_figure(xobjid, (0, 0, 
1, 1), MATRIX_IDENTITY)\n            self.device.render_image(xobjid, xobj)\n            self.device.end_figure(xobjid)\n        else:\n            # unsupported xobject type.\n            pass\n\n    def do_W(self) -> None:\n        \"\"\"Set clipping path using nonzero winding number rule\"\"\"\n        self.handle_w(False)\n\n    def do_W_a(self) -> None:\n        \"\"\"Set clipping path using even-odd rule\"\"\"\n        self.handle_w(True)\n\n    def handle_w(self, evenodd: bool):\n        path = self.curpath\n        self.il_creater.on_pdf_clip_path(path, evenodd, self.ctm)\n\n    def process_page(self, page: PDFPage) -> None:\n        # Override: set obj_patch for the page\n        # log.debug(\"Processing page: %r\", page)\n        # print(page.mediabox,page.cropbox)\n        # (x0, y0, x1, y1) = page.mediabox\n        (x0, y0, x1, y1) = page.cropbox\n        if page.rotate == 90:\n            ctm = (0, -1, 1, 0, -y0, x1)\n        elif page.rotate == 180:\n            ctm = (-1, 0, 0, -1, x1, y1)\n        elif page.rotate == 270:\n            ctm = (0, 1, -1, 0, y1, -x0)\n        else:\n            ctm = (1, 0, 0, 1, -x0, -y0)\n        # ctm_for_ops = copy.copy(ctm)\n        ctm_for_ops = (1, 0, 0, 1, -x0, -y0)\n        ctm = (1, 0, 0, 1, -x0, -y0)\n        if page.rotate == 90 or page.rotate == 270:\n            (x0, y0, x1, y1) = (y0, x1, y1, x0)\n        self.il_creater.on_page_start()\n        self.il_creater.on_page_crop_box(x0, y0, x1, y1)\n        self.device.begin_page(page, ctm)\n        ops_base = self.render_contents(page.resources, page.contents, ctm=ctm)\n        self.device.fontid = self.fontid\n        self.device.fontmap = self.fontmap\n        _ops_new = self.device.end_page(page)\n        # Rendering above subtracts the page offset (based on the cropbox) to get real coordinates; the output here must add the page offset back with cm\n        # self.obj_patch[page.page_xref] = (\n        #     # f\"q {ops_base}Q 1 0 0 1 {x0} {y0} cm {ops_new}\"  # ops_base may contain images; the text in ops_new must be drawn on top of them, so q/Q is used to reset the position matrix\n        #     \"\"\n        # )\n        # 
for obj in page.contents:\n        #     self.obj_patch[obj.objid] = \"\"\n        return f\"q {ops_base} Q {' '.join(f'{x:f}' for x in ctm_for_ops)} cm\"\n        # return f\"q {ops_base} Q 1 0 0 1 {x0} {y0} cm\"\n\n    def render_contents(\n        self,\n        resources: dict[object, object],\n        streams: Sequence[object],\n        ctm: Matrix = MATRIX_IDENTITY,\n    ) -> None:\n        # Override: return the instruction stream\n        \"\"\"Render the content streams.\n\n        This method may be called recursively.\n        \"\"\"\n        # log.debug(\n        #     \"render_contents: resources=%r, streams=%r, ctm=%r\",\n        #     resources,\n        #     streams,\n        #     ctm,\n        # )\n        self.init_resources(resources)\n        self.init_state(ctm)\n        return self.execute(list_value(streams))\n\n    def do_q(self) -> None:\n        \"\"\"Save graphics state\"\"\"\n        self.gstack.append(self.get_current_state())\n        self.il_creater.push_passthrough_per_char_instruction()\n        return\n\n    def do_Q(self) -> None:\n        \"\"\"Restore graphics state\"\"\"\n        if self.gstack:\n            self.set_current_state(self.gstack.pop())\n        self.il_creater.pop_passthrough_per_char_instruction()\n        return\n\n    def do_TJ(self, seq: PDFStackT) -> None:\n        \"\"\"Show text, allowing individual glyph positioning\"\"\"\n        if self.textstate.font is None:\n            if settings.STRICT:\n                raise PDFInterpreterError(\"No font specified!\")\n            return\n        if isinstance(seq, PSLiteral):\n            return\n        assert self.ncs is not None\n        gs = self.graphicstate.copy()\n        gs.passthrough_instruction = (\n            self.il_creater.passthrough_per_char_instruction.copy()\n        )\n        if isinstance(seq, int) or isinstance(seq, float):\n            seq = [seq]\n        self.device.render_string(self.textstate, cast(PDFTextSeq, seq), self.ncs, gs)\n        return\n\n    def 
do_d(self, dash: PDFStackT, phase: PDFStackT) -> None:\n        \"\"\"Set line dash pattern\"\"\"\n        self.graphicstate.dash = (dash, phase)\n        self.il_creater.on_line_dash(dash, phase)\n\n    def do_BI(self) -> None:\n        \"\"\"Begin inline image object\"\"\"\n        self.il_creater.on_inline_image_begin()\n\n    def do_ID(self) -> None:\n        \"\"\"Begin inline image data\"\"\"\n        pass  # Handled by PDFContentParserEx\n\n    def do_EI(self, obj: PDFStackT) -> None:\n        \"\"\"End inline image object\"\"\"\n        if isinstance(obj, PDFStream):\n            self.il_creater.on_inline_image_end(obj, self.ctm)\n\n    # Run PostScript commands\n    # The Do_xxx method is the method for executing corresponding postscript instructions\n    def execute(self, streams: Sequence[object]) -> None:\n        ops = \"\"\n        for stream in streams:\n            self.il_creater.on_new_stream()\n            # Override: return the instruction stream\n            try:\n                parser = PDFContentParserEx([stream])\n            except PSEOF:\n                # empty page\n                return\n            while True:\n                try:\n                    (_, obj) = parser.nextobject()\n                except PSEOF:\n                    break\n                if isinstance(obj, PSKeyword):\n                    name = keyword_name(obj)\n                    act_name = (\n                        name.replace(\"*\", \"_a\").replace('\"', \"_w\").replace(\"'\", \"_q\")\n                    )\n                    method = f\"do_{act_name}\"\n                    if hasattr(self, method):\n                        func = getattr(self, method)\n                        nargs = func.__code__.co_argcount - 1\n                        if nargs:\n                            args = self.pop(nargs)\n                            # log.debug(\"exec: %s %r\", name, args)\n                            if len(args) == nargs:\n                                func(*args)\n                       
         if self.il_creater.is_passthrough_per_char_operation(\n                                    name,\n                                ):\n                                    self.il_creater.on_passthrough_per_char(name, args)\n                                if self.il_creater.is_graphic_operation(name):\n                                    continue\n                                elif name == \"d\":\n                                    arg0 = f\"[{' '.join(f'{arg}' for arg in args[0])}]\"\n                                    arg1 = args[1]\n                                    ops += f\"{arg0} {arg1} {name} \"\n                                elif not (\n                                    name[0] == \"T\"\n                                    or name\n                                    in ['\"', \"'\", \"EI\", \"MP\", \"DP\", \"BMC\", \"BDC\"]\n                                ):  # 过滤 T 系列文字指令，因为 EI 的参数是 obj 所以也需要过滤（只在少数文档中画横线时使用），过滤 marked 系列指令\n                                    p = \" \".join(\n                                        [\n                                            (\n                                                f\"{x:f}\"\n                                                if isinstance(x, float)\n                                                else str(x).replace(\"'\", \"\")\n                                            )\n                                            for x in args\n                                        ],\n                                    )\n                                    ops += f\"{p} {name} \"\n                        else:\n                            # log.debug(\"exec: %s\", name)\n                            targs = func()\n                            if targs is None:\n                                targs = []\n                            if self.il_creater.is_graphic_operation(name):\n                                continue\n                            elif not (name[0] == \"T\" or name in [\"BI\", 
\"ID\", \"EMC\"]):\n                                p = \" \".join(\n                                    [\n                                        (\n                                            f\"{x:f}\"\n                                            if isinstance(x, float)\n                                            else str(x).replace(\"'\", \"\")\n                                        )\n                                        for x in targs\n                                    ],\n                                )\n                                ops += f\"{p} {name} \"\n                    elif settings.STRICT:\n                        error_msg = f\"Unknown operator: {name!r}\"\n                        raise PDFInterpreterError(error_msg)\n                else:\n                    self.push(obj)\n            # print('REV DATA',ops)\n        return ops\n"
  },
  {
    "path": "babeldoc/format/pdf/result_merger.py",
    "content": "import logging\nfrom pathlib import Path\n\nfrom pymupdf import Document\n\nfrom babeldoc.format.pdf.document_il.backend.pdf_creater import PDFCreater\nfrom babeldoc.format.pdf.translation_config import TranslateResult\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\n\nlogger = logging.getLogger(__name__)\n\n\nclass ResultMerger:\n    \"\"\"Handles merging of split translation results\"\"\"\n\n    def __init__(self, translation_config: TranslationConfig):\n        self.config = translation_config\n\n    def merge_results(\n        self, results: dict[int, TranslateResult | None]\n    ) -> TranslateResult:\n        \"\"\"Merge multiple translation results into one\"\"\"\n        if not results:\n            raise ValueError(\"No results to merge\")\n\n        basename = Path(self.config.input_file).stem\n        debug_suffix = \".debug\" if self.config.debug else \"\"\n\n        mono_file_name = f\"{basename}{debug_suffix}.{self.config.lang_out}.mono.pdf\"\n        dual_file_name = f\"{basename}{debug_suffix}.{self.config.lang_out}.dual.pdf\"\n\n        debug_suffix += \".no_watermark\"\n\n        mono_file_name_no_watermark = (\n            f\"{basename}{debug_suffix}.{self.config.lang_out}.mono.pdf\"\n        )\n        dual_file_name_no_watermark = (\n            f\"{basename}{debug_suffix}.{self.config.lang_out}.dual.pdf\"\n        )\n        results = {k: v for k, v in results.items() if v is not None}\n        # Sort results by part index\n        sorted_results = dict(sorted(results.items()))\n        first_result = next(iter(sorted_results.values()))\n\n        # Initialize paths for merged files\n        merged_mono_path = None\n        merged_dual_path = None\n        merged_no_watermark_mono_path = None\n        merged_no_watermark_dual_path = None\n        try:\n            # Merge monolingual PDFs if they exist\n            if (\n                any(r.mono_pdf_path for r in results.values())\n                and not 
self.config.no_mono\n            ):\n                merged_mono_path = self._merge_pdfs(\n                    [\n                        r.mono_pdf_path\n                        for r in sorted_results.values()\n                        if r.mono_pdf_path\n                    ],\n                    mono_file_name,\n                    tag=\"merged_mono\",\n                )\n        except Exception as e:\n            logger.error(f\"Error merging monolingual PDFs: {e}\")\n            merged_mono_path = None\n\n        try:\n            # Merge dual-language PDFs if they exist\n            if (\n                any(r.dual_pdf_path for r in results.values())\n                and not self.config.no_dual\n            ):\n                merged_dual_path = self._merge_pdfs(\n                    [\n                        r.dual_pdf_path\n                        for r in sorted_results.values()\n                        if r.dual_pdf_path\n                    ],\n                    dual_file_name,\n                    tag=\"merged_dual\",\n                )\n        except Exception as e:\n            logger.error(f\"Error merging dual-language PDFs: {e}\")\n            merged_dual_path = None\n\n        if any(\n            r.dual_pdf_path != r.no_watermark_dual_pdf_path\n            or r.mono_pdf_path != r.no_watermark_mono_pdf_path\n            for r in results.values()\n        ):\n            try:\n                # Merge no-watermark PDFs if they exist\n                if (\n                    any(r.no_watermark_mono_pdf_path for r in results.values())\n                    and not self.config.no_mono\n                ):\n                    merged_no_watermark_mono_path = self._merge_pdfs(\n                        [\n                            r.no_watermark_mono_pdf_path\n                            for r in sorted_results.values()\n                            if r.no_watermark_mono_pdf_path\n                        ],\n                        
mono_file_name_no_watermark,\n                        tag=\"merged_no_watermark_mono\",\n                    )\n            except Exception as e:\n                logger.error(f\"Error merging no-watermark monolingual PDFs: {e}\")\n                merged_no_watermark_mono_path = None\n\n            try:\n                if (\n                    any(r.no_watermark_dual_pdf_path for r in results.values())\n                    and not self.config.no_dual\n                ):\n                    merged_no_watermark_dual_path = self._merge_pdfs(\n                        [\n                            r.no_watermark_dual_pdf_path\n                            for r in sorted_results.values()\n                            if r.no_watermark_dual_pdf_path\n                        ],\n                        dual_file_name_no_watermark,\n                        tag=\"merged_no_watermark_dual\",\n                    )\n            except Exception as e:\n                logger.error(f\"Error merging no-watermark dual-language PDFs: {e}\")\n                merged_no_watermark_dual_path = None\n\n        auto_extracted_glossary_path = None\n        if (\n            self.config.save_auto_extracted_glossary\n            and self.config.shared_context_cross_split_part.auto_extracted_glossary\n        ):\n            auto_extracted_glossary_path = self.config.get_output_file_path(\n                f\"{basename}{debug_suffix}.{self.config.lang_out}.glossary.csv\"\n            )\n            with auto_extracted_glossary_path.open(\"w\", encoding=\"utf-8\") as f:\n                logger.info(\n                    f\"save auto extracted glossary to {auto_extracted_glossary_path}\"\n                )\n                f.write(\n                    self.config.shared_context_cross_split_part.auto_extracted_glossary.to_csv()\n                )\n\n        # Create merged result\n        merged_result = TranslateResult(\n            mono_pdf_path=merged_mono_path,\n            dual_pdf_path=merged_dual_path,\n  
          auto_extracted_glossary_path=auto_extracted_glossary_path,\n        )\n        merged_result.no_watermark_mono_pdf_path = merged_no_watermark_mono_path\n        merged_result.no_watermark_dual_pdf_path = merged_no_watermark_dual_path\n\n        if merged_result.no_watermark_mono_pdf_path is None:\n            merged_result.no_watermark_mono_pdf_path = merged_mono_path\n        elif merged_result.mono_pdf_path is None:\n            merged_result.mono_pdf_path = merged_no_watermark_mono_path\n\n        if merged_result.no_watermark_dual_pdf_path is None:\n            merged_result.no_watermark_dual_pdf_path = merged_dual_path\n        elif merged_result.dual_pdf_path is None:\n            merged_result.dual_pdf_path = merged_no_watermark_dual_path\n\n        # Calculate total time\n        total_time = sum(\n            r.total_seconds for r in results.values() if hasattr(r, \"total_seconds\")\n        )\n        merged_result.total_seconds = total_time\n\n        return merged_result\n\n    def _merge_pdfs(\n        self, pdf_paths: list[str | Path], output_name: str, tag: str\n    ) -> Path:\n        \"\"\"Merge multiple PDFs into one\"\"\"\n        if not pdf_paths:\n            return None\n\n        output_path = self.config.get_output_file_path(output_name)\n        merged_doc = Document()\n\n        for pdf_path in pdf_paths:\n            doc = Document(str(pdf_path))\n            merged_doc.insert_pdf(doc)\n\n        merged_doc = PDFCreater.subset_fonts_in_subprocess(\n            merged_doc, self.config, tag=tag\n        )\n        PDFCreater.save_pdf_with_timeout(\n            merged_doc, str(output_path), translation_config=self.config\n        )\n\n        return output_path\n"
  },
  {
    "path": "babeldoc/format/pdf/split_manager.py",
    "content": "import logging\nfrom dataclasses import dataclass\n\nlogger = logging.getLogger(__name__)\n\n\n@dataclass\nclass SplitPoint:\n    \"\"\"Represents a point where the document should be split\"\"\"\n\n    start_page: int\n    end_page: int\n    estimated_complexity: float = 1.0\n    chapter_title: str | None = None\n\n\nclass BaseSplitStrategy:\n    \"\"\"Base class for split strategies\"\"\"\n\n    def determine_split_points(self, config) -> list[SplitPoint]:\n        raise NotImplementedError\n\n\nclass PageCountStrategy(BaseSplitStrategy):\n    \"\"\"Split document based on page count\"\"\"\n\n    def __init__(self, max_pages_per_part: int = 20):\n        self.max_pages_per_part = max_pages_per_part\n\n    def determine_split_points(self, config) -> list[SplitPoint]:\n        from pymupdf import Document\n\n        doc = Document(str(config.input_file))\n        total_pages = doc.page_count\n\n        split_points = []\n        current_page = 0\n\n        while current_page < total_pages:\n            end_page = min(current_page + self.max_pages_per_part, total_pages)\n            split_points.append(\n                SplitPoint(\n                    start_page=current_page,\n                    end_page=end_page - 1,  # end_page is inclusive\n                )\n            )\n            current_page = end_page\n\n        return split_points\n\n\nclass SplitManager:\n    \"\"\"Manages document splitting process\"\"\"\n\n    def __init__(self, config=None):\n        self.strategy = config.split_strategy\n\n    def determine_split_points(self, config) -> list[SplitPoint]:\n        \"\"\"Determine where to split the document\"\"\"\n        return self.strategy.determine_split_points(config)\n\n    def estimate_part_complexity(self, split_point: SplitPoint) -> float:\n        \"\"\"Estimate the complexity of a document part\"\"\"\n        # Simple estimation based on page count for now\n        return (\n            split_point.end_page - 
split_point.start_page + 1\n        ) * split_point.estimated_complexity\n"
  },
  {
    "path": "babeldoc/format/pdf/translation_config.py",
    "content": "import enum\nimport logging\nimport shutil\nimport tempfile\nimport threading\nfrom collections import Counter\nfrom pathlib import Path\n\nfrom babeldoc.const import CACHE_FOLDER\nfrom babeldoc.format.pdf.split_manager import BaseSplitStrategy\nfrom babeldoc.format.pdf.split_manager import PageCountStrategy\nfrom babeldoc.glossary import Glossary\nfrom babeldoc.glossary import GlossaryEntry\nfrom babeldoc.progress_monitor import ProgressMonitor\nfrom babeldoc.translator.translator import BaseTranslator\n\nlogger = logging.getLogger(__name__)\n\n\nclass WatermarkOutputMode(enum.Enum):\n    Watermarked = \"watermarked\"\n    NoWatermark = \"no_watermark\"\n    Both = \"both\"\n\n\nclass SharedContextCrossSplitPart:\n    def __init__(self):\n        self.first_paragraph = None\n        self.recent_title_paragraph = None\n        self._lock = threading.Lock()\n        self.user_glossaries: list[Glossary] = []\n        self.auto_extracted_glossary: Glossary | None = None\n        self.raw_extracted_terms: list[tuple[str, str]] = []\n        self.auto_enabled_ocr_workaround = False\n        # Statistics for valid characters/text across the whole file\n        self.valid_char_count_total: int = 0\n        self.total_valid_text_token_count: int = 0\n\n    def initialize_glossaries(self, initial_glossaries: list[Glossary] | None):\n        with self._lock:\n            self.user_glossaries = (\n                list(initial_glossaries) if initial_glossaries else []\n            )\n            self.auto_extracted_glossary = None\n            self.raw_extracted_terms = []\n            self.unique_name = self._generate_unique_auto_glossary_name()\n            self.norm_terms = set()\n            for g in self.user_glossaries:\n                for entity in g.normalized_lookup:\n                    self.norm_terms.add(entity)\n            # reset statistics buffer when initializing\n            self.valid_char_count_total = 0\n            
self.total_valid_text_token_count = 0\n\n    def add_raw_extracted_term_pair(self, src: str, tgt: str):\n        with self._lock:\n            self.raw_extracted_terms.append((src, tgt))\n\n    def _generate_unique_auto_glossary_name(self) -> str:\n        base_name = \"auto_extracted_glossary\"\n        current_name = base_name\n        suffix = 0\n        existing_names = {g.name for g in self.user_glossaries}\n        if (\n            self.auto_extracted_glossary\n            and self.auto_extracted_glossary.name == current_name\n        ):\n            pass\n\n        while current_name in existing_names:\n            suffix += 1\n            current_name = f\"{base_name}#{suffix}\"\n        return current_name\n\n    def contains_term(self, term: str) -> bool:\n        with self._lock:\n            try:\n                return term in self.norm_terms\n            except Exception:\n                return False\n\n    def finalize_auto_extracted_glossary(self):\n        with self._lock:\n            self.auto_extracted_glossary = None\n\n            if not self.raw_extracted_terms:\n                self.raw_extracted_terms = []\n                return\n\n            term_translations: dict[str, list[str]] = {}\n            for src, tgt in self.raw_extracted_terms:\n                term_translations.setdefault(src, []).append(tgt)\n\n            final_entries: list[GlossaryEntry] = []\n            for src, tgts in term_translations.items():\n                if not tgts:\n                    continue\n                most_common_tgt = Counter(tgts).most_common(1)[0][0]\n                final_entries.append(GlossaryEntry(src, most_common_tgt))\n\n            if final_entries:\n                self.auto_extracted_glossary = Glossary(\n                    name=self.unique_name, entries=final_entries\n                )\n\n    def get_glossaries(self) -> list[Glossary]:\n        with self._lock:\n            all_glossaries = list(self.user_glossaries)\n            if 
self.auto_extracted_glossary:\n                all_glossaries.append(self.auto_extracted_glossary)\n            return all_glossaries\n\n    def get_glossaries_for_translation(\n        self, auto_extract_enabled: bool\n    ) -> list[Glossary]:\n        with self._lock:\n            if auto_extract_enabled and self.auto_extracted_glossary:\n                return [self.auto_extracted_glossary]\n            else:\n                all_glossaries = list(self.user_glossaries)\n                if self.auto_extracted_glossary:\n                    all_glossaries.append(self.auto_extracted_glossary)\n                return all_glossaries\n\n    def add_valid_counts(self, char_count: int, token_count: int):\n        \"\"\"Accumulate valid character and token counts in a threadsafe way.\"\"\"\n        if char_count <= 0 and token_count <= 0:\n            return\n        with self._lock:\n            if char_count > 0:\n                self.valid_char_count_total += char_count\n            if token_count > 0:\n                self.total_valid_text_token_count += token_count\n\n\nclass TranslationConfig:\n    @staticmethod\n    def create_max_pages_per_part_split_strategy(max_pages_per_part: int):\n        return PageCountStrategy(max_pages_per_part)\n\n    # for backward compatibility,\n    # new parameters should be added at the end of the function.\n    def __init__(\n        self,\n        translator: BaseTranslator,\n        input_file: str | Path,\n        lang_in: str,\n        lang_out: str,\n        doc_layout_model,  # DocLayoutModel\n        # for backward compatibility\n        font: str | Path | None = None,\n        pages: str | None = None,\n        output_dir: str | Path | None = None,\n        debug: bool = False,\n        working_dir: str | Path | None = None,\n        no_dual: bool = False,\n        no_mono: bool = False,\n        formular_font_pattern: str | None = None,\n        formular_char_pattern: str | None = None,\n        qps: int = 1,\n        
split_short_lines: bool = False,\n        short_line_split_factor: float = 0.8,\n        use_rich_pbar: bool = True,\n        progress_monitor: ProgressMonitor | None = None,\n        skip_clean: bool = False,\n        dual_translate_first: bool = False,\n        disable_rich_text_translate: bool = False,\n        enhance_compatibility: bool = False,\n        report_interval: float = 0.1,\n        min_text_length: int = 5,\n        use_side_by_side_dual: bool = True,  # Deprecated: whether to produce a side-by-side dual-language PDF (original and translation shown next to each other). Backward-compatibility option, no longer in use.\n        use_alternating_pages_dual: bool = False,\n        watermark_output_mode: WatermarkOutputMode = WatermarkOutputMode.Watermarked,\n        # Add split-related parameters\n        split_strategy: BaseSplitStrategy | None = None,\n        table_model=None,\n        show_char_box: bool = False,\n        skip_scanned_detection: bool = False,\n        ocr_workaround: bool = False,\n        custom_system_prompt: str | None = None,\n        add_formula_placehold_hint: bool = False,\n        glossaries: list[Glossary] | None = None,\n        pool_max_workers: int | None = None,\n        auto_extract_glossary: bool = True,\n        auto_enable_ocr_workaround: bool = False,\n        primary_font_family: str | None = None,\n        only_include_translated_page: bool | None = False,\n        save_auto_extracted_glossary: bool = True,\n        enable_graphic_element_process: bool = True,\n        merge_alternating_line_numbers: bool = True,\n        skip_translation: bool = False,\n        skip_form_render: bool = False,\n        skip_curve_render: bool = False,\n        only_parse_generate_pdf: bool = False,\n        remove_non_formula_lines: bool = False,\n        non_formula_line_iou_threshold: float = 0.9,\n        figure_table_protection_threshold: float = 0.9,\n        skip_formula_offset_calculation: bool = False,\n        term_extraction_translator: BaseTranslator | None = None,\n        metadata_extra_data: str | None = None,\n        
term_pool_max_workers: int | None = None,\n        disable_same_text_fallback: bool = False,\n    ):\n        self.translator = translator\n        self.term_extraction_translator = term_extraction_translator or translator\n        initial_user_glossaries = list(glossaries) if glossaries else []\n\n        self.input_file = input_file\n        self.lang_in = lang_in\n        self.lang_out = lang_out\n        # just ignore font\n        self.font = None\n\n        self.pages = pages\n        self.page_ranges = self.parse_pages(pages) if pages else None\n        self.debug = debug\n        self.watermark_output_mode = watermark_output_mode\n\n        self.output_dir = output_dir\n        self.working_dir = working_dir\n        self.no_dual = no_dual\n        self.no_mono = no_mono\n\n        self.formular_font_pattern = formular_font_pattern\n        self.formular_char_pattern = formular_char_pattern\n        self.qps = qps\n        # Set pool_max_workers with default value from qps\n        self.pool_max_workers = (\n            pool_max_workers if pool_max_workers is not None else qps\n        )\n        # Set term_pool_max_workers for automatic term extraction.\n        # If not provided, default to pool_max_workers.\n        self.term_pool_max_workers = (\n            term_pool_max_workers\n            if term_pool_max_workers is not None\n            else self.pool_max_workers\n        )\n        self.split_short_lines = split_short_lines\n\n        self.short_line_split_factor = short_line_split_factor\n        self.use_rich_pbar = use_rich_pbar\n        self.progress_monitor = progress_monitor\n        self.doc_layout_model = doc_layout_model\n\n        self.skip_clean = skip_clean or enhance_compatibility\n        self.skip_scanned_detection = skip_scanned_detection\n\n        self.dual_translate_first = dual_translate_first or enhance_compatibility\n        self.disable_rich_text_translate = (\n            disable_rich_text_translate or 
enhance_compatibility\n        )\n\n        self.report_interval = report_interval\n        self.min_text_length = min_text_length\n        self.use_alternating_pages_dual = use_alternating_pages_dual\n        self.ocr_workaround = ocr_workaround\n        self.merge_alternating_line_numbers = merge_alternating_line_numbers\n\n        if self.ocr_workaround:\n            self.skip_scanned_detection = True\n            self.disable_rich_text_translate = True\n\n        # for backward compatibility\n        if use_side_by_side_dual is False and use_alternating_pages_dual is False:\n            self.use_alternating_pages_dual = True\n\n        if progress_monitor and progress_monitor.cancel_event is None:\n            progress_monitor.cancel_event = threading.Event()\n\n        if working_dir is None:\n            if debug:\n                working_dir = Path(CACHE_FOLDER) / \"working\" / Path(input_file).stem\n                self._is_temp_dir = False\n            else:\n                working_dir = tempfile.mkdtemp()\n                self._is_temp_dir = True\n        else:\n            working_dir = Path(working_dir) / Path(input_file).stem\n            self._is_temp_dir = False\n\n        self.working_dir = working_dir\n\n        Path(working_dir).mkdir(parents=True, exist_ok=True)\n\n        if output_dir is None:\n            output_dir = Path.cwd()\n        self.output_dir = output_dir\n\n        Path(output_dir).mkdir(parents=True, exist_ok=True)\n\n        if not doc_layout_model:\n            from babeldoc.docvision.doclayout import DocLayoutModel\n\n            doc_layout_model = DocLayoutModel.load_available()\n        self.doc_layout_model = doc_layout_model\n\n        self.shared_context_cross_split_part = SharedContextCrossSplitPart()\n        self.shared_context_cross_split_part.initialize_glossaries(\n            initial_user_glossaries\n        )\n\n        # Initialize split-related attributes\n        self.split_strategy = split_strategy\n\n        
# Create a unique working directory for each part\n        self._part_working_dirs: dict[int, Path] = {}\n        self._part_output_dirs: dict[int, Path] = {}\n\n        self.table_model = table_model\n        self.show_char_box = show_char_box\n        self.custom_system_prompt = custom_system_prompt\n        self.add_formula_placehold_hint = add_formula_placehold_hint\n        self.auto_extract_glossary = auto_extract_glossary\n        self.auto_enable_ocr_workaround = auto_enable_ocr_workaround\n        self.skip_translation = skip_translation\n        self.only_parse_generate_pdf = only_parse_generate_pdf\n\n        if self.skip_translation or self.only_parse_generate_pdf:\n            self.auto_extract_glossary = False\n\n        if auto_enable_ocr_workaround:\n            self.ocr_workaround = False\n            self.skip_scanned_detection = False\n\n        assert primary_font_family in [\n            None,\n            \"serif\",\n            \"sans-serif\",\n            \"script\",\n        ]\n        self.primary_font_family = primary_font_family\n\n        if only_include_translated_page is None:\n            only_include_translated_page = False\n\n        self.only_include_translated_page = only_include_translated_page\n\n        self.save_auto_extracted_glossary = save_auto_extracted_glossary\n\n        # force disable table translate until the new model is ready\n        self.table_model = None\n        self.enable_graphic_element_process = enable_graphic_element_process\n        self.skip_form_render = skip_form_render\n        self.skip_curve_render = skip_curve_render\n        self.remove_non_formula_lines = remove_non_formula_lines\n        self.non_formula_line_iou_threshold = non_formula_line_iou_threshold\n        self.figure_table_protection_threshold = figure_table_protection_threshold\n        self.skip_formula_offset_calculation = skip_formula_offset_calculation\n\n        self.metadata_extra_data = metadata_extra_data\n\n        
self.term_extraction_token_usage: dict[str, int] = {\n            \"total_tokens\": 0,\n            \"prompt_tokens\": 0,\n            \"completion_tokens\": 0,\n            \"cache_hit_prompt_tokens\": 0,\n        }\n        self.disable_same_text_fallback = disable_same_text_fallback\n\n        if self.ocr_workaround:\n            self.remove_non_formula_lines = False\n\n    def parse_pages(self, pages_str: str | None) -> list[tuple[int, int]] | None:\n        \"\"\"Parse a page string and return a list of page ranges.\n\n        Args:\n            pages_str: A page string such as \"1-,2,-3,4\".\n\n        Returns:\n            A list of (start, end) tuples, where an end of -1 means no upper limit.\n        \"\"\"\n        if not pages_str:\n            return None\n\n        ranges: list[tuple[int, int]] = []\n        for part in pages_str.split(\",\"):\n            part = part.strip()\n            if \"-\" in part:\n                start, end = part.split(\"-\")\n                start_as_int = int(start) if start else 1\n                end_as_int = int(end) if end else -1\n                ranges.append((start_as_int, end_as_int))\n            else:\n                page = int(part)\n                ranges.append((page, page))\n        return ranges\n\n    def should_translate_page(self, page_number: int) -> bool:\n        \"\"\"Decide whether the given page should be translated.\n        Args:\n            page_number: The page number.\n        Returns:\n            Whether the page should be translated.\n        \"\"\"\n        if isinstance(self.page_ranges, list) and len(self.page_ranges) == 0:\n            return False\n        if not self.page_ranges:\n            return True\n\n        for start, end in self.page_ranges:\n            if start <= page_number and (end == -1 or page_number <= end):\n                return True\n        return False\n\n    def get_output_file_path(self, filename: str) -> Path:\n        return Path(self.output_dir) / filename\n\n    def get_working_file_path(self, filename: str) -> Path:\n        return Path(self.working_dir) / filename\n\n    def get_part_working_dir(self, part_index: int) -> 
Path:\n        \"\"\"Get working directory for a specific part\"\"\"\n        if part_index not in self._part_working_dirs:\n            if self.working_dir:\n                part_dir = Path(self.working_dir) / f\"part_{part_index}\"\n            else:\n                part_dir = Path(tempfile.mkdtemp()) / f\"part_{part_index}\"\n            part_dir.mkdir(parents=True, exist_ok=True)\n            self._part_working_dirs[part_index] = part_dir\n        return self._part_working_dirs[part_index]\n\n    def get_part_output_dir(self, part_index: int) -> Path:\n        \"\"\"Get output directory for a specific part\"\"\"\n        if part_index not in self._part_output_dirs:\n            part_dir = Path(self.working_dir) / f\"part_{part_index}_output\"\n            part_dir.mkdir(parents=True, exist_ok=True)\n            self._part_output_dirs[part_index] = part_dir\n        return self._part_output_dirs[part_index]\n\n    def cleanup_part_output_dir(self, part_index: int):\n        \"\"\"Clean up output directory for a specific part\"\"\"\n        if part_index in self._part_output_dirs:\n            part_dir = self._part_output_dirs[part_index]\n            if part_dir.exists():\n                shutil.rmtree(part_dir)\n            del self._part_output_dirs[part_index]\n\n    def cleanup_part_working_dir(self, part_index: int):\n        \"\"\"Clean up working directory for a specific part\"\"\"\n        if part_index in self._part_working_dirs:\n            part_dir = self._part_working_dirs[part_index]\n            if part_dir.exists():\n                shutil.rmtree(part_dir, ignore_errors=True)\n            del self._part_working_dirs[part_index]\n\n    def cleanup_temp_files(self):\n        \"\"\"Clean up all temporary files including part working directories\"\"\"\n        try:\n            for part_index in list(self._part_working_dirs.keys()):\n                self.cleanup_part_working_dir(part_index)\n            if self._is_temp_dir:\n                
logger.info(f\"cleanup temp files: {self.working_dir}\")\n                shutil.rmtree(self.working_dir, ignore_errors=True)\n        except Exception:\n            logger.exception(\"Error cleaning up temporary files\")\n\n    def raise_if_cancelled(self):\n        if self.progress_monitor is not None:\n            self.progress_monitor.raise_if_cancelled()\n\n    def cancel_translation(self):\n        if self.progress_monitor is not None:\n            self.progress_monitor.cancel()\n\n    def get_term_extraction_translator(self) -> BaseTranslator:\n        \"\"\"Return the translator to use for automatic term extraction.\"\"\"\n        return self.term_extraction_translator\n\n    def record_term_extraction_usage(\n        self,\n        total_tokens: int,\n        prompt_tokens: int,\n        completion_tokens: int,\n        cache_hit_prompt_tokens: int,\n    ) -> None:\n        \"\"\"Accumulate token usage for automatic term extraction.\"\"\"\n        if total_tokens > 0:\n            self.term_extraction_token_usage[\"total_tokens\"] += total_tokens\n        if prompt_tokens > 0:\n            self.term_extraction_token_usage[\"prompt_tokens\"] += prompt_tokens\n        if completion_tokens > 0:\n            self.term_extraction_token_usage[\"completion_tokens\"] += completion_tokens\n        if cache_hit_prompt_tokens > 0:\n            self.term_extraction_token_usage[\"cache_hit_prompt_tokens\"] += (\n                cache_hit_prompt_tokens\n            )\n\n\nclass TranslateResult:\n    original_pdf_path: str\n    total_seconds: float\n    mono_pdf_path: Path | None\n    dual_pdf_path: Path | None\n    no_watermark_mono_pdf_path: Path | None\n    no_watermark_dual_pdf_path: Path | None\n    peak_memory_usage: int | None\n    auto_extracted_glossary_path: Path | None\n    total_valid_character_count: int | None\n    total_valid_text_token_count: int | None\n\n    def __init__(\n        self,\n        mono_pdf_path: Path | None,\n        dual_pdf_path: Path | 
None,\n        auto_extracted_glossary_path: Path | None = None,\n    ):\n        self.mono_pdf_path = mono_pdf_path\n        self.dual_pdf_path = dual_pdf_path\n\n        # For compatibility considerations, if only a non-watermarked PDF is generated,\n        # the values of mono_pdf_path and no_watermark_mono_pdf_path are the same.\n        self.no_watermark_mono_pdf_path = mono_pdf_path\n        self.no_watermark_dual_pdf_path = dual_pdf_path\n\n        self.auto_extracted_glossary_path = auto_extracted_glossary_path\n        self.total_valid_character_count = None\n        self.total_valid_text_token_count = None\n\n    def __str__(self):\n        \"\"\"Return a human-readable string representation of the translation result.\"\"\"\n        result = []\n        if hasattr(self, \"original_pdf_path\") and self.original_pdf_path:\n            result.append(f\"\\tOriginal PDF: {self.original_pdf_path}\")\n\n        if hasattr(self, \"total_seconds\") and self.total_seconds:\n            result.append(f\"\\tTotal time: {self.total_seconds:.2f} seconds\")\n\n        if self.mono_pdf_path:\n            result.append(f\"\\tMonolingual PDF: {self.mono_pdf_path}\")\n\n        if self.dual_pdf_path:\n            result.append(f\"\\tDual-language PDF: {self.dual_pdf_path}\")\n\n        if (\n            hasattr(self, \"no_watermark_mono_pdf_path\")\n            and self.no_watermark_mono_pdf_path\n            and self.no_watermark_mono_pdf_path != self.mono_pdf_path\n        ):\n            result.append(\n                f\"\\tNo-watermark Monolingual PDF: {self.no_watermark_mono_pdf_path}\"\n            )\n\n        if (\n            hasattr(self, \"no_watermark_dual_pdf_path\")\n            and self.no_watermark_dual_pdf_path\n            and self.no_watermark_dual_pdf_path != self.dual_pdf_path\n        ):\n            result.append(\n                f\"\\tNo-watermark Dual-language PDF: {self.no_watermark_dual_pdf_path}\"\n            )\n\n        if (\n            
hasattr(self, \"auto_extracted_glossary_path\")\n            and self.auto_extracted_glossary_path\n        ):\n            result.append(\n                f\"\\tAuto-extracted glossary: {self.auto_extracted_glossary_path}\"\n            )\n\n        if hasattr(self, \"peak_memory_usage\") and self.peak_memory_usage:\n            result.append(f\"\\tPeak memory usage: {self.peak_memory_usage} MB\")\n\n        if hasattr(self, \"total_valid_character_count\") and isinstance(\n            self.total_valid_character_count, int\n        ):\n            result.append(\n                f\"\\tTotal valid character count: {self.total_valid_character_count}\"\n            )\n\n        if hasattr(self, \"total_valid_text_token_count\") and isinstance(\n            self.total_valid_text_token_count, int\n        ):\n            result.append(\n                f\"\\tTotal valid text token count (gpt-4o): {self.total_valid_text_token_count}\"\n            )\n\n        if result:\n            result.insert(0, \"Translation results:\")\n\n        return \"\\n\".join(result) if result else \"No translation results available\"\n"
  },
  {
    "path": "babeldoc/glossary.py",
    "content": "import csv\nimport io\nimport itertools\nimport logging\nimport re\nimport time\nfrom pathlib import Path\n\nimport chardet\nimport hyperscan\nimport regex\n\nlogger = logging.getLogger(__name__)\n\n\nclass GlossaryEntry:\n    def __init__(self, source: str, target: str, target_language: str | None = None):\n        self.source = source\n        self.target = target\n        self.target_language = target_language\n\n    def __repr__(self):\n        return f\"GlossaryEntry(source='{self.source}', target='{self.target}', target_language='{self.target_language}')\"\n\n\ndef batched(iterable, n, *, strict=False):\n    # batched('ABCDEFG', 3) → ABC DEF G\n    if n < 1:\n        raise ValueError(\"n must be at least one\")\n    iterator = iter(iterable)\n    while batch := tuple(itertools.islice(iterator, n)):\n        if strict and len(batch) != n:\n            raise ValueError(\"batched(): incomplete batch\")\n        yield batch\n\n\nTERM_NORM_PATTERN = re.compile(r\"\\s+\", regex.UNICODE)\n\n\nclass Glossary:\n    def __init__(self, name: str, entries: list[GlossaryEntry]):\n        self.name = name\n\n        # Deduplicate entries based on normalized source\n        unique_entries = []\n        seen_normalized_sources = set()\n        for entry in entries:\n            normalized_source = self.normalize_source(entry.source)\n            if normalized_source not in seen_normalized_sources:\n                unique_entries.append(entry)\n                seen_normalized_sources.add(normalized_source)\n        self.entries = unique_entries\n\n        self.normalized_lookup: dict[str, tuple[str, str]] = {}\n        self.id_lookup: list[tuple[str, str]] = []\n        self.hs_dbs: list[hyperscan.Database] | None = None\n        self._build_regex_and_lookup()\n\n    @staticmethod\n    def normalize_source(source_term: str) -> str:\n        \"\"\"Normalizes a source term by lowercasing and standardizing whitespace.\"\"\"\n        term = source_term.lower()\n   
     term = TERM_NORM_PATTERN.sub(\n            \" \", term\n        )  # Replace multiple whitespace with single space\n        return term.strip()\n\n    def _build_regex_and_lookup(self):\n        logger.debug(\n            f\"start build regex for glossary {self.name} with {len(self.entries)} entries\"\n        )\n        \"\"\"\n        Builds a combined regex for all source terms and a lookup dictionary\n        from normalized source terms to (original_source, original_target).\n        Regex patterns are sorted by length in descending order to prioritize longer matches.\n        \"\"\"\n        self.normalized_lookup = {}\n\n        if not self.entries:\n            self.source_terms_regex = None\n            return\n\n        self.hs_dbs = []\n        hs_pattern = []\n        start = time.time()\n        for idx, entry in enumerate(self.entries):\n            normalized_key = self.normalize_source(entry.source)\n            self.normalized_lookup[normalized_key] = (entry.source, entry.target)\n            self.id_lookup.append((entry.source, entry.target))\n\n            hs_pattern.append((re.escape(entry.source).encode(\"utf-8\"), idx))\n\n        chunk_size = 20000\n        for i, pattern_chunk in enumerate(\n            batched(hs_pattern, chunk_size, strict=False)\n        ):\n            logger.debug(\n                f\"building hs_db chunk {i + 1} / {len(self.entries) // chunk_size + 1}\"\n            )\n            expressions, ids = zip(*pattern_chunk, strict=False)\n\n            hs_db = hyperscan.Database()\n            hs_db.compile(\n                expressions=expressions,\n                ids=ids,\n                elements=len(pattern_chunk),\n                flags=hyperscan.HS_FLAG_CASELESS | hyperscan.HS_FLAG_SINGLEMATCH,\n                # | hyperscan.HS_FLAG_UTF8\n                # | hyperscan.HS_FLAG_UCP,\n            )\n            self.hs_dbs.append(hs_db)\n\n        end = time.time()\n        logger.debug(\n            f\"finished 
building regex for glossary {self.name} in {end - start:.2f} seconds\"\n        )\n        logger.debug(\n            f\"build hs database for glossary {self.name} with {len(self.entries)} entries, hs_info: {self.hs_dbs[0].info()}\"\n        )\n        if not self.hs_dbs:\n            self.hs_dbs = None\n\n    @classmethod\n    def from_csv(cls, file_path: Path, target_lang_out: str) -> \"Glossary\":\n        \"\"\"\n        Loads glossary entries from a CSV file.\n        CSV format: source,target,tgt_lng (tgt_lng is optional)\n        Filters entries based on tgt_lng matching target_lang_out.\n        The glossary name is derived from the CSV filename.\n        \"\"\"\n        glossary_name = file_path.stem\n        loaded_entries: list[GlossaryEntry] = []\n\n        # Normalize target_lang_out once for comparison\n        normalized_target_lang_out = target_lang_out.lower().replace(\"-\", \"_\")\n\n        try:\n            with file_path.open(\"rb\") as f:\n                content = f.read()\n                encoding = chardet.detect(content)[\"encoding\"]\n                buffer = io.StringIO(content.decode(encoding))\n                reader = csv.DictReader(buffer, doublequote=True)\n                if not all(col in reader.fieldnames for col in [\"source\", \"target\"]):\n                    raise ValueError(\n                        f\"CSV file {file_path} must contain 'source' and 'target' columns.\"\n                    )\n\n                for row in reader:\n                    source = row[\"source\"]\n                    target = row[\"target\"]\n                    tgt_lng = row.get(\"tgt_lng\", None)  # Handle optional tgt_lng\n\n                    if tgt_lng and tgt_lng.strip():\n                        normalized_entry_tgt_lng = (\n                            tgt_lng.strip().lower().replace(\"-\", \"_\")\n                        )\n                        if normalized_entry_tgt_lng != normalized_target_lang_out:\n                            
continue  # Skip if language doesn't match\n\n                    loaded_entries.append(GlossaryEntry(source, target, tgt_lng))\n        except FileNotFoundError:\n            # Or handle as per your project's error strategy, e.g., log and return empty Glossary\n            raise\n        except Exception as e:\n            # Or handle as per your project's error strategy\n            raise ValueError(\n                f\"Error reading or parsing CSV file {file_path}: {e}\"\n            ) from e\n\n        return cls(name=glossary_name, entries=loaded_entries)\n\n    def to_csv(self) -> str:\n        \"\"\"Exports the glossary entries to a CSV formatted string.\"\"\"\n        dict_data = [\n            {\n                \"source\": x.source,\n                \"target\": x.target,\n                \"tgt_lng\": x.target_language if x.target_language else \"\",\n            }\n            for x in self.entries\n        ]\n        buffer = io.StringIO()\n        dict_writer = csv.DictWriter(\n            buffer, fieldnames=[\"source\", \"target\", \"tgt_lng\"], doublequote=True\n        )\n        dict_writer.writeheader()\n        dict_writer.writerows(dict_data)\n        return buffer.getvalue()\n\n    def __repr__(self):\n        return f\"Glossary(name='{self.name}', num_entries={len(self.entries)})\"\n\n    def get_active_entries_for_text(self, text: str) -> list[tuple[str, str]]:\n        \"\"\"Returns a list of (original_source, target_text) tuples for terms found in the given text.\"\"\"\n        if not self.hs_dbs or not text:\n            return []\n\n        text = TERM_NORM_PATTERN.sub(\" \", text)  # Normalize whitespace in the text\n        if not text:\n            return []\n\n        active_entries = []\n\n        def on_match(\n            idx: int, _from: int, _to: int, _flags: int, _context=None\n        ) -> bool | None:\n            active_entries.append(self.id_lookup[idx])\n            return False\n\n        for hs_db in self.hs_dbs:\n         
   # Scan the text with the hyperscan database\n            scratch = hyperscan.Scratch(hs_db)\n            hs_db.scan(text.encode(\"utf-8\"), on_match, scratch=scratch)\n        return active_entries\n"
  },
  {
    "path": "babeldoc/main.py",
    "content": "import asyncio\nimport logging\nimport multiprocessing as mp\nimport queue\nimport random\nimport sys\nfrom pathlib import Path\nfrom typing import Any\n\nimport configargparse\nimport tqdm\nfrom rich.progress import BarColumn\nfrom rich.progress import MofNCompleteColumn\nfrom rich.progress import Progress\nfrom rich.progress import TextColumn\nfrom rich.progress import TimeElapsedColumn\nfrom rich.progress import TimeRemainingColumn\n\nimport babeldoc.assets.assets\nimport babeldoc.format.pdf.high_level\nfrom babeldoc.const import enable_process_pool\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\nfrom babeldoc.format.pdf.translation_config import WatermarkOutputMode\nfrom babeldoc.glossary import Glossary\nfrom babeldoc.translator.translator import OpenAITranslator\nfrom babeldoc.translator.translator import set_translate_rate_limiter\n\nlogger = logging.getLogger(__name__)\n__version__ = \"0.5.23\"\n\n\ndef create_parser():\n    parser = configargparse.ArgParser(\n        config_file_parser_class=configargparse.TomlConfigParser([\"babeldoc\"]),\n    )\n    parser.add_argument(\n        \"-c\",\n        \"--config\",\n        is_config_file=True,\n        help=\"config file path\",\n    )\n    parser.add_argument(\n        \"--version\",\n        action=\"version\",\n        version=f\"%(prog)s {__version__}\",\n    )\n    parser.add_argument(\n        \"--files\",\n        action=\"append\",\n        help=\"One or more paths to PDF files.\",\n    )\n    parser.add_argument(\n        \"--debug\",\n        action=\"store_true\",\n        help=\"Use debug logging level.\",\n    )\n    parser.add_argument(\n        \"--warmup\",\n        action=\"store_true\",\n        help=\"Only download and verify required assets then exit.\",\n    )\n    parser.add_argument(\n        \"--rpc-doclayout\",\n        help=\"RPC service host address for document layout analysis\",\n    )\n    parser.add_argument(\n        
\"--rpc-doclayout2\",\n        help=\"RPC service host address for document layout analysis\",\n    )\n    parser.add_argument(\n        \"--rpc-doclayout3\",\n        help=\"RPC service host address for document layout analysis\",\n    )\n    parser.add_argument(\n        \"--rpc-doclayout4\",\n        help=\"RPC service host address for document layout analysis\",\n    )\n    parser.add_argument(\n        \"--rpc-doclayout5\",\n        help=\"RPC service host address for document layout analysis\",\n    )\n    parser.add_argument(\n        \"--rpc-doclayout6\",\n        help=\"RPC service host address for document layout analysis\",\n    )\n    parser.add_argument(\n        \"--rpc-doclayout7\",\n        help=\"RPC service host address for document layout analysis\",\n    )\n    parser.add_argument(\n        \"--generate-offline-assets\",\n        default=None,\n        help=\"Generate offline assets package in the specified directory\",\n    )\n    parser.add_argument(\n        \"--restore-offline-assets\",\n        default=None,\n        help=\"Restore offline assets package from the specified file\",\n    )\n    parser.add_argument(\n        \"--working-dir\",\n        default=None,\n        help=\"Working directory for translation. If not set, use temp directory.\",\n    )\n    parser.add_argument(\n        \"--metadata-extra-data\",\n        default=None,\n        help=\"Extra data for metadata\",\n    )\n    parser.add_argument(\n        \"--enable-process-pool\",\n        action=\"store_true\",\n        help=\"DEBUG ONLY\",\n    )\n    # translation option argument group\n    translation_group = parser.add_argument_group(\n        \"Translation\",\n        description=\"Used during translation\",\n    )\n    translation_group.add_argument(\n        \"--pages\",\n        \"-p\",\n        help=\"Pages to translate. If not set, translate all pages. 
like: 1,2,1-,-3,3-5\",\n    )\n    translation_group.add_argument(\n        \"--min-text-length\",\n        type=int,\n        default=5,\n        help=\"Minimum text length to translate (default: 5)\",\n    )\n    translation_group.add_argument(\n        \"--lang-in\",\n        \"-li\",\n        default=\"en\",\n        help=\"The code of source language.\",\n    )\n    translation_group.add_argument(\n        \"--lang-out\",\n        \"-lo\",\n        default=\"zh\",\n        help=\"The code of target language.\",\n    )\n    translation_group.add_argument(\n        \"--output\",\n        \"-o\",\n        help=\"Output directory for files. if not set, use same as input.\",\n    )\n    translation_group.add_argument(\n        \"--qps\",\n        \"-q\",\n        type=int,\n        default=4,\n        help=\"QPS limit of translation service\",\n    )\n    translation_group.add_argument(\n        \"--ignore-cache\",\n        action=\"store_true\",\n        help=\"Ignore translation cache.\",\n    )\n    translation_group.add_argument(\n        \"--no-dual\",\n        action=\"store_true\",\n        help=\"Do not output bilingual PDF files\",\n    )\n    translation_group.add_argument(\n        \"--no-mono\",\n        action=\"store_true\",\n        help=\"Do not output monolingual PDF files\",\n    )\n    translation_group.add_argument(\n        \"--formular-font-pattern\",\n        help=\"Font pattern to identify formula text\",\n    )\n    translation_group.add_argument(\n        \"--formular-char-pattern\",\n        help=\"Character pattern to identify formula text\",\n    )\n    translation_group.add_argument(\n        \"--split-short-lines\",\n        action=\"store_true\",\n        help=\"Force split short lines into different paragraphs (may cause poor typesetting & bugs)\",\n    )\n    translation_group.add_argument(\n        \"--short-line-split-factor\",\n        type=float,\n        default=0.8,\n        help=\"Split threshold factor. 
The actual threshold is the median length of all lines on the current page * this factor\",\n    )\n    translation_group.add_argument(\n        \"--skip-clean\",\n        action=\"store_true\",\n        help=\"Skip PDF cleaning step\",\n    )\n    translation_group.add_argument(\n        \"--dual-translate-first\",\n        action=\"store_true\",\n        help=\"Put translated pages first in dual PDF mode\",\n    )\n    translation_group.add_argument(\n        \"--disable-rich-text-translate\",\n        action=\"store_true\",\n        help=\"Disable rich text translation (may help improve compatibility with some PDFs)\",\n    )\n    translation_group.add_argument(\n        \"--enhance-compatibility\",\n        action=\"store_true\",\n        help=\"Enable all compatibility enhancement options (equivalent to --skip-clean --dual-translate-first --disable-rich-text-translate)\",\n    )\n    translation_group.add_argument(\n        \"--use-alternating-pages-dual\",\n        action=\"store_true\",\n        help=\"Use alternating pages mode for dual PDF. When enabled, original and translated pages are arranged in alternate order.\",\n    )\n    translation_group.add_argument(\n        \"--watermark-output-mode\",\n        type=str,\n        choices=[\"watermarked\", \"no_watermark\", \"both\"],\n        default=\"watermarked\",\n        help=\"Control watermark output mode: 'watermarked' (default) adds watermark to translated PDF, 'no_watermark' doesn't add watermark, 'both' outputs both versions.\",\n    )\n    translation_group.add_argument(\n        \"--max-pages-per-part\",\n        type=int,\n        help=\"Maximum number of pages per part for split translation. If not set, no splitting will be performed.\",\n    )\n    translation_group.add_argument(\n        \"--no-watermark\",\n        action=\"store_true\",\n        help=\"[DEPRECATED] Use --watermark-output-mode=no_watermark instead. 
Do not add watermark to the translated PDF.\",\n    )\n    translation_group.add_argument(\n        \"--report-interval\",\n        type=float,\n        default=0.1,\n        help=\"Progress report interval in seconds (default: 0.1)\",\n    )\n    translation_group.add_argument(\n        \"--translate-table-text\",\n        action=\"store_true\",\n        default=False,\n        help=\"Translate table text (experimental)\",\n    )\n    translation_group.add_argument(\n        \"--show-char-box\",\n        action=\"store_true\",\n        default=False,\n        help=\"Show character box (debug only)\",\n    )\n    translation_group.add_argument(\n        \"--skip-scanned-detection\",\n        action=\"store_true\",\n        default=False,\n        help=\"Skip scanned document detection (speeds up processing for non-scanned documents)\",\n    )\n    translation_group.add_argument(\n        \"--ocr-workaround\",\n        action=\"store_true\",\n        default=False,\n        help=\"Add text fill background (experimental)\",\n    )\n    translation_group.add_argument(\n        \"--custom-system-prompt\",\n        help=\"Custom system prompt for translation.\",\n        default=None,\n    )\n    translation_group.add_argument(\n        \"--add-formula-placehold-hint\",\n        action=\"store_true\",\n        default=False,\n        help=\"Add formula placeholder hint for translation. (Currently not recommended, it may affect translation quality, default: False)\",\n    )\n    translation_group.add_argument(\n        \"--disable-same-text-fallback\",\n        action=\"store_true\",\n        default=False,\n        help=\"Disable fallback translation when LLM output matches input text. 
(default: False)\",\n    )\n    translation_group.add_argument(\n        \"--glossary-files\",\n        type=str,\n        default=None,\n        help=\"Comma-separated paths to glossary CSV files.\",\n    )\n    translation_group.add_argument(\n        \"--pool-max-workers\",\n        type=int,\n        help=\"Maximum number of worker threads for internal task processing pools. If not specified, defaults to QPS value. This parameter directly sets the worker count, replacing previous QPS-based dynamic calculations.\",\n    )\n    translation_group.add_argument(\n        \"--term-pool-max-workers\",\n        type=int,\n        help=\"Maximum number of worker threads dedicated to automatic term extraction. If not specified, defaults to --pool-max-workers (or QPS value when unset).\",\n    )\n    translation_group.add_argument(\n        \"--no-auto-extract-glossary\",\n        action=\"store_false\",\n        dest=\"auto_extract_glossary\",\n        default=True,\n        help=\"Disable automatic term extraction. (Config file: set auto_extract_glossary = false)\",\n    )\n    translation_group.add_argument(\n        \"--auto-enable-ocr-workaround\",\n        action=\"store_true\",\n        default=False,\n        help=\"Enable automatic OCR workaround. If a document is detected as heavily scanned, this will attempt to enable OCR processing and skip further scan detection. Note: This option interacts with `--ocr-workaround` and `--skip-scanned-detection`. See documentation for details. (default: False)\",\n    )\n    translation_group.add_argument(\n        \"--primary-font-family\",\n        type=str,\n        choices=[\"serif\", \"sans-serif\", \"script\"],\n        default=None,\n        help=\"Override primary font family for translated text. Choices: 'serif' for serif fonts, 'sans-serif' for sans-serif fonts, 'script' for script/italic fonts. 
If not specified, uses automatic font selection based on original text properties.\",\n    )\n    translation_group.add_argument(\n        \"--only-include-translated-page\",\n        action=\"store_true\",\n        default=False,\n        help=\"Only include translated pages in the output PDF. Effective only when --pages is used.\",\n    )\n    translation_group.add_argument(\n        \"--save-auto-extracted-glossary\",\n        action=\"store_true\",\n        default=False,\n        help=\"Save automatically extracted glossary terms to a CSV file in the output directory.\",\n    )\n    translation_group.add_argument(\n        \"--disable-graphic-element-process\",\n        action=\"store_true\",\n        default=False,\n        help=\"Disable graphic element process. (default: False)\",\n    )\n    translation_group.add_argument(\n        \"--no-merge-alternating-line-numbers\",\n        action=\"store_false\",\n        dest=\"merge_alternating_line_numbers\",\n        default=True,\n        help=\"Disable post-processing that merges alternating line-number layouts (by default this feature is enabled).\",\n    )\n    translation_group.add_argument(\n        \"--skip-translation\",\n        action=\"store_true\",\n        default=False,\n        help=\"Skip translation step. (default: False)\",\n    )\n    translation_group.add_argument(\n        \"--skip-form-render\",\n        action=\"store_true\",\n        default=False,\n        help=\"Skip form rendering. (default: False)\",\n    )\n    translation_group.add_argument(\n        \"--skip-curve-render\",\n        action=\"store_true\",\n        default=False,\n        help=\"Skip curve rendering. (default: False)\",\n    )\n    translation_group.add_argument(\n        \"--only-parse-generate-pdf\",\n        action=\"store_true\",\n        default=False,\n        help=\"Only parse PDF and generate output PDF without translation (default: False). 
This skips all translation-related processing including layout analysis, paragraph finding, style processing, and translation itself.\",\n    )\n    translation_group.add_argument(\n        \"--remove-non-formula-lines\",\n        action=\"store_true\",\n        default=False,\n        help=\"Remove non-formula lines from paragraph areas. This removes decorative lines that are not part of formulas, while protecting lines in figure/table areas. (default: False)\",\n    )\n    translation_group.add_argument(\n        \"--non-formula-line-iou-threshold\",\n        type=float,\n        default=0.9,\n        help=\"IoU threshold for detecting paragraph overlap when removing non-formula lines. Higher values are more conservative. (default: 0.9)\",\n    )\n    translation_group.add_argument(\n        \"--figure-table-protection-threshold\",\n        type=float,\n        default=0.9,\n        help=\"IoU threshold for protecting lines in figure/table areas when removing non-formula lines. Higher values provide more protection. 
(default: 0.9)\",\n    )\n    translation_group.add_argument(\n        \"--skip-formula-offset-calculation\",\n        action=\"store_true\",\n        default=False,\n        help=\"Skip formula offset calculation (default: False)\",\n    )\n    # service option argument group\n    service_group = translation_group.add_mutually_exclusive_group()\n    service_group.add_argument(\n        \"--openai\",\n        action=\"store_true\",\n        help=\"Use OpenAI translator.\",\n    )\n    service_group = parser.add_argument_group(\n        \"Translation - OpenAI Options\",\n        description=\"OpenAI specific options\",\n    )\n    service_group.add_argument(\n        \"--openai-model\",\n        default=\"gpt-4o-mini\",\n        help=\"The OpenAI model to use for translation.\",\n    )\n    service_group.add_argument(\n        \"--openai-base-url\",\n        help=\"The base URL for the OpenAI API.\",\n    )\n    service_group.add_argument(\n        \"--openai-api-key\",\n        \"-k\",\n        help=\"The API key for the OpenAI API.\",\n    )\n    service_group.add_argument(\n        \"--openai-term-extraction-model\",\n        default=None,\n        help=\"OpenAI model to use for automatic term extraction. Defaults to --openai-model when unset.\",\n    )\n    service_group.add_argument(\n        \"--openai-term-extraction-base-url\",\n        default=None,\n        help=\"Base URL for the OpenAI API used during automatic term extraction. Falls back to --openai-base-url when unset.\",\n    )\n    service_group.add_argument(\n        \"--openai-term-extraction-api-key\",\n        default=None,\n        help=\"API key for the OpenAI API used during automatic term extraction. 
Falls back to --openai-api-key when unset.\",\n    )\n    service_group.add_argument(\n        \"--enable-json-mode-if-requested\",\n        action=\"store_true\",\n        default=False,\n        help=\"Enable JSON mode for OpenAI requests.\",\n    )\n    service_group.add_argument(\n        \"--send-dashscope-header\",\n        action=\"store_true\",\n        default=False,\n        help=\"Send DashScope data inspection header to disable input/output inspection.\",\n    )\n    service_group.add_argument(\n        \"--no-send-temperature\",\n        action=\"store_true\",\n        default=False,\n        help=\"Do not send temperature parameter to OpenAI API (default: send temperature).\",\n    )\n    service_group.add_argument(\n        \"--openai-reasoning\",\n        type=str,\n        default=None,\n        help=\"Reasoning string to send in the OpenAI request body 'reasoning' field. If not set, the field is not sent.\",\n    )\n    service_group.add_argument(\n        \"--openai-term-extraction-reasoning\",\n        type=str,\n        default=None,\n        help=\"Reasoning string for the OpenAI term extraction translator. 
If not set, no reasoning field is sent for term extraction requests.\",\n    )\n\n    return parser\n\n\nasync def main():\n    parser = create_parser()\n    args: Any = parser.parse_args()\n\n    if args.debug:\n        logging.getLogger().setLevel(logging.DEBUG)\n\n    if args.generate_offline_assets:\n        babeldoc.assets.assets.generate_offline_assets_package(\n            Path(args.generate_offline_assets)\n        )\n        logger.info(\"Offline assets package generated, exiting...\")\n        return\n\n    if args.restore_offline_assets:\n        babeldoc.assets.assets.restore_offline_assets_package(\n            Path(args.restore_offline_assets)\n        )\n        logger.info(\"Offline assets package restored, exiting...\")\n        return\n\n    if args.warmup:\n        babeldoc.assets.assets.warmup()\n        logger.info(\"Warmup completed, exiting...\")\n        return\n\n    # Validate the translation service selection\n    if not args.openai:\n        parser.error(\"A translation service must be selected: --openai\")\n\n    # Validate OpenAI arguments\n    if args.openai and not args.openai_api_key:\n        parser.error(\"An API key is required when using the OpenAI service\")\n\n    if args.enable_process_pool:\n        enable_process_pool()\n\n    # Instantiate the translator\n    if args.openai:\n        translator_kwargs: dict[str, Any] = {}\n        if args.openai_reasoning is not None:\n            translator_kwargs[\"reasoning\"] = args.openai_reasoning\n        translator = OpenAITranslator(\n            lang_in=args.lang_in,\n            lang_out=args.lang_out,\n            model=args.openai_model,\n            base_url=args.openai_base_url,\n            api_key=args.openai_api_key,\n            ignore_cache=args.ignore_cache,\n            enable_json_mode_if_requested=args.enable_json_mode_if_requested,\n            send_dashscope_header=args.send_dashscope_header,\n            send_temperature=not args.no_send_temperature,\n            **translator_kwargs,\n        )\n        term_extraction_translator = translator\n        if (\n            
args.openai_term_extraction_model\n            or args.openai_term_extraction_base_url\n            or args.openai_term_extraction_api_key\n        ):\n            term_translator_kwargs: dict[str, Any] = {}\n            if args.openai_term_extraction_reasoning is not None:\n                term_translator_kwargs[\"reasoning\"] = (\n                    args.openai_term_extraction_reasoning\n                )\n            term_extraction_translator = OpenAITranslator(\n                lang_in=args.lang_in,\n                lang_out=args.lang_out,\n                model=args.openai_term_extraction_model or args.openai_model,\n                base_url=(args.openai_term_extraction_base_url or args.openai_base_url),\n                api_key=args.openai_term_extraction_api_key or args.openai_api_key,\n                ignore_cache=args.ignore_cache,\n                enable_json_mode_if_requested=args.enable_json_mode_if_requested,\n                send_dashscope_header=args.send_dashscope_header,\n                send_temperature=not args.no_send_temperature,\n                **term_translator_kwargs,\n            )\n    else:\n        raise ValueError(\"Invalid translator type\")\n\n    # Set the translation rate limit\n    set_translate_rate_limiter(args.qps)\n    # Initialize the document layout model\n    if args.rpc_doclayout:\n        from babeldoc.docvision.rpc_doclayout import RpcDocLayoutModel\n\n        doc_layout_model = RpcDocLayoutModel(host=args.rpc_doclayout)\n    elif args.rpc_doclayout2:\n        from babeldoc.docvision.rpc_doclayout2 import RpcDocLayoutModel\n\n        doc_layout_model = RpcDocLayoutModel(host=args.rpc_doclayout2)\n    elif args.rpc_doclayout3:\n        from babeldoc.docvision.rpc_doclayout3 import RpcDocLayoutModel\n\n        doc_layout_model = RpcDocLayoutModel(host=args.rpc_doclayout3)\n    elif args.rpc_doclayout4:\n        from babeldoc.docvision.rpc_doclayout4 import RpcDocLayoutModel\n\n        doc_layout_model = RpcDocLayoutModel(host=args.rpc_doclayout4)\n    elif 
args.rpc_doclayout5:\n        from babeldoc.docvision.rpc_doclayout5 import RpcDocLayoutModel\n\n        doc_layout_model = RpcDocLayoutModel(host=args.rpc_doclayout5)\n    elif args.rpc_doclayout6:\n        from babeldoc.docvision.rpc_doclayout6 import RpcDocLayoutModel\n\n        doc_layout_model = RpcDocLayoutModel(host=args.rpc_doclayout6)\n    elif args.rpc_doclayout7:\n        from babeldoc.docvision.rpc_doclayout7 import RpcDocLayoutModel\n\n        doc_layout_model = RpcDocLayoutModel(host=args.rpc_doclayout7)\n    else:\n        from babeldoc.docvision.doclayout import DocLayoutModel\n\n        doc_layout_model = DocLayoutModel.load_onnx()\n\n    if args.translate_table_text:\n        from babeldoc.docvision.table_detection.rapidocr import RapidOCRModel\n\n        table_model = RapidOCRModel()\n    else:\n        table_model = None\n\n    # Load glossaries\n    loaded_glossaries: list[Glossary] = []\n    if args.glossary_files:\n        paths_str = args.glossary_files.split(\",\")\n        for p_str in paths_str:\n            file_path = Path(p_str.strip())\n            if not file_path.exists():\n                logger.error(f\"Glossary file not found: {file_path}\")\n                continue\n            if not file_path.is_file():\n                logger.error(f\"Glossary path is not a file: {file_path}\")\n                continue\n            try:\n                glossary_obj = Glossary.from_csv(file_path, args.lang_out)\n                if glossary_obj.entries:\n                    loaded_glossaries.append(glossary_obj)\n                    logger.info(\n                        f\"Loaded glossary '{glossary_obj.name}' with {len(glossary_obj.entries)} entries.\"\n                    )\n                else:\n                    logger.info(\n                        f\"Glossary '{file_path.stem}' loaded with no applicable entries for lang_out '{args.lang_out}'.\"\n                    )\n            except Exception as e:\n                
logger.error(f\"Failed to load glossary from {file_path}: {e}\")\n\n    pending_files = []\n    for file in args.files:\n        # Clean the file path: strip surrounding quotes\n        if file.startswith(\"--files=\"):\n            file = file[len(\"--files=\") :]\n        file = file.lstrip(\"-\").strip(\"\\\"'\")\n        if not Path(file).exists():\n            logger.error(f\"File does not exist: {file}\")\n            exit(1)\n        if not file.lower().endswith(\".pdf\"):\n            logger.error(f\"File is not a PDF: {file}\")\n            exit(1)\n        pending_files.append(file)\n\n    if args.output:\n        if not Path(args.output).exists():\n            logger.info(f\"Output directory does not exist, creating: {args.output}\")\n            try:\n                Path(args.output).mkdir(parents=True, exist_ok=True)\n            except OSError:\n                logger.critical(\n                    f\"Failed to create output folder at {args.output}\",\n                    exc_info=True,\n                )\n                exit(1)\n    else:\n        args.output = None\n\n    if args.working_dir:\n        working_dir = Path(args.working_dir)\n        if not working_dir.exists():\n            logger.info(f\"Working directory does not exist, creating: {working_dir}\")\n            try:\n                working_dir.mkdir(parents=True, exist_ok=True)\n            except OSError:\n                logger.critical(\n                    f\"Failed to create working directory at {working_dir}\",\n                    exc_info=True,\n                )\n                exit(1)\n    else:\n        working_dir = None\n\n    watermark_output_mode = WatermarkOutputMode.Watermarked\n    if args.no_watermark:\n        watermark_output_mode = WatermarkOutputMode.NoWatermark\n    elif args.watermark_output_mode == \"both\":\n        watermark_output_mode = WatermarkOutputMode.Both\n    elif args.watermark_output_mode == \"watermarked\":\n        watermark_output_mode = WatermarkOutputMode.Watermarked\n    elif args.watermark_output_mode == \"no_watermark\":\n        watermark_output_mode 
= WatermarkOutputMode.NoWatermark\n\n    split_strategy = None\n    if args.max_pages_per_part:\n        split_strategy = TranslationConfig.create_max_pages_per_part_split_strategy(\n            args.max_pages_per_part\n        )\n\n    total_term_extraction_total_tokens = 0\n    total_term_extraction_prompt_tokens = 0\n    total_term_extraction_completion_tokens = 0\n    total_term_extraction_cache_hit_prompt_tokens = 0\n\n    for file in pending_files:\n        # Clean the file path: strip surrounding quotes\n        file = file.strip(\"\\\"'\")\n        # Create the configuration object\n        config = TranslationConfig(\n            input_file=file,\n            font=None,\n            pages=args.pages,\n            output_dir=args.output,\n            translator=translator,\n            term_extraction_translator=term_extraction_translator,\n            debug=args.debug,\n            lang_in=args.lang_in,\n            lang_out=args.lang_out,\n            no_dual=args.no_dual,\n            no_mono=args.no_mono,\n            qps=args.qps,\n            formular_font_pattern=args.formular_font_pattern,\n            formular_char_pattern=args.formular_char_pattern,\n            split_short_lines=args.split_short_lines,\n            short_line_split_factor=args.short_line_split_factor,\n            doc_layout_model=doc_layout_model,\n            skip_clean=args.skip_clean,\n            dual_translate_first=args.dual_translate_first,\n            disable_rich_text_translate=args.disable_rich_text_translate,\n            enhance_compatibility=args.enhance_compatibility,\n            use_alternating_pages_dual=args.use_alternating_pages_dual,\n            report_interval=args.report_interval,\n            min_text_length=args.min_text_length,\n            watermark_output_mode=watermark_output_mode,\n            split_strategy=split_strategy,\n            table_model=table_model,\n            show_char_box=args.show_char_box,\n            skip_scanned_detection=args.skip_scanned_detection,\n            
ocr_workaround=args.ocr_workaround,\n            custom_system_prompt=args.custom_system_prompt,\n            working_dir=working_dir,\n            add_formula_placehold_hint=args.add_formula_placehold_hint,\n            disable_same_text_fallback=args.disable_same_text_fallback,\n            glossaries=loaded_glossaries,\n            pool_max_workers=args.pool_max_workers,\n            auto_extract_glossary=args.auto_extract_glossary,\n            auto_enable_ocr_workaround=args.auto_enable_ocr_workaround,\n            primary_font_family=args.primary_font_family,\n            only_include_translated_page=args.only_include_translated_page,\n            save_auto_extracted_glossary=args.save_auto_extracted_glossary,\n            enable_graphic_element_process=not args.disable_graphic_element_process,\n            merge_alternating_line_numbers=args.merge_alternating_line_numbers,\n            skip_translation=args.skip_translation,\n            skip_form_render=args.skip_form_render,\n            skip_curve_render=args.skip_curve_render,\n            only_parse_generate_pdf=args.only_parse_generate_pdf,\n            remove_non_formula_lines=args.remove_non_formula_lines,\n            non_formula_line_iou_threshold=args.non_formula_line_iou_threshold,\n            figure_table_protection_threshold=args.figure_table_protection_threshold,\n            skip_formula_offset_calculation=args.skip_formula_offset_calculation,\n            metadata_extra_data=args.metadata_extra_data,\n            term_pool_max_workers=args.term_pool_max_workers,\n        )\n\n        def nop(_x):\n            pass\n\n        getattr(doc_layout_model, \"init_font_mapper\", nop)(config)\n        # Create progress handler\n        progress_context, progress_handler = create_progress_handler(\n            config, show_log=False\n        )\n\n        # Start translation\n        with progress_context:\n            async for event in babeldoc.format.pdf.high_level.async_translate(config):\n                
progress_handler(event)\n                if config.debug:\n                    logger.debug(event)\n                if event[\"type\"] == \"error\":\n                    logger.error(f\"Error: {event['error']}\")\n                    break\n                if event[\"type\"] == \"finish\":\n                    result = event[\"translate_result\"]\n                    logger.info(str(result))\n                    break\n        usage = config.term_extraction_token_usage\n        total_term_extraction_total_tokens += usage[\"total_tokens\"]\n        total_term_extraction_prompt_tokens += usage[\"prompt_tokens\"]\n        total_term_extraction_completion_tokens += usage[\"completion_tokens\"]\n        total_term_extraction_cache_hit_prompt_tokens += usage[\n            \"cache_hit_prompt_tokens\"\n        ]\n    logger.info(f\"Total tokens: {translator.token_count.value}\")\n    logger.info(f\"Prompt tokens: {translator.prompt_token_count.value}\")\n    logger.info(f\"Completion tokens: {translator.completion_token_count.value}\")\n    logger.info(\n        f\"Cache hit prompt tokens: {translator.cache_hit_prompt_token_count.value}\"\n    )\n    logger.info(\n        \"Term extraction tokens: total=%s prompt=%s completion=%s cache_hit_prompt=%s\",\n        total_term_extraction_total_tokens,\n        total_term_extraction_prompt_tokens,\n        total_term_extraction_completion_tokens,\n        total_term_extraction_cache_hit_prompt_tokens,\n    )\n    if term_extraction_translator is not translator:\n        logger.info(\n            \"Term extraction translator raw tokens: total=%s prompt=%s completion=%s cache_hit_prompt=%s\",\n            term_extraction_translator.token_count.value,\n            term_extraction_translator.prompt_token_count.value,\n            term_extraction_translator.completion_token_count.value,\n            term_extraction_translator.cache_hit_prompt_token_count.value,\n        )\n\n\ndef create_progress_handler(\n    translation_config: 
TranslationConfig, show_log: bool = False\n):\n    \"\"\"Create a progress handler function based on the configuration.\n\n    Args:\n        translation_config: The translation configuration.\n\n    Returns:\n        A tuple of (progress_context, progress_handler), where progress_context is a context\n        manager that should be used to wrap the translation process, and progress_handler\n        is a function that will be called with progress events.\n    \"\"\"\n    if translation_config.use_rich_pbar:\n        progress = Progress(\n            TextColumn(\"[progress.description]{task.description}\"),\n            BarColumn(),\n            MofNCompleteColumn(),\n            TimeElapsedColumn(),\n            TimeRemainingColumn(),\n        )\n        translate_task_id = progress.add_task(\"translate\", total=100)\n        stage_tasks = {}\n\n        def progress_handler(event):\n            if show_log and random.random() <= 0.1:  # noqa: S311\n                logger.info(event)\n            if event[\"type\"] == \"progress_start\":\n                if event[\"stage\"] not in stage_tasks:\n                    stage_tasks[event[\"stage\"]] = progress.add_task(\n                        f\"{event['stage']} ({event['part_index']}/{event['total_parts']})\",\n                        total=event.get(\"stage_total\", 100),\n                    )\n            elif event[\"type\"] == \"progress_update\":\n                stage = event[\"stage\"]\n                if stage in stage_tasks:\n                    progress.update(\n                        stage_tasks[stage],\n                        completed=event[\"stage_current\"],\n                        total=event[\"stage_total\"],\n                        description=f\"{event['stage']} ({event['part_index']}/{event['total_parts']})\",\n                        refresh=True,\n                    )\n                progress.update(\n                    translate_task_id,\n                    
completed=event[\"overall_progress\"],\n                    refresh=True,\n                )\n            elif event[\"type\"] == \"progress_end\":\n                stage = event[\"stage\"]\n                if stage in stage_tasks:\n                    progress.update(\n                        stage_tasks[stage],\n                        completed=event[\"stage_total\"],\n                        total=event[\"stage_total\"],\n                        description=f\"{event['stage']} ({event['part_index']}/{event['total_parts']})\",\n                        refresh=True,\n                    )\n                    progress.update(\n                        translate_task_id,\n                        completed=event[\"overall_progress\"],\n                        refresh=True,\n                    )\n                progress.refresh()\n\n        return progress, progress_handler\n    else:\n        pbar = tqdm.tqdm(total=100, desc=\"translate\")\n\n        def progress_handler(event):\n            if event[\"type\"] == \"progress_update\":\n                pbar.update(event[\"overall_progress\"] - pbar.n)\n                pbar.set_description(\n                    f\"{event['stage']} ({event['stage_current']}/{event['stage_total']})\",\n                )\n            elif event[\"type\"] == \"progress_end\":\n                pbar.set_description(f\"{event['stage']} (Complete)\")\n                pbar.refresh()\n\n        return pbar, progress_handler\n\n\n# for backward compatibility\ndef create_cache_folder():\n    return babeldoc.format.pdf.high_level.create_cache_folder()\n\n\n# for backward compatibility\ndef download_font_assets():\n    return babeldoc.format.pdf.high_level.download_font_assets()\n\n\nclass EvictQueue(queue.Queue):\n    def __init__(self, maxsize):\n        self.discarded = 0\n        super().__init__(maxsize)\n\n    def put(self, item, block=False, timeout=None):\n        while True:\n            try:\n                super().put(item, 
block=False)\n                break\n            except queue.Full:\n                try:\n                    self.get_nowait()\n                    self.discarded += 1\n                except queue.Empty:\n                    pass\n\n\ndef speed_up_logs():\n    import logging.handlers\n\n    root_logger = logging.getLogger()\n    log_que = EvictQueue(1000)\n    queue_handler = logging.handlers.QueueHandler(log_que)\n    queue_listener = logging.handlers.QueueListener(log_que, *root_logger.handlers)\n    queue_listener.start()\n    root_logger.handlers = [queue_handler]\n\n\ndef cli():\n    \"\"\"Command line interface entry point.\"\"\"\n    from rich.logging import RichHandler\n\n    logging.basicConfig(level=logging.INFO, handlers=[RichHandler()])\n\n    logging.getLogger(\"httpx\").setLevel(\"CRITICAL\")\n    logging.getLogger(\"httpx\").propagate = False\n    logging.getLogger(\"openai\").setLevel(\"CRITICAL\")\n    logging.getLogger(\"openai\").propagate = False\n    logging.getLogger(\"httpcore\").setLevel(\"CRITICAL\")\n    logging.getLogger(\"httpcore\").propagate = False\n    logging.getLogger(\"http11\").setLevel(\"CRITICAL\")\n    logging.getLogger(\"http11\").propagate = False\n    for v in logging.Logger.manager.loggerDict.values():\n        if getattr(v, \"name\", None) is None:\n            continue\n        if (\n            v.name.startswith(\"pdfminer\")\n            or v.name.startswith(\"peewee\")\n            or v.name.startswith(\"httpx\")\n            or \"http11\" in v.name\n            or \"openai\" in v.name\n            or \"pdfminer\" in v.name\n        ):\n            v.disabled = True\n            v.propagate = False\n\n    speed_up_logs()\n    babeldoc.format.pdf.high_level.init()\n    asyncio.run(main())\n\n\nif __name__ == \"__main__\":\n    if sys.platform == \"darwin\" or sys.platform == \"win32\":\n        mp.set_start_method(\"spawn\")\n    else:\n        mp.set_start_method(\"forkserver\")\n    cli()\n"
  },
  {
    "path": "babeldoc/pdfminer/LICENSE",
    "content": "Copyright (c) 2004-2016  Yusuke Shinyama <yusuke at shinyama dot jp>\n\nPermission is hereby granted, free of charge, to any person\nobtaining a copy of this software and associated documentation\nfiles (the \"Software\"), to deal in the Software without\nrestriction, including without limitation the rights to use,\ncopy, modify, merge, publish, distribute, sublicense, and/or\nsell copies of the Software, and to permit persons to whom the\nSoftware is furnished to do so, subject to the following\nconditions:\n\nThe above copyright notice and this permission notice shall be\nincluded in all copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY\nKIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE\nWARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR\nPURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR\nCOPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR\nOTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE\nSOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE."
  },
  {
    "path": "babeldoc/pdfminer/__init__.py",
    "content": "from importlib.metadata import PackageNotFoundError\nfrom importlib.metadata import version\n\ntry:\n    __version__ = version(\"pdfminer.six\")\nexcept PackageNotFoundError:\n    # package is not installed, return default\n    __version__ = \"0.0\"\n\nif __name__ == \"__main__\":\n    print(__version__)\n"
  },
  {
    "path": "babeldoc/pdfminer/_saslprep.py",
    "content": "# Copyright 2016-present MongoDB, Inc.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n# http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n#\n# Some changes copyright 2021-present Matthias Valvekens,\n# licensed under the license of the pyHanko project (see LICENSE file).\n\n\n\"\"\"An implementation of RFC4013 SASLprep.\"\"\"\n\n__all__ = [\"saslprep\"]\n\nimport stringprep\nimport unicodedata\nfrom collections.abc import Callable\n\nfrom babeldoc.pdfminer.pdfexceptions import PDFValueError\n\n# RFC4013 section 2.3 prohibited output.\n_PROHIBITED: tuple[Callable[[str], bool], ...] = (\n    # A strict reading of RFC 4013 requires table c12 here, but\n    # characters from it are mapped to SPACE in the Map step. Can\n    # normalization reintroduce them somehow?\n    stringprep.in_table_c12,\n    stringprep.in_table_c21_c22,\n    stringprep.in_table_c3,\n    stringprep.in_table_c4,\n    stringprep.in_table_c5,\n    stringprep.in_table_c6,\n    stringprep.in_table_c7,\n    stringprep.in_table_c8,\n    stringprep.in_table_c9,\n)\n\n\ndef saslprep(data: str, prohibit_unassigned_code_points: bool = True) -> str:\n    \"\"\"An implementation of RFC4013 SASLprep.\n    :param data:\n        The string to SASLprep.\n    :param prohibit_unassigned_code_points:\n        RFC 3454 and RFCs for various SASL mechanisms distinguish between\n        `queries` (unassigned code points allowed) and\n        `stored strings` (unassigned code points prohibited). 
Defaults\n        to ``True`` (unassigned code points are prohibited).\n    :return: The SASLprep'ed version of `data`.\n    \"\"\"\n    if prohibit_unassigned_code_points:\n        prohibited = _PROHIBITED + (stringprep.in_table_a1,)\n    else:\n        prohibited = _PROHIBITED\n\n    # RFC3454 section 2, step 1 - Map\n    # RFC4013 section 2.1 mappings\n    # Map Non-ASCII space characters to SPACE (U+0020). Map\n    # commonly mapped to nothing characters to, well, nothing.\n    in_table_c12 = stringprep.in_table_c12\n    in_table_b1 = stringprep.in_table_b1\n    data = \"\".join(\n        [\n            \"\\u0020\" if in_table_c12(elt) else elt\n            for elt in data\n            if not in_table_b1(elt)\n        ],\n    )\n\n    # RFC3454 section 2, step 2 - Normalize\n    # RFC4013 section 2.2 normalization\n    data = unicodedata.ucd_3_2_0.normalize(\"NFKC\", data)\n\n    in_table_d1 = stringprep.in_table_d1\n    if in_table_d1(data[0]):\n        if not in_table_d1(data[-1]):\n            # RFC3454, Section 6, #3. If a string contains any\n            # RandALCat character, the first and last characters\n            # MUST be RandALCat characters.\n            raise PDFValueError(\"SASLprep: failed bidirectional check\")\n        # RFC3454, Section 6, #2. If a string contains any RandALCat\n        # character, it MUST NOT contain any LCat character.\n        prohibited = prohibited + (stringprep.in_table_d2,)\n    else:\n        # RFC3454, Section 6, #3. Following the logic of #3, if\n        # the first character is not a RandALCat, no other character\n        # can be either.\n        prohibited = prohibited + (in_table_d1,)\n\n    # RFC3454 section 2, step 3 and 4 - Prohibit and check bidi\n    for char in data:\n        if any(in_table(char) for in_table in prohibited):\n            raise PDFValueError(\"SASLprep: failed prohibited character check\")\n\n    return data\n"
  },
  {
    "path": "babeldoc/pdfminer/arcfour.py",
    "content": "\"\"\"Python implementation of Arcfour encryption algorithm.\nSee https://en.wikipedia.org/wiki/RC4\nThis code is in the public domain.\n\n\"\"\"\n\nfrom collections.abc import Sequence\n\n\nclass Arcfour:\n    def __init__(self, key: Sequence[int]) -> None:\n        # because Py3 range is not indexable\n        s = [i for i in range(256)]\n        j = 0\n        klen = len(key)\n        for i in range(256):\n            j = (j + s[i] + key[i % klen]) % 256\n            (s[i], s[j]) = (s[j], s[i])\n        self.s = s\n        (self.i, self.j) = (0, 0)\n\n    def process(self, data: bytes) -> bytes:\n        (i, j) = (self.i, self.j)\n        s = self.s\n        r = b\"\"\n        for c in iter(data):\n            i = (i + 1) % 256\n            j = (j + s[i]) % 256\n            (s[i], s[j]) = (s[j], s[i])\n            k = s[(s[i] + s[j]) % 256]\n            r += bytes((c ^ k,))\n        (self.i, self.j) = (i, j)\n        return r\n\n    encrypt = decrypt = process\n"
  },
  {
    "path": "babeldoc/pdfminer/ascii85.py",
    "content": "\"\"\"Python implementation of ASCII85/ASCIIHex decoder (Adobe version).\"\"\"\n\nimport re\nfrom base64 import a85decode\nfrom binascii import unhexlify\n\nstart_re = re.compile(rb\"^\\s*<?\\s*~\\s*\")\nend_re = re.compile(rb\"\\s*~\\s*>?\\s*$\")\n\n\ndef ascii85decode(data: bytes) -> bytes:\n    \"\"\"In ASCII85 encoding, every four bytes are encoded with five ASCII\n    letters, using 85 different types of characters (as 256**4 < 85**5).\n    When the length of the original bytes is not a multiple of 4, a special\n    rule is used for round up.\n\n    Adobe's ASCII85 implementation expects the input to be terminated\n    by `b\"~>\"`, and (though this is absent from the PDF spec) it can\n    also begin with `b\"<~\"`.  We can't reliably expect this to be the\n    case, and there can be off-by-one errors in stream lengths which\n    mean we only see `~` at the end.  Worse yet, `<` and `>` are\n    ASCII85 digits, so we can't strip them.  We settle on a compromise\n    where we strip leading `<~` or `~` and trailing `~` or `~>`.\n    \"\"\"\n    data = start_re.sub(b\"\", data)\n    data = end_re.sub(b\"\", data)\n    return a85decode(data)\n\n\nbws_re = re.compile(rb\"\\s\")\n\n\ndef asciihexdecode(data: bytes) -> bytes:\n    \"\"\"ASCIIHexDecode filter: PDFReference v1.4 section 3.3.1\n    For each pair of ASCII hexadecimal digits (0-9 and A-F or a-f), the\n    ASCIIHexDecode filter produces one byte of binary data. All white-space\n    characters are ignored. A right angle bracket character (>) indicates\n    EOD. Any other characters will cause an error. If the filter encounters\n    the EOD marker after reading an odd number of hexadecimal digits, it\n    will behave as if a 0 followed the last digit.\n    \"\"\"\n    data = bws_re.sub(b\"\", data)\n    idx = data.find(b\">\")\n    if idx != -1:\n        data = data[:idx]\n        if idx % 2 == 1:\n            data += b\"0\"\n    return unhexlify(data)\n"
  },
  {
    "path": "babeldoc/pdfminer/casting.py",
    "content": "import itertools\nfrom typing import Any\n\nfrom babeldoc.pdfminer.utils import Matrix\nfrom babeldoc.pdfminer.utils import Rect\n\n_FloatTriple = tuple[float, float, float]\n_FloatQuadruple = tuple[float, float, float, float]\n\n\ndef safe_int(o: Any) -> int | None:\n    try:\n        return int(o)\n    except (TypeError, ValueError):\n        return None\n\n\ndef safe_float(o: Any) -> float | None:\n    try:\n        return float(o)\n    except (TypeError, ValueError):\n        return None\n\n\ndef safe_matrix(a: Any, b: Any, c: Any, d: Any, e: Any, f: Any) -> Matrix | None:\n    a_f = safe_float(a)\n    b_f = safe_float(b)\n    c_f = safe_float(c)\n    d_f = safe_float(d)\n    e_f = safe_float(e)\n    f_f = safe_float(f)\n\n    if (\n        a_f is None\n        or b_f is None\n        or c_f is None\n        or d_f is None\n        or e_f is None\n        or f_f is None\n    ):\n        return None\n\n    return a_f, b_f, c_f, d_f, e_f, f_f\n\n\ndef safe_rgb(r: Any, g: Any, b: Any) -> tuple[float, float, float] | None:\n    return _safe_float_triple(r, g, b)\n\n\ndef safe_cmyk(\n    c: Any, m: Any, y: Any, k: Any\n) -> tuple[float, float, float, float] | None:\n    return _safe_float_quadruple(c, m, y, k)\n\n\ndef safe_rect_list(value: Any) -> Rect | None:\n    try:\n        values = list(itertools.islice(value, 4))\n    except TypeError:\n        return None\n\n    if len(values) != 4:\n        return None\n\n    return safe_rect(*values)\n\n\ndef safe_rect(a: Any, b: Any, c: Any, d: Any) -> Rect | None:\n    return _safe_float_quadruple(a, b, c, d)\n\n\ndef _safe_float_triple(a: Any, b: Any, c: Any) -> _FloatTriple | None:\n    a_f = safe_float(a)\n    b_f = safe_float(b)\n    c_f = safe_float(c)\n\n    if a_f is None or b_f is None or c_f is None:\n        return None\n\n    return a_f, b_f, c_f\n\n\ndef _safe_float_quadruple(a: Any, b: Any, c: Any, d: Any) -> _FloatQuadruple | None:\n    a_f = safe_float(a)\n    b_f = safe_float(b)\n    c_f 
= safe_float(c)\n    d_f = safe_float(d)\n\n    if a_f is None or b_f is None or c_f is None or d_f is None:\n        return None\n\n    return a_f, b_f, c_f, d_f\n"
  },
  {
    "path": "babeldoc/pdfminer/ccitt.py",
    "content": "# CCITT Fax decoder\n#\n# Bugs: uncompressed mode untested.\n#\n# cf.\n#  ITU-T Recommendation T.4\n#    \"Standardization of Group 3 facsimile terminals\n#    for document transmission\"\n#  ITU-T Recommendation T.6\n#    \"FACSIMILE CODING SCHEMES AND CODING CONTROL FUNCTIONS\n#    FOR GROUP 4 FACSIMILE APPARATUS\"\n\n\nimport array\nfrom collections.abc import Callable\nfrom collections.abc import Iterator\nfrom collections.abc import MutableSequence\nfrom collections.abc import Sequence\nfrom typing import Any\nfrom typing import cast\n\nfrom babeldoc.pdfminer.pdfexceptions import PDFException\nfrom babeldoc.pdfminer.pdfexceptions import PDFValueError\n\n\ndef get_bytes(data: bytes) -> Iterator[int]:\n    yield from data\n\n\n# Workaround https://github.com/python/mypy/issues/731\nBitParserState = MutableSequence[Any]\n# A better definition (not supported by mypy) would be:\n# BitParserState = MutableSequence[Union[\"BitParserState\", int, str, None]]\n\n\nclass BitParser:\n    _state: BitParserState\n\n    # _accept is declared Optional solely as a workaround for\n    # https://github.com/python/mypy/issues/708\n    _accept: Callable[[Any], BitParserState] | None\n\n    def __init__(self) -> None:\n        self._pos = 0\n\n    @classmethod\n    def add(cls, root: BitParserState, v: int | str, bits: str) -> None:\n        p: BitParserState = root\n        b = None\n        for i in range(len(bits)):\n            if i > 0:\n                assert b is not None\n                if p[b] is None:\n                    p[b] = [None, None]\n                p = p[b]\n            if bits[i] == \"1\":\n                b = 1\n            else:\n                b = 0\n        assert b is not None\n        p[b] = v\n\n    def feedbytes(self, data: bytes) -> None:\n        for byte in get_bytes(data):\n            for m in (128, 64, 32, 16, 8, 4, 2, 1):\n                self._parse_bit(byte & m)\n\n    def _parse_bit(self, x: object) -> None:\n        if x:\n 
           v = self._state[1]\n        else:\n            v = self._state[0]\n        self._pos += 1\n        if isinstance(v, list):\n            self._state = v\n        else:\n            assert self._accept is not None\n            self._state = self._accept(v)\n\n\nclass CCITTG4Parser(BitParser):\n    MODE = [None, None]\n    BitParser.add(MODE, 0, \"1\")\n    BitParser.add(MODE, +1, \"011\")\n    BitParser.add(MODE, -1, \"010\")\n    BitParser.add(MODE, \"h\", \"001\")\n    BitParser.add(MODE, \"p\", \"0001\")\n    BitParser.add(MODE, +2, \"000011\")\n    BitParser.add(MODE, -2, \"000010\")\n    BitParser.add(MODE, +3, \"0000011\")\n    BitParser.add(MODE, -3, \"0000010\")\n    BitParser.add(MODE, \"u\", \"0000001111\")\n    BitParser.add(MODE, \"x1\", \"0000001000\")\n    BitParser.add(MODE, \"x2\", \"0000001001\")\n    BitParser.add(MODE, \"x3\", \"0000001010\")\n    BitParser.add(MODE, \"x4\", \"0000001011\")\n    BitParser.add(MODE, \"x5\", \"0000001100\")\n    BitParser.add(MODE, \"x6\", \"0000001101\")\n    BitParser.add(MODE, \"x7\", \"0000001110\")\n    BitParser.add(MODE, \"e\", \"000000000001000000000001\")\n\n    WHITE = [None, None]\n    BitParser.add(WHITE, 0, \"00110101\")\n    BitParser.add(WHITE, 1, \"000111\")\n    BitParser.add(WHITE, 2, \"0111\")\n    BitParser.add(WHITE, 3, \"1000\")\n    BitParser.add(WHITE, 4, \"1011\")\n    BitParser.add(WHITE, 5, \"1100\")\n    BitParser.add(WHITE, 6, \"1110\")\n    BitParser.add(WHITE, 7, \"1111\")\n    BitParser.add(WHITE, 8, \"10011\")\n    BitParser.add(WHITE, 9, \"10100\")\n    BitParser.add(WHITE, 10, \"00111\")\n    BitParser.add(WHITE, 11, \"01000\")\n    BitParser.add(WHITE, 12, \"001000\")\n    BitParser.add(WHITE, 13, \"000011\")\n    BitParser.add(WHITE, 14, \"110100\")\n    BitParser.add(WHITE, 15, \"110101\")\n    BitParser.add(WHITE, 16, \"101010\")\n    BitParser.add(WHITE, 17, \"101011\")\n    BitParser.add(WHITE, 18, \"0100111\")\n    BitParser.add(WHITE, 19, \"0001100\")\n    
BitParser.add(WHITE, 20, \"0001000\")\n    BitParser.add(WHITE, 21, \"0010111\")\n    BitParser.add(WHITE, 22, \"0000011\")\n    BitParser.add(WHITE, 23, \"0000100\")\n    BitParser.add(WHITE, 24, \"0101000\")\n    BitParser.add(WHITE, 25, \"0101011\")\n    BitParser.add(WHITE, 26, \"0010011\")\n    BitParser.add(WHITE, 27, \"0100100\")\n    BitParser.add(WHITE, 28, \"0011000\")\n    BitParser.add(WHITE, 29, \"00000010\")\n    BitParser.add(WHITE, 30, \"00000011\")\n    BitParser.add(WHITE, 31, \"00011010\")\n    BitParser.add(WHITE, 32, \"00011011\")\n    BitParser.add(WHITE, 33, \"00010010\")\n    BitParser.add(WHITE, 34, \"00010011\")\n    BitParser.add(WHITE, 35, \"00010100\")\n    BitParser.add(WHITE, 36, \"00010101\")\n    BitParser.add(WHITE, 37, \"00010110\")\n    BitParser.add(WHITE, 38, \"00010111\")\n    BitParser.add(WHITE, 39, \"00101000\")\n    BitParser.add(WHITE, 40, \"00101001\")\n    BitParser.add(WHITE, 41, \"00101010\")\n    BitParser.add(WHITE, 42, \"00101011\")\n    BitParser.add(WHITE, 43, \"00101100\")\n    BitParser.add(WHITE, 44, \"00101101\")\n    BitParser.add(WHITE, 45, \"00000100\")\n    BitParser.add(WHITE, 46, \"00000101\")\n    BitParser.add(WHITE, 47, \"00001010\")\n    BitParser.add(WHITE, 48, \"00001011\")\n    BitParser.add(WHITE, 49, \"01010010\")\n    BitParser.add(WHITE, 50, \"01010011\")\n    BitParser.add(WHITE, 51, \"01010100\")\n    BitParser.add(WHITE, 52, \"01010101\")\n    BitParser.add(WHITE, 53, \"00100100\")\n    BitParser.add(WHITE, 54, \"00100101\")\n    BitParser.add(WHITE, 55, \"01011000\")\n    BitParser.add(WHITE, 56, \"01011001\")\n    BitParser.add(WHITE, 57, \"01011010\")\n    BitParser.add(WHITE, 58, \"01011011\")\n    BitParser.add(WHITE, 59, \"01001010\")\n    BitParser.add(WHITE, 60, \"01001011\")\n    BitParser.add(WHITE, 61, \"00110010\")\n    BitParser.add(WHITE, 62, \"00110011\")\n    BitParser.add(WHITE, 63, \"00110100\")\n    BitParser.add(WHITE, 64, \"11011\")\n    BitParser.add(WHITE, 128, 
\"10010\")\n    BitParser.add(WHITE, 192, \"010111\")\n    BitParser.add(WHITE, 256, \"0110111\")\n    BitParser.add(WHITE, 320, \"00110110\")\n    BitParser.add(WHITE, 384, \"00110111\")\n    BitParser.add(WHITE, 448, \"01100100\")\n    BitParser.add(WHITE, 512, \"01100101\")\n    BitParser.add(WHITE, 576, \"01101000\")\n    BitParser.add(WHITE, 640, \"01100111\")\n    BitParser.add(WHITE, 704, \"011001100\")\n    BitParser.add(WHITE, 768, \"011001101\")\n    BitParser.add(WHITE, 832, \"011010010\")\n    BitParser.add(WHITE, 896, \"011010011\")\n    BitParser.add(WHITE, 960, \"011010100\")\n    BitParser.add(WHITE, 1024, \"011010101\")\n    BitParser.add(WHITE, 1088, \"011010110\")\n    BitParser.add(WHITE, 1152, \"011010111\")\n    BitParser.add(WHITE, 1216, \"011011000\")\n    BitParser.add(WHITE, 1280, \"011011001\")\n    BitParser.add(WHITE, 1344, \"011011010\")\n    BitParser.add(WHITE, 1408, \"011011011\")\n    BitParser.add(WHITE, 1472, \"010011000\")\n    BitParser.add(WHITE, 1536, \"010011001\")\n    BitParser.add(WHITE, 1600, \"010011010\")\n    BitParser.add(WHITE, 1664, \"011000\")\n    BitParser.add(WHITE, 1728, \"010011011\")\n    BitParser.add(WHITE, 1792, \"00000001000\")\n    BitParser.add(WHITE, 1856, \"00000001100\")\n    BitParser.add(WHITE, 1920, \"00000001101\")\n    BitParser.add(WHITE, 1984, \"000000010010\")\n    BitParser.add(WHITE, 2048, \"000000010011\")\n    BitParser.add(WHITE, 2112, \"000000010100\")\n    BitParser.add(WHITE, 2176, \"000000010101\")\n    BitParser.add(WHITE, 2240, \"000000010110\")\n    BitParser.add(WHITE, 2304, \"000000010111\")\n    BitParser.add(WHITE, 2368, \"000000011100\")\n    BitParser.add(WHITE, 2432, \"000000011101\")\n    BitParser.add(WHITE, 2496, \"000000011110\")\n    BitParser.add(WHITE, 2560, \"000000011111\")\n\n    BLACK = [None, None]\n    BitParser.add(BLACK, 0, \"0000110111\")\n    BitParser.add(BLACK, 1, \"010\")\n    BitParser.add(BLACK, 2, \"11\")\n    BitParser.add(BLACK, 3, \"10\")\n    
BitParser.add(BLACK, 4, \"011\")\n    BitParser.add(BLACK, 5, \"0011\")\n    BitParser.add(BLACK, 6, \"0010\")\n    BitParser.add(BLACK, 7, \"00011\")\n    BitParser.add(BLACK, 8, \"000101\")\n    BitParser.add(BLACK, 9, \"000100\")\n    BitParser.add(BLACK, 10, \"0000100\")\n    BitParser.add(BLACK, 11, \"0000101\")\n    BitParser.add(BLACK, 12, \"0000111\")\n    BitParser.add(BLACK, 13, \"00000100\")\n    BitParser.add(BLACK, 14, \"00000111\")\n    BitParser.add(BLACK, 15, \"000011000\")\n    BitParser.add(BLACK, 16, \"0000010111\")\n    BitParser.add(BLACK, 17, \"0000011000\")\n    BitParser.add(BLACK, 18, \"0000001000\")\n    BitParser.add(BLACK, 19, \"00001100111\")\n    BitParser.add(BLACK, 20, \"00001101000\")\n    BitParser.add(BLACK, 21, \"00001101100\")\n    BitParser.add(BLACK, 22, \"00000110111\")\n    BitParser.add(BLACK, 23, \"00000101000\")\n    BitParser.add(BLACK, 24, \"00000010111\")\n    BitParser.add(BLACK, 25, \"00000011000\")\n    BitParser.add(BLACK, 26, \"000011001010\")\n    BitParser.add(BLACK, 27, \"000011001011\")\n    BitParser.add(BLACK, 28, \"000011001100\")\n    BitParser.add(BLACK, 29, \"000011001101\")\n    BitParser.add(BLACK, 30, \"000001101000\")\n    BitParser.add(BLACK, 31, \"000001101001\")\n    BitParser.add(BLACK, 32, \"000001101010\")\n    BitParser.add(BLACK, 33, \"000001101011\")\n    BitParser.add(BLACK, 34, \"000011010010\")\n    BitParser.add(BLACK, 35, \"000011010011\")\n    BitParser.add(BLACK, 36, \"000011010100\")\n    BitParser.add(BLACK, 37, \"000011010101\")\n    BitParser.add(BLACK, 38, \"000011010110\")\n    BitParser.add(BLACK, 39, \"000011010111\")\n    BitParser.add(BLACK, 40, \"000001101100\")\n    BitParser.add(BLACK, 41, \"000001101101\")\n    BitParser.add(BLACK, 42, \"000011011010\")\n    BitParser.add(BLACK, 43, \"000011011011\")\n    BitParser.add(BLACK, 44, \"000001010100\")\n    BitParser.add(BLACK, 45, \"000001010101\")\n    BitParser.add(BLACK, 46, \"000001010110\")\n    BitParser.add(BLACK, 47, 
\"000001010111\")\n    BitParser.add(BLACK, 48, \"000001100100\")\n    BitParser.add(BLACK, 49, \"000001100101\")\n    BitParser.add(BLACK, 50, \"000001010010\")\n    BitParser.add(BLACK, 51, \"000001010011\")\n    BitParser.add(BLACK, 52, \"000000100100\")\n    BitParser.add(BLACK, 53, \"000000110111\")\n    BitParser.add(BLACK, 54, \"000000111000\")\n    BitParser.add(BLACK, 55, \"000000100111\")\n    BitParser.add(BLACK, 56, \"000000101000\")\n    BitParser.add(BLACK, 57, \"000001011000\")\n    BitParser.add(BLACK, 58, \"000001011001\")\n    BitParser.add(BLACK, 59, \"000000101011\")\n    BitParser.add(BLACK, 60, \"000000101100\")\n    BitParser.add(BLACK, 61, \"000001011010\")\n    BitParser.add(BLACK, 62, \"000001100110\")\n    BitParser.add(BLACK, 63, \"000001100111\")\n    BitParser.add(BLACK, 64, \"0000001111\")\n    BitParser.add(BLACK, 128, \"000011001000\")\n    BitParser.add(BLACK, 192, \"000011001001\")\n    BitParser.add(BLACK, 256, \"000001011011\")\n    BitParser.add(BLACK, 320, \"000000110011\")\n    BitParser.add(BLACK, 384, \"000000110100\")\n    BitParser.add(BLACK, 448, \"000000110101\")\n    BitParser.add(BLACK, 512, \"0000001101100\")\n    BitParser.add(BLACK, 576, \"0000001101101\")\n    BitParser.add(BLACK, 640, \"0000001001010\")\n    BitParser.add(BLACK, 704, \"0000001001011\")\n    BitParser.add(BLACK, 768, \"0000001001100\")\n    BitParser.add(BLACK, 832, \"0000001001101\")\n    BitParser.add(BLACK, 896, \"0000001110010\")\n    BitParser.add(BLACK, 960, \"0000001110011\")\n    BitParser.add(BLACK, 1024, \"0000001110100\")\n    BitParser.add(BLACK, 1088, \"0000001110101\")\n    BitParser.add(BLACK, 1152, \"0000001110110\")\n    BitParser.add(BLACK, 1216, \"0000001110111\")\n    BitParser.add(BLACK, 1280, \"0000001010010\")\n    BitParser.add(BLACK, 1344, \"0000001010011\")\n    BitParser.add(BLACK, 1408, \"0000001010100\")\n    BitParser.add(BLACK, 1472, \"0000001010101\")\n    BitParser.add(BLACK, 1536, \"0000001011010\")\n    
BitParser.add(BLACK, 1600, \"0000001011011\")\n    BitParser.add(BLACK, 1664, \"0000001100100\")\n    BitParser.add(BLACK, 1728, \"0000001100101\")\n    BitParser.add(BLACK, 1792, \"00000001000\")\n    BitParser.add(BLACK, 1856, \"00000001100\")\n    BitParser.add(BLACK, 1920, \"00000001101\")\n    BitParser.add(BLACK, 1984, \"000000010010\")\n    BitParser.add(BLACK, 2048, \"000000010011\")\n    BitParser.add(BLACK, 2112, \"000000010100\")\n    BitParser.add(BLACK, 2176, \"000000010101\")\n    BitParser.add(BLACK, 2240, \"000000010110\")\n    BitParser.add(BLACK, 2304, \"000000010111\")\n    BitParser.add(BLACK, 2368, \"000000011100\")\n    BitParser.add(BLACK, 2432, \"000000011101\")\n    BitParser.add(BLACK, 2496, \"000000011110\")\n    BitParser.add(BLACK, 2560, \"000000011111\")\n\n    UNCOMPRESSED = [None, None]\n    BitParser.add(UNCOMPRESSED, \"1\", \"1\")\n    BitParser.add(UNCOMPRESSED, \"01\", \"01\")\n    BitParser.add(UNCOMPRESSED, \"001\", \"001\")\n    BitParser.add(UNCOMPRESSED, \"0001\", \"0001\")\n    BitParser.add(UNCOMPRESSED, \"00001\", \"00001\")\n    BitParser.add(UNCOMPRESSED, \"00000\", \"000001\")\n    BitParser.add(UNCOMPRESSED, \"T00\", \"00000011\")\n    BitParser.add(UNCOMPRESSED, \"T10\", \"00000010\")\n    BitParser.add(UNCOMPRESSED, \"T000\", \"000000011\")\n    BitParser.add(UNCOMPRESSED, \"T100\", \"000000010\")\n    BitParser.add(UNCOMPRESSED, \"T0000\", \"0000000011\")\n    BitParser.add(UNCOMPRESSED, \"T1000\", \"0000000010\")\n    BitParser.add(UNCOMPRESSED, \"T00000\", \"00000000011\")\n    BitParser.add(UNCOMPRESSED, \"T10000\", \"00000000010\")\n\n    class CCITTException(PDFException):\n        pass\n\n    class EOFB(CCITTException):\n        pass\n\n    class InvalidData(CCITTException):\n        pass\n\n    class ByteSkip(CCITTException):\n        pass\n\n    _color: int\n\n    def __init__(self, width: int, bytealign: bool = False) -> None:\n        BitParser.__init__(self)\n        self.width = width\n        
self.bytealign = bytealign\n        self.reset()\n\n    def feedbytes(self, data: bytes) -> None:\n        for byte in get_bytes(data):\n            try:\n                for m in (128, 64, 32, 16, 8, 4, 2, 1):\n                    self._parse_bit(byte & m)\n            except self.ByteSkip:\n                self._accept = self._parse_mode\n                self._state = self.MODE\n            except self.EOFB:\n                break\n\n    def _parse_mode(self, mode: object) -> BitParserState:\n        if mode == \"p\":\n            self._do_pass()\n            self._flush_line()\n            return self.MODE\n        elif mode == \"h\":\n            self._n1 = 0\n            self._accept = self._parse_horiz1\n            if self._color:\n                return self.WHITE\n            else:\n                return self.BLACK\n        elif mode == \"u\":\n            self._accept = self._parse_uncompressed\n            return self.UNCOMPRESSED\n        elif mode == \"e\":\n            raise self.EOFB\n        elif isinstance(mode, int):\n            self._do_vertical(mode)\n            self._flush_line()\n            return self.MODE\n        else:\n            raise self.InvalidData(mode)\n\n    def _parse_horiz1(self, n: Any) -> BitParserState:\n        if n is None:\n            raise self.InvalidData\n        self._n1 += n\n        if n < 64:\n            self._n2 = 0\n            self._color = 1 - self._color\n            self._accept = self._parse_horiz2\n        if self._color:\n            return self.WHITE\n        else:\n            return self.BLACK\n\n    def _parse_horiz2(self, n: Any) -> BitParserState:\n        if n is None:\n            raise self.InvalidData\n        self._n2 += n\n        if n < 64:\n            self._color = 1 - self._color\n            self._accept = self._parse_mode\n            self._do_horizontal(self._n1, self._n2)\n            self._flush_line()\n            return self.MODE\n        elif self._color:\n            return 
self.WHITE\n        else:\n            return self.BLACK\n\n    def _parse_uncompressed(self, bits: str | None) -> BitParserState:\n        if not bits:\n            raise self.InvalidData\n        if bits.startswith(\"T\"):\n            self._accept = self._parse_mode\n            self._color = int(bits[1])\n            self._do_uncompressed(bits[2:])\n            return self.MODE\n        else:\n            self._do_uncompressed(bits)\n            return self.UNCOMPRESSED\n\n    def _get_bits(self) -> str:\n        return \"\".join(str(b) for b in self._curline[: self._curpos])\n\n    def _get_refline(self, i: int) -> str:\n        if i < 0:\n            return \"[]\" + \"\".join(str(b) for b in self._refline)\n        elif len(self._refline) <= i:\n            return \"\".join(str(b) for b in self._refline) + \"[]\"\n        else:\n            return (\n                \"\".join(str(b) for b in self._refline[:i])\n                + \"[\"\n                + str(self._refline[i])\n                + \"]\"\n                + \"\".join(str(b) for b in self._refline[i + 1 :])\n            )\n\n    def reset(self) -> None:\n        self._y = 0\n        self._curline = array.array(\"b\", [1] * self.width)\n        self._reset_line()\n        self._accept = self._parse_mode\n        self._state = self.MODE\n\n    def output_line(self, y: int, bits: Sequence[int]) -> None:\n        print(y, \"\".join(str(b) for b in bits))\n\n    def _reset_line(self) -> None:\n        self._refline = self._curline\n        self._curline = array.array(\"b\", [1] * self.width)\n        self._curpos = -1\n        self._color = 1\n\n    def _flush_line(self) -> None:\n        if self.width <= self._curpos:\n            self.output_line(self._y, self._curline)\n            self._y += 1\n            self._reset_line()\n            if self.bytealign:\n                raise self.ByteSkip\n\n    def _do_vertical(self, dx: int) -> None:\n        x1 = self._curpos + 1\n        while 1:\n            
if x1 == 0:\n                if self._color == 1 and self._refline[x1] != self._color:\n                    break\n            elif x1 == len(self._refline) or (\n                self._refline[x1 - 1] == self._color\n                and self._refline[x1] != self._color\n            ):\n                break\n            x1 += 1\n        x1 += dx\n        x0 = max(0, self._curpos)\n        x1 = max(0, min(self.width, x1))\n        if x1 < x0:\n            for x in range(x1, x0):\n                self._curline[x] = self._color\n        elif x0 < x1:\n            for x in range(x0, x1):\n                self._curline[x] = self._color\n        self._curpos = x1\n        self._color = 1 - self._color\n\n    def _do_pass(self) -> None:\n        x1 = self._curpos + 1\n        while 1:\n            if x1 == 0:\n                if self._color == 1 and self._refline[x1] != self._color:\n                    break\n            elif x1 == len(self._refline) or (\n                self._refline[x1 - 1] == self._color\n                and self._refline[x1] != self._color\n            ):\n                break\n            x1 += 1\n        while 1:\n            if x1 == 0:\n                if self._color == 0 and self._refline[x1] == self._color:\n                    break\n            elif x1 == len(self._refline) or (\n                self._refline[x1 - 1] != self._color\n                and self._refline[x1] == self._color\n            ):\n                break\n            x1 += 1\n        for x in range(self._curpos, x1):\n            self._curline[x] = self._color\n        self._curpos = x1\n\n    def _do_horizontal(self, n1: int, n2: int) -> None:\n        if self._curpos < 0:\n            self._curpos = 0\n        x = self._curpos\n        for _ in range(n1):\n            if len(self._curline) <= x:\n                break\n            self._curline[x] = self._color\n            x += 1\n        for _ in range(n2):\n            if len(self._curline) <= x:\n                
break\n            self._curline[x] = 1 - self._color\n            x += 1\n        self._curpos = x\n\n    def _do_uncompressed(self, bits: str) -> None:\n        for c in bits:\n            self._curline[self._curpos] = int(c)\n            self._curpos += 1\n            self._flush_line()\n\n\nclass CCITTFaxDecoder(CCITTG4Parser):\n    def __init__(\n        self,\n        width: int,\n        bytealign: bool = False,\n        reversed: bool = False,\n    ) -> None:\n        CCITTG4Parser.__init__(self, width, bytealign=bytealign)\n        self.reversed = reversed\n        self._buf = b\"\"\n\n    def close(self) -> bytes:\n        return self._buf\n\n    def output_line(self, y: int, bits: Sequence[int]) -> None:\n        arr = array.array(\"B\", [0] * ((len(bits) + 7) // 8))\n        if self.reversed:\n            bits = [1 - b for b in bits]\n        for i, b in enumerate(bits):\n            if b:\n                arr[i // 8] += (128, 64, 32, 16, 8, 4, 2, 1)[i % 8]\n        self._buf += arr.tobytes()\n\n\ndef ccittfaxdecode(data: bytes, params: dict[str, object]) -> bytes:\n    K = params.get(\"K\")\n    if K == -1:\n        cols = cast(int, params.get(\"Columns\"))\n        bytealign = cast(bool, params.get(\"EncodedByteAlign\"))\n        reversed = cast(bool, params.get(\"BlackIs1\"))\n        parser = CCITTFaxDecoder(cols, bytealign=bytealign, reversed=reversed)\n    else:\n        raise PDFValueError(K)\n    parser.feedbytes(data)\n    return parser.close()\n\n\n# test\ndef main(argv: list[str]) -> None:\n    if not argv[1:]:\n        import unittest\n\n        unittest.main()\n        return\n\n    class Parser(CCITTG4Parser):\n        def __init__(self, width: int, bytealign: bool = False) -> None:\n            import pygame  # type: ignore[import]\n\n            CCITTG4Parser.__init__(self, width, bytealign=bytealign)\n            self.img = pygame.Surface((self.width, 1000))\n\n        def output_line(self, y: int, bits: Sequence[int]) -> None:\n        
    for x, b in enumerate(bits):\n                if b:\n                    self.img.set_at((x, y), (255, 255, 255))\n                else:\n                    self.img.set_at((x, y), (0, 0, 0))\n\n        def close(self) -> None:\n            import pygame\n\n            pygame.image.save(self.img, \"out.bmp\")\n\n    for path in argv[1:]:\n        fp = open(path, \"rb\")\n        (_, _, k, w, h, _) = path.split(\".\")\n        parser = Parser(int(w))\n        parser.feedbytes(fp.read())\n        parser.close()\n        fp.close()\n"
  },
  {
    "path": "babeldoc/pdfminer/cmap/README.txt",
    "content": "README.txt for cmap\n\nThis directory contains *.pickle.gz files converted from Adobe CMap resources.\nCMaps are required to decode text data written in CJK (Chinese, Japanese,\nKorean) language.  CMap resources are now available freely from Adobe web site:\nhttp://opensource.adobe.com/wiki/display/cmap/CMap+Resources\n\nThe follwing files were extracted from the downloadable tarballs:\n\ncid2code_Adobe_CNS1.txt:\n\thttp://download.macromedia.com/pub/opensource/cmap/cmapresources_cns1-6.tar.z\n\ncid2code_Adobe_GB1.txt:\n\thttp://download.macromedia.com/pub/opensource/cmap/cmapresources_gb1-5.tar.z\n\ncid2code_Adobe_Japan1.txt:\n\thttp://download.macromedia.com/pub/opensource/cmap/cmapresources_japan1-6.tar.z\n\ncid2code_Adobe_Korea1.txt:\n\thttp://download.macromedia.com/pub/opensource/cmap/cmapresources_korean1-2.tar.z\n\n\nThese *.pickle.gz files can be generated by running following commands in the\ntop directory:\n\n    $ make cmap\n    python tools/conv_cmap.py pdfminer/cmap Adobe-CNS1 cmaprsrc/cid2code_Adobe_CNS1.txt\n    reading 'cmaprsrc/cid2code_Adobe_CNS1.txt'...\n    writing 'CNS1_H.py'...\n    ...\n\nOn Windows machines which don't have `make` command,\npaste the following commands on a command line prompt:\n\n    mkdir pdfminer\\cmap\n    python tools\\conv_cmap.py -c B5=cp950 -c UniCNS-UTF8=utf-8 pdfminer\\cmap Adobe-CNS1 cmaprsrc\\cid2code_Adobe_CNS1.txt\n    python tools\\conv_cmap.py -c GBK-EUC=cp936 -c UniGB-UTF8=utf-8 pdfminer\\cmap Adobe-GB1 cmaprsrc\\cid2code_Adobe_GB1.txt\n    python tools\\conv_cmap.py -c RKSJ=cp932 -c EUC=euc-jp -c UniJIS-UTF8=utf-8 pdfminer\\cmap Adobe-Japan1 cmaprsrc\\cid2code_Adobe_Japan1.txt\n    python tools\\conv_cmap.py -c KSC-EUC=euc-kr -c KSC-Johab=johab -c KSCms-UHC=cp949 -c UniKS-UTF8=utf-8 pdfminer\\cmap Adobe-Korea1 cmaprsrc\\cid2code_Adobe_Korea1.txt\n\n\nHere is the license information in the original files:\n\n%%Copyright: 
-----------------------------------------------------------\n%%Copyright: Copyright 1990-20xx Adobe Systems Incorporated.\n%%Copyright: All rights reserved.\n%%Copyright:\n%%Copyright: Redistribution and use in source and binary forms, with or\n%%Copyright: without modification, are permitted provided that the\n%%Copyright: following conditions are met:\n%%Copyright:\n%%Copyright: Redistributions of source code must retain the above\n%%Copyright: copyright notice, this list of conditions and the following\n%%Copyright: disclaimer.\n%%Copyright:\n%%Copyright: Redistributions in binary form must reproduce the above\n%%Copyright: copyright notice, this list of conditions and the following\n%%Copyright: disclaimer in the documentation and/or other materials\n%%Copyright: provided with the distribution.\n%%Copyright:\n%%Copyright: Neither the name of Adobe Systems Incorporated nor the names\n%%Copyright: of its contributors may be used to endorse or promote\n%%Copyright: products derived from this software without specific prior\n%%Copyright: written permission.\n%%Copyright:\n%%Copyright: THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND\n%%Copyright: CONTRIBUTORS \"AS IS\" AND ANY EXPRESS OR IMPLIED WARRANTIES,\n%%Copyright: INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF\n%%Copyright: MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\n%%Copyright: DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR\n%%Copyright: CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,\n%%Copyright: SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT\n%%Copyright: NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;\n%%Copyright: LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)\n%%Copyright: HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN\n%%Copyright: CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR\n%%Copyright: OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS\n%%Copyright: SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n%%Copyright: -----------------------------------------------------------\n"
  },
  {
    "path": "babeldoc/pdfminer/cmapdb.py",
    "content": "\"\"\"Adobe character mapping (CMap) support.\n\nCMaps provide the mapping between character codes and Unicode\ncode-points to character ids (CIDs).\n\nMore information is available on:\n\n  https://github.com/adobe-type-tools/cmap-resources\n\n\"\"\"\n\nimport gzip\nimport logging\nimport os\nimport os.path\nimport pickle as pickle\nimport struct\nimport sys\nfrom collections.abc import Iterable\nfrom collections.abc import Iterator\nfrom collections.abc import MutableMapping\nfrom typing import Any\nfrom typing import BinaryIO\nfrom typing import TextIO\nfrom typing import cast\n\nfrom babeldoc.pdfminer.encodingdb import name2unicode\nfrom babeldoc.pdfminer.pdfexceptions import PDFException\nfrom babeldoc.pdfminer.pdfexceptions import PDFTypeError\nfrom babeldoc.pdfminer.psexceptions import PSEOF\nfrom babeldoc.pdfminer.psexceptions import PSSyntaxError\nfrom babeldoc.pdfminer.psparser import KWD\nfrom babeldoc.pdfminer.psparser import PSKeyword\nfrom babeldoc.pdfminer.psparser import PSLiteral\nfrom babeldoc.pdfminer.psparser import PSStackParser\nfrom babeldoc.pdfminer.psparser import literal_name\nfrom babeldoc.pdfminer.utils import choplist\nfrom babeldoc.pdfminer.utils import nunpack\n\nlog = logging.getLogger(__name__)\n\n\nclass CMapError(PDFException):\n    pass\n\n\nclass CMapBase:\n    debug = 0\n\n    def __init__(self, **kwargs: object) -> None:\n        self.attrs: MutableMapping[str, object] = kwargs.copy()\n\n    def is_vertical(self) -> bool:\n        return self.attrs.get(\"WMode\", 0) != 0\n\n    def set_attr(self, k: str, v: object) -> None:\n        self.attrs[k] = v\n\n    def add_code2cid(self, code: str, cid: int) -> None:\n        pass\n\n    def add_cid2unichr(self, cid: int, code: PSLiteral | bytes | int) -> None:\n        pass\n\n    def use_cmap(self, cmap: \"CMapBase\") -> None:\n        pass\n\n    def decode(self, code: bytes) -> Iterable[int]:\n        raise NotImplementedError\n\n\nclass CMap(CMapBase):\n    def 
__init__(self, **kwargs: str | int) -> None:\n        CMapBase.__init__(self, **kwargs)\n        self.code2cid: dict[int, object] = {}\n\n    def __repr__(self) -> str:\n        return \"<CMap: %s>\" % self.attrs.get(\"CMapName\")\n\n    def use_cmap(self, cmap: CMapBase) -> None:\n        assert isinstance(cmap, CMap), str(type(cmap))\n\n        def copy(dst: dict[int, object], src: dict[int, object]) -> None:\n            for k, v in src.items():\n                if isinstance(v, dict):\n                    d: dict[int, object] = {}\n                    dst[k] = d\n                    copy(d, v)\n                else:\n                    dst[k] = v\n\n        copy(self.code2cid, cmap.code2cid)\n\n    def decode(self, code: bytes) -> Iterator[int]:\n        log.debug(\"decode: %r, %r\", self, code)\n        d = self.code2cid\n        for i in iter(code):\n            if i in d:\n                x = d[i]\n                if isinstance(x, int):\n                    yield x\n                    d = self.code2cid\n                else:\n                    d = cast(dict[int, object], x)\n            else:\n                d = self.code2cid\n\n    def dump(\n        self,\n        out: TextIO = sys.stdout,\n        code2cid: dict[int, object] | None = None,\n        code: tuple[int, ...] 
= (),\n    ) -> None:\n        if code2cid is None:\n            code2cid = self.code2cid\n            code = ()\n        for k, v in sorted(code2cid.items()):\n            c = code + (k,)\n            if isinstance(v, int):\n                out.write(\"code %r = cid %d\\n\" % (c, v))\n            else:\n                self.dump(out=out, code2cid=cast(dict[int, object], v), code=c)\n\n\nclass IdentityCMap(CMapBase):\n    def decode(self, code: bytes) -> tuple[int, ...]:\n        n = len(code) // 2\n        if n:\n            return struct.unpack_from(f\">{n}H\", code)\n        else:\n            return ()\n\n\nclass IdentityCMapByte(IdentityCMap):\n    def decode(self, code: bytes) -> tuple[int, ...]:\n        n = len(code)\n        if n:\n            return struct.unpack(\">%dB\" % n, code)\n        else:\n            return ()\n\n\nclass UnicodeMap(CMapBase):\n    def __init__(self, **kwargs: str | int) -> None:\n        CMapBase.__init__(self, **kwargs)\n        self.cid2unichr: dict[int, str] = {}\n\n    def __repr__(self) -> str:\n        return \"<UnicodeMap: %s>\" % self.attrs.get(\"CMapName\")\n\n    def get_unichr(self, cid: int) -> str:\n        log.debug(\"get_unichr: %r, %r\", self, cid)\n        return self.cid2unichr[cid]\n\n    def dump(self, out: TextIO = sys.stdout) -> None:\n        for k, v in sorted(self.cid2unichr.items()):\n            out.write(\"cid %d = unicode %r\\n\" % (k, v))\n\n\nclass IdentityUnicodeMap(UnicodeMap):\n    def get_unichr(self, cid: int) -> str:\n        \"\"\"Interpret character id as unicode codepoint\"\"\"\n        log.debug(\"get_unichr: %r, %r\", self, cid)\n        return chr(cid)\n\n\nclass FileCMap(CMap):\n    def add_code2cid(self, code: str, cid: int) -> None:\n        assert isinstance(code, str) and isinstance(cid, int), str(\n            (type(code), type(cid)),\n        )\n        d = self.code2cid\n        for c in code[:-1]:\n            ci = ord(c)\n            if ci in d:\n                d = 
cast(dict[int, object], d[ci])\n            else:\n                t: dict[int, object] = {}\n                d[ci] = t\n                d = t\n        ci = ord(code[-1])\n        d[ci] = cid\n\n\nclass FileUnicodeMap(UnicodeMap):\n    def add_cid2unichr(self, cid: int, code: PSLiteral | bytes | int) -> None:\n        assert isinstance(cid, int), str(type(cid))\n        if isinstance(code, PSLiteral):\n            # Interpret as an Adobe glyph name.\n            assert isinstance(code.name, str)\n            unichr = name2unicode(code.name)\n        elif isinstance(code, bytes):\n            # Interpret as UTF-16BE.\n            unichr = code.decode(\"UTF-16BE\", \"ignore\")\n        elif isinstance(code, int):\n            unichr = chr(code)\n        else:\n            raise PDFTypeError(code)\n\n        # A0 = non-breaking space, some weird fonts can have a collision on a cid here.\n        if unichr == \"\\u00a0\" and self.cid2unichr.get(cid) == \" \":\n            return\n        self.cid2unichr[cid] = unichr\n\n\nclass PyCMap(CMap):\n    def __init__(self, name: str, module: Any) -> None:\n        super().__init__(CMapName=name)\n        self.code2cid = module.CODE2CID\n        if module.IS_VERTICAL:\n            self.attrs[\"WMode\"] = 1\n\n\nclass PyUnicodeMap(UnicodeMap):\n    def __init__(self, name: str, module: Any, vertical: bool) -> None:\n        super().__init__(CMapName=name)\n        if vertical:\n            self.cid2unichr = module.CID2UNICHR_V\n            self.attrs[\"WMode\"] = 1\n        else:\n            self.cid2unichr = module.CID2UNICHR_H\n\n\nclass CMapDB:\n    _cmap_cache: dict[str, PyCMap] = {}\n    _umap_cache: dict[str, list[PyUnicodeMap]] = {}\n\n    class CMapNotFound(CMapError):\n        pass\n\n    @classmethod\n    def _load_data(cls, name: str) -> Any:\n        name = name.replace(\"\\0\", \"\")\n        filename = \"%s.pickle.gz\" % name\n        log.debug(\"loading: %r\", name)\n        cmap_paths = (\n            
os.environ.get(\"CMAP_PATH\", \"/usr/share/pdfminer/\"),\n            os.path.join(os.path.dirname(__file__), \"cmap\"),\n        )\n        for directory in cmap_paths:\n            path = os.path.join(directory, filename)\n            if os.path.exists(path):\n                gzfile = gzip.open(path)\n                try:\n                    return type(str(name), (), pickle.loads(gzfile.read()))\n                finally:\n                    gzfile.close()\n        raise CMapDB.CMapNotFound(name)\n\n    @classmethod\n    def get_cmap(cls, name: str) -> CMapBase:\n        if name == \"Identity-H\":\n            return IdentityCMap(WMode=0)\n        elif name == \"Identity-V\":\n            return IdentityCMap(WMode=1)\n        elif name == \"OneByteIdentityH\":\n            return IdentityCMapByte(WMode=0)\n        elif name == \"OneByteIdentityV\":\n            return IdentityCMapByte(WMode=1)\n        try:\n            return cls._cmap_cache[name]\n        except KeyError:\n            pass\n        data = cls._load_data(name)\n        cls._cmap_cache[name] = cmap = PyCMap(name, data)\n        return cmap\n\n    @classmethod\n    def get_unicode_map(cls, name: str, vertical: bool = False) -> UnicodeMap:\n        try:\n            return cls._umap_cache[name][vertical]\n        except KeyError:\n            pass\n        data = cls._load_data(\"to-unicode-%s\" % name)\n        cls._umap_cache[name] = [PyUnicodeMap(name, data, v) for v in (False, True)]\n        return cls._umap_cache[name][vertical]\n\n\nclass CMapParser(PSStackParser[PSKeyword]):\n    def __init__(self, cmap: CMapBase, fp: BinaryIO) -> None:\n        PSStackParser.__init__(self, fp)\n        self.cmap = cmap\n        # some ToUnicode maps don't have \"begincmap\" keyword.\n        self._in_cmap = True\n        self._warnings: set[str] = set()\n\n    def run(self) -> None:\n        try:\n            self.nextobject()\n        except PSEOF:\n            pass\n\n    KEYWORD_BEGINCMAP = 
KWD(b\"begincmap\")\n    KEYWORD_ENDCMAP = KWD(b\"endcmap\")\n    KEYWORD_USECMAP = KWD(b\"usecmap\")\n    KEYWORD_DEF = KWD(b\"def\")\n    KEYWORD_BEGINCODESPACERANGE = KWD(b\"begincodespacerange\")\n    KEYWORD_ENDCODESPACERANGE = KWD(b\"endcodespacerange\")\n    KEYWORD_BEGINCIDRANGE = KWD(b\"begincidrange\")\n    KEYWORD_ENDCIDRANGE = KWD(b\"endcidrange\")\n    KEYWORD_BEGINCIDCHAR = KWD(b\"begincidchar\")\n    KEYWORD_ENDCIDCHAR = KWD(b\"endcidchar\")\n    KEYWORD_BEGINBFRANGE = KWD(b\"beginbfrange\")\n    KEYWORD_ENDBFRANGE = KWD(b\"endbfrange\")\n    KEYWORD_BEGINBFCHAR = KWD(b\"beginbfchar\")\n    KEYWORD_ENDBFCHAR = KWD(b\"endbfchar\")\n    KEYWORD_BEGINNOTDEFRANGE = KWD(b\"beginnotdefrange\")\n    KEYWORD_ENDNOTDEFRANGE = KWD(b\"endnotdefrange\")\n\n    def do_keyword(self, pos: int, token: PSKeyword) -> None:\n        \"\"\"ToUnicode CMaps\n\n        See Section 5.9.2 - ToUnicode CMaps of the PDF Reference.\n        \"\"\"\n        if token is self.KEYWORD_BEGINCMAP:\n            self._in_cmap = True\n            self.popall()\n            return\n\n        elif token is self.KEYWORD_ENDCMAP:\n            self._in_cmap = False\n            return\n\n        if not self._in_cmap:\n            return\n\n        if token is self.KEYWORD_DEF:\n            try:\n                ((_, k), (_, v)) = self.pop(2)\n                self.cmap.set_attr(literal_name(k), v)\n            except PSSyntaxError:\n                pass\n            return\n\n        if token is self.KEYWORD_USECMAP:\n            try:\n                ((_, cmapname),) = self.pop(1)\n                self.cmap.use_cmap(CMapDB.get_cmap(literal_name(cmapname)))\n            except PSSyntaxError:\n                pass\n            except CMapDB.CMapNotFound:\n                pass\n            return\n\n        if token is self.KEYWORD_BEGINCODESPACERANGE:\n            self.popall()\n            return\n        if token is self.KEYWORD_ENDCODESPACERANGE:\n            self.popall()\n            
return\n\n        if token is self.KEYWORD_BEGINCIDRANGE:\n            self.popall()\n            return\n\n        if token is self.KEYWORD_ENDCIDRANGE:\n            objs = [obj for (__, obj) in self.popall()]\n            for start_byte, end_byte, cid in choplist(3, objs):\n                if not isinstance(start_byte, bytes):\n                    self._warn_once(\"The start object of begincidrange is not a byte.\")\n                    continue\n                if not isinstance(end_byte, bytes):\n                    self._warn_once(\"The end object of begincidrange is not a byte.\")\n                    continue\n                if not isinstance(cid, int):\n                    self._warn_once(\"The cid object of begincidrange is not an int.\")\n                    continue\n                if len(start_byte) != len(end_byte):\n                    self._warn_once(\n                        \"The start and end byte of begincidrange have \"\n                        \"different lengths.\",\n                    )\n                    continue\n                start_prefix = start_byte[:-4]\n                end_prefix = end_byte[:-4]\n                if start_prefix != end_prefix:\n                    self._warn_once(\n                        \"The prefix of the start and end byte of \"\n                        \"begincidrange are not the same.\",\n                    )\n                    continue\n                svar = start_byte[-4:]\n                evar = end_byte[-4:]\n                start = nunpack(svar)\n                end = nunpack(evar)\n                vlen = len(svar)\n                for i in range(end - start + 1):\n                    x = start_prefix + struct.pack(\">L\", start + i)[-vlen:]\n                    self.cmap.add_cid2unichr(cid + i, x)\n            return\n\n        if token is self.KEYWORD_BEGINCIDCHAR:\n            self.popall()\n            return\n\n        if token is self.KEYWORD_ENDCIDCHAR:\n            objs = [obj for (__, obj) 
in self.popall()]\n            for cid, code in choplist(2, objs):\n                if isinstance(code, bytes) and isinstance(cid, int):\n                    self.cmap.add_cid2unichr(cid, code)\n            return\n\n        if token is self.KEYWORD_BEGINBFRANGE:\n            self.popall()\n            return\n\n        if token is self.KEYWORD_ENDBFRANGE:\n            objs = [obj for (__, obj) in self.popall()]\n            for start_byte, end_byte, code in choplist(3, objs):\n                if not isinstance(start_byte, bytes):\n                    self._warn_once(\"The start object is not a byte.\")\n                    continue\n                if not isinstance(end_byte, bytes):\n                    self._warn_once(\"The end object is not a byte.\")\n                    continue\n                if len(start_byte) != len(end_byte):\n                    self._warn_once(\"The start and end byte have different lengths.\")\n                    continue\n                start = nunpack(start_byte)\n                end = nunpack(end_byte)\n                if isinstance(code, list):\n                    if len(code) != end - start + 1:\n                        self._warn_once(\n                            \"The difference between the start and end \"\n                            \"offsets does not match the code length.\",\n                        )\n                    for cid, unicode_value in zip(\n                        range(start, end + 1), code, strict=False\n                    ):\n                        self.cmap.add_cid2unichr(cid, unicode_value)\n                else:\n                    assert isinstance(code, bytes)\n                    var = code[-4:]\n                    base = nunpack(var)\n                    prefix = code[:-4]\n                    vlen = len(var)\n                    for i in range(end - start + 1):\n                        x = prefix + struct.pack(\">L\", base + i)[-vlen:]\n                        self.cmap.add_cid2unichr(start 
+ i, x)\n            return\n\n        if token is self.KEYWORD_BEGINBFCHAR:\n            self.popall()\n            return\n\n        if token is self.KEYWORD_ENDBFCHAR:\n            objs = [obj for (__, obj) in self.popall()]\n            for cid, code in choplist(2, objs):\n                if isinstance(cid, bytes) and isinstance(code, bytes):\n                    self.cmap.add_cid2unichr(nunpack(cid), code)\n            return\n\n        if token is self.KEYWORD_BEGINNOTDEFRANGE:\n            self.popall()\n            return\n\n        if token is self.KEYWORD_ENDNOTDEFRANGE:\n            self.popall()\n            return\n\n        self.push((pos, token))\n\n    def _warn_once(self, msg: str) -> None:\n        \"\"\"Warn once for each unique message\"\"\"\n        if msg not in self._warnings:\n            self._warnings.add(msg)\n            base_msg = (\n                \"Ignoring (part of) ToUnicode map because the PDF data \"\n                \"does not conform to the format. This could result in \"\n                \"(cid) values in the output. \"\n            )\n            log.warning(base_msg + msg)\n"
  },
  {
    "path": "babeldoc/pdfminer/converter.py",
    "content": "import io\nimport logging\nimport re\nfrom collections.abc import Sequence\nfrom typing import BinaryIO\nfrom typing import Generic\nfrom typing import TextIO\nfrom typing import TypeVar\nfrom typing import cast\n\nfrom babeldoc.format.pdf.document_il import il_version_1\nfrom babeldoc.pdfminer.image import ImageWriter\nfrom babeldoc.pdfminer.layout import LAParams\nfrom babeldoc.pdfminer.layout import LTAnno\nfrom babeldoc.pdfminer.layout import LTChar\nfrom babeldoc.pdfminer.layout import LTComponent\nfrom babeldoc.pdfminer.layout import LTContainer\nfrom babeldoc.pdfminer.layout import LTCurve\nfrom babeldoc.pdfminer.layout import LTFigure\nfrom babeldoc.pdfminer.layout import LTImage\nfrom babeldoc.pdfminer.layout import LTItem\nfrom babeldoc.pdfminer.layout import LTLayoutContainer\nfrom babeldoc.pdfminer.layout import LTLine\nfrom babeldoc.pdfminer.layout import LTPage\nfrom babeldoc.pdfminer.layout import LTRect\nfrom babeldoc.pdfminer.layout import LTText\nfrom babeldoc.pdfminer.layout import LTTextBox\nfrom babeldoc.pdfminer.layout import LTTextBoxVertical\nfrom babeldoc.pdfminer.layout import LTTextGroup\nfrom babeldoc.pdfminer.layout import LTTextLine\nfrom babeldoc.pdfminer.layout import TextGroupElement\nfrom babeldoc.pdfminer.pdfcolor import PDFColorSpace\nfrom babeldoc.pdfminer.pdfdevice import PDFTextDevice\nfrom babeldoc.pdfminer.pdfexceptions import PDFValueError\nfrom babeldoc.pdfminer.pdffont import PDFFont\nfrom babeldoc.pdfminer.pdffont import PDFUnicodeNotDefined\nfrom babeldoc.pdfminer.pdfinterp import PDFGraphicState\nfrom babeldoc.pdfminer.pdfinterp import PDFResourceManager\nfrom babeldoc.pdfminer.pdfpage import PDFPage\nfrom babeldoc.pdfminer.pdftypes import PDFStream\nfrom babeldoc.pdfminer.utils import AnyIO\nfrom babeldoc.pdfminer.utils import Matrix\nfrom babeldoc.pdfminer.utils import PathSegment\nfrom babeldoc.pdfminer.utils import Point\nfrom babeldoc.pdfminer.utils import Rect\nfrom babeldoc.pdfminer.utils import 
apply_matrix_pt\nfrom babeldoc.pdfminer.utils import bbox2str\nfrom babeldoc.pdfminer.utils import enc\nfrom babeldoc.pdfminer.utils import make_compat_str\nfrom babeldoc.pdfminer.utils import mult_matrix\nfrom babeldoc.pdfminer import utils\n\nlog = logging.getLogger(__name__)\n\n\nclass PDFLayoutAnalyzer(PDFTextDevice):\n    cur_item: LTLayoutContainer\n    ctm: Matrix\n\n    def __init__(\n        self,\n        rsrcmgr: PDFResourceManager,\n        pageno: int = 1,\n        laparams: LAParams | None = None,\n    ) -> None:\n        PDFTextDevice.__init__(self, rsrcmgr)\n        self.pageno = pageno\n        self.laparams = laparams\n        self._stack: list[LTLayoutContainer] = []\n\n    def begin_page(self, page: PDFPage, ctm: Matrix) -> None:\n        (x0, y0, x1, y1) = page.mediabox\n        (x0, y0) = apply_matrix_pt(ctm, (x0, y0))\n        (x1, y1) = apply_matrix_pt(ctm, (x1, y1))\n        mediabox = (0, 0, abs(x0 - x1), abs(y0 - y1))\n        self.cur_item = LTPage(self.pageno, mediabox)\n\n    def end_page(self, page: PDFPage) -> None:\n        assert not self._stack, str(len(self._stack))\n        assert isinstance(self.cur_item, LTPage), str(type(self.cur_item))\n        if self.laparams is not None:\n            self.cur_item.analyze(self.laparams)\n        self.pageno += 1\n        self.receive_layout(self.cur_item)\n\n    def begin_figure(self, name: str, bbox: Rect, matrix: Matrix) -> None:\n        self._stack.append(self.cur_item)\n        self.cur_item = LTFigure(name, bbox, mult_matrix(matrix, self.ctm))\n\n    def end_figure(self, _: str) -> None:\n        fig = self.cur_item\n        assert isinstance(self.cur_item, LTFigure), str(type(self.cur_item))\n        self.cur_item = self._stack.pop()\n        self.cur_item.add(fig)\n\n    def render_image(self, name: str, stream: PDFStream) -> None:\n        assert isinstance(self.cur_item, LTFigure), str(type(self.cur_item))\n        item = LTImage(\n            name,\n            stream,\n        
    (self.cur_item.x0, self.cur_item.y0, self.cur_item.x1, self.cur_item.y1),\n        )\n        self.cur_item.add(item)\n\n    def paint_path(\n        self,\n        gstate: PDFGraphicState,\n        stroke: bool,\n        fill: bool,\n        evenodd: bool,\n        path: Sequence[PathSegment],\n    ) -> None:\n        \"\"\"Paint paths described in section 4.4 of the PDF reference manual\"\"\"\n        shape = \"\".join(x[0] for x in path)\n        current_clip_paths = self.il_creater.current_clip_paths.copy()\n        if shape[:1] != \"m\":\n            # Per PDF Reference Section 4.4.1, \"path construction operators may\n            # be invoked in any sequence, but the first one invoked must be m\n            # or re to begin a new subpath.\" Since pdfminer.six already\n            # converts all `re` (rectangle) operators to their equivalent\n            # `mlllh` representation, paths ingested by `.paint_path(...)` that\n            # do not begin with the `m` operator are invalid.\n            pass\n\n        # elif shape.count(\"m\") > 1:\n        #     # recurse if there are multiple m's in this shape\n        #     for m in re.finditer(r\"m[^m]+\", shape):\n        #         subpath = path[m.start(0) : m.end(0)]\n        #         self.paint_path(gstate, stroke, fill, evenodd, subpath)\n\n        else:\n            # Although the 'h' command does not literally provide a\n            # point-position, its position is (by definition) equal to the\n            # subpath's starting point.\n            #\n            # And, per Section 4.4's Table 4.9, all other path commands place\n            # their point-position in their final two arguments. 
(Any preceding\n            # arguments represent control points on Bézier curves.)\n            raw_pts = [\n                cast(Point, p[-2:] if p[0] != \"h\" else path[0][-2:]) for p in path\n            ]\n            pts = [apply_matrix_pt(self.ctm, pt) for pt in raw_pts]\n\n            operators = [str(operation[0]) for operation in path]\n            transformed_points = [\n                [\n                    apply_matrix_pt(self.ctm, (float(operand1), float(operand2)))\n                    for operand1, operand2 in zip(\n                        operation[1::2], operation[2::2], strict=False\n                    )\n                ]\n                for operation in path\n            ]\n            transformed_path = [\n                cast(PathSegment, (o, *p))\n                for o, p in zip(operators, transformed_points, strict=False)\n            ]\n\n            # Drop a redundant \"l\" on a path closed with \"h\"\n            if len(shape) > 3 and shape[-2:] == \"lh\" and pts[-2] == pts[0]:\n                shape = shape[:-2] + \"h\"\n                pts.pop()\n\n            passthrough_instruction = (\n                self.il_creater.passthrough_per_char_instruction.copy()\n            )\n            xobj_id = self.il_creater.xobj_id\n            if shape in {\"mlh\", \"ml\"}:\n                # single line segment\n                #\n                # Note: 'ml', in conditional above, is a frequent anomaly\n                # that we want to support.\n                line = LTLine(\n                    gstate.linewidth,\n                    pts[0],\n                    pts[1],\n                    stroke,\n                    fill,\n                    evenodd,\n                    gstate.scolor,\n                    gstate.ncolor,\n                    original_path=transformed_path,\n                    dashing_style=gstate.dash,\n                )\n                line.passthrough_instruction = passthrough_instruction\n                
line.xobj_id = xobj_id\n                line.render_order = self.il_creater.get_render_order_and_increase()\n                line.ctm = self.ctm\n                line.raw_path = path.copy()\n                line.clip_paths = current_clip_paths\n                self.cur_item.add(line)\n\n            elif shape in {\"mlllh\", \"mllll\"}:\n                (x0, y0), (x1, y1), (x2, y2), (x3, y3), _ = pts\n\n                is_closed_loop = pts[0] == pts[4]\n                has_square_coordinates = (\n                    x0 == x1 and y1 == y2 and x2 == x3 and y3 == y0\n                ) or (y0 == y1 and x1 == x2 and y2 == y3 and x3 == x0)\n                if is_closed_loop and has_square_coordinates:\n                    rect = LTRect(\n                        gstate.linewidth,\n                        (*pts[0], *pts[2]),\n                        stroke,\n                        fill,\n                        evenodd,\n                        gstate.scolor,\n                        gstate.ncolor,\n                        transformed_path,\n                        gstate.dash,\n                    )\n                    rect.passthrough_instruction = passthrough_instruction\n                    rect.xobj_id = xobj_id\n                    rect.render_order = self.il_creater.get_render_order_and_increase()\n                    rect.ctm = self.ctm\n                    rect.raw_path = path.copy()\n                    rect.clip_paths = current_clip_paths\n                    self.cur_item.add(rect)\n                else:\n                    curve = LTCurve(\n                        gstate.linewidth,\n                        pts,\n                        stroke,\n                        fill,\n                        evenodd,\n                        gstate.scolor,\n                        gstate.ncolor,\n                        transformed_path,\n                        gstate.dash,\n                    )\n                    curve.passthrough_instruction = 
passthrough_instruction\n                    curve.xobj_id = xobj_id\n                    curve.render_order = self.il_creater.get_render_order_and_increase()\n                    curve.ctm = self.ctm\n                    curve.raw_path = path.copy()\n                    curve.clip_paths = current_clip_paths\n                    self.cur_item.add(curve)\n            else:\n                curve = LTCurve(\n                    gstate.linewidth,\n                    pts,\n                    stroke,\n                    fill,\n                    evenodd,\n                    gstate.scolor,\n                    gstate.ncolor,\n                    transformed_path,\n                    gstate.dash,\n                )\n                curve.passthrough_instruction = passthrough_instruction\n                curve.xobj_id = xobj_id\n                curve.render_order = self.il_creater.get_render_order_and_increase()\n                curve.ctm = self.ctm\n                curve.raw_path = path.copy()\n                curve.clip_paths = current_clip_paths\n                self.cur_item.add(curve)\n\n    def render_char(\n        self,\n        matrix: Matrix,\n        font: PDFFont,\n        fontsize: float,\n        scaling: float,\n        rise: float,\n        cid: int,\n        ncs: PDFColorSpace,\n        graphicstate: PDFGraphicState,\n    ) -> float:\n        try:\n            text = font.to_unichr(cid)\n            assert isinstance(text, str), str(type(text))\n        except PDFUnicodeNotDefined:\n            text = self.handle_undefined_char(font, cid)\n        textwidth = font.char_width(cid)\n        textdisp = font.char_disp(cid)\n        item = LTChar(\n            matrix,\n            font,\n            fontsize,\n            scaling,\n            rise,\n            text,\n            textwidth,\n            textdisp,\n            ncs,\n            graphicstate,\n        )\n        self.cur_item.add(item)\n        return item.adv\n\n    def 
handle_undefined_char(self, font: PDFFont, cid: int) -> str:\n        log.debug(\"undefined: %r, %r\", font, cid)\n        return \"(cid:%d)\" % cid\n\n    def receive_layout(self, ltpage: LTPage) -> None:\n        pass\n\n\nclass PDFPageAggregator(PDFLayoutAnalyzer):\n    def __init__(\n        self,\n        rsrcmgr: PDFResourceManager,\n        pageno: int = 1,\n        laparams: LAParams | None = None,\n    ) -> None:\n        PDFLayoutAnalyzer.__init__(self, rsrcmgr, pageno=pageno, laparams=laparams)\n        self.result: LTPage | None = None\n\n    def receive_layout(self, ltpage: LTPage) -> None:\n        self.result = ltpage\n\n    def get_result(self) -> LTPage:\n        assert self.result is not None\n        return self.result\n\n\n# Some PDFConverter children support only binary I/O\nIOType = TypeVar(\"IOType\", TextIO, BinaryIO, AnyIO)\n\n\nclass PDFConverter(PDFLayoutAnalyzer, Generic[IOType]):\n    def __init__(\n        self,\n        rsrcmgr: PDFResourceManager,\n        outfp: IOType,\n        codec: str = \"utf-8\",\n        pageno: int = 1,\n        laparams: LAParams | None = None,\n    ) -> None:\n        PDFLayoutAnalyzer.__init__(self, rsrcmgr, pageno=pageno, laparams=laparams)\n        self.outfp: IOType = outfp\n        self.codec = codec\n        self.outfp_binary = self._is_binary_stream(self.outfp)\n\n    @staticmethod\n    def _is_binary_stream(outfp: AnyIO) -> bool:\n        \"\"\"Test if a stream is binary or not\"\"\"\n        if \"b\" in getattr(outfp, \"mode\", \"\"):\n            return True\n        elif hasattr(outfp, \"mode\"):\n            # output stream has a mode, but it does not contain 'b'\n            return False\n        elif isinstance(outfp, io.BytesIO):\n            return True\n        elif isinstance(outfp, io.StringIO) or isinstance(outfp, io.TextIOBase):\n            return False\n\n        return True\n\n\nclass TextConverter(PDFConverter[AnyIO]):\n    def __init__(\n        self,\n        rsrcmgr: 
PDFResourceManager,\n        outfp: AnyIO,\n        codec: str = \"utf-8\",\n        pageno: int = 1,\n        laparams: LAParams | None = None,\n        showpageno: bool = False,\n        imagewriter: ImageWriter | None = None,\n    ) -> None:\n        super().__init__(rsrcmgr, outfp, codec=codec, pageno=pageno, laparams=laparams)\n        self.showpageno = showpageno\n        self.imagewriter = imagewriter\n\n    def write_text(self, text: str) -> None:\n        text = utils.compatible_encode_method(text, self.codec, \"ignore\")\n        if self.outfp_binary:\n            cast(BinaryIO, self.outfp).write(text.encode())\n        else:\n            cast(TextIO, self.outfp).write(text)\n\n    def receive_layout(self, ltpage: LTPage) -> None:\n        def render(item: LTItem) -> None:\n            if isinstance(item, LTContainer):\n                for child in item:\n                    render(child)\n            elif isinstance(item, LTText):\n                self.write_text(item.get_text())\n            if isinstance(item, LTTextBox):\n                self.write_text(\"\\n\")\n            elif isinstance(item, LTImage):\n                if self.imagewriter is not None:\n                    self.imagewriter.export_image(item)\n\n        if self.showpageno:\n            self.write_text(\"Page %s\\n\" % ltpage.pageid)\n        render(ltpage)\n        self.write_text(\"\\f\")\n\n    # Some dummy functions to save memory/CPU when all that is wanted\n    # is text.  
This stops all the image and drawing output from being\n    # recorded and taking up RAM.\n    def render_image(self, name: str, stream: PDFStream) -> None:\n        if self.imagewriter is not None:\n            PDFConverter.render_image(self, name, stream)\n\n    def paint_path(\n        self,\n        gstate: PDFGraphicState,\n        stroke: bool,\n        fill: bool,\n        evenodd: bool,\n        path: Sequence[PathSegment],\n    ) -> None:\n        pass\n\n\nclass HTMLConverter(PDFConverter[AnyIO]):\n    RECT_COLORS = {\n        \"figure\": \"yellow\",\n        \"textline\": \"magenta\",\n        \"textbox\": \"cyan\",\n        \"textgroup\": \"red\",\n        \"curve\": \"black\",\n        \"page\": \"gray\",\n    }\n\n    TEXT_COLORS = {\n        \"textbox\": \"blue\",\n        \"char\": \"black\",\n    }\n\n    def __init__(\n        self,\n        rsrcmgr: PDFResourceManager,\n        outfp: AnyIO,\n        codec: str = \"utf-8\",\n        pageno: int = 1,\n        laparams: LAParams | None = None,\n        scale: float = 1,\n        fontscale: float = 1.0,\n        layoutmode: str = \"normal\",\n        showpageno: bool = True,\n        pagemargin: int = 50,\n        imagewriter: ImageWriter | None = None,\n        debug: int = 0,\n        rect_colors: dict[str, str] | None = None,\n        text_colors: dict[str, str] | None = None,\n    ) -> None:\n        PDFConverter.__init__(\n            self,\n            rsrcmgr,\n            outfp,\n            codec=codec,\n            pageno=pageno,\n            laparams=laparams,\n        )\n\n        # write() assumes a codec for binary I/O, or no codec for text I/O.\n        if self.outfp_binary and not self.codec:\n            raise PDFValueError(\"Codec is required for a binary I/O output\")\n        if not self.outfp_binary and self.codec:\n            raise PDFValueError(\"Codec must not be specified for a text I/O output\")\n\n        if text_colors is None:\n            text_colors = {\"char\": 
\"black\"}\n        if rect_colors is None:\n            rect_colors = {\"curve\": \"black\", \"page\": \"gray\"}\n\n        self.scale = scale\n        self.fontscale = fontscale\n        self.layoutmode = layoutmode\n        self.showpageno = showpageno\n        self.pagemargin = pagemargin\n        self.imagewriter = imagewriter\n        self.rect_colors = rect_colors\n        self.text_colors = text_colors\n        if debug:\n            self.rect_colors.update(self.RECT_COLORS)\n            self.text_colors.update(self.TEXT_COLORS)\n        self._yoffset: float = self.pagemargin\n        self._font: tuple[str, float] | None = None\n        self._fontstack: list[tuple[str, float] | None] = []\n        self.write_header()\n\n    def write(self, text: str) -> None:\n        if self.codec:\n            cast(BinaryIO, self.outfp).write(text.encode(self.codec))\n        else:\n            cast(TextIO, self.outfp).write(text)\n\n    def write_header(self) -> None:\n        self.write(\"<html><head>\\n\")\n        if self.codec:\n            s = (\n                '<meta http-equiv=\"Content-Type\" content=\"text/html; '\n                'charset=%s\">\\n' % self.codec\n            )\n        else:\n            s = '<meta http-equiv=\"Content-Type\" content=\"text/html\">\\n'\n        self.write(s)\n        self.write(\"</head><body>\\n\")\n\n    def write_footer(self) -> None:\n        page_links = [f'<a href=\"#{i}\">{i}</a>' for i in range(1, self.pageno)]\n        s = '<div style=\"position:absolute; top:0px;\">Page: %s</div>\\n' % \", \".join(\n            page_links,\n        )\n        self.write(s)\n        self.write(\"</body></html>\\n\")\n\n    def write_text(self, text: str) -> None:\n        self.write(enc(text))\n\n    def place_rect(\n        self,\n        color: str,\n        borderwidth: int,\n        x: float,\n        y: float,\n        w: float,\n        h: float,\n    ) -> None:\n        color2 = self.rect_colors.get(color)\n        if color2 is 
not None:\n            s = (\n                '<span style=\"position:absolute; border: %s %dpx solid; '\n                'left:%dpx; top:%dpx; width:%dpx; height:%dpx;\"></span>\\n'\n                % (\n                    color2,\n                    borderwidth,\n                    x * self.scale,\n                    (self._yoffset - y) * self.scale,\n                    w * self.scale,\n                    h * self.scale,\n                )\n            )\n            self.write(s)\n\n    def place_border(self, color: str, borderwidth: int, item: LTComponent) -> None:\n        self.place_rect(color, borderwidth, item.x0, item.y1, item.width, item.height)\n\n    def place_image(\n        self,\n        item: LTImage,\n        borderwidth: int,\n        x: float,\n        y: float,\n        w: float,\n        h: float,\n    ) -> None:\n        if self.imagewriter is not None:\n            name = self.imagewriter.export_image(item)\n            s = (\n                '<img src=\"%s\" border=\"%d\" style=\"position:absolute; '\n                'left:%dpx; top:%dpx;\" width=\"%d\" height=\"%d\" />\\n'\n                % (\n                    enc(name),\n                    borderwidth,\n                    x * self.scale,\n                    (self._yoffset - y) * self.scale,\n                    w * self.scale,\n                    h * self.scale,\n                )\n            )\n            self.write(s)\n\n    def place_text(\n        self,\n        color: str,\n        text: str,\n        x: float,\n        y: float,\n        size: float,\n    ) -> None:\n        color2 = self.text_colors.get(color)\n        if color2 is not None:\n            s = (\n                '<span style=\"position:absolute; color:%s; left:%dpx; '\n                'top:%dpx; font-size:%dpx;\">'\n                % (\n                    color2,\n                    x * self.scale,\n                    (self._yoffset - y) * self.scale,\n                    size * self.scale * 
self.fontscale,\n                )\n            )\n            self.write(s)\n            self.write_text(text)\n            self.write(\"</span>\\n\")\n\n    def begin_div(\n        self,\n        color: str,\n        borderwidth: int,\n        x: float,\n        y: float,\n        w: float,\n        h: float,\n        writing_mode: str = \"False\",\n    ) -> None:\n        self._fontstack.append(self._font)\n        self._font = None\n        s = (\n            '<div style=\"position:absolute; border: %s %dpx solid; '\n            \"writing-mode:%s; left:%dpx; top:%dpx; width:%dpx; \"\n            'height:%dpx;\">'\n            % (\n                color,\n                borderwidth,\n                writing_mode,\n                x * self.scale,\n                (self._yoffset - y) * self.scale,\n                w * self.scale,\n                h * self.scale,\n            )\n        )\n        self.write(s)\n\n    def end_div(self, color: str) -> None:\n        if self._font is not None:\n            self.write(\"</span>\")\n        self._font = self._fontstack.pop()\n        self.write(\"</div>\")\n\n    def put_text(self, text: str, fontname: str, fontsize: float) -> None:\n        font = (fontname, fontsize)\n        if font != self._font:\n            if self._font is not None:\n                self.write(\"</span>\")\n            # Remove subset tag from fontname, see PDF Reference 5.5.3\n            fontname_without_subset_tag = fontname.split(\"+\")[-1]\n            self.write(\n                '<span style=\"font-family: %s; font-size:%dpx\">'\n                % (fontname_without_subset_tag, fontsize * self.scale * self.fontscale),\n            )\n            self._font = font\n        self.write_text(text)\n\n    def put_newline(self) -> None:\n        self.write(\"<br>\")\n\n    def receive_layout(self, ltpage: LTPage) -> None:\n        def show_group(item: LTTextGroup | TextGroupElement) -> None:\n            if isinstance(item, LTTextGroup):\n      
          self.place_border(\"textgroup\", 1, item)\n                for child in item:\n                    show_group(child)\n\n        def render(item: LTItem) -> None:\n            child: LTItem\n            if isinstance(item, LTPage):\n                self._yoffset += item.y1\n                self.place_border(\"page\", 1, item)\n                if self.showpageno:\n                    self.write(\n                        '<div style=\"position:absolute; top:%dpx;\">'\n                        % ((self._yoffset - item.y1) * self.scale),\n                    )\n                    self.write(\n                        f'<a name=\"{item.pageid}\">Page {item.pageid}</a></div>\\n',\n                    )\n                for child in item:\n                    render(child)\n                if item.groups is not None:\n                    for group in item.groups:\n                        show_group(group)\n            elif isinstance(item, LTCurve):\n                self.place_border(\"curve\", 1, item)\n            elif isinstance(item, LTFigure):\n                self.begin_div(\"figure\", 1, item.x0, item.y1, item.width, item.height)\n                for child in item:\n                    render(child)\n                self.end_div(\"figure\")\n            elif isinstance(item, LTImage):\n                self.place_image(item, 1, item.x0, item.y1, item.width, item.height)\n            elif self.layoutmode == \"exact\":\n                if isinstance(item, LTTextLine):\n                    self.place_border(\"textline\", 1, item)\n                    for child in item:\n                        render(child)\n                elif isinstance(item, LTTextBox):\n                    self.place_border(\"textbox\", 1, item)\n                    self.place_text(\n                        \"textbox\",\n                        str(item.index + 1),\n                        item.x0,\n                        item.y1,\n                        20,\n                    )\n      
              for child in item:\n                        render(child)\n                elif isinstance(item, LTChar):\n                    self.place_border(\"char\", 1, item)\n                    self.place_text(\n                        \"char\",\n                        item.get_text(),\n                        item.x0,\n                        item.y1,\n                        item.size,\n                    )\n            elif isinstance(item, LTTextLine):\n                for child in item:\n                    render(child)\n                if self.layoutmode != \"loose\":\n                    self.put_newline()\n            elif isinstance(item, LTTextBox):\n                self.begin_div(\n                    \"textbox\",\n                    1,\n                    item.x0,\n                    item.y1,\n                    item.width,\n                    item.height,\n                    item.get_writing_mode(),\n                )\n                for child in item:\n                    render(child)\n                self.end_div(\"textbox\")\n            elif isinstance(item, LTChar):\n                fontname = make_compat_str(item.fontname)\n                self.put_text(item.get_text(), fontname, item.size)\n            elif isinstance(item, LTText):\n                self.write_text(item.get_text())\n\n        render(ltpage)\n        self._yoffset += self.pagemargin\n\n    def close(self) -> None:\n        self.write_footer()\n\n\nclass XMLConverter(PDFConverter[AnyIO]):\n    CONTROL = re.compile(\"[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]\")\n\n    def __init__(\n        self,\n        rsrcmgr: PDFResourceManager,\n        outfp: AnyIO,\n        codec: str = \"utf-8\",\n        pageno: int = 1,\n        laparams: LAParams | None = None,\n        imagewriter: ImageWriter | None = None,\n        stripcontrol: bool = False,\n    ) -> None:\n        PDFConverter.__init__(\n            self,\n            rsrcmgr,\n            outfp,\n            
codec=codec,\n            pageno=pageno,\n            laparams=laparams,\n        )\n\n        # write() assumes a codec for binary I/O, or no codec for text I/O.\n        if self.outfp_binary == (not self.codec):\n            raise PDFValueError(\"Codec is required for a binary I/O output\")\n\n        self.imagewriter = imagewriter\n        self.stripcontrol = stripcontrol\n        self.write_header()\n\n    def write(self, text: str) -> None:\n        if self.codec:\n            cast(BinaryIO, self.outfp).write(text.encode(self.codec))\n        else:\n            cast(TextIO, self.outfp).write(text)\n\n    def write_header(self) -> None:\n        if self.codec:\n            self.write('<?xml version=\"1.0\" encoding=\"%s\" ?>\\n' % self.codec)\n        else:\n            self.write('<?xml version=\"1.0\" ?>\\n')\n        self.write(\"<pages>\\n\")\n\n    def write_footer(self) -> None:\n        self.write(\"</pages>\\n\")\n\n    def write_text(self, text: str) -> None:\n        if self.stripcontrol:\n            text = self.CONTROL.sub(\"\", text)\n        self.write(enc(text))\n\n    def receive_layout(self, ltpage: LTPage) -> None:\n        def show_group(item: LTItem) -> None:\n            if isinstance(item, LTTextBox):\n                self.write(\n                    '<textbox id=\"%d\" bbox=\"%s\" />\\n'\n                    % (item.index, bbox2str(item.bbox)),\n                )\n            elif isinstance(item, LTTextGroup):\n                self.write('<textgroup bbox=\"%s\">\\n' % bbox2str(item.bbox))\n                for child in item:\n                    show_group(child)\n                self.write(\"</textgroup>\\n\")\n\n        def render(item: LTItem) -> None:\n            child: LTItem\n            if isinstance(item, LTPage):\n                s = '<page id=\"%s\" bbox=\"%s\" rotate=\"%d\">\\n' % (\n                    item.pageid,\n                    bbox2str(item.bbox),\n                    item.rotate,\n                )\n                
self.write(s)\n                for child in item:\n                    render(child)\n                if item.groups is not None:\n                    self.write(\"<layout>\\n\")\n                    for group in item.groups:\n                        show_group(group)\n                    self.write(\"</layout>\\n\")\n                self.write(\"</page>\\n\")\n            elif isinstance(item, LTLine):\n                s = '<line linewidth=\"%d\" bbox=\"%s\" />\\n' % (\n                    item.linewidth,\n                    bbox2str(item.bbox),\n                )\n                self.write(s)\n            elif isinstance(item, LTRect):\n                s = '<rect linewidth=\"%d\" bbox=\"%s\" />\\n' % (\n                    item.linewidth,\n                    bbox2str(item.bbox),\n                )\n                self.write(s)\n            elif isinstance(item, LTCurve):\n                s = '<curve linewidth=\"%d\" bbox=\"%s\" pts=\"%s\"/>\\n' % (\n                    item.linewidth,\n                    bbox2str(item.bbox),\n                    item.get_pts(),\n                )\n                self.write(s)\n            elif isinstance(item, LTFigure):\n                s = f'<figure name=\"{item.name}\" bbox=\"{bbox2str(item.bbox)}\">\\n'\n                self.write(s)\n                for child in item:\n                    render(child)\n                self.write(\"</figure>\\n\")\n            elif isinstance(item, LTTextLine):\n                self.write('<textline bbox=\"%s\">\\n' % bbox2str(item.bbox))\n                for child in item:\n                    render(child)\n                self.write(\"</textline>\\n\")\n            elif isinstance(item, LTTextBox):\n                wmode = \"\"\n                if isinstance(item, LTTextBoxVertical):\n                    wmode = ' wmode=\"vertical\"'\n                s = '<textbox id=\"%d\" bbox=\"%s\"%s>\\n' % (\n                    item.index,\n                    bbox2str(item.bbox),\n            
        wmode,\n                )\n                self.write(s)\n                for child in item:\n                    render(child)\n                self.write(\"</textbox>\\n\")\n            elif isinstance(item, LTChar):\n                s = (\n                    '<text font=\"%s\" bbox=\"%s\" colourspace=\"%s\" '\n                    'ncolour=\"%s\" size=\"%.3f\">'\n                    % (\n                        enc(item.fontname),\n                        bbox2str(item.bbox),\n                        item.ncs.name,\n                        item.graphicstate.ncolor,\n                        item.size,\n                    )\n                )\n                self.write(s)\n                self.write_text(item.get_text())\n                self.write(\"</text>\\n\")\n            elif isinstance(item, LTText):\n                self.write(\"<text>%s</text>\\n\" % item.get_text())\n            elif isinstance(item, LTImage):\n                if self.imagewriter is not None:\n                    name = self.imagewriter.export_image(item)\n                    self.write(\n                        '<image src=\"%s\" width=\"%d\" height=\"%d\" />\\n'\n                        % (enc(name), item.width, item.height),\n                    )\n                else:\n                    self.write(\n                        '<image width=\"%d\" height=\"%d\" />\\n'\n                        % (item.width, item.height),\n                    )\n            else:\n                assert False, str((\"Unhandled\", item))\n\n        render(ltpage)\n\n    def close(self) -> None:\n        self.write_footer()\n\n\nclass HOCRConverter(PDFConverter[AnyIO]):\n    \"\"\"Extract an hOCR representation from explicit text information within a PDF.\"\"\"\n\n    #   Where text is being extracted from a variety of types of PDF within a\n    #   business process, those PDFs where the text is only present in image\n    #   form will need to be analysed using an OCR tool which will 
typically\n    #   output hOCR. This converter extracts the explicit text information from\n    #   those PDFs that do have it and uses it to generate a basic hOCR\n    #   representation that is designed to be used in conjunction with the image\n    #   of the PDF in the same way as genuine OCR output would be, but without the\n    #   inevitable OCR errors.\n\n    #   The converter does not handle images, diagrams or text colors.\n\n    #   In the examples processed by the contributor it was necessary to set\n    #   LAParams.all_texts to True.\n\n    CONTROL = re.compile(r\"[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]\")\n\n    def __init__(\n        self,\n        rsrcmgr: PDFResourceManager,\n        outfp: AnyIO,\n        codec: str = \"utf8\",\n        pageno: int = 1,\n        laparams: LAParams | None = None,\n        stripcontrol: bool = False,\n    ):\n        PDFConverter.__init__(\n            self,\n            rsrcmgr,\n            outfp,\n            codec=codec,\n            pageno=pageno,\n            laparams=laparams,\n        )\n        self.stripcontrol = stripcontrol\n        self.within_chars = False\n        self.write_header()\n\n    def bbox_repr(self, bbox: Rect) -> str:\n        (in_x0, in_y0, in_x1, in_y1) = bbox\n        # PDF y-coordinates are the other way round from hOCR coordinates\n        out_x0 = int(in_x0)\n        out_y0 = int(self.page_bbox[3] - in_y1)\n        out_x1 = int(in_x1)\n        out_y1 = int(self.page_bbox[3] - in_y0)\n        return f\"bbox {out_x0} {out_y0} {out_x1} {out_y1}\"\n\n    def write(self, text: str) -> None:\n        if self.codec:\n            encoded_text = text.encode(self.codec)\n            cast(BinaryIO, self.outfp).write(encoded_text)\n        else:\n            cast(TextIO, self.outfp).write(text)\n\n    def write_header(self) -> None:\n        if self.codec:\n            self.write(\n                \"<html xmlns='http://www.w3.org/1999/xhtml' \"\n                \"xml:lang='en' lang='en' 
charset='%s'>\\n\" % self.codec,\n            )\n        else:\n            self.write(\n                \"<html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' lang='en'>\\n\",\n            )\n        self.write(\"<head>\\n\")\n        self.write(\"<title></title>\\n\")\n        self.write(\n            \"<meta http-equiv='Content-Type' content='text/html;charset=utf-8' />\\n\",\n        )\n        self.write(\n            \"<meta name='ocr-system' content='pdfminer.six HOCR Converter' />\\n\",\n        )\n        self.write(\n            \"  <meta name='ocr-capabilities'\"\n            \" content='ocr_page ocr_block ocr_line ocrx_word'/>\\n\",\n        )\n        self.write(\"</head>\\n\")\n        self.write(\"<body>\\n\")\n\n    def write_footer(self) -> None:\n        self.write(\"<!-- comment in the following line to debug -->\\n\")\n        self.write(\n            \"<!--script src='https://unpkg.com/hocrjs'></script--></body></html>\\n\",\n        )\n\n    def write_text(self, text: str) -> None:\n        if self.stripcontrol:\n            text = self.CONTROL.sub(\"\", text)\n        self.write(text)\n\n    def write_word(self) -> None:\n        if len(self.working_text) > 0:\n            bold_and_italic_styles = \"\"\n            if \"Italic\" in self.working_font:\n                bold_and_italic_styles = \"font-style: italic; \"\n            if \"Bold\" in self.working_font:\n                bold_and_italic_styles += \"font-weight: bold; \"\n            self.write(\n                \"<span style='font:\\\"%s\\\"; font-size:%d; %s' \"\n                \"class='ocrx_word' title='%s; x_font %s; \"\n                \"x_fsize %d'>%s</span>\"\n                % (\n                    (\n                        self.working_font,\n                        self.working_size,\n                        bold_and_italic_styles,\n                        self.bbox_repr(self.working_bbox),\n                        self.working_font,\n                        
self.working_size,\n                        self.working_text.strip(),\n                    )\n                ),\n            )\n        self.within_chars = False\n\n    def receive_layout(self, ltpage: LTPage) -> None:\n        def render(item: LTItem) -> None:\n            if self.within_chars and isinstance(item, LTAnno):\n                self.write_word()\n            if isinstance(item, LTPage):\n                self.page_bbox = item.bbox\n                self.write(\n                    \"<div class='ocr_page' id='%s' title='%s'>\\n\"\n                    % (item.pageid, self.bbox_repr(item.bbox)),\n                )\n                for child in item:\n                    render(child)\n                self.write(\"</div>\\n\")\n            elif isinstance(item, LTTextLine):\n                self.write(\n                    \"<span class='ocr_line' title='%s'>\" % (self.bbox_repr(item.bbox)),\n                )\n                for child_line in item:\n                    render(child_line)\n                self.write(\"</span>\\n\")\n            elif isinstance(item, LTTextBox):\n                self.write(\n                    \"<div class='ocr_block' id='%d' title='%s'>\\n\"\n                    % (item.index, self.bbox_repr(item.bbox)),\n                )\n                for child in item:\n                    render(child)\n                self.write(\"</div>\\n\")\n            elif isinstance(item, LTChar):\n                if not self.within_chars:\n                    self.within_chars = True\n                    self.working_text = item.get_text()\n                    self.working_bbox = item.bbox\n                    self.working_font = item.fontname\n                    self.working_size = item.size\n                elif len(item.get_text().strip()) == 0:\n                    self.write_word()\n                    self.write(item.get_text())\n                else:\n                    if (\n                        self.working_bbox[1] != 
item.bbox[1]\n                        or self.working_font != item.fontname\n                        or self.working_size != item.size\n                    ):\n                        self.write_word()\n                        self.working_bbox = item.bbox\n                        self.working_font = item.fontname\n                        self.working_size = item.size\n                    self.working_text += item.get_text()\n                    self.working_bbox = (\n                        self.working_bbox[0],\n                        self.working_bbox[1],\n                        item.bbox[2],\n                        self.working_bbox[3],\n                    )\n\n        render(ltpage)\n\n    def close(self) -> None:\n        self.write_footer()\n"
  },
  {
    "path": "babeldoc/pdfminer/data_structures.py",
    "content": "from collections.abc import Iterable\nfrom typing import Any\n\nfrom babeldoc.pdfminer.pdfparser import PDFSyntaxError\nfrom babeldoc.pdfminer.pdftypes import dict_value\nfrom babeldoc.pdfminer.pdftypes import int_value\nfrom babeldoc.pdfminer.pdftypes import list_value\nfrom babeldoc.pdfminer.utils import choplist\nfrom babeldoc.pdfminer import settings\n\n\nclass NumberTree:\n    \"\"\"A PDF number tree.\n\n    See Section 3.8.6 of the PDF Reference.\n    \"\"\"\n\n    def __init__(self, obj: Any):\n        self._obj = dict_value(obj)\n        self.nums: Iterable[Any] | None = None\n        self.kids: Iterable[Any] | None = None\n        self.limits: Iterable[Any] | None = None\n\n        if \"Nums\" in self._obj:\n            self.nums = list_value(self._obj[\"Nums\"])\n        if \"Kids\" in self._obj:\n            self.kids = list_value(self._obj[\"Kids\"])\n        if \"Limits\" in self._obj:\n            self.limits = list_value(self._obj[\"Limits\"])\n\n    def _parse(self) -> list[tuple[int, Any]]:\n        items = []\n        if self.nums:  # Leaf node\n            for k, v in choplist(2, self.nums):\n                items.append((int_value(k), v))\n\n        if self.kids:  # Root or intermediate node\n            for child_ref in self.kids:\n                items += NumberTree(child_ref)._parse()\n\n        return items\n\n    values: list[tuple[int, Any]]  # workaround decorators unsupported by mypy\n\n    @property  # type: ignore[no-redef,misc]\n    def values(self) -> list[tuple[int, Any]]:\n        values = self._parse()\n\n        if settings.STRICT:\n            if not all(a[0] <= b[0] for a, b in zip(values, values[1:], strict=False)):\n                raise PDFSyntaxError(\"Number tree elements are out of order\")\n        else:\n            values.sort(key=lambda t: t[0])\n\n        return values\n"
  },
  {
    "path": "babeldoc/pdfminer/encodingdb.py",
    "content": "import logging\nimport re\nfrom collections.abc import Iterable\nfrom typing import cast\n\nfrom babeldoc.pdfminer.glyphlist import glyphname2unicode\nfrom babeldoc.pdfminer.latin_enc import ENCODING\nfrom babeldoc.pdfminer.pdfexceptions import PDFKeyError\nfrom babeldoc.pdfminer.psparser import PSLiteral\n\nHEXADECIMAL = re.compile(r\"[0-9a-fA-F]+\")\n\nlog = logging.getLogger(__name__)\n\n\ndef name2unicode(name: str) -> str:\n    \"\"\"Converts Adobe glyph names to Unicode numbers.\n\n    In contrast to the specification, this raises a KeyError instead of return\n    an empty string when the key is unknown.\n    This way the caller must explicitly define what to do\n    when there is not a match.\n\n    Reference:\n    https://github.com/adobe-type-tools/agl-specification#2-the-mapping\n\n    :returns unicode character if name resembles something,\n    otherwise a KeyError\n    \"\"\"\n    if not isinstance(name, str):\n        raise PDFKeyError(\n            'Could not convert unicode name \"%s\" to character because '\n            \"it should be of type str but is of type %s\" % (name, type(name)),\n        )\n\n    name = name.split(\".\")[0]\n    components = name.split(\"_\")\n\n    if len(components) > 1:\n        return \"\".join(map(name2unicode, components))\n\n    elif name in glyphname2unicode:\n        return glyphname2unicode[name]\n\n    elif name.startswith(\"uni\"):\n        name_without_uni = name.strip(\"uni\")\n\n        if HEXADECIMAL.match(name_without_uni) and len(name_without_uni) % 4 == 0:\n            unicode_digits = [\n                int(name_without_uni[i : i + 4], base=16)\n                for i in range(0, len(name_without_uni), 4)\n            ]\n            for digit in unicode_digits:\n                raise_key_error_for_invalid_unicode(digit)\n            characters = map(chr, unicode_digits)\n            return \"\".join(characters)\n\n    elif name.startswith(\"u\"):\n        name_without_u = 
name.strip(\"u\")\n\n        if HEXADECIMAL.match(name_without_u) and 4 <= len(name_without_u) <= 6:\n            unicode_digit = int(name_without_u, base=16)\n            raise_key_error_for_invalid_unicode(unicode_digit)\n            return chr(unicode_digit)\n\n    raise PDFKeyError(\n        'Could not convert unicode name \"%s\" to character because '\n        \"it does not match specification\" % name,\n    )\n\n\ndef raise_key_error_for_invalid_unicode(unicode_digit: int) -> None:\n    \"\"\"Unicode values should not be in the range D800 through DFFF because\n    that is used for surrogate pairs in UTF-16\n\n    :raises KeyError if unicode digit is invalid\n    \"\"\"\n    if 55295 < unicode_digit < 57344:\n        raise PDFKeyError(\n            \"Unicode digit %d is invalid because \"\n            \"it is in the range D800 through DFFF\" % unicode_digit,\n        )\n\n\nclass EncodingDB:\n    std2unicode: dict[int, str] = {}\n    mac2unicode: dict[int, str] = {}\n    win2unicode: dict[int, str] = {}\n    pdf2unicode: dict[int, str] = {}\n    for name, std, mac, win, pdf in ENCODING:\n        c = name2unicode(name)\n        if std:\n            std2unicode[std] = c\n        if mac:\n            mac2unicode[mac] = c\n        if win:\n            win2unicode[win] = c\n        if pdf:\n            pdf2unicode[pdf] = c\n\n    encodings = {\n        \"StandardEncoding\": std2unicode,\n        \"MacRomanEncoding\": mac2unicode,\n        \"WinAnsiEncoding\": win2unicode,\n        \"PDFDocEncoding\": pdf2unicode,\n    }\n\n    @classmethod\n    def get_encoding(\n        cls,\n        name: str,\n        diff: Iterable[object] | None = None,\n    ) -> dict[int, str]:\n        cid2unicode = cls.encodings.get(name, cls.std2unicode)\n        if diff:\n            cid2unicode = cid2unicode.copy()\n            cid = 0\n            for x in diff:\n                if isinstance(x, int):\n                    cid = x\n                elif isinstance(x, PSLiteral):\n         
           try:\n                        cid2unicode[cid] = name2unicode(cast(str, x.name))\n                    except (KeyError, ValueError) as e:\n                        log.debug(str(e))\n                    cid += 1\n        return cid2unicode\n"
  },
  {
    "path": "babeldoc/pdfminer/fontmetrics.py",
    "content": "\"\"\"Font metrics for the Adobe core 14 fonts.\n\nFont metrics are used to compute the boundary of each character\nwritten with a proportional font.\n\nThe following data were extracted from the AFM files:\n\n  http://www.ctan.org/tex-archive/fonts/adobe/afm/\n\n\"\"\"\n\n###  BEGIN Verbatim copy of the license part\n\n#\n# Adobe Core 35 AFM Files with 314 Glyph Entries - ReadMe\n#\n# This file and the 35 PostScript(R) AFM files it accompanies may be\n# used, copied, and distributed for any purpose and without charge,\n# with or without modification, provided that all copyright notices\n# are retained; that the AFM files are not distributed without this\n# file; that all modifications to this file or any of the AFM files\n# are prominently noted in the modified file(s); and that this\n# paragraph is not modified. Adobe Systems has no responsibility or\n# obligation to support the use of the AFM files.\n#\n\n###  END Verbatim copy of the license part\n\n# flake8: noqa\nfrom typing import Dict\n\n\ndef convert_font_metrics(path: str) -> None:\n    \"\"\"Convert an AFM file to a mapping of font metrics.\n\n    See below for the output.\n    \"\"\"\n    fonts = {}\n    with open(path) as fileinput:\n        for line in fileinput.readlines():\n            f = line.strip().split(\" \")\n            if not f:\n                continue\n            k = f[0]\n            if k == \"FontName\":\n                fontname = f[1]\n                props = {\"FontName\": fontname, \"Flags\": 0}\n                chars: Dict[int, int] = {}\n                fonts[fontname] = (props, chars)\n            elif k == \"C\":\n                cid = int(f[1])\n                if 0 <= cid and cid <= 255:\n                    width = int(f[4])\n                    chars[cid] = width\n            elif k in (\"CapHeight\", \"XHeight\", \"ItalicAngle\", \"Ascender\", \"Descender\"):\n                k = {\"Ascender\": \"Ascent\", \"Descender\": \"Descent\"}.get(k, k)\n            
    props[k] = float(f[1])\n            elif k in (\"FontName\", \"FamilyName\", \"Weight\"):\n                k = {\"FamilyName\": \"FontFamily\", \"Weight\": \"FontWeight\"}.get(k, k)\n                props[k] = f[1]\n            elif k == \"IsFixedPitch\":\n                if f[1].lower() == \"true\":\n                    props[\"Flags\"] = 64\n            elif k == \"FontBBox\":\n                props[k] = tuple(map(float, f[1:5]))\n        print(\"# -*- python -*-\")\n        print(\"FONT_METRICS = {\")\n        for fontname, (props, chars) in fonts.items():\n            print(f\" {fontname!r}: {(props, chars)!r},\")\n        print(\"}\")\n\n\nFONT_METRICS = {\n    \"Courier\": (\n        {\n            \"FontName\": \"Courier\",\n            \"Descent\": -194.0,\n            \"FontBBox\": (-6.0, -249.0, 639.0, 803.0),\n            \"FontWeight\": \"Medium\",\n            \"CapHeight\": 572.0,\n            \"FontFamily\": \"Courier\",\n            \"Flags\": 64,\n            \"XHeight\": 434.0,\n            \"ItalicAngle\": 0.0,\n            \"Ascent\": 627.0,\n        },\n        {\n            \" \": 600,\n            \"!\": 600,\n            '\"': 600,\n            \"#\": 600,\n            \"$\": 600,\n            \"%\": 600,\n            \"&\": 600,\n            \"'\": 600,\n            \"(\": 600,\n            \")\": 600,\n            \"*\": 600,\n            \"+\": 600,\n            \",\": 600,\n            \"-\": 600,\n            \".\": 600,\n            \"/\": 600,\n            \"0\": 600,\n            \"1\": 600,\n            \"2\": 600,\n            \"3\": 600,\n            \"4\": 600,\n            \"5\": 600,\n            \"6\": 600,\n            \"7\": 600,\n            \"8\": 600,\n            \"9\": 600,\n            \":\": 600,\n            \";\": 600,\n            \"<\": 600,\n            \"=\": 600,\n            \">\": 600,\n            \"?\": 600,\n            \"@\": 600,\n            \"A\": 600,\n            \"B\": 600,\n            \"C\": 
600,\n            \"D\": 600,\n            \"E\": 600,\n            \"F\": 600,\n            \"G\": 600,\n            \"H\": 600,\n            \"I\": 600,\n            \"J\": 600,\n            \"K\": 600,\n            \"L\": 600,\n            \"M\": 600,\n            \"N\": 600,\n            \"O\": 600,\n            \"P\": 600,\n            \"Q\": 600,\n            \"R\": 600,\n            \"S\": 600,\n            \"T\": 600,\n            \"U\": 600,\n            \"V\": 600,\n            \"W\": 600,\n            \"X\": 600,\n            \"Y\": 600,\n            \"Z\": 600,\n            \"[\": 600,\n            \"\\\\\": 600,\n            \"]\": 600,\n            \"^\": 600,\n            \"_\": 600,\n            \"`\": 600,\n            \"a\": 600,\n            \"b\": 600,\n            \"c\": 600,\n            \"d\": 600,\n            \"e\": 600,\n            \"f\": 600,\n            \"g\": 600,\n            \"h\": 600,\n            \"i\": 600,\n            \"j\": 600,\n            \"k\": 600,\n            \"l\": 600,\n            \"m\": 600,\n            \"n\": 600,\n            \"o\": 600,\n            \"p\": 600,\n            \"q\": 600,\n            \"r\": 600,\n            \"s\": 600,\n            \"t\": 600,\n            \"u\": 600,\n            \"v\": 600,\n            \"w\": 600,\n            \"x\": 600,\n            \"y\": 600,\n            \"z\": 600,\n            \"{\": 600,\n            \"|\": 600,\n            \"}\": 600,\n            \"~\": 600,\n            \"\\xa1\": 600,\n            \"\\xa2\": 600,\n            \"\\xa3\": 600,\n            \"\\xa4\": 600,\n            \"\\xa5\": 600,\n            \"\\xa6\": 600,\n            \"\\xa7\": 600,\n            \"\\xa8\": 600,\n            \"\\xa9\": 600,\n            \"\\xaa\": 600,\n            \"\\xab\": 600,\n            \"\\xac\": 600,\n            \"\\xae\": 600,\n            \"\\xaf\": 600,\n            \"\\xb0\": 600,\n            \"\\xb1\": 600,\n            \"\\xb2\": 600,\n            \"\\xb3\": 
600,\n            \"\\xb4\": 600,\n            \"\\xb5\": 600,\n            \"\\xb6\": 600,\n            \"\\xb7\": 600,\n            \"\\xb8\": 600,\n            \"\\xb9\": 600,\n            \"\\xba\": 600,\n            \"\\xbb\": 600,\n            \"\\xbc\": 600,\n            \"\\xbd\": 600,\n            \"\\xbe\": 600,\n            \"\\xbf\": 600,\n            \"\\xc0\": 600,\n            \"\\xc1\": 600,\n            \"\\xc2\": 600,\n            \"\\xc3\": 600,\n            \"\\xc4\": 600,\n            \"\\xc5\": 600,\n            \"\\xc6\": 600,\n            \"\\xc7\": 600,\n            \"\\xc8\": 600,\n            \"\\xc9\": 600,\n            \"\\xca\": 600,\n            \"\\xcb\": 600,\n            \"\\xcc\": 600,\n            \"\\xcd\": 600,\n            \"\\xce\": 600,\n            \"\\xcf\": 600,\n            \"\\xd0\": 600,\n            \"\\xd1\": 600,\n            \"\\xd2\": 600,\n            \"\\xd3\": 600,\n            \"\\xd4\": 600,\n            \"\\xd5\": 600,\n            \"\\xd6\": 600,\n            \"\\xd7\": 600,\n            \"\\xd8\": 600,\n            \"\\xd9\": 600,\n            \"\\xda\": 600,\n            \"\\xdb\": 600,\n            \"\\xdc\": 600,\n            \"\\xdd\": 600,\n            \"\\xde\": 600,\n            \"\\xdf\": 600,\n            \"\\xe0\": 600,\n            \"\\xe1\": 600,\n            \"\\xe2\": 600,\n            \"\\xe3\": 600,\n            \"\\xe4\": 600,\n            \"\\xe5\": 600,\n            \"\\xe6\": 600,\n            \"\\xe7\": 600,\n            \"\\xe8\": 600,\n            \"\\xe9\": 600,\n            \"\\xea\": 600,\n            \"\\xeb\": 600,\n            \"\\xec\": 600,\n            \"\\xed\": 600,\n            \"\\xee\": 600,\n            \"\\xef\": 600,\n            \"\\xf0\": 600,\n            \"\\xf1\": 600,\n            \"\\xf2\": 600,\n            \"\\xf3\": 600,\n            \"\\xf4\": 600,\n            \"\\xf5\": 600,\n            \"\\xf6\": 600,\n            \"\\xf7\": 600,\n            
\"\\xf8\": 600,\n            \"\\xf9\": 600,\n            \"\\xfa\": 600,\n            \"\\xfb\": 600,\n            \"\\xfc\": 600,\n            \"\\xfd\": 600,\n            \"\\xfe\": 600,\n            \"\\xff\": 600,\n            \"\\u0100\": 600,\n            \"\\u0101\": 600,\n            \"\\u0102\": 600,\n            \"\\u0103\": 600,\n            \"\\u0104\": 600,\n            \"\\u0105\": 600,\n            \"\\u0106\": 600,\n            \"\\u0107\": 600,\n            \"\\u010c\": 600,\n            \"\\u010d\": 600,\n            \"\\u010e\": 600,\n            \"\\u010f\": 600,\n            \"\\u0110\": 600,\n            \"\\u0111\": 600,\n            \"\\u0112\": 600,\n            \"\\u0113\": 600,\n            \"\\u0116\": 600,\n            \"\\u0117\": 600,\n            \"\\u0118\": 600,\n            \"\\u0119\": 600,\n            \"\\u011a\": 600,\n            \"\\u011b\": 600,\n            \"\\u011e\": 600,\n            \"\\u011f\": 600,\n            \"\\u0122\": 600,\n            \"\\u0123\": 600,\n            \"\\u012a\": 600,\n            \"\\u012b\": 600,\n            \"\\u012e\": 600,\n            \"\\u012f\": 600,\n            \"\\u0130\": 600,\n            \"\\u0131\": 600,\n            \"\\u0136\": 600,\n            \"\\u0137\": 600,\n            \"\\u0139\": 600,\n            \"\\u013a\": 600,\n            \"\\u013b\": 600,\n            \"\\u013c\": 600,\n            \"\\u013d\": 600,\n            \"\\u013e\": 600,\n            \"\\u0141\": 600,\n            \"\\u0142\": 600,\n            \"\\u0143\": 600,\n            \"\\u0144\": 600,\n            \"\\u0145\": 600,\n            \"\\u0146\": 600,\n            \"\\u0147\": 600,\n            \"\\u0148\": 600,\n            \"\\u014c\": 600,\n            \"\\u014d\": 600,\n            \"\\u0150\": 600,\n            \"\\u0151\": 600,\n            \"\\u0152\": 600,\n            \"\\u0153\": 600,\n            \"\\u0154\": 600,\n            \"\\u0155\": 600,\n            \"\\u0156\": 600,\n            
\"\\u0157\": 600,\n            \"\\u0158\": 600,\n            \"\\u0159\": 600,\n            \"\\u015a\": 600,\n            \"\\u015b\": 600,\n            \"\\u015e\": 600,\n            \"\\u015f\": 600,\n            \"\\u0160\": 600,\n            \"\\u0161\": 600,\n            \"\\u0162\": 600,\n            \"\\u0163\": 600,\n            \"\\u0164\": 600,\n            \"\\u0165\": 600,\n            \"\\u016a\": 600,\n            \"\\u016b\": 600,\n            \"\\u016e\": 600,\n            \"\\u016f\": 600,\n            \"\\u0170\": 600,\n            \"\\u0171\": 600,\n            \"\\u0172\": 600,\n            \"\\u0173\": 600,\n            \"\\u0178\": 600,\n            \"\\u0179\": 600,\n            \"\\u017a\": 600,\n            \"\\u017b\": 600,\n            \"\\u017c\": 600,\n            \"\\u017d\": 600,\n            \"\\u017e\": 600,\n            \"\\u0192\": 600,\n            \"\\u0218\": 600,\n            \"\\u0219\": 600,\n            \"\\u02c6\": 600,\n            \"\\u02c7\": 600,\n            \"\\u02d8\": 600,\n            \"\\u02d9\": 600,\n            \"\\u02da\": 600,\n            \"\\u02db\": 600,\n            \"\\u02dc\": 600,\n            \"\\u02dd\": 600,\n            \"\\u2013\": 600,\n            \"\\u2014\": 600,\n            \"\\u2018\": 600,\n            \"\\u2019\": 600,\n            \"\\u201a\": 600,\n            \"\\u201c\": 600,\n            \"\\u201d\": 600,\n            \"\\u201e\": 600,\n            \"\\u2020\": 600,\n            \"\\u2021\": 600,\n            \"\\u2022\": 600,\n            \"\\u2026\": 600,\n            \"\\u2030\": 600,\n            \"\\u2039\": 600,\n            \"\\u203a\": 600,\n            \"\\u2044\": 600,\n            \"\\u2122\": 600,\n            \"\\u2202\": 600,\n            \"\\u2206\": 600,\n            \"\\u2211\": 600,\n            \"\\u2212\": 600,\n            \"\\u221a\": 600,\n            \"\\u2260\": 600,\n            \"\\u2264\": 600,\n            \"\\u2265\": 600,\n            \"\\u25ca\": 
600,\n            \"\\uf6c3\": 600,\n            \"\\ufb01\": 600,\n            \"\\ufb02\": 600,\n        },\n    ),\n    \"Courier-Bold\": (\n        {\n            \"FontName\": \"Courier-Bold\",\n            \"Descent\": -194.0,\n            \"FontBBox\": (-88.0, -249.0, 697.0, 811.0),\n            \"FontWeight\": \"Bold\",\n            \"CapHeight\": 572.0,\n            \"FontFamily\": \"Courier\",\n            \"Flags\": 64,\n            \"XHeight\": 434.0,\n            \"ItalicAngle\": 0.0,\n            \"Ascent\": 627.0,\n        },\n        {\n            \" \": 600,\n            \"!\": 600,\n            '\"': 600,\n            \"#\": 600,\n            \"$\": 600,\n            \"%\": 600,\n            \"&\": 600,\n            \"'\": 600,\n            \"(\": 600,\n            \")\": 600,\n            \"*\": 600,\n            \"+\": 600,\n            \",\": 600,\n            \"-\": 600,\n            \".\": 600,\n            \"/\": 600,\n            \"0\": 600,\n            \"1\": 600,\n            \"2\": 600,\n            \"3\": 600,\n            \"4\": 600,\n            \"5\": 600,\n            \"6\": 600,\n            \"7\": 600,\n            \"8\": 600,\n            \"9\": 600,\n            \":\": 600,\n            \";\": 600,\n            \"<\": 600,\n            \"=\": 600,\n            \">\": 600,\n            \"?\": 600,\n            \"@\": 600,\n            \"A\": 600,\n            \"B\": 600,\n            \"C\": 600,\n            \"D\": 600,\n            \"E\": 600,\n            \"F\": 600,\n            \"G\": 600,\n            \"H\": 600,\n            \"I\": 600,\n            \"J\": 600,\n            \"K\": 600,\n            \"L\": 600,\n            \"M\": 600,\n            \"N\": 600,\n            \"O\": 600,\n            \"P\": 600,\n            \"Q\": 600,\n            \"R\": 600,\n            \"S\": 600,\n            \"T\": 600,\n            \"U\": 600,\n            \"V\": 600,\n            \"W\": 600,\n            \"X\": 600,\n            
\"Y\": 600,\n            \"Z\": 600,\n            \"[\": 600,\n            \"\\\\\": 600,\n            \"]\": 600,\n            \"^\": 600,\n            \"_\": 600,\n            \"`\": 600,\n            \"a\": 600,\n            \"b\": 600,\n            \"c\": 600,\n            \"d\": 600,\n            \"e\": 600,\n            \"f\": 600,\n            \"g\": 600,\n            \"h\": 600,\n            \"i\": 600,\n            \"j\": 600,\n            \"k\": 600,\n            \"l\": 600,\n            \"m\": 600,\n            \"n\": 600,\n            \"o\": 600,\n            \"p\": 600,\n            \"q\": 600,\n            \"r\": 600,\n            \"s\": 600,\n            \"t\": 600,\n            \"u\": 600,\n            \"v\": 600,\n            \"w\": 600,\n            \"x\": 600,\n            \"y\": 600,\n            \"z\": 600,\n            \"{\": 600,\n            \"|\": 600,\n            \"}\": 600,\n            \"~\": 600,\n            \"\\xa1\": 600,\n            \"\\xa2\": 600,\n            \"\\xa3\": 600,\n            \"\\xa4\": 600,\n            \"\\xa5\": 600,\n            \"\\xa6\": 600,\n            \"\\xa7\": 600,\n            \"\\xa8\": 600,\n            \"\\xa9\": 600,\n            \"\\xaa\": 600,\n            \"\\xab\": 600,\n            \"\\xac\": 600,\n            \"\\xae\": 600,\n            \"\\xaf\": 600,\n            \"\\xb0\": 600,\n            \"\\xb1\": 600,\n            \"\\xb2\": 600,\n            \"\\xb3\": 600,\n            \"\\xb4\": 600,\n            \"\\xb5\": 600,\n            \"\\xb6\": 600,\n            \"\\xb7\": 600,\n            \"\\xb8\": 600,\n            \"\\xb9\": 600,\n            \"\\xba\": 600,\n            \"\\xbb\": 600,\n            \"\\xbc\": 600,\n            \"\\xbd\": 600,\n            \"\\xbe\": 600,\n            \"\\xbf\": 600,\n            \"\\xc0\": 600,\n            \"\\xc1\": 600,\n            \"\\xc2\": 600,\n            \"\\xc3\": 600,\n            \"\\xc4\": 600,\n            \"\\xc5\": 600,\n            
\"\\xc6\": 600,\n            \"\\xc7\": 600,\n            \"\\xc8\": 600,\n            \"\\xc9\": 600,\n            \"\\xca\": 600,\n            \"\\xcb\": 600,\n            \"\\xcc\": 600,\n            \"\\xcd\": 600,\n            \"\\xce\": 600,\n            \"\\xcf\": 600,\n            \"\\xd0\": 600,\n            \"\\xd1\": 600,\n            \"\\xd2\": 600,\n            \"\\xd3\": 600,\n            \"\\xd4\": 600,\n            \"\\xd5\": 600,\n            \"\\xd6\": 600,\n            \"\\xd7\": 600,\n            \"\\xd8\": 600,\n            \"\\xd9\": 600,\n            \"\\xda\": 600,\n            \"\\xdb\": 600,\n            \"\\xdc\": 600,\n            \"\\xdd\": 600,\n            \"\\xde\": 600,\n            \"\\xdf\": 600,\n            \"\\xe0\": 600,\n            \"\\xe1\": 600,\n            \"\\xe2\": 600,\n            \"\\xe3\": 600,\n            \"\\xe4\": 600,\n            \"\\xe5\": 600,\n            \"\\xe6\": 600,\n            \"\\xe7\": 600,\n            \"\\xe8\": 600,\n            \"\\xe9\": 600,\n            \"\\xea\": 600,\n            \"\\xeb\": 600,\n            \"\\xec\": 600,\n            \"\\xed\": 600,\n            \"\\xee\": 600,\n            \"\\xef\": 600,\n            \"\\xf0\": 600,\n            \"\\xf1\": 600,\n            \"\\xf2\": 600,\n            \"\\xf3\": 600,\n            \"\\xf4\": 600,\n            \"\\xf5\": 600,\n            \"\\xf6\": 600,\n            \"\\xf7\": 600,\n            \"\\xf8\": 600,\n            \"\\xf9\": 600,\n            \"\\xfa\": 600,\n            \"\\xfb\": 600,\n            \"\\xfc\": 600,\n            \"\\xfd\": 600,\n            \"\\xfe\": 600,\n            \"\\xff\": 600,\n            \"\\u0100\": 600,\n            \"\\u0101\": 600,\n            \"\\u0102\": 600,\n            \"\\u0103\": 600,\n            \"\\u0104\": 600,\n            \"\\u0105\": 600,\n            \"\\u0106\": 600,\n            \"\\u0107\": 600,\n            \"\\u010c\": 600,\n            \"\\u010d\": 600,\n            
\"\\u010e\": 600,\n            \"\\u010f\": 600,\n            \"\\u0110\": 600,\n            \"\\u0111\": 600,\n            \"\\u0112\": 600,\n            \"\\u0113\": 600,\n            \"\\u0116\": 600,\n            \"\\u0117\": 600,\n            \"\\u0118\": 600,\n            \"\\u0119\": 600,\n            \"\\u011a\": 600,\n            \"\\u011b\": 600,\n            \"\\u011e\": 600,\n            \"\\u011f\": 600,\n            \"\\u0122\": 600,\n            \"\\u0123\": 600,\n            \"\\u012a\": 600,\n            \"\\u012b\": 600,\n            \"\\u012e\": 600,\n            \"\\u012f\": 600,\n            \"\\u0130\": 600,\n            \"\\u0131\": 600,\n            \"\\u0136\": 600,\n            \"\\u0137\": 600,\n            \"\\u0139\": 600,\n            \"\\u013a\": 600,\n            \"\\u013b\": 600,\n            \"\\u013c\": 600,\n            \"\\u013d\": 600,\n            \"\\u013e\": 600,\n            \"\\u0141\": 600,\n            \"\\u0142\": 600,\n            \"\\u0143\": 600,\n            \"\\u0144\": 600,\n            \"\\u0145\": 600,\n            \"\\u0146\": 600,\n            \"\\u0147\": 600,\n            \"\\u0148\": 600,\n            \"\\u014c\": 600,\n            \"\\u014d\": 600,\n            \"\\u0150\": 600,\n            \"\\u0151\": 600,\n            \"\\u0152\": 600,\n            \"\\u0153\": 600,\n            \"\\u0154\": 600,\n            \"\\u0155\": 600,\n            \"\\u0156\": 600,\n            \"\\u0157\": 600,\n            \"\\u0158\": 600,\n            \"\\u0159\": 600,\n            \"\\u015a\": 600,\n            \"\\u015b\": 600,\n            \"\\u015e\": 600,\n            \"\\u015f\": 600,\n            \"\\u0160\": 600,\n            \"\\u0161\": 600,\n            \"\\u0162\": 600,\n            \"\\u0163\": 600,\n            \"\\u0164\": 600,\n            \"\\u0165\": 600,\n            \"\\u016a\": 600,\n            \"\\u016b\": 600,\n            \"\\u016e\": 600,\n            \"\\u016f\": 600,\n            \"\\u0170\": 
600,\n            \"\\u0171\": 600,\n            \"\\u0172\": 600,\n            \"\\u0173\": 600,\n            \"\\u0178\": 600,\n            \"\\u0179\": 600,\n            \"\\u017a\": 600,\n            \"\\u017b\": 600,\n            \"\\u017c\": 600,\n            \"\\u017d\": 600,\n            \"\\u017e\": 600,\n            \"\\u0192\": 600,\n            \"\\u0218\": 600,\n            \"\\u0219\": 600,\n            \"\\u02c6\": 600,\n            \"\\u02c7\": 600,\n            \"\\u02d8\": 600,\n            \"\\u02d9\": 600,\n            \"\\u02da\": 600,\n            \"\\u02db\": 600,\n            \"\\u02dc\": 600,\n            \"\\u02dd\": 600,\n            \"\\u2013\": 600,\n            \"\\u2014\": 600,\n            \"\\u2018\": 600,\n            \"\\u2019\": 600,\n            \"\\u201a\": 600,\n            \"\\u201c\": 600,\n            \"\\u201d\": 600,\n            \"\\u201e\": 600,\n            \"\\u2020\": 600,\n            \"\\u2021\": 600,\n            \"\\u2022\": 600,\n            \"\\u2026\": 600,\n            \"\\u2030\": 600,\n            \"\\u2039\": 600,\n            \"\\u203a\": 600,\n            \"\\u2044\": 600,\n            \"\\u2122\": 600,\n            \"\\u2202\": 600,\n            \"\\u2206\": 600,\n            \"\\u2211\": 600,\n            \"\\u2212\": 600,\n            \"\\u221a\": 600,\n            \"\\u2260\": 600,\n            \"\\u2264\": 600,\n            \"\\u2265\": 600,\n            \"\\u25ca\": 600,\n            \"\\uf6c3\": 600,\n            \"\\ufb01\": 600,\n            \"\\ufb02\": 600,\n        },\n    ),\n    \"Courier-BoldOblique\": (\n        {\n            \"FontName\": \"Courier-BoldOblique\",\n            \"Descent\": -194.0,\n            \"FontBBox\": (-49.0, -249.0, 758.0, 811.0),\n            \"FontWeight\": \"Bold\",\n            \"CapHeight\": 572.0,\n            \"FontFamily\": \"Courier\",\n            \"Flags\": 64,\n            \"XHeight\": 434.0,\n            \"ItalicAngle\": -11.0,\n            
\"Ascent\": 627.0,\n        },\n        {\n            \" \": 600,\n            \"!\": 600,\n            '\"': 600,\n            \"#\": 600,\n            \"$\": 600,\n            \"%\": 600,\n            \"&\": 600,\n            \"'\": 600,\n            \"(\": 600,\n            \")\": 600,\n            \"*\": 600,\n            \"+\": 600,\n            \",\": 600,\n            \"-\": 600,\n            \".\": 600,\n            \"/\": 600,\n            \"0\": 600,\n            \"1\": 600,\n            \"2\": 600,\n            \"3\": 600,\n            \"4\": 600,\n            \"5\": 600,\n            \"6\": 600,\n            \"7\": 600,\n            \"8\": 600,\n            \"9\": 600,\n            \":\": 600,\n            \";\": 600,\n            \"<\": 600,\n            \"=\": 600,\n            \">\": 600,\n            \"?\": 600,\n            \"@\": 600,\n            \"A\": 600,\n            \"B\": 600,\n            \"C\": 600,\n            \"D\": 600,\n            \"E\": 600,\n            \"F\": 600,\n            \"G\": 600,\n            \"H\": 600,\n            \"I\": 600,\n            \"J\": 600,\n            \"K\": 600,\n            \"L\": 600,\n            \"M\": 600,\n            \"N\": 600,\n            \"O\": 600,\n            \"P\": 600,\n            \"Q\": 600,\n            \"R\": 600,\n            \"S\": 600,\n            \"T\": 600,\n            \"U\": 600,\n            \"V\": 600,\n            \"W\": 600,\n            \"X\": 600,\n            \"Y\": 600,\n            \"Z\": 600,\n            \"[\": 600,\n            \"\\\\\": 600,\n            \"]\": 600,\n            \"^\": 600,\n            \"_\": 600,\n            \"`\": 600,\n            \"a\": 600,\n            \"b\": 600,\n            \"c\": 600,\n            \"d\": 600,\n            \"e\": 600,\n            \"f\": 600,\n            \"g\": 600,\n            \"h\": 600,\n            \"i\": 600,\n            \"j\": 600,\n            \"k\": 600,\n            \"l\": 600,\n            \"m\": 600,\n     
       \"n\": 600,\n            \"o\": 600,\n            \"p\": 600,\n            \"q\": 600,\n            \"r\": 600,\n            \"s\": 600,\n            \"t\": 600,\n            \"u\": 600,\n            \"v\": 600,\n            \"w\": 600,\n            \"x\": 600,\n            \"y\": 600,\n            \"z\": 600,\n            \"{\": 600,\n            \"|\": 600,\n            \"}\": 600,\n            \"~\": 600,\n            \"\\xa1\": 600,\n            \"\\xa2\": 600,\n            \"\\xa3\": 600,\n            \"\\xa4\": 600,\n            \"\\xa5\": 600,\n            \"\\xa6\": 600,\n            \"\\xa7\": 600,\n            \"\\xa8\": 600,\n            \"\\xa9\": 600,\n            \"\\xaa\": 600,\n            \"\\xab\": 600,\n            \"\\xac\": 600,\n            \"\\xae\": 600,\n            \"\\xaf\": 600,\n            \"\\xb0\": 600,\n            \"\\xb1\": 600,\n            \"\\xb2\": 600,\n            \"\\xb3\": 600,\n            \"\\xb4\": 600,\n            \"\\xb5\": 600,\n            \"\\xb6\": 600,\n            \"\\xb7\": 600,\n            \"\\xb8\": 600,\n            \"\\xb9\": 600,\n            \"\\xba\": 600,\n            \"\\xbb\": 600,\n            \"\\xbc\": 600,\n            \"\\xbd\": 600,\n            \"\\xbe\": 600,\n            \"\\xbf\": 600,\n            \"\\xc0\": 600,\n            \"\\xc1\": 600,\n            \"\\xc2\": 600,\n            \"\\xc3\": 600,\n            \"\\xc4\": 600,\n            \"\\xc5\": 600,\n            \"\\xc6\": 600,\n            \"\\xc7\": 600,\n            \"\\xc8\": 600,\n            \"\\xc9\": 600,\n            \"\\xca\": 600,\n            \"\\xcb\": 600,\n            \"\\xcc\": 600,\n            \"\\xcd\": 600,\n            \"\\xce\": 600,\n            \"\\xcf\": 600,\n            \"\\xd0\": 600,\n            \"\\xd1\": 600,\n            \"\\xd2\": 600,\n            \"\\xd3\": 600,\n            \"\\xd4\": 600,\n            \"\\xd5\": 600,\n            \"\\xd6\": 600,\n            \"\\xd7\": 600,\n            
\"\\xd8\": 600,\n            \"\\xd9\": 600,\n            \"\\xda\": 600,\n            \"\\xdb\": 600,\n            \"\\xdc\": 600,\n            \"\\xdd\": 600,\n            \"\\xde\": 600,\n            \"\\xdf\": 600,\n            \"\\xe0\": 600,\n            \"\\xe1\": 600,\n            \"\\xe2\": 600,\n            \"\\xe3\": 600,\n            \"\\xe4\": 600,\n            \"\\xe5\": 600,\n            \"\\xe6\": 600,\n            \"\\xe7\": 600,\n            \"\\xe8\": 600,\n            \"\\xe9\": 600,\n            \"\\xea\": 600,\n            \"\\xeb\": 600,\n            \"\\xec\": 600,\n            \"\\xed\": 600,\n            \"\\xee\": 600,\n            \"\\xef\": 600,\n            \"\\xf0\": 600,\n            \"\\xf1\": 600,\n            \"\\xf2\": 600,\n            \"\\xf3\": 600,\n            \"\\xf4\": 600,\n            \"\\xf5\": 600,\n            \"\\xf6\": 600,\n            \"\\xf7\": 600,\n            \"\\xf8\": 600,\n            \"\\xf9\": 600,\n            \"\\xfa\": 600,\n            \"\\xfb\": 600,\n            \"\\xfc\": 600,\n            \"\\xfd\": 600,\n            \"\\xfe\": 600,\n            \"\\xff\": 600,\n            \"\\u0100\": 600,\n            \"\\u0101\": 600,\n            \"\\u0102\": 600,\n            \"\\u0103\": 600,\n            \"\\u0104\": 600,\n            \"\\u0105\": 600,\n            \"\\u0106\": 600,\n            \"\\u0107\": 600,\n            \"\\u010c\": 600,\n            \"\\u010d\": 600,\n            \"\\u010e\": 600,\n            \"\\u010f\": 600,\n            \"\\u0110\": 600,\n            \"\\u0111\": 600,\n            \"\\u0112\": 600,\n            \"\\u0113\": 600,\n            \"\\u0116\": 600,\n            \"\\u0117\": 600,\n            \"\\u0118\": 600,\n            \"\\u0119\": 600,\n            \"\\u011a\": 600,\n            \"\\u011b\": 600,\n            \"\\u011e\": 600,\n            \"\\u011f\": 600,\n            \"\\u0122\": 600,\n            \"\\u0123\": 600,\n            \"\\u012a\": 600,\n            
\"\\u012b\": 600,\n            \"\\u012e\": 600,\n            \"\\u012f\": 600,\n            \"\\u0130\": 600,\n            \"\\u0131\": 600,\n            \"\\u0136\": 600,\n            \"\\u0137\": 600,\n            \"\\u0139\": 600,\n            \"\\u013a\": 600,\n            \"\\u013b\": 600,\n            \"\\u013c\": 600,\n            \"\\u013d\": 600,\n            \"\\u013e\": 600,\n            \"\\u0141\": 600,\n            \"\\u0142\": 600,\n            \"\\u0143\": 600,\n            \"\\u0144\": 600,\n            \"\\u0145\": 600,\n            \"\\u0146\": 600,\n            \"\\u0147\": 600,\n            \"\\u0148\": 600,\n            \"\\u014c\": 600,\n            \"\\u014d\": 600,\n            \"\\u0150\": 600,\n            \"\\u0151\": 600,\n            \"\\u0152\": 600,\n            \"\\u0153\": 600,\n            \"\\u0154\": 600,\n            \"\\u0155\": 600,\n            \"\\u0156\": 600,\n            \"\\u0157\": 600,\n            \"\\u0158\": 600,\n            \"\\u0159\": 600,\n            \"\\u015a\": 600,\n            \"\\u015b\": 600,\n            \"\\u015e\": 600,\n            \"\\u015f\": 600,\n            \"\\u0160\": 600,\n            \"\\u0161\": 600,\n            \"\\u0162\": 600,\n            \"\\u0163\": 600,\n            \"\\u0164\": 600,\n            \"\\u0165\": 600,\n            \"\\u016a\": 600,\n            \"\\u016b\": 600,\n            \"\\u016e\": 600,\n            \"\\u016f\": 600,\n            \"\\u0170\": 600,\n            \"\\u0171\": 600,\n            \"\\u0172\": 600,\n            \"\\u0173\": 600,\n            \"\\u0178\": 600,\n            \"\\u0179\": 600,\n            \"\\u017a\": 600,\n            \"\\u017b\": 600,\n            \"\\u017c\": 600,\n            \"\\u017d\": 600,\n            \"\\u017e\": 600,\n            \"\\u0192\": 600,\n            \"\\u0218\": 600,\n            \"\\u0219\": 600,\n            \"\\u02c6\": 600,\n            \"\\u02c7\": 600,\n            \"\\u02d8\": 600,\n            \"\\u02d9\": 
600,\n            \"\\u02da\": 600,\n            \"\\u02db\": 600,\n            \"\\u02dc\": 600,\n            \"\\u02dd\": 600,\n            \"\\u2013\": 600,\n            \"\\u2014\": 600,\n            \"\\u2018\": 600,\n            \"\\u2019\": 600,\n            \"\\u201a\": 600,\n            \"\\u201c\": 600,\n            \"\\u201d\": 600,\n            \"\\u201e\": 600,\n            \"\\u2020\": 600,\n            \"\\u2021\": 600,\n            \"\\u2022\": 600,\n            \"\\u2026\": 600,\n            \"\\u2030\": 600,\n            \"\\u2039\": 600,\n            \"\\u203a\": 600,\n            \"\\u2044\": 600,\n            \"\\u2122\": 600,\n            \"\\u2202\": 600,\n            \"\\u2206\": 600,\n            \"\\u2211\": 600,\n            \"\\u2212\": 600,\n            \"\\u221a\": 600,\n            \"\\u2260\": 600,\n            \"\\u2264\": 600,\n            \"\\u2265\": 600,\n            \"\\u25ca\": 600,\n            \"\\uf6c3\": 600,\n            \"\\ufb01\": 600,\n            \"\\ufb02\": 600,\n        },\n    ),\n    \"Courier-Oblique\": (\n        {\n            \"FontName\": \"Courier-Oblique\",\n            \"Descent\": -194.0,\n            \"FontBBox\": (-49.0, -249.0, 749.0, 803.0),\n            \"FontWeight\": \"Medium\",\n            \"CapHeight\": 572.0,\n            \"FontFamily\": \"Courier\",\n            \"Flags\": 64,\n            \"XHeight\": 434.0,\n            \"ItalicAngle\": -11.0,\n            \"Ascent\": 627.0,\n        },\n        {\n            \" \": 600,\n            \"!\": 600,\n            '\"': 600,\n            \"#\": 600,\n            \"$\": 600,\n            \"%\": 600,\n            \"&\": 600,\n            \"'\": 600,\n            \"(\": 600,\n            \")\": 600,\n            \"*\": 600,\n            \"+\": 600,\n            \",\": 600,\n            \"-\": 600,\n            \".\": 600,\n            \"/\": 600,\n            \"0\": 600,\n            \"1\": 600,\n            \"2\": 600,\n            \"3\": 600,\n  
          \"4\": 600,\n            \"5\": 600,\n            \"6\": 600,\n            \"7\": 600,\n            \"8\": 600,\n            \"9\": 600,\n            \":\": 600,\n            \";\": 600,\n            \"<\": 600,\n            \"=\": 600,\n            \">\": 600,\n            \"?\": 600,\n            \"@\": 600,\n            \"A\": 600,\n            \"B\": 600,\n            \"C\": 600,\n            \"D\": 600,\n            \"E\": 600,\n            \"F\": 600,\n            \"G\": 600,\n            \"H\": 600,\n            \"I\": 600,\n            \"J\": 600,\n            \"K\": 600,\n            \"L\": 600,\n            \"M\": 600,\n            \"N\": 600,\n            \"O\": 600,\n            \"P\": 600,\n            \"Q\": 600,\n            \"R\": 600,\n            \"S\": 600,\n            \"T\": 600,\n            \"U\": 600,\n            \"V\": 600,\n            \"W\": 600,\n            \"X\": 600,\n            \"Y\": 600,\n            \"Z\": 600,\n            \"[\": 600,\n            \"\\\\\": 600,\n            \"]\": 600,\n            \"^\": 600,\n            \"_\": 600,\n            \"`\": 600,\n            \"a\": 600,\n            \"b\": 600,\n            \"c\": 600,\n            \"d\": 600,\n            \"e\": 600,\n            \"f\": 600,\n            \"g\": 600,\n            \"h\": 600,\n            \"i\": 600,\n            \"j\": 600,\n            \"k\": 600,\n            \"l\": 600,\n            \"m\": 600,\n            \"n\": 600,\n            \"o\": 600,\n            \"p\": 600,\n            \"q\": 600,\n            \"r\": 600,\n            \"s\": 600,\n            \"t\": 600,\n            \"u\": 600,\n            \"v\": 600,\n            \"w\": 600,\n            \"x\": 600,\n            \"y\": 600,\n            \"z\": 600,\n            \"{\": 600,\n            \"|\": 600,\n            \"}\": 600,\n            \"~\": 600,\n            \"\\xa1\": 600,\n            \"\\xa2\": 600,\n            \"\\xa3\": 600,\n            \"\\xa4\": 600,\n        
    \"\\xa5\": 600,\n            \"\\xa6\": 600,\n            \"\\xa7\": 600,\n            \"\\xa8\": 600,\n            \"\\xa9\": 600,\n            \"\\xaa\": 600,\n            \"\\xab\": 600,\n            \"\\xac\": 600,\n            \"\\xae\": 600,\n            \"\\xaf\": 600,\n            \"\\xb0\": 600,\n            \"\\xb1\": 600,\n            \"\\xb2\": 600,\n            \"\\xb3\": 600,\n            \"\\xb4\": 600,\n            \"\\xb5\": 600,\n            \"\\xb6\": 600,\n            \"\\xb7\": 600,\n            \"\\xb8\": 600,\n            \"\\xb9\": 600,\n            \"\\xba\": 600,\n            \"\\xbb\": 600,\n            \"\\xbc\": 600,\n            \"\\xbd\": 600,\n            \"\\xbe\": 600,\n            \"\\xbf\": 600,\n            \"\\xc0\": 600,\n            \"\\xc1\": 600,\n            \"\\xc2\": 600,\n            \"\\xc3\": 600,\n            \"\\xc4\": 600,\n            \"\\xc5\": 600,\n            \"\\xc6\": 600,\n            \"\\xc7\": 600,\n            \"\\xc8\": 600,\n            \"\\xc9\": 600,\n            \"\\xca\": 600,\n            \"\\xcb\": 600,\n            \"\\xcc\": 600,\n            \"\\xcd\": 600,\n            \"\\xce\": 600,\n            \"\\xcf\": 600,\n            \"\\xd0\": 600,\n            \"\\xd1\": 600,\n            \"\\xd2\": 600,\n            \"\\xd3\": 600,\n            \"\\xd4\": 600,\n            \"\\xd5\": 600,\n            \"\\xd6\": 600,\n            \"\\xd7\": 600,\n            \"\\xd8\": 600,\n            \"\\xd9\": 600,\n            \"\\xda\": 600,\n            \"\\xdb\": 600,\n            \"\\xdc\": 600,\n            \"\\xdd\": 600,\n            \"\\xde\": 600,\n            \"\\xdf\": 600,\n            \"\\xe0\": 600,\n            \"\\xe1\": 600,\n            \"\\xe2\": 600,\n            \"\\xe3\": 600,\n            \"\\xe4\": 600,\n            \"\\xe5\": 600,\n            \"\\xe6\": 600,\n            \"\\xe7\": 600,\n            \"\\xe8\": 600,\n            \"\\xe9\": 600,\n            \"\\xea\": 600,\n       
     \"\\xeb\": 600,\n            \"\\xec\": 600,\n            \"\\xed\": 600,\n            \"\\xee\": 600,\n            \"\\xef\": 600,\n            \"\\xf0\": 600,\n            \"\\xf1\": 600,\n            \"\\xf2\": 600,\n            \"\\xf3\": 600,\n            \"\\xf4\": 600,\n            \"\\xf5\": 600,\n            \"\\xf6\": 600,\n            \"\\xf7\": 600,\n            \"\\xf8\": 600,\n            \"\\xf9\": 600,\n            \"\\xfa\": 600,\n            \"\\xfb\": 600,\n            \"\\xfc\": 600,\n            \"\\xfd\": 600,\n            \"\\xfe\": 600,\n            \"\\xff\": 600,\n            \"\\u0100\": 600,\n            \"\\u0101\": 600,\n            \"\\u0102\": 600,\n            \"\\u0103\": 600,\n            \"\\u0104\": 600,\n            \"\\u0105\": 600,\n            \"\\u0106\": 600,\n            \"\\u0107\": 600,\n            \"\\u010c\": 600,\n            \"\\u010d\": 600,\n            \"\\u010e\": 600,\n            \"\\u010f\": 600,\n            \"\\u0110\": 600,\n            \"\\u0111\": 600,\n            \"\\u0112\": 600,\n            \"\\u0113\": 600,\n            \"\\u0116\": 600,\n            \"\\u0117\": 600,\n            \"\\u0118\": 600,\n            \"\\u0119\": 600,\n            \"\\u011a\": 600,\n            \"\\u011b\": 600,\n            \"\\u011e\": 600,\n            \"\\u011f\": 600,\n            \"\\u0122\": 600,\n            \"\\u0123\": 600,\n            \"\\u012a\": 600,\n            \"\\u012b\": 600,\n            \"\\u012e\": 600,\n            \"\\u012f\": 600,\n            \"\\u0130\": 600,\n            \"\\u0131\": 600,\n            \"\\u0136\": 600,\n            \"\\u0137\": 600,\n            \"\\u0139\": 600,\n            \"\\u013a\": 600,\n            \"\\u013b\": 600,\n            \"\\u013c\": 600,\n            \"\\u013d\": 600,\n            \"\\u013e\": 600,\n            \"\\u0141\": 600,\n            \"\\u0142\": 600,\n            \"\\u0143\": 600,\n            \"\\u0144\": 600,\n            \"\\u0145\": 600,\n   
         \"\\u0146\": 600,\n            \"\\u0147\": 600,\n            \"\\u0148\": 600,\n            \"\\u014c\": 600,\n            \"\\u014d\": 600,\n            \"\\u0150\": 600,\n            \"\\u0151\": 600,\n            \"\\u0152\": 600,\n            \"\\u0153\": 600,\n            \"\\u0154\": 600,\n            \"\\u0155\": 600,\n            \"\\u0156\": 600,\n            \"\\u0157\": 600,\n            \"\\u0158\": 600,\n            \"\\u0159\": 600,\n            \"\\u015a\": 600,\n            \"\\u015b\": 600,\n            \"\\u015e\": 600,\n            \"\\u015f\": 600,\n            \"\\u0160\": 600,\n            \"\\u0161\": 600,\n            \"\\u0162\": 600,\n            \"\\u0163\": 600,\n            \"\\u0164\": 600,\n            \"\\u0165\": 600,\n            \"\\u016a\": 600,\n            \"\\u016b\": 600,\n            \"\\u016e\": 600,\n            \"\\u016f\": 600,\n            \"\\u0170\": 600,\n            \"\\u0171\": 600,\n            \"\\u0172\": 600,\n            \"\\u0173\": 600,\n            \"\\u0178\": 600,\n            \"\\u0179\": 600,\n            \"\\u017a\": 600,\n            \"\\u017b\": 600,\n            \"\\u017c\": 600,\n            \"\\u017d\": 600,\n            \"\\u017e\": 600,\n            \"\\u0192\": 600,\n            \"\\u0218\": 600,\n            \"\\u0219\": 600,\n            \"\\u02c6\": 600,\n            \"\\u02c7\": 600,\n            \"\\u02d8\": 600,\n            \"\\u02d9\": 600,\n            \"\\u02da\": 600,\n            \"\\u02db\": 600,\n            \"\\u02dc\": 600,\n            \"\\u02dd\": 600,\n            \"\\u2013\": 600,\n            \"\\u2014\": 600,\n            \"\\u2018\": 600,\n            \"\\u2019\": 600,\n            \"\\u201a\": 600,\n            \"\\u201c\": 600,\n            \"\\u201d\": 600,\n            \"\\u201e\": 600,\n            \"\\u2020\": 600,\n            \"\\u2021\": 600,\n            \"\\u2022\": 600,\n            \"\\u2026\": 600,\n            \"\\u2030\": 600,\n            
\"\\u2039\": 600,\n            \"\\u203a\": 600,\n            \"\\u2044\": 600,\n            \"\\u2122\": 600,\n            \"\\u2202\": 600,\n            \"\\u2206\": 600,\n            \"\\u2211\": 600,\n            \"\\u2212\": 600,\n            \"\\u221a\": 600,\n            \"\\u2260\": 600,\n            \"\\u2264\": 600,\n            \"\\u2265\": 600,\n            \"\\u25ca\": 600,\n            \"\\uf6c3\": 600,\n            \"\\ufb01\": 600,\n            \"\\ufb02\": 600,\n        },\n    ),\n    \"Helvetica\": (\n        {\n            \"FontName\": \"Helvetica\",\n            \"Descent\": -207.0,\n            \"FontBBox\": (-166.0, -225.0, 1000.0, 931.0),\n            \"FontWeight\": \"Medium\",\n            \"CapHeight\": 718.0,\n            \"FontFamily\": \"Helvetica\",\n            \"Flags\": 0,\n            \"XHeight\": 523.0,\n            \"ItalicAngle\": 0.0,\n            \"Ascent\": 718.0,\n        },\n        {\n            \" \": 278,\n            \"!\": 278,\n            '\"': 355,\n            \"#\": 556,\n            \"$\": 556,\n            \"%\": 889,\n            \"&\": 667,\n            \"'\": 191,\n            \"(\": 333,\n            \")\": 333,\n            \"*\": 389,\n            \"+\": 584,\n            \",\": 278,\n            \"-\": 333,\n            \".\": 278,\n            \"/\": 278,\n            \"0\": 556,\n            \"1\": 556,\n            \"2\": 556,\n            \"3\": 556,\n            \"4\": 556,\n            \"5\": 556,\n            \"6\": 556,\n            \"7\": 556,\n            \"8\": 556,\n            \"9\": 556,\n            \":\": 278,\n            \";\": 278,\n            \"<\": 584,\n            \"=\": 584,\n            \">\": 584,\n            \"?\": 556,\n            \"@\": 1015,\n            \"A\": 667,\n            \"B\": 667,\n            \"C\": 722,\n            \"D\": 722,\n            \"E\": 667,\n            \"F\": 611,\n            \"G\": 778,\n            \"H\": 722,\n            \"I\": 278,\n       
     \"J\": 500,\n            \"K\": 667,\n            \"L\": 556,\n            \"M\": 833,\n            \"N\": 722,\n            \"O\": 778,\n            \"P\": 667,\n            \"Q\": 778,\n            \"R\": 722,\n            \"S\": 667,\n            \"T\": 611,\n            \"U\": 722,\n            \"V\": 667,\n            \"W\": 944,\n            \"X\": 667,\n            \"Y\": 667,\n            \"Z\": 611,\n            \"[\": 278,\n            \"\\\\\": 278,\n            \"]\": 278,\n            \"^\": 469,\n            \"_\": 556,\n            \"`\": 333,\n            \"a\": 556,\n            \"b\": 556,\n            \"c\": 500,\n            \"d\": 556,\n            \"e\": 556,\n            \"f\": 278,\n            \"g\": 556,\n            \"h\": 556,\n            \"i\": 222,\n            \"j\": 222,\n            \"k\": 500,\n            \"l\": 222,\n            \"m\": 833,\n            \"n\": 556,\n            \"o\": 556,\n            \"p\": 556,\n            \"q\": 556,\n            \"r\": 333,\n            \"s\": 500,\n            \"t\": 278,\n            \"u\": 556,\n            \"v\": 500,\n            \"w\": 722,\n            \"x\": 500,\n            \"y\": 500,\n            \"z\": 500,\n            \"{\": 334,\n            \"|\": 260,\n            \"}\": 334,\n            \"~\": 584,\n            \"\\xa1\": 333,\n            \"\\xa2\": 556,\n            \"\\xa3\": 556,\n            \"\\xa4\": 556,\n            \"\\xa5\": 556,\n            \"\\xa6\": 260,\n            \"\\xa7\": 556,\n            \"\\xa8\": 333,\n            \"\\xa9\": 737,\n            \"\\xaa\": 370,\n            \"\\xab\": 556,\n            \"\\xac\": 584,\n            \"\\xae\": 737,\n            \"\\xaf\": 333,\n            \"\\xb0\": 400,\n            \"\\xb1\": 584,\n            \"\\xb2\": 333,\n            \"\\xb3\": 333,\n            \"\\xb4\": 333,\n            \"\\xb5\": 556,\n            \"\\xb6\": 537,\n            \"\\xb7\": 278,\n            \"\\xb8\": 333,\n            
\"\\xb9\": 333,\n            \"\\xba\": 365,\n            \"\\xbb\": 556,\n            \"\\xbc\": 834,\n            \"\\xbd\": 834,\n            \"\\xbe\": 834,\n            \"\\xbf\": 611,\n            \"\\xc0\": 667,\n            \"\\xc1\": 667,\n            \"\\xc2\": 667,\n            \"\\xc3\": 667,\n            \"\\xc4\": 667,\n            \"\\xc5\": 667,\n            \"\\xc6\": 1000,\n            \"\\xc7\": 722,\n            \"\\xc8\": 667,\n            \"\\xc9\": 667,\n            \"\\xca\": 667,\n            \"\\xcb\": 667,\n            \"\\xcc\": 278,\n            \"\\xcd\": 278,\n            \"\\xce\": 278,\n            \"\\xcf\": 278,\n            \"\\xd0\": 722,\n            \"\\xd1\": 722,\n            \"\\xd2\": 778,\n            \"\\xd3\": 778,\n            \"\\xd4\": 778,\n            \"\\xd5\": 778,\n            \"\\xd6\": 778,\n            \"\\xd7\": 584,\n            \"\\xd8\": 778,\n            \"\\xd9\": 722,\n            \"\\xda\": 722,\n            \"\\xdb\": 722,\n            \"\\xdc\": 722,\n            \"\\xdd\": 667,\n            \"\\xde\": 667,\n            \"\\xdf\": 611,\n            \"\\xe0\": 556,\n            \"\\xe1\": 556,\n            \"\\xe2\": 556,\n            \"\\xe3\": 556,\n            \"\\xe4\": 556,\n            \"\\xe5\": 556,\n            \"\\xe6\": 889,\n            \"\\xe7\": 500,\n            \"\\xe8\": 556,\n            \"\\xe9\": 556,\n            \"\\xea\": 556,\n            \"\\xeb\": 556,\n            \"\\xec\": 278,\n            \"\\xed\": 278,\n            \"\\xee\": 278,\n            \"\\xef\": 278,\n            \"\\xf0\": 556,\n            \"\\xf1\": 556,\n            \"\\xf2\": 556,\n            \"\\xf3\": 556,\n            \"\\xf4\": 556,\n            \"\\xf5\": 556,\n            \"\\xf6\": 556,\n            \"\\xf7\": 584,\n            \"\\xf8\": 611,\n            \"\\xf9\": 556,\n            \"\\xfa\": 556,\n            \"\\xfb\": 556,\n            \"\\xfc\": 556,\n            \"\\xfd\": 500,\n          
  \"\\xfe\": 556,\n            \"\\xff\": 500,\n            \"\\u0100\": 667,\n            \"\\u0101\": 556,\n            \"\\u0102\": 667,\n            \"\\u0103\": 556,\n            \"\\u0104\": 667,\n            \"\\u0105\": 556,\n            \"\\u0106\": 722,\n            \"\\u0107\": 500,\n            \"\\u010c\": 722,\n            \"\\u010d\": 500,\n            \"\\u010e\": 722,\n            \"\\u010f\": 643,\n            \"\\u0110\": 722,\n            \"\\u0111\": 556,\n            \"\\u0112\": 667,\n            \"\\u0113\": 556,\n            \"\\u0116\": 667,\n            \"\\u0117\": 556,\n            \"\\u0118\": 667,\n            \"\\u0119\": 556,\n            \"\\u011a\": 667,\n            \"\\u011b\": 556,\n            \"\\u011e\": 778,\n            \"\\u011f\": 556,\n            \"\\u0122\": 778,\n            \"\\u0123\": 556,\n            \"\\u012a\": 278,\n            \"\\u012b\": 278,\n            \"\\u012e\": 278,\n            \"\\u012f\": 222,\n            \"\\u0130\": 278,\n            \"\\u0131\": 278,\n            \"\\u0136\": 667,\n            \"\\u0137\": 500,\n            \"\\u0139\": 556,\n            \"\\u013a\": 222,\n            \"\\u013b\": 556,\n            \"\\u013c\": 222,\n            \"\\u013d\": 556,\n            \"\\u013e\": 299,\n            \"\\u0141\": 556,\n            \"\\u0142\": 222,\n            \"\\u0143\": 722,\n            \"\\u0144\": 556,\n            \"\\u0145\": 722,\n            \"\\u0146\": 556,\n            \"\\u0147\": 722,\n            \"\\u0148\": 556,\n            \"\\u014c\": 778,\n            \"\\u014d\": 556,\n            \"\\u0150\": 778,\n            \"\\u0151\": 556,\n            \"\\u0152\": 1000,\n            \"\\u0153\": 944,\n            \"\\u0154\": 722,\n            \"\\u0155\": 333,\n            \"\\u0156\": 722,\n            \"\\u0157\": 333,\n            \"\\u0158\": 722,\n            \"\\u0159\": 333,\n            \"\\u015a\": 667,\n            \"\\u015b\": 500,\n            \"\\u015e\": 
667,\n            \"\\u015f\": 500,\n            \"\\u0160\": 667,\n            \"\\u0161\": 500,\n            \"\\u0162\": 611,\n            \"\\u0163\": 278,\n            \"\\u0164\": 611,\n            \"\\u0165\": 317,\n            \"\\u016a\": 722,\n            \"\\u016b\": 556,\n            \"\\u016e\": 722,\n            \"\\u016f\": 556,\n            \"\\u0170\": 722,\n            \"\\u0171\": 556,\n            \"\\u0172\": 722,\n            \"\\u0173\": 556,\n            \"\\u0178\": 667,\n            \"\\u0179\": 611,\n            \"\\u017a\": 500,\n            \"\\u017b\": 611,\n            \"\\u017c\": 500,\n            \"\\u017d\": 611,\n            \"\\u017e\": 500,\n            \"\\u0192\": 556,\n            \"\\u0218\": 667,\n            \"\\u0219\": 500,\n            \"\\u02c6\": 333,\n            \"\\u02c7\": 333,\n            \"\\u02d8\": 333,\n            \"\\u02d9\": 333,\n            \"\\u02da\": 333,\n            \"\\u02db\": 333,\n            \"\\u02dc\": 333,\n            \"\\u02dd\": 333,\n            \"\\u2013\": 556,\n            \"\\u2014\": 1000,\n            \"\\u2018\": 222,\n            \"\\u2019\": 222,\n            \"\\u201a\": 222,\n            \"\\u201c\": 333,\n            \"\\u201d\": 333,\n            \"\\u201e\": 333,\n            \"\\u2020\": 556,\n            \"\\u2021\": 556,\n            \"\\u2022\": 350,\n            \"\\u2026\": 1000,\n            \"\\u2030\": 1000,\n            \"\\u2039\": 333,\n            \"\\u203a\": 333,\n            \"\\u2044\": 167,\n            \"\\u2122\": 1000,\n            \"\\u2202\": 476,\n            \"\\u2206\": 612,\n            \"\\u2211\": 600,\n            \"\\u2212\": 584,\n            \"\\u221a\": 453,\n            \"\\u2260\": 549,\n            \"\\u2264\": 549,\n            \"\\u2265\": 549,\n            \"\\u25ca\": 471,\n            \"\\uf6c3\": 250,\n            \"\\ufb01\": 500,\n            \"\\ufb02\": 500,\n        },\n    ),\n    \"Helvetica-Bold\": (\n        {\n          
  \"FontName\": \"Helvetica-Bold\",\n            \"Descent\": -207.0,\n            \"FontBBox\": (-170.0, -228.0, 1003.0, 962.0),\n            \"FontWeight\": \"Bold\",\n            \"CapHeight\": 718.0,\n            \"FontFamily\": \"Helvetica\",\n            \"Flags\": 0,\n            \"XHeight\": 532.0,\n            \"ItalicAngle\": 0.0,\n            \"Ascent\": 718.0,\n        },\n        {\n            \" \": 278,\n            \"!\": 333,\n            '\"': 474,\n            \"#\": 556,\n            \"$\": 556,\n            \"%\": 889,\n            \"&\": 722,\n            \"'\": 238,\n            \"(\": 333,\n            \")\": 333,\n            \"*\": 389,\n            \"+\": 584,\n            \",\": 278,\n            \"-\": 333,\n            \".\": 278,\n            \"/\": 278,\n            \"0\": 556,\n            \"1\": 556,\n            \"2\": 556,\n            \"3\": 556,\n            \"4\": 556,\n            \"5\": 556,\n            \"6\": 556,\n            \"7\": 556,\n            \"8\": 556,\n            \"9\": 556,\n            \":\": 333,\n            \";\": 333,\n            \"<\": 584,\n            \"=\": 584,\n            \">\": 584,\n            \"?\": 611,\n            \"@\": 975,\n            \"A\": 722,\n            \"B\": 722,\n            \"C\": 722,\n            \"D\": 722,\n            \"E\": 667,\n            \"F\": 611,\n            \"G\": 778,\n            \"H\": 722,\n            \"I\": 278,\n            \"J\": 556,\n            \"K\": 722,\n            \"L\": 611,\n            \"M\": 833,\n            \"N\": 722,\n            \"O\": 778,\n            \"P\": 667,\n            \"Q\": 778,\n            \"R\": 722,\n            \"S\": 667,\n            \"T\": 611,\n            \"U\": 722,\n            \"V\": 667,\n            \"W\": 944,\n            \"X\": 667,\n            \"Y\": 667,\n            \"Z\": 611,\n            \"[\": 333,\n            \"\\\\\": 278,\n            \"]\": 333,\n            \"^\": 584,\n            \"_\": 
556,\n            \"`\": 333,\n            \"a\": 556,\n            \"b\": 611,\n            \"c\": 556,\n            \"d\": 611,\n            \"e\": 556,\n            \"f\": 333,\n            \"g\": 611,\n            \"h\": 611,\n            \"i\": 278,\n            \"j\": 278,\n            \"k\": 556,\n            \"l\": 278,\n            \"m\": 889,\n            \"n\": 611,\n            \"o\": 611,\n            \"p\": 611,\n            \"q\": 611,\n            \"r\": 389,\n            \"s\": 556,\n            \"t\": 333,\n            \"u\": 611,\n            \"v\": 556,\n            \"w\": 778,\n            \"x\": 556,\n            \"y\": 556,\n            \"z\": 500,\n            \"{\": 389,\n            \"|\": 280,\n            \"}\": 389,\n            \"~\": 584,\n            \"\\xa1\": 333,\n            \"\\xa2\": 556,\n            \"\\xa3\": 556,\n            \"\\xa4\": 556,\n            \"\\xa5\": 556,\n            \"\\xa6\": 280,\n            \"\\xa7\": 556,\n            \"\\xa8\": 333,\n            \"\\xa9\": 737,\n            \"\\xaa\": 370,\n            \"\\xab\": 556,\n            \"\\xac\": 584,\n            \"\\xae\": 737,\n            \"\\xaf\": 333,\n            \"\\xb0\": 400,\n            \"\\xb1\": 584,\n            \"\\xb2\": 333,\n            \"\\xb3\": 333,\n            \"\\xb4\": 333,\n            \"\\xb5\": 611,\n            \"\\xb6\": 556,\n            \"\\xb7\": 278,\n            \"\\xb8\": 333,\n            \"\\xb9\": 333,\n            \"\\xba\": 365,\n            \"\\xbb\": 556,\n            \"\\xbc\": 834,\n            \"\\xbd\": 834,\n            \"\\xbe\": 834,\n            \"\\xbf\": 611,\n            \"\\xc0\": 722,\n            \"\\xc1\": 722,\n            \"\\xc2\": 722,\n            \"\\xc3\": 722,\n            \"\\xc4\": 722,\n            \"\\xc5\": 722,\n            \"\\xc6\": 1000,\n            \"\\xc7\": 722,\n            \"\\xc8\": 667,\n            \"\\xc9\": 667,\n            \"\\xca\": 667,\n            \"\\xcb\": 
667,\n            \"\\xcc\": 278,\n            \"\\xcd\": 278,\n            \"\\xce\": 278,\n            \"\\xcf\": 278,\n            \"\\xd0\": 722,\n            \"\\xd1\": 722,\n            \"\\xd2\": 778,\n            \"\\xd3\": 778,\n            \"\\xd4\": 778,\n            \"\\xd5\": 778,\n            \"\\xd6\": 778,\n            \"\\xd7\": 584,\n            \"\\xd8\": 778,\n            \"\\xd9\": 722,\n            \"\\xda\": 722,\n            \"\\xdb\": 722,\n            \"\\xdc\": 722,\n            \"\\xdd\": 667,\n            \"\\xde\": 667,\n            \"\\xdf\": 611,\n            \"\\xe0\": 556,\n            \"\\xe1\": 556,\n            \"\\xe2\": 556,\n            \"\\xe3\": 556,\n            \"\\xe4\": 556,\n            \"\\xe5\": 556,\n            \"\\xe6\": 889,\n            \"\\xe7\": 556,\n            \"\\xe8\": 556,\n            \"\\xe9\": 556,\n            \"\\xea\": 556,\n            \"\\xeb\": 556,\n            \"\\xec\": 278,\n            \"\\xed\": 278,\n            \"\\xee\": 278,\n            \"\\xef\": 278,\n            \"\\xf0\": 611,\n            \"\\xf1\": 611,\n            \"\\xf2\": 611,\n            \"\\xf3\": 611,\n            \"\\xf4\": 611,\n            \"\\xf5\": 611,\n            \"\\xf6\": 611,\n            \"\\xf7\": 584,\n            \"\\xf8\": 611,\n            \"\\xf9\": 611,\n            \"\\xfa\": 611,\n            \"\\xfb\": 611,\n            \"\\xfc\": 611,\n            \"\\xfd\": 556,\n            \"\\xfe\": 611,\n            \"\\xff\": 556,\n            \"\\u0100\": 722,\n            \"\\u0101\": 556,\n            \"\\u0102\": 722,\n            \"\\u0103\": 556,\n            \"\\u0104\": 722,\n            \"\\u0105\": 556,\n            \"\\u0106\": 722,\n            \"\\u0107\": 556,\n            \"\\u010c\": 722,\n            \"\\u010d\": 556,\n            \"\\u010e\": 722,\n            \"\\u010f\": 743,\n            \"\\u0110\": 722,\n            \"\\u0111\": 611,\n            \"\\u0112\": 667,\n            
\"\\u0113\": 556,\n            \"\\u0116\": 667,\n            \"\\u0117\": 556,\n            \"\\u0118\": 667,\n            \"\\u0119\": 556,\n            \"\\u011a\": 667,\n            \"\\u011b\": 556,\n            \"\\u011e\": 778,\n            \"\\u011f\": 611,\n            \"\\u0122\": 778,\n            \"\\u0123\": 611,\n            \"\\u012a\": 278,\n            \"\\u012b\": 278,\n            \"\\u012e\": 278,\n            \"\\u012f\": 278,\n            \"\\u0130\": 278,\n            \"\\u0131\": 278,\n            \"\\u0136\": 722,\n            \"\\u0137\": 556,\n            \"\\u0139\": 611,\n            \"\\u013a\": 278,\n            \"\\u013b\": 611,\n            \"\\u013c\": 278,\n            \"\\u013d\": 611,\n            \"\\u013e\": 400,\n            \"\\u0141\": 611,\n            \"\\u0142\": 278,\n            \"\\u0143\": 722,\n            \"\\u0144\": 611,\n            \"\\u0145\": 722,\n            \"\\u0146\": 611,\n            \"\\u0147\": 722,\n            \"\\u0148\": 611,\n            \"\\u014c\": 778,\n            \"\\u014d\": 611,\n            \"\\u0150\": 778,\n            \"\\u0151\": 611,\n            \"\\u0152\": 1000,\n            \"\\u0153\": 944,\n            \"\\u0154\": 722,\n            \"\\u0155\": 389,\n            \"\\u0156\": 722,\n            \"\\u0157\": 389,\n            \"\\u0158\": 722,\n            \"\\u0159\": 389,\n            \"\\u015a\": 667,\n            \"\\u015b\": 556,\n            \"\\u015e\": 667,\n            \"\\u015f\": 556,\n            \"\\u0160\": 667,\n            \"\\u0161\": 556,\n            \"\\u0162\": 611,\n            \"\\u0163\": 333,\n            \"\\u0164\": 611,\n            \"\\u0165\": 389,\n            \"\\u016a\": 722,\n            \"\\u016b\": 611,\n            \"\\u016e\": 722,\n            \"\\u016f\": 611,\n            \"\\u0170\": 722,\n            \"\\u0171\": 611,\n            \"\\u0172\": 722,\n            \"\\u0173\": 611,\n            \"\\u0178\": 667,\n            \"\\u0179\": 
611,\n            \"\\u017a\": 500,\n            \"\\u017b\": 611,\n            \"\\u017c\": 500,\n            \"\\u017d\": 611,\n            \"\\u017e\": 500,\n            \"\\u0192\": 556,\n            \"\\u0218\": 667,\n            \"\\u0219\": 556,\n            \"\\u02c6\": 333,\n            \"\\u02c7\": 333,\n            \"\\u02d8\": 333,\n            \"\\u02d9\": 333,\n            \"\\u02da\": 333,\n            \"\\u02db\": 333,\n            \"\\u02dc\": 333,\n            \"\\u02dd\": 333,\n            \"\\u2013\": 556,\n            \"\\u2014\": 1000,\n            \"\\u2018\": 278,\n            \"\\u2019\": 278,\n            \"\\u201a\": 278,\n            \"\\u201c\": 500,\n            \"\\u201d\": 500,\n            \"\\u201e\": 500,\n            \"\\u2020\": 556,\n            \"\\u2021\": 556,\n            \"\\u2022\": 350,\n            \"\\u2026\": 1000,\n            \"\\u2030\": 1000,\n            \"\\u2039\": 333,\n            \"\\u203a\": 333,\n            \"\\u2044\": 167,\n            \"\\u2122\": 1000,\n            \"\\u2202\": 494,\n            \"\\u2206\": 612,\n            \"\\u2211\": 600,\n            \"\\u2212\": 584,\n            \"\\u221a\": 549,\n            \"\\u2260\": 549,\n            \"\\u2264\": 549,\n            \"\\u2265\": 549,\n            \"\\u25ca\": 494,\n            \"\\uf6c3\": 250,\n            \"\\ufb01\": 611,\n            \"\\ufb02\": 611,\n        },\n    ),\n    \"Helvetica-BoldOblique\": (\n        {\n            \"FontName\": \"Helvetica-BoldOblique\",\n            \"Descent\": -207.0,\n            \"FontBBox\": (-175.0, -228.0, 1114.0, 962.0),\n            \"FontWeight\": \"Bold\",\n            \"CapHeight\": 718.0,\n            \"FontFamily\": \"Helvetica\",\n            \"Flags\": 0,\n            \"XHeight\": 532.0,\n            \"ItalicAngle\": -12.0,\n            \"Ascent\": 718.0,\n        },\n        {\n            \" \": 278,\n            \"!\": 333,\n            '\"': 474,\n            \"#\": 556,\n            
\"$\": 556,\n            \"%\": 889,\n            \"&\": 722,\n            \"'\": 238,\n            \"(\": 333,\n            \")\": 333,\n            \"*\": 389,\n            \"+\": 584,\n            \",\": 278,\n            \"-\": 333,\n            \".\": 278,\n            \"/\": 278,\n            \"0\": 556,\n            \"1\": 556,\n            \"2\": 556,\n            \"3\": 556,\n            \"4\": 556,\n            \"5\": 556,\n            \"6\": 556,\n            \"7\": 556,\n            \"8\": 556,\n            \"9\": 556,\n            \":\": 333,\n            \";\": 333,\n            \"<\": 584,\n            \"=\": 584,\n            \">\": 584,\n            \"?\": 611,\n            \"@\": 975,\n            \"A\": 722,\n            \"B\": 722,\n            \"C\": 722,\n            \"D\": 722,\n            \"E\": 667,\n            \"F\": 611,\n            \"G\": 778,\n            \"H\": 722,\n            \"I\": 278,\n            \"J\": 556,\n            \"K\": 722,\n            \"L\": 611,\n            \"M\": 833,\n            \"N\": 722,\n            \"O\": 778,\n            \"P\": 667,\n            \"Q\": 778,\n            \"R\": 722,\n            \"S\": 667,\n            \"T\": 611,\n            \"U\": 722,\n            \"V\": 667,\n            \"W\": 944,\n            \"X\": 667,\n            \"Y\": 667,\n            \"Z\": 611,\n            \"[\": 333,\n            \"\\\\\": 278,\n            \"]\": 333,\n            \"^\": 584,\n            \"_\": 556,\n            \"`\": 333,\n            \"a\": 556,\n            \"b\": 611,\n            \"c\": 556,\n            \"d\": 611,\n            \"e\": 556,\n            \"f\": 333,\n            \"g\": 611,\n            \"h\": 611,\n            \"i\": 278,\n            \"j\": 278,\n            \"k\": 556,\n            \"l\": 278,\n            \"m\": 889,\n            \"n\": 611,\n            \"o\": 611,\n            \"p\": 611,\n            \"q\": 611,\n            \"r\": 389,\n            \"s\": 556,\n         
   \"t\": 333,\n            \"u\": 611,\n            \"v\": 556,\n            \"w\": 778,\n            \"x\": 556,\n            \"y\": 556,\n            \"z\": 500,\n            \"{\": 389,\n            \"|\": 280,\n            \"}\": 389,\n            \"~\": 584,\n            \"\\xa1\": 333,\n            \"\\xa2\": 556,\n            \"\\xa3\": 556,\n            \"\\xa4\": 556,\n            \"\\xa5\": 556,\n            \"\\xa6\": 280,\n            \"\\xa7\": 556,\n            \"\\xa8\": 333,\n            \"\\xa9\": 737,\n            \"\\xaa\": 370,\n            \"\\xab\": 556,\n            \"\\xac\": 584,\n            \"\\xae\": 737,\n            \"\\xaf\": 333,\n            \"\\xb0\": 400,\n            \"\\xb1\": 584,\n            \"\\xb2\": 333,\n            \"\\xb3\": 333,\n            \"\\xb4\": 333,\n            \"\\xb5\": 611,\n            \"\\xb6\": 556,\n            \"\\xb7\": 278,\n            \"\\xb8\": 333,\n            \"\\xb9\": 333,\n            \"\\xba\": 365,\n            \"\\xbb\": 556,\n            \"\\xbc\": 834,\n            \"\\xbd\": 834,\n            \"\\xbe\": 834,\n            \"\\xbf\": 611,\n            \"\\xc0\": 722,\n            \"\\xc1\": 722,\n            \"\\xc2\": 722,\n            \"\\xc3\": 722,\n            \"\\xc4\": 722,\n            \"\\xc5\": 722,\n            \"\\xc6\": 1000,\n            \"\\xc7\": 722,\n            \"\\xc8\": 667,\n            \"\\xc9\": 667,\n            \"\\xca\": 667,\n            \"\\xcb\": 667,\n            \"\\xcc\": 278,\n            \"\\xcd\": 278,\n            \"\\xce\": 278,\n            \"\\xcf\": 278,\n            \"\\xd0\": 722,\n            \"\\xd1\": 722,\n            \"\\xd2\": 778,\n            \"\\xd3\": 778,\n            \"\\xd4\": 778,\n            \"\\xd5\": 778,\n            \"\\xd6\": 778,\n            \"\\xd7\": 584,\n            \"\\xd8\": 778,\n            \"\\xd9\": 722,\n            \"\\xda\": 722,\n            \"\\xdb\": 722,\n            \"\\xdc\": 722,\n            
\"\\xdd\": 667,\n            \"\\xde\": 667,\n            \"\\xdf\": 611,\n            \"\\xe0\": 556,\n            \"\\xe1\": 556,\n            \"\\xe2\": 556,\n            \"\\xe3\": 556,\n            \"\\xe4\": 556,\n            \"\\xe5\": 556,\n            \"\\xe6\": 889,\n            \"\\xe7\": 556,\n            \"\\xe8\": 556,\n            \"\\xe9\": 556,\n            \"\\xea\": 556,\n            \"\\xeb\": 556,\n            \"\\xec\": 278,\n            \"\\xed\": 278,\n            \"\\xee\": 278,\n            \"\\xef\": 278,\n            \"\\xf0\": 611,\n            \"\\xf1\": 611,\n            \"\\xf2\": 611,\n            \"\\xf3\": 611,\n            \"\\xf4\": 611,\n            \"\\xf5\": 611,\n            \"\\xf6\": 611,\n            \"\\xf7\": 584,\n            \"\\xf8\": 611,\n            \"\\xf9\": 611,\n            \"\\xfa\": 611,\n            \"\\xfb\": 611,\n            \"\\xfc\": 611,\n            \"\\xfd\": 556,\n            \"\\xfe\": 611,\n            \"\\xff\": 556,\n            \"\\u0100\": 722,\n            \"\\u0101\": 556,\n            \"\\u0102\": 722,\n            \"\\u0103\": 556,\n            \"\\u0104\": 722,\n            \"\\u0105\": 556,\n            \"\\u0106\": 722,\n            \"\\u0107\": 556,\n            \"\\u010c\": 722,\n            \"\\u010d\": 556,\n            \"\\u010e\": 722,\n            \"\\u010f\": 743,\n            \"\\u0110\": 722,\n            \"\\u0111\": 611,\n            \"\\u0112\": 667,\n            \"\\u0113\": 556,\n            \"\\u0116\": 667,\n            \"\\u0117\": 556,\n            \"\\u0118\": 667,\n            \"\\u0119\": 556,\n            \"\\u011a\": 667,\n            \"\\u011b\": 556,\n            \"\\u011e\": 778,\n            \"\\u011f\": 611,\n            \"\\u0122\": 778,\n            \"\\u0123\": 611,\n            \"\\u012a\": 278,\n            \"\\u012b\": 278,\n            \"\\u012e\": 278,\n            \"\\u012f\": 278,\n            \"\\u0130\": 278,\n            \"\\u0131\": 278,\n     
       \"\\u0136\": 722,\n            \"\\u0137\": 556,\n            \"\\u0139\": 611,\n            \"\\u013a\": 278,\n            \"\\u013b\": 611,\n            \"\\u013c\": 278,\n            \"\\u013d\": 611,\n            \"\\u013e\": 400,\n            \"\\u0141\": 611,\n            \"\\u0142\": 278,\n            \"\\u0143\": 722,\n            \"\\u0144\": 611,\n            \"\\u0145\": 722,\n            \"\\u0146\": 611,\n            \"\\u0147\": 722,\n            \"\\u0148\": 611,\n            \"\\u014c\": 778,\n            \"\\u014d\": 611,\n            \"\\u0150\": 778,\n            \"\\u0151\": 611,\n            \"\\u0152\": 1000,\n            \"\\u0153\": 944,\n            \"\\u0154\": 722,\n            \"\\u0155\": 389,\n            \"\\u0156\": 722,\n            \"\\u0157\": 389,\n            \"\\u0158\": 722,\n            \"\\u0159\": 389,\n            \"\\u015a\": 667,\n            \"\\u015b\": 556,\n            \"\\u015e\": 667,\n            \"\\u015f\": 556,\n            \"\\u0160\": 667,\n            \"\\u0161\": 556,\n            \"\\u0162\": 611,\n            \"\\u0163\": 333,\n            \"\\u0164\": 611,\n            \"\\u0165\": 389,\n            \"\\u016a\": 722,\n            \"\\u016b\": 611,\n            \"\\u016e\": 722,\n            \"\\u016f\": 611,\n            \"\\u0170\": 722,\n            \"\\u0171\": 611,\n            \"\\u0172\": 722,\n            \"\\u0173\": 611,\n            \"\\u0178\": 667,\n            \"\\u0179\": 611,\n            \"\\u017a\": 500,\n            \"\\u017b\": 611,\n            \"\\u017c\": 500,\n            \"\\u017d\": 611,\n            \"\\u017e\": 500,\n            \"\\u0192\": 556,\n            \"\\u0218\": 667,\n            \"\\u0219\": 556,\n            \"\\u02c6\": 333,\n            \"\\u02c7\": 333,\n            \"\\u02d8\": 333,\n            \"\\u02d9\": 333,\n            \"\\u02da\": 333,\n            \"\\u02db\": 333,\n            \"\\u02dc\": 333,\n            \"\\u02dd\": 333,\n            
\"\\u2013\": 556,\n            \"\\u2014\": 1000,\n            \"\\u2018\": 278,\n            \"\\u2019\": 278,\n            \"\\u201a\": 278,\n            \"\\u201c\": 500,\n            \"\\u201d\": 500,\n            \"\\u201e\": 500,\n            \"\\u2020\": 556,\n            \"\\u2021\": 556,\n            \"\\u2022\": 350,\n            \"\\u2026\": 1000,\n            \"\\u2030\": 1000,\n            \"\\u2039\": 333,\n            \"\\u203a\": 333,\n            \"\\u2044\": 167,\n            \"\\u2122\": 1000,\n            \"\\u2202\": 494,\n            \"\\u2206\": 612,\n            \"\\u2211\": 600,\n            \"\\u2212\": 584,\n            \"\\u221a\": 549,\n            \"\\u2260\": 549,\n            \"\\u2264\": 549,\n            \"\\u2265\": 549,\n            \"\\u25ca\": 494,\n            \"\\uf6c3\": 250,\n            \"\\ufb01\": 611,\n            \"\\ufb02\": 611,\n        },\n    ),\n    \"Helvetica-Oblique\": (\n        {\n            \"FontName\": \"Helvetica-Oblique\",\n            \"Descent\": -207.0,\n            \"FontBBox\": (-171.0, -225.0, 1116.0, 931.0),\n            \"FontWeight\": \"Medium\",\n            \"CapHeight\": 718.0,\n            \"FontFamily\": \"Helvetica\",\n            \"Flags\": 0,\n            \"XHeight\": 523.0,\n            \"ItalicAngle\": -12.0,\n            \"Ascent\": 718.0,\n        },\n        {\n            \" \": 278,\n            \"!\": 278,\n            '\"': 355,\n            \"#\": 556,\n            \"$\": 556,\n            \"%\": 889,\n            \"&\": 667,\n            \"'\": 191,\n            \"(\": 333,\n            \")\": 333,\n            \"*\": 389,\n            \"+\": 584,\n            \",\": 278,\n            \"-\": 333,\n            \".\": 278,\n            \"/\": 278,\n            \"0\": 556,\n            \"1\": 556,\n            \"2\": 556,\n            \"3\": 556,\n            \"4\": 556,\n            \"5\": 556,\n            \"6\": 556,\n            \"7\": 556,\n            \"8\": 556,\n        
    \"9\": 556,\n            \":\": 278,\n            \";\": 278,\n            \"<\": 584,\n            \"=\": 584,\n            \">\": 584,\n            \"?\": 556,\n            \"@\": 1015,\n            \"A\": 667,\n            \"B\": 667,\n            \"C\": 722,\n            \"D\": 722,\n            \"E\": 667,\n            \"F\": 611,\n            \"G\": 778,\n            \"H\": 722,\n            \"I\": 278,\n            \"J\": 500,\n            \"K\": 667,\n            \"L\": 556,\n            \"M\": 833,\n            \"N\": 722,\n            \"O\": 778,\n            \"P\": 667,\n            \"Q\": 778,\n            \"R\": 722,\n            \"S\": 667,\n            \"T\": 611,\n            \"U\": 722,\n            \"V\": 667,\n            \"W\": 944,\n            \"X\": 667,\n            \"Y\": 667,\n            \"Z\": 611,\n            \"[\": 278,\n            \"\\\\\": 278,\n            \"]\": 278,\n            \"^\": 469,\n            \"_\": 556,\n            \"`\": 333,\n            \"a\": 556,\n            \"b\": 556,\n            \"c\": 500,\n            \"d\": 556,\n            \"e\": 556,\n            \"f\": 278,\n            \"g\": 556,\n            \"h\": 556,\n            \"i\": 222,\n            \"j\": 222,\n            \"k\": 500,\n            \"l\": 222,\n            \"m\": 833,\n            \"n\": 556,\n            \"o\": 556,\n            \"p\": 556,\n            \"q\": 556,\n            \"r\": 333,\n            \"s\": 500,\n            \"t\": 278,\n            \"u\": 556,\n            \"v\": 500,\n            \"w\": 722,\n            \"x\": 500,\n            \"y\": 500,\n            \"z\": 500,\n            \"{\": 334,\n            \"|\": 260,\n            \"}\": 334,\n            \"~\": 584,\n            \"\\xa1\": 333,\n            \"\\xa2\": 556,\n            \"\\xa3\": 556,\n            \"\\xa4\": 556,\n            \"\\xa5\": 556,\n            \"\\xa6\": 260,\n            \"\\xa7\": 556,\n            \"\\xa8\": 333,\n            
\"\\xa9\": 737,\n            \"\\xaa\": 370,\n            \"\\xab\": 556,\n            \"\\xac\": 584,\n            \"\\xae\": 737,\n            \"\\xaf\": 333,\n            \"\\xb0\": 400,\n            \"\\xb1\": 584,\n            \"\\xb2\": 333,\n            \"\\xb3\": 333,\n            \"\\xb4\": 333,\n            \"\\xb5\": 556,\n            \"\\xb6\": 537,\n            \"\\xb7\": 278,\n            \"\\xb8\": 333,\n            \"\\xb9\": 333,\n            \"\\xba\": 365,\n            \"\\xbb\": 556,\n            \"\\xbc\": 834,\n            \"\\xbd\": 834,\n            \"\\xbe\": 834,\n            \"\\xbf\": 611,\n            \"\\xc0\": 667,\n            \"\\xc1\": 667,\n            \"\\xc2\": 667,\n            \"\\xc3\": 667,\n            \"\\xc4\": 667,\n            \"\\xc5\": 667,\n            \"\\xc6\": 1000,\n            \"\\xc7\": 722,\n            \"\\xc8\": 667,\n            \"\\xc9\": 667,\n            \"\\xca\": 667,\n            \"\\xcb\": 667,\n            \"\\xcc\": 278,\n            \"\\xcd\": 278,\n            \"\\xce\": 278,\n            \"\\xcf\": 278,\n            \"\\xd0\": 722,\n            \"\\xd1\": 722,\n            \"\\xd2\": 778,\n            \"\\xd3\": 778,\n            \"\\xd4\": 778,\n            \"\\xd5\": 778,\n            \"\\xd6\": 778,\n            \"\\xd7\": 584,\n            \"\\xd8\": 778,\n            \"\\xd9\": 722,\n            \"\\xda\": 722,\n            \"\\xdb\": 722,\n            \"\\xdc\": 722,\n            \"\\xdd\": 667,\n            \"\\xde\": 667,\n            \"\\xdf\": 611,\n            \"\\xe0\": 556,\n            \"\\xe1\": 556,\n            \"\\xe2\": 556,\n            \"\\xe3\": 556,\n            \"\\xe4\": 556,\n            \"\\xe5\": 556,\n            \"\\xe6\": 889,\n            \"\\xe7\": 500,\n            \"\\xe8\": 556,\n            \"\\xe9\": 556,\n            \"\\xea\": 556,\n            \"\\xeb\": 556,\n            \"\\xec\": 278,\n            \"\\xed\": 278,\n            \"\\xee\": 278,\n          
  \"\\xef\": 278,\n            \"\\xf0\": 556,\n            \"\\xf1\": 556,\n            \"\\xf2\": 556,\n            \"\\xf3\": 556,\n            \"\\xf4\": 556,\n            \"\\xf5\": 556,\n            \"\\xf6\": 556,\n            \"\\xf7\": 584,\n            \"\\xf8\": 611,\n            \"\\xf9\": 556,\n            \"\\xfa\": 556,\n            \"\\xfb\": 556,\n            \"\\xfc\": 556,\n            \"\\xfd\": 500,\n            \"\\xfe\": 556,\n            \"\\xff\": 500,\n            \"\\u0100\": 667,\n            \"\\u0101\": 556,\n            \"\\u0102\": 667,\n            \"\\u0103\": 556,\n            \"\\u0104\": 667,\n            \"\\u0105\": 556,\n            \"\\u0106\": 722,\n            \"\\u0107\": 500,\n            \"\\u010c\": 722,\n            \"\\u010d\": 500,\n            \"\\u010e\": 722,\n            \"\\u010f\": 643,\n            \"\\u0110\": 722,\n            \"\\u0111\": 556,\n            \"\\u0112\": 667,\n            \"\\u0113\": 556,\n            \"\\u0116\": 667,\n            \"\\u0117\": 556,\n            \"\\u0118\": 667,\n            \"\\u0119\": 556,\n            \"\\u011a\": 667,\n            \"\\u011b\": 556,\n            \"\\u011e\": 778,\n            \"\\u011f\": 556,\n            \"\\u0122\": 778,\n            \"\\u0123\": 556,\n            \"\\u012a\": 278,\n            \"\\u012b\": 278,\n            \"\\u012e\": 278,\n            \"\\u012f\": 222,\n            \"\\u0130\": 278,\n            \"\\u0131\": 278,\n            \"\\u0136\": 667,\n            \"\\u0137\": 500,\n            \"\\u0139\": 556,\n            \"\\u013a\": 222,\n            \"\\u013b\": 556,\n            \"\\u013c\": 222,\n            \"\\u013d\": 556,\n            \"\\u013e\": 299,\n            \"\\u0141\": 556,\n            \"\\u0142\": 222,\n            \"\\u0143\": 722,\n            \"\\u0144\": 556,\n            \"\\u0145\": 722,\n            \"\\u0146\": 556,\n            \"\\u0147\": 722,\n            \"\\u0148\": 556,\n            \"\\u014c\": 
778,\n            \"\\u014d\": 556,\n            \"\\u0150\": 778,\n            \"\\u0151\": 556,\n            \"\\u0152\": 1000,\n            \"\\u0153\": 944,\n            \"\\u0154\": 722,\n            \"\\u0155\": 333,\n            \"\\u0156\": 722,\n            \"\\u0157\": 333,\n            \"\\u0158\": 722,\n            \"\\u0159\": 333,\n            \"\\u015a\": 667,\n            \"\\u015b\": 500,\n            \"\\u015e\": 667,\n            \"\\u015f\": 500,\n            \"\\u0160\": 667,\n            \"\\u0161\": 500,\n            \"\\u0162\": 611,\n            \"\\u0163\": 278,\n            \"\\u0164\": 611,\n            \"\\u0165\": 317,\n            \"\\u016a\": 722,\n            \"\\u016b\": 556,\n            \"\\u016e\": 722,\n            \"\\u016f\": 556,\n            \"\\u0170\": 722,\n            \"\\u0171\": 556,\n            \"\\u0172\": 722,\n            \"\\u0173\": 556,\n            \"\\u0178\": 667,\n            \"\\u0179\": 611,\n            \"\\u017a\": 500,\n            \"\\u017b\": 611,\n            \"\\u017c\": 500,\n            \"\\u017d\": 611,\n            \"\\u017e\": 500,\n            \"\\u0192\": 556,\n            \"\\u0218\": 667,\n            \"\\u0219\": 500,\n            \"\\u02c6\": 333,\n            \"\\u02c7\": 333,\n            \"\\u02d8\": 333,\n            \"\\u02d9\": 333,\n            \"\\u02da\": 333,\n            \"\\u02db\": 333,\n            \"\\u02dc\": 333,\n            \"\\u02dd\": 333,\n            \"\\u2013\": 556,\n            \"\\u2014\": 1000,\n            \"\\u2018\": 222,\n            \"\\u2019\": 222,\n            \"\\u201a\": 222,\n            \"\\u201c\": 333,\n            \"\\u201d\": 333,\n            \"\\u201e\": 333,\n            \"\\u2020\": 556,\n            \"\\u2021\": 556,\n            \"\\u2022\": 350,\n            \"\\u2026\": 1000,\n            \"\\u2030\": 1000,\n            \"\\u2039\": 333,\n            \"\\u203a\": 333,\n            \"\\u2044\": 167,\n            \"\\u2122\": 1000,\n     
       \"\\u2202\": 476,\n            \"\\u2206\": 612,\n            \"\\u2211\": 600,\n            \"\\u2212\": 584,\n            \"\\u221a\": 453,\n            \"\\u2260\": 549,\n            \"\\u2264\": 549,\n            \"\\u2265\": 549,\n            \"\\u25ca\": 471,\n            \"\\uf6c3\": 250,\n            \"\\ufb01\": 500,\n            \"\\ufb02\": 500,\n        },\n    ),\n    \"Symbol\": (\n        {\n            \"FontName\": \"Symbol\",\n            \"FontBBox\": (-180.0, -293.0, 1090.0, 1010.0),\n            \"FontWeight\": \"Medium\",\n            \"FontFamily\": \"Symbol\",\n            \"Flags\": 0,\n            \"ItalicAngle\": 0.0,\n        },\n        {\n            \" \": 250,\n            \"!\": 333,\n            \"#\": 500,\n            \"%\": 833,\n            \"&\": 778,\n            \"(\": 333,\n            \")\": 333,\n            \"+\": 549,\n            \",\": 250,\n            \".\": 250,\n            \"/\": 278,\n            \"0\": 500,\n            \"1\": 500,\n            \"2\": 500,\n            \"3\": 500,\n            \"4\": 500,\n            \"5\": 500,\n            \"6\": 500,\n            \"7\": 500,\n            \"8\": 500,\n            \"9\": 500,\n            \":\": 278,\n            \";\": 278,\n            \"<\": 549,\n            \"=\": 549,\n            \">\": 549,\n            \"?\": 444,\n            \"[\": 333,\n            \"]\": 333,\n            \"_\": 500,\n            \"{\": 480,\n            \"|\": 200,\n            \"}\": 480,\n            \"\\xac\": 713,\n            \"\\xb0\": 400,\n            \"\\xb1\": 549,\n            \"\\xb5\": 576,\n            \"\\xd7\": 549,\n            \"\\xf7\": 549,\n            \"\\u0192\": 500,\n            \"\\u0391\": 722,\n            \"\\u0392\": 667,\n            \"\\u0393\": 603,\n            \"\\u0395\": 611,\n            \"\\u0396\": 611,\n            \"\\u0397\": 722,\n            \"\\u0398\": 741,\n            \"\\u0399\": 333,\n            \"\\u039a\": 722,\n       
     \"\\u039b\": 686,\n            \"\\u039c\": 889,\n            \"\\u039d\": 722,\n            \"\\u039e\": 645,\n            \"\\u039f\": 722,\n            \"\\u03a0\": 768,\n            \"\\u03a1\": 556,\n            \"\\u03a3\": 592,\n            \"\\u03a4\": 611,\n            \"\\u03a5\": 690,\n            \"\\u03a6\": 763,\n            \"\\u03a7\": 722,\n            \"\\u03a8\": 795,\n            \"\\u03b1\": 631,\n            \"\\u03b2\": 549,\n            \"\\u03b3\": 411,\n            \"\\u03b4\": 494,\n            \"\\u03b5\": 439,\n            \"\\u03b6\": 494,\n            \"\\u03b7\": 603,\n            \"\\u03b8\": 521,\n            \"\\u03b9\": 329,\n            \"\\u03ba\": 549,\n            \"\\u03bb\": 549,\n            \"\\u03bd\": 521,\n            \"\\u03be\": 493,\n            \"\\u03bf\": 549,\n            \"\\u03c0\": 549,\n            \"\\u03c1\": 549,\n            \"\\u03c2\": 439,\n            \"\\u03c3\": 603,\n            \"\\u03c4\": 439,\n            \"\\u03c5\": 576,\n            \"\\u03c6\": 521,\n            \"\\u03c7\": 549,\n            \"\\u03c8\": 686,\n            \"\\u03c9\": 686,\n            \"\\u03d1\": 631,\n            \"\\u03d2\": 620,\n            \"\\u03d5\": 603,\n            \"\\u03d6\": 713,\n            \"\\u2022\": 460,\n            \"\\u2026\": 1000,\n            \"\\u2032\": 247,\n            \"\\u2033\": 411,\n            \"\\u2044\": 167,\n            \"\\u20ac\": 750,\n            \"\\u2111\": 686,\n            \"\\u2118\": 987,\n            \"\\u211c\": 795,\n            \"\\u2126\": 768,\n            \"\\u2135\": 823,\n            \"\\u2190\": 987,\n            \"\\u2191\": 603,\n            \"\\u2192\": 987,\n            \"\\u2193\": 603,\n            \"\\u2194\": 1042,\n            \"\\u21b5\": 658,\n            \"\\u21d0\": 987,\n            \"\\u21d1\": 603,\n            \"\\u21d2\": 987,\n            \"\\u21d3\": 603,\n            \"\\u21d4\": 1042,\n            \"\\u2200\": 713,\n            
\"\\u2202\": 494,\n            \"\\u2203\": 549,\n            \"\\u2205\": 823,\n            \"\\u2206\": 612,\n            \"\\u2207\": 713,\n            \"\\u2208\": 713,\n            \"\\u2209\": 713,\n            \"\\u220b\": 439,\n            \"\\u220f\": 823,\n            \"\\u2211\": 713,\n            \"\\u2212\": 549,\n            \"\\u2217\": 500,\n            \"\\u221a\": 549,\n            \"\\u221d\": 713,\n            \"\\u221e\": 713,\n            \"\\u2220\": 768,\n            \"\\u2227\": 603,\n            \"\\u2228\": 603,\n            \"\\u2229\": 768,\n            \"\\u222a\": 768,\n            \"\\u222b\": 274,\n            \"\\u2234\": 863,\n            \"\\u223c\": 549,\n            \"\\u2245\": 549,\n            \"\\u2248\": 549,\n            \"\\u2260\": 549,\n            \"\\u2261\": 549,\n            \"\\u2264\": 549,\n            \"\\u2265\": 549,\n            \"\\u2282\": 713,\n            \"\\u2283\": 713,\n            \"\\u2284\": 713,\n            \"\\u2286\": 713,\n            \"\\u2287\": 713,\n            \"\\u2295\": 768,\n            \"\\u2297\": 768,\n            \"\\u22a5\": 658,\n            \"\\u22c5\": 250,\n            \"\\u2320\": 686,\n            \"\\u2321\": 686,\n            \"\\u2329\": 329,\n            \"\\u232a\": 329,\n            \"\\u25ca\": 494,\n            \"\\u2660\": 753,\n            \"\\u2663\": 753,\n            \"\\u2665\": 753,\n            \"\\u2666\": 753,\n            \"\\uf6d9\": 790,\n            \"\\uf6da\": 790,\n            \"\\uf6db\": 890,\n            \"\\uf8e5\": 500,\n            \"\\uf8e6\": 603,\n            \"\\uf8e7\": 1000,\n            \"\\uf8e8\": 790,\n            \"\\uf8e9\": 790,\n            \"\\uf8ea\": 786,\n            \"\\uf8eb\": 384,\n            \"\\uf8ec\": 384,\n            \"\\uf8ed\": 384,\n            \"\\uf8ee\": 384,\n            \"\\uf8ef\": 384,\n            \"\\uf8f0\": 384,\n            \"\\uf8f1\": 494,\n            \"\\uf8f2\": 494,\n            \"\\uf8f3\": 
494,\n            \"\\uf8f4\": 494,\n            \"\\uf8f5\": 686,\n            \"\\uf8f6\": 384,\n            \"\\uf8f7\": 384,\n            \"\\uf8f8\": 384,\n            \"\\uf8f9\": 384,\n            \"\\uf8fa\": 384,\n            \"\\uf8fb\": 384,\n            \"\\uf8fc\": 494,\n            \"\\uf8fd\": 494,\n            \"\\uf8fe\": 494,\n            \"\\uf8ff\": 790,\n        },\n    ),\n    \"Times-Bold\": (\n        {\n            \"FontName\": \"Times-Bold\",\n            \"Descent\": -217.0,\n            \"FontBBox\": (-168.0, -218.0, 1000.0, 935.0),\n            \"FontWeight\": \"Bold\",\n            \"CapHeight\": 676.0,\n            \"FontFamily\": \"Times\",\n            \"Flags\": 0,\n            \"XHeight\": 461.0,\n            \"ItalicAngle\": 0.0,\n            \"Ascent\": 683.0,\n        },\n        {\n            \" \": 250,\n            \"!\": 333,\n            '\"': 555,\n            \"#\": 500,\n            \"$\": 500,\n            \"%\": 1000,\n            \"&\": 833,\n            \"'\": 278,\n            \"(\": 333,\n            \")\": 333,\n            \"*\": 500,\n            \"+\": 570,\n            \",\": 250,\n            \"-\": 333,\n            \".\": 250,\n            \"/\": 278,\n            \"0\": 500,\n            \"1\": 500,\n            \"2\": 500,\n            \"3\": 500,\n            \"4\": 500,\n            \"5\": 500,\n            \"6\": 500,\n            \"7\": 500,\n            \"8\": 500,\n            \"9\": 500,\n            \":\": 333,\n            \";\": 333,\n            \"<\": 570,\n            \"=\": 570,\n            \">\": 570,\n            \"?\": 500,\n            \"@\": 930,\n            \"A\": 722,\n            \"B\": 667,\n            \"C\": 722,\n            \"D\": 722,\n            \"E\": 667,\n            \"F\": 611,\n            \"G\": 778,\n            \"H\": 778,\n            \"I\": 389,\n            \"J\": 500,\n            \"K\": 778,\n            \"L\": 667,\n            \"M\": 944,\n            
\"N\": 722,\n            \"O\": 778,\n            \"P\": 611,\n            \"Q\": 778,\n            \"R\": 722,\n            \"S\": 556,\n            \"T\": 667,\n            \"U\": 722,\n            \"V\": 722,\n            \"W\": 1000,\n            \"X\": 722,\n            \"Y\": 722,\n            \"Z\": 667,\n            \"[\": 333,\n            \"\\\\\": 278,\n            \"]\": 333,\n            \"^\": 581,\n            \"_\": 500,\n            \"`\": 333,\n            \"a\": 500,\n            \"b\": 556,\n            \"c\": 444,\n            \"d\": 556,\n            \"e\": 444,\n            \"f\": 333,\n            \"g\": 500,\n            \"h\": 556,\n            \"i\": 278,\n            \"j\": 333,\n            \"k\": 556,\n            \"l\": 278,\n            \"m\": 833,\n            \"n\": 556,\n            \"o\": 500,\n            \"p\": 556,\n            \"q\": 556,\n            \"r\": 444,\n            \"s\": 389,\n            \"t\": 333,\n            \"u\": 556,\n            \"v\": 500,\n            \"w\": 722,\n            \"x\": 500,\n            \"y\": 500,\n            \"z\": 444,\n            \"{\": 394,\n            \"|\": 220,\n            \"}\": 394,\n            \"~\": 520,\n            \"\\xa1\": 333,\n            \"\\xa2\": 500,\n            \"\\xa3\": 500,\n            \"\\xa4\": 500,\n            \"\\xa5\": 500,\n            \"\\xa6\": 220,\n            \"\\xa7\": 500,\n            \"\\xa8\": 333,\n            \"\\xa9\": 747,\n            \"\\xaa\": 300,\n            \"\\xab\": 500,\n            \"\\xac\": 570,\n            \"\\xae\": 747,\n            \"\\xaf\": 333,\n            \"\\xb0\": 400,\n            \"\\xb1\": 570,\n            \"\\xb2\": 300,\n            \"\\xb3\": 300,\n            \"\\xb4\": 333,\n            \"\\xb5\": 556,\n            \"\\xb6\": 540,\n            \"\\xb7\": 250,\n            \"\\xb8\": 333,\n            \"\\xb9\": 300,\n            \"\\xba\": 330,\n            \"\\xbb\": 500,\n            \"\\xbc\": 
750,\n            \"\\xbd\": 750,\n            \"\\xbe\": 750,\n            \"\\xbf\": 500,\n            \"\\xc0\": 722,\n            \"\\xc1\": 722,\n            \"\\xc2\": 722,\n            \"\\xc3\": 722,\n            \"\\xc4\": 722,\n            \"\\xc5\": 722,\n            \"\\xc6\": 1000,\n            \"\\xc7\": 722,\n            \"\\xc8\": 667,\n            \"\\xc9\": 667,\n            \"\\xca\": 667,\n            \"\\xcb\": 667,\n            \"\\xcc\": 389,\n            \"\\xcd\": 389,\n            \"\\xce\": 389,\n            \"\\xcf\": 389,\n            \"\\xd0\": 722,\n            \"\\xd1\": 722,\n            \"\\xd2\": 778,\n            \"\\xd3\": 778,\n            \"\\xd4\": 778,\n            \"\\xd5\": 778,\n            \"\\xd6\": 778,\n            \"\\xd7\": 570,\n            \"\\xd8\": 778,\n            \"\\xd9\": 722,\n            \"\\xda\": 722,\n            \"\\xdb\": 722,\n            \"\\xdc\": 722,\n            \"\\xdd\": 722,\n            \"\\xde\": 611,\n            \"\\xdf\": 556,\n            \"\\xe0\": 500,\n            \"\\xe1\": 500,\n            \"\\xe2\": 500,\n            \"\\xe3\": 500,\n            \"\\xe4\": 500,\n            \"\\xe5\": 500,\n            \"\\xe6\": 722,\n            \"\\xe7\": 444,\n            \"\\xe8\": 444,\n            \"\\xe9\": 444,\n            \"\\xea\": 444,\n            \"\\xeb\": 444,\n            \"\\xec\": 278,\n            \"\\xed\": 278,\n            \"\\xee\": 278,\n            \"\\xef\": 278,\n            \"\\xf0\": 500,\n            \"\\xf1\": 556,\n            \"\\xf2\": 500,\n            \"\\xf3\": 500,\n            \"\\xf4\": 500,\n            \"\\xf5\": 500,\n            \"\\xf6\": 500,\n            \"\\xf7\": 570,\n            \"\\xf8\": 500,\n            \"\\xf9\": 556,\n            \"\\xfa\": 556,\n            \"\\xfb\": 556,\n            \"\\xfc\": 556,\n            \"\\xfd\": 500,\n            \"\\xfe\": 556,\n            \"\\xff\": 500,\n            \"\\u0100\": 722,\n            
\"\\u0101\": 500,\n            \"\\u0102\": 722,\n            \"\\u0103\": 500,\n            \"\\u0104\": 722,\n            \"\\u0105\": 500,\n            \"\\u0106\": 722,\n            \"\\u0107\": 444,\n            \"\\u010c\": 722,\n            \"\\u010d\": 444,\n            \"\\u010e\": 722,\n            \"\\u010f\": 672,\n            \"\\u0110\": 722,\n            \"\\u0111\": 556,\n            \"\\u0112\": 667,\n            \"\\u0113\": 444,\n            \"\\u0116\": 667,\n            \"\\u0117\": 444,\n            \"\\u0118\": 667,\n            \"\\u0119\": 444,\n            \"\\u011a\": 667,\n            \"\\u011b\": 444,\n            \"\\u011e\": 778,\n            \"\\u011f\": 500,\n            \"\\u0122\": 778,\n            \"\\u0123\": 500,\n            \"\\u012a\": 389,\n            \"\\u012b\": 278,\n            \"\\u012e\": 389,\n            \"\\u012f\": 278,\n            \"\\u0130\": 389,\n            \"\\u0131\": 278,\n            \"\\u0136\": 778,\n            \"\\u0137\": 556,\n            \"\\u0139\": 667,\n            \"\\u013a\": 278,\n            \"\\u013b\": 667,\n            \"\\u013c\": 278,\n            \"\\u013d\": 667,\n            \"\\u013e\": 394,\n            \"\\u0141\": 667,\n            \"\\u0142\": 278,\n            \"\\u0143\": 722,\n            \"\\u0144\": 556,\n            \"\\u0145\": 722,\n            \"\\u0146\": 556,\n            \"\\u0147\": 722,\n            \"\\u0148\": 556,\n            \"\\u014c\": 778,\n            \"\\u014d\": 500,\n            \"\\u0150\": 778,\n            \"\\u0151\": 500,\n            \"\\u0152\": 1000,\n            \"\\u0153\": 722,\n            \"\\u0154\": 722,\n            \"\\u0155\": 444,\n            \"\\u0156\": 722,\n            \"\\u0157\": 444,\n            \"\\u0158\": 722,\n            \"\\u0159\": 444,\n            \"\\u015a\": 556,\n            \"\\u015b\": 389,\n            \"\\u015e\": 556,\n            \"\\u015f\": 389,\n            \"\\u0160\": 556,\n            \"\\u0161\": 
389,\n            \"\\u0162\": 667,\n            \"\\u0163\": 333,\n            \"\\u0164\": 667,\n            \"\\u0165\": 416,\n            \"\\u016a\": 722,\n            \"\\u016b\": 556,\n            \"\\u016e\": 722,\n            \"\\u016f\": 556,\n            \"\\u0170\": 722,\n            \"\\u0171\": 556,\n            \"\\u0172\": 722,\n            \"\\u0173\": 556,\n            \"\\u0178\": 722,\n            \"\\u0179\": 667,\n            \"\\u017a\": 444,\n            \"\\u017b\": 667,\n            \"\\u017c\": 444,\n            \"\\u017d\": 667,\n            \"\\u017e\": 444,\n            \"\\u0192\": 500,\n            \"\\u0218\": 556,\n            \"\\u0219\": 389,\n            \"\\u02c6\": 333,\n            \"\\u02c7\": 333,\n            \"\\u02d8\": 333,\n            \"\\u02d9\": 333,\n            \"\\u02da\": 333,\n            \"\\u02db\": 333,\n            \"\\u02dc\": 333,\n            \"\\u02dd\": 333,\n            \"\\u2013\": 500,\n            \"\\u2014\": 1000,\n            \"\\u2018\": 333,\n            \"\\u2019\": 333,\n            \"\\u201a\": 333,\n            \"\\u201c\": 500,\n            \"\\u201d\": 500,\n            \"\\u201e\": 500,\n            \"\\u2020\": 500,\n            \"\\u2021\": 500,\n            \"\\u2022\": 350,\n            \"\\u2026\": 1000,\n            \"\\u2030\": 1000,\n            \"\\u2039\": 333,\n            \"\\u203a\": 333,\n            \"\\u2044\": 167,\n            \"\\u2122\": 1000,\n            \"\\u2202\": 494,\n            \"\\u2206\": 612,\n            \"\\u2211\": 600,\n            \"\\u2212\": 570,\n            \"\\u221a\": 549,\n            \"\\u2260\": 549,\n            \"\\u2264\": 549,\n            \"\\u2265\": 549,\n            \"\\u25ca\": 494,\n            \"\\uf6c3\": 250,\n            \"\\ufb01\": 556,\n            \"\\ufb02\": 556,\n        },\n    ),\n    \"Times-BoldItalic\": (\n        {\n            \"FontName\": \"Times-BoldItalic\",\n            \"Descent\": -217.0,\n            
\"FontBBox\": (-200.0, -218.0, 996.0, 921.0),\n            \"FontWeight\": \"Bold\",\n            \"CapHeight\": 669.0,\n            \"FontFamily\": \"Times\",\n            \"Flags\": 0,\n            \"XHeight\": 462.0,\n            \"ItalicAngle\": -15.0,\n            \"Ascent\": 683.0,\n        },\n        {\n            \" \": 250,\n            \"!\": 389,\n            '\"': 555,\n            \"#\": 500,\n            \"$\": 500,\n            \"%\": 833,\n            \"&\": 778,\n            \"'\": 278,\n            \"(\": 333,\n            \")\": 333,\n            \"*\": 500,\n            \"+\": 570,\n            \",\": 250,\n            \"-\": 333,\n            \".\": 250,\n            \"/\": 278,\n            \"0\": 500,\n            \"1\": 500,\n            \"2\": 500,\n            \"3\": 500,\n            \"4\": 500,\n            \"5\": 500,\n            \"6\": 500,\n            \"7\": 500,\n            \"8\": 500,\n            \"9\": 500,\n            \":\": 333,\n            \";\": 333,\n            \"<\": 570,\n            \"=\": 570,\n            \">\": 570,\n            \"?\": 500,\n            \"@\": 832,\n            \"A\": 667,\n            \"B\": 667,\n            \"C\": 667,\n            \"D\": 722,\n            \"E\": 667,\n            \"F\": 667,\n            \"G\": 722,\n            \"H\": 778,\n            \"I\": 389,\n            \"J\": 500,\n            \"K\": 667,\n            \"L\": 611,\n            \"M\": 889,\n            \"N\": 722,\n            \"O\": 722,\n            \"P\": 611,\n            \"Q\": 722,\n            \"R\": 667,\n            \"S\": 556,\n            \"T\": 611,\n            \"U\": 722,\n            \"V\": 667,\n            \"W\": 889,\n            \"X\": 667,\n            \"Y\": 611,\n            \"Z\": 611,\n            \"[\": 333,\n            \"\\\\\": 278,\n            \"]\": 333,\n            \"^\": 570,\n            \"_\": 500,\n            \"`\": 333,\n            \"a\": 500,\n            \"b\": 500,\n          
  \"c\": 444,\n            \"d\": 500,\n            \"e\": 444,\n            \"f\": 333,\n            \"g\": 500,\n            \"h\": 556,\n            \"i\": 278,\n            \"j\": 278,\n            \"k\": 500,\n            \"l\": 278,\n            \"m\": 778,\n            \"n\": 556,\n            \"o\": 500,\n            \"p\": 500,\n            \"q\": 500,\n            \"r\": 389,\n            \"s\": 389,\n            \"t\": 278,\n            \"u\": 556,\n            \"v\": 444,\n            \"w\": 667,\n            \"x\": 500,\n            \"y\": 444,\n            \"z\": 389,\n            \"{\": 348,\n            \"|\": 220,\n            \"}\": 348,\n            \"~\": 570,\n            \"\\xa1\": 389,\n            \"\\xa2\": 500,\n            \"\\xa3\": 500,\n            \"\\xa4\": 500,\n            \"\\xa5\": 500,\n            \"\\xa6\": 220,\n            \"\\xa7\": 500,\n            \"\\xa8\": 333,\n            \"\\xa9\": 747,\n            \"\\xaa\": 266,\n            \"\\xab\": 500,\n            \"\\xac\": 606,\n            \"\\xae\": 747,\n            \"\\xaf\": 333,\n            \"\\xb0\": 400,\n            \"\\xb1\": 570,\n            \"\\xb2\": 300,\n            \"\\xb3\": 300,\n            \"\\xb4\": 333,\n            \"\\xb5\": 576,\n            \"\\xb6\": 500,\n            \"\\xb7\": 250,\n            \"\\xb8\": 333,\n            \"\\xb9\": 300,\n            \"\\xba\": 300,\n            \"\\xbb\": 500,\n            \"\\xbc\": 750,\n            \"\\xbd\": 750,\n            \"\\xbe\": 750,\n            \"\\xbf\": 500,\n            \"\\xc0\": 667,\n            \"\\xc1\": 667,\n            \"\\xc2\": 667,\n            \"\\xc3\": 667,\n            \"\\xc4\": 667,\n            \"\\xc5\": 667,\n            \"\\xc6\": 944,\n            \"\\xc7\": 667,\n            \"\\xc8\": 667,\n            \"\\xc9\": 667,\n            \"\\xca\": 667,\n            \"\\xcb\": 667,\n            \"\\xcc\": 389,\n            \"\\xcd\": 389,\n            \"\\xce\": 389,\n     
       \"\\xcf\": 389,\n            \"\\xd0\": 722,\n            \"\\xd1\": 722,\n            \"\\xd2\": 722,\n            \"\\xd3\": 722,\n            \"\\xd4\": 722,\n            \"\\xd5\": 722,\n            \"\\xd6\": 722,\n            \"\\xd7\": 570,\n            \"\\xd8\": 722,\n            \"\\xd9\": 722,\n            \"\\xda\": 722,\n            \"\\xdb\": 722,\n            \"\\xdc\": 722,\n            \"\\xdd\": 611,\n            \"\\xde\": 611,\n            \"\\xdf\": 500,\n            \"\\xe0\": 500,\n            \"\\xe1\": 500,\n            \"\\xe2\": 500,\n            \"\\xe3\": 500,\n            \"\\xe4\": 500,\n            \"\\xe5\": 500,\n            \"\\xe6\": 722,\n            \"\\xe7\": 444,\n            \"\\xe8\": 444,\n            \"\\xe9\": 444,\n            \"\\xea\": 444,\n            \"\\xeb\": 444,\n            \"\\xec\": 278,\n            \"\\xed\": 278,\n            \"\\xee\": 278,\n            \"\\xef\": 278,\n            \"\\xf0\": 500,\n            \"\\xf1\": 556,\n            \"\\xf2\": 500,\n            \"\\xf3\": 500,\n            \"\\xf4\": 500,\n            \"\\xf5\": 500,\n            \"\\xf6\": 500,\n            \"\\xf7\": 570,\n            \"\\xf8\": 500,\n            \"\\xf9\": 556,\n            \"\\xfa\": 556,\n            \"\\xfb\": 556,\n            \"\\xfc\": 556,\n            \"\\xfd\": 444,\n            \"\\xfe\": 500,\n            \"\\xff\": 444,\n            \"\\u0100\": 667,\n            \"\\u0101\": 500,\n            \"\\u0102\": 667,\n            \"\\u0103\": 500,\n            \"\\u0104\": 667,\n            \"\\u0105\": 500,\n            \"\\u0106\": 667,\n            \"\\u0107\": 444,\n            \"\\u010c\": 667,\n            \"\\u010d\": 444,\n            \"\\u010e\": 722,\n            \"\\u010f\": 608,\n            \"\\u0110\": 722,\n            \"\\u0111\": 500,\n            \"\\u0112\": 667,\n            \"\\u0113\": 444,\n            \"\\u0116\": 667,\n            \"\\u0117\": 444,\n            \"\\u0118\": 
667,\n            \"\\u0119\": 444,\n            \"\\u011a\": 667,\n            \"\\u011b\": 444,\n            \"\\u011e\": 722,\n            \"\\u011f\": 500,\n            \"\\u0122\": 722,\n            \"\\u0123\": 500,\n            \"\\u012a\": 389,\n            \"\\u012b\": 278,\n            \"\\u012e\": 389,\n            \"\\u012f\": 278,\n            \"\\u0130\": 389,\n            \"\\u0131\": 278,\n            \"\\u0136\": 667,\n            \"\\u0137\": 500,\n            \"\\u0139\": 611,\n            \"\\u013a\": 278,\n            \"\\u013b\": 611,\n            \"\\u013c\": 278,\n            \"\\u013d\": 611,\n            \"\\u013e\": 382,\n            \"\\u0141\": 611,\n            \"\\u0142\": 278,\n            \"\\u0143\": 722,\n            \"\\u0144\": 556,\n            \"\\u0145\": 722,\n            \"\\u0146\": 556,\n            \"\\u0147\": 722,\n            \"\\u0148\": 556,\n            \"\\u014c\": 722,\n            \"\\u014d\": 500,\n            \"\\u0150\": 722,\n            \"\\u0151\": 500,\n            \"\\u0152\": 944,\n            \"\\u0153\": 722,\n            \"\\u0154\": 667,\n            \"\\u0155\": 389,\n            \"\\u0156\": 667,\n            \"\\u0157\": 389,\n            \"\\u0158\": 667,\n            \"\\u0159\": 389,\n            \"\\u015a\": 556,\n            \"\\u015b\": 389,\n            \"\\u015e\": 556,\n            \"\\u015f\": 389,\n            \"\\u0160\": 556,\n            \"\\u0161\": 389,\n            \"\\u0162\": 611,\n            \"\\u0163\": 278,\n            \"\\u0164\": 611,\n            \"\\u0165\": 366,\n            \"\\u016a\": 722,\n            \"\\u016b\": 556,\n            \"\\u016e\": 722,\n            \"\\u016f\": 556,\n            \"\\u0170\": 722,\n            \"\\u0171\": 556,\n            \"\\u0172\": 722,\n            \"\\u0173\": 556,\n            \"\\u0178\": 611,\n            \"\\u0179\": 611,\n            \"\\u017a\": 389,\n            \"\\u017b\": 611,\n            \"\\u017c\": 389,\n          
  \"\\u017d\": 611,\n            \"\\u017e\": 389,\n            \"\\u0192\": 500,\n            \"\\u0218\": 556,\n            \"\\u0219\": 389,\n            \"\\u02c6\": 333,\n            \"\\u02c7\": 333,\n            \"\\u02d8\": 333,\n            \"\\u02d9\": 333,\n            \"\\u02da\": 333,\n            \"\\u02db\": 333,\n            \"\\u02dc\": 333,\n            \"\\u02dd\": 333,\n            \"\\u2013\": 500,\n            \"\\u2014\": 1000,\n            \"\\u2018\": 333,\n            \"\\u2019\": 333,\n            \"\\u201a\": 333,\n            \"\\u201c\": 500,\n            \"\\u201d\": 500,\n            \"\\u201e\": 500,\n            \"\\u2020\": 500,\n            \"\\u2021\": 500,\n            \"\\u2022\": 350,\n            \"\\u2026\": 1000,\n            \"\\u2030\": 1000,\n            \"\\u2039\": 333,\n            \"\\u203a\": 333,\n            \"\\u2044\": 167,\n            \"\\u2122\": 1000,\n            \"\\u2202\": 494,\n            \"\\u2206\": 612,\n            \"\\u2211\": 600,\n            \"\\u2212\": 606,\n            \"\\u221a\": 549,\n            \"\\u2260\": 549,\n            \"\\u2264\": 549,\n            \"\\u2265\": 549,\n            \"\\u25ca\": 494,\n            \"\\uf6c3\": 250,\n            \"\\ufb01\": 556,\n            \"\\ufb02\": 556,\n        },\n    ),\n    \"Times-Italic\": (\n        {\n            \"FontName\": \"Times-Italic\",\n            \"Descent\": -217.0,\n            \"FontBBox\": (-169.0, -217.0, 1010.0, 883.0),\n            \"FontWeight\": \"Medium\",\n            \"CapHeight\": 653.0,\n            \"FontFamily\": \"Times\",\n            \"Flags\": 0,\n            \"XHeight\": 441.0,\n            \"ItalicAngle\": -15.5,\n            \"Ascent\": 683.0,\n        },\n        {\n            \" \": 250,\n            \"!\": 333,\n            '\"': 420,\n            \"#\": 500,\n            \"$\": 500,\n            \"%\": 833,\n            \"&\": 778,\n            \"'\": 214,\n            \"(\": 333,\n            
\")\": 333,\n            \"*\": 500,\n            \"+\": 675,\n            \",\": 250,\n            \"-\": 333,\n            \".\": 250,\n            \"/\": 278,\n            \"0\": 500,\n            \"1\": 500,\n            \"2\": 500,\n            \"3\": 500,\n            \"4\": 500,\n            \"5\": 500,\n            \"6\": 500,\n            \"7\": 500,\n            \"8\": 500,\n            \"9\": 500,\n            \":\": 333,\n            \";\": 333,\n            \"<\": 675,\n            \"=\": 675,\n            \">\": 675,\n            \"?\": 500,\n            \"@\": 920,\n            \"A\": 611,\n            \"B\": 611,\n            \"C\": 667,\n            \"D\": 722,\n            \"E\": 611,\n            \"F\": 611,\n            \"G\": 722,\n            \"H\": 722,\n            \"I\": 333,\n            \"J\": 444,\n            \"K\": 667,\n            \"L\": 556,\n            \"M\": 833,\n            \"N\": 667,\n            \"O\": 722,\n            \"P\": 611,\n            \"Q\": 722,\n            \"R\": 611,\n            \"S\": 500,\n            \"T\": 556,\n            \"U\": 722,\n            \"V\": 611,\n            \"W\": 833,\n            \"X\": 611,\n            \"Y\": 556,\n            \"Z\": 556,\n            \"[\": 389,\n            \"\\\\\": 278,\n            \"]\": 389,\n            \"^\": 422,\n            \"_\": 500,\n            \"`\": 333,\n            \"a\": 500,\n            \"b\": 500,\n            \"c\": 444,\n            \"d\": 500,\n            \"e\": 444,\n            \"f\": 278,\n            \"g\": 500,\n            \"h\": 500,\n            \"i\": 278,\n            \"j\": 278,\n            \"k\": 444,\n            \"l\": 278,\n            \"m\": 722,\n            \"n\": 500,\n            \"o\": 500,\n            \"p\": 500,\n            \"q\": 500,\n            \"r\": 389,\n            \"s\": 389,\n            \"t\": 278,\n            \"u\": 500,\n            \"v\": 444,\n            \"w\": 667,\n            \"x\": 444,\n         
   \"y\": 444,\n            \"z\": 389,\n            \"{\": 400,\n            \"|\": 275,\n            \"}\": 400,\n            \"~\": 541,\n            \"\\xa1\": 389,\n            \"\\xa2\": 500,\n            \"\\xa3\": 500,\n            \"\\xa4\": 500,\n            \"\\xa5\": 500,\n            \"\\xa6\": 275,\n            \"\\xa7\": 500,\n            \"\\xa8\": 333,\n            \"\\xa9\": 760,\n            \"\\xaa\": 276,\n            \"\\xab\": 500,\n            \"\\xac\": 675,\n            \"\\xae\": 760,\n            \"\\xaf\": 333,\n            \"\\xb0\": 400,\n            \"\\xb1\": 675,\n            \"\\xb2\": 300,\n            \"\\xb3\": 300,\n            \"\\xb4\": 333,\n            \"\\xb5\": 500,\n            \"\\xb6\": 523,\n            \"\\xb7\": 250,\n            \"\\xb8\": 333,\n            \"\\xb9\": 300,\n            \"\\xba\": 310,\n            \"\\xbb\": 500,\n            \"\\xbc\": 750,\n            \"\\xbd\": 750,\n            \"\\xbe\": 750,\n            \"\\xbf\": 500,\n            \"\\xc0\": 611,\n            \"\\xc1\": 611,\n            \"\\xc2\": 611,\n            \"\\xc3\": 611,\n            \"\\xc4\": 611,\n            \"\\xc5\": 611,\n            \"\\xc6\": 889,\n            \"\\xc7\": 667,\n            \"\\xc8\": 611,\n            \"\\xc9\": 611,\n            \"\\xca\": 611,\n            \"\\xcb\": 611,\n            \"\\xcc\": 333,\n            \"\\xcd\": 333,\n            \"\\xce\": 333,\n            \"\\xcf\": 333,\n            \"\\xd0\": 722,\n            \"\\xd1\": 667,\n            \"\\xd2\": 722,\n            \"\\xd3\": 722,\n            \"\\xd4\": 722,\n            \"\\xd5\": 722,\n            \"\\xd6\": 722,\n            \"\\xd7\": 675,\n            \"\\xd8\": 722,\n            \"\\xd9\": 722,\n            \"\\xda\": 722,\n            \"\\xdb\": 722,\n            \"\\xdc\": 722,\n            \"\\xdd\": 556,\n            \"\\xde\": 611,\n            \"\\xdf\": 500,\n            \"\\xe0\": 500,\n            \"\\xe1\": 500,\n   
         \"\\xe2\": 500,\n            \"\\xe3\": 500,\n            \"\\xe4\": 500,\n            \"\\xe5\": 500,\n            \"\\xe6\": 667,\n            \"\\xe7\": 444,\n            \"\\xe8\": 444,\n            \"\\xe9\": 444,\n            \"\\xea\": 444,\n            \"\\xeb\": 444,\n            \"\\xec\": 278,\n            \"\\xed\": 278,\n            \"\\xee\": 278,\n            \"\\xef\": 278,\n            \"\\xf0\": 500,\n            \"\\xf1\": 500,\n            \"\\xf2\": 500,\n            \"\\xf3\": 500,\n            \"\\xf4\": 500,\n            \"\\xf5\": 500,\n            \"\\xf6\": 500,\n            \"\\xf7\": 675,\n            \"\\xf8\": 500,\n            \"\\xf9\": 500,\n            \"\\xfa\": 500,\n            \"\\xfb\": 500,\n            \"\\xfc\": 500,\n            \"\\xfd\": 444,\n            \"\\xfe\": 500,\n            \"\\xff\": 444,\n            \"\\u0100\": 611,\n            \"\\u0101\": 500,\n            \"\\u0102\": 611,\n            \"\\u0103\": 500,\n            \"\\u0104\": 611,\n            \"\\u0105\": 500,\n            \"\\u0106\": 667,\n            \"\\u0107\": 444,\n            \"\\u010c\": 667,\n            \"\\u010d\": 444,\n            \"\\u010e\": 722,\n            \"\\u010f\": 544,\n            \"\\u0110\": 722,\n            \"\\u0111\": 500,\n            \"\\u0112\": 611,\n            \"\\u0113\": 444,\n            \"\\u0116\": 611,\n            \"\\u0117\": 444,\n            \"\\u0118\": 611,\n            \"\\u0119\": 444,\n            \"\\u011a\": 611,\n            \"\\u011b\": 444,\n            \"\\u011e\": 722,\n            \"\\u011f\": 500,\n            \"\\u0122\": 722,\n            \"\\u0123\": 500,\n            \"\\u012a\": 333,\n            \"\\u012b\": 278,\n            \"\\u012e\": 333,\n            \"\\u012f\": 278,\n            \"\\u0130\": 333,\n            \"\\u0131\": 278,\n            \"\\u0136\": 667,\n            \"\\u0137\": 444,\n            \"\\u0139\": 556,\n            \"\\u013a\": 278,\n            
\"\\u013b\": 556,\n            \"\\u013c\": 278,\n            \"\\u013d\": 611,\n            \"\\u013e\": 300,\n            \"\\u0141\": 556,\n            \"\\u0142\": 278,\n            \"\\u0143\": 667,\n            \"\\u0144\": 500,\n            \"\\u0145\": 667,\n            \"\\u0146\": 500,\n            \"\\u0147\": 667,\n            \"\\u0148\": 500,\n            \"\\u014c\": 722,\n            \"\\u014d\": 500,\n            \"\\u0150\": 722,\n            \"\\u0151\": 500,\n            \"\\u0152\": 944,\n            \"\\u0153\": 667,\n            \"\\u0154\": 611,\n            \"\\u0155\": 389,\n            \"\\u0156\": 611,\n            \"\\u0157\": 389,\n            \"\\u0158\": 611,\n            \"\\u0159\": 389,\n            \"\\u015a\": 500,\n            \"\\u015b\": 389,\n            \"\\u015e\": 500,\n            \"\\u015f\": 389,\n            \"\\u0160\": 500,\n            \"\\u0161\": 389,\n            \"\\u0162\": 556,\n            \"\\u0163\": 278,\n            \"\\u0164\": 556,\n            \"\\u0165\": 300,\n            \"\\u016a\": 722,\n            \"\\u016b\": 500,\n            \"\\u016e\": 722,\n            \"\\u016f\": 500,\n            \"\\u0170\": 722,\n            \"\\u0171\": 500,\n            \"\\u0172\": 722,\n            \"\\u0173\": 500,\n            \"\\u0178\": 556,\n            \"\\u0179\": 556,\n            \"\\u017a\": 389,\n            \"\\u017b\": 556,\n            \"\\u017c\": 389,\n            \"\\u017d\": 556,\n            \"\\u017e\": 389,\n            \"\\u0192\": 500,\n            \"\\u0218\": 500,\n            \"\\u0219\": 389,\n            \"\\u02c6\": 333,\n            \"\\u02c7\": 333,\n            \"\\u02d8\": 333,\n            \"\\u02d9\": 333,\n            \"\\u02da\": 333,\n            \"\\u02db\": 333,\n            \"\\u02dc\": 333,\n            \"\\u02dd\": 333,\n            \"\\u2013\": 500,\n            \"\\u2014\": 889,\n            \"\\u2018\": 333,\n            \"\\u2019\": 333,\n            \"\\u201a\": 
333,\n            \"\\u201c\": 556,\n            \"\\u201d\": 556,\n            \"\\u201e\": 556,\n            \"\\u2020\": 500,\n            \"\\u2021\": 500,\n            \"\\u2022\": 350,\n            \"\\u2026\": 889,\n            \"\\u2030\": 1000,\n            \"\\u2039\": 333,\n            \"\\u203a\": 333,\n            \"\\u2044\": 167,\n            \"\\u2122\": 980,\n            \"\\u2202\": 476,\n            \"\\u2206\": 612,\n            \"\\u2211\": 600,\n            \"\\u2212\": 675,\n            \"\\u221a\": 453,\n            \"\\u2260\": 549,\n            \"\\u2264\": 549,\n            \"\\u2265\": 549,\n            \"\\u25ca\": 471,\n            \"\\uf6c3\": 250,\n            \"\\ufb01\": 500,\n            \"\\ufb02\": 500,\n        },\n    ),\n    \"Times-Roman\": (\n        {\n            \"FontName\": \"Times-Roman\",\n            \"Descent\": -217.0,\n            \"FontBBox\": (-168.0, -218.0, 1000.0, 898.0),\n            \"FontWeight\": \"Roman\",\n            \"CapHeight\": 662.0,\n            \"FontFamily\": \"Times\",\n            \"Flags\": 0,\n            \"XHeight\": 450.0,\n            \"ItalicAngle\": 0.0,\n            \"Ascent\": 683.0,\n        },\n        {\n            \" \": 250,\n            \"!\": 333,\n            '\"': 408,\n            \"#\": 500,\n            \"$\": 500,\n            \"%\": 833,\n            \"&\": 778,\n            \"'\": 180,\n            \"(\": 333,\n            \")\": 333,\n            \"*\": 500,\n            \"+\": 564,\n            \",\": 250,\n            \"-\": 333,\n            \".\": 250,\n            \"/\": 278,\n            \"0\": 500,\n            \"1\": 500,\n            \"2\": 500,\n            \"3\": 500,\n            \"4\": 500,\n            \"5\": 500,\n            \"6\": 500,\n            \"7\": 500,\n            \"8\": 500,\n            \"9\": 500,\n            \":\": 278,\n            \";\": 278,\n            \"<\": 564,\n            \"=\": 564,\n            \">\": 564,\n            
\"?\": 444,\n            \"@\": 921,\n            \"A\": 722,\n            \"B\": 667,\n            \"C\": 667,\n            \"D\": 722,\n            \"E\": 611,\n            \"F\": 556,\n            \"G\": 722,\n            \"H\": 722,\n            \"I\": 333,\n            \"J\": 389,\n            \"K\": 722,\n            \"L\": 611,\n            \"M\": 889,\n            \"N\": 722,\n            \"O\": 722,\n            \"P\": 556,\n            \"Q\": 722,\n            \"R\": 667,\n            \"S\": 556,\n            \"T\": 611,\n            \"U\": 722,\n            \"V\": 722,\n            \"W\": 944,\n            \"X\": 722,\n            \"Y\": 722,\n            \"Z\": 611,\n            \"[\": 333,\n            \"\\\\\": 278,\n            \"]\": 333,\n            \"^\": 469,\n            \"_\": 500,\n            \"`\": 333,\n            \"a\": 444,\n            \"b\": 500,\n            \"c\": 444,\n            \"d\": 500,\n            \"e\": 444,\n            \"f\": 333,\n            \"g\": 500,\n            \"h\": 500,\n            \"i\": 278,\n            \"j\": 278,\n            \"k\": 500,\n            \"l\": 278,\n            \"m\": 778,\n            \"n\": 500,\n            \"o\": 500,\n            \"p\": 500,\n            \"q\": 500,\n            \"r\": 333,\n            \"s\": 389,\n            \"t\": 278,\n            \"u\": 500,\n            \"v\": 500,\n            \"w\": 722,\n            \"x\": 500,\n            \"y\": 500,\n            \"z\": 444,\n            \"{\": 480,\n            \"|\": 200,\n            \"}\": 480,\n            \"~\": 541,\n            \"\\xa1\": 333,\n            \"\\xa2\": 500,\n            \"\\xa3\": 500,\n            \"\\xa4\": 500,\n            \"\\xa5\": 500,\n            \"\\xa6\": 200,\n            \"\\xa7\": 500,\n            \"\\xa8\": 333,\n            \"\\xa9\": 760,\n            \"\\xaa\": 276,\n            \"\\xab\": 500,\n            \"\\xac\": 564,\n            \"\\xae\": 760,\n            \"\\xaf\": 333,\n   
         \"\\xb0\": 400,\n            \"\\xb1\": 564,\n            \"\\xb2\": 300,\n            \"\\xb3\": 300,\n            \"\\xb4\": 333,\n            \"\\xb5\": 500,\n            \"\\xb6\": 453,\n            \"\\xb7\": 250,\n            \"\\xb8\": 333,\n            \"\\xb9\": 300,\n            \"\\xba\": 310,\n            \"\\xbb\": 500,\n            \"\\xbc\": 750,\n            \"\\xbd\": 750,\n            \"\\xbe\": 750,\n            \"\\xbf\": 444,\n            \"\\xc0\": 722,\n            \"\\xc1\": 722,\n            \"\\xc2\": 722,\n            \"\\xc3\": 722,\n            \"\\xc4\": 722,\n            \"\\xc5\": 722,\n            \"\\xc6\": 889,\n            \"\\xc7\": 667,\n            \"\\xc8\": 611,\n            \"\\xc9\": 611,\n            \"\\xca\": 611,\n            \"\\xcb\": 611,\n            \"\\xcc\": 333,\n            \"\\xcd\": 333,\n            \"\\xce\": 333,\n            \"\\xcf\": 333,\n            \"\\xd0\": 722,\n            \"\\xd1\": 722,\n            \"\\xd2\": 722,\n            \"\\xd3\": 722,\n            \"\\xd4\": 722,\n            \"\\xd5\": 722,\n            \"\\xd6\": 722,\n            \"\\xd7\": 564,\n            \"\\xd8\": 722,\n            \"\\xd9\": 722,\n            \"\\xda\": 722,\n            \"\\xdb\": 722,\n            \"\\xdc\": 722,\n            \"\\xdd\": 722,\n            \"\\xde\": 556,\n            \"\\xdf\": 500,\n            \"\\xe0\": 444,\n            \"\\xe1\": 444,\n            \"\\xe2\": 444,\n            \"\\xe3\": 444,\n            \"\\xe4\": 444,\n            \"\\xe5\": 444,\n            \"\\xe6\": 667,\n            \"\\xe7\": 444,\n            \"\\xe8\": 444,\n            \"\\xe9\": 444,\n            \"\\xea\": 444,\n            \"\\xeb\": 444,\n            \"\\xec\": 278,\n            \"\\xed\": 278,\n            \"\\xee\": 278,\n            \"\\xef\": 278,\n            \"\\xf0\": 500,\n            \"\\xf1\": 500,\n            \"\\xf2\": 500,\n            \"\\xf3\": 500,\n            \"\\xf4\": 500,\n  
          \"\\xf5\": 500,\n            \"\\xf6\": 500,\n            \"\\xf7\": 564,\n            \"\\xf8\": 500,\n            \"\\xf9\": 500,\n            \"\\xfa\": 500,\n            \"\\xfb\": 500,\n            \"\\xfc\": 500,\n            \"\\xfd\": 500,\n            \"\\xfe\": 500,\n            \"\\xff\": 500,\n            \"\\u0100\": 722,\n            \"\\u0101\": 444,\n            \"\\u0102\": 722,\n            \"\\u0103\": 444,\n            \"\\u0104\": 722,\n            \"\\u0105\": 444,\n            \"\\u0106\": 667,\n            \"\\u0107\": 444,\n            \"\\u010c\": 667,\n            \"\\u010d\": 444,\n            \"\\u010e\": 722,\n            \"\\u010f\": 588,\n            \"\\u0110\": 722,\n            \"\\u0111\": 500,\n            \"\\u0112\": 611,\n            \"\\u0113\": 444,\n            \"\\u0116\": 611,\n            \"\\u0117\": 444,\n            \"\\u0118\": 611,\n            \"\\u0119\": 444,\n            \"\\u011a\": 611,\n            \"\\u011b\": 444,\n            \"\\u011e\": 722,\n            \"\\u011f\": 500,\n            \"\\u0122\": 722,\n            \"\\u0123\": 500,\n            \"\\u012a\": 333,\n            \"\\u012b\": 278,\n            \"\\u012e\": 333,\n            \"\\u012f\": 278,\n            \"\\u0130\": 333,\n            \"\\u0131\": 278,\n            \"\\u0136\": 722,\n            \"\\u0137\": 500,\n            \"\\u0139\": 611,\n            \"\\u013a\": 278,\n            \"\\u013b\": 611,\n            \"\\u013c\": 278,\n            \"\\u013d\": 611,\n            \"\\u013e\": 344,\n            \"\\u0141\": 611,\n            \"\\u0142\": 278,\n            \"\\u0143\": 722,\n            \"\\u0144\": 500,\n            \"\\u0145\": 722,\n            \"\\u0146\": 500,\n            \"\\u0147\": 722,\n            \"\\u0148\": 500,\n            \"\\u014c\": 722,\n            \"\\u014d\": 500,\n            \"\\u0150\": 722,\n            \"\\u0151\": 500,\n            \"\\u0152\": 889,\n            \"\\u0153\": 722,\n         
   \"\\u0154\": 667,\n            \"\\u0155\": 333,\n            \"\\u0156\": 667,\n            \"\\u0157\": 333,\n            \"\\u0158\": 667,\n            \"\\u0159\": 333,\n            \"\\u015a\": 556,\n            \"\\u015b\": 389,\n            \"\\u015e\": 556,\n            \"\\u015f\": 389,\n            \"\\u0160\": 556,\n            \"\\u0161\": 389,\n            \"\\u0162\": 611,\n            \"\\u0163\": 278,\n            \"\\u0164\": 611,\n            \"\\u0165\": 326,\n            \"\\u016a\": 722,\n            \"\\u016b\": 500,\n            \"\\u016e\": 722,\n            \"\\u016f\": 500,\n            \"\\u0170\": 722,\n            \"\\u0171\": 500,\n            \"\\u0172\": 722,\n            \"\\u0173\": 500,\n            \"\\u0178\": 722,\n            \"\\u0179\": 611,\n            \"\\u017a\": 444,\n            \"\\u017b\": 611,\n            \"\\u017c\": 444,\n            \"\\u017d\": 611,\n            \"\\u017e\": 444,\n            \"\\u0192\": 500,\n            \"\\u0218\": 556,\n            \"\\u0219\": 389,\n            \"\\u02c6\": 333,\n            \"\\u02c7\": 333,\n            \"\\u02d8\": 333,\n            \"\\u02d9\": 333,\n            \"\\u02da\": 333,\n            \"\\u02db\": 333,\n            \"\\u02dc\": 333,\n            \"\\u02dd\": 333,\n            \"\\u2013\": 500,\n            \"\\u2014\": 1000,\n            \"\\u2018\": 333,\n            \"\\u2019\": 333,\n            \"\\u201a\": 333,\n            \"\\u201c\": 444,\n            \"\\u201d\": 444,\n            \"\\u201e\": 444,\n            \"\\u2020\": 500,\n            \"\\u2021\": 500,\n            \"\\u2022\": 350,\n            \"\\u2026\": 1000,\n            \"\\u2030\": 1000,\n            \"\\u2039\": 333,\n            \"\\u203a\": 333,\n            \"\\u2044\": 167,\n            \"\\u2122\": 980,\n            \"\\u2202\": 476,\n            \"\\u2206\": 612,\n            \"\\u2211\": 600,\n            \"\\u2212\": 564,\n            \"\\u221a\": 453,\n            
\"\\u2260\": 549,\n            \"\\u2264\": 549,\n            \"\\u2265\": 549,\n            \"\\u25ca\": 471,\n            \"\\uf6c3\": 250,\n            \"\\ufb01\": 556,\n            \"\\ufb02\": 556,\n        },\n    ),\n    \"ZapfDingbats\": (\n        {\n            \"FontName\": \"ZapfDingbats\",\n            \"FontBBox\": (-1.0, -143.0, 981.0, 820.0),\n            \"FontWeight\": \"Medium\",\n            \"FontFamily\": \"ITC\",\n            \"Flags\": 0,\n            \"ItalicAngle\": 0.0,\n        },\n        {\n            \"\\x01\": 974,\n            \"\\x02\": 961,\n            \"\\x03\": 980,\n            \"\\x04\": 719,\n            \"\\x05\": 789,\n            \"\\x06\": 494,\n            \"\\x07\": 552,\n            \"\\x08\": 537,\n            \"\\t\": 577,\n            \"\\n\": 692,\n            \"\\x0b\": 960,\n            \"\\x0c\": 939,\n            \"\\r\": 549,\n            \"\\x0e\": 855,\n            \"\\x0f\": 911,\n            \"\\x10\": 933,\n            \"\\x11\": 945,\n            \"\\x12\": 974,\n            \"\\x13\": 755,\n            \"\\x14\": 846,\n            \"\\x15\": 762,\n            \"\\x16\": 761,\n            \"\\x17\": 571,\n            \"\\x18\": 677,\n            \"\\x19\": 763,\n            \"\\x1a\": 760,\n            \"\\x1b\": 759,\n            \"\\x1c\": 754,\n            \"\\x1d\": 786,\n            \"\\x1e\": 788,\n            \"\\x1f\": 788,\n            \" \": 790,\n            \"!\": 793,\n            '\"': 794,\n            \"#\": 816,\n            \"$\": 823,\n            \"%\": 789,\n            \"&\": 841,\n            \"'\": 823,\n            \"(\": 833,\n            \")\": 816,\n            \"*\": 831,\n            \"+\": 923,\n            \",\": 744,\n            \"-\": 723,\n            \".\": 749,\n            \"/\": 790,\n            \"0\": 792,\n            \"1\": 695,\n            \"2\": 776,\n            \"3\": 768,\n            \"4\": 792,\n            \"5\": 759,\n            \"6\": 707,\n      
      \"7\": 708,\n            \"8\": 682,\n            \"9\": 701,\n            \":\": 826,\n            \";\": 815,\n            \"<\": 789,\n            \"=\": 789,\n            \">\": 707,\n            \"?\": 687,\n            \"@\": 696,\n            \"A\": 689,\n            \"B\": 786,\n            \"C\": 787,\n            \"D\": 713,\n            \"E\": 791,\n            \"F\": 785,\n            \"G\": 791,\n            \"H\": 873,\n            \"I\": 761,\n            \"J\": 762,\n            \"K\": 759,\n            \"L\": 892,\n            \"M\": 892,\n            \"N\": 788,\n            \"O\": 784,\n            \"Q\": 438,\n            \"R\": 138,\n            \"S\": 277,\n            \"T\": 415,\n            \"U\": 509,\n            \"V\": 410,\n            \"W\": 234,\n            \"X\": 234,\n            \"Y\": 390,\n            \"Z\": 390,\n            \"[\": 276,\n            \"\\\\\": 276,\n            \"]\": 317,\n            \"^\": 317,\n            \"_\": 334,\n            \"`\": 334,\n            \"a\": 392,\n            \"b\": 392,\n            \"c\": 668,\n            \"d\": 668,\n            \"e\": 732,\n            \"f\": 544,\n            \"g\": 544,\n            \"h\": 910,\n            \"i\": 911,\n            \"j\": 667,\n            \"k\": 760,\n            \"l\": 760,\n            \"m\": 626,\n            \"n\": 694,\n            \"o\": 595,\n            \"p\": 776,\n            \"u\": 690,\n            \"v\": 791,\n            \"w\": 790,\n            \"x\": 788,\n            \"y\": 788,\n            \"z\": 788,\n            \"{\": 788,\n            \"|\": 788,\n            \"}\": 788,\n            \"~\": 788,\n            \"\\x7f\": 788,\n            \"\\x80\": 788,\n            \"\\x81\": 788,\n            \"\\x82\": 788,\n            \"\\x83\": 788,\n            \"\\x84\": 788,\n            \"\\x85\": 788,\n            \"\\x86\": 788,\n            \"\\x87\": 788,\n            \"\\x88\": 788,\n            \"\\x89\": 788,\n         
   \"\\x8a\": 788,\n            \"\\x8b\": 788,\n            \"\\x8c\": 788,\n            \"\\x8d\": 788,\n            \"\\x8e\": 788,\n            \"\\x8f\": 788,\n            \"\\x90\": 788,\n            \"\\x91\": 788,\n            \"\\x92\": 788,\n            \"\\x93\": 788,\n            \"\\x94\": 788,\n            \"\\x95\": 788,\n            \"\\x96\": 788,\n            \"\\x97\": 788,\n            \"\\x98\": 788,\n            \"\\x99\": 788,\n            \"\\x9a\": 788,\n            \"\\x9b\": 788,\n            \"\\x9c\": 788,\n            \"\\x9d\": 788,\n            \"\\x9e\": 788,\n            \"\\x9f\": 788,\n            \"\\xa0\": 894,\n            \"\\xa1\": 838,\n            \"\\xa2\": 924,\n            \"\\xa3\": 1016,\n            \"\\xa4\": 458,\n            \"\\xa5\": 924,\n            \"\\xa6\": 918,\n            \"\\xa7\": 927,\n            \"\\xa8\": 928,\n            \"\\xa9\": 928,\n            \"\\xaa\": 834,\n            \"\\xab\": 873,\n            \"\\xac\": 828,\n            \"\\xad\": 924,\n            \"\\xae\": 917,\n            \"\\xaf\": 930,\n            \"\\xb0\": 931,\n            \"\\xb1\": 463,\n            \"\\xb2\": 883,\n            \"\\xb3\": 836,\n            \"\\xb4\": 867,\n            \"\\xb5\": 696,\n            \"\\xb6\": 874,\n            \"\\xb7\": 760,\n            \"\\xb8\": 946,\n            \"\\xb9\": 865,\n            \"\\xba\": 967,\n            \"\\xbb\": 831,\n            \"\\xbc\": 873,\n            \"\\xbd\": 927,\n            \"\\xbe\": 970,\n            \"\\xbf\": 918,\n            \"\\xc0\": 748,\n            \"\\xc1\": 836,\n            \"\\xc2\": 771,\n            \"\\xc3\": 888,\n            \"\\xc4\": 748,\n            \"\\xc5\": 771,\n            \"\\xc6\": 888,\n            \"\\xc7\": 867,\n            \"\\xc8\": 696,\n            \"\\xc9\": 874,\n            \"\\xca\": 974,\n            \"\\xcb\": 762,\n            \"\\xcc\": 759,\n            \"\\xcd\": 509,\n            \"\\xce\": 410,\n       
 },\n    ),\n}\n\n# Aliases defined in implementation note 62 in Appendix H, related to section 5.5.1\n# (Type 1 Fonts) of the PDF Reference.\nFONT_METRICS[\"Arial\"] = FONT_METRICS[\"Helvetica\"]\nFONT_METRICS[\"Arial,Italic\"] = FONT_METRICS[\"Helvetica-Oblique\"]\nFONT_METRICS[\"Arial,Bold\"] = FONT_METRICS[\"Helvetica-Bold\"]\nFONT_METRICS[\"Arial,BoldItalic\"] = FONT_METRICS[\"Helvetica-BoldOblique\"]\nFONT_METRICS[\"CourierNew\"] = FONT_METRICS[\"Courier\"]\nFONT_METRICS[\"CourierNew,Italic\"] = FONT_METRICS[\"Courier-Oblique\"]\nFONT_METRICS[\"CourierNew,Bold\"] = FONT_METRICS[\"Courier-Bold\"]\nFONT_METRICS[\"CourierNew,BoldItalic\"] = FONT_METRICS[\"Courier-BoldOblique\"]\nFONT_METRICS[\"TimesNewRoman\"] = FONT_METRICS[\"Times-Roman\"]\nFONT_METRICS[\"TimesNewRoman,Italic\"] = FONT_METRICS[\"Times-Italic\"]\nFONT_METRICS[\"TimesNewRoman,Bold\"] = FONT_METRICS[\"Times-Bold\"]\nFONT_METRICS[\"TimesNewRoman,BoldItalic\"] = FONT_METRICS[\"Times-BoldItalic\"]\n"
  },
  {
    "path": "babeldoc/pdfminer/glyphlist.py",
    "content": "\"\"\"Mappings from Adobe glyph names to Unicode characters.\n\nIn some CMap tables, Adobe glyph names are used to specify\nUnicode characters instead of decimal/hex character codes.\n\nThe data below was generated by downloading the glyph list\n\n  $ wget https://partners.adobe.com/public/developer/en/opentype/glyphlist.txt\n\nand converting it:\n\n```python\nfrom babeldoc.pdfminer.glyphlist import convert_glyphlist\n\nconvert_glyphlist(\"glyphlist.txt\")\n```\n\"\"\"\n\n# ###################################################################################\n# Copyright (c) 1997,1998,2002,2007 Adobe Systems Incorporated\n#\n# Permission is hereby granted, free of charge, to any person obtaining a\n# copy of this documentation file to use, copy, publish, distribute,\n# sublicense, and/or sell copies of the documentation, and to permit\n# others to do the same, provided that:\n# - No modification, editing or other alteration of this document is\n# allowed; and\n# - The above copyright notice and this permission notice shall be\n# included in all copies of the documentation.\n#\n# Permission is hereby granted, free of charge, to any person obtaining a\n# copy of this documentation file, to create their own derivative works\n# from the content of this document to use, copy, publish, distribute,\n# sublicense, and/or sell the derivative works, and to permit others to do\n# the same, provided that the derived work is not represented as being a\n# copy or version of this document.\n#\n# Adobe shall not be liable to any party for any loss of revenue or profit\n# or for indirect, incidental, special, consequential, or other similar\n# damages, whether based on tort (including without limitation negligence\n# or strict liability), contract or other legal or equitable grounds even\n# if Adobe has been advised or had reason to know of the possibility of\n# such damages. 
The Adobe materials are provided on an \"AS IS\" basis.\n# Adobe specifically disclaims all express, statutory, or implied\n# warranties relating to the Adobe materials, including but not limited to\n# those concerning merchantability or fitness for a particular purpose or\n# non-infringement of any third party rights regarding the Adobe\n# materials.\n# ###################################################################################\n# Name:          Adobe Glyph List\n# Table version: 2.0\n# Date:          September 20, 2002\n#\n# See http://partners.adobe.com/asn/developer/typeforum/unicodegn.html\n#\n# Format: Semicolon-delimited fields:\n#            (1) glyph name\n#            (2) Unicode scalar value\n\n\ndef convert_glyphlist(path: str) -> None:\n    \"\"\"Print a Python representation of an Adobe glyph list file.\n\n    Comment and blank lines are echoed unchanged. Each ``name;codes`` entry\n    becomes one item of a ``glyphname2unicode`` dict literal (see the table\n    below for the resulting output).\n    \"\"\"\n    # state: 0 = dict not yet opened, 1 = inside the dict, 2 = dict closed\n    state = 0\n    with open(path) as fileinput:\n        for line in fileinput:\n            line = line.strip()\n            if not line or line.startswith(\"#\"):\n                if state == 1:\n                    state = 2\n                    print(\"}\\n\")\n                print(line)\n                continue\n            if state == 0:\n                print(\"\\nglyphname2unicode = {\")\n                state = 1\n            # Each entry is \"<glyph name>;<space-separated hex code points>\".\n            (name, x) = line.split(\";\")\n            codes = x.split(\" \")\n            print(\n                \" {!r}: u'{}',\".format(name, \"\".join(\"\\\\u%s\" % code for code in codes)),\n            )\n\n\nglyphname2unicode = {\n    \"A\": \"\\u0041\",\n    \"AE\": \"\\u00c6\",\n    \"AEacute\": \"\\u01fc\",\n    \"AEmacron\": \"\\u01e2\",\n    \"AEsmall\": \"\\uf7e6\",\n    \"Aacute\": \"\\u00c1\",\n    \"Aacutesmall\": \"\\uf7e1\",\n    \"Abreve\": \"\\u0102\",\n    \"Abreveacute\": \"\\u1eae\",\n    \"Abrevecyrillic\": \"\\u04d0\",\n    \"Abrevedotbelow\": \"\\u1eb6\",\n    \"Abrevegrave\": \"\\u1eb0\",\n    \"Abrevehookabove\": \"\\u1eb2\",\n    \"Abrevetilde\": 
\"\\u1eb4\",\n    \"Acaron\": \"\\u01cd\",\n    \"Acircle\": \"\\u24b6\",\n    \"Acircumflex\": \"\\u00c2\",\n    \"Acircumflexacute\": \"\\u1ea4\",\n    \"Acircumflexdotbelow\": \"\\u1eac\",\n    \"Acircumflexgrave\": \"\\u1ea6\",\n    \"Acircumflexhookabove\": \"\\u1ea8\",\n    \"Acircumflexsmall\": \"\\uf7e2\",\n    \"Acircumflextilde\": \"\\u1eaa\",\n    \"Acute\": \"\\uf6c9\",\n    \"Acutesmall\": \"\\uf7b4\",\n    \"Acyrillic\": \"\\u0410\",\n    \"Adblgrave\": \"\\u0200\",\n    \"Adieresis\": \"\\u00c4\",\n    \"Adieresiscyrillic\": \"\\u04d2\",\n    \"Adieresismacron\": \"\\u01de\",\n    \"Adieresissmall\": \"\\uf7e4\",\n    \"Adotbelow\": \"\\u1ea0\",\n    \"Adotmacron\": \"\\u01e0\",\n    \"Agrave\": \"\\u00c0\",\n    \"Agravesmall\": \"\\uf7e0\",\n    \"Ahookabove\": \"\\u1ea2\",\n    \"Aiecyrillic\": \"\\u04d4\",\n    \"Ainvertedbreve\": \"\\u0202\",\n    \"Alpha\": \"\\u0391\",\n    \"Alphatonos\": \"\\u0386\",\n    \"Amacron\": \"\\u0100\",\n    \"Amonospace\": \"\\uff21\",\n    \"Aogonek\": \"\\u0104\",\n    \"Aring\": \"\\u00c5\",\n    \"Aringacute\": \"\\u01fa\",\n    \"Aringbelow\": \"\\u1e00\",\n    \"Aringsmall\": \"\\uf7e5\",\n    \"Asmall\": \"\\uf761\",\n    \"Atilde\": \"\\u00c3\",\n    \"Atildesmall\": \"\\uf7e3\",\n    \"Aybarmenian\": \"\\u0531\",\n    \"B\": \"\\u0042\",\n    \"Bcircle\": \"\\u24b7\",\n    \"Bdotaccent\": \"\\u1e02\",\n    \"Bdotbelow\": \"\\u1e04\",\n    \"Becyrillic\": \"\\u0411\",\n    \"Benarmenian\": \"\\u0532\",\n    \"Beta\": \"\\u0392\",\n    \"Bhook\": \"\\u0181\",\n    \"Blinebelow\": \"\\u1e06\",\n    \"Bmonospace\": \"\\uff22\",\n    \"Brevesmall\": \"\\uf6f4\",\n    \"Bsmall\": \"\\uf762\",\n    \"Btopbar\": \"\\u0182\",\n    \"C\": \"\\u0043\",\n    \"Caarmenian\": \"\\u053e\",\n    \"Cacute\": \"\\u0106\",\n    \"Caron\": \"\\uf6ca\",\n    \"Caronsmall\": \"\\uf6f5\",\n    \"Ccaron\": \"\\u010c\",\n    \"Ccedilla\": \"\\u00c7\",\n    \"Ccedillaacute\": \"\\u1e08\",\n    \"Ccedillasmall\": \"\\uf7e7\",\n    
\"Ccircle\": \"\\u24b8\",\n    \"Ccircumflex\": \"\\u0108\",\n    \"Cdot\": \"\\u010a\",\n    \"Cdotaccent\": \"\\u010a\",\n    \"Cedillasmall\": \"\\uf7b8\",\n    \"Chaarmenian\": \"\\u0549\",\n    \"Cheabkhasiancyrillic\": \"\\u04bc\",\n    \"Checyrillic\": \"\\u0427\",\n    \"Chedescenderabkhasiancyrillic\": \"\\u04be\",\n    \"Chedescendercyrillic\": \"\\u04b6\",\n    \"Chedieresiscyrillic\": \"\\u04f4\",\n    \"Cheharmenian\": \"\\u0543\",\n    \"Chekhakassiancyrillic\": \"\\u04cb\",\n    \"Cheverticalstrokecyrillic\": \"\\u04b8\",\n    \"Chi\": \"\\u03a7\",\n    \"Chook\": \"\\u0187\",\n    \"Circumflexsmall\": \"\\uf6f6\",\n    \"Cmonospace\": \"\\uff23\",\n    \"Coarmenian\": \"\\u0551\",\n    \"Csmall\": \"\\uf763\",\n    \"D\": \"\\u0044\",\n    \"DZ\": \"\\u01f1\",\n    \"DZcaron\": \"\\u01c4\",\n    \"Daarmenian\": \"\\u0534\",\n    \"Dafrican\": \"\\u0189\",\n    \"Dcaron\": \"\\u010e\",\n    \"Dcedilla\": \"\\u1e10\",\n    \"Dcircle\": \"\\u24b9\",\n    \"Dcircumflexbelow\": \"\\u1e12\",\n    \"Dcroat\": \"\\u0110\",\n    \"Ddotaccent\": \"\\u1e0a\",\n    \"Ddotbelow\": \"\\u1e0c\",\n    \"Decyrillic\": \"\\u0414\",\n    \"Deicoptic\": \"\\u03ee\",\n    \"Delta\": \"\\u2206\",\n    \"Deltagreek\": \"\\u0394\",\n    \"Dhook\": \"\\u018a\",\n    \"Dieresis\": \"\\uf6cb\",\n    \"DieresisAcute\": \"\\uf6cc\",\n    \"DieresisGrave\": \"\\uf6cd\",\n    \"Dieresissmall\": \"\\uf7a8\",\n    \"Digammagreek\": \"\\u03dc\",\n    \"Djecyrillic\": \"\\u0402\",\n    \"Dlinebelow\": \"\\u1e0e\",\n    \"Dmonospace\": \"\\uff24\",\n    \"Dotaccentsmall\": \"\\uf6f7\",\n    \"Dslash\": \"\\u0110\",\n    \"Dsmall\": \"\\uf764\",\n    \"Dtopbar\": \"\\u018b\",\n    \"Dz\": \"\\u01f2\",\n    \"Dzcaron\": \"\\u01c5\",\n    \"Dzeabkhasiancyrillic\": \"\\u04e0\",\n    \"Dzecyrillic\": \"\\u0405\",\n    \"Dzhecyrillic\": \"\\u040f\",\n    \"E\": \"\\u0045\",\n    \"Eacute\": \"\\u00c9\",\n    \"Eacutesmall\": \"\\uf7e9\",\n    \"Ebreve\": \"\\u0114\",\n    \"Ecaron\": 
\"\\u011a\",\n    \"Ecedillabreve\": \"\\u1e1c\",\n    \"Echarmenian\": \"\\u0535\",\n    \"Ecircle\": \"\\u24ba\",\n    \"Ecircumflex\": \"\\u00ca\",\n    \"Ecircumflexacute\": \"\\u1ebe\",\n    \"Ecircumflexbelow\": \"\\u1e18\",\n    \"Ecircumflexdotbelow\": \"\\u1ec6\",\n    \"Ecircumflexgrave\": \"\\u1ec0\",\n    \"Ecircumflexhookabove\": \"\\u1ec2\",\n    \"Ecircumflexsmall\": \"\\uf7ea\",\n    \"Ecircumflextilde\": \"\\u1ec4\",\n    \"Ecyrillic\": \"\\u0404\",\n    \"Edblgrave\": \"\\u0204\",\n    \"Edieresis\": \"\\u00cb\",\n    \"Edieresissmall\": \"\\uf7eb\",\n    \"Edot\": \"\\u0116\",\n    \"Edotaccent\": \"\\u0116\",\n    \"Edotbelow\": \"\\u1eb8\",\n    \"Efcyrillic\": \"\\u0424\",\n    \"Egrave\": \"\\u00c8\",\n    \"Egravesmall\": \"\\uf7e8\",\n    \"Eharmenian\": \"\\u0537\",\n    \"Ehookabove\": \"\\u1eba\",\n    \"Eightroman\": \"\\u2167\",\n    \"Einvertedbreve\": \"\\u0206\",\n    \"Eiotifiedcyrillic\": \"\\u0464\",\n    \"Elcyrillic\": \"\\u041b\",\n    \"Elevenroman\": \"\\u216a\",\n    \"Emacron\": \"\\u0112\",\n    \"Emacronacute\": \"\\u1e16\",\n    \"Emacrongrave\": \"\\u1e14\",\n    \"Emcyrillic\": \"\\u041c\",\n    \"Emonospace\": \"\\uff25\",\n    \"Encyrillic\": \"\\u041d\",\n    \"Endescendercyrillic\": \"\\u04a2\",\n    \"Eng\": \"\\u014a\",\n    \"Enghecyrillic\": \"\\u04a4\",\n    \"Enhookcyrillic\": \"\\u04c7\",\n    \"Eogonek\": \"\\u0118\",\n    \"Eopen\": \"\\u0190\",\n    \"Epsilon\": \"\\u0395\",\n    \"Epsilontonos\": \"\\u0388\",\n    \"Ercyrillic\": \"\\u0420\",\n    \"Ereversed\": \"\\u018e\",\n    \"Ereversedcyrillic\": \"\\u042d\",\n    \"Escyrillic\": \"\\u0421\",\n    \"Esdescendercyrillic\": \"\\u04aa\",\n    \"Esh\": \"\\u01a9\",\n    \"Esmall\": \"\\uf765\",\n    \"Eta\": \"\\u0397\",\n    \"Etarmenian\": \"\\u0538\",\n    \"Etatonos\": \"\\u0389\",\n    \"Eth\": \"\\u00d0\",\n    \"Ethsmall\": \"\\uf7f0\",\n    \"Etilde\": \"\\u1ebc\",\n    \"Etildebelow\": \"\\u1e1a\",\n    \"Euro\": \"\\u20ac\",\n    \"Ezh\": 
\"\\u01b7\",\n    \"Ezhcaron\": \"\\u01ee\",\n    \"Ezhreversed\": \"\\u01b8\",\n    \"F\": \"\\u0046\",\n    \"Fcircle\": \"\\u24bb\",\n    \"Fdotaccent\": \"\\u1e1e\",\n    \"Feharmenian\": \"\\u0556\",\n    \"Feicoptic\": \"\\u03e4\",\n    \"Fhook\": \"\\u0191\",\n    \"Fitacyrillic\": \"\\u0472\",\n    \"Fiveroman\": \"\\u2164\",\n    \"Fmonospace\": \"\\uff26\",\n    \"Fourroman\": \"\\u2163\",\n    \"Fsmall\": \"\\uf766\",\n    \"G\": \"\\u0047\",\n    \"GBsquare\": \"\\u3387\",\n    \"Gacute\": \"\\u01f4\",\n    \"Gamma\": \"\\u0393\",\n    \"Gammaafrican\": \"\\u0194\",\n    \"Gangiacoptic\": \"\\u03ea\",\n    \"Gbreve\": \"\\u011e\",\n    \"Gcaron\": \"\\u01e6\",\n    \"Gcedilla\": \"\\u0122\",\n    \"Gcircle\": \"\\u24bc\",\n    \"Gcircumflex\": \"\\u011c\",\n    \"Gcommaaccent\": \"\\u0122\",\n    \"Gdot\": \"\\u0120\",\n    \"Gdotaccent\": \"\\u0120\",\n    \"Gecyrillic\": \"\\u0413\",\n    \"Ghadarmenian\": \"\\u0542\",\n    \"Ghemiddlehookcyrillic\": \"\\u0494\",\n    \"Ghestrokecyrillic\": \"\\u0492\",\n    \"Gheupturncyrillic\": \"\\u0490\",\n    \"Ghook\": \"\\u0193\",\n    \"Gimarmenian\": \"\\u0533\",\n    \"Gjecyrillic\": \"\\u0403\",\n    \"Gmacron\": \"\\u1e20\",\n    \"Gmonospace\": \"\\uff27\",\n    \"Grave\": \"\\uf6ce\",\n    \"Gravesmall\": \"\\uf760\",\n    \"Gsmall\": \"\\uf767\",\n    \"Gsmallhook\": \"\\u029b\",\n    \"Gstroke\": \"\\u01e4\",\n    \"H\": \"\\u0048\",\n    \"H18533\": \"\\u25cf\",\n    \"H18543\": \"\\u25aa\",\n    \"H18551\": \"\\u25ab\",\n    \"H22073\": \"\\u25a1\",\n    \"HPsquare\": \"\\u33cb\",\n    \"Haabkhasiancyrillic\": \"\\u04a8\",\n    \"Hadescendercyrillic\": \"\\u04b2\",\n    \"Hardsigncyrillic\": \"\\u042a\",\n    \"Hbar\": \"\\u0126\",\n    \"Hbrevebelow\": \"\\u1e2a\",\n    \"Hcedilla\": \"\\u1e28\",\n    \"Hcircle\": \"\\u24bd\",\n    \"Hcircumflex\": \"\\u0124\",\n    \"Hdieresis\": \"\\u1e26\",\n    \"Hdotaccent\": \"\\u1e22\",\n    \"Hdotbelow\": \"\\u1e24\",\n    \"Hmonospace\": \"\\uff28\",\n    
\"Hoarmenian\": \"\\u0540\",\n    \"Horicoptic\": \"\\u03e8\",\n    \"Hsmall\": \"\\uf768\",\n    \"Hungarumlaut\": \"\\uf6cf\",\n    \"Hungarumlautsmall\": \"\\uf6f8\",\n    \"Hzsquare\": \"\\u3390\",\n    \"I\": \"\\u0049\",\n    \"IAcyrillic\": \"\\u042f\",\n    \"IJ\": \"\\u0132\",\n    \"IUcyrillic\": \"\\u042e\",\n    \"Iacute\": \"\\u00cd\",\n    \"Iacutesmall\": \"\\uf7ed\",\n    \"Ibreve\": \"\\u012c\",\n    \"Icaron\": \"\\u01cf\",\n    \"Icircle\": \"\\u24be\",\n    \"Icircumflex\": \"\\u00ce\",\n    \"Icircumflexsmall\": \"\\uf7ee\",\n    \"Icyrillic\": \"\\u0406\",\n    \"Idblgrave\": \"\\u0208\",\n    \"Idieresis\": \"\\u00cf\",\n    \"Idieresisacute\": \"\\u1e2e\",\n    \"Idieresiscyrillic\": \"\\u04e4\",\n    \"Idieresissmall\": \"\\uf7ef\",\n    \"Idot\": \"\\u0130\",\n    \"Idotaccent\": \"\\u0130\",\n    \"Idotbelow\": \"\\u1eca\",\n    \"Iebrevecyrillic\": \"\\u04d6\",\n    \"Iecyrillic\": \"\\u0415\",\n    \"Ifraktur\": \"\\u2111\",\n    \"Igrave\": \"\\u00cc\",\n    \"Igravesmall\": \"\\uf7ec\",\n    \"Ihookabove\": \"\\u1ec8\",\n    \"Iicyrillic\": \"\\u0418\",\n    \"Iinvertedbreve\": \"\\u020a\",\n    \"Iishortcyrillic\": \"\\u0419\",\n    \"Imacron\": \"\\u012a\",\n    \"Imacroncyrillic\": \"\\u04e2\",\n    \"Imonospace\": \"\\uff29\",\n    \"Iniarmenian\": \"\\u053b\",\n    \"Iocyrillic\": \"\\u0401\",\n    \"Iogonek\": \"\\u012e\",\n    \"Iota\": \"\\u0399\",\n    \"Iotaafrican\": \"\\u0196\",\n    \"Iotadieresis\": \"\\u03aa\",\n    \"Iotatonos\": \"\\u038a\",\n    \"Ismall\": \"\\uf769\",\n    \"Istroke\": \"\\u0197\",\n    \"Itilde\": \"\\u0128\",\n    \"Itildebelow\": \"\\u1e2c\",\n    \"Izhitsacyrillic\": \"\\u0474\",\n    \"Izhitsadblgravecyrillic\": \"\\u0476\",\n    \"J\": \"\\u004a\",\n    \"Jaarmenian\": \"\\u0541\",\n    \"Jcircle\": \"\\u24bf\",\n    \"Jcircumflex\": \"\\u0134\",\n    \"Jecyrillic\": \"\\u0408\",\n    \"Jheharmenian\": \"\\u054b\",\n    \"Jmonospace\": \"\\uff2a\",\n    \"Jsmall\": \"\\uf76a\",\n    \"K\": 
\"\\u004b\",\n    \"KBsquare\": \"\\u3385\",\n    \"KKsquare\": \"\\u33cd\",\n    \"Kabashkircyrillic\": \"\\u04a0\",\n    \"Kacute\": \"\\u1e30\",\n    \"Kacyrillic\": \"\\u041a\",\n    \"Kadescendercyrillic\": \"\\u049a\",\n    \"Kahookcyrillic\": \"\\u04c3\",\n    \"Kappa\": \"\\u039a\",\n    \"Kastrokecyrillic\": \"\\u049e\",\n    \"Kaverticalstrokecyrillic\": \"\\u049c\",\n    \"Kcaron\": \"\\u01e8\",\n    \"Kcedilla\": \"\\u0136\",\n    \"Kcircle\": \"\\u24c0\",\n    \"Kcommaaccent\": \"\\u0136\",\n    \"Kdotbelow\": \"\\u1e32\",\n    \"Keharmenian\": \"\\u0554\",\n    \"Kenarmenian\": \"\\u053f\",\n    \"Khacyrillic\": \"\\u0425\",\n    \"Kheicoptic\": \"\\u03e6\",\n    \"Khook\": \"\\u0198\",\n    \"Kjecyrillic\": \"\\u040c\",\n    \"Klinebelow\": \"\\u1e34\",\n    \"Kmonospace\": \"\\uff2b\",\n    \"Koppacyrillic\": \"\\u0480\",\n    \"Koppagreek\": \"\\u03de\",\n    \"Ksicyrillic\": \"\\u046e\",\n    \"Ksmall\": \"\\uf76b\",\n    \"L\": \"\\u004c\",\n    \"LJ\": \"\\u01c7\",\n    \"LL\": \"\\uf6bf\",\n    \"Lacute\": \"\\u0139\",\n    \"Lambda\": \"\\u039b\",\n    \"Lcaron\": \"\\u013d\",\n    \"Lcedilla\": \"\\u013b\",\n    \"Lcircle\": \"\\u24c1\",\n    \"Lcircumflexbelow\": \"\\u1e3c\",\n    \"Lcommaaccent\": \"\\u013b\",\n    \"Ldot\": \"\\u013f\",\n    \"Ldotaccent\": \"\\u013f\",\n    \"Ldotbelow\": \"\\u1e36\",\n    \"Ldotbelowmacron\": \"\\u1e38\",\n    \"Liwnarmenian\": \"\\u053c\",\n    \"Lj\": \"\\u01c8\",\n    \"Ljecyrillic\": \"\\u0409\",\n    \"Llinebelow\": \"\\u1e3a\",\n    \"Lmonospace\": \"\\uff2c\",\n    \"Lslash\": \"\\u0141\",\n    \"Lslashsmall\": \"\\uf6f9\",\n    \"Lsmall\": \"\\uf76c\",\n    \"M\": \"\\u004d\",\n    \"MBsquare\": \"\\u3386\",\n    \"Macron\": \"\\uf6d0\",\n    \"Macronsmall\": \"\\uf7af\",\n    \"Macute\": \"\\u1e3e\",\n    \"Mcircle\": \"\\u24c2\",\n    \"Mdotaccent\": \"\\u1e40\",\n    \"Mdotbelow\": \"\\u1e42\",\n    \"Menarmenian\": \"\\u0544\",\n    \"Mmonospace\": \"\\uff2d\",\n    \"Msmall\": \"\\uf76d\",\n 
   \"Mturned\": \"\\u019c\",\n    \"Mu\": \"\\u039c\",\n    \"N\": \"\\u004e\",\n    \"NJ\": \"\\u01ca\",\n    \"Nacute\": \"\\u0143\",\n    \"Ncaron\": \"\\u0147\",\n    \"Ncedilla\": \"\\u0145\",\n    \"Ncircle\": \"\\u24c3\",\n    \"Ncircumflexbelow\": \"\\u1e4a\",\n    \"Ncommaaccent\": \"\\u0145\",\n    \"Ndotaccent\": \"\\u1e44\",\n    \"Ndotbelow\": \"\\u1e46\",\n    \"Nhookleft\": \"\\u019d\",\n    \"Nineroman\": \"\\u2168\",\n    \"Nj\": \"\\u01cb\",\n    \"Njecyrillic\": \"\\u040a\",\n    \"Nlinebelow\": \"\\u1e48\",\n    \"Nmonospace\": \"\\uff2e\",\n    \"Nowarmenian\": \"\\u0546\",\n    \"Nsmall\": \"\\uf76e\",\n    \"Ntilde\": \"\\u00d1\",\n    \"Ntildesmall\": \"\\uf7f1\",\n    \"Nu\": \"\\u039d\",\n    \"O\": \"\\u004f\",\n    \"OE\": \"\\u0152\",\n    \"OEsmall\": \"\\uf6fa\",\n    \"Oacute\": \"\\u00d3\",\n    \"Oacutesmall\": \"\\uf7f3\",\n    \"Obarredcyrillic\": \"\\u04e8\",\n    \"Obarreddieresiscyrillic\": \"\\u04ea\",\n    \"Obreve\": \"\\u014e\",\n    \"Ocaron\": \"\\u01d1\",\n    \"Ocenteredtilde\": \"\\u019f\",\n    \"Ocircle\": \"\\u24c4\",\n    \"Ocircumflex\": \"\\u00d4\",\n    \"Ocircumflexacute\": \"\\u1ed0\",\n    \"Ocircumflexdotbelow\": \"\\u1ed8\",\n    \"Ocircumflexgrave\": \"\\u1ed2\",\n    \"Ocircumflexhookabove\": \"\\u1ed4\",\n    \"Ocircumflexsmall\": \"\\uf7f4\",\n    \"Ocircumflextilde\": \"\\u1ed6\",\n    \"Ocyrillic\": \"\\u041e\",\n    \"Odblacute\": \"\\u0150\",\n    \"Odblgrave\": \"\\u020c\",\n    \"Odieresis\": \"\\u00d6\",\n    \"Odieresiscyrillic\": \"\\u04e6\",\n    \"Odieresissmall\": \"\\uf7f6\",\n    \"Odotbelow\": \"\\u1ecc\",\n    \"Ogoneksmall\": \"\\uf6fb\",\n    \"Ograve\": \"\\u00d2\",\n    \"Ogravesmall\": \"\\uf7f2\",\n    \"Oharmenian\": \"\\u0555\",\n    \"Ohm\": \"\\u2126\",\n    \"Ohookabove\": \"\\u1ece\",\n    \"Ohorn\": \"\\u01a0\",\n    \"Ohornacute\": \"\\u1eda\",\n    \"Ohorndotbelow\": \"\\u1ee2\",\n    \"Ohorngrave\": \"\\u1edc\",\n    \"Ohornhookabove\": \"\\u1ede\",\n    \"Ohorntilde\": 
\"\\u1ee0\",\n    \"Ohungarumlaut\": \"\\u0150\",\n    \"Oi\": \"\\u01a2\",\n    \"Oinvertedbreve\": \"\\u020e\",\n    \"Omacron\": \"\\u014c\",\n    \"Omacronacute\": \"\\u1e52\",\n    \"Omacrongrave\": \"\\u1e50\",\n    \"Omega\": \"\\u2126\",\n    \"Omegacyrillic\": \"\\u0460\",\n    \"Omegagreek\": \"\\u03a9\",\n    \"Omegaroundcyrillic\": \"\\u047a\",\n    \"Omegatitlocyrillic\": \"\\u047c\",\n    \"Omegatonos\": \"\\u038f\",\n    \"Omicron\": \"\\u039f\",\n    \"Omicrontonos\": \"\\u038c\",\n    \"Omonospace\": \"\\uff2f\",\n    \"Oneroman\": \"\\u2160\",\n    \"Oogonek\": \"\\u01ea\",\n    \"Oogonekmacron\": \"\\u01ec\",\n    \"Oopen\": \"\\u0186\",\n    \"Oslash\": \"\\u00d8\",\n    \"Oslashacute\": \"\\u01fe\",\n    \"Oslashsmall\": \"\\uf7f8\",\n    \"Osmall\": \"\\uf76f\",\n    \"Ostrokeacute\": \"\\u01fe\",\n    \"Otcyrillic\": \"\\u047e\",\n    \"Otilde\": \"\\u00d5\",\n    \"Otildeacute\": \"\\u1e4c\",\n    \"Otildedieresis\": \"\\u1e4e\",\n    \"Otildesmall\": \"\\uf7f5\",\n    \"P\": \"\\u0050\",\n    \"Pacute\": \"\\u1e54\",\n    \"Pcircle\": \"\\u24c5\",\n    \"Pdotaccent\": \"\\u1e56\",\n    \"Pecyrillic\": \"\\u041f\",\n    \"Peharmenian\": \"\\u054a\",\n    \"Pemiddlehookcyrillic\": \"\\u04a6\",\n    \"Phi\": \"\\u03a6\",\n    \"Phook\": \"\\u01a4\",\n    \"Pi\": \"\\u03a0\",\n    \"Piwrarmenian\": \"\\u0553\",\n    \"Pmonospace\": \"\\uff30\",\n    \"Psi\": \"\\u03a8\",\n    \"Psicyrillic\": \"\\u0470\",\n    \"Psmall\": \"\\uf770\",\n    \"Q\": \"\\u0051\",\n    \"Qcircle\": \"\\u24c6\",\n    \"Qmonospace\": \"\\uff31\",\n    \"Qsmall\": \"\\uf771\",\n    \"R\": \"\\u0052\",\n    \"Raarmenian\": \"\\u054c\",\n    \"Racute\": \"\\u0154\",\n    \"Rcaron\": \"\\u0158\",\n    \"Rcedilla\": \"\\u0156\",\n    \"Rcircle\": \"\\u24c7\",\n    \"Rcommaaccent\": \"\\u0156\",\n    \"Rdblgrave\": \"\\u0210\",\n    \"Rdotaccent\": \"\\u1e58\",\n    \"Rdotbelow\": \"\\u1e5a\",\n    \"Rdotbelowmacron\": \"\\u1e5c\",\n    \"Reharmenian\": \"\\u0550\",\n    
\"Rfraktur\": \"\\u211c\",\n    \"Rho\": \"\\u03a1\",\n    \"Ringsmall\": \"\\uf6fc\",\n    \"Rinvertedbreve\": \"\\u0212\",\n    \"Rlinebelow\": \"\\u1e5e\",\n    \"Rmonospace\": \"\\uff32\",\n    \"Rsmall\": \"\\uf772\",\n    \"Rsmallinverted\": \"\\u0281\",\n    \"Rsmallinvertedsuperior\": \"\\u02b6\",\n    \"S\": \"\\u0053\",\n    \"SF010000\": \"\\u250c\",\n    \"SF020000\": \"\\u2514\",\n    \"SF030000\": \"\\u2510\",\n    \"SF040000\": \"\\u2518\",\n    \"SF050000\": \"\\u253c\",\n    \"SF060000\": \"\\u252c\",\n    \"SF070000\": \"\\u2534\",\n    \"SF080000\": \"\\u251c\",\n    \"SF090000\": \"\\u2524\",\n    \"SF100000\": \"\\u2500\",\n    \"SF110000\": \"\\u2502\",\n    \"SF190000\": \"\\u2561\",\n    \"SF200000\": \"\\u2562\",\n    \"SF210000\": \"\\u2556\",\n    \"SF220000\": \"\\u2555\",\n    \"SF230000\": \"\\u2563\",\n    \"SF240000\": \"\\u2551\",\n    \"SF250000\": \"\\u2557\",\n    \"SF260000\": \"\\u255d\",\n    \"SF270000\": \"\\u255c\",\n    \"SF280000\": \"\\u255b\",\n    \"SF360000\": \"\\u255e\",\n    \"SF370000\": \"\\u255f\",\n    \"SF380000\": \"\\u255a\",\n    \"SF390000\": \"\\u2554\",\n    \"SF400000\": \"\\u2569\",\n    \"SF410000\": \"\\u2566\",\n    \"SF420000\": \"\\u2560\",\n    \"SF430000\": \"\\u2550\",\n    \"SF440000\": \"\\u256c\",\n    \"SF450000\": \"\\u2567\",\n    \"SF460000\": \"\\u2568\",\n    \"SF470000\": \"\\u2564\",\n    \"SF480000\": \"\\u2565\",\n    \"SF490000\": \"\\u2559\",\n    \"SF500000\": \"\\u2558\",\n    \"SF510000\": \"\\u2552\",\n    \"SF520000\": \"\\u2553\",\n    \"SF530000\": \"\\u256b\",\n    \"SF540000\": \"\\u256a\",\n    \"Sacute\": \"\\u015a\",\n    \"Sacutedotaccent\": \"\\u1e64\",\n    \"Sampigreek\": \"\\u03e0\",\n    \"Scaron\": \"\\u0160\",\n    \"Scarondotaccent\": \"\\u1e66\",\n    \"Scaronsmall\": \"\\uf6fd\",\n    \"Scedilla\": \"\\u015e\",\n    \"Schwa\": \"\\u018f\",\n    \"Schwacyrillic\": \"\\u04d8\",\n    \"Schwadieresiscyrillic\": \"\\u04da\",\n    \"Scircle\": \"\\u24c8\",\n    
\"Scircumflex\": \"\\u015c\",\n    \"Scommaaccent\": \"\\u0218\",\n    \"Sdotaccent\": \"\\u1e60\",\n    \"Sdotbelow\": \"\\u1e62\",\n    \"Sdotbelowdotaccent\": \"\\u1e68\",\n    \"Seharmenian\": \"\\u054d\",\n    \"Sevenroman\": \"\\u2166\",\n    \"Shaarmenian\": \"\\u0547\",\n    \"Shacyrillic\": \"\\u0428\",\n    \"Shchacyrillic\": \"\\u0429\",\n    \"Sheicoptic\": \"\\u03e2\",\n    \"Shhacyrillic\": \"\\u04ba\",\n    \"Shimacoptic\": \"\\u03ec\",\n    \"Sigma\": \"\\u03a3\",\n    \"Sixroman\": \"\\u2165\",\n    \"Smonospace\": \"\\uff33\",\n    \"Softsigncyrillic\": \"\\u042c\",\n    \"Ssmall\": \"\\uf773\",\n    \"Stigmagreek\": \"\\u03da\",\n    \"T\": \"\\u0054\",\n    \"Tau\": \"\\u03a4\",\n    \"Tbar\": \"\\u0166\",\n    \"Tcaron\": \"\\u0164\",\n    \"Tcedilla\": \"\\u0162\",\n    \"Tcircle\": \"\\u24c9\",\n    \"Tcircumflexbelow\": \"\\u1e70\",\n    \"Tcommaaccent\": \"\\u0162\",\n    \"Tdotaccent\": \"\\u1e6a\",\n    \"Tdotbelow\": \"\\u1e6c\",\n    \"Tecyrillic\": \"\\u0422\",\n    \"Tedescendercyrillic\": \"\\u04ac\",\n    \"Tenroman\": \"\\u2169\",\n    \"Tetsecyrillic\": \"\\u04b4\",\n    \"Theta\": \"\\u0398\",\n    \"Thook\": \"\\u01ac\",\n    \"Thorn\": \"\\u00de\",\n    \"Thornsmall\": \"\\uf7fe\",\n    \"Threeroman\": \"\\u2162\",\n    \"Tildesmall\": \"\\uf6fe\",\n    \"Tiwnarmenian\": \"\\u054f\",\n    \"Tlinebelow\": \"\\u1e6e\",\n    \"Tmonospace\": \"\\uff34\",\n    \"Toarmenian\": \"\\u0539\",\n    \"Tonefive\": \"\\u01bc\",\n    \"Tonesix\": \"\\u0184\",\n    \"Tonetwo\": \"\\u01a7\",\n    \"Tretroflexhook\": \"\\u01ae\",\n    \"Tsecyrillic\": \"\\u0426\",\n    \"Tshecyrillic\": \"\\u040b\",\n    \"Tsmall\": \"\\uf774\",\n    \"Twelveroman\": \"\\u216b\",\n    \"Tworoman\": \"\\u2161\",\n    \"U\": \"\\u0055\",\n    \"Uacute\": \"\\u00da\",\n    \"Uacutesmall\": \"\\uf7fa\",\n    \"Ubreve\": \"\\u016c\",\n    \"Ucaron\": \"\\u01d3\",\n    \"Ucircle\": \"\\u24ca\",\n    \"Ucircumflex\": \"\\u00db\",\n    \"Ucircumflexbelow\": 
\"\\u1e76\",\n    \"Ucircumflexsmall\": \"\\uf7fb\",\n    \"Ucyrillic\": \"\\u0423\",\n    \"Udblacute\": \"\\u0170\",\n    \"Udblgrave\": \"\\u0214\",\n    \"Udieresis\": \"\\u00dc\",\n    \"Udieresisacute\": \"\\u01d7\",\n    \"Udieresisbelow\": \"\\u1e72\",\n    \"Udieresiscaron\": \"\\u01d9\",\n    \"Udieresiscyrillic\": \"\\u04f0\",\n    \"Udieresisgrave\": \"\\u01db\",\n    \"Udieresismacron\": \"\\u01d5\",\n    \"Udieresissmall\": \"\\uf7fc\",\n    \"Udotbelow\": \"\\u1ee4\",\n    \"Ugrave\": \"\\u00d9\",\n    \"Ugravesmall\": \"\\uf7f9\",\n    \"Uhookabove\": \"\\u1ee6\",\n    \"Uhorn\": \"\\u01af\",\n    \"Uhornacute\": \"\\u1ee8\",\n    \"Uhorndotbelow\": \"\\u1ef0\",\n    \"Uhorngrave\": \"\\u1eea\",\n    \"Uhornhookabove\": \"\\u1eec\",\n    \"Uhorntilde\": \"\\u1eee\",\n    \"Uhungarumlaut\": \"\\u0170\",\n    \"Uhungarumlautcyrillic\": \"\\u04f2\",\n    \"Uinvertedbreve\": \"\\u0216\",\n    \"Ukcyrillic\": \"\\u0478\",\n    \"Umacron\": \"\\u016a\",\n    \"Umacroncyrillic\": \"\\u04ee\",\n    \"Umacrondieresis\": \"\\u1e7a\",\n    \"Umonospace\": \"\\uff35\",\n    \"Uogonek\": \"\\u0172\",\n    \"Upsilon\": \"\\u03a5\",\n    \"Upsilon1\": \"\\u03d2\",\n    \"Upsilonacutehooksymbolgreek\": \"\\u03d3\",\n    \"Upsilonafrican\": \"\\u01b1\",\n    \"Upsilondieresis\": \"\\u03ab\",\n    \"Upsilondieresishooksymbolgreek\": \"\\u03d4\",\n    \"Upsilonhooksymbol\": \"\\u03d2\",\n    \"Upsilontonos\": \"\\u038e\",\n    \"Uring\": \"\\u016e\",\n    \"Ushortcyrillic\": \"\\u040e\",\n    \"Usmall\": \"\\uf775\",\n    \"Ustraightcyrillic\": \"\\u04ae\",\n    \"Ustraightstrokecyrillic\": \"\\u04b0\",\n    \"Utilde\": \"\\u0168\",\n    \"Utildeacute\": \"\\u1e78\",\n    \"Utildebelow\": \"\\u1e74\",\n    \"V\": \"\\u0056\",\n    \"Vcircle\": \"\\u24cb\",\n    \"Vdotbelow\": \"\\u1e7e\",\n    \"Vecyrillic\": \"\\u0412\",\n    \"Vewarmenian\": \"\\u054e\",\n    \"Vhook\": \"\\u01b2\",\n    \"Vmonospace\": \"\\uff36\",\n    \"Voarmenian\": \"\\u0548\",\n    \"Vsmall\": 
\"\\uf776\",\n    \"Vtilde\": \"\\u1e7c\",\n    \"W\": \"\\u0057\",\n    \"Wacute\": \"\\u1e82\",\n    \"Wcircle\": \"\\u24cc\",\n    \"Wcircumflex\": \"\\u0174\",\n    \"Wdieresis\": \"\\u1e84\",\n    \"Wdotaccent\": \"\\u1e86\",\n    \"Wdotbelow\": \"\\u1e88\",\n    \"Wgrave\": \"\\u1e80\",\n    \"Wmonospace\": \"\\uff37\",\n    \"Wsmall\": \"\\uf777\",\n    \"X\": \"\\u0058\",\n    \"Xcircle\": \"\\u24cd\",\n    \"Xdieresis\": \"\\u1e8c\",\n    \"Xdotaccent\": \"\\u1e8a\",\n    \"Xeharmenian\": \"\\u053d\",\n    \"Xi\": \"\\u039e\",\n    \"Xmonospace\": \"\\uff38\",\n    \"Xsmall\": \"\\uf778\",\n    \"Y\": \"\\u0059\",\n    \"Yacute\": \"\\u00dd\",\n    \"Yacutesmall\": \"\\uf7fd\",\n    \"Yatcyrillic\": \"\\u0462\",\n    \"Ycircle\": \"\\u24ce\",\n    \"Ycircumflex\": \"\\u0176\",\n    \"Ydieresis\": \"\\u0178\",\n    \"Ydieresissmall\": \"\\uf7ff\",\n    \"Ydotaccent\": \"\\u1e8e\",\n    \"Ydotbelow\": \"\\u1ef4\",\n    \"Yericyrillic\": \"\\u042b\",\n    \"Yerudieresiscyrillic\": \"\\u04f8\",\n    \"Ygrave\": \"\\u1ef2\",\n    \"Yhook\": \"\\u01b3\",\n    \"Yhookabove\": \"\\u1ef6\",\n    \"Yiarmenian\": \"\\u0545\",\n    \"Yicyrillic\": \"\\u0407\",\n    \"Yiwnarmenian\": \"\\u0552\",\n    \"Ymonospace\": \"\\uff39\",\n    \"Ysmall\": \"\\uf779\",\n    \"Ytilde\": \"\\u1ef8\",\n    \"Yusbigcyrillic\": \"\\u046a\",\n    \"Yusbigiotifiedcyrillic\": \"\\u046c\",\n    \"Yuslittlecyrillic\": \"\\u0466\",\n    \"Yuslittleiotifiedcyrillic\": \"\\u0468\",\n    \"Z\": \"\\u005a\",\n    \"Zaarmenian\": \"\\u0536\",\n    \"Zacute\": \"\\u0179\",\n    \"Zcaron\": \"\\u017d\",\n    \"Zcaronsmall\": \"\\uf6ff\",\n    \"Zcircle\": \"\\u24cf\",\n    \"Zcircumflex\": \"\\u1e90\",\n    \"Zdot\": \"\\u017b\",\n    \"Zdotaccent\": \"\\u017b\",\n    \"Zdotbelow\": \"\\u1e92\",\n    \"Zecyrillic\": \"\\u0417\",\n    \"Zedescendercyrillic\": \"\\u0498\",\n    \"Zedieresiscyrillic\": \"\\u04de\",\n    \"Zeta\": \"\\u0396\",\n    \"Zhearmenian\": \"\\u053a\",\n    
\"Zhebrevecyrillic\": \"\\u04c1\",\n    \"Zhecyrillic\": \"\\u0416\",\n    \"Zhedescendercyrillic\": \"\\u0496\",\n    \"Zhedieresiscyrillic\": \"\\u04dc\",\n    \"Zlinebelow\": \"\\u1e94\",\n    \"Zmonospace\": \"\\uff3a\",\n    \"Zsmall\": \"\\uf77a\",\n    \"Zstroke\": \"\\u01b5\",\n    \"a\": \"\\u0061\",\n    \"aabengali\": \"\\u0986\",\n    \"aacute\": \"\\u00e1\",\n    \"aadeva\": \"\\u0906\",\n    \"aagujarati\": \"\\u0a86\",\n    \"aagurmukhi\": \"\\u0a06\",\n    \"aamatragurmukhi\": \"\\u0a3e\",\n    \"aarusquare\": \"\\u3303\",\n    \"aavowelsignbengali\": \"\\u09be\",\n    \"aavowelsigndeva\": \"\\u093e\",\n    \"aavowelsigngujarati\": \"\\u0abe\",\n    \"abbreviationmarkarmenian\": \"\\u055f\",\n    \"abbreviationsigndeva\": \"\\u0970\",\n    \"abengali\": \"\\u0985\",\n    \"abopomofo\": \"\\u311a\",\n    \"abreve\": \"\\u0103\",\n    \"abreveacute\": \"\\u1eaf\",\n    \"abrevecyrillic\": \"\\u04d1\",\n    \"abrevedotbelow\": \"\\u1eb7\",\n    \"abrevegrave\": \"\\u1eb1\",\n    \"abrevehookabove\": \"\\u1eb3\",\n    \"abrevetilde\": \"\\u1eb5\",\n    \"acaron\": \"\\u01ce\",\n    \"acircle\": \"\\u24d0\",\n    \"acircumflex\": \"\\u00e2\",\n    \"acircumflexacute\": \"\\u1ea5\",\n    \"acircumflexdotbelow\": \"\\u1ead\",\n    \"acircumflexgrave\": \"\\u1ea7\",\n    \"acircumflexhookabove\": \"\\u1ea9\",\n    \"acircumflextilde\": \"\\u1eab\",\n    \"acute\": \"\\u00b4\",\n    \"acutebelowcmb\": \"\\u0317\",\n    \"acutecmb\": \"\\u0301\",\n    \"acutecomb\": \"\\u0301\",\n    \"acutedeva\": \"\\u0954\",\n    \"acutelowmod\": \"\\u02cf\",\n    \"acutetonecmb\": \"\\u0341\",\n    \"acyrillic\": \"\\u0430\",\n    \"adblgrave\": \"\\u0201\",\n    \"addakgurmukhi\": \"\\u0a71\",\n    \"adeva\": \"\\u0905\",\n    \"adieresis\": \"\\u00e4\",\n    \"adieresiscyrillic\": \"\\u04d3\",\n    \"adieresismacron\": \"\\u01df\",\n    \"adotbelow\": \"\\u1ea1\",\n    \"adotmacron\": \"\\u01e1\",\n    \"ae\": \"\\u00e6\",\n    \"aeacute\": \"\\u01fd\",\n    
\"aekorean\": \"\\u3150\",\n    \"aemacron\": \"\\u01e3\",\n    \"afii00208\": \"\\u2015\",\n    \"afii08941\": \"\\u20a4\",\n    \"afii10017\": \"\\u0410\",\n    \"afii10018\": \"\\u0411\",\n    \"afii10019\": \"\\u0412\",\n    \"afii10020\": \"\\u0413\",\n    \"afii10021\": \"\\u0414\",\n    \"afii10022\": \"\\u0415\",\n    \"afii10023\": \"\\u0401\",\n    \"afii10024\": \"\\u0416\",\n    \"afii10025\": \"\\u0417\",\n    \"afii10026\": \"\\u0418\",\n    \"afii10027\": \"\\u0419\",\n    \"afii10028\": \"\\u041a\",\n    \"afii10029\": \"\\u041b\",\n    \"afii10030\": \"\\u041c\",\n    \"afii10031\": \"\\u041d\",\n    \"afii10032\": \"\\u041e\",\n    \"afii10033\": \"\\u041f\",\n    \"afii10034\": \"\\u0420\",\n    \"afii10035\": \"\\u0421\",\n    \"afii10036\": \"\\u0422\",\n    \"afii10037\": \"\\u0423\",\n    \"afii10038\": \"\\u0424\",\n    \"afii10039\": \"\\u0425\",\n    \"afii10040\": \"\\u0426\",\n    \"afii10041\": \"\\u0427\",\n    \"afii10042\": \"\\u0428\",\n    \"afii10043\": \"\\u0429\",\n    \"afii10044\": \"\\u042a\",\n    \"afii10045\": \"\\u042b\",\n    \"afii10046\": \"\\u042c\",\n    \"afii10047\": \"\\u042d\",\n    \"afii10048\": \"\\u042e\",\n    \"afii10049\": \"\\u042f\",\n    \"afii10050\": \"\\u0490\",\n    \"afii10051\": \"\\u0402\",\n    \"afii10052\": \"\\u0403\",\n    \"afii10053\": \"\\u0404\",\n    \"afii10054\": \"\\u0405\",\n    \"afii10055\": \"\\u0406\",\n    \"afii10056\": \"\\u0407\",\n    \"afii10057\": \"\\u0408\",\n    \"afii10058\": \"\\u0409\",\n    \"afii10059\": \"\\u040a\",\n    \"afii10060\": \"\\u040b\",\n    \"afii10061\": \"\\u040c\",\n    \"afii10062\": \"\\u040e\",\n    \"afii10063\": \"\\uf6c4\",\n    \"afii10064\": \"\\uf6c5\",\n    \"afii10065\": \"\\u0430\",\n    \"afii10066\": \"\\u0431\",\n    \"afii10067\": \"\\u0432\",\n    \"afii10068\": \"\\u0433\",\n    \"afii10069\": \"\\u0434\",\n    \"afii10070\": \"\\u0435\",\n    \"afii10071\": \"\\u0451\",\n    \"afii10072\": \"\\u0436\",\n    \"afii10073\": 
\"\\u0437\",\n    \"afii10074\": \"\\u0438\",\n    \"afii10075\": \"\\u0439\",\n    \"afii10076\": \"\\u043a\",\n    \"afii10077\": \"\\u043b\",\n    \"afii10078\": \"\\u043c\",\n    \"afii10079\": \"\\u043d\",\n    \"afii10080\": \"\\u043e\",\n    \"afii10081\": \"\\u043f\",\n    \"afii10082\": \"\\u0440\",\n    \"afii10083\": \"\\u0441\",\n    \"afii10084\": \"\\u0442\",\n    \"afii10085\": \"\\u0443\",\n    \"afii10086\": \"\\u0444\",\n    \"afii10087\": \"\\u0445\",\n    \"afii10088\": \"\\u0446\",\n    \"afii10089\": \"\\u0447\",\n    \"afii10090\": \"\\u0448\",\n    \"afii10091\": \"\\u0449\",\n    \"afii10092\": \"\\u044a\",\n    \"afii10093\": \"\\u044b\",\n    \"afii10094\": \"\\u044c\",\n    \"afii10095\": \"\\u044d\",\n    \"afii10096\": \"\\u044e\",\n    \"afii10097\": \"\\u044f\",\n    \"afii10098\": \"\\u0491\",\n    \"afii10099\": \"\\u0452\",\n    \"afii10100\": \"\\u0453\",\n    \"afii10101\": \"\\u0454\",\n    \"afii10102\": \"\\u0455\",\n    \"afii10103\": \"\\u0456\",\n    \"afii10104\": \"\\u0457\",\n    \"afii10105\": \"\\u0458\",\n    \"afii10106\": \"\\u0459\",\n    \"afii10107\": \"\\u045a\",\n    \"afii10108\": \"\\u045b\",\n    \"afii10109\": \"\\u045c\",\n    \"afii10110\": \"\\u045e\",\n    \"afii10145\": \"\\u040f\",\n    \"afii10146\": \"\\u0462\",\n    \"afii10147\": \"\\u0472\",\n    \"afii10148\": \"\\u0474\",\n    \"afii10192\": \"\\uf6c6\",\n    \"afii10193\": \"\\u045f\",\n    \"afii10194\": \"\\u0463\",\n    \"afii10195\": \"\\u0473\",\n    \"afii10196\": \"\\u0475\",\n    \"afii10831\": \"\\uf6c7\",\n    \"afii10832\": \"\\uf6c8\",\n    \"afii10846\": \"\\u04d9\",\n    \"afii299\": \"\\u200e\",\n    \"afii300\": \"\\u200f\",\n    \"afii301\": \"\\u200d\",\n    \"afii57381\": \"\\u066a\",\n    \"afii57388\": \"\\u060c\",\n    \"afii57392\": \"\\u0660\",\n    \"afii57393\": \"\\u0661\",\n    \"afii57394\": \"\\u0662\",\n    \"afii57395\": \"\\u0663\",\n    \"afii57396\": \"\\u0664\",\n    \"afii57397\": \"\\u0665\",\n    
\"afii57398\": \"\\u0666\",\n    \"afii57399\": \"\\u0667\",\n    \"afii57400\": \"\\u0668\",\n    \"afii57401\": \"\\u0669\",\n    \"afii57403\": \"\\u061b\",\n    \"afii57407\": \"\\u061f\",\n    \"afii57409\": \"\\u0621\",\n    \"afii57410\": \"\\u0622\",\n    \"afii57411\": \"\\u0623\",\n    \"afii57412\": \"\\u0624\",\n    \"afii57413\": \"\\u0625\",\n    \"afii57414\": \"\\u0626\",\n    \"afii57415\": \"\\u0627\",\n    \"afii57416\": \"\\u0628\",\n    \"afii57417\": \"\\u0629\",\n    \"afii57418\": \"\\u062a\",\n    \"afii57419\": \"\\u062b\",\n    \"afii57420\": \"\\u062c\",\n    \"afii57421\": \"\\u062d\",\n    \"afii57422\": \"\\u062e\",\n    \"afii57423\": \"\\u062f\",\n    \"afii57424\": \"\\u0630\",\n    \"afii57425\": \"\\u0631\",\n    \"afii57426\": \"\\u0632\",\n    \"afii57427\": \"\\u0633\",\n    \"afii57428\": \"\\u0634\",\n    \"afii57429\": \"\\u0635\",\n    \"afii57430\": \"\\u0636\",\n    \"afii57431\": \"\\u0637\",\n    \"afii57432\": \"\\u0638\",\n    \"afii57433\": \"\\u0639\",\n    \"afii57434\": \"\\u063a\",\n    \"afii57440\": \"\\u0640\",\n    \"afii57441\": \"\\u0641\",\n    \"afii57442\": \"\\u0642\",\n    \"afii57443\": \"\\u0643\",\n    \"afii57444\": \"\\u0644\",\n    \"afii57445\": \"\\u0645\",\n    \"afii57446\": \"\\u0646\",\n    \"afii57448\": \"\\u0648\",\n    \"afii57449\": \"\\u0649\",\n    \"afii57450\": \"\\u064a\",\n    \"afii57451\": \"\\u064b\",\n    \"afii57452\": \"\\u064c\",\n    \"afii57453\": \"\\u064d\",\n    \"afii57454\": \"\\u064e\",\n    \"afii57455\": \"\\u064f\",\n    \"afii57456\": \"\\u0650\",\n    \"afii57457\": \"\\u0651\",\n    \"afii57458\": \"\\u0652\",\n    \"afii57470\": \"\\u0647\",\n    \"afii57505\": \"\\u06a4\",\n    \"afii57506\": \"\\u067e\",\n    \"afii57507\": \"\\u0686\",\n    \"afii57508\": \"\\u0698\",\n    \"afii57509\": \"\\u06af\",\n    \"afii57511\": \"\\u0679\",\n    \"afii57512\": \"\\u0688\",\n    \"afii57513\": \"\\u0691\",\n    \"afii57514\": \"\\u06ba\",\n    \"afii57519\": 
\"\\u06d2\",\n    \"afii57534\": \"\\u06d5\",\n    \"afii57636\": \"\\u20aa\",\n    \"afii57645\": \"\\u05be\",\n    \"afii57658\": \"\\u05c3\",\n    \"afii57664\": \"\\u05d0\",\n    \"afii57665\": \"\\u05d1\",\n    \"afii57666\": \"\\u05d2\",\n    \"afii57667\": \"\\u05d3\",\n    \"afii57668\": \"\\u05d4\",\n    \"afii57669\": \"\\u05d5\",\n    \"afii57670\": \"\\u05d6\",\n    \"afii57671\": \"\\u05d7\",\n    \"afii57672\": \"\\u05d8\",\n    \"afii57673\": \"\\u05d9\",\n    \"afii57674\": \"\\u05da\",\n    \"afii57675\": \"\\u05db\",\n    \"afii57676\": \"\\u05dc\",\n    \"afii57677\": \"\\u05dd\",\n    \"afii57678\": \"\\u05de\",\n    \"afii57679\": \"\\u05df\",\n    \"afii57680\": \"\\u05e0\",\n    \"afii57681\": \"\\u05e1\",\n    \"afii57682\": \"\\u05e2\",\n    \"afii57683\": \"\\u05e3\",\n    \"afii57684\": \"\\u05e4\",\n    \"afii57685\": \"\\u05e5\",\n    \"afii57686\": \"\\u05e6\",\n    \"afii57687\": \"\\u05e7\",\n    \"afii57688\": \"\\u05e8\",\n    \"afii57689\": \"\\u05e9\",\n    \"afii57690\": \"\\u05ea\",\n    \"afii57694\": \"\\ufb2a\",\n    \"afii57695\": \"\\ufb2b\",\n    \"afii57700\": \"\\ufb4b\",\n    \"afii57705\": \"\\ufb1f\",\n    \"afii57716\": \"\\u05f0\",\n    \"afii57717\": \"\\u05f1\",\n    \"afii57718\": \"\\u05f2\",\n    \"afii57723\": \"\\ufb35\",\n    \"afii57793\": \"\\u05b4\",\n    \"afii57794\": \"\\u05b5\",\n    \"afii57795\": \"\\u05b6\",\n    \"afii57796\": \"\\u05bb\",\n    \"afii57797\": \"\\u05b8\",\n    \"afii57798\": \"\\u05b7\",\n    \"afii57799\": \"\\u05b0\",\n    \"afii57800\": \"\\u05b2\",\n    \"afii57801\": \"\\u05b1\",\n    \"afii57802\": \"\\u05b3\",\n    \"afii57803\": \"\\u05c2\",\n    \"afii57804\": \"\\u05c1\",\n    \"afii57806\": \"\\u05b9\",\n    \"afii57807\": \"\\u05bc\",\n    \"afii57839\": \"\\u05bd\",\n    \"afii57841\": \"\\u05bf\",\n    \"afii57842\": \"\\u05c0\",\n    \"afii57929\": \"\\u02bc\",\n    \"afii61248\": \"\\u2105\",\n    \"afii61289\": \"\\u2113\",\n    \"afii61352\": \"\\u2116\",\n    
\"afii61573\": \"\\u202c\",\n    \"afii61574\": \"\\u202d\",\n    \"afii61575\": \"\\u202e\",\n    \"afii61664\": \"\\u200c\",\n    \"afii63167\": \"\\u066d\",\n    \"afii64937\": \"\\u02bd\",\n    \"agrave\": \"\\u00e0\",\n    \"agujarati\": \"\\u0a85\",\n    \"agurmukhi\": \"\\u0a05\",\n    \"ahiragana\": \"\\u3042\",\n    \"ahookabove\": \"\\u1ea3\",\n    \"aibengali\": \"\\u0990\",\n    \"aibopomofo\": \"\\u311e\",\n    \"aideva\": \"\\u0910\",\n    \"aiecyrillic\": \"\\u04d5\",\n    \"aigujarati\": \"\\u0a90\",\n    \"aigurmukhi\": \"\\u0a10\",\n    \"aimatragurmukhi\": \"\\u0a48\",\n    \"ainarabic\": \"\\u0639\",\n    \"ainfinalarabic\": \"\\ufeca\",\n    \"aininitialarabic\": \"\\ufecb\",\n    \"ainmedialarabic\": \"\\ufecc\",\n    \"ainvertedbreve\": \"\\u0203\",\n    \"aivowelsignbengali\": \"\\u09c8\",\n    \"aivowelsigndeva\": \"\\u0948\",\n    \"aivowelsigngujarati\": \"\\u0ac8\",\n    \"akatakana\": \"\\u30a2\",\n    \"akatakanahalfwidth\": \"\\uff71\",\n    \"akorean\": \"\\u314f\",\n    \"alef\": \"\\u05d0\",\n    \"alefarabic\": \"\\u0627\",\n    \"alefdageshhebrew\": \"\\ufb30\",\n    \"aleffinalarabic\": \"\\ufe8e\",\n    \"alefhamzaabovearabic\": \"\\u0623\",\n    \"alefhamzaabovefinalarabic\": \"\\ufe84\",\n    \"alefhamzabelowarabic\": \"\\u0625\",\n    \"alefhamzabelowfinalarabic\": \"\\ufe88\",\n    \"alefhebrew\": \"\\u05d0\",\n    \"aleflamedhebrew\": \"\\ufb4f\",\n    \"alefmaddaabovearabic\": \"\\u0622\",\n    \"alefmaddaabovefinalarabic\": \"\\ufe82\",\n    \"alefmaksuraarabic\": \"\\u0649\",\n    \"alefmaksurafinalarabic\": \"\\ufef0\",\n    \"alefmaksurainitialarabic\": \"\\ufef3\",\n    \"alefmaksuramedialarabic\": \"\\ufef4\",\n    \"alefpatahhebrew\": \"\\ufb2e\",\n    \"alefqamatshebrew\": \"\\ufb2f\",\n    \"aleph\": \"\\u2135\",\n    \"allequal\": \"\\u224c\",\n    \"alpha\": \"\\u03b1\",\n    \"alphatonos\": \"\\u03ac\",\n    \"amacron\": \"\\u0101\",\n    \"amonospace\": \"\\uff41\",\n    \"ampersand\": \"\\u0026\",\n    
\"ampersandmonospace\": \"\\uff06\",\n    \"ampersandsmall\": \"\\uf726\",\n    \"amsquare\": \"\\u33c2\",\n    \"anbopomofo\": \"\\u3122\",\n    \"angbopomofo\": \"\\u3124\",\n    \"angkhankhuthai\": \"\\u0e5a\",\n    \"angle\": \"\\u2220\",\n    \"anglebracketleft\": \"\\u3008\",\n    \"anglebracketleftvertical\": \"\\ufe3f\",\n    \"anglebracketright\": \"\\u3009\",\n    \"anglebracketrightvertical\": \"\\ufe40\",\n    \"angleleft\": \"\\u2329\",\n    \"angleright\": \"\\u232a\",\n    \"angstrom\": \"\\u212b\",\n    \"anoteleia\": \"\\u0387\",\n    \"anudattadeva\": \"\\u0952\",\n    \"anusvarabengali\": \"\\u0982\",\n    \"anusvaradeva\": \"\\u0902\",\n    \"anusvaragujarati\": \"\\u0a82\",\n    \"aogonek\": \"\\u0105\",\n    \"apaatosquare\": \"\\u3300\",\n    \"aparen\": \"\\u249c\",\n    \"apostrophearmenian\": \"\\u055a\",\n    \"apostrophemod\": \"\\u02bc\",\n    \"apple\": \"\\uf8ff\",\n    \"approaches\": \"\\u2250\",\n    \"approxequal\": \"\\u2248\",\n    \"approxequalorimage\": \"\\u2252\",\n    \"approximatelyequal\": \"\\u2245\",\n    \"araeaekorean\": \"\\u318e\",\n    \"araeakorean\": \"\\u318d\",\n    \"arc\": \"\\u2312\",\n    \"arighthalfring\": \"\\u1e9a\",\n    \"aring\": \"\\u00e5\",\n    \"aringacute\": \"\\u01fb\",\n    \"aringbelow\": \"\\u1e01\",\n    \"arrowboth\": \"\\u2194\",\n    \"arrowdashdown\": \"\\u21e3\",\n    \"arrowdashleft\": \"\\u21e0\",\n    \"arrowdashright\": \"\\u21e2\",\n    \"arrowdashup\": \"\\u21e1\",\n    \"arrowdblboth\": \"\\u21d4\",\n    \"arrowdbldown\": \"\\u21d3\",\n    \"arrowdblleft\": \"\\u21d0\",\n    \"arrowdblright\": \"\\u21d2\",\n    \"arrowdblup\": \"\\u21d1\",\n    \"arrowdown\": \"\\u2193\",\n    \"arrowdownleft\": \"\\u2199\",\n    \"arrowdownright\": \"\\u2198\",\n    \"arrowdownwhite\": \"\\u21e9\",\n    \"arrowheaddownmod\": \"\\u02c5\",\n    \"arrowheadleftmod\": \"\\u02c2\",\n    \"arrowheadrightmod\": \"\\u02c3\",\n    \"arrowheadupmod\": \"\\u02c4\",\n    \"arrowhorizex\": \"\\uf8e7\",\n    
\"arrowleft\": \"\\u2190\",\n    \"arrowleftdbl\": \"\\u21d0\",\n    \"arrowleftdblstroke\": \"\\u21cd\",\n    \"arrowleftoverright\": \"\\u21c6\",\n    \"arrowleftwhite\": \"\\u21e6\",\n    \"arrowright\": \"\\u2192\",\n    \"arrowrightdblstroke\": \"\\u21cf\",\n    \"arrowrightheavy\": \"\\u279e\",\n    \"arrowrightoverleft\": \"\\u21c4\",\n    \"arrowrightwhite\": \"\\u21e8\",\n    \"arrowtableft\": \"\\u21e4\",\n    \"arrowtabright\": \"\\u21e5\",\n    \"arrowup\": \"\\u2191\",\n    \"arrowupdn\": \"\\u2195\",\n    \"arrowupdnbse\": \"\\u21a8\",\n    \"arrowupdownbase\": \"\\u21a8\",\n    \"arrowupleft\": \"\\u2196\",\n    \"arrowupleftofdown\": \"\\u21c5\",\n    \"arrowupright\": \"\\u2197\",\n    \"arrowupwhite\": \"\\u21e7\",\n    \"arrowvertex\": \"\\uf8e6\",\n    \"asciicircum\": \"\\u005e\",\n    \"asciicircummonospace\": \"\\uff3e\",\n    \"asciitilde\": \"\\u007e\",\n    \"asciitildemonospace\": \"\\uff5e\",\n    \"ascript\": \"\\u0251\",\n    \"ascriptturned\": \"\\u0252\",\n    \"asmallhiragana\": \"\\u3041\",\n    \"asmallkatakana\": \"\\u30a1\",\n    \"asmallkatakanahalfwidth\": \"\\uff67\",\n    \"asterisk\": \"\\u002a\",\n    \"asteriskaltonearabic\": \"\\u066d\",\n    \"asteriskarabic\": \"\\u066d\",\n    \"asteriskmath\": \"\\u2217\",\n    \"asteriskmonospace\": \"\\uff0a\",\n    \"asterisksmall\": \"\\ufe61\",\n    \"asterism\": \"\\u2042\",\n    \"asuperior\": \"\\uf6e9\",\n    \"asymptoticallyequal\": \"\\u2243\",\n    \"at\": \"\\u0040\",\n    \"atilde\": \"\\u00e3\",\n    \"atmonospace\": \"\\uff20\",\n    \"atsmall\": \"\\ufe6b\",\n    \"aturned\": \"\\u0250\",\n    \"aubengali\": \"\\u0994\",\n    \"aubopomofo\": \"\\u3120\",\n    \"audeva\": \"\\u0914\",\n    \"augujarati\": \"\\u0a94\",\n    \"augurmukhi\": \"\\u0a14\",\n    \"aulengthmarkbengali\": \"\\u09d7\",\n    \"aumatragurmukhi\": \"\\u0a4c\",\n    \"auvowelsignbengali\": \"\\u09cc\",\n    \"auvowelsigndeva\": \"\\u094c\",\n    \"auvowelsigngujarati\": \"\\u0acc\",\n    
\"avagrahadeva\": \"\\u093d\",\n    \"aybarmenian\": \"\\u0561\",\n    \"ayin\": \"\\u05e2\",\n    \"ayinaltonehebrew\": \"\\ufb20\",\n    \"ayinhebrew\": \"\\u05e2\",\n    \"b\": \"\\u0062\",\n    \"babengali\": \"\\u09ac\",\n    \"backslash\": \"\\u005c\",\n    \"backslashmonospace\": \"\\uff3c\",\n    \"badeva\": \"\\u092c\",\n    \"bagujarati\": \"\\u0aac\",\n    \"bagurmukhi\": \"\\u0a2c\",\n    \"bahiragana\": \"\\u3070\",\n    \"bahtthai\": \"\\u0e3f\",\n    \"bakatakana\": \"\\u30d0\",\n    \"bar\": \"\\u007c\",\n    \"barmonospace\": \"\\uff5c\",\n    \"bbopomofo\": \"\\u3105\",\n    \"bcircle\": \"\\u24d1\",\n    \"bdotaccent\": \"\\u1e03\",\n    \"bdotbelow\": \"\\u1e05\",\n    \"beamedsixteenthnotes\": \"\\u266c\",\n    \"because\": \"\\u2235\",\n    \"becyrillic\": \"\\u0431\",\n    \"beharabic\": \"\\u0628\",\n    \"behfinalarabic\": \"\\ufe90\",\n    \"behinitialarabic\": \"\\ufe91\",\n    \"behiragana\": \"\\u3079\",\n    \"behmedialarabic\": \"\\ufe92\",\n    \"behmeeminitialarabic\": \"\\ufc9f\",\n    \"behmeemisolatedarabic\": \"\\ufc08\",\n    \"behnoonfinalarabic\": \"\\ufc6d\",\n    \"bekatakana\": \"\\u30d9\",\n    \"benarmenian\": \"\\u0562\",\n    \"bet\": \"\\u05d1\",\n    \"beta\": \"\\u03b2\",\n    \"betasymbolgreek\": \"\\u03d0\",\n    \"betdagesh\": \"\\ufb31\",\n    \"betdageshhebrew\": \"\\ufb31\",\n    \"bethebrew\": \"\\u05d1\",\n    \"betrafehebrew\": \"\\ufb4c\",\n    \"bhabengali\": \"\\u09ad\",\n    \"bhadeva\": \"\\u092d\",\n    \"bhagujarati\": \"\\u0aad\",\n    \"bhagurmukhi\": \"\\u0a2d\",\n    \"bhook\": \"\\u0253\",\n    \"bihiragana\": \"\\u3073\",\n    \"bikatakana\": \"\\u30d3\",\n    \"bilabialclick\": \"\\u0298\",\n    \"bindigurmukhi\": \"\\u0a02\",\n    \"birusquare\": \"\\u3331\",\n    \"blackcircle\": \"\\u25cf\",\n    \"blackdiamond\": \"\\u25c6\",\n    \"blackdownpointingtriangle\": \"\\u25bc\",\n    \"blackleftpointingpointer\": \"\\u25c4\",\n    \"blackleftpointingtriangle\": \"\\u25c0\",\n    
\"blacklenticularbracketleft\": \"\\u3010\",\n    \"blacklenticularbracketleftvertical\": \"\\ufe3b\",\n    \"blacklenticularbracketright\": \"\\u3011\",\n    \"blacklenticularbracketrightvertical\": \"\\ufe3c\",\n    \"blacklowerlefttriangle\": \"\\u25e3\",\n    \"blacklowerrighttriangle\": \"\\u25e2\",\n    \"blackrectangle\": \"\\u25ac\",\n    \"blackrightpointingpointer\": \"\\u25ba\",\n    \"blackrightpointingtriangle\": \"\\u25b6\",\n    \"blacksmallsquare\": \"\\u25aa\",\n    \"blacksmilingface\": \"\\u263b\",\n    \"blacksquare\": \"\\u25a0\",\n    \"blackstar\": \"\\u2605\",\n    \"blackupperlefttriangle\": \"\\u25e4\",\n    \"blackupperrighttriangle\": \"\\u25e5\",\n    \"blackuppointingsmalltriangle\": \"\\u25b4\",\n    \"blackuppointingtriangle\": \"\\u25b2\",\n    \"blank\": \"\\u2423\",\n    \"blinebelow\": \"\\u1e07\",\n    \"block\": \"\\u2588\",\n    \"bmonospace\": \"\\uff42\",\n    \"bobaimaithai\": \"\\u0e1a\",\n    \"bohiragana\": \"\\u307c\",\n    \"bokatakana\": \"\\u30dc\",\n    \"bparen\": \"\\u249d\",\n    \"bqsquare\": \"\\u33c3\",\n    \"braceex\": \"\\uf8f4\",\n    \"braceleft\": \"\\u007b\",\n    \"braceleftbt\": \"\\uf8f3\",\n    \"braceleftmid\": \"\\uf8f2\",\n    \"braceleftmonospace\": \"\\uff5b\",\n    \"braceleftsmall\": \"\\ufe5b\",\n    \"bracelefttp\": \"\\uf8f1\",\n    \"braceleftvertical\": \"\\ufe37\",\n    \"braceright\": \"\\u007d\",\n    \"bracerightbt\": \"\\uf8fe\",\n    \"bracerightmid\": \"\\uf8fd\",\n    \"bracerightmonospace\": \"\\uff5d\",\n    \"bracerightsmall\": \"\\ufe5c\",\n    \"bracerighttp\": \"\\uf8fc\",\n    \"bracerightvertical\": \"\\ufe38\",\n    \"bracketleft\": \"\\u005b\",\n    \"bracketleftbt\": \"\\uf8f0\",\n    \"bracketleftex\": \"\\uf8ef\",\n    \"bracketleftmonospace\": \"\\uff3b\",\n    \"bracketlefttp\": \"\\uf8ee\",\n    \"bracketright\": \"\\u005d\",\n    \"bracketrightbt\": \"\\uf8fb\",\n    \"bracketrightex\": \"\\uf8fa\",\n    \"bracketrightmonospace\": \"\\uff3d\",\n    
\"bracketrighttp\": \"\\uf8f9\",\n    \"breve\": \"\\u02d8\",\n    \"brevebelowcmb\": \"\\u032e\",\n    \"brevecmb\": \"\\u0306\",\n    \"breveinvertedbelowcmb\": \"\\u032f\",\n    \"breveinvertedcmb\": \"\\u0311\",\n    \"breveinverteddoublecmb\": \"\\u0361\",\n    \"bridgebelowcmb\": \"\\u032a\",\n    \"bridgeinvertedbelowcmb\": \"\\u033a\",\n    \"brokenbar\": \"\\u00a6\",\n    \"bstroke\": \"\\u0180\",\n    \"bsuperior\": \"\\uf6ea\",\n    \"btopbar\": \"\\u0183\",\n    \"buhiragana\": \"\\u3076\",\n    \"bukatakana\": \"\\u30d6\",\n    \"bullet\": \"\\u2022\",\n    \"bulletinverse\": \"\\u25d8\",\n    \"bulletoperator\": \"\\u2219\",\n    \"bullseye\": \"\\u25ce\",\n    \"c\": \"\\u0063\",\n    \"caarmenian\": \"\\u056e\",\n    \"cabengali\": \"\\u099a\",\n    \"cacute\": \"\\u0107\",\n    \"cadeva\": \"\\u091a\",\n    \"cagujarati\": \"\\u0a9a\",\n    \"cagurmukhi\": \"\\u0a1a\",\n    \"calsquare\": \"\\u3388\",\n    \"candrabindubengali\": \"\\u0981\",\n    \"candrabinducmb\": \"\\u0310\",\n    \"candrabindudeva\": \"\\u0901\",\n    \"candrabindugujarati\": \"\\u0a81\",\n    \"capslock\": \"\\u21ea\",\n    \"careof\": \"\\u2105\",\n    \"caron\": \"\\u02c7\",\n    \"caronbelowcmb\": \"\\u032c\",\n    \"caroncmb\": \"\\u030c\",\n    \"carriagereturn\": \"\\u21b5\",\n    \"cbopomofo\": \"\\u3118\",\n    \"ccaron\": \"\\u010d\",\n    \"ccedilla\": \"\\u00e7\",\n    \"ccedillaacute\": \"\\u1e09\",\n    \"ccircle\": \"\\u24d2\",\n    \"ccircumflex\": \"\\u0109\",\n    \"ccurl\": \"\\u0255\",\n    \"cdot\": \"\\u010b\",\n    \"cdotaccent\": \"\\u010b\",\n    \"cdsquare\": \"\\u33c5\",\n    \"cedilla\": \"\\u00b8\",\n    \"cedillacmb\": \"\\u0327\",\n    \"cent\": \"\\u00a2\",\n    \"centigrade\": \"\\u2103\",\n    \"centinferior\": \"\\uf6df\",\n    \"centmonospace\": \"\\uffe0\",\n    \"centoldstyle\": \"\\uf7a2\",\n    \"centsuperior\": \"\\uf6e0\",\n    \"chaarmenian\": \"\\u0579\",\n    \"chabengali\": \"\\u099b\",\n    \"chadeva\": \"\\u091b\",\n    
\"chagujarati\": \"\\u0a9b\",\n    \"chagurmukhi\": \"\\u0a1b\",\n    \"chbopomofo\": \"\\u3114\",\n    \"cheabkhasiancyrillic\": \"\\u04bd\",\n    \"checkmark\": \"\\u2713\",\n    \"checyrillic\": \"\\u0447\",\n    \"chedescenderabkhasiancyrillic\": \"\\u04bf\",\n    \"chedescendercyrillic\": \"\\u04b7\",\n    \"chedieresiscyrillic\": \"\\u04f5\",\n    \"cheharmenian\": \"\\u0573\",\n    \"chekhakassiancyrillic\": \"\\u04cc\",\n    \"cheverticalstrokecyrillic\": \"\\u04b9\",\n    \"chi\": \"\\u03c7\",\n    \"chieuchacirclekorean\": \"\\u3277\",\n    \"chieuchaparenkorean\": \"\\u3217\",\n    \"chieuchcirclekorean\": \"\\u3269\",\n    \"chieuchkorean\": \"\\u314a\",\n    \"chieuchparenkorean\": \"\\u3209\",\n    \"chochangthai\": \"\\u0e0a\",\n    \"chochanthai\": \"\\u0e08\",\n    \"chochingthai\": \"\\u0e09\",\n    \"chochoethai\": \"\\u0e0c\",\n    \"chook\": \"\\u0188\",\n    \"cieucacirclekorean\": \"\\u3276\",\n    \"cieucaparenkorean\": \"\\u3216\",\n    \"cieuccirclekorean\": \"\\u3268\",\n    \"cieuckorean\": \"\\u3148\",\n    \"cieucparenkorean\": \"\\u3208\",\n    \"cieucuparenkorean\": \"\\u321c\",\n    \"circle\": \"\\u25cb\",\n    \"circlemultiply\": \"\\u2297\",\n    \"circleot\": \"\\u2299\",\n    \"circleplus\": \"\\u2295\",\n    \"circlepostalmark\": \"\\u3036\",\n    \"circlewithlefthalfblack\": \"\\u25d0\",\n    \"circlewithrighthalfblack\": \"\\u25d1\",\n    \"circumflex\": \"\\u02c6\",\n    \"circumflexbelowcmb\": \"\\u032d\",\n    \"circumflexcmb\": \"\\u0302\",\n    \"clear\": \"\\u2327\",\n    \"clickalveolar\": \"\\u01c2\",\n    \"clickdental\": \"\\u01c0\",\n    \"clicklateral\": \"\\u01c1\",\n    \"clickretroflex\": \"\\u01c3\",\n    \"club\": \"\\u2663\",\n    \"clubsuitblack\": \"\\u2663\",\n    \"clubsuitwhite\": \"\\u2667\",\n    \"cmcubedsquare\": \"\\u33a4\",\n    \"cmonospace\": \"\\uff43\",\n    \"cmsquaredsquare\": \"\\u33a0\",\n    \"coarmenian\": \"\\u0581\",\n    \"colon\": \"\\u003a\",\n    \"colonmonetary\": \"\\u20a1\",\n  
  \"colonmonospace\": \"\\uff1a\",\n    \"colonsign\": \"\\u20a1\",\n    \"colonsmall\": \"\\ufe55\",\n    \"colontriangularhalfmod\": \"\\u02d1\",\n    \"colontriangularmod\": \"\\u02d0\",\n    \"comma\": \"\\u002c\",\n    \"commaabovecmb\": \"\\u0313\",\n    \"commaaboverightcmb\": \"\\u0315\",\n    \"commaaccent\": \"\\uf6c3\",\n    \"commaarabic\": \"\\u060c\",\n    \"commaarmenian\": \"\\u055d\",\n    \"commainferior\": \"\\uf6e1\",\n    \"commamonospace\": \"\\uff0c\",\n    \"commareversedabovecmb\": \"\\u0314\",\n    \"commareversedmod\": \"\\u02bd\",\n    \"commasmall\": \"\\ufe50\",\n    \"commasuperior\": \"\\uf6e2\",\n    \"commaturnedabovecmb\": \"\\u0312\",\n    \"commaturnedmod\": \"\\u02bb\",\n    \"compass\": \"\\u263c\",\n    \"congruent\": \"\\u2245\",\n    \"contourintegral\": \"\\u222e\",\n    \"control\": \"\\u2303\",\n    \"controlACK\": \"\\u0006\",\n    \"controlBEL\": \"\\u0007\",\n    \"controlBS\": \"\\u0008\",\n    \"controlCAN\": \"\\u0018\",\n    \"controlCR\": \"\\u000d\",\n    \"controlDC1\": \"\\u0011\",\n    \"controlDC2\": \"\\u0012\",\n    \"controlDC3\": \"\\u0013\",\n    \"controlDC4\": \"\\u0014\",\n    \"controlDEL\": \"\\u007f\",\n    \"controlDLE\": \"\\u0010\",\n    \"controlEM\": \"\\u0019\",\n    \"controlENQ\": \"\\u0005\",\n    \"controlEOT\": \"\\u0004\",\n    \"controlESC\": \"\\u001b\",\n    \"controlETB\": \"\\u0017\",\n    \"controlETX\": \"\\u0003\",\n    \"controlFF\": \"\\u000c\",\n    \"controlFS\": \"\\u001c\",\n    \"controlGS\": \"\\u001d\",\n    \"controlHT\": \"\\u0009\",\n    \"controlLF\": \"\\u000a\",\n    \"controlNAK\": \"\\u0015\",\n    \"controlRS\": \"\\u001e\",\n    \"controlSI\": \"\\u000f\",\n    \"controlSO\": \"\\u000e\",\n    \"controlSOT\": \"\\u0002\",\n    \"controlSTX\": \"\\u0001\",\n    \"controlSUB\": \"\\u001a\",\n    \"controlSYN\": \"\\u0016\",\n    \"controlUS\": \"\\u001f\",\n    \"controlVT\": \"\\u000b\",\n    \"copyright\": \"\\u00a9\",\n    \"copyrightsans\": \"\\uf8e9\",\n   
 \"copyrightserif\": \"\\uf6d9\",\n    \"cornerbracketleft\": \"\\u300c\",\n    \"cornerbracketlefthalfwidth\": \"\\uff62\",\n    \"cornerbracketleftvertical\": \"\\ufe41\",\n    \"cornerbracketright\": \"\\u300d\",\n    \"cornerbracketrighthalfwidth\": \"\\uff63\",\n    \"cornerbracketrightvertical\": \"\\ufe42\",\n    \"corporationsquare\": \"\\u337f\",\n    \"cosquare\": \"\\u33c7\",\n    \"coverkgsquare\": \"\\u33c6\",\n    \"cparen\": \"\\u249e\",\n    \"cruzeiro\": \"\\u20a2\",\n    \"cstretched\": \"\\u0297\",\n    \"curlyand\": \"\\u22cf\",\n    \"curlyor\": \"\\u22ce\",\n    \"currency\": \"\\u00a4\",\n    \"cyrBreve\": \"\\uf6d1\",\n    \"cyrFlex\": \"\\uf6d2\",\n    \"cyrbreve\": \"\\uf6d4\",\n    \"cyrflex\": \"\\uf6d5\",\n    \"d\": \"\\u0064\",\n    \"daarmenian\": \"\\u0564\",\n    \"dabengali\": \"\\u09a6\",\n    \"dadarabic\": \"\\u0636\",\n    \"dadeva\": \"\\u0926\",\n    \"dadfinalarabic\": \"\\ufebe\",\n    \"dadinitialarabic\": \"\\ufebf\",\n    \"dadmedialarabic\": \"\\ufec0\",\n    \"dagesh\": \"\\u05bc\",\n    \"dageshhebrew\": \"\\u05bc\",\n    \"dagger\": \"\\u2020\",\n    \"daggerdbl\": \"\\u2021\",\n    \"dagujarati\": \"\\u0aa6\",\n    \"dagurmukhi\": \"\\u0a26\",\n    \"dahiragana\": \"\\u3060\",\n    \"dakatakana\": \"\\u30c0\",\n    \"dalarabic\": \"\\u062f\",\n    \"dalet\": \"\\u05d3\",\n    \"daletdagesh\": \"\\ufb33\",\n    \"daletdageshhebrew\": \"\\ufb33\",\n    \"dalethatafpatah\": \"\\u05d3\\u05b2\",\n    \"dalethatafpatahhebrew\": \"\\u05d3\\u05b2\",\n    \"dalethatafsegol\": \"\\u05d3\\u05b1\",\n    \"dalethatafsegolhebrew\": \"\\u05d3\\u05b1\",\n    \"dalethebrew\": \"\\u05d3\",\n    \"dalethiriq\": \"\\u05d3\\u05b4\",\n    \"dalethiriqhebrew\": \"\\u05d3\\u05b4\",\n    \"daletholam\": \"\\u05d3\\u05b9\",\n    \"daletholamhebrew\": \"\\u05d3\\u05b9\",\n    \"daletpatah\": \"\\u05d3\\u05b7\",\n    \"daletpatahhebrew\": \"\\u05d3\\u05b7\",\n    \"daletqamats\": \"\\u05d3\\u05b8\",\n    \"daletqamatshebrew\": 
\"\\u05d3\\u05b8\",\n    \"daletqubuts\": \"\\u05d3\\u05bb\",\n    \"daletqubutshebrew\": \"\\u05d3\\u05bb\",\n    \"daletsegol\": \"\\u05d3\\u05b6\",\n    \"daletsegolhebrew\": \"\\u05d3\\u05b6\",\n    \"daletsheva\": \"\\u05d3\\u05b0\",\n    \"daletshevahebrew\": \"\\u05d3\\u05b0\",\n    \"dalettsere\": \"\\u05d3\\u05b5\",\n    \"dalettserehebrew\": \"\\u05d3\\u05b5\",\n    \"dalfinalarabic\": \"\\ufeaa\",\n    \"dammaarabic\": \"\\u064f\",\n    \"dammalowarabic\": \"\\u064f\",\n    \"dammatanaltonearabic\": \"\\u064c\",\n    \"dammatanarabic\": \"\\u064c\",\n    \"danda\": \"\\u0964\",\n    \"dargahebrew\": \"\\u05a7\",\n    \"dargalefthebrew\": \"\\u05a7\",\n    \"dasiapneumatacyrilliccmb\": \"\\u0485\",\n    \"dblGrave\": \"\\uf6d3\",\n    \"dblanglebracketleft\": \"\\u300a\",\n    \"dblanglebracketleftvertical\": \"\\ufe3d\",\n    \"dblanglebracketright\": \"\\u300b\",\n    \"dblanglebracketrightvertical\": \"\\ufe3e\",\n    \"dblarchinvertedbelowcmb\": \"\\u032b\",\n    \"dblarrowleft\": \"\\u21d4\",\n    \"dblarrowright\": \"\\u21d2\",\n    \"dbldanda\": \"\\u0965\",\n    \"dblgrave\": \"\\uf6d6\",\n    \"dblgravecmb\": \"\\u030f\",\n    \"dblintegral\": \"\\u222c\",\n    \"dbllowline\": \"\\u2017\",\n    \"dbllowlinecmb\": \"\\u0333\",\n    \"dbloverlinecmb\": \"\\u033f\",\n    \"dblprimemod\": \"\\u02ba\",\n    \"dblverticalbar\": \"\\u2016\",\n    \"dblverticallineabovecmb\": \"\\u030e\",\n    \"dbopomofo\": \"\\u3109\",\n    \"dbsquare\": \"\\u33c8\",\n    \"dcaron\": \"\\u010f\",\n    \"dcedilla\": \"\\u1e11\",\n    \"dcircle\": \"\\u24d3\",\n    \"dcircumflexbelow\": \"\\u1e13\",\n    \"dcroat\": \"\\u0111\",\n    \"ddabengali\": \"\\u09a1\",\n    \"ddadeva\": \"\\u0921\",\n    \"ddagujarati\": \"\\u0aa1\",\n    \"ddagurmukhi\": \"\\u0a21\",\n    \"ddalarabic\": \"\\u0688\",\n    \"ddalfinalarabic\": \"\\ufb89\",\n    \"dddhadeva\": \"\\u095c\",\n    \"ddhabengali\": \"\\u09a2\",\n    \"ddhadeva\": \"\\u0922\",\n    \"ddhagujarati\": \"\\u0aa2\",\n    
\"ddhagurmukhi\": \"\\u0a22\",\n    \"ddotaccent\": \"\\u1e0b\",\n    \"ddotbelow\": \"\\u1e0d\",\n    \"decimalseparatorarabic\": \"\\u066b\",\n    \"decimalseparatorpersian\": \"\\u066b\",\n    \"decyrillic\": \"\\u0434\",\n    \"degree\": \"\\u00b0\",\n    \"dehihebrew\": \"\\u05ad\",\n    \"dehiragana\": \"\\u3067\",\n    \"deicoptic\": \"\\u03ef\",\n    \"dekatakana\": \"\\u30c7\",\n    \"deleteleft\": \"\\u232b\",\n    \"deleteright\": \"\\u2326\",\n    \"delta\": \"\\u03b4\",\n    \"deltaturned\": \"\\u018d\",\n    \"denominatorminusonenumeratorbengali\": \"\\u09f8\",\n    \"dezh\": \"\\u02a4\",\n    \"dhabengali\": \"\\u09a7\",\n    \"dhadeva\": \"\\u0927\",\n    \"dhagujarati\": \"\\u0aa7\",\n    \"dhagurmukhi\": \"\\u0a27\",\n    \"dhook\": \"\\u0257\",\n    \"dialytikatonos\": \"\\u0385\",\n    \"dialytikatonoscmb\": \"\\u0344\",\n    \"diamond\": \"\\u2666\",\n    \"diamondsuitwhite\": \"\\u2662\",\n    \"dieresis\": \"\\u00a8\",\n    \"dieresisacute\": \"\\uf6d7\",\n    \"dieresisbelowcmb\": \"\\u0324\",\n    \"dieresiscmb\": \"\\u0308\",\n    \"dieresisgrave\": \"\\uf6d8\",\n    \"dieresistonos\": \"\\u0385\",\n    \"dihiragana\": \"\\u3062\",\n    \"dikatakana\": \"\\u30c2\",\n    \"dittomark\": \"\\u3003\",\n    \"divide\": \"\\u00f7\",\n    \"divides\": \"\\u2223\",\n    \"divisionslash\": \"\\u2215\",\n    \"djecyrillic\": \"\\u0452\",\n    \"dkshade\": \"\\u2593\",\n    \"dlinebelow\": \"\\u1e0f\",\n    \"dlsquare\": \"\\u3397\",\n    \"dmacron\": \"\\u0111\",\n    \"dmonospace\": \"\\uff44\",\n    \"dnblock\": \"\\u2584\",\n    \"dochadathai\": \"\\u0e0e\",\n    \"dodekthai\": \"\\u0e14\",\n    \"dohiragana\": \"\\u3069\",\n    \"dokatakana\": \"\\u30c9\",\n    \"dollar\": \"\\u0024\",\n    \"dollarinferior\": \"\\uf6e3\",\n    \"dollarmonospace\": \"\\uff04\",\n    \"dollaroldstyle\": \"\\uf724\",\n    \"dollarsmall\": \"\\ufe69\",\n    \"dollarsuperior\": \"\\uf6e4\",\n    \"dong\": \"\\u20ab\",\n    \"dorusquare\": \"\\u3326\",\n    
\"dotaccent\": \"\\u02d9\",\n    \"dotaccentcmb\": \"\\u0307\",\n    \"dotbelowcmb\": \"\\u0323\",\n    \"dotbelowcomb\": \"\\u0323\",\n    \"dotkatakana\": \"\\u30fb\",\n    \"dotlessi\": \"\\u0131\",\n    \"dotlessj\": \"\\uf6be\",\n    \"dotlessjstrokehook\": \"\\u0284\",\n    \"dotmath\": \"\\u22c5\",\n    \"dottedcircle\": \"\\u25cc\",\n    \"doubleyodpatah\": \"\\ufb1f\",\n    \"doubleyodpatahhebrew\": \"\\ufb1f\",\n    \"downtackbelowcmb\": \"\\u031e\",\n    \"downtackmod\": \"\\u02d5\",\n    \"dparen\": \"\\u249f\",\n    \"dsuperior\": \"\\uf6eb\",\n    \"dtail\": \"\\u0256\",\n    \"dtopbar\": \"\\u018c\",\n    \"duhiragana\": \"\\u3065\",\n    \"dukatakana\": \"\\u30c5\",\n    \"dz\": \"\\u01f3\",\n    \"dzaltone\": \"\\u02a3\",\n    \"dzcaron\": \"\\u01c6\",\n    \"dzcurl\": \"\\u02a5\",\n    \"dzeabkhasiancyrillic\": \"\\u04e1\",\n    \"dzecyrillic\": \"\\u0455\",\n    \"dzhecyrillic\": \"\\u045f\",\n    \"e\": \"\\u0065\",\n    \"eacute\": \"\\u00e9\",\n    \"earth\": \"\\u2641\",\n    \"ebengali\": \"\\u098f\",\n    \"ebopomofo\": \"\\u311c\",\n    \"ebreve\": \"\\u0115\",\n    \"ecandradeva\": \"\\u090d\",\n    \"ecandragujarati\": \"\\u0a8d\",\n    \"ecandravowelsigndeva\": \"\\u0945\",\n    \"ecandravowelsigngujarati\": \"\\u0ac5\",\n    \"ecaron\": \"\\u011b\",\n    \"ecedillabreve\": \"\\u1e1d\",\n    \"echarmenian\": \"\\u0565\",\n    \"echyiwnarmenian\": \"\\u0587\",\n    \"ecircle\": \"\\u24d4\",\n    \"ecircumflex\": \"\\u00ea\",\n    \"ecircumflexacute\": \"\\u1ebf\",\n    \"ecircumflexbelow\": \"\\u1e19\",\n    \"ecircumflexdotbelow\": \"\\u1ec7\",\n    \"ecircumflexgrave\": \"\\u1ec1\",\n    \"ecircumflexhookabove\": \"\\u1ec3\",\n    \"ecircumflextilde\": \"\\u1ec5\",\n    \"ecyrillic\": \"\\u0454\",\n    \"edblgrave\": \"\\u0205\",\n    \"edeva\": \"\\u090f\",\n    \"edieresis\": \"\\u00eb\",\n    \"edot\": \"\\u0117\",\n    \"edotaccent\": \"\\u0117\",\n    \"edotbelow\": \"\\u1eb9\",\n    \"eegurmukhi\": \"\\u0a0f\",\n    
\"eematragurmukhi\": \"\\u0a47\",\n    \"efcyrillic\": \"\\u0444\",\n    \"egrave\": \"\\u00e8\",\n    \"egujarati\": \"\\u0a8f\",\n    \"eharmenian\": \"\\u0567\",\n    \"ehbopomofo\": \"\\u311d\",\n    \"ehiragana\": \"\\u3048\",\n    \"ehookabove\": \"\\u1ebb\",\n    \"eibopomofo\": \"\\u311f\",\n    \"eight\": \"\\u0038\",\n    \"eightarabic\": \"\\u0668\",\n    \"eightbengali\": \"\\u09ee\",\n    \"eightcircle\": \"\\u2467\",\n    \"eightcircleinversesansserif\": \"\\u2791\",\n    \"eightdeva\": \"\\u096e\",\n    \"eighteencircle\": \"\\u2471\",\n    \"eighteenparen\": \"\\u2485\",\n    \"eighteenperiod\": \"\\u2499\",\n    \"eightgujarati\": \"\\u0aee\",\n    \"eightgurmukhi\": \"\\u0a6e\",\n    \"eighthackarabic\": \"\\u0668\",\n    \"eighthangzhou\": \"\\u3028\",\n    \"eighthnotebeamed\": \"\\u266b\",\n    \"eightideographicparen\": \"\\u3227\",\n    \"eightinferior\": \"\\u2088\",\n    \"eightmonospace\": \"\\uff18\",\n    \"eightoldstyle\": \"\\uf738\",\n    \"eightparen\": \"\\u247b\",\n    \"eightperiod\": \"\\u248f\",\n    \"eightpersian\": \"\\u06f8\",\n    \"eightroman\": \"\\u2177\",\n    \"eightsuperior\": \"\\u2078\",\n    \"eightthai\": \"\\u0e58\",\n    \"einvertedbreve\": \"\\u0207\",\n    \"eiotifiedcyrillic\": \"\\u0465\",\n    \"ekatakana\": \"\\u30a8\",\n    \"ekatakanahalfwidth\": \"\\uff74\",\n    \"ekonkargurmukhi\": \"\\u0a74\",\n    \"ekorean\": \"\\u3154\",\n    \"elcyrillic\": \"\\u043b\",\n    \"element\": \"\\u2208\",\n    \"elevencircle\": \"\\u246a\",\n    \"elevenparen\": \"\\u247e\",\n    \"elevenperiod\": \"\\u2492\",\n    \"elevenroman\": \"\\u217a\",\n    \"ellipsis\": \"\\u2026\",\n    \"ellipsisvertical\": \"\\u22ee\",\n    \"emacron\": \"\\u0113\",\n    \"emacronacute\": \"\\u1e17\",\n    \"emacrongrave\": \"\\u1e15\",\n    \"emcyrillic\": \"\\u043c\",\n    \"emdash\": \"\\u2014\",\n    \"emdashvertical\": \"\\ufe31\",\n    \"emonospace\": \"\\uff45\",\n    \"emphasismarkarmenian\": \"\\u055b\",\n    \"emptyset\": 
\"\\u2205\",\n    \"enbopomofo\": \"\\u3123\",\n    \"encyrillic\": \"\\u043d\",\n    \"endash\": \"\\u2013\",\n    \"endashvertical\": \"\\ufe32\",\n    \"endescendercyrillic\": \"\\u04a3\",\n    \"eng\": \"\\u014b\",\n    \"engbopomofo\": \"\\u3125\",\n    \"enghecyrillic\": \"\\u04a5\",\n    \"enhookcyrillic\": \"\\u04c8\",\n    \"enspace\": \"\\u2002\",\n    \"eogonek\": \"\\u0119\",\n    \"eokorean\": \"\\u3153\",\n    \"eopen\": \"\\u025b\",\n    \"eopenclosed\": \"\\u029a\",\n    \"eopenreversed\": \"\\u025c\",\n    \"eopenreversedclosed\": \"\\u025e\",\n    \"eopenreversedhook\": \"\\u025d\",\n    \"eparen\": \"\\u24a0\",\n    \"epsilon\": \"\\u03b5\",\n    \"epsilontonos\": \"\\u03ad\",\n    \"equal\": \"\\u003d\",\n    \"equalmonospace\": \"\\uff1d\",\n    \"equalsmall\": \"\\ufe66\",\n    \"equalsuperior\": \"\\u207c\",\n    \"equivalence\": \"\\u2261\",\n    \"erbopomofo\": \"\\u3126\",\n    \"ercyrillic\": \"\\u0440\",\n    \"ereversed\": \"\\u0258\",\n    \"ereversedcyrillic\": \"\\u044d\",\n    \"escyrillic\": \"\\u0441\",\n    \"esdescendercyrillic\": \"\\u04ab\",\n    \"esh\": \"\\u0283\",\n    \"eshcurl\": \"\\u0286\",\n    \"eshortdeva\": \"\\u090e\",\n    \"eshortvowelsigndeva\": \"\\u0946\",\n    \"eshreversedloop\": \"\\u01aa\",\n    \"eshsquatreversed\": \"\\u0285\",\n    \"esmallhiragana\": \"\\u3047\",\n    \"esmallkatakana\": \"\\u30a7\",\n    \"esmallkatakanahalfwidth\": \"\\uff6a\",\n    \"estimated\": \"\\u212e\",\n    \"esuperior\": \"\\uf6ec\",\n    \"eta\": \"\\u03b7\",\n    \"etarmenian\": \"\\u0568\",\n    \"etatonos\": \"\\u03ae\",\n    \"eth\": \"\\u00f0\",\n    \"etilde\": \"\\u1ebd\",\n    \"etildebelow\": \"\\u1e1b\",\n    \"etnahtafoukhhebrew\": \"\\u0591\",\n    \"etnahtafoukhlefthebrew\": \"\\u0591\",\n    \"etnahtahebrew\": \"\\u0591\",\n    \"etnahtalefthebrew\": \"\\u0591\",\n    \"eturned\": \"\\u01dd\",\n    \"eukorean\": \"\\u3161\",\n    \"euro\": \"\\u20ac\",\n    \"evowelsignbengali\": \"\\u09c7\",\n    
\"evowelsigndeva\": \"\\u0947\",\n    \"evowelsigngujarati\": \"\\u0ac7\",\n    \"exclam\": \"\\u0021\",\n    \"exclamarmenian\": \"\\u055c\",\n    \"exclamdbl\": \"\\u203c\",\n    \"exclamdown\": \"\\u00a1\",\n    \"exclamdownsmall\": \"\\uf7a1\",\n    \"exclammonospace\": \"\\uff01\",\n    \"exclamsmall\": \"\\uf721\",\n    \"existential\": \"\\u2203\",\n    \"ezh\": \"\\u0292\",\n    \"ezhcaron\": \"\\u01ef\",\n    \"ezhcurl\": \"\\u0293\",\n    \"ezhreversed\": \"\\u01b9\",\n    \"ezhtail\": \"\\u01ba\",\n    \"f\": \"\\u0066\",\n    \"fadeva\": \"\\u095e\",\n    \"fagurmukhi\": \"\\u0a5e\",\n    \"fahrenheit\": \"\\u2109\",\n    \"fathaarabic\": \"\\u064e\",\n    \"fathalowarabic\": \"\\u064e\",\n    \"fathatanarabic\": \"\\u064b\",\n    \"fbopomofo\": \"\\u3108\",\n    \"fcircle\": \"\\u24d5\",\n    \"fdotaccent\": \"\\u1e1f\",\n    \"feharabic\": \"\\u0641\",\n    \"feharmenian\": \"\\u0586\",\n    \"fehfinalarabic\": \"\\ufed2\",\n    \"fehinitialarabic\": \"\\ufed3\",\n    \"fehmedialarabic\": \"\\ufed4\",\n    \"feicoptic\": \"\\u03e5\",\n    \"female\": \"\\u2640\",\n    \"ff\": \"\\ufb00\",\n    \"ffi\": \"\\ufb03\",\n    \"ffl\": \"\\ufb04\",\n    \"fi\": \"\\ufb01\",\n    \"fifteencircle\": \"\\u246e\",\n    \"fifteenparen\": \"\\u2482\",\n    \"fifteenperiod\": \"\\u2496\",\n    \"figuredash\": \"\\u2012\",\n    \"filledbox\": \"\\u25a0\",\n    \"filledrect\": \"\\u25ac\",\n    \"finalkaf\": \"\\u05da\",\n    \"finalkafdagesh\": \"\\ufb3a\",\n    \"finalkafdageshhebrew\": \"\\ufb3a\",\n    \"finalkafhebrew\": \"\\u05da\",\n    \"finalkafqamats\": \"\\u05da\\u05b8\",\n    \"finalkafqamatshebrew\": \"\\u05da\\u05b8\",\n    \"finalkafsheva\": \"\\u05da\\u05b0\",\n    \"finalkafshevahebrew\": \"\\u05da\\u05b0\",\n    \"finalmem\": \"\\u05dd\",\n    \"finalmemhebrew\": \"\\u05dd\",\n    \"finalnun\": \"\\u05df\",\n    \"finalnunhebrew\": \"\\u05df\",\n    \"finalpe\": \"\\u05e3\",\n    \"finalpehebrew\": \"\\u05e3\",\n    \"finaltsadi\": \"\\u05e5\",\n    
\"finaltsadihebrew\": \"\\u05e5\",\n    \"firsttonechinese\": \"\\u02c9\",\n    \"fisheye\": \"\\u25c9\",\n    \"fitacyrillic\": \"\\u0473\",\n    \"five\": \"\\u0035\",\n    \"fivearabic\": \"\\u0665\",\n    \"fivebengali\": \"\\u09eb\",\n    \"fivecircle\": \"\\u2464\",\n    \"fivecircleinversesansserif\": \"\\u278e\",\n    \"fivedeva\": \"\\u096b\",\n    \"fiveeighths\": \"\\u215d\",\n    \"fivegujarati\": \"\\u0aeb\",\n    \"fivegurmukhi\": \"\\u0a6b\",\n    \"fivehackarabic\": \"\\u0665\",\n    \"fivehangzhou\": \"\\u3025\",\n    \"fiveideographicparen\": \"\\u3224\",\n    \"fiveinferior\": \"\\u2085\",\n    \"fivemonospace\": \"\\uff15\",\n    \"fiveoldstyle\": \"\\uf735\",\n    \"fiveparen\": \"\\u2478\",\n    \"fiveperiod\": \"\\u248c\",\n    \"fivepersian\": \"\\u06f5\",\n    \"fiveroman\": \"\\u2174\",\n    \"fivesuperior\": \"\\u2075\",\n    \"fivethai\": \"\\u0e55\",\n    \"fl\": \"\\ufb02\",\n    \"florin\": \"\\u0192\",\n    \"fmonospace\": \"\\uff46\",\n    \"fmsquare\": \"\\u3399\",\n    \"fofanthai\": \"\\u0e1f\",\n    \"fofathai\": \"\\u0e1d\",\n    \"fongmanthai\": \"\\u0e4f\",\n    \"forall\": \"\\u2200\",\n    \"four\": \"\\u0034\",\n    \"fourarabic\": \"\\u0664\",\n    \"fourbengali\": \"\\u09ea\",\n    \"fourcircle\": \"\\u2463\",\n    \"fourcircleinversesansserif\": \"\\u278d\",\n    \"fourdeva\": \"\\u096a\",\n    \"fourgujarati\": \"\\u0aea\",\n    \"fourgurmukhi\": \"\\u0a6a\",\n    \"fourhackarabic\": \"\\u0664\",\n    \"fourhangzhou\": \"\\u3024\",\n    \"fourideographicparen\": \"\\u3223\",\n    \"fourinferior\": \"\\u2084\",\n    \"fourmonospace\": \"\\uff14\",\n    \"fournumeratorbengali\": \"\\u09f7\",\n    \"fouroldstyle\": \"\\uf734\",\n    \"fourparen\": \"\\u2477\",\n    \"fourperiod\": \"\\u248b\",\n    \"fourpersian\": \"\\u06f4\",\n    \"fourroman\": \"\\u2173\",\n    \"foursuperior\": \"\\u2074\",\n    \"fourteencircle\": \"\\u246d\",\n    \"fourteenparen\": \"\\u2481\",\n    \"fourteenperiod\": \"\\u2495\",\n    
\"fourthai\": \"\\u0e54\",\n    \"fourthtonechinese\": \"\\u02cb\",\n    \"fparen\": \"\\u24a1\",\n    \"fraction\": \"\\u2044\",\n    \"franc\": \"\\u20a3\",\n    \"g\": \"\\u0067\",\n    \"gabengali\": \"\\u0997\",\n    \"gacute\": \"\\u01f5\",\n    \"gadeva\": \"\\u0917\",\n    \"gafarabic\": \"\\u06af\",\n    \"gaffinalarabic\": \"\\ufb93\",\n    \"gafinitialarabic\": \"\\ufb94\",\n    \"gafmedialarabic\": \"\\ufb95\",\n    \"gagujarati\": \"\\u0a97\",\n    \"gagurmukhi\": \"\\u0a17\",\n    \"gahiragana\": \"\\u304c\",\n    \"gakatakana\": \"\\u30ac\",\n    \"gamma\": \"\\u03b3\",\n    \"gammalatinsmall\": \"\\u0263\",\n    \"gammasuperior\": \"\\u02e0\",\n    \"gangiacoptic\": \"\\u03eb\",\n    \"gbopomofo\": \"\\u310d\",\n    \"gbreve\": \"\\u011f\",\n    \"gcaron\": \"\\u01e7\",\n    \"gcedilla\": \"\\u0123\",\n    \"gcircle\": \"\\u24d6\",\n    \"gcircumflex\": \"\\u011d\",\n    \"gcommaaccent\": \"\\u0123\",\n    \"gdot\": \"\\u0121\",\n    \"gdotaccent\": \"\\u0121\",\n    \"gecyrillic\": \"\\u0433\",\n    \"gehiragana\": \"\\u3052\",\n    \"gekatakana\": \"\\u30b2\",\n    \"geometricallyequal\": \"\\u2251\",\n    \"gereshaccenthebrew\": \"\\u059c\",\n    \"gereshhebrew\": \"\\u05f3\",\n    \"gereshmuqdamhebrew\": \"\\u059d\",\n    \"germandbls\": \"\\u00df\",\n    \"gershayimaccenthebrew\": \"\\u059e\",\n    \"gershayimhebrew\": \"\\u05f4\",\n    \"getamark\": \"\\u3013\",\n    \"ghabengali\": \"\\u0998\",\n    \"ghadarmenian\": \"\\u0572\",\n    \"ghadeva\": \"\\u0918\",\n    \"ghagujarati\": \"\\u0a98\",\n    \"ghagurmukhi\": \"\\u0a18\",\n    \"ghainarabic\": \"\\u063a\",\n    \"ghainfinalarabic\": \"\\ufece\",\n    \"ghaininitialarabic\": \"\\ufecf\",\n    \"ghainmedialarabic\": \"\\ufed0\",\n    \"ghemiddlehookcyrillic\": \"\\u0495\",\n    \"ghestrokecyrillic\": \"\\u0493\",\n    \"gheupturncyrillic\": \"\\u0491\",\n    \"ghhadeva\": \"\\u095a\",\n    \"ghhagurmukhi\": \"\\u0a5a\",\n    \"ghook\": \"\\u0260\",\n    \"ghzsquare\": \"\\u3393\",\n    
\"gihiragana\": \"\\u304e\",\n    \"gikatakana\": \"\\u30ae\",\n    \"gimarmenian\": \"\\u0563\",\n    \"gimel\": \"\\u05d2\",\n    \"gimeldagesh\": \"\\ufb32\",\n    \"gimeldageshhebrew\": \"\\ufb32\",\n    \"gimelhebrew\": \"\\u05d2\",\n    \"gjecyrillic\": \"\\u0453\",\n    \"glottalinvertedstroke\": \"\\u01be\",\n    \"glottalstop\": \"\\u0294\",\n    \"glottalstopinverted\": \"\\u0296\",\n    \"glottalstopmod\": \"\\u02c0\",\n    \"glottalstopreversed\": \"\\u0295\",\n    \"glottalstopreversedmod\": \"\\u02c1\",\n    \"glottalstopreversedsuperior\": \"\\u02e4\",\n    \"glottalstopstroke\": \"\\u02a1\",\n    \"glottalstopstrokereversed\": \"\\u02a2\",\n    \"gmacron\": \"\\u1e21\",\n    \"gmonospace\": \"\\uff47\",\n    \"gohiragana\": \"\\u3054\",\n    \"gokatakana\": \"\\u30b4\",\n    \"gparen\": \"\\u24a2\",\n    \"gpasquare\": \"\\u33ac\",\n    \"gradient\": \"\\u2207\",\n    \"grave\": \"\\u0060\",\n    \"gravebelowcmb\": \"\\u0316\",\n    \"gravecmb\": \"\\u0300\",\n    \"gravecomb\": \"\\u0300\",\n    \"gravedeva\": \"\\u0953\",\n    \"gravelowmod\": \"\\u02ce\",\n    \"gravemonospace\": \"\\uff40\",\n    \"gravetonecmb\": \"\\u0340\",\n    \"greater\": \"\\u003e\",\n    \"greaterequal\": \"\\u2265\",\n    \"greaterequalorless\": \"\\u22db\",\n    \"greatermonospace\": \"\\uff1e\",\n    \"greaterorequivalent\": \"\\u2273\",\n    \"greaterorless\": \"\\u2277\",\n    \"greateroverequal\": \"\\u2267\",\n    \"greatersmall\": \"\\ufe65\",\n    \"gscript\": \"\\u0261\",\n    \"gstroke\": \"\\u01e5\",\n    \"guhiragana\": \"\\u3050\",\n    \"guillemotleft\": \"\\u00ab\",\n    \"guillemotright\": \"\\u00bb\",\n    \"guilsinglleft\": \"\\u2039\",\n    \"guilsinglright\": \"\\u203a\",\n    \"gukatakana\": \"\\u30b0\",\n    \"guramusquare\": \"\\u3318\",\n    \"gysquare\": \"\\u33c9\",\n    \"h\": \"\\u0068\",\n    \"haabkhasiancyrillic\": \"\\u04a9\",\n    \"haaltonearabic\": \"\\u06c1\",\n    \"habengali\": \"\\u09b9\",\n    \"hadescendercyrillic\": 
\"\\u04b3\",\n    \"hadeva\": \"\\u0939\",\n    \"hagujarati\": \"\\u0ab9\",\n    \"hagurmukhi\": \"\\u0a39\",\n    \"haharabic\": \"\\u062d\",\n    \"hahfinalarabic\": \"\\ufea2\",\n    \"hahinitialarabic\": \"\\ufea3\",\n    \"hahiragana\": \"\\u306f\",\n    \"hahmedialarabic\": \"\\ufea4\",\n    \"haitusquare\": \"\\u332a\",\n    \"hakatakana\": \"\\u30cf\",\n    \"hakatakanahalfwidth\": \"\\uff8a\",\n    \"halantgurmukhi\": \"\\u0a4d\",\n    \"hamzaarabic\": \"\\u0621\",\n    \"hamzadammaarabic\": \"\\u0621\\u064f\",\n    \"hamzadammatanarabic\": \"\\u0621\\u064c\",\n    \"hamzafathaarabic\": \"\\u0621\\u064e\",\n    \"hamzafathatanarabic\": \"\\u0621\\u064b\",\n    \"hamzalowarabic\": \"\\u0621\",\n    \"hamzalowkasraarabic\": \"\\u0621\\u0650\",\n    \"hamzalowkasratanarabic\": \"\\u0621\\u064d\",\n    \"hamzasukunarabic\": \"\\u0621\\u0652\",\n    \"hangulfiller\": \"\\u3164\",\n    \"hardsigncyrillic\": \"\\u044a\",\n    \"harpoonleftbarbup\": \"\\u21bc\",\n    \"harpoonrightbarbup\": \"\\u21c0\",\n    \"hasquare\": \"\\u33ca\",\n    \"hatafpatah\": \"\\u05b2\",\n    \"hatafpatah16\": \"\\u05b2\",\n    \"hatafpatah23\": \"\\u05b2\",\n    \"hatafpatah2f\": \"\\u05b2\",\n    \"hatafpatahhebrew\": \"\\u05b2\",\n    \"hatafpatahnarrowhebrew\": \"\\u05b2\",\n    \"hatafpatahquarterhebrew\": \"\\u05b2\",\n    \"hatafpatahwidehebrew\": \"\\u05b2\",\n    \"hatafqamats\": \"\\u05b3\",\n    \"hatafqamats1b\": \"\\u05b3\",\n    \"hatafqamats28\": \"\\u05b3\",\n    \"hatafqamats34\": \"\\u05b3\",\n    \"hatafqamatshebrew\": \"\\u05b3\",\n    \"hatafqamatsnarrowhebrew\": \"\\u05b3\",\n    \"hatafqamatsquarterhebrew\": \"\\u05b3\",\n    \"hatafqamatswidehebrew\": \"\\u05b3\",\n    \"hatafsegol\": \"\\u05b1\",\n    \"hatafsegol17\": \"\\u05b1\",\n    \"hatafsegol24\": \"\\u05b1\",\n    \"hatafsegol30\": \"\\u05b1\",\n    \"hatafsegolhebrew\": \"\\u05b1\",\n    \"hatafsegolnarrowhebrew\": \"\\u05b1\",\n    \"hatafsegolquarterhebrew\": \"\\u05b1\",\n    
\"hatafsegolwidehebrew\": \"\\u05b1\",\n    \"hbar\": \"\\u0127\",\n    \"hbopomofo\": \"\\u310f\",\n    \"hbrevebelow\": \"\\u1e2b\",\n    \"hcedilla\": \"\\u1e29\",\n    \"hcircle\": \"\\u24d7\",\n    \"hcircumflex\": \"\\u0125\",\n    \"hdieresis\": \"\\u1e27\",\n    \"hdotaccent\": \"\\u1e23\",\n    \"hdotbelow\": \"\\u1e25\",\n    \"he\": \"\\u05d4\",\n    \"heart\": \"\\u2665\",\n    \"heartsuitblack\": \"\\u2665\",\n    \"heartsuitwhite\": \"\\u2661\",\n    \"hedagesh\": \"\\ufb34\",\n    \"hedageshhebrew\": \"\\ufb34\",\n    \"hehaltonearabic\": \"\\u06c1\",\n    \"heharabic\": \"\\u0647\",\n    \"hehebrew\": \"\\u05d4\",\n    \"hehfinalaltonearabic\": \"\\ufba7\",\n    \"hehfinalalttwoarabic\": \"\\ufeea\",\n    \"hehfinalarabic\": \"\\ufeea\",\n    \"hehhamzaabovefinalarabic\": \"\\ufba5\",\n    \"hehhamzaaboveisolatedarabic\": \"\\ufba4\",\n    \"hehinitialaltonearabic\": \"\\ufba8\",\n    \"hehinitialarabic\": \"\\ufeeb\",\n    \"hehiragana\": \"\\u3078\",\n    \"hehmedialaltonearabic\": \"\\ufba9\",\n    \"hehmedialarabic\": \"\\ufeec\",\n    \"heiseierasquare\": \"\\u337b\",\n    \"hekatakana\": \"\\u30d8\",\n    \"hekatakanahalfwidth\": \"\\uff8d\",\n    \"hekutaarusquare\": \"\\u3336\",\n    \"henghook\": \"\\u0267\",\n    \"herutusquare\": \"\\u3339\",\n    \"het\": \"\\u05d7\",\n    \"hethebrew\": \"\\u05d7\",\n    \"hhook\": \"\\u0266\",\n    \"hhooksuperior\": \"\\u02b1\",\n    \"hieuhacirclekorean\": \"\\u327b\",\n    \"hieuhaparenkorean\": \"\\u321b\",\n    \"hieuhcirclekorean\": \"\\u326d\",\n    \"hieuhkorean\": \"\\u314e\",\n    \"hieuhparenkorean\": \"\\u320d\",\n    \"hihiragana\": \"\\u3072\",\n    \"hikatakana\": \"\\u30d2\",\n    \"hikatakanahalfwidth\": \"\\uff8b\",\n    \"hiriq\": \"\\u05b4\",\n    \"hiriq14\": \"\\u05b4\",\n    \"hiriq21\": \"\\u05b4\",\n    \"hiriq2d\": \"\\u05b4\",\n    \"hiriqhebrew\": \"\\u05b4\",\n    \"hiriqnarrowhebrew\": \"\\u05b4\",\n    \"hiriqquarterhebrew\": \"\\u05b4\",\n    \"hiriqwidehebrew\": 
\"\\u05b4\",\n    \"hlinebelow\": \"\\u1e96\",\n    \"hmonospace\": \"\\uff48\",\n    \"hoarmenian\": \"\\u0570\",\n    \"hohipthai\": \"\\u0e2b\",\n    \"hohiragana\": \"\\u307b\",\n    \"hokatakana\": \"\\u30db\",\n    \"hokatakanahalfwidth\": \"\\uff8e\",\n    \"holam\": \"\\u05b9\",\n    \"holam19\": \"\\u05b9\",\n    \"holam26\": \"\\u05b9\",\n    \"holam32\": \"\\u05b9\",\n    \"holamhebrew\": \"\\u05b9\",\n    \"holamnarrowhebrew\": \"\\u05b9\",\n    \"holamquarterhebrew\": \"\\u05b9\",\n    \"holamwidehebrew\": \"\\u05b9\",\n    \"honokhukthai\": \"\\u0e2e\",\n    \"hookabovecomb\": \"\\u0309\",\n    \"hookcmb\": \"\\u0309\",\n    \"hookpalatalizedbelowcmb\": \"\\u0321\",\n    \"hookretroflexbelowcmb\": \"\\u0322\",\n    \"hoonsquare\": \"\\u3342\",\n    \"horicoptic\": \"\\u03e9\",\n    \"horizontalbar\": \"\\u2015\",\n    \"horncmb\": \"\\u031b\",\n    \"hotsprings\": \"\\u2668\",\n    \"house\": \"\\u2302\",\n    \"hparen\": \"\\u24a3\",\n    \"hsuperior\": \"\\u02b0\",\n    \"hturned\": \"\\u0265\",\n    \"huhiragana\": \"\\u3075\",\n    \"huiitosquare\": \"\\u3333\",\n    \"hukatakana\": \"\\u30d5\",\n    \"hukatakanahalfwidth\": \"\\uff8c\",\n    \"hungarumlaut\": \"\\u02dd\",\n    \"hungarumlautcmb\": \"\\u030b\",\n    \"hv\": \"\\u0195\",\n    \"hyphen\": \"\\u002d\",\n    \"hypheninferior\": \"\\uf6e5\",\n    \"hyphenmonospace\": \"\\uff0d\",\n    \"hyphensmall\": \"\\ufe63\",\n    \"hyphensuperior\": \"\\uf6e6\",\n    \"hyphentwo\": \"\\u2010\",\n    \"i\": \"\\u0069\",\n    \"iacute\": \"\\u00ed\",\n    \"iacyrillic\": \"\\u044f\",\n    \"ibengali\": \"\\u0987\",\n    \"ibopomofo\": \"\\u3127\",\n    \"ibreve\": \"\\u012d\",\n    \"icaron\": \"\\u01d0\",\n    \"icircle\": \"\\u24d8\",\n    \"icircumflex\": \"\\u00ee\",\n    \"icyrillic\": \"\\u0456\",\n    \"idblgrave\": \"\\u0209\",\n    \"ideographearthcircle\": \"\\u328f\",\n    \"ideographfirecircle\": \"\\u328b\",\n    \"ideographicallianceparen\": \"\\u323f\",\n    \"ideographiccallparen\": 
\"\\u323a\",\n    \"ideographiccentrecircle\": \"\\u32a5\",\n    \"ideographicclose\": \"\\u3006\",\n    \"ideographiccomma\": \"\\u3001\",\n    \"ideographiccommaleft\": \"\\uff64\",\n    \"ideographiccongratulationparen\": \"\\u3237\",\n    \"ideographiccorrectcircle\": \"\\u32a3\",\n    \"ideographicearthparen\": \"\\u322f\",\n    \"ideographicenterpriseparen\": \"\\u323d\",\n    \"ideographicexcellentcircle\": \"\\u329d\",\n    \"ideographicfestivalparen\": \"\\u3240\",\n    \"ideographicfinancialcircle\": \"\\u3296\",\n    \"ideographicfinancialparen\": \"\\u3236\",\n    \"ideographicfireparen\": \"\\u322b\",\n    \"ideographichaveparen\": \"\\u3232\",\n    \"ideographichighcircle\": \"\\u32a4\",\n    \"ideographiciterationmark\": \"\\u3005\",\n    \"ideographiclaborcircle\": \"\\u3298\",\n    \"ideographiclaborparen\": \"\\u3238\",\n    \"ideographicleftcircle\": \"\\u32a7\",\n    \"ideographiclowcircle\": \"\\u32a6\",\n    \"ideographicmedicinecircle\": \"\\u32a9\",\n    \"ideographicmetalparen\": \"\\u322e\",\n    \"ideographicmoonparen\": \"\\u322a\",\n    \"ideographicnameparen\": \"\\u3234\",\n    \"ideographicperiod\": \"\\u3002\",\n    \"ideographicprintcircle\": \"\\u329e\",\n    \"ideographicreachparen\": \"\\u3243\",\n    \"ideographicrepresentparen\": \"\\u3239\",\n    \"ideographicresourceparen\": \"\\u323e\",\n    \"ideographicrightcircle\": \"\\u32a8\",\n    \"ideographicsecretcircle\": \"\\u3299\",\n    \"ideographicselfparen\": \"\\u3242\",\n    \"ideographicsocietyparen\": \"\\u3233\",\n    \"ideographicspace\": \"\\u3000\",\n    \"ideographicspecialparen\": \"\\u3235\",\n    \"ideographicstockparen\": \"\\u3231\",\n    \"ideographicstudyparen\": \"\\u323b\",\n    \"ideographicsunparen\": \"\\u3230\",\n    \"ideographicsuperviseparen\": \"\\u323c\",\n    \"ideographicwaterparen\": \"\\u322c\",\n    \"ideographicwoodparen\": \"\\u322d\",\n    \"ideographiczero\": \"\\u3007\",\n    \"ideographmetalcircle\": \"\\u328e\",\n    
\"ideographmooncircle\": \"\\u328a\",\n    \"ideographnamecircle\": \"\\u3294\",\n    \"ideographsuncircle\": \"\\u3290\",\n    \"ideographwatercircle\": \"\\u328c\",\n    \"ideographwoodcircle\": \"\\u328d\",\n    \"ideva\": \"\\u0907\",\n    \"idieresis\": \"\\u00ef\",\n    \"idieresisacute\": \"\\u1e2f\",\n    \"idieresiscyrillic\": \"\\u04e5\",\n    \"idotbelow\": \"\\u1ecb\",\n    \"iebrevecyrillic\": \"\\u04d7\",\n    \"iecyrillic\": \"\\u0435\",\n    \"ieungacirclekorean\": \"\\u3275\",\n    \"ieungaparenkorean\": \"\\u3215\",\n    \"ieungcirclekorean\": \"\\u3267\",\n    \"ieungkorean\": \"\\u3147\",\n    \"ieungparenkorean\": \"\\u3207\",\n    \"igrave\": \"\\u00ec\",\n    \"igujarati\": \"\\u0a87\",\n    \"igurmukhi\": \"\\u0a07\",\n    \"ihiragana\": \"\\u3044\",\n    \"ihookabove\": \"\\u1ec9\",\n    \"iibengali\": \"\\u0988\",\n    \"iicyrillic\": \"\\u0438\",\n    \"iideva\": \"\\u0908\",\n    \"iigujarati\": \"\\u0a88\",\n    \"iigurmukhi\": \"\\u0a08\",\n    \"iimatragurmukhi\": \"\\u0a40\",\n    \"iinvertedbreve\": \"\\u020b\",\n    \"iishortcyrillic\": \"\\u0439\",\n    \"iivowelsignbengali\": \"\\u09c0\",\n    \"iivowelsigndeva\": \"\\u0940\",\n    \"iivowelsigngujarati\": \"\\u0ac0\",\n    \"ij\": \"\\u0133\",\n    \"ikatakana\": \"\\u30a4\",\n    \"ikatakanahalfwidth\": \"\\uff72\",\n    \"ikorean\": \"\\u3163\",\n    \"ilde\": \"\\u02dc\",\n    \"iluyhebrew\": \"\\u05ac\",\n    \"imacron\": \"\\u012b\",\n    \"imacroncyrillic\": \"\\u04e3\",\n    \"imageorapproximatelyequal\": \"\\u2253\",\n    \"imatragurmukhi\": \"\\u0a3f\",\n    \"imonospace\": \"\\uff49\",\n    \"increment\": \"\\u2206\",\n    \"infinity\": \"\\u221e\",\n    \"iniarmenian\": \"\\u056b\",\n    \"integral\": \"\\u222b\",\n    \"integralbottom\": \"\\u2321\",\n    \"integralbt\": \"\\u2321\",\n    \"integralex\": \"\\uf8f5\",\n    \"integraltop\": \"\\u2320\",\n    \"integraltp\": \"\\u2320\",\n    \"intersection\": \"\\u2229\",\n    \"intisquare\": \"\\u3305\",\n    
\"invbullet\": \"\\u25d8\",\n    \"invcircle\": \"\\u25d9\",\n    \"invsmileface\": \"\\u263b\",\n    \"iocyrillic\": \"\\u0451\",\n    \"iogonek\": \"\\u012f\",\n    \"iota\": \"\\u03b9\",\n    \"iotadieresis\": \"\\u03ca\",\n    \"iotadieresistonos\": \"\\u0390\",\n    \"iotalatin\": \"\\u0269\",\n    \"iotatonos\": \"\\u03af\",\n    \"iparen\": \"\\u24a4\",\n    \"irigurmukhi\": \"\\u0a72\",\n    \"ismallhiragana\": \"\\u3043\",\n    \"ismallkatakana\": \"\\u30a3\",\n    \"ismallkatakanahalfwidth\": \"\\uff68\",\n    \"issharbengali\": \"\\u09fa\",\n    \"istroke\": \"\\u0268\",\n    \"isuperior\": \"\\uf6ed\",\n    \"iterationhiragana\": \"\\u309d\",\n    \"iterationkatakana\": \"\\u30fd\",\n    \"itilde\": \"\\u0129\",\n    \"itildebelow\": \"\\u1e2d\",\n    \"iubopomofo\": \"\\u3129\",\n    \"iucyrillic\": \"\\u044e\",\n    \"ivowelsignbengali\": \"\\u09bf\",\n    \"ivowelsigndeva\": \"\\u093f\",\n    \"ivowelsigngujarati\": \"\\u0abf\",\n    \"izhitsacyrillic\": \"\\u0475\",\n    \"izhitsadblgravecyrillic\": \"\\u0477\",\n    \"j\": \"\\u006a\",\n    \"jaarmenian\": \"\\u0571\",\n    \"jabengali\": \"\\u099c\",\n    \"jadeva\": \"\\u091c\",\n    \"jagujarati\": \"\\u0a9c\",\n    \"jagurmukhi\": \"\\u0a1c\",\n    \"jbopomofo\": \"\\u3110\",\n    \"jcaron\": \"\\u01f0\",\n    \"jcircle\": \"\\u24d9\",\n    \"jcircumflex\": \"\\u0135\",\n    \"jcrossedtail\": \"\\u029d\",\n    \"jdotlessstroke\": \"\\u025f\",\n    \"jecyrillic\": \"\\u0458\",\n    \"jeemarabic\": \"\\u062c\",\n    \"jeemfinalarabic\": \"\\ufe9e\",\n    \"jeeminitialarabic\": \"\\ufe9f\",\n    \"jeemmedialarabic\": \"\\ufea0\",\n    \"jeharabic\": \"\\u0698\",\n    \"jehfinalarabic\": \"\\ufb8b\",\n    \"jhabengali\": \"\\u099d\",\n    \"jhadeva\": \"\\u091d\",\n    \"jhagujarati\": \"\\u0a9d\",\n    \"jhagurmukhi\": \"\\u0a1d\",\n    \"jheharmenian\": \"\\u057b\",\n    \"jis\": \"\\u3004\",\n    \"jmonospace\": \"\\uff4a\",\n    \"jparen\": \"\\u24a5\",\n    \"jsuperior\": \"\\u02b2\",\n    
\"k\": \"\\u006b\",\n    \"kabashkircyrillic\": \"\\u04a1\",\n    \"kabengali\": \"\\u0995\",\n    \"kacute\": \"\\u1e31\",\n    \"kacyrillic\": \"\\u043a\",\n    \"kadescendercyrillic\": \"\\u049b\",\n    \"kadeva\": \"\\u0915\",\n    \"kaf\": \"\\u05db\",\n    \"kafarabic\": \"\\u0643\",\n    \"kafdagesh\": \"\\ufb3b\",\n    \"kafdageshhebrew\": \"\\ufb3b\",\n    \"kaffinalarabic\": \"\\ufeda\",\n    \"kafhebrew\": \"\\u05db\",\n    \"kafinitialarabic\": \"\\ufedb\",\n    \"kafmedialarabic\": \"\\ufedc\",\n    \"kafrafehebrew\": \"\\ufb4d\",\n    \"kagujarati\": \"\\u0a95\",\n    \"kagurmukhi\": \"\\u0a15\",\n    \"kahiragana\": \"\\u304b\",\n    \"kahookcyrillic\": \"\\u04c4\",\n    \"kakatakana\": \"\\u30ab\",\n    \"kakatakanahalfwidth\": \"\\uff76\",\n    \"kappa\": \"\\u03ba\",\n    \"kappasymbolgreek\": \"\\u03f0\",\n    \"kapyeounmieumkorean\": \"\\u3171\",\n    \"kapyeounphieuphkorean\": \"\\u3184\",\n    \"kapyeounpieupkorean\": \"\\u3178\",\n    \"kapyeounssangpieupkorean\": \"\\u3179\",\n    \"karoriisquare\": \"\\u330d\",\n    \"kashidaautoarabic\": \"\\u0640\",\n    \"kashidaautonosidebearingarabic\": \"\\u0640\",\n    \"kasmallkatakana\": \"\\u30f5\",\n    \"kasquare\": \"\\u3384\",\n    \"kasraarabic\": \"\\u0650\",\n    \"kasratanarabic\": \"\\u064d\",\n    \"kastrokecyrillic\": \"\\u049f\",\n    \"katahiraprolongmarkhalfwidth\": \"\\uff70\",\n    \"kaverticalstrokecyrillic\": \"\\u049d\",\n    \"kbopomofo\": \"\\u310e\",\n    \"kcalsquare\": \"\\u3389\",\n    \"kcaron\": \"\\u01e9\",\n    \"kcedilla\": \"\\u0137\",\n    \"kcircle\": \"\\u24da\",\n    \"kcommaaccent\": \"\\u0137\",\n    \"kdotbelow\": \"\\u1e33\",\n    \"keharmenian\": \"\\u0584\",\n    \"kehiragana\": \"\\u3051\",\n    \"kekatakana\": \"\\u30b1\",\n    \"kekatakanahalfwidth\": \"\\uff79\",\n    \"kenarmenian\": \"\\u056f\",\n    \"kesmallkatakana\": \"\\u30f6\",\n    \"kgreenlandic\": \"\\u0138\",\n    \"khabengali\": \"\\u0996\",\n    \"khacyrillic\": \"\\u0445\",\n    
\"khadeva\": \"\\u0916\",\n    \"khagujarati\": \"\\u0a96\",\n    \"khagurmukhi\": \"\\u0a16\",\n    \"khaharabic\": \"\\u062e\",\n    \"khahfinalarabic\": \"\\ufea6\",\n    \"khahinitialarabic\": \"\\ufea7\",\n    \"khahmedialarabic\": \"\\ufea8\",\n    \"kheicoptic\": \"\\u03e7\",\n    \"khhadeva\": \"\\u0959\",\n    \"khhagurmukhi\": \"\\u0a59\",\n    \"khieukhacirclekorean\": \"\\u3278\",\n    \"khieukhaparenkorean\": \"\\u3218\",\n    \"khieukhcirclekorean\": \"\\u326a\",\n    \"khieukhkorean\": \"\\u314b\",\n    \"khieukhparenkorean\": \"\\u320a\",\n    \"khokhaithai\": \"\\u0e02\",\n    \"khokhonthai\": \"\\u0e05\",\n    \"khokhuatthai\": \"\\u0e03\",\n    \"khokhwaithai\": \"\\u0e04\",\n    \"khomutthai\": \"\\u0e5b\",\n    \"khook\": \"\\u0199\",\n    \"khorakhangthai\": \"\\u0e06\",\n    \"khzsquare\": \"\\u3391\",\n    \"kihiragana\": \"\\u304d\",\n    \"kikatakana\": \"\\u30ad\",\n    \"kikatakanahalfwidth\": \"\\uff77\",\n    \"kiroguramusquare\": \"\\u3315\",\n    \"kiromeetorusquare\": \"\\u3316\",\n    \"kirosquare\": \"\\u3314\",\n    \"kiyeokacirclekorean\": \"\\u326e\",\n    \"kiyeokaparenkorean\": \"\\u320e\",\n    \"kiyeokcirclekorean\": \"\\u3260\",\n    \"kiyeokkorean\": \"\\u3131\",\n    \"kiyeokparenkorean\": \"\\u3200\",\n    \"kiyeoksioskorean\": \"\\u3133\",\n    \"kjecyrillic\": \"\\u045c\",\n    \"klinebelow\": \"\\u1e35\",\n    \"klsquare\": \"\\u3398\",\n    \"kmcubedsquare\": \"\\u33a6\",\n    \"kmonospace\": \"\\uff4b\",\n    \"kmsquaredsquare\": \"\\u33a2\",\n    \"kohiragana\": \"\\u3053\",\n    \"kohmsquare\": \"\\u33c0\",\n    \"kokaithai\": \"\\u0e01\",\n    \"kokatakana\": \"\\u30b3\",\n    \"kokatakanahalfwidth\": \"\\uff7a\",\n    \"kooposquare\": \"\\u331e\",\n    \"koppacyrillic\": \"\\u0481\",\n    \"koreanstandardsymbol\": \"\\u327f\",\n    \"koroniscmb\": \"\\u0343\",\n    \"kparen\": \"\\u24a6\",\n    \"kpasquare\": \"\\u33aa\",\n    \"ksicyrillic\": \"\\u046f\",\n    \"ktsquare\": \"\\u33cf\",\n    \"kturned\": 
\"\\u029e\",\n    \"kuhiragana\": \"\\u304f\",\n    \"kukatakana\": \"\\u30af\",\n    \"kukatakanahalfwidth\": \"\\uff78\",\n    \"kvsquare\": \"\\u33b8\",\n    \"kwsquare\": \"\\u33be\",\n    \"l\": \"\\u006c\",\n    \"labengali\": \"\\u09b2\",\n    \"lacute\": \"\\u013a\",\n    \"ladeva\": \"\\u0932\",\n    \"lagujarati\": \"\\u0ab2\",\n    \"lagurmukhi\": \"\\u0a32\",\n    \"lakkhangyaothai\": \"\\u0e45\",\n    \"lamaleffinalarabic\": \"\\ufefc\",\n    \"lamalefhamzaabovefinalarabic\": \"\\ufef8\",\n    \"lamalefhamzaaboveisolatedarabic\": \"\\ufef7\",\n    \"lamalefhamzabelowfinalarabic\": \"\\ufefa\",\n    \"lamalefhamzabelowisolatedarabic\": \"\\ufef9\",\n    \"lamalefisolatedarabic\": \"\\ufefb\",\n    \"lamalefmaddaabovefinalarabic\": \"\\ufef6\",\n    \"lamalefmaddaaboveisolatedarabic\": \"\\ufef5\",\n    \"lamarabic\": \"\\u0644\",\n    \"lambda\": \"\\u03bb\",\n    \"lambdastroke\": \"\\u019b\",\n    \"lamed\": \"\\u05dc\",\n    \"lameddagesh\": \"\\ufb3c\",\n    \"lameddageshhebrew\": \"\\ufb3c\",\n    \"lamedhebrew\": \"\\u05dc\",\n    \"lamedholam\": \"\\u05dc\\u05b9\",\n    \"lamedholamdagesh\": \"\\u05dc\\u05b9\\u05bc\",\n    \"lamedholamdageshhebrew\": \"\\u05dc\\u05b9\\u05bc\",\n    \"lamedholamhebrew\": \"\\u05dc\\u05b9\",\n    \"lamfinalarabic\": \"\\ufede\",\n    \"lamhahinitialarabic\": \"\\ufcca\",\n    \"laminitialarabic\": \"\\ufedf\",\n    \"lamjeeminitialarabic\": \"\\ufcc9\",\n    \"lamkhahinitialarabic\": \"\\ufccb\",\n    \"lamlamhehisolatedarabic\": \"\\ufdf2\",\n    \"lammedialarabic\": \"\\ufee0\",\n    \"lammeemhahinitialarabic\": \"\\ufd88\",\n    \"lammeeminitialarabic\": \"\\ufccc\",\n    \"lammeemjeeminitialarabic\": \"\\ufedf\\ufee4\\ufea0\",\n    \"lammeemkhahinitialarabic\": \"\\ufedf\\ufee4\\ufea8\",\n    \"largecircle\": \"\\u25ef\",\n    \"lbar\": \"\\u019a\",\n    \"lbelt\": \"\\u026c\",\n    \"lbopomofo\": \"\\u310c\",\n    \"lcaron\": \"\\u013e\",\n    \"lcedilla\": \"\\u013c\",\n    \"lcircle\": \"\\u24db\",\n    
\"lcircumflexbelow\": \"\\u1e3d\",\n    \"lcommaaccent\": \"\\u013c\",\n    \"ldot\": \"\\u0140\",\n    \"ldotaccent\": \"\\u0140\",\n    \"ldotbelow\": \"\\u1e37\",\n    \"ldotbelowmacron\": \"\\u1e39\",\n    \"leftangleabovecmb\": \"\\u031a\",\n    \"lefttackbelowcmb\": \"\\u0318\",\n    \"less\": \"\\u003c\",\n    \"lessequal\": \"\\u2264\",\n    \"lessequalorgreater\": \"\\u22da\",\n    \"lessmonospace\": \"\\uff1c\",\n    \"lessorequivalent\": \"\\u2272\",\n    \"lessorgreater\": \"\\u2276\",\n    \"lessoverequal\": \"\\u2266\",\n    \"lesssmall\": \"\\ufe64\",\n    \"lezh\": \"\\u026e\",\n    \"lfblock\": \"\\u258c\",\n    \"lhookretroflex\": \"\\u026d\",\n    \"lira\": \"\\u20a4\",\n    \"liwnarmenian\": \"\\u056c\",\n    \"lj\": \"\\u01c9\",\n    \"ljecyrillic\": \"\\u0459\",\n    \"ll\": \"\\uf6c0\",\n    \"lladeva\": \"\\u0933\",\n    \"llagujarati\": \"\\u0ab3\",\n    \"llinebelow\": \"\\u1e3b\",\n    \"llladeva\": \"\\u0934\",\n    \"llvocalicbengali\": \"\\u09e1\",\n    \"llvocalicdeva\": \"\\u0961\",\n    \"llvocalicvowelsignbengali\": \"\\u09e3\",\n    \"llvocalicvowelsigndeva\": \"\\u0963\",\n    \"lmiddletilde\": \"\\u026b\",\n    \"lmonospace\": \"\\uff4c\",\n    \"lmsquare\": \"\\u33d0\",\n    \"lochulathai\": \"\\u0e2c\",\n    \"logicaland\": \"\\u2227\",\n    \"logicalnot\": \"\\u00ac\",\n    \"logicalnotreversed\": \"\\u2310\",\n    \"logicalor\": \"\\u2228\",\n    \"lolingthai\": \"\\u0e25\",\n    \"longs\": \"\\u017f\",\n    \"lowlinecenterline\": \"\\ufe4e\",\n    \"lowlinecmb\": \"\\u0332\",\n    \"lowlinedashed\": \"\\ufe4d\",\n    \"lozenge\": \"\\u25ca\",\n    \"lparen\": \"\\u24a7\",\n    \"lslash\": \"\\u0142\",\n    \"lsquare\": \"\\u2113\",\n    \"lsuperior\": \"\\uf6ee\",\n    \"ltshade\": \"\\u2591\",\n    \"luthai\": \"\\u0e26\",\n    \"lvocalicbengali\": \"\\u098c\",\n    \"lvocalicdeva\": \"\\u090c\",\n    \"lvocalicvowelsignbengali\": \"\\u09e2\",\n    \"lvocalicvowelsigndeva\": \"\\u0962\",\n    \"lxsquare\": \"\\u33d3\",\n   
 \"m\": \"\\u006d\",\n    \"mabengali\": \"\\u09ae\",\n    \"macron\": \"\\u00af\",\n    \"macronbelowcmb\": \"\\u0331\",\n    \"macroncmb\": \"\\u0304\",\n    \"macronlowmod\": \"\\u02cd\",\n    \"macronmonospace\": \"\\uffe3\",\n    \"macute\": \"\\u1e3f\",\n    \"madeva\": \"\\u092e\",\n    \"magujarati\": \"\\u0aae\",\n    \"magurmukhi\": \"\\u0a2e\",\n    \"mahapakhhebrew\": \"\\u05a4\",\n    \"mahapakhlefthebrew\": \"\\u05a4\",\n    \"mahiragana\": \"\\u307e\",\n    \"maichattawalowleftthai\": \"\\uf895\",\n    \"maichattawalowrightthai\": \"\\uf894\",\n    \"maichattawathai\": \"\\u0e4b\",\n    \"maichattawaupperleftthai\": \"\\uf893\",\n    \"maieklowleftthai\": \"\\uf88c\",\n    \"maieklowrightthai\": \"\\uf88b\",\n    \"maiekthai\": \"\\u0e48\",\n    \"maiekupperleftthai\": \"\\uf88a\",\n    \"maihanakatleftthai\": \"\\uf884\",\n    \"maihanakatthai\": \"\\u0e31\",\n    \"maitaikhuleftthai\": \"\\uf889\",\n    \"maitaikhuthai\": \"\\u0e47\",\n    \"maitholowleftthai\": \"\\uf88f\",\n    \"maitholowrightthai\": \"\\uf88e\",\n    \"maithothai\": \"\\u0e49\",\n    \"maithoupperleftthai\": \"\\uf88d\",\n    \"maitrilowleftthai\": \"\\uf892\",\n    \"maitrilowrightthai\": \"\\uf891\",\n    \"maitrithai\": \"\\u0e4a\",\n    \"maitriupperleftthai\": \"\\uf890\",\n    \"maiyamokthai\": \"\\u0e46\",\n    \"makatakana\": \"\\u30de\",\n    \"makatakanahalfwidth\": \"\\uff8f\",\n    \"male\": \"\\u2642\",\n    \"mansyonsquare\": \"\\u3347\",\n    \"maqafhebrew\": \"\\u05be\",\n    \"mars\": \"\\u2642\",\n    \"masoracirclehebrew\": \"\\u05af\",\n    \"masquare\": \"\\u3383\",\n    \"mbopomofo\": \"\\u3107\",\n    \"mbsquare\": \"\\u33d4\",\n    \"mcircle\": \"\\u24dc\",\n    \"mcubedsquare\": \"\\u33a5\",\n    \"mdotaccent\": \"\\u1e41\",\n    \"mdotbelow\": \"\\u1e43\",\n    \"meemarabic\": \"\\u0645\",\n    \"meemfinalarabic\": \"\\ufee2\",\n    \"meeminitialarabic\": \"\\ufee3\",\n    \"meemmedialarabic\": \"\\ufee4\",\n    \"meemmeeminitialarabic\": 
\"\\ufcd1\",\n    \"meemmeemisolatedarabic\": \"\\ufc48\",\n    \"meetorusquare\": \"\\u334d\",\n    \"mehiragana\": \"\\u3081\",\n    \"meizierasquare\": \"\\u337e\",\n    \"mekatakana\": \"\\u30e1\",\n    \"mekatakanahalfwidth\": \"\\uff92\",\n    \"mem\": \"\\u05de\",\n    \"memdagesh\": \"\\ufb3e\",\n    \"memdageshhebrew\": \"\\ufb3e\",\n    \"memhebrew\": \"\\u05de\",\n    \"menarmenian\": \"\\u0574\",\n    \"merkhahebrew\": \"\\u05a5\",\n    \"merkhakefulahebrew\": \"\\u05a6\",\n    \"merkhakefulalefthebrew\": \"\\u05a6\",\n    \"merkhalefthebrew\": \"\\u05a5\",\n    \"mhook\": \"\\u0271\",\n    \"mhzsquare\": \"\\u3392\",\n    \"middledotkatakanahalfwidth\": \"\\uff65\",\n    \"middot\": \"\\u00b7\",\n    \"mieumacirclekorean\": \"\\u3272\",\n    \"mieumaparenkorean\": \"\\u3212\",\n    \"mieumcirclekorean\": \"\\u3264\",\n    \"mieumkorean\": \"\\u3141\",\n    \"mieumpansioskorean\": \"\\u3170\",\n    \"mieumparenkorean\": \"\\u3204\",\n    \"mieumpieupkorean\": \"\\u316e\",\n    \"mieumsioskorean\": \"\\u316f\",\n    \"mihiragana\": \"\\u307f\",\n    \"mikatakana\": \"\\u30df\",\n    \"mikatakanahalfwidth\": \"\\uff90\",\n    \"minus\": \"\\u2212\",\n    \"minusbelowcmb\": \"\\u0320\",\n    \"minuscircle\": \"\\u2296\",\n    \"minusmod\": \"\\u02d7\",\n    \"minusplus\": \"\\u2213\",\n    \"minute\": \"\\u2032\",\n    \"miribaarusquare\": \"\\u334a\",\n    \"mirisquare\": \"\\u3349\",\n    \"mlonglegturned\": \"\\u0270\",\n    \"mlsquare\": \"\\u3396\",\n    \"mmcubedsquare\": \"\\u33a3\",\n    \"mmonospace\": \"\\uff4d\",\n    \"mmsquaredsquare\": \"\\u339f\",\n    \"mohiragana\": \"\\u3082\",\n    \"mohmsquare\": \"\\u33c1\",\n    \"mokatakana\": \"\\u30e2\",\n    \"mokatakanahalfwidth\": \"\\uff93\",\n    \"molsquare\": \"\\u33d6\",\n    \"momathai\": \"\\u0e21\",\n    \"moverssquare\": \"\\u33a7\",\n    \"moverssquaredsquare\": \"\\u33a8\",\n    \"mparen\": \"\\u24a8\",\n    \"mpasquare\": \"\\u33ab\",\n    \"mssquare\": \"\\u33b3\",\n    
\"msuperior\": \"\\uf6ef\",\n    \"mturned\": \"\\u026f\",\n    \"mu\": \"\\u00b5\",\n    \"mu1\": \"\\u00b5\",\n    \"muasquare\": \"\\u3382\",\n    \"muchgreater\": \"\\u226b\",\n    \"muchless\": \"\\u226a\",\n    \"mufsquare\": \"\\u338c\",\n    \"mugreek\": \"\\u03bc\",\n    \"mugsquare\": \"\\u338d\",\n    \"muhiragana\": \"\\u3080\",\n    \"mukatakana\": \"\\u30e0\",\n    \"mukatakanahalfwidth\": \"\\uff91\",\n    \"mulsquare\": \"\\u3395\",\n    \"multiply\": \"\\u00d7\",\n    \"mumsquare\": \"\\u339b\",\n    \"munahhebrew\": \"\\u05a3\",\n    \"munahlefthebrew\": \"\\u05a3\",\n    \"musicalnote\": \"\\u266a\",\n    \"musicalnotedbl\": \"\\u266b\",\n    \"musicflatsign\": \"\\u266d\",\n    \"musicsharpsign\": \"\\u266f\",\n    \"mussquare\": \"\\u33b2\",\n    \"muvsquare\": \"\\u33b6\",\n    \"muwsquare\": \"\\u33bc\",\n    \"mvmegasquare\": \"\\u33b9\",\n    \"mvsquare\": \"\\u33b7\",\n    \"mwmegasquare\": \"\\u33bf\",\n    \"mwsquare\": \"\\u33bd\",\n    \"n\": \"\\u006e\",\n    \"nabengali\": \"\\u09a8\",\n    \"nabla\": \"\\u2207\",\n    \"nacute\": \"\\u0144\",\n    \"nadeva\": \"\\u0928\",\n    \"nagujarati\": \"\\u0aa8\",\n    \"nagurmukhi\": \"\\u0a28\",\n    \"nahiragana\": \"\\u306a\",\n    \"nakatakana\": \"\\u30ca\",\n    \"nakatakanahalfwidth\": \"\\uff85\",\n    \"napostrophe\": \"\\u0149\",\n    \"nasquare\": \"\\u3381\",\n    \"nbopomofo\": \"\\u310b\",\n    \"nbspace\": \"\\u00a0\",\n    \"ncaron\": \"\\u0148\",\n    \"ncedilla\": \"\\u0146\",\n    \"ncircle\": \"\\u24dd\",\n    \"ncircumflexbelow\": \"\\u1e4b\",\n    \"ncommaaccent\": \"\\u0146\",\n    \"ndotaccent\": \"\\u1e45\",\n    \"ndotbelow\": \"\\u1e47\",\n    \"nehiragana\": \"\\u306d\",\n    \"nekatakana\": \"\\u30cd\",\n    \"nekatakanahalfwidth\": \"\\uff88\",\n    \"newsheqelsign\": \"\\u20aa\",\n    \"nfsquare\": \"\\u338b\",\n    \"ngabengali\": \"\\u0999\",\n    \"ngadeva\": \"\\u0919\",\n    \"ngagujarati\": \"\\u0a99\",\n    \"ngagurmukhi\": \"\\u0a19\",\n    
\"ngonguthai\": \"\\u0e07\",\n    \"nhiragana\": \"\\u3093\",\n    \"nhookleft\": \"\\u0272\",\n    \"nhookretroflex\": \"\\u0273\",\n    \"nieunacirclekorean\": \"\\u326f\",\n    \"nieunaparenkorean\": \"\\u320f\",\n    \"nieuncieuckorean\": \"\\u3135\",\n    \"nieuncirclekorean\": \"\\u3261\",\n    \"nieunhieuhkorean\": \"\\u3136\",\n    \"nieunkorean\": \"\\u3134\",\n    \"nieunpansioskorean\": \"\\u3168\",\n    \"nieunparenkorean\": \"\\u3201\",\n    \"nieunsioskorean\": \"\\u3167\",\n    \"nieuntikeutkorean\": \"\\u3166\",\n    \"nihiragana\": \"\\u306b\",\n    \"nikatakana\": \"\\u30cb\",\n    \"nikatakanahalfwidth\": \"\\uff86\",\n    \"nikhahitleftthai\": \"\\uf899\",\n    \"nikhahitthai\": \"\\u0e4d\",\n    \"nine\": \"\\u0039\",\n    \"ninearabic\": \"\\u0669\",\n    \"ninebengali\": \"\\u09ef\",\n    \"ninecircle\": \"\\u2468\",\n    \"ninecircleinversesansserif\": \"\\u2792\",\n    \"ninedeva\": \"\\u096f\",\n    \"ninegujarati\": \"\\u0aef\",\n    \"ninegurmukhi\": \"\\u0a6f\",\n    \"ninehackarabic\": \"\\u0669\",\n    \"ninehangzhou\": \"\\u3029\",\n    \"nineideographicparen\": \"\\u3228\",\n    \"nineinferior\": \"\\u2089\",\n    \"ninemonospace\": \"\\uff19\",\n    \"nineoldstyle\": \"\\uf739\",\n    \"nineparen\": \"\\u247c\",\n    \"nineperiod\": \"\\u2490\",\n    \"ninepersian\": \"\\u06f9\",\n    \"nineroman\": \"\\u2178\",\n    \"ninesuperior\": \"\\u2079\",\n    \"nineteencircle\": \"\\u2472\",\n    \"nineteenparen\": \"\\u2486\",\n    \"nineteenperiod\": \"\\u249a\",\n    \"ninethai\": \"\\u0e59\",\n    \"nj\": \"\\u01cc\",\n    \"njecyrillic\": \"\\u045a\",\n    \"nkatakana\": \"\\u30f3\",\n    \"nkatakanahalfwidth\": \"\\uff9d\",\n    \"nlegrightlong\": \"\\u019e\",\n    \"nlinebelow\": \"\\u1e49\",\n    \"nmonospace\": \"\\uff4e\",\n    \"nmsquare\": \"\\u339a\",\n    \"nnabengali\": \"\\u09a3\",\n    \"nnadeva\": \"\\u0923\",\n    \"nnagujarati\": \"\\u0aa3\",\n    \"nnagurmukhi\": \"\\u0a23\",\n    \"nnnadeva\": \"\\u0929\",\n    
\"nohiragana\": \"\\u306e\",\n    \"nokatakana\": \"\\u30ce\",\n    \"nokatakanahalfwidth\": \"\\uff89\",\n    \"nonbreakingspace\": \"\\u00a0\",\n    \"nonenthai\": \"\\u0e13\",\n    \"nonuthai\": \"\\u0e19\",\n    \"noonarabic\": \"\\u0646\",\n    \"noonfinalarabic\": \"\\ufee6\",\n    \"noonghunnaarabic\": \"\\u06ba\",\n    \"noonghunnafinalarabic\": \"\\ufb9f\",\n    \"noonhehinitialarabic\": \"\\ufee7\\ufeec\",\n    \"nooninitialarabic\": \"\\ufee7\",\n    \"noonjeeminitialarabic\": \"\\ufcd2\",\n    \"noonjeemisolatedarabic\": \"\\ufc4b\",\n    \"noonmedialarabic\": \"\\ufee8\",\n    \"noonmeeminitialarabic\": \"\\ufcd5\",\n    \"noonmeemisolatedarabic\": \"\\ufc4e\",\n    \"noonnoonfinalarabic\": \"\\ufc8d\",\n    \"notcontains\": \"\\u220c\",\n    \"notelement\": \"\\u2209\",\n    \"notelementof\": \"\\u2209\",\n    \"notequal\": \"\\u2260\",\n    \"notgreater\": \"\\u226f\",\n    \"notgreaternorequal\": \"\\u2271\",\n    \"notgreaternorless\": \"\\u2279\",\n    \"notidentical\": \"\\u2262\",\n    \"notless\": \"\\u226e\",\n    \"notlessnorequal\": \"\\u2270\",\n    \"notparallel\": \"\\u2226\",\n    \"notprecedes\": \"\\u2280\",\n    \"notsubset\": \"\\u2284\",\n    \"notsucceeds\": \"\\u2281\",\n    \"notsuperset\": \"\\u2285\",\n    \"nowarmenian\": \"\\u0576\",\n    \"nparen\": \"\\u24a9\",\n    \"nssquare\": \"\\u33b1\",\n    \"nsuperior\": \"\\u207f\",\n    \"ntilde\": \"\\u00f1\",\n    \"nu\": \"\\u03bd\",\n    \"nuhiragana\": \"\\u306c\",\n    \"nukatakana\": \"\\u30cc\",\n    \"nukatakanahalfwidth\": \"\\uff87\",\n    \"nuktabengali\": \"\\u09bc\",\n    \"nuktadeva\": \"\\u093c\",\n    \"nuktagujarati\": \"\\u0abc\",\n    \"nuktagurmukhi\": \"\\u0a3c\",\n    \"numbersign\": \"\\u0023\",\n    \"numbersignmonospace\": \"\\uff03\",\n    \"numbersignsmall\": \"\\ufe5f\",\n    \"numeralsigngreek\": \"\\u0374\",\n    \"numeralsignlowergreek\": \"\\u0375\",\n    \"numero\": \"\\u2116\",\n    \"nun\": \"\\u05e0\",\n    \"nundagesh\": \"\\ufb40\",\n    
\"nundageshhebrew\": \"\\ufb40\",\n    \"nunhebrew\": \"\\u05e0\",\n    \"nvsquare\": \"\\u33b5\",\n    \"nwsquare\": \"\\u33bb\",\n    \"nyabengali\": \"\\u099e\",\n    \"nyadeva\": \"\\u091e\",\n    \"nyagujarati\": \"\\u0a9e\",\n    \"nyagurmukhi\": \"\\u0a1e\",\n    \"o\": \"\\u006f\",\n    \"oacute\": \"\\u00f3\",\n    \"oangthai\": \"\\u0e2d\",\n    \"obarred\": \"\\u0275\",\n    \"obarredcyrillic\": \"\\u04e9\",\n    \"obarreddieresiscyrillic\": \"\\u04eb\",\n    \"obengali\": \"\\u0993\",\n    \"obopomofo\": \"\\u311b\",\n    \"obreve\": \"\\u014f\",\n    \"ocandradeva\": \"\\u0911\",\n    \"ocandragujarati\": \"\\u0a91\",\n    \"ocandravowelsigndeva\": \"\\u0949\",\n    \"ocandravowelsigngujarati\": \"\\u0ac9\",\n    \"ocaron\": \"\\u01d2\",\n    \"ocircle\": \"\\u24de\",\n    \"ocircumflex\": \"\\u00f4\",\n    \"ocircumflexacute\": \"\\u1ed1\",\n    \"ocircumflexdotbelow\": \"\\u1ed9\",\n    \"ocircumflexgrave\": \"\\u1ed3\",\n    \"ocircumflexhookabove\": \"\\u1ed5\",\n    \"ocircumflextilde\": \"\\u1ed7\",\n    \"ocyrillic\": \"\\u043e\",\n    \"odblacute\": \"\\u0151\",\n    \"odblgrave\": \"\\u020d\",\n    \"odeva\": \"\\u0913\",\n    \"odieresis\": \"\\u00f6\",\n    \"odieresiscyrillic\": \"\\u04e7\",\n    \"odotbelow\": \"\\u1ecd\",\n    \"oe\": \"\\u0153\",\n    \"oekorean\": \"\\u315a\",\n    \"ogonek\": \"\\u02db\",\n    \"ogonekcmb\": \"\\u0328\",\n    \"ograve\": \"\\u00f2\",\n    \"ogujarati\": \"\\u0a93\",\n    \"oharmenian\": \"\\u0585\",\n    \"ohiragana\": \"\\u304a\",\n    \"ohookabove\": \"\\u1ecf\",\n    \"ohorn\": \"\\u01a1\",\n    \"ohornacute\": \"\\u1edb\",\n    \"ohorndotbelow\": \"\\u1ee3\",\n    \"ohorngrave\": \"\\u1edd\",\n    \"ohornhookabove\": \"\\u1edf\",\n    \"ohorntilde\": \"\\u1ee1\",\n    \"ohungarumlaut\": \"\\u0151\",\n    \"oi\": \"\\u01a3\",\n    \"oinvertedbreve\": \"\\u020f\",\n    \"okatakana\": \"\\u30aa\",\n    \"okatakanahalfwidth\": \"\\uff75\",\n    \"okorean\": \"\\u3157\",\n    \"olehebrew\": 
\"\\u05ab\",\n    \"omacron\": \"\\u014d\",\n    \"omacronacute\": \"\\u1e53\",\n    \"omacrongrave\": \"\\u1e51\",\n    \"omdeva\": \"\\u0950\",\n    \"omega\": \"\\u03c9\",\n    \"omega1\": \"\\u03d6\",\n    \"omegacyrillic\": \"\\u0461\",\n    \"omegalatinclosed\": \"\\u0277\",\n    \"omegaroundcyrillic\": \"\\u047b\",\n    \"omegatitlocyrillic\": \"\\u047d\",\n    \"omegatonos\": \"\\u03ce\",\n    \"omgujarati\": \"\\u0ad0\",\n    \"omicron\": \"\\u03bf\",\n    \"omicrontonos\": \"\\u03cc\",\n    \"omonospace\": \"\\uff4f\",\n    \"one\": \"\\u0031\",\n    \"onearabic\": \"\\u0661\",\n    \"onebengali\": \"\\u09e7\",\n    \"onecircle\": \"\\u2460\",\n    \"onecircleinversesansserif\": \"\\u278a\",\n    \"onedeva\": \"\\u0967\",\n    \"onedotenleader\": \"\\u2024\",\n    \"oneeighth\": \"\\u215b\",\n    \"onefitted\": \"\\uf6dc\",\n    \"onegujarati\": \"\\u0ae7\",\n    \"onegurmukhi\": \"\\u0a67\",\n    \"onehackarabic\": \"\\u0661\",\n    \"onehalf\": \"\\u00bd\",\n    \"onehangzhou\": \"\\u3021\",\n    \"oneideographicparen\": \"\\u3220\",\n    \"oneinferior\": \"\\u2081\",\n    \"onemonospace\": \"\\uff11\",\n    \"onenumeratorbengali\": \"\\u09f4\",\n    \"oneoldstyle\": \"\\uf731\",\n    \"oneparen\": \"\\u2474\",\n    \"oneperiod\": \"\\u2488\",\n    \"onepersian\": \"\\u06f1\",\n    \"onequarter\": \"\\u00bc\",\n    \"oneroman\": \"\\u2170\",\n    \"onesuperior\": \"\\u00b9\",\n    \"onethai\": \"\\u0e51\",\n    \"onethird\": \"\\u2153\",\n    \"oogonek\": \"\\u01eb\",\n    \"oogonekmacron\": \"\\u01ed\",\n    \"oogurmukhi\": \"\\u0a13\",\n    \"oomatragurmukhi\": \"\\u0a4b\",\n    \"oopen\": \"\\u0254\",\n    \"oparen\": \"\\u24aa\",\n    \"openbullet\": \"\\u25e6\",\n    \"option\": \"\\u2325\",\n    \"ordfeminine\": \"\\u00aa\",\n    \"ordmasculine\": \"\\u00ba\",\n    \"orthogonal\": \"\\u221f\",\n    \"oshortdeva\": \"\\u0912\",\n    \"oshortvowelsigndeva\": \"\\u094a\",\n    \"oslash\": \"\\u00f8\",\n    \"oslashacute\": \"\\u01ff\",\n    
\"osmallhiragana\": \"\\u3049\",\n    \"osmallkatakana\": \"\\u30a9\",\n    \"osmallkatakanahalfwidth\": \"\\uff6b\",\n    \"ostrokeacute\": \"\\u01ff\",\n    \"osuperior\": \"\\uf6f0\",\n    \"otcyrillic\": \"\\u047f\",\n    \"otilde\": \"\\u00f5\",\n    \"otildeacute\": \"\\u1e4d\",\n    \"otildedieresis\": \"\\u1e4f\",\n    \"oubopomofo\": \"\\u3121\",\n    \"overline\": \"\\u203e\",\n    \"overlinecenterline\": \"\\ufe4a\",\n    \"overlinecmb\": \"\\u0305\",\n    \"overlinedashed\": \"\\ufe49\",\n    \"overlinedblwavy\": \"\\ufe4c\",\n    \"overlinewavy\": \"\\ufe4b\",\n    \"overscore\": \"\\u00af\",\n    \"ovowelsignbengali\": \"\\u09cb\",\n    \"ovowelsigndeva\": \"\\u094b\",\n    \"ovowelsigngujarati\": \"\\u0acb\",\n    \"p\": \"\\u0070\",\n    \"paampssquare\": \"\\u3380\",\n    \"paasentosquare\": \"\\u332b\",\n    \"pabengali\": \"\\u09aa\",\n    \"pacute\": \"\\u1e55\",\n    \"padeva\": \"\\u092a\",\n    \"pagedown\": \"\\u21df\",\n    \"pageup\": \"\\u21de\",\n    \"pagujarati\": \"\\u0aaa\",\n    \"pagurmukhi\": \"\\u0a2a\",\n    \"pahiragana\": \"\\u3071\",\n    \"paiyannoithai\": \"\\u0e2f\",\n    \"pakatakana\": \"\\u30d1\",\n    \"palatalizationcyrilliccmb\": \"\\u0484\",\n    \"palochkacyrillic\": \"\\u04c0\",\n    \"pansioskorean\": \"\\u317f\",\n    \"paragraph\": \"\\u00b6\",\n    \"parallel\": \"\\u2225\",\n    \"parenleft\": \"\\u0028\",\n    \"parenleftaltonearabic\": \"\\ufd3e\",\n    \"parenleftbt\": \"\\uf8ed\",\n    \"parenleftex\": \"\\uf8ec\",\n    \"parenleftinferior\": \"\\u208d\",\n    \"parenleftmonospace\": \"\\uff08\",\n    \"parenleftsmall\": \"\\ufe59\",\n    \"parenleftsuperior\": \"\\u207d\",\n    \"parenlefttp\": \"\\uf8eb\",\n    \"parenleftvertical\": \"\\ufe35\",\n    \"parenright\": \"\\u0029\",\n    \"parenrightaltonearabic\": \"\\ufd3f\",\n    \"parenrightbt\": \"\\uf8f8\",\n    \"parenrightex\": \"\\uf8f7\",\n    \"parenrightinferior\": \"\\u208e\",\n    \"parenrightmonospace\": \"\\uff09\",\n    
\"parenrightsmall\": \"\\ufe5a\",\n    \"parenrightsuperior\": \"\\u207e\",\n    \"parenrighttp\": \"\\uf8f6\",\n    \"parenrightvertical\": \"\\ufe36\",\n    \"partialdiff\": \"\\u2202\",\n    \"paseqhebrew\": \"\\u05c0\",\n    \"pashtahebrew\": \"\\u0599\",\n    \"pasquare\": \"\\u33a9\",\n    \"patah\": \"\\u05b7\",\n    \"patah11\": \"\\u05b7\",\n    \"patah1d\": \"\\u05b7\",\n    \"patah2a\": \"\\u05b7\",\n    \"patahhebrew\": \"\\u05b7\",\n    \"patahnarrowhebrew\": \"\\u05b7\",\n    \"patahquarterhebrew\": \"\\u05b7\",\n    \"patahwidehebrew\": \"\\u05b7\",\n    \"pazerhebrew\": \"\\u05a1\",\n    \"pbopomofo\": \"\\u3106\",\n    \"pcircle\": \"\\u24df\",\n    \"pdotaccent\": \"\\u1e57\",\n    \"pe\": \"\\u05e4\",\n    \"pecyrillic\": \"\\u043f\",\n    \"pedagesh\": \"\\ufb44\",\n    \"pedageshhebrew\": \"\\ufb44\",\n    \"peezisquare\": \"\\u333b\",\n    \"pefinaldageshhebrew\": \"\\ufb43\",\n    \"peharabic\": \"\\u067e\",\n    \"peharmenian\": \"\\u057a\",\n    \"pehebrew\": \"\\u05e4\",\n    \"pehfinalarabic\": \"\\ufb57\",\n    \"pehinitialarabic\": \"\\ufb58\",\n    \"pehiragana\": \"\\u307a\",\n    \"pehmedialarabic\": \"\\ufb59\",\n    \"pekatakana\": \"\\u30da\",\n    \"pemiddlehookcyrillic\": \"\\u04a7\",\n    \"perafehebrew\": \"\\ufb4e\",\n    \"percent\": \"\\u0025\",\n    \"percentarabic\": \"\\u066a\",\n    \"percentmonospace\": \"\\uff05\",\n    \"percentsmall\": \"\\ufe6a\",\n    \"period\": \"\\u002e\",\n    \"periodarmenian\": \"\\u0589\",\n    \"periodcentered\": \"\\u00b7\",\n    \"periodhalfwidth\": \"\\uff61\",\n    \"periodinferior\": \"\\uf6e7\",\n    \"periodmonospace\": \"\\uff0e\",\n    \"periodsmall\": \"\\ufe52\",\n    \"periodsuperior\": \"\\uf6e8\",\n    \"perispomenigreekcmb\": \"\\u0342\",\n    \"perpendicular\": \"\\u22a5\",\n    \"perthousand\": \"\\u2030\",\n    \"peseta\": \"\\u20a7\",\n    \"pfsquare\": \"\\u338a\",\n    \"phabengali\": \"\\u09ab\",\n    \"phadeva\": \"\\u092b\",\n    \"phagujarati\": \"\\u0aab\",\n    
\"phagurmukhi\": \"\\u0a2b\",\n    \"phi\": \"\\u03c6\",\n    \"phi1\": \"\\u03d5\",\n    \"phieuphacirclekorean\": \"\\u327a\",\n    \"phieuphaparenkorean\": \"\\u321a\",\n    \"phieuphcirclekorean\": \"\\u326c\",\n    \"phieuphkorean\": \"\\u314d\",\n    \"phieuphparenkorean\": \"\\u320c\",\n    \"philatin\": \"\\u0278\",\n    \"phinthuthai\": \"\\u0e3a\",\n    \"phisymbolgreek\": \"\\u03d5\",\n    \"phook\": \"\\u01a5\",\n    \"phophanthai\": \"\\u0e1e\",\n    \"phophungthai\": \"\\u0e1c\",\n    \"phosamphaothai\": \"\\u0e20\",\n    \"pi\": \"\\u03c0\",\n    \"pieupacirclekorean\": \"\\u3273\",\n    \"pieupaparenkorean\": \"\\u3213\",\n    \"pieupcieuckorean\": \"\\u3176\",\n    \"pieupcirclekorean\": \"\\u3265\",\n    \"pieupkiyeokkorean\": \"\\u3172\",\n    \"pieupkorean\": \"\\u3142\",\n    \"pieupparenkorean\": \"\\u3205\",\n    \"pieupsioskiyeokkorean\": \"\\u3174\",\n    \"pieupsioskorean\": \"\\u3144\",\n    \"pieupsiostikeutkorean\": \"\\u3175\",\n    \"pieupthieuthkorean\": \"\\u3177\",\n    \"pieuptikeutkorean\": \"\\u3173\",\n    \"pihiragana\": \"\\u3074\",\n    \"pikatakana\": \"\\u30d4\",\n    \"pisymbolgreek\": \"\\u03d6\",\n    \"piwrarmenian\": \"\\u0583\",\n    \"plus\": \"\\u002b\",\n    \"plusbelowcmb\": \"\\u031f\",\n    \"pluscircle\": \"\\u2295\",\n    \"plusminus\": \"\\u00b1\",\n    \"plusmod\": \"\\u02d6\",\n    \"plusmonospace\": \"\\uff0b\",\n    \"plussmall\": \"\\ufe62\",\n    \"plussuperior\": \"\\u207a\",\n    \"pmonospace\": \"\\uff50\",\n    \"pmsquare\": \"\\u33d8\",\n    \"pohiragana\": \"\\u307d\",\n    \"pointingindexdownwhite\": \"\\u261f\",\n    \"pointingindexleftwhite\": \"\\u261c\",\n    \"pointingindexrightwhite\": \"\\u261e\",\n    \"pointingindexupwhite\": \"\\u261d\",\n    \"pokatakana\": \"\\u30dd\",\n    \"poplathai\": \"\\u0e1b\",\n    \"postalmark\": \"\\u3012\",\n    \"postalmarkface\": \"\\u3020\",\n    \"pparen\": \"\\u24ab\",\n    \"precedes\": \"\\u227a\",\n    \"prescription\": \"\\u211e\",\n    
\"primemod\": \"\\u02b9\",\n    \"primereversed\": \"\\u2035\",\n    \"product\": \"\\u220f\",\n    \"projective\": \"\\u2305\",\n    \"prolongedkana\": \"\\u30fc\",\n    \"propellor\": \"\\u2318\",\n    \"propersubset\": \"\\u2282\",\n    \"propersuperset\": \"\\u2283\",\n    \"proportion\": \"\\u2237\",\n    \"proportional\": \"\\u221d\",\n    \"psi\": \"\\u03c8\",\n    \"psicyrillic\": \"\\u0471\",\n    \"psilipneumatacyrilliccmb\": \"\\u0486\",\n    \"pssquare\": \"\\u33b0\",\n    \"puhiragana\": \"\\u3077\",\n    \"pukatakana\": \"\\u30d7\",\n    \"pvsquare\": \"\\u33b4\",\n    \"pwsquare\": \"\\u33ba\",\n    \"q\": \"\\u0071\",\n    \"qadeva\": \"\\u0958\",\n    \"qadmahebrew\": \"\\u05a8\",\n    \"qafarabic\": \"\\u0642\",\n    \"qaffinalarabic\": \"\\ufed6\",\n    \"qafinitialarabic\": \"\\ufed7\",\n    \"qafmedialarabic\": \"\\ufed8\",\n    \"qamats\": \"\\u05b8\",\n    \"qamats10\": \"\\u05b8\",\n    \"qamats1a\": \"\\u05b8\",\n    \"qamats1c\": \"\\u05b8\",\n    \"qamats27\": \"\\u05b8\",\n    \"qamats29\": \"\\u05b8\",\n    \"qamats33\": \"\\u05b8\",\n    \"qamatsde\": \"\\u05b8\",\n    \"qamatshebrew\": \"\\u05b8\",\n    \"qamatsnarrowhebrew\": \"\\u05b8\",\n    \"qamatsqatanhebrew\": \"\\u05b8\",\n    \"qamatsqatannarrowhebrew\": \"\\u05b8\",\n    \"qamatsqatanquarterhebrew\": \"\\u05b8\",\n    \"qamatsqatanwidehebrew\": \"\\u05b8\",\n    \"qamatsquarterhebrew\": \"\\u05b8\",\n    \"qamatswidehebrew\": \"\\u05b8\",\n    \"qarneyparahebrew\": \"\\u059f\",\n    \"qbopomofo\": \"\\u3111\",\n    \"qcircle\": \"\\u24e0\",\n    \"qhook\": \"\\u02a0\",\n    \"qmonospace\": \"\\uff51\",\n    \"qof\": \"\\u05e7\",\n    \"qofdagesh\": \"\\ufb47\",\n    \"qofdageshhebrew\": \"\\ufb47\",\n    \"qofhatafpatah\": \"\\u05e7\\u05b2\",\n    \"qofhatafpatahhebrew\": \"\\u05e7\\u05b2\",\n    \"qofhatafsegol\": \"\\u05e7\\u05b1\",\n    \"qofhatafsegolhebrew\": \"\\u05e7\\u05b1\",\n    \"qofhebrew\": \"\\u05e7\",\n    \"qofhiriq\": \"\\u05e7\\u05b4\",\n    
\"qofhiriqhebrew\": \"\\u05e7\\u05b4\",\n    \"qofholam\": \"\\u05e7\\u05b9\",\n    \"qofholamhebrew\": \"\\u05e7\\u05b9\",\n    \"qofpatah\": \"\\u05e7\\u05b7\",\n    \"qofpatahhebrew\": \"\\u05e7\\u05b7\",\n    \"qofqamats\": \"\\u05e7\\u05b8\",\n    \"qofqamatshebrew\": \"\\u05e7\\u05b8\",\n    \"qofqubuts\": \"\\u05e7\\u05bb\",\n    \"qofqubutshebrew\": \"\\u05e7\\u05bb\",\n    \"qofsegol\": \"\\u05e7\\u05b6\",\n    \"qofsegolhebrew\": \"\\u05e7\\u05b6\",\n    \"qofsheva\": \"\\u05e7\\u05b0\",\n    \"qofshevahebrew\": \"\\u05e7\\u05b0\",\n    \"qoftsere\": \"\\u05e7\\u05b5\",\n    \"qoftserehebrew\": \"\\u05e7\\u05b5\",\n    \"qparen\": \"\\u24ac\",\n    \"quarternote\": \"\\u2669\",\n    \"qubuts\": \"\\u05bb\",\n    \"qubuts18\": \"\\u05bb\",\n    \"qubuts25\": \"\\u05bb\",\n    \"qubuts31\": \"\\u05bb\",\n    \"qubutshebrew\": \"\\u05bb\",\n    \"qubutsnarrowhebrew\": \"\\u05bb\",\n    \"qubutsquarterhebrew\": \"\\u05bb\",\n    \"qubutswidehebrew\": \"\\u05bb\",\n    \"question\": \"\\u003f\",\n    \"questionarabic\": \"\\u061f\",\n    \"questionarmenian\": \"\\u055e\",\n    \"questiondown\": \"\\u00bf\",\n    \"questiondownsmall\": \"\\uf7bf\",\n    \"questiongreek\": \"\\u037e\",\n    \"questionmonospace\": \"\\uff1f\",\n    \"questionsmall\": \"\\uf73f\",\n    \"quotedbl\": \"\\u0022\",\n    \"quotedblbase\": \"\\u201e\",\n    \"quotedblleft\": \"\\u201c\",\n    \"quotedblmonospace\": \"\\uff02\",\n    \"quotedblprime\": \"\\u301e\",\n    \"quotedblprimereversed\": \"\\u301d\",\n    \"quotedblright\": \"\\u201d\",\n    \"quoteleft\": \"\\u2018\",\n    \"quoteleftreversed\": \"\\u201b\",\n    \"quotereversed\": \"\\u201b\",\n    \"quoteright\": \"\\u2019\",\n    \"quoterightn\": \"\\u0149\",\n    \"quotesinglbase\": \"\\u201a\",\n    \"quotesingle\": \"\\u0027\",\n    \"quotesinglemonospace\": \"\\uff07\",\n    \"r\": \"\\u0072\",\n    \"raarmenian\": \"\\u057c\",\n    \"rabengali\": \"\\u09b0\",\n    \"racute\": \"\\u0155\",\n    \"radeva\": 
\"\\u0930\",\n    \"radical\": \"\\u221a\",\n    \"radicalex\": \"\\uf8e5\",\n    \"radoverssquare\": \"\\u33ae\",\n    \"radoverssquaredsquare\": \"\\u33af\",\n    \"radsquare\": \"\\u33ad\",\n    \"rafe\": \"\\u05bf\",\n    \"rafehebrew\": \"\\u05bf\",\n    \"ragujarati\": \"\\u0ab0\",\n    \"ragurmukhi\": \"\\u0a30\",\n    \"rahiragana\": \"\\u3089\",\n    \"rakatakana\": \"\\u30e9\",\n    \"rakatakanahalfwidth\": \"\\uff97\",\n    \"ralowerdiagonalbengali\": \"\\u09f1\",\n    \"ramiddlediagonalbengali\": \"\\u09f0\",\n    \"ramshorn\": \"\\u0264\",\n    \"ratio\": \"\\u2236\",\n    \"rbopomofo\": \"\\u3116\",\n    \"rcaron\": \"\\u0159\",\n    \"rcedilla\": \"\\u0157\",\n    \"rcircle\": \"\\u24e1\",\n    \"rcommaaccent\": \"\\u0157\",\n    \"rdblgrave\": \"\\u0211\",\n    \"rdotaccent\": \"\\u1e59\",\n    \"rdotbelow\": \"\\u1e5b\",\n    \"rdotbelowmacron\": \"\\u1e5d\",\n    \"referencemark\": \"\\u203b\",\n    \"reflexsubset\": \"\\u2286\",\n    \"reflexsuperset\": \"\\u2287\",\n    \"registered\": \"\\u00ae\",\n    \"registersans\": \"\\uf8e8\",\n    \"registerserif\": \"\\uf6da\",\n    \"reharabic\": \"\\u0631\",\n    \"reharmenian\": \"\\u0580\",\n    \"rehfinalarabic\": \"\\ufeae\",\n    \"rehiragana\": \"\\u308c\",\n    \"rehyehaleflamarabic\": \"\\u0631\\ufef3\\ufe8e\\u0644\",\n    \"rekatakana\": \"\\u30ec\",\n    \"rekatakanahalfwidth\": \"\\uff9a\",\n    \"resh\": \"\\u05e8\",\n    \"reshdageshhebrew\": \"\\ufb48\",\n    \"reshhatafpatah\": \"\\u05e8\\u05b2\",\n    \"reshhatafpatahhebrew\": \"\\u05e8\\u05b2\",\n    \"reshhatafsegol\": \"\\u05e8\\u05b1\",\n    \"reshhatafsegolhebrew\": \"\\u05e8\\u05b1\",\n    \"reshhebrew\": \"\\u05e8\",\n    \"reshhiriq\": \"\\u05e8\\u05b4\",\n    \"reshhiriqhebrew\": \"\\u05e8\\u05b4\",\n    \"reshholam\": \"\\u05e8\\u05b9\",\n    \"reshholamhebrew\": \"\\u05e8\\u05b9\",\n    \"reshpatah\": \"\\u05e8\\u05b7\",\n    \"reshpatahhebrew\": \"\\u05e8\\u05b7\",\n    \"reshqamats\": \"\\u05e8\\u05b8\",\n    
\"reshqamatshebrew\": \"\\u05e8\\u05b8\",\n    \"reshqubuts\": \"\\u05e8\\u05bb\",\n    \"reshqubutshebrew\": \"\\u05e8\\u05bb\",\n    \"reshsegol\": \"\\u05e8\\u05b6\",\n    \"reshsegolhebrew\": \"\\u05e8\\u05b6\",\n    \"reshsheva\": \"\\u05e8\\u05b0\",\n    \"reshshevahebrew\": \"\\u05e8\\u05b0\",\n    \"reshtsere\": \"\\u05e8\\u05b5\",\n    \"reshtserehebrew\": \"\\u05e8\\u05b5\",\n    \"reversedtilde\": \"\\u223d\",\n    \"reviahebrew\": \"\\u0597\",\n    \"reviamugrashhebrew\": \"\\u0597\",\n    \"revlogicalnot\": \"\\u2310\",\n    \"rfishhook\": \"\\u027e\",\n    \"rfishhookreversed\": \"\\u027f\",\n    \"rhabengali\": \"\\u09dd\",\n    \"rhadeva\": \"\\u095d\",\n    \"rho\": \"\\u03c1\",\n    \"rhook\": \"\\u027d\",\n    \"rhookturned\": \"\\u027b\",\n    \"rhookturnedsuperior\": \"\\u02b5\",\n    \"rhosymbolgreek\": \"\\u03f1\",\n    \"rhotichookmod\": \"\\u02de\",\n    \"rieulacirclekorean\": \"\\u3271\",\n    \"rieulaparenkorean\": \"\\u3211\",\n    \"rieulcirclekorean\": \"\\u3263\",\n    \"rieulhieuhkorean\": \"\\u3140\",\n    \"rieulkiyeokkorean\": \"\\u313a\",\n    \"rieulkiyeoksioskorean\": \"\\u3169\",\n    \"rieulkorean\": \"\\u3139\",\n    \"rieulmieumkorean\": \"\\u313b\",\n    \"rieulpansioskorean\": \"\\u316c\",\n    \"rieulparenkorean\": \"\\u3203\",\n    \"rieulphieuphkorean\": \"\\u313f\",\n    \"rieulpieupkorean\": \"\\u313c\",\n    \"rieulpieupsioskorean\": \"\\u316b\",\n    \"rieulsioskorean\": \"\\u313d\",\n    \"rieulthieuthkorean\": \"\\u313e\",\n    \"rieultikeutkorean\": \"\\u316a\",\n    \"rieulyeorinhieuhkorean\": \"\\u316d\",\n    \"rightangle\": \"\\u221f\",\n    \"righttackbelowcmb\": \"\\u0319\",\n    \"righttriangle\": \"\\u22bf\",\n    \"rihiragana\": \"\\u308a\",\n    \"rikatakana\": \"\\u30ea\",\n    \"rikatakanahalfwidth\": \"\\uff98\",\n    \"ring\": \"\\u02da\",\n    \"ringbelowcmb\": \"\\u0325\",\n    \"ringcmb\": \"\\u030a\",\n    \"ringhalfleft\": \"\\u02bf\",\n    \"ringhalfleftarmenian\": \"\\u0559\",\n    
\"ringhalfleftbelowcmb\": \"\\u031c\",\n    \"ringhalfleftcentered\": \"\\u02d3\",\n    \"ringhalfright\": \"\\u02be\",\n    \"ringhalfrightbelowcmb\": \"\\u0339\",\n    \"ringhalfrightcentered\": \"\\u02d2\",\n    \"rinvertedbreve\": \"\\u0213\",\n    \"rittorusquare\": \"\\u3351\",\n    \"rlinebelow\": \"\\u1e5f\",\n    \"rlongleg\": \"\\u027c\",\n    \"rlonglegturned\": \"\\u027a\",\n    \"rmonospace\": \"\\uff52\",\n    \"rohiragana\": \"\\u308d\",\n    \"rokatakana\": \"\\u30ed\",\n    \"rokatakanahalfwidth\": \"\\uff9b\",\n    \"roruathai\": \"\\u0e23\",\n    \"rparen\": \"\\u24ad\",\n    \"rrabengali\": \"\\u09dc\",\n    \"rradeva\": \"\\u0931\",\n    \"rragurmukhi\": \"\\u0a5c\",\n    \"rreharabic\": \"\\u0691\",\n    \"rrehfinalarabic\": \"\\ufb8d\",\n    \"rrvocalicbengali\": \"\\u09e0\",\n    \"rrvocalicdeva\": \"\\u0960\",\n    \"rrvocalicgujarati\": \"\\u0ae0\",\n    \"rrvocalicvowelsignbengali\": \"\\u09c4\",\n    \"rrvocalicvowelsigndeva\": \"\\u0944\",\n    \"rrvocalicvowelsigngujarati\": \"\\u0ac4\",\n    \"rsuperior\": \"\\uf6f1\",\n    \"rtblock\": \"\\u2590\",\n    \"rturned\": \"\\u0279\",\n    \"rturnedsuperior\": \"\\u02b4\",\n    \"ruhiragana\": \"\\u308b\",\n    \"rukatakana\": \"\\u30eb\",\n    \"rukatakanahalfwidth\": \"\\uff99\",\n    \"rupeemarkbengali\": \"\\u09f2\",\n    \"rupeesignbengali\": \"\\u09f3\",\n    \"rupiah\": \"\\uf6dd\",\n    \"ruthai\": \"\\u0e24\",\n    \"rvocalicbengali\": \"\\u098b\",\n    \"rvocalicdeva\": \"\\u090b\",\n    \"rvocalicgujarati\": \"\\u0a8b\",\n    \"rvocalicvowelsignbengali\": \"\\u09c3\",\n    \"rvocalicvowelsigndeva\": \"\\u0943\",\n    \"rvocalicvowelsigngujarati\": \"\\u0ac3\",\n    \"s\": \"\\u0073\",\n    \"sabengali\": \"\\u09b8\",\n    \"sacute\": \"\\u015b\",\n    \"sacutedotaccent\": \"\\u1e65\",\n    \"sadarabic\": \"\\u0635\",\n    \"sadeva\": \"\\u0938\",\n    \"sadfinalarabic\": \"\\ufeba\",\n    \"sadinitialarabic\": \"\\ufebb\",\n    \"sadmedialarabic\": \"\\ufebc\",\n    
\"sagujarati\": \"\\u0ab8\",\n    \"sagurmukhi\": \"\\u0a38\",\n    \"sahiragana\": \"\\u3055\",\n    \"sakatakana\": \"\\u30b5\",\n    \"sakatakanahalfwidth\": \"\\uff7b\",\n    \"sallallahoualayhewasallamarabic\": \"\\ufdfa\",\n    \"samekh\": \"\\u05e1\",\n    \"samekhdagesh\": \"\\ufb41\",\n    \"samekhdageshhebrew\": \"\\ufb41\",\n    \"samekhhebrew\": \"\\u05e1\",\n    \"saraaathai\": \"\\u0e32\",\n    \"saraaethai\": \"\\u0e41\",\n    \"saraaimaimalaithai\": \"\\u0e44\",\n    \"saraaimaimuanthai\": \"\\u0e43\",\n    \"saraamthai\": \"\\u0e33\",\n    \"saraathai\": \"\\u0e30\",\n    \"saraethai\": \"\\u0e40\",\n    \"saraiileftthai\": \"\\uf886\",\n    \"saraiithai\": \"\\u0e35\",\n    \"saraileftthai\": \"\\uf885\",\n    \"saraithai\": \"\\u0e34\",\n    \"saraothai\": \"\\u0e42\",\n    \"saraueeleftthai\": \"\\uf888\",\n    \"saraueethai\": \"\\u0e37\",\n    \"saraueleftthai\": \"\\uf887\",\n    \"sarauethai\": \"\\u0e36\",\n    \"sarauthai\": \"\\u0e38\",\n    \"sarauuthai\": \"\\u0e39\",\n    \"sbopomofo\": \"\\u3119\",\n    \"scaron\": \"\\u0161\",\n    \"scarondotaccent\": \"\\u1e67\",\n    \"scedilla\": \"\\u015f\",\n    \"schwa\": \"\\u0259\",\n    \"schwacyrillic\": \"\\u04d9\",\n    \"schwadieresiscyrillic\": \"\\u04db\",\n    \"schwahook\": \"\\u025a\",\n    \"scircle\": \"\\u24e2\",\n    \"scircumflex\": \"\\u015d\",\n    \"scommaaccent\": \"\\u0219\",\n    \"sdotaccent\": \"\\u1e61\",\n    \"sdotbelow\": \"\\u1e63\",\n    \"sdotbelowdotaccent\": \"\\u1e69\",\n    \"seagullbelowcmb\": \"\\u033c\",\n    \"second\": \"\\u2033\",\n    \"secondtonechinese\": \"\\u02ca\",\n    \"section\": \"\\u00a7\",\n    \"seenarabic\": \"\\u0633\",\n    \"seenfinalarabic\": \"\\ufeb2\",\n    \"seeninitialarabic\": \"\\ufeb3\",\n    \"seenmedialarabic\": \"\\ufeb4\",\n    \"segol\": \"\\u05b6\",\n    \"segol13\": \"\\u05b6\",\n    \"segol1f\": \"\\u05b6\",\n    \"segol2c\": \"\\u05b6\",\n    \"segolhebrew\": \"\\u05b6\",\n    \"segolnarrowhebrew\": \"\\u05b6\",\n    
\"segolquarterhebrew\": \"\\u05b6\",\n    \"segoltahebrew\": \"\\u0592\",\n    \"segolwidehebrew\": \"\\u05b6\",\n    \"seharmenian\": \"\\u057d\",\n    \"sehiragana\": \"\\u305b\",\n    \"sekatakana\": \"\\u30bb\",\n    \"sekatakanahalfwidth\": \"\\uff7e\",\n    \"semicolon\": \"\\u003b\",\n    \"semicolonarabic\": \"\\u061b\",\n    \"semicolonmonospace\": \"\\uff1b\",\n    \"semicolonsmall\": \"\\ufe54\",\n    \"semivoicedmarkkana\": \"\\u309c\",\n    \"semivoicedmarkkanahalfwidth\": \"\\uff9f\",\n    \"sentisquare\": \"\\u3322\",\n    \"sentosquare\": \"\\u3323\",\n    \"seven\": \"\\u0037\",\n    \"sevenarabic\": \"\\u0667\",\n    \"sevenbengali\": \"\\u09ed\",\n    \"sevencircle\": \"\\u2466\",\n    \"sevencircleinversesansserif\": \"\\u2790\",\n    \"sevendeva\": \"\\u096d\",\n    \"seveneighths\": \"\\u215e\",\n    \"sevengujarati\": \"\\u0aed\",\n    \"sevengurmukhi\": \"\\u0a6d\",\n    \"sevenhackarabic\": \"\\u0667\",\n    \"sevenhangzhou\": \"\\u3027\",\n    \"sevenideographicparen\": \"\\u3226\",\n    \"seveninferior\": \"\\u2087\",\n    \"sevenmonospace\": \"\\uff17\",\n    \"sevenoldstyle\": \"\\uf737\",\n    \"sevenparen\": \"\\u247a\",\n    \"sevenperiod\": \"\\u248e\",\n    \"sevenpersian\": \"\\u06f7\",\n    \"sevenroman\": \"\\u2176\",\n    \"sevensuperior\": \"\\u2077\",\n    \"seventeencircle\": \"\\u2470\",\n    \"seventeenparen\": \"\\u2484\",\n    \"seventeenperiod\": \"\\u2498\",\n    \"seventhai\": \"\\u0e57\",\n    \"sfthyphen\": \"\\u00ad\",\n    \"shaarmenian\": \"\\u0577\",\n    \"shabengali\": \"\\u09b6\",\n    \"shacyrillic\": \"\\u0448\",\n    \"shaddaarabic\": \"\\u0651\",\n    \"shaddadammaarabic\": \"\\ufc61\",\n    \"shaddadammatanarabic\": \"\\ufc5e\",\n    \"shaddafathaarabic\": \"\\ufc60\",\n    \"shaddafathatanarabic\": \"\\u0651\\u064b\",\n    \"shaddakasraarabic\": \"\\ufc62\",\n    \"shaddakasratanarabic\": \"\\ufc5f\",\n    \"shade\": \"\\u2592\",\n    \"shadedark\": \"\\u2593\",\n    \"shadelight\": \"\\u2591\",\n    
\"shademedium\": \"\\u2592\",\n    \"shadeva\": \"\\u0936\",\n    \"shagujarati\": \"\\u0ab6\",\n    \"shagurmukhi\": \"\\u0a36\",\n    \"shalshelethebrew\": \"\\u0593\",\n    \"shbopomofo\": \"\\u3115\",\n    \"shchacyrillic\": \"\\u0449\",\n    \"sheenarabic\": \"\\u0634\",\n    \"sheenfinalarabic\": \"\\ufeb6\",\n    \"sheeninitialarabic\": \"\\ufeb7\",\n    \"sheenmedialarabic\": \"\\ufeb8\",\n    \"sheicoptic\": \"\\u03e3\",\n    \"sheqel\": \"\\u20aa\",\n    \"sheqelhebrew\": \"\\u20aa\",\n    \"sheva\": \"\\u05b0\",\n    \"sheva115\": \"\\u05b0\",\n    \"sheva15\": \"\\u05b0\",\n    \"sheva22\": \"\\u05b0\",\n    \"sheva2e\": \"\\u05b0\",\n    \"shevahebrew\": \"\\u05b0\",\n    \"shevanarrowhebrew\": \"\\u05b0\",\n    \"shevaquarterhebrew\": \"\\u05b0\",\n    \"shevawidehebrew\": \"\\u05b0\",\n    \"shhacyrillic\": \"\\u04bb\",\n    \"shimacoptic\": \"\\u03ed\",\n    \"shin\": \"\\u05e9\",\n    \"shindagesh\": \"\\ufb49\",\n    \"shindageshhebrew\": \"\\ufb49\",\n    \"shindageshshindot\": \"\\ufb2c\",\n    \"shindageshshindothebrew\": \"\\ufb2c\",\n    \"shindageshsindot\": \"\\ufb2d\",\n    \"shindageshsindothebrew\": \"\\ufb2d\",\n    \"shindothebrew\": \"\\u05c1\",\n    \"shinhebrew\": \"\\u05e9\",\n    \"shinshindot\": \"\\ufb2a\",\n    \"shinshindothebrew\": \"\\ufb2a\",\n    \"shinsindot\": \"\\ufb2b\",\n    \"shinsindothebrew\": \"\\ufb2b\",\n    \"shook\": \"\\u0282\",\n    \"sigma\": \"\\u03c3\",\n    \"sigma1\": \"\\u03c2\",\n    \"sigmafinal\": \"\\u03c2\",\n    \"sigmalunatesymbolgreek\": \"\\u03f2\",\n    \"sihiragana\": \"\\u3057\",\n    \"sikatakana\": \"\\u30b7\",\n    \"sikatakanahalfwidth\": \"\\uff7c\",\n    \"siluqhebrew\": \"\\u05bd\",\n    \"siluqlefthebrew\": \"\\u05bd\",\n    \"similar\": \"\\u223c\",\n    \"sindothebrew\": \"\\u05c2\",\n    \"siosacirclekorean\": \"\\u3274\",\n    \"siosaparenkorean\": \"\\u3214\",\n    \"sioscieuckorean\": \"\\u317e\",\n    \"sioscirclekorean\": \"\\u3266\",\n    \"sioskiyeokkorean\": 
\"\\u317a\",\n    \"sioskorean\": \"\\u3145\",\n    \"siosnieunkorean\": \"\\u317b\",\n    \"siosparenkorean\": \"\\u3206\",\n    \"siospieupkorean\": \"\\u317d\",\n    \"siostikeutkorean\": \"\\u317c\",\n    \"six\": \"\\u0036\",\n    \"sixarabic\": \"\\u0666\",\n    \"sixbengali\": \"\\u09ec\",\n    \"sixcircle\": \"\\u2465\",\n    \"sixcircleinversesansserif\": \"\\u278f\",\n    \"sixdeva\": \"\\u096c\",\n    \"sixgujarati\": \"\\u0aec\",\n    \"sixgurmukhi\": \"\\u0a6c\",\n    \"sixhackarabic\": \"\\u0666\",\n    \"sixhangzhou\": \"\\u3026\",\n    \"sixideographicparen\": \"\\u3225\",\n    \"sixinferior\": \"\\u2086\",\n    \"sixmonospace\": \"\\uff16\",\n    \"sixoldstyle\": \"\\uf736\",\n    \"sixparen\": \"\\u2479\",\n    \"sixperiod\": \"\\u248d\",\n    \"sixpersian\": \"\\u06f6\",\n    \"sixroman\": \"\\u2175\",\n    \"sixsuperior\": \"\\u2076\",\n    \"sixteencircle\": \"\\u246f\",\n    \"sixteencurrencydenominatorbengali\": \"\\u09f9\",\n    \"sixteenparen\": \"\\u2483\",\n    \"sixteenperiod\": \"\\u2497\",\n    \"sixthai\": \"\\u0e56\",\n    \"slash\": \"\\u002f\",\n    \"slashmonospace\": \"\\uff0f\",\n    \"slong\": \"\\u017f\",\n    \"slongdotaccent\": \"\\u1e9b\",\n    \"smileface\": \"\\u263a\",\n    \"smonospace\": \"\\uff53\",\n    \"sofpasuqhebrew\": \"\\u05c3\",\n    \"softhyphen\": \"\\u00ad\",\n    \"softsigncyrillic\": \"\\u044c\",\n    \"sohiragana\": \"\\u305d\",\n    \"sokatakana\": \"\\u30bd\",\n    \"sokatakanahalfwidth\": \"\\uff7f\",\n    \"soliduslongoverlaycmb\": \"\\u0338\",\n    \"solidusshortoverlaycmb\": \"\\u0337\",\n    \"sorusithai\": \"\\u0e29\",\n    \"sosalathai\": \"\\u0e28\",\n    \"sosothai\": \"\\u0e0b\",\n    \"sosuathai\": \"\\u0e2a\",\n    \"space\": \"\\u0020\",\n    \"spacehackarabic\": \"\\u0020\",\n    \"spade\": \"\\u2660\",\n    \"spadesuitblack\": \"\\u2660\",\n    \"spadesuitwhite\": \"\\u2664\",\n    \"sparen\": \"\\u24ae\",\n    \"squarebelowcmb\": \"\\u033b\",\n    \"squarecc\": \"\\u33c4\",\n    
\"squarecm\": \"\\u339d\",\n    \"squarediagonalcrosshatchfill\": \"\\u25a9\",\n    \"squarehorizontalfill\": \"\\u25a4\",\n    \"squarekg\": \"\\u338f\",\n    \"squarekm\": \"\\u339e\",\n    \"squarekmcapital\": \"\\u33ce\",\n    \"squareln\": \"\\u33d1\",\n    \"squarelog\": \"\\u33d2\",\n    \"squaremg\": \"\\u338e\",\n    \"squaremil\": \"\\u33d5\",\n    \"squaremm\": \"\\u339c\",\n    \"squaremsquared\": \"\\u33a1\",\n    \"squareorthogonalcrosshatchfill\": \"\\u25a6\",\n    \"squareupperlefttolowerrightfill\": \"\\u25a7\",\n    \"squareupperrighttolowerleftfill\": \"\\u25a8\",\n    \"squareverticalfill\": \"\\u25a5\",\n    \"squarewhitewithsmallblack\": \"\\u25a3\",\n    \"srsquare\": \"\\u33db\",\n    \"ssabengali\": \"\\u09b7\",\n    \"ssadeva\": \"\\u0937\",\n    \"ssagujarati\": \"\\u0ab7\",\n    \"ssangcieuckorean\": \"\\u3149\",\n    \"ssanghieuhkorean\": \"\\u3185\",\n    \"ssangieungkorean\": \"\\u3180\",\n    \"ssangkiyeokkorean\": \"\\u3132\",\n    \"ssangnieunkorean\": \"\\u3165\",\n    \"ssangpieupkorean\": \"\\u3143\",\n    \"ssangsioskorean\": \"\\u3146\",\n    \"ssangtikeutkorean\": \"\\u3138\",\n    \"ssuperior\": \"\\uf6f2\",\n    \"sterling\": \"\\u00a3\",\n    \"sterlingmonospace\": \"\\uffe1\",\n    \"strokelongoverlaycmb\": \"\\u0336\",\n    \"strokeshortoverlaycmb\": \"\\u0335\",\n    \"subset\": \"\\u2282\",\n    \"subsetnotequal\": \"\\u228a\",\n    \"subsetorequal\": \"\\u2286\",\n    \"succeeds\": \"\\u227b\",\n    \"suchthat\": \"\\u220b\",\n    \"suhiragana\": \"\\u3059\",\n    \"sukatakana\": \"\\u30b9\",\n    \"sukatakanahalfwidth\": \"\\uff7d\",\n    \"sukunarabic\": \"\\u0652\",\n    \"summation\": \"\\u2211\",\n    \"sun\": \"\\u263c\",\n    \"superset\": \"\\u2283\",\n    \"supersetnotequal\": \"\\u228b\",\n    \"supersetorequal\": \"\\u2287\",\n    \"svsquare\": \"\\u33dc\",\n    \"syouwaerasquare\": \"\\u337c\",\n    \"t\": \"\\u0074\",\n    \"tabengali\": \"\\u09a4\",\n    \"tackdown\": \"\\u22a4\",\n    \"tackleft\": 
\"\\u22a3\",\n    \"tadeva\": \"\\u0924\",\n    \"tagujarati\": \"\\u0aa4\",\n    \"tagurmukhi\": \"\\u0a24\",\n    \"taharabic\": \"\\u0637\",\n    \"tahfinalarabic\": \"\\ufec2\",\n    \"tahinitialarabic\": \"\\ufec3\",\n    \"tahiragana\": \"\\u305f\",\n    \"tahmedialarabic\": \"\\ufec4\",\n    \"taisyouerasquare\": \"\\u337d\",\n    \"takatakana\": \"\\u30bf\",\n    \"takatakanahalfwidth\": \"\\uff80\",\n    \"tatweelarabic\": \"\\u0640\",\n    \"tau\": \"\\u03c4\",\n    \"tav\": \"\\u05ea\",\n    \"tavdages\": \"\\ufb4a\",\n    \"tavdagesh\": \"\\ufb4a\",\n    \"tavdageshhebrew\": \"\\ufb4a\",\n    \"tavhebrew\": \"\\u05ea\",\n    \"tbar\": \"\\u0167\",\n    \"tbopomofo\": \"\\u310a\",\n    \"tcaron\": \"\\u0165\",\n    \"tccurl\": \"\\u02a8\",\n    \"tcedilla\": \"\\u0163\",\n    \"tcheharabic\": \"\\u0686\",\n    \"tchehfinalarabic\": \"\\ufb7b\",\n    \"tchehinitialarabic\": \"\\ufb7c\",\n    \"tchehmedialarabic\": \"\\ufb7d\",\n    \"tchehmeeminitialarabic\": \"\\ufb7c\\ufee4\",\n    \"tcircle\": \"\\u24e3\",\n    \"tcircumflexbelow\": \"\\u1e71\",\n    \"tcommaaccent\": \"\\u0163\",\n    \"tdieresis\": \"\\u1e97\",\n    \"tdotaccent\": \"\\u1e6b\",\n    \"tdotbelow\": \"\\u1e6d\",\n    \"tecyrillic\": \"\\u0442\",\n    \"tedescendercyrillic\": \"\\u04ad\",\n    \"teharabic\": \"\\u062a\",\n    \"tehfinalarabic\": \"\\ufe96\",\n    \"tehhahinitialarabic\": \"\\ufca2\",\n    \"tehhahisolatedarabic\": \"\\ufc0c\",\n    \"tehinitialarabic\": \"\\ufe97\",\n    \"tehiragana\": \"\\u3066\",\n    \"tehjeeminitialarabic\": \"\\ufca1\",\n    \"tehjeemisolatedarabic\": \"\\ufc0b\",\n    \"tehmarbutaarabic\": \"\\u0629\",\n    \"tehmarbutafinalarabic\": \"\\ufe94\",\n    \"tehmedialarabic\": \"\\ufe98\",\n    \"tehmeeminitialarabic\": \"\\ufca4\",\n    \"tehmeemisolatedarabic\": \"\\ufc0e\",\n    \"tehnoonfinalarabic\": \"\\ufc73\",\n    \"tekatakana\": \"\\u30c6\",\n    \"tekatakanahalfwidth\": \"\\uff83\",\n    \"telephone\": \"\\u2121\",\n    \"telephoneblack\": 
\"\\u260e\",\n    \"telishagedolahebrew\": \"\\u05a0\",\n    \"telishaqetanahebrew\": \"\\u05a9\",\n    \"tencircle\": \"\\u2469\",\n    \"tenideographicparen\": \"\\u3229\",\n    \"tenparen\": \"\\u247d\",\n    \"tenperiod\": \"\\u2491\",\n    \"tenroman\": \"\\u2179\",\n    \"tesh\": \"\\u02a7\",\n    \"tet\": \"\\u05d8\",\n    \"tetdagesh\": \"\\ufb38\",\n    \"tetdageshhebrew\": \"\\ufb38\",\n    \"tethebrew\": \"\\u05d8\",\n    \"tetsecyrillic\": \"\\u04b5\",\n    \"tevirhebrew\": \"\\u059b\",\n    \"tevirlefthebrew\": \"\\u059b\",\n    \"thabengali\": \"\\u09a5\",\n    \"thadeva\": \"\\u0925\",\n    \"thagujarati\": \"\\u0aa5\",\n    \"thagurmukhi\": \"\\u0a25\",\n    \"thalarabic\": \"\\u0630\",\n    \"thalfinalarabic\": \"\\ufeac\",\n    \"thanthakhatlowleftthai\": \"\\uf898\",\n    \"thanthakhatlowrightthai\": \"\\uf897\",\n    \"thanthakhatthai\": \"\\u0e4c\",\n    \"thanthakhatupperleftthai\": \"\\uf896\",\n    \"theharabic\": \"\\u062b\",\n    \"thehfinalarabic\": \"\\ufe9a\",\n    \"thehinitialarabic\": \"\\ufe9b\",\n    \"thehmedialarabic\": \"\\ufe9c\",\n    \"thereexists\": \"\\u2203\",\n    \"therefore\": \"\\u2234\",\n    \"theta\": \"\\u03b8\",\n    \"theta1\": \"\\u03d1\",\n    \"thetasymbolgreek\": \"\\u03d1\",\n    \"thieuthacirclekorean\": \"\\u3279\",\n    \"thieuthaparenkorean\": \"\\u3219\",\n    \"thieuthcirclekorean\": \"\\u326b\",\n    \"thieuthkorean\": \"\\u314c\",\n    \"thieuthparenkorean\": \"\\u320b\",\n    \"thirteencircle\": \"\\u246c\",\n    \"thirteenparen\": \"\\u2480\",\n    \"thirteenperiod\": \"\\u2494\",\n    \"thonangmonthothai\": \"\\u0e11\",\n    \"thook\": \"\\u01ad\",\n    \"thophuthaothai\": \"\\u0e12\",\n    \"thorn\": \"\\u00fe\",\n    \"thothahanthai\": \"\\u0e17\",\n    \"thothanthai\": \"\\u0e10\",\n    \"thothongthai\": \"\\u0e18\",\n    \"thothungthai\": \"\\u0e16\",\n    \"thousandcyrillic\": \"\\u0482\",\n    \"thousandsseparatorarabic\": \"\\u066c\",\n    \"thousandsseparatorpersian\": \"\\u066c\",\n    
\"three\": \"\\u0033\",\n    \"threearabic\": \"\\u0663\",\n    \"threebengali\": \"\\u09e9\",\n    \"threecircle\": \"\\u2462\",\n    \"threecircleinversesansserif\": \"\\u278c\",\n    \"threedeva\": \"\\u0969\",\n    \"threeeighths\": \"\\u215c\",\n    \"threegujarati\": \"\\u0ae9\",\n    \"threegurmukhi\": \"\\u0a69\",\n    \"threehackarabic\": \"\\u0663\",\n    \"threehangzhou\": \"\\u3023\",\n    \"threeideographicparen\": \"\\u3222\",\n    \"threeinferior\": \"\\u2083\",\n    \"threemonospace\": \"\\uff13\",\n    \"threenumeratorbengali\": \"\\u09f6\",\n    \"threeoldstyle\": \"\\uf733\",\n    \"threeparen\": \"\\u2476\",\n    \"threeperiod\": \"\\u248a\",\n    \"threepersian\": \"\\u06f3\",\n    \"threequarters\": \"\\u00be\",\n    \"threequartersemdash\": \"\\uf6de\",\n    \"threeroman\": \"\\u2172\",\n    \"threesuperior\": \"\\u00b3\",\n    \"threethai\": \"\\u0e53\",\n    \"thzsquare\": \"\\u3394\",\n    \"tihiragana\": \"\\u3061\",\n    \"tikatakana\": \"\\u30c1\",\n    \"tikatakanahalfwidth\": \"\\uff81\",\n    \"tikeutacirclekorean\": \"\\u3270\",\n    \"tikeutaparenkorean\": \"\\u3210\",\n    \"tikeutcirclekorean\": \"\\u3262\",\n    \"tikeutkorean\": \"\\u3137\",\n    \"tikeutparenkorean\": \"\\u3202\",\n    \"tilde\": \"\\u02dc\",\n    \"tildebelowcmb\": \"\\u0330\",\n    \"tildecmb\": \"\\u0303\",\n    \"tildecomb\": \"\\u0303\",\n    \"tildedoublecmb\": \"\\u0360\",\n    \"tildeoperator\": \"\\u223c\",\n    \"tildeoverlaycmb\": \"\\u0334\",\n    \"tildeverticalcmb\": \"\\u033e\",\n    \"timescircle\": \"\\u2297\",\n    \"tipehahebrew\": \"\\u0596\",\n    \"tipehalefthebrew\": \"\\u0596\",\n    \"tippigurmukhi\": \"\\u0a70\",\n    \"titlocyrilliccmb\": \"\\u0483\",\n    \"tiwnarmenian\": \"\\u057f\",\n    \"tlinebelow\": \"\\u1e6f\",\n    \"tmonospace\": \"\\uff54\",\n    \"toarmenian\": \"\\u0569\",\n    \"tohiragana\": \"\\u3068\",\n    \"tokatakana\": \"\\u30c8\",\n    \"tokatakanahalfwidth\": \"\\uff84\",\n    \"tonebarextrahighmod\": 
\"\\u02e5\",\n    \"tonebarextralowmod\": \"\\u02e9\",\n    \"tonebarhighmod\": \"\\u02e6\",\n    \"tonebarlowmod\": \"\\u02e8\",\n    \"tonebarmidmod\": \"\\u02e7\",\n    \"tonefive\": \"\\u01bd\",\n    \"tonesix\": \"\\u0185\",\n    \"tonetwo\": \"\\u01a8\",\n    \"tonos\": \"\\u0384\",\n    \"tonsquare\": \"\\u3327\",\n    \"topatakthai\": \"\\u0e0f\",\n    \"tortoiseshellbracketleft\": \"\\u3014\",\n    \"tortoiseshellbracketleftsmall\": \"\\ufe5d\",\n    \"tortoiseshellbracketleftvertical\": \"\\ufe39\",\n    \"tortoiseshellbracketright\": \"\\u3015\",\n    \"tortoiseshellbracketrightsmall\": \"\\ufe5e\",\n    \"tortoiseshellbracketrightvertical\": \"\\ufe3a\",\n    \"totaothai\": \"\\u0e15\",\n    \"tpalatalhook\": \"\\u01ab\",\n    \"tparen\": \"\\u24af\",\n    \"trademark\": \"\\u2122\",\n    \"trademarksans\": \"\\uf8ea\",\n    \"trademarkserif\": \"\\uf6db\",\n    \"tretroflexhook\": \"\\u0288\",\n    \"triagdn\": \"\\u25bc\",\n    \"triaglf\": \"\\u25c4\",\n    \"triagrt\": \"\\u25ba\",\n    \"triagup\": \"\\u25b2\",\n    \"ts\": \"\\u02a6\",\n    \"tsadi\": \"\\u05e6\",\n    \"tsadidagesh\": \"\\ufb46\",\n    \"tsadidageshhebrew\": \"\\ufb46\",\n    \"tsadihebrew\": \"\\u05e6\",\n    \"tsecyrillic\": \"\\u0446\",\n    \"tsere\": \"\\u05b5\",\n    \"tsere12\": \"\\u05b5\",\n    \"tsere1e\": \"\\u05b5\",\n    \"tsere2b\": \"\\u05b5\",\n    \"tserehebrew\": \"\\u05b5\",\n    \"tserenarrowhebrew\": \"\\u05b5\",\n    \"tserequarterhebrew\": \"\\u05b5\",\n    \"tserewidehebrew\": \"\\u05b5\",\n    \"tshecyrillic\": \"\\u045b\",\n    \"tsuperior\": \"\\uf6f3\",\n    \"ttabengali\": \"\\u099f\",\n    \"ttadeva\": \"\\u091f\",\n    \"ttagujarati\": \"\\u0a9f\",\n    \"ttagurmukhi\": \"\\u0a1f\",\n    \"tteharabic\": \"\\u0679\",\n    \"ttehfinalarabic\": \"\\ufb67\",\n    \"ttehinitialarabic\": \"\\ufb68\",\n    \"ttehmedialarabic\": \"\\ufb69\",\n    \"tthabengali\": \"\\u09a0\",\n    \"tthadeva\": \"\\u0920\",\n    \"tthagujarati\": \"\\u0aa0\",\n    
\"tthagurmukhi\": \"\\u0a20\",\n    \"tturned\": \"\\u0287\",\n    \"tuhiragana\": \"\\u3064\",\n    \"tukatakana\": \"\\u30c4\",\n    \"tukatakanahalfwidth\": \"\\uff82\",\n    \"tusmallhiragana\": \"\\u3063\",\n    \"tusmallkatakana\": \"\\u30c3\",\n    \"tusmallkatakanahalfwidth\": \"\\uff6f\",\n    \"twelvecircle\": \"\\u246b\",\n    \"twelveparen\": \"\\u247f\",\n    \"twelveperiod\": \"\\u2493\",\n    \"twelveroman\": \"\\u217b\",\n    \"twentycircle\": \"\\u2473\",\n    \"twentyhangzhou\": \"\\u5344\",\n    \"twentyparen\": \"\\u2487\",\n    \"twentyperiod\": \"\\u249b\",\n    \"two\": \"\\u0032\",\n    \"twoarabic\": \"\\u0662\",\n    \"twobengali\": \"\\u09e8\",\n    \"twocircle\": \"\\u2461\",\n    \"twocircleinversesansserif\": \"\\u278b\",\n    \"twodeva\": \"\\u0968\",\n    \"twodotenleader\": \"\\u2025\",\n    \"twodotleader\": \"\\u2025\",\n    \"twodotleadervertical\": \"\\ufe30\",\n    \"twogujarati\": \"\\u0ae8\",\n    \"twogurmukhi\": \"\\u0a68\",\n    \"twohackarabic\": \"\\u0662\",\n    \"twohangzhou\": \"\\u3022\",\n    \"twoideographicparen\": \"\\u3221\",\n    \"twoinferior\": \"\\u2082\",\n    \"twomonospace\": \"\\uff12\",\n    \"twonumeratorbengali\": \"\\u09f5\",\n    \"twooldstyle\": \"\\uf732\",\n    \"twoparen\": \"\\u2475\",\n    \"twoperiod\": \"\\u2489\",\n    \"twopersian\": \"\\u06f2\",\n    \"tworoman\": \"\\u2171\",\n    \"twostroke\": \"\\u01bb\",\n    \"twosuperior\": \"\\u00b2\",\n    \"twothai\": \"\\u0e52\",\n    \"twothirds\": \"\\u2154\",\n    \"u\": \"\\u0075\",\n    \"uacute\": \"\\u00fa\",\n    \"ubar\": \"\\u0289\",\n    \"ubengali\": \"\\u0989\",\n    \"ubopomofo\": \"\\u3128\",\n    \"ubreve\": \"\\u016d\",\n    \"ucaron\": \"\\u01d4\",\n    \"ucircle\": \"\\u24e4\",\n    \"ucircumflex\": \"\\u00fb\",\n    \"ucircumflexbelow\": \"\\u1e77\",\n    \"ucyrillic\": \"\\u0443\",\n    \"udattadeva\": \"\\u0951\",\n    \"udblacute\": \"\\u0171\",\n    \"udblgrave\": \"\\u0215\",\n    \"udeva\": \"\\u0909\",\n    
\"udieresis\": \"\\u00fc\",\n    \"udieresisacute\": \"\\u01d8\",\n    \"udieresisbelow\": \"\\u1e73\",\n    \"udieresiscaron\": \"\\u01da\",\n    \"udieresiscyrillic\": \"\\u04f1\",\n    \"udieresisgrave\": \"\\u01dc\",\n    \"udieresismacron\": \"\\u01d6\",\n    \"udotbelow\": \"\\u1ee5\",\n    \"ugrave\": \"\\u00f9\",\n    \"ugujarati\": \"\\u0a89\",\n    \"ugurmukhi\": \"\\u0a09\",\n    \"uhiragana\": \"\\u3046\",\n    \"uhookabove\": \"\\u1ee7\",\n    \"uhorn\": \"\\u01b0\",\n    \"uhornacute\": \"\\u1ee9\",\n    \"uhorndotbelow\": \"\\u1ef1\",\n    \"uhorngrave\": \"\\u1eeb\",\n    \"uhornhookabove\": \"\\u1eed\",\n    \"uhorntilde\": \"\\u1eef\",\n    \"uhungarumlaut\": \"\\u0171\",\n    \"uhungarumlautcyrillic\": \"\\u04f3\",\n    \"uinvertedbreve\": \"\\u0217\",\n    \"ukatakana\": \"\\u30a6\",\n    \"ukatakanahalfwidth\": \"\\uff73\",\n    \"ukcyrillic\": \"\\u0479\",\n    \"ukorean\": \"\\u315c\",\n    \"umacron\": \"\\u016b\",\n    \"umacroncyrillic\": \"\\u04ef\",\n    \"umacrondieresis\": \"\\u1e7b\",\n    \"umatragurmukhi\": \"\\u0a41\",\n    \"umonospace\": \"\\uff55\",\n    \"underscore\": \"\\u005f\",\n    \"underscoredbl\": \"\\u2017\",\n    \"underscoremonospace\": \"\\uff3f\",\n    \"underscorevertical\": \"\\ufe33\",\n    \"underscorewavy\": \"\\ufe4f\",\n    \"union\": \"\\u222a\",\n    \"universal\": \"\\u2200\",\n    \"uogonek\": \"\\u0173\",\n    \"uparen\": \"\\u24b0\",\n    \"upblock\": \"\\u2580\",\n    \"upperdothebrew\": \"\\u05c4\",\n    \"upsilon\": \"\\u03c5\",\n    \"upsilondieresis\": \"\\u03cb\",\n    \"upsilondieresistonos\": \"\\u03b0\",\n    \"upsilonlatin\": \"\\u028a\",\n    \"upsilontonos\": \"\\u03cd\",\n    \"uptackbelowcmb\": \"\\u031d\",\n    \"uptackmod\": \"\\u02d4\",\n    \"uragurmukhi\": \"\\u0a73\",\n    \"uring\": \"\\u016f\",\n    \"ushortcyrillic\": \"\\u045e\",\n    \"usmallhiragana\": \"\\u3045\",\n    \"usmallkatakana\": \"\\u30a5\",\n    \"usmallkatakanahalfwidth\": \"\\uff69\",\n    \"ustraightcyrillic\": 
\"\\u04af\",\n    \"ustraightstrokecyrillic\": \"\\u04b1\",\n    \"utilde\": \"\\u0169\",\n    \"utildeacute\": \"\\u1e79\",\n    \"utildebelow\": \"\\u1e75\",\n    \"uubengali\": \"\\u098a\",\n    \"uudeva\": \"\\u090a\",\n    \"uugujarati\": \"\\u0a8a\",\n    \"uugurmukhi\": \"\\u0a0a\",\n    \"uumatragurmukhi\": \"\\u0a42\",\n    \"uuvowelsignbengali\": \"\\u09c2\",\n    \"uuvowelsigndeva\": \"\\u0942\",\n    \"uuvowelsigngujarati\": \"\\u0ac2\",\n    \"uvowelsignbengali\": \"\\u09c1\",\n    \"uvowelsigndeva\": \"\\u0941\",\n    \"uvowelsigngujarati\": \"\\u0ac1\",\n    \"v\": \"\\u0076\",\n    \"vadeva\": \"\\u0935\",\n    \"vagujarati\": \"\\u0ab5\",\n    \"vagurmukhi\": \"\\u0a35\",\n    \"vakatakana\": \"\\u30f7\",\n    \"vav\": \"\\u05d5\",\n    \"vavdagesh\": \"\\ufb35\",\n    \"vavdagesh65\": \"\\ufb35\",\n    \"vavdageshhebrew\": \"\\ufb35\",\n    \"vavhebrew\": \"\\u05d5\",\n    \"vavholam\": \"\\ufb4b\",\n    \"vavholamhebrew\": \"\\ufb4b\",\n    \"vavvavhebrew\": \"\\u05f0\",\n    \"vavyodhebrew\": \"\\u05f1\",\n    \"vcircle\": \"\\u24e5\",\n    \"vdotbelow\": \"\\u1e7f\",\n    \"vecyrillic\": \"\\u0432\",\n    \"veharabic\": \"\\u06a4\",\n    \"vehfinalarabic\": \"\\ufb6b\",\n    \"vehinitialarabic\": \"\\ufb6c\",\n    \"vehmedialarabic\": \"\\ufb6d\",\n    \"vekatakana\": \"\\u30f9\",\n    \"venus\": \"\\u2640\",\n    \"verticalbar\": \"\\u007c\",\n    \"verticallineabovecmb\": \"\\u030d\",\n    \"verticallinebelowcmb\": \"\\u0329\",\n    \"verticallinelowmod\": \"\\u02cc\",\n    \"verticallinemod\": \"\\u02c8\",\n    \"vewarmenian\": \"\\u057e\",\n    \"vhook\": \"\\u028b\",\n    \"vikatakana\": \"\\u30f8\",\n    \"viramabengali\": \"\\u09cd\",\n    \"viramadeva\": \"\\u094d\",\n    \"viramagujarati\": \"\\u0acd\",\n    \"visargabengali\": \"\\u0983\",\n    \"visargadeva\": \"\\u0903\",\n    \"visargagujarati\": \"\\u0a83\",\n    \"vmonospace\": \"\\uff56\",\n    \"voarmenian\": \"\\u0578\",\n    \"voicediterationhiragana\": \"\\u309e\",\n    
\"voicediterationkatakana\": \"\\u30fe\",\n    \"voicedmarkkana\": \"\\u309b\",\n    \"voicedmarkkanahalfwidth\": \"\\uff9e\",\n    \"vokatakana\": \"\\u30fa\",\n    \"vparen\": \"\\u24b1\",\n    \"vtilde\": \"\\u1e7d\",\n    \"vturned\": \"\\u028c\",\n    \"vuhiragana\": \"\\u3094\",\n    \"vukatakana\": \"\\u30f4\",\n    \"w\": \"\\u0077\",\n    \"wacute\": \"\\u1e83\",\n    \"waekorean\": \"\\u3159\",\n    \"wahiragana\": \"\\u308f\",\n    \"wakatakana\": \"\\u30ef\",\n    \"wakatakanahalfwidth\": \"\\uff9c\",\n    \"wakorean\": \"\\u3158\",\n    \"wasmallhiragana\": \"\\u308e\",\n    \"wasmallkatakana\": \"\\u30ee\",\n    \"wattosquare\": \"\\u3357\",\n    \"wavedash\": \"\\u301c\",\n    \"wavyunderscorevertical\": \"\\ufe34\",\n    \"wawarabic\": \"\\u0648\",\n    \"wawfinalarabic\": \"\\ufeee\",\n    \"wawhamzaabovearabic\": \"\\u0624\",\n    \"wawhamzaabovefinalarabic\": \"\\ufe86\",\n    \"wbsquare\": \"\\u33dd\",\n    \"wcircle\": \"\\u24e6\",\n    \"wcircumflex\": \"\\u0175\",\n    \"wdieresis\": \"\\u1e85\",\n    \"wdotaccent\": \"\\u1e87\",\n    \"wdotbelow\": \"\\u1e89\",\n    \"wehiragana\": \"\\u3091\",\n    \"weierstrass\": \"\\u2118\",\n    \"wekatakana\": \"\\u30f1\",\n    \"wekorean\": \"\\u315e\",\n    \"weokorean\": \"\\u315d\",\n    \"wgrave\": \"\\u1e81\",\n    \"whitebullet\": \"\\u25e6\",\n    \"whitecircle\": \"\\u25cb\",\n    \"whitecircleinverse\": \"\\u25d9\",\n    \"whitecornerbracketleft\": \"\\u300e\",\n    \"whitecornerbracketleftvertical\": \"\\ufe43\",\n    \"whitecornerbracketright\": \"\\u300f\",\n    \"whitecornerbracketrightvertical\": \"\\ufe44\",\n    \"whitediamond\": \"\\u25c7\",\n    \"whitediamondcontainingblacksmalldiamond\": \"\\u25c8\",\n    \"whitedownpointingsmalltriangle\": \"\\u25bf\",\n    \"whitedownpointingtriangle\": \"\\u25bd\",\n    \"whiteleftpointingsmalltriangle\": \"\\u25c3\",\n    \"whiteleftpointingtriangle\": \"\\u25c1\",\n    \"whitelenticularbracketleft\": \"\\u3016\",\n    
\"whitelenticularbracketright\": \"\\u3017\",\n    \"whiterightpointingsmalltriangle\": \"\\u25b9\",\n    \"whiterightpointingtriangle\": \"\\u25b7\",\n    \"whitesmallsquare\": \"\\u25ab\",\n    \"whitesmilingface\": \"\\u263a\",\n    \"whitesquare\": \"\\u25a1\",\n    \"whitestar\": \"\\u2606\",\n    \"whitetelephone\": \"\\u260f\",\n    \"whitetortoiseshellbracketleft\": \"\\u3018\",\n    \"whitetortoiseshellbracketright\": \"\\u3019\",\n    \"whiteuppointingsmalltriangle\": \"\\u25b5\",\n    \"whiteuppointingtriangle\": \"\\u25b3\",\n    \"wihiragana\": \"\\u3090\",\n    \"wikatakana\": \"\\u30f0\",\n    \"wikorean\": \"\\u315f\",\n    \"wmonospace\": \"\\uff57\",\n    \"wohiragana\": \"\\u3092\",\n    \"wokatakana\": \"\\u30f2\",\n    \"wokatakanahalfwidth\": \"\\uff66\",\n    \"won\": \"\\u20a9\",\n    \"wonmonospace\": \"\\uffe6\",\n    \"wowaenthai\": \"\\u0e27\",\n    \"wparen\": \"\\u24b2\",\n    \"wring\": \"\\u1e98\",\n    \"wsuperior\": \"\\u02b7\",\n    \"wturned\": \"\\u028d\",\n    \"wynn\": \"\\u01bf\",\n    \"x\": \"\\u0078\",\n    \"xabovecmb\": \"\\u033d\",\n    \"xbopomofo\": \"\\u3112\",\n    \"xcircle\": \"\\u24e7\",\n    \"xdieresis\": \"\\u1e8d\",\n    \"xdotaccent\": \"\\u1e8b\",\n    \"xeharmenian\": \"\\u056d\",\n    \"xi\": \"\\u03be\",\n    \"xmonospace\": \"\\uff58\",\n    \"xparen\": \"\\u24b3\",\n    \"xsuperior\": \"\\u02e3\",\n    \"y\": \"\\u0079\",\n    \"yaadosquare\": \"\\u334e\",\n    \"yabengali\": \"\\u09af\",\n    \"yacute\": \"\\u00fd\",\n    \"yadeva\": \"\\u092f\",\n    \"yaekorean\": \"\\u3152\",\n    \"yagujarati\": \"\\u0aaf\",\n    \"yagurmukhi\": \"\\u0a2f\",\n    \"yahiragana\": \"\\u3084\",\n    \"yakatakana\": \"\\u30e4\",\n    \"yakatakanahalfwidth\": \"\\uff94\",\n    \"yakorean\": \"\\u3151\",\n    \"yamakkanthai\": \"\\u0e4e\",\n    \"yasmallhiragana\": \"\\u3083\",\n    \"yasmallkatakana\": \"\\u30e3\",\n    \"yasmallkatakanahalfwidth\": \"\\uff6c\",\n    \"yatcyrillic\": \"\\u0463\",\n    \"ycircle\": 
\"\\u24e8\",\n    \"ycircumflex\": \"\\u0177\",\n    \"ydieresis\": \"\\u00ff\",\n    \"ydotaccent\": \"\\u1e8f\",\n    \"ydotbelow\": \"\\u1ef5\",\n    \"yeharabic\": \"\\u064a\",\n    \"yehbarreearabic\": \"\\u06d2\",\n    \"yehbarreefinalarabic\": \"\\ufbaf\",\n    \"yehfinalarabic\": \"\\ufef2\",\n    \"yehhamzaabovearabic\": \"\\u0626\",\n    \"yehhamzaabovefinalarabic\": \"\\ufe8a\",\n    \"yehhamzaaboveinitialarabic\": \"\\ufe8b\",\n    \"yehhamzaabovemedialarabic\": \"\\ufe8c\",\n    \"yehinitialarabic\": \"\\ufef3\",\n    \"yehmedialarabic\": \"\\ufef4\",\n    \"yehmeeminitialarabic\": \"\\ufcdd\",\n    \"yehmeemisolatedarabic\": \"\\ufc58\",\n    \"yehnoonfinalarabic\": \"\\ufc94\",\n    \"yehthreedotsbelowarabic\": \"\\u06d1\",\n    \"yekorean\": \"\\u3156\",\n    \"yen\": \"\\u00a5\",\n    \"yenmonospace\": \"\\uffe5\",\n    \"yeokorean\": \"\\u3155\",\n    \"yeorinhieuhkorean\": \"\\u3186\",\n    \"yerahbenyomohebrew\": \"\\u05aa\",\n    \"yerahbenyomolefthebrew\": \"\\u05aa\",\n    \"yericyrillic\": \"\\u044b\",\n    \"yerudieresiscyrillic\": \"\\u04f9\",\n    \"yesieungkorean\": \"\\u3181\",\n    \"yesieungpansioskorean\": \"\\u3183\",\n    \"yesieungsioskorean\": \"\\u3182\",\n    \"yetivhebrew\": \"\\u059a\",\n    \"ygrave\": \"\\u1ef3\",\n    \"yhook\": \"\\u01b4\",\n    \"yhookabove\": \"\\u1ef7\",\n    \"yiarmenian\": \"\\u0575\",\n    \"yicyrillic\": \"\\u0457\",\n    \"yikorean\": \"\\u3162\",\n    \"yinyang\": \"\\u262f\",\n    \"yiwnarmenian\": \"\\u0582\",\n    \"ymonospace\": \"\\uff59\",\n    \"yod\": \"\\u05d9\",\n    \"yoddagesh\": \"\\ufb39\",\n    \"yoddageshhebrew\": \"\\ufb39\",\n    \"yodhebrew\": \"\\u05d9\",\n    \"yodyodhebrew\": \"\\u05f2\",\n    \"yodyodpatahhebrew\": \"\\ufb1f\",\n    \"yohiragana\": \"\\u3088\",\n    \"yoikorean\": \"\\u3189\",\n    \"yokatakana\": \"\\u30e8\",\n    \"yokatakanahalfwidth\": \"\\uff96\",\n    \"yokorean\": \"\\u315b\",\n    \"yosmallhiragana\": \"\\u3087\",\n    \"yosmallkatakana\": 
\"\\u30e7\",\n    \"yosmallkatakanahalfwidth\": \"\\uff6e\",\n    \"yotgreek\": \"\\u03f3\",\n    \"yoyaekorean\": \"\\u3188\",\n    \"yoyakorean\": \"\\u3187\",\n    \"yoyakthai\": \"\\u0e22\",\n    \"yoyingthai\": \"\\u0e0d\",\n    \"yparen\": \"\\u24b4\",\n    \"ypogegrammeni\": \"\\u037a\",\n    \"ypogegrammenigreekcmb\": \"\\u0345\",\n    \"yr\": \"\\u01a6\",\n    \"yring\": \"\\u1e99\",\n    \"ysuperior\": \"\\u02b8\",\n    \"ytilde\": \"\\u1ef9\",\n    \"yturned\": \"\\u028e\",\n    \"yuhiragana\": \"\\u3086\",\n    \"yuikorean\": \"\\u318c\",\n    \"yukatakana\": \"\\u30e6\",\n    \"yukatakanahalfwidth\": \"\\uff95\",\n    \"yukorean\": \"\\u3160\",\n    \"yusbigcyrillic\": \"\\u046b\",\n    \"yusbigiotifiedcyrillic\": \"\\u046d\",\n    \"yuslittlecyrillic\": \"\\u0467\",\n    \"yuslittleiotifiedcyrillic\": \"\\u0469\",\n    \"yusmallhiragana\": \"\\u3085\",\n    \"yusmallkatakana\": \"\\u30e5\",\n    \"yusmallkatakanahalfwidth\": \"\\uff6d\",\n    \"yuyekorean\": \"\\u318b\",\n    \"yuyeokorean\": \"\\u318a\",\n    \"yyabengali\": \"\\u09df\",\n    \"yyadeva\": \"\\u095f\",\n    \"z\": \"\\u007a\",\n    \"zaarmenian\": \"\\u0566\",\n    \"zacute\": \"\\u017a\",\n    \"zadeva\": \"\\u095b\",\n    \"zagurmukhi\": \"\\u0a5b\",\n    \"zaharabic\": \"\\u0638\",\n    \"zahfinalarabic\": \"\\ufec6\",\n    \"zahinitialarabic\": \"\\ufec7\",\n    \"zahiragana\": \"\\u3056\",\n    \"zahmedialarabic\": \"\\ufec8\",\n    \"zainarabic\": \"\\u0632\",\n    \"zainfinalarabic\": \"\\ufeb0\",\n    \"zakatakana\": \"\\u30b6\",\n    \"zaqefgadolhebrew\": \"\\u0595\",\n    \"zaqefqatanhebrew\": \"\\u0594\",\n    \"zarqahebrew\": \"\\u0598\",\n    \"zayin\": \"\\u05d6\",\n    \"zayindagesh\": \"\\ufb36\",\n    \"zayindageshhebrew\": \"\\ufb36\",\n    \"zayinhebrew\": \"\\u05d6\",\n    \"zbopomofo\": \"\\u3117\",\n    \"zcaron\": \"\\u017e\",\n    \"zcircle\": \"\\u24e9\",\n    \"zcircumflex\": \"\\u1e91\",\n    \"zcurl\": \"\\u0291\",\n    \"zdot\": \"\\u017c\",\n    
\"zdotaccent\": \"\\u017c\",\n    \"zdotbelow\": \"\\u1e93\",\n    \"zecyrillic\": \"\\u0437\",\n    \"zedescendercyrillic\": \"\\u0499\",\n    \"zedieresiscyrillic\": \"\\u04df\",\n    \"zehiragana\": \"\\u305c\",\n    \"zekatakana\": \"\\u30bc\",\n    \"zero\": \"\\u0030\",\n    \"zeroarabic\": \"\\u0660\",\n    \"zerobengali\": \"\\u09e6\",\n    \"zerodeva\": \"\\u0966\",\n    \"zerogujarati\": \"\\u0ae6\",\n    \"zerogurmukhi\": \"\\u0a66\",\n    \"zerohackarabic\": \"\\u0660\",\n    \"zeroinferior\": \"\\u2080\",\n    \"zeromonospace\": \"\\uff10\",\n    \"zerooldstyle\": \"\\uf730\",\n    \"zeropersian\": \"\\u06f0\",\n    \"zerosuperior\": \"\\u2070\",\n    \"zerothai\": \"\\u0e50\",\n    \"zerowidthjoiner\": \"\\ufeff\",\n    \"zerowidthnonjoiner\": \"\\u200c\",\n    \"zerowidthspace\": \"\\u200b\",\n    \"zeta\": \"\\u03b6\",\n    \"zhbopomofo\": \"\\u3113\",\n    \"zhearmenian\": \"\\u056a\",\n    \"zhebrevecyrillic\": \"\\u04c2\",\n    \"zhecyrillic\": \"\\u0436\",\n    \"zhedescendercyrillic\": \"\\u0497\",\n    \"zhedieresiscyrillic\": \"\\u04dd\",\n    \"zihiragana\": \"\\u3058\",\n    \"zikatakana\": \"\\u30b8\",\n    \"zinorhebrew\": \"\\u05ae\",\n    \"zlinebelow\": \"\\u1e95\",\n    \"zmonospace\": \"\\uff5a\",\n    \"zohiragana\": \"\\u305e\",\n    \"zokatakana\": \"\\u30be\",\n    \"zparen\": \"\\u24b5\",\n    \"zretroflexhook\": \"\\u0290\",\n    \"zstroke\": \"\\u01b6\",\n    \"zuhiragana\": \"\\u305a\",\n    \"zukatakana\": \"\\u30ba\",\n}\n# --end\n"
  },
  {
    "path": "babeldoc/pdfminer/high_level.py",
    "content": "\"\"\"Functions that can be used for the most common use-cases for pdfminer.six\"\"\"\n\nimport logging\nimport sys\nfrom collections.abc import Container\nfrom collections.abc import Iterator\nfrom io import StringIO\nfrom typing import Any\nfrom typing import BinaryIO\nfrom typing import cast\n\nfrom babeldoc.pdfminer.converter import HOCRConverter\nfrom babeldoc.pdfminer.converter import HTMLConverter\nfrom babeldoc.pdfminer.converter import PDFPageAggregator\nfrom babeldoc.pdfminer.converter import TextConverter\nfrom babeldoc.pdfminer.converter import XMLConverter\nfrom babeldoc.pdfminer.image import ImageWriter\nfrom babeldoc.pdfminer.layout import LAParams\nfrom babeldoc.pdfminer.layout import LTPage\nfrom babeldoc.pdfminer.pdfdevice import PDFDevice\nfrom babeldoc.pdfminer.pdfdevice import TagExtractor\nfrom babeldoc.pdfminer.pdfexceptions import PDFValueError\nfrom babeldoc.pdfminer.pdfinterp import PDFPageInterpreter\nfrom babeldoc.pdfminer.pdfinterp import PDFResourceManager\nfrom babeldoc.pdfminer.pdfpage import PDFPage\nfrom babeldoc.pdfminer.utils import AnyIO\nfrom babeldoc.pdfminer.utils import FileOrName\nfrom babeldoc.pdfminer.utils import open_filename\n\n\ndef extract_text_to_fp(\n    inf: BinaryIO,\n    outfp: AnyIO,\n    output_type: str = \"text\",\n    codec: str = \"utf-8\",\n    laparams: LAParams | None = None,\n    maxpages: int = 0,\n    page_numbers: Container[int] | None = None,\n    password: str = \"\",\n    scale: float = 1.0,\n    rotation: int = 0,\n    layoutmode: str = \"normal\",\n    output_dir: str | None = None,\n    strip_control: bool = False,\n    debug: bool = False,\n    disable_caching: bool = False,\n    **kwargs: Any,\n) -> None:\n    \"\"\"Parses text from inf-file and writes to outfp file-like object.\n\n    Takes loads of optional arguments but the defaults are somewhat sane.\n    Beware laparams: Including an empty LAParams is not the same as passing\n    None!\n\n    :param inf: a file-like 
object to read PDF structure from, such as a\n        file object (returned by the builtin `open()` function) or a `BytesIO`.\n    :param outfp: a file-like object to write the text to.\n    :param output_type: May be 'text', 'xml', 'html', 'hocr', 'tag'.\n        Only 'text' works properly.\n    :param codec: Text decoding codec\n    :param laparams: An LAParams object from babeldoc.pdfminer.layout.\n        Default is None, in which case the layout may not be reconstructed\n        correctly.\n    :param maxpages: Maximum number of pages to parse.\n    :param page_numbers: zero-indexed page numbers to operate on.\n    :param password: For encrypted PDFs, the password to decrypt.\n    :param scale: Scale factor\n    :param rotation: Rotation factor\n    :param layoutmode: Default is 'normal', see\n        babeldoc.pdfminer.converter.HTMLConverter\n    :param output_dir: If given, creates an ImageWriter for extracted images.\n    :param strip_control: Strip control characters from the output\n        (passed to the XML and hOCR converters).\n    :param debug: Output more logging data\n    :param disable_caching: If True, disable caching of resources.\n    :param kwargs: Additional keyword arguments; not used by this function.\n    :return: nothing, acting as it does on two streams. 
Use StringIO to get\n        strings.\n    \"\"\"\n    if debug:\n        logging.getLogger().setLevel(logging.DEBUG)\n\n    imagewriter = None\n    if output_dir:\n        imagewriter = ImageWriter(output_dir)\n\n    rsrcmgr = PDFResourceManager(caching=not disable_caching)\n    device: PDFDevice | None = None\n\n    if output_type != \"text\" and outfp == sys.stdout:\n        outfp = sys.stdout.buffer\n\n    if output_type == \"text\":\n        device = TextConverter(\n            rsrcmgr,\n            outfp,\n            codec=codec,\n            laparams=laparams,\n            imagewriter=imagewriter,\n        )\n\n    elif output_type == \"xml\":\n        device = XMLConverter(\n            rsrcmgr,\n            outfp,\n            codec=codec,\n            laparams=laparams,\n            imagewriter=imagewriter,\n            stripcontrol=strip_control,\n        )\n\n    elif output_type == \"html\":\n        device = HTMLConverter(\n            rsrcmgr,\n            outfp,\n            codec=codec,\n            scale=scale,\n            layoutmode=layoutmode,\n            laparams=laparams,\n            imagewriter=imagewriter,\n        )\n\n    elif output_type == \"hocr\":\n        device = HOCRConverter(\n            rsrcmgr,\n            outfp,\n            codec=codec,\n            laparams=laparams,\n            stripcontrol=strip_control,\n        )\n\n    elif output_type == \"tag\":\n        # Binary I/O is required, but we have no good way to test it here.\n        device = TagExtractor(rsrcmgr, cast(BinaryIO, outfp), codec=codec)\n\n    else:\n        msg = f\"Output type can be text, xml, html, hocr or tag but is {output_type}\"\n        raise PDFValueError(msg)\n\n    assert device is not None\n    interpreter = PDFPageInterpreter(rsrcmgr, device)\n    for page in PDFPage.get_pages(\n        inf,\n        page_numbers,\n        maxpages=maxpages,\n        password=password,\n        caching=not disable_caching,\n    ):\n        page.rotate = 
(page.rotate + rotation) % 360\n        interpreter.process_page(page)\n\n    device.close()\n\n\ndef extract_text(\n    pdf_file: FileOrName,\n    password: str = \"\",\n    page_numbers: Container[int] | None = None,\n    maxpages: int = 0,\n    caching: bool = True,\n    codec: str = \"utf-8\",\n    laparams: LAParams | None = None,\n) -> str:\n    \"\"\"Parse and return the text contained in a PDF file.\n\n    :param pdf_file: Either a file path or a file-like object for the PDF file\n        to be worked on.\n    :param password: For encrypted PDFs, the password to decrypt.\n    :param page_numbers: List of zero-indexed page numbers to extract.\n    :param maxpages: The maximum number of pages to parse\n    :param caching: If resources should be cached\n    :param codec: Text decoding codec\n    :param laparams: An LAParams object from babeldoc.pdfminer.layout. If None, uses\n        some default settings that often work well.\n    :return: a string containing all of the text extracted.\n    \"\"\"\n    if laparams is None:\n        laparams = LAParams()\n\n    with open_filename(pdf_file, \"rb\") as fp, StringIO() as output_string:\n        fp = cast(BinaryIO, fp)  # we opened in binary mode\n        rsrcmgr = PDFResourceManager(caching=caching)\n        device = TextConverter(rsrcmgr, output_string, codec=codec, laparams=laparams)\n        interpreter = PDFPageInterpreter(rsrcmgr, device)\n\n        for page in PDFPage.get_pages(\n            fp,\n            page_numbers,\n            maxpages=maxpages,\n            password=password,\n            caching=caching,\n        ):\n            interpreter.process_page(page)\n\n        return output_string.getvalue()\n\n\ndef extract_pages(\n    pdf_file: FileOrName,\n    password: str = \"\",\n    page_numbers: Container[int] | None = None,\n    maxpages: int = 0,\n    caching: bool = True,\n    laparams: LAParams | None = None,\n) -> Iterator[LTPage]:\n    \"\"\"Extract and yield LTPage objects\n\n    :param 
pdf_file: Either a file path or a file-like object for the PDF file\n        to be worked on.\n    :param password: For encrypted PDFs, the password to decrypt.\n    :param page_numbers: List of zero-indexed page numbers to extract.\n    :param maxpages: The maximum number of pages to parse\n    :param caching: If resources should be cached\n    :param laparams: An LAParams object from babeldoc.pdfminer.layout. If None, uses\n        some default settings that often work well.\n    :return: LTPage objects\n    \"\"\"\n    if laparams is None:\n        laparams = LAParams()\n\n    with open_filename(pdf_file, \"rb\") as fp:\n        fp = cast(BinaryIO, fp)  # we opened in binary mode\n        resource_manager = PDFResourceManager(caching=caching)\n        device = PDFPageAggregator(resource_manager, laparams=laparams)\n        interpreter = PDFPageInterpreter(resource_manager, device)\n        for page in PDFPage.get_pages(\n            fp,\n            page_numbers,\n            maxpages=maxpages,\n            password=password,\n            caching=caching,\n        ):\n            interpreter.process_page(page)\n            layout = device.get_result()\n            yield layout\n"
  },
  {
    "path": "babeldoc/pdfminer/image.py",
    "content": "import os\nimport os.path\nimport struct\nfrom io import BytesIO\nfrom typing import BinaryIO\nfrom typing import Literal\n\nfrom babeldoc.pdfminer.jbig2 import JBIG2StreamReader\nfrom babeldoc.pdfminer.jbig2 import JBIG2StreamWriter\nfrom babeldoc.pdfminer.layout import LTImage\nfrom babeldoc.pdfminer.pdfcolor import LITERAL_DEVICE_CMYK\nfrom babeldoc.pdfminer.pdfcolor import LITERAL_DEVICE_GRAY\nfrom babeldoc.pdfminer.pdfcolor import LITERAL_DEVICE_RGB\nfrom babeldoc.pdfminer.pdfcolor import LITERAL_INLINE_DEVICE_GRAY\nfrom babeldoc.pdfminer.pdfcolor import LITERAL_INLINE_DEVICE_RGB\nfrom babeldoc.pdfminer.pdfexceptions import PDFValueError\nfrom babeldoc.pdfminer.pdftypes import LITERALS_DCT_DECODE\nfrom babeldoc.pdfminer.pdftypes import LITERALS_FLATE_DECODE\nfrom babeldoc.pdfminer.pdftypes import LITERALS_JBIG2_DECODE\nfrom babeldoc.pdfminer.pdftypes import LITERALS_JPX_DECODE\n\nPIL_ERROR_MESSAGE = (\n    \"Could not import Pillow. This dependency of pdfminer.six is not \"\n    \"installed by default. You need it to to save jpg images to a file. 
Install it \"\n    \"with `pip install 'pdfminer.six[image]'`\"\n)\n\n\ndef align32(x: int) -> int:\n    return ((x + 3) // 4) * 4\n\n\nclass BMPWriter:\n    def __init__(self, fp: BinaryIO, bits: int, width: int, height: int) -> None:\n        self.fp = fp\n        self.bits = bits\n        self.width = width\n        self.height = height\n        if bits == 1:\n            ncols = 2\n        elif bits == 8:\n            ncols = 256\n        elif bits == 24:\n            ncols = 0\n        else:\n            raise PDFValueError(bits)\n        self.linesize = align32((self.width * self.bits + 7) // 8)\n        self.datasize = self.linesize * self.height\n        headersize = 14 + 40 + ncols * 4\n        info = struct.pack(\n            \"<IiiHHIIIIII\",\n            40,\n            self.width,\n            self.height,\n            1,\n            self.bits,\n            0,\n            self.datasize,\n            0,\n            0,\n            ncols,\n            0,\n        )\n        assert len(info) == 40, str(len(info))\n        header = struct.pack(\n            \"<ccIHHI\",\n            b\"B\",\n            b\"M\",\n            headersize + self.datasize,\n            0,\n            0,\n            headersize,\n        )\n        assert len(header) == 14, str(len(header))\n        self.fp.write(header)\n        self.fp.write(info)\n        if ncols == 2:\n            # B&W color table\n            for i in (0, 255):\n                self.fp.write(struct.pack(\"BBBx\", i, i, i))\n        elif ncols == 256:\n            # grayscale color table\n            for i in range(256):\n                self.fp.write(struct.pack(\"BBBx\", i, i, i))\n        self.pos0 = self.fp.tell()\n        self.pos1 = self.pos0 + self.datasize\n\n    def write_line(self, y: int, data: bytes) -> None:\n        self.fp.seek(self.pos1 - (y + 1) * self.linesize)\n        self.fp.write(data)\n\n\nclass ImageWriter:\n    \"\"\"Write image to a file\n\n    Supports various image types: 
JPEG, JBIG2 and bitmaps\n    \"\"\"\n\n    def __init__(self, outdir: str) -> None:\n        self.outdir = outdir\n        if not os.path.exists(self.outdir):\n            os.makedirs(self.outdir)\n\n    def export_image(self, image: LTImage) -> str:\n        \"\"\"Save an LTImage to disk\"\"\"\n        (width, height) = image.srcsize\n\n        filters = image.stream.get_filters()\n\n        if filters[-1][0] in LITERALS_DCT_DECODE:\n            name = self._save_jpeg(image)\n\n        elif filters[-1][0] in LITERALS_JPX_DECODE:\n            name = self._save_jpeg2000(image)\n\n        elif self._is_jbig2_iamge(image):\n            name = self._save_jbig2(image)\n\n        elif image.bits == 1:\n            name = self._save_bmp(image, width, height, (width + 7) // 8, image.bits)\n\n        elif image.bits == 8 and (\n            LITERAL_DEVICE_RGB in image.colorspace\n            or LITERAL_INLINE_DEVICE_RGB in image.colorspace\n        ):\n            name = self._save_bmp(image, width, height, width * 3, image.bits * 3)\n\n        elif image.bits == 8 and (\n            LITERAL_DEVICE_GRAY in image.colorspace\n            or LITERAL_INLINE_DEVICE_GRAY in image.colorspace\n        ):\n            name = self._save_bmp(image, width, height, width, image.bits)\n\n        elif len(filters) == 1 and filters[0][0] in LITERALS_FLATE_DECODE:\n            name = self._save_bytes(image)\n\n        else:\n            name = self._save_raw(image)\n\n        return name\n\n    def _save_jpeg(self, image: LTImage) -> str:\n        \"\"\"Save a JPEG encoded image\"\"\"\n        data = image.stream.get_data()\n\n        name, path = self._create_unique_image_name(image, \".jpg\")\n        with open(path, \"wb\") as fp:\n            if LITERAL_DEVICE_CMYK in image.colorspace:\n                try:\n                    from PIL import Image  # type: ignore[import]\n                    from PIL import ImageChops  # type: ignore[import]\n                except ImportError:\n       
             raise ImportError(PIL_ERROR_MESSAGE)\n\n                ifp = BytesIO(data)\n                i = Image.open(ifp)\n                i = ImageChops.invert(i)\n                i = i.convert(\"RGB\")\n                i.save(fp, \"JPEG\")\n            else:\n                fp.write(data)\n\n        return name\n\n    def _save_jpeg2000(self, image: LTImage) -> str:\n        \"\"\"Save a JPEG 2000 encoded image\"\"\"\n        data = image.stream.get_data()\n\n        name, path = self._create_unique_image_name(image, \".jp2\")\n        with open(path, \"wb\") as fp:\n            try:\n                from PIL import Image  # type: ignore[import]\n            except ImportError:\n                raise ImportError(PIL_ERROR_MESSAGE)\n\n            # if we just write the raw data, most image programs\n            # that I have tried cannot open the file. However,\n            # open and saving with PIL produces a file that\n            # seems to be easily opened by other programs\n            ifp = BytesIO(data)\n            i = Image.open(ifp)\n            i.save(fp, \"JPEG2000\")\n        return name\n\n    def _save_jbig2(self, image: LTImage) -> str:\n        \"\"\"Save a JBIG2 encoded image\"\"\"\n        name, path = self._create_unique_image_name(image, \".jb2\")\n        with open(path, \"wb\") as fp:\n            input_stream = BytesIO()\n\n            global_streams = []\n            filters = image.stream.get_filters()\n            for filter_name, params in filters:\n                if filter_name in LITERALS_JBIG2_DECODE:\n                    global_streams.append(params[\"JBIG2Globals\"].resolve())\n\n            if len(global_streams) > 1:\n                msg = (\n                    \"There should never be more than one JBIG2Globals \"\n                    \"associated with a JBIG2 embedded image\"\n                )\n                raise PDFValueError(msg)\n            if len(global_streams) == 1:\n                
input_stream.write(global_streams[0].get_data().rstrip(b\"\\n\"))\n            input_stream.write(image.stream.get_data())\n            input_stream.seek(0)\n            reader = JBIG2StreamReader(input_stream)\n            segments = reader.get_segments()\n\n            writer = JBIG2StreamWriter(fp)\n            writer.write_file(segments)\n        return name\n\n    def _save_bmp(\n        self,\n        image: LTImage,\n        width: int,\n        height: int,\n        bytes_per_line: int,\n        bits: int,\n    ) -> str:\n        \"\"\"Save a BMP encoded image\"\"\"\n        name, path = self._create_unique_image_name(image, \".bmp\")\n        with open(path, \"wb\") as fp:\n            bmp = BMPWriter(fp, bits, width, height)\n            data = image.stream.get_data()\n            i = 0\n            for y in range(height):\n                bmp.write_line(y, data[i : i + bytes_per_line])\n                i += bytes_per_line\n        return name\n\n    def _save_bytes(self, image: LTImage) -> str:\n        \"\"\"Save an image without encoding, just bytes\"\"\"\n        name, path = self._create_unique_image_name(image, \".jpg\")\n        width, height = image.srcsize\n        channels = len(image.stream.get_data()) / width / height / (image.bits / 8)\n        with open(path, \"wb\") as fp:\n            try:\n                from PIL import Image  # type: ignore[import]\n                from PIL import ImageOps\n            except ImportError:\n                raise ImportError(PIL_ERROR_MESSAGE)\n\n            mode: Literal[\"1\", \"L\", \"RGB\", \"CMYK\"]\n            if image.bits == 1:\n                mode = \"1\"\n            elif image.bits == 8 and channels == 1:\n                mode = \"L\"\n            elif image.bits == 8 and channels == 3:\n                mode = \"RGB\"\n            elif image.bits == 8 and channels == 4:\n                mode = \"CMYK\"\n\n            img = Image.frombytes(mode, image.srcsize, image.stream.get_data(), 
\"raw\")\n            if mode == \"L\":\n                img = ImageOps.invert(img)\n\n            img.save(fp)\n\n        return name\n\n    def _save_raw(self, image: LTImage) -> str:\n        \"\"\"Save an image with unknown encoding\"\"\"\n        ext = \".%d.%dx%d.img\" % (image.bits, image.srcsize[0], image.srcsize[1])\n        name, path = self._create_unique_image_name(image, ext)\n\n        with open(path, \"wb\") as fp:\n            fp.write(image.stream.get_data())\n        return name\n\n    @staticmethod\n    def _is_jbig2_iamge(image: LTImage) -> bool:\n        filters = image.stream.get_filters()\n        for filter_name, params in filters:\n            if filter_name in LITERALS_JBIG2_DECODE:\n                return True\n        return False\n\n    def _create_unique_image_name(self, image: LTImage, ext: str) -> tuple[str, str]:\n        name = image.name + ext\n        path = os.path.join(self.outdir, name)\n        img_index = 0\n        while os.path.exists(path):\n            name = \"%s.%d%s\" % (image.name, img_index, ext)\n            path = os.path.join(self.outdir, name)\n            img_index += 1\n        return name, path\n"
  },
  {
    "path": "babeldoc/pdfminer/jbig2.py",
    "content": "import math\nimport os\nfrom collections.abc import Iterable\nfrom struct import calcsize\nfrom struct import pack\nfrom struct import unpack\nfrom typing import BinaryIO\nfrom typing import cast\n\nfrom babeldoc.pdfminer.pdfexceptions import PDFValueError\n\n# segment structure base\nSEG_STRUCT = [\n    (\">L\", \"number\"),\n    (\">B\", \"flags\"),\n    (\">B\", \"retention_flags\"),\n    (\">B\", \"page_assoc\"),\n    (\">L\", \"data_length\"),\n]\n\n# segment header literals\nHEADER_FLAG_DEFERRED = 0b10000000\nHEADER_FLAG_PAGE_ASSOC_LONG = 0b01000000\n\nSEG_TYPE_MASK = 0b00111111\n\nREF_COUNT_SHORT_MASK = 0b11100000\nREF_COUNT_LONG_MASK = 0x1FFFFFFF\nREF_COUNT_LONG = 7\n\nDATA_LEN_UNKNOWN = 0xFFFFFFFF\n\n# segment types\nSEG_TYPE_IMMEDIATE_GEN_REGION = 38\nSEG_TYPE_END_OF_PAGE = 49\nSEG_TYPE_END_OF_FILE = 51\n\n# file literals\nFILE_HEADER_ID = b\"\\x97\\x4a\\x42\\x32\\x0d\\x0a\\x1a\\x0a\"\nFILE_HEAD_FLAG_SEQUENTIAL = 0b00000001\n\n\ndef bit_set(bit_pos: int, value: int) -> bool:\n    return bool((value >> bit_pos) & 1)\n\n\ndef check_flag(flag: int, value: int) -> bool:\n    return bool(flag & value)\n\n\ndef masked_value(mask: int, value: int) -> int:\n    for bit_pos in range(31):\n        if bit_set(bit_pos, mask):\n            return (value & mask) >> bit_pos\n\n    raise PDFValueError(\"Invalid mask or value\")\n\n\ndef mask_value(mask: int, value: int) -> int:\n    for bit_pos in range(31):\n        if bit_set(bit_pos, mask):\n            return (value & (mask >> bit_pos)) << bit_pos\n\n    raise PDFValueError(\"Invalid mask or value\")\n\n\ndef unpack_int(format: str, buffer: bytes) -> int:\n    assert format in {\">B\", \">I\", \">L\"}\n    [result] = cast(tuple[int], unpack(format, buffer))\n    return result\n\n\nJBIG2SegmentFlags = dict[str, int | bool]\nJBIG2RetentionFlags = dict[str, int | list[int] | list[bool]]\nJBIG2Segment = dict[\n    str,\n    bool | int | bytes | JBIG2SegmentFlags | JBIG2RetentionFlags,\n]\n\n\nclass 
JBIG2StreamReader:\n    \"\"\"Read segments from a JBIG2 byte stream\"\"\"\n\n    def __init__(self, stream: BinaryIO) -> None:\n        self.stream = stream\n\n    def get_segments(self) -> list[JBIG2Segment]:\n        segments: list[JBIG2Segment] = []\n        while not self.is_eof():\n            segment: JBIG2Segment = {}\n            for field_format, name in SEG_STRUCT:\n                field_len = calcsize(field_format)\n                field = self.stream.read(field_len)\n                if len(field) < field_len:\n                    segment[\"_error\"] = True\n                    break\n                value = unpack_int(field_format, field)\n                parser = getattr(self, \"parse_%s\" % name, None)\n                if callable(parser):\n                    value = parser(segment, value, field)\n                segment[name] = value\n\n            if not segment.get(\"_error\"):\n                segments.append(segment)\n        return segments\n\n    def is_eof(self) -> bool:\n        if self.stream.read(1) == b\"\":\n            return True\n        else:\n            self.stream.seek(-1, os.SEEK_CUR)\n            return False\n\n    def parse_flags(\n        self,\n        segment: JBIG2Segment,\n        flags: int,\n        field: bytes,\n    ) -> JBIG2SegmentFlags:\n        return {\n            \"deferred\": check_flag(HEADER_FLAG_DEFERRED, flags),\n            \"page_assoc_long\": check_flag(HEADER_FLAG_PAGE_ASSOC_LONG, flags),\n            \"type\": masked_value(SEG_TYPE_MASK, flags),\n        }\n\n    def parse_retention_flags(\n        self,\n        segment: JBIG2Segment,\n        flags: int,\n        field: bytes,\n    ) -> JBIG2RetentionFlags:\n        ref_count = masked_value(REF_COUNT_SHORT_MASK, flags)\n        retain_segments = []\n        ref_segments = []\n\n        if ref_count < REF_COUNT_LONG:\n            for bit_pos in range(5):\n                retain_segments.append(bit_set(bit_pos, flags))\n        else:\n            
field += self.stream.read(3)\n            ref_count = unpack_int(\">L\", field)\n            ref_count = masked_value(REF_COUNT_LONG_MASK, ref_count)\n            ret_bytes_count = int(math.ceil((ref_count + 1) / 8))\n            for ret_byte_index in range(ret_bytes_count):\n                ret_byte = unpack_int(\">B\", self.stream.read(1))\n                for bit_pos in range(7):\n                    retain_segments.append(bit_set(bit_pos, ret_byte))\n\n        seg_num = segment[\"number\"]\n        assert isinstance(seg_num, int)\n        if seg_num <= 256:\n            ref_format = \">B\"\n        elif seg_num <= 65536:\n            ref_format = \">I\"\n        else:\n            ref_format = \">L\"\n\n        ref_size = calcsize(ref_format)\n\n        for ref_index in range(ref_count):\n            ref_data = self.stream.read(ref_size)\n            ref = unpack_int(ref_format, ref_data)\n            ref_segments.append(ref)\n\n        return {\n            \"ref_count\": ref_count,\n            \"retain_segments\": retain_segments,\n            \"ref_segments\": ref_segments,\n        }\n\n    def parse_page_assoc(self, segment: JBIG2Segment, page: int, field: bytes) -> int:\n        if cast(JBIG2SegmentFlags, segment[\"flags\"])[\"page_assoc_long\"]:\n            field += self.stream.read(3)\n            page = unpack_int(\">L\", field)\n        return page\n\n    def parse_data_length(\n        self,\n        segment: JBIG2Segment,\n        length: int,\n        field: bytes,\n    ) -> int:\n        if length:\n            if (\n                cast(JBIG2SegmentFlags, segment[\"flags\"])[\"type\"]\n                == SEG_TYPE_IMMEDIATE_GEN_REGION\n            ) and (length == DATA_LEN_UNKNOWN):\n                raise NotImplementedError(\n                    \"Working with unknown segment length is not implemented yet\",\n                )\n            else:\n                segment[\"raw_data\"] = self.stream.read(length)\n\n        return 
length\n\n\nclass JBIG2StreamWriter:\n    \"\"\"Write JBIG2 segments to a file in JBIG2 format\"\"\"\n\n    EMPTY_RETENTION_FLAGS: JBIG2RetentionFlags = {\n        \"ref_count\": 0,\n        \"ref_segments\": cast(list[int], []),\n        \"retain_segments\": cast(list[bool], []),\n    }\n\n    def __init__(self, stream: BinaryIO) -> None:\n        self.stream = stream\n\n    def write_segments(\n        self,\n        segments: Iterable[JBIG2Segment],\n        fix_last_page: bool = True,\n    ) -> int:\n        data_len = 0\n        current_page: int | None = None\n        seg_num: int | None = None\n\n        for segment in segments:\n            data = self.encode_segment(segment)\n            self.stream.write(data)\n            data_len += len(data)\n\n            seg_num = cast(int | None, segment[\"number\"])\n\n            if fix_last_page:\n                seg_page = cast(int, segment.get(\"page_assoc\"))\n\n                if (\n                    cast(JBIG2SegmentFlags, segment[\"flags\"])[\"type\"]\n                    == SEG_TYPE_END_OF_PAGE\n                ):\n                    current_page = None\n                elif seg_page:\n                    current_page = seg_page\n\n        if fix_last_page and current_page and (seg_num is not None):\n            segment = self.get_eop_segment(seg_num + 1, current_page)\n            data = self.encode_segment(segment)\n            self.stream.write(data)\n            data_len += len(data)\n\n        return data_len\n\n    def write_file(\n        self,\n        segments: Iterable[JBIG2Segment],\n        fix_last_page: bool = True,\n    ) -> int:\n        header = FILE_HEADER_ID\n        header_flags = FILE_HEAD_FLAG_SEQUENTIAL\n        header += pack(\">B\", header_flags)\n        # The embedded JBIG2 files in a PDF always\n        # only have one page\n        number_of_pages = pack(\">L\", 1)\n        header += number_of_pages\n        self.stream.write(header)\n        data_len = len(header)\n\n       
 data_len += self.write_segments(segments, fix_last_page)\n\n        seg_num = 0\n        for segment in segments:\n            seg_num = cast(int, segment[\"number\"])\n\n        if fix_last_page:\n            seg_num_offset = 2\n        else:\n            seg_num_offset = 1\n        eof_segment = self.get_eof_segment(seg_num + seg_num_offset)\n        data = self.encode_segment(eof_segment)\n\n        self.stream.write(data)\n        data_len += len(data)\n\n        return data_len\n\n    def encode_segment(self, segment: JBIG2Segment) -> bytes:\n        data = b\"\"\n        for field_format, name in SEG_STRUCT:\n            value = segment.get(name)\n            encoder = getattr(self, \"encode_%s\" % name, None)\n            if callable(encoder):\n                field = encoder(value, segment)\n            else:\n                field = pack(field_format, value)\n            data += field\n        return data\n\n    def encode_flags(self, value: JBIG2SegmentFlags, segment: JBIG2Segment) -> bytes:\n        flags = 0\n        if value.get(\"deferred\"):\n            flags |= HEADER_FLAG_DEFERRED\n\n        if \"page_assoc_long\" in value:\n            flags |= HEADER_FLAG_PAGE_ASSOC_LONG if value[\"page_assoc_long\"] else flags\n        else:\n            flags |= (\n                HEADER_FLAG_PAGE_ASSOC_LONG\n                if cast(int, segment.get(\"page\", 0)) > 255\n                else flags\n            )\n\n        flags |= mask_value(SEG_TYPE_MASK, value[\"type\"])\n\n        return pack(\">B\", flags)\n\n    def encode_retention_flags(\n        self,\n        value: JBIG2RetentionFlags,\n        segment: JBIG2Segment,\n    ) -> bytes:\n        flags = []\n        flags_format = \">B\"\n        ref_count = value[\"ref_count\"]\n        assert isinstance(ref_count, int)\n        retain_segments = cast(list[bool], value.get(\"retain_segments\", []))\n\n        if ref_count <= 4:\n            flags_byte = mask_value(REF_COUNT_SHORT_MASK, ref_count)\n     
       for ref_index, ref_retain in enumerate(retain_segments):\n                if ref_retain:\n                    flags_byte |= 1 << ref_index\n            flags.append(flags_byte)\n        else:\n            bytes_count = math.ceil((ref_count + 1) / 8)\n            flags_format = \">L\" + (\"B\" * bytes_count)\n            flags_dword = mask_value(REF_COUNT_SHORT_MASK, REF_COUNT_LONG) << 24\n            flags.append(flags_dword)\n\n            for byte_index in range(bytes_count):\n                ret_byte = 0\n                ret_part = retain_segments[byte_index * 8 : byte_index * 8 + 8]\n                for bit_pos, ret_seg in enumerate(ret_part):\n                    ret_byte |= 1 << bit_pos if ret_seg else ret_byte\n\n                flags.append(ret_byte)\n\n        ref_segments = cast(list[int], value.get(\"ref_segments\", []))\n\n        seg_num = cast(int, segment[\"number\"])\n        if seg_num <= 256:\n            ref_format = \"B\"\n        elif seg_num <= 65536:\n            ref_format = \"I\"\n        else:\n            ref_format = \"L\"\n\n        for ref in ref_segments:\n            flags_format += ref_format\n            flags.append(ref)\n\n        return pack(flags_format, *flags)\n\n    def encode_data_length(self, value: int, segment: JBIG2Segment) -> bytes:\n        data = pack(\">L\", value)\n        data += cast(bytes, segment[\"raw_data\"])\n        return data\n\n    def get_eop_segment(self, seg_number: int, page_number: int) -> JBIG2Segment:\n        return {\n            \"data_length\": 0,\n            \"flags\": {\"deferred\": False, \"type\": SEG_TYPE_END_OF_PAGE},\n            \"number\": seg_number,\n            \"page_assoc\": page_number,\n            \"raw_data\": b\"\",\n            \"retention_flags\": JBIG2StreamWriter.EMPTY_RETENTION_FLAGS,\n        }\n\n    def get_eof_segment(self, seg_number: int) -> JBIG2Segment:\n        return {\n            \"data_length\": 0,\n            \"flags\": {\"deferred\": False, 
\"type\": SEG_TYPE_END_OF_FILE},\n            \"number\": seg_number,\n            \"page_assoc\": 0,\n            \"raw_data\": b\"\",\n            \"retention_flags\": JBIG2StreamWriter.EMPTY_RETENTION_FLAGS,\n        }\n"
  },
  {
    "path": "babeldoc/pdfminer/latin_enc.py",
    "content": "\"\"\"Standard encoding tables used in PDF.\n\nThis table is extracted from PDF Reference Manual 1.6, pp.925\n  \"D.1 Latin Character Set and Encodings\"\n\n\"\"\"\n\nEncodingRow = tuple[str, int | None, int | None, int | None, int | None]\n\nENCODING: list[EncodingRow] = [\n    # (name, std, mac, win, pdf)\n    (\"A\", 65, 65, 65, 65),\n    (\"AE\", 225, 174, 198, 198),\n    (\"Aacute\", None, 231, 193, 193),\n    (\"Acircumflex\", None, 229, 194, 194),\n    (\"Adieresis\", None, 128, 196, 196),\n    (\"Agrave\", None, 203, 192, 192),\n    (\"Aring\", None, 129, 197, 197),\n    (\"Atilde\", None, 204, 195, 195),\n    (\"B\", 66, 66, 66, 66),\n    (\"C\", 67, 67, 67, 67),\n    (\"Ccedilla\", None, 130, 199, 199),\n    (\"D\", 68, 68, 68, 68),\n    (\"E\", 69, 69, 69, 69),\n    (\"Eacute\", None, 131, 201, 201),\n    (\"Ecircumflex\", None, 230, 202, 202),\n    (\"Edieresis\", None, 232, 203, 203),\n    (\"Egrave\", None, 233, 200, 200),\n    (\"Eth\", None, None, 208, 208),\n    (\"Euro\", None, None, 128, 160),\n    (\"F\", 70, 70, 70, 70),\n    (\"G\", 71, 71, 71, 71),\n    (\"H\", 72, 72, 72, 72),\n    (\"I\", 73, 73, 73, 73),\n    (\"Iacute\", None, 234, 205, 205),\n    (\"Icircumflex\", None, 235, 206, 206),\n    (\"Idieresis\", None, 236, 207, 207),\n    (\"Igrave\", None, 237, 204, 204),\n    (\"J\", 74, 74, 74, 74),\n    (\"K\", 75, 75, 75, 75),\n    (\"L\", 76, 76, 76, 76),\n    (\"Lslash\", 232, None, None, 149),\n    (\"M\", 77, 77, 77, 77),\n    (\"N\", 78, 78, 78, 78),\n    (\"Ntilde\", None, 132, 209, 209),\n    (\"O\", 79, 79, 79, 79),\n    (\"OE\", 234, 206, 140, 150),\n    (\"Oacute\", None, 238, 211, 211),\n    (\"Ocircumflex\", None, 239, 212, 212),\n    (\"Odieresis\", None, 133, 214, 214),\n    (\"Ograve\", None, 241, 210, 210),\n    (\"Oslash\", 233, 175, 216, 216),\n    (\"Otilde\", None, 205, 213, 213),\n    (\"P\", 80, 80, 80, 80),\n    (\"Q\", 81, 81, 81, 81),\n    (\"R\", 82, 82, 82, 82),\n    (\"S\", 83, 83, 83, 83),\n    
(\"Scaron\", None, None, 138, 151),\n    (\"T\", 84, 84, 84, 84),\n    (\"Thorn\", None, None, 222, 222),\n    (\"U\", 85, 85, 85, 85),\n    (\"Uacute\", None, 242, 218, 218),\n    (\"Ucircumflex\", None, 243, 219, 219),\n    (\"Udieresis\", None, 134, 220, 220),\n    (\"Ugrave\", None, 244, 217, 217),\n    (\"V\", 86, 86, 86, 86),\n    (\"W\", 87, 87, 87, 87),\n    (\"X\", 88, 88, 88, 88),\n    (\"Y\", 89, 89, 89, 89),\n    (\"Yacute\", None, None, 221, 221),\n    (\"Ydieresis\", None, 217, 159, 152),\n    (\"Z\", 90, 90, 90, 90),\n    (\"Zcaron\", None, None, 142, 153),\n    (\"a\", 97, 97, 97, 97),\n    (\"aacute\", None, 135, 225, 225),\n    (\"acircumflex\", None, 137, 226, 226),\n    (\"acute\", 194, 171, 180, 180),\n    (\"adieresis\", None, 138, 228, 228),\n    (\"ae\", 241, 190, 230, 230),\n    (\"agrave\", None, 136, 224, 224),\n    (\"ampersand\", 38, 38, 38, 38),\n    (\"aring\", None, 140, 229, 229),\n    (\"asciicircum\", 94, 94, 94, 94),\n    (\"asciitilde\", 126, 126, 126, 126),\n    (\"asterisk\", 42, 42, 42, 42),\n    (\"at\", 64, 64, 64, 64),\n    (\"atilde\", None, 139, 227, 227),\n    (\"b\", 98, 98, 98, 98),\n    (\"backslash\", 92, 92, 92, 92),\n    (\"bar\", 124, 124, 124, 124),\n    (\"braceleft\", 123, 123, 123, 123),\n    (\"braceright\", 125, 125, 125, 125),\n    (\"bracketleft\", 91, 91, 91, 91),\n    (\"bracketright\", 93, 93, 93, 93),\n    (\"breve\", 198, 249, None, 24),\n    (\"brokenbar\", None, None, 166, 166),\n    (\"bullet\", 183, 165, 149, 128),\n    (\"c\", 99, 99, 99, 99),\n    (\"caron\", 207, 255, None, 25),\n    (\"ccedilla\", None, 141, 231, 231),\n    (\"cedilla\", 203, 252, 184, 184),\n    (\"cent\", 162, 162, 162, 162),\n    (\"circumflex\", 195, 246, 136, 26),\n    (\"colon\", 58, 58, 58, 58),\n    (\"comma\", 44, 44, 44, 44),\n    (\"copyright\", None, 169, 169, 169),\n    (\"currency\", 168, 219, 164, 164),\n    (\"d\", 100, 100, 100, 100),\n    (\"dagger\", 178, 160, 134, 129),\n    (\"daggerdbl\", 179, 224, 135, 
130),\n    (\"degree\", None, 161, 176, 176),\n    (\"dieresis\", 200, 172, 168, 168),\n    (\"divide\", None, 214, 247, 247),\n    (\"dollar\", 36, 36, 36, 36),\n    (\"dotaccent\", 199, 250, None, 27),\n    (\"dotlessi\", 245, 245, None, 154),\n    (\"e\", 101, 101, 101, 101),\n    (\"eacute\", None, 142, 233, 233),\n    (\"ecircumflex\", None, 144, 234, 234),\n    (\"edieresis\", None, 145, 235, 235),\n    (\"egrave\", None, 143, 232, 232),\n    (\"eight\", 56, 56, 56, 56),\n    (\"ellipsis\", 188, 201, 133, 131),\n    (\"emdash\", 208, 209, 151, 132),\n    (\"endash\", 177, 208, 150, 133),\n    (\"equal\", 61, 61, 61, 61),\n    (\"eth\", None, None, 240, 240),\n    (\"exclam\", 33, 33, 33, 33),\n    (\"exclamdown\", 161, 193, 161, 161),\n    (\"f\", 102, 102, 102, 102),\n    (\"fi\", 174, 222, None, 147),\n    (\"five\", 53, 53, 53, 53),\n    (\"fl\", 175, 223, None, 148),\n    (\"florin\", 166, 196, 131, 134),\n    (\"four\", 52, 52, 52, 52),\n    (\"fraction\", 164, 218, None, 135),\n    (\"g\", 103, 103, 103, 103),\n    (\"germandbls\", 251, 167, 223, 223),\n    (\"grave\", 193, 96, 96, 96),\n    (\"greater\", 62, 62, 62, 62),\n    (\"guillemotleft\", 171, 199, 171, 171),\n    (\"guillemotright\", 187, 200, 187, 187),\n    (\"guilsinglleft\", 172, 220, 139, 136),\n    (\"guilsinglright\", 173, 221, 155, 137),\n    (\"h\", 104, 104, 104, 104),\n    (\"hungarumlaut\", 205, 253, None, 28),\n    (\"hyphen\", 45, 45, 45, 45),\n    (\"i\", 105, 105, 105, 105),\n    (\"iacute\", None, 146, 237, 237),\n    (\"icircumflex\", None, 148, 238, 238),\n    (\"idieresis\", None, 149, 239, 239),\n    (\"igrave\", None, 147, 236, 236),\n    (\"j\", 106, 106, 106, 106),\n    (\"k\", 107, 107, 107, 107),\n    (\"l\", 108, 108, 108, 108),\n    (\"less\", 60, 60, 60, 60),\n    (\"logicalnot\", None, 194, 172, 172),\n    (\"lslash\", 248, None, None, 155),\n    (\"m\", 109, 109, 109, 109),\n    (\"macron\", 197, 248, 175, 175),\n    (\"minus\", None, None, None, 138),\n    
(\"mu\", None, 181, 181, 181),\n    (\"multiply\", None, None, 215, 215),\n    (\"n\", 110, 110, 110, 110),\n    (\"nbspace\", None, 202, 160, None),\n    (\"nine\", 57, 57, 57, 57),\n    (\"ntilde\", None, 150, 241, 241),\n    (\"numbersign\", 35, 35, 35, 35),\n    (\"o\", 111, 111, 111, 111),\n    (\"oacute\", None, 151, 243, 243),\n    (\"ocircumflex\", None, 153, 244, 244),\n    (\"odieresis\", None, 154, 246, 246),\n    (\"oe\", 250, 207, 156, 156),\n    (\"ogonek\", 206, 254, None, 29),\n    (\"ograve\", None, 152, 242, 242),\n    (\"one\", 49, 49, 49, 49),\n    (\"onehalf\", None, None, 189, 189),\n    (\"onequarter\", None, None, 188, 188),\n    (\"onesuperior\", None, None, 185, 185),\n    (\"ordfeminine\", 227, 187, 170, 170),\n    (\"ordmasculine\", 235, 188, 186, 186),\n    (\"oslash\", 249, 191, 248, 248),\n    (\"otilde\", None, 155, 245, 245),\n    (\"p\", 112, 112, 112, 112),\n    (\"paragraph\", 182, 166, 182, 182),\n    (\"parenleft\", 40, 40, 40, 40),\n    (\"parenright\", 41, 41, 41, 41),\n    (\"percent\", 37, 37, 37, 37),\n    (\"period\", 46, 46, 46, 46),\n    (\"periodcentered\", 180, 225, 183, 183),\n    (\"perthousand\", 189, 228, 137, 139),\n    (\"plus\", 43, 43, 43, 43),\n    (\"plusminus\", None, 177, 177, 177),\n    (\"q\", 113, 113, 113, 113),\n    (\"question\", 63, 63, 63, 63),\n    (\"questiondown\", 191, 192, 191, 191),\n    (\"quotedbl\", 34, 34, 34, 34),\n    (\"quotedblbase\", 185, 227, 132, 140),\n    (\"quotedblleft\", 170, 210, 147, 141),\n    (\"quotedblright\", 186, 211, 148, 142),\n    (\"quoteleft\", 96, 212, 145, 143),\n    (\"quoteright\", 39, 213, 146, 144),\n    (\"quotesinglbase\", 184, 226, 130, 145),\n    (\"quotesingle\", 169, 39, 39, 39),\n    (\"r\", 114, 114, 114, 114),\n    (\"registered\", None, 168, 174, 174),\n    (\"ring\", 202, 251, None, 30),\n    (\"s\", 115, 115, 115, 115),\n    (\"scaron\", None, None, 154, 157),\n    (\"section\", 167, 164, 167, 167),\n    (\"semicolon\", 59, 59, 59, 59),\n    
(\"seven\", 55, 55, 55, 55),\n    (\"six\", 54, 54, 54, 54),\n    (\"slash\", 47, 47, 47, 47),\n    (\"space\", 32, 32, 32, 32),\n    (\"space\", None, 202, 160, None),\n    (\"space\", None, 202, 173, None),\n    (\"sterling\", 163, 163, 163, 163),\n    (\"t\", 116, 116, 116, 116),\n    (\"thorn\", None, None, 254, 254),\n    (\"three\", 51, 51, 51, 51),\n    (\"threequarters\", None, None, 190, 190),\n    (\"threesuperior\", None, None, 179, 179),\n    (\"tilde\", 196, 247, 152, 31),\n    (\"trademark\", None, 170, 153, 146),\n    (\"two\", 50, 50, 50, 50),\n    (\"twosuperior\", None, None, 178, 178),\n    (\"u\", 117, 117, 117, 117),\n    (\"uacute\", None, 156, 250, 250),\n    (\"ucircumflex\", None, 158, 251, 251),\n    (\"udieresis\", None, 159, 252, 252),\n    (\"ugrave\", None, 157, 249, 249),\n    (\"underscore\", 95, 95, 95, 95),\n    (\"v\", 118, 118, 118, 118),\n    (\"w\", 119, 119, 119, 119),\n    (\"x\", 120, 120, 120, 120),\n    (\"y\", 121, 121, 121, 121),\n    (\"yacute\", None, None, 253, 253),\n    (\"ydieresis\", None, 216, 255, 255),\n    (\"yen\", 165, 180, 165, 165),\n    (\"z\", 122, 122, 122, 122),\n    (\"zcaron\", None, None, 158, 158),\n    (\"zero\", 48, 48, 48, 48),\n]\n"
  },
  {
    "path": "babeldoc/pdfminer/layout.py",
    "content": "import heapq\nimport logging\nfrom collections.abc import Iterable\nfrom collections.abc import Iterator\nfrom collections.abc import Sequence\nfrom typing import Generic\nfrom typing import TypeVar\nfrom typing import Union\nfrom typing import cast\n\nfrom babeldoc.format.pdf.babelpdf.utils import guarded_bbox\nfrom babeldoc.pdfminer.pdfcolor import PDFColorSpace\nfrom babeldoc.pdfminer.pdfexceptions import PDFTypeError\nfrom babeldoc.pdfminer.pdfexceptions import PDFValueError\nfrom babeldoc.pdfminer.pdffont import PDFFont\nfrom babeldoc.pdfminer.pdfinterp import Color\nfrom babeldoc.pdfminer.pdfinterp import PDFGraphicState\nfrom babeldoc.pdfminer.pdftypes import PDFStream\nfrom babeldoc.pdfminer.utils import INF\nfrom babeldoc.pdfminer.utils import LTComponentT\nfrom babeldoc.pdfminer.utils import Matrix\nfrom babeldoc.pdfminer.utils import PathSegment\nfrom babeldoc.pdfminer.utils import Plane\nfrom babeldoc.pdfminer.utils import Point\nfrom babeldoc.pdfminer.utils import Rect\nfrom babeldoc.pdfminer.utils import apply_matrix_pt\nfrom babeldoc.pdfminer.utils import bbox2str\nfrom babeldoc.pdfminer.utils import fsplit\nfrom babeldoc.pdfminer.utils import get_bound\nfrom babeldoc.pdfminer.utils import matrix2str\nfrom babeldoc.pdfminer.utils import uniq\n\nlogger = logging.getLogger(__name__)\n\n\nclass IndexAssigner:\n    def __init__(self, index: int = 0) -> None:\n        self.index = index\n\n    def run(self, obj: \"LTItem\") -> None:\n        if isinstance(obj, LTTextBox):\n            obj.index = self.index\n            self.index += 1\n        elif isinstance(obj, LTTextGroup):\n            for x in obj:\n                self.run(x)\n\n\nclass LAParams:\n    \"\"\"Parameters for layout analysis\n\n    :param line_overlap: If two characters have more overlap than this they\n        are considered to be on the same line. 
The overlap is specified\n        relative to the minimum height of both characters.\n    :param char_margin: If two characters are closer together than this\n        margin they are considered part of the same line. The margin is\n        specified relative to the width of the character.\n    :param word_margin: If two characters on the same line are further apart\n        than this margin then they are considered to be two separate words, and\n        an intermediate space will be added for readability. The margin is\n        specified relative to the width of the character.\n    :param line_margin: If two lines are close together they are\n        considered to be part of the same paragraph. The margin is\n        specified relative to the height of a line.\n    :param boxes_flow: Specifies how much a horizontal and vertical position\n        of a text matters when determining the order of text boxes. The value\n        should be within the range of -1.0 (only horizontal position\n        matters) to +1.0 (only vertical position matters). 
You can also pass\n        `None` to disable advanced layout analysis, and instead return text\n        based on the position of the bottom left corner of the text box.\n    :param detect_vertical: If vertical text should be considered during\n        layout analysis\n    :param all_texts: If layout analysis should be performed on text in\n        figures.\n    \"\"\"\n\n    def __init__(\n        self,\n        line_overlap: float = 0.5,\n        char_margin: float = 2.0,\n        line_margin: float = 0.5,\n        word_margin: float = 0.1,\n        boxes_flow: float | None = 0.5,\n        detect_vertical: bool = False,\n        all_texts: bool = False,\n    ) -> None:\n        self.line_overlap = line_overlap\n        self.char_margin = char_margin\n        self.line_margin = line_margin\n        self.word_margin = word_margin\n        self.boxes_flow = boxes_flow\n        self.detect_vertical = detect_vertical\n        self.all_texts = all_texts\n\n        self._validate()\n\n    def _validate(self) -> None:\n        if self.boxes_flow is not None:\n            boxes_flow_err_msg = (\n                \"LAParam boxes_flow should be None, or a number between -1 and +1\"\n            )\n            if not (\n                isinstance(self.boxes_flow, int) or isinstance(self.boxes_flow, float)\n            ):\n                raise PDFTypeError(boxes_flow_err_msg)\n            if not -1 <= self.boxes_flow <= 1:\n                raise PDFValueError(boxes_flow_err_msg)\n\n    def __repr__(self) -> str:\n        return (\n            \"<LAParams: char_margin=%.1f, line_margin=%.1f, \"\n            \"word_margin=%.1f all_texts=%r>\"\n            % (self.char_margin, self.line_margin, self.word_margin, self.all_texts)\n        )\n\n\nclass LTItem:\n    \"\"\"Interface for things that can be analyzed\"\"\"\n\n    def analyze(self, laparams: LAParams) -> None:\n        \"\"\"Perform the layout analysis.\"\"\"\n\n\nclass LTText:\n    \"\"\"Interface for things that have 
text\"\"\"\n\n    def __repr__(self) -> str:\n        return f\"<{self.__class__.__name__} {self.get_text()!r}>\"\n\n    def get_text(self) -> str:\n        \"\"\"Text contained in this object\"\"\"\n        raise NotImplementedError\n\n\nclass LTComponent(LTItem):\n    \"\"\"Object with a bounding box\"\"\"\n\n    def __init__(self, bbox: Rect) -> None:\n        LTItem.__init__(self)\n        self.set_bbox(bbox)\n\n    def __repr__(self) -> str:\n        return f\"<{self.__class__.__name__} {bbox2str(self.bbox)}>\"\n\n    # Disable comparison.\n    def __lt__(self, _: object) -> bool:\n        raise PDFValueError\n\n    def __le__(self, _: object) -> bool:\n        raise PDFValueError\n\n    def __gt__(self, _: object) -> bool:\n        raise PDFValueError\n\n    def __ge__(self, _: object) -> bool:\n        raise PDFValueError\n\n    def set_bbox(self, bbox: Rect) -> None:\n        (x0, y0, x1, y1) = bbox\n        self.x0 = x0\n        self.y0 = y0\n        self.x1 = x1\n        self.y1 = y1\n        self.width = x1 - x0\n        self.height = y1 - y0\n        self.bbox = bbox\n\n    def is_empty(self) -> bool:\n        return self.width <= 0 or self.height <= 0\n\n    def is_hoverlap(self, obj: \"LTComponent\") -> bool:\n        assert isinstance(obj, LTComponent), str(type(obj))\n        return obj.x0 <= self.x1 and self.x0 <= obj.x1\n\n    def hdistance(self, obj: \"LTComponent\") -> float:\n        assert isinstance(obj, LTComponent), str(type(obj))\n        if self.is_hoverlap(obj):\n            return 0\n        else:\n            return min(abs(self.x0 - obj.x1), abs(self.x1 - obj.x0))\n\n    def hoverlap(self, obj: \"LTComponent\") -> float:\n        assert isinstance(obj, LTComponent), str(type(obj))\n        if self.is_hoverlap(obj):\n            return min(abs(self.x0 - obj.x1), abs(self.x1 - obj.x0))\n        else:\n            return 0\n\n    def is_voverlap(self, obj: \"LTComponent\") -> bool:\n        assert isinstance(obj, LTComponent), 
str(type(obj))\n        return obj.y0 <= self.y1 and self.y0 <= obj.y1\n\n    def vdistance(self, obj: \"LTComponent\") -> float:\n        assert isinstance(obj, LTComponent), str(type(obj))\n        if self.is_voverlap(obj):\n            return 0\n        else:\n            return min(abs(self.y0 - obj.y1), abs(self.y1 - obj.y0))\n\n    def voverlap(self, obj: \"LTComponent\") -> float:\n        assert isinstance(obj, LTComponent), str(type(obj))\n        if self.is_voverlap(obj):\n            return min(abs(self.y0 - obj.y1), abs(self.y1 - obj.y0))\n        else:\n            return 0\n\n\nclass LTCurve(LTComponent):\n    \"\"\"A generic Bezier curve\n\n    The parameter `original_path` contains the original\n    pathing information from the pdf (e.g. for reconstructing Bezier Curves).\n\n    `dashing_style` contains the Dashing information if any.\n    \"\"\"\n\n    def __init__(\n        self,\n        linewidth: float,\n        pts: list[Point],\n        stroke: bool = False,\n        fill: bool = False,\n        evenodd: bool = False,\n        stroking_color: Color | None = None,\n        non_stroking_color: Color | None = None,\n        original_path: list[PathSegment] | None = None,\n        dashing_style: tuple[object, object] | None = None,\n    ) -> None:\n        LTComponent.__init__(self, get_bound(pts))\n        self.pts = pts\n        self.linewidth = linewidth\n        self.stroke = stroke\n        self.fill = fill\n        self.evenodd = evenodd\n        self.stroking_color = stroking_color\n        self.non_stroking_color = non_stroking_color\n        self.original_path = original_path\n        self.dashing_style = dashing_style\n\n    def get_pts(self) -> str:\n        return \",\".join(\"%.3f,%.3f\" % p for p in self.pts)\n\n\nclass LTLine(LTCurve):\n    \"\"\"A single straight line.\n\n    Could be used for separating text or figures.\n    \"\"\"\n\n    def __init__(\n        self,\n        linewidth: float,\n        p0: Point,\n        p1: 
Point,\n        stroke: bool = False,\n        fill: bool = False,\n        evenodd: bool = False,\n        stroking_color: Color | None = None,\n        non_stroking_color: Color | None = None,\n        original_path: list[PathSegment] | None = None,\n        dashing_style: tuple[object, object] | None = None,\n    ) -> None:\n        LTCurve.__init__(\n            self,\n            linewidth,\n            [p0, p1],\n            stroke,\n            fill,\n            evenodd,\n            stroking_color,\n            non_stroking_color,\n            original_path,\n            dashing_style,\n        )\n\n\nclass LTRect(LTCurve):\n    \"\"\"A rectangle.\n\n    Could be used for framing other pictures or figures.\n    \"\"\"\n\n    def __init__(\n        self,\n        linewidth: float,\n        bbox: Rect,\n        stroke: bool = False,\n        fill: bool = False,\n        evenodd: bool = False,\n        stroking_color: Color | None = None,\n        non_stroking_color: Color | None = None,\n        original_path: list[PathSegment] | None = None,\n        dashing_style: tuple[object, object] | None = None,\n    ) -> None:\n        (x0, y0, x1, y1) = bbox\n        LTCurve.__init__(\n            self,\n            linewidth,\n            [(x0, y0), (x1, y0), (x1, y1), (x0, y1)],\n            stroke,\n            fill,\n            evenodd,\n            stroking_color,\n            non_stroking_color,\n            original_path,\n            dashing_style,\n        )\n\n\nclass LTImage(LTComponent):\n    \"\"\"An image object.\n\n    Embedded images can be in JPEG, Bitmap or JBIG2.\n    \"\"\"\n\n    def __init__(self, name: str, stream: PDFStream, bbox: Rect) -> None:\n        LTComponent.__init__(self, bbox)\n        self.name = name\n        self.stream = stream\n        self.srcsize = (stream.get_any((\"W\", \"Width\")), stream.get_any((\"H\", \"Height\")))\n        self.imagemask = stream.get_any((\"IM\", \"ImageMask\"))\n        self.bits = 
stream.get_any((\"BPC\", \"BitsPerComponent\"), 1)\n        self.colorspace = stream.get_any((\"CS\", \"ColorSpace\"))\n        if not isinstance(self.colorspace, list):\n            self.colorspace = [self.colorspace]\n\n    def __repr__(self) -> str:\n        return f\"<{self.__class__.__name__}({self.name}) {bbox2str(self.bbox)} {self.srcsize!r}>\"\n\n\nclass LTAnno(LTItem, LTText):\n    \"\"\"Actual letter in the text as a Unicode string.\n\n    Note that, while a LTChar object has actual boundaries, LTAnno objects does\n    not, as these are \"virtual\" characters, inserted by a layout analyzer\n    according to the relationship between two characters (e.g. a space).\n    \"\"\"\n\n    def __init__(self, text: str) -> None:\n        self._text = text\n\n    def get_text(self) -> str:\n        return self._text\n\n\nclass LTChar(LTComponent, LTText):\n    \"\"\"Actual letter in the text as a Unicode string.\"\"\"\n\n    def __init__(\n        self,\n        matrix: Matrix,\n        font: PDFFont,\n        fontsize: float,\n        scaling: float,\n        rise: float,\n        text: str,\n        textwidth: float,\n        textdisp: float | tuple[float | None, float],\n        ncs: PDFColorSpace,\n        graphicstate: PDFGraphicState,\n    ) -> None:\n        LTText.__init__(self)\n        self._text = text\n        self.matrix = matrix\n        self.fontname = font.fontname\n        self.ncs = ncs\n        self.graphicstate = graphicstate\n        self.adv = textwidth * fontsize * scaling\n        # compute the boundary rectangle.\n        if font.is_vertical():\n            # vertical\n            assert isinstance(textdisp, tuple)\n            (vx, vy) = textdisp\n            if vx is None:\n                vx = fontsize * 0.5\n            else:\n                vx = vx * fontsize * 0.001\n            vy = (1000 - vy) * fontsize * 0.001\n            bbox_lower_left = (-vx, vy + rise + self.adv)\n            bbox_upper_right = (-vx + fontsize, vy + rise)\n   
     else:\n            # horizontal\n            descent = font.get_descent() * fontsize\n            bbox_lower_left = (0, descent + rise)\n            bbox_upper_right = (self.adv, descent + rise + fontsize)\n        (a, b, c, d, e, f) = self.matrix\n        self.upright = a * d * scaling > 0 and b * c <= 0\n        (x0, y0) = apply_matrix_pt(self.matrix, bbox_lower_left)\n        (x1, y1) = apply_matrix_pt(self.matrix, bbox_upper_right)\n        if x1 < x0:\n            (x0, x1) = (x1, x0)\n        if y1 < y0:\n            (y0, y1) = (y1, y0)\n        LTComponent.__init__(self, (x0, y0, x1, y1))\n        if font.is_vertical():\n            self.size = self.width\n        else:\n            self.size = self.height\n\n    def __repr__(self) -> str:\n        return f\"<{self.__class__.__name__} {bbox2str(self.bbox)} matrix={matrix2str(self.matrix)} font={self.fontname!r} adv={self.adv} text={self.get_text()!r}>\"\n\n    def get_text(self) -> str:\n        return self._text\n\n\nLTItemT = TypeVar(\"LTItemT\", bound=LTItem)\n\n\nclass LTContainer(LTComponent, Generic[LTItemT]):\n    \"\"\"Object that can be extended and analyzed\"\"\"\n\n    def __init__(self, bbox: Rect) -> None:\n        LTComponent.__init__(self, bbox)\n        self._objs: list[LTItemT] = []\n\n    def __iter__(self) -> Iterator[LTItemT]:\n        return iter(self._objs)\n\n    def __len__(self) -> int:\n        return len(self._objs)\n\n    def add(self, obj: LTItemT) -> None:\n        self._objs.append(obj)\n\n    def extend(self, objs: Iterable[LTItemT]) -> None:\n        for obj in objs:\n            self.add(obj)\n\n    def analyze(self, laparams: LAParams) -> None:\n        for obj in self._objs:\n            obj.analyze(laparams)\n\n\nclass LTExpandableContainer(LTContainer[LTItemT]):\n    def __init__(self) -> None:\n        LTContainer.__init__(self, (+INF, +INF, -INF, -INF))\n\n    # Incompatible override: we take an LTComponent (with bounding box), but\n    # super() LTContainer only 
considers LTItem (no bounding box).\n    def add(self, obj: LTComponent) -> None:  # type: ignore[override]\n        LTContainer.add(self, cast(LTItemT, obj))\n        self.set_bbox(\n            (\n                min(self.x0, obj.x0),\n                min(self.y0, obj.y0),\n                max(self.x1, obj.x1),\n                max(self.y1, obj.y1),\n            ),\n        )\n\n\nclass LTTextContainer(LTExpandableContainer[LTItemT], LTText):\n    def __init__(self) -> None:\n        LTText.__init__(self)\n        LTExpandableContainer.__init__(self)\n\n    def get_text(self) -> str:\n        return \"\".join(\n            cast(LTText, obj).get_text() for obj in self if isinstance(obj, LTText)\n        )\n\n\nTextLineElement = Union[LTChar, LTAnno]\n\n\nclass LTTextLine(LTTextContainer[TextLineElement]):\n    \"\"\"Contains a list of LTChar objects that represent a single text line.\n\n    The characters are aligned either horizontally or vertically, depending on\n    the text's writing mode.\n    \"\"\"\n\n    def __init__(self, word_margin: float) -> None:\n        super().__init__()\n        self.word_margin = word_margin\n\n    def __repr__(self) -> str:\n        return f\"<{self.__class__.__name__} {bbox2str(self.bbox)} {self.get_text()!r}>\"\n\n    def analyze(self, laparams: LAParams) -> None:\n        for obj in self._objs:\n            obj.analyze(laparams)\n        LTContainer.add(self, LTAnno(\"\\n\"))\n\n    def find_neighbors(\n        self,\n        plane: Plane[LTComponentT],\n        ratio: float,\n    ) -> list[\"LTTextLine\"]:\n        raise NotImplementedError\n\n    def is_empty(self) -> bool:\n        return super().is_empty() or self.get_text().isspace()\n\n\nclass LTTextLineHorizontal(LTTextLine):\n    def __init__(self, word_margin: float) -> None:\n        LTTextLine.__init__(self, word_margin)\n        self._x1: float = +INF\n\n    # Incompatible override: we take an LTComponent (with bounding box), but\n    # LTContainer only considers 
LTItem (no bounding box).\n    def add(self, obj: LTComponent) -> None:  # type: ignore[override]\n        if isinstance(obj, LTChar) and self.word_margin:\n            margin = self.word_margin * max(obj.width, obj.height)\n            if self._x1 < obj.x0 - margin:\n                LTContainer.add(self, LTAnno(\" \"))\n        self._x1 = obj.x1\n        super().add(obj)\n\n    def find_neighbors(\n        self,\n        plane: Plane[LTComponentT],\n        ratio: float,\n    ) -> list[LTTextLine]:\n        \"\"\"Finds neighboring LTTextLineHorizontals in the plane.\n\n        Returns a list of other LTTextLineHorizontals in the plane which are\n        close to self. \"Close\" can be controlled by ratio. The returned objects\n        will be the same height as self, and also either left-, right-, or\n        centrally-aligned.\n        \"\"\"\n        d = ratio * self.height\n        objs = plane.find((self.x0, self.y0 - d, self.x1, self.y1 + d))\n        return [\n            obj\n            for obj in objs\n            if (\n                isinstance(obj, LTTextLineHorizontal)\n                and self._is_same_height_as(obj, tolerance=d)\n                and (\n                    self._is_left_aligned_with(obj, tolerance=d)\n                    or self._is_right_aligned_with(obj, tolerance=d)\n                    or self._is_centrally_aligned_with(obj, tolerance=d)\n                )\n            )\n        ]\n\n    def _is_left_aligned_with(self, other: LTComponent, tolerance: float = 0) -> bool:\n        \"\"\"Whether the left-hand edge of `other` is within `tolerance`.\"\"\"\n        return abs(other.x0 - self.x0) <= tolerance\n\n    def _is_right_aligned_with(self, other: LTComponent, tolerance: float = 0) -> bool:\n        \"\"\"Whether the right-hand edge of `other` is within `tolerance`.\"\"\"\n        return abs(other.x1 - self.x1) <= tolerance\n\n    def _is_centrally_aligned_with(\n        self,\n        other: LTComponent,\n        tolerance: 
float = 0,\n    ) -> bool:\n        \"\"\"Whether the horizontal center of `other` is within `tolerance`.\"\"\"\n        return abs((other.x0 + other.x1) / 2 - (self.x0 + self.x1) / 2) <= tolerance\n\n    def _is_same_height_as(self, other: LTComponent, tolerance: float = 0) -> bool:\n        return abs(other.height - self.height) <= tolerance\n\n\nclass LTTextLineVertical(LTTextLine):\n    def __init__(self, word_margin: float) -> None:\n        LTTextLine.__init__(self, word_margin)\n        self._y0: float = -INF\n\n    # Incompatible override: we take an LTComponent (with bounding box), but\n    # LTContainer only considers LTItem (no bounding box).\n    def add(self, obj: LTComponent) -> None:  # type: ignore[override]\n        if isinstance(obj, LTChar) and self.word_margin:\n            margin = self.word_margin * max(obj.width, obj.height)\n            if obj.y1 + margin < self._y0:\n                LTContainer.add(self, LTAnno(\" \"))\n        self._y0 = obj.y0\n        super().add(obj)\n\n    def find_neighbors(\n        self,\n        plane: Plane[LTComponentT],\n        ratio: float,\n    ) -> list[LTTextLine]:\n        \"\"\"Finds neighboring LTTextLineVerticals in the plane.\n\n        Returns a list of other LTTextLineVerticals in the plane which are\n        close to self. \"Close\" can be controlled by ratio. 
The returned objects\n        will be the same width as self, and also either upper-, lower-, or\n        centrally-aligned.\n        \"\"\"\n        d = ratio * self.width\n        objs = plane.find((self.x0 - d, self.y0, self.x1 + d, self.y1))\n        return [\n            obj\n            for obj in objs\n            if (\n                isinstance(obj, LTTextLineVertical)\n                and self._is_same_width_as(obj, tolerance=d)\n                and (\n                    self._is_lower_aligned_with(obj, tolerance=d)\n                    or self._is_upper_aligned_with(obj, tolerance=d)\n                    or self._is_centrally_aligned_with(obj, tolerance=d)\n                )\n            )\n        ]\n\n    def _is_lower_aligned_with(self, other: LTComponent, tolerance: float = 0) -> bool:\n        \"\"\"Whether the lower edge of `other` is within `tolerance`.\"\"\"\n        return abs(other.y0 - self.y0) <= tolerance\n\n    def _is_upper_aligned_with(self, other: LTComponent, tolerance: float = 0) -> bool:\n        \"\"\"Whether the upper edge of `other` is within `tolerance`.\"\"\"\n        return abs(other.y1 - self.y1) <= tolerance\n\n    def _is_centrally_aligned_with(\n        self,\n        other: LTComponent,\n        tolerance: float = 0,\n    ) -> bool:\n        \"\"\"Whether the vertical center of `other` is within `tolerance`.\"\"\"\n        return abs((other.y0 + other.y1) / 2 - (self.y0 + self.y1) / 2) <= tolerance\n\n    def _is_same_width_as(self, other: LTComponent, tolerance: float) -> bool:\n        return abs(other.width - self.width) <= tolerance\n\n\nclass LTTextBox(LTTextContainer[LTTextLine]):\n    \"\"\"Represents a group of text chunks in a rectangular area.\n\n    Note that this box is created by geometric analysis and does not\n    necessarily represent a logical boundary of the text. 
It contains a list\n    of LTTextLine objects.\n    \"\"\"\n\n    def __init__(self) -> None:\n        LTTextContainer.__init__(self)\n        self.index: int = -1\n\n    def __repr__(self) -> str:\n        return f\"<{self.__class__.__name__}({self.index}) {bbox2str(self.bbox)} {self.get_text()!r}>\"\n\n    def get_writing_mode(self) -> str:\n        raise NotImplementedError\n\n\nclass LTTextBoxHorizontal(LTTextBox):\n    def analyze(self, laparams: LAParams) -> None:\n        super().analyze(laparams)\n        self._objs.sort(key=lambda obj: -obj.y1)\n\n    def get_writing_mode(self) -> str:\n        return \"lr-tb\"\n\n\nclass LTTextBoxVertical(LTTextBox):\n    def analyze(self, laparams: LAParams) -> None:\n        super().analyze(laparams)\n        self._objs.sort(key=lambda obj: -obj.x1)\n\n    def get_writing_mode(self) -> str:\n        return \"tb-rl\"\n\n\nTextGroupElement = Union[LTTextBox, \"LTTextGroup\"]\n\n\nclass LTTextGroup(LTTextContainer[TextGroupElement]):\n    def __init__(self, objs: Iterable[TextGroupElement]) -> None:\n        super().__init__()\n        self.extend(objs)\n\n\nclass LTTextGroupLRTB(LTTextGroup):\n    def analyze(self, laparams: LAParams) -> None:\n        super().analyze(laparams)\n        assert laparams.boxes_flow is not None\n        boxes_flow = laparams.boxes_flow\n        # reorder the objects from top-left to bottom-right.\n        self._objs.sort(\n            key=lambda obj: (1 - boxes_flow) * obj.x0\n            - (1 + boxes_flow) * (obj.y0 + obj.y1),\n        )\n\n\nclass LTTextGroupTBRL(LTTextGroup):\n    def analyze(self, laparams: LAParams) -> None:\n        super().analyze(laparams)\n        assert laparams.boxes_flow is not None\n        boxes_flow = laparams.boxes_flow\n        # reorder the objects from top-right to bottom-left.\n        self._objs.sort(\n            key=lambda obj: -(1 + boxes_flow) * (obj.x0 + obj.x1)\n            - (1 - boxes_flow) * obj.y1,\n        )\n\n\nclass 
LTLayoutContainer(LTContainer[LTComponent]):\n    def __init__(self, bbox: Rect) -> None:\n        LTContainer.__init__(self, bbox)\n        self.groups: list[LTTextGroup] | None = None\n\n    # group_objects: group text object to textlines.\n    def group_objects(\n        self,\n        laparams: LAParams,\n        objs: Iterable[LTComponent],\n    ) -> Iterator[LTTextLine]:\n        obj0 = None\n        line = None\n        for obj1 in objs:\n            if obj0 is not None:\n                # halign: obj0 and obj1 is horizontally aligned.\n                #\n                #   +------+ - - -\n                #   | obj0 | - - +------+   -\n                #   |      |     | obj1 |   | (line_overlap)\n                #   +------+ - - |      |   -\n                #          - - - +------+\n                #\n                #          |<--->|\n                #        (char_margin)\n                halign = (\n                    obj0.is_voverlap(obj1)\n                    and min(obj0.height, obj1.height) * laparams.line_overlap\n                    < obj0.voverlap(obj1)\n                    and obj0.hdistance(obj1)\n                    < max(obj0.width, obj1.width) * laparams.char_margin\n                )\n\n                # valign: obj0 and obj1 is vertically aligned.\n                #\n                #   +------+\n                #   | obj0 |\n                #   |      |\n                #   +------+ - - -\n                #     |    |     | (char_margin)\n                #     +------+ - -\n                #     | obj1 |\n                #     |      |\n                #     +------+\n                #\n                #     |<-->|\n                #   (line_overlap)\n                valign = (\n                    laparams.detect_vertical\n                    and obj0.is_hoverlap(obj1)\n                    and min(obj0.width, obj1.width) * laparams.line_overlap\n                    < obj0.hoverlap(obj1)\n                    and obj0.vdistance(obj1)\n  
                  < max(obj0.height, obj1.height) * laparams.char_margin\n                )\n\n                if (halign and isinstance(line, LTTextLineHorizontal)) or (\n                    valign and isinstance(line, LTTextLineVertical)\n                ):\n                    line.add(obj1)\n                elif line is not None:\n                    yield line\n                    line = None\n                elif valign and not halign:\n                    line = LTTextLineVertical(laparams.word_margin)\n                    line.add(obj0)\n                    line.add(obj1)\n                elif halign and not valign:\n                    line = LTTextLineHorizontal(laparams.word_margin)\n                    line.add(obj0)\n                    line.add(obj1)\n                else:\n                    line = LTTextLineHorizontal(laparams.word_margin)\n                    line.add(obj0)\n                    yield line\n                    line = None\n            obj0 = obj1\n        if line is None:\n            line = LTTextLineHorizontal(laparams.word_margin)\n            assert obj0 is not None\n            line.add(obj0)\n        yield line\n\n    def group_textlines(\n        self,\n        laparams: LAParams,\n        lines: Iterable[LTTextLine],\n    ) -> Iterator[LTTextBox]:\n        \"\"\"Group neighboring lines to textboxes\"\"\"\n        plane: Plane[LTTextLine] = Plane(self.bbox)\n        plane.extend(lines)\n        boxes: dict[LTTextLine, LTTextBox] = {}\n        for line in lines:\n            neighbors = line.find_neighbors(plane, laparams.line_margin)\n            members = [line]\n            for obj1 in neighbors:\n                members.append(obj1)\n                if obj1 in boxes:\n                    members.extend(boxes.pop(obj1))\n            if isinstance(line, LTTextLineHorizontal):\n                box: LTTextBox = LTTextBoxHorizontal()\n            else:\n                box = LTTextBoxVertical()\n            for obj in 
uniq(members):\n                box.add(obj)\n                boxes[obj] = box\n        done = set()\n        for line in lines:\n            if line not in boxes:\n                continue\n            box = boxes[line]\n            if box in done:\n                continue\n            done.add(box)\n            if not box.is_empty():\n                yield box\n\n    def group_textboxes(\n        self,\n        laparams: LAParams,\n        boxes: Sequence[LTTextBox],\n    ) -> list[LTTextGroup]:\n        \"\"\"Group textboxes hierarchically.\n\n        Get pair-wise distances, via dist func defined below, and then merge\n        from the closest textbox pair. Once obj1 and obj2 are merged /\n        grouped, the resulting group is considered as a new object, and its\n        distances to other objects & groups are added to the process queue.\n\n        For performance reasons, pair-wise distances and object pair info are\n        maintained in a heap of (skip_isany, dist, id(obj1), id(obj2), obj1, obj2)\n        tuples. It ensures quick access to the smallest element. Note that\n        since comparison operators, e.g., __lt__, are disabled for\n        LTComponent, id(obj) has to appear before obj in element tuples.\n\n        :param laparams: LAParams object.\n        :param boxes: All textbox objects to be grouped.\n        :return: a list that has only one element, the final top level group.\n        \"\"\"\n        ElementT = Union[LTTextBox, LTTextGroup]\n        plane: Plane[ElementT] = Plane(self.bbox)\n\n        def dist(obj1: LTComponent, obj2: LTComponent) -> float:\n            \"\"\"A distance function between two TextBoxes.\n\n            Consider the bounding rectangle for obj1 and obj2.\n            Return its area less the areas of obj1 and obj2,\n            shown as 'www' below. 
This value may be negative.\n                    +------+..........+ (x1, y1)\n                    | obj1 |wwwwwwwwww:\n                    +------+www+------+\n                    :wwwwwwwwww| obj2 |\n            (x0, y0) +..........+------+\n            \"\"\"\n            x0 = min(obj1.x0, obj2.x0)\n            y0 = min(obj1.y0, obj2.y0)\n            x1 = max(obj1.x1, obj2.x1)\n            y1 = max(obj1.y1, obj2.y1)\n            return (\n                (x1 - x0) * (y1 - y0)\n                - obj1.width * obj1.height\n                - obj2.width * obj2.height\n            )\n\n        def isany(obj1: ElementT, obj2: ElementT) -> set[ElementT]:\n            \"\"\"Check if there's any other object between obj1 and obj2.\"\"\"\n            x0 = min(obj1.x0, obj2.x0)\n            y0 = min(obj1.y0, obj2.y0)\n            x1 = max(obj1.x1, obj2.x1)\n            y1 = max(obj1.y1, obj2.y1)\n            objs = set(plane.find((x0, y0, x1, y1)))\n            return objs.difference((obj1, obj2))\n\n        dists: list[tuple[bool, float, int, int, ElementT, ElementT]] = []\n        for i in range(len(boxes)):\n            box1 = boxes[i]\n            for j in range(i + 1, len(boxes)):\n                box2 = boxes[j]\n                dists.append((False, dist(box1, box2), id(box1), id(box2), box1, box2))\n        heapq.heapify(dists)\n\n        plane.extend(boxes)\n        done = set()\n        while len(dists) > 0:\n            (skip_isany, d, id1, id2, obj1, obj2) = heapq.heappop(dists)\n            # Skip objects that are already merged\n            if (id1 not in done) and (id2 not in done):\n                if not skip_isany and isany(obj1, obj2):\n                    heapq.heappush(dists, (True, d, id1, id2, obj1, obj2))\n                    continue\n                if isinstance(obj1, (LTTextBoxVertical, LTTextGroupTBRL)) or isinstance(\n                    obj2,\n                    (LTTextBoxVertical, LTTextGroupTBRL),\n                ):\n                    
group: LTTextGroup = LTTextGroupTBRL([obj1, obj2])\n                else:\n                    group = LTTextGroupLRTB([obj1, obj2])\n                plane.remove(obj1)\n                plane.remove(obj2)\n                done.update([id1, id2])\n\n                for other in plane:\n                    heapq.heappush(\n                        dists,\n                        (False, dist(group, other), id(group), id(other), group, other),\n                    )\n                plane.add(group)\n        # By now only groups are in the plane\n        return list(cast(LTTextGroup, g) for g in plane)\n\n    def analyze(self, laparams: LAParams) -> None:\n        # textobjs is a list of LTChar objects, i.e.\n        # it has all the individual characters in the page.\n        (textobjs, otherobjs) = fsplit(lambda obj: isinstance(obj, LTChar), self)\n        for obj in otherobjs:\n            obj.analyze(laparams)\n        if not textobjs:\n            return\n        textlines = list(self.group_objects(laparams, textobjs))\n        (empties, textlines) = fsplit(lambda obj: obj.is_empty(), textlines)\n        for obj in empties:\n            obj.analyze(laparams)\n        textboxes = list(self.group_textlines(laparams, textlines))\n        if laparams.boxes_flow is None:\n            for textbox in textboxes:\n                textbox.analyze(laparams)\n\n            def getkey(box: LTTextBox) -> tuple[int, float, float]:\n                if isinstance(box, LTTextBoxVertical):\n                    return (0, -box.x1, -box.y0)\n                else:\n                    return (1, -box.y0, box.x0)\n\n            textboxes.sort(key=getkey)\n        else:\n            self.groups = self.group_textboxes(laparams, textboxes)\n            assigner = IndexAssigner()\n            for group in self.groups:\n                group.analyze(laparams)\n                assigner.run(group)\n            textboxes.sort(key=lambda box: box.index)\n        self._objs = (\n            
cast(list[LTComponent], textboxes)\n            + otherobjs\n            + cast(list[LTComponent], empties)\n        )\n\n\nclass LTFigure(LTLayoutContainer):\n    \"\"\"Represents an area used by PDF Form objects.\n\n    PDF Forms can be used to present figures or pictures by embedding yet\n    another PDF document within a page. Note that LTFigure objects can appear\n    recursively.\n    \"\"\"\n\n    def __init__(self, name: str, bbox: Rect, matrix: Matrix) -> None:\n        self.name = name\n        self.matrix = matrix\n        (x, y, w, h) = guarded_bbox(bbox)\n        bounds = ((x, y), (x + w, y), (x, y + h), (x + w, y + h))\n        bbox = get_bound(apply_matrix_pt(matrix, (p, q)) for (p, q) in bounds)\n        LTLayoutContainer.__init__(self, bbox)\n\n    def __repr__(self) -> str:\n        return f\"<{self.__class__.__name__}({self.name}) {bbox2str(self.bbox)} matrix={matrix2str(self.matrix)}>\"\n\n    def analyze(self, laparams: LAParams) -> None:\n        if not laparams.all_texts:\n            return\n        LTLayoutContainer.analyze(self, laparams)\n\n\nclass LTPage(LTLayoutContainer):\n    \"\"\"Represents an entire page.\n\n    Like any other LTLayoutContainer, an LTPage can be iterated to obtain child\n    objects like LTTextBox, LTFigure, LTImage, LTRect, LTCurve and LTLine.\n    \"\"\"\n\n    def __init__(self, pageid: int, bbox: Rect, rotate: float = 0) -> None:\n        LTLayoutContainer.__init__(self, bbox)\n        self.pageid = pageid\n        self.rotate = rotate\n\n    def __repr__(self) -> str:\n        return f\"<{self.__class__.__name__}({self.pageid!r}) {bbox2str(self.bbox)} rotate={self.rotate!r}>\"\n"
  },
  {
    "path": "babeldoc/pdfminer/lzw.py",
    "content": "import logging\nfrom collections.abc import Iterator\nfrom io import BytesIO\nfrom typing import BinaryIO\nfrom typing import cast\n\nfrom babeldoc.pdfminer.pdfexceptions import PDFEOFError\nfrom babeldoc.pdfminer.pdfexceptions import PDFException\n\nlogger = logging.getLogger(__name__)\n\n\nclass CorruptDataError(PDFException):\n    pass\n\n\nclass LZWDecoder:\n    def __init__(self, fp: BinaryIO) -> None:\n        self.fp = fp\n        self.buff = 0\n        self.bpos = 8\n        self.nbits = 9\n        # NB: self.table stores None only in indices 256 and 257\n        self.table: list[bytes | None] = []\n        self.prevbuf: bytes | None = None\n\n    def readbits(self, bits: int) -> int:\n        v = 0\n        while 1:\n            # the number of remaining bits we can get from the current buffer.\n            r = 8 - self.bpos\n            if bits <= r:\n                # |-----8-bits-----|\n                # |-bpos-|-bits-|  |\n                # |      |----r----|\n                v = (v << bits) | ((self.buff >> (r - bits)) & ((1 << bits) - 1))\n                self.bpos += bits\n                break\n            else:\n                # |-----8-bits-----|\n                # |-bpos-|---bits----...\n                # |      |----r----|\n                v = (v << r) | (self.buff & ((1 << r) - 1))\n                bits -= r\n                x = self.fp.read(1)\n                if not x:\n                    raise PDFEOFError\n                self.buff = ord(x)\n                self.bpos = 0\n        return v\n\n    def feed(self, code: int) -> bytes:\n        x = b\"\"\n        if code == 256:\n            self.table = [bytes((c,)) for c in range(256)]  # 0-255\n            self.table.append(None)  # 256\n            self.table.append(None)  # 257\n            self.prevbuf = b\"\"\n            self.nbits = 9\n        elif code == 257:\n            pass\n        elif not self.prevbuf:\n            x = self.prevbuf = cast(bytes, 
self.table[code])  # assume not None\n        else:\n            if code < len(self.table):\n                x = cast(bytes, self.table[code])  # assume not None\n                self.table.append(self.prevbuf + x[:1])\n            elif code == len(self.table):\n                self.table.append(self.prevbuf + self.prevbuf[:1])\n                x = cast(bytes, self.table[code])\n            else:\n                raise CorruptDataError\n            table_length = len(self.table)\n            if table_length == 511:\n                self.nbits = 10\n            elif table_length == 1023:\n                self.nbits = 11\n            elif table_length == 2047:\n                self.nbits = 12\n            self.prevbuf = x\n        return x\n\n    def run(self) -> Iterator[bytes]:\n        while 1:\n            try:\n                code = self.readbits(self.nbits)\n            except EOFError:\n                break\n            try:\n                x = self.feed(code)\n            except CorruptDataError:\n                # just ignore corrupt data and stop yielding there\n                break\n            yield x\n\n            logger.debug(\n                \"nbits=%d, code=%d, output=%r, table=%r\",\n                self.nbits,\n                code,\n                x,\n                self.table[258:],\n            )\n\n\ndef lzwdecode(data: bytes) -> bytes:\n    fp = BytesIO(data)\n    s = LZWDecoder(fp).run()\n    return b\"\".join(s)\n"
  },
  {
    "path": "babeldoc/pdfminer/pdfcolor.py",
    "content": "import collections\n\nfrom babeldoc.pdfminer.psparser import LIT\n\nLITERAL_DEVICE_GRAY = LIT(\"DeviceGray\")\nLITERAL_DEVICE_RGB = LIT(\"DeviceRGB\")\nLITERAL_DEVICE_CMYK = LIT(\"DeviceCMYK\")\n# Abbreviations for inline images\nLITERAL_INLINE_DEVICE_GRAY = LIT(\"G\")\nLITERAL_INLINE_DEVICE_RGB = LIT(\"RGB\")\nLITERAL_INLINE_DEVICE_CMYK = LIT(\"CMYK\")\n\n\nclass PDFColorSpace:\n    def __init__(self, name: str, ncomponents: int) -> None:\n        self.name = name\n        self.ncomponents = ncomponents\n\n    def __repr__(self) -> str:\n        return \"<PDFColorSpace: %s, ncomponents=%d>\" % (self.name, self.ncomponents)\n\n\nPREDEFINED_COLORSPACE: dict[str, PDFColorSpace] = collections.OrderedDict()\n\nfor name, n in [\n    (\"DeviceGray\", 1),  # default value first\n    (\"CalRGB\", 3),\n    (\"CalGray\", 1),\n    (\"Lab\", 3),\n    (\"DeviceRGB\", 3),\n    (\"DeviceCMYK\", 4),\n    (\"Separation\", 1),\n    (\"Indexed\", 1),\n    (\"Pattern\", 1),\n]:\n    PREDEFINED_COLORSPACE[name] = PDFColorSpace(name, n)\n"
  },
  {
    "path": "babeldoc/pdfminer/pdfdevice.py",
    "content": "import logging\nfrom collections.abc import Iterable\nfrom collections.abc import Sequence\nfrom typing import TYPE_CHECKING\nfrom typing import BinaryIO\nfrom typing import Optional\nfrom typing import cast\n\nfrom babeldoc.pdfminer.pdfcolor import PDFColorSpace\nfrom babeldoc.pdfminer.pdffont import PDFFont\nfrom babeldoc.pdfminer.pdffont import PDFUnicodeNotDefined\nfrom babeldoc.pdfminer.pdfpage import PDFPage\nfrom babeldoc.pdfminer.pdftypes import PDFStream\nfrom babeldoc.pdfminer.psparser import PSLiteral\nfrom babeldoc.pdfminer.utils import Matrix\nfrom babeldoc.pdfminer.utils import PathSegment\nfrom babeldoc.pdfminer.utils import Point\nfrom babeldoc.pdfminer.utils import Rect\nfrom babeldoc.pdfminer import utils\n\nif TYPE_CHECKING:\n    from babeldoc.pdfminer.pdfinterp import PDFGraphicState\n    from babeldoc.pdfminer.pdfinterp import PDFResourceManager\n    from babeldoc.pdfminer.pdfinterp import PDFStackT\n    from babeldoc.pdfminer.pdfinterp import PDFTextState\n\n\nPDFTextSeq = Iterable[int | float | bytes]\n\nlogger = logging.getLogger(__name__)\n\n\nclass PDFDevice:\n    \"\"\"Translate the output of PDFPageInterpreter to the output that is needed\"\"\"\n\n    def __init__(self, rsrcmgr: \"PDFResourceManager\") -> None:\n        self.rsrcmgr = rsrcmgr\n        self.ctm: Matrix | None = None\n\n    def __repr__(self) -> str:\n        return \"<PDFDevice>\"\n\n    def __enter__(self) -> \"PDFDevice\":\n        return self\n\n    def __exit__(self, exc_type: object, exc_val: object, exc_tb: object) -> None:\n        self.close()\n\n    def close(self) -> None:\n        pass\n\n    def set_ctm(self, ctm: Matrix) -> None:\n        self.ctm = ctm\n\n    def begin_tag(self, tag: PSLiteral, props: Optional[\"PDFStackT\"] = None) -> None:\n        pass\n\n    def end_tag(self) -> None:\n        pass\n\n    def do_tag(self, tag: PSLiteral, props: Optional[\"PDFStackT\"] = None) -> None:\n        pass\n\n    def begin_page(self, page: 
PDFPage, ctm: Matrix) -> None:\n        pass\n\n    def end_page(self, page: PDFPage) -> None:\n        pass\n\n    def begin_figure(self, name: str, bbox: Rect, matrix: Matrix) -> None:\n        pass\n\n    def end_figure(self, name: str) -> None:\n        pass\n\n    def paint_path(\n        self,\n        graphicstate: \"PDFGraphicState\",\n        stroke: bool,\n        fill: bool,\n        evenodd: bool,\n        path: Sequence[PathSegment],\n    ) -> None:\n        pass\n\n    def render_image(self, name: str, stream: PDFStream) -> None:\n        pass\n\n    def render_string(\n        self,\n        textstate: \"PDFTextState\",\n        seq: PDFTextSeq,\n        ncs: PDFColorSpace,\n        graphicstate: \"PDFGraphicState\",\n    ) -> None:\n        pass\n\n\nclass PDFTextDevice(PDFDevice):\n    def render_string(\n        self,\n        textstate: \"PDFTextState\",\n        seq: PDFTextSeq,\n        ncs: PDFColorSpace,\n        graphicstate: \"PDFGraphicState\",\n    ) -> None:\n        assert self.ctm is not None\n        matrix = utils.mult_matrix(textstate.matrix, self.ctm)\n        font = textstate.font\n        font.font_id_temp = getattr(textstate, \"font_id\", None)\n        fontsize = textstate.fontsize\n        scaling = textstate.scaling * 0.01\n        charspace = textstate.charspace * scaling\n        wordspace = textstate.wordspace * scaling\n        rise = textstate.rise\n        assert font is not None\n        if font.is_multibyte():\n            wordspace = 0\n        dxscale = 0.001 * fontsize * scaling\n        if font.is_vertical():\n            textstate.linematrix = self.render_string_vertical(\n                seq,\n                matrix,\n                textstate.linematrix,\n                font,\n                fontsize,\n                scaling,\n                charspace,\n                wordspace,\n                rise,\n                dxscale,\n                ncs,\n                graphicstate,\n            )\n        
else:\n            textstate.linematrix = self.render_string_horizontal(\n                seq,\n                matrix,\n                textstate.linematrix,\n                font,\n                fontsize,\n                scaling,\n                charspace,\n                wordspace,\n                rise,\n                dxscale,\n                ncs,\n                graphicstate,\n            )\n\n    def render_string_horizontal(\n        self,\n        seq: PDFTextSeq,\n        matrix: Matrix,\n        pos: Point,\n        font: PDFFont,\n        fontsize: float,\n        scaling: float,\n        charspace: float,\n        wordspace: float,\n        rise: float,\n        dxscale: float,\n        ncs: PDFColorSpace,\n        graphicstate: \"PDFGraphicState\",\n    ) -> Point:\n        (x, y) = pos\n        needcharspace = False\n        for obj in seq:\n            if isinstance(obj, (int, float)):\n                x -= obj * dxscale\n                needcharspace = True\n            elif isinstance(obj, bytes):\n                for cid in font.decode(obj):\n                    if needcharspace:\n                        x += charspace\n                    x += self.render_char(\n                        utils.translate_matrix(matrix, (x, y)),\n                        font,\n                        fontsize,\n                        scaling,\n                        rise,\n                        cid,\n                        ncs,\n                        graphicstate,\n                    )\n                    if cid == 32 and wordspace:\n                        x += wordspace\n                    needcharspace = True\n            else:\n                logger.warning(\n                    f\"Cannot render horizontal string because {obj!r} is not a valid int, float or bytes.\"\n                )\n        return (x, y)\n\n    def render_string_vertical(\n        self,\n        seq: PDFTextSeq,\n        matrix: Matrix,\n        pos: Point,\n        font: 
PDFFont,\n        fontsize: float,\n        scaling: float,\n        charspace: float,\n        wordspace: float,\n        rise: float,\n        dxscale: float,\n        ncs: PDFColorSpace,\n        graphicstate: \"PDFGraphicState\",\n    ) -> Point:\n        (x, y) = pos\n        needcharspace = False\n        for obj in seq:\n            if isinstance(obj, (int, float)):\n                y -= obj * dxscale\n                needcharspace = True\n            elif isinstance(obj, bytes):\n                for cid in font.decode(obj):\n                    if needcharspace:\n                        y += charspace\n                    y += self.render_char(\n                        utils.translate_matrix(matrix, (x, y)),\n                        font,\n                        fontsize,\n                        scaling,\n                        rise,\n                        cid,\n                        ncs,\n                        graphicstate,\n                    )\n                    if cid == 32 and wordspace:\n                        y += wordspace\n                    needcharspace = True\n            else:\n                logger.warning(\n                    f\"Cannot render vertical string because {obj!r} is not a valid int, float or bytes.\"\n                )\n        return (x, y)\n\n    def render_char(\n        self,\n        matrix: Matrix,\n        font: PDFFont,\n        fontsize: float,\n        scaling: float,\n        rise: float,\n        cid: int,\n        ncs: PDFColorSpace,\n        graphicstate: \"PDFGraphicState\",\n    ) -> float:\n        return 0\n\n\nclass TagExtractor(PDFDevice):\n    def __init__(\n        self,\n        rsrcmgr: \"PDFResourceManager\",\n        outfp: BinaryIO,\n        codec: str = \"utf-8\",\n    ) -> None:\n        PDFDevice.__init__(self, rsrcmgr)\n        self.outfp = outfp\n        self.codec = codec\n        self.pageno = 0\n        self._stack: list[PSLiteral] = []\n\n    def render_string(\n        self,\n    
    textstate: \"PDFTextState\",\n        seq: PDFTextSeq,\n        ncs: PDFColorSpace,\n        graphicstate: \"PDFGraphicState\",\n    ) -> None:\n        font = textstate.font\n        assert font is not None\n        text = \"\"\n        for obj in seq:\n            if isinstance(obj, str):\n                obj = utils.make_compat_bytes(obj)\n            if not isinstance(obj, bytes):\n                continue\n            chars = font.decode(obj)\n            for cid in chars:\n                try:\n                    char = font.to_unichr(cid)\n                    text += char\n                except PDFUnicodeNotDefined:\n                    pass\n        self._write(utils.enc(text))\n\n    def begin_page(self, page: PDFPage, ctm: Matrix) -> None:\n        output = '<page id=\"%s\" bbox=\"%s\" rotate=\"%d\">' % (\n            self.pageno,\n            utils.bbox2str(page.mediabox),\n            page.rotate,\n        )\n        self._write(output)\n\n    def end_page(self, page: PDFPage) -> None:\n        self._write(\"</page>\\n\")\n        self.pageno += 1\n\n    def begin_tag(self, tag: PSLiteral, props: Optional[\"PDFStackT\"] = None) -> None:\n        s = \"\"\n        if isinstance(props, dict):\n            s = \"\".join(\n                [\n                    f' {utils.enc(k)}=\"{utils.make_compat_str(v)}\"'\n                    for (k, v) in sorted(props.items())\n                ],\n            )\n        out_s = f\"<{utils.enc(cast(str, tag.name))}{s}>\"\n        self._write(out_s)\n        self._stack.append(tag)\n\n    def end_tag(self) -> None:\n        assert self._stack, str(self.pageno)\n        tag = self._stack.pop(-1)\n        out_s = \"</%s>\" % utils.enc(cast(str, tag.name))\n        self._write(out_s)\n\n    def do_tag(self, tag: PSLiteral, props: Optional[\"PDFStackT\"] = None) -> None:\n        self.begin_tag(tag, props)\n        self._stack.pop(-1)\n\n    def _write(self, s: str) -> None:\n        
self.outfp.write(s.encode(self.codec))\n"
  },
  {
    "path": "babeldoc/pdfminer/pdfdocument.py",
    "content": "import itertools\nimport logging\nimport re\nimport struct\nfrom collections.abc import Callable\nfrom collections.abc import Iterable\nfrom collections.abc import Iterator\nfrom collections.abc import KeysView\nfrom collections.abc import Sequence\nfrom hashlib import md5\nfrom hashlib import sha256\nfrom hashlib import sha384\nfrom hashlib import sha512\nfrom typing import Any\nfrom typing import cast\n\nfrom cryptography.hazmat.backends import default_backend\nfrom cryptography.hazmat.primitives.ciphers import Cipher\nfrom cryptography.hazmat.primitives.ciphers import algorithms\nfrom cryptography.hazmat.primitives.ciphers import modes\n\nfrom babeldoc.pdfminer.arcfour import Arcfour\nfrom babeldoc.pdfminer.casting import safe_int\nfrom babeldoc.pdfminer.data_structures import NumberTree\nfrom babeldoc.pdfminer.pdfexceptions import PDFException\nfrom babeldoc.pdfminer.pdfexceptions import PDFKeyError\nfrom babeldoc.pdfminer.pdfexceptions import PDFObjectNotFound\nfrom babeldoc.pdfminer.pdfexceptions import PDFTypeError\nfrom babeldoc.pdfminer.pdfparser import PDFParser\nfrom babeldoc.pdfminer.pdfparser import PDFStreamParser\nfrom babeldoc.pdfminer.pdfparser import PDFSyntaxError\nfrom babeldoc.pdfminer.pdftypes import DecipherCallable\nfrom babeldoc.pdfminer.pdftypes import PDFStream\nfrom babeldoc.pdfminer.pdftypes import decipher_all\nfrom babeldoc.pdfminer.pdftypes import dict_value\nfrom babeldoc.pdfminer.pdftypes import int_value\nfrom babeldoc.pdfminer.pdftypes import list_value\nfrom babeldoc.pdfminer.pdftypes import str_value\nfrom babeldoc.pdfminer.pdftypes import stream_value\nfrom babeldoc.pdfminer.pdftypes import uint_value\nfrom babeldoc.pdfminer.psexceptions import PSEOF\nfrom babeldoc.pdfminer.psparser import KWD\nfrom babeldoc.pdfminer.psparser import LIT\nfrom babeldoc.pdfminer.psparser import literal_name\nfrom babeldoc.pdfminer.utils import choplist\nfrom babeldoc.pdfminer.utils import decode_text\nfrom babeldoc.pdfminer.utils 
import format_int_alpha\nfrom babeldoc.pdfminer.utils import format_int_roman\nfrom babeldoc.pdfminer.utils import nunpack\nfrom babeldoc.pdfminer import settings\n\nlog = logging.getLogger(__name__)\n\n\nclass PDFNoValidXRef(PDFSyntaxError):\n    pass\n\n\nclass PDFNoValidXRefWarning(SyntaxWarning):\n    \"\"\"Legacy warning for missing xref.\n\n    Not used anymore because warnings.warn is replaced by logger.Logger.warn.\n    \"\"\"\n\n\nclass PDFNoOutlines(PDFException):\n    pass\n\n\nclass PDFNoPageLabels(PDFException):\n    pass\n\n\nclass PDFDestinationNotFound(PDFException):\n    pass\n\n\nclass PDFEncryptionError(PDFException):\n    pass\n\n\nclass PDFPasswordIncorrect(PDFEncryptionError):\n    pass\n\n\nclass PDFEncryptionWarning(UserWarning):\n    \"\"\"Legacy warning for failed decryption.\n\n    Not used anymore because warnings.warn is replaced by logger.Logger.warn.\n    \"\"\"\n\n\nclass PDFTextExtractionNotAllowedWarning(UserWarning):\n    \"\"\"Legacy warning for PDF that does not allow extraction.\n\n    Not used anymore because warnings.warn is replaced by logger.Logger.warn.\n    \"\"\"\n\n\nclass PDFTextExtractionNotAllowed(PDFEncryptionError):\n    pass\n\n\n# some predefined literals and keywords.\nLITERAL_OBJSTM = LIT(\"ObjStm\")\nLITERAL_XREF = LIT(\"XRef\")\nLITERAL_CATALOG = LIT(\"Catalog\")\n\n\nclass PDFBaseXRef:\n    def get_trailer(self) -> dict[str, Any]:\n        raise NotImplementedError\n\n    def get_objids(self) -> Iterable[int]:\n        return []\n\n    # Must return\n    #     (strmid, index, genno)\n    #  or (None, pos, genno)\n    def get_pos(self, objid: int) -> tuple[int | None, int, int]:\n        raise PDFKeyError(objid)\n\n    def load(self, parser: PDFParser) -> None:\n        raise NotImplementedError\n\n\nclass PDFXRef(PDFBaseXRef):\n    def __init__(self) -> None:\n        self.offsets: dict[int, tuple[int | None, int, int]] = {}\n        self.trailer: dict[str, Any] = {}\n\n    def __repr__(self) -> str:\n       
 return \"<PDFXRef: offsets=%r>\" % (self.offsets.keys())\n\n    def load(self, parser: PDFParser) -> None:\n        while True:\n            try:\n                (pos, line) = parser.nextline()\n                line = line.strip()\n                if not line:\n                    continue\n            except PSEOF:\n                raise PDFNoValidXRef(\"Unexpected EOF - file corrupted?\")\n            if line.startswith(b\"trailer\"):\n                parser.seek(pos)\n                break\n            f = line.split(b\" \")\n            if len(f) != 2:\n                error_msg = f\"Trailer not found: {parser!r}: line={line!r}\"\n                raise PDFNoValidXRef(error_msg)\n            try:\n                (start, nobjs) = map(int, f)\n            except ValueError:\n                error_msg = f\"Invalid line: {parser!r}: line={line!r}\"\n                raise PDFNoValidXRef(error_msg)\n            for objid in range(start, start + nobjs):\n                try:\n                    (_, line) = parser.nextline()\n                    line = line.strip()\n                except PSEOF:\n                    raise PDFNoValidXRef(\"Unexpected EOF - file corrupted?\")\n                f = line.split(b\" \")\n                if len(f) != 3:\n                    error_msg = f\"Invalid XRef format: {parser!r}, line={line!r}\"\n                    raise PDFNoValidXRef(error_msg)\n                (pos_b, genno_b, use_b) = f\n                if use_b != b\"n\":\n                    continue\n\n                pos_i = safe_int(pos_b)\n                genno_i = safe_int(genno_b)\n                if pos_i is not None and genno_i is not None:\n                    self.offsets[objid] = (None, pos_i, genno_i)\n                else:\n                    log.warning(\n                        f\"Not adding object {objid} to xref because position {pos_b!r} \"\n                        f\"or generation number {genno_b!r} cannot be parsed as an int\"\n                    )\n\n   
     log.debug(\"xref objects: %r\", self.offsets)\n        self.load_trailer(parser)\n\n    def load_trailer(self, parser: PDFParser) -> None:\n        try:\n            (_, kwd) = parser.nexttoken()\n            assert kwd is KWD(b\"trailer\"), str(kwd)\n            (_, dic) = parser.nextobject()\n        except PSEOF:\n            x = parser.pop(1)\n            if not x:\n                raise PDFNoValidXRef(\"Unexpected EOF - file corrupted\")\n            (_, dic) = x[0]\n        self.trailer.update(dict_value(dic))\n        log.debug(\"trailer=%r\", self.trailer)\n\n    def get_trailer(self) -> dict[str, Any]:\n        return self.trailer\n\n    def get_objids(self) -> KeysView[int]:\n        return self.offsets.keys()\n\n    def get_pos(self, objid: int) -> tuple[int | None, int, int]:\n        return self.offsets[objid]\n\n\nclass PDFXRefFallback(PDFXRef):\n    def __repr__(self) -> str:\n        return \"<PDFXRefFallback: offsets=%r>\" % (self.offsets.keys())\n\n    PDFOBJ_CUE = re.compile(r\"^(\\d+)\\s+(\\d+)\\s+obj\\b\")\n\n    def load(self, parser: PDFParser) -> None:\n        parser.seek(0)\n        while 1:\n            try:\n                (pos, line_bytes) = parser.nextline()\n            except PSEOF:\n                break\n            if line_bytes.startswith(b\"trailer\"):\n                parser.seek(pos)\n                self.load_trailer(parser)\n                log.debug(\"trailer: %r\", self.trailer)\n                break\n            line = line_bytes.decode(\"latin-1\")  # default pdf encoding\n            m = self.PDFOBJ_CUE.match(line)\n            if not m:\n                continue\n            (objid_s, genno_s) = m.groups()\n            objid = int(objid_s)\n            genno = int(genno_s)\n            self.offsets[objid] = (None, pos, genno)\n            # expand ObjStm.\n            parser.seek(pos)\n            (_, obj) = parser.nextobject()\n            if isinstance(obj, PDFStream) and obj.get(\"Type\") is LITERAL_OBJSTM:\n 
               stream = stream_value(obj)\n                try:\n                    n = stream[\"N\"]\n                except KeyError:\n                    if settings.STRICT:\n                        raise PDFSyntaxError(\"N is not defined: %r\" % stream)\n                    n = 0\n                parser1 = PDFStreamParser(stream.get_data())\n                objs: list[int] = []\n                try:\n                    while 1:\n                        (_, obj) = parser1.nextobject()\n                        objs.append(cast(int, obj))\n                except PSEOF:\n                    pass\n                n = min(n, len(objs) // 2)\n                for index in range(n):\n                    objid1 = objs[index * 2]\n                    self.offsets[objid1] = (objid, index, 0)\n\n\nclass PDFXRefStream(PDFBaseXRef):\n    def __init__(self) -> None:\n        self.data: bytes | None = None\n        self.entlen: int | None = None\n        self.fl1: int | None = None\n        self.fl2: int | None = None\n        self.fl3: int | None = None\n        self.ranges: list[tuple[int, int]] = []\n\n    def __repr__(self) -> str:\n        return \"<PDFXRefStream: ranges=%r>\" % (self.ranges)\n\n    def load(self, parser: PDFParser) -> None:\n        (_, objid) = parser.nexttoken()  # ignored\n        (_, genno) = parser.nexttoken()  # ignored\n        (_, kwd) = parser.nexttoken()\n        (_, stream) = parser.nextobject()\n        if not isinstance(stream, PDFStream) or stream.get(\"Type\") is not LITERAL_XREF:\n            raise PDFNoValidXRef(\"Invalid PDF stream spec.\")\n        size = stream[\"Size\"]\n        index_array = stream.get(\"Index\", (0, size))\n        if len(index_array) % 2 != 0:\n            raise PDFSyntaxError(\"Invalid index number\")\n        self.ranges.extend(cast(Iterator[tuple[int, int]], choplist(2, index_array)))\n        (self.fl1, self.fl2, self.fl3) = stream[\"W\"]\n        assert self.fl1 is not None and self.fl2 is not None and 
self.fl3 is not None\n        self.data = stream.get_data()\n        self.entlen = self.fl1 + self.fl2 + self.fl3\n        self.trailer = stream.attrs\n        log.debug(\n            \"xref stream: objid=%s, fields=%d,%d,%d\",\n            \", \".join(map(repr, self.ranges)),\n            self.fl1,\n            self.fl2,\n            self.fl3,\n        )\n\n    def get_trailer(self) -> dict[str, Any]:\n        return self.trailer\n\n    def get_objids(self) -> Iterator[int]:\n        # Entries for all ranges are stored back to back in self.data, so the\n        # entry index must keep counting across ranges (as get_pos does).\n        index = 0\n        for start, nobjs in self.ranges:\n            for i in range(nobjs):\n                assert self.entlen is not None\n                assert self.data is not None\n                offset = self.entlen * index\n                index += 1\n                ent = self.data[offset : offset + self.entlen]\n                f1 = nunpack(ent[: self.fl1], 1)\n                if f1 == 1 or f1 == 2:\n                    yield start + i\n\n    def get_pos(self, objid: int) -> tuple[int | None, int, int]:\n        index = 0\n        for start, nobjs in self.ranges:\n            if start <= objid and objid < start + nobjs:\n                index += objid - start\n                break\n            else:\n                index += nobjs\n        else:\n            raise PDFKeyError(objid)\n        assert self.entlen is not None\n        assert self.data is not None\n        assert self.fl1 is not None and self.fl2 is not None and self.fl3 is not None\n        offset = self.entlen * index\n        ent = self.data[offset : offset + self.entlen]\n        f1 = nunpack(ent[: self.fl1], 1)\n        f2 = nunpack(ent[self.fl1 : self.fl1 + self.fl2])\n        f3 = nunpack(ent[self.fl1 + self.fl2 :])\n        if f1 == 1:\n            return (None, f2, f3)\n        elif f1 == 2:\n            return (f2, f3, 0)\n        else:\n            # this is a free object\n            raise PDFKeyError(objid)\n\n\nclass PDFStandardSecurityHandler:\n    PASSWORD_PADDING = (\n        
b\"(\\xbfN^Nu\\x8aAd\\x00NV\\xff\\xfa\\x01\\x08..\\x00\\xb6\\xd0h>\\x80/\\x0c\\xa9\\xfedSiz\"\n    )\n    supported_revisions: tuple[int, ...] = (2, 3)\n\n    def __init__(\n        self,\n        docid: Sequence[bytes],\n        param: dict[str, Any],\n        password: str = \"\",\n    ) -> None:\n        self.docid = docid\n        self.param = param\n        self.password = password\n        self.init()\n\n    def init(self) -> None:\n        self.init_params()\n        if self.r not in self.supported_revisions:\n            error_msg = \"Unsupported revision: param=%r\" % self.param\n            raise PDFEncryptionError(error_msg)\n        self.init_key()\n\n    def init_params(self) -> None:\n        self.v = int_value(self.param.get(\"V\", 0))\n        self.r = int_value(self.param[\"R\"])\n        self.p = uint_value(self.param[\"P\"], 32)\n        self.o = str_value(self.param[\"O\"])\n        self.u = str_value(self.param[\"U\"])\n        self.length = int_value(self.param.get(\"Length\", 40))\n\n    def init_key(self) -> None:\n        self.key = self.authenticate(self.password)\n        if self.key is None:\n            raise PDFPasswordIncorrect\n\n    def is_printable(self) -> bool:\n        return bool(self.p & 4)\n\n    def is_modifiable(self) -> bool:\n        return bool(self.p & 8)\n\n    def is_extractable(self) -> bool:\n        return bool(self.p & 16)\n\n    def compute_u(self, key: bytes) -> bytes:\n        if self.r == 2:\n            # Algorithm 3.4\n            return Arcfour(key).encrypt(self.PASSWORD_PADDING)  # 2\n        else:\n            # Algorithm 3.5\n            hash = md5(self.PASSWORD_PADDING)  # 2\n            hash.update(self.docid[0])  # 3\n            result = Arcfour(key).encrypt(hash.digest())  # 4\n            for i in range(1, 20):  # 5\n                k = b\"\".join(bytes((c ^ i,)) for c in iter(key))\n                result = Arcfour(k).encrypt(result)\n            result += result  # 6\n            return 
result\n\n    def compute_encryption_key(self, password: bytes) -> bytes:\n        # Algorithm 3.2\n        password = (password + self.PASSWORD_PADDING)[:32]  # 1\n        hash = md5(password)  # 2\n        hash.update(self.o)  # 3\n        # See https://github.com/pdfminer/pdfminer.six/issues/186\n        hash.update(struct.pack(\"<L\", self.p))  # 4\n        hash.update(self.docid[0])  # 5\n        if self.r >= 4:\n            if not cast(PDFStandardSecurityHandlerV4, self).encrypt_metadata:\n                hash.update(b\"\\xff\\xff\\xff\\xff\")\n        result = hash.digest()\n        n = 5\n        if self.r >= 3:\n            n = self.length // 8\n            for _ in range(50):\n                result = md5(result[:n]).digest()\n        return result[:n]\n\n    def authenticate(self, password: str) -> bytes | None:\n        password_bytes = password.encode(\"latin1\")\n        key = self.authenticate_user_password(password_bytes)\n        if key is None:\n            key = self.authenticate_owner_password(password_bytes)\n        return key\n\n    def authenticate_user_password(self, password: bytes) -> bytes | None:\n        key = self.compute_encryption_key(password)\n        if self.verify_encryption_key(key):\n            return key\n        else:\n            return None\n\n    def verify_encryption_key(self, key: bytes) -> bool:\n        # Algorithm 3.6\n        u = self.compute_u(key)\n        if self.r == 2:\n            return u == self.u\n        return u[:16] == self.u[:16]\n\n    def authenticate_owner_password(self, password: bytes) -> bytes | None:\n        # Algorithm 3.7\n        password = (password + self.PASSWORD_PADDING)[:32]\n        hash = md5(password)\n        if self.r >= 3:\n            for _ in range(50):\n                hash = md5(hash.digest())\n        n = 5\n        if self.r >= 3:\n            n = self.length // 8\n        key = hash.digest()[:n]\n        if self.r == 2:\n            user_password = 
Arcfour(key).decrypt(self.o)\n        else:\n            user_password = self.o\n            for i in range(19, -1, -1):\n                k = b\"\".join(bytes((c ^ i,)) for c in iter(key))\n                user_password = Arcfour(k).decrypt(user_password)\n        return self.authenticate_user_password(user_password)\n\n    def decrypt(\n        self,\n        objid: int,\n        genno: int,\n        data: bytes,\n        attrs: dict[str, Any] | None = None,\n    ) -> bytes:\n        return self.decrypt_rc4(objid, genno, data)\n\n    def decrypt_rc4(self, objid: int, genno: int, data: bytes) -> bytes:\n        assert self.key is not None\n        key = self.key + struct.pack(\"<L\", objid)[:3] + struct.pack(\"<L\", genno)[:2]\n        hash = md5(key)\n        key = hash.digest()[: min(len(key), 16)]\n        return Arcfour(key).decrypt(data)\n\n\nclass PDFStandardSecurityHandlerV4(PDFStandardSecurityHandler):\n    supported_revisions: tuple[int, ...] = (4,)\n\n    def init_params(self) -> None:\n        super().init_params()\n        self.length = 128\n        self.cf = dict_value(self.param.get(\"CF\"))\n        self.stmf = literal_name(self.param[\"StmF\"])\n        self.strf = literal_name(self.param[\"StrF\"])\n        self.encrypt_metadata = bool(self.param.get(\"EncryptMetadata\", True))\n        if self.stmf != self.strf:\n            error_msg = \"Unsupported crypt filter: param=%r\" % self.param\n            raise PDFEncryptionError(error_msg)\n        self.cfm = {}\n        for k, v in self.cf.items():\n            f = self.get_cfm(literal_name(v[\"CFM\"]))\n            if f is None:\n                error_msg = \"Unknown crypt filter method: param=%r\" % self.param\n                raise PDFEncryptionError(error_msg)\n            self.cfm[k] = f\n        self.cfm[\"Identity\"] = self.decrypt_identity\n        if self.strf not in self.cfm:\n            error_msg = \"Undefined crypt filter: param=%r\" % self.param\n            raise 
PDFEncryptionError(error_msg)\n\n    def get_cfm(self, name: str) -> Callable[[int, int, bytes], bytes] | None:\n        if name == \"V2\":\n            return self.decrypt_rc4\n        elif name == \"AESV2\":\n            return self.decrypt_aes128\n        else:\n            return None\n\n    def decrypt(\n        self,\n        objid: int,\n        genno: int,\n        data: bytes,\n        attrs: dict[str, Any] | None = None,\n        name: str | None = None,\n    ) -> bytes:\n        if not self.encrypt_metadata and attrs is not None:\n            t = attrs.get(\"Type\")\n            if t is not None and literal_name(t) == \"Metadata\":\n                return data\n        if name is None:\n            name = self.strf\n        return self.cfm[name](objid, genno, data)\n\n    def decrypt_identity(self, objid: int, genno: int, data: bytes) -> bytes:\n        return data\n\n    def decrypt_aes128(self, objid: int, genno: int, data: bytes) -> bytes:\n        assert self.key is not None\n        key = (\n            self.key\n            + struct.pack(\"<L\", objid)[:3]\n            + struct.pack(\"<L\", genno)[:2]\n            + b\"sAlT\"\n        )\n        hash = md5(key)\n        key = hash.digest()[: min(len(key), 16)]\n        initialization_vector = data[:16]\n        ciphertext = data[16:]\n        cipher = Cipher(\n            algorithms.AES(key),\n            modes.CBC(initialization_vector),\n            backend=default_backend(),\n        )  # type: ignore\n        return cipher.decryptor().update(ciphertext)  # type: ignore\n\n\nclass PDFStandardSecurityHandlerV5(PDFStandardSecurityHandlerV4):\n    supported_revisions = (5, 6)\n\n    def init_params(self) -> None:\n        super().init_params()\n        self.length = 256\n        self.oe = str_value(self.param[\"OE\"])\n        self.ue = str_value(self.param[\"UE\"])\n        self.o_hash = self.o[:32]\n        self.o_validation_salt = self.o[32:40]\n        self.o_key_salt = self.o[40:]\n        
self.u_hash = self.u[:32]\n        self.u_validation_salt = self.u[32:40]\n        self.u_key_salt = self.u[40:]\n\n    def get_cfm(self, name: str) -> Callable[[int, int, bytes], bytes] | None:\n        if name == \"AESV3\":\n            return self.decrypt_aes256\n        else:\n            return None\n\n    def authenticate(self, password: str) -> bytes | None:\n        password_b = self._normalize_password(password)\n        hash = self._password_hash(password_b, self.o_validation_salt, self.u)\n        if hash == self.o_hash:\n            hash = self._password_hash(password_b, self.o_key_salt, self.u)\n            cipher = Cipher(\n                algorithms.AES(hash),\n                modes.CBC(b\"\\0\" * 16),\n                backend=default_backend(),\n            )  # type: ignore\n            return cipher.decryptor().update(self.oe)  # type: ignore\n        hash = self._password_hash(password_b, self.u_validation_salt)\n        if hash == self.u_hash:\n            hash = self._password_hash(password_b, self.u_key_salt)\n            cipher = Cipher(\n                algorithms.AES(hash),\n                modes.CBC(b\"\\0\" * 16),\n                backend=default_backend(),\n            )  # type: ignore\n            return cipher.decryptor().update(self.ue)  # type: ignore\n        return None\n\n    def _normalize_password(self, password: str) -> bytes:\n        if self.r == 6:\n            # saslprep expects non-empty strings, apparently\n            if not password:\n                return b\"\"\n            from babeldoc.pdfminer._saslprep import saslprep\n\n            password = saslprep(password)\n        return password.encode(\"utf-8\")[:127]\n\n    def _password_hash(\n        self,\n        password: bytes,\n        salt: bytes,\n        vector: bytes | None = None,\n    ) -> bytes:\n        \"\"\"Compute password hash depending on revision number\"\"\"\n        if self.r == 5:\n            return self._r5_password(password, salt, vector)\n    
    return self._r6_password(password, salt[0:8], vector)\n\n    def _r5_password(\n        self,\n        password: bytes,\n        salt: bytes,\n        vector: bytes | None = None,\n    ) -> bytes:\n        \"\"\"Compute the password for revision 5\"\"\"\n        hash = sha256(password)\n        hash.update(salt)\n        if vector is not None:\n            hash.update(vector)\n        return hash.digest()\n\n    def _r6_password(\n        self,\n        password: bytes,\n        salt: bytes,\n        vector: bytes | None = None,\n    ) -> bytes:\n        \"\"\"Compute the password for revision 6\"\"\"\n        initial_hash = sha256(password)\n        initial_hash.update(salt)\n        if vector is not None:\n            initial_hash.update(vector)\n        k = initial_hash.digest()\n        hashes = (sha256, sha384, sha512)\n        round_no = last_byte_val = 0\n        while round_no < 64 or last_byte_val > round_no - 32:\n            k1 = (password + k + (vector or b\"\")) * 64\n            e = self._aes_cbc_encrypt(key=k[:16], iv=k[16:32], data=k1)\n            # compute the first 16 bytes of e,\n            # interpreted as an unsigned integer mod 3\n            next_hash = hashes[self._bytes_mod_3(e[:16])]\n            k = next_hash(e).digest()\n            last_byte_val = e[len(e) - 1]\n            round_no += 1\n        return k[:32]\n\n    @staticmethod\n    def _bytes_mod_3(input_bytes: bytes) -> int:\n        # 256 is 1 mod 3, so we can just sum 'em\n        return sum(b % 3 for b in input_bytes) % 3\n\n    def _aes_cbc_encrypt(self, key: bytes, iv: bytes, data: bytes) -> bytes:\n        cipher = Cipher(algorithms.AES(key), modes.CBC(iv))\n        encryptor = cipher.encryptor()  # type: ignore\n        return encryptor.update(data) + encryptor.finalize()  # type: ignore\n\n    def decrypt_aes256(self, objid: int, genno: int, data: bytes) -> bytes:\n        initialization_vector = data[:16]\n        ciphertext = data[16:]\n        assert self.key is 
not None\n        cipher = Cipher(\n            algorithms.AES(self.key),\n            modes.CBC(initialization_vector),\n            backend=default_backend(),\n        )  # type: ignore\n        return cipher.decryptor().update(ciphertext)  # type: ignore\n\n\nclass PDFDocument:\n    \"\"\"PDFDocument object represents a PDF document.\n\n    Since a PDF file can be very big, normally it is not loaded at\n    once. So PDF document has to cooperate with a PDF parser in order to\n    dynamically import the data as processing goes.\n\n    Typical usage:\n      doc = PDFDocument(parser, password)\n      obj = doc.getobj(objid)\n\n    \"\"\"\n\n    security_handler_registry: dict[int, type[PDFStandardSecurityHandler]] = {\n        1: PDFStandardSecurityHandler,\n        2: PDFStandardSecurityHandler,\n        4: PDFStandardSecurityHandlerV4,\n        5: PDFStandardSecurityHandlerV5,\n    }\n\n    def __init__(\n        self,\n        parser: PDFParser,\n        password: str = \"\",\n        caching: bool = True,\n        fallback: bool = True,\n    ) -> None:\n        \"\"\"Set the document to use a given PDFParser object.\"\"\"\n        self.caching = caching\n        self.xrefs: list[PDFBaseXRef] = []\n        self.info = []\n        self.catalog: dict[str, Any] = {}\n        self.encryption: tuple[Any, Any] | None = None\n        self.decipher: DecipherCallable | None = None\n        self._parser = None\n        self._cached_objs: dict[int, tuple[object, int]] = {}\n        self._parsed_objs: dict[int, tuple[list[object], int]] = {}\n        self._parser = parser\n        self._parser.set_document(self)\n        self.is_printable = self.is_modifiable = self.is_extractable = True\n        # Retrieve the information of each header that was appended\n        # (maybe multiple times) at the end of the document.\n        try:\n            pos = self.find_xref(parser)\n            self.read_xref_from(parser, pos, self.xrefs)\n        except PDFNoValidXRef:\n            
if fallback:\n                parser.fallback = True\n                newxref = PDFXRefFallback()\n                newxref.load(parser)\n                self.xrefs.append(newxref)\n\n        for xref in self.xrefs:\n            trailer = xref.get_trailer()\n            if not trailer:\n                continue\n            # If there's an encryption info, remember it.\n            if \"Encrypt\" in trailer:\n                if \"ID\" in trailer:\n                    id_value = list_value(trailer[\"ID\"])\n                else:\n                    # Some documents may not have a /ID, use two empty\n                    # byte strings instead. Solves\n                    # https://github.com/pdfminer/pdfminer.six/issues/594\n                    id_value = (b\"\", b\"\")\n                self.encryption = (id_value, dict_value(trailer[\"Encrypt\"]))\n                self._initialize_password(password)\n            if \"Info\" in trailer:\n                self.info.append(dict_value(trailer[\"Info\"]))\n            if \"Root\" in trailer:\n                # Every PDF file must have exactly one /Root dictionary.\n                self.catalog = dict_value(trailer[\"Root\"])\n                break\n        else:\n            raise PDFSyntaxError(\"No /Root object! 
- Is this really a PDF?\")\n        if self.catalog.get(\"Type\") is not LITERAL_CATALOG:\n            if settings.STRICT:\n                raise PDFSyntaxError(\"Catalog not found!\")\n\n    KEYWORD_OBJ = KWD(b\"obj\")\n\n    # _initialize_password(password=b'')\n    #   Perform the initialization with a given password.\n    def _initialize_password(self, password: str = \"\") -> None:\n        assert self.encryption is not None\n        (docid, param) = self.encryption\n        if literal_name(param.get(\"Filter\")) != \"Standard\":\n            raise PDFEncryptionError(\"Unknown filter: param=%r\" % param)\n        v = int_value(param.get(\"V\", 0))\n        factory = self.security_handler_registry.get(v)\n        if factory is None:\n            raise PDFEncryptionError(\"Unknown algorithm: param=%r\" % param)\n        handler = factory(docid, param, password)\n        self.decipher = handler.decrypt\n        self.is_printable = handler.is_printable()\n        self.is_modifiable = handler.is_modifiable()\n        self.is_extractable = handler.is_extractable()\n        assert self._parser is not None\n        self._parser.fallback = False  # need to read streams with exact length\n\n    def _getobj_objstm(self, stream: PDFStream, index: int, objid: int) -> object:\n        if stream.objid in self._parsed_objs:\n            (objs, n) = self._parsed_objs[stream.objid]\n        else:\n            (objs, n) = self._get_objects(stream)\n            if self.caching:\n                assert stream.objid is not None\n                self._parsed_objs[stream.objid] = (objs, n)\n        i = n * 2 + index\n        try:\n            obj = objs[i]\n        except IndexError:\n            raise PDFSyntaxError(\"index too big: %r\" % index)\n        return obj\n\n    def _get_objects(self, stream: PDFStream) -> tuple[list[object], int]:\n        if stream.get(\"Type\") is not LITERAL_OBJSTM:\n            if settings.STRICT:\n                raise PDFSyntaxError(\"Not a stream 
object: %r\" % stream)\n        try:\n            n = cast(int, stream[\"N\"])\n        except KeyError:\n            if settings.STRICT:\n                raise PDFSyntaxError(\"N is not defined: %r\" % stream)\n            n = 0\n        parser = PDFStreamParser(stream.get_data())\n        parser.set_document(self)\n        objs: list[object] = []\n        try:\n            while 1:\n                (_, obj) = parser.nextobject()\n                objs.append(obj)\n        except PSEOF:\n            pass\n        return (objs, n)\n\n    def _getobj_parse(self, pos: int, objid: int) -> object:\n        assert self._parser is not None\n        self._parser.seek(pos)\n        (_, objid1) = self._parser.nexttoken()  # objid\n        (_, genno) = self._parser.nexttoken()  # genno\n        (_, kwd) = self._parser.nexttoken()\n        # hack around malformed pdf files\n        # copied from https://github.com/jaepil/pdfminer3k/blob/master/\n        # pdfminer/pdfparser.py#L399\n        # to solve https://github.com/pdfminer/pdfminer.six/issues/56\n        # assert objid1 == objid, str((objid1, objid))\n        if objid1 != objid:\n            x = []\n            while kwd is not self.KEYWORD_OBJ:\n                (_, kwd) = self._parser.nexttoken()\n                x.append(kwd)\n            if len(x) >= 2:\n                objid1 = x[-2]\n        # #### end hack around malformed pdf files\n        if objid1 != objid:\n            raise PDFSyntaxError(f\"objid mismatch: {objid1!r}={objid!r}\")\n\n        if kwd != KWD(b\"obj\"):\n            raise PDFSyntaxError(\"Invalid object spec: offset=%r\" % pos)\n        (_, obj) = self._parser.nextobject()\n        return obj\n\n    # can raise PDFObjectNotFound\n    def getobj(self, objid: int) -> object:\n        \"\"\"Get object from PDF\n\n        :raises PDFException if PDFDocument is not initialized\n        :raises PDFObjectNotFound if objid does not exist in PDF\n        \"\"\"\n        if not self.xrefs:\n            
raise PDFException(\"PDFDocument is not initialized\")\n        log.debug(\"getobj: objid=%r\", objid)\n        if objid in self._cached_objs:\n            (obj, genno) = self._cached_objs[objid]\n        else:\n            for xref in self.xrefs:\n                try:\n                    (strmid, index, genno) = xref.get_pos(objid)\n                except KeyError:\n                    continue\n                try:\n                    if strmid is not None:\n                        stream = stream_value(self.getobj(strmid))\n                        obj = self._getobj_objstm(stream, index, objid)\n                    else:\n                        obj = self._getobj_parse(index, objid)\n                        if self.decipher:\n                            obj = decipher_all(self.decipher, objid, genno, obj)\n\n                    if isinstance(obj, PDFStream):\n                        obj.set_objid(objid, genno)\n                    break\n                except (PSEOF, PDFSyntaxError):\n                    continue\n            else:\n                raise PDFObjectNotFound(objid)\n            log.debug(\"register: objid=%r: %r\", objid, obj)\n            if self.caching:\n                self._cached_objs[objid] = (obj, genno)\n        return obj\n\n    OutlineType = tuple[Any, Any, Any, Any, Any]\n\n    def get_outlines(self) -> Iterator[OutlineType]:\n        if \"Outlines\" not in self.catalog:\n            raise PDFNoOutlines\n\n        def search(entry: object, level: int) -> Iterator[PDFDocument.OutlineType]:\n            entry = dict_value(entry)\n            if \"Title\" in entry:\n                if \"A\" in entry or \"Dest\" in entry:\n                    title = decode_text(str_value(entry[\"Title\"]))\n                    dest = entry.get(\"Dest\")\n                    action = entry.get(\"A\")\n                    se = entry.get(\"SE\")\n                    yield (level, title, dest, action, se)\n            if \"First\" in entry and \"Last\" in 
entry:\n                yield from search(entry[\"First\"], level + 1)\n            if \"Next\" in entry:\n                yield from search(entry[\"Next\"], level)\n\n        return search(self.catalog[\"Outlines\"], 0)\n\n    def get_page_labels(self) -> Iterator[str]:\n        \"\"\"Generate page label strings for the PDF document.\n\n        If the document includes page labels, generates strings, one per page.\n        If not, raises PDFNoPageLabels.\n\n        The resulting iteration is unbounded.\n        \"\"\"\n        assert self.catalog is not None\n\n        try:\n            page_labels = PageLabels(self.catalog[\"PageLabels\"])\n        except (PDFTypeError, KeyError):\n            raise PDFNoPageLabels\n\n        return page_labels.labels\n\n    def lookup_name(self, cat: str, key: str | bytes) -> Any:\n        try:\n            names = dict_value(self.catalog[\"Names\"])\n        except (PDFTypeError, KeyError):\n            raise PDFKeyError((cat, key))\n        # may raise KeyError\n        d0 = dict_value(names[cat])\n\n        def lookup(d: dict[str, Any]) -> Any:\n            if \"Limits\" in d:\n                (k1, k2) = list_value(d[\"Limits\"])\n                if key < k1 or k2 < key:\n                    return None\n            if \"Names\" in d:\n                objs = list_value(d[\"Names\"])\n                names = dict(\n                    cast(Iterator[tuple[str | bytes, Any]], choplist(2, objs)),\n                )\n                return names[key]\n            if \"Kids\" in d:\n                for c in list_value(d[\"Kids\"]):\n                    v = lookup(dict_value(c))\n                    if v:\n                        return v\n            raise PDFKeyError((cat, key))\n\n        return lookup(d0)\n\n    def get_dest(self, name: str | bytes) -> Any:\n        try:\n            # PDF-1.2 or later\n            obj = self.lookup_name(\"Dests\", name)\n        except KeyError:\n            # PDF-1.1 or prior\n            if 
\"Dests\" not in self.catalog:\n                raise PDFDestinationNotFound(name)\n            d0 = dict_value(self.catalog[\"Dests\"])\n            if name not in d0:\n                raise PDFDestinationNotFound(name)\n            obj = d0[name]\n        return obj\n\n    # find_xref\n    def find_xref(self, parser: PDFParser) -> int:\n        \"\"\"Internal function used to locate the first XRef.\"\"\"\n        # search the last xref table by scanning the file backwards.\n        prev = b\"\"\n        for line in parser.revreadlines():\n            line = line.strip()\n            log.debug(\"find_xref: %r\", line)\n\n            if line == b\"startxref\":\n                log.debug(\"xref found: pos=%r\", prev)\n\n                if not prev.isdigit():\n                    raise PDFNoValidXRef(f\"Invalid xref position: {prev!r}\")\n\n                start = int(prev)\n\n                if not start >= 0:\n                    raise PDFNoValidXRef(f\"Invalid negative xref position: {start}\")\n\n                return start\n\n            if line:\n                prev = line\n\n        raise PDFNoValidXRef(\"Unexpected EOF\")\n\n    # read xref table\n    def read_xref_from(\n        self,\n        parser: PDFParser,\n        start: int,\n        xrefs: list[PDFBaseXRef],\n    ) -> None:\n        \"\"\"Reads XRefs from the given location.\"\"\"\n        parser.seek(start)\n        parser.reset()\n        try:\n            (pos, token) = parser.nexttoken()\n        except PSEOF:\n            raise PDFNoValidXRef(\"Unexpected EOF\")\n        log.debug(\"read_xref_from: start=%d, token=%r\", start, token)\n        if isinstance(token, int):\n            # XRefStream: PDF-1.5\n            parser.seek(pos)\n            parser.reset()\n            xref: PDFBaseXRef = PDFXRefStream()\n            xref.load(parser)\n        else:\n            if token is parser.KEYWORD_XREF:\n                parser.nextline()\n            xref = PDFXRef()\n            
xref.load(parser)\n        xrefs.append(xref)\n        trailer = xref.get_trailer()\n        log.debug(\"trailer: %r\", trailer)\n        if \"XRefStm\" in trailer:\n            pos = int_value(trailer[\"XRefStm\"])\n            self.read_xref_from(parser, pos, xrefs)\n        if \"Prev\" in trailer:\n            # find previous xref\n            pos = int_value(trailer[\"Prev\"])\n            self.read_xref_from(parser, pos, xrefs)\n\n\nclass PageLabels(NumberTree):\n    \"\"\"PageLabels from the document catalog.\n\n    See Section 8.3.1 in the PDF Reference.\n    \"\"\"\n\n    @property\n    def labels(self) -> Iterator[str]:\n        ranges = self.values\n\n        # The tree must begin with page index 0\n        if len(ranges) == 0 or ranges[0][0] != 0:\n            if settings.STRICT:\n                raise PDFSyntaxError(\"PageLabels is missing page index 0\")\n            else:\n                # Try to cope, by assuming empty labels for the initial pages\n                ranges.insert(0, (0, {}))\n\n        for next, (start, label_dict_unchecked) in enumerate(ranges, 1):\n            label_dict = dict_value(label_dict_unchecked)\n            style = label_dict.get(\"S\")\n            prefix = decode_text(str_value(label_dict.get(\"P\", b\"\")))\n            first_value = int_value(label_dict.get(\"St\", 1))\n\n            if next == len(ranges):\n                # This is the last specified range. 
It continues until the end\n                # of the document.\n                values: Iterable[int] = itertools.count(first_value)\n            else:\n                end, _ = ranges[next]\n                range_length = end - start\n                values = range(first_value, first_value + range_length)\n\n            for value in values:\n                label = self._format_page_label(value, style)\n                yield prefix + label\n\n    @staticmethod\n    def _format_page_label(value: int, style: Any) -> str:\n        \"\"\"Format page label value in a specific style\"\"\"\n        if style is None:\n            label = \"\"\n        elif style is LIT(\"D\"):  # Decimal arabic numerals\n            label = str(value)\n        elif style is LIT(\"R\"):  # Uppercase roman numerals\n            label = format_int_roman(value).upper()\n        elif style is LIT(\"r\"):  # Lowercase roman numerals\n            label = format_int_roman(value)\n        elif style is LIT(\"A\"):  # Uppercase letters A-Z, AA-ZZ...\n            label = format_int_alpha(value).upper()\n        elif style is LIT(\"a\"):  # Lowercase letters a-z, aa-zz...\n            label = format_int_alpha(value)\n        else:\n            log.warning(\"Unknown page label style: %r\", style)\n            label = \"\"\n        return label\n"
  },
  {
    "path": "babeldoc/pdfminer/pdfexceptions.py",
    "content": "from babeldoc.pdfminer.psexceptions import PSException\n\n\nclass PDFException(PSException):\n    pass\n\n\nclass PDFTypeError(PDFException, TypeError):\n    pass\n\n\nclass PDFValueError(PDFException, ValueError):\n    pass\n\n\nclass PDFObjectNotFound(PDFException):\n    pass\n\n\nclass PDFNotImplementedError(PDFException, NotImplementedError):\n    pass\n\n\nclass PDFKeyError(PDFException, KeyError):\n    pass\n\n\nclass PDFEOFError(PDFException, EOFError):\n    pass\n\n\nclass PDFIOError(PDFException, IOError):\n    pass\n"
  },
  {
    "path": "babeldoc/pdfminer/pdffont.py",
    "content": "import logging\nimport struct\nfrom collections.abc import Iterable\nfrom collections.abc import Iterator\nfrom collections.abc import Mapping\nfrom io import BytesIO\nfrom typing import TYPE_CHECKING\nfrom typing import Any\nfrom typing import BinaryIO\nfrom typing import cast\nimport freetype\n\nfrom babeldoc.pdfminer.casting import safe_float\nfrom babeldoc.pdfminer.casting import safe_rect_list\nfrom babeldoc.pdfminer.cmapdb import CMap\nfrom babeldoc.pdfminer.cmapdb import CMapBase\nfrom babeldoc.pdfminer.cmapdb import CMapDB\nfrom babeldoc.pdfminer.cmapdb import CMapParser\nfrom babeldoc.pdfminer.cmapdb import FileUnicodeMap\nfrom babeldoc.pdfminer.cmapdb import IdentityUnicodeMap\nfrom babeldoc.pdfminer.cmapdb import UnicodeMap\nfrom babeldoc.pdfminer.encodingdb import EncodingDB\nfrom babeldoc.pdfminer.encodingdb import name2unicode\nfrom babeldoc.pdfminer.fontmetrics import FONT_METRICS\nfrom babeldoc.pdfminer.pdfexceptions import PDFException\nfrom babeldoc.pdfminer.pdfexceptions import PDFKeyError\nfrom babeldoc.pdfminer.pdfexceptions import PDFValueError\nfrom babeldoc.pdfminer.pdftypes import PDFStream\nfrom babeldoc.pdfminer.pdftypes import dict_value\nfrom babeldoc.pdfminer.pdftypes import int_value\nfrom babeldoc.pdfminer.pdftypes import list_value\nfrom babeldoc.pdfminer.pdftypes import num_value\nfrom babeldoc.pdfminer.pdftypes import resolve1\nfrom babeldoc.pdfminer.pdftypes import resolve_all\nfrom babeldoc.pdfminer.pdftypes import stream_value\nfrom babeldoc.pdfminer.psexceptions import PSEOF\nfrom babeldoc.pdfminer.psparser import KWD\nfrom babeldoc.pdfminer.psparser import LIT\nfrom babeldoc.pdfminer.psparser import PSKeyword\nfrom babeldoc.pdfminer.psparser import PSLiteral\nfrom babeldoc.pdfminer.psparser import PSStackParser\nfrom babeldoc.pdfminer.psparser import literal_name\nfrom babeldoc.pdfminer.utils import Matrix\nfrom babeldoc.pdfminer.utils import Point\nfrom babeldoc.pdfminer.utils import Rect\nfrom 
babeldoc.pdfminer.utils import apply_matrix_norm\nfrom babeldoc.pdfminer.utils import choplist\nfrom babeldoc.pdfminer.utils import nunpack\nfrom babeldoc.pdfminer import settings\nfrom babeldoc.format.pdf.babelpdf.cmap import CharacterMap\n\nif TYPE_CHECKING:\n    from babeldoc.pdfminer.pdfinterp import PDFResourceManager\n\nlog = logging.getLogger(__name__)\n\n\ndef get_widths(seq: Iterable[object]) -> dict[str | int, float]:\n    \"\"\"Build a mapping of character widths for horizontal writing.\"\"\"\n    widths: dict[int, float] = {}\n    r: list[float] = []\n    for v in seq:\n        v = resolve1(v)\n        if isinstance(v, list):\n            if r:\n                char1 = r[-1]\n                for i, w in enumerate(v):\n                    widths[cast(int, char1) + i] = w\n                r = []\n        elif isinstance(v, (int, float)):  # == utils.isnumber(v)\n            r.append(v)\n            if len(r) == 3:\n                (char1, char2, w) = r\n                if isinstance(char1, int) and isinstance(char2, int):\n                    for i in range(cast(int, char1), cast(int, char2) + 1):\n                        widths[i] = w\n                else:\n                    log.warning(\n                        f\"Skipping invalid font width specification for {char1} to {char2} because either of them is not an int\"\n                    )\n                r = []\n        else:\n            log.warning(\n                f\"Skipping invalid font width specification for {v} because it is not a number or a list\"\n            )\n    return cast(dict[str | int, float], widths)\n\n\ndef get_widths2(seq: Iterable[object]) -> dict[int, tuple[float, Point]]:\n    \"\"\"Build a mapping of character widths for vertical writing.\"\"\"\n    widths: dict[int, tuple[float, Point]] = {}\n    r: list[float] = []\n    for v in seq:\n        if isinstance(v, list):\n            if r:\n                char1 = r[-1]\n                for i, (w, vx, vy) in 
enumerate(choplist(3, v)):\n                    widths[cast(int, char1) + i] = (w, (vx, vy))\n                r = []\n        elif isinstance(v, (int, float)):  # == utils.isnumber(v)\n            r.append(v)\n            if len(r) == 5:\n                (char1, char2, w, vx, vy) = r\n                for i in range(cast(int, char1), cast(int, char2) + 1):\n                    widths[i] = (w, (vx, vy))\n                r = []\n    return widths\n\n\nclass FontMetricsDB:\n    @classmethod\n    def get_metrics(cls, fontname: str) -> tuple[dict[str, object], dict[str, int]]:\n        return FONT_METRICS[fontname]\n\n\n# int here means that we're not extending PSStackParser with additional types.\nclass Type1FontHeaderParser(PSStackParser[int]):\n    KEYWORD_BEGIN = KWD(b\"begin\")\n    KEYWORD_END = KWD(b\"end\")\n    KEYWORD_DEF = KWD(b\"def\")\n    KEYWORD_PUT = KWD(b\"put\")\n    KEYWORD_DICT = KWD(b\"dict\")\n    KEYWORD_ARRAY = KWD(b\"array\")\n    KEYWORD_READONLY = KWD(b\"readonly\")\n    KEYWORD_FOR = KWD(b\"for\")\n\n    def __init__(self, data: BinaryIO) -> None:\n        PSStackParser.__init__(self, data)\n        self._cid2unicode: dict[int, str] = {}\n\n    def get_encoding(self) -> dict[int, str]:\n        \"\"\"Parse the font encoding.\n\n        The Type1 font encoding maps character codes to character names. These\n        character names could either be standard Adobe glyph names, or\n        character names associated with custom CharStrings for this font. A\n        CharString is a sequence of operations that describe how the character\n        should be drawn. 
Currently, this function returns '' (empty string)\n        for character names that are associated with CharStrings.\n\n        Reference: Adobe Systems Incorporated, Adobe Type 1 Font Format\n\n        :returns: mapping of character identifiers (CIDs) to Unicode characters\n        \"\"\"\n        while 1:\n            try:\n                (cid, name) = self.nextobject()\n            except PSEOF:\n                break\n            try:\n                self._cid2unicode[cid] = name2unicode(cast(str, name))\n            except KeyError as e:\n                log.debug(str(e))\n        return self._cid2unicode\n\n    def do_keyword(self, pos: int, token: PSKeyword) -> None:\n        if token is self.KEYWORD_PUT:\n            ((_, key), (_, value)) = self.pop(2)\n            if isinstance(key, int) and isinstance(value, PSLiteral):\n                self.add_results((key, literal_name(value)))\n\n\nNIBBLES = (\"0\", \"1\", \"2\", \"3\", \"4\", \"5\", \"6\", \"7\", \"8\", \"9\", \".\", \"e\", \"e-\", None, \"-\")\n\n# Mapping of cmap names. 
Original cmap name is kept if not in the mapping.\n# (missing reference for why DLIdent is mapped to Identity)\nIDENTITY_ENCODER = {\n    \"DLIdent-H\": \"Identity-H\",\n    \"DLIdent-V\": \"Identity-V\",\n}\n\n\ndef getdict(data: bytes) -> dict[int, list[float | int]]:\n    d: dict[int, list[float | int]] = {}\n    fp = BytesIO(data)\n    stack: list[float | int] = []\n    while 1:\n        c = fp.read(1)\n        if not c:\n            break\n        b0 = ord(c)\n        if b0 <= 21:\n            d[b0] = stack\n            stack = []\n            continue\n        if b0 == 30:\n            s = \"\"\n            loop = True\n            while loop:\n                b = ord(fp.read(1))\n                for n in (b >> 4, b & 15):\n                    if n == 15:\n                        loop = False\n                    else:\n                        nibble = NIBBLES[n]\n                        assert nibble is not None\n                        s += nibble\n            value = float(s)\n        elif b0 >= 32 and b0 <= 246:\n            value = b0 - 139\n        else:\n            b1 = ord(fp.read(1))\n            if b0 >= 247 and b0 <= 250:\n                value = ((b0 - 247) << 8) + b1 + 108\n            elif b0 >= 251 and b0 <= 254:\n                value = -((b0 - 251) << 8) - b1 - 108\n            else:\n                b2 = ord(fp.read(1))\n                if b1 >= 128:\n                    b1 -= 256\n                if b0 == 28:\n                    value = b1 << 8 | b2\n                else:\n                    value = b1 << 24 | b2 << 16 | struct.unpack(\">H\", fp.read(2))[0]\n        stack.append(value)\n    return d\n\n\nclass CFFFont:\n    STANDARD_STRINGS = (\n        \".notdef\",\n        \"space\",\n        \"exclam\",\n        \"quotedbl\",\n        \"numbersign\",\n        \"dollar\",\n        \"percent\",\n        \"ampersand\",\n        \"quoteright\",\n        \"parenleft\",\n        \"parenright\",\n        \"asterisk\",\n        \"plus\",\n    
    \"comma\",\n        \"hyphen\",\n        \"period\",\n        \"slash\",\n        \"zero\",\n        \"one\",\n        \"two\",\n        \"three\",\n        \"four\",\n        \"five\",\n        \"six\",\n        \"seven\",\n        \"eight\",\n        \"nine\",\n        \"colon\",\n        \"semicolon\",\n        \"less\",\n        \"equal\",\n        \"greater\",\n        \"question\",\n        \"at\",\n        \"A\",\n        \"B\",\n        \"C\",\n        \"D\",\n        \"E\",\n        \"F\",\n        \"G\",\n        \"H\",\n        \"I\",\n        \"J\",\n        \"K\",\n        \"L\",\n        \"M\",\n        \"N\",\n        \"O\",\n        \"P\",\n        \"Q\",\n        \"R\",\n        \"S\",\n        \"T\",\n        \"U\",\n        \"V\",\n        \"W\",\n        \"X\",\n        \"Y\",\n        \"Z\",\n        \"bracketleft\",\n        \"backslash\",\n        \"bracketright\",\n        \"asciicircum\",\n        \"underscore\",\n        \"quoteleft\",\n        \"a\",\n        \"b\",\n        \"c\",\n        \"d\",\n        \"e\",\n        \"f\",\n        \"g\",\n        \"h\",\n        \"i\",\n        \"j\",\n        \"k\",\n        \"l\",\n        \"m\",\n        \"n\",\n        \"o\",\n        \"p\",\n        \"q\",\n        \"r\",\n        \"s\",\n        \"t\",\n        \"u\",\n        \"v\",\n        \"w\",\n        \"x\",\n        \"y\",\n        \"z\",\n        \"braceleft\",\n        \"bar\",\n        \"braceright\",\n        \"asciitilde\",\n        \"exclamdown\",\n        \"cent\",\n        \"sterling\",\n        \"fraction\",\n        \"yen\",\n        \"florin\",\n        \"section\",\n        \"currency\",\n        \"quotesingle\",\n        \"quotedblleft\",\n        \"guillemotleft\",\n        \"guilsinglleft\",\n        \"guilsinglright\",\n        \"fi\",\n        \"fl\",\n        \"endash\",\n        \"dagger\",\n        \"daggerdbl\",\n        \"periodcentered\",\n        \"paragraph\",\n        \"bullet\",\n        
\"quotesinglbase\",\n        \"quotedblbase\",\n        \"quotedblright\",\n        \"guillemotright\",\n        \"ellipsis\",\n        \"perthousand\",\n        \"questiondown\",\n        \"grave\",\n        \"acute\",\n        \"circumflex\",\n        \"tilde\",\n        \"macron\",\n        \"breve\",\n        \"dotaccent\",\n        \"dieresis\",\n        \"ring\",\n        \"cedilla\",\n        \"hungarumlaut\",\n        \"ogonek\",\n        \"caron\",\n        \"emdash\",\n        \"AE\",\n        \"ordfeminine\",\n        \"Lslash\",\n        \"Oslash\",\n        \"OE\",\n        \"ordmasculine\",\n        \"ae\",\n        \"dotlessi\",\n        \"lslash\",\n        \"oslash\",\n        \"oe\",\n        \"germandbls\",\n        \"onesuperior\",\n        \"logicalnot\",\n        \"mu\",\n        \"trademark\",\n        \"Eth\",\n        \"onehalf\",\n        \"plusminus\",\n        \"Thorn\",\n        \"onequarter\",\n        \"divide\",\n        \"brokenbar\",\n        \"degree\",\n        \"thorn\",\n        \"threequarters\",\n        \"twosuperior\",\n        \"registered\",\n        \"minus\",\n        \"eth\",\n        \"multiply\",\n        \"threesuperior\",\n        \"copyright\",\n        \"Aacute\",\n        \"Acircumflex\",\n        \"Adieresis\",\n        \"Agrave\",\n        \"Aring\",\n        \"Atilde\",\n        \"Ccedilla\",\n        \"Eacute\",\n        \"Ecircumflex\",\n        \"Edieresis\",\n        \"Egrave\",\n        \"Iacute\",\n        \"Icircumflex\",\n        \"Idieresis\",\n        \"Igrave\",\n        \"Ntilde\",\n        \"Oacute\",\n        \"Ocircumflex\",\n        \"Odieresis\",\n        \"Ograve\",\n        \"Otilde\",\n        \"Scaron\",\n        \"Uacute\",\n        \"Ucircumflex\",\n        \"Udieresis\",\n        \"Ugrave\",\n        \"Yacute\",\n        \"Ydieresis\",\n        \"Zcaron\",\n        \"aacute\",\n        \"acircumflex\",\n        \"adieresis\",\n        \"agrave\",\n        \"aring\",\n        
\"atilde\",\n        \"ccedilla\",\n        \"eacute\",\n        \"ecircumflex\",\n        \"edieresis\",\n        \"egrave\",\n        \"iacute\",\n        \"icircumflex\",\n        \"idieresis\",\n        \"igrave\",\n        \"ntilde\",\n        \"oacute\",\n        \"ocircumflex\",\n        \"odieresis\",\n        \"ograve\",\n        \"otilde\",\n        \"scaron\",\n        \"uacute\",\n        \"ucircumflex\",\n        \"udieresis\",\n        \"ugrave\",\n        \"yacute\",\n        \"ydieresis\",\n        \"zcaron\",\n        \"exclamsmall\",\n        \"Hungarumlautsmall\",\n        \"dollaroldstyle\",\n        \"dollarsuperior\",\n        \"ampersandsmall\",\n        \"Acutesmall\",\n        \"parenleftsuperior\",\n        \"parenrightsuperior\",\n        \"twodotenleader\",\n        \"onedotenleader\",\n        \"zerooldstyle\",\n        \"oneoldstyle\",\n        \"twooldstyle\",\n        \"threeoldstyle\",\n        \"fouroldstyle\",\n        \"fiveoldstyle\",\n        \"sixoldstyle\",\n        \"sevenoldstyle\",\n        \"eightoldstyle\",\n        \"nineoldstyle\",\n        \"commasuperior\",\n        \"threequartersemdash\",\n        \"periodsuperior\",\n        \"questionsmall\",\n        \"asuperior\",\n        \"bsuperior\",\n        \"centsuperior\",\n        \"dsuperior\",\n        \"esuperior\",\n        \"isuperior\",\n        \"lsuperior\",\n        \"msuperior\",\n        \"nsuperior\",\n        \"osuperior\",\n        \"rsuperior\",\n        \"ssuperior\",\n        \"tsuperior\",\n        \"ff\",\n        \"ffi\",\n        \"ffl\",\n        \"parenleftinferior\",\n        \"parenrightinferior\",\n        \"Circumflexsmall\",\n        \"hyphensuperior\",\n        \"Gravesmall\",\n        \"Asmall\",\n        \"Bsmall\",\n        \"Csmall\",\n        \"Dsmall\",\n        \"Esmall\",\n        \"Fsmall\",\n        \"Gsmall\",\n        \"Hsmall\",\n        \"Ismall\",\n        \"Jsmall\",\n        \"Ksmall\",\n        \"Lsmall\",\n        
\"Msmall\",\n        \"Nsmall\",\n        \"Osmall\",\n        \"Psmall\",\n        \"Qsmall\",\n        \"Rsmall\",\n        \"Ssmall\",\n        \"Tsmall\",\n        \"Usmall\",\n        \"Vsmall\",\n        \"Wsmall\",\n        \"Xsmall\",\n        \"Ysmall\",\n        \"Zsmall\",\n        \"colonmonetary\",\n        \"onefitted\",\n        \"rupiah\",\n        \"Tildesmall\",\n        \"exclamdownsmall\",\n        \"centoldstyle\",\n        \"Lslashsmall\",\n        \"Scaronsmall\",\n        \"Zcaronsmall\",\n        \"Dieresissmall\",\n        \"Brevesmall\",\n        \"Caronsmall\",\n        \"Dotaccentsmall\",\n        \"Macronsmall\",\n        \"figuredash\",\n        \"hypheninferior\",\n        \"Ogoneksmall\",\n        \"Ringsmall\",\n        \"Cedillasmall\",\n        \"questiondownsmall\",\n        \"oneeighth\",\n        \"threeeighths\",\n        \"fiveeighths\",\n        \"seveneighths\",\n        \"onethird\",\n        \"twothirds\",\n        \"zerosuperior\",\n        \"foursuperior\",\n        \"fivesuperior\",\n        \"sixsuperior\",\n        \"sevensuperior\",\n        \"eightsuperior\",\n        \"ninesuperior\",\n        \"zeroinferior\",\n        \"oneinferior\",\n        \"twoinferior\",\n        \"threeinferior\",\n        \"fourinferior\",\n        \"fiveinferior\",\n        \"sixinferior\",\n        \"seveninferior\",\n        \"eightinferior\",\n        \"nineinferior\",\n        \"centinferior\",\n        \"dollarinferior\",\n        \"periodinferior\",\n        \"commainferior\",\n        \"Agravesmall\",\n        \"Aacutesmall\",\n        \"Acircumflexsmall\",\n        \"Atildesmall\",\n        \"Adieresissmall\",\n        \"Aringsmall\",\n        \"AEsmall\",\n        \"Ccedillasmall\",\n        \"Egravesmall\",\n        \"Eacutesmall\",\n        \"Ecircumflexsmall\",\n        \"Edieresissmall\",\n        \"Igravesmall\",\n        \"Iacutesmall\",\n        \"Icircumflexsmall\",\n        \"Idieresissmall\",\n        \"Ethsmall\",\n 
       \"Ntildesmall\",\n        \"Ogravesmall\",\n        \"Oacutesmall\",\n        \"Ocircumflexsmall\",\n        \"Otildesmall\",\n        \"Odieresissmall\",\n        \"OEsmall\",\n        \"Oslashsmall\",\n        \"Ugravesmall\",\n        \"Uacutesmall\",\n        \"Ucircumflexsmall\",\n        \"Udieresissmall\",\n        \"Yacutesmall\",\n        \"Thornsmall\",\n        \"Ydieresissmall\",\n        \"001.000\",\n        \"001.001\",\n        \"001.002\",\n        \"001.003\",\n        \"Black\",\n        \"Bold\",\n        \"Book\",\n        \"Light\",\n        \"Medium\",\n        \"Regular\",\n        \"Roman\",\n        \"Semibold\",\n    )\n\n    class INDEX:\n        def __init__(self, fp: BinaryIO) -> None:\n            self.fp = fp\n            self.offsets: list[int] = []\n            (count, offsize) = struct.unpack(\">HB\", self.fp.read(3))\n            for i in range(count + 1):\n                self.offsets.append(nunpack(self.fp.read(offsize)))\n            self.base = self.fp.tell() - 1\n            self.fp.seek(self.base + self.offsets[-1])\n\n        def __repr__(self) -> str:\n            return \"<INDEX: size=%d>\" % len(self)\n\n        def __len__(self) -> int:\n            return len(self.offsets) - 1\n\n        def __getitem__(self, i: int) -> bytes:\n            self.fp.seek(self.base + self.offsets[i])\n            return self.fp.read(self.offsets[i + 1] - self.offsets[i])\n\n        def __iter__(self) -> Iterator[bytes]:\n            return iter(self[i] for i in range(len(self)))\n\n    def __init__(self, name: str, fp: BinaryIO) -> None:\n        self.name = name\n        self.fp = fp\n        # Header\n        (_major, _minor, hdrsize, offsize) = struct.unpack(\"BBBB\", self.fp.read(4))\n        self.fp.read(hdrsize - 4)\n        # Name INDEX\n        self.name_index = self.INDEX(self.fp)\n        # Top DICT INDEX\n        self.dict_index = self.INDEX(self.fp)\n        # String INDEX\n        self.string_index = 
self.INDEX(self.fp)\n        # Global Subr INDEX\n        self.subr_index = self.INDEX(self.fp)\n        # Top DICT DATA\n        self.top_dict = getdict(self.dict_index[0])\n        (charset_pos,) = self.top_dict.get(15, [0])\n        (encoding_pos,) = self.top_dict.get(16, [0])\n        (charstring_pos,) = self.top_dict.get(17, [0])\n        # CharStrings\n        self.fp.seek(cast(int, charstring_pos))\n        self.charstring = self.INDEX(self.fp)\n        self.nglyphs = len(self.charstring)\n        # Encodings\n        self.code2gid = {}\n        self.gid2code = {}\n        self.fp.seek(cast(int, encoding_pos))\n        format = self.fp.read(1)\n        if format == b\"\\x00\":\n            # Format 0\n            (n,) = struct.unpack(\"B\", self.fp.read(1))\n            for code, gid in enumerate(struct.unpack(\"B\" * n, self.fp.read(n))):\n                self.code2gid[code] = gid\n                self.gid2code[gid] = code\n        elif format == b\"\\x01\":\n            # Format 1\n            (n,) = struct.unpack(\"B\", self.fp.read(1))\n            code = 0\n            for i in range(n):\n                (first, nleft) = struct.unpack(\"BB\", self.fp.read(2))\n                for gid in range(first, first + nleft + 1):\n                    self.code2gid[code] = gid\n                    self.gid2code[gid] = code\n                    code += 1\n        else:\n            raise PDFValueError(\"unsupported encoding format: %r\" % format)\n        # Charsets\n        self.name2gid = {}\n        self.gid2name = {}\n        self.fp.seek(cast(int, charset_pos))\n        format = self.fp.read(1)\n        if format == b\"\\x00\":\n            # Format 0\n            n = self.nglyphs - 1\n            for gid, sid in enumerate(\n                cast(\n                    tuple[int, ...], struct.unpack(\">\" + \"H\" * n, self.fp.read(2 * n))\n                ),\n            ):\n                gid += 1\n                sidname = self.getstr(sid)\n                
self.name2gid[sidname] = gid\n                self.gid2name[gid] = sidname\n        elif format == b\"\\x01\":\n            # Format 1\n            (n,) = struct.unpack(\"B\", self.fp.read(1))\n            sid = 0\n            for i in range(n):\n                (first, nleft) = struct.unpack(\"BB\", self.fp.read(2))\n                for gid in range(first, first + nleft + 1):\n                    sidname = self.getstr(sid)\n                    self.name2gid[sidname] = gid\n                    self.gid2name[gid] = sidname\n                    sid += 1\n        elif format == b\"\\x02\":\n            # Format 2\n            assert False, str((\"Unhandled\", format))\n        else:\n            raise PDFValueError(\"unsupported charset format: %r\" % format)\n\n    def getstr(self, sid: int) -> str | bytes:\n        # This returns str for one of the STANDARD_STRINGS but bytes otherwise,\n        # and appears to be a needless source of type complexity.\n        if sid < len(self.STANDARD_STRINGS):\n            return self.STANDARD_STRINGS[sid]\n        return self.string_index[sid - len(self.STANDARD_STRINGS)]\n\n\nclass TrueTypeFont:\n    class CMapNotFound(PDFException):\n        pass\n\n    def __init__(self, name: str, fp: BinaryIO) -> None:\n        self.name = name\n        self.fp = fp\n        self.tables: dict[bytes, tuple[int, int]] = {}\n        self.fonttype = fp.read(4)\n        try:\n            (ntables, _1, _2, _3) = cast(\n                tuple[int, int, int, int],\n                struct.unpack(\">HHHH\", fp.read(8)),\n            )\n            for _ in range(ntables):\n                (name_bytes, tsum, offset, length) = cast(\n                    tuple[bytes, int, int, int],\n                    struct.unpack(\">4sLLL\", fp.read(16)),\n                )\n                self.tables[name_bytes] = (offset, length)\n        except struct.error:\n            # Do not fail if there are not enough bytes to read. 
Even for\n            # corrupted PDFs we would like to get as much information as\n            # possible, so continue.\n            pass\n\n    def create_unicode_map(self) -> FileUnicodeMap:\n        if b\"cmap\" not in self.tables:\n            raise TrueTypeFont.CMapNotFound\n        fp = self.fp\n        char2gid = []\n        try:\n            face = freetype.Face(fp)\n            char2gid = list(face.get_chars())\n        except Exception:\n            raise TrueTypeFont.CMapNotFound\n        # create unicode map\n        unicode_map = FileUnicodeMap()\n        for char, gid in char2gid:\n            unicode_map.add_cid2unichr(gid, char)\n        return unicode_map\n\n\nclass PDFFontError(PDFException):\n    pass\n\n\nclass PDFUnicodeNotDefined(PDFFontError):\n    pass\n\n\nLITERAL_STANDARD_ENCODING = LIT(\"StandardEncoding\")\nLITERAL_TYPE1C = LIT(\"Type1C\")\n\n# Font widths are maintained in a dict type that maps from *either* unicode\n# chars or integer character IDs.\nFontWidthDict = dict[int | str, float]\n\n\nclass PDFFont:\n    def __init__(\n        self,\n        descriptor: Mapping[str, Any],\n        widths: FontWidthDict,\n        default_width: float | None = None,\n    ) -> None:\n        self.descriptor = descriptor\n        self.widths: FontWidthDict = resolve_all(widths)\n        self.fontname = resolve1(descriptor.get(\"FontName\", \"unknown\"))\n        if isinstance(self.fontname, PSLiteral):\n            self.fontname = literal_name(self.fontname)\n        self.flags = int_value(descriptor.get(\"Flags\", 0))\n        self.ascent = num_value(descriptor.get(\"Ascent\", 0))\n        self.descent = num_value(descriptor.get(\"Descent\", 0))\n        self.italic_angle = num_value(descriptor.get(\"ItalicAngle\", 0))\n        if default_width is None:\n            self.default_width = num_value(descriptor.get(\"MissingWidth\", 0))\n        else:\n            self.default_width = default_width\n        self.default_width = 
resolve1(self.default_width)\n        self.leading = num_value(descriptor.get(\"Leading\", 0))\n        self.bbox = self._parse_bbox(descriptor)\n        self.hscale = self.vscale = 0.001\n\n        # PDF RM 9.8.1 specifies /Descent should always be a negative number.\n        # PScript5.dll seems to produce Descent with a positive number, but\n        # text analysis will be wrong if this is taken as correct. So force\n        # descent to negative.\n        if self.descent > 0:\n            self.descent = -self.descent\n\n    def __repr__(self) -> str:\n        return \"<PDFFont>\"\n\n    def is_vertical(self) -> bool:\n        return False\n\n    def is_multibyte(self) -> bool:\n        return False\n\n    def decode(self, bytes: bytes) -> Iterable[int]:\n        return bytearray(bytes)  # map(ord, bytes)\n\n    def get_ascent(self) -> float:\n        \"\"\"Ascent above the baseline, in text space units\"\"\"\n        return self.ascent * self.vscale\n\n    def get_descent(self) -> float:\n        \"\"\"Descent below the baseline, in text space units; always negative\"\"\"\n        return self.descent * self.vscale\n\n    def get_width(self) -> float:\n        w = self.bbox[2] - self.bbox[0]\n        if w == 0:\n            w = -self.default_width\n        return w * self.hscale\n\n    def get_height(self) -> float:\n        h = self.bbox[3] - self.bbox[1]\n        if h == 0:\n            h = self.ascent - self.descent\n        return h * self.vscale\n\n    def char_width(self, cid: int) -> float:\n        # Because character widths may be mapping either IDs or strings,\n        # we try to lookup the character ID first, then its str equivalent.\n        cid_width = safe_float(self.widths.get(cid))\n        if cid_width is not None:\n            return cid_width * self.hscale\n\n        try:\n            str_cid = self.to_unichr(cid)\n            cid_width = safe_float(self.widths.get(str_cid))\n            if cid_width is not None:\n                return 
cid_width * self.hscale\n\n        except PDFUnicodeNotDefined:\n            pass\n\n        return self.default_width * self.hscale\n\n    def char_disp(self, cid: int) -> float | tuple[float | None, float]:\n        \"\"\"Returns an integer for horizontal fonts, a tuple for vertical fonts.\"\"\"\n        return 0\n\n    def string_width(self, s: bytes) -> float:\n        return sum(self.char_width(cid) for cid in self.decode(s))\n\n    def to_unichr(self, cid: int) -> str:\n        raise NotImplementedError\n\n    @staticmethod\n    def _parse_bbox(descriptor: Mapping[str, Any]) -> Rect:\n        \"\"\"Parse FontBBox from the fonts descriptor\"\"\"\n        font_bbox = resolve_all(descriptor.get(\"FontBBox\"))\n        bbox = safe_rect_list(font_bbox)\n        if bbox is None:\n            log.warning(\n                f\"Could get FontBBox from font descriptor because {font_bbox!r} cannot be parsed as 4 floats\"\n            )\n            return 0.0, 0.0, 0.0, 0.0\n        return bbox\n\n\nclass PDFSimpleFont(PDFFont):\n    def __init__(\n        self,\n        descriptor: Mapping[str, Any],\n        widths: FontWidthDict,\n        spec: Mapping[str, Any],\n    ) -> None:\n        # Font encoding is specified either by a name of\n        # built-in encoding or a dictionary that describes\n        # the differences.\n        if \"Encoding\" in spec:\n            encoding = resolve1(spec[\"Encoding\"])\n        else:\n            encoding = LITERAL_STANDARD_ENCODING\n        if isinstance(encoding, dict):\n            name = literal_name(encoding.get(\"BaseEncoding\", LITERAL_STANDARD_ENCODING))\n            diff = list_value(encoding.get(\"Differences\", []))\n            self.cid2unicode = EncodingDB.get_encoding(name, diff)\n        else:\n            self.cid2unicode = EncodingDB.get_encoding(literal_name(encoding))\n        self.unicode_map: UnicodeMap | None = None\n        if \"ToUnicode\" in spec:\n            strm = stream_value(spec[\"ToUnicode\"])\n    
        self.unicode_map = FileUnicodeMap()\n            CMapParser(self.unicode_map, BytesIO(strm.get_data())).run()\n        PDFFont.__init__(self, descriptor, widths)\n\n    def to_unichr(self, cid: int) -> str:\n        if self.unicode_map:\n            try:\n                return self.unicode_map.get_unichr(cid)\n            except KeyError:\n                pass\n        try:\n            return self.cid2unicode[cid]\n        except KeyError:\n            raise PDFUnicodeNotDefined(None, cid)\n\n\nclass PDFType1Font(PDFSimpleFont):\n    def __init__(self, rsrcmgr: \"PDFResourceManager\", spec: Mapping[str, Any]) -> None:\n        try:\n            self.basefont = literal_name(spec[\"BaseFont\"])\n        except KeyError:\n            if settings.STRICT:\n                raise PDFFontError(\"BaseFont is missing\")\n            self.basefont = \"unknown\"\n\n        widths: FontWidthDict\n        try:\n            (descriptor, int_widths) = FontMetricsDB.get_metrics(self.basefont)\n            widths = cast(dict[str | int, float], int_widths)  # implicit int->float\n        except KeyError:\n            descriptor = dict_value(spec.get(\"FontDescriptor\", {}))\n            firstchar = int_value(spec.get(\"FirstChar\", 0))\n            # lastchar = int_value(spec.get('LastChar', 255))\n            width_list = list_value(spec.get(\"Widths\", [0] * 256))\n            widths = {i + firstchar: resolve1(w) for (i, w) in enumerate(width_list)}\n        PDFSimpleFont.__init__(self, descriptor, widths, spec)\n        if \"Encoding\" not in spec and \"FontFile\" in descriptor:\n            # try to recover the missing encoding info from the font file.\n            self.fontfile = stream_value(descriptor.get(\"FontFile\"))\n            length1 = int_value(self.fontfile[\"Length1\"])\n            data = self.fontfile.get_data()[:length1]\n            # awcm: quickfix for type 1 font which contains bad string literals\n            offset = 0\n            if enc_offset := 
data.index(b\"/Encoding\"):\n                offset = enc_offset\n            parser = Type1FontHeaderParser(BytesIO(data[offset:]))\n            self.cid2unicode = parser.get_encoding()\n\n    def __repr__(self) -> str:\n        return \"<PDFType1Font: basefont=%r>\" % self.basefont\n\n\nclass PDFTrueTypeFont(PDFType1Font):\n    def __repr__(self) -> str:\n        return \"<PDFTrueTypeFont: basefont=%r>\" % self.basefont\n\n\nclass PDFType3Font(PDFSimpleFont):\n    def __init__(self, rsrcmgr: \"PDFResourceManager\", spec: Mapping[str, Any]) -> None:\n        firstchar = int_value(spec.get(\"FirstChar\", 0))\n        # lastchar = int_value(spec.get('LastChar', 0))\n        width_list = list_value(spec.get(\"Widths\", [0] * 256))\n        widths: dict[str | int, float] = {\n            i + firstchar: w for (i, w) in enumerate(width_list)\n        }\n        if \"FontDescriptor\" in spec:\n            descriptor = dict_value(spec[\"FontDescriptor\"])\n        else:\n            descriptor = {\"Ascent\": 0, \"Descent\": 0, \"FontBBox\": spec[\"FontBBox\"]}\n        PDFSimpleFont.__init__(self, descriptor, widths, spec)\n        self.matrix = cast(Matrix, tuple(list_value(spec.get(\"FontMatrix\"))))\n        (_, self.descent, _, self.ascent) = self.bbox\n        (self.hscale, self.vscale) = apply_matrix_norm(self.matrix, (1, 1))\n\n    def __repr__(self) -> str:\n        return \"<PDFType3Font>\"\n\n\nclass PDFCIDFont(PDFFont):\n    default_disp: float | tuple[float | None, float]\n\n    def __init__(\n        self,\n        rsrcmgr: \"PDFResourceManager\",\n        spec: Mapping[str, Any],\n        strict: bool = settings.STRICT,\n    ) -> None:\n        try:\n            self.basefont = literal_name(spec[\"BaseFont\"])\n        except KeyError:\n            if strict:\n                raise PDFFontError(\"BaseFont is missing\")\n            self.basefont = \"unknown\"\n        self.cidsysteminfo = dict_value(spec.get(\"CIDSystemInfo\", {}))\n        cid_registry = 
resolve1(self.cidsysteminfo.get(\"Registry\", b\"unknown\")).decode(\n            \"latin1\",\n        )\n        cid_ordering = resolve1(self.cidsysteminfo.get(\"Ordering\", b\"unknown\")).decode(\n            \"latin1\",\n        )\n        self.cidcoding = f\"{cid_registry.strip()}-{cid_ordering.strip()}\"\n        self.cmap: CMapBase = self.get_cmap_from_spec(spec, strict)\n\n        try:\n            descriptor = dict_value(spec[\"FontDescriptor\"])\n        except KeyError:\n            if strict:\n                raise PDFFontError(\"FontDescriptor is missing\")\n            descriptor = {}\n        ttf = None\n        self.has_encoding = False\n        self.cid_encoding = None\n        try:\n            if \"Encoding\" in spec:\n                encoding_part = resolve1(spec[\"Encoding\"])\n                if isinstance(encoding_part, PDFStream):\n                    self.has_encoding = True\n                    self.cid_encoding = CharacterMap(\n                        encoding_part.get_data().decode(\"U8\")\n                    )\n        except Exception as e:\n            log.error(f\"Error get cid_encoding from spec: {e}\")\n            self.has_encoding = False\n            self.cid_encoding = None\n        if \"FontFile2\" in descriptor:\n            self.fontfile = stream_value(descriptor.get(\"FontFile2\"))\n            ttf = TrueTypeFont(self.basefont, BytesIO(self.fontfile.get_data()))\n        self.unicode_map: UnicodeMap | None = None\n        if \"ToUnicode\" in spec:\n            if isinstance(spec[\"ToUnicode\"], PDFStream):\n                strm = stream_value(spec[\"ToUnicode\"])\n                self.unicode_map = FileUnicodeMap()\n                CMapParser(self.unicode_map, BytesIO(strm.get_data())).run()\n            else:\n                cmap_name = literal_name(spec[\"ToUnicode\"])\n                encoding = literal_name(spec[\"Encoding\"])\n                if (\n                    \"Identity\" in cid_ordering\n                    
or \"Identity\" in cmap_name\n                    or \"Identity\" in encoding\n                ):\n                    self.unicode_map = IdentityUnicodeMap()\n        elif self.cidcoding in (\"Adobe-Identity\", \"Adobe-UCS\"):\n            if ttf:\n                try:\n                    self.unicode_map = ttf.create_unicode_map()\n                except TrueTypeFont.CMapNotFound:\n                    pass\n        else:\n            try:\n                self.unicode_map = CMapDB.get_unicode_map(\n                    self.cidcoding,\n                    self.cmap.is_vertical(),\n                )\n            except CMapDB.CMapNotFound:\n                pass\n\n        self.vertical = self.cmap.is_vertical()\n        if self.vertical:\n            # writing mode: vertical\n            widths2 = get_widths2(list_value(spec.get(\"W2\", [])))\n            self.disps = {cid: (vx, vy) for (cid, (_, (vx, vy))) in widths2.items()}\n            (vy, w) = resolve1(spec.get(\"DW2\", [880, -1000]))\n            self.default_disp = (None, vy)\n            widths: dict[str | int, float] = {\n                cid: w for (cid, (w, _)) in widths2.items()\n            }\n            default_width = w\n        else:\n            # writing mode: horizontal\n            self.disps = {}\n            self.default_disp = 0\n            widths = get_widths(list_value(spec.get(\"W\", [])))\n            default_width = spec.get(\"DW\", 1000)\n        PDFFont.__init__(self, descriptor, widths, default_width=default_width)\n\n    def get_cmap_from_spec(self, spec: Mapping[str, Any], strict: bool) -> CMapBase:\n        \"\"\"Get cmap from font specification\n\n        For certain PDFs, Encoding Type isn't mentioned as an attribute of\n        Encoding but as an attribute of CMapName, where CMapName is an\n        attribute of spec['Encoding'].\n        The horizontal/vertical modes are mentioned with different name\n        such as 'DLIdent-H/V','OneByteIdentityH/V','Identity-H/V'.\n        
\"\"\"\n        cmap_name = self._get_cmap_name(spec, strict)\n\n        try:\n            return CMapDB.get_cmap(cmap_name)\n        except CMapDB.CMapNotFound as e:\n            if strict:\n                raise PDFFontError(e)\n            return CMap()\n\n    @staticmethod\n    def _get_cmap_name(spec: Mapping[str, Any], strict: bool) -> str:\n        \"\"\"Get cmap name from font specification\"\"\"\n        cmap_name = \"unknown\"  # default value\n\n        try:\n            spec_encoding = spec[\"Encoding\"]\n            if hasattr(spec_encoding, \"name\"):\n                cmap_name = literal_name(spec[\"Encoding\"])\n            else:\n                cmap_name = literal_name(spec_encoding[\"CMapName\"])\n        except KeyError:\n            if strict:\n                raise PDFFontError(\"Encoding is unspecified\")\n\n        if type(cmap_name) is PDFStream:  # type: ignore[comparison-overlap]\n            cmap_name_stream: PDFStream = cast(PDFStream, cmap_name)\n            if \"CMapName\" in cmap_name_stream:\n                cmap_name = cmap_name_stream.get(\"CMapName\").name\n            elif strict:\n                raise PDFFontError(\"CMapName unspecified for encoding\")\n\n        return IDENTITY_ENCODER.get(cmap_name, cmap_name)\n\n    def __repr__(self) -> str:\n        return f\"<PDFCIDFont: basefont={self.basefont!r}, cidcoding={self.cidcoding!r}>\"\n\n    def is_vertical(self) -> bool:\n        return self.vertical\n\n    def is_multibyte(self) -> bool:\n        return True\n\n    def decode(self, bytes: bytes) -> Iterable[int]:\n        try:\n            if self.has_encoding:\n                res = self.cid_encoding.decode(bytes)\n\n                if res is not None and all(x > 0 for x in res):\n                    return res\n        except Exception as e:\n            log.error(f\"Error use cid_encoding to decode bytes: {e}\")\n        return self.cmap.decode(bytes)\n\n    def char_disp(self, cid: int) -> float | tuple[float | None, 
float]:\n        \"\"\"Returns an integer for horizontal fonts, a tuple for vertical fonts.\"\"\"\n        return self.disps.get(cid, self.default_disp)\n\n    def to_unichr(self, cid: int) -> str:\n        try:\n            if not self.unicode_map:\n                raise PDFKeyError(cid)\n            return self.unicode_map.get_unichr(cid)\n        except KeyError:\n            raise PDFUnicodeNotDefined(self.cidcoding, cid)\n"
  },
  {
    "path": "babeldoc/pdfminer/pdfinterp.py",
    "content": "import logging\nimport re\nfrom collections.abc import Mapping\nfrom collections.abc import Sequence\nfrom io import BytesIO\nfrom typing import Union\nfrom typing import cast\n\nfrom babeldoc.pdfminer.casting import safe_cmyk\nfrom babeldoc.pdfminer.casting import safe_float\nfrom babeldoc.pdfminer.casting import safe_int\nfrom babeldoc.pdfminer.casting import safe_matrix\nfrom babeldoc.pdfminer.casting import safe_rgb\nfrom babeldoc.pdfminer.cmapdb import CMap\nfrom babeldoc.pdfminer.cmapdb import CMapBase\nfrom babeldoc.pdfminer.cmapdb import CMapDB\nfrom babeldoc.pdfminer.pdfcolor import PREDEFINED_COLORSPACE\nfrom babeldoc.pdfminer.pdfcolor import PDFColorSpace\nfrom babeldoc.pdfminer.pdfdevice import PDFDevice\nfrom babeldoc.pdfminer.pdfdevice import PDFTextSeq\nfrom babeldoc.pdfminer.pdfexceptions import PDFException\nfrom babeldoc.pdfminer.pdfexceptions import PDFValueError\nfrom babeldoc.pdfminer.pdffont import PDFCIDFont\nfrom babeldoc.pdfminer.pdffont import PDFFont\nfrom babeldoc.pdfminer.pdffont import PDFFontError\nfrom babeldoc.pdfminer.pdffont import PDFTrueTypeFont\nfrom babeldoc.pdfminer.pdffont import PDFType1Font\nfrom babeldoc.pdfminer.pdffont import PDFType3Font\nfrom babeldoc.pdfminer.pdfpage import PDFPage\nfrom babeldoc.pdfminer.pdftypes import LITERALS_ASCII85_DECODE\nfrom babeldoc.pdfminer.pdftypes import PDFObjRef\nfrom babeldoc.pdfminer.pdftypes import PDFStream\nfrom babeldoc.pdfminer.pdftypes import dict_value\nfrom babeldoc.pdfminer.pdftypes import list_value\nfrom babeldoc.pdfminer.pdftypes import resolve1\nfrom babeldoc.pdfminer.pdftypes import stream_value\nfrom babeldoc.pdfminer.psexceptions import PSEOF\nfrom babeldoc.pdfminer.psexceptions import PSTypeError\nfrom babeldoc.pdfminer.psparser import KWD\nfrom babeldoc.pdfminer.psparser import LIT\nfrom babeldoc.pdfminer.psparser import PSKeyword\nfrom babeldoc.pdfminer.psparser import PSLiteral\nfrom babeldoc.pdfminer.psparser import PSStackParser\nfrom 
babeldoc.pdfminer.psparser import PSStackType\nfrom babeldoc.pdfminer.psparser import keyword_name\nfrom babeldoc.pdfminer.psparser import literal_name\nfrom babeldoc.pdfminer.utils import MATRIX_IDENTITY, apply_matrix_pt\nfrom babeldoc.pdfminer.utils import Matrix\nfrom babeldoc.pdfminer.utils import PathSegment\nfrom babeldoc.pdfminer.utils import Point\nfrom babeldoc.pdfminer.utils import Rect\nfrom babeldoc.pdfminer.utils import choplist\nfrom babeldoc.pdfminer.utils import mult_matrix\nfrom babeldoc.pdfminer import settings\n\nlog = logging.getLogger(__name__)\n\n\nclass PDFResourceError(PDFException):\n    pass\n\n\nclass PDFInterpreterError(PDFException):\n    pass\n\n\nLITERAL_PDF = LIT(\"PDF\")\nLITERAL_TEXT = LIT(\"Text\")\nLITERAL_FONT = LIT(\"Font\")\nLITERAL_FORM = LIT(\"Form\")\nLITERAL_IMAGE = LIT(\"Image\")\n\n\nclass PDFTextState:\n    matrix: Matrix\n    linematrix: Point\n\n    def __init__(self) -> None:\n        self.font: PDFFont | None = None\n        self.fontsize: float = 0\n        self.charspace: float = 0\n        self.wordspace: float = 0\n        self.scaling: float = 100\n        self.leading: float = 0\n        self.render: int = 0\n        self.rise: float = 0\n        self.reset()\n        # self.matrix is set\n        # self.linematrix is set\n\n    def __repr__(self) -> str:\n        return (\n            \"<PDFTextState: font=%r, fontsize=%r, charspace=%r, \"\n            \"wordspace=%r, scaling=%r, leading=%r, render=%r, rise=%r, \"\n            \"matrix=%r, linematrix=%r>\"\n            % (\n                self.font,\n                self.fontsize,\n                self.charspace,\n                self.wordspace,\n                self.scaling,\n                self.leading,\n                self.render,\n                self.rise,\n                self.matrix,\n                self.linematrix,\n            )\n        )\n\n    def copy(self) -> \"PDFTextState\":\n        obj = PDFTextState()\n        obj.font = self.font\n     
   obj.fontsize = self.fontsize\n        obj.charspace = self.charspace\n        obj.wordspace = self.wordspace\n        obj.scaling = self.scaling\n        obj.leading = self.leading\n        obj.render = self.render\n        obj.rise = self.rise\n        obj.matrix = self.matrix\n        obj.linematrix = self.linematrix\n        obj.font_id = getattr(self, \"font_id\", None)\n        return obj\n\n    def reset(self) -> None:\n        self.matrix = MATRIX_IDENTITY\n        self.linematrix = (0, 0)\n\n\nColor = Union[\n    float,  # Greyscale\n    tuple[float, float, float],  # R, G, B\n    tuple[float, float, float, float],  # C, M, Y, K\n]\n\n\nclass PDFGraphicState:\n    def __init__(self) -> None:\n        self.linewidth: float = 0\n        self.linecap: object | None = None\n        self.linejoin: object | None = None\n        self.miterlimit: object | None = None\n        self.dash: tuple[object, object] | None = None\n        self.intent: object | None = None\n        self.flatness: object | None = None\n\n        # stroking color\n        self.scolor: Color | None = None\n\n        # non stroking color\n        self.ncolor: Color | None = None\n\n    def copy(self) -> \"PDFGraphicState\":\n        obj = PDFGraphicState()\n        obj.linewidth = self.linewidth\n        obj.linecap = self.linecap\n        obj.linejoin = self.linejoin\n        obj.miterlimit = self.miterlimit\n        obj.dash = self.dash\n        obj.intent = self.intent\n        obj.flatness = self.flatness\n        obj.scolor = self.scolor\n        obj.ncolor = self.ncolor\n        return obj\n\n    def __repr__(self) -> str:\n        return (\n            \"<PDFGraphicState: linewidth=%r, linecap=%r, linejoin=%r, \"\n            \" miterlimit=%r, dash=%r, intent=%r, flatness=%r, \"\n            \" stroking color=%r, non stroking color=%r>\"\n            % (\n                self.linewidth,\n                self.linecap,\n                self.linejoin,\n                self.miterlimit,\n  
              self.dash,\n                self.intent,\n                self.flatness,\n                self.scolor,\n                self.ncolor,\n            )\n        )\n\n\nclass PDFResourceManager:\n    \"\"\"Repository of shared resources.\n\n    ResourceManager facilitates reuse of shared resources\n    such as fonts and images so that large objects are not\n    allocated multiple times.\n    \"\"\"\n\n    def __init__(self, caching: bool = True) -> None:\n        self.caching = caching\n        self._cached_fonts: dict[object, PDFFont] = {}\n\n    def get_procset(self, procs: Sequence[object]) -> None:\n        for proc in procs:\n            if proc is LITERAL_PDF or proc is LITERAL_TEXT:\n                pass\n            else:\n                pass\n\n    def get_cmap(self, cmapname: str, strict: bool = False) -> CMapBase:\n        try:\n            return CMapDB.get_cmap(cmapname)\n        except CMapDB.CMapNotFound:\n            if strict:\n                raise\n            return CMap()\n\n    def get_font(self, objid: object, spec: Mapping[str, object]) -> PDFFont:\n        if objid and objid in self._cached_fonts:\n            font = self._cached_fonts[objid]\n        else:\n            log.debug(\"get_font: create: objid=%r, spec=%r\", objid, spec)\n            if settings.STRICT:\n                if spec[\"Type\"] is not LITERAL_FONT:\n                    raise PDFFontError(\"Type is not /Font\")\n            # Create a Font object.\n            if \"Subtype\" in spec:\n                subtype = literal_name(spec[\"Subtype\"])\n            else:\n                if settings.STRICT:\n                    raise PDFFontError(\"Font Subtype is not specified.\")\n                subtype = \"Type1\"\n            if subtype in (\"Type1\", \"MMType1\"):\n                # Type1 Font\n                font = PDFType1Font(self, spec)\n            elif subtype == \"TrueType\":\n                # TrueType Font\n                font = PDFTrueTypeFont(self, 
spec)\n            elif subtype == \"Type3\":\n                # Type3 Font\n                font = PDFType3Font(self, spec)\n            elif subtype in (\"CIDFontType0\", \"CIDFontType2\"):\n                # CID Font\n                font = PDFCIDFont(self, spec)\n            elif subtype == \"Type0\":\n                # Type0 Font\n                dfonts = list_value(spec[\"DescendantFonts\"])\n                assert dfonts\n                subspec = dict_value(dfonts[0]).copy()\n                for k in (\"Encoding\", \"ToUnicode\"):\n                    if k in spec:\n                        subspec[k] = resolve1(spec[k])\n                font = self.get_font(None, subspec)\n            else:\n                if settings.STRICT:\n                    raise PDFFontError(\"Invalid Font spec: %r\" % spec)\n                font = PDFType1Font(self, spec)  # this is so wrong!\n            if objid and self.caching:\n                self._cached_fonts[objid] = font\n        return font\n\n\nclass PDFContentParser(PSStackParser[Union[PSKeyword, PDFStream]]):\n    def __init__(self, streams: Sequence[object]) -> None:\n        self.streams = streams\n        self.istream = 0\n        # PSStackParser.__init__(fp=None) is safe only because we've overloaded\n        # all the methods that would attempt to access self.fp without first\n        # calling self.fillfp().\n        PSStackParser.__init__(self, None)  # type: ignore[arg-type]\n\n    def fillfp(self) -> None:\n        if not self.fp:\n            if self.istream < len(self.streams):\n                strm = stream_value(self.streams[self.istream])\n                self.istream += 1\n            else:\n                raise PSEOF(\"Unexpected EOF, file truncated?\")\n            self.fp = BytesIO(strm.get_data())\n\n    def seek(self, pos: int) -> None:\n        self.fillfp()\n        PSStackParser.seek(self, pos)\n\n    def fillbuf(self) -> None:\n        if self.charpos < len(self.buf):\n            return\n     
   while 1:\n            self.fillfp()\n            self.bufpos = self.fp.tell()\n            self.buf = self.fp.read(self.BUFSIZ)\n            if self.buf:\n                break\n            self.fp = None  # type: ignore[assignment]\n        self.charpos = 0\n\n    def get_inline_data(self, pos: int, target: bytes = b\"EI\") -> tuple[int, bytes]:\n        self.seek(pos)\n        i = 0\n        data = b\"\"\n        while i <= len(target):\n            self.fillbuf()\n            if i:\n                ci = self.buf[self.charpos]\n                c = bytes((ci,))\n                data += c\n                self.charpos += 1\n                if (\n                    len(target) <= i\n                    and c.isspace()\n                    or i < len(target)\n                    and c == (bytes((target[i],)))\n                ):\n                    i += 1\n                else:\n                    i = 0\n            else:\n                try:\n                    j = self.buf.index(target[0], self.charpos)\n                    data += self.buf[self.charpos : j + 1]\n                    self.charpos = j + 1\n                    i = 1\n                except ValueError:\n                    data += self.buf[self.charpos :]\n                    self.charpos = len(self.buf)\n        data = data[: -(len(target) + 1)]  # strip the last part\n        data = re.sub(rb\"(\\x0d\\x0a|[\\x0d\\x0a])$\", b\"\", data)\n        return (pos, data)\n\n    def flush(self) -> None:\n        self.add_results(*self.popall())\n\n    KEYWORD_BI = KWD(b\"BI\")\n    KEYWORD_ID = KWD(b\"ID\")\n    KEYWORD_EI = KWD(b\"EI\")\n\n    def do_keyword(self, pos: int, token: PSKeyword) -> None:\n        if token is self.KEYWORD_BI:\n            # inline image within a content stream\n            self.start_type(pos, \"inline\")\n        elif token is self.KEYWORD_ID:\n            try:\n                (_, objs) = self.end_type(\"inline\")\n                if len(objs) % 2 != 0:\n                
    error_msg = f\"Invalid dictionary construct: {objs!r}\"\n                    raise PSTypeError(error_msg)\n                d = {literal_name(k): resolve1(v) for (k, v) in choplist(2, objs)}\n                eos = b\"EI\"\n                filter = d.get(\"F\", None)\n                if filter is not None:\n                    if isinstance(filter, PSLiteral):\n                        filter = [filter]\n                    if filter[0] in LITERALS_ASCII85_DECODE:\n                        eos = b\"~>\"\n                (pos, data) = self.get_inline_data(pos + len(b\"ID \"), target=eos)\n                if eos != b\"EI\":  # it may be necessary for decoding\n                    data += eos\n                obj = PDFStream(d, data)\n                self.push((pos, obj))\n                if eos == b\"EI\":  # otherwise it is still in the stream\n                    self.push((pos, self.KEYWORD_EI))\n            except PSTypeError:\n                if settings.STRICT:\n                    raise\n        else:\n            self.push((pos, token))\n\n\nPDFStackT = PSStackType[PDFStream]\n\"\"\"Types that may appear on the PDF argument stack.\"\"\"\n\n\nclass PDFPageInterpreter:\n    \"\"\"Processor for the content of a PDF page\n\n    Reference: PDF Reference, Appendix A, Operator Summary\n    \"\"\"\n\n    def __init__(self, rsrcmgr: PDFResourceManager, device: PDFDevice) -> None:\n        self.rsrcmgr = rsrcmgr\n        self.device = device\n\n    def dup(self) -> \"PDFPageInterpreter\":\n        return self.__class__(self.rsrcmgr, self.device)\n\n    def init_resources(self, resources: dict[object, object]) -> None:\n        \"\"\"Prepare the fonts and XObjects listed in the Resource attribute.\"\"\"\n        self.resources = resources\n        self.fontmap: dict[object, PDFFont] = {}\n        self.xobjmap = {}\n        self.csmap: dict[str, PDFColorSpace] = PREDEFINED_COLORSPACE.copy()\n        if not resources:\n            return\n\n        def 
get_colorspace(spec: object) -> PDFColorSpace | None:\n            if isinstance(spec, list):\n                name = literal_name(spec[0])\n            else:\n                name = literal_name(spec)\n            if name == \"ICCBased\" and isinstance(spec, list) and len(spec) >= 2:\n                return PDFColorSpace(name, stream_value(spec[1])[\"N\"])\n            elif name == \"DeviceN\" and isinstance(spec, list) and len(spec) >= 2:\n                return PDFColorSpace(name, len(list_value(spec[1])))\n            else:\n                return PREDEFINED_COLORSPACE.get(name)\n\n        for k, v in dict_value(resources).items():\n            log.debug(\"Resource: %r: %r\", k, v)\n            if k == \"Font\":\n                for fontid, spec in dict_value(v).items():\n                    objid = None\n                    if isinstance(spec, PDFObjRef):\n                        objid = spec.objid\n                    spec = dict_value(spec)\n                    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)\n            elif k == \"ColorSpace\":\n                for csid, spec in dict_value(v).items():\n                    colorspace = get_colorspace(resolve1(spec))\n                    if colorspace is not None:\n                        self.csmap[csid] = colorspace\n            elif k == \"ProcSet\":\n                self.rsrcmgr.get_procset(list_value(v))\n            elif k == \"XObject\":\n                for xobjid, xobjstrm in dict_value(v).items():\n                    self.xobjmap[xobjid] = xobjstrm\n\n    def init_state(self, ctm: Matrix) -> None:\n        \"\"\"Initialize the text and graphic states for rendering a page.\"\"\"\n        # gstack: stack for graphical states.\n        self.gstack: list[tuple[Matrix, PDFTextState, PDFGraphicState]] = []\n        self.ctm = ctm\n        self.device.set_ctm(self.ctm)\n        self.textstate = PDFTextState()\n        self.graphicstate = PDFGraphicState()\n        self.curpath: 
list[PathSegment] = []\n        # argstack: stack for command arguments.\n        self.argstack: list[PDFStackT] = []\n        # set some global states.\n        self.scs: PDFColorSpace | None = None\n        self.ncs: PDFColorSpace | None = None\n        if self.csmap:\n            self.scs = self.ncs = next(iter(self.csmap.values()))\n\n    def push(self, obj: PDFStackT) -> None:\n        self.argstack.append(obj)\n\n    def pop(self, n: int) -> list[PDFStackT]:\n        if n == 0:\n            return []\n        x = self.argstack[-n:]\n        self.argstack = self.argstack[:-n]\n        return x\n\n    def get_current_state(self) -> tuple[Matrix, PDFTextState, PDFGraphicState]:\n        return (self.ctm, self.textstate.copy(), self.graphicstate.copy())\n\n    def set_current_state(\n        self,\n        state: tuple[Matrix, PDFTextState, PDFGraphicState],\n    ) -> None:\n        (self.ctm, self.textstate, self.graphicstate) = state\n        self.device.set_ctm(self.ctm)\n\n    def do_q(self) -> None:\n        \"\"\"Save graphics state\"\"\"\n        self.gstack.append(self.get_current_state())\n\n    def do_Q(self) -> None:\n        \"\"\"Restore graphics state\"\"\"\n        if self.gstack:\n            self.set_current_state(self.gstack.pop())\n\n    def do_cm(\n        self,\n        a1: PDFStackT,\n        b1: PDFStackT,\n        c1: PDFStackT,\n        d1: PDFStackT,\n        e1: PDFStackT,\n        f1: PDFStackT,\n    ) -> None:\n        \"\"\"Concatenate matrix to current transformation matrix\"\"\"\n        matrix = safe_matrix(a1, b1, c1, d1, e1, f1)\n\n        if matrix is None:\n            log.warning(\n                f\"Cannot concatenate matrix to current transformation matrix because not all values in {(a1, b1, c1, d1, e1, f1)!r} can be parsed as floats\"\n            )\n        else:\n            self.ctm = mult_matrix(matrix, self.ctm)\n            self.device.set_ctm(self.ctm)\n\n    def do_w(self, linewidth: PDFStackT) -> None:\n        
\"\"\"Set line width\"\"\"\n        linewidth_f = safe_float(linewidth)\n        if linewidth_f is None:\n            log.warning(\n                f\"Cannot set line width because {linewidth!r} is an invalid float value\"\n            )\n        else:\n            self.graphicstate.linewidth = linewidth_f\n\n    def do_J(self, linecap: PDFStackT) -> None:\n        \"\"\"Set line cap style\"\"\"\n        self.graphicstate.linecap = linecap\n\n    def do_j(self, linejoin: PDFStackT) -> None:\n        \"\"\"Set line join style\"\"\"\n        self.graphicstate.linejoin = linejoin\n\n    def do_M(self, miterlimit: PDFStackT) -> None:\n        \"\"\"Set miter limit\"\"\"\n        self.graphicstate.miterlimit = miterlimit\n\n    def do_d(self, dash: PDFStackT, phase: PDFStackT) -> None:\n        \"\"\"Set line dash pattern\"\"\"\n        self.graphicstate.dash = (dash, phase)\n\n    def do_ri(self, intent: PDFStackT) -> None:\n        \"\"\"Set color rendering intent\"\"\"\n        self.graphicstate.intent = intent\n\n    def do_i(self, flatness: PDFStackT) -> None:\n        \"\"\"Set flatness tolerance\"\"\"\n        self.graphicstate.flatness = flatness\n\n    def do_gs(self, name: PDFStackT) -> None:\n        \"\"\"Set parameters from graphics state parameter dictionary\"\"\"\n        # to do\n\n    def do_m(self, x: PDFStackT, y: PDFStackT) -> None:\n        \"\"\"Begin new subpath\"\"\"\n        x_f = safe_float(x)\n        y_f = safe_float(y)\n\n        if x_f is None or y_f is None:\n            point = (\"m\", x, y)\n            log.warning(\n                f\"Cannot start new subpath because not all values in {point!r} can be parsed as floats\"\n            )\n        else:\n            point = (\"m\", x_f, y_f)\n            self.curpath.append(point)\n\n    def do_l(self, x: PDFStackT, y: PDFStackT) -> None:\n        \"\"\"Append straight line segment to path\"\"\"\n        x_f = safe_float(x)\n        y_f = safe_float(y)\n        if x_f is None or y_f is 
None:\n            point = (\"l\", x, y)\n            log.warning(\n                f\"Cannot append straight line segment to path because not all values in {point!r} can be parsed as floats\"\n            )\n        else:\n            point = (\"l\", x_f, y_f)\n            self.curpath.append(point)\n\n    def do_c(\n        self,\n        x1: PDFStackT,\n        y1: PDFStackT,\n        x2: PDFStackT,\n        y2: PDFStackT,\n        x3: PDFStackT,\n        y3: PDFStackT,\n    ) -> None:\n        \"\"\"Append curved segment to path (three control points)\"\"\"\n        x1_f = safe_float(x1)\n        y1_f = safe_float(y1)\n        x2_f = safe_float(x2)\n        y2_f = safe_float(y2)\n        x3_f = safe_float(x3)\n        y3_f = safe_float(y3)\n        if (\n            x1_f is None\n            or y1_f is None\n            or x2_f is None\n            or y2_f is None\n            or x3_f is None\n            or y3_f is None\n        ):\n            point = (\"c\", x1, y1, x2, y2, x3, y3)\n            log.warning(\n                f\"Cannot append curved segment to path because not all values in {point!r} can be parsed as floats\"\n            )\n        else:\n            point = (\"c\", x1_f, y1_f, x2_f, y2_f, x3_f, y3_f)\n            self.curpath.append(point)\n\n    def do_v(self, x2: PDFStackT, y2: PDFStackT, x3: PDFStackT, y3: PDFStackT) -> None:\n        \"\"\"Append curved segment to path (initial point replicated)\"\"\"\n        x2_f = safe_float(x2)\n        y2_f = safe_float(y2)\n        x3_f = safe_float(x3)\n        y3_f = safe_float(y3)\n        if x2_f is None or y2_f is None or x3_f is None or y3_f is None:\n            point = (\"v\", x2, y2, x3, y3)\n            log.warning(\n                f\"Cannot append curved segment to path because not all values in {point!r} can be parsed as floats\"\n            )\n        else:\n            point = (\"v\", x2_f, y2_f, x3_f, y3_f)\n            self.curpath.append(point)\n\n    def do_y(self, x1: 
PDFStackT, y1: PDFStackT, x3: PDFStackT, y3: PDFStackT) -> None:\n        \"\"\"Append curved segment to path (final point replicated)\"\"\"\n        x1_f = safe_float(x1)\n        y1_f = safe_float(y1)\n        x3_f = safe_float(x3)\n        y3_f = safe_float(y3)\n        if x1_f is None or y1_f is None or x3_f is None or y3_f is None:\n            point = (\"y\", x1, y1, x3, y3)\n            log.warning(\n                f\"Cannot append curved segment to path because not all values in {point!r} can be parsed as floats\"\n            )\n        else:\n            point = (\"y\", x1_f, y1_f, x3_f, y3_f)\n            self.curpath.append(point)\n\n    def do_h(self) -> None:\n        \"\"\"Close subpath\"\"\"\n        self.curpath.append((\"h\",))\n\n    def do_re(self, x: PDFStackT, y: PDFStackT, w: PDFStackT, h: PDFStackT) -> None:\n        \"\"\"Append rectangle to path\"\"\"\n        x_f = safe_float(x)\n        y_f = safe_float(y)\n        w_f = safe_float(w)\n        h_f = safe_float(h)\n\n        if x_f is None or y_f is None or w_f is None or h_f is None:\n            values = (x, y, w, h)\n            log.warning(\n                f\"Cannot append rectangle to path because not all values in {values!r} can be parsed as floats\"\n            )\n        else:\n            self.curpath.append((\"m\", x_f, y_f))\n            self.curpath.append((\"l\", x_f + w_f, y_f))\n            self.curpath.append((\"l\", x_f + w_f, y_f + h_f))\n            self.curpath.append((\"l\", x_f, y_f + h_f))\n            self.curpath.append((\"h\",))\n\n    def do_S(self) -> None:\n        \"\"\"Stroke path\"\"\"\n        self.device.paint_path(self.graphicstate, True, False, False, self.curpath)\n        self.curpath = []\n\n    def do_s(self) -> None:\n        \"\"\"Close and stroke path\"\"\"\n        self.do_h()\n        self.do_S()\n\n    def do_f(self) -> None:\n        \"\"\"Fill path using nonzero winding number rule\"\"\"\n        self.device.paint_path(self.graphicstate, 
False, True, False, self.curpath)\n        self.curpath = []\n\n    def do_F(self) -> None:\n        \"\"\"Fill path using nonzero winding number rule (obsolete)\"\"\"\n\n    def do_f_a(self) -> None:\n        \"\"\"Fill path using even-odd rule\"\"\"\n        self.device.paint_path(self.graphicstate, False, True, True, self.curpath)\n        self.curpath = []\n\n    def do_B(self) -> None:\n        \"\"\"Fill and stroke path using nonzero winding number rule\"\"\"\n        self.device.paint_path(self.graphicstate, True, True, False, self.curpath)\n        self.curpath = []\n\n    def do_B_a(self) -> None:\n        \"\"\"Fill and stroke path using even-odd rule\"\"\"\n        self.device.paint_path(self.graphicstate, True, True, True, self.curpath)\n        self.curpath = []\n\n    def do_b(self) -> None:\n        \"\"\"Close, fill, and stroke path using nonzero winding number rule\"\"\"\n        self.do_h()\n        self.do_B()\n\n    def do_b_a(self) -> None:\n        \"\"\"Close, fill, and stroke path using even-odd rule\"\"\"\n        self.do_h()\n        self.do_B_a()\n\n    def do_n(self) -> None:\n        \"\"\"End path without filling or stroking\"\"\"\n        self.curpath = []\n\n    def do_W(self) -> None:\n        \"\"\"Set clipping path using nonzero winding number rule\"\"\"\n        pass\n\n    def do_W_a(self) -> None:\n        \"\"\"Set clipping path using even-odd rule\"\"\"\n        pass\n\n    def do_CS(self, name: PDFStackT) -> None:\n        \"\"\"Set color space for stroking operations\n\n        Introduced in PDF 1.1\n        \"\"\"\n        try:\n            self.scs = self.csmap[literal_name(name)]\n        except KeyError:\n            if settings.STRICT:\n                raise PDFInterpreterError(\"Undefined ColorSpace: %r\" % name)\n\n    def do_cs(self, name: PDFStackT) -> None:\n        \"\"\"Set color space for nonstroking operations\"\"\"\n        try:\n            self.ncs = self.csmap[literal_name(name)]\n        except 
KeyError:\n            if settings.STRICT:\n                raise PDFInterpreterError(\"Undefined ColorSpace: %r\" % name)\n\n    def do_G(self, gray: PDFStackT) -> None:\n        \"\"\"Set gray level for stroking operations\"\"\"\n        gray_f = safe_float(gray)\n\n        if gray_f is None:\n            log.warning(\n                f\"Cannot set gray level because {gray!r} is an invalid float value\"\n            )\n        else:\n            self.graphicstate.scolor = gray_f\n            self.scs = self.csmap[\"DeviceGray\"]\n\n    def do_g(self, gray: PDFStackT) -> None:\n        \"\"\"Set gray level for nonstroking operations\"\"\"\n        gray_f = safe_float(gray)\n\n        if gray_f is None:\n            log.warning(\n                f\"Cannot set gray level because {gray!r} is an invalid float value\"\n            )\n        else:\n            self.graphicstate.ncolor = gray_f\n            self.ncs = self.csmap[\"DeviceGray\"]\n\n    def do_RG(self, r: PDFStackT, g: PDFStackT, b: PDFStackT) -> None:\n        \"\"\"Set RGB color for stroking operations\"\"\"\n        rgb = safe_rgb(r, g, b)\n\n        if rgb is None:\n            log.warning(\n                f\"Cannot set RGB stroke color because not all values in {(r, g, b)!r} can be parsed as floats\"\n            )\n        else:\n            self.graphicstate.scolor = rgb\n            self.scs = self.csmap[\"DeviceRGB\"]\n\n    def do_rg(self, r: PDFStackT, g: PDFStackT, b: PDFStackT) -> None:\n        \"\"\"Set RGB color for nonstroking operations\"\"\"\n        rgb = safe_rgb(r, g, b)\n\n        if rgb is None:\n            log.warning(\n                f\"Cannot set RGB non-stroke color because not all values in {(r, g, b)!r} can be parsed as floats\"\n            )\n        else:\n            self.graphicstate.ncolor = rgb\n            self.ncs = self.csmap[\"DeviceRGB\"]\n\n    def do_K(self, c: PDFStackT, m: PDFStackT, y: PDFStackT, k: PDFStackT) -> None:\n        \"\"\"Set CMYK color for 
stroking operations\"\"\"\n        cmyk = safe_cmyk(c, m, y, k)\n\n        if cmyk is None:\n            log.warning(\n                f\"Cannot set CMYK stroke color because not all values in {(c, m, y, k)!r} can be parsed as floats\"\n            )\n        else:\n            self.graphicstate.scolor = cmyk\n            self.scs = self.csmap[\"DeviceCMYK\"]\n\n    def do_k(self, c: PDFStackT, m: PDFStackT, y: PDFStackT, k: PDFStackT) -> None:\n        \"\"\"Set CMYK color for nonstroking operations\"\"\"\n        cmyk = safe_cmyk(c, m, y, k)\n\n        if cmyk is None:\n            log.warning(\n                f\"Cannot set CMYK non-stroke color because not all values in {(c, m, y, k)!r} can be parsed as floats\"\n            )\n        else:\n            self.graphicstate.ncolor = cmyk\n            self.ncs = self.csmap[\"DeviceCMYK\"]\n\n    def do_SCN(self) -> None:\n        \"\"\"Set color for stroking operations.\"\"\"\n        if self.scs:\n            n = self.scs.ncomponents\n        else:\n            if settings.STRICT:\n                raise PDFInterpreterError(\"No colorspace specified!\")\n            n = 1\n\n        if n == 1:\n            gray = self.pop(1)[0]\n            gray_f = safe_float(gray)\n            if gray_f is None:\n                log.warning(\n                    f\"Cannot set gray stroke color because {gray!r} is an invalid float value\"\n                )\n            else:\n                self.graphicstate.scolor = gray_f\n\n        elif n == 3:\n            values = self.pop(3)\n            rgb = safe_rgb(*values)\n            if rgb is None:\n                log.warning(\n                    f\"Cannot set RGB stroke color because not all values in {values!r} can be parsed as floats\"\n                )\n            else:\n                self.graphicstate.scolor = rgb\n\n        elif n == 4:\n            values = self.pop(4)\n            cmyk = safe_cmyk(*values)\n\n            if cmyk is None:\n                
log.warning(\n                    f\"Cannot set CMYK stroke color because not all values in {values!r} can be parsed as floats\"\n                )\n            else:\n                self.graphicstate.scolor = cmyk\n\n        else:\n            log.warning(\n                f\"Cannot set stroke color because {n} components are specified but only 1 (grayscale), 3 (rgb) and 4 (cmyk) are supported\"\n            )\n\n    def do_scn(self) -> None:\n        \"\"\"Set color for nonstroking operations\"\"\"\n        if self.ncs:\n            n = self.ncs.ncomponents\n        else:\n            if settings.STRICT:\n                raise PDFInterpreterError(\"No colorspace specified!\")\n            n = 1\n\n        if n == 1:\n            gray = self.pop(1)[0]\n            gray_f = safe_float(gray)\n            if gray_f is None:\n                log.warning(\n                    f\"Cannot set gray non-stroke color because {gray!r} is an invalid float value\"\n                )\n            else:\n                self.graphicstate.ncolor = gray_f\n\n        elif n == 3:\n            values = self.pop(3)\n            rgb = safe_rgb(*values)\n\n            if rgb is None:\n                log.warning(\n                    f\"Cannot set RGB non-stroke color because not all values in {values!r} can be parsed as floats\"\n                )\n            else:\n                self.graphicstate.ncolor = rgb\n\n        elif n == 4:\n            values = self.pop(4)\n            cmyk = safe_cmyk(*values)\n\n            if cmyk is None:\n                log.warning(\n                    f\"Cannot set CMYK non-stroke color because not all values in {values!r} can be parsed as floats\"\n                )\n            else:\n                self.graphicstate.ncolor = cmyk\n\n        else:\n            log.warning(\n                f\"Cannot set non-stroke color because {n} components are specified but only 1 (grayscale), 3 (rgb) and 4 (cmyk) are supported\"\n            )\n\n    def 
do_SC(self) -> None:\n        \"\"\"Set color for stroking operations\"\"\"\n        self.do_SCN()\n\n    def do_sc(self) -> None:\n        \"\"\"Set color for nonstroking operations\"\"\"\n        self.do_scn()\n\n    def do_sh(self, name: object) -> None:\n        \"\"\"Paint area defined by shading pattern\"\"\"\n\n    def do_BT(self) -> None:\n        \"\"\"Begin text object\n\n        Initializing the text matrix, Tm, and the text line matrix, Tlm, to\n        the identity matrix. Text objects cannot be nested; a second BT cannot\n        appear before an ET.\n        \"\"\"\n        self.textstate.reset()\n\n    def do_ET(self) -> None:\n        \"\"\"End a text object\"\"\"\n\n    def do_BX(self) -> None:\n        \"\"\"Begin compatibility section\"\"\"\n\n    def do_EX(self) -> None:\n        \"\"\"End compatibility section\"\"\"\n\n    def do_MP(self, tag: PDFStackT) -> None:\n        \"\"\"Define marked-content point\"\"\"\n        if isinstance(tag, PSLiteral):\n            self.device.do_tag(tag)\n        else:\n            log.warning(\n                f\"Cannot define marked-content point because {tag!r} is not a PSLiteral\"\n            )\n\n    def do_DP(self, tag: PDFStackT, props: PDFStackT) -> None:\n        \"\"\"Define marked-content point with property list\"\"\"\n        if isinstance(tag, PSLiteral):\n            self.device.do_tag(tag, props)\n        else:\n            log.warning(\n                f\"Cannot define marked-content point with property list because {tag!r} is not a PSLiteral\"\n            )\n\n    def do_BMC(self, tag: PDFStackT) -> None:\n        \"\"\"Begin marked-content sequence\"\"\"\n        if isinstance(tag, PSLiteral):\n            self.device.begin_tag(tag)\n        else:\n            log.warning(\n                f\"Cannot begin marked-content sequence because {tag!r} is not a PSLiteral\"\n            )\n\n    def do_BDC(self, tag: PDFStackT, props: PDFStackT) -> None:\n        \"\"\"Begin marked-content sequence 
with property list\"\"\"\n        if isinstance(tag, PSLiteral):\n            self.device.begin_tag(tag, props)\n        else:\n            log.warning(\n                f\"Cannot begin marked-content sequence with property list because {tag!r} is not a PSLiteral\"\n            )\n\n    def do_EMC(self) -> None:\n        \"\"\"End marked-content sequence\"\"\"\n        self.device.end_tag()\n\n    def do_Tc(self, space: PDFStackT) -> None:\n        \"\"\"Set character spacing.\n\n        Character spacing is used by the Tj, TJ, and ' operators.\n\n        :param space: a number expressed in unscaled text space units.\n        \"\"\"\n        charspace = safe_float(space)\n        if charspace is None:\n            log.warning(\n                f\"Could not set character spacing because {space!r} is an invalid float value\"\n            )\n        else:\n            self.textstate.charspace = charspace\n\n    def do_Tw(self, space: PDFStackT) -> None:\n        \"\"\"Set the word spacing.\n\n        Word spacing is used by the Tj, TJ, and ' operators.\n\n        :param space: a number expressed in unscaled text space units\n        \"\"\"\n        wordspace = safe_float(space)\n        if wordspace is None:\n            log.warning(\n                f\"Could not set word spacing because {space!r} is an invalid float value\"\n            )\n        else:\n            self.textstate.wordspace = wordspace\n\n    def do_Tz(self, scale: PDFStackT) -> None:\n        \"\"\"Set the horizontal scaling.\n\n        :param scale: is a number specifying the percentage of the normal width\n        \"\"\"\n        scale_f = safe_float(scale)\n\n        if scale_f is None:\n            log.warning(\n                f\"Could not set horizontal scaling because {scale!r} is an invalid float value\"\n            )\n        else:\n            self.textstate.scaling = scale_f\n\n    def do_TL(self, leading: PDFStackT) -> None:\n        \"\"\"Set the text leading.\n\n        Text leading 
is used only by the T*, ', and \" operators.\n\n        :param leading: a number expressed in unscaled text space units\n        \"\"\"\n        leading_f = safe_float(leading)\n        if leading_f is None:\n            log.warning(\n                f\"Could not set text leading because {leading!r} is an invalid float value\"\n            )\n        else:\n            self.textstate.leading = -leading_f\n\n    def do_Tf(self, fontid: PDFStackT, fontsize: PDFStackT) -> None:\n        \"\"\"Set the text font\n\n        :param fontid: the name of a font resource in the Font subdictionary\n            of the current resource dictionary\n        :param fontsize: size is a number representing a scale factor.\n        \"\"\"\n        try:\n            self.textstate.font = self.fontmap[literal_name(fontid)]\n            self.textstate.font_id = literal_name(fontid)\n        except KeyError:\n            if settings.STRICT:\n                raise PDFInterpreterError(\"Undefined Font id: %r\" % fontid)\n            self.textstate.font = self.rsrcmgr.get_font(None, {})\n\n        fontsize_f = safe_float(fontsize)\n        if fontsize_f is None:\n            log.warning(\n                f\"Could not set text font because {fontsize!r} is an invalid float value\"\n            )\n        else:\n            self.textstate.fontsize = fontsize_f\n\n    def do_Tr(self, render: PDFStackT) -> None:\n        \"\"\"Set the text rendering mode\"\"\"\n        render_i = safe_int(render)\n\n        if render_i is None:\n            log.warning(\n                f\"Could not set text rendering mode because {render!r} is an invalid int value\"\n            )\n        else:\n            self.textstate.render = render_i\n\n    def do_Ts(self, rise: PDFStackT) -> None:\n        \"\"\"Set the text rise\n\n        :param rise: a number expressed in unscaled text space units\n        \"\"\"\n        rise_f = safe_float(rise)\n\n        if rise_f is None:\n            log.warning(\n               
 f\"Could not set text rise because {rise!r} is an invalid float value\"\n            )\n        else:\n            self.textstate.rise = rise_f\n\n    def do_Td(self, tx: PDFStackT, ty: PDFStackT) -> None:\n        \"\"\"Move to the start of the next line\n\n        Offset from the start of the current line by (tx , ty).\n        \"\"\"\n        tx_ = safe_float(tx)\n        ty_ = safe_float(ty)\n        if tx_ is not None and ty_ is not None:\n            (a, b, c, d, e, f) = self.textstate.matrix\n            e_new = tx_ * a + ty_ * c + e\n            f_new = tx_ * b + ty_ * d + f\n            self.textstate.matrix = (a, b, c, d, e_new, f_new)\n\n        elif settings.STRICT:\n            raise PDFValueError(f\"Invalid offset ({tx!r}, {ty!r}) for Td\")\n\n        self.textstate.linematrix = (0, 0)\n\n    def do_TD(self, tx: PDFStackT, ty: PDFStackT) -> None:\n        \"\"\"Move to the start of the next line.\n\n        Offset from the start of the current line by (tx , ty). As a side effect, this\n        operator sets the leading parameter in the text state.\n        \"\"\"\n        tx_ = safe_float(tx)\n        ty_ = safe_float(ty)\n\n        if tx_ is not None and ty_ is not None:\n            (a, b, c, d, e, f) = self.textstate.matrix\n            e_new = tx_ * a + ty_ * c + e\n            f_new = tx_ * b + ty_ * d + f\n            self.textstate.matrix = (a, b, c, d, e_new, f_new)\n\n        elif settings.STRICT:\n            raise PDFValueError(f\"Invalid offset ({tx!r}, {ty!r}) for TD\")\n\n        if ty_ is not None:\n            self.textstate.leading = ty_\n\n        self.textstate.linematrix = (0, 0)\n\n    def do_Tm(\n        self,\n        a: PDFStackT,\n        b: PDFStackT,\n        c: PDFStackT,\n        d: PDFStackT,\n        e: PDFStackT,\n        f: PDFStackT,\n    ) -> None:\n        \"\"\"Set text matrix and text line matrix\"\"\"\n        values = (a, b, c, d, e, f)\n        matrix = safe_matrix(*values)\n\n        if matrix is None:\n           
 log.warning(\n                f\"Could not set text matrix because not all values in {values!r} can be parsed as floats\"\n            )\n        else:\n            self.textstate.matrix = matrix\n            self.textstate.linematrix = (0, 0)\n\n    def do_T_a(self) -> None:\n        \"\"\"Move to start of next text line\"\"\"\n        (a, b, c, d, e, f) = self.textstate.matrix\n        self.textstate.matrix = (\n            a,\n            b,\n            c,\n            d,\n            self.textstate.leading * c + e,\n            self.textstate.leading * d + f,\n        )\n        self.textstate.linematrix = (0, 0)\n\n    def do_TJ(self, seq: PDFStackT) -> None:\n        \"\"\"Show text, allowing individual glyph positioning\"\"\"\n        if self.textstate.font is None:\n            if settings.STRICT:\n                raise PDFInterpreterError(\"No font specified!\")\n            return\n        assert self.ncs is not None\n        self.device.render_string(\n            self.textstate,\n            cast(PDFTextSeq, seq),\n            self.ncs,\n            self.graphicstate.copy(),\n        )\n\n    def do_Tj(self, s: PDFStackT) -> None:\n        \"\"\"Show text\"\"\"\n        self.do_TJ([s])\n\n    def do__q(self, s: PDFStackT) -> None:\n        \"\"\"Move to next line and show text\n\n        The ' (single quote) operator.\n        \"\"\"\n        self.do_T_a()\n        self.do_TJ([s])\n\n    def do__w(self, aw: PDFStackT, ac: PDFStackT, s: PDFStackT) -> None:\n        \"\"\"Set word and character spacing, move to next line, and show text\n\n        The \" (double quote) operator.\n        \"\"\"\n        self.do_Tw(aw)\n        self.do_Tc(ac)\n        self.do_TJ([s])\n\n    def do_BI(self) -> None:\n        \"\"\"Begin inline image object\"\"\"\n\n    def do_ID(self) -> None:\n        \"\"\"Begin inline image data\"\"\"\n\n    def do_EI(self, obj: PDFStackT) -> None:\n        \"\"\"End inline image object\"\"\"\n        if isinstance(obj, PDFStream) and 
\"W\" in obj and \"H\" in obj:\n            iobjid = str(id(obj))\n            self.device.begin_figure(iobjid, (0, 0, 1, 1), MATRIX_IDENTITY)\n            self.device.render_image(iobjid, obj)\n            self.device.end_figure(iobjid)\n\n    def do_Do(self, xobjid_arg: PDFStackT) -> None:\n        \"\"\"Invoke named XObject\"\"\"\n        xobjid = literal_name(xobjid_arg)\n        try:\n            xobj = stream_value(self.xobjmap[xobjid])\n        except KeyError:\n            if settings.STRICT:\n                raise PDFInterpreterError(\"Undefined xobject id: %r\" % xobjid)\n            return\n        log.debug(\"Processing xobj: %r\", xobj)\n        subtype = xobj.get(\"Subtype\")\n        if subtype is LITERAL_FORM and \"BBox\" in xobj:\n            interpreter = self.dup()\n            bbox = cast(Rect, list_value(xobj[\"BBox\"]))\n            matrix = cast(Matrix, list_value(xobj.get(\"Matrix\", MATRIX_IDENTITY)))\n            # According to PDF reference 1.7 section 4.9.1, XObjects in\n            # earlier PDFs (prior to v1.2) use the page's Resources entry\n            # instead of having their own Resources entry.\n            xobjres = xobj.get(\"Resources\")\n            if xobjres:\n                resources = dict_value(xobjres)\n            else:\n                resources = self.resources.copy()\n            self.device.begin_figure(xobjid, bbox, matrix)\n            interpreter.render_contents(\n                resources,\n                [xobj],\n                ctm=mult_matrix(matrix, self.ctm),\n            )\n            self.device.end_figure(xobjid)\n        elif subtype is LITERAL_IMAGE and \"Width\" in xobj and \"Height\" in xobj:\n            self.device.begin_figure(xobjid, (0, 0, 1, 1), MATRIX_IDENTITY)\n            self.device.render_image(xobjid, xobj)\n            self.device.end_figure(xobjid)\n        else:\n            # unsupported xobject type.\n            pass\n\n    def process_page(self, page: PDFPage) -> None:\n        
log.debug(\"Processing page: %r\", page)\n        (x0, y0, x1, y1) = page.mediabox\n        if page.rotate == 90:\n            ctm = (0, -1, 1, 0, -y0, x1)\n        elif page.rotate == 180:\n            ctm = (-1, 0, 0, -1, x1, y1)\n        elif page.rotate == 270:\n            ctm = (0, 1, -1, 0, y1, -x0)\n        else:\n            ctm = (1, 0, 0, 1, -x0, -y0)\n        self.device.begin_page(page, ctm)\n        self.render_contents(page.resources, page.contents, ctm=ctm)\n        self.device.end_page(page)\n\n    def render_contents(\n        self,\n        resources: dict[object, object],\n        streams: Sequence[object],\n        ctm: Matrix = MATRIX_IDENTITY,\n    ) -> None:\n        \"\"\"Render the content streams.\n\n        This method may be called recursively.\n        \"\"\"\n        log.debug(\n            \"render_contents: resources=%r, streams=%r, ctm=%r\",\n            resources,\n            streams,\n            ctm,\n        )\n        self.init_resources(resources)\n        self.init_state(ctm)\n        self.execute(list_value(streams))\n\n    def execute(self, streams: Sequence[object]) -> None:\n        try:\n            parser = PDFContentParser(streams)\n        except PSEOF:\n            # empty page\n            return\n        while True:\n            try:\n                (_, obj) = parser.nextobject()\n            except PSEOF:\n                break\n            if isinstance(obj, PSKeyword):\n                name = keyword_name(obj)\n                method = \"do_%s\" % name.replace(\"*\", \"_a\").replace('\"', \"_w\").replace(\n                    \"'\",\n                    \"_q\",\n                )\n                if hasattr(self, method):\n                    func = getattr(self, method)\n                    nargs = func.__code__.co_argcount - 1\n                    if nargs:\n                        args = self.pop(nargs)\n                        log.debug(\"exec: %s %r\", name, args)\n                        if len(args) == 
nargs:\n                            func(*args)\n                    else:\n                        log.debug(\"exec: %s\", name)\n                        func()\n                elif settings.STRICT:\n                    error_msg = \"Unknown operator: %r\" % name\n                    raise PDFInterpreterError(error_msg)\n            else:\n                self.push(obj)\n"
  },
  {
    "path": "babeldoc/pdfminer/pdfpage.py",
    "content": "import itertools\nimport logging\nfrom collections.abc import Container\nfrom collections.abc import Iterator\nfrom typing import Any\nfrom typing import BinaryIO\n\nfrom babeldoc.pdfminer.pdfdocument import PDFDocument\nfrom babeldoc.pdfminer.pdfdocument import PDFNoPageLabels\nfrom babeldoc.pdfminer.pdfdocument import PDFTextExtractionNotAllowed\nfrom babeldoc.pdfminer.pdfexceptions import PDFObjectNotFound\nfrom babeldoc.pdfminer.pdfexceptions import PDFValueError\nfrom babeldoc.pdfminer.pdfparser import PDFParser\nfrom babeldoc.pdfminer.pdftypes import dict_value, PDFObjRef\nfrom babeldoc.pdfminer.pdftypes import int_value\nfrom babeldoc.pdfminer.pdftypes import list_value\nfrom babeldoc.pdfminer.pdftypes import resolve1\nfrom babeldoc.pdfminer.psparser import LIT\nfrom babeldoc.pdfminer.utils import Rect\nfrom babeldoc.pdfminer.utils import parse_rect\nfrom babeldoc.pdfminer import settings\n\nlog = logging.getLogger(__name__)\n\n# some predefined literals and keywords.\nLITERAL_PAGE = LIT(\"Page\")\nLITERAL_PAGES = LIT(\"Pages\")\n\n\nclass PDFPage:\n    \"\"\"An object that holds the information about a page.\n\n    A PDFPage object is merely a convenience class that has a set\n    of keys and values, which describe the properties of a page\n    and point to its contents.\n\n    Attributes\n    ----------\n      doc: a PDFDocument object.\n      pageid: any Python object that can uniquely identify the page.\n      attrs: a dictionary of page attributes.\n      contents: a list of PDFStream objects that represents the page content.\n      lastmod: the last modified time of the page.\n      resources: a dictionary of resources used by the page.\n      mediabox: the physical size of the page.\n      cropbox: the crop rectangle of the page.\n      rotate: the page rotation (in degree).\n      annots: the page annotations.\n      beads: a chain that represents natural reading order.\n      label: the page's label (typically, the logical page 
number).\n\n    \"\"\"\n\n    def __init__(\n        self,\n        doc: PDFDocument,\n        pageid: object,\n        attrs: object,\n        label: str | None,\n    ) -> None:\n        \"\"\"Initialize a page object.\n\n        doc: a PDFDocument object.\n        pageid: any Python object that can uniquely identify the page.\n        attrs: a dictionary of page attributes.\n        label: page label string.\n        \"\"\"\n        self.doc = doc\n        self.pageid = pageid\n        self.attrs = dict_value(attrs)\n        self.label = label\n        self.lastmod = resolve1(self.attrs.get(\"LastModified\"))\n        self.resources: dict[object, object] = resolve1(\n            self.attrs.get(\"Resources\", dict()),\n        )\n        try:\n            while isinstance(attrs[\"MediaBox\"], PDFObjRef):\n                attrs[\"MediaBox\"] = resolve1(attrs[\"MediaBox\"])\n        except Exception:\n            log.exception(f\"try to fix mediabox failed: {attrs}\")\n\n        self.mediabox = self._parse_mediabox(self.attrs.get(\"MediaBox\"))\n        try:\n            self.cropbox = self._parse_cropbox(self.attrs.get(\"CropBox\"), self.mediabox)\n        except Exception:\n            self.cropbox = self.mediabox\n        self.contents = self._parse_contents(self.attrs.get(\"Contents\"))\n\n        self.rotate = (int_value(self.attrs.get(\"Rotate\", 0)) + 360) % 360\n        self.annots = self.attrs.get(\"Annots\")\n        self.beads = self.attrs.get(\"B\")\n\n    def __repr__(self) -> str:\n        return f\"<PDFPage: Resources={self.resources!r}, MediaBox={self.mediabox!r}>\"\n\n    INHERITABLE_ATTRS = {\"Resources\", \"MediaBox\", \"CropBox\", \"Rotate\"}\n\n    @classmethod\n    def create_pages(cls, document: PDFDocument) -> Iterator[\"PDFPage\"]:\n        def depth_first_search(\n            obj: Any,\n            parent: dict[str, Any],\n            visited: set[Any] | None = None,\n        ) -> Iterator[tuple[int, dict[Any, dict[Any, Any]]]]:\n           
 if isinstance(obj, int):\n                object_id = obj\n                object_properties = dict_value(document.getobj(object_id)).copy()\n            else:\n                # This looks broken. obj.objid means obj could be either\n                # PDFObjRef or PDFStream, but neither is valid for dict_value.\n                object_id = obj.objid  # type: ignore[attr-defined]\n                object_properties = dict_value(obj).copy()\n\n            # Avoid recursion errors by keeping track of visited nodes\n            if visited is None:\n                visited = set()\n            if object_id in visited:\n                return\n            visited.add(object_id)\n\n            for k, v in parent.items():\n                if k in cls.INHERITABLE_ATTRS and k not in object_properties:\n                    object_properties[k] = v\n\n            object_type = object_properties.get(\"Type\")\n            if object_type is None and not settings.STRICT:  # See #64\n                object_type = object_properties.get(\"type\")\n\n            if object_type is LITERAL_PAGES and \"Kids\" in object_properties:\n                log.debug(\"Pages: Kids=%r\", object_properties[\"Kids\"])\n                for child in list_value(object_properties[\"Kids\"]):\n                    yield from depth_first_search(child, object_properties, visited)\n\n            elif object_type is LITERAL_PAGE:\n                log.debug(\"Page: %r\", object_properties)\n                yield (object_id, object_properties)\n\n        try:\n            page_labels: Iterator[str | None] = document.get_page_labels()\n        except PDFNoPageLabels:\n            page_labels = itertools.repeat(None)\n\n        pages = False\n        if \"Pages\" in document.catalog:\n            objects = depth_first_search(document.catalog[\"Pages\"], document.catalog)\n            for objid, tree in objects:\n                yield cls(document, objid, tree, next(page_labels))\n                pages = True\n   
     if not pages:\n            # fallback when /Pages is missing.\n            for xref in document.xrefs:\n                for objid in xref.get_objids():\n                    try:\n                        obj = document.getobj(objid)\n                        if isinstance(obj, dict) and obj.get(\"Type\") is LITERAL_PAGE:\n                            yield cls(document, objid, obj, next(page_labels))\n                    except PDFObjectNotFound:\n                        pass\n\n    @classmethod\n    def get_pages(\n        cls,\n        fp: BinaryIO,\n        pagenos: Container[int] | None = None,\n        maxpages: int = 0,\n        password: str = \"\",\n        caching: bool = True,\n        check_extractable: bool = False,\n    ) -> Iterator[\"PDFPage\"]:\n        # Create a PDF parser object associated with the file object.\n        parser = PDFParser(fp)\n        # Create a PDF document object that stores the document structure.\n        doc = PDFDocument(parser, password=password, caching=caching)\n        # Check if the document allows text extraction.\n        # If not, warn the user and proceed.\n        if not doc.is_extractable:\n            if check_extractable:\n                error_msg = \"Text extraction is not allowed: %r\" % fp\n                raise PDFTextExtractionNotAllowed(error_msg)\n            else:\n                warning_msg = (\n                    \"The PDF %r contains a metadata field \"\n                    \"indicating that it should not allow \"\n                    \"text extraction. Ignoring this field \"\n                    \"and proceeding. 
Use the check_extractable \"\n                    \"if you want to raise an error in this case\" % fp\n                )\n                log.warning(warning_msg)\n        # Process each page contained in the document.\n        for pageno, page in enumerate(cls.create_pages(doc)):\n            if pagenos and (pageno not in pagenos):\n                continue\n            yield page\n            if maxpages and maxpages <= pageno + 1:\n                break\n\n    def _parse_mediabox(self, value: Any) -> Rect:\n        us_letter = (0.0, 0.0, 612.0, 792.0)\n\n        if value is None:\n            log.warning(\n                \"MediaBox missing from /Page (and not inherited), \"\n                \"defaulting to US Letter\"\n            )\n            return us_letter\n\n        try:\n            return parse_rect(resolve1(val) for val in resolve1(value))\n\n        except PDFValueError:\n            log.warning(\"Invalid MediaBox in /Page, defaulting to US Letter\")\n            return us_letter\n\n    def _parse_cropbox(self, value: Any, mediabox: Rect) -> Rect:\n        if value is None:\n            # CropBox is optional, and MediaBox is used if not specified.\n            return mediabox\n\n        try:\n            return parse_rect(resolve1(val) for val in resolve1(value))\n\n        except PDFValueError:\n            log.warning(\"Invalid CropBox in /Page, defaulting to MediaBox\")\n            return mediabox\n\n    def _parse_contents(self, value: Any) -> list[Any]:\n        contents: list[Any] = []\n        if value is not None:\n            contents = resolve1(value)\n            if not isinstance(contents, list):\n                contents = [contents]\n        return contents\n"
  },
  {
    "path": "babeldoc/pdfminer/pdfparser.py",
    "content": "import logging\nfrom io import BytesIO\nfrom typing import TYPE_CHECKING\nfrom typing import BinaryIO\nfrom typing import Union\n\nfrom babeldoc.pdfminer.casting import safe_int\nfrom babeldoc.pdfminer.pdfexceptions import PDFException\nfrom babeldoc.pdfminer.pdftypes import PDFObjRef\nfrom babeldoc.pdfminer.pdftypes import PDFStream\nfrom babeldoc.pdfminer.pdftypes import dict_value\nfrom babeldoc.pdfminer.pdftypes import int_value\nfrom babeldoc.pdfminer.psexceptions import PSEOF\nfrom babeldoc.pdfminer.psparser import KWD\nfrom babeldoc.pdfminer.psparser import PSKeyword\nfrom babeldoc.pdfminer.psparser import PSStackParser\nfrom babeldoc.pdfminer import settings\n\nif TYPE_CHECKING:\n    from babeldoc.pdfminer.pdfdocument import PDFDocument\n\nlog = logging.getLogger(__name__)\n\n\nclass PDFSyntaxError(PDFException):\n    pass\n\n\n# PDFParser stack holds all the base types plus PDFStream, PDFObjRef, and None\nclass PDFParser(PSStackParser[Union[PSKeyword, PDFStream, PDFObjRef, None]]):\n    \"\"\"PDFParser fetches PDF objects from a file stream.\n    It can handle indirect references by referring to\n    a PDF document set by the set_document method.\n    It also reads XRefs at the end of every PDF file.\n\n    Typical usage:\n      parser = PDFParser(fp)\n      parser.read_xref()\n      parser.read_xref(fallback=True) # optional\n      parser.set_document(doc)\n      parser.seek(offset)\n      parser.nextobject()\n\n    \"\"\"\n\n    def __init__(self, fp: BinaryIO) -> None:\n        PSStackParser.__init__(self, fp)\n        self.doc: PDFDocument | None = None\n        self.fallback = False\n\n    def set_document(self, doc: \"PDFDocument\") -> None:\n        \"\"\"Associates the parser with a PDFDocument object.\"\"\"\n        self.doc = doc\n\n    KEYWORD_R = KWD(b\"R\")\n    KEYWORD_NULL = KWD(b\"null\")\n    KEYWORD_ENDOBJ = KWD(b\"endobj\")\n    KEYWORD_STREAM = KWD(b\"stream\")\n    KEYWORD_XREF = KWD(b\"xref\")\n    KEYWORD_STARTXREF = 
KWD(b\"startxref\")\n\n    def do_keyword(self, pos: int, token: PSKeyword) -> None:\n        \"\"\"Handles PDF-related keywords.\"\"\"\n        if token in (self.KEYWORD_XREF, self.KEYWORD_STARTXREF):\n            self.add_results(*self.pop(1))\n\n        elif token is self.KEYWORD_ENDOBJ:\n            self.add_results(*self.pop(4))\n\n        elif token is self.KEYWORD_NULL:\n            # null object\n            self.push((pos, None))\n\n        elif token is self.KEYWORD_R:\n            # reference to indirect object\n            if len(self.curstack) >= 2:\n                (_, _object_id), _ = self.pop(2)\n                object_id = safe_int(_object_id)\n                if object_id is not None:\n                    obj = PDFObjRef(self.doc, object_id)\n                    self.push((pos, obj))\n\n        elif token is self.KEYWORD_STREAM:\n            # stream object\n            ((_, dic),) = self.pop(1)\n            dic = dict_value(dic)\n            objlen = 0\n            if not self.fallback:\n                try:\n                    objlen = int_value(dic[\"Length\"])\n                except KeyError:\n                    if settings.STRICT:\n                        raise PDFSyntaxError(\"/Length is undefined: %r\" % dic)\n            self.seek(pos)\n            try:\n                (_, line) = self.nextline()  # 'stream'\n            except PSEOF:\n                if settings.STRICT:\n                    raise PDFSyntaxError(\"Unexpected EOF\")\n                return\n            pos += len(line)\n            self.fp.seek(pos)\n            data = bytearray(self.fp.read(objlen))\n            self.seek(pos + objlen)\n            while 1:\n                try:\n                    (linepos, line) = self.nextline()\n                except PSEOF:\n                    if settings.STRICT:\n                        raise PDFSyntaxError(\"Unexpected EOF\")\n                    break\n                if b\"endstream\" in line:\n                    i = 
line.index(b\"endstream\")\n                    objlen += i\n                    if self.fallback:\n                        data += line[:i]\n                    break\n                objlen += len(line)\n                if self.fallback:\n                    data += line\n            self.seek(pos + objlen)\n            # XXX limit objlen not to exceed object boundary\n            log.debug(\n                \"Stream: pos=%d, objlen=%d, dic=%r, data=%r...\",\n                pos,\n                objlen,\n                dic,\n                data[:10],\n            )\n            assert self.doc is not None\n            stream = PDFStream(dic, bytes(data), self.doc.decipher)\n            self.push((pos, stream))\n\n        else:\n            # others\n            self.push((pos, token))\n\n\nclass PDFStreamParser(PDFParser):\n    \"\"\"PDFStreamParser is used to parse the PDF content streams\n    that are contained in each page and hold the instructions\n    for rendering the page. A reference to a PDF document is\n    needed because a PDF content stream can also have\n    indirect references to other objects in the same document.\n    \"\"\"\n\n    def __init__(self, data: bytes) -> None:\n        PDFParser.__init__(self, BytesIO(data))\n\n    def flush(self) -> None:\n        self.add_results(*self.popall())\n\n    KEYWORD_OBJ = KWD(b\"obj\")\n\n    def do_keyword(self, pos: int, token: PSKeyword) -> None:\n        if token is self.KEYWORD_R:\n            # reference to indirect object\n            (_, _object_id), _ = self.pop(2)\n            object_id = safe_int(_object_id)\n            if object_id is not None:\n                obj = PDFObjRef(self.doc, object_id)\n                self.push((pos, obj))\n            return\n\n        elif token in (self.KEYWORD_OBJ, self.KEYWORD_ENDOBJ):\n            if settings.STRICT:\n                # See PDF Spec 3.4.6: Only the object values are stored in the\n                # stream; the obj and endobj keywords are not used.\n 
               raise PDFSyntaxError(\"Keyword obj or endobj found in stream\")\n            return\n\n        # others\n        self.push((pos, token))\n"
  },
  {
    "path": "babeldoc/pdfminer/pdftypes.py",
    "content": "import io\nimport logging\nimport zlib\nfrom collections.abc import Iterable\nfrom typing import TYPE_CHECKING\nfrom typing import Any\nfrom typing import Optional\nfrom typing import Protocol\nfrom typing import cast\nfrom warnings import warn\n\nfrom babeldoc.pdfminer.ascii85 import ascii85decode\nfrom babeldoc.pdfminer.ascii85 import asciihexdecode\nfrom babeldoc.pdfminer.ccitt import ccittfaxdecode\nfrom babeldoc.pdfminer.lzw import lzwdecode\nfrom babeldoc.pdfminer.psparser import LIT\nfrom babeldoc.pdfminer.psparser import PSObject\nfrom babeldoc.pdfminer.runlength import rldecode\nfrom babeldoc.pdfminer.utils import apply_png_predictor\nfrom babeldoc.pdfminer import pdfexceptions\nfrom babeldoc.pdfminer import settings\n\nif TYPE_CHECKING:\n    from babeldoc.pdfminer.pdfdocument import PDFDocument\n\nlogger = logging.getLogger(__name__)\n\nLITERAL_CRYPT = LIT(\"Crypt\")\n\n# Abbreviation of Filter names in PDF 4.8.6. \"Inline Images\"\nLITERALS_FLATE_DECODE = (LIT(\"FlateDecode\"), LIT(\"Fl\"))\nLITERALS_LZW_DECODE = (LIT(\"LZWDecode\"), LIT(\"LZW\"))\nLITERALS_ASCII85_DECODE = (LIT(\"ASCII85Decode\"), LIT(\"A85\"))\nLITERALS_ASCIIHEX_DECODE = (LIT(\"ASCIIHexDecode\"), LIT(\"AHx\"))\nLITERALS_RUNLENGTH_DECODE = (LIT(\"RunLengthDecode\"), LIT(\"RL\"))\nLITERALS_CCITTFAX_DECODE = (LIT(\"CCITTFaxDecode\"), LIT(\"CCF\"))\nLITERALS_DCT_DECODE = (LIT(\"DCTDecode\"), LIT(\"DCT\"))\nLITERALS_JBIG2_DECODE = (LIT(\"JBIG2Decode\"),)\nLITERALS_JPX_DECODE = (LIT(\"JPXDecode\"),)\n\n\nclass DecipherCallable(Protocol):\n    \"\"\"A fully typed decipher callback, with an optional parameter.\"\"\"\n\n    def __call__(\n        self,\n        objid: int,\n        genno: int,\n        data: bytes,\n        attrs: dict[str, Any] | None = None,\n    ) -> bytes:\n        raise NotImplementedError\n\n\nclass PDFObject(PSObject):\n    pass\n\n\n# Adding aliases for these exceptions for backwards compatibility\nPDFException = pdfexceptions.PDFException\nPDFTypeError = 
pdfexceptions.PDFTypeError\nPDFValueError = pdfexceptions.PDFValueError\nPDFObjectNotFound = pdfexceptions.PDFObjectNotFound\nPDFNotImplementedError = pdfexceptions.PDFNotImplementedError\n\n_DEFAULT = object()\n\n\nclass PDFObjRef(PDFObject):\n    def __init__(\n        self,\n        doc: Optional[\"PDFDocument\"],\n        objid: int,\n        _: Any = _DEFAULT,\n    ) -> None:\n        \"\"\"Reference to a PDF object.\n\n        :param doc: The PDF document.\n        :param objid: The object number.\n        :param _: Unused argument for backwards compatibility.\n        \"\"\"\n        if _ is not _DEFAULT:\n            warn(\n                \"The third argument of PDFObjRef is unused and will be removed after \"\n                \"2024\",\n                DeprecationWarning,\n            )\n\n        if objid == 0:\n            if settings.STRICT:\n                raise PDFValueError(\"PDF object id cannot be 0.\")\n\n        self.doc = doc\n        self.objid = objid\n\n    def __repr__(self) -> str:\n        return \"<PDFObjRef:%d>\" % (self.objid)\n\n    def resolve(self, default: object = None) -> Any:\n        assert self.doc is not None\n        try:\n            return self.doc.getobj(self.objid)\n        except PDFObjectNotFound:\n            return default\n\n\ndef resolve1(x: object, default: object = None) -> Any:\n    \"\"\"Resolves an object.\n\n    If this is an array or dictionary, it may still contain\n    some indirect objects inside.\n    \"\"\"\n    while isinstance(x, PDFObjRef):\n        x = x.resolve(default=default)\n    return x\n\n\ndef resolve_all(x: object, default: object = None) -> Any:\n    \"\"\"Recursively resolves the given object and all the internals.\n\n    Ensures that no indirect reference remains within the nested object.\n    This procedure might be slow.\n    \"\"\"\n    while isinstance(x, PDFObjRef):\n        x = x.resolve(default=default)\n    if isinstance(x, list):\n        x = [resolve_all(v, default=default) for 
v in x]\n    elif isinstance(x, dict):\n        for k, v in x.items():\n            x[k] = resolve_all(v, default=default)\n    return x\n\n\ndef decipher_all(decipher: DecipherCallable, objid: int, genno: int, x: object) -> Any:\n    \"\"\"Recursively deciphers the given object.\"\"\"\n    if isinstance(x, bytes):\n        if len(x) == 0:\n            return x\n        return decipher(objid, genno, x)\n    if isinstance(x, list):\n        x = [decipher_all(decipher, objid, genno, v) for v in x]\n    elif isinstance(x, dict):\n        for k, v in x.items():\n            x[k] = decipher_all(decipher, objid, genno, v)\n    return x\n\n\ndef int_value(x: object) -> int:\n    x = resolve1(x)\n    if not isinstance(x, int):\n        if settings.STRICT:\n            raise PDFTypeError(\"Integer required: %r\" % x)\n        return 0\n    return x\n\n\ndef float_value(x: object) -> float:\n    x = resolve1(x)\n    if not isinstance(x, float):\n        if settings.STRICT:\n            raise PDFTypeError(\"Float required: %r\" % x)\n        return 0.0\n    return x\n\n\ndef num_value(x: object) -> float:\n    x = resolve1(x)\n    if not isinstance(x, (int, float)):  # == utils.isnumber(x)\n        if settings.STRICT:\n            raise PDFTypeError(\"Int or Float required: %r\" % x)\n        return 0\n    return x\n\n\ndef uint_value(x: object, n_bits: int) -> int:\n    \"\"\"Resolve number and interpret it as a two's-complement unsigned number\"\"\"\n    xi = int_value(x)\n    if xi >= 0:\n        return xi\n    else:\n        return xi + cast(int, 2**n_bits)\n\n\ndef str_value(x: object) -> bytes:\n    x = resolve1(x)\n    if not isinstance(x, bytes):\n        if settings.STRICT:\n            raise PDFTypeError(\"String required: %r\" % x)\n        return b\"\"\n    return x\n\n\ndef list_value(x: object) -> list[Any] | tuple[Any, ...]:\n    x = resolve1(x)\n    if not isinstance(x, (list, tuple)):\n        if settings.STRICT:\n            raise PDFTypeError(\"List 
required: %r\" % x)\n        return []\n    return x\n\n\ndef dict_value(x: object) -> dict[Any, Any]:\n    x = resolve1(x)\n    if not isinstance(x, dict):\n        if settings.STRICT:\n            logger.error(\"PDFTypeError : Dict required: %r\", x)\n            raise PDFTypeError(\"Dict required: %r\" % x)\n        return {}\n    return x\n\n\ndef stream_value(x: object) -> \"PDFStream\":\n    x = resolve1(x)\n    if not isinstance(x, PDFStream):\n        if settings.STRICT:\n            raise PDFTypeError(\"PDFStream required: %r\" % x)\n        return PDFStream({}, b\"\")\n    return x\n\n\ndef decompress_corrupted(data: bytes) -> bytes:\n    \"\"\"Called on some data that can't be properly decoded because of CRC checksum\n    error. Attempt to decode it skipping the CRC.\n    \"\"\"\n    d = zlib.decompressobj()\n    f = io.BytesIO(data)\n    result_str = b\"\"\n    buffer = f.read(1)\n    i = 0\n    try:\n        while buffer:\n            result_str += d.decompress(buffer)\n            buffer = f.read(1)\n            i += 1\n    except zlib.error:\n        # Warn only if the failure happened before the trailing checksum bytes\n        if i < len(data) - 3:\n            logger.warning(\"Data-loss while decompressing corrupted data\")\n    return result_str\n\n\nclass PDFStream(PDFObject):\n    def __init__(\n        self,\n        attrs: dict[str, Any],\n        rawdata: bytes,\n        decipher: DecipherCallable | None = None,\n    ) -> None:\n        assert isinstance(attrs, dict), str(type(attrs))\n        self.attrs = attrs\n        self.rawdata: bytes | None = rawdata\n        self.decipher = decipher\n        self.data: bytes | None = None\n        self.objid: int | None = None\n        self.genno: int | None = None\n\n    def set_objid(self, objid: int, genno: int) -> None:\n        self.objid = objid\n        self.genno = genno\n\n    def __repr__(self) -> str:\n        if self.data is None:\n            assert self.rawdata is not None\n            return 
\"<PDFStream(%r): raw=%d, %r>\" % (\n                self.objid,\n                len(self.rawdata),\n                self.attrs,\n            )\n        else:\n            assert self.data is not None\n            return \"<PDFStream(%r): len=%d, %r>\" % (\n                self.objid,\n                len(self.data),\n                self.attrs,\n            )\n\n    def __contains__(self, name: object) -> bool:\n        return name in self.attrs\n\n    def __getitem__(self, name: str) -> Any:\n        return self.attrs[name]\n\n    def get(self, name: str, default: object = None) -> Any:\n        return self.attrs.get(name, default)\n\n    def get_any(self, names: Iterable[str], default: object = None) -> Any:\n        for name in names:\n            if name in self.attrs:\n                return self.attrs[name]\n        return default\n\n    def get_filters(self) -> list[tuple[Any, Any]]:\n        filters = resolve1(self.get_any((\"F\", \"Filter\"), []))\n        params = resolve1(self.get_any((\"DP\", \"DecodeParms\", \"FDecodeParms\"), {}))\n        if not filters:\n            return []\n        if not isinstance(filters, list):\n            filters = [filters]\n        if not isinstance(params, list):\n            # Make sure the parameters list is the same as filters.\n            params = [params] * len(filters)\n        if settings.STRICT and len(params) != len(filters):\n            raise PDFException(\"Parameters len filter mismatch\")\n\n        resolved_filters = [resolve1(f) for f in filters]\n        resolved_params = [resolve1(param) for param in params]\n        return list(zip(resolved_filters, resolved_params, strict=False))\n\n    def decode(self) -> None:\n        assert self.data is None and self.rawdata is not None, str(\n            (self.data, self.rawdata),\n        )\n        data = self.rawdata\n        if self.decipher:\n            # Handle encryption\n            assert self.objid is not None\n            assert self.genno is not 
None\n            data = self.decipher(self.objid, self.genno, data, self.attrs)\n        filters = self.get_filters()\n        if not filters:\n            self.data = data\n            self.rawdata = None\n            return\n        for f, params in filters:\n            if f in LITERALS_FLATE_DECODE:\n                # will get errors if the document is encrypted.\n                try:\n                    data = zlib.decompress(data)\n\n                except zlib.error as e:\n                    if settings.STRICT:\n                        error_msg = f\"Invalid zlib bytes: {e!r}, {data!r}\"\n                        raise PDFException(error_msg)\n\n                    try:\n                        data = decompress_corrupted(data)\n                    except zlib.error:\n                        data = b\"\"\n\n            elif f in LITERALS_LZW_DECODE:\n                data = lzwdecode(data)\n            elif f in LITERALS_ASCII85_DECODE:\n                data = ascii85decode(data)\n            elif f in LITERALS_ASCIIHEX_DECODE:\n                data = asciihexdecode(data)\n            elif f in LITERALS_RUNLENGTH_DECODE:\n                data = rldecode(data)\n            elif f in LITERALS_CCITTFAX_DECODE:\n                data = ccittfaxdecode(data, params)\n            elif f in LITERALS_DCT_DECODE:\n                # This is probably a JPG stream\n                # it does not need to be decoded twice.\n                # Just return the stream to the user.\n                pass\n            elif f in LITERALS_JBIG2_DECODE or f in LITERALS_JPX_DECODE:\n                pass\n            elif f == LITERAL_CRYPT:\n                # not yet..\n                raise PDFNotImplementedError(\"/Crypt filter is unsupported\")\n            else:\n                raise PDFNotImplementedError(\"Unsupported filter: %r\" % f)\n            # apply predictors\n            if params and \"Predictor\" in params:\n                pred = int_value(params[\"Predictor\"])\n   
             if pred == 1:\n                    # no predictor\n                    pass\n                elif pred >= 10:\n                    # PNG predictor\n                    colors = int_value(params.get(\"Colors\", 1))\n                    columns = int_value(params.get(\"Columns\", 1))\n                    raw_bits_per_component = params.get(\"BitsPerComponent\", 8)\n                    bitspercomponent = int_value(raw_bits_per_component)\n                    data = apply_png_predictor(\n                        pred,\n                        colors,\n                        columns,\n                        bitspercomponent,\n                        data,\n                    )\n                else:\n                    error_msg = \"Unsupported predictor: %r\" % pred\n                    raise PDFNotImplementedError(error_msg)\n        self.data = data\n        self.rawdata = None\n\n    def get_data(self) -> bytes:\n        if self.data is None:\n            self.decode()\n            assert self.data is not None\n        return self.data\n\n    def get_rawdata(self) -> bytes | None:\n        return self.rawdata\n"
  },
  {
    "path": "babeldoc/pdfminer/psexceptions.py",
    "content": "class PSException(Exception):\n    pass\n\n\nclass PSEOF(PSException):\n    pass\n\n\nclass PSSyntaxError(PSException):\n    pass\n\n\nclass PSTypeError(PSException):\n    pass\n\n\nclass PSValueError(PSException):\n    pass\n"
  },
  {
    "path": "babeldoc/pdfminer/psparser.py",
    "content": "#!/usr/bin/env python3\nimport io\nimport logging\nimport re\nfrom collections.abc import Iterator\nfrom typing import Any\nfrom typing import BinaryIO\nfrom typing import Generic\nfrom typing import TypeVar\nfrom typing import Union\n\nfrom babeldoc.pdfminer.utils import choplist\nfrom babeldoc.pdfminer import psexceptions\nfrom babeldoc.pdfminer import settings\n\nlog = logging.getLogger(__name__)\n\n\n# Adding aliases for these exceptions for backwards compatibility\nPSException = psexceptions.PSException\nPSEOF = psexceptions.PSEOF\nPSSyntaxError = psexceptions.PSSyntaxError\nPSTypeError = psexceptions.PSTypeError\nPSValueError = psexceptions.PSValueError\n\n\nclass PSObject:\n    \"\"\"Base class for all PS or PDF-related data types.\"\"\"\n\n\nclass PSLiteral(PSObject):\n    \"\"\"A class that represents a PostScript literal.\n\n    Postscript literals are used as identifiers, such as\n    variable names, property names and dictionary keys.\n    Literals are case sensitive and denoted by a preceding\n    slash sign (e.g. 
\"/Name\")\n\n    Note: Do not create an instance of PSLiteral directly.\n    Always use PSLiteralTable.intern().\n    \"\"\"\n\n    NameType = Union[str, bytes]\n\n    def __init__(self, name: NameType) -> None:\n        self.name = name\n\n    def __repr__(self) -> str:\n        name = self.name\n        return \"/%r\" % name\n\n\nclass PSKeyword(PSObject):\n    \"\"\"A class that represents a PostScript keyword.\n\n    PostScript keywords are a dozen of predefined words.\n    Commands and directives in PostScript are expressed by keywords.\n    They are also used to denote the content boundaries.\n\n    Note: Do not create an instance of PSKeyword directly.\n    Always use PSKeywordTable.intern().\n    \"\"\"\n\n    def __init__(self, name: bytes) -> None:\n        self.name = name\n\n    def __repr__(self) -> str:\n        name = self.name\n        return \"/%r\" % name\n\n\n_SymbolT = TypeVar(\"_SymbolT\", PSLiteral, PSKeyword)\n\n\nclass PSSymbolTable(Generic[_SymbolT]):\n    \"\"\"A utility class for storing PSLiteral/PSKeyword objects.\n\n    Interned objects can be checked its identity with \"is\" operator.\n    \"\"\"\n\n    def __init__(self, klass: type[_SymbolT]) -> None:\n        self.dict: dict[PSLiteral.NameType, _SymbolT] = {}\n        self.klass: type[_SymbolT] = klass\n\n    def intern(self, name: PSLiteral.NameType) -> _SymbolT:\n        if name in self.dict:\n            lit = self.dict[name]\n        else:\n            # Type confusion issue: PSKeyword always takes bytes as name\n            #                       PSLiteral uses either str or bytes\n            lit = self.klass(name)  # type: ignore[arg-type]\n            self.dict[name] = lit\n        return lit\n\n\nPSLiteralTable = PSSymbolTable(PSLiteral)\nPSKeywordTable = PSSymbolTable(PSKeyword)\nLIT = PSLiteralTable.intern\nKWD = PSKeywordTable.intern\nKEYWORD_PROC_BEGIN = KWD(b\"{\")\nKEYWORD_PROC_END = KWD(b\"}\")\nKEYWORD_ARRAY_BEGIN = KWD(b\"[\")\nKEYWORD_ARRAY_END = 
KWD(b\"]\")\nKEYWORD_DICT_BEGIN = KWD(b\"<<\")\nKEYWORD_DICT_END = KWD(b\">>\")\n\n\ndef literal_name(x: Any) -> str:\n    if isinstance(x, PSLiteral):\n        if isinstance(x.name, str):\n            return x.name\n        try:\n            return str(x.name, \"utf-8\")\n        except UnicodeDecodeError:\n            return str(x.name)\n    else:\n        if settings.STRICT:\n            raise PSTypeError(f\"Literal required: {x!r}\")\n        return str(x)\n\n\ndef keyword_name(x: Any) -> Any:\n    if not isinstance(x, PSKeyword):\n        if settings.STRICT:\n            raise PSTypeError(\"Keyword required: %r\" % x)\n        else:\n            name = x\n    else:\n        name = str(x.name, \"utf-8\", \"ignore\")\n    return name\n\n\nEOL = re.compile(rb\"[\\r\\n]\")\nSPC = re.compile(rb\"\\s\")\nNONSPC = re.compile(rb\"\\S\")\nHEX = re.compile(rb\"[0-9a-fA-F]\")\nEND_LITERAL = re.compile(rb\"[#/%\\[\\]()<>{}\\s]\")\nEND_HEX_STRING = re.compile(rb\"[^\\s0-9a-fA-F]\")\nHEX_PAIR = re.compile(rb\"[0-9a-fA-F]{2}|.\")\nEND_NUMBER = re.compile(rb\"[^0-9]\")\nEND_KEYWORD = re.compile(rb\"[#/%\\[\\]()<>{}\\s]\")\nEND_STRING = re.compile(rb\"[()\\134]\")\nOCT_STRING = re.compile(rb\"[0-7]\")\nESC_STRING = {\n    b\"b\": 8,\n    b\"t\": 9,\n    b\"n\": 10,\n    b\"f\": 12,\n    b\"r\": 13,\n    b\"(\": 40,\n    b\")\": 41,\n    b\"\\\\\": 92,\n}\n\n\nPSBaseParserToken = Union[float, bool, PSLiteral, PSKeyword, bytes]\n\n\nclass PSBaseParser:\n    \"\"\"Most basic PostScript parser that performs only tokenization.\"\"\"\n\n    BUFSIZ = 4096\n\n    def __init__(self, fp: BinaryIO) -> None:\n        self.fp = fp\n        self.eof = False\n        self.seek(0)\n\n    def __repr__(self) -> str:\n        return \"<%s: %r, bufpos=%d>\" % (self.__class__.__name__, self.fp, self.bufpos)\n\n    def flush(self) -> None:\n        pass\n\n    def close(self) -> None:\n        self.flush()\n\n    def tell(self) -> int:\n        return self.bufpos + self.charpos\n\n    def 
poll(self, pos: int | None = None, n: int = 80) -> None:\n        pos0 = self.fp.tell()\n        if not pos:\n            pos = self.bufpos + self.charpos\n        self.fp.seek(pos)\n        log.debug(\"poll(%d): %r\", pos, self.fp.read(n))\n        self.fp.seek(pos0)\n\n    def seek(self, pos: int) -> None:\n        \"\"\"Seeks the parser to the given position.\"\"\"\n        log.debug(\"seek: %r\", pos)\n        self.fp.seek(pos)\n        # reset the status for nextline()\n        self.bufpos = pos\n        self.buf = b\"\"\n        self.charpos = 0\n        # reset the status for nexttoken()\n        self._parse1 = self._parse_main\n        self._curtoken = b\"\"\n        self._curtokenpos = 0\n        self._tokens: list[tuple[int, PSBaseParserToken]] = []\n        self.eof = False\n\n    def fillbuf(self) -> None:\n        if self.charpos < len(self.buf):\n            return\n        # fetch next chunk.\n        self.bufpos = self.fp.tell()\n        self.buf = self.fp.read(self.BUFSIZ)\n        if not self.buf:\n            raise PSEOF(\"Unexpected EOF\")\n        self.charpos = 0\n\n    def nextline(self) -> tuple[int, bytes]:\n        \"\"\"Fetches a next line that ends either with \\\\r or \\\\n.\"\"\"\n        linebuf = b\"\"\n        linepos = self.bufpos + self.charpos\n        eol = False\n        while 1:\n            self.fillbuf()\n            if eol:\n                c = self.buf[self.charpos : self.charpos + 1]\n                # handle b'\\r\\n'\n                if c == b\"\\n\":\n                    linebuf += c\n                    self.charpos += 1\n                break\n            m = EOL.search(self.buf, self.charpos)\n            if m:\n                linebuf += self.buf[self.charpos : m.end(0)]\n                self.charpos = m.end(0)\n                if linebuf[-1:] == b\"\\r\":\n                    eol = True\n                else:\n                    break\n            else:\n                linebuf += self.buf[self.charpos :]\n       
         self.charpos = len(self.buf)\n        log.debug(\"nextline: %r, %r\", linepos, linebuf)\n\n        return (linepos, linebuf)\n\n    def revreadlines(self) -> Iterator[bytes]:\n        \"\"\"Fetches a next line backword.\n\n        This is used to locate the trailers at the end of a file.\n        \"\"\"\n        self.fp.seek(0, io.SEEK_END)\n        pos = self.fp.tell()\n        buf = b\"\"\n        while pos > 0:\n            prevpos = pos\n            pos = max(0, pos - self.BUFSIZ)\n            self.fp.seek(pos)\n            s = self.fp.read(prevpos - pos)\n            if not s:\n                break\n            while 1:\n                n = max(s.rfind(b\"\\r\"), s.rfind(b\"\\n\"))\n                if n == -1:\n                    buf = s + buf\n                    break\n                yield s[n:] + buf\n                s = s[:n]\n                buf = b\"\"\n\n    def _parse_main(self, s: bytes, i: int) -> int:\n        m = NONSPC.search(s, i)\n        if not m:\n            return len(s)\n        j = m.start(0)\n        c = s[j : j + 1]\n        self._curtokenpos = self.bufpos + j\n        if c == b\"%\":\n            self._curtoken = b\"%\"\n            self._parse1 = self._parse_comment\n            return j + 1\n        elif c == b\"/\":\n            self._curtoken = b\"\"\n            self._parse1 = self._parse_literal\n            return j + 1\n        elif c in b\"-+\" or c.isdigit():\n            self._curtoken = c\n            self._parse1 = self._parse_number\n            return j + 1\n        elif c == b\".\":\n            self._curtoken = c\n            self._parse1 = self._parse_float\n            return j + 1\n        elif c.isalpha():\n            self._curtoken = c\n            self._parse1 = self._parse_keyword\n            return j + 1\n        elif c == b\"(\":\n            self._curtoken = b\"\"\n            self.paren = 1\n            self._parse1 = self._parse_string\n            return j + 1\n        elif c == b\"<\":\n      
      self._curtoken = b\"\"\n            self._parse1 = self._parse_wopen\n            return j + 1\n        elif c == b\">\":\n            self._curtoken = b\"\"\n            self._parse1 = self._parse_wclose\n            return j + 1\n        elif c == b\"\\x00\":\n            return j + 1\n        else:\n            self._add_token(KWD(c))\n            return j + 1\n\n    def _add_token(self, obj: PSBaseParserToken) -> None:\n        self._tokens.append((self._curtokenpos, obj))\n\n    def _parse_comment(self, s: bytes, i: int) -> int:\n        m = EOL.search(s, i)\n        if not m:\n            self._curtoken += s[i:]\n            return len(s)\n        j = m.start(0)\n        self._curtoken += s[i:j]\n        self._parse1 = self._parse_main\n        # We ignore comments.\n        # self._tokens.append(self._curtoken)\n        return j\n\n    def _parse_literal(self, s: bytes, i: int) -> int:\n        m = END_LITERAL.search(s, i)\n        if not m:\n            self._curtoken += s[i:]\n            return len(s)\n        j = m.start(0)\n        self._curtoken += s[i:j]\n        c = s[j : j + 1]\n        if c == b\"#\":\n            self.hex = b\"\"\n            self._parse1 = self._parse_literal_hex\n            return j + 1\n        try:\n            name: str | bytes = str(self._curtoken, \"utf-8\")\n        except Exception:\n            name = self._curtoken\n        self._add_token(LIT(name))\n        self._parse1 = self._parse_main\n        return j\n\n    def _parse_literal_hex(self, s: bytes, i: int) -> int:\n        c = s[i : i + 1]\n        if HEX.match(c) and len(self.hex) < 2:\n            self.hex += c\n            return i + 1\n        if self.hex:\n            self._curtoken += bytes((int(self.hex, 16),))\n        self._parse1 = self._parse_literal\n        return i\n\n    def _parse_number(self, s: bytes, i: int) -> int:\n        m = END_NUMBER.search(s, i)\n        if not m:\n            self._curtoken += s[i:]\n            return len(s)\n     
   j = m.start(0)\n        self._curtoken += s[i:j]\n        c = s[j : j + 1]\n        if c == b\".\":\n            self._curtoken += c\n            self._parse1 = self._parse_float\n            return j + 1\n        try:\n            self._add_token(int(self._curtoken))\n        except ValueError:\n            pass\n        self._parse1 = self._parse_main\n        return j\n\n    def _parse_float(self, s: bytes, i: int) -> int:\n        m = END_NUMBER.search(s, i)\n        if not m:\n            self._curtoken += s[i:]\n            return len(s)\n        j = m.start(0)\n        self._curtoken += s[i:j]\n        try:\n            self._add_token(float(self._curtoken))\n        except ValueError:\n            pass\n        self._parse1 = self._parse_main\n        return j\n\n    def _parse_keyword(self, s: bytes, i: int) -> int:\n        m = END_KEYWORD.search(s, i)\n        if m:\n            j = m.start(0)\n            self._curtoken += s[i:j]\n        else:\n            self._curtoken += s[i:]\n            return len(s)\n        if self._curtoken == b\"true\":\n            token: bool | PSKeyword = True\n        elif self._curtoken == b\"false\":\n            token = False\n        else:\n            token = KWD(self._curtoken)\n        self._add_token(token)\n        self._parse1 = self._parse_main\n        return j\n\n    def _parse_string(self, s: bytes, i: int) -> int:\n        m = END_STRING.search(s, i)\n        if not m:\n            self._curtoken += s[i:]\n            return len(s)\n        j = m.start(0)\n        self._curtoken += s[i:j]\n        c = s[j : j + 1]\n        if c == b\"\\\\\":\n            self.oct = b\"\"\n            self._parse1 = self._parse_string_1\n            return j + 1\n        if c == b\"(\":\n            self.paren += 1\n            self._curtoken += c\n            return j + 1\n        if c == b\")\":\n            self.paren -= 1\n            if self.paren:\n                # WTF, they said balanced parens need no special 
treatment.\n                self._curtoken += c\n                return j + 1\n        self._add_token(self._curtoken)\n        self._parse1 = self._parse_main\n        return j + 1\n\n    def _parse_string_1(self, s: bytes, i: int) -> int:\n        \"\"\"Parse literal strings\n\n        PDF Reference 3.2.3\n        \"\"\"\n        c = s[i : i + 1]\n        if OCT_STRING.match(c) and len(self.oct) < 3:\n            self.oct += c\n            return i + 1\n\n        elif self.oct:\n            chrcode = int(self.oct, 8)\n            assert chrcode < 256, \"Invalid octal %s (%d)\" % (repr(self.oct), chrcode)\n            self._curtoken += bytes((chrcode,))\n            self._parse1 = self._parse_string\n            return i\n\n        elif c in ESC_STRING:\n            self._curtoken += bytes((ESC_STRING[c],))\n\n        elif c == b\"\\r\" and len(s) > i + 1 and s[i + 1 : i + 2] == b\"\\n\":\n            # If current and next character is \\r\\n skip both because enters\n            # after a \\ are ignored\n            i += 1\n\n        # default action\n        self._parse1 = self._parse_string\n        return i + 1\n\n    def _parse_wopen(self, s: bytes, i: int) -> int:\n        c = s[i : i + 1]\n        if c == b\"<\":\n            self._add_token(KEYWORD_DICT_BEGIN)\n            self._parse1 = self._parse_main\n            i += 1\n        else:\n            self._parse1 = self._parse_hexstring\n        return i\n\n    def _parse_wclose(self, s: bytes, i: int) -> int:\n        c = s[i : i + 1]\n        if c == b\">\":\n            self._add_token(KEYWORD_DICT_END)\n            i += 1\n        self._parse1 = self._parse_main\n        return i\n\n    def _parse_hexstring(self, s: bytes, i: int) -> int:\n        m = END_HEX_STRING.search(s, i)\n        if not m:\n            self._curtoken += s[i:]\n            return len(s)\n        j = m.start(0)\n        self._curtoken += s[i:j]\n        token = HEX_PAIR.sub(\n            lambda m: bytes((int(m.group(0), 
16),)),\n            SPC.sub(b\"\", self._curtoken),\n        )\n        self._add_token(token)\n        self._parse1 = self._parse_main\n        return j\n\n    def nexttoken(self) -> tuple[int, PSBaseParserToken]:\n        if self.eof:\n            # It's not really unexpected, come on now...\n            raise PSEOF(\"Unexpected EOF\")\n        while not self._tokens:\n            try:\n                self.fillbuf()\n                self.charpos = self._parse1(self.buf, self.charpos)\n            except PSEOF:\n                # If we hit EOF in the middle of a token, try to parse\n                # it by tacking on whitespace, and delay raising PSEOF\n                # until next time around\n                self.charpos = self._parse1(b\"\\n\", 0)\n                self.eof = True\n                # Oh, so there wasn't actually a token there? OK.\n                if not self._tokens:\n                    raise\n        token = self._tokens.pop(0)\n        log.debug(\"nexttoken: %r\", token)\n        return token\n\n\n# Stack slots may be occupied by any of:\n#  * the name of a literal\n#  * the PSBaseParserToken types\n#  * list (via KEYWORD_ARRAY)\n#  * dict (via KEYWORD_DICT)\n#  * subclass-specific extensions (e.g. 
PDFStream, PDFObjRef) via ExtraT\nExtraT = TypeVar(\"ExtraT\")\nPSStackType = Union[str, float, bool, PSLiteral, bytes, list, dict, ExtraT]\nPSStackEntry = tuple[int, PSStackType[ExtraT]]\n\n\nclass PSStackParser(PSBaseParser, Generic[ExtraT]):\n    def __init__(self, fp: BinaryIO) -> None:\n        PSBaseParser.__init__(self, fp)\n        self.reset()\n\n    def reset(self) -> None:\n        self.context: list[tuple[int, str | None, list[PSStackEntry[ExtraT]]]] = []\n        self.curtype: str | None = None\n        self.curstack: list[PSStackEntry[ExtraT]] = []\n        self.results: list[PSStackEntry[ExtraT]] = []\n\n    def seek(self, pos: int) -> None:\n        PSBaseParser.seek(self, pos)\n        self.reset()\n\n    def push(self, *objs: PSStackEntry[ExtraT]) -> None:\n        self.curstack.extend(objs)\n\n    def pop(self, n: int) -> list[PSStackEntry[ExtraT]]:\n        objs = self.curstack[-n:]\n        self.curstack[-n:] = []\n        return objs\n\n    def popall(self) -> list[PSStackEntry[ExtraT]]:\n        objs = self.curstack\n        self.curstack = []\n        return objs\n\n    def add_results(self, *objs: PSStackEntry[ExtraT]) -> None:\n        try:\n            log.debug(\"add_results: %r\", objs)\n        except Exception:\n            log.debug(\"add_results: (unprintable object)\")\n        self.results.extend(objs)\n\n    def start_type(self, pos: int, type: str) -> None:\n        self.context.append((pos, self.curtype, self.curstack))\n        (self.curtype, self.curstack) = (type, [])\n        log.debug(\"start_type: pos=%r, type=%r\", pos, type)\n\n    def end_type(self, type: str) -> tuple[int, list[PSStackType[ExtraT]]]:\n        if self.curtype != type:\n            raise PSTypeError(f\"Type mismatch: {self.curtype!r} != {type!r}\")\n        objs = [obj for (_, obj) in self.curstack]\n        (pos, self.curtype, self.curstack) = self.context.pop()\n        log.debug(\"end_type: pos=%r, type=%r, objs=%r\", pos, type, objs)\n        return 
(pos, objs)\n\n    def do_keyword(self, pos: int, token: PSKeyword) -> None:\n        pass\n\n    def nextobject(self) -> PSStackEntry[ExtraT]:\n        \"\"\"Yields a list of objects.\n\n        Arrays and dictionaries are represented as Python lists and\n        dictionaries.\n\n        :return: keywords, literals, strings, numbers, arrays and dictionaries.\n        \"\"\"\n        while not self.results:\n            (pos, token) = self.nexttoken()\n            if isinstance(token, (int, float, bool, str, bytes, PSLiteral)):\n                # normal token\n                self.push((pos, token))\n            elif token == KEYWORD_ARRAY_BEGIN:\n                # begin array\n                self.start_type(pos, \"a\")\n            elif token == KEYWORD_ARRAY_END:\n                # end array\n                try:\n                    self.push(self.end_type(\"a\"))\n                except PSTypeError:\n                    if settings.STRICT:\n                        raise\n            elif token == KEYWORD_DICT_BEGIN:\n                # begin dictionary\n                self.start_type(pos, \"d\")\n            elif token == KEYWORD_DICT_END:\n                # end dictionary\n                try:\n                    (pos, objs) = self.end_type(\"d\")\n                    if len(objs) % 2 != 0:\n                        error_msg = \"Invalid dictionary construct: %r\" % objs\n                        raise PSSyntaxError(error_msg)\n                    d = {\n                        literal_name(k): v\n                        for (k, v) in choplist(2, objs)\n                        if v is not None\n                    }\n                    self.push((pos, d))\n                except PSTypeError:\n                    if settings.STRICT:\n                        raise\n            elif token == KEYWORD_PROC_BEGIN:\n                # begin proc\n                self.start_type(pos, \"p\")\n            elif token == KEYWORD_PROC_END:\n                # end proc\n     
           try:\n                    self.push(self.end_type(\"p\"))\n                except PSTypeError:\n                    if settings.STRICT:\n                        raise\n            elif isinstance(token, PSKeyword):\n                log.debug(\n                    \"do_keyword: pos=%r, token=%r, stack=%r\",\n                    pos,\n                    token,\n                    self.curstack,\n                )\n                self.do_keyword(pos, token)\n            else:\n                log.error(\n                    \"unknown token: pos=%r, token=%r, stack=%r\",\n                    pos,\n                    token,\n                    self.curstack,\n                )\n                self.do_keyword(pos, token)\n                raise PSException\n            if self.context:\n                continue\n            else:\n                self.flush()\n        obj = self.results.pop(0)\n        try:\n            log.debug(\"nextobject: %r\", obj)\n        except Exception:\n            log.debug(\"nextobject: (unprintable object)\")\n        return obj\n"
  },
  {
    "path": "babeldoc/pdfminer/py.typed",
    "content": ""
  },
  {
    "path": "babeldoc/pdfminer/runlength.py",
    "content": "#\n# RunLength decoder (Adobe version) implementation based on PDF Reference\n# version 1.4 section 3.3.4.\n#\n#  * public domain *\n#\n\n\ndef rldecode(data: bytes) -> bytes:\n    \"\"\"RunLength decoder (Adobe version) implementation based on PDF Reference\n    version 1.4 section 3.3.4:\n        The RunLengthDecode filter decodes data that has been encoded in a\n        simple byte-oriented format based on run length. The encoded data\n        is a sequence of runs, where each run consists of a length byte\n        followed by 1 to 128 bytes of data. If the length byte is in the\n        range 0 to 127, the following length + 1 (1 to 128) bytes are\n        copied literally during decompression. If length is in the range\n        129 to 255, the following single byte is to be copied 257 - length\n        (2 to 128) times during decompression. A length value of 128\n        denotes EOD.\n    \"\"\"\n    decoded_array: list[int] = []\n    data_iter = iter(data)\n\n    while True:\n        length = next(data_iter, 128)\n        if length == 128:\n            break\n\n        if 0 <= length < 128:\n            decoded_array.extend(next(data_iter) for _ in range(length + 1))\n\n        if length > 128:\n            run = [next(data_iter)] * (257 - length)\n            decoded_array.extend(run)\n    return bytes(decoded_array)\n"
  },
  {
    "path": "babeldoc/pdfminer/settings.py",
    "content": "STRICT = False\n"
  },
  {
    "path": "babeldoc/pdfminer/utils.py",
    "content": "\"\"\"Miscellaneous Routines.\"\"\"\n\nimport io\nimport pathlib\nimport string\nfrom collections.abc import Callable\nfrom collections.abc import Iterable\nfrom collections.abc import Iterator\nfrom html import escape\nfrom typing import TYPE_CHECKING\nfrom typing import Any\nfrom typing import BinaryIO\nfrom typing import Generic\nfrom typing import TextIO\nfrom typing import TypeVar\nfrom typing import Union\nfrom typing import cast\n\nfrom babeldoc.pdfminer.pdfexceptions import PDFTypeError\nfrom babeldoc.pdfminer.pdfexceptions import PDFValueError\n\nif TYPE_CHECKING:\n    from babeldoc.pdfminer.layout import LTComponent\n\nimport charset_normalizer  # For str encoding detection\n\n# from sys import maxint as INF doesn't work anymore under Python3, but PDF\n# still uses 32 bits ints\nINF = (1 << 31) - 1\n\n\nFileOrName = Union[pathlib.PurePath, str, io.IOBase]\nAnyIO = Union[TextIO, BinaryIO]\n\n\nclass open_filename:\n    \"\"\"Context manager that allows opening a filename\n    (str or pathlib.PurePath type is supported) and closes it on exit,\n    (just like `open`), but does nothing for file-like objects.\n    \"\"\"\n\n    def __init__(self, filename: FileOrName, *args: Any, **kwargs: Any) -> None:\n        if isinstance(filename, pathlib.PurePath):\n            filename = str(filename)\n        if isinstance(filename, str):\n            self.file_handler: AnyIO = open(filename, *args, **kwargs)\n            self.closing = True\n        elif isinstance(filename, io.IOBase):\n            self.file_handler = cast(AnyIO, filename)\n            self.closing = False\n        else:\n            raise PDFTypeError(\"Unsupported input type: %s\" % type(filename))\n\n    def __enter__(self) -> AnyIO:\n        return self.file_handler\n\n    def __exit__(self, exc_type: object, exc_val: object, exc_tb: object) -> None:\n        if self.closing:\n            self.file_handler.close()\n\n\ndef make_compat_bytes(in_str: str) -> bytes:\n    
\"\"\"Converts to bytes, encoding to unicode.\"\"\"\n    assert isinstance(in_str, str), str(type(in_str))\n    return in_str.encode()\n\n\ndef make_compat_str(o: object) -> str:\n    \"\"\"Converts everything to string, if bytes guessing the encoding.\"\"\"\n    if isinstance(o, bytes):\n        enc = charset_normalizer.detect(o)\n        try:\n            return o.decode(enc[\"encoding\"])\n        except UnicodeDecodeError:\n            return str(o)\n    else:\n        return str(o)\n\n\ndef shorten_str(s: str, size: int) -> str:\n    if size < 7:\n        return s[:size]\n    if len(s) > size:\n        length = (size - 5) // 2\n        return f\"{s[:length]} ... {s[-length:]}\"\n    else:\n        return s\n\n\ndef compatible_encode_method(\n    bytesorstring: bytes | str,\n    encoding: str = \"utf-8\",\n    erraction: str = \"ignore\",\n) -> str:\n    \"\"\"When Py2 str.encode is called, it often means bytes.encode in Py3.\n\n    This does either.\n    \"\"\"\n    if isinstance(bytesorstring, str):\n        return bytesorstring\n    assert isinstance(bytesorstring, bytes), str(type(bytesorstring))\n    return bytesorstring.decode(encoding, erraction)\n\n\ndef paeth_predictor(left: int, above: int, upper_left: int) -> int:\n    # From http://www.libpng.org/pub/png/spec/1.2/PNG-Filters.html\n    # Initial estimate\n    p = left + above - upper_left\n    # Distances to a,b,c\n    pa = abs(p - left)\n    pb = abs(p - above)\n    pc = abs(p - upper_left)\n\n    # Return nearest of a,b,c breaking ties in order a,b,c\n    if pa <= pb and pa <= pc:\n        return left\n    elif pb <= pc:\n        return above\n    else:\n        return upper_left\n\n\ndef apply_png_predictor(\n    pred: int,\n    colors: int,\n    columns: int,\n    bitspercomponent: int,\n    data: bytes,\n) -> bytes:\n    \"\"\"Reverse the effect of the PNG predictor\n\n    Documentation: http://www.libpng.org/pub/png/spec/1.2/PNG-Filters.html\n    \"\"\"\n    if bitspercomponent not in [8, 1]:\n 
       msg = \"Unsupported `bitspercomponent': %d\" % bitspercomponent\n        raise PDFValueError(msg)\n\n    nbytes = colors * columns * bitspercomponent // 8\n    # Number of bytes per complete pixel; the PNG spec rounds this up to 1.\n    bpp = max(1, colors * bitspercomponent // 8)\n    buf = []\n    # The prior scanline must be as long as a decoded scanline (nbytes).\n    line_above = list(b\"\\x00\" * nbytes)\n    for scanline_i in range(0, len(data), nbytes + 1):\n        filter_type = data[scanline_i]\n        line_encoded = data[scanline_i + 1 : scanline_i + 1 + nbytes]\n        raw = []\n\n        if filter_type == 0:\n            # Filter type 0: None\n            raw = list(line_encoded)\n\n        elif filter_type == 1:\n            # Filter type 1: Sub\n            # To reverse the effect of the Sub() filter after decompression,\n            # output the following value:\n            #   Raw(x) = Sub(x) + Raw(x - bpp)\n            # (computed mod 256), where Raw() refers to the bytes already\n            #  decoded.\n            for j, sub_x in enumerate(line_encoded):\n                if j - bpp < 0:\n                    raw_x_bpp = 0\n                else:\n                    raw_x_bpp = int(raw[j - bpp])\n                raw_x = (sub_x + raw_x_bpp) & 255\n                raw.append(raw_x)\n\n        elif filter_type == 2:\n            # Filter type 2: Up\n            # To reverse the effect of the Up() filter after decompression,\n            # output the following value:\n            #   Raw(x) = Up(x) + Prior(x)\n            # (computed mod 256), where Prior() refers to the decoded bytes of\n            # the prior scanline.\n            for up_x, prior_x in zip(line_encoded, line_above, strict=False):\n                raw_x = (up_x + prior_x) & 255\n                raw.append(raw_x)\n\n        elif filter_type == 3:\n            # Filter type 3: Average\n            # To reverse the effect of the Average() filter after\n            # decompression, output the following value:\n            #    Raw(x) = Average(x) + floor((Raw(x-bpp)+Prior(x))/2)\n            # 
where the result is computed mod 256, but the prediction is\n            # calculated in the same way as for encoding. Raw() refers to the\n            # bytes already decoded, and Prior() refers to the decoded bytes of\n            # the prior scanline.\n            for j, average_x in enumerate(line_encoded):\n                if j - bpp < 0:\n                    raw_x_bpp = 0\n                else:\n                    raw_x_bpp = int(raw[j - bpp])\n                prior_x = int(line_above[j])\n                raw_x = (average_x + (raw_x_bpp + prior_x) // 2) & 255\n                raw.append(raw_x)\n\n        elif filter_type == 4:\n            # Filter type 4: Paeth\n            # To reverse the effect of the Paeth() filter after decompression,\n            # output the following value:\n            #    Raw(x) = Paeth(x)\n            #             + PaethPredictor(Raw(x-bpp), Prior(x), Prior(x-bpp))\n            # (computed mod 256), where Raw() and Prior() refer to bytes\n            # already decoded. 
Exactly the same PaethPredictor() function is\n            # used by both encoder and decoder.\n            for j, paeth_x in enumerate(line_encoded):\n                if j - bpp < 0:\n                    raw_x_bpp = 0\n                    prior_x_bpp = 0\n                else:\n                    raw_x_bpp = int(raw[j - bpp])\n                    prior_x_bpp = int(line_above[j - bpp])\n                prior_x = int(line_above[j])\n                paeth = paeth_predictor(raw_x_bpp, prior_x, prior_x_bpp)\n                raw_x = (paeth_x + paeth) & 255\n                raw.append(raw_x)\n\n        else:\n            raise PDFValueError(\"Unsupported predictor value: %d\" % filter_type)\n\n        buf.extend(raw)\n        line_above = raw\n    return bytes(buf)\n\n\nPoint = tuple[float, float]\nRect = tuple[float, float, float, float]\nMatrix = tuple[float, float, float, float, float, float]\nPathSegment = Union[\n    tuple[str],  # Literal['h']\n    tuple[str, float, float],  # Literal['m', 'l']\n    tuple[str, float, float, float, float],  # Literal['v', 'y']\n    tuple[str, float, float, float, float, float, float],\n]  # Literal['c']\n\n#  Matrix operations\nMATRIX_IDENTITY: Matrix = (1, 0, 0, 1, 0, 0)\n\n\ndef parse_rect(o: Any) -> Rect:\n    try:\n        (x0, y0, x1, y1) = o\n        return float(x0), float(y0), float(x1), float(y1)\n    except ValueError:\n        raise PDFValueError(\"Could not parse rectangle\")\n\n\ndef mult_matrix(m1: Matrix, m0: Matrix) -> Matrix:\n    \"\"\"Returns the multiplication of two matrices.\"\"\"\n    (a1, b1, c1, d1, e1, f1) = m1\n    (a0, b0, c0, d0, e0, f0) = m0\n    return (\n        a0 * a1 + c0 * b1,\n        b0 * a1 + d0 * b1,\n        a0 * c1 + c0 * d1,\n        b0 * c1 + d0 * d1,\n        a0 * e1 + c0 * f1 + e0,\n        b0 * e1 + d0 * f1 + f0,\n    )\n\n\ndef translate_matrix(m: Matrix, v: Point) -> Matrix:\n    \"\"\"Translates a matrix by (x, y).\"\"\"\n    (a, b, c, d, e, f) = m\n    (x, y) = v\n    return a, b, 
c, d, x * a + y * c + e, x * b + y * d + f\n\n\ndef apply_matrix_pt(m: Matrix, v: Point) -> Point:\n    \"\"\"Applies a matrix to a point.\"\"\"\n    (a, b, c, d, e, f) = m\n    (x, y) = v\n    return a * x + c * y + e, b * x + d * y + f\n\n\ndef apply_matrix_norm(m: Matrix, v: Point) -> Point:\n    \"\"\"Equivalent to apply_matrix_pt(M, (p,q)) - apply_matrix_pt(M, (0,0))\"\"\"\n    (a, b, c, d, e, f) = m\n    (p, q) = v\n    return a * p + c * q, b * p + d * q\n\n\n#  Utility functions\n\n\ndef isnumber(x: object) -> bool:\n    return isinstance(x, (int, float))\n\n\n_T = TypeVar(\"_T\")\n\n\ndef uniq(objs: Iterable[_T]) -> Iterator[_T]:\n    \"\"\"Eliminates duplicated elements.\"\"\"\n    done = set()\n    for obj in objs:\n        if obj in done:\n            continue\n        done.add(obj)\n        yield obj\n\n\ndef fsplit(pred: Callable[[_T], bool], objs: Iterable[_T]) -> tuple[list[_T], list[_T]]:\n    \"\"\"Split a list into two classes according to the predicate.\"\"\"\n    t = []\n    f = []\n    for obj in objs:\n        if pred(obj):\n            t.append(obj)\n        else:\n            f.append(obj)\n    return t, f\n\n\ndef drange(v0: float, v1: float, d: int) -> range:\n    \"\"\"Returns a discrete range.\"\"\"\n    return range(int(v0) // d, int(v1 + d) // d)\n\n\ndef get_bound(pts: Iterable[Point]) -> Rect:\n    \"\"\"Compute a minimal rectangle that covers all the points.\"\"\"\n    limit: Rect = (INF, INF, -INF, -INF)\n    (x0, y0, x1, y1) = limit\n    for x, y in pts:\n        x0 = min(x0, x)\n        y0 = min(y0, y)\n        x1 = max(x1, x)\n        y1 = max(y1, y)\n    return x0, y0, x1, y1\n\n\ndef pick(\n    seq: Iterable[_T],\n    func: Callable[[_T], float],\n    maxobj: _T | None = None,\n) -> _T | None:\n    \"\"\"Picks the object obj where func(obj) has the highest value.\"\"\"\n    maxscore = None\n    for obj in seq:\n        score = func(obj)\n        if maxscore is None or maxscore < score:\n            (maxscore, maxobj) = 
(score, obj)\n    return maxobj\n\n\ndef choplist(n: int, seq: Iterable[_T]) -> Iterator[tuple[_T, ...]]:\n    \"\"\"Groups every n elements of the list.\"\"\"\n    r = []\n    for x in seq:\n        r.append(x)\n        if len(r) == n:\n            yield tuple(r)\n            r = []\n\n\ndef nunpack(s: bytes, default: int = 0) -> int:\n    \"\"\"Unpacks variable-length unsigned integers (big endian).\"\"\"\n    length = len(s)\n    if not length:\n        return default\n    else:\n        return int.from_bytes(s, byteorder=\"big\", signed=False)\n\n\nPDFDocEncoding = \"\".join(\n    chr(x)\n    for x in (\n        0x0000,\n        0x0001,\n        0x0002,\n        0x0003,\n        0x0004,\n        0x0005,\n        0x0006,\n        0x0007,\n        0x0008,\n        0x0009,\n        0x000A,\n        0x000B,\n        0x000C,\n        0x000D,\n        0x000E,\n        0x000F,\n        0x0010,\n        0x0011,\n        0x0012,\n        0x0013,\n        0x0014,\n        0x0015,\n        0x0017,\n        0x0017,\n        0x02D8,\n        0x02C7,\n        0x02C6,\n        0x02D9,\n        0x02DD,\n        0x02DB,\n        0x02DA,\n        0x02DC,\n        0x0020,\n        0x0021,\n        0x0022,\n        0x0023,\n        0x0024,\n        0x0025,\n        0x0026,\n        0x0027,\n        0x0028,\n        0x0029,\n        0x002A,\n        0x002B,\n        0x002C,\n        0x002D,\n        0x002E,\n        0x002F,\n        0x0030,\n        0x0031,\n        0x0032,\n        0x0033,\n        0x0034,\n        0x0035,\n        0x0036,\n        0x0037,\n        0x0038,\n        0x0039,\n        0x003A,\n        0x003B,\n        0x003C,\n        0x003D,\n        0x003E,\n        0x003F,\n        0x0040,\n        0x0041,\n        0x0042,\n        0x0043,\n        0x0044,\n        0x0045,\n        0x0046,\n        0x0047,\n        0x0048,\n        0x0049,\n        0x004A,\n        0x004B,\n        0x004C,\n        0x004D,\n        0x004E,\n        0x004F,\n        0x0050,\n       
 0x0051,\n        0x0052,\n        0x0053,\n        0x0054,\n        0x0055,\n        0x0056,\n        0x0057,\n        0x0058,\n        0x0059,\n        0x005A,\n        0x005B,\n        0x005C,\n        0x005D,\n        0x005E,\n        0x005F,\n        0x0060,\n        0x0061,\n        0x0062,\n        0x0063,\n        0x0064,\n        0x0065,\n        0x0066,\n        0x0067,\n        0x0068,\n        0x0069,\n        0x006A,\n        0x006B,\n        0x006C,\n        0x006D,\n        0x006E,\n        0x006F,\n        0x0070,\n        0x0071,\n        0x0072,\n        0x0073,\n        0x0074,\n        0x0075,\n        0x0076,\n        0x0077,\n        0x0078,\n        0x0079,\n        0x007A,\n        0x007B,\n        0x007C,\n        0x007D,\n        0x007E,\n        0x0000,\n        0x2022,\n        0x2020,\n        0x2021,\n        0x2026,\n        0x2014,\n        0x2013,\n        0x0192,\n        0x2044,\n        0x2039,\n        0x203A,\n        0x2212,\n        0x2030,\n        0x201E,\n        0x201C,\n        0x201D,\n        0x2018,\n        0x2019,\n        0x201A,\n        0x2122,\n        0xFB01,\n        0xFB02,\n        0x0141,\n        0x0152,\n        0x0160,\n        0x0178,\n        0x017D,\n        0x0131,\n        0x0142,\n        0x0153,\n        0x0161,\n        0x017E,\n        0x0000,\n        0x20AC,\n        0x00A1,\n        0x00A2,\n        0x00A3,\n        0x00A4,\n        0x00A5,\n        0x00A6,\n        0x00A7,\n        0x00A8,\n        0x00A9,\n        0x00AA,\n        0x00AB,\n        0x00AC,\n        0x0000,\n        0x00AE,\n        0x00AF,\n        0x00B0,\n        0x00B1,\n        0x00B2,\n        0x00B3,\n        0x00B4,\n        0x00B5,\n        0x00B6,\n        0x00B7,\n        0x00B8,\n        0x00B9,\n        0x00BA,\n        0x00BB,\n        0x00BC,\n        0x00BD,\n        0x00BE,\n        0x00BF,\n        0x00C0,\n        0x00C1,\n        0x00C2,\n        0x00C3,\n        0x00C4,\n        0x00C5,\n        0x00C6,\n 
       0x00C7,\n        0x00C8,\n        0x00C9,\n        0x00CA,\n        0x00CB,\n        0x00CC,\n        0x00CD,\n        0x00CE,\n        0x00CF,\n        0x00D0,\n        0x00D1,\n        0x00D2,\n        0x00D3,\n        0x00D4,\n        0x00D5,\n        0x00D6,\n        0x00D7,\n        0x00D8,\n        0x00D9,\n        0x00DA,\n        0x00DB,\n        0x00DC,\n        0x00DD,\n        0x00DE,\n        0x00DF,\n        0x00E0,\n        0x00E1,\n        0x00E2,\n        0x00E3,\n        0x00E4,\n        0x00E5,\n        0x00E6,\n        0x00E7,\n        0x00E8,\n        0x00E9,\n        0x00EA,\n        0x00EB,\n        0x00EC,\n        0x00ED,\n        0x00EE,\n        0x00EF,\n        0x00F0,\n        0x00F1,\n        0x00F2,\n        0x00F3,\n        0x00F4,\n        0x00F5,\n        0x00F6,\n        0x00F7,\n        0x00F8,\n        0x00F9,\n        0x00FA,\n        0x00FB,\n        0x00FC,\n        0x00FD,\n        0x00FE,\n        0x00FF,\n    )\n)\n\n\ndef decode_text(s: bytes) -> str:\n    \"\"\"Decodes a PDFDocEncoding string to Unicode.\"\"\"\n    if s.startswith(b\"\\xfe\\xff\"):\n        return str(s[2:], \"utf-16be\", \"ignore\")\n    else:\n        return \"\".join(PDFDocEncoding[c] for c in s)\n\n\ndef enc(x: str) -> str:\n    \"\"\"Encodes a string for SGML/XML/HTML\"\"\"\n    if isinstance(x, bytes):\n        return \"\"\n    return escape(x)\n\n\ndef bbox2str(bbox: Rect) -> str:\n    (x0, y0, x1, y1) = bbox\n    return f\"{x0:.3f},{y0:.3f},{x1:.3f},{y1:.3f}\"\n\n\ndef matrix2str(m: Matrix) -> str:\n    (a, b, c, d, e, f) = m\n    return f\"[{a:.2f},{b:.2f},{c:.2f},{d:.2f}, ({e:.2f},{f:.2f})]\"\n\n\ndef vecBetweenBoxes(obj1: \"LTComponent\", obj2: \"LTComponent\") -> Point:\n    \"\"\"A distance function between two TextBoxes.\n\n    Consider the bounding rectangle for obj1 and obj2.\n    Returns the vector between the two boxes' boundaries if they don't\n    overlap, otherwise returns the vector between the boxes' centers.\n\n             +------+..........+ 
(x1, y1)\n             | obj1 |          :\n             +------+www+------+\n             :          | obj2 |\n    (x0, y0) +..........+------+\n    \"\"\"\n    (x0, y0) = (min(obj1.x0, obj2.x0), min(obj1.y0, obj2.y0))\n    (x1, y1) = (max(obj1.x1, obj2.x1), max(obj1.y1, obj2.y1))\n    (ow, oh) = (x1 - x0, y1 - y0)\n    (iw, ih) = (ow - obj1.width - obj2.width, oh - obj1.height - obj2.height)\n    if iw < 0 and ih < 0:\n        # the boxes overlap on both axes: return the vector between centers\n        (xc1, yc1) = ((obj1.x0 + obj1.x1) / 2, (obj1.y0 + obj1.y1) / 2)\n        (xc2, yc2) = ((obj2.x0 + obj2.x1) / 2, (obj2.y0 + obj2.y1) / 2)\n        return xc1 - xc2, yc1 - yc2\n    else:\n        return max(0, iw), max(0, ih)\n\n\nLTComponentT = TypeVar(\"LTComponentT\", bound=\"LTComponent\")\n\n\nclass Plane(Generic[LTComponentT]):\n    \"\"\"A set-like data structure for objects placed on a plane.\n\n    Can efficiently find objects in a certain rectangular area.\n    It buckets objects into a uniform grid of `gridsize`-sized cells, so a\n    lookup only inspects the cells that the query rectangle touches.\n    \"\"\"\n\n    def __init__(self, bbox: Rect, gridsize: int = 50) -> None:\n        self._seq: list[LTComponentT] = []  # preserve the object order.\n        self._objs: set[LTComponentT] = set()\n        self._grid: dict[Point, list[LTComponentT]] = {}\n        self.gridsize = gridsize\n        (self.x0, self.y0, self.x1, self.y1) = bbox\n\n    def __repr__(self) -> str:\n        return \"<Plane objs=%r>\" % list(self)\n\n    def __iter__(self) -> Iterator[LTComponentT]:\n        return (obj for obj in self._seq if obj in self._objs)\n\n    def __len__(self) -> int:\n        return len(self._objs)\n\n    def __contains__(self, obj: object) -> bool:\n        return obj in self._objs\n\n    def _getrange(self, bbox: Rect) -> Iterator[Point]:\n        (x0, y0, x1, y1) = bbox\n        if x1 <= self.x0 or self.x1 <= x0 or y1 <= self.y0 or self.y1 <= y0:\n            return\n        x0 = max(self.x0, x0)\n        
y0 = max(self.y0, y0)\n        x1 = min(self.x1, x1)\n        y1 = min(self.y1, y1)\n        for grid_y in drange(y0, y1, self.gridsize):\n            for grid_x in drange(x0, x1, self.gridsize):\n                yield (grid_x, grid_y)\n\n    def extend(self, objs: Iterable[LTComponentT]) -> None:\n        for obj in objs:\n            self.add(obj)\n\n    def add(self, obj: LTComponentT) -> None:\n        \"\"\"Place an object.\"\"\"\n        for k in self._getrange((obj.x0, obj.y0, obj.x1, obj.y1)):\n            if k not in self._grid:\n                r: list[LTComponentT] = []\n                self._grid[k] = r\n            else:\n                r = self._grid[k]\n            r.append(obj)\n        self._seq.append(obj)\n        self._objs.add(obj)\n\n    def remove(self, obj: LTComponentT) -> None:\n        \"\"\"Displace an object.\"\"\"\n        for k in self._getrange((obj.x0, obj.y0, obj.x1, obj.y1)):\n            try:\n                self._grid[k].remove(obj)\n            except (KeyError, ValueError):\n                pass\n        self._objs.remove(obj)\n\n    def find(self, bbox: Rect) -> Iterator[LTComponentT]:\n        \"\"\"Finds objects that are in a certain area.\"\"\"\n        (x0, y0, x1, y1) = bbox\n        done = set()\n        for k in self._getrange(bbox):\n            if k not in self._grid:\n                continue\n            for obj in self._grid[k]:\n                if obj in done:\n                    continue\n                done.add(obj)\n                if obj.x1 <= x0 or x1 <= obj.x0 or obj.y1 <= y0 or y1 <= obj.y0:\n                    continue\n                yield obj\n\n\nROMAN_ONES = [\"i\", \"x\", \"c\", \"m\"]\nROMAN_FIVES = [\"v\", \"l\", \"d\"]\n\n\ndef format_int_roman(value: int) -> str:\n    \"\"\"Format a number as lowercase Roman numerals.\"\"\"\n    assert 0 < value < 4000\n    result: list[str] = []\n    index = 0\n\n    while value != 0:\n        value, remainder = divmod(value, 10)\n        if remainder == 
9:\n            result.insert(0, ROMAN_ONES[index])\n            result.insert(1, ROMAN_ONES[index + 1])\n        elif remainder == 4:\n            result.insert(0, ROMAN_ONES[index])\n            result.insert(1, ROMAN_FIVES[index])\n        else:\n            over_five = remainder >= 5\n            if over_five:\n                result.insert(0, ROMAN_FIVES[index])\n                remainder -= 5\n            result.insert(1 if over_five else 0, ROMAN_ONES[index] * remainder)\n        index += 1\n\n    return \"\".join(result)\n\n\ndef format_int_alpha(value: int) -> str:\n    \"\"\"Format a number as lowercase letters a-z, aa-zz, etc.\"\"\"\n    assert value > 0\n    result: list[str] = []\n\n    while value != 0:\n        value, remainder = divmod(value - 1, len(string.ascii_lowercase))\n        result.append(string.ascii_lowercase[remainder])\n\n    result.reverse()\n    return \"\".join(result)\n"
  },
  {
    "path": "babeldoc/progress_monitor.py",
    "content": "import asyncio\nimport logging\nimport threading\nimport time\nfrom asyncio import CancelledError\nfrom collections.abc import Callable\nfrom typing import Optional\n\nlogger = logging.getLogger(__name__)\n\n\nclass ProgressMonitor:\n    def __init__(\n        self,\n        stages: list[tuple[str, float]],\n        progress_change_callback: Callable | None = None,\n        finish_callback: Callable | None = None,\n        report_interval: float = 0.1,\n        finish_event: asyncio.Event | None = None,\n        cancel_event: threading.Event | None = None,\n        loop: asyncio.AbstractEventLoop | None = None,\n        parent_monitor: Optional[\"ProgressMonitor\"] = None,\n        part_index: int | None = 0,\n        total_parts: int | None = 1,\n    ):\n        self.lock = threading.Lock()\n        self.parent_monitor = parent_monitor\n        self.part_index = part_index\n        self.total_parts = total_parts\n        self.raw_stages = stages\n        self.part_results = {}\n\n        # Convert stages list to dict with name and weight\n        self.stage = {}\n        total_weight = sum(weight for _, weight in stages)\n        for name, weight in stages:\n            normalized_weight = weight / total_weight\n            self.stage[name] = TranslationStage(\n                name,\n                0,\n                self,\n                normalized_weight,\n                self.lock,\n            )\n\n        self.progress_change_callback = progress_change_callback\n        self.finish_callback = finish_callback\n        self.report_interval = report_interval\n        logger.debug(f\"report_interval: {self.report_interval}\")\n        self.last_report_time = 0\n        self.finish_stage_count = 0\n        self.finish_event = finish_event\n        self.cancel_event = cancel_event\n        self.loop = loop\n        self.disable = False\n        if finish_event and not loop:\n            raise ValueError(\"finish_event requires a loop\")\n        
if self.progress_change_callback:\n            self.progress_change_callback(\n                type=\"stage_summary\",\n                stages=[\n                    {\n                        \"name\": name,\n                        \"percent\": self.stage[name].weight,\n                    }\n                    for name, _ in stages\n                ],\n                part_index=self.part_index,\n                total_parts=self.total_parts,\n            )\n\n    def create_part_monitor(\n        self, part_index: int, total_parts: int\n    ) -> \"ProgressMonitor\":\n        \"\"\"Create a new progress monitor for a document part\"\"\"\n        return ProgressMonitor(\n            stages=self.raw_stages,\n            progress_change_callback=self._handle_part_progress,\n            finish_callback=self._handle_part_finish,\n            report_interval=self.report_interval,\n            cancel_event=self.cancel_event,\n            loop=self.loop,\n            parent_monitor=self,\n            part_index=part_index,\n            total_parts=total_parts,\n        )\n\n    def _handle_part_progress(self, **kwargs):\n        \"\"\"Handle progress updates from part monitors\"\"\"\n        if self.progress_change_callback and not self.disable:\n            # Add part information to progress update\n            kwargs[\"part_index\"] = kwargs.get(\"part_index\")\n            kwargs[\"total_parts\"] = kwargs.get(\"total_parts\")\n            self.progress_change_callback(**kwargs)\n\n    def _handle_part_finish(self, **kwargs):\n        \"\"\"Handle completion of a part translation\"\"\"\n        if kwargs[\"type\"] == \"error\":\n            logger.info(f\"progress_monitor handle_part_finish: {kwargs['error']}\")\n            self.finish_callback(type=\"error\", error=kwargs[\"error\"])\n            return\n        if \"translate_result\" in kwargs:\n            part_index = kwargs.get(\"part_index\")\n            if part_index is not None:\n                
self.part_results[part_index] = kwargs[\"translate_result\"]\n\n        # if self.finish_callback and not self.disable:\n        #     self.finish_callback(**kwargs)\n\n    def stage_start(self, stage_name: str, total: int):\n        if self.disable or self.parent_monitor and self.parent_monitor.disable:\n            return DummyTranslationStage(stage_name, total, self, 0)\n        stage = self.stage[stage_name]\n        stage.run_time += 1\n        stage.name = stage_name\n        stage.display_name = f\"{stage_name}\" if stage.run_time > 1 else stage_name\n        stage.current = 0\n        stage.total = total\n        if self.progress_change_callback:\n            self.progress_change_callback(\n                type=\"progress_start\",\n                stage=stage.display_name,\n                stage_progress=0.0,\n                stage_current=0,\n                stage_total=total,\n                overall_progress=self.calculate_current_progress(),\n                part_index=self.part_index + 1,\n                total_parts=self.total_parts,\n            )\n        self.last_report_time = 0.0\n        return stage\n\n    def __enter__(self):\n        return self\n\n    def __exit__(self, exc_type, exc_val, exc_tb):\n        logger.debug(\"ProgressMonitor __exit__\")\n\n    def on_finish(self):\n        if self.disable or self.parent_monitor and self.parent_monitor.disable:\n            return\n        if self.cancel_event:\n            self.cancel_event.set()\n        if self.finish_event and self.loop:\n            self.loop.call_soon_threadsafe(self.finish_event.set)\n        if self.cancel_event and self.cancel_event.is_set():\n            self.finish_callback(type=\"error\", error=CancelledError)\n\n    def stage_done(self, stage):\n        if self.disable or self.parent_monitor and self.parent_monitor.disable:\n            return\n        self.last_report_time = 0.0\n        self.finish_stage_count += 1\n        if (\n            stage.current != 
stage.total\n            and self.cancel_event is not None\n            and not self.cancel_event.is_set()\n        ):\n            logger.warning(\n                f\"Stage {stage.name} completed with {stage.current}/{stage.total} items\",\n            )\n            return\n        if self.progress_change_callback:\n            self.progress_change_callback(\n                type=\"progress_end\",\n                stage=stage.display_name,\n                stage_progress=100.0,\n                stage_current=stage.total,\n                stage_total=stage.total,\n                overall_progress=self.calculate_current_progress(),\n                part_index=self.part_index + 1,\n                total_parts=self.total_parts,\n            )\n\n    def calculate_current_progress(self, stage=None):\n        if self.disable or self.parent_monitor and self.parent_monitor.disable:\n            return 100\n        part_weight = 1 / self.total_parts\n        if self.parent_monitor:\n            part_offset = self.part_index * part_weight\n        else:\n            part_offset = len(self.part_results) * part_weight\n        part_offset *= 100\n        progress = self._calculate_current_progress(stage) * part_weight + part_offset\n        return progress\n\n    def _calculate_current_progress(self, stage=None):\n        \"\"\"Calculate overall progress including part progress\"\"\"\n        # Count completed stages\n        completed_stages = sum(\n            1 for s in self.stage.values() if s.run_time > 0 and s.current == s.total\n        )\n\n        # If all stages are complete, return exactly 100\n        if completed_stages == len(self.stage):\n            return 100\n\n        # Calculate progress based on weights\n        progress = sum(\n            s.weight * 100\n            for s in self.stage.values()\n            if s.run_time > 0 and s.current == s.total\n        )\n        if stage is not None and 0 < stage.total != stage.current:\n            progress += 
stage.weight * stage.current * 100 / stage.total\n\n        # If this is a part monitor (has parent_monitor), return the progress as is\n        if hasattr(self, \"parent_monitor\") and self.parent_monitor:\n            return progress\n\n        # Otherwise return the standard progress\n        return progress\n\n    def stage_update(self, stage, n: int):\n        if self.disable or self.parent_monitor and self.parent_monitor.disable:\n            return\n        report_time_delta = time.time() - self.last_report_time\n        if report_time_delta < self.report_interval and stage.total > 3:\n            return\n        if self.progress_change_callback:\n            if stage.total != 0:\n                stage_progress = stage.current * 100 / stage.total\n            else:\n                stage_progress = 100\n            self.progress_change_callback(\n                type=\"progress_update\",\n                stage=stage.display_name,\n                stage_progress=stage_progress,\n                stage_current=stage.current,\n                stage_total=stage.total,\n                overall_progress=self.calculate_current_progress(stage),\n                part_index=self.part_index + 1,\n                total_parts=self.total_parts,\n            )\n            self.last_report_time = time.time()\n\n    def translate_done(self, translate_result):\n        if self.disable or self.parent_monitor and self.parent_monitor.disable:\n            return\n        if self.finish_callback:\n            self.finish_callback(type=\"finish\", translate_result=translate_result)\n\n    def translate_error(self, error):\n        if self.disable or self.parent_monitor and self.parent_monitor.disable:\n            return\n        if self.finish_callback:\n            logger.info(f\"progress_monitor handle translate_error: {error}\")\n            self.finish_callback(type=\"error\", error=error)\n\n    def raise_if_cancelled(self):\n        if self.cancel_event and 
self.cancel_event.is_set():\n            raise asyncio.CancelledError\n\n    def cancel(self):\n        if self.disable or self.parent_monitor and self.parent_monitor.disable:\n            return\n        if self.cancel_event:\n            logger.info(\"Translation canceled\")\n            self.cancel_event.set()\n\n\nclass TranslationStage:\n    def __init__(\n        self,\n        name: str,\n        total: int,\n        pm: ProgressMonitor,\n        weight: float,\n        lock: threading.Lock,\n    ):\n        self.name = name\n        self.display_name = name\n        self.current = 0\n        self.total = total\n        self.pm = pm\n        self.run_time = 0\n        self.weight = weight\n        self.lock = lock\n\n    def __enter__(self):\n        return self\n\n    def __exit__(self, exc_type, exc_val, exc_tb):\n        with self.lock:\n            diff = self.total - self.current\n            if diff > 0:\n                logger.info(\n                    f\"Stage {self.name} completed with {self.current}/{self.total} items\"\n                )\n            self.pm.stage_update(self, diff)\n            self.current = self.total\n            self.pm.stage_done(self)\n\n    def advance(self, n: int = 1):\n        with self.lock:\n            self.current += n\n            self.pm.stage_update(self, n)\n\n\nclass DummyTranslationStage:\n    def __init__(self, name: str, total: int, pm: ProgressMonitor, weight: float):\n        self.name = name\n        self.display_name = name\n        self.current = 0\n        self.total = total\n        self.pm = pm\n\n    def __enter__(self):\n        return self\n\n    def __exit__(self, exc_type, exc_val, exc_tb):\n        pass\n\n    def advance(self, n: int = 1):\n        pass\n"
  },
  {
    "path": "babeldoc/tools/generate_cmap_metadata.py",
    "content": "\"\"\"\nThis script is used to automatically generate the following file:\nhttps://github.com/funstory-ai/BabelDOC-Assets/blob/main/cmap_metadata.json\n\"\"\"\n\nimport argparse\nimport hashlib\nimport logging\nfrom pathlib import Path\n\nimport orjson\nfrom rich.logging import RichHandler\n\nlogger = logging.getLogger(__name__)\n\n\ndef _calc_sha3_256(path: Path) -> str:\n    \"\"\"Calculate sha3-256 for a given file path.\"\"\"\n    hash_ = hashlib.sha3_256()\n    with path.open(\"rb\") as f:\n        # Read the file in chunks to handle large files efficiently\n        while True:\n            chunk = f.read(1024 * 1024)\n            if not chunk:\n                break\n            hash_.update(chunk)\n    return hash_.hexdigest()\n\n\ndef main() -> None:\n    logging.basicConfig(level=logging.INFO, handlers=[RichHandler()])\n    parser = argparse.ArgumentParser(description=\"Generate cmap metadata.\")\n    parser.add_argument(\n        \"assets_repo_path\",\n        type=str,\n        help=\"Path to the BabelDOC-Assets repository.\",\n    )\n    args = parser.parse_args()\n    repo_path = Path(args.assets_repo_path)\n    assert repo_path.exists(), f\"Assets repo path {repo_path} does not exist.\"\n    assert (repo_path / \"README.md\").exists(), (\n        f\"Assets repo path {repo_path} does not contain a README.md file.\"\n    )\n    assert (repo_path / \"cmap\").exists(), (\n        f\"Assets repo path {repo_path} does not contain a cmap folder.\"\n    )\n    logger.info(f\"Getting cmap metadata for {repo_path}\")\n\n    metadatas: dict[str, dict[str, object]] = {}\n    cmap_dir = repo_path / \"cmap\"\n    for cmap_path in sorted(cmap_dir.glob(\"**/*.json\")):\n        if not cmap_path.is_file():\n            continue\n        logger.info(f\"Getting cmap metadata for {cmap_path}\")\n        sha3_256 = _calc_sha3_256(cmap_path)\n        metadata = {\n            \"file_name\": cmap_path.name,\n            \"sha3_256\": sha3_256,\n            
\"size\": cmap_path.stat().st_size,\n        }\n        metadatas[cmap_path.name] = metadata\n\n    metadatas_json = orjson.dumps(\n        metadatas,\n        option=orjson.OPT_APPEND_NEWLINE | orjson.OPT_INDENT_2 | orjson.OPT_SORT_KEYS,\n    ).decode()\n    print(f\"CMAP METADATA: {metadatas_json}\")\n    with (repo_path / \"cmap_metadata.json\").open(\"w\") as f:\n        f.write(metadatas_json)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "babeldoc/tools/generate_font_metadata.py",
    "content": "# This script is used to automatically generate the following files:\n# https://github.com/funstory-ai/BabelDOC-Assets/blob/main/font_metadata.json\n\n\nimport argparse\nimport hashlib\nimport io\nimport logging\nimport re\nfrom pathlib import Path\n\nimport babeldoc.format.pdf.high_level\nimport babeldoc.format.pdf.translation_config\nimport orjson\nimport pymupdf\nfrom babeldoc.format.pdf.document_il import PdfFont\nfrom rich.logging import RichHandler\n\nlogger = logging.getLogger(__name__)\n\nserif_keywords = [\n    \"serif\",\n]\nsans_serif_keywords = [\"sans\", \"GoNotoKurrent\"]\nserif_regex = \"|\".join(serif_keywords)\nsans_serif_regex = \"|\".join(sans_serif_keywords)\n\n\ndef get_font_metadata(font_path) -> PdfFont:\n    doc = pymupdf.open()\n    page = doc.new_page(width=1000, height=1000)\n    page.insert_font(\"test_font\", font_path)\n    translation_config = babeldoc.format.pdf.translation_config.TranslationConfig(\n        *[None for _ in range(4)], doc_layout_model=1\n    )\n    translation_config.progress_monitor = (\n        babeldoc.format.pdf.high_level.ProgressMonitor(\n            babeldoc.format.pdf.high_level.get_translation_stage(translation_config)\n        )\n    )\n    translation_config.font = font_path\n    il_creater = babeldoc.format.pdf.high_level.ILCreater(translation_config)\n    il_creater.mupdf = doc\n    buffer = io.BytesIO()\n    doc.save(buffer)\n    babeldoc.format.pdf.high_level.start_parse_il(\n        buffer,\n        doc_zh=doc,\n        resfont=\"test_font\",\n        il_creater=il_creater,\n        translation_config=translation_config,\n    )\n\n    il = il_creater.create_il()\n    il_page = il.page[0]\n    font_metadata = il_page.pdf_font[0]\n    return font_metadata\n\n\ndef main():\n    logging.basicConfig(level=logging.INFO, handlers=[RichHandler()])\n    parser = argparse.ArgumentParser(description=\"Get font metadata.\")\n    parser.add_argument(\"assets_repo_path\", type=str, help=\"Path to 
the BabelDOC-Assets repository.\")\n    args = parser.parse_args()\n    repo_path = Path(args.assets_repo_path)\n    assert repo_path.exists(), f\"Assets repo path {repo_path} does not exist.\"\n    assert (repo_path / \"README.md\").exists(), (\n        f\"Assets repo path {repo_path} does not contain a README.md file.\"\n    )\n    assert (repo_path / \"fonts\").exists(), (\n        f\"Assets repo path {repo_path} does not contain a fonts folder.\"\n    )\n    logger.info(f\"Getting font metadata for {repo_path}\")\n\n    metadatas = {}\n    for font_path in list((repo_path / \"fonts\").glob(\"**/*.ttf\")):\n        logger.info(f\"Getting font metadata for {font_path}\")\n        with Path(font_path).open(\"rb\") as f:\n            # Read the file in chunks to handle large files efficiently\n            hash_ = hashlib.sha3_256()\n            while True:\n                chunk = f.read(1024 * 1024)\n                if not chunk:\n                    break\n                hash_.update(chunk)\n        extracted_metadata = get_font_metadata(font_path)\n\n        if re.search(serif_regex, extracted_metadata.name, re.IGNORECASE):\n            serif = 1\n        else:\n            serif = 0\n\n        metadata = {\n            \"file_name\": font_path.name,\n            \"font_name\": extracted_metadata.name,\n            \"encoding_length\": extracted_metadata.encoding_length,\n            \"bold\": extracted_metadata.bold,\n            \"italic\": extracted_metadata.italic,\n            \"monospace\": extracted_metadata.monospace,\n            \"serif\": serif,\n            \"ascent\": extracted_metadata.ascent,\n            \"descent\": extracted_metadata.descent,\n            \"sha3_256\": hash_.hexdigest(),\n            \"size\": font_path.stat().st_size,\n        }\n        metadatas[font_path.name] = metadata\n    metadatas = orjson.dumps(\n        metadatas,\n        option=orjson.OPT_APPEND_NEWLINE | orjson.OPT_INDENT_2 | orjson.OPT_SORT_KEYS,\n    ).decode()\n    
print(f\"FONT METADATA: {metadatas}\")\n    with (repo_path / \"font_metadata.json\").open(\"w\") as f:\n        f.write(metadatas)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "babeldoc/tools/italic_assistance.py",
    "content": "import argparse\nimport json\nimport re\nfrom pathlib import Path\n\nimport orjson\nfrom babeldoc.const import CACHE_FOLDER\nfrom babeldoc.format.pdf.document_il.utils.formular_helper import is_formulas_font\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\nfrom rich.console import Console\nfrom rich.table import Table\n\nWORKING_FOLDER = Path(CACHE_FOLDER) / \"working\"\n\n\ndef find_latest_il_json() -> Path | None:\n    \"\"\"\n    Find the latest il_translated.json file in ~/.cache/babeldoc/ subdirectories.\n\n    Returns:\n        Path to the most recently modified il_translated.json file, or None if not found.\n    \"\"\"\n    base_dir = Path(WORKING_FOLDER)\n    json_files = list(base_dir.glob(\"*/il_translated.json\"))\n\n    if not json_files:\n        return None\n\n    # Sort by modification time (newest first)\n    json_files.sort(key=lambda p: p.stat().st_mtime, reverse=True)\n    return json_files[0]\n\n\ndef extract_fonts_from_paragraph(\n    paragraph: dict, page_font_map: dict[str, tuple[str, str]]\n) -> set[tuple[str, str]]:\n    \"\"\"\n    Extract all font_ids and names used in a paragraph.\n\n    Args:\n        paragraph: The paragraph dictionary\n        page_font_map: Dictionary mapping font_id to (font_id, name) tuples\n\n    Returns:\n        Set of (font_id, name) tuples\n    \"\"\"\n    fonts = set()\n\n    # Check if paragraph has a pdfStyle with font_id\n    if (\n        \"pdf_style\" in paragraph\n        and paragraph[\"pdf_style\"]\n        and \"font_id\" in paragraph[\"pdf_style\"]\n    ):\n        font_id = paragraph[\"pdf_style\"][\"font_id\"]\n        if font_id in page_font_map:\n            fonts.add(page_font_map[font_id])\n\n    # Process paragraph compositions if present\n    if \"pdf_paragraph_composition\" in paragraph:\n        for comp in paragraph[\"pdf_paragraph_composition\"]:\n            # Check different composition types that might contain font information\n\n            # 
Direct pdfCharacter in composition\n            if \"pdf_character\" in comp and comp[\"pdf_character\"]:\n                char = comp[\"pdf_character\"]\n                if \"pdf_style\" in char and \"font_id\" in char[\"pdf_style\"]:\n                    font_id = char[\"pdf_style\"][\"font_id\"]\n                    if font_id in page_font_map:\n                        fonts.add(page_font_map[font_id])\n\n            # PdfLine in composition\n            elif \"pdf_line\" in comp and comp[\"pdf_line\"]:\n                line = comp[\"pdf_line\"]\n                if \"pdf_character\" in line:\n                    for char in line[\"pdf_character\"]:\n                        if \"pdf_style\" in char and \"font_id\" in char[\"pdf_style\"]:\n                            font_id = char[\"pdf_style\"][\"font_id\"]\n                            if font_id in page_font_map:\n                                fonts.add(page_font_map[font_id])\n\n            # PdfFormula in composition\n            elif \"pdf_formula\" in comp and comp[\"pdf_formula\"]:\n                formula = comp[\"pdf_formula\"]\n                if \"pdf_character\" in formula:\n                    for char in formula[\"pdf_character\"]:\n                        if \"pdf_style\" in char and \"font_id\" in char[\"pdf_style\"]:\n                            font_id = char[\"pdf_style\"][\"font_id\"]\n                            if font_id in page_font_map:\n                                fonts.add(page_font_map[font_id])\n\n            # PdfSameStyleCharacters in composition\n            elif (\n                \"pdf_same_style_characters\" in comp\n                and comp[\"pdf_same_style_characters\"]\n            ):\n                same_style = comp[\"pdf_same_style_characters\"]\n                if \"pdf_style\" in same_style and \"font_id\" in same_style[\"pdf_style\"]:\n                    font_id = same_style[\"pdf_style\"][\"font_id\"]\n                    if font_id in page_font_map:\n          
              fonts.add(page_font_map[font_id])\n\n            # PdfSameStyleUnicodeCharacters in composition\n            elif (\n                \"pdf_same_style_unicode_characters\" in comp\n                and comp[\"pdf_same_style_unicode_characters\"]\n            ):\n                same_style_unicode = comp[\"pdf_same_style_unicode_characters\"]\n                if (\n                    \"pdf_style\" in same_style_unicode\n                    and same_style_unicode[\"pdf_style\"] is not None\n                    and \"font_id\" in same_style_unicode[\"pdf_style\"]\n                ):\n                    font_id = same_style_unicode[\"pdf_style\"][\"font_id\"]\n                    if font_id in page_font_map:\n                        fonts.add(page_font_map[font_id])\n\n    return fonts\n\n\ndef find_fonts_by_debug_id(json_path: Path, debug_id_regex: str) -> dict[str, str]:\n    \"\"\"\n    Find all fonts used in paragraphs with matching debug_id.\n\n    Args:\n        json_path: Path to the il_translated.json file\n        debug_id_regex: Regular expression to match debug_id values\n\n    Returns:\n        Dictionary mapping font_ids to font names\n    \"\"\"\n    # Load and parse JSON\n    with json_path.open(\"rb\") as f:\n        doc_data = orjson.loads(f.read())\n\n    # Compile regex pattern (case insensitive)\n    pattern = re.compile(debug_id_regex.strip(\" \\\"'\"), re.IGNORECASE)\n\n    # Set to collect all found font information\n    found_fonts = set()\n\n    # Process each page\n    for page in doc_data.get(\"page\", []):\n        # Create a mapping of font_id to (font_id, name) tuples for this page\n        page_font_map = {}\n        for font in page.get(\"pdf_font\", []):\n            if \"font_id\" in font and \"name\" in font:\n                page_font_map[font[\"font_id\"]] = (font[\"font_id\"], font[\"name\"])\n\n        # Check each paragraph\n        for paragraph in page.get(\"pdf_paragraph\", []):\n            # Check if paragraph 
has debug_id and if it matches the pattern\n            debug_id = paragraph.get(\"debug_id\")\n            if debug_id and pattern.search(debug_id):\n                # Get all fonts used in this paragraph\n                paragraph_fonts = extract_fonts_from_paragraph(paragraph, page_font_map)\n                found_fonts.update(paragraph_fonts)\n\n    # Convert set of tuples to dictionary\n    return dict(found_fonts)\n\n\ndef main():\n    parser = argparse.ArgumentParser(\n        description=\"Extract fonts from paragraphs with matching debug_id\"\n    )\n    parser.add_argument(\n        \"debug_id_regex\", nargs=\"+\", help=\"Regular expression to match debug_id values\"\n    )\n    parser.add_argument(\n        \"--json-path\",\n        help=\"Path to il_translated.json (if not provided, will use the latest file)\",\n    )\n    parser.add_argument(\n        \"--working-folder\",\n        help=\"Path to the working folder containing il_translated.json files\",\n    )\n\n    args = parser.parse_args()\n\n    if args.working_folder:\n        global WORKING_FOLDER\n        WORKING_FOLDER = Path(args.working_folder)\n        if not WORKING_FOLDER.exists():\n            print(f\"Error: Working folder does not exist: {WORKING_FOLDER}\")\n            return 1\n\n    # Determine JSON file path\n    json_path = None\n    if args.json_path:\n        json_path = Path(args.json_path)\n        if not json_path.exists():\n            print(f\"Error: File not found: {json_path}\")\n            return 1\n    else:\n        json_path = find_latest_il_json()\n        if not json_path:\n            print(\"Error: Could not find any il_translated.json file\")\n            return 1\n\n    print(f\"Using JSON file: {json_path}\")\n\n    # Find fonts matching the debug_id pattern\n    fonts = find_fonts_by_debug_id(json_path, \"|\".join(args.debug_id_regex))\n\n    # Output the results\n    if fonts:\n        print(\n            f\"Found {len(fonts)} fonts in paragraphs matching 
debug_id pattern: {args.debug_id_regex}\"\n        )\n        print(json.dumps(fonts, indent=2, ensure_ascii=False))\n    else:\n        print(\n            f\"No fonts found for paragraphs matching debug_id pattern: {args.debug_id_regex}\"\n        )\n\n    fonts = []\n\n    # Read intermediate representation\n    with json_path.open(encoding=\"utf-8\") as f:\n        pdf_data = json.load(f)\n\n    for page_index, page in enumerate(pdf_data[\"page\"]):\n        for paragraph_index, paragraph_content in enumerate(page[\"pdf_paragraph\"]):\n            font_debug_id = paragraph_content[\"debug_id\"]\n            if font_debug_id:\n                # Create page font mapping\n                page_font_map = {}\n                for font in page[\"pdf_font\"]:\n                    if \"font_id\" in font and \"name\" in font:\n                        page_font_map[font[\"font_id\"]] = (font[\"font_id\"], font[\"name\"])\n\n                # Extract fonts from paragraph\n                name_list = []\n                paragraph_fonts = extract_fonts_from_paragraph(\n                    paragraph_content, page_font_map\n                )\n                for _font_id, font_name in paragraph_fonts:\n                    name_list.append(font_name)\n\n                font_list = []\n                for each in fonts:\n                    font_list.append(each[1])\n\n                for each_name in name_list:\n                    if each_name not in font_list:\n                        fonts.append(\n                            (page_index, each_name, paragraph_index, font_debug_id)\n                        )\n\n    # Initialize checker\n    translation_config = TranslationConfig(\n        *[None for _ in range(3)], lang_out=\"zh_cn\", doc_layout_model=1\n    )\n\n    # Create table\n    table = Table(title=\"Font Recognition Results\")\n    table.add_column(\"Page #\", justify=\"center\", style=\"cyan\")\n    table.add_column(\"Paragraph #\", justify=\"center\", 
style=\"cyan\")\n    table.add_column(\"DEBUG_ID\", justify=\"center\", style=\"cyan\")\n    table.add_column(\"Font Name\", style=\"magenta\")\n    table.add_column(\"Recognition Result\", justify=\"center\")\n\n    # Output results\n    for each_font in fonts:\n        page_index, font_name, paragraph_index, font_debug_id = each_font\n\n        if is_formulas_font(font_name, None):\n            table.add_row(\n                str(page_index),\n                str(paragraph_index),\n                str(font_debug_id),\n                font_name,\n                \"[bold red]Formula Font[/bold red]\",\n            )\n        else:\n            table.add_row(\n                str(page_index),\n                str(paragraph_index),\n                str(font_debug_id),\n                font_name,\n                \"[bold blue]Non-Formula Font[/bold blue]\",\n            )\n\n    # Print table\n    console = Console()\n\n    console.print(table)\n\n    return 0\n\n\nif __name__ == \"__main__\":\n    exit(main())\n"
  },
  {
    "path": "babeldoc/tools/italic_recognize_tool.py",
    "content": "# Identify non-formula italic fonts that were incorrectly classified as formulas in BableDOC translation results (intermediate)\n\nimport json\n\nimport babeldoc.tools.italic_assistance as italic_assistance\nfrom babeldoc.format.pdf.document_il.midend.styles_and_formulas import StylesAndFormulas\nfrom babeldoc.format.pdf.translation_config import TranslationConfig\nfrom rich.console import Console\nfrom rich.table import Table\n\nconsole = Console()\n\njson_path = italic_assistance.find_latest_il_json()\n\nfonts = []\n\n# Read intermediate representation\nwith json_path.open(encoding=\"utf-8\") as f:\n    pdf_data = json.load(f)\n\nfor page_index, page in enumerate(pdf_data[\"page\"]):\n    for paragraph_index, paragraph_content in enumerate(page[\"pdf_paragraph\"]):\n        font_debug_id = paragraph_content[\"debug_id\"]\n        if font_debug_id:\n            # Create page font mapping\n            page_font_map = {}\n            for font in page[\"pdf_font\"]:\n                if \"font_id\" in font and \"name\" in font:\n                    page_font_map[font[\"font_id\"]] = (font[\"font_id\"], font[\"name\"])\n\n            # Extract fonts from paragraph\n            name_list = []\n            paragraph_fonts = italic_assistance.extract_fonts_from_paragraph(\n                paragraph_content, page_font_map\n            )\n            for _font_id, font_name in paragraph_fonts:\n                name_list.append(font_name)\n\n            font_list = []\n            for each in fonts:\n                font_list.append(each[1])\n\n            for each_name in name_list:\n                if each_name not in font_list:\n                    fonts.append(\n                        (page_index, each_name, paragraph_index, font_debug_id)\n                    )\n\n# Initialize checker\ntranslation_config = TranslationConfig(\n    *[None for _ in range(3)], lang_out=\"zh_cn\", doc_layout_model=1\n)\nchecker = StylesAndFormulas(translation_config)\n\n# 
Create table\ntable = Table(title=\"Font Recognition Results\")\ntable.add_column(\"Page #\", justify=\"center\", style=\"cyan\")\ntable.add_column(\"Paragraph #\", justify=\"center\", style=\"cyan\")\ntable.add_column(\"DEBUG_ID\", justify=\"center\", style=\"cyan\")\ntable.add_column(\"Font Name\", style=\"magenta\")\ntable.add_column(\"Recognition Result\", justify=\"center\")\n\n# Output results\nfor each_font in fonts:\n    page_index, font_name, paragraph_index, font_debug_id = each_font\n\n    if checker.is_formulas_font(font_name):\n        table.add_row(\n            str(page_index),\n            str(paragraph_index),\n            str(font_debug_id),\n            font_name,\n            \"[bold red]Formula Font[/bold red]\",\n        )\n    else:\n        table.add_row(\n            str(page_index),\n            str(paragraph_index),\n            str(font_debug_id),\n            font_name,\n            \"[bold blue]Non-Formula Font[/bold blue]\",\n        )\n\n# Print table\nconsole.print(table)\n"
  },
  {
    "path": "babeldoc/translator/__init__.py",
    "content": ""
  },
  {
    "path": "babeldoc/translator/cache.py",
    "content": "import json\nimport logging\nimport random\nimport threading\nfrom pathlib import Path\n\nimport peewee\nfrom peewee import SQL\nfrom peewee import AutoField\nfrom peewee import CharField\nfrom peewee import Model\nfrom peewee import SqliteDatabase\nfrom peewee import TextField\nfrom peewee import fn  # For aggregation functions\n\nfrom babeldoc.const import CACHE_FOLDER\n\nlogger = logging.getLogger(__name__)\n\n# we don't init the database here\ndb = SqliteDatabase(None)\n\n# Cleanup configuration\nCLEAN_PROBABILITY = 0.001  # 0.1% chance to trigger cleanup\nMAX_CACHE_ROWS = 50_000  # Keep only the latest 50,000 rows\n\n# Thread-level mutex to ensure only one cleanup runs at a time within the process\n_cleanup_lock = threading.Lock()\n\n\nclass _TranslationCache(Model):\n    id = AutoField()\n    translate_engine = CharField(max_length=20)\n    translate_engine_params = TextField()\n    original_text = TextField()\n    translation = TextField()\n\n    class Meta:\n        database = db\n        constraints = [\n            SQL(\n                \"\"\"\n            UNIQUE (\n                translate_engine,\n                translate_engine_params,\n                original_text\n                )\n            ON CONFLICT REPLACE\n            \"\"\",\n            ),\n        ]\n\n\nclass TranslationCache:\n    @staticmethod\n    def _sort_dict_recursively(obj):\n        if isinstance(obj, dict):\n            return {\n                k: TranslationCache._sort_dict_recursively(v)\n                for k in sorted(obj.keys())\n                for v in [obj[k]]\n            }\n        elif isinstance(obj, list):\n            return [TranslationCache._sort_dict_recursively(item) for item in obj]\n        return obj\n\n    def __init__(self, translate_engine: str, translate_engine_params: dict = None):\n        self.translate_engine = translate_engine\n        self.replace_params(translate_engine_params)\n\n    # The program typically starts 
multi-threaded translation\n    # only after cache parameters are fully configured,\n    # so thread safety doesn't need to be considered here.\n    def replace_params(self, params: dict = None):\n        if params is None:\n            params = {}\n        self.params = params\n        params = self._sort_dict_recursively(params)\n        self.translate_engine_params = json.dumps(params)\n\n    def update_params(self, params: dict = None):\n        if params is None:\n            params = {}\n        self.params.update(params)\n        self.replace_params(self.params)\n\n    def add_params(self, k: str, v):\n        self.params[k] = v\n        self.replace_params(self.params)\n\n    # Since peewee and the underlying sqlite are thread-safe,\n    # get and set operations don't need locks.\n    def get(self, original_text: str) -> str | None:\n        try:\n            result = _TranslationCache.get_or_none(\n                translate_engine=self.translate_engine,\n                translate_engine_params=self.translate_engine_params,\n                original_text=original_text,\n            )\n            # Trigger cache cleanup with a small probability.\n            if result and random.random() < CLEAN_PROBABILITY:  # noqa: S311\n                self._cleanup()\n            return result.translation if result else None\n        except peewee.OperationalError as e:\n            if \"database is locked\" in str(e):\n                logger.debug(\"Cache is locked\")\n                return None\n            else:\n                raise\n\n    def set(self, original_text: str, translation: str):\n        try:\n            _TranslationCache.create(\n                translate_engine=self.translate_engine,\n                translate_engine_params=self.translate_engine_params,\n                original_text=original_text,\n                translation=translation,\n            )\n            # Trigger cache cleanup with a small probability.\n            if random.random() 
< CLEAN_PROBABILITY:  # noqa: S311\n                self._cleanup()\n        except peewee.OperationalError as e:\n            if \"database is locked\" in str(e):\n                logger.debug(\"Cache is locked\")\n            else:\n                raise\n\n    def _cleanup(self) -> None:\n        \"\"\"Remove old cache entries, keeping only the latest MAX_CACHE_ROWS records.\"\"\"\n        # Quick exit if another thread is already performing cleanup.\n        if not _cleanup_lock.acquire(blocking=False):\n            return\n        try:\n            logger.info(\"Cleaning up translation cache...\")\n            max_id = _TranslationCache.select(fn.MAX(_TranslationCache.id)).scalar()\n            # Nothing to do if table is empty or below threshold\n            if not max_id or max_id <= MAX_CACHE_ROWS:\n                return\n            threshold = max_id - MAX_CACHE_ROWS\n            # Delete rows with id *less than or equal* to threshold so that at most MAX_CACHE_ROWS remain.\n            _TranslationCache.delete().where(\n                _TranslationCache.id <= threshold\n            ).execute()\n        finally:\n            _cleanup_lock.release()\n\n\ndef init_db(remove_exists=False):\n    CACHE_FOLDER.mkdir(parents=True, exist_ok=True)\n    # The current version does not support database migration, so add the version number to the file name.\n    cache_db_path = CACHE_FOLDER / \"cache.v1.db\"\n    logger.info(f\"Initializing cache database at {cache_db_path}\")\n    if remove_exists and cache_db_path.exists():\n        cache_db_path.unlink()\n    db.init(\n        cache_db_path,\n        pragmas={\n            \"journal_mode\": \"wal\",\n            \"busy_timeout\": 1000,\n        },\n    )\n    db.create_tables([_TranslationCache], safe=True)\n\n\ndef init_test_db():\n    import tempfile\n\n    temp_file = tempfile.NamedTemporaryFile(suffix=\".db\", delete=False)\n    cache_db_path = temp_file.name\n    temp_file.close()\n\n    test_db = 
SqliteDatabase(\n        cache_db_path,\n        pragmas={\n            \"journal_mode\": \"wal\",\n            \"busy_timeout\": 1000,\n        },\n    )\n    test_db.bind([_TranslationCache], bind_refs=False, bind_backrefs=False)\n    test_db.connect()\n    test_db.create_tables([_TranslationCache], safe=True)\n    return test_db\n\n\ndef clean_test_db(test_db):\n    test_db.drop_tables([_TranslationCache])\n    test_db.close()\n    db_path = Path(test_db.database)\n    if db_path.exists():\n        db_path.unlink()\n    wal_path = Path(str(db_path) + \"-wal\")\n    if wal_path.exists():\n        wal_path.unlink()\n    shm_path = Path(str(db_path) + \"-shm\")\n    if shm_path.exists():\n        shm_path.unlink()\n\n\ninit_db()\n"
  },
  {
    "path": "babeldoc/translator/translator.py",
    "content": "import contextlib\nimport logging\nimport threading\nimport time\nimport unicodedata\nfrom abc import ABC\nfrom abc import abstractmethod\n\nimport httpx\nimport openai\nfrom tenacity import before_sleep_log\nfrom tenacity import retry\nfrom tenacity import retry_if_exception_type\nfrom tenacity import stop_after_attempt\nfrom tenacity import wait_exponential\n\nfrom babeldoc.babeldoc_exception.BabelDOCException import ContentFilterError\nfrom babeldoc.translator.cache import TranslationCache\nfrom babeldoc.utils.atomic_integer import AtomicInteger\n\nlogger = logging.getLogger(__name__)\n\n\ndef remove_control_characters(s):\n    return \"\".join(ch for ch in s if unicodedata.category(ch)[0] != \"C\")\n\n\nclass RateLimiter:\n    \"\"\"\n    A rate limiter using the leaky bucket algorithm to ensure a smooth, constant rate of requests.\n    This implementation is thread-safe and robust against system clock changes.\n    \"\"\"\n\n    def __init__(self, max_qps: int):\n        if max_qps <= 0:\n            raise ValueError(\"max_qps must be a positive number\")\n        self.max_qps = max_qps\n        self.min_interval = 1.0 / max_qps\n        self.lock = threading.Lock()\n        # Use monotonic time to prevent issues with system time changes\n        self.next_request_time = time.monotonic()\n\n    def wait(self, _rate_limit_params: dict = None):\n        \"\"\"\n        Blocks until the next request can be processed, ensuring the rate limit is not exceeded.\n        \"\"\"\n        with self.lock:\n            now = time.monotonic()\n\n            wait_duration = self.next_request_time - now\n            if wait_duration > 0:\n                time.sleep(wait_duration)\n\n            # Update the next allowed request time.\n            # If the limiter has been idle, the next request should start from 'now'.\n            now = time.monotonic()\n            self.next_request_time = (\n                max(self.next_request_time, now) + 
self.min_interval\n            )\n\n    def set_max_qps(self, max_qps: int):\n        \"\"\"\n        Updates the maximum queries per second. This operation is thread-safe.\n        \"\"\"\n        if max_qps <= 0:\n            raise ValueError(\"max_qps must be a positive number\")\n        with self.lock:\n            self.max_qps = max_qps\n            self.min_interval = 1.0 / max_qps\n\n\n_translate_rate_limiter = RateLimiter(5)\n\n\ndef set_translate_rate_limiter(max_qps):\n    _translate_rate_limiter.set_max_qps(max_qps)\n\n\nclass BaseTranslator(ABC):\n    # Due to cache limitations, name should be within 20 characters.\n    # cache.py: translate_engine = CharField(max_length=20)\n    name = \"base\"\n    lang_map = {}\n\n    def __init__(self, lang_in, lang_out, ignore_cache):\n        self.ignore_cache = ignore_cache\n        lang_in = self.lang_map.get(lang_in.lower(), lang_in)\n        lang_out = self.lang_map.get(lang_out.lower(), lang_out)\n        self.lang_in = lang_in\n        self.lang_out = lang_out\n\n        self.cache = TranslationCache(\n            self.name,\n            {\n                \"lang_in\": lang_in,\n                \"lang_out\": lang_out,\n            },\n        )\n\n        self.translate_call_count = 0\n        self.translate_cache_call_count = 0\n\n    def __del__(self):\n        with contextlib.suppress(Exception):\n            logger.info(\n                f\"{self.name} translate call count: {self.translate_call_count}\"\n            )\n            logger.info(\n                f\"{self.name} translate cache call count: {self.translate_cache_call_count}\",\n            )\n\n    def add_cache_impact_parameters(self, k: str, v):\n        \"\"\"\n        Add parameters that affect the translation quality to distinguish the translation effects under different parameters.\n        :param k: key\n        :param v: value\n        \"\"\"\n        self.cache.add_params(k, v)\n\n    def translate(self, text, ignore_cache=False, 
rate_limit_params: dict = None):\n        \"\"\"\n        Translate the text, and the other part should call this method.\n        :param text: text to translate\n        :return: translated text\n        \"\"\"\n        self.translate_call_count += 1\n        if not (self.ignore_cache or ignore_cache):\n            try:\n                cache = self.cache.get(text)\n                if cache is not None:\n                    self.translate_cache_call_count += 1\n                    return cache\n            except Exception as e:\n                logger.debug(f\"try get cache failed, ignore it: {e}\")\n        _translate_rate_limiter.wait()\n        translation = self.do_translate(text, rate_limit_params)\n        if not (self.ignore_cache or ignore_cache):\n            self.cache.set(text, translation)\n        return translation\n\n    def llm_translate(self, text, ignore_cache=False, rate_limit_params: dict = None):\n        \"\"\"\n        Translate the text, and the other part should call this method.\n        :param text: text to translate\n        :return: translated text\n        \"\"\"\n        self.translate_call_count += 1\n        if not (self.ignore_cache or ignore_cache):\n            try:\n                cache = self.cache.get(text)\n                if cache is not None:\n                    self.translate_cache_call_count += 1\n                    return cache\n            except Exception as e:\n                logger.debug(f\"try get cache failed, ignore it: {e}\")\n        _translate_rate_limiter.wait()\n        translation = self.do_llm_translate(text, rate_limit_params)\n        if not (self.ignore_cache or ignore_cache):\n            try:\n                self.cache.set(text, translation)\n            except Exception as e:\n                logger.debug(\n                    f\"try set cache failed, ignore it: {e}, text: {text}, translation: {translation}\"\n                )\n        return translation\n\n    @abstractmethod\n    def 
do_llm_translate(self, text, rate_limit_params: dict = None):\n        \"\"\"\n        Perform the actual LLM translation; subclasses override this method.\n        :param text: text to translate\n        :return: translated text\n        \"\"\"\n        raise NotImplementedError\n\n    @abstractmethod\n    def do_translate(self, text, rate_limit_params: dict = None):\n        \"\"\"\n        Perform the actual translation; subclasses override this method.\n        :param text: text to translate\n        :return: translated text\n        \"\"\"\n        logger.critical(\n            f\"Do not call BaseTranslator.do_translate. \"\n            f\"Translator: {self}. \"\n            f\"Text: {text}. \",\n        )\n        raise NotImplementedError\n\n    def __str__(self):\n        return f\"{self.name} {self.lang_in} {self.lang_out} {self.model}\"\n\n    def get_rich_text_left_placeholder(self, placeholder_id: int | str):\n        return f\"<b{placeholder_id}>\"\n\n    def get_rich_text_right_placeholder(self, placeholder_id: int | str):\n        return f\"</b{placeholder_id}>\"\n\n    def get_formular_placeholder(self, placeholder_id: int | str):\n        return self.get_rich_text_left_placeholder(placeholder_id)\n\n\nclass OpenAITranslator(BaseTranslator):\n    # https://github.com/openai/openai-python\n    name = \"openai\"\n\n    def __init__(\n        self,\n        lang_in,\n        lang_out,\n        model,\n        base_url=None,\n        api_key=None,\n        ignore_cache=False,\n        enable_json_mode_if_requested=False,\n        send_dashscope_header=False,\n        send_temperature=True,\n        reasoning=None,\n    ):\n        super().__init__(lang_in, lang_out, ignore_cache)\n        self.options = {\"temperature\": 0}  # random sampling may break formula markers\n        self.extra_body = {}\n        # if 'gpt-5' in model and 'gpt-5-chat' not in model:\n        #     self.extra_body['reasoning'] = {\n        #         \"effort\": \"minimal\"\n        #     }\n        #     
self.add_cache_impact_parameters(\"reasoning-effort\", 'minimal')\n        self.reasoning = reasoning\n        self.client = openai.OpenAI(\n            base_url=base_url,\n            api_key=api_key,\n            http_client=httpx.Client(\n                limits=httpx.Limits(\n                    max_connections=None, max_keepalive_connections=None\n                ),\n                timeout=60,  # Set a reasonable timeout\n            ),\n        )\n        if send_temperature:\n            self.add_cache_impact_parameters(\"temperature\", self.options[\"temperature\"])\n        self.model = model\n        self.enable_json_mode_if_requested = enable_json_mode_if_requested\n        self.send_dashscope_header = send_dashscope_header\n        self.send_temperature = send_temperature\n        self.add_cache_impact_parameters(\"model\", self.model)\n        self.add_cache_impact_parameters(\"prompt\", self.prompt(\"\"))\n        if self.reasoning:\n            self.extra_body[\"reasoning\"] = {\"effort\": self.reasoning}\n            self.add_cache_impact_parameters(\"reasoning\", self.reasoning)\n        if self.enable_json_mode_if_requested:\n            self.add_cache_impact_parameters(\n                \"enable_json_mode_if_requested\", self.enable_json_mode_if_requested\n            )\n        self.token_count = AtomicInteger()\n        self.prompt_token_count = AtomicInteger()\n        self.completion_token_count = AtomicInteger()\n        self.cache_hit_prompt_token_count = AtomicInteger()\n\n    @retry(\n        retry=retry_if_exception_type(openai.RateLimitError),\n        stop=stop_after_attempt(100),\n        wait=wait_exponential(multiplier=1, min=1, max=15),\n        before_sleep=before_sleep_log(logger, logging.WARNING),\n    )\n    def do_translate(self, text, rate_limit_params: dict = None) -> str:\n        options = {}\n        if self.send_temperature:\n            options.update(self.options)\n\n        response = 
self.client.chat.completions.create(\n            model=self.model,\n            **options,\n            messages=self.prompt(text),\n            extra_body=self.extra_body,\n        )\n        self.update_token_count(response)\n        return response.choices[0].message.content.strip()\n\n    def prompt(self, text):\n        return [\n            {\n                \"role\": \"system\",\n                \"content\": \"You are a professional,authentic machine translation engine.\",\n            },\n            {\n                \"role\": \"user\",\n                \"content\": f\";; Treat next line as plain text input and translate it into {self.lang_out}, output translation ONLY. If translation is unnecessary (e.g. proper nouns, codes, {'{{1}}, etc. '}), return the original text. NO explanations. NO notes. Input:\\n\\n{text}\",\n            },\n        ]\n\n    @retry(\n        retry=retry_if_exception_type(openai.RateLimitError),\n        stop=stop_after_attempt(100),\n        wait=wait_exponential(multiplier=1, min=1, max=15),\n        before_sleep=before_sleep_log(logger, logging.WARNING),\n    )\n    def do_llm_translate(self, text, rate_limit_params: dict = None):\n        if text is None:\n            return None\n\n        options = {}\n        if self.send_temperature:\n            options.update(self.options)\n        # rate_limit_params defaults to None, so guard before calling .get()\n        if self.enable_json_mode_if_requested and (rate_limit_params or {}).get(\n            \"request_json_mode\", False\n        ):\n            options[\"response_format\"] = {\"type\": \"json_object\"}\n\n        extra_headers = {}\n        if self.send_dashscope_header:\n            extra_headers[\"X-DashScope-DataInspection\"] = (\n                '{\"input\": \"disable\", \"output\": \"disable\"}'\n            )\n        try:\n            response = self.client.chat.completions.create(\n                model=self.model,\n                **options,\n                max_tokens=2048,\n                messages=[\n                    {\n                 
       \"role\": \"user\",\n                        \"content\": text,\n                    },\n                ],\n                extra_headers=extra_headers,\n                extra_body=self.extra_body,\n            )\n            self.update_token_count(response)\n            return response.choices[0].message.content.strip()\n        except openai.BadRequestError as e:\n            if (\n                \"系统检测到输入或生成内容可能包含不安全或敏感内容，请您避免输入易产生敏感内容的提示语，感谢您的配合。\"\n                in e.message\n            ):\n                raise ContentFilterError(e.message) from e\n            else:\n                raise\n\n    def update_token_count(self, response):\n        try:\n            if response.usage and response.usage.total_tokens:\n                self.token_count.inc(response.usage.total_tokens)\n            if response.usage and response.usage.prompt_tokens:\n                self.prompt_token_count.inc(response.usage.prompt_tokens)\n            if response.usage and response.usage.completion_tokens:\n                self.completion_token_count.inc(response.usage.completion_tokens)\n            # Support both response.usage.prompt_cache_hit_tokens and response.prompt_tokens_details.cached_tokens\n            hit_count = 0\n            if response.usage and hasattr(response.usage, \"prompt_cache_hit_tokens\"):\n                hit_count = getattr(response.usage, \"prompt_cache_hit_tokens\", 0)\n            if hasattr(response, \"prompt_tokens_details\") and getattr(\n                response.prompt_tokens_details, \"cached_tokens\", 0\n            ):\n                hit_count += getattr(response.prompt_tokens_details, \"cached_tokens\", 0)\n            if hit_count:\n                self.cache_hit_prompt_token_count.inc(hit_count)\n        except Exception as e:\n            logger.exception(\"Error updating token count\")\n\n    def get_formular_placeholder(self, placeholder_id: int | str):\n        return \"{v\" + str(placeholder_id) + \"}\", 
f\"{{\\\\s*v\\\\s*{placeholder_id}\\\\s*}}\"\n\n    def get_rich_text_left_placeholder(self, placeholder_id: int | str):\n        return (\n            f\"<style id='{placeholder_id}'>\",\n            f\"<\\\\s*style\\\\s*id\\\\s*=\\\\s*'\\\\s*{placeholder_id}\\\\s*'\\\\s*>\",\n        )\n\n    def get_rich_text_right_placeholder(self, placeholder_id: int | str):\n        return \"</style>\", r\"<\\s*\\/\\s*style\\s*>\"\n"
  },
  {
    "path": "babeldoc/utils/__init__.py",
    "content": ""
  },
  {
    "path": "babeldoc/utils/atomic_integer.py",
    "content": "import threading\n\n\nclass AtomicInteger:\n    def __init__(self, value=0):\n        self._value = int(value)\n        self._lock = threading.Lock()\n\n    def inc(self, d=1):\n        with self._lock:\n            self._value += int(d)\n            return self._value\n\n    def dec(self, d=1):\n        return self.inc(-d)\n\n    @property\n    def value(self):\n        with self._lock:\n            return self._value\n\n    @value.setter\n    def value(self, v):\n        with self._lock:\n            self._value = int(v)\n            return self._value\n"
  },
  {
    "path": "babeldoc/utils/memory.py",
    "content": "import os\nimport sys\nimport time\nfrom pathlib import Path\n\ntry:\n    import psutil\nexcept ImportError:\n    psutil = None\n\n\ndef _parse_pss_from_smaps_rollup(pid: int) -> int | None:\n    \"\"\"\n    Try to read PSS from /proc/<pid>/smaps_rollup.\n    Returns PSS in bytes, or None if not available/readable.\n    \"\"\"\n    try:\n        smaps_rollup_path = Path(f\"/proc/{pid}/smaps_rollup\")\n        with smaps_rollup_path.open() as f:\n            for line in f:\n                if line.startswith(\"Pss:\"):\n                    # Format: \"Pss:            1234 kB\"\n                    parts = line.split()\n                    if len(parts) >= 2:\n                        pss_kb = int(parts[1])\n                        return pss_kb * 1024  # Convert to bytes\n        return None\n    except (FileNotFoundError, PermissionError, ValueError, OSError):\n        return None\n\n\ndef _parse_pss_from_smaps(pid: int) -> int | None:\n    \"\"\"\n    Try to read PSS from /proc/<pid>/smaps and sum all Pss entries.\n    Returns PSS in bytes, or None if not available/readable.\n    \"\"\"\n    try:\n        smaps_path = Path(f\"/proc/{pid}/smaps\")\n        total_pss_kb = 0\n        with smaps_path.open() as f:\n            for line in f:\n                if line.startswith(\"Pss:\"):\n                    # Format: \"Pss:            1234 kB\"\n                    parts = line.split()\n                    if len(parts) >= 2:\n                        total_pss_kb += int(parts[1])\n        if total_pss_kb > 0:\n            return total_pss_kb * 1024  # Convert to bytes\n        return None\n    except (FileNotFoundError, PermissionError, ValueError, OSError):\n        return None\n\n\ndef _get_pss_linux(pid: int) -> int | None:\n    \"\"\"\n    Try to get PSS on Linux.\n    Priority: smaps_rollup -> smaps -> None\n    Returns PSS in bytes, or None if not available.\n    \"\"\"\n    # Try smaps_rollup first (lightweight)\n    pss = 
_parse_pss_from_smaps_rollup(pid)\n    if pss is not None:\n        return pss\n\n    # Fallback to smaps (heavier)\n    pss = _parse_pss_from_smaps(pid)\n    if pss is not None:\n        return pss\n\n    return None\n\n\ndef _get_rss_psutil(pid: int) -> int | None:\n    \"\"\"\n    Get RSS using psutil for a single process.\n    Returns RSS in bytes, or None if psutil unavailable or process not found.\n    \"\"\"\n    if psutil is None:\n        return None\n\n    try:\n        process = psutil.Process(pid)\n        return process.memory_info().rss\n    except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.TimeoutExpired):\n        return None\n\n\ndef _get_single_process_memory(\n    pid: int, prefer_pss: bool = True, use_smaps_rollup_only: bool = False\n) -> int | None:\n    \"\"\"\n    Get memory usage for a single process (no children).\n\n    Args:\n        pid: Process ID\n        prefer_pss: If True and on Linux, try PSS first; otherwise use RSS\n        use_smaps_rollup_only: If True, only try smaps_rollup (faster), fallback to RSS if not available\n\n    Returns:\n        Memory usage in bytes, or None if all methods fail\n    \"\"\"\n    if sys.platform == \"linux\":\n        if prefer_pss:\n            if use_smaps_rollup_only:\n                # Only try smaps_rollup, then fallback to RSS\n                pss = _parse_pss_from_smaps_rollup(pid)\n                if pss is not None:\n                    return pss\n            else:\n                # Try full PSS (smaps_rollup -> smaps)\n                pss = _get_pss_linux(pid)\n                if pss is not None:\n                    return pss\n\n    # Fallback to RSS\n    return _get_rss_psutil(pid)\n\n\ndef get_memory_usage_bytes(\n    pid: int | None = None,\n    include_children: bool = True,\n    prefer_pss: bool = True,\n) -> int:\n    \"\"\"\n    Get memory usage of a process (and optionally its children).\n\n    On Linux with prefer_pss=True:\n      - Tries /proc/<pid>/smaps_rollup first 
(lightweight)\n      - Falls back to /proc/<pid>/smaps if smaps_rollup unavailable (heavier)\n      - Falls back to psutil RSS if smaps unavailable\n\n    On non-Linux systems or prefer_pss=False:\n      - Uses psutil RSS\n\n    Args:\n        pid: Process ID to monitor. If None, uses current process.\n        include_children: If True, also includes memory of child processes.\n        prefer_pss: If True on Linux, attempts to use PSS; otherwise uses RSS.\n\n    Returns:\n        Total memory usage in bytes (guaranteed non-negative).\n    \"\"\"\n    if pid is None:\n        pid = os.getpid()\n\n    total_memory = 0\n\n    # Determine if we're using smaps (heavier) vs smaps_rollup (lighter)\n    use_smaps_rollup_only = False\n    if sys.platform == \"linux\" and prefer_pss:\n        # If we can read smaps_rollup, use rollup-only mode\n        test_rollup = _parse_pss_from_smaps_rollup(pid)\n        use_smaps_rollup_only = test_rollup is not None\n\n    # Get current process memory\n    memory = _get_single_process_memory(\n        pid, prefer_pss=prefer_pss, use_smaps_rollup_only=use_smaps_rollup_only\n    )\n    if memory is not None:\n        total_memory += memory\n\n    # Get children memory if requested\n    if include_children:\n        if psutil is None:\n            # Cannot get children without psutil\n            return total_memory\n\n        try:\n            parent_process = psutil.Process(pid)\n            children = parent_process.children(recursive=True)\n        except (psutil.NoSuchProcess, psutil.AccessDenied):\n            # Parent process not found or no permission\n            return total_memory\n\n        for child in children:\n            try:\n                child_pid = child.pid\n                child_memory = _get_single_process_memory(\n                    child_pid,\n                    prefer_pss=prefer_pss,\n                    use_smaps_rollup_only=use_smaps_rollup_only,\n                )\n                if child_memory is not 
 None:\n                    total_memory += child_memory\n            except (psutil.NoSuchProcess, psutil.AccessDenied):\n                # Child process died or no permission; skip it\n                pass\n\n    return max(0, total_memory)\n\n\ndef get_memory_usage_with_throttle(\n    pid: int | None = None,\n    include_children: bool = True,\n    prefer_pss: bool = True,\n    last_pss_check_time: float | None = None,\n    pss_throttle_seconds: float = 2.0,\n) -> tuple[int, float | None]:\n    \"\"\"\n    Get memory usage with throttling for PSS checks on Linux.\n\n    When PSS is not available via smaps_rollup and must read smaps (expensive),\n    this throttles checks to at most once per pss_throttle_seconds.\n\n    Args:\n        pid: Process ID. If None, uses current process.\n        include_children: If True, includes child process memory.\n        prefer_pss: If True on Linux, attempts to use PSS.\n        last_pss_check_time: Timestamp of last PSS check. For throttling logic.\n        pss_throttle_seconds: Minimum interval (seconds) between smaps reads.\n\n    Returns:\n        Tuple of (memory_bytes, new_check_time).\n        If throttled, returns a fast RSS-only estimate and the original check time.\n    \"\"\"\n    current_time = time.time()\n\n    # Check if we should throttle\n    if (\n        prefer_pss\n        and sys.platform == \"linux\"\n        and last_pss_check_time is not None\n        and (current_time - last_pss_check_time) < pss_throttle_seconds\n    ):\n        # Throttled: use RSS only as a fast estimate\n        memory = 0\n        pid_to_check = pid if pid is not None else os.getpid()\n        rss = _get_rss_psutil(pid_to_check)\n        if rss is not None:\n            memory += rss\n\n        if include_children and psutil is not None:\n            try:\n                parent_process = psutil.Process(pid_to_check)\n                for child in parent_process.children(recursive=True):\n                    try:\n                        
child_rss = _get_rss_psutil(child.pid)\n                        if child_rss is not None:\n                            memory += child_rss\n                    except (psutil.NoSuchProcess, psutil.AccessDenied):\n                        pass\n            except (psutil.NoSuchProcess, psutil.AccessDenied):\n                pass\n\n        return memory, last_pss_check_time\n\n    # Not throttled: do full check\n    memory = get_memory_usage_bytes(\n        pid=pid, include_children=include_children, prefer_pss=prefer_pss\n    )\n    return memory, current_time\n"
  },
  {
    "path": "babeldoc/utils/priority_thread_pool_executor.py",
    "content": "# thanks to:\n# https://github.com/oleglpts/PriorityThreadPoolExecutor/blob/master/PriorityThreadPoolExecutor/__init__.py\n# https://github.com/oleglpts/PriorityThreadPoolExecutor/issues/4\n\nimport atexit\nimport itertools\nimport logging\nimport queue\nimport random\nimport sys\nimport threading\nimport weakref\nfrom concurrent.futures import _base\nfrom concurrent.futures.thread import BrokenThreadPool\nfrom concurrent.futures.thread import ThreadPoolExecutor\nfrom concurrent.futures.thread import _python_exit\nfrom concurrent.futures.thread import _threads_queues\nfrom concurrent.futures.thread import _WorkItem\nfrom heapq import heappop\nfrom heapq import heappush\n\nlogger = logging.getLogger(__name__)\n\n########################################################################################################################\n#                                                Global variables                                                      #\n########################################################################################################################\n\nNULL_ENTRY = (sys.maxsize, _WorkItem(None, None, (), {}))\n_shutdown = False\n\n########################################################################################################################\n#                                           Before system exit procedure                                               #\n########################################################################################################################\n\n\ndef python_exit():\n    \"\"\"\n\n    Cleanup before system exit\n\n    \"\"\"\n    global _shutdown\n    _shutdown = True\n    items = list(_threads_queues.items())\n    for _t, q in items:\n        q.put(NULL_ENTRY)\n    for t, _q in items:\n        t.join()\n\n\n# change default cleanup\n\n\natexit.unregister(_python_exit)\natexit.register(python_exit)\n\n\nclass PriorityQueue(queue.Queue):\n    \"\"\"Variant of Queue that retrieves open 
entries in priority order (lowest first).\n\n    Entries are typically tuples of the form:  (priority number, data).\n    \"\"\"\n\n    REMOVED = \"<removed-task>\"\n    DEFAULT_PRIORITY = 100\n\n    def _init(self, maxsize):\n        self.queue = []\n        self.entry_finder = {}\n        self.counter = itertools.count()\n\n    def _qsize(self):\n        return len(self.queue)\n\n    def _put(self, item):\n        # heappush(self.queue, item)\n        try:\n            if item[1] in self.entry_finder:\n                self.remove(item[1])\n            count = next(self.counter)\n            entry = [item[0], count, item[1]]\n            self.entry_finder[item[1]] = entry\n            heappush(self.queue, entry)\n        except TypeError:  # handle item==None\n            self._put((self.DEFAULT_PRIORITY, None))\n\n    def remove(self, task):\n        \"\"\"\n        This simply replaces the data with the REMOVED value,\n        which will get cleared out once _get reaches it.\n        \"\"\"\n        entry = self.entry_finder.pop(task)\n        entry[-1] = self.REMOVED\n\n    def _get(self):\n        while self.queue:\n            entry = heappop(self.queue)\n            if entry[2] is not self.REMOVED:\n                del self.entry_finder[entry[2]]\n                return entry\n        return None\n\n\ndef _worker(executor_reference, work_queue, initializer, initargs):\n    if initializer is not None:\n        try:\n            initializer(*initargs)\n        except BaseException:\n            _base.LOGGER.critical(\"Exception in initializer:\", exc_info=True)\n            executor = executor_reference()\n            if executor is not None:\n                executor._initializer_failed()\n            return\n    try:\n        while True:\n            work_item = work_queue.get(block=True)\n            try:\n                if work_item[2] is not None:\n                    work_item[2].run()\n                    # Delete references to object. 
See issue16284\n                    del work_item\n\n                    # attempt to increment idle count\n                    executor = executor_reference()\n                    if executor is not None:\n                        executor._idle_semaphore.release()\n                    del executor\n                    continue\n\n                executor = executor_reference()\n                # Exit if:\n                #   - The interpreter is shutting down OR\n                #   - The executor that owns the worker has been collected OR\n                #   - The executor that owns the worker has been shutdown.\n                if _shutdown or executor is None or executor._shutdown:\n                    # Flag the executor as shutting down as early as possible if it\n                    # is not gc-ed yet.\n                    if executor is not None:\n                        executor._shutdown = True\n                    # Notice other workers\n                    work_queue.put(None)\n                    return\n                del executor\n            finally:\n                work_queue.task_done()\n    except BaseException:\n        _base.LOGGER.critical(\"Exception in worker\", exc_info=True)\n\n\nclass PriorityThreadPoolExecutor(ThreadPoolExecutor):\n    \"\"\"\n    Thread pool executor with priority queue (priorities must be different, lowest first)\n    \"\"\"\n\n    def __init__(self, *args, **kwargs):\n        super().__init__(*args, **kwargs)\n\n        # change work queue type to queue.PriorityQueue\n        self._work_queue: PriorityQueue = PriorityQueue()\n        self._all_future = []\n\n    def submit(self, fn, *args, **kwargs):\n        \"\"\"\n\n        Sending the function to the execution queue\n\n        :param fn: function being executed\n        :type fn: callable\n        :param args: function's positional arguments\n        :param kwargs: function's keywords arguments\n        :return: future instance\n        :rtype: _base.Future\n\n 
       Added keyword:\n\n        - priority (integer less than sys.maxsize; lower values run first)\n\n        \"\"\"\n        with self._shutdown_lock:\n            if self._broken:\n                raise BrokenThreadPool(self._broken)\n\n            if self._shutdown:\n                raise RuntimeError(\"cannot schedule new futures after shutdown\")\n            if _shutdown:\n                raise RuntimeError(\n                    \"cannot schedule new futures after interpreter shutdown\"\n                )\n\n            priority = kwargs.get(\"priority\", random.randint(0, sys.maxsize - 1))  # noqa: S311\n            if \"priority\" in kwargs:\n                del kwargs[\"priority\"]\n\n            f = _base.Future()\n            w = _WorkItem(f, fn, args, kwargs)\n\n            self._work_queue.put((priority, w))\n            self._adjust_thread_count()\n            self._all_future.append(f)\n            return f\n\n    def _adjust_thread_count(self):\n        # if idle threads are available, don't spin new threads\n        if self._idle_semaphore.acquire(timeout=0):\n            return\n\n        # When the executor gets lost, the weakref callback will wake up\n        # the worker threads.\n        def weakref_cb(_, q=self._work_queue):\n            q.put(None)\n\n        num_threads = len(self._threads)\n        if num_threads < self._max_workers:\n            thread_name = f\"{self._thread_name_prefix or self}_{num_threads:d}\"\n            t = threading.Thread(\n                name=thread_name,\n                target=_worker,\n                args=(\n                    weakref.ref(self, weakref_cb),\n                    self._work_queue,\n                    self._initializer,\n                    self._initargs,\n                ),\n            )\n            t.start()\n            self._threads.add(t)\n            _threads_queues[t] = self._work_queue\n\n    def shutdown(self, wait=True, *, cancel_futures=False):\n        logger.debug(\"Shutting down executor %s\", 
self._thread_name_prefix or self)\n        if wait:\n            logger.debug(\n                \"Waiting for all tasks done %s\", self._thread_name_prefix or self\n            )\n            self._work_queue.join()\n            logger.debug(\"All tasks done %s\", self._thread_name_prefix or self)\n\n        with self._shutdown_lock:\n            self._shutdown = True\n            if cancel_futures:\n                # Drain all work items from the queue, and then cancel their\n                # associated futures.\n                while True:\n                    try:\n                        work_item = self._work_queue.get_nowait()\n                    except queue.Empty:\n                        break\n                    if work_item is not None:\n                        work_item.future.cancel()\n\n            # Send a wake-up to prevent threads calling\n            # _work_queue.get(block=True) from permanently blocking.\n            self._work_queue.put(None)\n        if wait:\n            logger.debug(\n                \"Waiting for all thread done %s\", self._thread_name_prefix or self\n            )\n            for t in self._threads:\n                self._work_queue.put(None)\n                t.join()\n        logger.debug(\"shutdown finish %s\", self._thread_name_prefix or self)\n\n    def __del__(self):\n        for f in self._all_future:\n            if f.done() and not f.cancelled():\n                try:\n                    f.result()\n                except Exception as e:\n                    logger.warning(\"Exception in future %s: %s\", f, e, exc_info=True)\n"
  },
  {
    "path": "docs/CODE_OF_CONDUCT.md",
    "content": "# Contributor Covenant Code of Conduct\n\n## Our Pledge\n\nWe as members, contributors, and leaders pledge to make participation in our\ncommunity a harassment-free experience for everyone, regardless of age, body\nsize, visible or invisible disability, ethnicity, sex characteristics, gender\nidentity and expression, level of experience, education, socio-economic status,\nnationality, personal appearance, race, religion, or sexual identity\nand orientation.\n\nWe pledge to act and interact in ways that contribute to an open, welcoming,\ndiverse, inclusive, and healthy community.\n\n## Our Standards\n\nExamples of behavior that contributes to a positive environment for our\ncommunity include:\n\n* Demonstrating empathy and kindness toward other people\n* Being respectful of differing opinions, viewpoints, and experiences\n* Giving and gracefully accepting constructive feedback\n* Accepting responsibility and apologizing to those affected by our mistakes,\n  and learning from the experience\n* Focusing on what is best not just for us as individuals, but for the\n  overall community\n\nExamples of unacceptable behavior include:\n\n* The use of sexualized language or imagery, and sexual attention or\n  advances of any kind\n* Trolling, insulting or derogatory comments, and personal or political attacks\n* Public or private harassment\n* Publishing others' private information, such as a physical or email\n  address, without their explicit permission\n* Other conduct which could reasonably be considered inappropriate in a\n  professional setting\n\n## Enforcement Responsibilities\n\nCommunity leaders are responsible for clarifying and enforcing our standards of\nacceptable behavior and will take appropriate and fair corrective action in\nresponse to any behavior that they deem inappropriate, threatening, offensive,\nor harmful.\n\nCommunity leaders have the right and responsibility to remove, edit, or reject\ncomments, commits, code, wiki edits, issues, 
and other contributions that are\nnot aligned to this Code of Conduct, and will communicate reasons for moderation\ndecisions when appropriate.\n\n## Scope\n\nThis Code of Conduct applies within all community spaces, and also applies when\nan individual is officially representing the community in public spaces.\nExamples of representing our community include using an official e-mail address,\nposting via an official social media account, or acting as an appointed\nrepresentative at an online or offline event.\n\n## Enforcement\n\nInstances of abusive, harassing, or otherwise unacceptable behavior may be\nreported to the community leaders responsible for enforcement at\naw@funstory.ai .\nAll complaints will be reviewed and investigated promptly and fairly.\n\nAll community leaders are obligated to respect the privacy and security of the\nreporter of any incident.\n\n## Enforcement Guidelines\n\nCommunity leaders will follow these Community Impact Guidelines in determining\nthe consequences for any action they deem in violation of this Code of Conduct:\n\n### 1. Correction\n\n**Community Impact**: Use of inappropriate language or other behavior deemed\nunprofessional or unwelcome in the community.\n\n**Consequence**: A private, written warning from community leaders, providing\nclarity around the nature of the violation and an explanation of why the\nbehavior was inappropriate. A public apology may be requested.\n\n### 2. Warning\n\n**Community Impact**: A violation through a single incident or series\nof actions.\n\n**Consequence**: A warning with consequences for continued behavior. No\ninteraction with the people involved, including unsolicited interaction with\nthose enforcing the Code of Conduct, for a specified period of time. This\nincludes avoiding interactions in community spaces as well as external channels\nlike social media. Violating these terms may lead to a temporary or\npermanent ban.\n\n### 3. 
Temporary Ban\n\n**Community Impact**: A serious violation of community standards, including\nsustained inappropriate behavior.\n\n**Consequence**: A temporary ban from any sort of interaction or public\ncommunication with the community for a specified period of time. No public or\nprivate interaction with the people involved, including unsolicited interaction\nwith those enforcing the Code of Conduct, is allowed during this period.\nViolating these terms may lead to a permanent ban.\n\n### 4. Permanent Ban\n\n**Community Impact**: Demonstrating a pattern of violation of community\nstandards, including sustained inappropriate behavior,  harassment of an\nindividual, or aggression toward or disparagement of classes of individuals.\n\n**Consequence**: A permanent ban from any sort of public interaction within\nthe community.\n\n## Attribution\n\nThis Code of Conduct is adapted from the [Contributor Covenant][homepage],\nversion 2.0, available at\nhttps://www.contributor-covenant.org/version/2/0/code_of_conduct.html.\n\nCommunity Impact Guidelines were inspired by [Mozilla's code of conduct\nenforcement ladder](https://github.com/mozilla/diversity).\n\n[homepage]: https://www.contributor-covenant.org\n\nFor answers to common questions about this code of conduct, see the FAQ at\nhttps://www.contributor-covenant.org/faq. Translations are available at\nhttps://www.contributor-covenant.org/translations.\n"
  },
  {
    "path": "docs/CONTRIBUTING.md",
    "content": "# Contributing to BabelDOC\n\n## How to contribute to BabelDOC\n\n### **About Language**\n\n- Issues can be in Chinese or English\n- PRs are limited to English\n- All documents are provided in English only\n\n### **Did you find a bug?**\n\n- **Ensure the bug was not already reported** by searching on GitHub under [Issues](https://github.com/funstory-ai/BabelDOC/issues).\n\nPlease pay special attention to:\n\n1. Known compatibility issues with pdf2zh - see [#20](https://github.com/funstory-ai/BabelDOC/issues/20) for details\n2. Reported edge cases and limitations from downstream applications - see [#23](https://github.com/funstory-ai/BabelDOC/issues/23) for discussion\n\n- If you're unable to find an open issue addressing the problem, [open a new one](https://github.com/funstory-ai/BabelDOC/issues/new?template=bug_report.md). Be sure to include a **title and clear description**, as much relevant information as possible.\n\n### **If you wish to request changes or new features**\n\n- Suggest your change in the [Issues](https://github.com/funstory-ai/BabelDOC/issues/new?template=feature_request.md) section.\n\n### **If you wish to add more translators**\n\n- This project is not intended for direct end-user use, and the supported translators are mainly for debugging purposes. Unless it clearly helps with development and debugging, PRs for directly adding translators will not be accepted.\n- You can directly use [PDFMathTranslate](https://github.com/Byaidu/PDFMathTranslate) to get support for more translators.\n\n### **If you want to add new accelerator support for the layout model**\n\n- This project only plans to support various accelerators through onnxruntime. 
Please submit your accelerator support directly to onnxruntime.\n\n- Additionally, [translation_config.py](https://github.com/funstory-ai/BabelDOC/blob/9e5be3a05c15ecae98024ba695e4a2db1412c062/babeldoc/translation_config.py#L41) shows that the layout model implementation actually used in this project is passed in from outside. You can implement a layout model class according to the relevant interface, and then pass it through this parameter at runtime.\n\n### **If you wish to contribute to BabelDOC**\n\n> [!TIP]\n>\n> If you have any questions about the source code or related matters, please contact the maintainer at aw@funstory.ai .\n> \n> You can also raise questions in [Issues](https://github.com/funstory-ai/BabelDOC/issues).\n> \n> You can contact the maintainers in the pdf2zh discussion group.\n> \n> Due to the current high rate of code changes, this project only accepts small PRs. If you would like to suggest a change and you include a patch as a proof-of-concept, that would be great. However, please do not be offended if we rewrite your patch from scratch.\n>\n> In addition, we do not accept PRs involving the following changes:\n> 1. PRs that modify prompts.\n> 2. Adding GUI or other features directly targeting end users to this project. (Exceptions granted by maintainers in issues are excluded.)\n> 3. PRs that do not comply with this specification.\n> 4. Other PRs that maintainers deem inappropriate.\n>\n> **This project cannot accept all PRs. We recommend that you discuss with the maintainers via [Issue](https://github.com/funstory-ai/BabelDOC/issues) before submitting a PR.**\n\n[//]: # (> We welcome pull requests and will review your contributions.)\n\n\n1. Fork this repository and clone it locally.\n2. Use `doc/deploy.sh` to set up the development environment.\n3. Create a new branch and make code changes on that branch. `git checkout -b feature/<feature-name>`\n4. Perform development and ensure the code meets the requirements.\n\n5. 
Commit your changes to your new branch.\n\n```bash\ngit add .\n\ngit commit -m \"<semantic commit message>\"\n```\n\n6. Push to your repository: `git push origin feature/<feature-name>`.\n\n7. Create a PR on GitHub and provide a detailed description.\n\n8. Ensure all automated checks pass.\n\n#### Basic Requirements\n\n##### Workflow\n\n1. Fork the repository from the main branch and develop on your fork.\n\n- When submitting a Pull Request (PR), please provide a detailed description of the changes.\n\n- If the PR fails automated checks (shown as failed checks with red cross marks), please review the corresponding details and update the submission so that the PR passes the automated checks.\n\n2. Development and Testing\n\n- Use the `uv run BabelDOC` command for development and testing.\n\n- When you need to log output, use `log.debug()`. **DO NOT USE `print()`**\n\n- Code formatting\n\n3. Dependency Updates\n\n- If new dependencies are introduced, please update the dependency list in `pyproject.toml` accordingly.\n\n- It is recommended to use the `uv add` command for adding dependencies.\n\n4. Documentation Updates\n\n- If new command-line options are added, please update the command-line options list in `README.md` accordingly.\n\n5. Commit Messages\n\n- Use [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/), for example: `feat(translator): add openai`.\n\n6. Coding Style\n\n- Please ensure submitted code follows basic coding style guidelines.\n- Use pep8-naming.\n- Comments should be in English.\n- Follow these specific Python coding style guidelines:\n\n  a. Naming Conventions:\n\n  - Class names should use CapWords (PascalCase): `class TranslatorConfig`\n  - Function and variable names should use snake_case: `def process_text()`, `word_count = 0`\n  - Constants should be UPPER_CASE: `MAX_RETRY_COUNT = 3`\n  - Private attributes should start with an underscore: `_internal_state`\n\n  b. 
Code Layout:\n\n  - Use 4 spaces for indentation (no tabs)\n  - Maximum line length is 88 characters (compatible with black formatter)\n  - Add 2 blank lines before top-level classes and functions\n  - Add 1 blank line before class methods\n  - No trailing whitespace\n\n  c. Imports:\n\n  - Imports should be on separate lines: `import os\\nimport sys`\n  - Imports should be grouped in the following order:\n    1.  Standard library imports\n    2.  Related third party imports\n    3.  Local application/library specific imports\n  - Use absolute imports over relative imports\n\n  d. String Formatting:\n\n  - Prefer f-strings for string formatting: `f\"Count: {count}\"`\n  - Use double quotes for docstrings\n\n  e. Type Hints:\n\n  - Use type hints for function arguments and return values\n  - Example: `def translate_text(text: str) -> str:`\n\n  f. Documentation:\n\n  - All public functions and classes must have docstrings\n  - Use Google style for docstrings\n  - Example:\n\n    ```python\n    def function_name(arg1: str, arg2: int) -> bool:\n        \"\"\"Short description of function.\n\n        Args:\n            arg1: Description of arg1\n            arg2: Description of arg2\n\n        Returns:\n            Description of return value\n\n        Raises:\n            ValueError: Description of when this error occurs\n        \"\"\"\n    ```\n\nThe existing codebase does not comply with the above specifications in some aspects. Contributions for modifications are welcome.\n\n#### How to modify the intermediate representation\n\nThe intermediate representation is described by [il_version_1.rnc](https://github.com/funstory-ai/BabelDOC/blob/main/BabelDOC/format/pdf/document_il/il_version_1.rnc). Corresponding Python data classes are generated using [xsdata](https://xsdata.readthedocs.io/en/latest/). 
The files `il_version_1.rng`, `il_version_1.xsd`, and `il_version_1.py` are auto-generated and must not be manually modified.\n\n##### Format RNC file\n\n```bash\ntrang babeldoc/format/pdf/document_il/il_version_1.rnc babeldoc/format/pdf/document_il/il_version_1.rnc\n```\n\n##### Generate RNG, XSD and Python classes\n\n```bash\n# Generate RNG from RNC\ntrang babeldoc/format/pdf/document_il/il_version_1.rnc babeldoc/format/pdf/document_il/il_version_1.rng\n\n# Generate XSD from RNC\ntrang babeldoc/format/pdf/document_il/il_version_1.rnc babeldoc/format/pdf/document_il/il_version_1.xsd\n\n# Generate Python classes from XSD\nxsdata generate babeldoc/format/pdf/document_il/il_version_1.xsd --package babeldoc.format.pdf.document_il\n```\n\n##### Profile memory usage\n\n```bash\nuv run memray run --native --aggregate babeldoc/main.py -c yadt.toml\n```"
  },
  {
    "path": "docs/CONTRIBUTOR_REWARD.md",
    "content": "# BabelDOC/PDFMathTranslate/OneAIFW 贡献者奖励规则\n\n## 月度活跃贡献者奖励规则\n\n### 一、资格标准\n#### **贡献类型要求**\n   - 需提交 **至少 1 个有效 PR**（Pull Request），或进行 **PR 审核、文档编写** 等贡献。\n   - 有效贡献定义：\n     - 非简单的文档错别字修复\n     - 非简单的代码格式化调整（如仅调整缩进、空格等）\n     - 需做出实质性贡献（如功能开发、Bug 修复、性能优化、架构调整、技术文档编写、PR 审核等）\n   - 示例合格贡献：新增功能模块、修复逻辑错误、优化算法效率、编写技术文档等\n\n#### **时间范围**\n   - 每月 1 日至月末最后一天合并的 PR 计入当月统计\n\n### 二、申请流程\n#### **申请条件**\n   - PR 需被成功合并至以下几个仓库：\n     1. [funstory-ai/BabelDOC](https://github.com/funstory-ai/BabelDOC/pulls) 仓库\n     2. [PDFMathTranslate-next/PDFMathTranslate-next](https://github.com/PDFMathTranslate-next/PDFMathTranslate-next) 的主分支。\n     3. [guaguastandup/zotero-pdf2zh](https://github.com/guaguastandup/zotero-pdf2zh) 的主分支\n     4. [funstory-ai/aifw](https://github.com/funstory-ai/aifw) 的主分支\n   - 若目标为 [funstory-ai/BabelDOC](https://github.com/funstory-ai/BabelDOC/pulls) 的 PR 未被合并，但被维护者认定为有价值的概念验证，同样符合条件。\n   - 审核 PR、撰写 wiki 等贡献也必须是以上两个仓库。\n   - 同一贡献者每月仅可申请一次（无论提交 PR 数量）\n   - 同一贡献者每月最多可以获得 1 个兑换码\n   - 对于 PR，只有发起者可以申请兑换码\n   - 仅可使用当月的贡献申请兑换码（特殊情况请联系 aw@funstory.ai 说明）\n\n#### **申请方式**\n   - 发送邮件至 **aw@funstory.ai**\n   - 邮件标题格式：`[贡献者会员兑换码申请] GitHub用户名-月份`（例：`[贡献者会员兑换码申请] awwaawwa-2024-07`）\n   - 邮件正文需包含：\n     - GitHub 用户名\n     - 合并 PR 的完整链接\n   - 附件要求：\n     - PR 页面完整截图（需包含合并状态、仓库名称及点击头像后弹出来的侧边栏，如下图所示）\n\n> [!IMPORTANT]\n>\n> 不满足上述格式要求的邮件会被直接忽略！\n\n![附件示例](https://s.immersivetranslate.com/assets/r2-uploads/images/babeldoc-contributor_reward_example.png)\n\n#### **奖励说明**\n   - 奖励内容：[沉浸式翻译（Immersive Translate）](https://immersivetranslate.com/zh-Hans/pricing/)月度会员兑换码\n   - 兑换码使用：在[沉浸式翻译官网兑换页](https://immersivetranslate.com/zh-Hans/exchange)输入即可激活\n   - 会员权益：沉浸式翻译 Pro 会员一个月（详见[官网价格页](https://immersivetranslate.com/zh-Hans/pricing/)说明）\n   - 兑换码为专属福利，不可转让\n\n### 三、审核与发放\n#### **审核周期**\n   - 我们会尽力在收到申请邮件后 1 个工作日内完成审核\n   - 审核时间可能因申请数量、审核复杂度等因素有所延长\n   - 审核通过后，兑换码将通过邮件方式发送\n   - 若审核未通过，我们会通过邮件说明原因\n\n#### **兑换码规则**\n   - 
使用方式：[官网兑换页](https://immersivetranslate.com/zh-Hans/exchange)输入兑换码激活\n   - 权益内容：月度会员（具体权益见[官网价格页](https://immersivetranslate.com/zh-Hans/pricing/)说明）\n   - 不可转让\n\n### 四、注意事项\n#### **禁止行为**\n   - 将完整功能拆分为多个无关 PR 提交\n   - 提交质量不合格或具有潜在危害的代码\n   - 提供虚假或误导性的申请材料\n\n#### **特别说明**\n   - funstory.ai 保留对贡献价值的评估权、规则的最终解释权等所有必要权利\n   - 规则如有实质性更新（格式调整等除外），将提前 1 天在 [BabelDOC GitHub PR](https://github.com/funstory-ai/BabelDOC/pulls) 公告\n   - 过期未使用的兑换码不予补发\n   - 自 2025 年 2 月 1 日起的贡献可以申请兑换码\n   - 为了确认您是 Pull Request (PR) 的发起者，防止他人冒领，我们可能会要求您使用发起者账号在 PR 下方留言指定的随机数字。\n\n## 常见问题解答（FAQ）\n\n**Q：如何判断文档翻译贡献是否有效？**\n\nA：系统性的人工翻译（如完整章节的翻译并经过人工校对）视为有效贡献。零散段落翻译或仅依赖机器翻译的内容不计入有效贡献。\n\n**Q：兑换码过期了可以补发吗？**\n\n   A：为确保公平性，过期的兑换码将不予补发，请在有效期内及时使用。\n\n**Q：为什么这个文档是中文的？**\n\nA：因为目前应该是中文贡献者多吧，所以就先写中文的。后面再撰写英文版的。\n\n---\n**规则公示**：本规则文档存放于 BabelDOC 仓库 [CONTRIBUTOR_REWARD.md](https://github.com/funstory-ai/BabelDOC/blob/main/docs/CONTRIBUTOR_REWARD.md)，并在 [Contributor Reward - BabelDOC](https://funstory-ai.github.io/BabelDOC/CONTRIBUTOR_REWARD/) 展示。\n"
  },
  {
    "path": "docs/ImplementationDetails/AsyncTranslate/AsyncTranslate.md",
    "content": "# Async Translation API\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we strive for accuracy, there might be inaccuracies. Please report any issues via:\n>\n> - [GitHub Issues](https://github.com/funstory-ai/yadt/issues)\n> - Community contribution (PRs welcome!)\n\n## Overview\n\nThe `yadt.high_level.async_translate` function provides an asynchronous interface for translating PDF files with real-time progress reporting. This function yields progress events that can be used to update progress bars or other UI elements.\n\n## Usage\n\n```python linenums=\"1\"\nasync def translate_with_progress():\n    config = TranslationConfig(\n        input_file=\"example.pdf\",\n        translator=your_translator,\n        # ... other configuration options\n    )\n    \n    try:\n        async for event in async_translate(config):\n            if event[\"type\"] == \"progress_update\":\n                print(f\"Progress: {event['overall_progress']}%\")\n            elif event[\"type\"] == \"finish\":\n                result = event[\"translate_result\"]\n                print(f\"Translation completed: {result.original_pdf_path}\")\n            elif event[\"type\"] == \"error\":\n                print(f\"Error occurred: {event['error']}\")\n                break\n    except asyncio.CancelledError:\n        print(\"Translation was cancelled\")\n    except KeyboardInterrupt:\n        print(\"Translation was interrupted\")\n```\n\n## Event Types\n\nThe function yields different types of events during the translation process:\n\n### 1. Progress Start Event\n\nEmitted when a translation stage begins:\n\n```python\n{\n    \"type\": \"progress_start\",\n    \"stage\": str,              # Name of the current stage\n    \"stage_progress\": float,   # Always 0.0\n    \"stage_current\": int,      # Current progress count (0)\n    \"stage_total\": int         # Total items to process in this stage\n}\n```\n\n### 2. 
Progress Update Event\n\nEmitted periodically during translation (controlled by report_interval, default 0.1s):\n\n```python\n{\n    \"type\": \"progress_update\",\n    \"stage\": str,              # Name of the current stage\n    \"stage_progress\": float,   # Progress percentage of current stage (0-100)\n    \"stage_current\": int,      # Current items processed in this stage\n    \"stage_total\": int,        # Total items to process in this stage\n    \"overall_progress\": float  # Overall translation progress (0-100)\n}\n```\n\n### 3. Progress End Event\n\nEmitted when a stage completes:\n\n```python\n{\n    \"type\": \"progress_end\",\n    \"stage\": str,              # Name of the completed stage\n    \"stage_progress\": float,   # Always 100.0\n    \"stage_current\": int,      # Equal to stage_total\n    \"stage_total\": int,        # Total items processed in this stage\n    \"overall_progress\": float  # Overall translation progress (0-100)\n}\n```\n\n### 4. Finish Event\n\nEmitted when translation completes successfully:\n\n```python\n{\n    \"type\": \"finish\",\n    \"translate_result\": TranslateResult  # Contains paths to translated files and timing info\n}\n```\n\n### 5. Error Event\n\nEmitted if an error occurs during translation:\n\n```python\n{\n    \"type\": \"error\",\n    \"error\": str  # Error message\n}\n```\n\n## Translation Stages\n\nThe translation process goes through the following stages in order:\n\n1. ILCreater\n2. LayoutParser\n3. ParagraphFinder\n4. StylesAndFormulas\n5. ILTranslator\n6. Typesetting\n7. FontMapper\n8. PDFCreater\n\nEach stage will emit its own set of progress events.\n\n## Cancellation\n\nThe translation process can be cancelled in several ways:\n\n1. By raising a `CancelledError` (e.g., when using `asyncio.Task.cancel()`)\n2. Through `KeyboardInterrupt` (e.g., when user presses Ctrl+C)\n3. 
By calling the `translation_config.cancel_translation()` method\n\nExample of programmatic cancellation:\n\n```python linenums=\"1\"\nimport asyncio\n\nfrom yadt.high_level import async_translate\n# Assumption: adjust this import to wherever TranslationConfig lives in your installed version\nfrom yadt.translation_config import TranslationConfig\n\n\nasync def translate_with_cancellation():\n    # `your_translator` is a placeholder for a translator instance you construct\n    config = TranslationConfig(\n        input_file=\"example.pdf\",\n        translator=your_translator,\n        # ... other configuration options\n    )\n\n    try:\n        # Start translation in another task\n        translation_task = asyncio.create_task(process_translation(config))\n\n        # Simulate some condition that requires cancellation\n        await asyncio.sleep(5)\n        config.cancel_translation()  # This will trigger cancellation\n\n        await translation_task  # Wait for the task to finish\n    except asyncio.CancelledError:\n        print(\"Translation was cancelled\")\n\n\nasync def process_translation(config):\n    async for event in async_translate(config):\n        if event[\"type\"] == \"error\":\n            if isinstance(event[\"error\"], asyncio.CancelledError):\n                print(\"Translation was cancelled\")\n                break\n            print(f\"Error occurred: {event['error']}\")\n            break\n        # ... handle other events ...\n```\n\nWhen cancelled:\n- The function will log the cancellation reason\n- All resources will be cleaned up properly\n- Any ongoing translation tasks will be stopped\n- A final error event with `CancelledError` will be emitted\n- The function will exit gracefully\n\n## Error Handling\n\nAny error during translation will:\n1. Be logged with a full traceback (if debug mode is enabled)\n2. Be reported through an error event\n3. Cause the event stream to stop after the error event\n4. Trigger proper resource cleanup before exiting\n\nIt's recommended to handle these events appropriately in your application to provide feedback to users. The example in the Usage section shows a basic error handling pattern. "
  },
  {
    "path": "docs/ImplementationDetails/ILTranslator/ILTranslator.md",
    "content": "# Intermediate Layer Translator\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we strive for accuracy, there might be inaccuracies. Please report any issues via:\n>\n> - [GitHub Issues](https://github.com/funstory-ai/yadt/issues)\n> - Community contribution (PRs welcome!)\n\n## Background\n\nAfter formula and style processing, we need to translate the document while preserving all formatting, formulas, and styles. The intermediate layer translator handles this complex task by using placeholders and style preservation techniques.\n\n## Goal\n\n1. Translate text while preserving document structure\n2. Maintain formulas and special formatting\n3. Handle rich text with different styles\n4. Support concurrent translation for better performance\n\n## Specific Implementation\n\nThe translation process consists of several key steps:\n\n### Step 1: Translation Preparation\n\n1. Process paragraphs:\n   - Skip vertical text\n   - Handle single-component paragraphs directly\n   - Process multi-component paragraphs with placeholders\n\n2. Create placeholders:\n   - Formula placeholders for mathematical expressions\n   - Rich text placeholders for styled text\n   - Ensure placeholder uniqueness within each paragraph\n\n### Step 2: Translation Input Creation\n\n1. Analyze paragraph components:\n   - Regular text components\n   - Formula components\n   - Styled text components\n\n2. Handle special cases:\n   - Skip pure formula paragraphs\n   - Preserve original text when style matches base style\n   - Handle font mapping cases\n\n### Step 3: Translation Execution\n\n1. Concurrent translation:\n   - Use thread pool for parallel processing\n   - Control QPS (Queries Per Second)\n   - Track translation progress\n\n2. Translation tracking:\n   - Record original text\n   - Record translated text\n   - Save tracking information for debugging\n\n### Step 4: Translation Output Processing\n\n1. 
Parse translated text:\n   - Extract text between placeholders\n   - Restore formulas at placeholder positions\n   - Restore rich text with original styles\n\n2. Create new paragraph components:\n   - Maintain style information\n   - Preserve formula positioning\n   - Handle empty text segments\n\n## Additional Features\n\n1. Style preservation:\n   - Maintains original text styles\n   - Handles font size variations\n   - Preserves formatting attributes\n\n2. Formula handling:\n   - Preserves formula integrity\n   - Maintains formula positioning\n   - Supports complex mathematical expressions\n\n3. Debug support:\n   - Translation tracking\n   - JSON output for debugging\n   - Detailed logging\n\n## Limitations\n\n1. Vertical text is not supported\n\n2. Complex nested styles might not be perfectly preserved\n\n3. Placeholder conflicts could occur in rare cases\n\n4. Translation quality depends on external translation engine\n\n## Configuration Options\n\nThe translation process can be customized through `TranslationConfig`:\n\n1. `qps`: Maximum queries per second for translation\n2. `debug`: Enable/disable debug mode and tracking\n3. Translation engine specific settings "
  },
  {
    "path": "docs/ImplementationDetails/PDFCreation/PDFCreation.md",
    "content": "# PDF Creation\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we strive for accuracy, there might be inaccuracies. Please report any issues via:\n>\n> - [GitHub Issues](https://github.com/funstory-ai/yadt/issues)\n> - Community contribution (PRs welcome!)\n\n## Background\n\nAfter translation and typesetting, we need to create the final PDF document that preserves all the formatting, styles, and layout of the original document while containing the translated text. The PDF creation process handles this final step.\n\n## Goal\n\n1. Create a new PDF document with translated content\n2. Preserve all original formatting and styles\n3. Support both monolingual and dual-language output\n4. Maintain font consistency and character encoding\n5. Optimize the output file size and performance\n\n## Specific Implementation\n\nThe PDF creation process consists of several key steps:\n\n### Step 1: Font Management\n\n1. Font initialization:\n   - Add required fonts to the document\n   - Map font identifiers\n   - Handle font encoding lengths\n\n2. Font availability checking:\n   - Check available fonts for each page\n   - Handle XObject font requirements\n   - Manage font resources\n\n3. Font subsetting:\n   - Optimize font usage\n   - Reduce file size\n   - Maintain character support\n\n### Step 2: Content Rendering\n\n1. Character processing:\n   - Handle individual characters\n   - Process character encodings\n   - Manage character positioning\n\n2. Graphics state handling:\n   - Process color spaces\n   - Handle transparency\n   - Manage graphic state instructions\n\n3. XObject management:\n   - Process form XObjects\n   - Handle drawing operations\n   - Maintain XObject hierarchy\n\n### Step 3: Document Assembly\n\n1. Page construction:\n   - Build page content\n   - Process page resources\n   - Handle page boundaries\n\n2. 
Content stream creation:\n   - Generate drawing operations\n   - Handle text positioning\n   - Manage content streams\n\n3. Resource management:\n   - Handle font resources\n   - Manage XObject resources\n   - Process graphic states\n\n### Step 4: Output Generation\n\n1. Monolingual output:\n   - Create translated-only PDF\n   - Optimize file size\n   - Apply compression\n\n2. Dual-language output:\n   - Combine original and translated pages\n   - Handle page ordering\n   - Maintain document structure\n\n3. File optimization:\n   - Apply garbage collection\n   - Enable compression\n   - Optimize for linear reading\n\n## Additional Features\n\n1. Font handling:\n   - Support for CID fonts\n   - Font subsetting\n   - Font resource management\n\n2. Document optimization:\n   - File size reduction\n   - Performance optimization\n   - Resource cleanup\n\n3. Debug support:\n   - Decompressed output\n   - Debug information\n   - Progress tracking\n\n## Limitations\n\n1. Font support:\n   - Limited to available font formats\n   - Font subsetting restrictions\n   - Character encoding constraints\n\n2. File size:\n   - Dual-language output increases size\n   - Font embedding impact\n   - Resource duplication\n\n3. Performance considerations:\n   - Processing time for large documents\n   - Memory usage during creation\n   - Optimization overhead\n\n## Configuration Options\n\nThe PDF creation process can be customized through `TranslationConfig`:\n\n1. Output options:\n   - `no_mono`: Disable monolingual output\n   - `no_dual`: Disable dual-language output\n   - Output file naming patterns\n\n2. Optimization settings:\n   - Compression options\n   - Garbage collection\n   - Font subsetting\n\n3. Debug options:\n   - Debug mode\n   - Decompressed output\n   - Progress tracking "
  },
  {
    "path": "docs/ImplementationDetails/PDFParsing/PDFParsing.md",
    "content": "# PDF Parsing and Intermediate Layer Creation\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we strive for accuracy, there might be inaccuracies. Please report any issues via:\n>\n> - [GitHub Issues](https://github.com/funstory-ai/yadt/issues)\n> - Community contribution (PRs welcome!)\n\n## Background\n\nThe first step in the translation process is to parse the PDF document and create an intermediate layer (IL) representation. This step involves extracting text, styles, formulas, and layout information from the PDF while maintaining their relationships and properties.\n\n## Goal\n\n1. Extract text content while preserving character-level information\n2. Maintain font and style information\n3. Preserve document structure and layout\n4. Handle special elements like XObjects and graphics\n5. Create a structured intermediate representation for later processing\n\n## Specific Implementation\n\nThe parsing process consists of several key components working together:\n\n### Step 1: PDF Interpreter (PDFPageInterpreterEx)\n\n1. Page content processing:\n   - Parse PDF operators and their parameters\n   - Handle graphics state operations\n   - Process text and font operations\n   - Manage XObject rendering\n\n2. Graphics filtering:\n   - Filter non-formula lines\n   - Handle color space operations\n   - Process stroke and fill operations\n\n3. XObject handling:\n   - Process form XObjects\n   - Handle image XObjects\n   - Maintain XObject hierarchy\n\n### Step 2: PDF Converter (PDFConverterEx)\n\n1. Character processing:\n   - Extract character information\n   - Maintain character positions\n   - Preserve style attributes\n\n2. Layout management:\n   - Handle page boundaries\n   - Process figure elements\n   - Manage coordinate systems\n\n3. Font handling:\n   - Map font identifiers\n   - Process font metadata\n   - Handle CID fonts\n\n### Step 3: Intermediate Layer Creator (ILCreater)\n\n1. 
Document structure creation:\n   - Build page hierarchy\n   - Create character objects\n   - Maintain font registry\n\n2. Resource management:\n   - Process font resources\n   - Handle color spaces\n   - Manage graphic states\n\n3. XObject tracking:\n   - Track XObject hierarchy\n   - Maintain XObject states\n   - Process form content\n\n### Step 4: High-level Coordination\n\n1. Process management:\n   - Initialize resources\n   - Coordinate component interactions\n   - Handle progress tracking\n\n2. Resource initialization:\n   - Set up font management\n   - Initialize graphics resources\n   - Prepare document structure\n\n3. Error handling:\n   - Handle malformed content\n   - Manage resource errors\n   - Provide debug information\n\n## Additional Features\n\n1. Font management:\n   - Support for CID fonts\n   - Font metadata extraction\n   - Font mapping capabilities\n\n2. Graphics state tracking:\n   - Color space management\n   - Line style preservation\n   - Transparency handling\n\n3. Coordinate system handling:\n   - Support for transformations\n   - Bounding box calculations\n   - Position normalization\n\n4. Debug support:\n   - Detailed logging\n   - Intermediate file generation\n   - Progress tracking\n\n## Limitations\n\n1. Complex PDF features:\n   - Limited support for some PDF extensions\n   - Simplified graphics model\n   - Basic transparency support\n\n2. Font handling:\n   - Limited support for some font formats\n   - Simplified font metrics\n   - Basic font feature support\n\n3. Performance considerations:\n   - Memory usage for large documents\n   - Processing time for complex layouts\n   - Resource management overhead\n\n## Configuration Options\n\nThe parsing process can be customized through `TranslationConfig`:\n\n1. `debug`: Enable/disable debug mode and intermediate file generation\n2. Font-related settings:\n   - Font mapping configurations\n   - CID font handling options\n3. 
Layout processing options:\n   - Page selection\n   - Content filtering rules "
  },
  {
    "path": "docs/ImplementationDetails/ParagraphFinding/ParagraphFinding.md",
    "content": "# Paragraph Finding\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we strive for accuracy, there might be inaccuracies. Please report any issues via:\n>\n> - [GitHub Issues](https://github.com/funstory-ai/yadt/issues)\n> - Community contribution (PRs welcome!)\n\n## Background\n\nAfter PDF analysis, we need to identify paragraphs from individual characters. This is a crucial step before translation and typesetting, as it helps maintain the logical structure of the document.\n\n## Goal\n\n1. Group characters into meaningful paragraphs while preserving the document's logical structure\n2. Handle special cases like table of contents, short lines, and multi-line paragraphs\n3. Maintain layout information for later typesetting\n\n## Specific Implementation\n\nThe paragraph finding process consists of four main steps:\n\n### Step 1: Create Initial Paragraphs\n\n1. Group characters into lines based on their spatial relationships\n2. Create paragraphs based on layout information and XObject IDs\n3. Characters that don't belong to text layouts are skipped\n\n### Step 2: Process Paragraph Spacing\n\n1. Remove completely empty lines\n2. Handle trailing spaces within lines\n3. Update paragraph boundary boxes and metadata\n\n### Step 3: Calculate Line Width Statistics\n\n1. Calculate the median width of all lines\n2. This information is used for identifying potential paragraph breaks\n\n### Step 4: Process Independent Paragraphs\n\n1. Analyze paragraphs with multiple lines\n2. Split paragraphs in two cases:\n   - When encountering table of contents entries (identified by consecutive dots)\n   - When finding lines significantly shorter than the median width (configurable via `short_line_split_factor`)\n\n## Additional Features\n\n1. Layout-aware processing:\n   - Respects different layout types (plain text, title, figure caption, etc.)\n   - Maintains layout priority order for overlapping regions\n\n2. 
First line indent detection:\n   - Automatically detects and marks paragraphs with first line indentation\n\n3. Flexible character position detection:\n   - Uses multiple position detection modes (middle, topleft, bottomright)\n   - Special handling for characters with unreliable height information\n\n## Limitations\n\n1. The current implementation assumes left-to-right text direction\n\n2. May not perfectly handle complex layouts with overlapping regions\n\n3. Table of contents detection relies on consecutive dots pattern\n\n4. Short line splitting might occasionally create incorrect paragraph breaks\n\n## Configuration Options\n\nThe paragraph finding behavior can be customized through `TranslationConfig`:\n\n1. `split_short_lines`: Enable/disable splitting paragraphs at short lines\n2. `short_line_split_factor`: Threshold factor for short line detection (relative to median width) "
  },
  {
    "path": "docs/ImplementationDetails/README.md",
    "content": "# Implementation Details\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we strive for accuracy, there might be inaccuracies. Please report any issues via:\n>\n> - [GitHub Issues](https://github.com/funstory-ai/yadt/issues)\n> - Community contribution (PRs welcome!)\n\n## Core Processing Flow\n\nMain processing stages in order of actual execution and corresponding documentation:\n\n1. [PDFParser.md](PDFParsing/PDFParsing.md): **PDF Parsing and Intermediate Layer Creation**\n\n2. [LayoutParser](https://github.com/funstory-ai/yadt/blob/main/yadt/document_il/midend/layout_parser.py): **Layout OCR**\n\n3. [ParagraphFinding.md](ParagraphFinding/ParagraphFinding.md): **Paragraph Recognition**\n\n4. [StylesAndFormulas.md](StylesAndFormulas/StylesAndFormulas.md): **Style and Formula Processing**\n\n5. [ILTranslator.md](ILTranslator/ILTranslator.md): **Intermediate Layer Translation**\n\n6. [Typesetting.md](Typesetting/Typesetting.md): **Typesetting Processing**\n\n7. [FontMapper](https://github.com/funstory-ai/yadt/blob/main/yadt/document_il/utils/fontmap.py): **Font Mapping**\n\n8. [PDFCreation.md](PDFCreation/PDFCreation.md): **PDF Generation**\n\n## API\n\n1. [Async Translation API](AsyncTranslate/AsyncTranslate.md): **Async Translation API**\n\n> [!TIP]\n>\n> Click on document links to view detailed implementation principles and configuration options\n"
  },
  {
    "path": "docs/ImplementationDetails/StylesAndFormulas/StylesAndFormulas.md",
    "content": "# Styles and Formulas Processing\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we strive for accuracy, there might be inaccuracies. Please report any issues via:\n>\n> - [GitHub Issues](https://github.com/funstory-ai/yadt/issues)\n> - Community contribution (PRs welcome!)\n\n## Background\n\nAfter paragraph finding, we need to identify formulas and text styles within each paragraph. This step is crucial for maintaining mathematical expressions and text formatting during translation.\n\n## Goal\n\n1. Identify and preserve mathematical formulas\n2. Detect and maintain consistent text styles\n3. Handle special cases like subscripts and superscripts\n4. Calculate proper offsets for formula positioning\n\n## Specific Implementation\n\nThe processing consists of several main steps:\n\n### Step 1: Formula Detection\n\n1. Identify formula characters based on:\n   - Formula-specific fonts\n   - Special Unicode characters\n   - Vertical text\n   - Corner marks (subscripts/superscripts)\n\n2. Group consecutive formula characters into formula units\n\n### Step 2: Formula Processing\n\n1. Process comma-containing formulas:\n   - Split complex formulas at commas when appropriate\n   - Preserve brackets and their contents\n   - Convert simple number-only formulas to regular text\n\n2. Merge overlapping formulas:\n   - Handle cases where subscripts/superscripts are detected as separate formulas\n   - Maintain proper character ordering\n\n### Step 3: Style Analysis\n\n1. Calculate base style for each paragraph:\n   - Find common style attributes across all text\n   - Handle font variations\n   - Process graphic states\n\n2. Group characters with identical styles:\n   - Font properties\n   - Size properties\n   - Graphic state properties\n\n### Step 4: Position Calculation\n\n1. 
Calculate formula offsets:\n   - Compute x-offset relative to surrounding text\n   - Compute y-offset for proper vertical alignment\n   - Handle line spacing variations\n\n## Additional Features\n\n1. Font mapping:\n   - Maps different fonts to standard ones\n   - Special handling for formula fonts\n\n2. Style inheritance:\n   - Maintains style hierarchy\n   - Handles partial style overrides\n\n3. Formula classification:\n   - Distinguishes between translatable and non-translatable formulas\n   - Special handling for numeric formulas with commas\n\n## Limitations\n\n1. Formula detection relies on font and character patterns\n\n2. May not handle all types of mathematical notations\n\n3. Complex subscript/superscript combinations might be misidentified\n\n4. Limited support for vertical formulas\n\n## Configuration Options\n\nThe formula and style processing can be customized through `TranslationConfig`:\n\n1. `formular_font_pattern`: Regex pattern for identifying formula fonts\n2. `formular_char_pattern`: Regex pattern for identifying formula characters "
  },
  {
    "path": "docs/ImplementationDetails/Typesetting/Typesetting.md",
    "content": "# Typography\n\n> [!NOTE]\n> This documentation may contain AI-generated content. While we strive for accuracy, there might be inaccuracies. Please report any issues via:\n>\n> - [GitHub Issues](https://github.com/funstory-ai/yadt/issues)\n> - Community contribution (PRs welcome!)\n\n## Background\n\nAfter translation, text needs to be typeset before placing into PDF.\n\nTranslated paragraphs can contain any combination of the following types:\n\n1. PDF formulas\n\n2. Single PDF original character\n\n3. PDF original string with same style\n\n4. Translated Unicode string with same style\n\nLet's discuss different cases:\n\nFor the following 3 types, they can be directly transmitted transparently to new positions:\n\n1. PDF formulas\n\n2. Single PDF original character\n\n3. PDF original string with same style\n\nOnly \"translated Unicode string with same style\" needs typesetting operation, as this step loses original layout information. However, since paragraphs may contain other components that need transparent transmission, their positions may also change and need to participate in typesetting.\n\n## Goal\n\nTry to fit all components within the original paragraph bounding box. If impossible, try to expand the bounding box in writing direction.\n\n## Specific Implementation\n\nFirst perform reflow judgment to determine if the paragraph needs reflow. If all elements can be transmitted transparently, no reflow is needed. Then, if reflow is needed, execute Algorithm 1:\n\n1. Convert all elements to typesetting unit type, which records length and width information.\n\n2. Start from top-left of original paragraph bounding box, place elements sequentially.\n\n3. If current line cannot fit next element, wrap to next line.\n\n4. Repeat 2-3 until all elements are placed or exceed original bounding box.\n\nAlgorithm 1 works normally when translated text is shorter than original. When translated text is longer, Algorithm 2 needs to be added:\n\n1. 
Initialize the element scaling factor to 1.0.\n\n2. Initialize the line spacing to 1.5.\n\n3. Try typesetting using Algorithm 1.\n\n4. If not all elements fit:\n\n   - First reduce the line spacing in steps of 0.1 until it reaches the minimum line spacing (1.4)\n   - If the elements still do not fit:\n     - When scale > 0.6, reduce element scaling by 0.05\n     - When scale <= 0.6, reduce element scaling by 0.1\n     - Reset line spacing to 1.5\n   - When the scale drops below 0.7, lower the minimum line spacing to 1.1\n\n5. Report an error if the element scaling falls below 0.1.\n\nAlgorithm 2 can fit translations in almost all languages within the original position.\n\nHowever, in special cases such as \"图 1\" translated to \"Figure 1\", some text may still overflow even with the above algorithms. Algorithm 3 handles these cases:\n\n1. Before reducing the scale, first try to expand the bounding box in the writing direction.\n\n2. Calculate the whitespace to the right of the paragraph by:\n\n   - Using 90% of the page crop box width as the maximum limit\n   - Checking for overlapping paragraphs on the right\n   - Checking for overlapping figures on the right\n\n3. Expand the paragraph bounding box based on the available whitespace.\n\n4. If the elements still do not fit, continue with scale reduction as in Algorithm 2.\n\n## Additional Features\n\n1. Mixed Chinese-English text handling:\n\n   - Adds spacing of 0.5 character width at transitions between Chinese and English text\n   - Excludes certain punctuation marks from this spacing rule\n\n2. First line indent:\n\n   - Adds an indent of 2 Chinese character widths to the first line when specified\n\n3. Hanging punctuation:\n\n   - Allows certain punctuation marks to extend beyond the right margin\n   - Helps maintain better visual alignment\n\n## Limitations\n\n1. Currently, PDFPlumber is used for PDF analysis; typesetting is implemented only for paragraphs and only handles left-to-right writing.\n\n2. Cannot handle table of contents entries that are aligned with dot leaders.\n\n3. Performance is poor and needs optimization.\n\n4. 
Global page information is not considered, so text sizes can be inconsistent across paragraphs.\n\n5. Advanced typography features are not supported, which hurts the reading experience.\n\n## Related Resources\n\n[UTR #59: East Asian Spacing](https://www.unicode.org/reports/tr59/) specifies which characters need spacing between them.\n"
  },
  {
    "path": "docs/README.md",
    "content": "YADT Spec\n===\n\n## YADT Document Intermediate Language\n\n[il_version_1.rnc](https://github.com/funstory-ai/yadt/blob/main/yadt/document_il/il_version_1.rnc): The definition of the intermediate language used between PDF parsing and rendering stages.\n\nFor other implementation details, please refer to [Implementation Details](ImplementationDetails/README.md)."
  },
  {
    "path": "docs/deploy.sh",
    "content": "#!/bin/bash\nset -e\n\ncommand_exists() {\n  command -v \"$1\" >/dev/null 2>&1\n}\n\necho \"check uv installed ……\"\nif command_exists uv; then\n  echo \"uv installed !\"\n  exit 0\nfi\n\necho \"uv not install, start installing ……\"\n\nOS=$(uname -s)\ncase \"$OS\" in\n  Linux)\n    if command_exists curl; then\n        curl -LsSf https://astral.sh/uv/install.sh | sh\n    elif command_exists wget; then\n        wget -qO- https://astral.sh/uv/install.sh | sh\n    else\n      echo \"curl or wget not found. uv installed failed.\"\n      exit 1\n    fi\n    ;;\n  Darwin)\n    if command_exists brew; then\n      brew install uv\n    else\n      echo \"Homebrew not installed, please installed uv munally. \"\n      exit 1\n    fi\n    ;;\n  *)\n    echo \"not support OS: $OS\"\n    exit 1\n    ;;\nesac\n\nif command_exists uv; then\n     uv run babeldoc --version\n     pre-commit install\nelse\n  exit 1\nfi\n"
  },
  {
    "path": "docs/example/demo_glossary.csv",
    "content": "source,target,tgt_lng\nAutoML,自动ML,zh-CN\n\"a,a\",a,zh-CN\n\"\"\"\",\"\"\"\",zh-CN"
  },
  {
    "path": "docs/index.md",
    "content": "\n{!README.md!}\n"
  },
  {
    "path": "docs/intro-to-pdf-object.md",
    "content": "An Introduction to PDF Object Definitions in dpml\n===\n\n## 1. Understanding PDF Structure\nA PDF file is fundamentally an indexed collection of objects, where each object represents a structured data unit. The file structure consists of four main components:\n\n1. A header\n2. Object definitions\n3. A cross-reference table\n4. A trailer\n\nThe cross-reference table serves as a lookup directory, mapping each numbered object to its byte offset location within the file. The trailer contains critical metadata, including the location of the root object (document catalog), which serves as the entry point for PDF interpretation. The file concludes with a byte offset pointing to the cross-reference table.\n\nHere's an illustrative example of a PDF file structure:\n\n```pdf\n%PDF-2.0\n1 0 obj\n<<\n  /Pages 2 0 R\n  /Type /Catalog\n>>\nendobj\n2 0 obj\n<<\n  /Count 1\n  /Kids [\n    3 0 R\n  ]\n  /Type /Pages\n>>\nendobj\n3 0 obj\n<<\n  /Contents 4 0 R\n  /MediaBox [ 0 0 612 792 ]\n  /Parent 2 0 R\n  /Resources <<\n    /Font << /F1 5 0 R >>\n  >>\n  /Type /Page\n>>\nendobj\n4 0 obj\n<<\n  /Length 44\n>>\nstream\nBT\n  /F1 24 Tf\n  72 720 Td\n  (Potato) Tj\nET\nendstream\nendobj\n5 0 obj\n<<\n  /BaseFont /Helvetica\n  /Encoding /WinAnsiEncoding\n  /Subtype /Type1\n  /Type /Font\n>>\nendobj\n\nxref\n0 6\n0000000000 65535 f \n0000000009 00000 n \n0000000062 00000 n \n0000000133 00000 n \n0000000277 00000 n \n0000000372 00000 n \ntrailer <<\n  /Root 1 0 R\n  /Size 6\n  /ID [<42841c13bbf709d79a200fa1691836f8><b1d8b5838eeafe16125317aa78e666aa>]\n>>\nstartxref\n478\n%%EOF\n```\n\n### PDF File Interpretation\nWhen a PDF viewer processes a file, it follows these steps:\n\n1. Starts at the file's end to locate the cross-reference table offset\n2. Accesses the cross-reference table to find object locations\n3. Reads the trailer dictionary to identify the document catalog\n4. 
Uses the document catalog to access various document components:\n   - Pages\n   - Outlines\n   - Thumbnails\n   - Annotations\n   - Other PDF elements\n\nThe pages tree root is particularly crucial as it enables navigation to specific pages within the document.\n\n### Example Interpretation Flow\nLet's trace through our example:\n\n1. The cross-reference table begins at byte offset 478 (indicated after `startxref`)\n2. The trailer identifies object 1 as the document catalog (`/Root 1 0 R`)\n3. Object 1 is located at byte offset 9\n4. The document catalog points to object 2 as the pages tree root\n5. Object 2 is found at byte offset 62\n6. The pages tree identifies object 3 as the first page\n7. Object 3 is positioned at byte offset 133\n8. Object 3 defines the page properties and links to object 4 for content\n9. Object 4, at byte offset 277, contains the drawing instructions for rendering \"Potato\"\n\nThis structure enables efficient random access to any part of the PDF document.\n\n## 2. PDF Objects\n\nEarlier, we discussed PDF objects and introduced the concept of dictionaries. At the top level of a PDF file, objects are identified by two numbers followed by the keyword \"obj\". The first number serves as the object number, while the second, known as the generation number, is typically 0. Everything between these identifiers and the \"endobj\" keyword constitutes the object's body.\n\nThe PDF specification provides a mechanism for modifying files by appending object updates and cross-reference table entries. When an object's contents are completely replaced (rather than modified), its generation number can be incremented. This allows object numbers to be reused while preventing old indirect references from resolving to new objects. However, such files are rare in practice, and generation numbers can generally be disregarded. 
In PDF 1.5 and later, objects stored in object streams always have generation number 0.\n\nPDF objects share similarities with data structures found in JSON, YAML, and modern programming languages, though PDF includes some unique object types. Here are the available PDF object types:\n\n- String: A text sequence enclosed in parentheses, e.g., (potato). Note that PDF strings typically don't support full Unicode encoding, though there are specific cases where this is possible. (A detailed discussion of character encoding is beyond our current scope.)\n\n- Number: Both integers and floating-point numbers (e.g., 12, 3.14159). While the PDF specification distinguishes between integers and real numbers, they're often interchangeable in practice: integers can be used where real numbers are expected, and viewers typically handle real numbers appropriately when integers are required.\n\n- Boolean: Simple true/false values\n\n- Null: Represented by the keyword \"null\"\n\n- Name: A keyword or dictionary key identifier starting with a forward slash (/), e.g., /Type\n\n- Array: An ordered collection of objects enclosed in square brackets, with items separated only by whitespace (no commas). Arrays support nested structures, including other arrays and dictionaries. Example: `[1 (two) 3.14 false]`\n\n- Dictionary: A collection of key-value pairs where keys are Names and values can be any object type. Dictionaries are enclosed in << and >>, with entries separated only by whitespace. Example: `<< /A 1 /B [2 3 <</Four 4>>] >>`\n\n- Indirect object reference: A reference to a numbered object in the file, consisting of two numbers (object and generation) followed by 'R', e.g., 1 0 R. 
While some objects must be direct per the PDF specification, most can be defined at the top level and referenced indirectly.\n\n- Stream: A container for binary data, structured as a dictionary (containing at least a /Length key and other format-specific entries) followed by the specified number of bytes between \"stream\" and \"endstream\" keywords. 🔍 The stream length can be specified as an indirect object, enabling single-pass PDF generation where the stream length isn't known in advance—a common practice in PDF creation.\n\n## 3. PDF Object Definitions In dpml\n\n### Coordinate system definition\n\nThe positive x-axis extends horizontally to the right, while the positive y-axis extends vertically upward, following\nstandard mathematical conventions. The unit length along both the x and y axes is defined as 1/72 inch (or 1 point).\n\n## 4. Useful Information\n\n- [PDF32000_2008](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf) page 111: Table 51 - Operator Categories"
  },
  {
    "path": "docs/requirements.txt",
    "content": "sphinx>=8.2.0\nsphinx-click>=5.1.0\nfuro>=2024.1.29\nmyst-parser[linkify,html_meta,html_admonition]>=2.0.0 "
  },
  {
    "path": "docs/supported_languages.md",
    "content": "# Supported Languages\n\nFor languages in the table below that do not rely on ligature support, BabelDOC provides good support. For languages\nthat partially rely on ligatures, BabelDOC's translation results can generally meet self-reading needs. For languages\nthat completely rely on ligatures (such as some Indian languages), BabelDOC does not currently support them.\n\nWe are working hard to develop support for ligatures as soon as possible.\n<!-- | Kazakh (Cyrillic)    | kk            | None                | -->\n\n| Language                        | Language Code | Ligature Dependency |\n|:--------------------------------|:--------------|:--------------------|\n| English                         | EN            | None                |\n| Simplified Chinese              | zh-CN         | None                |\n| Traditional Chinese - Hong Kong | zh-HK         | None                |\n| Traditional Chinese - Taiwan    | zh-TW         | None                |\n| Japanese                        | JA            | None                |\n| Korean                          | KO            | None                |\n| Polish                          | PL            | Partial             |\n| Russian                         | RU            | None                |\n| Spanish                         | es            | None                |\n| Portuguese                      | pt            | None                |\n| French                          | fr            | Partial             |\n| Malay                           | ms            | None                |\n| Indonesian                      | id            | None                |\n| Turkmen                         | tk            | None                |\n| Filipino (Tagalog)              | tl            | None                |\n| Vietnamese                      | vi            | None                |\n| Kazakh (Latin)                  | kk            | None                |\n| German                          
| de            | None                |\n| Dutch                           | nl            | None                |\n| Irish                           | ga            | None                |\n| Italian                         | it            | None                |\n| Greek                           | el            | None                |\n| Swedish                         | sv            | None                |\n| Danish                          | da            | None                |\n| Norwegian                       | no            | None                |\n| Icelandic                       | is            | None                |\n| Finnish                         | fi            | None                |\n| Ukrainian                       | uk            | None                |\n| Czech                           | cs            | None                |\n| Romanian                        | ro            | None                |\n| Hungarian                       | hu            | None                |\n| Slovak                          | sk            | None                |\n| Croatian                        | hr            | None                |\n| Estonian                        | et            | None                |\n| Latvian                         | lv            | None                |\n| Lithuanian                      | lt            | None                |\n| Belarusian                      | be            | None                |\n| Macedonian                      | mk            | None                |\n| Albanian                        | sq            | None                |\n| Serbian (Cyrillic)              | sr            | Partial             |\n| Serbian (Latin)                 | sr            | Partial             |\n| Slovenian                       | sl            | None                |\n| Catalan                         | ca            | None                |\n| Bulgarian                       | bg            | None                |\n| 
Maltese                         | mt            | None                |\n| Swahili                         | sw            | None                |\n| Amharic                         | am            | None                |\n| Oromo                           | om            | None                |\n| Tigrinya                        | ti            | None                |\n| Haitian Creole                  | ht            | None                |\n| Latin                           | la            | None                |\n| Lao                             | lo            | None                |\n| Malayalam                       | ml            | None                |\n| Gujarati                        | gu            | None                |\n| Thai                            | th            | None                |\n| Burmese                         | my            | Partial             |\n| Tamil                           | ta            | None                |\n| Telugu                          | te            | None                |\n| Oriya                           | or            | Partial             |\n| Armenian                        | hy            | None                |\n| Mongolian (Cyrillic)            | mn            | None                |\n| Georgian                        | ka            | None                |\n| Khmer                           | km            | None                |\n| Bosnian                         | bs            | None                |\n| Luxembourgish                   | lb            | None                |\n| Moldovan                        | ro            | None                |\n| Moldovan (Cyrillic)             | ro            | None                |\n| Romansh                         | rm            | None                |\n| Turkish                         | tr            | None                |\n| Sinhala                         | si            | None                |\n| Uzbek                           | uz            | 
None                |\n| Kyrgyz                          | ky            | None                |\n| Tajik                           | tg            | None                |\n| Abkhazian                       | ab            | None                |\n| Afar                            | aa            | None                |\n| Afrikaans                       | af            | None                |\n| Akan                            | ak            | None                |\n| Aragonese                       | an            | None                |\n| Avaric                          | av            | None                |\n| Ewe                             | ee            | None                |\n| Aymara                          | ay            | None                |\n| Ojibwa                          | oj            | None                |\n| Occitan                         | oc            | None                |\n| Oriya                           | or            | None                |\n| Ossetian                        | os            | None                |\n| Pali                            | pi            | None                |\n| Bashkir                         | ba            | None                |\n| Basque                          | eu            | None                |\n| Breton                          | br            | None                |\n| Chamorro                        | ch            | None                |\n| Chechen                         | ce            | None                |\n| Chuvash                         | cv            | None                |\n| Tswana                          | tn            | None                |\n| Ndebele, South                  | nr            | None                |\n| Ndonga                          | ng            | None                |\n| Faroese                         | fo            | None                |\n| Fijian                          | fj            | None                |\n| Frisian, Western         
       | fy            | None                |\n| Ganda                           | lg            | None                |\n| Kongo                           | kg            | None                |\n| Kalaallisut                     | kl            | None                |\n| Church Slavic                   | cu            | None                |\n| Guarani                         | gn            | None                |\n| Interlingua                     | ia            | None                |\n| Herero                          | hz            | None                |\n| Kikuyu                          | ki            | None                |\n| Rundi                           | rn            | None                |\n| Kinyarwanda                     | rw            | None                |\n| Kirghiz                         | ky            | None                |\n| Galician                        | gl            | None                |\n| Kanuri                          | kr            | None                |\n| Cornish                         | kw            | None                |\n| Komi                            | kv            | None                |\n| Xhosa                           | xh            | None                |\n| Corsican                        | co            | None                |\n| Cree                            | cr            | None                |\n| Croatian                        | hr            | None                |\n| Quechua                         | qu            | None                |\n| Kurdish (Latin)                 | ku            | None                |\n| Kuanyama                        | kj            | None                |\n| Limburgan                       | li            | None                |\n| Lingala                         | ln            | None                |\n| Manx                            | gv            | None                |\n| Malagasy                        | mg            | None                |\n| 
Marshallese                     | mh            | None                |\n| Maori                           | mi            | None                |\n| Navajo                          | nv            | None                |\n| Nauru                           | na            | None                |\n| Nyanja                          | ny            | None                |\n| Norwegian Nynorsk               | nn            | None                |\n| Sardinian                       | sc            | None                |\n| Northern Sami                   | se            | None                |\n| Samoan                          | sm            | None                |\n| Sango                           | sg            | None                |\n| Shona                           | sn            | None                |\n| Esperanto                       | eo            | None                |\n| Scottish Gaelic                 | gd            | None                |\n| Somali                          | so            | None                |\n| Southern Sotho                  | st            | None                |\n| Tagalog                         | tl            | None                |\n| Tatar                           | tt            | None                |\n| Tahitian                        | ty            | None                |\n| Tongan                          | to            | None                |\n| Twi                             | tw            | None                |\n| Walloon                         | wa            | None                |\n| Welsh                           | cy            | None                |\n| Venda                           | ve            | None                |\n| Volapük                         | vo            | None                |\n| Interlingue                     | ie            | None                |\n| Hiri Motu                       | ho            | None                |\n| Igbo                            | ig            | 
None                |\n| Ido                             | io            | None                |\n| Inuktitut                       | iu            | None                |\n| Inupiaq                         | ik            | None                |\n| Sichuan Yi                      | ii            | None                |\n| Yoruba                          | yo            | None                |\n| Zhuang                          | za            | None                |\n| Tsonga                          | ts            | None                |\n| Zulu                            | zu            | None                |\n| Brazilian Portuguese            | pt-BR         | None                |\n"
  },
  {
    "path": "mkdocs.yml",
    "content": "# Copyright (c) 2016-2025 Martin Donath <martin.donath@squidfunk.com>\n\n# Permission is hereby granted, free of charge, to any person obtaining a copy\n# of this software and associated documentation files (the \"Software\"), to\n# deal in the Software without restriction, including without limitation the\n# rights to use, copy, modify, merge, publish, distribute, sublicense, and/or\n# sell copies of the Software, and to permit persons to whom the Software is\n# furnished to do so, subject to the following conditions:\n\n# The above copyright notice and this permission notice shall be included in\n# all copies or substantial portions of the Software.\n\n# THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n# FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE\n# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING\n# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS\n# IN THE SOFTWARE.\n\n# Project information\nsite_name: BabelDOC\nsite_url: https://squidfunk.github.io/mkdocs-material/\nsite_author: funstory.ai\nsite_description: >-\n  Write your documentation in Markdown and create a professional static site in\n  minutes – searchable, customizable, in 60+ languages, for all devices\n\n# Repository\nrepo_name: funstory-ai/BabelDOC\nrepo_url: https://github.com/funstory-ai/BabelDOC\nedit_uri: edit/main/docs/\n\n# Copyright\ncopyright: Copyright &copy; 2025 funstory.ai\n\n# Configuration\ntheme:\n  name: material\n  # custom_dir: material/overrides\n  features:\n    - announce.dismiss\n    - content.action.edit\n    - content.action.view\n    - content.code.annotate\n    - content.code.copy\n    - content.code.select\n    # - content.footnote.tooltips\n    # - content.tabs.link\n    - content.tooltips\n    
# - header.autohide\n    # - navigation.expand\n    - navigation.footer\n    - navigation.indexes\n    # - navigation.instant\n    # - navigation.instant.prefetch\n    # - navigation.instant.progress\n    # - navigation.prune\n    - navigation.sections\n    - navigation.tabs\n    # - navigation.tabs.sticky\n    - navigation.top\n    - navigation.tracking\n    - search.highlight\n    - search.share\n    - search.suggest\n    - toc.follow\n    # - toc.integrate\n  palette:\n    - media: \"(prefers-color-scheme)\"\n      toggle:\n        icon: material/brightness-auto\n        name: Switch to light mode\n    - media: \"(prefers-color-scheme: light)\"\n      scheme: default\n      primary: white\n      accent: indigo\n      toggle:\n        icon: material/brightness-7\n        name: Switch to dark mode\n    - media: \"(prefers-color-scheme: dark)\"\n      scheme: slate\n      primary: black\n      accent: indigo\n      toggle:\n        icon: material/brightness-4\n        name: Switch to system preference\n  font:\n    text: Roboto\n    code: Roboto Mono\n  # favicon: assets/favicon.png\n  favicon: images/babeldoc-small-logo-with-transparent-background.svg\n  logo: images/babeldoc-small-logo-with-transparent-background.svg\n\n# Plugins\nplugins:\n  - search:\n      separator: '[\\s\\u200b\\-_,:!=\\[\\]()\"`/]+|\\.(?!\\d)|&[lg]t;|(?!\\b)(?=[A-Z][a-z])'\n  - minify:\n      minify_html: true\n  - git-authors\n  - git-revision-date-localized:\n      enable_creation_date: true\n# Additional configuration\nextra:\n  status:\n    new: Recently added\n    deprecated: Deprecated\n  social:\n    - icon: fontawesome/brands/github\n      link: https://github.com/funstory-ai/BabelDOC\n    - icon: fontawesome/brands/python\n      link: https://pypi.org/project/BabelDOC/\n\n# Extensions\nmarkdown_extensions:\n  - github-callouts\n  - markdown_include.include\n  - pymdownx.highlight:\n      anchor_linenums: true\n      line_spans: __span\n      pygments_lang_class: true\n  - 
pymdownx.inlinehilite\n  - pymdownx.snippets\n  - pymdownx.superfences\n  - def_list\n  - pymdownx.tasklist:\n      custom_checkbox: true\nnot_in_nav: |\n  /tutorials/**/*.md\n\n# Page tree\nnav:\n  - Home: index.md\n  - Supported Languages: supported_languages.md\n  - API:\n    - Async Translation API: ImplementationDetails/AsyncTranslate/AsyncTranslate.md\n  - Implementation Details:\n    - ImplementationDetails/README.md\n    - PDF Parsing: ImplementationDetails/PDFParsing/PDFParsing.md\n    - Layout Parser(.py): https://github.com/funstory-ai/BabelDOC/blob/main/babeldoc/document_il/midend/layout_parser.py\n    - Paragraph Finding: ImplementationDetails/ParagraphFinding/ParagraphFinding.md\n    - Styles and Formulas: ImplementationDetails/StylesAndFormulas/StylesAndFormulas.md\n    - IL Translator: ImplementationDetails/ILTranslator/ILTranslator.md\n    - Typesetting: ImplementationDetails/Typesetting/Typesetting.md\n    - Font Mapper(.py): https://github.com/funstory-ai/BabelDOC/blob/main/babeldoc/document_il/utils/fontmap.py\n    - PDF Creation: ImplementationDetails/PDFCreation/PDFCreation.md\n    - Intro To PDF Object: intro-to-pdf-object.md\n  - Community:\n    - Code of Conduct: CODE_OF_CONDUCT.md\n    - Contributing:\n      - Contributing: CONTRIBUTING.md\n      - Contributor Reward: CONTRIBUTOR_REWARD.md"
  },
  {
    "path": "pyproject.toml",
    "content": "[project]\nname = \"BabelDOC\"\nversion = \"0.5.23\"\ndescription = \"Yet Another Document Translator\"\nlicense = \"AGPL-3.0\"\nreadme = \"README.md\"\nrequires-python = \">=3.10,<3.14\"\nauthors = [\n    { name = \"awwaawwa\", email = \"aw@funstory.ai\" }\n]\nmaintainers = [\n    { name = \"awwaawwa\", email = \"aw@funstory.ai\" }\n]\nclassifiers = [\n    \"Programming Language :: Python :: 3\",\n    \"Operating System :: OS Independent\",\n]\nkeywords = [\"PDF\"]\ndependencies = [\n    \"bitstring>=4.3.0\",\n    \"configargparse>=1.7\",\n    \"httpx[socks]>=0.27.0\",\n    \"huggingface-hub>=0.27.0\",\n    \"numpy>=2.0.2\",\n    \"onnx>=1.18.0\",\n    \"onnxruntime>=1.16.1\",\n    \"openai>=1.59.3\",\n    \"orjson>=3.10.14\",\n    \"charset-normalizer >= 2.0.0\",\n    \"cryptography >= 36.0.0\",\n    #    \"pdfminer-six==20250416\",\n    \"peewee>=3.17.8\",\n    \"psutil>=7.0.0\",\n    \"pymupdf>=1.25.1\",\n    \"rich>=13.9.4\",\n    \"toml>=0.10.2\",\n    \"tqdm>=4.67.1\",\n    \"xsdata[cli,lxml,soap]>=24.12\",\n    \"msgpack>=1.1.0\",\n    \"pydantic>=2.10.6\",\n    \"tenacity>=9.0.0\",\n    \"scikit-image>=0.25.2\",\n    \"freetype-py>=2.5.1\",\n    \"tiktoken>=0.9.0\",\n    \"Levenshtein>=0.27.1\",\n    \"opencv-python-headless>=4.10.0.84\",\n    \"rapidocr-onnxruntime>=1.4.4\",\n    \"pyzstd>=0.17.0\",\n    \"hyperscan>=0.7.13\",\n    \"rtree>=1.4.0\",\n    \"chardet>=5.2.0\",\n    \"scipy>=1.15.3\",\n    \"uharfbuzz>=0.50.2\",\n    \"scikit-learn>=1.7.1\",\n]\n\n[project.optional-dependencies]\ndirectml = [\"onnxruntime-directml>=1.16.1\"]\ncuda = [\"onnxruntime-gpu>=1.16.1\"]\nmemray = [\"memray>=1.17.1\"]\n\n[project.urls]\nHomepage = \"https://github.com/funstory-ai/BabelDOC\"\nIssues = \"https://github.com/funstory-ai/BabelDOC/issues\"\n\n[project.scripts]\nbabeldoc = \"babeldoc.main:cli\"\n\n[build-system]\nrequires = [\"hatchling\"]\nbuild-backend = \"hatchling.build\"\n\n[tool.flake8]\nignore = [\"E203\", \"E261\", \"E501\", \"W503\", 
\"E741\", \"E501\"]\nmax-line-length = 88\n\n[tool.ruff]\nsrc = [\"babeldoc\"]\ntarget-version = \"py310\"\nshow-fixes = true\n\n[tool.ruff.format]\n# Enable reformatting of code snippets in docstrings.\ndocstring-code-format = true\n\n[tool.ruff.lint]\nignore = [\n    \"E203\",   # whitespace before ':'\n    \"E261\",   # at least two spaces before inline comment\n    \"E501\",   # line too long\n    \"E741\",   # ambiguous variable name\n    \"F841\",   # unused variable\n    \"C901\",   # function is too complex\n    \"S101\",   # use assert\n    \"SIM\",    # flake8-simplify\n    \"ARG002\", # unused argument\n    \"S110\",   # `try`-`except`-`pass` detected, consider logging the exception\n    \"B024\",   # abstract class without abstract methods\n    \"S112\",   # `try`-`except`-`continue` detected, consider logging the exception\n    \"COM812\", # missing-trailing-comma\n\n]\nselect = [\n    \"E\",   # pycodestyle errors\n    \"F\",   # Pyflakes\n    \"N\",   # PEP 8 naming\n    \"B\",   # flake8-bugbear\n    \"I\",   # isort\n    \"C\",   # mccabe\n    \"UP\",  # pyupgrade\n    \"S\",   # flake8-bandit\n    \"A\",   # flake8-builtins\n    \"COM\", # flake8-commas\n    \"ARG\", # flake8-unused-arguments\n    \"PTH\", # use pathlib\n]\n\n[tool.ruff.lint.flake8-quotes]\ndocstring-quotes = \"double\"\n\n[tool.ruff.lint.flake8-annotations]\nsuppress-none-returning = true\n\n[tool.ruff.lint.isort]\nforce-single-line = true\n\n[tool.ruff.lint.pydocstyle]\nconvention = \"google\"\n\n# Rule-specific configuration\n[tool.ruff.lint.mccabe]\nmax-complexity = 10 # cyclomatic complexity threshold for functions\n\n[tool.ruff.lint.per-file-ignores]\n\"babeldoc/babeldoc_exception/BabelDOCException.py\" = [\"N999\"]\n\"babeldoc/format/pdf/pdfinterp.py\" = [\"N\"] # ignore naming conventions\n\"tests/*\" = [\"S101\"]            # allow assert in test files\n\"**/__init__.py\" = [\"F401\"]     # allow unused imports\n# Ignore the S311 warning because this use is intentional\n\"babeldoc/format/pdf/document_il/midend/paragraph_finder.py\" = [\"S311\"]\n\"docs/*\" = [\"A001\"]\n\"babeldoc/pdfminer/*\" = [\"A\", \"F\", \"I\", \"N\", \"S\", \"B\", \"C\", \"COM\", \"ARG\", \"PTH\", \"UP\"]\n[dependency-groups]\ndev = [\n   
 \"bumpver>=2024.1130\",\n    \"markdown-callouts>=0.4.0\",\n    \"markdown-include>=0.8.1\",\n    \"mkdocs-git-authors-plugin>=0.9.2\",\n    \"mkdocs-git-committers-plugin-2>=2.5.0\",\n    \"mkdocs-git-revision-date-localized-plugin>=1.3.0\",\n    \"mkdocs-material[recommended]>=9.6.4\",\n    \"pre-commit>=4.1.0\",\n    \"pygments>=2.19.1\",\n    \"ruff>=0.9.2\",\n    \"pytest>=8.3.4\",\n    \"pylance>=0.29.0\",\n    \"py-spy>=0.4.0\",\n]\n\n[tool.pytest.ini_options]\npythonpath = [\".\", \"src\"]\ntestpaths = [\"tests\"]\n\n[bumpver]\ncurrent_version = \"0.5.23\"\nversion_pattern = \"MAJOR.MINOR.PATCH[.PYTAGNUM]\"\n\n[bumpver.file_patterns]\n\"pyproject.toml\" = [\n    'current_version = \"{version}\"',\n    'version = \"{version}\"'\n]\n\"babeldoc/__init__.py\" = [\n    '__version__ = \"{version}\"'\n]\n\"babeldoc/main.py\" = [\n    '__version__ = \"{version}\"'\n]\n\"babeldoc/const.py\" = [\n    '__version__ = \"{version}\"'\n]\n\n[tool.uv.sources]\nyadt = { path = \".\", editable = true }\n\n[tool.pyright]\npythonVersion = \"3.10\"\n# typeCheckingMode = \"off\"\nreportGeneralTypeIssues = false\nreportUnknownVariableType = false\nreportMissingParameterType = false\nreportUnknownParameterType = false\n"
  },
  {
    "path": "tests/test_translation_cache_cleanup.py",
    "content": "from concurrent.futures import ThreadPoolExecutor\n\nfrom babeldoc.translator.cache import TranslationCache\nfrom babeldoc.translator.cache import _TranslationCache\nfrom babeldoc.translator.cache import clean_test_db\nfrom babeldoc.translator.cache import init_test_db\n\n\ndef _prepare_records(cache: TranslationCache, num_records: int) -> None:\n    \"\"\"Insert *num_records* unique records into the cache.\"\"\"\n    for i in range(num_records):\n        cache.set(f\"text_{i}\", f\"translation_{i}\")\n\n\ndef test_cleanup_under_limit(monkeypatch):\n    \"\"\"When total rows < MAX_CACHE_ROWS, cleanup should do nothing.\"\"\"\n    # Create an isolated test database\n    test_db = init_test_db()\n    try:\n        cache = TranslationCache(\"dummy\")\n        # Make cleanup run every time for deterministic behaviour\n        monkeypatch.setattr(\"babeldoc.translator.cache.CLEAN_PROBABILITY\", 1.0)\n        # Lower the MAX_CACHE_ROWS threshold for quick test execution\n        monkeypatch.setattr(\"babeldoc.translator.cache.MAX_CACHE_ROWS\", 1000)\n\n        _prepare_records(cache, 900)\n        cache.set(\"extra\", \"extra\")  # This triggers cleanup\n        assert _TranslationCache.select().count() == 901\n    finally:\n        clean_test_db(test_db)\n\n\ndef test_cleanup_over_limit(monkeypatch):\n    \"\"\"When rows > MAX_CACHE_ROWS, cleanup should trim to the limit.\"\"\"\n    test_db = init_test_db()\n    try:\n        cache = TranslationCache(\"dummy\")\n        monkeypatch.setattr(\"babeldoc.translator.cache.CLEAN_PROBABILITY\", 1.0)\n        monkeypatch.setattr(\"babeldoc.translator.cache.MAX_CACHE_ROWS\", 500)\n\n        total_records = 750\n        _prepare_records(cache, total_records)\n        cache.set(\"extra\", \"extra\")\n\n        assert _TranslationCache.select().count() <= 500  # capped at limit\n    finally:\n        clean_test_db(test_db)\n\n\ndef test_cleanup_thread_safety(monkeypatch):\n    \"\"\"Multiple threads attempting 
cleanup concurrently should not raise errors.\"\"\"\n    test_db = init_test_db()\n    try:\n        cache = TranslationCache(\"dummy\")\n        monkeypatch.setattr(\"babeldoc.translator.cache.CLEAN_PROBABILITY\", 1.0)\n        monkeypatch.setattr(\"babeldoc.translator.cache.MAX_CACHE_ROWS\", 500)\n\n        def task(n):\n            cache.set(f\"text_{n}\", f\"translation_{n}\")\n\n        # Use a pool of threads to stress cleanup\n        with ThreadPoolExecutor(max_workers=10) as executor:\n            # Consume the iterator so exceptions raised in worker threads propagate\n            list(executor.map(task, range(600)))\n\n        # After all threads complete, ensure table size is capped\n        assert _TranslationCache.select().count() <= 500\n    finally:\n        clean_test_db(test_db)\n"
  }
]