Repository: Unstructured-IO/unstructured-inference
Branch: main
Commit: 56fadb3e9593
Files: 62
Total size: 14.6 MB

Directory structure:
gitextract_bzec1cqm/

├── .github/
│   ├── dependabot.yml
│   └── workflows/
│       ├── ci.yml
│       ├── claude.yml
│       ├── create_issue.yml
│       ├── release.yml
│       └── version-bump.yml
├── .gitignore
├── .pre-commit-config.yaml
├── CHANGELOG.md
├── Dockerfile
├── LICENSE
├── Makefile
├── README.md
├── benchmarks/
│   ├── __init__.py
│   └── test_benchmark_yolox.py
├── examples/
│   └── ocr/
│       ├── engine.py
│       ├── requirements.txt
│       └── validate_ocr_performance.py
├── logger_config.yaml
├── pyproject.toml
├── renovate.json
├── sample-docs/
│   └── loremipsum.tiff
├── scripts/
│   ├── docker-build.sh
│   ├── shellcheck.sh
│   ├── test-unstructured-ingest-helper.sh
│   └── version-sync.sh
├── test_unstructured_inference/
│   ├── conftest.py
│   ├── inference/
│   │   ├── test_layout.py
│   │   ├── test_layout_element.py
│   │   └── test_layout_rotation.py
│   ├── models/
│   │   ├── test_detectron2onnx.py
│   │   ├── test_eval.py
│   │   ├── test_model.py
│   │   ├── test_tables.py
│   │   └── test_yolox.py
│   ├── test_config.py
│   ├── test_elements.py
│   ├── test_logger.py
│   ├── test_math.py
│   ├── test_utils.py
│   └── test_visualization.py
└── unstructured_inference/
    ├── __init__.py
    ├── __version__.py
    ├── config.py
    ├── constants.py
    ├── inference/
    │   ├── __init__.py
    │   ├── elements.py
    │   ├── layout.py
    │   ├── layoutelement.py
    │   └── pdf_image.py
    ├── logger.py
    ├── math.py
    ├── models/
    │   ├── __init__.py
    │   ├── base.py
    │   ├── detectron2onnx.py
    │   ├── eval.py
    │   ├── table_postprocess.py
    │   ├── tables.py
    │   ├── unstructuredmodel.py
    │   └── yolox.py
    ├── utils.py
    └── visualize.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/dependabot.yml
================================================
version: 2
updates:
  - package-ecosystem: "uv"
    directory: "/"
    schedule:
      interval: "monthly"

  - package-ecosystem: "github-actions"
    # NOTE(robinson) - Workflow files stored in the
    # default location of `.github/workflows`
    directory: "/"
    schedule:
      interval: "monthly"


================================================
FILE: .github/workflows/ci.yml
================================================
name: CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

permissions:
  contents: read

jobs:
  lint:
    runs-on: opensource-linux-8core
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.11", "3.12", "3.13"]
    steps:
    - uses: actions/checkout@v4
    - name: Install uv
      uses: astral-sh/setup-uv@v5
      with:
        enable-cache: true
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install lint dependencies
      run: make install-lint
    - name: Lint
      run: make check

  shellcheck:
    runs-on: opensource-linux-8core
    steps:
      - uses: actions/checkout@v4
      - name: ShellCheck
        uses: ludeeus/action-shellcheck@master

  test:
    runs-on: opensource-linux-8core
    needs: lint
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.11", "3.12", "3.13"]
    steps:
    - uses: actions/checkout@v4
    - name: Install uv
      uses: astral-sh/setup-uv@v5
      with:
        enable-cache: true
    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install system dependencies
      run: |
        sudo apt-get update
        sudo apt-get -y install poppler-utils tesseract-ocr
    - name: Install dependencies
      run: make install
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v4
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: us-east-2
    - name: Test
      env:
        UNSTRUCTURED_HF_TOKEN: ${{ secrets.HF_TOKEN }}
      run: |
        aws s3 cp s3://utic-dev-models/ci_test_model/test_ci_model.onnx test_unstructured_inference/models/
        CI=true make test
        make check-coverage

  changelog:
    runs-on: opensource-linux-8core
    steps:
    - uses: actions/checkout@v4
    - if: github.ref != 'refs/heads/main'
      uses: dorny/paths-filter@v2
      id: changes
      with:
        filters: |
          src:
            - 'unstructured_inference/**'

    - if: steps.changes.outputs.src == 'true' && github.ref != 'refs/heads/main'
      uses: dangoslen/changelog-enforcer@v3


================================================
FILE: .github/workflows/claude.yml
================================================
name: Claude Code

on:
  issue_comment:
    types: [created]
  pull_request_review_comment:
    types: [created]
  issues:
    types: [opened, assigned]
  pull_request_review:
    types: [submitted]

jobs:
  claude:
    if: |
      (github.event_name == 'issue_comment' && contains(github.event.comment.body, '@claude')) ||
      (github.event_name == 'pull_request_review_comment' && contains(github.event.comment.body, '@claude')) ||
      (github.event_name == 'pull_request_review' && contains(github.event.review.body, '@claude')) ||
      (github.event_name == 'issues' && (contains(github.event.issue.body, '@claude') || contains(github.event.issue.title, '@claude')))
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: read
      issues: read
      id-token: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
        with:
          fetch-depth: 1

      - name: Run Claude Code
        id: claude
        uses: anthropics/claude-code-action@beta
        with:
          anthropic_api_key: ${{ secrets.GH_ANTHROPIC_API_KEY }}
          allowed_tools: "Bash(git:*),View,GlobTool,GrepTool,BatchTool"


================================================
FILE: .github/workflows/create_issue.yml
================================================
name: create_jira_issue

on:
  issues:
    types:
      - opened

jobs:
  create:
    runs-on: ubuntu-latest
    name: Create JIRA Issue
    steps:

    - name: Login to Jira
      uses: atlassian/gajira-login@v3
      env:
        JIRA_BASE_URL: ${{ secrets.JIRA_BASE_URL }}
        JIRA_USER_EMAIL: ${{ secrets.JIRA_USER_EMAIL }}
        JIRA_API_TOKEN: ${{ secrets.JIRA_API_TOKEN }}

    - name: Create Jira issue
      uses: atlassian/gajira-create@v3
      with:
        project: CORE
        issuetype: Task
        summary: ${{ github.event.issue.title }}
        description: |
          Created from github issue: ${{ github.event.issue.html_url }}
          ----
          ${{ github.event.issue.body }}
        fields: '{ "labels": ["github-issue"] }'

    - name: Log created issue
      run: echo "Issue ${{ steps.create.outputs.issue }} was created"


================================================
FILE: .github/workflows/release.yml
================================================
name: Release

on:
  release:
    types: [published]

permissions:
  contents: read
  id-token: write       # Required for PyPI trusted publishing / attestations

concurrency:
  group: release
  cancel-in-progress: false

jobs:
  release:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@34e114876b0b11c390a56381ad16ebd13914f8d5 # v4

    - name: Install uv
      uses: astral-sh/setup-uv@d4b2f3b6ecc6e67c4457f6d3e41ec42d3d0fcb86 # v5
      with:
        enable-cache: true

    - name: Set up Python
      uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5
      with:
        python-version: "3.12"

    - name: Verify tag matches package version
      run: |
        PKG_VERSION=$(python -c "exec(open('unstructured_inference/__version__.py').read()); print(__version__)")
        TAG_VERSION="${GITHUB_REF_NAME#v}"
        if [ "$PKG_VERSION" != "$TAG_VERSION" ]; then
          echo "::error::Tag ($TAG_VERSION) does not match package version ($PKG_VERSION)"
          exit 1
        fi

    - name: Install release dependencies
      run: uv sync --locked --only-group release --no-install-project

    - name: Build package
      id: build
      run: uv build

    - name: Publish to PyPI
      uses: pypa/gh-action-pypi-publish@ed0c53931b1dc9bd32cbe73a98c7f6766f8a527e # release/v1

    # Best-effort: attempt Azure upload even if PyPI fails, but only if build succeeded.
    # continue-on-error allows the workflow to pass when Azure secrets are not configured.
    - name: Publish to Azure Artifacts
      if: always() && steps.build.outcome == 'success'
      continue-on-error: true
      run: |
        uv run --no-sync twine upload \
          --repository-url "${{ secrets.AZURE_ARTIFACTS_FEED }}" \
          --username "${{ secrets.AZURE_ARTIFACTS_USERNAME }}" \
          --password "${{ secrets.AZURE_ARTIFACTS_PAT }}" \
          dist/*


================================================
FILE: .github/workflows/version-bump.yml
================================================
name: Version Bump

on:
  pull_request:
    branches: [main]
    types: [opened, synchronize, reopened]

permissions:
  contents: write
  pull-requests: read

jobs:
  version-bump:
    if: github.event.pull_request.user.login == 'utic-renovate[bot]'
    uses: Unstructured-IO/infra/.github/workflows/version-bump.yml@main
    with:
      component-paths: '["."]'
      default-bump: patch
      update-changelog: true
      update-lockfile: true
      renovate-app-id: ${{ vars.RENOVATE_APP_ID }}
    secrets:
      token: ${{ secrets.GITHUB_TOKEN }}
      private-pypi-url: ${{ secrets.PRIVATE_PYPI_INDEX_URL }}
      renovate-app-private-key: ${{ secrets.RENOVATE_APP_PRIVATE_KEY }}


================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints
nbs/

# IPython
profile_default/
ipython_config.py

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# Pycharm
.idea/

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# Model artifacts
.models/*
!.models/.gitkeep

# Mac stuff
.DS_Store

# VSCode
.vscode/

sample-docs/*_images
examples/**/output
figures


================================================
FILE: .pre-commit-config.yaml
================================================
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: "v5.0.0"
    hooks:
      - id: check-added-large-files
      - id: check-toml
      - id: check-yaml
      - id: check-json
      - id: check-xml
      - id: end-of-file-fixer
        exclude: \.json$
        files: \.py$
      - id: trailing-whitespace
      - id: mixed-line-ending

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: "v0.15.0"
    hooks:
      - id: ruff
        args: ["--fix"]
      - id: ruff-format


================================================
FILE: CHANGELOG.md
================================================
## 1.6.11

### Enhancement
- Add `table_extraction_method` field to `LayoutElements` and `LayoutElement` to track which algorithm produced a table (grid, tatr, vlm).

## 1.6.10

### Enhancement
- Add Python 3.13 support.

## 1.6.9

### Enhancement
- Restore support for Python 3.11 alongside Python 3.12.

## 1.6.8

### Fix
- Reject PDF pages that would render beyond the configured pixel limit before
  allocating the page bitmap.

## 1.6.7

### Fix
- `get_model` now materializes `LazyDict` model configs into a plain dict before
  unpacking into `initialize(**...)`. Uses `__iter__` + `__getitem__` to avoid
  depending on `Mapping.keys()`, which has been observed to fail at `**`
  unpacking with "argument after ** must be a mapping, not LazyDict" in some
  deployment environments.
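
  The materialization step can be sketched as follows. `LazyDict` here is a toy stand-in with assumed behavior (it deliberately does not subclass `Mapping`, mirroring the failure mode), not the library's actual class:

  ```python
  class LazyDict:
      """Toy stand-in for a lazy mapping: values are computed on first access."""

      def __init__(self, **thunks):
          self._thunks = thunks

      def __iter__(self):
          return iter(self._thunks)

      def __getitem__(self, key):
          return self._thunks[key]()


  def initialize(**kwargs):
      # Stand-in for a model's initialize(**...) entry point.
      return kwargs


  lazy = LazyDict(model_path=lambda: "/tmp/model.onnx", label_map=lambda: {0: "Text"})
  # Build a plain dict via __iter__ + __getitem__ only, so ** unpacking
  # never depends on Mapping.keys() being present on the lazy object.
  materialized = {key: lazy[key] for key in lazy}
  config = initialize(**materialized)
  ```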

## 1.6.6

### Enhancement
- Relax the lower bound of the pandas and numpy dependency

## 1.6.5

### Enhancement
- Store `pdf_rotation` in `page.image_metadata` so downstream consumers can check page rotation after the page image is freed
- Add targeted unittest coverage for PDF page rotation handling in `convert_pdf_to_image`
- Speed up the targeted rotation unittest by isolating the PDF image conversion surface into a lightweight module and mocking the PDFium rendering path for the timing-critical test

## 1.6.4

### Fix
- Apply PDF `/Rotate` metadata during page rendering - pypdfium2's `page.render()` ignores the flag, producing sideways images for rotated pages

## 1.6.3

### Security

- fix(deps): upgrade vulnerable transitive dependencies

## 1.6.2

### Enhancement
- Make `dpi` an explicit parameter on `convert_pdf_to_image` (default 200) instead of reading from config internally, enabling unstructured to use this as the single source of truth for PDF rendering

## 1.6.1

### Enhancement
- Free intermediate arrays (`origin_img`, `img`, `ort_inputs`, `output`) and PIL pixel buffer at dead points during YoloX `image_processing()` to reduce peak memory during inference

## 1.6.0

### Fix
- Relax `huggingface-hub` lower bound from `>=1.4.1` to `>=0.22.0` (the `>=1.4.1` was an artifact of the uv migration and broke compatibility with `transformers<5.0`)

## 1.5.5

### Enhancement
- Lazy page rendering in `convert_pdf_to_image` to reduce peak memory from O(N pages) to O(1 page)
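
  As a rough illustration of the idea (the names here are hypothetical, not the library's API), lazy rendering means yielding one page at a time instead of building the full list up front:

  ```python
  from typing import Callable, Iterator


  def render_pages_lazily(num_pages: int, render: Callable[[int], bytes]) -> Iterator[bytes]:
      # A generator keeps only one rendered page alive at a time, so peak
      # memory is O(1 page) instead of O(N pages) for an up-front list.
      for page_number in range(num_pages):
          yield render(page_number)


  # Usage: nothing is rendered until a page is actually requested.
  fake_render = lambda n: f"page-{n}".encode()
  pages = render_pages_lazily(3, fake_render)
  first = next(pages)
  ```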

## 1.5.4

### Enhancement
- Use `np.full()` instead of `np.ones() * scalar` in YoloX preprocessing to avoid a redundant temporary array
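
  The difference: `np.ones(...) * scalar` allocates an array of ones and then a second array for the product, while `np.full(...)` writes the fill value into a single allocation. A minimal comparison (shape and fill value are illustrative, not the model's actual preprocessing constants):

  ```python
  import numpy as np

  shape = (4, 6, 3)
  # Two allocations: the ones array plus the multiplication result.
  padded_old = np.ones(shape, dtype=np.uint8) * 114
  # One allocation, filled directly.
  padded_new = np.full(shape, 114, dtype=np.uint8)
  ```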

## 1.5.3

- Store routing in LayoutElement

## 1.5.2

### Fix
- Switch to PyPI trusted publishing (OIDC) and remove API token auth

## 1.5.1

### Fix
- Add `id-token: write` permission to release workflow for PyPI attestations

## 1.5.0

### Enhancement
- Automate PyPI and Azure Artifacts publishing via GitHub release workflow
- Replace `--frozen` with `--locked` across Makefile and Dockerfile for stricter lockfile validation
- Add `release` dependency group with `twine` for Azure Artifacts upload
- Constrain pillow to >=12.1.1 to address CVE for out-of-bounds write when loading PSD images

## 1.4.0

### Enhancement
- Switch CI runners to `opensource-linux-8core` for faster builds
- Add pytest-xdist parallelization (`-n auto`) to `docker-test` target
- Remove mypy from lint pipeline; ruff covers linting needs sufficiently
- Add `install-lint` target; CI lint job no longer downloads full project dependencies

## 1.3.0

### Enhancement
- Migrate project to native uv with hatchling build backend
- Consolidate all configuration into pyproject.toml
- Replace pip/requirements workflow with uv sync/lock
- Parallelize test runs with pytest-xdist (`-n auto`)

### Breaking
- Drop support for Python 3.10 and 3.11; require Python >=3.12, <3.13

## 1.2.0

### Enhancement
- **Per-model locks for parallel model loading**: Replace single global lock with per-model locks
  - Allows concurrent loading of different models (detectron2, yolox, etc.)
  - 10x+ concurrency improvement in multi-model environments
  - Maintains thread-safe initialization with double-check pattern
  - Backward compatible - no API changes
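
  A minimal sketch of the pattern with simplified names (not the library's exact internals):

  ```python
  import threading

  _models: dict = {}
  _model_locks: dict = {}
  _registry_lock = threading.Lock()


  def get_model(name, loader):
      # Fast path: already loaded, no locking needed.
      if name in _models:
          return _models[name]
      # Short critical section just to create/find this model's own lock.
      with _registry_lock:
          model_lock = _model_locks.setdefault(name, threading.Lock())
      # Different model names hold different locks, so distinct models
      # can load concurrently.
      with model_lock:
          if name not in _models:  # double-check under the per-model lock
              _models[name] = loader()
      return _models[name]
  ```

  Threads loading different models contend only briefly on the registry lock; the expensive `loader()` call is serialized per model name, never globally.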

## 1.1.9

### Fix
- **TableTransformer device_map fix**: Remove device_map parameter to prevent meta tensor errors
  - Device normalization (cuda -> cuda:0) for consistent caching
  - Load models without device_map, use explicit .to(device, dtype=torch.float32)
  - Fixes concurrent PDF processing AssertionError
  - Prevents "Trying to set a tensor of type Float but got Meta" errors
- Use context manager for `pdfium.PdfDocument`
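
  The caching half of that fix can be sketched without torch. `normalize_device` is a hypothetical helper, shown only to illustrate why `cuda` and `cuda:0` must not produce two cache entries:

  ```python
  def normalize_device(device: str) -> str:
      # "cuda" and "cuda:0" name the same device; without normalization a
      # model cache keyed on the raw string would hold two model copies.
      return "cuda:0" if device == "cuda" else device


  cache = {}
  for requested in ("cuda", "cuda:0"):
      key = normalize_device(requested)
      cache.setdefault(key, object())  # both requests hit the same entry
  ```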

## 1.1.8

- put `pdfium` call behind a thread lock

## 1.1.7

- Update OpenCV-Python to 4.13.0.90 to squash ffmpeg vulnerability CVE-2023-6605

## 1.1.6

- Use inference_config to set default rendering DPI

## 1.1.5

- Render PDF to image using PyPDFium instead of pdf2image, due to much improved performance for certain docs

## 1.1.4

- Constrain urllib3 to urllib3>=2.6.0 to address CVE-2025-66471 and CVE-2025-66418

## 1.1.3

- Constrain fonttools to >=4.60.2 to address CVE-2025-66034

## 1.1.2

* chore(deps): Bump several dependencies to resolve open high CVEs
* fix: Exclude pip and setuptools pinning based on cursor comment
* fix: With the newer version of transformers 4.57.1, the type checking became stricter, and mypy correctly flagged that DetrImageProcessor.from_pretrained() expects str | PathLike[Any], not a model object.
* fix: Update test to explicitly cast numpy array to uint8 for Pillow 12.0.0 compatibility

## 1.1.1

* Add NotImplementedError when trying to single index a TextRegions, reflecting the fact that it won't behave correctly at the moment.

## 1.1.0

* Enhancement: Add `TextSource` to track where the text of an element came from
* Enhancement: Refactor `__post_init__` of `TextRegions` and `LayoutElement` slightly to automate initialization

## 1.0.10

* Remove merging logic that's no longer used

## 1.0.9

* Make OD model loading thread safe

## 1.0.8

* Enhancement: Optimized `zoom_image` (codeflash)
* Enhancement: Optimized `cells_to_html` for an 8% speedup in some cases (codeflash)
* Enhancement: Optimized `outputs_to_objects` for an 88% speedup in some cases (codeflash)

## 1.0.7

* Fix a hardcoded file extension causing confusion in the logs

## 1.0.6

* Add slicing through indexing for vectorized elements

## 1.0.5

* feat: add thread lock to prevent race condition when instantiating singletons
* feat: parametrize edge config for `DetrImageProcessor` with env variables

## 1.0.4

* feat: use singleton instead of `global` to store shared variables

## 1.0.3

* setting longest_edge=1333 to the table image processor

## 1.0.2

* adding parameter to table image preprocessor related to the image size

## 1.0.1

* fix: moving the table transformer model to device when loading the model instead of once the model is loaded.

## 1.0.0

* feat: support for Python 3.10+; drop support for Python 3.9

## 0.8.11

* feat: remove `donut` model

## 0.8.10

* feat: unpin `numpy` and bump minimum for `onnxruntime` to be compatible with `numpy>=2`

## 0.8.9

* chore: unpin `pdfminer-six` version

## 0.8.8
* fix: pdfminer-six dependencies
* feat: `PageLayout.elements` is now a `cached_property` to reduce unnecessary memory and cpu costs

## 0.8.7

* fix: add `password` for PDF

## 0.8.6

* feat: add back `source` to `TextRegions` and `LayoutElements` for backward compatibility

## 0.8.5

* fix: remove `pdfplumber` but include `pdfminer-six==20240706` to update `pdfminer`

## 0.8.4

* feat: add `text_as_html` and `table_as_cells` to `LayoutElements` class as new attributes
* feat: replace the single-valued `source` attribute from `TextRegions` and `LayoutElements` with an array attribute `sources`

## 0.8.3

* fix: removed `layoutelement.from_lp_textblock()` and related tests as it's not used
* fix: update requirements to drop `layoutparser` lib
* fix: update `README.md` to remove layoutparser model zoo support note

## 0.8.2

* fix: fix bug where passing an empty list into `TextRegions.from_list` triggers `IndexError`
* fix: fix bug where concatenating a list of `LayoutElements` does not properly update the class id mapping

## 0.8.1

* fix: fix list index out of range error caused by calling LayoutElements.from_list() with empty list

## 0.8.0

* fix: fix missing source after cleaning layout elements
* **BREAKING** Remove chipper model

## 0.7.41

* fix: fix incorrect type casting with higher versions of `numpy` when subtracting a `float` from an `int` array
* fix: fix a bug where class id 0 becomes class type `None` when calling `LayoutElements.as_list()`

## 0.7.40

* fix: store probabilities with `float` data type instead of `int`

## 0.7.39

* fix: Correctly assign mutable default value to variable in `LayoutElements` class

## 0.7.38

* fix: Correctly assign mutable default value to variable in `TextRegions` class

## 0.7.37

* refactor: remove layout analysis related code
* enhancement: Hide warning about table transformer weights not being loaded
* fix(layout): Use TemporaryDirectory instead of NamedTemporaryFile for Windows support
* refactor: use `numpy` array to store layout elements' information in one single `LayoutElements`
  object instead of using a list of `LayoutElement`

## 0.7.36

fix: add input parameter validation to `fill_cells()` when converting cells to html

## 0.7.35

Fix syntax for generated HTML tables

## 0.7.34

* Reduce excessive logging

## 0.7.33

* BREAKING CHANGE: removes legacy detectron2 model
* deps: remove layoutparser optional dependencies

## 0.7.32

* refactor: remove all code related to filling inferred elements text from embedded text (pdfminer).
* bug: set the Chipper max_length variable

## 0.7.31

* refactor: remove all `cid` related code that was originally added to filter out invalid `pdfminer` text
* enhancement: Wrapped hf_hub_download with a function that checks for local file before checking HF

## 0.7.30

* fix: table transformer doesn't return multiple cells with same coordinates

## 0.7.29

* fix: table transformer predictions are now removed if confidence is below threshold


## 0.7.28

* feat: allow table transformer agent to return table prediction in not parsed format

## 0.7.27

* fix: remove pin from `onnxruntime` dependency.

## 0.7.26

* feat: add a set of new `ElementType`s to extend future element types recognition
* feat: allow registering of new models for inference using `unstructured_inference.models.base.register_new_model` function

## 0.7.25

* fix: replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()` when filling in an inferred element with embedded text
* bug: check for None in Chipper bounding box reduction
* chore: removes `install-detectron2` from the `Makefile`
* fix: convert label_map keys read from os.environment `UNSTRUCTURED_DEFAULT_MODEL_INITIALIZE_PARAMS_JSON_PATH` to int type
* feat: removes supergradients references

## 0.7.24

* fix: assign value to `text_as_html` element attribute only if `text` attribute contains HTML tags.

## 0.7.23

* fix: added handling in `UnstructuredTableTransformerModel` for if `recognize` returns an empty
  list in `run_prediction`.

## 0.7.22

* fix: add logic to handle computation of intersections between 2 `Rectangle`s when a `Rectangle` has `None` value in its coordinates

## 0.7.21

* fix: fix a bug where chipper, or any element extraction model based `PageLayout` object, lack `image_metadata` and other attributes that are required for downstream processing; this fix also reduces the memory overhead of using chipper model

## 0.7.20

* chipper-v3: improved table prediction

## 0.7.19

* refactor: remove all OCR related code

## 0.7.18

* refactor: remove all image extraction related code

## 0.7.17

* refactor: remove all `pdfminer` related code
* enhancement: improved Chipper bounding boxes

## 0.7.16

* bug: Allow supplied ONNX models to use label_map dictionary from json file

## 0.7.15

* enhancement: Enable env variables for model definition

## 0.7.14

* enhancement: Remove Super-Gradients Dependency and Allow General Onnx Models Instead

## 0.7.13

* refactor: add a class `ElementType` for the element type constants and use the constants to replace element type strings
* enhancement: support extracting elements with types `Picture` and `Figure`
* fix: update logger in table initialization where the logger info was not showing
* chore: suppress UserWarning about specified model providers

## 0.7.12

* change the default model to yolox, as table output appears to be better and speed is similar to `yolox_quantized`

## 0.7.11

* chore: remove logger info for chipper since it's private
* fix: update broken slack invite link in chipper logger info
* enhancement: Improve error message when # images extracted doesn't match # page layouts.
* fix: use automatic mixed precision on GPU for Chipper
* fix: chipper Table elements now match other layout models' Table element format: html representation is stored in `text_as_html` attribute and `text` attribute stores text without html tags

## 0.7.10

* Handle kwargs explicitly when needed, suppress otherwise
* fix: Reduce Chipper memory consumption on x86_64 cpus
* fix: Skips ordering elements coming from Chipper
* fix: After refactoring to introduce Chipper, annotate() wasn't able to show text with extra info from elements, this is fixed now.
* feat: add table cell and dataframe output formats to table transformer's `run_prediction` call
* breaking change: function `unstructured_inference.models.tables.recognize` no longer takes `out_html` parameter and it now only returns table cell data format (lists of dictionaries)

## 0.7.9

* Allow table model to accept optional OCR tokens

## 0.7.8

* Fix: include onnx as base dependency.

## 0.7.7

* Fix a memory leak in DonutProcessor when using large images in numpy format
* Set the right settings for beam search size > 1
* Fix a bug that in very rare cases made the last element predicted by Chipper have a bbox = None

## 0.7.6

* fix a bug where an invalid zoom factor led to exceptions; now invalid zoom factors result in no scaling of the image

## 0.7.5

* Improved packaging

## 0.7.4

* Dynamic beam search size has been implemented for Chipper, the decoding process starts with a size = 1 and changes to size = 3 if repetitions appear.
* Fixed a bug where Chipper annotations were removed when PDFMiner predicts that an image's text occupies the full page.
* Added random seed to Chipper text generation to avoid differences between calls to Chipper.
* Allows user to use super-gradients model if they have a callback predict function, a yaml file with names field corresponding to classes and a path to the model weights

## 0.7.3

* Integration of Chipperv2 and additional Chipper functionality, which includes automatic detection of GPU,
bounding box prediction and hierarchical representation.
* Remove control characters from the text of all layout elements

## 0.7.2

* Sort elements extracted by `pdfminer` to get consistent result from `aggregate_by_block()`

## 0.7.1

* Download yolox_quantized from HF

## 0.7.0

* Remove all OCR related code except the table OCR code

## 0.6.6

* Stop passing ocr_languages parameter into paddle to avoid invalid paddle language code error; this workaround will remain until we have the mapping from standard language codes to paddle language codes.

## 0.6.5

* Add functionality to keep extracted image elements while merging inferred layout with extracted layout
* Fix `source` property for elements generated by pdfminer.
* Add 'OCR-tesseract' and 'OCR-paddle' as sources for elements generated by OCR.

## 0.6.4

* add a function to automatically scale table crop images based on text height so the text height is optimum for `tesseract` OCR task
* add the new image auto scaling parameters to `config.py`

## 0.6.3

* fix a bug where padded table structure bounding boxes are not shifted back into the original image coordinates correctly

## 0.6.2

* move the confidence threshold for table transformer to config

## 0.6.1

* YoloX_quantized is now the default model. This model detects the most diverse set of element types and detects tables better than the previous model.
* Since detection models tend to nest elements inside others (specifically in Tables), an algorithm has been added for reducing this
  behavior. Now all the elements produced by detection models are disjoint and they don't produce overlapping regions, which helps
  reduce duplicated content.
* Add `source` property to our elements, so you can know where the information was generated (OCR or detection model)

## 0.6.0

* add a config class to handle parameter configurations for inference tasks; parameters in the config class can be set via environment variables
* update behavior of `pad_image_with_background_color` so that input `pad` is applied to all sides
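
  For example, applying the same pad to all sides can be expressed with `np.pad` (a sketch of the behavior under assumed parameter names, not the library's implementation):

  ```python
  import numpy as np


  def pad_all_sides(image: np.ndarray, pad: int, background: int = 255) -> np.ndarray:
      # The same `pad` is applied to top, bottom, left, and right.
      return np.pad(
          image,
          pad_width=((pad, pad), (pad, pad)),
          mode="constant",
          constant_values=background,
      )


  img = np.zeros((10, 20), dtype=np.uint8)
  padded = pad_all_sides(img, pad=5)
  ```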

## 0.5.31

* Add functionality to extract and save images from the page
* Add functionality to get only "true" embedded images when extracting elements from PDF pages
* Update the layout visualization script to be able to show only image elements if needed
* add an evaluation metric for table comparison based on token similarity
* fix paddle unit tests where `make test` fails since paddle doesn't work on M1/M2 chip locally

## 0.5.28

* add env variable `ENTIRE_PAGE_OCR` to specify using paddle or tesseract on entire page OCR

## 0.5.27

* table structure detection now pads the input image by 25 pixels in all 4 directions to improve its recall

## 0.5.26

* support paddle with both cpu and gpu and assumed it is pre-installed

## 0.5.25

* fix a bug where `cells_to_html` doesn't handle cells spanning multiple rows properly

## 0.5.24

* remove `cv2` preprocessing step before OCR step in table transformer

## 0.5.23

* Add functionality to bring back embedded images in PDF

## 0.5.22

* Add object-detection classification probabilities to LayoutElement for all currently implemented object detection models

## 0.5.21

* adds `safe_division` to replace 0 with machine epsilon for `float` to avoid division by 0
* apply `safe_division` to area overlap calculations in `unstructured_inference/inference/elements.py`
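
  A minimal sketch of the idea (the repo's actual signature may differ):

  ```python
  import sys


  def safe_division(numerator: float, denominator: float) -> float:
      # A zero denominator is replaced with machine epsilon so ratio
      # calculations (e.g. area overlap) never raise ZeroDivisionError.
      if denominator == 0:
          denominator = sys.float_info.epsilon
      return numerator / denominator
  ```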

## 0.5.20

* Adds YoloX quantized model

## 0.5.19

* Add functionality to supplement detected layout with elements from the full page OCR
* Add functionality to annotate any layout (extracted, inferred, OCR) on a page

## 0.5.18

* Fix for incorrect type assignment in the ingest test

## 0.5.17

* Use `OMP_THREAD_LIMIT` to improve tesseract performance

## 0.5.16

* Fix to no longer create a directory for storing processed images
* Hot-load images for annotation

## 0.5.15

* Handle an uncaught TesseractError

## 0.5.14

* Add TIFF test file and TIFF filetype to `test_from_image_file` in `test_layout`

## 0.5.13

* Fix extracted image elements being included in layout merge

## 0.5.12

* Add multipage TIFF extraction support
* Fix a pdfminer error when using `process_data_with_model`

## 0.5.11

* Add warning when chipper is used with < 300 DPI
* Use None default for dpi so defaults can be properly handled upstream

## 0.5.10

* Implement full-page OCR

## 0.5.9

* Handle exceptions from Tesseract

## 0.5.8

* Add alternative architecture for detectron2 (but default is unchanged)
* Updates:

| Library       | From      | To       |
|---------------|-----------|----------|
| transformers  | 4.29.2    | 4.30.2   |
| opencv-python | 4.7.0.72  | 4.8.0.74 |
| ipython       | 8.12.2    | 8.14.0   |

* Cache named models that have been loaded

## 0.5.7

* hotfix to handle issue storing images in a new dir when the pdf has no file extension

## 0.5.6

* Update the `annotate` and `_get_image_array` methods of `PageLayout` to get the image from the `image_path` property if the `image` property is `None`.
* Add functionality to store pdf images for later use.
* Add `image_metadata` property to `PageLayout` & set `page.image` to None to reduce memory usage.
* Update `DocumentLayout.from_file` to open only one image.
* Update `load_pdf` to return either Image objects or Image paths.
* Warns users that Chipper is a beta model.
* Exposed control over dpi when converting PDF to an image.
* Updated detectron2 version to avoid errors related to deprecated PIL reference

## 0.5.5

* Rename large model to chipper
* Added functionality to write images to computer storage temporarily instead of keeping them in memory for `pdf2image.convert_from_path`
* Added functionality to convert a PDF in small chunks of pages at a time for `pdf2image.convert_from_path`
* Added an area check in table processing to fix a division-by-zero bug
* Added CUDA and TensorRT execution providers for yolox and detectron2onnx models.
* Suppressed the empty-page warning from the ONNX version of detectron2.

## 0.5.4

* Tweak to element ordering to make it more deterministic

## 0.5.3

* Refactor for large model

## 0.5.2

* Combine inferred elements with extracted elements
* Add ruff to keep code consistent with unstructured
* Configure fallback for OCR token if paddleocr doesn't work to use tesseract

## 0.5.1

* Add annotation for pages
* Store page numbers when processing PDFs
* Hotfix to handle inference of blank pages using ONNX detectron2
* Revert ordering change to investigate examples of misordering

## 0.5.0

* Preserve image format in PIL.Image.Image when loading
* Added ONNX version of Detectron2 and made it the default model
* Remove API code, we don't serve this as a standalone API any more
* Update ordering logic to account for multicolumn documents.

## 0.4.4

* Fixed patches not being a package.

## 0.4.3

* Patch pdfminer.six to fix parsing bug

## 0.4.2

* Output of table extraction is now stored in `text_as_html` property rather than `text` property

## 0.4.1

* Added the ability to pass `ocr_languages` to the OCR agent for users who need
  non-English language packs.

## 0.4.0

* Added logic to partition granular elements (words, characters) by proximity
* Text extraction is now delegated to text regions rather than being handled centrally
* Fixed embedded image coordinates being interpreted differently than embedded text coordinates
* Update to how dependencies are being handled
* Update detectron2 version

## 0.3.2

* Allow extracting tables from higher level functions

## 0.3.1

* Pin protobuf version to avoid errors
* Make paddleocr an extra again

## 0.3.0

* Fix for text block detection
* Add paddleocr dependency to setup for x86_64 machines

## 0.2.14

* Suppressed processing progress bars

## 0.2.13

* Add table processing
* Change OCR logic to be aware of PDF image elements

## 0.2.12

* Fix for processing RGBA images

## 0.2.11

* Fixed some cases where image elements were not being OCR'd

## 0.2.10

* Removed control characters from tesseract output

## 0.2.9

* Removed multithreading from OCR (DocumentLayout.get_elements_from_layout)

## 0.2.8

* Refactored YoloX inference code to integrate better with framework
* Improved testing time

## 0.2.7

* Fixed duplicated load_pdf call

## 0.2.6

* Add donut model script for image prediction
* Add sample receipt and test for donut prediction

## 0.2.5

* Add YoloX model for images and PDFs
* Add generic model interface

## 0.2.4

* Download default model from huggingface
* Clarify error when trying to open file that doesn't exist as an image

## 0.2.3

* Pins the version of `opencv-python` for linux compatibility

## 0.2.2

* Add capability to process image files
* Add logic to use OCR when layout text is full of unknown characters

## 0.2.1

* Refactor to facilitate local inference
* Removes BasicConfig from logger configuration
* Implement auto model downloading

## 0.2.0

* Initial release of unstructured-inference


================================================
FILE: Dockerfile
================================================
# syntax=docker/dockerfile:experimental
ARG PYTHON_VERSION=3.12
FROM python:${PYTHON_VERSION}-slim AS base

# Set up environment
ENV HOME=/home/
WORKDIR ${HOME}
RUN mkdir ${HOME}/.ssh && chmod go-rwx ${HOME}/.ssh \
  && ssh-keyscan -t rsa github.com >> /home/.ssh/known_hosts

# Install uv
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /usr/local/bin/

FROM base AS deps
# Copy project files needed for dependency resolution
COPY pyproject.toml uv.lock ./
COPY unstructured_inference/__version__.py unstructured_inference/__version__.py

RUN uv sync --locked --all-groups --no-install-project

# Ensure venv binaries are on PATH so pytest/etc. are directly accessible
ENV PATH="/home/.venv/bin:${PATH}"

FROM deps AS code
COPY unstructured_inference unstructured_inference
RUN uv sync --locked --all-groups

CMD ["/bin/bash"]


================================================
FILE: LICENSE
================================================
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: Makefile
================================================
PACKAGE_NAME := unstructured_inference
CURRENT_DIR := $(shell pwd)


.PHONY: help
help: Makefile
	@sed -n 's/^\(## \)\([a-zA-Z]\)/\2/p' $<


###########
# Install #
###########

## install:                 install all dependencies via uv
.PHONY: install
install:
	@uv sync --locked --all-groups

## install-lint:            install only lint dependencies (no project deps)
.PHONY: install-lint
install-lint:
	@uv sync --locked --only-group lint

## lock:                    update and lock all dependencies
.PHONY: lock
lock:
	@uv lock --upgrade

#################
# Test and Lint #
#################

export CI ?= false

## test:                    runs all unittests (excluding slow)
.PHONY: test
test:
	CI=$(CI) uv run --locked --no-sync pytest -n auto -m "not slow" test_${PACKAGE_NAME} --cov=${PACKAGE_NAME} --cov-report term-missing

## test-slow:               runs all unittests (including slow)
.PHONY: test-slow
test-slow:
	CI=$(CI) uv run --locked --no-sync pytest -n auto test_${PACKAGE_NAME} --cov=${PACKAGE_NAME} --cov-report term-missing

## check:                   runs all linters and checks
.PHONY: check
check: check-ruff check-version

## check-ruff:              runs ruff linter
.PHONY: check-ruff
check-ruff:
	uv run --locked --no-sync ruff check .
	uv run --locked --no-sync ruff format --check .

## check-scripts:           run shellcheck
.PHONY: check-scripts
check-scripts:
	scripts/shellcheck.sh

## check-version:           run check to ensure version in CHANGELOG.md matches version in package
.PHONY: check-version
check-version:
    # Fail if syncing version would produce changes
	scripts/version-sync.sh -c \
		-s CHANGELOG.md \
		-f ${PACKAGE_NAME}/__version__.py semver

## tidy:                    auto-format and fix lint issues
.PHONY: tidy
tidy:
	uv run --locked --no-sync ruff format .
	uv run --locked --no-sync ruff check --fix-only --show-fixes .

## version-sync:            update __version__.py with most recent version from CHANGELOG.md
.PHONY: version-sync
version-sync:
	scripts/version-sync.sh \
		-s CHANGELOG.md \
		-f ${PACKAGE_NAME}/__version__.py semver

## check-coverage:          check test coverage meets threshold
.PHONY: check-coverage
check-coverage:
	uv run --locked --no-sync coverage report --fail-under=90

##########
# Docker #
##########

DOCKER_IMAGE ?= unstructured-inference:dev

.PHONY: docker-build
docker-build:
	DOCKER_IMAGE=${DOCKER_IMAGE} ./scripts/docker-build.sh

.PHONY: docker-test
docker-test: docker-build
	docker run --rm \
	-v ${CURRENT_DIR}/test_unstructured_inference:/home/test_unstructured_inference \
	-v ${CURRENT_DIR}/sample-docs:/home/sample-docs \
	$(DOCKER_IMAGE) \
	bash -c "pytest -n auto $(if $(TEST_NAME),-k $(TEST_NAME),) test_unstructured_inference"


================================================
FILE: README.md
================================================
<h3 align="center">
  <img
    src="https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/img/unstructured_logo.png"
    height="200"
  >

</h3>

<h3 align="center">
  <p>Open-Source Pre-Processing Tools for Unstructured Data</p>
</h3>

The `unstructured-inference` repo contains hosted model inference code for layout parsing models. 
These models are invoked via API as part of the partitioning bricks in the `unstructured` package.

**Requires Python >=3.11, <3.14.**

## Installation

### Package

```shell
pip install unstructured-inference
```

### Detectron2

[Detectron2](https://github.com/facebookresearch/detectron2) is required for using models from the [layoutparser model zoo](#using-models-from-the-layoutparser-model-zoo) 
but is not automatically installed with this package. 
For macOS and Linux, build from source with:
```shell
pip install 'git+https://github.com/facebookresearch/detectron2.git@57bdb21249d5418c130d54e2ebdc94dda7a4c01a'
```
Other install options can be found in the 
[Detectron2 installation guide](https://detectron2.readthedocs.io/en/latest/tutorials/install.html).

Windows is not officially supported by Detectron2, but some users are able to install it anyway. 
See discussion [here](https://layout-parser.github.io/tutorials/installation#for-windows-users) for 
tips on installing Detectron2 on Windows.

### Development Setup

This project uses [uv](https://docs.astral.sh/uv/) for dependency management.

```shell
# Clone and install all dependencies (including dev/test/lint groups)
git clone https://github.com/Unstructured-IO/unstructured-inference.git
cd unstructured-inference
make install
```

Run `make help` for a full list of available targets.

## Getting Started

To get started with the layout parsing model, use the following commands:

```python
from unstructured_inference.inference.layout import DocumentLayout

layout = DocumentLayout.from_file("sample-docs/loremipsum.pdf")

print(layout.pages[0].elements)
```

Once the model has detected the layout and OCR'd the document, the text extracted from the first 
page of the sample document will be displayed.
You can convert a given element to a `dict` by running the `.to_dict()` method.

## Models

The inference pipeline operates by finding text elements in a document page using a detection model, then extracting the contents of the elements using direct extraction (if available), OCR, and optionally table inference models.

We offer several detection models including [Detectron2](https://github.com/facebookresearch/detectron2) and [YOLOX](https://github.com/Megvii-BaseDetection/YOLOX).

### Using a non-default model

When doing inference, an alternate model can be used by passing the model object to the ingestion method via the `model` parameter. The `get_model` function can be used to construct one of our out-of-the-box models from a keyword, e.g.:
```python
from unstructured_inference.models.base import get_model
from unstructured_inference.inference.layout import DocumentLayout

model = get_model("yolox")
layout = DocumentLayout.from_file("sample-docs/layout-parser-paper.pdf", detection_model=model)
```

### Using your own model

Any detection model can be used in the `unstructured_inference` pipeline by wrapping the model in the `UnstructuredObjectDetectionModel` class. To integrate with the `DocumentLayout` class, a subclass of `UnstructuredObjectDetectionModel` must have a `predict` method that accepts a `PIL.Image.Image` and returns a list of `LayoutElement`s, and an `initialize` method, which loads the model and prepares it for inference.
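As a schematic sketch of that interface (a minimal stand-in base class is defined here so the snippet is self-contained; in real code you would subclass the library's `UnstructuredObjectDetectionModel`, and `MyDetectionModel` and its return values are hypothetical):

```python
from typing import Any, List


class ObjectDetectionModelBase:
    """Minimal stand-in for UnstructuredObjectDetectionModel, used here so
    the sketch runs without the package installed."""

    def initialize(self, *args: Any, **kwargs: Any) -> None:
        raise NotImplementedError

    def predict(self, image: Any) -> List[Any]:
        raise NotImplementedError


class MyDetectionModel(ObjectDetectionModelBase):
    """Hypothetical custom detector satisfying the required interface."""

    def initialize(self, weights_path: str) -> None:
        # Load weights and prepare the model for inference (placeholder).
        self.weights_path = weights_path
        self.ready = True

    def predict(self, image: Any) -> List[dict]:
        # In real code this accepts a PIL.Image.Image and returns a list of
        # LayoutElement objects; plain dicts stand in for them here.
        if not getattr(self, "ready", False):
            raise RuntimeError("call initialize() before predict()")
        return [{"type": "Text", "bbox": (0, 0, 100, 20), "text": ""}]
```

An instance of such a subclass can then be passed as `detection_model` when constructing a `DocumentLayout`, as shown in the previous section.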

## Security Policy

See our [security policy](https://github.com/Unstructured-IO/unstructured-inference/security/policy) for
information on how to report security vulnerabilities.

## Learn more

| Section | Description |
|-|-|
| [Unstructured Community Github](https://github.com/Unstructured-IO/community) | Information about Unstructured.io community projects  |
| [Unstructured Github](https://github.com/Unstructured-IO) | Unstructured.io open source repositories |
| [Company Website](https://unstructured.io) | Unstructured.io product and company info |


================================================
FILE: benchmarks/__init__.py
================================================


================================================
FILE: benchmarks/test_benchmark_yolox.py
================================================
"""Benchmark for YoloX image_processing() memory optimization.

Uses a fake ONNX session to isolate the memory behavior of image_processing()
without requiring the real model weights. The fake session allocates a realistic
35 MiB workspace to simulate ONNX inference memory pressure.
"""

import numpy as np
from PIL import Image as PILImage

from unstructured_inference.models.yolox import UnstructuredYoloXModel


class _FakeInput:
    def __init__(self) -> None:
        self.name = "input"


class _FakeSession:
    """Simulates an ONNX inference session with realistic memory allocation."""

    def get_inputs(self):
        return [_FakeInput()]

    def run(self, _names, _inputs):
        workspace = np.empty((35 * 1024 * 1024,), dtype=np.uint8)  # 35 MiB  # noqa: F841
        # input_shape (1024,768), strides [8,16,32] → 128*96 + 64*48 + 32*24 = 16128
        return [np.random.randn(1, 16128, 16).astype(np.float32)]


def make_model() -> UnstructuredYoloXModel:
    model = object.__new__(UnstructuredYoloXModel)
    model.model = _FakeSession()
    model.model_path = "yolox_fake"
    model.layout_classes = {
        0: "Caption",
        1: "Footnote",
        2: "Formula",
        3: "List-item",
        4: "Page-footer",
        5: "Page-header",
        6: "Picture",
        7: "Section-header",
        8: "Table",
        9: "Text",
        10: "Title",
    }
    return model


# Letter-size page at 200 DPI — the default render resolution
def make_letter_200dpi() -> PILImage.Image:
    return PILImage.fromarray(np.random.randint(0, 255, (2200, 1700, 3), dtype=np.uint8))


def run_image_processing():
    model = make_model()
    img = make_letter_200dpi()
    return model.image_processing(img)


def test_benchmark_yolox_image_processing(benchmark):
    benchmark(run_image_processing)


================================================
FILE: examples/ocr/engine.py
================================================
import os
import re
import time
from typing import List, cast

import cv2
import numpy as np
import pytesseract
from pytesseract import Output

from unstructured_inference.inference import layout
from unstructured_inference.inference.elements import Rectangle, TextRegion


def remove_non_printable(s):
    dst_str = re.sub(r"[^\x20-\x7E]", " ", s)
    return " ".join(dst_str.split())


def run_ocr_with_layout_detection(
    images,
    detection_model=None,
    element_extraction_model=None,
    mode="individual_blocks",
    output_dir="",
    drawable=True,
    printable=True,
):
    total_text_extraction_infer_time = 0
    total_extracted_text = {}
    for i, image in enumerate(images):
        page_num = i + 1
        page_num_str = f"page{page_num}"

        page = layout.PageLayout(
            number=i + 1,
            image=image,
            layout=None,
            detection_model=detection_model,
            element_extraction_model=element_extraction_model,
        )

        inferred_layout: List[TextRegion] = cast(List[TextRegion], page.detection_model(page.image))

        cv_img = np.array(image)

        if mode == "individual_blocks":
            # OCR'ing individual blocks (current approach)
            text_extraction_start_time = time.time()

            elements = page.get_elements_from_layout(inferred_layout)

            text_extraction_infer_time = time.time() - text_extraction_start_time

            total_text_extraction_infer_time += text_extraction_infer_time

            page_text = ""
            for el in elements:
                page_text += el.text
            filtered_page_text = remove_non_printable(page_text)
            total_extracted_text[page_num_str] = filtered_page_text
        elif mode == "entire_page":
            # OCR'ing entire page (new approach to implement)
            text_extraction_start_time = time.time()

            ocr_data = pytesseract.image_to_data(image, lang="eng", output_type=Output.DICT)
            boxes = ocr_data["level"]
            extracted_text_list = []
            for k in range(len(boxes)):
                (x, y, w, h) = (
                    ocr_data["left"][k],
                    ocr_data["top"][k],
                    ocr_data["width"][k],
                    ocr_data["height"][k],
                )
                extracted_text = ocr_data["text"][k]
                if not extracted_text:
                    continue

                extracted_region = Rectangle(x1=x, y1=y, x2=x + w, y2=y + h)

                extracted_is_subregion_of_inferred = False
                for inferred_region in inferred_layout:
                    extracted_is_subregion_of_inferred = extracted_region.is_almost_subregion_of(
                        inferred_region.pad(12),
                        subregion_threshold=0.75,
                    )
                    if extracted_is_subregion_of_inferred:
                        break

                if extracted_is_subregion_of_inferred:
                    extracted_text_list.append(extracted_text)

                if drawable:
                    if extracted_is_subregion_of_inferred:
                        cv2.rectangle(cv_img, (x, y), (x + w, y + h), (0, 255, 0), 2, None)
                    else:
                        cv2.rectangle(cv_img, (x, y), (x + w, y + h), (255, 0, 0), 2, None)

            text_extraction_infer_time = time.time() - text_extraction_start_time
            total_text_extraction_infer_time += text_extraction_infer_time

            page_text = " ".join(extracted_text_list)
            filtered_page_text = remove_non_printable(page_text)
            total_extracted_text[page_num_str] = filtered_page_text
        else:
            raise ValueError("Invalid mode")

        if drawable:
            for el in inferred_layout:
                pt1 = [int(el.x1), int(el.y1)]
                pt2 = [int(el.x2), int(el.y2)]
                cv2.rectangle(
                    img=cv_img,
                    pt1=pt1,
                    pt2=pt2,
                    color=(0, 0, 255),
                    thickness=4,
                    lineType=None,
                )

            f_path = os.path.join(output_dir, f"ocr_{mode}_{page_num_str}.jpg")
            cv2.imwrite(f_path, cv_img)

        if printable:
            print(
                f"page: {i + 1} - n_layout_elements: {len(inferred_layout)} - "
                f"text_extraction_infer_time: {text_extraction_infer_time}"
            )

    return total_text_extraction_infer_time, total_extracted_text


def run_ocr(
    images,
    printable=True,
):
    total_text_extraction_infer_time = 0
    total_text = ""
    for i, image in enumerate(images):
        text_extraction_start_time = time.time()

        page_text = pytesseract.image_to_string(image)

        text_extraction_infer_time = time.time() - text_extraction_start_time

        if printable:
            print(f"page: {i + 1} - text_extraction_infer_time: {text_extraction_infer_time}")

        total_text_extraction_infer_time += text_extraction_infer_time
        total_text += page_text

    return total_text_extraction_infer_time, total_text


================================================
FILE: examples/ocr/requirements.txt
================================================
unstructured[local-inference]
nltk

================================================
FILE: examples/ocr/validate_ocr_performance.py
================================================
import json
import os
import time
from datetime import datetime
from difflib import SequenceMatcher

import nltk
import pdf2image

from unstructured_inference.inference.layout import (
    DocumentLayout,
    create_image_output_dir,
    process_file_with_model,
)

# Download the required resources (run this once)
nltk.download("punkt")


def validate_performance(
    f_name,
    validation_mode,
    is_image_file=False,
):
    print(
        f">>> Start performance comparison - filename: {f_name}"
        f" - validation_mode: {validation_mode}"
        f" - is_image_file: {is_image_file}"
    )

    now_dt = datetime.utcnow()
    now_str = now_dt.strftime("%Y_%m_%d-%H_%M_%S")

    f_path = os.path.join(example_docs_dir, f_name)

    image_f_paths = []
    if validation_mode == "pdf":
        pdf_info = pdf2image.pdfinfo_from_path(f_path)
        n_pages = pdf_info["Pages"]
    elif validation_mode == "image":
        if is_image_file:
            image_f_paths.append(f_path)
        else:
            image_output_dir = create_image_output_dir(f_path)
            images = pdf2image.convert_from_path(f_path, output_folder=image_output_dir)
            image_f_paths = [image.filename for image in images]
        n_pages = len(image_f_paths)
    else:
        n_pages = 0

    processing_result = {}
    for ocr_mode in ["individual_blocks", "entire_page"]:
        start_time = time.time()

        if validation_mode == "pdf":
            layout = process_file_with_model(
                f_path,
                model_name=None,
                ocr_mode=ocr_mode,
            )
        elif validation_mode == "image":
            pages = []
            for image_f_path in image_f_paths:
                _layout = process_file_with_model(
                    image_f_path,
                    model_name=None,
                    ocr_mode=ocr_mode,
                    is_image=True,
                )
                pages += _layout.pages
            for i, page in enumerate(pages):
                page.number = i + 1
            layout = DocumentLayout.from_pages(pages)
        else:
            layout = None

        infer_time = time.time() - start_time

        if layout is None:
            print("Layout is None")
            return

        full_text = str(layout)
        page_text = {}
        for page in layout.pages:
            page_text[page.number] = str(page)

        processing_result[ocr_mode] = {
            "infer_time": infer_time,
            "full_text": full_text,
            "page_text": page_text,
        }

    individual_mode_page_text = processing_result["individual_blocks"]["page_text"]
    entire_mode_page_text = processing_result["entire_page"]["page_text"]
    individual_mode_full_text = processing_result["individual_blocks"]["full_text"]
    entire_mode_full_text = processing_result["entire_page"]["full_text"]

    compare_result = compare_processed_text(individual_mode_full_text, entire_mode_full_text)

    report = {
        "validation_mode": validation_mode,
        "file_info": {
            "filename": f_name,
            "n_pages": n_pages,
        },
        "processing_time": {
            "individual_blocks": processing_result["individual_blocks"]["infer_time"],
            "entire_page": processing_result["entire_page"]["infer_time"],
        },
        "text_similarity": compare_result,
        "extracted_text": {
            "individual_blocks": {
                "page_text": individual_mode_page_text,
                "full_text": individual_mode_full_text,
            },
            "entire_page": {
                "page_text": entire_mode_page_text,
                "full_text": entire_mode_full_text,
            },
        },
    }

    write_report(report, now_str, validation_mode)

    print("<<< End performance comparison", f_name)


def compare_processed_text(individual_mode_full_text, entire_mode_full_text, delimiter=" "):
    # Calculate similarity ratio
    similarity_ratio = SequenceMatcher(
        None, individual_mode_full_text, entire_mode_full_text
    ).ratio()

    print(f"similarity_ratio: {similarity_ratio}")

    # Tokenize the text into words
    word_list_individual = nltk.word_tokenize(individual_mode_full_text)
    n_word_list_individual = len(word_list_individual)
    print("n_word_list_in_text_individual:", n_word_list_individual)
    word_sets_individual = set(word_list_individual)
    n_word_sets_individual = len(word_sets_individual)
    print(f"n_word_sets_in_text_individual: {n_word_sets_individual}")
    # print("word_sets_merged:", word_sets_merged)

    word_list_entire = nltk.word_tokenize(entire_mode_full_text)
    n_word_list_entire = len(word_list_entire)
    print("n_word_list_in_text_entire:", n_word_list_entire)
    word_sets_entire = set(word_list_entire)
    n_word_sets_entire = len(word_sets_entire)
    print(f"n_word_sets_in_text_entire: {n_word_sets_entire}")
    # print("word_sets_entire:", word_sets_entire)

    # Find unique elements using difference
    print("diff_elements:")
    unique_words_individual = word_sets_individual - word_sets_entire
    unique_words_entire = word_sets_entire - word_sets_individual
    print(f"unique_words_in_text_individual: {unique_words_individual}\n")
    print(f"unique_words_in_text_entire: {unique_words_entire}")

    return {
        "similarity_ratio": similarity_ratio,
        "individual_blocks": {
            "n_word_list": n_word_list_individual,
            "n_word_sets": n_word_sets_individual,
            "unique_words": delimiter.join(list(unique_words_individual)),
        },
        "entire_page": {
            "n_word_list": n_word_list_entire,
            "n_word_sets": n_word_sets_entire,
            "unique_words": delimiter.join(list(unique_words_entire)),
        },
    }


def write_report(report, now_str, validation_mode):
    report_f_name = f"validate-ocr-{validation_mode}-{now_str}.json"
    report_f_path = os.path.join(output_dir, report_f_name)
    with open(report_f_path, "w", encoding="utf-8-sig") as f:
        json.dump(report, f, indent=4)


def run():
    test_files = [
        {"name": "layout-parser-paper-fast.pdf", "mode": "image", "is_image_file": False},
        {"name": "loremipsum_multipage.pdf", "mode": "image", "is_image_file": False},
        {"name": "2023-Jan-economic-outlook.pdf", "mode": "image", "is_image_file": False},
        {"name": "recalibrating-risk-report.pdf", "mode": "image", "is_image_file": False},
        {"name": "Silent-Giant.pdf", "mode": "image", "is_image_file": False},
    ]

    for test_file in test_files:
        f_name = test_file["name"]
        validation_mode = test_file["mode"]
        is_image_file = test_file["is_image_file"]

        validate_performance(f_name, validation_mode, is_image_file)


if __name__ == "__main__":
    cur_dir = os.getcwd()
    base_dir = os.path.join(cur_dir, os.pardir, os.pardir)
    example_docs_dir = os.path.join(base_dir, "sample-docs")

    # folder path to save temporary outputs
    output_dir = os.path.join(cur_dir, "output")
    os.makedirs(output_dir, exist_ok=True)

    run()
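
The text comparison performed by `compare_processed_text` above (character-level similarity plus word-set differences) can be sketched with the standard library alone; `compare_texts` is a hypothetical helper, and plain `str.split()` stands in for `nltk.word_tokenize`:

```python
from difflib import SequenceMatcher


def compare_texts(text_a: str, text_b: str) -> dict:
    """Compare two extracted texts, mirroring compare_processed_text."""
    # Character-level similarity in [0, 1]; 1.0 means identical strings.
    ratio = SequenceMatcher(None, text_a, text_b).ratio()
    # Whitespace tokenization stands in for nltk.word_tokenize here.
    words_a, words_b = set(text_a.split()), set(text_b.split())
    return {
        "similarity_ratio": ratio,
        "unique_to_a": words_a - words_b,
        "unique_to_b": words_b - words_a,
    }


result = compare_texts("the quick brown fox", "the quick red fox")
```
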


================================================
FILE: logger_config.yaml
================================================
version: 1
disable_existing_loggers: False
formatters:
  default_format:
    "()": uvicorn.logging.DefaultFormatter
    format: '%(asctime)s %(name)s %(levelname)s %(message)s'
  access:
    "()": uvicorn.logging.AccessFormatter
    format: '%(asctime)s %(client_addr)s %(request_line)s - %(status_code)s'
handlers:
  access_handler:
    formatter: access
    class: logging.StreamHandler
    stream: ext://sys.stderr
  standard_handler:
    formatter: default_format
    class: logging.StreamHandler
    stream: ext://sys.stderr
loggers:
  uvicorn.error:
    level: INFO
    handlers:
      - standard_handler
    propagate: no
  uvicorn.access:
    level: INFO
    handlers:
      - access_handler
    propagate: no
  unstructured:
    level: INFO
    handlers:
      - standard_handler
    propagate: no
  unstructured_inference:
    level: DEBUG
    handlers:
      - standard_handler
    propagate: no



================================================
FILE: pyproject.toml
================================================
[project]
name = "unstructured_inference"
description = "A library for performing inference using trained models."
requires-python = ">=3.11, <3.14"
authors = [{name = "Unstructured Technologies", email = "devops@unstructuredai.io"}]
classifiers = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Developers",
    "Intended Audience :: Education",
    "Intended Audience :: Science/Research",
    "License :: OSI Approved :: Apache Software License",
    "Operating System :: OS Independent",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Programming Language :: Python :: 3.13",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
]
readme = "README.md"
license = "Apache-2.0"
keywords = ["NLP", "PDF", "HTML", "CV", "XML", "parsing", "preprocessing"]
dynamic = ["version"]
dependencies = [
    "huggingface-hub>=0.22.0",
    "numpy>=1.26.0",
    "opencv-python>=4.13.0.90",
    "onnx>=1.20.1",
    "onnxruntime>=1.25.0",
    "matplotlib>=3.10.8",
    "torch>=2.10.0",
    "timm>=1.0.24",
    # NOTE(alan): Pinned because this is when the most recent module we import appeared
    "transformers>=4.25.1",
    # Required by transformers[torch] for model loading with torch
    "accelerate>=1.12.0",
    "rapidfuzz>=3.14.3",
    "pandas>=1.5.0",
    "scipy>=1.17.0",
    "pypdfium2>=5.0.0",
]

[project.urls]
Homepage = "https://github.com/Unstructured-IO/unstructured-inference"

[tool.hatch.version]
path = "unstructured_inference/__version__.py"

[dependency-groups]
lint = [
    "ruff>=0.15.0",
]
test = [
    "pytest>=9.0.2",
    "pytest-cov>=7.0.0",
    "pytest-mock>=3.15.1",
    "pytest-xdist>=3.5.0",
    "coverage>=7.13.3",
    "httpx>=0.28.1",
    "pdf2image>=1.16.2",
]
dev = [
    "jupyter>=1.1.1",
    "ipython>=9.10.0",
]
release = [
    "twine>=6.2.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.uv]
constraint-dependencies = [
    # Security: CVE fix for fonttools
    "fonttools>=4.60.2",
    # Security: CVE fix for urllib3
    "urllib3>=2.6.0",
    # Security: CVE fix for Pillow (out-of-bounds write loading PSD images)
    "pillow>=12.1.1",
]

[tool.hatch.build.targets.wheel]
packages = ["/unstructured_inference"]

[tool.hatch.build.targets.sdist]
packages = ["/unstructured_inference"]

[tool.ruff]
line-length = 100

[tool.ruff.lint]
select = [
    # pycodestyle
    "E",
    # Pyflakes
    "F",
    # flake8-comprehensions
    "C4",
    # flake8-commas
    "COM",
    # isort
    "I",
    # flake8-simplify
    "SIM",
    # pyupgrade
    "UP015",
    "UP018",
    "UP032",
    "UP034",
    # pylint refactor
    "PLR0402",
    # flake8-pytest-style
    "PT",
]
ignore = [
    "COM812",
    "PT011",
    "PT012",
]

[tool.ruff.lint.per-file-ignores]
"test_*/**" = ["D"]

[tool.pytest.ini_options]
markers = [
    "slow: marks tests as slow (deselect with '-m \"not slow\"')",
]
filterwarnings = [
    "ignore::DeprecationWarning",
]

[tool.codeflash]
benchmarks-root = "benchmarks"

[tool.coverage.report]
fail_under = 90


================================================
FILE: renovate.json
================================================
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["github>Unstructured-IO/renovate-config:python-uv"]
}


================================================
FILE: sample-docs/loremipsum.tiff
================================================
[File too large to display: 14.3 MB]

================================================
FILE: scripts/docker-build.sh
================================================
#!/usr/bin/env bash

set -euo pipefail
DOCKER_IMAGE="${DOCKER_IMAGE:-unstructured-inference:dev}"

DOCKER_BUILD_CMD=(docker buildx build --load -f Dockerfile \
  --build-arg BUILDKIT_INLINE_CACHE=1 \
  --progress plain \
  -t "$DOCKER_IMAGE" .)

DOCKER_BUILDKIT=1 "${DOCKER_BUILD_CMD[@]}"


================================================
FILE: scripts/shellcheck.sh
================================================
#!/usr/bin/env bash

find scripts -name "*.sh" -exec shellcheck {} +



================================================
FILE: scripts/test-unstructured-ingest-helper.sh
================================================
#!/usr/bin/env bash

# This is intended to be run from an unstructured checkout, not in this repo
# The goal here is to see what changes the current branch would introduce to unstructured
# fixtures

INGEST_COMMANDS=(
    test_unstructured_ingest/src/azure.sh
    test_unstructured_ingest/src/biomed-api.sh
    test_unstructured_ingest/src/biomed-path.sh
    test_unstructured_ingest/src/box.sh
    test_unstructured_ingest/src/dropbox.sh
    test_unstructured_ingest/src/gcs.sh
    test_unstructured_ingest/src/onedrive.sh
    test_unstructured_ingest/src/s3.sh
)

EXIT_STATUSES=()

# Run each command and capture its exit status
for INGEST_COMMAND in "${INGEST_COMMANDS[@]}"; do
  "$INGEST_COMMAND"
  EXIT_STATUSES+=($?)
done

# Check for failures
for STATUS in "${EXIT_STATUSES[@]}"; do
  if [[ $STATUS -ne 0 ]]; then
    echo "At least one ingest command failed! Scroll up to see which"
    exit 1
  fi
done

echo "No diff's resulted from any ingest commands"


================================================
FILE: scripts/version-sync.sh
================================================
#!/usr/bin/env bash
function usage {
    echo "Usage: $(basename "$0") [-c] -f FILE_TO_CHANGE REPLACEMENT_FORMAT [-f FILE_TO_CHANGE REPLACEMENT_FORMAT ...]" 2>&1
    echo 'Synchronize files to latest version in source file'
    echo '   -s              Specifies source file for version (default is CHANGELOG.md)'
    echo '   -f              Specifies a file to change and the format for searching and replacing versions'
    echo '                       FILE_TO_CHANGE is the file to be updated/checked for updates'
    echo '                       REPLACEMENT_FORMAT is one of (semver, release, api-release)'
    echo '                           semver indicates to look for a full semver version and replace with the latest full version'
    echo '                           release indicates to look for a release semver version (x.x.x) and replace with the latest release version'
    echo '                           api-release indicates to look for a release semver version in the context of an api route and replace with the latest release version'
    echo '   -c              Compare versions and output proposed changes without changing anything.'
}

function getopts-extra () {
    declare -i i=1
    # if the next argument is not an option, then append it to array OPTARG
    while [[ ${OPTIND} -le $# && ${!OPTIND:0:1} != '-' ]]; do
        OPTARG[i]=${!OPTIND}
        ((i += 1))
        ((OPTIND += 1))
    done
}

# Parse input options
declare CHECK=0
declare SOURCE_FILE="CHANGELOG.md"
declare -a FILES_TO_CHECK=()
declare -a REPLACEMENT_FORMATS=()
declare args
declare OPTIND OPTARG opt
while getopts ":hcs:f:" opt; do
    case $opt in
        h)
            usage
            exit 0
            ;;
        c)
            CHECK=1
            ;;
        s)
            SOURCE_FILE="$OPTARG"
            ;;
        f)
            getopts-extra "$@"
            args=( "${OPTARG[@]}" )
            # validate length of args, should be 2
            if [ ${#args[@]} -eq 2 ]; then
                FILES_TO_CHECK+=( "${args[0]}" )
                REPLACEMENT_FORMATS+=( "${args[1]}" )
            else
                echo "Exactly 2 arguments must follow -f option." >&2
                exit 1
            fi
            ;;
        \?)
            echo "Invalid option: -$OPTARG." >&2
            usage
            exit 1
            ;;
    esac
done

# Parse REPLACEMENT_FORMATS
RE_SEMVER_FULL="(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)(-((0|[1-9][0-9]*|[0-9]*[a-zA-Z-][0-9a-zA-Z-]*)(\.(0|[1-9][0-9]*|[0-9]*[a-zA-Z-][0-9a-zA-Z-]*))*))?(\+([0-9a-zA-Z-]+(\.[0-9a-zA-Z-]+)*))?"
RE_RELEASE="(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)"
RE_API_RELEASE="v(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)"
# Pull out semver appearing earliest in SOURCE_FILE.
LAST_VERSION=$(grep -o -m 1 -E "${RE_SEMVER_FULL}" "$SOURCE_FILE")
LAST_RELEASE=$(grep -o -m 1 -E "${RE_RELEASE}($|[^-+])" "$SOURCE_FILE" | grep -o -m 1 -E "${RE_RELEASE}")
LAST_API_RELEASE="v$(grep -o -m 1 -E "${RE_RELEASE}($|[^-+])$" "$SOURCE_FILE" | grep -o -m 1 -E "${RE_RELEASE}")"
declare -a RE_SEMVERS=()
declare -a UPDATED_VERSIONS=()
for i in "${!REPLACEMENT_FORMATS[@]}"; do
    REPLACEMENT_FORMAT=${REPLACEMENT_FORMATS[$i]}
    case $REPLACEMENT_FORMAT in
        semver)
            RE_SEMVERS+=( "$RE_SEMVER_FULL" )
            UPDATED_VERSIONS+=( "$LAST_VERSION" )
            ;;
        release)
            RE_SEMVERS+=( "$RE_RELEASE" )
            UPDATED_VERSIONS+=( "$LAST_RELEASE" )
            ;;
        api-release)
            RE_SEMVERS+=( "$RE_API_RELEASE" )
            UPDATED_VERSIONS+=( "$LAST_API_RELEASE" )
            ;;
        *)
            echo "Invalid replacement format: \"${REPLACEMENT_FORMAT}\". Use semver, release, or api-release" >&2
            exit 1
            ;;
    esac
done

if [ -z "$LAST_VERSION" ];
then
    # No match to semver regex in SOURCE_FILE, so no version to go from.
    printf "Error: Unable to find latest version from %s.\n" "$SOURCE_FILE"
    exit 1
fi

# Search files in FILES_TO_CHECK and change (or get diffs)
declare FAILED_CHECK=0

for i in "${!FILES_TO_CHECK[@]}"; do
    FILE_TO_CHANGE=${FILES_TO_CHECK[$i]}
    RE_SEMVER=${RE_SEMVERS[$i]}
    UPDATED_VERSION=${UPDATED_VERSIONS[$i]}
    FILE_VERSION=$(grep -o -m 1 -E "${RE_SEMVER}" "$FILE_TO_CHANGE")
    if [ -z "$FILE_VERSION" ];
    then
        # No match to semver regex in VERSIONFILE, so nothing to replace
        printf "Error: No semver version found in file %s.\n" "$FILE_TO_CHANGE"
        exit 1
    else
        # Replace semver in VERSIONFILE with semver obtained from SOURCE_FILE
        TMPFILE=$(mktemp /tmp/new_version.XXXXXX)
        # Check sed version, exit if version < 4.3
        if ! sed --version > /dev/null 2>&1; then
            CURRENT_VERSION=1.archaic
        else
            CURRENT_VERSION=$(sed --version | head -n1 | cut -d" " -f4)
        fi
        REQUIRED_VERSION="4.3"
        if [ "$(printf '%s\n' "$REQUIRED_VERSION" "$CURRENT_VERSION" | sort -V | head -n1)" != "$REQUIRED_VERSION" ]; then
            echo "sed version must be >= ${REQUIRED_VERSION}" && exit 1
        fi
        sed -E -r "s/$RE_SEMVER/$UPDATED_VERSION/" "$FILE_TO_CHANGE" > "$TMPFILE"
        if [ $CHECK == 1 ];
        then
            DIFF=$(diff "$FILE_TO_CHANGE"  "$TMPFILE" )
            if [ -z "$DIFF" ];
            then
                printf "version sync would make no changes to %s.\n" "$FILE_TO_CHANGE"
                rm "$TMPFILE"
            else
                FAILED_CHECK=1
                printf "version sync would make the following changes to %s:\n%s\n" "$FILE_TO_CHANGE" "$DIFF"
                rm "$TMPFILE"
            fi
        else
            cp "$TMPFILE" "$FILE_TO_CHANGE" 
            rm "$TMPFILE"
        fi
    fi
done

# Exit with code determined by whether changes were needed in a check.
if [ ${FAILED_CHECK} -ne 0 ]; then
    exit 1
else
    exit 0
fi
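
The minimum-version check above relies on `sort -V`: sorting the required and current versions together and checking which comes first. The same idea can be sketched in Python for numeric dotted versions only (so it would not handle the `1.archaic` sentinel used above); `version_at_least` is a hypothetical helper, not part of the repo:

```python
def version_at_least(current: str, required: str) -> bool:
    """True when current >= required, comparing dotted numeric components."""
    def parts(version: str) -> tuple:
        # "4.3" -> (4, 3); tuple comparison then matches sort -V ordering
        # for purely numeric versions.
        return tuple(int(p) for p in version.split("."))

    return parts(current) >= parts(required)
```

Tuple comparison is what makes `4.10` correctly rank above `4.3`, which naive string comparison would get wrong.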


================================================
FILE: test_unstructured_inference/conftest.py
================================================
import numpy as np
import pytest
from PIL import Image

from unstructured_inference.inference.elements import (
    EmbeddedTextRegion,
    Rectangle,
    TextRegion,
)
from unstructured_inference.inference.layoutelement import LayoutElement


@pytest.fixture
def mock_pil_image():
    return Image.new("RGB", (50, 50))


@pytest.fixture
def mock_numpy_image():
    return np.zeros((50, 50, 3), np.uint8)


@pytest.fixture
def mock_rectangle():
    return Rectangle(100, 100, 300, 300)


@pytest.fixture
def mock_text_region():
    return TextRegion.from_coords(100, 100, 300, 300, text="Sample text")


@pytest.fixture
def mock_layout_element():
    return LayoutElement.from_coords(
        100,
        100,
        300,
        300,
        text="Sample text",
        source=None,
        type="Text",
    )


@pytest.fixture
def mock_embedded_text_regions():
    return [
        EmbeddedTextRegion.from_coords(
            x1=453.00277777777774,
            y1=317.319341111111,
            x2=711.5338541666665,
            y2=358.28571222222206,
            text="LayoutParser:",
        ),
        EmbeddedTextRegion.from_coords(
            x1=726.4778125,
            y1=317.319341111111,
            x2=760.3308594444444,
            y2=357.1698966666667,
            text="A",
        ),
        EmbeddedTextRegion.from_coords(
            x1=775.2748177777777,
            y1=317.319341111111,
            x2=917.3579885555555,
            y2=357.1698966666667,
            text="Unified",
        ),
        EmbeddedTextRegion.from_coords(
            x1=932.3019468888888,
            y1=317.319341111111,
            x2=1071.8426522222221,
            y2=357.1698966666667,
            text="Toolkit",
        ),
        EmbeddedTextRegion.from_coords(
            x1=1086.7866105555556,
            y1=317.319341111111,
            x2=1141.2105142777777,
            y2=357.1698966666667,
            text="for",
        ),
        EmbeddedTextRegion.from_coords(
            x1=1156.154472611111,
            y1=317.319341111111,
            x2=1256.334784222222,
            y2=357.1698966666667,
            text="Deep",
        ),
        EmbeddedTextRegion.from_coords(
            x1=437.83888888888885,
            y1=367.13322999999986,
            x2=610.0171992222222,
            y2=406.9837855555556,
            text="Learning",
        ),
        EmbeddedTextRegion.from_coords(
            x1=624.9611575555555,
            y1=367.13322999999986,
            x2=741.6754646666665,
            y2=406.9837855555556,
            text="Based",
        ),
        EmbeddedTextRegion.from_coords(
            x1=756.619423,
            y1=367.13322999999986,
            x2=958.3867708333332,
            y2=406.9837855555556,
            text="Document",
        ),
        EmbeddedTextRegion.from_coords(
            x1=973.3307291666665,
            y1=367.13322999999986,
            x2=1092.0535042777776,
            y2=406.9837855555556,
            text="Image",
        ),
    ]


# TODO(alan): Make a better test layout
@pytest.fixture
def mock_layout(mock_embedded_text_regions):
    return [
        LayoutElement(text=r.text, type="UncategorizedText", bbox=r.bbox)
        for r in mock_embedded_text_regions
    ]


@pytest.fixture
def example_table_cells():
    cells = [
        {"cell text": "Disability Category", "row_nums": [0, 1], "column_nums": [0]},
        {"cell text": "Participants", "row_nums": [0, 1], "column_nums": [1]},
        {"cell text": "Ballots Completed", "row_nums": [0, 1], "column_nums": [2]},
        {"cell text": "Ballots Incomplete/Terminated", "row_nums": [0, 1], "column_nums": [3]},
        {"cell text": "Results", "row_nums": [0], "column_nums": [4, 5]},
        {"cell text": "Accuracy", "row_nums": [1], "column_nums": [4]},
        {"cell text": "Time to complete", "row_nums": [1], "column_nums": [5]},
        {"cell text": "Blind", "row_nums": [2], "column_nums": [0]},
        {"cell text": "Low Vision", "row_nums": [3], "column_nums": [0]},
        {"cell text": "Dexterity", "row_nums": [4], "column_nums": [0]},
        {"cell text": "Mobility", "row_nums": [5], "column_nums": [0]},
        {"cell text": "5", "row_nums": [2], "column_nums": [1]},
        {"cell text": "5", "row_nums": [3], "column_nums": [1]},
        {"cell text": "5", "row_nums": [4], "column_nums": [1]},
        {"cell text": "3", "row_nums": [5], "column_nums": [1]},
        {"cell text": "1", "row_nums": [2], "column_nums": [2]},
        {"cell text": "2", "row_nums": [3], "column_nums": [2]},
        {"cell text": "4", "row_nums": [4], "column_nums": [2]},
        {"cell text": "3", "row_nums": [5], "column_nums": [2]},
        {"cell text": "4", "row_nums": [2], "column_nums": [3]},
        {"cell text": "3", "row_nums": [3], "column_nums": [3]},
        {"cell text": "1", "row_nums": [4], "column_nums": [3]},
        {"cell text": "0", "row_nums": [5], "column_nums": [3]},
        {"cell text": "34.5%, n=1", "row_nums": [2], "column_nums": [4]},
        {"cell text": "98.3% n=2 (97.7%, n=3)", "row_nums": [3], "column_nums": [4]},
        {"cell text": "98.3%, n=4", "row_nums": [4], "column_nums": [4]},
        {"cell text": "95.4%, n=3", "row_nums": [5], "column_nums": [4]},
        {"cell text": "1199 sec, n=1", "row_nums": [2], "column_nums": [5]},
        {"cell text": "1716 sec, n=3 (1934 sec, n=2)", "row_nums": [3], "column_nums": [5]},
        {"cell text": "1672.1 sec, n=4", "row_nums": [4], "column_nums": [5]},
        {"cell text": "1416 sec, n=3", "row_nums": [5], "column_nums": [5]},
    ]
    for cell in cells:
        cell["column header"] = False
    return [cells]


================================================
FILE: test_unstructured_inference/inference/test_layout.py
================================================
import os
import os.path
import tempfile
from unittest.mock import MagicMock, mock_open, patch

import numpy as np
import pytest
from PIL import Image

import unstructured_inference.models.base as models
from unstructured_inference.constants import IsExtracted
from unstructured_inference.inference import elements, layout, layoutelement, pdf_image
from unstructured_inference.inference.elements import (
    EmbeddedTextRegion,
    ImageTextRegion,
)
from unstructured_inference.models.unstructuredmodel import (
    UnstructuredElementExtractionModel,
    UnstructuredObjectDetectionModel,
)

skip_outside_ci = os.getenv("CI", "").lower() in {"", "false", "f", "0"}


@pytest.fixture
def mock_image():
    return Image.new("1", (1, 1))


@pytest.fixture
def mock_initial_layout():
    text_block = EmbeddedTextRegion.from_coords(
        2,
        4,
        6,
        8,
        text="A very repetitive narrative. " * 10,
        is_extracted=IsExtracted.TRUE,
    )

    title_block = EmbeddedTextRegion.from_coords(
        1,
        2,
        3,
        4,
        text="A Catchy Title",
        is_extracted=IsExtracted.TRUE,
    )

    return [text_block, title_block]


@pytest.fixture
def mock_final_layout():
    text_block = layoutelement.LayoutElement.from_coords(
        2,
        4,
        6,
        8,
        source="Mock",
        text="A very repetitive narrative. " * 10,
        type="NarrativeText",
    )

    title_block = layoutelement.LayoutElement.from_coords(
        1,
        2,
        3,
        4,
        source="Mock",
        text="A Catchy Title",
        type="Title",
    )

    return layoutelement.LayoutElements.from_list([text_block, title_block])


def test_pdf_page_converts_images_to_array(mock_image):
    def verify_image_array():
        assert page.image_array is None
        image_array = page._get_image_array()
        assert isinstance(image_array, np.ndarray)
        assert np.array_equal(page.image_array, image_array)

    # Scenario 1: where self.image exists
    page = layout.PageLayout(number=0, image=mock_image)
    verify_image_array()

    # Scenario 2: where self.image is None, but self.image_path exists
    page.image_array = None
    page.image = None
    page.image_path = "mock_path_to_image"
    with patch.object(Image, "open", return_value=mock_image):
        verify_image_array()


class MockLayoutModel:
    def __init__(self, layout):
        self.layout_return = layout

    def __call__(self, *args):
        return self.layout_return

    def initialize(self, *args, **kwargs):
        pass

    def deduplicate_detected_elements(self, elements, *args, **kwargs):
        return elements


def test_get_page_elements(monkeypatch, mock_final_layout):
    image = Image.fromarray(
        np.random.randint(12, 14, size=(40, 10, 3)).astype(np.uint8), mode="RGB"
    )
    page = layout.PageLayout(
        number=0,
        image=image,
        detection_model=MockLayoutModel(mock_final_layout),
    )
    elements = page.get_elements_with_detection_model(inplace=False)
    page.get_elements_with_detection_model(inplace=True)
    assert elements == page.elements_array


class MockPool:
    def map(self, f, xs):
        return [f(x) for x in xs]

    def close(self):
        pass

    def join(self):
        pass


@pytest.mark.parametrize("model_name", [None, "checkbox", "fake"])
def test_process_data_with_model(monkeypatch, mock_final_layout, model_name):
    monkeypatch.setattr(layout, "get_model", lambda x: MockLayoutModel(mock_final_layout))
    monkeypatch.setattr(
        layout.DocumentLayout,
        "from_file",
        lambda *args, **kwargs: layout.DocumentLayout.from_pages([]),
    )

    def new_isinstance(obj, cls):
        if type(obj) is MockLayoutModel:
            return True
        else:
            return isinstance(obj, cls)

    with (
        patch("builtins.open", mock_open(read_data=b"000000")),
        patch(
            "unstructured_inference.inference.layout.UnstructuredObjectDetectionModel",
            MockLayoutModel,
        ),
        open("") as fp,
    ):
        assert layout.process_data_with_model(fp, model_name=model_name)


def test_process_data_with_model_raises_on_invalid_model_name():
    with (
        patch("builtins.open", mock_open(read_data=b"000000")),
        pytest.raises(
            models.UnknownModelException,
        ),
        open("") as fp,
    ):
        layout.process_data_with_model(fp, model_name="fake")


@pytest.mark.parametrize("model_name", [None, "yolox"])
def test_process_file_with_model(monkeypatch, mock_final_layout, model_name):
    def mock_initialize(self, *args, **kwargs):
        self.model = MockLayoutModel(mock_final_layout)

    monkeypatch.setattr(
        layout.DocumentLayout,
        "from_file",
        lambda *args, **kwargs: layout.DocumentLayout.from_pages([]),
    )
    monkeypatch.setattr(models.UnstructuredDetectronONNXModel, "initialize", mock_initialize)
    filename = ""
    assert layout.process_file_with_model(filename, model_name=model_name)


def test_process_file_no_warnings(monkeypatch, mock_final_layout, recwarn):
    def mock_initialize(self, *args, **kwargs):
        self.model = MockLayoutModel(mock_final_layout)

    monkeypatch.setattr(
        layout.DocumentLayout,
        "from_file",
        lambda *args, **kwargs: layout.DocumentLayout.from_pages([]),
    )
    monkeypatch.setattr(models.UnstructuredDetectronONNXModel, "initialize", mock_initialize)
    filename = ""
    layout.process_file_with_model(filename, model_name=None)
    # There should be no UserWarning, but if there is one it should not have the following message
    with pytest.raises(AssertionError, match="not found in warning list"):
        user_warning = recwarn.pop(UserWarning)
        assert "not in available provider names" not in str(user_warning.message)


def test_process_file_with_model_raises_on_invalid_model_name():
    with pytest.raises(models.UnknownModelException):
        layout.process_file_with_model("", model_name="fake")


class MockPoints:
    def tolist(self):
        return [1, 2, 3, 4]


class MockEmbeddedTextRegion(EmbeddedTextRegion):
    def __init__(self, type=None, text=None):
        self.type = type
        self.text = text

    @property
    def points(self):
        return MockPoints()


class MockPageLayout(layout.PageLayout):
    def __init__(
        self,
        number=1,
        image=None,
        model=None,
        detection_model=None,
        layout=None,
    ):
        self.image = image
        self.layout = layout
        self.model = model
        self.number = number
        self.detection_model = detection_model


class MockLayout:
    def __init__(self, *elements):
        self.elements = elements

    def __len__(self):
        return len(self.elements)

    def sort(self, key, inplace):
        return self.elements

    def __iter__(self):
        return iter(self.elements)

    def get_texts(self):
        return [el.text for el in self.elements]

    def filter_by(self, *args, **kwargs):
        return MockLayout()


@pytest.mark.parametrize("element_extraction_model", [None, "foo"])
@pytest.mark.parametrize("filetype", ["png", "jpg", "tiff"])
def test_from_image_file(monkeypatch, mock_final_layout, filetype, element_extraction_model):
    def mock_get_elements(self, *args, **kwargs):
        self.elements = [mock_final_layout]

    monkeypatch.setattr(layout.PageLayout, "get_elements_with_detection_model", mock_get_elements)
    monkeypatch.setattr(layout.PageLayout, "get_elements_using_image_extraction", mock_get_elements)
    filename = f"sample-docs/loremipsum.{filetype}"
    image = Image.open(filename)
    image_metadata = {
        "format": image.format,
        "width": image.width,
        "height": image.height,
        "pdf_rotation": 0,
    }

    doc = layout.DocumentLayout.from_image_file(
        filename,
        element_extraction_model=element_extraction_model,
    )
    page = doc.pages[0]
    assert page.elements[0] == mock_final_layout
    assert page.image is None
    assert page.image_path == os.path.abspath(filename)
    assert page.image_metadata == image_metadata


def test_from_file(monkeypatch, mock_final_layout):
    def mock_get_elements(self, *args, **kwargs):
        self.elements = [mock_final_layout]

    monkeypatch.setattr(layout.PageLayout, "get_elements_with_detection_model", mock_get_elements)

    with tempfile.TemporaryDirectory() as tmpdir:
        image_path = os.path.join(tmpdir, "loremipsum.ppm")
        image = Image.open("sample-docs/loremipsum.jpg")
        image.save(image_path)
        image_metadata = {
            "format": "PPM",
            "width": image.width,
            "height": image.height,
            "pdf_rotation": 0,
        }

        with patch.object(
            layout,
            "convert_pdf_to_image",
            lambda *args, **kwargs: ([image_path]),
        ):
            doc = layout.DocumentLayout.from_file("fake-file.pdf")
            page = doc.pages[0]
            assert page.elements[0] == mock_final_layout
            assert page.image_metadata == image_metadata
            assert page.image is None


def test_from_file_rotated_pdf_stores_rotation_in_metadata(monkeypatch, mock_final_layout):
    """image_metadata includes pdf_rotation for rotated PDF pages."""

    def mock_get_elements(self, *args, **kwargs):
        self.elements = [mock_final_layout]

    monkeypatch.setattr(layout.PageLayout, "get_elements_with_detection_model", mock_get_elements)

    doc = layout.DocumentLayout.from_file("sample-docs/rotated-page-90.pdf")
    page = doc.pages[0]
    assert page.image_metadata["pdf_rotation"] == 90
    assert page.image is None


@pytest.mark.slow
def test_from_file_with_password(monkeypatch, mock_final_layout):

    doc = layout.DocumentLayout.from_file("sample-docs/password.pdf", password="password")
    assert doc

    monkeypatch.setattr(layout, "get_model", lambda x: MockLayoutModel(mock_final_layout))
    with (
        patch(
            "unstructured_inference.inference.layout.UnstructuredObjectDetectionModel",
            MockLayoutModel,
        ),
        open("sample-docs/password.pdf", mode="rb") as fp,
    ):
        doc = layout.process_data_with_model(fp, model_name="fake", password="password")
        assert doc


def test_from_image_file_raises_with_empty_fn():
    with pytest.raises(FileNotFoundError):
        layout.DocumentLayout.from_image_file("")


def test_from_image_file_raises_isadirectoryerror_with_dir():
    with tempfile.TemporaryDirectory() as tempdir, pytest.raises(IsADirectoryError):
        layout.DocumentLayout.from_image_file(tempdir)


def test_page_numbers_in_page_objects():
    with patch(
        "unstructured_inference.inference.layout.PageLayout.get_elements_with_detection_model",
    ) as mock_get_elements:
        doc = layout.DocumentLayout.from_file("sample-docs/layout-parser-paper.pdf")
        mock_get_elements.assert_called()
        assert [page.number for page in doc.pages] == list(range(1, len(doc.pages) + 1))


no_text_region = EmbeddedTextRegion.from_coords(0, 0, 100, 100)
text_region = EmbeddedTextRegion.from_coords(0, 0, 100, 100, text="test")
overlapping_rect = ImageTextRegion.from_coords(50, 50, 150, 150)
nonoverlapping_rect = ImageTextRegion.from_coords(150, 150, 200, 200)
populated_text_region = EmbeddedTextRegion.from_coords(50, 50, 60, 60, text="test")
unpopulated_text_region = EmbeddedTextRegion.from_coords(50, 50, 60, 60, text=None)


@pytest.mark.parametrize(
    ("colors", "add_details", "threshold"),
    [("red", False, 0.992), (None, False, 0.992), ("red", True, 0.8)],
)
def test_annotate(colors, add_details, threshold):
    def check_annotated_image():
        annotated_array = np.array(annotated_image)
        for coords in [coords1, coords2]:
            x1, y1, x2, y2 = coords
            # Make sure the pixels on the edge of the box are red
            for i, expected in zip(range(3), [255, 0, 0]):
                assert all(annotated_array[y1, x1:x2, i] == expected)
                assert all(annotated_array[y2, x1:x2, i] == expected)
                assert all(annotated_array[y1:y2, x1, i] == expected)
                assert all(annotated_array[y1:y2, x2, i] == expected)
            # Make sure almost all the pixels are not changed
            assert ((annotated_array[:, :, 0] == 1).mean()) > threshold
            assert ((annotated_array[:, :, 1] == 1).mean()) > threshold
            assert ((annotated_array[:, :, 2] == 1).mean()) > threshold

    test_image_arr = np.ones((100, 100, 3), dtype="uint8")
    image = Image.fromarray(test_image_arr)
    page = layout.PageLayout(number=1, image=image)
    coords1 = (21, 30, 37, 41)
    rect1 = elements.TextRegion.from_coords(*coords1)
    coords2 = (1, 10, 7, 11)
    rect2 = elements.TextRegion.from_coords(*coords2)
    page.elements = [rect1, rect2]

    annotated_image = page.annotate(colors=colors, add_details=add_details, sources=None)
    check_annotated_image()

    # Scenario 1: where self.image exists
    annotated_image = page.annotate(colors=colors, add_details=add_details)
    check_annotated_image()

    # Scenario 2: where self.image is None, but self.image_path exists
    with patch.object(Image, "open", return_value=image):
        page.image = None
        page.image_path = "mock_path_to_image"
        annotated_image = page.annotate(colors=colors, add_details=add_details)
        check_annotated_image()


class MockDetectionModel(layout.UnstructuredObjectDetectionModel):
    def initialize(self, *args, **kwargs):
        pass

    def predict(self, x):
        return layoutelement.LayoutElements.from_list(
            [
                layout.LayoutElement.from_coords(x1=447.0, y1=315.0, x2=1275.7, y2=413.0, text="0"),
                layout.LayoutElement.from_coords(x1=380.6, y1=473.4, x2=1334.8, y2=533.9, text="1"),
                layout.LayoutElement.from_coords(x1=578.6, y1=556.8, x2=1109.0, y2=874.4, text="2"),
                layout.LayoutElement.from_coords(
                    x1=444.5,
                    y1=942.3,
                    x2=1261.1,
                    y2=1584.1,
                    text="3",
                ),
                layout.LayoutElement.from_coords(
                    x1=444.8,
                    y1=1609.4,
                    x2=1257.2,
                    y2=1665.2,
                    text="4",
                ),
                layout.LayoutElement.from_coords(
                    x1=414.0,
                    y1=1718.8,
                    x2=635.0,
                    y2=1755.2,
                    text="5",
                ),
                layout.LayoutElement.from_coords(
                    x1=372.6,
                    y1=1786.9,
                    x2=1333.6,
                    y2=1848.7,
                    text="6",
                ),
            ],
        )


def test_layout_order(mock_image):
    with tempfile.TemporaryDirectory() as tmpdir:
        mock_image_path = os.path.join(tmpdir, "mock.jpg")
        mock_image.save(mock_image_path)
        with (
            patch.object(layout, "get_model", lambda *args, **kwargs: MockDetectionModel()),
            patch.object(
                layout,
                "convert_pdf_to_image",
                lambda *args, **kwargs: ([mock_image_path]),
            ),
        ):
            doc = layout.DocumentLayout.from_file("sample-docs/layout-parser-paper.pdf")
            page = doc.pages[0]
    for n, element in enumerate(page.elements):
        assert element.text == str(n)


def test_page_layout_raises_when_multiple_models_passed(mock_image, mock_initial_layout):
    with pytest.raises(ValueError):
        layout.PageLayout(
            0,
            mock_image,
            mock_initial_layout,
            detection_model="something",
            element_extraction_model="something else",
        )


class MockElementExtractionModel:
    def __call__(self, x):
        return [1, 2, 3]


@pytest.mark.parametrize(("inplace", "expected"), [(True, None), (False, [1, 2, 3])])
def test_get_elements_using_image_extraction(mock_image, inplace, expected):
    page = layout.PageLayout(
        1,
        mock_image,
        None,
        element_extraction_model=MockElementExtractionModel(),
    )
    assert page.get_elements_using_image_extraction(inplace=inplace) == expected


def test_get_elements_using_image_extraction_raises_with_no_extraction_model(
    mock_image,
):
    page = layout.PageLayout(1, mock_image, None, element_extraction_model=None)
    with pytest.raises(ValueError):
        page.get_elements_using_image_extraction()


def test_get_elements_with_detection_model_raises_with_wrong_default_model(
    monkeypatch, mock_image, mock_final_layout
):
    monkeypatch.setattr(layout, "get_model", lambda *x: MockLayoutModel(mock_final_layout))
    page = layout.PageLayout(1, mock_image, None)
    with pytest.raises(NotImplementedError):
        page.get_elements_with_detection_model()


@pytest.mark.parametrize(
    (
        "detection_model",
        "element_extraction_model",
        "detection_model_called",
        "element_extraction_model_called",
    ),
    [(None, "asdf", False, True), ("asdf", None, True, False)],
)
def test_from_image(
    mock_image,
    detection_model,
    element_extraction_model,
    detection_model_called,
    element_extraction_model_called,
):
    with (
        patch.object(
            layout.PageLayout,
            "get_elements_using_image_extraction",
        ) as mock_image_extraction,
        patch.object(
            layout.PageLayout,
            "get_elements_with_detection_model",
        ) as mock_detection,
    ):
        layout.PageLayout.from_image(
            mock_image,
            image_path=None,
            detection_model=detection_model,
            element_extraction_model=element_extraction_model,
        )
        assert mock_image_extraction.called == element_extraction_model_called
        assert mock_detection.called == detection_model_called


class MockUnstructuredElementExtractionModel(UnstructuredElementExtractionModel):
    def initialize(self, *args, **kwargs):
        return super().initialize(*args, **kwargs)

    def predict(self, x: Image.Image):
        return super().predict(x)


class MockUnstructuredDetectionModel(UnstructuredObjectDetectionModel):
    def initialize(self, *args, **kwargs):
        return super().initialize(*args, **kwargs)

    def predict(self, x: Image.Image):
        return super().predict(x)


@pytest.mark.parametrize(
    ("model_type", "is_detection_model"),
    [
        (MockUnstructuredElementExtractionModel, False),
        (MockUnstructuredDetectionModel, True),
    ],
)
def test_process_file_with_model_routing(monkeypatch, model_type, is_detection_model):
    model = model_type()
    monkeypatch.setattr(layout, "get_model", lambda *x: model)
    with patch.object(layout.DocumentLayout, "from_file") as mock_from_file:
        layout.process_file_with_model("asdf", model_name="fake", is_image=False)
        if is_detection_model:
            detection_model = model
            element_extraction_model = None
        else:
            detection_model = None
            element_extraction_model = model
        mock_from_file.assert_called_once_with(
            "asdf",
            detection_model=detection_model,
            element_extraction_model=element_extraction_model,
            fixed_layouts=None,
            password=None,
            pdf_image_dpi=200,
            pdf_render_max_pixels_per_page=None,
        )


@pytest.mark.parametrize(("pdf_image_dpi", "expected"), [(200, 2200), (100, 1100)])
def test_exposed_pdf_image_dpi(pdf_image_dpi, expected, monkeypatch):
    with patch.object(layout.PageLayout, "from_image") as mock_from_image:
        layout.DocumentLayout.from_file("sample-docs/loremipsum.pdf", pdf_image_dpi=pdf_image_dpi)
        assert mock_from_image.call_args[0][0].height == expected


def test_convert_pdf_to_image_no_output_folder():
    result = layout.convert_pdf_to_image(filename="sample-docs/loremipsum.pdf", dpi=72)
    assert len(result) == 1
    assert isinstance(result[0], Image.Image)


def _install_mock_pdfium(monkeypatch, *, width=720, height=720):
    page = MagicMock()
    page.get_width.return_value = width
    page.get_height.return_value = height
    page.get_rotation.return_value = 0
    page.render.return_value.to_pil.return_value = Image.new("RGB", (1, 1))
    pdf = MagicMock()
    pdf.__len__.return_value = 1
    pdf.__getitem__.return_value = page
    pdfium = MagicMock()
    pdfium.PdfDocument.return_value = pdf
    monkeypatch.setattr(pdf_image, "_get_pdfium_module", lambda: pdfium)
    return page
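
The helper above relies on `MagicMock` auto-creating attributes along a call chain, so deep stubbing needs only assignments to nested `return_value`s. A minimal standalone sketch of that pattern (the names here are illustrative, not tied to pypdfium2):

```python
from unittest.mock import MagicMock

# MagicMock supports magic methods and auto-creates attributes, so a whole
# call chain can be stubbed by assigning to nested return_value attributes,
# exactly as _install_mock_pdfium does for PdfDocument pages.
pdf = MagicMock()
pdf.__len__.return_value = 1  # len(pdf) -> 1
pdf.__getitem__.return_value.get_width.return_value = 720  # pdf[i].get_width() -> 720

assert len(pdf) == 1
assert pdf[0].get_width() == 720
```

Note that `__getitem__` returns the same stub regardless of index, which is why the helper configures a single `page` object.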


def test_convert_pdf_to_image_rejects_oversized_page_before_render(monkeypatch):
    page = _install_mock_pdfium(monkeypatch)

    with pytest.raises(pdf_image.PdfRenderTooLargeError, match="too many pixels"):
        pdf_image.convert_pdf_to_image(
            filename="mock.pdf",
            dpi=100,
            pdf_render_max_pixels_per_page=999_999,
        )

    page.render.assert_not_called()


def test_convert_pdf_to_image_allows_render_guard_to_be_disabled(monkeypatch):
    page = _install_mock_pdfium(monkeypatch)

    result = pdf_image.convert_pdf_to_image(
        filename="mock.pdf",
        dpi=100,
        pdf_render_max_pixels_per_page=0,
    )

    page.render.assert_called_once()
    assert len(result) == 1
    assert isinstance(result[0], Image.Image)


def test_page_hotload_preserves_render_max_pixels_per_page(monkeypatch, tmp_path):
    image_path = tmp_path / "page_1.png"
    Image.new("RGB", (1, 1)).save(image_path)
    calls = []

    def fake_convert_pdf_to_image(**kwargs):
        calls.append(kwargs)
        return [str(image_path)]

    monkeypatch.setattr(layout, "convert_pdf_to_image", fake_convert_pdf_to_image)
    page = layout.PageLayout(
        number=1,
        image=Image.new("RGB", (1, 1)),
        document_filename="mock.pdf",
        pdf_render_max_pixels_per_page=None,
    )

    image = page._get_image("mock.pdf", 1, pdf_image_dpi=123)

    assert image.size == (1, 1)
    assert calls[0]["dpi"] == 123
    assert calls[0]["pdf_render_max_pixels_per_page"] is None


def test_convert_pdf_to_image_output_folder_returns_images(tmp_path):
    result = layout.convert_pdf_to_image(
        filename="sample-docs/loremipsum.pdf",
        dpi=72,
        output_folder=tmp_path,
        path_only=False,
    )
    assert len(result) == 1
    assert isinstance(result[0], Image.Image)
    saved = list(tmp_path.glob("*.png"))
    assert len(saved) == 1


def test_convert_pdf_to_image_path_only(tmp_path):
    result = layout.convert_pdf_to_image(
        filename="sample-docs/loremipsum.pdf",
        dpi=72,
        output_folder=tmp_path,
        path_only=True,
    )
    assert len(result) == 1
    assert all(isinstance(p, str) for p in result)
    for p in result:
        assert os.path.exists(p)
        assert p.endswith(".png")
    saved = sorted(tmp_path.glob("*.png"))
    assert [str(s) for s in saved] == sorted(result)


def test_convert_pdf_to_image_applies_rotation_path_only(tmp_path):
    """Rotation is also applied when saving to disk (path_only mode)."""
    result = layout.convert_pdf_to_image(
        filename="sample-docs/rotated-page-90.pdf",
        dpi=72,
        output_folder=tmp_path,
        path_only=True,
    )
    assert len(result) == 1
    saved = Image.open(result[0])
    assert saved.height > saved.width, f"Expected portrait after rotation, got {saved.size}"


def test_convert_pdf_to_image_no_rotation_on_normal_pdf():
    """Non-rotated PDFs are unchanged."""
    result = layout.convert_pdf_to_image(filename="sample-docs/loremipsum.pdf", dpi=72)
    assert len(result) == 1
    img = result[0]
    # loremipsum.pdf is a standard portrait page - should stay portrait
    assert img.height > img.width, f"Expected portrait, got {img.size}"


def test_convert_pdf_to_image_save_not_under_pdfium_lock(tmp_path):
    """Verify that PIL save (disk I/O) is NOT performed while holding _pdfium_lock."""
    original_save = Image.Image.save
    lock_held_during_save = []

    def spy_save(self, *args, **kwargs):
        lock_held_during_save.append(layout._pdfium_lock.locked())
        return original_save(self, *args, **kwargs)

    with patch.object(Image.Image, "save", spy_save):
        layout.convert_pdf_to_image(
            filename="sample-docs/loremipsum.pdf",
            dpi=72,
            output_folder=tmp_path,
            path_only=True,
        )
    assert lock_held_during_save, "save was never called"
    assert not any(lock_held_during_save), "pil_image.save() was called while _pdfium_lock was held"
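
The spy technique in the test above, recording `Lock.locked()` from inside a patched callee to prove I/O happens outside the lock, can be sketched in isolation (all names below are illustrative):

```python
import threading

lock = threading.Lock()
observed = []

def do_io():
    # Record whether the lock is held at the exact moment the I/O runs.
    observed.append(lock.locked())

def render_then_save():
    with lock:
        data = "rendered"  # work that genuinely needs the lock
    do_io()  # I/O deliberately performed after releasing the lock
    return data

render_then_save()
assert observed == [False]  # the lock was not held during the I/O call
```

If `do_io()` were moved inside the `with lock:` block, `observed` would instead be `[True]`, which is precisely the regression the test guards against.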


def test_convert_pdf_to_image_concurrent_saves_not_serialized(tmp_path):
    """Two concurrent callers must be able to overlap their disk writes.

    Uses a threading.Barrier to verify both threads are inside save()
    simultaneously. If saves are serialized under _pdfium_lock, the second
    thread can never reach save() while the first is there, so the barrier
    times out and the test fails.
    """
    import threading

    original_save = Image.Image.save
    barrier = threading.Barrier(2, timeout=5)
    overlap_detected = threading.Event()

    def barrier_save(self, *args, **kwargs):
        try:
            barrier.wait()
            overlap_detected.set()
        except threading.BrokenBarrierError:
            pass
        return original_save(self, *args, **kwargs)

    errors: list[str] = []

    def run(folder):
        try:
            layout.convert_pdf_to_image(
                filename="sample-docs/loremipsum.pdf",
                dpi=72,
                output_folder=folder,
                path_only=True,
            )
        except Exception as exc:
            errors.append(str(exc))

    dir_a = tmp_path / "a"
    dir_b = tmp_path / "b"
    dir_a.mkdir()
    dir_b.mkdir()

    with patch.object(Image.Image, "save", barrier_save):
        t1 = threading.Thread(target=run, args=(dir_a,))
        t2 = threading.Thread(target=run, args=(dir_b,))
        t1.start()
        t2.start()
        t1.join(timeout=10)
        t2.join(timeout=10)

    assert not errors, f"threads raised: {errors}"
    assert overlap_detected.is_set(), (
        "saves were serialized under _pdfium_lock — threads could not overlap"
    )
    assert list(dir_a.glob("*.png")), "thread A produced no output"
    assert list(dir_b.glob("*.png")), "thread B produced no output"
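
The `threading.Barrier` trick used above generalizes: the barrier releases only when both workers are inside the instrumented section at the same time, so a barrier timeout is evidence the section was serialized. A minimal sketch:

```python
import threading

# The barrier requires 2 parties; it releases only if both threads reach
# wait() concurrently. If the section were serialized (e.g. under a lock),
# the second thread could never arrive in time and the barrier would break.
barrier = threading.Barrier(2, timeout=2)
overlapped = threading.Event()

def worker():
    try:
        barrier.wait()  # blocks until both threads are here at once
        overlapped.set()
    except threading.BrokenBarrierError:
        pass  # timeout: the two workers never overlapped

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert overlapped.is_set()
```

The test wires this same pattern into `Image.Image.save` so that overlap is measured at the exact point where serialization would occur.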


def test_render_can_proceed_while_other_thread_saves(tmp_path):
    """Thread B can acquire _pdfium_lock and render while thread A is in save().

    Blocks thread A inside save() (outside the lock), then starts thread B.
    If B completes entirely while A is still blocked, the lock was not held
    during save — rendering and saving can overlap across callers.
    """
    import threading

    original_save = Image.Image.save
    a_in_save = threading.Event()
    b_done = threading.Event()

    dir_a = tmp_path / "a"
    dir_b = tmp_path / "b"
    dir_a.mkdir()
    dir_b.mkdir()

    def gated_save(self, *args, **kwargs):
        fp = str(args[0]) if args else ""
        if str(dir_a) in fp:
            a_in_save.set()
            b_done.wait(timeout=5)
        return original_save(self, *args, **kwargs)

    errors: list[str] = []

    def run(folder, done_event=None):
        try:
            layout.convert_pdf_to_image(
                filename="sample-docs/loremipsum.pdf",
                dpi=72,
                output_folder=folder,
                path_only=True,
            )
        except Exception as exc:
            errors.append(str(exc))
        finally:
            if done_event:
                done_event.set()

    with patch.object(Image.Image, "save", gated_save):
        t_a = threading.Thread(target=run, args=(dir_a,))
        t_b = threading.Thread(target=run, args=(dir_b, b_done))
        t_a.start()
        a_in_save.wait(timeout=5)
        # A is now blocked in save (outside lock). B should render + save freely.
        t_b.start()
        t_b.join(timeout=10)
        t_a.join(timeout=10)

    assert not errors, f"threads raised: {errors}"
    assert b_done.is_set(), "Thread B could not complete while A was saving"
    assert list(dir_a.glob("*.png")), "thread A produced no output"
    assert list(dir_b.glob("*.png")), "thread B produced no output"


def test_multi_page_concurrent_output_complete(tmp_path):
    """Two threads processing a multi-page PDF both produce correct, complete output."""
    import threading

    errors: list[str] = []

    def run(folder):
        try:
            layout.convert_pdf_to_image(
                filename="sample-docs/loremipsum_multipage.pdf",
                dpi=72,
                output_folder=folder,
                path_only=True,
            )
        except Exception as exc:
            errors.append(str(exc))

    dir_a = tmp_path / "a"
    dir_b = tmp_path / "b"
    dir_a.mkdir()
    dir_b.mkdir()

    t1 = threading.Thread(target=run, args=(dir_a,))
    t2 = threading.Thread(target=run, args=(dir_b,))
    t1.start()
    t2.start()
    t1.join(timeout=60)
    t2.join(timeout=60)

    assert not errors, f"threads raised: {errors}"
    a_files = sorted(dir_a.glob("*.png"))
    b_files = sorted(dir_b.glob("*.png"))
    assert len(a_files) == 10, f"thread A produced {len(a_files)} files, expected 10"
    assert len(b_files) == 10, f"thread B produced {len(b_files)} files, expected 10"
    for i in range(1, 11):
        assert (dir_a / f"page_{i}.png").exists(), f"thread A missing page_{i}.png"
        assert (dir_b / f"page_{i}.png").exists(), f"thread B missing page_{i}.png"


def test_error_in_one_thread_does_not_block_other(tmp_path):
    """If one thread fails mid-processing, the other still completes."""
    import threading

    original_save = Image.Image.save

    dir_a = tmp_path / "a"
    dir_b = tmp_path / "b"
    dir_a.mkdir()
    dir_b.mkdir()

    def failing_save(self, *args, **kwargs):
        fp = str(args[0]) if args else ""
        if str(dir_a) in fp:
            raise OSError("simulated disk failure")
        return original_save(self, *args, **kwargs)

    a_error: list[Exception] = []
    b_result: list[str] = []
    b_error: list[Exception] = []

    def run_a():
        try:
            layout.convert_pdf_to_image(
                filename="sample-docs/loremipsum.pdf",
                dpi=72,
                output_folder=dir_a,
                path_only=True,
            )
        except Exception as exc:
            a_error.append(exc)

    def run_b():
        try:
            result = layout.convert_pdf_to_image(
                filename="sample-docs/loremipsum.pdf",
                dpi=72,
                output_folder=dir_b,
                path_only=True,
            )
            b_result.extend(result)
        except Exception as exc:
            b_error.append(exc)

    with patch.object(Image.Image, "save", failing_save):
        t_a = threading.Thread(target=run_a)
        t_b = threading.Thread(target=run_b)
        t_a.start()
        t_b.start()
        t_a.join(timeout=10)
        t_b.join(timeout=10)

    assert a_error, "Thread A should have failed"
    assert not b_error, f"Thread B should have succeeded: {b_error}"
    assert b_result, "Thread B produced no result"
    assert list(dir_b.glob("*.png")), "Thread B produced no output files"


@pytest.mark.parametrize(
    ("filename", "img_num", "should_complete"),
    [
        ("sample-docs/empty-document.pdf", 0, True),
        ("sample-docs/empty-document.pdf", 10, False),
    ],
)
def test_get_image(filename, img_num, should_complete):
    doc = layout.DocumentLayout.from_file(filename)
    page = doc.pages[0]
    try:
        img = page._get_image(filename, img_num)
        # Convert to a numpy array and verify the empty page rendered as an
        # all-white (blank) image.
        img = np.array(img)
        assert img.mean() == 255.0
    except ValueError:
        assert not should_complete


================================================
FILE: test_unstructured_inference/inference/test_layout_element.py
================================================
from unstructured_inference.constants import IsExtracted, Source
from unstructured_inference.inference.layoutelement import LayoutElement, TextRegion


def test_layout_element_to_dict(mock_layout_element):
    expected = {
        "coordinates": ((100, 100), (100, 300), (300, 300), (300, 100)),
        "text": "Sample text",
        "is_extracted": None,
        "type": "Text",
        "prob": None,
        "source": None,
    }

    assert mock_layout_element.to_dict() == expected


def test_layout_element_from_region(mock_rectangle):
    expected = LayoutElement.from_coords(100, 100, 300, 300)
    region = TextRegion(bbox=mock_rectangle)

    assert LayoutElement.from_region(region) == expected


def test_layoutelement_inheritance_works_correctly():
    """Test that LayoutElement properly inherits from TextRegion without conflicts"""
    from unstructured_inference.inference.elements import TextRegion

    # Create a TextRegion with both source and text_source
    region = TextRegion.from_coords(
        0, 0, 10, 10, text="test", source=Source.YOLOX, is_extracted=IsExtracted.TRUE
    )

    # Convert to LayoutElement
    element = LayoutElement.from_region(region)

    # Check that both properties are preserved
    assert element.source == Source.YOLOX, "LayoutElement should inherit source from TextRegion"
    assert element.is_extracted == IsExtracted.TRUE, (
        "LayoutElement should inherit is_extracted from TextRegion"
    )

    # Check that to_dict() works correctly
    d = element.to_dict()
    assert d["source"] == Source.YOLOX
    assert d["is_extracted"] == IsExtracted.TRUE

    # Check that we can set source directly on LayoutElement
    element.source = Source.DETECTRON2_ONNX
    assert element.source == Source.DETECTRON2_ONNX


================================================
FILE: test_unstructured_inference/inference/test_layout_rotation.py
================================================
from __future__ import annotations

import numpy as np

from unstructured_inference.inference import pdf_image


def test_convert_pdf_to_image_applies_rotation():
    """Pages with /Rotate metadata are rendered upright."""
    result = pdf_image.convert_pdf_to_image(filename="sample-docs/rotated-page-90.pdf", dpi=72)
    assert len(result) == 1
    img = result[0]
    # The PDF has /Rotate=90 on a landscape page (width > height in PDF units).
    # Without rotation fix the rendered image would be landscape; with the fix it's portrait.
    assert img.height > img.width, f"Expected portrait after rotation, got {img.size}"

    # Fixture contract: rotated-page-90.pdf has visible dark text in the upper half when upright.
    # Use relative dark-pixel counts to reduce sensitivity to minor renderer differences.
    gray = np.array(img.convert("L"))
    split = gray.shape[0] // 2
    top_dark_pixels = int(np.count_nonzero(gray[:split] < 245))
    bottom_dark_pixels = int(np.count_nonzero(gray[split:] < 245))

    assert top_dark_pixels > 0, "Expected text pixels in upper half of upright page"
    assert top_dark_pixels > max(bottom_dark_pixels * 10, 50), (
        "Expected substantially more dark pixels in upper half for upright orientation; "
        f"got top={top_dark_pixels}, bottom={bottom_dark_pixels}"
    )


================================================
FILE: test_unstructured_inference/models/test_detectron2onnx.py
================================================
import os
from unittest.mock import patch

import pytest
from PIL import Image

import unstructured_inference.models.base as models
import unstructured_inference.models.detectron2onnx as detectron2


class MockDetectron2ONNXLayoutModel:
    def __init__(self, *args, **kwargs):
        self.args = args
        self.kwargs = kwargs

    def run(self, *args):
        return ([(1, 2, 3, 4)], [0], [(4, 5)], [0.818])

    def get_inputs(self):
        class input_thing:
            name = "Bernard"

        return [input_thing()]


def test_load_default_model(monkeypatch):
    monkeypatch.setattr(models, "models", {})
    with patch.object(
        detectron2.onnxruntime,
        "InferenceSession",
        new=MockDetectron2ONNXLayoutModel,
    ):
        model = models.get_model("detectron2_mask_rcnn")

    assert isinstance(model.model, MockDetectron2ONNXLayoutModel)


@pytest.mark.parametrize(("model_path", "label_map"), [("asdf", "diufs"), ("dfaw", "hfhfhfh")])
def test_load_model(model_path, label_map):
    with patch.object(detectron2.onnxruntime, "InferenceSession", return_value=True):
        model = detectron2.UnstructuredDetectronONNXModel()
        model.initialize(model_path=model_path, label_map=label_map)
        args, _ = detectron2.onnxruntime.InferenceSession.call_args
        assert args == (model_path,)
    assert label_map == model.label_map


def test_unstructured_detectron_model():
    model = detectron2.UnstructuredDetectronONNXModel()
    model.model = 1
    with patch.object(detectron2.UnstructuredDetectronONNXModel, "predict", return_value=[]):
        result = model(None)
    assert isinstance(result, list)
    assert len(result) == 0


def test_inference():
    with patch.object(
        detectron2.onnxruntime,
        "InferenceSession",
        return_value=MockDetectron2ONNXLayoutModel(),
    ):
        model = detectron2.UnstructuredDetectronONNXModel()
        model.initialize(model_path="test_path", label_map={0: "test_class"})
        assert isinstance(model.model, MockDetectron2ONNXLayoutModel)
        with open(os.path.join("sample-docs", "receipt-sample.jpg"), mode="rb") as fp:
            image = Image.open(fp)
            image.load()
        elements = model(image)
        assert len(elements) == 1
        element = elements[0]
        (x1, y1), _, (x2, y2), _ = element.bbox.coordinates
        assert hasattr(
            element,
            "prob",
        )  # NOTE(pravin): ensure the element carries a probability attribute
        assert isinstance(
            element.prob,
            float,
        )  # NOTE(pravin): ensure the populated probability is a float
        # NOTE(alan): The bbox coordinates get resized, so check their relative proportions
        assert x2 / x1 == pytest.approx(3.0)  # x1 == 1, x2 == 3 before scaling
        assert y2 / y1 == pytest.approx(2.0)  # y1 == 2, y2 == 4 before scaling
        assert element.type == "test_class"


================================================
FILE: test_unstructured_inference/models/test_eval.py
================================================
import pytest

from unstructured_inference.inference.layoutelement import table_cells_to_dataframe
from unstructured_inference.models.eval import compare_contents_as_df, default_tokenizer


@pytest.fixture
def actual_cells():
    return [
        {
            "column_nums": [0],
            "row_nums": [0, 1],
            "column header": True,
            "cell text": "Disability Category",
        },
        {
            "column_nums": [1],
            "row_nums": [0, 1],
            "column header": True,
            "cell text": "Participants",
        },
        {
            "column_nums": [2],
            "row_nums": [0, 1],
            "column header": True,
            "cell text": "Ballots Completed",
        },
        {
            "column_nums": [3],
            "row_nums": [0, 1],
            "column header": True,
            "cell text": "Ballots Incomplete/Terminated",
        },
        {"column_nums": [4, 5], "row_nums": [0], "column header": True, "cell text": "Results"},
        {"column_nums": [4], "row_nums": [1], "column header": False, "cell text": "Accuracy"},
        {
            "column_nums": [5],
            "row_nums": [1],
            "column header": False,
            "cell text": "Time to complete",
        },
        {"column_nums": [0], "row_nums": [2], "column header": False, "cell text": "Blind"},
        {"column_nums": [0], "row_nums": [3], "column header": False, "cell text": "Low Vision"},
        {"column_nums": [0], "row_nums": [4], "column header": False, "cell text": "Dexterity"},
        {"column_nums": [0], "row_nums": [5], "column header": False, "cell text": "Mobility"},
        {"column_nums": [1], "row_nums": [2], "column header": False, "cell text": "5"},
        {"column_nums": [1], "row_nums": [3], "column header": False, "cell text": "5"},
        {"column_nums": [1], "row_nums": [4], "column header": False, "cell text": "5"},
        {"column_nums": [1], "row_nums": [5], "column header": False, "cell text": "3"},
        {"column_nums": [2], "row_nums": [2], "column header": False, "cell text": "1"},
        {"column_nums": [2], "row_nums": [3], "column header": False, "cell text": "2"},
        {"column_nums": [2], "row_nums": [4], "column header": False, "cell text": "4"},
        {"column_nums": [2], "row_nums": [5], "column header": False, "cell text": "3"},
        {"column_nums": [3], "row_nums": [2], "column header": False, "cell text": "4"},
        {"column_nums": [3], "row_nums": [3], "column header": False, "cell text": "3"},
        {"column_nums": [3], "row_nums": [4], "column header": False, "cell text": "1"},
        {"column_nums": [3], "row_nums": [5], "column header": False, "cell text": "0"},
        {"column_nums": [4], "row_nums": [2], "column header": False, "cell text": "34.5%, n=1"},
        {
            "column_nums": [4],
            "row_nums": [3],
            "column header": False,
            "cell text": "98.3% n=2 (97.7%, n=3)",
        },
        {"column_nums": [4], "row_nums": [4], "column header": False, "cell text": "98.3%, n=4"},
        {"column_nums": [4], "row_nums": [5], "column header": False, "cell text": "95.4%, n=3"},
        {"column_nums": [5], "row_nums": [2], "column header": False, "cell text": "1199 sec, n=1"},
        {
            "column_nums": [5],
            "row_nums": [3],
            "column header": False,
            "cell text": "1716 sec, n=3 (1934 sec, n=2)",
        },
        {
            "column_nums": [5],
            "row_nums": [4],
            "column header": False,
            "cell text": "1672.1 sec, n=4",
        },
        {"column_nums": [5], "row_nums": [5], "column header": False, "cell text": "1416 sec, n=3"},
    ]


@pytest.fixture
def pred_cells():
    return [
        {"column_nums": [0], "row_nums": [2], "column header": False, "cell text": "Blind"},
        {"column_nums": [0], "row_nums": [3], "column header": False, "cell text": "Low Vision"},
        {"column_nums": [0], "row_nums": [4], "column header": False, "cell text": "Dexterity"},
        {"column_nums": [0], "row_nums": [5], "column header": False, "cell text": "Mobility"},
        {"column_nums": [1], "row_nums": [2], "column header": False, "cell text": "5"},
        {"column_nums": [1], "row_nums": [3], "column header": False, "cell text": "5"},
        {"column_nums": [1], "row_nums": [4], "column header": False, "cell text": "5"},
        {"column_nums": [1], "row_nums": [5], "column header": False, "cell text": "3"},
        {"column_nums": [2], "row_nums": [2], "column header": False, "cell text": "1"},
        {"column_nums": [2], "row_nums": [3], "column header": False, "cell text": "2"},
        {"column_nums": [2], "row_nums": [4], "column header": False, "cell text": "4"},
        {"column_nums": [2], "row_nums": [5], "column header": False, "cell text": "3"},
        {"column_nums": [3], "row_nums": [2], "column header": False, "cell text": "4"},
        {"column_nums": [3], "row_nums": [3], "column header": False, "cell text": "3"},
        {"column_nums": [3], "row_nums": [4], "column header": False, "cell text": "1"},
        {"column_nums": [3], "row_nums": [5], "column header": False, "cell text": "0"},
        {"column_nums": [4], "row_nums": [1], "column header": False, "cell text": "Accuracy"},
        {"column_nums": [4], "row_nums": [2], "column header": False, "cell text": "34.5%, n=1"},
        {
            "column_nums": [4],
            "row_nums": [3],
            "column header": False,
            "cell text": "98.3% n=2 (97.7%, n=3)",
        },
        {"column_nums": [4], "row_nums": [4], "column header": False, "cell text": "98.3%, n=4"},
        {"column_nums": [4], "row_nums": [5], "column header": False, "cell text": "95.4%, n=3"},
        {
            "column_nums": [5],
            "row_nums": [1],
            "column header": False,
            "cell text": "Time to complete",
        },
        {"column_nums": [5], "row_nums": [2], "column header": False, "cell text": "1199 sec, n=1"},
        {
            "column_nums": [5],
            "row_nums": [3],
            "column header": False,
            "cell text": "1716 sec, n=3 | (1934 sec, n=2)",
        },
        {
            "column_nums": [5],
            "row_nums": [4],
            "column header": False,
            "cell text": "1672.1 sec, n=4",
        },
        {"column_nums": [5], "row_nums": [5], "column header": False, "cell text": "1416 sec, n=3"},
        {
            "column_nums": [0],
            "row_nums": [0, 1],
            "column header": True,
            "cell text": "soa etealeiliay Category",
        },
        {"column_nums": [4, 5], "row_nums": [0], "column header": True, "cell text": "Results"},
        {
            "column_nums": [1],
            "row_nums": [0, 1],
            "column header": True,
            "cell text": "Participants P",
        },
        {
            "column_nums": [2],
            "row_nums": [0, 1],
            "column header": True,
            "cell text": "pallets Completed",
        },
        {
            "column_nums": [3],
            "row_nums": [0, 1],
            "column header": True,
            "cell text": "Ballot: incom lete/ Ne Terminated",
        },
    ]


@pytest.fixture
def actual_df(actual_cells):
    return table_cells_to_dataframe(actual_cells).fillna("")


@pytest.fixture
def pred_df(pred_cells):
    return table_cells_to_dataframe(pred_cells).fillna("")


@pytest.mark.parametrize(
    ("eval_func", "processor"),
    [
        ("token_ratio", default_tokenizer),
        ("token_ratio", None),
        ("partial_token_ratio", default_tokenizer),
        ("ratio", None),
        ("ratio", default_tokenizer),
        ("partial_ratio", default_tokenizer),
    ],
)
def test_compare_content_as_df(actual_df, pred_df, eval_func, processor):
    results = compare_contents_as_df(actual_df, pred_df, eval_func=eval_func, processor=processor)
    assert 0 < results.get(f"by_col_{eval_func}") < 100


def test_compare_content_as_df_with_invalid_input(actual_df, pred_df):
    with pytest.raises(ValueError, match="eval_func must be one of"):
        compare_contents_as_df(actual_df, pred_df, eval_func="foo")


================================================
FILE: test_unstructured_inference/models/test_model.py
================================================
import json
import threading
import time
from typing import Any
from unittest import mock

import numpy as np
import pytest

import unstructured_inference.models.base as models
from unstructured_inference.inference.layoutelement import LayoutElement, LayoutElements
from unstructured_inference.models.unstructuredmodel import (
    ModelNotInitializedError,
    UnstructuredObjectDetectionModel,
)


class MockModel(UnstructuredObjectDetectionModel):
    call_count = 0

    def __init__(self):
        self.initializer = mock.MagicMock()
        super().__init__()

    def initialize(self, *args, **kwargs):
        return self.initializer(self, *args, **kwargs)

    def predict(self, x: Any) -> Any:
        return LayoutElements(element_coords=np.array([]))


MOCK_MODEL_TYPES = {
    "foo": {
        "input_shape": (640, 640),
    },
}


def test_get_model(monkeypatch):
    monkeypatch.setattr(models, "models", {})
    with mock.patch.dict(models.model_class_map, {"yolox": MockModel}):
        assert isinstance(models.get_model("yolox"), MockModel)


def test_get_model_threaded(monkeypatch):
    """Test that get_model works correctly when called from multiple threads simultaneously."""
    monkeypatch.setattr(models, "models", {})

    # Results and exceptions from threads will be stored here
    results = []
    exceptions = []

    def get_model_worker(thread_id):
        """Worker function for each thread."""
        try:
            model = models.get_model("yolox")
            results.append((thread_id, model))
        except Exception as e:
            exceptions.append((thread_id, e))

    # Create and start multiple threads
    num_threads = 10
    threads = []

    with mock.patch.dict(models.model_class_map, {"yolox": MockModel}):
        for i in range(num_threads):
            thread = threading.Thread(target=get_model_worker, args=(i,))
            threads.append(thread)
            thread.start()

        # Wait for all threads to complete
        for thread in threads:
            thread.join()

    # Verify no exceptions occurred
    assert len(exceptions) == 0, f"Exceptions occurred in threads: {exceptions}"

    # Verify all threads got results
    assert len(results) == num_threads, f"Expected {num_threads} results, got {len(results)}"

    # Verify all results are MockModel instances
    for thread_id, model in results:
        assert isinstance(model, MockModel), (
            f"Thread {thread_id} got unexpected model type: {type(model)}"
        )


def test_get_model_concurrent_different_models(monkeypatch):
    """Test that different models can load in parallel without serialization."""
    monkeypatch.setattr(models, "models", {})

    # Track initialization timing
    init_events = []
    init_lock = threading.Lock()

    class SlowMockModel(MockModel):
        def __init__(self):
            super().__init__()
            self.model_name = None

        def initialize(self, *args, **kwargs):
            with init_lock:
                init_events.append((self.model_name, "start"))
            time.sleep(0.1)  # Simulate slow loading
            with init_lock:
                init_events.append((self.model_name, "end"))
            return super().initialize(*args, **kwargs)

    # Store model names in instances
    def create_model_with_name(name):
        def factory():
            model = SlowMockModel()
            model.model_name = name
            return model

        return factory

    results = []

    def worker(model_name):
        models.get_model(model_name)  # Load the model
        results.append(model_name)

    # Load 2 different models concurrently
    threads = []
    mock_config = {"input_shape": (640, 640)}
    with (
        mock.patch.dict(
            models.model_class_map,
            {
                "yolox": create_model_with_name("yolox"),
                "detectron2": create_model_with_name("detectron2"),
            },
        ),
        mock.patch.dict(models.model_config_map, {"yolox": mock_config, "detectron2": mock_config}),
    ):
        for model_name in ["yolox", "detectron2"]:
            thread = threading.Thread(target=worker, args=(model_name,))
            threads.append(thread)
            thread.start()

        for thread in threads:
            thread.join()

    # Both models should load successfully
    assert len(results) == 2

    # Verify parallel execution (both start before either ends)
    assert len(init_events) == 4, f"Expected 4 events (2 starts + 2 ends), got {len(init_events)}"

    # True parallelism means both models start before either finishes
    # Find when the first model finishes
    first_end_idx = next(
        (i for i, (_, event_type) in enumerate(init_events) if event_type == "end"), None
    )
    assert first_end_idx is not None, "No 'end' event found"

    # Count how many models started before the first one finished
    starts_before_first_end = sum(
        1 for _, event_type in init_events[:first_end_idx] if event_type == "start"
    )
    assert starts_before_first_end == 2, (
        f"Expected both models to start before either finishes (parallel execution), "
        f"but only {starts_before_first_end} started before first completion. "
        f"Events: {init_events}"
    )


def test_register_new_model():
    assert "foo" not in models.model_class_map
    assert "foo" not in models.model_config_map
    models.register_new_model(MOCK_MODEL_TYPES, MockModel)
    assert "foo" in models.model_class_map
    assert "foo" in models.model_config_map
    model = models.get_model("foo")
    assert len(model.initializer.mock_calls) == 1
    assert model.initializer.mock_calls[0][-1] == MOCK_MODEL_TYPES["foo"]
    assert isinstance(model, MockModel)
    # unregister the new model by resetting the mappings to their defaults
    models.model_class_map, models.model_config_map = models.get_default_model_mappings()
    assert "foo" not in models.model_class_map
    assert "foo" not in models.model_config_map


def test_get_model_with_lazydict_config(monkeypatch):
    """get_model must unpack a LazyDict config into initialize() without
    depending on Mapping.keys() — prevents regression of
    'argument after ** must be a mapping, not LazyDict' in prod.
    """
    from unstructured_inference.utils import LazyDict, LazyEvaluateInfo

    monkeypatch.setattr(models, "models", {})

    evaluated = []

    def _fake_download(path):
        evaluated.append(path)
        return path

    lazy_config = LazyDict(
        model_path=LazyEvaluateInfo(_fake_download, "/tmp/weights.onnx"),
        input_shape=(640, 640),
    )

    with (
        mock.patch.dict(models.model_class_map, {"lazy_mock": MockModel}),
        mock.patch.dict(models.model_config_map, {"lazy_mock": lazy_config}),
    ):
        model = models.get_model("lazy_mock")

    assert isinstance(model, MockModel)
    assert evaluated == ["/tmp/weights.onnx"]
    model.initializer.assert_called_once_with(
        model,
        model_path="/tmp/weights.onnx",
        input_shape=(640, 640),
    )


def test_raises_invalid_model():
    with pytest.raises(models.UnknownModelException):
        models.get_model("fake_model")


def test_raises_uninitialized():
    with pytest.raises(ModelNotInitializedError):
        models.UnstructuredDetectronONNXModel().predict(None)


def test_model_initializes_once():
    from unstructured_inference.inference import layout

    with (
        mock.patch.dict(models.model_class_map, {"yolox": MockModel}),
        mock.patch.object(
            models,
            "models",
            {},
        ),
    ):
        doc = layout.DocumentLayout.from_file("sample-docs/loremipsum.pdf")
        doc.pages[0].detection_model.initializer.assert_called_once()


def test_deduplicate_detected_elements():
    import numpy as np

    from unstructured_inference.inference.elements import intersections
    from unstructured_inference.inference.layout import DocumentLayout
    from unstructured_inference.models.base import get_model

    model = get_model("yolox_quantized")
    # model.confidence_threshold=0.5
    file = "sample-docs/example_table.jpg"
    doc = DocumentLayout.from_image_file(
        file,
        model,
    )
    known_elements = [e.bbox for e in doc.pages[0].elements if e.type != "UncategorizedText"]
    # Compute intersection matrix
    intersections_mtx = intersections(*known_elements)
    # Zero out the diagonal (an element always intersects itself)
    np.fill_diagonal(intersections_mtx, False)
    # Now every entry should be False, because no intersections should remain
    assert not intersections_mtx.any()


def test_enhance_regions():
    from unstructured_inference.inference.elements import Rectangle
    from unstructured_inference.models.base import get_model

    elements = [
        LayoutElement(bbox=Rectangle(0, 0, 1, 1)),
        LayoutElement(bbox=Rectangle(0.01, 0.01, 1.01, 1.01)),
        LayoutElement(bbox=Rectangle(0.02, 0.02, 1.02, 1.02)),
        LayoutElement(bbox=Rectangle(0.03, 0.03, 1.03, 1.03)),
        LayoutElement(bbox=Rectangle(0.04, 0.04, 1.04, 1.04)),
        LayoutElement(bbox=Rectangle(0.05, 0.05, 1.05, 1.05)),
        LayoutElement(bbox=Rectangle(0.06, 0.06, 1.06, 1.06)),
        LayoutElement(bbox=Rectangle(0.07, 0.07, 1.07, 1.07)),
        LayoutElement(bbox=Rectangle(0.08, 0.08, 1.08, 1.08)),
        LayoutElement(bbox=Rectangle(0.09, 0.09, 1.09, 1.09)),
        LayoutElement(bbox=Rectangle(0.10, 0.10, 1.10, 1.10)),
    ]
    model = get_model("yolox_tiny")
    elements = model.enhance_regions(elements, 0.5)
    assert len(elements) == 1
    assert (
        elements[0].bbox.x1,
        elements[0].bbox.y1,
        elements[0].bbox.x2,
        elements[0].bbox.y2,
    ) == (
        0,
        0,
        1.10,
        1.10,
    )


def test_clean_type():
    from unstructured_inference.inference.layout import LayoutElement
    from unstructured_inference.models.base import get_model

    elements = [
        LayoutElement.from_coords(
            0.6,
            0.6,
            0.65,
            0.65,
            type="Table",
        ),  # One little table nested inside all the others
        LayoutElement.from_coords(0.5, 0.5, 0.7, 0.7, type="Table"),  # One nested table
        LayoutElement.from_coords(0, 0, 1, 1, type="Table"),  # Big table
        LayoutElement.from_coords(0.01, 0.01, 1.01, 1.01),
        LayoutElement.from_coords(0.02, 0.02, 1.02, 1.02),
        LayoutElement.from_coords(0.03, 0.03, 1.03, 1.03),
        LayoutElement.from_coords(0.04, 0.04, 1.04, 1.04),
        LayoutElement.from_coords(0.05, 0.05, 1.05, 1.05),
    ]
    model = get_model("yolox_tiny")
    elements = model.clean_type(elements, type_to_clean="Table")
    assert len(elements) == 1
    assert (
        elements[0].bbox.x1,
        elements[0].bbox.y1,
        elements[0].bbox.x2,
        elements[0].bbox.y2,
    ) == (0, 0, 1, 1)


def test_env_variables_override_default_model(monkeypatch):
    # When an environment variable specifies a different default model and we call get_model with no
    # args, we should get back the model the env var calls for
    monkeypatch.setattr(models, "models", {})
    with (
        mock.patch.dict(
            models.os.environ,
            {"UNSTRUCTURED_DEFAULT_MODEL_NAME": "yolox"},
        ),
        mock.patch.dict(models.model_class_map, {"yolox": MockModel}),
    ):
        model = models.get_model()
    assert isinstance(model, MockModel)


def test_env_variables_override_initialization_params(monkeypatch):
    # When initialization params are specified in an environment variable, and we call get_model, we
    # should see that the model was initialized with those params
    monkeypatch.setattr(models, "models", {})
    fake_label_map = {"1": "label1", "2": "label2"}
    with (
        mock.patch.dict(
            models.os.environ,
            {"UNSTRUCTURED_DEFAULT_MODEL_INITIALIZE_PARAMS_JSON_PATH": "fake_json.json"},
        ),
        mock.patch.object(models, "DEFAULT_MODEL", "fake"),
        mock.patch.dict(
            models.model_class_map,
            {"fake": mock.MagicMock()},
        ),
        mock.patch(
            "builtins.open",
            mock.mock_open(
                read_data='{"model_path": "fakepath", "label_map": '
                + json.dumps(fake_label_map)
                + "}",
            ),
        ),
    ):
        model = models.get_model()
    model.initialize.assert_called_once_with(
        model_path="fakepath",
        label_map={1: "label1", 2: "label2"},
    )


================================================
FILE: test_unstructured_inference/models/test_tables.py
================================================
import os
import threading
from copy import deepcopy

import numpy as np
import pytest
import torch
from PIL import Image
from transformers.models.table_transformer.modeling_table_transformer import (
    TableTransformerDecoder,
)

import unstructured_inference.models.table_postprocess as postprocess
from unstructured_inference.models import tables
from unstructured_inference.models.tables import (
    apply_thresholds_on_objects,
    structure_to_cells,
)

skip_outside_ci = os.getenv("CI", "").lower() in {"", "false", "f", "0"}


@pytest.fixture
def table_transformer():
    tables.load_agent()
    return tables.tables_agent


def test_load_agent(table_transformer):
    assert hasattr(table_transformer, "model")


@pytest.fixture
def example_image():
    return Image.open("./sample-docs/table-multi-row-column-cells.png").convert("RGB")


@pytest.fixture
def mocked_ocr_tokens():
    return [
        {
            "bbox": [51.0, 37.0, 1333.0, 38.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 0,
            "text": " ",
        },
        {
            "bbox": [1064.0, 47.0, 1161.0, 71.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 1,
            "text": "Results",
        },
        {
            "bbox": [891.0, 113.0, 1333.0, 114.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 2,
            "text": " ",
        },
        {
            "bbox": [51.0, 236.0, 1333.0, 237.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 3,
            "text": " ",
        },
        {
            "bbox": [51.0, 308.0, 1333.0, 309.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 4,
            "text": " ",
        },
        {
            "bbox": [51.0, 450.0, 1333.0, 452.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 5,
            "text": " ",
        },
        {
            "bbox": [51.0, 522.0, 1333.0, 524.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 6,
            "text": " ",
        },
        {
            "bbox": [51.0, 37.0, 53.0, 596.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 7,
            "text": " ",
        },
        {
            "bbox": [90.0, 89.0, 167.0, 93.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 8,
            "text": "soa",
        },
        {
            "bbox": [684.0, 68.0, 762.0, 91.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 9,
            "text": "Ballot:",
        },
        {
            "bbox": [69.0, 84.0, 196.0, 140.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 10,
            "text": "etealeiliay",
        },
        {
            "bbox": [283.0, 109.0, 446.0, 132.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 11,
            "text": "Participants",
        },
        {
            "bbox": [484.0, 84.0, 576.0, 140.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 12,
            "text": "pallets",
        },
        {
            "bbox": [684.0, 75.0, 776.0, 132.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 13,
            "text": "incom",
        },
        {
            "bbox": [788.0, 107.0, 853.0, 136.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 14,
            "text": "lete/",
        },
        {
            "bbox": [68.0, 121.0, 191.0, 162.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 15,
            "text": "Category",
        },
        {
            "bbox": [371.0, 115.0, 386.0, 137.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 16,
            "text": "P",
        },
        {
            "bbox": [483.0, 121.0, 632.0, 162.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 17,
            "text": "Completed",
        },
        {
            "bbox": [756.0, 115.0, 785.0, 154.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 18,
            "text": "Ne",
        },
        {
            "bbox": [930.0, 125.0, 1054.0, 152.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 19,
            "text": "Accuracy",
        },
        {
            "bbox": [1159.0, 124.0, 1227.0, 147.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 20,
            "text": "Time",
        },
        {
            "bbox": [1235.0, 126.0, 1264.0, 147.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 21,
            "text": "to",
        },
        {
            "bbox": [682.0, 149.0, 841.0, 173.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 22,
            "text": "Terminated",
        },
        {
            "bbox": [1147.0, 169.0, 1276.0, 198.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 23,
            "text": "complete",
        },
        {
            "bbox": [70.0, 245.0, 127.0, 266.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 24,
            "text": "Blind",
        },
        {
            "bbox": [361.0, 247.0, 373.0, 266.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 25,
            "text": "5",
        },
        {
            "bbox": [562.0, 247.0, 573.0, 266.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 26,
            "text": "1",
        },
        {
            "bbox": [772.0, 247.0, 786.0, 266.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 27,
            "text": "4",
        },
        {
            "bbox": [925.0, 246.0, 1005.0, 270.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 28,
            "text": "34.5%,",
        },
        {
            "bbox": [1017.0, 247.0, 1059.0, 266.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 29,
            "text": "n=1",
        },
        {
            "bbox": [1129.0, 246.0, 1187.0, 266.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 30,
            "text": "1199",
        },
        {
            "bbox": [1197.0, 251.0, 1241.0, 270.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 31,
            "text": "sec,",
        },
        {
            "bbox": [1253.0, 247.0, 1295.0, 266.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 32,
            "text": "n=1",
        },
        {
            "bbox": [70.0, 319.0, 117.0, 338.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 33,
            "text": "Low",
        },
        {
            "bbox": [125.0, 318.0, 198.0, 338.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 34,
            "text": "Vision",
        },
        {
            "bbox": [361.0, 319.0, 373.0, 338.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 35,
            "text": "5",
        },
        {
            "bbox": [561.0, 318.0, 573.0, 338.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 36,
            "text": "2",
        },
        {
            "bbox": [773.0, 318.0, 785.0, 338.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 37,
            "text": "3",
        },
        {
            "bbox": [928.0, 318.0, 1002.0, 339.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 38,
            "text": "98.3%",
        },
        {
            "bbox": [1013.0, 318.0, 1055.0, 338.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 39,
            "text": "n=2",
        },
        {
            "bbox": [1129.0, 318.0, 1188.0, 338.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 40,
            "text": "1716",
        },
        {
            "bbox": [1197.0, 323.0, 1242.0, 342.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 41,
            "text": "sec,",
        },
        {
            "bbox": [1253.0, 318.0, 1295.0, 338.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 42,
            "text": "n=3",
        },
        {
            "bbox": [916.0, 387.0, 1005.0, 413.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 43,
            "text": "(97.7%,",
        },
        {
            "bbox": [1016.0, 387.0, 1068.0, 413.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 44,
            "text": "n=3)",
        },
        {
            "bbox": [1086.0, 383.0, 1099.0, 418.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 45,
            "text": "|",
        },
        {
            "bbox": [1120.0, 387.0, 1188.0, 413.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 46,
            "text": "(1934",
        },
        {
            "bbox": [1197.0, 393.0, 1241.0, 412.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 47,
            "text": "sec,",
        },
        {
            "bbox": [1253.0, 387.0, 1305.0, 413.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 48,
            "text": "n=2)",
        },
        {
            "bbox": [70.0, 456.0, 181.0, 489.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 49,
            "text": "Dexterity",
        },
        {
            "bbox": [360.0, 461.0, 372.0, 480.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 50,
            "text": "5",
        },
        {
            "bbox": [560.0, 461.0, 574.0, 480.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 51,
            "text": "4",
        },
        {
            "bbox": [774.0, 461.0, 785.0, 480.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 52,
            "text": "1",
        },
        {
            "bbox": [924.0, 460.0, 1005.0, 484.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 53,
            "text": "98.3%,",
        },
        {
            "bbox": [1017.0, 461.0, 1060.0, 480.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 54,
            "text": "n=4",
        },
        {
            "bbox": [1118.0, 460.0, 1199.0, 480.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 55,
            "text": "1672.1",
        },
        {
            "bbox": [1209.0, 465.0, 1253.0, 484.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 56,
            "text": "sec,",
        },
        {
            "bbox": [1265.0, 461.0, 1308.0, 480.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 57,
            "text": "n=4",
        },
        {
            "bbox": [70.0, 527.0, 170.0, 561.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 58,
            "text": "Mobility",
        },
        {
            "bbox": [361.0, 532.0, 373.0, 552.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 59,
            "text": "3",
        },
        {
            "bbox": [561.0, 532.0, 573.0, 552.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 60,
            "text": "3",
        },
        {
            "bbox": [773.0, 532.0, 786.0, 552.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 61,
            "text": "0",
        },
        {
            "bbox": [924.0, 532.0, 1005.0, 556.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 62,
            "text": "95.4%,",
        },
        {
            "bbox": [1017.0, 532.0, 1059.0, 552.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 63,
            "text": "n=3",
        },
        {
            "bbox": [1129.0, 532.0, 1188.0, 552.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 64,
            "text": "1416",
        },
        {
            "bbox": [1197.0, 537.0, 1242.0, 556.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 65,
            "text": "sec,",
        },
        {
            "bbox": [1253.0, 532.0, 1295.0, 552.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 66,
            "text": "n=3",
        },
        {
            "bbox": [266.0, 37.0, 267.0, 596.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 67,
            "text": " ",
        },
        {
            "bbox": [466.0, 37.0, 468.0, 596.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 68,
            "text": " ",
        },
        {
            "bbox": [666.0, 37.0, 668.0, 596.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 69,
            "text": " ",
        },
        {
            "bbox": [891.0, 37.0, 893.0, 596.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 70,
            "text": " ",
        },
        {
            "bbox": [1091.0, 113.0, 1093.0, 596.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 71,
            "text": " ",
        },
        {
            "bbox": [51.0, 595.0, 1333.0, 596.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 72,
            "text": " ",
        },
        {
            "bbox": [1331.0, 37.0, 1333.0, 596.0],
            "block_num": 0,
            "line_num": 0,
            "span_num": 73,
            "text": " ",
        },
    ]


@pytest.mark.parametrize(
    "model_path",
    [
        ("invalid_table_path"),
        ("incorrect_table_path"),
    ],
)
def test_load_table_model_raises_when_not_available(model_path):
    with pytest.raises(OSError):
        table_model = tables.UnstructuredTableTransformerModel()
        table_model.initialize(model=model_path)


@pytest.mark.parametrize(
    ("bbox1", "bbox2", "expected_result"),
    [
        ((0, 0, 5, 5), (2, 2, 7, 7), 0.36),
        ((0, 0, 0, 0), (6, 6, 10, 10), 0),
    ],
)
def test_iob(bbox1, bbox2, expected_result):
    result = tables.iob(bbox1, bbox2)
    assert result == expected_result


@pytest.mark.parametrize(
    "model_path",
    [
        "microsoft/table-transformer-structure-recognition",
    ],
)
def test_load_table_transformer_model(model_path):
    table_model = tables.UnstructuredTableTransformerModel()
    table_model.initialize(model=model_path)
    assert type(table_model.model.model.decoder) is TableTransformerDecoder


@pytest.mark.parametrize(
    ("input_test", "output_test"),
    [
        (
            [
                {
                    "label": "table column header",
                    "score": 0.9349299073219299,
                    "bbox": [
                        47.83147430419922,
                        116.8877944946289,
                        2557.79296875,
                        216.98883056640625,
                    ],
                },
                {
                    "label": "table column header",
                    "score": 0.934,
                    "bbox": [
                        47.83147430419922,
                        116.8877944946289,
                        2557.79296875,
                        216.98883056640625,
                    ],
                },
            ],
            [
                {
                    "label": "table column header",
                    "score": 0.9349299073219299,
                    "bbox": [
                        47.83147430419922,
                        116.8877944946289,
                        2557.79296875,
                        216.98883056640625,
                    ],
                },
            ],
        ),
        ([], []),
    ],
)
def test_nms(input_test, output_test):
    output = postprocess.nms(input_test)

    assert output == output_test


@pytest.mark.parametrize(
    ("supercell1", "supercell2"),
    [
        (
            {
                "label": "table spanning cell",
                "score": 0.526617169380188,
                "bbox": [
                    1446.2801513671875,
                    1023.817138671875,
                    2114.3525390625,
                    1099.20166015625,
                ],
                "projected row header": False,
                "header": False,
                "row_numbers": [3, 4],
                "column_numbers": [0, 4],
            },
            {
                "label": "table spanning cell",
                "score": 0.5199193954467773,
                "bbox": [
                    98.92312622070312,
                    676.1566772460938,
                    751.0982666015625,
                    938.5986938476562,
                ],
                "projected row header": False,
                "header": False,
                "row_numbers": [3, 4, 6],
                "column_numbers": [0, 4],
            },
        ),
        (
            {
                "label": "table spanning cell",
                "score": 0.526617169380188,
                "bbox": [
                    1446.2801513671875,
                    1023.817138671875,
                    2114.3525390625,
                    1099.20166015625,
                ],
                "projected row header": False,
                "header": False,
                "row_numbers": [3, 4],
                "column_numbers": [0, 4],
            },
            {
                "label": "table spanning cell",
                "score": 0.5199193954467773,
                "bbox": [
                    98.92312622070312,
                    676.1566772460938,
                    751.0982666015625,
                    938.5986938476562,
                ],
                "projected row header": False,
                "header": False,
                "row_numbers": [4],
                "column_numbers": [0, 4, 6],
            },
        ),
        (
            {
                "label": "table spanning cell",
                "score": 0.526617169380188,
                "bbox": [
                    1446.2801513671875,
                    1023.817138671875,
                    2114.3525390625,
                    1099.20166015625,
                ],
                "projected row header": False,
                "header": False,
                "row_numbers": [3, 4],
                "column_numbers": [1, 4],
            },
            {
                "label": "table spanning cell",
                "score": 0.5199193954467773,
                "bbox": [
                    98.92312622070312,
                    676.1566772460938,
                    751.0982666015625,
                    938.5986938476562,
                ],
                "projected row header": False,
                "header": False,
                "row_numbers": [4],
                "column_numbers": [0, 4, 6],
            },
        ),
        (
            {
                "label": "table spanning cell",
                "score": 0.526617169380188,
                "bbox": [
                    1446.2801513671875,
                    1023.817138671875,
                    2114.3525390625,
                    1099.20166015625,
                ],
                "projected row header": False,
                "header": False,
                "row_numbers": [3, 4],
                "column_numbers": [1, 4],
            },
            {
                "label": "table spanning cell",
                "score": 0.5199193954467773,
                "bbox": [
                    98.92312622070312,
                    676.1566772460938,
                    751.0982666015625,
                    938.5986938476562,
                ],
                "projected row header": False,
                "header": False,
                "row_numbers": [2, 4, 5, 6, 7, 8],
                "column_numbers": [0, 4, 6],
            },
        ),
    ],
)
def test_remove_supercell_overlap(supercell1, supercell2):
    assert postprocess.remove_supercell_overlap(supercell1, supercell2) is None


@pytest.mark.parametrize(
    ("supercells", "rows", "columns", "output_test"),
    [
        (
            [
                {
                    "label": "table spanning cell",
                    "score": 0.9,
                    "bbox": [
                        98.92312622070312,
                        143.11549377441406,
                        2115.197265625,
                        1238.27587890625,
                    ],
                    "projected row header": True,
                    "header": True,
                    "span": True,
                },
            ],
            [
                {
                    "label": "table row",
                    "score": 0.9299452900886536,
                    "bbox": [0, 0, 10, 10],
                    "column header": True,
                    "header": True,
                },
                {
                    "label": "table row",
                    "score": 0.9299452900886536,
                    "bbox": [
                        98.92312622070312,
                        143.11549377441406,
                        2114.3525390625,
                        193.67681884765625,
                    ],
                    "column header": True,
                    "header": True,
                },
                {
                    "label": "table row",
                    "score": 0.9299452900886536,
                    "bbox": [
                        98.92312622070312,
                        143.11549377441406,
                        2114.3525390625,
                        193.67681884765625,
                    ],
                    "column header": True,
                    "header": True,
                },
            ],
            [
                {
                    "label": "table column",
                    "score": 0.9996132254600525,
                    "bbox": [
                        98.92312622070312,
                        143.11549377441406,
                        517.6508178710938,
                        1616.48779296875,
                    ],
                },
                {
                    "label": "table column",
                    "score": 0.9935646653175354,
                    "bbox": [
                        520.0474853515625,
                        143.11549377441406,
                        751.0982666015625,
                        1616.48779296875,
                    ],
                },
            ],
            [
                {
                    "label": "table spanning cell",
                    "score": 0.9,
                    "bbox": [
                        98.92312622070312,
                        143.11549377441406,
                        751.0982666015625,
                        193.67681884765625,
                    ],
                    "projected row header": True,
                    "header": True,
                    "span": True,
                    "row_numbers": [1, 2],
                    "column_numbers": [0, 1],
                },
                {
                    "row_numbers": [0],
                    "column_numbers": [0, 1],
                    "score": 0.9,
                    "propagated": True,
                    "bbox": [
                        98.92312622070312,
                        143.11549377441406,
                        751.0982666015625,
                        193.67681884765625,
                    ],
                },
            ],
        ),
    ],
)
def test_align_supercells(supercells, rows, columns, output_test):
    assert postprocess.align_supercells(supercells, rows, columns) == output_test


@pytest.mark.parametrize(("rows", "bbox", "output"), [([1.0], [0.0], [1.0])])
def test_align_rows(rows, bbox, output):
    assert postprocess.align_rows(rows, bbox) == output


@pytest.mark.parametrize(
    ("output_format", "expectation"),
    [
        ("html", "<tr><td>Blind</td><td>5</td><td>1</td><td>4</td><td>34.5%, n=1</td>"),
        (
            "cells",
            {
                "column_nums": [0],
                "row_nums": [2],
                "column header": False,
                "cell text": "Blind",
            },
        ),
        ("dataframe", ["Blind", "5", "1", "4", "34.5%, n=1", "1199 sec, n=1"]),
        (None, "<tr><td>Blind</td><td>5</td><td>1</td><td>4</td><td>34.5%, n=1</td>"),
    ],
)
def test_table_prediction_output_format(
    output_format,
    expectation,
    table_transformer,
    example_image,
    mocker,
    example_table_cells,
    mocked_ocr_tokens,
):
    mocker.patch.object(tables, "recognize", return_value=example_table_cells)
    mocker.patch.object(
        tables.UnstructuredTableTransformerModel,
        "get_structure",
        return_value=None,
    )
    if output_format:
        result = table_transformer.run_prediction(
            example_image,
            result_format=output_format,
            ocr_tokens=mocked_ocr_tokens,
        )
    else:
        result = table_transformer.run_prediction(example_image, ocr_tokens=mocked_ocr_tokens)

    if output_format == "dataframe":
        assert expectation in result.values
    elif output_format == "cells":
        # other outputs like bbox are flaky to test since they depend on OCR and may
        # change slightly when the OCR package changes or even across machines
        validation_fields = ("column_nums", "row_nums", "column header", "cell text")
        assert expectation in [{key: cell[key] for key in validation_fields} for cell in result]
    else:
        assert expectation in result


def test_table_prediction_output_format_when_wrong_type_then_value_error(
    table_transformer,
    example_image,
    mocker,
    example_table_cells,
    mocked_ocr_tokens,
):
    mocker.patch.object(tables, "recognize", return_value=example_table_cells)
    mocker.patch.object(
        tables.UnstructuredTableTransformerModel,
        "get_structure",
        return_value=None,
    )
    with pytest.raises(ValueError):
        table_transformer.run_prediction(
            example_image,
            result_format="Wrong format",
            ocr_tokens=mocked_ocr_tokens,
        )


def test_table_prediction_runs_with_empty_recognize(
    table_transformer,
    example_image,
    mocker,
    mocked_ocr_tokens,
):
    mocker.patch.object(tables, "recognize", return_value=[])
    mocker.patch.object(
        tables.UnstructuredTableTransformerModel,
        "get_structure",
        return_value=None,
    )
    assert table_transformer.run_prediction(example_image, ocr_tokens=mocked_ocr_tokens) == ""


def test_table_prediction_with_ocr_tokens(table_transformer, example_image, mocked_ocr_tokens):
    prediction = table_transformer.predict(example_image, ocr_tokens=mocked_ocr_tokens)
    assert '<table><thead><tr><th rowspan="2">' in prediction
    assert "<tr><td>Blind</td><td>5</td><td>1</td><td>4</td><td>34.5%, n=1</td>" in prediction


def test_table_prediction_with_no_ocr_tokens(table_transformer, example_image):
    with pytest.raises(ValueError):
        table_transformer.predict(example_image)


@pytest.mark.parametrize(
    ("thresholds", "expected_object_number"),
    [
        ({"0": 0.5}, 1),
        ({"0": 0.1}, 3),
        ({"0": 0.9}, 0),
    ],
)
def test_objects_are_filtered_based_on_class_thresholds_when_correct_prediction_and_threshold(
    thresholds,
    expected_object_number,
):
    objects = [
        {"label": "0", "score": 0.2},
        {"label": "0", "score": 0.4},
        {"label": "0", "score": 0.55},
    ]
    assert len(apply_thresholds_on_objects(objects, thresholds)) == expected_object_number


@pytest.mark.parametrize(
    ("thresholds", "expected_object_number"),
    [
        ({"0": 0.5, "1": 0.1}, 4),
        ({"0": 0.1, "1": 0.9}, 3),
        ({"0": 0.9, "1": 0.5}, 1),
    ],
)
def test_objects_are_filtered_based_on_class_thresholds_when_two_classes(
    thresholds,
    expected_object_number,
):
    objects = [
        {"label": "0", "score": 0.2},
        {"label": "0", "score": 0.4},
        {"label": "0", "score": 0.55},
        {"label": "1", "score": 0.2},
        {"label": "1", "score": 0.4},
        {"label": "1", "score": 0.55},
    ]
    assert len(apply_thresholds_on_objects(objects, thresholds)) == expected_object_number


def test_objects_filtering_when_missing_threshold():
    class_name = "class_name"
    objects = [{"label": class_name, "score": 0.2}]
    thresholds = {"1": 0.5}
    with pytest.raises(KeyError, match=class_name):
        apply_thresholds_on_objects(objects, thresholds)


def test_intersect():
    a = postprocess.Rect()
    b = postprocess.Rect([1, 2, 3, 4])
    assert a.intersect(b).get_area() == 4.0


def test_include_rect():
    a = postprocess.Rect()
    assert a.include_rect([1, 2, 3, 4]).get_area() == 4.0


@pytest.mark.parametrize(
    ("spans", "join_with_space", "expected"),
    [
        (
            [
                {
                    "flags": 2**0,
                    "text": "5",
                    "superscript": False,
                    "span_num": 0,
                    "line_num": 0,
                    "block_num": 0,
                },
            ],
            True,
            "",
        ),
        (
            [
                {
                    "flags": 2**0,
                    "text": "p",
                    "superscript": False,
                    "span_num": 0,
                    "line_num": 0,
                    "block_num": 0,
                },
            ],
            True,
            "p",
        ),
        (
            [
                {
                    "flags": 2**0,
                    "text": "p",
                    "superscript": False,
                    "span_num": 0,
                    "line_num": 0,
                    "block_num": 0,
                },
                {
                    "flags": 2**0,
                    "text": "p",
                    "superscript": False,
                    "span_num": 0,
                    "line_num": 0,
                    "block_num": 0,
                },
            ],
            True,
            "p p",
        ),
        (
            [
                {
                    "flags": 2**0,
                    "text": "p",
                    "superscript": False,
                    "span_num": 0,
                    "line_num": 0,
                    "block_num": 0,
                },
                {
                    "flags": 2**0,
                    "text": "p",
                    "superscript": False,
                    "span_num": 0,
                    "line_num": 0,
                    "block_num": 1,
                },
            ],
            True,
            "p p",
        ),
        (
            [
                {
                    "flags": 2**0,
                    "text": "p",
                    "superscript": False,
                    "span_num": 0,
                    "line_num": 0,
                    "block_num": 0,
                },
                {
                    "flags": 2**0,
                    "text": "p",
                    "superscript": False,
                    "span_num": 0,
                    "line_num": 0,
                    "block_num": 1,
                },
            ],
            False,
            "p p",
        ),
    ],
)
def test_extract_text_from_spans(spans, join_with_space, expected):
    res = postprocess.extract_text_from_spans(
        spans,
        join_with_space=join_with_space,
        remove_integer_superscripts=True,
    )
    assert res == expected


@pytest.mark.parametrize(
    ("supercells", "expected_len"),
    [
        ([{"header": "hi", "row_numbers": [0, 1, 2], "score": 0.9}], 1),
        (
            [
                {
                    "header": "hi",
                    "row_numbers": [0],
                    "column_numbers": [1, 2, 3],
                    "score": 0.9,
                },
                {
                    "header": "hi",
                    "row_numbers": [1],
                    "column_numbers": [1],
                    "score": 0.9,
                },
                {
                    "header": "hi",
                    "row_numbers": [1],
                    "column_numbers": [2],
                    "score": 0.9,
                },
                {
                    "header": "hi",
                    "row_numbers": [1],
                    "column_numbers": [3],
                    "score": 0.9,
                },
            ],
            4,
        ),
        (
            [
                {
                    "header": "hi",
                    "row_numbers": [0],
                    "column_numbers": [0],
                    "score": 0.9,
                },
                {
                    "header": "hi",
                    "row_numbers": [1],
                    "column_numbers": [0],
                    "score": 0.9,
                },
                {
                    "header": "hi",
                    "row_numbers": [1, 2],
                    "column_numbers": [0],
                    "score": 0.9,
                },
                {
                    "header": "hi",
                    "row_numbers": [3],
                    "column_numbers": [0],
                    "score": 0.9,
                },
            ],
            3,
        ),
    ],
)
def test_header_supercell_tree(supercells, expected_len):
    postprocess.header_supercell_tree(supercells)
    assert len(supercells) == expected_len


@pytest.mark.parametrize("zoom", [1, 0.1, 5, -1, 0])
def test_zoom_image(example_image, zoom):
    width, height = example_image.size
    new_image = tables.zoom_image(example_image, zoom)
    new_w, new_h = new_image.size
    if zoom <= 0:
        zoom = 1
    assert new_w == np.round(width * zoom, 0)
    assert new_h == np.round(height * zoom, 0)


@pytest.mark.parametrize(
    ("input_cells", "expected_html"),
    [
        # +----------+----------+----------+
        # | row1col1 | row1col2 | row1col3 |
        # +----------+----------+----------+
        # | row2col1 | row2col2 | row2col3 |
        # +----------+----------+----------+
        pytest.param(
            [
                {
                    "row_nums": [0],
                    "column_nums": [0],
                    "cell text": "row1col1",
                    "column header": False,
                },
                {
                    "row_nums": [0],
                    "column_nums": [1],
                    "cell text": "row1col2",
                    "column header": False,
                },
                {
                    "row_nums": [0],
                    "column_nums": [2],
                    "cell text": "row1col3",
                    "column header": False,
                },
                {
                    "row_nums": [1],
                    "column_nums": [0],
                    "cell text": "row2col1",
                    "column header": False,
                },
                {
                    "row_nums": [1],
                    "column_nums": [1],
                    "cell text": "row2col2",
                    "column header": False,
                },
                {
                    "row_nums": [1],
                    "column_nums": [2],
                    "cell text": "row2col3",
                    "column header": False,
                },
            ],
            (
                "<table><tbody><tr><td>row1col1</td><td>row1col2</td><td>row1col3</td></tr>"
                "<tr><td>row2col1</td><td>row2col2</td><td>row2col3</td></tr></tbody></table>"
            ),
            id="simple table without header",
        ),
        # +----------+----------+----------+
        # |  h1col1  |  h1col2  |  h1col3  |
        # +----------+----------+----------+
        # | row1col1 | row1col2 | row1col3 |
        # +----------+----------+----------+
        # | row2col1 | row2col2 | row2col3 |
        # +----------+----------+----------+
        pytest.param(
            [
                {"row_nums": [0], "column_nums": [0], "cell text": "h1col1", "column header": True},
                {"row_nums": [0], "column_nums": [1], "cell text": "h1col2", "column header": True},
                {"row_nums": [0], "column_nums": [2], "cell text": "h1col2", "column header": True},
                {
                    "row_nums": [1],
                    "column_nums": [0],
                    "cell text": "row1col1",
                    "column header": False,
                },
                {
                    "row_nums": [1],
                    "column_nums": [1],
                    "cell text": "row1col2",
                    "column header": False,
                },
                {
                    "row_nums": [1],
                    "column_nums": [2],
                    "cell text": "row1col3",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [0],
                    "cell text": "row2col1",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [1],
                    "cell text": "row2col2",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [2],
                    "cell text": "row2col3",
                    "column header": False,
                },
            ],
            (
                "<table><thead><tr><th>h1col1</th><th>h1col2</th><th>h1col2</th></tr></thead>"
                "<tbody><tr><td>row1col1</td><td>row1col2</td><td>row1col3</td></tr>"
                "<tr><td>row2col1</td><td>row2col2</td><td>row2col3</td></tr></tbody></table>"
            ),
            id="simple table with header",
        ),
        # +----------+----------+----------+
        # |  h1col1  |  h1col2  |  h1col3  |
        # +----------+----------+----------+
        # | row1col1 | row1col2 | row1col3 |
        # +----------+----------+----------+
        # | row2col1 | row2col2 | row2col3 |
        # +----------+----------+----------+
        pytest.param(
            [
                {"row_nums": [0], "column_nums": [1], "cell text": "h1col2", "column header": True},
                {
                    "row_nums": [2],
                    "column_nums": [0],
                    "cell text": "row2col1",
                    "column header": False,
                },
                {
                    "row_nums": [1],
                    "column_nums": [0],
                    "cell text": "row1col1",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [1],
                    "cell text": "row2col2",
                    "column header": False,
                },
                {
                    "row_nums": [1],
                    "column_nums": [1],
                    "cell text": "row1col2",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [2],
                    "cell text": "row2col3",
                    "column header": False,
                },
                {"row_nums": [0], "column_nums": [0], "cell text": "h1col1", "column header": True},
                {
                    "row_nums": [1],
                    "column_nums": [2],
                    "cell text": "row1col3",
                    "column header": False,
                },
                {"row_nums": [0], "column_nums": [2], "cell text": "h1col2", "column header": True},
            ],
            (
                "<table><thead><tr><th>h1col1</th><th>h1col2</th><th>h1col2</th></tr></thead>"
                "<tbody><tr><td>row1col1</td><td>row1col2</td><td>row1col3</td></tr>"
                "<tr><td>row2col1</td><td>row2col2</td><td>row2col3</td></tr></tbody></table>"
            ),
            id="simple table with header, mixed elements",
        ),
        # +----------+----------+----------+
        # |   two    |     two columns     |
        # |          +----------+----------+
        # |   rows   |sub cell 1|sub cell 2|
        # +----------+----------+----------+
        pytest.param(
            [
                {
                    "row_nums": [0, 1],
                    "column_nums": [0],
                    "cell text": "two row",
                    "column header": False,
                },
                {
                    "row_nums": [0],
                    "column_nums": [1, 2],
                    "cell text": "two cols",
                    "column header": False,
                },
                {
                    "row_nums": [1],
                    "column_nums": [1],
                    "cell text": "sub cell 1",
                    "column header": False,
                },
                {
                    "row_nums": [1],
                    "column_nums": [2],
                    "cell text": "sub cell 2",
                    "column header": False,
                },
            ],
            (
                '<table><tbody><tr><td rowspan="2">two row</td><td colspan="2">two '
                "cols</td></tr><tr><td>sub cell 1</td><td>sub cell 2</td></tr>"
                "</tbody></table>"
            ),
            id="various spans, no headers",
        ),
        # +----------+---------------------+----------+
        # |          |       h1col23       |  h1col4  |
        # | h12col1  |----------+----------+----------|
        # |          |  h2col2  |       h2col34       |
        # |----------|----------+----------+----------+
        # |  r3col1  |  r3col2  |                     |
        # |----------+----------|      r34col34       |
        # |       r4col12       |                     |
        # +----------+----------+----------+----------+
        pytest.param(
            [
                {
                    "row_nums": [0, 1],
                    "column_nums": [0],
                    "cell text": "h12col1",
                    "column header": True,
                },
                {
                    "row_nums": [0],
                    "column_nums": [1, 2],
                    "cell text": "h1col23",
                    "column header": True,
                },
                {"row_nums": [0], "column_nums": [3], "cell text": "h1col4", "column header": True},
                {"row_nums": [1], "column_nums": [1], "cell text": "h2col2", "column header": True},
                {
                    "row_nums": [1],
                    "column_nums": [2, 3],
                    "cell text": "h2col34",
                    "column header": True,
                },
                {
                    "row_nums": [2],
                    "column_nums": [0],
                    "cell text": "r3col1",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [1],
                    "cell text": "r3col2",
                    "column header": False,
                },
                {
                    "row_nums": [2, 3],
                    "column_nums": [2, 3],
                    "cell text": "r34col34",
                    "column header": False,
                },
                {
                    "row_nums": [3],
                    "column_nums": [0, 1],
                    "cell text": "r4col12",
                    "column header": False,
                },
            ],
            (
                '<table><thead><tr><th rowspan="2">h12col1</th>'
                '<th colspan="2">h1col23</th><th>h1col4</th></tr>'
                '<tr><th>h2col2</th><th colspan="2">h2col34</th></tr></thead><tbody>'
                '<tr><td>r3col1</td><td>r3col2</td><td colspan="2" rowspan="2">r34col34</td></tr>'
                '<tr><td colspan="2">r4col12</td></tr></tbody></table>'
            ),
            id="various spans, with 2 row header",
        ),
    ],
)
def test_cells_to_html(input_cells, expected_html):
    assert tables.cells_to_html(input_cells) == expected_html
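# Editor's note, not part of the repo's test suite: the parametrized cases above
# pin down the cells_to_html contract -- each cell dict's row_nums/column_nums
# spans become rowspan/colspan attributes, header cells become <th> inside
# <thead>, and the rest become <td> inside <tbody>. The function below is an
# illustrative, simplified re-implementation inferred from those expected
# strings; the name `simple_cells_to_html` is hypothetical, not a library API.

```python
def simple_cells_to_html(cells):
    """Render cell dicts (row_nums/column_nums spans) as an HTML table string."""
    n_rows = max(max(c["row_nums"]) for c in cells) + 1
    header_rows, body_rows = [], []
    for r in range(n_rows):
        # a spanning cell is emitted only in its first (top) row
        row_cells = sorted(
            (c for c in cells if c["row_nums"][0] == r),
            key=lambda c: c["column_nums"][0],
        )
        tds = []
        for c in row_cells:
            tag = "th" if c["column header"] else "td"
            attrs = ""
            if len(c["column_nums"]) > 1:  # horizontal span -> colspan
                attrs += f' colspan="{len(c["column_nums"])}"'
            if len(c["row_nums"]) > 1:  # vertical span -> rowspan
                attrs += f' rowspan="{len(c["row_nums"])}"'
            tds.append(f"<{tag}{attrs}>{c['cell text']}</{tag}>")
        row_html = "<tr>" + "".join(tds) + "</tr>"
        is_header = bool(row_cells) and row_cells[0]["column header"]
        (header_rows if is_header else body_rows).append(row_html)
    html = "<table>"
    if header_rows:
        html += "<thead>" + "".join(header_rows) + "</thead>"
    if body_rows:
        html += "<tbody>" + "".join(body_rows) + "</tbody>"
    return html + "</table>"
```

# Run against the "various spans, no headers" case above, this sketch
# reproduces the expected string, including the colspan-before-rowspan
# attribute order seen in the r34col34 expectation.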


@pytest.mark.parametrize(
    ("input_cells", "expected_cells"),
    [
        pytest.param(
            [
                {"row_nums": [0], "column_nums": [0], "cell text": "h1col1", "column header": True},
                {"row_nums": [0], "column_nums": [1], "cell text": "h1col2", "column header": True},
                {"row_nums": [0], "column_nums": [2], "cell text": "h1col2", "column header": True},
                {
                    "row_nums": [1],
                    "column_nums": [0],
                    "cell text": "row1col1",
                    "column header": False,
                },
                {
                    "row_nums": [1],
                    "column_nums": [1],
                    "cell text": "row1col2",
                    "column header": False,
                },
                {
                    "row_nums": [1],
                    "column_nums": [2],
                    "cell text": "row1col3",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [0],
                    "cell text": "row2col1",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [1],
                    "cell text": "row2col2",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [2],
                    "cell text": "row2col3",
                    "column header": False,
                },
            ],
            [
                {"row_nums": [0], "column_nums": [0], "cell text": "h1col1", "column header": True},
                {"row_nums": [0], "column_nums": [1], "cell text": "h1col2", "column header": True},
                {"row_nums": [0], "column_nums": [2], "cell text": "h1col2", "column header": True},
                {
                    "row_nums": [1],
                    "column_nums": [0],
                    "cell text": "row1col1",
                    "column header": False,
                },
                {
                    "row_nums": [1],
                    "column_nums": [1],
                    "cell text": "row1col2",
                    "column header": False,
                },
                {
                    "row_nums": [1],
                    "column_nums": [2],
                    "cell text": "row1col3",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [0],
                    "cell text": "row2col1",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [1],
                    "cell text": "row2col2",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [2],
                    "cell text": "row2col3",
                    "column header": False,
                },
            ],
            id="identical tables, no changes expected",
        ),
        pytest.param(
            [
                {"row_nums": [0], "column_nums": [0], "cell text": "h1col1", "column header": True},
                {"row_nums": [0], "column_nums": [2], "cell text": "h1col2", "column header": True},
                {
                    "row_nums": [1],
                    "column_nums": [0],
                    "cell text": "row1col1",
                    "column header": False,
                },
                {
                    "row_nums": [1],
                    "column_nums": [1],
                    "cell text": "row1col2",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [0],
                    "cell text": "row2col1",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [1],
                    "cell text": "row2col2",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [2],
                    "cell text": "row2col3",
                    "column header": False,
                },
            ],
            [
                {"row_nums": [0], "column_nums": [0], "cell text": "h1col1", "column header": True},
                {"row_nums": [0], "column_nums": [1], "cell text": "", "column header": True},
                {"row_nums": [0], "column_nums": [2], "cell text": "h1col2", "column header": True},
                {
                    "row_nums": [1],
                    "column_nums": [0],
                    "cell text": "row1col1",
                    "column header": False,
                },
                {
                    "row_nums": [1],
                    "column_nums": [1],
                    "cell text": "row1col2",
                    "column header": False,
                },
                {"row_nums": [1], "column_nums": [2], "cell text": "", "column header": False},
                {
                    "row_nums": [2],
                    "column_nums": [0],
                    "cell text": "row2col1",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [1],
                    "cell text": "row2col2",
                    "column header": False,
                },
                {
                    "row_nums": [2],
                    "column_nums": [2],
                    "cell text": "row2col3",
                    "column header": False,
                },
            ],
            id="missing column in header and in the middle",
        ),
        pytest.param(
            [
                {
                    "row_nums": [0, 1],
                    "column_nums": [0],
                    "cell text": "h12col1",
                    "column header": True,
                },
                {
                    "row_nums": [0],
                    "column_nums": [1, 2],
                    "cell text": "h1col23",
                    "column header": True,
                },
                {"row_nums": [1], "column_nums": [1], "cell text": "h2col2", "column header": True},
                {
                    "row_nums": [1],
                    "column_nums": [2, 3],
                    "cell text": "h2col34",
                    "column header": True,
                },
                {
                    "row_nums": [2],
                    "column_nums": [0],
                    "cell text": "r3col1",
                    "column header": False,
                },
                {
                    "row_nums": [2, 3],
                    "column_nums": [2, 3],
 
SYMBOL INDEX (433 symbols across 35 files)

FILE: benchmarks/test_benchmark_yolox.py
  class _FakeInput (line 14) | class _FakeInput:
    method __init__ (line 15) | def __init__(self) -> None:
  class _FakeSession (line 19) | class _FakeSession:
    method get_inputs (line 22) | def get_inputs(self):
    method run (line 25) | def run(self, _names, _inputs):
  function make_model (line 31) | def make_model() -> UnstructuredYoloXModel:
  function make_letter_200dpi (line 52) | def make_letter_200dpi() -> PILImage.Image:
  function run_image_processing (line 56) | def run_image_processing():
  function test_benchmark_yolox_image_processing (line 62) | def test_benchmark_yolox_image_processing(benchmark):

FILE: examples/ocr/engine.py
  function remove_non_printable (line 15) | def remove_non_printable(s):
  function run_ocr_with_layout_detection (line 20) | def run_ocr_with_layout_detection(
  function run_ocr (line 134) | def run_ocr(

FILE: examples/ocr/validate_ocr_performance.py
  function validate_performance (line 20) | def validate_performance(
  function compare_processed_text (line 129) | def compare_processed_text(individual_mode_full_text, entire_mode_full_t...
  function write_report (line 176) | def write_report(report, now_str, validation_mode):
  function run (line 183) | def run():

FILE: test_unstructured_inference/conftest.py
  function mock_pil_image (line 14) | def mock_pil_image():
  function mock_numpy_image (line 19) | def mock_numpy_image():
  function mock_rectangle (line 24) | def mock_rectangle():
  function mock_text_region (line 29) | def mock_text_region():
  function mock_layout_element (line 34) | def mock_layout_element():
  function mock_embedded_text_regions (line 47) | def mock_embedded_text_regions():
  function mock_layout (line 124) | def mock_layout(mock_embedded_text_regions):
  function example_table_cells (line 132) | def example_table_cells():

FILE: test_unstructured_inference/inference/test_layout.py
  function mock_image (line 26) | def mock_image():
  function mock_initial_layout (line 31) | def mock_initial_layout():
  function mock_final_layout (line 54) | def mock_final_layout():
  function test_pdf_page_converts_images_to_array (line 78) | def test_pdf_page_converts_images_to_array(mock_image):
  class MockLayoutModel (line 97) | class MockLayoutModel:
    method __init__ (line 98) | def __init__(self, layout):
    method __call__ (line 101) | def __call__(self, *args):
    method initialize (line 104) | def initialize(self, *args, **kwargs):
    method deduplicate_detected_elements (line 107) | def deduplicate_detected_elements(self, elements, *args, **kwargs):
  function test_get_page_elements (line 111) | def test_get_page_elements(monkeypatch, mock_final_layout):
  class MockPool (line 125) | class MockPool:
    method map (line 126) | def map(self, f, xs):
    method close (line 129) | def close(self):
    method join (line 132) | def join(self):
  function test_process_data_with_model (line 137) | def test_process_data_with_model(monkeypatch, mock_final_layout, model_n...
  function test_process_data_with_model_raises_on_invalid_model_name (line 162) | def test_process_data_with_model_raises_on_invalid_model_name():
  function test_process_file_with_model (line 174) | def test_process_file_with_model(monkeypatch, mock_final_layout, model_n...
  function test_process_file_no_warnings (line 188) | def test_process_file_no_warnings(monkeypatch, mock_final_layout, recwarn):
  function test_process_file_with_model_raises_on_invalid_model_name (line 206) | def test_process_file_with_model_raises_on_invalid_model_name():
  class MockPoints (line 211) | class MockPoints:
    method tolist (line 212) | def tolist(self):
  class MockEmbeddedTextRegion (line 216) | class MockEmbeddedTextRegion(EmbeddedTextRegion):
    method __init__ (line 217) | def __init__(self, type=None, text=None):
    method points (line 222) | def points(self):
  class MockPageLayout (line 226) | class MockPageLayout(layout.PageLayout):
    method __init__ (line 227) | def __init__(
  class MockLayout (line 241) | class MockLayout:
    method __init__ (line 242) | def __init__(self, *elements):
    method __len__ (line 245) | def __len__(self):
    method sort (line 248) | def sort(self, key, inplace):
    method __iter__ (line 251) | def __iter__(self):
    method get_texts (line 254) | def get_texts(self):
    method filter_by (line 257) | def filter_by(self, *args, **kwargs):
  function test_from_image_file (line 263) | def test_from_image_file(monkeypatch, mock_final_layout, filetype, eleme...
  function test_from_file (line 289) | def test_from_file(monkeypatch, mock_final_layout):
  function test_from_file_rotated_pdf_stores_rotation_in_metadata (line 318) | def test_from_file_rotated_pdf_stores_rotation_in_metadata(monkeypatch, ...
  function test_from_file_with_password (line 333) | def test_from_file_with_password(monkeypatch, mock_final_layout):
  function test_from_image_file_raises_with_empty_fn (line 350) | def test_from_image_file_raises_with_empty_fn():
  function test_from_image_file_raises_isadirectoryerror_with_dir (line 355) | def test_from_image_file_raises_isadirectoryerror_with_dir():
  function test_page_numbers_in_page_objects (line 360) | def test_page_numbers_in_page_objects():
  function test_annotate (line 381) | def test_annotate(colors, add_details, threshold):
  class MockDetectionModel (line 421) | class MockDetectionModel(layout.UnstructuredObjectDetectionModel):
    method initialize (line 422) | def initialize(self, *args, **kwargs):
    method predict (line 425) | def predict(self, x):
  function test_layout_order (line 463) | def test_layout_order(mock_image):
  function test_page_layout_raises_when_multiple_models_passed (line 481) | def test_page_layout_raises_when_multiple_models_passed(mock_image, mock...
  class MockElementExtractionModel (line 492) | class MockElementExtractionModel:
    method __call__ (line 493) | def __call__(self, x):
  function test_get_elements_using_image_extraction (line 498) | def test_get_elements_using_image_extraction(mock_image, inplace, expect...
  function test_get_elements_using_image_extraction_raises_with_no_extraction_model (line 508) | def test_get_elements_using_image_extraction_raises_with_no_extraction_m...
  function test_get_elements_with_detection_model_raises_with_wrong_default_model (line 516) | def test_get_elements_with_detection_model_raises_with_wrong_default_mod...
  function test_from_image (line 532) | def test_from_image(
  class MockUnstructuredElementExtractionModel (line 559) | class MockUnstructuredElementExtractionModel(UnstructuredElementExtracti...
    method initialize (line 560) | def initialize(self, *args, **kwargs):
    method predict (line 563) | def predict(self, x: Image):
  class MockUnstructuredDetectionModel (line 567) | class MockUnstructuredDetectionModel(UnstructuredObjectDetectionModel):
    method initialize (line 568) | def initialize(self, *args, **kwargs):
    method predict (line 571) | def predict(self, x: Image):
  function test_process_file_with_model_routing (line 582) | def test_process_file_with_model_routing(monkeypatch, model_type, is_det...
  function test_exposed_pdf_image_dpi (line 605) | def test_exposed_pdf_image_dpi(pdf_image_dpi, expected, monkeypatch):
  function test_convert_pdf_to_image_no_output_folder (line 611) | def test_convert_pdf_to_image_no_output_folder():
  function _install_mock_pdfium (line 617) | def _install_mock_pdfium(monkeypatch, *, width=720, height=720):
  function test_convert_pdf_to_image_rejects_oversized_page_before_render (line 632) | def test_convert_pdf_to_image_rejects_oversized_page_before_render(monke...
  function test_convert_pdf_to_image_allows_render_guard_to_be_disabled (line 645) | def test_convert_pdf_to_image_allows_render_guard_to_be_disabled(monkeyp...
  function test_page_hotload_preserves_render_max_pixels_per_page (line 659) | def test_page_hotload_preserves_render_max_pixels_per_page(monkeypatch, ...
  function test_convert_pdf_to_image_output_folder_returns_images (line 683) | def test_convert_pdf_to_image_output_folder_returns_images(tmp_path):
  function test_convert_pdf_to_image_path_only (line 696) | def test_convert_pdf_to_image_path_only(tmp_path):
  function test_convert_pdf_to_image_applies_rotation_path_only (line 712) | def test_convert_pdf_to_image_applies_rotation_path_only(tmp_path):
  function test_convert_pdf_to_image_no_rotation_on_normal_pdf (line 725) | def test_convert_pdf_to_image_no_rotation_on_normal_pdf():
  function test_convert_pdf_to_image_save_not_under_pdfium_lock (line 734) | def test_convert_pdf_to_image_save_not_under_pdfium_lock(tmp_path):
  function test_convert_pdf_to_image_concurrent_saves_not_serialized (line 754) | def test_convert_pdf_to_image_concurrent_saves_not_serialized(tmp_path):
  function test_render_can_proceed_while_other_thread_saves (line 810) | def test_render_can_proceed_while_other_thread_saves(tmp_path):
  function test_multi_page_concurrent_output_complete (line 867) | def test_multi_page_concurrent_output_complete(tmp_path):
  function test_error_in_one_thread_does_not_block_other (line 906) | def test_error_in_one_thread_does_not_block_other(tmp_path):
  function test_get_image (line 971) | def test_get_image(filename, img_num, should_complete):

FILE: test_unstructured_inference/inference/test_layout_element.py
  function test_layout_element_to_dict (line 5) | def test_layout_element_to_dict(mock_layout_element):
  function test_layout_element_from_region (line 18) | def test_layout_element_from_region(mock_rectangle):
  function test_layoutelement_inheritance_works_correctly (line 25) | def test_layoutelement_inheritance_works_correctly():

FILE: test_unstructured_inference/inference/test_layout_rotation.py
  function test_convert_pdf_to_image_applies_rotation (line 8) | def test_convert_pdf_to_image_applies_rotation():

FILE: test_unstructured_inference/models/test_detectron2onnx.py
  class MockDetectron2ONNXLayoutModel (line 11) | class MockDetectron2ONNXLayoutModel:
    method __init__ (line 12) | def __init__(self, *args, **kwargs):
    method run (line 16) | def run(self, *args):
    method get_inputs (line 19) | def get_inputs(self):
  function test_load_default_model (line 26) | def test_load_default_model(monkeypatch):
  function test_load_model (line 39) | def test_load_model(model_path, label_map):
  function test_unstructured_detectron_model (line 48) | def test_unstructured_detectron_model():
  function test_inference (line 57) | def test_inference():

FILE: test_unstructured_inference/models/test_eval.py
  function actual_cells (line 8) | def actual_cells():
  function pred_cells (line 85) | def pred_cells():
  function actual_df (line 162) | def actual_df(actual_cells):
  function pred_df (line 167) | def pred_df(pred_cells):
  function test_compare_content_as_df (line 182) | def test_compare_content_as_df(actual_df, pred_df, eval_func, processor):
  function test_compare_content_as_df_with_invalid_input (line 187) | def test_compare_content_as_df_with_invalid_input(actual_df, pred_df):

FILE: test_unstructured_inference/models/test_model.py
  class MockModel (line 18) | class MockModel(UnstructuredObjectDetectionModel):
    method __init__ (line 21) | def __init__(self):
    method initialize (line 25) | def initialize(self, *args, **kwargs):
    method predict (line 28) | def predict(self, x: Any) -> Any:
  function test_get_model (line 39) | def test_get_model(monkeypatch):
  function test_get_model_threaded (line 45) | def test_get_model_threaded(monkeypatch):
  function test_get_model_concurrent_different_models (line 88) | def test_get_model_concurrent_different_models(monkeypatch):
  function test_register_new_model (line 169) | def test_register_new_model():
  function test_get_model_with_lazydict_config (line 185) | def test_get_model_with_lazydict_config(monkeypatch):
  function test_raises_invalid_model (line 220) | def test_raises_invalid_model():
  function test_raises_uninitialized (line 225) | def test_raises_uninitialized():
  function test_model_initializes_once (line 230) | def test_model_initializes_once():
  function test_deduplicate_detected_elements (line 245) | def test_deduplicate_detected_elements():
  function test_enhance_regions (line 268) | def test_enhance_regions():
  function test_clean_type (line 301) | def test_clean_type():
  function test_env_variables_override_default_model (line 332) | def test_env_variables_override_default_model(monkeypatch):
  function test_env_variables_override_initialization_params (line 347) | def test_env_variables_override_initialization_params(monkeypatch):

FILE: test_unstructured_inference/models/test_tables.py
  function table_transformer (line 24) | def table_transformer():
  function test_load_agent (line 29) | def test_load_agent(table_transformer):
  function example_image (line 34) | def example_image():
  function mocked_ocr_tokens (line 39) | def mocked_ocr_tokens():
  function test_load_table_model_raises_when_not_available (line 569) | def test_load_table_model_raises_when_not_available(model_path):
  function test_iob (line 582) | def test_iob(bbox1, bbox2, expected_result):
  function test_load_donut_model (line 593) | def test_load_donut_model(model_path):
  function test_nms (line 641) | def test_nms(input_test, output_test):
  function test_remove_supercell_overlap (line 772) | def test_remove_supercell_overlap(supercell1, supercell2):
  function test_align_supercells (line 882) | def test_align_supercells(supercells, rows, columns, output_test):
  function test_align_rows (line 887) | def test_align_rows(rows, bbox, output):
  function test_table_prediction_output_format (line 908) | def test_table_prediction_output_format(
  function test_table_prediction_output_format_when_wrong_type_then_value_error (line 943) | def test_table_prediction_output_format_when_wrong_type_then_value_error(
  function test_table_prediction_runs_with_empty_recognize (line 964) | def test_table_prediction_runs_with_empty_recognize(
  function test_table_prediction_with_ocr_tokens (line 979) | def test_table_prediction_with_ocr_tokens(table_transformer, example_ima...
  function test_table_prediction_with_no_ocr_tokens (line 985) | def test_table_prediction_with_no_ocr_tokens(table_transformer, example_...
  function test_objects_are_filtered_based_on_class_thresholds_when_correct_prediction_and_threshold (line 998) | def test_objects_are_filtered_based_on_class_thresholds_when_correct_pre...
  function test_objects_are_filtered_based_on_class_thresholds_when_two_classes (line 1018) | def test_objects_are_filtered_based_on_class_thresholds_when_two_classes(
  function test_objects_filtering_when_missing_threshold (line 1033) | def test_objects_filtering_when_missing_threshold():
  function test_intersect (line 1041) | def test_intersect():
  function test_include_rect (line 1047) | def test_include_rect():
  function test_extract_text_from_spans (line 1151) | def test_extract_text_from_spans(spans, join_with_space, expected):
  function test_header_supercell_tree (line 1224) | def test_header_supercell_tree(supercells, expected_len):
  function test_zoom_image (line 1230) | def test_zoom_image(example_image, zoom):
  function test_cells_to_html (line 1511) | def test_cells_to_html(input_cells, expected_html):
  function test_fill_cells (line 1761) | def test_fill_cells(input_cells, expected_cells):
  function test_padded_results_has_right_dimensions (line 1768) | def test_padded_results_has_right_dimensions(table_transformer, example_...
  function test_compute_confidence_score_zero_division_error_handling (line 1805) | def test_compute_confidence_score_zero_division_error_handling():
  function test_subcells_filtering_when_overlapping_spanning_cells (line 1836) | def test_subcells_filtering_when_overlapping_spanning_cells(
  function test_model_init_is_thread_safe (line 1908) | def test_model_init_is_thread_safe():

FILE: test_unstructured_inference/models/test_yolox.py
  function test_layout_yolox_local_parsing_image (line 9) | def test_layout_yolox_local_parsing_image():
  function test_layout_yolox_local_parsing_pdf (line 32) | def test_layout_yolox_local_parsing_pdf():
  function test_layout_yolox_local_parsing_empty_pdf (line 50) | def test_layout_yolox_local_parsing_empty_pdf():
  function test_layout_yolox_local_parsing_image_soft (line 63) | def test_layout_yolox_local_parsing_image_soft():
  function test_layout_yolox_local_parsing_pdf_soft (line 82) | def test_layout_yolox_local_parsing_pdf_soft():
  function test_layout_yolox_local_parsing_empty_pdf_soft (line 94) | def test_layout_yolox_local_parsing_empty_pdf_soft():

FILE: test_unstructured_inference/test_config.py
  function test_default_config (line 1) | def test_default_config():
  function test_env_override (line 7) | def test_env_override(monkeypatch):

FILE: test_unstructured_inference/test_elements.py
  function intersect_brute (line 26) | def intersect_brute(rect1, rect2):
  function rand_rect (line 34) | def rand_rect(size=10):
  function test_layoutelements (line 41) | def test_layoutelements():
  function test_unhappy_intersection (line 72) | def test_unhappy_intersection(rect1, rect2, expected):
  function test_intersects (line 78) | def test_intersects(second_size):
  function test_intersection_of_lots_of_rects (line 100) | def test_intersection_of_lots_of_rects():
  function test_rectangle_width_height (line 114) | def test_rectangle_width_height():
  function test_minimal_containing_rect (line 125) | def test_minimal_containing_rect():
  function test_partition_groups_from_regions (line 145) | def test_partition_groups_from_regions(mock_embedded_text_regions, coord...
  function test_rectangle_padding (line 157) | def test_rectangle_padding():
  function test_rectangle_area (line 164) | def test_rectangle_area(monkeypatch):
  function test_rectangle_iou (line 184) | def test_rectangle_iou():
  function test_midpoints (line 204) | def test_midpoints():
  function test_is_disjoint (line 218) | def test_is_disjoint():
  function test_intersection_over_min (line 247) | def test_intersection_over_min(
  function test_grow_region_to_match_region (line 257) | def test_grow_region_to_match_region():
  function test_is_almost_subregion_of (line 277) | def test_is_almost_subregion_of(rect1, rect2, expected):
  function test_separate (line 294) | def test_separate(rect1, rect2):
  function test_clean_layoutelements (line 300) | def test_clean_layoutelements(test_layoutelements):
  function test_clean_layoutelements_cases (line 336) | def test_clean_layoutelements_cases(
  function test_clean_layoutelements_for_class (line 380) | def test_clean_layoutelements_for_class(
  function test_layoutelements_to_list_and_back (line 396) | def test_layoutelements_to_list_and_back(test_layoutelements):
  function test_layoutelements_from_list_no_elements (line 407) | def test_layoutelements_from_list_no_elements():
  function test_textregions_from_list_no_elements (line 414) | def test_textregions_from_list_no_elements():
  function test_layoutelements_concatenate (line 421) | def test_layoutelements_concatenate():
  function test_textregions_support_numpy_slicing (line 484) | def test_textregions_support_numpy_slicing(test_elements):
  function test_textregions_from_list_collects_sources (line 496) | def test_textregions_from_list_collects_sources():
  function test_textregions_has_sources_field (line 523) | def test_textregions_has_sources_field():
  function test_textregions_iter_elements_preserves_source (line 532) | def test_textregions_iter_elements_preserves_source():
  function test_textregions_slice_preserves_sources (line 549) | def test_textregions_slice_preserves_sources():
  function test_textregions_post_init_handles_sources (line 577) | def test_textregions_post_init_handles_sources():
  function test_textregions_from_coords_accepts_source (line 590) | def test_textregions_from_coords_accepts_source():
  function test_textregions_allows_for_single_element_access_and_returns_textregion_with_correct_values (line 602) | def test_textregions_allows_for_single_element_access_and_returns_textre...

FILE: test_unstructured_inference/test_logger.py
  function test_translate_log_level (line 9) | def test_translate_log_level(level):

FILE: test_unstructured_inference/test_math.py
  function test_safe_division (line 11) | def test_safe_division(a, b, expected):

FILE: test_unstructured_inference/test_utils.py
  class MockPageLayout (line 14) | class MockPageLayout:
    method annotate (line 15) | def annotate(self, annotation_data):
  class MockDocumentLayout (line 19) | class MockDocumentLayout(DocumentLayout):
    method pages (line 21) | def pages(self):
  function test_dict_same (line 25) | def test_dict_same():
  function test_lazy_evaluate (line 33) | def test_lazy_evaluate():
  function test_caches (line 50) | def test_caches(cache, expected):
  function test_pad_image_with_background_color (line 67) | def test_pad_image_with_background_color(mock_pil_image):
  function test_pad_image_with_invalid_input (line 79) | def test_pad_image_with_invalid_input(mock_pil_image):
  function test_strip_tags (line 94) | def test_strip_tags(html, text):

FILE: test_unstructured_inference/test_visualization.py
  function test_draw_bbox (line 11) | def test_draw_bbox():
  function test_show_plot_with_pil_image (line 30) | def test_show_plot_with_pil_image(mock_pil_image):
  function test_show_plot_with_numpy_image (line 52) | def test_show_plot_with_numpy_image(mock_numpy_image):
  function test_show_plot_with_unsupported_image_type (line 74) | def test_show_plot_with_unsupported_image_type():

FILE: unstructured_inference/config.py
  class InferenceConfig (line 14) | class InferenceConfig:
    method _get_string (line 17) | def _get_string(self, var: str, default_value: str = "") -> str:
    method _get_int (line 22) | def _get_int(self, var: str, default_value: int) -> int:
    method _get_float (line 27) | def _get_float(self, var: str, default_value: float) -> float:
    method TABLE_IMAGE_BACKGROUND_PAD (line 33) | def TABLE_IMAGE_BACKGROUND_PAD(self) -> int:
    method TT_TABLE_CONF (line 42) | def TT_TABLE_CONF(self) -> float:
    method TABLE_COLUMN_CONF (line 47) | def TABLE_COLUMN_CONF(self) -> float:
    method TABLE_ROW_CONF (line 52) | def TABLE_ROW_CONF(self) -> float:
    method TABLE_COLUMN_HEADER_CONF (line 57) | def TABLE_COLUMN_HEADER_CONF(self) -> float:
    method TABLE_PROJECTED_ROW_HEADER_CONF (line 62) | def TABLE_PROJECTED_ROW_HEADER_CONF(self) -> float:
    method TABLE_SPANNING_CELL_CONF (line 67) | def TABLE_SPANNING_CELL_CONF(self) -> float:
    method TABLE_IOB_THRESHOLD (line 72) | def TABLE_IOB_THRESHOLD(self) -> float:
    method LAYOUT_SAME_REGION_THRESHOLD (line 78) | def LAYOUT_SAME_REGION_THRESHOLD(self) -> float:
    method LAYOUT_SUBREGION_THRESHOLD (line 87) | def LAYOUT_SUBREGION_THRESHOLD(self) -> float:
    method ELEMENTS_H_PADDING_COEF (line 96) | def ELEMENTS_H_PADDING_COEF(self) -> float:
    method ELEMENTS_V_PADDING_COEF (line 105) | def ELEMENTS_V_PADDING_COEF(self) -> float:
    method IMG_PROCESSOR_LONGEST_EDGE (line 110) | def IMG_PROCESSOR_LONGEST_EDGE(self) -> int:
    method IMG_PROCESSOR_SHORTEST_EDGE (line 115) | def IMG_PROCESSOR_SHORTEST_EDGE(self) -> int:
    method PDF_RENDER_MAX_PIXELS_PER_PAGE (line 120) | def PDF_RENDER_MAX_PIXELS_PER_PAGE(self) -> int:
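The config.py index above shows typed getters (`_get_string`, `_get_int`, `_get_float`) backing property-style settings. A minimal sketch of that env-var-backed pattern, assuming the getters read from `os.environ` and fall back to a default; the property name is taken from the index but its default value here is a placeholder, not the library's actual setting:

```python
import os


class InferenceConfig:
    """Sketch of an env-var backed config: each typed getter reads a
    variable from os.environ and falls back to a supplied default."""

    def _get_string(self, var: str, default_value: str = "") -> str:
        return os.environ.get(var, default_value)

    def _get_int(self, var: str, default_value: int) -> int:
        value = os.environ.get(var)
        return int(value) if value is not None else default_value

    def _get_float(self, var: str, default_value: float) -> float:
        value = os.environ.get(var)
        return float(value) if value is not None else default_value

    @property
    def TT_TABLE_CONF(self) -> float:
        # 0.5 is a hypothetical default; the real library defines its own
        return self._get_float("TT_TABLE_CONF", 0.5)


config = InferenceConfig()
os.environ["TT_TABLE_CONF"] = "0.75"
print(config.TT_TABLE_CONF)  # 0.75
```

Exposing settings as properties (rather than module constants) means each read re-checks the environment, so tests can override values without reloading the module.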

FILE: unstructured_inference/constants.py
  class Source (line 4) | class Source(Enum):
  class IsExtracted (line 10) | class IsExtracted(Enum):
  class ElementType (line 16) | class ElementType:

FILE: unstructured_inference/inference/elements.py
  class Rectangle (line 15) | class Rectangle:
    method pad (line 21) | def pad(self, padding: Union[int, float]):
    method hpad (line 27) | def hpad(self, padding: Union[int, float]):
    method vpad (line 35) | def vpad(self, padding: Union[int, float]):
    method width (line 44) | def width(self) -> Union[int, float]:
    method height (line 49) | def height(self) -> Union[int, float]:
    method x_midpoint (line 54) | def x_midpoint(self) -> Union[int, float]:
    method y_midpoint (line 59) | def y_midpoint(self) -> Union[int, float]:
    method is_disjoint (line 63) | def is_disjoint(self, other: Rectangle) -> bool:
    method intersects (line 67) | def intersects(self, other: Rectangle) -> bool:
    method is_in (line 73) | def is_in(self, other: Rectangle, error_margin: Optional[Union[int, fl...
    method _has_none (line 85) | def _has_none(self) -> bool:
    method coordinates (line 90) | def coordinates(self):
    method intersection (line 94) | def intersection(self, other: Rectangle) -> Optional[Rectangle]:
    method area (line 108) | def area(self) -> float:
    method intersection_over_union (line 112) | def intersection_over_union(self, other: Rectangle) -> float:
    method intersection_over_minimum (line 121) | def intersection_over_minimum(self, other: Rectangle) -> float:
    method is_almost_subregion_of (line 130) | def is_almost_subregion_of(self, other: Rectangle, subregion_threshold...
  function minimal_containing_region (line 141) | def minimal_containing_region(*regions: Rectangle) -> Rectangle:
  function intersections (line 151) | def intersections(*rects: Rectangle):
  function coords_intersections (line 160) | def coords_intersections(coords: np.ndarray) -> np.ndarray:
  class TextRegion (line 184) | class TextRegion:
    method __str__ (line 190) | def __str__(self) -> str:
    method from_coords (line 194) | def from_coords(
  class TextRegions (line 212) | class TextRegions:
    method __post_init__ (line 230) | def __post_init__(self):
    method __getitem__ (line 248) | def __getitem__(self, indices) -> TextRegions:
    method slice (line 251) | def slice(self, indices) -> TextRegions:
    method iter_elements (line 266) | def iter_elements(self):
    method as_list (line 277) | def as_list(self):
    method from_list (line 282) | def from_list(cls, regions: list):
    method __len__ (line 298) | def __len__(self):
    method x1 (line 302) | def x1(self):
    method y1 (line 307) | def y1(self):
    method x2 (line 312) | def x2(self):
    method y2 (line 317) | def y2(self):
    method areas (line 322) | def areas(self) -> np.ndarray:
  class EmbeddedTextRegion (line 327) | class EmbeddedTextRegion(TextRegion):
  class ImageTextRegion (line 331) | class ImageTextRegion(TextRegion):
  function region_bounding_boxes_are_almost_the_same (line 335) | def region_bounding_boxes_are_almost_the_same(
  function grow_region_to_match_region (line 345) | def grow_region_to_match_region(region_to_grow: Rectangle, region_to_mat...
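elements.py centers on `Rectangle`, with pairwise geometry (`intersection`, `area`, `intersection_over_union`, `intersection_over_minimum`, `is_almost_subregion_of`). A hedged sketch of the IoU computation such a class implies; the `Rect` name and field layout here are assumptions based on the `x1/y1/x2/y2` properties listed for `TextRegions`, not the library's actual implementation:

```python
from dataclasses import dataclass


@dataclass
class Rect:
    # hypothetical stand-in for the library's Rectangle
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

    def intersection_area(self, other: "Rect") -> float:
        # overlap of two axis-aligned boxes; zero when they are disjoint
        w = min(self.x2, other.x2) - max(self.x1, other.x1)
        h = min(self.y2, other.y2) - max(self.y1, other.y1)
        return max(0.0, w) * max(0.0, h)

    def iou(self, other: "Rect") -> float:
        inter = self.intersection_area(other)
        union = self.area + other.area - inter
        return inter / union if union else 0.0


a = Rect(0, 0, 2, 2)  # area 4
b = Rect(1, 1, 3, 3)  # area 4, overlapping a in a 1x1 square
print(a.iou(b))  # 1 / (4 + 4 - 1) = 0.142857...
```

`intersection_over_minimum` would divide by `min(self.area, other.area)` instead of the union, which is the quantity `is_almost_subregion_of` needs: it approaches 1.0 when the smaller box sits inside the larger one.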

FILE: unstructured_inference/inference/layout.py
  class DocumentLayout (line 29) | class DocumentLayout:
    method __init__ (line 34) | def __init__(self, pages=None):
    method __str__ (line 37) | def __str__(self) -> str:
    method pages (line 41) | def pages(self) -> List[PageLayout]:
    method from_pages (line 46) | def from_pages(cls, pages: List[PageLayout]) -> DocumentLayout:
    method from_file (line 53) | def from_file(
    method from_image_file (line 95) | def from_image_file(
  class PageLayout (line 133) | class PageLayout:
    method __init__ (line 136) | def __init__(
    method __str__ (line 167) | def __str__(self) -> str:
    method elements (line 171) | def elements(self) -> Collection[LayoutElement]:
    method get_elements_using_image_extraction (line 178) | def get_elements_using_image_extraction(
    method get_elements_with_detection_model (line 194) | def get_elements_with_detection_model(
    method _get_image_array (line 224) | def _get_image_array(self) -> Union[np.ndarray[Any, Any], None]:
    method annotate (line 234) | def annotate(
    method _get_image (line 286) | def _get_image(self, filename, page_number, pdf_image_dpi: int = 200) ...
    method from_image (line 307) | def from_image(
  function process_data_with_model (line 350) | def process_data_with_model(
  function process_file_with_model (line 378) | def process_file_with_model(

FILE: unstructured_inference/inference/layoutelement.py
  class LayoutElements (line 23) | class LayoutElements(TextRegions):
    method __post_init__ (line 53) | def __post_init__(self):
    method __eq__ (line 57) | def __eq__(self, other: object) -> bool:
    method __getitem__ (line 81) | def __getitem__(self, indices):
    method slice (line 84) | def slice(self, indices) -> LayoutElements:
    method concatenate (line 100) | def concatenate(cls, groups: Iterable[LayoutElements]) -> LayoutElements:
    method iter_elements (line 139) | def iter_elements(self):
    method from_list (line 183) | def from_list(cls, elements: list):
  class LayoutElement (line 238) | class LayoutElement(TextRegion):
    method to_dict (line 247) | def to_dict(self) -> dict:
    method from_region (line 260) | def from_region(cls, region: TextRegion):
    method from_coords (line 277) | def from_coords(
  function separate (line 309) | def separate(region_a: Rectangle, region_b: Rectangle):
  function table_cells_to_dataframe (line 349) | def table_cells_to_dataframe(
  function partition_groups_from_regions (line 370) | def partition_groups_from_regions(regions: TextRegions) -> List[TextRegi...
  function intersection_areas_between_coords (line 393) | def intersection_areas_between_coords(
  function clean_layoutelements (line 410) | def clean_layoutelements(elements: LayoutElements, subregion_threshold: ...
  function clean_layoutelements_for_class (line 474) | def clean_layoutelements_for_class(

FILE: unstructured_inference/inference/pdf_image.py
  class PdfRenderTooLargeError (line 19) | class PdfRenderTooLargeError(ValueError):
  function _check_pdf_render_max_pixels (line 23) | def _check_pdf_render_max_pixels(page, page_number: int, scale: float, m...
  function _get_pdfium_module (line 40) | def _get_pdfium_module():
  function convert_pdf_to_image (line 46) | def convert_pdf_to_image(

FILE: unstructured_inference/logger.py
  function translate_log_level (line 4) | def translate_log_level(level: int) -> int:
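logger.py's single function maps a Python logging level to an ONNX Runtime severity (the file preview's docstring mentions translating to "ONNX runtime error" levels). A sketch of one plausible mapping, assuming ONNX Runtime's documented severities 0=VERBOSE through 4=FATAL; the library's exact translation may differ:

```python
import logging

# ONNX Runtime log severities: 0=VERBOSE, 1=INFO, 2=WARNING, 3=ERROR, 4=FATAL
ORT_SEVERITY = {
    logging.DEBUG: 0,
    logging.INFO: 1,
    logging.WARNING: 2,
    logging.ERROR: 3,
    logging.CRITICAL: 4,
}


def translate_log_level(level: int) -> int:
    """Map a Python logging level to the nearest ONNX Runtime severity.
    Sketch only: snaps nonstandard ints (e.g. 35) down to the nearest
    standard level, and falls back to VERBOSE below DEBUG."""
    candidates = [lvl for lvl in ORT_SEVERITY if lvl <= level]
    return ORT_SEVERITY[max(candidates)] if candidates else 0


print(translate_log_level(logging.WARNING))  # 2
```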

FILE: unstructured_inference/math.py
  function safe_division (line 8) | def safe_division(a, b) -> float:
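math.py exposes `safe_division(a, b)` alongside a `FLOAT_EPSILON` constant derived from `np.finfo` (per the file preview). A minimal sketch of the idea, assuming the helper clamps a non-negative denominator away from zero so the result stays finite; the real implementation may handle edge cases differently:

```python
import numpy as np

FLOAT_EPSILON = np.finfo(float).eps  # smallest representable relative step


def safe_division(a, b) -> float:
    """Divide a by b without risking ZeroDivisionError, by clamping the
    denominator to at least FLOAT_EPSILON. Sketch: assumes b >= 0, as
    with areas, widths, and counts in layout code."""
    return a / max(b, FLOAT_EPSILON)


print(safe_division(1, 2))  # 0.5
print(np.isfinite(safe_division(1, 0)))  # True: huge but finite
```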

FILE: unstructured_inference/models/base.py
  class Models (line 20) | class Models(object):
    method __new__ (line 35) | def __new__(cls):
    method __contains__ (line 44) | def __contains__(self, key):
    method __getitem__ (line 48) | def __getitem__(self, key: str):
    method __setitem__ (line 52) | def __setitem__(self, key: str, value: UnstructuredModel):
  function get_default_model_mappings (line 68) | def get_default_model_mappings() -> Tuple[
  function register_new_model (line 82) | def register_new_model(model_config: dict, model_class: UnstructuredModel):
  function get_model (line 91) | def get_model(model_name: Optional[str] = None) -> UnstructuredModel:
  class UnknownModelException (line 151) | class UnknownModelException(Exception):
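base.py's `Models` class overrides `__new__` and implements `__contains__`/`__getitem__`/`__setitem__`, which suggests a process-wide singleton registry of loaded models. A sketch of that pattern; the internal storage and behavior here are assumptions, only the method names come from the index:

```python
class Models:
    """Sketch of a singleton model registry: __new__ always returns the
    one shared instance, and mapping-style access stores models by name."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}  # hypothetical backing store
        return cls._instance

    def __contains__(self, key):
        return key in self._models

    def __getitem__(self, key: str):
        return self._models[key]

    def __setitem__(self, key: str, value):
        self._models[key] = value


a, b = Models(), Models()
a["yolox"] = "model-object"
print(b["yolox"], a is b)  # model-object True
```

A singleton registry means `get_model("yolox")` can hand back an already-initialized model on repeat calls instead of reloading weights.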

FILE: unstructured_inference/models/detectron2onnx.py
  class UnstructuredDetectronONNXModel (line 68) | class UnstructuredDetectronONNXModel(UnstructuredObjectDetectionModel):
    method predict (line 75) | def predict(self, image: Image.Image) -> List[LayoutElement]:
    method initialize (line 97) | def initialize(
    method preprocess (line 128) | def preprocess(self, image: Image.Image) -> Dict[str, np.ndarray]:
    method postprocess (line 147) | def postprocess(

FILE: unstructured_inference/models/eval.py
  function _join_df_content (line 15) | def _join_df_content(df, tab_token="\t", row_break_token="\n") -> str:
  function default_tokenizer (line 20) | def default_tokenizer(text: str) -> List[str]:
  function compare_contents_as_df (line 25) | def compare_contents_as_df(

FILE: unstructured_inference/models/table_postprocess.py
  class Rect (line 9) | class Rect:
    method __init__ (line 10) | def __init__(self, bbox=None):
    method get_area (line 22) | def get_area(self):
    method intersect (line 27) | def intersect(self, other):
    method include_rect (line 48) | def include_rect(self, bbox):
    method get_bbox (line 72) | def get_bbox(self):
  function apply_threshold (line 77) | def apply_threshold(objects, threshold):
  function refine_rows (line 84) | def refine_rows(rows, tokens, score_threshold):
  function refine_columns (line 101) | def refine_columns(columns, tokens, score_threshold):
  function nms_by_containment (line 123) | def nms_by_containment(container_objects, package_objects, overlap_thres...
  function slot_into_containers (line 152) | def slot_into_containers(
  function sort_objects_by_score (line 202) | def sort_objects_by_score(objects, reverse=True):
  function remove_objects_without_content (line 209) | def remove_objects_without_content(page_spans, objects):
  function extract_text_inside_bbox (line 220) | def extract_text_inside_bbox(spans, bbox):
  function get_bbox_span_subset (line 230) | def get_bbox_span_subset(spans, bbox, threshold=0.5):
  function overlaps (line 243) | def overlaps(bbox1, bbox2, threshold=0.5):
  function extract_text_from_spans (line 254) | def extract_text_from_spans(spans, join_with_space=True, remove_integer_...
  function sort_objects_left_to_right (line 304) | def sort_objects_left_to_right(objs):
  function sort_objects_top_to_bottom (line 311) | def sort_objects_top_to_bottom(objs):
  function align_columns (line 318) | def align_columns(columns, bbox):
  function align_rows (line 334) | def align_rows(rows, bbox):
  function nms (line 350) | def nms(objects, match_criteria="object2_overlap", match_threshold=0.05,...
  function align_supercells (line 395) | def align_supercells(supercells, rows, columns):
  function nms_supercells (line 509) | def nms_supercells(supercells):
  function header_supercell_tree (line 536) | def header_supercell_tree(supercells):
  function remove_supercell_overlap (line 564) | def remove_supercell_overlap(supercell1, supercell2):

FILE: unstructured_inference/models/tables.py
  class UnstructuredTableTransformerModel (line 30) | class UnstructuredTableTransformerModel(UnstructuredModel):
    method __new__ (line 36) | def __new__(cls):
    method predict (line 44) | def predict(
    method initialize (line 68) | def initialize(
    method get_structure (line 123) | def get_structure(
    method run_prediction (line 139) | def run_prediction(
  function load_agent (line 177) | def load_agent():
  function get_class_map (line 189) | def get_class_map(data_type: str):
  function recognize (line 218) | def recognize(outputs: TableTransformerObjectDetectionOutput, img: PILIm...
  function outputs_to_objects (line 233) | def outputs_to_objects(
  function apply_thresholds_on_objects (line 268) | def apply_thresholds_on_objects(
  function box_cxcywh_to_xyxy (line 296) | def box_cxcywh_to_xyxy(x):
  function rescale_bboxes (line 304) | def rescale_bboxes(out_bbox, size):
  function iob (line 312) | def iob(bbox1, bbox2):
  function objects_to_structures (line 325) | def objects_to_structures(objects, tokens, class_thresholds):
  function refine_table_structure (line 409) | def refine_table_structure(table_structure, class_thresholds):
  function align_headers (line 458) | def align_headers(headers, rows):
  function compute_confidence_score (line 508) | def compute_confidence_score(cell_match_scores):
  function structure_to_cells (line 522) | def structure_to_cells(table_structure, tokens):
  function fill_cells (line 697) | def fill_cells(cells: List[dict]) -> List[dict]:
  function cells_to_html (line 746) | def cells_to_html(cells: List[dict]) -> str:
  function zoom_image (line 799) | def zoom_image(image: PILImage.Image, zoom: float) -> PILImage.Image:
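tables.py (whose header cites Microsoft's table-transformer inference code) lists the DETR-style helpers `box_cxcywh_to_xyxy` and `rescale_bboxes`. The center-format conversion is standard; this NumPy sketch shows the math, while the library's versions operate on torch tensors:

```python
import numpy as np


def box_cxcywh_to_xyxy(box: np.ndarray) -> np.ndarray:
    """Convert (center_x, center_y, width, height) boxes to
    (x1, y1, x2, y2) corner boxes."""
    cx, cy, w, h = box[..., 0], box[..., 1], box[..., 2], box[..., 3]
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=-1)


def rescale_bboxes(boxes: np.ndarray, size: tuple) -> np.ndarray:
    """Scale normalized corner boxes up to absolute pixel coordinates."""
    img_w, img_h = size
    return box_cxcywh_to_xyxy(boxes) * np.array([img_w, img_h, img_w, img_h])


# a box centered at (0.5, 0.5) with width 0.2 and height 0.4
print(box_cxcywh_to_xyxy(np.array([0.5, 0.5, 0.2, 0.4])))
# [0.4 0.3 0.6 0.7]
```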

FILE: unstructured_inference/models/unstructuredmodel.py
  class UnstructuredModel (line 23) | class UnstructuredModel(ABC):
    method __init__ (line 26) | def __init__(self):
    method predict (line 34) | def predict(self, x: Any) -> Any:
    method __call__ (line 43) | def __call__(self, x: Any) -> Any:
    method initialize (line 48) | def initialize(self, *args, **kwargs):
  class UnstructuredObjectDetectionModel (line 53) | class UnstructuredObjectDetectionModel(UnstructuredModel):
    method predict (line 57) | def predict(self, x: Image) -> LayoutElements | list[LayoutElement]:
    method __call__ (line 62) | def __call__(self, x: Image) -> LayoutElements:
    method enhance_regions (line 67) | def enhance_regions(
    method clean_type (line 130) | def clean_type(
    method deduplicate_detected_elements (line 168) | def deduplicate_detected_elements(
  class UnstructuredElementExtractionModel (line 188) | class UnstructuredElementExtractionModel(UnstructuredModel):
    method predict (line 192) | def predict(self, x: Image) -> List[LayoutElement]:
    method __call__ (line 197) | def __call__(self, x: Image) -> List[LayoutElement]:
  class ModelNotInitializedError (line 202) | class ModelNotInitializedError(Exception):
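unstructuredmodel.py defines the abstract base contract: subclasses implement `initialize()` and `predict()`, `__call__` delegates to `predict`, and `ModelNotInitializedError` guards against use before initialization. A hedged sketch of that contract; `EchoModel` and the exact guard placement are illustrative assumptions:

```python
from abc import ABC, abstractmethod
from typing import Any


class ModelNotInitializedError(Exception):
    pass


class UnstructuredModel(ABC):
    """Sketch of the base-class contract suggested by the index."""

    def __init__(self):
        self.model: Any = None

    @abstractmethod
    def predict(self, x: Any) -> Any:
        # shared guard: subclasses call super().predict(x) first
        if self.model is None:
            raise ModelNotInitializedError(
                "Call initialize() before making predictions."
            )

    def __call__(self, x: Any) -> Any:
        return self.predict(x)

    @abstractmethod
    def initialize(self, *args, **kwargs): ...


class EchoModel(UnstructuredModel):
    # hypothetical minimal subclass for demonstration
    def initialize(self, model):
        self.model = model

    def predict(self, x):
        super().predict(x)  # raises if initialize() was never called
        return self.model(x)


m = EchoModel()
m.initialize(lambda v: v * 2)
print(m(21))  # 42
```

Note that abstract methods in Python may carry a body; putting the not-initialized check in the base `predict` lets every subclass inherit the guard with one `super()` call.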

FILE: unstructured_inference/models/yolox.py
  class UnstructuredYoloXModel (line 65) | class UnstructuredYoloXModel(UnstructuredObjectDetectionModel):
    method predict (line 66) | def predict(self, x: PILImage.Image):
    method initialize (line 71) | def initialize(self, model_path: str, label_map: dict):
    method image_processing (line 90) | def image_processing(
  function preprocess (line 154) | def preprocess(img, input_size, swap=(2, 0, 1)):
  function demo_postprocess (line 174) | def demo_postprocess(outputs, img_size, p6=False):
  function multiclass_nms (line 199) | def multiclass_nms(boxes, scores, nms_thr, score_thr, class_agnostic=True):
  function multiclass_nms_class_agnostic (line 209) | def multiclass_nms_class_agnostic(boxes, scores, nms_thr, score_thr):
  function nms (line 226) | def nms(boxes, scores, nms_thr):
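yolox.py ends with `nms(boxes, scores, nms_thr)`, the classic greedy non-maximum suppression step that runs after `multiclass_nms` splits detections by class. A sketch of the standard algorithm (the library's version, adapted from Megvii's YOLOX code, may differ in small details such as a +1 in the area terms):

```python
import numpy as np


def nms(boxes: np.ndarray, scores: np.ndarray, nms_thr: float) -> list:
    """Greedy NMS: keep the highest-scoring box, drop every remaining box
    whose IoU with it exceeds nms_thr, then repeat on what is left."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # candidate indices, best score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the best remaining box against all the others
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= nms_thr]  # keep only weak overlaps
    return keep


boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30.0]])
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores, 0.5))  # [0, 2]: box 1 overlaps box 0 too much
```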

FILE: unstructured_inference/utils.py
  class LazyEvaluateInfo (line 13) | class LazyEvaluateInfo:
    method __init__ (line 18) | def __init__(self, evaluate: Callable, *args, **kwargs):
  class LazyDict (line 23) | class LazyDict(Mapping):
    method __init__ (line 31) | def __init__(self, *args, cache=True, **kwargs):
    method __getitem__ (line 35) | def __getitem__(self, key: Hashable) -> Union[LazyEvaluateInfo, Any]:
    method __iter__ (line 45) | def __iter__(self) -> Iterator:
    method __len__ (line 48) | def __len__(self) -> int:
  function tag (line 52) | def tag(elements: Iterable[LayoutElement]):
  function pad_image_with_background_color (line 63) | def pad_image_with_background_color(
  class MLStripper (line 82) | class MLStripper(HTMLParser):
    method __init__ (line 85) | def __init__(self):
    method handle_data (line 92) | def handle_data(self, d):
    method get_data (line 96) | def get_data(self):
  function strip_tags (line 101) | def strip_tags(html: str) -> str:
  function download_if_needed_and_get_local_path (line 108) | def download_if_needed_and_get_local_path(path_or_repo: str, filename: s...
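utils.py pairs `LazyEvaluateInfo` with `LazyDict(Mapping)`: values wrapped in `LazyEvaluateInfo` are only computed on first access, with an optional cache. A sketch reconstructed from the method names in the index; the caching details are assumptions:

```python
from collections.abc import Mapping
from typing import Callable, Hashable, Iterator


class LazyEvaluateInfo:
    """Bundle a callable with its arguments so evaluation can be deferred."""

    def __init__(self, evaluate: Callable, *args, **kwargs):
        self.evaluate = evaluate
        self.info = (args, kwargs)


class LazyDict(Mapping):
    """Mapping that resolves LazyEvaluateInfo values on first access,
    optionally memoizing the result (sketch of the indexed class)."""

    def __init__(self, *args, cache: bool = True, **kwargs):
        self._raw_dict = dict(*args, **kwargs)
        self.cache = cache

    def __getitem__(self, key: Hashable):
        value = self._raw_dict[key]
        if isinstance(value, LazyEvaluateInfo):
            args, kwargs = value.info
            value = value.evaluate(*args, **kwargs)
            if self.cache:
                self._raw_dict[key] = value  # replace thunk with result
        return value

    def __iter__(self) -> Iterator:
        return iter(self._raw_dict)

    def __len__(self) -> int:
        return len(self._raw_dict)


d = LazyDict(model=LazyEvaluateInfo(lambda name: f"loaded:{name}", "yolox"))
print(d["model"])  # loaded:yolox (computed on first access)
```

This is a natural fit for model registries: a dict literal can list every known model without paying the weight-loading cost until one is actually requested.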

FILE: unstructured_inference/visualize.py
  function draw_bbox (line 17) | def draw_bbox(
  function show_plot (line 44) | def show_plot(
Condensed preview — 62 files, each showing path, character count, and a content snippet (full structured content: 418K chars).
[
  {
    "path": ".github/dependabot.yml",
    "chars": 305,
    "preview": "version: 2\nupdates:\n  - package-ecosystem: \"uv\"\n    directory: \"/\"\n    schedule:\n      interval: \"monthly\"\n\n  - package-"
  },
  {
    "path": ".github/workflows/ci.yml",
    "chars": 2351,
    "preview": "name: CI\n\non:\n  push:\n    branches: [ main ]\n  pull_request:\n    branches: [ main ]\n\npermissions:\n  contents: read\n\njobs"
  },
  {
    "path": ".github/workflows/claude.yml",
    "chars": 1176,
    "preview": "name: Claude Code\n\non:\n  issue_comment:\n    types: [created]\n  pull_request_review_comment:\n    types: [created]\n  issue"
  },
  {
    "path": ".github/workflows/create_issue.yml",
    "chars": 864,
    "preview": "name: create_jira_issue\n\non:\n  issues:\n    types:\n      - opened\n\njobs:\n  create:\n    runs-on: ubuntu-latest\n    name: C"
  },
  {
    "path": ".github/workflows/release.yml",
    "chars": 1897,
    "preview": "name: Release\n\non:\n  release:\n    types: [published]\n\npermissions:\n  contents: read\n  id-token: write       # Required f"
  },
  {
    "path": ".github/workflows/version-bump.yml",
    "chars": 685,
    "preview": "name: Version Bump\n\non:\n  pull_request:\n    branches: [main]\n    types: [opened, synchronize, reopened]\n\npermissions:\n  "
  },
  {
    "path": ".gitignore",
    "chars": 1935,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
  },
  {
    "path": ".pre-commit-config.yaml",
    "chars": 511,
    "preview": "repos:\n  - repo: https://github.com/pre-commit/pre-commit-hooks\n    rev: \"v5.0.0\"\n    hooks:\n      - id: check-added-lar"
  },
  {
    "path": "CHANGELOG.md",
    "chars": 23649,
    "preview": "## 1.6.11\n\n### Enhancement\n- Add `table_extraction_method` field to `LayoutElements` and `LayoutElement` to track which "
  },
  {
    "path": "Dockerfile",
    "chars": 829,
    "preview": "# syntax=docker/dockerfile:experimental\nARG PYTHON_VERSION=3.12\nFROM python:${PYTHON_VERSION}-slim AS base\n\n# Set up env"
  },
  {
    "path": "LICENSE",
    "chars": 11357,
    "preview": "                                 Apache License\n                           Version 2.0, January 2004\n                   "
  },
  {
    "path": "Makefile",
    "chars": 2754,
    "preview": "PACKAGE_NAME := unstructured_inference\nCURRENT_DIR := $(shell pwd)\n\n\n.PHONY: help\nhelp: Makefile\n\t@sed -n 's/^\\(## \\)\\(["
  },
  {
    "path": "README.md",
    "chars": 4169,
    "preview": "<h3 align=\"center\">\n  <img\n    src=\"https://raw.githubusercontent.com/Unstructured-IO/unstructured/main/img/unstructured"
  },
  {
    "path": "benchmarks/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "benchmarks/test_benchmark_yolox.py",
    "chars": 1818,
    "preview": "\"\"\"Benchmark for YoloX image_processing() memory optimization.\n\nUses a fake ONNX session to isolate the memory behavior "
  },
  {
    "path": "examples/ocr/engine.py",
    "chars": 5178,
    "preview": "import os\nimport re\nimport time\nfrom typing import List, cast\n\nimport cv2\nimport numpy as np\nimport pytesseract\nfrom pyt"
  },
  {
    "path": "examples/ocr/requirements.txt",
    "chars": 34,
    "preview": "unstructured[local-inference]\nnltk"
  },
  {
    "path": "examples/ocr/validate_ocr_performance.py",
    "chars": 7142,
    "preview": "import json\nimport os\nimport time\nfrom datetime import datetime\nfrom difflib import SequenceMatcher\n\nimport nltk\nimport "
  },
  {
    "path": "logger_config.yaml",
    "chars": 1037,
    "preview": "version: 1\ndisable_existing_loggers: False\nformatters:\n  default_format:\n    \"()\": uvicorn.logging.DefaultFormatter\n    "
  },
  {
    "path": "pyproject.toml",
    "chars": 3113,
    "preview": "[project]\nname = \"unstructured_inference\"\ndescription = \"A library for performing inference using trained models.\"\nrequi"
  },
  {
    "path": "renovate.json",
    "chars": 136,
    "preview": "{\n  \"$schema\": \"https://docs.renovatebot.com/renovate-schema.json\",\n  \"extends\": [\"github>Unstructured-IO/renovate-confi"
  },
  {
    "path": "scripts/docker-build.sh",
    "chars": 289,
    "preview": "#!/usr/bin/env bash\n\nset -euo pipefail\nDOCKER_IMAGE=\"${DOCKER_IMAGE:-unstructured-inference:dev}\"\n\nDOCKER_BUILD_CMD=(doc"
  },
  {
    "path": "scripts/shellcheck.sh",
    "chars": 70,
    "preview": "#!/usr/bin/env bash\n\nfind scripts -name \"*.sh\" -exec shellcheck {} +\n\n"
  },
  {
    "path": "scripts/test-unstructured-ingest-helper.sh",
    "chars": 962,
    "preview": "#!/usr/bin/env bash\n\n# This is intended to be run from an unstructured checkout, not in this repo\n# The goal here is to "
  },
  {
    "path": "scripts/version-sync.sh",
    "chars": 5923,
    "preview": "#!/usr/bin/env bash\nfunction usage {\n    echo \"Usage: $(basename \"$0\") [-c] -f FILE_TO_CHANGE REPLACEMENT_FORMAT [-f FIL"
  },
  {
    "path": "test_unstructured_inference/conftest.py",
    "chars": 5688,
    "preview": "import numpy as np\nimport pytest\nfrom PIL import Image\n\nfrom unstructured_inference.inference.elements import (\n    Embe"
  },
  {
    "path": "test_unstructured_inference/inference/test_layout.py",
    "chars": 32132,
    "preview": "import os\nimport os.path\nimport tempfile\nfrom unittest.mock import MagicMock, mock_open, patch\n\nimport numpy as np\nimpor"
  },
  {
    "path": "test_unstructured_inference/inference/test_layout_element.py",
    "chars": 1776,
    "preview": "from unstructured_inference.constants import IsExtracted, Source\nfrom unstructured_inference.inference.layoutelement imp"
  },
  {
    "path": "test_unstructured_inference/inference/test_layout_rotation.py",
    "chars": 1330,
    "preview": "from __future__ import annotations\n\nimport numpy as np\n\nfrom unstructured_inference.inference import pdf_image\n\n\ndef tes"
  },
  {
    "path": "test_unstructured_inference/models/test_detectron2onnx.py",
    "chars": 2960,
    "preview": "import os\nfrom unittest.mock import patch\n\nimport pytest\nfrom PIL import Image\n\nimport unstructured_inference.models.bas"
  },
  {
    "path": "test_unstructured_inference/models/test_eval.py",
    "chars": 8242,
    "preview": "import pytest\n\nfrom unstructured_inference.inference.layoutelement import table_cells_to_dataframe\nfrom unstructured_inf"
  },
  {
    "path": "test_unstructured_inference/models/test_model.py",
    "chars": 12586,
    "preview": "import json\nimport threading\nimport time\nfrom typing import Any\nfrom unittest import mock\n\nimport numpy as np\nimport pyt"
  },
  {
    "path": "test_unstructured_inference/models/test_tables.py",
    "chars": 59969,
    "preview": "import os\nimport threading\nfrom copy import deepcopy\n\nimport numpy as np\nimport pytest\nimport torch\nfrom PIL import Imag"
  },
  {
    "path": "test_unstructured_inference/models/test_yolox.py",
    "chars": 4360,
    "preview": "import os\n\nimport pytest\n\nfrom unstructured_inference.inference.layout import process_file_with_model\n\n\n@pytest.mark.slo"
  },
  {
    "path": "test_unstructured_inference/test_config.py",
    "chars": 332,
    "preview": "def test_default_config():\n    from unstructured_inference.config import inference_config\n\n    assert inference_config.T"
  },
  {
    "path": "test_unstructured_inference/test_elements.py",
    "chars": 22064,
    "preview": "import os\nfrom random import randint\nfrom unittest.mock import PropertyMock, patch\n\nimport numpy as np\nimport pytest\n\nfr"
  },
  {
    "path": "test_unstructured_inference/test_logger.py",
    "chars": 447,
    "preview": "import logging\n\nimport pytest\n\nfrom unstructured_inference import logger\n\n\n@pytest.mark.parametrize(\"level\", range(50))\n"
  },
  {
    "path": "test_unstructured_inference/test_math.py",
    "chars": 338,
    "preview": "import numpy as np\nimport pytest\n\nfrom unstructured_inference.math import FLOAT_EPSILON, safe_division\n\n\n@pytest.mark.pa"
  },
  {
    "path": "test_unstructured_inference/test_utils.py",
    "chars": 2384,
    "preview": "import numpy as np\nimport pytest\n\nfrom unstructured_inference.inference.layout import DocumentLayout\nfrom unstructured_i"
  },
  {
    "path": "test_unstructured_inference/test_visualization.py",
    "chars": 2492,
    "preview": "from unittest.mock import MagicMock, patch\n\nimport numpy as np\nimport pytest\nfrom PIL import Image\n\nfrom unstructured_in"
  },
  {
    "path": "unstructured_inference/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "unstructured_inference/__version__.py",
    "chars": 43,
    "preview": "__version__ = \"1.6.11\"  # pragma: no cover\n"
  },
  {
    "path": "unstructured_inference/config.py",
    "chars": 5288,
    "preview": "\"\"\"\nThis module contains variables that can permitted to be tweaked by the system environment. For\nexample, model parame"
  },
  {
    "path": "unstructured_inference/constants.py",
    "chars": 1278,
    "preview": "from enum import Enum\n\n\nclass Source(Enum):\n    YOLOX = \"yolox\"\n    DETECTRON2_ONNX = \"detectron2_onnx\"\n    DETECTRON2_L"
  },
  {
    "path": "unstructured_inference/inference/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "unstructured_inference/inference/elements.py",
    "chars": 13423,
    "preview": "from __future__ import annotations\n\nfrom copy import deepcopy\nfrom dataclasses import dataclass, field\nfrom functools im"
  },
  {
    "path": "unstructured_inference/inference/layout.py",
    "chars": 16514,
    "preview": "from __future__ import annotations\n\nimport os\nimport tempfile\nfrom functools import cached_property\nfrom pathlib import "
  },
  {
    "path": "unstructured_inference/inference/layoutelement.py",
    "chars": 20497,
    "preview": "from __future__ import annotations\n\nfrom dataclasses import dataclass, field\nfrom typing import Any, Iterable, List, Opt"
  },
  {
    "path": "unstructured_inference/inference/pdf_image.py",
    "chars": 4686,
    "preview": "from __future__ import annotations\n\nimport math\nimport os\nfrom functools import lru_cache\nfrom pathlib import Path, Pure"
  },
  {
    "path": "unstructured_inference/logger.py",
    "chars": 674,
    "preview": "import logging\n\n\ndef translate_log_level(level: int) -> int:\n    \"\"\"Translate Python debugg level to ONNX runtime error "
  },
  {
    "path": "unstructured_inference/math.py",
    "chars": 514,
    "preview": "\"\"\"a lightweight module that provides helpers to common math operations\"\"\"\n\nimport numpy as np\n\nFLOAT_EPSILON = np.finfo"
  },
  {
    "path": "unstructured_inference/models/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "unstructured_inference/models/base.py",
    "chars": 5981,
    "preview": "from __future__ import annotations\n\nimport json\nimport os\nimport threading\nfrom typing import Dict, Optional, Tuple, Typ"
  },
  {
    "path": "unstructured_inference/models/detectron2onnx.py",
    "chars": 6445,
    "preview": "import os\nfrom typing import Dict, Final, List, Optional, Union, cast\n\nimport cv2\nimport numpy as np\nimport onnxruntime\n"
  },
  {
    "path": "unstructured_inference/models/eval.py",
    "chars": 2743,
    "preview": "from functools import partial\nfrom typing import Callable, Dict, List, Optional\n\nimport pandas as pd\nfrom rapidfuzz impo"
  },
  {
    "path": "unstructured_inference/models/table_postprocess.py",
    "chars": 23583,
    "preview": "# https://github.com/microsoft/table-transformer/blob/main/src/postprocess.py\n\"\"\"\nCopyright (C) 2021 Microsoft Corporati"
  },
  {
    "path": "unstructured_inference/models/tables.py",
    "chars": 31800,
    "preview": "# https://github.com/microsoft/table-transformer/blob/main/src/inference.py\n# https://github.com/NielsRogge/Transformers"
  },
  {
    "path": "unstructured_inference/models/unstructuredmodel.py",
    "chars": 8101,
    "preview": "from __future__ import annotations\n\nfrom abc import ABC, abstractmethod\nfrom typing import Any, List, cast\n\nimport numpy"
  },
  {
    "path": "unstructured_inference/models/yolox.py",
    "chars": 8637,
    "preview": "# Copyright (c) Megvii, Inc. and its affiliates.\n# Unstructured modified the original source code found at:\n# https://gi"
  },
  {
    "path": "unstructured_inference/utils.py",
    "chars": 3880,
    "preview": "import os\nfrom collections.abc import Mapping\nfrom html.parser import HTMLParser\nfrom io import StringIO\nfrom typing imp"
  },
  {
    "path": "unstructured_inference/visualize.py",
    "chars": 5437,
    "preview": "# Copyright (c) Megvii Inc. All rights reserved.\n# Unstructured modified the original source code found at\n# https://git"
  }
]

// ... and 1 more file (download for full content)

About this extraction

This page contains the full source code of the Unstructured-IO/unstructured-inference GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 62 files (14.6 MB), approximately 97.7k tokens, and a symbol index with 433 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
