[
  {
    "path": ".github/ISSUE_TEMPLATE/bug-report.md",
    "content": "---\nname: Bug report\nabout: Use this if you observe a specific problem with pdfplumber's code or results\ntitle: ''\nlabels: bug\nassignees: ''\n---\n\n## Describe the bug\n\n*A clear and concise description of what the bug is.*\n\n\n## Have you tried [repairing](https://github.com/jsvine/pdfplumber/blob/stable/docs/repairing.md) the PDF?\n\n*Please try running your code with `pdfplumber.open(..., repair=True)` before submitting a bug report.*\n\n\n## Code to reproduce the problem\n\n*Paste it here, or attach a Python file.*\n\n\n## PDF file\n\n*Please attach any PDFs necessary to reproduce the problem.*\n\n*If you need to redact text in a sensitive PDF, you can run it through [JoshData/pdf-redactor](https://github.com/JoshData/pdf-redactor).*\n\n\n## Expected behavior\n\n*What did you expect the result __should__ have been?*\n\n\n## Actual behavior\n\n*What actually happened, instead?*\n\n\n## Screenshots\n\n*If applicable, add screenshots to help explain your problem.*\n\n\n## Environment\n\n- pdfplumber version: [e.g., 0.5.22]\n- Python version: [e.g., 3.8.1]\n- OS: [e.g., Mac, Linux, etc.]\n\n\n## Additional context\n\n*Add any other context/notes about the problem here.*\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/config.yml",
    "content": "blank_issues_enabled: false\ncontact_links:\n- name: Troubleshooting, etc.\n  url: https://github.com/jsvine/pdfplumber/discussions\n  about: Use 'Discussions' to request assistance, ask questions, etc.\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature-request.md",
    "content": "---\nname: Feature request\nabout: Suggest a feature or improvement\ntitle: ''\nlabels: feature-request\nassignees: ''\n---\n\nPlease describe, in as much detail as possible, your proposal and how it would improve your experience with pdfplumber.\n"
  },
  {
    "path": ".github/workflows/tests.yml",
    "content": "name: Tests\n\non: [push, pull_request]\n\njobs:\n  lint:\n    runs-on: ubuntu-latest\n\n    steps:\n    - uses: actions/checkout@v3\n\n    - name: Set up Python 3.12\n      uses: actions/setup-python@v4\n      with:\n        python-version: 3.12\n\n    - name: Configure pip caching\n      uses: actions/cache@v3\n      with:\n        path: ~/.cache/pip\n        key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt')}}-${{ hashFiles('**/requirements-dev.txt') }}\n\n    - name: Install Python dependencies\n      run: |\n        python -m pip install --upgrade pip\n        pip install -r requirements.txt\n        pip install -r requirements-dev.txt\n\n    - name: Validate against psf/black\n      run: python -m black --check pdfplumber tests\n\n    - name: Validate against isort\n      run: python -m isort --profile black --check-only pdfplumber tests\n\n    - name: Validate against flake8\n      run: python -m flake8 pdfplumber tests\n\n    - name: Check type annotations via mypy\n      run: python -m mypy --strict --implicit-reexport pdfplumber\n\n  test:\n    needs: lint\n    runs-on: ubuntu-latest\n    strategy:\n      fail-fast: false\n      matrix:\n        python-version: [\"3.10\", \"3.11\", \"3.12\", \"3.13\", \"3.14\"]\n\n    steps:\n    - uses: actions/checkout@v3\n\n    - name: Set up Python ${{ matrix.python-version }}\n      uses: actions/setup-python@v4\n      with:\n        python-version: ${{ matrix.python-version }}\n\n    - name: Install ghostscript\n      run: sudo apt update && sudo apt install ghostscript\n\n    - name: Configure pip caching\n      uses: actions/cache@v3\n      with:\n        path: ~/.cache/pip\n        key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt')}}-${{ hashFiles('**/requirements-dev.txt') }}\n\n    - name: Install Python dependencies\n      run: |\n        python -m pip install --upgrade pip\n        pip install -r requirements.txt\n        pip install -r requirements-dev.txt\n\n    - 
name: Run tests\n      run: |\n        python -m pytest -n auto\n        python -m coverage html\n\n    - name: Upload code coverage\n      uses: codecov/codecov-action@v3\n      if: matrix.python-version == '3.12'\n\n    - name: Build package\n      run: python setup.py build sdist\n\n"
  },
  {
    "path": ".gitignore",
    "content": "venv/\nnotebooks/\nnonpublic/\n.ipynb_checkpoints\n.DS_Store\n.idea/\n.pytest_cache/\n.mypy_cache/\n\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nenv/\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\n*.egg-info/\n.installed.cfg\n*.egg\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*,cover\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\ntarget/\n"
  },
  {
    "path": "CHANGELOG.md",
    "content": "# Changelog\n\nAll notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](http://keepachangelog.com/).\n\n## Unreleased\n\n- Upgrade `pdfminer.six` from `20251230` to `20260107`. ([07a5ff6](https://github.com/jsvine/pdfplumber/commit/07a5ff6))\n\n## 0.11.9 — 2026-01-05\n\n### Changed\n- Upgrade `pdfminer.six` from `20251107` to `20251230`. ([75bbed3](https://github.com/jsvine/pdfplumber/commit/75bbed3) + [1524ce4](https://github.com/jsvine/pdfplumber/commit/1524ce4) + [26687c3](https://github.com/jsvine/pdfplumber/commit/26687c3) + [9555532](https://github.com/jsvine/pdfplumber/commit/9555532))\n\n## [0.11.8] - 2025-11-08\n\n### Added\n- Add `edge_min_length_prefilter` table setting for initial edge filtering. Lowering this setting enables capturing small edge segments (e.g., dashed lines) that would be filtered out with the default minimum length of 1. Raising this setting would be less common but plausible. (h/t @bronislav). ([#1274](https://github.com/jsvine/pdfplumber/issues/1274)).\n\n### Changed\n- Upgrade `pdfminer.six` from `20250506` to `20251107` (h/t @henry-renner-v). ([0079187](https://github.com/jsvine/pdfplumber/pull/1348/commits/0079187ab5493f4440147cde83ee627cab081079))\n\n## [0.11.7] - 2025-06-12\n\n### Added\n- Add access to `Page.trimbox`, `Page.bleedbox`, and `Page.artbox` (h/t @samuelbradshaw). ([#1313](https://github.com/jsvine/pdfplumber/issues/1313) + [7e364e6](https://github.com/jsvine/pdfplumber/commit/7e364e6193c6e8bafa9b46587c0fdd4a46405399))\n\n### Changed\n- Upgrade `pdfminer.six` from `20250327` to `20250506`. ([4c7e092](https://github.com/jsvine/pdfplumber/commit/4c7e092))\n\n### Removed\n- Remove `stroking_pattern` and `non_stroking_pattern` object attributes, due to changes in `pdfminer.six`. 
([4c7e092](https://github.com/jsvine/pdfplumber/commit/4c7e092))\n\n## [0.11.6] - 2025-03-27\n### Changed\n- Upgrade `pdfminer.six` from `20231228` to `20250327` ([3fcb493](https://github.com/jsvine/pdfplumber/commit/3fcb493) + [12a73a2](https://github.com/jsvine/pdfplumber/commit/12a73a2))\n- Use csv.QUOTE_MINIMAL for .to_csv(...) ([980494a](https://github.com/jsvine/pdfplumber/commit/980494a))\n\n\n### Fixed\n- Fix bug with `use_text_flow=True` text extraction (h/t @samuelbradshaw)([#1279](https://github.com/jsvine/pdfplumber/issues/1279) + [e15ed98](https://github.com/jsvine/pdfplumber/commit/e15ed98))\n- Catch exceptions from pdfminer and malformed PDFs ([43ccc5b](https://github.com/jsvine/pdfplumber/commit/43ccc5b))\n- More broadly handle RecursionError ([748ff31](https://github.com/jsvine/pdfplumber/commit/748ff31))\n\n### Removed\n- Remove test_issue_1089 ([#1263](https://github.com/jsvine/pdfplumber/issues/1263) + [7e28e76](https://github.com/jsvine/pdfplumber/commit/7e28e76))\n\n## [0.11.5] - 2025-01-01\n\n### Added\n\n- Add `--format text` options to CLI (in addition to previously-available `csv` and `json`) (h/t @brandonrobertz). ([#1235](https://github.com/jsvine/pdfplumber/pull/1235))\n- Add `raise_unicode_errors: bool` parameter to `pdfplumber.open()` to allow bypassing `UnicodeDecodeError`s in annotation-parsing and generate warnings instead (h/t @stolarczyk). ([#1195](https://github.com/jsvine/pdfplumber/issues/1195))\n- Add `name` property to `image` objects (h/t @djr2015). ([#1201](https://github.com/jsvine/pdfplumber/discussions/1201))\n\n### Fixed\n\n- Fix `PageImage.debug_tablefinder(...)` so that its main keyword argument is named the same (`table_settings=`) as other related `Page` methods (h/t @n-traore). ([#1237](https://github.com/jsvine/pdfplumber/issues/1237))\n\n\n## [0.11.4] - 2024-08-18\n\n### Fixed\n\n- Fix one type hint so that it doesn't throw error on Python 3.8 (h/t @andrekeller). 
([#1184](https://github.com/jsvine/pdfplumber/issues/1184))\n\n## [0.11.3] - 2024-08-07\n\n### Added\n\n- Add `Table.columns`, analogous to `Table.rows` (h/t @Pk13055). ([#1050](https://github.com/jsvine/pdfplumber/issues/1050) + [d39302f](https://github.com/jsvine/pdfplumber/commit/d39302f))\n- Add `Page.extract_words(return_chars=True)`, mirroring `Page.search(..., return_chars=True)`; if this argument is passed, each word dictionary will include an additional key-value pair: `\"chars\": [char_object, ...]` (h/t @cmdlineluser). ([#1173](https://github.com/jsvine/pdfplumber/issues/1173) + [1496cbd](https://github.com/jsvine/pdfplumber/commit/1496cbd))\n- Add `pdfplumber.open(unicode_norm=\"NFC\"/\"NFD\"/\"NFKC\"/\"NFKD\")`, where the values are the [four options for Unicode normalization](https://unicode.org/reports/tr15/#Normalization_Forms_Table) (h/t @petermr + @agusluques). ([#905](https://github.com/jsvine/pdfplumber/issues/905) + [03a477f](https://github.com/jsvine/pdfplumber/commit/03a477f))\n\n### Changed\n\n- Change the default setting that `pdfplumber.repair(...)` passes to Ghostscript's `-dPDFSETTINGS` parameter, from `prepress` to `default`, and make that setting modifiable via `.repair(setting=...)`, where the value is one of `\"default\"`, `\"prepress\"`, `\"printer\"`, or `\"ebook\"` (h/t @Laubeee). ([#874](https://github.com/jsvine/pdfplumber/issues/874) + [48cab3f](https://github.com/jsvine/pdfplumber/commit/48cab3f))\n\n### Fixed\n\n- Fix handling of object coordinates when `mediabox` does not begin at `(0,0)` (h/t @wodny). ([#1181](https://github.com/jsvine/pdfplumber/issues/1181) + [9025c3f](https://github.com/jsvine/pdfplumber/commit/9025c3f) + [046bd87](https://github.com/jsvine/pdfplumber/commit/046bd87))\n- Fix error on getting `.annots`/`.hyperlinks` from `CroppedPage` (due to missing `.rotation` and `.initial_doctop` attributes) (h/t @Safrone). 
([#1171](https://github.com/jsvine/pdfplumber/issues/1171) + [e5737d2](https://github.com/jsvine/pdfplumber/commit/e5737d2))\n- Fix problem where `Page.crop(...)` was not cropping `.annots/.hyperlinks` (h/t @Safrone). ([#1171](https://github.com/jsvine/pdfplumber/issues/1171) + [22494e8](https://github.com/jsvine/pdfplumber/commit/22494e8))\n- Fix calculation of coordinates for `.annots` on `CroppedPage`s. ([0bbb340](https://github.com/jsvine/pdfplumber/commit/0bbb340) + [b16acc3](https://github.com/jsvine/pdfplumber/commit/b16acc3))\n- Dereference structure element attributes (h/t @dhdaines). ([#1169](https://github.com/jsvine/pdfplumber/pull/1169) + [3f16180](https://github.com/jsvine/pdfplumber/commit/3f16180))\n- Fix `Page.get_attr(...)` so that it fully resolves references before determining whether the attribute's value is `None` (h/t @zzhangyun + @mkl-public). ([#1176](https://github.com/jsvine/pdfplumber/issues/1176) + [c20cd3b](https://github.com/jsvine/pdfplumber/commit/c20cd3b))\n\n## [0.11.2] - 2024-07-06\n\n### Added\n\n- Add `extra_attrs` parameter to `.dedupe_chars(...)` to adjust the properties used when deduplicating (h/t @QuentinAndre11). ([#1114](https://github.com/jsvine/pdfplumber/issues/1114))\n\n### Development Changes\n\n- Remove testing for Python 3.8, add testing for Python 3.12. 
([944eaed](https://github.com/jsvine/pdfplumber/commit/944eaed))\n- Upgrade `flake8`, `pytest`, and `pytest-cov`, and add `setuptools` and `py` as explicit dev requirements (for Python 3.12).\n\n## [0.11.1] - 2024-06-11\n\n### Fixed\n- Fix `.open(..., repair=True)` subprocess args (to avoid stderr being captured) ([70534a7](https://github.com/jsvine/pdfplumber/commit/70534a7))\n- Fix coordinates of annots on rotated pages ([aaa35c9](https://github.com/jsvine/pdfplumber/commit/aaa35c9))\n- Fix handling of `PDFDocEncoding` failures in `decode_text(...)` ([#1147](https://github.com/jsvine/pdfplumber/issues/1147) + [4daf0aa](https://github.com/jsvine/pdfplumber/commit/4daf0aa))\n- Add `.get_textmap.cache_clear()` to `page.close()` ([0a26f05](https://github.com/jsvine/pdfplumber/commit/0a26f05))\n\n## [0.11.0] - 2024-03-07\n\n### Added\n\n- Add `{line,char}_dir{,rotated,render}` params, to provide better support for non–top-to-bottom, left-to-right text (h/t @afriedman412). ([850fd45](https://github.com/jsvine/pdfplumber/commit/850fd45))\n- Add `curve[\"path\"]` and `curve[\"dash\"]`, thanks to `pdfminer.six` upgrade (see below). ([1820247](https://github.com/jsvine/pdfplumber/commit/1820247))\n\n### Changed\n- Upgrade `pdfminer.six` from `20221105` to `20231228`. ([cd2f768](https://github.com/jsvine/pdfplumber/commit/cd2f768))\n- Change value of `word[\"direction\"]` from `{1,-1}` to `{\"ltr\",\"rtl\",\"ttb\",\"btt\"}`. ([850fd45](https://github.com/jsvine/pdfplumber/commit/850fd45))\n- Deprecate `vertical_ttb`, `horizontal_ltr` in favor of `char_dir` and `char_dir_rotated`. ([850fd45](https://github.com/jsvine/pdfplumber/commit/850fd45))\n\n\n### Fixed\n- Fix layout-caching issue caused by `0bfffc2`. ([#1097](https://github.com/jsvine/pdfplumber/pull/1097) + [efca277](https://github.com/jsvine/pdfplumber/commit/efca277))\n- Fix missing ParentTree edge-case. 
([#1094](https://github.com/jsvine/pdfplumber/pull/1094))\n\n## [0.10.4] - 2024-02-10\n\n### Added\n\n- Add `x_tolerance_ratio` parameter to `extract_text` and similar functions, to account for text size when spacing characters (instead of a fixed number of pixels) (h/t @afriedman412). ([#1041](https://github.com/jsvine/pdfplumber/pulls/1041))\n- Add support for PDF 1.3 logical structure via `Page.structure_tree` (h/t @dhdaines). ([#963](https://github.com/jsvine/pdfplumber/pulls/963))\n- Add \"gswin64c\" as another possible Ghostscript executable in `repair.py` (h/t @echedey-ls). ([#1032](https://github.com/jsvine/pdfplumber/issues/1030))\n- Re-add `Page.close()` method, have `PDF.close()` close all pages as well, and improve relevant documentation (h/t @luketudge). ([#1042](https://github.com/jsvine/pdfplumber/issues/1042))\n- Add `force_mediabox` parameter to `Page.to_image(...)`. ([#1054](https://github.com/jsvine/pdfplumber/issues/1054))\n\n### Fixed\n\n- Standardize handling of cropbox, fixing various issues with PageImage. ([#1054](https://github.com/jsvine/pdfplumber/issues/1054))\n- Fix `Page.get_textmap` caching to allow for `extra_attrs=[...]`, by preconverting list kwargs to tuples. ([#1030](https://github.com/jsvine/pdfplumber/issues/1030))\n- Explicitly close `pypdfium2.PdfDocument` in `get_page_image` (h/t @dhdaines). ([#1090](https://github.com/jsvine/pdfplumber/pull/1090))\n- In `PDFPageAggregatorWithMarkedContent.tag_cur_item`, check `self.cur_item._objs` length before trying to access `[-1]`. ([4f39d03](https://github.com/jsvine/pdfplumber/commit/4f39d03))\n\n\n## [0.10.3] - 2023-10-26\n\n### Added\n\n- Add support for marked-content sequences, represented by `mcid` and `tag` attributes on `char`/`rect`/`line`/`curve`/`image` objects (h/t @dhdaines). 
([#961](https://github.com/jsvine/pdfplumber/pulls/961))\n- Add `gs_path` argument to `pdfplumber.open(...)` and `pdfplumber.repair(...)`, to allow passing a custom Ghostscript path to be used for repairing. ([#953](https://github.com/jsvine/pdfplumber/issues/953))\n\n### Fixed\n\n- Respect `use_text_flow` in `extract_text` (h/t @dhdaines). ([#983](https://github.com/jsvine/pdfplumber/pulls/983))\n\n## [0.10.2] - 2023-07-29\n\n### Added\n\n- Add `PDF.path`: A `Path` object for PDFs loaded by passing a path (unless `repair=True`), and `None` otherwise. ([30a52cb](https://github.com/jsvine/pdfplumber/commit/30a52cb) + [#948](https://github.com/jsvine/pdfplumber/issues/948))\n\n- Accept `Iterable` objects for geometry utils (h/t @dhdaines). ([53bee23](https://github.com/jsvine/pdfplumber/commit/53bee23) + [#945](https://github.com/jsvine/pdfplumber/pulls/945))\n\n### Changed\n\n- Use pypdfium2's *public* (not private) `.render(...)` method (h/t @mara004). ([28f4ebe](https://github.com/jsvine/pdfplumber/commit/28f4ebe) + [#899](https://github.com/jsvine/pdfplumber/discussions/899#discussioncomment-6520928))\n\n### Fixed\n\n- Fix `.to_image()` for `ZipExtFile`s (h/t @Urbener). ([30a52cb](https://github.com/jsvine/pdfplumber/commit/30a52cb) + [#948](https://github.com/jsvine/pdfplumber/issues/948))\n\n## [0.10.1] - 2023-07-19\n\n### Added\n\n- Add `antialias` boolean parameter to `Page.to_image(...)` and associated methods (h/t @cmdlineluser). ([7e28931](https://github.com/jsvine/pdfplumber/commit/7e28931))\n\n## [0.10.0] - 2023-07-16\n\n### Changed\n\n- Normalize color representation to `tuple[float|int, ...]` ([#917](https://github.com/jsvine/pdfplumber/issues/917)). ([57d51bb](https://github.com/jsvine/pdfplumber/commit/57d51bb))\n- Replace Wand with pypdfium2 for page.to_image(...). 
([b049373](https://github.com/jsvine/pdfplumber/commit/b049373))\n\n### Added\n\n- Add `pdfplumber.repair(...)` and `.open(repair=True)` ([#824](https://github.com/jsvine/pdfplumber/issues/824)). ([db6ae97](https://github.com/jsvine/pdfplumber/commit/db6ae97))\n- Add `Page.find_table(...)` ([#873](https://github.com/jsvine/pdfplumber/issues/873)). ([3772af6](https://github.com/jsvine/pdfplumber/commit/3772af6))\n- Add `quantize=True`, `colors=256`, `bits=8` arguments/defaults to `PageImage.save(...)`. ([b049373](https://github.com/jsvine/pdfplumber/commit/b049373))\n- Extract and handle patterns + (some) color spaces. ([97ca4b0](https://github.com/jsvine/pdfplumber/commit/97ca4b0))\n\n### Removed\n\n- Remove support for Python 3.7 ([EOL'ed June 2023](https://endoflife.date/python)). ([c9d24d5](https://github.com/jsvine/pdfplumber/commit/c9d24d5))\n- Remove vestigial 'font' and 'name' properties from PDF objects. ([6d62054](https://github.com/jsvine/pdfplumber/commit/6d62054))\n\n### Fixed\n\n- Fix bug for re-crops that use `relative=True` ([#914](https://github.com/jsvine/pdfplumber/issues/914)). ([0de6da9](https://github.com/jsvine/pdfplumber/commit/0de6da9))\n- Handle `use_text_flow` more consistently ([#912](https://github.com/jsvine/pdfplumber/issues/912)). ([b1db5b8](https://github.com/jsvine/pdfplumber/commit/b1db5b8))\n\n\n## [0.9.0] - 2023-04-13\n\n### Changed\n\n- Make word segmentation (via `WordExtractor.char_begins_new_word(...)`) more explicit and rigorous; should help in catching edge-cases in the future. ([6acd580](https://github.com/jsvine/pdfplumber/commit/6acd580) + [ebb93ea](https://github.com/jsvine/pdfplumber/commit/ebb93ea) + [#840](https://github.com/jsvine/pdfplumber/discussions/840#discussioncomment-5312166))\n- Use `curve_edge` objects (instead of just `line` and `rect_edge` objects) in default table-detection strategy. 
([6f6b465](https://github.com/jsvine/pdfplumber/commit/6f6b465) + [#858](https://github.com/jsvine/pdfplumber/discussions/858))\n- By default, expand ligatures into their constituent letters (e.g., `ﬃ` to `ffi`), and add the `expand_ligatures` boolean parameter to text-extraction methods. ([86e935d](https://github.com/jsvine/pdfplumber/commit/86e935d) + [#598](https://github.com/jsvine/pdfplumber/issues/598))\n\n### Added\n\n- Add `Page.extract_text_lines(...)` method. ([4b37397](https://github.com/jsvine/pdfplumber/commit/4b37397) + [#852](https://github.com/jsvine/pdfplumber/discussions/852))\n- Add `main_group`, `return_groups`, `return_chars` parameters to `Page.search(...)`. ([4b37397](https://github.com/jsvine/pdfplumber/commit/4b37397))\n- Add `.curve_edges` property to `PDF` and `Page`. ([6f6b465](https://github.com/jsvine/pdfplumber/commit/6f6b465))\n\n### Fixed\n\n- Fix handling of bytes-typed fontnames. ([9441ff7](https://github.com/jsvine/pdfplumber/commit/9441ff7) + [#461](https://github.com/jsvine/pdfplumber/discussions/461) + [#842](https://github.com/jsvine/pdfplumber/discussions/842))\n- Fix handling of whitespace-only and empty results of `Page.search(...)`. ([6f6b465](https://github.com/jsvine/pdfplumber/commit/6f6b465) + [#853](https://github.com/jsvine/pdfplumber/discussions/853))\n\n## [0.8.1] - 2023-04-08\n### Fixed\n\n- Fix `x0>x1`/etc. error when drawing rect fills, per new Pillow version ([db136b7](https://github.com/jsvine/pdfplumber/commit/db136b7))\n\n## [0.8.0] - 2023-02-13\n\n### Changed\n\n- Change the (still experimental) `Page/utils.extract_text(layout=True)` approach so that it pads, to the degree necessary, the ends of lines with spaces and the end of the text with blank lines to achieve better mimicry of page layout. 
([d3662de](https://github.com/jsvine/pdfplumber/commit/d3662de))\n- Refactor handling of `pts` attribute and, in doing so, deprecate the `curve_obj[\"points\"]` attribute, and fix `PageImage.draw_line(...)`'s handling of diagonal lines. ([216bedd](https://github.com/jsvine/pdfplumber/commit/216bedd))\n- Breaking change: In `Page.extract_table[s](...)`, `keep_blank_chars` must now be passed as `text_keep_blank_chars`, for consistency's sake. ([c4e1b29](https://github.com/jsvine/pdfplumber/commit/c4e1b29))\n\n### Added\n\n- Add `Page.extract_table[s](...)` support for all `Page.extract_text(...)` keyword arguments. ([c4e1b29](https://github.com/jsvine/pdfplumber/commit/c4e1b29))\n- Add `height` and `width` keyword arguments to `Page.to_image(...)`. ([#798](https://github.com/jsvine/pdfplumber/issues/798) + [93f7dbd](https://github.com/jsvine/pdfplumber/commit/93f7dbd))\n- Add `layout_width`, `layout_width_chars`, `layout_height`, and `layout_height_chars` parameters to `Page/utils.extract_text(layout=True)`. ([d3662de](https://github.com/jsvine/pdfplumber/commit/d3662de))\n- Add CITATION.cff. ([#755](https://github.com/jsvine/pdfplumber/issues/755)) [h/t @joaoccruz]\n\n### Fixed\n\n- Fix simple edge-case for when page rotation is (incorrectly) set to `None`. ([#811](https://github.com/jsvine/pdfplumber/pull/811)) [h/t @toshi1127]\n\n### Development Changes\n\n- Convert `utils.py` into `utils/` submodules. Retains same interface, just an improvement in organization. ([6351d97](https://github.com/jsvine/pdfplumber/commit/6351d97))\n- Fix typing hints to include io.BytesIO. ([d4107f6](https://github.com/jsvine/pdfplumber/commit/d4107f6)) [h/t @conitrade-as]\n- Refactor text-extraction utilities, paving the way for better consistency across various entrypoints to text extraction (e.g., via `utils.extract_text(...)`, via `Page.extract_text(...)`, via `Page.extract_table(...)`). 
([3424b57](https://github.com/jsvine/pdfplumber/commit/3424b57))\n\n## [0.7.6] - 2022-11-22\n\n### Changed\n\n- Bump pinned `pdfminer.six` version to `20221105`. ([e63a038](https://github.com/jsvine/pdfplumber/commit/e63a038))\n\n### Fixed\n\n- Restore `text` attribute to `.textboxhorizontal`/etc., fixing a regression introduced in `9587cc7` / `v0.6.2`. ([8a0c126](https://github.com/jsvine/pdfplumber/commit/8a0c126))\n- Fix `lru_cache` usage, which is [discouraged for class methods](https://rednafi.github.io/reflections/dont-wrap-instance-methods-with-functoolslru_cache-decorator-in-python.html) due to garbage-collection issues. ([e3142a0](https://github.com/jsvine/pdfplumber/commit/e3142a0))\n\n### Development Changes\n\n- Upgrade `nbexec` development requirement from `0.1.0` to `0.2.0`. ([30dac25](https://github.com/jsvine/pdfplumber/commit/30dac25))\n\n## [0.7.5] - 2022-10-01\n\n### Added\n\n- Add `PageImage.show()` as alias for `PageImage.annotated.show()`. ([#715](https://github.com/jsvine/pdfplumber/discussions/715) + [5c7787b](https://github.com/jsvine/pdfplumber/commit/5c7787b))\n\n### Fixed\n\n- Fix issue where `py.typed` file was not included in PyPI distribution. ([#698](https://github.com/jsvine/pdfplumber/issues/698) + [#703](https://github.com/jsvine/pdfplumber/pull/703) + [6908487](https://github.com/jsvine/pdfplumber/commit/6908487)) [h/t @jhonatan-lopes]\n- Reinstate the ability to call `utils.cluster_objects(...)` with any hashable value (`str`, `int`, `tuple`, etc.) as the `key_fn` parameter, reverting breaking change in [58b1ab1](https://github.com/jsvine/pdfplumber/commit/58b1ab1). ([#691](https://github.com/jsvine/pdfplumber/issues/691) + [1e97656](https://github.com/jsvine/pdfplumber/commit/1e97656)) [h/t @jfuruness]\n\n### Development Changes\n\n- Update Wand version in `requirements.txt` from `>=0.6.7` to `>=0.6.10`. 
([#713](https://github.com/jsvine/pdfplumber/issues/713) + [3457d79](https://github.com/jsvine/pdfplumber/commit/3457d79))\n\n## [0.7.4] - 2022-07-19\n\n### Added\n\n- Add `utils.outside_bbox(...)` and `Page.outside_bbox(...)` method, which are the inverse of `utils.within_bbox(...)` and `Page.within_bbox(...)`. ([#369](https://github.com/jsvine/pdfplumber/issues/369) + [3ab1cc4](https://github.com/jsvine/pdfplumber/commit/3ab1cc4))\n- Add `strict=True/False` parameter to `Page.crop(...)`, `Page.within_bbox(...)`, and `Page.outside_bbox(...)`; default is `True`, while `False` bypasses the `test_proposed_bbox(...)` check. ([#421](https://github.com/jsvine/pdfplumber/issues/421) + [71ad60f](https://github.com/jsvine/pdfplumber/commit/71ad60f))\n- Add more guidance to exception when `.to_image(...)` raises `PIL.Image.DecompressionBombError`. ([#413](https://github.com/jsvine/pdfplumber/issues/413) + [b6ff9e8](https://github.com/jsvine/pdfplumber/commit/b6ff9e8))\n\n### Fixed\n\n- Fix `PageImage` conversions for PDFs with `cmyk` colorspaces; convert them to `rgb` earlier in the process. ([28330da](https://github.com/jsvine/pdfplumber/commit/28330da))\n\n## [0.7.3] - 2022-07-18\n\n### Fixed\n\n- Quick fix for transparency issue in visual debugging mode. ([b98dd7c](https://github.com/jsvine/pdfplumber/commit/b98dd7c))\n\n## [0.7.2] - 2022-07-17\n\n### Added\n\n- Add `split_at_punctuation` parameter to `.extract_words(...)` and `.extract_text(...)`. ([#682](https://github.com/jsvine/pdfplumber/issues/674)) [h/t @lolipopshock]\n- Add README.md link to @hbh112233abc's [Chinese translation of README.md](https://github.com/hbh112233abc/pdfplumber/blob/stable/README-CN.md). ([#674](https://github.com/jsvine/pdfplumber/issues/674))\n\n### Changed\n\n- Change `.to_image(...)`'s approach, preferring to composite with a white background instead of removing the alpha channel. 
([1cd1f9a](https://github.com/jsvine/pdfplumber/commit/1cd1f9a))\n\n### Fixed\n\n- Fix bug in `LayoutEngine.calculate(...)` when processing char objects with len>1 representations, such as ligatures. ([#683](https://github.com/jsvine/pdfplumber/issues/683))\n\n## [0.7.1] - 2022-05-31\n\n### Fixed\n\n- Fix bug when calling `PageImage.debug_tablefinder()` (i.e., with no parameters). ([#659](https://github.com/jsvine/pdfplumber/issues/659) + [063e2ed](https://github.com/jsvine/pdfplumber/commit/063e2ed)) [h/t @rneumann7]\n\n### Development Changes\n\n- Add `Makefile` target for `examples`, as well as dev requirements to support re-running the example notebooks automatically. ([ef065a7](https://github.com/jsvine/pdfplumber/commit/ef065a7))\n\n## [0.7.0] - 2022-05-27\n\n### Added\n\n- Add `\"matrix\"` property to `char` objects, representing the current transformation matrix. ([ae6f99e](https://github.com/jsvine/pdfplumber/commit/ae6f99e))\n- Add `pdfplumber.ctm` submodule with class `CTM`, to calculate scale, skew, and translation of a current transformation matrix obtained from a `char`'s `\"matrix\"` property. ([ae6f99e](https://github.com/jsvine/pdfplumber/commit/ae6f99e))\n- Add `page.search(...)`, an *experimental feature* that allows you to search a page's text via regular expressions and non-regex strings, returning the text, any regex matches, the bounding box coordinates, and the char objects themselves. ([#201](https://github.com/jsvine/pdfplumber/issues/201) + [58b1ab1](https://github.com/jsvine/pdfplumber/commit/58b1ab1))\n- Add `--include-attrs`/`--exclude-attrs` to CLI (and corresponding params to `.to_json(...)`, `.to_csv(...)`, and `Serializer`). ([4deac25](https://github.com/jsvine/pdfplumber/commit/4deac25))\n- Add `py.typed` for PEP 561 compatibility and detection of typing hints by mypy. ([ca795d1](https://github.com/jsvine/pdfplumber/commit/ca795d1)) [h/t @jhonatan-lopes]\n\n### Changed\n\n- Bump pinned `pdfminer.six` version to `20220524`. 
([486cea8](https://github.com/jsvine/pdfplumber/commit/486cea8))\n\n### Removed\n\n- Remove `utils.collate_chars(...)`, the old name (and then alias) for `utils.extract_text(...)`. ([24f3532](https://github.com/jsvine/pdfplumber/commit/24f3532))\n- Remove `utils._itemgetter(...)`, an internal-use method previously used by `utils.cluster_objects(...)`. ([58b1ab1](https://github.com/jsvine/pdfplumber/commit/58b1ab1))\n\n### Fixed\n\n- Fix `IndexError` bug for `.extract_text(layout=True)` on pages without text. ([#658](https://github.com/jsvine/pdfplumber/issues/658) + [ad3df11](https://github.com/jsvine/pdfplumber/commit/ad3df11)) [h/t @ethanscorey]\n\n## [0.6.2] - 2022-05-06\n\n### Added\n\n- Add type annotations, and refactor parts of the library accordingly. ([9587cc7](https://github.com/jsvine/pdfplumber/commit/9587cc7d2292a1eae7a0150ab406f9365944266f))\n- Add enforcement of type annotations via `mypy --strict`. ([cdfdb87](https://github.com/jsvine/pdfplumber/commit/cdfdb87a215fed6cdc0db3a218c35bf18d399cbe))\n- Add final bits of test coverage. ([feb9d08](https://github.com/jsvine/pdfplumber/commit/feb9d082d7afb31edd0838cb93666d1e71c119da))\n- Add `TableSettings` class, a behind-the-scenes handler for managing and validating table-extraction settings. ([9587cc7](https://github.com/jsvine/pdfplumber/commit/9587cc7d2292a1eae7a0150ab406f9365944266f))\n\n### Changed\n\n- Rename the positional argument to `.to_csv(...)` and `.to_json(...)` from `types` to `object_types`. ([9587cc7](https://github.com/jsvine/pdfplumber/commit/9587cc7d2292a1eae7a0150ab406f9365944266f))\n- Tweak the output of `.to_json(...)` so that, if an object type is not present for a given page, it has no key in the page's object representation. 
([9587cc7](https://github.com/jsvine/pdfplumber/commit/9587cc7d2292a1eae7a0150ab406f9365944266f))\n\n### Removed\n\n- Remove `utils.filter_objects(...)` and move the functionality to within the `FilteredPage.objects` property calculation, the only part of the library that used it. ([9587cc7](https://github.com/jsvine/pdfplumber/commit/9587cc7d2292a1eae7a0150ab406f9365944266f))\n- Remove code that sets `pdfminer.pdftypes.STRICT = True` and `pdfminer.pdfinterp.STRICT = True`, since that [has now been the default for a while](https://github.com/pdfminer/pdfminer.six/commit/9439a3a31a347836aad1c1226168156125d9505f). ([9587cc7](https://github.com/jsvine/pdfplumber/commit/9587cc7d2292a1eae7a0150ab406f9365944266f))\n\n## [0.6.1] - 2022-04-23\n\n### Changed\n- Bump pinned `pdfminer.six` version to `20220319`. ([e434ed0](https://github.com/jsvine/pdfplumber/commit/e434ed0b196f1f2c0b7f76e8ea2663e40c99e93c))\n- Bump minimum `Pillow` version to `>=9.1`. ([d88eff1](https://github.com/jsvine/pdfplumber/commit/d88eff1e5354baa219ebff244fd4ab0e74db49c5))\n- Drop support for Python 3.6 (EOL Dec. 2021) ([a32473e](https://github.com/jsvine/pdfplumber/commit/a32473ee5f9113d5c5a96b30270cafc58d170f46))\n\n### Fixed\n- If `pdfplumber.open(...)` opens a file but a `pdfminer.pdfparser.PSException` is raised during the process, `pdfplumber` now makes sure to close that file. ([#581](https://github.com/jsvine/pdfplumber/pull/581) + [#578](https://github.com/jsvine/pdfplumber/issues/578)) [h/t @johnhuge]\n- Fix incompatibility with `Pillow>=9.1`. ([#637](https://github.com/jsvine/pdfplumber/issues/637))\n\n## [0.6.0] - 2021-12-21\n### Added\n- Add `.extract_text(layout=True)`, an *experimental feature* which attempts to mimic the structural layout of the text on the page. ([#10](https://github.com/jsvine/pdfplumber/issues/10))\n- Add `utils.merge_bboxes(bboxes)`, which returns the smallest bounding box that contains all bounding boxes in the `bboxes` argument. 
([f8d5e70](https://github.com/jsvine/pdfplumber/commit/f8d5e70a509aa9ed3ee565d7d3f97bb5ec67f5a5))\n- Add `--precision` argument to CLI ([#520](https://github.com/jsvine/pdfplumber/pull/520))\n- Add `snap_x_tolerance` and `snap_y_tolerance` to table extraction settings. ([#51](https://github.com/jsvine/pdfplumber/pull/51) + [#475](https://github.com/jsvine/pdfplumber/issues/475)) [h/t @dustindall]\n- Add `join_x_tolerance` and `join_y_tolerance` to table extraction settings. ([cbb34ce](https://github.com/jsvine/pdfplumber/commit/cbb34ce28b9b66d8d709304bbd0de267d82d75f3))\n\n### Changed\n- Upgrade `pdfminer.six` from `20200517` to `20211012`; see [that library's changelog](https://github.com/pdfminer/pdfminer.six/blob/develop/CHANGELOG.md) for details, but a key difference is an improvement in how it assigns `line`, `rect`, and `curve` objects. (Diagonal two-point lines, for instance, are now `line` objects instead of `curve` objects.) ([#515](https://github.com/jsvine/pdfplumber/pull/515))\n- Remove Decimal-ization of parsed object attributes, which are now represented with as much precision as is returned by `pdfminer.six` ([#346](https://github.com/jsvine/pdfplumber/discussions/346) + [#520](https://github.com/jsvine/pdfplumber/pull/520))\n- `.extract_text(...)` returns `\"\"` instead of `None` when character list is empty. ([#482](https://github.com/jsvine/pdfplumber/issues/482) + [cb9900b](https://github.com/jsvine/pdfplumber/commit/cb9900b49706e96df520dbd1067c2a57a4cdb20d)) [h/t @tungph]\n- `.extract_words(...)` now includes `doctop` among the attributes it returns for each word. ([66fef89](https://github.com/jsvine/pdfplumber/commit/66fef89b670cf95d13a5e23040c7bf9339944c01))\n- Change behavior of horizontal `text_strategy`, so that it uses the top and bottom of *every* word, not just the top of every word and the bottom of the last. 
([#467](https://github.com/jsvine/pdfplumber/pull/467) + [#466](https://github.com/jsvine/pdfplumber/issues/466) + [#265](https://github.com/jsvine/pdfplumber/issues/265)) [h/t @bobluda + @samkit-jain]\n- Change `table.merge_edges(...)` behavior when `join_tolerance` (and `x`/`y` variants) `<= 0`, so that joining is attempted regardless, to handle cases of overlapping lines. ([cbb34ce](https://github.com/jsvine/pdfplumber/commit/cbb34ce28b9b66d8d709304bbd0de267d82d75f3))\n- Raise error if certain table-extraction settings are negative. ([aa2d594](https://github.com/jsvine/pdfplumber/commit/aa2d594d3b3352dbcef503e4df2e045d69fc2511))\n\n### Fixed\n- Fix slowdown in `.extract_words(...)`/`WordExtractor.iter_chars_to_words(...)` on very long words, caused by repeatedly re-calculating bounding box. ([#483](https://github.com/jsvine/pdfplumber/discussions/483))\n- Handle `UnicodeDecodeError` when trying to decode utf-16-encoded annotations ([#463](https://github.com/jsvine/pdfplumber/issues/463)) [h/t @tungph]\n- Fix crash when extracting tables with null values in `(text|intersection)_(x|y)_tolerance` settings. 
([#539](https://github.com/jsvine/pdfplumber/discussions/539)) [h/t @yoavxyoav]\n\n### Removed\n- Remove `pdfplumber.load(...)` method, which has been deprecated since `0.5.23` ([54cbbc5](https://github.com/jsvine/pdfplumber/commit/54cbbc5321b42f3976b2ac750c25b7b2ec6045d7))\n\n### Development Changes\n- Add `CONTRIBUTING.md` ([#428](https://github.com/jsvine/pdfplumber/pull/428))\n- Enforce import order via [`isort`](https://pycqa.github.io/isort/index.html) ([d72b879](https://github.com/jsvine/pdfplumber/commit/d72b879665b410bd0f9c436d54ae60b3989489d5))\n- Update Pillow and Wand versions in `requirements.txt` ([cae6924](https://github.com/jsvine/pdfplumber/commit/cae69246c53e49f95c1adbb5dffb3d49e726c5df))\n- Update all dependency versions in `requirements-dev.txt` ([2f7e7ee](https://github.com/jsvine/pdfplumber/commit/2f7e7ee49172d681f34269a0db0276dffefa6386))\n\n## [0.5.28] — 2021-05-08\n### Added\n- Add `--laparams` flag to CLI. ([#407](https://github.com/jsvine/pdfplumber/pull/407))\n\n### Changed\n- Change `.convert_csv(...)` to order objects first by page number, rather than object type. ([#407](https://github.com/jsvine/pdfplumber/pull/407))\n- Change `.convert_csv(...)`, `.convert_json(...)`, and CLI so that, by default, they return all available object types, rather than those in a predefined default list. ([#407](https://github.com/jsvine/pdfplumber/pull/407))\n\n### Fixed\n- Fix `.extract_text(...)` so that it can accept generator objects as its main parameter. ([#385](https://github.com/jsvine/pdfplumber/pull/385)) [h/t @alexreg]\n- Fix page-parsing so that `LTAnno` objects (which have no bounding-box coordinates) are not extracted. (Was only an issue when setting `laparams`.) 
([#388](https://github.com/jsvine/pdfplumber/issues/383))\n- Fix `Page.extract_table(...)` so that it honors text tolerance settings ([#415](https://github.com/jsvine/pdfplumber/issues/415)) [h/t @trifling]\n\n## [0.5.27] — 2021-02-28\n### Fixed\n- Fix regression (introduced in `0.5.26`/[b1849f4](https://github.com/jsvine/pdfplumber/commit/b1849f4)) in closing files opened by `PDF.open`\n- Reinstate access to higher-level layout objects (such as `textboxhorizontal`) when `laparams` is passed to `pdfplumber.open(...)`. Had been removed in `0.5.24` via [1f87898](https://github.com/jsvine/pdfplumber/commit/1f878988576017b64f5cd77e1eb21b401124c699). ([#359](https://github.com/jsvine/pdfplumber/issues/359) + [#364](https://github.com/jsvine/pdfplumber/pull/364))\n\n### Development Changes\n- Add a `python setup.py build sdist` test to main GitHub action. ([#365](https://github.com/jsvine/pdfplumber/pull/365))\n\n## [0.5.26] — 2021-02-10\n### Added\n- Add `Page.close/__enter__/__exit__` methods, by generalizing that behavior through the `Container` class ([b1849f4](https://github.com/jsvine/pdfplumber/commit/b1849f4))\n\n### Changed\n- Change handling of floating point numbers; no longer convert them to `Decimal` objects and do not round them\n- Change `TableFinder` to return tables in order of topmost-and-then-leftmost, rather than leftmost-and-then-topmost ([#336](https://github.com/jsvine/pdfplumber/issues/336))\n- Change `Page.to_image()`'s handling of alpha layer, to remove aliasing artifacts ([#340](https://github.com/jsvine/pdfplumber/pull/340)) [h/t @arlyon]\n\n### Development Changes\n\n- Enforce `psf/black` and `flake8` on `tests/` ([#327](https://github.com/jsvine/pdfplumber/pull/327))\n\n## [0.5.25] — 2020-12-09\n### Added\n- Add new boolean argument `strict_metadata` (default `False`) to `pdfplumber.open(...)` method for handling metadata resolution failures ([f2c510d](https://github.com/jsvine/pdfplumber/commit/f2c510d))\n\n### Fixed\n- Fix metadata 
extraction to handle integer/floating-point values ([cb32478](https://github.com/jsvine/pdfplumber/commit/cb32478)) ([#297](https://github.com/jsvine/pdfplumber/issues/297))\n- Fix metadata extraction to handle nested metadata values ([2d9415](https://github.com/jsvine/pdfplumber/commit/2d9415)) ([#316](https://github.com/jsvine/pdfplumber/issues/316))\n- Explicitly load text as utf-8 in `setup.py` ([7854328](https://github.com/jsvine/pdfplumber/commit/7854328)) ([#304](https://github.com/jsvine/pdfplumber/issues/304))\n- Fix `pdfplumber.open(...)` so that it does not close file objects passed to it ([408605f](https://github.com/jsvine/pdfplumber/commit/408605f)) ([#312](https://github.com/jsvine/pdfplumber/issues/312))\n\n\n## [0.5.24] — 2020-10-20\n### Added\n- Added `extra_attrs=[...]` parameter to `.extract_text(...)` ([c8b200e](https://github.com/jsvine/pdfplumber/commit/c8b200e)) ([#28](https://github.com/jsvine/pdfplumber/issues/28))\n- Added `utils/page.dedupe_chars(...)` ([04fd56a](https://github.com/jsvine/pdfplumber/commit/04fd56a) + [b132d45](https://github.com/jsvine/pdfplumber/commit/b132d45)) ([#71](https://github.com/jsvine/pdfplumber/issues/71))\n\n### Changed\n- Change character attribute `upright` from `int` to `bool` (per original `pdfminer.six` representation) ([1f87898](https://github.com/jsvine/pdfplumber/commit/1f87898))\n- Remove access and reference to `Container.figures`, given that they are not fundamental objects ([8e74cb9](https://github.com/jsvine/pdfplumber/commit/8e74cb9))\n\n### Fixed\n- Decimalize \"simple\" `explicit_horizontal_lines`/`explicit_vertical_lines` descs passed to `TableFinder` methods ([bc40779](https://github.com/jsvine/pdfplumber/commit/bc40779)) ([#290](https://github.com/jsvine/pdfplumber/issues/290))\n\n### Development Changes\n\n- Refactor/simplify `Page.process_objects` ([1f87898](https://github.com/jsvine/pdfplumber/commit/1f87898)), `utils.extract_words` 
([c8b200e](https://github.com/jsvine/pdfplumber/commit/c8b200e)), and `convert.serialize` ([a74d3bc](https://github.com/jsvine/pdfplumber/commit/a74d3bc))\n- Remove `test_issues.py:test_pr_77` ([917467a](https://github.com/jsvine/pdfplumber/commit/917467a)) and narrow `test_ca_warn_report:test_objects` ([6233bbd](https://github.com/jsvine/pdfplumber/commit/6233bbd)) to speed up tests\n\n## [0.5.23] — 2020-08-15\n### Added\n- Add `utils.resolve` (non-recursive .resolve_all) ([7a90630](https://github.com/jsvine/pdfplumber/commit/7a90630))\n- Add `page.annots` and `page.hyperlinks`, replacing non-functional `page.annos`, and mirroring pdfminer's language (\"annot\" vs. \"anno\"). ([aa03961](https://github.com/jsvine/pdfplumber/commit/aa03961))\n- Add `page/pdf.to_json` and `page/pdf.to_csv` ([cbc91c6](https://github.com/jsvine/pdfplumber/commit/cbc91c6))\n- Add `relative=True/False` parameter to `.crop` and `.within_bbox`; those methods also now raise exceptions for invalid and out-of-page bounding boxes. ([047ad34](https://github.com/jsvine/pdfplumber/commit/047ad34)) [h/t @samkit-jain]\n\n### Changed\n- Mark `pdfplumber.from_path` and `pdfplumber.load` as deprecated; `pdfplumber.open` is now the canonical way to load a PDF. ([00e789b](https://github.com/jsvine/pdfplumber/commit/00e789b))\n- Simplify the logic in \"text\" table-finding strategies; in edge cases, may result in changes to results. 
([d224202](https://github.com/jsvine/pdfplumber/commit/d224202))\n- Drop support for Python 3.5 ([baf1033](https://github.com/jsvine/pdfplumber/commit/baf1033))\n\n### Fixed\n- Fix `.extract_words`, which had been returning incorrect results when `horizontal_ltr = False` ([d16aa13](https://github.com/jsvine/pdfplumber/commit/d16aa13))\n- Fix `utils.resize_object`, which had been failing in various permutations ([d16aa13](https://github.com/jsvine/pdfplumber/commit/d16aa13))\n- Fix `lines_strict` table-finding strategy, which a typo had prevented from being usable ([f0c9b85](https://github.com/jsvine/pdfplumber/commit/f0c9b85))\n- Fix `utils.resolve_all` to guard against two known sources of infinite recursion ([cbc91c6](https://github.com/jsvine/pdfplumber/commit/cbc91c6))\n\n### Development Changes\n\n- Rename default branch to \"stable,\" to clarify its purpose\n- Reformat code with psf/black ([1258e09](https://github.com/jsvine/pdfplumber/commit/1258e09))\n- Add code linting via psf/black and flake8 ([1258e09](https://github.com/jsvine/pdfplumber/commit/1258e09))\n- Switch from nosetests to pytest ([1ac16dd](https://github.com/jsvine/pdfplumber/commit/1ac16dd))\n- Switch from pipenv to standard requirements.txt + python -m venv ([48eaa51](https://github.com/jsvine/pdfplumber/commit/48eaa51))\n- Add GitHub action for tests + codecov ([b148fd1](https://github.com/jsvine/pdfplumber/commit/b148fd1))\n- Add Makefile for building development virtual environment and running tests ([4c69c58](https://github.com/jsvine/pdfplumber/commit/4c69c58))\n- Add badges to README.md ([9e42dc3](https://github.com/jsvine/pdfplumber/commit/9e42dc3))\n- Add Trove classifiers for Python versions to setup.py ([6946e8d](https://github.com/jsvine/pdfplumber/commit/6946e8d))\n- Add MANIFEST.in ([eafc15c](https://github.com/jsvine/pdfplumber/commit/eafc15c))\n- Add GitHub issue templates ([c4156d6](https://github.com/jsvine/pdfplumber/commit/c4156d6))\n- Remove `pandas` from dev requirements 
and tests ([a5e7d7f](https://github.com/jsvine/pdfplumber/commit/a5e7d7f))\n\n## [0.5.22] — 2020-07-18\n### Changed\n- Upgraded `pdfminer.six` requirement to `==20200517` ([cddbff7](https://github.com/jsvine/pdfplumber/commit/cddbff7)) [h/t @youngquan]\n\n### Added\n- Add support for `non_stroking_color` attribute on `char` objects ([0254da3](https://github.com/jsvine/pdfplumber/commit/0254da3)) [h/t @idan-david]\n\n## [0.5.21] — 2020-05-27\n### Fixed\n- Fix `Page.extract_table(...)` to return `None` instead of crashing when no table is found ([d64afa8](https://github.com/jsvine/pdfplumber/commit/d64afa8)) [h/t @stucka]\n\n## [0.5.20] — 2020-04-29\n### Fixed\n- Fix `.get_page_image` to prefer paths over streams, when possible ([ab957de](https://github.com/jsvine/pdfplumber/commit/ab957de)) [h/t @ubmarco]\n- Local-fix pdfminer.six's `.resolve_all` to handle tuples and simplify ([85f422d](https://github.com/jsvine/pdfplumber/commit/85f422d))\n\n### Changed\n- Remove support for Python 2 and Python <3.3\n\n## [0.5.19] — 2020-04-16\n### Changed\n- Add `utils.decimalize` performance improvement ([830d117](https://github.com/jsvine/pdfplumber/commit/830d117)) [h/t @ubmarco]\n\n### Fixed\n- Fix un-referenced method when using \"text\" table-finding strategy ([2a0c4a2](https://github.com/jsvine/pdfplumber/commit/2a0c4a2))\n- Add missing object type `rect_edge` to `obj_to_edges()` ([0edc6bf](https://github.com/jsvine/pdfplumber/commit/0edc6bf))\n\n## [0.5.18] — 2020-04-01\n### Changed\n- Allow `rect` and `curve` objects also to be passed to \"explicit_..._lines\" setting when table-finding. 
(And disallow other types of dicts to be passed.)\n\n### Fixed\n- Fix `utils.extract_text` bug introduced in prior version\n\n## [0.5.17] — 2020-04-01\n### Fixed\n- Fix and simplify obj-in-bbox logic (see commit [25672961](https://github.com/jsvine/pdfplumber/commit/25672961))\n- Improve/fix the way `utils.extract_text` handles vertical text (see commit [8a5d858b](https://github.com/jsvine/pdfplumber/commit/8a5d858b)) [h/t @dwalton76]\n- Have `Page.to_image` use bytes stream instead of file path (Issue [#124](https://github.com/jsvine/pdfplumber/issues/124) / PR [#179](https://github.com/jsvine/pdfplumber/pull/179)) [h/t @cheungpat]\n- Fix issue [#176](https://github.com/jsvine/pdfplumber/issues/176), in which `Page.extract_tables` did not pass kwargs to `Table.extract` [h/t @jsfenfen]\n\n## [0.5.16] — 2020-01-12\n### Fixed\n- Prevent custom LAParams from raising exception (Issue [#168](https://github.com/jsvine/pdfplumber/issues/168) / PR [#169](https://github.com/jsvine/pdfplumber/pull/169)) [h/t @frascuchon]\n- Add `six` as explicit dependency (for now)\n\n## [0.5.15] — 2020-01-05\n### Changed\n- Upgrade `pdfminer.six` requirement to `==20200104`\n- Upgrade `pillow` requirement `>=7.0.0`\n- Remove Python 2.7 and 3.4 from `tox` tests\n\n## [0.5.14] — 2019-10-06\n### Fixed\n- Fix sorting bug in `page.extract_table()`\n- Fix support for password-protected PDFs (PR [#138](https://github.com/jsvine/pdfplumber/pull/138))\n\n## [0.5.13] — 2019-08-29\n### Fixed\n- Fixed PDF object resolution for rotation (PR [#136](https://github.com/jsvine/pdfplumber/pull/136))\n\n## [0.5.12] — 2019-04-14\n### Added\n- `cdecimal` support for Python 2\n- Support for password-protected PDFs\n\n## [0.5.11] — 2018-11-13\n### Added\n- Caching for `.decimalize()` method\n\n### Changed\n- Upgrade to `pdfminer.six==20181108`\n- Make whitespace checking more robust (PR [#88](https://github.com/jsvine/pdfplumber/pull/88))\n\n### Fixed\n- Fix issue 
[#75](https://github.com/jsvine/pdfplumber/issues/75) (`.to_image()` custom arguments)\n- Fix issue raised in PR [#77](https://github.com/jsvine/pdfplumber/pull/77) (PDFObjRef resolution), and general class of problems\n- Fix issue [#90](https://github.com/jsvine/pdfplumber/issues/90), and general class of problems, by explicitly typecasting each kind of PDF Object\n\n## [0.5.10] — 2018-08-03\n### Fixed\n- Fix bug in which, when calling get_page_image(...), the alpha channel could make the whole page black out.\n\n## [0.5.9] — 2018-07-10\n### Fixed\n- Fix issue [#67](https://github.com/jsvine/pdfplumber/issues/67), in which bool-type metadata were handled incorrectly\n\n## [0.5.8] — 2018-03-06\n### Fixed\n- Fix issue [#53](https://github.com/jsvine/pdfplumber/issues/53), in which non-decimalize-able (non_)stroking_color properties were raising errors.\n\n## [0.5.7] — 2018-01-20\n### Added\n- `.travis.yml`, but failing on `.to_image()`\n\n### Changed\n- Move from defunct `pycrypto` to `pycryptodome`\n- Update `pdfminer.six` to `20170720`\n\n## [0.5.6] — 2017-11-21\n### Fixed\n- Fix issue [#41](https://github.com/jsvine/pdfplumber/issues/41), in which PDF-object-referenced cropboxes/mediaboxes weren't being fully resolved.\n\n## [0.5.5] — 2017-05-10\n### Added\n- Access to `__version__` from main namespace\n\n### Fixed\n- Fix issue #33, by checking `decode_text`'s argument type\n\n## [0.5.4] — 2017-04-27\n### Fixed\n- Pin `pdfminer.six` to version `20151013` (for now), fixing incompatibility\n\n## [0.5.3] — 2017-02-27\n### Fixed\n- Allow `import pdfplumber` even if ImageMagick not installed.\n\n## [0.5.2] — 2017-02-27\n### Added\n- Access to `curve` points. 
(E.g., `page.curves[0][\"points\"]`.)\n- Ability for `.draw_line` to draw `curve` points.\n\n### Changed\n- Disaggregated \"min_words_vertical\" (default: 3) and \"min_words_horizontal\" (default: 1), removing \"text_word_threshold\".\n- Internally, made `utils.decimalize` a bit more robust; now throws errors on non-decimalizable items.\n- Now explicitly ignoring some (obscure) `pdfminer` object attributes.\n- Raw input for `.draw_line` from a bounding box to `((x, y), (x, y))`, for consistency with `curve[\"points\"]` and with `Pillow`'s underlying method.\n\n### Fixed\n- Fixed typo bug when `.rect_edges` is called before `.edges`\n\n## [0.5.1] — 2017-02-26\n### Added\n- Quick-draw `PageImage` methods: `.draw_vline`, `.draw_vlines`, `.draw_hline`, and `.draw_hlines`.\n- Boolean parameter `keep_blank_chars` for `.extract_words(...)` and `TableFinder` settings.\n\n### Changed\n- Increased default `text_tolerance` and `intersection_tolerance` TableFinder values from 1 to 3.\n\n### Fixed\n- Properly handle conversion of PDFs with transparency to `pillow` images.\n- Properly handle `pandas` DataFrames as inputs to multi-draw commands (e.g., `PageImage.draw_rects(...)`).\n\n## [0.5.0] - 2017-02-25\n### Added\n- Visual debugging features, via `Page.to_image(...)` and `PageImage`. (Introduces `wand` and `pillow` as package requirements.)\n- More powerful options for extracting data from tables. See changes below.\n\n### Changed\n- Entirely overhaul the table-extraction methods. Now based on [Anssi Nurminen's master's thesis](http://dspace.cc.tut.fi/dpub/bitstream/handle/123456789/21520/Nurminen.pdf?sequence=3).\n- Disentangle `.crop` from `.intersects_bbox` and `.within_bbox`.\n- Change default `x_tolerance` and `y_tolerance` for word extraction from `5` to `3`\n\n### Fixed\n- Fix bug stemming from non-decimalized page heights. 
[h/t @jsfenfen]\n\n## [0.4.6] - 2017-01-26\n### Added\n- Provide access to `Page.page_number`\n\n### Changed\n- Use `.page_number` instead of `.page_id` as primary identifier. [h/t @jsfenfen]\n- Change default `x_tolerance` and `y_tolerance` for word extraction from `0` to `5`\n\n### Fixed\n- Provide proper support for rotated pages\n\n## [0.4.5] - 2016-12-09\n### Fixed\n- Fix bug stemming from when metadata includes a PostScript literal. [h/t @boblannon]\n\n\n## [0.4.4] - Mistakenly skipped\n\nWhoops.\n\n## [0.4.3] - 2016-04-12\n### Changed\n- When extracting table cells, use chars' midpoints instead of top-points.\n\n### Fixed\n- Fix find_gutters — should ignore `\" \"` chars\n\n"
  },
  {
    "path": "CITATION.cff",
    "content": "cff-version: 1.2.0\ntitle: pdfplumber\ntype: software\nversion: 0.11.9\ndate-released: \"2026-01-05\"\nauthors:\n  - family-names: \"Singer-Vine\"\n    given-names: \"Jeremy\"\n    email: \"jsvine@gmail.com\"\n  - name: \"The pdfplumber contributors\"\nrepository-code: \"https://github.com/jsvine/pdfplumber\"\nurl: \"https://github.com/jsvine/pdfplumber\"\nlicense: MIT\nabstract: >- \n  Plumb a PDF for detailed information about each char, rectangle,\n  line, et cetera — and easily extract text and tables.\nkeywords:\n  - \"pdf\"\n  - \"pdf parsing\"\n  - \"table extraction\"\n"
  },
  {
    "path": "CONTRIBUTING.md",
    "content": "# Contribution Guidelines\n\nThank you for your interest in `pdfplumber`! Before submitting an issue or filing a pull request, please consult the brief notes and instructions below.\n\n## Creating issues\n\n- If you are __troubleshooting__ a specific PDF and have not identified a clear bug, please [open a discussion](https://github.com/jsvine/pdfplumber/discussions) instead of an issue.\n- Malformed PDFs can often cause problems that cannot be directly fixed in `pdfplumber`. For that reason, please __try repairing__ your PDF using [Ghostscript](https://www.ghostscript.com/) before filing a bug report. To do so, run `gs -o repaired.pdf -sDEVICE=pdfwrite original.pdf`, replacing `original.pdf` with your PDF's actual filename.\n- If your issue relates to __text not being displayed__ correctly, please compare the output to [`pdfminer.six`'s `pdf2txt` command](https://pdfminersix.readthedocs.io/en/latest/tutorial/commandline.html). If you're seeing the same problems there, please consult that repository instead of this one, because `pdfplumber` depends on `pdfminer.six` for text extraction.\n- Please do fill out all requested sections of the __issue template__; doing so will help the maintainers and community respond more efficiently.\n\n## Submitting pull requests\n\n- If you would like to propose a change that is more __complex__ than a simple bug fix, please [first open a discussion](https://github.com/jsvine/pdfplumber/discussions). If you are submitting a __simple__ bug fix, typo correction, et cetera, feel free to open a pull request directly.\n- PRs should be submitted against the __`develop` branch__ only.\n- PRs should contain one or more __tests__ that support the changes. The tests should pass with the new code but fail on prior commits. For guidance, see the existing tests in the `tests/` directory. 
To execute the tests, run `make tests` or `python -m pytest`.\n- Python code in PRs should conform to [`psf/black`](https://black.readthedocs.io/en/stable/), [`isort`](https://pycqa.github.io/isort/index.html), and [`flake8`](https://pypi.org/project/flake8/) __formatting__ guidelines. To automatically reformat your code accordingly, run `make format`. To test the formatting and `flake8` compliance, run `make lint`.\n- Please add yourself to the [list of contributors](https://github.com/jsvine/pdfplumber#acknowledgments--contributors).\n- Please also update the [CHANGELOG.md](https://github.com/jsvine/pdfplumber/blob/develop/CHANGELOG.md).\n"
  },
  {
    "path": "LICENSE.txt",
    "content": "The MIT License (MIT)\n\nCopyright (c) 2015, Jeremy Singer-Vine\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "MANIFEST.in",
    "content": "include LICENSE.txt\ninclude README.md\ninclude requirements.txt\ninclude requirements-dev.txt\ninclude pdfplumber/py.typed\n"
  },
  {
    "path": "Makefile",
    "content": ".PHONY: venv tests check-black check-isort check-flake check-mypy lint format examples build\nVENV ?= .venv\nPYTHON = ${VENV}/bin/python\n\nvenv:\n\tpython3 -m venv ${VENV}\n\t${VENV}/bin/pip install --upgrade pip\n\t${VENV}/bin/pip install -r requirements.txt\n\t${VENV}/bin/pip install -r requirements-dev.txt\n\t${VENV}/bin/pip install -e .\n\ntests:\n\t${PYTHON} -m pytest -n auto\n\t${PYTHON} -m coverage html\n\ncheck-black:\n\t${VENV}/bin/black --check pdfplumber tests\n\ncheck-isort:\n\t${VENV}/bin/isort --profile black --check-only pdfplumber tests\n\ncheck-flake:\n\t${VENV}/bin/flake8 pdfplumber tests\n\ncheck-mypy:\n\t${VENV}/bin/mypy --strict --implicit-reexport pdfplumber\n\nlint: check-flake check-mypy check-black check-isort\n\nformat:\n\t${VENV}/bin/black pdfplumber tests\n\t${VENV}/bin/isort --profile black pdfplumber tests\n\nexamples:\n\t${VENV}/bin/nbexec examples/notebooks\n\nbuild:\n\t${PYTHON} -m build\n"
  },
  {
    "path": "README.md",
    "content": "# pdfplumber\n\n[![Version](https://img.shields.io/pypi/v/pdfplumber.svg)](https://pypi.python.org/pypi/pdfplumber) ![Tests](https://github.com/jsvine/pdfplumber/workflows/Tests/badge.svg) [![Code coverage](https://codecov.io/gh/jsvine/pdfplumber/branch/stable/graph/badge.svg)](https://codecov.io/gh/jsvine/pdfplumber/branch/stable) [![Support Python versions](https://img.shields.io/pypi/pyversions/pdfplumber.svg)](https://pypi.python.org/pypi/pdfplumber)\n\nPlumb a PDF for detailed information about each text character, rectangle, and line. Plus: Table extraction and visual debugging.\n\nWorks best on machine-generated, rather than scanned, PDFs. Built on [`pdfminer.six`](https://github.com/goulu/pdfminer). \n\nCurrently [tested](tests/) on [Python 3.10, 3.11, 3.12, 3.13, 3.14](.github/workflows/tests.yml).\n\nTranslations of this document are available in: [Chinese (by @hbh112233abc)](https://github.com/hbh112233abc/pdfplumber/blob/stable/README-CN.md).\n\n__To report a bug__ or request a feature, please [file an issue](https://github.com/jsvine/pdfplumber/issues/new/choose). 
__To ask a question__ or request assistance with a specific PDF, please [use the discussions forum](https://github.com/jsvine/pdfplumber/discussions).\n\n## Table of Contents\n\n- [Installation](#installation)\n- [Command line interface](#command-line-interface)\n- [Python library](#python-library)\n- [Visual debugging](#visual-debugging)\n- [Extracting text](#extracting-text)\n- [Extracting tables](#extracting-tables)\n- [Extracting form values](#extracting-form-values)\n- [Demonstrations](#demonstrations)\n- [Comparison to other libraries](#comparison-to-other-libraries)\n- [Acknowledgments / Contributors](#acknowledgments--contributors)\n- [Contributing](#contributing)\n\n## Installation\n\n```sh\npip install pdfplumber\n```\n\n## Command line interface\n\n### Basic example\n\n```sh\ncurl \"https://raw.githubusercontent.com/jsvine/pdfplumber/stable/examples/pdfs/background-checks.pdf\" > background-checks.pdf\npdfplumber background-checks.pdf > background-checks.csv\n```\n\nThe output will be a CSV containing info about every character, line, and rectangle in the PDF.\n\n### Options\n\n| Argument | Description |\n|----------|-------------|\n|`--format [format]`| `csv`, `json`, or `text`. The `csv` and `json` formats return information about each object. Of those two, the `json` format returns more information; it includes PDF-level and page-level metadata, plus dictionary-nested attributes. The `text` option returns a plain-text representation of the PDF, using `Page.extract_text(layout=True)`.|\n|`--pages [list of pages]`| A space-delimited, `1`-indexed list of pages or hyphenated page ranges. E.g., `1 11-15`, which would return data for pages 1, 11, 12, 13, 14, and 15.|\n|`--types [list of object types to extract]`| Choices are `char`, `rect`, `line`, `curve`, `image`, `annot`, et cetera. 
Defaults to all available.|\n|`--laparams`| A JSON-formatted string (e.g., `'{\"detect_vertical\": true}'`) to pass to `pdfplumber.open(..., laparams=...)`.|\n|`--precision [integer]`| The number of decimal places to round floating-point numbers. Defaults to no rounding.|\n\n## Python library\n\n### Basic example\n\n```python\nimport pdfplumber\n\nwith pdfplumber.open(\"path/to/file.pdf\") as pdf:\n    first_page = pdf.pages[0]\n    print(first_page.chars[0])\n```\n\n### Loading a PDF\n\nTo start working with a PDF, call `pdfplumber.open(x)`, where `x` can be a:\n\n- path to your PDF file\n- file object, loaded as bytes\n- file-like object, loaded as bytes\n\nThe `open` method returns an instance of the `pdfplumber.PDF` class.\n\nTo load a password-protected PDF, pass the `password` keyword argument, e.g., `pdfplumber.open(\"file.pdf\", password = \"test\")`.\n\nTo set layout analysis parameters to `pdfminer.six`'s layout engine, pass the `laparams` keyword argument, e.g., `pdfplumber.open(\"file.pdf\", laparams = { \"line_overlap\": 0.7 })`.\n\nTo [pre-normalize Unicode text](https://unicode.org/reports/tr15/), pass `unicode_norm=...`, where `...` is one of the [four Unicode normalization forms](https://unicode.org/reports/tr15/#Normalization_Forms_Table): `\"NFC\"`, `\"NFD\"`, `\"NFKC\"`, or `\"NFKD\"`.\n\nInvalid metadata values are treated as a warning by default. If that is not intended, pass `strict_metadata=True` to the `open` method and `pdfplumber.open` will raise an exception if it is unable to parse the metadata.\n\n### The `pdfplumber.PDF` class\n\nThe top-level `pdfplumber.PDF` class represents a single PDF and has two main properties:\n\n| Property | Description |\n|----------|-------------|\n|`.metadata`| A dictionary of metadata key/value pairs, drawn from the PDF's `Info` trailers. Typically includes \"CreationDate,\" \"ModDate,\" \"Producer,\" et cetera.|\n|`.pages`| A list containing one `pdfplumber.Page` instance per page loaded.|\n\n... 
and also has the following method:\n\n| Method | Description |\n|--------|-------------|\n|`.close()`| Calling this method calls `Page.close()` on each page, and also closes the file stream (except in cases when the stream is external, i.e., already opened and passed directly to `pdfplumber`). |\n\n### The `pdfplumber.Page` class\n\nThe `pdfplumber.Page` class is at the core of `pdfplumber`. Most things you'll do with `pdfplumber` will revolve around this class. It has these main properties:\n\n| Property | Description |\n|----------|-------------|\n|`.page_number`| The sequential page number, starting with `1` for the first page, `2` for the second, and so on.|\n|`.width`| The page's width.|\n|`.height`| The page's height.|\n|`.objects` / `.chars` / `.lines` / `.rects` / `.curves` / `.images`| Each of these properties is a list, and each list contains one dictionary for each such object embedded on the page. For more detail, see \"[Objects](#objects)\" below.|\n\n... and these main methods:\n\n| Method | Description |\n|--------|-------------|\n|`.crop(bounding_box, relative=False, strict=True)`| Returns a version of the page cropped to the bounding box, which should be expressed as 4-tuple with the values `(x0, top, x1, bottom)`. Cropped pages retain objects that fall at least partly within the bounding box. If an object falls only partly within the box, its dimensions are sliced to fit the bounding box. If `relative=True`, the bounding box is calculated as an offset from the top-left of the page's bounding box, rather than an absolute positioning. (See [Issue #245](https://github.com/jsvine/pdfplumber/issues/245) for a visual example and explanation.) 
When `strict=True` (the default), the crop's bounding box must fall entirely within the page's bounding box.|\n|`.within_bbox(bounding_box, relative=False, strict=True)`| Similar to `.crop`, but only retains objects that fall *entirely within* the bounding box.|\n|`.outside_bbox(bounding_box, relative=False, strict=True)`| Similar to `.crop` and `.within_bbox`, but only retains objects that fall *entirely outside* the bounding box.|\n|`.filter(test_function)`| Returns a version of the page with only the `.objects` for which `test_function(obj)` returns `True`.|\n\n... and also has the following method:\n\n| Method | Description |\n|--------|-------------|\n|`.close()`| By default, `Page` objects cache their layout and object information to avoid having to reprocess it. When parsing large PDFs, however, these cached properties can require a lot of memory. You can use this method to flush the cache and release the memory.|\n\nAdditional methods are described in the sections below:\n\n- [Visual debugging](#visual-debugging)\n- [Extracting text](#extracting-text)\n- [Extracting tables](#extracting-tables)\n\n### Objects\n\nEach instance of `pdfplumber.PDF` and `pdfplumber.Page` provides access to several types of PDF objects, all derived from [`pdfminer.six`](https://github.com/pdfminer/pdfminer.six/) PDF parsing. The following properties each return a Python list of the matching objects:\n\n- `.chars`, each representing a single text character.\n- `.lines`, each representing a single 1-dimensional line.\n- `.rects`, each representing a single 2-dimensional rectangle.\n- `.curves`, each representing any series of connected points that `pdfminer.six` does not recognize as a line or rectangle.\n- `.images`, each representing an image.\n- `.annots`, each representing a single PDF annotation (cf. 
Section 8.4 of the [official PDF specification](https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf) for details)\n- `.hyperlinks`, each representing a single PDF annotation of the subtype `Link` and having an `URI` action attribute\n\nEach object is represented as a simple Python `dict`, with the following properties:\n\n#### `char` properties\n\n| Property | Description |\n|----------|-------------|\n|`page_number`| Page number on which this character was found.|\n|`text`| E.g., \"z\", or \"Z\" or \" \".|\n|`fontname`| Name of the character's font face.|\n|`size`| Font size.|\n|`adv`| Equal to text width * the font size * scaling factor.|\n|`upright`| Whether the character is upright.|\n|`height`| Height of the character.|\n|`width`| Width of the character.|\n|`x0`| Distance of left side of character from left side of page.|\n|`x1`| Distance of right side of character from left side of page.|\n|`y0`| Distance of bottom of character from bottom of page.|\n|`y1`| Distance of top of character from bottom of page.|\n|`top`| Distance of top of character from top of page.|\n|`bottom`| Distance of bottom of the character from top of page.|\n|`doctop`| Distance of top of character from top of document.|\n|`matrix`| The \"current transformation matrix\" for this character. (See below for details.)|\n|`mcid`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section ID for this character if any (otherwise `None`). *Experimental attribute.*|\n|`tag`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section tag for this character if any (otherwise `None`). *Experimental attribute.*|\n|`ncs`|TKTK|\n|`stroking_pattern`|TKTK|\n|`non_stroking_pattern`|TKTK|\n|`stroking_color`|The color of the character's outline (i.e., stroke). See [docs/colors.md](docs/colors.md) for details.|\n|`non_stroking_color`|The character's interior color. 
See [docs/colors.md](docs/colors.md) for details.|\n|`object_type`| \"char\"|\n\n__Note__: A character’s `matrix` property represents the “current transformation matrix,” as described in Section 4.2.2 of the [PDF Reference](https://ghostscript.com/~robin/pdf_reference17.pdf) (6th Ed.). The matrix controls the character’s scale, skew, and positional translation. Rotation is a combination of scale and skew, but in most cases can be considered equal to the x-axis skew. The `pdfplumber.ctm` submodule defines a class, `CTM`, that assists with these calculations. For instance:\n\n```python\nfrom pdfplumber.ctm import CTM\nmy_char = pdf.pages[0].chars[3]\nmy_char_ctm = CTM(*my_char[\"matrix\"])\nmy_char_rotation = my_char_ctm.skew_x\n```\n\n#### `line` properties\n\n| Property | Description |\n|----------|-------------|\n|`page_number`| Page number on which this line was found.|\n|`height`| Height of line.|\n|`width`| Width of line.|\n|`x0`| Distance of left-side extremity from left side of page.|\n|`x1`| Distance of right-side extremity from left side of page.|\n|`y0`| Distance of bottom extremity from bottom of page.|\n|`y1`| Distance of top extremity from bottom of page.|\n|`top`| Distance of top of line from top of page.|\n|`bottom`| Distance of bottom of the line from top of page.|\n|`doctop`| Distance of top of line from top of document.|\n|`linewidth`| Thickness of line.|\n|`stroking_color`|The color of the line. See [docs/colors.md](docs/colors.md) for details.|\n|`non_stroking_color`|The non-stroking color specified for the line’s path. See [docs/colors.md](docs/colors.md) for details.|\n|`mcid`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section ID for this line if any (otherwise `None`). *Experimental attribute.*|\n|`tag`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section tag for this line if any (otherwise `None`). 
*Experimental attribute.*|\n|`object_type`| \"line\"|\n\n#### `rect` properties\n\n| Property | Description |\n|----------|-------------|\n|`page_number`| Page number on which this rectangle was found.|\n|`height`| Height of rectangle.|\n|`width`| Width of rectangle.|\n|`x0`| Distance of left side of rectangle from left side of page.|\n|`x1`| Distance of right side of rectangle from left side of page.|\n|`y0`| Distance of bottom of rectangle from bottom of page.|\n|`y1`| Distance of top of rectangle from bottom of page.|\n|`top`| Distance of top of rectangle from top of page.|\n|`bottom`| Distance of bottom of the rectangle from top of page.|\n|`doctop`| Distance of top of rectangle from top of document.|\n|`linewidth`| Thickness of line.|\n|`stroking_color`|The color of the rectangle's outline. See [docs/colors.md](docs/colors.md) for details.|\n|`non_stroking_color`|The rectangle’s fill color. See [docs/colors.md](docs/colors.md) for details.|\n|`mcid`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section ID for this rect if any (otherwise `None`). *Experimental attribute.*|\n|`tag`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section tag for this rect if any (otherwise `None`). 
*Experimental attribute.*|\n|`object_type`| \"rect\"|\n\n#### `curve` properties\n\n| Property | Description |\n|----------|-------------|\n|`page_number`| Page number on which this curve was found.|\n|`pts`| A list of `(x, top)` tuples indicating the *points on the curve*.|\n|`path`| A list of `(cmd, *(x, top))` tuples *describing the full path description*, including (for example) control points used in Bezier curves.|\n|`height`| Height of curve's bounding box.|\n|`width`| Width of curve's bounding box.|\n|`x0`| Distance of curve's left-most point from left side of page.|\n|`x1`| Distance of curve's right-most point from left side of the page.|\n|`y0`| Distance of curve's lowest point from bottom of page.|\n|`y1`| Distance of curve's highest point from bottom of page.|\n|`top`| Distance of curve's highest point from top of page.|\n|`bottom`| Distance of curve's lowest point from top of page.|\n|`doctop`| Distance of curve's highest point from top of document.|\n|`linewidth`| Thickness of line.|\n|`fill`| Whether the shape defined by the curve's path is filled.|\n|`stroking_color`|The color of the curve's outline. See [docs/colors.md](docs/colors.md) for details.|\n|`non_stroking_color`|The curve’s fill color. See [docs/colors.md](docs/colors.md) for details.|\n|`dash`|A `([dash_array], dash_phase)` tuple describing the curve's dash style. See [Table 4.6 of the PDF specification](https://ghostscript.com/~robin/pdf_reference17.pdf#page=218) for details.|\n|`mcid`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section ID for this curve if any (otherwise `None`). *Experimental attribute.*|\n|`tag`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section tag for this curve if any (otherwise `None`). 
*Experimental attribute.*|\n|`object_type`| \"curve\"|\n\n#### Derived properties\n\nAdditionally, both `pdfplumber.PDF` and `pdfplumber.Page` provide access to several derived lists of objects: `.rect_edges` (which decomposes each rectangle into its four lines), `.curve_edges` (which does the same for `curve` objects), and `.edges` (which combines `.rect_edges`, `.curve_edges`, and `.lines`). \n\n#### `image` properties\n\n*Note: Although the positioning and characteristics of `image` objects are available via `pdfplumber`, this library does not provide direct support for reconstructing image content. For that, please see [this suggestion](https://github.com/jsvine/pdfplumber/discussions/496#discussioncomment-1259772).*\n\n| Property | Description |\n|----------|-------------|\n|`page_number`| Page number on which the image was found.|\n|`height`| Height of the image.|\n|`width`| Width of the image.|\n|`x0`| Distance of left side of the image from left side of page.|\n|`x1`| Distance of right side of the image from left side of page.|\n|`y0`| Distance of bottom of the image from bottom of page.|\n|`y1`| Distance of top of the image from bottom of page.|\n|`top`| Distance of top of the image from top of page.|\n|`bottom`| Distance of bottom of the image from top of page.|\n|`doctop`| Distance of top of the image from top of document.|\n|`srcsize`| The image's original dimensions, as a `(width, height)` tuple.|\n|`colorspace`| Color domain of the image (e.g., RGB).|\n|`bits`| The number of bits per color component; e.g., 8 corresponds to 256 possible values for each color component (R, G, and B in an RGB color space).|\n|`stream`| Pixel values of the image, as a `pdfminer.pdftypes.PDFStream` object.|\n|`imagemask`| A nullable boolean; if `True`, \"specifies that the image data is to be used as a stencil mask for painting in the current color.\"|\n|`name`| \"The name by which this image XObject is referenced in the XObject subdictionary of the current resource 
dictionary.\" [🔗](https://ghostscript.com/~robin/pdf_reference17.pdf#page=340) |\n|`mcid`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section ID for this image if any (otherwise `None`). *Experimental attribute.*|\n|`tag`| The [marked content](https://ghostscript.com/~robin/pdf_reference17.pdf#page=850) section tag for this image if any (otherwise `None`). *Experimental attribute.*|\n|`object_type`| \"image\"|\n\n### Obtaining higher-level layout objects via `pdfminer.six`\n\nIf you pass the `pdfminer.six`-handling `laparams` parameter to `pdfplumber.open(...)`, then each page's `.objects` dictionary will also contain `pdfminer.six`'s higher-level layout objects, such as `\"textboxhorizontal\"`.\n\n\n## Visual debugging\n\n`pdfplumber`'s visual debugging tools can be helpful in understanding the structure of a PDF and the objects that have been extracted from it.\n\n\n### Creating a `PageImage` with `.to_image()`\n\nTo turn any page (including cropped pages) into a `PageImage` object, call `my_page.to_image()`. You can optionally pass *one* of the following keyword arguments:\n\n- `resolution`: The desired number of pixels per inch. Default: `72`. Type: `int`.\n- `width`: The desired image width in pixels. Default: unset, determined by `resolution`. Type: `int`.\n- `height`: The desired image height in pixels. Default: unset, determined by `resolution`. Type: `int`.\n- `antialias`: Whether to use antialiasing when creating the image. Setting to `True` creates images with less-jagged text and graphics, but with larger file sizes. Default: `False`. Type: `bool`.\n- `force_mediabox`: Use the page's `.mediabox` dimensions, rather than the `.cropbox` dimensions. Default: `False`. Type: `bool`.\n\nFor instance:\n\n```python\nim = my_pdf.pages[0].to_image(resolution=150)\n```\n\nFrom a script or REPL, `im.show()` will open the image in your local image viewer. 
But `PageImage` objects also play nicely with Jupyter notebooks; they automatically render as cell outputs. For example:\n\n![Visual debugging in Jupyter](examples/screenshots/visual-debugging-in-jupyter.png \"Visual debugging in Jupyter\")\n\n*Note*: `.to_image(...)` works as expected with `Page.crop(...)`/`CroppedPage` instances, but is unable to incorporate changes made via `Page.filter(...)`/`FilteredPage` instances.\n\n\n### Basic `PageImage` methods\n\n| Method | Description |\n|--------|-------------|\n|`im.reset()`| Clears anything you've drawn so far.|\n|`im.copy()`| Copies the image to a new `PageImage` object.|\n|`im.show()`| Opens the image in your local image viewer.|\n|`im.save(path_or_fileobject, format=\"PNG\", quantize=True, colors=256, bits=8)`| Saves the annotated image as a PNG file. The default arguments quantize the image to a palette of 256 colors, saving the PNG with 8-bit color depth. You can disable quantization by passing `quantize=False` or adjust the size of the color palette by passing `colors=N`.|\n\n### Drawing methods\n\nYou can pass explicit coordinates or any `pdfplumber` PDF object (e.g., char, line, rect) to these methods.\n\n| Single-object method | Bulk method | Description |\n|----------------------|-------------|-------------|\n|`im.draw_line(line, stroke={color}, stroke_width=1)`| `im.draw_lines(list_of_lines, **kwargs)`| Draws a line from a `line`, `curve`, or a 2-tuple of 2-tuples (e.g., `((x, y), (x, y))`).|\n|`im.draw_vline(location, stroke={color}, stroke_width=1)`| `im.draw_vlines(list_of_locations, **kwargs)`| Draws a vertical line at the x-coordinate indicated by `location`.|\n|`im.draw_hline(location, stroke={color}, stroke_width=1)`| `im.draw_hlines(list_of_locations, **kwargs)`| Draws a horizontal line at the y-coordinate indicated by `location`.|\n|`im.draw_rect(bbox_or_obj, fill={color}, stroke={color}, stroke_width=1)`| `im.draw_rects(list_of_rects, **kwargs)`| Draws a rectangle from a `rect`, `char`, etc., or 
4-tuple bounding box.|\n|`im.draw_circle(center_or_obj, radius=5, fill={color}, stroke={color})`| `im.draw_circles(list_of_circles, **kwargs)`| Draws a circle at `(x, y)` coordinate or at the center of a `char`, `rect`, etc.|\n\nNote: The methods above are built on Pillow's [`ImageDraw` methods](http://pillow.readthedocs.io/en/latest/reference/ImageDraw.html), but the parameters have been tweaked for consistency with SVG's `fill`/`stroke`/`stroke_width` nomenclature.\n\n### Visually debugging the table-finder\n\n`im.debug_tablefinder(table_settings={})` will return a version of the PageImage with the detected lines (in red), intersections (circles), and tables (light blue) overlaid.\n\n## Extracting text\n\n`pdfplumber` can extract text from any given page (including cropped and derived pages). It can also attempt to preserve the layout of that text, as well as to identify the coordinates of words and search queries. `Page` objects can call the following text-extraction methods:\n\n\n| Method | Description |\n|--------|-------------|\n|`.extract_text(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, layout=False, x_density=7.25, y_density=13, line_dir_render=None, char_dir_render=None, **kwargs)`| Collates all of the page's character objects into a single string.<ul><li><p>When `layout=False`: Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character[\"size\"]`.) Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.</p></li><li><p>When `layout=True` (*experimental feature*): Attempts to mimic the structural layout of the text on the page(s), using `x_density` and `y_density` to determine the minimum number of characters/newlines per \"point,\" the PDF unit of measurement. 
Passing `line_dir_render=\"ttb\"/\"btt\"/\"ltr\"/\"rtl\"` and/or `char_dir_render=\"ttb\"/\"btt\"/\"ltr\"/\"rtl\"` will output the lines/characters in a different direction than the default. All remaining `**kwargs` are passed to `.extract_words(...)` (see below), the first step in calculating the layout.</p></li></ul>|\n|`.extract_text_simple(x_tolerance=3, y_tolerance=3)`| A slightly faster but less flexible version of `.extract_text(...)`, using a simpler logic.|\n|`.extract_words(x_tolerance=3, x_tolerance_ratio=None, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, line_dir=\"ttb\", char_dir=\"ltr\", line_dir_rotated=\"ttb\", char_dir_rotated=\"ltr\", extra_attrs=[], split_at_punctuation=False, expand_ligatures=True, return_chars=False)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for \"upright\" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. (If `x_tolerance_ratio` is not `None`, the extractor uses a dynamic `x_tolerance` equal to `x_tolerance_ratio * previous_character[\"size\"]`.) A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) 
The arguments `line_dir` and `char_dir` tell this method the direction in which lines/characters are expected to be read; valid options are \"ttb\" (top-to-bottom), \"btt\" (bottom-to-top), \"ltr\" (left-to-right), and \"rtl\" (right-to-left). The `line_dir_rotated` and `char_dir_rotated` arguments are similar, but for text that has been rotated. Passing a list of `extra_attrs` (e.g., `[\"fontname\", \"size\"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation characters specified by `string.punctuation`; or you can specify the list of separating punctuation by passing a string, e.g., <code>split_at_punctuation='!\"&\\'()*+,.:;<=>?@[\\]^\\`\\{\\|\\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`). Passing `return_chars=True` will add, to each word dictionary, a list of its constituent characters, as a list in the `\"chars\"` field.|\n|`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`|*Experimental feature* that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout = True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.|\n|`.search(pattern, regex=True, case=True, main_group=0, return_groups=True, return_chars=True, layout=False, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. 
For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. Setting `main_group` restricts the results to a specific regex group within the `pattern` (default of `0` means the entire match). Setting `return_groups` and/or `return_chars` to `False` will exclude the lists of the matched regex groups and/or characters from being added (as `\"groups\"` and `\"chars\"` to the return dicts). The `layout` parameter operates as it does for `.extract_text(...)`. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`. __Note__: Zero-width and all-whitespace matches are discarded, because they (generally) have no explicit position on the page. |\n|`.dedupe_chars(tolerance=1, extra_attrs=(\"fontname\", \"size\"))`| Returns a version of the page with duplicate chars — those sharing the same text, positioning (within `tolerance` x/y), and `extra_attrs` as other characters — removed. (See [Issue #71](https://github.com/jsvine/pdfplumber/issues/71) to understand the motivation.)|\n\n## Extracting tables\n\n`pdfplumber`'s approach to table detection borrows heavily from [Anssi Nurminen's master's thesis](https://trepo.tuni.fi/bitstream/handle/123456789/21520/Nurminen.pdf?sequence=3), and is inspired by [Tabula](https://github.com/tabulapdf/tabula-extractor/issues/16). It works like this:\n\n1. For any given PDF page, find the lines that are (a) explicitly defined and/or (b) implied by the alignment of words on the page.\n2. Merge overlapping, or nearly-overlapping, lines.\n3. Find the intersections of all those lines.\n4. 
Find the most granular set of rectangles (i.e., cells) that use these intersections as their vertices.\n5. Group contiguous cells into tables. \n\n### Table-extraction methods\n\n`pdfplumber.Page` objects can call the following table methods:\n\n| Method | Description |\n|--------|-------------|\n|`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, `.columns`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.|\n|`.find_table(table_settings={})`|Similar to `.find_tables(...)`, but returns the *largest* table on the page, as a `Table` object. If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.|\n|`.extract_tables(table_settings={})`|Returns the text extracted from *all* tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`.|\n|`.extract_table(table_settings={})`|Returns the text extracted from the *largest* table on the page (see `.find_table(...)` above), represented as a list of lists, with the structure `row -> cell`.|\n|`.debug_tablefinder(table_settings={})`|Returns an instance of the `TableFinder` class, with access to the `.edges`, `.intersections`, `.cells`, and `.tables` properties.|\n\nFor example:\n\n```python\npdf = pdfplumber.open(\"path/to/my.pdf\")\npage = pdf.pages[0]\npage.extract_table()\n```\n\n[Click here for a more detailed example.](examples/notebooks/extract-table-ca-warn-report.ipynb)\n\n### Table-extraction settings\n\nBy default, `extract_tables` uses the page's vertical and horizontal lines (or rectangle edges) as cell-separators. But the method is highly customizable via the `table_settings` argument. 
The possible settings, and their defaults:\n\n```python\n{\n    \"vertical_strategy\": \"lines\", \n    \"horizontal_strategy\": \"lines\",\n    \"explicit_vertical_lines\": [],\n    \"explicit_horizontal_lines\": [],\n    \"snap_tolerance\": 3,\n    \"snap_x_tolerance\": 3,\n    \"snap_y_tolerance\": 3,\n    \"join_tolerance\": 3,\n    \"join_x_tolerance\": 3,\n    \"join_y_tolerance\": 3,\n    \"edge_min_length\": 3,\n    \"edge_min_length_prefilter\": 1,\n    \"min_words_vertical\": 3,\n    \"min_words_horizontal\": 1,\n    \"intersection_tolerance\": 3,\n    \"intersection_x_tolerance\": 3,\n    \"intersection_y_tolerance\": 3,\n    \"text_tolerance\": 3,\n    \"text_x_tolerance\": 3,\n    \"text_y_tolerance\": 3,\n    \"text_*\": …, # See below\n}\n```\n\n| Setting | Description |\n|---------|-------------|\n|`\"vertical_strategy\"`| Either `\"lines\"`, `\"lines_strict\"`, `\"text\"`, or `\"explicit\"`. See explanation below.|\n|`\"horizontal_strategy\"`| Either `\"lines\"`, `\"lines_strict\"`, `\"text\"`, or `\"explicit\"`. See explanation below.|\n|`\"explicit_vertical_lines\"`| A list of vertical lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. Items in the list should be either numbers — indicating the `x` coordinate of a line the full height of the page — or `line`/`rect`/`curve` objects.|\n|`\"explicit_horizontal_lines\"`| A list of horizontal lines that explicitly demarcate cells in the table. Can be used in combination with any of the strategies above. 
Items in the list should be either numbers — indicating the `y` coordinate of a line the full width of the page — or `line`/`rect`/`curve` objects.|\n|`\"snap_tolerance\"`, `\"snap_x_tolerance\"`, `\"snap_y_tolerance\"`| Parallel lines within `snap_tolerance` points will be \"snapped\" to the same horizontal or vertical position.|\n|`\"join_tolerance\"`, `\"join_x_tolerance\"`, `\"join_y_tolerance\"`| Line segments on the same infinite line, and whose ends are within `join_tolerance` of one another, will be \"joined\" into a single line segment.|\n|`\"edge_min_length\"`| Edges shorter than `edge_min_length` will be discarded before attempting to reconstruct the table.|\n|`\"edge_min_length_prefilter\"`| Edges shorter than `edge_min_length_prefilter` will be discarded during initial edge filtering from the page. Lowering this value (e.g., to `0.5`) can help capture small dashed lines that might otherwise be filtered out.|\n|`\"min_words_vertical\"`| When using `\"vertical_strategy\": \"text\"`, at least `min_words_vertical` words must share the same alignment.|\n|`\"min_words_horizontal\"`| When using `\"horizontal_strategy\": \"text\"`, at least `min_words_horizontal` words must share the same alignment.|\n|`\"intersection_tolerance\"`, `\"intersection_x_tolerance\"`, `\"intersection_y_tolerance\"`| When combining edges into cells, orthogonal edges must be within `intersection_tolerance` points to be considered intersecting.|\n|`\"text_*\"`| All settings prefixed with `text_` are then used when extracting text from each discovered table. All possible arguments to `Page.extract_text(...)` are also valid here.|\n|`\"text_x_tolerance\"`, `\"text_y_tolerance\"`| These `text_`-prefixed settings *also* apply to the table-identification algorithm when the `text` strategy is used. 
I.e., when that algorithm searches for words, it will expect the individual letters in each word to be no more than `text_x_tolerance`/`text_y_tolerance` points apart.|\n\n### Table-extraction strategies\n\nBoth `vertical_strategy` and `horizontal_strategy` accept the following options:\n\n| Strategy | Description | \n|----------|-------------|\n| `\"lines\"` | Use the page's graphical lines — including the sides of rectangle objects — as the borders of potential table-cells. |\n| `\"lines_strict\"` | Use the page's graphical lines — but *not* the sides of rectangle objects — as the borders of potential table-cells. |\n| `\"text\"` | For `vertical_strategy`: Deduce the (imaginary) lines that connect the left, right, or center of words on the page, and use those lines as the borders of potential table-cells. For `horizontal_strategy`, the same but using the tops of words. |\n| `\"explicit\"` | Only use the lines explicitly defined in `explicit_vertical_lines` / `explicit_horizontal_lines`. |\n\n### Notes\n\n- Often it's helpful to crop a page — `Page.crop(bounding_box)` — before trying to extract the table.\n\n- Table extraction for `pdfplumber` was radically redesigned for `v0.5.0`, and introduced breaking changes.\n\n\n## Extracting form values\n\nSometimes PDF files can contain forms that include inputs that people can fill out and save. While values in form fields appear like other text in a PDF file, form data is handled differently. 
If you want the gory details, see page 671 of this [specification](https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf).\n\n`pdfplumber` doesn't have an interface for working with form data, but you can access it using `pdfplumber`'s wrappers around `pdfminer`.\n\nFor example, this snippet will retrieve form field names and values and store them in a list.\n\n```python\nimport pdfplumber\nfrom pdfplumber.utils.pdfinternals import resolve_and_decode, resolve\n\npdf = pdfplumber.open(\"document_with_form.pdf\")\n\ndef parse_field_helper(form_data, field, prefix=None):\n    \"\"\" appends any PDF AcroForm field/value pairs in `field` to provided `form_data` list\n\n        if `field` has child fields, those will be parsed recursively.\n    \"\"\"\n    resolved_field = field.resolve()\n    field_name = '.'.join(filter(lambda x: x, [prefix, resolve_and_decode(resolved_field.get(\"T\"))]))\n    if \"Kids\" in resolved_field:\n        for kid_field in resolved_field[\"Kids\"]:\n            parse_field_helper(form_data, kid_field, prefix=field_name)\n    if \"T\" in resolved_field or \"TU\" in resolved_field:\n        # \"T\" is a field-name, but it's sometimes absent.\n        # \"TU\" is the \"alternate field name\" and is often more human-readable\n        # your PDF may have one, the other, or both.\n        alternate_field_name = resolve_and_decode(resolved_field.get(\"TU\")) if resolved_field.get(\"TU\") else None\n        field_value = resolve_and_decode(resolved_field[\"V\"]) if 'V' in resolved_field else None\n        form_data.append([field_name, alternate_field_name, field_value])\n\n\nform_data = []\nfields = resolve(resolve(pdf.doc.catalog[\"AcroForm\"])[\"Fields\"])\nfor field in fields:\n    parse_field_helper(form_data, field)\n```\n\nOnce you run this script, `form_data` is a list containing a three-element list (field name, alternate field name, value) for each form element. 
For instance, a PDF form with a city and state field might look like this.\n```\n[\n ['STATE.0', 'enter STATE', 'CA'],\n ['section 2  accident infoRmation.1.0',\n  'enter city of accident',\n  'SAN FRANCISCO']\n]\n```\n\n*Thanks to [@jeremybmerrill](https://github.com/jeremybmerrill) for helping to maintain the form-parsing code above.*\n\n## Demonstrations\n\n- [Using `extract_table` on a California Worker Adjustment and Retraining Notification (WARN) report](examples/notebooks/extract-table-ca-warn-report.ipynb). Demonstrates basic visual debugging and table extraction.\n- [Using `extract_table` on the FBI's National Instant Criminal Background Check System PDFs](examples/notebooks/extract-table-nics.ipynb). Demonstrates how to use visual debugging to find optimal table extraction settings. Also demonstrates `Page.crop(...)` and `Page.extract_text(...).`\n- [Inspecting and visualizing `curve` objects](examples/notebooks/ag-energy-roundup-curves.ipynb).\n- [Extracting fixed-width data from a San Jose PD firearm search report](examples/notebooks/san-jose-pd-firearm-report.ipynb), an example of using `Page.extract_text(...)`.\n\n## Comparison to other libraries\n\nSeveral other Python libraries help users to extract information from PDFs. As a broad overview, `pdfplumber` distinguishes itself from other PDF processing libraries by combining these features:\n\n- Easy access to detailed information about each PDF object\n- Higher-level, customizable methods for extracting text and tables\n- Tightly integrated visual debugging\n- Other useful utility functions, such as filtering objects via a crop-box\n\nIt's also helpful to know what features `pdfplumber` does __not__ provide:\n\n- PDF *generation*\n- PDF *modification*\n- Optical character recognition (OCR)\n- Strong support for extracting tables from OCR'ed documents\n\n### Specific comparisons\n\n- [`pdfminer.six`](https://github.com/pdfminer/pdfminer.six) provides the foundation for `pdfplumber`. 
It primarily focuses on parsing PDFs, analyzing PDF layouts and object positioning, and extracting text. It does not provide tools for table extraction or visual debugging. License: [MIT](https://github.com/pdfminer/pdfminer.six?tab=MIT-1-ov-file).\n\n- [`PyPDF2`](https://github.com/mstamy2/PyPDF2) is a pure-Python library \"capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files.\" It can extract page text, but does not provide easy access to shape objects (rectangles, lines, etc.), table extraction, or visual debugging tools. License: [BSD](https://github.com/py-pdf/pypdf?tab=License-1-ov-file#readme).\n\n- [`pymupdf`](https://pymupdf.readthedocs.io/) is substantially faster than `pdfminer.six` (and thus also `pdfplumber`) and can generate and modify PDFs, but the library requires installation of non-Python software (MuPDF). It also does not enable easy access to shape objects (rectangles, lines, etc.), and does not provide table-extraction or visual debugging tools. License: [AGPL](https://pymupdf.readthedocs.io/en/latest/about.html#license-and-copyright).\n\n- [`camelot`](https://github.com/camelot-dev/camelot), [`tabula-py`](https://github.com/chezou/tabula-py), and [`pdftables`](https://github.com/drj11/pdftables) all focus primarily on extracting tables. In some cases, they may be better suited to the particular tables you are trying to extract. 
License: [MIT](https://github.com/camelot-dev/camelot?tab=MIT-1-ov-file#readme) (`camelot`), [MIT](https://github.com/chezou/tabula-py?tab=MIT-1-ov-file#readme) (`tabula-py`), [BSD](https://github.com/drj11/pdftables?tab=BSD-2-Clause-1-ov-file#readme) (`pdftables`).\n\n\n## Acknowledgments / Contributors\n\nMany thanks to the following users who've contributed ideas, features, and fixes:\n\n- [Jacob Fenton](https://github.com/jsfenfen)\n- [Dan Nguyen](https://github.com/dannguyen)\n- [Jeff Barrera](https://github.com/jeffbarrera)\n- [Bob Lannon](https://github.com/boblannon)\n- [Dustin Tindall](https://github.com/dustindall)\n- [@yevgnen](https://github.com/Yevgnen)\n- [@meldonization](https://github.com/meldonization)\n- [Oisín Moran](https://github.com/OisinMoran)\n- [Samkit Jain](https://github.com/samkit-jain)\n- [Francisco Aranda](https://github.com/frascuchon)\n- [Kwok-kuen Cheung](https://github.com/cheungpat)\n- [Marco](https://github.com/ubmarco)\n- [Idan David](https://github.com/idan-david)\n- [@xv44586](https://github.com/xv44586)\n- [Alexander Regueiro](https://github.com/alexreg)\n- [Daniel Peña](https://github.com/trifling)\n- [@bobluda](https://github.com/bobluda)\n- [@ramcdona](https://github.com/ramcdona)\n- [@johnhuge](https://github.com/johnhuge)\n- [Jhonatan Lopes](https://github.com/jhonatan-lopes)\n- [Ethan Corey](https://github.com/ethanscorey)\n- [Shannon Shen](https://github.com/lolipopshock)\n- [Matsumoto Toshi](https://github.com/toshi1127)\n- [John West](https://github.com/jwestwsj)\n- [David Huggins-Daines](https://github.com/dhdaines)\n- [Jeremy B. 
Merrill](https://github.com/jeremybmerrill)\n- [Echedey Luis](https://github.com/echedey-ls)\n- [Andy Friedman](https://github.com/afriedman412)\n- [Aron Weiler](https://github.com/aronweiler)\n- [Quentin André](https://github.com/QuentinAndre11)\n- [Léo Roux](https://github.com/leorouxx)\n- [@wodny](https://github.com/wodny)\n- [Michal Stolarczyk](https://github.com/stolarczyk)\n- [Brandon Roberts](https://github.com/brandonrobertz)\n- [@ennamarie19](https://github.com/ennamarie19)\n- [Anton Ilin](https://github.com/bronislav)\n\n## Contributing\n\nPull requests are welcome, but please submit a proposal issue first, as the library is in active development.\n\nCurrent maintainers:\n\n- [Jeremy Singer-Vine](https://github.com/jsvine)\n- [Samkit Jain](https://github.com/samkit-jain)\n"
  },
  {
    "path": "codecov.yml",
    "content": "codecov:\n  branch: stable\n"
  },
  {
    "path": "docs/colors.md",
    "content": "# Colors\n\nIn the PDF specification, as well as in `pdfplumber`, most graphical objects can have two color attributes:\n\n- `stroking_color`: The color of the object's outline\n- `non_stroking_color`: The color of the object's interior, or \"fill\"\n\nIn the PDF specification, colors have both a \"color space\" and a \"color value\".\n\n## Color Spaces\n\nValid color spaces are grouped into three categories:\n\n- Device color spaces\n    - `DeviceGray`\n    - `DeviceRGB`\n    - `DeviceCMYK`\n- CIE-based color spaces\n    - `CalGray`\n    - `CalRGB`\n    - `Lab`\n    - `ICCBased`\n- Special color spaces\n    - `Indexed`\n    - `Pattern`\n    - `Separation`\n    - `DeviceN`\n\nTo read more about the differences between those color spaces, see section 4.5 [here](https://ghostscript.com/~robin/pdf_reference17.pdf).\n\n`pdfplumber` aims to expose those color spaces as `scs` (stroking color space) and `ncs` (non-stroking color space), represented as a __string__.\n\n__Caveat__: The only information `pdfplumber` can __currently__ expose is the non-stroking color space for `char` objects. The rest (stroking color space for `char` objects and either color space for the other types of objects) will require a pull request to `pdfminer.six`.\n\n## Color Values\n\nThe color value determines *what specific color* in the color space should be used. With the exception of the \"special color spaces,\" these color values are specified as a series of numbers. For `DeviceRGB`, for example, the color values are three numbers, representing the intensities of red, green, and blue.\n\nIn `pdfplumber`, those color values are exposed as `stroking_color` and `non_stroking_color`, represented as a __tuple of numbers__.\n\nThe pattern specified by the `Pattern` color space is exposed via the `non_stroking_pattern` and `stroking_pattern` attributes.\n"
  },
  {
    "path": "docs/repairing.md",
    "content": "# Repairing Malformed PDFs\n\nMany parsing issues can be traced back to malformed PDFs.\n\nMalformed PDFs can often be [fixed via Ghostscript](https://superuser.com/questions/278562/how-can-i-fix-repair-a-corrupted-pdf-file).\n\n`pdfplumber` lets you automatically run those repairs, in several ways:\n\n- `pdfplumber.open(..., repair=True)` will repair your PDF on the fly (but not save the repaired version to disk).\n- `pdfplumber.repair(path_to_pdf)` will return a `BytesIO` object holding the bytes of a repaired version of the original file.\n- `pdfplumber.repair(path_to_pdf, outfile=\"path/to/repaired.pdf\")` will write a repaired version of the original file to the indicated `outfile` path.\n\n## Custom parameters\n\n- `gs_path=...`: You can pass a custom path for the Ghostscript executable, helpful in case `pdfplumber` is unable to auto-detect your copy of Ghostscript.\n"
  },
  {
    "path": "docs/structure.md",
    "content": "# Structure Tree\n\nSince PDF 1.3 it is possible for a PDF to contain logical structure,\ncontained in a *structure tree*.  In conjunction with PDF 1.2 [marked\ncontent sections](#marked-content-sections) this forms the basis of\nTagged PDF and other accessibility features.\n\nUnfortunately, since all of these standards are optional and variably\nimplemented in PDF authoring tools, and are frequently not enabled by\ndefault, it is not possible to rely on them to extract the structure\nof a PDF and associated content.  Nonetheless they can be useful as\nfeatures for a heuristic or machine-learning based system, or for\nextracting particular structures such as tables.\n\nSince `pdfplumber`'s API is page-based, the structure is available for\na particular page, using the `structure_tree` attribute:\n\n    with pdfplumber.open(pdffile) as pdf:\n        for element in pdf.pages[0].structure_tree:\n            # Each element is a plain dict; empty fields may be omitted\n            print(element[\"type\"], element.get(\"mcids\"))\n            for child in element.get(\"children\", []):\n                print(child[\"type\"], child.get(\"mcids\"))\n\nThe `type` field contains the type of the structure element - the\nstandard structure types can be seen in section 10.7.3 of [the PDF 1.7\nreference\ndocument](https://ghostscript.com/~robin/pdf_reference17.pdf#page=898),\nbut usually they are rather HTML-like, if created by a recent PDF\nauthoring tool (notably, older tools may simply produce `P` for\neverything).\n\nThe `mcids` field contains the list of marked content section IDs\ncorresponding to this element.\n\nThe `lang` field is often present as well, and contains a language\ncode for the text content, e.g. `\"EN-US\"` or `\"FR-CA\"`.\n\nThe `alt_text` field will be present if the author has helpfully added\nalternate text to an image.  In some cases, `actual_text` may also be\npresent.\n\nThere are also various attributes that may be in the `attributes`\nfield.  
Some of these are quite useful indeed, such as `BBox` which\ngives you the bounding box of a `Table`, `Figure`, or `Image`.  You\ncan see a full list of these [in the PDF\nspec](https://ghostscript.com/~robin/pdf_reference17.pdf#page=916).\nNote that the `BBox` is in PDF coordinate space with the origin at the\nbottom left of the page.  To convert it to `pdfplumber`'s space you\ncan do, for example:\n\n    x0, y0, x1, y1 = element['attributes']['BBox']\n    top = page.height - y1\n    bottom = page.height - y0\n    doctop = page.initial_doctop + top\n    bbox = (x0, top, x1, bottom)\n\nIt is also possible to get the structure tree for the entire document.\nIn this case, because marked content IDs are specific to a given page,\neach element will also have a `page_number` attribute, which is the\nnumber of the page containing (partially or completely) this element,\nindexed from 1 (for consistency with `pdfplumber.Page`).\n\nYou can also access the underlying `PDFStructTree` object for more\nflexibility, including visual debugging.  For instance, to plot the\nbounding boxes of the contents of all of the `TD` elements on the\nfirst page of a document:\n\n    from pdfplumber.structure import PDFStructTree\n\n    page = pdf.pages[0]\n    stree = PDFStructTree(pdf, page)\n    img = page.to_image()\n    img.draw_rects(stree.element_bbox(td) for td in stree.find_all(\"TD\"))\n\nThe `find_all` method works rather like the same method in\n[BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree) -\nit takes an element name, a regular expression, or a matching\nfunction.\n"
  },
  {
    "path": "pdfplumber/__init__.py",
    "content": "__all__ = [\n    \"__version__\",\n    \"utils\",\n    \"pdfminer\",\n    \"open\",\n    \"repair\",\n    \"set_debug\",\n]\n\nimport pdfminer\nimport pdfminer.pdftypes\n\nfrom . import utils\nfrom ._version import __version__\nfrom .pdf import PDF\nfrom .repair import repair\n\nopen = PDF.open\n"
  },
  {
    "path": "pdfplumber/_typing.py",
    "content": "from typing import Any, Dict, Iterable, List, Literal, Sequence, Tuple, Union\n\nT_seq = Sequence\nT_num = Union[int, float]\nT_point = Tuple[T_num, T_num]\nT_bbox = Tuple[T_num, T_num, T_num, T_num]\nT_obj = Dict[str, Any]\nT_obj_list = List[T_obj]\nT_obj_iter = Iterable[T_obj]\nT_dir = Union[Literal[\"ltr\"], Literal[\"rtl\"], Literal[\"ttb\"], Literal[\"btt\"]]\n"
  },
  {
    "path": "pdfplumber/_version.py",
    "content": "version_info = (0, 11, 9)\n__version__ = \".\".join(map(str, version_info))\n"
  },
  {
    "path": "pdfplumber/cli.py",
    "content": "#!/usr/bin/env python\nimport argparse\nimport json\nimport sys\nfrom collections import defaultdict, deque\nfrom itertools import chain\nfrom typing import Any, DefaultDict, Dict, List\n\nfrom .pdf import PDF\n\nif len(sys.argv) == 1:\n    sys.argv.append(\"--help\")\n\n\ndef parse_page_spec(p_str: str) -> List[int]:\n    if \"-\" in p_str:\n        start, end = map(int, p_str.split(\"-\"))\n        return list(range(start, end + 1))\n    else:\n        return [int(p_str)]\n\n\ndef parse_args(args_raw: List[str]) -> argparse.Namespace:\n    parser = argparse.ArgumentParser(\"pdfplumber\")\n\n    parser.add_argument(\"infile\", nargs=\"?\", type=argparse.FileType(\"rb\"))\n    group = parser.add_mutually_exclusive_group()\n    group.add_argument(\n        \"--structure\",\n        help=\"Write the structure tree as JSON.  \"\n        \"All other arguments except --pages, --laparams, and --indent will be ignored\",\n        action=\"store_true\",\n    )\n    group.add_argument(\n        \"--structure-text\",\n        help=\"Write the structure tree as JSON including text contents.  
\"\n        \"All other arguments except --pages, --laparams, and --indent will be ignored\",\n        action=\"store_true\",\n    )\n\n    parser.add_argument(\"--format\", choices=[\"csv\", \"json\", \"text\"], default=\"csv\")\n\n    parser.add_argument(\"--types\", nargs=\"+\")\n\n    parser.add_argument(\n        \"--include-attrs\",\n        nargs=\"+\",\n        help=\"Include *only* these object attributes in output.\",\n    )\n\n    parser.add_argument(\n        \"--exclude-attrs\",\n        nargs=\"+\",\n        help=\"Exclude these object attributes from output.\",\n    )\n\n    parser.add_argument(\"--laparams\", type=json.loads)\n\n    parser.add_argument(\"--precision\", type=int)\n\n    parser.add_argument(\"--pages\", nargs=\"+\", type=parse_page_spec)\n\n    parser.add_argument(\n        \"--indent\", type=int, help=\"Indent level for JSON pretty-printing.\"\n    )\n\n    args = parser.parse_args(args_raw)\n    if args.pages is not None:\n        args.pages = list(chain(*args.pages))\n    return args\n\n\ndef add_text_to_mcids(pdf: PDF, data: List[Dict[str, Any]]) -> None:\n    page_contents: DefaultDict[int, Any] = defaultdict(lambda: defaultdict(str))\n    for page in pdf.pages:\n        text_contents = page_contents[page.page_number]\n        for c in page.chars:\n            mcid = c.get(\"mcid\")\n            if mcid is None:\n                continue\n            text_contents[mcid] += c[\"text\"]\n    d = deque(data)\n    while d:\n        el = d.popleft()\n        if \"children\" in el:\n            d.extend(el[\"children\"])\n        pageno = el.get(\"page_number\")\n        if pageno is None:\n            continue\n        text_contents = page_contents[pageno]\n        if \"mcids\" in el:\n            el[\"text\"] = [text_contents[mcid] for mcid in el[\"mcids\"]]\n\n\ndef main(args_raw: List[str] = sys.argv[1:]) -> None:\n    args = parse_args(args_raw)\n\n    with PDF.open(args.infile, pages=args.pages, laparams=args.laparams) as pdf:\n  
      if args.structure:\n            print(json.dumps(pdf.structure_tree, indent=args.indent))\n        elif args.structure_text:\n            tree = pdf.structure_tree\n            add_text_to_mcids(pdf, tree)\n            print(json.dumps(tree, indent=args.indent, ensure_ascii=False))\n        elif args.format == \"csv\":\n            pdf.to_csv(\n                sys.stdout,\n                args.types,\n                precision=args.precision,\n                include_attrs=args.include_attrs,\n                exclude_attrs=args.exclude_attrs,\n            )\n        elif args.format == \"text\":\n            for page in pdf.pages:\n                print(page.extract_text(layout=True))\n        else:\n            pdf.to_json(\n                sys.stdout,\n                args.types,\n                precision=args.precision,\n                include_attrs=args.include_attrs,\n                exclude_attrs=args.exclude_attrs,\n                indent=args.indent,\n            )\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "pdfplumber/container.py",
    "content": "import csv\nimport json\nfrom io import StringIO\nfrom itertools import chain\nfrom typing import Any, Dict, List, Optional, Set, TextIO\n\nfrom . import utils\nfrom ._typing import T_obj, T_obj_list\nfrom .convert import CSV_COLS_REQUIRED, CSV_COLS_TO_PREPEND, Serializer\n\n\nclass Container(object):\n    cached_properties = [\"_rect_edges\", \"_curve_edges\", \"_edges\", \"_objects\"]\n\n    @property\n    def pages(self) -> Optional[List[Any]]:  # pragma: nocover\n        raise NotImplementedError\n\n    @property\n    def objects(self) -> Dict[str, T_obj_list]:  # pragma: nocover\n        raise NotImplementedError\n\n    def to_dict(\n        self, object_types: Optional[List[str]] = None\n    ) -> Dict[str, Any]:  # pragma: nocover\n        raise NotImplementedError\n\n    def flush_cache(self, properties: Optional[List[str]] = None) -> None:\n        props = self.cached_properties if properties is None else properties\n        for p in props:\n            if hasattr(self, p):\n                delattr(self, p)\n\n    @property\n    def rects(self) -> T_obj_list:\n        return self.objects.get(\"rect\", [])\n\n    @property\n    def lines(self) -> T_obj_list:\n        return self.objects.get(\"line\", [])\n\n    @property\n    def curves(self) -> T_obj_list:\n        return self.objects.get(\"curve\", [])\n\n    @property\n    def images(self) -> T_obj_list:\n        return self.objects.get(\"image\", [])\n\n    @property\n    def chars(self) -> T_obj_list:\n        return self.objects.get(\"char\", [])\n\n    @property\n    def textboxverticals(self) -> T_obj_list:\n        return self.objects.get(\"textboxvertical\", [])\n\n    @property\n    def textboxhorizontals(self) -> T_obj_list:\n        return self.objects.get(\"textboxhorizontal\", [])\n\n    @property\n    def textlineverticals(self) -> T_obj_list:\n        return self.objects.get(\"textlinevertical\", [])\n\n    @property\n    def textlinehorizontals(self) -> T_obj_list:\n        
return self.objects.get(\"textlinehorizontal\", [])\n\n    @property\n    def rect_edges(self) -> T_obj_list:\n        if hasattr(self, \"_rect_edges\"):\n            return self._rect_edges\n        rect_edges_gen = (utils.rect_to_edges(r) for r in self.rects)\n        self._rect_edges: T_obj_list = list(chain(*rect_edges_gen))\n        return self._rect_edges\n\n    @property\n    def curve_edges(self) -> T_obj_list:\n        if hasattr(self, \"_curve_edges\"):\n            return self._curve_edges\n        curve_edges_gen = (utils.curve_to_edges(r) for r in self.curves)\n        self._curve_edges: T_obj_list = list(chain(*curve_edges_gen))\n        return self._curve_edges\n\n    @property\n    def edges(self) -> T_obj_list:\n        if hasattr(self, \"_edges\"):\n            return self._edges\n        line_edges = list(map(utils.line_to_edge, self.lines))\n        self._edges: T_obj_list = line_edges + self.rect_edges + self.curve_edges\n        return self._edges\n\n    @property\n    def horizontal_edges(self) -> T_obj_list:\n        def test(x: T_obj) -> bool:\n            return bool(x[\"orientation\"] == \"h\")\n\n        return list(filter(test, self.edges))\n\n    @property\n    def vertical_edges(self) -> T_obj_list:\n        def test(x: T_obj) -> bool:\n            return bool(x[\"orientation\"] == \"v\")\n\n        return list(filter(test, self.edges))\n\n    def to_json(\n        self,\n        stream: Optional[TextIO] = None,\n        object_types: Optional[List[str]] = None,\n        include_attrs: Optional[List[str]] = None,\n        exclude_attrs: Optional[List[str]] = None,\n        precision: Optional[int] = None,\n        indent: Optional[int] = None,\n    ) -> Optional[str]:\n\n        data = self.to_dict(object_types)\n\n        serialized = Serializer(\n            precision=precision,\n            include_attrs=include_attrs,\n            exclude_attrs=exclude_attrs,\n        ).serialize(data)\n\n        if stream is None:\n            
return json.dumps(serialized, indent=indent)\n        else:\n            json.dump(serialized, stream, indent=indent)\n            return None\n\n    def to_csv(\n        self,\n        stream: Optional[TextIO] = None,\n        object_types: Optional[List[str]] = None,\n        precision: Optional[int] = None,\n        include_attrs: Optional[List[str]] = None,\n        exclude_attrs: Optional[List[str]] = None,\n    ) -> Optional[str]:\n        if stream is None:\n            stream = StringIO()\n            to_string = True\n        else:\n            to_string = False\n\n        if object_types is None:\n            object_types = list(self.objects.keys()) + [\"annot\"]\n\n        serialized = []\n        fields: Set[str] = set()\n\n        pages = [self] if self.pages is None else self.pages\n\n        serializer = Serializer(\n            precision=precision,\n            include_attrs=include_attrs,\n            exclude_attrs=exclude_attrs,\n        )\n        for page in pages:\n            for t in object_types:\n                objs = getattr(page, t + \"s\")\n                if len(objs):\n                    serialized += serializer.serialize(objs)\n                    new_keys = [k for k, v in objs[0].items() if type(v) is not dict]\n                    fields = fields.union(set(new_keys))\n\n        non_req_cols = CSV_COLS_TO_PREPEND + list(\n            sorted(set(fields) - set(CSV_COLS_REQUIRED + CSV_COLS_TO_PREPEND))\n        )\n\n        cols = CSV_COLS_REQUIRED + list(filter(serializer.attr_filter, non_req_cols))\n\n        w = csv.DictWriter(\n            stream,\n            fieldnames=cols,\n            extrasaction=\"ignore\",\n            quoting=csv.QUOTE_MINIMAL,\n            escapechar=\"\\\\\",\n        )\n        w.writeheader()\n        w.writerows(serialized)\n\n        if to_string:\n            stream.seek(0)\n            return stream.read()\n        else:\n            return None\n"
  },
  {
    "path": "pdfplumber/convert.py",
    "content": "import base64\nfrom typing import Any, Callable, Dict, List, Optional, Tuple\n\nfrom pdfminer.psparser import PSLiteral\n\nfrom .utils import decode_text\n\nENCODINGS_TO_TRY = [\n    \"utf-8\",\n    \"latin-1\",\n    \"utf-16\",\n    \"utf-16le\",\n]\n\nCSV_COLS_REQUIRED = [\n    \"object_type\",\n]\n\nCSV_COLS_TO_PREPEND = [\n    \"page_number\",\n    \"x0\",\n    \"x1\",\n    \"y0\",\n    \"y1\",\n    \"doctop\",\n    \"top\",\n    \"bottom\",\n    \"width\",\n    \"height\",\n]\n\n\ndef get_attr_filter(\n    include_attrs: Optional[List[str]] = None, exclude_attrs: Optional[List[str]] = None\n) -> Callable[[str], bool]:\n    if include_attrs is not None and exclude_attrs is not None:\n        raise ValueError(\n            \"Cannot specify `include_attrs` and `exclude_attrs` at the same time.\"\n        )\n\n    elif include_attrs is not None:\n        incl = set(CSV_COLS_REQUIRED + include_attrs)\n        return lambda attr: attr in incl\n\n    elif exclude_attrs is not None:\n        nonexcludable = set(exclude_attrs).intersection(set(CSV_COLS_REQUIRED))\n        if len(nonexcludable):\n            raise ValueError(\n                f\"Cannot exclude these required properties: {list(nonexcludable)}\"\n            )\n        excl = set(exclude_attrs)\n        return lambda attr: attr not in excl\n\n    else:\n        return lambda attr: True\n\n\ndef to_b64(data: bytes) -> str:\n    return base64.b64encode(data).decode(\"ascii\")\n\n\nclass Serializer:\n    def __init__(\n        self,\n        precision: Optional[int] = None,\n        include_attrs: Optional[List[str]] = None,\n        exclude_attrs: Optional[List[str]] = None,\n    ):\n\n        self.precision = precision\n        self.attr_filter = get_attr_filter(\n            include_attrs=include_attrs, exclude_attrs=exclude_attrs\n        )\n\n    def serialize(self, obj: Any) -> Any:\n        if obj is None:\n            return None\n\n        t = type(obj)\n\n        # Basic types don't 
need to be converted\n        if t in (int, str):\n            return obj\n\n        # Use one of the custom converters, if possible\n        fn = getattr(self, f\"do_{t.__name__}\", None)\n        if fn is not None:\n            return fn(obj)\n\n        # Otherwise, just use the string-representation\n        else:\n            return str(obj)\n\n    def do_float(self, x: float) -> float:\n        return x if self.precision is None else round(x, self.precision)\n\n    def do_bool(self, x: bool) -> int:\n        return int(x)\n\n    def do_list(self, obj: List[Any]) -> List[Any]:\n        return list(self.serialize(x) for x in obj)\n\n    def do_tuple(self, obj: Tuple[Any, ...]) -> Tuple[Any, ...]:\n        return tuple(self.serialize(x) for x in obj)\n\n    def do_dict(self, obj: Dict[str, Any]) -> Dict[str, Any]:\n        if \"object_type\" in obj.keys():\n            return {k: self.serialize(v) for k, v in obj.items() if self.attr_filter(k)}\n        else:\n            return {k: self.serialize(v) for k, v in obj.items()}\n\n    def do_PDFStream(self, obj: Any) -> Dict[str, Optional[str]]:\n        return {\"rawdata\": to_b64(obj.rawdata) if obj.rawdata else None}\n\n    def do_PSLiteral(self, obj: PSLiteral) -> str:\n        return decode_text(obj.name)\n\n    def do_bytes(self, obj: bytes) -> Optional[str]:\n        for e in ENCODINGS_TO_TRY:\n            try:\n                return obj.decode(e)\n            except UnicodeDecodeError:  # pragma: no cover\n                # Fall through to the next candidate encoding\n                continue\n        # If none of the decodings work, raise whatever error\n        # decoding with utf-8 causes\n        obj.decode(ENCODINGS_TO_TRY[0])  # pragma: no cover\n        return None  # pragma: no cover\n"
  },
  {
    "path": "pdfplumber/ctm.py",
    "content": "import math\nfrom typing import NamedTuple\n\n# For more details, see the PDF Reference, 6th Ed., Section 4.2.2 (\"Common\n# Transformations\")\n\n\nclass CTM(NamedTuple):\n    a: float\n    b: float\n    c: float\n    d: float\n    e: float\n    f: float\n\n    @property\n    def scale_x(self) -> float:\n        return math.sqrt(pow(self.a, 2) + pow(self.b, 2))\n\n    @property\n    def scale_y(self) -> float:\n        return math.sqrt(pow(self.c, 2) + pow(self.d, 2))\n\n    @property\n    def skew_x(self) -> float:\n        return (math.atan2(self.d, self.c) * 180 / math.pi) - 90\n\n    @property\n    def skew_y(self) -> float:\n        return math.atan2(self.b, self.a) * 180 / math.pi\n\n    @property\n    def translation_x(self) -> float:\n        return self.e\n\n    @property\n    def translation_y(self) -> float:\n        return self.f\n"
  },
  {
    "path": "pdfplumber/display.py",
    "content": "import pathlib\nfrom io import BufferedReader, BytesIO\nfrom typing import TYPE_CHECKING, Any, List, Optional, Tuple, Union\n\nimport PIL.Image\nimport PIL.ImageDraw\nimport pypdfium2  # type: ignore\n\nfrom . import utils\nfrom ._typing import T_bbox, T_num, T_obj, T_obj_list, T_point, T_seq\nfrom .table import T_table_settings, Table, TableFinder, TableSettings\nfrom .utils.exceptions import MalformedPDFException\n\nif TYPE_CHECKING:  # pragma: nocover\n    import pandas as pd\n\n    from .page import Page\n\n\nclass COLORS:\n    RED = (255, 0, 0)\n    GREEN = (0, 255, 0)\n    BLUE = (0, 0, 255)\n    TRANSPARENT = (0, 0, 0, 0)\n\n\nDEFAULT_FILL = COLORS.BLUE + (50,)\nDEFAULT_STROKE = COLORS.RED + (200,)\nDEFAULT_STROKE_WIDTH = 1\nDEFAULT_RESOLUTION = 72\n\nT_color = Union[Tuple[int, int, int], Tuple[int, int, int, int], str]\nT_contains_points = Union[Tuple[T_point, ...], List[T_point], T_obj]\n\n\ndef get_page_image(\n    stream: Union[BufferedReader, BytesIO],\n    path: Optional[pathlib.Path],\n    page_ix: int,\n    resolution: Union[int, float],\n    password: Optional[str],\n    antialias: bool = False,\n) -> PIL.Image.Image:\n\n    src: Union[pathlib.Path, BufferedReader, BytesIO]\n\n    # If we are working with a file object saved to disk\n    if path:\n        src = path\n\n    # If we instead are working with a BytesIO stream\n    else:\n        stream.seek(0)\n        src = stream\n\n    try:\n        pdfium_doc = pypdfium2.PdfDocument(src, password=password)\n    except pypdfium2.PdfiumError as e:\n        raise MalformedPDFException(e)\n\n    pdfium_page = pdfium_doc.get_page(page_ix)\n\n    img: PIL.Image.Image = pdfium_page.render(\n        # Modifiable arguments\n        scale=resolution / 72,\n        no_smoothtext=not antialias,\n        no_smoothpath=not antialias,\n        no_smoothimage=not antialias,\n        # Non-modifiable arguments\n        prefer_bgrx=True,\n    ).to_pil()\n    pdfium_doc.close()\n\n    return 
img.convert(\"RGB\")\n\n\nclass PageImage:\n    def __init__(\n        self,\n        page: \"Page\",\n        original: Optional[PIL.Image.Image] = None,\n        resolution: Union[int, float] = DEFAULT_RESOLUTION,\n        antialias: bool = False,\n        force_mediabox: bool = False,\n    ):\n        self.page = page\n        self.root = page if page.is_original else page.root_page\n        self.resolution = resolution\n\n        if original is None:\n            self.original = get_page_image(\n                stream=page.pdf.stream,\n                path=page.pdf.path,\n                page_ix=page.page_number - 1,\n                resolution=resolution,\n                antialias=antialias,\n                password=page.pdf.password,\n            )\n        else:\n            self.original = original\n\n        self.scale = self.original.size[0] / (page.cropbox[2] - page.cropbox[0])\n\n        # This value represents the coordinates of the page,\n        # in page-unit values, that will be displayed.\n        self.bbox = (\n            page.bbox\n            if page.bbox != page.mediabox\n            else (page.mediabox if force_mediabox else page.cropbox)\n        )\n\n        # If this value is different than the *Page*'s .cropbox\n        # (e.g., because the mediabox differs from the cropbox\n        # or because we've used Page.crop(...)), then we'll need to\n        # crop the initially-converted image.\n        if page.bbox != page.cropbox:\n            crop_dims = self._reproject_bbox(page.cropbox)\n            bbox_dims = self._reproject_bbox(self.bbox)\n            self.original = self.original.crop(\n                (\n                    bbox_dims[0] - crop_dims[0],\n                    bbox_dims[1] - crop_dims[1],\n                    bbox_dims[2] - crop_dims[0],\n                    bbox_dims[3] - crop_dims[1],\n                )\n            )\n\n        self.reset()\n\n    def _reproject_bbox(self, bbox: T_bbox) -> Tuple[int, int, int, 
int]:\n        x0, top, x1, bottom = bbox\n        _x0, _top = self._reproject((x0, top))\n        _x1, _bottom = self._reproject((x1, bottom))\n        return (_x0, _top, _x1, _bottom)\n\n    def _reproject(self, coord: T_point) -> Tuple[int, int]:\n        \"\"\"\n        Given an (x0, top) tuple from the *root* coordinate system,\n        return an (x0, top) tuple in the *image* coordinate system.\n        \"\"\"\n        x0, top = coord\n        _x0 = (x0 - self.bbox[0]) * self.scale\n        _top = (top - self.bbox[1]) * self.scale\n        return (int(_x0), int(_top))\n\n    def reset(self) -> \"PageImage\":\n        self.annotated = PIL.Image.new(\"RGB\", self.original.size)\n        self.annotated.paste(self.original)\n        self.draw = PIL.ImageDraw.Draw(self.annotated, \"RGBA\")\n        return self\n\n    def save(\n        self,\n        dest: Union[str, pathlib.Path, BytesIO],\n        format: str = \"PNG\",\n        quantize: bool = True,\n        colors: int = 256,\n        bits: int = 8,\n        **kwargs: Any,\n    ) -> None:\n        if quantize:\n            out = self.annotated.quantize(colors, method=PIL.Image.FASTOCTREE).convert(\n                \"P\"\n            )\n        else:\n            out = self.annotated\n\n        out.save(\n            dest,\n            format=format,\n            bits=bits,\n            dpi=(self.resolution, self.resolution),\n            **kwargs,\n        )\n\n    def copy(self) -> \"PageImage\":\n        return self.__class__(self.page, self.original)\n\n    def draw_line(\n        self,\n        points_or_obj: T_contains_points,\n        stroke: T_color = DEFAULT_STROKE,\n        stroke_width: int = DEFAULT_STROKE_WIDTH,\n    ) -> \"PageImage\":\n        # If passing a raw list of points, use those\n        if isinstance(points_or_obj, (tuple, list)):\n            points = points_or_obj\n        # Else, use the \"pts\" attribute if available\n        elif isinstance(points_or_obj, dict) and \"pts\" in 
points_or_obj:\n            points = [(x, y) for x, y in points_or_obj[\"pts\"]]\n        # Otherwise, just use ((x0, top), (x1, bottom))\n        else:\n            obj = points_or_obj\n            points = ((obj[\"x0\"], obj[\"top\"]), (obj[\"x1\"], obj[\"bottom\"]))\n\n        self.draw.line(\n            list(map(self._reproject, points)), fill=stroke, width=stroke_width\n        )\n\n        return self\n\n    def draw_lines(\n        self,\n        list_of_lines: Union[T_seq[T_contains_points], \"pd.DataFrame\"],\n        stroke: T_color = DEFAULT_STROKE,\n        stroke_width: int = DEFAULT_STROKE_WIDTH,\n    ) -> \"PageImage\":\n        for x in utils.to_list(list_of_lines):\n            self.draw_line(x, stroke=stroke, stroke_width=stroke_width)\n        return self\n\n    def draw_vline(\n        self,\n        location: T_num,\n        stroke: T_color = DEFAULT_STROKE,\n        stroke_width: int = DEFAULT_STROKE_WIDTH,\n    ) -> \"PageImage\":\n        points = (location, self.bbox[1], location, self.bbox[3])\n        self.draw.line(self._reproject_bbox(points), fill=stroke, width=stroke_width)\n        return self\n\n    def draw_vlines(\n        self,\n        locations: Union[List[T_num], \"pd.Series[float]\"],\n        stroke: T_color = DEFAULT_STROKE,\n        stroke_width: int = DEFAULT_STROKE_WIDTH,\n    ) -> \"PageImage\":\n        for x in list(locations):\n            self.draw_vline(x, stroke=stroke, stroke_width=stroke_width)\n        return self\n\n    def draw_hline(\n        self,\n        location: T_num,\n        stroke: T_color = DEFAULT_STROKE,\n        stroke_width: int = DEFAULT_STROKE_WIDTH,\n    ) -> \"PageImage\":\n        points = (self.bbox[0], location, self.bbox[2], location)\n        self.draw.line(self._reproject_bbox(points), fill=stroke, width=stroke_width)\n        return self\n\n    def draw_hlines(\n        self,\n        locations: Union[List[T_num], \"pd.Series[float]\"],\n        stroke: T_color = DEFAULT_STROKE,\n   
     stroke_width: int = DEFAULT_STROKE_WIDTH,\n    ) -> \"PageImage\":\n        for x in list(locations):\n            self.draw_hline(x, stroke=stroke, stroke_width=stroke_width)\n        return self\n\n    def draw_rect(\n        self,\n        bbox_or_obj: Union[T_bbox, T_obj],\n        fill: T_color = DEFAULT_FILL,\n        stroke: T_color = DEFAULT_STROKE,\n        stroke_width: int = DEFAULT_STROKE_WIDTH,\n    ) -> \"PageImage\":\n        if isinstance(bbox_or_obj, (tuple, list)):\n            bbox = bbox_or_obj\n        else:\n            obj = bbox_or_obj\n            bbox = (obj[\"x0\"], obj[\"top\"], obj[\"x1\"], obj[\"bottom\"])\n\n        x0, top, x1, bottom = bbox\n        half = stroke_width / 2\n        x0 = min(x0 + half, (x0 + x1) / 2)\n        top = min(top + half, (top + bottom) / 2)\n        x1 = max(x1 - half, (x0 + x1) / 2)\n        bottom = max(bottom - half, (top + bottom) / 2)\n\n        fill_bbox = self._reproject_bbox((x0, top, x1, bottom))\n        self.draw.rectangle(fill_bbox, fill, COLORS.TRANSPARENT)\n\n        if stroke_width > 0:\n            segments = [\n                ((x0, top), (x1, top)),  # top\n                ((x0, bottom), (x1, bottom)),  # bottom\n                ((x0, top), (x0, bottom)),  # left\n                ((x1, top), (x1, bottom)),  # right\n            ]\n            self.draw_lines(segments, stroke=stroke, stroke_width=stroke_width)\n        return self\n\n    def draw_rects(\n        self,\n        list_of_rects: Union[List[T_bbox], T_obj_list, \"pd.DataFrame\"],\n        fill: T_color = DEFAULT_FILL,\n        stroke: T_color = DEFAULT_STROKE,\n        stroke_width: int = DEFAULT_STROKE_WIDTH,\n    ) -> \"PageImage\":\n        for x in utils.to_list(list_of_rects):\n            self.draw_rect(x, fill=fill, stroke=stroke, stroke_width=stroke_width)\n        return self\n\n    def draw_circle(\n        self,\n        center_or_obj: Union[T_point, T_obj],\n        radius: int = 5,\n        fill: T_color = 
DEFAULT_FILL,\n        stroke: T_color = DEFAULT_STROKE,\n    ) -> \"PageImage\":\n        if isinstance(center_or_obj, tuple):\n            center = center_or_obj\n        else:\n            obj = center_or_obj\n            center = ((obj[\"x0\"] + obj[\"x1\"]) / 2, (obj[\"top\"] + obj[\"bottom\"]) / 2)\n        cx, cy = center\n        bbox = (cx - radius, cy - radius, cx + radius, cy + radius)\n        self.draw.ellipse(self._reproject_bbox(bbox), fill, stroke)\n        return self\n\n    def draw_circles(\n        self,\n        list_of_circles: Union[List[T_point], T_obj_list, \"pd.DataFrame\"],\n        radius: int = 5,\n        fill: T_color = DEFAULT_FILL,\n        stroke: T_color = DEFAULT_STROKE,\n    ) -> \"PageImage\":\n        for x in utils.to_list(list_of_circles):\n            self.draw_circle(x, radius=radius, fill=fill, stroke=stroke)\n        return self\n\n    def debug_table(\n        self,\n        table: Table,\n        fill: T_color = DEFAULT_FILL,\n        stroke: T_color = DEFAULT_STROKE,\n        stroke_width: int = 1,\n    ) -> \"PageImage\":\n        \"\"\"\n        Outline the cells of the given table.\n        \"\"\"\n        self.draw_rects(\n            table.cells, fill=fill, stroke=stroke, stroke_width=stroke_width\n        )\n        return self\n\n    def debug_tablefinder(\n        self,\n        table_settings: Optional[\n            Union[TableFinder, TableSettings, T_table_settings]\n        ] = None,\n    ) -> \"PageImage\":\n        if isinstance(table_settings, TableFinder):\n            finder = table_settings\n        elif table_settings is None or isinstance(\n            table_settings, (TableSettings, dict)\n        ):\n            finder = self.page.debug_tablefinder(table_settings)\n        else:\n            raise ValueError(\n                \"Argument must be instance of TableFinder \"\n                \"or a TableFinder settings dict.\"\n            )\n\n        for table in finder.tables:\n            
self.debug_table(table)\n\n        self.draw_lines(finder.edges, stroke_width=1)\n\n        self.draw_circles(\n            list(finder.intersections.keys()),\n            fill=COLORS.TRANSPARENT,\n            stroke=COLORS.BLUE + (200,),\n            radius=3,\n        )\n        return self\n\n    def outline_words(\n        self,\n        stroke: T_color = DEFAULT_STROKE,\n        fill: T_color = DEFAULT_FILL,\n        stroke_width: int = DEFAULT_STROKE_WIDTH,\n        x_tolerance: T_num = utils.DEFAULT_X_TOLERANCE,\n        y_tolerance: T_num = utils.DEFAULT_Y_TOLERANCE,\n    ) -> \"PageImage\":\n\n        words = self.page.extract_words(\n            x_tolerance=x_tolerance, y_tolerance=y_tolerance\n        )\n        self.draw_rects(words, stroke=stroke, fill=fill, stroke_width=stroke_width)\n        return self\n\n    def outline_chars(\n        self,\n        stroke: T_color = (255, 0, 0, 255),\n        fill: T_color = (255, 0, 0, int(255 / 4)),\n        stroke_width: int = DEFAULT_STROKE_WIDTH,\n    ) -> \"PageImage\":\n\n        self.draw_rects(\n            self.page.chars, stroke=stroke, fill=fill, stroke_width=stroke_width\n        )\n        return self\n\n    def _repr_png_(self) -> bytes:\n        b = BytesIO()\n        self.save(b, \"PNG\")\n        return b.getvalue()\n\n    def show(self) -> None:  # pragma: no cover\n        self.annotated.show()\n"
  },
  {
    "path": "pdfplumber/page.py",
    "content": "import numbers\nimport re\nfrom functools import lru_cache\nfrom typing import (\n    TYPE_CHECKING,\n    Any,\n    Callable,\n    Dict,\n    Generator,\n    List,\n    Optional,\n    Pattern,\n    Tuple,\n    Union,\n)\nfrom unicodedata import normalize as normalize_unicode\nfrom warnings import warn\n\nfrom pdfminer.converter import PDFPageAggregator\nfrom pdfminer.layout import (\n    LTChar,\n    LTComponent,\n    LTContainer,\n    LTCurve,\n    LTItem,\n    LTPage,\n    LTTextContainer,\n)\nfrom pdfminer.pdfinterp import PDFPageInterpreter, PDFStackT\nfrom pdfminer.pdfpage import PDFPage\nfrom pdfminer.psparser import PSLiteral\n\nfrom . import utils\nfrom ._typing import T_bbox, T_num, T_obj, T_obj_list\nfrom .container import Container\nfrom .structure import PDFStructTree, StructTreeMissing\nfrom .table import T_table_settings, Table, TableFinder, TableSettings\nfrom .utils import decode_text, resolve_all, resolve_and_decode\nfrom .utils.exceptions import MalformedPDFException, PdfminerException\nfrom .utils.text import TextMap\n\nlt_pat = re.compile(r\"^LT\")\n\nALL_ATTRS = set(\n    [\n        \"adv\",\n        \"height\",\n        \"linewidth\",\n        \"pts\",\n        \"size\",\n        \"srcsize\",\n        \"width\",\n        \"x0\",\n        \"x1\",\n        \"y0\",\n        \"y1\",\n        \"bits\",\n        \"matrix\",\n        \"upright\",\n        \"fontname\",\n        \"text\",\n        \"imagemask\",\n        \"colorspace\",\n        \"evenodd\",\n        \"fill\",\n        \"non_stroking_color\",\n        \"stroke\",\n        \"stroking_color\",\n        \"stream\",\n        \"name\",\n        \"mcid\",\n        \"tag\",\n    ]\n)\n\n\nif TYPE_CHECKING:  # pragma: nocover\n    from .display import PageImage\n    from .pdf import PDF\n\n# via https://git.ghostscript.com/?p=mupdf.git;a=blob;f=source/pdf/pdf-font.c;h=6322cedf2c26cfb312c0c0878d7aff97b4c7470e;hb=HEAD#l774   # noqa\n\nCP936_FONTNAMES = {\n    
b\"\\xcb\\xce\\xcc\\xe5\": \"SimSun,Regular\",\n    b\"\\xba\\xda\\xcc\\xe5\": \"SimHei,Regular\",\n    b\"\\xbf\\xac\\xcc\\xe5_GB2312\": \"SimKai,Regular\",\n    b\"\\xb7\\xc2\\xcb\\xce_GB2312\": \"SimFang,Regular\",\n    b\"\\xc1\\xa5\\xca\\xe9\": \"SimLi,Regular\",\n}\n\n\ndef fix_fontname_bytes(fontname: bytes) -> str:\n    if b\"+\" in fontname:\n        split_at = fontname.index(b\"+\") + 1\n        prefix, suffix = fontname[:split_at], fontname[split_at:]\n    else:\n        prefix, suffix = b\"\", fontname\n\n    suffix_new = CP936_FONTNAMES.get(suffix, str(suffix)[2:-1])\n    return str(prefix)[2:-1] + suffix_new\n\n\ndef tuplify_list_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:\n    return {\n        key: (tuple(value) if isinstance(value, list) else value)\n        for key, value in kwargs.items()\n    }\n\n\nclass PDFPageAggregatorWithMarkedContent(PDFPageAggregator):\n    \"\"\"Extract layout from a specific page, adding marked-content IDs to\n    objects where found.\"\"\"\n\n    cur_mcid: Optional[int] = None\n    cur_tag: Optional[str] = None\n\n    def begin_tag(self, tag: PSLiteral, props: Optional[PDFStackT] = None) -> None:\n        \"\"\"Handle beginning of tag, setting current MCID if any.\"\"\"\n        self.cur_tag = decode_text(tag.name)\n        if isinstance(props, dict) and \"MCID\" in props:\n            self.cur_mcid = props[\"MCID\"]\n        else:\n            self.cur_mcid = None\n\n    def end_tag(self) -> None:\n        \"\"\"Handle end of tag, clearing current MCID.\"\"\"\n        self.cur_tag = None\n        self.cur_mcid = None\n\n    def tag_cur_item(self) -> None:\n        \"\"\"Add current MCID to what we hope to be the most recent object created\n        by pdfminer.six.\"\"\"\n        # This is somewhat hacky and would not be necessary if\n        # pdfminer.six supported MCIDs.  
In reading the code it's\n        # clear that the `render_*` methods will only ever\n        # create one object, but that is far from being guaranteed.\n        # Even if pdfminer.six's API would just return the objects it\n        # creates, we wouldn't have to do this.\n        if self.cur_item._objs:\n            cur_obj = self.cur_item._objs[-1]\n            cur_obj.mcid = self.cur_mcid  # type: ignore\n            cur_obj.tag = self.cur_tag  # type: ignore\n\n    def render_char(self, *args, **kwargs) -> float:  # type: ignore\n        \"\"\"Hook for rendering characters, adding the `mcid` attribute.\"\"\"\n        adv = super().render_char(*args, **kwargs)\n        self.tag_cur_item()\n        return adv\n\n    def render_image(self, *args, **kwargs) -> None:  # type: ignore\n        \"\"\"Hook for rendering images, adding the `mcid` attribute.\"\"\"\n        super().render_image(*args, **kwargs)\n        self.tag_cur_item()\n\n    def paint_path(self, *args, **kwargs) -> None:  # type: ignore\n        \"\"\"Hook for rendering lines and curves, adding the `mcid` attribute.\"\"\"\n        super().paint_path(*args, **kwargs)\n        self.tag_cur_item()\n\n\ndef _normalize_box(box_raw: T_bbox, rotation: T_num = 0) -> T_bbox:\n    # Per PDF Reference 3.8.4: \"Note: Although rectangles are\n    # conventionally specified by their lower-left and upper-right\n    # corners, it is acceptable to specify any two diagonally opposite\n    # corners.\"\n    if not all(isinstance(x, numbers.Number) for x in box_raw):  # pragma: nocover\n        raise MalformedPDFException(\n            f\"Bounding box contains non-number coordinate(s): {box_raw}\"\n        )\n    x0, x1 = sorted((box_raw[0], box_raw[2]))\n    y0, y1 = sorted((box_raw[1], box_raw[3]))\n    if rotation in [90, 270]:\n        return (y0, x0, y1, x1)\n    else:\n        return (x0, y0, x1, y1)\n\n\n# PDF coordinate spaces refer to an origin in the bottom-left of the\n# page; pdfplumber flips this 
vertically, so that the origin is in the\n# top-left.\ndef _invert_box(box_raw: T_bbox, mb_height: T_num) -> T_bbox:\n    x0, y0, x1, y1 = box_raw\n    return (x0, mb_height - y1, x1, mb_height - y0)\n\n\nclass Page(Container):\n    cached_properties: List[str] = Container.cached_properties + [\"_layout\"]\n    is_original: bool = True\n    pages = None\n\n    def __init__(\n        self,\n        pdf: \"PDF\",\n        page_obj: PDFPage,\n        page_number: int,\n        initial_doctop: T_num = 0,\n    ):\n        self.pdf = pdf\n        self.root_page = self\n        self.page_obj = page_obj\n        self.page_number = page_number\n        self.initial_doctop = initial_doctop\n\n        def get_attr(key: str, default: Any = None) -> Any:\n            value = resolve_all(page_obj.attrs.get(key))\n            return default if value is None else value\n\n        # Per PDF Reference Table 3.27: \"The number of degrees by which the\n        # page should be rotated clockwise when displayed or printed. The value\n        # must be a multiple of 90. 
Default value: 0\"\n        _rotation = get_attr(\"Rotate\", 0)\n        self.rotation = _rotation % 360\n\n        mb_raw = _normalize_box(get_attr(\"MediaBox\"), self.rotation)\n        mb_height = mb_raw[3] - mb_raw[1]\n\n        self.mediabox = _invert_box(mb_raw, mb_height)\n\n        for box_name in [\"CropBox\", \"TrimBox\", \"BleedBox\", \"ArtBox\"]:\n            if box_name in page_obj.attrs:\n                box_normalized = _invert_box(\n                    _normalize_box(get_attr(box_name), self.rotation), mb_height\n                )\n                setattr(self, box_name.lower(), box_normalized)\n\n        if \"CropBox\" not in page_obj.attrs:\n            self.cropbox = self.mediabox\n\n        # Page.bbox defaults to self.mediabox, but can be altered by Page.crop(...)\n        self.bbox = self.mediabox\n\n        # See https://rednafi.com/python/lru_cache_on_methods/\n        self.get_textmap = lru_cache()(self._get_textmap)\n\n    def close(self) -> None:\n        self.flush_cache()\n        self.get_textmap.cache_clear()\n\n    @property\n    def width(self) -> T_num:\n        return self.bbox[2] - self.bbox[0]\n\n    @property\n    def height(self) -> T_num:\n        return self.bbox[3] - self.bbox[1]\n\n    @property\n    def structure_tree(self) -> List[Dict[str, Any]]:\n        \"\"\"Return the structure tree for a page, if any.\"\"\"\n        try:\n            return [elem.to_dict() for elem in PDFStructTree(self.pdf, self)]\n        except StructTreeMissing:\n            return []\n\n    @property\n    def layout(self) -> LTPage:\n        if hasattr(self, \"_layout\"):\n            return self._layout\n        device = PDFPageAggregatorWithMarkedContent(\n            self.pdf.rsrcmgr,\n            pageno=self.page_number,\n            laparams=self.pdf.laparams,\n        )\n        interpreter = PDFPageInterpreter(self.pdf.rsrcmgr, device)\n        try:\n            interpreter.process_page(self.page_obj)\n        except Exception as e:\n   
         raise PdfminerException(e)\n        self._layout: LTPage = device.get_result()\n        return self._layout\n\n    @property\n    def annots(self) -> T_obj_list:\n        def rotate_point(pt: Tuple[float, float], r: int) -> Tuple[float, float]:\n            turns = r // 90\n            for i in range(turns):\n                x, y = pt\n                comp = self.width if i == turns % 2 else self.height\n                pt = (y, (comp - x))\n            return pt\n\n        def parse(annot: T_obj) -> T_obj:\n            _a, _b, _c, _d = annot[\"Rect\"]\n            pt0 = rotate_point((_a, _b), self.rotation)\n            pt1 = rotate_point((_c, _d), self.rotation)\n            rh = self.root_page.height\n            x0, top, x1, bottom = _invert_box(_normalize_box((*pt0, *pt1)), rh)\n\n            a = annot.get(\"A\", {})\n            extras = {\n                \"uri\": a.get(\"URI\"),\n                \"title\": annot.get(\"T\"),\n                \"contents\": annot.get(\"Contents\"),\n            }\n            for k, v in extras.items():\n                if v is not None:\n                    try:\n                        extras[k] = v.decode(\"utf-8\")\n                    except UnicodeDecodeError:\n                        try:\n                            extras[k] = v.decode(\"utf-16\")\n                        except UnicodeDecodeError:\n                            if self.pdf.raise_unicode_errors:\n                                raise\n                            warn(\n                                f\"Could not decode {k} of annotation.\"\n                                f\" {k} will be missing.\"\n                            )\n\n            parsed = {\n                \"page_number\": self.page_number,\n                \"object_type\": \"annot\",\n                \"x0\": x0,\n                \"y0\": rh - bottom,\n                \"x1\": x1,\n                \"y1\": rh - top,\n                \"doctop\": self.initial_doctop + top,\n          
      \"top\": top,\n                \"bottom\": bottom,\n                \"width\": x1 - x0,\n                \"height\": bottom - top,\n            }\n            parsed.update(extras)\n            # Replace the indirect reference to the page dictionary\n            # with a pointer to our actual page\n            if \"P\" in annot:\n                annot[\"P\"] = self\n            parsed[\"data\"] = annot\n            return parsed\n\n        raw = resolve_all(self.page_obj.annots) or []\n        parsed = list(map(parse, raw))\n        if isinstance(self, CroppedPage):\n            return self._crop_fn(parsed)\n        else:\n            return parsed\n\n    @property\n    def hyperlinks(self) -> T_obj_list:\n        return [a for a in self.annots if a[\"uri\"] is not None]\n\n    @property\n    def objects(self) -> Dict[str, T_obj_list]:\n        if hasattr(self, \"_objects\"):\n            return self._objects\n        self._objects: Dict[str, T_obj_list] = self.parse_objects()\n        return self._objects\n\n    def point2coord(self, pt: Tuple[T_num, T_num]) -> Tuple[T_num, T_num]:\n        # See note below re. 
#1181 and mediabox-adjustment reversions\n        return (self.mediabox[0] + pt[0], self.mediabox[1] + self.height - pt[1])\n\n    def process_object(self, obj: LTItem) -> T_obj:\n        kind = re.sub(lt_pat, \"\", obj.__class__.__name__).lower()\n\n        def process_attr(item: Tuple[str, Any]) -> Optional[Tuple[str, Any]]:\n            k, v = item\n            if k in ALL_ATTRS:\n                res = resolve_all(v)\n                return (k, res)\n            else:\n                return None\n\n        attr = dict(filter(None, map(process_attr, obj.__dict__.items())))\n\n        attr[\"object_type\"] = kind\n        attr[\"page_number\"] = self.page_number\n\n        for cs in [\"ncs\", \"scs\"]:\n            # Note: As of pdfminer.six v20221105, that library only\n            # exposes ncs for LTChars, and neither attribute for\n            # other objects. Keeping this code here, though,\n            # for ease of addition if color spaces become\n            # more available via pdfminer.six\n            if hasattr(obj, cs):\n                attr[cs] = resolve_and_decode(getattr(obj, cs).name)\n\n        if isinstance(obj, (LTChar, LTTextContainer)):\n            text = obj.get_text()\n            attr[\"text\"] = (\n                normalize_unicode(self.pdf.unicode_norm, text)\n                if self.pdf.unicode_norm is not None\n                else text\n            )\n\n        if isinstance(obj, LTChar):\n            # pdfminer.six (at least as of v20221105) does not\n            # directly expose .stroking_color and .non_stroking_color\n            # for LTChar objects (unlike, e.g., LTRect objects).\n            gs = obj.graphicstate\n            attr[\"stroking_color\"] = (\n                gs.scolor if isinstance(gs.scolor, tuple) else (gs.scolor,)\n            )\n            attr[\"non_stroking_color\"] = (\n                gs.ncolor if isinstance(gs.ncolor, tuple) else (gs.ncolor,)\n            )\n\n            # Handle (rare) byte-encoded 
fontnames\n            if isinstance(attr[\"fontname\"], bytes):  # pragma: nocover\n                attr[\"fontname\"] = fix_fontname_bytes(attr[\"fontname\"])\n\n        elif isinstance(obj, (LTCurve,)):\n            attr[\"pts\"] = list(map(self.point2coord, attr[\"pts\"]))\n\n            # Ignoring typing because type signature for obj.original_path\n            # appears to be incorrect\n            attr[\"path\"] = [(cmd, *map(self.point2coord, pts)) for cmd, *pts in obj.original_path]  # type: ignore  # noqa: E501\n\n            attr[\"dash\"] = obj.dashing_style\n\n        # As noted in #1181, `pdfminer.six` adjusts objects'\n        # coordinates relative to the MediaBox:\n        # https://github.com/pdfminer/pdfminer.six/blob/1a8bd2f730295b31d6165e4d95fcb5a03793c978/pdfminer/converter.py#L79-L84\n        mb_x0, mb_top = self.mediabox[:2]\n\n        if \"y0\" in attr:\n            attr[\"top\"] = (self.height - attr[\"y1\"]) + mb_top\n            attr[\"bottom\"] = (self.height - attr[\"y0\"]) + mb_top\n            attr[\"doctop\"] = self.initial_doctop + attr[\"top\"]\n\n        if \"x0\" in attr and mb_x0 != 0:\n            attr[\"x0\"] = attr[\"x0\"] + mb_x0\n            attr[\"x1\"] = attr[\"x1\"] + mb_x0\n\n        return attr\n\n    def iter_layout_objects(\n        self, layout_objects: List[LTComponent]\n    ) -> Generator[T_obj, None, None]:\n        for obj in layout_objects:\n            # If object is, like LTFigure, a higher-level object ...\n            if isinstance(obj, LTContainer):\n                # and LAParams is passed, process the object itself.\n                if self.pdf.laparams is not None:\n                    yield self.process_object(obj)\n                # Regardless, iterate through its children\n                yield from self.iter_layout_objects(obj._objs)\n            else:\n                yield self.process_object(obj)\n\n    def parse_objects(self) -> Dict[str, T_obj_list]:\n        objects: Dict[str, T_obj_list] = 
{}\n        for obj in self.iter_layout_objects(self.layout._objs):\n            kind = obj[\"object_type\"]\n            if kind in [\"anno\"]:\n                continue\n            if objects.get(kind) is None:\n                objects[kind] = []\n            objects[kind].append(obj)\n        return objects\n\n    def debug_tablefinder(\n        self, table_settings: Optional[T_table_settings] = None\n    ) -> TableFinder:\n        tset = TableSettings.resolve(table_settings)\n        return TableFinder(self, tset)\n\n    def find_tables(\n        self, table_settings: Optional[T_table_settings] = None\n    ) -> List[Table]:\n        tset = TableSettings.resolve(table_settings)\n        return TableFinder(self, tset).tables\n\n    def find_table(\n        self, table_settings: Optional[T_table_settings] = None\n    ) -> Optional[Table]:\n        tset = TableSettings.resolve(table_settings)\n        tables = self.find_tables(tset)\n\n        if len(tables) == 0:\n            return None\n\n        # Return the largest table, as measured by number of cells.\n        def sorter(x: Table) -> Tuple[int, T_num, T_num]:\n            return (-len(x.cells), x.bbox[1], x.bbox[0])\n\n        largest = list(sorted(tables, key=sorter))[0]\n\n        return largest\n\n    def extract_tables(\n        self, table_settings: Optional[T_table_settings] = None\n    ) -> List[List[List[Optional[str]]]]:\n        tset = TableSettings.resolve(table_settings)\n        tables = self.find_tables(tset)\n        return [table.extract(**(tset.text_settings or {})) for table in tables]\n\n    def extract_table(\n        self, table_settings: Optional[T_table_settings] = None\n    ) -> Optional[List[List[Optional[str]]]]:\n        tset = TableSettings.resolve(table_settings)\n        table = self.find_table(tset)\n        if table is None:\n            return None\n        else:\n            return table.extract(**(tset.text_settings or {}))\n\n    def _get_textmap(self, **kwargs: Any) -> 
TextMap:\n        defaults: Dict[str, Any] = dict(\n            layout_bbox=self.bbox,\n        )\n        if \"layout_width_chars\" not in kwargs:\n            defaults.update({\"layout_width\": self.width})\n        if \"layout_height_chars\" not in kwargs:\n            defaults.update({\"layout_height\": self.height})\n        full_kwargs: Dict[str, Any] = {**defaults, **kwargs}\n        return utils.chars_to_textmap(self.chars, **full_kwargs)\n\n    def search(\n        self,\n        pattern: Union[str, Pattern[str]],\n        regex: bool = True,\n        case: bool = True,\n        main_group: int = 0,\n        return_chars: bool = True,\n        return_groups: bool = True,\n        **kwargs: Any,\n    ) -> List[Dict[str, Any]]:\n        textmap = self.get_textmap(**tuplify_list_kwargs(kwargs))\n        return textmap.search(\n            pattern,\n            regex=regex,\n            case=case,\n            main_group=main_group,\n            return_chars=return_chars,\n            return_groups=return_groups,\n        )\n\n    def extract_text(self, **kwargs: Any) -> str:\n        return self.get_textmap(**tuplify_list_kwargs(kwargs)).as_string\n\n    def extract_text_simple(self, **kwargs: Any) -> str:\n        return utils.extract_text_simple(self.chars, **kwargs)\n\n    def extract_words(self, **kwargs: Any) -> T_obj_list:\n        return utils.extract_words(self.chars, **kwargs)\n\n    def extract_text_lines(\n        self, strip: bool = True, return_chars: bool = True, **kwargs: Any\n    ) -> T_obj_list:\n        return self.get_textmap(**tuplify_list_kwargs(kwargs)).extract_text_lines(\n            strip=strip, return_chars=return_chars\n        )\n\n    def crop(\n        self, bbox: T_bbox, relative: bool = False, strict: bool = True\n    ) -> \"CroppedPage\":\n        return CroppedPage(self, bbox, relative=relative, strict=strict)\n\n    def within_bbox(\n        self, bbox: T_bbox, relative: bool = False, strict: bool = True\n    ) -> 
\"CroppedPage\":\n        \"\"\"\n        Same as .crop, except only includes objects fully within the bbox\n        \"\"\"\n        return CroppedPage(\n            self, bbox, relative=relative, strict=strict, crop_fn=utils.within_bbox\n        )\n\n    def outside_bbox(\n        self, bbox: T_bbox, relative: bool = False, strict: bool = True\n    ) -> \"CroppedPage\":\n        \"\"\"\n        Same as .crop, except only includes objects fully outside the bbox\n        \"\"\"\n        return CroppedPage(\n            self, bbox, relative=relative, strict=strict, crop_fn=utils.outside_bbox\n        )\n\n    def filter(self, test_function: Callable[[T_obj], bool]) -> \"FilteredPage\":\n        return FilteredPage(self, test_function)\n\n    def dedupe_chars(self, **kwargs: Any) -> \"FilteredPage\":\n        \"\"\"\n        Removes duplicate chars: those sharing the same text and positioning\n        (within `tolerance`) as other characters in the set. Adjust extra_args\n        to be more/less restrictive with the properties checked.\n        \"\"\"\n        p = FilteredPage(self, lambda x: True)\n        p._objects = {kind: objs for kind, objs in self.objects.items()}\n        p._objects[\"char\"] = utils.dedupe_chars(self.chars, **kwargs)\n        return p\n\n    def to_image(\n        self,\n        resolution: Optional[Union[int, float]] = None,\n        width: Optional[Union[int, float]] = None,\n        height: Optional[Union[int, float]] = None,\n        antialias: bool = False,\n        force_mediabox: bool = False,\n    ) -> \"PageImage\":\n        \"\"\"\n        You can pass a maximum of 1 of the following:\n        - resolution: The desired number of pixels per inch. 
Defaults to 72.\n        - width: The desired image width in pixels.\n        - height: The desired image height in pixels.\n        \"\"\"\n        from .display import DEFAULT_RESOLUTION, PageImage\n\n        num_specs = sum(x is not None for x in [resolution, width, height])\n        if num_specs > 1:\n            raise ValueError(\n                f\"Only one of these arguments can be provided: resolution, width, height. You provided {num_specs}\"  # noqa: E501\n            )\n        elif width is not None:\n            resolution = 72 * width / self.width\n        elif height is not None:\n            resolution = 72 * height / self.height\n\n        return PageImage(\n            self,\n            resolution=resolution or DEFAULT_RESOLUTION,\n            antialias=antialias,\n            force_mediabox=force_mediabox,\n        )\n\n    def to_dict(self, object_types: Optional[List[str]] = None) -> Dict[str, Any]:\n        if object_types is None:\n            _object_types = list(self.objects.keys()) + [\"annot\"]\n        else:\n            _object_types = object_types\n        d = {\n            \"page_number\": self.page_number,\n            \"initial_doctop\": self.initial_doctop,\n            \"rotation\": self.rotation,\n            \"cropbox\": self.cropbox,\n            \"mediabox\": self.mediabox,\n            \"bbox\": self.bbox,\n            \"width\": self.width,\n            \"height\": self.height,\n        }\n        for t in _object_types:\n            d[t + \"s\"] = getattr(self, t + \"s\")\n        return d\n\n    def __repr__(self) -> str:\n        return f\"<Page:{self.page_number}>\"\n\n\nclass DerivedPage(Page):\n    is_original: bool = False\n\n    def __init__(self, parent_page: Page):\n        self.parent_page = parent_page\n        self.root_page = parent_page.root_page\n        self.pdf = parent_page.pdf\n        self.page_obj = parent_page.page_obj\n        self.page_number = parent_page.page_number\n        self.initial_doctop = 
parent_page.initial_doctop\n        self.rotation = parent_page.rotation\n        self.mediabox = parent_page.mediabox\n        self.cropbox = parent_page.cropbox\n        self.flush_cache(Container.cached_properties)\n        self.get_textmap = lru_cache()(self._get_textmap)\n\n\ndef test_proposed_bbox(bbox: T_bbox, parent_bbox: T_bbox) -> None:\n    bbox_area = utils.calculate_area(bbox)\n    if bbox_area == 0:\n        raise ValueError(f\"Bounding box {bbox} has an area of zero.\")\n\n    overlap = utils.get_bbox_overlap(bbox, parent_bbox)\n    if overlap is None:\n        raise ValueError(\n            f\"Bounding box {bbox} is entirely outside \"\n            f\"parent page bounding box {parent_bbox}\"\n        )\n\n    overlap_area = utils.calculate_area(overlap)\n    if overlap_area < bbox_area:\n        raise ValueError(\n            f\"Bounding box {bbox} is not fully within \"\n            f\"parent page bounding box {parent_bbox}\"\n        )\n\n\nclass CroppedPage(DerivedPage):\n    def __init__(\n        self,\n        parent_page: Page,\n        crop_bbox: T_bbox,\n        crop_fn: Callable[[T_obj_list, T_bbox], T_obj_list] = utils.crop_to_bbox,\n        relative: bool = False,\n        strict: bool = True,\n    ):\n        if relative:\n            o_x0, o_top, _, _ = parent_page.bbox\n            x0, top, x1, bottom = crop_bbox\n            crop_bbox = (x0 + o_x0, top + o_top, x1 + o_x0, bottom + o_top)\n\n        if strict:\n            test_proposed_bbox(crop_bbox, parent_page.bbox)\n\n        def _crop_fn(objs: T_obj_list) -> T_obj_list:\n            return crop_fn(objs, crop_bbox)\n\n        super().__init__(parent_page)\n\n        self._crop_fn = _crop_fn\n\n        # Note: testing for original function passed, not _crop_fn\n        if crop_fn is utils.outside_bbox:\n            self.bbox = parent_page.bbox\n        else:\n            self.bbox = crop_bbox\n\n    @property\n    def objects(self) -> Dict[str, T_obj_list]:\n        if 
hasattr(self, \"_objects\"):\n            return self._objects\n        self._objects: Dict[str, T_obj_list] = {\n            k: self._crop_fn(v) for k, v in self.parent_page.objects.items()\n        }\n        return self._objects\n\n\nclass FilteredPage(DerivedPage):\n    def __init__(self, parent_page: Page, filter_fn: Callable[[T_obj], bool]):\n        self.bbox = parent_page.bbox\n        self.filter_fn = filter_fn\n        super().__init__(parent_page)\n\n    @property\n    def objects(self) -> Dict[str, T_obj_list]:\n        if hasattr(self, \"_objects\"):\n            return self._objects\n        self._objects: Dict[str, T_obj_list] = {\n            k: list(filter(self.filter_fn, v))\n            for k, v in self.parent_page.objects.items()\n        }\n        return self._objects\n"
  },
  {
    "path": "pdfplumber/pdf.py",
    "content": "import itertools\nimport logging\nimport pathlib\nfrom io import BufferedReader, BytesIO\nfrom types import TracebackType\nfrom typing import Any, Dict, Generator, List, Literal, Optional, Tuple, Type, Union\n\nfrom pdfminer.layout import LAParams\nfrom pdfminer.pdfdocument import PDFDocument\nfrom pdfminer.pdfinterp import PDFResourceManager\nfrom pdfminer.pdfpage import PDFPage\nfrom pdfminer.pdfparser import PDFParser\n\nfrom ._typing import T_num, T_obj_list\nfrom .container import Container\nfrom .page import Page\nfrom .repair import T_repair_setting, _repair\nfrom .structure import PDFStructTree, StructTreeMissing\nfrom .utils import resolve_and_decode\nfrom .utils.exceptions import PdfminerException\n\nlogger = logging.getLogger(__name__)\n\n\nclass PDF(Container):\n    cached_properties: List[str] = Container.cached_properties + [\"_pages\"]\n\n    def __init__(\n        self,\n        stream: Union[BufferedReader, BytesIO],\n        stream_is_external: bool = False,\n        path: Optional[pathlib.Path] = None,\n        pages: Optional[Union[List[int], Tuple[int]]] = None,\n        laparams: Optional[Dict[str, Any]] = None,\n        password: Optional[str] = None,\n        strict_metadata: bool = False,\n        unicode_norm: Optional[Literal[\"NFC\", \"NFKC\", \"NFD\", \"NFKD\"]] = None,\n        raise_unicode_errors: bool = True,\n    ):\n        self.stream = stream\n        self.stream_is_external = stream_is_external\n        self.path = path\n        self.pages_to_parse = pages\n        self.laparams = None if laparams is None else LAParams(**laparams)\n        self.password = password\n        self.unicode_norm = unicode_norm\n        self.raise_unicode_errors = raise_unicode_errors\n\n        try:\n            self.doc = PDFDocument(PDFParser(stream), password=password or \"\")\n        except Exception as e:\n            raise PdfminerException(e)\n        self.rsrcmgr = PDFResourceManager()\n        self.metadata = {}\n\n        
for info in self.doc.info:\n            self.metadata.update(info)\n        for k, v in self.metadata.items():\n            try:\n                self.metadata[k] = resolve_and_decode(v)\n            except Exception as e:  # pragma: nocover\n                if strict_metadata:\n                    # Raise an exception since unable to resolve the metadata value.\n                    raise\n                # This metadata value could not be parsed. Instead of failing the PDF\n                # read, treat it as a warning only if `strict_metadata=False`.\n                logger.warning(\n                    f'[WARNING] Metadata key \"{k}\" could not be parsed due to '\n                    f\"exception: {str(e)}\"\n                )\n\n    @classmethod\n    def open(\n        cls,\n        path_or_fp: Union[str, pathlib.Path, BufferedReader, BytesIO],\n        pages: Optional[Union[List[int], Tuple[int]]] = None,\n        laparams: Optional[Dict[str, Any]] = None,\n        password: Optional[str] = None,\n        strict_metadata: bool = False,\n        unicode_norm: Optional[Literal[\"NFC\", \"NFKC\", \"NFD\", \"NFKD\"]] = None,\n        repair: bool = False,\n        gs_path: Optional[Union[str, pathlib.Path]] = None,\n        repair_setting: T_repair_setting = \"default\",\n        raise_unicode_errors: bool = True,\n    ) -> \"PDF\":\n\n        stream: Union[BufferedReader, BytesIO]\n\n        if repair:\n            stream = _repair(\n                path_or_fp, password=password, gs_path=gs_path, setting=repair_setting\n            )\n            stream_is_external = False\n            # Although the original file has a path,\n            # the repaired version does not\n            path = None\n        elif isinstance(path_or_fp, (str, pathlib.Path)):\n            stream = open(path_or_fp, \"rb\")\n            stream_is_external = False\n            path = pathlib.Path(path_or_fp)\n        else:\n            stream = path_or_fp\n            stream_is_external = 
True\n            path = None\n\n        try:\n            return cls(\n                stream,\n                path=path,\n                pages=pages,\n                laparams=laparams,\n                password=password,\n                strict_metadata=strict_metadata,\n                unicode_norm=unicode_norm,\n                stream_is_external=stream_is_external,\n                raise_unicode_errors=raise_unicode_errors,\n            )\n\n        except PdfminerException:\n            if not stream_is_external:\n                stream.close()\n            raise\n\n    def close(self) -> None:\n        self.flush_cache()\n\n        for page in self.pages:\n            page.close()\n\n        if not self.stream_is_external:\n            self.stream.close()\n\n    def __enter__(self) -> \"PDF\":\n        return self\n\n    def __exit__(\n        self,\n        t: Optional[Type[BaseException]],\n        value: Optional[BaseException],\n        traceback: Optional[TracebackType],\n    ) -> None:\n        self.close()\n\n    @property\n    def pages(self) -> List[Page]:\n        if hasattr(self, \"_pages\"):\n            return self._pages\n\n        doctop: T_num = 0\n        pp = self.pages_to_parse\n        self._pages: List[Page] = []\n\n        def iter_pages() -> Generator[PDFPage, None, None]:\n            gen = PDFPage.create_pages(self.doc)\n            while True:\n                try:\n                    yield next(gen)\n                except StopIteration:\n                    break\n                except Exception as e:\n                    raise PdfminerException(e)\n\n        for i, page in enumerate(iter_pages()):\n            page_number = i + 1\n            if pp is not None and page_number not in pp:\n                continue\n            p = Page(self, page, page_number=page_number, initial_doctop=doctop)\n            self._pages.append(p)\n            doctop += p.height\n        return self._pages\n\n    @property\n    def objects(self) 
-> Dict[str, T_obj_list]:\n        if hasattr(self, \"_objects\"):\n            return self._objects\n        all_objects: Dict[str, T_obj_list] = {}\n        for p in self.pages:\n            for kind in p.objects.keys():\n                all_objects[kind] = all_objects.get(kind, []) + p.objects[kind]\n        self._objects: Dict[str, T_obj_list] = all_objects\n        return self._objects\n\n    @property\n    def annots(self) -> List[Dict[str, Any]]:\n        gen = (p.annots for p in self.pages)\n        return list(itertools.chain(*gen))\n\n    @property\n    def hyperlinks(self) -> List[Dict[str, Any]]:\n        gen = (p.hyperlinks for p in self.pages)\n        return list(itertools.chain(*gen))\n\n    @property\n    def structure_tree(self) -> List[Dict[str, Any]]:\n        \"\"\"Return the structure tree for the document.\"\"\"\n        try:\n            return [elem.to_dict() for elem in PDFStructTree(self)]\n        except StructTreeMissing:\n            return []\n\n    def to_dict(self, object_types: Optional[List[str]] = None) -> Dict[str, Any]:\n        return {\n            \"metadata\": self.metadata,\n            \"pages\": [page.to_dict(object_types) for page in self.pages],\n        }\n"
  },
  {
    "path": "pdfplumber/py.typed",
    "content": ""
  },
  {
    "path": "pdfplumber/repair.py",
    "content": "import pathlib\nimport shutil\nimport subprocess\nfrom io import BufferedReader, BytesIO\nfrom typing import Literal, Optional, Union\n\nT_repair_setting = Literal[\"default\", \"prepress\", \"printer\", \"ebook\", \"screen\"]\n\n\ndef _repair(\n    path_or_fp: Union[str, pathlib.Path, BufferedReader, BytesIO],\n    password: Optional[str] = None,\n    gs_path: Optional[Union[str, pathlib.Path]] = None,\n    setting: T_repair_setting = \"default\",\n) -> BytesIO:\n\n    executable = (\n        gs_path\n        or shutil.which(\"gs\")\n        or shutil.which(\"gswin32c\")\n        or shutil.which(\"gswin64c\")\n    )\n    if executable is None:  # pragma: nocover\n        raise Exception(\n            \"Cannot find Ghostscript, which is required for repairs.\\n\"\n            \"Visit https://www.ghostscript.com/ for installation instructions.\"\n        )\n\n    repair_args = [\n        executable,\n        \"-sstdout=%stderr\",\n        \"-o\",\n        \"-\",\n        \"-sDEVICE=pdfwrite\",\n        f\"-dPDFSETTINGS=/{setting}\",\n    ]\n\n    if password:\n        repair_args += [f\"-sPDFPassword={password}\"]\n\n    if isinstance(path_or_fp, (str, pathlib.Path)):\n        stdin = None\n        repair_args += [str(pathlib.Path(path_or_fp).absolute())]\n    else:\n        stdin = path_or_fp\n        repair_args += [\"-\"]\n\n    proc = subprocess.Popen(\n        repair_args,\n        stdin=subprocess.PIPE if stdin else None,\n        stdout=subprocess.PIPE,\n        stderr=subprocess.PIPE,\n    )\n\n    stdout, stderr = proc.communicate(stdin.read() if stdin else None)\n\n    if proc.returncode:\n        raise Exception(f\"{stderr.decode('utf-8')}\")\n\n    return BytesIO(stdout)\n\n\ndef repair(\n    path_or_fp: Union[str, pathlib.Path, BufferedReader, BytesIO],\n    outfile: Optional[Union[str, pathlib.Path]] = None,\n    password: Optional[str] = None,\n    gs_path: Optional[Union[str, pathlib.Path]] = None,\n    setting: T_repair_setting = 
\"default\",\n) -> Optional[BytesIO]:\n    repaired = _repair(path_or_fp, password, gs_path=gs_path, setting=setting)\n    if outfile:\n        with open(outfile, \"wb\") as f:\n            f.write(repaired.read())\n        return None\n    else:\n        return repaired\n"
  },
  {
    "path": "pdfplumber/structure.py",
    "content": "import itertools\nimport logging\nimport re\nfrom collections import deque\nfrom dataclasses import asdict, dataclass, field\nfrom typing import (\n    TYPE_CHECKING,\n    Any,\n    Callable,\n    Dict,\n    Iterable,\n    Iterator,\n    List,\n    Optional,\n    Pattern,\n    Tuple,\n    Union,\n)\n\nfrom pdfminer.data_structures import NumberTree\nfrom pdfminer.pdfparser import PDFParser\nfrom pdfminer.pdftypes import PDFObjRef, resolve1\nfrom pdfminer.psparser import PSLiteral\n\nfrom ._typing import T_bbox, T_obj\nfrom .utils import decode_text, geometry\n\nlogger = logging.getLogger(__name__)\n\n\nif TYPE_CHECKING:  # pragma: nocover\n    from .page import Page\n    from .pdf import PDF\n\n\nMatchFunc = Callable[[\"PDFStructElement\"], bool]\n\n\ndef _find_all(\n    elements: Iterable[\"PDFStructElement\"],\n    matcher: Union[str, Pattern[str], MatchFunc],\n) -> Iterator[\"PDFStructElement\"]:\n    \"\"\"\n    Common code for `find_all()` in trees and elements.\n    \"\"\"\n\n    def match_tag(x: \"PDFStructElement\") -> bool:\n        \"\"\"Match an element name.\"\"\"\n        return x.type == matcher\n\n    def match_regex(x: \"PDFStructElement\") -> bool:\n        \"\"\"Match an element name by regular expression.\"\"\"\n        return matcher.match(x.type)  # type: ignore\n\n    if isinstance(matcher, str):\n        match_func = match_tag\n    elif isinstance(matcher, re.Pattern):\n        match_func = match_regex\n    else:\n        match_func = matcher  # type: ignore\n    d = deque(elements)\n    while d:\n        el = d.popleft()\n        if match_func(el):\n            yield el\n        d.extendleft(reversed(el.children))\n\n\nclass Findable:\n    \"\"\"find() and find_all() methods that can be inherited to avoid\n    repeating oneself\"\"\"\n\n    children: List[\"PDFStructElement\"]\n\n    def find_all(\n        self, matcher: Union[str, Pattern[str], MatchFunc]\n    ) -> Iterator[\"PDFStructElement\"]:\n        \"\"\"Iterate 
depth-first over matching elements in subtree.\n\n        The `matcher` argument is either an element name, a regular\n        expression, or a function taking a `PDFStructElement` and\n        returning `True` if the element matches.\n        \"\"\"\n        return _find_all(self.children, matcher)\n\n    def find(\n        self, matcher: Union[str, Pattern[str], MatchFunc]\n    ) -> Optional[\"PDFStructElement\"]:\n        \"\"\"Find the first matching element in subtree.\n\n        The `matcher` argument is either an element name, a regular\n        expression, or a function taking a `PDFStructElement` and\n        returning `True` if the element matches.\n        \"\"\"\n        try:\n            return next(_find_all(self.children, matcher))\n        except StopIteration:\n            return None\n\n\n@dataclass\nclass PDFStructElement(Findable):\n    type: str\n    revision: Optional[int]\n    id: Optional[str]\n    lang: Optional[str]\n    alt_text: Optional[str]\n    actual_text: Optional[str]\n    title: Optional[str]\n    page_number: Optional[int]\n    attributes: Dict[str, Any] = field(default_factory=dict)\n    mcids: List[int] = field(default_factory=list)\n    children: List[\"PDFStructElement\"] = field(default_factory=list)\n\n    def __iter__(self) -> Iterator[\"PDFStructElement\"]:\n        return iter(self.children)\n\n    def all_mcids(self) -> Iterator[Tuple[Optional[int], int]]:\n        \"\"\"Collect all MCIDs (with their page numbers, if there are\n        multiple pages in the tree) inside a structure element.\n        \"\"\"\n        # Collect them depth-first to preserve ordering\n        for mcid in self.mcids:\n            yield self.page_number, mcid\n        d = deque(self.children)\n        while d:\n            el = d.popleft()\n            for mcid in el.mcids:\n                yield el.page_number, mcid\n            d.extendleft(reversed(el.children))\n\n    def to_dict(self) -> Dict[str, Any]:\n        \"\"\"Return a compacted 
dict representation.\"\"\"\n        r = asdict(self)\n        # Prune empty values (does not matter in which order)\n        d = deque([r])\n        while d:\n            el = d.popleft()\n            for k in list(el.keys()):\n                if el[k] is None or el[k] == [] or el[k] == {}:\n                    del el[k]\n            if \"children\" in el:\n                d.extend(el[\"children\"])\n        return r\n\n\nclass StructTreeMissing(ValueError):\n    pass\n\n\nclass PDFStructTree(Findable):\n    \"\"\"Parse the structure tree of a PDF.\n\n    The constructor takes a `pdfplumber.PDF` and optionally a\n    `pdfplumber.Page`.  To avoid creating the entire tree for a large\n    document it is recommended to provide a page.\n\n    This class creates a representation of the portion of the\n    structure tree that reaches marked content sections, either for a\n    single page, or for the whole document.  Note that this is slightly\n    different from the behaviour of other PDF libraries which will\n    also include structure elements with no content.\n\n    If the PDF has no structure, the constructor will raise\n    `StructTreeMissing`.\n\n    \"\"\"\n\n    page: Optional[\"Page\"]\n\n    def __init__(self, doc: \"PDF\", page: Optional[\"Page\"] = None):\n        self.doc = doc.doc\n        if \"StructTreeRoot\" not in self.doc.catalog:\n            raise StructTreeMissing(\"PDF has no structure\")\n        self.root = resolve1(self.doc.catalog[\"StructTreeRoot\"])\n        self.role_map = resolve1(self.root.get(\"RoleMap\", {}))\n        self.class_map = resolve1(self.root.get(\"ClassMap\", {}))\n        self.children: List[PDFStructElement] = []\n\n        # If we have a specific page then we will work backwards from\n        # its ParentTree - this is because structure elements could\n        # span multiple pages, and the \"Pg\" attribute is *optional*,\n        # so this is the approved way to get a page's structure...\n        if page is not None:\n    
        self.page = page\n            self.pages = {page.page_number: page}\n            self.page_dict = None\n            # ...EXCEPT that the ParentTree is sometimes missing, in which\n            # case we fall back to the non-approved way.\n            parent_tree_obj = self.root.get(\"ParentTree\")\n            if parent_tree_obj is None:\n                self._parse_struct_tree()\n            else:\n                parent_tree = NumberTree(parent_tree_obj)\n                # If there is no marked content in the structure tree for\n                # this page (which can happen even when there is a\n                # structure tree) then there is no `StructParents`.\n                # Note however that if there are XObjects in a page,\n                # *they* may have `StructParent` (not `StructParents`)\n                if \"StructParents\" not in self.page.page_obj.attrs:\n                    return\n                parent_id = self.page.page_obj.attrs[\"StructParents\"]\n                # NumberTree should have a `get` method like it does in pdf.js...\n                parent_array = resolve1(\n                    next(array for num, array in parent_tree.values if num == parent_id)\n                )\n                self._parse_parent_tree(parent_array)\n        else:\n            self.page = None\n            # Overhead of creating pages shouldn't be too bad we hope!\n            self.pages = {page.page_number: page for page in doc.pages}\n            self.page_dict = {\n                page.page_obj.pageid: page.page_number for page in self.pages.values()\n            }\n            self._parse_struct_tree()\n\n    def _make_attributes(\n        self, obj: Dict[str, Any], revision: Optional[int]\n    ) -> Dict[str, Any]:\n        attr_obj_list = []\n        for key in \"C\", \"A\":\n            if key not in obj:\n                continue\n            attr_obj = resolve1(obj[key])\n            # It could be a list of attribute objects (why?)\n            
if isinstance(attr_obj, list):\n                attr_obj_list.extend(attr_obj)\n            else:\n                attr_obj_list.append(attr_obj)\n        attr_objs = []\n        prev_obj = None\n        for aref in attr_obj_list:\n            # If we find a revision number, which might \"follow the\n            # revision object\" (the spec is not clear about what this\n            # should look like but it implies they are simply adjacent\n            # in a flat array), then use it to decide whether to take\n            # the previous object...\n            if isinstance(aref, int):\n                if aref == revision and prev_obj is not None:\n                    attr_objs.append(prev_obj)\n                prev_obj = None\n            else:\n                if prev_obj is not None:\n                    attr_objs.append(prev_obj)\n                prev_obj = resolve1(aref)\n        if prev_obj is not None:\n            attr_objs.append(prev_obj)\n        # Now merge all the collected attribute objects into a\n        # single set (again, the spec doesn't really explain this but\n        # does say that attributes in /A supersede those in /C)\n        attr = {}\n        for obj in attr_objs:\n            if isinstance(obj, PSLiteral):\n                key = decode_text(obj.name)\n                if key not in self.class_map:\n                    logger.warning(\"Unknown attribute class %s\", key)\n                    continue\n                obj = self.class_map[key]\n            for k, v in obj.items():\n                if isinstance(v, PSLiteral):\n                    attr[k] = decode_text(v.name)\n                else:\n                    attr[k] = obj[k]\n        return attr\n\n    def _make_element(self, obj: Any) -> Tuple[Optional[PDFStructElement], List[Any]]:\n        # We hopefully caught these earlier\n        assert \"MCID\" not in obj, \"Uncaught MCR: %s\" % obj\n        assert \"Obj\" not in obj, \"Uncaught OBJR: %s\" % obj\n        # Get page 
number if necessary\n        page_number = None\n        if self.page_dict is not None and \"Pg\" in obj:\n            page_objid = obj[\"Pg\"].objid\n            assert page_objid in self.page_dict, \"Object on unparsed page: %s\" % obj\n            page_number = self.page_dict[page_objid]\n        obj_tag = \"\"\n        if \"S\" in obj:\n            obj_tag = decode_text(obj[\"S\"].name)\n            if obj_tag in self.role_map:\n                obj_tag = decode_text(self.role_map[obj_tag].name)\n        children = resolve1(obj[\"K\"]) if \"K\" in obj else []\n        if isinstance(children, int):  # ugh... isinstance...\n            children = [children]\n        elif isinstance(children, dict):  # a single object.. ugh...\n            children = [obj[\"K\"]]\n        revision = obj.get(\"R\")\n        attributes = self._make_attributes(obj, revision)\n        element_id = decode_text(resolve1(obj[\"ID\"])) if \"ID\" in obj else None\n        title = decode_text(resolve1(obj[\"T\"])) if \"T\" in obj else None\n        lang = decode_text(resolve1(obj[\"Lang\"])) if \"Lang\" in obj else None\n        alt_text = decode_text(resolve1(obj[\"Alt\"])) if \"Alt\" in obj else None\n        actual_text = (\n            decode_text(resolve1(obj[\"ActualText\"])) if \"ActualText\" in obj else None\n        )\n        element = PDFStructElement(\n            type=obj_tag,\n            id=element_id,\n            page_number=page_number,\n            revision=revision,\n            lang=lang,\n            title=title,\n            alt_text=alt_text,\n            actual_text=actual_text,\n            attributes=attributes,\n        )\n        return element, children\n\n    def _parse_parent_tree(self, parent_array: List[Any]) -> None:\n        \"\"\"Populate the structure tree using the leaves of the parent tree for\n        a given page.\"\"\"\n        # First walk backwards from the leaves to the root, tracking references\n        d = deque(parent_array)\n        s = {}\n  
      found_root = False\n        while d:\n            ref = d.popleft()\n            # In the case where an MCID is not associated with any\n            # structure, there will be a \"null\" in the parent tree.\n            if ref == PDFParser.KEYWORD_NULL:\n                continue\n            if repr(ref) in s:\n                continue\n            obj = resolve1(ref)\n            # This is required! It's in the spec!\n            if \"Type\" in obj and decode_text(obj[\"Type\"].name) == \"StructTreeRoot\":\n                found_root = True\n            else:\n                # We hope that these are actual elements and not\n                # references or marked-content sections...\n                element, children = self._make_element(obj)\n                # We have no page tree so we assume this page was parsed\n                assert element is not None\n                s[repr(ref)] = element, children\n                d.append(obj[\"P\"])\n        # If we didn't reach the root something is quite wrong!\n        assert found_root\n        self._resolve_children(s)\n\n    def on_parsed_page(self, obj: Dict[str, Any]) -> bool:\n        if \"Pg\" not in obj:\n            return True\n        page_objid = obj[\"Pg\"].objid\n        if self.page_dict is not None:\n            return page_objid in self.page_dict\n        if self.page is not None:\n            # We have to do this to satisfy mypy\n            if page_objid != self.page.page_obj.pageid:\n                return False\n        return True\n\n    def _parse_struct_tree(self) -> None:\n        \"\"\"Populate the structure tree starting from the root, skipping\n        unparsed pages and empty elements.\"\"\"\n        root = resolve1(self.root[\"K\"])\n\n        # It could just be a single object ... 
it's in the spec (argh)\n        if isinstance(root, dict):\n            root = [self.root[\"K\"]]\n        d = deque(root)\n        s = {}\n        while d:\n            ref = d.popleft()\n            # In case the tree is actually a DAG and not a tree...\n            if repr(ref) in s:  # pragma: nocover (shouldn't happen)\n                continue\n            obj = resolve1(ref)\n            # Deref top-level OBJR skipping refs to unparsed pages\n            if isinstance(obj, dict) and \"Obj\" in obj:\n                if not self.on_parsed_page(obj):\n                    continue\n                ref = obj[\"Obj\"]\n                obj = resolve1(ref)\n            element, children = self._make_element(obj)\n            # Similar to above, delay resolving the children to avoid\n            # tree-recursion.\n            s[repr(ref)] = element, children\n            for child in children:\n                obj = resolve1(child)\n                if isinstance(obj, dict):\n                    if not self.on_parsed_page(obj):\n                        continue\n                    if \"Obj\" in obj:\n                        child = obj[\"Obj\"]\n                    elif \"MCID\" in obj:\n                        continue\n                if isinstance(child, PDFObjRef):\n                    d.append(child)\n\n        # Traverse depth-first, removing empty elements (unsure how to\n        # do this non-recursively)\n        def prune(elements: List[Any]) -> List[Any]:\n            next_elements = []\n            for ref in elements:\n                obj = resolve1(ref)\n                if isinstance(ref, int):\n                    next_elements.append(ref)\n                    continue\n                elif isinstance(obj, dict):\n                    if not self.on_parsed_page(obj):\n                        continue\n                    if \"MCID\" in obj:\n                        next_elements.append(obj[\"MCID\"])\n                        continue\n                  
  elif \"Obj\" in obj:\n                        ref = obj[\"Obj\"]\n                element, children = s[repr(ref)]\n                children = prune(children)\n                # See assertions below\n                if element is None or not children:\n                    del s[repr(ref)]\n                else:\n                    s[repr(ref)] = element, children\n                    next_elements.append(ref)\n            return next_elements\n\n        prune(root)\n        self._resolve_children(s)\n\n    def _resolve_children(self, seen: Dict[str, Any]) -> None:\n        \"\"\"Resolve children starting from the tree root based on references we\n        saw when traversing the structure tree.\n        \"\"\"\n        root = resolve1(self.root[\"K\"])\n        # It could just be a single object ... it's in the spec (argh)\n        if isinstance(root, dict):\n            root = [self.root[\"K\"]]\n        self.children = []\n        # Create top-level self.children\n        parsed_root = []\n        for ref in root:\n            obj = resolve1(ref)\n            if isinstance(obj, dict) and \"Obj\" in obj:\n                if not self.on_parsed_page(obj):\n                    continue\n                ref = obj[\"Obj\"]\n            if repr(ref) in seen:\n                parsed_root.append(ref)\n        d = deque(parsed_root)\n        while d:\n            ref = d.popleft()\n            element, children = seen[repr(ref)]\n            assert element is not None, \"Unparsed element\"\n            for child in children:\n                obj = resolve1(child)\n                if isinstance(obj, int):\n                    element.mcids.append(obj)\n                elif isinstance(obj, dict):\n                    # Skip out-of-page MCIDS and OBJRs\n                    if not self.on_parsed_page(obj):\n                        continue\n                    if \"MCID\" in obj:\n                        element.mcids.append(obj[\"MCID\"])\n                    elif \"Obj\" 
in obj:\n                        child = obj[\"Obj\"]\n                # NOTE: if, not elif, in case of OBJR above\n                if isinstance(child, PDFObjRef):\n                    child_element, _ = seen.get(repr(child), (None, None))\n                    if child_element is not None:\n                        element.children.append(child_element)\n                        d.append(child)\n        self.children = [seen[repr(ref)][0] for ref in parsed_root]\n\n    def __iter__(self) -> Iterator[PDFStructElement]:\n        return iter(self.children)\n\n    def element_bbox(self, el: PDFStructElement) -> T_bbox:\n        \"\"\"Get the bounding box for an element for visual debugging.\"\"\"\n        page = None\n        if self.page is not None:\n            page = self.page\n        elif el.page_number is not None:\n            page = self.pages[el.page_number]\n        bbox = el.attributes.get(\"BBox\", None)\n        if page is not None and bbox is not None:\n            from .page import CroppedPage, _invert_box, _normalize_box\n\n            # Use secret knowledge of CroppedPage (cannot use\n            # page.height because it is the *cropped* dimension, but\n            # cropping does not actually translate coordinates)\n            bbox = _invert_box(\n                _normalize_box(bbox), page.mediabox[3] - page.mediabox[1]\n            )\n            # Use more secret knowledge of CroppedPage\n            if isinstance(page, CroppedPage):\n                rect = geometry.bbox_to_rect(bbox)\n                rects = page._crop_fn([rect])\n                if not rects:\n                    raise IndexError(\"Element no longer on page\")\n                return geometry.obj_to_bbox(rects[0])\n            else:\n                # Not sure why mypy complains here\n                return bbox  # type: ignore\n        else:\n            mcid_objs = []\n            for page_number, mcid in el.all_mcids():\n                objects: Iterable[T_obj]\n               
 if page_number is None:\n                    if page is not None:\n                        objects = itertools.chain.from_iterable(page.objects.values())\n                    else:\n                        objects = []  # pragma: nocover\n                else:\n                    objects = itertools.chain.from_iterable(\n                        self.pages[page_number].objects.values()\n                    )\n                for c in objects:\n                    if c[\"mcid\"] == mcid:\n                        mcid_objs.append(c)\n            if not mcid_objs:\n                raise IndexError(\"No objects found\")  # pragma: nocover\n            return geometry.objects_to_bbox(mcid_objs)\n"
  },
  {
    "path": "pdfplumber/table.py",
    "content": "import itertools\nfrom dataclasses import dataclass\nfrom operator import itemgetter\nfrom typing import TYPE_CHECKING, Any, Dict, List, Optional, Set, Tuple, Type, Union\n\nfrom . import utils\nfrom ._typing import T_bbox, T_num, T_obj, T_obj_iter, T_obj_list, T_point\n\nDEFAULT_SNAP_TOLERANCE = 3\nDEFAULT_JOIN_TOLERANCE = 3\nDEFAULT_MIN_WORDS_VERTICAL = 3\nDEFAULT_MIN_WORDS_HORIZONTAL = 1\n\nT_intersections = Dict[T_point, Dict[str, T_obj_list]]\nT_table_settings = Union[\"TableSettings\", Dict[str, Any]]\n\nif TYPE_CHECKING:  # pragma: nocover\n    from .page import Page\n\n\ndef snap_edges(\n    edges: T_obj_list,\n    x_tolerance: T_num = DEFAULT_SNAP_TOLERANCE,\n    y_tolerance: T_num = DEFAULT_SNAP_TOLERANCE,\n) -> T_obj_list:\n    \"\"\"\n    Given a list of edges, snap any within `tolerance` pixels of one another\n    to their positional average.\n    \"\"\"\n    by_orientation: Dict[str, T_obj_list] = {\"v\": [], \"h\": []}\n    for e in edges:\n        by_orientation[e[\"orientation\"]].append(e)\n\n    snapped_v = utils.snap_objects(by_orientation[\"v\"], \"x0\", x_tolerance)\n    snapped_h = utils.snap_objects(by_orientation[\"h\"], \"top\", y_tolerance)\n    return snapped_v + snapped_h\n\n\ndef join_edge_group(\n    edges: T_obj_iter, orientation: str, tolerance: T_num = DEFAULT_JOIN_TOLERANCE\n) -> T_obj_list:\n    \"\"\"\n    Given a list of edges along the same infinite line, join those that\n    are within `tolerance` pixels of one another.\n    \"\"\"\n    if orientation == \"h\":\n        min_prop, max_prop = \"x0\", \"x1\"\n    elif orientation == \"v\":\n        min_prop, max_prop = \"top\", \"bottom\"\n    else:\n        raise ValueError(\"Orientation must be 'v' or 'h'\")\n\n    sorted_edges = list(sorted(edges, key=itemgetter(min_prop)))\n    joined = [sorted_edges[0]]\n    for e in sorted_edges[1:]:\n        last = joined[-1]\n        if e[min_prop] <= (last[max_prop] + tolerance):\n            if e[max_prop] > 
last[max_prop]:\n                # Extend current edge to new extremity\n                joined[-1] = utils.resize_object(last, max_prop, e[max_prop])\n        else:\n            # Edge is separate from previous edges\n            joined.append(e)\n\n    return joined\n\n\ndef merge_edges(\n    edges: T_obj_list,\n    snap_x_tolerance: T_num,\n    snap_y_tolerance: T_num,\n    join_x_tolerance: T_num,\n    join_y_tolerance: T_num,\n) -> T_obj_list:\n    \"\"\"\n    Using the `snap_edges` and `join_edge_group` methods above,\n    merge a list of edges into a more \"seamless\" list.\n    \"\"\"\n\n    def get_group(edge: T_obj) -> Tuple[str, T_num]:\n        if edge[\"orientation\"] == \"h\":\n            return (\"h\", edge[\"top\"])\n        else:\n            return (\"v\", edge[\"x0\"])\n\n    if snap_x_tolerance > 0 or snap_y_tolerance > 0:\n        edges = snap_edges(edges, snap_x_tolerance, snap_y_tolerance)\n\n    _sorted = sorted(edges, key=get_group)\n    edge_groups = itertools.groupby(_sorted, key=get_group)\n    edge_gen = (\n        join_edge_group(\n            items, k[0], (join_x_tolerance if k[0] == \"h\" else join_y_tolerance)\n        )\n        for k, items in edge_groups\n    )\n    edges = list(itertools.chain(*edge_gen))\n    return edges\n\n\ndef words_to_edges_h(\n    words: T_obj_list, word_threshold: int = DEFAULT_MIN_WORDS_HORIZONTAL\n) -> T_obj_list:\n    \"\"\"\n    Find (imaginary) horizontal lines that connect the tops\n    of at least `word_threshold` words.\n    \"\"\"\n    by_top = utils.cluster_objects(words, itemgetter(\"top\"), 1)\n    large_clusters = filter(lambda x: len(x) >= word_threshold, by_top)\n    rects = list(map(utils.objects_to_rect, large_clusters))\n    if len(rects) == 0:\n        return []\n    min_x0 = min(map(itemgetter(\"x0\"), rects))\n    max_x1 = max(map(itemgetter(\"x1\"), rects))\n\n    edges = []\n    for r in rects:\n        edges += [\n            # Top of text\n            {\n                \"x0\": 
min_x0,\n                \"x1\": max_x1,\n                \"top\": r[\"top\"],\n                \"bottom\": r[\"top\"],\n                \"width\": max_x1 - min_x0,\n                \"orientation\": \"h\",\n            },\n            # For each detected row, we also add the 'bottom' line.  This will\n            # generate extra edges, (some will be redundant with the next row\n            # 'top' line), but this catches the last row of every table.\n            {\n                \"x0\": min_x0,\n                \"x1\": max_x1,\n                \"top\": r[\"bottom\"],\n                \"bottom\": r[\"bottom\"],\n                \"width\": max_x1 - min_x0,\n                \"orientation\": \"h\",\n            },\n        ]\n\n    return edges\n\n\ndef words_to_edges_v(\n    words: T_obj_list, word_threshold: int = DEFAULT_MIN_WORDS_VERTICAL\n) -> T_obj_list:\n    \"\"\"\n    Find (imaginary) vertical lines that connect the left, right, or\n    center of at least `word_threshold` words.\n    \"\"\"\n    # Find words that share the same left, right, or centerpoints\n    by_x0 = utils.cluster_objects(words, itemgetter(\"x0\"), 1)\n    by_x1 = utils.cluster_objects(words, itemgetter(\"x1\"), 1)\n\n    def get_center(word: T_obj) -> T_num:\n        return float(word[\"x0\"] + word[\"x1\"]) / 2\n\n    by_center = utils.cluster_objects(words, get_center, 1)\n    clusters = by_x0 + by_x1 + by_center\n\n    # Find the points that align with the most words\n    sorted_clusters = sorted(clusters, key=lambda x: -len(x))\n    large_clusters = filter(lambda x: len(x) >= word_threshold, sorted_clusters)\n\n    # For each of those points, find the bboxes fitting all matching words\n    bboxes = list(map(utils.objects_to_bbox, large_clusters))\n\n    # Iterate through those bboxes, condensing overlapping bboxes\n    condensed_bboxes: List[T_bbox] = []\n    for bbox in bboxes:\n        overlap = any(utils.get_bbox_overlap(bbox, c) for c in condensed_bboxes)\n        if not 
overlap:\n            condensed_bboxes.append(bbox)\n\n    if len(condensed_bboxes) == 0:\n        return []\n\n    condensed_rects = map(utils.bbox_to_rect, condensed_bboxes)\n    sorted_rects = list(sorted(condensed_rects, key=itemgetter(\"x0\")))\n\n    max_x1 = max(map(itemgetter(\"x1\"), sorted_rects))\n    min_top = min(map(itemgetter(\"top\"), sorted_rects))\n    max_bottom = max(map(itemgetter(\"bottom\"), sorted_rects))\n\n    return [\n        {\n            \"x0\": b[\"x0\"],\n            \"x1\": b[\"x0\"],\n            \"top\": min_top,\n            \"bottom\": max_bottom,\n            \"height\": max_bottom - min_top,\n            \"orientation\": \"v\",\n        }\n        for b in sorted_rects\n    ] + [\n        {\n            \"x0\": max_x1,\n            \"x1\": max_x1,\n            \"top\": min_top,\n            \"bottom\": max_bottom,\n            \"height\": max_bottom - min_top,\n            \"orientation\": \"v\",\n        }\n    ]\n\n\ndef edges_to_intersections(\n    edges: T_obj_list, x_tolerance: T_num = 1, y_tolerance: T_num = 1\n) -> T_intersections:\n    \"\"\"\n    Given a list of edges, return the points at which they intersect\n    within `tolerance` pixels.\n    \"\"\"\n    intersections: T_intersections = {}\n    v_edges, h_edges = [\n        list(filter(lambda x: x[\"orientation\"] == o, edges)) for o in (\"v\", \"h\")\n    ]\n    for v in sorted(v_edges, key=itemgetter(\"x0\", \"top\")):\n        for h in sorted(h_edges, key=itemgetter(\"top\", \"x0\")):\n            if (\n                (v[\"top\"] <= (h[\"top\"] + y_tolerance))\n                and (v[\"bottom\"] >= (h[\"top\"] - y_tolerance))\n                and (v[\"x0\"] >= (h[\"x0\"] - x_tolerance))\n                and (v[\"x0\"] <= (h[\"x1\"] + x_tolerance))\n            ):\n                vertex = (v[\"x0\"], h[\"top\"])\n                if vertex not in intersections:\n                    intersections[vertex] = {\"v\": [], \"h\": []}\n                
intersections[vertex][\"v\"].append(v)\n                intersections[vertex][\"h\"].append(h)\n    return intersections\n\n\ndef intersections_to_cells(intersections: T_intersections) -> List[T_bbox]:\n    \"\"\"\n    Given a list of points (`intersections`), return all rectangular \"cells\"\n    that those points describe.\n\n    `intersections` should be a dictionary with (x0, top) tuples as keys,\n    and a list of edge objects as values. The edge objects should correspond\n    to the edges that touch the intersection.\n    \"\"\"\n\n    def edge_connects(p1: T_point, p2: T_point) -> bool:\n        def edges_to_set(edges: T_obj_list) -> Set[T_bbox]:\n            return set(map(utils.obj_to_bbox, edges))\n\n        if p1[0] == p2[0]:\n            common = edges_to_set(intersections[p1][\"v\"]).intersection(\n                edges_to_set(intersections[p2][\"v\"])\n            )\n            if len(common):\n                return True\n\n        if p1[1] == p2[1]:\n            common = edges_to_set(intersections[p1][\"h\"]).intersection(\n                edges_to_set(intersections[p2][\"h\"])\n            )\n            if len(common):\n                return True\n        return False\n\n    points = list(sorted(intersections.keys()))\n    n_points = len(points)\n\n    def find_smallest_cell(points: List[T_point], i: int) -> Optional[T_bbox]:\n        if i == n_points - 1:\n            return None\n        pt = points[i]\n        rest = points[i + 1 :]\n        # Get all the points directly below and directly right\n        below = [x for x in rest if x[0] == pt[0]]\n        right = [x for x in rest if x[1] == pt[1]]\n        for below_pt in below:\n            if not edge_connects(pt, below_pt):\n                continue\n\n            for right_pt in right:\n                if not edge_connects(pt, right_pt):\n                    continue\n\n                bottom_right = (right_pt[0], below_pt[1])\n\n                if (\n                    (bottom_right in 
intersections)\n                    and edge_connects(bottom_right, right_pt)\n                    and edge_connects(bottom_right, below_pt)\n                ):\n\n                    return (pt[0], pt[1], bottom_right[0], bottom_right[1])\n        return None\n\n    cell_gen = (find_smallest_cell(points, i) for i in range(len(points)))\n    return list(filter(None, cell_gen))\n\n\ndef cells_to_tables(cells: List[T_bbox]) -> List[List[T_bbox]]:\n    \"\"\"\n    Given a list of bounding boxes (`cells`), return a list of tables that\n    hold those cells most simply (and contiguously).\n    \"\"\"\n\n    def bbox_to_corners(bbox: T_bbox) -> Tuple[T_point, T_point, T_point, T_point]:\n        x0, top, x1, bottom = bbox\n        return ((x0, top), (x0, bottom), (x1, top), (x1, bottom))\n\n    remaining_cells = list(cells)\n\n    # Iterate through the cells found above, and assign them\n    # to contiguous tables\n\n    current_corners: Set[T_point] = set()\n    current_cells: List[T_bbox] = []\n\n    tables = []\n    while len(remaining_cells):\n        initial_cell_count = len(current_cells)\n        for cell in list(remaining_cells):\n            cell_corners = bbox_to_corners(cell)\n            # If we're just starting a table ...\n            if len(current_cells) == 0:\n                # ... immediately assign it to the empty group\n                current_corners |= set(cell_corners)\n                current_cells.append(cell)\n                remaining_cells.remove(cell)\n            else:\n                # How many corners does this table share with the current group?\n                corner_count = sum(c in current_corners for c in cell_corners)\n\n                # If touching on at least one corner...\n                if corner_count > 0:\n                    # ... 
assign it to the current group\n                    current_corners |= set(cell_corners)\n                    current_cells.append(cell)\n                    remaining_cells.remove(cell)\n\n        # If this iteration did not find any more cells to append...\n        if len(current_cells) == initial_cell_count:\n            # ... start a new cell group\n            tables.append(list(current_cells))\n            current_corners.clear()\n            current_cells.clear()\n\n    # Once we have exhausted the list of cells ...\n\n    # ... and we have a cell group that has not been stored\n    if len(current_cells):\n        # ... store it.\n        tables.append(list(current_cells))\n\n    # Sort the tables top-to-bottom-left-to-right based on the value of the\n    # topmost-and-then-leftmost coordinate of a table.\n    _sorted = sorted(tables, key=lambda t: min((c[1], c[0]) for c in t))\n    filtered = [t for t in _sorted if len(t) > 1]\n    return filtered\n\n\nclass CellGroup(object):\n    def __init__(self, cells: List[Optional[T_bbox]]):\n        self.cells = cells\n        self.bbox = (\n            min(map(itemgetter(0), filter(None, cells))),\n            min(map(itemgetter(1), filter(None, cells))),\n            max(map(itemgetter(2), filter(None, cells))),\n            max(map(itemgetter(3), filter(None, cells))),\n        )\n\n\nclass Row(CellGroup):\n    pass\n\n\nclass Column(CellGroup):\n    pass\n\n\nclass Table(object):\n    def __init__(self, page: \"Page\", cells: List[T_bbox]):\n        self.page = page\n        self.cells = cells\n\n    @property\n    def bbox(self) -> T_bbox:\n        c = self.cells\n        return (\n            min(map(itemgetter(0), c)),\n            min(map(itemgetter(1), c)),\n            max(map(itemgetter(2), c)),\n            max(map(itemgetter(3), c)),\n        )\n\n    def _get_rows_or_cols(self, kind: Type[CellGroup]) -> List[CellGroup]:\n        axis = 0 if kind is Row else 1\n        antiaxis = int(not axis)\n\n      
  # Sort first by top/x0, then by x0/top\n        _sorted = sorted(self.cells, key=itemgetter(antiaxis, axis))\n\n        # Get all distinct x0s/tops, sorted\n        xs = list(sorted(set(map(itemgetter(axis), self.cells))))\n\n        # Group by top/x0\n        grouped = itertools.groupby(_sorted, itemgetter(antiaxis))\n\n        rows = []\n        # for y/x, row/column-cells ...\n        for y, row_cells in grouped:\n            xdict = {cell[axis]: cell for cell in row_cells}\n            row = kind([xdict.get(x) for x in xs])\n            rows.append(row)\n        return rows\n\n    @property\n    def rows(self) -> List[CellGroup]:\n        return self._get_rows_or_cols(Row)\n\n    @property\n    def columns(self) -> List[CellGroup]:\n        return self._get_rows_or_cols(Column)\n\n    def extract(self, **kwargs: Any) -> List[List[Optional[str]]]:\n\n        chars = self.page.chars\n        table_arr = []\n\n        def char_in_bbox(char: T_obj, bbox: T_bbox) -> bool:\n            v_mid = (char[\"top\"] + char[\"bottom\"]) / 2\n            h_mid = (char[\"x0\"] + char[\"x1\"]) / 2\n            x0, top, x1, bottom = bbox\n            return bool(\n                (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)\n            )\n\n        for row in self.rows:\n            arr = []\n            row_chars = [char for char in chars if char_in_bbox(char, row.bbox)]\n\n            for cell in row.cells:\n                if cell is None:\n                    cell_text = None\n                else:\n                    cell_chars = [\n                        char for char in row_chars if char_in_bbox(char, cell)\n                    ]\n\n                    if len(cell_chars):\n                        if \"layout\" in kwargs:\n                            kwargs[\"layout_width\"] = cell[2] - cell[0]\n                            kwargs[\"layout_height\"] = cell[3] - cell[1]\n                            kwargs[\"layout_bbox\"] = cell\n                        
cell_text = utils.extract_text(cell_chars, **kwargs)\n                    else:\n                        cell_text = \"\"\n                arr.append(cell_text)\n            table_arr.append(arr)\n\n        return table_arr\n\n\nTABLE_STRATEGIES = [\"lines\", \"lines_strict\", \"text\", \"explicit\"]\nNON_NEGATIVE_SETTINGS = [\n    \"snap_tolerance\",\n    \"snap_x_tolerance\",\n    \"snap_y_tolerance\",\n    \"join_tolerance\",\n    \"join_x_tolerance\",\n    \"join_y_tolerance\",\n    \"edge_min_length\",\n    \"edge_min_length_prefilter\",\n    \"min_words_vertical\",\n    \"min_words_horizontal\",\n    \"intersection_tolerance\",\n    \"intersection_x_tolerance\",\n    \"intersection_y_tolerance\",\n]\n\n\nclass UnsetFloat(float):\n    pass\n\n\nUNSET = UnsetFloat(0)\n\n\n@dataclass\nclass TableSettings:\n    vertical_strategy: str = \"lines\"\n    horizontal_strategy: str = \"lines\"\n    explicit_vertical_lines: Optional[List[Union[T_obj, T_num]]] = None\n    explicit_horizontal_lines: Optional[List[Union[T_obj, T_num]]] = None\n    snap_tolerance: T_num = DEFAULT_SNAP_TOLERANCE\n    snap_x_tolerance: T_num = UNSET\n    snap_y_tolerance: T_num = UNSET\n    join_tolerance: T_num = DEFAULT_JOIN_TOLERANCE\n    join_x_tolerance: T_num = UNSET\n    join_y_tolerance: T_num = UNSET\n    edge_min_length: T_num = 3\n    edge_min_length_prefilter: T_num = 1\n    min_words_vertical: int = DEFAULT_MIN_WORDS_VERTICAL\n    min_words_horizontal: int = DEFAULT_MIN_WORDS_HORIZONTAL\n    intersection_tolerance: T_num = 3\n    intersection_x_tolerance: T_num = UNSET\n    intersection_y_tolerance: T_num = UNSET\n    text_settings: Optional[Dict[str, Any]] = None\n\n    def __post_init__(self) -> None:\n        \"\"\"Clean up user-provided table settings.\n\n        Validates that the table settings provided consists of acceptable values and\n        returns a cleaned up version. 
The cleaned up version fills out the missing\n        values with the default values in the provided settings.\n\n        TODO: Can be further used to validate that the values are of the correct\n            type. For example, raising a value error when a non-boolean input is\n            provided for the key ``keep_blank_chars``.\n\n        :param table_settings: User-provided table settings.\n        :returns: A cleaned up version of the user-provided table settings.\n        :raises ValueError: When an unrecognised key is provided.\n        \"\"\"\n\n        for setting in NON_NEGATIVE_SETTINGS:\n            if (getattr(self, setting) or 0) < 0:\n                raise ValueError(f\"Table setting '{setting}' cannot be negative\")\n\n        for orientation in [\"horizontal\", \"vertical\"]:\n            strategy = getattr(self, orientation + \"_strategy\")\n            if strategy not in TABLE_STRATEGIES:\n                raise ValueError(\n                    f\"{orientation}_strategy must be one of \"\n                    f'{{{\",\".join(TABLE_STRATEGIES)}}}'\n                )\n\n        if self.text_settings is None:\n            self.text_settings = {}\n\n        # This next section is for backwards compatibility\n        for attr in [\"x_tolerance\", \"y_tolerance\"]:\n            if attr not in self.text_settings:\n                self.text_settings[attr] = self.text_settings.get(\"tolerance\", 3)\n\n        if \"tolerance\" in self.text_settings:\n            del self.text_settings[\"tolerance\"]\n        # End of that section\n\n        for attr, fallback in [\n            (\"snap_x_tolerance\", \"snap_tolerance\"),\n            (\"snap_y_tolerance\", \"snap_tolerance\"),\n            (\"join_x_tolerance\", \"join_tolerance\"),\n            (\"join_y_tolerance\", \"join_tolerance\"),\n            (\"intersection_x_tolerance\", \"intersection_tolerance\"),\n            (\"intersection_y_tolerance\", \"intersection_tolerance\"),\n        ]:\n            if 
getattr(self, attr) is UNSET:\n                setattr(self, attr, getattr(self, fallback))\n\n    @classmethod\n    def resolve(cls, settings: Optional[T_table_settings]) -> \"TableSettings\":\n        if settings is None:\n            return cls()\n        elif isinstance(settings, cls):\n            return settings\n        elif isinstance(settings, dict):\n            core_settings = {}\n            text_settings = {}\n            for k, v in settings.items():\n                if k[:5] == \"text_\":\n                    text_settings[k[5:]] = v\n                else:\n                    core_settings[k] = v\n            core_settings[\"text_settings\"] = text_settings\n            return cls(**core_settings)\n        else:\n            raise ValueError(f\"Cannot resolve settings: {settings}\")\n\n\nclass TableFinder(object):\n    \"\"\"\n    Given a PDF page, find plausible table structures.\n\n    Largely borrowed from Anssi Nurminen's master's thesis:\n    http://dspace.cc.tut.fi/dpub/bitstream/handle/123456789/21520/Nurminen.pdf?sequence=3\n\n    ... 
and inspired by Tabula:\n    https://github.com/tabulapdf/tabula-extractor/issues/16\n    \"\"\"\n\n    def __init__(self, page: \"Page\", settings: Optional[T_table_settings] = None):\n        self.page = page\n        self.settings = TableSettings.resolve(settings)\n        self.edges = self.get_edges()\n        self.intersections = edges_to_intersections(\n            self.edges,\n            self.settings.intersection_x_tolerance,\n            self.settings.intersection_y_tolerance,\n        )\n        self.cells = intersections_to_cells(self.intersections)\n        self.tables = [\n            Table(self.page, cell_group) for cell_group in cells_to_tables(self.cells)\n        ]\n\n    def get_edges(self) -> T_obj_list:\n        settings = self.settings\n\n        for orientation in [\"vertical\", \"horizontal\"]:\n            strategy = getattr(settings, orientation + \"_strategy\")\n            if strategy == \"explicit\":\n                lines = getattr(settings, \"explicit_\" + orientation + \"_lines\")\n                if len(lines) < 2:\n                    raise ValueError(\n                        f\"If {orientation}_strategy == 'explicit', \"\n                        f\"explicit_{orientation}_lines \"\n                        f\"must be specified as a list/tuple of two or more \"\n                        f\"floats/ints.\"\n                    )\n\n        v_strat = settings.vertical_strategy\n        h_strat = settings.horizontal_strategy\n\n        if v_strat == \"text\" or h_strat == \"text\":\n            words = self.page.extract_words(**(settings.text_settings or {}))\n\n        v_explicit = []\n        for desc in settings.explicit_vertical_lines or []:\n            if isinstance(desc, dict):\n                for e in utils.obj_to_edges(desc):\n                    if e[\"orientation\"] == \"v\":\n                        v_explicit.append(e)\n            else:\n                v_explicit.append(\n                    {\n                        
\"x0\": desc,\n                        \"x1\": desc,\n                        \"top\": self.page.bbox[1],\n                        \"bottom\": self.page.bbox[3],\n                        \"height\": self.page.bbox[3] - self.page.bbox[1],\n                        \"orientation\": \"v\",\n                    }\n                )\n\n        if v_strat == \"lines\":\n            v_base = utils.filter_edges(\n                self.page.edges, \"v\", min_length=settings.edge_min_length_prefilter\n            )\n        elif v_strat == \"lines_strict\":\n            v_base = utils.filter_edges(\n                self.page.edges,\n                \"v\",\n                edge_type=\"line\",\n                min_length=settings.edge_min_length_prefilter,\n            )\n        elif v_strat == \"text\":\n            v_base = words_to_edges_v(words, word_threshold=settings.min_words_vertical)\n        elif v_strat == \"explicit\":\n            v_base = []\n\n        v = v_base + v_explicit\n\n        h_explicit = []\n        for desc in settings.explicit_horizontal_lines or []:\n            if isinstance(desc, dict):\n                for e in utils.obj_to_edges(desc):\n                    if e[\"orientation\"] == \"h\":\n                        h_explicit.append(e)\n            else:\n                h_explicit.append(\n                    {\n                        \"x0\": self.page.bbox[0],\n                        \"x1\": self.page.bbox[2],\n                        \"width\": self.page.bbox[2] - self.page.bbox[0],\n                        \"top\": desc,\n                        \"bottom\": desc,\n                        \"orientation\": \"h\",\n                    }\n                )\n\n        if h_strat == \"lines\":\n            h_base = utils.filter_edges(\n                self.page.edges, \"h\", min_length=settings.edge_min_length_prefilter\n            )\n        elif h_strat == \"lines_strict\":\n            h_base = utils.filter_edges(\n                
self.page.edges,\n                \"h\",\n                edge_type=\"line\",\n                min_length=settings.edge_min_length_prefilter,\n            )\n        elif h_strat == \"text\":\n            h_base = words_to_edges_h(\n                words, word_threshold=settings.min_words_horizontal\n            )\n        elif h_strat == \"explicit\":\n            h_base = []\n\n        h = h_base + h_explicit\n\n        edges = list(v) + list(h)\n\n        edges = merge_edges(\n            edges,\n            snap_x_tolerance=settings.snap_x_tolerance,\n            snap_y_tolerance=settings.snap_y_tolerance,\n            join_x_tolerance=settings.join_x_tolerance,\n            join_y_tolerance=settings.join_y_tolerance,\n        )\n\n        return utils.filter_edges(edges, min_length=settings.edge_min_length)\n"
  },
  {
    "path": "pdfplumber/utils/__init__.py",
    "content": "from .clustering import cluster_list, cluster_objects, make_cluster_dict  # noqa: F401\nfrom .generic import to_list  # noqa: F401\nfrom .geometry import (  # noqa: F401\n    bbox_to_rect,\n    calculate_area,\n    clip_obj,\n    crop_to_bbox,\n    curve_to_edges,\n    filter_edges,\n    get_bbox_overlap,\n    intersects_bbox,\n    line_to_edge,\n    merge_bboxes,\n    move_object,\n    obj_to_bbox,\n    obj_to_edges,\n    objects_to_bbox,\n    objects_to_rect,\n    outside_bbox,\n    rect_to_edges,\n    resize_object,\n    snap_objects,\n    within_bbox,\n)\nfrom .pdfinternals import (  # noqa: F401\n    decode_psl_list,\n    decode_text,\n    resolve,\n    resolve_all,\n    resolve_and_decode,\n)\nfrom .text import (  # noqa: F401\n    DEFAULT_X_DENSITY,\n    DEFAULT_X_TOLERANCE,\n    DEFAULT_Y_DENSITY,\n    DEFAULT_Y_TOLERANCE,\n    chars_to_textmap,\n    collate_line,\n    dedupe_chars,\n    extract_text,\n    extract_text_simple,\n    extract_words,\n)\n"
  },
  {
    "path": "pdfplumber/utils/clustering.py",
    "content": "import itertools\nfrom collections.abc import Hashable\nfrom operator import itemgetter\nfrom typing import Any, Callable, Dict, Iterable, List, Tuple, TypeVar, Union\n\nfrom .._typing import T_num, T_obj\n\n\ndef cluster_list(xs: List[T_num], tolerance: T_num = 0) -> List[List[T_num]]:\n    if tolerance == 0:\n        return [[x] for x in sorted(xs)]\n    if len(xs) < 2:\n        return [[x] for x in sorted(xs)]\n    groups = []\n    xs = list(sorted(xs))\n    current_group = [xs[0]]\n    last = xs[0]\n    for x in xs[1:]:\n        if x <= (last + tolerance):\n            current_group.append(x)\n        else:\n            groups.append(current_group)\n            current_group = [x]\n        last = x\n    groups.append(current_group)\n    return groups\n\n\ndef make_cluster_dict(values: Iterable[T_num], tolerance: T_num) -> Dict[T_num, int]:\n    clusters = cluster_list(list(set(values)), tolerance)\n\n    nested_tuples = [\n        [(val, i) for val in value_cluster] for i, value_cluster in enumerate(clusters)\n    ]\n\n    return dict(itertools.chain(*nested_tuples))\n\n\nClusterable = TypeVar(\"Clusterable\", T_obj, Tuple[Any, ...])\n\n\ndef cluster_objects(\n    xs: List[Clusterable],\n    key_fn: Union[Hashable, Callable[[Clusterable], T_num]],\n    tolerance: T_num,\n    preserve_order: bool = False,\n) -> List[List[Clusterable]]:\n\n    if not callable(key_fn):\n        key_fn = itemgetter(key_fn)\n\n    values = map(key_fn, xs)\n    cluster_dict = make_cluster_dict(values, tolerance)\n\n    get_0, get_1 = itemgetter(0), itemgetter(1)\n\n    if preserve_order:\n        cluster_tuples = [(x, cluster_dict.get(key_fn(x))) for x in xs]\n    else:\n        cluster_tuples = sorted(\n            ((x, cluster_dict.get(key_fn(x))) for x in xs), key=get_1\n        )\n\n    grouped = itertools.groupby(cluster_tuples, key=get_1)\n\n    return [list(map(get_0, v)) for k, v in grouped]\n"
  },
  {
    "path": "pdfplumber/utils/exceptions.py",
    "content": "class MalformedPDFException(Exception):\n    pass\n\n\nclass PdfminerException(Exception):\n    pass\n"
  },
  {
    "path": "pdfplumber/utils/generic.py",
    "content": "from collections.abc import Sequence\nfrom typing import TYPE_CHECKING, Any, Dict, Hashable, List, Union\n\nfrom .._typing import T_seq\n\nif TYPE_CHECKING:  # pragma: nocover\n    from pandas.core.frame import DataFrame\n\n\ndef to_list(collection: Union[T_seq[Any], \"DataFrame\"]) -> List[Any]:\n    if isinstance(collection, list):\n        return collection\n    elif isinstance(collection, Sequence):\n        return list(collection)\n    elif hasattr(collection, \"to_dict\"):\n        res: List[Dict[Hashable, Any]] = collection.to_dict(\n            \"records\"\n        )  # pragma: nocover\n        return res\n    else:\n        return list(collection)\n"
  },
  {
    "path": "pdfplumber/utils/geometry.py",
    "content": "import itertools\nfrom operator import itemgetter\nfrom typing import Dict, Iterable, Optional\n\nfrom .._typing import T_bbox, T_num, T_obj, T_obj_list\nfrom .clustering import cluster_objects\n\n\ndef objects_to_rect(objects: Iterable[T_obj]) -> Dict[str, T_num]:\n    \"\"\"\n    Given an iterable of objects, return the smallest rectangle (i.e. a\n    dict with \"x0\", \"top\", \"x1\", and \"bottom\" keys) that contains them\n    all.\n    \"\"\"\n    return bbox_to_rect(objects_to_bbox(objects))\n\n\ndef objects_to_bbox(objects: Iterable[T_obj]) -> T_bbox:\n    \"\"\"\n    Given an iterable of objects, return the smallest bounding box that\n    contains them all.\n    \"\"\"\n    return merge_bboxes(map(bbox_getter, objects))\n\n\nbbox_getter = itemgetter(\"x0\", \"top\", \"x1\", \"bottom\")\n\n\ndef obj_to_bbox(obj: T_obj) -> T_bbox:\n    \"\"\"\n    Return the bounding box for an object.\n    \"\"\"\n    bbox: T_bbox = bbox_getter(obj)\n    return bbox\n\n\ndef bbox_to_rect(bbox: T_bbox) -> Dict[str, T_num]:\n    \"\"\"\n    Return the rectangle (i.e a dict with keys \"x0\", \"top\", \"x1\",\n    \"bottom\") for an object.\n    \"\"\"\n    return {\"x0\": bbox[0], \"top\": bbox[1], \"x1\": bbox[2], \"bottom\": bbox[3]}\n\n\ndef merge_bboxes(bboxes: Iterable[T_bbox]) -> T_bbox:\n    \"\"\"\n    Given an iterable of bounding boxes, return the smallest bounding box\n    that contains them all.\n    \"\"\"\n    x0, top, x1, bottom = zip(*bboxes)\n    return (min(x0), min(top), max(x1), max(bottom))\n\n\ndef get_bbox_overlap(a: T_bbox, b: T_bbox) -> Optional[T_bbox]:\n    a_left, a_top, a_right, a_bottom = a\n    b_left, b_top, b_right, b_bottom = b\n    o_left = max(a_left, b_left)\n    o_right = min(a_right, b_right)\n    o_bottom = min(a_bottom, b_bottom)\n    o_top = max(a_top, b_top)\n    o_width = o_right - o_left\n    o_height = o_bottom - o_top\n    if o_height >= 0 and o_width >= 0 and o_height + o_width > 0:\n        return (o_left, o_top, 
o_right, o_bottom)\n    else:\n        return None\n\n\ndef calculate_area(bbox: T_bbox) -> T_num:\n    left, top, right, bottom = bbox\n    if left > right or top > bottom:\n        raise ValueError(f\"{bbox} has a negative width or height.\")\n    return (right - left) * (bottom - top)\n\n\ndef clip_obj(obj: T_obj, bbox: T_bbox) -> Optional[T_obj]:\n    overlap = get_bbox_overlap(obj_to_bbox(obj), bbox)\n    if overlap is None:\n        return None\n\n    dims = bbox_to_rect(overlap)\n    copy = dict(obj)\n\n    for attr in [\"x0\", \"top\", \"x1\", \"bottom\"]:\n        copy[attr] = dims[attr]\n\n    diff = dims[\"top\"] - obj[\"top\"]\n    if \"doctop\" in copy:\n        copy[\"doctop\"] = obj[\"doctop\"] + diff\n    copy[\"width\"] = copy[\"x1\"] - copy[\"x0\"]\n    copy[\"height\"] = copy[\"bottom\"] - copy[\"top\"]\n\n    return copy\n\n\ndef intersects_bbox(objs: Iterable[T_obj], bbox: T_bbox) -> T_obj_list:\n    \"\"\"\n    Filters objs to only those intersecting the bbox\n    \"\"\"\n    return [obj for obj in objs if get_bbox_overlap(obj_to_bbox(obj), bbox) is not None]\n\n\ndef within_bbox(objs: Iterable[T_obj], bbox: T_bbox) -> T_obj_list:\n    \"\"\"\n    Filters objs to only those fully within the bbox\n    \"\"\"\n    return [\n        obj\n        for obj in objs\n        if get_bbox_overlap(obj_to_bbox(obj), bbox) == obj_to_bbox(obj)\n    ]\n\n\ndef outside_bbox(objs: Iterable[T_obj], bbox: T_bbox) -> T_obj_list:\n    \"\"\"\n    Filters objs to only those fully outside the bbox\n    \"\"\"\n    return [obj for obj in objs if get_bbox_overlap(obj_to_bbox(obj), bbox) is None]\n\n\ndef crop_to_bbox(objs: Iterable[T_obj], bbox: T_bbox) -> T_obj_list:\n    \"\"\"\n    Filters objs to only those intersecting the bbox,\n    and crops the extent of the objects to the bbox.\n    \"\"\"\n    return list(filter(None, (clip_obj(obj, bbox) for obj in objs)))\n\n\ndef move_object(obj: T_obj, axis: str, value: T_num) -> T_obj:\n    assert axis in (\"h\", 
\"v\")\n    if axis == \"h\":\n        new_items = [\n            (\"x0\", obj[\"x0\"] + value),\n            (\"x1\", obj[\"x1\"] + value),\n        ]\n    if axis == \"v\":\n        new_items = [\n            (\"top\", obj[\"top\"] + value),\n            (\"bottom\", obj[\"bottom\"] + value),\n        ]\n        if \"doctop\" in obj:\n            new_items += [(\"doctop\", obj[\"doctop\"] + value)]\n        if \"y0\" in obj:\n            new_items += [\n                (\"y0\", obj[\"y0\"] - value),\n                (\"y1\", obj[\"y1\"] - value),\n            ]\n    return obj.__class__(tuple(obj.items()) + tuple(new_items))\n\n\ndef snap_objects(objs: Iterable[T_obj], attr: str, tolerance: T_num) -> T_obj_list:\n    axis = {\"x0\": \"h\", \"x1\": \"h\", \"top\": \"v\", \"bottom\": \"v\"}[attr]\n    list_objs = list(objs)\n    clusters = cluster_objects(list_objs, itemgetter(attr), tolerance)\n    avgs = [sum(map(itemgetter(attr), cluster)) / len(cluster) for cluster in clusters]\n    snapped_clusters = [\n        [move_object(obj, axis, avg - obj[attr]) for obj in cluster]\n        for cluster, avg in zip(clusters, avgs)\n    ]\n    return list(itertools.chain(*snapped_clusters))\n\n\ndef resize_object(obj: T_obj, key: str, value: T_num) -> T_obj:\n    assert key in (\"x0\", \"x1\", \"top\", \"bottom\")\n    old_value = obj[key]\n    diff = value - old_value\n    new_items = [\n        (key, value),\n    ]\n    if key == \"x0\":\n        assert value <= obj[\"x1\"]\n        new_items.append((\"width\", obj[\"x1\"] - value))\n    elif key == \"x1\":\n        assert value >= obj[\"x0\"]\n        new_items.append((\"width\", value - obj[\"x0\"]))\n    elif key == \"top\":\n        assert value <= obj[\"bottom\"]\n        new_items.append((\"doctop\", obj[\"doctop\"] + diff))\n        new_items.append((\"height\", obj[\"height\"] - diff))\n        if \"y1\" in obj:\n            new_items.append((\"y1\", obj[\"y1\"] - diff))\n    elif key == \"bottom\":\n        
assert value >= obj[\"top\"]\n        new_items.append((\"height\", obj[\"height\"] + diff))\n        if \"y0\" in obj:\n            new_items.append((\"y0\", obj[\"y0\"] - diff))\n    return obj.__class__(tuple(obj.items()) + tuple(new_items))\n\n\ndef curve_to_edges(curve: T_obj) -> T_obj_list:\n    point_pairs = zip(curve[\"pts\"], curve[\"pts\"][1:])\n    return [\n        {\n            \"object_type\": \"curve_edge\",\n            \"x0\": min(p0[0], p1[0]),\n            \"x1\": max(p0[0], p1[0]),\n            \"top\": min(p0[1], p1[1]),\n            \"doctop\": min(p0[1], p1[1]) + (curve[\"doctop\"] - curve[\"top\"]),\n            \"bottom\": max(p0[1], p1[1]),\n            \"width\": abs(p0[0] - p1[0]),\n            \"height\": abs(p0[1] - p1[1]),\n            \"orientation\": \"v\" if p0[0] == p1[0] else (\"h\" if p0[1] == p1[1] else None),\n        }\n        for p0, p1 in point_pairs\n    ]\n\n\ndef rect_to_edges(rect: T_obj) -> T_obj_list:\n    top, bottom, left, right = [dict(rect) for x in range(4)]\n    top.update(\n        {\n            \"object_type\": \"rect_edge\",\n            \"height\": 0,\n            \"y0\": rect[\"y1\"],\n            \"bottom\": rect[\"top\"],\n            \"orientation\": \"h\",\n        }\n    )\n    bottom.update(\n        {\n            \"object_type\": \"rect_edge\",\n            \"height\": 0,\n            \"y1\": rect[\"y0\"],\n            \"top\": rect[\"top\"] + rect[\"height\"],\n            \"doctop\": rect[\"doctop\"] + rect[\"height\"],\n            \"orientation\": \"h\",\n        }\n    )\n    left.update(\n        {\n            \"object_type\": \"rect_edge\",\n            \"width\": 0,\n            \"x1\": rect[\"x0\"],\n            \"orientation\": \"v\",\n        }\n    )\n    right.update(\n        {\n            \"object_type\": \"rect_edge\",\n            \"width\": 0,\n            \"x0\": rect[\"x1\"],\n            \"orientation\": \"v\",\n        }\n    )\n    return [top, bottom, left, 
right]\n\n\ndef line_to_edge(line: T_obj) -> T_obj:\n    edge = dict(line)\n    edge[\"orientation\"] = \"h\" if (line[\"top\"] == line[\"bottom\"]) else \"v\"\n    return edge\n\n\ndef obj_to_edges(obj: T_obj) -> T_obj_list:\n    t = obj[\"object_type\"]\n    if \"_edge\" in t:\n        return [obj]\n    elif t == \"line\":\n        return [line_to_edge(obj)]\n    else:\n        return {\"rect\": rect_to_edges, \"curve\": curve_to_edges}[t](obj)\n\n\ndef filter_edges(\n    edges: Iterable[T_obj],\n    orientation: Optional[str] = None,\n    edge_type: Optional[str] = None,\n    min_length: T_num = 1,\n) -> T_obj_list:\n    if orientation not in (\"v\", \"h\", None):\n        raise ValueError(\"Orientation must be 'v' or 'h'\")\n\n    def test(e: T_obj) -> bool:\n        dim = \"height\" if e[\"orientation\"] == \"v\" else \"width\"\n        et_correct = e[\"object_type\"] == edge_type if edge_type is not None else True\n        orient_correct = orientation is None or e[\"orientation\"] == orientation\n        return bool(et_correct and orient_correct and (e[dim] >= min_length))\n\n    return list(filter(test, edges))\n"
  },
  {
    "path": "pdfplumber/utils/pdfinternals.py",
    "content": "from typing import Any, List, Optional, Union\n\nfrom pdfminer.pdftypes import PDFObjRef\nfrom pdfminer.psparser import PSLiteral\nfrom pdfminer.utils import PDFDocEncoding\n\nfrom .exceptions import MalformedPDFException\n\n\ndef decode_text(s: Union[bytes, str]) -> str:\n    \"\"\"\n    Decodes a PDFDocEncoding string to Unicode.\n    Adds py3 compatibility to pdfminer's version.\n    \"\"\"\n    if isinstance(s, bytes) and s.startswith(b\"\\xfe\\xff\"):\n        return str(s[2:], \"utf-16be\", \"ignore\")\n    try:\n        ords = (ord(c) if isinstance(c, str) else c for c in s)\n        return \"\".join(PDFDocEncoding[o] for o in ords)\n    except IndexError:\n        return str(s)\n\n\ndef resolve_and_decode(obj: Any) -> Any:\n    \"\"\"Recursively resolve the metadata values.\"\"\"\n    if hasattr(obj, \"resolve\"):\n        obj = obj.resolve()\n    if isinstance(obj, list):\n        return list(map(resolve_and_decode, obj))\n    elif isinstance(obj, PSLiteral):\n        return decode_text(obj.name)\n    elif isinstance(obj, (str, bytes)):\n        return decode_text(obj)\n    elif isinstance(obj, dict):\n        for k, v in obj.items():\n            obj[k] = resolve_and_decode(v)\n        return obj\n\n    return obj\n\n\ndef decode_psl_list(_list: List[Union[PSLiteral, str]]) -> List[str]:\n    return [\n        decode_text(value.name) if isinstance(value, PSLiteral) else value\n        for value in _list\n    ]\n\n\ndef resolve(x: Any) -> Any:\n    if isinstance(x, PDFObjRef):\n        return x.resolve()\n    else:\n        return x\n\n\ndef get_dict_type(d: Any) -> Optional[str]:\n    if not isinstance(d, dict):\n        return None\n    t = d.get(\"Type\")\n    if isinstance(t, PSLiteral):\n        return decode_text(t.name)\n    else:\n        return t\n\n\ndef resolve_all(x: Any) -> Any:\n    \"\"\"\n    Recursively resolves the given object and all the internals.\n    \"\"\"\n    if isinstance(x, PDFObjRef):\n        resolved = 
x.resolve()\n\n        # Avoid infinite recursion\n        if get_dict_type(resolved) == \"Page\":\n            return x\n\n        try:\n            return resolve_all(resolved)\n        except RecursionError as e:\n            raise MalformedPDFException(e)\n    elif isinstance(x, (list, tuple)):\n        return type(x)(resolve_all(v) for v in x)\n    elif isinstance(x, dict):\n        exceptions = [\"Parent\"] if get_dict_type(x) == \"Annot\" else []\n        return {k: v if k in exceptions else resolve_all(v) for k, v in x.items()}\n    else:\n        return x\n"
  },
  {
    "path": "pdfplumber/utils/text.py",
    "content": "import inspect\nimport itertools\nimport logging\nimport re\nimport string\nfrom operator import itemgetter\nfrom typing import (\n    Any,\n    Callable,\n    Dict,\n    Generator,\n    List,\n    Match,\n    Optional,\n    Pattern,\n    Tuple,\n    Union,\n)\n\nfrom .._typing import T_bbox, T_dir, T_num, T_obj, T_obj_iter, T_obj_list\nfrom .clustering import cluster_objects\nfrom .generic import to_list\nfrom .geometry import objects_to_bbox\n\nlogger = logging.getLogger(__name__)\n\nDEFAULT_X_TOLERANCE = 3\nDEFAULT_Y_TOLERANCE = 3\nDEFAULT_X_DENSITY = 7.25\nDEFAULT_Y_DENSITY = 13\nDEFAULT_LINE_DIR: T_dir = \"ttb\"\nDEFAULT_CHAR_DIR: T_dir = \"ltr\"\n\nLIGATURES = {\n    \"ﬀ\": \"ff\",\n    \"ﬃ\": \"ffi\",\n    \"ﬄ\": \"ffl\",\n    \"ﬁ\": \"fi\",\n    \"ﬂ\": \"fl\",\n    \"ﬆ\": \"st\",\n    \"ﬅ\": \"st\",\n}\n\n\ndef get_line_cluster_key(line_dir: T_dir) -> Callable[[T_obj], T_num]:\n    return {\n        \"ttb\": lambda x: x[\"top\"],\n        \"btt\": lambda x: -x[\"bottom\"],\n        \"ltr\": lambda x: x[\"x0\"],\n        \"rtl\": lambda x: -x[\"x1\"],\n    }[line_dir]\n\n\ndef get_char_sort_key(char_dir: T_dir) -> Callable[[T_obj], Tuple[T_num, T_num]]:\n    return {\n        \"ttb\": lambda x: (x[\"top\"], x[\"bottom\"]),\n        \"btt\": lambda x: (-(x[\"top\"] + x[\"height\"]), -x[\"top\"]),\n        \"ltr\": lambda x: (x[\"x0\"], x[\"x0\"]),\n        \"rtl\": lambda x: (-x[\"x1\"], -x[\"x0\"]),\n    }[char_dir]\n\n\nBBOX_ORIGIN_KEYS = {\n    \"ttb\": itemgetter(1),\n    \"btt\": itemgetter(3),\n    \"ltr\": itemgetter(0),\n    \"rtl\": itemgetter(2),\n}\n\nPOSITION_KEYS = {\n    \"ttb\": itemgetter(\"top\"),\n    \"btt\": itemgetter(\"bottom\"),\n    \"ltr\": itemgetter(\"x0\"),\n    \"rtl\": itemgetter(\"x1\"),\n}\n\n\ndef validate_directions(line_dir: T_dir, char_dir: T_dir, suffix: str = \"\") -> None:\n    valid_dirs = set(POSITION_KEYS.keys())\n    if line_dir not in valid_dirs:\n        raise ValueError(\n            
f\"line_dir{suffix} must be one of {valid_dirs}, not {line_dir}\"\n        )\n    if char_dir not in valid_dirs:\n        raise ValueError(\n            f\"char_dir{suffix} must be one of {valid_dirs}, not {char_dir}\"\n        )\n    if set(line_dir) == set(char_dir):\n        raise ValueError(\n            f\"line_dir{suffix}={line_dir} is incompatible \"\n            f\"with char_dir{suffix}={char_dir}\"\n        )\n\n\nclass TextMap:\n    \"\"\"\n    A TextMap maps each unicode character in the text to an individual `char`\n    object (or, in the case of layout-implied whitespace, `None`).\n    \"\"\"\n\n    def __init__(\n        self,\n        tuples: List[Tuple[str, Optional[T_obj]]],\n        line_dir_render: T_dir,\n        char_dir_render: T_dir,\n    ) -> None:\n        validate_directions(line_dir_render, char_dir_render, \"_render\")\n        self.tuples = tuples\n        self.line_dir_render = line_dir_render\n        self.char_dir_render = char_dir_render\n        self.as_string = self.to_string()\n\n    def to_string(self) -> str:\n        cd = self.char_dir_render\n        ld = self.line_dir_render\n\n        base = \"\".join(map(itemgetter(0), self.tuples))\n\n        if cd == \"ltr\" and ld == \"ttb\":\n            return base\n        else:\n            lines = base.split(\"\\n\")\n            if ld in (\"btt\", \"rtl\"):\n                lines = list(reversed(lines))\n\n            if cd == \"rtl\":\n                lines = [\"\".join(reversed(line)) for line in lines]\n\n            if ld in (\"rtl\", \"ltr\"):\n                max_line_length = max(map(len, lines))\n                if cd == \"btt\":\n                    lines = [\n                        (\" \" * (max_line_length - len(line))) + line for line in lines\n                    ]\n                else:\n                    lines = [\n                        line + (\" \" * (max_line_length - len(line))) for line in lines\n                    ]\n                return 
\"\\n\".join(\n                    \"\".join(line[i] for line in lines) for i in range(max_line_length)\n                )\n            else:\n                return \"\\n\".join(lines)\n\n    def match_to_dict(\n        self,\n        m: Match[str],\n        main_group: int = 0,\n        return_groups: bool = True,\n        return_chars: bool = True,\n    ) -> Dict[str, Any]:\n        subset = self.tuples[m.start(main_group) : m.end(main_group)]\n        chars = [c for (text, c) in subset if c is not None]\n        x0, top, x1, bottom = objects_to_bbox(chars)\n\n        result = {\n            \"text\": m.group(main_group),\n            \"x0\": x0,\n            \"top\": top,\n            \"x1\": x1,\n            \"bottom\": bottom,\n        }\n\n        if return_groups:\n            result[\"groups\"] = m.groups()\n\n        if return_chars:\n            result[\"chars\"] = chars\n\n        return result\n\n    def search(\n        self,\n        pattern: Union[str, Pattern[str]],\n        regex: bool = True,\n        case: bool = True,\n        return_groups: bool = True,\n        return_chars: bool = True,\n        main_group: int = 0,\n    ) -> List[Dict[str, Any]]:\n        if isinstance(pattern, Pattern):\n            if regex is False:\n                raise ValueError(\n                    \"Cannot pass a compiled search pattern *and* regex=False together.\"\n                )\n            if case is False:\n                raise ValueError(\n                    \"Cannot pass a compiled search pattern *and* case=False together.\"\n                )\n            compiled = pattern\n        else:\n            if regex is False:\n                pattern = re.escape(pattern)\n\n            flags = re.I if case is False else 0\n            compiled = re.compile(pattern, flags)\n\n        gen = re.finditer(compiled, self.as_string)\n        # Remove zero-length matches (can happen, e.g., with optional\n        # patterns in regexes) and whitespace-only matches\n 
       filtered = filter(lambda m: bool(m.group(main_group).strip()), gen)\n        return [\n            self.match_to_dict(\n                m,\n                return_groups=return_groups,\n                return_chars=return_chars,\n                main_group=main_group,\n            )\n            for m in filtered\n        ]\n\n    def extract_text_lines(\n        self, strip: bool = True, return_chars: bool = True\n    ) -> List[Dict[str, Any]]:\n        \"\"\"\n        `strip` is analogous to Python's `str.strip()` method, and returns\n        `text` attributes without their surrounding whitespace. Only\n        relevant when the relevant TextMap is created with `layout` = True\n\n        Setting `return_chars` to False will exclude the individual\n        character objects from the returned text-line dicts.\n        \"\"\"\n        if strip:\n            pat = r\" *([^\\n]+?) *(\\n|$)\"\n        else:\n            pat = r\"([^\\n]+)\"\n\n        return self.search(\n            pat, main_group=1, return_chars=return_chars, return_groups=False\n        )\n\n\nclass WordMap:\n    \"\"\"\n    A WordMap maps words->chars.\n    \"\"\"\n\n    def __init__(self, tuples: List[Tuple[T_obj, T_obj_list]]) -> None:\n        self.tuples = tuples\n\n    def to_textmap(\n        self,\n        layout: bool = False,\n        layout_width: T_num = 0,\n        layout_height: T_num = 0,\n        layout_width_chars: int = 0,\n        layout_height_chars: int = 0,\n        layout_bbox: T_bbox = (0, 0, 0, 0),\n        x_density: T_num = DEFAULT_X_DENSITY,\n        y_density: T_num = DEFAULT_Y_DENSITY,\n        x_shift: T_num = 0,\n        y_shift: T_num = 0,\n        y_tolerance: T_num = DEFAULT_Y_TOLERANCE,\n        line_dir: T_dir = DEFAULT_LINE_DIR,\n        char_dir: T_dir = DEFAULT_CHAR_DIR,\n        line_dir_rotated: Optional[T_dir] = None,\n        char_dir_rotated: Optional[T_dir] = None,\n        char_dir_render: Optional[T_dir] = None,\n        line_dir_render: 
Optional[T_dir] = None,\n        use_text_flow: bool = False,\n        presorted: bool = False,\n        expand_ligatures: bool = True,\n    ) -> TextMap:\n        \"\"\"\n        Given a list of (word, chars) tuples (i.e., a WordMap), return a list of\n        (char-text, char) tuples (i.e., a TextMap) that can be used to mimic\n        the structural layout of the text on the page(s), using the following\n        approach for top-to-bottom, left-to-right text:\n\n        - Sort the words by (top, x0) if not already sorted.\n\n        - Cluster the words by top (taking `y_tolerance` into account), and\n          iterate through them.\n\n        - For each cluster, divide (top - y_shift) by `y_density` to calculate\n          the minimum number of newlines that should come before this cluster.\n          Append that number of newlines *minus* the number of newlines already\n          appended, with a minimum of one.\n\n        - Then for each cluster, iterate through each word in it. Divide each\n          word's x0, minus `x_shift`, by `x_density` to calculate the minimum\n          number of characters that should come before this word.  Append that\n          number of spaces *minus* the number of characters and spaces already\n          appended, with a minimum of one. 
Then append the word's text.\n\n        - At the termination of each line, add more spaces if necessary to\n          mimic `layout_width`.\n\n        - Finally, add newlines to the end if necessary to mimic\n          `layout_height`.\n\n        For other line/character directions (e.g., bottom-to-top,\n        right-to-left), these steps are adjusted.\n        \"\"\"\n        _textmap: List[Tuple[str, Optional[T_obj]]] = []\n\n        if not len(self.tuples):\n            return TextMap(\n                _textmap,\n                line_dir_render=line_dir_render or line_dir,\n                char_dir_render=char_dir_render or char_dir,\n            )\n\n        expansions = LIGATURES if expand_ligatures else {}\n\n        if layout:\n            if layout_width_chars:\n                if layout_width:\n                    raise ValueError(\n                        \"`layout_width` and `layout_width_chars` cannot both be set.\"\n                    )\n            else:\n                layout_width_chars = int(round(layout_width / x_density))\n\n            if layout_height_chars:\n                if layout_height:\n                    raise ValueError(\n                        \"`layout_height` and `layout_height_chars` cannot both be set.\"\n                    )\n            else:\n                layout_height_chars = int(round(layout_height / y_density))\n\n            blank_line = [(\" \", None)] * layout_width_chars\n        else:\n            blank_line = []\n\n        num_newlines = 0\n\n        line_cluster_key = get_line_cluster_key(line_dir)\n        char_sort_key = get_char_sort_key(char_dir)\n\n        line_position_key = POSITION_KEYS[line_dir]\n        char_position_key = POSITION_KEYS[char_dir]\n\n        y_origin = BBOX_ORIGIN_KEYS[line_dir](layout_bbox)\n        x_origin = BBOX_ORIGIN_KEYS[char_dir](layout_bbox)\n\n        words_sorted_line_dir = (\n            self.tuples\n            if presorted or use_text_flow\n            else 
sorted(self.tuples, key=lambda x: line_cluster_key(x[0]))\n        )\n\n        tuples_by_line = cluster_objects(\n            words_sorted_line_dir,\n            lambda x: line_cluster_key(x[0]),\n            y_tolerance,\n            preserve_order=presorted or use_text_flow,\n        )\n\n        for i, line_tuples in enumerate(tuples_by_line):\n            if layout:\n                line_position = line_position_key(line_tuples[0][0])\n                y_dist_raw = line_position - (y_origin + y_shift)\n                adj = -1 if line_dir in [\"btt\", \"rtl\"] else 1\n                y_dist = y_dist_raw * adj / y_density\n            else:\n                y_dist = 0\n            num_newlines_prepend = max(\n                # At least one newline, unless this is the first line\n                int(i > 0),\n                # ... or as many as needed to get the imputed \"distance\" from the top\n                round(y_dist) - num_newlines,\n            )\n\n            for i in range(num_newlines_prepend):\n                if not len(_textmap) or _textmap[-1][0] == \"\\n\":\n                    _textmap += blank_line\n                _textmap.append((\"\\n\", None))\n\n            num_newlines += num_newlines_prepend\n\n            line_len = 0\n\n            line_tuples_sorted = (\n                line_tuples\n                if presorted or use_text_flow\n                else sorted(line_tuples, key=lambda x: char_sort_key(x[0]))\n            )\n\n            for word, chars in line_tuples_sorted:\n                if layout:\n                    char_position = char_position_key(word)\n                    x_dist_raw = char_position - (x_origin + x_shift)\n                    adj = -1 if char_dir in [\"btt\", \"rtl\"] else 1\n                    x_dist = x_dist_raw * adj / x_density\n                else:\n                    x_dist = 0\n\n                num_spaces_prepend = max(min(1, line_len), round(x_dist) - line_len)\n                _textmap += [(\" \", 
None)] * num_spaces_prepend\n                line_len += num_spaces_prepend\n\n                for c in chars:\n                    letters = expansions.get(c[\"text\"], c[\"text\"])\n                    for letter in letters:\n                        _textmap.append((letter, c))\n                        line_len += 1\n\n            # Append spaces at end of line\n            if layout:\n                _textmap += [(\" \", None)] * (layout_width_chars - line_len)\n\n        # Append blank lines at end of text\n        if layout:\n            num_newlines_append = layout_height_chars - (num_newlines + 1)\n            for i in range(num_newlines_append):\n                if i > 0:\n                    _textmap += blank_line\n                _textmap.append((\"\\n\", None))\n\n            # Remove terminal newline\n            if _textmap[-1] == (\"\\n\", None):\n                _textmap = _textmap[:-1]\n\n        return TextMap(\n            _textmap,\n            line_dir_render=line_dir_render or line_dir,\n            char_dir_render=char_dir_render or char_dir,\n        )\n\n\nclass WordExtractor:\n    def __init__(\n        self,\n        x_tolerance: T_num = DEFAULT_X_TOLERANCE,\n        y_tolerance: T_num = DEFAULT_Y_TOLERANCE,\n        x_tolerance_ratio: Union[int, float, None] = None,\n        y_tolerance_ratio: Union[int, float, None] = None,\n        keep_blank_chars: bool = False,\n        use_text_flow: bool = False,\n        vertical_ttb: bool = True,  # Should vertical words be read top-to-bottom?\n        horizontal_ltr: bool = True,  # Should words be read left-to-right?\n        line_dir: T_dir = DEFAULT_LINE_DIR,\n        char_dir: T_dir = DEFAULT_CHAR_DIR,\n        line_dir_rotated: Optional[T_dir] = None,\n        char_dir_rotated: Optional[T_dir] = None,\n        extra_attrs: Optional[List[str]] = None,\n        split_at_punctuation: Union[bool, str] = False,\n        expand_ligatures: bool = True,\n    ):\n        self.x_tolerance = 
x_tolerance\n        self.y_tolerance = y_tolerance\n        self.x_tolerance_ratio = x_tolerance_ratio\n        self.y_tolerance_ratio = y_tolerance_ratio\n        self.keep_blank_chars = keep_blank_chars\n        self.use_text_flow = use_text_flow\n        self.horizontal_ltr = horizontal_ltr\n        self.vertical_ttb = vertical_ttb\n        if vertical_ttb is False:\n            logger.warning(\n                \"vertical_ttb is deprecated and will be removed;\"\n                \" use line_dir/char_dir instead.\"\n            )\n        if horizontal_ltr is False:\n            logger.warning(\n                \"horizontal_ltr is deprecated and will be removed;\"\n                \" use line_dir/char_dir instead.\"\n            )\n        self.line_dir = line_dir\n        self.char_dir = char_dir\n        # Default is to \"flip\" the directions for rotated text\n        self.line_dir_rotated = line_dir_rotated or char_dir\n        self.char_dir_rotated = char_dir_rotated or line_dir\n        validate_directions(self.line_dir, self.char_dir)\n        validate_directions(self.line_dir_rotated, self.char_dir_rotated, \"_rotated\")\n        self.extra_attrs = [] if extra_attrs is None else extra_attrs\n\n        # Note: string.punctuation = '!\"#$%&\\'()*+,-./:;<=>?@[\\\\]^_`{|}~'\n        self.split_at_punctuation = (\n            string.punctuation\n            if split_at_punctuation is True\n            else (split_at_punctuation or \"\")\n        )\n\n        self.expansions = LIGATURES if expand_ligatures else {}\n\n    def get_char_dir(self, upright: int) -> T_dir:\n        # Note: This can be simplified and reincorporated into .merge_chars and\n        # .iter_chars_to_lines once .vertical_ttb and .horizontal_ltr\n        # deprecation is complete.\n        if not upright and not self.vertical_ttb:\n            return \"btt\"\n\n        elif upright and not self.horizontal_ltr:\n            return \"rtl\"\n\n        return self.char_dir if upright else 
self.char_dir_rotated\n\n    def merge_chars(self, ordered_chars: T_obj_list) -> T_obj:\n        x0, top, x1, bottom = objects_to_bbox(ordered_chars)\n        doctop_adj = ordered_chars[0][\"doctop\"] - ordered_chars[0][\"top\"]\n        upright = ordered_chars[0][\"upright\"]\n        char_dir = self.get_char_dir(upright)\n\n        word = {\n            \"text\": \"\".join(\n                self.expansions.get(c[\"text\"], c[\"text\"]) for c in ordered_chars\n            ),\n            \"x0\": x0,\n            \"x1\": x1,\n            \"top\": top,\n            \"doctop\": top + doctop_adj,\n            \"bottom\": bottom,\n            \"upright\": upright,\n            \"height\": bottom - top,\n            \"width\": x1 - x0,\n            \"direction\": char_dir,\n        }\n\n        for key in self.extra_attrs:\n            word[key] = ordered_chars[0][key]\n\n        return word\n\n    def char_begins_new_word(\n        self,\n        prev_char: T_obj,\n        curr_char: T_obj,\n        direction: T_dir,\n        x_tolerance: T_num,\n        y_tolerance: T_num,\n    ) -> bool:\n        \"\"\"This method takes several factors into account to determine if\n        `curr_char` represents the beginning of a new word:\n\n        - Whether the text is \"upright\" (i.e., non-rotated)\n        - Whether the user has specified that horizontal text runs\n          left-to-right (default) or right-to-left, as represented by\n          self.horizontal_ltr\n        - Whether the user has specified that vertical text runs\n          top-to-bottom (default) or bottom-to-top, as represented by\n          self.vertical_ttb\n        - The x0, top, x1, and bottom attributes of prev_char and\n          curr_char\n        - The self.x_tolerance and self.y_tolerance settings. Note: In\n          this case, x/y refer to those directions for non-rotated text.\n          For vertical text, they are flipped. 
A more accurate terminology\n          might be \"*intra*line character distance tolerance\" and\n          \"*inter*line character distance tolerance\"\n\n        An important note: The *intra*line distance is measured from the\n        *end* of the previous character to the *beginning* of the current\n        character, while the *inter*line distance is measured from the\n        *top* of the previous character to the *top* of the next\n        character. The reasons for this are partly repository-historical,\n        and partly logical, as successive text lines' bounding boxes often\n        overlap slightly (and we don't want that overlap to be interpreted\n        as the two lines being the same line).\n\n        The upright-ness of the character determines the attributes to\n        compare, while horizontal_ltr/vertical_ttb determine the direction\n        of the comparison.\n        \"\"\"\n        # Note: Due to the grouping step earlier in the process,\n        # curr_char[\"upright\"] will always equal prev_char[\"upright\"].\n        if direction in (\"ltr\", \"rtl\"):\n            x = x_tolerance\n            y = y_tolerance\n            ay = prev_char[\"top\"]\n            cy = curr_char[\"top\"]\n            if direction == \"ltr\":\n                ax = prev_char[\"x0\"]\n                bx = prev_char[\"x1\"]\n                cx = curr_char[\"x0\"]\n            else:\n                ax = -prev_char[\"x1\"]\n                bx = -prev_char[\"x0\"]\n                cx = -curr_char[\"x1\"]\n\n        else:\n            x = y_tolerance\n            y = x_tolerance\n            ay = prev_char[\"x0\"]\n            cy = curr_char[\"x0\"]\n            if direction == \"ttb\":\n                ax = prev_char[\"top\"]\n                bx = prev_char[\"bottom\"]\n                cx = curr_char[\"top\"]\n            else:\n                ax = -prev_char[\"bottom\"]\n                bx = -prev_char[\"top\"]\n                cx = -curr_char[\"bottom\"]\n\n     
   return bool(\n            # Intraline test\n            (cx < ax)\n            or (cx > bx + x)\n            # Interline test\n            or abs(cy - ay) > y\n        )\n\n    def iter_chars_to_words(\n        self,\n        ordered_chars: T_obj_iter,\n        direction: T_dir,\n    ) -> Generator[T_obj_list, None, None]:\n        current_word: T_obj_list = []\n\n        def start_next_word(\n            new_char: Optional[T_obj],\n        ) -> Generator[T_obj_list, None, None]:\n            nonlocal current_word\n\n            if current_word:\n                yield current_word\n\n            current_word = [] if new_char is None else [new_char]\n\n        xt = self.x_tolerance\n        xtr = self.x_tolerance_ratio\n        yt = self.y_tolerance\n        ytr = self.y_tolerance_ratio\n\n        for char in ordered_chars:\n            text = char[\"text\"]\n\n            if not self.keep_blank_chars and text.isspace():\n                yield from start_next_word(None)\n\n            elif text in self.split_at_punctuation:\n                yield from start_next_word(char)\n                yield from start_next_word(None)\n\n            elif current_word and self.char_begins_new_word(\n                current_word[-1],\n                char,\n                direction,\n                x_tolerance=(xt if xtr is None else xtr * current_word[-1][\"size\"]),\n                y_tolerance=(yt if ytr is None else ytr * current_word[-1][\"size\"]),\n            ):\n                yield from start_next_word(char)\n\n            else:\n                current_word.append(char)\n\n        # Finally, after all chars processed\n        if current_word:\n            yield current_word\n\n    def iter_chars_to_lines(\n        self, chars: T_obj_iter\n    ) -> Generator[Tuple[T_obj_list, T_dir], None, None]:\n        chars = list(chars)\n        upright = chars[0][\"upright\"]\n        line_dir = self.line_dir if upright else self.line_dir_rotated\n        char_dir = 
self.get_char_dir(upright)\n\n        line_cluster_key = get_line_cluster_key(line_dir)\n        char_sort_key = get_char_sort_key(char_dir)\n\n        # Cluster by line\n        subclusters = cluster_objects(\n            chars,\n            line_cluster_key,\n            (self.y_tolerance if line_dir in (\"ttb\", \"btt\") else self.x_tolerance),\n        )\n\n        for sc in subclusters:\n            # Sort within line\n            chars_sorted = sorted(sc, key=char_sort_key)\n            yield (chars_sorted, char_dir)\n\n    def iter_extract_tuples(\n        self, chars: T_obj_iter\n    ) -> Generator[Tuple[T_obj, T_obj_list], None, None]:\n        grouping_key = itemgetter(\"upright\", *self.extra_attrs)\n        grouped_chars = itertools.groupby(chars, grouping_key)\n\n        for keyvals, char_group in grouped_chars:\n            line_groups = (\n                [(char_group, self.char_dir)]\n                if self.use_text_flow\n                else self.iter_chars_to_lines(char_group)\n            )\n            for line_chars, direction in line_groups:\n                for word_chars in self.iter_chars_to_words(line_chars, direction):\n                    yield (self.merge_chars(word_chars), word_chars)\n\n    def extract_wordmap(self, chars: T_obj_iter) -> WordMap:\n        return WordMap(list(self.iter_extract_tuples(chars)))\n\n    def extract_words(\n        self, chars: T_obj_list, return_chars: bool = False\n    ) -> T_obj_list:\n        if return_chars:\n            return list(\n                {**word, \"chars\": word_chars}\n                for word, word_chars in self.iter_extract_tuples(chars)\n            )\n        else:\n            return list(word for word, word_chars in self.iter_extract_tuples(chars))\n\n\ndef extract_words(\n    chars: T_obj_list, return_chars: bool = False, **kwargs: Any\n) -> T_obj_list:\n    return WordExtractor(**kwargs).extract_words(chars, return_chars)\n\n\nTEXTMAP_KWARGS = 
inspect.signature(WordMap.to_textmap).parameters.keys()\nWORD_EXTRACTOR_KWARGS = inspect.signature(WordExtractor).parameters.keys()\n\n\ndef chars_to_textmap(chars: T_obj_list, **kwargs: Any) -> TextMap:\n    kwargs.update(\n        {\n            \"presorted\": True,\n            \"layout_bbox\": kwargs.get(\"layout_bbox\") or objects_to_bbox(chars),\n        }\n    )\n\n    extractor = WordExtractor(\n        **{k: kwargs[k] for k in WORD_EXTRACTOR_KWARGS if k in kwargs}\n    )\n    wordmap = extractor.extract_wordmap(chars)\n    textmap = wordmap.to_textmap(\n        **{k: kwargs[k] for k in TEXTMAP_KWARGS if k in kwargs}\n    )\n    return textmap\n\n\ndef extract_text(\n    chars: T_obj_list,\n    line_dir_render: Optional[T_dir] = None,\n    char_dir_render: Optional[T_dir] = None,\n    **kwargs: Any,\n) -> str:\n    chars = to_list(chars)\n    if len(chars) == 0:\n        return \"\"\n\n    if kwargs.get(\"layout\"):\n        textmap_kwargs = {\n            **kwargs,\n            **{\"line_dir_render\": line_dir_render, \"char_dir_render\": char_dir_render},\n        }\n        return chars_to_textmap(chars, **textmap_kwargs).as_string\n    else:\n        extractor = WordExtractor(\n            **{k: kwargs[k] for k in WORD_EXTRACTOR_KWARGS if k in kwargs}\n        )\n        words = extractor.extract_words(chars)\n\n        line_dir_render = line_dir_render or extractor.line_dir\n        char_dir_render = char_dir_render or extractor.char_dir\n\n        line_cluster_key = get_line_cluster_key(extractor.line_dir)\n\n        x_tolerance = kwargs.get(\"x_tolerance\", DEFAULT_X_TOLERANCE)\n        y_tolerance = kwargs.get(\"y_tolerance\", DEFAULT_Y_TOLERANCE)\n\n        lines = cluster_objects(\n            words,\n            line_cluster_key,\n            y_tolerance if line_dir_render in (\"ttb\", \"btt\") else x_tolerance,\n        )\n\n        return TextMap(\n            [\n                (char, None)\n                for char in (\n                    
\"\\n\".join(\" \".join(word[\"text\"] for word in line) for line in lines)\n                )\n            ],\n            line_dir_render=line_dir_render,\n            char_dir_render=char_dir_render,\n        ).as_string\n\n\ndef collate_line(\n    line_chars: T_obj_list,\n    tolerance: T_num = DEFAULT_X_TOLERANCE,\n) -> str:\n    coll = \"\"\n    last_x1 = None\n    for char in sorted(line_chars, key=itemgetter(\"x0\")):\n        if (last_x1 is not None) and (char[\"x0\"] > (last_x1 + tolerance)):\n            coll += \" \"\n        last_x1 = char[\"x1\"]\n        coll += char[\"text\"]\n    return coll\n\n\ndef extract_text_simple(\n    chars: T_obj_list,\n    x_tolerance: T_num = DEFAULT_X_TOLERANCE,\n    y_tolerance: T_num = DEFAULT_Y_TOLERANCE,\n) -> str:\n    clustered = cluster_objects(chars, itemgetter(\"doctop\"), y_tolerance)\n    return \"\\n\".join(collate_line(c, x_tolerance) for c in clustered)\n\n\ndef dedupe_chars(\n    chars: T_obj_list,\n    tolerance: T_num = 1,\n    extra_attrs: Optional[Tuple[str, ...]] = (\"fontname\", \"size\"),\n) -> T_obj_list:\n    \"\"\"\n    Removes duplicate chars — those sharing the same text and positioning\n    (within `tolerance`) as other characters in the set. 
Use `extra_attrs` to\n    be more restrictive with the properties shared by the matching chars.\n    \"\"\"\n    key = itemgetter(*(\"upright\", \"text\"), *(extra_attrs or tuple()))\n    pos_key = itemgetter(\"doctop\", \"x0\")\n\n    def yield_unique_chars(chars: T_obj_list) -> Generator[T_obj, None, None]:\n        sorted_chars = sorted(chars, key=key)\n        for grp, grp_chars in itertools.groupby(sorted_chars, key=key):\n            for y_cluster in cluster_objects(\n                list(grp_chars), itemgetter(\"doctop\"), tolerance\n            ):\n                for x_cluster in cluster_objects(\n                    y_cluster, itemgetter(\"x0\"), tolerance\n                ):\n                    yield sorted(x_cluster, key=pos_key)[0]\n\n    deduped = yield_unique_chars(chars)\n    return sorted(deduped, key=chars.index)\n"
  },
  {
    "path": "requirements-dev.txt",
    "content": "black==24.8.0\nflake8==7.1.1\nisort==5.13.2\njupyterlab>=4.4.8\nmypy==1.11.1\nnbexec==0.2.0\npandas-stubs>=2.2.2.240805\npandas>=2.2.2\npy==1.11.0\npytest-cov==5.0.0\npytest-xdist==3.8.0\npytest==8.3.2\nsetuptools>=78.1.1\ntypes-Pillow==10.2.0.20240520\n"
  },
  {
    "path": "requirements.txt",
    "content": "pdfminer.six==20260107\nPillow>=9.1\npypdfium2>=4.18.0\n"
  },
  {
    "path": "setup.cfg",
    "content": "[flake8]\n# max-complexity = 10\nmax-line-length = 88\nignore = \n    # https://black.readthedocs.io/en/stable/the_black_code_style.html#slices\n    E203\n    # Impossible to obey both W503 and W504\n    W503\n    # https://github.com/psf/black/issues/3887\n    E704\n\n[tool:pytest]\naddopts=--cov=pdfplumber --cov-report xml:coverage.xml --cov-report term\n\n[tool.isort]\nprofile = \"black\"\n\n[testenv]\ndeps=\n    -r requirements.txt\n    -r requirements-dev.txt\ncommands=python -m pytest\n"
  },
  {
    "path": "setup.py",
    "content": "import os\n\nfrom setuptools import setup, find_packages\n\nNAME = \"pdfplumber\"\nHERE = os.path.abspath(os.path.dirname(__file__))\n\nversion_ns = {}\n\n\ndef _open(subpath):\n    path = os.path.join(HERE, subpath)\n    return open(path, encoding=\"utf-8\")\n\n\nwith _open(NAME + \"/_version.py\") as f:\n    exec(f.read(), {}, version_ns)\n\nwith _open(\"requirements.txt\") as f:\n    base_reqs = f.read().strip().split(\"\\n\")\n\nwith _open(\"requirements-dev.txt\") as f:\n    dev_reqs = f.read().strip().split(\"\\n\")\n\nwith _open(\"README.md\") as f:\n    long_description = f.read()\n\nsetup(\n    name=NAME,\n    url=\"https://github.com/jsvine/pdfplumber\",\n    author=\"Jeremy Singer-Vine\",\n    author_email=\"jsvine@gmail.com\",\n    description=\"Plumb a PDF for detailed information about each char, rectangle, and line.\",\n    long_description=long_description,\n    long_description_content_type=\"text/markdown\",\n    version=version_ns[\"__version__\"],\n    packages=find_packages(\n        exclude=[\n            \"test\",\n        ]\n    ),\n    include_package_data=True,\n    package_data={\"pdfplumber\": [\"py.typed\"]},\n    zip_safe=False,\n    tests_require=base_reqs + dev_reqs,\n    python_requires=\">=3.8\",\n    install_requires=base_reqs,\n    entry_points={\"console_scripts\": [\"pdfplumber = pdfplumber.cli:main\"]},\n    classifiers=[\n        \"Intended Audience :: Developers\",\n        \"License :: OSI Approved :: MIT License\",\n        \"Operating System :: OS Independent\",\n        \"Programming Language :: Python :: 3.10\",\n        \"Programming Language :: Python :: 3.11\",\n        \"Programming Language :: Python :: 3.12\",\n        \"Programming Language :: Python :: 3.13\",\n        \"Programming Language :: Python :: 3.14\",\n    ],\n)\n"
  },
  {
    "path": "tests/comparisons/scotus-transcript-p1-cropped.txt",
    "content": " 1      IN THE SUPREME COURT OF THE UNITED STATES                       \n                                                                        \n 2   - - - - - - - - - - - - - - - - - x                                \n                                                                        \n 3   MICHAEL A. KNOWLES,               :                                \n                                                                        \n 4   WARDEN,                           :                                \n                                                                        \n 5               Petitioner            :                                \n                                                                        \n 6          v.                         :  No. 07-1315                   \n                                                                        \n 7   ALEXANDRE MIRZAYANCE.             :                                \n                                                                        \n 8   - - - - - - - - - - - - - - - - - x                                \n                                                                        \n 9                          Washington, D.C.                            \n                                                                        \n10                          Tuesday, January 13, 2009                   \n"
  },
  {
    "path": "tests/comparisons/scotus-transcript-p1.txt",
    "content": "                                                                                    \n                                                                                    \n                                                                                    \n                                   Official - Subject to Final Review               \n                                                                                    \n             1      IN THE SUPREME COURT OF THE UNITED STATES                       \n                                                                                    \n             2   - - - - - - - - - - - - - - - - - x                                \n                                                                                    \n             3   MICHAEL A. KNOWLES,               :                                \n                                                                                    \n             4   WARDEN,                           :                                \n                                                                                    \n             5               Petitioner            :                                \n                                                                                    \n             6          v.                         :  No. 07-1315                   \n                                                                                    \n             7   ALEXANDRE MIRZAYANCE.             :                                \n                                                                                    \n             8   - - - - - - - - - - - - - - - - - x                                \n                                                                                    \n             9                          Washington, D.C.                            
\n                                                                                    \n            10                          Tuesday, January 13, 2009                   \n                                                                                    \n            11                                                                      \n                                                                                    \n            12                 The above-entitled matter came on for oral           \n                                                                                    \n            13   argument before the Supreme Court of the United States             \n                                                                                    \n            14   at 1:01 p.m.                                                       \n                                                                                    \n            15   APPEARANCES:                                                       \n                                                                                    \n            16   STEVEN E. MERCER, ESQ., Deputy Attorney General, Los               \n                                                                                    \n            17      Angeles, Cal.; on behalf of the Petitioner.                     \n                                                                                    \n            18   CHARLES M. SEVILLA, ESQ., San Diego, Cal.; on behalf               \n                                                                                    \n            19      of the Respondent.                                              
\n                                                                                    \n            20                                                                      \n                                                                                    \n            21                                                                      \n                                                                                    \n            22                                                                      \n                                                                                    \n            23                                                                      \n                                                                                    \n            24                                                                      \n                                                                                    \n            25                                                                      \n                                                                                    \n                                                                                    \n                                            1                                       \n                                    Alderson Reporting Company                      \n                                                                                    \n                                                                                    \n"
  },
  {
    "path": "tests/pdfs/make_xref.py",
    "content": "#!/usr/bin/env python\n\n\"\"\"Create an xref section for a simple handmade PDF.\n\nNot a general purpose tool!!!\"\"\"\n\nimport re\nimport sys\n\nwith open(sys.argv[1], \"r+b\") as infh:\n    pos = 0\n    xref = [(0, 65535, \"f\")]\n    for spam in infh:\n        text = spam.decode(\"ascii\")\n        if re.match(r\"\\s*(\\d+)\\s+(\\d+)\\s+obj\", text):\n            xref.append((pos, 0, \"n\"))\n        elif text.strip() == \"xref\":\n            startxref = pos\n        pos = infh.tell()\n    infh.seek(startxref)\n    infh.write(b\"xref\\n\")\n    infh.write((\"0 %d\\n\" % len(xref)).encode(\"ascii\"))\n    for x in xref:\n        infh.write((\"%010d %05d %s \\n\" % x).encode(\"ascii\"))\n    infh.write((\"trailer  << /Size %d /Root 1 0 R >>\\n\" % len(xref)).encode(\"ascii\"))\n    infh.write(b\"startxref\\n\")\n    infh.write((\"%d\\n\" % startxref).encode(\"ascii\"))\n    infh.write(b\"%%EOF\\n\")\n"
  },
  {
    "path": "tests/test_basics.py",
    "content": "#!/usr/bin/env python\nimport logging\nimport os\nimport unittest\n\nimport pytest\n\nimport pdfplumber\n\nlogging.disable(logging.ERROR)\n\nHERE = os.path.abspath(os.path.dirname(__file__))\n\n\nclass Test(unittest.TestCase):\n    @classmethod\n    def setup_class(self):\n        path = os.path.join(HERE, \"pdfs/nics-background-checks-2015-11.pdf\")\n        self.pdf = pdfplumber.open(path)\n        # via http://www.pdfill.com/example/pdf_drawing_new.pdf\n        path_2 = os.path.join(HERE, \"pdfs/pdffill-demo.pdf\")\n        self.pdf_2 = pdfplumber.open(path_2)\n\n    @classmethod\n    def teardown_class(self):\n        self.pdf.close()\n        self.pdf_2.close()\n\n    def test_metadata(self):\n        metadata = self.pdf.metadata\n        assert isinstance(metadata[\"Producer\"], str)\n\n    def test_pagecount(self):\n        assert len(self.pdf.pages) == 1\n\n    def test_page_number(self):\n        assert self.pdf.pages[0].page_number == 1\n        assert str(self.pdf.pages[0]) == \"<Page:1>\"\n\n    def test_objects(self):\n        assert len(self.pdf.chars)\n        assert len(self.pdf.rects)\n        assert len(self.pdf.lines)\n        assert len(self.pdf.rect_edges)\n        assert len(self.pdf_2.curve_edges)\n        # Ensure that caching is working:\n        assert id(self.pdf._rect_edges) == id(self.pdf.rect_edges)\n        assert id(self.pdf_2._curve_edges) == id(self.pdf_2.curve_edges)\n        assert id(self.pdf.pages[0]._layout) == id(self.pdf.pages[0].layout)\n\n    def test_annots(self):\n        pdf = self.pdf_2\n        assert len(pdf.annots)\n        assert len(pdf.hyperlinks) == 17\n        uri = \"http://www.pdfill.com/pdf_drawing.html\"\n        assert pdf.hyperlinks[0][\"uri\"] == uri\n\n        path = os.path.join(HERE, \"pdfs/annotations.pdf\")\n        with pdfplumber.open(path) as pdf:\n            assert len(pdf.annots)\n\n    def test_annots_cropped(self):\n        pdf = self.pdf_2\n        page = pdf.pages[0]\n      
  assert len(page.annots) == 13\n        assert len(page.hyperlinks) == 1\n\n        cropped = page.crop(page.bbox)\n        assert len(cropped.annots) == 13\n        assert len(cropped.hyperlinks) == 1\n\n        h0_bbox = pdfplumber.utils.obj_to_bbox(page.hyperlinks[0])\n        cropped = page.crop(h0_bbox)\n        assert len(cropped.annots) == len(cropped.hyperlinks) == 1\n\n    def test_annots_rotated(self):\n        def get_annot(filename, n=0):\n            path = os.path.join(HERE, \"pdfs\", filename)\n            with pdfplumber.open(path) as pdf:\n                return pdf.pages[0].annots[n]\n\n        a = get_annot(\"annotations.pdf\", 3)\n        b = get_annot(\"annotations-rotated-180.pdf\", 3)\n        c = get_annot(\"annotations-rotated-90.pdf\", 3)\n        d = get_annot(\"annotations-rotated-270.pdf\", 3)\n\n        assert (\n            int(a[\"width\"]) == int(b[\"width\"]) == int(c[\"height\"]) == int(d[\"height\"])\n        )\n        assert (\n            int(a[\"height\"]) == int(b[\"height\"]) == int(c[\"width\"]) == int(d[\"width\"])\n        )\n        assert int(a[\"x0\"]) == int(c[\"top\"]) == int(d[\"y0\"])\n        assert int(a[\"x1\"]) == int(c[\"bottom\"]) == int(d[\"y1\"])\n        assert int(a[\"top\"]) == int(b[\"y0\"]) == int(d[\"x0\"])\n        assert int(a[\"bottom\"]) == int(b[\"y1\"]) == int(d[\"x1\"])\n\n    def test_crop_and_filter(self):\n        def test(obj):\n            return obj[\"object_type\"] == \"char\"\n\n        bbox = (0, 0, 200, 200)\n        original = self.pdf.pages[0]\n        cropped = original.crop(bbox)\n        assert id(cropped.chars) == id(cropped._objects[\"char\"])\n        assert cropped.width == 200\n        assert len(cropped.rects) > 0\n        assert len(cropped.chars) < len(original.chars)\n\n        within_bbox = original.within_bbox(bbox)\n        assert len(within_bbox.chars) < len(cropped.chars)\n        assert len(within_bbox.chars) > 0\n\n        filtered = cropped.filter(test)\n       
 assert id(filtered.chars) == id(filtered._objects[\"char\"])\n        assert len(filtered.rects) == 0\n\n    def test_outside_bbox(self):\n        original = self.pdf.pages[0]\n        outside_bbox = original.outside_bbox(original.find_tables()[0].bbox)\n        assert outside_bbox.extract_text() == \"Page 1 of 205\"\n        assert outside_bbox.bbox == original.bbox\n\n    def test_relative_crop(self):\n        page = self.pdf.pages[0]\n        cropped = page.crop((10, 10, 40, 40))\n        recropped = cropped.crop((10, 15, 20, 25), relative=True)\n        target_bbox = (20, 25, 30, 35)\n        assert recropped.bbox == target_bbox\n\n        recropped_wi = cropped.within_bbox((10, 15, 20, 25), relative=True)\n        assert recropped_wi.bbox == target_bbox\n\n        # via issue #245, should not throw error when using `relative=True`\n        bottom = page.crop((0, 0.8 * float(page.height), page.width, page.height))\n        bottom.crop((0, 0, 0.5 * float(bottom.width), bottom.height), relative=True)\n        bottom.crop(\n            (0.5 * float(bottom.width), 0, bottom.width, bottom.height), relative=True\n        )\n\n        # An extra test for issue #914, in which relative crops were\n        # using the the wrong bboxes for cropping, leading to empty object-lists\n        crop_right = page.crop((page.width / 2, 0, page.width, page.height))\n        crop_right_again_rel = crop_right.crop(\n            (0, 0, crop_right.width / 2, page.height), relative=True\n        )\n        assert len(crop_right_again_rel.chars)\n\n    def test_invalid_crops(self):\n        page = self.pdf.pages[0]\n        with pytest.raises(ValueError):\n            page.crop((0, 0, 0, 0))\n\n        with pytest.raises(ValueError):\n            page.crop((0, 0, 10000, 10))\n\n        with pytest.raises(ValueError):\n            page.crop((-10, 0, 10, 10))\n\n        with pytest.raises(ValueError):\n            page.crop((100, 0, 0, 100))\n\n        with pytest.raises(ValueError):\n    
        page.crop((0, 100, 100, 0))\n\n        # via issue #245\n        bottom = page.crop((0, 0.8 * float(page.height), page.width, page.height))\n        with pytest.raises(ValueError):\n            bottom.crop((0, 0, 0.5 * float(bottom.width), bottom.height))\n        with pytest.raises(ValueError):\n            bottom.crop((0.5 * float(bottom.width), 0, bottom.width, bottom.height))\n\n        # via issue #421, testing strict=True/False\n        with pytest.raises(ValueError):\n            page.crop((0, 0, page.width + 10, page.height + 10))\n\n        page.crop((0, 0, page.width + 10, page.height + 10), strict=False)\n\n    def test_rotation(self):\n        assert self.pdf.pages[0].width == 1008\n        assert self.pdf.pages[0].height == 612\n        path = os.path.join(HERE, \"pdfs/nics-background-checks-2015-11-rotated.pdf\")\n        with pdfplumber.open(path) as rotated:\n            assert rotated.pages[0].width == 612\n            assert rotated.pages[0].height == 1008\n\n            assert rotated.pages[0].cropbox != self.pdf.pages[0].cropbox\n            assert rotated.pages[0].bbox != self.pdf.pages[0].bbox\n\n    def test_password(self):\n        path = os.path.join(HERE, \"pdfs/password-example.pdf\")\n        with pdfplumber.open(path, password=\"test\") as pdf:\n            assert len(pdf.chars) > 0\n\n    def test_unicode_normalization(self):\n        path = os.path.join(HERE, \"pdfs/issue-905.pdf\")\n\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            print(page.extract_text())\n            assert ord(page.chars[0][\"text\"]) == 894\n\n        with pdfplumber.open(path, unicode_norm=\"NFC\") as pdf:\n            page = pdf.pages[0]\n            assert ord(page.chars[0][\"text\"]) == 59\n            assert page.extract_text() == \";;\"\n\n    def test_colors(self):\n        rect = self.pdf.pages[0].rects[0]\n        assert rect[\"non_stroking_color\"] == (0.8, 1, 1)\n\n    def test_text_colors(self):\n     
   char = self.pdf.pages[0].chars[3358]\n        assert char[\"non_stroking_color\"] == (1, 0, 0)\n\n    def test_load_with_custom_laparams(self):\n        # See https://github.com/jsvine/pdfplumber/issues/168\n        path = os.path.join(HERE, \"pdfs/cupertino_usd_4-6-16.pdf\")\n        laparams = dict(line_margin=0.2)\n        with pdfplumber.open(path, laparams=laparams) as pdf:\n            assert round(pdf.pages[0].chars[0][\"top\"], 3) == 66.384\n\n    def test_loading_pathobj(self):\n        from pathlib import Path\n\n        path = os.path.join(HERE, \"pdfs/nics-background-checks-2015-11.pdf\")\n        path_obj = Path(path)\n        with pdfplumber.open(path_obj) as pdf:\n            assert len(pdf.metadata)\n\n    def test_loading_fileobj(self):\n        path = os.path.join(HERE, \"pdfs/nics-background-checks-2015-11.pdf\")\n        with open(path, \"rb\") as f:\n            with pdfplumber.open(f) as pdf:\n                assert len(pdf.metadata)\n            assert not f.closed\n\n    def test_bad_fileobj(self):\n        path = os.path.join(HERE, \"pdfs/empty.pdf\")\n        with pytest.raises(pdfplumber.utils.exceptions.PdfminerException):\n            pdfplumber.open(path)\n\n        f = open(path)\n        with pytest.raises(pdfplumber.utils.exceptions.PdfminerException):\n            pdfplumber.open(f)\n        # File objects passed to pdfplumber should not be auto-closed\n        assert not f.closed\n        f.close()\n\n    def test_uncommon_boxes(self):\n        path = os.path.join(HERE, \"pdfs/page-boxes-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            assert page.artbox == (42.51969, 70.86613999999997, 552.75591, 827.71653)\n            assert page.bleedbox == (0, 0.0, 623.62205, 870.23622)\n            assert page.trimbox == (28.34646, 56.69290999999998, 566.92913, 841.88976)\n"
  },
  {
    "path": "tests/test_ca_warn_report.py",
    "content": "#!/usr/bin/env python\nimport logging\nimport os\nimport unittest\n\nimport pdfplumber\nfrom pdfplumber import table, utils\n\nlogging.disable(logging.ERROR)\n\nHERE = os.path.abspath(os.path.dirname(__file__))\n\n\ndef fix_row_spaces(row):\n    return [(x or \"\").replace(\" \", \"\") for x in row[:3]] + row[3:]\n\n\nclass Test(unittest.TestCase):\n    @classmethod\n    def setup_class(self):\n        self.path = os.path.join(\n            HERE, \"pdfs/WARN-Report-for-7-1-2015-to-03-25-2016.pdf\"\n        )\n        self.pdf = pdfplumber.open(self.path)\n        self.PDF_WIDTH = self.pdf.pages[0].width\n\n    @classmethod\n    def teardown_class(self):\n        self.pdf.close()\n\n    def test_page_limiting(self):\n        with pdfplumber.open(self.path, pages=[1, 3]) as pdf:\n            assert len(pdf.pages) == 2\n            assert pdf.pages[1].page_number == 3\n\n    def test_objects(self):\n        p = self.pdf.pages[0]\n        assert len(p.chars)\n        assert len(p.rects)\n        assert len(p.images)\n\n    def test_parse(self):\n\n        rect_x0_clusters = utils.cluster_list(\n            [r[\"x0\"] for r in self.pdf.pages[1].rects], tolerance=3\n        )\n\n        v_lines = [x[0] for x in rect_x0_clusters]\n\n        def parse_page(page):\n            data = page.extract_table(\n                {\"vertical_strategy\": \"explicit\", \"explicit_vertical_lines\": v_lines}\n            )\n            without_spaces = [fix_row_spaces(row) for row in data]\n            return without_spaces\n\n        parsed = parse_page(self.pdf.pages[0])\n\n        assert parsed[0] == [\n            \"NoticeDate\",\n            \"Effective\",\n            \"Received\",\n            \"Company\",\n            \"City\",\n            \"No. 
Of\",\n            \"Layoff/Closure\",\n        ]\n\n        assert parsed[1] == [\n            \"06/22/2015\",\n            \"03/25/2016\",\n            \"07/01/2015\",\n            \"Maxim Integrated Product\",\n            \"San Jose\",\n            \"150\",\n            \"Closure Permanent\",\n        ]\n\n    def test_edge_merging(self):\n        p0 = self.pdf.pages[0]\n        assert len(p0.edges) == 364\n        assert (\n            len(\n                table.merge_edges(\n                    p0.edges,\n                    snap_x_tolerance=3,\n                    snap_y_tolerance=3,\n                    join_x_tolerance=3,\n                    join_y_tolerance=3,\n                )\n            )\n            == 46\n        )\n        assert (\n            len(\n                table.merge_edges(\n                    p0.edges,\n                    snap_x_tolerance=3,\n                    snap_y_tolerance=3,\n                    join_x_tolerance=3,\n                    join_y_tolerance=0,\n                )\n            )\n            == 52\n        )\n        assert (\n            len(\n                table.merge_edges(\n                    p0.edges,\n                    snap_x_tolerance=0,\n                    snap_y_tolerance=3,\n                    join_x_tolerance=3,\n                    join_y_tolerance=3,\n                )\n            )\n            == 94\n        )\n        assert (\n            len(\n                table.merge_edges(\n                    p0.edges,\n                    snap_x_tolerance=3,\n                    snap_y_tolerance=0,\n                    join_x_tolerance=3,\n                    join_y_tolerance=3,\n                )\n            )\n            == 174\n        )\n\n    def test_vertices(self):\n        p0 = self.pdf.pages[0]\n        edges = table.merge_edges(\n            p0.edges,\n            snap_x_tolerance=3,\n            snap_y_tolerance=3,\n            join_x_tolerance=3,\n            join_y_tolerance=3,\n     
   )\n        ixs = table.edges_to_intersections(edges)\n        assert len(ixs.keys()) == 304  # 38x8\n"
  },
  {
    "path": "tests/test_convert.py",
    "content": "#!/usr/bin/env python\nimport json\nimport logging\nimport os\nimport sys\nimport unittest\nfrom io import StringIO\nfrom subprocess import PIPE, Popen\n\nimport pytest\n\nimport pdfplumber\n\nlogging.disable(logging.ERROR)\n\nHERE = os.path.abspath(os.path.dirname(__file__))\n\n\nSCOTUS_TEXT = [\n    {\n        \"type\": \"Div\",\n        \"children\": [\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                \"attributes\": {\n                    \"LineHeight\": 25.75,\n                    \"TextIndent\": 21.625,\n                    \"O\": \"Layout\",\n                },\n                \"mcids\": [1],\n                \"text\": [\n                    \"IN THE SUPREME COURT OF THE UNITED STATES - - - - - - - - - - - - \"\n                    \"- - - - - x MICHAEL A. KNOWLES, : WARDEN, :\"\n                ],\n            },\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                \"attributes\": {\n                    \"LineHeight\": 25.75,\n                    \"StartIndent\": 86.375,\n                    \"O\": \"Layout\",\n                },\n                \"mcids\": [2],\n                \"text\": [\" Petitioner :\"],\n            },\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                \"attributes\": {\n                    \"LineHeight\": 25.75,\n                    \"TextIndent\": 50.375,\n                    \"O\": \"Layout\",\n                },\n                \"mcids\": [3, 4],\n                \"text\": [\n                    \" v. \",\n                    \": No. 07-1315 ALEXANDRE MIRZAYANCE. 
: - - - - - - - - - - - - - -\"\n                    \" - - - x\",\n                ],\n            },\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                \"attributes\": {\n                    \"O\": \"Layout\",\n                    \"SpaceAfter\": 24.5,\n                    \"LineHeight\": 25.75,\n                    \"StartIndent\": 165.625,\n                    \"EndIndent\": 57.625,\n                },\n                \"mcids\": [5],\n                \"text\": [\" Washington, D.C. Tuesday, January 13, 2009\"],\n            },\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                \"attributes\": {\n                    \"LineHeight\": 25.75,\n                    \"TextIndent\": 100.75,\n                    \"O\": \"Layout\",\n                },\n                \"mcids\": [6],\n                \"text\": [\n                    \" The above-entitled matter came on for oral argument before the \"\n                    \"Supreme Court of the United States at 1:01 p.m. APPEARANCES: \"\n                    \"STEVEN E. MERCER, ESQ., Deputy Attorney General, Los\"\n                ],\n            },\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                \"attributes\": {\n                    \"O\": \"Layout\",\n                    \"SpaceAfter\": 179.125,\n                    \"LineHeight\": 25.75,\n                    \"TextIndent\": 21.625,\n                    \"EndIndent\": 50.375,\n                    \"TextAlign\": \"None\",\n                },\n                \"mcids\": [7],\n                \"text\": [\n                    \" Angeles, Cal.; on behalf of the Petitioner. CHARLES M. SEVILLA, \"\n                    \"ESQ., San Diego, Cal.; on behalf of the Respondent. 
\"\n                ],\n            },\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                \"attributes\": {\"O\": \"Layout\", \"TextAlign\": \"Center\", \"SpaceAfter\": 8.5},\n                \"mcids\": [8],\n                \"text\": [\"1\\n\"],\n            },\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                \"attributes\": {\"O\": \"Layout\", \"TextAlign\": \"Center\"},\n                \"mcids\": [9],\n                \"text\": [\"Alderson Reporting Company \"],\n            },\n        ],\n    }\n]\n\n\ndef run(cmd):\n    return Popen(cmd, stdout=PIPE).communicate()[0]\n\n\nclass Test(unittest.TestCase):\n    @classmethod\n    def setup_class(self):\n        self.path = os.path.join(HERE, \"pdfs/pdffill-demo.pdf\")\n        self.pdf = pdfplumber.open(self.path, pages=[1, 2, 5])\n\n    @classmethod\n    def teardown_class(self):\n        self.pdf.close()\n\n    def test_json(self):\n        c = json.loads(self.pdf.to_json())\n        assert (\n            c[\"pages\"][0][\"rects\"][0][\"bottom\"] == self.pdf.pages[0].rects[0][\"bottom\"]\n        )\n\n    def test_json_attr_filter(self):\n        c = json.loads(self.pdf.to_json(include_attrs=[\"page_number\"]))\n        assert list(c[\"pages\"][0][\"rects\"][0].keys()) == [\"object_type\", \"page_number\"]\n\n        with pytest.raises(ValueError):\n            self.pdf.to_json(include_attrs=[\"page_number\"], exclude_attrs=[\"bottom\"])\n\n        with pytest.raises(ValueError):\n            self.pdf.to_json(exclude_attrs=[\"object_type\"])\n\n    def test_json_all_types(self):\n        c = json.loads(self.pdf.to_json(object_types=None))\n        found_types = c[\"pages\"][0].keys()\n        assert \"chars\" in found_types\n        assert \"lines\" in found_types\n        assert \"rects\" in found_types\n        assert \"images\" in found_types\n        assert \"curves\" in c[\"pages\"][2].keys()\n\n    
def test_single_pages(self):\n        c = json.loads(self.pdf.pages[0].to_json())\n        assert c[\"rects\"][0][\"bottom\"] == self.pdf.pages[0].rects[0][\"bottom\"]\n\n    def test_additional_attr_types(self):\n        path = os.path.join(HERE, \"pdfs/issue-67-example.pdf\")\n        with pdfplumber.open(path, pages=[1]) as pdf:\n            c = json.loads(pdf.to_json())\n            assert len(c[\"pages\"][0][\"images\"])\n\n    def test_csv(self):\n        c = self.pdf.to_csv(precision=3)\n        assert c.split(\"\\r\\n\")[9] == (\n            \"char,1,45.83,58.826,656.82,674.82,117.18,117.18,135.18,12.996,\"\n            \"18.0,12.996,,,,,,,TimesNewRomanPSMT,,,\"\n            '\"(1.0, 0.0, 0.0, 1.0, 45.83, 660.69)\"'\n            ',,,DeviceRGB,\"(0.0, 0.0, 0.0)\",,,18.0,,,,\"(0,)\",,Y,,1,'\n        )\n\n        io = StringIO()\n        self.pdf.to_csv(io, precision=3)\n        io.seek(0)\n        c_from_io = io.read()\n        assert c == c_from_io\n\n    def test_csv_all_types(self):\n        c = self.pdf.to_csv(object_types=None)\n        assert c.split(\"\\r\\n\")[1].split(\",\")[0] == \"line\"\n\n    def test_cli_help(self):\n        res = run([sys.executable, \"-m\", \"pdfplumber.cli\"])\n        assert b\"usage:\" in res\n\n    def test_cli_structure(self):\n        res = run([sys.executable, \"-m\", \"pdfplumber.cli\", self.path, \"--structure\"])\n        c = json.loads(res)\n        # lol no structure\n        assert c == []\n\n    def test_cli_structure_text(self):\n        path = os.path.join(HERE, \"pdfs/scotus-transcript-p1.pdf\")\n        res = run([sys.executable, \"-m\", \"pdfplumber.cli\", path, \"--structure-text\"])\n        c = json.loads(res)\n        assert c == SCOTUS_TEXT\n\n    def test_cli_json(self):\n        res = run(\n            [\n                sys.executable,\n                \"-m\",\n                \"pdfplumber.cli\",\n                self.path,\n                \"--format\",\n                \"json\",\n                
\"--pages\",\n                \"1-2\",\n                \"5\",\n                \"--indent\",\n                \"2\",\n            ]\n        )\n\n        c = json.loads(res)\n        assert c[\"pages\"][0][\"page_number\"] == 1\n        assert c[\"pages\"][1][\"page_number\"] == 2\n        assert c[\"pages\"][2][\"page_number\"] == 5\n        assert c[\"pages\"][0][\"rects\"][0][\"bottom\"] == float(\n            self.pdf.pages[0].rects[0][\"bottom\"]\n        )\n\n    def test_cli_csv(self):\n        res = run(\n            [\n                sys.executable,\n                \"-m\",\n                \"pdfplumber.cli\",\n                self.path,\n                \"--format\",\n                \"csv\",\n                \"--precision\",\n                \"3\",\n            ]\n        )\n\n        assert res.decode(\"utf-8\").split(\"\\r\\n\")[9] == (\n            \"char,1,45.83,58.826,656.82,674.82,117.18,117.18,135.18,12.996,\"\n            \"18.0,12.996,,,,,,,TimesNewRomanPSMT,,,\"\n            '\"(1.0, 0.0, 0.0, 1.0, 45.83, 660.69)\"'\n            ',,,DeviceRGB,\"(0.0, 0.0, 0.0)\",,,18.0,,,,\"(0,)\",,Y,,1,'\n        )\n\n    def test_cli_csv_exclude(self):\n        res = run(\n            [\n                sys.executable,\n                \"-m\",\n                \"pdfplumber.cli\",\n                self.path,\n                \"--format\",\n                \"csv\",\n                \"--precision\",\n                \"3\",\n                \"--exclude-attrs\",\n                \"matrix\",\n                \"mcid\",\n                \"ncs\",\n            ]\n        )\n\n        assert res.decode(\"utf-8\").split(\"\\r\\n\")[9] == (\n            \"char,1,45.83,58.826,656.82,674.82,117.18,117.18,135.18,12.996,\"\n            \"18.0,12.996,,,,,,,TimesNewRomanPSMT,\"\n            ',,,\"(0.0, 0.0, 0.0)\",,,18.0,,,,\"(0,)\",,Y,,1,'\n        )\n\n    def test_cli_csv_include(self):\n        res = run(\n            [\n                sys.executable,\n                
\"-m\",\n                \"pdfplumber.cli\",\n                self.path,\n                \"--format\",\n                \"csv\",\n                \"--precision\",\n                \"3\",\n                \"--include-attrs\",\n                \"page_number\",\n            ]\n        )\n\n        assert res.decode(\"utf-8\").split(\"\\r\\n\")[9] == (\"char,1\")\n\n    def test_cli_text(self):\n        path = os.path.join(HERE, \"pdfs/scotus-transcript-p1.pdf\")\n        res = run(\n            [\n                sys.executable,\n                \"-m\",\n                \"pdfplumber.cli\",\n                path,\n                \"--format\",\n                \"text\",\n            ]\n        )\n\n        target_path = os.path.join(HERE, \"comparisons/scotus-transcript-p1.txt\")\n        target = open(target_path).read()\n        assert res.decode(\"utf-8\") == target\n\n    def test_page_to_dict(self):\n        x = self.pdf.pages[0].to_dict(object_types=[\"char\"])\n        assert len(x[\"chars\"]) == len(self.pdf.pages[0].chars)\n"
  },
  {
    "path": "tests/test_ctm.py",
    "content": "#!/usr/bin/env python\nimport os\nimport unittest\n\nimport pdfplumber\nfrom pdfplumber.ctm import CTM\n\nHERE = os.path.abspath(os.path.dirname(__file__))\n\n\nclass Test(unittest.TestCase):\n    def test_pdffill_demo(self):\n        path = os.path.join(HERE, \"pdfs/pdffill-demo.pdf\")\n        # Close the PDF once the two characters have been read\n        with pdfplumber.open(path) as pdf:\n            left_r = pdf.pages[3].chars[97]\n            right_r = pdf.pages[3].chars[105]\n\n        left_ctm = CTM(*left_r[\"matrix\"])\n        right_ctm = CTM(*right_r[\"matrix\"])\n\n        assert round(left_ctm.translation_x) == 126\n        assert round(right_ctm.translation_x) == 372\n\n        assert round(left_ctm.translation_y) == 519\n        assert round(right_ctm.translation_y) == 562\n\n        assert left_ctm.skew_x == 45\n        assert right_ctm.skew_x == -45\n\n        assert left_ctm.skew_y == 45\n        assert right_ctm.skew_y == -45\n\n        assert round(left_ctm.scale_x, 3) == 1\n        assert round(right_ctm.scale_x, 3) == 1\n\n        assert round(left_ctm.scale_y, 3) == 1\n        assert round(right_ctm.scale_y, 3) == 1\n"
  },
  {
    "path": "tests/test_dedupe_chars.py",
    "content": "#!/usr/bin/env python\nimport logging\nimport os\nimport unittest\n\nimport pdfplumber\n\nlogging.disable(logging.ERROR)\n\nHERE = os.path.abspath(os.path.dirname(__file__))\n\n\nclass Test(unittest.TestCase):\n    @classmethod\n    def setup_class(self):\n        path = os.path.join(HERE, \"pdfs/issue-71-duplicate-chars.pdf\")\n        self.pdf = pdfplumber.open(path)\n\n    @classmethod\n    def teardown_class(self):\n        self.pdf.close()\n\n    def test_extract_table(self):\n        page = self.pdf.pages[0]\n        table_without_drop_duplicates = page.extract_table()\n        table_with_drop_duplicates = page.dedupe_chars().extract_table()\n        last_line_without_drop = table_without_drop_duplicates[1][1].split(\"\\n\")[-1]\n        last_line_with_drop = table_with_drop_duplicates[1][1].split(\"\\n\")[-1]\n\n        assert (\n            last_line_without_drop\n            == \"微微软软 培培训训课课程程：： 名名模模意意义义一一些些有有意意义义一一些些\"\n        )\n        assert last_line_with_drop == \"微软 培训课程： 名模意义一些有意义一些\"\n\n    def test_extract_words(self):\n        page = self.pdf.pages[0]\n        x0 = 440.143\n        x1_without_drop = 534.992\n        x1_with_drop = 534.719\n        top_windows = 791.849\n        top_linux = 794.357\n        bottom = 802.961\n        last_words_without_drop = page.extract_words()[-1]\n        last_words_with_drop = page.dedupe_chars().extract_words()[-1]\n\n        assert round(last_words_without_drop[\"x0\"], 3) == x0\n        assert round(last_words_without_drop[\"x1\"], 3) == x1_without_drop\n        assert round(last_words_without_drop[\"top\"], 3) in (top_windows, top_linux)\n        assert round(last_words_without_drop[\"bottom\"], 3) == bottom\n        assert last_words_without_drop[\"upright\"] == 1\n        assert (\n            last_words_without_drop[\"text\"]\n            == \"名名模模意意义义一一些些有有意意义义一一些些\"\n        )\n\n        assert round(last_words_with_drop[\"x0\"], 3) == x0\n        assert 
round(last_words_with_drop[\"x1\"], 3) == x1_with_drop\n        assert round(last_words_with_drop[\"top\"], 3) in (top_windows, top_linux)\n        assert round(last_words_with_drop[\"bottom\"], 3) == bottom\n        assert last_words_with_drop[\"upright\"] == 1\n        assert last_words_with_drop[\"text\"] == \"名模意义一些有意义一些\"\n\n    def test_extract_text(self):\n        page = self.pdf.pages[0]\n        last_line_without_drop = page.extract_text().split(\"\\n\")[-1]\n        last_line_with_drop = page.dedupe_chars().extract_text().split(\"\\n\")[-1]\n\n        assert (\n            last_line_without_drop\n            == \"微微软软 培培训训课课程程：： 名名模模意意义义一一些些有有意意义义一一些些\"\n        )\n        assert last_line_with_drop == \"微软 培训课程： 名模意义一些有意义一些\"\n\n    def test_extract_text2(self):\n        path = os.path.join(HERE, \"pdfs/issue-71-duplicate-chars-2.pdf\")\n        pdf = pdfplumber.open(path)\n        page = pdf.pages[0]\n\n        assert (\n            page.dedupe_chars().extract_text(y_tolerance=6).splitlines()[4]\n            == \"UE 8. 
Circulation - Métabolismes\"\n        )\n\n    def test_extra_attrs(self):\n        path = os.path.join(HERE, \"pdfs/issue-1114-dedupe-chars.pdf\")\n        pdf = pdfplumber.open(path)\n        page = pdf.pages[0]\n\n        def dup_chars(s: str) -> str:\n            return \"\".join((char if char == \" \" else char + char) for char in s)\n\n        ground_truth = (\n            (\"Simple\", False, False),\n            (\"Duplicated\", True, True),\n            (\"Font\", \"fontname\", True),\n            (\"Size\", \"size\", True),\n            (\"Italic\", \"fontname\", True),\n            (\"Weight\", \"fontname\", True),\n            (\"Horizontal shift\", False, \"HHoorrizizoonntatal ls shhifitft\"),\n            (\"Vertical shift\", False, True),\n        )\n        gt = []\n        for text, should_dedup, dup_text in ground_truth:\n            if isinstance(dup_text, bool):\n                if dup_text:\n                    dup_text = dup_chars(text)\n                else:\n                    dup_text = text\n            gt.append((text, should_dedup, dup_text))\n\n        keys_list = [\"no_dedupe\", (), (\"size\",), (\"fontname\",), (\"size\", \"fontname\")]\n        for keys in keys_list:\n            if keys != \"no_dedupe\":\n                filtered_page = page.dedupe_chars(tolerance=2, extra_attrs=keys)\n            else:\n                filtered_page = page\n            for i, line in enumerate(\n                filtered_page.extract_text(y_tolerance=5).splitlines()\n            ):\n                text, should_dedup, dup_text = gt[i]\n                if keys == \"no_dedupe\":\n                    should_dedup = False\n                if isinstance(should_dedup, str):\n                    if should_dedup in keys:\n                        fail_msg = (\n                            f\"{should_dedup} is not required to match \"\n                            \"so it should be duplicated\"\n                        )\n                        assert line == 
dup_text, fail_msg\n                    else:\n                        fail_msg = (\n                            \"Should not be duplicated \"\n                            f\"when requiring matching {should_dedup}\"\n                        )\n                        assert line == text, fail_msg\n                elif should_dedup:\n                    assert line == text\n                else:\n                    assert line == dup_text\n"
  },
  {
    "path": "tests/test_display.py",
    "content": "#!/usr/bin/env python\nimport io\nimport logging\nimport os\nimport unittest\nfrom zipfile import ZipFile\n\nimport PIL.Image\nimport pytest\n\nimport pdfplumber\nfrom pdfplumber.table import TableFinder\n\nlogging.disable(logging.ERROR)\n\nHERE = os.path.abspath(os.path.dirname(__file__))\n\n\nclass Test(unittest.TestCase):\n    @classmethod\n    def setup_class(self):\n        path = os.path.join(HERE, \"pdfs/nics-background-checks-2015-11.pdf\")\n        self.pdf = pdfplumber.open(path)\n        self.im = self.pdf.pages[0].to_image()\n\n    @classmethod\n    def teardown_class(self):\n        self.pdf.close()\n\n    def test_basic_conversion(self):\n        self.im.reset()\n        self.im.draw_rects(self.im.page.rects)\n        self.im.draw_circle(self.im.page.chars[0])\n        self.im.draw_line(self.im.page.edges[0])\n        self.im.draw_vlines([10])\n        self.im.draw_hlines([10])\n\n    def test_width_height(self):\n        p = self.pdf.pages[0]\n        with pytest.raises(ValueError):\n            p.to_image(resolution=72, height=100)\n\n        im = p.to_image(width=503)\n        assert im.original.width == 503\n\n        im = p.to_image(height=805)\n        assert im.original.height == 805\n\n    def test_debug_tablefinder(self):\n        self.im.reset()\n        settings = {\"horizontal_strategy\": \"text\", \"intersection_tolerance\": 5}\n        self.im.debug_tablefinder(settings)\n        finder = TableFinder(self.im.page, settings)\n        self.im.debug_tablefinder(finder)\n\n        self.im.debug_tablefinder(None)\n\n        # https://github.com/jsvine/pdfplumber/issues/1237\n        self.im.debug_tablefinder(table_settings={})\n\n        with pytest.raises(ValueError):\n            self.im.debug_tablefinder(0)\n\n    def test_bytes_stream_to_image(self):\n        path = os.path.join(HERE, \"pdfs/nics-background-checks-2015-11.pdf\")\n        page = pdfplumber.PDF(io.BytesIO(open(path, \"rb\").read())).pages[0]\n        
page.to_image()\n\n    def test_curves(self):\n        path = os.path.join(HERE, \"../examples/pdfs/ag-energy-round-up-2017-02-24.pdf\")\n        page = pdfplumber.open(path).pages[0]\n        im = page.to_image()\n        im.draw_lines(page.curves)\n\n    def test_cropped(self):\n        im = self.pdf.pages[0].crop((10, 20, 30, 50)).to_image()\n        assert im.original.size == (20, 30)\n\n    def test_cropbox(self):\n        path = os.path.join(HERE, \"pdfs/issue-1054-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            im = pdf.pages[0].to_image()\n            assert im.original.size == (596, 842)\n            im = pdf.pages[0].to_image(force_mediabox=True)\n            assert im.original.size == (2227, 2923)\n\n    def test_copy(self):\n        assert self.im.copy().original == self.im.original\n\n    def test_outline_words(self):\n        self.im.outline_words(\n            stroke=\"blue\",\n            fill=(0, 200, 10),\n            stroke_width=2,\n            x_tolerance=5,\n            y_tolerance=5,\n        )\n\n    def test_outline_chars(self):\n        self.im.outline_chars(stroke=\"blue\", fill=(0, 200, 10), stroke_width=2)\n\n    def test__repr_png_(self):\n        png = self.im._repr_png_()\n        assert isinstance(png, bytes)\n        assert 20000 < len(png) < 80000\n\n    def test_no_quantize(self):\n        b = io.BytesIO()\n        self.im.save(b, \"PNG\", quantize=False)\n        assert len(b.getvalue()) > len(self.im._repr_png_())\n\n    def test_antialias(self):\n        aa = self.pdf.pages[0].to_image(antialias=True)\n        assert len(aa._repr_png_()) > len(self.im._repr_png_())\n\n    def test_decompression_bomb(self):\n        original_max = PIL.Image.MAX_IMAGE_PIXELS\n        PIL.Image.MAX_IMAGE_PIXELS = 10\n        # Previously, this raised PIL.Image.DecompressionBombError\n        self.pdf.pages[0].to_image()\n        PIL.Image.MAX_IMAGE_PIXELS = original_max\n\n    def test_password(self):\n        path = 
os.path.join(HERE, \"pdfs/password-example.pdf\")\n        with pdfplumber.open(path, password=\"test\") as pdf:\n            pdf.pages[0].to_image()\n\n    def test_zip(self):\n        # See https://github.com/jsvine/pdfplumber/issues/948\n        # reproducer.py\n        path = os.path.join(HERE, \"pdfs/issue-948.zip\")\n        with ZipFile(path) as zip_file:\n            with zip_file.open(\"dummy.pdf\") as pdf_file:\n                with pdfplumber.open(pdf_file) as pdf:\n                    page = pdf.pages[0]\n                    page.to_image()\n"
  },
  {
    "path": "tests/test_issues.py",
    "content": "#!/usr/bin/env python\nimport logging\nimport os\nimport re\n\ntry:\n    import resource\nexcept ModuleNotFoundError:\n    resource = None\nimport unittest\n\nimport pytest\n\nimport pdfplumber\n\nlogging.disable(logging.ERROR)\n\nHERE = os.path.abspath(os.path.dirname(__file__))\n\n\nclass Test(unittest.TestCase):\n    def test_issue_13(self):\n        \"\"\"\n        Test slightly simplified from gist here:\n        https://github.com/jsvine/pdfplumber/issues/13\n        \"\"\"\n        pdf = pdfplumber.open(\n            os.path.join(HERE, \"pdfs/issue-13-151201DSP-Fond-581-90D.pdf\")\n        )\n\n        # Only find checkboxes this size\n        RECT_WIDTH = 9.3\n        RECT_HEIGHT = 9.3\n        RECT_TOLERANCE = 2\n\n        def filter_rects(rects):\n            # Just get the rects that are the right size to be checkboxes\n            rects_found = []\n            for rect in rects:\n                if (\n                    rect[\"height\"] > (RECT_HEIGHT - RECT_TOLERANCE)\n                    and (rect[\"height\"] < RECT_HEIGHT + RECT_TOLERANCE)\n                    and (rect[\"width\"] < RECT_WIDTH + RECT_TOLERANCE)\n                    and (rect[\"width\"] < RECT_WIDTH + RECT_TOLERANCE)\n                ):\n                    rects_found.append(rect)\n            return rects_found\n\n        def determine_if_checked(checkbox, checklines):\n            \"\"\"\n            This figures out if the bounding box of (either) line used to make\n            one half of the 'x' is the right size and overlaps with a rectangle.\n            This isn't foolproof, but works for this case.\n            It's not totally clear (to me) how common this style of checkboxes\n            are used, and whether this is useful approach to them.\n            Also note there should be *two* matching LTCurves for each checkbox.\n            But here we only test there's at least one.\n            \"\"\"\n\n            for cl in checklines:\n\n                if 
(\n                    checkbox[\"height\"] > (RECT_HEIGHT - RECT_TOLERANCE)\n                    and (checkbox[\"height\"] < RECT_HEIGHT + RECT_TOLERANCE)\n                    and (checkbox[\"width\"] < RECT_WIDTH + RECT_TOLERANCE)\n                    and (checkbox[\"width\"] < RECT_WIDTH + RECT_TOLERANCE)\n                ):\n\n                    xmatch = False\n                    ymatch = False\n\n                    if max(checkbox[\"x0\"], cl[\"x0\"]) <= min(checkbox[\"x1\"], cl[\"x1\"]):\n                        xmatch = True\n                    if max(checkbox[\"y0\"], cl[\"y0\"]) <= min(checkbox[\"y1\"], cl[\"y1\"]):\n                        ymatch = True\n                    if xmatch and ymatch:\n                        return True\n\n            return False\n\n        p0 = pdf.pages[0]\n        checklines = [\n            line\n            for line in p0.lines\n            if round(line[\"height\"], 2) == round(line[\"width\"], 2)\n        ]  # These are diagonals\n        rects = filter_rects(p0.objects[\"rect\"])\n\n        n_checked = sum([determine_if_checked(rect, checklines) for rect in rects])\n\n        assert n_checked == 5\n        pdf.close()\n\n    def test_issue_14(self):\n        pdf = pdfplumber.open(os.path.join(HERE, \"pdfs/cupertino_usd_4-6-16.pdf\"))\n        assert len(pdf.objects)\n        pdf.close()\n\n    def test_issue_21(self):\n        pdf = pdfplumber.open(os.path.join(HERE, \"pdfs/150109DSP-Milw-505-90D.pdf\"))\n        assert len(pdf.objects)\n        pdf.close()\n\n    def test_issue_33(self):\n        pdf = pdfplumber.open(os.path.join(HERE, \"pdfs/issue-33-lorem-ipsum.pdf\"))\n        assert len(pdf.metadata.keys())\n        pdf.close()\n\n    def test_issue_53(self):\n        pdf = pdfplumber.open(os.path.join(HERE, \"pdfs/issue-53-example.pdf\"))\n        assert len(pdf.objects)\n        pdf.close()\n\n    def test_issue_67(self):\n        pdf = pdfplumber.open(os.path.join(HERE, \"pdfs/issue-67-example.pdf\"))\n   
     assert len(pdf.metadata.keys())\n        pdf.close()\n\n    def test_pr_88(self):\n        # via https://github.com/jsvine/pdfplumber/pull/88\n        path = os.path.join(HERE, \"pdfs/pr-88-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            words = page.extract_words()\n            assert len(words) == 25\n\n    def test_issue_90(self):\n        path = os.path.join(HERE, \"pdfs/issue-90-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            page.extract_words()\n\n    def test_pr_136(self):\n        path = os.path.join(HERE, \"pdfs/pr-136-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            page.extract_words()\n\n    def test_pr_138(self):\n        path = os.path.join(HERE, \"pdfs/pr-138-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            assert len(page.chars) == 5140\n            page.extract_tables(\n                {\n                    \"vertical_strategy\": \"explicit\",\n                    \"horizontal_strategy\": \"lines\",\n                    \"explicit_vertical_lines\": page.curves + page.edges,\n                }\n            )\n\n    def test_issue_140(self):\n        path = os.path.join(HERE, \"pdfs/issue-140-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            cropped_page = page.crop((0, 0, page.width, 122))\n            assert len(cropped_page.extract_table()) == 5\n\n    def test_issue_203(self):\n        path = os.path.join(HERE, \"pdfs/issue-203-decimalize.pdf\")\n        with pdfplumber.open(path) as pdf:\n            assert len(pdf.objects)\n\n    def test_issue_216(self):\n        \"\"\"\n        .extract_table() should return None if there's no table,\n        instead of crashing\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/issue-140-example.pdf\")\n        with 
pdfplumber.open(path) as pdf:\n            cropped = pdf.pages[0].crop((0, 0, 1, 1))\n            assert cropped.extract_table() is None\n\n    def test_issue_297(self):\n        \"\"\"\n        Handle integer type metadata\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/issue-297-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            assert isinstance(pdf.metadata[\"Copies\"], int)\n\n    def test_issue_316(self):\n        \"\"\"\n        Handle invalid metadata\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/issue-316-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            assert (\n                pdf.metadata[\"Changes\"][0][\"CreationDate\"] == \"D:20061207105020Z00'00'\"\n            )\n\n    def test_issue_386(self):\n        \"\"\"\n        util.extract_text() should not raise exception if given pure iterator\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/nics-background-checks-2015-11.pdf\")\n        with pdfplumber.open(path) as pdf:\n            chars = (char for char in pdf.chars)\n            pdfplumber.utils.extract_text(chars)\n\n    def test_issue_461_and_842(self):\n        \"\"\"\n        pdfplumber should gracefully handle characters with byte-encoded\n        font names.\n        \"\"\"\n        before = b\"RGJSAP+\\xcb\\xce\\xcc\\xe5\"\n        after = pdfplumber.page.fix_fontname_bytes(before)\n        assert after == \"RGJSAP+SimSun,Regular\"\n\n        before = b\"\\xcb\\xce\\xcc\\xe5\"\n        after = pdfplumber.page.fix_fontname_bytes(before)\n        assert after == \"SimSun,Regular\"\n\n        path = os.path.join(HERE, \"pdfs/issue-461-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            assert all(isinstance(c[\"fontname\"], str) for c in page.chars)\n            page.dedupe_chars()\n\n        path = os.path.join(HERE, \"pdfs/issue-842-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = 
pdf.pages[0]\n            assert all(isinstance(c[\"fontname\"], str) for c in page.chars)\n            page.dedupe_chars()\n\n    def test_issue_463(self):\n        \"\"\"\n        Extracting annotations should not raise UnicodeDecodeError on utf-16 text\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/issue-463-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            annots = pdf.annots\n            assert annots[0][\"contents\"] == \"日本語\"\n\n    def test_issue_598(self):\n        \"\"\"\n        Ligatures should be translated by default.\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/issue-598-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            a = page.extract_text()\n            assert \"fiction\" in a\n            assert \"ﬁction\" not in a\n\n            b = page.extract_text(expand_ligatures=False)\n            assert \"ﬁction\" in b\n            assert \"fiction\" not in b\n\n            assert page.extract_words()[53][\"text\"] == \"fiction\"\n            assert page.extract_words(expand_ligatures=False)[53][\"text\"] == \"ﬁction\"\n\n    def test_issue_683(self):\n        \"\"\"\n        Page.search ValueError: min() arg is an empty sequence\n\n        This ultimately stemmed from a mistaken assumption in\n        LayoutEngine.calculate(...) that len(char[\"text\"]) would always equal\n        1, which is not true for ligatures. 
Issue 683 does not provide a PDF,\n        but the test PDF triggers the same error, which should now be fixed.\n\n        Thank you to @samkit-jain for identifying and writing this test.\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/issue-71-duplicate-chars-2.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            page.search(r\"\\d+\", regex=True)\n\n    def test_issue_982(self):\n        \"\"\"\n        extract_text(use_text_flow=True) apparently does nothing\n\n        This is because, while we took care not to sort the words by\n        `doctop` in `WordExtractor` and `WordMap`, no such precaution\n        was taken in `cluster_objects`.  We thus add an option to\n        `cluster_objects` to preserve the ordering (which could come\n        from `use_text_flow` or from `presorted`) of the input objects.\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/issue-982-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            text = re.sub(r\"\\s+\", \" \", page.extract_text(use_text_flow=True))\n            words = \" \".join(w[\"text\"] for w in page.extract_words(use_text_flow=True))\n            assert text[0:100] == words[0:100]\n\n    def test_issue_1147(self):\n        \"\"\"\n        Edge-case for when decode_text is passed a string\n        that is out of bounds of PDFDocEncoding\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/issue-1147-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            # Should not error:\n            assert page.extract_text()\n\n    def test_issue_1181(self):\n        \"\"\"\n        Correctly re-calculate coordinates when MediaBox does not start at (0,0)\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/issue-1181.pdf\")\n        with pdfplumber.open(path) as pdf:\n            p0, p1 = pdf.pages\n            assert p0.crop(p0.bbox).extract_table() == [\n       
         [\"FooCol1\", \"FooCol2\", \"FooCol3\"],\n                [\"Foo4\", \"Foo5\", \"Foo6\"],\n                [\"Foo7\", \"Foo8\", \"Foo9\"],\n                [\"Foo10\", \"Foo11\", \"Foo12\"],\n                [\"\", \"\", \"\"],\n            ]\n            assert p1.crop(p1.bbox).extract_table() == [\n                [\"BarCol1\", \"BarCol2\", \"BarCol3\"],\n                [\"Bar4\", \"Bar5\", \"Bar6\"],\n                [\"Bar7\", \"Bar8\", \"Bar9\"],\n                [\"Bar10\", \"Bar11\", \"Bar12\"],\n                [\"\", \"\", \"\"],\n            ]\n\n    def test_pr_1195(self):\n        \"\"\"\n        In certain scenarios, annotations may include invalid or extraneous\n        data that can obstruct the annotation processing workflow.  To mitigate\n        this, the raise_unicode_errors parameter in the PDF initializer and the\n        .open() method provides a configurable option to bypass these errors\n        and generate warnings instead, ensuring smoother handling of such\n        anomalies.\n\n        The following test verifies the functionality of the\n        raise_unicode_errors parameter.\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/annotations-unicode-issues.pdf\")\n        with pdfplumber.open(path) as pdf, pytest.raises(UnicodeDecodeError):\n            for _ in pdf.annots:\n                pass\n\n        with pdfplumber.open(path, raise_unicode_errors=False) as pdf, pytest.warns(\n            UserWarning\n        ):\n            for _ in pdf.annots:\n                pass\n"
  },
  {
    "path": "tests/test_laparams.py",
    "content": "#!/usr/bin/env python\nimport logging\nimport os\nimport unittest\n\nimport pdfplumber\n\nlogging.disable(logging.ERROR)\n\nHERE = os.path.abspath(os.path.dirname(__file__))\n\n\nclass Test(unittest.TestCase):\n    @classmethod\n    def setup_class(self):\n        self.path = os.path.join(HERE, \"pdfs/issue-13-151201DSP-Fond-581-90D.pdf\")\n\n    def test_without_laparams(self):\n        with pdfplumber.open(self.path, laparams=None) as pdf:\n            objs = pdf.pages[0].objects\n            assert \"textboxhorizontal\" not in objs.keys()\n            assert len(objs[\"char\"]) == 4408\n\n    def test_with_laparams(self):\n        with pdfplumber.open(self.path, laparams={}) as pdf:\n            page = pdf.pages[0]\n            assert len(page.textboxhorizontals) == 27\n            assert len(page.textlinehorizontals) == 79\n            assert \"text\" in page.textboxhorizontals[0]\n            assert \"text\" in page.textlinehorizontals[0]\n            assert len(page.chars) == 4408\n            assert \"anno\" not in page.objects.keys()\n\n    def test_vertical_texts(self):\n        path = os.path.join(HERE, \"pdfs/issue-192-example.pdf\")\n        laparams = {\"detect_vertical\": True}\n        with pdfplumber.open(path, laparams=laparams) as pdf:\n            page = pdf.pages[0]\n            assert len(page.textlinehorizontals) == 142\n            assert len(page.textboxhorizontals) == 74\n            assert len(page.textlineverticals) == 11\n            assert len(page.textboxverticals) == 6\n            assert \"text\" in page.textboxverticals[0]\n            assert \"text\" in page.textlineverticals[0]\n\n    def test_issue_383(self):\n        with pdfplumber.open(self.path, laparams={}) as pdf:\n            p0 = pdf.pages[0]\n            assert \"anno\" not in p0.objects.keys()\n            cropped = p0.crop((0, 0, 100, 100))\n            assert len(cropped.objects)\n"
  },
  {
    "path": "tests/test_list_metadata.py",
    "content": "#!/usr/bin/env python\nimport logging\nimport os\nimport unittest\n\nimport pdfplumber\n\nlogging.disable(logging.ERROR)\n\nHERE = os.path.abspath(os.path.dirname(__file__))\n\n\nclass Test(unittest.TestCase):\n    def test_load(self):\n        path = os.path.join(HERE, \"pdfs/cupertino_usd_4-6-16.pdf\")\n        with pdfplumber.open(path) as pdf:\n            assert len(pdf.metadata)\n"
  },
  {
    "path": "tests/test_mcids.py",
    "content": "#!/usr/bin/env python3\n\nimport os\nimport unittest\n\nimport pdfplumber\n\nHERE = os.path.abspath(os.path.dirname(__file__))\n\n\nclass TestMCIDs(unittest.TestCase):\n    \"\"\"Test MCID extraction.\"\"\"\n\n    def test_mcids(self):\n        path = os.path.join(HERE, \"pdfs/mcid_example.pdf\")\n\n        pdf = pdfplumber.open(path)\n        page = pdf.pages[0]\n        # Check text of MCIDS\n        mcids = []\n        for c in page.chars:\n            if \"mcid\" in c:\n                while len(mcids) <= c[\"mcid\"]:\n                    mcids.append(\"\")\n                if not mcids[c[\"mcid\"]]:\n                    mcids[c[\"mcid\"]] = c[\"tag\"] + \": \"\n                mcids[c[\"mcid\"]] += c[\"text\"]\n        assert mcids == [\n            \"Standard: Test of figures\",\n            \"\",\n            \"P: 1 ligne\",\n            \"P: 2 ligne\",\n            \"P: 3 ligne\",\n            \"P: 4 ligne\",\n            \"P: 0\",\n            \"P: 2\",\n            \"P: 4\",\n            \"P: 6\",\n            \"P: 8\",\n            \"P: 10\",\n            \"P: 12\",\n            \"P: Figure 1: Chart\",\n            \"\",\n            \"P: 1 colonne\",\n            \"P: 2 colonne\",\n            \"P: 3 colonne\",\n        ]\n        # Check line and curve MCIDs\n        line_mcids = set(x[\"mcid\"] for x in page.lines)\n        curve_mcids = set(x[\"mcid\"] for x in page.curves)\n        assert all(x[\"tag\"] == \"Figure\" for x in page.lines)\n        assert all(x[\"tag\"] == \"Figure\" for x in page.curves)\n        assert line_mcids & {1, 14}\n        assert curve_mcids & {1, 14}\n        # No rects to test unfortunately!\n"
  },
  {
    "path": "tests/test_nics_report.py",
    "content": "#!/usr/bin/env python\nimport logging\nimport os\nimport unittest\nfrom operator import itemgetter\n\nimport pdfplumber\nfrom pdfplumber.utils import extract_text, within_bbox\n\nlogging.disable(logging.ERROR)\n\nHERE = os.path.abspath(os.path.dirname(__file__))\n\nCOLUMNS = [\n    \"state\",\n    \"permit\",\n    \"handgun\",\n    \"long_gun\",\n    \"other\",\n    \"multiple\",\n    \"admin\",\n    \"prepawn_handgun\",\n    \"prepawn_long_gun\",\n    \"prepawn_other\",\n    \"redemption_handgun\",\n    \"redemption_long_gun\",\n    \"redemption_other\",\n    \"returned_handgun\",\n    \"returned_long_gun\",\n    \"returned_other\",\n    \"rentals_handgun\",\n    \"rentals_long_gun\",\n    \"private_sale_handgun\",\n    \"private_sale_long_gun\",\n    \"private_sale_other\",\n    \"return_to_seller_handgun\",\n    \"return_to_seller_long_gun\",\n    \"return_to_seller_other\",\n    \"totals\",\n]\n\n\nclass Test(unittest.TestCase):\n    @classmethod\n    def setup_class(self):\n        path = os.path.join(HERE, \"pdfs/nics-background-checks-2015-11.pdf\")\n        self.pdf = pdfplumber.open(path)\n        self.PDF_WIDTH = self.pdf.pages[0].width\n\n    @classmethod\n    def teardown_class(self):\n        self.pdf.close()\n\n    def test_edges(self):\n        assert len(self.pdf.vertical_edges) == 700\n        assert len(self.pdf.horizontal_edges) == 508\n\n    def test_plain(self):\n        page = self.pdf.pages[0]\n        cropped = page.crop((0, 80, self.PDF_WIDTH, 485))\n        table = cropped.extract_table(\n            {\n                \"horizontal_strategy\": \"text\",\n                \"explicit_vertical_lines\": [min(map(itemgetter(\"x0\"), cropped.chars))],\n                \"intersection_tolerance\": 5,\n            }\n        )\n\n        def parse_value(k, x):\n            if k == 0:\n                return x\n            if x in (None, \"\"):\n                return None\n            return int(x.replace(\",\", \"\"))\n\n        def 
parse_row(row):\n            return dict((COLUMNS[i], parse_value(i, v)) for i, v in enumerate(row))\n\n        parsed_table = [parse_row(row) for row in table]\n\n        # [1:] because first column is state name\n        for c in COLUMNS[1:]:\n            total = parsed_table[-1][c]\n            colsum = sum(row[c] or 0 for row in parsed_table)\n            assert colsum == (total * 2)\n\n        month_chars = within_bbox(page.chars, (0, 35, self.PDF_WIDTH, 65))\n        month_text = extract_text(month_chars)\n        assert month_text == \"November - 2015\"\n\n    def test_filter(self):\n        page = self.pdf.pages[0]\n\n        def test(obj):\n            if obj[\"object_type\"] == \"char\":\n                if obj[\"size\"] < 15:\n                    return False\n            return True\n\n        filtered = page.filter(test)\n        text = filtered.extract_text()\n        assert text == \"NICS Firearm Background Checks\\nNovember - 2015\"\n\n    def test_text_only_strategy(self):\n        cropped = self.pdf.pages[0].crop((0, 80, self.PDF_WIDTH, 475))\n        table = cropped.extract_table(\n            dict(\n                horizontal_strategy=\"text\",\n                vertical_strategy=\"text\",\n            )\n        )\n        assert table[0][0] == \"Alabama\"\n        assert table[0][22] == \"71,137\"\n        assert table[-1][0] == \"Wyoming\"\n        assert table[-1][22] == \"5,017\"\n\n    def test_explicit_horizontal(self):\n        cropped = self.pdf.pages[0].crop((0, 80, self.PDF_WIDTH, 475))\n        table = cropped.find_tables(\n            dict(\n                horizontal_strategy=\"text\",\n                vertical_strategy=\"text\",\n            )\n        )[0]\n\n        h_positions = [row.cells[0][1] for row in table.rows] + [\n            table.rows[-1].cells[0][3]\n        ]\n\n        t_explicit = cropped.find_tables(\n            dict(\n                horizontal_strategy=\"explicit\",\n                
vertical_strategy=\"text\",\n                explicit_horizontal_lines=h_positions,\n            )\n        )[0]\n\n        assert table.extract() == t_explicit.extract()\n\n        h_objs = [\n            {\n                \"x0\": 0,\n                \"x1\": self.PDF_WIDTH,\n                \"width\": self.PDF_WIDTH,\n                \"top\": h,\n                \"bottom\": h,\n                \"object_type\": \"line\",\n            }\n            for h in h_positions\n        ]\n\n        t_explicit_objs = cropped.find_tables(\n            dict(\n                horizontal_strategy=\"explicit\",\n                vertical_strategy=\"text\",\n                explicit_horizontal_lines=h_objs,\n            )\n        )[0]\n\n        assert table.extract() == t_explicit_objs.extract()\n"
  },
  {
    "path": "tests/test_oss_fuzz.py",
    "content": "#!/usr/bin/env python\nimport logging\nimport os\nimport unittest\nfrom pathlib import Path\n\nimport pdfplumber\nfrom pdfplumber.utils.exceptions import MalformedPDFException, PdfminerException\n\nlogging.disable(logging.ERROR)\n\nHERE = Path(os.path.abspath(os.path.dirname(__file__)))\n\nACCEPTABLE_EXCEPTIONS = (MalformedPDFException, PdfminerException)\n\n\nclass Test(unittest.TestCase):\n    def test_load(self):\n        def test_conversions(pdf):\n            methods = [pdf.to_dict, pdf.to_json, pdf.to_csv, pdf.pages[0].to_image]\n            for method in methods:\n                try:\n                    method()\n                except ACCEPTABLE_EXCEPTIONS:\n                    continue\n                except Exception as e:\n                    print(f\"Failed on: {path.name}\")\n                    raise e\n\n        paths = sorted((HERE / \"pdfs/from-oss-fuzz/load/\").glob(\"*.pdf\"))\n        for path in paths:\n            try:\n                with pdfplumber.open(path) as pdf:\n                    assert pdf.pages\n                    test_conversions(pdf)\n            except ACCEPTABLE_EXCEPTIONS:\n                continue\n            except Exception as e:\n                print(f\"Failed on: {path.name}\")\n                raise e\n"
  },
  {
    "path": "tests/test_repair.py",
    "content": "#!/usr/bin/env python\nimport os\nimport shutil\nimport tempfile\nimport unittest\n\nimport pytest\n\nimport pdfplumber\n\nHERE = os.path.abspath(os.path.dirname(__file__))\n\n\nclass Test(unittest.TestCase):\n    def test_from_issue_932(self):\n        path = os.path.join(HERE, \"pdfs/malformed-from-issue-932.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            char = page.chars[0]\n            assert char[\"bottom\"] > page.height\n\n        with pdfplumber.open(path, repair=True) as pdf:\n            page = pdf.pages[0]\n            char = page.chars[0]\n            assert char[\"bottom\"] < page.height\n\n        with pdfplumber.repair(path) as repaired:\n            with pdfplumber.open(repaired) as pdf:\n                page = pdf.pages[0]\n                char = page.chars[0]\n                assert char[\"bottom\"] < page.height\n\n    def test_other_repair_inputs(self):\n        path = os.path.join(HERE, \"pdfs/malformed-from-issue-932.pdf\")\n        with pdfplumber.open(open(path, \"rb\"), repair=True) as pdf:\n            page = pdf.pages[0]\n            char = page.chars[0]\n            assert char[\"bottom\"] < page.height\n\n    def test_bad_repair_path(self):\n        path = os.path.join(HERE, \"pdfs/abc.xyz\")\n\n        with pytest.raises(Exception):\n            with pdfplumber.open(path, repair=True):\n                pass\n\n    def test_repair_to_file(self):\n        path = os.path.join(HERE, \"pdfs/malformed-from-issue-932.pdf\")\n        with tempfile.NamedTemporaryFile(\"wb\") as out:\n            pdfplumber.repair(path, outfile=out.name)\n            with pdfplumber.open(out.name) as pdf:\n                page = pdf.pages[0]\n                char = page.chars[0]\n                assert char[\"bottom\"] < page.height\n\n    def test_repair_setting(self):\n        path = os.path.join(HERE, \"pdfs/malformed-from-issue-932.pdf\")\n        with tempfile.NamedTemporaryFile(\"wb\") as 
out:\n            pdfplumber.repair(path, outfile=out.name)\n\n        with tempfile.NamedTemporaryFile(\"wb\") as out:\n            pdfplumber.repair(path, outfile=out.name, setting=\"prepress\")\n\n    def test_repair_password(self):\n        path = os.path.join(HERE, \"pdfs/password-example.pdf\")\n        with pdfplumber.open(path, repair=True, password=\"test\") as pdf:\n            assert len(pdf.pages[0].chars)\n\n    def test_repair_custom_path(self):\n        path = os.path.join(HERE, \"pdfs/malformed-from-issue-932.pdf\")\n        with pdfplumber.open(path, repair=True, gs_path=shutil.which(\"gs\")) as pdf:\n            assert len(pdf.pages[0].chars)\n"
  },
  {
    "path": "tests/test_structure.py",
    "content": "#!/usr/bin/env python3\n\nimport os\nimport re\nimport unittest\nfrom collections import deque\n\nfrom pdfminer.pdftypes import resolve1\n\nimport pdfplumber\nfrom pdfplumber.structure import PDFStructTree\n\nHERE = os.path.abspath(os.path.dirname(__file__))\nTREE = [\n    {\n        \"type\": \"Document\",\n        \"children\": [\n            {\n                \"type\": \"P\",\n                \"attributes\": {\n                    \"O\": \"Layout\",\n                    \"Placement\": \"Block\",\n                    \"SpaceBefore\": 0.24,\n                    \"TextAlign\": \"Center\",\n                },\n                \"mcids\": [0],\n            },\n            {\n                \"type\": \"H1\",\n                \"attributes\": {\n                    \"O\": \"Layout\",\n                    \"Placement\": \"Block\",\n                    \"SpaceBefore\": 0.36,\n                },\n                \"mcids\": [1],\n            },\n            {\n                \"type\": \"P\",\n                \"attributes\": {\n                    \"O\": \"Layout\",\n                    \"Placement\": \"Block\",\n                    \"SpaceBefore\": 0.12,\n                },\n                \"mcids\": [2],\n            },\n            {\n                \"type\": \"P\",\n                \"attributes\": {\n                    \"O\": \"Layout\",\n                    \"Placement\": \"Block\",\n                    \"SpaceBefore\": 0.181,\n                },\n                \"mcids\": [3, 4, 5, 6, 7],\n            },\n            {\n                \"type\": \"H2\",\n                \"attributes\": {\n                    \"O\": \"Layout\",\n                    \"Placement\": \"Block\",\n                    \"SpaceBefore\": 0.381,\n                },\n                \"mcids\": [8],\n            },\n            {\n                \"type\": \"P\",\n                \"attributes\": {\n                    \"O\": \"Layout\",\n                    \"Placement\": 
\"Block\",\n                    \"SpaceBefore\": 0.12,\n                },\n                \"mcids\": [9],\n            },\n            {\n                \"type\": \"L\",\n                \"children\": [\n                    {\n                        \"type\": \"LI\",\n                        \"children\": [\n                            {\n                                \"type\": \"LBody\",\n                                \"children\": [\n                                    {\n                                        \"type\": \"P\",\n                                        \"attributes\": {\n                                            \"O\": \"Layout\",\n                                            \"Placement\": \"Block\",\n                                            \"SpaceBefore\": 0.181,\n                                            \"StartIndent\": 0.36,\n                                        },\n                                        \"mcids\": [10, 11],\n                                    }\n                                ],\n                            }\n                        ],\n                    },\n                    {\n                        \"type\": \"LI\",\n                        \"children\": [\n                            {\n                                \"type\": \"LBody\",\n                                \"children\": [\n                                    {\n                                        \"type\": \"P\",\n                                        \"attributes\": {\n                                            \"O\": \"Layout\",\n                                            \"Placement\": \"Block\",\n                                            \"SpaceBefore\": 0.181,\n                                            \"StartIndent\": 0.36,\n                                        },\n                                        \"mcids\": [12, 13],\n                                    },\n                                    {\n      
                                  \"type\": \"L\",\n                                        \"children\": [\n                                            {\n                                                \"type\": \"LI\",\n                                                \"children\": [\n                                                    {\n                                                        \"type\": \"LBody\",\n                                                        \"children\": [\n                                                            {\n                                                                \"type\": \"P\",\n                                                                \"attributes\": {\n                                                                    \"O\": \"Layout\",\n                                                                    \"Placement\": \"Block\",  # noqa: E501\n                                                                    \"SpaceBefore\": 0.181,  # noqa: E501\n                                                                    \"StartIndent\": 0.72,  # noqa: E501\n                                                                },\n                                                                \"mcids\": [14, 15],\n                                                            }\n                                                        ],\n                                                    }\n                                                ],\n                                            }\n                                        ],\n                                    },\n                                ],\n                            }\n                        ],\n                    },\n                    {\n                        \"type\": \"LI\",\n                        \"children\": [\n                            {\n                                \"type\": \"LBody\",\n                                
\"children\": [\n                                    {\n                                        \"type\": \"P\",\n                                        \"attributes\": {\n                                            \"O\": \"Layout\",\n                                            \"Placement\": \"Block\",\n                                            \"SpaceBefore\": 0.181,\n                                            \"StartIndent\": 0.36,\n                                        },\n                                        \"mcids\": [16, 17, 18, 19, 20, 21, 22, 23],\n                                    }\n                                ],\n                            }\n                        ],\n                    },\n                ],\n            },\n            {\n                \"type\": \"H3\",\n                \"attributes\": {\n                    \"O\": \"Layout\",\n                    \"Placement\": \"Block\",\n                    \"SpaceBefore\": 0.321,\n                },\n                \"mcids\": [24],\n            },\n            {\n                \"type\": \"Table\",\n                \"attributes\": {\n                    \"O\": \"Layout\",\n                    \"Placement\": \"Block\",\n                    \"SpaceBefore\": 0.12,\n                    \"SpaceAfter\": 0.015,\n                    \"Width\": 9.972,\n                    \"Height\": 1.047,\n                    \"BBox\": [56.7, 249.75, 555.3, 302.1],\n                },\n                \"children\": [\n                    {\n                        \"type\": \"TR\",\n                        \"attributes\": {\"O\": \"Layout\", \"Placement\": \"Block\"},\n                        \"children\": [\n                            {\n                                \"type\": \"TH\",\n                                \"attributes\": {\n                                    \"O\": \"Layout\",\n                                    \"Placement\": \"Inline\",\n                                    
\"Width\": 4.985,\n                                    \"Height\": 0.291,\n                                },\n                                \"children\": [\n                                    {\n                                        \"type\": \"P\",\n                                        \"attributes\": {\n                                            \"O\": \"Layout\",\n                                            \"Placement\": \"Block\",\n                                        },\n                                        \"mcids\": [25],\n                                    }\n                                ],\n                            },\n                            {\n                                \"type\": \"TH\",\n                                \"attributes\": {\n                                    \"O\": \"Layout\",\n                                    \"Placement\": \"Inline\",\n                                    \"Width\": 4.987,\n                                    \"Height\": 0.291,\n                                },\n                                \"children\": [\n                                    {\n                                        \"type\": \"P\",\n                                        \"attributes\": {\n                                            \"O\": \"Layout\",\n                                            \"Placement\": \"Block\",\n                                        },\n                                        \"mcids\": [26],\n                                    }\n                                ],\n                            },\n                        ],\n                    },\n                    {\n                        \"type\": \"TR\",\n                        \"attributes\": {\"O\": \"Layout\", \"Placement\": \"Block\"},\n                        \"children\": [\n                            {\n                                \"type\": \"TD\",\n                                \"attributes\": {\n           
                         \"O\": \"Layout\",\n                                    \"Placement\": \"Inline\",\n                                    \"Width\": 4.985,\n                                    \"Height\": 0.291,\n                                },\n                                \"children\": [\n                                    {\n                                        \"type\": \"P\",\n                                        \"attributes\": {\n                                            \"O\": \"Layout\",\n                                            \"Placement\": \"Block\",\n                                        },\n                                        \"mcids\": [27],\n                                    }\n                                ],\n                            },\n                            {\n                                \"type\": \"TD\",\n                                \"attributes\": {\n                                    \"O\": \"Layout\",\n                                    \"Placement\": \"Inline\",\n                                    \"Width\": 4.987,\n                                    \"Height\": 0.291,\n                                },\n                                \"children\": [\n                                    {\n                                        \"type\": \"P\",\n                                        \"attributes\": {\n                                            \"O\": \"Layout\",\n                                            \"Placement\": \"Block\",\n                                        },\n                                        \"mcids\": [28],\n                                    }\n                                ],\n                            },\n                        ],\n                    },\n                    {\n                        \"type\": \"TR\",\n                        \"attributes\": {\"O\": \"Layout\", \"Placement\": \"Block\"},\n                        \"children\": 
[\n                            {\n                                \"type\": \"TD\",\n                                \"attributes\": {\n                                    \"O\": \"Layout\",\n                                    \"Placement\": \"Inline\",\n                                    \"Width\": 4.985,\n                                    \"Height\": 0.33,\n                                },\n                                \"children\": [\n                                    {\n                                        \"type\": \"P\",\n                                        \"attributes\": {\n                                            \"O\": \"Layout\",\n                                            \"Placement\": \"Block\",\n                                        },\n                                        \"mcids\": [29],\n                                    }\n                                ],\n                            },\n                            {\n                                \"type\": \"TD\",\n                                \"attributes\": {\n                                    \"O\": \"Layout\",\n                                    \"Placement\": \"Inline\",\n                                    \"Width\": 4.987,\n                                    \"Height\": 0.33,\n                                },\n                                \"children\": [\n                                    {\n                                        \"type\": \"P\",\n                                        \"attributes\": {\n                                            \"O\": \"Layout\",\n                                            \"Placement\": \"Block\",\n                                        },\n                                        \"mcids\": [30],\n                                    }\n                                ],\n                            },\n                        ],\n                    },\n                ],\n            },\n        ],\n 
   }\n]\n\n\nclass Test(unittest.TestCase):\n    \"\"\"Test a PDF specifically created to show structure.\"\"\"\n\n    @classmethod\n    def setup_class(self):\n        path = os.path.join(HERE, \"pdfs/pdf_structure.pdf\")\n        self.pdf = pdfplumber.open(path)\n\n    @classmethod\n    def teardown_class(self):\n        self.pdf.close()\n\n    def test_structure_tree(self):\n        assert self.pdf.pages[0].structure_tree == TREE\n        # Add page numbers\n        d = deque(TREE)\n        while d:\n            el = d.popleft()\n            el[\"page_number\"] = 1\n            if \"children\" in el:\n                d.extend(el[\"children\"])\n        assert self.pdf.structure_tree == TREE\n\n\nPVSTRUCT = [\n    {\n        \"type\": \"Sect\",\n        \"children\": [\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [0]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [1]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [2]},\n            {\"type\": \"P\", \"lang\": \"FR-FR\", \"page_number\": 1, \"mcids\": [3]},\n            {\"type\": \"P\", \"lang\": \"FR-FR\", \"page_number\": 1, \"mcids\": [4]},\n            {\"type\": \"P\", \"lang\": \"FR-FR\", \"page_number\": 1, \"mcids\": [5]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [6]},\n            {\"type\": \"P\", \"lang\": \"FR-FR\", \"page_number\": 1, \"mcids\": [7]},\n            {\n                \"type\": \"P\",\n                \"lang\": \"FR-FR\",\n                \"page_number\": 1,\n                \"mcids\": [8],\n                \"children\": [\n                    {\"type\": \"Span\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [9]}\n                ],\n            },\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [11]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": 
[12]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [13]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [14]},\n            {\"type\": \"P\", \"lang\": \"FR-FR\", \"page_number\": 1, \"mcids\": [15]},\n            {\"type\": \"P\", \"lang\": \"FR-FR\", \"page_number\": 1, \"mcids\": [16]},\n            {\n                \"type\": \"L\",\n                \"children\": [\n                    {\n                        \"type\": \"LI\",\n                        \"children\": [\n                            {\n                                \"type\": \"LBody\",\n                                \"lang\": \"FR-CA\",\n                                \"page_number\": 1,\n                                \"mcids\": [19],\n                            }\n                        ],\n                    }\n                ],\n            },\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [22]},\n            {\"type\": \"P\", \"lang\": \"FR-FR\", \"page_number\": 1, \"mcids\": [23]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [24]},\n            {\n                \"type\": \"L\",\n                \"children\": [\n                    {\n                        \"type\": \"LI\",\n                        \"children\": [\n                            {\n                                \"type\": \"LBody\",\n                                \"lang\": \"FR-CA\",\n                                \"page_number\": 1,\n                                \"mcids\": [27],\n                            }\n                        ],\n                    }\n                ],\n            },\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [30]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [31]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": 
[32]},\n            {\n                \"type\": \"L\",\n                \"children\": [\n                    {\n                        \"type\": \"LI\",\n                        \"children\": [\n                            {\n                                \"type\": \"LBody\",\n                                \"lang\": \"FR-CA\",\n                                \"page_number\": 1,\n                                \"mcids\": [35],\n                            }\n                        ],\n                    }\n                ],\n            },\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [38]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [39]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [40]},\n            {\n                \"type\": \"L\",\n                \"children\": [\n                    {\n                        \"type\": \"LI\",\n                        \"children\": [\n                            {\n                                \"type\": \"LBody\",\n                                \"lang\": \"FR-CA\",\n                                \"page_number\": 1,\n                                \"mcids\": [43, 45],\n                                \"children\": [\n                                    {\n                                        \"type\": \"Span\",\n                                        \"lang\": \"FR-FR\",\n                                        \"page_number\": 1,\n                                        \"mcids\": [44],\n                                    }\n                                ],\n                            }\n                        ],\n                    }\n                ],\n            },\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [48]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [49]},\n            {\"type\": \"P\", 
\"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [50]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [51]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [52]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [53]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [54]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [55]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [56]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [57]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [58]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [59]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [60]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [61]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [62]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [63]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [64]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 1, \"mcids\": [65]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [0]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [1]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [2]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [3]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [4]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [5]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", 
\"page_number\": 2, \"mcids\": [6]},\n            {\n                \"type\": \"L\",\n                \"children\": [\n                    {\n                        \"type\": \"LI\",\n                        \"children\": [\n                            {\n                                \"type\": \"LBody\",\n                                \"lang\": \"FR-CA\",\n                                \"page_number\": 2,\n                                \"mcids\": [9, 11],\n                                \"children\": [\n                                    {\n                                        \"type\": \"Span\",\n                                        \"lang\": \"FR-FR\",\n                                        \"page_number\": 2,\n                                        \"mcids\": [10],\n                                    }\n                                ],\n                            }\n                        ],\n                    }\n                ],\n            },\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [14]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [15]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [16]},\n            {\"type\": \"P\", \"lang\": \"FR-FR\", \"page_number\": 2, \"mcids\": [17]},\n            {\"type\": \"P\", \"lang\": \"FR-FR\", \"page_number\": 2, \"mcids\": [18]},\n            {\"type\": \"P\", \"lang\": \"FR-FR\", \"page_number\": 2, \"mcids\": [19]},\n        ],\n    }\n]\n\n\nPVSTRUCT1 = [\n    {\n        \"type\": \"Sect\",\n        \"children\": [\n            {\"lang\": \"FR-CA\", \"type\": \"P\", \"mcids\": [0]},\n            {\"lang\": \"FR-CA\", \"type\": \"P\", \"mcids\": [1]},\n            {\"lang\": \"FR-CA\", \"type\": \"P\", \"mcids\": [2]},\n            {\"lang\": \"FR-CA\", \"type\": \"P\", \"mcids\": [3]},\n            {\"lang\": \"FR-CA\", \"type\": \"P\", \"mcids\": [4]},\n            
{\"lang\": \"FR-CA\", \"type\": \"P\", \"mcids\": [5]},\n            {\"lang\": \"FR-CA\", \"type\": \"P\", \"mcids\": [6]},\n            {\n                \"type\": \"L\",\n                \"children\": [\n                    {\n                        \"type\": \"LI\",\n                        \"children\": [\n                            {\n                                \"lang\": \"FR-CA\",\n                                \"type\": \"LBody\",\n                                \"mcids\": [9, 11],\n                                \"children\": [\n                                    {\"lang\": \"FR-FR\", \"type\": \"Span\", \"mcids\": [10]}\n                                ],\n                            }\n                        ],\n                    }\n                ],\n            },\n            {\"lang\": \"FR-CA\", \"type\": \"P\", \"mcids\": [14]},\n            {\"lang\": \"FR-CA\", \"type\": \"P\", \"mcids\": [15]},\n            {\"lang\": \"FR-CA\", \"type\": \"P\", \"mcids\": [16]},\n            {\"lang\": \"FR-FR\", \"type\": \"P\", \"mcids\": [17]},\n            {\"lang\": \"FR-FR\", \"type\": \"P\", \"mcids\": [18]},\n            {\"lang\": \"FR-FR\", \"type\": \"P\", \"mcids\": [19]},\n        ],\n    }\n]\n\nPVSTRUCT2 = [\n    {\n        \"type\": \"Sect\",\n        \"children\": [\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [0]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [1]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [2]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [3]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [4]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [5]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [6]},\n            {\n                \"type\": \"L\",\n  
              \"children\": [\n                    {\n                        \"type\": \"LI\",\n                        \"children\": [\n                            {\n                                \"type\": \"LBody\",\n                                \"lang\": \"FR-CA\",\n                                \"page_number\": 2,\n                                \"mcids\": [9, 11],\n                                \"children\": [\n                                    {\n                                        \"type\": \"Span\",\n                                        \"lang\": \"FR-FR\",\n                                        \"page_number\": 2,\n                                        \"mcids\": [10],\n                                    }\n                                ],\n                            }\n                        ],\n                    }\n                ],\n            },\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [14]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [15]},\n            {\"type\": \"P\", \"lang\": \"FR-CA\", \"page_number\": 2, \"mcids\": [16]},\n            {\"type\": \"P\", \"lang\": \"FR-FR\", \"page_number\": 2, \"mcids\": [17]},\n            {\"type\": \"P\", \"lang\": \"FR-FR\", \"page_number\": 2, \"mcids\": [18]},\n            {\"type\": \"P\", \"lang\": \"FR-FR\", \"page_number\": 2, \"mcids\": [19]},\n        ],\n    }\n]\n\n\nIMAGESTRUCT = [\n    {\n        \"type\": \"Document\",\n        \"children\": [\n            {\"type\": \"P\", \"mcids\": [0]},\n            {\"type\": \"P\", \"mcids\": [1]},\n            {\n                \"type\": \"Figure\",\n                \"alt_text\": \"pdfplumber on github\\n\\n\"\n                \"a screen capture of the github page for pdfplumber\",\n                \"mcids\": [2],\n            },\n        ],\n    }\n]\n\n\nWORD365 = [\n    {\n        \"type\": \"Document\",\n        \"children\": [\n        
    {\n                \"type\": \"H1\",\n                \"children\": [\n                    {\"type\": \"Span\", \"mcids\": [0]},\n                    {\"type\": \"Span\", \"actual_text\": \" \", \"mcids\": [1]},\n                ],\n            },\n            {\"type\": \"P\", \"mcids\": [2]},\n            {\n                \"type\": \"L\",\n                \"attributes\": {\"O\": \"List\", \"ListNumbering\": \"Disc\"},\n                \"children\": [\n                    {\"type\": \"LI\", \"children\": [{\"type\": \"LBody\", \"mcids\": [3]}]},\n                    {\"type\": \"LI\", \"children\": [{\"type\": \"LBody\", \"mcids\": [4]}]},\n                    {\"type\": \"LI\", \"children\": [{\"type\": \"LBody\", \"mcids\": [5]}]},\n                ],\n            },\n            {\"type\": \"P\", \"mcids\": [6]},\n            {\n                \"type\": \"L\",\n                \"attributes\": {\"O\": \"List\", \"ListNumbering\": \"Decimal\"},\n                \"children\": [\n                    {\"type\": \"LI\", \"children\": [{\"type\": \"LBody\", \"mcids\": [7]}]},\n                    {\"type\": \"LI\", \"children\": [{\"type\": \"LBody\", \"mcids\": [8]}]},\n                ],\n            },\n            {\n                \"type\": \"Table\",\n                \"children\": [\n                    {\n                        \"type\": \"THead\",\n                        \"children\": [\n                            {\n                                \"type\": \"TR\",\n                                \"children\": [\n                                    {\n                                        \"type\": \"TH\",\n                                        \"children\": [{\"type\": \"P\", \"mcids\": [9, 10]}],\n                                    },\n                                    {\n                                        \"type\": \"TH\",\n                                        \"children\": [{\"type\": \"P\", \"mcids\": [11, 12]}],\n              
                      },\n                                    {\n                                        \"type\": \"TH\",\n                                        \"children\": [{\"type\": \"P\", \"mcids\": [13, 14]}],\n                                    },\n                                ],\n                            }\n                        ],\n                    },\n                    {\n                        \"type\": \"TBody\",\n                        \"children\": [\n                            {\n                                \"type\": \"TR\",\n                                \"children\": [\n                                    {\n                                        \"type\": \"TD\",\n                                        \"children\": [{\"type\": \"P\", \"mcids\": [15, 16]}],\n                                    },\n                                    {\n                                        \"type\": \"TD\",\n                                        \"children\": [{\"type\": \"P\", \"mcids\": [17, 18]}],\n                                    },\n                                    {\n                                        \"type\": \"TD\",\n                                        \"children\": [{\"type\": \"P\", \"mcids\": [19, 20]}],\n                                    },\n                                ],\n                            },\n                            {\n                                \"type\": \"TR\",\n                                \"children\": [\n                                    {\n                                        \"type\": \"TD\",\n                                        \"children\": [{\"type\": \"P\", \"mcids\": [21, 22]}],\n                                    },\n                                    {\n                                        \"type\": \"TD\",\n                                        \"children\": [{\"type\": \"P\", \"mcids\": [23, 24]}],\n                                    },\n     
                               {\n                                        \"type\": \"TD\",\n                                        \"children\": [{\"type\": \"P\", \"mcids\": [25, 26]}],\n                                    },\n                                ],\n                            },\n                        ],\n                    },\n                ],\n            },\n            {\"type\": \"P\", \"mcids\": [27]},\n        ],\n    }\n]\n\n\nSCOTUS = [\n    {\n        \"type\": \"Div\",\n        \"children\": [\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                \"attributes\": {\n                    \"LineHeight\": 25.75,\n                    \"TextIndent\": 21.625,\n                    \"O\": \"Layout\",\n                },\n                \"mcids\": [1],\n            },\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                \"attributes\": {\n                    \"LineHeight\": 25.75,\n                    \"StartIndent\": 86.375,\n                    \"O\": \"Layout\",\n                },\n                \"mcids\": [2],\n            },\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                \"attributes\": {\n                    \"LineHeight\": 25.75,\n                    \"TextIndent\": 50.375,\n                    \"O\": \"Layout\",\n                },\n                \"mcids\": [3, 4],\n            },\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                # This is important, it has attributes and a class\n                \"attributes\": {\n                    \"LineHeight\": 25.75,\n                    \"StartIndent\": 165.625,\n                    \"EndIndent\": 57.625,\n                    \"SpaceAfter\": 24.5,\n                    \"O\": \"Layout\",\n                },\n                \"mcids\": [5],\n            },\n            {\n               
 \"type\": \"P\",\n                \"page_number\": 1,\n                \"attributes\": {\n                    \"LineHeight\": 25.75,\n                    \"TextIndent\": 100.75,\n                    \"O\": \"Layout\",\n                },\n                \"mcids\": [6],\n            },\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                # This is important, it has attributes and a class\n                \"attributes\": {\n                    \"LineHeight\": 25.75,\n                    \"TextIndent\": 21.625,\n                    \"EndIndent\": 50.375,\n                    \"O\": \"Layout\",\n                    \"TextAlign\": \"None\",\n                    \"SpaceAfter\": 179.125,\n                },\n                \"mcids\": [7],\n            },\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                # This is important, it has two attribute classes\n                \"attributes\": {\"O\": \"Layout\", \"TextAlign\": \"Center\", \"SpaceAfter\": 8.5},\n                \"mcids\": [8],\n            },\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                \"attributes\": {\"O\": \"Layout\", \"TextAlign\": \"Center\"},\n                \"mcids\": [9],\n            },\n        ],\n    }\n]\n\n\nHELLO = [\n    {\n        \"type\": \"Section\",\n        \"page_number\": 1,\n        \"children\": [\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                \"attributes\": {\"O\": \"Foo\", \"A1\": 1},\n                \"mcids\": [1],\n            },\n            {\n                \"type\": \"P\",\n                \"page_number\": 2,\n                \"attributes\": {\"O\": \"Foo\", \"A1\": 2, \"A2\": 2},\n                \"mcids\": [1],\n            },\n        ],\n    },\n    {\n        \"type\": \"P\",\n        \"revision\": 1,\n        \"page_number\": 2,\n        \"attributes\": 
{\"O\": \"Foo\", \"A1\": 3, \"A2\": 3},\n        \"mcids\": [2],\n    },\n]\nHELLO1 = [\n    {\n        \"type\": \"Section\",\n        \"page_number\": 1,\n        \"children\": [\n            {\n                \"type\": \"P\",\n                \"page_number\": 1,\n                \"attributes\": {\"O\": \"Foo\", \"A1\": 1},\n                \"mcids\": [1],\n            },\n        ],\n    }\n]\nHELLO1P = [\n    {\n        \"type\": \"Section\",\n        \"children\": [{\"type\": \"P\", \"attributes\": {\"O\": \"Foo\", \"A1\": 1}, \"mcids\": [1]}],\n    }\n]\n\n\nclass TestClass(unittest.TestCase):\n    \"\"\"Test the underlying Structure tree class\"\"\"\n\n    def test_structure_tree_class(self):\n        path = os.path.join(HERE, \"pdfs/image_structure.pdf\")\n        pdf = pdfplumber.open(path)\n        stree = PDFStructTree(pdf, pdf.pages[0])\n        doc_elem = next(iter(stree))\n        assert [k.type for k in doc_elem] == [\"P\", \"P\", \"Figure\"]\n\n    def test_find_all_tree(self):\n        \"\"\"\n        Test find_all() and find() on trees\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/image_structure.pdf\")\n        pdf = pdfplumber.open(path)\n        stree = PDFStructTree(pdf, pdf.pages[0])\n        figs = list(stree.find_all(\"Figure\"))\n        assert len(figs) == 1\n        fig = stree.find(\"Figure\")\n        assert fig == figs[0]\n        assert stree.find(\"Fogure\") is None\n        figs = list(stree.find_all(re.compile(r\"Fig.*\")))\n        assert len(figs) == 1\n        figs = list(stree.find_all(lambda x: x.type == \"Figure\"))\n        assert len(figs) == 1\n        figs = list(stree.find_all(\"Foogure\"))\n        assert len(figs) == 0\n        figs = list(stree.find_all(re.compile(r\"Fog.*\")))\n        assert len(figs) == 0\n        figs = list(stree.find_all(lambda x: x.type == \"Flogger\"))\n        assert len(figs) == 0\n\n    def test_find_all_element(self):\n        \"\"\"\n        Test find_all() and find() on 
elements\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/pdf_structure.pdf\")\n        pdf = pdfplumber.open(path)\n        stree = PDFStructTree(pdf)\n        for list_elem in stree.find_all(\"L\"):\n            items = list(list_elem.find_all(\"LI\"))\n            assert items\n            for item in items:\n                body = list(item.find_all(\"LBody\"))\n                assert body\n                body1 = item.find(\"LBody\")\n                assert body1 == body[0]\n                assert item.find(\"Loonie\") is None\n\n    def test_all_mcids(self):\n        \"\"\"\n        Test all_mcids()\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/2023-06-20-PV.pdf\")\n        pdf = pdfplumber.open(path)\n        # Make sure we can get them with page numbers\n        stree = PDFStructTree(pdf)\n        sect = next(stree.find_all(\"Sect\"))\n        mcids = list(sect.all_mcids())\n        pages = set(page for page, mcid in mcids)\n        assert 1 in pages\n        assert 2 in pages\n        # If we take only a single page there are no page numbers\n        # (FIXME: may wish to reconsider this API decision...)\n        page = pdf.pages[1]\n        stree = PDFStructTree(pdf, page)\n        sect = next(stree.find_all(\"Sect\"))\n        mcids = list(sect.all_mcids())\n        pages = set(page for page, mcid in mcids)\n        assert None in pages\n        assert 1 not in pages\n        assert 2 not in pages\n        # Assure that we get the MCIDs for a content element\n        for p in sect.find_all(\"P\"):\n            assert set(mcid for page, mcid in p.all_mcids()) == set(p.mcids)\n\n    def test_element_bbox(self):\n        \"\"\"\n        Test various ways of getting element bboxes\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/pdf_structure.pdf\")\n        pdf = pdfplumber.open(path)\n        stree = PDFStructTree(pdf)\n        # As BBox attribute\n        table = next(stree.find_all(\"Table\"))\n        assert 
tuple(stree.element_bbox(table)) == (56.7, 489.9, 555.3, 542.25)\n        # With child elements\n        tr = next(table.find_all(\"TR\"))\n        assert tuple(stree.element_bbox(tr)) == (56.8, 495.9, 328.312, 507.9)\n        # From a specific page it should also work\n        stree = PDFStructTree(pdf, pdf.pages[0])\n        table = next(stree.find_all(\"Table\"))\n        assert tuple(stree.element_bbox(table)) == (56.7, 489.9, 555.3, 542.25)\n        tr = next(table.find_all(\"TR\"))\n        assert tuple(stree.element_bbox(tr)) == (56.8, 495.9, 328.312, 507.9)\n        # Yeah but what happens if you crop the page?\n        page = pdf.pages[0].crop((10, 400, 500, 500))\n        stree = PDFStructTree(pdf, page)\n        table = next(stree.find_all(\"Table\"))\n        # The element gets cropped too\n        assert tuple(stree.element_bbox(table)) == (56.7, 489.9, 500, 500)\n        # And if you crop it out of the page?\n        page = pdf.pages[0].crop((0, 0, 560, 400))\n        stree = PDFStructTree(pdf, page)\n        table = next(stree.find_all(\"Table\"))\n        with self.assertRaises(IndexError):\n            _ = stree.element_bbox(table)\n\n\nclass TestUnparsed(unittest.TestCase):\n    \"\"\"Test handling of PDFs with unparsed pages.\"\"\"\n\n    def test_unparsed_pages(self):\n        path = os.path.join(HERE, \"pdfs/2023-06-20-PV.pdf\")\n\n        pdf = pdfplumber.open(path, pages=[2])\n        assert pdf.structure_tree == PVSTRUCT2\n\n\nclass TestMany(unittest.TestCase):\n    \"\"\"Test various PDFs.\"\"\"\n\n    def test_no_stucture(self):\n        path = os.path.join(HERE, \"pdfs/pdffill-demo.pdf\")\n        pdf = pdfplumber.open(path)\n        assert pdf.structure_tree == []\n        assert pdf.pages[0].structure_tree == []\n\n    def test_word365(self):\n        path = os.path.join(HERE, \"pdfs/word365_structure.pdf\")\n        pdf = pdfplumber.open(path)\n        page = pdf.pages[0]\n        assert page.structure_tree == WORD365\n\n    def 
test_proces_verbal(self):\n        path = os.path.join(HERE, \"pdfs/2023-06-20-PV.pdf\")\n\n        pdf = pdfplumber.open(path)\n        assert pdf.structure_tree == PVSTRUCT\n        page = pdf.pages[1]\n        assert page.structure_tree == PVSTRUCT1\n\n    def test_missing_parenttree(self):\n        \"\"\"Verify we can get structure without a ParentTree.\"\"\"\n        path = os.path.join(HERE, \"pdfs/2023-06-20-PV.pdf\")\n        pdf = pdfplumber.open(path)\n        root = resolve1(pdf.doc.catalog[\"StructTreeRoot\"])\n        del root[\"ParentTree\"]\n        assert pdf.pages[1].structure_tree == PVSTRUCT1\n\n    def test_image_structure(self):\n        path = os.path.join(HERE, \"pdfs/image_structure.pdf\")\n\n        pdf = pdfplumber.open(path)\n        page = pdf.pages[0]\n        assert page.structure_tree == IMAGESTRUCT\n\n    def test_figure_mcids(self):\n        path = os.path.join(HERE, \"pdfs/figure_structure.pdf\")\n\n        pdf = pdfplumber.open(path)\n        page = pdf.pages[0]\n        d = deque(page.structure_tree)\n        while d:\n            el = d.popleft()\n            if el[\"type\"] == \"Figure\":\n                break\n            if \"children\" in el:\n                d.extend(el[\"children\"])\n        # We found a Figure\n        assert el[\"type\"] == \"Figure\"\n        # It has these MCIDS\n        assert el[\"mcids\"] == [1, 14]\n\n    def test_scotus(self):\n        # This one actually has attribute classes!\n        path = os.path.join(HERE, \"pdfs/scotus-transcript-p1.pdf\")\n        pdf = pdfplumber.open(path)\n        assert pdf.structure_tree == SCOTUS\n\n    def test_chelsea_pdta(self):\n        # This one has structure elements for marked content sections\n        path = os.path.join(HERE, \"pdfs/chelsea_pdta.pdf\")\n        pdf = pdfplumber.open(path)\n        # This page has no structure tree (really!)\n        tree8 = pdf.pages[7].structure_tree\n        assert tree8 == []\n        # We should also have no structure 
tree here\n        with pdfplumber.open(path, pages=[8]) as pdf8:\n            assert pdf8.structure_tree == []\n        # This page is empty\n        tree3 = pdf.pages[3].structure_tree\n        assert tree3 == []\n        # This page in particular has OBJR and MCR elements\n        tree1 = pdf.pages[2].structure_tree\n        assert tree1  # Should contain a tree!\n        pdf = pdfplumber.open(path, pages=[3])\n        tree2 = pdf.structure_tree\n        assert tree2\n        # Compare modulo page_number\n        d = deque(zip(tree1, tree2))\n        while d:\n            el1, el2 = d.popleft()\n            if \"page_number\" in el1:\n                assert el1[\"page_number\"] == 3\n                assert el1 == el2\n            if \"children\" in el1:\n                assert len(el1[\"children\"]) == len(el2[\"children\"])\n                d.extend(zip(el1[\"children\"], el2[\"children\"]))\n\n    def test_hello_structure(self):\n        # Synthetic PDF to test some corner cases\n        path = os.path.join(HERE, \"pdfs/hello_structure.pdf\")\n        with pdfplumber.open(path) as pdf:\n            assert pdf.structure_tree == HELLO\n            assert pdf.pages[0].structure_tree == HELLO1P\n        with pdfplumber.open(path, pages=[1]) as pdf:\n            assert pdf.structure_tree == HELLO1\n"
  },
  {
    "path": "tests/test_table.py",
    "content": "#!/usr/bin/env python\nimport logging\nimport os\nimport unittest\n\nimport pytest\n\nimport pdfplumber\nfrom pdfplumber import table\n\nlogging.disable(logging.ERROR)\n\nHERE = os.path.abspath(os.path.dirname(__file__))\n\n\nclass Test(unittest.TestCase):\n    @classmethod\n    def setup_class(self):\n        path = os.path.join(HERE, \"pdfs/pdffill-demo.pdf\")\n        self.pdf = pdfplumber.open(path)\n\n    @classmethod\n    def teardown_class(self):\n        self.pdf.close()\n\n    def test_orientation_errors(self):\n        with pytest.raises(ValueError):\n            table.join_edge_group([], \"x\")\n\n    def test_table_settings_errors(self):\n        with pytest.raises(ValueError):\n            tf = table.TableFinder(self.pdf.pages[0], tuple())\n\n        with pytest.raises(TypeError):\n            tf = table.TableFinder(self.pdf.pages[0], {\"strategy\": \"x\"})\n            tf.get_edges()\n\n        with pytest.raises(ValueError):\n            tf = table.TableFinder(self.pdf.pages[0], {\"vertical_strategy\": \"x\"})\n\n        with pytest.raises(ValueError):\n            tf = table.TableFinder(\n                self.pdf.pages[0],\n                {\n                    \"vertical_strategy\": \"explicit\",\n                    \"explicit_vertical_lines\": [],\n                },\n            )\n\n        with pytest.raises(ValueError):\n            tf = table.TableFinder(self.pdf.pages[0], {\"join_tolerance\": -1})\n            tf.get_edges()\n\n    def test_edges_strict(self):\n        path = os.path.join(HERE, \"pdfs/issue-140-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            t = pdf.pages[0].extract_table(\n                {\n                    \"vertical_strategy\": \"lines_strict\",\n                    \"horizontal_strategy\": \"lines_strict\",\n                }\n            )\n\n        assert t[-1] == [\n            \"\",\n            \"0085648100300\",\n            \"CENTRAL KMA\",\n            \"LILYS 55% 
DARK CHOC BAR\",\n            \"415\",\n            \"$ 0.61\",\n            \"$ 253.15\",\n            \"0.0000\",\n            \"\",\n        ]\n\n    def test_rows_and_columns(self):\n        path = os.path.join(HERE, \"pdfs/issue-140-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            table = page.find_table()\n            row = [page.crop(bbox).extract_text() for bbox in table.rows[0].cells]\n            assert row == [\n                \"Line no\",\n                \"UPC code\",\n                \"Location\",\n                \"Item Description\",\n                \"Item Quantity\",\n                \"Bill Amount\",\n                \"Accrued Amount\",\n                \"Handling Rate\",\n                \"PO number\",\n            ]\n            col = [page.crop(bbox).extract_text() for bbox in table.columns[1].cells]\n            assert col == [\n                \"UPC code\",\n                \"0085648100305\",\n                \"0085648100380\",\n                \"0085648100303\",\n                \"0085648100300\",\n            ]\n\n    def test_explicit_desc_decimalization(self):\n        \"\"\"\n        See issue #290\n        \"\"\"\n        tf = table.TableFinder(\n            self.pdf.pages[0],\n            {\n                \"vertical_strategy\": \"explicit\",\n                \"explicit_vertical_lines\": [100, 200, 300],\n                \"horizontal_strategy\": \"explicit\",\n                \"explicit_horizontal_lines\": [100, 200, 300],\n            },\n        )\n        assert tf.tables[0].extract()\n\n    def test_text_tolerance(self):\n        path = os.path.join(HERE, \"pdfs/senate-expenditures.pdf\")\n        with pdfplumber.open(path) as pdf:\n            bbox = (70.332, 130.986, 420, 509.106)\n            cropped = pdf.pages[0].crop(bbox)\n            t = cropped.extract_table(\n                {\n                    \"horizontal_strategy\": \"text\",\n                    
\"vertical_strategy\": \"text\",\n                    \"min_words_vertical\": 20,\n                }\n            )\n            t_tol = cropped.extract_table(\n                {\n                    \"horizontal_strategy\": \"text\",\n                    \"vertical_strategy\": \"text\",\n                    \"min_words_vertical\": 20,\n                    \"text_x_tolerance\": 1,\n                }\n            )\n            t_tol_from_tables = cropped.extract_tables(\n                {\n                    \"horizontal_strategy\": \"text\",\n                    \"vertical_strategy\": \"text\",\n                    \"min_words_vertical\": 20,\n                    \"text_x_tolerance\": 1,\n                }\n            )[0]\n\n        assert t[-1] == [\n            \"DHAW20190070\",\n            \"09/09/2019\",\n            \"CITIBANK-TRAVELCBACARD\",\n            \"08/12/2019\",\n            \"08/14/2019\",\n        ]\n        assert t_tol[-1] == [\n            \"DHAW20190070\",\n            \"09/09/2019\",\n            \"CITIBANK - TRAVEL CBA CARD\",\n            \"08/12/2019\",\n            \"08/14/2019\",\n        ]\n        assert t_tol[-1] == t_tol_from_tables[-1]\n\n    def test_text_layout(self):\n        path = os.path.join(HERE, \"pdfs/issue-53-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            table = pdf.pages[0].extract_table(\n                {\n                    \"text_layout\": True,\n                }\n            )\n            assert table[3][0] == \"   FY2013   \\n   FY2014   \"\n\n    def test_text_without_words(self):\n        assert table.words_to_edges_h([]) == []\n        assert table.words_to_edges_v([]) == []\n\n    def test_order(self):\n        \"\"\"\n        See issue #336\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/issue-336-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            tables = pdf.pages[0].extract_tables()\n            assert len(tables) == 3\n            assert 
len(tables[0]) == 8\n            assert len(tables[1]) == 11\n            assert len(tables[2]) == 2\n\n    def test_issue_466_mixed_strategy(self):\n        \"\"\"\n        See issue #466\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/issue-466-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            tables = pdf.pages[0].extract_tables(\n                {\n                    \"vertical_strategy\": \"lines\",\n                    \"horizontal_strategy\": \"text\",\n                    \"snap_tolerance\": 8,\n                    \"intersection_tolerance\": 4,\n                }\n            )\n\n            # The engine only extracts the tables which have drawn horizontal\n            # lines.\n            # For the 3 extracted tables, some common properties are expected:\n            # - 4 rows\n            # - 3 columns\n            # - Data in last row contains the string 'last'\n            for t in tables:\n                assert len(t) == 4\n                assert len(t[0]) == 3\n\n                # Verify that all cells contain real data\n                for cell in t[3]:\n                    assert \"last\" in cell\n\n    def test_discussion_539_null_value(self):\n        \"\"\"\n        See discussion #539\n        \"\"\"\n        path = os.path.join(HERE, \"pdfs/nics-background-checks-2015-11.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            table_settings = {\n                \"vertical_strategy\": \"lines\",\n                \"horizontal_strategy\": \"lines\",\n                \"explicit_vertical_lines\": [],\n                \"explicit_horizontal_lines\": [],\n                \"snap_tolerance\": 3,\n                \"join_tolerance\": 3,\n                \"edge_min_length\": 3,\n                \"min_words_vertical\": 3,\n                \"min_words_horizontal\": 1,\n                \"text_keep_blank_chars\": False,\n                \"text_tolerance\": 3,\n                
\"intersection_tolerance\": 3,\n            }\n            assert page.extract_table(table_settings)\n            assert page.extract_tables(table_settings)\n\n    def test_table_curves(self):\n        # See https://github.com/jsvine/pdfplumber/discussions/808\n        path = os.path.join(HERE, \"pdfs/table-curves-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            assert len(page.curves)\n            tables = page.extract_tables()\n            assert len(tables) == 1\n            t = tables[0]\n            assert t[-2][-2] == \"Uncommon\"\n\n            assert len(page.extract_tables({\"vertical_strategy\": \"lines_strict\"})) == 0\n"
  },
  {
    "path": "tests/test_utils.py",
    "content": "#!/usr/bin/env python\nimport logging\nimport os\nimport re\nimport unittest\nfrom itertools import groupby\nfrom operator import itemgetter\n\nimport pandas as pd\nimport pytest\nfrom pdfminer.pdfparser import PDFObjRef\nfrom pdfminer.psparser import PSLiteral\n\nimport pdfplumber\nfrom pdfplumber import utils\n\nlogging.disable(logging.ERROR)\n\nHERE = os.path.abspath(os.path.dirname(__file__))\n\n\nclass Test(unittest.TestCase):\n    @classmethod\n    def setup_class(self):\n        self.pdf = pdfplumber.open(os.path.join(HERE, \"pdfs/pdffill-demo.pdf\"))\n        self.pdf_scotus = pdfplumber.open(\n            os.path.join(HERE, \"pdfs/scotus-transcript-p1.pdf\")\n        )\n\n    @classmethod\n    def teardown_class(self):\n        self.pdf.close()\n        self.pdf_scotus.close()\n\n    def test_cluster_list(self):\n        a = [1, 2, 3, 4]\n        assert utils.cluster_list(a) == [[x] for x in a]\n        assert utils.cluster_list(a, tolerance=1) == [a]\n\n        a = [1, 2, 5, 6]\n        assert utils.cluster_list(a, tolerance=1) == [[1, 2], [5, 6]]\n\n    def test_cluster_objects(self):\n        a = [\"a\", \"ab\", \"abc\", \"b\"]\n        assert utils.cluster_objects(a, len, 0) == [[\"a\", \"b\"], [\"ab\"], [\"abc\"]]\n\n        b = [{\"x\": 1, 7: \"a\"}, {\"x\": 1, 7: \"b\"}, {\"x\": 2, 7: \"b\"}, {\"x\": 2, 7: \"b\"}]\n        assert utils.cluster_objects(b, \"x\", 0) == [[b[0], b[1]], [b[2], b[3]]]\n        assert utils.cluster_objects(b, 7, 0) == [[b[0]], [b[1], b[2], b[3]]]\n\n    def test_resolve(self):\n        annot = self.pdf.annots[0]\n        annot_ad0 = utils.resolve(annot[\"data\"][\"A\"][\"D\"][0])\n        assert annot_ad0[\"MediaBox\"] == [0, 0, 612, 792]\n        assert utils.resolve(1) == 1\n\n    def test_resolve_all(self):\n        info = self.pdf.doc.xrefs[0].trailer[\"Info\"]\n        assert type(info) is PDFObjRef\n        a = [{\"info\": info}]\n        a_res = utils.resolve_all(a)\n        assert a_res[0][\"info\"][\"Producer\"] == 
self.pdf.doc.info[0][\"Producer\"]\n\n    def test_decode_psl_list(self):\n        a = [PSLiteral(\"test\"), \"test_2\"]\n        assert utils.decode_psl_list(a) == [\"test\", \"test_2\"]\n\n    def test_x_tolerance_ratio(self):\n        pdf = pdfplumber.open(os.path.join(HERE, \"pdfs/issue-987-test.pdf\"))\n        page = pdf.pages[0]\n\n        assert page.extract_text() == \"Big Te xt\\nSmall Text\"\n        assert page.extract_text(x_tolerance=4) == \"Big Te xt\\nSmallText\"\n        assert page.extract_text(x_tolerance_ratio=0.15) == \"Big Text\\nSmall Text\"\n\n        words = page.extract_words(x_tolerance_ratio=0.15)\n        assert \"|\".join(w[\"text\"] for w in words) == \"Big|Text|Small|Text\"\n\n    def test_extract_words(self):\n        path = os.path.join(HERE, \"pdfs/issue-192-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            p = pdf.pages[0]\n            words = p.extract_words(vertical_ttb=False)\n            words_attr = p.extract_words(vertical_ttb=False, extra_attrs=[\"size\"])\n            words_w_spaces = p.extract_words(vertical_ttb=False, keep_blank_chars=True)\n            words_rtl = p.extract_words(horizontal_ltr=False)\n\n        assert words[0][\"text\"] == \"Agaaaaa:\"\n        assert words[0][\"direction\"] == \"ltr\"\n\n        assert \"size\" not in words[0]\n        assert round(words_attr[0][\"size\"], 2) == 9.96\n\n        assert words_w_spaces[0][\"text\"] == \"Agaaaaa: AAAA\"\n\n        vertical = [w for w in words if w[\"upright\"] == 0]\n        assert vertical[0][\"text\"] == \"Aaaaaabag8\"\n        assert vertical[0][\"direction\"] == \"btt\"\n\n        assert words_rtl[1][\"text\"] == \"baaabaaA/AAA\"\n        assert words_rtl[1][\"direction\"] == \"rtl\"\n\n    def test_extract_words_return_chars(self):\n        path = os.path.join(HERE, \"pdfs/extra-attrs-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n\n            words = page.extract_words()\n         
   assert \"chars\" not in words[0]\n\n            words = page.extract_words(return_chars=True)\n            assert \"chars\" in words[0]\n            assert \"\".join(c[\"text\"] for c in words[0][\"chars\"]) == words[0][\"text\"]\n\n    def test_text_rotation(self):\n        rotations = {\n            \"0\": (\"ltr\", \"ttb\"),\n            \"-0\": (\"rtl\", \"ttb\"),\n            \"180\": (\"rtl\", \"btt\"),\n            \"-180\": (\"ltr\", \"btt\"),\n            \"90\": (\"ttb\", \"rtl\"),\n            \"-90\": (\"btt\", \"rtl\"),\n            \"270\": (\"btt\", \"ltr\"),\n            \"-270\": (\"ttb\", \"ltr\"),\n        }\n\n        path = os.path.join(HERE, \"pdfs/issue-848.pdf\")\n        with pdfplumber.open(path) as pdf:\n            expected = utils.text.extract_text(pdf.pages[0].chars)\n            for i, (rotation, (char_dir, line_dir)) in enumerate(rotations.items()):\n                if i == 0:\n                    continue\n                print(f\"--- {rotation} ---\")\n                p = pdf.pages[i].filter(lambda obj: obj.get(\"text\") != \" \")\n                output = utils.text.extract_text(\n                    x_tolerance=2,\n                    y_tolerance=2,\n                    chars=p.chars,\n                    char_dir=char_dir,\n                    line_dir=line_dir,\n                    char_dir_rotated=char_dir,\n                    line_dir_rotated=line_dir,\n                    char_dir_render=\"ltr\",\n                    line_dir_render=\"ttb\",\n                )\n                assert output == expected\n\n    def test_text_rotation_layout(self):\n        rotations = {\n            \"0\": (\"ltr\", \"ttb\"),\n            \"-0\": (\"rtl\", \"ttb\"),\n            \"180\": (\"rtl\", \"btt\"),\n            \"-180\": (\"ltr\", \"btt\"),\n            \"90\": (\"ttb\", \"rtl\"),\n            \"-90\": (\"btt\", \"rtl\"),\n            \"270\": (\"btt\", \"ltr\"),\n            \"-270\": (\"ttb\", \"ltr\"),\n        }\n\n        def 
meets_expectations(text):\n            # Both texts should be found, and the first should appear before the second\n            a = re.search(\"opens with a news report\", text)\n            b = re.search(\"having been transferred\", text)\n            return a and b and (a.start() < b.start())\n\n        path = os.path.join(HERE, \"pdfs/issue-848.pdf\")\n        with pdfplumber.open(path) as pdf:\n            for i, (rotation, (char_dir, line_dir)) in enumerate(rotations.items()):\n                print(f\"--- {rotation} ---\")\n                p = pdf.pages[i].filter(lambda obj: obj.get(\"text\") != \" \")\n                output = p.extract_text(\n                    layout=True,\n                    x_tolerance=2,\n                    y_tolerance=2,\n                    char_dir=char_dir,\n                    line_dir=line_dir,\n                    char_dir_rotated=char_dir,\n                    line_dir_rotated=line_dir,\n                    char_dir_render=\"ltr\",\n                    line_dir_render=\"ttb\",\n                    y_density=14,\n                )\n                assert meets_expectations(output)\n\n    def test_text_render_directions(self):\n        path = os.path.join(HERE, \"pdfs/line-char-render-example.pdf\")\n        targets = {\n            (\"ttb\", \"ltr\"): \"first line\\nsecond line\\nthird line\",\n            (\"ttb\", \"rtl\"): \"enil tsrif\\nenil dnoces\\nenil driht\",\n            (\"btt\", \"ltr\"): \"third line\\nsecond line\\nfirst line\",\n            (\"btt\", \"rtl\"): \"enil driht\\nenil dnoces\\nenil tsrif\",\n            (\"ltr\", \"ttb\"): \"fst\\nieh\\nrci\\nsor\\ntnd\\n d \\nl l\\nili\\nnin\\nene\\n e \",\n            (\"ltr\", \"btt\"): \" s \\nfet\\nich\\nroi\\nsnr\\ntdd\\n   \\nlll\\niii\\nnnn\\neee\",\n            (\"rtl\", \"ttb\"): \"tsf\\nhei\\nicr\\nros\\ndnt\\n d \\nl l\\nili\\nnin\\nene\\n e \",\n            (\"rtl\", \"btt\"): \" s \\ntef\\nhci\\nior\\nrns\\nddt\\n   \\nlll\\niii\\nnnn\\neee\",\n        
}\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            for (line_dir, char_dir), target in targets.items():\n                text = page.extract_text(\n                    line_dir_render=line_dir, char_dir_render=char_dir\n                )\n                assert text == target\n\n    def test_invalid_directions(self):\n        path = os.path.join(HERE, \"pdfs/line-char-render-example.pdf\")\n        pdf = pdfplumber.open(path)\n        page = pdf.pages[0]\n        with pytest.raises(ValueError):\n            page.extract_text(line_dir=\"xxx\", char_dir=\"ltr\")\n        with pytest.raises(ValueError):\n            page.extract_text(line_dir=\"ttb\", char_dir=\"a\")\n        with pytest.raises(ValueError):\n            page.extract_text(line_dir=\"rtl\", char_dir=\"ltr\")\n        with pytest.raises(ValueError):\n            page.extract_text(line_dir=\"ttb\", char_dir=\"btt\")\n        with pytest.raises(ValueError):\n            page.extract_text(line_dir_rotated=\"ttb\", char_dir=\"btt\")\n        with pytest.raises(ValueError):\n            page.extract_text(line_dir_render=\"ttb\", char_dir_render=\"btt\")\n        pdf.close()\n\n    def test_extra_attrs(self):\n        path = os.path.join(HERE, \"pdfs/extra-attrs-example.pdf\")\n        with pdfplumber.open(path) as pdf:\n            page = pdf.pages[0]\n            assert page.extract_text() == \"BlackRedArial\"\n            assert (\n                page.extract_text(extra_attrs=[\"non_stroking_color\"])\n                == \"Black RedArial\"\n            )\n            assert page.extract_text(extra_attrs=[\"fontname\"]) == \"BlackRed Arial\"\n            assert (\n                page.extract_text(extra_attrs=[\"non_stroking_color\", \"fontname\"])\n                == \"Black Red Arial\"\n            )\n            # Should not error\n            assert page.extract_text(\n                layout=True,\n                use_text_flow=True,\n                
extra_attrs=[\"non_stroking_color\", \"fontname\"],\n            )\n\n    def test_extract_words_punctuation(self):\n        path = os.path.join(HERE, \"pdfs/test-punkt.pdf\")\n        with pdfplumber.open(path) as pdf:\n\n            wordsA = pdf.pages[0].extract_words(split_at_punctuation=True)\n            wordsB = pdf.pages[0].extract_words(split_at_punctuation=False)\n            wordsC = pdf.pages[0].extract_words(\n                split_at_punctuation=r\"!\\\"&'()*+,.:;<=>?@[]^`{|}~\"\n            )\n\n            assert wordsA[0][\"text\"] == \"https\"\n            assert (\n                wordsB[0][\"text\"]\n                == \"https://dell-research-harvard.github.io/HJDataset/\"\n            )\n            assert wordsC[2][\"text\"] == \"//dell-research-harvard\"\n\n            wordsA = pdf.pages[1].extract_words(split_at_punctuation=True)\n            wordsB = pdf.pages[1].extract_words(split_at_punctuation=False)\n            wordsC = pdf.pages[1].extract_words(\n                split_at_punctuation=r\"!\\\"&'()*+,.:;<=>?@[]^`{|}~\"\n            )\n\n            assert len(wordsA) == 4\n            assert len(wordsB) == 2\n            assert len(wordsC) == 2\n\n            wordsA = pdf.pages[2].extract_words(split_at_punctuation=True)\n            wordsB = pdf.pages[2].extract_words(split_at_punctuation=False)\n            wordsC = pdf.pages[2].extract_words(\n                split_at_punctuation=r\"!\\\"&'()*+,.:;<=>?@[]^`{|}~\"\n            )\n\n            assert wordsA[1][\"text\"] == \"[\"\n            assert wordsB[1][\"text\"] == \"[2,\"\n            assert wordsC[1][\"text\"] == \"[\"\n\n            wordsA = pdf.pages[3].extract_words(split_at_punctuation=True)\n            wordsB = pdf.pages[3].extract_words(split_at_punctuation=False)\n            wordsC = pdf.pages[3].extract_words(\n                split_at_punctuation=r\"!\\\"&'()*+,.:;<=>?@[]^`{|}~\"\n            )\n\n            assert wordsA[2][\"text\"] == \"al\"\n            assert 
wordsB[2][\"text\"] == \"al.\"\n            assert wordsC[2][\"text\"] == \"al\"\n\n    def test_extract_text_punctuation(self):\n        path = os.path.join(HERE, \"pdfs/test-punkt.pdf\")\n        with pdfplumber.open(path) as pdf:\n            text = pdf.pages[0].extract_text(\n                layout=True,\n                split_at_punctuation=True,\n            )\n            assert \"https \" in text\n\n    def test_text_flow(self):\n        path = os.path.join(HERE, \"pdfs/federal-register-2020-17221.pdf\")\n\n        def words_to_text(words):\n            grouped = groupby(words, key=itemgetter(\"top\"))\n            lines = [\" \".join(word[\"text\"] for word in grp) for top, grp in grouped]\n            return \"\\n\".join(lines)\n\n        with pdfplumber.open(path) as pdf:\n            p0 = pdf.pages[0]\n            using_flow = p0.extract_words(use_text_flow=True)\n            not_using_flow = p0.extract_words()\n\n        target_text = (\n            \"The FAA proposes to\\n\"\n            \"supersede Airworthiness Directive (AD)\\n\"\n            \"2018–23–51, which applies to all The\\n\"\n            \"Boeing Company Model 737–8 and 737–\\n\"\n            \"9 (737 MAX) airplanes. 
Since AD 2018–\\n\"\n        )\n\n        assert target_text in words_to_text(using_flow)\n        assert target_text not in words_to_text(not_using_flow)\n\n    def test_text_flow_overlapping(self):\n        path = os.path.join(HERE, \"pdfs/issue-912.pdf\")\n\n        with pdfplumber.open(path) as pdf:\n            p0 = pdf.pages[0]\n            using_flow = p0.extract_text(use_text_flow=True, layout=True, x_tolerance=1)\n            not_using_flow = p0.extract_text(layout=True, x_tolerance=1)\n\n        assert re.search(\"2015 RICE PAYMENT 26406576 0 1207631 Cr\", using_flow)\n        assert re.search(\"124644,06155766\", using_flow) is None\n\n        assert re.search(\"124644,06155766\", not_using_flow)\n        assert (\n            re.search(\"2015 RICE PAYMENT 26406576 0 1207631 Cr\", not_using_flow) is None\n        )\n\n    def test_text_flow_words_mixed_lines(self):\n        path = os.path.join(HERE, \"pdfs/issue-1279-example.pdf\")\n\n        with pdfplumber.open(path) as pdf:\n            p0 = pdf.pages[0]\n            words = p0.extract_words(use_text_flow=True)\n\n        texts = set(w[\"text\"] for w in words)\n\n        assert \"claim\" in texts\n        assert \"lence\" in texts\n        assert \"claimlence\" not in texts\n\n    def test_extract_text(self):\n        text = self.pdf.pages[0].extract_text()\n        goal_lines = [\n            \"First Page Previous Page Next Page Last Page\",\n            \"Print\",\n            \"PDFill: PDF Drawing\",\n            \"You can open a PDF or create a blank PDF by PDFill.\",\n            \"Online Help\",\n            \"Here are the PDF drawings created by PDFill\",\n            \"Please save into a new PDF to see the effect!\",\n            \"Goto Page 2: Line Tool\",\n            \"Goto Page 3: Arrow Tool\",\n            \"Goto Page 4: Tool for Rectangle, Square and Rounded Corner\",\n            \"Goto Page 5: Tool for Circle, Ellipse, Arc, Pie\",\n            \"Goto Page 6: Tool for Basic Shapes\",\n 
           \"Goto Page 7: Tool for Curves\",\n            \"Here are the tools to change line width, style, arrow style and colors\",\n        ]\n        goal = \"\\n\".join(goal_lines)\n\n        assert text == goal\n\n        text_simple = self.pdf.pages[0].extract_text_simple()\n        assert text_simple == goal\n\n        assert self.pdf.pages[0].crop((0, 0, 1, 1)).extract_text() == \"\"\n\n    def test_extract_text_blank(self):\n        assert utils.extract_text([]) == \"\"\n\n    def test_extract_text_layout(self):\n        target = (\n            open(os.path.join(HERE, \"comparisons/scotus-transcript-p1.txt\"))\n            .read()\n            .strip(\"\\n\")\n        )\n        page = self.pdf_scotus.pages[0]\n        text = page.extract_text(layout=True)\n        utils_text = utils.extract_text(\n            page.chars,\n            layout=True,\n            layout_width=page.width,\n            layout_height=page.height,\n            layout_bbox=page.bbox,\n        )\n        assert text == utils_text\n        assert text == target\n\n    def test_extract_text_layout_cropped(self):\n        target = (\n            open(os.path.join(HERE, \"comparisons/scotus-transcript-p1-cropped.txt\"))\n            .read()\n            .strip(\"\\n\")\n        )\n        p = self.pdf_scotus.pages[0]\n        cropped = p.crop((90, 70, p.width, 300))\n        text = cropped.extract_text(layout=True)\n        assert text == target\n\n    def test_extract_text_layout_widths(self):\n        p = self.pdf_scotus.pages[0]\n        text = p.extract_text(layout=True, layout_width_chars=75)\n        assert all(len(line) == 75 for line in text.splitlines())\n        with pytest.raises(ValueError):\n            p.extract_text(layout=True, layout_width=300, layout_width_chars=50)\n        with pytest.raises(ValueError):\n            p.extract_text(layout=True, layout_height=300, layout_height_chars=50)\n\n    def test_extract_text_nochars(self):\n        charless = 
self.pdf.pages[0].filter(lambda df: df[\"object_type\"] != \"char\")\n        assert charless.extract_text() == \"\"\n        assert charless.extract_text(layout=True) == \"\"\n\n    def test_search_regex_compiled(self):\n        page = self.pdf_scotus.pages[0]\n        pat = re.compile(r\"supreme\\s+(\\w+)\", re.I)\n        results = page.search(pat)\n        assert results[0][\"text\"] == \"SUPREME COURT\"\n        assert results[0][\"groups\"] == (\"COURT\",)\n        assert results[1][\"text\"] == \"Supreme Court\"\n        assert results[1][\"groups\"] == (\"Court\",)\n\n        with pytest.raises(ValueError):\n            page.search(re.compile(r\"x\"), regex=False)\n\n        with pytest.raises(ValueError):\n            page.search(re.compile(r\"x\"), case=False)\n\n    def test_search_regex_uncompiled(self):\n        page = self.pdf_scotus.pages[0]\n        pat = r\"supreme\\s+(\\w+)\"\n        results = page.search(pat, case=False)\n        assert results[0][\"text\"] == \"SUPREME COURT\"\n        assert results[0][\"groups\"] == (\"COURT\",)\n        assert results[1][\"text\"] == \"Supreme Court\"\n        assert results[1][\"groups\"] == (\"Court\",)\n\n    def test_search_string(self):\n        page = self.pdf_scotus.pages[0]\n        results = page.search(\"SUPREME COURT\", regex=False)\n        assert results[0][\"text\"] == \"SUPREME COURT\"\n        assert results[0][\"groups\"] == tuple()\n\n        results = page.search(\"supreme court\", regex=False)\n        assert len(results) == 0\n\n        results = page.search(\"supreme court\", regex=False, case=False)\n        assert len(results) == 2\n\n        results = page.search(\"supreme court\", regex=True, case=False)\n        assert len(results) == 2\n\n        results = page.search(r\"supreme\\s+(\\w+)\", regex=False)\n        assert len(results) == 0\n\n        results = page.search(r\"10 Tuesday\", layout=False)\n        assert len(results) == 1\n\n        results = page.search(r\"10 
Tuesday\", layout=True)\n        assert len(results) == 0\n\n    def test_extract_text_lines(self):\n        page = self.pdf_scotus.pages[0]\n        results = page.extract_text_lines()\n        assert len(results) == 28\n        assert \"chars\" in results[0]\n        assert results[0][\"text\"] == \"Official - Subject to Final Review\"\n\n        alt = page.extract_text_lines(layout=True, strip=False, return_chars=False)\n        assert \"chars\" not in alt[0]\n        assert (\n            alt[0][\"text\"]\n            == \"                                   Official - Subject to Final Review               \"  # noqa: E501\n        )\n\n        assert results[10][\"text\"] == \"10 Tuesday, January 13, 2009\"\n        assert (\n            alt[10][\"text\"]\n            == \"            10                          Tuesday, January 13, 2009                   \"  # noqa: E501\n        )\n        assert (\n            page.extract_text_lines(layout=True)[10][\"text\"]\n            == \"10                          Tuesday, January 13, 2009\"\n        )  # noqa: E501\n\n    def test_handle_empty_and_whitespace_search_results(self):\n        # via https://github.com/jsvine/pdfplumber/discussions/853\n        # The searches below should not raise errors but instead\n        # should return empty result-sets.\n        page = self.pdf_scotus.pages[0]\n        for regex in [True, False]:\n            results = page.search(\"\\n\", regex=regex)\n            assert len(results) == 0\n\n        assert len(page.search(\"(sdfsd)?\")) == 0\n        assert len(page.search(\"\")) == 0\n\n    def test_intersects_bbox(self):\n        objs = [\n            # Is same as bbox\n            {\n                \"x0\": 0,\n                \"top\": 0,\n                \"x1\": 20,\n                \"bottom\": 20,\n            },\n            # Inside bbox\n            {\n                \"x0\": 10,\n                \"top\": 10,\n                \"x1\": 15,\n                \"bottom\": 15,\n  
          },\n            # Overlaps bbox\n            {\n                \"x0\": 10,\n                \"top\": 10,\n                \"x1\": 30,\n                \"bottom\": 30,\n            },\n            # Touching on one side\n            {\n                \"x0\": 20,\n                \"top\": 0,\n                \"x1\": 40,\n                \"bottom\": 20,\n            },\n            # Touching on one corner\n            {\n                \"x0\": 20,\n                \"top\": 20,\n                \"x1\": 40,\n                \"bottom\": 40,\n            },\n            # Fully outside\n            {\n                \"x0\": 21,\n                \"top\": 21,\n                \"x1\": 40,\n                \"bottom\": 40,\n            },\n        ]\n        bbox = utils.obj_to_bbox(objs[0])\n\n        assert utils.intersects_bbox(objs, bbox) == objs[:4]\n        assert utils.intersects_bbox(iter(objs), bbox) == objs[:4]\n\n    def test_merge_bboxes(self):\n        bboxes = [\n            (0, 10, 20, 20),\n            (10, 5, 10, 30),\n        ]\n        merged = utils.merge_bboxes(bboxes)\n        assert merged == (0, 5, 20, 30)\n        merged = utils.merge_bboxes(iter(bboxes))\n        assert merged == (0, 5, 20, 30)\n\n    def test_resize_object(self):\n        obj = {\n            \"x0\": 5,\n            \"x1\": 10,\n            \"top\": 20,\n            \"bottom\": 30,\n            \"width\": 5,\n            \"height\": 10,\n            \"doctop\": 120,\n            \"y0\": 40,\n            \"y1\": 50,\n        }\n        assert utils.resize_object(obj, \"x0\", 0) == {\n            \"x0\": 0,\n            \"x1\": 10,\n            \"top\": 20,\n            \"doctop\": 120,\n            \"bottom\": 30,\n            \"width\": 10,\n            \"height\": 10,\n            \"y0\": 40,\n            \"y1\": 50,\n        }\n        assert utils.resize_object(obj, \"x1\", 50) == {\n            \"x0\": 5,\n            \"x1\": 50,\n            \"top\": 20,\n         
   \"doctop\": 120,\n            \"bottom\": 30,\n            \"width\": 45,\n            \"height\": 10,\n            \"y0\": 40,\n            \"y1\": 50,\n        }\n        assert utils.resize_object(obj, \"top\", 0) == {\n            \"x0\": 5,\n            \"x1\": 10,\n            \"top\": 0,\n            \"doctop\": 100,\n            \"bottom\": 30,\n            \"height\": 30,\n            \"width\": 5,\n            \"y0\": 40,\n            \"y1\": 70,\n        }\n        assert utils.resize_object(obj, \"bottom\", 40) == {\n            \"x0\": 5,\n            \"x1\": 10,\n            \"top\": 20,\n            \"doctop\": 120,\n            \"bottom\": 40,\n            \"height\": 20,\n            \"width\": 5,\n            \"y0\": 30,\n            \"y1\": 50,\n        }\n\n    def test_move_object(self):\n        a = {\n            \"x0\": 5,\n            \"x1\": 10,\n            \"top\": 20,\n            \"bottom\": 30,\n            \"width\": 5,\n            \"height\": 10,\n            \"doctop\": 120,\n            \"y0\": 40,\n            \"y1\": 50,\n        }\n\n        b = dict(a)\n        b[\"x0\"] = 15\n        b[\"x1\"] = 20\n\n        a_new = utils.move_object(a, \"h\", 10)\n        assert a_new == b\n\n    def test_snap_objects(self):\n        a = {\n            \"x0\": 5,\n            \"x1\": 10,\n            \"top\": 20,\n            \"bottom\": 30,\n            \"width\": 5,\n            \"height\": 10,\n            \"doctop\": 120,\n            \"y0\": 40,\n            \"y1\": 50,\n        }\n\n        b = dict(a)\n        b[\"x0\"] = 6\n        b[\"x1\"] = 11\n\n        c = dict(a)\n        c[\"x0\"] = 7\n        c[\"x1\"] = 12\n\n        a_new, b_new, c_new = utils.snap_objects([a, b, c], \"x0\", 1)\n        assert a_new == b_new == c_new\n        a_new, b_new, c_new = utils.snap_objects(iter([a, b, c]), \"x0\", 1)\n        assert a_new == b_new == c_new\n\n    def test_filter_edges(self):\n        with pytest.raises(ValueError):\n          
  utils.filter_edges([], \"x\")\n\n    def test_to_list(self):\n        objs = [\n            {\n                \"x0\": 0,\n                \"top\": 0,\n                \"x1\": 20,\n                \"bottom\": 20,\n            },\n            {\n                \"x0\": 10,\n                \"top\": 10,\n                \"x1\": 15,\n                \"bottom\": 15,\n            },\n        ]\n        assert utils.to_list(objs) == objs\n        assert utils.to_list(iter(objs)) == objs\n        assert utils.to_list(tuple(objs)) == objs\n        assert utils.to_list((o for o in objs)) == objs\n        assert utils.to_list(pd.DataFrame(objs)) == objs\n"
  }
]