[
  {
    "path": ".github/workflows/test.yml",
    "content": "name: Run tests\n\non: [push]\n\njobs:\n  test:\n    strategy:\n      fail-fast: false\n      matrix:\n        os: [ubuntu-24.04, macos-14]\n\n    runs-on: ${{ matrix.os }}\n\n    steps:\n    - uses: actions/checkout@v3\n\n    - name: Install prerequisites on Ubuntu\n      if: matrix.os == 'ubuntu-24.04'\n      run: |\n        sudo apt update\n        sudo apt install --yes libdjvulibre21 libdjvulibre-dev\n\n    - name: Install prerequisites on macOS\n      if: matrix.os == 'macos-14'\n      run: brew install djvulibre libtiff\n\n    # - name: Install prerequisites on Windows\n    #   if: matrix.os == 'windows-2022'\n    #   run: |\n    #     choco install djvu-libre\n    #     vcpkg install tiff\n\n    - uses: astral-sh/setup-uv@v7\n    - name: Install dependencies\n      run: uv sync --all-extras\n\n    - name: Lint\n      run: make lint\n\n    - name: Test\n      run: make test\n"
  },
  {
    "path": ".gitignore",
    "content": ".pytest_cache\n.ruff_cache\n.tests\n"
  },
  {
    "path": ".python-version",
    "content": "3.11\n"
  },
  {
    "path": "CHANGELOG.md",
    "content": "## v2.5.4 (2026-04-24)\n\n* Run `uv` security audit and update some dependencies\n\n## v2.5.3 (2026-03-25)\n\n* Fix broken workflow without text layer translation\n* Shorter names for temporary directories\n* Code maintenance\n\n## v2.5.2 (2026-03-25)\n\n* Relax dependency versions\n\n## v2.5.1 (2026-03-14)\n\n* Allow manually configuring PDF page resolution (DPI)\n\n## v2.5.0 (2026-03-13)\n\n* Account for DjVu file resolution\n* Simplify image diffing and regenerate better-quality fixtures\n\n## v2.4.2 (2026-02-24)\n\n* Fix issue where only the main process has its logger configured\n\n## v2.4.1 (2026-02-24)\n\n* Fix compatibility issues with the new OCRmyPDF API\n* Remove support for Python 3.10\n\n## v2.4.0 (2026-02-24)\n\n* Migrate to `uv` from `pyenv` + `poetry`\n* Update dependencies\n\n## v2.3.1 (2025-10-28)\n\n* Fix mixed-up email format\n\n## v2.3.0 (2025-10-28)\n\n* Remove support for Python 3.9\n* Migrate to standardized `pyproject.toml`\n* Update dependencies\n\n## v2.2.15 (2025-07-02)\n\n* Add support for installation via `pipx`\n\n## v2.2.14 (2025-05-27)\n\n* Improve installation notes\n* Bump djvulibre-python version\n\n## v2.2.13 (2025-02-12)\n\n* Fail-safe quality settings for non-JPEG images\n\n## v2.2.12 (2025-01-27)\n\n* Update pytest_image_diff and fix newly broken tests\n\n## v2.2.11 (2025-01-26)\n\n* Update dependencies\n\n## v2.2.10 (2024-10-25)\n\n* Improve interface with OCRmyPDF\n* Fix CI build\n\n## v2.2.9 (2024-10-25)\n\n* Improve type hints\n* Update dependencies\n\n## v2.2.8 (2024-10-18)\n\n* Support single characters in the text layer\n\n## v2.2.7 (2024-08-27)\n\n* Improve tab and newline handling\n\n## v2.2.6 (2024-08-05)\n\n* Fix accidental whitespace removal from text blocks\n\n## v2.2.5 (2024-07-20)\n\n* Re-add ability to force the image mode (RGB/Grayscale/Monochrome)\n\n## v2.2.4 (2024-02-24)\n\n* Update dependencies\n\n## v2.2.3 (2023-12-09)\n\n* Fix CI build\n* Ignore invalid UTF-8 sequences\n* Ignore 
unrecognized page titles in the outline (#23)\n\n## v2.2.2 (2023-10-29)\n\n* Update dependencies\n\n## v2.2.1 (2023-11-06)\n\n* Handle invalid PDF pages\n* Fix exception in text layer processing (#20)\n\n## v2.2.0 (2023-10-28)\n\n* Add options for disabling the text layer and for directly running OCR\n\n## v2.1.5 (2023-10-27)\n\n* Fix inverted colors in images (#16)\n\n## v2.1.4 (2023-10-06)\n\n* Fix typo in logging code\n\n## v2.1.3 (2023-10-06)\n\n* Improve logging\n\n## v2.1.2 (2023-10-02)\n\n* Accidental version bump\n\n## v2.1.1 (2023-10-02)\n\n* Remove debug code\n\n## v2.1.0 (2023-10-02)\n\n* Add support for OCRmyPDF\n\n## v2.0.2 (2023-08-03)\n\n* Update some other dependencies\n* Replace `python-djvulibre` with `djvulibre-python`\n\n## v2.0.1 (2023-06-22)\n\n* Minor improvements in packaging\n\n## v2.0.0 (2023-05-04)\n\n* Fully rewrite\n"
  },
  {
    "path": "LICENSE",
    "content": "Copyright (C) 2015 Kevin Arthur Schiff Croker\n\nThis program is free software: you can redistribute it and/or modify\nit under the terms of the GNU General Public License as published by\nthe Free Software Foundation, either version 3 of the License, or\n(at your option) any later version.\n\nThis program is distributed in the hope that it will be useful,\nbut WITHOUT ANY WARRANTY; without even the implied warranty of\nMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\nGNU General Public License for more details.\n\nYou should have received a copy of the GNU General Public License\nalong with this program.  If not, see <https://www.gnu.org/licenses/>.\n"
  },
  {
    "path": "Makefile",
    "content": ".PHONY: lint test\n\nlint:\n\tuv run ruff check\n\tuv run mypy\n\ntest:\n\tuv run pytest\n\ndpsprep.1: dpsprep.1.ronn\n\tronn --roff dpsprep.1.ronn\n"
  },
  {
    "path": "README.md",
    "content": "# dpsprep\n\n[![Tests](https://github.com/kcroker/dpsprep/actions/workflows/test.yml/badge.svg)](https://github.com/kcroker/dpsprep/actions/workflows/test.yml) [![AUR Package](https://img.shields.io/aur/version/dpsprep)](https://aur.archlinux.org/packages/dpsprep)\n\nThis tool, initially made specifically for use with Sony's Digital Paper System (DPS), is now a general-purpose DjVu to PDF converter with a focus on small output size and the ability to preserve document outlines (e.g. TOC) and text layers (e.g. OCR).\n\n## Usage\n\nFull example (the name of the PDF is optional and inferred from the input name):\n\n    dpsprep --pool=8 --quality=50 input.djvu output.pdf\n\nIf you have [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF) installed, you can use its PDF optimizer:\n\n    dpsprep -O3 input.djvu\n\nYou can also skip translating the text layer (it is sometimes not translated well) and redo the OCR (rather than launching the `ocrmypdf` CLI, we use the API directly and accept options in JSON format):\n\n    dpsprep --ocr '{\"language\": [\"rus\", \"eng\"]}' input.djvu\n\nConsult the man file ([online](./dpsprep.1.ronn)) for details; there are a lot of options to consider.\n\nSee the next section for different ways to run the program.\n\n## Installation\n\n### Automated\n\nAn easy way to install a `dpsprep` executable for the current user is via [`uv`](https://docs.astral.sh/uv/):\n\n    uv tool install dpsprep --from git+https://github.com/kcroker/dpsprep\n\nFor better compression (see below), the `compress` extra must be specified:\n\n    uv tool install dpsprep --from git+https://github.com/kcroker/dpsprep[compress]\n\nSometimes a particular feature branch need to be tested. For installing a fixed revision (i.e. common/branch/tag), the following should work (if `extra-name` is needed, use `dpsprep@rev[extra-name]`):\n\n    uv tool install dpsprep --from git+https://github.com/kcroker/dpsprep@rev\n\nThe only hard prerequisite is `djvulibre` (e.g. 
`djvulibre` on Arch, `libdjvulibre-dev` on Ubuntu, etc.). We use the Python bindings from the package [`djvulibre-python`](https://github.com/FriedrichFroebel/python-djvulibre) (not to be confused with the unmaintained [`python-djvulibre`](https://github.com/jwilk-archive/python-djvulibre); see [this pull request](https://github.com/kcroker/dpsprep/pull/10)).\n\n> [!TIP]\n> A few people have reported installation problems; see [this possible solution](https://github.com/kcroker/dpsprep/issues/38) and [this sample Dockerfile](https://github.com/kcroker/dpsprep/pull/37).\n\n> [!NOTE]\n> Windows support in `djvulibre-python` requires 64-bit `djvulibre`, but only 32-bit Windows packages are officially distributed. If you manage to make it work, consider opening a pull request.\n\nOptional prerequisites are:\n* `libtiff` for bitonal image compression.\n* `libjpeg` (or `libjpeg-turbo`) for multitonal (RGB or grayscale) compression.\n* `OCRmyPDF` and `jbig2enc` for PDF optimization (see the next section).\n\n`libtiff` depends on `libjpeg`, so installing `libtiff` will likely install both.\n\nFor details on how these dependencies can be installed, see the GitHub Actions [workflow](./.github/workflows/test.yml) and the [dpsprep](https://aur.archlinux.org/packages/dpsprep) package for Arch Linux.\n\n### Manual\n\nSetting up the project is again done via `uv`. Once inside the cloned repository, the environment for the program can be set up by simply running `uv sync --all-extras`. After that, the following should work:\n\n    uv run dpsprep [OPTIONS] SRC [DEST]\n\n> [!NOTE]\n> Previous versions used [`pyenv`](https://github.com/pyenv/pyenv) for managing Python versions and [`poetry`](https://python-poetry.org/) for managing dependencies and building. 
Since then the project migrated to `uv`, which subsumes both and provides other niceties.\n\nYou can also build and install the project, for example via [`pipx`](https://pipx.pypa.io/en/stable/):\n\n    uv build --wheel\n    pipx install --include-deps dist/*.whl\n\n> [!TIP]\n> The build can fail if the [`uv_build`](https://docs.astral.sh/uv/concepts/build-backend/) Python package is not installed. Make sure not only the `uv` binary, but also the corresponding Python package is available. For example, in the Arch repositories, these are distinct packages, `uv` and `python-uv`. Alternatively, try to install the [`uv-build`](https://pypi.org/project/uv-build/) PyPI package (`python-uv-build` in Arch) explicitly in this case.\n\nIf you want `dpsprep` to be able to use `ocrmypdf` from `pipx`'s isolated environment, you must [inject](https://fig.io/manual/pipx/inject) it explicitly via\n\n    pipx inject dpsprep ocrmypdf\n\n> [!TIP]\n> If you are packaging this for some other package manager, consider using PEP-517 tools as shown in [this PKGBUILD file](https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=dpsprep).\n\n> [!NOTE]\n> Previous versions of the tool itself used to depend on third-party binaries, but this is no longer the case. The test fixtures are checked in, however regenerating them (see [`./fixtures/Makefile`](./fixtures/Makefile)) requires `pdflatex` (texlive, among others), `gs` (Ghostscript), `oxipng` (oxipng), `pdftotext` (Poppler), `djvudigital` (GSDjVU) and `djvused` (DjVuLibre). Similarly, the man file is checked in, but building it from markdown depends on `ronn`.\n\n## Details\n\n### Compression\n\nWe perform compression in two stages:\n\n* The first one is the default compression provided by [Pillow](https://github.com/python-pillow/Pillow). 
For bitonal images, [the PDF generation code says](https://github.com/python-pillow/Pillow/blob/a088d54509e42e4eeed37d618b42d775c0d16ef5/src/PIL/PdfImagePlugin.py#L138C16-L138C16) that, if `libtiff` is available, `group4` compression is used.\n\n* If [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF) is installed, its PDF optimization can be used via the flags `-O1` to `-O3` (this involves no OCR). This allows us to use advanced techniques, including JBIG2 compression via `jbig2enc`.\n\nIf manually running OCRmyPDF, note that the optimization command suggested [in the documentation](https://ocrmypdf.readthedocs.io/en/latest/cookbook.html#optimize-images-without-performing-ocr) (setting `--tesseract-timeout` to `0`) may ruin existing text layers. To perform only PDF optimization you can use the following undocumented tool instead:\n\n    python -m ocrmypdf.optimize <input_file> <level> <output_file>\n\n### Text layer\n\nThe visible contents of a DjVu file are well-compressed images (see [here](http://yann.lecun.com/ex/djvu/index.html)). But a DjVu file also contains a \"text layer\" stored as metadata attached to invisible rectangular blocks. PDF does not support such constructs, so we do a little hack.\n\nWe render each page as an image and put it as a background in the PDF. We then use a font, [`invisible1.ttf`](./dpsprep/invisible.ttf), taken from [here](https://www.angelfire.com/pr/pgpf/if.html), to \"draw\" text. Every time we draw a block of text, we rescale the font so that the width of the text matches that of the corresponding DjVu block.\n\n> [!NOTE]\n> The font is small (12kb) and contains (invisible) Latin, Cyrillic and Greek characters. 
Even Chinese characters seem to be working correctly, at least with [Evince](https://gitlab.gnome.org/GNOME/evince).\n\nThe following screenshot displays the result of converting a DjVu document:\n\n![Image](./screenshots/lipsum_with_image.png)\n\nThe following screenshot displays the same document without the background image and with the invisible font replaced by Times New Roman:\n\n![Image](./screenshots/lipsum_with_text.png)\n\nSince the image is actually drawn on top of the text, there is no harm in using an actual visible font, possibly rendered using a transparent \"color\". Still, when searching and selecting text, the scrambled letters from the second image would be highlighted. With the invisible font, there are no visible glyphs to highlight, so an illusory \"block\" containing the text is highlighted instead.\n\nSee [`./dpsprep/text.py`](./dpsprep/text.py) for the implementation.\n\n## Kevin's notes regarding the first version\n\nI wrote this with the specific intent of converting ebooks in the DJVU format into PDFs for use with the fantastic (but pricey) \nSony Digital Paper System.\n\nDjVu technology is strikingly superior for many ebook applications, yet the Sony Digital Paper System (rev 1.3 US)\nonly supports PDF technology: this is because its primary design purpose is not as an ereader.  The device, however, \nis quite nearly the **perfect** ereader.\n\nUnfortunately, all presently available DjVu to PDF tools seem to just dump flattened enormous TIFF images.  This is ridiculous.\nSince PDF really can't do that much better on the way it stores image data, a 5-6x bloat cannot be avoided.  
However, none of the \nexisting tools preserve:\n\n* The OCR'd text content\n* Table of Contents or Internal links\n\nThis is kind of silly, but until Sony's Digital Paper, there was no need to move functional DjVu files to PDFs.\nIn order to make workable PDFs from DjVu files for use on the Digital Paper System, I have implemented in one location the following\nprocedures detailed here:\n\nBy automating the procedure of user zetah for extracting the text and getting it in the correct locations:\nhttp://askubuntu.com/questions/46233/converting-djvu-to-pdf (OCR text transfer)\n\nBy implementing the procedure of user pyrocrasty for extracting the outline, and putting it into the PDF generated above:\nhttp://superuser.com/questions/801893/converting-djvu-to-pdf-and-preserving-table-of-contents-how-is-it-possible (bookmark transfer)\n"
  },
  {
    "path": "dpsprep/__init__.py",
    "content": "from .dpsprep import dpsprep\n\n\n__all__ = ['dpsprep']\n"
  },
  {
    "path": "dpsprep/conftest.py",
    "content": "import loguru\nimport pytest\n\n\n@pytest.fixture(autouse=True)\ndef disable_loguru() -> None:\n    loguru.logger.remove()\n"
  },
  {
    "path": "dpsprep/dpsprep.py",
    "content": "import json\nimport multiprocessing.pool\nimport shutil\nfrom time import time\n\nimport click\nimport djvu.decode\nimport loguru\nimport pdfrw\n\nfrom .images import ImageMode, failsafe_save_djvu_page, process_djvu_page\nfrom .logging import configure_loguru, human_readable_size\nfrom .ocrmypdf import optimize_pdf, perform_ocr\nfrom .outline import OutlineTransformVisitor\nfrom .pdf import combine_pdfs_on_fs_with_text, combine_pdfs_on_fs_without_text, is_valid_pdf\nfrom .text import djvu_pages_to_text_fpdf\nfrom .workdir import WorkingDirectory\n\n\ndef process_page_bg(workdir: WorkingDirectory, mode: ImageMode, quality: int | None, dpi: int | None, i: int, *, verbose: bool) -> None:  # noqa: PLR0913\n    configure_loguru(verbose=verbose)\n    page_number = i + 1\n\n    if workdir.get_page_pdf_path(i).exists():\n        if is_valid_pdf(workdir.get_page_pdf_path(i)):\n            loguru.logger.debug(f'Image data from page {page_number} already processed.')\n            return\n        loguru.logger.debug(f'Invalid page generated for {page_number}, regenerating.')\n    else:\n        loguru.logger.debug(f'Processing image data from page {page_number}.')\n\n    start_time = time()\n    document = djvu.decode.Context().new_document(\n        djvu.decode.FileURI(workdir.src),\n    )\n    document.decoding_job.wait()\n\n    page_bg = process_djvu_page(document.pages[i], mode, i)\n\n    failsafe_save_djvu_page(\n        page_bg,\n        workdir.get_page_pdf_path(i),\n        quality,\n        dpi,\n        page_number,\n    )\n\n    pdf_size = workdir.get_page_pdf_path(i).stat().st_size\n    loguru.logger.debug(f'Image data with size {human_readable_size(pdf_size)} from page {page_number} processed in {time() - start_time:.2f}s and written to working directory.')\n\n\ndef process_text(workdir: WorkingDirectory, dpi: int | None, *, verbose: bool) -> None:\n    configure_loguru(verbose=verbose)\n\n    if workdir.text_layer_pdf_path.exists():\n        
loguru.logger.info('Text data already processed.')\n        return\n\n    loguru.logger.debug('Processing text data.')\n\n    start_time = time()\n    document = djvu.decode.Context().new_document(\n        djvu.decode.FileURI(workdir.src),\n    )\n    document.decoding_job.wait()\n\n    fpdf = djvu_pages_to_text_fpdf(document.pages, dpi)\n    fpdf.output(str(workdir.text_layer_pdf_path))\n\n    pdf_size = workdir.text_layer_pdf_path.stat().st_size\n    loguru.logger.info(f'Text data with size {human_readable_size(pdf_size) } processed in {time() - start_time:.2f}s and written to working directory')\n\n\n@click.option('-d', '--delete-working', is_flag=True, help='Delete any existing files in the working directory prior to writing to it.')\n@click.option('-w', '--preserve-working', is_flag=True, help='Preserve the working directory after script termination.')\n@click.option('-o', '--overwrite', is_flag=True, help='Overwrite destination file.')\n@click.option('-v', '--verbose', is_flag=True, help='Display debug messages.')\n@click.option('-t', '--no-text', is_flag=True, help='Disable the generation of text layers. Implied by --ocr.')\n@click.option('-O1', 'optlevel', flag_value=1, help='Use the lossless PDF image optimization from OCRmyPDF (without performing OCR).')\n@click.option('-O2', 'optlevel', flag_value=2, help='Use the PDF image optimization from OCRmyPDF.')\n@click.option('-O3', 'optlevel', flag_value=3, help='Use the aggressive lossy PDF image optimization from OCRmyPDF.')\n@click.option('-p', '--pool-size', type=click.IntRange(min=0), default=4, help='Size of MultiProcessing pool for handling page-by-page operations.')\n@click.option('-q', '--quality', type=click.IntRange(min=0, max=100), help=\"Quality of images in output. Used only for JPEG compression, i.e. RGB and Grayscale images. 
Passed directly to Pillow and to OCRmyPDF's optimizer.\")\n@click.option('-m', '--mode', type=click.Choice(['infer', 'bitonal', 'grayscale', 'rgb']), default='infer', help='Override the image modes encoded in the DjVu file for individual pages. It sometimes makes sense to force bitonal images since they compress well.')\n@click.option('--dpi', type=click.IntRange(min=1), help='Override DPI values encoded in the DjVu file for individual pages.')\n@click.option('--ocr', type=str, is_flag=False, flag_value='{}', help='Perform OCR via OCRmyPDF rather than trying to convert the text layer. If this parameter has a value, it should be a JSON dictionary of options to be passed to OCRmyPDF.')\n@click.version_option()\n@click.argument('dest', type=click.Path(exists=False, resolve_path=True), required=False)\n@click.argument('src', type=click.Path(exists=True, resolve_path=True), required=True)\n@click.command()\ndef dpsprep(  # noqa: C901, PLR0912, PLR0913, PLR0915\n    src: str,\n    dest: str | None,\n    quality: int | None,\n    dpi: int | None,\n    pool_size: int,\n    mode: ImageMode,\n    optlevel: int | None,\n    ocr: str | None,\n    *,\n    verbose: bool,\n    overwrite: bool,\n    delete_working: bool,\n    preserve_working: bool,\n    no_text: bool,\n) -> None:\n    configure_loguru(verbose=verbose)\n    workdir = WorkingDirectory(src, dest)\n\n    if ocr is None:\n        ocr_options = None\n    else:\n        try:\n            ocr_options = json.loads(ocr)\n        except ValueError as err:\n            msg = f'The OCR options {ocr!r} are not valid JSON.'\n            raise SystemExit(msg) from err\n        else:\n            if not isinstance(ocr_options, dict):\n                msg = f'The OCR options {ocr!r} are not a JSON dictionary.'\n                raise SystemExit(msg)\n\n        no_text = True\n\n    if not overwrite and workdir.dest.exists():\n        msg = f'File {workdir.dest} already exists.'\n        raise SystemExit(msg)\n\n    start_time = 
time()\n\n    if workdir.workdir.exists():\n        if delete_working:\n            loguru.logger.debug(f'Removing existing working directory {workdir.workdir}.')\n            workdir.destroy()\n            loguru.logger.info(f'Removed existing working directory {workdir.workdir}.')\n        else:\n            loguru.logger.info(f'Reusing working directory {workdir.workdir}.')\n    else:\n        loguru.logger.info(f'Working directory {workdir.workdir} has been created.')\n\n    workdir.create_if_necessary()\n\n    document = djvu.decode.Context().new_document(\n        djvu.decode.FileURI(workdir.src),\n    )\n    document.decoding_job.wait()\n\n    djvu_size = workdir.src.stat().st_size\n    loguru.logger.info(f'Processing {workdir.src} with {len(document.pages)} pages and size {human_readable_size(djvu_size)} using {pool_size} workers.')\n\n    pool = multiprocessing.Pool(processes=pool_size)\n    tasks = list[multiprocessing.pool.AsyncResult]()\n\n    if not no_text:\n        tasks.append(pool.apply_async(func=process_text, args=[workdir, dpi], kwds={'verbose': verbose}))\n\n    for i in range(len(document.pages)):\n        # Cannot pass the page object itself because it does not support serialization for IPC\n        tasks.append(pool.apply_async(func=process_page_bg, args=[workdir, mode, quality, dpi, i], kwds={'verbose': verbose}))\n\n    pool.close()\n    pool_is_working = True\n\n    while pool_is_working:\n        pool_is_working = False\n\n        for task in tasks:\n            try:\n                task.get(timeout=25)\n            except multiprocessing.TimeoutError:\n                pool_is_working = True\n\n    pool.join()\n    loguru.logger.info('Processed all pages.')\n\n    outline = pdfrw.IndirectPdfDict()\n\n    if len(document.outline.sexpr) > 0:\n        loguru.logger.info('Processing metadata.')\n        outline = OutlineTransformVisitor().visit(document.outline.sexpr)\n        loguru.logger.info('Metadata processed.')\n    else:\n        
loguru.logger.info('No metadata to process.')\n\n    loguru.logger.info('Combining everything.')\n\n    if no_text:\n        combine_pdfs_on_fs_without_text(workdir, outline, len(document.pages))\n\n        ocr_success = False\n\n        # A bare --ocr yields an empty dictionary, which must still trigger OCR with default options\n        if ocr_options is not None:\n            loguru.logger.info('Performing OCR.')\n            ocr_success = perform_ocr(workdir, ocr_options)\n        else:\n            loguru.logger.info('Skipping the text layer.')\n\n        if not ocr_success:\n            shutil.copy(workdir.combined_pdf_without_text_path, workdir.combined_pdf_path)\n    else:\n        combine_pdfs_on_fs_with_text(workdir, outline)\n\n    combined_size = workdir.combined_pdf_path.stat().st_size\n    loguru.logger.info(f'Produced a combined output file with size {human_readable_size(combined_size)} in {time() - start_time:.2f}s. This is {round(100 * combined_size / djvu_size, 2)}% of the DjVu source file.')\n\n    opt_success = False\n\n    if optlevel is not None:\n        loguru.logger.info(f'Performing level {optlevel} optimization.')\n        opt_success = optimize_pdf(workdir, optlevel, quality, pool_size)\n\n    if opt_success:\n        opt_size = workdir.optimized_pdf_path.stat().st_size\n\n        loguru.logger.info(f'The optimized file has size {human_readable_size(opt_size)}, which is {round(100 * opt_size / combined_size, 2)}% of the raw combined file and {round(100 * opt_size / djvu_size, 2)}% of the DjVu source file.')\n\n        if opt_size < combined_size:\n            loguru.logger.info('Using the optimized file.')\n            shutil.copy(workdir.optimized_pdf_path, workdir.dest)\n        else:\n            loguru.logger.info('Using the raw combined file.')\n            shutil.copy(workdir.combined_pdf_path, workdir.dest)\n    else:\n        shutil.copy(workdir.combined_pdf_path, workdir.dest)\n\n    if preserve_working:\n        loguru.logger.info(f'Working directory {workdir.workdir} will be preserved.')\n    else:\n        loguru.logger.info(f'Deleting 
the working directory {workdir.workdir}.')\n        workdir.destroy()\n"
  },
  {
    "path": "dpsprep/images.py",
    "content": "import pathlib\nfrom typing import Literal, NamedTuple\n\nimport djvu.decode\nimport loguru\nimport PIL.features\nfrom PIL import Image, ImageOps\n\n\nImageMode = Literal['rgb', 'grayscale', 'bitonal', 'infer']\n\n\ndjvu_pixel_formats = {\n    'rgb': djvu.decode.PixelFormatRgb(byte_order='RGB'),\n    'grayscale': djvu.decode.PixelFormatGrey(),\n    'bitonal': djvu.decode.PixelFormatPackedBits('>'),\n}\n\n\nfor pixel_format in djvu_pixel_formats.values():\n    pixel_format.rows_top_to_bottom = 1\n    pixel_format.y_top_to_bottom = 0\n\n\npil_modes = {\n    'rgb': 'RGB',\n    'grayscale': 'L',\n    'bitonal': '1',\n}\n\n\nclass ProcessedPageBackground(NamedTuple):\n    pil_image: Image.Image\n    resolution: int\n\n\ndef process_djvu_page(page: djvu.decode.Page, mode: ImageMode, i: int) -> ProcessedPageBackground:\n    page_job = page.decode(wait=True)\n    width, height = page_job.size\n    buffer = bytearray(3 * width * height)  # RGB at most\n\n    rect = (0, 0, width, height)\n\n    if mode == 'infer':\n        mode = 'bitonal' if page_job.type == djvu.decode.PAGE_TYPE_BITONAL else 'rgb'\n\n    if mode == 'bitonal':\n        if not PIL.features.check_codec('libtiff'):\n            loguru.logger.warning('Bitonal image compression may suffer because Pillow has been built without libtiff support.')\n    elif not PIL.features.check_codec('jpg'):\n        loguru.logger.warning('Multitonal image compression may suffer because Pillow has been built without libjpeg support.')\n\n    try:\n        page_job.render(\n            # RENDER_COLOR is simply a default value and doesn't actually imply colors\n            mode=djvu.decode.RENDER_COLOR,\n            page_rect=rect,\n            render_rect=rect,\n            pixel_format=djvu_pixel_formats[mode],\n            buffer=buffer,\n        )\n    except djvu.decode.NotAvailable:\n        loguru.logger.warning(f'libdjvu claims that data for page {i + 1} is not available. 
Producing a blank page.')\n        image = Image.new(\n            pil_modes['bitonal'],\n            page_job.size,\n            1,\n        )\n\n        return ProcessedPageBackground(image, page_job.dpi)\n\n    image = Image.frombuffer(\n        pil_modes[mode],\n        page_job.size,\n        buffer,\n        'raw',\n    )\n\n    return ProcessedPageBackground(\n        # I have experimentally determined that we need to invert the black-and-white images. -- Ianis, 2023-05-13\n        # See also https://github.com/kcroker/dpsprep/issues/16\n        ImageOps.invert(image) if mode == 'bitonal' else image,\n        page_job.dpi,\n    )\n\n\ndef failsafe_save_djvu_page(page_bg: ProcessedPageBackground, target: pathlib.Path, quality: int | None, dpi: int | None, page_number: int) -> None:\n    if quality is not None:\n        if page_bg.pil_image.mode in pil_modes['bitonal'] and PIL.features.check_codec('libtiff'):\n            loguru.logger.warning('Pillow uses TIFF for encoding bitonal PDF images. The encoder does not support a \"quality\" setting. If the conversion fails, please try again without specifying quality.')\n\n        try:\n            page_bg.pil_image.save(\n                target,\n                format='PDF',\n                quality=quality,\n                resolution=dpi or page_bg.resolution,\n            )\n        except ValueError:\n            loguru.logger.warning(f'Failed to encode page {page_number}. Trying again without setting quality.')\n        else:\n            return\n\n    page_bg.pil_image.save(\n        target,\n        format='PDF',\n        resolution=dpi or page_bg.resolution,\n    )\n"
  },
  {
    "path": "dpsprep/logging.py",
    "content": "import os\nimport sys\nfrom types import TracebackType\n\nimport loguru\n\n\ncached_stdout = sys.stdout\n\n\ndef configure_loguru(*, verbose: bool) -> None:\n    loguru.logger.remove()\n    loguru.logger.add(\n        cached_stdout,\n        format='<level>{level}</level> <green>{time:HH:mm:ss}</green> <level>{message}</level>',\n        level='DEBUG' if verbose else 'INFO',\n    )\n\n\ndef human_readable_size(size: int) -> str:\n    # ruff: disable[PLR2004]\n    if size < 1024:\n        return f'{size} bytes'\n\n    if size < 1024 ** 2:\n        return f'{size / 1024:.02f} KiB'\n\n    return f'{size / 1024 ** 2:.02f} MiB'\n    # ruff: enable[PLR2004]\n\n\n# img2pdf abuses debug logging by using print\n# This is a way to temporarily silence it\nclass SilencePrint:\n    def __enter__(self) -> None:\n        sys.stdout = open(os.devnull, 'w', encoding='utf-8')\n\n    def __exit__(\n        self,\n        exc_type: type[BaseException] | None,\n        exc_value: BaseException | None,\n        traceback: TracebackType | None,\n     ) -> None:\n        sys.stdout.close()\n        sys.stdout = cached_stdout\n"
  },
  {
    "path": "dpsprep/ocrmypdf.py",
    "content": "# We use OCRmyPDF in a non-canonical way: only optimize the file without performing any OCR.\n# The optimization procedure provides good results and preserves the text layer and outline.\n# The code here is based on\n#   https://github.com/ocrmypdf/OCRmyPDF/blob/fb006ef39f7f8842dec1976bebe4bcd5ca2e8df8/src/ocrmypdf/optimize.py#L724\n# with some simplifications for OCRmyPDF 17\n\n# ruff: noqa: PLC0415\n\nimport shutil\nfrom typing import Any\n\nimport loguru\n\nfrom .workdir import WorkingDirectory\n\n\ndef optimize_pdf(workdir: WorkingDirectory, optlevel: int, quality: int | None, pool_size: int) -> bool:\n    try:\n        # ObjectStreamMode is actually from pikepdf, but I did not want to include that as a dependency\n        from ocrmypdf._options import OcrOptions\n        from ocrmypdf.optimize import ObjectStreamMode, PdfContext, optimize\n        from ocrmypdf.pdfinfo import PdfInfo\n    except ImportError:\n        loguru.logger.warning('Cannot detect OCRmyPDF. No optimizations will be performed on the output file.')\n        return False\n\n    options = OcrOptions(\n        input_file=workdir.combined_pdf_without_text_path,\n        output_file=workdir.combined_pdf_path,\n        # Jobs correspond to CPU cores rather than threads, but it seems better to use the available pool size parameter\n        jobs=pool_size,\n        optimize=optlevel,\n        # When 0, these should be adjusted inside OCRmyPDF's \"optimize\" function\n        jpg_quality=quality or 0,\n        png_quality=quality or 0,\n    )\n\n    info = PdfInfo(workdir.combined_pdf_path)\n    context = PdfContext(options, workdir.ocrmypdf_tmp_path, workdir.combined_pdf_path, info, None)\n\n    optimize(\n        workdir.combined_pdf_path,\n        workdir.optimized_pdf_path,\n        context,\n        {\n            'compress_streams': True,\n            'preserve_pdfa': True,\n            'object_stream_mode': ObjectStreamMode.generate,\n        },\n    )\n\n    return 
True\n\n\ndef perform_ocr(workdir: WorkingDirectory, options: dict[str, Any]) -> bool:\n    try:\n        from ocrmypdf import api\n    except ImportError:\n        loguru.logger.warning('Cannot detect OCRmyPDF. No OCR will be performed on the output file.')\n        return False\n\n    try:\n        api.ocr(\n            input_file_or_options=workdir.combined_pdf_without_text_path,\n            output_file=workdir.combined_pdf_path,\n            **options,\n        )\n    except Exception as err:\n        loguru.logger.warning(f'OCRmyPDF failed: {err}')\n        shutil.copy(workdir.combined_pdf_without_text_path, workdir.combined_pdf_path)\n        return False\n    else:\n        return True\n"
  },
  {
    "path": "dpsprep/outline.py",
    "content": "import djvu.sexpr\nimport loguru\nfrom pdfrw import IndirectPdfDict, PdfDict, PdfName\n\nfrom .sexpr import SExpressionVisitor\n\n\n# Based on\n# https://github.com/pmaupin/pdfrw/issues/52#issuecomment-271190546\nclass OutlineTransformVisitor(SExpressionVisitor[PdfDict]):\n    def visit_plain_list(self, node: djvu.sexpr.StringExpression, parent: IndirectPdfDict) -> PdfDict:\n        title, page, *rest = node\n\n        # I have experimentally determined that we need to translate page indices. -- Ianis, 2023-05-03\n        try:\n            page_number = int(page.value[1:]) - 1\n        except ValueError:\n            # As far as I understand, python-djvulibre doesn't support Djvu's page titles. -- Ianis, 2023-12-09\n            loguru.logger.warning(f'Could not determine page number from the page title {page.value}.')\n            return None\n\n        try:\n            title_text = title.value\n        except UnicodeDecodeError:\n            loguru.logger.warning(f'Could not decode page title {title!r}; leaving it in escaped form.')\n            title_text = str(title)\n\n        bookmark = IndirectPdfDict(\n            Parent = parent,\n            Title = title_text,\n            A = PdfDict(\n                D = [page_number, PdfName.Fit],\n                S = PdfName.GoTo,\n            ),\n        )\n\n        if parent.Count is None:\n            parent.Count = 0\n            parent.First = bookmark\n        else:\n            bookmark.Prev = parent.Last\n            bookmark.Prev.Next = bookmark\n\n        parent.Count += 1\n        parent.Last = bookmark\n\n        for child in rest:\n            self.visit(child, parent=bookmark)\n\n        return bookmark\n\n    def visit_list_bookmarks(self, node: djvu.sexpr.ListExpression) -> PdfDict:\n        _, *rest = node\n\n        outline = IndirectPdfDict()\n\n        for child in rest:\n            self.visit(child, parent=outline)\n\n        return outline\n"
  },
  {
    "path": "dpsprep/pdf.py",
    "content": "import pathlib\n\nimport pdfrw\n\nfrom .workdir import WorkingDirectory\n\n\ndef is_valid_pdf(path: pathlib.Path) -> bool:\n    try:\n        pdfrw.PdfReader(path)\n    except pdfrw.errors.PdfParseError:\n        return False\n    else:\n        return True\n\n\ndef combine_pdfs_on_fs_with_text(workdir: WorkingDirectory, outline: pdfrw.IndirectPdfDict) -> None:\n    text_pdf = pdfrw.PdfReader(workdir.text_layer_pdf_path)\n    writer = pdfrw.PdfWriter()\n\n    for i, text_page in enumerate(text_pdf.pages):\n        # We take the one-page text PDF and add the image layer on top\n        # Even if the font was not invisible, it would be hidden visually (but not during search or text highlight)\n        image_pdf = pdfrw.PdfReader(workdir.get_page_pdf_path(i))\n        image_page = image_pdf.pages[0]\n        merger = pdfrw.PageMerge(text_page)\n        merger.add(image_page).render()\n        writer.addpage(text_page)\n\n    writer.trailer.Root.Outlines = outline\n    writer.write(workdir.combined_pdf_path)\n\n\ndef combine_pdfs_on_fs_without_text(workdir: WorkingDirectory, outline: pdfrw.IndirectPdfDict, max_page: int) -> None:\n    writer = pdfrw.PdfWriter()\n\n    for i in range(max_page):\n        image_pdf = pdfrw.PdfReader(workdir.get_page_pdf_path(i))\n        image_page = image_pdf.pages[0]\n        writer.addpage(image_page)\n\n    writer.trailer.Root.Outlines = outline\n    writer.write(workdir.combined_pdf_without_text_path)\n"
  },
  {
    "path": "dpsprep/py.typed",
    "content": ""
  },
  {
    "path": "dpsprep/sexpr.py",
    "content": "from typing import Generic, TypeVar\n\nimport djvu.sexpr\nimport loguru\n\n\nT = TypeVar('T')\nR = TypeVar('R')\n\n\nclass SExpressionVisitor(Generic[R]):\n    def visit_list(self, node: djvu.sexpr.ListExpression, **kwargs: T) -> R | None:\n        if len(node) > 0 and isinstance(node[0], djvu.sexpr.SymbolExpression):\n            method = getattr(self, f'visit_list_{node[0]}', None)\n            if method is None:\n                loguru.logger.warning(f\"Don't know how to visit ListExpression of type {str(node[0])!r}.\")\n                return None\n            return method(node, **kwargs)\n        if hasattr(self, 'visit_plain_list'):\n            return self.visit_plain_list(node, **kwargs)\n        loguru.logger.warning(\"Don't know how to visit a plain ListExpression.\")\n        return None\n\n    def visit_other(self, node: djvu.sexpr.Expression, **kwargs: T) -> R | None:  # noqa: ARG002\n        loguru.logger.warning(f\"Don't know how to visit S-expression type {type(node)!r}.\")\n        return None\n\n    def visit(self, node: djvu.sexpr.Expression, **kwargs: T) -> R | None:\n        if isinstance(node, djvu.sexpr.IntExpression):\n            if hasattr(self, 'visit_int'):\n                return self.visit_int(node, **kwargs)\n            loguru.logger.warning(\"Don't know how to visit IntExpression.\")\n            return None\n        if isinstance(node, djvu.sexpr.StringExpression):\n            if hasattr(self, 'visit_string'):\n                return self.visit_string(node, **kwargs)\n            loguru.logger.warning(\"Don't know how to visit StringExpression.\")\n            return None\n        if isinstance(node, djvu.sexpr.ListExpression):\n            return self.visit_list(node, **kwargs)\n        return self.visit_other(node, **kwargs)\n"
  },
  {
    "path": "dpsprep/test_images.py",
    "content": "import djvu.decode\nfrom PIL import Image, ImageChops, ImageStat\n\nfrom .images import process_djvu_page\n\n\n# A simple score function for Pillow images.\n# We previously used the pytest-image-diff module, which used the diffimg module.\n# It turned out that diffimg uses a similar approach, so we dropped the dependency in favor of a few-liner.\ndef calculate_image_diff_score(a: Image.Image, b: Image.Image) -> float:\n    assert a.size == b.size, 'We only support diffing images with identical sizes'\n    assert a.mode == b.mode, 'We only support diffing images with the same mode'\n\n    diff = ImageChops.difference(a, b)\n    stat = ImageStat.Stat(diff)\n    return max(stat.rms) / 256  # The ImageStat module uses 256 bins\n\n\ndef test_process_djvu_page_bitonal() -> None:\n    document = djvu.decode.Context().new_document(\n        djvu.decode.FileURI('fixtures/lipsum_words.djvu'),\n    )\n    document.decoding_job.wait()\n\n    fixture = Image.open('fixtures/lipsum_01.png')\n    result = process_djvu_page(document.pages[0], mode='infer', i=0)\n\n    page_decode_job = document.pages[0].decode()\n    page_decode_job.wait()\n    assert result.resolution == page_decode_job.dpi\n\n    assert calculate_image_diff_score(fixture, result.pil_image) < 0.05\n"
  },
  {
    "path": "dpsprep/test_outline.py",
    "content": "from djvu import sexpr\nfrom pdfrw import IndirectPdfDict\n\nfrom .outline import OutlineTransformVisitor\n\n\ndef test_basic_outline() -> None:\n    src = sexpr.ListExpression([\n        sexpr.SymbolExpression(sexpr.Symbol('bookmarks')),\n        sexpr.ListExpression([\n            sexpr.StringExpression(b'Chapter 2'),\n            sexpr.StringExpression(b'#100'),\n        ]),\n    ])\n\n    visitor = OutlineTransformVisitor()\n    bookmarks = visitor.visit(src)\n    assert bookmarks is not None\n    assert bookmarks.Count == 1\n    assert bookmarks.First.Title == 'Chapter 2'\n    assert bookmarks.First.A.D[0] == 99  # The page number\n\n\ndef test_nested_outline() -> None:\n    src = sexpr.ListExpression([\n        sexpr.SymbolExpression(sexpr.Symbol('bookmarks')),\n        sexpr.ListExpression([\n            sexpr.StringExpression(b'Chapter 2'),\n            sexpr.StringExpression(b'#100'),\n            sexpr.ListExpression([\n                sexpr.StringExpression(b'Chapter 2.1'),\n                sexpr.StringExpression(b'#200'),\n            ]),\n        ]),\n    ])\n\n    visitor = OutlineTransformVisitor()\n    bookmarks = visitor.visit(src)\n    assert bookmarks is not None\n    assert bookmarks.Count == 1\n    assert bookmarks.First.Count == 1\n    assert bookmarks.First.A.D[0] == 99  # The page number of chapter 2\n    assert bookmarks.First.First.A.D[0] == 199  # The page number of chapter 2.1\n\n\n# Sometimes the page numbers are instead page titles, which our libdjvu bindings do not support\n# We ignore them since there is not much we can do in this case\n# See https://github.com/kcroker/dpsprep/issues/23\ndef test_outline_with_page_titles() -> None:\n    src = sexpr.ListExpression([\n        sexpr.SymbolExpression(sexpr.Symbol('bookmarks')),\n        sexpr.ListExpression([\n            sexpr.StringExpression(b'Preface'),\n            sexpr.StringExpression(b'#f007.djvu'),\n        ]),\n        sexpr.ListExpression([\n            
sexpr.StringExpression(b'Contents'),\n            sexpr.StringExpression(b'#f011.djvu'),\n        ]),\n        sexpr.ListExpression([\n            sexpr.StringExpression(b'0 Prologue'),\n            sexpr.StringExpression(b'#p001.djvu'),\n        ]),\n    ])\n\n    visitor = OutlineTransformVisitor()\n    bookmarks = visitor.visit(src)\n    empty_pdf_dict = IndirectPdfDict()\n    assert bookmarks == empty_pdf_dict\n\n\ndef test_outline_with_invalid_unicode() -> None:\n    src = sexpr.ListExpression([\n        sexpr.SymbolExpression(sexpr.Symbol('bookmarks')),\n        sexpr.ListExpression([\n            sexpr.StringExpression(b'\\2470'),\n            sexpr.StringExpression(b'#1'),\n        ]),\n    ])\n\n    visitor = OutlineTransformVisitor()\n    bookmarks = visitor.visit(src)\n    assert bookmarks is not None\n    assert bookmarks.Count == 1\n    assert bookmarks.First.Title == '\"\\\\2470\"'\n"
  },
  {
    "path": "dpsprep/test_text.py",
    "content": "import pathlib\nimport string\n\nimport djvu.decode\nimport pytest\n\nfrom .text import TextExtractVisitor\n\n\ndef remove_whitespace(src: str) -> str:\n    return src.translate({ord(c): None for c in string.whitespace})\n\n\ndef test_extract_djvu_page_text_words() -> None:\n    document = djvu.decode.Context().new_document(\n        djvu.decode.FileURI('fixtures/lipsum_words.djvu'),\n    )\n    document.decoding_job.wait()\n\n    djvu_page = document.pages[0]\n    djvu_page.get_info()\n    djvu_text = TextExtractVisitor().visit(djvu_page.text.sexpr)\n\n    assert djvu_text is not None\n\n    source_pdf_text = pathlib.Path('fixtures/lipsum_01.txt').read_text(encoding='utf-8')\n\n    assert remove_whitespace(djvu_text) == remove_whitespace(source_pdf_text)\n\n\ndef test_extract_djvu_page_text_lines() -> None:\n    document = djvu.decode.Context().new_document(\n        djvu.decode.FileURI('fixtures/lipsum_lines.djvu'),\n    )\n    document.decoding_job.wait()\n\n    djvu_page = document.pages[0]\n    djvu_page.get_info()\n    djvu_text = TextExtractVisitor().visit(djvu_page.text.sexpr)\n\n    assert djvu_text is not None\n\n    source_pdf_text = pathlib.Path('fixtures/lipsum_01.txt').read_text(encoding='utf-8')\n\n    assert remove_whitespace(djvu_text) == remove_whitespace(source_pdf_text)\n\n\ndef test_invalid_utf8() -> None:\n    document = djvu.decode.Context().new_document(\n        djvu.decode.FileURI('fixtures/lipsum_words_invalid.djvu'),\n    )\n    document.decoding_job.wait()\n\n    djvu_page = document.pages[0]\n    djvu_page.get_info()\n    first_word_sexpr = djvu_page.text.sexpr[5][5]\n\n    # djvulibre cannot decode the first word\n    with pytest.raises(UnicodeDecodeError):\n        first_word_sexpr.value  # noqa: B018\n\n    first_word = TextExtractVisitor().visit(first_word_sexpr)\n    assert first_word == ''\n"
  },
  {
    "path": "dpsprep/text.py",
    "content": "# ruff: noqa: RUF059\n\nimport unicodedata\nfrom collections.abc import Iterable, Sequence\nfrom pathlib import Path\n\nimport djvu.sexpr\nimport loguru\nfrom fpdf import FPDF\n\nfrom .sexpr import SExpressionVisitor\n\n\nBASE_FONT_SIZE = 10\nTAB_SIZE = 4\n\n\nclass TextExtractVisitor(SExpressionVisitor[str]):\n    def iter_chars(self, string: str) -> Iterable[str]:\n        for char in string:\n            code = unicodedata.category(char)\n\n            # Line Separator (Zl) | Space Separator (Zs)\n            if code in {'Zl', 'Zs'}:\n                yield ' '\n\n            # Paragraph Separator (Zp)\n            elif code == 'Zp':\n                yield '\\n'\n\n            # Control (Cc)\n            elif code == 'Cc':\n                if char == '\\t':\n                    yield ' ' * TAB_SIZE\n                elif char == '\\n':\n                    yield ' '\n\n            # These break FPDF.\n            # A full list of categories can be found in https://www.compart.com/en/unicode/category\n            # Format (Cf) | Private Use (Co) | Surrogate 'Cs':\n            elif code in {'Cf', 'Co', 'Cs'}:\n                pass\n\n            else:\n                yield char\n\n    def visit_string(self, node: djvu.sexpr.StringExpression) -> str:\n        try:\n            string = node.value  # This getter is not static - it does UTF-8 conversion and fails for some DjVu files\n        except ValueError as err:\n            loguru.logger.warning(f'Could not decode {node!r}: {err}')\n            return ''\n        else:\n            return ''.join(self.iter_chars(string))\n\n    def visit_plain_list(self, node: djvu.sexpr.ListExpression) -> str:  # noqa: ARG002\n        return ''\n\n    def visit_list_word(self, node: djvu.sexpr.ListExpression) -> str | None:\n        _, x1, y1, x2, y2, content, *rest = node\n        return self.visit(content)\n\n    visit_list_char = visit_list_word\n\n    def visit_list_line(self, node: 
djvu.sexpr.ListExpression) -> str:\n        _, x1, y1, x2, y2, *rest = node\n        return ' '.join(self.visit(child) or '' for child in rest)\n\n    def visit_list_para(self, node: djvu.sexpr.ListExpression) -> str:\n        _, x1, y1, x2, y2, *rest = node\n        return '\\n'.join(self.visit(child) or '' for child in rest)\n\n    visit_list_column = visit_list_para\n    visit_list_region = visit_list_para\n    visit_list_page = visit_list_para\n\n\nclass TextDrawVisitor(SExpressionVisitor):\n    pdf: FPDF\n    dpi: int\n    extractor: TextExtractVisitor\n\n    def __init__(self, pdf: FPDF, dpi: int) -> None:\n        self.pdf = pdf\n        self.dpi = dpi\n        self.extractor = TextExtractVisitor()\n\n    def draw_text(self, x1: int, x2: int, y1: int, y2: int, text: str) -> None:  # noqa: ARG002\n        page_width, page_height = self.pdf.pages[self.pdf.page].dimensions()\n\n        if page_height is None:\n            loguru.logger.warning(f'Cannot draw {text!r} because page height is not set.')\n            return\n\n        self.pdf.set_font('Invisible', size=BASE_FONT_SIZE)\n\n        # Adjust font size\n        desired_width = (x2 - x1) / self.dpi\n        actual_width = self.pdf.get_string_width(text)\n\n        if actual_width == 0:\n            return\n\n        self.pdf.set_font('Invisible', size=int(BASE_FONT_SIZE * desired_width / actual_width))\n\n        try:\n            self.pdf.text(x=x1 / self.dpi, y=page_height / 72 - y1 / self.dpi, text=text)\n        except TypeError as err:\n            loguru.logger.warning(f'FPDF refuses to draw {text!r}: {err}')\n\n    def iter_loose_string_content(self, expressions: list[djvu.sexpr.Expression]) -> Iterable[str]:\n        for child in expressions:\n            if not isinstance(child, djvu.sexpr.StringExpression):\n                continue\n\n            if (text := self.extractor.visit(child)) is not None:\n                yield text\n\n    def get_loose_string_content(self, expressions: 
list[djvu.sexpr.Expression], delimiter: str) -> str:\n        return delimiter.join(self.iter_loose_string_content(expressions))\n\n    def visit_list_word(self, node: djvu.sexpr.ListExpression) -> None:\n        _, x1, y1, x2, y2, *rest = node\n        text = self.extractor.visit(node)\n\n        if text is not None:\n            self.draw_text(x1.value, x2.value, y1.value, y2.value, text)\n\n    visit_list_char = visit_list_word\n\n    def visit_list_line(self, node: djvu.sexpr.ListExpression) -> None:\n        _, x1, y1, x2, y2, *rest = node\n\n        text = self.get_loose_string_content(rest, ' ')\n\n        if len(text) > 0:\n            self.draw_text(x1.value, x2.value, y1.value, y2.value, text)\n\n        for child in rest:\n            if not isinstance(child, djvu.sexpr.StringExpression):\n                self.visit(child)\n\n    def visit_list_para(self, node: djvu.sexpr.ListExpression) -> None:\n        _, x1, y1, x2, y2, *rest = node\n\n        text = self.get_loose_string_content(rest, '\\n')\n\n        if len(text) > 0:\n            self.draw_text(x1.value, x2.value, y1.value, y2.value, text)\n\n        for child in rest:\n            if not isinstance(child, djvu.sexpr.StringExpression):\n                self.visit(child)\n\n    def visit_list_column(self, node: djvu.sexpr.ListExpression) -> None:\n        _, x1, y1, x2, y2, *rest = node\n\n        for child in rest:\n            self.visit(child)\n\n    visit_list_page = visit_list_column\n    visit_list_region = visit_list_column\n\n\ndef djvu_pages_to_text_fpdf(pages: Sequence[djvu.decode.Page], dpi: int | None) -> FPDF:\n    pdf = FPDF(unit='in')\n    pdf.add_font(\n        family='Invisible',\n        fname=Path(__file__).parent / 'invisible1.ttf',\n        style='',\n    )\n\n    for i, page in enumerate(pages):\n        page_job = page.decode(wait=True)\n        page_dpi = dpi or page_job.dpi\n        pdf.add_page(format=(page_job.width / page_dpi, page_job.height / page_dpi))\n        
loguru.logger.debug(f'Processing text for page {i + 1}.')\n        visitor = TextDrawVisitor(pdf, page_dpi)\n        visitor.visit(page.text.sexpr)\n\n    return pdf\n"
  },
  {
    "path": "dpsprep/workdir.py",
    "content": "import hashlib\nimport os\nimport pathlib\nimport shutil\nimport tempfile\n\nimport loguru\n\n\nHASHING_BUFFER_SIZE = 64 * 1024\n\n\ndef get_file_hash(path: os.PathLike | str) -> str:\n    h = hashlib.blake2b(digest_size=4)\n\n    with open(path, 'rb') as file:\n        data = file.read(HASHING_BUFFER_SIZE)\n\n        while len(data) > 0:\n            h.update(data)\n            data = file.read(HASHING_BUFFER_SIZE)\n\n    return h.hexdigest()\n\n\nclass WorkingDirectory:\n    src: pathlib.Path\n    dest: pathlib.Path\n    workdir: pathlib.Path\n\n    def __init__(self, src: os.PathLike | str, dest: os.PathLike | str | None) -> None:\n        self.src = pathlib.Path(src)\n\n        if dest is None:\n            self.dest = pathlib.Path(pathlib.Path(src).with_suffix('.pdf').name)\n        else:\n            self.dest = pathlib.Path(dest)\n\n        # Working path\n        # If possible, we avoid the ephemeral storage /tmp\n        persistent_tmp = pathlib.Path('/var/tmp')  # noqa: S108\n\n        if persistent_tmp.exists() and (persistent_tmp.stat().st_mode & (os.W_OK | os.X_OK)):\n            loguru.logger.debug('Using non-ephemeral storage \"/var/tmp\".')\n            root = persistent_tmp\n        else:\n            loguru.logger.debug(f'Using default system storage {tempfile.gettempdir()!r}.')\n            root = pathlib.Path(tempfile.gettempdir())\n\n        self.workdir = root / 'dpsprep' / get_file_hash(self.src)\n\n    def create_if_necessary(self) -> None:\n        if not self.workdir.exists():\n            loguru.logger.debug(f'Creating {str(self.workdir)!r}.')\n            self.workdir.mkdir(parents=True)\n\n        if not self.ocrmypdf_tmp_path.exists():\n            loguru.logger.debug(f'Creating {str(self.ocrmypdf_tmp_path)!r}.')\n\n    def get_page_pdf_path(self, i: int) -> pathlib.Path:\n        return self.workdir / f'page_bg_{i + 1}.pdf'\n\n    @property\n    def text_layer_pdf_path(self) -> pathlib.Path:\n        return 
self.workdir / 'text_layer.pdf'\n\n    @property\n    def ocrmypdf_tmp_path(self) -> pathlib.Path:\n        return self.workdir / 'ocrmypdf'\n\n    @property\n    def combined_pdf_without_text_path(self) -> pathlib.Path:\n        return self.workdir / 'combined_without_text.pdf'\n\n    @property\n    def combined_pdf_path(self) -> pathlib.Path:\n        return self.workdir / 'combined.pdf'\n\n    @property\n    def optimized_pdf_path(self) -> pathlib.Path:\n        return self.workdir / 'optimized.pdf'\n\n    def destroy(self) -> None:\n        shutil.rmtree(self.workdir)\n"
  },
  {
    "path": "dpsprep.1",
    "content": ".\\\" generated with Ronn-NG/v0.10.1\n.\\\" http://github.com/apjanke/ronn-ng/tree/0.10.1\n.TH \"DPS\" \"1\" \"March 2026\" \"\"\n.SH \"NAME\"\n\\fBdps\\fR \\- a DjVu to PDF converter\n.SH \"SYNOPSIS\"\n\\fBdpsprep\\fR \\fIoptions\\fR src [dest]\n.SH \"DESCRIPTION\"\nThis tool, initially made specifically for use with Sony's Digital Paper System (DPS), is now a general\\-purpose DjVu to PDF converter with a focus on small output size and the ability to preserve document outlines (e\\.g\\. TOC) and text layers (e\\.g\\. OCR)\\.\n.SH \"OPTIONS\"\n.IP \"\\(bu\" 4\n\\fB\\-q\\fR, \\fB\\-\\-quality\\fR: Quality of images in output\\. Used only for JPEG compression, i\\.e\\. RGB and Grayscale images\\. Passed directly to Pillow and to OCRmyPDF's optimizer\\.\n.IP \"\\(bu\" 4\n\\fB\\-v\\fR, \\fB\\-\\-verbose\\fR: Display debug messages\\.\n.IP \"\\(bu\" 4\n\\fB\\-o\\fR, \\fB\\-\\-overwrite\\fR: Overwrite destination file\\.\n.IP \"\\(bu\" 4\n\\fB\\-w\\fR, \\fB\\-\\-preserve\\-working\\fR: Preserve the working directory after script termination\\.\n.IP \"\\(bu\" 4\n\\fB\\-d\\fR, \\fB\\-\\-delete\\-working\\fR: Delete any existing files in the working directory prior to writing to it\\.\n.IP \"\\(bu\" 4\n\\fB\\-t\\fR, \\fB\\-\\-no\\-text\\fR: Disable the generation of text layers\\. Implied by \\-\\-ocr\\.\n.IP \"\\(bu\" 4\n\\fB\\-p\\fR, \\fB\\-\\-pool\\-size\\fR \\fIint\\fR: Size of MultiProcessing pool for handling page\\-by\\-page operations\\.\n.IP \"\\(bu\" 4\n\\fB\\-m\\fR, \\fB\\-\\-mode\\fR \\fIinfer|bitonal|grayscale|rgb\\fR: Override the image modes encoded in the DjVu file for individual pages\\. It sometimes makes sense to force bitonal images since they compress well\\.\n.IP \"\\(bu\" 4\n\\fB\\-\\-dpi\\fR \\fIint\\fR: Override DPI values encoded in the DjVu file for individual pages\\.\n.IP \"\\(bu\" 4\n\\fB\\-\\-ocr\\fR \\fIJSON\\fR: Perform OCR via OCRmyPDF rather than trying to convert the text layer\\. 
If this parameter has a value, it should be a JSON dictionary of options to be passed to OCRmyPDF\\.\n.IP \"\\(bu\" 4\n\\fB\\-O1\\fR: Use the lossless PDF image optimization from OCRmyPDF (without performing OCR)\\.\n.IP \"\\(bu\" 4\n\\fB\\-O2\\fR: Use the PDF image optimization from OCRmyPDF\\.\n.IP \"\\(bu\" 4\n\\fB\\-O3\\fR: Use the aggressive lossy PDF image optimization from OCRmyPDF\\.\n.IP \"\\(bu\" 4\n\\fB\\-\\-help\\fR: Show help message and exit\\.\n.IP \"\\(bu\" 4\n\\fB\\-\\-version\\fR: Show the version and exit\\.\n.IP \"\" 0\n.SH \"EXAMPLES\"\nProduce \\fBfile\\.pdf\\fR in the current directory:\n.IP \"\" 4\n.nf\ndpsprep /wherever/file\\.djvu\n.fi\n.IP \"\" 0\n.P\nProduce \\fBoutput\\.pdf\\fR with reduced image quality and aggressive PDF image optimizations:\n.IP \"\" 4\n.nf\ndpsprep \\-\\-quality=30 \\-O3 input\\.djvu output\\.pdf\n.fi\n.IP \"\" 0\n.P\nProduce an output file using a large pool of workers:\n.IP \"\" 4\n.nf\ndpsprep \\-\\-pool\\-size=16 input\\.djvu\n.fi\n.IP \"\" 0\n.P\nForce bitonal images:\n.IP \"\" 4\n.nf\ndpsprep \\-\\-mode bitonal input\\.djvu\n.fi\n.IP \"\" 0\n.P\nProduce an output file by disregarding the text layer and running OCRmyPDF instead:\n.IP \"\" 4\n.nf\ndpsprep \\-\\-ocr '{\"language\": [\"rus\", \"eng\"]}' input\\.djvu\n.fi\n.IP \"\" 0\n.P\nOr simply disregard the text layer without OCR:\n.IP \"\" 4\n.nf\ndpsprep \\-\\-no\\-text input\\.djvu\n.fi\n.IP \"\" 0\n.SH \"NOTE REGARDING COMPRESSION\"\nWe perform compression in two stages:\n.IP \"\\(bu\" 4\nThe first one is the default compression provided by Pillow\\. For bitonal images, the PDF generation code says that, if \\fBlibtiff\\fR is available, \\fBgroup4\\fR compression is used\\.\n.IP \"\\(bu\" 4\nIf OCRmyPDF is installed, its PDF optimization can be used via the flags \\fB\\-O1\\fR to \\fB\\-O3\\fR (this involves no OCR)\\. 
This allows us to use advanced techniques, including JBIG2 compression via \\fBjbig2enc\\fR\\.\n.IP \"\" 0\n.P\nIf manually running OCRmyPDF, note that the optimization command suggested in the documentation (setting \\fB\\-\\-tesseract\\-timeout\\fR to \\fB0\\fR) may ruin existing text layers\\. To perform only PDF optimization you can use the following undocumented tool instead:\n.IP \"\" 4\n.nf\npython \\-m ocrmypdf\\.optimize <input_file> <level> <output_file>\n.fi\n.IP \"\" 0\n\n"
  },
  {
    "path": "dpsprep.1.ronn",
    "content": "# dps(1) -- a DjVu to PDF converter\n\n## SYNOPSIS\n\n`dpsprep` [options] src [dest]\n\n## DESCRIPTION\n\nThis tool, initially made specifically for use with Sony's Digital Paper System (DPS), is now a general-purpose DjVu to PDF converter with a focus on small output size and the ability to preserve document outlines (e.g. TOC) and text layers (e.g. OCR).\n\n## OPTIONS\n\n* `-q`, `--quality`:                            Quality of images in output. Used only for JPEG compression, i.e. RGB and Grayscale images. Passed directly to Pillow and to OCRmyPDF's optimizer.\n* `-v`, `--verbose`:                            Display debug messages.\n* `-o`, `--overwrite`:                          Overwrite destination file.\n* `-w`, `--preserve-working`:                   Preserve the working directory after script termination.\n* `-d`, `--delete-working`:                     Delete any existing files in the working directory prior to writing to it.\n* `-t`, `--no-text`:                            Disable the generation of text layers. Implied by --ocr.\n* `-p`, `--pool-size` <int>:                    Size of MultiProcessing pool for handling page-by-page operations.\n* `-m`, `--mode` <infer|bitonal|grayscale|rgb>: Override the image modes encoded in the DjVu file for individual pages. It sometimes makes sense to force bitonal images since they compress well.\n* `--dpi` <int>:                                Override DPI values encoded in the DjVu file for individual pages.\n* `--ocr` <JSON>:                               Perform OCR via OCRmyPDF rather than trying to convert the text layer. 
If this parameter has a value, it should be a JSON dictionary of options to be passed to OCRmyPDF.\n* `-O1`:                                        Use the lossless PDF image optimization from OCRmyPDF (without performing OCR).\n* `-O2`:                                        Use the PDF image optimization from OCRmyPDF.\n* `-O3`:                                        Use the aggressive lossy PDF image optimization from OCRmyPDF.\n* `--help`:                                     Show help message and exit.\n* `--version`:                                  Show the version and exit.\n\n## EXAMPLES\n\nProduce `file.pdf` in the current directory:\n\n    dpsprep /wherever/file.djvu\n\nProduce `output.pdf` with reduced image quality and aggressive PDF image optimizations:\n\n    dpsprep --quality=30 -O3 input.djvu output.pdf\n\nProduce an output file using a large pool of workers:\n\n    dpsprep --pool-size=16 input.djvu\n\nForce bitonal images:\n\n    dpsprep --mode bitonal input.djvu\n\nProduce an output file by disregarding the text layer and running OCRmyPDF instead:\n\n    dpsprep --ocr '{\"language\": [\"rus\", \"eng\"]}' input.djvu\n\nOr simply disregard the text layer without OCR:\n\n    dpsprep --no-text input.djvu\n\n## NOTE REGARDING COMPRESSION\n\nWe perform compression in two stages:\n\n* The first one is the default compression provided by Pillow. For bitonal images, the PDF generation code says that, if `libtiff` is available, `group4` compression is used.\n\n* If OCRmyPDF is installed, its PDF optimization can be used via the flags `-O1` to `-O3` (this involves no OCR). This allows us to use advanced techniques, including JBIG2 compression via `jbig2enc`.\n\nIf manually running OCRmyPDF, note that the optimization command suggested in the documentation (setting `--tesseract-timeout` to `0`) may ruin existing text layers. 
To perform only PDF optimization you can use the following undocumented tool instead:\n\n    python -m ocrmypdf.optimize <input_file> <level> <output_file>\n"
  },
  {
    "path": "fixtures/.gitattributes",
    "content": "lipsum* linguist-generated\n"
  },
  {
    "path": "fixtures/Makefile",
    "content": ".PHONY: all clean\n\nall: lipsum.pdf lipsum_01.txt lipsum_01.png lipsum_lines.djvu lipsum_words.djvu lipsum_words_invalid.djvu\n\nclean:\n\trm --force *.djvu *.pdf *.png *.txt\n\n%.pdf: %.tex\n\tpdflatex $*.tex\n\trm $*.aux $*.log\n\n%_01.txt: %.pdf\n\tpdftotext -l 1 -layout $*.pdf $*_01.txt\n\n%_01.png: %.pdf\n\tgs -sDEVICE=pngmono -r600 -dLastPage=1 -o $*_01.png $*.pdf\n\toxipng $*_01.png\n\n%_words.djvu: %.pdf\n\tdjvudigital --dpi=600 --words $*.pdf\n\tmv $*.djvu $*_words.djvu\n\n%_lines.djvu: %.pdf\n\tdjvudigital --dpi=600 --lines $*.pdf\n\tmv $*.djvu $*_lines.djvu\n\n%_invalid.djvu: %.djvu\n\tcp $*.djvu $*_invalid.djvu\n\tdjvused $*_invalid.djvu -e 'output-all' | \\\n\t\tsed 's/Lorem/\\\\270/g' | \\\n\t\tdjvused $*_invalid.djvu -f /dev/stdin -s\n"
  },
  {
    "path": "fixtures/lipsum.tex",
    "content": "\\documentclass{article}\n\n\\usepackage{lipsum}\n\n\\title{Lorem Ipsum}\n\\author{Cicero}\n\n\\begin{document}\n  \\lipsum\n\\end{document}\n"
  },
  {
    "path": "fixtures/lipsum_01.txt",
    "content": "    Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Ut purus elit,\nvestibulum ut, placerat ac, adipiscing vitae, felis. Curabitur dictum gravida\nmauris. Nam arcu libero, nonummy eget, consectetuer id, vulputate a, magna.\nDonec vehicula augue eu neque. Pellentesque habitant morbi tristique senectus\net netus et malesuada fames ac turpis egestas. Mauris ut leo. Cras viverra\nmetus rhoncus sem. Nulla et lectus vestibulum urna fringilla ultrices. Phasellus\neu tellus sit amet tortor gravida placerat. Integer sapien est, iaculis in, pretium\nquis, viverra ac, nunc. Praesent eget sem vel leo ultrices bibendum. Aenean\nfaucibus. Morbi dolor nulla, malesuada eu, pulvinar at, mollis ac, nulla. Cur-\nabitur auctor semper nulla. Donec varius orci eget risus. Duis nibh mi, congue\neu, accumsan eleifend, sagittis quis, diam. Duis eget orci sit amet orci dignissim\nrutrum.\n    Nam dui ligula, fringilla a, euismod sodales, sollicitudin vel, wisi. Morbi\nauctor lorem non justo. Nam lacus libero, pretium at, lobortis vitae, ultricies et,\ntellus. Donec aliquet, tortor sed accumsan bibendum, erat ligula aliquet magna,\nvitae ornare odio metus a mi. Morbi ac orci et nisl hendrerit mollis. Suspendisse\nut massa. Cras nec ante. Pellentesque a nulla. Cum sociis natoque penatibus et\nmagnis dis parturient montes, nascetur ridiculus mus. Aliquam tincidunt urna.\nNulla ullamcorper vestibulum turpis. Pellentesque cursus luctus mauris.\n    Nulla malesuada porttitor diam. Donec felis erat, congue non, volutpat at,\ntincidunt tristique, libero. Vivamus viverra fermentum felis. Donec nonummy\npellentesque ante. Phasellus adipiscing semper elit. Proin fermentum massa\nac quam. Sed diam turpis, molestie vitae, placerat a, molestie nec, leo. Mae-\ncenas lacinia. Nam ipsum ligula, eleifend at, accumsan nec, suscipit a, ipsum.\nMorbi blandit ligula feugiat magna. Nunc eleifend consequat lorem. Sed lacinia\nnulla vitae enim. Pellentesque tincidunt purus vel magna. 
Integer non enim.\nPraesent euismod nunc eu purus. Donec bibendum quam in tellus. Nullam cur-\nsus pulvinar lectus. Donec et mi. Nam vulputate metus eu enim. Vestibulum\npellentesque felis eu massa.\n    Quisque ullamcorper placerat ipsum. Cras nibh. Morbi vel justo vitae lacus\ntincidunt ultrices. Lorem ipsum dolor sit amet, consectetuer adipiscing elit. In\nhac habitasse platea dictumst. Integer tempus convallis augue. Etiam facilisis.\nNunc elementum fermentum wisi. Aenean placerat. Ut imperdiet, enim sed\ngravida sollicitudin, felis odio placerat quam, ac pulvinar elit purus eget enim.\nNunc vitae tortor. Proin tempus nibh sit amet nisl. Vivamus quis tortor vitae\nrisus porta vehicula.\n    Fusce mauris. Vestibulum luctus nibh at lectus. Sed bibendum, nulla a fau-\ncibus semper, leo velit ultricies tellus, ac venenatis arcu wisi vel nisl. Vestibulum\ndiam. Aliquam pellentesque, augue quis sagittis posuere, turpis lacus congue\nquam, in hendrerit risus eros eget felis. Maecenas eget erat in sapien mattis\nporttitor. Vestibulum porttitor. Nulla facilisi. Sed a turpis eu lacus commodo\nfacilisis. Morbi fringilla, wisi in dignissim interdum, justo lectus sagittis dui, et\nvehicula libero dui cursus dui. Mauris tempor ligula sed lacus. Duis cursus enim\nut augue. Cras ac magna. Cras nulla. Nulla egestas. Curabitur a leo. Quisque\negestas wisi eget nunc. Nam feugiat lacus vel est. Curabitur consectetuer.\n    Suspendisse vel felis. Ut lorem lorem, interdum eu, tincidunt sit amet,\n\n\n                                         1\n\f"
  },
  {
    "path": "pyproject.toml",
    "content": "[project]\nname = \"dpsprep\"\nversion = \"2.5.4\"\ndescription = \"A DjVu to PDF converter with a focus on small output size and the ability to preserve document outlines and text layers\"\nrequires-python = \">=3.11, <4.0\"\nauthors = [\n  { name = \"Kevin Arthur Schiff Croker\" },\n  { name = \"Ianis Vasilev\", email = \"ianis@ivasilev.net\" }\n]\nlicense = \"GPL-3.0-or-later\"\ndependencies = [\n  \"click (>=8)\",\n  \"djvulibre-python (>=0.9.3)\",\n  \"fpdf2 (>=2.8)\",\n  \"loguru (>=0.7)\",\n  \"pdfrw (>=0.4)\",\n  \"pillow (>=12.2.0)\"\n]\n\n[project.urls]\nRepository = \"https://github.com/kcroker/dpsprep.git\"\nChangelog = \"https://github.com/kcroker/dpsprep/blob/master/CHANGELOG.md\"\n\n[project.optional-dependencies]\ncompress = [\n  \"ocrmypdf (>=17)\"\n]\n\n[project.scripts]\ndpsprep = \"dpsprep:dpsprep\"\n\n[dependency-groups]\ndev = [\n  \"mypy (>=1.19)\",\n  \"pytest (>=9.0.3)\",\n  \"ruff (>=0.15)\",\n  \"types-fpdf2 (>=2.8.4.20260322)\"\n]\n\n[build-system]\n# uv build complains if no upper bound is set, but it updates its minor versions often, so we put a major version just to shut it up\nrequires = [\"uv_build (>=0.10, <1)\"]\nbuild-backend = \"uv_build\"\n\n# uv\n[tool.uv]\nresolution = \"lowest-direct\"\n\n[tool.uv.build-backend]\nmodule-root = \"\"  # uv-build expects the code to be in src/dpsprep, but I did not want to move it when migrating to uv\n\n# pytest\n[tool.pytest.ini_options]\naddopts = \"--capture tee-sys\"\n\n# ruff\n[tool.ruff]\nline-length = 120\n\n[tool.ruff.lint]\nselect = [\n  \"A\",     # flake8-builtins\n  \"ANN\",   # flake8-annotations\n  \"ARG\",   # flake8-unused-arguments\n  \"ASYNC\", # flake8-async\n  \"B\",     # flake8-bugbear\n  \"C4\",    # flake8-comrehensions\n  \"C90\",   # mccabe\n  \"COM\",   # flake8-commas\n  \"E\",     # pycodestyle error\n  \"F\",     # pyflakes\n  \"FURB\",  # refurb\n  \"I\",     # isort\n  \"INP\",   # flake8-no-pep420\n  \"N\",     # pep8-naming\n  \"PERF\",  # 
perflint\n  \"PL\",    # pylint\n  \"PT\",    # flake8-pytest-style\n  \"PTH\",   # flake8-use-pathlib\n  \"Q\",     # flake8-quotes\n  \"RUF\",   # ruff\n  \"S\",     # flake8-bandit\n  \"SIM\",   # flake8-simplify\n  \"TC\",    # flake8-type-checking\n  \"TRY\",   # tryceratops\n  \"UP\",    # pyupgrade\n  \"W\",     # pycodestyle warning\n]\nignore = [\n  \"E501\",    # line-too-long\n  \"PLC1901\", # compare-to-empty-string\n  \"PLR6301\", # no-self-use\n  \"PTH123\",  # builtin-open\n  \"RUF001\", \"RUF002\", \"RUF003\", # ambiguous-unicode-character-{string,docstring,comment}\n]\n\n[tool.ruff.lint.isort]\nlines-after-imports = 2\n\n[tool.ruff.lint.flake8-quotes]\ninline-quotes = \"single\"\nmultiline-quotes = \"single\"\n\n[tool.ruff.lint.per-file-ignores]\n\"test_*.py\" = [\"S101\", \"PLR2004\"]\n\n# mypy\n[tool.mypy]\npackages = [\"dpsprep\"]\n\n[[tool.mypy.overrides]]\nmodule = [\n  \"djvu.*\",\n  \"ocrmypdf.*\",\n  \"pdfrw.*\"\n]\nignore_missing_imports = true\n"
  }
]