Repository: ernestofgonzalez/epub-utils
Branch: main
Commit: 8c5417c331f2
Files: 60
Total size: 315.4 KB
Directory structure:
gitextract_obeqz0f5/
├── .github/
│   └── workflows/
│       ├── docs.yml
│       └── test.yml
├── .gitignore
├── .vscode/
│   └── settings.json
├── LICENSE
├── Makefile
├── README.md
├── docs/
│   ├── Makefile
│   ├── api-reference.rst
│   ├── api-tutorial.rst
│   ├── changelog.rst
│   ├── cli-reference.rst
│   ├── cli-tutorial.rst
│   ├── conf.py
│   ├── contributing.rst
│   ├── epub-standards.rst
│   ├── examples.rst
│   ├── formats.rst
│   ├── index.rst
│   └── installation.rst
├── epub_utils/
│   ├── __init__.py
│   ├── __main__.py
│   ├── cli.py
│   ├── container.py
│   ├── content/
│   │   ├── __init__.py
│   │   ├── base.py
│   │   └── xhtml.py
│   ├── doc.py
│   ├── exceptions.py
│   ├── navigation/
│   │   ├── __init__.py
│   │   ├── base.py
│   │   ├── nav/
│   │   │   ├── __init__.py
│   │   │   └── dom.py
│   │   └── ncx/
│   │       ├── __init__.py
│   │       └── dom.py
│   ├── package/
│   │   ├── __init__.py
│   │   ├── manifest.py
│   │   ├── metadata.py
│   │   └── spine.py
│   └── printers.py
├── pytest.ini
├── requirements/
│   ├── requirements-docs.txt
│   ├── requirements-linting.txt
│   ├── requirements-testing.txt
│   └── requirements.txt
├── requirements.txt
├── ruff.toml
├── setup.py
└── tests/
    ├── assets/
    │   └── roads.epub
    ├── conftest.py
    ├── test_cli.py
    ├── test_container.py
    ├── test_doc.py
    ├── test_manifest.py
    ├── test_metadata.py
    ├── test_nav_navigation.py
    ├── test_ncx_navigation.py
    ├── test_package.py
    ├── test_spine.py
    └── test_xhtml_content.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/workflows/docs.yml
================================================
name: Publish documentation
on:
  push:
    branches:
      - main
jobs:
  docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - name: Install dependencies
        run: |
          pip install -r requirements/requirements-docs.txt
      - name: Sphinx build
        run: |
          sphinx-build docs _build
      - name: Deploy
        uses: peaceiris/actions-gh-pages@v3
        if: ${{ github.ref == 'refs/heads/main' }}
        with:
          publish_branch: gh-pages
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: _build/
          force_orphan: true
================================================
FILE: .github/workflows/test.yml
================================================
name: Test
on:
  push:
    branches:
      - "main"
  pull_request:
concurrency:
  group: ${{ github.head_ref || github.run_id }}
  cancel-in-progress: true
jobs:
  test:
    name: Python ${{ matrix.python-version }} on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      max-parallel: 4
      matrix:
        os:
          - ubuntu-24.04
          - windows-2022
          - macos-14
        python-version:
          - "3.8"
          - "3.9"
          - "3.10"
          - "3.11"
          - "3.12"
          - "3.13"
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          allow-prereleases: true
      - name: Cache pip packages
        uses: actions/cache@v3
        with:
          path: ~/.cache/pip
          key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
          restore-keys: |
            ${{ runner.os }}-pip-
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Run tests
        run: |
          pytest
================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# MacOS
.DS_Store
================================================
FILE: .vscode/settings.json
================================================
{
"python.testing.pytestEnabled": true
}
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright 2025 Ernesto González
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: Makefile
================================================
#!/usr/bin/env bash
LIGHT_CYAN=\033[1;36m
NO_COLOR=\033[0m
.PHONY: docs
help:
	@echo "test - run tests with pytest"
	@echo "coverage - get code coverage report"
	@echo "lint - lint the python code"
	@echo "format - format the python code"
# Run tests
test:
	@echo "${LIGHT_CYAN}Running tests...${NO_COLOR}"
	pytest
# Get code coverage report
coverage:
	@echo "${LIGHT_CYAN}Running tests and collecting coverage data...${NO_COLOR}"
	pytest
	coverage combine
	@echo "${LIGHT_CYAN}Reporting code coverage data...${NO_COLOR}"
	coverage report
	@echo "${LIGHT_CYAN}Creating HTML report...${NO_COLOR}"
	coverage html
	@echo "${LIGHT_CYAN}Creating coverage badge...${NO_COLOR}"
	@rm ./coverage.svg
	coverage-badge -o coverage.svg
# Lint code
lint:
	@echo "${LIGHT_CYAN}Linting code...${NO_COLOR}"
	ruff check
# Format code
format:
	@echo "${LIGHT_CYAN}Formatting code...${NO_COLOR}"
	ruff check --select I --fix
	ruff format
================================================
FILE: README.md
================================================
# epub-utils
[](https://pypi.org/project/epub-utils/)
[](https://ernestofgonzalez.github.io/epub-utils/changelog)
[](https://pypi.org/project/epub-utils/)
[](https://github.com/ernestofgonzalez/epub-utils/blob/main/LICENSE)
A Python library and CLI tool for inspecting EPUB files from the terminal.
## Features
- **Complete EPUB Support** - Parse both EPUB 2.0.1 and EPUB 3.0+ specifications with container, package, manifest, spine, and table of contents inspection
- **Rich Metadata Extraction** - Extract Dublin Core metadata (title, author, language, publisher) with key-value, XML, and raw output formats for easy scripting
- **Content Analysis** - Access document content by manifest ID or file path, with plain text extraction for content analysis and word counting
- **File System Navigation** - Browse and extract any file within EPUB archives (XHTML, CSS, images, fonts) with detailed file information including sizes and compression ratios
- **Multiple Output Formats** - XML with syntax highlighting, raw content, key-value pairs, plain text, and formatted tables to suit different workflows
- **CLI and Python API** - Comprehensive command-line tool for terminal workflows plus a clean Python library for programmatic access
- **Standards Compliance** - Built-in validation capabilities and adherence to W3C/IDPF specifications for reliable EPUB processing
- **Performance Optimized** - Lazy loading, efficient ZIP parsing, and optional lxml support for handling large EPUB collections
## Installation
`epub-utils` is available as a [PyPI](https://pypi.org/) package:
```bash
pip install epub-utils
```
## Use as a CLI tool
The basic format is:
```bash
epub-utils EPUB_PATH COMMAND [OPTIONS]
```
### Commands
- `container` - Display the container.xml contents
```bash
# Show container.xml with syntax highlighting
epub-utils book.epub container
# Show container.xml as raw content
epub-utils book.epub container --format raw
# Show container.xml with pretty formatting
epub-utils book.epub container --pretty-print
```
- `package` - Display the package OPF file contents
```bash
# Show package.opf with syntax highlighting
epub-utils book.epub package
# Show package.opf as raw content
epub-utils book.epub package --format raw
```
- `toc` - Display the table of contents file contents
```bash
# Show toc.ncx/nav.xhtml with syntax highlighting (auto-detect)
epub-utils book.epub toc
# Show toc.ncx/nav.xhtml as raw content
epub-utils book.epub toc --format raw
# Force NCX format (EPUB 2 navigation control file)
epub-utils book.epub toc --ncx
# Force Navigation Document (EPUB 3 navigation file)
epub-utils book.epub toc --nav
```
- `metadata` - Display the metadata information from the package file
```bash
# Show metadata with syntax highlighting
epub-utils book.epub metadata
# Show metadata as key-value pairs
epub-utils book.epub metadata --format kv
# Show metadata with pretty formatting
epub-utils book.epub metadata --pretty-print
```
- `manifest` - Display the manifest information from the package file
```bash
# Show manifest with syntax highlighting
epub-utils book.epub manifest
# Show manifest as raw content
epub-utils book.epub manifest --format raw
```
- `spine` - Display the spine information from the package file
```bash
# Show spine with syntax highlighting
epub-utils book.epub spine
# Show spine as raw content
epub-utils book.epub spine --format raw
```
- `content` - Display the content of a document by its manifest item ID
```bash
# Show content with syntax highlighting
epub-utils book.epub content chapter1
# Show raw HTML/XML content
epub-utils book.epub content chapter1 --format raw
# Show plain text content (HTML tags stripped)
epub-utils book.epub content chapter1 --format plain
```
- `files` - List all files in the EPUB archive or display content of a specific file
```bash
# List all files in table format (default)
epub-utils book.epub files
# List all files as simple paths
epub-utils book.epub files --format raw
# Display content of a specific file by path
epub-utils book.epub files OEBPS/chapter1.xhtml
# Display XHTML file content in different formats
epub-utils book.epub files OEBPS/chapter1.xhtml --format raw
epub-utils book.epub files OEBPS/chapter1.xhtml --format xml --pretty-print
epub-utils book.epub files OEBPS/chapter1.xhtml --format plain
# Display non-XHTML files (CSS, images, etc.)
epub-utils book.epub files OEBPS/styles/main.css
epub-utils book.epub files META-INF/container.xml
```
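The `toc` auto-detection described above can be pictured with a standard-library sketch. Per the EPUB specifications, an EPUB 3 package marks its Navigation Document with `properties="nav"` in the manifest, while the NCX file is referenced by the spine's `toc` attribute. The function name `find_toc_href`, the `prefer` parameter, and the sample OPF are hypothetical, and preferring the Navigation Document by default is an assumption; this is not necessarily how `epub-utils` resolves the file internally.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical package document that ships both navigation files.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="chapter1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="chapter1"/>
  </spine>
</package>
"""


def find_toc_href(opf_xml, prefer="nav"):
    """Return the href of the navigation file, or None if there is none."""
    root = ET.fromstring(opf_xml)
    items = root.findall("opf:manifest/opf:item", OPF_NS)

    # EPUB 3: the manifest item whose properties include "nav".
    nav_href = None
    for item in items:
        if "nav" in item.get("properties", "").split():
            nav_href = item.get("href")
            break

    # EPUB 2 (or EPUB 3 with a legacy NCX): the spine's toc attribute
    # holds the manifest ID of the NCX file.
    ncx_href = None
    spine = root.find("opf:spine", OPF_NS)
    toc_id = spine.get("toc") if spine is not None else None
    if toc_id:
        for item in items:
            if item.get("id") == toc_id:
                ncx_href = item.get("href")
                break

    if prefer == "ncx":
        return ncx_href or nav_href
    return nav_href or ncx_href


print(find_toc_href(SAMPLE_OPF))
print(find_toc_href(SAMPLE_OPF, prefer="ncx"))
```

Passing `prefer="ncx"` mirrors the idea behind the `--ncx` flag: the caller overrides the default choice but still falls back to whichever file exists.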
### Options
- `-h, --help` - Show help message and exit
- `-v, --version` - Show program version and exit
- `-fmt, --format` - Output format (default: xml)
  - `xml` - Display with XML syntax highlighting (default)
  - `raw` - Display raw content without formatting
  - `plain` - Display plain text content (HTML tags stripped; for the `content` and `files` commands)
  - `kv` - Display key-value pairs (where supported)
- `-pp, --pretty-print` - Pretty-print XML output (applies to xml and raw formats only)
```bash
# Display as raw content
epub-utils book.epub package --format raw
# Display with XML syntax highlighting (default)
epub-utils book.epub package --format xml
# Display as key-value pairs (for supported commands)
epub-utils book.epub metadata --format kv
# Display plain text content (HTML tags stripped)
epub-utils book.epub content chapter1 --format plain
# Pretty-print XML with proper indentation
epub-utils book.epub package --pretty-print
# Combine format and pretty-print options
epub-utils book.epub metadata --format raw --pretty-print
```
## Use as a Python library
```python
from epub_utils import Document
# Load an EPUB document
doc = Document("path/to/book.epub")
```
### Basic Document Access
Access the main components of an EPUB document:
```python
# Get container information
container = doc.container
print(container.to_xml()) # Formatted XML with syntax highlighting
print(container.to_str()) # Raw XML content
# Get package information
package = doc.package
print(package.to_xml()) # Formatted XML with syntax highlighting
print(package.to_str()) # Raw XML content
# Get table of contents
toc = doc.toc
if toc:  # TOC might be None if not present
    print(toc.to_xml())  # Formatted XML with syntax highlighting
    print(toc.to_str())  # Raw XML content
# Access specific navigation formats
ncx = doc.ncx  # NCX format (EPUB 2 or EPUB 3 with NCX)
if ncx:
    print("NCX navigation available")
    print(ncx.to_xml())
nav = doc.nav  # Navigation Document (EPUB 3 only)
if nav:
    print("Navigation Document available")
    print(nav.to_xml())
```
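To make the role of the container concrete, here is a minimal standard-library sketch of what reading `META-INF/container.xml` involves: the file names the package document through the `full-path` attribute of its `rootfile` element. The helper names `build_sample_epub` and `read_rootfile_path` and the in-memory archive are hypothetical and only illustrate the EPUB container format; they are not part of the `epub-utils` API.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

CONTAINER_XML = """<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_sample_epub():
    """Build a tiny in-memory archive that only contains the container file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr("META-INF/container.xml", CONTAINER_XML)
    buffer.seek(0)
    return buffer


def read_rootfile_path(epub_file):
    """Return the archive path of the package document (the OPF file)."""
    with zipfile.ZipFile(epub_file) as archive:
        root = ET.fromstring(archive.read("META-INF/container.xml"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no rootfile element")
    return rootfile.get("full-path")


rootfile_path = read_rootfile_path(build_sample_epub())
print(rootfile_path)
```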
### Working with Metadata
Access and format metadata information:
```python
# Access package metadata
metadata = doc.package.metadata
# Basic Dublin Core elements
print(f"Title: {metadata.title}")
print(f"Creator: {metadata.creator}")
print(f"Identifier: {metadata.identifier}")
print(f"Language: {metadata.language}")
print(f"Publisher: {metadata.publisher}")
print(f"Date: {metadata.date}")
# Dynamic attribute access for any metadata field
isbn = getattr(metadata, 'isbn', 'Not available')
series = getattr(metadata, 'series', 'Not available')
# Get formatted metadata output
print(metadata.to_xml()) # Formatted XML with syntax highlighting
print(metadata.to_str()) # Raw XML content
print(metadata.to_kv()) # Key-value format for easy parsing
```
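The Dublin Core fields shown above live in the `metadata` element of the package document. The following sketch collects them with the standard library only; `dublin_core_fields` and the sample OPF are hypothetical, and the printed `key: value` lines are just for illustration, since the exact output of `to_kv()` is not shown in this README.

```python
import xml.etree.ElementTree as ET

DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical package document with a few Dublin Core elements.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">urn:uuid:12345</dc:identifier>
    <dc:title>Sample Book</dc:title>
    <dc:creator>Jane Doe</dc:creator>
    <dc:language>en</dc:language>
  </metadata>
</package>
"""


def dublin_core_fields(opf_xml):
    """Map Dublin Core element names to their text (first occurrence wins)."""
    root = ET.fromstring(opf_xml)
    metadata = root.find("opf:metadata", OPF_NS)
    fields = {}
    if metadata is None:
        return fields
    prefix = "{" + DC_NAMESPACE + "}"
    for element in metadata:
        # ElementTree expands prefixed names to "{namespace}local-name".
        if element.tag.startswith(prefix):
            name = element.tag[len(prefix):]
            fields.setdefault(name, (element.text or "").strip())
    return fields


fields = dublin_core_fields(SAMPLE_OPF)
for key, value in fields.items():
    print(f"{key}: {value}")
```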
### Working with Manifest
Access the manifest to see all files in the EPUB:
```python
# Get manifest information
manifest = doc.package.manifest
# Access all manifest items
for item in manifest.items:
    print(f"ID: {item['id']}")
    print(f"File: {item['href']}")
    print(f"Type: {item['media_type']}")
    print(f"Properties: {item['properties']}")
# Find specific items
nav_item = manifest.find_by_property('nav')
chapter = manifest.find_by_id('chapter1')
xhtml_items = manifest.find_by_media_type('application/xhtml+xml')
# Get formatted manifest output
print(manifest.to_xml()) # Formatted XML with syntax highlighting
print(manifest.to_str()) # Raw XML content
```
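As a rough picture of what a manifest lookup involves, the sketch below parses `item` elements into dictionaries with the same keys used above and filters them by media type. The functions `parse_manifest` and `filter_by_media_type` are hypothetical, and representing `properties` as a list of tokens is an assumption made for this example rather than the documented return type.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical package document with three manifest items.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="chapter1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="styles/main.css" media-type="text/css"/>
  </manifest>
</package>
"""


def parse_manifest(opf_xml):
    """Return one dictionary per manifest item, in document order."""
    root = ET.fromstring(opf_xml)
    items = []
    for item in root.findall("opf:manifest/opf:item", OPF_NS):
        items.append(
            {
                "id": item.get("id"),
                "href": item.get("href"),
                "media_type": item.get("media-type"),
                # Assumption: properties are exposed as a list of tokens.
                "properties": item.get("properties", "").split(),
            }
        )
    return items


def filter_by_media_type(items, media_type):
    """Keep only the items whose media type matches exactly."""
    return [item for item in items if item["media_type"] == media_type]


items = parse_manifest(SAMPLE_OPF)
xhtml_items = filter_by_media_type(items, "application/xhtml+xml")
print([item["id"] for item in xhtml_items])
```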
### Working with Spine
Access the spine to see the reading order:
```python
# Get spine information
spine = doc.package.spine
# Access spine properties
print(f"TOC reference: {spine.toc}")
print(f"Page progression: {spine.page_progression_direction}")
# Access spine items in reading order
for itemref in spine.itemrefs:
    print(f"ID: {itemref['idref']}")
    print(f"Linear: {itemref['linear']}")
    print(f"Properties: {itemref['properties']}")
# Find specific spine item
spine_item = spine.find_by_idref('chapter1')
# Get formatted spine output
print(spine.to_xml()) # Formatted XML with syntax highlighting
print(spine.to_str()) # Raw XML content
```
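The spine only lists manifest IDs, so producing a reading order means resolving each `idref` against the manifest. A minimal sketch under that reading of the EPUB package format follows; `reading_order` and the sample OPF are hypothetical, and skipping `linear="no"` items by default is a choice made for this example.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical package document: a non-linear cover plus two chapters.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="cover" href="cover.xhtml" media-type="application/xhtml+xml"/>
    <item id="chapter1" href="chapter1.xhtml" media-type="application/xhtml+xml"/>
    <item id="chapter2" href="chapter2.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="cover" linear="no"/>
    <itemref idref="chapter1"/>
    <itemref idref="chapter2"/>
  </spine>
</package>
"""


def reading_order(opf_xml, include_non_linear=False):
    """Resolve spine itemrefs to manifest hrefs, in spine order."""
    root = ET.fromstring(opf_xml)
    hrefs = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    order = []
    for itemref in root.findall("opf:spine/opf:itemref", OPF_NS):
        # The linear attribute defaults to "yes" when it is absent.
        is_linear = itemref.get("linear", "yes") != "no"
        if not is_linear and not include_non_linear:
            continue
        idref = itemref.get("idref")
        if idref in hrefs:
            order.append(hrefs[idref])
    return order


linear_order = reading_order(SAMPLE_OPF)
full_order = reading_order(SAMPLE_OPF, include_non_linear=True)
print(linear_order)
print(full_order)
```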
### Content Extraction
Extract content from specific documents within the EPUB:
```python
# Access content by manifest item ID
try:
    content = doc.find_content_by_id('chapter1')
    # Get content in different formats
    print(content.to_xml())  # Formatted XHTML with syntax highlighting
    print(content.to_str())  # Raw XHTML content
    print(content.to_plain())  # Plain text with HTML tags stripped
    # Access the parsed content tree for advanced processing
    tree = content.tree
    inner_text = content.inner_text
except ValueError as e:
    print(f"Content not found: {e}")
# Find publication resources by ID (for non-spine items)
try:
    resource = doc.find_pub_resource_by_id('cover-image')
except ValueError as e:
    print(f"Resource not found: {e}")
```
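For readers curious what "plain text with HTML tags stripped" can mean in practice, here is one possible approach using only `html.parser` from the standard library. The class `TextExtractor`, the function `to_plain_text`, and the choice of which tags to skip or treat as line breaks are all assumptions made for this sketch; `to_plain()` in `epub-utils` may behave differently.

```python
from html.parser import HTMLParser

SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Chapter 1</title><style>p { margin: 0; }</style></head>
<body><h1>Chapter 1</h1><p>It was a <em>dark</em> night.</p></body>
</html>"""


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the head and any script or style."""

    SKIPPED_TAGS = {"head", "script", "style"}
    BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            # End of a block element: start a new line of output.
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def to_plain_text(xhtml):
    """Strip markup and normalise whitespace, one line per block element."""
    parser = TextExtractor()
    parser.feed(xhtml)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


plain_text = to_plain_text(SAMPLE_XHTML)
print(plain_text)
```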
### File Operations
List and access files directly by their paths in the EPUB archive:
```python
# Get information about all files
files_info = doc.get_files_info()
for file_info in files_info:
    print(f"Path: {file_info['path']}")
    print(f"Size: {file_info['size']} bytes")
    print(f"Compressed: {file_info['compressed_size']} bytes")
    print(f"Modified: {file_info['modified']}")
# Access specific file by path
try:
    # For XHTML files, returns XHTMLContent object
    xhtml_content = doc.get_file_by_path('OEBPS/chapter1.xhtml')
    print(xhtml_content.to_xml())
    print(xhtml_content.to_plain())
    # For other files, returns raw string content
    css_content = doc.get_file_by_path('OEBPS/styles/main.css')
    print(css_content)
except ValueError as e:
    print(f"File not found: {e}")
```
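Because an EPUB is a ZIP archive, the per-file details above map directly onto the entries that `zipfile` exposes. The sketch below builds a small in-memory archive and reports the same four keys; `build_sample_epub` and `files_info` are hypothetical helpers, not the library's implementation of `get_files_info()`.

```python
import io
import zipfile
from datetime import datetime


def build_sample_epub():
    """Create a tiny in-memory archive with a stored and a deflated entry."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # The mimetype entry is stored uncompressed, as EPUB requires.
        archive.writestr(
            "mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED
        )
        archive.writestr("OEBPS/chapter1.xhtml", "<p>" + "lorem ipsum " * 200 + "</p>")
    buffer.seek(0)
    return buffer


def files_info(epub_file):
    """Return path, sizes, and modification time for every file entry."""
    infos = []
    with zipfile.ZipFile(epub_file) as archive:
        for entry in archive.infolist():
            if entry.is_dir():
                continue
            infos.append(
                {
                    "path": entry.filename,
                    "size": entry.file_size,
                    "compressed_size": entry.compress_size,
                    "modified": datetime(*entry.date_time).isoformat(),
                }
            )
    return infos


infos = files_info(build_sample_epub())
for info in infos:
    print(f"{info['path']}: {info['size']} bytes ({info['compressed_size']} compressed)")
```

A compression ratio can be derived from `size` and `compressed_size`; guard against a zero `size` for empty entries.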
### Output Formatting Options
All document components support flexible output formatting:
```python
# Pretty-printed XML output
print(metadata.to_str(pretty_print=True))
print(manifest.to_xml(pretty_print=True))
# Syntax highlighting can be controlled
print(package.to_xml(highlight_syntax=True)) # With highlighting (default)
print(package.to_xml(highlight_syntax=False)) # Without highlighting
```
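To show what pretty-printing changes, the following sketch re-indents a compact XML fragment with `xml.dom.minidom` from the standard library. The helper `pretty_print_xml` is hypothetical and only approximates the effect of `pretty_print=True`; the library's own output (for example, whether an XML declaration is emitted) may differ.

```python
from xml.dom import minidom

COMPACT_XML = '<spine toc="ncx"><itemref idref="chapter1"/><itemref idref="chapter2"/></spine>'


def pretty_print_xml(xml_text, indent="  "):
    """Re-indent an XML string, one element per line."""
    document = minidom.parseString(xml_text)
    pretty = document.toprettyxml(indent=indent)
    # Drop whitespace-only lines that can appear when the input
    # already contains line breaks between elements.
    return "\n".join(line for line in pretty.splitlines() if line.strip())


pretty = pretty_print_xml(COMPACT_XML)
print(pretty)
```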
## Industry Standards & Compliance
`epub-utils` supports the industry-standard EPUB specifications and related metadata standards listed below, for broad compatibility across the digital publishing ecosystem.
### Supported EPUB Standards
- **EPUB 2.0.1** (IDPF, 2010)
  - Complete OPF 2.0 package document support
  - NCX navigation control file support
  - Dublin Core metadata extraction
  - Legacy EPUB compatibility
- **EPUB 3.0+** (IDPF/W3C, 2011-present)
  - EPUB 3.3 specification compliance
  - HTML5-based content documents
  - Navigation document (nav.xhtml) support
  - Enhanced accessibility features
  - Media overlays and scripting support
### Metadata Standards
- **Dublin Core Metadata Initiative (DCMI)**
  - Dublin Core Metadata Element Set v1.1
  - Dublin Core Metadata Terms (DCTERMS)
- **Open Packaging Format (OPF)**
  - OPF 2.0 specification (EPUB 2.0.1)
  - OPF 3.0 specification (EPUB 3.0+)
The library maintains strict adherence to published specifications while providing robust handling of real-world EPUB variations commonly found in commercial and open-source reading applications.
================================================
FILE: docs/Makefile
================================================
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
================================================
FILE: docs/api-reference.rst
================================================
API Reference
=============

This section provides complete API documentation for all classes and methods in epub-utils.

Document Class
--------------

.. py:class:: Document(path)

   Main class for working with EPUB files.

   :param str path: Path to the EPUB file

   **Example**:

   .. code-block:: python

      from epub_utils import Document

      doc = Document("book.epub")
      print(doc.package.metadata.title)

   .. py:attribute:: container

      Access to the container information.

      :type: Container
      :returns: Container object with container.xml information

      **Example**:

      .. code-block:: python

         container = doc.container
         print(f"Package path: {container.rootfile_path}")

   .. py:attribute:: package

      Access to the package (OPF) information.

      :type: Package
      :returns: Package object with OPF file information

      **Example**:

      .. code-block:: python

         package = doc.package
         print(f"Title: {package.metadata.title}")

   .. py:attribute:: toc

      Access to the table of contents.

      :type: TableOfContents
      :returns: Table of contents object

      **Example**:

      .. code-block:: python

         toc = doc.toc
         toc_xml = toc.to_xml()

   .. py:attribute:: ncx

      Access to the NCX (Navigation Control for XML) table of contents.

      :type: TableOfContents or None
      :returns: NCX table of contents object for EPUB 2, or for EPUB 3 if NCX is present, None otherwise

      **Example**:

      .. code-block:: python

         ncx = doc.ncx
         if ncx:
             ncx_xml = ncx.to_xml()

      **Note**: For EPUB 2, this returns the same as ``toc``. For EPUB 3, this specifically
      accesses the NCX file if present, which provides backward compatibility.

   .. py:attribute:: nav

      Access to the Navigation Document (EPUB 3 only).

      :type: TableOfContents or None
      :returns: Navigation Document table of contents object for EPUB 3, None for EPUB 2 or if not present

      **Example**:

      .. code-block:: python

         nav = doc.nav
         if nav:
             nav_xml = nav.to_xml()

      **Note**: This property specifically accesses EPUB 3 Navigation Documents.
      Returns None for EPUB 2 documents.

   .. py:method:: get_files_info()

      Get detailed information about all files in the EPUB.

      :returns: List of dictionaries containing file information
      :rtype: List[Dict[str, Union[str, int]]]

      Each dictionary contains:

      - ``path`` (str): File path within the EPUB
      - ``size`` (int): Uncompressed size in bytes
      - ``compressed_size`` (int): Compressed size in bytes
      - ``modified`` (str): Last modified date in ISO format

      **Example**:

      .. code-block:: python

         files = doc.get_files_info()
         for file_info in files:
             print(f"{file_info['path']}: {file_info['size']} bytes")

   .. py:method:: list_files()

      Get basic information about all files in the EPUB.

      :returns: List of dictionaries with basic file information
      :rtype: List[Dict[str, str]]

      **Example**:

      .. code-block:: python

         files = doc.list_files()
         print(f"EPUB contains {len(files)} files")

Container Class
---------------
.. py:class:: Container
Represents the META-INF/container.xml file information.
.. py:attribute:: rootfile_path
Path to the main package file within the EPUB.
:type: str
.. py:attribute:: rootfile_media_type
Media type of the main package file.
:type: str
.. py:method:: to_xml(highlight_syntax=True)
Get formatted XML representation.
:param bool highlight_syntax: Whether to apply syntax highlighting
:returns: Formatted XML string
:rtype: str
.. py:method:: to_str()
Get raw XML content.
:returns: Raw XML string
:rtype: str
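To make the two attributes concrete, here is a minimal sketch of how the rootfile path and media type can be read from a ``container.xml`` document with the standard library. ``read_rootfile`` is a hypothetical helper written for illustration; it is not the ``Container`` implementation.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

# Namespace used by META-INF/container.xml.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

SAMPLE_CONTAINER = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def read_rootfile(container_xml: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (full-path, media-type) of the first <rootfile> element."""
    # Parse bytes so the XML declaration's encoding is honoured.
    root = ET.fromstring(container_xml.encode("utf-8"))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        raise ValueError("container.xml has no <rootfile> element")
    return rootfile.get("full-path"), rootfile.get("media-type")


rootfile_path, rootfile_media_type = read_rootfile(SAMPLE_CONTAINER)
```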
Package Class
-------------
.. py:class:: Package
Represents the main OPF package file.
.. py:attribute:: metadata
Package metadata information.
:type: Metadata
.. py:attribute:: manifest
Package manifest information.
:type: Manifest
.. py:attribute:: spine
Package spine information.
:type: Spine
.. py:method:: to_xml(highlight_syntax=True)
Get formatted XML representation of the complete package.
:param bool highlight_syntax: Whether to apply syntax highlighting
:returns: Formatted XML string
:rtype: str
.. py:method:: to_str()
Get raw XML content of the complete package.
:returns: Raw XML string
:rtype: str
Metadata Class
--------------
.. py:class:: Metadata
Represents Dublin Core and EPUB-specific metadata.
.. py:attribute:: title
Book title from dc:title element.
:type: str
.. py:attribute:: creator
Book author/creator from dc:creator element.
:type: str
.. py:attribute:: language
Language code from dc:language element.
:type: str
.. py:attribute:: identifier
Unique identifier from dc:identifier element.
:type: str
.. py:attribute:: publisher
Publisher from dc:publisher element.
:type: str
.. py:attribute:: date
Publication date from dc:date element.
:type: str
.. py:attribute:: subject
Subject/keywords from dc:subject element.
:type: str
.. py:attribute:: description
Description from dc:description element.
:type: str
.. py:attribute:: contributor
Contributor from dc:contributor element.
:type: str
.. py:attribute:: type
Resource type from dc:type element.
:type: str
.. py:attribute:: format
Format from dc:format element.
:type: str
.. py:attribute:: source
Source from dc:source element.
:type: str
.. py:attribute:: relation
Relation from dc:relation element.
:type: str
.. py:attribute:: coverage
Coverage from dc:coverage element.
:type: str
.. py:attribute:: rights
Rights information from dc:rights element.
:type: str
.. py:method:: __getattr__(name)
Dynamic attribute access for any metadata field.
:param str name: Metadata field name
:returns: Metadata value or empty string
:rtype: str
**Example**:
.. code-block:: python
# Access any metadata field
isbn = metadata.isbn if hasattr(metadata, 'isbn') else 'Not available'
series = getattr(metadata, 'series', 'Not available')
.. py:method:: to_xml(highlight_syntax=True)
Get formatted XML representation of metadata.
:param bool highlight_syntax: Whether to apply syntax highlighting
:returns: Formatted XML string
:rtype: str
.. py:method:: to_kv()
Get metadata as key-value pairs.
:returns: Key-value formatted string
:rtype: str
**Example**:
.. code-block:: python
kv_data = metadata.to_kv()
print(kv_data)
# Output:
# title: The Great Gatsby
# creator: F. Scott Fitzgerald
# language: en
.. py:method:: to_str()
Get raw XML content of metadata.
:returns: Raw XML string
:rtype: str
Manifest Class
--------------
.. py:class:: Manifest
Represents the package manifest section.
.. py:attribute:: items
Dictionary of manifest items.
:type: Dict[str, Dict[str, str]]
Each item contains:
- ``href``: File path
- ``media-type``: MIME type
- Other attributes as needed
**Example**:
.. code-block:: python
for item_id, item in manifest.items.items():
    print(f"ID: {item_id}")
    print(f"  File: {item['href']}")
    print(f"  Type: {item['media-type']}")
.. py:method:: to_xml(highlight_syntax=True)
Get formatted XML representation.
:param bool highlight_syntax: Whether to apply syntax highlighting
:returns: Formatted XML string
:rtype: str
.. py:method:: to_str()
Get raw XML content.
:returns: Raw XML string
:rtype: str
Spine Class
-----------
.. py:class:: Spine
Represents the package spine section.
.. py:attribute:: items
List of spine items in reading order.
:type: List[Dict[str, str]]
**Example**:
.. code-block:: python
for item in spine.items:
    print(f"Reading order item: {item}")
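The spine only gives the reading order; the file paths live in the manifest. The sketch below joins the two using plain dictionaries shaped like the documented ``items`` attributes. It assumes each spine item carries an ``idref`` key (the attribute name used in OPF files); the keys actually present in ``spine.items`` are not specified here, so treat that as an assumption. ``reading_order`` is a hypothetical helper.

```python
from typing import Dict, List


def reading_order(
    manifest_items: Dict[str, Dict[str, str]],
    spine_items: List[Dict[str, str]],
) -> List[str]:
    """Return manifest hrefs in spine order, skipping unknown idrefs."""
    hrefs = []
    for entry in spine_items:
        # Assumption: the spine entry references the manifest via "idref".
        item = manifest_items.get(entry.get("idref", ""))
        if item is not None:
            hrefs.append(item["href"])
    return hrefs


# Stand-in data shaped like manifest.items and spine.items.
demo_manifest = {
    "ch1": {"href": "Text/ch1.xhtml", "media-type": "application/xhtml+xml"},
    "css": {"href": "Styles/main.css", "media-type": "text/css"},
    "ch2": {"href": "Text/ch2.xhtml", "media-type": "application/xhtml+xml"},
}
demo_spine = [{"idref": "ch1"}, {"idref": "ch2"}, {"idref": "missing"}]
demo_order = reading_order(demo_manifest, demo_spine)
```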
.. py:method:: to_xml(highlight_syntax=True)
Get formatted XML representation.
:param bool highlight_syntax: Whether to apply syntax highlighting
:returns: Formatted XML string
:rtype: str
.. py:method:: to_str()
Get raw XML content.
:returns: Raw XML string
:rtype: str
TableOfContents Class
---------------------
.. py:class:: TableOfContents
Represents the table of contents (NCX or Navigation Document).
.. py:method:: to_xml(highlight_syntax=True)
Get formatted XML representation.
:param bool highlight_syntax: Whether to apply syntax highlighting
:returns: Formatted XML string
:rtype: str
.. py:method:: to_str()
Get raw XML content.
:returns: Raw XML string
:rtype: str
Content Classes
---------------
.. py:class:: Content
Base class for EPUB content documents.
.. py:method:: to_xml(highlight_syntax=True)
Get formatted content.
:param bool highlight_syntax: Whether to apply syntax highlighting
:returns: Formatted content string
:rtype: str
.. py:method:: to_str()
Get raw content.
:returns: Raw content string
:rtype: str
.. py:class:: XHTMLContent
Specialized class for XHTML content documents.
Inherits from Content with additional XHTML-specific methods.
.. py:method:: to_plain()
Get plain text content with HTML tags stripped.
:returns: Plain text string
:rtype: str
**Example**:
.. code-block:: python
from epub_utils.content import XHTMLContent
# This would typically be accessed through Document
# content = XHTMLContent(raw_html)
# plain_text = content.to_plain()
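For readers curious what "HTML tags stripped" involves, the following is a minimal standard-library sketch of tag stripping. It is not how ``to_plain()`` is implemented (that is not shown in this reference); ``strip_tags`` is a hypothetical helper, and it makes no attempt to skip ``<script>`` or ``<style>`` content.

```python
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Accumulate only the text nodes of a document."""

    def __init__(self) -> None:
        super().__init__()  # convert_charrefs=True decodes entities such as &amp;
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_tags(markup: str) -> str:
    """Return the text content of markup with whitespace runs collapsed."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    return " ".join("".join(collector.parts).split())


demo_text = strip_tags("<h1>Title</h1>\n<p>Hello <em>world</em>!</p>")
```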
Exception Classes
-----------------
.. py:exception:: ParseError
Raised when there's an error parsing EPUB content.
Base class: ``Exception``
**Example**:
.. code-block:: python
from epub_utils import Document
from epub_utils.exceptions import ParseError
try:
    doc = Document("corrupted.epub")
    title = doc.package.metadata.title
except ParseError as e:
    print(f"Failed to parse EPUB: {e}")
except FileNotFoundError:
    print("EPUB file not found")
Usage Examples
--------------
Basic Usage
~~~~~~~~~~~
.. code-block:: python
from epub_utils import Document
# Load document
doc = Document("book.epub")
# Access metadata
metadata = doc.package.metadata
print(f"Title: {metadata.title}")
print(f"Author: {metadata.creator}")
# Check file structure
files = doc.get_files_info()
print(f"Contains {len(files)} files")
# Get formatted output
toc_xml = doc.toc.to_xml()
metadata_kv = metadata.to_kv()
Error Handling
~~~~~~~~~~~~~~
.. code-block:: python
from epub_utils import Document
from epub_utils.exceptions import ParseError
def safe_load_epub(path):
    try:
        doc = Document(path)
        return {
            'status': 'success',
            'document': doc,
            'title': getattr(doc.package.metadata, 'title', 'Unknown')
        }
    except ParseError as e:
        return {
            'status': 'parse_error',
            'error': str(e)
        }
    except FileNotFoundError:
        return {
            'status': 'file_not_found',
            'error': 'EPUB file not found'
        }
    except Exception as e:
        return {
            'status': 'unknown_error',
            'error': str(e)
        }
Batch Processing
~~~~~~~~~~~~~~~~
.. code-block:: python
from pathlib import Path
from epub_utils import Document
def process_epub_directory(directory):
    epub_files = Path(directory).glob("*.epub")
    results = []
    for epub_path in epub_files:
        try:
            doc = Document(str(epub_path))
            metadata = doc.package.metadata
            result = {
                'file': epub_path.name,
                'title': getattr(metadata, 'title', ''),
                'author': getattr(metadata, 'creator', ''),
                'language': getattr(metadata, 'language', ''),
                'file_size': epub_path.stat().st_size,
                'epub_files': len(doc.get_files_info())
            }
            results.append(result)
        except Exception as e:
            # Record the failure and keep processing the remaining files
            results.append({
                'file': epub_path.name,
                'error': str(e)
            })
    return results
Type Hints
----------
For better IDE support and type checking, here are the main type hints:
.. code-block:: python
from typing import Dict, List, Union, Optional
from epub_utils import Document
# Function signatures for reference
def get_files_info(self) -> List[Dict[str, Union[str, int]]]: ...
def list_files(self) -> List[Dict[str, str]]: ...
def to_xml(self, highlight_syntax: bool = True) -> str: ...
def to_str(self) -> str: ...
def to_kv(self) -> str: ...
# Type-safe usage example
doc: Document = Document("book.epub")
files_info: List[Dict[str, Union[str, int]]] = doc.get_files_info()
title: str = doc.package.metadata.title
kv_data: str = doc.package.metadata.to_kv()
Module Structure
----------------
The ``epub-utils`` package is organized as follows:
.. code-block:: text
epub_utils/
├── __init__.py          # Main exports (Document, Container)
├── __main__.py          # Entry point for python -m epub_utils
├── cli.py               # Command-line interface
├── container.py         # Container class
├── content/
│   ├── __init__.py      # Content classes
│   ├── base.py          # Base Content class
│   └── xhtml.py         # XHTMLContent class
├── doc.py               # Document class
├── exceptions.py        # Exception classes
├── navigation/
│   ├── __init__.py      # Navigation (table of contents) classes
│   ├── base.py
│   ├── nav/             # EPUB 3 Navigation Document
│   │   ├── __init__.py
│   │   └── dom.py
│   └── ncx/             # NCX navigation
│       ├── __init__.py
│       └── dom.py
├── package/
│   ├── __init__.py      # Package class
│   ├── manifest.py      # Manifest class
│   ├── metadata.py      # Metadata class
│   └── spine.py         # Spine class
└── printers.py          # Output printing utilities
For detailed implementation examples, see :doc:`api-tutorial` and :doc:`examples`.
================================================
FILE: docs/api-tutorial.rst
================================================
Use as a Python library
=======================
This guide covers using ``epub-utils`` as a Python library: loading an EPUB file and reading
its container, package metadata, manifest, spine, table of contents, and content documents.
Quick Start
-----------
The main entry point is the ``Document`` class:
.. code-block:: python
from epub_utils import Document
# Load an EPUB file
doc = Document("path/to/book.epub")
# Access various components
print(f"Title: {doc.package.metadata.title}")
print(f"Author: {doc.package.metadata.creator}")
Core Classes
------------
Document Class
~~~~~~~~~~~~~~
The ``Document`` class is your main interface to an EPUB file:
.. code-block:: python
from epub_utils import Document
doc = Document("example.epub")
# Access major components
container = doc.container # Container information
package = doc.package # Package/OPF file
toc = doc.toc # Table of contents
# Get file information
files_info = doc.get_files_info()
**Key Methods**:
- ``get_files_info()``: Returns detailed information about all files in the EPUB
- ``list_files()``: Returns a simple list of files with basic metadata
Container Access
~~~~~~~~~~~~~~~~
The container provides information from the META-INF/container.xml file:
.. code-block:: python
# Access container properties
print(f"Package path: {doc.container.rootfile_path}")
print(f"Media type: {doc.container.rootfile_media_type}")
# Get formatted XML and the raw XML string
container_xml = doc.container.to_xml()
raw_container = doc.container.to_str()
Package and Metadata
~~~~~~~~~~~~~~~~~~~~~
The package object gives you access to the main OPF file and its metadata:
.. code-block:: python
package = doc.package
# Access metadata
metadata = package.metadata
print(f"Title: {metadata.title}")
print(f"Author: {metadata.creator}")
print(f"Language: {metadata.language}")
print(f"Identifier: {metadata.identifier}")
print(f"Publisher: {metadata.publisher}")
# Get all metadata as key-value pairs
kv_metadata = metadata.to_kv()
print(kv_metadata)
# Access manifest and spine
manifest = package.manifest
spine = package.spine
Working with Metadata
----------------------
Extracting Common Fields
~~~~~~~~~~~~~~~~~~~~~~~~~
The metadata object provides easy access to Dublin Core and EPUB-specific metadata:
.. code-block:: python
metadata = doc.package.metadata
# Basic Dublin Core elements
title = metadata.title
creator = metadata.creator # Usually the author
subject = metadata.subject # Keywords/topics
description = metadata.description
publisher = metadata.publisher
contributor = metadata.contributor
date = metadata.date
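resource_type = metadata.type  # named to avoid shadowing the built-in type()
file_format = metadata.format  # named to avoid shadowing the built-in format()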
identifier = metadata.identifier
source = metadata.source
language = metadata.language
relation = metadata.relation
coverage = metadata.coverage
rights = metadata.rights
Dynamic Attribute Access
~~~~~~~~~~~~~~~~~~~~~~~~
The metadata object supports dynamic attribute access for any metadata field:
.. code-block:: python
# Access any metadata field by name
isbn = getattr(metadata, 'isbn', 'Not available')
series = getattr(metadata, 'series', 'Not available')
# Or use the more direct approach
try:
    custom_field = metadata.custom_metadata_field
except AttributeError:
    custom_field = "Field not found"
Formatted Output
~~~~~~~~~~~~~~~~
Get metadata in different formats:
.. code-block:: python
# XML format with syntax highlighting
xml_metadata = metadata.to_xml(highlight_syntax=True)
# Formatted XML without syntax highlighting
plain_xml = metadata.to_xml(highlight_syntax=False)
# Key-value format for easy parsing
kv_format = metadata.to_kv()
Manifest and Spine
-------------------
Working with the Manifest
~~~~~~~~~~~~~~~~~~~~~~~~~~
The manifest lists all files in the EPUB package:
.. code-block:: python
manifest = doc.package.manifest
# Get all items
items = manifest.items # Dictionary of manifest items
# Find specific items
for item_id, item in items.items():
    print(f"ID: {item_id}")
    print(f"  File: {item['href']}")
    print(f"  Type: {item['media-type']}")
# Get formatted output
manifest_xml = manifest.to_xml()
Understanding the Spine
~~~~~~~~~~~~~~~~~~~~~~~~
The spine defines the reading order:
.. code-block:: python
spine = doc.package.spine
# Get spine items in reading order
spine_items = spine.items
# Get formatted output
spine_xml = spine.to_xml()
Table of Contents
-----------------
Working with TOC
~~~~~~~~~~~~~~~~
Access the table of contents (either NCX or Navigation Document):
.. code-block:: python
toc = doc.toc
# Get formatted TOC
toc_xml = toc.to_xml()
raw_toc = toc.to_str()
Specific TOC Access
~~~~~~~~~~~~~~~~~~~
For fine-grained control over which table of contents format to access:
.. code-block:: python
# Access NCX specifically (EPUB 2 or EPUB 3 with NCX)
ncx = doc.ncx
if ncx:
    ncx_xml = ncx.to_xml()
    print("NCX navigation available")
else:
    print("No NCX navigation found")
# Access Navigation Document specifically (EPUB 3 only)
nav = doc.nav
if nav:
    nav_xml = nav.to_xml()
    print("Navigation Document available")
else:
    print("No Navigation Document found (likely EPUB 2)")
# Handle different EPUB versions
package = doc.package
if package.version.major >= 3:
    # EPUB 3 - prefer Navigation Document, fallback to NCX
    nav_doc = doc.nav or doc.ncx
else:
    # EPUB 2 - use NCX
    nav_doc = doc.ncx
if nav_doc:
    print("Table of contents found:", nav_doc.to_str()[:100])
Content Extraction
------------------
Accessing Document Content
~~~~~~~~~~~~~~~~~~~~~~~~~~
Extract content from specific documents within the EPUB:
.. code-block:: python
# First, find content IDs from the manifest
manifest = doc.package.manifest
content_items = {
    item_id: item for item_id, item in manifest.items.items()
    if item['media-type'] == 'application/xhtml+xml'
}
# Access content by ID
for content_id in content_items:
    try:
        content = doc.get_content(content_id)
        # Process content as needed
        print(f"Content ID {content_id}: {len(content)} characters")
    except Exception as e:
        print(f"Could not access content {content_id}: {e}")
File Information
----------------
Detailed File Analysis
~~~~~~~~~~~~~~~~~~~~~~
Get comprehensive information about all files in the EPUB:
.. code-block:: python
files_info = doc.get_files_info()
for file_info in files_info:
    print(f"Path: {file_info['path']}")
    print(f"Size: {file_info['size']} bytes")
    print(f"Compressed: {file_info['compressed_size']} bytes")
    print(f"Modified: {file_info['modified']}")
    print("---")
# Calculate total size
total_size = sum(f['size'] for f in files_info)
total_compressed = sum(f['compressed_size'] for f in files_info)
# Guard against an empty archive to avoid dividing by zero
compression_ratio = (1 - total_compressed / total_size) * 100 if total_size else 0.0
print(f"Total size: {total_size} bytes")
print(f"Compressed size: {total_compressed} bytes")
print(f"Compression ratio: {compression_ratio:.1f}%")
Error Handling
--------------
Robust Error Handling
~~~~~~~~~~~~~~~~~~~~~~
epub-utils provides specific exception types for better error handling:
.. code-block:: python
from epub_utils import Document
from epub_utils.exceptions import ParseError
try:
    doc = Document("potentially_corrupt.epub")
    # Try to access metadata
    title = doc.package.metadata.title
    print(f"Successfully loaded: {title}")
except ParseError as e:
    print(f"EPUB parsing error: {e}")
except FileNotFoundError:
    print("EPUB file not found")
except Exception as e:
    print(f"Unexpected error: {e}")
Graceful Degradation
~~~~~~~~~~~~~~~~~~~~
Handle missing or malformed metadata gracefully:
.. code-block:: python
from epub_utils.exceptions import ParseError
def safe_get_metadata(doc, field_name, default="Unknown"):
    """Safely extract metadata field with fallback."""
    try:
        return getattr(doc.package.metadata, field_name, default)
    except (AttributeError, ParseError):
        return default
# Usage
title = safe_get_metadata(doc, 'title', 'Untitled')
author = safe_get_metadata(doc, 'creator', 'Unknown Author')
Next Steps
----------
- Explore the complete :doc:`api-reference` for detailed class documentation
- See more :doc:`examples` for advanced use cases
- Learn about :doc:`epub-standards` to understand the underlying specifications
- Check out the :doc:`cli-reference` for command-line equivalents
================================================
FILE: docs/changelog.rst
================================================
.. _changelog:
=========
Changelog
=========
.. _v_0_1_0a1:
0.1.0a1 (2025-06-14)
--------------------
* Added ``toc`` retrieval as dictionary (:issue:`4`)
* Added comprehensive navigation reading support (`#38 <https://github.com/ernestofgonzalez/epub-utils/pull/38>`__, `#39 <https://github.com/ernestofgonzalez/epub-utils/pull/39>`__, `#42 <https://github.com/ernestofgonzalez/epub-utils/pull/42>`__)
* Added macOS test runner (`#41 <https://github.com/ernestofgonzalez/epub-utils/pull/41>`__)
* Added support for Python 3.8 and Python 3.9 (`#40 <https://github.com/ernestofgonzalez/epub-utils/pull/40>`__)
.. _v_0_0_0a5:
0.0.0a5 (2025-06-01)
--------------------
* Added file retrieval by file path. (:issue:`22`)
* Added pretty printing to XML inspection (:issue:`23`)
.. _v_0_0_0a4:
0.0.0a4 (2025-05-26)
--------------------
* Added file inspection and ``files`` CLI command. (`#20 <https://github.com/ernestofgonzalez/epub-utils/pull/20>`__)
* Added content inspection and ``content`` CLI command (:issue:`5`)
* Added manifest parsing and ``manifest`` CLI command (`#13 <https://github.com/ernestofgonzalez/epub-utils/pull/13>`__)
* Added spine parsing and ``spine`` CLI command (`#9 <https://github.com/ernestofgonzalez/epub-utils/pull/9>`__)
* Added key-value support for ``metadata`` CLI command
* Fixed table of contents parsing for OEBPS 1 (`#11 <https://github.com/ernestofgonzalez/epub-utils/pull/11>`__). Thanks, `Christian Klein <https://github.com/cklein>`__.
.. _v_0_0_0a3:
0.0.0a3 (2025-05-04)
--------------------
* Fixed ``toc`` command. (:issue:`1`)
.. _v_0_0_0a2:
0.0.0a2 (2025-05-03)
--------------------
* Added classifiers
.. _v_0_0_0a1:
0.0.0a1 (2025-05-03)
--------------------
* Initial release to PyPI
================================================
FILE: docs/cli-reference.rst
================================================
CLI Reference
=============
This reference documents all available command-line options and commands for ``epub-utils``.
Synopsis
--------
.. code-block:: text
epub-utils [GLOBAL_OPTIONS] EPUB_FILE COMMAND [COMMAND_OPTIONS]
Global Options
--------------
``-h, --help``
Show help message and exit
``-v, --version``
Show program version and exit
``-pp, --pretty-print``
Pretty-print XML output with proper indentation (applies to xml and raw formats only)
Commands
--------
All commands operate on an EPUB file and support the ``--format`` and ``--pretty-print`` options unless otherwise noted.
container
~~~~~~~~~
Display the container.xml file contents.
**Syntax**:
.. code-block:: bash
epub-utils EPUB_FILE container [--format FORMAT] [--pretty-print]
**Description**:
The container command shows the contents of META-INF/container.xml, which defines the
location of the main package file within the EPUB.
**Supported formats**: ``xml`` (default), ``raw``
**Examples**:
.. code-block:: bash
# Show container with syntax highlighting
epub-utils book.epub container
# Show raw container XML
epub-utils book.epub container --format raw
# Show container with pretty formatting
epub-utils book.epub container --pretty-print
# Combine both options
epub-utils book.epub container --format raw --pretty-print
**Sample output**:
.. code-block:: xml
<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
<rootfiles>
<rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
</rootfiles>
</container>
package
~~~~~~~
Display the main package (OPF) file contents.
**Syntax**:
.. code-block:: bash
epub-utils EPUB_FILE package [--format FORMAT] [--pretty-print]
**Description**:
The package command shows the complete OPF (Open Packaging Format) file, which contains
metadata, manifest, and spine information.
**Supported formats**: ``xml`` (default), ``raw``
**Examples**:
.. code-block:: bash
# Show package with syntax highlighting
epub-utils book.epub package
# Show raw package XML for processing
epub-utils book.epub package --format raw | xmllint --format -
# Show package with pretty formatting
epub-utils book.epub package --pretty-print
toc
~~~
Display the table of contents file.
**Syntax**:
.. code-block:: bash
epub-utils EPUB_FILE toc [--format FORMAT] [--pretty-print] [--ncx | --nav]
**Description**:
Shows the table of contents, which can be either an NCX file (EPUB 2.x) or a
Navigation Document (EPUB 3.x). By default, automatically detects and uses the
appropriate format for the EPUB version.
**Options**:
``--ncx``
Force retrieval of NCX file (EPUB 2 navigation control file). For EPUB 2,
this is the same as the default behavior. For EPUB 3, this specifically
accesses the NCX file if present for backward compatibility.
``--nav``
Force retrieval of Navigation Document (EPUB 3 navigation file). Only works
with EPUB 3 documents that have a Navigation Document.
**Note**: The ``--ncx`` and ``--nav`` flags are mutually exclusive.
**Supported formats**: ``xml`` (default), ``raw``
**Examples**:
.. code-block:: bash
# Show TOC with highlighting (auto-detect format)
epub-utils book.epub toc
# Extract navigation structure
epub-utils book.epub toc --format raw
# Show TOC with pretty formatting
epub-utils book.epub toc --pretty-print
# Force NCX format (EPUB 2 style)
epub-utils book.epub toc --ncx
# Force Navigation Document (EPUB 3 style)
epub-utils book.epub toc --nav
metadata
~~~~~~~~
Display metadata information from the package file.
**Syntax**:
.. code-block:: bash
epub-utils EPUB_FILE metadata [--format FORMAT] [--pretty-print]
**Description**:
Extracts and displays Dublin Core and EPUB-specific metadata from the package file.
**Supported formats**: ``xml`` (default), ``raw``, ``kv``
**Examples**:
.. code-block:: bash
# Show formatted metadata
epub-utils book.epub metadata
# Get key-value pairs for scripting
epub-utils book.epub metadata --format kv
# Raw metadata XML
epub-utils book.epub metadata --format raw
# Show metadata with pretty formatting
epub-utils book.epub metadata --pretty-print
**Key-value output format**:
.. code-block:: text
title: The Great Gatsby
creator: F. Scott Fitzgerald
language: en
identifier: urn:uuid:12345678-1234-1234-1234-123456789abc
publisher: Scribner
date: 2021-01-01
subject: Fiction, Classic Literature
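When the key-value output is consumed from Python rather than a shell, it can be turned into a dictionary. This is a minimal sketch that assumes the output is exactly one ``key: value`` pair per line, as in the sample above; ``parse_kv`` is a hypothetical helper and not part of ``epub-utils``. If a key repeats, the last value wins.

```python
from typing import Dict

SAMPLE_KV = """title: The Great Gatsby
creator: F. Scott Fitzgerald
language: en
identifier: urn:uuid:12345678-1234-1234-1234-123456789abc
publisher: Scribner
date: 2021-01-01
subject: Fiction, Classic Literature
"""


def parse_kv(text: str) -> Dict[str, str]:
    """Parse 'key: value' lines into a dictionary, ignoring other lines."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        # Split on the first ": " only, so values such as URNs stay intact.
        key, separator, value = line.partition(": ")
        if separator:
            fields[key.strip()] = value.strip()
    return fields


kv_fields = parse_kv(SAMPLE_KV)
```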
manifest
~~~~~~~~
Display the manifest section from the package file.
**Syntax**:
.. code-block:: bash
epub-utils EPUB_FILE manifest [--format FORMAT] [--pretty-print]
**Description**:
Shows the manifest, which lists all files included in the EPUB package with their
IDs, file paths, and media types.
**Supported formats**: ``xml`` (default), ``raw``
**Examples**:
.. code-block:: bash
# Show manifest with highlighting
epub-utils book.epub manifest
# Find all CSS files
epub-utils book.epub manifest --format raw | grep 'media-type="text/css"'
# Show manifest with pretty formatting
epub-utils book.epub manifest --pretty-print
# Count content files
epub-utils book.epub manifest --format raw | grep -c 'application/xhtml+xml'
spine
~~~~~
Display the spine section from the package file.
**Syntax**:
.. code-block:: bash
epub-utils EPUB_FILE spine [--format FORMAT] [--pretty-print]
**Description**:
Shows the spine, which defines the default reading order of the book's content.
**Supported formats**: ``xml`` (default), ``raw``
**Examples**:
.. code-block:: bash
# Show spine with highlighting
epub-utils book.epub spine
# Extract reading order
epub-utils book.epub spine --format raw
# Show spine with pretty formatting
epub-utils book.epub spine --pretty-print
content
~~~~~~~
Display the content of a document by its manifest item ID.
**Syntax**:
.. code-block:: bash
epub-utils EPUB_FILE content ITEM_ID [--format FORMAT] [--pretty-print]
**Description**:
Extracts and displays the content of a specific document within the EPUB, identified
by its manifest item ID.
**Supported formats**: ``xml`` (default), ``raw``, ``plain``
**Arguments**:
- ``ITEM_ID``: The ID of the item as defined in the manifest
**Examples**:
.. code-block:: bash
# Show content with syntax highlighting
epub-utils book.epub content chapter1
# Get raw HTML/XHTML
epub-utils book.epub content intro --format raw
# Extract plain text (no HTML tags)
epub-utils book.epub content chapter2 --format plain
# Show content with pretty formatting
epub-utils book.epub content chapter1 --pretty-print
**Finding item IDs**:
.. code-block:: bash
# First check the manifest for available IDs
epub-utils book.epub manifest | grep 'id='
# Then extract specific content
epub-utils book.epub content found_id --format plain
files
~~~~~
List all files in the EPUB archive with metadata, or display content of a specific file.
**Syntax**:
.. code-block:: bash
epub-utils EPUB_FILE files [FILE_PATH] [--format FORMAT] [--pretty-print]
**Description**:
When used without a file path, provides detailed information about all files contained
within the EPUB archive, including sizes, compressed sizes, and modification dates.
When used with a file path, displays the content of the specified file within the EPUB archive.
**Supported formats**:
- For file listing: ``table`` (default), ``raw``
- For file content: ``raw``, ``xml`` (default), ``plain``, ``kv``
**Arguments**:
- ``FILE_PATH`` (optional): Path to a specific file within the EPUB archive
**Examples**:
.. code-block:: bash
# List all files in table format (default)
epub-utils book.epub files
# Get simple file list
epub-utils book.epub files --format raw
# Count total files
epub-utils book.epub files --format raw | wc -l
# Display content of a specific XHTML file
epub-utils book.epub files OEBPS/chapter1.xhtml
# Display XHTML file in different formats
epub-utils book.epub files OEBPS/chapter1.xhtml --format raw
epub-utils book.epub files OEBPS/chapter1.xhtml --format xml --pretty-print
epub-utils book.epub files OEBPS/chapter1.xhtml --format plain
# Display non-XHTML files (CSS, etc.)
epub-utils book.epub files OEBPS/styles/main.css
**Key differences from content command**:
- ``files`` uses file paths within the EPUB archive
- ``content`` uses manifest item IDs
- ``files`` can access any file, including CSS, XML, and image files
- ``content`` only accesses files listed in the manifest
**Sample table output**:
.. code-block:: text
File Information for book.epub
┌────────────────────────────────────────┬──────────┬──────────────┬─────────────────────┐
│ Path │ Size │ Compressed │ Modified │
├────────────────────────────────────────┼──────────┼──────────────┼─────────────────────┤
│ META-INF/container.xml │ 230 B │ 140 B │ 2021-01-01 10:00:00│
│ OEBPS/content.opf │ 2.1 KB │ 856 B │ 2021-01-01 10:00:00│
│ OEBPS/Text/chapter01.xhtml │ 12.4 KB │ 3.2 KB │ 2021-01-01 10:00:00│
└────────────────────────────────────────┴──────────┴──────────────┴─────────────────────┘
Format Options
--------------
Most commands support the ``--format`` and ``--pretty-print`` options to control output formatting:
``xml`` (default for most commands)
Syntax-highlighted, formatted XML output
``raw``
Unformatted content exactly as stored in the EPUB
``kv`` (``metadata`` command; also ``files`` when displaying a file)
Key-value pairs suitable for shell scripting
``plain`` (``content`` command; also ``files`` when displaying a file)
Plain text with HTML tags stripped
``table`` (files command only)
Formatted table with aligned columns
Pretty Print Option
~~~~~~~~~~~~~~~~~~~
The ``--pretty-print`` (or ``-pp``) option formats XML output with proper indentation and structure:
.. code-block:: bash
# Default output (with syntax highlighting but compact)
epub-utils book.epub metadata
# Pretty-printed output (with proper indentation)
epub-utils book.epub metadata --pretty-print
# Combine with raw format for clean, formatted XML
epub-utils book.epub package --format raw --pretty-print
**Note**: The pretty-print option applies to both ``xml`` and ``raw`` formats, but has no effect on ``kv``, ``plain``, or ``table`` formats.
Exit Codes
----------
epub-utils uses standard exit codes:
- ``0``: Success
- ``1``: General error (file not found, invalid EPUB, etc.)
- ``2``: Command line usage error
Scripts can check the exit code for error handling:
.. code-block:: bash
if epub-utils book.epub metadata >/dev/null 2>&1; then
echo "EPUB is valid"
else
echo "EPUB has issues"
fi
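The same check can be made from Python with ``subprocess``. The sketch below maps the three documented exit codes to short labels; ``run_and_classify`` is a hypothetical helper, and the demonstration uses the Python interpreter as a stand-in command so that it runs without ``epub-utils`` or an EPUB file present.

```python
import subprocess
import sys
from typing import List

# Labels for the exit codes documented above.
EXIT_MEANINGS = {0: "success", 1: "general error", 2: "usage error"}


def run_and_classify(command: List[str]) -> str:
    """Run a command quietly and describe its exit code."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return EXIT_MEANINGS.get(
        completed.returncode, f"unexpected exit code {completed.returncode}"
    )


# Stand-in commands that simply exit with a chosen status.
ok_label = run_and_classify([sys.executable, "-c", "raise SystemExit(0)"])
usage_label = run_and_classify([sys.executable, "-c", "raise SystemExit(2)"])
```

With the tool installed, the command list would follow the synopsis, for example ``["epub-utils", "book.epub", "metadata"]``.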
Environment Variables
---------------------
epub-utils respects these environment variables:
``NO_COLOR``
Disable color output when set to any value
``FORCE_COLOR``
Force color output even when not outputting to a terminal
**Examples**:
.. code-block:: bash
# Disable colors
NO_COLOR=1 epub-utils book.epub metadata
# Force colors in pipes
FORCE_COLOR=1 epub-utils book.epub metadata | less -R
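The interaction of the two variables can be summarised as a small decision function. This is only a sketch of the behaviour described above, not the tool's actual code: ``should_use_color`` is a hypothetical name, and the precedence of ``NO_COLOR`` over ``FORCE_COLOR`` when both are set is an assumption.

```python
from typing import Mapping


def should_use_color(env: Mapping[str, str], is_tty: bool) -> bool:
    """Decide whether to emit color, given the environment and terminal state."""
    if "NO_COLOR" in env:      # set to any value: never use color
        return False           # assumption: NO_COLOR wins over FORCE_COLOR
    if "FORCE_COLOR" in env:   # color even when output is piped
        return True
    return is_tty              # default: color only on a terminal


piped_default = should_use_color({}, is_tty=False)
piped_forced = should_use_color({"FORCE_COLOR": "1"}, is_tty=False)
```

In a real program the arguments would typically be ``os.environ`` and ``sys.stdout.isatty()``.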
Common Usage Patterns
---------------------
Validation Workflow
~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
#!/bin/zsh
# validate-epub.sh - Basic EPUB validation
epub_file="$1"
echo "Validating: $epub_file"
# Check container
if ! epub-utils "$epub_file" container >/dev/null 2>&1; then
echo "❌ Invalid container"
exit 1
fi
# Check package
if ! epub-utils "$epub_file" package >/dev/null 2>&1; then
echo "❌ Invalid package"
exit 1
fi
# Check required metadata
metadata=$(epub-utils "$epub_file" metadata --format kv 2>/dev/null)
if ! echo "$metadata" | grep -q "^title:"; then
echo "⚠️ Missing title"
fi
if ! echo "$metadata" | grep -q "^creator:"; then
echo "⚠️ Missing author"
fi
echo "✅ EPUB structure is valid"
Metadata Extraction
~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
#!/bin/zsh
# extract-metadata.sh - Extract metadata to CSV
echo "filename,title,author,language,publisher" > metadata.csv
for epub in *.epub; do
if [[ -f "$epub" ]]; then
metadata=$(epub-utils "$epub" metadata --format kv 2>/dev/null)
title=$(echo "$metadata" | grep "^title:" | cut -d' ' -f2- | tr ',' ';')
author=$(echo "$metadata" | grep "^creator:" | cut -d' ' -f2- | tr ',' ';')
language=$(echo "$metadata" | grep "^language:" | cut -d' ' -f2-)
publisher=$(echo "$metadata" | grep "^publisher:" | cut -d' ' -f2- | tr ',' ';')
echo "$epub,$title,$author,$language,$publisher" >> metadata.csv
fi
done
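The shell script above replaces commas with semicolons to keep the CSV well formed. A Python variant can let the ``csv`` module quote such values instead. The sketch below works on key-value text that has already been captured (for example from ``metadata --format kv``); ``kv_to_row`` and ``write_csv`` are hypothetical helpers, and the one-pair-per-line input format is assumed from the sample output in this reference.

```python
import csv
import io
from typing import Iterable, List, TextIO

COLUMNS = ["title", "creator", "language", "publisher"]


def kv_to_row(filename: str, kv_text: str) -> List[str]:
    """Build one CSV row from key-value metadata text."""
    fields = {}
    for line in kv_text.splitlines():
        key, separator, value = line.partition(": ")
        if separator and key in COLUMNS:
            fields.setdefault(key, value.strip())  # keep the first occurrence
    return [filename] + [fields.get(column, "") for column in COLUMNS]


def write_csv(rows: Iterable[List[str]], stream: TextIO) -> None:
    """Write a header plus rows; csv.writer quotes values containing commas."""
    writer = csv.writer(stream)
    writer.writerow(["filename", "title", "author", "language", "publisher"])
    writer.writerows(rows)


demo_out = io.StringIO()
demo_kv = "title: The Great Gatsby\ncreator: Fitzgerald, F. Scott\nlanguage: en"
write_csv([kv_to_row("gatsby.epub", demo_kv)], demo_out)
```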
Content Analysis
~~~~~~~~~~~~~~~~
.. code-block:: bash
#!/bin/zsh
# analyze-content.sh - Analyze EPUB content structure
epub_file="$1"
echo "Content Analysis for: $epub_file"
echo "=================================="
# Get content files from manifest
content_ids=$(epub-utils "$epub_file" manifest --format raw | \
grep 'media-type="application/xhtml+xml"' | \
sed 's/.*id="\([^"]*\)".*/\1/')
total_words=0
for content_id in $content_ids; do
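# Test the exit status of epub-utils itself; in a pipeline the `if` would only see the status of `wc`
if text=$(epub-utils "$epub_file" content "$content_id" --format plain 2>/dev/null); then
word_count=$(printf '%s\n' "$text" | wc -w)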
echo "Content ID '$content_id': $word_count words"
total_words=$((total_words + word_count))
fi
done
echo "=================================="
echo "Total words: $total_words"
Error Handling
--------------
Always handle errors when using epub-utils in scripts:
.. code-block:: bash
# Check if file exists first
if [[ ! -f "$epub_file" ]]; then
echo "Error: File '$epub_file' not found" >&2
exit 1
fi
# Capture and handle command errors
if ! output=$(epub-utils "$epub_file" metadata --format kv 2>&1); then
echo "Error processing EPUB: $output" >&2
exit 1
fi
# Check for specific issues
if [[ -z "$output" ]]; then
echo "Warning: No metadata found" >&2
fi
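The same checks can be applied when calling the CLI from Python. The sketch below is a generic wrapper around ``subprocess.run``; ``run_command`` is a hypothetical helper name, and the epub-utils invocation in the trailing comment assumes the tool is installed and on ``PATH``.

```python
import subprocess


def run_command(args, timeout=30):
    """Run a command and return its stdout, raising RuntimeError on any failure."""
    try:
        result = subprocess.run(
            args, capture_output=True, text=True, timeout=timeout
        )
    except FileNotFoundError as exc:
        raise RuntimeError(f"command not found: {args[0]}") from exc
    except subprocess.TimeoutExpired as exc:
        raise RuntimeError(f"timed out after {timeout}s") from exc
    if result.returncode != 0:
        # Surface stderr so the caller sees why the command failed
        raise RuntimeError(result.stderr.strip() or f"exit code {result.returncode}")
    return result.stdout


# With epub-utils installed, a call might look like:
# run_command(["epub-utils", "book.epub", "metadata", "--format", "kv"])
```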
Performance Tips
----------------
1. **Use raw format for large-scale processing** to avoid syntax highlighting overhead
2. **Pipe efficiently** to avoid unnecessary intermediate files
3. **Process files in parallel** when handling many EPUBs
4. **Cache results** when running the same command multiple times
.. code-block:: bash
# Efficient parallel processing
# -print0/-0 keep file names with spaces intact; each name is passed to sh as $1
find . -name "*.epub" -print0 | xargs -0 -n 1 -P 4 \
sh -c 'echo "$1: $(epub-utils "$1" metadata --format kv | grep "^title:" | cut -d" " -f2-)"' _
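Parallel processing can also be driven from Python. The following is a sketch using a thread pool, which is usually sufficient when each task mostly waits on a subprocess or on file I/O. ``map_parallel`` is a hypothetical helper; ``func`` stands for whatever per-file work you need.

```python
from concurrent.futures import ThreadPoolExecutor


def map_parallel(func, paths, workers=4):
    """Apply func to each path concurrently.

    Returns (path, result) pairs in input order; if func raises,
    the exception object is stored in place of the result.
    """
    def safe(path):
        try:
            return path, func(path)
        except Exception as exc:  # keep going if one file fails
            return path, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(safe, paths))
```

Collecting exceptions instead of raising them lets a batch run finish even when a few files are corrupted.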
Troubleshooting
---------------
Common Issues and Solutions
~~~~~~~~~~~~~~~~~~~~~~~~~~~
**"Invalid value for 'PATH': File does not exist"**
Check the file path and ensure the EPUB file exists.
**"ParseError: Unable to parse container.xml"**
The EPUB file may be corrupted. Verify it's a valid ZIP file.
**"Content with id 'X' not found"**
Check available content IDs using the manifest command first.
**No color output**
Ensure your terminal supports colors and check the ``NO_COLOR`` environment variable.
**Large file performance**
Use ``--format raw`` for better performance with large files.
================================================
FILE: docs/cli-tutorial.rst
================================================
Use as a command-line tool
==========================
This tutorial will guide you through using ``epub-utils`` from the command line. We'll cover all
available commands with practical examples and tips for everyday usage.
Getting Started
---------------
The basic syntax for epub-utils is:
.. code-block:: bash
epub-utils [OPTIONS] EPUB_FILE COMMAND [COMMAND_OPTIONS]
Let's start with a simple example:
.. code-block:: bash
# Display help
epub-utils --help
# Check version
epub-utils --version
Basic File Inspection
---------------------
Container Information
~~~~~~~~~~~~~~~~~~~~~
The container command shows the EPUB's container.xml file, which points to the main package file:
.. code-block:: bash
# Show container with syntax highlighting (default)
epub-utils book.epub container
# Show raw XML without highlighting
epub-utils book.epub container --format raw
# Show container with pretty formatting
epub-utils book.epub container --pretty-print
**Example output**:
.. code-block:: xml
<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
<rootfiles>
<rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
</rootfiles>
</container>
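To read the package path out of this XML programmatically, a standard-library sketch such as the following may help. It is independent of epub-utils; ``rootfile_paths`` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def rootfile_paths(container_xml):
    """Return the full-path of every <rootfile> in a container.xml document (str or bytes)."""
    root = ET.fromstring(container_xml)
    return [
        el.get("full-path")
        for el in root.findall("c:rootfiles/c:rootfile", CONTAINER_NS)
    ]
```

The input could be the output of ``epub-utils book.epub container --format raw``.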
Package Information
~~~~~~~~~~~~~~~~~~~
The package command displays the main OPF (Open Packaging Format) file:
.. code-block:: bash
# Show package file with highlighting
epub-utils book.epub package
# Show raw package content
epub-utils book.epub package --format raw
# Show package with pretty formatting
epub-utils book.epub package --pretty-print
This reveals the complete EPUB structure including metadata, manifest, and spine.
Working with Metadata
----------------------
Extracting Metadata
~~~~~~~~~~~~~~~~~~~~
The metadata command is perfect for getting book information:
.. code-block:: bash
# Pretty-printed metadata with highlighting
epub-utils book.epub metadata
# Key-value format for scripting
epub-utils book.epub metadata --format kv
# Metadata with pretty formatting
epub-utils book.epub metadata --pretty-print
**Example key-value output**:
.. code-block:: text
title: The Great Gatsby
creator: F. Scott Fitzgerald
language: en
identifier: urn:uuid:12345678-1234-1234-1234-123456789abc
publisher: Scribner
date: 2021-01-01
subject: Fiction, Classic Literature
Scripting with Metadata
~~~~~~~~~~~~~~~~~~~~~~~~
The key-value format prints one ``key: value`` pair per line, which makes it easy to process with tools such as ``grep`` and ``cut``:
.. code-block:: bash
# Extract just the title
epub-utils book.epub metadata --format kv | grep "^title:" | cut -d' ' -f2-
# Get author name
author=$(epub-utils book.epub metadata --format kv | grep "^creator:" | cut -d' ' -f2-)
echo "Author: $author"
# Batch process multiple files
for epub in *.epub; do
title=$(epub-utils "$epub" metadata --format kv | grep "^title:" | cut -d' ' -f2-)
echo "$epub: $title"
done
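For scripts written in Python rather than shell, the key-value output can be parsed into a dictionary. This is a sketch under two assumptions: the output uses the ``key: value`` layout shown above, and a key that occurs more than once (for example several creators) appears on several lines. ``parse_kv`` is a hypothetical helper name.

```python
from collections import defaultdict


def parse_kv(text):
    """Parse `key: value` lines into a dict mapping each key to a list of values."""
    parsed = defaultdict(list)
    for line in text.splitlines():
        # Split on the first ": " only, so values such as urn:uuid:... stay intact
        key, sep, value = line.partition(": ")
        if sep:  # skip lines without a separator
            parsed[key.strip()].append(value.strip())
    return dict(parsed)
```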
Understanding EPUB Structure
-----------------------------
Table of Contents
~~~~~~~~~~~~~~~~~
View the navigation structure of your EPUB:
.. code-block:: bash
# Show table of contents with highlighting (auto-detect format)
epub-utils book.epub toc
# Raw TOC for processing
epub-utils book.epub toc --format raw
# TOC with pretty formatting
epub-utils book.epub toc --pretty-print
**EPUB Version-Specific Access**:
For precise control over which navigation format to access:
.. code-block:: bash
# Force NCX format (EPUB 2 navigation control file)
epub-utils book.epub toc --ncx
# Force Navigation Document (EPUB 3 navigation file)
epub-utils book.epub toc --nav
**Use Cases**:
- Use ``--ncx`` when you specifically need the EPUB 2 style navigation or want to access backward-compatible NCX in EPUB 3
- Use ``--nav`` when you specifically need the EPUB 3 Navigation Document features
- Use the default (no flags) for general TOC access that works with any EPUB version
Manifest Inspection
~~~~~~~~~~~~~~~~~~~
The manifest lists all files contained in the EPUB:
.. code-block:: bash
# View manifest with syntax highlighting
epub-utils book.epub manifest
# Raw manifest content
epub-utils book.epub manifest --format raw
# Manifest with pretty formatting
epub-utils book.epub manifest --pretty-print
**What you'll see**: Each item in the manifest includes:
- ``id``: Unique identifier for the item
- ``href``: File path within the EPUB
- ``media-type``: MIME type of the file
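These three attributes can be collected with the standard library instead of ``grep`` and ``sed``. The sketch below assumes the raw manifest output is a well-formed ``<manifest>`` element, with or without the OPF namespace declared; ``manifest_items`` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"


def manifest_items(manifest_xml):
    """Return a list of {id, href, media-type} dicts from a <manifest> element."""
    root = ET.fromstring(manifest_xml)
    items = []
    for el in root.iter():
        # Match <item> with or without the OPF namespace
        if el.tag in ("item", f"{{{OPF_NS}}}item"):
            items.append({k: el.get(k) for k in ("id", "href", "media-type")})
    return items
```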
Spine Information
~~~~~~~~~~~~~~~~~
The spine defines the reading order of the book:
.. code-block:: bash
# View spine with highlighting
epub-utils book.epub spine
# Raw spine for processing
epub-utils book.epub spine --format raw
Content Extraction
------------------
Viewing Document Content
~~~~~~~~~~~~~~~~~~~~~~~~
Extract content from specific documents using their manifest ID:
.. code-block:: bash
# Show content with syntax highlighting
epub-utils book.epub content chapter1
# Raw HTML/XHTML content
epub-utils book.epub content chapter1 --format raw
# Plain text (HTML tags stripped)
epub-utils book.epub content chapter1 --format plain
**Finding Content IDs**: Use the manifest command to see available content IDs:
.. code-block:: bash
# First, check the manifest for available IDs
epub-utils book.epub manifest
# Then extract specific content
epub-utils book.epub content intro --format plain
File Listing and Content Access
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Get detailed information about all files in the EPUB, or access specific file content:
.. code-block:: bash
# Formatted table of files
epub-utils book.epub files
# Raw file list
epub-utils book.epub files --format raw
# Display content of a specific file by path
epub-utils book.epub files OEBPS/chapter1.xhtml
# Access different file types
epub-utils book.epub files META-INF/container.xml
epub-utils book.epub files OEBPS/styles/main.css
epub-utils book.epub files OEBPS/images/cover.jpg
# Different output formats for XHTML content
epub-utils book.epub files OEBPS/chapter1.xhtml --format raw
epub-utils book.epub files OEBPS/chapter1.xhtml --format xml --pretty-print
epub-utils book.epub files OEBPS/chapter1.xhtml --format plain
**Key advantages of the files command**:
- Access any file in the EPUB archive by its path
- No need to know manifest item IDs
- Works with all file types (XHTML, CSS, XML, images, etc.)
- Complements the ``content`` command which uses manifest IDs
Content Analysis
~~~~~~~~~~~~~~~~
Analyze EPUB content structure:
.. code-block:: bash
#!/bin/bash
# analyze-content.sh - Analyze EPUB content structure
epub_file="$1"
echo "=== Content Analysis for $epub_file ==="
# Get all content files from manifest
epub-utils "$epub_file" manifest --format raw | \
grep 'media-type="application/xhtml+xml"' | \
sed 's/.*id="\([^"]*\)".*/\1/' | \
while read -r content_id; do
echo "--- Content ID: $content_id ---"
word_count=$(epub-utils "$epub_file" content "$content_id" --format plain | wc -w)
echo "Word count: $word_count"
echo ""
done
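If you already have the raw XHTML and want a word count without a second CLI call, a rough equivalent can be sketched with ``html.parser``. This is an illustration only: how epub-utils produces its ``plain`` output is not specified here, so the counts may differ (for example, this version also counts the ``<title>`` text). ``count_words`` is a hypothetical helper name.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth of skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def count_words(xhtml):
    """Return the number of whitespace-separated words in the visible text."""
    parser = _TextExtractor()
    parser.feed(xhtml)
    parser.close()
    return len(" ".join(parser.parts).split())
```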
Output Format Options
---------------------
epub-utils supports multiple output formats for different use cases:
XML Format (Default)
~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
epub-utils book.epub metadata
# Produces syntax-highlighted, formatted XML
Raw Format
~~~~~~~~~~
.. code-block:: bash
epub-utils book.epub metadata --format raw
# Produces unformatted XML, perfect for piping to other tools
Key-Value Format
~~~~~~~~~~~~~~~~
.. code-block:: bash
epub-utils book.epub metadata --format kv
# Produces key: value pairs, ideal for scripting
Plain Text Format
~~~~~~~~~~~~~~~~~
.. code-block:: bash
epub-utils book.epub content chapter1 --format plain
# Strips HTML tags, produces readable text
Pretty-Print Option
~~~~~~~~~~~~~~~~~~~
Use the ``--pretty-print`` (or ``-pp``) option to format XML output with proper indentation:
.. code-block:: bash
# Default output (compact XML)
epub-utils book.epub metadata --format raw
# Pretty-formatted output (with indentation)
epub-utils book.epub metadata --format raw --pretty-print
# Works with syntax highlighting too
epub-utils book.epub package --pretty-print
Next Steps
----------
Now that you're familiar with the CLI basics, you might want to:
- Explore the :doc:`api-tutorial` for programmatic access
- Check out more :doc:`examples` for real-world use cases
- Learn about :doc:`epub-standards` for deeper understanding
- Contribute to the project via :doc:`contributing`
================================================
FILE: docs/conf.py
================================================
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
project = 'epub-utils'
copyright = '2025, Ernesto González'
author = 'Ernesto González'
release = '0.1.0a1'
# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
extensions = [
'sphinx.ext.autodoc',
'sphinx.ext.autosummary',
'sphinx.ext.napoleon',
'sphinx.ext.viewcode',
'sphinx_copybutton',
'sphinx_issues',
]
templates_path = ['_templates']
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# -- Napoleon settings -------------------------------------------------------
napoleon_google_docstring = True
napoleon_numpy_docstring = True
napoleon_include_init_with_doc = False
napoleon_include_private_with_doc = False
# -- Autodoc settings --------------------------------------------------------
autodoc_member_order = 'bysource'
# autodoc_default_flags was removed in Sphinx 4.0; autodoc_default_options replaces it
autodoc_default_options = {'members': True}
autosummary_generate = True
# -- Intersphinx mapping -----------------------------------------------------
intersphinx_mapping = {
'python': ('https://docs.python.org/3', None),
'lxml': ('https://lxml.de/', None),
}
# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = 'furo'
html_static_path = ['_static']
# Add source link in footer
html_show_sourcelink = True
html_copy_source = True
html_show_sphinx = True
# -- Linking Github issues --------------------------------------------------
# https://github.com/sloria/sphinx-issues
issues_github_path = 'ernestofgonzalez/epub-utils'
================================================
FILE: docs/contributing.rst
================================================
============
Contributing
============
We welcome contributions to ``epub-utils``! This guide will help you get started with contributing to the project.
Getting Started
===============
Setting Up Development Environment
----------------------------------
1. **Fork the Repository**
Fork the ``epub-utils`` repository on GitHub to your own account.
2. **Clone Your Fork**
.. code-block:: bash
git clone https://github.com/yourusername/epub-utils.git
cd epub-utils
3. **Set Up Development Environment**
.. code-block:: bash
# Create virtual environment
python -m venv dev-env
source dev-env/bin/activate # On Windows: dev-env\Scripts\activate
# Install in development mode
pip install -e ".[dev]"
# Or install dependencies manually
pip install -e .
pip install pytest black flake8 mypy sphinx
Project Structure
-----------------
.. code-block:: text
epub-utils/
├── src/
│ └── epub_utils/
│ ├── __init__.py
│ ├── cli.py # Command-line interface
│ ├── document.py # Main Document class
│ ├── extractors.py # Content extraction logic
│ └── formatters.py # Output formatting
├── tests/
│ ├── __init__.py
│ ├── test_document.py
│ ├── test_cli.py
│ └── fixtures/ # Test EPUB files
├── docs/
│ ├── conf.py
│ ├── index.rst
│ └── ... # Documentation files
├── pyproject.toml
├── README.md
└── CHANGELOG.md
Development Workflow
====================
Branch Strategy
---------------
- ``main`` branch: Stable, release-ready code
- ``develop`` branch: Integration branch for features
- Feature branches: ``feature/your-feature-name``
- Bug fix branches: ``fix/issue-description``
Making Changes
--------------
1. **Create a Feature Branch**
.. code-block:: bash
git checkout -b feature/your-feature-name
2. **Make Your Changes**
Follow the coding standards outlined below.
3. **Write Tests**
All new features should include comprehensive tests.
4. **Run Tests Locally**
.. code-block:: bash
# Run all tests
pytest
# Run with coverage
pytest --cov=epub_utils
# Run specific test file
pytest tests/test_document.py
5. **Check Code Quality**
.. code-block:: bash
# Format code
black src/ tests/
# Check linting
flake8 src/ tests/
# Type checking
mypy src/
6. **Update Documentation**
If your changes affect the API or add new features, update the documentation.
7. **Commit Your Changes**
.. code-block:: bash
git add .
git commit -m "Add: Brief description of your changes"
8. **Push and Create Pull Request**
.. code-block:: bash
git push origin feature/your-feature-name
Then create a pull request on GitHub.
Coding Standards
================
Python Style Guide
------------------
We follow PEP 8 with some modifications:
- **Line length**: 88 characters (Black's default)
- **String quotes**: Use double quotes for strings
- **Import sorting**: Use isort or similar tool
- **Docstrings**: Use Google-style docstrings
Code Formatting
---------------
We use **Black** for code formatting:
.. code-block:: bash
# Format all Python files
black src/ tests/
# Check formatting without making changes
black --check src/ tests/
Example of properly formatted code:
.. code-block:: python
import os
def extract_metadata(epub_path: str, format_type: str = "dict") -> dict:
"""Extract metadata from an EPUB file.
Args:
epub_path: Path to the EPUB file.
format_type: Output format ('dict', 'xml', 'json').
Returns:
Dictionary containing extracted metadata.
Raises:
FileNotFoundError: If the EPUB file doesn't exist.
ValueError: If format_type is not supported.
"""
if not os.path.exists(epub_path):
raise FileNotFoundError(f"EPUB file not found: {epub_path}")
if format_type not in ["dict", "xml", "json"]:
raise ValueError(f"Unsupported format: {format_type}")
# Implementation here...
return {}
Linting
-------
We use **ruff** for linting:
.. code-block:: bash
# Check for linting errors
make lint
Type Hints
----------
Use type hints for all function signatures:
.. code-block:: python
from typing import Any, Dict, List, Optional, Union
from pathlib import Path
def process_files(
file_paths: List[Union[str, Path]],
output_format: str = "table"
) -> Optional[Dict[str, Any]]:
"""Process multiple EPUB files."""
pass
Documentation Standards
=======================
Docstring Format
----------------
Use Google-style docstrings:
.. code-block:: python
def complex_function(param1: str, param2: int, param3: bool = False) -> dict:
"""Brief description of the function.
Longer description if needed. Explain the purpose, behavior,
and any important details about the function.
Args:
param1: Description of the first parameter.
param2: Description of the second parameter.
param3: Description of optional parameter. Defaults to False.
Returns:
Description of return value and its structure.
Raises:
ValueError: When param2 is negative.
FileNotFoundError: When the specified file doesn't exist.
Example:
Basic usage example:
>>> result = complex_function("test", 42)
>>> print(result["status"])
success
"""
pass
API Documentation
-----------------
When adding new classes or functions to the public API:
1. **Add to __init__.py** exports if appropriate
2. **Update API reference** documentation
3. **Include usage examples** in docstrings
4. **Add to tutorials** if it's a major feature
RST Documentation
-----------------
When writing RST documentation:
.. code-block:: rst
Section Title
=============
Subsection
----------
Code examples:
.. code-block:: python
# Python code here
import epub_utils
Shell commands:
.. code-block:: bash
epub-utils book.epub metadata
Testing Guidelines
==================
Test Structure
--------------
- **Unit tests**: Test individual functions and methods
- **Integration tests**: Test component interactions
- **End-to-end tests**: Test complete workflows
- **Performance tests**: Test with large files (optional)
Writing Tests
-------------
Use pytest for all tests:
.. code-block:: python
import pytest
from epub_utils import Document
from pathlib import Path
def test_document_with_invalid_file():
"""Test error handling with invalid file."""
with pytest.raises(FileNotFoundError):
Document("nonexistent.epub")
@pytest.mark.parametrize("format_type", ["dict", "xml", "json"])
def test_metadata_formats(doc_path, format_type):
"""Test different metadata formats."""
doc = Document(str(doc_path))
metadata = doc.get_metadata(format_type=format_type)
assert metadata is not None
Test Fixtures
-------------
Create test EPUB files in ``tests/fixtures/``:
.. code-block:: python
# tests/conftest.py
import pytest
from pathlib import Path
@pytest.fixture
def sample_epub():
"""Provide path to sample EPUB for testing."""
return Path(__file__).parent / "fixtures" / "sample.epub"
@pytest.fixture
def invalid_epub():
"""Provide path to invalid EPUB for error testing."""
return Path(__file__).parent / "fixtures" / "invalid.epub"
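Fixtures can also be generated at test time instead of being committed as binary files. The sketch below writes a very small EPUB-shaped archive with ``zipfile``; it is not a fully valid EPUB (the manifest and spine are empty), and whether ``Document`` accepts it is an assumption you would need to verify. ``write_minimal_epub`` is a hypothetical helper name.

```python
import zipfile

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

PACKAGE_OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="bookid">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>Sample</dc:title>
    <dc:language>en</dc:language>
  </metadata>
  <manifest/>
  <spine/>
</package>
"""


def write_minimal_epub(path):
    """Write a tiny EPUB-shaped archive to a path or file object and return it."""
    with zipfile.ZipFile(path, "w") as archive:
        # The mimetype entry goes first and is stored without compression
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML,
                         compress_type=zipfile.ZIP_DEFLATED)
        archive.writestr("OEBPS/content.opf", PACKAGE_OPF,
                         compress_type=zipfile.ZIP_DEFLATED)
    return path


# Possible use in tests/conftest.py with pytest's built-in tmp_path fixture:
# @pytest.fixture
# def generated_epub(tmp_path):
#     return write_minimal_epub(tmp_path / "generated.epub")
```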
Running Tests
-------------
.. code-block:: bash
# Run all tests
make test
# Run specific test file
pytest tests/test_document.py
Types of Contributions
======================
Bug Reports
-----------
When reporting bugs:
1. Check existing issues first
2. Use the issue template if available
3. Provide minimal reproduction case
4. Include system information
.. code-block:: text
**Bug Description**
Clear description of the bug.
**Steps to Reproduce**
1. Step one
2. Step two
3. Step three
**Expected Behavior**
What should happen.
**Actual Behavior**
What actually happens.
**Environment**
- epub-utils version:
- Python version:
- Operating system:
**Sample File**
Attach or link to EPUB file if relevant.
Feature Requests
----------------
For new features:
1. Describe the use case clearly
2. Explain why it's valuable to users
3. Suggest implementation approach if you have ideas
4. Consider backward compatibility
Documentation Improvements
--------------------------
Documentation contributions are highly valued:
- Fix typos and grammar errors
- Improve clarity of explanations
- Add more examples to existing docs
- Create new tutorials for common use cases
- Update outdated information
Code Contributions
------------------
Areas where contributions are welcome:
1. Performance improvements
2. New output formats
3. Additional EPUB validation
4. Better error handling
5. CLI usability enhancements
6. Support for EPUB 3 features
Release Process
===============
Versioning
----------
We follow `Semantic Versioning <https://semver.org/>`_:
- MAJOR: Incompatible API changes
- MINOR: New functionality (backward compatible)
- PATCH: Bug fixes (backward compatible)
Version format: ``MAJOR.MINOR.PATCH`` (e.g., ``1.2.3``)
Development versions may include additional identifiers:
- ``1.2.3-dev`` (development)
- ``1.2.3rc1`` (release candidate)
================================================
FILE: docs/epub-standards.rst
================================================
==============
EPUB Standards
==============
Understanding EPUB Specifications
=================================
EPUB (Electronic Publication) is an open standard for digital books and publications.
This guide covers the EPUB specifications and how epub-utils can help you check compliance.
EPUB 3.3 Specification
======================
Current Standard
----------------
EPUB 3.3 is the current specification, published by the W3C. It defines:
- **Package Document**: Contains metadata, manifest, and spine
- **Container Format**: ZIP-based archive structure
- **Content Documents**: XHTML5, SVG, and other media types
- **Navigation Document**: Replaces NCX for table of contents
Key Components
--------------
Container Structure
~~~~~~~~~~~~~~~~~~~
.. code-block:: text
book.epub
├── META-INF/
│ ├── container.xml # Points to package document
│ └── signatures.xml # Digital signatures (optional)
├── OEBPS/ # Content folder (common name)
│ ├── package.opf # Package document
│ ├── nav.xhtml # Navigation document
│ ├── content/ # Text content
│ ├── images/ # Images
│ ├── styles/ # CSS files
│ └── fonts/ # Font files (optional)
└── mimetype # Must be first file, uncompressed
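The layout rules in the tree above (``mimetype`` first and uncompressed, ``META-INF/container.xml`` present) can be checked with the standard ``zipfile`` module. This is a minimal sketch that does not use epub-utils; ``check_container_layout`` is a hypothetical helper name.

```python
import zipfile


def check_container_layout(epub):
    """Return a list of problems found in the archive layout (empty if none)."""
    problems = []
    with zipfile.ZipFile(epub) as archive:
        infos = archive.infolist()
        if not infos or infos[0].filename != "mimetype":
            problems.append("mimetype is not the first entry")
        else:
            if infos[0].compress_type != zipfile.ZIP_STORED:
                problems.append("mimetype is compressed")
            if archive.read("mimetype") != b"application/epub+zip":
                problems.append("mimetype has unexpected content")
        if "META-INF/container.xml" not in archive.namelist():
            problems.append("META-INF/container.xml is missing")
    return problems
```

``epub`` may be a file path or a binary file object.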
Package Document (OPF)
~~~~~~~~~~~~~~~~~~~~~~
The package document defines three main sections:
**Metadata Section**:
.. code-block:: xml
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:title>Book Title</dc:title>
<dc:creator>Author Name</dc:creator>
<dc:identifier id="bookid">urn:uuid:12345</dc:identifier>
<dc:language>en</dc:language>
<meta property="dcterms:modified">2024-01-01T00:00:00Z</meta>
</metadata>
**Manifest Section**:
.. code-block:: xml
<manifest>
<item id="nav" href="nav.xhtml" media-type="application/xhtml+xml"
properties="nav"/>
<item id="chapter1" href="content/chapter1.xhtml"
media-type="application/xhtml+xml"/>
<item id="cover-image" href="images/cover.jpg"
media-type="image/jpeg" properties="cover-image"/>
</manifest>
**Spine Section**:
.. code-block:: xml
<spine>
<itemref idref="chapter1"/>
<itemref idref="chapter2"/>
</spine>
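The spine only lists ``idref`` values, so the reading order is obtained by joining it with the manifest. A sketch using the standard library, assuming a package document that declares the OPF namespace (``reading_order`` is a hypothetical helper name):

```python
import xml.etree.ElementTree as ET

NS = {"opf": "http://www.idpf.org/2007/opf"}


def reading_order(package_xml):
    """Return manifest hrefs in spine order for a package document."""
    root = ET.fromstring(package_xml)
    hrefs = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", NS)
    }
    return [
        hrefs[ref.get("idref")]
        for ref in root.findall("opf:spine/opf:itemref", NS)
        if ref.get("idref") in hrefs  # ignore dangling references
    ]
```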
Navigation Document
~~~~~~~~~~~~~~~~~~~
EPUB 3 uses XHTML navigation documents instead of NCX:
.. code-block:: html
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"
xmlns:epub="http://www.idpf.org/2007/ops">
<head>
<title>Navigation</title>
</head>
<body>
<nav epub:type="toc">
<h1>Table of Contents</h1>
<ol>
<li><a href="content/chapter1.xhtml">Chapter 1</a></li>
<li><a href="content/chapter2.xhtml">Chapter 2</a></li>
</ol>
</nav>
</body>
</html>
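Because the navigation document is XHTML, its table of contents can be read with an XML parser. The sketch below flattens nested lists into a single sequence and assumes the document is well-formed XML without HTML-only named entities; ``toc_entries`` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
EPUB = "http://www.idpf.org/2007/ops"


def toc_entries(nav_xhtml):
    """Return (label, href) pairs from the <nav epub:type="toc"> element."""
    root = ET.fromstring(nav_xhtml)
    for nav in root.iter(f"{{{XHTML}}}nav"):
        # epub:type may hold several space-separated values
        if "toc" in nav.get(f"{{{EPUB}}}type", "").split():
            return [
                ("".join(a.itertext()).strip(), a.get("href"))
                for a in nav.iter(f"{{{XHTML}}}a")
            ]
    return []
```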
EPUB Compliance with epub-utils
===============================
Validation Capabilities
-----------------------
epub-utils helps ensure EPUB compliance by:
1. **Structure Validation**: Checks container format
2. **Metadata Validation**: Verifies required elements
3. **Manifest Validation**: Ensures all files are declared
4. **Spine Validation**: Checks reading order
5. **Content Validation**: Basic XHTML structure checks
Checking Compliance
-------------------
Use epub-utils to validate EPUB structure:
.. code-block:: bash
# Check basic structure
epub-utils book.epub container
epub-utils book.epub package
# Detailed manifest information
epub-utils book.epub manifest
# Examine the package document by its path inside the archive
epub-utils book.epub files OEBPS/package.opf
Python API for Validation
~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: python
from epub_utils import Document
def validate_epub_structure(epub_path):
"""Validate basic EPUB structure."""
try:
doc = Document(epub_path)
# Check required components
checks = {
'has_container': hasattr(doc, 'container'),
'has_package': hasattr(doc, 'package'),
'has_metadata': len(doc.metadata) > 0,
'has_manifest': len(doc.manifest) > 0,
'has_spine': len(doc.spine) > 0,
}
# Check required metadata
required_metadata = ['title', 'language', 'identifier']
metadata_present = {}
for item in doc.metadata:
for req in required_metadata:
if req in item.get('name', '').lower():
metadata_present[req] = True
print("Structure Validation:")
for check, passed in checks.items():
status = "✓" if passed else "✗"
print(f" {status} {check}")
print("\nRequired Metadata:")
for req in required_metadata:
status = "✓" if metadata_present.get(req) else "✗"
print(f" {status} {req}")
return all(checks.values()) and len(metadata_present) == len(required_metadata)
except Exception as e:
print(f"Validation failed: {e}")
return False
Common Compliance Issues
========================
Missing Required Elements
-------------------------
**Problem**: EPUB missing required metadata
.. code-block:: bash
# Check metadata completeness
epub-utils book.epub metadata --format kv
**Solution**: Ensure these elements are present:
- ``dc:title``
- ``dc:language``
- ``dc:identifier`` (with unique ID)
- ``meta property="dcterms:modified"`` (EPUB 3)
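A check for these elements can be sketched directly against the package document XML with the standard library. The function name ``missing_required_metadata`` is hypothetical, and the rule that ``dcterms:modified`` is only expected when the package ``version`` starts with ``3`` follows the list above.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OPF = "http://www.idpf.org/2007/opf"


def missing_required_metadata(package_xml):
    """Return the names of required metadata elements absent from a package document."""
    root = ET.fromstring(package_xml)
    missing = []
    for name in ("title", "language", "identifier"):
        el = root.find(f".//{{{DC}}}{name}")
        # An element that is present but empty is treated as missing
        if el is None or not (el.text or "").strip():
            missing.append(f"dc:{name}")
    modified = [
        m for m in root.iter(f"{{{OPF}}}meta")
        if m.get("property") == "dcterms:modified"
    ]
    if root.get("version", "").startswith("3") and not modified:
        missing.append("dcterms:modified")
    return missing
```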
Invalid File References
-----------------------
**Problem**: Manifest references files that don't exist
.. code-block:: python
def check_file_references(epub_path):
"""Check if all manifest files exist in the archive."""
doc = Document(epub_path)
missing_files = []
for item in doc.manifest:
file_path = item.get('href')
if file_path:
# Check if file exists in the EPUB
try:
# This would need zip file checking
pass
except:
missing_files.append(file_path)
if missing_files:
print("Missing files referenced in manifest:")
for file in missing_files:
print(f" - {file}")
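The archive check left as a placeholder above can be sketched with the standard library alone. This version reads the package document straight from the ZIP rather than through ``Document``, so it makes no assumptions about the epub-utils API; ``missing_manifest_files`` and its ``opf_path`` parameter are hypothetical names. Items that point at remote resources would also be reported, since they are not in the archive.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile
from urllib.parse import unquote

NS = {"opf": "http://www.idpf.org/2007/opf"}


def missing_manifest_files(epub, opf_path):
    """Return manifest hrefs that have no matching entry in the ZIP archive."""
    with zipfile.ZipFile(epub) as archive:
        names = set(archive.namelist())
        package = ET.fromstring(archive.read(opf_path))
    base = posixpath.dirname(opf_path)
    missing = []
    for item in package.findall("opf:manifest/opf:item", NS):
        href = item.get("href", "")
        # hrefs are relative to the package document and may be percent-encoded
        target = posixpath.normpath(posixpath.join(base, unquote(href)))
        if target not in names:
            missing.append(href)
    return missing
```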
Incorrect MIME Types
--------------------
**Problem**: Wrong media-type attributes in manifest
Common correct MIME types:
- XHTML: ``application/xhtml+xml``
- CSS: ``text/css``
- JPEG: ``image/jpeg``
- PNG: ``image/png``
- NCX: ``application/x-dtbncx+xml``
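A simple way to spot such problems is to compare each declared media type with the one expected for the file extension. The extension mapping below is an assumption built from the list above (for example, it treats ``.xhtml`` as the XHTML extension); ``media_type_mismatches`` is a hypothetical helper name.

```python
import posixpath

# Expected media types keyed by file extension (illustrative, not exhaustive)
EXPECTED_MEDIA_TYPES = {
    ".xhtml": "application/xhtml+xml",
    ".css": "text/css",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".ncx": "application/x-dtbncx+xml",
}


def media_type_mismatches(items):
    """Return (href, declared, expected) for items whose media type looks wrong.

    `items` is an iterable of (href, media_type) pairs; unknown extensions are skipped.
    """
    mismatches = []
    for href, declared in items:
        extension = posixpath.splitext(href)[1].lower()
        expected = EXPECTED_MEDIA_TYPES.get(extension)
        if expected is not None and declared != expected:
            mismatches.append((href, declared, expected))
    return mismatches
```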
EPUB 2 vs EPUB 3 Differences
============================
Format Evolution
-----------------
+------------------+-------------------------+-------------------------+
| Feature | EPUB 2 | EPUB 3 |
+==================+=========================+=========================+
| Navigation | NCX file required | XHTML nav document |
+------------------+-------------------------+-------------------------+
| Content Types | XHTML 1.1, limited | XHTML5, SVG, MathML |
+------------------+-------------------------+-------------------------+
| Metadata | Dublin Core only | Enhanced metadata |
+------------------+-------------------------+-------------------------+
| Accessibility | Limited | Rich accessibility |
+------------------+-------------------------+-------------------------+
| Scripting | Not allowed | Limited JavaScript |
+------------------+-------------------------+-------------------------+
Migration Considerations
------------------------
When working with older EPUB 2 files:
.. code-block:: python
def detect_epub_version(epub_path):
"""Detect EPUB version from package document."""
doc = Document(epub_path)
# Check package document for version attribute
# This is a simplified example
for item in doc.manifest:
if 'nav' in item.get('properties', ''):
return "EPUB 3"
# Check for NCX file (EPUB 2 indicator)
for item in doc.manifest:
if item.get('media-type') == 'application/x-dtbncx+xml':
return "EPUB 2"
return "Unknown"
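The heuristic above infers the version from manifest contents, which can mislead for EPUB 3 files that also ship an NCX. A more direct sketch reads the ``version`` attribute of the root ``<package>`` element from the package document XML; ``package_version`` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def package_version(package_xml):
    """Return the `version` attribute of the root <package> element, or None."""
    root = ET.fromstring(package_xml)
    return root.get("version")
```

The XML could come from ``epub-utils book.epub package --format raw``.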
Best Practices for Compliance
=============================
Metadata Best Practices
-----------------------
1. **Always include required elements**:
.. code-block:: xml
<dc:title>Complete Book Title</dc:title>
<dc:creator>Author Full Name</dc:creator>
<dc:identifier id="bookid">urn:uuid:unique-identifier</dc:identifier>
<dc:language>en-US</dc:language>
2. **Use proper Dublin Core refinements**:
.. code-block:: xml
<dc:creator id="author">Jane Doe</dc:creator>
<meta refines="#author" property="role" scheme="marc:relators">aut</meta>
3. **Include modification date for EPUB 3**:
.. code-block:: xml
<meta property="dcterms:modified">2024-05-25T10:30:00Z</meta>
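The modification date is expressed in UTC with a trailing ``Z``, as in the example above. A small sketch for producing that form from a timezone-aware Python ``datetime`` (``modified_timestamp`` is a hypothetical helper name):

```python
from datetime import datetime, timezone


def modified_timestamp(moment=None):
    """Format a timezone-aware datetime as UTC in the CCYY-MM-DDThh:mm:ssZ form."""
    moment = moment or datetime.now(timezone.utc)
    # Convert to UTC first so that local offsets are not silently dropped
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```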
File Organization
-----------------
1. **Use consistent folder structure**
2. **Declare all files in manifest**
3. **Use proper MIME types**
4. **Include fallbacks for specialized content**
Content Guidelines
------------------
1. **Valid XHTML**: Ensure all content files are well-formed
2. **Proper encoding**: Use UTF-8 encoding
3. **Relative links**: Use relative paths for internal references
4. **Alt text**: Include alt attributes for images
Testing and Validation Tools
============================
External Validators
-------------------
- **EPUBCheck**: Official EPUB validator
- **Ace by DAISY**: Accessibility checker
- **pagina EPUB-Checker**: Online validator
Integration with epub-utils
---------------------------
.. code-block:: bash
# Basic structure check
epub-utils book.epub container
# Run EPUBCheck directly on the EPUB file (assumes epubcheck.jar is available locally)
java -jar epubcheck.jar book.epub
# Check specific components
epub-utils book.epub manifest --format raw > manifest.xml
epub-utils book.epub metadata --format raw > metadata.xml
Future Standards
================
EPUB 3.3 and Beyond
-------------------
Current developments in EPUB standards:
- **Enhanced accessibility features**
- **Better multimedia support**
- **Improved metadata vocabularies**
- **Web standards alignment**
Staying Current
---------------
- Monitor W3C EPUB Working Group
- Test with latest validators
- Follow accessibility guidelines (WCAG)
- Use semantic markup
Resources
=========
Official Specifications
-----------------------
- `EPUB 3.3 Specification <https://www.w3.org/TR/epub-33/>`_
- `EPUB Accessibility 1.1 <https://www.w3.org/TR/epub-a11y-11/>`_
- `EPUB Open Container Format 3.0.1 <https://www.w3.org/TR/epub-ocf-301/>`_
Validation Tools
----------------
- `EPUBCheck <https://github.com/w3c/epubcheck>`_
- `Ace Accessibility Checker <https://github.com/daisy/ace>`_
- `EPUB Validator <https://validator.idpf.org/>`_
Developer Resources
-------------------
- `EPUB 3 Best Practices <https://www.w3.org/TR/epub-bp/>`_
- `IDPF EPUB Resources <http://idpf.org/epub/31/spec/>`_
- `Accessibility Guidelines <https://www.w3.org/WAI/WCAG21/quickref/>`_
================================================
FILE: docs/examples.rst
================================================
Examples and Use Cases
======================
This page showcases real-world examples of using epub-utils for various tasks. Each example
includes both CLI and Python API approaches where applicable.
Digital Library Management
--------------------------
Cataloging Your EPUB Collection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Scenario**: You have a large collection of EPUB files and want to create a comprehensive catalog.
**CLI Approach**:
.. code-block:: bash
#!/bin/bash
# catalog-epubs.sh - Create a catalog of all EPUB files
echo "Creating EPUB catalog..."
echo "File,Title,Author,Publisher,Language,Year,Files,Size" > epub_catalog.csv
find . -name "*.epub" -type f | while read -r epub; do
echo "Processing: $epub"
# Extract metadata using epub-utils
metadata=$(epub-utils "$epub" metadata --format kv 2>/dev/null)
if [ $? -eq 0 ]; then
title=$(echo "$metadata" | grep "^title:" | cut -d' ' -f2- | sed 's/,/;/g')
author=$(echo "$metadata" | grep "^creator:" | cut -d' ' -f2- | sed 's/,/;/g')
publisher=$(echo "$metadata" | grep "^publisher:" | cut -d' ' -f2- | sed 's/,/;/g')
language=$(echo "$metadata" | grep "^language:" | cut -d' ' -f2-)
year=$(echo "$metadata" | grep "^date:" | cut -d' ' -f2- | cut -d'-' -f1)
# Count files and get size
file_count=$(epub-utils "$epub" files --format raw 2>/dev/null | wc -l)
size=$(stat -f%z "$epub" 2>/dev/null || stat -c%s "$epub" 2>/dev/null)
echo "$epub,$title,$author,$publisher,$language,$year,$file_count,$size" >> epub_catalog.csv
else
echo "$epub,ERROR,ERROR,ERROR,ERROR,ERROR,ERROR,ERROR" >> epub_catalog.csv
fi
done
echo "Catalog complete! See epub_catalog.csv"
**Python Approach**:
.. code-block:: python
import csv
import os
from pathlib import Path
from epub_utils import Document
def create_epub_catalog(directory, output_file="epub_catalog.csv"):
"""Create a comprehensive catalog of EPUB files."""
fieldnames = [
'filepath', 'filename', 'title', 'author', 'publisher',
'language', 'year', 'isbn', 'file_count', 'size_bytes', 'size_mb'
]
epub_files = list(Path(directory).rglob("*.epub"))
print(f"Found {len(epub_files)} EPUB files")
with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for i, epub_path in enumerate(epub_files, 1):
print(f"Processing {i}/{len(epub_files)}: {epub_path.name}")
try:
doc = Document(str(epub_path))
metadata = doc.package.metadata
# Extract date year
date_str = getattr(metadata, 'date', '')
year = date_str.split('-')[0] if date_str else ''
# Get file size
size_bytes = epub_path.stat().st_size
size_mb = round(size_bytes / (1024 * 1024), 2)
row = {
'filepath': str(epub_path),
'filename': epub_path.name,
'title': getattr(metadata, 'title', ''),
'author': getattr(metadata, 'creator', ''),
'publisher': getattr(metadata, 'publisher', ''),
'language': getattr(metadata, 'language', ''),
'year': year,
'isbn': getattr(metadata, 'identifier', ''),
'file_count': len(doc.get_files_info()),
'size_bytes': size_bytes,
'size_mb': size_mb
}
writer.writerow(row)
except Exception as e:
print(f" Error: {e}")
# Write error row
writer.writerow({
'filepath': str(epub_path),
'filename': epub_path.name,
'title': f'ERROR: {str(e)}',
'author': '',
'publisher': '',
'language': '',
'year': '',
'isbn': '',
'file_count': 0,
'size_bytes': epub_path.stat().st_size,
'size_mb': 0
})
# Usage
create_epub_catalog("/path/to/your/epub/collection")
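
Once the catalog exists, it can be summarized with the standard library alone. The sketch below assumes the column names written by ``create_epub_catalog`` above and the ``ERROR:`` prefix it puts in the ``title`` column of failed rows. ``summarize_catalog`` is a hypothetical helper, not part of epub-utils.

.. code-block:: python

    import csv
    import io
    from collections import Counter


    def summarize_catalog(csv_file):
        """Summarize a catalog that uses the column names shown above.

        `csv_file` is any text file object, e.g. open("epub_catalog.csv", newline="").
        """
        languages = Counter()
        total_bytes = 0
        errors = 0
        for row in csv.DictReader(csv_file):
            if row["title"].startswith("ERROR:"):
                errors += 1  # error rows carry the message in the title column
                continue
            languages[row["language"] or "unknown"] += 1
            total_bytes += int(row["size_bytes"])
        return {"languages": dict(languages), "total_bytes": total_bytes, "errors": errors}


    # Usage with an in-memory sample instead of a real catalog file
    sample = io.StringIO(
        "title,language,size_bytes\n"
        "Roads,en,1000\n"
        "Caminos,es,2500\n"
        "ERROR: bad zip,,10\n"
    )
    print(summarize_catalog(sample))

Error rows are counted separately so that a few unreadable files do not skew the size total.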
Quality Assurance and Validation
---------------------------------
EPUB Health Check
~~~~~~~~~~~~~~~~~
**Scenario**: Validate EPUB files and identify potential issues.
.. code-block:: python
from epub_utils import Document, ParseError
import zipfile
from pathlib import Path
class EPUBHealthChecker:
def __init__(self):
self.issues = []
def check_epub(self, epub_path):
"""Comprehensive EPUB health check."""
self.issues = []
epub_path = Path(epub_path)
print(f"Checking EPUB: {epub_path.name}")
# Basic file checks
if not epub_path.exists():
self.issues.append("File does not exist")
return self.get_report()
if epub_path.stat().st_size == 0:
self.issues.append("File is empty")
return self.get_report()
# ZIP integrity check
try:
with zipfile.ZipFile(epub_path, 'r') as zf:
corrupt_files = zf.testzip()
if corrupt_files:
self.issues.append(f"Corrupt ZIP file: {corrupt_files}")
except zipfile.BadZipFile:
self.issues.append("Invalid ZIP file")
return self.get_report()
# EPUB structure checks
try:
doc = Document(str(epub_path))
self._check_container(doc)
self._check_package(doc)
self._check_metadata(doc)
self._check_manifest(doc)
self._check_files(doc)
except ParseError as e:
self.issues.append(f"Parse error: {e}")
except Exception as e:
self.issues.append(f"Unexpected error: {e}")
return self.get_report()
def _check_container(self, doc):
"""Check container structure."""
try:
container = doc.container
if not container.rootfile_path:
self.issues.append("No rootfile specified in container")
except Exception as e:
self.issues.append(f"Container error: {e}")
def _check_package(self, doc):
"""Check package/OPF file."""
try:
package = doc.package
if not hasattr(package, 'metadata'):
self.issues.append("Package missing metadata")
if not hasattr(package, 'manifest'):
self.issues.append("Package missing manifest")
if not hasattr(package, 'spine'):
self.issues.append("Package missing spine")
except Exception as e:
self.issues.append(f"Package error: {e}")
def _check_metadata(self, doc):
"""Check metadata quality."""
try:
metadata = doc.package.metadata
# Check required fields
if not getattr(metadata, 'title', '').strip():
self.issues.append("Missing or empty title")
if not getattr(metadata, 'language', '').strip():
self.issues.append("Missing or empty language")
if not getattr(metadata, 'identifier', '').strip():
self.issues.append("Missing or empty identifier")
except Exception as e:
self.issues.append(f"Metadata error: {e}")
def _check_manifest(self, doc):
"""Check manifest integrity."""
try:
manifest = doc.package.manifest
if not manifest.items:
self.issues.append("Empty manifest")
# Check for common content types
has_html = any(
item.get('media-type') == 'application/xhtml+xml'
for item in manifest.items.values()
)
if not has_html:
self.issues.append("No XHTML content files found")
except Exception as e:
self.issues.append(f"Manifest error: {e}")
def _check_files(self, doc):
"""Check file structure."""
try:
files_info = doc.get_files_info()
if len(files_info) < 3: # At least container, package, and one content file
self.issues.append("Very few files in EPUB (possibly incomplete)")
# Check for suspiciously large files
for file_info in files_info:
if file_info['size'] > 10 * 1024 * 1024: # 10MB
self.issues.append(f"Large file found: {file_info['path']} ({file_info['size']} bytes)")
except Exception as e:
self.issues.append(f"File check error: {e}")
def get_report(self):
"""Generate health check report."""
if not self.issues:
return {"status": "healthy", "issues": []}
else:
return {"status": "issues_found", "issues": self.issues}
# Usage
checker = EPUBHealthChecker()
report = checker.check_epub("book.epub")
if report["status"] == "healthy":
print("✓ EPUB is healthy!")
else:
print("⚠ Issues found:")
for issue in report["issues"]:
print(f" - {issue}")
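
The ZIP integrity check above could be complemented by a look at the ``mimetype`` entry. The EPUB container format expects it to be the first entry in the archive, stored without compression, and to contain exactly ``application/epub+zip``. The sketch below uses only the standard ``zipfile`` module; ``check_mimetype_entry`` is a hypothetical helper and not an epub-utils function.

.. code-block:: python

    import io
    import zipfile


    def check_mimetype_entry(epub_file):
        """Return a list of problems with the archive's `mimetype` entry.

        `epub_file` may be a path or a binary file object.
        """
        problems = []
        with zipfile.ZipFile(epub_file) as zf:
            entries = zf.infolist()
            if not entries or entries[0].filename != "mimetype":
                problems.append("'mimetype' is not the first entry in the archive")
                return problems
            if entries[0].compress_type != zipfile.ZIP_STORED:
                problems.append("'mimetype' entry is compressed")
            if zf.read("mimetype") != b"application/epub+zip":
                problems.append("'mimetype' content is not 'application/epub+zip'")
        return problems


    # Usage with a minimal in-memory archive
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", "<container/>")
    buffer.seek(0)
    print(check_mimetype_entry(buffer))  # → []

The returned strings could be appended to ``self.issues`` from inside ``check_epub`` if this check is useful for your collection.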
Metadata Management
-------------------
Standardizing Metadata
~~~~~~~~~~~~~~~~~~~~~~
**Scenario**: Clean and standardize metadata across your EPUB collection.
.. code-block:: python
import re
from epub_utils import Document
class MetadataStandardizer:
def __init__(self):
self.language_codes = {
'english': 'en',
'spanish': 'es',
'french': 'fr',
'german': 'de',
'italian': 'it'
# Add more as needed
}
def analyze_metadata(self, epub_path):
"""Analyze and suggest metadata improvements."""
doc = Document(epub_path)
metadata = doc.package.metadata
suggestions = []
# Check title
title = getattr(metadata, 'title', '')
if not title:
suggestions.append("Missing title")
elif len(title) > 200:
suggestions.append("Title is very long (>200 chars)")
elif title.isupper():
suggestions.append("Title is all uppercase - consider title case")
# Check author
creator = getattr(metadata, 'creator', '')
if not creator:
suggestions.append("Missing author/creator")
elif ',' not in creator and len(creator.split()) > 2:
suggestions.append("Author name might need reformatting (Last, First)")
# Check language
language = getattr(metadata, 'language', '')
if not language:
suggestions.append("Missing language code")
elif len(language) > 3:
# Might be full language name instead of code
lang_lower = language.lower()
if lang_lower in self.language_codes:
suggestions.append(f"Use language code '{self.language_codes[lang_lower]}' instead of '{language}'")
# Check identifier
identifier = getattr(metadata, 'identifier', '')
if not identifier:
suggestions.append("Missing identifier")
elif not self._is_valid_identifier(identifier):
suggestions.append("Identifier format might be invalid")
# Check date format
date = getattr(metadata, 'date', '')
if date and not re.fullmatch(r'\d{4}(-\d{2}-\d{2})?', date):
suggestions.append("Date should be in YYYY or YYYY-MM-DD format")
return {
'file': epub_path,
'current_metadata': {
'title': title,
'creator': creator,
'language': language,
'identifier': identifier,
'date': date
},
'suggestions': suggestions
}
def _is_valid_identifier(self, identifier):
"""Check if identifier looks valid."""
# Check for ISBN, DOI, UUID patterns
patterns = [
r'urn:isbn:\d{10,13}', # ISBN URN
r'isbn:\d{10,13}', # Simple ISBN
r'urn:uuid:[a-f0-9-]{36}', # UUID URN
r'doi:10\.\d+/.+', # DOI
r'urn:doi:10\.\d+/.+' # DOI URN
]
return any(re.match(pattern, identifier, re.I) for pattern in patterns)
# Usage
standardizer = MetadataStandardizer()
analysis = standardizer.analyze_metadata("book.epub")
print(f"Analyzing: {analysis['file']}")
if analysis['suggestions']:
print("Suggestions for improvement:")
for suggestion in analysis['suggestions']:
print(f" - {suggestion}")
else:
print("Metadata looks good!")
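
The identifier patterns above only check the general shape of a value. For ISBN-13 identifiers the check digit can be verified as well: the digits are weighted 1, 3, 1, 3, ... and the weighted sum must be a multiple of 10. The following is a minimal sketch; ``is_valid_isbn13`` is a hypothetical helper, and the prefixes it strips are the ones used in ``_is_valid_identifier``.

.. code-block:: python

    import re


    def is_valid_isbn13(identifier):
        """Return True if `identifier` holds an ISBN-13 with a correct check digit.

        Accepts optional 'urn:isbn:' or 'isbn:' prefixes and hyphens or spaces.
        """
        value = re.sub(r"^(urn:)?isbn:", "", identifier.strip(), flags=re.I)
        digits = re.sub(r"[\s-]", "", value)
        if not re.fullmatch(r"\d{13}", digits):
            return False
        # Weights alternate 1, 3, 1, 3, ...; the weighted sum must be a multiple of 10
        total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
        return total % 10 == 0


    print(is_valid_isbn13("urn:isbn:978-0-306-40615-7"))  # → True
    print(is_valid_isbn13("isbn:9780306406158"))          # → False

A failed checksum usually points to a typo in the metadata, which makes it a reasonable candidate for the suggestions list.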
Content Analysis and Statistics
-------------------------------
Reading Level Analysis
~~~~~~~~~~~~~~~~~~~~~~
**Scenario**: Analyze EPUB content to determine reading complexity.
.. code-block:: python
import re
import math
from epub_utils import Document
class ReadingLevelAnalyzer:
def analyze_epub(self, epub_path):
"""Analyze reading level of an EPUB."""
doc = Document(epub_path)
# Get all text content
all_text = self._extract_all_text(doc)
if not all_text.strip():
return {"error": "No readable text found"}
# Calculate statistics
stats = self._calculate_text_stats(all_text)
# Calculate reading level scores
flesch_score = self._flesch_reading_ease(stats)
flesch_grade = self._flesch_kincaid_grade(stats)
return {
'title': getattr(doc.package.metadata, 'title', 'Unknown'),
'word_count': stats['words'],
'sentence_count': stats['sentences'],
'syllable_count': stats['syllables'],
'avg_words_per_sentence': round(stats['words'] / stats['sentences'], 2),
'avg_syllables_per_word': round(stats['syllables'] / stats['words'], 2),
'flesch_reading_ease': round(flesch_score, 2),
'flesch_kincaid_grade': round(flesch_grade, 2),
'reading_level': self._interpret_flesch_score(flesch_score)
}
def _extract_all_text(self, doc):
"""Extract all readable text from EPUB."""
# This is a simplified version - real implementation would
# need to parse XHTML content files
try:
manifest = doc.package.manifest
# In a real implementation, you'd extract and parse each content file
# For now, return placeholder
return "Sample text for analysis. This would contain the actual book content."
except Exception:
return ""
def _calculate_text_stats(self, text):
"""Calculate basic text statistics."""
# Clean text
text = re.sub(r'[^\w\s\.\!\?]', '', text)
# Count words
words = len(text.split())
# Count sentences
sentences = len(re.findall(r'[.!?]+', text))
if sentences == 0:
sentences = 1 # Avoid division by zero
# Count syllables (simplified)
syllables = self._count_syllables(text)
return {
'words': words,
'sentences': sentences,
'syllables': syllables
}
def _count_syllables(self, text):
"""Simplified syllable counting."""
words = text.lower().split()
syllable_count = 0
for word in words:
word = re.sub(r'[^a-z]', '', word)
if word:
# Simple syllable counting heuristic
vowels = 'aeiouy'
syllables = sum(1 for i, char in enumerate(word)
if char in vowels and (i == 0 or word[i-1] not in vowels))
if word.endswith('e') and syllables > 1:
syllables -= 1
syllable_count += max(1, syllables)
return syllable_count
def _flesch_reading_ease(self, stats):
"""Calculate Flesch Reading Ease score."""
return (206.835 -
(1.015 * (stats['words'] / stats['sentences'])) -
(84.6 * (stats['syllables'] / stats['words'])))
def _flesch_kincaid_grade(self, stats):
"""Calculate Flesch-Kincaid Grade Level."""
return ((0.39 * (stats['words'] / stats['sentences'])) +
(11.8 * (stats['syllables'] / stats['words'])) - 15.59)
def _interpret_flesch_score(self, score):
"""Interpret Flesch Reading Ease score."""
if score >= 90:
return "Very Easy (5th grade)"
elif score >= 80:
return "Easy (6th grade)"
elif score >= 70:
return "Fairly Easy (7th grade)"
elif score >= 60:
return "Standard (8th-9th grade)"
elif score >= 50:
return "Fairly Difficult (10th-12th grade)"
elif score >= 30:
return "Difficult (College level)"
else:
return "Very Difficult (Graduate level)"
# Usage
analyzer = ReadingLevelAnalyzer()
analysis = analyzer.analyze_epub("book.epub")
print(f"Reading Level Analysis for: {analysis['title']}")
print(f"Word Count: {analysis['word_count']:,}")
print(f"Reading Level: {analysis['reading_level']}")
print(f"Flesch-Kincaid Grade: {analysis['flesch_kincaid_grade']}")
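
The ``_extract_all_text`` method above returns placeholder text. One way to fill it in, assuming the raw XHTML of each content file is available as a string (for example through ``to_str()`` on the object returned by ``doc.get_file_by_path``, as used in the file-access example on this page), is to strip the markup with the standard ``html.parser`` module. ``xhtml_to_text`` is a hypothetical helper and only a sketch; the content objects' own ``to_plain()`` may already be sufficient.

.. code-block:: python

    from html.parser import HTMLParser


    class _TextCollector(HTMLParser):
        """Collect text nodes, skipping <script> and <style> content."""

        SKIPPED = {"script", "style"}
        BLOCK = {"p", "div", "li", "br", "tr", "section", "blockquote",
                 "h1", "h2", "h3", "h4", "h5", "h6"}

        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.parts = []
            self._skip_depth = 0

        def handle_starttag(self, tag, attrs):
            if tag in self.SKIPPED:
                self._skip_depth += 1

        def handle_endtag(self, tag):
            if tag in self.SKIPPED and self._skip_depth:
                self._skip_depth -= 1
            elif tag in self.BLOCK:
                self.parts.append(" ")  # keep words of adjacent blocks apart

        def handle_data(self, data):
            if not self._skip_depth:
                self.parts.append(data)


    def xhtml_to_text(markup):
        """Return the visible text of one XHTML document as a single string."""
        collector = _TextCollector()
        collector.feed(markup)
        collector.close()
        # Collapse the whitespace left behind by removed tags
        return " ".join("".join(collector.parts).split())


    print(xhtml_to_text("<body><h1>Title</h1><p>Hello &amp; <em>goodbye</em>.</p></body>"))
    # → Title Hello & goodbye.

Joining the result for every XHTML file in the manifest would give the ``all_text`` string that the statistics methods expect.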
Direct File Access and Extraction
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Scenario**: Extract specific files from EPUB archives for processing or analysis.
**CLI Approach**:
.. code-block:: bash
#!/bin/bash
# extract-epub-assets.sh - Extract and process EPUB content files
epub_file="$1"
output_dir="extracted_content"
mkdir -p "$output_dir"
echo "Extracting content from: $epub_file"
# Get list of all XHTML content files
epub-utils "$epub_file" files --format raw | grep '\.xhtml$' | while read -r file_path; do
echo "Processing: $file_path"
# Extract plain text content
safe_name=$(echo "$file_path" | tr '/' '_')
epub-utils "$epub_file" files "$file_path" --format plain > "$output_dir/${safe_name}.txt"
# Extract styled HTML content
epub-utils "$epub_file" files "$file_path" --format raw > "$output_dir/${safe_name}.html"
done
# Extract CSS files for styling reference
epub-utils "$epub_file" files --format raw | grep '\.css$' | while read -r css_path; do
echo "Extracting CSS: $css_path"
safe_name=$(echo "$css_path" | tr '/' '_')
epub-utils "$epub_file" files "$css_path" > "$output_dir/${safe_name}"
done
echo "Extraction complete! Files saved to $output_dir/"
**Comparing files vs content commands**:
.. code-block:: bash
# Using files command (direct path access)
epub-utils book.epub files OEBPS/chapter1.xhtml --format plain
epub-utils book.epub files OEBPS/styles/main.css
epub-utils book.epub files META-INF/container.xml
# Using content command (requires manifest item ID)
epub-utils book.epub manifest | grep chapter1 # Find the ID first
epub-utils book.epub content chapter1-id --format plain
**Key advantages of the files command**:
- **Direct access**: Use actual file paths without needing manifest IDs
- **Universal file access**: Access any file type (XHTML, CSS, XML, images, etc.)
- **Simpler automation**: No need to parse manifest to find item IDs
- **Better for file-system-based workflows**: Mirrors actual EPUB structure
**Python equivalent using API**:
.. code-block:: python
from epub_utils import Document
def extract_file_content(epub_path, file_path):
"""Extract content from a specific file in EPUB."""
doc = Document(epub_path)
try:
content = doc.get_file_by_path(file_path)
# Handle different content types
if hasattr(content, 'to_plain'):
# XHTML content - can extract plain text
return {
'raw_html': content.to_str(),
'plain_text': content.to_plain(),
'formatted_xml': content.to_xml(pretty_print=True)
}
else:
# Other file types (CSS, XML, etc.)
return {'raw_content': content}
except ValueError as e:
return {'error': str(e)}
# Usage
doc = Document("book.epub")
# Extract chapter content
chapter_content = extract_file_content("book.epub", "OEBPS/chapter1.xhtml")
if 'plain_text' in chapter_content:
print(f"Chapter text: {chapter_content['plain_text'][:200]}...")
# Extract CSS for styling analysis
css_content = extract_file_content("book.epub", "OEBPS/styles/main.css")
if 'raw_content' in css_content:
print(f"CSS rules: {css_content['raw_content'].count('{')} rules found")
Automation and Workflows
-------------------------
Automated EPUB Processing Pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Scenario**: Set up an automated pipeline for processing new EPUB files.
.. code-block:: python
import os
import re
import shutil
import json
from pathlib import Path
from datetime import datetime
from epub_utils import Document
class EPUBProcessor:
def __init__(self, input_dir, output_dir, processed_dir):
self.input_dir = Path(input_dir)
self.output_dir = Path(output_dir)
self.processed_dir = Path(processed_dir)
# Create directories if they don't exist
self.output_dir.mkdir(parents=True, exist_ok=True)
self.processed_dir.mkdir(parents=True, exist_ok=True)
def process_new_files(self):
"""Process all new EPUB files in input directory."""
epub_files = list(self.input_dir.glob("*.epub"))
if not epub_files:
print("No EPUB files found to process")
return
print(f"Found {len(epub_files)} EPUB files to process")
results = []
for epub_path in epub_files:
result = self.process_single_file(epub_path)
results.append(result)
# Generate processing report
self.generate_report(results)
return results
def process_single_file(self, epub_path):
"""Process a single EPUB file."""
print(f"Processing: {epub_path.name}")
try:
doc = Document(str(epub_path))
# Extract metadata
metadata = self.extract_metadata(doc)
# Validate file
validation_result = self.validate_epub(doc)
# Generate file info
file_info = self.generate_file_info(epub_path, doc)
# Create organized filename
new_filename = self.create_organized_filename(metadata)
# Move file to organized location
organized_path = self.organize_file(epub_path, new_filename, metadata)
result = {
'original_path': str(epub_path),
'new_path': str(organized_path),
'status': 'success',
'metadata': metadata,
'validation': validation_result,
'file_info': file_info,
'processed_at': datetime.now().isoformat()
}
# Move original to processed directory
processed_path = self.processed_dir / epub_path.name
shutil.move(str(epub_path), str(processed_path))
return result
except Exception as e:
result = {
'original_path': str(epub_path),
'status': 'error',
'error': str(e),
'processed_at': datetime.now().isoformat()
}
# Move problematic file to processed directory
processed_path = self.processed_dir / f"ERROR_{epub_path.name}"
shutil.move(str(epub_path), str(processed_path))
return result
def extract_metadata(self, doc):
"""Extract standardized metadata."""
metadata = doc.package.metadata
return {
'title': getattr(metadata, 'title', '').strip(),
'author': getattr(metadata, 'creator', '').strip(),
'publisher': getattr(metadata, 'publisher', '').strip(),
'language': getattr(metadata, 'language', '').strip(),
'year': self.extract_year(getattr(metadata, 'date', '')),
'identifier': getattr(metadata, 'identifier', '').strip(),
'subject': getattr(metadata, 'subject', '').strip()
}
def extract_year(self, date_str):
"""Extract year from date string."""
if not date_str:
return ''
return date_str.split('-')[0] if '-' in date_str else date_str[:4]
def validate_epub(self, doc):
"""Basic EPUB validation."""
issues = []
try:
metadata = doc.package.metadata
if not getattr(metadata, 'title', '').strip():
issues.append('Missing title')
if not getattr(metadata, 'creator', '').strip():
issues.append('Missing author')
if not getattr(metadata, 'language', '').strip():
issues.append('Missing language')
# Check for content
manifest = doc.package.manifest
has_content = any(
item.get('media-type') == 'application/xhtml+xml'
for item in manifest.items.values()
)
if not has_content:
issues.append('No content files found')
except Exception as e:
issues.append(f'Validation error: {e}')
return {
'is_valid': len(issues) == 0,
'issues': issues
}
def generate_file_info(self, epub_path, doc):
"""Generate file information."""
stat = epub_path.stat()
return {
'filename': epub_path.name,
'size_bytes': stat.st_size,
'size_mb': round(stat.st_size / (1024 * 1024), 2),
'file_count': len(doc.get_files_info()),
'modified': datetime.fromtimestamp(stat.st_mtime).isoformat()
}
def create_organized_filename(self, metadata):
"""Create an organized filename from metadata."""
# Clean strings for filename
def clean_for_filename(s):
return re.sub(r'[^\w\s-]', '', s).strip()[:50]
author = clean_for_filename(metadata['author'] or 'Unknown_Author')
title = clean_for_filename(metadata['title'] or 'Unknown_Title')
year = metadata['year'] or 'Unknown_Year'
return f"{author} - {title} ({year}).epub"
def organize_file(self, epub_path, new_filename, metadata):
"""Organize file into structured directory."""
# Create author directory
author = metadata['author'] or 'Unknown_Author'
author_dir = self.output_dir / author[:50] # Limit length
author_dir.mkdir(exist_ok=True)
# Create final path
final_path = author_dir / new_filename
# Copy file to organized location
shutil.copy2(str(epub_path), str(final_path))
return final_path
def generate_report(self, results):
"""Generate processing report."""
report_path = self.output_dir / f"processing_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
summary = {
'total_files': len(results),
'successful': len([r for r in results if r['status'] == 'success']),
'errors': len([r for r in results if r['status'] == 'error']),
'generated_at': datetime.now().isoformat(),
'results': results
}
with open(report_path, 'w', encoding='utf-8') as f:
json.dump(summary, f, indent=2, ensure_ascii=False)
print("Processing complete!")
print(f"Successfully processed: {summary['successful']}")
print(f"Errors: {summary['errors']}")
print(f"Report saved to: {report_path}")
# Usage
processor = EPUBProcessor(
input_dir="/path/to/new/epubs",
output_dir="/path/to/organized/library",
processed_dir="/path/to/processed/files"
)
results = processor.process_new_files()
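
The filename logic in ``create_organized_filename`` can be tried on its own. The sketch below repeats the same regular expression and adds a fallback for values that become empty after cleaning (for example an author consisting only of punctuation). ``organized_filename`` and the ``fallback`` parameter are illustrative names, not part of epub-utils.

.. code-block:: python

    import re


    def clean_for_filename(value, fallback, max_length=50):
        """Strip characters that are unsafe in file names and cap the length."""
        cleaned = re.sub(r"[^\w\s-]", "", value)
        cleaned = re.sub(r"\s+", " ", cleaned).strip()[:max_length].strip()
        return cleaned or fallback


    def organized_filename(metadata):
        """Build 'Author - Title (Year).epub' from a metadata dict."""
        author = clean_for_filename(metadata.get("author", ""), "Unknown_Author")
        title = clean_for_filename(metadata.get("title", ""), "Unknown_Title")
        year = metadata.get("year") or "Unknown_Year"
        return f"{author} - {title} ({year}).epub"


    print(organized_filename({"author": "Doe, Jane", "title": "Roads: A/B Guide?", "year": "2021"}))
    # → Doe Jane - Roads AB Guide (2021).epub

The same cleaning step could also be applied to the author directory name, since a raw author string may contain path separators.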
Command-Line Power User Examples
--------------------------------
Advanced Shell Scripts
~~~~~~~~~~~~~~~~~~~~~~
**Complex metadata extraction with error handling**:
.. code-block:: bash
#!/bin/bash
# advanced-epub-analysis.sh
set -euo pipefail
EPUB_DIR="${1:-./}"
OUTPUT_FILE="detailed_analysis.json"
echo "Starting advanced EPUB analysis..."
echo "Directory: $EPUB_DIR"
echo "Output: $OUTPUT_FILE"
# Initialize JSON output
echo '{"analysis_date": "'$(date -Iseconds)'", "epubs": [' > "$OUTPUT_FILE"
first=true
find "$EPUB_DIR" -name "*.epub" -type f | while read -r epub; do
echo "Analyzing: $(basename "$epub")"
if [ "$first" = true ]; then
first=false
else
echo "," >> "$OUTPUT_FILE"
fi
# Start JSON object for this EPUB
echo ' {' >> "$OUTPUT_FILE"
echo " \"file\": \"$epub\"," >> "$OUTPUT_FILE"
# Extract metadata with error handling
if metadata=$(epub-utils "$epub" metadata --format kv 2>/dev/null); then
echo " \"metadata\": {" >> "$OUTPUT_FILE"
# Parse metadata into JSON
# Emit one JSON pair per line, strip the trailing comma, then append to the file
echo "$metadata" | while IFS=': ' read -r key value; do
if [ -n "$key" ] && [ -n "$value" ]; then
echo " \"$key\": \"$value\","
fi
done | sed '$s/,$//' >> "$OUTPUT_FILE"
echo " }," >> "$OUTPUT_FILE"
else
echo " \"metadata\": null," >> "$OUTPUT_FILE"
echo " \"metadata_error\": true," >> "$OUTPUT_FILE"
fi
# File analysis
if file_info=$(epub-utils "$epub" files --format raw 2>/dev/null); then
file_count=$(echo "$file_info" | wc -l)
echo " \"file_count\": $file_count," >> "$OUTPUT_FILE"
else
echo " \"file_count\": null," >> "$OUTPUT_FILE"
fi
# File size
size=$(stat -f%z "$epub" 2>/dev/null || stat -c%s "$epub" 2>/dev/null || echo "0")
echo " \"size_bytes\": $size," >> "$OUTPUT_FILE"
# Validation check
if epub-utils "$epub" container >/dev/null 2>&1 && \
epub-utils "$epub" package >/dev/null 2>&1; then
echo " \"is_valid\": true" >> "$OUTPUT_FILE"
else
echo " \"is_valid\": false" >> "$OUTPUT_FILE"
fi
echo " }" >> "$OUTPUT_FILE"
done
# Close JSON
echo "]}" >> "$OUTPUT_FILE"
echo "Analysis complete! Results in $OUTPUT_FILE"
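
The script above assembles JSON by string concatenation, so a quote or backslash inside a title would produce invalid output. Where that matters, the key-value text can be converted with Python's ``json`` module instead. The sketch assumes that ``metadata --format kv`` prints one ``key: value`` pair per line, as the ``grep`` patterns on this page suggest; ``kv_to_dict`` is a hypothetical helper.

.. code-block:: python

    import json


    def kv_to_dict(kv_text):
        """Parse 'key: value' lines into a dict, keeping repeated keys as lists."""
        result = {}
        for line in kv_text.splitlines():
            key, sep, value = line.partition(":")
            key, value = key.strip(), value.strip()
            if not sep or not key or not value:
                continue  # skip blank or malformed lines
            if key in result:
                existing = result[key]
                result[key] = existing + [value] if isinstance(existing, list) else [existing, value]
            else:
                result[key] = value
        return result


    # Sample text in the assumed format; json.dumps takes care of the escaping
    sample = 'title: The "Roads" Collection\ncreator: Jane Doe\ncreator: John Roe\n'
    print(json.dumps(kv_to_dict(sample), indent=2))

Only the first colon separates key and value, so values that contain colons (such as ``urn:isbn:...`` identifiers) stay intact.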
**Batch processing with parallel execution**:
.. code-block:: bash
#!/bin/bash
# parallel-epub-check.sh
EPUB_DIR="${1:-./}"
MAX_JOBS=4
check_single_epub() {
epub="$1"
base=$(basename "$epub")
echo "[$base] Starting check..."
# Quick validation
if ! epub-utils "$epub" container >/dev/null 2>&1; then
echo "[$base] ❌ Invalid container"
return 1
fi
if ! epub-utils "$epub" package >/dev/null 2>&1; then
echo "[$base] ❌ Invalid package"
return 1
fi
# Check for required metadata
metadata=$(epub-utils "$epub" metadata --format kv 2>/dev/null)
if ! echo "$metadata" | grep -q "^title:"; then
echo "[$base] ⚠️ Missing title"
fi
if ! echo "$metadata" | grep -q "^creator:"; then
echo "[$base] ⚠️ Missing author"
fi
echo "[$base] ✅ Check complete"
}
# Export after the definition so the shells started by xargs can see the function
export -f check_single_epub
# Run parallel checks
find "$EPUB_DIR" -name "*.epub" -type f -print0 | \
xargs -0 -P "$MAX_JOBS" -I {} bash -c 'check_single_epub "$1"' _ {}
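
The same fan-out can be written in Python with ``concurrent.futures``. In the sketch below the per-file check is passed in as a callable, so it could open a ``Document`` or call the CLI. ``run_checks`` and ``fake_check`` are hypothetical names used only for illustration.

.. code-block:: python

    from concurrent.futures import ThreadPoolExecutor


    def run_checks(paths, check, max_workers=4):
        """Run `check(path)` for every path in parallel and collect the results.

        Each result is (path, value, error); an exception raised by `check`
        is recorded as the error string instead of stopping the batch.
        """
        def safe_check(path):
            try:
                return path, check(path), None
            except Exception as exc:  # keep going when one file fails
                return path, None, str(exc)

        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # map() preserves input order even though the work runs concurrently
            return list(pool.map(safe_check, paths))


    # Stand-in check so the example runs without any EPUB files
    def fake_check(path):
        if path.endswith("broken.epub"):
            raise ValueError("invalid container")
        return "ok"


    for path, status, error in run_checks(["a.epub", "broken.epub"], fake_check):
        print(path, status or f"error: {error}")

Threads are a reasonable default here because most of the time is spent reading files; a process pool may suit CPU-heavy checks better.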
Navigation and Table of Contents
--------------------------------
Working with EPUB Navigation Documents
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Scenario**: Extract and analyze navigation structures from both EPUB 2 and EPUB 3 files.
**CLI Approach - Version-Specific TOC Access**:
.. code-block:: bash
#!/bin/bash
# extract-navigation.sh - Extract navigation from EPUB files
EPUB_FILE="$1"
if [ -z "$EPUB_FILE" ]; then
echo "Usage: $0 <epub-file>"
exit 1
fi
echo "Analyzing navigation in: $(basename "$EPUB_FILE")"
echo "========================================"
# Try EPUB 3 nav document first
echo "Attempting EPUB 3 nav document extraction..."
if epub-utils "$EPUB_FILE" toc --nav > /tmp/nav.xml 2>/dev/null; then
echo "✅ EPUB 3 nav document found"
echo "Navigation structure:"
# Extract navigation items with their hierarchy
grep -o '<a[^>]*href="[^"]*"[^>]*>[^<]*</a>' /tmp/nav.xml | \
sed 's/<a[^>]*href="\([^"]*\)"[^>]*>\([^<]*\)<\/a>/ → \2 (\1)/' | \
head -10
# Count navigation items
nav_count=$(grep -c '<a[^>]*href=' /tmp/nav.xml)
echo "Total navigation items: $nav_count"
else
echo "❌ No EPUB 3 nav document found"
fi
echo ""
echo "Attempting EPUB 2 NCX extraction..."
if epub-utils "$EPUB_FILE" toc --ncx > /tmp/ncx.xml 2>/dev/null; then
echo "✅ EPUB 2 NCX document found"
echo "Table of contents structure:"
# Extract NCX navigation points
grep -o '<navLabel><text>[^<]*</text></navLabel>' /tmp/ncx.xml | \
sed 's/<navLabel><text>\([^<]*\)<\/text><\/navLabel>/ → \1/' | \
head -10
# Count NCX nav points
ncx_count=$(grep -c '<navPoint' /tmp/ncx.xml)
echo "Total NCX navigation points: $ncx_count"
else
echo "❌ No EPUB 2 NCX document found"
fi
# Compare standard TOC with version-specific extracts
echo ""
echo "Standard TOC extraction:"
standard_toc=$(epub-utils "$EPUB_FILE" toc --format raw 2>/dev/null | wc -l)
echo "Standard TOC items: $standard_toc"
**Python Approach - Advanced Navigation Analysis**:
.. code-block:: python
from epub_utils import Document
import xml.etree.ElementTree as ET
from pathlib import Path
class NavigationAnalyzer:
def __init__(self, epub_path):
self.doc = Document(epub_path)
self.epub_path = Path(epub_path)
def analyze_navigation(self):
"""Comprehensive navigation analysis."""
print(f"Analyzing: {self.epub_path.name}")
print("=" * 50)
# Check EPUB version
version = getattr(self.doc.package.metadata, 'version', 'unknown')
print(f"EPUB Version: {version}")
print()
# Analyze EPUB 3 nav document
self._analyze_nav_document()
# Analyze EPUB 2 NCX document
self._analyze_ncx_document()
# Compare with standard TOC
self._analyze_standard_toc()
def _analyze_nav_document(self):
"""Analyze EPUB 3 navigation document."""
print("EPUB 3 Navigation Document Analysis:")
print("-" * 40)
try:
nav_content = self.doc.nav
if nav_content:
print("✅ Nav document found")
# Parse navigation structure
nav_items = self._parse_nav_structure(nav_content)
print(f"Navigation items found: {len(nav_items)}")
# Show hierarchy
print("\nNavigation hierarchy:")
for item in nav_items[:10]: # Show first 10
indent = " " * item['level']
print(f"{indent}→ {item['title']} ({item['href']})")
if len(nav_items) > 10:
print(f" ... and {len(nav_items) - 10} more items")
else:
print("❌ No nav document found")
except Exception as e:
print(f"❌ Error accessing nav document: {e}")
print()
def _analyze_ncx_document(self):
"""Analyze EPUB 2 NCX document."""
print("EPUB 2 NCX Document Analysis:")
print("-" * 30)
try:
ncx_content = self.doc.ncx
if ncx_content:
print("✅ NCX document found")
# Parse NCX structure
ncx_items = self._parse_ncx_structure(ncx_content)
print(f"NCX navigation points: {len(ncx_items)}")
# Show structure
print("\nNCX structure:")
for item in ncx_items[:10]: # Show first 10
indent = " " * item['level']
print(f"{indent}→ {item['title']} ({item['src']})")
if len(ncx_items) > 10:
print(f" ... and {len(ncx_items) - 10} more items")
else:
print("❌ No NCX document found")
except Exception as e:
print(f"❌ Error accessing NCX document: {e}")
print()
def _analyze_standard_toc(self):
"""Analyze standard TOC extraction."""
print("Standard TOC Analysis:")
print("-" * 22)
try:
toc = self.doc.get_toc()
toc_items = len(toc.get_nav_items())
print(f"✅ Standard TOC items: {toc_items}")
# Show some items
print("\nStandard TOC items:")
for i, item in enumerate(toc.get_nav_items()[:5]):
print(f" → {item.title} ({item.href})")
except Exception as e:
print(f"❌ Error with standard TOC: {e}")
print()
def _parse_nav_structure(self, nav_content):
"""Parse EPUB 3 nav document structure."""
items = []
try:
root = ET.fromstring(nav_content)
# Handle namespaces
namespaces = {'xhtml': 'http://www.w3.org/1999/xhtml'}
def parse_nav_list(ol_element, level=0):
for li in ol_element.findall('xhtml:li', namespaces):  # direct children only, nested lists are handled by the recursion below
a_elem = li.find('.//xhtml:a', namespaces)
if a_elem is not None:
title = "".join(a_elem.itertext())  # include text inside nested inline elements
href = a_elem.get('href', '')
items.append({
'title': title.strip(),
'href': href,
'level': level
})
# Check for nested lists
nested_ol = li.find('.//xhtml:ol', namespaces)
if nested_ol is not None:
parse_nav_list(nested_ol, level + 1)
# Find main navigation
nav_elem = root.find('.//xhtml:nav[@{http://www.idpf.org/2007/ops}type="toc"]', namespaces)
if nav_elem is None:
nav_elem = root.find('.//xhtml:nav', namespaces)
if nav_elem is not None:
ol_elem = nav_elem.find('.//xhtml:ol', namespaces)
if ol_elem is not None:
parse_nav_list(ol_elem)
except ET.ParseError as e:
print(f"Warning: Could not parse nav XML: {e}")
return items
def _parse_ncx_structure(self, ncx_content):
"""Parse EPUB 2 NCX document structure."""
items = []
try:
root = ET.fromstring(ncx_content)
# NCX namespace
namespaces = {'ncx': 'http://www.daisy.org/z3986/2005/ncx/'}
def parse_nav_point(nav_point, level=0):
# Get label
nav_label = nav_point.find('ncx:navLabel/ncx:text', namespaces)
title = nav_label.text if nav_label is not None else ""
# Get content source
content = nav_point.find('ncx:content', namespaces)
src = content.get('src', '') if content is not None else ""
items.append({
'title': title.strip(),
'src': src,
'level': level
})
# Process child nav points
for child_nav_point in nav_point.findall('ncx:navPoint', namespaces):
parse_nav_point(child_nav_point, level + 1)
# Find all top-level navigation points
nav_map = root.find('ncx:navMap', namespaces)
if nav_map is not None:
for nav_point in nav_map.findall('ncx:navPoint', namespaces):
parse_nav_point(nav_point)
except ET.ParseError as e:
print(f"Warning: Could not parse NCX XML: {e}")
return items
# Usage examples
def analyze_single_epub(epub_path):
"""Analyze a single EPUB file."""
analyzer = NavigationAnalyzer(epub_path)
analyzer.analyze_navigation()
def compare_navigation_across_epubs(epub_directory):
"""Compare navigation structures across multiple EPUB files."""
epub_files = list(Path(epub_directory).glob("*.epub"))
print(f"Comparing navigation across {len(epub_files)} EPUB files")
print("=" * 60)
results = []
for epub_path in epub_files:
try:
doc = Document(str(epub_path))
# Check what navigation documents are available
has_nav = bool(doc.nav)
has_ncx = bool(doc.ncx)
standard_toc_count = len(doc.get_toc().get_nav_items())
results.append({
'file': epub_path.name,
'has_nav': has_nav,
'has_ncx': has_ncx,
'toc_items': standard_toc_count,
'version': getattr(doc.package.metadata, 'version', 'unknown')
})
except Exception as e:
print(f"Error processing {epub_path.name}: {e}")
# Print comparison table
print(f"{'File':<30} {'Version':<8} {'Nav':<5} {'NCX':<5} {'TOC Items':<10}")
print("-" * 65)
for result in results:
nav_mark = "✅" if result['has_nav'] else "❌"
ncx_mark = "✅" if result['has_ncx'] else "❌"
print(f"{result['file']:<30} {result['version']:<8} "
f"{nav_mark:<5} {ncx_mark:<5} {result['toc_items']:<10}")
# Example usage
if __name__ == "__main__":
# Analyze single file
analyze_single_epub("/path/to/your/book.epub")
# Compare multiple files
compare_navigation_across_epubs("/path/to/epub/collection")
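The recursive NCX walk used by the analyzer above can also be tried in isolation. The following is a minimal sketch, assuming a hand-written NCX snippet (``SAMPLE_NCX``) and a hypothetical helper name (``flatten_ncx``). It uses only the standard library's ``xml.etree.ElementTree`` and does not go through ``epub-utils``.

```python
import xml.etree.ElementTree as ET

# NCX namespace, as used in the analyzer above
NCX_NS = {'ncx': 'http://www.daisy.org/z3986/2005/ncx/'}

# Hand-written sample for illustration, not taken from a real book
SAMPLE_NCX = """
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="p1" playOrder="1">
      <navLabel><text>Part One</text></navLabel>
      <content src="part1.xhtml"/>
      <navPoint id="c1" playOrder="2">
        <navLabel><text> Chapter 1 </text></navLabel>
        <content src="part1.xhtml#ch1"/>
      </navPoint>
    </navPoint>
    <navPoint id="p2" playOrder="3">
      <navLabel><text>Part Two</text></navLabel>
      <content src="part2.xhtml"/>
    </navPoint>
  </navMap>
</ncx>
"""


def flatten_ncx(ncx_xml):
    """Return a flat list of {'title', 'src', 'level'} dicts (hypothetical helper)."""
    items = []

    def walk(nav_point, level):
        label = nav_point.find('ncx:navLabel/ncx:text', NCX_NS)
        content = nav_point.find('ncx:content', NCX_NS)
        # Guard against a missing label as well as an empty <text/> element
        title = (label.text or '') if label is not None else ''
        items.append({
            'title': title.strip(),
            'src': content.get('src', '') if content is not None else '',
            'level': level,
        })
        # Only direct children are visited, so nesting depth maps to 'level'
        for child in nav_point.findall('ncx:navPoint', NCX_NS):
            walk(child, level + 1)

    root = ET.fromstring(ncx_xml)
    nav_map = root.find('ncx:navMap', NCX_NS)
    if nav_map is not None:
        for nav_point in nav_map.findall('ncx:navPoint', NCX_NS):
            walk(nav_point, 0)
    return items


for item in flatten_ncx(SAMPLE_NCX):
    print('  ' * item['level'] + f"{item['title']} -> {item['src']}")
```

Because ``findall('ncx:navPoint', ...)`` matches direct children only, each entry's ``level`` reflects how deeply it is nested in the ``navMap``.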
Building Smart Reading Lists
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Scenario**: Create curated reading lists based on navigation complexity and structure.
.. code-block:: python
from epub_utils import Document
import json
from pathlib import Path
from collections import defaultdict
class ReadingListBuilder:
def __init__(self):
self.books = []
def analyze_book_complexity(self, epub_path):
"""Analyze book's structural complexity."""
try:
doc = Document(str(epub_path))
# Get navigation info
toc_items = len(doc.get_toc().get_nav_items())
has_advanced_nav = bool(doc.nav) or bool(doc.ncx)
# Get file structure info
files_info = doc.get_files_info()
html_files = [f for f in files_info if f['media_type'] == 'application/xhtml+xml']
complexity_score = self._calculate_complexity_score(
toc_items, len(html_files), has_advanced_nav
)
return {
'path': epub_path,
'title': getattr(doc.package.metadata, 'title', ''),
'author': getattr(doc.package.metadata, 'creator', ''),
'toc_items': toc_items,
'html_files': len(html_files),
'has_advanced_nav': has_advanced_nav,
'complexity_score': complexity_score,
'complexity_level': self._get_complexity_level(complexity_score)
}
except Exception as e:
print(f"Error analyzing {epub_path}: {e}")
return None
def _calculate_complexity_score(self, toc_items, html_files, has_advanced_nav):
"""Calculate structural complexity score."""
score = 0
# TOC complexity
if toc_items > 50:
score += 30
elif toc_items > 20:
score += 20
elif toc_items > 10:
score += 10
# File structure complexity
if html_files > 100:
score += 25
elif html_files > 50:
score += 15
elif html_files > 20:
score += 10
# Advanced navigation features
if has_advanced_nav:
score += 15
return min(score, 100) # Cap at 100
def _get_complexity_level(self, score):
"""Convert score to complexity level."""
if score >= 70:
return "Advanced"
elif score >= 40:
return "Intermediate"
else:
return "Beginner"
def build_reading_lists(self, epub_directory, output_file="reading_lists.json"):
"""Build categorized reading lists."""
epub_files = list(Path(epub_directory).glob("*.epub"))
print(f"Analyzing {len(epub_files)} EPUB files for reading lists...")
# Analyze all books
for epub_path in epub_files:
book_info = self.analyze_book_complexity(epub_path)
if book_info:
self.books.append(book_info)
# Categorize books
categories = defaultdict(list)
for book in self.books:
# By complexity
categories[f"complexity_{book['complexity_level'].lower()}"].append(book)
# By navigation richness
if book['toc_items'] >= 20:
categories['detailed_structure'].append(book)
if book['has_advanced_nav']:
categories['advanced_navigation'].append(book)
# Create final reading lists
reading_lists = {
'beginner_friendly': {
'description': 'Books with simple structure, perfect for casual reading',
'books': sorted(categories['complexity_beginner'],
key=lambda x: x['toc_items'])[:10]
},
'intermediate_reads': {
'description': 'Well-structured books with moderate complexity',
'books': sorted(categories['complexity_intermediate'],
key=lambda x: x['complexity_score'])[:15]
},
'advanced_studies': {
'description': 'Complex books with rich navigation, ideal for research',
'books': sorted(categories['complexity_advanced'],
key=lambda x: x['complexity_score'], reverse=True)[:10]
},
'detailed_references': {
'description': 'Books with detailed table of contents',
'books': sorted(categories['detailed_structure'],
key=lambda x: x['toc_items'], reverse=True)[:12]
},
'enhanced_navigation': {
'description': 'Books with advanced navigation features',
'books': categories['advanced_navigation'][:10]
}
}
# Save to file
with open(output_file, 'w', encoding='utf-8') as f:
json.dump(reading_lists, f, indent=2, ensure_ascii=False, default=str)
# Print summary
print("\nReading Lists Generated:")
print("=" * 25)
for list_name, list_data in reading_lists.items():
print(f"{list_name}: {len(list_data['books'])} books")
print(f" → {list_data['description']}")
print(f"\nSaved to: {output_file}")
# Usage
builder = ReadingListBuilder()
builder.build_reading_lists("/path/to/epub/collection")
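To make the scoring thresholds concrete, here is a standalone sketch that mirrors ``_calculate_complexity_score`` and ``_get_complexity_level`` as plain functions. The names ``complexity_score`` and ``complexity_level`` and the sample numbers are introduced here for illustration only. With the thresholds as written, the largest attainable score is 30 + 25 + 15 = 70, so the cap at 100 never triggers and a book is rated "Advanced" only when it reaches the top tier in every category.

```python
def complexity_score(toc_items, html_files, has_advanced_nav):
    """Same thresholds as ReadingListBuilder._calculate_complexity_score."""
    score = 0
    # TOC complexity
    if toc_items > 50:
        score += 30
    elif toc_items > 20:
        score += 20
    elif toc_items > 10:
        score += 10
    # File structure complexity
    if html_files > 100:
        score += 25
    elif html_files > 50:
        score += 15
    elif html_files > 20:
        score += 10
    # Advanced navigation features
    if has_advanced_nav:
        score += 15
    return min(score, 100)


def complexity_level(score):
    """Same cut-offs as ReadingListBuilder._get_complexity_level."""
    if score >= 70:
        return "Advanced"
    elif score >= 40:
        return "Intermediate"
    return "Beginner"


# (toc_items, html_files, has_advanced_nav) for three made-up books
samples = [(8, 12, False), (25, 60, True), (120, 150, True)]
for toc, files, nav in samples:
    s = complexity_score(toc, files, nav)
    print(f"toc={toc:<4} files={files:<4} nav={nav!s:<5} -> {s:>3} {complexity_level(s)}")
```

If more books should land in the "Advanced" list, lowering the ``>= 70`` cut-off or raising the per-category weights are the two obvious knobs.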
These examples demonstrate the power and flexibility of ``epub-utils`` for various real-world scenarios. Whether you're managing a digital library, performing quality assurance, building automated workflows, or analyzing navigation structures, epub-utils provides the tools you need to work effectively with EPUB files.
================================================
FILE: docs/formats.rst
================================================
Output Formats Reference
========================
``epub-utils`` supports multiple output formats to suit different use cases. This guide explains each
format with examples and best practices for when to use each one.
Overview
--------
Commands in ``epub-utils`` accept the ``--format`` option. Which of the following values is valid depends on the command:
- ``xml`` - Syntax-highlighted XML (default for most commands)
- ``raw`` - Unformatted, raw content
- ``kv`` - Key-value pairs (where supported)
- ``plain`` - Plain text with HTML tags stripped (``content`` command, and ``files`` with a file path)
- ``table`` - Formatted table (files command only)
Additionally, most commands support the ``--pretty-print`` option to format XML output with proper indentation and structure.
XML Format (Default)
--------------------
The XML format provides syntax-highlighted XML output that's easy to read. Add ``--pretty-print`` to re-indent it.
**When to use**: Interactive inspection, debugging, learning EPUB structure
**Example**:
.. code-block:: bash
$ epub-utils book.epub metadata --format xml
**Output**:
.. code-block:: xml
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:opf="http://www.idpf.org/2007/opf">
<dc:title>The Great Gatsby</dc:title>
<dc:creator>F. Scott Fitzgerald</dc:creator>
<dc:language>en</dc:language>
<dc:identifier id="bookid">urn:uuid:12345678-1234-1234-1234-123456789abc</dc:identifier>
<dc:publisher>Scribner</dc:publisher>
<dc:date>2021-01-01</dc:date>
<dc:subject>Fiction</dc:subject>
<dc:subject>Classic Literature</dc:subject>
</metadata>
**Features**:
- Color syntax highlighting
- Proper indentation (with ``--pretty-print``)
- Easy to read structure
- Preserves all XML attributes and namespaces
Raw Format
----------
The raw format outputs unprocessed content exactly as stored in the EPUB file.
**When to use**: Piping to other tools, automated processing, debugging XML issues
**Example**:
.. code-block:: bash
$ epub-utils book.epub metadata --format raw
**Output**:
.. code-block:: xml
<?xml version="1.0" encoding="UTF-8"?><metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf"><dc:title>The Great Gatsby</dc:title><dc:creator>F. Scott Fitzgerald</dc:creator><dc:language>en</dc:language><dc:identifier id="bookid">urn:uuid:12345678-1234-1234-1234-123456789abc</dc:identifier><dc:publisher>Scribner</dc:publisher><dc:date>2021-01-01</dc:date><dc:subject>Fiction</dc:subject><dc:subject>Classic Literature</dc:subject></metadata>
**Use cases**:
.. code-block:: bash
# Pipe to xmllint for custom formatting
$ epub-utils book.epub package --format raw | xmllint --format -
# Extract specific elements with grep
$ epub-utils book.epub manifest --format raw | grep 'media-type="text/css"'
# Check that the XML is well-formed (--valid would additionally require a DTD)
$ epub-utils book.epub toc --format raw | xmllint --noout -
Key-Value Format
----------------
The key-value format presents metadata as simple ``key: value`` pairs, perfect for scripting.
**When to use**: Shell scripting, automated data extraction, configuration files
**Supported commands**: ``metadata``
**Example**:
.. code-block:: bash
$ epub-utils book.epub metadata --format kv
**Output**:
.. code-block:: text
title: The Great Gatsby
creator: F. Scott Fitzgerald
language: en
identifier: urn:uuid:12345678-1234-1234-1234-123456789abc
publisher: Scribner
date: 2021-01-01
subject: Fiction, Classic Literature
**Scripting examples**:
.. code-block:: bash
# Extract just the title
title=$(epub-utils book.epub metadata --format kv | grep "^title:" | cut -d' ' -f2-)
# Get all metadata into shell variables (prefixed with meta_), without eval
while IFS= read -r line; do
declare "meta_${line%%: *}=${line#*: }"
done < <(epub-utils book.epub metadata --format kv)
echo "Book title: $meta_title"
echo "Author: $meta_creator"
# Create a simple database
echo "filename,title,author" > books.csv
for epub in *.epub; do
metadata=$(epub-utils "$epub" metadata --format kv)
title=$(echo "$metadata" | grep "^title:" | cut -d' ' -f2- | tr ',' ';')
author=$(echo "$metadata" | grep "^creator:" | cut -d' ' -f2- | tr ',' ';')
echo "$epub,$title,$author" >> books.csv
done
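The same extraction can be done from Python. This is a minimal sketch, assuming the ``key: value`` line layout shown in the example output above. ``parse_kv`` is a hypothetical helper, not part of the ``epub-utils`` API, and the sample text is copied from that example rather than produced by running the tool.

```python
def parse_kv(text):
    """Parse 'key: value' lines into a dict (hypothetical helper).

    Each line is split on its first colon only, so values that contain
    colons themselves (for example 'urn:uuid:...') are kept intact.
    """
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(':')
        if not sep:
            continue  # skip blank lines or lines without a key
        result[key.strip()] = value.strip()
    return result


sample = """title: The Great Gatsby
creator: F. Scott Fitzgerald
identifier: urn:uuid:12345678-1234-1234-1234-123456789abc
subject: Fiction, Classic Literature"""

meta = parse_kv(sample)
print(meta['title'])
print(meta['identifier'])
```

To feed it real output, the text could come from ``subprocess.run(["epub-utils", "book.epub", "metadata", "--format", "kv"], capture_output=True, text=True, check=True).stdout``.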
Plain Text Format
-----------------
The plain text format strips HTML tags and returns readable text content.
**When to use**: Content analysis, word counting, text extraction
**Supported commands**: ``content``, ``files`` (with file path)
**Example**:
.. code-block:: bash
$ epub-utils book.epub content chapter1 --format plain
**Output**:
.. code-block:: text
Chapter 1: The Beginning
In my younger and more vulnerable years my father gave me some advice
that I've carried with me ever since. "Whenever you feel like criticizing
anyone," he told me, "just remember that all the people in this world
haven't had the advantages that you've had."
**Use cases**:
.. code-block:: bash
# Count words in a chapter (using content command)
word_count=$(epub-utils book.epub content chapter1 --format plain | wc -w)
echo "Chapter 1 has $word_count words"
# Extract all text for analysis (using files command)
epub-utils book.epub files OEBPS/chapter1.xhtml --format plain > chapter1.txt
# Search for specific content in any file
if epub-utils book.epub files OEBPS/chapter2.xhtml --format plain | grep -q "important phrase"; then
echo "Found the phrase in chapter 2"
fi
# Access files by path without knowing manifest IDs
epub-utils book.epub files OEBPS/styles/main.css
epub-utils book.epub files META-INF/container.xml
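For intuition about what tag stripping involves, the sketch below removes markup with the standard library's ``html.parser``. It is an illustration only: ``TextExtractor`` and ``strip_tags`` are hypothetical names, and this is not claimed to be how ``epub-utils`` produces its ``plain`` output.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> content."""

    SKIP_TAGS = {'script', 'style'}
    BLOCK_TAGS = {'p', 'div', 'li', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            # Keep adjacent blocks from running together once tags are gone
            self.parts.append(' ')

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(markup):
    """Return the text content of an (X)HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return ' '.join(''.join(parser.parts).split())


sample = """<html><head><style>p { color: red; }</style></head>
<body><h1>Chapter 1</h1><p>In my <em>younger</em> years &amp; later.</p></body></html>"""
print(strip_tags(sample))
```

The sketch collapses all whitespace into single spaces, which suits word counting and searching but discards paragraph breaks.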
Table Format
------------
The table format presents file information in a readable tabular layout.
**When to use**: File analysis, human-readable file listings
**Supported commands**: ``files``
**Example**:
.. code-block:: bash
$ epub-utils book.epub files --format table
**Output**:
.. code-block:: text
File Information for book.epub
┌────────────────────────────────────────┬──────────┬──────────────┬─────────────────────┐
│ Path │ Size │ Compressed │ Modified │
├────────────────────────────────────────┼──────────┼──────────────┼─────────────────────┤
│ META-INF/container.xml │ 230 B │ 140 B │ 2021-01-01 10:00:00│
│ OEBPS/content.opf │ 2.1 KB │ 856 B │ 2021-01-01 10:00:00│
│ OEBPS/toc.ncx │ 1.8 KB │ 542 B │ 2021-01-01 10:00:00│
│ OEBPS/Text/chapter01.xhtml │ 12.4 KB │ 3.2 KB │ 2021-01-01 10:00:00│
│ OEBPS/Text/chapter02.xhtml │ 15.6 KB │ 4.1 KB │ 2021-01-01 10:00:00│
│ OEBPS/Styles/stylesheet.css │ 3.2 KB │ 1.1 KB │ 2021-01-01 10:00:00│
│ OEBPS/Images/cover.jpg │ 145.2 KB │ 144.8 KB │ 2021-01-01 10:00:00│
└────────────────────────────────────────┴──────────┴──────────────┴─────────────────────┘
Command-Specific Format Support
-------------------------------
Here's a quick reference for which formats each command supports:
.. list-table:: Format Support by Command
:header-rows: 1
:widths: 20 15 15 15 15 15
* - Command
- XML
- Raw
- KV
- Plain
- Table
* - ``container``
- ✓
- ✓
- ✗
- ✗
- ✗
* - ``package``
- ✓
- ✓
- ✗
- ✗
- ✗
* - ``toc``
- ✓
- ✓
- ✗
- ✗
- ✗
* - ``metadata``
- ✓
- ✓
- ✓
- ✗
- ✗
* - ``manifest``
- ✓
- ✓
- ✗
- ✗
- ✗
* - ``spine``
- ✓
- ✓
- ✗
- ✗
- ✗
* - ``content``
- ✓
- ✓
- ✗
- ✓
- ✗
* - ``files``
- ✓*
- ✓
- ✗
- ✓*
- ✓*
.. note::
\* For the ``files`` command, format support depends on whether a file path is given. With a file path, ``xml`` and ``plain`` are available in addition to ``raw``. The ``table`` format applies only when listing files (no path specified), where ``table`` and ``raw`` are the supported formats.
Advanced Format Usage
---------------------
Combining Formats with Shell Tools
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Pretty-print with custom tools**:
.. code-block:: bash
# Use xmllint for custom XML formatting
epub-utils book.epub package --format raw | xmllint --format --noblanks -
# Convert to JSON using xq (if available)
epub-utils book.epub metadata --format raw | xq '.'
**Processing key-value output**:
.. code-block:: bash
# Export each field as an environment variable (values are kept intact)
while IFS= read -r line; do
export "${line%%: *}=${line#*: }"
done < <(epub-utils book.epub metadata --format kv)
echo "Title: $title"
# Create YAML-like output
epub-utils book.epub metadata --format kv | sed 's/^/ /' | sed '1i metadata:'
**Text analysis workflows**:
.. code-block:: bash
# Analyze reading time (assuming 200 words per minute)
words=$(epub-utils book.epub content chapter1 --format plain | wc -w)
minutes=$((words / 200))
echo "Chapter 1 reading time: $minutes minutes"
# Extract quotes (lines starting with quotation marks)
epub-utils book.epub content chapter1 --format plain | grep '^".*"$'
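The shell arithmetic above uses integer division, so a 450-word chapter is reported as 2 minutes. A small Python sketch of the same estimate that rounds up instead is shown below. ``estimate_minutes`` is a hypothetical helper, the 200 words-per-minute figure is the assumption already used above, and the sample text stands in for real ``--format plain`` output.

```python
import math


def estimate_minutes(text, words_per_minute=200):
    """Estimate reading time in whole minutes, rounded up."""
    words = len(text.split())  # whitespace-separated words, similar to `wc -w`
    return math.ceil(words / words_per_minute)


sample = "word " * 450  # stand-in for the plain-text content of a chapter
print(estimate_minutes(sample))
```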
Format Selection Guidelines
---------------------------
Choose the right format based on your use case:
**For Human Reading**:
- Use ``xml`` for inspecting EPUB structure
- Use ``table`` for file listings
- Use ``plain`` for content reading
**For Automation**:
- Use ``raw`` for piping to other XML tools
- Use ``kv`` for simple scripting and data extraction
- Use ``raw`` with ``files`` for getting simple file lists
**For Integration**:
- Use ``raw`` when feeding into other programs
- Use ``kv`` for configuration file generation
- Use ``plain`` for text processing workflows
**Performance Considerations**:
- ``raw`` format is fastest (no syntax highlighting)
- ``xml`` format has slight overhead for highlighting
- ``table`` format requires additional formatting computation
Error Handling with Formats
----------------------------
Different formats handle errors differently:
.. code-block:: bash
# XML format shows formatted error messages
$ epub-utils corrupted.epub metadata --format xml
Error: Unable to parse metadata
# Raw format may show parsing errors directly
$ epub-utils corrupted.epub metadata --format raw
ParseError: Invalid XML structure
# KV format gracefully handles missing fields
$ epub-utils incomplete.epub metadata --format kv
title:
creator: Unknown Author
language: en
Custom Format Processing
------------------------
You can create custom output formats by post-processing the raw output:
.. code-block:: bash
#!/bin/zsh
# custom-json-format.sh - Convert metadata to JSON
epub_file="$1"
echo "{"
epub-utils "$epub_file" metadata --format kv | while IFS=': ' read -r key value; do
if [[ -n "$key" && -n "$value" ]]; then
echo " \"$key\": \"$value\","
fi
done | sed '$s/,$//'
echo "}"
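The shell loop above does not escape quotes or backslashes inside values, so a title containing ``"`` would yield invalid JSON. A sketch of the same conversion in Python, assuming the ``key: value`` layout shown earlier, lets ``json.dumps`` handle the escaping. ``kv_to_json`` is a hypothetical helper and the sample input is made up.

```python
import json


def kv_to_json(kv_text):
    """Convert 'key: value' lines to a JSON object string."""
    data = {}
    for line in kv_text.splitlines():
        key, sep, value = line.partition(':')
        key, value = key.strip(), value.strip()
        # Mirror the shell script: skip lines with an empty key or value
        if sep and key and value:
            data[key] = value
    return json.dumps(data, indent=2, ensure_ascii=False)


sample = 'title: The "Great" Gatsby\ncreator: F. Scott Fitzgerald\npublisher:'
print(kv_to_json(sample))
```

Building a dictionary first also avoids the trailing-comma clean-up that the ``sed`` step performs in the shell version.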
.. code-block:: bash
#!/bin/zsh
# custom-markdown-format.sh - Convert metadata to Markdown
epub_file="$1"
echo "# Book Information"
echo ""
epub-utils "$epub_file" metadata --format kv | while IFS=': ' read -r key value; do
if [[ -n "$key" && -n "$value" ]]; then
formatted_key=$(echo "$key" | sed 's/\b\w/\U&/g') # Title case (\U requires GNU sed)
echo "**$formatted_key**: $value"
fi
done
Pretty-Print Option
-------------------
The ``--pretty-print`` (or ``-pp``) option enhances XML output by adding proper indentation and structure, making it more readable for human inspection.
**When to use**: Human review, debugging XML structure, cleaner output for documentation
**Supported formats**: ``xml`` and ``raw``
**Example without pretty-print**:
.. code-block:: bash
$ epub-utils book.epub metadata --format raw
**Output**:
.. code-block:: xml
<?xml version="1.0" encoding="UTF-8"?><metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf"><dc:title>The Great Gatsby</dc:title><dc:creator>F. Scott Fitzgerald</dc:creator><dc:language>en</dc:language></metadata>
**Example with pretty-print**:
.. code-block:: bash
$ epub-utils book.epub metadata --format raw --pretty-print
**Output**:
.. code-block:: xml
<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:opf="http://www.idpf.org/2007/opf">
<dc:title>The Great Gatsby</dc:title>
<dc:creator>F. Scott Fitzgerald</dc:creator>
<dc:language>en</dc:language>
</metadata>
**Use cases**:
.. code-block:: bash
# Better readability for manual inspection
epub-utils book.epub package --pretty-print
# Clean output for documentation or examples
epub-utils book.epub container --format raw --pretty-print
# Pipe to file with proper formatting
epub-utils book.epub toc --pretty-print > toc-formatted.xml
**Note**: Pretty-print has no effect on ``kv``, ``plain``, or ``table`` formats as these are already optimized for readability.
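For readers who want a similar effect inside a Python script, the sketch below re-indents a single-line XML string with the standard library's ``xml.dom.minidom``. It illustrates the idea only; the exact whitespace and attribute layout produced by ``--pretty-print`` may differ.

```python
from xml.dom import minidom

# Single-line XML, similar in shape to the raw example above
raw = (
    '<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:title>The Great Gatsby</dc:title>'
    '<dc:creator>F. Scott Fitzgerald</dc:creator>'
    '<dc:language>en</dc:language>'
    '</metadata>'
)

# toprettyxml() places each element on its own line using the given indent
pretty = minidom.parseString(raw).toprettyxml(indent='  ')
print(pretty)
```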
Best Practices
--------------
1. **Default to XML for interactive use** - it's the most readable
2. **Use raw for scripting** - it's the most reliable for automation
3. **Use kv for metadata extraction** - it's purpose-built for simple parsing
4. **Use plain for content analysis** - it removes HTML complexity
5. **Use pretty-print for human review** - it makes XML structure clearer
6. **Always handle errors** - EPUB files can be malformed
7. **Test with various EPUB files** - format output can vary with different EPUB structures
These format options make epub-utils flexible enough to handle everything from quick
interactive inspection to complex automated workflows.
================================================
FILE: docs/index.rst
================================================
epub-utils: EPUB Inspection and Manipulation
=============================================
.. image:: https://img.shields.io/pypi/v/epub-utils.svg
:target: https://pypi.org/project/epub-utils/
:alt: PyPI version
.. image:: https://img.shields.io/pypi/pyversions/epub-utils.svg?logo=python&logoColor=white
:target: https://pypi.org/project/epub-utils/
:alt: Python versions
.. image:: https://img.shields.io/badge/license-Apache%202.0-blue.svg
:target: https://github.com/ernestofgonzalez/epub-utils/blob/main/LICENSE
:alt: License
**epub-utils** is a comprehensive Python library and command-line tool for working with EPUB files.
It provides both a programmatic API and an intuitive CLI for inspecting and parsing EPUB archives.
.. note::
epub-utils supports **EPUB 2.0.1** and **EPUB 3.0+** specifications, ensuring compatibility
with the vast majority of EPUB files in circulation.
Key Features
------------
**Rich CLI Interface**
- Syntax-highlighted XML output
- Multiple output formats (XML, raw, key-value, plain text, table)
- Comprehensive file inspection capabilities
**Complete EPUB Support**
- Parse container.xml and package files
- Extract and display table of contents
- Access manifest and spine information
- Retrieve document content by ID
**Metadata Extraction**
- Dublin Core metadata support
- EPUB-specific metadata fields
- Key-value output for easy parsing
**Python API**
- Clean, object-oriented interface
- Lazy loading for performance
- Comprehensive error handling
Quick Start
-----------
Installation
~~~~~~~~~~~~
.. code-block:: bash
$ pip install epub-utils
Basic CLI Usage
~~~~~~~~~~~~~~~
Inspect an EPUB file with a simple command:
.. code-block:: bash
# Display metadata with beautiful syntax highlighting
$ epub-utils my-book.epub metadata
# Show table of contents structure
$ epub-utils my-book.epub toc
# Get key-value metadata for scripting
$ epub-utils my-book.epub metadata --format kv
Basic Python Usage
~~~~~~~~~~~~~~~~~~
.. code-block:: python
from epub_utils import Document
# Load an EPUB document
doc = Document("path/to/book.epub")
# Access metadata easily
print(f"Title: {doc.package.metadata.title}")
print(f"Author: {doc.package.metadata.creator}")
print(f"Language: {doc.package.metadata.language}")
# Get table of contents
toc_xml = doc.toc.to_xml()
print(toc_xml)
Why epub-utils?
---------------
epub-utils fills a crucial gap in the Python ecosystem for EPUB file manipulation. While there are
libraries for creating EPUBs, few focus on inspection and analysis. This tool is perfect for:
**Publishers and Authors**
Validate EPUB structure and metadata before distribution
**Digital Librarians**
Batch process and analyze EPUB collections
**Automation Scripts**
Extract metadata for catalogs and databases
**Debugging**
Inspect malformed or problematic EPUB files
**Learning**
Understand EPUB structure and standards compliance
Documentation Contents
----------------------
.. toctree::
:maxdepth: 2
:caption: User Guide
installation
cli-tutorial
api-tutorial
examples
formats
.. toctree::
:maxdepth: 2
:caption: Reference
cli-reference
api-reference
epub-standards
.. toctree::
:maxdepth: 1
:caption: Development
contributing
changelog
Community & Support
-------------------
- **Source Code**: `GitHub Repository <https://github.com/ernestofgonzalez/epub-utils>`_
- **Issues**: `Bug Reports & Feature Requests <https://github.com/ernestofgonzalez/epub-utils/issues>`_
- **PyPI**: `Package Index <https://pypi.org/project/epub-utils/>`_
License
-------
``epub-utils`` is distributed under the `Apache License 2.0 <https://github.com/ernestofgonzalez/epub-utils/blob/main/LICENSE>`_.
================================================
FILE: docs/installation.rst
================================================
Installation Guide
==================
System Requirements
-------------------
``epub-utils`` requires Python 3.10 or higher and works on:
- **Linux** (Ubuntu 18.04+, Debian 10+, CentOS 7+, Fedora 30+)
- **macOS** (10.14+)
- **Windows** (Windows 10+)
Installing from PyPI
---------------------
The easiest way to install ``epub-utils`` is using pip:
.. code-block:: bash
$ pip install epub-utils
This will install the latest stable version with all required dependencies.
Development Installation
------------------------
If you want to contribute to ``epub-utils`` or use the latest development version:
.. code-block:: bash
# Clone the repository
$ git clone https://github.com/ernestofgonzalez/epub-utils.git
$ cd epub-utils
# Create a virtual environment
$ python -m venv env
$ source env/bin/activate # On Windows: env\Scripts\activate
# Install in development mode
$ pip install -e .
# Install development dependencies
$ pip install -r requirements/requirements-testing.txt
$ pip install -r requirements/requirements-linting.txt
Virtual Environment Installation
--------------------------------
For isolated installations, we recommend using virtual environments:
Using venv (Python 3.3+)
~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash
# Create virtual environment
$ python -m venv epub-utils-env
# Activate virtual environment
$ source epub-utils-env/bin/activate # Linux/macOS
$ epub-utils-env\Scripts\activate # Windows
# Install epub-utils
$ pip install epub-utils
Using conda
~~~~~~~~~~~
.. code-block:: bash
# Create conda environment
$ conda create -n epub-utils python=3.10
# Activate environment
$ conda activate epub-utils
# Install epub-utils
$ pip install epub-utils
Verifying Installation
----------------------
After installation, verify that ``epub-utils`` is working correctly:
.. code-block:: bash
# Check version
$ epub-utils --version
# Test with a sample EPUB (if you have one)
$ epub-utils sample.epub metadata
If you see the version number and can run commands without errors, the installation was successful!
Installing from Source
----------------------
To install from source code:
.. code-block:: bash
# Download and extract the source
$ wget https://github.com/ernestofgonzalez/epub-utils/archive/main.zip
$ unzip main.zip
$ cd epub-utils-main
# Install
$ pip install .
Upgrading
---------
To upgrade to the latest version:
.. code-block:: bash
$ pip install --upgrade epub-utils
Uninstalling
------------
To remove epub-utils:
.. code-block:: bash
$ pip uninstall epub-utils
Performance Considerations
--------------------------
Installing lxml
~~~~~~~~~~~~~~~
While not required, installing ``lxml`` can significantly improve XML parsing performance:
.. code-block:: bash
$ pip install lxml
``epub-utils`` will automatically use lxml if available, falling back to the standard library's
``xml.etree.ElementTree`` if not.
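A fallback of this kind is commonly written as a guarded import. The sketch below shows that general pattern; it is not copied from the ``epub-utils`` source, and ``PARSER_NAME`` is a name made up for the example.

```python
try:
    # Optional third-party parser, used when it is installed
    from lxml import etree
    PARSER_NAME = 'lxml'
except ImportError:
    # Standard-library fallback, always available
    import xml.etree.ElementTree as etree
    PARSER_NAME = 'xml.etree.ElementTree'

# Both modules expose fromstring(), .tag and .get() for simple parsing
root = etree.fromstring('<package version="3.0"><metadata/></package>')
print(PARSER_NAME, root.tag, root.get('version'))
```

The two libraries are compatible for basic parsing like this, but they differ in more advanced features, so code relying on the fallback should stick to the shared subset.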
================================================
FILE: epub_utils/__init__.py
================================================
from epub_utils.container import Container
from epub_utils.doc import Document
__all__ = ['Document', 'Container']
================================================
FILE: epub_utils/__main__.py
================================================
from epub_utils.cli import main
if __name__ == '__main__':
main(prog_name='epub-utils')
================================================
FILE: epub_utils/cli.py
================================================
import click
from epub_utils.doc import Document
from epub_utils.exceptions import (
EPUBError,
FileNotFoundError,
)
VERSION = '0.1.0a1'
def format_error_message(e: Exception) -> str:
    """Format exception messages for CLI output."""
    # EPUBError supplies its own formatting via __str__, so every
    # exception type is rendered the same way here.
    return str(e)
def print_version(ctx, param, value):
if not value or ctx.resilient_parsing:
return
click.echo(VERSION)
ctx.exit()
@click.group(
context_settings=dict(help_option_names=['-h', '--help']),
)
@click.option(
'-v',
'--version',
is_flag=True,
callback=print_version,
expose_value=False,
is_eager=True,
help='Print epub-utils version.',
)
@click.argument(
'path',
type=click.Path(exists=True, file_okay=True),
required=True,
)
@click.pass_context
def main(ctx, path):
ctx.ensure_object(dict)
ctx.obj['path'] = path
def format_option(default='xml'):
"""Reusable decorator for the format option."""
return click.option(
'-fmt',
'--format',
type=click.Choice(['raw', 'xml', 'plain', 'kv'], case_sensitive=False),
default=default,
help=f'Output format, defaults to {default}.',
)
def pretty_print_option():
"""Reusable decorator for the pretty-print option."""
return click.option(
'-pp',
'--pretty-print',
is_flag=True,
default=False,
help='Pretty-print XML output (only applies to raw and xml formats).',
)
def output_document_part(doc, part_name, format, pretty_print=False):
    """Helper function to output document parts in the specified format."""
    part = getattr(doc, part_name)
    if format == 'raw':
        click.echo(part.to_str(pretty_print=pretty_print))
    elif format == 'xml':
        click.echo(part.to_xml(pretty_print=pretty_print))
    elif format == 'kv':
        if callable(getattr(part, 'to_kv', None)):
            click.echo(part.to_kv())
        else:
            click.secho(
                'Key-value format not supported for this document part. Falling back to raw:\n',
                fg='yellow',
            )
            click.echo(part.to_str())
    else:
        # Any other accepted choice (e.g. 'plain') is not implemented for
        # document parts; warn and fall back instead of printing nothing.
        click.secho(
            f'{format.title()} format not supported for this document part. Falling back to raw:\n',
            fg='yellow',
        )
        click.echo(part.to_str())
def format_file_size(size_bytes: int) -> str:
"""Format file size in human-readable format."""
if size_bytes == 0:
return '0 B'
size_names = ['B', 'KB', 'MB', 'GB']
i = 0
size = float(size_bytes)
while size >= 1024.0 and i < len(size_names) - 1:
size /= 1024.0
i += 1
if i == 0:
return f'{int(size)} {size_names[i]}'
else:
return f'{size:.1f} {size_names[i]}'
def format_files_table(files_info: list) -> str:
"""Format file information as a table."""
if not files_info:
return 'No files found in EPUB archive.'
# Calculate column widths
max_path_width = max(len(file_info['path']) for file_info in files_info)
max_size_width = max(len(format_file_size(file_info['size'])) for file_info in files_info)
max_compressed_width = max(
len(format_file_size(file_info['compressed_size'])) for file_info in files_info
)
# Ensure minimum widths for headers
path_width = max(max_path_width, len('Path'))
size_width = max(max_size_width, len('Size'))
compressed_width = max(max_compressed_width, len('Compressed'))
modified_width = len('Modified') # Fixed width for date/time
# Create header
header = f'{"Path":<{path_width}} | {"Size":>{size_width}} | {"Compressed":>{compressed_width}} | {"Modified":<{modified_width}}'
separator = '-' * len(header)
# Create rows
rows = []
for file_info in files_info:
path = file_info['path'][:path_width] # Truncate if too long
size = format_file_size(file_info['size'])
compressed = format_file_size(file_info['compressed_size'])
modified = file_info['modified']
row = f'{path:<{path_width}} | {size:>{size_width}} | {compressed:>{compressed_width}} | {modified:<{modified_width}}'
rows.append(row)
# Combine all parts
result = [header, separator] + rows
return '\n'.join(result)
@main.command()
@format_option()
@pretty_print_option()
@click.pass_context
def container(ctx, format, pretty_print):
"""Outputs the container information of the EPUB file."""
try:
doc = Document(ctx.obj['path'])
output_document_part(doc, 'container', format, pretty_print)
except EPUBError as e:
click.secho('EPUB Error:', fg='red', bold=True, err=True)
click.secho(format_error_message(e), fg='red', err=True)
ctx.exit(1)
except Exception as e:
click.secho('Unexpected Error:', fg='red', bold=True, err=True)
click.secho(str(e), fg='red', err=True)
ctx.exit(1)
@main.command()
@format_option()
@pretty_print_option()
@click.pass_context
def package(ctx, format, pretty_print):
"""Outputs the package information of the EPUB file."""
doc = Document(ctx.obj['path'])
output_document_part(doc, 'package', format, pretty_print)
@main.command()
@format_option()
@pretty_print_option()
@click.option(
'--ncx',
is_flag=True,
default=False,
help='Force retrieval of NCX file (EPUB 2 navigation control file).',
)
@click.option(
'--nav',
is_flag=True,
default=False,
help='Force retrieval of Navigation Document (EPUB 3 navigation file).',
)
@click.pass_context
def toc(ctx, format, pretty_print, ncx, nav):
"""Outputs the Table of Contents (TOC) of the EPUB file."""
doc = Document(ctx.obj['path'])
if ncx and nav:
click.secho('Error: --ncx and --nav flags cannot be used together.', fg='red', err=True)
ctx.exit(1)
if ncx:
part = 'ncx'
if doc.ncx is None:
click.secho(
'Error: This document does not include a Navigation Control eXtended (NCX).',
fg='red',
err=True,
)
ctx.exit(1)
elif nav:
part = 'nav'
if doc.nav is None:
click.secho(
'Error: This document does not include an EPUB Navigation Document.',
fg='red',
err=True,
)
ctx.exit(1)
else:
part = 'toc'
output_document_part(doc, part, format, pretty_print)
@main.command()
@format_option()
@pretty_print_option()
@click.pass_context
def metadata(ctx, format, pretty_print):
"""Outputs the metadata information from the package file."""
doc = Document(ctx.obj['path'])
package = doc.package
output_document_part(package, 'metadata', format, pretty_print)
@main.command()
@format_option()
@pretty_print_option()
@click.pass_context
def manifest(ctx, format, pretty_print):
"""Outputs the manifest information from the package file."""
doc = Document(ctx.obj['path'])
package = doc.package
output_document_part(package, 'manifest', format, pretty_print)
@main.command()
@format_option()
@pretty_print_option()
@click.pass_context
def spine(ctx, format, pretty_print):
"""Outputs the spine information from the package file."""
doc = Document(ctx.obj['path'])
package = doc.package
output_document_part(package, 'spine', format, pretty_print)
@main.command()
@click.argument('item_id', required=True)
@format_option()
@pretty_print_option()
@click.pass_context
def content(ctx, item_id, format, pretty_print):
"""Outputs the content of a document by its manifest item ID."""
doc = Document(ctx.obj['path'])
content = doc.find_content_by_id(item_id)
if format == 'raw':
click.echo(content.to_str())
elif format == 'xml':
if hasattr(content, 'to_xml'):
click.echo(content.to_xml(pretty_print=pretty_print))
else:
click.echo(content.to_str())
elif format == 'plain':
click.echo(content.to_plain())
elif format == 'kv':
click.secho(
'Key-value format not supported for content documents. Falling back to raw:\n',
fg='yellow',
)
click.echo(content.to_str())


@main.command()
@click.argument('file_path', required=False)
@click.option(
    '-fmt',
    '--format',
    type=click.Choice(['table', 'raw', 'xml', 'plain', 'kv'], case_sensitive=False),
    default=None,
    help='Output format. For file listing: table, raw. For file content: raw, xml, plain, kv. Defaults to table for listing, xml for file content.',
)
@pretty_print_option()
@click.pass_context
def files(ctx, file_path, format, pretty_print):
    """List all files in the EPUB archive with their metadata, or output content of a specific file."""
    doc = Document(ctx.obj['path'])

    # Set dynamic default based on whether file_path is provided
    if format is None:
        format = 'xml' if file_path else 'table'

    if file_path:
        # Display content of specific file
        try:
            content = doc.get_file_by_path(file_path)
        except FileNotFoundError as e:
            click.secho('FileNotFoundError:', fg='red', bold=True, err=True)
            click.secho(format_error_message(e), fg='red', err=True)
            ctx.exit(1)
            return

        # Handle XHTMLContent objects
        if hasattr(content, 'to_str'):
            if format == 'raw':
                click.echo(content.to_str())
            elif format == 'xml':
                if hasattr(content, 'to_xml'):
                    click.echo(content.to_xml(pretty_print=pretty_print))
                else:
                    click.echo(content.to_str())
            elif format == 'plain':
                if hasattr(content, 'to_plain'):
                    click.echo(content.to_plain())
                else:
                    click.echo(content.to_str())
            elif format == 'kv':
                click.secho(
                    'Key-value format not supported for file content. Falling back to raw:\n',
                    fg='yellow',
                )
                click.echo(content.to_str())
            elif format == 'table':
                # For file content, table format doesn't make sense, fall back to raw
                click.secho(
                    'Table format not supported for file content. Falling back to raw:\n',
                    fg='yellow',
                )
                click.echo(content.to_str())
        else:
            # Handle raw string content (non-XHTML files)
            click.echo(content)
    else:
        # List all files (existing behavior)
        files_info = doc.get_files_info()

        if format == 'table':
            click.echo(format_files_table(files_info))
        elif format == 'raw':
            for file_info in files_info:
                click.echo(f'{file_info["path"]}')
        else:
            # For file listing, only table and raw make sense
            if format in ['xml', 'plain', 'kv']:
                click.secho(
                    f'{format.title()} format not supported for file listing. Using table format:\n',
                    fg='yellow',
                )
            click.echo(format_files_table(files_info))
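
The `files` command above picks its default format from whether a file path was given, and falls back when a requested format does not apply. The sketch below restates that decision logic as a pure function using only the standard library. `resolve_format`, `LISTING_FORMATS`, and `CONTENT_FORMATS` are hypothetical names introduced for illustration and are not part of `epub_utils`; the sketch also ignores the branch where non-XHTML content is echoed unchanged.

```python
from typing import Optional, Tuple

# Formats that the command handles without a fallback (taken from the branches above).
LISTING_FORMATS = {'table', 'raw'}
CONTENT_FORMATS = {'raw', 'xml', 'plain'}


def resolve_format(file_path: Optional[str], fmt: Optional[str]) -> Tuple[str, bool]:
    """Return (effective_format, fell_back).

    Hypothetical helper: it mirrors the defaulting and fallback rules of the
    `files` command but is not an epub_utils API.
    """
    if fmt is None:
        # Dynamic default: xml for a single file, table for the listing.
        return ('xml' if file_path else 'table', False)

    fmt = fmt.lower()
    if file_path:
        # File content: kv and table are not supported and fall back to raw.
        if fmt in CONTENT_FORMATS:
            return (fmt, False)
        return ('raw', True)

    # File listing: anything other than table or raw falls back to table.
    if fmt in LISTING_FORMATS:
        return (fmt, False)
    return ('table', True)
```

Keeping the decision in one function like this would make the fallback rules testable without invoking the CLI; whether that refactoring suits the project is a design choice left open here.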
================================================
FILE: epub_utils/container.py
================================================
"""
Open Container Format: https://www.w3.org/TR/epub/#sec-ocf

This file includes the `Container` class, which is responsible for parsing the `container.xml` file
of an EPUB archive. The `container.xml` file is a required component of the EPUB Open Container
Format (OCF) and is located in the `META-INF` directory of the EPUB archive.

The `container.xml` file serves as the entry point for identifying the package document(s)
within the EPUB container. It must conform to the following structure as defined in the EPUB
specification:

- The root element is `<container>` and must include the `version` attribute with the value "1.0".
- The `<container>` element must contain exactly one `<rootfiles>` child element.
- The `<rootfiles>` element must contain one or more `<rootfile>` child elements.
- Each `<rootfile>` element must include a `full-path` attribute that specifies the location of
  the package document relative to the root of the EPUB container.

Namespace:
- All elements in the `container.xml` file are in the namespace
  `urn:oasis:names:tc:opendocument:xmlns:container`.

For more details on the structure and requirements of the `container.xml` file, refer to the
EPUB specification: https://www.w3.org/TR/epub/#sec-ocf
"""

try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

from epub_utils.exceptions import InvalidEPUBError, ParseError
from epub_utils.printers import XMLPrinter


class Container:
    """
    Represents the parsed container.xml file of an EPUB.

    Attributes:
        xml_content (str): The raw XML content of the container.xml file.
        rootfile_path (str): The path to the rootfile specified in the container.
    """

    NAMESPACE = 'urn:oasis:names:tc:opendocument:xmlns:container'
    ROOTFILE_XPATH = f'.//{{{NAMESPACE}}}rootfile'

    def __init__(self, xml_content: str) -> None:
        """
        Initialize the Container by parsing the container.xml data.

        Args:
            xml_content (str): The raw XML content of the container.xml file.
        """
        self.xml_content = xml_content
        self.rootfile_path: str = None
        self._parse(xml_content)

        self._printer = XMLPrinter(self)

    def __str__(self) -> str:
        return self.xml_content

    def to_str(self, *args, **kwargs) -> str:
        return self._printer.to_str(*args, **kwargs)

    def to_xml(self, *args, **kwargs) -> str:
        return self._printer.to_xml(*args, **kwargs)

    def _find_rootfile_element(self, root: etree.Element) -> etree.Element:
        """
        Finds the rootfile element in the container.xml data.

        Args:
            root (etree.Element): The root element of the parsed XML.

        Returns:
            etree.Element: The rootfile element.

        Raises:
            InvalidEPUBError: If the rootfile element or its 'full-path' attribute is missing.
        """
        rootfile_element = root.find(self.ROOTFILE_XPATH)
        if rootfile_element is None:
            raise InvalidEPUBError(
                'Invalid container.xml: Missing rootfile element',
                suggestions=[
                    'Ensure the container.xml contains a rootfile element',
                    'Check that the container structure follows EPUB specifications',
                    'Verify the EPUB was created with compliant tools',
                ],
            )
        if 'full-path' not in rootfile_element.attrib:
            raise InvalidEPUBError(
                "Invalid container.xml: Missing 'full-path' attribute in rootfile element",
                suggestions=[
                    "Ensure the rootfile element has a 'full-path' attribute",
                    'Check that the container.xml follows EPUB specifications',
                    'Verify the EPUB package structure is complete',
                ],
            )
        return rootfile_element

    def _parse(self, xml_content: str) -> None:
        """
        Parses the container.xml data to extract the rootfile path.

        Args:
            xml_content (str): The raw XML content of the container.xml file.

        Raises:
            ParseError: If the XML is invalid or cannot be parsed.
            InvalidEPUBError: If the container.xml structure is invalid.
        """
        try:
            if isinstance(xml_content, str):
                xml_content = xml_content.encode('utf-8')
            root = etree.fromstring(xml_content)

            rootfile_element = self._find_rootfile_element(root)

            self.rootfile_path = rootfile_element.attrib['full-path']

            if not self.rootfile_path.strip():
                raise InvalidEPUBError(
                    "Invalid container.xml: 'full-path' attribute is empty",
                    suggestions=[
                        "Ensure the rootfile element has a non-empty 'full-path' attribute",
                        'Check that the path points to a valid OPF file',
                        'Verify the EPUB package structure is complete',
                    ],
                )

        except etree.ParseError as e:
            raise ParseError(
                f'Invalid XML in container.xml: {str(e)}',
                suggestions=[
                    'Check that the container.xml file contains valid XML',
                    'Verify the file is not corrupted',
                    'Ensure all XML tags are properly closed',
                    'Check for invalid characters in the XML',
                ],
            ) from e
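
The rootfile lookup performed by `Container` can be tried in isolation. The following is a minimal sketch using only the standard library `xml.etree.ElementTree`; `find_rootfile_path`, `CONTAINER_NS`, and the sample document (including its `OEBPS/content.opf` path) are illustrative assumptions, and plain `ValueError` stands in for the project's `InvalidEPUBError`.

```python
import xml.etree.ElementTree as ET

# Namespace used by every element of container.xml.
CONTAINER_NS = 'urn:oasis:names:tc:opendocument:xmlns:container'

# Hypothetical sample document; the package path is made up for illustration.
SAMPLE_CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def find_rootfile_path(xml_text: str) -> str:
    """Return the 'full-path' of the first rootfile element (illustrative helper)."""
    # Encode first so the XML declaration's encoding is honoured by the parser.
    root = ET.fromstring(xml_text.encode('utf-8'))
    rootfile = root.find(f'.//{{{CONTAINER_NS}}}rootfile')
    if rootfile is None:
        raise ValueError('container.xml has no rootfile element')
    full_path = rootfile.attrib.get('full-path', '')
    if not full_path.strip():
        raise ValueError("rootfile element has a missing or empty 'full-path'")
    return full_path


rootfile_path = find_rootfile_path(SAMPLE_CONTAINER_XML)
print(rootfile_path)
```

Like the class above, the sketch reads only the first `<rootfile>`; a container that lists several package documents would need `findall` instead of `find`.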
================================================
FILE: epub_utils/content/__init__.py
================================================
from epub_utils.content.base import Content
from epub_utils.content.xhtml import XHTMLContent
__all__ = ['Content', 'XHTMLContent']
================================================
FILE: epub_utils/content/base.py
================================================
class Content:
    """
    Base class for EPUB content documents.

    Attributes:
        media_type (str): The MIME type of the content.
        href (str): The path to the content file within the EPUB.
    """

    def __init__(self, media_type: str, href: str) -> None:
        self.media_type = media_type
        self.href = href
================================================
FILE: epub_utils/content/xhtml.py
================================================
import re

from lxml import etree

from epub_utils.content.base import Content
from epub_utils.exceptions import ParseError, UnsupportedFormatError
from epub_utils.printers import XMLPrinter


class XHTMLContent(Content):
    """
    Represents an XHTML content document within an EPUB file.
    """

    MEDIA_TYPES = ['application/xhtml+xml', 'text/html']

    def __init__(self, xml_content: str, media_type: str, href: str) -> None:
        self.xml_content = xml_content
        self._tree = None

        if media_type not in self.MEDIA_TYPES:
            raise UnsupportedFormatError(
                f"Media type '{media_type}' is not supported for XHTML content",
                suggestions=[
                    f'Use one of the supported media types: {", ".join(self.MEDIA_TYPES)}',
                    'Check that this is an XHTML content file',
                    'Verify the manifest declares the correct media type',
                ],
            )

        super().__init__(media_type, href)

        self._parse(xml_content)

        self._printer = XMLPrinter(self)

    def __str__(self) -> str:
        return self.xml_content

    def to_str(self, *args, **kwargs) -> str:
        return self._printer.to_str(*args, **kwargs)

    def to_xml(self, *args, **kwargs) -> str:
        return self._printer.to_xml(*args, **kwargs)

    def to_plain(self) -> str:
        return self.inner_text

    def _parse(self, xml_content: str) -> None:
        try:
            self._tree = etree.fromstring(xml_content.encode('utf-8'))
        except etree.ParseError as e:
            raise ParseError(
                f'Invalid XML in XHTML content file: {str(e)}',
                suggestions=[
                    'Check that the content file contains valid XHTML',
                    'Verify the file is not corrupted',
                    'Ensure all XML tags are properly closed',
                    'Check for invalid characters in the XML',
                ],
            ) from e

    @property
    def tree(self):
        """Lazily parse and cache the XHTML tree."""
        if self._tree is None:
            self._parse(self.xml_content)
        return self._tree

    @property
    def inner_text(self) -> str:
        tree = self.tree

        body_elements = tree.xpath('//*[local-name()="body"]')
        if body_elements:
            inner_text = ''.join(body_elements[0].itertext())
        else:
            inner_text = ''.join(tree.itertext())

        # Normalize whitespace
        inner_text = re.sub(r'\s+', ' ', inner_text).strip()

        return inner_text
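
The `inner_text` property above collects the text of the `<body>` element and collapses whitespace. Below is a minimal standalone sketch of the same idea. It swaps in the standard library `xml.etree.ElementTree` for `lxml`, so the `local-name()` XPath is replaced by a manual tag comparison; `extract_inner_text` and the sample markup are hypothetical and not part of `epub_utils`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample content document.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Ignored title</title></head>
  <body>
    <h1>Chapter 1</h1>
    <p>Hello,
       <em>world</em>!</p>
  </body>
</html>
"""


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit('}', 1)[-1]


def extract_inner_text(xhtml: str) -> str:
    """Return whitespace-normalized text of <body>, or of the whole tree if absent."""
    root = ET.fromstring(xhtml.encode('utf-8'))
    body = next(
        (el for el in root.iter() if isinstance(el.tag, str) and local_name(el.tag) == 'body'),
        None,
    )
    source = body if body is not None else root
    text = ''.join(source.itertext())
    # Collapse runs of whitespace, as the property above does.
    return re.sub(r'\s+', ' ', text).strip()


print(extract_inner_text(SAMPLE_XHTML))  # prints "Chapter 1 Hello, world!"
```

Text inside `<head>` is skipped because only the body subtree is walked; when no body exists the whole tree is used as a fallback.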
================================================
FILE: epub_utils/doc.py
================================================
import os
import zipfile
from datetime import datetime
from functools import cached_property
from pathlib import Path
from typing import Dict, List, Optional, Union

from epub_utils.container import Container
from epub_utils.content import XHTMLContent
from epub_utils.exceptions import FileNotFoundError as EPUBFileNotFoundError
from epub_utils.exceptions import InvalidEPUBError
from epub_utils.navigation import EPUBNavDocNavigation, Navigation, NCXNavigation
from epub_utils.package import Package


class Document:
    """
    Represents an EPUB document.

    Attributes:
        path (Path): The path to the EPUB file.
        _container (Container): The parsed container document.
        _package (Package): The parsed package document.
        _toc (Navigation): The parsed table of contents document.
        _ncx (NCXNavigation): The parsed NCX navigation document.
        _nav (EPUBNavDocNavigation): The parsed EPUB navigation document.
    """

    CONTAINER_FILE_PATH = 'META-INF/container.xml'

    def __init__(self, path: Union[str, Path]) -> None:
        """
        Initialize the Document from a given path.

        Args:
            path (str | Path): The path to the EPUB file.

        Raises:
            InvalidEPUBError: If the file does not exist or is not a valid ZIP archive.
        """
        self.path: Path = Path(path)
        if not self.path.exists():
            raise InvalidEPUBError(
                f'EPUB file does not exist: {self.path}',
                suggestions=[
                    'Check that the file path is correct',
                    'Verify the file has not been moved or deleted',
                ],
                file_path=str(self.path),
            )
        if not zipfile.is_zipfile(self.path):
            raise InvalidEPUBError(
                f'File is not a valid ZIP archive: {self.path}',
                suggestions=[
                    'Ensure the file is a valid EPUB (which is a ZIP archive)',
                    'Check that the file is not corrupted',
                    'Verify the file extension is .epub',
                ],
                file_path=str(self.path),
            )

        self._container: Container = None
        self._package: Package = None
        self._toc: Navigation = None
        self._ncx: NCXNavigation = None
        self._nav: EPUBNavDocNavigation = None

    def _read_file_from_epub(self, file_path: str) -> str:
        """
        Read and decode a file from the EPUB archive.

        Args:
            file_path (str): Path to the file within the EPUB archive.

        Returns:
            str: Decoded contents of the file.

        Raises:
            EPUBFileNotFoundError: If the file is missing from the EPUB archive.
        """
        with zipfile.ZipFile(self.path, 'r') as epub_zip:
            norm_namelist = {os.path.normpath(name): name for name in epub_zip.namelist()}
            norm_path = os.path.normpath(file_path)
            if norm_path not in norm_namelist:
                available_files = sorted(norm_namelist.keys())[:10]  # Show first 10 files
                suggestions = [
                    'Check that the file path is correct',
                    'Verify the EPUB file structure is complete',
                ]
                if available_files:
                    file_list = ', '.join(available_files)
                    if len(norm_namelist) > 10:
                        file_list += f' (and {len(norm_namelist) - 10} more)'
                    suggestions.append(f'Available files include: {file_list}')
                raise EPUBFile
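
The lookup in `_read_file_from_epub` maps normalized archive names back to their original names so that paths containing `..` segments still resolve. The sketch below shows that technique on an in-memory ZIP using only the standard library. `build_sample_archive` and `read_member` are hypothetical helpers, the archive contents are made up, and the read-and-decode-as-UTF-8 step is an assumption because the original method is cut off before that point.

```python
import io
import os
import zipfile


def build_sample_archive() -> bytes:
    """Create a tiny in-memory ZIP with made-up entries for the demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('META-INF/container.xml', '<container/>')
        archive.writestr('OEBPS/text/ch1.xhtml', '<html/>')
    return buffer.getvalue()


def read_member(archive_bytes: bytes, file_path: str) -> str:
    """Read a member by path, tolerating non-normalized paths (illustrative helper)."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes), 'r') as archive:
        # Map each normalized name to the name actually stored in the archive.
        normalized = {os.path.normpath(name): name for name in archive.namelist()}
        key = os.path.normpath(file_path)
        if key not in normalized:
            raise KeyError(f'{file_path!r} not found in archive')
        # Assumption: members are UTF-8 text.
        return archive.read(normalized[key]).decode('utf-8')


sample = build_sample_archive()
content = read_member(sample, 'OEBPS/images/../text/ch1.xhtml')
print(content)
```

Reading through the stored name rather than the normalized one matters on platforms where `os.path.normpath` changes the path separator.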
SYMBOL INDEX (333 symbols across 28 files)
FILE: epub_utils/cli.py
function format_error_message (line 12) | def format_error_message(e: Exception) -> str:
function print_version (line 22) | def print_version(ctx, param, value):
function main (line 47) | def main(ctx, path):
function format_option (line 52) | def format_option(default='xml'):
function pretty_print_option (line 63) | def pretty_print_option():
function output_document_part (line 74) | def output_document_part(doc, part_name, format, pretty_print=False):
function format_file_size (line 92) | def format_file_size(size_bytes: int) -> str:
function format_files_table (line 111) | def format_files_table(files_info: list) -> str:
function container (line 153) | def container(ctx, format, pretty_print):
function package (line 172) | def package(ctx, format, pretty_print):
function toc (line 194) | def toc(ctx, format, pretty_print, ncx, nav):
function metadata (line 230) | def metadata(ctx, format, pretty_print):
function manifest (line 241) | def manifest(ctx, format, pretty_print):
function spine (line 252) | def spine(ctx, format, pretty_print):
function content (line 264) | def content(ctx, item_id, format, pretty_print):
function files (line 297) | def files(ctx, file_path, format, pretty_print):
FILE: epub_utils/container.py
class Container (line 35) | class Container:
method __init__ (line 47) | def __init__(self, xml_content: str) -> None:
method __str__ (line 61) | def __str__(self) -> str:
method to_str (line 64) | def to_str(self, *args, **kwargs) -> str:
method to_xml (line 67) | def to_xml(self, *args, **kwargs) -> str:
method _find_rootfile_element (line 70) | def _find_rootfile_element(self, root: etree.Element) -> etree.Element:
method _parse (line 106) | def _parse(self, xml_content: str) -> None:
FILE: epub_utils/content/base.py
class Content (line 1) | class Content:
method __init__ (line 10) | def __init__(self, media_type: str, href: str) -> None:
FILE: epub_utils/content/xhtml.py
class XHTMLContent (line 10) | class XHTMLContent(Content):
method __init__ (line 17) | def __init__(self, xml_content: str, media_type: str, href: str) -> None:
method __str__ (line 37) | def __str__(self) -> str:
method to_str (line 40) | def to_str(self, *args, **kwargs) -> str:
method to_xml (line 43) | def to_xml(self, *args, **kwargs) -> str:
method to_plain (line 46) | def to_plain(self) -> str:
method _parse (line 49) | def _parse(self, xml_content: str) -> None:
method tree (line 64) | def tree(self):
method inner_text (line 71) | def inner_text(self) -> str:
FILE: epub_utils/doc.py
class Document (line 16) | class Document:
method __init__ (line 29) | def __init__(self, path: Union[str, Path]) -> None:
method _read_file_from_epub (line 69) | def _read_file_from_epub(self, file_path: str) -> str:
method container (line 116) | def container(self) -> Container:
method package (line 123) | def package(self) -> Package:
method package_href (line 130) | def package_href(self):
method toc (line 134) | def toc(self) -> Optional[Navigation]:
method ncx (line 145) | def ncx(self) -> Optional[NCXNavigation]:
method nav (line 162) | def nav(self) -> Optional[EPUBNavDocNavigation]:
method find_content_by_id (line 178) | def find_content_by_id(self, item_id: str) -> str:
method find_pub_resource_by_id (line 236) | def find_pub_resource_by_id(self, item_id: str) -> str:
method list_files (line 275) | def list_files(self) -> List[Dict[str, str]]:
method get_files_info (line 295) | def get_files_info(self) -> List[Dict[str, Union[str, int]]]:
method get_file_by_path (line 323) | def get_file_by_path(self, file_path: str):
FILE: epub_utils/exceptions.py
class EPUBError (line 10) | class EPUBError(Exception):
method __init__ (line 13) | def __init__(self, message: str, suggestions: list = None, file_path: ...
method __str__ (line 26) | def __str__(self):
class ParseError (line 40) | class ParseError(EPUBError, ValueError):
method __init__ (line 43) | def __init__(
class InvalidEPUBError (line 76) | class InvalidEPUBError(EPUBError, ValueError):
method __init__ (line 79) | def __init__(
class UnsupportedFormatError (line 109) | class UnsupportedFormatError(EPUBError, ValueError):
method __init__ (line 112) | def __init__(
class NotImplementedError (line 146) | class NotImplementedError(EPUBError):
method __init__ (line 149) | def __init__(
class FileNotFoundError (line 178) | class FileNotFoundError(EPUBError, ValueError):
method __init__ (line 181) | def __init__(self, file_path: str, epub_path: str = None, suggestions:...
class ValidationError (line 202) | class ValidationError(EPUBError, ValueError):
method __init__ (line 205) | def __init__(
FILE: epub_utils/navigation/base.py
class NavigationItem (line 7) | class NavigationItem:
method to_dict (line 18) | def to_dict(self) -> Dict[str, Any]:
class Navigation (line 37) | class Navigation(ABC):
method __init__ (line 46) | def __init__(self, media_type: str, href: str) -> None:
method get_toc_items (line 52) | def get_toc_items(self) -> List[NavigationItem]:
method get_page_list (line 57) | def get_page_list(self) -> List[NavigationItem]:
method get_landmarks (line 62) | def get_landmarks(self) -> List[NavigationItem]:
method add_toc_item (line 68) | def add_toc_item(self, item: NavigationItem, after_id: Optional[str] =...
method remove_toc_item (line 73) | def remove_toc_item(self, item_id: str) -> bool:
method update_toc_item (line 78) | def update_toc_item(self, item_id: str, **kwargs) -> bool:
method reorder_toc_items (line 83) | def reorder_toc_items(self, new_order: List[str]) -> None:
method find_item_by_id (line 88) | def find_item_by_id(self, item_id: str) -> Optional[NavigationItem]:
method find_items_by_target (line 95) | def find_items_by_target(self, target: str) -> List[NavigationItem]:
method get_all_items (line 99) | def get_all_items(self) -> List[NavigationItem]:
method get_toc_items_as_dicts (line 107) | def get_toc_items_as_dicts(self) -> List[Dict[str, Any]]:
method get_page_list_as_dicts (line 116) | def get_page_list_as_dicts(self) -> List[Dict[str, Any]]:
method get_landmarks_as_dicts (line 124) | def get_landmarks_as_dicts(self) -> List[Dict[str, Any]]:
method tree (line 135) | def tree(self):
method to_str (line 141) | def to_str(self, *args, **kwargs) -> str:
method to_xml (line 145) | def to_xml(self, *args, **kwargs) -> str:
method to_plain (line 149) | def to_plain(self) -> str:
FILE: epub_utils/navigation/nav/__init__.py
class EPUBNavDocNavigation (line 13) | class EPUBNavDocNavigation(Navigation):
method __init__ (line 18) | def __init__(
method __str__ (line 43) | def __str__(self) -> str:
method to_str (line 46) | def to_str(self, *args, **kwargs) -> str:
method to_xml (line 49) | def to_xml(self, *args, **kwargs) -> str:
method to_plain (line 52) | def to_plain(self) -> str:
method _parse (line 55) | def _parse(self, xml_content: str) -> None:
method tree (line 76) | def tree(self):
method inner_text (line 83) | def inner_text(self) -> str:
method get_toc_items (line 102) | def get_toc_items(self) -> List[NavigationItem]:
method get_page_list (line 115) | def get_page_list(self) -> List[NavigationItem]:
method get_landmarks (line 128) | def get_landmarks(self) -> List[NavigationItem]:
method add_toc_item (line 143) | def add_toc_item(self, item: NavigationItem, after_id: Optional[str] =...
method remove_toc_item (line 175) | def remove_toc_item(self, item_id: str) -> bool:
method update_toc_item (line 207) | def update_toc_item(self, item_id: str, **kwargs) -> bool:
method reorder_toc_items (line 255) | def reorder_toc_items(self, new_order: List[str]) -> None:
method _convert_list_items_recursive (line 287) | def _convert_list_items_recursive(
method _convert_list_items_to_pages (line 330) | def _convert_list_items_to_pages(self, list_items: List[NavListItem]) ...
method _convert_list_items_to_landmarks (line 351) | def _convert_list_items_to_landmarks(
FILE: epub_utils/navigation/nav/dom.py
class NavElement (line 8) | class NavElement:
method __init__ (line 11) | def __init__(self, element: etree.Element) -> None:
method id (line 15) | def id(self) -> Optional[str]:
method id (line 20) | def id(self, value: str) -> None:
class NavAnchor (line 25) | class NavAnchor(NavElement):
method href (line 29) | def href(self) -> Optional[str]:
method href (line 34) | def href(self, value: str) -> None:
method text (line 39) | def text(self) -> str:
method text (line 44) | def text(self, value: str) -> None:
method epub_type (line 49) | def epub_type(self) -> Optional[str]:
method epub_type (line 54) | def epub_type(self, value: str) -> None:
class NavListItem (line 59) | class NavListItem(NavElement):
method anchor (line 63) | def anchor(self) -> Optional[NavAnchor]:
method nested_list (line 73) | def nested_list(self) -> Optional['NavList']:
method span (line 83) | def span(self) -> Optional[NavElement]:
method add_anchor (line 92) | def add_anchor(self, href: str, text: str, epub_type: Optional[str] = ...
method add_span (line 102) | def add_span(self, text: str) -> NavElement:
method add_nested_list (line 109) | def add_nested_list(self) -> 'NavList':
class NavList (line 115) | class NavList(NavElement):
method list_items (line 119) | def list_items(self) -> List[NavListItem]:
method add_list_item (line 126) | def add_list_item(self) -> NavListItem:
method get_all_items_recursive (line 131) | def get_all_items_recursive(self) -> List[NavListItem]:
class NavSection (line 146) | class NavSection(NavElement):
method epub_type (line 150) | def epub_type(self) -> Optional[str]:
method epub_type (line 155) | def epub_type(self, value: str) -> None:
method heading (line 160) | def heading(self) -> Optional[str]:
method ordered_list (line 171) | def ordered_list(self) -> Optional[NavList]:
method add_heading (line 180) | def add_heading(self, level: int, text: str) -> NavElement:
method add_ordered_list (line 192) | def add_ordered_list(self) -> NavList:
class NavDocument (line 198) | class NavDocument(NavElement):
method toc_nav (line 202) | def toc_nav(self) -> Optional[NavSection]:
method page_list_nav (line 216) | def page_list_nav(self) -> Optional[NavSection]:
method landmarks_nav (line 230) | def landmarks_nav(self) -> Optional[NavSection]:
method all_nav_sections (line 244) | def all_nav_sections(self) -> List[NavSection]:
method title (line 252) | def title(self) -> str:
method body (line 260) | def body(self) -> Optional[NavElement]:
method add_nav_section (line 269) | def add_nav_section(self, epub_type: str) -> NavSection:
FILE: epub_utils/navigation/ncx/__init__.py
class NCXNavigation (line 14) | class NCXNavigation(Navigation):
method __init__ (line 17) | def __init__(
method __str__ (line 43) | def __str__(self) -> str:
method to_str (line 46) | def to_str(self, *args, **kwargs) -> str:
method to_xml (line 49) | def to_xml(self, *args, **kwargs) -> str:
method to_plain (line 52) | def to_plain(self) -> str:
method _parse (line 55) | def _parse(self, xml_content: str) -> None:
method tree (line 77) | def tree(self):
method inner_text (line 84) | def inner_text(self) -> str:
method get_toc_items (line 101) | def get_toc_items(self) -> List[NavigationItem]:
method get_page_list (line 110) | def get_page_list(self) -> List[NavigationItem]:
method get_landmarks (line 119) | def get_landmarks(self) -> List[NavigationItem]:
method add_toc_item (line 131) | def add_toc_item(self, item: NavigationItem, after_id: Optional[str] =...
method remove_toc_item (line 182) | def remove_toc_item(self, item_id: str) -> bool:
method update_toc_item (line 201) | def update_toc_item(self, item_id: str, **kwargs) -> bool:
method reorder_toc_items (line 238) | def reorder_toc_items(self, new_order: List[str]) -> None:
method _convert_nav_points_recursive (line 258) | def _convert_nav_points_recursive(
method _convert_page_targets (line 283) | def _convert_page_targets(self, page_targets: List[NCXPageTarget]) -> ...
method _convert_nav_target (line 300) | def _convert_nav_target(self, nav_target: NCXNavTarget) -> NavigationI...
FILE: epub_utils/navigation/ncx/dom.py
class NCXElement (line 8) | class NCXElement:
method __init__ (line 11) | def __init__(self, element: etree.Element):
method id (line 15) | def id(self) -> Optional[str]:
method id (line 20) | def id(self, value: str) -> None:
class NCXText (line 25) | class NCXText(NCXElement):
method text (line 29) | def text(self) -> str:
method text (line 34) | def text(self, value: str) -> None:
class NCXContent (line 39) | class NCXContent(NCXElement):
method src (line 43) | def src(self) -> Optional[str]:
method src (line 48) | def src(self, value: str) -> None:
class NCXNavLabel (line 53) | class NCXNavLabel(NCXElement):
method text_element (line 57) | def text_element(self) -> Optional[NCXText]:
method text (line 67) | def text(self) -> str:
method text (line 73) | def text(self, value: str) -> None:
class NCXNavPoint (line 86) | class NCXNavPoint(NCXElement):
method class_attr (line 90) | def class_attr(self) -> Optional[str]:
method class_attr (line 95) | def class_attr(self, value: str) -> None:
method play_order (line 100) | def play_order(self) -> Optional[int]:
method play_order (line 106) | def play_order(self, value: int) -> None:
method nav_label (line 111) | def nav_label(self) -> Optional[NCXNavLabel]:
method content (line 121) | def content(self) -> Optional[NCXContent]:
method nav_points (line 131) | def nav_points(self) -> List['NCXNavPoint']:
method add_nav_point (line 138) | def add_nav_point(
method label_text (line 175) | def label_text(self) -> str:
method content_src (line 181) | def content_src(self) -> str:
class NCXNavMap (line 187) | class NCXNavMap(NCXElement):
method nav_points (line 191) | def nav_points(self) -> List[NCXNavPoint]:
method add_nav_point (line 198) | def add_nav_point(
method get_all_nav_points (line 234) | def get_all_nav_points(self) -> List[NCXNavPoint]:
class NCXPageTarget (line 242) | class NCXPageTarget(NCXElement):
method type_attr (line 246) | def type_attr(self) -> Optional[str]:
method type_attr (line 251) | def type_attr(self, value: str) -> None:
method value (line 256) | def value(self) -> Optional[str]:
method value (line 261) | def value(self, value: str) -> None:
method play_order (line 266) | def play_order(self) -> Optional[int]:
method play_order (line 272) | def play_order(self, value: int) -> None:
method nav_label (line 277) | def nav_label(self) -> Optional[NCXNavLabel]:
method content (line 287) | def content(self) -> Optional[NCXContent]:
method label_text (line 297) | def label_text(self) -> str:
method content_src (line 303) | def content_src(self) -> str:
class NCXPageList (line 309) | class NCXPageList(NCXElement):
method page_targets (line 313) | def page_targets(self) -> List[NCXPageTarget]:
method add_page_target (line 320) | def add_page_target(
class NCXNavTarget (line 358) | class NCXNavTarget(NCXElement):
method value (line 362) | def value(self) -> Optional[str]:
method value (line 367) | def value(self, value: str) -> None:
method class_attr (line 372) | def class_attr(self) -> Optional[str]:
method class_attr (line 377) | def class_attr(self, value: str) -> None:
method play_order (line 382) | def play_order(self) -> Optional[int]:
method play_order (line 388) | def play_order(self, value: int) -> None:
method nav_label (line 393) | def nav_label(self) -> Optional[NCXNavLabel]:
method content (line 403) | def content(self) -> Optional[NCXContent]:
class NCXNavList (line 413) | class NCXNavList(NCXElement):
method nav_label (line 417) | def nav_label(self) -> Optional[NCXNavLabel]:
method nav_targets (line 427) | def nav_targets(self) -> List[NCXNavTarget]:
method add_nav_target (line 434) | def add_nav_target(
method label_text (line 464) | def label_text(self) -> str:
class NCXDocument (line 470) | class NCXDocument(NCXElement):
method nav_map (line 474) | def nav_map(self) -> Optional[NCXNavMap]:
method page_list (line 484) | def page_list(self) -> Optional[NCXPageList]:
method nav_lists (line 494) | def nav_lists(self) -> List[NCXNavList]:
method title (line 502) | def title(self) -> str:
method author (line 510) | def author(self) -> str:
method get_uid (line 517) | def get_uid(self) -> Optional[str]:
method get_depth (line 525) | def get_depth(self) -> Optional[int]:
method get_total_page_count (line 533) | def get_total_page_count(self) -> Optional[int]:
method get_max_page_number (line 541) | def get_max_page_number(self) -> Optional[int]:
FILE: epub_utils/package/__init__.py
class Package (line 29) | class Package:
method __init__ (line 55) | def __init__(self, xml_content: str) -> None:
method __str__ (line 77) | def __str__(self) -> str:
method to_str (line 80) | def to_str(self, *args, **kwargs) -> str:
method to_xml (line 83) | def to_xml(self, *args, **kwargs) -> str:
method _parse (line 86) | def _parse(self, xml_content: str) -> None:
method _get_text (line 176) | def _get_text(self, root: etree.Element, xpath: str) -> str:
method _find_toc_href (line 190) | def _find_toc_href(self, root: etree.Element) -> str:
method _find_nav_href (line 219) | def _find_nav_href(self, root: etree.Element) -> str:
method _parse_version (line 247) | def _parse_version(self, version):
FILE: epub_utils/package/manifest.py
class Manifest (line 10) | class Manifest:
method __init__ (line 19) | def __init__(self, xml_content: str):
method __str__ (line 27) | def __str__(self) -> str:
method to_str (line 30) | def to_str(self, *args, **kwargs) -> str:
method to_xml (line 33) | def to_xml(self, *args, **kwargs) -> str:
method _parse (line 36) | def _parse(self, xml_content: str) -> None:
method find_by_property (line 70) | def find_by_property(self, property_name: str) -> dict:
method find_by_id (line 77) | def find_by_id(self, item_id: str) -> dict:
method find_by_media_type (line 84) | def find_by_media_type(self, media_type: str) -> list:
FILE: epub_utils/package/metadata.py
class Metadata (line 10) | class Metadata:
method __init__ (line 22) | def __init__(self, xml_content: str):
method _parse (line 30) | def _parse(self, xml_content: str) -> None:
method _add_field (line 65) | def _add_field(self, name: str, value: str) -> None:
method _validate (line 74) | def _validate(self, raise_exception=False) -> None:
method _validate_field (line 101) | def _validate_field(self, field_name: str) -> None:
method __str__ (line 115) | def __str__(self) -> str:
method to_str (line 118) | def to_str(self, *args, **kwargs) -> str:
method to_xml (line 121) | def to_xml(self, *args, **kwargs) -> str:
method _get_text (line 124) | def _get_text(self, root: etree.Element, xpath: str) -> str:
method __getattr__ (line 128) | def __getattr__(self, name: str) -> str:
method to_kv (line 131) | def to_kv(self) -> str:
FILE: epub_utils/package/spine.py
class Spine (line 10) | class Spine:
method __init__ (line 19) | def __init__(self, xml_content: str):
method __str__ (line 30) | def __str__(self) -> str:
method to_str (line 33) | def to_str(self, *args, **kwargs) -> str:
method to_xml (line 36) | def to_xml(self, *args, **kwargs) -> str:
method _parse (line 39) | def _parse(self, xml_content: str) -> None:
method find_by_idref (line 73) | def find_by_idref(self, itemref_idref: str) -> dict:
FILE: epub_utils/printers.py
function highlight_xml (line 11) | def highlight_xml(xml_content: str) -> str:
function pretty_print_xml (line 15) | def pretty_print_xml(xml_content: str) -> str:
function print_to_str (line 54) | def print_to_str(xml_content: bool, pretty_print: bool) -> str:
function print_to_xml (line 61) | def print_to_xml(xml_content: str, pretty_print: bool, highlight_syntax:...
class XMLPrinter (line 71) | class XMLPrinter:
method __init__ (line 74) | def __init__(self, xml_content_provider):
method to_str (line 83) | def to_str(self, pretty_print: bool = False) -> str:
method to_xml (line 95) | def to_xml(self, pretty_print: bool = False, highlight_syntax: bool = ...
FILE: setup.py
function get_long_description (line 8) | def get_long_description():
FILE: tests/conftest.py
function doc_path (line 5) | def doc_path():
FILE: tests/test_cli.py
function test_help (line 14) | def test_help(options):
function test_version (line 28) | def test_version(options):
function test_files_command_with_file_path_xhtml_xml (line 34) | def test_files_command_with_file_path_xhtml_xml(doc_path):
function test_files_command_with_file_path_missing_file (line 43) | def test_files_command_with_file_path_missing_file(doc_path):
function test_files_command_without_file_path_table (line 50) | def test_files_command_without_file_path_table(doc_path):
function test_files_command_without_file_path_raw (line 59) | def test_files_command_without_file_path_raw(doc_path):
function test_toc_command_default (line 67) | def test_toc_command_default(doc_path):
function test_toc_command_nav_flag (line 74) | def test_toc_command_nav_flag(doc_path):
function test_toc_command_mutually_exclusive_flags (line 81) | def test_toc_command_mutually_exclusive_flags(doc_path):
FILE: tests/test_container.py
function test_container_initialization (line 15) | def test_container_initialization():
function test_invalid_container_xml (line 24) | def test_invalid_container_xml():
function test_container_to_str_pretty_print_parameter (line 48) | def test_container_to_str_pretty_print_parameter(xml_content, pretty_pri...
FILE: tests/test_doc.py
function test_document_container (line 9) | def test_document_container(doc_path):
function test_document_package (line 17) | def test_document_package(doc_path):
function test_document_toc (line 45) | def test_document_toc(doc_path):
function test_document_find_content_by_id (line 53) | def test_document_find_content_by_id(doc_path):
function test_document_get_file_by_path_xhtml (line 59) | def test_document_get_file_by_path_xhtml(doc_path):
function test_document_get_file_by_path_missing_file (line 77) | def test_document_get_file_by_path_missing_file(doc_path):
function test_document_nav_property (line 90) | def test_document_nav_property(doc_path):
FILE: tests/test_manifest.py
function test_manifest_initialization (line 21) | def test_manifest_initialization():
function test_minimal_manifest (line 37) | def test_minimal_manifest():
function test_find_by_property (line 47) | def test_find_by_property():
function test_find_by_id (line 54) | def test_find_by_id():
function test_find_by_media_type (line 61) | def test_find_by_media_type():
function test_manifest_to_str_pretty_print_parameter (line 83) | def test_manifest_to_str_pretty_print_parameter(xml_content, pretty_prin...
FILE: tests/test_metadata.py
function test_metadata_parse_valid_element (line 29) | def test_metadata_parse_valid_element():
function test_metadata_validate_missing_identifier_with_raise_exception (line 46) | def test_metadata_validate_missing_identifier_with_raise_exception():
function test_metadata_to_str_pretty_print_parameter (line 67) | def test_metadata_to_str_pretty_print_parameter(xml_content, pretty_prin...
FILE: tests/test_nav_navigation.py
function test_nav_doc_navigation_initialization (line 23) | def test_nav_doc_navigation_initialization():
function test_nav_doc_navigation_interface (line 35) | def test_nav_doc_navigation_interface():
function test_nav_doc_navigation_toc_items_as_dicts (line 69) | def test_nav_doc_navigation_toc_items_as_dicts():
function test_nav_doc_navigation_page_list (line 132) | def test_nav_doc_navigation_page_list():
function test_nav_doc_navigation_landmarks (line 182) | def test_nav_doc_navigation_landmarks():
function test_nav_doc_navigation_editing (line 232) | def test_nav_doc_navigation_editing():
function test_nav_doc_navigation_span_elements (line 271) | def test_nav_doc_navigation_span_elements():
function test_nav_doc_navigation_item_types (line 316) | def test_nav_doc_navigation_item_types():
function test_nav_doc_navigation_invalid_media_type (line 350) | def test_nav_doc_navigation_invalid_media_type():
function test_nav_doc_navigation_malformed_xml (line 360) | def test_nav_doc_navigation_malformed_xml():
function test_nav_doc_navigation_output_methods (line 384) | def test_nav_doc_navigation_output_methods():
function test_nav_doc_navigation_reorder_items (line 412) | def test_nav_doc_navigation_reorder_items():
function test_nav_doc_navigation_empty_document (line 445) | def test_nav_doc_navigation_empty_document():
FILE: tests/test_ncx_navigation.py
function test_ncx_navigation_initialization (line 25) | def test_ncx_navigation_initialization():
function test_ncx_navigation_interface (line 38) | def test_ncx_navigation_interface():
function test_ncx_navigation_hierarchy (line 72) | def test_ncx_navigation_hierarchy():
function test_ncx_navigation_editing (line 143) | def test_ncx_navigation_editing():
FILE: tests/test_package.py
function test_package_initialization (line 92) | def test_package_initialization():
function test_package_invalid_xml (line 102) | def test_package_invalid_xml():
function test_epub3 (line 108) | def test_epub3():
function test_epub3_without_toc (line 115) | def test_epub3_without_toc():
function test_epub2 (line 122) | def test_epub2():
function test_epub2_without_toc (line 129) | def test_epub2_without_toc():
function test_epub1 (line 136) | def test_epub1():
function test_invalid_version (line 143) | def test_invalid_version():
function test_package_to_str_pretty_print_parameter (line 164) | def test_package_to_str_pretty_print_parameter(xml_content, pretty_print...
FILE: tests/test_spine.py
function test_spine_initialization (line 21) | def test_spine_initialization():
function test_minimal_spine (line 39) | def test_minimal_spine():
function test_spine_to_str_pretty_print_parameter (line 65) | def test_spine_to_str_pretty_print_parameter(xml_content, pretty_print, ...
FILE: tests/test_xhtml_content.py
function test_simple_paragraph (line 6) | def test_simple_paragraph():
function test_to_str_pretty_print_parameter (line 56) | def test_to_str_pretty_print_parameter(xml_content, pretty_print, expect...
Condensed preview: 60 files, each showing path, character count, and a content snippet (full structured content: 351K chars).
[
{
"path": ".github/workflows/docs.yml",
"chars": 655,
"preview": "name: Publish documentation\n\non:\n push:\n branches:\n - main\n\njobs:\n docs:\n runs-on: ubuntu-latest\n steps:\n "
},
{
"path": ".github/workflows/test.yml",
"chars": 1127,
"preview": "name: Test\n\non:\n push:\n branches: \n - \"main\"\n pull_request:\n\nconcurrency:\n group: ${{ github.head_ref || github"
},
{
"path": ".gitignore",
"chars": 1817,
"preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
},
{
"path": ".vscode/settings.json",
"chars": 44,
"preview": "{\n \"python.testing.pytestEnabled\": true\n}"
},
{
"path": "LICENSE",
"chars": 11346,
"preview": " Apache License\n Version 2.0, January 2004\n "
},
{
"path": "Makefile",
"chars": 924,
"preview": "#!/usr/bin/env bash\n\nLIGHT_CYAN=\\033[1;36m\nNO_COLOR=\\033[0m\n\n.PHONY: docs\n\nhelp:\n\t@echo \"test - run tests with pytest\"\n\t"
},
{
"path": "README.md",
"chars": 12655,
"preview": "# epub-utils\n\n[](https://pypi.org/project/epub-utils/)\n[![Changelog"
},
{
"path": "docs/Makefile",
"chars": 634,
"preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line, and also\n# from the "
},
{
"path": "docs/api-reference.rst",
"chars": 14860,
"preview": "API Reference\n=============\n\nThis section provides complete API documentation for all classes and methods in epub-utils."
},
{
"path": "docs/api-tutorial.rst",
"chars": 8823,
"preview": "Use as a Python library\n=======================\n\nThis guide covers using ``epub-utils`` as a Python library. The API is "
},
{
"path": "docs/changelog.rst",
"chars": 1745,
"preview": ".. _changelog:\n\n=========\nChangelog\n=========\n\n.. _v_0_1_0a1:\n\n0.1.0a1 (2025-06-14)\n--------------------\n\n* Added `toc` "
},
{
"path": "docs/cli-reference.rst",
"chars": 15737,
"preview": "CLI Reference\n=============\n\nThis reference documents all available command-line options and commands for ``epub-utils``"
},
{
"path": "docs/cli-tutorial.rst",
"chars": 8789,
"preview": "Use as a command-line tool\n==========================\n\nThis tutorial will guide you through using ``epub-utils`` from th"
},
{
"path": "docs/conf.py",
"chars": 1980,
"preview": "# Configuration file for the Sphinx documentation builder.\n#\n# For the full list of built-in configuration values, see t"
},
{
"path": "docs/contributing.rst",
"chars": 10062,
"preview": "============\nContributing\n============\n\nWe welcome contributions to ``epub-utils``! This guide will help you get started"
},
{
"path": "docs/epub-standards.rst",
"chars": 11607,
"preview": "==============\nEPUB Standards\n==============\n\nUnderstanding EPUB Specifications\n=================================\n\nEPUB "
},
{
"path": "docs/examples.rst",
"chars": 52712,
"preview": "Examples and Use Cases\n======================\n\nThis page showcases real-world examples of using epub-utils for various t"
},
{
"path": "docs/formats.rst",
"chars": 13714,
"preview": "Output Formats Reference\n========================\n\n``epub-utils`` supports multiple output formats to suit different use"
},
{
"path": "docs/index.rst",
"chars": 3864,
"preview": "epub-utils: EPUB Inspection and Manipulation\n=============================================\n\n.. image:: https://img.shiel"
},
{
"path": "docs/installation.rst",
"chars": 3029,
"preview": "Installation Guide\n==================\n\nSystem Requirements\n-------------------\n\n``epub-utils`` requires Python 3.10 or h"
},
{
"path": "epub_utils/__init__.py",
"chars": 116,
"preview": "from epub_utils.container import Container\nfrom epub_utils.doc import Document\n\n__all__ = ['Document', 'Container']\n"
},
{
"path": "epub_utils/__main__.py",
"chars": 90,
"preview": "from epub_utils.cli import main\n\nif __name__ == '__main__':\n\tmain(prog_name='epub-utils')\n"
},
{
"path": "epub_utils/cli.py",
"chars": 9850,
"preview": "import click\n\nfrom epub_utils.doc import Document\nfrom epub_utils.exceptions import (\n\tEPUBError,\n\tFileNotFoundError,\n)\n"
},
{
"path": "epub_utils/container.py",
"chars": 4791,
"preview": "\"\"\"\nOpen Container Format: https://www.w3.org/TR/epub/#sec-ocf\n\nThis file includes the `Container` class, which is respo"
},
{
"path": "epub_utils/content/__init__.py",
"chars": 133,
"preview": "from epub_utils.content.base import Content\nfrom epub_utils.content.xhtml import XHTMLContent\n\n__all__ = ['Content', 'XH"
},
{
"path": "epub_utils/content/base.py",
"chars": 303,
"preview": "class Content:\n\t\"\"\"\n\tBase class for EPUB content documents.\n\n\tAttributes:\n\t media_type (str): The MIME type of the co"
},
{
"path": "epub_utils/content/xhtml.py",
"chars": 2176,
"preview": "import re\n\nfrom lxml import etree\n\nfrom epub_utils.content.base import Content\nfrom epub_utils.exceptions import ParseEr"
},
{
"path": "epub_utils/doc.py",
"chars": 10527,
"preview": "import os\nimport zipfile\nfrom datetime import datetime\nfrom functools import cached_property\nfrom pathlib import Path\nfr"
},
{
"path": "epub_utils/exceptions.py",
"chars": 6428,
"preview": "\"\"\"\nGlobal epub-utils exception classes.\n\nThis module defines custom exceptions for the epub-utils library that provide\n"
},
{
"path": "epub_utils/navigation/__init__.py",
"chars": 237,
"preview": "\"\"\"EPUB Navigation module.\"\"\"\n\nfrom .base import Navigation, NavigationItem\nfrom .nav import EPUBNavDocNavigation\nfrom ."
},
{
"path": "epub_utils/navigation/base.py",
"chars": 4059,
"preview": "from abc import ABC, abstractmethod\nfrom dataclasses import dataclass, field\nfrom typing import Any, Dict, List, Optiona"
},
{
"path": "epub_utils/navigation/nav/__init__.py",
"chars": 10082,
"preview": "import re\nfrom typing import List, Optional\n\nfrom lxml import etree\n\nfrom epub_utils.exceptions import ParseError, Unsup"
},
{
"path": "epub_utils/navigation/nav/dom.py",
"chars": 7819,
"preview": "\"\"\"DOM classes for structured access to EPUB 3 Navigation Documents.\"\"\"\n\nfrom typing import List, Optional\n\nfrom lxml im"
},
{
"path": "epub_utils/navigation/ncx/__init__.py",
"chars": 8710,
"preview": "import re\nfrom typing import List, Optional\n\nfrom lxml import etree\n\nfrom epub_utils.exceptions import FileNotFoundError"
},
{
"path": "epub_utils/navigation/ncx/dom.py",
"chars": 15066,
"preview": "\"\"\"NCX DOM classes for structured access to NCX navigation documents.\"\"\"\n\nfrom typing import List, Optional\n\nfrom lxml i"
},
{
"path": "epub_utils/package/__init__.py",
"chars": 8630,
"preview": "\"\"\"\nOpen Packaging Format (OPF): https://www.w3.org/TR/epub/#sec-package-doc\n\nThis file includes the `Package` class, wh"
},
{
"path": "epub_utils/package/manifest.py",
"chars": 2371,
"preview": "try:\n\tfrom lxml import etree\nexcept ImportError:\n\timport xml.etree.ElementTree as etree\n\nfrom epub_utils.exceptions impo"
},
{
"path": "epub_utils/package/metadata.py",
"chars": 4043,
"preview": "try:\n\tfrom lxml import etree\nexcept ImportError:\n\timport xml.etree.ElementTree as etree\n\nfrom epub_utils.exceptions impo"
},
{
"path": "epub_utils/package/spine.py",
"chars": 2092,
"preview": "try:\n\tfrom lxml import etree\nexcept ImportError:\n\timport xml.etree.ElementTree as etree\n\nfrom epub_utils.exceptions impo"
},
{
"path": "epub_utils/printers.py",
"chars": 3051,
"preview": "try:\n\tfrom lxml import etree\nexcept ImportError:\n\timport xml.etree.ElementTree as etree\n\nfrom pygments import highlight\n"
},
{
"path": "pytest.ini",
"chars": 93,
"preview": "[pytest]\npythonpath = .\npython_files = tests.py test_*.py *_tests.py\naddopts = -p no:warnings"
},
{
"path": "requirements/requirements-docs.txt",
"chars": 75,
"preview": "sphinx==6.2.0\nsphinx-copybutton==0.5.1\nsphinx-issues==3.0.1\nfuro==2022.12.7"
},
{
"path": "requirements/requirements-linting.txt",
"chars": 12,
"preview": "ruff==0.11.9"
},
{
"path": "requirements/requirements-testing.txt",
"chars": 69,
"preview": "coverage==6.4.1\ncoverage-badge==1.1.0\npytest==7.2.0\npytest-cov==3.0.0"
},
{
"path": "requirements/requirements.txt",
"chars": 55,
"preview": "click==8.1.8\nlxml==5.4.0\npygments==2.19.1\nPyYAML==6.0.2"
},
{
"path": "requirements.txt",
"chars": 152,
"preview": "-r requirements/requirements-docs.txt\n-r requirements/requirements-linting.txt\n-r requirements/requirements-testing.txt\n"
},
{
"path": "ruff.toml",
"chars": 100,
"preview": "line-length = 100\n\n[format]\nquote-style = \"single\"\nindent-style = \"tab\"\ndocstring-code-format = true"
},
{
"path": "setup.py",
"chars": 1756,
"preview": "import os\n\nfrom setuptools import find_packages, setup\n\nVERSION = '0.1.0a1'\n\n\ndef get_long_description():\n\twith open(\n\t\t"
},
{
"path": "tests/conftest.py",
"chars": 100,
"preview": "import pytest\n\n\n@pytest.fixture\ndef doc_path():\n\tpath = str('tests/assets/roads.epub')\n\treturn path\n"
},
{
"path": "tests/test_cli.py",
"chars": 2572,
"preview": "import pytest\nfrom click.testing import CliRunner\n\nfrom epub_utils import cli\n\n\n@pytest.mark.parametrize(\n\t'options',\n\t("
},
{
"path": "tests/test_container.py",
"chars": 2288,
"preview": "import pytest\n\nfrom epub_utils.container import Container\nfrom epub_utils.exceptions import InvalidEPUBError\n\nCONTAINER_"
},
{
"path": "tests/test_doc.py",
"chars": 2396,
"preview": "import unittest\n\nfrom epub_utils.container import Container\nfrom epub_utils.doc import Document\nfrom epub_utils.navigati"
},
{
"path": "tests/test_manifest.py",
"chars": 3454,
"preview": "import pytest\n\nfrom epub_utils.package.manifest import Manifest\n\nVALID_MANIFEST_XML = \"\"\"\n<manifest xmlns=\"http://www.id"
},
{
"path": "tests/test_metadata.py",
"chars": 2858,
"preview": "import pytest\n\nfrom epub_utils.exceptions import ValidationError\nfrom epub_utils.package.metadata import Metadata\n\nVALID"
},
{
"path": "tests/test_nav_navigation.py",
"chars": 13680,
"preview": "import pytest\n\nfrom epub_utils.navigation.nav import EPUBNavDocNavigation\n\nNAV_XML = \"\"\"<?xml version=\"1.0\" encoding=\"UT"
},
{
"path": "tests/test_ncx_navigation.py",
"chars": 5007,
"preview": "from epub_utils.navigation.ncx import NCXNavigation\n\nNCX_XML = \"\"\"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<ncx xmlns=\"htt"
},
{
"path": "tests/test_package.py",
"chars": 6180,
"preview": "import pytest\n\nfrom epub_utils.exceptions import InvalidEPUBError, UnsupportedFormatError\nfrom epub_utils.package import"
},
{
"path": "tests/test_spine.py",
"chars": 2267,
"preview": "import pytest\n\nfrom epub_utils.package.spine import Spine\n\nVALID_SPINE_XML = \"\"\"\n<spine xmlns=\"http://www.idpf.org/2007/"
},
{
"path": "tests/test_xhtml_content.py",
"chars": 3172,
"preview": "import pytest\n\nfrom epub_utils.content.xhtml import XHTMLContent\n\n\ndef test_simple_paragraph():\n\t\"\"\"Test extraction from"
}
]
// ... and 1 more file not shown in this preview