Showing preview only (1,669K chars total). Download the full file or copy to clipboard to get everything.
Repository: microsoft/markitdown
Branch: main
Commit: a6c8ac46a684
Files: 137
Total size: 1.6 MB
Directory structure:
gitextract_s2k12qtb/
├── .devcontainer/
│ └── devcontainer.json
├── .dockerignore
├── .gitattributes
├── .github/
│ ├── dependabot.yml
│ └── workflows/
│ ├── pre-commit.yml
│ └── tests.yml
├── .gitignore
├── .pre-commit-config.yaml
├── CODE_OF_CONDUCT.md
├── Dockerfile
├── LICENSE
├── README.md
├── SECURITY.md
├── SUPPORT.md
└── packages/
├── markitdown/
│ ├── README.md
│ ├── ThirdPartyNotices.md
│ ├── pyproject.toml
│ ├── src/
│ │ └── markitdown/
│ │ ├── __about__.py
│ │ ├── __init__.py
│ │ ├── __main__.py
│ │ ├── _base_converter.py
│ │ ├── _exceptions.py
│ │ ├── _markitdown.py
│ │ ├── _stream_info.py
│ │ ├── _uri_utils.py
│ │ ├── converter_utils/
│ │ │ ├── __init__.py
│ │ │ └── docx/
│ │ │ ├── __init__.py
│ │ │ ├── math/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── latex_dict.py
│ │ │ │ └── omml.py
│ │ │ └── pre_process.py
│ │ ├── converters/
│ │ │ ├── __init__.py
│ │ │ ├── _audio_converter.py
│ │ │ ├── _bing_serp_converter.py
│ │ │ ├── _csv_converter.py
│ │ │ ├── _doc_intel_converter.py
│ │ │ ├── _docx_converter.py
│ │ │ ├── _epub_converter.py
│ │ │ ├── _exiftool.py
│ │ │ ├── _html_converter.py
│ │ │ ├── _image_converter.py
│ │ │ ├── _ipynb_converter.py
│ │ │ ├── _llm_caption.py
│ │ │ ├── _markdownify.py
│ │ │ ├── _outlook_msg_converter.py
│ │ │ ├── _pdf_converter.py
│ │ │ ├── _plain_text_converter.py
│ │ │ ├── _pptx_converter.py
│ │ │ ├── _rss_converter.py
│ │ │ ├── _transcribe_audio.py
│ │ │ ├── _wikipedia_converter.py
│ │ │ ├── _xlsx_converter.py
│ │ │ ├── _youtube_converter.py
│ │ │ └── _zip_converter.py
│ │ └── py.typed
│ └── tests/
│ ├── __init__.py
│ ├── _test_vectors.py
│ ├── test_cli_misc.py
│ ├── test_cli_vectors.py
│ ├── test_docintel_html.py
│ ├── test_files/
│ │ ├── equations.docx
│ │ ├── expected_outputs/
│ │ │ ├── MEDRPT-2024-PAT-3847_medical_report_scan.md
│ │ │ ├── RECEIPT-2024-TXN-98765_retail_purchase.md
│ │ │ ├── REPAIR-2022-INV-001_multipage.md
│ │ │ ├── SPARSE-2024-INV-1234_borderless_table.md
│ │ │ ├── movie-theater-booking-2024.md
│ │ │ └── test.md
│ │ ├── rlink.docx
│ │ ├── test.docx
│ │ ├── test.epub
│ │ ├── test.json
│ │ ├── test.m4a
│ │ ├── test.pptx
│ │ ├── test.xls
│ │ ├── test.xlsx
│ │ ├── test_blog.html
│ │ ├── test_mskanji.csv
│ │ ├── test_notebook.ipynb
│ │ ├── test_outlook_msg.msg
│ │ ├── test_rss.xml
│ │ ├── test_serp.html
│ │ ├── test_wikipedia.html
│ │ └── test_with_comment.docx
│ ├── test_module_misc.py
│ ├── test_module_vectors.py
│ ├── test_pdf_masterformat.py
│ ├── test_pdf_memory.py
│ └── test_pdf_tables.py
├── markitdown-mcp/
│ ├── Dockerfile
│ ├── README.md
│ ├── pyproject.toml
│ ├── src/
│ │ └── markitdown_mcp/
│ │ ├── __about__.py
│ │ ├── __init__.py
│ │ ├── __main__.py
│ │ └── py.typed
│ └── tests/
│ └── __init__.py
├── markitdown-ocr/
│ ├── LICENSE
│ ├── README.md
│ ├── pyproject.toml
│ ├── src/
│ │ └── markitdown_ocr/
│ │ ├── __about__.py
│ │ ├── __init__.py
│ │ ├── _docx_converter_with_ocr.py
│ │ ├── _ocr_service.py
│ │ ├── _pdf_converter_with_ocr.py
│ │ ├── _plugin.py
│ │ ├── _pptx_converter_with_ocr.py
│ │ └── _xlsx_converter_with_ocr.py
│ └── tests/
│ ├── __init__.py
│ ├── ocr_test_data/
│ │ ├── docx_complex_layout.docx
│ │ ├── docx_image_end.docx
│ │ ├── docx_image_middle.docx
│ │ ├── docx_image_start.docx
│ │ ├── docx_multipage.docx
│ │ ├── docx_multiple_images.docx
│ │ ├── pptx_complex_layout.pptx
│ │ ├── pptx_image_end.pptx
│ │ ├── pptx_image_middle.pptx
│ │ ├── pptx_image_start.pptx
│ │ ├── pptx_multiple_images.pptx
│ │ ├── xlsx_complex_layout.xlsx
│ │ ├── xlsx_image_end.xlsx
│ │ ├── xlsx_image_middle.xlsx
│ │ ├── xlsx_image_start.xlsx
│ │ └── xlsx_multiple_images.xlsx
│ ├── test_docx_converter.py
│ ├── test_pdf_converter.py
│ ├── test_pptx_converter.py
│ └── test_xlsx_converter.py
└── markitdown-sample-plugin/
├── README.md
├── pyproject.toml
├── src/
│ └── markitdown_sample_plugin/
│ ├── __about__.py
│ ├── __init__.py
│ ├── _plugin.py
│ └── py.typed
└── tests/
├── __init__.py
├── test_files/
│ └── test.rtf
└── test_sample_plugin.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .devcontainer/devcontainer.json
================================================
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/docker-existing-dockerfile
{
"name": "Existing Dockerfile",
"build": {
// Sets the run context to one level up instead of the .devcontainer folder.
"context": "..",
// Update the 'dockerFile' property if you aren't using the standard 'Dockerfile' filename.
"dockerfile": "../Dockerfile",
"args": {
"INSTALL_GIT": "true"
}
},
// Features to add to the dev container. More info: https://containers.dev/features.
// "features": {},
"features": {
"ghcr.io/devcontainers-extra/features/hatch:2": {}
},
// Use 'forwardPorts' to make a list of ports inside the container available locally.
// "forwardPorts": [],
// Uncomment the next line to run commands after the container is created.
// "postCreateCommand": "cat /etc/os-release",
// Configure tool-specific properties.
// "customizations": {},
// Uncomment to connect as an existing user other than the container default. More info: https://aka.ms/dev-containers-non-root.
"remoteUser": "root"
}
================================================
FILE: .dockerignore
================================================
*
!packages/
================================================
FILE: .gitattributes
================================================
packages/markitdown/tests/test_files/** linguist-vendored
packages/markitdown-sample-plugin/tests/test_files/** linguist-vendored
# Treat PDF files as binary to prevent line ending conversion
*.pdf binary
================================================
FILE: .github/dependabot.yml
================================================
version: 2
updates:
- package-ecosystem: "github-actions"
directory: "/"
schedule:
interval: "weekly"
================================================
FILE: .github/workflows/pre-commit.yml
================================================
name: pre-commit
on: [pull_request]
jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v5
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.x"
- name: Install pre-commit
run: |
pip install pre-commit
pre-commit install --install-hooks
- name: Run pre-commit
run: pre-commit run --all-files
================================================
FILE: .github/workflows/tests.yml
================================================
name: tests
on: [pull_request]
jobs:
tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v5
- uses: actions/setup-python@v5
with:
python-version: |
3.10
3.11
3.12
- name: Install Hatch
run: pipx install hatch
- name: Run tests
run: cd packages/markitdown; hatch test
================================================
FILE: .gitignore
================================================
.vscode
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
.test-logs/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
.pdm.toml
.pdm-python
.pdm-build/
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
src/.DS_Store
.DS_Store
.cursorrules
================================================
FILE: .pre-commit-config.yaml
================================================
repos:
- repo: https://github.com/psf/black
rev: 23.7.0 # Use the latest version of Black
hooks:
- id: black
================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Microsoft Open Source Code of Conduct
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
Resources:
- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
================================================
FILE: Dockerfile
================================================
FROM python:3.13-slim-bullseye
ENV DEBIAN_FRONTEND=noninteractive
ENV EXIFTOOL_PATH=/usr/bin/exiftool
ENV FFMPEG_PATH=/usr/bin/ffmpeg
# Runtime dependency
RUN apt-get update && apt-get install -y --no-install-recommends \
ffmpeg \
exiftool
ARG INSTALL_GIT=false
RUN if [ "$INSTALL_GIT" = "true" ]; then \
apt-get install -y --no-install-recommends \
git; \
fi
# Cleanup
RUN rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY . /app
RUN pip --no-cache-dir install \
/app/packages/markitdown[all] \
/app/packages/markitdown-sample-plugin
# Default USERID and GROUPID
ARG USERID=nobody
ARG GROUPID=nogroup
USER $USERID:$GROUPID
ENTRYPOINT [ "markitdown" ]
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) Microsoft Corporation.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE
================================================
FILE: README.md
================================================
# MarkItDown
[](https://pypi.org/project/markitdown/)

[](https://github.com/microsoft/autogen)
> [!TIP]
> MarkItDown now offers an MCP (Model Context Protocol) server for integration with LLM applications like Claude Desktop. See [markitdown-mcp](https://github.com/microsoft/markitdown/tree/main/packages/markitdown-mcp) for more information.
> [!IMPORTANT]
> Breaking changes between 0.0.1 to 0.1.0:
> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]'` to have backward-compatible behavior.
> * convert\_stream() now requires a binary file-like object (e.g., a file opened in binary mode, or an io.BytesIO object). This is a breaking change from the previous version, where it previously also accepted text file-like objects, like io.StringIO.
> * The DocumentConverter class interface has changed to read from file-like streams rather than file paths. *No temporary files are created anymore*. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.
MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving important document structure and content as Markdown (including: headings, lists, tables, links, etc.) While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools -- and may not be the best option for high-fidelity document conversions for human consumption.
MarkItDown currently supports the conversion from:
- PDF
- PowerPoint
- Word
- Excel
- Images (EXIF metadata and OCR)
- Audio (EXIF metadata and speech transcription)
- HTML
- Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents)
- Youtube URLs
- EPubs
- ... and more!
## Why Markdown?
Markdown is extremely close to plain text, with minimal markup or formatting, but still
provides a way to represent important document structure. Mainstream LLMs, such as
OpenAI's GPT-4o, natively "_speak_" Markdown, and often incorporate Markdown into their
responses unprompted. This suggests that they have been trained on vast amounts of
Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions
are also highly token-efficient.
## Prerequisites
MarkItDown requires Python 3.10 or higher. It is recommended to use a virtual environment to avoid dependency conflicts.
With the standard Python installation, you can create and activate a virtual environment using the following commands:
```bash
python -m venv .venv
source .venv/bin/activate
```
If using `uv`, you can create a virtual environment with:
```bash
uv venv --python=3.12 .venv
source .venv/bin/activate
# NOTE: Be sure to use 'uv pip install' rather than just 'pip install' to install packages in this virtual environment
```
If you are using Anaconda, you can create a virtual environment with:
```bash
conda create -n markitdown python=3.12
conda activate markitdown
```
## Installation
To install MarkItDown, use pip: `pip install 'markitdown[all]'`. Alternatively, you can install it from the source:
```bash
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e 'packages/markitdown[all]'
```
## Usage
### Command-Line
```bash
markitdown path-to-file.pdf > document.md
```
Or use `-o` to specify the output file:
```bash
markitdown path-to-file.pdf -o document.md
```
You can also pipe content:
```bash
cat path-to-file.pdf | markitdown
```
### Optional Dependencies
MarkItDown has optional dependencies for activating various file formats. Earlier in this document, we installed all optional dependencies with the `[all]` option. However, you can also install them individually for more control. For example:
```bash
pip install 'markitdown[pdf, docx, pptx]'
```
will install only the dependencies for PDF, DOCX, and PPTX files.
At the moment, the following optional dependencies are available:
* `[all]` Installs all optional dependencies
* `[pptx]` Installs dependencies for PowerPoint files
* `[docx]` Installs dependencies for Word files
* `[xlsx]` Installs dependencies for Excel files
* `[xls]` Installs dependencies for older Excel files
* `[pdf]` Installs dependencies for PDF files
* `[outlook]` Installs dependencies for Outlook messages
* `[az-doc-intel]` Installs dependencies for Azure Document Intelligence
* `[audio-transcription]` Installs dependencies for audio transcription of wav and mp3 files
* `[youtube-transcription]` Installs dependencies for fetching YouTube video transcription
### Plugins
MarkItDown also supports 3rd-party plugins. Plugins are disabled by default. To list installed plugins:
```bash
markitdown --list-plugins
```
To enable plugins use:
```bash
markitdown --use-plugins path-to-file.pdf
```
To find available plugins, search GitHub for the hashtag `#markitdown-plugin`. To develop a plugin, see `packages/markitdown-sample-plugin`.
#### markitdown-ocr Plugin
The `markitdown-ocr` plugin adds OCR support to PDF, DOCX, PPTX, and XLSX converters, extracting text from embedded images using LLM Vision — the same `llm_client` / `llm_model` pattern that MarkItDown already uses for image descriptions. No new ML libraries or binary dependencies required.
**Installation:**
```bash
pip install markitdown-ocr
pip install openai # or any OpenAI-compatible client
```
**Usage:**
Pass the same `llm_client` and `llm_model` you would use for image descriptions:
```python
from markitdown import MarkItDown
from openai import OpenAI
md = MarkItDown(
enable_plugins=True,
llm_client=OpenAI(),
llm_model="gpt-4o",
)
result = md.convert("document_with_images.pdf")
print(result.text_content)
```
If no `llm_client` is provided the plugin still loads, but OCR is silently skipped and the standard built-in converter is used instead.
See [`packages/markitdown-ocr/README.md`](packages/markitdown-ocr/README.md) for detailed documentation.
### Azure Document Intelligence
To use Microsoft Document Intelligence for conversion:
```bash
markitdown path-to-file.pdf -o document.md -d -e "<document_intelligence_endpoint>"
```
More information about how to set up an Azure Document Intelligence Resource can be found [here](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/create-document-intelligence-resource?view=doc-intel-4.0.0)
### Python API
Basic usage in Python:
```python
from markitdown import MarkItDown
md = MarkItDown(enable_plugins=False) # Set to True to enable plugins
result = md.convert("test.xlsx")
print(result.text_content)
```
Document Intelligence conversion in Python:
```python
from markitdown import MarkItDown
md = MarkItDown(docintel_endpoint="<document_intelligence_endpoint>")
result = md.convert("test.pdf")
print(result.text_content)
```
To use Large Language Models for image descriptions (currently only for pptx and image files), provide `llm_client` and `llm_model`:
```python
from markitdown import MarkItDown
from openai import OpenAI
client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o", llm_prompt="optional custom prompt")
result = md.convert("example.jpg")
print(result.text_content)
```
### Docker
```sh
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
```
## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
### How to Contribute
You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are of course just suggestions and you are welcome to contribute in any way you like.
<div align="center">
| | All | Especially Needs Help from Community |
| ---------- | ------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- |
| **Issues** | [All Issues](https://github.com/microsoft/markitdown/issues) | [Issues open for contribution](https://github.com/microsoft/markitdown/issues?q=is%3Aissue+is%3Aopen+label%3A%22open+for+contribution%22) |
| **PRs** | [All PRs](https://github.com/microsoft/markitdown/pulls) | [PRs open for reviewing](https://github.com/microsoft/markitdown/pulls?q=is%3Apr+is%3Aopen+label%3A%22open+for+reviewing%22) |
</div>
### Running Tests and Checks
- Navigate to the MarkItDown package:
```sh
cd packages/markitdown
```
- Install `hatch` in your environment and run tests:
```sh
pip install hatch # Other ways of installing hatch: https://hatch.pypa.io/dev/install/
hatch shell
hatch test
```
(Alternative) Use the Devcontainer which has all the dependencies installed:
```sh
# Reopen the project in Devcontainer and run:
hatch test
```
- Run pre-commit checks before submitting a PR: `pre-commit run --all-files`
### Contributing 3rd-party Plugins
You can also contribute by creating and sharing 3rd party plugins. See `packages/markitdown-sample-plugin` for more details.
## Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos are subject to those third-party's policies.
================================================
FILE: SECURITY.md
================================================
<!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK -->
## Security
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet) and [Xamarin](https://github.com/xamarin).
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets [Microsoft's definition of a security vulnerability](https://aka.ms/security.md/definition), please report it to us as described below.
## Reporting Security Issues
**Please do not report security vulnerabilities through public GitHub issues.**
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://aka.ms/security.md/msrc/create-report).
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://aka.ms/security.md/msrc/pgp).
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
* Type of issue (e.g. buffer overflow, SQL injection, cross-site scripting, etc.)
* Full paths of source file(s) related to the manifestation of the issue
* The location of the affected source code (tag/branch/commit or direct URL)
* Any special configuration required to reproduce the issue
* Step-by-step instructions to reproduce the issue
* Proof-of-concept or exploit code (if possible)
* Impact of the issue, including how an attacker might exploit the issue
This information will help us triage your report more quickly.
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://aka.ms/security.md/msrc/bounty) page for more details about our active programs.
## Preferred Languages
We prefer all communications to be in English.
## Policy
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://aka.ms/security.md/cvd).
<!-- END MICROSOFT SECURITY.MD BLOCK -->
================================================
FILE: SUPPORT.md
================================================
# TODO: The maintainer of this repo has not yet edited this file
**REPO OWNER**: Do you want Customer Service & Support (CSS) support for this product/project?
- **No CSS support:** Fill out this template with information about how to file issues and get help.
- **Yes CSS support:** Fill out an intake form at [aka.ms/onboardsupport](https://aka.ms/onboardsupport). CSS will work with/help you to determine next steps.
- **Not sure?** Fill out an intake as though the answer were "Yes". CSS will help you decide.
*Then remove this first heading from this SUPPORT.MD file before publishing your repo.*
# Support
## How to file issues and get help
This project uses GitHub Issues to track bugs and feature requests. Please search the existing
issues before filing new issues to avoid duplicates. For new issues, file your bug or
feature request as a new Issue.
For help and questions about using this project, please **REPO MAINTAINER: INSERT INSTRUCTIONS HERE
FOR HOW TO ENGAGE REPO OWNERS OR COMMUNITY FOR HELP. COULD BE A STACK OVERFLOW TAG OR OTHER
CHANNEL. WHERE WILL YOU HELP PEOPLE?**.
## Microsoft Support Policy
Support for this **PROJECT or PRODUCT** is limited to the resources listed above.
================================================
FILE: packages/markitdown/README.md
================================================
# MarkItDown
> [!IMPORTANT]
> MarkItDown is a Python package and command-line utility for converting various files to Markdown (e.g., for indexing, text analysis, etc).
>
> For more information, and full documentation, see the project [README.md](https://github.com/microsoft/markitdown) on GitHub.
## Installation
From PyPI:
```bash
pip install markitdown[all]
```
From source:
```bash
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e packages/markitdown[all]
```
## Usage
### Command-Line
```bash
markitdown path-to-file.pdf > document.md
```
### Python API
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("test.xlsx")
print(result.text_content)
```
### More Information
For more information, and full documentation, see the project [README.md](https://github.com/microsoft/markitdown) on GitHub.
## Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos are subject to those third-party's policies.
================================================
FILE: packages/markitdown/ThirdPartyNotices.md
================================================
# THIRD-PARTY SOFTWARE NOTICES AND INFORMATION
**Do Not Translate or Localize**
This project incorporates components from the projects listed below. The original copyright notices and the licenses
under which MarkItDown received such components are set forth below. MarkItDown reserves all rights not expressly
granted herein, whether by implication, estoppel or otherwise.
1.dwml (https://github.com/xiilei/dwml)
dwml NOTICES AND INFORMATION BEGIN HERE
-----------------------------------------
NOTE 1: What follows is a verbatim copy of dwml's LICENSE file, as it appeared on March 28th, 2025 - including
placeholders for the copyright owner and year.
NOTE 2: The Apache License, Version 2.0, requires that modifications to the dwml source code be documented.
The following section summarizes these changes. The full details are available in the MarkItDown source code
repository under PR #1160 (https://github.com/microsoft/markitdown/pull/1160)
This project incorporates `dwml/latex_dict.py` and `dwml/omml.py` files without any additional logic modifications (which
lives in `packages/markitdown/src/markitdown/converter_utils/docx/math` location). However, we have reformatted the code
according to `black` code formatter. From `tests/docx.py` file, we have used `DOCXML_ROOT` XML namespaces and the rest of
the file is not used.
-----------------------------------------
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-----------------------------------------
END OF dwml NOTICES AND INFORMATION
================================================
FILE: packages/markitdown/pyproject.toml
================================================
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "markitdown"
dynamic = ["version"]
description = 'Utility tool for converting various files to Markdown'
readme = "README.md"
requires-python = ">=3.10"
license = "MIT"
keywords = []
authors = [
{ name = "Adam Fourney", email = "adamfo@microsoft.com" },
]
classifiers = [
"Development Status :: 4 - Beta",
"Programming Language :: Python",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
"Programming Language :: Python :: 3.13",
"Programming Language :: Python :: Implementation :: CPython",
"Programming Language :: Python :: Implementation :: PyPy",
]
dependencies = [
"beautifulsoup4",
"requests",
"markdownify",
"magika~=0.6.1",
"charset-normalizer",
"defusedxml",
]
[project.optional-dependencies]
all = [
"python-pptx",
"mammoth~=1.11.0",
"pandas",
"openpyxl",
"xlrd",
"lxml",
"pdfminer.six>=20251230",
"pdfplumber>=0.11.9",
"olefile",
"pydub",
"SpeechRecognition",
"youtube-transcript-api~=1.0.0",
"azure-ai-documentintelligence",
"azure-identity",
]
pptx = ["python-pptx"]
docx = ["mammoth~=1.11.0", "lxml"]
xlsx = ["pandas", "openpyxl"]
xls = ["pandas", "xlrd"]
pdf = ["pdfminer.six>=20251230", "pdfplumber>=0.11.9"]
outlook = ["olefile"]
audio-transcription = ["pydub", "SpeechRecognition"]
youtube-transcription = ["youtube-transcript-api"]
az-doc-intel = ["azure-ai-documentintelligence", "azure-identity"]
[project.urls]
Documentation = "https://github.com/microsoft/markitdown#readme"
Issues = "https://github.com/microsoft/markitdown/issues"
Source = "https://github.com/microsoft/markitdown"
[tool.hatch.version]
path = "src/markitdown/__about__.py"
[project.scripts]
markitdown = "markitdown.__main__:main"
[tool.hatch.envs.default]
features = ["all"]
[tool.hatch.envs.hatch-test]
features = ["all"]
extra-dependencies = [
"openai",
]
[tool.hatch.envs.types]
features = ["all"]
extra-dependencies = [
"openai",
"mypy>=1.0.0",
]
[tool.hatch.envs.types.scripts]
check = "mypy --install-types --non-interactive --ignore-missing-imports {args:src/markitdown tests}"
[tool.coverage.run]
source_pkgs = ["markitdown", "tests"]
branch = true
parallel = true
omit = [
"src/markitdown/__about__.py",
]
[tool.coverage.paths]
markitdown = ["src/markitdown", "*/markitdown/src/markitdown"]
tests = ["tests", "*/markitdown/tests"]
[tool.coverage.report]
exclude_lines = [
"no cov",
"if __name__ == .__main__.:",
"if TYPE_CHECKING:",
]
[tool.hatch.build.targets.sdist]
only-include = ["src/markitdown"]
================================================
FILE: packages/markitdown/src/markitdown/__about__.py
================================================
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
#
# SPDX-License-Identifier: MIT
__version__ = "0.1.6b2"
================================================
FILE: packages/markitdown/src/markitdown/__init__.py
================================================
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
#
# SPDX-License-Identifier: MIT
from .__about__ import __version__
from ._markitdown import (
MarkItDown,
PRIORITY_SPECIFIC_FILE_FORMAT,
PRIORITY_GENERIC_FILE_FORMAT,
)
from ._base_converter import DocumentConverterResult, DocumentConverter
from ._stream_info import StreamInfo
from ._exceptions import (
MarkItDownException,
MissingDependencyException,
FailedConversionAttempt,
FileConversionException,
UnsupportedFormatException,
)
__all__ = [
"__version__",
"MarkItDown",
"DocumentConverter",
"DocumentConverterResult",
"MarkItDownException",
"MissingDependencyException",
"FailedConversionAttempt",
"FileConversionException",
"UnsupportedFormatException",
"StreamInfo",
"PRIORITY_SPECIFIC_FILE_FORMAT",
"PRIORITY_GENERIC_FILE_FORMAT",
]
================================================
FILE: packages/markitdown/src/markitdown/__main__.py
================================================
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
#
# SPDX-License-Identifier: MIT
import argparse
import sys
import codecs
from textwrap import dedent
from importlib.metadata import entry_points
from .__about__ import __version__
from ._markitdown import MarkItDown, StreamInfo, DocumentConverterResult
def main():
parser = argparse.ArgumentParser(
description="Convert various file formats to markdown.",
prog="markitdown",
formatter_class=argparse.RawDescriptionHelpFormatter,
usage=dedent(
"""
SYNTAX:
markitdown <OPTIONAL: FILENAME>
If FILENAME is empty, markitdown reads from stdin.
EXAMPLE:
markitdown example.pdf
OR
cat example.pdf | markitdown
OR
markitdown < example.pdf
OR to save to a file use
markitdown example.pdf -o example.md
OR
markitdown example.pdf > example.md
"""
).strip(),
)
parser.add_argument(
"-v",
"--version",
action="version",
version=f"%(prog)s {__version__}",
help="show the version number and exit",
)
parser.add_argument(
"-o",
"--output",
help="Output file name. If not provided, output is written to stdout.",
)
parser.add_argument(
"-x",
"--extension",
help="Provide a hint about the file extension (e.g., when reading from stdin).",
)
parser.add_argument(
"-m",
"--mime-type",
help="Provide a hint about the file's MIME type.",
)
parser.add_argument(
"-c",
"--charset",
help="Provide a hint about the file's charset (e.g, UTF-8).",
)
parser.add_argument(
"-d",
"--use-docintel",
action="store_true",
help="Use Document Intelligence to extract text instead of offline conversion. Requires a valid Document Intelligence Endpoint.",
)
parser.add_argument(
"-e",
"--endpoint",
type=str,
help="Document Intelligence Endpoint. Required if using Document Intelligence.",
)
parser.add_argument(
"-p",
"--use-plugins",
action="store_true",
help="Use 3rd-party plugins to convert files. Use --list-plugins to see installed plugins.",
)
parser.add_argument(
"--list-plugins",
action="store_true",
help="List installed 3rd-party plugins. Plugins are loaded when using the -p or --use-plugin option.",
)
parser.add_argument(
"--keep-data-uris",
action="store_true",
help="Keep data URIs (like base64-encoded images) in the output. By default, data URIs are truncated.",
)
parser.add_argument("filename", nargs="?")
args = parser.parse_args()
# Parse the extension hint
extension_hint = args.extension
if extension_hint is not None:
extension_hint = extension_hint.strip().lower()
if len(extension_hint) > 0:
if not extension_hint.startswith("."):
extension_hint = "." + extension_hint
else:
extension_hint = None
# Parse the mime type
mime_type_hint = args.mime_type
if mime_type_hint is not None:
mime_type_hint = mime_type_hint.strip()
if len(mime_type_hint) > 0:
if mime_type_hint.count("/") != 1:
_exit_with_error(f"Invalid MIME type: {mime_type_hint}")
else:
mime_type_hint = None
# Parse the charset
charset_hint = args.charset
if charset_hint is not None:
charset_hint = charset_hint.strip()
if len(charset_hint) > 0:
try:
charset_hint = codecs.lookup(charset_hint).name
except LookupError:
_exit_with_error(f"Invalid charset: {charset_hint}")
else:
charset_hint = None
stream_info = None
if (
extension_hint is not None
or mime_type_hint is not None
or charset_hint is not None
):
stream_info = StreamInfo(
extension=extension_hint, mimetype=mime_type_hint, charset=charset_hint
)
if args.list_plugins:
# List installed plugins, then exit
print("Installed MarkItDown 3rd-party Plugins:\n")
plugin_entry_points = list(entry_points(group="markitdown.plugin"))
if len(plugin_entry_points) == 0:
print(" * No 3rd-party plugins installed.")
print(
"\nFind plugins by searching for the hashtag #markitdown-plugin on GitHub.\n"
)
else:
for entry_point in plugin_entry_points:
print(f" * {entry_point.name:<16}\t(package: {entry_point.value})")
print(
"\nUse the -p (or --use-plugins) option to enable 3rd-party plugins.\n"
)
sys.exit(0)
if args.use_docintel:
if args.endpoint is None:
_exit_with_error(
"Document Intelligence Endpoint is required when using Document Intelligence."
)
elif args.filename is None:
_exit_with_error("Filename is required when using Document Intelligence.")
markitdown = MarkItDown(
enable_plugins=args.use_plugins, docintel_endpoint=args.endpoint
)
else:
markitdown = MarkItDown(enable_plugins=args.use_plugins)
if args.filename is None:
result = markitdown.convert_stream(
sys.stdin.buffer,
stream_info=stream_info,
keep_data_uris=args.keep_data_uris,
)
else:
result = markitdown.convert(
args.filename, stream_info=stream_info, keep_data_uris=args.keep_data_uris
)
_handle_output(args, result)
def _handle_output(args, result: DocumentConverterResult):
"""Handle output to stdout or file"""
if args.output:
with open(args.output, "w", encoding="utf-8") as f:
f.write(result.markdown)
else:
# Handle stdout encoding errors more gracefully
print(
result.markdown.encode(sys.stdout.encoding, errors="replace").decode(
sys.stdout.encoding
)
)
def _exit_with_error(message: str):
print(message)
sys.exit(1)
if __name__ == "__main__":
main()
================================================
FILE: packages/markitdown/src/markitdown/_base_converter.py
================================================
from typing import Any, BinaryIO, Optional
from ._stream_info import StreamInfo
class DocumentConverterResult:
"""The result of converting a document to Markdown."""
def __init__(
self,
markdown: str,
*,
title: Optional[str] = None,
):
"""
Initialize the DocumentConverterResult.
The only required parameter is the converted Markdown text.
The title, and any other metadata that may be added in the future, are optional.
Parameters:
- markdown: The converted Markdown text.
- title: Optional title of the document.
"""
self.markdown = markdown
self.title = title
@property
def text_content(self) -> str:
"""Soft-deprecated alias for `markdown`. New code should migrate to using `markdown` or __str__."""
return self.markdown
@text_content.setter
def text_content(self, markdown: str):
"""Soft-deprecated alias for `markdown`. New code should migrate to using `markdown` or __str__."""
self.markdown = markdown
def __str__(self) -> str:
"""Return the converted Markdown text."""
return self.markdown
class DocumentConverter:
"""Abstract superclass of all DocumentConverters."""
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
"""
Return a quick determination on if the converter should attempt converting the document.
This is primarily based `stream_info` (typically, `stream_info.mimetype`, `stream_info.extension`).
In cases where the data is retrieved via HTTP, the `steam_info.url` might also be referenced to
make a determination (e.g., special converters for Wikipedia, YouTube etc).
Finally, it is conceivable that the `stream_info.filename` might be used to in cases
where the filename is well-known (e.g., `Dockerfile`, `Makefile`, etc)
NOTE: The method signature is designed to match that of the convert() method. This provides some
assurance that, if accepts() returns True, the convert() method will also be able to handle the document.
IMPORTANT: In rare cases, (e.g., OutlookMsgConverter) we need to read more from the stream to make a final
determination. Read operations inevitably advances the position in file_stream. In these case, the position
MUST be reset it MUST be reset before returning. This is because the convert() method may be called immediately
after accepts(), and will expect the file_stream to be at the original position.
E.g.,
cur_pos = file_stream.tell() # Save the current position
data = file_stream.read(100) # ... peek at the first 100 bytes, etc.
file_stream.seek(cur_pos) # Reset the position to the original position
Parameters:
- file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
- stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, set)
- kwargs: Additional keyword arguments for the converter.
Returns:
- bool: True if the converter can handle the document, False otherwise.
"""
raise NotImplementedError(
f"The subclass, {type(self).__name__}, must implement the accepts() method to determine if they can handle the document."
)
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
"""
Convert a document to Markdown text.
Parameters:
- file_stream: The file-like object to convert. Must support seek(), tell(), and read() methods.
- stream_info: The StreamInfo object containing metadata about the file (mimetype, extension, charset, set)
- kwargs: Additional keyword arguments for the converter.
Returns:
- DocumentConverterResult: The result of the conversion, which includes the title and markdown content.
Raises:
- FileConversionException: If the mimetype is recognized, but the conversion fails for some other reason.
- MissingDependencyException: If the converter requires a dependency that is not installed.
"""
raise NotImplementedError("Subclasses must implement this method")
================================================
FILE: packages/markitdown/src/markitdown/_exceptions.py
================================================
from typing import Optional, List, Any
MISSING_DEPENDENCY_MESSAGE = """{converter} recognized the input as a potential {extension} file, but the dependencies needed to read {extension} files have not been installed. To resolve this error, include the optional dependency [{feature}] or [all] when installing MarkItDown. For example:
* pip install markitdown[{feature}]
* pip install markitdown[all]
* pip install markitdown[{feature}, ...]
* etc."""
class MarkItDownException(Exception):
"""
Base exception class for MarkItDown.
"""
pass
class MissingDependencyException(MarkItDownException):
"""
Converters shipped with MarkItDown may depend on optional
dependencies. This exception is thrown when a converter's
convert() method is called, but the required dependency is not
installed. This is not necessarily a fatal error, as the converter
will simply be skipped (an error will bubble up only if no other
suitable converter is found).
Error messages should clearly indicate which dependency is missing.
"""
pass
class UnsupportedFormatException(MarkItDownException):
"""
Thrown when no suitable converter was found for the given file.
"""
pass
class FailedConversionAttempt(object):
"""
Represents an a single attempt to convert a file.
"""
def __init__(self, converter: Any, exc_info: Optional[tuple] = None):
self.converter = converter
self.exc_info = exc_info
class FileConversionException(MarkItDownException):
"""
Thrown when a suitable converter was found, but the conversion
process fails for any reason.
"""
def __init__(
self,
message: Optional[str] = None,
attempts: Optional[List[FailedConversionAttempt]] = None,
):
self.attempts = attempts
if message is None:
if attempts is None:
message = "File conversion failed."
else:
message = f"File conversion failed after {len(attempts)} attempts:\n"
for attempt in attempts:
if attempt.exc_info is None:
message += f" - {type(attempt.converter).__name__} provided no execution info."
else:
message += f" - {type(attempt.converter).__name__} threw {attempt.exc_info[0].__name__} with message: {attempt.exc_info[1]}\n"
super().__init__(message)
================================================
FILE: packages/markitdown/src/markitdown/_markitdown.py
================================================
import mimetypes
import os
import re
import sys
import shutil
import traceback
import io
from dataclasses import dataclass
from importlib.metadata import entry_points
from typing import Any, List, Dict, Optional, Union, BinaryIO
from pathlib import Path
from urllib.parse import urlparse
from warnings import warn
import requests
import magika
import charset_normalizer
import codecs
from ._stream_info import StreamInfo
from ._uri_utils import parse_data_uri, file_uri_to_path
from .converters import (
PlainTextConverter,
HtmlConverter,
RssConverter,
WikipediaConverter,
YouTubeConverter,
IpynbConverter,
BingSerpConverter,
PdfConverter,
DocxConverter,
XlsxConverter,
XlsConverter,
PptxConverter,
ImageConverter,
AudioConverter,
OutlookMsgConverter,
ZipConverter,
EpubConverter,
DocumentIntelligenceConverter,
CsvConverter,
)
from ._base_converter import DocumentConverter, DocumentConverterResult
from ._exceptions import (
FileConversionException,
UnsupportedFormatException,
FailedConversionAttempt,
)
# Lower priority values are tried first.
PRIORITY_SPECIFIC_FILE_FORMAT = (
0.0 # e.g., .docx, .pdf, .xlsx, Or specific pages, e.g., wikipedia
)
PRIORITY_GENERIC_FILE_FORMAT = (
10.0 # Near catch-all converters for mimetypes like text/*, etc.
)
_plugins: Union[None, List[Any]] = None # If None, plugins have not been loaded yet.
def _load_plugins() -> Union[None, List[Any]]:
"""Lazy load plugins, exiting early if already loaded."""
global _plugins
# Skip if we've already loaded plugins
if _plugins is not None:
return _plugins
# Load plugins
_plugins = []
for entry_point in entry_points(group="markitdown.plugin"):
try:
_plugins.append(entry_point.load())
except Exception:
tb = traceback.format_exc()
warn(f"Plugin '{entry_point.name}' failed to load ... skipping:\n{tb}")
return _plugins
@dataclass(kw_only=True, frozen=True)
class ConverterRegistration:
"""A registration of a converter with its priority and other metadata."""
converter: DocumentConverter
priority: float
class MarkItDown:
"""(In preview) An extremely simple text-based document reader, suitable for LLM use.
This reader will convert common file-types or webpages to Markdown."""
def __init__(
self,
*,
enable_builtins: Union[None, bool] = None,
enable_plugins: Union[None, bool] = None,
**kwargs,
):
self._builtins_enabled = False
self._plugins_enabled = False
requests_session = kwargs.get("requests_session")
if requests_session is None:
self._requests_session = requests.Session()
# Signal that we prefer markdown over HTML, etc. if the server supports it.
# e.g., https://blog.cloudflare.com/markdown-for-agents/
self._requests_session.headers.update(
{
"Accept": "text/markdown, text/html;q=0.9, text/plain;q=0.8, */*;q=0.1"
}
)
else:
self._requests_session = requests_session
self._magika = magika.Magika()
# TODO - remove these (see enable_builtins)
self._llm_client: Any = None
self._llm_model: Union[str | None] = None
self._llm_prompt: Union[str | None] = None
self._exiftool_path: Union[str | None] = None
self._style_map: Union[str | None] = None
# Register the converters
self._converters: List[ConverterRegistration] = []
if (
enable_builtins is None or enable_builtins
): # Default to True when not specified
self.enable_builtins(**kwargs)
if enable_plugins:
self.enable_plugins(**kwargs)
def enable_builtins(self, **kwargs) -> None:
"""
Enable and register built-in converters.
Built-in converters are enabled by default.
This method should only be called once, if built-ins were initially disabled.
"""
if not self._builtins_enabled:
# TODO: Move these into converter constructors
self._llm_client = kwargs.get("llm_client")
self._llm_model = kwargs.get("llm_model")
self._llm_prompt = kwargs.get("llm_prompt")
self._exiftool_path = kwargs.get("exiftool_path")
self._style_map = kwargs.get("style_map")
if self._exiftool_path is None:
self._exiftool_path = os.getenv("EXIFTOOL_PATH")
# Still none? Check well-known paths
if self._exiftool_path is None:
candidate = shutil.which("exiftool")
if candidate:
candidate = os.path.abspath(candidate)
if any(
d == os.path.dirname(candidate)
for d in [
"/usr/bin",
"/usr/local/bin",
"/opt",
"/opt/bin",
"/opt/local/bin",
"/opt/homebrew/bin",
"C:\\Windows\\System32",
"C:\\Program Files",
"C:\\Program Files (x86)",
]
):
self._exiftool_path = candidate
# Register converters for successful browsing operations
# Later registrations are tried first / take higher priority than earlier registrations
# To this end, the most specific converters should appear below the most generic converters
self.register_converter(
PlainTextConverter(), priority=PRIORITY_GENERIC_FILE_FORMAT
)
self.register_converter(
ZipConverter(markitdown=self), priority=PRIORITY_GENERIC_FILE_FORMAT
)
self.register_converter(
HtmlConverter(), priority=PRIORITY_GENERIC_FILE_FORMAT
)
self.register_converter(RssConverter())
self.register_converter(WikipediaConverter())
self.register_converter(YouTubeConverter())
self.register_converter(BingSerpConverter())
self.register_converter(DocxConverter())
self.register_converter(XlsxConverter())
self.register_converter(XlsConverter())
self.register_converter(PptxConverter())
self.register_converter(AudioConverter())
self.register_converter(ImageConverter())
self.register_converter(IpynbConverter())
self.register_converter(PdfConverter())
self.register_converter(OutlookMsgConverter())
self.register_converter(EpubConverter())
self.register_converter(CsvConverter())
# Register Document Intelligence converter at the top of the stack if endpoint is provided
docintel_endpoint = kwargs.get("docintel_endpoint")
if docintel_endpoint is not None:
docintel_args: Dict[str, Any] = {}
docintel_args["endpoint"] = docintel_endpoint
docintel_credential = kwargs.get("docintel_credential")
if docintel_credential is not None:
docintel_args["credential"] = docintel_credential
docintel_types = kwargs.get("docintel_file_types")
if docintel_types is not None:
docintel_args["file_types"] = docintel_types
docintel_version = kwargs.get("docintel_api_version")
if docintel_version is not None:
docintel_args["api_version"] = docintel_version
self.register_converter(
DocumentIntelligenceConverter(**docintel_args),
)
self._builtins_enabled = True
else:
warn("Built-in converters are already enabled.", RuntimeWarning)
def enable_plugins(self, **kwargs) -> None:
"""
Enable and register converters provided by plugins.
Plugins are disabled by default.
This method should only be called once, if plugins were initially disabled.
"""
if not self._plugins_enabled:
# Load plugins
plugins = _load_plugins()
assert plugins is not None
for plugin in plugins:
try:
plugin.register_converters(self, **kwargs)
except Exception:
tb = traceback.format_exc()
warn(f"Plugin '{plugin}' failed to register converters:\n{tb}")
self._plugins_enabled = True
else:
warn("Plugins converters are already enabled.", RuntimeWarning)
def convert(
self,
source: Union[str, requests.Response, Path, BinaryIO],
*,
stream_info: Optional[StreamInfo] = None,
**kwargs: Any,
) -> DocumentConverterResult: # TODO: deal with kwargs
"""
Args:
- source: can be a path (str or Path), url, or a requests.response object
- stream_info: optional stream info to use for the conversion. If None, infer from source
- kwargs: additional arguments to pass to the converter
"""
# Local path or url
if isinstance(source, str):
if (
source.startswith("http:")
or source.startswith("https:")
or source.startswith("file:")
or source.startswith("data:")
):
# Rename the url argument to mock_url
# (Deprecated -- use stream_info)
_kwargs = {k: v for k, v in kwargs.items()}
if "url" in _kwargs:
_kwargs["mock_url"] = _kwargs["url"]
del _kwargs["url"]
return self.convert_uri(source, stream_info=stream_info, **_kwargs)
else:
return self.convert_local(source, stream_info=stream_info, **kwargs)
# Path object
elif isinstance(source, Path):
return self.convert_local(source, stream_info=stream_info, **kwargs)
# Request response
elif isinstance(source, requests.Response):
return self.convert_response(source, stream_info=stream_info, **kwargs)
# Binary stream
elif (
hasattr(source, "read")
and callable(source.read)
and not isinstance(source, io.TextIOBase)
):
return self.convert_stream(source, stream_info=stream_info, **kwargs)
else:
raise TypeError(
f"Invalid source type: {type(source)}. Expected str, requests.Response, BinaryIO."
)
def convert_local(
self,
path: Union[str, Path],
*,
stream_info: Optional[StreamInfo] = None,
file_extension: Optional[str] = None, # Deprecated -- use stream_info
url: Optional[str] = None, # Deprecated -- use stream_info
**kwargs: Any,
) -> DocumentConverterResult:
if isinstance(path, Path):
path = str(path)
# Build a base StreamInfo object from which to start guesses
base_guess = StreamInfo(
local_path=path,
extension=os.path.splitext(path)[1],
filename=os.path.basename(path),
)
# Extend the base_guess with any additional info from the arguments
if stream_info is not None:
base_guess = base_guess.copy_and_update(stream_info)
if file_extension is not None:
# Deprecated -- use stream_info
base_guess = base_guess.copy_and_update(extension=file_extension)
if url is not None:
# Deprecated -- use stream_info
base_guess = base_guess.copy_and_update(url=url)
with open(path, "rb") as fh:
guesses = self._get_stream_info_guesses(
file_stream=fh, base_guess=base_guess
)
return self._convert(file_stream=fh, stream_info_guesses=guesses, **kwargs)
def convert_stream(
self,
stream: BinaryIO,
*,
stream_info: Optional[StreamInfo] = None,
file_extension: Optional[str] = None, # Deprecated -- use stream_info
url: Optional[str] = None, # Deprecated -- use stream_info
**kwargs: Any,
) -> DocumentConverterResult:
guesses: List[StreamInfo] = []
# Do we have anything on which to base a guess?
base_guess = None
if stream_info is not None or file_extension is not None or url is not None:
# Start with a non-Null base guess
if stream_info is None:
base_guess = StreamInfo()
else:
base_guess = stream_info
if file_extension is not None:
# Deprecated -- use stream_info
assert base_guess is not None # for mypy
base_guess = base_guess.copy_and_update(extension=file_extension)
if url is not None:
# Deprecated -- use stream_info
assert base_guess is not None # for mypy
base_guess = base_guess.copy_and_update(url=url)
# Check if we have a seekable stream. If not, load the entire stream into memory.
if not stream.seekable():
buffer = io.BytesIO()
while True:
chunk = stream.read(4096)
if not chunk:
break
buffer.write(chunk)
buffer.seek(0)
stream = buffer
# Add guesses based on stream content
guesses = self._get_stream_info_guesses(
file_stream=stream, base_guess=base_guess or StreamInfo()
)
return self._convert(file_stream=stream, stream_info_guesses=guesses, **kwargs)
def convert_url(
self,
url: str,
*,
stream_info: Optional[StreamInfo] = None,
file_extension: Optional[str] = None,
mock_url: Optional[str] = None,
**kwargs: Any,
) -> DocumentConverterResult:
"""Alias for convert_uri()"""
# convert_url will likely be deprecated in the future in favor of convert_uri
return self.convert_uri(
url,
stream_info=stream_info,
file_extension=file_extension,
mock_url=mock_url,
**kwargs,
)
def convert_uri(
self,
uri: str,
*,
stream_info: Optional[StreamInfo] = None,
file_extension: Optional[str] = None, # Deprecated -- use stream_info
mock_url: Optional[
str
] = None, # Mock the request as if it came from a different URL
**kwargs: Any,
) -> DocumentConverterResult:
uri = uri.strip()
# File URIs
if uri.startswith("file:"):
netloc, path = file_uri_to_path(uri)
if netloc and netloc != "localhost":
raise ValueError(
f"Unsupported file URI: {uri}. Netloc must be empty or localhost."
)
return self.convert_local(
path,
stream_info=stream_info,
file_extension=file_extension,
url=mock_url,
**kwargs,
)
# Data URIs
elif uri.startswith("data:"):
mimetype, attributes, data = parse_data_uri(uri)
base_guess = StreamInfo(
mimetype=mimetype,
charset=attributes.get("charset"),
)
if stream_info is not None:
base_guess = base_guess.copy_and_update(stream_info)
return self.convert_stream(
io.BytesIO(data),
stream_info=base_guess,
file_extension=file_extension,
url=mock_url,
**kwargs,
)
# HTTP/HTTPS URIs
elif uri.startswith("http:") or uri.startswith("https:"):
response = self._requests_session.get(uri, stream=True)
response.raise_for_status()
return self.convert_response(
response,
stream_info=stream_info,
file_extension=file_extension,
url=mock_url,
**kwargs,
)
else:
raise ValueError(
f"Unsupported URI scheme: {uri.split(':')[0]}. Supported schemes are: file:, data:, http:, https:"
)
def convert_response(
self,
response: requests.Response,
*,
stream_info: Optional[StreamInfo] = None,
file_extension: Optional[str] = None, # Deprecated -- use stream_info
url: Optional[str] = None, # Deprecated -- use stream_info
**kwargs: Any,
) -> DocumentConverterResult:
# If there is a content-type header, get the mimetype and charset (if present)
mimetype: Optional[str] = None
charset: Optional[str] = None
if "content-type" in response.headers:
parts = response.headers["content-type"].split(";")
mimetype = parts.pop(0).strip()
for part in parts:
if part.strip().startswith("charset="):
_charset = part.split("=")[1].strip()
if len(_charset) > 0:
charset = _charset
# If there is a content-disposition header, get the filename and possibly the extension
filename: Optional[str] = None
extension: Optional[str] = None
if "content-disposition" in response.headers:
m = re.search(r"filename=([^;]+)", response.headers["content-disposition"])
if m:
filename = m.group(1).strip("\"'")
_, _extension = os.path.splitext(filename)
if len(_extension) > 0:
extension = _extension
# If there is still no filename, try to read it from the url
if filename is None:
parsed_url = urlparse(response.url)
_, _extension = os.path.splitext(parsed_url.path)
if len(_extension) > 0: # Looks like this might be a file!
filename = os.path.basename(parsed_url.path)
extension = _extension
# Create an initial guess from all this information
base_guess = StreamInfo(
mimetype=mimetype,
charset=charset,
filename=filename,
extension=extension,
url=response.url,
)
# Update with any additional info from the arguments
if stream_info is not None:
base_guess = base_guess.copy_and_update(stream_info)
if file_extension is not None:
# Deprecated -- use stream_info
base_guess = base_guess.copy_and_update(extension=file_extension)
if url is not None:
# Deprecated -- use stream_info
base_guess = base_guess.copy_and_update(url=url)
# Read into BytesIO
buffer = io.BytesIO()
for chunk in response.iter_content(chunk_size=512):
buffer.write(chunk)
buffer.seek(0)
# Convert
guesses = self._get_stream_info_guesses(
file_stream=buffer, base_guess=base_guess
)
return self._convert(file_stream=buffer, stream_info_guesses=guesses, **kwargs)
def _convert(
self, *, file_stream: BinaryIO, stream_info_guesses: List[StreamInfo], **kwargs
) -> DocumentConverterResult:
res: Union[None, DocumentConverterResult] = None
# Keep track of which converters throw exceptions
failed_attempts: List[FailedConversionAttempt] = []
# Create a copy of the page_converters list, sorted by priority.
# We do this with each call to _convert because the priority of converters may change between calls.
# The sort is guaranteed to be stable, so converters with the same priority will remain in the same order.
sorted_registrations = sorted(self._converters, key=lambda x: x.priority)
# Remember the initial stream position so that we can return to it
cur_pos = file_stream.tell()
for stream_info in stream_info_guesses + [StreamInfo()]:
for converter_registration in sorted_registrations:
converter = converter_registration.converter
# Sanity check -- make sure the cur_pos is still the same
assert (
cur_pos == file_stream.tell()
), "File stream position should NOT change between guess iterations"
_kwargs = {k: v for k, v in kwargs.items()}
# Copy any additional global options
if "llm_client" not in _kwargs and self._llm_client is not None:
_kwargs["llm_client"] = self._llm_client
if "llm_model" not in _kwargs and self._llm_model is not None:
_kwargs["llm_model"] = self._llm_model
if "llm_prompt" not in _kwargs and self._llm_prompt is not None:
_kwargs["llm_prompt"] = self._llm_prompt
if "style_map" not in _kwargs and self._style_map is not None:
_kwargs["style_map"] = self._style_map
if "exiftool_path" not in _kwargs and self._exiftool_path is not None:
_kwargs["exiftool_path"] = self._exiftool_path
# Add the list of converters for nested processing
_kwargs["_parent_converters"] = self._converters
# Add legaxy kwargs
if stream_info is not None:
if stream_info.extension is not None:
_kwargs["file_extension"] = stream_info.extension
if stream_info.url is not None:
_kwargs["url"] = stream_info.url
# Check if the converter will accept the file, and if so, try to convert it
_accepts = False
try:
_accepts = converter.accepts(file_stream, stream_info, **_kwargs)
except NotImplementedError:
pass
# accept() should not have changed the file stream position
assert (
cur_pos == file_stream.tell()
), f"{type(converter).__name__}.accept() should NOT change the file_stream position"
# Attempt the conversion
if _accepts:
try:
res = converter.convert(file_stream, stream_info, **_kwargs)
except Exception:
failed_attempts.append(
FailedConversionAttempt(
converter=converter, exc_info=sys.exc_info()
)
)
finally:
file_stream.seek(cur_pos)
if res is not None:
# Normalize the content
res.text_content = "\n".join(
[line.rstrip() for line in re.split(r"\r?\n", res.text_content)]
)
res.text_content = re.sub(r"\n{3,}", "\n\n", res.text_content)
return res
# If we got this far without success, report any exceptions
if len(failed_attempts) > 0:
raise FileConversionException(attempts=failed_attempts)
# Nothing can handle it!
raise UnsupportedFormatException(
"Could not convert stream to Markdown. No converter attempted a conversion, suggesting that the filetype is simply not supported."
)
def register_page_converter(self, converter: DocumentConverter) -> None:
"""DEPRECATED: User register_converter instead."""
warn(
"register_page_converter is deprecated. Use register_converter instead.",
DeprecationWarning,
)
self.register_converter(converter)
def register_converter(
self,
converter: DocumentConverter,
*,
priority: float = PRIORITY_SPECIFIC_FILE_FORMAT,
) -> None:
"""
Register a DocumentConverter with a given priority.
Priorities work as follows: By default, most converters get priority
DocumentConverter.PRIORITY_SPECIFIC_FILE_FORMAT (== 0). The exception
is the PlainTextConverter, HtmlConverter, and ZipConverter, which get
priority PRIORITY_SPECIFIC_FILE_FORMAT (== 10), with lower values
being tried first (i.e., higher priority).
Just prior to conversion, the converters are sorted by priority, using
a stable sort. This means that converters with the same priority will
remain in the same order, with the most recently registered converters
appearing first.
We have tight control over the order of built-in converters, but
plugins can register converters in any order. The registration's priority
field reasserts some control over the order of converters.
Plugins can register converters with any priority, to appear before or
after the built-ins. For example, a plugin with priority 9 will run
before the PlainTextConverter, but after the built-in converters.
"""
self._converters.insert(
0, ConverterRegistration(converter=converter, priority=priority)
)
def _get_stream_info_guesses(
self, file_stream: BinaryIO, base_guess: StreamInfo
) -> List[StreamInfo]:
"""
Given a base guess, attempt to guess or expand on the stream info using the stream content (via magika).
"""
guesses: List[StreamInfo] = []
# Enhance the base guess with information based on the extension or mimetype
enhanced_guess = base_guess.copy_and_update()
# If there's an extension and no mimetype, try to guess the mimetype
if base_guess.mimetype is None and base_guess.extension is not None:
_m, _ = mimetypes.guess_type(
"placeholder" + base_guess.extension, strict=False
)
if _m is not None:
enhanced_guess = enhanced_guess.copy_and_update(mimetype=_m)
# If there's a mimetype and no extension, try to guess the extension
if base_guess.mimetype is not None and base_guess.extension is None:
_e = mimetypes.guess_all_extensions(base_guess.mimetype, strict=False)
if len(_e) > 0:
enhanced_guess = enhanced_guess.copy_and_update(extension=_e[0])
# Call magika to guess from the stream
cur_pos = file_stream.tell()
try:
result = self._magika.identify_stream(file_stream)
if result.status == "ok" and result.prediction.output.label != "unknown":
# If it's text, also guess the charset
charset = None
if result.prediction.output.is_text:
# Read the first 4k to guess the charset
file_stream.seek(cur_pos)
stream_page = file_stream.read(4096)
charset_result = charset_normalizer.from_bytes(stream_page).best()
if charset_result is not None:
charset = self._normalize_charset(charset_result.encoding)
# Normalize the first extension listed
guessed_extension = None
if len(result.prediction.output.extensions) > 0:
guessed_extension = "." + result.prediction.output.extensions[0]
# Determine if the guess is compatible with the base guess
compatible = True
if (
base_guess.mimetype is not None
and base_guess.mimetype != result.prediction.output.mime_type
):
compatible = False
if (
base_guess.extension is not None
and base_guess.extension.lstrip(".")
not in result.prediction.output.extensions
):
compatible = False
if (
base_guess.charset is not None
and self._normalize_charset(base_guess.charset) != charset
):
compatible = False
if compatible:
# Add the compatible base guess
guesses.append(
StreamInfo(
mimetype=base_guess.mimetype
or result.prediction.output.mime_type,
extension=base_guess.extension or guessed_extension,
charset=base_guess.charset or charset,
filename=base_guess.filename,
local_path=base_guess.local_path,
url=base_guess.url,
)
)
else:
# The magika guess was incompatible with the base guess, so add both guesses
guesses.append(enhanced_guess)
guesses.append(
StreamInfo(
mimetype=result.prediction.output.mime_type,
extension=guessed_extension,
charset=charset,
filename=base_guess.filename,
local_path=base_guess.local_path,
url=base_guess.url,
)
)
else:
# There were no other guesses, so just add the base guess
guesses.append(enhanced_guess)
finally:
file_stream.seek(cur_pos)
return guesses
def _normalize_charset(self, charset: str | None) -> str | None:
"""
Normalize a charset string to a canonical form.
"""
if charset is None:
return None
try:
return codecs.lookup(charset).name
except LookupError:
return charset
================================================
FILE: packages/markitdown/src/markitdown/_stream_info.py
================================================
from dataclasses import dataclass, asdict
from typing import Optional
@dataclass(kw_only=True, frozen=True)
class StreamInfo:
"""The StreamInfo class is used to store information about a file stream.
All fields can be None, and will depend on how the stream was opened.
"""
mimetype: Optional[str] = None
extension: Optional[str] = None
charset: Optional[str] = None
filename: Optional[
str
] = None # From local path, url, or Content-Disposition header
local_path: Optional[str] = None # If read from disk
url: Optional[str] = None # If read from url
def copy_and_update(self, *args, **kwargs):
"""Copy the StreamInfo object and update it with the given StreamInfo
instance and/or other keyword arguments."""
new_info = asdict(self)
for si in args:
assert isinstance(si, StreamInfo)
new_info.update({k: v for k, v in asdict(si).items() if v is not None})
if len(kwargs) > 0:
new_info.update(kwargs)
return StreamInfo(**new_info)
================================================
FILE: packages/markitdown/src/markitdown/_uri_utils.py
================================================
import base64
import os
from typing import Tuple, Dict
from urllib.request import url2pathname
from urllib.parse import urlparse, unquote_to_bytes
def file_uri_to_path(file_uri: str) -> Tuple[str | None, str]:
"""Convert a file URI to a local file path"""
parsed = urlparse(file_uri)
if parsed.scheme != "file":
raise ValueError(f"Not a file URL: {file_uri}")
netloc = parsed.netloc if parsed.netloc else None
path = os.path.abspath(url2pathname(parsed.path))
return netloc, path
def parse_data_uri(uri: str) -> Tuple[str | None, Dict[str, str], bytes]:
if not uri.startswith("data:"):
raise ValueError("Not a data URI")
header, _, data = uri.partition(",")
if not _:
raise ValueError("Malformed data URI, missing ',' separator")
meta = header[5:] # Strip 'data:'
parts = meta.split(";")
is_base64 = False
# Ends with base64?
if parts[-1] == "base64":
parts.pop()
is_base64 = True
mime_type = None # Normally this would default to text/plain but we won't assume
if len(parts) and len(parts[0]) > 0:
# First part is the mime type
mime_type = parts.pop(0)
attributes: Dict[str, str] = {}
for part in parts:
# Handle key=value pairs in the middle
if "=" in part:
key, value = part.split("=", 1)
attributes[key] = value
elif len(part) > 0:
attributes[part] = ""
content = base64.b64decode(data) if is_base64 else unquote_to_bytes(data)
return mime_type, attributes, content
================================================
FILE: packages/markitdown/src/markitdown/converter_utils/__init__.py
================================================
================================================
FILE: packages/markitdown/src/markitdown/converter_utils/docx/__init__.py
================================================
================================================
FILE: packages/markitdown/src/markitdown/converter_utils/docx/math/__init__.py
================================================
================================================
FILE: packages/markitdown/src/markitdown/converter_utils/docx/math/latex_dict.py
================================================
# -*- coding: utf-8 -*-
"""
Adapted from https://github.com/xiilei/dwml/blob/master/dwml/latex_dict.py
On 25/03/2025
"""
from __future__ import unicode_literals
CHARS = ("{", "}", "_", "^", "#", "&", "$", "%", "~")
BLANK = ""
BACKSLASH = "\\"
ALN = "&"
CHR = {
# Unicode : Latex Math Symbols
# Top accents
"\u0300": "\\grave{{{0}}}",
"\u0301": "\\acute{{{0}}}",
"\u0302": "\\hat{{{0}}}",
"\u0303": "\\tilde{{{0}}}",
"\u0304": "\\bar{{{0}}}",
"\u0305": "\\overbar{{{0}}}",
"\u0306": "\\breve{{{0}}}",
"\u0307": "\\dot{{{0}}}",
"\u0308": "\\ddot{{{0}}}",
"\u0309": "\\ovhook{{{0}}}",
"\u030a": "\\ocirc{{{0}}}}",
"\u030c": "\\check{{{0}}}}",
"\u0310": "\\candra{{{0}}}",
"\u0312": "\\oturnedcomma{{{0}}}",
"\u0315": "\\ocommatopright{{{0}}}",
"\u031a": "\\droang{{{0}}}",
"\u0338": "\\not{{{0}}}",
"\u20d0": "\\leftharpoonaccent{{{0}}}",
"\u20d1": "\\rightharpoonaccent{{{0}}}",
"\u20d2": "\\vertoverlay{{{0}}}",
"\u20d6": "\\overleftarrow{{{0}}}",
"\u20d7": "\\vec{{{0}}}",
"\u20db": "\\dddot{{{0}}}",
"\u20dc": "\\ddddot{{{0}}}",
"\u20e1": "\\overleftrightarrow{{{0}}}",
"\u20e7": "\\annuity{{{0}}}",
"\u20e9": "\\widebridgeabove{{{0}}}",
"\u20f0": "\\asteraccent{{{0}}}",
# Bottom accents
"\u0330": "\\wideutilde{{{0}}}",
"\u0331": "\\underbar{{{0}}}",
"\u20e8": "\\threeunderdot{{{0}}}",
"\u20ec": "\\underrightharpoondown{{{0}}}",
"\u20ed": "\\underleftharpoondown{{{0}}}",
"\u20ee": "\\underledtarrow{{{0}}}",
"\u20ef": "\\underrightarrow{{{0}}}",
# Over | group
"\u23b4": "\\overbracket{{{0}}}",
"\u23dc": "\\overparen{{{0}}}",
"\u23de": "\\overbrace{{{0}}}",
# Under| group
"\u23b5": "\\underbracket{{{0}}}",
"\u23dd": "\\underparen{{{0}}}",
"\u23df": "\\underbrace{{{0}}}",
}
CHR_BO = {
# Big operators,
"\u2140": "\\Bbbsum",
"\u220f": "\\prod",
"\u2210": "\\coprod",
"\u2211": "\\sum",
"\u222b": "\\int",
"\u22c0": "\\bigwedge",
"\u22c1": "\\bigvee",
"\u22c2": "\\bigcap",
"\u22c3": "\\bigcup",
"\u2a00": "\\bigodot",
"\u2a01": "\\bigoplus",
"\u2a02": "\\bigotimes",
}
T = {
"\u2192": "\\rightarrow ",
# Greek letters
"\U0001d6fc": "\\alpha ",
"\U0001d6fd": "\\beta ",
"\U0001d6fe": "\\gamma ",
"\U0001d6ff": "\\theta ",
"\U0001d700": "\\epsilon ",
"\U0001d701": "\\zeta ",
"\U0001d702": "\\eta ",
"\U0001d703": "\\theta ",
"\U0001d704": "\\iota ",
"\U0001d705": "\\kappa ",
"\U0001d706": "\\lambda ",
"\U0001d707": "\\m ",
"\U0001d708": "\\n ",
"\U0001d709": "\\xi ",
"\U0001d70a": "\\omicron ",
"\U0001d70b": "\\pi ",
"\U0001d70c": "\\rho ",
"\U0001d70d": "\\varsigma ",
"\U0001d70e": "\\sigma ",
"\U0001d70f": "\\ta ",
"\U0001d710": "\\upsilon ",
"\U0001d711": "\\phi ",
"\U0001d712": "\\chi ",
"\U0001d713": "\\psi ",
"\U0001d714": "\\omega ",
"\U0001d715": "\\partial ",
"\U0001d716": "\\varepsilon ",
"\U0001d717": "\\vartheta ",
"\U0001d718": "\\varkappa ",
"\U0001d719": "\\varphi ",
"\U0001d71a": "\\varrho ",
"\U0001d71b": "\\varpi ",
# Relation symbols
"\u2190": "\\leftarrow ",
"\u2191": "\\uparrow ",
"\u2192": "\\rightarrow ",
"\u2193": "\\downright ",
"\u2194": "\\leftrightarrow ",
"\u2195": "\\updownarrow ",
"\u2196": "\\nwarrow ",
"\u2197": "\\nearrow ",
"\u2198": "\\searrow ",
"\u2199": "\\swarrow ",
"\u22ee": "\\vdots ",
"\u22ef": "\\cdots ",
"\u22f0": "\\adots ",
"\u22f1": "\\ddots ",
"\u2260": "\\ne ",
"\u2264": "\\leq ",
"\u2265": "\\geq ",
"\u2266": "\\leqq ",
"\u2267": "\\geqq ",
"\u2268": "\\lneqq ",
"\u2269": "\\gneqq ",
"\u226a": "\\ll ",
"\u226b": "\\gg ",
"\u2208": "\\in ",
"\u2209": "\\notin ",
"\u220b": "\\ni ",
"\u220c": "\\nni ",
# Ordinary symbols
"\u221e": "\\infty ",
# Binary relations
"\u00b1": "\\pm ",
"\u2213": "\\mp ",
# Italic, Latin, uppercase
"\U0001d434": "A",
"\U0001d435": "B",
"\U0001d436": "C",
"\U0001d437": "D",
"\U0001d438": "E",
"\U0001d439": "F",
"\U0001d43a": "G",
"\U0001d43b": "H",
"\U0001d43c": "I",
"\U0001d43d": "J",
"\U0001d43e": "K",
"\U0001d43f": "L",
"\U0001d440": "M",
"\U0001d441": "N",
"\U0001d442": "O",
"\U0001d443": "P",
"\U0001d444": "Q",
"\U0001d445": "R",
"\U0001d446": "S",
"\U0001d447": "T",
"\U0001d448": "U",
"\U0001d449": "V",
"\U0001d44a": "W",
"\U0001d44b": "X",
"\U0001d44c": "Y",
"\U0001d44d": "Z",
# Italic, Latin, lowercase
"\U0001d44e": "a",
"\U0001d44f": "b",
"\U0001d450": "c",
"\U0001d451": "d",
"\U0001d452": "e",
"\U0001d453": "f",
"\U0001d454": "g",
"\U0001d456": "i",
"\U0001d457": "j",
"\U0001d458": "k",
"\U0001d459": "l",
"\U0001d45a": "m",
"\U0001d45b": "n",
"\U0001d45c": "o",
"\U0001d45d": "p",
"\U0001d45e": "q",
"\U0001d45f": "r",
"\U0001d460": "s",
"\U0001d461": "t",
"\U0001d462": "u",
"\U0001d463": "v",
"\U0001d464": "w",
"\U0001d465": "x",
"\U0001d466": "y",
"\U0001d467": "z",
}
FUNC = {
"sin": "\\sin({fe})",
"cos": "\\cos({fe})",
"tan": "\\tan({fe})",
"arcsin": "\\arcsin({fe})",
"arccos": "\\arccos({fe})",
"arctan": "\\arctan({fe})",
"arccot": "\\arccot({fe})",
"sinh": "\\sinh({fe})",
"cosh": "\\cosh({fe})",
"tanh": "\\tanh({fe})",
"coth": "\\coth({fe})",
"sec": "\\sec({fe})",
"csc": "\\csc({fe})",
}
FUNC_PLACE = "{fe}"
BRK = "\\\\"
CHR_DEFAULT = {
"ACC_VAL": "\\hat{{{0}}}",
}
POS = {
"top": "\\overline{{{0}}}", # not sure
"bot": "\\underline{{{0}}}",
}
POS_DEFAULT = {
"BAR_VAL": "\\overline{{{0}}}",
}
SUB = "_{{{0}}}"
SUP = "^{{{0}}}"
F = {
"bar": "\\frac{{{num}}}{{{den}}}",
"skw": r"^{{{num}}}/_{{{den}}}",
"noBar": "\\genfrac{{}}{{}}{{0pt}}{{}}{{{num}}}{{{den}}}",
"lin": "{{{num}}}/{{{den}}}",
}
F_DEFAULT = "\\frac{{{num}}}{{{den}}}"
D = "\\left{left}{text}\\right{right}"
D_DEFAULT = {
"left": "(",
"right": ")",
"null": ".",
}
RAD = "\\sqrt[{deg}]{{{text}}}"
RAD_DEFAULT = "\\sqrt{{{text}}}"
ARR = "\\begin{{array}}{{c}}{text}\\end{{array}}"
LIM_FUNC = {
"lim": "\\lim_{{{lim}}}",
"max": "\\max_{{{lim}}}",
"min": "\\min_{{{lim}}}",
}
LIM_TO = ("\\rightarrow", "\\to")
LIM_UPP = "\\overset{{{lim}}}{{{text}}}"
M = "\\begin{{matrix}}{text}\\end{{matrix}}"
================================================
FILE: packages/markitdown/src/markitdown/converter_utils/docx/math/omml.py
================================================
# -*- coding: utf-8 -*-
"""
Office Math Markup Language (OMML)
Adapted from https://github.com/xiilei/dwml/blob/master/dwml/omml.py
On 25/03/2025
"""
from defusedxml import ElementTree as ET
from .latex_dict import (
CHARS,
CHR,
CHR_BO,
CHR_DEFAULT,
POS,
POS_DEFAULT,
SUB,
SUP,
F,
F_DEFAULT,
T,
FUNC,
D,
D_DEFAULT,
RAD,
RAD_DEFAULT,
ARR,
LIM_FUNC,
LIM_TO,
LIM_UPP,
M,
BRK,
BLANK,
BACKSLASH,
ALN,
FUNC_PLACE,
)
OMML_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/math}"
def load(stream):
tree = ET.parse(stream)
for omath in tree.findall(OMML_NS + "oMath"):
yield oMath2Latex(omath)
def load_string(string):
root = ET.fromstring(string)
for omath in root.findall(OMML_NS + "oMath"):
yield oMath2Latex(omath)
def escape_latex(strs):
last = None
new_chr = []
strs = strs.replace(r"\\", "\\")
for c in strs:
if (c in CHARS) and (last != BACKSLASH):
new_chr.append(BACKSLASH + c)
else:
new_chr.append(c)
last = c
return BLANK.join(new_chr)
def get_val(key, default=None, store=CHR):
if key is not None:
return key if not store else store.get(key, key)
else:
return default
class Tag2Method(object):
def call_method(self, elm, stag=None):
getmethod = self.tag2meth.get
if stag is None:
stag = elm.tag.replace(OMML_NS, "")
method = getmethod(stag)
if method:
return method(self, elm)
else:
return None
def process_children_list(self, elm, include=None):
"""
process children of the elm,return iterable
"""
for _e in list(elm):
if OMML_NS not in _e.tag:
continue
stag = _e.tag.replace(OMML_NS, "")
if include and (stag not in include):
continue
t = self.call_method(_e, stag=stag)
if t is None:
t = self.process_unknow(_e, stag)
if t is None:
continue
yield (stag, t, _e)
def process_children_dict(self, elm, include=None):
"""
process children of the elm,return dict
"""
latex_chars = dict()
for stag, t, e in self.process_children_list(elm, include):
latex_chars[stag] = t
return latex_chars
def process_children(self, elm, include=None):
"""
process children of the elm,return string
"""
return BLANK.join(
(
t if not isinstance(t, Tag2Method) else str(t)
for stag, t, e in self.process_children_list(elm, include)
)
)
def process_unknow(self, elm, stag):
return None
class Pr(Tag2Method):
text = ""
__val_tags = ("chr", "pos", "begChr", "endChr", "type")
__innerdict = None # can't use the __dict__
""" common properties of element"""
def __init__(self, elm):
self.__innerdict = {}
self.text = self.process_children(elm)
def __str__(self):
return self.text
def __unicode__(self):
return self.__str__(self)
def __getattr__(self, name):
return self.__innerdict.get(name, None)
def do_brk(self, elm):
self.__innerdict["brk"] = BRK
return BRK
def do_common(self, elm):
stag = elm.tag.replace(OMML_NS, "")
if stag in self.__val_tags:
t = elm.get("{0}val".format(OMML_NS))
self.__innerdict[stag] = t
return None
tag2meth = {
"brk": do_brk,
"chr": do_common,
"pos": do_common,
"begChr": do_common,
"endChr": do_common,
"type": do_common,
}
class oMath2Latex(Tag2Method):
"""
Convert oMath element of omml to latex
"""
_t_dict = T
__direct_tags = ("box", "sSub", "sSup", "sSubSup", "num", "den", "deg", "e")
def __init__(self, element):
self._latex = self.process_children(element)
def __str__(self):
return self.latex
def __unicode__(self):
return self.__str__(self)
def process_unknow(self, elm, stag):
if stag in self.__direct_tags:
return self.process_children(elm)
elif stag[-2:] == "Pr":
return Pr(elm)
else:
return None
@property
def latex(self):
return self._latex
def do_acc(self, elm):
"""
the accent function
"""
c_dict = self.process_children_dict(elm)
latex_s = get_val(
c_dict["accPr"].chr, default=CHR_DEFAULT.get("ACC_VAL"), store=CHR
)
return latex_s.format(c_dict["e"])
def do_bar(self, elm):
"""
the bar function
"""
c_dict = self.process_children_dict(elm)
pr = c_dict["barPr"]
latex_s = get_val(pr.pos, default=POS_DEFAULT.get("BAR_VAL"), store=POS)
return pr.text + latex_s.format(c_dict["e"])
def do_d(self, elm):
"""
the delimiter object
"""
c_dict = self.process_children_dict(elm)
pr = c_dict["dPr"]
null = D_DEFAULT.get("null")
s_val = get_val(pr.begChr, default=D_DEFAULT.get("left"), store=T)
e_val = get_val(pr.endChr, default=D_DEFAULT.get("right"), store=T)
return pr.text + D.format(
left=null if not s_val else escape_latex(s_val),
text=c_dict["e"],
right=null if not e_val else escape_latex(e_val),
)
def do_spre(self, elm):
"""
the Pre-Sub-Superscript object -- Not support yet
"""
pass
def do_sub(self, elm):
text = self.process_children(elm)
return SUB.format(text)
def do_sup(self, elm):
text = self.process_children(elm)
return SUP.format(text)
def do_f(self, elm):
"""
the fraction object
"""
c_dict = self.process_children_dict(elm)
pr = c_dict["fPr"]
latex_s = get_val(pr.type, default=F_DEFAULT, store=F)
return pr.text + latex_s.format(num=c_dict.get("num"), den=c_dict.get("den"))
def do_func(self, elm):
"""
the Function-Apply object (Examples:sin cos)
"""
c_dict = self.process_children_dict(elm)
func_name = c_dict.get("fName")
return func_name.replace(FUNC_PLACE, c_dict.get("e"))
def do_fname(self, elm):
"""
the func name
"""
latex_chars = []
for stag, t, e in self.process_children_list(elm):
if stag == "r":
if FUNC.get(t):
latex_chars.append(FUNC[t])
else:
raise NotImplementedError("Not support func %s" % t)
else:
latex_chars.append(t)
t = BLANK.join(latex_chars)
return t if FUNC_PLACE in t else t + FUNC_PLACE # do_func will replace this
def do_groupchr(self, elm):
"""
the Group-Character object
"""
c_dict = self.process_children_dict(elm)
pr = c_dict["groupChrPr"]
latex_s = get_val(pr.chr)
return pr.text + latex_s.format(c_dict["e"])
def do_rad(self, elm):
"""
the radical object
"""
c_dict = self.process_children_dict(elm)
text = c_dict.get("e")
deg_text = c_dict.get("deg")
if deg_text:
return RAD.format(deg=deg_text, text=text)
else:
return RAD_DEFAULT.format(text=text)
def do_eqarr(self, elm):
"""
the Array object
"""
return ARR.format(
text=BRK.join(
[t for stag, t, e in self.process_children_list(elm, include=("e",))]
)
)
def do_limlow(self, elm):
"""
the Lower-Limit object
"""
t_dict = self.process_children_dict(elm, include=("e", "lim"))
latex_s = LIM_FUNC.get(t_dict["e"])
if not latex_s:
raise NotImplementedError("Not support lim %s" % t_dict["e"])
else:
return latex_s.format(lim=t_dict.get("lim"))
def do_limupp(self, elm):
"""
the Upper-Limit object
"""
t_dict = self.process_children_dict(elm, include=("e", "lim"))
return LIM_UPP.format(lim=t_dict.get("lim"), text=t_dict.get("e"))
def do_lim(self, elm):
"""
the lower limit of the limLow object and the upper limit of the limUpp function
"""
return self.process_children(elm).replace(LIM_TO[0], LIM_TO[1])
def do_m(self, elm):
"""
the Matrix object
"""
rows = []
for stag, t, e in self.process_children_list(elm):
if stag == "mPr":
pass
elif stag == "mr":
rows.append(t)
return M.format(text=BRK.join(rows))
def do_mr(self, elm):
"""
a single row of the matrix m
"""
return ALN.join(
[t for stag, t, e in self.process_children_list(elm, include=("e",))]
)
def do_nary(self, elm):
"""
the n-ary object
"""
res = []
bo = ""
for stag, t, e in self.process_children_list(elm):
if stag == "naryPr":
bo = get_val(t.chr, store=CHR_BO)
else:
res.append(t)
return bo + BLANK.join(res)
def do_r(self, elm):
"""
Get text from 'r' element,And try convert them to latex symbols
@todo text style support , (sty)
@todo \text (latex pure text support)
"""
_str = []
for s in elm.findtext("./{0}t".format(OMML_NS)):
# s = s if isinstance(s,unicode) else unicode(s,'utf-8')
_str.append(self._t_dict.get(s, s))
return escape_latex(BLANK.join(_str))
tag2meth = {
"acc": do_acc,
"r": do_r,
"bar": do_bar,
"sub": do_sub,
"sup": do_sup,
"f": do_f,
"func": do_func,
"fName": do_fname,
"groupChr": do_groupchr,
"d": do_d,
"rad": do_rad,
"eqArr": do_eqarr,
"limLow": do_limlow,
"limUpp": do_limupp,
"lim": do_lim,
"m": do_m,
"mr": do_mr,
"nary": do_nary,
}
================================================
FILE: packages/markitdown/src/markitdown/converter_utils/docx/pre_process.py
================================================
import zipfile
from io import BytesIO
from typing import BinaryIO
from xml.etree import ElementTree as ET
from bs4 import BeautifulSoup, Tag
from .math.omml import OMML_NS, oMath2Latex
MATH_ROOT_TEMPLATE = "".join(
(
"<w:document ",
'xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" ',
'xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" ',
'xmlns:o="urn:schemas-microsoft-com:office:office" ',
'xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" ',
'xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" ',
'xmlns:v="urn:schemas-microsoft-com:vml" ',
'xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" ',
'xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" ',
'xmlns:w10="urn:schemas-microsoft-com:office:word" ',
'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" ',
'xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" ',
'xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" ',
'xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" ',
'xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" ',
'xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 wp14">',
"{0}</w:document>",
)
)
def _convert_omath_to_latex(tag: Tag) -> str:
"""
Converts an OMML (Office Math Markup Language) tag to LaTeX format.
Args:
tag (Tag): A BeautifulSoup Tag object representing the OMML element.
Returns:
str: The LaTeX representation of the OMML element.
"""
# Format the tag into a complete XML document string
math_root = ET.fromstring(MATH_ROOT_TEMPLATE.format(str(tag)))
# Find the 'oMath' element within the XML document
math_element = math_root.find(OMML_NS + "oMath")
# Convert the 'oMath' element to LaTeX using the oMath2Latex function
latex = oMath2Latex(math_element).latex
return latex
def _get_omath_tag_replacement(tag: Tag, block: bool = False) -> Tag:
"""
Creates a replacement tag for an OMML (Office Math Markup Language) element.
Args:
tag (Tag): A BeautifulSoup Tag object representing the "oMath" element.
block (bool, optional): If True, the LaTeX will be wrapped in double dollar signs for block mode. Defaults to False.
Returns:
Tag: A BeautifulSoup Tag object representing the replacement element.
"""
t_tag = Tag(name="w:t")
t_tag.string = (
f"$${_convert_omath_to_latex(tag)}$$"
if block
else f"${_convert_omath_to_latex(tag)}$"
)
r_tag = Tag(name="w:r")
r_tag.append(t_tag)
return r_tag
def _replace_equations(tag: Tag):
"""
Replaces OMML (Office Math Markup Language) elements with their LaTeX equivalents.
Args:
tag (Tag): A BeautifulSoup Tag object representing the OMML element. Could be either "oMathPara" or "oMath".
Raises:
ValueError: If the tag is not supported.
"""
if tag.name == "oMathPara":
# Create a new paragraph tag
p_tag = Tag(name="w:p")
# Replace each 'oMath' child tag with its LaTeX equivalent as block equations
for child_tag in tag.find_all("oMath"):
p_tag.append(_get_omath_tag_replacement(child_tag, block=True))
# Replace the original 'oMathPara' tag with the new paragraph tag
tag.replace_with(p_tag)
elif tag.name == "oMath":
# Replace the 'oMath' tag with its LaTeX equivalent as inline equation
tag.replace_with(_get_omath_tag_replacement(tag, block=False))
else:
raise ValueError(f"Not supported tag: {tag.name}")
def _pre_process_math(content: bytes) -> bytes:
"""
Pre-processes the math content in a DOCX -> XML file by converting OMML (Office Math Markup Language) elements to LaTeX.
This preprocessed content can be directly replaced in the DOCX file -> XMLs.
Args:
content (bytes): The XML content of the DOCX file as bytes.
Returns:
bytes: The processed content with OMML elements replaced by their LaTeX equivalents, encoded as bytes.
"""
soup = BeautifulSoup(content.decode(), features="xml")
for tag in soup.find_all("oMathPara"):
_replace_equations(tag)
for tag in soup.find_all("oMath"):
_replace_equations(tag)
return str(soup).encode()
def pre_process_docx(input_docx: BinaryIO) -> BinaryIO:
"""
Pre-processes a DOCX file with provided steps.
The process works by unzipping the DOCX file in memory, transforming specific XML files
(such as converting OMML elements to LaTeX), and then zipping everything back into a
DOCX file without writing to disk.
Args:
input_docx (BinaryIO): A binary input stream representing the DOCX file.
Returns:
BinaryIO: A binary output stream representing the processed DOCX file.
"""
output_docx = BytesIO()
# The files that need to be pre-processed from .docx
pre_process_enable_files = [
"word/document.xml",
"word/footnotes.xml",
"word/endnotes.xml",
]
with zipfile.ZipFile(input_docx, mode="r") as zip_input:
files = {name: zip_input.read(name) for name in zip_input.namelist()}
with zipfile.ZipFile(output_docx, mode="w") as zip_output:
zip_output.comment = zip_input.comment
for name, content in files.items():
if name in pre_process_enable_files:
try:
# Pre-process the content
updated_content = _pre_process_math(content)
# In the future, if there are more pre-processing steps, they can be added here
zip_output.writestr(name, updated_content)
except Exception:
# If there is an error in processing the content, write the original content
zip_output.writestr(name, content)
else:
zip_output.writestr(name, content)
output_docx.seek(0)
return output_docx
================================================
FILE: packages/markitdown/src/markitdown/converters/__init__.py
================================================
# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>
#
# SPDX-License-Identifier: MIT
from ._plain_text_converter import PlainTextConverter
from ._html_converter import HtmlConverter
from ._rss_converter import RssConverter
from ._wikipedia_converter import WikipediaConverter
from ._youtube_converter import YouTubeConverter
from ._ipynb_converter import IpynbConverter
from ._bing_serp_converter import BingSerpConverter
from ._pdf_converter import PdfConverter
from ._docx_converter import DocxConverter
from ._xlsx_converter import XlsxConverter, XlsConverter
from ._pptx_converter import PptxConverter
from ._image_converter import ImageConverter
from ._audio_converter import AudioConverter
from ._outlook_msg_converter import OutlookMsgConverter
from ._zip_converter import ZipConverter
from ._doc_intel_converter import (
DocumentIntelligenceConverter,
DocumentIntelligenceFileType,
)
from ._epub_converter import EpubConverter
from ._csv_converter import CsvConverter
__all__ = [
"PlainTextConverter",
"HtmlConverter",
"RssConverter",
"WikipediaConverter",
"YouTubeConverter",
"IpynbConverter",
"BingSerpConverter",
"PdfConverter",
"DocxConverter",
"XlsxConverter",
"XlsConverter",
"PptxConverter",
"ImageConverter",
"AudioConverter",
"OutlookMsgConverter",
"ZipConverter",
"DocumentIntelligenceConverter",
"DocumentIntelligenceFileType",
"EpubConverter",
"CsvConverter",
]
================================================
FILE: packages/markitdown/src/markitdown/converters/_audio_converter.py
================================================
from typing import Any, BinaryIO
from ._exiftool import exiftool_metadata
from ._transcribe_audio import transcribe_audio
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
from .._exceptions import MissingDependencyException
ACCEPTED_MIME_TYPE_PREFIXES = [
"audio/x-wav",
"audio/mpeg",
"video/mp4",
]
ACCEPTED_FILE_EXTENSIONS = [
".wav",
".mp3",
".m4a",
".mp4",
]
class AudioConverter(DocumentConverter):
"""
Converts audio files to markdown via extraction of metadata (if `exiftool` is installed), and speech transcription (if `speech_recognition` is installed).
"""
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
md_content = ""
# Add metadata
metadata = exiftool_metadata(
file_stream, exiftool_path=kwargs.get("exiftool_path")
)
if metadata:
for f in [
"Title",
"Artist",
"Author",
"Band",
"Album",
"Genre",
"Track",
"DateTimeOriginal",
"CreateDate",
# "Duration", -- Wrong values when read from memory
"NumChannels",
"SampleRate",
"AvgBytesPerSec",
"BitsPerSample",
]:
if f in metadata:
md_content += f"{f}: {metadata[f]}\n"
# Figure out the audio format for transcription
if stream_info.extension == ".wav" or stream_info.mimetype == "audio/x-wav":
audio_format = "wav"
elif stream_info.extension == ".mp3" or stream_info.mimetype == "audio/mpeg":
audio_format = "mp3"
elif (
stream_info.extension in [".mp4", ".m4a"]
or stream_info.mimetype == "video/mp4"
):
audio_format = "mp4"
else:
audio_format = None
# Transcribe
if audio_format:
try:
transcript = transcribe_audio(file_stream, audio_format=audio_format)
if transcript:
md_content += "\n\n### Audio Transcript:\n" + transcript
except MissingDependencyException:
pass
# Return the result
return DocumentConverterResult(markdown=md_content.strip())
================================================
FILE: packages/markitdown/src/markitdown/converters/_bing_serp_converter.py
================================================
import re
import base64
import binascii
from urllib.parse import parse_qs, urlparse
from typing import Any, BinaryIO
from bs4 import BeautifulSoup
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
from ._markdownify import _CustomMarkdownify
ACCEPTED_MIME_TYPE_PREFIXES = [
"text/html",
"application/xhtml",
]
ACCEPTED_FILE_EXTENSIONS = [
".html",
".htm",
]
class BingSerpConverter(DocumentConverter):
"""
Handle Bing results pages (only the organic search results).
NOTE: It is better to use the Bing API
"""
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
"""
Make sure we're dealing with HTML content *from* Bing.
"""
url = stream_info.url or ""
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if not re.search(r"^https://www\.bing\.com/search\?q=", url):
# Not a Bing SERP URL
return False
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
# Not HTML content
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
assert stream_info.url is not None
# Parse the query parameters
parsed_params = parse_qs(urlparse(stream_info.url).query)
query = parsed_params.get("q", [""])[0]
# Parse the stream
encoding = "utf-8" if stream_info.charset is None else stream_info.charset
soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
# Clean up some formatting
for tptt in soup.find_all(class_="tptt"):
if hasattr(tptt, "string") and tptt.string:
tptt.string += " "
for slug in soup.find_all(class_="algoSlug_icon"):
slug.extract()
# Parse the algorithmic results
_markdownify = _CustomMarkdownify(**kwargs)
results = list()
for result in soup.find_all(class_="b_algo"):
if not hasattr(result, "find_all"):
continue
# Rewrite redirect urls
for a in result.find_all("a", href=True):
parsed_href = urlparse(a["href"])
qs = parse_qs(parsed_href.query)
# The destination is contained in the u parameter,
# but appears to be base64 encoded, with some prefix
if "u" in qs:
u = (
qs["u"][0][2:].strip() + "=="
) # Python 3 doesn't care about extra padding
try:
# RFC 4648 / Base64URL" variant, which uses "-" and "_"
a["href"] = base64.b64decode(u, altchars="-_").decode("utf-8")
except UnicodeDecodeError:
pass
except binascii.Error:
pass
# Convert to markdown
md_result = _markdownify.convert_soup(result).strip()
lines = [line.strip() for line in re.split(r"\n+", md_result)]
results.append("\n".join([line for line in lines if len(line) > 0]))
webpage_text = (
f"## A Bing search for '{query}' found the following results:\n\n"
+ "\n\n".join(results)
)
return DocumentConverterResult(
markdown=webpage_text,
title=None if soup.title is None else soup.title.string,
)
================================================
FILE: packages/markitdown/src/markitdown/converters/_csv_converter.py
================================================
import csv
import io
from typing import BinaryIO, Any
from charset_normalizer import from_bytes
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
ACCEPTED_MIME_TYPE_PREFIXES = [
"text/csv",
"application/csv",
]
ACCEPTED_FILE_EXTENSIONS = [".csv"]
class CsvConverter(DocumentConverter):
"""
Converts CSV files to Markdown tables.
"""
def __init__(self):
super().__init__()
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
# Read the file content
if stream_info.charset:
content = file_stream.read().decode(stream_info.charset)
else:
content = str(from_bytes(file_stream.read()).best())
# Parse CSV content
reader = csv.reader(io.StringIO(content))
rows = list(reader)
if not rows:
return DocumentConverterResult(markdown="")
# Create markdown table
markdown_table = []
# Add header row
markdown_table.append("| " + " | ".join(rows[0]) + " |")
# Add separator row
markdown_table.append("| " + " | ".join(["---"] * len(rows[0])) + " |")
# Add data rows
for row in rows[1:]:
# Make sure row has the same number of columns as header
while len(row) < len(rows[0]):
row.append("")
# Truncate if row has more columns than header
row = row[: len(rows[0])]
markdown_table.append("| " + " | ".join(row) + " |")
result = "\n".join(markdown_table)
return DocumentConverterResult(markdown=result)
================================================
FILE: packages/markitdown/src/markitdown/converters/_doc_intel_converter.py
================================================
import sys
import re
import os
from typing import BinaryIO, Any, List
from enum import Enum
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
from .._exceptions import MissingDependencyException
# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import (
AnalyzeDocumentRequest,
AnalyzeResult,
DocumentAnalysisFeature,
)
from azure.core.credentials import AzureKeyCredential, TokenCredential
from azure.identity import DefaultAzureCredential
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()
# Define these types for type hinting when the package is not available
class AzureKeyCredential:
pass
class TokenCredential:
pass
class DocumentIntelligenceClient:
pass
class AnalyzeDocumentRequest:
pass
class AnalyzeResult:
pass
class DocumentAnalysisFeature:
pass
class DefaultAzureCredential:
pass
# TODO: currently, there is a bug in the document intelligence SDK with importing the "ContentFormat" enum.
# This constant is a temporary fix until the bug is resolved.
CONTENT_FORMAT = "markdown"
class DocumentIntelligenceFileType(str, Enum):
"""Enum of file types supported by the Document Intelligence Converter."""
# No OCR
DOCX = "docx"
PPTX = "pptx"
XLSX = "xlsx"
HTML = "html"
# OCR
PDF = "pdf"
JPEG = "jpeg"
PNG = "png"
BMP = "bmp"
TIFF = "tiff"
def _get_mime_type_prefixes(types: List[DocumentIntelligenceFileType]) -> List[str]:
"""Get the MIME type prefixes for the given file types."""
prefixes: List[str] = []
for type_ in types:
if type_ == DocumentIntelligenceFileType.DOCX:
prefixes.append(
"application/vnd.openxmlformats-officedocument.wordprocessingml.document"
)
elif type_ == DocumentIntelligenceFileType.PPTX:
prefixes.append(
"application/vnd.openxmlformats-officedocument.presentationml"
)
elif type_ == DocumentIntelligenceFileType.XLSX:
prefixes.append(
"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
)
elif type_ == DocumentIntelligenceFileType.HTML:
prefixes.append("text/html")
prefixes.append("application/xhtml+xml")
elif type_ == DocumentIntelligenceFileType.PDF:
prefixes.append("application/pdf")
prefixes.append("application/x-pdf")
elif type_ == DocumentIntelligenceFileType.JPEG:
prefixes.append("image/jpeg")
elif type_ == DocumentIntelligenceFileType.PNG:
prefixes.append("image/png")
elif type_ == DocumentIntelligenceFileType.BMP:
prefixes.append("image/bmp")
elif type_ == DocumentIntelligenceFileType.TIFF:
prefixes.append("image/tiff")
return prefixes
def _get_file_extensions(types: List[DocumentIntelligenceFileType]) -> List[str]:
"""Get the file extensions for the given file types."""
extensions: List[str] = []
for type_ in types:
if type_ == DocumentIntelligenceFileType.DOCX:
extensions.append(".docx")
elif type_ == DocumentIntelligenceFileType.PPTX:
extensions.append(".pptx")
elif type_ == DocumentIntelligenceFileType.XLSX:
extensions.append(".xlsx")
elif type_ == DocumentIntelligenceFileType.PDF:
extensions.append(".pdf")
elif type_ == DocumentIntelligenceFileType.JPEG:
extensions.append(".jpg")
extensions.append(".jpeg")
elif type_ == DocumentIntelligenceFileType.PNG:
extensions.append(".png")
elif type_ == DocumentIntelligenceFileType.BMP:
extensions.append(".bmp")
elif type_ == DocumentIntelligenceFileType.TIFF:
extensions.append(".tiff")
elif type_ == DocumentIntelligenceFileType.HTML:
extensions.append(".html")
return extensions
class DocumentIntelligenceConverter(DocumentConverter):
"""Specialized DocumentConverter that uses Document Intelligence to extract text from documents."""
def __init__(
self,
*,
endpoint: str,
api_version: str = "2024-07-31-preview",
credential: AzureKeyCredential | TokenCredential | None = None,
file_types: List[DocumentIntelligenceFileType] = [
DocumentIntelligenceFileType.DOCX,
DocumentIntelligenceFileType.PPTX,
DocumentIntelligenceFileType.XLSX,
DocumentIntelligenceFileType.PDF,
DocumentIntelligenceFileType.JPEG,
DocumentIntelligenceFileType.PNG,
DocumentIntelligenceFileType.BMP,
DocumentIntelligenceFileType.TIFF,
],
):
"""
Initialize the DocumentIntelligenceConverter.
Args:
endpoint (str): The endpoint for the Document Intelligence service.
api_version (str): The API version to use. Defaults to "2024-07-31-preview".
credential (AzureKeyCredential | TokenCredential | None): The credential to use for authentication.
file_types (List[DocumentIntelligenceFileType]): The file types to accept. Defaults to all supported file types.
"""
super().__init__()
self._file_types = file_types
# Raise an error if the dependencies are not available.
# This is different than other converters since this one isn't even instantiated
# unless explicitly requested.
if _dependency_exc_info is not None:
raise MissingDependencyException(
"DocumentIntelligenceConverter requires the optional dependency [az-doc-intel] (or [all]) to be installed. E.g., `pip install markitdown[az-doc-intel]`"
) from _dependency_exc_info[
1
].with_traceback( # type: ignore[union-attr]
_dependency_exc_info[2]
)
if credential is None:
if os.environ.get("AZURE_API_KEY") is None:
credential = DefaultAzureCredential()
else:
credential = AzureKeyCredential(os.environ["AZURE_API_KEY"])
self.endpoint = endpoint
self.api_version = api_version
self.doc_intel_client = DocumentIntelligenceClient(
endpoint=self.endpoint,
api_version=self.api_version,
credential=credential,
)
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in _get_file_extensions(self._file_types):
return True
for prefix in _get_mime_type_prefixes(self._file_types):
if mimetype.startswith(prefix):
return True
return False
def _analysis_features(self, stream_info: StreamInfo) -> List[str]:
"""
Helper needed to determine which analysis features to use.
Certain document analysis features are not availiable for
office filetypes (.xlsx, .pptx, .html, .docx)
"""
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
# Types that don't support ocr
no_ocr_types = [
DocumentIntelligenceFileType.DOCX,
DocumentIntelligenceFileType.PPTX,
DocumentIntelligenceFileType.XLSX,
DocumentIntelligenceFileType.HTML,
]
if extension in _get_file_extensions(no_ocr_types):
return []
for prefix in _get_mime_type_prefixes(no_ocr_types):
if mimetype.startswith(prefix):
return []
return [
DocumentAnalysisFeature.FORMULAS, # enable formula extraction
DocumentAnalysisFeature.OCR_HIGH_RESOLUTION, # enable high resolution OCR
DocumentAnalysisFeature.STYLE_FONT, # enable font style extraction
]
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
# Extract the text using Azure Document Intelligence
poller = self.doc_intel_client.begin_analyze_document(
model_id="prebuilt-layout",
body=AnalyzeDocumentRequest(bytes_source=file_stream.read()),
features=self._analysis_features(stream_info),
output_content_format=CONTENT_FORMAT, # TODO: replace with "ContentFormat.MARKDOWN" when the bug is fixed
)
result: AnalyzeResult = poller.result()
# remove comments from the markdown content generated by Doc Intelligence and append to markdown string
markdown_text = re.sub(r"<!--.*?-->", "", result.content, flags=re.DOTALL)
return DocumentConverterResult(markdown=markdown_text)
================================================
FILE: packages/markitdown/src/markitdown/converters/_docx_converter.py
================================================
import sys
import io
from warnings import warn
from typing import BinaryIO, Any
from ._html_converter import HtmlConverter
from ..converter_utils.docx.pre_process import pre_process_docx
from .._base_converter import DocumentConverterResult
from .._stream_info import StreamInfo
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
import mammoth
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()
ACCEPTED_MIME_TYPE_PREFIXES = [
"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
]
ACCEPTED_FILE_EXTENSIONS = [".docx"]
class DocxConverter(HtmlConverter):
"""
Converts DOCX files to Markdown. Style information (e.g.m headings) and tables are preserved where possible.
"""
def __init__(self):
super().__init__()
self._html_converter = HtmlConverter()
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
# Check: the dependencies
if _dependency_exc_info is not None:
raise MissingDependencyException(
MISSING_DEPENDENCY_MESSAGE.format(
converter=type(self).__name__,
extension=".docx",
feature="docx",
)
) from _dependency_exc_info[
1
].with_traceback( # type: ignore[union-attr]
_dependency_exc_info[2]
)
style_map = kwargs.get("style_map", None)
pre_process_stream = pre_process_docx(file_stream)
return self._html_converter.convert_string(
mammoth.convert_to_html(pre_process_stream, style_map=style_map).value,
**kwargs,
)
================================================
FILE: packages/markitdown/src/markitdown/converters/_epub_converter.py
================================================
import os
import zipfile
from defusedxml import minidom
from xml.dom.minidom import Document
from typing import BinaryIO, Any, Dict, List
from ._html_converter import HtmlConverter
from .._base_converter import DocumentConverterResult
from .._stream_info import StreamInfo
ACCEPTED_MIME_TYPE_PREFIXES = [
"application/epub",
"application/epub+zip",
"application/x-epub+zip",
]
ACCEPTED_FILE_EXTENSIONS = [".epub"]
MIME_TYPE_MAPPING = {
".html": "text/html",
".xhtml": "application/xhtml+xml",
}
class EpubConverter(HtmlConverter):
"""
Converts EPUB files to Markdown. Style information (e.g.m headings) and tables are preserved where possible.
"""
def __init__(self):
super().__init__()
self._html_converter = HtmlConverter()
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
with zipfile.ZipFile(file_stream, "r") as z:
# Extracts metadata (title, authors, language, publisher, date, description, cover) from an EPUB file."""
# Locate content.opf
container_dom = minidom.parse(z.open("META-INF/container.xml"))
opf_path = container_dom.getElementsByTagName("rootfile")[0].getAttribute(
"full-path"
)
# Parse content.opf
opf_dom = minidom.parse(z.open(opf_path))
metadata: Dict[str, Any] = {
"title": self._get_text_from_node(opf_dom, "dc:title"),
"authors": self._get_all_texts_from_nodes(opf_dom, "dc:creator"),
"language": self._get_text_from_node(opf_dom, "dc:language"),
"publisher": self._get_text_from_node(opf_dom, "dc:publisher"),
"date": self._get_text_from_node(opf_dom, "dc:date"),
"description": self._get_text_from_node(opf_dom, "dc:description"),
"identifier": self._get_text_from_node(opf_dom, "dc:identifier"),
}
# Extract manifest items (ID → href mapping)
manifest = {
item.getAttribute("id"): item.getAttribute("href")
for item in opf_dom.getElementsByTagName("item")
}
# Extract spine order (ID refs)
spine_items = opf_dom.getElementsByTagName("itemref")
spine_order = [item.getAttribute("idref") for item in spine_items]
# Convert spine order to actual file paths
base_path = "/".join(
opf_path.split("/")[:-1]
) # Get base directory of content.opf
spine = [
f"{base_path}/{manifest[item_id]}" if base_path else manifest[item_id]
for item_id in spine_order
if item_id in manifest
]
# Extract and convert the content
markdown_content: List[str] = []
for file in spine:
if file in z.namelist():
with z.open(file) as f:
filename = os.path.basename(file)
extension = os.path.splitext(filename)[1].lower()
mimetype = MIME_TYPE_MAPPING.get(extension)
converted_content = self._html_converter.convert(
f,
StreamInfo(
mimetype=mimetype,
extension=extension,
filename=filename,
),
)
markdown_content.append(converted_content.markdown.strip())
# Format and add the metadata
metadata_markdown = []
for key, value in metadata.items():
if isinstance(value, list):
value = ", ".join(value)
if value:
metadata_markdown.append(f"**{key.capitalize()}:** {value}")
markdown_content.insert(0, "\n".join(metadata_markdown))
return DocumentConverterResult(
markdown="\n\n".join(markdown_content), title=metadata["title"]
)
def _get_text_from_node(self, dom: Document, tag_name: str) -> str | None:
"""Convenience function to extract a single occurrence of a tag (e.g., title)."""
texts = self._get_all_texts_from_nodes(dom, tag_name)
if len(texts) > 0:
return texts[0]
else:
return None
def _get_all_texts_from_nodes(self, dom: Document, tag_name: str) -> List[str]:
"""Helper function to extract all occurrences of a tag (e.g., multiple authors)."""
texts: List[str] = []
for node in dom.getElementsByTagName(tag_name):
if node.firstChild and hasattr(node.firstChild, "nodeValue"):
texts.append(node.firstChild.nodeValue.strip())
return texts
================================================
FILE: packages/markitdown/src/markitdown/converters/_exiftool.py
================================================
import json
import locale
import subprocess
from typing import Any, BinaryIO, Union
def _parse_version(version: str) -> tuple:
return tuple(map(int, (version.split("."))))
def exiftool_metadata(
file_stream: BinaryIO,
*,
exiftool_path: Union[str, None],
) -> Any: # Need a better type for json data
# Nothing to do
if not exiftool_path:
return {}
# Verify exiftool version
try:
version_output = subprocess.run(
[exiftool_path, "-ver"],
capture_output=True,
text=True,
check=True,
).stdout.strip()
version = _parse_version(version_output)
min_version = (12, 24)
if version < min_version:
raise RuntimeError(
f"ExifTool version {version_output} is vulnerable to CVE-2021-22204. "
"Please upgrade to version 12.24 or later."
)
except (subprocess.CalledProcessError, ValueError) as e:
raise RuntimeError("Failed to verify ExifTool version.") from e
# Run exiftool
cur_pos = file_stream.tell()
try:
output = subprocess.run(
[exiftool_path, "-json", "-"],
input=file_stream.read(),
capture_output=True,
text=False,
).stdout
return json.loads(
output.decode(locale.getpreferredencoding(False)),
)[0]
finally:
file_stream.seek(cur_pos)
================================================
FILE: packages/markitdown/src/markitdown/converters/_html_converter.py
================================================
import io
from typing import Any, BinaryIO, Optional
from bs4 import BeautifulSoup
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
from ._markdownify import _CustomMarkdownify
ACCEPTED_MIME_TYPE_PREFIXES = [
"text/html",
"application/xhtml",
]
ACCEPTED_FILE_EXTENSIONS = [
".html",
".htm",
]
class HtmlConverter(DocumentConverter):
"""Anything with content type text/html"""
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
# Parse the stream
encoding = "utf-8" if stream_info.charset is None else stream_info.charset
soup = BeautifulSoup(file_stream, "html.parser", from_encoding=encoding)
# Remove javascript and style blocks
for script in soup(["script", "style"]):
script.extract()
# Print only the main content
body_elm = soup.find("body")
webpage_text = ""
if body_elm:
webpage_text = _CustomMarkdownify(**kwargs).convert_soup(body_elm)
else:
webpage_text = _CustomMarkdownify(**kwargs).convert_soup(soup)
assert isinstance(webpage_text, str)
# remove leading and trailing \n
webpage_text = webpage_text.strip()
return DocumentConverterResult(
markdown=webpage_text,
title=None if soup.title is None else soup.title.string,
)
def convert_string(
self, html_content: str, *, url: Optional[str] = None, **kwargs
) -> DocumentConverterResult:
"""
Non-standard convenience method to convert a string to markdown.
Given that many converters produce HTML as intermediate output, this
allows for easy conversion of HTML to markdown.
"""
return self.convert(
file_stream=io.BytesIO(html_content.encode("utf-8")),
stream_info=StreamInfo(
mimetype="text/html",
extension=".html",
charset="utf-8",
url=url,
),
**kwargs,
)
================================================
FILE: packages/markitdown/src/markitdown/converters/_image_converter.py
================================================
from typing import BinaryIO, Any, Union
import base64
import mimetypes
from ._exiftool import exiftool_metadata
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
ACCEPTED_MIME_TYPE_PREFIXES = [
"image/jpeg",
"image/png",
]
ACCEPTED_FILE_EXTENSIONS = [".jpg", ".jpeg", ".png"]
class ImageConverter(DocumentConverter):
"""
Converts images to markdown via extraction of metadata (if `exiftool` is installed), and description via a multimodal LLM (if an llm_client is configured).
"""
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any,
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
md_content = ""
# Add metadata
metadata = exiftool_metadata(
file_stream, exiftool_path=kwargs.get("exiftool_path")
)
if metadata:
for f in [
"ImageSize",
"Title",
"Caption",
"Description",
"Keywords",
"Artist",
"Author",
"DateTimeOriginal",
"CreateDate",
"GPSPosition",
]:
if f in metadata:
md_content += f"{f}: {metadata[f]}\n"
# Try describing the image with GPT
llm_client = kwargs.get("llm_client")
llm_model = kwargs.get("llm_model")
if llm_client is not None and llm_model is not None:
llm_description = self._get_llm_description(
file_stream,
stream_info,
client=llm_client,
model=llm_model,
prompt=kwargs.get("llm_prompt"),
)
if llm_description is not None:
md_content += "\n# Description:\n" + llm_description.strip() + "\n"
return DocumentConverterResult(
markdown=md_content,
)
def _get_llm_description(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
*,
client,
model,
prompt=None,
) -> Union[None, str]:
if prompt is None or prompt.strip() == "":
prompt = "Write a detailed caption for this image."
# Get the content type
content_type = stream_info.mimetype
if not content_type:
content_type, _ = mimetypes.guess_type(
"_dummy" + (stream_info.extension or "")
)
if not content_type:
content_type = "application/octet-stream"
# Convert to base64
cur_pos = file_stream.tell()
try:
base64_image = base64.b64encode(file_stream.read()).decode("utf-8")
except Exception as e:
return None
finally:
file_stream.seek(cur_pos)
# Prepare the data-uri
data_uri = f"data:{content_type};base64,{base64_image}"
# Prepare the OpenAI API request
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{
"type": "image_url",
"image_url": {
"url": data_uri,
},
},
],
}
]
# Call the OpenAI API
response = client.chat.completions.create(model=model, messages=messages)
return response.choices[0].message.content
================================================
FILE: packages/markitdown/src/markitdown/converters/_ipynb_converter.py
================================================
from typing import BinaryIO, Any
import json
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._exceptions import FileConversionException
from .._stream_info import StreamInfo
CANDIDATE_MIME_TYPE_PREFIXES = [
"application/json",
]
ACCEPTED_FILE_EXTENSIONS = [".ipynb"]
class IpynbConverter(DocumentConverter):
"""Converts Jupyter Notebook (.ipynb) files to Markdown."""
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in CANDIDATE_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
# Read further to see if it's a notebook
cur_pos = file_stream.tell()
try:
encoding = stream_info.charset or "utf-8"
notebook_content = file_stream.read().decode(encoding)
return (
"nbformat" in notebook_content
and "nbformat_minor" in notebook_content
)
finally:
file_stream.seek(cur_pos)
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
# Parse and convert the notebook
encoding = stream_info.charset or "utf-8"
notebook_content = file_stream.read().decode(encoding=encoding)
return self._convert(json.loads(notebook_content))
def _convert(self, notebook_content: dict) -> DocumentConverterResult:
"""Helper function that converts notebook JSON content to Markdown."""
try:
md_output = []
title = None
for cell in notebook_content.get("cells", []):
cell_type = cell.get("cell_type", "")
source_lines = cell.get("source", [])
if cell_type == "markdown":
md_output.append("".join(source_lines))
# Extract the first # heading as title if not already found
if title is None:
for line in source_lines:
if line.startswith("# "):
title = line.lstrip("# ").strip()
break
elif cell_type == "code":
# Code cells are wrapped in Markdown code blocks
md_output.append(f"```python\n{''.join(source_lines)}\n```")
elif cell_type == "raw":
md_output.append(f"```\n{''.join(source_lines)}\n```")
md_text = "\n\n".join(md_output)
# Check for title in notebook metadata
title = notebook_content.get("metadata", {}).get("title", title)
return DocumentConverterResult(
markdown=md_text,
title=title,
)
except Exception as e:
raise FileConversionException(
f"Error converting .ipynb file: {str(e)}"
) from e
================================================
FILE: packages/markitdown/src/markitdown/converters/_llm_caption.py
================================================
from typing import BinaryIO, Union
import base64
import mimetypes
from .._stream_info import StreamInfo
def llm_caption(
file_stream: BinaryIO, stream_info: StreamInfo, *, client, model, prompt=None
) -> Union[None, str]:
if prompt is None or prompt.strip() == "":
prompt = "Write a detailed caption for this image."
# Get the content type
content_type = stream_info.mimetype
if not content_type:
content_type, _ = mimetypes.guess_type("_dummy" + (stream_info.extension or ""))
if not content_type:
content_type = "application/octet-stream"
# Convert to base64
cur_pos = file_stream.tell()
try:
base64_image = base64.b64encode(file_stream.read()).decode("utf-8")
except Exception as e:
return None
finally:
file_stream.seek(cur_pos)
# Prepare the data-uri
data_uri = f"data:{content_type};base64,{base64_image}"
# Prepare the OpenAI API request
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": prompt},
{
"type": "image_url",
"image_url": {
"url": data_uri,
},
},
],
}
]
# Call the OpenAI API
response = client.chat.completions.create(model=model, messages=messages)
return response.choices[0].message.content
================================================
FILE: packages/markitdown/src/markitdown/converters/_markdownify.py
================================================
import re
import markdownify
from typing import Any, Optional
from urllib.parse import quote, unquote, urlparse, urlunparse
class _CustomMarkdownify(markdownify.MarkdownConverter):
"""
A custom version of markdownify's MarkdownConverter. Changes include:
- Altering the default heading style to use '#', '##', etc.
- Removing javascript hyperlinks.
- Truncating images with large data:uri sources.
- Ensuring URIs are properly escaped, and do not conflict with Markdown syntax
"""
def __init__(self, **options: Any):
options["heading_style"] = options.get("heading_style", markdownify.ATX)
options["keep_data_uris"] = options.get("keep_data_uris", False)
# Explicitly cast options to the expected type if necessary
super().__init__(**options)
def convert_hn(
self,
n: int,
el: Any,
text: str,
convert_as_inline: Optional[bool] = False,
**kwargs,
) -> str:
"""Same as usual, but be sure to start with a new line"""
if not convert_as_inline:
if not re.search(r"^\n", text):
return "\n" + super().convert_hn(n, el, text, convert_as_inline) # type: ignore
return super().convert_hn(n, el, text, convert_as_inline) # type: ignore
def convert_a(
self,
el: Any,
text: str,
convert_as_inline: Optional[bool] = False,
**kwargs,
):
"""Same as usual converter, but removes Javascript links and escapes URIs."""
prefix, suffix, text = markdownify.chomp(text) # type: ignore
if not text:
return ""
if el.find_parent("pre") is not None:
return text
href = el.get("href")
title = el.get("title")
# Escape URIs and skip non-http or file schemes
if href:
try:
parsed_url = urlparse(href) # type: ignore
if parsed_url.scheme and parsed_url.scheme.lower() not in ["http", "https", "file"]: # type: ignore
return "%s%s%s" % (prefix, text, suffix)
href = urlunparse(parsed_url._replace(path=quote(unquote(parsed_url.path)))) # type: ignore
except ValueError: # It's not clear if this ever gets thrown
return "%s%s%s" % (prefix, text, suffix)
# For the replacement see #29: text nodes underscores are escaped
if (
self.options["autolinks"]
and text.replace(r"\_", "_") == href
and not title
and not self.options["default_title"]
):
# Shortcut syntax
return "<%s>" % href
if self.options["default_title"] and not title:
title = href
title_part = ' "%s"' % title.replace('"', r"\"") if title else ""
return (
"%s[%s](%s%s)%s" % (prefix, text, href, title_part, suffix)
if href
else text
)
def convert_img(
self,
el: Any,
text: str,
convert_as_inline: Optional[bool] = False,
**kwargs,
) -> str:
"""Same as usual converter, but removes data URIs"""
alt = el.attrs.get("alt", None) or ""
src = el.attrs.get("src", None) or el.attrs.get("data-src", None) or ""
title = el.attrs.get("title", None) or ""
title_part = ' "%s"' % title.replace('"', r"\"") if title else ""
# Remove all line breaks from alt
alt = alt.replace("\n", " ")
if (
convert_as_inline
and el.parent.name not in self.options["keep_inline_images_in"]
):
return alt
# Remove dataURIs
if src.startswith("data:") and not self.options["keep_data_uris"]:
src = src.split(",")[0] + "..."
return "" % (alt, src, title_part)
def convert_input(
self,
el: Any,
text: str,
convert_as_inline: Optional[bool] = False,
**kwargs,
) -> str:
"""Convert checkboxes to Markdown [x]/[ ] syntax."""
if el.get("type") == "checkbox":
return "[x] " if el.has_attr("checked") else "[ ] "
return ""
def convert_soup(self, soup: Any) -> str:
return super().convert_soup(soup) # type: ignore
================================================
FILE: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py
================================================
import sys
from typing import Any, Union, BinaryIO
from .._stream_info import StreamInfo
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
olefile = None
try:
import olefile # type: ignore[no-redef]
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()
ACCEPTED_MIME_TYPE_PREFIXES = [
"application/vnd.ms-outlook",
]
ACCEPTED_FILE_EXTENSIONS = [".msg"]
class OutlookMsgConverter(DocumentConverter):
"""Converts Outlook .msg files to markdown by extracting email metadata and content.
Uses the olefile package to parse the .msg file structure and extract:
- Email headers (From, To, Subject)
- Email body content
"""
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
# Check the extension and mimetype
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
# Brute force, check if we have an OLE file
cur_pos = file_stream.tell()
try:
if olefile and not olefile.isOleFile(file_stream):
return False
finally:
file_stream.seek(cur_pos)
# Brue force, check if it's an Outlook file
try:
if olefile is not None:
msg = olefile.OleFileIO(file_stream)
toc = "\n".join([str(stream) for stream in msg.listdir()])
return (
"__properties_version1.0" in toc
and "__recip_version1.0_#00000000" in toc
)
except Exception as e:
pass
finally:
file_stream.seek(cur_pos)
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
# Check: the dependencies
if _dependency_exc_info is not None:
raise MissingDependencyException(
MISSING_DEPENDENCY_MESSAGE.format(
converter=type(self).__name__,
extension=".msg",
feature="outlook",
)
) from _dependency_exc_info[
1
].with_traceback( # type: ignore[union-attr]
_dependency_exc_info[2]
)
assert (
olefile is not None
) # If we made it this far, olefile should be available
msg = olefile.OleFileIO(file_stream)
# Extract email metadata
md_content = "# Email Message\n\n"
# Get headers
headers = {
"From": self._get_stream_data(msg, "__substg1.0_0C1F001F"),
"To": self._get_stream_data(msg, "__substg1.0_0E04001F"),
"Subject": self._get_stream_data(msg, "__substg1.0_0037001F"),
}
# Add headers to markdown
for key, value in headers.items():
if value:
md_content += f"**{key}:** {value}\n"
md_content += "\n## Content\n\n"
# Get email body
body = self._get_stream_data(msg, "__substg1.0_1000001F")
if body:
md_content += body
msg.close()
return DocumentConverterResult(
markdown=md_content.strip(),
title=headers.get("Subject"),
)
def _get_stream_data(self, msg: Any, stream_path: str) -> Union[str, None]:
"""Helper to safely extract and decode stream data from the MSG file."""
assert olefile is not None
assert isinstance(
msg, olefile.OleFileIO
) # Ensure msg is of the correct type (type hinting is not possible with the optional olefile package)
try:
if msg.exists(stream_path):
data = msg.openstream(stream_path).read()
# Try UTF-16 first (common for .msg files)
try:
return data.decode("utf-16-le").strip()
except UnicodeDecodeError:
# Fall back to UTF-8
try:
return data.decode("utf-8").strip()
except UnicodeDecodeError:
# Last resort - ignore errors
return data.decode("utf-8", errors="ignore").strip()
except Exception:
pass
return None
================================================
FILE: packages/markitdown/src/markitdown/converters/_pdf_converter.py
================================================
import sys
import io
import re
from typing import BinaryIO, Any
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
# Pattern for MasterFormat-style partial numbering (e.g., ".1", ".2", ".10")
PARTIAL_NUMBERING_PATTERN = re.compile(r"^\.\d+$")
def _merge_partial_numbering_lines(text: str) -> str:
"""
Post-process extracted text to merge MasterFormat-style partial numbering
with the following text line.
MasterFormat documents use partial numbering like:
.1 The intent of this Request for Proposal...
.2 Available information relative to...
Some PDF extractors split these into separate lines:
.1
The intent of this Request for Proposal...
This function merges them back together.
"""
lines = text.split("\n")
result_lines: list[str] = []
i = 0
while i < len(lines):
line = lines[i]
stripped = line.strip()
# Check if this line is ONLY a partial numbering
if PARTIAL_NUMBERING_PATTERN.match(stripped):
# Look for the next non-empty line to merge with
j = i + 1
while j < len(lines) and not lines[j].strip():
j += 1
if j < len(lines):
# Merge the partial numbering with the next line
next_line = lines[j].strip()
result_lines.append(f"{stripped} {next_line}")
i = j + 1 # Skip past the merged line
else:
# No next line to merge with, keep as is
result_lines.append(line)
i += 1
else:
result_lines.append(line)
i += 1
return "\n".join(result_lines)
# Load dependencies
_dependency_exc_info = None
try:
import pdfminer
import pdfminer.high_level
import pdfplumber
except ImportError:
_dependency_exc_info = sys.exc_info()
ACCEPTED_MIME_TYPE_PREFIXES = [
"application/pdf",
"application/x-pdf",
]
ACCEPTED_FILE_EXTENSIONS = [".pdf"]
def _to_markdown_table(table: list[list[str]], include_separator: bool = True) -> str:
"""Convert a 2D list (rows/columns) into a nicely aligned Markdown table.
Args:
table: 2D list of cell values
include_separator: If True, include header separator row (standard markdown).
If False, output simple pipe-separated rows.
"""
if not table:
return ""
# Normalize None → ""
table = [[cell if cell is not None else "" for cell in row] for row in table]
# Filter out empty rows
table = [row for row in table if any(cell.strip() for cell in row)]
if not table:
return ""
# Column widths
col_widths = [max(len(str(cell)) for cell in col) for col in zip(*table)]
def fmt_row(row: list[str]) -> str:
return (
"|"
+ "|".join(str(cell).ljust(width) for cell, width in zip(row, col_widths))
+ "|"
)
if include_separator:
header, *rows = table
md = [fmt_row(header)]
md.append("|" + "|".join("-" * w for w in col_widths) + "|")
for row in rows:
md.append(fmt_row(row))
else:
md = [fmt_row(row) for row in table]
return "\n".join(md)
def _extract_form_content_from_words(page: Any) -> str | None:
"""
Extract form-style content from a PDF page by analyzing word positions.
This handles borderless forms/tables where words are aligned in columns.
Returns markdown with proper table formatting:
- Tables have pipe-separated columns with header separator rows
- Non-table content is rendered as plain text
Returns None if the page doesn't appear to be a form-style document,
indicating that pdfminer should be used instead for better text spacing.
"""
words = page.extract_words(keep_blank_chars=True, x_tolerance=3, y_tolerance=3)
if not words:
return None
# Group words by their Y position (rows)
y_tolerance = 5
rows_by_y: dict[float, list[dict]] = {}
for word in words:
y_key = round(word["top"] / y_tolerance) * y_tolerance
if y_key not in rows_by_y:
rows_by_y[y_key] = []
rows_by_y[y_key].append(word)
# Sort rows by Y position
sorted_y_keys = sorted(rows_by_y.keys())
page_width = page.width if hasattr(page, "width") else 612
# First pass: analyze each row
row_info: list[dict] = []
for y_key in sorted_y_keys:
row_words = sorted(rows_by_y[y_key], key=lambda w: w["x0"])
if not row_words:
continue
first_x0 = row_words[0]["x0"]
last_x1 = row_words[-1]["x1"]
line_width = last_x1 - first_x0
combined_text = " ".join(w["text"] for w in row_words)
# Count distinct x-position groups (columns)
x_positions = [w["x0"] for w in row_words]
x_groups: list[float] = []
for x in sorted(x_positions):
if not x_groups or x - x_groups[-1] > 50:
x_groups.append(x)
# Determine row type
is_paragraph = line_width > page_width * 0.55 and len(combined_text) > 60
# Check for MasterFormat-style partial numbering (e.g., ".1", ".2")
# These should be treated as list items, not table rows
has_partial_numbering = False
if row_words:
first_word = row_words[0]["text"].strip()
if PARTIAL_NUMBERING_PATTERN.match(first_word):
has_partial_numbering = True
row_info.append(
{
"y_key": y_key,
"words": row_words,
"text": combined_text,
"x_groups": x_groups,
"is_paragraph": is_paragraph,
"num_columns": len(x_groups),
"has_partial_numbering": has_partial_numbering,
}
)
# Collect ALL x-positions from rows with 3+ columns (table-like rows)
# This gives us the global column structure
all_table_x_positions: list[float] = []
for info in row_info:
if info["num_columns"] >= 3 and not info["is_paragraph"]:
all_table_x_positions.extend(info["x_groups"])
if not all_table_x_positions:
return None
# Compute adaptive column clustering tolerance based on gap analysis
all_table_x_positions.sort()
# Calculate gaps between consecutive x-positions
gaps = []
for i in range(len(all_table_x_positions) - 1):
gap = all_table_x_positions[i + 1] - all_table_x_positions[i]
if gap > 5: # Only significant gaps
gaps.append(gap)
# Determine optimal tolerance using statistical analysis
if gaps and len(gaps) >= 3:
# Use 70th percentile of gaps as threshold (balances precision/recall)
sorted_gaps = sorted(gaps)
percentile_70_idx = int(len(sorted_gaps) * 0.70)
adaptive_tolerance = sorted_gaps[percentile_70_idx]
# Clamp tolerance to reasonable range [25, 50]
adaptive_tolerance = max(25, min(50, adaptive_tolerance))
else:
# Fallback to conservative value
adaptive_tolerance = 35
# Compute global column boundaries using adaptive tolerance
global_columns: list[float] = []
for x in all_table_x_positions:
if not global_columns or x - global_columns[-1] > adaptive_tolerance:
global_columns.append(x)
# Adaptive max column check based on page characteristics
# Calculate average column width
if len(global_columns) > 1:
content_width = global_columns[-1] - global_columns[0]
avg_col_width = content_width / len(global_columns)
# Forms with very narrow columns (< 30px) are likely dense text
if avg_col_width < 30:
return None
# Compute adaptive max based on columns per inch
# Typical forms have 3-8 columns per inch
columns_per_inch = len(global_columns) / (content_width / 72)
# If density is too high (> 10 cols/inch), likely not a form
if columns_per_inch > 10:
return None
# Adaptive max: allow more columns for wider pages
# Standard letter is 612pt wide, so scale accordingly
adaptive_max_columns = int(20 * (page_width / 612))
adaptive_max_columns = max(15, adaptive_max_columns) # At least 15
if len(global_columns) > adaptive_max_columns:
return None
else:
# Single column, not a form
return None
# Now classify each row as table row or not
# A row is a table row if it has words that align with 2+ of the global columns
for info in row_info:
if info["is_paragraph"]:
info["is_table_row"] = False
continue
# Rows with partial numbering (e.g., ".1", ".2") are list items, not table rows
if info["has_partial_numbering"]:
info["is_table_row"] = False
continue
# Count how many global columns this row's words align with
aligned_columns: set[int] = set()
for word in info["words"]:
word_x = word["x0"]
for col_idx, col_x in enumerate(global_columns):
if abs(word_x - col_x) < 40:
aligned_columns.add(col_idx)
break
# If row uses 2+ of the established columns, it's a table row
info["is_table_row"] = len(aligned_columns) >= 2
# Find table regions (consecutive table rows)
table_regions: list[tuple[int, int]] = [] # (start_idx, end_idx)
i = 0
while i < len(row_info):
if row_info[i]["is_table_row"]:
start_idx = i
while i < len(row_info) and row_info[i]["is_table_row"]:
i += 1
end_idx = i
table_regions.append((start_idx, end_idx))
else:
i += 1
# Check if enough rows are table rows (at least 20%)
total_table_rows = sum(end - start for start, end in table_regions)
if len(row_info) > 0 and total_table_rows / len(row_info) < 0.2:
return None
# Build output - collect table data first, then format with proper column widths
result_lines: list[str] = []
num_cols = len(global_columns)
# Helper function to extract cells from a row
def extract_cells(info: dict) -> list[str]:
cells: list[str] = ["" for _ in range(num_cols)]
for word in info["words"]:
word_x = word["x0"]
# Find the correct column using boundary ranges
assigned_col = num_cols - 1 # Default to last column
for col_idx in range(num_cols - 1):
col_end = global_columns[col_idx + 1]
if word_x < col_end - 20:
assigned_col = col_idx
break
if cells[assigned_col]:
cells[assigned_col] += " " + word["text"]
else:
cells[assigned_col] = word["text"]
return cells
# Process rows, collecting table data for proper formatting
idx = 0
while idx < len(row_info):
info = row_info[idx]
# Check if this row starts a table region
table_region = None
for start, end in table_regions:
if idx == start:
table_region = (start, end)
break
if table_region:
start, end = table_region
# Collect all rows in this table
table_data: list[list[str]] = []
for table_idx in range(start, end):
cells = extract_cells(row_info[table_idx])
table_data.append(cells)
# Calculate column widths for this table
if table_data:
col_widths = [
max(len(row[col]) for row in table_data) for col in range(num_cols)
]
# Ensure minimum width of 3 for separator dashes
col_widths = [max(w, 3) for w in col_widths]
# Format header row
header = table_data[0]
header_str = (
"| "
+ " | ".join(
cell.ljust(col_widths[i]) for i, cell in enumerate(header)
)
+ " |"
)
result_lines.append(header_str)
# Format separator row
separator = (
"| "
+ " | ".join("-" * col_widths[i] for i in range(num_cols))
+ " |"
)
result_lines.append(separator)
# Format data rows
for row in table_data[1:]:
row_str = (
"| "
+ " | ".join(
cell.ljust(col_widths[i]) for i, cell in enumerate(row)
)
+ " |"
)
result_lines.append(row_str)
idx = end # Skip to end of table region
else:
# Check if we're inside a table region (not at start)
in_table = False
for start, end in table_regions:
if start < idx < end:
in_table = True
break
if not in_table:
# Non-table content
result_lines.append(info["text"])
idx += 1
return "\n".join(result_lines)
def _extract_tables_from_words(page: Any) -> list[list[list[str]]]:
"""
Extract tables from a PDF page by analyzing word positions.
This handles borderless tables where words are aligned in columns.
This function is designed for structured tabular data (like invoices),
not for multi-column text layouts in scientific documents.
"""
words = page.extract_words(keep_blank_chars=True, x_tolerance=3, y_tolerance=3)
if not words:
return []
# Group words by their Y position (rows)
y_tolerance = 5
rows_by_y: dict[float, list[dict]] = {}
for word in words:
y_key = round(word["top"] / y_tolerance) * y_tolerance
if y_key not in rows_by_y:
rows_by_y[y_key] = []
rows_by_y[y_key].append(word)
# Sort rows by Y position
sorted_y_keys = sorted(rows_by_y.keys())
# Find potential column boundaries by analyzing x positions across all rows
all_x_positions = []
for words_in_row in rows_by_y.values():
for word in words_in_row:
all_x_positions.append(word["x0"])
if not all_x_positions:
return []
# Cluster x positions to find column starts
all_x_positions.sort()
x_tolerance_col = 20
column_starts: list[float] = []
for x in all_x_positions:
if not column_starts or x - column_starts[-1] > x_tolerance_col:
column_starts.append(x)
# Need at least 3 columns but not too many (likely text layout, not table)
if len(column_starts) < 3 or len(column_starts) > 10:
return []
# Find rows that span multiple columns (potential table rows)
table_rows = []
for y_key in sorted_y_keys:
words_in_row = sorted(rows_by_y[y_key], key=lambda w: w["x0"])
# Assign words to columns
row_data = [""] * len(column_starts)
for word in words_in_row:
# Find the closest column
best_col = 0
min_dist = float("inf")
for i, col_x in enumerate(column_starts):
dist = abs(word["x0"] - col_x)
if dist < min_dist:
min_dist = dist
best_col = i
if row_data[best_col]:
row_data[best_col] += " " + word["text"]
else:
row_data[best_col] = word["text"]
# Only include rows that have content in multiple columns
non_empty = sum(1 for cell in row_data if cell.strip())
if non_empty >= 2:
table_rows.append(row_data)
# Validate table quality - tables should have:
# 1. Enough rows (at least 3 including header)
# 2. Short cell content (tables have concise data, not paragraphs)
# 3. Consistent structure across rows
if len(table_rows) < 3:
return []
# Check if cells contain short, structured data (not long text)
long_cell_count = 0
total_cell_count = 0
for row in table_rows:
for cell in row:
if cell.strip():
total_cell_count += 1
# If cell has more than 30 chars, it's likely prose text
if len(cell.strip()) > 30:
long_cell_count += 1
# If more than 30% of cells are long, this is probably not a table
if total_cell_count > 0 and long_cell_count / total_cell_count > 0.3:
return []
return [table_rows]
class PdfConverter(DocumentConverter):
"""
Converts PDFs to Markdown.
Supports extracting tables into aligned Markdown format (via pdfplumber).
Falls back to pdfminer if pdfplumber is missing or fails.
"""
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any,
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any,
) -> DocumentConverterResult:
if _dependency_exc_info is not None:
raise MissingDependencyException(
MISSING_DEPENDENCY_MESSAGE.format(
converter=type(self).__name__,
extension=".pdf",
feature="pdf",
)
) from _dependency_exc_info[1].with_traceback(
_dependency_exc_info[2]
) # type: ignore[union-attr]
assert isinstance(file_stream, io.IOBase)
# Read file stream into BytesIO for compatibility with pdfplumber
pdf_bytes = io.BytesIO(file_stream.read())
try:
# Single pass: check every page for form-style content.
# Pages with tables/forms get rich extraction; plain-text
# pages are collected separately. page.close() is called
# after each page to free pdfplumber's cached objects and
# keep memory usage constant regardless of page count.
markdown_chunks: list[str] = []
form_page_count = 0
plain_page_indices: list[int] = []
with pdfplumber.open(pdf_bytes) as pdf:
for page_idx, page in enumerate(pdf.pages):
page_content = _extract_form_content_from_words(page)
if page_content is not None:
form_page_count += 1
if page_content.strip():
markdown_chunks.append(page_content)
else:
plain_page_indices.append(page_idx)
text = page.extract_text()
if text and text.strip():
markdown_chunks.append(text.strip())
page.close() # Free cached page data immediately
# If no pages had form-style content, use pdfminer for
# the whole document (better text spacing for prose).
if form_page_count == 0:
pdf_bytes.seek(0)
markdown = pdfminer.high_level.extract_text(pdf_bytes)
else:
markdown = "\n\n".join(markdown_chunks).strip()
except Exception:
# Fallback if pdfplumber fails
pdf_bytes.seek(0)
markdown = pdfminer.high_level.extract_text(pdf_bytes)
# Fallback if still empty
if not markdown:
pdf_bytes.seek(0)
markdown = pdfminer.high_level.extract_text(pdf_bytes)
# Post-process to merge MasterFormat-style partial numbering with following text
markdown = _merge_partial_numbering_lines(markdown)
return DocumentConverterResult(markdown=markdown)
================================================
FILE: packages/markitdown/src/markitdown/converters/_plain_text_converter.py
================================================
import sys
from typing import BinaryIO, Any
from charset_normalizer import from_bytes
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
import mammoth # noqa: F401
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()
ACCEPTED_MIME_TYPE_PREFIXES = [
"text/",
"application/json",
"application/markdown",
]
ACCEPTED_FILE_EXTENSIONS = [
".txt",
".text",
".md",
".markdown",
".json",
".jsonl",
]
class PlainTextConverter(DocumentConverter):
"""Anything with content type text/plain"""
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
# If we have a charset, we can safely assume it's text
# With Magika in the earlier stages, this handles most cases
if stream_info.charset is not None:
return True
# Otherwise, check the mimetype and extension
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
if stream_info.charset:
text_content = file_stream.read().decode(stream_info.charset)
else:
text_content = str(from_bytes(file_stream.read()).best())
return DocumentConverterResult(markdown=text_content)
================================================
FILE: packages/markitdown/src/markitdown/converters/_pptx_converter.py
================================================
import sys
import base64
import os
import io
import re
import html
from typing import BinaryIO, Any
from operator import attrgetter
from ._html_converter import HtmlConverter
from ._llm_caption import llm_caption
from .._base_converter import DocumentConverter, DocumentConverterResult
from .._stream_info import StreamInfo
from .._exceptions import MissingDependencyException, MISSING_DEPENDENCY_MESSAGE
# Try loading optional (but in this case, required) dependencies
# Save reporting of any exceptions for later
_dependency_exc_info = None
try:
import pptx
except ImportError:
# Preserve the error and stack trace for later
_dependency_exc_info = sys.exc_info()
ACCEPTED_MIME_TYPE_PREFIXES = [
"application/vnd.openxmlformats-officedocument.presentationml",
]
ACCEPTED_FILE_EXTENSIONS = [".pptx"]
class PptxConverter(DocumentConverter):
"""
Converts PPTX files to Markdown. Supports heading, tables and images with alt text.
"""
def __init__(self):
super().__init__()
self._html_converter = HtmlConverter()
def accepts(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> bool:
mimetype = (stream_info.mimetype or "").lower()
extension = (stream_info.extension or "").lower()
if extension in ACCEPTED_FILE_EXTENSIONS:
return True
for prefix in ACCEPTED_MIME_TYPE_PREFIXES:
if mimetype.startswith(prefix):
return True
return False
def convert(
self,
file_stream: BinaryIO,
stream_info: StreamInfo,
**kwargs: Any, # Options to pass to the converter
) -> DocumentConverterResult:
# Check the dependencies
if _dependency_exc_info is not None:
raise MissingDependencyException(
MISSING_DEPENDENCY_MESSAGE.format(
converter=type(self).__name__,
extension=".pptx",
feature="pptx",
)
) from _dependency_exc_info[
1
].with_traceback( # type: ignore[union-attr]
_dependency_exc_info[2]
)
# Perform the conversion
presentation = pptx.Presentation(file_stream)
md_content = ""
slide_num = 0
for slide in presentation.slides:
slide_num += 1
md_content += f"\n\n<!-- Slide number: {slide_num} -->\n"
title = slide.shapes.title
def get_shape_content(shape, **kwargs):
nonlocal md_content
# Pictures
if self._is_picture(shape):
# https://github.com/scanny/python-pptx/pull/512#issuecomment-1713100069
llm_description = ""
alt_text = ""
# Potentially generate a description using an LLM
llm_client = kwargs.get("llm_client")
llm_model = kwargs.get("llm_model")
if llm_client is not None and llm_model is not None:
# Prepare a file_stream and stream_info for the image data
image_filename = shape.image.filename
image_extension = None
if image_filename:
image_extension = os.path.splitext(image_filename)[1]
image_stream_info = StreamInfo(
mimetype=shape.image.content_type,
extension=image_extension,
filename=image_filename,
)
image_stream = io.BytesIO(shape.image.blob)
# Caption the image
try:
llm_description = llm_caption(
image_stream,
image_stream_info,
client=llm_client,
model=llm_model,
prompt=kwargs.get("llm_prompt"),
)
except Exception:
# Unable to generate a description
pass
# Also grab any description embedded in the deck
try:
alt_text = shape._element._nvXxPr.cNvPr.attrib.get("descr", "")
except Exception:
# Unable to get alt text
pass
# Prepare the alt, escaping any special characters
alt_text = "\n".join([llm_description, alt_text]) or shape.name
alt_text = re.sub(r"[\r\n\[\]]", " ", alt_text)
alt_text = re.sub(r"\s+", " ", alt_text).strip()
# If keep_data_uris is True, use base64 encoding for images
if kwargs.get("keep_data_uris", False):
blob = shape.image.blob
content_type = shape.image.content_type or "image/png"
b64_string = base64.b64encode(blob).decode("utf-8")
md_content += f"\n\n"
else:
# A placeholder name
filename = re.sub(r"\W", "", shape.name) + ".jpg"
md_content += "\n\n"
# Tables
if self._is_table(shape):
md_content += self._convert_table_to_markdown(shape.table, **kwargs)
# Charts
if shape.has_chart:
md_content += self._convert_chart_to_markdown(shape.chart)
# Text areas
elif shape.has_text_frame:
if shape == title:
md_content += "# " + shape.text.lstrip() + "\n"
else:
md_content += shape.text + "\n"
# Group Shapes
if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.GROUP:
sorted_shapes = sorted(
shape.shapes,
key=lambda x: (
float("-inf") if not x.top else x.top,
float("-inf") if not x.left else x.left,
),
)
for subshape in sorted_shapes:
get_shape_content(subshape, **kwargs)
sorted_shapes = sorted(
slide.shapes,
key=lambda x: (
float("-inf") if not x.top else x.top,
float("-inf") if not x.left else x.left,
),
)
for shape in sorted_shapes:
get_shape_content(shape, **kwargs)
md_content = md_content.strip()
if slide.has_notes_slide:
md_content += "\n\n### Notes:\n"
notes_frame = slide.notes_slide.notes_text_frame
if notes_frame is not None:
md_content += notes_frame.text
md_content = md_content.strip()
return DocumentConverterResult(markdown=md_content.strip())
def _is_picture(self, shape):
if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.PICTURE:
return True
if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.PLACEHOLDER:
if hasattr(shape, "image"):
return True
return False
def _is_table(self, shape):
if shape.shape_type == pptx.enum.shapes.MSO_SHAPE_TYPE.TABLE:
return True
return False
def _convert_table_to_markdown(self, table, **kwargs):
# Write the table as HTML, then convert it to Markdown
html_table = "<html><body><table>"
first_row = True
for row in table.rows:
html_table += "<tr>"
for cell in row.cells:
if first_row:
html_table += "<th>" + html.escape(cell.text) + "</th>"
else:
html_table += "<td>" + html.escape(cell.text) + "</td>"
html_table += "</tr>"
first_row = Fa
gitextract_s2k12qtb/
├── .devcontainer/
│ └── devcontainer.json
├── .dockerignore
├── .gitattributes
├── .github/
│ ├── dependabot.yml
│ └── workflows/
│ ├── pre-commit.yml
│ └── tests.yml
├── .gitignore
├── .pre-commit-config.yaml
├── CODE_OF_CONDUCT.md
├── Dockerfile
├── LICENSE
├── README.md
├── SECURITY.md
├── SUPPORT.md
└── packages/
├── markitdown/
│ ├── README.md
│ ├── ThirdPartyNotices.md
│ ├── pyproject.toml
│ ├── src/
│ │ └── markitdown/
│ │ ├── __about__.py
│ │ ├── __init__.py
│ │ ├── __main__.py
│ │ ├── _base_converter.py
│ │ ├── _exceptions.py
│ │ ├── _markitdown.py
│ │ ├── _stream_info.py
│ │ ├── _uri_utils.py
│ │ ├── converter_utils/
│ │ │ ├── __init__.py
│ │ │ └── docx/
│ │ │ ├── __init__.py
│ │ │ ├── math/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── latex_dict.py
│ │ │ │ └── omml.py
│ │ │ └── pre_process.py
│ │ ├── converters/
│ │ │ ├── __init__.py
│ │ │ ├── _audio_converter.py
│ │ │ ├── _bing_serp_converter.py
│ │ │ ├── _csv_converter.py
│ │ │ ├── _doc_intel_converter.py
│ │ │ ├── _docx_converter.py
│ │ │ ├── _epub_converter.py
│ │ │ ├── _exiftool.py
│ │ │ ├── _html_converter.py
│ │ │ ├── _image_converter.py
│ │ │ ├── _ipynb_converter.py
│ │ │ ├── _llm_caption.py
│ │ │ ├── _markdownify.py
│ │ │ ├── _outlook_msg_converter.py
│ │ │ ├── _pdf_converter.py
│ │ │ ├── _plain_text_converter.py
│ │ │ ├── _pptx_converter.py
│ │ │ ├── _rss_converter.py
│ │ │ ├── _transcribe_audio.py
│ │ │ ├── _wikipedia_converter.py
│ │ │ ├── _xlsx_converter.py
│ │ │ ├── _youtube_converter.py
│ │ │ └── _zip_converter.py
│ │ └── py.typed
│ └── tests/
│ ├── __init__.py
│ ├── _test_vectors.py
│ ├── test_cli_misc.py
│ ├── test_cli_vectors.py
│ ├── test_docintel_html.py
│ ├── test_files/
│ │ ├── equations.docx
│ │ ├── expected_outputs/
│ │ │ ├── MEDRPT-2024-PAT-3847_medical_report_scan.md
│ │ │ ├── RECEIPT-2024-TXN-98765_retail_purchase.md
│ │ │ ├── REPAIR-2022-INV-001_multipage.md
│ │ │ ├── SPARSE-2024-INV-1234_borderless_table.md
│ │ │ ├── movie-theater-booking-2024.md
│ │ │ └── test.md
│ │ ├── rlink.docx
│ │ ├── test.docx
│ │ ├── test.epub
│ │ ├── test.json
│ │ ├── test.m4a
│ │ ├── test.pptx
│ │ ├── test.xls
│ │ ├── test.xlsx
│ │ ├── test_blog.html
│ │ ├── test_mskanji.csv
│ │ ├── test_notebook.ipynb
│ │ ├── test_outlook_msg.msg
│ │ ├── test_rss.xml
│ │ ├── test_serp.html
│ │ ├── test_wikipedia.html
│ │ └── test_with_comment.docx
│ ├── test_module_misc.py
│ ├── test_module_vectors.py
│ ├── test_pdf_masterformat.py
│ ├── test_pdf_memory.py
│ └── test_pdf_tables.py
├── markitdown-mcp/
│ ├── Dockerfile
│ ├── README.md
│ ├── pyproject.toml
│ ├── src/
│ │ └── markitdown_mcp/
│ │ ├── __about__.py
│ │ ├── __init__.py
│ │ ├── __main__.py
│ │ └── py.typed
│ └── tests/
│ └── __init__.py
├── markitdown-ocr/
│ ├── LICENSE
│ ├── README.md
│ ├── pyproject.toml
│ ├── src/
│ │ └── markitdown_ocr/
│ │ ├── __about__.py
│ │ ├── __init__.py
│ │ ├── _docx_converter_with_ocr.py
│ │ ├── _ocr_service.py
│ │ ├── _pdf_converter_with_ocr.py
│ │ ├── _plugin.py
│ │ ├── _pptx_converter_with_ocr.py
│ │ └── _xlsx_converter_with_ocr.py
│ └── tests/
│ ├── __init__.py
│ ├── ocr_test_data/
│ │ ├── docx_complex_layout.docx
│ │ ├── docx_image_end.docx
│ │ ├── docx_image_middle.docx
│ │ ├── docx_image_start.docx
│ │ ├── docx_multipage.docx
│ │ ├── docx_multiple_images.docx
│ │ ├── pptx_complex_layout.pptx
│ │ ├── pptx_image_end.pptx
│ │ ├── pptx_image_middle.pptx
│ │ ├── pptx_image_start.pptx
│ │ ├── pptx_multiple_images.pptx
│ │ ├── xlsx_complex_layout.xlsx
│ │ ├── xlsx_image_end.xlsx
│ │ ├── xlsx_image_middle.xlsx
│ │ ├── xlsx_image_start.xlsx
│ │ └── xlsx_multiple_images.xlsx
│ ├── test_docx_converter.py
│ ├── test_pdf_converter.py
│ ├── test_pptx_converter.py
│ └── test_xlsx_converter.py
└── markitdown-sample-plugin/
├── README.md
├── pyproject.toml
├── src/
│ └── markitdown_sample_plugin/
│ ├── __about__.py
│ ├── __init__.py
│ ├── _plugin.py
│ └── py.typed
└── tests/
├── __init__.py
├── test_files/
│ └── test.rtf
└── test_sample_plugin.py
SYMBOL INDEX (383 symbols across 52 files)
FILE: packages/markitdown-mcp/src/markitdown_mcp/__main__.py
function convert_to_markdown (line 21) | async def convert_to_markdown(uri: str) -> str:
function check_plugins_enabled (line 26) | def check_plugins_enabled() -> bool:
function create_starlette_app (line 34) | def create_starlette_app(mcp_server: Server, *, debug: bool = False) -> ...
function main (line 82) | def main():
FILE: packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py
class DocxConverterWithOCR (line 33) | class DocxConverterWithOCR(HtmlConverter):
method __init__ (line 39) | def __init__(self, ocr_service: Optional[LLMVisionOCRService] = None):
method accepts (line 44) | def accepts(
method convert (line 63) | def convert(
method _extract_and_ocr_images (line 126) | def _extract_and_ocr_images(
method _inject_placeholders (line 160) | def _inject_placeholders(
FILE: packages/markitdown-ocr/src/markitdown_ocr/_ocr_service.py
class OCRResult (line 14) | class OCRResult:
class LLMVisionOCRService (line 23) | class LLMVisionOCRService:
method __init__ (line 26) | def __init__(
method extract_text (line 48) | def extract_text(
FILE: packages/markitdown-ocr/src/markitdown_ocr/_pdf_converter_with_ocr.py
function _extract_images_from_page (line 28) | def _extract_images_from_page(page: Any) -> list[dict]:
class PdfConverterWithOCR (line 129) | class PdfConverterWithOCR(DocumentConverter):
method __init__ (line 135) | def __init__(self, ocr_service: Optional[LLMVisionOCRService] = None):
method accepts (line 139) | def accepts(
method convert (line 158) | def convert(
method _extract_page_images (line 313) | def _extract_page_images(self, pdf_bytes: io.BytesIO, page_num: int) -...
method _ocr_full_pages (line 340) | def _ocr_full_pages(
FILE: packages/markitdown-ocr/src/markitdown_ocr/_plugin.py
function register_converters (line 19) | def register_converters(markitdown: MarkItDown, **kwargs: Any) -> None:
FILE: packages/markitdown-ocr/src/markitdown_ocr/_pptx_converter_with_ocr.py
class PptxConverterWithOCR (line 27) | class PptxConverterWithOCR(DocumentConverter):
method __init__ (line 30) | def __init__(self, ocr_service: Optional[LLMVisionOCRService] = None):
method accepts (line 35) | def accepts(
method convert (line 54) | def convert(
method _is_picture (line 188) | def _is_picture(self, shape):
method _is_table (line 196) | def _is_table(self, shape):
method _convert_table_to_markdown (line 201) | def _convert_table_to_markdown(self, table, **kwargs):
method _convert_chart_to_markdown (line 222) | def _convert_chart_to_markdown(self, chart):
FILE: packages/markitdown-ocr/src/markitdown_ocr/_xlsx_converter_with_ocr.py
class XlsxConverterWithOCR (line 27) | class XlsxConverterWithOCR(DocumentConverter):
method __init__ (line 33) | def __init__(self, ocr_service: Optional[LLMVisionOCRService] = None):
method accepts (line 38) | def accepts(
method convert (line 57) | def convert(
method _convert_standard (line 88) | def _convert_standard(
method _convert_with_ocr (line 108) | def _convert_with_ocr(
method _extract_and_ocr_sheet_images (line 149) | def _extract_and_ocr_sheet_images(
method _column_number_to_letter (line 217) | def _column_number_to_letter(n: int) -> str:
FILE: packages/markitdown-ocr/tests/test_docx_converter.py
class MockOCRService (line 32) | class MockOCRService:
method extract_text (line 33) | def extract_text( # noqa: ANN101
function svc (line 40) | def svc() -> MockOCRService:
function _convert (line 44) | def _convert(filename: str, ocr_service: MockOCRService) -> str:
function test_docx_image_start (line 60) | def test_docx_image_start(svc: MockOCRService) -> None:
function test_docx_image_middle (line 75) | def test_docx_image_middle(svc: MockOCRService) -> None:
function test_docx_image_end (line 92) | def test_docx_image_end(svc: MockOCRService) -> None:
function test_docx_multiple_images (line 108) | def test_docx_multiple_images(svc: MockOCRService) -> None:
function test_docx_multipage (line 125) | def test_docx_multipage(svc: MockOCRService) -> None:
function test_docx_complex_layout (line 152) | def test_docx_complex_layout(svc: MockOCRService) -> None:
function test_inject_placeholders_single_image (line 171) | def test_inject_placeholders_single_image() -> None:
function test_inject_placeholders_two_images_sequential_tokens (line 180) | def test_inject_placeholders_two_images_sequential_tokens() -> None:
function test_inject_placeholders_no_img_tag_appends_at_end (line 194) | def test_inject_placeholders_no_img_tag_appends_at_end() -> None:
function test_inject_placeholders_empty_map_leaves_html_unchanged (line 202) | def test_inject_placeholders_empty_map_leaves_html_unchanged() -> None:
function test_docx_no_ocr_service_no_tags (line 215) | def test_docx_no_ocr_service_no_tags() -> None:
FILE: packages/markitdown-ocr/tests/test_pdf_converter.py
class MockOCRService (line 36) | class MockOCRService:
method extract_text (line 37) | def extract_text(
function svc (line 46) | def svc() -> MockOCRService:
function _convert (line 50) | def _convert(filename: str, ocr_service: MockOCRService) -> str:
function test_pdf_image_start (line 66) | def test_pdf_image_start(svc: MockOCRService) -> None:
function test_pdf_image_middle (line 82) | def test_pdf_image_middle(svc: MockOCRService) -> None:
function test_pdf_image_end (line 100) | def test_pdf_image_end(svc: MockOCRService) -> None:
function test_pdf_multiple_images (line 117) | def test_pdf_multiple_images(svc: MockOCRService) -> None:
function test_pdf_complex_layout (line 134) | def test_pdf_complex_layout(svc: MockOCRService) -> None:
function test_pdf_multipage (line 151) | def test_pdf_multipage(svc: MockOCRService) -> None:
function test_pdf_scanned_invoice (line 167) | def test_pdf_scanned_invoice(svc: MockOCRService) -> None:
function test_pdf_scanned_meeting_minutes (line 171) | def test_pdf_scanned_meeting_minutes(svc: MockOCRService) -> None:
function test_pdf_scanned_minimal (line 175) | def test_pdf_scanned_minimal(svc: MockOCRService) -> None:
function test_pdf_scanned_sales_report (line 179) | def test_pdf_scanned_sales_report(svc: MockOCRService) -> None:
function test_pdf_scanned_report (line 183) | def test_pdf_scanned_report(svc: MockOCRService) -> None:
function test_pdf_scanned_fallback_format (line 197) | def test_pdf_scanned_fallback_format(svc: MockOCRService) -> None:
function test_pdf_no_ocr_service_no_tags (line 226) | def test_pdf_no_ocr_service_no_tags() -> None:
FILE: packages/markitdown-ocr/tests/test_pptx_converter.py
class MockOCRService (line 36) | class MockOCRService:
method extract_text (line 37) | def extract_text(
function svc (line 46) | def svc() -> MockOCRService:
function _convert (line 50) | def _convert(filename: str, ocr_service: MockOCRService) -> str:
function test_pptx_image_start (line 66) | def test_pptx_image_start(svc: MockOCRService) -> None:
function test_pptx_image_middle (line 80) | def test_pptx_image_middle(svc: MockOCRService) -> None:
function test_pptx_image_end (line 96) | def test_pptx_image_end(svc: MockOCRService) -> None:
function test_pptx_multiple_images (line 111) | def test_pptx_multiple_images(svc: MockOCRService) -> None:
function test_pptx_complex_layout (line 126) | def test_pptx_complex_layout(svc: MockOCRService) -> None:
function test_pptx_no_ocr_service_no_tags (line 140) | def test_pptx_no_ocr_service_no_tags() -> None:
FILE: packages/markitdown-ocr/tests/test_xlsx_converter.py
class MockOCRService (line 37) | class MockOCRService:
method extract_text (line 38) | def extract_text(
function svc (line 47) | def svc() -> MockOCRService:
function _convert (line 51) | def _convert(filename: str, ocr_service: MockOCRService) -> str:
function test_xlsx_image_start (line 67) | def test_xlsx_image_start(svc: MockOCRService) -> None:
function test_xlsx_image_middle (line 92) | def test_xlsx_image_middle(svc: MockOCRService) -> None:
function test_xlsx_image_end (line 127) | def test_xlsx_image_end(svc: MockOCRService) -> None:
function test_xlsx_multiple_images (line 166) | def test_xlsx_multiple_images(svc: MockOCRService) -> None:
function test_xlsx_complex_layout (line 201) | def test_xlsx_complex_layout(svc: MockOCRService) -> None:
function test_xlsx_no_ocr_service_no_tags (line 241) | def test_xlsx_no_ocr_service_no_tags() -> None:
FILE: packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py
function register_converters (line 25) | def register_converters(markitdown: MarkItDown, **kwargs):
class RtfConverter (line 34) | class RtfConverter(DocumentConverter):
method accepts (line 39) | def accepts(
method convert (line 57) | def convert(
FILE: packages/markitdown-sample-plugin/tests/test_sample_plugin.py
function test_converter (line 15) | def test_converter() -> None:
function test_markitdown (line 30) | def test_markitdown() -> None:
FILE: packages/markitdown/src/markitdown/__main__.py
function main (line 13) | def main():
function _handle_output (line 203) | def _handle_output(args, result: DocumentConverterResult):
function _exit_with_error (line 217) | def _exit_with_error(message: str):
FILE: packages/markitdown/src/markitdown/_base_converter.py
class DocumentConverterResult (line 5) | class DocumentConverterResult:
method __init__ (line 8) | def __init__(
method text_content (line 28) | def text_content(self) -> str:
method text_content (line 33) | def text_content(self, markdown: str):
method __str__ (line 37) | def __str__(self) -> str:
class DocumentConverter (line 42) | class DocumentConverter:
method accepts (line 45) | def accepts(
method convert (line 84) | def convert(
FILE: packages/markitdown/src/markitdown/_exceptions.py
class MarkItDownException (line 11) | class MarkItDownException(Exception):
class MissingDependencyException (line 19) | class MissingDependencyException(MarkItDownException):
class UnsupportedFormatException (line 34) | class UnsupportedFormatException(MarkItDownException):
class FailedConversionAttempt (line 42) | class FailedConversionAttempt(object):
method __init__ (line 47) | def __init__(self, converter: Any, exc_info: Optional[tuple] = None):
class FileConversionException (line 52) | class FileConversionException(MarkItDownException):
method __init__ (line 58) | def __init__(
FILE: packages/markitdown/src/markitdown/_markitdown.py
function _load_plugins (line 65) | def _load_plugins() -> Union[None, List[Any]]:
class ConverterRegistration (line 86) | class ConverterRegistration:
class MarkItDown (line 93) | class MarkItDown:
method __init__ (line 97) | def __init__(
method enable_builtins (line 140) | def enable_builtins(self, **kwargs) -> None:
method enable_plugins (line 232) | def enable_plugins(self, **kwargs) -> None:
method convert (line 252) | def convert(
method convert_local (line 302) | def convert_local(
method convert_stream (line 339) | def convert_stream(
method convert_url (line 386) | def convert_url(
method convert_uri (line 405) | def convert_uri(
method convert_response (line 466) | def convert_response(
method _convert (line 538) | def _convert(
method register_page_converter (line 633) | def register_page_converter(self, converter: DocumentConverter) -> None:
method register_converter (line 641) | def register_converter(
method _get_stream_info_guesses (line 673) | def _get_stream_info_guesses(
method _normalize_charset (line 774) | def _normalize_charset(self, charset: str | None) -> str | None:
FILE: packages/markitdown/src/markitdown/_stream_info.py
class StreamInfo (line 6) | class StreamInfo:
method copy_and_update (line 20) | def copy_and_update(self, *args, **kwargs):
FILE: packages/markitdown/src/markitdown/_uri_utils.py
function file_uri_to_path (line 8) | def file_uri_to_path(file_uri: str) -> Tuple[str | None, str]:
function parse_data_uri (line 19) | def parse_data_uri(uri: str) -> Tuple[str | None, Dict[str, str], bytes]:
FILE: packages/markitdown/src/markitdown/converter_utils/docx/math/omml.py
function load (line 43) | def load(stream):
function load_string (line 49) | def load_string(string):
function escape_latex (line 55) | def escape_latex(strs):
function get_val (line 68) | def get_val(key, default=None, store=CHR):
class Tag2Method (line 75) | class Tag2Method(object):
method call_method (line 76) | def call_method(self, elm, stag=None):
method process_children_list (line 86) | def process_children_list(self, elm, include=None):
method process_children_dict (line 103) | def process_children_dict(self, elm, include=None):
method process_children (line 112) | def process_children(self, elm, include=None):
method process_unknow (line 123) | def process_unknow(self, elm, stag):
class Pr (line 127) | class Pr(Tag2Method):
method __init__ (line 136) | def __init__(self, elm):
method __str__ (line 140) | def __str__(self):
method __unicode__ (line 143) | def __unicode__(self):
method __getattr__ (line 146) | def __getattr__(self, name):
method do_brk (line 149) | def do_brk(self, elm):
method do_common (line 153) | def do_common(self, elm):
class oMath2Latex (line 170) | class oMath2Latex(Tag2Method):
method __init__ (line 179) | def __init__(self, element):
method __str__ (line 182) | def __str__(self):
method __unicode__ (line 185) | def __unicode__(self):
method process_unknow (line 188) | def process_unknow(self, elm, stag):
method latex (line 197) | def latex(self):
method do_acc (line 200) | def do_acc(self, elm):
method do_bar (line 210) | def do_bar(self, elm):
method do_d (line 219) | def do_d(self, elm):
method do_spre (line 234) | def do_spre(self, elm):
method do_sub (line 240) | def do_sub(self, elm):
method do_sup (line 244) | def do_sup(self, elm):
method do_f (line 248) | def do_f(self, elm):
method do_func (line 257) | def do_func(self, elm):
method do_fname (line 265) | def do_fname(self, elm):
method do_groupchr (line 281) | def do_groupchr(self, elm):
method do_rad (line 290) | def do_rad(self, elm):
method do_eqarr (line 302) | def do_eqarr(self, elm):
method do_limlow (line 312) | def do_limlow(self, elm):
method do_limupp (line 323) | def do_limupp(self, elm):
method do_lim (line 330) | def do_lim(self, elm):
method do_m (line 336) | def do_m(self, elm):
method do_mr (line 348) | def do_mr(self, elm):
method do_nary (line 356) | def do_nary(self, elm):
method do_r (line 369) | def do_r(self, elm):
FILE: packages/markitdown/src/markitdown/converter_utils/docx/pre_process.py
function _convert_omath_to_latex (line 33) | def _convert_omath_to_latex(tag: Tag) -> str:
function _get_omath_tag_replacement (line 52) | def _get_omath_tag_replacement(tag: Tag, block: bool = False) -> Tag:
function _replace_equations (line 74) | def _replace_equations(tag: Tag):
function _pre_process_math (line 99) | def _pre_process_math(content: bytes) -> bytes:
function pre_process_docx (line 118) | def pre_process_docx(input_docx: BinaryIO) -> BinaryIO:
FILE: packages/markitdown/src/markitdown/converters/_audio_converter.py
class AudioConverter (line 23) | class AudioConverter(DocumentConverter):
method accepts (line 28) | def accepts(
method convert (line 46) | def convert(
FILE: packages/markitdown/src/markitdown/converters/_bing_serp_converter.py
class BingSerpConverter (line 23) | class BingSerpConverter(DocumentConverter):
method accepts (line 29) | def accepts(
method convert (line 57) | def convert(
FILE: packages/markitdown/src/markitdown/converters/_csv_converter.py
class CsvConverter (line 15) | class CsvConverter(DocumentConverter):
method __init__ (line 20) | def __init__(self):
method accepts (line 23) | def accepts(
method convert (line 38) | def convert(
FILE: packages/markitdown/src/markitdown/converters/_doc_intel_converter.py
class AzureKeyCredential (line 28) | class AzureKeyCredential:
class TokenCredential (line 31) | class TokenCredential:
class DocumentIntelligenceClient (line 34) | class DocumentIntelligenceClient:
class AnalyzeDocumentRequest (line 37) | class AnalyzeDocumentRequest:
class AnalyzeResult (line 40) | class AnalyzeResult:
class DocumentAnalysisFeature (line 43) | class DocumentAnalysisFeature:
class DefaultAzureCredential (line 46) | class DefaultAzureCredential:
class DocumentIntelligenceFileType (line 55) | class DocumentIntelligenceFileType(str, Enum):
function _get_mime_type_prefixes (line 71) | def _get_mime_type_prefixes(types: List[DocumentIntelligenceFileType]) -...
function _get_file_extensions (line 104) | def _get_file_extensions(types: List[DocumentIntelligenceFileType]) -> L...
class DocumentIntelligenceConverter (line 130) | class DocumentIntelligenceConverter(DocumentConverter):
method __init__ (line 133) | def __init__(
method accepts (line 189) | def accepts(
method _analysis_features (line 207) | def _analysis_features(self, stream_info: StreamInfo) -> List[str]:
method convert (line 237) | def convert(
FILE: packages/markitdown/src/markitdown/converters/_docx_converter.py
class DocxConverter (line 31) | class DocxConverter(HtmlConverter):
method __init__ (line 36) | def __init__(self):
method accepts (line 40) | def accepts(
method convert (line 58) | def convert(
FILE: packages/markitdown/src/markitdown/converters/_epub_converter.py
class EpubConverter (line 26) | class EpubConverter(HtmlConverter):
method __init__ (line 31) | def __init__(self):
method accepts (line 35) | def accepts(
method convert (line 53) | def convert(
method _get_text_from_node (line 132) | def _get_text_from_node(self, dom: Document, tag_name: str) -> str | N...
method _get_all_texts_from_nodes (line 140) | def _get_all_texts_from_nodes(self, dom: Document, tag_name: str) -> L...
FILE: packages/markitdown/src/markitdown/converters/_exiftool.py
function _parse_version (line 7) | def _parse_version(version: str) -> tuple:
function exiftool_metadata (line 11) | def exiftool_metadata(
FILE: packages/markitdown/src/markitdown/converters/_html_converter.py
class HtmlConverter (line 20) | class HtmlConverter(DocumentConverter):
method accepts (line 23) | def accepts(
method convert (line 41) | def convert(
method convert_string (line 73) | def convert_string(
FILE: packages/markitdown/src/markitdown/converters/_image_converter.py
class ImageConverter (line 16) | class ImageConverter(DocumentConverter):
method accepts (line 21) | def accepts(
method convert (line 39) | def convert(
method _get_llm_description (line 87) | def _get_llm_description(
FILE: packages/markitdown/src/markitdown/converters/_ipynb_converter.py
class IpynbConverter (line 15) | class IpynbConverter(DocumentConverter):
method accepts (line 18) | def accepts(
method convert (line 46) | def convert(
method _convert (line 57) | def _convert(self, notebook_content: dict) -> DocumentConverterResult:
FILE: packages/markitdown/src/markitdown/converters/_llm_caption.py
function llm_caption (line 7) | def llm_caption(
FILE: packages/markitdown/src/markitdown/converters/_markdownify.py
class _CustomMarkdownify (line 8) | class _CustomMarkdownify(markdownify.MarkdownConverter):
method __init__ (line 18) | def __init__(self, **options: Any):
method convert_hn (line 24) | def convert_hn(
method convert_a (line 39) | def convert_a(
method convert_img (line 85) | def convert_img(
method convert_input (line 112) | def convert_input(
method convert_soup (line 125) | def convert_soup(self, soup: Any) -> str:
FILE: packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py
class OutlookMsgConverter (line 24) | class OutlookMsgConverter(DocumentConverter):
method accepts (line 32) | def accepts(
method convert (line 73) | def convert(
method _get_stream_data (line 127) | def _get_stream_data(self, msg: Any, stream_path: str) -> Union[str, N...
FILE: packages/markitdown/src/markitdown/converters/_pdf_converter.py
function _merge_partial_numbering_lines (line 14) | def _merge_partial_numbering_lines(text: str) -> str:
function _to_markdown_table (line 78) | def _to_markdown_table(table: list[list[str]], include_separator: bool =...
function _extract_form_content_from_words (line 120) | def _extract_form_content_from_words(page: Any) -> str | None:
function _extract_tables_from_words (line 398) | def _extract_tables_from_words(page: Any) -> list[list[list[str]]]:
class PdfConverter (line 495) | class PdfConverter(DocumentConverter):
method accepts (line 502) | def accepts(
method convert (line 520) | def convert(
FILE: packages/markitdown/src/markitdown/converters/_plain_text_converter.py
class PlainTextConverter (line 33) | class PlainTextConverter(DocumentConverter):
method accepts (line 36) | def accepts(
method convert (line 60) | def convert(
FILE: packages/markitdown/src/markitdown/converters/_pptx_converter.py
class PptxConverter (line 34) | class PptxConverter(DocumentConverter):
method __init__ (line 39) | def __init__(self):
method accepts (line 43) | def accepts(
method convert (line 61) | def convert(
method _is_picture (line 202) | def _is_picture(self, shape):
method _is_table (line 210) | def _is_table(self, shape):
method _convert_table_to_markdown (line 215) | def _convert_table_to_markdown(self, table, **kwargs):
method _convert_chart_to_markdown (line 235) | def _convert_chart_to_markdown(self, chart):
FILE: packages/markitdown/src/markitdown/converters/_rss_converter.py
class RssConverter (line 29) | class RssConverter(DocumentConverter):
method __init__ (line 32) | def __init__(self):
method accepts (line 36) | def accepts(
method _check_xml (line 63) | def _check_xml(self, file_stream: BinaryIO) -> bool:
method _feed_type (line 74) | def _feed_type(self, doc: Any) -> str | None:
method convert (line 84) | def convert(
method _parse_atom_type (line 101) | def _parse_atom_type(self, doc: Document) -> DocumentConverterResult:
method _parse_rss_type (line 133) | def _parse_rss_type(self, doc: Document) -> DocumentConverterResult:
method _parse_content (line 170) | def _parse_content(self, content: str) -> str:
method _get_data_by_tag_name (line 179) | def _get_data_by_tag_name(
FILE: packages/markitdown/src/markitdown/converters/_transcribe_audio.py
function transcribe_audio (line 23) | def transcribe_audio(file_stream: BinaryIO, *, audio_format: str = "wav"...
FILE: packages/markitdown/src/markitdown/converters/_wikipedia_converter.py
class WikipediaConverter (line 20) | class WikipediaConverter(DocumentConverter):
method accepts (line 23) | def accepts(
method convert (line 51) | def convert(
FILE: packages/markitdown/src/markitdown/converters/_xlsx_converter.py
class XlsxConverter (line 36) | class XlsxConverter(DocumentConverter):
method __init__ (line 41) | def __init__(self):
method accepts (line 45) | def accepts(
method convert (line 63) | def convert(
class XlsConverter (line 98) | class XlsConverter(DocumentConverter):
method __init__ (line 103) | def __init__(self):
method accepts (line 107) | def accepts(
method convert (line 125) | def convert(
FILE: packages/markitdown/src/markitdown/converters/_youtube_converter.py
class YouTubeConverter (line 37) | class YouTubeConverter(DocumentConverter):
method accepts (line 40) | def accepts(
method convert (line 70) | def convert(
method _get (line 199) | def _get(
method _findKey (line 211) | def _findKey(self, json: Any, key: str) -> Union[str, None]: # TODO: ...
method _retry_operation (line 226) | def _retry_operation(self, operation, retries=3, delay=2):
FILE: packages/markitdown/src/markitdown/converters/_zip_converter.py
class ZipConverter (line 22) | class ZipConverter(DocumentConverter):
method __init__ (line 61) | def __init__(
method accepts (line 69) | def accepts(
method convert (line 87) | def convert(
FILE: packages/markitdown/tests/_test_vectors.py
class FileTestVector (line 6) | class FileTestVector(object):
FILE: packages/markitdown/tests/test_cli_misc.py
function test_version (line 9) | def test_version() -> None:
function test_invalid_flag (line 18) | def test_invalid_flag() -> None:
FILE: packages/markitdown/tests/test_cli_vectors.py
function shared_tmp_dir (line 39) | def shared_tmp_dir(tmp_path_factory):
function test_output_to_stdout (line 44) | def test_output_to_stdout(shared_tmp_dir, test_vector) -> None:
function test_output_to_file (line 66) | def test_output_to_file(shared_tmp_dir, test_vector) -> None:
function test_input_from_stdin_without_hints (line 98) | def test_input_from_stdin_without_hints(shared_tmp_dir, test_vector) -> ...
function test_convert_url (line 132) | def test_convert_url(shared_tmp_dir, test_vector):
function test_output_to_file_with_data_uris (line 152) | def test_output_to_file_with_data_uris(shared_tmp_dir, test_vector) -> N...
FILE: packages/markitdown/tests/test_docintel_html.py
function _make_converter (line 9) | def _make_converter(file_types):
function test_docintel_accepts_html_extension (line 15) | def test_docintel_accepts_html_extension():
function test_docintel_accepts_html_mimetype (line 21) | def test_docintel_accepts_html_mimetype():
FILE: packages/markitdown/tests/test_module_misc.py
function validate_strings (line 100) | def validate_strings(result, expected_strings, exclude_strings=None):
function test_stream_info_operations (line 110) | def test_stream_info_operations() -> None:
function test_data_uris (line 182) | def test_data_uris() -> None:
function test_file_uris (line 223) | def test_file_uris() -> None:
function test_docx_comments (line 255) | def test_docx_comments() -> None:
function test_docx_equations (line 264) | def test_docx_equations() -> None:
function test_input_as_strings (line 277) | def test_input_as_strings() -> None:
function test_doc_rlink (line 291) | def test_doc_rlink() -> None:
function test_markitdown_remote (line 336) | def test_markitdown_remote() -> None:
function test_speech_transcription (line 354) | def test_speech_transcription() -> None:
function test_exceptions (line 370) | def test_exceptions() -> None:
function test_markitdown_exiftool (line 389) | def test_markitdown_exiftool() -> None:
function test_markitdown_llm_parameters (line 415) | def test_markitdown_llm_parameters() -> None:
function test_markitdown_llm (line 463) | def test_markitdown_llm() -> None:
FILE: packages/markitdown/tests/test_module_vectors.py
function test_guess_stream_info (line 28) | def test_guess_stream_info(test_vector):
function test_convert_local (line 58) | def test_convert_local(test_vector):
function test_convert_stream_with_hints (line 72) | def test_convert_stream_with_hints(test_vector):
function test_convert_stream_without_hints (line 93) | def test_convert_stream_without_hints(test_vector):
function test_convert_http_uri (line 110) | def test_convert_http_uri(test_vector):
function test_convert_file_uri (line 127) | def test_convert_file_uri(test_vector):
function test_convert_data_uri (line 142) | def test_convert_data_uri(test_vector):
function test_convert_keep_data_uris (line 163) | def test_convert_keep_data_uris(test_vector):
function test_convert_stream_keep_data_uris (line 181) | def test_convert_stream_keep_data_uris(test_vector):
FILE: packages/markitdown/tests/test_pdf_masterformat.py
class TestMasterFormatPartialNumbering (line 14) | class TestMasterFormatPartialNumbering:
method test_partial_numbering_pattern_regex (line 17) | def test_partial_numbering_pattern_regex(self):
method test_masterformat_partial_numbering_not_split (line 34) | def test_masterformat_partial_numbering_not_split(self):
method test_masterformat_content_preserved (line 73) | def test_masterformat_content_preserved(self):
method test_merge_partial_numbering_with_empty_lines_between (line 115) | def test_merge_partial_numbering_with_empty_lines_between(self):
method test_multiple_partial_numberings_all_merged (line 148) | def test_multiple_partial_numberings_all_merged(self):
FILE: packages/markitdown/tests/test_pdf_memory.py
function _has_fpdf2 (line 24) | def _has_fpdf2() -> bool:
function _make_form_page (line 33) | def _make_form_page():
function _make_plain_page (line 52) | def _make_plain_page():
function _mock_pdfplumber_open (line 70) | def _mock_pdfplumber_open(pages):
class TestPdfMemoryOptimization (line 83) | class TestPdfMemoryOptimization:
method test_page_close_called_on_every_page (line 86) | def test_page_close_called_on_every_page(self):
method test_plain_text_pdf_falls_back_to_pdfminer (line 116) | def test_plain_text_pdf_falls_back_to_pdfminer(self):
method test_plain_text_pdf_still_closes_all_pages (line 150) | def test_plain_text_pdf_still_closes_all_pages(self):
method test_mixed_pdf_uses_form_extraction_per_page (line 177) | def test_mixed_pdf_uses_form_extraction_per_page(self):
method test_only_one_pdfplumber_open_call (line 222) | def test_only_one_pdfplumber_open_call(self):
method test_real_pdf_page_cleanup (line 249) | def test_real_pdf_page_cleanup(self):
function _generate_table_pdf (line 271) | def _generate_table_pdf(num_pages: int) -> bytes:
class TestPdfMemoryBenchmark (line 300) | class TestPdfMemoryBenchmark:
method test_memory_does_not_grow_linearly (line 303) | def test_memory_does_not_grow_linearly(self):
method test_memory_constant_across_page_counts (line 333) | def test_memory_constant_across_page_counts(self):
FILE: packages/markitdown/tests/test_pdf_tables.py
function validate_strings (line 14) | def validate_strings(result, expected_strings, exclude_strings=None):
function validate_markdown_table (line 24) | def validate_markdown_table(result, expected_headers, expected_data_samp...
function extract_markdown_tables (line 40) | def extract_markdown_tables(text_content):
function validate_table_structure (line 74) | def validate_table_structure(table):
class TestPdfTableExtraction (line 97) | class TestPdfTableExtraction:
method markitdown (line 101) | def markitdown(self):
method test_borderless_table_extraction (line 105) | def test_borderless_table_extraction(self, markitdown):
method test_borderless_table_no_duplication (line 273) | def test_borderless_table_no_duplication(self, markitdown):
method test_borderless_table_correct_position (line 293) | def test_borderless_table_correct_position(self, markitdown):
method test_receipt_pdf_extraction (line 337) | def test_receipt_pdf_extraction(self, markitdown):
method test_multipage_invoice_extraction (line 495) | def test_multipage_invoice_extraction(self, markitdown):
method test_academic_pdf_extraction (line 577) | def test_academic_pdf_extraction(self, markitdown):
method test_scanned_pdf_handling (line 629) | def test_scanned_pdf_handling(self, markitdown):
method test_movie_theater_booking_pdf_extraction (line 654) | def test_movie_theater_booking_pdf_extraction(self, markitdown):
class TestPdfFullOutputComparison (line 722) | class TestPdfFullOutputComparison:
method markitdown (line 726) | def markitdown(self):
method test_movie_theater_full_output (line 730) | def test_movie_theater_full_output(self, markitdown):
method test_sparse_borderless_table_full_output (line 779) | def test_sparse_borderless_table_full_output(self, markitdown):
method test_repair_multipage_full_output (line 825) | def test_repair_multipage_full_output(self, markitdown):
method test_receipt_full_output (line 867) | def test_receipt_full_output(self, markitdown):
method test_academic_paper_full_output (line 910) | def test_academic_paper_full_output(self, markitdown):
method test_medical_scan_full_output (line 951) | def test_medical_scan_full_output(self, markitdown):
class TestPdfTableMarkdownFormat (line 981) | class TestPdfTableMarkdownFormat:
method markitdown (line 985) | def markitdown(self):
method test_markdown_table_has_pipe_format (line 989) | def test_markdown_table_has_pipe_format(self, markitdown):
method test_markdown_table_columns_have_pipes (line 1013) | def test_markdown_table_columns_have_pipes(self, markitdown):
class TestPdfTableStructureConsistency (line 1040) | class TestPdfTableStructureConsistency:
method markitdown (line 1044) | def markitdown(self):
method test_borderless_table_structure (line 1048) | def test_borderless_table_structure(self, markitdown):
method test_multipage_invoice_table_structure (line 1068) | def test_multipage_invoice_table_structure(self, markitdown):
method test_receipt_has_no_tables (line 1095) | def test_receipt_has_no_tables(self, markitdown):
method test_scanned_pdf_no_tables (line 1115) | def test_scanned_pdf_no_tables(self, markitdown):
method test_all_pdfs_table_rows_consistent (line 1136) | def test_all_pdfs_table_rows_consistent(self, markitdown):
method test_borderless_table_data_integrity (line 1174) | def test_borderless_table_data_integrity(self, markitdown):
Condensed preview — 137 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (1,712K chars).
[
{
"path": ".devcontainer/devcontainer.json",
"chars": 1141,
"preview": "// For format details, see https://aka.ms/devcontainer.json. For config options, see the\n// README at: https://github.co"
},
{
"path": ".dockerignore",
"chars": 13,
"preview": "*\n!packages/\n"
},
{
"path": ".gitattributes",
"chars": 206,
"preview": "packages/markitdown/tests/test_files/** linguist-vendored\npackages/markitdown-sample-plugin/tests/test_files/** linguist"
},
{
"path": ".github/dependabot.yml",
"chars": 118,
"preview": "version: 2\nupdates:\n - package-ecosystem: \"github-actions\"\n directory: \"/\"\n schedule:\n interval: \"weekly\"\n"
},
{
"path": ".github/workflows/pre-commit.yml",
"chars": 438,
"preview": "name: pre-commit\non: [pull_request]\n\njobs:\n pre-commit:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/che"
},
{
"path": ".github/workflows/tests.yml",
"chars": 382,
"preview": "name: tests\non: [pull_request]\n\njobs:\n tests:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v5\n "
},
{
"path": ".gitignore",
"chars": 3197,
"preview": ".vscode\n\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution "
},
{
"path": ".pre-commit-config.yaml",
"chars": 125,
"preview": "repos:\n - repo: https://github.com/psf/black\n rev: 23.7.0 # Use the latest version of Black\n hooks:\n - id: b"
},
{
"path": "CODE_OF_CONDUCT.md",
"chars": 444,
"preview": "# Microsoft Open Source Code of Conduct\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://op"
},
{
"path": "Dockerfile",
"chars": 684,
"preview": "FROM python:3.13-slim-bullseye\n\nENV DEBIAN_FRONTEND=noninteractive\nENV EXIFTOOL_PATH=/usr/bin/exiftool\nENV FFMPEG_PATH=/"
},
{
"path": "LICENSE",
"chars": 1141,
"preview": " MIT License\n\n Copyright (c) Microsoft Corporation.\n\n Permission is hereby granted, free of charge, to any pers"
},
{
"path": "README.md",
"chars": 11195,
"preview": "# MarkItDown\n\n[](https://pypi.org/project/markitdown/)\n![PyPI - Dow"
},
{
"path": "SECURITY.md",
"chars": 2656,
"preview": "<!-- BEGIN MICROSOFT SECURITY.MD V0.0.9 BLOCK -->\n\n## Security\n\nMicrosoft takes the security of our software products an"
},
{
"path": "SUPPORT.md",
"chars": 1244,
"preview": "# TODO: The maintainer of this repo has not yet edited this file\r\n\r\n**REPO OWNER**: Do you want Customer Service & Suppo"
},
{
"path": "packages/markitdown/README.md",
"chars": 1402,
"preview": "# MarkItDown\n\n> [!IMPORTANT]\n> MarkItDown is a Python package and command-line utility for converting various files to M"
},
{
"path": "packages/markitdown/ThirdPartyNotices.md",
"chars": 12799,
"preview": "# THIRD-PARTY SOFTWARE NOTICES AND INFORMATION\n\n**Do Not Translate or Localize**\n\nThis project incorporates components f"
},
{
"path": "packages/markitdown/pyproject.toml",
"chars": 2660,
"preview": "[build-system]\nrequires = [\"hatchling\"]\nbuild-backend = \"hatchling.build\"\n\n[project]\nname = \"markitdown\"\ndynamic = [\"ver"
},
{
"path": "packages/markitdown/src/markitdown/__about__.py",
"chars": 132,
"preview": "# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>\n#\n# SPDX-License-Identifier: MIT\n__version__ "
},
{
"path": "packages/markitdown/src/markitdown/__init__.py",
"chars": 899,
"preview": "# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>\n#\n# SPDX-License-Identifier: MIT\n\nfrom .__abo"
},
{
"path": "packages/markitdown/src/markitdown/__main__.py",
"chars": 6504,
"preview": "# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>\n#\n# SPDX-License-Identifier: MIT\nimport argpa"
},
{
"path": "packages/markitdown/src/markitdown/_base_converter.py",
"chars": 4530,
"preview": "from typing import Any, BinaryIO, Optional\nfrom ._stream_info import StreamInfo\n\n\nclass DocumentConverterResult:\n \"\"\""
},
{
"path": "packages/markitdown/src/markitdown/_exceptions.py",
"chars": 2459,
"preview": "from typing import Optional, List, Any\n\nMISSING_DEPENDENCY_MESSAGE = \"\"\"{converter} recognized the input as a potential "
},
{
"path": "packages/markitdown/src/markitdown/_markitdown.py",
"chars": 30612,
"preview": "import mimetypes\nimport os\nimport re\nimport sys\nimport shutil\nimport traceback\nimport io\nfrom dataclasses import datacla"
},
{
"path": "packages/markitdown/src/markitdown/_stream_info.py",
"chars": 1076,
"preview": "from dataclasses import dataclass, asdict\nfrom typing import Optional\n\n\n@dataclass(kw_only=True, frozen=True)\nclass Stre"
},
{
"path": "packages/markitdown/src/markitdown/_uri_utils.py",
"chars": 1580,
"preview": "import base64\nimport os\nfrom typing import Tuple, Dict\nfrom urllib.request import url2pathname\nfrom urllib.parse import "
},
{
"path": "packages/markitdown/src/markitdown/converter_utils/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "packages/markitdown/src/markitdown/converter_utils/docx/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "packages/markitdown/src/markitdown/converter_utils/docx/math/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "packages/markitdown/src/markitdown/converter_utils/docx/math/latex_dict.py",
"chars": 6650,
"preview": "# -*- coding: utf-8 -*-\n\n\"\"\"\nAdapted from https://github.com/xiilei/dwml/blob/master/dwml/latex_dict.py\nOn 25/03/2025\n\"\""
},
{
"path": "packages/markitdown/src/markitdown/converter_utils/docx/math/omml.py",
"chars": 10481,
"preview": "# -*- coding: utf-8 -*-\n\n\"\"\"\nOffice Math Markup Language (OMML)\nAdapted from https://github.com/xiilei/dwml/blob/master/"
},
{
"path": "packages/markitdown/src/markitdown/converter_utils/docx/pre_process.py",
"chars": 6342,
"preview": "import zipfile\nfrom io import BytesIO\nfrom typing import BinaryIO\nfrom xml.etree import ElementTree as ET\n\nfrom bs4 impo"
},
{
"path": "packages/markitdown/src/markitdown/converters/__init__.py",
"chars": 1495,
"preview": "# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>\n#\n# SPDX-License-Identifier: MIT\n\nfrom ._plai"
},
{
"path": "packages/markitdown/src/markitdown/converters/_audio_converter.py",
"chars": 3047,
"preview": "from typing import Any, BinaryIO\n\nfrom ._exiftool import exiftool_metadata\nfrom ._transcribe_audio import transcribe_aud"
},
{
"path": "packages/markitdown/src/markitdown/converters/_bing_serp_converter.py",
"chars": 3885,
"preview": "import re\nimport base64\nimport binascii\nfrom urllib.parse import parse_qs, urlparse\nfrom typing import Any, BinaryIO\nfro"
},
{
"path": "packages/markitdown/src/markitdown/converters/_csv_converter.py",
"chars": 2278,
"preview": "import csv\nimport io\nfrom typing import BinaryIO, Any\nfrom charset_normalizer import from_bytes\nfrom .._base_converter i"
},
{
"path": "packages/markitdown/src/markitdown/converters/_doc_intel_converter.py",
"chars": 9437,
"preview": "import sys\nimport re\nimport os\nfrom typing import BinaryIO, Any, List\nfrom enum import Enum\n\nfrom .._base_converter impo"
},
{
"path": "packages/markitdown/src/markitdown/converters/_docx_converter.py",
"chars": 2521,
"preview": "import sys\nimport io\nfrom warnings import warn\n\nfrom typing import BinaryIO, Any\n\nfrom ._html_converter import HtmlConve"
},
{
"path": "packages/markitdown/src/markitdown/converters/_epub_converter.py",
"chars": 5513,
"preview": "import os\nimport zipfile\nfrom defusedxml import minidom\nfrom xml.dom.minidom import Document\n\nfrom typing import BinaryI"
},
{
"path": "packages/markitdown/src/markitdown/converters/_exiftool.py",
"chars": 1448,
"preview": "import json\nimport locale\nimport subprocess\nfrom typing import Any, BinaryIO, Union\n\n\ndef _parse_version(version: str) -"
},
{
"path": "packages/markitdown/src/markitdown/converters/_html_converter.py",
"chars": 2720,
"preview": "import io\nfrom typing import Any, BinaryIO, Optional\nfrom bs4 import BeautifulSoup\n\nfrom .._base_converter import Docume"
},
{
"path": "packages/markitdown/src/markitdown/converters/_image_converter.py",
"chars": 4073,
"preview": "from typing import BinaryIO, Any, Union\nimport base64\nimport mimetypes\nfrom ._exiftool import exiftool_metadata\nfrom .._"
},
{
"path": "packages/markitdown/src/markitdown/converters/_ipynb_converter.py",
"chars": 3384,
"preview": "from typing import BinaryIO, Any\nimport json\n\nfrom .._base_converter import DocumentConverter, DocumentConverterResult\nf"
},
{
"path": "packages/markitdown/src/markitdown/converters/_llm_caption.py",
"chars": 1445,
"preview": "from typing import BinaryIO, Union\nimport base64\nimport mimetypes\nfrom .._stream_info import StreamInfo\n\n\ndef llm_captio"
},
{
"path": "packages/markitdown/src/markitdown/converters/_markdownify.py",
"chars": 4323,
"preview": "import re\nimport markdownify\n\nfrom typing import Any, Optional\nfrom urllib.parse import quote, unquote, urlparse, urlunp"
},
{
"path": "packages/markitdown/src/markitdown/converters/_outlook_msg_converter.py",
"chars": 4954,
"preview": "import sys\nfrom typing import Any, Union, BinaryIO\nfrom .._stream_info import StreamInfo\nfrom .._base_converter import D"
},
{
"path": "packages/markitdown/src/markitdown/converters/_pdf_converter.py",
"chars": 20540,
"preview": "import sys\nimport io\nimport re\nfrom typing import BinaryIO, Any\n\nfrom .._base_converter import DocumentConverter, Docume"
},
{
"path": "packages/markitdown/src/markitdown/converters/_plain_text_converter.py",
"chars": 1997,
"preview": "import sys\n\nfrom typing import BinaryIO, Any\nfrom charset_normalizer import from_bytes\nfrom .._base_converter import Doc"
},
{
"path": "packages/markitdown/src/markitdown/converters/_pptx_converter.py",
"chars": 9953,
"preview": "import sys\nimport base64\nimport os\nimport io\nimport re\nimport html\n\nfrom typing import BinaryIO, Any\nfrom operator impor"
},
{
"path": "packages/markitdown/src/markitdown/converters/_rss_converter.py",
"chars": 6605,
"preview": "from defusedxml import minidom\nfrom xml.dom.minidom import Document, Element\nfrom typing import BinaryIO, Any, Union\nfro"
},
{
"path": "packages/markitdown/src/markitdown/converters/_transcribe_audio.py",
"chars": 1899,
"preview": "import io\nimport sys\nfrom typing import BinaryIO\nfrom .._exceptions import MissingDependencyException\n\n# Try loading opt"
},
{
"path": "packages/markitdown/src/markitdown/converters/_wikipedia_converter.py",
"chars": 2589,
"preview": "import re\nimport bs4\nfrom typing import Any, BinaryIO\n\nfrom .._base_converter import DocumentConverter, DocumentConverte"
},
{
"path": "packages/markitdown/src/markitdown/converters/_xlsx_converter.py",
"chars": 4905,
"preview": "import sys\nfrom typing import BinaryIO, Any\nfrom ._html_converter import HtmlConverter\nfrom .._base_converter import Doc"
},
{
"path": "packages/markitdown/src/markitdown/converters/_youtube_converter.py",
"chars": 8565,
"preview": "import json\nimport time\nimport re\nimport bs4\nfrom typing import Any, BinaryIO, Dict, List, Union\nfrom urllib.parse impor"
},
{
"path": "packages/markitdown/src/markitdown/converters/_zip_converter.py",
"chars": 3639,
"preview": "import zipfile\nimport io\nimport os\n\nfrom typing import BinaryIO, Any, TYPE_CHECKING\n\nfrom .._base_converter import Docum"
},
{
"path": "packages/markitdown/src/markitdown/py.typed",
"chars": 0,
"preview": ""
},
{
"path": "packages/markitdown/tests/__init__.py",
"chars": 108,
"preview": "# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>\n#\n# SPDX-License-Identifier: MIT\n"
},
{
"path": "packages/markitdown/tests/_test_vectors.py",
"chars": 9839,
"preview": "import dataclasses\nfrom typing import List\n\n\n@dataclasses.dataclass(frozen=True, kw_only=True)\nclass FileTestVector(obje"
},
{
"path": "packages/markitdown/tests/test_cli_misc.py",
"chars": 1160,
"preview": "#!/usr/bin/env python3 -m pytest\nimport subprocess\nfrom markitdown import __version__\n\n# This file contains CLI tests th"
},
{
"path": "packages/markitdown/tests/test_cli_vectors.py",
"chars": 6944,
"preview": "#!/usr/bin/env python3 -m pytest\nimport os\nimport time\nimport pytest\nimport subprocess\nimport locale\nfrom typing import "
},
{
"path": "packages/markitdown/tests/test_docintel_html.py",
"chars": 940,
"preview": "import io\nfrom markitdown.converters._doc_intel_converter import (\n DocumentIntelligenceConverter,\n DocumentIntell"
},
{
"path": "packages/markitdown/tests/test_files/expected_outputs/MEDRPT-2024-PAT-3847_medical_report_scan.md",
"chars": 0,
"preview": ""
},
{
"path": "packages/markitdown/tests/test_files/expected_outputs/RECEIPT-2024-TXN-98765_retail_purchase.md",
"chars": 1483,
"preview": "TECHMART ELECTRONICS\n4567 Innovation Blvd\nSan Francisco, CA 94103\n(415) 555-0199\n\n===================================\n\nS"
},
{
"path": "packages/markitdown/tests/test_files/expected_outputs/REPAIR-2022-INV-001_multipage.md",
"chars": 5089,
"preview": "ZAVA AUTO REPAIR\nCertified Collision Repair\n123 Main Street, Redmond, WA 98052\nPhone: (425) 000-0000\nPreliminary Estimat"
},
{
"path": "packages/markitdown/tests/test_files/expected_outputs/SPARSE-2024-INV-1234_borderless_table.md",
"chars": 2711,
"preview": "INVENTORY RECONCILIATION REPORT\nReport ID: SPARSE-2024-INV-1234\nWarehouse: Distribution Center East\nReport Date: 2024-11"
},
{
"path": "packages/markitdown/tests/test_files/expected_outputs/movie-theater-booking-2024.md",
"chars": 4030,
"preview": "BOOKING ORDER\nPrint Date 12/15/2024 14:30:22\nPage 1 of 1\nSTARLIGHT CINEMAS\nOrders\n| Order / Rev: | 2024-12-5678 | "
},
{
"path": "packages/markitdown/tests/test_files/expected_outputs/test.md",
"chars": 5194,
"preview": "1\n\nIntroduction\n\nLarge language models (LLMs) are becoming a crucial building block in developing powerful agents\nthat u"
},
{
"path": "packages/markitdown/tests/test_files/test.json",
"chars": 229,
"preview": "{\n \"key1\": \"string_value\",\n \"key2\": 1234,\n \"key3\": [\n \"list_value1\",\n \"list_value2\"\n ],\n \"5"
},
{
"path": "packages/markitdown/tests/test_files/test_blog.html",
"chars": 25955,
"preview": "<!doctype html>\n<html lang=\"en\" dir=\"ltr\" class=\"blog-wrapper blog-post-page plugin-blog plugin-id-default\" data-has-hyd"
},
{
"path": "packages/markitdown/tests/test_files/test_mskanji.csv",
"chars": 32,
"preview": "O,N,Z\r\nY,30,\r\nO؉pq,25,\r\n~,35,É\r\n"
},
{
"path": "packages/markitdown/tests/test_files/test_notebook.ipynb",
"chars": 1408,
"preview": "{\n \"cells\": [\n {\n \"cell_type\": \"markdown\",\n \"id\": \"0f61db80\",\n \"metadata\": {},\n \"source\": [\n \"# Test Noteboo"
},
{
"path": "packages/markitdown/tests/test_files/test_rss.xml",
"chars": 280005,
"preview": "<rss xmlns:dc=\"http://purl.org/dc/elements/1.1/\" xmlns:ns0=\"http://www.w3.org/2005/Atom\" xmlns:ns1=\"http://purl.org/rss/"
},
{
"path": "packages/markitdown/tests/test_files/test_serp.html",
"chars": 439622,
"preview": "<!DOCTYPE html><html dir=\"ltr\" lang=\"en\" xml:lang=\"en\" xmlns=\"http://www.w3.org/1999/xhtml\" xmlns:Web=\"http://schemas.li"
},
{
"path": "packages/markitdown/tests/test_files/test_wikipedia.html",
"chars": 392420,
"preview": "<!DOCTYPE html>\n<html class=\"client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-"
},
{
"path": "packages/markitdown/tests/test_module_misc.py",
"chars": 18003,
"preview": "#!/usr/bin/env python3 -m pytest\nimport io\nimport os\nimport re\nimport shutil\nimport pytest\nfrom unittest.mock import Mag"
},
{
"path": "packages/markitdown/tests/test_module_vectors.py",
"chars": 7782,
"preview": "#!/usr/bin/env python3 -m pytest\nimport os\nimport time\nimport pytest\nimport base64\n\nfrom pathlib import Path\n\nif __name_"
},
{
"path": "packages/markitdown/tests/test_pdf_masterformat.py",
"chars": 7255,
"preview": "#!/usr/bin/env python3 -m pytest\n\"\"\"Tests for MasterFormat-style partial numbering in PDF conversion.\"\"\"\n\nimport os\nimpo"
},
{
"path": "packages/markitdown/tests/test_pdf_memory.py",
"chars": 12636,
"preview": "#!/usr/bin/env python3 -m pytest\n\"\"\"Tests for PDF converter memory optimization.\n\nVerifies that:\n- page.close() is calle"
},
{
"path": "packages/markitdown/tests/test_pdf_tables.py",
"chars": 45659,
"preview": "#!/usr/bin/env python3 -m pytest\n\"\"\"Tests for PDF table extraction functionality.\"\"\"\n\nimport os\nimport re\nimport pytest\n"
},
{
"path": "packages/markitdown-mcp/Dockerfile",
"chars": 570,
"preview": "FROM python:3.13-slim-bullseye\n\nENV DEBIAN_FRONTEND=noninteractive\nENV EXIFTOOL_PATH=/usr/bin/exiftool\nENV FFMPEG_PATH=/"
},
{
"path": "packages/markitdown-mcp/README.md",
"chars": 3949,
"preview": "# MarkItDown-MCP\n\n[](https://pypi.org/project/markitdown-mcp/)\n"
},
{
"path": "packages/markitdown-mcp/pyproject.toml",
"chars": 1777,
"preview": "[build-system]\nrequires = [\"hatchling\"]\nbuild-backend = \"hatchling.build\"\n\n[project]\nname = \"markitdown-mcp\"\ndynamic = ["
},
{
"path": "packages/markitdown-mcp/src/markitdown_mcp/__about__.py",
"chars": 132,
"preview": "# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>\n#\n# SPDX-License-Identifier: MIT\n__version__ "
},
{
"path": "packages/markitdown-mcp/src/markitdown_mcp/__init__.py",
"chars": 178,
"preview": "# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>\n#\n# SPDX-License-Identifier: MIT\n\nfrom .__abo"
},
{
"path": "packages/markitdown-mcp/src/markitdown_mcp/__main__.py",
"chars": 3803,
"preview": "import contextlib\nimport sys\nimport os\nfrom collections.abc import AsyncIterator\nfrom mcp.server.fastmcp import FastMCP\n"
},
{
"path": "packages/markitdown-mcp/src/markitdown_mcp/py.typed",
"chars": 0,
"preview": ""
},
{
"path": "packages/markitdown-mcp/tests/__init__.py",
"chars": 108,
"preview": "# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>\n#\n# SPDX-License-Identifier: MIT\n"
},
{
"path": "packages/markitdown-ocr/LICENSE",
"chars": 1141,
"preview": " MIT License\n\n Copyright (c) Microsoft Corporation.\n\n Permission is hereby granted, free of charge, to any pers"
},
{
"path": "packages/markitdown-ocr/README.md",
"chars": 5940,
"preview": "# MarkItDown OCR Plugin\n\nLLM Vision plugin for MarkItDown that extracts text from images embedded in PDF, DOCX, PPTX, an"
},
{
"path": "packages/markitdown-ocr/pyproject.toml",
"chars": 1708,
"preview": "[build-system]\nrequires = [\"hatchling\"]\nbuild-backend = \"hatchling.build\"\n\n[project]\nname = \"markitdown-ocr\"\ndynamic = ["
},
{
"path": "packages/markitdown-ocr/src/markitdown_ocr/__about__.py",
"chars": 106,
"preview": "# SPDX-FileCopyrightText: 2025-present Contributors\n# SPDX-License-Identifier: MIT\n\n__version__ = \"0.1.0\"\n"
},
{
"path": "packages/markitdown-ocr/src/markitdown_ocr/__init__.py",
"chars": 893,
"preview": "# SPDX-FileCopyrightText: 2025-present Contributors\n# SPDX-License-Identifier: MIT\n\n\"\"\"\nmarkitdown-ocr: OCR plugin for M"
},
{
"path": "packages/markitdown-ocr/src/markitdown_ocr/_docx_converter_with_ocr.py",
"chars": 6400,
"preview": "\"\"\"\nEnhanced DOCX Converter with OCR support for embedded images.\nExtracts images from Word documents and performs OCR w"
},
{
"path": "packages/markitdown-ocr/src/markitdown_ocr/_ocr_service.py",
"chars": 3328,
"preview": "\"\"\"\nOCR Service Layer for MarkItDown\nProvides LLM Vision-based image text extraction.\n\"\"\"\n\nimport base64\nfrom typing imp"
},
{
"path": "packages/markitdown-ocr/src/markitdown_ocr/_pdf_converter_with_ocr.py",
"chars": 16548,
"preview": "\"\"\"\nEnhanced PDF Converter with OCR support for embedded images.\nExtracts images from PDFs and performs OCR while mainta"
},
{
"path": "packages/markitdown-ocr/src/markitdown_ocr/_plugin.py",
"chars": 2504,
"preview": "\"\"\"\nPlugin registration for markitdown-ocr.\nRegisters OCR-enhanced converters with priority-based replacement strategy.\n"
},
{
"path": "packages/markitdown-ocr/src/markitdown_ocr/_pptx_converter_with_ocr.py",
"chars": 8990,
"preview": "\"\"\"\nEnhanced PPTX Converter with improved OCR support.\nAlready has LLM-based image description, this enhances it with tr"
},
{
"path": "packages/markitdown-ocr/src/markitdown_ocr/_xlsx_converter_with_ocr.py",
"chars": 7739,
"preview": "\"\"\"\nEnhanced XLSX Converter with OCR support for embedded images.\nExtracts images from Excel spreadsheets and performs O"
},
{
"path": "packages/markitdown-ocr/tests/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "packages/markitdown-ocr/tests/test_docx_converter.py",
"chars": 7520,
"preview": "\"\"\"\nUnit tests for DocxConverterWithOCR.\n\nFor each DOCX test file: convert with a mock OCR service then compare the\nfull"
},
{
"path": "packages/markitdown-ocr/tests/test_pdf_converter.py",
"chars": 7847,
"preview": "\"\"\"\nUnit tests for PdfConverterWithOCR.\n\nFor each PDF test file: convert with a mock OCR service then compare the\nfull o"
},
{
"path": "packages/markitdown-ocr/tests/test_pptx_converter.py",
"chars": 4950,
"preview": "\"\"\"\nUnit tests for PptxConverterWithOCR.\n\nFor each PPTX test file: convert with a mock OCR service then compare the\nfull"
},
{
"path": "packages/markitdown-ocr/tests/test_xlsx_converter.py",
"chars": 7895,
"preview": "\"\"\"\nUnit tests for XlsxConverterWithOCR.\n\nFor each XLSX test file: convert with a mock OCR service then compare the\nfull"
},
{
"path": "packages/markitdown-sample-plugin/README.md",
"chars": 3434,
"preview": "# MarkItDown Sample Plugin\n\n[](https://pypi.org/proje"
},
{
"path": "packages/markitdown-sample-plugin/pyproject.toml",
"chars": 1972,
"preview": "[build-system]\nrequires = [\"hatchling\"]\nbuild-backend = \"hatchling.build\"\n\n[project]\nname = \"markitdown-sample-plugin\"\nd"
},
{
"path": "packages/markitdown-sample-plugin/src/markitdown_sample_plugin/__about__.py",
"chars": 132,
"preview": "# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>\n#\n# SPDX-License-Identifier: MIT\n__version__ "
},
{
"path": "packages/markitdown-sample-plugin/src/markitdown_sample_plugin/__init__.py",
"chars": 346,
"preview": "# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>\n#\n# SPDX-License-Identifier: MIT\n\nfrom ._plug"
},
{
"path": "packages/markitdown-sample-plugin/src/markitdown_sample_plugin/_plugin.py",
"chars": 1824,
"preview": "import locale\nfrom typing import BinaryIO, Any\nfrom striprtf.striprtf import rtf_to_text\n\nfrom markitdown import (\n M"
},
{
"path": "packages/markitdown-sample-plugin/src/markitdown_sample_plugin/py.typed",
"chars": 0,
"preview": ""
},
{
"path": "packages/markitdown-sample-plugin/tests/__init__.py",
"chars": 108,
"preview": "# SPDX-FileCopyrightText: 2024-present Adam Fourney <adamfo@microsoft.com>\n#\n# SPDX-License-Identifier: MIT\n"
},
{
"path": "packages/markitdown-sample-plugin/tests/test_files/test.rtf",
"chars": 51180,
"preview": "{\\rtf1\\adeflang1025\\ansi\\ansicpg1252\\uc1\\adeff31507\\deff0\\stshfdbch31506\\stshfloch31506\\stshfhich31506\\stshfbi31507\\defl"
},
{
"path": "packages/markitdown-sample-plugin/tests/test_sample_plugin.py",
"chars": 1312,
"preview": "#!/usr/bin/env python3 -m pytest\nimport os\n\nfrom markitdown import MarkItDown, StreamInfo\nfrom markitdown_sample_plugin "
}
]
// ... and 26 more files (download for full content)
About this extraction
This page contains the full source code of the microsoft/markitdown GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 137 files (1.6 MB), approximately 501.9k tokens, and a symbol index with 383 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.