Full Code of lukas-blecher/LaTeX-OCR for AI

Repository: lukas-blecher/LaTeX-OCR
Branch: main
Commit: 5c1ac929bd19
Files: 95
Total size: 1.4 MB

Directory structure:
gitextract_jxo7rqko/

├── .gitignore
├── .readthedocs.yaml
├── LICENSE
├── MANIFEST.in
├── README.md
├── docker/
│   ├── api.dockerfile
│   └── build-api.sh
├── docs/
│   ├── Makefile
│   ├── conf.py
│   ├── index.rst
│   ├── installation.md
│   ├── make.bat
│   ├── pix2tex.rst
│   └── requirements.txt
├── notebooks/
│   ├── LaTeX_OCR_test.ipynb
│   └── LaTeX_OCR_training.ipynb
├── pix2tex/
│   ├── __init__.py
│   ├── __main__.py
│   ├── api/
│   │   ├── __init__.py
│   │   ├── app.py
│   │   ├── run.py
│   │   └── streamlit.py
│   ├── cli.py
│   ├── dataset/
│   │   ├── __init__.py
│   │   ├── arxiv.py
│   │   ├── data/
│   │   │   └── .gitkeep
│   │   ├── dataset.py
│   │   ├── demacro-test.py
│   │   ├── demacro.py
│   │   ├── extract_latex.py
│   │   ├── latex2png.py
│   │   ├── postprocess.py
│   │   ├── preprocessing/
│   │   │   ├── __init__.py
│   │   │   ├── generate_latex_vocab.py
│   │   │   ├── preprocess_formulas.py
│   │   │   ├── preprocess_latex.js
│   │   │   └── third_party/
│   │   │       ├── README.md
│   │   │       ├── katex/
│   │   │       │   ├── .#katex.js
│   │   │       │   ├── LICENSE.txt
│   │   │       │   ├── README.md
│   │   │       │   ├── cli.js
│   │   │       │   ├── katex.js
│   │   │       │   ├── package.json
│   │   │       │   └── src/
│   │   │       │       ├── Lexer.js
│   │   │       │       ├── Options.js
│   │   │       │       ├── ParseError.js
│   │   │       │       ├── Parser.js
│   │   │       │       ├── Settings.js
│   │   │       │       ├── Style.js
│   │   │       │       ├── buildCommon.js
│   │   │       │       ├── buildHTML.js
│   │   │       │       ├── buildMathML.js
│   │   │       │       ├── buildTree.js
│   │   │       │       ├── delimiter.js
│   │   │       │       ├── domTree.js
│   │   │       │       ├── environments.js
│   │   │       │       ├── fontMetrics.js
│   │   │       │       ├── fontMetricsData.js
│   │   │       │       ├── functions.js
│   │   │       │       ├── mathMLTree.js
│   │   │       │       ├── parseData.js
│   │   │       │       ├── parseTree.js
│   │   │       │       ├── symbols.js
│   │   │       │       └── utils.js
│   │   │       └── match-at/
│   │   │           ├── README.md
│   │   │           ├── lib/
│   │   │           │   └── matchAt.js
│   │   │           └── package.json
│   │   ├── render.py
│   │   ├── scraping.py
│   │   └── transforms.py
│   ├── eval.py
│   ├── gui.py
│   ├── model/
│   │   ├── __init__.py
│   │   ├── checkpoints/
│   │   │   ├── __init__.py
│   │   │   └── get_latest_checkpoint.py
│   │   ├── dataset/
│   │   │   └── tokenizer.json
│   │   └── settings/
│   │       ├── config-vit.yaml
│   │       ├── config.yaml
│   │       └── debug.yaml
│   ├── models/
│   │   ├── __init__.py
│   │   ├── hybrid.py
│   │   ├── transformer.py
│   │   ├── utils.py
│   │   └── vit.py
│   ├── resources/
│   │   ├── MathJax.js
│   │   ├── __init__.py
│   │   ├── resources.py
│   │   └── resources.qrc
│   ├── setup_desktop.py
│   ├── train.py
│   ├── train_resizer.py
│   └── utils/
│       ├── __init__.py
│       └── utils.py
├── setup.cfg
└── setup.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
# lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

notebooks/**
!notebooks/LaTeX_OCR*.ipynb
.ipynb_checkpoints/
**/dataset/data/**
wandb/
pix2tex/model/checkpoints/**
!pix2tex/model/checkpoints/*.py
!**/.gitkeep
.vscode
.DS_Store
test/*



================================================
FILE: .readthedocs.yaml
================================================
# .readthedocs.yaml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Set the version of Python and other tools you might need
build:
  os: ubuntu-20.04
  tools:
    python: "3.9"

# Build documentation in the docs/ directory with Sphinx
sphinx:
   configuration: docs/conf.py

# If using Sphinx, optionally build your docs in additional formats such as PDF
# formats:
#    - pdf

# Optionally declare the Python requirements required to build your docs
python:
   install:
    - requirements: docs/requirements.txt
    - method: pip
      path: .
      extra_requirements:
        - all


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2021 Lukas Blecher

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: MANIFEST.in
================================================
exclude **\*.pth 


================================================
FILE: README.md
================================================
# pix2tex - LaTeX OCR

[![GitHub](https://img.shields.io/github/license/lukas-blecher/LaTeX-OCR)](https://github.com/lukas-blecher/LaTeX-OCR) [![Documentation Status](https://readthedocs.org/projects/pix2tex/badge/?version=latest)](https://pix2tex.readthedocs.io/en/latest/?badge=latest) [![PyPI](https://img.shields.io/pypi/v/pix2tex?logo=pypi)](https://pypi.org/project/pix2tex) [![PyPI - Downloads](https://img.shields.io/pypi/dm/pix2tex?logo=pypi)](https://pypi.org/project/pix2tex) [![GitHub all releases](https://img.shields.io/github/downloads/lukas-blecher/LaTeX-OCR/total?color=blue&logo=github)](https://github.com/lukas-blecher/LaTeX-OCR/releases) [![Docker Pulls](https://img.shields.io/docker/pulls/lukasblecher/pix2tex?logo=docker)](https://hub.docker.com/r/lukasblecher/pix2tex) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lukas-blecher/LaTeX-OCR/blob/main/notebooks/LaTeX_OCR_test.ipynb) [![Hugging Face Spaces](https://img.shields.io/badge/🤗%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/lukbl/LaTeX-OCR)

The goal of this project is to create a learning based system that takes an image of a math formula and returns corresponding LaTeX code. 

![header](https://user-images.githubusercontent.com/55287601/109183599-69431f00-778e-11eb-9809-d42b9451e018.png)

## Using the model
To run the model you need Python 3.7+

If you don't have PyTorch installed, follow their instructions [here](https://pytorch.org/get-started/locally/).

Install the package `pix2tex`: 

```
pip install "pix2tex[gui]"
```

Model checkpoints will be downloaded automatically.

There are four ways to get a prediction from an image. 
1. You can use the command line tool by calling `pix2tex`. Here you can parse already existing images from the disk and images in your clipboard.

2. Thanks to [@katie-lim](https://github.com/katie-lim), you can use a nice user interface as a quick way to get the model prediction. Just call the GUI with `latexocr`. From here you can take a screenshot and the predicted latex code is rendered using [MathJax](https://www.mathjax.org/) and copied to your clipboard.

    Under Linux, it is possible to use the GUI with `gnome-screenshot` (which comes with multiple monitor support). On Wayland, `grim` and `slurp` will be used for wlroots-based compositors and `spectacle` for KDE Plasma. Note that `gnome-screenshot` is not compatible with wlroots or Qt based compositors. Since `gnome-screenshot` will be preferred when available, you may have to set the environment variable `SCREENSHOT_TOOL` to `grim` or `spectacle` in these cases (other available values are `gnome-screenshot` and `pil`).

    ![demo](https://user-images.githubusercontent.com/55287601/117812740-77b7b780-b262-11eb-81f6-fc19766ae2ae.gif)

    If the model is unsure about what's in the image, it might output a different prediction every time you click "Retry". With the `temperature` parameter you can control this behavior (a low temperature will produce the same result every time).

3. You can use an API. This has additional dependencies. Install via `pip install -U "pix2tex[api]"` and run
    ```bash
    python -m pix2tex.api.run
    ```
    to start a [Streamlit](https://streamlit.io/) demo that connects to the API at port 8502. There is also a Docker image available for the API: https://hub.docker.com/r/lukasblecher/pix2tex [![Docker Image Size (latest by date)](https://img.shields.io/docker/image-size/lukasblecher/pix2tex?logo=docker)](https://hub.docker.com/r/lukasblecher/pix2tex)

    ```
    docker pull lukasblecher/pix2tex:api
    docker run --rm -p 8502:8502 lukasblecher/pix2tex:api
    ```
    To also run the streamlit demo run
    ```
    docker run --rm -it -p 8501:8501 --entrypoint python lukasblecher/pix2tex:api pix2tex/api/run.py
    ```
    and navigate to http://localhost:8501/

4. Use from within Python
    ```python
    from PIL import Image
    from pix2tex.cli import LatexOCR
    
    img = Image.open('path/to/image.png')
    model = LatexOCR()
    print(model(img))
    ```
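
The Python usage above extends naturally to several images. The following is a minimal sketch, not part of the package: `predict_many` and its `open_image` parameter are hypothetical names introduced here, and the only thing assumed about the model is what the snippet above shows, namely that a `LatexOCR` instance can be called with a PIL image and returns the predicted LaTeX string.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def predict_many(model: Callable, paths: Iterable[str],
                 open_image: Optional[Callable] = None) -> Dict[str, str]:
    """Run `model` on every image path and collect the predictions by file name."""
    if open_image is None:
        # Imported lazily so Pillow is only needed when files are actually opened.
        from PIL import Image
        open_image = Image.open
    results = {}
    for path in paths:
        img = open_image(path)
        results[Path(path).name] = model(img)
    return results


# Hypothetical usage, assuming pix2tex is installed:
#   from pix2tex.cli import LatexOCR
#   predictions = predict_many(LatexOCR(), ["a.png", "b.png"])
```

Creating the model once and reusing it for all images avoids reloading the checkpoint for every prediction.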

The model works best with images of smaller resolution. That's why I added a preprocessing step where another neural network predicts the optimal resolution of the input image. This model will automatically resize the custom image to best resemble the training data and thus increase performance on images found in the wild. Still, it's not perfect and might not be able to handle huge images optimally, so don't zoom in all the way before taking a picture. 

Always double check the result carefully. You can try to redo the prediction with another resolution if the answer was wrong.
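
To see why repeated predictions can differ and why a low `temperature` makes them stable, here is a small illustrative sketch of temperature-scaled sampling. It is a generic textbook formulation, not the code used by the model; the function names and the example scores are made up for illustration.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Draw one token index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With a temperature close to zero almost all probability mass sits on the highest-scoring token, so every retry picks the same token. With a higher temperature the alternatives get a real chance of being sampled.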

**Want to use the package?**

I'm currently putting the documentation together. 

Visit here: https://pix2tex.readthedocs.io/ 


## Training the model [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lukas-blecher/LaTeX-OCR/blob/main/notebooks/LaTeX_OCR_training.ipynb)

Install a couple of dependencies `pip install "pix2tex[train]"`.
1. First we need to combine the images with their ground truth labels. I wrote a dataset class (which needs further improving) that saves the relative paths to the images with the LaTeX code they were rendered with. To generate the dataset pickle file run 

```
python -m pix2tex.dataset.dataset --equations path_to_textfile --images path_to_images --out dataset.pkl
```
To use your own tokenizer pass it via `--tokenizer` (See below).

You can also find my generated training data on [Google Drive](https://drive.google.com/drive/folders/13CA4vAmOmD_I_dSbvLp-Lf0s6KiaNfuO) (formulae.zip - images, math.txt - labels). Repeat the step for the validation and test data. All use the same label text file.

2. Edit the `data` (and `valdata`) entry in the config file to the newly generated `.pkl` file. Change other hyperparameters if you want to. See `pix2tex/model/settings/config.yaml` for a template.
3. Now for the actual training run 
```
python -m pix2tex.train --config path_to_config_file
```

If you want to use your own data you might be interested in creating your own tokenizer with
```
python -m pix2tex.dataset.dataset --equations path_to_textfile --vocab-size 8000 --out tokenizer.json
```
Don't forget to update the path to the tokenizer in the config file and set `num_tokens` to your vocabulary size.
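
Since a mismatch between the config and the generated files is easy to miss, a quick sanity check before training can help. The sketch below is not part of the package: `check_config` and `vocab_size_from_json` are hypothetical helpers, the config is passed as an already loaded dictionary, and the tokenizer file is assumed to follow the Hugging Face `tokenizers` JSON layout with the vocabulary stored under `model.vocab`.

```python
import json
import os
from typing import Dict, List


def vocab_size_from_json(path: str) -> int:
    # Assumption: Hugging Face `tokenizers` layout, vocabulary is a
    # token -> id mapping stored under model.vocab.
    with open(path, encoding="utf-8") as f:
        return len(json.load(f)["model"]["vocab"])


def check_config(config: Dict, required=("data", "valdata", "tokenizer")) -> List[str]:
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    for key in required:
        value = config.get(key)
        if not value:
            problems.append(f"missing entry: {key}")
        elif not os.path.exists(value):
            problems.append(f"{key} does not exist: {value}")
    tokenizer = config.get("tokenizer")
    if tokenizer and os.path.exists(tokenizer):
        size = vocab_size_from_json(tokenizer)
        if config.get("num_tokens") != size:
            problems.append(
                f"num_tokens is {config.get('num_tokens')}, tokenizer has {size} tokens")
    return problems
```

If the tokenizer defines additional special tokens outside `model.vocab`, the count may differ slightly, so treat a reported mismatch as a hint to look closer rather than as proof of an error.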

## Model
The model consists of a ViT [[1](#References)] encoder with a ResNet backbone and a Transformer [[2](#References)] decoder.

### Performance
| BLEU score | normed edit distance | token accuracy |
| ---------- | -------------------- | -------------- |
| 0.88       | 0.10                 | 0.60           |
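
For readers unfamiliar with the middle column, the following sketch shows one common way to define a normalized edit distance: the Levenshtein distance divided by the length of the longer sequence. This is an illustration under that assumption, not necessarily the exact formula used to produce the numbers in the table.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def normed_edit_distance(pred: Sequence, truth: Sequence) -> float:
    """Edit distance scaled to [0, 1] by the length of the longer sequence."""
    longest = max(len(pred), len(truth))
    return edit_distance(pred, truth) / longest if longest else 0.0
```

Both functions work on strings as well as on token lists, so the distance can be computed per character or per LaTeX token. Lower values are better, and 0 means the prediction matches the ground truth exactly.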

## Data
We need paired data for the network to learn. Luckily there is a lot of LaTeX code on the internet, e.g. [wikipedia](https://www.wikipedia.org), [arXiv](https://www.arxiv.org). We also use the formulae from the [im2latex-100k](https://zenodo.org/record/56198#.V2px0jXT6eA) [[3](#References)] dataset.
All of it can be found [here](https://drive.google.com/drive/folders/13CA4vAmOmD_I_dSbvLp-Lf0s6KiaNfuO)

### Dataset Requirements
In order to render the math in many different fonts we use XeLaTeX, generate a PDF and finally convert it to a PNG. This pipeline needs some third-party tools: 
* [XeLaTeX](https://www.ctan.org/pkg/xetex)
* [ImageMagick](https://imagemagick.org/) with [Ghostscript](https://www.ghostscript.com/index.html). (for converting pdf to png)
* [Node.js](https://nodejs.org/) to run [KaTeX](https://github.com/KaTeX/KaTeX) (for normalizing LaTeX code)
* Python 3.7+ & dependencies (specified in `setup.py`)
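
The rendering steps described above (XeLaTeX to PDF, then ImageMagick to PNG) can be sketched roughly as follows. This is a simplified illustration and not the repository's rendering code: the helper names, the default density and the use of ImageMagick's `convert` executable (called `magick` in ImageMagick 7) are assumptions.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple


def build_render_commands(tex_file: str, out_dir: str,
                          density: int = 200) -> Tuple[List[str], List[str]]:
    """Build the two command lines: .tex -> .pdf with XeLaTeX, .pdf -> .png with ImageMagick."""
    tex = Path(tex_file)
    out = Path(out_dir)
    pdf = out / (tex.stem + ".pdf")  # XeLaTeX names the PDF after the .tex file
    png = out / (tex.stem + ".png")
    xelatex_cmd = ["xelatex", "-interaction=nonstopmode",
                   f"-output-directory={out}", str(tex)]
    # -density sets the rasterization resolution, -trim crops the empty border.
    convert_cmd = ["convert", "-density", str(density), str(pdf), "-trim", str(png)]
    return xelatex_cmd, convert_cmd


def render(tex_file: str, out_dir: str, density: int = 200) -> None:
    """Run both steps; raises CalledProcessError if one of the tools fails."""
    for cmd in build_render_commands(tex_file, out_dir, density):
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from the execution makes it possible to inspect or log the commands without having the tools installed.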

### Fonts
Latin Modern Math, GFSNeohellenicMath.otf, Asana Math, XITS Math, Cambria Math


## TODO
- [x] add more evaluation metrics
- [x] create a GUI
- [ ] add beam search
- [ ] support handwritten formulae (kinda done, see training colab notebook)
- [ ] reduce model size (distillation)
- [ ] find optimal hyperparameters
- [ ] tweak model structure
- [ ] fix data scraping and scrape more data
- [ ] trace the model ([#2](https://github.com/lukas-blecher/LaTeX-OCR/issues/2))


## Contribution
Contributions of any kind are welcome.

## Acknowledgment
Code taken and modified from [lucidrains](https://github.com/lucidrains), [rwightman](https://github.com/rwightman/pytorch-image-models), [im2markup](https://github.com/harvardnlp/im2markup), [arxiv_leaks](https://github.com/soskek/arxiv_leaks), [pkra: Mathjax](https://github.com/pkra/MathJax-single-file), [harupy: snipping tool](https://github.com/harupy/snipping-tool)

## References
[1] [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929)

[2] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

[3] [Image-to-Markup Generation with Coarse-to-Fine Attention](https://arxiv.org/abs/1609.04938v2)


================================================
FILE: docker/api.dockerfile
================================================
FROM python:3.8-slim
# Quote the requirement: unquoted, the shell treats ">=1.7.1" as an output redirection.
RUN pip install "torch>=1.7.1"
WORKDIR /latexocr
ADD pix2tex /latexocr/pix2tex/
ADD setup.py /latexocr/
ADD README.md /latexocr/
RUN pip install -e .[api]
RUN python -m pix2tex.model.checkpoints.get_latest_checkpoint

ENTRYPOINT ["uvicorn", "pix2tex.api.app:app", "--host", "0.0.0.0", "--port", "8502"]


================================================
FILE: docker/build-api.sh
================================================
# cd into proj. root
cd "$(dirname "$0")"
cd ..
docker build -t lukasblecher/pix2tex:api -f docker/api.dockerfile .


================================================
FILE: docs/Makefile
================================================
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS    ?=
SPHINXBUILD   ?= sphinx-build
SOURCEDIR     = .
BUILDDIR      = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)


================================================
FILE: docs/conf.py
================================================
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))


# -- Project information -----------------------------------------------------

project = 'LaTeX-OCR'
copyright = '2022, Lukas Blecher'
author = 'Lukas Blecher'


# -- General configuration ---------------------------------------------------

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['myst_parser', 'sphinx.ext.autodoc']

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']

autoclass_content = 'both'
# -- Options for HTML output -------------------------------------------------

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']


================================================
FILE: docs/index.rst
================================================
.. LaTeX-OCR documentation master file, created by
   sphinx-quickstart on Sun May  1 16:39:27 2022.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to LaTeX-OCR's documentation!
=====================================

.. |ico| image:: https://img.shields.io/badge/LaTeX--OCR-visit-a?style=social&logo=github
   :target: https://github.com/lukas-blecher/LaTeX-OCR

This is the documentation for LaTeX-OCR |ico|. The goal of this project is to find the LaTeX code corresponding to a given image of an equation.


.. toctree::
   :maxdepth: 2
   :caption: Contents:

   installation
   pix2tex


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`


================================================
FILE: docs/installation.md
================================================
Installation
============

Python package
--------------

To run the model you need Python 3.7+

If you don't have PyTorch installed, follow their instructions [here](https://pytorch.org/get-started/locally/).

Install the package `pix2tex`: 

```
pip install "pix2tex[gui]"
```

Model checkpoints will be downloaded automatically when first running the script.

To install
- with GUI dependencies use tag `[gui]`.
- with training dependencies use tag `[train]`.
- with api dependencies use tag `[api]`.
- all dependencies use tag `[all]`.

Docker
------

The API can be used from a docker container, available on [DockerHub](https://hub.docker.com/r/lukasblecher/pix2tex)
```
docker pull lukasblecher/pix2tex:api
docker run -p 8502:8502 lukasblecher/pix2tex:api
```
This starts the API which is available at port 8502.
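
The container needs a moment to load the model before it accepts connections. As a small convenience, the following sketch waits until the port is reachable. It only checks that something is listening on the port and makes no assumption about the API's endpoints; `wait_for_port` is a hypothetical helper, not part of the package.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True as soon as a TCP connection succeeds, False once `timeout` has elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Hypothetical usage after `docker run`:
#   if wait_for_port("localhost", 8502):
#       print("API is up")
```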

To use the [Streamlit](https://streamlit.io/) demo run instead
```
docker run -it -p 8501:8501 --entrypoint python lukasblecher/pix2tex:api pix2tex/api/run.py
```
and navigate to [http://localhost:8501/](http://localhost:8501/)


================================================
FILE: docs/make.bat
================================================
@ECHO OFF

pushd %~dp0

REM Command file for Sphinx documentation

if "%SPHINXBUILD%" == "" (
	set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build

%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
	echo.
	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
	echo.installed, then set the SPHINXBUILD environment variable to point
	echo.to the full path of the 'sphinx-build' executable. Alternatively you
	echo.may add the Sphinx directory to PATH.
	echo.
	echo.If you don't have Sphinx installed, grab it from
	echo.https://www.sphinx-doc.org/
	exit /b 1
)

if "%1" == "" goto help

%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end

:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd


================================================
FILE: docs/pix2tex.rst
================================================
pix2tex
=======

pix2tex.cli package
-------------------

.. automodule:: pix2tex.cli
   :members:
   :no-undoc-members:
   :show-inheritance:


pix2tex.gui package
-------------------

.. automodule:: pix2tex.gui
   :members:
   :no-undoc-members:
   :show-inheritance:

pix2tex.api package
-------------------

Submodules
~~~~~~~~~~

pix2tex.api.app module
~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: pix2tex.api.app
   :members:
   :no-undoc-members:
   :show-inheritance:


pix2tex.api.streamlit module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: pix2tex.api.streamlit
   :members:
   :no-undoc-members:
   :show-inheritance:

pix2tex.dataset package
-----------------------

Submodules
~~~~~~~~~~

pix2tex.dataset.arxiv module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: pix2tex.dataset.arxiv
   :members:
   :no-undoc-members:
   :show-inheritance:

pix2tex.dataset.dataset module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: pix2tex.dataset.dataset
   :members:
   :no-undoc-members:
   :show-inheritance:

pix2tex.dataset.demacro module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: pix2tex.dataset.demacro
   :members:
   :no-undoc-members:
   :show-inheritance:

pix2tex.dataset.extract\_latex module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: pix2tex.dataset.extract_latex
   :members:
   :no-undoc-members:
   :show-inheritance:

pix2tex.dataset.latex2png module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: pix2tex.dataset.latex2png
   :members:
   :no-undoc-members:
   :show-inheritance:

pix2tex.dataset.render module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: pix2tex.dataset.render
   :members:
   :no-undoc-members:
   :show-inheritance:

pix2tex.dataset.scraping module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: pix2tex.dataset.scraping
   :members:
   :no-undoc-members:
   :show-inheritance:

pix2tex.models package
----------------------

pix2tex.models.hybrid module
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: pix2tex.models.hybrid
   :members:
   :no-undoc-members:
   :show-inheritance:

pix2tex.models.vit module
~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: pix2tex.models.vit
   :members:
   :no-undoc-members:
   :show-inheritance:

pix2tex.utils package
---------------------

pix2tex.utils.utils module
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. automodule:: pix2tex.utils.utils
   :members:
   :no-undoc-members:
   :show-inheritance:


================================================
FILE: docs/requirements.txt
================================================
myst_parser
torch>=1.7.1


================================================
FILE: notebooks/LaTeX_OCR_test.ipynb
================================================
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "LaTeX OCR test.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# LaTeX OCR\n",
        "In this colab you can convert an image of an equation into LaTeX code.\n",
        "## How?\n",
        "Execute the cell titled \"Setup\". The first time an error will show up. Simply execute the cell again. Everything should be fine now.\n",
        "\n",
        "Next, execute the cell below and upload the image(s).\n",
        "\n",
        "> Note: You can probably also run this project locally and with a GUI. Follow the steps on [GitHub](https://github.com/lukas-blecher/LaTeX-OCR)"
      ],
      "metadata": {
        "id": "aaAqi3wku23I"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "DQM_PKeCuzWR"
      },
      "outputs": [],
      "source": [
        "#@title Setup\n",
        "%reload_ext autoreload\n",
        "%autoreload\n",
        "import PIL\n",
        "!pip install Pillow -U -qq\n",
        "if int(PIL.__version__.split('.')[0]) < 9:\n",
        "    print('Mandatory restart: Execute this cell again!')\n",
        "    import os\n",
        "    os.kill(os.getpid(), 9)\n",
        "!pip install pix2tex -qq\n",
        "!pip install opencv-python-headless==4.1.2.30 -U -qq\n",
        "\n",
        "def upload_files():\n",
        "  from google.colab import files\n",
        "  from io import BytesIO\n",
        "  uploaded = files.upload()\n",
        "  return [(name, BytesIO(b)) for name, b in uploaded.items()]\n",
        "\n",
        "from pix2tex import cli as pix2tex\n",
        "from PIL import Image\n",
        "model = pix2tex.LatexOCR()\n",
        "\n",
        "from IPython.display import HTML, Math\n",
        "display(HTML(\"<script src='https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.3/\"\n",
        "             \"latest.js?config=default'></script>\"))\n",
        "table = r'\\begin{array} {l|l} %s  \\end{array}'"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "imgs = upload_files()\n",
        "predictions = []\n",
        "for name, f in imgs:\n",
        "    img = Image.open(f)\n",
        "    math = model(img)\n",
        "    print(math)\n",
        "    predictions.append('\\\\mathrm{%s} & \\\\displaystyle{%s}'%(name, math))\n",
        "Math(table%'\\\\\\\\'.join(predictions))"
      ],
      "metadata": {
        "id": "CjrR3O07u3uH"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "ZqCH-4XoCkMO"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}

================================================
FILE: notebooks/LaTeX_OCR_training.ipynb
================================================
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "LaTeX-OCR training.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Train a LaTeX OCR model\n",
        "In this brief notebook I show how you can finetune/train an OCR model.\n",
        "\n",
        "I've opted to mix in handwritten data into the regular pdf LaTeX images. For that I started out with the released pretrained model and continued training on the slightly larger corpus."
      ],
      "metadata": {
        "id": "YtR1GhYwnLnu"
      }
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "r396ah-Q3EQc"
      },
      "source": [
        "!pip install pix2tex[train] -qq"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "dZ4PLwkb3RIs"
      },
      "source": [
        "import os\n",
        "!mkdir -p LaTeX-OCR\n",
        "os.chdir('LaTeX-OCR')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cUsTlxXV3Mot"
      },
      "source": [
        "!pip install gpustat -q\n",
        "!pip install opencv-python-headless==4.1.2.30 -U -q\n",
        "!pip install --upgrade --no-cache-dir gdown -q"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# check what GPU we have\n",
        "!gpustat"
      ],
      "metadata": {
        "id": "uhLzh5vyaCaL"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aAz37dDU21zu"
      },
      "source": [
        "!mkdir -p dataset/data\n",
        "!mkdir images\n",
        "# Google Drive ids\n",
        "# handwritten: 13vjxGYrFCuYnwgDIUqkxsNGKk__D_sOM\n",
        "# pdf - images: 176PKaCUDWmTJdQwc-OfkO0y8t4gLsIvQ\n",
        "# pdf - math: 1QUjX6PFWPa-HBWdcY-7bA5TRVUnbyS1D\n",
        "!gdown -O dataset/data/crohme.zip --id 13vjxGYrFCuYnwgDIUqkxsNGKk__D_sOM\n",
        "!gdown -O dataset/data/pdf.zip --id 176PKaCUDWmTJdQwc-OfkO0y8t4gLsIvQ\n",
        "!gdown -O dataset/data/pdfmath.txt --id 1QUjX6PFWPa-HBWdcY-7bA5TRVUnbyS1D\n",
        "os.chdir('dataset/data')\n",
        "!unzip -q crohme.zip \n",
        "!unzip -q pdf.zip \n",
        "# split handwritten data into val set and train set\n",
        "os.chdir('images')\n",
        "!mkdir ../valimages\n",
        "!ls | shuf -n 1000 | xargs -i mv {} ../valimages\n",
        "os.chdir('../../..')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now we generate the datasets. We can string multiple datasets together to get one large lookup table. The only things saved in these pkl files are the image sizes, the image locations and the ground truth LaTeX code. That way we can serve batches of images with the same dimensionality."
      ],
      "metadata": {
        "id": "2BMuIqRIqG-8"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!python -m pix2tex.dataset.dataset -i dataset/data/images dataset/data/train -e dataset/data/CROHME_math.txt dataset/data/pdfmath.txt -o dataset/data/train.pkl"
      ],
      "metadata": {
        "id": "1JebcEarl-g6"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "!python -m pix2tex.dataset.dataset -i dataset/data/valimages dataset/data/val -e dataset/data/CROHME_math.txt dataset/data/pdfmath.txt -o dataset/data/val.pkl"
      ],
      "metadata": {
        "id": "x_Orutb37xHD"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# download the weights we want to fine tune\n",
        "!curl -L -o weights.pth https://github.com/lukas-blecher/LaTeX-OCR/releases/download/v0.0.1/weights.pth"
      ],
      "metadata": {
        "id": "I3iOyEEBbw58"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# If using wandb\n",
        "!pip install -q wandb \n",
        "# you can cancel this if you don't want to use it or don't have a W&B account.\n",
        "#!wandb login"
      ],
      "metadata": {
        "id": "vow2NnpHmWt0"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# generate colab specific config (set 'debug' to true if wandb is not used)\n",
        "!echo {backbone_layers: [2, 3, 7], betas: [0.9, 0.999], batchsize: 10, bos_token: 1, channels: 1, data: dataset/data/train.pkl, debug: true, decoder_args: {'attn_on_attn': true, 'cross_attend': true, 'ff_glu': true, 'rel_pos_bias': false, 'use_scalenorm': false}, dim: 256, encoder_depth: 4, eos_token: 2, epochs: 50, gamma: 0.9995, heads: 8, id: null, load_chkpt: 'weights.pth', lr: 0.001, lr_step: 30, max_height: 192, max_seq_len: 512, max_width: 672, min_height: 32, min_width: 32, model_path: checkpoints, name: mixed, num_layers: 4, num_tokens: 8000, optimizer: Adam, output_path: outputs, pad: false, pad_token: 0, patch_size: 16, sample_freq: 2000, save_freq: 1, scheduler: StepLR, seed: 42, temperature: 0.2, test_samples: 5, testbatchsize: 20, tokenizer: dataset/tokenizer.json, valbatches: 100, valdata: dataset/data/val.pkl} > colab.yaml"
      ],
      "metadata": {
        "id": "OnsNCLp84QSY"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "c8NU5j2k3z36"
      },
      "source": [
        "!python -m pix2tex.train --config colab.yaml"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        ""
      ],
      "metadata": {
        "id": "g3DU9KxubWgq"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}

================================================
FILE: pix2tex/__init__.py
================================================
import os
os.environ['FOR_DISABLE_CONSOLE_CTRL_HANDLER'] = '1'


================================================
FILE: pix2tex/__main__.py
================================================
#!/usr/bin/env python
def main():
    from argparse import ArgumentParser

    parser = ArgumentParser()
    parser.add_argument('-t', '--temperature', type=float, default=.333, help='Softmax sampling temperature')
    parser.add_argument('-c', '--config', type=str, default='settings/config.yaml', help='path to config file')
    parser.add_argument('-m', '--checkpoint', type=str, default='checkpoints/weights.pth', help='path to weights file')
    parser.add_argument('--no-cuda', action='store_true', help='Compute on CPU')
    parser.add_argument('--no-resize', action='store_true', help='Do not resize the image beforehand')

    parser.add_argument('-s', '--show', action='store_true', help='Show the rendered predicted latex code (cli only)')
    parser.add_argument('-k', '--katex', action='store_true', help='Render the latex code in the browser (cli only)')

    parser.add_argument('--gui', action='store_true', help='Use GUI (gui only)')

    parser.add_argument('file', nargs='*', type=str, default=None, help='Predict LaTeX code from image file instead of clipboard (cli only)')
    arguments = parser.parse_args()

    import os
    import sys

    name = os.path.split(sys.argv[0])[-1]
    if arguments.gui or name in ['pix2tex_gui', 'latexocr']:
        from .gui import main
    else:
        from .cli import main
    main(arguments)


if __name__ == '__main__':
    main()


================================================
FILE: pix2tex/api/__init__.py
================================================


================================================
FILE: pix2tex/api/app.py
================================================
# Adapted from https://github.com/kingyiusuen/image-to-latex/blob/main/api/app.py

from http import HTTPStatus
from fastapi import FastAPI, File, UploadFile, Form
from PIL import Image
from io import BytesIO
from pix2tex.cli import LatexOCR

model = None
app = FastAPI(title='pix2tex API')


def read_imagefile(file) -> Image.Image:
    image = Image.open(BytesIO(file))
    return image


@app.on_event('startup')
async def load_model():
    global model
    if model is None:
        model = LatexOCR()


@app.get('/')
def root():
    '''Health check.'''
    response = {
        'message': HTTPStatus.OK.phrase,
        'status-code': HTTPStatus.OK,
        'data': {},
    }
    return response


@app.post('/predict/')
async def predict(file: UploadFile = File(...)) -> str:
    """Predict the Latex code from an image file.

    Args:
        file (UploadFile, optional): Image to predict. Defaults to File(...).

    Returns:
        str: Latex prediction
    """
    global model
    image = Image.open(file.file)
    return model(image)


@app.post('/bytes/')
async def predict_from_bytes(file: bytes = File(...)) -> str:  # , size: str = Form(...)
    """Predict the Latex code from a byte array

    Args:
        file (bytes, optional): Image as byte array. Defaults to File(...).

    Returns:
        str: Latex prediction
    """
    global model
    #size = tuple(int(a) for a in size.split(','))
    image = Image.open(BytesIO(file))
    return model(image, resize=False)


================================================
FILE: pix2tex/api/run.py
================================================
from multiprocessing import Process
import subprocess
import os


def start_api(path='.'):
    subprocess.call(['uvicorn', 'app:app', '--port', '8502'], cwd=path)


def start_frontend(path='.'):
    subprocess.call(['streamlit', 'run', 'streamlit.py'], cwd=path)


if __name__ == '__main__':
    path = os.path.realpath(os.path.dirname(__file__))
    api = Process(target=start_api, kwargs={'path': path})
    api.start()
    frontend = Process(target=start_frontend, kwargs={'path': path})
    frontend.start()
    api.join()
    frontend.join()


================================================
FILE: pix2tex/api/streamlit.py
================================================
import requests
from PIL import Image
import streamlit as st
from st_img_pastebutton import paste
from io import BytesIO
import base64


def encode_image(file):
    _, encoded = file.split(",", 1)
    binary_data = base64.b64decode(encoded)
    bytes_data = BytesIO(binary_data)
    return bytes_data


if __name__ == "__main__":
    st.set_page_config(page_title="LaTeX-OCR")
    st.title("LaTeX OCR")
    st.markdown(
        "Convert images of equations to corresponding LaTeX code.\n\nThis is based on the `pix2tex` module. Check it out [![github](https://img.shields.io/badge/LaTeX--OCR-visit-a?style=social&logo=github)](https://github.com/lukas-blecher/LaTeX-OCR)"
    )

    source = st.radio(
        "Choose the source of the image",
        options=["Upload", "Paste"],
    )

    image = None

    if source == "Upload":
        uploaded_file = st.file_uploader(
            "Upload an image of an equation",
            type=["png", "jpg"],
        )

        if uploaded_file is not None:
            st.image(Image.open(uploaded_file))
            image = uploaded_file.getvalue()

    if source == "Paste":
        pasted_file = paste("Paste an image of an equation")

        if pasted_file is not None:
            image = encode_image(pasted_file)
            st.image(image)

    if st.button("Convert"):
        if image is not None:
            with st.spinner("Computing"):
                response = requests.post(
                    "http://127.0.0.1:8502/predict/", files={"file": image}
                )
            if response.ok:
                latex_code = response.json()
                st.code(latex_code, language="latex")
                st.markdown(f"$\\displaystyle {latex_code}$")
            else:
                st.error(response.text)
        else:
            st.error("No image selected")


================================================
FILE: pix2tex/cli.py
================================================
from pix2tex.dataset.transforms import test_transform
import pandas.io.clipboard as clipboard
from PIL import ImageGrab
from PIL import Image
import os
from pathlib import Path
import sys
from typing import List, Optional, Tuple
import atexit
from contextlib import suppress
import logging
import yaml
import re

with suppress(ImportError, AttributeError):
    import readline

import numpy as np
import torch
from torch._appdirs import user_data_dir
from munch import Munch
from transformers import PreTrainedTokenizerFast
from timm.models.resnetv2 import ResNetV2
from timm.models.layers import StdConv2dSame

from pix2tex.dataset.latex2png import tex2pil
from pix2tex.models import get_model
from pix2tex.utils import *
from pix2tex.model.checkpoints.get_latest_checkpoint import download_checkpoints


def minmax_size(img: Image.Image, max_dimensions: Optional[Tuple[int, int]] = None, min_dimensions: Optional[Tuple[int, int]] = None) -> Image.Image:
    """Resize or pad an image to fit into given dimensions

    Args:
        img (Image): Image to scale up/down.
        max_dimensions (Tuple[int, int], optional): Maximum dimensions. Defaults to None.
        min_dimensions (Tuple[int, int], optional): Minimum dimensions. Defaults to None.

    Returns:
        Image: Image with correct dimensionality
    """
    if max_dimensions is not None:
        ratios = [a/b for a, b in zip(img.size, max_dimensions)]
        if any([r > 1 for r in ratios]):
            size = np.array(img.size)//max(ratios)
            img = img.resize(tuple(size.astype(int)), Image.BILINEAR)
    if min_dimensions is not None:
        # if any image dimension is smaller than min_dimensions, pad that side up to the minimum
        padded_size = [max(img_dim, min_dim) for img_dim, min_dim in zip(img.size, min_dimensions)]
        if padded_size != list(img.size):  # at least one side needs padding
            padded_im = Image.new('L', padded_size, 255)
            padded_im.paste(img, img.getbbox())
            img = padded_im
    return img
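
# Illustrative sketch, not part of the original module: `_demo_minmax_dims` is a
# hypothetical helper that mirrors only the size arithmetic of `minmax_size` above,
# so the resize-then-pad behaviour can be checked without loading an image.
# It assumes sizes are (width, height) tuples, as returned by `PIL.Image.Image.size`.
def _demo_minmax_dims(size, max_dimensions=None, min_dimensions=None):
    """Return the (width, height) that `minmax_size` would produce for `size`."""
    width, height = size
    if max_dimensions is not None:
        ratios = [a / b for a, b in zip((width, height), max_dimensions)]
        if any(r > 1 for r in ratios):
            # shrink both sides by the largest overshoot so the aspect ratio is kept
            scale = max(ratios)
            width, height = int(width // scale), int(height // scale)
    if min_dimensions is not None:
        # sides that are too small are padded (not scaled) up to the minimum
        width = max(width, min_dimensions[0])
        height = max(height, min_dimensions[1])
    return width, height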


class LatexOCR:
    '''Get a prediction of an image in the easiest way'''

    image_resizer = None
    last_pic = None

    @in_model_path()
    def __init__(self, arguments=None):
        """Initialize a LatexOCR model

        Args:
            arguments (Union[Namespace, Munch], optional): Special model parameters. Defaults to None.
        """
        if arguments is None:
            arguments = Munch({'config': 'settings/config.yaml', 'checkpoint': 'checkpoints/weights.pth', 'no_cuda': True, 'no_resize': False})
        logging.getLogger().setLevel(logging.FATAL)
        os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
        with open(arguments.config, 'r') as f:
            params = yaml.load(f, Loader=yaml.FullLoader)
        self.args = parse_args(Munch(params))
        self.args.update(**vars(arguments))
        self.args.wandb = False
        self.args.device = 'cuda' if torch.cuda.is_available() and not self.args.no_cuda else 'cpu'
        if not os.path.exists(self.args.checkpoint):
            download_checkpoints()
        self.model = get_model(self.args)
        self.model.load_state_dict(torch.load(self.args.checkpoint, map_location=self.args.device))
        self.model.eval()

        if 'image_resizer.pth' in os.listdir(os.path.dirname(self.args.checkpoint)) and not arguments.no_resize:
            self.image_resizer = ResNetV2(layers=[2, 3, 3], num_classes=max(self.args.max_dimensions)//32, global_pool='avg', in_chans=1, drop_rate=.05,
                                          preact=True, stem_type='same', conv_layer=StdConv2dSame).to(self.args.device)
            self.image_resizer.load_state_dict(torch.load(os.path.join(os.path.dirname(self.args.checkpoint), 'image_resizer.pth'), map_location=self.args.device))
            self.image_resizer.eval()
        self.tokenizer = PreTrainedTokenizerFast(tokenizer_file=self.args.tokenizer)

    @in_model_path()
    def __call__(self, img=None, resize=True) -> str:
        """Get a prediction from an image

        Args:
            img (Image, optional): Image to predict. Defaults to None.
            resize (bool, optional): Whether to call the resize model. Defaults to True.

        Returns:
            str: predicted Latex code
        """
        if type(img) is bool:
            img = None
        if img is None:
            if self.last_pic is None:
                return ''
            else:
                print('\nLast image is: ', end='')
                img = self.last_pic.copy()
        else:
            self.last_pic = img.copy()
        img = minmax_size(pad(img), self.args.max_dimensions, self.args.min_dimensions)
        if (self.image_resizer is not None and not self.args.no_resize) and resize:
            with torch.no_grad():
                input_image = img.convert('RGB').copy()
                r, w, h = 1, input_image.size[0], input_image.size[1]
                for _ in range(10):
                    h = int(h * r)  # height to resize
                    img = pad(minmax_size(input_image.resize((w, h), Image.Resampling.BILINEAR if r > 1 else Image.Resampling.LANCZOS), self.args.max_dimensions, self.args.min_dimensions))
                    t = test_transform(image=np.array(img.convert('RGB')))['image'][:1].unsqueeze(0)
                    w = (self.image_resizer(t.to(self.args.device)).argmax(-1).item()+1)*32
                    logging.info('%s %s %s', r, img.size, (w, int(input_image.size[1]*r)))
                    if (w == img.size[0]):
                        break
                    r = w/img.size[0]
        else:
            img = np.array(pad(img).convert('RGB'))
            t = test_transform(image=img)['image'][:1].unsqueeze(0)
        im = t.to(self.args.device)

        dec = self.model.generate(im.to(self.args.device), temperature=self.args.get('temperature', .25))
        pred = post_process(token2str(dec, self.tokenizer)[0])
        try:
            clipboard.copy(pred)
        except Exception:
            # copying to the clipboard is best effort (e.g. no clipboard on headless systems)
            pass
        return pred


def output_prediction(pred, args):
    TERM = os.getenv('TERM', 'xterm')
    if not sys.stdout.isatty():
        TERM = 'dumb'
    try:
        from pygments import highlight
        from pygments.lexers import get_lexer_by_name
        from pygments.formatters import get_formatter_by_name

        if TERM.split('-')[-1] == '256color':
            formatter_name = 'terminal256'
        elif TERM != 'dumb':
            formatter_name = 'terminal'
        else:
            formatter_name = None
        if formatter_name:
            formatter = get_formatter_by_name(formatter_name)
            lexer = get_lexer_by_name('tex')
            print(highlight(pred, lexer, formatter), end='')
    except ImportError:
        TERM = 'dumb'
    if TERM == 'dumb':
        print(pred)
    if args.show or args.katex:
        try:
            if args.katex:
                raise ValueError
            tex2pil([f'$${pred}$$'])[0].show()
        except Exception as e:
            # render using katex
            import webbrowser
            from urllib.parse import quote
            url = 'https://katex.org/?data=' + \
                quote('{"displayMode":true,"leqno":false,"fleqn":false,"throwOnError":true,"errorColor":"#cc0000",\
"strict":"warn","output":"htmlAndMathml","trust":false,"code":"%s"}' % pred.replace('\\', '\\\\'))
            webbrowser.open(url)


def predict(model, file, arguments):
    img = None
    if file:
        try:
            img = Image.open(os.path.expanduser(file))
        except Exception as e:
            print(e, end='')
    else:
        try:
            img = ImageGrab.grabclipboard()
        except NotImplementedError as e:
            print(e, end='')
    pred = model(img)
    output_prediction(pred, arguments)

def check_file_path(paths:List[Path], wdir:Optional[Path]=None)->List[str]:
    files = []
    for path in paths:
        if type(path)==str:
            if path=='':
                continue
            path=Path(path)
        pathsi = ([path] if wdir is None else [path, wdir/path])
        for p in pathsi:
            if p.exists():
                files.append(str(p.resolve()))
            elif '*' in path.name:
                files.extend([str(pi.resolve()) for pi in p.parent.glob(p.name)])
    return list(set(files))

def main(arguments):
    path = user_data_dir('pix2tex')
    os.makedirs(path, exist_ok=True)
    history_file = os.path.join(path, 'history.txt')
    with suppress(NameError):
        # user can `ln -s /dev/null ~/.local/share/pix2tex/history.txt` to
        # disable history record
        with suppress(OSError):
            readline.read_history_file(history_file)
        atexit.register(readline.write_history_file, history_file)
    files = check_file_path(arguments.file)
    wdir = Path(os.getcwd())
    with in_model_path():
        model = LatexOCR(arguments)
        if files:
            for file in check_file_path(arguments.file, wdir):
                print(file + ': ', end='')
                predict(model, file, arguments)
                model.last_pic = None
                with suppress(NameError):
                    readline.add_history(file)
            exit()
        pat = re.compile(r't=([\.\d]+)')
        while True:
            try:
                instructions = input('Predict LaTeX code for image ("h" for help). ')
            except KeyboardInterrupt:
                # TODO: make the last line gray
                print("")
                continue
            except EOFError:
                break
            file = instructions.strip()
            ins = file.lower()
            t = pat.match(ins)
            if ins == 'x':
                break
            elif ins in ['?', 'h', 'help']:
                print('''pix2tex help:

    Usage:
        On Windows and macOS you can copy the image into memory and just press ENTER to get a prediction.
        Alternatively you can paste the image file path here and submit.

        You might get a different prediction every time you submit the same image. If the result was close, you
        can predict the same image again by pressing ENTER. If that still does not work, you can change the temperature
        or take another picture at a different resolution (e.g. zoom out and take a screenshot with lower resolution).

        Press "x" to close the program.
        You can interrupt the model if it takes too long by pressing Ctrl+C.

    Visualization:
        You can render the predicted code into a png using XeLaTeX (see README) and get an image file back.
        This is slow and requires a working installation of XeLaTeX. To activate it, type 'show' or set the flag --show.
        Alternatively, you can render the expression in the browser using katex.org. Type 'katex' or set --katex.

    Settings:
        To toggle one of the settings 'show', 'katex' or 'no_resize', just type its name into the console.
        To change the temperature (default=0.333), type "t=0.XX".
                    ''')
                continue
            elif ins in ['show', 'katex', 'no_resize']:
                setattr(arguments, ins, not getattr(arguments, ins, False))
                print('set %s to %s' % (ins, getattr(arguments, ins)))
                continue
            elif t is not None:
                t = t.groups()[0]
                model.args.temperature = float(t)+1e-8
                print('new temperature: T=%.3f' % model.args.temperature)
                continue
            files = check_file_path(file.split(' '), wdir)
            with suppress(KeyboardInterrupt):
                if files:
                    for file in files:
                        if len(files)>1:
                            print(file + ': ', end='')
                        predict(model, file, arguments)
                else:
                    predict(model, file, arguments)
            file = None


================================================
FILE: pix2tex/dataset/__init__.py
================================================


================================================
FILE: pix2tex/dataset/arxiv.py
================================================
# modified from https://github.com/soskek/arxiv_leaks

import argparse
import subprocess
import os
import glob
import re
import sys
import logging
import tarfile
import tempfile
import requests
import urllib.request
from tqdm import tqdm
from urllib.error import HTTPError
from pix2tex.dataset.extract_latex import find_math
from pix2tex.dataset.scraping import recursive_search
from pix2tex.dataset.demacro import *

# logging.getLogger().setLevel(logging.INFO)
arxiv_id = re.compile(r'(?<!\d)(\d{4}\.\d{5})(?!\d)')
arxiv_base = 'https://export.arxiv.org/e-print/'


def get_all_arxiv_ids(text):
    '''returns all arxiv ids present in a string `text`'''
    ids = []
    for id in arxiv_id.findall(text):
        ids.append(id)
    return list(set(ids))
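

# Illustrative sketch, not part of the original module: `_demo_find_arxiv_ids` is a
# hypothetical helper that repeats the `arxiv_id` pattern above to show what it accepts.
# The lookarounds reject ids embedded in longer digit runs, and the pattern only covers
# ids with a five-digit suffix (ids with a four-digit suffix are not matched).
import re


def _demo_find_arxiv_ids(text):
    """Return the sorted, de-duplicated arxiv ids found in `text`."""
    pattern = re.compile(r'(?<!\d)(\d{4}\.\d{5})(?!\d)')
    return sorted(set(pattern.findall(text)))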


def download(url, dir_path='./'):
    idx = os.path.split(url)[-1]
    file_name = idx + '.tar.gz'
    file_path = os.path.join(dir_path, file_name)
    if os.path.exists(file_path):
        return file_path
    logging.info('\tdownload {}'.format(url) + '\n')
    try:
        r = urllib.request.urlretrieve(url, file_path)
        return r[0]
    except HTTPError:
        logging.info('Could not download %s' % url)
        return 0


def read_tex_files(file_path:str, demacro:bool=False)->str:
    """Read all tex files in the latex source at `file_path`. If it is not a `tar.gz` file try to read it as text file.

    Args:
        file_path (str): Path to latex source
        demacro (bool, optional): Deprecated. Call external `de-macro` program. Defaults to False.

    Returns:
        str: All Latex files concatenated into one string.
    """    
    tex = ''
    try:
        with tempfile.TemporaryDirectory() as tempdir:
            try:
                tf = tarfile.open(file_path, 'r')
                tf.extractall(tempdir)
                tf.close()
                texfiles = [os.path.abspath(x) for x in glob.glob(os.path.join(tempdir, '**', '*.tex'), recursive=True)]
            except tarfile.ReadError as e:
                texfiles = [file_path]  # [os.path.join(tempdir, file_path+'.tex')]
            if demacro:
                ret = subprocess.run(['de-macro', *texfiles], cwd=tempdir, capture_output=True)
                if ret.returncode == 0:
                    texfiles = glob.glob(os.path.join(tempdir, '**', '*-clean.tex'), recursive=True)
            for texfile in texfiles:
                try:
                    # close the file handle right after reading
                    with open(texfile, 'r', encoding='utf-8') as f:
                        tex += f.read()
                except UnicodeDecodeError as e:
                    logging.debug(e)
    except Exception as e:
        logging.debug('Could not read %s: %s' % (file_path, str(e)))
        raise e
    tex = pydemacro(tex)
    return tex


def download_paper(arxiv_id, dir_path='./'):
    url = arxiv_base + arxiv_id
    return download(url, dir_path)


def read_paper(targz_path, delete=False, demacro=False):
    paper = ''
    if targz_path != 0:
        paper = read_tex_files(targz_path, demacro=demacro)
        if delete:
            os.remove(targz_path)
    return paper


def parse_arxiv(id, save=None, demacro=True):
    if save is None:
        dir = tempfile.gettempdir()
    else:
        dir = save
    text = read_paper(download_paper(id, dir), delete=save is None, demacro=demacro)

    return find_math(text, wiki=False), []


if __name__ == '__main__':
    # logging.getLogger().setLevel(logging.DEBUG)
    parser = argparse.ArgumentParser(description='Extract math from arxiv')
    parser.add_argument('-m', '--mode', default='top', choices=['top', 'ids', 'dirs'],
                        help='Where to extract code from. top: current 100 arxiv papers (-m top int for any other number of papers), ids: specific arxiv ids. \
                              Usage: `python arxiv.py -m ids id001 [id002 ...]`, dirs: a folder full of .tar.gz files. Usage: `python arxiv.py -m dirs directory [dir2 ...]`')
    parser.add_argument(nargs='*', dest='args', default=[])
    parser.add_argument('-o', '--out', default=os.path.join(os.path.dirname(os.path.realpath(__file__)), 'data'), help='output directory')
    parser.add_argument('-d', '--demacro', dest='demacro', action='store_true',
                        help='Deprecated - Use de-macro (slows down extraction, but may improve quality). Install https://www.ctan.org/pkg/de-macro')
    parser.add_argument('-s', '--save', default=None, type=str, help='When downloading files from arxiv. Where to save the .tar.gz files. Default: Only temporary')
    args = parser.parse_args()
    if '.' in args.out:
        args.out = os.path.dirname(args.out)
    skips = os.path.join(args.out, 'visited_arxiv.txt')
    if os.path.exists(skips):
        skip = open(skips, 'r', encoding='utf-8').read().split('\n')
    else:
        skip = []
    if args.save is not None:
        os.makedirs(args.save, exist_ok=True)
    try:
        if args.mode == 'ids':
            visited, math = recursive_search(parse_arxiv, args.args, skip=skip, unit='paper', save=args.save, demacro=args.demacro)
        elif args.mode == 'top':
            num = 100 if len(args.args) == 0 else int(args.args[0])
            url = 'https://arxiv.org/list/physics/pastweek?skip=0&show=%i' % num  # 'https://arxiv.org/list/hep-th/2203?skip=0&show=100'
            ids = get_all_arxiv_ids(requests.get(url).text)
            math, visited = [], ids
            for id in tqdm(ids):
                try:
                    m, _ = parse_arxiv(id, save=args.save, demacro=args.demacro)
                    math.extend(m)
                except ValueError:
                    pass
        elif args.mode == 'dirs':
            files = []
            for folder in args.args:
                files.extend([os.path.join(folder, p) for p in os.listdir(folder)])
            math, visited = [], []
            for f in tqdm(files):
                try:
                    text = read_paper(f, delete=False, demacro=args.demacro)
                    math.extend(find_math(text, wiki=False))
                    visited.append(os.path.basename(f))
                except DemacroError as e:
                    logging.debug(f + str(e))
                    pass
                except KeyboardInterrupt:
                    break
                except Exception as e:
                    logging.debug(e)
                    raise e
        else:
            raise NotImplementedError
    except KeyboardInterrupt:
        pass
    print('Found %i instances of math latex code' % len(math))
    # print('\n'.join(math))
    # sys.exit(0)
    for l, name in zip([visited, math], ['visited_arxiv.txt', 'math_arxiv.txt']):
        f = os.path.join(args.out, name)
        if not os.path.exists(f):
            open(f, 'w').write('')
        f = open(f, 'a', encoding='utf-8')
        for element in l:
            f.write(element)
            f.write('\n')
        f.close()


================================================
FILE: pix2tex/dataset/data/.gitkeep
================================================


================================================
FILE: pix2tex/dataset/dataset.py
================================================
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence
import numpy as np
import imagesize
import logging
import glob
import os
from os.path import join
from collections import defaultdict
import pickle
import cv2
from transformers import PreTrainedTokenizerFast
from tqdm.auto import tqdm

from pix2tex.utils.utils import in_model_path
from pix2tex.dataset.transforms import train_transform, test_transform
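

# Illustrative sketch, not part of the original module: `_demo_bucket_batches` is a
# hypothetical helper showing the batching idea used by `Im2LatexDataset` below.
# Samples are grouped by image size so that every batch contains images of one shape,
# and incomplete batches are dropped unless `keep_smaller_batches` is set.
# Shuffling is left out here to keep the result deterministic.
from collections import defaultdict


def _demo_bucket_batches(samples, batchsize, keep_smaller_batches=False):
    """Group `samples`, an iterable of ((width, height), item) pairs, into same-size batches."""
    buckets = defaultdict(list)
    for size, item in samples:
        buckets[size].append(item)
    batches = []
    for size, items in buckets.items():
        for i in range(0, len(items), batchsize):
            batch = items[i:i + batchsize]
            if len(batch) < batchsize and not keep_smaller_batches:
                continue
            batches.append((size, batch))
    return batches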



class Im2LatexDataset:
    keep_smaller_batches = False
    shuffle = True
    batchsize = 16
    max_dimensions = (1024, 512)
    min_dimensions = (32, 32)
    max_seq_len = 1024
    pad_token = "[PAD]"
    bos_token = "[BOS]"
    eos_token = "[EOS]"
    pad_token_id = 0
    bos_token_id = 1
    eos_token_id = 2
    transform = train_transform
    data = defaultdict(lambda: [])

    def __init__(self, equations=None, images=None, tokenizer=None, shuffle=True, batchsize=16, max_seq_len=1024,
                 max_dimensions=(1024, 512), min_dimensions=(32, 32), pad=False, keep_smaller_batches=False, test=False):
        """Generates a torch dataset from pairs of `equations` and `images`.

        Args:
            equations (str, optional): Path to equations. Defaults to None.
            images (str, optional): Directory where images are saved. Defaults to None.
            tokenizer (str, optional): Path to saved tokenizer. Defaults to None.
            shuffle (bool, optional): Whether to shuffle the data. Defaults to True.
            batchsize (int, optional): Defaults to 16.
            max_seq_len (int, optional): Defaults to 1024.
            max_dimensions (tuple(int, int), optional): Maximal dimensions the model can handle
            min_dimensions (tuple(int, int), optional): Minimal dimensions the model can handle
            pad (bool): Pad the images to `max_dimensions`. Defaults to False.
            keep_smaller_batches (bool): Whether to also return batches with smaller size than `batchsize`. Defaults to False.
            test (bool): Whether to use the test transformation or not. Defaults to False.
        """

        if images is not None and equations is not None:
            assert tokenizer is not None
            self.images = [path.replace('\\', '/') for path in glob.glob(join(images, '*.png'))]
            self.sample_size = len(self.images)
            eqs = open(equations, 'r').read().split('\n')
            self.indices = [int(os.path.basename(img).split('.')[0]) for img in self.images]
            self.tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer)
            self.shuffle = shuffle
            self.batchsize = batchsize
            self.max_seq_len = max_seq_len
            self.max_dimensions = max_dimensions
            self.min_dimensions = min_dimensions
            self.pad = pad
            self.keep_smaller_batches = keep_smaller_batches
            self.test = test
            # check the image dimension for every image and group them together
            try:
                for i, im in tqdm(enumerate(self.images), total=len(self.images)):
                    width, height = imagesize.get(im)
                    if min_dimensions[0] <= width <= max_dimensions[0] and min_dimensions[1] <= height <= max_dimensions[1]:
                        self.data[(width, height)].append((eqs[self.indices[i]], im))
            except KeyboardInterrupt:
                pass
            self.data = dict(self.data)
            self._get_size()

            iter(self)

    def __len__(self):
        return self.size

    def __iter__(self):
        self.i = 0
        self.transform = test_transform if self.test else train_transform
        self.pairs = []
        for k in self.data:
            info = np.array(self.data[k], dtype=object)
            p = torch.randperm(len(info)) if self.shuffle else torch.arange(len(info))
            for i in range(0, len(info), self.batchsize):
                batch = info[p[i:i+self.batchsize]]
                if len(batch.shape) == 1:
                    batch = batch[None, :]
                if len(batch) < self.batchsize and not self.keep_smaller_batches:
                    continue
                self.pairs.append(batch)
        if self.shuffle:
            self.pairs = np.random.permutation(np.array(self.pairs, dtype=object))
        else:
            self.pairs = np.array(self.pairs, dtype=object)
        self.size = len(self.pairs)
        return self

    def __next__(self):
        if self.i >= self.size:
            raise StopIteration
        self.i += 1
        return self.prepare_data(self.pairs[self.i-1])

    def prepare_data(self, batch):
        """loads images into memory

        Args:
            batch (numpy.array[[str, str]]): array of equations and image path pairs

        Returns:
            tuple(dict, torch.Tensor): tokenized equations and the image batch, or `(None, None)` if the images could not be stacked
        """

        eqs, ims = batch.T
        tok = self.tokenizer(list(eqs), return_token_type_ids=False)
        # pad with bos and eos token
        for k, p in zip(tok, [[self.bos_token_id, self.eos_token_id], [1, 1]]):
            tok[k] = pad_sequence([torch.LongTensor([p[0]]+x+[p[1]]) for x in tok[k]], batch_first=True, padding_value=self.pad_token_id)
        # check if sequence length is too long
        if self.max_seq_len < tok['attention_mask'].shape[1]:
            return next(self)
        images = []
        for path in list(ims):
            im = cv2.imread(path)
            if im is None:
                print(path, 'not found!')
                continue
            im = cv2.cvtColor(im, cv2.COLOR_BGR2RGB)
            if not self.test:
                # sometimes convert to bitmask
                if np.random.random() < .04:
                    im[im != 255] = 0
            images.append(self.transform(image=im)['image'][:1])
        try:
            images = torch.cat(images).float().unsqueeze(1)
        except RuntimeError:
            logging.critical('Images not working: %s' % (' '.join(list(ims))))
            return None, None
        if self.pad:
            h, w = images.shape[2:]
            images = F.pad(images, (0, self.max_dimensions[0]-w, 0, self.max_dimensions[1]-h), value=1)
        return tok, images

    def _get_size(self):
        self.size = 0
        for k in self.data:
            div, mod = divmod(len(self.data[k]), self.batchsize)
            self.size += div  # + (1 if mod > 0 else 0)

    def load(self, filename, args=None):
        """Load a pickled dataset from disk.

        Args:
            filename (str): Path to dataset
            args (optional): Unused.

        Returns:
            The unpickled dataset
        """
        if not os.path.exists(filename):
            with in_model_path():
                tempf = os.path.join('..', filename)
                if os.path.exists(tempf):
                    filename = os.path.realpath(tempf)
        with open(filename, 'rb') as file:
            x = pickle.load(file)
        return x

    def combine(self, x):
        """Combine Im2LatexDataset with another Im2LatexDataset

        Args:
            x (Im2LatexDataset): Dataset to absorb
        """
        for key in x.data.keys():
            if key in self.data.keys():
                self.data[key].extend(x.data[key])
                self.data[key] = list(set(self.data[key]))
            else:
                self.data[key] = x.data[key]
        self._get_size()
        iter(self)

    def save(self, filename):
        """save a pickled version of a dataset

        Args:
            filename (str): Path to dataset
        """
        with open(filename, 'wb') as file:
            pickle.dump(self, file)

    def update(self, **kwargs):
        for k in ['batchsize', 'shuffle', 'pad', 'keep_smaller_batches', 'test', 'max_seq_len']:
            if k in kwargs:
                setattr(self, k, kwargs[k])
        if 'max_dimensions' in kwargs or 'min_dimensions' in kwargs:
            if 'max_dimensions' in kwargs:
                self.max_dimensions = kwargs['max_dimensions']
            if 'min_dimensions' in kwargs:
                self.min_dimensions = kwargs['min_dimensions']
            temp = {}
            for k in self.data:
                if self.min_dimensions[0] <= k[0] <= self.max_dimensions[0] and self.min_dimensions[1] <= k[1] <= self.max_dimensions[1]:
                    temp[k] = self.data[k]
            self.data = temp
        if 'tokenizer' in kwargs:
            tokenizer_file = kwargs['tokenizer']
            if not os.path.exists(tokenizer_file):
                with in_model_path():
                    tokenizer_file = os.path.realpath(tokenizer_file)
            self.tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_file)
        self._get_size()
        iter(self)


def generate_tokenizer(equations, output, vocab_size):
    from tokenizers import Tokenizer, pre_tokenizers
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    tokenizer = Tokenizer(BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    trainer = BpeTrainer(special_tokens=["[PAD]", "[BOS]", "[EOS]"], vocab_size=vocab_size, show_progress=True)
    tokenizer.train(equations, trainer)
    tokenizer.save(path=output, pretty=False)
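

# Illustrative sketch, not part of the original module: how the number of
# batches follows from the per-dimension grouping used in
# `Im2LatexDataset.__iter__`. The helper name `_sketch_count_batches` and its
# `group_sizes` argument (image dimensions -> number of samples) are
# assumptions made for this example only.
def _sketch_count_batches(group_sizes, batchsize, keep_smaller_batches=False):
    total = 0
    for n_samples in group_sizes.values():
        full, rest = divmod(n_samples, batchsize)
        # every group contributes its full batches; a trailing partial batch
        # only counts if smaller batches are kept
        total += full + (1 if rest > 0 and keep_smaller_batches else 0)
    return total


# Usage: two groups with 5 and 3 images and a batch size of 2 give
# 2 + 1 full batches, or 3 + 2 batches once partial batches are kept:
# _sketch_count_batches({(32, 32): 5, (64, 32): 3}, 2)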


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(description='Generate a dataset or train a tokenizer', add_help=False)
    parser.add_argument('-i', '--images', type=str, nargs='+', default=None, help='Image folders')
    parser.add_argument('-e', '--equations', type=str, nargs='+', default=None, help='equations text files')
    parser.add_argument('-t', '--tokenizer', default=None, help='Pretrained tokenizer file')
    parser.add_argument('-o', '--out', type=str, required=True, help='output file')
    parser.add_argument('-s', '--vocab-size', default=8000, type=int, help='vocabulary size when training a tokenizer')
    args = parser.parse_args()
    if args.tokenizer is None:
        with in_model_path():
            args.tokenizer = os.path.realpath(os.path.join('dataset', 'tokenizer.json'))
    if args.images is None and args.equations is not None:
        print('Generate tokenizer')
        generate_tokenizer(args.equations, args.out, args.vocab_size)
    elif args.images is not None and args.equations is not None:
        print('Generate dataset')
        dataset = None
        for images, equations in zip(args.images, args.equations):
            if dataset is None:
                dataset = Im2LatexDataset(equations, images, args.tokenizer)
            else:
                dataset.combine(Im2LatexDataset(equations, images, args.tokenizer))
        dataset.update(batchsize=1, keep_smaller_batches=True)
        dataset.save(args.out)
    else:
        print('Not defined')


================================================
FILE: pix2tex/dataset/demacro-test.py
================================================
import unittest
import re
from pix2tex.dataset.demacro import pydemacro


def norm(s):
    s = re.sub(r'\n+', '\n', s)
    s = re.sub(r'\s+', ' ', s)
    return s.strip()


def f(s):
    return norm(pydemacro(s))


class TestDemacroCases(unittest.TestCase):
    def test_noargs(self):
        inp = r'''
        \newcommand*{\noargs}{sample text}
        \noargs[a]\noargs{b}\noargs
        '''
        expected = r'''sample text[a]sample text{b}sample text'''
        self.assertEqual(f(inp), norm(expected))

    def test_optional_arg(self):
        inp = r'''
        \newcommand{\example}[2][YYY]{Mandatory arg: #2; Optional arg: #1.}     
        \example{BBB}
        \example[XXX]{AAA}
        '''
        expected = r'''
        Mandatory arg: BBB; Optional arg: YYY.
        Mandatory arg: AAA; Optional arg: XXX.
        '''
        self.assertEqual(f(inp), norm(expected))

    def test_optional_arg_and_positional_args(self):
        inp = r'''
        \newcommand{\plusbinomial}[3][2]{(#2 + #3)^{#1}}
        \plusbinomial[4]{y}{x}
        '''
        expected = r'''(y + x)^{4}'''
        self.assertEqual(f(inp), norm(expected))

    def test_alt_definition1(self):
        inp = r'''
        \newcommand\d{replacement}
        \d
        '''
        expected = r'''replacement'''
        self.assertEqual(f(inp), norm(expected))

    def test_arg_with_bs_and_cb(self):
        # definition with 1 argument and with backslashes (bs) and curly brackets (cb) in the definition
        inp = r'''
        \newcommand{\eq}[1]{\begin{equation}#1\end{equation}}
        \eq{\sqrt{2}\approx1.4}
        \eq[unexpected argument]{\sqrt{2}\approx1.4}
        '''
        expected = r'''
        \begin{equation}\sqrt{2}\approx1.4\end{equation}
        \begin{equation}\sqrt{2}\approx1.4\end{equation}
        '''
        self.assertEqual(f(inp), norm(expected))

    def test_multiline_definition(self):
        inp = r'''
        \newcommand{\multiline}[2]{%
        Arg 1: \bf{#1}
        Arg 2: #2
        }
        \multiline{1}{two}
        '''
        expected = r'''
        Arg 1: \bf{1}
        Arg 2: two
        '''
        self.assertEqual(f(inp), norm(expected))

    def test_multiline_definition_alt1(self):
        inp = r'''
        \newcommand{\identity}[1]
        {#1}
        \identity{x}
        '''
        expected = 'x'
        self.assertEqual(f(inp), norm(expected))

    def test_multiline_definition_alt2(self):
        inp = r'''
        \newcommand
        {\identity}[1]{#1}
        \identity{x}
        '''
        expected = 'x'
        self.assertEqual(f(inp), norm(expected))

    def test_multiline_definition_alt3(self):
        inp = r'''
        \newcommand
        {\identity}[1]
        {#1}
        \identity{x}
        '''
        expected = 'x'
        self.assertEqual(f(inp), norm(expected))

    def test_multiline_definition_alt4(self):
        inp = r'''
        \newcommand
        {\identity}
        [1]
        {#1}
        \identity{x}
        '''
        expected = 'x'
        self.assertEqual(f(inp), norm(expected))

    def test_nested_definition(self):
        inp = r'''
        \newcommand{\cmd}[1]{command #1}
        \newcommand{\nested}[2]{\cmd{#1} \cmd{#2}}
        \nested{\alpha}{\beta}
        '''
        expected = r'''
        command \alpha command \beta
        '''
        self.assertEqual(f(inp), norm(expected))

    def test_def(self):
        # check if \def is handled correctly.
        inp = r'''
        \def\defcheck#1#2{Defcheck arg1: #1 arg2: #2}
        \defcheck{1}{two}
        '''
        expected = r'''
        Defcheck arg1: 1 arg2: two
        '''
        self.assertEqual(f(inp), norm(expected))

    def test_multi_def_lines_alt0(self):
        inp = r'''\def\be{\begin{equation}} \def\ee{\end{equation}} %some comment
        \be
        1+1=2
        \ee'''
        expected = r'''
        \begin{equation}
        1+1=2
        \end{equation}
        '''
        self.assertEqual(f(inp), norm(expected))

    def test_multi_def_lines_alt1(self):
        inp = r'''\def\be{\begin{equation}}\def\ee{\end{equation}}
        \be
        1+1=2
        \ee'''
        expected = r'''
        \begin{equation}
        1+1=2
        \end{equation}
        '''
        self.assertEqual(f(inp), norm(expected))

    def test_multi_def_lines_alt2(self):
        inp = r'''\def
        \be{\begin{equation}}
        \def\ee
        {\end{equation}}
        \be
        1+1=2
        \ee'''
        expected = r'''
        \begin{equation}
        1+1=2
        \end{equation}
        '''
        self.assertEqual(f(inp), norm(expected))

    def test_multi_def_lines_alt3(self):
        inp = r'''
        \def\be
        {
            \begin{equation}
        }
        \def
        \ee
        {\end{equation}}
        \be
        1+1=2
        \ee'''
        expected = r'''
        \begin{equation}
        1+1=2
        \end{equation}
        '''
        self.assertEqual(f(inp), norm(expected))

    def test_let_alt0(self):
        inp = r'''\let\a\alpha\let\b=\beta
        \a \b'''
        expected = r'''\alpha \beta'''
        self.assertEqual(f(inp), norm(expected))

    def test_let_alt1(self):
        inp = r'''\let\a\alpha \let\b=\beta
        \a \b'''
        expected = r'''\alpha \beta'''
        self.assertEqual(f(inp), norm(expected))

    def test_let_alt2(self):
        inp = r'''\let\a\alpha \let\b=\beta
        \a \b'''
        expected = r'''\alpha \beta'''
        self.assertEqual(f(inp), norm(expected))

    def test_let_alt3(self):
        inp = r'''
        \let
        \a
        \alpha
        \let\b=
        \beta
        \a \b'''
        expected = r'''\alpha \beta'''
        self.assertEqual(f(inp), norm(expected))


if __name__ == '__main__':
    unittest.main()


================================================
FILE: pix2tex/dataset/demacro.py
================================================
# modified from https://tex.stackexchange.com/a/521639

import argparse
import re
import logging
from collections import Counter
import time
from pix2tex.dataset.extract_latex import remove_labels


class DemacroError(Exception):
    pass


def main():
    args = parse_command_line()
    data = read(args.input)
    data = pydemacro(data)
    if args.output is not None:
        write(args.output, data)
    else:
        print(data)


def parse_command_line():
    parser = argparse.ArgumentParser(description='Replace \\def with \\newcommand where possible.')
    parser.add_argument('input', help='TeX input file with \\def')
    parser.add_argument('--output', '-o', default=None, help='TeX output file with \\newcommand')
    return parser.parse_args()


def read(path):
    with open(path, mode='r') as handle:
        return handle.read()


def bracket_replace(string: str) -> str:
    '''
    replaces all layered brackets with special symbols
    '''
    layer = 0
    out = list(string)
    for i, c in enumerate(out):
        if c == '{':
            if layer > 0:
                out[i] = 'Ḋ'
            layer += 1
        elif c == '}':
            layer -= 1
            if layer > 0:
                out[i] = 'Ḍ'
    return ''.join(out)


def undo_bracket_replace(string):
    return string.replace('Ḋ', '{').replace('Ḍ', '}')
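

# Illustrative sketch, not part of the original module: why nested braces are
# masked before the lazy `{(.+?)}` argument patterns are applied. The helper
# names below are assumptions for this example; `_sketch_mask_inner_braces`
# only mirrors the idea of `bracket_replace` so that the snippet runs on its own.
import re


def _sketch_mask_inner_braces(s, open_mark='Ḋ', close_mark='Ḍ'):
    depth = 0
    out = []
    for ch in s:
        if ch == '{':
            # only braces below the outermost level are masked
            out.append(open_mark if depth > 0 else ch)
            depth += 1
        elif ch == '}':
            depth -= 1
            out.append(close_mark if depth > 0 else ch)
        else:
            out.append(ch)
    return ''.join(out)


def _sketch_first_argument(s):
    pattern = r'\\cmd{(.+?)}'
    # without masking, the lazy group stops at the first closing brace
    naive = re.search(pattern, s).group(1)
    # with masking, the group spans the whole argument; unmask afterwards
    masked = re.search(pattern, _sketch_mask_inner_braces(s)).group(1)
    return naive, masked.replace('Ḋ', '{').replace('Ḍ', '}')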


def sweep(t, cmds):
    num_matches = 0
    for c in cmds:
        nargs = int(c[1][1]) if c[1] != r'' else 0
        optional = c[2] != r''
        if nargs == 0:
            num_matches += len(re.findall(r'\\%s([\W_^\dĊ])' % c[0], t))
            if num_matches > 0:
                t = re.sub(r'\\%s([\W_^\dĊ])' % c[0], r'%s\1' % c[-1].replace('\\', r'\\'), t)
        else:
            matches = re.findall(r'(\\%s(?:\[(.+?)\])?' % c[0]+r'{(.+?)}'*(nargs-(1 if optional else 0))+r')', t)
            num_matches += len(matches)
            for i, m in enumerate(matches):
                r = c[-1]
                if m[1] == r'':
                    matches[i] = (m[0], c[2][1:-1], *m[2:])
                for j in range(1, nargs+1):
                    r = r.replace(r'#%i' % j, matches[i][j+int(not optional)])
                t = t.replace(matches[i][0], r)
    return t, num_matches


def unfold(t):
    #t = queue.get()
    t = t.replace('\n', 'Ċ')
    t = bracket_replace(t)
    commands_pattern = r'\\(?:re)?newcommand\*?{\\(.+?)}[\sĊ]*(\[\d\])?[\sĊ]*(\[.+?\])?[\sĊ]*{(.*?)}'
    cmds = re.findall(commands_pattern, t)
    t = re.sub(r'(?<!\\)'+commands_pattern, 'Ċ', t)
    cmds = sorted(cmds, key=lambda x: len(x[0]))
    cmd_names = Counter([c[0] for c in cmds])
    for i in reversed(range(len(cmds))):
        if cmd_names[cmds[i][0]] > 1:
            # something went wrong here. No multiple definitions allowed
            del cmds[i]
        elif '\\newcommand' in cmds[i][-1]:
            logging.debug("Command recognition pattern didn't work properly. %s" % (undo_bracket_replace(cmds[i][-1])))
            del cmds[i]
    start = time.time()
    try:
        for i in range(10):
            # check for up to 10 nested commands
            if i > 0:
                t = bracket_replace(t)
            t, N = sweep(t, cmds)
            if time.time()-start > 5:  # not optimal; more sophisticated methods didn't work or were slow
                raise TimeoutError
            t = undo_bracket_replace(t)
            if N == 0 or i == 9:
                #print("Needed %i iterations to demacro" % (i+1))
                break
            elif N > 4000:
                raise ValueError("Too many matches. Processing would take too long.")
    except ValueError:
        pass
    except TimeoutError:
        pass
    except re.error as e:
        raise DemacroError(e)
    t = remove_labels(t.replace('Ċ', '\n'))
    # queue.put(t)
    return t


def pydemacro(t: str) -> str:
    r"""Replaces all occurrences of newly defined LaTeX commands in a document.
    Can replace `\newcommand`, `\def` and `\let` definitions in the code.

    Args:
        t (str): Latex document

    Returns:
        str: Document without custom commands
    """
    return unfold(convert(re.sub('\n+', '\n', re.sub(r'(?<!\\)%.*\n', '\n', t))))


def replace(match):
    prefix = match.group(1)
    if (
            prefix is not None and
            (
                'expandafter' in prefix or
                'global' in prefix or
                'outer' in prefix or
                'protected' in prefix
            )
    ):
        return match.group(0)

    result = r'\newcommand'
    if prefix is None or 'long' not in prefix:
        result += '*'

    result += '{' + match.group(2) + '}'
    if match.lastindex == 3:
        result += '[' + match.group(3) + ']'

    result += '{'
    return result


def convert(data):
    data = re.sub(
        r'((?:\\(?:expandafter|global|long|outer|protected)(?:\s+|\r?\n\s*)?)*)?\\def\s*(\\[a-zA-Z]+)\s*(?:#+([0-9]))*\{',
        replace,
        data,
    )
    return re.sub(r'\\let[\sĊ]*(\\[a-zA-Z]+)\s*=?[\sĊ]*(\\?\w+)*', r'\\newcommand*{\1}{\2}\n', data)
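

# Illustrative sketch, not part of the original module: what the `\let`
# rewrite at the end of `convert` does to a single definition. The helper name
# is an assumption for this example; the pattern and replacement are copied
# from `convert`, and only the single-definition case is shown here.
import re


def _sketch_let_to_newcommand(tex):
    pattern = r'\\let[\sĊ]*(\\[a-zA-Z]+)\s*=?[\sĊ]*(\\?\w+)*'
    # group 1 is the new command name, group 2 the command it aliases
    return re.sub(pattern, r'\\newcommand*{\1}{\2}\n', tex)


# Usage: both `\let\a\alpha` and `\let\b=\beta` become a `\newcommand*`
# definition followed by a newline, which `unfold` can then expand:
# _sketch_let_to_newcommand(r'\let\b=\beta')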


def write(path, data):
    with open(path, mode='w') as handle:
        handle.write(data)

    print('=> File written: {0}'.format(path))


if __name__ == '__main__':
    main()


================================================
FILE: pix2tex/dataset/extract_latex.py
================================================
import argparse
import html
import os
import re
import numpy as np
from typing import List

MIN_CHARS = 1
MAX_CHARS = 3000
dollar = re.compile(r'((?<!\$)\${1,2}(?!\$))(.{%i,%i}?)(?<!\\)(?<!\$)\1(?!\$)' % (1, MAX_CHARS))
inline = re.compile(r'(\\\((.*?)(?<!\\)\\\))|(\\\[(.{%i,%i}?)(?<!\\)\\\])' % (1, MAX_CHARS))
equation = re.compile(r'\\begin\{(equation|math|displaymath)\*?\}(.{%i,%i}?)\\end\{\1\*?\}' % (1, MAX_CHARS), re.S)
align = re.compile(r'(\\begin\{(align|alignedat|alignat|flalign|eqnarray|aligned|split|gather)\*?\}(.{%i,%i}?)\\end\{\2\*?\})' % (1, MAX_CHARS), re.S)
displaymath = re.compile(r'(?:\\displaystyle)(.{%i,%i}?)((?<!\\)\}?(?:\"|<))' % (1, MAX_CHARS), re.S)
outer_whitespace = re.compile(
    r'^\\,|\\,$|^~|~$|^\\ |\\ $|^\\thinspace|\\thinspace$|^\\!|\\!$|^\\:|\\:$|^\\;|\\;$|^\\enspace|\\enspace$|^\\quad|\\quad$|^\\qquad|\\qquad$|^\\hspace{[a-zA-Z0-9]+}|\\hspace{[a-zA-Z0-9]+}$|^\\hfill|\\hfill$')
label_names = [re.compile(r'\\%s\s?\{(.*?)\}' % s) for s in ['ref', 'cite', 'label', 'eqref']]


def check_brackets(s):
    a = []
    surrounding = False
    for i, c in enumerate(s):
        if c == '{':
            if i > 0 and s[i-1] == '\\':  # not perfect
                continue
            else:
                a.append(1)
            if i == 0:
                surrounding = True
        elif c == '}':
            if i > 0 and s[i-1] == '\\':
                continue
            else:
                a.append(-1)
    b = np.cumsum(a)
    if len(b) > 1 and b[-1] != 0:
        raise ValueError(s)
    surrounding = s[-1] == '}' and surrounding
    if not surrounding:
        return s
    elif (b == 0).sum() == 1:
        return s[1:-1]
    else:
        return s


def remove_labels(string):
    for s in label_names:
        string = re.sub(s, '', string)
    return string


def clean_matches(matches, min_chars=MIN_CHARS):
    faulty = []
    for i in range(len(matches)):
        if 'tikz' in matches[i]:  # do not support tikz at the moment
            faulty.append(i)
            continue
        matches[i] = remove_labels(matches[i])
        matches[i] = matches[i].replace('\n', '').replace(r'\notag', '').replace(r'\nonumber', '')
        matches[i] = re.sub(outer_whitespace, '', matches[i])
        if len(matches[i]) < min_chars:
            faulty.append(i)
            continue
        # try:
        #     matches[i] = check_brackets(matches[i])
        # except ValueError:
        #     faulty.append(i)
        # `matches[i][-1]` is a single character, so search the whole match for 'newcommand'
        if matches[i][-1] == '\\' or 'newcommand' in matches[i]:
            faulty.append(i)

    matches = [m.strip() for i, m in enumerate(matches) if i not in faulty]
    return list(set(matches))


def find_math(s: str, wiki=False) -> List[str]:
    r"""Find all occurrences of math in a LaTeX-like document.

    Args:
        s (str): String to search
        wiki (bool, optional): Search for `\displaystyle` as it can be found in the wikipedia page source code. Defaults to False.

    Returns:
        List[str]: List of all found mathematical expressions
    """
    matches = []
    x = re.findall(inline, s)
    matches.extend([(g[1] if g[1] != '' else g[-1]) for g in x])
    if not wiki:
        patterns = [dollar, equation, align]
        groups = [1, 1, 0]
    else:
        patterns = [displaymath]
        groups = [0]
    for i, pattern in zip(groups, patterns):
        x = re.findall(pattern, s)
        matches.extend([g[i] for g in x])

    return clean_matches(matches)
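

# Illustrative sketch, not part of the original module: how the dollar-delimited
# pattern picks up `$...$` and `$$...$$` math. The names `_SKETCH_DOLLAR` and
# `_sketch_dollar_math` are assumptions for this example; the pattern is a copy
# of `dollar` with the length limits written out.
import re

_SKETCH_DOLLAR = re.compile(r'((?<!\$)\${1,2}(?!\$))(.{1,3000}?)(?<!\\)(?<!\$)\1(?!\$)')


def _sketch_dollar_math(text):
    # each findall result is (delimiter, content); keep only the content,
    # which is the same group that `find_math` selects for this pattern
    return [content for _delim, content in _SKETCH_DOLLAR.findall(text)]


# Usage:
# _sketch_dollar_math('Let $a^2+b^2=c^2$ and $$E=mc^2$$.')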

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(dest='file', type=str, help='file to find equations in')
    parser.add_argument('--out','-o', type=str, default=None, help='file to save equations to. If none provided, print all equations.')
    parser.add_argument('--wiki', action='store_true', help='only look for math starting with \\displaystyle')
    parser.add_argument('--unescape', action='store_true', help='call `html.unescape` on input')
    args = parser.parse_args()

    if not os.path.exists(args.file):
        raise ValueError('File can not be found. %s' % args.file)

    from pix2tex.dataset.demacro import pydemacro
    with open(args.file, 'r', encoding='utf-8') as f:
        s = pydemacro(f.read())
    if args.unescape:
        s = html.unescape(s)
    math = '\n'.join(sorted(find_math(s, args.wiki)))
    if args.out is None:
        print(math)
    else:
        with open(args.out, 'w') as f:
            f.write(math)
    

================================================
FILE: pix2tex/dataset/latex2png.py
================================================
# mostly taken from http://code.google.com/p/latexmath2png/
# install preview.sty
import os
import re
import sys
import io
import glob
import tempfile
import shlex
import subprocess
import traceback
from PIL import Image


class Latex:
    BASE = r'''
\documentclass[varwidth]{standalone}
\usepackage{fontspec,unicode-math}
\usepackage[active,tightpage,displaymath,textmath]{preview}
\setmathfont{%s}
\begin{document}
\thispagestyle{empty}
%s
\end{document}
'''

    def __init__(self, math, dpi=250, font='Latin Modern Math'):
        '''Takes a list of math code; `write` returns each element as a PNG rendered at DPI=`dpi`.'''
        self.math = math
        self.dpi = dpi
        self.font = font
        self.prefix_line = self.BASE.split("\n").index(
            "%s")  # used to calculate the index of an erroneous formula

    def write(self, return_bytes=False):
        # inline = bool(re.match('^\$[^$]*\$$', self.math)) and False
        try:
            workdir = tempfile.gettempdir()
            fd, texfile = tempfile.mkstemp('.tex', 'eq', workdir, True)
            # print(self.BASE % (self.font, self.math))
            with os.fdopen(fd, 'w+') as f:
                document = self.BASE % (self.font, '\n'.join(self.math))
                # print(document)
                f.write(document)

            png, error_index = self.convert_file(
                texfile, workdir, return_bytes=return_bytes)
            return png, error_index

        finally:
            if os.path.exists(texfile):
                try:
                    os.remove(texfile)
                except PermissionError:
                    pass

    def convert_file(self, infile, workdir, return_bytes=False):
        infile = infile.replace('\\', '/')
        try:
            # Generate the PDF file
            #  do not stop at an erroneous line, but report its line index (1-based)
            cmd = 'xelatex -interaction nonstopmode -file-line-error -output-directory %s %s' % (
                workdir.replace('\\', '/'), infile)

            p = subprocess.Popen(
                shlex.split(cmd),
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                universal_newlines=True
            )
            sout, serr = p.communicate()
            # extract error line from sout
            error_index, _ = extract(text=sout, expression=r"%s:(\d+)" % os.path.basename(infile))
            # extract success rendered equation
            if error_index != []:
                # shift to 0-based indices, matching self.math
                error_index = [int(_)-self.prefix_line-1 for _ in error_index]
            # Convert the PDF file to PNG's
            pdffile = infile.replace('.tex', '.pdf')
            result, _ = extract(
                text=sout, expression=r"Output written on %s \((\d+)? page" % pdffile)
            if int(result[0]) != len(self.math):
                raise Exception('xelatex rendering error, generated %d formula\'s page, but the total number of formulas is %d.' % (
                    int(result[0]), len(self.math)))
            pngfile = os.path.join(workdir, infile.replace('.tex', '.png'))

            cmd = 'convert -density %i -colorspace gray %s -quality 90 %s' % (
                self.dpi,
                pdffile,
                pngfile,
            )  # -bg Transparent -z 9
            if sys.platform == 'win32':
                cmd = 'magick ' + cmd
            p = subprocess.Popen(
                shlex.split(cmd),
                stdin=subprocess.PIPE,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
            )

            sout, serr = p.communicate()
            if p.returncode != 0:
                raise Exception('PDFpng error', serr, cmd, os.path.exists(
                    pdffile), os.path.exists(infile))
            if return_bytes:
                if len(self.math) > 1:
                    png = [open(pngfile.replace('.png', '')+'-%i.png' %
                                i, 'rb').read() for i in range(len(self.math))]
                else:
                    png = [open(pngfile.replace(
                        '.png', '')+'.png', 'rb').read()]
            else:
                # return path
                if len(self.math) > 1:
                    png = [(pngfile.replace('.png', '')+'-%i.png' % i)
                           for i in range(len(self.math))]
                else:
                    png = [(pngfile.replace('.png', '')+'.png')]
            return png, error_index
        except Exception as e:
            print(e)
        finally:
            # Cleanup temporaries
            basefile = infile.replace('.tex', '')
            tempext = ['.aux', '.pdf', '.log']
            if return_bytes:
                ims = glob.glob(basefile+'*.png')
                for im in ims:
                    os.remove(im)
            for te in tempext:
                temp_path = basefile + te  # do not shadow the `tempfile` module
                if os.path.exists(temp_path):
                    os.remove(temp_path)


__cache = {}


def tex2png(eq, **kwargs):
    if eq not in __cache:
        __cache[eq] = Latex(eq, **kwargs).write(return_bytes=True)
    return __cache[eq]


def tex2pil(tex, return_error_index=False, **kwargs):
    pngs, error_index = Latex(tex, **kwargs).write(return_bytes=True)
    images = [Image.open(io.BytesIO(d)) for d in pngs]
    return (images, error_index) if return_error_index else images


def extract(text, expression=None):
    """Find all matches of a regular expression in a text.

    Args:
        text (str): input text
        expression (str, optional): regular expression. Defaults to None.

    Returns:
        tuple(list, bool): all matches and whether at least one was found
    """
    try:
        pattern = re.compile(expression)
        results = re.findall(pattern, text)
        return results, len(results) != 0
    except Exception:
        traceback.print_exc()
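

# Illustrative sketch, not part of the original module: how a line number from
# an `xelatex -file-line-error` log maps back to an index into the formula
# list, as done in `Latex.convert_file`. The shortened template, the helper
# name and the sample log line in the usage note are assumptions for this
# example; it also assumes every formula occupies exactly one line.
import re

_SKETCH_TEMPLATE = '\n'.join(['', r'\documentclass{standalone}', r'\begin{document}', '%s', r'\end{document}', ''])


def _sketch_error_indices(log, tex_name, template=_SKETCH_TEMPLATE):
    # 0-based position of the placeholder line inside the template
    prefix_line = template.split('\n').index('%s')
    # log lines look like "<file>:<line>: <message>" with 1-based line numbers
    lines = re.findall(r'%s:(\d+)' % re.escape(tex_name), log)
    return [int(n) - prefix_line - 1 for n in lines]


# Usage: with the template above the first formula sits on line 4 of the .tex
# file, so an error reported on line 5 points at the second formula (index 1):
# _sketch_error_indices('./eq.tex:5: Undefined control sequence.', 'eq.tex')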


if __name__ == '__main__':
    if len(sys.argv) > 1:
        src = sys.argv[1]
    else:
        src = r'\begin{equation}\mathcal{ L}\nonumber\end{equation}'

    print('Equation is: %s' % src)
    print(Latex([src]).write())


================================================
FILE: pix2tex/dataset/postprocess.py
================================================
import argparse
from tqdm.auto import tqdm

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--input', required=True, help='input file')
    parser.add_argument('-o', '--output', default=None, help='output file')
    args = parser.parse_args()

    with open(args.input, 'r') as fin:
        d = fin.read().split('\n')
    reqs = ['\\', '_', '^', '(', ')', '{', '}']
    deleted = 0
    for i in tqdm(reversed(range(len(d))), total=len(d)):
        if not any([r in d[i] for r in reqs]):
            del d[i]
            deleted += 1
    print('removed %i lines' % deleted)
    f = args.output
    if f is None:
        f = args.input
    with open(f, 'w') as fout:
        fout.write('\n'.join(d))


================================================
FILE: pix2tex/dataset/preprocessing/__init__.py
================================================


================================================
FILE: pix2tex/dataset/preprocessing/generate_latex_vocab.py
================================================
import sys, logging, argparse, os

def process_args(args):
    parser = argparse.ArgumentParser(description='Generate vocabulary file.')

    parser.add_argument('--data-path', dest='data_path',
                        type=str, required=True,
                        help=('Input file containing <img_name> <line_idx> per line. This should be the file used for training.'
                        ))
    parser.add_argument('--label-path', dest='label_path',
                        type=str, required=True,
                        help=('Input file containing a tokenized formula per line.'
                        ))
    parser.add_argument('--output-file', dest='output_file',
                        type=str, required=True,
                        help=('Output file for putting vocabulary.'
                        ))
    parser.add_argument('--unk-threshold', dest='unk_threshold',
                        type=int, default=1,
                        help=('If the number of occurrences of a token is less than or equal to the threshold, it will be excluded from the generated vocabulary.'
                        ))
    parser.add_argument('--log-path', dest="log_path",
                        type=str, default='log.txt',
                        help=('Log file path, default=log.txt' 
                        ))
    parameters = parser.parse_args(args)
    return parameters

def main(args):
    parameters = process_args(args)
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)-15s %(name)-5s %(levelname)-8s %(message)s',
        filename=parameters.log_path)

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    formatter = logging.Formatter('%(asctime)-15s %(name)-5s %(levelname)-8s %(message)s')
    console.setFormatter(formatter)
    logging.getLogger('').addHandler(console)

    logging.info('Script being executed: %s'%__file__)

    label_path = parameters.label_path
    assert os.path.exists(label_path), label_path
    data_path = parameters.data_path
    assert os.path.exists(data_path), data_path

    with open(label_path) as flabel:
        formulas = flabel.readlines()
    # count how often each token occurs in the formulas referenced by the data file
    vocab = {}
    with open(data_path) as fin:
        for line in fin:
            _, line_idx = line.strip().split()
            tokens = formulas[int(line_idx)].strip().split()
            for token in tokens:
                vocab[token] = vocab.get(token, 0) + 1

    # keep the tokens that occur more than unk_threshold times, in sorted order
    vocab_sort = sorted(vocab)
    vocab_out = []
    num_unknown = 0
    for word in vocab_sort:
        if vocab[word] > parameters.unk_threshold:
            vocab_out.append(word)
        else:
            num_unknown += 1
    #vocab = ["'"+word.replace('\\','\\\\').replace('\'', '\\\'')+"'" for word in vocab_out]
    vocab = [word for word in vocab_out]

    with open(parameters.output_file, 'w') as fout:
        fout.write('\n'.join(vocab))
    logging.info('#UNK\'s: %d'%num_unknown)

if __name__ == '__main__':
    main(sys.argv[1:])
    logging.info('Jobs finished')


================================================
FILE: pix2tex/dataset/preprocessing/preprocess_formulas.py
================================================
# taken and modified from https://github.com/harvardnlp/im2markup
# tokenize latex formulas
import sys
import os
import re
import argparse
import logging
import subprocess
import shutil


def process_args(args):
    parser = argparse.ArgumentParser(description='Preprocess (tokenize or normalize) latex formulas')

    parser.add_argument('--mode', '-m', dest='mode',
                        choices=['tokenize', 'normalize'], default='normalize',
                        help=('Tokenize (split into tokens separated by spaces) or normalize (further translate to an equivalent standard form).'
                              ))
    parser.add_argument('--input-file', '-i', dest='input_file',
                        type=str, required=True,
                        help=('Input file containing latex formulas. One formula per line.'
                              ))
    parser.add_argument('--output-file', '-o', dest='output_file',
                        type=str, required=True,
                        help=('Output file.'
                              ))
    parser.add_argument('-n', '--num-threads', dest='num_threads',
                        type=int, default=4,
                        help=('Number of threads, default=4.'))
    parser.add_argument('--log-path', dest="log_path",
                        type=str, default=None,
                        help=('Log file path. If omitted, no log file is written.'))
    parameters = parser.parse_args(args)
    return parameters


def main(args):
    parameters = process_args(args)
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)-15s %(name)-5s %(levelname)-8s %(message)s',
        filename=parameters.log_path)

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)
    formatter = logging.Formatter('%(asctime)-15s %(name)-5s %(levelname)-8s %(message)s')
    console.setFormatter(formatter)
    logging.getLogger('').addHandler(console)

    logging.info('Script being executed: %s' % __file__)

    input_file = parameters.input_file
    output_file = parameters.output_file

    assert os.path.exists(input_file), input_file
    shutil.copy(input_file, output_file)
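    # KaTeX emits operator names character by character, e.g. \operatorname { s i n },
    # so allow optional whitespace between the characters and inside the braces
    operators = '|'.join(r'\s?'.join(op) for op in ['arccos', 'arcsin', 'arctan', 'arg', 'cos', 'cosh', 'cot', 'coth', 'csc', 'deg', 'det', 'dim', 'exp', 'gcd', 'hom', 'inf',
                                                    'injlim', 'ker', 'lg', 'lim', 'liminf', 'limsup', 'ln', 'log', 'max', 'min', 'Pr', 'projlim', 'sec', 'sin', 'sinh', 'sup', 'tan', 'tanh'])
    ops = re.compile(r'\\operatorname \{\s?(%s)\s?\}' % operators)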
    temp_file = output_file + '.tmp'
    with open(temp_file, 'w') as fout:
        with open(output_file, 'r') as fin:
            prepre = fin.read().replace('\r', ' ')  # replace carriage returns with spaces
        # replace split, align, alignedat, alignat and eqnarray with aligned
        prepre = re.sub(r'\\begin{(split|align|alignedat|alignat|eqnarray)\*?}(.+?)\\end{\1\*?}', r'\\begin{aligned}\2\\end{aligned}', prepre, flags=re.S)
        prepre = re.sub(r'\\begin{(smallmatrix)\*?}(.+?)\\end{\1\*?}', r'\\begin{matrix}\2\\end{matrix}', prepre, flags=re.S)
        fout.write(prepre)

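    # run the KaTeX-based preprocessor; passing the files directly (instead of building
    # a shell pipeline) also works for paths that contain spaces
    script = os.path.join(os.path.dirname(__file__), 'preprocess_latex.js')
    cmd = ['node', script, parameters.mode]
    with open(temp_file, 'r') as fin, open(output_file, 'w') as fout:
        ret = subprocess.call(cmd, stdin=fin, stdout=fout)
    os.remove(temp_file)
    if ret != 0:
        logging.error('FAILED: %s' % ' '.join(cmd))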
    temp_file = output_file + '.tmp'
    shutil.move(output_file, temp_file)
    with open(temp_file, 'r') as fin:
        with open(output_file, 'w') as fout:
            for line in fin:
                tokens_out = line.strip().split()
                # drop formulas with five or fewer tokens
                if len(tokens_out) > 5:
                    post = ' '.join(tokens_out)
                    # use \sin instead of \operatorname{sin}
                    names = ['\\'+x.replace(' ', '') for x in re.findall(ops, post)]
                    post = re.sub(ops, lambda match: str(names.pop(0)), post).replace(r'\\ \end{array}', r'\end{array}')
                    fout.write(post+'\n')
    os.remove(temp_file)


if __name__ == '__main__':
    main(sys.argv[1:])
    logging.info('Jobs finished')


================================================
FILE: pix2tex/dataset/preprocessing/preprocess_latex.js
================================================
const path = require('path');
var katex = require(path.join(__dirname,"third_party/katex/katex.js"))
options = require(path.join(__dirname,"third_party/katex/src/Options.js"))
var readline = require('readline');
var rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout,
    terminal: false
});


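rl.on('line', function(line){
    // drop a leading '%' so the line itself is not discarded as a comment
    if (line[0] == "%") {
        line = line.substr(1, line.length - 1);
    }
    // strip everything after the first remaining '%'
    line = line.split('%')[0];

    // replace '\~' with a space
    line = line.split('\\~').join(' ');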
    
    for (var i = 0; i < 300; i++) {
        line = line.replace(/\\>/, " ");
        line = line.replace('$', ' ');
        line = line.replace(/\\label{.*?}/, "");
    }

    if (line.indexOf("matrix") == -1 && line.indexOf("cases")==-1 &&
        line.indexOf("array")==-1 && line.indexOf("begin")==-1)  {
        for (var i = 0; i < 300; i++) {
            line = line.replace(/\\\\/, "\\,");
        }
    }
    

    line = line + " "
    // global_str is the tokenized version (built in Parser.js)
    // norm_str is the normalized version built by the renderer below.
    try {
    

        if (process.argv[2] == "tokenize") {
            var tree = katex.__parse(line, {});
            console.log(global_str.replace(/\\label { .*? }/, ""));
        } else {
            for (var i = 0; i < 300; ++i) {
                line = line.replace(/{\\rm/, "\\mathrm{");
                line = line.replace(/{ \\rm/, "\\mathrm{");
                line = line.replace(/\\rm{/, "\\mathrm{");
            }

            var tree = katex.__parse(line, {});
            buildExpression(tree, new options({}));            
            for (var i = 0; i < 300; ++i) {
                norm_str = norm_str.replace('SSSSSS', '$');
                norm_str = norm_str.replace(' S S S S S S', '$');
            }
            console.log(norm_str.replace(/\\label { .*? }/, ""));
        }
    } catch (e) {
        console.error(line);
        console.error(norm_str);
        console.error(e);
        console.log();
    }
    global_str = ""
    norm_str = ""
})



// This is a LaTeX AST to LaTeX Renderer (modified version of KaTeX AST-> MathML).
norm_str = ""

var groupTypes = {};

groupTypes.mathord = function(group, options) {
    if (options.font == "mathrm"){
        for (var i = 0; i < group.value.length; ++i ) {
            if (group.value[i] == " ") {
                norm_str = norm_str + group.value[i] + "\; ";
            } else {
                norm_str = norm_str + group.value[i] + " ";
            }
        }
    } else {
        norm_str = norm_str + group.value + " ";
    }
};

groupTypes.textord = function(group, options) {
    norm_str = norm_str + group.value + " ";
};

groupTypes.bin = function(group) {
    norm_str = norm_str + group.value + " ";
};

groupTypes.rel = function(group) {
    norm_str = norm_str + group.value + " ";
};

groupTypes.open = function(group) {
    norm_str = norm_str + group.value + " ";
};

groupTypes.close = function(group) {
    norm_str = norm_str + group.value + " ";
};

groupTypes.inner = function(group) {
    norm_str = norm_str + group.value + " ";
};

groupTypes.punct = function(group) {
    norm_str = norm_str + group.value + " ";
};

groupTypes.ordgroup = function(group, options) {
    norm_str = norm_str + "{ ";

    buildExpression(group.value, options);

    norm_str = norm_str +  "} ";
};

groupTypes.text = function(group, options) {
    
    norm_str = norm_str + "\\mathrm { ";

    buildExpression(group.value.body, options);
    norm_str = norm_str + "} ";
};

// NOTE: leftover from the KaTeX MathML builder; mathMLTree is not defined in this file.
groupTypes.color = function(group, options) {
    var inner = buildExpression(group.value.value, options);

    var node = new mathMLTree.MathNode("mstyle", inner);

    node.setAttribute("mathcolor", group.value.color);

    return node;
};

groupTypes.supsub = function(group, options) {
    buildGroup(group.value.base, options);

    if (group.value.sub) {
        norm_str = norm_str + "_ ";
        if (group.value.sub.type != 'ordgroup') {
            norm_str = norm_str + " { ";
            buildGroup(group.value.sub, options);
            norm_str = norm_str + "} ";
        } else {
            buildGroup(group.value.sub, options);
        }
        
    }

    if (group.value.sup) {
        norm_str = norm_str + "^ ";
        if (group.value.sup.type != 'ordgroup') {
            norm_str = norm_str + " { ";
            buildGroup(group.value.sup, options);
            norm_str = norm_str + "} ";
        } else {
            buildGroup(group.value.sup, options);
        }
    }

};

groupTypes.genfrac = function(group, options) {
    if (!group.value.hasBarLine) {
        norm_str = norm_str + "\\binom ";
    } else {
        norm_str = norm_str + "\\frac ";
    }
    buildGroup(group.value.numer, options);
    buildGroup(group.value.denom, options);

};

groupTypes.array = function(group, options) {
    norm_str = norm_str + "\\begin{array} { ";
    if (group.value.cols) {
        group.value.cols.map(function(start) {
            if (start && start.align) {
                norm_str = norm_str + start.align + " ";}});
    } else {
        group.value.body[0].map(function(start) {
            norm_str = norm_str + "l ";
        } );
    }
    norm_str = norm_str + "} ";
    group.value.body.map(function(row) {
        if (row[0].value.length > 0) {
            out = row.map(function(cell) {
                buildGroup(cell, options);
                norm_str = norm_str + "& ";
            });
            norm_str = norm_str.substring(0, norm_str.length-2) + "\\\\ ";
        }
    }); 
    norm_str = norm_str + "\\end{array} ";
};

groupTypes.sqrt = function(group, options) {
    var node;
    if (group.value.index) {
        norm_str = norm_str + "\\sqrt [ ";
        buildExpression(group.value.index.value, options);
        norm_str = norm_str + "] ";
        buildGroup(group.value.body, options);
    } else {
        norm_str = norm_str + "\\sqrt ";
        buildGroup(group.value.body, options);
    }
};

groupTypes.leftright = function(group, options) {



    norm_str = norm_str + "\\left" + group.value.left + " ";
    buildExpression(group.value.body, options);
    norm_str = norm_str + "\\right" + group.value.right + " ";
};

groupTypes.accent = function(group, options) {
    if (group.value.base.type != 'ordgroup') {
        norm_str = norm_str + group.value.accent + " { ";
        buildGroup(group.value.base, options);
        norm_str = norm_str + "} ";
    } else {
        norm_str = norm_str + group.value.accent + " ";
        buildGroup(group.value.base, options);
    }
};

groupTypes.spacing = function(group) {
    // a plain space is written as a non-breaking space (~)
    if (group.value == " ") {
        norm_str = norm_str + "~ ";
    } else {
        norm_str = norm_str + group.value + " ";
    }
};

groupTypes.op = function(group) {
    var node;

    // TODO(emily): handle big operators using the `largeop` attribute
    
    
    if (group.value.symbol) {
        // This is a symbol. Just add the symbol.
        norm_str = norm_str + group.value.body + " ";

    } else {
        if (group.value.limits == false) {
            norm_str = norm_str + "\\operatorname { ";
        } else {
            norm_str = norm_str + "\\operatorname* { ";
        }
        // start at index 1 to skip the leading backslash of the operator name
        for (var i = 1; i < group.value.body.length; ++i ) {
            norm_str = norm_str + group.value.body[i] + " ";
        }
        norm_str = norm_str + "} ";
    }
};

groupTypes.katex = function(group) {
    var node = new mathMLTree.MathNode(
        "mtext", [new mathMLTree.TextNode("KaTeX")]);

    return node;
};



groupTypes.font = function(group, options) {
    var font = group.value.font;
    if (font == "mbox" || font == "hbox") {
        font = "mathrm";
    }
    norm_str = norm_str + "\\" + font + " ";
    buildGroup(group.value.body, options.withFont(font));    
};

groupTypes.delimsizing = function(group) {
    var children = [];
    norm_str = norm_str + group.value.funcName + " " + group.value.value + " ";
};

groupTypes.styling = function(group, options) {
    norm_str = norm_str + " " + group.value.original + " ";
    buildExpression(group.value.value, options);

};

groupTypes.sizing = function(group, options) {

    if (group.value.original == "\\rm") {
        norm_str = norm_str + "\\mathrm { "; 
        buildExpression(group.value.value, options.withFont("mathrm"));
        norm_str = norm_str + "} ";
    } else {
        norm_str = norm_str + " " + group.value.original + " ";
        buildExpression(group.value.value, options);
    }
};

groupTypes.overline = function(group, options) {
    norm_str = norm_str + "\\overline { ";
    buildGroup(group.value.body, options);
    norm_str = norm_str + "} ";
};

groupTypes.underline = function(group, options) {
    norm_str = norm_str + "\\underline { ";
    buildGroup(group.value.body, options);
    norm_str = norm_str + "} ";
};

groupTypes.rule = function(group) {
    norm_str = norm_str + "\\rule { "+group.value.width.number+" "+group.value.width.unit+"  } { "+group.value.height.number+" "+group.value.height.unit+ " } ";

};

groupTypes.llap = function(group, options) {
    norm_str = norm_str + "\\llap ";
    buildGroup(group.value.body, options);
};

groupTypes.rlap = function(group, options) {
    norm_str = norm_str + "\\rlap ";
    buildGroup(group.value.body, options);

};

groupTypes.phantom = function(group, options, prev) {
    norm_str = norm_str + "\\phantom { ";
    buildExpression(group.value.value, options);
    norm_str = norm_str + "} ";

};

/**
 * Takes a list of parse nodes and builds each of them in order. Unlike the
 * KaTeX MathML builder this is derived from, nothing is returned: the
 * groupTypes handlers append the normalized LaTeX to the global `norm_str`.
 */
var buildExpression = function(expression, options) {
    for (var i = 0; i < expression.length; i++) {
        var group = expression[i];
        buildGroup(group, options);
    }
};

/**
 * Takes a group from the parser and calls the appropriate groupTypes function
 * on it, which appends the group's normalized LaTeX to `norm_str`.
 */
var buildGroup = function(group, options) {
    if (groupTypes[group.type]) {
        groupTypes[group.type](group, options);
    } else {
        // ParseError is not imported in this file; use the class exported by katex.js
        throw new katex.ParseError(
            "Got group of unknown type: '" + group.type + "'");
    }
};





================================================
FILE: pix2tex/dataset/preprocessing/third_party/README.md
================================================
Directly taken from https://github.com/harvardnlp/im2markup


================================================
FILE: pix2tex/dataset/preprocessing/third_party/katex/.#katex.js
================================================
srush@beaker.12118:1471814512

================================================
FILE: pix2tex/dataset/preprocessing/third_party/katex/LICENSE.txt
================================================
The MIT License (MIT)

Copyright (c) 2015 Khan Academy

This software also uses portions of the underscore.js project, which is
MIT licensed with the following copyright:

Copyright (c) 2009-2015 Jeremy Ashkenas, DocumentCloud and Investigative
Reporters & Editors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: pix2tex/dataset/preprocessing/third_party/katex/README.md
================================================
# [<img src="https://khan.github.io/KaTeX/katex-logo.svg" width="130" alt="KaTeX">](https://khan.github.io/KaTeX/) [![Build Status](https://travis-ci.org/Khan/KaTeX.svg?branch=master)](https://travis-ci.org/Khan/KaTeX)

[![Join the chat at https://gitter.im/Khan/KaTeX](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/Khan/KaTeX?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)

KaTeX is a fast, easy-to-use JavaScript library for TeX math rendering on the web.

 * **Fast:** KaTeX renders its math synchronously and doesn't need to reflow the page. See how it compares to a competitor in [this speed test](http://jsperf.com/katex-vs-mathjax/).
 * **Print quality:** KaTeX’s layout is based on Donald Knuth’s TeX, the gold standard for math typesetting.
 * **Self contained:** KaTeX has no dependencies and can easily be bundled with your website resources.
 * **Server side rendering:** KaTeX produces the same output regardless of browser or environment, so you can pre-render expressions using Node.js and send them as plain HTML.

KaTeX supports all major browsers, including Chrome, Safari, Firefox, Opera, and IE 8 - IE 11. A list of supported commands can be found on the [wiki](https://github.com/Khan/KaTeX/wiki/Function-Support-in-KaTeX).

## Usage

You can [download KaTeX](https://github.com/khan/katex/releases) and host it on your server or include the `katex.min.js` and `katex.min.css` files on your page directly from a CDN:

```html
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.css">
<script src="https://cdnjs.cloudflare.com/ajax/libs/KaTeX/0.5.1/katex.min.js"></script>
```

#### In-browser rendering

Call `katex.render` with a TeX expression and a DOM element to render into:

```js
katex.render("c = \\pm\\sqrt{a^2 + b^2}", element);
```

If KaTeX can't parse the expression, it throws a `katex.ParseError` error.

#### Server side rendering or rendering to a string

To generate HTML on the server or to generate an HTML string of the rendered math, you can use `katex.renderToString`:

```js
var html = katex.renderToString("c = \\pm\\sqrt{a^2 + b^2}");
// '<span class="katex">...</span>'
```

Make sure to include the CSS and font files, but there is no need to include the JavaScript. Like `render`, `renderToString` throws if it can't parse the expression.

#### Rendering options

You can provide an object of options as the last argument to `katex.render` and `katex.renderToString`. Available options are:

- `displayMode`: `boolean`. If `true` the math will be rendered in display mode, which will put the math in display style (so `\int` and `\sum` are large, for example), and will center the math on the page on its own line. If `false` the math will be rendered in inline mode. (default: `false`)
- `throwOnError`: `boolean`. If `true`, KaTeX will throw a `ParseError` when it encounters an unsupported command. If `false`, KaTeX will render the unsupported command as text in the color given by `errorColor`. (default: `true`)
- `errorColor`: `string`. A color string given in the format `"#XXX"` or `"#XXXXXX"`. This option determines the color which unsupported commands are rendered in. (default: `#cc0000`)

For example:

```js
katex.render("c = \\pm\\sqrt{a^2 + b^2}", element, { displayMode: true });
```

#### Automatic rendering of math on a page

Math on the page can be automatically rendered using the auto-render extension. See [the Auto-render README](contrib/auto-render/README.md) for more information.

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md)

## License

KaTeX is licensed under the [MIT License](http://opensource.org/licenses/MIT).


================================================
FILE: pix2tex/dataset/preprocessing/third_party/katex/cli.js
================================================
#!/usr/bin/env node
// Simple CLI for KaTeX.
// Reads TeX from stdin, outputs HTML to stdout.
/* eslint no-console:0 */

var katex = require("./");
var input = "";

// Skip the first two args, which are just "node" and "cli.js"
var args = process.argv.slice(2);

if (args.indexOf("--help") !== -1) {
    console.log(process.argv[0] + " " + process.argv[1] +
                " [ --help ]" +
                " [ --display-mode ]");

    console.log("\n" +
                "Options:");
    console.log("  --help            Display this help message");
    console.log("  --display-mode    Render in display mode (not inline mode)");
    process.exit();
}

process.stdin.on("data", function(chunk) {
    input += chunk.toString();
});

process.stdin.on("end", function() {
    var options = { displayMode: args.indexOf("--display-mode") !== -1 };
    var output = katex.renderToString(input, options);
    console.log(output);
});


================================================
FILE: pix2tex/dataset/preprocessing/third_party/katex/katex.js
================================================
/* eslint no-console:0 */
/**
 * This is the main entry point for KaTeX. Here, we expose functions for
 * rendering expressions either to DOM nodes or to markup strings.
 *
 * We also expose the ParseError class to check if errors thrown from KaTeX are
 * errors in the expression, or errors in javascript handling.
 */

var ParseError = require("./src/ParseError");
var Settings = require("./src/Settings");

var buildTree = require("./src/buildTree");
var parseTree = require("./src/parseTree");
var utils = require("./src/utils");

/**
 * Parse and build an expression, and place that expression in the DOM node
 * given.
 */
var render = function(expression, baseNode, options) {
    utils.clearNode(baseNode);

    var settings = new Settings(options);

    var tree = parseTree(expression, settings);
    var node = buildTree(tree, expression, settings).toNode();

    baseNode.appendChild(node);
};

// KaTeX's styles don't work properly in quirks mode. Print out an error, and
// disable rendering.
if (typeof document !== "undefined") {
    if (document.compatMode !== "CSS1Compat") {
        typeof console !== "undefined" && console.warn(
            "Warning: KaTeX doesn't work in quirks mode. Make sure your " +
                "website has a suitable doctype.");

        render = function() {
            throw new ParseError("KaTeX doesn't work in quirks mode.");
        };
    }
}

/**
 * Parse and build an expression, and return the markup for that.
 */
var renderToString = function(expression, options) {
    var settings = new Settings(options);

    var tree = parseTree(expression, settings);
    return buildTree(tree, expression, settings).toMarkup();
};

/**
 * Parse an expression and return the parse tree.
 */
var generateParseTree = function(expression, options) {
    var settings = new Settings(options);
    return parseTree(expression, settings);
};

module.exports = {
    render: render,
    renderToString: renderToString,
    /**
     * NOTE: This method is not currently recommended for public use.
     * The internal tree representation is unstable and is very likely
     * to change. Use at your own risk.
     */
    __parse: generateParseTree,
    ParseError: ParseError,
};


================================================
FILE: pix2tex/dataset/preprocessing/third_party/katex/package.json
================================================
{
  "_args": [
    [
      "katex",
      "/home/srush/Projects/im2latex"
    ]
  ],
  "_from": "katex@latest",
  "_id": "katex@0.6.0",
  "_inCache": true,
  "_installable": true,
  "_location": "/katex",
  "_nodeVersion": "4.2.1",
  "_npmOperationalInternal": {
    "host": "packages-12-west.internal.npmjs.com",
    "tmp": "tmp/katex-0.6.0.tgz_1460769444991_0.38667152682319283"
  },
  "_npmUser": {
    "email": "kevinb7@gmail.com",
    "name": "kevinbarabash"
  },
  "_npmVersion": "2.15.2",
  "_phantomChildren": {},
  "_requested": {
    "name": "katex",
    "raw": "katex",
    "rawSpec": "",
    "scope": null,
    "spec": "latest",
    "type": "tag"
  },
  "_requiredBy": [
    "#USER"
  ],
  "_resolved": "https://registry.npmjs.org/katex/-/katex-0.6.0.tgz",
  "_shasum": "12418e09121c05c92041b6b3b9fb6bab213cb6f3",
  "_shrinkwrap": null,
  "_spec": "katex",
  "_where": "/home/srush/Projects/im2latex",
  "bin": {
    "katex": "cli.js"
  },
  "bugs": {
    "url": "https://github.com/Khan/KaTeX/issues"
  },
  "dependencies": {
    "match-at": "^0.1.0"
  },
  "description": "Fast math typesetting for the web.",
  "devDependencies": {
    "browserify": "^10.2.4",
    "clean-css": "~2.2.15",
    "eslint": "^1.10.2",
    "express": "~3.3.3",
    "glob": "^5.0.15",
    "jasmine": "^2.3.2",
    "jasmine-core": "^2.3.4",
    "js-yaml": "^3.3.1",
    "jspngopt": "^0.1.0",
    "less": "~1.7.5",
    "nomnom": "^1.8.1",
    "pako": "0.2.7",
    "selenium-webdriver": "^2.46.1",
    "uglify-js": "~2.4.15"
  },
  "directories": {},
  "dist": {
    "shasum": "12418e09121c05c92041b6b3b9fb6bab213cb6f3",
    "tarball": "https://registry.npmjs.org/katex/-/katex-0.6.0.tgz"
  },
  "files": [
    "cli.js",
    "dist/",
    "katex.js",
    "src/"
  ],
  "gitHead": "b94fc6534d5c23f944906a52a592bee4e0090665",
  "homepage": "https://github.com/Khan/KaTeX#readme",
  "license": "MIT",
  "main": "katex.js",
  "maintainers": [
    {
      "name": "kevinbarabash",
      "email": "kevinb7@gmail.com"
    },
    {
      "name": "spicyj",
      "email": "ben@benalpert.com"
    },
    {
      "name": "xymostech",
      "email": "xymostech@gmail.com"
    }
  ],
  "name": "katex",
  "optionalDependencies": {},
  "readme": "ERROR: No README data found!",
  "repository": {
    "type": "git",
    "url": "git://github.com/Khan/KaTeX.git"
  },
  "scripts": {
    "prepublish": "make dist",
    "start": "node server.js",
    "test": "make lint test"
  },
  "version": "0.6.0"
}


================================================
FILE: pix2tex/dataset/preprocessing/third_party/katex/src/Lexer.js
================================================
/**
 * The Lexer class handles tokenizing the input in various ways. Since our
 * parser expects us to be able to backtrack, the lexer allows lexing from any
 * given starting point.
 *
 * Its main exposed function is the `lex` function, which takes a position to
 * lex from and a type of token to lex. It defers to the appropriate `_innerLex`
 * function.
 *
 * The various `_innerLex` functions perform the actual lexing of different
 * kinds.
 */

var matchAt = require("../../match-at");

var ParseError = require("./ParseError");

// The main lexer class
function Lexer(input) {
    this._input = input;
}

// The resulting token returned from `lex`.
function Token(text, data, position) {
    this.text = text;
    this.data = data;
    this.position = position;
}

/* The following tokenRegex
 * - matches typical whitespace (but not NBSP etc.) using its first group
 * - matches symbol combinations which result in a single output character
 * - does not match any control character \x00-\x1f except whitespace
 * - does not match a bare backslash
 * - matches any ASCII character except those just mentioned
 * - does not match the BMP private use area \uE000-\uF8FF
 * - does not match bare surrogate code units
 * - matches any BMP character except for those just described
 * - matches any valid Unicode surrogate pair
 * - matches a backslash followed by one or more letters
 * - matches a backslash followed by any BMP character, including newline
 * Just because the Lexer matches something doesn't mean it's valid input:
 * If there is no matching function or symbol definition, the Parser will
 * still reject the input.
 */
var tokenRegex = new RegExp(
    "([ \r\n\t]+)|(" +                                // whitespace
    "---?" +                                          // special combinations
    "|[!-\\[\\]-\u2027\u202A-\uD7FF\uF900-\uFFFF]" +  // single codepoint
    "|[\uD800-\uDBFF][\uDC00-\uDFFF]" +               // surrogate pair
    "|\\\\(?:[a-zA-Z]+|[^\uD800-\uDFFF])" +           // function name
    ")"
);

var whitespaceRegex = /\s*/;

/**
 * This function lexes a single normal token. It takes a position and
 * whether it should completely ignore whitespace or not.
 */
Lexer.prototype._innerLex = function(pos, ignoreWhitespace) {
    var input = this._input;
    if (pos === input.length) {
        return new Token("EOF", null, pos);
    }
    var match = matchAt(tokenRegex, input, pos);
    if (match === null) {
        throw new ParseError(
            "Unexpected character: '" + input[pos] + "'",
            this, pos);
    } else if (match[2]) { // matched non-whitespace
        return new Token(match[2], null, pos + match[2].length);
    } else if (ignoreWhitespace) {
        return this._innerLex(pos + match[1].length, true);
    } else { // concatenate whitespace to a single space
        return new Token(" ", null, pos + match[1].length);
    }
};

// A regex to match a CSS color (like #ffffff or BlueViolet)
var cssColor = /#[a-z0-9]+|[a-z]+/i;

/**
 * This function lexes a CSS color.
 */
Lexer.prototype._innerLexColor = function(pos) {
    var input = this._input;

    // Ignore whitespace
    var whitespace = matchAt(whitespaceRegex, input, pos)[0];
    pos += whitespace.length;

    var match;
    if ((match = matchAt(cssColor, input, pos))) {
        // If we look like a color, return a color
        return new Token(match[0], null, pos + match[0].length);
    } else {
        throw new ParseError("Invalid color", this, pos);
    }
};

// A regex to match a dimension. Dimensions look like
// "1.2em" or ".4pt" or "1 ex"
var sizeRegex = /(-?)\s*(\d+(?:\.\d*)?|\.\d+)\s*([a-z]{2})/;

/**
 * This function lexes a dimension.
 */
Lexer.prototype._innerLexSize = function(pos) {
    var input = this._input;

    // Ignore whitespace
    var whitespace = matchAt(whitespaceRegex, input, pos)[0];
    pos += whitespace.length;

    var match;
    if ((match = matchAt(sizeRegex, input, pos))) {
        var unit = match[3];
        // We only currently handle "em" and "ex" units
        // if (unit !== "em" && unit !== "ex") {
        //     throw new ParseError("Invalid unit: '" + unit + "'", this, pos);
        // }
        return new Token(match[0], {
            number: +(match[1] + match[2]),
            unit: unit,
        }, pos + match[0].length);
    }

    throw new ParseError("Invalid size", this, pos);
};
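
/*
 * Illustrative sketch, not part of KaTeX: what the dimension regex above
 * captures, and how the sign and magnitude groups combine into a number. The
 * name `sketchParseSize` is hypothetical, and requiring the match to start at
 * index 0 is an assumption standing in for `matchAt`.
 */
function sketchParseSize(text) {
    var sketchSizeRegex = /(-?)\s*(\d+(?:\.\d*)?|\.\d+)\s*([a-z]{2})/;
    var match = sketchSizeRegex.exec(text);
    if (match === null || match.index !== 0) {
        return null;
    }
    return {
        // Group 1 is the optional sign, group 2 the digits
        number: +(match[1] + match[2]),
        // Group 3 is the two-letter unit
        unit: match[3],
    };
}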

/**
 * This function lexes a string of whitespace.
 */
Lexer.prototype._innerLexWhitespace = function(pos) {
    var input = this._input;

    var whitespace = matchAt(whitespaceRegex, input, pos)[0];
    pos += whitespace.length;

    return new Token(whitespace[0], null, pos);
};

/**
 * This function lexes a single token starting at `pos` and of the given mode.
 * Based on the mode, we defer to one of the `_innerLex` functions.
 */
Lexer.prototype.lex = function(pos, mode) {
    if (mode === "math") {
        return this._innerLex(pos, true);
    } else if (mode === "text") {
        return this._innerLex(pos, false);
    } else if (mode === "color") {
        return this._innerLexColor(pos);
    } else if (mode === "size") {
        return this._innerLexSize(pos);
    } else if (mode === "whitespace") {
        return this._innerLexWhitespace(pos);
    }
};

module.exports = Lexer;


================================================
FILE: pix2tex/dataset/preprocessing/third_party/katex/src/Options.js
================================================
/**
 * This file contains information about the options that the Parser carries
 * around with it while parsing. Data is held in an `Options` object, and when
 * recursing, a new `Options` object can be created with the `.with*` and
 * `.reset` functions.
 */

/**
 * This is the main options class. It contains the style, size, color, and font
 * of the current parse level. It also contains the style and size of the parent
 * parse level, so size changes can be handled efficiently.
 *
 * Each of the `.with*` and `.reset` functions passes its current style and size
 * as the parentStyle and parentSize of the new options class, so parent
 * handling is taken care of automatically.
 */
function Options(data) {
    this.style = data.style;
    this.color = data.color;
    this.size = data.size;
    this.phantom = data.phantom;
    this.font = data.font;

    if (data.parentStyle === undefined) {
        this.parentStyle = data.style;
    } else {
        this.parentStyle = data.parentStyle;
    }

    if (data.parentSize === undefined) {
        this.parentSize = data.size;
    } else {
        this.parentSize = data.parentSize;
    }
}

/**
 * Returns a new options object with the same properties as "this".  Properties
 * from "extension" will be copied to the new options object.
 */
Options.prototype.extend = function(extension) {
    var data = {
        style: this.style,
        size: this.size,
        color: this.color,
        parentStyle: this.style,
        parentSize: this.size,
        phantom: this.phantom,
        font: this.font,
    };

    for (var key in extension) {
        if (extension.hasOwnProperty(key)) {
            data[key] = extension[key];
        }
    }

    return new Options(data);
};

/**
 * Create a new options object with the given style.
 */
Options.prototype.withStyle = function(style) {
    return this.extend({
        style: style,
    });
};

/**
 * Create a new options object with the given size.
 */
Options.prototype.withSize = function(size) {
    return this.extend({
        size: size,
    });
};

/**
 * Create a new options object with the given color.
 */
Options.prototype.withColor = function(color) {
    return this.extend({
        color: color,
    });
};

/**
 * Create a new options object with "phantom" set to true.
 */
Options.prototype.withPhantom = function() {
    return this.extend({
        phantom: true,
    });
};

/**
 * Create a new options object with the given font.
 */
Options.prototype.withFont = function(font) {
    return this.extend({
        font: font,
    });
};

/**
 * Create a new options object with the same style, size, and color. This is
 * used so that parent style and size changes are handled correctly.
 */
Options.prototype.reset = function() {
    return this.extend({});
};
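
/*
 * Illustrative sketch, not part of KaTeX: the parent-tracking idea described
 * above, reduced to the size property only. `SketchOptions` is a hypothetical
 * stand-in for `Options`, and the size names used with it are placeholders.
 */
function SketchOptions(data) {
    this.size = data.size;
    // Without an explicit parent, the parent size defaults to the own size
    this.parentSize =
        data.parentSize === undefined ? data.size : data.parentSize;
}

SketchOptions.prototype.withSize = function(size) {
    // The current size becomes the parent size of the new object
    return new SketchOptions({size: size, parentSize: this.size});
};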

/**
 * A map of color names to CSS colors.
 * TODO(emily): Remove this when we have real macros
 */
var colorMap = {
    "katex-blue": "#6495ed",
    "katex-orange": "#ffa500",
    "katex-pink": "#ff00af",
    "katex-red": "#df0030",
    "katex-green": "#28ae7b",
    "katex-gray": "gray",
    "katex-purple": "#9d38bd",
    "katex-blueA": "#c7e9f1",
    "katex-blueB": "#9cdceb",
    "katex-blueC": "#58c4dd",
    "katex-blueD": "#29abca",
    "katex-blueE": "#1c758a",
    "katex-tealA": "#acead7",
    "katex-tealB": "#76ddc0",
    "katex-tealC": "#5cd0b3",
    "katex-tealD": "#55c1a7",
    "katex-tealE": "#49a88f",
    "katex-greenA": "#c9e2ae",
    "katex-greenB": "#a6cf8c",
    "katex-greenC": "#83c167",
    "katex-greenD": "#77b05d",
    "katex-greenE": "#699c52",
    "katex-goldA": "#f7c797",
    "katex-goldB": "#f9b775",
    "katex-goldC": "#f0ac5f",
    "katex-goldD": "#e1a158",
    "katex-goldE": "#c78d46",
    "katex-redA": "#f7a1a3",
    "katex-redB": "#ff8080",
    "katex-redC": "#fc6255",
    "katex-redD": "#e65a4c",
    "katex-redE": "#cf5044",
    "katex-maroonA": "#ecabc1",
    "katex-maroonB": "#ec92ab",
    "katex-maroonC": "#c55f73",
    "katex-maroonD": "#a24d61",
    "katex-maroonE": "#94424f",
    "katex-purpleA": "#caa3e8",
    "katex-purpleB": "#b189c6",
    "katex-purpleC": "#9a72ac",
    "katex-purpleD": "#715582",
    "katex-purpleE": "#644172",
    "katex-mintA": "#f5f9e8",
    "katex-mintB": "#edf2df",
    "katex-mintC": "#e0e5cc",
    "katex-grayA": "#fdfdfd",
    "katex-grayB": "#f7f7f7",
    "katex-grayC": "#eeeeee",
    "katex-grayD": "#dddddd",
    "katex-grayE": "#cccccc",
    "katex-grayF": "#aaaaaa",
    "katex-grayG": "#999999",
    "katex-grayH": "#555555",
    "katex-grayI": "#333333",
    "katex-kaBlue": "#314453",
    "katex-kaGreen": "#639b24",
};

/**
 * Gets the CSS color of the current options object, accounting for the
 * `colorMap`.
 */
Options.prototype.getColor = function() {
    if (this.phantom) {
        return "transparent";
    } else {
        return colorMap[this.color] || this.color;
    }
};

module.exports = Options;


================================================
FILE: pix2tex/dataset/preprocessing/third_party/katex/src/ParseError.js
================================================
/**
 * This is the ParseError class, which is the main error thrown by KaTeX
 * functions when something has gone wrong. This is used to distinguish internal
 * errors from errors in the expression that the user provided.
 */
function ParseError(message, lexer, position) {
    var error = "KaTeX parse error: " + message;

    if (lexer !== undefined && position !== undefined) {
        // If we have the input and a position, make the error a bit fancier

        // Prepend some information
        error += " at position " + position + ": ";

        // Get the input
        var input = lexer._input;
        // Insert a combining underscore at the correct position
        input = input.slice(0, position) + "\u0332" +
            input.slice(position);

        // Extract some context from the input and add it to the error
        var begin = Math.max(0, position - 15);
        var end = position + 15;
        error += input.slice(begin, end);
    }

    // Some hackery to make ParseError a prototype of Error
    // See http://stackoverflow.com/a/8460753
    var self = new Error(error);
    self.name = "ParseError";
    self.__proto__ = ParseError.prototype;

    self.position = position;
    return self;
}

// More hackery
ParseError.prototype.__proto__ = Error.prototype;
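
/*
 * Illustrative sketch, not part of KaTeX: the context excerpt that ParseError
 * appends to its message, isolated as a pure function. The name
 * `sketchErrorContext` is hypothetical.
 */
function sketchErrorContext(input, position) {
    // Mark the offending position with a combining underscore
    var marked = input.slice(0, position) + "\u0332" + input.slice(position);
    // Keep up to 15 characters on either side of the marker
    var begin = Math.max(0, position - 15);
    var end = position + 15;
    return marked.slice(begin, end);
}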

module.exports = ParseError;


================================================
FILE: pix2tex/dataset/preprocessing/third_party/katex/src/Parser.js
================================================
/* eslint no-constant-condition:0 */
var functions = require("./functions");
var environments = require("./environments");
var Lexer = require("./Lexer");
var symbols = require("./symbols");
var utils = require("./utils");

var parseData = require("./parseData");
var ParseError = require("./ParseError");

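// Running record of the text of each consumed token; it is appended to in
// Parser.prototype.consume. It is assigned without `var`, so it is an
// implicit global.
global_str = "";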

/**
 * This file contains the parser used to parse out a TeX expression from the
 * input. Since TeX isn't context-free, standard parsers don't work particularly
 * well.
 *
 * The strategy of this parser is as such:
 *
 * The main functions (the `.parse...` ones) take a position in the current
 * parse string to parse tokens from. The lexer (found in Lexer.js, stored at
 * this.lexer) also supports pulling out tokens at arbitrary places. When
 * individual tokens are needed at a position, the lexer is called to pull out a
 * token, which is then used.
 *
 * The parser has a property called "mode" indicating the mode that
 * the parser is currently in. Currently it has to be one of "math" or
 * "text", which denotes whether the current environment is a math-y
 * one or a text-y one (e.g. inside \text). Currently, this serves to
 * limit the functions which can be used in text mode.
 *
 * The main functions then return an object which contains the useful data that
 * was parsed at its given point, and a new position at the end of the parsed
 * data. The main functions can call each other and continue the parsing by
 * using the returned position as a new starting point.
 *
 * There are also extra `.handle...` functions, which pull out some reused
 * functionality into self-contained functions.
 *
 * The earlier functions return ParseNodes.
 * The later functions (which are called deeper in the parse) sometimes return
 * ParseFuncOrArgument, which contain a ParseNode as well as some data about
 * whether the parsed object is a function which is missing some arguments, or a
 * standalone object which can be used as an argument to another function.
 */

/**
 * Main Parser class
 */
function Parser(input, settings) {
    // Make a new lexer
    this.lexer = new Lexer(input);
    // Store the settings for use in parsing
    this.settings = settings;
}

var ParseNode = parseData.ParseNode;

/**
 * An initial function (without its arguments), or an argument to a function.
 * The `result` argument should be a ParseNode.
 */
function ParseFuncOrArgument(result, isFunction) {
    this.result = result;
    // Is this a function (i.e. is it something defined in functions.js)?
    this.isFunction = isFunction;
}

/**
 * Checks a result to make sure it has the right type, and throws an
 * appropriate error otherwise.
 *
 * @param {boolean=} consume whether to consume the expected token,
 *                           defaults to true
 */
Parser.prototype.expect = function(text, consume) {
    if (this.nextToken.text !== text) {
        throw new ParseError(
            "Expected '" + text + "', got '" + this.nextToken.text + "'",
            this.lexer, this.nextToken.position
        );
    }
    if (consume !== false) {
        this.consume();
    }
};

/**
 * Considers the current look ahead token as consumed,
 * and fetches the one after that as the new look ahead.
 */
Parser.prototype.consume = function() {
    this.pos = this.nextToken.position;

    global_str = global_str + " " + this.nextToken.text;
    this.nextToken = this.lexer.lex(this.pos, this.mode);
};

/**
 * Main parsing function, which parses an entire input.
 *
 * @return {?Array.<ParseNode>}
 */
Parser.prototype.parse = function() {
    // Try to parse the input
    this.mode = "math";
    this.pos = 0;
    this.nextToken = this.lexer.lex(this.pos, this.mode);
    var parse = this.parseInput();
    return parse;
};

/**
 * Parses an entire input tree.
 */
Parser.prototype.parseInput = function() {
    // Parse an expression
    var expression = this.parseExpression(false);
    // If we succeeded, make sure there's an EOF at the end
    this.expect("EOF", false);
    return expression;
};

var endOfExpression = ["}", "\\end", "\\right", "&", "\\\\", "\\cr"];

/**
 * Parses an "expression", which is a list of atoms.
 *
 * @param {boolean} breakOnInfix Should the parsing stop when we hit infix
 *                  nodes? This happens when functions have higher precedence
 *                  than infix nodes in implicit parses.
 *
 * @param {?string} breakOnToken The token that the expression should end with,
 *                  or `null` if something else should end the expression.
 *
 * @return {ParseNode}
 */
Parser.prototype.parseExpression = function(breakOnInfix, breakOnToken) {
    var body = [];
    // Keep adding atoms to the body until we can't parse any more atoms (either
    // we reached the end, a }, or a \right)
    while (true) {
        var lex = this.nextToken;
        var pos = this.pos;
        if (endOfExpression.indexOf(lex.text) !== -1) {
            break;
        }
        if (breakOnToken && lex.text === breakOnToken) {
            break;
        }
        var atom = this.parseAtom();
        if (!atom) {
            if (!this.settings.throwOnError && lex.text[0] === "\\") {
                var errorNode = this.handleUnsupportedCmd();
                body.push(errorNode);

                pos = lex.position;
                continue;
            }

            break;
        }
        if (breakOnInfix && atom.type === "infix") {
            // rewind so we can parse the infix atom again
            this.pos = pos;
            this.nextToken = lex;
            break;
        }
        body.push(atom);
    }
    return this.handleInfixNodes(body);
};

/**
 * Rewrites infix operators such as \over with corresponding commands such
 * as \frac.
 *
 * There can only be one infix operator per group.  If there's more than one
 * then the expression is ambiguous.  This can be resolved by adding {}.
 *
 * @returns {Array}
 */
Parser.prototype.handleInfixNodes = function(body) {
    var overIndex = -1;
    var funcName;

    for (var i = 0; i < body.length; i++) {
        var node = body[i];
        if (node.type === "infix") {
            if (overIndex !== -1) {
                throw new ParseError("only one infix operator per group",
                    this.lexer, -1);
            }
            overIndex = i;
            funcName = node.value.replaceWith;
        }
    }

    if (overIndex !== -1) {
        var numerNode;
        var denomNode;

        var numerBody = body.slice(0, overIndex);
        var denomBody = body.slice(overIndex + 1);

        if (numerBody.length === 1 && numerBody[0].type === "ordgroup") {
            numerNode = numerBody[0];
        } else {
            numerNode = new ParseNode("ordgroup", numerBody, this.mode);
        }

        if (denomBody.length === 1 && denomBody[0].type === "ordgroup") {
            denomNode = denomBody[0];
        } else {
            denomNode = new ParseNode("ordgroup", denomBody, this.mode);
        }

        var value = this.callFunction(
            funcName, [numerNode, denomNode], null);
        return [new ParseNode(value.type, value, this.mode)];
    } else {
        return body;
    }
};
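
/*
 * Illustrative sketch, not part of KaTeX: the infix rewrite described above,
 * applied to a flat array of token strings instead of ParseNodes. The name
 * `sketchRewriteInfix` and the shape of the returned object are hypothetical;
 * only the `\over` case is modeled.
 */
function sketchRewriteInfix(body) {
    var overIndex = -1;
    for (var i = 0; i < body.length; i++) {
        if (body[i] === "\\over") {
            if (overIndex !== -1) {
                throw new Error("only one infix operator per group");
            }
            overIndex = i;
        }
    }
    if (overIndex === -1) {
        // No infix operator: the body is returned unchanged
        return body;
    }
    // Everything before the operator is the numerator, everything after it
    // is the denominator
    return [{
        type: "frac",
        numer: body.slice(0, overIndex),
        denom: body.slice(overIndex + 1),
    }];
}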

// The greediness of a superscript or subscript
var SUPSUB_GREEDINESS = 1;

/**
 * Handle a subscript or superscript with nice errors.
 */
Parser.prototype.handleSupSubscript = function(name) {
    var symbol = this.nextToken.text;
    var symPos = this.pos;
    this.consume();
    var group = this.parseGroup();

    if (!group) {
        if (!this.settings.throwOnError && this.nextToken.text[0] === "\\") {
            return this.handleUnsupportedCmd();
        } else {
            // throw new ParseError(
            //     "Expected group after '" + symbol + "'",
            //     this.lexer,
            //     symPos + 1
            // );
        }
    } else if (group.isFunction) {
        // ^ and _ have a greediness, so handle interactions with functions'
        // greediness
        var funcGreediness = functions[group.result].greediness;
        if (funcGreediness > SUPSUB_GREEDINESS) {
            return this.parseFunction(group);
        } else {
            throw new ParseError(
                "Got function '" + group.result + "' with no arguments " +
                    "as " + name,
                this.lexer, symPos + 1);
        }
    } else {
        return group.result;
    }
};

/**
 * Converts the textual input of an unsupported command into a text node
 * contained within a color node whose color is determined by errorColor
 */
Parser.prototype.handleUnsupportedCmd = function() {
    var text = this.nextToken.text;
    var textordArray = [];

    for (var i = 0; i < text.length; i++) {
        textordArray.push(new ParseNode("textord", text[i], "text"));
    }

    var textNode = new ParseNode(
        "text",
        {
            body: textordArray,
            type: "text",
        },
        this.mode);

    var colorNode = new ParseNode(
        "color",
        {
            color: this.settings.errorColor,
            value: [textNode],
            type: "color",
        },
        this.mode);

    this.consume();
    return colorNode;
};

/**
 * Parses a group with optional super/subscripts.
 *
 * @return {?ParseNode}
 */
Parser.prototype.parseAtom = function() {
    // The body of an atom is an implicit group, so that things like
    // \left(x\right)^2 work correctly.
    var base = this.parseImplicitGroup();

    // In text mode, we don't have superscripts or subscripts
    if (this.mode === "text") {
        return base;
    }

    // Note that base may be empty (i.e. null) at this point.

    var superscript;
    var subscript;
    while (true) {
        // Lex the first token
        var lex = this.nextToken;

        if (lex.text === "\\limits" || lex.text === "\\nolimits") {
            // We got a limit control
            if (!base || base.type !== "op") {
                throw new ParseError(
                    "Limit controls must follow a math operator",
                    this.lexer, this.pos);
            } else {
                var limits = lex.text === "\\limits";
                base.value.limits = limits;
                base.value.alwaysHandleSupSub = true;
            }
            this.consume();
        } else if (lex.text === "^") {
            // We got a superscript start
            // if (superscript) {
            //     throw new ParseError(
            //         "Double superscript", this.lexer, this.pos);
            // }
            superscript = this.handleSupSubscript("superscript");
        } else if (lex.text === "_") {
            // We got a subscript start
            // if (subscript) {
            //     throw new ParseError(
            //         "Double subscript", this.lexer, this.pos);
            // }
            subscript = this.handleSupSubscript("subscript");
        } else if (lex.text === "'") {
            // We got a prime
            var prime = new ParseNode("textord", "\\prime", this.mode);

            // Many primes can be grouped together, so we handle this here
            var primes = [prime];
            this.consume();
            // Keep lexing tokens until we get something that's not a prime
            while (this.nextToken.text === "'") {
                // For each one, add another prime to the list
                primes.push(prime);
                this.consume();
            }
            // Put them into an ordgroup as the superscript
            superscript = new ParseNode("ordgroup", primes, this.mode);
        } else {
            // If it wasn't ^, _, or ', stop parsing super/subscripts
            break;
        }
    }

    if (superscript || subscript) {
        // If we got either a superscript or subscript, create a supsub
        return new ParseNode("supsub", {
            base: base,
            sup: superscript,
            sub: subscript,
        }, this.mode);
    } else {
        // Otherwise return the original body
        return base;
    }
};

// A list of the size-changing functions, for use in parseImplicitGroup
var sizeFuncs = [
    "\\tiny", "\\scriptsize", "\\footnotesize", "\\small", "\\normalsize",
    "\\large", "\\Large", "\\LARGE", "\\huge", "\\Huge", "\\textrm", "\\rm", "\\cal",
    "\\bf", "\\siptstyle", "\\boldmath", "\\it"
];

// A list of the style-changing functions, for use in parseImplicitGroup
var styleFuncs = [
    "\\displaystyle", "\\textstyle", "\\scriptstyle", "\\scriptscriptstyle",
];

/**
 * Parses an implicit group, which is a group that starts right after a given
 * function and ends right before a higher explicit group ends, or at EOL. It
 * is used for functions that appear to affect the current style, like \Large or
 * \textrm, where instead of keeping a style we just pretend that there is an
 * implicit grouping after it until the end of the group. E.g.
 *   small text {\Large large text} small text again
 * It is also used for \left and \right to get the correct grouping.
 *
 * @return {?ParseNode}
 */
Parser.prototype.parseImplicitGroup = function() {
    var start = this.parseSymbol();

    if (start == null) {
        // If we didn't get anything we handle, fall back to parseFunction
        return this.parseFunction();
    }

    var func = start.result;
    var body;
    if (func === "\\left") {
        // If we see a left:
        // Parse the entire left function (including the delimiter)
        var left = this.parseFunction(start);
        // Parse out the implicit body
        body = this.parseExpression(false);
        // Check the next token
        this.expect("\\right", false);
        var right = this.parseFunction();
        return new ParseNode("leftright", {
            body: body,
            left: left.value.value,
            right: right.value.value,
        }, this.mode);
    } else if (func === "\\begin") {
        // begin...end is similar to left...right
        var begin = this.parseFunction(start);
        var envName = begin.value.name;
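        var name = (begin.value.name + "");

        // The environment name was consumed one character at a time, so
        // global_str ends in e.g. " a r r a y }". Rewrite that tail as
        // "array}".
        global_str = global_str.substring(0, global_str.length - (name.length * 2 + 2)) + name + "}";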

        if (!environments.hasOwnProperty(envName)) {
            throw new ParseError(
                "No such environment: " + envName,
                this.lexer, begin.value.namepos);
        }
        // Build the environment object. Arguments and other information will
        // be made available to the begin and end methods using properties.
        var env = environments[envName];
        var args = this.parseArguments("\\begin{" + envName + "}", env);
        var context = {
            mode: this.mode,
            envName: envName,
            parser: this,
            lexer: this.lexer,
            positions: args.pop(),
        };
        var result = env.handler(context, args);
        this.expect("\\end", false);
        var end = this.parseFunction();
        
        var name = (begin.value.name + "");

        global_str = global_str.substring(0, global_str.length - (name.length * 2 + 2)) + name + "}";
        if (end.value.name !== envName) {
            throw new ParseError(
                "Mismatch: \\begin{" + envName + "} matched " +
                "by \\end{" + end.value.name + "}",
                this.lexer /* , end.value.namepos */);
            // TODO: Add position to the above line and adjust test case,
            // requires #385 to get merged first
        }
        result.position = end.position;

        return result;

    } else if (func.value == "\\matrix" || func.value == "\\pmatrix" || func.value == "\\cases") {
        // if (!environments.hasOwnProperty(envName)) {
        //     throw new ParseError(
        //         "No such environment: " + envName,
        //         this.lexer, begin.value.namepos);
        // }
        // Build the environment object. Arguments and other information will
        // be made available to the begin and end methods using properties.

        envName = func.value.slice(1);
        var env = environments[envName];
        // var args = this.parseArguments("\\matrix{", env);
        this.expect("{", true);
        var context = {
            mode: this.mode,
            envName: envName,
            parser: this,
            lexer: this.lexer
        };

        var result = env.handler(context, {});
        // exit();
        this.expect("}", true);
        // var end = this.parseFunction();
        var next = this.nextToken.text;
        // exit();
        // console.log(next);
        // var name = ( + "")

        // global_str = global_str.substring(0, global_str.length - (name.length * 2 + 2)) + name + "}"
        // result.position = end.position;

        return result;
        
    } else if (utils.contains(sizeFuncs, func)) {
        // If we see a sizing function, parse out the implicit body
        body = this.parseExpression(false);

        return new ParseNode("sizing", {
            // Figure out what size to use based on the list of functions above
            original: func,
            size: "size" + (utils.indexOf(sizeFuncs, func) + 1),
            value: body,
        }, this.mode);
    } else if (utils.contains(styleFuncs, func)) {
        // If we see a styling function, parse out the implicit body
        body = this.parseExpression(true);
        return new ParseNode("styling", {
            // Figure out what style to use by pulling out the style from
            // the function name
            original: func,
            style: func.slice(1, func.length - 5),
            value: body,
        }, this.mode);
    } else {
        // Defer to parseFunction if it's not a function we handle
        return this.parseFunction(start);
    }
};

/**
 * Parses an entire function, including its base and all of its arguments.
 * The base might either have been parsed already, in which case
 * it is provided as an argument, or it's the next group in the input.
 *
 * @param {ParseFuncOrArgument=} baseGroup optional as described above
 * @return {?ParseNode}
 */
Parser.prototype.parseFunction = function(baseGroup) {
    if (!baseGroup) {
        baseGroup = this.parseGroup();
    }

    if (baseGroup) {
        if (baseGroup.isFunction) {
            var func = baseGroup.result;
            var funcData = functions[func];
            if (this.mode === "text" && !funcData.allowedInText) {
                // throw new ParseError(
                //     "Can't use function '" + func + "' in text mode",
                //     this.lexer, baseGroup.position);
            }

            var args = this.parseArguments(func, funcData);
            var result = this.callFunction(func, args, args.pop());
            return new ParseNode(result.type, result, this.mode);
        } else {
            return baseGroup.result;
        }
    } else {
        return null;
    }
};

/**
 * Call a function handler with a suitable context and arguments.
 */
Parser.prototype.callFunction = function(name, args, positions) {
    var context = {
        funcName: name,
        parser: this,
        lexer: this.lexer,
        positions: positions,
    };
    return functions[name].handler(context, args);
};

/**
 * Parses the arguments of a function or environment
 *
 * @param {string} func  "\name" or "\begin{name}"
 * @param {{numArgs:number,numOptionalArgs:number|undefined}} funcData
 * @return the array of arguments, with the list of positions as last element
 */
Parser.prototype.parseArguments = function(func, funcData) {
    var totalArgs = funcData.numArgs + funcData.numOptionalArgs;
    if (totalArgs === 0) {
        return [[this.pos]];
    }

    var baseGreediness = funcData.greediness;
    var positions = [this.pos];
    var args = [];

    for (var i = 0; i < totalArgs; i++) {
        var argType = funcData.argTypes && funcData.argTypes[i];
        var arg;
        if (i < funcData.numOptionalArgs) {
            if (argType) {
                arg = this.parseSpecialGroup(argType, true);
            } else {
                arg = this.parseOptionalGroup();
            }
            if (!arg) {
                args.push(null);
                positions.push(this.pos);
                continue;
            }
        } else {
            if (argType) {
                arg = this.parseSpecialGroup(argType);
            } else {
                arg = this.parseGroup();
            }
            if (!arg) {
                if (!this.settings.throwOnError &&
                    this.nextToken.text[0] === "\\") {
                    arg = new ParseFuncOrArgument(
                        this.handleUnsupportedCmd(this.nextToken.text),
                        false);
                } else {
                    throw new ParseError(
                        "Expected group after '" + func + "'",
                        this.lexer, this.pos);
                }
            }
        }
        var argNode;
        if (arg.isFunction) {
            var argGreediness =
                functions[arg.result].greediness;
            if (argGreediness > baseGreediness) {
                argNode = this.parseFunction(arg);
            } else {
                // throw new ParseError(
                //     "Got function '" + arg.result + "' as " +
                //     "argument to '" + func + "'",
                //     this.lexer, this.pos - 1);
            }
        } else {
            argNode = arg.result;
        }
        args.push(argNode);
        positions.push(this.pos);
    }

    args.push(positions);

    return args;
};


/**
 * Parses a group when the mode is changing. Takes the mode to parse the group
 * in (`innerMode`) and whether the group is optional. The current mode is
 * saved as the outer mode and restored once the group has been parsed.
 *
 * @return {?ParseFuncOrArgument}
 */
Parser.prototype.parseSpecialGroup = function(innerMode, optional) {
    var outerMode = this.mode;
    // Handle `original` argTypes
    if (innerMode === "original") {
        innerMode = outerMode;
    }

    if (innerMode === "color" || innerMode === "size") {
        // color and size modes are special because they should have braces and
        // should only lex a single symbol inside
        var openBrace = this.nextToken;
        if (optional && openBrace.text !== "[") {
            // optional arguments should return null if they don't exist
            return null;
        }
        // The call to expect will lex the token after the '{' in inner mode
        this.mode = innerMode;
        this.expect(optional ? "[" : "{");
        var inner = this.nextToken;
        this.mode = outerMode;
        var data;
        if (innerMode === "color") {
            data = inner.text;
        } else {
            data = inner.data;
        }
        this.consume(); // consume the token stored in inner
        this.expect(optional ? "]" : "}");
        return new ParseFuncOrArgument(
            new ParseNode(innerMode, data, outerMode),
            false);
    } else if (innerMode === "text") {
        // text mode is special because it should ignore the whitespace before
        // it
        var whitespace = this.lexer.lex(this.pos, "whitespace");
        this.pos = whitespace.position;
    }

    // By the time we get here, innerMode is one of "text" or "math".
    // We switch the mode of the parser, recurse, then restore the old mode.
    this.mode = innerMode;
    this.nextToken = this.lexer.lex(this.pos, innerMode);
    var res;
    if (optional) {
        res = this.parseOptionalGroup();
    } else {
        res = this.parseGroup();
    }
    this.mode = outerMode;
    this.nextToken = this.lexer.lex(this.pos, outerMode);
    return res;
};

/**
 * Parses a group, which is either a single nucleus (like "x") or an expression
 * in braces (like "{x+y}")
 *
 * @return {?ParseFuncOrArgument}
 */
Parser.prototype.parseGroup = function() {
    // Try to parse an open brace
    if (this.nextToken.text === "{") {
        // If we get a brace, parse an expression
        this.consume();
        var expression = this.parseExpression(false);
        // Make sure we get a close brace
        this.expect("}");
        return new ParseFuncOrArgument(
            new ParseNode("ordgroup", expression, this.mode),
            false);
    } else {
        // Otherwise, just return a nucleus
        return this.parseSymbol();
    }
};

/**
 * Parses a group, which is an expression in brackets (like "[x+y]")
 *
 * @return {?ParseFuncOrArgument}
 */
Parser.prototype.parseOptionalGroup = function() {
    // Try to parse an open bracket
    if (this.nextToken.text === "[") {
        // If we get a bracket, parse an expression
        this.consume();
        var expression = this.parseExpression(false, "]");
        // Make sure we get a close bracket
        this.expect("]");
        return new ParseFuncOrArgument(
            new ParseNode("ordgroup", expression, this.mode),
            false);
    } else {
        // Otherwise, return null
        return null;
    }
};

/**
 * Parse a single symbol out of the string. Here, we handle both the functions
 * we have defined, as well as the single character symbols
 *
 * @return {?ParseFuncOrArgument}
 */
Parser.prototype.parseSymbol = function() {
    var nucleus = this.nextToken;

    if (functions[nucleus.text]) {
        this.consume();
        // If there exists a function with this name, we return the function and
        // say that it is a function.
        return new ParseFuncOrArgument(
            nucleus.text,
            true);
    } else if (symbols[this.mode][nucleus.text]) {
        this.consume();
        // Otherwise if this is a no-argument function, find the type it
        // corresponds to in the symbols map
        return new ParseFuncOrArgument(
            new ParseNode(symbols[this.mode][nucleus.text].group,
                          nucleus.text, this.mode),
            false);
    } else if (nucleus.text === "EOF" || nucleus.text === "{") {
        return null;
    } else {
        // Unknown token: instead of failing, consume it and wrap it in a
        // node that reuses the group of "\\sigma" from the math symbols table.
        this.consume();
        return new ParseFuncOrArgument(
            new ParseNode(symbols["math"]["\\sigma"].group,
                          nucleus.text, this.mode),
            false);
    }
};

Parser.prototype.ParseNode = ParseNode;

module.exports = Parser;


================================================
FILE: pix2tex/dataset/preprocessing/third_party/katex/src/Settings.js
================================================
/**
 * This is a module for storing settings passed into KaTeX. It correctly handles
 * default settings.
 */

/**
 * Helper function for getting a default value if the value is undefined
 */
function get(option, defaultValue) {
    return option === undefined ? defaultValue : option;
}
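
// Illustrative sketch (not part of upstream KaTeX): why the helper above tests
// against `undefined` rather than using `option || defaultValue`. The name
// `sketchGetVersusOr` is hypothetical. An explicit falsy option such as
// `throwOnError: false` has to survive the default lookup, and `||` would
// replace it with the default.
function sketchGetVersusOr(option, defaultValue) {
    return {
        // Same test as `get`: only `undefined` triggers the default.
        viaUndefinedCheck: option === undefined ? defaultValue : option,
        // Naive alternative: any falsy value triggers the default.
        viaOr: option || defaultValue,
    };
}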

/**
 * The main Settings object
 *
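 * The current options stored are:
 *  - displayMode: Whether the expression should be typeset by default in
 *                 textstyle or displaystyle (default false)
 *  - throwOnError: Whether invalid input should throw a ParseError
 *                  (default true)
 *  - errorColor: Color used to mark invalid input when throwOnError is false
 *                (default "#cc0000")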
 */
function Settings(options) {
    // allow null options
    options = options || {};
    this.displayMode = get(options.displayMode, false);
    this.throwOnError = get(options.throwOnError, true);
    this.errorColor = get(options.errorColor, "#cc0000");
}

module.exports = Settings;


================================================
FILE: pix2tex/dataset/preprocessing/third_party/katex/src/Style.js
================================================
/**
 * This file contains information and classes for the various kinds of styles
 * used in TeX. It provides a generic `Style` class, which holds information
 * about a specific style. It then provides instances of all the different kinds
 * of styles possible, and provides functions to move between them and get
 * information about them.
 */

/**
 * The main style class. Contains a unique id for the style, a size (which is
 * the same for cramped and uncramped version of a style), a cramped flag, and a
 * size multiplier, which gives the size difference between a style and
 * textstyle.
 */
function Style(id, size, multiplier, cramped) {
    this.id = id;
    this.size = size;
    this.cramped = cramped;
    this.sizeMultiplier = multiplier;
}

/**
 * Get the style of a superscript given a base in the current style.
 */
Style.prototype.sup = function() {
    return styles[sup[this.id]];
};

/**
 * Get the style of a subscript given a base in the current style.
 */
Style.prototype.sub = function() {
    return styles[sub[this.id]];
};

/**
 * Get the style of a fraction numerator given the fraction in the current
 * style.
 */
Style.prototype.fracNum = function() {
    return styles[fracNum[this.id]];
};

/**
 * Get the style of a fraction denominator given the fraction in the current
 * style.
 */
Style.prototype.fracDen = function() {
    return styles[fracDen[this.id]];
};

/**
 * Get the cramped version of a style (in particular, cramping a cramped style
 * doesn't change the style).
 */
Style.prototype.cramp = function() {
    return styles[cramp[this.id]];
};

/**
 * HTML class name, like "textstyle cramped"
 */
Style.prototype.cls = function() {
    return sizeNames[this.size] + (this.cramped ? " cramped" : " uncramped");
};

/**
 * HTML Reset class name, like "reset-textstyle"
 */
Style.prototype.reset = function() {
    return resetNames[this.size];
};

// IDs of the different styles
var D = 0;
var Dc = 1;
var T = 2;
var Tc = 3;
var S = 4;
var Sc = 5;
var SS = 6;
var SSc = 7;

// String names for the different sizes
var sizeNames = [
    "displaystyle textstyle",
    "textstyle",
    "scriptstyle",
    "scriptscriptstyle",
];

// Reset names for the different sizes
var resetNames = [
    "reset-textstyle",
    "reset-textstyle",
    "reset-scriptstyle",
    "reset-scriptscriptstyle",
];

// Instances of the different styles
var styles = [
    new Style(D, 0, 1.0, false),
    new Style(Dc, 0, 1.0, true),
    new Style(T, 1, 1.0, false),
    new Style(Tc, 1, 1.0, true),
    new Style(S, 2, 0.7, false),
    new Style(Sc, 2, 0.7, true),
    new Style(SS, 3, 0.5, false),
    new Style(SSc, 3, 0.5, true),
];

// Lookup tables for switching from one style to another
var sup = [S, Sc, S, Sc, SS, SSc, SS, SSc];
var sub = [Sc, Sc, Sc, Sc, SSc, SSc, SSc, SSc];
var fracNum = [T, Tc, S, Sc, SS, SSc, SS, SSc];
var fracDen = [Tc, Tc, Sc, Sc, SSc, SSc, SSc, SSc];
var cramp = [Dc, Dc, Tc, Tc, Sc, Sc, SSc, SSc];

// We only export some of the styles. Also, we don't export the `Style` class so
// no more styles can be generated.
module.exports = {
    DISPLAY: styles[D],
    TEXT: styles[T],
    SCRIPT: styles[S],
    SCRIPTSCRIPT: styles[SS],
};
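
// Illustrative sketch (not part of upstream KaTeX): how the lookup tables above
// drive style transitions. All `sketch*` names are hypothetical. The ids and
// the `sup` row are copied from this file as plain numbers so the snippet
// stands alone.
var sketchStyleIds = ["D", "Dc", "T", "Tc", "S", "Sc", "SS", "SSc"];
// Same contents as sup = [S, Sc, S, Sc, SS, SSc, SS, SSc]
var sketchSupTable = [4, 5, 4, 5, 6, 7, 6, 7];

// Follows the superscript transition `depth` times, starting from `startId`,
// and returns the names of the styles visited along the way.
function sketchSupChain(startId, depth) {
    var chain = [sketchStyleIds[startId]];
    var id = startId;
    for (var i = 0; i < depth; i++) {
        id = sketchSupTable[id];
        chain.push(sketchStyleIds[id]);
    }
    return chain;
}

// For example, sketchSupChain(0, 3) walks display -> script -> scriptscript
// and then stays there, because scriptscript is the smallest style.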


================================================
FILE: pix2tex/dataset/preprocessing/third_party/katex/src/buildCommon.js
================================================
/* eslint no-console:0 */
/**
 * This module contains general functions that can be used for building
 * different kinds of domTree nodes in a consistent manner.
 */

var domTree = require("./domTree");
var fontMetrics = require("./fontMetrics");
var symbols = require("./symbols");
var utils = require("./utils");

var greekCapitals = [
    "\\Gamma",
    "\\Delta",
    "\\Theta",
    "\\Lambda",
    "\\Xi",
    "\\Pi",
    "\\Sigma",
    "\\Upsilon",
    "\\Phi",
    "\\Psi",
    "\\Omega",
];

var dotlessLetters = [
    "\u0131",   // dotless i, \imath
    "\u0237",   // dotless j, \jmath
];

/**
 * Makes a symbolNode after translation via the list of symbols in symbols.js.
 * Correctly pulls out metrics for the character, and optionally takes a list of
 * classes to be attached to the node.
 */
var makeSymbol = function(value, style, mode, color, classes) {
    // Substitute the replacement value from symbols.js, if one is defined
    if (symbols[mode][value] && symbols[mode][value].replace) {
        value = symbols[mode][value].replace;
    }

    var metrics = fontMetrics.getCharacterMetrics(value, style);

    var symbolNode;
    if (metrics) {
        symbolNode = new domTree.symbolNode(
            value, metrics.height, metrics.depth, metrics.italic, metrics.skew,
            classes);
    } else {
        // TODO(emily): Figure out a good way to only print this in development
        typeof console !== "undefined" && console.warn(
            "No character metrics for '" + value + "' in style '" +
                style + "'");
        symbolNode = new domTree.symbolNode(value, 0, 0, 0, 0, classes);
    }

    if (color) {
        symbolNode.style.color = color;
    }

    return symbolNode;
};

/**
 * Makes a symbol in Main-Regular or AMS-Regular.
 * Used for rel, bin, open, close, inner, and punct.
 */
var mathsym = function(value, mode, color, classes) {
    // Decide what font to render the symbol in by its entry in the symbols
    // table.
    // Special-case value === "\\": the backslash is used as a textord in
    // unsupported-command errors, but it cannot be parsed as a regular text
    // ordinal, so it has no entry in the symbols table for text mode.
    if (value === "\\" || symbols[mode][value].font === "main") {
        return makeSymbol(value, "Main-Regular", mode, color, classes);
    } else {
        return makeSymbol(
            value, "AMS-Regular", mode, color, classes.concat(["amsrm"]));
    }
};

/**
 * Makes a symbol in the default font for mathords and textords.
 */
var mathDefault = function(value, mode, color, classes, type) {
    if (type === "mathord") {
        return mathit(value, mode, color, classes);
    } else if (type === "textord") {
        return makeSymbol(
            value, "Main-Regular", mode, color, classes.concat(["mathrm"]));
    } else {
        throw new Error("unexpected type: " + type + " in mathDefault");
    }
};

/**
 * Makes a symbol in the italic math font.
 */
var mathit = function(value, mode, color, classes) {
    if (/[0-9]/.test(value.charAt(0)) ||
            // glyphs for \imath and \jmath do not exist in Math-Italic so we
            // need to use Main-Italic instead
            utils.contains(dotlessLetters, value) ||
            utils.contains(greekCapitals, value)) {
        return makeSymbol(
            value, "Main-Italic", mode, color, classes.concat(["mainit"]));
    } else {
        return makeSymbol(
            value, "Math-Italic", mode, color, classes.concat(["mathit"]));
    }
};

/**
 * Makes either a mathord or textord in the correct font and color.
 */
var makeOrd = function(group, options, type) {
    var mode = group.mode;
    var value = group.value;
    if (symbols[mode][value] && symbols[mode][value].replace) {
        value = symbols[mode][value].replace;
    }

    var classes = ["mord"];
    var color = options.getColor();

    var font = options.font;
    if (font) {
        if (font === "mathit" || utils.contains(dotlessLetters, value)) {
            return mathit(value, mode, color, classes);
        } else {
            var fontName = fontMap[font].fontName;
            if (fontMetrics.getCharacterMetrics(value, fontName)) {
                return makeSymbol(
                    value, fontName, mode, color, classes.concat([font]));
            } else {
                return mathDefault(value, mode, color, classes, type);
            }
        }
    } else {
        return mathDefault(value, mode, color, classes, type);
    }
};

/**
 * Calculate the height, depth, and maxFontSize of an element based on its
 * children.
 */
var sizeElementFromChildren = function(elem) {
    var height = 0;
    var depth = 0;
    var maxFontSize = 0;

    if (elem.children) {
        for (var i = 0; i < elem.children.length; i++) {
            if (elem.children[i].height > height) {
                height = elem.children[i].height;
            }
            if (elem.children[i].depth > depth) {
                depth = elem.children[i].depth;
            }
            if (elem.children[i].maxFontSize > maxFontSize) {
                maxFontSize = elem.children[i].maxFontSize;
            }
        }
    }

    elem.height = height;
    elem.depth = depth;
    elem.maxFontSize = maxFontSize;
};

/**
 * Makes a span with the given list of classes, list of children, and color.
 */
var makeSpan = function(classes, children, color) {
    var span = new domTree.span(classes, children);

    sizeElementFromChildren(span);

    if (color) {
        span.style.color = color;
    }

    return span;
};

/**
 * Makes a document fragment with the given list of children.
 */
var makeFragment = function(children) {
    var fragment = new domTree.documentFragment(children);

    sizeElementFromChildren(fragment);

    return fragment;
};

/**
 * Makes an element placed in each of the vlist elements to ensure that each
 * element has the same max font size. To do this, we create a zero-width space
 * with the correct font size.
 */
var makeFontSizer = function(options, fontSize) {
    var fontSizeInner = makeSpan([], [new domTree.symbolNode("\u200b")]);
    fontSizeInner.style.fontSize =
        (fontSize / options.style.sizeMultiplier) + "em";

    var fontSizer = makeSpan(
        ["fontsize-ensurer", "reset-" + options.size, "size5"],
        [fontSizeInner]);

    return fontSizer;
};

/**
 * Makes a vertical list by stacking elements and kerns on top of each other.
 * Allows for many different ways of specifying the positioning method.
 *
 * Arguments:
 *  - children: A list of child or kern nodes to be stacked on top of each other
 *              (i.e. the first element will be at the bottom, and the last at
 *              the top). Element nodes are specified as
 *                {type: "elem", elem: node}
 *              while kern nodes are specified as
 *                {type: "kern", size: size}
 *  - positionType: The method by which the vlist should be positioned. Valid
 *                  values are:
 *                   - "individualShift": The children list only contains elem
 *                                        nodes, and each node contains an extra
 *                                        "shift" value of how much it should be
 *                                        shifted (note that shifting is always
 *                                        moving downwards). positionData is
 *                                        ignored.
 *                   - "top": The positionData specifies the topmost point of
 *                            the vlist (note this is expected to be a height,
 *                            so positive values move up)
 *                   - "bottom": The positionData specifies the bottommost point
 *                               of the vlist (note this is expected to be a
 *                               depth, so positive values move down)
 *                   - "shift": The vlist will be positioned such that its
 *                              baseline is positionData away from the baseline
 *                              of the first child. Positive values move
 *                              downwards.
 *                   - "firstBaseline": The vlist will be positioned such that
 *                                      its baseline is aligned with the
 *                                      baseline of the first child.
 *                                      positionData is ignored. (this is
 *                                      equivalent to "shift" with
 *                                      positionData=0)
 *  - positionData: Data used in different ways depending on positionType
 *  - options: An Options object
 *
 */
var makeVList = function(children, positionType, positionData, options) {
    var depth;
    var currPos;
    var i;
    if (positionType === "individualShift") {
        var oldChildren = children;
        children = [oldChildren[0]];

        // Add in kerns to the list of children to get each element to be
        // shifted to the correct specified shift
        depth = -oldChildren[0].shift - oldChildren[0].elem.depth;
        currPos = depth;
        for (i = 1; i < oldChildren.length; i++) {
            var diff = -oldChildren[i].shift - currPos -
                oldChildren[i].elem.depth;
            var size = diff -
                (oldChildren[i - 1].elem.height +
                 oldChildren[i - 1].elem.depth);

            currPos = currPos + diff;

            children.push({type: "kern", size: size});
            children.push(oldChildren[i]);
        }
    } else if (positionType === "top") {
        // We always start at the bottom, so calculate the bottom by adding up
        // all the sizes
        var bottom = positionData;
        for (i = 0; i < children.length; i++) {
            if (children[i].type === "kern") {
                bottom -= children[i].size;
            } else {
                bottom -= children[i].elem.height + children[i].elem.depth;
            }
        }
        depth = bottom;
    } else if (positionType === "bottom") {
        depth = -positionData;
    } else if (positionType === "shift") {
        depth = -children[0].elem.depth - positionData;
    } else if (positionType === "firstBaseline") {
        depth = -children[0].elem.depth;
    } else {
        depth = 0;
    }

    // Make the fontSizer
    var maxFontSize = 0;
    for (i = 0; i < children.length; i++) {
        if (children[i].type === "elem") {
            maxFontSize = Math.max(maxFontSize, children[i].elem.maxFontSize);
        }
    }
    var fontSizer = makeFontSizer(options, maxFontSize);

    // Create a new list of actual children at the correct offsets
    var realChildren = [];
    currPos = depth;
    for (i = 0; i < children.length; i++) {
        if (children[i].type === "kern") {
            currPos += children[i].size;
        } else {
            var child = children[i].elem;

            var shift = -child.depth - currPos;
            currPos += child.height + child.depth;

            var childWrap = makeSpan([], [fontSizer, child]);
            childWrap.height -= shift;
            childWrap.depth += shift;
            childWrap.style.top = shift + "em";

            realChildren.push(childWrap);
        }
    }

    // Add in an element at the end with no offset to fix the calculation of
    // baselines in some browsers (namely IE, sometimes Safari)
    var baselineFix = makeSpan(
        ["baseline-fix"], [fontSizer, new domTree.symbolNode("\u200b")]);
    realChildren.push(baselineFix);

    var vlist = makeSpan(["vlist"], realChildren);
    // Fix the final height and depth, in case there were kerns at the ends
    // since the makeSpan calculation won't take that into account.
    vlist.height = Math.max(currPos, vlist.height);
    vlist.depth = Math.max(-depth, vlist.depth);
    return vlist;
};
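
// Illustrative sketch (not part of upstream KaTeX): the starting-depth
// arithmetic that makeVList applies for the "top", "bottom", "shift" and
// "firstBaseline" position types, reduced to plain numbers. The name
// `sketchVListStart` is hypothetical and "individualShift" is left out.
// Children use the same {type: "elem"} / {type: "kern"} shapes as above,
// with `elem` reduced to an object holding only height and depth.
function sketchVListStart(children, positionType, positionData) {
    if (positionType === "top") {
        // Walk down from the requested top through every child and kern.
        var bottom = positionData;
        for (var i = 0; i < children.length; i++) {
            if (children[i].type === "kern") {
                bottom -= children[i].size;
            } else {
                bottom -= children[i].elem.height + children[i].elem.depth;
            }
        }
        return bottom;
    } else if (positionType === "bottom") {
        // positionData is a depth, so the start is its negation.
        return -positionData;
    } else if (positionType === "shift") {
        return -children[0].elem.depth - positionData;
    } else if (positionType === "firstBaseline") {
        return -children[0].elem.depth;
    }
    return 0;
}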

// A table of size -> font size for the different sizing functions
var sizingMultiplier = {
    size1: 0.5,
    size2: 0.7,
    size3: 0.8,
    size4: 0.9,
    size5: 1.0,
    size6: 1.2,
    size7: 1.44,
    size8: 1.73,
    size9: 2.07,
    size10: 2.49,
};

// A map of spacing functions to their attributes, like size and corresponding
// CSS class
var spacingFunctions = {
    "\\qquad": {
        size: "2em",
        className: "qquad",
    },
    "\\quad": {
        size: "1em",
        className: "quad",
    },
    "\\enspace": {
        size: "0.5em",
        className: "enspace",
    },
    "\\;": {
        size: "0.277778em",
        className: "thickspace",
    },
    "\\:": {
        size: "0.22222em",
        className: "mediumspace",
    },
    "\\,": {
        size: "0.16667em",
        className: "thinspace",
    },
    "\\!": {
        size: "-0.16667em",
        className: "negativethinspace",
    },
};

/**
 * Maps TeX font commands to objects containing:
 * - variant: string used for "mathvariant" attribute in buildMathML.js
 * - fontName: the "style" parameter to fontMetrics.getCharacterMetrics
 */
var fontMap = {
    // styles
    "mathbf": {
        variant: "bold",
        fontName: "Main-Bold",
    },
    "mathrm": {
        variant: "normal",
        fontName: "Main-Regular",
    },

    // "mathit" is missing because it requires the use of two fonts: Main-Italic
    // and Math-Italic.  This is handled by a special case in makeOrd which ends
    // up calling mathit.

    // families
    "mathbb": {
        variant: "double-struck",
        fontName: "AMS-Regular",
    },
    "mathcal": {
        variant: "script",
        fontName: "Caligraphic-Regular",
    },
    "mathfrak": {
        variant: "fraktur",
        fontName: "Fraktur-Regular",
    },
    "mathscr": {
        variant: "script",
        fontName: "Script-Regular",
    },
    "mathsf": {
        variant: "sans-serif",
        fontName: "SansSerif-Regular",
    },
    "mathtt": {
        variant: "monospace",
        fontName: "Typewriter-Regular",
    },
};

module.exports = {
    fontMap: fontMap,
    makeSymbol: makeSymbol,
    mathsym: mathsym,
    makeSpan: makeSpan,
    makeFragment: makeFragment,
    makeVList: makeVList,
    makeOrd: makeOrd,
    sizingMultiplier: sizingMultiplier,
    spacingFunctions: spacingFunctions,
};


================================================
FILE: pix2tex/dataset/preprocessing/third_party/katex/src/buildHTML.js
================================================
/* eslint no-console:0 */
/**
 * This file does the main work of building a domTree structure from a parse
 * tree. The entry point is the `buildHTML` function, which takes a parse tree.
 * Then, the buildExpression, buildGroup, and various groupTypes functions are
 * called, to produce a final HTML tree.
 */

var ParseError = require("./ParseError");
var Style = require("./Style");

var buildCommon = require("./buildCommon");
var delimiter = require("./delimiter");
var domTree = require("./domTree");
var fontMetrics = require("./fontMetrics");
var utils = require("./utils");

var makeSpan = buildCommon.makeSpan;

/**
 * Take a list of nodes, build them in order, and return a list of the built
 * nodes. This function handles the `prev` node correctly, and passes the
 * previous element from the list as the prev of the next element.
 */
var buildExpression = function(expression, options, prev) {
    var groups = [];
    for (var i = 0; i < expression.length; i++) {
        var group = expression[i];
        groups.push(buildGroup(group, options, prev));
        prev = group;
    }
    return groups;
};

// List of types used by getTypeOfGroup,
// see https://github.com/Khan/KaTeX/wiki/Examining-TeX#group-types
var groupToType = {
    mathord: "mord",
    textord: "mord",
    bin: "mbin",
    rel: "mrel",
    text: "mord",
    open: "mopen",
    close: "mclose",
    inner: "minner",
    genfrac: "mord",
    array: "mord",
    spacing: "mord",
    punct: "mpunct",
    ordgroup: "mord",
    op: "mop",
    katex: "mord",
    overline: "mord",
    underline: "mord",
    rule: "mord",
    leftright: "minner",
    sqrt: "mord",
    accent: "mord",
};

/**
 * Gets the final math type of an expression, given its group type. This type is
 * used to determine spacing between elements, and affects bin elements by
 * causing them to change depending on what types are around them. This type
 * must be attached to the outermost node of an element as a CSS class so that
 * spacing with its surrounding elements works correctly.
 *
 * Some elements can be mapped one-to-one from group type to math type, and
 * those are listed in the `groupToType` table.
 *
 * Others (usually elements that wrap around other elements) often have
 * recursive definitions, and thus call `getTypeOfGroup` on their inner
 * elements.
 */
var getTypeOfGroup = function(group) {
    if (group == null) {
        // Like when typesetting $^3$
        return groupToType.mathord;
    } else if (group.type === "supsub") {
        return getTypeOfGroup(group.value.base);
    } else if (group.type === "llap" || group.type === "rlap") {
        return getTypeOfGroup(group.value);
    } else if (group.type === "color") {
        return getTypeOfGroup(group.value.value);
    } else if (group.type === "sizing") {
        return getTypeOfGroup(group.value.value);
    } else if (group.type === "styling") {
        return getTypeOfGroup(group.value.value);
    } else if (group.type === "delimsizing") {
        return groupToType[group.value.delimType];
    } else {
        return groupToType[group.type];
    }
};

/**
 * Sometimes, groups perform special rules when they have superscripts or
 * subscripts attached to them. This function lets the `supsub` group know that
 * its inner element should handle the superscripts and subscripts instead of
 * handling them itself.
 */
var shouldHandleSupSub = function(group, options) {
    if (!group) {
        return false;
    } else if (group.type === "op") {
        // Operators handle supsubs differently when they have limits
        // (e.g. `\displaystyle\sum_2^3`)
        return group.value.limits &&
            (options.style.size === Style.DISPLAY.size ||
            group.value.alwaysHandleSupSub);
    } else if (group.type === "accent") {
        return isCharacterBox(group.value.base);
    } else {
        return null;
    }
};

/**
 * Sometimes we want to pull out the innermost element of a group. In most
 * cases, this will just be the group itself, but when ordgroups and colors have
 * a single element, we want to pull that out.
 */
var getBaseElem = function(group) {
    if (!group) {
        return false;
    } else if (group.type === "ordgroup") {
        if (group.value.length === 1) {
            return getBaseElem(group.value[0]);
        } else {
            return group;
        }
    } else if (group.type === "color") {
        if (group.value.value.length === 1) {
            return getBaseElem(group.value.value[0]);
        } else {
            return group;
        }
    } else {
        return group;
    }
};

/**
 * TeXbook algorithms often reference "character boxes", which are simply groups
 * with a single character in them. To decide if something is a character box,
 * we find its innermost group, and see if it is a single character.
 */
var isCharacterBox = function(group) {
    var baseElem = getBaseElem(group);

    // These are all the types of groups which hold single characters
    return baseElem.type === "mathord" ||
        baseElem.type === "textord" ||
        baseElem.type === "bin" ||
        baseElem.type === "rel" ||
        baseElem.type === "inner" ||
        baseElem.type === "open" ||
        baseElem.type === "close" ||
        baseElem.type === "punct";
};
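
// Illustrative sketch (not part of upstream KaTeX): the unwrapping idea behind
// getBaseElem, written as a loop over plain objects. `sketchInnermost` is a
// hypothetical name, and only ordgroups are handled here (color groups are
// left out). Nested single-element groups such as {{x}} collapse to the
// innermost node, while a group with several elements is returned unchanged.
function sketchInnermost(group) {
    while (group && group.type === "ordgroup" && group.value.length === 1) {
        group = group.value[0];
    }
    return group;
}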

var makeNullDelimiter = function(options) {
    return makeSpan([
        "sizing", "reset-" + options.size, "size5",
        options.style.reset(), Style.TEXT.cls(),
        "nulldelimiter",
    ]);
};

/**
 * This is a map of group types to the function used to handle that type.
 * Simpler types come at the beginning, while complicated types come afterwards.
 */
var groupTypes = {};

groupTypes.mathord = function(group, options, prev) {
    return buildCommon.makeOrd(group, options, "mathord");
};

groupTypes.textord = function(group, options, prev) {
    return buildCommon.makeOrd(group, options, "textord");
};

groupTypes.bin = function(group, options, prev) {
    var className = "mbin";
    // Pull out the most recent element. Do some special handling to find
    // things at the end of a \color group. Note that we don't use the same
    // logic for ordgroups (which count as ords).
    var prevAtom = prev;
    while (prevAtom && prevAtom.type === "color") {
        var atoms = prevAtom.value.value;
        prevAtom = atoms[atoms.length - 1];
    }
    // See TeXbook pg. 442-446, Rules 5 and 6, and the text before Rule 19.
    // Here, we determine whether the bin should turn into an ord. We
    // currently only apply Rule 5.
    if (!prev || utils.contains(["mbin", "mopen", "mrel", "mop", "mpunct"],
            getTypeOfGroup(prevAtom))) {
        group.type = "textord";
        className = "mord";
    }

    return buildCommon.mathsym(
        group.value, group.mode, options.getColor(), [className]);
};
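
// Illustrative sketch (not part of upstream KaTeX): the Rule 5 decision made in
// groupTypes.bin, reduced to a function over math class names.
// `sketchBinClass` is a hypothetical name. A binary operator with nothing
// before it, or one that follows a bin, open, rel, op or punct atom, is
// typeset as an ordinary symbol instead.
function sketchBinClass(prevType) {
    var demoting = ["mbin", "mopen", "mrel", "mop", "mpunct"];
    if (!prevType || demoting.indexOf(prevType) !== -1) {
        return "mord";
    }
    return "mbin";
}

// For example, the minus sign in `-x` or `(-x)` gets "mord", while the one in
// `a - b` keeps "mbin" because it follows an ordinary atom.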

groupTypes.rel = function(group, options, prev) {
    return buildCommon.mathsym(
        group.value, group.mode, options.getColor(), ["mrel"]);
};

groupTypes.open = function(group, options, prev) {
    return buildCommon.mathsym(
        group.value, group.mode, options.getColor(), ["mopen"]);
};

groupTypes.close = function(group, options, prev) {
    return buildCommon.mathsym(
        group.value, group.mode, options.getColor(), ["mclose"]);
};

groupTypes.inner = function(group, options, prev) {
    return buildCommon.mathsym(
        group.value, group.mode, options.getColor(), ["minner"]);
};

groupTypes.punct = function(group, options, prev) {
    return buildCommon.mathsym(
        group.value, group.mode, options.getColor(), ["mpunct"]);
};

groupTypes.ordgroup = function(group, options, prev) {
    return makeSpan(
        ["mord", options.style.cls()],
        buildExpression(group.value, options.reset())
    );
};

groupTypes.text = function(group, options, prev) {
    return makeSpan(["text", "mord", options.style.cls()],
        buildExpression(group.value.body, options.reset()));
};

groupTypes.color = function(group, options, prev) {
    var elements = buildExpression(
        group.value.value,
        options.withColor(group.value.color),
        prev
    );

    // \color isn't supposed to affect the type of the elements it contains.
    // To accomplish this, we wrap the results in a fragment, so the inner
    // elements will be able to directly interact with their neighbors. For
    // example, `\color{red}{2 +} 3` has the same spacing as `2 + 3`
    return buildCommon.makeFragment(elements);
};

groupTypes.supsub = function(group, options, prev) {
    // Superscripts and subscripts are handled in the TeXbook on pages
    // 445-446, rules 18(a-f).

    // Here is where we defer to the inner group if it should handle
    // superscripts and subscripts itself.
    if (shouldHandleSupSub(group.value.base, options)) {
        return groupTypes[group.value.base.type](group, options, prev);
    }

    var base = buildGroup(group.value.base, options.reset());
    var supmid;
    var submid;
    var sup;
    var sub;

    if (group.value.sup) {
        sup = buildGroup(group.value.sup,
                options.withStyle(options.style.sup()));
        supmid = makeSpan(
                [options.style.reset(), options.style.sup().cls()], [sup]);
    }

    if (group.value.sub) {
        sub = buildGroup(group.value.sub,
                options.withStyle(options.style.sub()));
        submid = makeSpan(
                [options.style.reset(), options.style.sub().cls()], [sub]);
    }

    // Rule 18a
    var supShift;
    var subShift;
    if (isCharacterBox(group.value.base)) {
        supShift = 0;
        subShift = 0;
    } else {
        supShift = base.height - fontMetrics.metrics.supDrop;
        subShift = base.depth + fontMetrics.metrics.subDrop;
    }

    // Rule 18c
    var minSupShift;
    if (options.style === Style.DISPLAY) {
        minSupShift = fontMetrics.metrics.sup1;
    } else if (options.style.cramped) {
        minSupShift = fontMetrics.metrics.sup3;
    } else {
        minSupShift = fontMetrics.metrics.sup2;
    }

    // scriptspace is a font-size-independent size, so scale it
    // appropriately
    var multiplier = Style.TEXT.sizeMultiplier *
            options.style.sizeMultiplier;
    var scriptspace =
        (0.5 / fontMetrics.metrics.ptPerEm) / multiplier + "em";

    var supsub;
    if (!group.value.sup) {
        // Rule 18b
        subShift = Math.max(
            subShift, fontMetrics.metrics.sub1,
            sub.height - 0.8 * fontMetrics.metrics.xHeight);

        supsub = buildCommon.makeVList([
            {type: "elem", elem: submid},
        ], "shift", subShift, options);

        supsub.children[0].style.marginRight = scriptspace;

        // Subscripts shouldn't be shifted by the base's italic correction.
        // Account for that by shifting the subscript back the appropriate
        // amount. Note we only do this when the base is a single symbol.
        if (base instanceof domTree.symbolNode) {
            supsub.children[0].style.marginLeft = -base.italic + "em";
        }
    } else if (!group.value.sub) {
        // Rule 18c, d
        supShift = Math.max(supShift, minSupShift,
            sup.depth + 0.25 * fontMetrics.metrics.xHeight);

        supsub = buildCommon.makeVList([
            {type: "elem", elem: supmid},
        ], "shift", -supShift, options);

        supsub.children[0].style.marginRight = scriptspace;
    } else {
        supShift = Math.max(
            supShift, minSupShift,
            sup.depth + 0.25 * fontMetrics.metrics.xHeight);
        subShift = Math.max(subShift, fontMetrics.metrics.sub2);

        var ruleWidth = fontMetrics.metrics.defaultRuleThickness;

        // Rule 18e
        if ((supShift - sup.depth) - (sub.height - subShift) <
                4 * ruleWidth) {
            subShift = 4 * ruleWidth - (supShift - sup.depth) + sub.height;
            var psi = 0.8 * fontMetrics.metrics.xHeight -
                (supShift - sup.depth);
            if (psi > 0) {
                supShift += psi;
                subShift -= psi;
            }
        }

        supsub = buildCommon.makeVList([
            {type: "elem", elem: submid, shift: subShift},
            {type: "elem", elem: supmid, shift: -supShift},
        ], "individualShift", null, options);

        // See comment above about subscripts not being shifted
        if (base instanceof domTree.symbolNode) {
            supsub.children[0].style.marginLeft = -base.italic + "em";
        }

        supsub.children[0].style.marginRight = scriptspace;
        supsub.children[1].style.marginRight = scriptspace;
    }

    return makeSpan([getTypeOfGroup(group.value.base)],
        [base, supsub]);
};
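
// Illustrative sketch only, not part of upstream KaTeX: the Rule 18e
// adjustment from groupTypes.supsub above, restated as a standalone helper so
// the arithmetic can be checked in isolation. The name `adjustSupSubShifts`
// and its plain-number arguments are assumptions made for this example; all
// lengths are in ems.
function adjustSupSubShifts(supShift, subShift, supDepth, subHeight,
                            ruleWidth, xHeight) {
    // Gap between the bottom of the superscript and the top of the subscript
    var gap = (supShift - supDepth) - (subHeight - subShift);
    if (gap < 4 * ruleWidth) {
        // Push the subscript down until the gap is exactly 4 rule widths
        subShift = 4 * ruleWidth - (supShift - supDepth) + subHeight;
        // If the superscript's bottom still sits below 0.8 x-height, move
        // both scripts up by the shortfall; the gap between them is unchanged
        var psi = 0.8 * xHeight - (supShift - supDepth);
        if (psi > 0) {
            supShift += psi;
            subShift -= psi;
        }
    }
    return {supShift: supShift, subShift: subShift};
}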

groupTypes.genfrac = function(group, options, prev) {
    // Fractions are handled in the TeXbook on pages 444-445, rules 15(a-e).
    // Figure out what style this fraction should be in based on the
    // function used
    var fstyle = options.style;
    if (group.value.size === "display") {
        fstyle = Style.DISPLAY;
    } else if (group.value.size === "text") {
        fstyle = Style.TEXT;
    }

    var nstyle = fstyle.fracNum();
    var dstyle = fstyle.fracDen();

    var numer = buildGroup(group.value.numer, options.withStyle(nstyle));
    var numerreset = makeSpan([fstyle.reset(), nstyle.cls()], [numer]);

    var denom = buildGroup(group.value.denom, options.withStyle(dstyle));
    var denomreset = makeSpan([fstyle.reset(), dstyle.cls()], [denom]);

    var ruleWidth;
    if (group.value.hasBarLine) {
        ruleWidth = fontMetrics.metrics.defaultRuleThickness /
            options.style.sizeMultiplier;
    } else {
        ruleWidth = 0;
    }

    // Rule 15b
    var numShift;
    var clearance;
    var denomShift;
    if (fstyle.size === Style.DISPLAY.size) {
        numShift = fontMetrics.metrics.num1;
        if (ruleWidth > 0) {
            clearance = 3 * ruleWidth;
        } else {
            clearance = 7 * fontMetrics.metrics.defaultRuleThickness;
        }
        denomShift = fontMetrics.metrics.denom1;
    } else {
        if (ruleWidth > 0) {
            numShift = fontMetrics.metrics.num2;
            clearance = ruleWidth;
        } else {
            numShift = fontMetrics.metrics.num3;
            clearance = 3 * fontMetrics.metrics.defaultRuleThickness;
        }
        denomShift = fontMetrics.metrics.denom2;
    }

    var frac;
    if (ruleWidth === 0) {
        // Rule 15c
        var candidateClearance =
            (numShift - numer.depth) - (denom.height - denomShift);
        if (candidateClearance < clearance) {
            numShift += 0.5 * (clearance - candidateClearance);
            denomShift += 0.5 * (clearance - candidateClearance);
        }

        frac = buildCommon.makeVList([
            {type: "elem", elem: denomreset, shift: denomShift},
            {type: "elem", elem: numerreset, shift: -numShift},
        ], "individualShift", null, options);
    } else {
        // Rule 15d
        var axisHeight = fontMetrics.metrics.axisHeight;

        if ((numShift - numer.depth) - (axisHeight + 0.5 * ruleWidth) <
                clearance) {
            numShift +=
                clearance - ((numShift - numer.depth) -
                             (axisHeight + 0.5 * ruleWidth));
        }

        if ((axisHeight - 0.5 * ruleWidth) - (denom.height - denomShift) <
                clearance) {
            denomShift +=
                clearance - ((axisHeight - 0.5 * ruleWidth) -
                             (denom.height - denomShift));
        }

        var mid = makeSpan(
            [options.style.reset(), Style.TEXT.cls(), "frac-line"]);
        // Manually set the height of the line because its height is
        // created in CSS
        mid.height = ruleWidth;

        var midShift = -(axisHeight - 0.5 * ruleWidth);

        frac = buildCommon.makeVList([
            {type: "elem", elem: denomreset, shift: denomShift},
            {type: "elem", elem: mid,        shift: midShift},
            {type: "elem", elem: numerreset, shift: -numShift},
        ], "individualShift", null, options);
    }

    // Since we manually change the style sometimes (with \dfrac or \tfrac),
    // account for the possible size change here.
    frac.height *= fstyle.sizeMultiplier / options.style.sizeMultiplier;
    frac.depth *= fstyle.sizeMultiplier / options.style.sizeMultiplier;

    // Rule 15e
    var delimSize;
    if (fstyle.size === Style.DISPLAY.size) {
        delimSize = fontMetrics.metrics.delim1;
    } else {
        delimSize = fontMetrics.metrics.getDelim2(fstyle);
    }

    var leftDelim;
    var rightDelim;
    if (group.value.leftDelim == null) {
        leftDelim = makeNullDelimiter(options);
    } else {
        leftDelim = delimiter.customSizedDelim(
            group.value.leftDelim, delimSize, true,
            options.withStyle(fstyle), group.mode);
    }
    if (group.value.rightDelim == null) {
        rightDelim = makeNullDelimiter(options);
    } else {
        rightDelim = delimiter.customSizedDelim(
            group.value.rightDelim, delimSize, true,
            options.withStyle(fstyle), group.mode);
    }

    return makeSpan(
        ["mord", options.style.reset(), fstyle.cls()],
        [leftDelim, makeSpan(["mfrac"], [frac]), rightDelim],
        options.getColor());
};
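
// Illustrative sketch only, not part of upstream KaTeX: the Rule 15c branch
// of groupTypes.genfrac above (fractions without a bar line), restated as a
// standalone helper. The name `spreadFractionShifts` and its plain-number
// arguments are assumptions made for this example; all lengths are in ems.
function spreadFractionShifts(numShift, denomShift, numerDepth, denomHeight,
                              clearance) {
    // Current gap between the numerator's bottom and the denominator's top
    var gap = (numShift - numerDepth) - (denomHeight - denomShift);
    if (gap < clearance) {
        // Split the shortfall evenly: the numerator moves up by half and the
        // denominator moves down by half
        var half = 0.5 * (clearance - gap);
        numShift += half;
        denomShift += half;
    }
    return {numShift: numShift, denomShift: denomShift};
}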

groupTypes.array = function(group, options, prev) {
    var r;
    var c;
    var nr = group.value.body.length;
    var nc = 0;
    var body = new Array(nr);

    // Horizontal spacing
    var pt = 1 / fontMetrics.metrics.ptPerEm;
    var arraycolsep = 5 * pt; // \arraycolsep in article.cls

    // Vertical spacing
    var baselineskip = 12 * pt; // see size10.clo
    // Default \arraystretch from lttab.dtx
    // TODO(gagern): may get redefined once we have user-defined macros
    var arraystretch = utils.deflt(group.value.arraystretch, 1);
    var arrayskip = arraystretch * baselineskip;
    var arstrutHeight = 0.7 * arrayskip; // \strutbox in ltfsstrc.dtx and
    var arstrutDepth = 0.3 * arrayskip;  // \@arstrutbox in lttab.dtx

    var totalHeight = 0;
    for (r = 0; r < group.value.body.length; ++r) {
        var inrow = group.value.body[r];
        var height = arstrutHeight; // \@array adds an \@arstrut
        var depth = arstrutDepth;   // to each row (via the template)

        if (nc < inrow.length) {
            nc = inrow.length;
        }

        var outrow = new Array(inrow.length);
        for (c = 0; c < inrow.length; ++c) {
            var elt = buildGroup(inrow[c], options);
            if (depth < elt.depth) {
                depth = elt.depth;
            }
            if (height < elt.height) {
                height = elt.height;
            }
            outrow[c] = elt;
        }

        var gap = 0;
        if (group.value.rowGaps[r]) {
            gap = group.value.rowGaps[r].value;
            switch (gap.unit) {
                case "em":
                    gap = gap.number;
                    break;
                case "ex":
                    gap = gap.number * fontMetrics.metrics.emPerEx;
                    break;
                default:
                    console.error("Can't handle unit " + gap.unit);
                    gap = 0;
            }
            if (gap > 0) { // \@argarraycr
                gap += arstrutDepth;
                if (depth < gap) {
                    depth = gap; // \@xargarraycr
                }
                gap = 0;
            }
        }

        outrow.height = height;
        outrow.depth = depth;
        totalHeight += height;
        outrow.pos = totalHeight;
        totalHeight += depth + gap; // \@yargarraycr
        body[r] = outrow;
    }

    var offset = totalHeight / 2 + fontMetrics.metrics.axisHeight;
    var colDescriptions = group.value.cols || [];
    var cols = [];
    var colSep;
    var colDescrNum;
    for (c = 0, colDescrNum = 0;
         // Continue while either there are more columns or more column
         // descriptions, so trailing separators don't get lost.
         c < nc || colDescrNum < colDescriptions.length;
         ++c, ++colDescrNum) {

        var colDescr = colDescriptions[colDescrNum] || {};

        var firstSeparator = true;
        while (colDescr.type === "separator") {
            // If there is more than one separator in a row, add a space
            // between them.
            if (!firstSeparator) {
                colSep = makeSpan(["arraycolsep"], []);
                colSep.style.width =
                    fontMetrics.metrics.doubleRuleSep + "em";
                cols.push(colSep);
            }

            if (colDescr.separator === "|") {
                var separator = makeSpan(
                    ["vertical-separator"],
                    []);
                separator.style.height = totalHeight + "em";
                separator.style.verticalAlign =
                    -(totalHeight - offset) + "em";

                cols.push(separator);
            } else {
                throw new ParseError(
                    "Invalid separator type: " + colDescr.separa

SYMBOL INDEX (202 symbols across 41 files)

FILE: pix2tex/__main__.py
  function main (line 2) | def main():

FILE: pix2tex/api/app.py
  function read_imagefile (line 13) | def read_imagefile(file) -> Image.Image:
  function load_model (line 19) | async def load_model():
  function root (line 26) | def root():
  function predict (line 37) | async def predict(file: UploadFile = File(...)) -> str:
  function predict_from_bytes (line 52) | async def predict_from_bytes(file: bytes = File(...)) -> str:  # , size:...

FILE: pix2tex/api/run.py
  function start_api (line 6) | def start_api(path='.'):
  function start_frontend (line 10) | def start_frontend(path='.'):

FILE: pix2tex/api/streamlit.py
  function encode_image (line 9) | def encode_image(file):

FILE: pix2tex/cli.py
  function minmax_size (line 32) | def minmax_size(img: Image, max_dimensions: Tuple[int, int] = None, min_...
  class LatexOCR (line 58) | class LatexOCR:
    method __init__ (line 65) | def __init__(self, arguments=None):
    method __call__ (line 95) | def __call__(self, img=None, resize=True) -> str:
  function output_prediction (line 143) | def output_prediction(pred, args):
  function predict (line 181) | def predict(model, file, arguments):
  function check_file_path (line 196) | def check_file_path(paths:List[Path], wdir:Optional[Path]=None)->List[str]:
  function main (line 211) | def main(arguments):

FILE: pix2tex/dataset/arxiv.py
  function get_all_arxiv_ids (line 27) | def get_all_arxiv_ids(text):
  function download (line 35) | def download(url, dir_path='./'):
  function read_tex_files (line 50) | def read_tex_files(file_path:str, demacro:bool=False)->str:
  function download_paper (line 88) | def download_paper(arxiv_id, dir_path='./'):
  function read_paper (line 93) | def read_paper(targz_path, delete=False, demacro=False):
  function parse_arxiv (line 102) | def parse_arxiv(id, save=None, demacro=True):

FILE: pix2tex/dataset/dataset.py
  class Im2LatexDataset (line 21) | class Im2LatexDataset:
    method __init__ (line 37) | def __init__(self, equations=None, images=None, tokenizer=None, shuffl...
    method __len__ (line 83) | def __len__(self):
    method __iter__ (line 86) | def __iter__(self):
    method __next__ (line 107) | def __next__(self):
    method prepare_data (line 113) | def prepare_data(self, batch):
    method _get_size (line 153) | def _get_size(self):
    method load (line 159) | def load(self, filename, args=[]):
    method combine (line 174) | def combine(self, x):
    method save (line 189) | def save(self, filename):
    method update (line 198) | def update(self, **kwargs):
  function generate_tokenizer (line 222) | def generate_tokenizer(equations, output, vocab_size):

FILE: pix2tex/dataset/demacro-test.py
  function norm (line 6) | def norm(s):
  function f (line 12) | def f(s):
  class TestDemacroCases (line 16) | class TestDemacroCases(unittest.TestCase):
    method test_noargs (line 17) | def test_noargs(self):
    method test_optional_arg (line 25) | def test_optional_arg(self):
    method test_optional_arg_and_positional_args (line 37) | def test_optional_arg_and_positional_args(self):
    method test_alt_definition1 (line 45) | def test_alt_definition1(self):
    method test_arg_with_bs_and_cb (line 53) | def test_arg_with_bs_and_cb(self):
    method test_multiline_definition (line 66) | def test_multiline_definition(self):
    method test_multiline_definition_alt1 (line 80) | def test_multiline_definition_alt1(self):
    method test_multiline_definition_alt2 (line 89) | def test_multiline_definition_alt2(self):
    method test_multiline_definition_alt3 (line 98) | def test_multiline_definition_alt3(self):
    method test_multiline_definition_alt4 (line 108) | def test_multiline_definition_alt4(self):
    method test_nested_definition (line 119) | def test_nested_definition(self):
    method test_def (line 130) | def test_def(self):
    method test_multi_def_lines_alt0 (line 141) | def test_multi_def_lines_alt0(self):
    method test_multi_def_lines_alt1 (line 153) | def test_multi_def_lines_alt1(self):
    method test_multi_def_lines_alt2 (line 165) | def test_multi_def_lines_alt2(self):
    method test_multi_def_lines_alt3 (line 180) | def test_multi_def_lines_alt3(self):
    method test_let_alt0 (line 199) | def test_let_alt0(self):
    method test_let_alt1 (line 205) | def test_let_alt1(self):
    method test_let_alt2 (line 211) | def test_let_alt2(self):
    method test_let_alt3 (line 217) | def test_let_alt3(self):

FILE: pix2tex/dataset/demacro.py
  class DemacroError (line 11) | class DemacroError(Exception):
  function main (line 15) | def main():
  function parse_command_line (line 25) | def parse_command_line():
  function read (line 32) | def read(path):
  function bracket_replace (line 37) | def bracket_replace(string: str) -> str:
  function undo_bracket_replace (line 55) | def undo_bracket_replace(string):
  function sweep (line 59) | def sweep(t, cmds):
  function unfold (line 81) | def unfold(t):
  function pydemacro (line 123) | def pydemacro(t: str) -> str:
  function replace (line 136) | def replace(match):
  function convert (line 161) | def convert(data):
  function write (line 170) | def write(path, data):

FILE: pix2tex/dataset/extract_latex.py
  function check_brackets (line 20) | def check_brackets(s):
  function remove_labels (line 48) | def remove_labels(string):
  function clean_matches (line 54) | def clean_matches(matches, min_chars=MIN_CHARS):
  function find_math (line 77) | def find_math(s: str, wiki=False) -> List[str]:

FILE: pix2tex/dataset/latex2png.py
  class Latex (line 15) | class Latex:
    method __init__ (line 27) | def __init__(self, math, dpi=250, font='Latin Modern Math'):
    method write (line 35) | def write(self, return_bytes=False):
    method convert_file (line 57) | def convert_file(self, infile, workdir, return_bytes=False):
  function tex2png (line 140) | def tex2png(eq, **kwargs):
  function tex2pil (line 146) | def tex2pil(tex, return_error_index=False, **kwargs):
  function extract (line 152) | def extract(text, expression=None):

FILE: pix2tex/dataset/preprocessing/generate_latex_vocab.py
  function process_args (line 3) | def process_args(args):
  function main (line 29) | def main(args):

FILE: pix2tex/dataset/preprocessing/preprocess_formulas.py
  function process_args (line 12) | def process_args(args):
  function main (line 37) | def main(args):

FILE: pix2tex/dataset/preprocessing/third_party/katex/src/Lexer.js
  function Lexer (line 19) | function Lexer(input) {
  function Token (line 24) | function Token(text, data, position) {

FILE: pix2tex/dataset/preprocessing/third_party/katex/src/Options.js
  function Options (line 17) | function Options(data) {

FILE: pix2tex/dataset/preprocessing/third_party/katex/src/ParseError.js
  function ParseError (line 6) | function ParseError(message, lexer, position) {

FILE: pix2tex/dataset/preprocessing/third_party/katex/src/Parser.js
  function Parser (line 50) | function Parser(input, settings) {
  function ParseFuncOrArgument (line 63) | function ParseFuncOrArgument(result, isFunction) {

FILE: pix2tex/dataset/preprocessing/third_party/katex/src/Settings.js
  function get (line 9) | function get(option, defaultValue) {
  function Settings (line 20) | function Settings(options) {

FILE: pix2tex/dataset/preprocessing/third_party/katex/src/Style.js
  function Style (line 15) | function Style(id, size, multiplier, cramped) {

FILE: pix2tex/dataset/preprocessing/third_party/katex/src/domTree.js
  function span (line 33) | function span(classes, children, height, depth, maxFontSize, style) {
  function documentFragment (line 136) | function documentFragment(children, height, depth, maxFontSize) {
  function symbolNode (line 177) | function symbolNode(value, height, depth, italic, skew, classes, style) {

FILE: pix2tex/dataset/preprocessing/third_party/katex/src/environments.js
  function parseArray (line 14) | function parseArray(parser, result) {
  function defineEnvironment (line 80) | function defineEnvironment(names, props, handler) {

FILE: pix2tex/dataset/preprocessing/third_party/katex/src/functions.js
  function defineFunction (line 80) | function defineFunction(names, props, handler) {

FILE: pix2tex/dataset/preprocessing/third_party/katex/src/mathMLTree.js
  function MathNode (line 18) | function MathNode(type, children) {
  function TextNode (line 81) | function TextNode(text) {

FILE: pix2tex/dataset/preprocessing/third_party/katex/src/parseData.js
  function ParseNode (line 4) | function ParseNode(type, value, mode) {

FILE: pix2tex/dataset/preprocessing/third_party/katex/src/symbols.js
  function defineSymbol (line 24) | function defineSymbol(mode, font, group, replace, name) {

FILE: pix2tex/dataset/preprocessing/third_party/katex/src/utils.js
  function escaper (line 59) | function escaper(match) {
  function escape (line 69) | function escape(text) {
  function clearNode (line 94) | function clearNode(node) {

FILE: pix2tex/dataset/preprocessing/third_party/match-at/lib/matchAt.js
  function getRelocatable (line 5) | function getRelocatable(re) {
  function matchAt (line 24) | function matchAt(re, str, pos) {

FILE: pix2tex/dataset/render.py
  function get_installed_fonts (line 15) | def get_installed_fonts(tex_path: str):
  function render_dataset (line 31) | def render_dataset(dataset: np.ndarray, unrendered: np.ndarray, args) ->...

FILE: pix2tex/dataset/scraping.py
  function recursive_search (line 23) | def recursive_search(parser: Callable,  seeds: List[str], depth: int = 2...
  function parse_url (line 67) | def parse_url(url, encoding=None):
  function parse_wiki (line 76) | def parse_wiki(url):
  function parse_stack_exchange (line 82) | def parse_stack_exchange(url):
  function recursive_wiki (line 90) | def recursive_wiki(seeds, depth=4, skip=[], base_url=wiki_base):
  function recursive_stack_exchange (line 97) | def recursive_stack_exchange(seeds, depth=4, skip=[], base_url=math_stac...

FILE: pix2tex/eval.py
  function detokenize (line 18) | def detokenize(tokens, tokenizer):
  function evaluate (line 31) | def evaluate(model: Model, dataset: Im2LatexDataset, args: Munch, num_ba...

FILE: pix2tex/gui.py
  function to_sympy (line 27) | def to_sympy(latex):
  class WebView (line 33) | class WebView(QWebEngineView):
    method __init__ (line 34) | def __init__(self, app) -> None:
    method dragEnterEvent (line 39) | def dragEnterEvent(self, event):
    method dropEvent (line 45) | def dropEvent(self, event):
  class App (line 49) | class App(QMainWindow):
    method __init__ (line 52) | def __init__(self, args=None):
    method initUI (line 60) | def initUI(self):
    method toggleProcessing (line 153) | def toggleProcessing(self, value=None):
    method eventFilter (line 174) | def eventFilter(self, obj, event):
    method onClick (line 187) | def onClick(self):
    method interrupt (line 205) | def interrupt(self):
    method snip_using_gnome_screenshot (line 211) | def snip_using_gnome_screenshot(self):
    method snip_using_spectacle (line 222) | def snip_using_spectacle(self):
    method snip_using_grim (line 232) | def snip_using_grim(self):
    method returnFromMimeData (line 249) | def returnFromMimeData(self, urls):
    method returnSnip (line 258) | def returnSnip(self, img=None):
    method returnPrediction (line 293) | def returnPrediction(self, result):
    method onFormatChange (line 310) | def onFormatChange(self):
    method formatPrediction (line 317) | def formatPrediction(self, prediction, format_type=None):
    method onTextboxChange (line 346) | def onTextboxChange(self):
    method onFormatTextboxChange (line 354) | def onFormatTextboxChange(self):
    method displayPrediction (line 359) | def displayPrediction(self, prediction=None):
  class ModelThread (line 387) | class ModelThread(QThread):
    method __init__ (line 390) | def __init__(self, img, model):
    method run (line 395) | def run(self):
  class SnipWidget (line 407) | class SnipWidget(QMainWindow):
    method __init__ (line 410) | def __init__(self, parent):
    method update_geometry_based_on_cursor_position (line 431) | def update_geometry_based_on_cursor_position(self):
    method snip (line 444) | def snip(self):
    method paintEvent (line 451) | def paintEvent(self, event):
    method keyPressEvent (line 465) | def keyPressEvent(self, event):
    method mousePressEvent (line 472) | def mousePressEvent(self, event):
    method mouseMoveEvent (line 479) | def mouseMoveEvent(self, event):
    method mouseReleaseEvent (line 483) | def mouseReleaseEvent(self, event):
  function main (line 512) | def main(arguments):

FILE: pix2tex/model/checkpoints/get_latest_checkpoint.py
  function get_latest_tag (line 9) | def get_latest_tag():
  function download_as_bytes_with_progress (line 17) | def download_as_bytes_with_progress(url: str, name: str = None) -> bytes:
  function download_checkpoints (line 37) | def download_checkpoints():

FILE: pix2tex/models/hybrid.py
  class CustomVisionTransformer (line 10) | class CustomVisionTransformer(VisionTransformer):
    method __init__ (line 11) | def __init__(self, img_size=224, patch_size=16, *args, **kwargs):
    method forward_features (line 16) | def forward_features(self, x):
  function get_encoder (line 36) | def get_encoder(args):

FILE: pix2tex/models/transformer.py
  class CustomARWrapper (line 7) | class CustomARWrapper(AutoregressiveWrapper):
    method __init__ (line 8) | def __init__(self, *args, **kwargs):
    method generate (line 12) | def generate(self, start_tokens, seq_len=256, eos_token=None, temperat...
  function get_decoder (line 55) | def get_decoder(args):

FILE: pix2tex/models/utils.py
  class Model (line 9) | class Model(nn.Module):
    method __init__ (line 10) | def __init__(self, encoder, decoder, args):
    method data_parallel (line 16) | def data_parallel(self, x: torch.Tensor, device_ids, output_device=Non...
    method forward (line 29) | def forward(self, x: torch.Tensor, tgt_seq: torch.Tensor,  **kwargs):
    method generate (line 35) | def generate(self, x: torch.Tensor, temperature: float = 0.25):
  function get_model (line 40) | def get_model(args):

FILE: pix2tex/models/vit.py
  class ViTransformerWrapper (line 8) | class ViTransformerWrapper(nn.Module):
    method __init__ (line 9) | def __init__(
    method forward (line 41) | def forward(self, img, **kwargs):
  function get_encoder (line 62) | def get_encoder(args):

FILE: pix2tex/resources/resources.py
  function qInitResources (line 9226) | def qInitResources():
  function qCleanupResources (line 9229) | def qCleanupResources():

FILE: pix2tex/setup_desktop.py
  function _check_file (line 10) | def _check_file(
  function _make_desktop_file (line 20) | def _make_desktop_file(
  function setup_desktop (line 28) | def setup_desktop(

FILE: pix2tex/train.py
  function train (line 18) | def train(args):

FILE: pix2tex/train_resizer.py
  function prepare_data (line 21) | def prepare_data(dataloader: Im2LatexDataset) -> Tuple[torch.tensor, tor...
  function val (line 82) | def val(val: Im2LatexDataset, model: ResNetV2, num_samples=400, device='...
  function main (line 109) | def main(args):

FILE: pix2tex/utils/utils.py
  class EmptyStepper (line 17) | class EmptyStepper:
    method __init__ (line 18) | def __init__(self, *args, **kwargs):
    method step (line 21) | def step(self, *args, **kwargs):
  function exists (line 27) | def exists(val):
  function default (line 31) | def default(val, d):
  function seed_everything (line 37) | def seed_everything(seed: int):
  function parse_args (line 52) | def parse_args(args, **kwargs) -> Munch:
  function get_device (line 66) | def get_device(args, no_cuda=False):
  function gpu_memory_check (line 77) | def gpu_memory_check(model, args):
  function token2str (line 94) | def token2str(tokens, tokenizer) -> list:
  function pad (line 101) | def pad(img: Image, divable: int = 32) -> Image:
  function post_process (line 138) | def post_process(s: str):
  function alternatives (line 163) | def alternatives(s):
  function get_optimizer (line 174) | def get_optimizer(optimizer):
  function get_scheduler (line 178) | def get_scheduler(scheduler):
  function num_model_params (line 184) | def num_model_params(model):
  function in_model_path (line 189) | def in_model_path():

Condensed preview: 95 files, each showing path, character count, and a content snippet (1,631K chars of full structured content).

[
  {
    "path": ".gitignore",
    "chars": 1989,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
  },
  {
    "path": ".readthedocs.yaml",
    "chars": 675,
    "preview": "# .readthedocs.yaml\n# Read the Docs configuration file\n# See https://docs.readthedocs.io/en/stable/config-file/v2.html f"
  },
  {
    "path": "LICENSE",
    "chars": 1070,
    "preview": "MIT License\n\nCopyright (c) 2021 Lukas Blecher\n\nPermission is hereby granted, free of charge, to any person obtaining a c"
  },
  {
    "path": "MANIFEST.in",
    "chars": 18,
    "preview": "exclude **\\*.pth \n"
  },
  {
    "path": "README.md",
    "chars": 8892,
    "preview": "# pix2tex - LaTeX OCR\n\n[![GitHub](https://img.shields.io/github/license/lukas-blecher/LaTeX-OCR)](https://github.com/luk"
  },
  {
    "path": "docker/api.dockerfile",
    "chars": 322,
    "preview": "FROM python:3.8-slim\nRUN pip install torch>=1.7.1\nWORKDIR /latexocr\nADD pix2tex /latexocr/pix2tex/\nADD setup.py /latexoc"
  },
  {
    "path": "docker/build-api.sh",
    "chars": 112,
    "preview": "# cd into proj. root\ncd $(dirname $0)\ncd ..\ndocker build -t lukasblecher/pix2tex:api -f docker/api.dockerfile .\n"
  },
  {
    "path": "docs/Makefile",
    "chars": 634,
    "preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line, and also\n# from the "
  },
  {
    "path": "docs/conf.py",
    "chars": 1789,
    "preview": "# Configuration file for the Sphinx documentation builder.\n#\n# This file only contains a selection of the most common op"
  },
  {
    "path": "docs/index.rst",
    "chars": 761,
    "preview": ".. LaTeX-OCR documentation master file, created by\n   sphinx-quickstart on Sun May  1 16:39:27 2022.\n   You can adapt th"
  },
  {
    "path": "docs/installation.md",
    "chars": 1047,
    "preview": "Installation\n============\n\nPython package\n--------------\n\nTo run the model you need Python 3.7+\n\nIf you don't have PyTor"
  },
  {
    "path": "docs/make.bat",
    "chars": 765,
    "preview": "@ECHO OFF\n\npushd %~dp0\n\nREM Command file for Sphinx documentation\n\nif \"%SPHINXBUILD%\" == \"\" (\n\tset SPHINXBUILD=sphinx-bu"
  },
  {
    "path": "docs/pix2tex.rst",
    "chars": 2389,
    "preview": "pix2tex\n=======\n\npix2tex.cli package\n-------------------\n\n.. automodule:: pix2tex.cli\n   :members:\n   :no-undoc-members:"
  },
  {
    "path": "docs/requirements.txt",
    "chars": 25,
    "preview": "myst_parser\ntorch>=1.7.1\n"
  },
  {
    "path": "notebooks/LaTeX_OCR_test.ipynb",
    "chars": 3008,
    "preview": "{\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0,\n  \"metadata\": {\n    \"colab\": {\n      \"name\": \"LaTeX OCR test.ipynb\",\n      \"pr"
  },
  {
    "path": "notebooks/LaTeX_OCR_training.ipynb",
    "chars": 6264,
    "preview": "{\n  \"nbformat\": 4,\n  \"nbformat_minor\": 0,\n  \"metadata\": {\n    \"colab\": {\n      \"name\": \"LaTeX-OCR training.ipynb\",\n     "
  },
  {
    "path": "pix2tex/__init__.py",
    "chars": 63,
    "preview": "import os\nos.environ['FOR_DISABLE_CONSOLE_CTRL_HANDLER'] = '1'\n"
  },
  {
    "path": "pix2tex/__main__.py",
    "chars": 1385,
    "preview": "#!/usr/bin/env python\ndef main():\n    from argparse import ArgumentParser\n\n    parser = ArgumentParser()\n    parser.add_"
  },
  {
    "path": "pix2tex/api/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "pix2tex/api/app.py",
    "chars": 1489,
    "preview": "# Adapted from https://github.com/kingyiusuen/image-to-latex/blob/main/api/app.py\n\nfrom http import HTTPStatus\nfrom fast"
  },
  {
    "path": "pix2tex/api/run.py",
    "chars": 547,
    "preview": "from multiprocessing import Process\nimport subprocess\nimport os\n\n\ndef start_api(path='.'):\n    subprocess.call(['uvicorn"
  },
  {
    "path": "pix2tex/api/streamlit.py",
    "chars": 1836,
    "preview": "import requests\nfrom PIL import Image\nimport streamlit as st\nfrom st_img_pastebutton import paste\nfrom io import BytesIO"
  },
  {
    "path": "pix2tex/cli.py",
    "chars": 11925,
    "preview": "from pix2tex.dataset.transforms import test_transform\nimport pandas.io.clipboard as clipboard\nfrom PIL import ImageGrab\n"
  },
  {
    "path": "pix2tex/dataset/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "pix2tex/dataset/arxiv.py",
    "chars": 6945,
    "preview": "# modified from https://github.com/soskek/arxiv_leaks\n\nimport argparse\nimport subprocess\nimport os\nimport glob\nimport re"
  },
  {
    "path": "pix2tex/dataset/data/.gitkeep",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "pix2tex/dataset/dataset.py",
    "chars": 10699,
    "preview": "import torch\nimport torch.nn.functional as F\nfrom torch.nn.utils.rnn import pad_sequence\nimport numpy as np\nimport image"
  },
  {
    "path": "pix2tex/dataset/demacro-test.py",
    "chars": 5784,
    "preview": "import unittest\nimport re\nfrom pix2tex.dataset.demacro import pydemacro\n\n\ndef norm(s):\n    s = re.sub(r'\\n+', '\\n', s)\n "
  },
  {
    "path": "pix2tex/dataset/demacro.py",
    "chars": 5236,
    "preview": "# modified from https://tex.stackexchange.com/a/521639\n\nimport argparse\nimport re\nimport logging\nfrom collections import"
  },
  {
    "path": "pix2tex/dataset/extract_latex.py",
    "chars": 4436,
    "preview": "import argparse\nimport html\nimport os\nimport re\nimport numpy as np\nfrom typing import List\n\nMIN_CHARS = 1\nMAX_CHARS = 30"
  },
  {
    "path": "pix2tex/dataset/latex2png.py",
    "chars": 6149,
    "preview": "# mostly taken from http://code.google.com/p/latexmath2png/\n# install preview.sty\nimport os\nimport re\nimport sys\nimport "
  },
  {
    "path": "pix2tex/dataset/postprocess.py",
    "chars": 695,
    "preview": "import argparse\nfrom tqdm.auto import tqdm\n\nif __name__ == '__main__':\n    parser = argparse.ArgumentParser()\n    parser"
  },
  {
    "path": "pix2tex/dataset/preprocessing/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "pix2tex/dataset/preprocessing/generate_latex_vocab.py",
    "chars": 3168,
    "preview": "import sys, logging, argparse, os\n\ndef process_args(args):\n    parser = argparse.ArgumentParser(description='Generate vo"
  },
  {
    "path": "pix2tex/dataset/preprocessing/preprocess_formulas.py",
    "chars": 4233,
    "preview": "# taken and modified from https://github.com/harvardnlp/im2markup\n# tokenize latex formulas\nimport sys\nimport os\nimport "
  },
  {
    "path": "pix2tex/dataset/preprocessing/preprocess_latex.js",
    "chars": 10418,
    "preview": "const path = require('path');\nvar katex = require(path.join(__dirname,\"third_party/katex/katex.js\"))\noptions = require(p"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/README.md",
    "chars": 60,
    "preview": "Directly taken from https://github.com/harvardnlp/im2markup\n"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/.#katex.js",
    "chars": 29,
    "preview": "srush@beaker.12118:1471814512"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/LICENSE.txt",
    "chars": 1289,
    "preview": "The MIT License (MIT)\n\nCopyright (c) 2015 Khan Academy\n\nThis software also uses portions of the underscore.js project, w"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/README.md",
    "chars": 3685,
    "preview": "# [<img src=\"https://khan.github.io/KaTeX/katex-logo.svg\" width=\"130\" alt=\"KaTeX\">](https://khan.github.io/KaTeX/) [![Bu"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/cli.js",
    "chars": 927,
    "preview": "#!/usr/bin/env node\n// Simple CLI for KaTeX.\n// Reads TeX from stdin, outputs HTML to stdout.\n/* eslint no-console:0 */\n"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/katex.js",
    "chars": 2223,
    "preview": "/* eslint no-console:0 */\n/**\n * This is the main entry point for KaTeX. Here, we expose functions for\n * rendering expr"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/package.json",
    "chars": 2473,
    "preview": "{\n  \"_args\": [\n    [\n      \"katex\",\n      \"/home/srush/Projects/im2latex\"\n    ]\n  ],\n  \"_from\": \"katex@latest\",\n  \"_id\":"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/Lexer.js",
    "chars": 5300,
    "preview": "/**\n * The Lexer class handles tokenizing the input in various ways. Since our\n * parser expects us to be able to backtr"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/Options.js",
    "chars": 4911,
    "preview": "/**\n * This file contains information about the options that the Parser carries\n * around with it while parsing. Data is"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/ParseError.js",
    "chars": 1322,
    "preview": "/**\n * This is the ParseError class, which is the main error thrown by KaTeX\n * functions when something has gone wrong."
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/Parser.js",
    "chars": 26372,
    "preview": "/* eslint no-constant-condition:0 */\nvar functions = require(\"./functions\");\nvar environments = require(\"./environments\""
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/Settings.js",
    "chars": 783,
    "preview": "/**\n * This is a module for storing settings passed into KaTeX. It correctly handles\n * default settings.\n */\n\n/**\n * He"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/Style.js",
    "chars": 3199,
    "preview": "/**\n * This file contains information and classes for the various kinds of styles\n * used in TeX. It provides a generic "
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/buildCommon.js",
    "chars": 14362,
    "preview": "/* eslint no-console:0 */\n/**\n * This module contains general functions that can be used for building\n * different kinds"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/buildHTML.js",
    "chars": 49168,
    "preview": "/* eslint no-console:0 */\n/**\n * This file does the main work of building a domTree structure from a parse\n * tree. The "
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/buildMathML.js",
    "chars": 14472,
    "preview": "/**\n * This file converts a parse tree into a cooresponding MathML tree. The main\n * entry point is the `buildMathML` fu"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/buildTree.js",
    "chars": 1090,
    "preview": "var buildHTML = require(\"./buildHTML\");\nvar buildMathML = require(\"./buildMathML\");\nvar buildCommon = require(\"./buildCo"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/delimiter.js",
    "chars": 18988,
    "preview": "/**\n * This file deals with creating delimiters of various sizes. The TeXbook\n * discusses these routines on page 441-44"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/domTree.js",
    "chars": 7224,
    "preview": "/**\n * These objects store the data about the DOM nodes we create, as well as some\n * extra data. They can then be trans"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/environments.js",
    "chars": 8510,
    "preview": "/* eslint no-constant-condition:0 */\nvar fontMetrics = require(\"./fontMetrics\");\nvar parseData = require(\"./parseData\");"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/fontMetrics.js",
    "chars": 4274,
    "preview": "/* eslint no-unused-vars:0 */\n\nvar Style = require(\"./Style\");\n\n/**\n * This file contains metrics regarding fonts and in"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/fontMetricsData.js",
    "chars": 67282,
    "preview": "module.exports = {\n    \"AMS-Regular\": {\n        \"65\": [0, 0.68889, 0, 0],\n        \"66\": [0, 0.68889, 0, 0],\n        \"67\""
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/functions.js",
    "chars": 16828,
    "preview": "var utils = require(\"./utils\");\nvar ParseError = require(\"./ParseError\");\n\n/* This file contains a list of functions tha"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/mathMLTree.js",
    "chars": 2694,
    "preview": "/**\n * These objects store data about MathML nodes. This is the MathML equivalent\n * of the types in domTree.js. Since M"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/parseData.js",
    "chars": 221,
    "preview": "/**\n * The resulting parse tree nodes of the parse tree.\n */\nfunction ParseNode(type, value, mode) {\n    this.type = typ"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/parseTree.js",
    "chars": 377,
    "preview": "/**\n * Provides a single function for parsing an expression using a Parser\n * TODO(emily): Remove this\n */\n\nvar Parser ="
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/symbols.js",
    "chars": 33147,
    "preview": "/**\n * This file holds a list of all no-argument functions and single-character\n * symbols (like 'a' or ';').\n *\n * For "
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/katex/src/utils.js",
    "chars": 2330,
    "preview": "/**\n * This file contains a list of utility functions which are useful in other\n * files.\n */\n\n/**\n * Provide an `indexO"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/match-at/README.md",
    "chars": 125,
    "preview": "# match-at [![Build Status](https://travis-ci.org/spicyj/match-at.svg?branch=master)](https://travis-ci.org/spicyj/match"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/match-at/lib/matchAt.js",
    "chars": 1309,
    "preview": "/** @flow */\n\n\"use strict\";\n\nfunction getRelocatable(re) {\n  // In the future, this could use a WeakMap instead of an ex"
  },
  {
    "path": "pix2tex/dataset/preprocessing/third_party/match-at/package.json",
    "chars": 1328,
    "preview": "{\n  \"name\": \"match-at\",\n  \"version\": \"0.1.0\",\n  \"description\": \"Relocatable regular expressions.\",\n  \"repository\": {\n   "
  },
  {
    "path": "pix2tex/dataset/render.py",
    "chars": 8443,
    "preview": "\nfrom pix2tex.dataset.latex2png import Latex, tex2pil\nimport argparse\nimport sys\nimport os\nimport glob\nimport shutil\nfro"
  },
  {
    "path": "pix2tex/dataset/scraping.py",
    "chars": 6537,
    "preview": "import os\nimport sys\nimport random\nfrom tqdm import tqdm\nimport html\nimport requests\nimport re\nimport argparse\nimport lo"
  },
  {
    "path": "pix2tex/dataset/transforms.py",
    "chars": 1086,
    "preview": "import albumentations as alb\nfrom albumentations.pytorch import ToTensorV2\n\ntrain_transform = alb.Compose(\n    [\n       "
  },
  {
    "path": "pix2tex/eval.py",
    "chars": 5922,
    "preview": "from pix2tex.dataset.dataset import Im2LatexDataset\nimport argparse\nimport logging\nimport yaml\n\nimport numpy as np\nimpor"
  },
  {
    "path": "pix2tex/gui.py",
    "chars": 18434,
    "preview": "from shutil import which\nimport io\nimport subprocess\nimport sys\nimport os\nimport re\nimport tempfile\nfrom PyQt6 import Qt"
  },
  {
    "path": "pix2tex/model/__init__.py",
    "chars": 34,
    "preview": "from pix2tex.utils.utils import *\n"
  },
  {
    "path": "pix2tex/model/checkpoints/__init__.py",
    "chars": 34,
    "preview": "from pix2tex.utils.utils import *\n"
  },
  {
    "path": "pix2tex/model/checkpoints/get_latest_checkpoint.py",
    "chars": 1514,
    "preview": "import requests\nimport os\nimport tqdm\nimport io\n\nurl = 'https://github.com/lukas-blecher/LaTeX-OCR/releases/latest'\n\n\nde"
  },
  {
    "path": "pix2tex/model/dataset/tokenizer.json",
    "chars": 23902,
    "preview": "{\"version\":\"1.0\",\"truncation\":null,\"padding\":null,\"added_tokens\":[{\"id\":0,\"special\":true,\"content\":\"[PAD]\",\"single_word\""
  },
  {
    "path": "pix2tex/model/settings/config-vit.yaml",
    "chars": 853,
    "preview": "gpu_devices: null #[0,1,2,3,4,5,6,7]\nbetas:\n- 0.9\n- 0.999\nbatchsize: 64\nbos_token: 1\nchannels: 1\ndata: dataset/data/trai"
  },
  {
    "path": "pix2tex/model/settings/config.yaml",
    "chars": 862,
    "preview": "gpu_devices: null #[0,1,2,3,4,5,6,7]\nbackbone_layers:\n- 2\n- 3\n- 7\nbetas:\n- 0.9\n- 0.999\nbatchsize: 64\nbos_token: 1\nchanne"
  },
  {
    "path": "pix2tex/model/settings/debug.yaml",
    "chars": 1024,
    "preview": "# Input/Output/Name\ndata: \"dataset/data/dataset.pkl\"\nvaldata: \"dataset/data/val.pkl\"\ntokenizer: \"dataset/tokenizer.json\""
  },
  {
    "path": "pix2tex/models/__init__.py",
    "chars": 20,
    "preview": "from .utils import *"
  },
  {
    "path": "pix2tex/models/hybrid.py",
    "chars": 2429,
    "preview": "import torch\nimport torch.nn as nn\n\nfrom timm.models.vision_transformer import VisionTransformer\nfrom timm.models.vision"
  },
  {
    "path": "pix2tex/models/transformer.py",
    "chars": 2157,
    "preview": "import torch\nimport torch.nn.functional as F\nfrom x_transformers.autoregressive_wrapper import AutoregressiveWrapper, to"
  },
  {
    "path": "pix2tex/models/utils.py",
    "chars": 2149,
    "preview": "import torch\nimport torch.nn as nn\n\nfrom . import hybrid\nfrom . import vit\nfrom . import transformer\n\n\nclass Model(nn.Mo"
  },
  {
    "path": "pix2tex/models/vit.py",
    "chars": 2425,
    "preview": "import torch\nimport torch.nn as nn\n\nfrom x_transformers import Encoder\nfrom einops import rearrange, repeat\n\n\nclass ViTr"
  },
  {
    "path": "pix2tex/resources/MathJax.js",
    "chars": 520824,
    "preview": "document.getElementById&&document.childNodes&&document.createElement&&(window.MathJax&&MathJax.Hub||(window.MathJax?wind"
  },
  {
    "path": "pix2tex/resources/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "pix2tex/resources/resources.py",
    "chars": 440083,
    "preview": "# Resource object code (Python 3)\n# Created by: object code\n# Created by: The Resource Compiler for Qt version 6.4.2\n# W"
  },
  {
    "path": "pix2tex/resources/resources.qrc",
    "chars": 329,
    "preview": "<!DOCTYPE RCC>\n<RCC version=\"1.0\">\n    <qresource prefix=\"icons\">\n        <file alias=\"icon.svg\">resources/icon.svg</fil"
  },
  {
    "path": "pix2tex/setup_desktop.py",
    "chars": 4134,
    "preview": "#!/usr/bin/env python3\n\n'''Simple installer for the graphical user interface of pix2tex'''\n\nimport argparse\nimport os\nim"
  },
  {
    "path": "pix2tex/train.py",
    "chars": 4640,
    "preview": "from pix2tex.dataset.dataset import Im2LatexDataset\nimport os\nimport argparse\nimport logging\nimport yaml\n\nimport torch\nf"
  },
  {
    "path": "pix2tex/train_resizer.py",
    "chars": 6996,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom torch.optim import Adam\nfrom torch.optim.lr_sche"
  },
  {
    "path": "pix2tex/utils/__init__.py",
    "chars": 34,
    "preview": "from pix2tex.utils.utils import *\n"
  },
  {
    "path": "pix2tex/utils/utils.py",
    "chars": 6546,
    "preview": "import random\nimport os\nimport cv2\nimport re\nfrom PIL import Image\nimport numpy as np\nimport torch\nfrom munch import Mun"
  },
  {
    "path": "setup.cfg",
    "chars": 40,
    "preview": "[metadata]\ndescription_file = README.md\n"
  },
  {
    "path": "setup.py",
    "chars": 2507,
    "preview": "#!/usr/bin/env python\n\nimport setuptools\n\n# read the contents of your README file\nfrom pathlib import Path\nthis_director"
  }
]
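
Each entry in the listing above is an object with `path`, `chars`, and `preview` keys. A minimal sketch of how such a listing could be summarized per top-level directory follows. The `summarize_preview` helper is hypothetical (it is not part of the repository or of the extraction tool), and the sample holds four entries copied from the listing.

```python
import json
from collections import defaultdict


def summarize_preview(entries):
    """Sum the `chars` field per top-level directory (hypothetical helper)."""
    totals = defaultdict(int)
    for entry in entries:
        path = entry["path"]
        # Files at the repository root have no "/" in their path.
        top = path.split("/", 1)[0] if "/" in path else "(root)"
        totals[top] += entry["chars"]
    return dict(totals)


# Four entries copied from the listing; a raw string keeps the JSON
# escape sequences (\n) intact so that json.loads decodes them.
sample = json.loads(r'''
[
  {"path": "setup.cfg", "chars": 40,
   "preview": "[metadata]\ndescription_file = README.md\n"},
  {"path": "pix2tex/utils/__init__.py", "chars": 34,
   "preview": "from pix2tex.utils.utils import *\n"},
  {"path": "pix2tex/model/__init__.py", "chars": 34,
   "preview": "from pix2tex.utils.utils import *\n"},
  {"path": "pix2tex/api/__init__.py", "chars": 0, "preview": ""}
]
''')

print(summarize_preview(sample))  # {'(root)': 40, 'pix2tex': 68}
```

Run against the full array, the same function would show which directories dominate the extraction. Note that `preview` is truncated for larger files, so only `chars` reflects a file's real size.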

About this extraction

This page contains the full source code of the lukas-blecher/LaTeX-OCR GitHub repository, extracted and formatted as plain text. The extraction includes 95 files (1.4 MB), approximately 581.7k tokens, and a symbol index with 202 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
