Showing preview only (2,345K chars total). Download the full file or copy to clipboard to get everything.
Repository: facebookresearch/nougat
Branch: main
Commit: 5a92920d342f
Files: 47
Total size: 2.2 MB
Directory structure:
gitextract_rzpn8tp_/
├── .gitignore
├── CODE_OF_CONDUCT.md
├── CONTRIBUTING.md
├── LICENSE
├── LICENSE-MODEL.md
├── MANIFEST.in
├── NOTICE
├── README.md
├── app.py
├── config/
│ └── train_nougat.yaml
├── docker/
│ ├── Dockerfile
│ └── README.md
├── lightning_module.py
├── nougat/
│ ├── __init__.py
│ ├── _version.py
│ ├── dataset/
│ │ ├── __init__.py
│ │ ├── create_index.py
│ │ ├── gen_seek.py
│ │ ├── parser/
│ │ │ ├── __init__.py
│ │ │ ├── document.py
│ │ │ ├── html2md.py
│ │ │ ├── latexml_parser.py
│ │ │ └── markdown.py
│ │ ├── pdffigures.py
│ │ ├── rasterize.py
│ │ ├── split_htmls_to_pages.py
│ │ ├── split_md_to_pages.py
│ │ ├── splitter.py
│ │ ├── staircase.py
│ │ ├── tokenizer.json
│ │ └── utils/
│ │ ├── __init__.py
│ │ ├── latex_conversion.py
│ │ ├── pdf_text_extract.py
│ │ └── utils.py
│ ├── metrics.py
│ ├── model.py
│ ├── postprocessing.py
│ ├── transforms.py
│ └── utils/
│ ├── __init__.py
│ ├── checkpoint.py
│ ├── dataset.py
│ └── device.py
├── predict.py
├── setup.cfg
├── setup.py
├── test.py
└── train.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
core.*
*.bin
.nfs*
.vscode/*
result/*
!result/extract.py
misc/*
wandb/
!misc/*.png
!dataset/gen_seek.py
!result/.gitkeep
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
ckpt*/
# Misc
pdfs
================================================
FILE: CODE_OF_CONDUCT.md
================================================
# Code of Conduct
## Our Pledge
In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.
## Our Standards
Examples of behavior that contributes to creating a positive environment
include:
* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members
Examples of unacceptable behavior by participants include:
* The use of sexualized language or imagery and unwelcome sexual attention or
advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Our Responsibilities
Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.
Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.
## Scope
This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.
This Code of Conduct also applies outside the project spaces when there is a
reasonable belief that an individual's behavior may have a negative impact on
the project or its community.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <opensource-conduct@meta.com>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.
Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
================================================
FILE: CONTRIBUTING.md
================================================
# Contributing to Nougat
## Pull Requests
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Facebook's open source projects.
Complete your CLA here: <https://code.facebook.com/cla>
## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.
## License
By contributing to this repo, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) Meta Platforms, Inc. and affiliates.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: LICENSE-MODEL.md
================================================
# Creative Commons Attribution-NonCommercial 4.0 International Public License
By exercising the Licensed Rights (defined below), You accept and agree to be bound by the terms and conditions of this Creative Commons Attribution-NonCommercial 4.0 International Public License ("Public License"). To the extent this Public License may be interpreted as a contract, You are granted the Licensed Rights in consideration of Your acceptance of these terms and conditions, and the Licensor grants You such rights in consideration of benefits the Licensor receives from making the Licensed Material available under these terms and conditions.
## Section 1 – Definitions.
a. Adapted Material means material subject to Copyright and Similar Rights that is derived from or based upon the Licensed Material and in which the Licensed Material is translated, altered, arranged, transformed, or otherwise modified in a manner requiring permission under the Copyright and Similar Rights held by the Licensor. For purposes of this Public License, where the Licensed Material is a musical work, performance, or sound recording, Adapted Material is always produced where the Licensed Material is synched in timed relation with a moving image.
b. Adapter's License means the license You apply to Your Copyright and Similar Rights in Your contributions to Adapted Material in accordance with the terms and conditions of this Public License.
c. Copyright and Similar Rights means copyright and/or similar rights closely related to copyright including, without limitation, performance, broadcast, sound recording, and Sui Generis Database Rights, without regard to how the rights are labeled or categorized. For purposes of this Public License, the rights specified in Section 2(b)(1)-(2) are not d. Copyright and Similar Rights.
d. Effective Technological Measures means those measures that, in the absence of proper authority, may not be circumvented under laws fulfilling obligations under Article 11 of the WIPO Copyright Treaty adopted on December 20, 1996, and/or similar international agreements.
e. Exceptions and Limitations means fair use, fair dealing, and/or any other exception or limitation to Copyright and Similar Rights that applies to Your use of the Licensed Material.
Licensed Material means the artistic or literary work, database, or other material to which the Licensor applied this Public License.
f. Licensed Rights means the rights granted to You subject to the terms and conditions of this Public License, which are limited to all Copyright and Similar Rights that apply to Your use of the Licensed Material and that the Licensor has authority to license.
g. Licensor means the individual(s) or entity(ies) granting rights under this Public License.
i. NonCommercial means not primarily intended for or directed towards commercial advantage or monetary compensation. For purposes of this Public License, the exchange of the Licensed Material for other material subject to Copyright and Similar Rights by digital file-sharing or similar means is NonCommercial provided there is no payment of monetary compensation in connection with the exchange.
j. Share means to provide material to the public by any means or process that requires permission under the Licensed Rights, such as reproduction, public display, public performance, distribution, dissemination, communication, or importation, and to make material available to the public including in ways that members of the public may access the material from a place and at a time individually chosen by them.
k. Sui Generis Database Rights means rights other than copyright resulting from Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases, as amended and/or succeeded, as well as other essentially equivalent rights anywhere in the world.
l. You means the individual or entity exercising the Licensed Rights under this Public License. Your has a corresponding meaning.
## Section 2 – Scope.
a. License grant.
1. Subject to the terms and conditions of this Public License, the Licensor hereby grants You a worldwide, royalty-free, non-sublicensable, non-exclusive, irrevocable license to exercise the Licensed Rights in the Licensed Material to:
A. reproduce and Share the Licensed Material, in whole or in part, for NonCommercial purposes only; and
B. produce, reproduce, and Share Adapted Material for NonCommercial purposes only.
2. Exceptions and Limitations. For the avoidance of doubt, where Exceptions and Limitations apply to Your use, this Public License does not apply, and You do not need to comply with its terms and conditions.
3. Term. The term of this Public License is specified in Section 6(a).
4. Media and formats; technical modifications allowed. The Licensor authorizes You to exercise the Licensed Rights in all media and formats whether now known or hereafter created, and to make technical modifications necessary to do so. The Licensor waives and/or agrees not to assert any right or authority to forbid You from making technical modifications necessary to exercise the Licensed Rights, including technical modifications necessary to circumvent Effective Technological Measures. For purposes of this Public License, simply making modifications authorized by this Section 2(a)(4) never produces Adapted Material.
5. Downstream recipients.
a. Offer from the Licensor – Licensed Material. Every recipient of the Licensed Material automatically receives an offer from the Licensor to exercise the Licensed Rights under the terms and conditions of this Public License.
b. No downstream restrictions. You may not offer or impose any additional or different terms or conditions on, or apply any Effective Technological Measures to, the Licensed Material if doing so restricts exercise of the Licensed Rights by any recipient of the Licensed Material.
6. No endorsement. Nothing in this Public License constitutes or may be construed as permission to assert or imply that You are, or that Your use of the Licensed Material is, connected with, or sponsored, endorsed, or granted official status by, the Licensor or others designated to receive attribution as provided in Section 3(a)(1)(A)(i).
b. Other rights.
1. Moral rights, such as the right of integrity, are not licensed under this Public License, nor are publicity, privacy, and/or other similar personality rights; however, to the extent possible, the Licensor waives and/or agrees not to assert any such rights held by the Licensor to the limited extent necessary to allow You to exercise the Licensed Rights, but not otherwise.
2. Patent and trademark rights are not licensed under this Public License.
3. To the extent possible, the Licensor waives any right to collect royalties from You for the exercise of the Licensed Rights, whether directly or through a collecting society under any voluntary or waivable statutory or compulsory licensing scheme. In all other cases the Licensor expressly reserves any right to collect such royalties, including when the Licensed Material is used other than for NonCommercial purposes.
## Section 3 – License Conditions.
Your exercise of the Licensed Rights is expressly made subject to the following conditions.
a. Attribution.
1. If You Share the Licensed Material (including in modified form), You must:
A. retain the following if it is supplied by the Licensor with the Licensed Material:
identification of the creator(s) of the Licensed Material and any others designated to receive attribution, in any reasonable manner requested by the Licensor (including by pseudonym if designated);
i) a copyright notice;
ii) a notice that refers to this Public License;
iii) a notice that refers to the disclaimer of warranties;
iv) a URI or hyperlink to the Licensed Material to the extent reasonably practicable;
B. indicate if You modified the Licensed Material and retain an indication of any previous modifications; and
C. indicate the Licensed Material is licensed under this Public License, and include the text of, or the URI or hyperlink to, this Public License.
2. You may satisfy the conditions in Section 3(a)(1) in any reasonable manner based on the medium, means, and context in which You Share the Licensed Material. For example, it may be reasonable to satisfy the conditions by providing a URI or hyperlink to a resource that includes the required information.
3. If requested by the Licensor, You must remove any of the information required by Section 3(a)(1)(A) to the extent reasonably practicable.
4. If You Share Adapted Material You produce, the Adapter's License You apply must not prevent recipients of the Adapted Material from complying with this Public License.
## Section 4 – Sui Generis Database Rights.
Where the Licensed Rights include Sui Generis Database Rights that apply to Your use of the Licensed Material:
a. for the avoidance of doubt, Section 2(a)(1) grants You the right to extract, reuse, reproduce, and Share all or a substantial portion of the contents of the database for NonCommercial purposes only;
b. if You include all or a substantial portion of the database contents in a database in which You have Sui Generis Database Rights, then the database in which You have Sui Generis Database Rights (but not its individual contents) is Adapted Material; and
c. You must comply with the conditions in Section 3(a) if You Share all or a substantial portion of the contents of the database.
For the avoidance of doubt, this Section 4 supplements and does not replace Your obligations under this Public License where the Licensed Rights include other Copyright and Similar Rights.
## Section 5 – Disclaimer of Warranties and Limitation of Liability.
a. Unless otherwise separately undertaken by the Licensor, to the extent possible, the Licensor offers the Licensed Material as-is and as-available, and makes no representations or warranties of any kind concerning the Licensed Material, whether express, implied, statutory, or other. This includes, without limitation, warranties of title, merchantability, fitness for a particular purpose, non-infringement, absence of latent or other defects, accuracy, or the presence or absence of errors, whether or not known or discoverable. Where disclaimers of warranties are not allowed in full or in part, this disclaimer may not apply to You.
b. To the extent possible, in no event will the Licensor be liable to You on any legal theory (including, without limitation, negligence) or otherwise for any direct, special, indirect, incidental, consequential, punitive, exemplary, or other losses, costs, expenses, or damages arising out of this Public License or use of the Licensed Material, even if the Licensor has been advised of the possibility of such losses, costs, expenses, or damages. Where a limitation of liability is not allowed in full or in part, this limitation may not apply to You.
c. The disclaimer of warranties and limitation of liability provided above shall be interpreted in a manner that, to the extent possible, most closely approximates an absolute disclaimer and waiver of all liability.
## Section 6 – Term and Termination.
a. This Public License applies for the term of the Copyright and Similar Rights licensed here. However, if You fail to comply with this Public License, then Your rights under this Public License terminate automatically.
b. Where Your right to use the Licensed Material has terminated under Section 6(a), it reinstates:
1. automatically as of the date the violation is cured, provided it is cured within 30 days of Your discovery of the violation; or
2. upon express reinstatement by the Licensor.
For the avoidance of doubt, this Section 6(b) does not affect any right the Licensor may have to seek remedies for Your violations of this Public License.
c. For the avoidance of doubt, the Licensor may also offer the Licensed Material under separate terms or conditions or stop distributing the Licensed Material at any time; however, doing so will not terminate this Public License.
d. Sections 1, 5, 6, 7, and 8 survive termination of this Public License.
## Section 7 – Other Terms and Conditions.
a. The Licensor shall not be bound by any additional or different terms or conditions communicated by You unless expressly agreed.
b. Any arrangements, understandings, or agreements regarding the Licensed Material not stated herein are separate from and independent of the terms and conditions of this Public License.
## Section 8 – Interpretation.
a. For the avoidance of doubt, this Public License does not, and shall not be interpreted to, reduce, limit, restrict, or impose conditions on any use of the Licensed Material that could lawfully be made without permission under this Public License.
b. To the extent possible, if any provision of this Public License is deemed unenforceable, it shall be automatically reformed to the minimum extent necessary to make it enforceable. If the provision cannot be reformed, it shall be severed from this Public License without affecting the enforceability of the remaining terms and conditions.
c. No term or condition of this Public License will be waived and no failure to comply consented to unless expressly agreed to by the Licensor.
d. Nothing in this Public License constitutes or may be interpreted as a limitation upon, or waiver of, any privileges and immunities that apply to the Licensor or You, including from the legal processes of any jurisdiction or authority.
================================================
FILE: MANIFEST.in
================================================
include ./*.*
================================================
FILE: NOTICE
================================================
Donut
Copyright (c) 2022-present NAVER Corp.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
--------------------------------------------------------------------------------------
This project contains subcomponents with separate copyright notices and license terms.
Your use of the source code for these subcomponents is subject to the terms and conditions of the following licenses.
=====
googlefonts/noto-fonts
https://fonts.google.com/specimen/Noto+Sans
Copyright 2018 The Noto Project Authors (github.com/googlei18n/noto-fonts)
This Font Software is licensed under the SIL Open Font License,
Version 1.1.
This license is copied below, and is also available with a FAQ at:
http://scripts.sil.org/OFL
-----------------------------------------------------------
SIL OPEN FONT LICENSE Version 1.1 - 26 February 2007
-----------------------------------------------------------
PREAMBLE
The goals of the Open Font License (OFL) are to stimulate worldwide
development of collaborative font projects, to support the font
creation efforts of academic and linguistic communities, and to
provide a free and open framework in which fonts may be shared and
improved in partnership with others.
The OFL allows the licensed fonts to be used, studied, modified and
redistributed freely as long as they are not sold by themselves. The
fonts, including any derivative works, can be bundled, embedded,
redistributed and/or sold with any software provided that any reserved
names are not used by derivative works. The fonts and derivatives,
however, cannot be released under any other type of license. The
requirement for fonts to remain under this license does not apply to
any document created using the fonts or their derivatives.
DEFINITIONS
"Font Software" refers to the set of files released by the Copyright
Holder(s) under this license and clearly marked as such. This may
include source files, build scripts and documentation.
"Reserved Font Name" refers to any names specified as such after the
copyright statement(s).
"Original Version" refers to the collection of Font Software
components as distributed by the Copyright Holder(s).
"Modified Version" refers to any derivative made by adding to,
deleting, or substituting -- in part or in whole -- any of the
components of the Original Version, by changing formats or by porting
the Font Software to a new environment.
"Author" refers to any designer, engineer, programmer, technical
writer or other person who contributed to the Font Software.
PERMISSION & CONDITIONS
Permission is hereby granted, free of charge, to any person obtaining
a copy of the Font Software, to use, study, copy, merge, embed,
modify, redistribute, and sell modified and unmodified copies of the
Font Software, subject to the following conditions:
1) Neither the Font Software nor any of its individual components, in
Original or Modified Versions, may be sold by itself.
2) Original or Modified Versions of the Font Software may be bundled,
redistributed and/or sold with any software, provided that each copy
contains the above copyright notice and this license. These can be
included either as stand-alone text files, human-readable headers or
in the appropriate machine-readable metadata fields within text or
binary files as long as those fields can be easily viewed by the user.
3) No Modified Version of the Font Software may use the Reserved Font
Name(s) unless explicit written permission is granted by the
corresponding Copyright Holder. This restriction only applies to the
primary font name as presented to the users.
4) The name(s) of the Copyright Holder(s) or the Author(s) of the Font
Software shall not be used to promote, endorse or advertise any
Modified Version, except to acknowledge the contribution(s) of the
Copyright Holder(s) and the Author(s) or with their explicit written
permission.
5) The Font Software, modified or unmodified, in part or in whole,
must be distributed entirely under this license, and must not be
distributed under any other license. The requirement for fonts to
remain under this license does not apply to any document created using
the Font Software.
TERMINATION
This license becomes null and void if any of the above conditions are
not met.
DISCLAIMER
THE FONT SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT
OF COPYRIGHT, PATENT, TRADEMARK, OR OTHER RIGHT. IN NO EVENT SHALL THE
COPYRIGHT HOLDER BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
INCLUDING ANY GENERAL, SPECIAL, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL
DAMAGES, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF THE USE OR INABILITY TO USE THE FONT SOFTWARE OR FROM
OTHER DEALINGS IN THE FONT SOFTWARE.
=====
huggingface/transformers
https://github.com/huggingface/transformers
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.
=====
clovaai/synthtiger
https://github.com/clovaai/synthtiger
Copyright (c) 2021-present NAVER Corp.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
=====
rwightman/pytorch-image-models
https://github.com/rwightman/pytorch-image-models
Copyright 2019 Ross Wightman
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
=====
ankush-me/SynthText
https://github.com/ankush-me/SynthText
Copyright 2017, Ankush Gupta.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
=====
================================================
FILE: README.md
================================================
<div align="center">
<h1>Nougat: Neural Optical Understanding for Academic Documents</h1>
[](https://arxiv.org/abs/2308.13418)
[](https://github.com/facebookresearch/nougat)
[](https://pypi.org/project/nougat-ocr)
[](https://www.python.org/downloads/release/python-390/)
[](https://github.com/psf/black)
[](https://huggingface.co/spaces/ysharma/nougat)
</div>
This is the official repository for Nougat, the academic document PDF parser that understands LaTeX math and tables.
Project page: https://facebookresearch.github.io/nougat/
## Install
From pip:
```
pip install nougat-ocr
```
From repository:
```
pip install git+https://github.com/facebookresearch/nougat
```
> Note, on Windows: If you want to utilize a GPU, make sure you first install the correct PyTorch version. Follow instructions [here](https://pytorch.org/get-started/locally/)
There are extra dependencies if you want to call the model from an API or generate a dataset.
Install via
`pip install "nougat-ocr[api]"` or `pip install "nougat-ocr[dataset]"`
### Get prediction for a PDF
#### CLI
To get predictions for a PDF run
```
$ nougat path/to/file.pdf -o output_directory
```
A path to a directory or to a file where each line is a path to a PDF can also be passed as a positional argument
```
$ nougat path/to/directory -o output_directory
```
```
usage: nougat [-h] [--batchsize BATCHSIZE] [--checkpoint CHECKPOINT] [--model MODEL] [--out OUT]
[--recompute] [--markdown] [--no-skipping] pdf [pdf ...]
positional arguments:
pdf PDF(s) to process.
options:
-h, --help show this help message and exit
--batchsize BATCHSIZE, -b BATCHSIZE
Batch size to use.
--checkpoint CHECKPOINT, -c CHECKPOINT
Path to checkpoint directory.
--model MODEL_TAG, -m MODEL_TAG
Model tag to use.
--out OUT, -o OUT Output directory.
--recompute Recompute already computed PDF, discarding previous predictions.
--full-precision Use float32 instead of bfloat16. Can speed up CPU conversion for some setups.
--no-markdown Do not add postprocessing step for markdown compatibility.
--markdown Add postprocessing step for markdown compatibility (default).
--no-skipping Don't apply failure detection heuristic.
--pages PAGES, -p PAGES
Provide page numbers like '1-4,7' for pages 1 through 4 and page 7. Only works for single PDFs.
```
The default model tag is `0.1.0-small`. If you want to use the base model, use `0.1.0-base`.
```
$ nougat path/to/file.pdf -o output_directory -m 0.1.0-base
```
In the output directory every PDF will be saved as a `.mmd` file, the lightweight markup language, mostly compatible with [Mathpix Markdown](https://github.com/Mathpix/mathpix-markdown-it) (we make use of the LaTeX tables).
> Note: On some devices the failure detection heuristic is not working properly. If you experience a lot of `[MISSING_PAGE]` responses, try to run with the `--no-skipping` flag. Related: [#11](https://github.com/facebookresearch/nougat/issues/11), [#67](https://github.com/facebookresearch/nougat/issues/67)
#### API
With the extra dependencies you use `app.py` to start an API. Call
```sh
$ nougat_api
```
To get a prediction of a PDF file by making a POST request to http://127.0.0.1:8503/predict/. It also accepts parameters `start` and `stop` to limit the computation to select page numbers (boundaries are included).
The response is a string with the markdown text of the document.
```sh
curl -X 'POST' \
'http://127.0.0.1:8503/predict/' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@<PDFFILE.pdf>;type=application/pdf'
```
To use the limit the conversion to pages 1 to 5, use the start/stop parameters in the request URL: http://127.0.0.1:8503/predict/?start=1&stop=5
## Dataset
### Generate dataset
To generate a dataset you need
1. A directory containing the PDFs
2. A directory containing the `.html` files (processed `.tex` files by [LaTeXML](https://math.nist.gov/~BMiller/LaTeXML/)) with the same folder structure
3. A binary file of [pdffigures2](https://github.com/allenai/pdffigures2) and a corresponding environment variable `export PDFFIGURES_PATH="/path/to/binary.jar"`
Next run
```
python -m nougat.dataset.split_htmls_to_pages --html path/html/root --pdfs path/pdf/root --out path/paired/output --figure path/pdffigures/outputs
```
Additional arguments include
| Argument | Description |
| --------------------- | ------------------------------------------ |
| `--recompute` | recompute all splits |
| `--markdown MARKDOWN` | Markdown output dir |
| `--workers WORKERS` | How many processes to use |
| `--dpi DPI` | What resolution the pages will be saved at |
| `--timeout TIMEOUT` | max time per paper in seconds |
| `--tesseract` | Tesseract OCR prediction for each page |
Finally create a `jsonl` file that contains all the image paths, markdown text and meta information.
```
python -m nougat.dataset.create_index --dir path/paired/output --out index.jsonl
```
For each `jsonl` file you also need to generate a seek map for faster data loading:
```
python -m nougat.dataset.gen_seek file.jsonl
```
The resulting directory structure can look as follows:
```
root/
├── images
├── train.jsonl
├── train.seek.map
├── test.jsonl
├── test.seek.map
├── validation.jsonl
└── validation.seek.map
```
Note that the `.mmd` and `.json` files in the `path/paired/output` (here `images`) are no longer required.
This can be useful for pushing to a S3 bucket by halving the amount of files.
## Training
To train or fine tune a Nougat model, run
```
python train.py --config config/train_nougat.yaml
```
## Evaluation
Run
```
python test.py --checkpoint path/to/checkpoint --dataset path/to/test.jsonl --save_path path/to/results.json
```
To get the results for the different text modalities, run
```
python -m nougat.metrics path/to/results.json
```
## FAQ
- Why am I only getting `[MISSING_PAGE]`?
Nougat was trained on scientific papers found on arXiv and PMC. Is the document you're processing similar to that?
What language is the document in? Nougat works best with English papers, other Latin-based languages might work. **Chinese, Russian, Japanese etc. will not work**.
If these requirements are fulfilled it might be because of false positives in the failure detection, when computing on CPU or older GPUs ([#11](https://github.com/facebookresearch/nougat/issues/11)). Try passing the `--no-skipping` flag for now.
- Where can I download the model checkpoint from.
They are uploaded here on GitHub in the release section. You can also download them during the first execution of the program. Choose the preferred preferred model by passing `--model 0.1.0-{base,small}`
## Citation
```
@misc{blecher2023nougat,
title={Nougat: Neural Optical Understanding for Academic Documents},
author={Lukas Blecher and Guillem Cucurull and Thomas Scialom and Robert Stojnic},
year={2023},
eprint={2308.13418},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
## Acknowledgments
This repository builds on top of the [Donut](https://github.com/clovaai/donut/) repository.
## License
Nougat codebase is licensed under MIT.
Nougat model weights are licensed under CC-BY-NC.
================================================
FILE: app.py
================================================
"""
Copyright (c) Meta Platforms, Inc. and affiliates.
This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.
"""
import os
import sys
from functools import partial
from http import HTTPStatus
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from pathlib import Path
import hashlib
from fastapi.middleware.cors import CORSMiddleware
import pypdfium2
import torch
from nougat import NougatModel
from nougat.postprocessing import markdown_compatible, close_envs
from nougat.utils.dataset import ImageDataset
from nougat.utils.checkpoint import get_checkpoint
from nougat.dataset.rasterize import rasterize_paper
from nougat.utils.device import move_to_device, default_batch_size
from tqdm import tqdm
SAVE_DIR = Path("./pdfs")
BATCHSIZE = int(os.environ.get("NOUGAT_BATCHSIZE", default_batch_size()))
NOUGAT_CHECKPOINT = get_checkpoint()
if NOUGAT_CHECKPOINT is None:
print(
"Set environment variable 'NOUGAT_CHECKPOINT' with a path to the model checkpoint!"
)
sys.exit(1)
app = FastAPI(title="Nougat API")
origins = ["http://localhost", "http://127.0.0.1"]
app.add_middleware(
CORSMiddleware,
allow_origins=origins,
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
model = None
@app.on_event("startup")
async def load_model(
checkpoint: str = NOUGAT_CHECKPOINT,
):
global model, BATCHSIZE
if model is None:
model = NougatModel.from_pretrained(checkpoint)
model = move_to_device(model, cuda=BATCHSIZE > 0)
if BATCHSIZE <= 0:
BATCHSIZE = 1
model.eval()
@app.get("/")
def root():
"""Health check."""
response = {
"status-code": HTTPStatus.OK,
"data": {},
}
return response
@app.post("/predict/")
async def predict(
file: UploadFile = File(...), start: int = None, stop: int = None
) -> str:
"""
Perform predictions on a PDF document and return the extracted text in Markdown format.
Args:
file (UploadFile): The uploaded PDF file to process.
start (int, optional): The starting page number for prediction.
stop (int, optional): The ending page number for prediction.
Returns:
str: The extracted text in Markdown format.
"""
pdfbin = file.file.read()
pdf = pypdfium2.PdfDocument(pdfbin)
md5 = hashlib.md5(pdfbin).hexdigest()
save_path = SAVE_DIR / md5
if start is not None and stop is not None:
pages = list(range(start - 1, stop))
else:
pages = list(range(len(pdf)))
predictions = [""] * len(pages)
dellist = []
if save_path.exists():
for computed in (save_path / "pages").glob("*.mmd"):
try:
idx = int(computed.stem) - 1
if idx in pages:
i = pages.index(idx)
print("skip page", idx + 1)
predictions[i] = computed.read_text(encoding="utf-8")
dellist.append(idx)
except Exception as e:
print(e)
compute_pages = pages.copy()
for el in dellist:
compute_pages.remove(el)
images = rasterize_paper(pdf, pages=compute_pages)
global model
dataset = ImageDataset(
images,
partial(model.encoder.prepare_input, random_padding=False),
)
dataloader = torch.utils.data.DataLoader(
dataset,
batch_size=BATCHSIZE,
pin_memory=True,
shuffle=False,
)
for idx, sample in tqdm(enumerate(dataloader), total=len(dataloader)):
if sample is None:
continue
model_output = model.inference(image_tensors=sample)
for j, output in enumerate(model_output["predictions"]):
if model_output["repeats"][j] is not None:
if model_output["repeats"][j] > 0:
disclaimer = "\n\n+++ ==WARNING: Truncated because of repetitions==\n%s\n+++\n\n"
else:
disclaimer = (
"\n\n+++ ==ERROR: No output for this page==\n%s\n+++\n\n"
)
rest = close_envs(model_output["repetitions"][j]).strip()
if len(rest) > 0:
disclaimer = disclaimer % rest
else:
disclaimer = ""
else:
disclaimer = ""
predictions[pages.index(compute_pages[idx * BATCHSIZE + j])] = (
markdown_compatible(output) + disclaimer
)
(save_path / "pages").mkdir(parents=True, exist_ok=True)
pdf.save(save_path / "doc.pdf")
if len(images) > 0:
thumb = Image.open(images[0])
thumb.thumbnail((400, 400))
thumb.save(save_path / "thumb.jpg")
for idx, page_num in enumerate(pages):
(save_path / "pages" / ("%02d.mmd" % (page_num + 1))).write_text(
predictions[idx], encoding="utf-8"
)
final = "".join(predictions).strip()
(save_path / "doc.mmd").write_text(final, encoding="utf-8")
return final
def main():
import uvicorn
uvicorn.run("app:app", port=8503)
if __name__ == "__main__":
main()
================================================
FILE: config/train_nougat.yaml
================================================
resume_from_checkpoint_path: null
result_path: "result"
model_path: null
dataset_paths: ["path/to/train.jsonl"]
tokenizer: "dataset/tokenizer.json"
exp_name: "nougat"
train_batch_sizes: [1]
num_workers: 8
val_batch_sizes: [1]
val_batches: 1
input_size: [896, 672]
max_length: 4096
max_position_embeddings: 4096
accumulate_grad_batches: 3
window_size: 7
patch_size: 4
embed_dim: 128
hidden_dimension: 1024
num_heads: [4, 8, 16, 32]
encoder_layer: [2, 2, 14, 2]
decoder_layer: 10
align_long_axis: False
num_nodes: 1
seed: 25
lr: 5e-5
min_lr: 7.5e-6
lr_step: 16
gamma: 0.9996
warmup_steps: 250
num_training_samples_per_epoch: 10000
max_epochs: 30
max_steps: -1
val_check_interval: null
check_val_every_n_epoch: 1
gradient_clip_val: 0.5
verbose: False
================================================
FILE: docker/Dockerfile
================================================
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
# replace CUDA version to your CUDA version.
# You can check your CUDA version with below.
# nvcc -V
RUN apt-get update
RUN apt-get install -y python3
RUN apt-get -y install python3-pip git
RUN pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# replace CUDA version to your CUDA version.
RUN mkdir workspace
WORKDIR /workspace
RUN pip3 install fastapi uvicorn[standard] fsspec[http]==2023.1.0
RUN git clone https://github.com/facebookresearch/nougat.git
WORKDIR /workspace/nougat
RUN python3 setup.py install
EXPOSE 8503
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8503"]
# Run this using 'docker run -it -d -p <YOUR PORT>:8503 --gpus all <IMAGE NAME>
================================================
FILE: docker/README.md
================================================
## Prerequisites
Ensure you have Docker installed on your machine.
And you must also have NVIDIA CUDA and CuDNN installed in your machine.
Then, you must check your machine's CUDA version.
```sh
nvcc -V
```
You must change base image name and pytorch version compatible with your **CUDA version**.
## Building the Docker Image
Clone this repository and navigate into the current directory(nougat/docker). You can build the Docker image by running:
```sh
docker build -t <image-name> .
```
Replace <image-name> with a name of your choice. This will be used to refer to the image later.
Please be patient as this operation can take a while. It needs to pull the CUDA-capable image from NVIDIA’s Docker repository and install several libraries.
Image size will be about 17GB.
## Running the Docker Container
You can run your Docker container with the following command:
```sh
docker run -it -d -p <your-port>:8503 --gpus all <image-name>
```
Replace <your-port> with the port number you wish to expose on your host machine to access the nougat API server.
This can be any valid port number. Replace <image-name> with the name you chose earlier during the build step.
## Testing the API Server
Once the Docker container is running, you can access the nougat API server.
You can easily check connection by running:
```sh
curl -X 'GET' \
'http://127.0.0.1:<your-port>/'
```
It can be take a while for loading API server, because the server have to download nougat model at startup.
If connection is successful, you can get response looks like this.
```
{"status-code":200,"data":{}}
```
## Using the API Server
To get a prediction of a PDF file by making a POST request to `http://127.0.0.1:<your-port>/predict/`. It also accepts parameters `start` and `stop` to limit the computation to select page numbers (boundaries are included).
The response is a string with the markdown text of the document.
```sh
curl -X 'POST' \
'http://127.0.0.1:<your-port>/predict/' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F 'file=@<PDFFILE.pdf>;type=application/pdf'
```
To use the limit the conversion to pages 1 to 5, use the start/stop parameters in the request URL:
`http://127.0.0.1:<your-port>/predict/?start=1&stop=5`
================================================
FILE: lightning_module.py
================================================
"""
Donut
Copyright (c) 2022-present NAVER Corp.
MIT License
Copyright (c) Meta Platforms, Inc. and affiliates.
"""
import math
import random
from pathlib import Path
import numpy as np
import lightning.pytorch as pl
import torch
from lightning.pytorch.utilities import rank_zero_only
from torch.nn.utils.rnn import pad_sequence
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader
from nougat import NougatConfig, NougatModel
from nougat.metrics import get_metrics
class NougatModelPLModule(pl.LightningModule):
def __init__(self, config):
super().__init__()
self.validation_step_outputs = []
self.config = config
if self.config.get("model_path", False):
self.model = NougatModel.from_pretrained(
self.config.model_path,
input_size=self.config.input_size,
max_length=self.config.max_length,
align_long_axis=self.config.align_long_axis,
window_size=self.config.window_size,
encoder_layer=self.config.encoder_layer,
decoder_layer=self.config.decoder_layer,
patch_size=self.config.patch_size,
embed_dim=self.config.embed_dim,
num_heads=self.config.num_heads,
hidden_dimension=self.config.hidden_dimension,
ignore_mismatched_sizes=True,
)
else:
self.model = NougatModel(
config=NougatConfig(
input_size=self.config.input_size,
max_length=self.config.max_length,
align_long_axis=self.config.align_long_axis,
window_size=self.config.window_size,
encoder_layer=self.config.encoder_layer,
decoder_layer=self.config.decoder_layer,
tokenizer_file=self.config.tokenizer,
patch_size=self.config.patch_size,
embed_dim=self.config.embed_dim,
num_heads=self.config.num_heads,
hidden_dimension=self.config.hidden_dimension,
)
)
def training_step(self, batch, batch_idx):
image_tensors, decoder_input_ids, attention_masks = list(), list(), list()
if batch is None:
return
for batch_data in batch:
if batch_data is None or batch_data[0] is None:
continue
image_tensors.append(batch_data[0])
decoder_input_ids.append(batch_data[1])
attention_masks.append(batch_data[2])
image_tensors = torch.cat(image_tensors)
decoder_input_ids = torch.cat(decoder_input_ids)
attention_masks = torch.cat(attention_masks)
loss = self.model(image_tensors, decoder_input_ids, attention_masks)[0]
if loss is not None:
self.log_dict({"train/loss": loss}, sync_dist=True)
return loss
def validation_step(self, batch, batch_idx, dataset_idx=0):
if batch is None:
return
image_tensors, decoder_input_ids, _ = batch
if image_tensors is None:
return
markdown = pad_sequence(
decoder_input_ids,
batch_first=True,
)
preds = self.model.inference(
image_tensors=image_tensors,
return_attentions=False,
)["predictions"]
gts = self.model.decoder.tokenizer.batch_decode(
markdown, skip_special_tokens=True
)
metrics = get_metrics(gts, preds, pool=False)
scores = {
"val/" + key: sum(values) / len(values) for key, values in metrics.items()
}
self.validation_step_outputs.append(scores)
return scores
def on_validation_epoch_end(self):
if (
self.validation_step_outputs is not None
and len(self.validation_step_outputs) >= 1
):
self.log_dict(self.validation_step_outputs[0], sync_dist=True)
self.validation_step_outputs.clear()
def configure_optimizers(self):
def _get_device_count():
if torch.cuda.is_available():
return torch.cuda.device_count()
elif torch.backends.mps.is_available():
# Can MPS have more than one device?
return 1
return 1
max_iter = None
if int(self.config.get("max_epochs", -1)) > 0:
assert (
len(self.config.train_batch_sizes) == 1
), "Set max_epochs only if the number of datasets is 1"
steps = self.config.num_training_samples_per_epoch
max_iter = (self.config.max_epochs * steps) / max(
1,
(
self.config.train_batch_sizes[0]
* _get_device_count()
* self.config.get("num_nodes", 1)
),
)
if int(self.config.get("max_steps", -1)) > 0:
max_iter = (
min(self.config.max_steps, max_iter)
if max_iter is not None
else self.config.max_steps
)
assert max_iter is not None
optimizer = torch.optim.AdamW(self.parameters(), lr=self.config.lr)
scheduler = {
"scheduler": self.exponential_scheduler(
optimizer,
self.config.warmup_steps,
self.config.lr,
self.config.get("min_lr", 5e-5),
self.config.get("gamma", 0.9996),
),
"name": "learning_rate",
"interval": "step",
"frequency": self.config.get("lr_step", 1),
}
return [optimizer], [scheduler]
@staticmethod
def cosine_scheduler(optimizer, training_steps, warmup_steps):
def lr_lambda(current_step):
if current_step < warmup_steps:
return current_step / max(1, warmup_steps)
progress = current_step - warmup_steps
progress /= max(1, training_steps - warmup_steps)
return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))
return LambdaLR(optimizer, lr_lambda)
@staticmethod
def exponential_scheduler(optimizer, warmup_steps, lr, min_lr=5e-5, gamma=0.9999):
def lr_lambda(x):
if x > warmup_steps or warmup_steps <= 0:
if lr * gamma ** (x - warmup_steps) > min_lr:
return gamma ** (x - warmup_steps)
else:
return min_lr / lr
else:
return x / warmup_steps
return LambdaLR(optimizer, lr_lambda=lr_lambda)
def get_progress_bar_dict(self):
items = super().get_progress_bar_dict()
items.pop("v_num", None)
items["exp_name"] = f"{self.config.get('exp_name', '')}"
items["exp_version"] = f"{self.config.get('exp_version', '')}"
return items
@rank_zero_only
def on_save_checkpoint(self, checkpoint):
save_path = (
Path(self.config.result_path)
/ self.config.exp_name
/ self.config.exp_version
)
self.model.save_pretrained(save_path)
self.model.decoder.tokenizer.save_pretrained(save_path)
class NougatDataPLModule(pl.LightningDataModule):
def __init__(self, config):
super().__init__()
self.config = config
self.train_batch_sizes = self.config.train_batch_sizes
self.val_batch_sizes = self.config.val_batch_sizes
self.train_datasets = []
self.val_datasets = []
self.g = torch.Generator()
self.g.manual_seed(self.config.seed)
def train_dataloader(self):
loaders = [
DataLoader(
torch.utils.data.ConcatDataset(self.train_datasets),
batch_size=self.train_batch_sizes[0],
num_workers=self.config.num_workers,
pin_memory=True,
worker_init_fn=self.seed_worker,
generator=self.g,
shuffle=True,
collate_fn=self.ignore_none_collate,
)
]
return loaders
def val_dataloader(self):
loaders = [
DataLoader(
torch.utils.data.ConcatDataset(self.val_datasets),
batch_size=self.val_batch_sizes[0],
pin_memory=True,
shuffle=True,
collate_fn=self.ignore_none_collate,
)
]
return loaders
@staticmethod
def seed_worker(wordker_id):
worker_seed = torch.initial_seed() % 2**32
np.random.seed(worker_seed)
random.seed(worker_seed)
@staticmethod
def ignore_none_collate(batch):
if batch is None:
return
try:
batch = [x for x in batch if x is not None and x[0] is not None]
if len(batch) == 0:
return
return torch.utils.data.dataloader.default_collate(batch)
except AttributeError:
pass
================================================
FILE: nougat/__init__.py
================================================
"""
Donut
Copyright (c) 2022-present NAVER Corp.
MIT License
Copyright (c) Meta Platforms, Inc. and affiliates.
"""
from .model import NougatConfig, NougatModel
from .utils.dataset import NougatDataset
from ._version import __version__
__all__ = [
"NougatConfig",
"NougatModel",
"NougatDataset",
]
================================================
FILE: nougat/_version.py
================================================
"""
Copyright (c) Meta Platforms, Inc. and affiliates.
This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.
"""
__version__ = "0.1.18"
================================================
FILE: nougat/dataset/__init__.py
================================================
================================================
FILE: nougat/dataset/create_index.py
================================================
"""
Copyright (c) Meta Platforms, Inc. and affiliates.
This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.
"""
"""
This script creates an index of all available pages and parses the meta data for all pages into a separate file.
Optionally TesseractOCR is called for each image.
"""
import argparse
import json
from typing import Dict, List
import numpy as np
from pathlib import Path
import multiprocessing
from pebble import ProcessPool
from PIL import Image
import pytesseract
import re
import logging
from tqdm import tqdm
logging.basicConfig()
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def convert_pt2px(pt, dpi=96):
if isinstance(pt, list):
return [round(dpi / 72 * p) for p in pt]
elif isinstance(pt, dict):
for k in pt:
pt[k] = round(dpi / 72 * pt[k])
return pt
def read_metadata(data: Dict) -> List[List[Dict]]:
N = data["num_pages"]
out = [[] for _ in range(N)]
# pdffigures2 meta data
if "pdffigures" in data and data["pdffigures"]:
for item in data["pdffigures"]:
p = item.pop("page", None)
if p is None or p >= N:
continue
item["source"] = "fig"
if "regionBoundary" in item:
item["regionBoundary"] = convert_pt2px(item["regionBoundary"])
if "captionBoundary" in item:
item["captionBoundary"] = convert_pt2px(item["captionBoundary"])
out[p].append(item)
return out
def index_paper(directory: Path, args: argparse.Namespace):
"""
Pack all image-text pairs into a single h5 file and save it at `args.out`
"""
paper = directory.name
markdowns = directory.glob("*.mmd")
meta_file = directory / "meta.json"
data_samples = []
if not meta_file.exists():
return
# load meta info
try:
meta = read_metadata(json.load(meta_file.open("r", encoding="utf-8")))
except json.JSONDecodeError:
return
for md_path in markdowns:
image = md_path.parent / (md_path.stem + ".png")
i = int(image.stem) - 1
if not image.exists():
continue
if i >= len(meta):
continue
data_sample = {}
ocr_path = image.parent / (image.stem + "_OCR.txt")
if args.tesseract and not ocr_path.exists():
try:
pil = Image.open(image)
ocr = pytesseract.image_to_string(pil, lang="eng", timeout=2)
ocr = re.sub(r"\n+\s+?([^\s])", r"\n\n\1", ocr).strip()
with ocr_path.open("w", encoding="utf-8") as f_ocr:
f_ocr.write(ocr)
except RuntimeError:
logger.info("Page %s of paper %s timed out", image.stem, paper)
pass
if ocr_path.exists():
data_sample["ocr"] = str(ocr_path.relative_to(args.root))
data_sample["image"] = str(image.relative_to(args.root))
data_sample["markdown"] = md_path.read_text(encoding="utf8").strip()
data_sample["meta"] = meta[i]
data_samples.append(data_sample)
return data_samples
def create_index(args):
if not args.dir.exists() and not args.dir.is_dir():
logger.error("%s does not exist or is no dir.", args.dir)
return
papers = []
depth = 0
p = args.dir
while True:
p = next(p.iterdir())
if p.is_file():
break
else:
depth += 1
papers = args.dir.glob("*/" * depth)
index = []
with ProcessPool(max_workers=args.workers) as pool:
tasks = {}
for j, paper in enumerate(papers):
fname = paper.name
tasks[fname] = pool.schedule(
index_paper,
args=[paper, args],
timeout=args.timeout,
)
for fname in tqdm(tasks):
try:
res = tasks[fname].result()
if res is None:
logger.info("%s is faulty", fname)
continue
index.append(res)
except TimeoutError:
logger.info("%s timed out", fname)
with args.out.open("w", encoding="utf-8") as f:
for item in index:
for page in item:
if len(page) == 0:
continue
f.write(json.dumps(page) + "\n")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--out", type=Path, required=True, help="Index file")
parser.add_argument(
"--dir", type=Path, required=True, help="Parent directory for input dirs"
)
parser.add_argument("--root", type=Path, default=None)
parser.add_argument(
"--tesseract",
action="store_true",
help="Tesseract OCR prediction for each page",
)
parser.add_argument(
"--workers",
type=int,
default=multiprocessing.cpu_count(),
help="How many processes to use",
)
parser.add_argument(
"--dpi", type=int, default=96, help="DPI the images were saved with"
)
parser.add_argument("--timeout", type=int, default=240, help="Max time per paper")
args = parser.parse_args()
if args.root is None:
args.root = args.dir
else:
# check if dir is subdir of root
args.dir.relative_to(args.root)
create_index(args)
================================================
FILE: nougat/dataset/gen_seek.py
================================================
"""
Copyright (c) Meta Platforms, Inc. and affiliates.
This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.
"""
from tqdm import tqdm
import json
from pathlib import Path
import argparse
def get_args():
parser = argparse.ArgumentParser()
parser.add_argument("src_file", nargs="+", type=Path, help="JSONL file in question")
args = parser.parse_args()
return args
if __name__ == "__main__":
args = get_args()
for file in args.src_file:
seek_map = []
seek_pos = 0
with open(file) as f:
with tqdm(smoothing=0.0) as pbar:
line = f.readline()
while line:
seek_map.append(seek_pos)
seek_pos = f.tell()
line = f.readline()
pbar.update(1)
out_file = file.parent / (file.stem + ".seek.map")
with open(out_file, "w") as f:
f.write(json.dumps(seek_map))
================================================
FILE: nougat/dataset/parser/__init__.py
================================================
================================================
FILE: nougat/dataset/parser/document.py
================================================
"""
Copyright (c) Meta Platforms, Inc. and affiliates.
This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.
"""
from collections import defaultdict
from copy import copy
import itertools
import re
from dataclasses import dataclass, field, asdict
from typing import (
Any,
List,
Dict,
Optional,
TypeVar,
Type,
Generic,
)
import numpy as np
import logging
logger = logging.getLogger()
from dataclasses import dataclass, field, asdict
from typing import List, Dict, TypeVar, Type, Generic
T = TypeVar("T")
EL = TypeVar("EL")
@dataclass
class Element(Generic[EL]):
"""
Generic class representing an element with children in a tree-like structure.
Attributes:
parent (Element): The parent element.
children (List[Element]): List of child elements.
"""
parent: "Element" = None
children: List["Element"] = field(default_factory=list)
@property
def plaintext(self):
return "".join([child.plaintext for child in self.children])
def append(self, child: EL) -> EL:
self.children.append(child)
child.parent = self
return child
def find_parent(self, class_or_tuple: Type[T]) -> T:
elem = self
while elem:
if isinstance(elem, class_or_tuple):
return elem
elem = elem.parent
return None
@dataclass
class UnknownElement(Element):
pass
@dataclass
class TextElement(Element):
content: str = ""
@property
def plaintext(self):
return self.content
def append(self, child: "Element"):
raise Exception(f"Cannot append elements to {self.__class__.__name__}")
@dataclass
class Math(Element):
pass
@dataclass
class PlaintextMath(Math):
pass
@dataclass
class LatexMath(Math):
inline: bool = True
code: str = ""
@property
def plaintext(self):
return self.code
@dataclass
class Author:
fullname: str = None
lastname: str = None
affiliation: str = None
@dataclass
class Link(Element):
target: str = None
@dataclass
class InlineRef(Element):
target: str = None
def as_dict(self):
return {
"target": self.target,
}
@dataclass
class Reference:
"""
Data class representing a reference with various attributes.
Attributes:
title (Element): The title of the reference.
authors (List[Author]): List of authors of the reference.
ids (Dict[str, str]): Dictionary of identification information.
date (str): The publication date of the reference.
url (str): The URL link to the reference.
journal (str): The journal where the reference is published.
full_text (str): The full text content of the reference.
Methods:
as_dict(): Convert the reference object to a dictionary.
"""
title: Element = None
authors: List[Author] = field(default_factory=list)
ids: Dict[str, str] = field(default_factory=dict)
date: str = None
url: str = None
journal: str = None
full_text: str = None
def as_dict(self):
return {
"title": self.title.plaintext,
"authors": [asdict(auth) for auth in self.authors],
"ids": self.ids,
"date": self.date,
"url": self.url,
"journal": self.journal,
"full_text": self.full_text,
}
@dataclass
class SpanElement(Element):
pass
@dataclass
class Italic(SpanElement):
pass
@dataclass
class Bold(SpanElement):
pass
@dataclass
class Superscript(SpanElement):
pass
@dataclass
class Subscript(SpanElement):
pass
@dataclass
class Paragraph(Element):
pass
@dataclass
class TableRow(Element):
cells: List[Element] = field(default_factory=list)
def add_cell(self, cell: Element):
self.cells.append(cell)
cell.parent = self
return cell
@property
def plaintext(self):
return "\t".join([cell.plaintext for cell in self.cells])
@dataclass
class TableHead(TableRow):
pass
@dataclass
class Table(Element):
id: str = None
header: Element = None
caption: Element = None
rows: List[TableRow] = field(default_factory=list)
keep_table: bool = False
def add_row(self, row: TableRow) -> TableRow:
self.rows.append(row)
row.parent = self
return row
@property
def plaintext(self):
return "\n".join([row.plaintext for row in self.rows])
@dataclass
class Equation(Element):
pass
@dataclass
class EquationList(Element):
equations: List[Equation] = field(default_factory=list)
def add_equation(self, eqn: Equation) -> Equation:
self.equations.append(eqn)
eqn.parent = self
return eqn
@property
def plaintext(self):
return "\n".join([eqn.plaintext for eqn in self.equations])
@dataclass
class Algorithm(Element):
caption: Element = None
lines: List[Element] = field(default_factory=list)
inline: bool = False
def add_line(self, line: Element) -> Element:
self.lines.append(line)
line.parent = self
return line
@property
def plaintext(self):
return "\n".join([line.plaintext for line in self.lines])
@dataclass
class Definition(Element):
term: Element = None
definition: Element = None
@property
def plaintext(self):
parts = []
if self.term:
parts.append(f"{self.term.plaintext}:")
if self.definition:
parts.append(self.definition.plaintext)
return " ".join(parts)
@dataclass
class DefinitionList(Element):
"""
Data class representing a list of definitions with an optional header.
Attributes:
header (Element): The header element for the definition list.
items (List[Definition]): List of Definition elements.
Methods:
add_item(item: Definition) -> Definition: Add a definition item to the list.
"""
header: Element = None
items: List[Element] = field(default_factory=list)
def add_item(self, item: Definition) -> Definition:
self.items.append(item)
item.parent = self
return item
@property
def plaintext(self):
parts = []
if self.header:
parts.append(self.header.plaintext)
parts.extend([df.plaintext for df in self.items])
return "\n".join(parts)
@dataclass
class Figure(Element):
id: str = None
header: Element = None
caption: Element = None
@dataclass
class Section(Element):
id: str = None
header: Element = None
level: int = 0
hnum: int = 1
@dataclass
class SectionHeader(Element):
id: str = None
header: Element = None
level: int = 0
@dataclass
class ListItem(Element):
label: str = ""
@dataclass
class ListContainer(Element):
level: int = 0
ordered: bool = False
items: List[Element] = field(default_factory=list)
def add_item(self, item: ListItem) -> ListItem:
self.items.append(item)
item.parent = self
return item
@property
def plaintext(self):
return "\n".join([item.plaintext for item in self.items])
@dataclass
class Footnote(Element):
id: str = None
@dataclass
class Document(Element, Reference):
abstract: Element = None
language: str = None
keywords: List[Element] = field(default_factory=list)
references: List[Reference] = field(default_factory=list)
inline_refs: List[InlineRef] = field(default_factory=list)
bib: Reference = None
def add_reference(self, reference):
self.references.append(reference)
def add_inline_ref(self, in_ref):
self.inline_refs.append(in_ref)
def set_bib(self, reference):
self.bib = reference
@dataclass
class Spec:
"""
Data class representing specifications for table cells.
Attributes:
t (int): The top border size.
b (int): The bottom border size.
l (int): The left border size.
r (int): The right border size.
align (str): The alignment of the cell content ('c' for center, 'l' for left, 'r' for right,
or 'p{width}' for justified with a specified width).
Methods:
__hash__() -> int: Compute the hash of the specification.
__eq__(__o: object) -> bool: Check if two specifications are equal.
set_align(classes: List[str], style: Optional[str] = None) -> None:
Extract alignment information from HTML classes.
set_border(classes: List[str]) -> None: Automatically set border specifications.
set_attrs(attrs: Dict[str, Any]) -> None: Automatically set all attributes from HTML class attributes.
__str__() -> str: Get the string representation of the specification.
"""
t: int = field(default=0, repr=False)
b: int = field(default=0, repr=False)
l: int = field(default=0)
r: int = field(default=0)
align: str = field(default="")
def __hash__(self) -> int:
return hash(repr(self))
def __eq__(self, __o: object) -> bool:
return repr(self) == repr(__o)
def set_align(self, classes: List[str], style: Optional[str] = None) -> None:
"""extract alignment information from available classes (html)"""
aligns = [s for s in classes if "align" in s]
if len(aligns) == 0:
return
elif len(aligns) > 1:
logger.warn("Found multiple aligns in classes: %s", ", ".join(classes))
align = aligns[0]
if "center" in align or align == "c":
self.align = "c"
elif "left" in align or align == "l":
self.align = "l"
elif "right" in align or align == "r":
self.align = "r"
elif "justify" in align or align == "p":
# assert style is not None, "justify without style information"
if style is None:
self.align = "c"
else:
width = style.partition("width:")[2].partition(";")[0]
self.align = "p{%s}" % width
else:
logger.warn(
"only center, left, right, justify supported at the moment. Found %s",
align,
)
self.align = "c"
def set_border(self, classes: List[str]) -> None:
"""automatically set spec with border classes e.g 'ltx_border_t'"""
for border in classes:
orientation = border.partition("border_")[2]
if len(orientation) > 0 and orientation[0] in "tbrl":
setattr(self, orientation[0], len(orientation))
def set_attrs(self, attrs: Dict[str, Any]) -> None:
"""automatically set all attr from html class attributes"""
classes = attrs["class"]
style = attrs["style"] if "style" in attrs else None
self.set_align(classes, style=style)
self.set_border(classes)
def __str__(self) -> str:
if self.align:
return "|" * self.l + self.align + "|" * self.r
else:
# default center
return "|" * self.l + "c" + "|" * self.r
@dataclass
class TableCell(Element):
"""
Represents a cell in an HTML table.
Attributes:
multicolumn (Optional[int]): The number of columns spanned by the cell.
multirow (Optional[int]): The number of rows spanned by the cell.
spec (Spec): The specification for the cell's formatting.
content (Element): The content of the cell.
Methods:
__post_init__(*args, **kwargs) -> None: Initialize the cell, ensuring that the spec property is not None.
__hash__() -> int: Compute the hash of the cell.
__eq__(__o: object) -> bool: Check if two cells are equal.
set_attrs(attrs: Dict[str, Any]) -> None: Set attributes for the cell from HTML attributes.
plaintext() -> str: Get the plaintext content of the cell.
"""
multicolumn: Optional[int] = None
multirow: Optional[int] = None
spec: Spec = None
content: Element = None
def __post_init__(self, *args, **kwargs) -> None:
# spec property cannot be None
if self.spec is None:
self.spec = Spec()
def __hash__(self) -> int:
return hash(repr(self))
def __eq__(self, __o: object) -> bool:
return repr(self) == repr(__o)
def set_attrs(self, attrs: Dict[str, Any]) -> None:
if "colspan" in attrs:
self.multicolumn = int(attrs["colspan"])
if "rowspan" in attrs:
self.multirow = int(attrs["rowspan"])
self.spec.set_attrs(attrs)
@property
def plaintext(self):
if self.content is None:
return ""
return self.content.plaintext
@dataclass
class TableRow(Element):
"""
Represents a row in an HTML table.
Attributes:
cells (List[TableCell]): The list of cells in the row.
Methods:
add_cell(cell: TableCell) -> TableCell: Add a cell to the row.
__iter__() -> Iterator: Iterate through the cells in the row.
__len__() -> int: Get the number of cells in the row.
__bool__() -> bool: Check if the row is not empty.
cum_cell_widths() -> List[int]: Get the cumulative cell widths.
cell_widths() -> List[int]: Get the widths of individual cells.
width() -> int: Get the total width of the row.
_hline(orientation: str) -> str: Determine horizontal lines to be inserted.
hline_above() -> str: Get the horizontal line description for the top of the row.
hline_below() -> str: Get the horizontal line description for the bottom of the row.
plaintext() -> str: Get the plaintext content of the row.
"""
cells: List[TableCell] = field(default_factory=list)
def add_cell(self, cell: TableCell):
self.cells.append(cell)
cell.parent = self
return cell
def __iter__(self):
return iter(self.cells)
def __len__(self) -> int:
return len(self.cells)
def __bool__(self) -> bool:
return True
@property
def cum_cell_widths(self) -> List[int]:
return np.cumsum(self.cell_widths)
@property
def cell_widths(self) -> List[int]:
return [(cell.multicolumn or 1) for cell in self.cells]
@property
def width(self) -> int:
return sum(self.cell_widths)
def _hline(self, orientation: str) -> str:
"""Figure out if and where horizontal lines need to be inserted.
Args:
orientation (str): Either 't' (top) or 'b' (bottom)
Returns:
str: Correct vertical line description for latex tables.
"""
assert orientation == "t" or orientation == "b"
lines = []
for cell in self.cells:
lines.extend([getattr(cell.spec, orientation)] * (cell.multicolumn or 1))
lines.append(0)
indices = []
start = None
for i, v in enumerate(lines):
if v and start is None:
start = i
elif start is not None and not v:
indices.append((start, i - 1))
start = None
s = ""
for a, b in indices:
if b - a + 1 == self.width:
s += "\\hline " * lines[0]
else:
s += "\\cline{%i-%i} " % (a + 1, b + 1)
return s.strip()
@property
def hline_above(self) -> str:
return self._hline("t")
@property
def hline_below(self) -> str:
return self._hline("b")
@property
def plaintext(self) -> str:
return "\t".join([cell.plaintext for cell in self.cells])
@dataclass
class Tabular(Element):
rows: List[TableRow] = field(default_factory=list)
"""
Represents a tabular structure, such as an HTML table.
Attributes:
rows (List[TableRow]): The list of rows in the tabular structure.
Methods:
add_row(row: TableRow) -> TableRow: Add a row to the tabular structure.
width() -> int: Get the maximum width of the tabular structure.
cols() -> List[List[TableCell]]: Get a list of columns in the tabular structure.
_square_table() -> None: Ensure the table has an equal number of columns in each row.
get_table_spec() -> str: Generate a LaTeX table specification based on cell alignments.
plaintext() -> str: Get the plaintext content of the tabular structure.
"""
def add_row(self, row: TableRow) -> TableRow:
self.rows.append(row)
row.parent = self
return row
@property
def width(self) -> int:
if len(self.rows) > 0:
return max([r.width for r in self.rows])
else:
return 0
@property
def cols(self) -> List[List[TableCell]]:
return list(
map(
list,
itertools.zip_longest(*[r.cells for r in self.rows], fillvalue=None),
)
)
def _square_table(self) -> None:
"""check if number of columns is equal for every row. Add placeholders for `\multirow` instances"""
for i, row in enumerate(self.rows):
for j, cell in enumerate(row.cells):
if cell.multirow is not None and cell.multirow > 1:
spec = copy(cell.spec)
# assume no hlines in multi cells: disable bottom lines for top and top lines for lower cells.
spec.t = 0
cell.spec.b = 0
for k in range(i + 1, i + cell.multirow):
if k < len(self.rows):
for _ in range(row.cell_widths[j]):
# add empty cell
self.rows[k].cells.insert(
j, TableCell(parent=self.rows[k], spec=spec)
)
def get_table_spec(self) -> str:
"""Generates a LaTeX table spec."""
# First make table square
self._square_table()
# Find the most used spec in regular cells (no multi-col/row)
specs = [Spec() for _ in range(self.width)]
for i, col in enumerate(self.cols):
counts = defaultdict(int)
for cell in col:
if cell is None or cell.spec.align == "":
continue
if cell.multicolumn is None and cell.multirow is None:
counts[cell.spec] += 1
if len(counts) > 0:
specs[i] = max(counts, key=counts.get)
# convert all cells that don't match the column style into a multicol{1}{custom_spec}
for i, col in enumerate(self.cols):
for cell in col:
if cell is not None and cell.spec != specs[i]:
# check if there is text in the cell. If not alignment doesn't matter
if (
len(cell.children) == 0
and cell.spec.l == specs[i].l
and cell.spec.r == specs[i].r
):
continue
# convert any standard cell into a multicol cell of width 1
if cell.multicolumn is None:
cell.multicolumn = 1
# generate final latex table spec
out = " ".join([str(spec) for spec in specs])
out = re.sub(r"(\|) +(\w)", r"\1\2", out)
out = re.sub(r"(\w) +(\|)", r"\1\2", out)
return out
@property
def plaintext(self):
return "\n".join([row.plaintext for row in self.rows])
@dataclass
class Table(Element):
id: str = None
caption: Element = None
================================================
FILE: nougat/dataset/parser/html2md.py
================================================
"""
Copyright (c) Meta Platforms, Inc. and affiliates.
This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.
"""
import argparse
from pathlib import Path
from typing import List, Optional
from bs4 import BeautifulSoup
from tqdm import tqdm
import htmlmin
from nougat.dataset.parser.latexml_parser import parse_latexml, _clean_html_whitespace
from nougat.dataset.parser.markdown import format_document
def check_file_path(paths: List[Path], wdir: Optional[Path] = None) -> List[str]:
"""
Checks if the given file paths exist.
Args:
paths: A list of file paths.
wdir: The working directory. If None, the current working directory is used.
Returns:
A list of file paths that exist.
"""
files = []
for path in paths:
if type(path) == str:
if path == "":
continue
path = Path(path)
pathsi = [path] if wdir is None else [path, wdir / path]
for p in pathsi:
if p.exists():
files.append((p.resolve()))
elif "*" in path.name:
files.extend([(pi.resolve()) for pi in p.parent.glob(p.name)])
return list(set(files))
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--html", type=Path, nargs="+", help="HTML file", required=True)
parser.add_argument("--out", type=Path, help="Output file", required=True)
args = parser.parse_args()
args.html = check_file_path(args.html)
for f in tqdm(args.html):
html = BeautifulSoup(
htmlmin.minify(
open(f, "r", encoding="utf-8").read().replace("\xa0", " "),
remove_all_empty_space=1,
),
features="html.parser",
)
try:
doc = parse_latexml(html)
except ValueError as e:
print(e)
continue
if doc is None:
continue
out, fig = format_document(doc, keep_refs=True)
outp = (args.out if args.out.is_dir() else args.out.parent) / (f.stem + ".mmd")
with open(outp, "w", encoding="utf-8") as f:
f.write(out)
================================================
FILE: nougat/dataset/parser/latexml_parser.py
================================================
"""
Copyright (c) Meta Platforms, Inc. and affiliates.
This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.
"""
import re
import sys
import requests
from typing import Optional, Set
from bs4 import BeautifulSoup, NavigableString
import soupsieve as sv
from nougat.dataset.parser.document import *
def printerr(*args, **kwargs):
# uncomment for debugging
# print(*args, **kwargs)
pass
latexml_wrapper_selector = sv.compile(
", ".join(
[
".ltx_engrafo_equation_container",
"tbody",
".ltx_note_content",
".ltx_role_footnote",
".ltx_note_type",
".ltx_theorem",
".ltx_proof",
".ltx_quote",
"blockquote",
".ltx_inline-para",
".ltx_inline-block",
]
)
)
latexml_ignore_selector = sv.compile(".ltx_rule, .ltx_pagination.ltx_role_newpage")
def is_wrapper_element(element: BeautifulSoup) -> bool:
return latexml_wrapper_selector.match(element)
def ignore_element(element: BeautifulSoup) -> bool:
return latexml_ignore_selector.match(element)
def _get_classes(el: BeautifulSoup) -> Set[str]:
if not hasattr(el, "attrs"):
return set()
classes = el.attrs.get("class")
if classes is None:
return set()
return set(classes)
def _detach_selected(element: BeautifulSoup, selector: str) -> None:
for elem in element.select(selector):
elem.extract()
def parse_latexml_authors(ltx_authors: BeautifulSoup) -> List[Author]:
authors = Paragraph()
parse_latexml_children(ltx_authors, authors)
return authors
def parse_latexml_citations(cite: BeautifulSoup, parent: Element) -> None:
"""
Parses LaTeXML citations and appends them as children to the given parent element.
Args:
cite (BeautifulSoup): The BeautifulSoup object containing the citation data.
parent (Element): The parent element to which the citations will be added as children.
"""
parse_latexml_children(cite, parent)
if ("[" in parent.plaintext and "]" in parent.plaintext) or re.search(
r"[A-Za-z]", parent.plaintext
):
return
parent.children.insert(0, TextElement(content="["))
parent.children.append(TextElement(content="]"))
def _clean_html_whitespace(text: str) -> str:
if text.strip():
text = re.sub(r"(^\n+|\n+$)", "\n", text)
else:
text = text.strip("\n")
text = re.sub(r"[ \t]+", " ", text)
return text
def parse_latexml_children(html: BeautifulSoup, parent: Element) -> None:
"""
Parses LaTeXML children and appends them as appropriate elements to the given parent element.
Args:
html (BeautifulSoup): The BeautifulSoup object containing the HTML data.
parent (Element): The parent element to which the parsed children will be added.
"""
if html is None:
return
for child in html.children:
classes = _get_classes(child)
if isinstance(child, NavigableString):
parent.append(TextElement(content=_clean_html_whitespace(str(child))))
elif sv.match(
"p, .ltx_p, div.ltx_para, span.ltx_para, section.ltx_paragraph", child
):
paragraph = parent.append(Paragraph())
parse_latexml_children(child, paragraph)
elif sv.match(".ltx_tag", child):
if "ltx_tag_note" not in classes:
if sv.match(".ltx_tag_section", child):
child.string = child.string.upper()
elif sv.match(".ltx_tag_subsection", child):
child.string = ""
parse_latexml_children(child, parent)
elif "ltx_tag_bibitem" in classes:
parse_latexml_children(child, parent.append(SpanElement()))
elif sv.match(".ltx_note_outer", child):
# try to place the footnote outside the current paragraph
paragraph = parent.find_parent(Paragraph)
if paragraph is not None and paragraph.parent is not None:
footnote = paragraph.parent.append(Footnote())
else:
footnote = parent.append(Footnote())
parse_latexml_children(child, footnote)
elif sv.match(".ltx_note_content > .ltx_note_mark", child):
footnote = parent.find_parent(Footnote)
if footnote is not None:
footnote.id = child.get_text(strip=True)
else:
printerr("Unable to find footnote to set its id", file=sys.stderr)
parse_latexml_children(child, parent)
elif sv.match("sup", child):
sup = parent.append(Superscript())
parse_latexml_children(child, sup)
elif sv.match("sub", child):
sub = parent.append(Subscript())
parse_latexml_children(child, sub)
elif sv.match("span.ltx_Math, span.ltx_DisplayMath", child):
inline = "ltx_DisplayMath" not in classes
math_elem = child.select_one(".mjx-math")
if math_elem:
tex = math_elem.attrs["aria-label"]
if inline:
tex = rf"\({tex}\)"
else:
tex = rf"\[{tex}\]"
parent.append(LatexMath(code=tex, inline=inline))
elif sv.match("math.ltx_Math", child):
# not sure if the math tag LaTeXML version specific, but that seems to work
inline = True
if "display" in child.attrs:
inline = child.attrs["display"] == "inline"
tex = child.attrs["alttext"]
if inline:
tex = rf"\({tex}\)"
else:
tex = rf"\[{tex}\]"
parent.append(LatexMath(code=tex, inline=inline))
elif sv.match("a.ref", child):
link = parent.append(Link())
link.target = child.attrs.get("href")
parse_latexml_children(child, link)
elif sv.match(
".ltx_ref.ltx_missing_citation, .ltx_ref.ltx_missing_label", child
):
placeholder = child.get_text().strip()
resolved = False
if placeholder.isnumeric():
parent.append(TextElement(content=placeholder))
resolved = True
else:
target = child.attrs.get("href")
if target is not None:
potential_num = target.partition(".bib")[2]
if potential_num.isnumeric():
parent.append(TextElement(content=potential_num))
resolved = True
if not resolved:
raise ValueError("missing reference detected")
elif sv.match(
".ltx_bibblock, .ltx_role_author, .ltx_contact, .ltx_role_email, .ltx_role_affiliation",
child,
):
parse_latexml_children(child, parent.append(SpanElement()))
parent.append(TextElement(content="\n"))
elif sv.match(
".ltx_authors, .ltx_personname, .ltx_role_creation.ltx_date, .ltx_engrafo_author_notes, .ltx_author_notes, .ltx_date.ltx_role_creation",
child,
):
parse_latexml_children(child, parent.append(Paragraph()))
parent.append(TextElement(content="\n"))
elif sv.match(
".ltx_author_before, .ltx_role_pubyear, .ltx_role_pagerange", child
):
pass
elif sv.match("h1.ltx_title_document", child):
doc = parent.find_parent(Document)
if doc is not None:
if doc.title is None:
doc.title = SectionHeader(parent=doc)
doc.title.hnum = int(child.name[1])
parse_latexml_children(child, doc.title)
else:
printerr("Document title is already set", file=sys.stderr)
else:
printerr("Unable to find document to set title", file=sys.stderr)
elif sv.match("section", child):
if ".ltx_bibliography" not in classes:
section = parent.append(Section())
parse_latexml_children(child, section)
elif sv.match("h1, h2, h3, h4, h5, h6", child) and "ltx_title" in classes:
if {"ltx_title_theorem", "ltx_title_proof"} & classes:
parse_latexml_children(child, parent)
parent.append(TextElement(content=": "))
elif isinstance(parent, Section):
parent.hnum = int(child.name[1])
if parent.header is None:
parent.header = SpanElement()
parse_latexml_children(child, parent.header)
else:
printerr("Dangling title element", file=sys.stderr)
parse_latexml_children(child, parent)
elif sv.match(".ltx_TOC.ltx_toc_toc", child):
s = parent.append(Section(hnum=6, header=TextElement(content="Contents")))
parse_latexml_children(child, s.append(Paragraph()))
elif sv.match(
"ul.ltx_itemize, ul.ltx_toclist, ul.ltx_biblist, ol.ltx_enumerate", child
):
lst = parent.append(ListContainer())
lst.ordered = child.name == "ol"
parent_list = parent.find_parent(ListContainer)
lst.level = parent_list.level + 1 if parent_list is not None else 1
parse_latexml_children(child, lst)
elif sv.match("li.ltx_item, li.ltx_tocentry, li.ltx_bibitem", child):
lst = parent.find_parent(ListContainer)
if lst is not None:
item = lst.add_item(ListItem())
parse_latexml_children(child, item)
else:
printerr("List item outside list", file=sys.stderr)
elif sv.match("cite", child):
span = parent.append(SpanElement())
parse_latexml_citations(child, span)
elif sv.match("a.ltx_ref", child):
target = child.attrs.get("href")
if target.startswith("#bib"): # citation link
in_ref = parent.append(InlineRef())
in_ref.target = target
text = child.get_text()
in_ref.target = target
if text.strip().isnumeric():
in_ref.append(TextElement(content=text))
elif re.search(r"[A-Za-z][:;.,_]?\d", text):
# probably a broken citation, go with link number instead
in_ref.append(
TextElement(
content=re.sub(r"\D", "", target.partition(".bib")[2])
)
)
else:
raise ValueError('unusable reference "%s"' % text)
doc = parent.find_parent(Document)
if doc:
doc.add_inline_ref(in_ref)
else:
link = parent.append(Link())
link.target = target
parse_latexml_children(child, link)
elif sv.match("a", child) and len(classes) == 0:
target = child.attrs.get("href")
parse_latexml_children(child, parent.append(Link(target=target)))
elif sv.match(".ltx_eqn_table", child):
eqn_list = parent.append(EquationList())
parse_latexml_children(child, eqn_list)
elif sv.match(".ltx_eqn_row", child):
eqn_list = parent.find_parent(EquationList)
if eqn_list is not None:
eqn = eqn_list.add_equation(Equation())
parse_latexml_children(child, eqn)
else:
printerr("Dangling equation row", file=sys.stderr)
parse_latexml_children(child, parent)
elif sv.match(".ltx_eqn_cell", child):
parse_latexml_children(child, parent)
elif sv.match("table, span.ltx_tabular, div.ltx_tabular", child):
tabular = parent.append(Tabular())
parse_latexml_children(child, tabular)
elif sv.match("thead.ltx_thead", child):
table = parent.find_parent(Tabular)
if table is not None:
parse_latexml_children(child, table)
else:
printerr("Table header element outside table", file=sys.stderr)
elif sv.match("tbody.ltx_tbody", child):
parse_latexml_children(child, parent)
elif sv.match("tr.ltx_tr", child):
table = parent.find_parent(Tabular)
if table is not None:
row = table.add_row(TableRow())
parse_latexml_children(child, row)
else:
printerr("TableRow element outside table", file=sys.stderr)
elif sv.match("td.ltx_td, th.ltx_th", child):
row = parent.find_parent(TableRow)
if row is not None:
cell = TableCell()
cell.set_attrs(child.attrs)
row.add_cell(cell)
parse_latexml_children(child, cell)
else:
printerr("TableData element outside table row", file=sys.stderr)
elif sv.match("span.ltx_text, em.ltx_emph", child):
if (
child.find_parent(ListItem) is None
or child.get_text() != "[label=0)]"
or child.get_text() != "[leftmargin=*] "
):
if "ltx_font_italic" in classes:
elem = Italic()
elif "ltx_font_bold" in classes:
elem = Bold()
else:
elem = SpanElement()
parent.append(elem)
parse_latexml_children(child, elem)
else:
parent.find_parent(ListContainer).items.pop()
elif sv.match("figure.ltx_table", child):
figure = parent.append(Table())
if "id" in child.attrs:
figure.id = child.attrs["id"]
parse_latexml_children(child, figure)
elif sv.match("figure.ltx_figure", child):
figure = parent.append(Figure())
if "id" in child.attrs:
figure.id = child.attrs["id"]
parse_latexml_children(child, figure)
elif sv.match("figure.ltx_float", child):
parse_latexml_children(child, parent)
elif sv.match(".ltx_listing", child):
alg = parent.append(Algorithm())
parse_latexml_children(child, alg)
elif sv.match(".ltx_listingline", child):
alg = parent.find_parent(Algorithm)
if alg is not None:
line = alg.add_line(Element())
parse_latexml_children(child, line)
else:
printerr("Listing line outside algorithm environment", file=sys.stderr)
elif sv.match("dl.ltx_description", child):
def_list = parent.append(DefinitionList())
parse_latexml_children(child, def_list)
elif sv.match("dt.ltx_item", child):
def_list = parent.find_parent(DefinitionList)
if def_list is not None:
item = def_list.add_item(Definition())
item.term = SpanElement(parent=item)
parse_latexml_children(child, item.term)
else:
printerr("Found dangling definition term", file=sys.stderr)
elif sv.match("dd.ltx_item", child):
def_list = parent.find_parent(DefinitionList)
if def_list is not None:
if def_list.items and def_list.items[-1].definition is None:
item = def_list.items[-1]
else:
printerr("Found definition without term", file=sys.stderr)
item = def_list.add_item(Definition())
item.definition = SpanElement(parent=item)
parse_latexml_children(child, item.definition)
else:
printerr("Found dangling definition", file=sys.stderr)
parse_latexml_children(child, parent)
elif sv.match("figcaption", child):
fig = parent.find_parent((Figure, Table))
if fig is not None:
if fig.caption is None:
fig.caption = Paragraph(parent=fig)
parse_latexml_children(child, fig.caption)
fig.caption.append(TextElement(content="\n"))
else:
printerr("Figure caption outside figure element", file=sys.stderr)
para = parent.append(Paragraph())
parse_latexml_children(child, para)
elif sv.match(".ltx_break", child):
parent.append(TextElement(content="\n\n"))
elif sv.match(".ltx_abstract, .ltx_acknowledgements", child):
abstract = parent.append(Section())
parse_latexml_children(child, abstract)
elif sv.match(".ltx_ERROR", child):
printerr(
f"LaTeX error element: {child.get_text(strip=True)}", file=sys.stderr
)
elif is_wrapper_element(child):
parse_latexml_children(child, parent)
elif ignore_element(child):
continue
else:
printerr(
f"Unknown LaTeXML element <{child.name}> with classes {', '.join(classes)}",
file=sys.stderr,
)
elem = parent.append(UnknownElement())
parse_latexml_children(child, elem)
# TODO: move this somewhere else, so I can use it with plaintext too
sess = requests.Session()
def parse_latexml_references(html: BeautifulSoup, doc: Document) -> None:
for child in html.select("li.ltx_bibitem"):
child.attrs.get("id")
ref_text = child.get_text(strip=False).replace("\n", " ")
reference = Reference()
reference.title = TextElement(content=child.get_text(strip=True))
doc.add_reference(reference)
def parse_latexml(
html: BeautifulSoup,
) -> Optional[Document]:
if html.article is None:
printerr("Missing article element", file=sys.stderr)
return None
doc = Document()
parse_latexml_children(html.article, doc)
parse_latexml_references(
html.article,
doc,
)
return doc
================================================
FILE: nougat/dataset/parser/markdown.py
================================================
"""
Copyright (c) Meta Platforms, Inc. and affiliates.
This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.
"""
from typing import Iterable, List, Optional, Tuple
import re
from uuid import uuid4
from nougat.dataset.utils import normalize_tex
from nougat.dataset.parser.document import *
from nougat.dataset.parser.latexml_parser import _clean_html_whitespace
from unidecode import unidecode
SUPERSCRIPT_MAP = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")
SUBSCRIPT_MAP = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")
figure_regex = re.compile(r"\[(FOOTNOTE|FIGURE|TABLE)(.*?)\](.*?)\[END\1\]", re.S)
conv = {
"&": r"\&",
"%": r"\%",
"$": r"\$",
"#": r"\#",
"_": r"\_",
"{": r"\{",
"}": r"\}",
"~": r"\textasciitilde{}",
"^": r"\^{}",
"\\": r"\textbackslash{}",
"<": r"\textless{}",
">": r"\textgreater{}",
}
regex = re.compile(
"|".join(
re.escape(str(key)) for key in sorted(conv.keys(), key=lambda item: -len(item))
)
)
def remove_trailing_whitespace(parts: List[str]) -> None:
"""Removes whitespace elements in list inplace"""
for s in reversed(parts):
if s.rstrip() == "":
del parts[-1]
else:
break
def remove_line_breaks(parts: List[str]):
out = []
for s in parts:
out.append(s.replace("\n", " "))
return out
def leading_trailing_whitespace(
parts: List[str],
) -> Tuple[List[str], List[str], List[str]]:
"""splits the list into three parts. The first and last return elements are made up only of whitespace
Args:
parts (List[str]): List to split.
Returns:
Tuple[List[str],List[str],List[str]]: Splitted list
"""
lead = []
trail = []
out_slice = [None, None]
for i, s in enumerate(parts):
if s.strip() == "":
lead.append(s)
out_slice[0] = i + 1
else:
break
for i, s in enumerate(reversed(parts)):
if s.strip() == "":
trail.append(s)
out_slice[1] = -1 - i
else:
break
return lead, parts[slice(*out_slice)], trail[::-1]
def latex_escape(string: str) -> str:
return regex.sub(lambda match: conv[match.group()], string)
def is_empty(content: List) -> bool:
"""Used to determine if a Section is empty"""
empty = True
for part in content:
if len(part.strip()):
empty = False
break
return empty
def format_element(
element: Element, keep_refs: bool = False, latex_env: bool = False
) -> List[str]:
"""
Formats a given Element into a list of formatted strings.
Args:
element (Element): The element to be formatted.
keep_refs (bool, optional): Whether to keep references in the formatting. Default is False.
latex_env (bool, optional): Whether to use LaTeX environment formatting. Default is False.
Returns:
List[str]: A list of formatted strings representing the formatted element.
"""
if isinstance(element, TextElement):
if latex_env:
return [latex_escape(element.content)]
else:
return [element.content]
if isinstance(element, Bold):
parts = format_children(element, keep_refs, latex_env)
if element.find_parent(Algorithm) is not None:
return parts
lead, text, tail = leading_trailing_whitespace("".join(parts))
return [*lead, "**", *remove_line_breaks(text), "**", *tail]
if isinstance(element, Italic):
parts = format_children(element, keep_refs, latex_env)
if element.find_parent(Algorithm) is not None:
return parts
lead, text, tail = leading_trailing_whitespace("".join(parts))
return [*lead, "_", *remove_line_breaks(text), "_", *tail]
if isinstance(element, PlaintextMath):
return format_children(element, keep_refs) + ["\n"]
if isinstance(element, Paragraph):
return format_children(element, keep_refs, latex_env) + ["\n\n"]
if isinstance(element, TableCell):
parts = format_children(element, keep_refs, latex_env)
remove_trailing_whitespace(parts)
if element.multirow is not None:
parts.insert(0, "\\multirow{%i}{*}{" % (element.multirow))
parts.append("}")
if element.multicolumn is not None:
parts.insert(
0, "\\multicolumn{%i}{%s}{" % (element.multicolumn, element.spec)
)
parts.append("}")
return parts
if isinstance(element, TableRow):
parts = []
if element.hline_above:
parts.append(element.hline_above + "\n")
parts.extend(
remove_line_breaks(
format_iterator(element.cells, keep_refs, latex_env, join=" & ")
)
)
parts.append(r" \\")
parts.append((" " + element.hline_below).rstrip())
return parts
if isinstance(element, Tabular):
parts = [
"\\begin{tabular}",
"{%s}\n" % element.get_table_spec(),
]
parts.extend(format_iterator(element.rows, keep_refs, True, join="\n"))
parts.append("\n\\end{tabular}\n")
return parts
if isinstance(element, Table):
parts = [
"[TABLE%s]\n\\begin{table}\n"
% (str(uuid4())[:5] if element.id is None else ":" + str(element.id))
]
parts.extend(format_children(element, keep_refs, latex_env))
caption_parts = format_element(element.caption, keep_refs, latex_env)
remove_trailing_whitespace(caption_parts)
parts.append("\\end{table}\n")
if len(caption_parts) > 0:
parts.extend(caption_parts + ["\n"])
parts.append("[ENDTABLE]\n\n")
return parts
if isinstance(element, Figure):
parts = format_element(element.caption, keep_refs)
remove_trailing_whitespace(parts)
return (
[
"[FIGURE%s]\n"
% (str(uuid4())[:5] if element.id is None else ":" + str(element.id))
]
+ parts
+ ["\n[ENDFIGURE]\n\n"]
)
if isinstance(element, SectionHeader):
parts = ["# "]
if element.id:
parts.append(f"{element.id.upper()} ")
if element.header:
header = format_element(element.header, keep_refs)
else:
header = format_iterator(element.children, keep_refs)
_, title, _ = leading_trailing_whitespace("".join(header))
parts.append(title)
parts.append("\n\n")
return parts
if isinstance(element, Section):
children_parts = format_children(element, keep_refs)
if is_empty(children_parts):
return []
if element.header:
parts = [f"\n\n{'#'*element.hnum} "]
_, title, _ = leading_trailing_whitespace(
"".join(format_element(element.header, keep_refs))
)
parts.append(title)
parts.append("\n\n")
else:
parts = []
return parts + children_parts
if isinstance(element, Footnote):
if element.id is not None:
foot = f"\n[FOOTNOTE:{element.id}]Footnote {element.id}: "
else:
foot = "\n[FOOTNOTE:%s]Footnote: " % (str(uuid4())[:5])
return [foot] + format_children(element, keep_refs) + ["[ENDFOOTNOTE]\n\n"]
if isinstance(element, ListContainer):
items = [
(
item.label,
"".join(format_element(item, keep_refs)).strip().replace("\n", " "),
)
for item in element.items
]
parts = ["\n"]
indent = " " * max(element.level - 1, 0)
for i, (label, item) in enumerate(items, 1):
if label:
bullet = label
else:
bullet = f"{i}." if element.ordered else "*"
parts.append(f"{indent}{bullet} {item}\n")
parts.append("\n")
return parts
if isinstance(element, Equation):
# equation comprises of multiple displaystyle TeX formulas and optional equation label
parts = []
for child in element.children:
if isinstance(child, LatexMath):
tex = normalize_tex(
"".join(format_element(child, keep_refs)).strip(" \n"), inline=False
)
parts.append(tex)
else:
text = "".join(format_element(child, keep_refs))
if text:
parts.append(text)
lead, eqs, tail = leading_trailing_whitespace(parts)
s = " ".join(eqs).replace(r"\] \[", " ")
return [*lead, s, *tail]
if isinstance(element, EquationList):
parts = ["\n"]
items = element.equations
items = ["".join(format_element(item, keep_refs)).rstrip() for item in items]
items = [item + "\n" for item in items if item]
if items:
parts.extend(items)
parts.append("\n")
return parts
if isinstance(element, Algorithm):
parts = []
items = element.lines
items = ["".join(format_element(item, keep_refs)).rstrip() for item in items]
if element.inline:
items = [item for item in items if item]
else:
items = [item + "\n" for item in items if item]
if items:
prepend = "`" if element.inline else "\n```\n"
parts.append(prepend)
parts.extend(items)
append = "`" if element.inline else "```\n\n"
parts.append(append)
return parts
if isinstance(element, DefinitionList):
parts = ["\n"]
if element.header is not None:
parts.extend(format_element(element.header, keep_refs))
parts.append("\n")
items = [
"".join(format_element(item, keep_refs)).rstrip() for item in element.items
]
items = [item + "\n" for item in items if item]
if items:
parts.extend(items)
parts.append("\n")
return parts
if isinstance(element, Definition):
parts = []
if element.term is not None:
term = (
"".join(format_element(element.term, keep_refs)).rstrip(" \n\t:") + ": "
)
# maths in wiki might be inside a definition without a term
if term.strip() != ":":
parts.append(term)
if element.definition is not None:
definition = "".join(format_element(element.definition, keep_refs)).rstrip()
parts.append(definition)
if parts:
parts.append("\n")
return parts
if isinstance(element, LatexMath):
parts = []
if not element.inline:
parts.append("\n\n")
parts.append(normalize_tex(element.code, element.inline).strip())
if not element.inline:
parts.append("\n\n")
return parts
if isinstance(element, (Superscript, Subscript)):
content = element.plaintext
if content.strip().isdigit():
script_map = (
SUBSCRIPT_MAP if isinstance(element, Subscript) else SUPERSCRIPT_MAP
)
return [content.translate(script_map)]
else:
return format_children(element, keep_refs)
if isinstance(element, InlineRef):
parts = format_children(element, keep_refs)
return parts
return format_children(element, keep_refs, latex_env)
def format_iterator(
iterator: Iterable,
keep_refs: bool = False,
latex_env: bool = False,
join: Optional[str] = None,
) -> List[str]:
"""
The `format_iterator` function takes an iterator and formats its elements, optionally joining them with a specified string.
:param iterator: The `iterator` parameter is an iterable object that contains the elements to be formatted. It could be a list, tuple, set, or any other iterable object
:type iterator: Iterable
:param keep_refs: The `keep_refs` parameter is a boolean flag that determines whether references to other elements should be preserved in the formatted output. If `keep_refs` is set to `True`, the references will be included in the output. If `keep_refs` is set to `False` (default), the, defaults to False
:type keep_refs: bool (optional)
:param latex_env: The `latex_env` parameter is a boolean flag that determines whether the output should be formatted as LaTeX code. If `latex_env` is set to `True`, the output will be formatted using LaTeX syntax. If `latex_env` is set to `False` (default), the output will be, defaults to False
:type latex_env: bool (optional)
:param join: The `join` parameter is an optional string that specifies the delimiter to be used when joining the formatted elements of the iterator into a single string. If `join` is provided, it will be inserted between each formatted element. If `join` is not provided, the formatted elements will be returned as
:type join: Optional[str]
:return: The function `format_iterator` returns a list of strings.
"""
parts = []
for child in iterator:
parts.extend(format_element(child, keep_refs, latex_env))
if join is not None:
parts.append(join)
if join is not None:
parts = parts[:-1]
return parts
def format_children(
element: Element, keep_refs: bool = False, latex_env: bool = False
) -> List[str]:
if element is None:
return []
return format_iterator(element.children, keep_refs, latex_env)
def format_document(
doc: Document, keep_refs: bool = False
) -> Tuple[str, Dict[str, str]]:
"""
The `format_document` function takes a `doc` object of type `Document` and a boolean `keep_refs` as input and returns a tuple containing the formatted text of the document and a dictionary of figures found in the document.
:param doc: The `doc` parameter is of type `Document`, which is presumably a custom class representing a document
:type doc: Document
:param keep_refs: The `keep_refs` parameter is a boolean flag that determines whether to keep references in the formatted document or not. If `keep_refs` is set to `True`, the references will be included in the formatted document. If `keep_refs` is set to `False`, the references will be excluded, defaults to False
:type keep_refs: bool (optional)
:return: The function `format_document` returns a tuple containing two elements: a formatted text document and a dictionary of figures.
"""
parts = []
if doc.title:
parts.extend([*format_element(doc.title), "\n"])
parts.append("\n")
parts.extend(format_children(doc, keep_refs))
text = "".join(parts)
text = text.replace("\xa0", " ") # replace non-breakable spaces
text = re.sub(r" $", "", text, flags=re.MULTILINE)
text = re.sub(r"\n[\t ]*$", "\n", text, flags=re.MULTILINE)
text = re.sub(r"(?<!\n) {2,}", " ", text)
text = re.sub(r"\n{3,}", "\n\n", text).lstrip()
figures = {unidecode(m[0] + m[1]): m[2].strip() for m in figure_regex.findall(text)}
text = figure_regex.sub(
r"[\1\2][END\1]",
text,
)
return text, figures
================================================
FILE: nougat/dataset/pdffigures.py
================================================
"""
Copyright (c) Meta Platforms, Inc. and affiliates.
This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.
"""
import os
import subprocess
import logging
PDFFIGURES2_JAR_PATH = os.environ.get("PDFFIGURES_PATH", None)
logger = logging.getLogger()
if PDFFIGURES2_JAR_PATH is None:
logger.warning(
"You need to configure the path to the pdffigures2 executable in this file (nougat/dataset/pdffigures.py) or set the environment variable 'PDFFIGURES_PATH'."
)
def call_pdffigures(
pdf_path: str, figures_dir: str, timeout: int = 30, verbose: bool = False
):
"""
Extract figures from a PDF file using pdffigures2.
Args:
pdf_path (str): The path to the PDF file.
figures_dir (str): The directory where the figures will be extracted.
timeout (int, optional): The timeout in seconds for the pdffigures2 command. Defaults to 30.
verbose (bool, optional): Whether to print the output of the pdffigures2 command. Defaults to False.
Returns:
str: The path to the JSON file containing the extracted figures.
"""
os.makedirs(figures_dir, exist_ok=True)
kwargs = (
{} if verbose else {"stderr": subprocess.DEVNULL, "stdout": subprocess.DEVNULL}
)
if PDFFIGURES2_JAR_PATH is None:
return
process = subprocess.Popen(
"java"
" -jar {pdffigures_jar_path}"
" -d {figures_dir}/"
" -c"
" -q"
" {pdf_path}".format(
pdffigures_jar_path=PDFFIGURES2_JAR_PATH,
pdf_path=pdf_path,
figures_dir=figures_dir,
),
shell=True,
**kwargs
)
try:
exit_code = process.wait(timeout=timeout)
if exit_code != 0:
logger.error("Extracting figures from file %s failed.", pdf_path)
return False
except subprocess.TimeoutExpired as e:
logger.error(
"pdffigures2 command did not terminate in 30 seconds, "
"terminating. Error: %s",
e,
)
process.terminate() # give up
return False
pdf_name = os.path.basename(pdf_path).partition(".pdf")[0]
dest_file = os.path.join(figures_dir, (pdf_name + ".json"))
return dest_file
================================================
FILE: nougat/dataset/rasterize.py
================================================
"""
Copyright (c) Meta Platforms, Inc. and affiliates.
This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.
"""
import argparse
import logging
import pypdfium2
from pathlib import Path
from tqdm import tqdm
import io
from typing import Optional, List, Union
logging.getLogger("pypdfium2").setLevel(logging.WARNING)
def rasterize_paper(
pdf: Union[Path, bytes],
outpath: Optional[Path] = None,
dpi: int = 96,
return_pil=False,
pages=None,
) -> Optional[List[io.BytesIO]]:
"""
Rasterize a PDF file to PNG images.
Args:
pdf (Path): The path to the PDF file.
outpath (Optional[Path], optional): The output directory. If None, the PIL images will be returned instead. Defaults to None.
dpi (int, optional): The output DPI. Defaults to 96.
return_pil (bool, optional): Whether to return the PIL images instead of writing them to disk. Defaults to False.
pages (Optional[List[int]], optional): The pages to rasterize. If None, all pages will be rasterized. Defaults to None.
Returns:
Optional[List[io.BytesIO]]: The PIL images if `return_pil` is True, otherwise None.
"""
pils = []
if outpath is None:
return_pil = True
try:
if isinstance(pdf, (str, Path)):
pdf = pypdfium2.PdfDocument(pdf)
if pages is None:
pages = range(len(pdf))
renderer = pdf.render(
pypdfium2.PdfBitmap.to_pil,
page_indices=pages,
scale=dpi / 72,
)
for i, image in zip(pages, renderer):
if return_pil:
page_bytes = io.BytesIO()
image.save(page_bytes, "bmp")
pils.append(page_bytes)
else:
image.save((outpath / ("%02d.png" % (i + 1))), "png")
except Exception as e:
logging.error(e)
if return_pil:
return pils
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--pdfs", nargs="+", type=Path, help="PDF files", required=True)
parser.add_argument("--out", type=Path, help="Output dir", default=None)
parser.add_argument(
"--dpi", type=int, default=96, help="What resolution the pages will be saved"
)
parser.add_argument(
"--pages", type=int, nargs="+", default=None, help="list of page numbers"
)
args = parser.parse_args()
if args.pages:
args.pages = [p - 1 for p in args.pages]
for pdf_file in tqdm(args.pdfs):
assert pdf_file.exists() and pdf_file.is_file()
outpath: Path = args.out or (pdf_file.parent / pdf_file.stem)
outpath.mkdir(exist_ok=True)
rasterize_paper(pdf_file, outpath, pages=args.pages, dpi=args.dpi)
================================================
FILE: nougat/dataset/split_htmls_to_pages.py
================================================
"""
Copyright (c) Meta Platforms, Inc. and affiliates.
This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.
"""
import argparse
from io import BytesIO
import multiprocessing
from pebble import ProcessPool
from concurrent.futures import TimeoutError
from tqdm import tqdm
from typing import Tuple
import os
from pathlib import Path
import logging
import pypdf
from PIL import Image
import pytesseract
from nougat.dataset.split_md_to_pages import *
from nougat.dataset.parser.html2md import *
from nougat.dataset.pdffigures import call_pdffigures
logging.basicConfig()
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def process_paper(
fname: str,
pdf_file: Path,
html_file: Path,
json_file: Path,
args: argparse.Namespace,
) -> Tuple[int, int]:
"""
Process a single paper.
Args:
fname (str): The paper's filename.
pdf_file (Path): The path to the PDF file.
html_file (Path): The path to the HTML file.
json_file (Path): The path to the JSON file containing the extracted figures.
args (argparse.Namespace): The command-line arguments.
Returns:
Tuple[int, int]: The number of total pages and the number of recognized pages.
"""
total_pages = 0
num_recognized_pages = 0
try:
pdf = pypdf.PdfReader(pdf_file)
total_pages = len(pdf.pages)
outpath: Path = args.out / fname
# skip this paper if already processed
dirs_with_same_stem = list(args.out.glob(fname.partition("v")[0] + "*"))
if (
len(dirs_with_same_stem) > 0
and len(list(dirs_with_same_stem[0].iterdir())) > 0
and not args.recompute
):
logger.info(
"%s (or another version thereof) already processed. Skipping paper",
fname,
)
return total_pages, len(list(outpath.glob("*.mmd")))
html = BeautifulSoup(
htmlmin.minify(
open(html_file, "r", encoding="utf-8").read().replace("\xa0", " "),
remove_all_empty_space=True,
),
features="html.parser",
)
doc = parse_latexml(html)
if doc is None:
return
out, fig = format_document(doc, keep_refs=True)
if args.markdown:
md_out = args.markdown / (fname + ".mmd")
with open(md_out, "w", encoding="utf-8") as f:
f.write(out)
if json_file is None:
json_file = call_pdffigures(pdf_file, args.figure)
if json_file:
figure_info = json.load(open(json_file, "r", encoding="utf-8"))
else:
figure_info = None
split = split_markdown(
out, pdf_file, figure_info=figure_info, doc_fig=fig, min_score=0.9
)
if split is None:
return
pages, meta = split
num_recognized_pages = sum([len(p) > 0 for p in pages])
if all([len(p) == 0 for p in pages]):
return
os.makedirs(outpath, exist_ok=True)
recognized_indices = []
for i, content in enumerate(pages):
with (outpath / "meta.json").open("w", encoding="utf-8") as f:
f.write(json.dumps(meta))
if content:
if re.search(r"\[(?:\?\?(?:. )?)+\]", content):
# there are wrongly parsed references in the page eg [??].
continue
with (outpath / ("%02d.mmd" % (i + 1))).open(
"w", encoding="utf-8"
) as f:
f.write(content)
recognized_indices.append(i)
rasterize_paper(pdf_file, outpath, dpi=args.dpi, pages=recognized_indices)
if args.tesseract:
for i in recognized_indices:
ocr = pytesseract.image_to_string(
Image.open((outpath / ("%02d.png" % (i + 1)))), lang="eng"
)
ocr = re.sub(r"\n+\s+?([^\s])", r"\n\n\1", ocr).strip()
with (outpath / ("%02d_OCR.txt" % (i + 1))).open(
"w", encoding="utf-8"
) as f_ocr:
f_ocr.write(ocr)
except Exception as e:
logger.error(e)
return total_pages, num_recognized_pages
def process_htmls(args):
for input_dir in (args.pdfs, args.html):
if not input_dir.exists() and not input_dir.is_dir():
logger.error("%s does not exist or is no dir.", input_dir)
return
htmls: List[Path] = args.html.glob("*.html")
args.out.mkdir(exist_ok=True)
if args.markdown:
args.markdown.mkdir(exist_ok=True)
with ProcessPool(max_workers=args.workers) as pool:
total_pages, total_pages_extracted = 0, 0
tasks = {}
for j, html_file in enumerate(htmls):
fname = html_file.stem
pdf_file = args.pdfs / (fname + ".pdf")
if not pdf_file.exists():
logger.info("%s pdf could not be found.", fname)
continue
json_file = args.figure / (fname + ".json")
if not json_file.exists():
logger.info("%s figure json could not be found.", fname)
json_file = None
tasks[fname] = pool.schedule(
process_paper,
args=[fname, pdf_file, html_file, json_file, args],
timeout=args.timeout,
)
for fname in tqdm(tasks):
try:
res = tasks[fname].result()
if res is None:
logger.info("%s is faulty", fname)
continue
num_pages, num_recognized_pages = res
total_pages += num_pages
total_pages_extracted += num_recognized_pages
logger.info(
"%s: %i/%i pages recognized. Percentage: %.2f%%",
fname,
num_recognized_pages,
num_pages,
(100 * num_recognized_pages / max(1, num_pages)),
)
except TimeoutError:
logger.info("%s timed out", fname)
if total_pages > 0:
logger.info(
"In total: %i/%i pages recognized. Percentage: %.2f%%",
total_pages_extracted,
total_pages,
(100 * total_pages_extracted / max(1, total_pages)),
)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--html", type=Path, help="HTML files", required=True)
parser.add_argument("--pdfs", type=Path, help="PDF files", required=True)
parser.add_argument("--out", type=Path, help="Output dir", required=True)
parser.add_argument("--recompute", action="store_true", help="recompute all splits")
parser.add_argument(
"--markdown", type=Path, help="Markdown output dir", default=None
)
parser.add_argument(
"--figure",
type=Path,
help="Figure info JSON dir",
)
parser.add_argument(
"--workers",
type=int,
default=multiprocessing.cpu_count(),
help="How many processes to use",
)
parser.add_argument(
"--dpi", type=int, default=96, help="What resolution the pages will be saved at"
)
parser.add_argument(
"--timeout", type=float, default=120, help="max time per paper in seconds"
)
parser.add_argument(
"--tesseract",
action="store_true",
help="Tesseract OCR prediction for each page",
)
args = parser.parse_args()
print(args)
process_htmls(args)
================================================
FILE: nougat/dataset/split_md_to_pages.py
================================================
"""
Copyright (c) Meta Platforms, Inc. and affiliates.
This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.
"""
import argparse
from collections import Counter
from copy import deepcopy
import json
import math
from operator import itemgetter
import re
from typing import Dict, List, Tuple, Union, Optional
import os
import pypdf
from unidecode import unidecode
from rapidfuzz.fuzz import ratio as ratio_perc
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from nougat.dataset.staircase import Staircase
from nougat.dataset.splitter import (
Splitter,
get_first_last,
get_glob_index,
)
from nougat.dataset.utils import unicode_to_latex, remove_pretty_linebreaks
from nougat.dataset.utils.pdf_text_extract import get_pages, get_paragraphs
from nougat.dataset.rasterize import rasterize_paper
def ratio(*args, **kwargs):
return ratio_perc(*args, **kwargs) / 100
class BagOfWords:
"""
A bag-of-words model for text classification.
Args:
sentences (List[str]): The training sentences.
target (Optional[List[int]]): The target labels for the training sentences. Defaults to None.
"""
def __init__(
self,
sentences: List[str],
target: Optional[List[int]] = None,
) -> None:
self.sentences = sentences
self.target = target
self.train()
def train(self):
if self.target is None:
self.target = np.arange(len(self.sentences))
self.count_vect = CountVectorizer()
X_train_counts = self.count_vect.fit_transform(self.sentences)
self.tfidf_transformer = TfidfTransformer(use_idf=True)
X_train_tfidf = self.tfidf_transformer.fit_transform(X_train_counts)
self.clf = SGDClassifier(
loss="hinge",
penalty="l2",
alpha=1e-3,
random_state=42,
max_iter=5,
tol=None,
)
self.clf.fit(X_train_tfidf, self.target)
def __call__(
self, text: Union[str, List[str]], lob_probs: bool = False
) -> np.ndarray:
if type(text) == str:
text = [text]
X_new_counts = self.count_vect.transform(text)
X_new_tfidf = self.tfidf_transformer.transform(X_new_counts)
if lob_probs:
return self.clf.predict_log_proba(X_new_tfidf)
else:
return self.clf.predict(X_new_tfidf)
def remove_short_seqs(seqs: List[str], minimum: int = 10) -> List[str]:
"""Remove sequences shorter than the specified minimum length."""
out = []
for seq in seqs:
if len(seq) > minimum:
out.append(seq)
return out
def find_figures(
pdf_pages: List[List[str]], figure_info: Union[Dict, List]
) -> List[Tuple[int, int]]:
""" "
Find the locations of figures in a PDF file.
Args:
pdf_pages (List[List[str]]): The text of the PDF pages.
figure_info (Union[Dict, List]): A dictionary or list of dictionaries, where each dictionary
specifies the information about a figure, such as its caption, page number, and bounding box.
Returns:
List[Tuple[int, int]]: A list of tuples, where each tuple contains the figure index, page number,
start position, and end position of the figure in the PDF file.
"""
figure_locations = []
iterator = figure_info.values() if type(figure_info) == dict else [figure_info]
for figure_list in iterator:
for i, f in enumerate(figure_list):
if "caption" in f:
fig_string = f["caption"]
elif "text" in f:
fig_string = f["text"]
else:
continue
fig_string = unicode_to_latex(fig_string)
if f["page"] >= len(pdf_pages):
continue
block, score = Splitter.fuzzysearch(
"\n".join(pdf_pages[f["page"]]),
fig_string,
)
if score > 0.8 and block[2] > 0:
figure_locations.append((i, f["page"], block[0], block[2]))
return figure_locations
def flatten(l: List) -> List:
return [item for sublist in l for item in sublist]
def get_doc_text(
pdf: str,
splitn: bool = True,
split_block: bool = True,
minlen: Optional[int] = 10,
) -> List[List[str]]:
"""
Get the text from a PDF document.
Args:
doc (str): Path to the PDF document.
splitn (bool): Whether to split the text into lines. Defaults to True.
split_block (bool): Whether to split the text into blocks. Defaults to True.
minlen (Optional[int]): The minimum length of a line or block. Defaults to 10.
Returns:
List[List[str]]: The text of the PDF document, either as a list of lines or a list of blocks..
"""
document_lines = []
if split_block:
pages = get_paragraphs(pdf)
else:
pages = [get_pages(pdf)]
for blocks in pages:
page_lines = []
for block in blocks:
if splitn:
page_lines.extend(block.split("\n"))
else:
page_lines.append(block)
if splitn:
page_lines = remove_short_seqs(page_lines, minlen)
document_lines.append(page_lines)
return document_lines
def clean_pdf_text(pages: List[List[str]], num_words: int = 10) -> List[List[str]]:
"""
Clean the text of a PDF document by removing frequent words from the beginning and end of each page.
Args:
pages (List[List[str]]): The text of the PDF document, as a list of lists of strings.
num_words (int, optional): The number of words to consider at the beginning and end of each page. Defaults to 10.
Returns:
List[List[str]]: The cleaned text of the PDF document.
"""
words = []
for page in pages:
first = get_first_last(
" ".join(page).lower(), num_words=num_words, first_only=True
)
words.extend(first.split(" "))
word_counts = Counter(words)
common_words = [
"the",
"of",
"a",
"and",
"to",
"in",
"is",
"that",
"for",
"are",
"this",
"we",
"figure",
"fig.",
"",
]
frequent_words = []
for w, f in word_counts.items():
if w in common_words or w.startswith("\\"):
continue
if f / len(pages) >= 0.4:
frequent_words.append(w)
if len(frequent_words) == 0:
return pages
# remove frequent words from page beginning/end
for i in range(len(pages)):
page = pages[i]
stop = 0
page_num_words = 0
for p in page:
page_num_words += len(p.split(" "))
stop += 1
if page_num_words >= num_words:
break
for w in frequent_words:
for j in range(stop):
if w == "-": # probably page number - \d -
pages[i][j] = re.sub(
r"-\s*\d{1,3}\s*-", "", pages[i][j], flags=re.IGNORECASE
)
pages[i][j] = re.sub(re.escape(w), "", pages[i][j], flags=re.IGNORECASE)
return pages
def split_markdown(
doc: str,
pdf_file: str,
figure_info: Optional[List[Dict]] = None,
doc_fig: Dict[str, str] = {},
minlen: int = 3,
min_num_words: int = 22,
doc_paragraph_chars: int = 1000,
min_score: float = 0.75,
staircase: bool = True,
) -> Tuple[List[str], Dict]:
"""
Split a PDF document into Markdown paragraphs.
Args:
doc (str): The text of the Markdown document.
pdf (str): The PDF document.
figure_info (Optional[List[Dict]]): A list of dictionaries, where each dictionary
specifies the information about a figure, such as its caption, page number, and bounding box.
doc_fig (Dict[str, str]): A dictionary mapping figure ids to LaTeX code.
minlen (int): The minimum length of a Markdown paragraph.
min_num_words: The minimum number of words in a Markdown paragraph.
doc_paragraph_chars: The maximum number of characters in a Markdown paragraph.
min_score: The minimum score for a Markdown paragraph to be split.
staircase: Whether to split the document into paragraphs with a staircase pattern.
Returns:
Tuple[List[str], Dict]: The list of Markdown paragraphs and the metadata.
"""
pdf = pypdf.PdfReader(pdf_file)
doc_paragraphs_full: List[str] = doc.split("\n")
doc_paragraph_lengths = [len(p) for p in doc_paragraphs_full if len(p) > 1]
num_lines = 1 + int(doc_paragraph_chars / np.mean(doc_paragraph_lengths))
doc_paragraphs_full = [
unidecode("\n".join(doc_paragraphs_full[i : i + num_lines]))
for i in range(0, len(doc_paragraphs_full), num_lines)
]
doc_paragraphs: List[str] = []
doc_paragraph_indices: List[int] = []
for i, p in enumerate(doc_paragraphs_full):
if len(p) > 1:
doc_paragraphs.append(
re.sub(r"(\[(FOOTNOTE|FIGURE|TABLE).*?END\2\])", "", p)
)
doc_paragraph_indices.append(i)
meta = {"pdffigures": figure_info}
if len(pdf.pages) > 1:
pdf_text = get_doc_text(pdf_file, True, True, minlen)
pdf_content = [
[unicode_to_latex(q).replace("\n", " ") for q in p if len(q) >= minlen]
for p in pdf_text
]
pdf_content = clean_pdf_text(pdf_content)
if figure_info is not None:
figure_locations = sorted(
find_figures(pdf_content, figure_info), key=itemgetter(2), reverse=True
)
clean_pdf_content = deepcopy(pdf_content)
for i, page_content in enumerate(pdf_content):
len_sentences = np.cumsum([0] + [len(p) for p in page_content])
for match in figure_locations:
_, page, start, len_ = match
if i != page:
continue
a, b = (
get_glob_index(len_sentences, start),
get_glob_index(len_sentences, start + len_) + 1,
)
for j, k in enumerate(range(a, b + 1)):
if len(clean_pdf_content[i]) == k:
break
if j == 0:
clean_pdf_content[i][k] = clean_pdf_content[i][k][
: start - len_sentences[k]
]
elif k == b:
clean_pdf_content[i][k] = clean_pdf_content[i][k][
start + len_ - len_sentences[k] :
]
else:
clean_pdf_content[i][k] = ""
clean_pdf_content[i] = remove_short_seqs(clean_pdf_content[i], 0)
pdf_content = clean_pdf_content
paragraphs = flatten(pdf_content)
num_paragraphs = np.cumsum([0] + [len(page) for page in pdf_content])
if staircase:
# train bag of words
page_target = np.zeros(len(paragraphs))
page_target[num_paragraphs[1:-1] - 1] = 1
page_target = np.cumsum(page_target).astype(int)
model = BagOfWords(paragraphs, target=page_target)
labels = model(doc_paragraphs)
# fit stair case function
x = np.arange(len(labels))
stairs = Staircase(len(labels), labels.max() + 1)
stairs.fit(x, labels)
boundaries = (stairs.get_boundaries().astype(int)).tolist()
boundaries.insert(0, 0)
else:
boundaries = [0] * (len(pdf.pages))
splitter = Splitter(doc_paragraphs)
pages = [(0, 0, 1.0)]
meta["first_words"] = []
meta["last_words"] = []
for i in range(1, len(boundaries)):
delta = (
math.ceil(stairs.uncertainty[i - 1]) + 5
if staircase
else len(doc_paragraphs)
)
words_f = []
words_l = []
for p in pdf_content[i]:
words_f.extend(p.split(" "))
if len(words_f) >= min_num_words:
break
for p in pdf_content[i - 1][::-1]:
words_l.extend(p.split(" ")[::-1])
if len(words_l) >= min_num_words:
words_l = words_l[::-1]
break
if len(words_f) < 2:
pages.append(pages[-1])
first_words = " ".join(words_f[:min_num_words]).strip()
last_words = " ".join(words_l[-min_num_words:]).strip()
meta["first_words"].append(first_words)
meta["last_words"].append(last_words)
if len(first_words) < minlen and len(last_words) < minlen:
pages.append(pages[-1])
continue
pages.append(
splitter.split_first_last(
boundaries[i],
first_words,
last_words,
delta=delta,
)
)
elif len(pdf.pages) == 1: # single page
pages = [(0, 0, 1)]
else:
return
pages.append((len(doc_paragraphs), -1, 1.0))
out = []
page_scores = {}
for i in range(len(pages) - 1):
score = (pages[i][2] + pages[i + 1][2]) * 0.5
if score >= min_score:
end = pages[i + 1][0]
if end >= len(doc_paragraph_indices):
end = None
else:
end = doc_paragraph_indices[pages[i + 1][0]] + 1
lines = doc_paragraphs_full[doc_paragraph_indices[pages[i][0]] : end]
if len(lines) > 0:
lines[0] = lines[0][pages[i][1] :]
lines[-1] = lines[-1][: pages[i + 1][1]]
else:
lines = []
page_content = "\n".join(lines)
page_content = remove_pretty_linebreaks(page_content)
page_scores[i] = score
out.append(page_content)
meta["page_splits"] = pages
meta["page_scores"] = page_scores
meta["num_pages"] = len(pdf.pages)
# Reintroduce figures, tables and footnotes
figure_tex = list(doc_fig.keys()), list(doc_fig.values())
if len(doc_fig) > 0:
iterator = figure_info.values() if type(figure_info) == dict else [figure_info]
for figure_list in iterator:
if not figure_list:
continue
for i, f in enumerate(figure_list):
if "caption" in f:
fig_string = f["caption"]
elif "text" in f:
fig_string = f["text"]
else:
continue
ratios = []
for tex in figure_tex[1]:
if f["figType"] == "Table":
tex = tex.partition(r"\end{table}")[2]
ratios.append(ratio(tex, fig_string))
k = np.argmax(ratios)
if ratios[k] < 0.8:
continue
if f["page"] < len(out) and out[f["page"]] != "":
out[f["page"]] += "\n\n" + remove_pretty_linebreaks(
figure_tex[1][k].strip()
)
for i in range(len(out)):
foot_match = re.findall(r"\[FOOTNOTE(.*?)\]\[ENDFOOTNOTE\]", out[i])
for match in foot_match:
out[i] = out[i].replace(
"[FOOTNOTE%s][ENDFOOTNOTE]" % match,
doc_fig.get("FOOTNOTE%s" % match, ""),
)
out[i] = re.sub(r"\[(FIGURE|TABLE)(.*?)\](.*?)\[END\1\]", "", out[i])
return out, meta
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--md", type=str, help="Markdown file", required=True)
parser.add_argument("--pdf", type=str, help="PDF File", required=True)
parser.add_argument("--out", type=str, help="Out dir", required=True)
parser.add_argument(
"--figure",
type=str,
help="Figure info JSON",
)
parser.add_argument("--dpi", type=int, default=96)
args = parser.parse_args()
md = open(args.md, "r", encoding="utf-8").read().replace("\xa0", " ")
pdf = pypdf.PdfReader(args.pdf)
try:
fig_info = json.load(open(args.figure, "r", encoding="utf-8"))
except FileNotFoundError:
fig_info = None
pages, meta = split_markdown(md, pdf, fig_info)
if args.out:
outpath = os.path.join(args.out, os.path.basename(args.pdf).partition(".")[0])
os.makedirs(outpath, exist_ok=True)
found_pages = []
for i, content in enumerate(pages):
if content:
with open(
os.path.join(
outpath, "%02d_s=%.2f.mmd" % (i + 1, meta["page_scores"][i])
),
"w",
encoding="utf-8",
) as f:
f.write(content)
found_pages.append(i)
rasterize_paper(pdf, outpath, dpi=args.dpi, pages=found_pages)
================================================
FILE: nougat/dataset/splitter.py
================================================
"""
Copyright (c) Meta Platforms, Inc. and affiliates.
This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.
"""
from typing import List, Tuple, Union
import re
import numpy as np
from rapidfuzz.fuzz import ratio as ratio_perc
from fuzzysearch import find_near_matches
math_start_regex = re.compile(r"(?<!\\)\\[\[\(]", re.M)
math_end_regex = re.compile(r"(?<!\\)\\[\]\)]", re.M)
def ratio(*args, **kwargs):
return ratio_perc(*args, **kwargs) / 100
def reverse(lst: List[str]) -> List[str]:
"""Reverses a list and the strings inside
Args:
lst (List[str]): List to process
Returns:
List[str]: Reversed list
"""
out = lst[::-1]
for i in range(len(out)):
out[i] = out[i][::-1]
return out
def get_first_last(
s: str,
num_words: int = 8,
delim: str = " ",
first_only: bool = False,
last_only: bool = False,
) -> Union[Tuple[str, str], str]:
"""
Get the first and last `num_words` from a string `s`.
Args:
s (str): The string.
num_words (int): The number of words.
delim (str): The delimiter between words.
first_only (bool): Whether to only get the first `num_words`.
last_only (bool): Whether to only get the last `num_words`.
Returns:
Union[Tuple[str, str], str]: The first and last `num_words` from `s`, or `s` if `num_words` is 0.
"""
s = s.split(delim)
if not first_only and not last_only:
return delim.join(s[:num_words]), delim.join(s[-num_words:])
elif first_only:
return delim.join(s[:num_words])
elif last_only:
return delim.join(s[-num_words:])
def get_glob_index(
lengths: List[int], ind: int, return_breakpoints: bool = False
) -> int:
"""returns the index where ind is closest and greater than the lengths"""
breakpoints = np.cumsum(lengths)
overlap = breakpoints - ind
overlap[overlap > 0] = -int(1e5)
indices = overlap.argmax(0)
if return_breakpoints:
return indices, breakpoints
else:
return indices
# table-header-figure regex
# thf_regex = re.compile(r"(\[(FOOTNOTE|FIGURE|TABLE).*?END\2\])")
class Splitter:
_split_locs: List[Tuple[int, int]] = None
def __init__(self, paragraphs: List[str]) -> None:
self.paragraphs = paragraphs
self.paragraphs_no_space = [self.remove_special_chars(h) for h in paragraphs]
self._split_locs = [(0, 0)]
self.paragraphs_rev = reverse(self.paragraphs)
self.paragraphs_rev_no_space = reverse(self.paragraphs_no_space)
@staticmethod
def remove_special_chars(string: str) -> str:
# string = thf_regex.sub(r"", string)
return (
string.replace("\\ ", "")
.replace(" ", "")
.replace("\n", "")
.replace("*", "")
.replace("_", "")
.replace("^", "")
.replace("\\[", "")
.replace("\\]", "")
.replace("\\(", "")
.replace("\\)", "")
.replace("\\right", "")
.replace("\\left", "")
.replace("\\sum", "X") # old latex unicode encoding issue
.replace("{", "")
.replace("}", "")
.replace("#", "")
.replace("[REF]", "")
.replace("[ENDREF]", "")
.replace("\\varphi", "\\phi") # https://meta.stackexchange.com/a/349360
.replace("\\quad", "")
.replace("\\qquad", "")
.replace("\\hskip", "")
.replace("\\vskip", "")
.replace("\\frac", "")
.replace("\\rm", "")
.replace("\\,", "")
.replace("-", "")
.lower()
)
@staticmethod
def count_special_chars(string: str, char_ind: int) -> int:
if len(string) == 0:
return 0
add_space_ind = 0
while True:
string_ = string[: char_ind + add_space_ind]
# last_first = string[: char_ind + add_space_ind+]
add = (
string_.count(" ")
+ string_.count("\\ ") * 2
+ string_.count("\n")
+ string_.count("*")
+ string_.count("_")
+ string_.count("^")
+ string_.count("\\[") * 2
+ string_.count("\\]") * 2
+ string_.count("\\(") * 2
+ string_.count("\\)") * 2
+ string_.count("\\right") * 6
+ string_.count("\\left") * 5
+ string_.count("\\sum") * 3 # replaced to X that's why not 4
+ string_.count("{")
+ string_.count("}")
+ string_.count("#")
+ string_.count("[REF]") * 5
+ string_.count("[ENDREF]") * 8
+ string_.count("\\varphi") * 3
+ string_.count("\\quad") * 5
+ string_.count("\\qquad") * 6
+ string_.count("\\hskip") * 6
+ string_.count("\\vskip") * 6
+ string_.count("\\frac") * 5
+ string_.count("\\rm") * 3
+ string_.count("\\,") * 2
+ string_.count("-")
)
if add == add_space_ind:
break
add_space_ind = add
if len(string) <= char_ind + add_space_ind:
add_space_ind = max(0, len(string) - 1 - char_ind)
# check first chars of rest if they match closing expressions
while True:
rest = string[char_ind + add_space_ind :]
string_ = string[: char_ind + add_space_ind]
section_title = re.match(r"#+\s?\d*\s*$", string_)
if rest.startswith("\\]") or rest.startswith("\\)"):
add_space_ind += 2
elif (rest.startswith(")") or rest.startswith("]")) and string_.endswith(
"\\"
):
add_space_ind += 1
elif (rest.startswith("(") or rest.startswith("[")) and string_.endswith(
"\\"
):
add_space_ind -= 1
elif rest.startswith(" "):
add_space_ind += 1
elif section_title:
add_space_ind -= section_title.end() - section_title.start()
elif (
re.match(r"^[^\w\s]*_\s", rest)
or re.match(r"^[^\w\s]*\*\*?\s", rest)
or re.match(r"^.\n", rest)
):
add_space_ind += 1
else:
break
# check if it starts in a math env and include everything before
end = math_end_regex.search(rest)
if end is not None:
start = math_start_regex.search(rest)
if start is None or start.start() > end.start():
inds = [
m.start()
for m in math_start_regex.finditer(string_)
if m.start() < end.start() + len(string_)
]
if len(inds) > 0:
add_space_ind = inds[-1] - char_ind
# assert string_[char_ind+add_space_ind]=='\\'
return add_space_ind
def split_first_last(
self, index: int, first: str, last: str, delta: int = 5
) -> Tuple[int, int, float]:
"""Refines a split by looking at both the first words from a new page and the last words from the previous page.
Args:
index (int): paragraph index
first (str): first words
last (str): last words
delta (int, optional): paragraph search radius. Defaults to 5.
Returns:
Tuple[int, int, float]: split prediction
"""
if first:
first_split = glob_f, char_f, score_f = self.split(
index, first, delta=delta
)
if last:
last_split = glob_l, char_l, score_l = self.split(
index, last, delta=delta, reverse=True
)
if first and not last:
return first_split
elif not first and last:
return last_split
elif not first and not last:
return index, 0, 0.0
if char_f == char_l and glob_f == glob_l and (score_f > 0.5 or score_l > 0.5):
return glob_l, char_l, 1.0
# score calculation
first, last = self.remove_special_chars(first), self.remove_special_chars(last)
matching = []
for split in (first_split, last_split):
first_source = []
num_chars_first = len(first)
num_chars_last = len(last)
last_source = []
for i, p in enumerate(self.paragraphs[split[0] :]):
if i == 0:
p = p[split[1] :]
first_source.append(self.remove_special_chars(p))
if sum([len(s) for s in first_source]) >= num_chars_first:
break
first_source = "".join(first_source)[:num_chars_first]
for i, p in enumerate(self.paragraphs[split[0] :: -1]):
if i == 0:
p = p[: split[1]]
last_source.insert(0, self.remove_special_chars(p))
if sum([len(s) for s in last_source]) >= num_chars_last:
last_source = last_source
break
last_source = "".join(last_source)[-num_chars_last:]
matching.append(
[
ratio(first, first_source) * ratio(first[:10], first_source[:10]),
ratio(last, last_source) * ratio(last[-10:], last_source[-10:]),
]
)
scores = np.asarray(matching).max(0)
return (
(glob_l, char_l, scores[1])
if scores.argmax()
else (glob_f, char_f, scores[0])
)
def split(
self, index: int, string: str, delta: int = 5, reverse: bool = False
) -> Tuple[int, int, float]:
"""
refine split prediction. `string` are the first words from new page.
delta can be used as uncertainty measure.
returns new index and split index
"""
if reverse:
index = len(self.paragraphs) - 1 - index
string = string[::-1]
paragraphs = self.paragraphs_rev
paragraphs_no_space = self.paragraphs_rev_no_space
else:
paragraphs = self.paragraphs
paragraphs_no_space = self.paragraphs_no_space
string_ = self.remove_special_chars(string)
start_ind = max(0, index - delta)
search_corpus = paragraphs_no_space[start_ind : index + delta + 1]
lengths = np.asarray([0] + [len(p) for p in search_corpus])
corp = "".join(search_corpus)
if len(corp) == 0:
self._split_locs.append((index, 0))
return index, 0, 1
ind, score = self._find_match(corp, string_)
indices, breakpoints = get_glob_index(lengths, ind, True)
global_ind, char_ind = int(start_ind + indices), int(ind - breakpoints[indices])
self._split_locs.append((global_ind, char_ind))
if reverse:
char_ind = len(paragraphs_no_space[global_ind]) - char_ind
global_ind = len(paragraphs) - global_ind - 1
add_space_ind = self.count_special_chars(self.paragraphs[global_ind], char_ind)
return global_ind, char_ind + add_space_ind, score
def _find_match(
self, corp: str, key: str, get_start: bool = True
) -> Tuple[int, float]:
block, score = self._fuzzy(corp, key)
index = max(0, block[0])
if not get_start:
index += block[2]
return index, score
@staticmethod
def _fuzzy(
corpus: str, string: str, max_error_rate: float = 0.025
) -> Tuple[Tuple[int, int, int], float]:
max_dist = min(len(string) - 1, int(len(string) * min(0.9, max_error_rate)) + 5)
matches = find_near_matches(string, corpus, max_l_dist=max_dist)
if len(matches) > 0 and max_dist > 0:
match = min(matches, key=lambda x: x.dist)
block = (match.start, 0, match.end - match.start)
score = 1 - match.dist / max_dist
return block, score
return (0, 0, 0), 0
@staticmethod
def fuzzysearch(
corpus: str, string: str, max_error_rate: float = 0.025
) -> Tuple[Tuple[int, int, int], float]:
corpus_ = Splitter.remove_special_chars(corpus)
string_ = Splitter.remove_special_chars(string)
(start, _, dist), score = Splitter._fuzzy(
corpus_, string_, max_error_rate=max_error_rate
)
end = Splitter.count_special_chars(corpus, start + dist) + start + dist
start = start + Splitter.count_special_chars(corpus, start)
return (start, _, end - start), score
def evaluate_split(self, page_num: int, page_content: str) -> float:
if page_num > len(self._split_locs) or page_num < 1:
return 0
page_content = self.remove_special_chars(page_content)
if page_num == len(self._split_locs):
start, end = self._split_locs[-1], (-1, -1)
else:
start, end = self._split_locs[page_num - 1], self._split_locs[page_num]
if (end[0] + 1) - start[0] < 0:
return 0
doc_content = self.paragraphs_no_space[start[0] : (end[0] + 1) or None]
if (
len(doc_content) < 1
or len(doc_content[0]) < start[1]
or len(doc_content[-1]) < end[1]
):
return 0
doc_content[0] = doc_content[0][start[1] :]
doc_content[-1] = doc_content[-1][: end[1]]
doc_content = "".join(doc_content)
match = ratio(page_content, doc_content)
return match
================================================
FILE: nougat/dataset/staircase.py
================================================
"""
Copyright (c) Meta Platforms, Inc. and affiliates.
This source code is licensed under the MIT license found in the
LICENSE file in the root directory of this source tree.
"""
from collections import deque
import operator
import itertools
from typing import Optional, List, Tuple
import numpy as np
import warnings
warnings.filterwarnings("ignore", message="All-NaN slice encountered")
def stair_func(x: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
return np.heaviside(x[:, None] - np.floor(thresholds)[None, :], 0).sum(1)
def compute_gini(labels: np.ndarray) -> float:
N = len(labels)
if N == 0:
return 0
G = N - np.square(np.bincount(labels)).sum() / N
return G
def compute_binary_gini(labels: np.ndarray) -> float:
N = len(labels)
if N == 0:
return 0
G = N - labels.sum() ** 2 / N
return G
def gini_impurity(
thresholds: np.ndarray,
data: np.ndarray,
labels: np.ndarray,
classes: Optional[List[int]] = None,
reduction: Optional[str] = "sum",
padded: bool = True,
) -> float:
"""
Calculate the Gini impurity of a dataset split on a set of thresholds.
Args:
thresholds (np.ndarray): The thresholds to split the data on.
data (np.ndarray): The data to split.
labels (np.ndarray): The labels for the data.
classes (Optional[List[int]]): The classes to consider. If None, all classes are used.
reduction (Optional[str]): The reduction to apply to the impurity. One of "none", "sum", or "mean".
padded (bool): Whether to pad the thresholds with `[-0.5, data.max() + 0.5]`.
Returns:
float: The Gini impurity.
"""
G = []
if not padded:
thresholds = np.insert(
thresholds, [0, len(thresholds)], [-0.5, data.max() + 0.5]
)
if classes is None:
classes = np.arange(len(thresholds) - 1)
else:
classes = np.asarray(classes)
if data.ndim == 1:
data = np.expand_dims(data, 0)
masks = np.logical_and(
data > thresholds[classes, None],
data <= thresholds[classes + 1, None],
)
for i, c in enumerate(classes):
G.append(compute_binary_gini(np.where(labels[masks[i]] == c, 1, 0)))
if reduction is None or reduction == "none":
return G
elif reduction == "sum":
return sum(G)
elif reduction == "mean":
return sum(G) / len(G)
else:
raise NotImplementedError
def step_impurity(
thresholds,
data: np.ndarray,
labels: np.ndarray,
classes: Optional[List[int]] = None,
) -> float:
"""
Calculate the step-wise Gini impurity of a dataset split on a set of thresholds.
Args:
thresholds (np.ndarray): The thresholds to split the data on.
data (np.ndarray): The data to split.
labels (np.ndarray): The labels for the data.
classes (Optional[List[int]]): The classes to consider. If None, all classes are used.
Returns:
float: The step-wise Gini impurity.
"""
G = gini_impurity(thresholds, data, labels, reduction=None, classes=classes)
out = []
for i in range(len(G) - 1):
out.append(G[i] + G[i + 1])
return out
class PaddedArray:
"""
A wrapper class for an array that allows for relative indexing.
Args:
array (np.ndarray): The array to wrap.
range (Optional[Tuple[int, int]]): The range of the array to expose. Defaults to (1, -1).
"""
def __init__(
self, array: np.ndarray, range: Optional[Tuple[int, int]] = (1, -1)
) -> None:
self.array = array
mi, ma = range
assert ma <= 0, "relative assignment only"
self.range = mi, ma
def __len__(self):
return len(self.array) + self.range[1] - self.range[0]
def _process_index(self, index):
if isinstance(index, slice):
index = slice(
(index.start or 0) + self.range[0],
self.range[0] + (len(self) if index.stop is None else index.stop),
index.step,
)
if index.stop > len(self.array):
raise IndexError
else:
index = index + self.range[0]
if index > len(self):
raise IndexError
return index
def __getitem__(self, index):
index = self._process_index(index)
return self.array[index]
def __setitem__(self, index, value):
self.array[self._process_index(index)] = value
def copy(self):
return PaddedArray(self.array.copy(), self.range)
def toarray(self):
return self.array[self.range[0] : self.range[1]]
class Staircase:
"""
A class for learning a staircase decision tree.
Args:
domain: The number of points in the domain.
n_classes: The number of classes.
"""
def __init__(self, domain: int, n_classes: int) -> None:
self.domain = domain
self.classes = n_classes
assert domain > 0
assert n_classes > 0
self.thresholds = self._back_thres = self._forward_thres = np.linspace(
domain / n_classes, domain, n_classes - 1, endpoint=False
)
self.uncertainty = np.zeros_like(self.thresholds)
def statistic_fit(
self,
data: np.ndarray,
labels: np.ndarray,
):
"""
Fit statistical thresholds for anomaly detection.
This method fits statistical thresholds for anomaly detection based on input data and labels.
Args:
data (np.ndarray): The input data.
labels (np.ndarray): The labels corresponding to the data.
Note:
This method modifies the internal state of the object to set statistical thresholds.
"""
onehot = np.eye(self.classes)[labels.reshape(-1)]
onehot.reshape(list(labels.shape) + [self.classes])
k = onehot * data.T.repeat(self.classes, 1)
k[k == 0] = np.nan
med = np.nanmedian(k, 0)
for i in range(len(med)):
if med[i] != med[i]:
med[i] = 0 if i == 0 else med[i - 1]
mad = 5 * np.nan_to_num(
np.nanmedian(np.absolute(k - np.nanmedian(k, 0)), 0),
nan=self.domain / self.classes / 2,
)
arr = np.vstack(((med - mad)[:-1], (med + mad)[1:]))
self._forward_thres[:] = arr.max(0)
self._back_thres[:] = arr.min(0)
self._stat_forward = self._forward_thres.copy()
self._stat_back = self._back_thres.copy()
def fit(
self,
data: np.ndarray,
labels: np.ndarray,
early_stop_after: int = 10,
fixed: bool = True,
) -> None:
"""
Fit statistical thresholds for anomaly detection.
This method fits statistical thresholds for anomaly detection based on input data and labels.
Args:
data (np.ndarray): The input data.
labels (np.ndarray): The labels corresponding to the data.
early_stop_after (int, optional): The number of consecutive early stops to consider. Default is 10.
fixed (bool, optional): Whether to use fixed thresholds. Default is True.
Note:
This method modifies the internal state of the object to set statistical thresholds.
"""
assert data.ndim == 1
assert labels.ndim <= 2
if self.classes == 1:
self.thresholds = np.array([0.5 + data.max()])
self.uncertainty = np.zeros_like(self.thresholds)
if data.ndim == 1:
data = np.expand_dims(data, 0)
thresholds = PaddedArray(
np.insert(
np.arange(self.domain - self.classes + 1, self.domain) - 1,
[0, self.classes - 1],
[-0.5, self.domain + 0.5],
).astype(int)
)
self._back_thres = thresholds.copy()
self._forward_thres = thresholds.copy()
self.statistic_fit(data, labels)
last = -0.5
for n in range(self.classes):
G = np.inf
Gis = deque([], early_stop_after)
# forward pass
if n < self.classes - 1:
new_forward_n: float = self._forward_thres[n]
for i in range(
max(0, self._back_thres[n - 1]) if n - 1 >= 0 else int(last),
min(self.domain, self._forward_thres[n + 1])
if n + 2 < self.classes
else self.domain - 1,
):
thresholds.array[n + 1] = i + 0.5
Gi = step_impurity(
thresholds.array, data, labels, classes=[n, n + 1]
)[0]
Gis.append(Gi)
if Gi <= G:
last = i + 0.5
new_forward_n = last
G = Gi
elif (
(not fixed or i - last > self.domain / self.classes)
and len(Gis) == early_stop_after
and all(
itertools.starmap(
operator.ge,
zip(Gis, itertools.islice(Gis, 1, early_stop_after)),
)
)
):
break
thresholds.array[n + 1] = new_forward_n
self._forward_thres.array[n + 1] = new_forward_n
self._back_thres.array[n + 1] = new_forward_n
G = np.inf
self._forward_thres = self._forward_thres.toarray().clip(
min=0, max=self.domain - 1
)
self._back_thres = self._back_thres.toarray().clip(min=0, max=self.domain - 1)
self.thresholds = (self._forward_thres + self._back_thres) / 2
self.uncertainty = np.abs(self._forward_thres - self._back_thres) / 2
@property
def score(self):
try:
return gini_impurity(self.thresholds, self._data, self._labels) / len(
self._data
)
except AttributeError:
return np.inf
def predict(self, x: np.ndarray) -> np.ndarray:
return stair_func(x, self.get_boundaries())
def __call__(self, *args):
return self.predict(*args)
def get_boundaries(self) -> np.ndarray:
return self.thresholds.astype(int).clip(min=0, max=self.domain - 1) + 0.5
================================================
FILE: nougat/dataset/tokenizer.json
================================================
{
"version": "1.0",
"truncation": {
"direction": "Right",
"max_length": 4096,
"strategy": "LongestFirst",
"stride": 0
},
"padding": {
"strategy": {
"Fixed": 4096
},
"direction": "Right",
"pad_to_multiple_of": null,
"pad_id": 1,
"pad_type_id": 0,
"pad_token": "<pad>"
},
"added_tokens": [
{
"id": 0,
"content": "<s>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 1,
"content": "<pad>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 2,
"content": "</s>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 3,
"content": "<unk>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 4,
"content": "[START_REF]",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 5,
"content": "[END_REF]",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 6,
"content": "[IMAGE]",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 7,
"content": "<fragments>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 8,
"content": "</fragments>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 9,
"content": "<work>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 10,
"content": "</work>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 11,
"content": "[START_SUP]",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 12,
"content": "[END_SUP]",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 13,
"content": "[START_SUB]",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 14,
"content": "[END_SUB]",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 15,
"content": "[START_DNA]",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 16,
"content": "[END_DNA]",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 17,
"content": "[START_AMINO]",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 18,
"content": "[END_AMINO]",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 19,
"content": "[START_SMILES]",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 20,
"content": "[END_SMILES]",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 21,
"content": "[START_I_SMILES]",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 22,
"content": "[END_I_SMILES]",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
}
],
"normalizer": {
"type": "NFKC"
},
"pre_tokenizer": {
"type": "Sequence",
"pretokenizers": [
{
"type": "Split",
"pattern": {
"String": "SPL1T-TH1S-Pl3A5E"
},
"behavior": "Removed",
"invert": false
},
{
"type": "Digits",
"individual_digits": true
},
{
"type": "Split",
"pattern": {
"Regex": "[\\(\\)\\[\\]\\{\\}]|([!\"\\#\\$%\\&'\\*\\+,\\-\\./:;<=>\\?\\\\\\^_`\\|\\~])\\1*"
},
"behavior": "Isolated",
"invert": false
},
{
"type": "Split",
"pattern": {
"String": "\n"
},
"behavior": "Isolated",
"invert": false
},
{
"type": "ByteLevel",
"add_prefix_space": false,
"trim_offsets": true,
"use_regex": true
}
]
},
"post_processor": {
"type": "TemplateProcessing",
"single": [
{
"SpecialToken": {
"id": "<s>",
"type_id": 0
}
},
{
"Sequence": {
"id": "A",
"type_id": 0
}
},
{
"SpecialToken": {
"id": "</s>",
"type_id": 0
}
}
],
"pair": [
{
"Sequence": {
"id": "A",
"type_id": 0
}
},
{
"Sequence": {
"id": "B",
"type_id": 1
}
}
],
"special_tokens": {
"</s>": {
"id": "</s>",
"ids": [
2
],
"tokens": [
"</s>"
]
},
"<s>": {
"id": "<s>",
"ids": [
0
],
"tokens": [
"<s>"
]
}
}
},
"decoder": {
"type": "ByteLevel",
"add_prefix_space": true,
"trim_offsets": true,
"use_regex": true
},
"model": {
"type": "BPE",
"dropout": null,
"unk_token": null,
"continuing_subword_prefix": null,
"end_of_word_suffix": null,
"fuse_unk": false,
"vocab": {
"<s>": 0,
"<pad>": 1,
"</s>": 2,
"<unk>": 3,
"[START_REF]": 4,
"[END_REF]": 5,
"[IMAGE]": 6,
"<fragments>": 7,
"</fragments>": 8,
"<work>": 9,
"</work>": 10,
"[START_SUP]": 11,
"[END_SUP]": 12,
"[START_SUB]": 13,
"[END_SUB]": 14,
"[START_DNA]": 15,
"[END_DNA]": 16,
"[START_AMINO]": 17,
"[END_AMINO]": 18,
"[START_SMILES]": 19,
"[END_SMILES]": 20,
"[START_I_SMILES]": 21,
"[END_I_SMILES]": 22,
"!": 23,
"\"": 24,
"#": 25,
"$": 26,
"%": 27,
"&": 28,
"'": 29,
"(": 30,
")": 31,
"*": 32,
"+": 33,
",": 34,
"-": 35,
".": 36,
"/": 37,
"0": 38,
"1": 39,
"2": 40,
"3": 41,
"4": 42,
"5": 43,
"6": 44,
"7": 45,
"8": 46,
"9": 47,
":": 48,
";": 49,
"<": 50,
"=": 51,
">": 52,
"?": 53,
"@": 54,
"A": 55,
"B": 56,
"C": 57,
"D": 58,
"E": 59,
"F": 60,
"G": 61,
"H": 62,
"I": 63,
"J": 64,
"K": 65,
"L": 66,
"M": 67,
"N": 68,
"O": 69,
"P": 70,
"Q": 71,
"R": 72,
"S": 73,
"T": 74,
"U": 75,
"V": 76,
"W": 77,
"X": 78,
"Y": 79,
"Z": 80,
"[": 81,
"\\": 82,
"]": 83,
"^": 84,
"_": 85,
"`": 86,
"a": 87,
"b": 88,
"c": 89,
"d": 90,
"e": 91,
"f": 92,
"g": 93,
"h": 94,
"i": 95,
"j": 96,
"k": 97,
"l": 98,
"m": 99,
"n": 100,
"o": 101,
"p": 102,
"q": 103,
"r": 104,
"s": 105,
"t": 106,
"u": 107,
"v": 108,
"w": 109,
"x": 110,
"y": 111,
"z": 112,
"{": 113,
"|": 114,
"}": 115,
"~": 116,
"¡": 117,
"¢": 118,
"£": 119,
"¤": 120,
"¥": 121,
"¦": 122,
"§": 123,
"¨": 124,
"©": 125,
"ª": 126,
"«": 127,
"¬": 128,
"®": 129,
"¯": 130,
"°": 131,
"±": 132,
"²": 133,
"³": 134,
"´": 135,
"µ": 136,
"¶": 137,
"·": 138,
"¸": 139,
"¹": 140,
"º": 141,
"»": 142,
"¼": 143,
"½": 144,
"¾": 145,
"¿": 146,
"À": 147,
"Á": 148,
"Â": 149,
"Ã": 150,
"Ä": 151,
"Å": 152,
"Æ": 153,
"Ç": 154,
"È": 155,
"É": 156,
"Ê": 157,
"Ë": 158,
"Ì": 159,
"Í": 160,
"Î": 161,
"Ï": 162,
"Ð": 163,
"Ñ": 164,
"Ò": 165,
"Ó": 166,
"Ô": 167,
"Õ": 168,
"Ö": 169,
"×": 170,
"Ø": 171,
"Ù": 172,
"Ú": 173,
"Û": 174,
"Ü": 175,
"Ý": 176,
"Þ": 177,
"ß": 178,
"à": 179,
"á": 180,
"â": 181,
"ã": 182,
"ä": 183,
"å": 184,
"æ": 185,
"ç": 186,
"è": 187,
"é": 188,
"ê": 189,
"ë": 190,
"ì": 191,
"í": 192,
"î": 193,
"ï": 194,
"ð": 195,
"ñ": 196,
"ò": 197,
"ó": 198,
"ô": 199,
"õ": 200,
"ö": 201,
"÷": 202,
"ø": 203,
"ù": 204,
"ú": 205,
"û": 206,
"ü": 207,
"ý": 208,
"þ": 209,
"ÿ": 210,
"Ā": 211,
"ā": 212,
"Ă": 213,
"ă": 214,
"Ą": 215,
"ą": 216,
"Ć": 217,
"ć": 218,
"Ĉ": 219,
"ĉ": 220,
"Ċ": 221,
"ċ": 222,
"Č": 223,
"č": 224,
"Ď": 225,
"ď": 226,
"Đ": 227,
"đ": 228,
"Ē": 229,
"ē": 230,
"Ĕ": 231,
"ĕ": 232,
"Ė": 233,
"ė": 234,
"Ę": 235,
"ę": 236,
"Ě": 237,
"ě": 238,
"Ĝ": 239,
"ĝ": 240,
"Ğ": 241,
"ğ": 242,
"Ġ": 243,
"ġ": 244,
"Ģ": 245,
"ģ": 246,
"Ĥ": 247,
"ĥ": 248,
"Ħ": 249,
"ħ": 250,
"Ĩ": 251,
"ĩ": 252,
"Ī": 253,
"ī": 254,
"Ĭ": 255,
"ĭ": 256,
"Į": 257,
"į": 258,
"İ": 259,
"ı": 260,
"IJ": 261,
"ij": 262,
"Ĵ": 263,
"ĵ": 264,
"Ķ": 265,
"ķ": 266,
"ĸ": 267,
"Ĺ": 268,
"ĺ": 269,
"Ļ": 270,
"ļ": 271,
"Ľ": 272,
"ľ": 273,
"Ŀ": 274,
"ŀ": 275,
"Ł": 276,
"ł": 277,
"Ń": 278,
"Ġt": 279,
"in": 280,
"Ġa": 281,
"he": 282,
"on": 283,
"re": 284,
"at": 285,
"Ġthe": 286,
"er": 287,
"Ġs": 288,
"Ġo": 289,
"en": 290,
"al": 291,
"Ġc": 292,
"ti": 293,
"or": 294,
"ed": 295,
"es": 296,
"is": 297,
"Ġp": 298,
"Ġof": 299,
"nd": 300,
"Ġin": 301,
"Ġf": 302,
"Ġw": 303,
"ĠĠ": 304,
"it": 305,
"an": 306,
"ro": 307,
"ar": 308,
"Ġd": 309,
"Ġm": 310,
"Ġb": 311,
"Ġand": 312,
"ic": 313,
"le": 314,
"ing": 315,
"ion": 316,
"as": 317,
"Ġe": 318,
"Ġre": 319,
"ation": 320,
"Ġto": 321,
"el": 322,
"ent": 323,
"ac": 324,
"et": 325,
"ec": 326,
"tion": 327,
"om": 328,
"st": 329,
"ĠT": 330,
"Ġn": 331,
"Ġth": 332,
"ol": 333,
"ul": 334,
"im": 335,
"RE": 336,
"ig": 337,
"us": 338,
"REF": 339,
"Ġl": 340,
"Ġh": 341,
"ur": 342,
"Ġis": 343,
"ĠĠĠĠ": 344,
"Ġfor": 345,
"id": 346,
"am": 347,
"ĠS": 348,
"ve": 349,
"il": 350,
"ĠA": 351,
"ĠC": 352,
"Ġg": 353,
"ot": 354,
"ith": 355,
"ly": 356,
"ce": 357,
"Ġcon": 358,
"ow": 359,
"Ġst": 360,
"ut": 361,
"os": 362,
"Ġwith": 363,
"od": 364,
"ra": 365,
"Ġv": 366,
"Ġpro": 367,
"um": 368,
"ĠI": 369,
"if": 370,
"uc": 371,
"ter": 372,
"un": 373,
"AR": 374,
"ST": 375,
"res": 376,
"Ġon": 377,
"EN": 378,
"ere": 379,
"ĠP": 380,
"ĠThe": 381,
"ĠM": 382,
"Ġas": 383,
"ART": 384,
"Ġan": 385,
"END": 386,
"START": 387,
"Ġthat": 388,
"qu": 389,
"em": 390,
"Ġbe": 391,
"Ġex": 392,
"ri": 393,
"ab": 394,
"ity": 395,
"tic": 396,
"ver": 397,
"Ġal": 398,
"pl": 399,
"ts": 400,
"ĠF": 401,
"Ġâ": 402,
"ure": 403,
"Ġby": 404,
"ate": 405,
"ag": 406,
"ir": 407,
"oc": 408,
"per": 409,
"ĠB": 410,
"ay": 411,
"ĠD": 412,
"Ġcom": 413,
"ĠH": 414,
"ated": 415,
"ĠR": 416,
"Ġare": 417,
"rom": 418,
"ĠE": 419,
"op": 420,
"ad": 421,
"se": 422,
"ĠL": 423,
"igh": 424,
"ĠN": 425,
"ment": 426,
"her": 427,
"og": 428,
"ain": 429,
"ect": 430,
"ud": 431,
"Ġde": 432,
"Ġr": 433,
"Ġat": 434,
"Ġwas": 435,
"Ġus": 436,
"Ġres": 437,
"ell": 438,
"iz": 439,
"ine": 440,
"ph": 441,
"Ġac": 442,
"ess": 443,
"ore": 444,
"ical": 445,
"th": 446,
"und": 447,
"rac": 448,
"Ġwe": 449,
"ath": 450,
"ĠG": 451,
"Ġfrom": 452,
"ati": 453,
"up": 454,
"ist": 455,
"ant": 456,
"Ġor": 457,
"ff": 458,
"Ġcomp": 459,
"Ġwh": 460,
"ĠW": 461,
"ch": 462,
"ers": 463,
"Ġsp": 464,
"orm": 465,
"Ġch": 466,
"ations": 467,
"ran": 468,
"ub": 469,
"te": 470,
"di": 471,
"Ġsh": 472,
"ge": 473,
"ase": 474,
"Ġwere": 475,
"ĠĠĠĠĠĠĠĠ": 476,
"ĠÎ": 477,
"ap": 478,
"ĠIn": 479,
"and": 480,
"Ġse": 481,
"vel": 482,
"Ġim": 483,
"ĠâĪ": 484,
"ens": 485,
"ies": 486,
"ich": 487,
"ight": 488,
"duc": 489,
"ĠO": 490,
"Ġit": 491,
"tions": 492,
"end": 493,
"Ġco": 494,
"Ġthis": 495,
"Ġcan": 496,
"Ġk": 497,
"âĢ": 498,
"lec": 499,
"ted": 500,
"Ġmod": 501,
"math": 502,
"Ġcont": 503,
"Ġne": 504,
"Ġpar": 505,
"ib": 506,
"ĠĠĠ": 507,
"Ġle": 508,
"iv": 509,
"ug": 510,
"ence": 511,
"ign": 512,
"ous": 513,
"ents": 514,
"ys": 515,
"ave": 516,
"red": 517,
"ress": 518,
"able": 519,
"por": 520,
"all": 521,
"iff": 522,
"est": 523,
"Ġap": 524,
"Ġinc": 525,
"nt": 526,
"ary": 527,
"iti": 528,
"Ġwhich": 529,
"Ġnot": 530,
"form": 531,
"Ġsy": 532,
"Ġad": 533,
"low": 534,
"ak": 535,
"Ġper": 536,
"Ġhe": 537,
"pro": 538,
"ance": 539,
"ial": 540,
"ue": 541,
"Ġen": 542,
"Ġcl": 543,
"ass": 544,
"ip": 545,
"rans": 546,
"Ġob": 547,
"Ġgen": 548,
"tim": 549,
"Ġdis": 550,
"unc": 551,
"Ġint": 552,
"ep": 553,
"etw": 554,
"Ġdiff": 555,
"ach": 556,
"ther": 557,
"ime": 558,
"age": 559,
"ple": 560,
"ill": 561,
"yp": 562,
"ĠK": 563,
"act": 564,
"ari": 565,
"Ġmet": 566,
"ors": 567,
"Ġhave": 568,
"Ġstud": 569,
"ong": 570,
"ĠU": 571,
"Ġpl": 572,
"ide": 573,
"ma": 574,
"hen": 575,
"ific": 576,
"ome": 577,
"Ġi": 578,
"ular": 579,
"ĠV": 580,
"ally": 581,
"Ġshow": 582,
"rib": 583,
"ia": 584,
"enti": 585,
"Ġass": 586,
"ond": 587,
"ft": 588,
"Ġab": 589,
"Ġinter": 590,
"ĠTh": 591,
"The": 592,
"str": 593,
"Ġcell": 594,
"cal": 595,
"Ġmodel": 596,
"ata": 597,
"ast": 598,
"Ġeff": 599,
"Ġtrans": 600,
"ates": 601,
"ased": 602,
"ost": 603,
"vi": 604,
"ang": 605,
"our": 606,
"Ġme": 607,
"ard": 608,
"Ġdiffere": 609,
"Ġpre": 610,
"Ġdi": 611,
"ĠâĪĴ": 612,
"olog": 613,
"ution": 614,
"ound": 615,
"ace": 616,
"Ġresul": 617,
"erm": 618,
"pos": 619,
"here": 620,
"tive": 621,
"ord": 622,
"so": 623,
"stem": 624,
"yl": 625,
"Ġph": 626,
"Ġy": 627,
"ame": 628,
"ork": 629,
"ative": 630,
"Ġqu": 631,
"ric": 632,
"SU": 633,
"wo": 634,
"Ġun": 635,
"Ġev": 636,
"are": 637,
"##": 638,
"de": 639,
"een": 640,
"tiv": 641,
"Ġgro": 642,
"ory": 643,
"Ġcons": 644,
"Ġsub": 645,
"ta": 646,
"--": 647,
"Ġstr": 648,
"ber": 649,
"erv": 650,
"etween": 651,
"enc": 652,
"Ġanal": 653,
"int": 654,
"Ġhas": 655,
"uch": 656,
"Ġreg": 657,
"Ġbetween": 658,
"Ġdet": 659,
"Ġall": 660,
"cess": 661,
"Ġexp": 662,
"ection": 663,
"ĠâĢ": 664,
"ind": 665,
"ater": 666,
"Ġsign": 667,
"pt": 668,
"ugh": 669,
"ite": 670,
"ility": 671,
"Ġusing": 672,
"Ġval": 673,
"Ġro": 674,
"ree": 675,
"Ġrel": 676,
"out": 677,
"Ġfunc": 678,
"ition": 679,
"Ġcor": 680,
"Ġalso": 681,
"Ġtwo": 682,
"ne": 683,
"ĠJ": 684,
"Ġsystem": 685,
"cl": 686,
"uct": 687,
"Ġsim": 688,
"tain": 689,
"ust": 690,
"ied": 691,
"port": 692,
"Ġrec": 693,
"Ġresp": 694,
"Ġdata": 695,
"rm": 696,
"resent": 697,
"uld": 698,
"xt": 699,
"Ġj": 700,
"ry": 701,
"ack": 702,
"Ġra": 703,
"par": 704,
"Ġform": 705,
"Ġsc": 706,
"frac": 707,
"ĠWe": 708,
"ating": 709,
"ech": 710,
"hod": 711,
"Ġfol": 712,
"ined": 713,
"ĠSt": 714,
"ual": 715,
"Ġused": 716,
"Ġone": 717,
"Ġdes": 718,
"ĠÏ": 719,
"Ġvari": 720,
"Ġdist": 721,
"Ġnum": 722,
"ym": 723,
"ew": 724,
"rec": 725,
"ob": 726,
"Ġinf": 727,
"Ġar": 728,
"lect": 729,
"ll": 730,
"ons": 731,
"ĠThis": 732,
"ose": 733,
"ile": 734,
"play": 735,
"ear": 736,
"ox": 737,
"ures": 738,
"one": 739,
"Ġstudy": 740,
"ysis": 741,
"Ġfollow": 742,
"yle": 743,
"ract": 744,
"dis": 745,
"Ġpos": 746,
"right": 747,
"Ġthan": 748,
"ros": 749,
"av": 750,
"Fig": 751,
"Ġtime": 752,
"ization": 753,
"ulation": 754,
"ized": 755,
"Ġsur": 756,
"oth": 757,
"Ġout": 758,
"Ġcol": 759,
"ature": 760,
"ive": 761,
"Ġsol": 762,
"Ġx": 763,
"eld": 764,
"Ġother": 765,
"plic": 766,
"Ġdef": 767,
"erg": 768,
"Ġgener": 769,
"ely": 770,
"Ġbeen": 771,
"Ġincre": 772,
"Ġthese": 773,
"Ġno": 774,
"ax": 775,
"style": 776,
"arg": 777,
"ian": 778,
"Ġind": 779,
"Ġsuch": 780,
"Ġfunction": 781,
"ting": 782,
"Ġequ": 783,
"aus": 784,
"Ġund": 785,
"mathb": 786,
"tical": 787,
"Ġhigh": 788,
"rain": 789,
"Ġam": 790,
"ield": 791,
"oun": 792,
"ression": 793,
"Ġspec": 794,
"Ġop": 795,
"Ġdec": 796,
"Ġover": 797,
"Ġmethod": 798,
"Ġset": 799,
"âĪ": 800,
"Ġif": 801,
"dition": 802,
"ues": 803,
"ects": 804,
"display": 805,
"hem": 806,
"Ġpati": 807,
"Ġresults": 808,
"old": 809,
"anc": 810,
"displaystyle": 811,
"Ġeach": 812,
"Ġmore": 813,
"les": 814,
"pr": 815,
"acter": 816,
"Ġtheir": 817,
"Ġacc": 818,
"Ġappro": 819,
"iss": 820,
"ize": 821,
"Ġinv": 822,
"ases": 823,
"Ġcells": 824,
"irst": 825,
"lu": 826,
"ail": 827,
"Ġmeas": 828,
"Ġlow": 829,
"ov": 830,
"the": 831,
"ik": 832,
"**": 833,
"ef": 834,
"Ġbut": 835,
"hes": 836,
"fter": 837,
"Ġdifferent": 838,
"vely": 839,
"Ġext": 840,
"Ġthere": 841,
"oci": 842,
"Ġprob": 843,
"Ġits": 844,
"ron": 845,
"ments": 846,
"Ġag": 847,
"NA": 848,
"Ġpo": 849,
"ice": 850,
"ype": 851,
"Ġgroup": 852,
"âĢĵ": 853,
"ever": 854,
"ult": 855,
"ism": 856,
"tern": 857,
"ability": 858,
"ions": 859,
"ark": 860,
"Ġnon": 861,
"to": 862,
"ĠĠĠĠĠĠĠ": 863,
"Ġobs": 864,
"Ġtre": 865,
"als": 866,
"left": 867,
"ĠPro": 868,
"Ġonly": 869,
"Ġman": 870,
"der": 871,
"Ġpol": 872,
"uring": 873,
"amet": 874,
"rol": 875,
"In": 876,
"yn": 877,
"Ġunder": 878,
"ĠCh": 879,
"Ġwhere": 880,
"ood": 881,
"ĠX": 882,
"nce": 883,
"Ġpartic": 884,
"ected": 885,
"ĠFig": 886,
"Ġem": 887,
"Ġfact": 888,
"ĠAn": 889,
"Ġperform": 890,
"Ġso": 891,
"Ġanalysis": 892,
"stract": 893,
"hed": 894,
"Ġmay": 895,
"atic": 896,
"Ġrep": 897,
"tein": 898,
"duced": 899,
"Ġup": 900,
"Ġinto": 901,
"Ġnumber": 902,
"Ġour": 903,
"Ġet": 904,
"eg": 905,
"itle": 906,
"over": 907,
"ix": 908,
"ator": 909,
"ulti": 910,
"Ġincl": 911,
"ould": 912,
"ici": 913,
"bstract": 914,
"Ġcomple": 915,
"Ġpatients": 916,
"Ġdo": 917,
"Ġexper": 918,
"vid": 919,
"ange": 920,
"Ġlevel": 921,
"Ġprocess": 922,
gitextract_rzpn8tp_/ ├── .gitignore ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md ├── LICENSE ├── LICENSE-MODEL.md ├── MANIFEST.in ├── NOTICE ├── README.md ├── app.py ├── config/ │ └── train_nougat.yaml ├── docker/ │ ├── Dockerfile │ └── README.md ├── lightning_module.py ├── nougat/ │ ├── __init__.py │ ├── _version.py │ ├── dataset/ │ │ ├── __init__.py │ │ ├── create_index.py │ │ ├── gen_seek.py │ │ ├── parser/ │ │ │ ├── __init__.py │ │ │ ├── document.py │ │ │ ├── html2md.py │ │ │ ├── latexml_parser.py │ │ │ └── markdown.py │ │ ├── pdffigures.py │ │ ├── rasterize.py │ │ ├── split_htmls_to_pages.py │ │ ├── split_md_to_pages.py │ │ ├── splitter.py │ │ ├── staircase.py │ │ ├── tokenizer.json │ │ └── utils/ │ │ ├── __init__.py │ │ ├── latex_conversion.py │ │ ├── pdf_text_extract.py │ │ └── utils.py │ ├── metrics.py │ ├── model.py │ ├── postprocessing.py │ ├── transforms.py │ └── utils/ │ ├── __init__.py │ ├── checkpoint.py │ ├── dataset.py │ └── device.py ├── predict.py ├── setup.cfg ├── setup.py ├── test.py └── train.py
SYMBOL INDEX (289 symbols across 28 files)
FILE: app.py
function load_model (line 50) | async def load_model(
function root (line 63) | def root():
function predict (line 73) | async def predict(
function main (line 166) | def main():
FILE: lightning_module.py
class NougatModelPLModule (line 23) | class NougatModelPLModule(pl.LightningModule):
method __init__ (line 24) | def __init__(self, config):
method training_step (line 60) | def training_step(self, batch, batch_idx):
method validation_step (line 78) | def validation_step(self, batch, batch_idx, dataset_idx=0):
method on_validation_epoch_end (line 102) | def on_validation_epoch_end(self):
method configure_optimizers (line 110) | def configure_optimizers(self):
method cosine_scheduler (line 159) | def cosine_scheduler(optimizer, training_steps, warmup_steps):
method exponential_scheduler (line 170) | def exponential_scheduler(optimizer, warmup_steps, lr, min_lr=5e-5, ga...
method get_progress_bar_dict (line 182) | def get_progress_bar_dict(self):
method on_save_checkpoint (line 190) | def on_save_checkpoint(self, checkpoint):
class NougatDataPLModule (line 200) | class NougatDataPLModule(pl.LightningDataModule):
method __init__ (line 201) | def __init__(self, config):
method train_dataloader (line 211) | def train_dataloader(self):
method val_dataloader (line 226) | def val_dataloader(self):
method seed_worker (line 239) | def seed_worker(wordker_id):
method ignore_none_collate (line 245) | def ignore_none_collate(batch):
FILE: nougat/dataset/create_index.py
function convert_pt2px (line 30) | def convert_pt2px(pt, dpi=96):
function read_metadata (line 39) | def read_metadata(data: Dict) -> List[List[Dict]]:
function index_paper (line 58) | def index_paper(directory: Path, args: argparse.Namespace):
function create_index (line 102) | def create_index(args):
FILE: nougat/dataset/gen_seek.py
function get_args (line 13) | def get_args():
FILE: nougat/dataset/parser/document.py
class Element (line 35) | class Element(Generic[EL]):
method plaintext (line 48) | def plaintext(self):
method append (line 51) | def append(self, child: EL) -> EL:
method find_parent (line 56) | def find_parent(self, class_or_tuple: Type[T]) -> T:
class UnknownElement (line 66) | class UnknownElement(Element):
class TextElement (line 71) | class TextElement(Element):
method plaintext (line 75) | def plaintext(self):
method append (line 78) | def append(self, child: "Element"):
class Math (line 83) | class Math(Element):
class PlaintextMath (line 88) | class PlaintextMath(Math):
class LatexMath (line 93) | class LatexMath(Math):
method plaintext (line 98) | def plaintext(self):
class Author (line 103) | class Author:
class Link (line 110) | class Link(Element):
class InlineRef (line 115) | class InlineRef(Element):
method as_dict (line 118) | def as_dict(self):
class Reference (line 125) | class Reference:
method as_dict (line 150) | def as_dict(self):
class SpanElement (line 163) | class SpanElement(Element):
class Italic (line 168) | class Italic(SpanElement):
class Bold (line 173) | class Bold(SpanElement):
class Superscript (line 178) | class Superscript(SpanElement):
class Subscript (line 183) | class Subscript(SpanElement):
class Paragraph (line 188) | class Paragraph(Element):
class TableRow (line 193) | class TableRow(Element):
method add_cell (line 196) | def add_cell(self, cell: Element):
method plaintext (line 202) | def plaintext(self):
method add_cell (line 535) | def add_cell(self, cell: TableCell):
method __iter__ (line 540) | def __iter__(self):
method __len__ (line 543) | def __len__(self) -> int:
method __bool__ (line 546) | def __bool__(self) -> bool:
method cum_cell_widths (line 550) | def cum_cell_widths(self) -> List[int]:
method cell_widths (line 554) | def cell_widths(self) -> List[int]:
method width (line 558) | def width(self) -> int:
method _hline (line 561) | def _hline(self, orientation: str) -> str:
method hline_above (line 592) | def hline_above(self) -> str:
method hline_below (line 596) | def hline_below(self) -> str:
method plaintext (line 600) | def plaintext(self) -> str:
class TableHead (line 207) | class TableHead(TableRow):
class Table (line 212) | class Table(Element):
method add_row (line 219) | def add_row(self, row: TableRow) -> TableRow:
method plaintext (line 225) | def plaintext(self):
class Equation (line 230) | class Equation(Element):
class EquationList (line 235) | class EquationList(Element):
method add_equation (line 238) | def add_equation(self, eqn: Equation) -> Equation:
method plaintext (line 244) | def plaintext(self):
class Algorithm (line 249) | class Algorithm(Element):
method add_line (line 254) | def add_line(self, line: Element) -> Element:
method plaintext (line 260) | def plaintext(self):
class Definition (line 265) | class Definition(Element):
method plaintext (line 270) | def plaintext(self):
class DefinitionList (line 280) | class DefinitionList(Element):
method add_item (line 295) | def add_item(self, item: Definition) -> Definition:
method plaintext (line 301) | def plaintext(self):
class Figure (line 310) | class Figure(Element):
class Section (line 317) | class Section(Element):
class SectionHeader (line 325) | class SectionHeader(Element):
class ListItem (line 332) | class ListItem(Element):
class ListContainer (line 337) | class ListContainer(Element):
method add_item (line 342) | def add_item(self, item: ListItem) -> ListItem:
method plaintext (line 348) | def plaintext(self):
class Footnote (line 353) | class Footnote(Element):
class Document (line 358) | class Document(Element, Reference):
method add_reference (line 366) | def add_reference(self, reference):
method add_inline_ref (line 369) | def add_inline_ref(self, in_ref):
method set_bib (line 372) | def set_bib(self, reference):
class Spec (line 377) | class Spec:
method __hash__ (line 405) | def __hash__(self) -> int:
method __eq__ (line 408) | def __eq__(self, __o: object) -> bool:
method set_align (line 411) | def set_align(self, classes: List[str], style: Optional[str] = None) -...
method set_border (line 439) | def set_border(self, classes: List[str]) -> None:
method set_attrs (line 446) | def set_attrs(self, attrs: Dict[str, Any]) -> None:
method __str__ (line 454) | def __str__(self) -> str:
class TableCell (line 463) | class TableCell(Element):
method __post_init__ (line 486) | def __post_init__(self, *args, **kwargs) -> None:
method __hash__ (line 491) | def __hash__(self) -> int:
method __eq__ (line 494) | def __eq__(self, __o: object) -> bool:
method set_attrs (line 497) | def set_attrs(self, attrs: Dict[str, Any]) -> None:
method plaintext (line 505) | def plaintext(self):
class TableRow (line 512) | class TableRow(Element):
method add_cell (line 196) | def add_cell(self, cell: Element):
method plaintext (line 202) | def plaintext(self):
method add_cell (line 535) | def add_cell(self, cell: TableCell):
method __iter__ (line 540) | def __iter__(self):
method __len__ (line 543) | def __len__(self) -> int:
method __bool__ (line 546) | def __bool__(self) -> bool:
method cum_cell_widths (line 550) | def cum_cell_widths(self) -> List[int]:
method cell_widths (line 554) | def cell_widths(self) -> List[int]:
method width (line 558) | def width(self) -> int:
method _hline (line 561) | def _hline(self, orientation: str) -> str:
method hline_above (line 592) | def hline_above(self) -> str:
method hline_below (line 596) | def hline_below(self) -> str:
method plaintext (line 600) | def plaintext(self) -> str:
class Tabular (line 605) | class Tabular(Element):
method add_row (line 622) | def add_row(self, row: TableRow) -> TableRow:
method width (line 628) | def width(self) -> int:
method cols (line 635) | def cols(self) -> List[List[TableCell]]:
method _square_table (line 643) | def _square_table(self) -> None:
method get_table_spec (line 660) | def get_table_spec(self) -> str:
method plaintext (line 696) | def plaintext(self):
class Table (line 701) | class Table(Element):
method add_row (line 219) | def add_row(self, row: TableRow) -> TableRow:
method plaintext (line 225) | def plaintext(self):
FILE: nougat/dataset/parser/html2md.py
function check_file_path (line 17) | def check_file_path(paths: List[Path], wdir: Optional[Path] = None) -> L...
FILE: nougat/dataset/parser/latexml_parser.py
function printerr (line 17) | def printerr(*args, **kwargs):
function is_wrapper_element (line 43) | def is_wrapper_element(element: BeautifulSoup) -> bool:
function ignore_element (line 47) | def ignore_element(element: BeautifulSoup) -> bool:
function _get_classes (line 51) | def _get_classes(el: BeautifulSoup) -> Set[str]:
function _detach_selected (line 60) | def _detach_selected(element: BeautifulSoup, selector: str) -> None:
function parse_latexml_authors (line 65) | def parse_latexml_authors(ltx_authors: BeautifulSoup) -> List[Author]:
function parse_latexml_citations (line 71) | def parse_latexml_citations(cite: BeautifulSoup, parent: Element) -> None:
function _clean_html_whitespace (line 89) | def _clean_html_whitespace(text: str) -> str:
function parse_latexml_children (line 98) | def parse_latexml_children(html: BeautifulSoup, parent: Element) -> None:
function parse_latexml_references (line 420) | def parse_latexml_references(html: BeautifulSoup, doc: Document) -> None:
function parse_latexml (line 429) | def parse_latexml(
FILE: nougat/dataset/parser/markdown.py
function remove_trailing_whitespace (line 39) | def remove_trailing_whitespace(parts: List[str]) -> None:
function remove_line_breaks (line 48) | def remove_line_breaks(parts: List[str]):
function leading_trailing_whitespace (line 55) | def leading_trailing_whitespace(
function latex_escape (line 84) | def latex_escape(string: str) -> str:
function is_empty (line 88) | def is_empty(content: List) -> bool:
function format_element (line 98) | def format_element(
function format_iterator (line 330) | def format_iterator(
function format_children (line 359) | def format_children(
function format_document (line 367) | def format_document(
FILE: nougat/dataset/pdffigures.py
function call_pdffigures (line 19) | def call_pdffigures(
FILE: nougat/dataset/rasterize.py
function rasterize_paper (line 18) | def rasterize_paper(
FILE: nougat/dataset/split_htmls_to_pages.py
function process_paper (line 29) | def process_paper(
function process_htmls (line 130) | def process_htmls(args):
FILE: nougat/dataset/split_md_to_pages.py
function ratio (line 37) | def ratio(*args, **kwargs):
class BagOfWords (line 41) | class BagOfWords:
method __init__ (line 51) | def __init__(
method train (line 60) | def train(self):
method __call__ (line 77) | def __call__(
function remove_short_seqs (line 90) | def remove_short_seqs(seqs: List[str], minimum: int = 10) -> List[str]:
function find_figures (line 99) | def find_figures(
function flatten (line 136) | def flatten(l: List) -> List:
function get_doc_text (line 140) | def get_doc_text(
function clean_pdf_text (line 176) | def clean_pdf_text(pages: List[List[str]], num_words: int = 10) -> List[...
function split_markdown (line 239) | def split_markdown(
FILE: nougat/dataset/splitter.py
function ratio (line 18) | def ratio(*args, **kwargs):
function reverse (line 22) | def reverse(lst: List[str]) -> List[str]:
function get_first_last (line 37) | def get_first_last(
function get_glob_index (line 66) | def get_glob_index(
class Splitter (line 84) | class Splitter:
method __init__ (line 87) | def __init__(self, paragraphs: List[str]) -> None:
method remove_special_chars (line 95) | def remove_special_chars(string: str) -> str:
method count_special_chars (line 129) | def count_special_chars(string: str, char_ind: int) -> int:
method split_first_last (line 213) | def split_first_last(
method split (line 280) | def split(
method _find_match (line 315) | def _find_match(
method _fuzzy (line 325) | def _fuzzy(
method fuzzysearch (line 338) | def fuzzysearch(
method evaluate_split (line 350) | def evaluate_split(self, page_num: int, page_content: str) -> float:
FILE: nougat/dataset/staircase.py
function stair_func (line 17) | def stair_func(x: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
function compute_gini (line 21) | def compute_gini(labels: np.ndarray) -> float:
function compute_binary_gini (line 29) | def compute_binary_gini(labels: np.ndarray) -> float:
function gini_impurity (line 37) | def gini_impurity(
function step_impurity (line 87) | def step_impurity(
class PaddedArray (line 112) | class PaddedArray:
method __init__ (line 121) | def __init__(
method __len__ (line 129) | def __len__(self):
method _process_index (line 132) | def _process_index(self, index):
method __getitem__ (line 147) | def __getitem__(self, index):
method __setitem__ (line 151) | def __setitem__(self, index, value):
method copy (line 154) | def copy(self):
method toarray (line 157) | def toarray(self):
class Staircase (line 161) | class Staircase:
method __init__ (line 170) | def __init__(self, domain: int, n_classes: int) -> None:
method statistic_fit (line 180) | def statistic_fit(
method fit (line 216) | def fit(
method score (line 299) | def score(self):
method predict (line 307) | def predict(self, x: np.ndarray) -> np.ndarray:
method __call__ (line 310) | def __call__(self, *args):
method get_boundaries (line 313) | def get_boundaries(self) -> np.ndarray:
FILE: nougat/dataset/utils/latex_conversion.py
function remove_style (line 60) | def remove_style(string: str) -> str:
function replace_duplicate_definitions (line 69) | def replace_duplicate_definitions(string: str) -> str:
function unicode_to_latex (line 76) | def unicode_to_latex(s: str) -> str:
function remove_line_breaks (line 108) | def remove_line_breaks(string: str) -> str:
function normalize_tex (line 113) | def normalize_tex(math: str, inline: bool) -> str:
FILE: nougat/dataset/utils/pdf_text_extract.py
function replace_ligatures (line 18) | def replace_ligatures(text: str) -> str:
function remove_hyphens (line 36) | def remove_hyphens(text: str) -> str:
function dehyphenate (line 59) | def dehyphenate(lines: List[str], line_no: int) -> List[str]:
function get_pages (line 68) | def get_pages(pdf: str) -> List[str]:
function get_paragraphs (line 84) | def get_paragraphs(pdf: str) -> List[List[str]]:
FILE: nougat/dataset/utils/utils.py
function remove_pretty_linebreaks (line 10) | def remove_pretty_linebreaks(string: str) -> str:
FILE: nougat/metrics.py
function compute_metrics (line 27) | def compute_metrics(pred, gt, minlen=4):
function get_parser (line 47) | def get_parser():
function split_text (line 63) | def split_text(pages: List[str]):
function get_metrics (line 86) | def get_metrics(gt: List[str], pred: List[str], pool: bool = True):
FILE: nougat/model.py
class SwinEncoder (line 37) | class SwinEncoder(nn.Module):
method __init__ (line 52) | def __init__(
method forward (line 116) | def forward(self, x: torch.Tensor) -> torch.Tensor:
method crop_margin (line 127) | def crop_margin(img: Image.Image) -> Image.Image:
method to_tensor (line 142) | def to_tensor(self):
method prepare_input (line 148) | def prepare_input(
class BARTDecoder (line 191) | class BARTDecoder(nn.Module):
method __init__ (line 207) | def __init__(
method add_special_tokens (line 271) | def add_special_tokens(self, list_of_tokens: List[str]):
method prepare_inputs_for_inference (line 281) | def prepare_inputs_for_inference(
method forward (line 312) | def forward(
method resize_bart_abs_pos_emb (line 337) | def resize_bart_abs_pos_emb(weight: torch.Tensor, max_length: int) -> ...
class NougatConfig (line 359) | class NougatConfig(PretrainedConfig):
method __init__ (line 385) | def __init__(
class RunningVarTorch (line 418) | class RunningVarTorch:
method __init__ (line 419) | def __init__(self, L=15, norm=False):
method push (line 424) | def push(self, x: torch.Tensor):
method variance (line 433) | def variance(self):
class StoppingCriteriaScores (line 442) | class StoppingCriteriaScores(StoppingCriteria):
method __init__ (line 443) | def __init__(self, threshold: float = 0.015, window_size: int = 200):
method __call__ (line 454) | def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTen...
function batch (line 477) | def batch(l, b=15):
function subdiv (line 484) | def subdiv(l, b=10):
class NougatModel (line 491) | class NougatModel(PreTrainedModel):
method __init__ (line 501) | def __init__(self, config: NougatConfig):
method forward (line 521) | def forward(
method _init_weights (line 544) | def _init_weights(self, *args, **kwargs):
method inference (line 547) | def inference(
method from_pretrained (line 671) | def from_pretrained(
FILE: nougat/postprocessing.py
function ratio (line 18) | def ratio(*args, **kwargs):
function markdown_compatible (line 25) | def markdown_compatible(s: str) -> str:
function find_next_punctuation (line 70) | def find_next_punctuation(s: str, start_inx=0):
function find_last_punctuation (line 86) | def find_last_punctuation(s: str, start_inx=0):
function truncate_repetitions (line 102) | def truncate_repetitions(s: str, min_len=30):
function close_envs (line 168) | def close_envs(s: str) -> str:
function remove_numbers (line 178) | def remove_numbers(lines):
function get_slices (line 190) | def get_slices(lines, clean_lines):
function remove_slice_from_lines (line 233) | def remove_slice_from_lines(lines, clean_text, sli) -> str:
function remove_hallucinated_references (line 301) | def remove_hallucinated_references(text: str) -> str:
function postprocess_single (line 332) | def postprocess_single(generation: str, markdown_fix: bool = True) -> str:
function postprocess (line 487) | def postprocess(
FILE: nougat/transforms.py
function alb_wrapper (line 16) | def alb_wrapper(transform):
class Erosion (line 23) | class Erosion(alb.ImageOnlyTransform):
method __init__ (line 41) | def __init__(self, scale, always_apply=False, p=0.5):
method apply (line 49) | def apply(self, img, **params):
class Dilation (line 57) | class Dilation(alb.ImageOnlyTransform):
method __init__ (line 75) | def __init__(self, scale, always_apply=False, p=0.5):
method apply (line 83) | def apply(self, img, **params):
class Bitmap (line 91) | class Bitmap(alb.ImageOnlyTransform):
method __init__ (line 107) | def __init__(self, value=0, lower=200, always_apply=False, p=0.5):
method apply (line 112) | def apply(self, img, **params):
FILE: nougat/utils/checkpoint.py
function download_as_bytes_with_progress (line 20) | def download_as_bytes_with_progress(url: str, name: str = None) -> bytes:
function download_checkpoint (line 49) | def download_checkpoint(checkpoint: Path, model_tag: str = MODEL_TAG):
function torch_hub (line 74) | def torch_hub(model_tag: Optional[str] = MODEL_TAG) -> Path:
function get_checkpoint (line 85) | def get_checkpoint(
FILE: nougat/utils/dataset.py
class ImageDataset (line 25) | class ImageDataset(torch.utils.data.Dataset):
method __init__ (line 40) | def __init__(self, img_list, prepare: Callable):
method __len__ (line 45) | def __len__(self):
method ignore_none_collate (line 49) | def ignore_none_collate(batch):
method __getitem__ (line 60) | def __getitem__(self, idx):
class LazyDataset (line 68) | class LazyDataset(Dataset):
method __init__ (line 83) | def __init__(self, pdf, prepare: Callable, pages: Optional[List[int]] ...
method __len__ (line 91) | def __len__(self):
method __getitem__ (line 94) | def __getitem__(self, i):
method ignore_none_collate (line 103) | def ignore_none_collate(batch):
class SciPDFDataset (line 125) | class SciPDFDataset(Dataset):
method __init__ (line 144) | def __init__(
method __len__ (line 172) | def __len__(self) -> int:
method __getitem__ (line 175) | def __getitem__(self, index: int) -> Dict:
method __iter__ (line 203) | def __iter__(self):
class NougatDataset (line 208) | class NougatDataset(Dataset):
method __init__ (line 214) | def __init__(
method __len__ (line 234) | def __len__(self) -> int:
method __getitem__ (line 237) | def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
FILE: nougat/utils/device.py
function default_batch_size (line 11) | def default_batch_size():
function move_to_device (line 28) | def move_to_device(model, bf16: bool = True, cuda: bool = True):
FILE: predict.py
function get_args (line 28) | def get_args():
function main (line 125) | def main():
FILE: setup.py
function read_version (line 14) | def read_version():
function read_long_description (line 22) | def read_long_description():
FILE: test.py
function test (line 27) | def test(args):
FILE: train.py
class CustomCheckpointIO (line 42) | class CustomCheckpointIO(CheckpointIO):
method save_checkpoint (line 62) | def save_checkpoint(self, checkpoint, path, storage_options=None):
method load_checkpoint (line 73) | def load_checkpoint(self, path, storage_options=None):
method remove_checkpoint (line 101) | def remove_checkpoint(self, path) -> None:
class GradNormCallback (line 105) | class GradNormCallback(Callback):
method gradient_norm (line 111) | def gradient_norm(model):
method on_after_backward (line 120) | def on_after_backward(self, trainer, model):
function save_config_file (line 125) | def save_config_file(config, path):
function train (line 135) | def train(config):
Condensed preview — 47 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (2,652K chars).
[
{
"path": ".gitignore",
"chars": 1941,
"preview": "core.*\n*.bin\n.nfs*\n.vscode/*\nresult/*\n!result/extract.py\nmisc/*\nwandb/\n!misc/*.png\n!dataset/gen_seek.py\n!result/.gitkeep"
},
{
"path": "CODE_OF_CONDUCT.md",
"chars": 3537,
"preview": "# Code of Conduct\n\n## Our Pledge\n\nIn the interest of fostering an open and welcoming environment, we as\ncontributors and"
},
{
"path": "CONTRIBUTING.md",
"chars": 569,
"preview": "# Contributing to Nougat\n\n## Pull Requests\n\nIn order to accept your pull request, we need you to submit a CLA. You only "
},
{
"path": "LICENSE",
"chars": 1088,
"preview": "MIT License\n\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nPermission is hereby granted, free of charge, to any pe"
},
{
"path": "LICENSE-MODEL.md",
"chars": 13567,
"preview": "# Creative Commons Attribution-NonCommercial 4.0 International Public License\n\nBy exercising the Licensed Rights (define"
},
{
"path": "MANIFEST.in",
"chars": 14,
"preview": "include ./*.*\n"
},
{
"path": "NOTICE",
"chars": 8932,
"preview": "Donut\nCopyright (c) 2022-present NAVER Corp.\n\nPermission is hereby granted, free of charge, to any person obtaining a co"
},
{
"path": "README.md",
"chars": 7995,
"preview": "<div align=\"center\">\n<h1>Nougat: Neural Optical Understanding for Academic Documents</h1>\n\n[ Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "config/train_nougat.yaml",
"chars": 748,
"preview": "resume_from_checkpoint_path: null\nresult_path: \"result\"\nmodel_path: null\ndataset_paths: [\"path/to/train.jsonl\"]\ntokenize"
},
{
"path": "docker/Dockerfile",
"chars": 766,
"preview": "FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04\n# replace CUDA version to your CUDA version.\n# You can check your CUDA "
},
{
"path": "docker/README.md",
"chars": 2263,
"preview": "## Prerequisites\nEnsure you have Docker installed on your machine. \nAnd you must also have NVIDIA CUDA and CuDNN install"
},
{
"path": "lightning_module.py",
"chars": 9121,
"preview": "\"\"\"\nDonut\nCopyright (c) 2022-present NAVER Corp.\nMIT License\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\"\"\"\nimpo"
},
{
"path": "nougat/__init__.py",
"chars": 311,
"preview": "\"\"\"\nDonut\nCopyright (c) 2022-present NAVER Corp.\nMIT License\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\"\"\"\nfrom"
},
{
"path": "nougat/_version.py",
"chars": 204,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "nougat/dataset/create_index.py",
"chars": 5482,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/gen_seek.py",
"chars": 1015,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/parser/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "nougat/dataset/parser/document.py",
"chars": 19783,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/parser/html2md.py",
"chars": 2220,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/parser/latexml_parser.py",
"chars": 18357,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/parser/markdown.py",
"chars": 15343,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/pdffigures.py",
"chars": 2300,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/rasterize.py",
"chars": 2806,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/split_htmls_to_pages.py",
"chars": 7717,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/split_md_to_pages.py",
"chars": 17432,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/splitter.py",
"chars": 13880,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/staircase.py",
"chars": 10486,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/tokenizer.json",
"chars": 2068443,
"preview": "{\n \"version\": \"1.0\",\n \"truncation\": {\n \"direction\": \"Right\",\n \"max_length\": 4096,\n \"strategy\": \"LongestFirst\""
},
{
"path": "nougat/dataset/utils/__init__.py",
"chars": 273,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/utils/latex_conversion.py",
"chars": 4234,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/utils/pdf_text_extract.py",
"chars": 2531,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/dataset/utils/utils.py",
"chars": 528,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/metrics.py",
"chars": 3844,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/model.py",
"chars": 26035,
"preview": "\"\"\"\nDonut\nCopyright (c) 2022-present NAVER Corp.\nMIT License\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\"\"\"\nimpo"
},
{
"path": "nougat/postprocessing.py",
"chars": 16892,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/transforms.py",
"chars": 5986,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/utils/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "nougat/utils/checkpoint.py",
"chars": 3890,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "nougat/utils/dataset.py",
"chars": 9275,
"preview": "\"\"\"\nDonut\nCopyright (c) 2022-present NAVER Corp.\nMIT License\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\"\"\"\nimpo"
},
{
"path": "nougat/utils/device.py",
"chars": 1198,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "predict.py",
"chars": 7439,
"preview": "\"\"\"\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\nThis source code is licensed under the MIT license found in the\n"
},
{
"path": "setup.cfg",
"chars": 39,
"preview": "[metadata]\ndescription_file = README.md"
},
{
"path": "setup.py",
"chars": 2775,
"preview": "\"\"\"\nDonut\nCopyright (c) 2022-present NAVER Corp.\nMIT License\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\"\"\"\n\nimp"
},
{
"path": "test.py",
"chars": 3810,
"preview": "\"\"\"\nDonut\nCopyright (c) 2022-present NAVER Corp.\nMIT License\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\"\"\"\nimpo"
},
{
"path": "train.py",
"chars": 7430,
"preview": "\"\"\"\nDonut\nCopyright (c) 2022-present NAVER Corp.\nMIT License\nCopyright (c) Meta Platforms, Inc. and affiliates.\n\"\"\"\nimpo"
}
]
About this extraction
This page contains the full source code of the facebookresearch/nougat GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 47 files (2.2 MB), approximately 586.5k tokens, and a symbol index with 289 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.