Repository: google-research/arxiv-latex-cleaner
Branch: main
Commit: fb7ee5c72100
Files: 36
Total size: 109.9 KB
Directory structure:
gitextract_ut9ensog/
├── .github/
│   └── workflows/
│       └── release-workflow.yml
├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
├── MANIFEST.in
├── README.md
├── __init__.py
├── arxiv_latex_cleaner/
│   ├── __init__.py
│   ├── __main__.py
│   ├── _version.py
│   ├── arxiv_latex_cleaner.py
│   └── tests/
│       └── arxiv_latex_cleaner_test.py
├── cleaner_config.yaml
├── requirements.txt
├── setup.py
└── test_data/
    ├── tex/
    │   ├── figures/
    │   │   ├── data_included.txt
    │   │   ├── data_not_included.txt
    │   │   ├── figure_included.tex
    │   │   ├── figure_included.tikz
    │   │   ├── figure_not_included.tex
    │   │   └── figure_not_included_2.tex
    │   ├── main.aux
    │   ├── main.bbl
    │   ├── main.bib
    │   ├── main.tex
    │   └── not_included/
    │       └── figures/
    │           └── data_included.txt
    ├── tex_arXiv_png2jpg_true/
    │   ├── figures/
    │   │   ├── data_included.txt
    │   │   ├── figure_included.tex
    │   │   └── figure_included.tikz
    │   ├── main.bbl
    │   └── main.tex
    └── tex_arXiv_true/
        ├── figures/
        │   ├── data_included.txt
        │   ├── figure_included.tex
        │   └── figure_included.tikz
        ├── main.bbl
        └── main.tex
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/workflows/release-workflow.yml
================================================
name: Create a GitHub and PyPI release
on:
  push:
    tags:
      - 'v*'
jobs:
  build:
    name: Create a GitHub Release
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Create Release
        id: create_release
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          tag_name: ${{ github.ref }}
          release_name: Release ${{ github.ref }}
          body: ${{ github.ref }} release of `arxiv_latex_cleaner`.
          draft: false
          prerelease: false
  deploy:
    name: Create a PyPI Release
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install setuptools wheel twine
      - name: Build
        run: |
          python setup.py sdist bdist_wheel
      - name: Publish
        env:
          TWINE_USERNAME: '__token__'
          TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
        run: |
          python -m twine upload dist/*
================================================
FILE: .gitignore
================================================
*.pyc
.idea
arxiv-latex-cleaner.iml
arxiv-latex-cleaner.ipr
arxiv-latex-cleaner.iws
arxiv_latex_cleaner.egg-info/
build/
dist/
*.DS_Store
================================================
FILE: CONTRIBUTING.md
================================================
# How to Contribute
We'd love to accept your patches and contributions to this project. There are
just a few small guidelines you need to follow.
## Contributor License Agreement
Contributions to this project must be accompanied by a Contributor License
Agreement. You (or your employer) retain the copyright to your contribution;
this simply gives us permission to use and redistribute your contributions as
part of the project. Head over to <https://cla.developers.google.com/> to see
your current agreements on file or to sign a new one.
You generally only need to submit a CLA once, so if you've already submitted one
(even if it was for a different project), you probably don't need to do it
again.
## Code reviews
All submissions, including submissions by project members, require review. We
use GitHub pull requests for this purpose. Consult
[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
information on using pull requests.
## Community Guidelines
This project follows
[Google's Open Source Community Guidelines](https://opensource.google.com/conduct/).
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: MANIFEST.in
================================================
include LICENSE
include README.md
include requirements.txt
================================================
FILE: README.md
================================================
# `arxiv_latex_cleaner`
This tool allows you to easily clean the LaTeX code of your paper to submit to
arXiv. From a folder containing all your code, e.g. `/path/to/latex/`, it
creates a new folder `/path/to/latex_arXiv/`, that is ready to ZIP and upload to
arXiv.
## Example call:
```bash
arxiv_latex_cleaner /path/to/latex --resize_images --im_size 500 --images_allowlist='{"images/im.png":2000}'
```
Or simply from a config file:
```bash
arxiv_latex_cleaner /path/to/latex --config cleaner_config.yaml
```
## Installation:
```bash
pip install arxiv-latex-cleaner
```
| :exclamation: arxiv_latex_cleaner is only compatible with Python >=3.9 :exclamation: |
| ---------------------------------------------------------------------------------- |
If using macOS, you can install using [Homebrew](https://brew.sh/):
```bash
brew install arxiv_latex_cleaner
```
Alternatively, you can download the source code:
```bash
git clone https://github.com/google-research/arxiv-latex-cleaner
cd arxiv-latex-cleaner/
python -m arxiv_latex_cleaner --help
```
And install as a command-line program directly from the source code:
```bash
python setup.py install
```
## Main features:
#### Privacy-oriented
* Removes all auxiliary files (`.aux`, `.log`, `.out`, etc.).
* Removes all comments from your code (yes, those are visible on arXiv and you
do not want them to be). These also include `\begin{comment}\end{comment}`,
`\iffalse\fi`, and `\if0\fi` environments.
* Optionally removes user-defined commands entered with `commands_to_delete`
(such as `\todo{}` that you redefine as the empty string at the end).
* Optionally allows you to define custom regex replacement rules through a
`cleaner_config.yaml` file.
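
To illustrate the comment-removal item in the list above, here is a minimal sketch of stripping `%` comments from a single line. It is not the tool's actual implementation: the helper name `strip_comment` is hypothetical, the sketch assumes a `%` preceded by a backslash is a literal percent sign, and it ignores verbatim environments and the `\\%` edge case. It keeps the bare `%` so that LaTeX's end-of-line whitespace handling stays unchanged.

```python
import re

# Hypothetical helper for illustration only.
# A % that is not preceded by a backslash starts a comment.
_COMMENT = re.compile(r"(?<!\\)%.*")


def strip_comment(line):
  """Removes the comment text from one line, keeping the bare % marker."""
  return _COMMENT.sub("%", line)


print(strip_comment(r"A 50\% gain. % TODO: double-check this number"))
# prints: A 50\% gain. %
```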
#### Size-oriented
There is a 50MB limit on arXiv submissions, so to make it fit:
* Removes all unused `.tex` files (those that are not in the root and not
included in any other `.tex` file).
* Removes all unused images that take up space (those that are not actually
included in any used `.tex` file).
* Optionally resizes all images to `im_size` pixels, to reduce the size of the
submission. You can allowlist some images to skip the global size using
`images_allowlist`.
* Optionally compresses `.pdf` files using ghostscript (Linux and Mac only).
You can allowlist some PDFs to skip the global resolution using
`images_allowlist`.
* Optionally converts PNG images to JPG format to reduce file size.
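
The interplay between `im_size` and `images_allowlist` described above can be sketched as a small size calculation. This is an illustration under stated assumptions, not the tool's code: the function name `target_size` is hypothetical, and the sketch assumes that images already smaller than the limit are left untouched and that the aspect ratio is preserved.

```python
def target_size(width, height, path, im_size=500, images_allowlist=None):
  """Returns the (width, height) an image would be resized to.

  Hypothetical helper: the per-image value in `images_allowlist` (keyed by
  path) replaces the global `im_size` limit on the longest side.
  """
  limit = (images_allowlist or {}).get(path, im_size)
  longest = max(width, height)
  if longest <= limit:
    # Assumption: never upscale small images.
    return width, height
  scale = limit / longest
  return max(1, round(width * scale)), max(1, round(height * scale))


allowlist = {"images/im.png": 2000}
print(target_size(2000, 1000, "images/other.png"))            # (500, 250)
print(target_size(4000, 2000, "images/im.png", 500, allowlist))  # (2000, 1000)
```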
#### TikZ picture source code concealment
To prevent the upload of tikzpicture source code or raw simulation data, this
feature:
* Replaces the tikzpicture environment `\begin{tikzpicture} ...
\end{tikzpicture}` with the respective
`\includegraphics{EXTERNAL_TIKZ_FOLDER/picture_name.pdf}`.
* Requires externally compiled TikZ pictures as `.pdf` files in folder
`EXTERNAL_TIKZ_FOLDER`. See section 52 (Externalization Library) in the
[PGF/TikZ manual](https://ctan.org/pkg/pgf?lang=en) on TikZ picture
externalization.
* Only replaces environments with preceding
`\tikzsetnextfilename{picture_name}` command (as in
`\tikzsetnextfilename{picture_name}\begin{tikzpicture} ...
\end{tikzpicture}`) where the externalized `picture_name.pdf` filename
matches `picture_name`.
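
The replacement rule above can be sketched with a single regular expression. This is a simplified illustration, not the tool's implementation: the function name `conceal_tikz` and the folder name `ext_tikz` are hypothetical, and the sketch does not check that the externalized `.pdf` file actually exists.

```python
import re

# Matches \tikzsetnextfilename{name} followed by a tikzpicture environment.
_TIKZ = re.compile(
    r"\\tikzsetnextfilename\{(?P<name>[^}]+)\}\s*"
    r"\\begin\{tikzpicture\}.*?\\end\{tikzpicture\}",
    re.DOTALL,
)


def conceal_tikz(text, external_folder):
  """Replaces named tikzpicture environments with \\includegraphics calls."""

  def _replace(match):
    name = match.group("name")
    return "\\includegraphics{%s/%s.pdf}" % (external_folder, name)

  return _TIKZ.sub(_replace, text)


source = (
    "\\tikzsetnextfilename{plot}\n"
    "\\begin{tikzpicture}\n\\draw (0,0) -- (1,1);\n\\end{tikzpicture}\n"
    "\\begin{tikzpicture}\\draw (0,0);\\end{tikzpicture}"
)
print(conceal_tikz(source, "ext_tikz"))
```

Only the first environment is replaced here, because the second one has no preceding `\tikzsetnextfilename{...}` command.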
#### More sophisticated pattern replacement based on regex group captures
Sometimes it is useful to work with a set of custom LaTeX commands when writing
a paper. To get rid of them upon arXiv submission, one can simply revert them to
plain LaTeX with a regular expression insertion.
```yaml
{
"pattern" : '(?:\\figcomp{\s*)(?P<first>.*?)\s*}\s*{\s*(?P<second>.*?)\s*}\s*{\s*(?P<third>.*?)\s*}',
"insertion" : '\parbox[c]{{ {second} \linewidth}} {{ \includegraphics[width= {third} \linewidth]{{figures/{first} }} }}',
"description" : "Replace figcomp"
}
```
The pattern above will find all `\figcomp{path}{w1}{w2}` commands and replace
them with
`\parbox[c]{w1\linewidth}{\includegraphics[width=w2\linewidth]{figures/path}}`.
The insertion template is filled with the
[named group captures](https://docs.python.org/3/library/re.html#regular-expression-examples)
from the pattern. The replacement runs **before** the `\includegraphics`
commands are processed and the corresponding files are copied, so every figure
referenced by the rewritten commands is copied to the cleaned version. See also
[cleaner_config.yaml](cleaner_config.yaml) for details on how to specify the
patterns.
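
To see how such a rule behaves, the following sketch applies the `\figcomp` pattern with Python's standard `re` module and fills the insertion template with `str.format`. This is an assumption about the mechanism, suggested by the doubled braces in the template, rather than a copy of the tool's code. The helper name `apply_rule` is hypothetical, and the template is written without the extra spaces of the example above.

```python
import re

PATTERN = (
    r"(?:\\figcomp{\s*)(?P<first>.*?)\s*}\s*{\s*(?P<second>.*?)\s*}"
    r"\s*{\s*(?P<third>.*?)\s*}"
)
# Doubled braces are literal braces for str.format; {first}, {second} and
# {third} are filled from the named groups of the match.
INSERTION = (
    r"\parbox[c]{{{second}\linewidth}}"
    r"{{\includegraphics[width={third}\linewidth]{{figures/{first}}}}}"
)


def apply_rule(text, pattern, insertion):
  """Replaces every match of `pattern` with the filled-in `insertion`."""
  return re.sub(
      pattern, lambda m: insertion.format(**m.groupdict()), text
  )


print(apply_rule(r"\figcomp{plot.png}{0.4}{0.9}", PATTERN, INSERTION))
# prints: \parbox[c]{0.4\linewidth}{\includegraphics[width=0.9\linewidth]{figures/plot.png}}
```

Using a function as the replacement argument avoids `re.sub` interpreting the backslashes in the filled template as escape sequences.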
## Usage:
```
usage: arxiv_latex_cleaner@v1.0.10 [-h] [--resize_images] [--im_size IM_SIZE]
[--compress_pdf]
[--pdf_im_resolution PDF_IM_RESOLUTION]
[--images_allowlist IMAGES_ALLOWLIST]
[--keep_bib]
[--commands_to_delete COMMANDS_TO_DELETE [COMMANDS_TO_DELETE ...]]
[--commands_only_to_delete COMMANDS_ONLY_TO_DELETE [COMMANDS_ONLY_TO_DELETE ...]]
[--environments_to_delete ENVIRONMENTS_TO_DELETE [ENVIRONMENTS_TO_DELETE ...]]
[--if_exceptions IF_EXCEPTIONS [IF_EXCEPTIONS ...]]
[--use_external_tikz USE_EXTERNAL_TIKZ]
[--svg_inkscape [SVG_INKSCAPE]]
[--convert_png_to_jpg]
[--png_quality PNG_QUALITY]
[--png_size_threshold PNG_SIZE_THRESHOLD]
[--config CONFIG] [--verbose]
input_folder
Clean the LaTeX code of your paper to submit to arXiv. Check the README for
more information on the use.
positional arguments:
input_folder Input folder or zip archive containing the LaTeX code.
optional arguments:
-h, --help show this help message and exit
--resize_images Resize images.
--im_size IM_SIZE Size of the output images (in pixels, longest side).
Fine tune this to get as close to 10MB as possible.
--compress_pdf Compress PDF images using ghostscript (Linux and Mac
only).
--pdf_im_resolution PDF_IM_RESOLUTION
Resolution (in dpi) to which the tool resamples the
PDF images.
--images_allowlist IMAGES_ALLOWLIST
Images (and PDFs) that won't be resized to the default
resolution, but the one provided here. Value is pixel
for images, and dpi for PDFs, as in --im_size and
--pdf_im_resolution, respectively. Format is a
dictionary as: '{"path/to/im.jpg": 1000}'
--keep_bib Avoid deleting the *.bib files.
--commands_to_delete COMMANDS_TO_DELETE [COMMANDS_TO_DELETE ...]
LaTeX commands that will be deleted. Useful for e.g.
user-defined \todo commands. For example, to delete
all occurrences of \todo1{} and \todo2{}, run the tool
with `--commands_to_delete todo1 todo2`. Please note
that the positional argument `input_folder` cannot
come immediately after `commands_to_delete`, as the
parser does not have any way to know if it's another
command to delete.
--commands_only_to_delete COMMANDS_ONLY_TO_DELETE [COMMANDS_ONLY_TO_DELETE ...]
LaTeX commands that will be deleted but the text
wrapped in the commands will be retained. Useful for
commands that change text formats and colors, which
you may want to remove but keep the text within. Usages
are exactly the same as commands_to_delete. Note that if
the commands listed here duplicate that after
commands_to_delete, the default action will be retaining
the wrapped text.
--environments_to_delete ENVIRONMENTS_TO_DELETE [ENVIRONMENTS_TO_DELETE ...]
LaTeX environments that will be deleted. Useful for e.g.
user-defined comment environments. For example, to
delete all occurrences of \begin{note} ... \end{note},
run the tool with `--environments_to_delete note`.
Please note that the positional argument `input_folder`
cannot come immediately after
`environments_to_delete`, as the parser does not have
any way to know if it's another environment to delete.
--if_exceptions IF_EXCEPTIONS [IF_EXCEPTIONS ...]
Constant TeX primitive conditionals (\iffalse, \iftrue,
etc.) are simplified, i.e., true branches are kept, false
branches deleted. To parse the conditional constructs
correctly, all commands starting with `\if` are assumed to
be TeX primitive conditionals (e.g., declared by
\newif\ifvar). Some known exceptions to this rule are
already included (e.g., \iff, \ifthenelse, etc.), but you
can add custom exceptions using `--if_exceptions iffalt`.
--use_external_tikz USE_EXTERNAL_TIKZ
Folder (relative to input folder) containing
externalized tikz figures in PDF format.
--svg_inkscape [SVG_INKSCAPE]
Include PDF files generated by Inkscape via the
`\includesvg` command from the `svg` package. This is
done by replacing the `\includesvg` calls with
`\includeinkscape` calls pointing to the generated
`.pdf_tex` files. By default, these files and the
generated PDFs are located under `./svg-inkscape`
(relative to the input folder), but a different path
(relative to the input folder) can be provided in case a
different `inkscapepath` was set when loading the `svg`
package.
--convert_png_to_jpg Convert PNG images to JPG format to reduce file size
--png_quality PNG_QUALITY
JPG quality for PNG conversion (0-100, default: 50)
--png_size_threshold PNG_SIZE_THRESHOLD
Minimum PNG file size in MB to apply quality reduction (default: 0.5)
--config CONFIG Read settings from `.yaml` config file. If command
line arguments are provided additionally, the config
file parameters are updated with the command line
parameters.
--verbose Enable detailed output.
```
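
The ordering caveat for `--commands_to_delete` and `--environments_to_delete` follows from how `argparse` handles `nargs="+"`. The sketch below reproduces it with a reduced parser that has only two of the arguments; the helper names `build_parser` and `try_parse` and the program name `demo` are hypothetical.

```python
import argparse


def build_parser():
  """Builds a reduced parser with one positional and one nargs='+' option."""
  parser = argparse.ArgumentParser(prog="demo")
  parser.add_argument("input_folder", type=str)
  parser.add_argument("--commands_to_delete", nargs="+", default=[])
  return parser


def try_parse(argv):
  """Returns the parsed arguments as a dict, or None if parsing fails."""
  try:
    return vars(build_parser().parse_args(argv))
  except SystemExit:
    # argparse reports the error on stderr and exits; swallow it here.
    return None


# Positional argument first: parsed as intended.
ok = try_parse(["/path/to/latex", "--commands_to_delete", "todo1", "todo2"])
# Positional argument last: it is consumed as a third command to delete,
# so `input_folder` is reported as missing.
bad = try_parse(["--commands_to_delete", "todo1", "todo2", "/path/to/latex"])
print(ok)
print(bad)
```

Placing `input_folder` before the list-valued options, as in the example call at the top of this README, avoids the problem.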
## Testing:
```bash
python -m unittest arxiv_latex_cleaner.tests.arxiv_latex_cleaner_test
```
## Note
This is not an officially supported Google product.
================================================
FILE: __init__.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
================================================
FILE: arxiv_latex_cleaner/__init__.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
================================================
FILE: arxiv_latex_cleaner/__main__.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Main module for ``arxiv_latex_cleaner``.
.. code-block:: bash
$ python -m arxiv_latex_cleaner --help
"""
import argparse
import json
import logging
import yaml
from ._version import __version__
from .arxiv_latex_cleaner import merge_args_into_config
from .arxiv_latex_cleaner import run_arxiv_cleaner
PARSER = argparse.ArgumentParser(
prog="arxiv_latex_cleaner@{0}".format(__version__),
description=(
"Clean the LaTeX code of your paper to submit to arXiv. "
"Check the README for more information on the use."
),
)
PARSER.add_argument(
"input_folder",
type=str,
help="Input folder or zip archive containing the LaTeX code.",
)
PARSER.add_argument(
"--resize_images",
action="store_true",
help="Resize images.",
)
PARSER.add_argument(
"--im_size",
default=500,
type=int,
help=(
"Size of the output images (in pixels, longest side). Fine tune this "
"to get as close to 10MB as possible."
),
)
PARSER.add_argument(
"--compress_pdf",
action="store_true",
help="Compress PDF images using ghostscript (Linux and Mac only).",
)
PARSER.add_argument(
"--pdf_im_resolution",
default=500,
type=int,
help="Resolution (in dpi) to which the tool resamples the PDF images.",
)
PARSER.add_argument(
"--images_allowlist",
default={},
type=json.loads,
help=(
"Images (and PDFs) that won't be resized to the default resolution, "
"but the one provided here. Value is pixel for images, and dpi for "
"PDFs, as in --im_size and --pdf_im_resolution, respectively. Format "
"is a dictionary as: '{\"path/to/im.jpg\": 1000}'"
),
)
PARSER.add_argument(
"--keep_bib",
action="store_true",
help="Avoid deleting the *.bib files.",
)
PARSER.add_argument(
"--commands_to_delete",
nargs="+",
default=[],
required=False,
help=(
"LaTeX commands that will be deleted. Useful for e.g. user-defined "
"\\todo commands. For example, to delete all occurrences of \\todo1{} "
"and \\todo2{}, run the tool with `--commands_to_delete todo1 todo2`. "
"Please note that the positional argument `input_folder` cannot come "
"immediately after `commands_to_delete`, as the parser does not have "
"any way to know if it's another command to delete."
),
)
PARSER.add_argument(
"--commands_only_to_delete",
nargs="+",
default=[],
required=False,
help=(
"LaTeX commands that will be deleted but the text wrapped in the"
" commands will be retained. Useful for commands that change text"
" formats and colors, which you may want to remove but keep the text"
" within. Usages are exactly the same as commands_to_delete. Note that"
" if the commands listed here duplicate that after commands_to_delete,"
" the default action will be retaining the wrapped text."
),
)
PARSER.add_argument(
"--environments_to_delete",
nargs="+",
default=[],
required=False,
help=(
"LaTeX environments that will be deleted. Useful for e.g. user-"
"defined comment environments. For example, to delete all occurrences "
"of \\begin{note} ... \\end{note}, run the tool with "
"`--environments_to_delete note`. Please note that the positional "
"argument `input_folder` cannot come immediately after "
"`environments_to_delete`, as the parser does not have any way to "
"know if it's another environment to delete."
),
)
def if_prefixed(orig_string):
  """Strips a leading backslash and checks that the name starts with 'if'."""
  if orig_string.startswith("\\"):
    string = orig_string[1:]
  else:
    string = orig_string
  if not string.startswith("if"):
    raise argparse.ArgumentTypeError(
        f"Expected a string starting with 'if', got '{orig_string}'!"
    )
  return string
PARSER.add_argument(
"--if_exceptions",
nargs="+",
default=[],
required=False,
type=if_prefixed,
help=(
"Constant TeX primitive conditionals (\\iffalse, \\iftrue, etc.) are "
"simplified, i.e., true branches are kept, false branches deleted. "
"To parse the conditional constructs correctly, all commands starting "
"with `\\if` are assumed to be TeX primitive conditionals (e.g., "
"declared by \\newif\\ifvar). Some known exceptions to this rule are "
"already included (e.g., \\iff, \\ifthenelse, etc.), but you can add "
"custom exceptions using `--if_exceptions iffalt`."
),
)
PARSER.add_argument(
"--use_external_tikz",
type=str,
help=(
"Folder (relative to input folder) containing externalized tikz "
"figures in PDF format."
),
)
PARSER.add_argument(
"--svg_inkscape",
nargs="?",
type=str,
const="svg-inkscape",
help=(
"Include PDF files generated by Inkscape via the `\\includesvg` "
"command from the `svg` package. This is done by replacing the "
"`\\includesvg` calls with `\\includeinkscape` calls pointing to the "
"generated `.pdf_tex` files. By default, these files and the "
"generated PDFs are located under `./svg-inkscape` (relative to the "
"input folder), but a different path (relative to the input folder) "
"can be provided in case a different `inkscapepath` was set when "
"loading the `svg` package."
),
)
PARSER.add_argument(
"--convert_png_to_jpg",
action="store_true",
help="Convert PNG images to JPG format to reduce file size. Note that this will override --resize_images for PNG files.",
)
PARSER.add_argument(
"--png_quality",
type=int,
default=50,
help="JPG quality for PNG conversion (0-100, default: 50)",
)
PARSER.add_argument(
"--png_size_threshold",
type=float,
default=0.5,
help="Minimum PNG file size in MB to apply quality reduction (default: 0.5)",
)
PARSER.add_argument(
"--config",
type=str,
help=(
"Read settings from `.yaml` config file. If command line arguments "
"are provided additionally, the config file parameters are updated "
"with the command line parameters."
),
required=False,
)
PARSER.add_argument(
"--verbose",
action="store_true",
help="Enable detailed output.",
)
ARGS = vars(PARSER.parse_args())
if ARGS["config"] is not None:
  try:
    with open(ARGS["config"], "r") as config_file:
      config_params = yaml.safe_load(config_file)
    final_args = merge_args_into_config(ARGS, config_params)
  except FileNotFoundError:
    # ARGS is a plain dict (from vars()), so it is indexed by key.
    print(f"config file {ARGS['config']} not found.")
    final_args = ARGS
    final_args.pop("config", None)
else:
  final_args = ARGS
if final_args.get("verbose", False):
  logging.basicConfig(level=logging.INFO)
else:
  logging.basicConfig(level=logging.ERROR)
run_arxiv_cleaner(final_args)
exit(0)
================================================
FILE: arxiv_latex_cleaner/_version.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
__version__ = "v1.0.10"
================================================
FILE: arxiv_latex_cleaner/arxiv_latex_cleaner.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Cleans the LaTeX code of your paper to submit to arXiv."""
import collections
import contextlib
import copy
import logging
import os
import pathlib
import shutil
import subprocess
import tempfile
from PIL import Image
import regex
PDF_RESIZE_COMMAND = (
'gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH '
'-dDownsampleColorImages=true -dColorImageResolution={resolution} '
'-dColorImageDownsampleThreshold=1.0 -dAutoRotatePages=/None '
'-sOutputFile={output} {input}'
)
MAX_FILENAME_LENGTH = 120
# Fix for Windows: Even if '\' (os.sep) is the standard way of making paths on
# Windows, it interferes with regular expressions. We just change os.sep to '/'
# and os.path.join to a version using '/' as Windows will handle it the right
# way.
if os.name == 'nt':
global old_os_path_join
def new_os_join(path, *args):
res = old_os_path_join(path, *args)
res = res.replace('\\', '/')
return res
old_os_path_join = os.path.join
os.sep = '/'
os.path.join = new_os_join
def _create_dir_erase_if_exists(path):
if os.path.exists(path):
shutil.rmtree(path)
os.makedirs(path)
def _create_dir_if_not_exists(path):
if not os.path.exists(path):
os.makedirs(path)
def _keep_pattern(haystack, patterns_to_keep):
"""Keeps the strings that match 'patterns_to_keep'."""
out = []
for item in haystack:
if any((regex.findall(rem, item) for rem in patterns_to_keep)):
out.append(item)
return out
def _remove_pattern(haystack, patterns_to_remove):
"""Removes the strings that match 'patterns_to_remove'."""
return [
item
for item in haystack
if item not in _keep_pattern([item], patterns_to_remove)
]
def _list_all_files(in_folder, ignore_dirs=None):
if ignore_dirs is None:
ignore_dirs = []
to_consider = [
os.path.join(os.path.relpath(path, in_folder), name)
if path != in_folder
else name
for path, _, files in os.walk(in_folder)
for name in files
]
return _remove_pattern(to_consider, ignore_dirs)
def _copy_file(filename, params):
_create_dir_if_not_exists(
os.path.join(params['output_folder'], os.path.dirname(filename))
)
shutil.copy(
os.path.join(params['input_folder'], filename),
os.path.join(params['output_folder'], filename),
)
def _remove_command(text, command, keep_text=False):
"""Removes '\\command{*}' from the string 'text'.
Regex `base_pattern` used to match balanced parentheses taken from:
https://stackoverflow.com/questions/546433/regular-expression-to-match-balanced-parentheses/35271017#35271017
"""
base_pattern = (
r'\\'
+ command
+ r'(?:\[(?:.*?)\])*\{((?:[^{}]+|\{(?1)\})*)\}(?:\[(?:.*?)\])*'
)
def extract_text_inside_curly_braces(text):
"""Extract text inside of {} from command string"""
pattern = r'\{((?:[^{}]|(?R))*)\}'
match = regex.search(pattern, text)
if match:
return match.group(1)
else:
return ''
# Loops in case of nested commands that need to retain text, e.g.,
# \red{hello \red{world}}.
while True:
all_substitutions = []
has_match = False
for match in regex.finditer(base_pattern, text):
# In case there are only spaces or nothing up to the following newline,
# adds a percent, not to alter the newlines.
has_match = True
if not keep_text:
new_substring = ''
else:
temp_substring = text[match.span()[0] : match.span()[1]]
new_substring = extract_text_inside_curly_braces(temp_substring)
if match.span()[1] < len(text):
next_newline = text[match.span()[1] :].find('\n')
if next_newline != -1:
text_until_newline = text[
match.span()[1] : match.span()[1] + next_newline
]
if (
not text_until_newline or text_until_newline.isspace()
) and not keep_text:
new_substring = '%'
all_substitutions.append(
(match.span()[0], match.span()[1], new_substring)
)
for start, end, new_substring in reversed(all_substitutions):
text = text[:start] + new_substring + text[end:]
if not keep_text or not has_match:
break
return text
def _remove_environment(text, environment):
"""Removes '\\begin{environment}*\\end{environment}' from 'text'."""
# Need to escape '{', to not trigger fuzzy matching if `environment` starts
# with one of 'i', 'd', 's', or 'e'
return regex.sub(
r'\\begin\{' + environment + r'}[\s\S]*?\\end\{' + environment + r'}',
'',
text,
)
def _simplify_conditional_blocks(text, if_exceptions=[]):
r"""Simplify possibly nested conditional blocks from 'text'.
For example, `\iffalse TEST1\else TEST2\fi` is simplified to `TEST2`,
and `\iftrue TEST1\else TEST2\fi` is simplified to `TEST1`.
Unknown conditionals are left untouched.
If the conditional tree is malformed, the function will print a warning
to stderr and return the original text.
"""
p = regex.compile(r'(?!(?<=\\newif\s*))\\if\s*(\w+)|\\else(?!\w)|\\fi(?!\w)')
toplevel_tree = {'left': [], 'right': [], 'kind': 'toplevel', 'parent': None}
tree = toplevel_tree
exceptions = [
# Not a conditional: \iff is the math "if and only if" symbol
'iff',
# package etoolbox
'ifpatchable',
'ifpatchable*',
'ifbool',
'iftoggle',
'ifdef',
'ifcsdef',
'ifundef',
'ifcsundef',
'ifdefmacro',
'ifcsmacro',
'ifdefparam',
'ifcsparam',
'ifcsprefix',
'ifdefprotected',
'ifcsprotected',
'ifdefltxprotect',
'ifcsltxprotect',
'ifdefempty',
'ifcsempty',
'ifdefvoid',
'ifcsvoid',
'ifdefequal',
'ifcsequal',
'ifdefstring',
'ifcsstring',
'ifdefstrequal',
'ifcsstrequal',
'ifdefcounter',
'ifcscounter',
'ifltxcounter',
'ifdeflength',
'ifcslength',
'ifdefdimen',
'ifcsdimen',
'ifstrequal',
'ifstrempty',
'ifblank',
'ifnumcomp',
'ifnumequal',
'ifnumodd',
'ifdimcomp',
'ifdimequal',
'ifdimgreater',
'ifdimless',
'ifboolexpr',
'ifboolexpe',
'ifinlist',
'ifinlistcs',
'ifrmnum',
# package hyperref
'ifpdfstringunicode',
# package ifthen
'ifthenelse',
] + if_exceptions
def new_subtree(kind):
return {'kind': kind, 'left': [], 'right': []}
def add_subtree(tree, subtree):
if 'else' not in tree:
tree['left'].append(subtree)
else:
tree['right'].append(subtree)
subtree['parent'] = tree
def print_tree(tree, indent, write):
if 'start' in tree:
write(' ' * indent + tree['start'].group() + '\n')
for subtree in tree['left']:
print_tree(subtree, indent + 2, write)
if 'else' in tree:
write(' ' * indent + tree['else'].group() + '\n')
for subtree in tree['right']:
print_tree(subtree, indent + 2, write)
if 'end' in tree:
write(' ' * indent + tree['end'].group() + '\n')
def print_abort(error_finding):
os.sys.stderr.write(
f'Warning: Found {error_finding}! Not removing any conditional'
' blocks...\n'
)
os.sys.stderr.write(
f' This is the matched tree (as built up to the error):\n'
)
print_tree(toplevel_tree, indent=9, write=os.sys.stderr.write)
os.sys.stderr.write(
f' Potentially, you need to supply an exception using'
f" '--if_exceptions'.\n"
)
for m in p.finditer(text):
m_no_space = m.group().replace(' ', '')
if m_no_space == r'\iffalse' or m_no_space == r'\if0':
subtree = new_subtree('iffalse')
subtree['start'] = m
add_subtree(tree, subtree)
tree = subtree
elif m_no_space == r'\iftrue' or m_no_space == r'\if1':
subtree = new_subtree('iftrue')
subtree['start'] = m
add_subtree(tree, subtree)
tree = subtree
elif m_no_space.startswith(r'\if'):
if m_no_space[1:] in exceptions:
continue
subtree = new_subtree('unknown')
subtree['start'] = m
add_subtree(tree, subtree)
tree = subtree
elif m_no_space == r'\else':
if tree['parent'] is None:
print_abort(r'unmatched \else')
return text
elif 'else' in tree:
print_abort(r'duplicate \else')
return text
tree['else'] = m
elif m.group() == r'\fi':
if tree['parent'] is None:
print_abort(r'unmatched \fi')
return text
tree['end'] = m
tree = tree['parent']
else:
raise RuntimeError('Unreachable!')
if tree['parent'] is not None:
print_abort('unmatched ' + tree['start'].group())
return text
positions_to_delete = []
def traverse_tree(tree):
if tree['kind'] == 'iffalse':
if 'else' in tree:
positions_to_delete.append((tree['start'].start(), tree['else'].end()))
for subtree in tree['right']:
traverse_tree(subtree)
positions_to_delete.append((tree['end'].start(), tree['end'].end()))
else:
positions_to_delete.append((tree['start'].start(), tree['end'].end()))
elif tree['kind'] == 'iftrue':
if 'else' in tree:
positions_to_delete.append((tree['start'].start(), tree['start'].end()))
for subtree in tree['left']:
traverse_tree(subtree)
positions_to_delete.append((tree['else'].start(), tree['end'].end()))
else:
positions_to_delete.append((tree['start'].start(), tree['start'].end()))
positions_to_delete.append((tree['end'].start(), tree['end'].end()))
elif tree['kind'] == 'unknown':
for subtree in tree['left']:
traverse_tree(subtree)
for subtree in tree['right']:
traverse_tree(subtree)
else:
raise ValueError('Unreachable!')
for tree in toplevel_tree['left']:
traverse_tree(tree)
for start, end in reversed(positions_to_delete):
if end < len(text) and text[end].isspace():
end_to_del = end + 1
else:
end_to_del = end
text = text[:start] + text[end_to_del:]
return text
def _remove_comments_inline(text):
"""Removes the comments from the string 'text' and ignores % inside \\url{}."""
auto_ignore_pattern = r'(%\s*auto-ignore).*'
if regex.search(auto_ignore_pattern, text):
return regex.sub(auto_ignore_pattern, r'\1', text)
if text.lstrip(' ').lstrip('\t').startswith('%'):
return ''
url_pattern = r'\\url\{(?>[^{}]|(?R))*\}'
def remove_comments(segment):
"""Check if a segment of text contains a comment and remove it."""
if segment.lstrip().startswith('%'):
return '', True
match = regex.search(r'(?<!\\)%', segment)
if match:
return segment[: match.end()] + '\n', True
else:
return segment, False
# split the text into segments based on \url{} tags
segments = regex.split(f'({url_pattern})', text)
for i in range(len(segments)):
# only process segments that are not part of a \url{} tag
if not regex.match(url_pattern, segments[i]):
segments[i], match = remove_comments(segments[i])
if match:
# remove all segments after the first inline comment
segments = segments[: i + 1]
break
final_text = ''.join(segments)
return (
final_text
if final_text.endswith('\n') or final_text.endswith('\\n')
else final_text + '\n'
)
def _strip_tex_contents(lines, end_str):
"""Removes everything after end_str."""
for i in range(len(lines)):
if end_str in lines[i]:
if '%' not in lines[i]:
return lines[: i + 1]
elif lines[i].index('%') > lines[i].index(end_str):
return lines[: i + 1]
return lines
def _read_file_content(filename):
with open(filename, 'r', encoding='utf-8') as fp:
lines = fp.readlines()
lines = _strip_tex_contents(lines, '\\end{document}')
return lines
def _read_all_tex_contents(tex_files, parameters):
contents = {}
for fn in tex_files:
contents[fn] = _read_file_content(
os.path.join(parameters['input_folder'], fn)
)
return contents
def _write_file_content(content, filename):
_create_dir_if_not_exists(os.path.dirname(filename))
with open(filename, 'w', encoding='utf-8') as fp:
return fp.write(content)
def _remove_comments_and_commands_to_delete(content, parameters):
"""Removes comments, constant conditional blocks, and the configured environments and commands from the content."""
content = [_remove_comments_inline(line) for line in content]
content = _remove_environment(''.join(content), 'comment')
content = _simplify_conditional_blocks(
content, parameters.get('if_exceptions', [])
)
for environment in parameters.get('environments_to_delete', []):
content = _remove_environment(content, environment)
for command in parameters.get('commands_only_to_delete', []):
content = _remove_command(content, command, True)
for command in parameters['commands_to_delete']:
content = _remove_command(content, command, False)
return content
def _replace_tikzpictures(content, figures):
"""Replaces tikzpicture environments with \\includegraphics commands.
Each environment preceded by \\tikzsetnextfilename is replaced only if exactly
one matching external PDF figure is found in 'figures'.
"""
def get_figure(matchobj):
found_tikz_filename = regex.search(
r'\\tikzsetnextfilename{(.*?)}', matchobj.group(0)
).group(1)
# search in tex split if figure is available
matching_tikz_filenames = _keep_pattern(
figures, ['/' + found_tikz_filename + '.pdf']
)
if len(matching_tikz_filenames) == 1:
return '\\includegraphics{' + matching_tikz_filenames[0] + '}'
else:
return matchobj.group(0)
content = regex.sub(
r'\\tikzsetnextfilename{[\s\S]*?\\end{tikzpicture}', get_figure, content
)
return content
def _replace_includesvg(content, svg_inkscape_files):
def repl_svg(matchobj):
svg_path = matchobj.group(2)
if svg_path.endswith('.svg'):
svg_path = '_'.join(svg_path.rsplit('.', 1))
svg_filename = os.path.basename(svg_path)
# search in svg_inkscape split if pdf_tex file is available
matching_pdf_tex_files = _keep_pattern(
svg_inkscape_files, ['/' + svg_filename + '-tex.pdf_tex']
)
if len(matching_pdf_tex_files) == 1:
options = '' if matchobj.group(1) is None else matchobj.group(1)
res = f'\\includeinkscape{options}{{{matching_pdf_tex_files[0]}}}'
return res
else:
return matchobj.group(0)
content = regex.sub(r'\\includesvg(\[.*?\])?{(.*?)}', repl_svg, content)
return content
def _resize_and_copy_figure(
filename,
origin_folder,
destination_folder,
resize_image,
image_size,
compress_pdf,
pdf_resolution,
convert_png_to_jpg=False,
png_quality=50,
png_size_threshold=0.5,
verbose=False
):
"""Resizes and copies the input figure (either JPG, PNG, or PDF).
Parameters:
filename: The input filename
origin_folder: The folder containing the input filename
destination_folder: The folder to copy the output filename to
resize_image: Whether to resize the image
image_size: The maximum size of the image in pixels
compress_pdf: Whether to compress the PDF file
pdf_resolution: The resolution used to downsample images in compressed PDFs
convert_png_to_jpg: Whether to convert PNG files to JPG format. Note that this will override resize_image for PNG files.
png_quality: JPG quality for converted PNG files (0-100)
png_size_threshold: Minimum file size in MB to apply quality reduction
verbose: Enable verbose logging
Returns:
str: The actual output filename (may differ from input if PNG was converted)
"""
_create_dir_if_not_exists(
os.path.join(destination_folder, os.path.dirname(filename))
)
if convert_png_to_jpg and os.path.splitext(filename)[1].lower() in ['.png']:
original_size_mb = os.path.getsize(os.path.join(origin_folder, filename)) / (1024 * 1024)
im = Image.open(os.path.join(origin_folder, filename))
# Determine quality based on file size
if original_size_mb < png_size_threshold:
quality = 100 # Keep high quality for small files
if verbose:
print(f"Keeping original quality for small PNG: {filename}")
else:
quality = png_quality
if verbose:
print(f"Converting PNG to JPG with quality {quality}: {filename}")
# Convert PNG to JPG
output_filename = os.path.splitext(filename)[0] + '.jpg'
rgb_img = im.convert('RGB')
rgb_img.save(os.path.join(destination_folder, output_filename), 'JPEG', quality=quality)
if verbose:
print(f"Converted: {filename} -> {output_filename}")
return output_filename
if resize_image and os.path.splitext(filename)[1].lower() in [
'.jpg',
'.jpeg',
'.png',
]:
try:
im = Image.open(os.path.join(origin_folder, filename))
if max(im.size) > image_size:
im = im.resize(
tuple([int(x * float(image_size) / max(im.size)) for x in im.size]),
Image.Resampling.LANCZOS,
)
if os.path.splitext(filename)[1].lower() in ['.jpg', '.jpeg']:
im.save(os.path.join(destination_folder, filename), 'JPEG', quality=90)
return filename
elif os.path.splitext(filename)[1].lower() in ['.png']:
im.save(os.path.join(destination_folder, filename), 'PNG')
return filename
except Exception as e:
if verbose:
print(f"Failed to process image {filename}: {e}")
# Fall back to simple copy
shutil.copy(
os.path.join(origin_folder, filename),
os.path.join(destination_folder, filename),
)
return filename
elif compress_pdf and os.path.splitext(filename)[1].lower() == '.pdf':
_resize_pdf_figure(
filename, origin_folder, destination_folder, pdf_resolution
)
return filename
else:
shutil.copy(
os.path.join(origin_folder, filename),
os.path.join(destination_folder, filename),
)
return filename
def _update_image_references(tex_contents_dict, old_filename, new_filename, verbose=False):
"""Update references from old_filename to new_filename in all tex content."""
if old_filename == new_filename:
return # No change needed
old_base = os.path.splitext(old_filename)[0]
new_base = os.path.splitext(new_filename)[0]
if verbose:
print(f"Updating LaTeX references: {old_filename} -> {new_filename}")
for tex_file in tex_contents_dict:
# Handle both string and list content
if isinstance(tex_contents_dict[tex_file], list):
content = '\n'.join(tex_contents_dict[tex_file])
else:
content = tex_contents_dict[tex_file]
content_changed = False
# Pattern 1: Direct filename with full extension, handling comments and newlines
pattern1 = r'(\{(?:%\s*\n\s*)?[^}]*?)' + regex.escape(old_filename) + r'((?:%\s*\n\s*)?[^}]*?\})'
replacement1 = r'\g<1>' + new_filename + r'\g<2>'
new_content = regex.sub(pattern1, replacement1, content, flags=regex.IGNORECASE | regex.DOTALL)
if new_content != content:
content = new_content
content_changed = True
if verbose:
print(f"Applied pattern 1 (full filename) in {tex_file}")
else:
# Pattern 2: Base filename without extension, handling comments and newlines
# Only apply this if Pattern 1 didn't match to avoid double replacements
pattern2 = r'(\{(?:%\s*\n\s*)?[^}]*?)' + regex.escape(old_base) + r'((?:%\s*\n\s*)?[^}]*?\})'
replacement2 = r'\g<1>' + new_base + r'.jpg\g<2>'
new_content = regex.sub(pattern2, replacement2, content, flags=regex.IGNORECASE | regex.DOTALL)
if new_content != content:
content = new_content
content_changed = True
if verbose:
print(f"Applied pattern 2 (base filename) in {tex_file}")
else:
# Pattern 3: Handle cases where extension is split across lines with comments
# This specifically targets patterns like: images/filename%\n.png
pattern3 = r'(\{[^}]*?)' + regex.escape(old_base) + r'(%\s*\n\s*)(\.png)([^}]*?\})'
replacement3 = r'\g<1>' + new_base + r'\g<2>.jpg\g<4>'
new_content = regex.sub(pattern3, replacement3, content, flags=regex.IGNORECASE | regex.DOTALL)
if new_content != content:
content = new_content
content_changed = True
if verbose:
print(f"Applied pattern 3 (split extension) in {tex_file}")
# Update the content back in the appropriate format
if content_changed:
if isinstance(tex_contents_dict[tex_file], list):
# Convert back to list format, preserving line endings
tex_contents_dict[tex_file] = content.split('\n')
else:
tex_contents_dict[tex_file] = content
if verbose:
print(f"Updated references in {tex_file}")
# Re-write the updated tex files to the output directory
if verbose and any(tex_contents_dict.values()):
print("Re-writing updated tex files...")
return tex_contents_dict
def _resize_pdf_figure(
filename, origin_folder, destination_folder, resolution, timeout=10
):
input_file = os.path.join(origin_folder, filename)
output_file = os.path.join(destination_folder, filename)
bash_command = PDF_RESIZE_COMMAND.format(
input=input_file, output=output_file, resolution=resolution
)
process = subprocess.Popen(bash_command.split(), stdout=subprocess.PIPE)
try:
process.communicate(timeout=timeout)
except subprocess.TimeoutExpired:
process.kill()
outs, errs = process.communicate()
print('Output: ', outs)
print('Errors: ', errs)
def _copy_only_referenced_non_tex_not_in_root(parameters, contents, splits):
for fn in _keep_only_referenced(
splits['non_tex_not_in_root'], contents, strict=True
):
_copy_file(fn, parameters)
def _resize_and_copy_figures_if_referenced(parameters, contents, splits):
"""Resizes and copies the referenced figures and returns the filename changes (e.g., PNG -> JPG)."""
image_size = collections.defaultdict(lambda: parameters['im_size'])
image_size.update(parameters['images_allowlist'])
pdf_resolution = collections.defaultdict(
lambda: parameters['pdf_im_resolution']
)
pdf_resolution.update(parameters['images_allowlist'])
# contents is the full content string for reference checking
filename_changes = {} # Track PNG -> JPG filename changes
for image_file in _keep_only_referenced(
splits['figures'], contents, strict=False
):
actual_output_filename = _resize_and_copy_figure(
filename=image_file,
origin_folder=parameters['input_folder'],
destination_folder=parameters['output_folder'],
resize_image=parameters['resize_images'],
image_size=image_size[image_file],
compress_pdf=parameters['compress_pdf'],
pdf_resolution=pdf_resolution[image_file],
convert_png_to_jpg=parameters.get('convert_png_to_jpg', False),
png_quality=parameters.get('png_quality', 50),
png_size_threshold=parameters.get('png_size_threshold', 0.5),
verbose=parameters.get('verbose', False)
)
# Track filename changes for reference updates
if actual_output_filename != image_file:
filename_changes[image_file] = actual_output_filename
return filename_changes
def _search_reference(filename, contents, strict=False):
"""Returns a match object if filename is referenced in contents, and None otherwise.
If not strict mode, path prefix and extension are optional.
"""
if strict:
# regex pattern for strict=True for path/to/img.ext:
# \{[\s%]*path/to/img\.ext[\s%]*\}
filename_regex = filename.replace('.', r'\.')
else:
filename_path = pathlib.Path(filename)
# make extension optional
root, extension = filename_path.stem, filename_path.suffix
basename_regex = '{}({})?'.format(
regex.escape(root), regex.escape(extension)
)
# iterate through parent fragments to make path prefix optional
path_prefix_regex = ''
for fragment in reversed(filename_path.parents):
if fragment.name == '.':
continue
fragment = regex.escape(fragment.name)
path_prefix_regex = '({}{}{})?'.format(
path_prefix_regex, fragment, os.sep
)
# Regex pattern for strict=False for path/to/img.ext:
# \{[\s%]*(<path_prefix>)?<basename>(<ext>)?[\s%]*\}
filename_regex = path_prefix_regex + basename_regex
# Some files 'path/to/file' are referenced in tex as './path/to/file' thus
# adds prefix for relative paths starting with './' or '.\' to regex search.
filename_regex = r'(\.' + os.sep + r')?' + filename_regex
# Pads with braces and optional whitespace/comment characters.
patn = r'\{{[\s%]*{}[\s%]*\}}'.format(filename_regex)
# Picture references in LaTeX are allowed to be in different cases.
return regex.search(patn, contents, regex.IGNORECASE)
def _keep_only_referenced(filenames, contents, strict=False):
"""Returns the filenames referenced from contents.
If not strict mode, path prefix and extension are optional.
"""
return [
fn
for fn in filenames
if _search_reference(fn, contents, strict) is not None
]
def _keep_only_referenced_tex(contents, splits):
"""Returns the filenames referenced from the tex files themselves.
It iterates until the set of referenced files stops changing, because a file
may be referenced only from a file that is itself unreferenced.
"""
old_referenced = set(splits['tex_in_root'] + splits['tex_not_in_root'])
while True:
referenced = set(splits['tex_in_root'])
for fn in old_referenced:
for fn2 in old_referenced:
if regex.search(
r'(' + regex.escape(os.path.splitext(fn)[0]) + r'[.}])', '\n'.join(contents[fn2])
):
referenced.add(fn)
if referenced == old_referenced:
splits['tex_to_copy'] = list(referenced)
return
old_referenced = referenced.copy()
def _add_root_tex_files(splits):
# TODO: Check auto-ignore marker in root to detect the main file. Then check
# there is only one non-referenced TeX in root.
# Forces the TeX in root to be copied, even if they are not referenced.
for fn in splits['tex_in_root']:
if fn not in splits['tex_to_copy']:
splits['tex_to_copy'].append(fn)
def _split_all_files(parameters):
"""Splits the files into types or location to know what to do with them."""
file_splits = {
'all': _list_all_files(
parameters['input_folder'], ignore_dirs=['.git' + os.sep]
),
'in_root': [
f
for f in os.listdir(parameters['input_folder'])
if os.path.isfile(os.path.join(parameters['input_folder'], f))
],
}
file_splits['not_in_root'] = [
f for f in file_splits['all'] if f not in file_splits['in_root']
]
file_splits['to_copy_in_root'] = _remove_pattern(
file_splits['in_root'],
parameters['to_delete'] + parameters['figures_to_copy_if_referenced'],
)
file_splits['to_copy_not_in_root'] = _remove_pattern(
file_splits['not_in_root'],
parameters['to_delete'] + parameters['figures_to_copy_if_referenced'],
)
file_splits['figures'] = _keep_pattern(
file_splits['all'], parameters['figures_to_copy_if_referenced']
)
file_splits['tex_in_root'] = _keep_pattern(
file_splits['to_copy_in_root'], ['.tex$', '.tikz$']
)
file_splits['tex_not_in_root'] = _keep_pattern(
file_splits['to_copy_not_in_root'], ['.tex$', '.tikz$']
)
file_splits['non_tex_in_root'] = _remove_pattern(
file_splits['to_copy_in_root'], ['.tex$', '.tikz$']
)
file_splits['non_tex_not_in_root'] = _remove_pattern(
file_splits['to_copy_not_in_root'], ['.tex$', '.tikz$']
)
if parameters.get('use_external_tikz', None) is not None:
file_splits['external_tikz_figures'] = _keep_pattern(
file_splits['all'], [parameters['use_external_tikz']]
)
else:
file_splits['external_tikz_figures'] = []
if parameters.get('svg_inkscape', None) is not None:
file_splits['svg_inkscape'] = _keep_pattern(
file_splits['all'], [parameters['svg_inkscape']]
)
else:
file_splits['svg_inkscape'] = []
return file_splits
def _create_out_folder(input_folder):
"""Creates the output folder, erasing it first if it already exists."""
out_folder = os.path.abspath(input_folder).removesuffix('.zip') + '_arXiv'
_create_dir_erase_if_exists(out_folder)
return out_folder
def run_arxiv_cleaner(parameters):
"""Core of the code, runs the actual arXiv cleaner."""
files_to_delete = [
r'\.aux$',
r'\.sh$',
r'\.blg$',
r'\.brf$',
r'\.log$',
r'\.out$',
r'\.ps$',
r'\.dvi$',
r'\.synctex\.gz$',
'~$',
r'\.backup$',
r'\.gitignore$',
r'\.DS_Store$',
r'\.svg$',
r'^\.idea',
r'\.dpth$',
r'\.md5$',
r'\.dep$',
r'\.auxlock$',
r'\.fls$',
r'\.fdb_latexmk$',
]
if not parameters['keep_bib']:
files_to_delete.append(r'\.bib$')
parameters.update({
'to_delete': files_to_delete,
'figures_to_copy_if_referenced': [
r'\.png$',
r'\.jpg$',
r'\.jpeg$',
r'\.pdf$',
],
})
logging.info('Collecting file structure.')
parameters['output_folder'] = _create_out_folder(parameters['input_folder'])
from_zip = parameters['input_folder'].endswith('.zip')
tempdir_context = (
tempfile.TemporaryDirectory() if from_zip else contextlib.nullcontext()
)
with tempdir_context as tempdir:
if from_zip:
logging.info('Unzipping input folder.')
shutil.unpack_archive(parameters['input_folder'], tempdir)
parameters['input_folder'] = tempdir
splits = _split_all_files(parameters)
logging.info('Reading all tex files')
tex_contents = _read_all_tex_contents(
splits['tex_in_root'] + splits['tex_not_in_root'], parameters
)
for tex_file in tex_contents:
logging.info('Removing comments in file %s.', tex_file)
tex_contents[tex_file] = _remove_comments_and_commands_to_delete(
tex_contents[tex_file], parameters
)
for tex_file in tex_contents:
logging.info('Replacing \\includesvg calls in file %s.', tex_file)
tex_contents[tex_file] = _replace_includesvg(
tex_contents[tex_file], splits['svg_inkscape']
)
for tex_file in tex_contents:
logging.info('Replacing Tikz Pictures in file %s.', tex_file)
content = _replace_tikzpictures(
tex_contents[tex_file], splits['external_tikz_figures']
)
# Stores the content as a list of lines; it is re-joined with '\n' when
# searching for references and when writing the output.
tex_contents[tex_file] = content.split('\n')
_keep_only_referenced_tex(tex_contents, splits)
_add_root_tex_files(splits)
for tex_file in splits['tex_to_copy']:
logging.info('Replacing patterns in file %s.', tex_file)
content = '\n'.join(tex_contents[tex_file])
content = _find_and_replace_patterns(
content, parameters.get('patterns_and_insertions', list())
)
tex_contents[tex_file] = content
new_path = os.path.join(parameters['output_folder'], tex_file)
logging.info('Writing modified contents to %s.', new_path)
_write_file_content(
content,
new_path,
)
full_content = '\n'.join(
''.join(tex_contents[fn]) for fn in splits['tex_to_copy']
)
_copy_only_referenced_non_tex_not_in_root(parameters, full_content, splits)
for non_tex_file in splits['non_tex_in_root']:
logging.info('Copying non-tex file %s.', non_tex_file)
_copy_file(non_tex_file, parameters)
filename_changes = _resize_and_copy_figures_if_referenced(parameters, full_content, splits)
logging.info('Outputs written to %s', parameters['output_folder'])
# Updates LaTeX references for files that were renamed (e.g., PNG -> JPG).
if tex_contents and filename_changes:
for old_filename, new_filename in filename_changes.items():
tex_contents = _update_image_references(
tex_contents, old_filename, new_filename,
verbose=parameters.get('verbose', False)
)
# Re-write modified tex files with new references after resizing and copying figures
for tex_file in splits['tex_to_copy']:
if tex_file in tex_contents:
# Get the updated content
if isinstance(tex_contents[tex_file], list):
updated_content = '\n'.join(tex_contents[tex_file])
else:
updated_content = tex_contents[tex_file]
# Write the updated content back to the output file
output_path = os.path.join(parameters['output_folder'], tex_file)
logging.info('Re-writing modified tex file with updated references: %s', output_path)
_write_file_content(updated_content, output_path)
if parameters.get('verbose', False):
print(f"Re-wrote {tex_file} with updated image references")
if parameters.get('verbose', False):
print(f"Updated {len(filename_changes)} image references and re-wrote tex files")
def strip_whitespace(text):
"""Strips all whitespace characters.
https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string
"""
pattern = regex.compile(r'\s+')
text = regex.sub(pattern, '', text)
return text
def merge_args_into_config(args, config_params):
final_args = copy.deepcopy(config_params)
config_keys = config_params.keys()
for key, value in args.items():
if key in config_keys:
if isinstance(value, (str, bool, float, int)):
# Overwrites config value with args value.
final_args[key] = value
elif isinstance(value, list):
# Prepends args values to config values.
final_args[key] = value + config_params[key]
elif isinstance(value, dict):
# Updates config params with args params.
final_args[key].update(**value)
else:
final_args[key] = value
return final_args
def _find_and_replace_patterns(content, patterns_and_insertions):
  r"""Replaces every match of each pattern in content with its insertion.

  The search restarts on the updated content after each replacement, so an
  insertion that itself matches its pattern may never terminate.

  content: str
  patterns_and_insertions: List[Dict]

  Example for patterns_and_insertions:

      [
          {
              "pattern" :
              r"(?:\\figcompfigures{\s*)(?P<first>.*?)\s*}\s*{\s*(?P<second>.*?)\s*}\s*{\s*(?P<third>.*?)\s*}",
              "insertion" :
              r"\parbox[c]{{{second}\linewidth}}{{\includegraphics[width={third}\linewidth]{{figures/{first}}}}}}",
              "description": "Replace figcompfigures"
          },
      ]
  """
  for pattern_and_insertion in patterns_and_insertions:
    pattern = pattern_and_insertion['pattern']
    insertion = pattern_and_insertion['insertion']
    description = pattern_and_insertion['description']
    logging.info('Processing pattern: %s.', description)
    p = regex.compile(pattern)
    m = p.search(content)
    while m is not None:
      # Named groups of the match fill the placeholders of the insertion.
      local_insertion = insertion.format(**m.groupdict())
      if pattern_and_insertion.get('strip_whitespace', True):
        local_insertion = strip_whitespace(local_insertion)
      logging.info(f'Found {content[m.start():m.end()]:<70}')
      logging.info(f'Replacing with {local_insertion:<30}')
      content = content[: m.start()] + local_insertion + content[m.end() :]
      m = p.search(content)
    logging.info('Finished pattern: %s.', description)
  return content
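

# A minimal, illustrative sketch (not part of the original module) of the
# substitution step used in _find_and_replace_patterns above: the named groups
# captured by the pattern are passed to str.format() on the insertion template,
# which is why literal LaTeX braces in the template have to be doubled.
# Assumptions: the names `_demo_format_insertion` and `\mycmd` are hypothetical,
# and the standard `re` module stands in for `regex` so that the sketch has no
# third-party dependency.
def _demo_format_insertion():
  import re  # Local import: only this sketch needs it.

  pattern = r'\\mycmd\{\s*(?P<first>.*?)\s*\}'
  insertion = r'\includegraphics{{figures/{first}}}'
  match = re.search(pattern, r'\mycmd{ image1.jpg }')
  # '{{' and '}}' collapse to single braces; '{first}' is filled from the group.
  return insertion.format(**match.groupdict())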
================================================
FILE: arxiv_latex_cleaner/tests/arxiv_latex_cleaner_test.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from os import path
import shutil
import unittest
from absl.testing import parameterized
from arxiv_latex_cleaner import arxiv_latex_cleaner
from PIL import Image
def make_args(
input_folder='foo/bar',
resize_images=False,
im_size=500,
compress_pdf=False,
pdf_im_resolution=500,
images_allowlist=None,
commands_to_delete=None,
use_external_tikz='foo/bar/tikz',
):
if images_allowlist is None:
images_allowlist = {}
if commands_to_delete is None:
commands_to_delete = []
args = {
'input_folder': input_folder,
'resize_images': resize_images,
'im_size': im_size,
'compress_pdf': compress_pdf,
'pdf_im_resolution': pdf_im_resolution,
'images_allowlist': images_allowlist,
'commands_to_delete': commands_to_delete,
'use_external_tikz': use_external_tikz,
}
return args
def make_contents():
return (
r'& \figcompfigures{'
'\n\timage1.jpg'
'\n}{'
'\n\t'
r'\ww'
'\n}{'
'\n\t1.0'
'\n\t}'
'\n& '
r'\figcompfigures{image2.jpg}{\ww}{1.0}'
)
def make_patterns():
pattern = r'(?:\\figcompfigures{\s*)(?P<first>.*?)\s*}\s*{\s*(?P<second>.*?)\s*}\s*{\s*(?P<third>.*?)\s*}'
insertion = r"""\parbox[c]{{
{second}\linewidth
}}{{
\includegraphics[
width={third}\linewidth
]{{
figures/{first}
}}
}} """
description = 'Replace figcompfigures'
output = {
'pattern': pattern,
'insertion': insertion,
'description': description,
}
return [output]
def make_search_reference_tests():
return (
{
'testcase_name': 'prefix1',
'filenames': ['include_image_yes.png', 'include_image.png'],
'contents': '\\include{include_image_yes.png}',
'strict': False,
'true_outputs': ['include_image_yes.png'],
},
{
'testcase_name': 'prefix2',
'filenames': ['include_image_yes.png', 'include_image.png'],
'contents': '\\include{include_image.png}',
'strict': False,
'true_outputs': ['include_image.png'],
},
{
'testcase_name': 'nested_more_specific',
'filenames': [
'images/im_included.png',
'images/include/images/im_included.png',
],
'contents': '\\include{images/include/images/im_included.png}',
'strict': False,
'true_outputs': ['images/include/images/im_included.png'],
},
{
'testcase_name': 'nested_less_specific',
'filenames': [
'images/im_included.png',
'images/include/images/im_included.png',
],
'contents': '\\include{images/im_included.png}',
'strict': False,
'true_outputs': [
'images/im_included.png',
'images/include/images/im_included.png',
],
},
{
'testcase_name': 'nested_substring',
'filenames': ['images/im_included.png', 'im_included.png'],
'contents': '\\include{images/im_included.png}',
'strict': False,
'true_outputs': ['images/im_included.png'],
},
{
'testcase_name': 'nested_diffpath',
'filenames': ['images/im_included.png', 'figures/im_included.png'],
'contents': '\\include{images/im_included.png}',
'strict': False,
'true_outputs': ['images/im_included.png'],
},
{
'testcase_name': 'diffext',
'filenames': ['tables/demo.tex', 'tables/demo.tikz', 'demo.tex'],
'contents': '\\include{tables/demo.tex}',
'strict': False,
'true_outputs': ['tables/demo.tex'],
},
{
'testcase_name': 'diffext2',
'filenames': ['tables/demo.tex', 'tables/demo.tikz', 'demo.tex'],
'contents': '\\include{tables/demo}',
'strict': False,
'true_outputs': ['tables/demo.tex', 'tables/demo.tikz'],
},
{
'testcase_name': 'strict_prefix1',
'filenames': ['demo_yes.tex', 'demo.tex'],
'contents': '\\include{demo_yes.tex}',
'strict': True,
'true_outputs': ['demo_yes.tex'],
},
{
'testcase_name': 'strict_prefix2',
'filenames': ['demo_yes.tex', 'demo.tex'],
'contents': '\\include{demo.tex}',
'strict': True,
'true_outputs': ['demo.tex'],
},
{
'testcase_name': 'strict_nested_more_specific',
'filenames': [
'tables/table_included.csv',
'tables/include/tables/table_included.csv',
],
'contents': '\\include{tables/include/tables/table_included.csv}',
'strict': True,
'true_outputs': ['tables/include/tables/table_included.csv'],
},
{
'testcase_name': 'strict_nested_less_specific',
'filenames': [
'tables/table_included.csv',
'tables/include/tables/table_included.csv',
],
'contents': '\\include{tables/table_included.csv}',
'strict': True,
'true_outputs': ['tables/table_included.csv'],
},
{
'testcase_name': 'strict_nested_substring1',
'filenames': ['tables/table_included.csv', 'table_included.csv'],
'contents': '\\include{tables/table_included.csv}',
'strict': True,
'true_outputs': ['tables/table_included.csv'],
},
{
'testcase_name': 'strict_nested_substring2',
'filenames': ['tables/table_included.csv', 'table_included.csv'],
'contents': '\\include{table_included.csv}',
'strict': True,
'true_outputs': ['table_included.csv'],
},
{
'testcase_name': 'strict_nested_diffpath',
'filenames': ['tables/table_included.csv', 'data/table_included.csv'],
'contents': '\\include{tables/table_included.csv}',
'strict': True,
'true_outputs': ['tables/table_included.csv'],
},
{
'testcase_name': 'strict_diffext',
'filenames': ['tables/demo.csv', 'tables/demo.txt', 'demo.csv'],
'contents': '\\include{tables/demo.csv}',
'strict': True,
'true_outputs': ['tables/demo.csv'],
},
{
'testcase_name': 'path_starting_with_dot',
'filenames': [
'./images/im_included.png',
'./figures/im_included.png',
],
'contents': '\\include{./images/im_included.png}',
'strict': False,
'true_outputs': ['./images/im_included.png'],
},
)
class UnitTests(parameterized.TestCase):
@parameterized.named_parameters(
{
'testcase_name': 'empty config',
'args': make_args(),
'config_params': {},
'final_args': make_args(),
},
{
'testcase_name': 'empty args',
'args': {},
'config_params': make_args(),
'final_args': make_args(),
},
{
'testcase_name': 'args and config provided',
'args': make_args(
images_allowlist={'path1/': 1000}, commands_to_delete=[r'\todo1']
),
'config_params': make_args(
'foo_/bar_',
True,
1000,
True,
1000,
images_allowlist={'path2/': 1000},
commands_to_delete=[r'\todo2'],
use_external_tikz='foo_/bar_/tikz_',
),
'final_args': make_args(
images_allowlist={'path1/': 1000, 'path2/': 1000},
commands_to_delete=[r'\todo1', r'\todo2'],
),
},
)
def test_merge_args_into_config(self, args, config_params, final_args):
self.assertEqual(
arxiv_latex_cleaner.merge_args_into_config(args, config_params),
final_args,
)
@parameterized.named_parameters(
{
'testcase_name': 'no_comment',
'line_in': 'Foo\n',
'true_output': 'Foo\n',
},
{
'testcase_name': 'auto_ignore',
'line_in': '%auto-ignore\n',
'true_output': '%auto-ignore\n',
},
{
'testcase_name': 'auto_ignore_middle',
'line_in': 'Foo % auto-ignore Comment\n',
'true_output': 'Foo % auto-ignore\n',
},
{
'testcase_name': 'auto_ignore_text_with_comment',
'line_in': 'Foo auto-ignore % Comment\n',
'true_output': 'Foo auto-ignore %\n',
},
{
'testcase_name': 'percent',
'line_in': '100\\% accurate\n',
'true_output': '100\\% accurate\n',
},
{
'testcase_name': 'comment',
'line_in': ' % Comment\n',
'true_output': '',
},
{
'testcase_name': 'comment_inline',
'line_in': 'Foo %Comment\n',
'true_output': 'Foo %\n',
},
{
'testcase_name': 'url_with_percent',
'line_in': '\\url{https://www.example.com/hello%20world}\n',
'true_output': '\\url{https://www.example.com/hello%20world}\n',
},
{
'testcase_name': 'comment_with_url',
'line_in': 'Foo %\\url{https://www.example.com/hello%20world}\n',
'true_output': 'Foo %\n',
},
)
def test_remove_comments_inline(self, line_in, true_output):
self.assertEqual(
arxiv_latex_cleaner._remove_comments_inline(line_in), true_output
)
@parameterized.named_parameters(
{
'testcase_name': 'no_command',
'text_in': 'Foo\nFoo2\n',
'keep_text': False,
'true_output': 'Foo\nFoo2\n',
},
{
'testcase_name': 'command_not_removed',
'text_in': '\\textit{Foo\nFoo2}\n',
'keep_text': False,
'true_output': '\\textit{Foo\nFoo2}\n',
},
{
'testcase_name': 'command_no_end_line_removed',
'text_in': 'A\\todo{B\nC}D\nE\n\\end{document}',
'keep_text': False,
'true_output': 'AD\nE\n\\end{document}',
},
{
'testcase_name': 'command_with_end_line_removed',
'text_in': 'A\n\\todo{B\nC}\nD\n\\end{document}',
'keep_text': False,
'true_output': 'A\n%\nD\n\\end{document}',
},
{
'testcase_name': 'command_with_optional_arguments_start',
'text_in': 'A\n\\todo[B]{C\nD}\nE\n\\end{document}',
'keep_text': False,
'true_output': 'A\n%\nE\n\\end{document}',
},
{
'testcase_name': 'command_with_optional_arguments_end',
'text_in': 'A\n\\todo{B\nC}[D]\nE\n\\end{document}',
'keep_text': False,
'true_output': 'A\n%\nE\n\\end{document}',
},
{
'testcase_name': 'no_command_keep_text',
'text_in': 'Foo\nFoo2\n',
'keep_text': True,
'true_output': 'Foo\nFoo2\n',
},
{
'testcase_name': 'command_not_removed_keep_text',
'text_in': '\\textit{Foo\nFoo2}\n',
'keep_text': True,
'true_output': '\\textit{Foo\nFoo2}\n',
},
{
'testcase_name': 'command_no_end_line_removed_keep_text',
'text_in': 'A\\todo{B\nC}D\nE\n\\end{document}',
'keep_text': True,
'true_output': 'AB\nCD\nE\n\\end{document}',
},
{
'testcase_name': 'command_with_end_line_removed_keep_text',
'text_in': 'A\n\\todo{B\nC}\nD\n\\end{document}',
'keep_text': True,
'true_output': 'A\nB\nC\nD\n\\end{document}',
},
{
'testcase_name': 'nested_command_keep_text',
'text_in': 'A\n\\todo{B\n\\todo{C}}\nD\n\\end{document}',
'keep_text': True,
'true_output': 'A\nB\nC\nD\n\\end{document}',
},
{
'testcase_name': 'command_with_optional_arguments_start_keep_text',
'text_in': 'A\n\\todo[B]{C\nD}\nE\n\\end{document}',
'keep_text': True,
'true_output': 'A\nC\nD\nE\n\\end{document}',
},
{
'testcase_name': 'command_with_optional_arguments_end_keep_text',
'text_in': 'A\n\\todo{B\nC}[D]\nE\n\\end{document}',
'keep_text': True,
'true_output': 'A\nB\nC\nE\n\\end{document}',
},
{
'testcase_name': 'deeply_nested_command_keep_text',
'text_in': 'A\n\\todo{B\n\\emph{C\\footnote{\\textbf{D}}}}\nE\n\\end{document}',
'keep_text': True,
'true_output': (
'A\nB\n\\emph{C\\footnote{\\textbf{D}}}\nE\n\\end{document}'
),
},
)
def test_remove_command(self, text_in, keep_text, true_output):
self.assertEqual(
arxiv_latex_cleaner._remove_command(text_in, 'todo', keep_text),
true_output,
)
@parameterized.named_parameters(
{
'testcase_name': 'no_environment',
'text_in': 'Foo\n',
'true_output': 'Foo\n',
},
{
'testcase_name': 'environment_not_removed',
'text_in': 'Foo\n\\begin{equation}\n3x+2\n\\end{equation}\nFoo',
'true_output': 'Foo\n\\begin{equation}\n3x+2\n\\end{equation}\nFoo',
},
{
'testcase_name': 'environment_removed',
'text_in': 'Foo\\begin{comment}\n3x+2\n\\end{comment}\nFoo',
'true_output': 'Foo\nFoo',
},
)
def test_remove_environment(self, text_in, true_output):
self.assertEqual(
arxiv_latex_cleaner._remove_environment(text_in, 'comment'), true_output
)
@parameterized.named_parameters(
{
'testcase_name': 'no_iffalse',
'text_in': 'Foo\n',
'true_output': 'Foo\n',
},
{
'testcase_name': 'if_not_removed',
'text_in': '\\ifvar\n\\ifvar\nFoo\n\\fi\n\\fi\n',
'true_output': '\\ifvar\n\\ifvar\nFoo\n\\fi\n\\fi\n',
},
{
'testcase_name': 'if_removed_with_nested_ifvar',
'text_in': '\\ifvar\n\\iffalse\n\\ifvar\nFoo\n\\fi\n\\fi\n\\fi\n',
'true_output': '\\ifvar\n\\fi\n',
},
{
'testcase_name': 'if_removed_with_nested_iffalse',
'text_in': '\\ifvar\n\\iffalse\n\\iffalse\nFoo\n\\fi\n\\fi\n\\fi\n',
'true_output': '\\ifvar\n\\fi\n',
},
{
'testcase_name': 'if_removed_eof',
'text_in': '\\iffalse\nFoo\n\\fi',
'true_output': '',
},
{
'testcase_name': 'if_removed_space',
'text_in': '\\iffalse\nFoo\n\\fi ',
'true_output': '',
},
{
'testcase_name': 'if_removed_backslash',
'text_in': '\\iffalse\nFoo\n\\fi\\end{document}',
'true_output': '\\end{document}',
},
{
'testcase_name': 'commands_not_removed',
'text_in': '\\newcommand\\figref[1]{Figure~\\ref{fig:\\#1}}',
'true_output': '\\newcommand\\figref[1]{Figure~\\ref{fig:\\#1}}',
},
{
'testcase_name': 'iffalse_else_sustained',
'text_in': '\\iffalse not there\\else here\\fi',
'true_output': 'here',
},
{
'testcase_name': 'iftrue_else_removed',
'text_in': '\\iftrue expected\\else not expected\\fi',
'true_output': 'expected',
},
{
'testcase_name': 'if0_removed',
'text_in': '\\if0 to be removed\\fi',
'true_output': '',
},
{
'testcase_name': 'if1 works',
'text_in': '\\if 1 expected\\fi',
'true_output': 'expected',
},
{
'testcase_name': 'new_if_ignored',
'text_in': '\\newif \\ifvar \\ifvar\\iffalse test\\fi\\fi',
'true_output': '\\newif \\ifvar \\ifvar\\fi',
},
{
'testcase_name': 'known exceptions (iff) ignored in \\iffalse',
'text_in': '\\iffalse \\iff\\fi',
'true_output': '',
},
{
'testcase_name': 'known exceptions (iff) ignored in \\iftrue',
'text_in': '\\iftrue\\iff\\else\\fi',
'true_output': '\\iff',
},
)
def test_simplify_conditional_blocks(self, text_in, true_output):
self.assertEqual(
arxiv_latex_cleaner._simplify_conditional_blocks(text_in), true_output
)
@parameterized.named_parameters(
{
'testcase_name': 'all_pass',
'inputs': ['abc', 'bca'],
'patterns': ['a'],
'true_outputs': ['abc', 'bca'],
},
{
'testcase_name': 'not_all_pass',
'inputs': ['abc', 'bca'],
'patterns': ['a$'],
'true_outputs': ['bca'],
},
)
def test_keep_pattern(self, inputs, patterns, true_outputs):
self.assertEqual(
list(arxiv_latex_cleaner._keep_pattern(inputs, patterns)), true_outputs
)
@parameterized.named_parameters(
{
'testcase_name': 'all_pass',
'inputs': ['abc', 'bca'],
'patterns': ['a'],
'true_outputs': [],
},
{
'testcase_name': 'not_all_pass',
'inputs': ['abc', 'bca'],
'patterns': ['a$'],
'true_outputs': ['abc'],
},
)
def test_remove_pattern(self, inputs, patterns, true_outputs):
self.assertEqual(
list(arxiv_latex_cleaner._remove_pattern(inputs, patterns)),
true_outputs,
)
@parameterized.named_parameters(
{
'testcase_name': 'replace_contents',
'content': make_contents(),
'patterns_and_insertions': make_patterns(),
'true_outputs': (
r'& \parbox[c]{\ww\linewidth}{\includegraphics[width=1.0\linewidth]{figures/image1.jpg}}'
'\n'
r'& \parbox[c]{\ww\linewidth}{\includegraphics[width=1.0\linewidth]{figures/image2.jpg}}'
),
},
)
def test_find_and_replace_patterns(
self, content, patterns_and_insertions, true_outputs
):
output = arxiv_latex_cleaner._find_and_replace_patterns(
content, patterns_and_insertions
)
output = arxiv_latex_cleaner.strip_whitespace(output)
true_outputs = arxiv_latex_cleaner.strip_whitespace(true_outputs)
self.assertEqual(output, true_outputs)
@parameterized.named_parameters(
{
'testcase_name': 'no_tikz',
'text_in': 'Foo\n',
'figures_in': ['ext_tikz/test1.pdf', 'ext_tikz/test2.pdf'],
'true_output': 'Foo\n',
},
{
'testcase_name': 'tikz_no_match',
'text_in': (
'Foo\\tikzsetnextfilename{test_no_match}\n\\begin{tikzpicture}\n\\node'
' (test) at (0,0) {Test1};\n\\end{tikzpicture}\nFoo'
),
'figures_in': ['ext_tikz/test1.pdf', 'ext_tikz/test2.pdf'],
'true_output': (
'Foo\\tikzsetnextfilename{test_no_match}\n\\begin{tikzpicture}\n\\node'
' (test) at (0,0) {Test1};\n\\end{tikzpicture}\nFoo'
),
},
{
'testcase_name': 'tikz_match',
'text_in': (
'Foo\\tikzsetnextfilename{test2}\n\\begin{tikzpicture}\n\\node'
' (test) at (0,0) {Test1};\n\\end{tikzpicture}\nFoo'
),
'figures_in': ['ext_tikz/test1.pdf', 'ext_tikz/test2.pdf'],
'true_output': 'Foo\\includegraphics{ext_tikz/test2.pdf}\nFoo',
},
)
def test_replace_tikzpictures(self, text_in, figures_in, true_output):
self.assertEqual(
arxiv_latex_cleaner._replace_tikzpictures(text_in, figures_in),
true_output,
)
@parameterized.named_parameters(
{
'testcase_name': 'no_includesvg',
'text_in': 'Foo\n',
'figures_in': [
'ext_svg/test1-tex.pdf_tex',
'ext_svg/test2-tex.pdf_tex',
],
'true_output': 'Foo\n',
},
{
'testcase_name': 'includesvg_no_match',
'text_in': 'Foo\\includesvg{test_no_match}\nFoo',
'figures_in': [
'ext_svg/test1-tex.pdf_tex',
'ext_svg/test2-tex.pdf_tex',
],
'true_output': 'Foo\\includesvg{test_no_match}\nFoo',
},
{
'testcase_name': 'includesvg_match',
'text_in': 'Foo\\includesvg{test2}\nFoo',
'figures_in': [
'ext_svg/test1-tex.pdf_tex',
'ext_svg/test2-tex.pdf_tex',
],
'true_output': 'Foo\\includeinkscape{ext_svg/test2-tex.pdf_tex}\nFoo',
},
{
'testcase_name': 'includesvg_match_with_options',
'text_in': 'Foo\\includesvg[width=\\linewidth,scale=0.40]{figs/persdf/test2}\nFoo',
'figures_in': [
'ext_svg/test1-tex.pdf_tex',
'ext_svg/test2-tex.pdf_tex',
],
'true_output': 'Foo\\includeinkscape[width=\\linewidth,scale=0.40]{ext_svg/test2-tex.pdf_tex}\nFoo',
},
{
'testcase_name': 'includesvg_match_with_options_with_suffix',
'text_in': 'Foo\\includesvg[width=\\linewidth]{figs/test2.svg}\nFoo',
'figures_in': [
'ext_svg/test1-tex.pdf_tex',
'ext_svg/test2_svg-tex.pdf_tex',
],
'true_output': 'Foo\\includeinkscape[width=\\linewidth]{ext_svg/test2_svg-tex.pdf_tex}\nFoo',
},
{
'testcase_name': 'includesvg_match_with_options_with_dot_with_suffix',
'text_in': (
'Foo\\includesvg[width=\\linewidth]{figs/test2-0.9.svg}\nFoo'
),
'figures_in': [
'ext_svg/test1-tex.pdf_tex',
'ext_svg/test2-0.9_svg-tex.pdf_tex',
],
'true_output': 'Foo\\includeinkscape[width=\\linewidth]{ext_svg/test2-0.9_svg-tex.pdf_tex}\nFoo',
},
)
def test_replace_includesvg(self, text_in, figures_in, true_output):
self.assertEqual(
arxiv_latex_cleaner._replace_includesvg(text_in, figures_in),
true_output,
)
@parameterized.named_parameters(*make_search_reference_tests())
def test_search_reference_weak(
self, filenames, contents, strict, true_outputs
):
cleaner_outputs = []
for filename in filenames:
reference = arxiv_latex_cleaner._search_reference(
filename, contents, strict
)
if reference is not None:
cleaner_outputs.append(filename)
# weak check (passes as long as cleaner includes a superset of the true_output)
for true_output in true_outputs:
self.assertIn(true_output, cleaner_outputs)
@parameterized.named_parameters(*make_search_reference_tests())
def test_search_reference_strong(
self, filenames, contents, strict, true_outputs
):
cleaner_outputs = []
for filename in filenames:
reference = arxiv_latex_cleaner._search_reference(
filename, contents, strict
)
if reference is not None:
cleaner_outputs.append(filename)
# strong check (set of files must match exactly)
weak_check_result = set(true_outputs).issubset(cleaner_outputs)
if weak_check_result:
msg = 'not fatal, cleaner included more files than necessary'
else:
msg = 'fatal, see test_search_reference_weak'
self.assertEqual(cleaner_outputs, true_outputs, msg)
@parameterized.named_parameters(
{
'testcase_name': 'three_parent',
'filename': 'long/path/to/img.ext',
'content_strs': [
# match
'{img.ext}',
'{to/img.ext}',
'{path/to/img.ext}',
'{long/path/to/img.ext}',
'{%\nimg.ext }',
'{to/img.ext % \n}',
'{ \npath/to/img.ext\n}',
'{ \n \nlong/path/to/img.ext\n}',
'{img}',
'{to/img}',
'{path/to/img}',
'{long/path/to/img}',
# don't match
'{from/img.ext}',
'{from/img}',
'{imgoext}',
'{from/imgo}',
'{ \n long/\npath/to/img.ext\n}',
'{path/img.ext}',
'{long/img.ext}',
'{long/path/img.ext}',
'{long/to/img.ext}',
'{path/img}',
'{long/img}',
'{long/path/img}',
'{long/to/img}',
],
'strict': False,
'true_outputs': [True] * 12 + [False] * 13,
},
{
'testcase_name': 'two_parent',
'filename': 'path/to/img.ext',
'content_strs': [
# match
'{img.ext}',
'{to/img.ext}',
'{path/to/img.ext}',
'{%\nimg.ext }',
'{to/img.ext % \n}',
'{ \npath/to/img.ext\n}',
'{img}',
'{to/img}',
'{path/to/img}',
# don't match
'{long/path/to/img.ext}',
'{ \n \nlong/path/to/img.ext\n}',
'{long/path/to/img}',
'{from/img.ext}',
'{from/img}',
'{imgoext}',
'{from/imgo}',
'{ \n long/\npath/to/img.ext\n}',
'{path/img.ext}',
'{long/img.ext}',
'{long/path/img.ext}',
'{long/to/img.ext}',
'{path/img}',
'{long/img}',
'{long/path/img}',
'{long/to/img}',
],
'strict': False,
'true_outputs': [True] * 9 + [False] * 16,
},
{
'testcase_name': 'one_parent',
'filename': 'to/img.ext',
'content_strs': [
# match
'{img.ext}',
'{to/img.ext}',
'{%\nimg.ext }',
'{to/img.ext % \n}',
'{img}',
'{to/img}',
# don't match
'{long/path/to/img}',
'{path/to/img}',
'{ \n \nlong/path/to/img.ext\n}',
'{ \npath/to/img.ext\n}',
'{long/path/to/img.ext}',
'{path/to/img.ext}',
'{from/img.ext}',
'{from/img}',
'{imgoext}',
'{from/imgo}',
'{ \n long/\npath/to/img.ext\n}',
'{path/img.ext}',
'{long/img.ext}',
'{long/path/img.ext}',
'{long/to/img.ext}',
'{path/img}',
'{long/img}',
'{long/path/img}',
'{long/to/img}',
],
'strict': False,
'true_outputs': [True] * 6 + [False] * 19,
},
{
'testcase_name': 'two_parent_strict',
'filename': 'path/to/img.ext',
'content_strs': [
# match
'{path/to/img.ext}',
'{ \npath/to/img.ext\n}',
# don't match
'{img.ext}',
'{to/img.ext}',
'{%\nimg.ext }',
'{to/img.ext % \n}',
'{img}',
'{to/img}',
'{path/to/img}',
'{long/path/to/img.ext}',
'{ \n \nlong/path/to/img.ext\n}',
'{long/path/to/img}',
'{from/img.ext}',
'{from/img}',
'{imgoext}',
'{from/imgo}',
'{ \n long/\npath/to/img.ext\n}',
'{path/img.ext}',
'{long/img.ext}',
'{long/path/img.ext}',
'{long/to/img.ext}',
'{path/img}',
'{long/img}',
'{long/path/img}',
'{long/to/img}',
],
'strict': True,
'true_outputs': [True] * 2 + [False] * 23,
},
)
def test_search_reference_filewise(
self, filename, content_strs, strict, true_outputs
):
if len(content_strs) != len(true_outputs):
raise ValueError(
"number of true_outputs doesn't match number of content strs"
)
for content, true_output in zip(content_strs, true_outputs):
reference = arxiv_latex_cleaner._search_reference(
filename, content, strict
)
matched = reference is not None
msg_not = ' ' if true_output else ' not '
msg_fmt = 'file {} should' + msg_not + 'have matched latex reference {}'
msg = msg_fmt.format(filename, content)
self.assertEqual(matched, true_output, msg)
class IntegrationTests(parameterized.TestCase):
def setUp(self):
super().setUp()
self.out_path = 'test_data/tex_arXiv'
def _compare_files(self, filename, filename_true):
if path.splitext(filename)[1].lower() in ['.jpg', '.jpeg', '.png']:
with Image.open(filename) as im, Image.open(filename_true) as im_true:
# We check only the sizes of the images, checking pixels would be too
# complicated in case the resize implementations change.
self.assertEqual(
im.size,
im_true.size,
'Image {:s} was not resized properly.'.format(filename),
)
else:
# Checks if text files are equal without taking into account end-of-line
# characters.
with open(filename, 'rb') as f:
processed_content = f.read().splitlines()
with open(filename_true, 'rb') as f:
groundtruth_content = f.read().splitlines()
self.assertEqual(
processed_content,
groundtruth_content,
'{:s} and {:s} are not equal.'.format(filename, filename_true),
)
@parameterized.named_parameters(
{'testcase_name': 'from_dir', 'input_dir': 'test_data/tex'},
{'testcase_name': 'from_zip', 'input_dir': 'test_data/tex.zip'},
)
def test_complete(self, input_dir):
out_path_true = 'test_data/tex_arXiv_true'
# Make sure the folder does not exist, since we erase it in the test.
if path.isdir(self.out_path):
raise RuntimeError(
'The folder {:s} should not exist.'.format(self.out_path)
)
arxiv_latex_cleaner.run_arxiv_cleaner({
'input_folder': input_dir,
'images_allowlist': {
'images/im2_included.jpg': 200,
'images/im3_included.png': 400,
},
'resize_images': True,
'im_size': 100,
'compress_pdf': False,
'pdf_im_resolution': 500,
'commands_to_delete': ['mytodo'],
'commands_only_to_delete': ['red'],
'if_exceptions': ['iffalt'],
'environments_to_delete': ['mynote'],
'use_external_tikz': 'ext_tikz',
'keep_bib': False,
})
# Checks the set of files is the same as in the true folder.
out_files = set(arxiv_latex_cleaner._list_all_files(self.out_path))
out_files_true = set(arxiv_latex_cleaner._list_all_files(out_path_true))
self.assertSetEqual(out_files, out_files_true)
# Compares the contents of each file against the true value.
for f1 in out_files:
self._compare_files(
path.join(self.out_path, f1), path.join(out_path_true, f1)
)
@parameterized.named_parameters(
{'testcase_name': 'from_dir', 'input_dir': 'test_data/tex'},
{'testcase_name': 'from_zip', 'input_dir': 'test_data/tex.zip'},
)
def test_png2jpg(self, input_dir):
out_path_true = 'test_data/tex_arXiv_png2jpg_true'
# Make sure the folder does not exist, since we erase it in the test.
if path.isdir(self.out_path):
raise RuntimeError(
'The folder {:s} should not exist.'.format(self.out_path)
)
arxiv_latex_cleaner.run_arxiv_cleaner({
'input_folder': input_dir,
'images_allowlist': {
# 'images/im2_included.jpg': 200,
# 'images/im3_included.png': 400,
},
'resize_images': False,
'im_size': 100,
'compress_pdf': False,
'pdf_im_resolution': 500,
'commands_to_delete': ['mytodo'],
'commands_only_to_delete': ['red'],
'if_exceptions': ['iffalt'],
'environments_to_delete': ['mynote'],
'use_external_tikz': 'ext_tikz',
'keep_bib': False,
'convert_png_to_jpg': True,
'png_quality': 50,
'png_size_threshold': 0.5,
})
# Checks the set of files is the same as in the true folder.
out_files = set(arxiv_latex_cleaner._list_all_files(self.out_path))
out_files_true = set(arxiv_latex_cleaner._list_all_files(out_path_true))
self.assertSetEqual(out_files, out_files_true)
# Compares the contents of each file against the true value.
for f1 in out_files:
if path.splitext(path.join(self.out_path, f1))[1].lower() in ['.jpg', '.jpeg', '.png']:
# Check that all png files have been renamed to jpg.
self.assertNotEqual(path.splitext(f1)[1].lower(), '.png', f'{f1} is not renamed to jpg')
else:
self._compare_files(
path.join(self.out_path, f1), path.join(out_path_true, f1)
)
def tearDown(self):
shutil.rmtree(self.out_path)
super().tearDown()
if __name__ == '__main__':
unittest.main()
================================================
FILE: cleaner_config.yaml
================================================
patterns_and_insertions:
[
# Use single ticks for regex patterns
# http://blogs.perl.org/users/tinita/2018/03/strings-in-yaml---to-quote-or-not-to-quote.html
# You need to escape \ with \\ in the pattern, for instance for \\todo
# Use Python named groups https://docs.python.org/3/library/re.html#regular-expression-examples
# Escape literal { and } as {{ and }} in the insertion expression
#
# Optional:
# Set strip_whitespace to n to disable white space stripping while replacing the pattern. (Default: y)
{
"pattern" : '(?:\\figcomp{\s*)(?P<first>.*?)\s*}\s*{\s*(?P<second>.*?)\s*}\s*{\s*(?P<third>.*?)\s*}',
"insertion" : '\parbox[c]{{ {second} \linewidth}} {{ \includegraphics[width= {third} \linewidth]{{figures/{first} }} }}',
"description" : "Replace figcomp",
# "strip_whitespace": n
},
]
verbose: False
commands_to_delete: [
'\\todo',
]
================================================
FILE: requirements.txt
================================================
absl_py>=0.12
pillow
pyyaml
regex
================================================
FILE: setup.py
================================================
#! /usr/bin/env python
#
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from setuptools import setup
from setuptools import find_packages
from arxiv_latex_cleaner._version import __version__
with open("README.md", "r") as fh:
    long_description = fh.read()
install_requires = []
with open("requirements.txt") as f:
    for line in f:
        requirement = line.strip()
        # Skip blank lines and comment lines.
        if requirement and not requirement.startswith('#'):
            install_requires.append(requirement)
setup(
name="arxiv_latex_cleaner",
version=__version__,
packages=find_packages(exclude=["*.tests"]),
python_requires='>=3',
url="https://github.com/google-research/arxiv-latex-cleaner",
license="Apache License, Version 2.0",
author="Google Research Authors",
author_email="jponttuset@gmail.com",
description="Cleans the LaTeX code of your paper to submit to arXiv.",
long_description=long_description,
long_description_content_type="text/markdown",
entry_points={
"console_scripts": ["arxiv_latex_cleaner=arxiv_latex_cleaner.__main__:__main__"]
},
install_requires=install_requires,
classifiers=[
"License :: OSI Approved :: Apache Software License",
"Intended Audience :: Science/Research",
],
)
================================================
FILE: test_data/tex/figures/data_included.txt
================================================
================================================
FILE: test_data/tex/figures/data_not_included.txt
================================================
================================================
FILE: test_data/tex/figures/figure_included.tex
================================================
\includegraphics{images/im2_included.jpg}
\addplot{figures/data_included.txt}
================================================
FILE: test_data/tex/figures/figure_included.tikz
================================================
\tikzsetnextfilename{test2}
\begin{tikzpicture}
\node {root}
child {node {left}}
child {node {right}
child {node {child}}
child {node {child}}
};
\end{tikzpicture}
================================================
FILE: test_data/tex/figures/figure_not_included.tex
================================================
\addplot{figures/data_not_included.txt}
\input{figures/figure_not_included_2.tex}
================================================
FILE: test_data/tex/figures/figure_not_included_2.tex
================================================
================================================
FILE: test_data/tex/main.aux
================================================
================================================
FILE: test_data/tex/main.bbl
================================================
BBL content, should be intact.
================================================
FILE: test_data/tex/main.bib
================================================
================================================
FILE: test_data/tex/main.tex
================================================
\begin{document}
Text
% Whole line comment
Text% Inline comment
\begin{comment}
This is an environment comment.
\end{comment}
This is a percent \%.
% Whole line comment without newline
\includegraphics{images/im1_included.png}
%\includegraphics{images/im_not_included}
\includegraphics{images/im3_included.png}
\includegraphics{%
images/im4_included.png%
}
\includegraphics[width=.5\linewidth]{%
images/im5_included.jpg}
%\includegraphics{%
% images/im4_not_included.png
% }
%\includegraphics[width=.5\linewidth]{%
% images/im5_not_included.jpg}
% test whether a path starting with a dot works when including graphics
\includegraphics{./images/im3_included.png}
This line should\mytodo{Do this later} not be separated
\mytodo{This is a todo command with a nested \textit{command}.
Please remember that up to \texttt{2 levels} of \textit{nesting} are supported.}
from this one.
\begin{mynote}
This is a custom environment that could be excluded.
\end{mynote}
\newif\ifvar
\newif \ifvarII
\ifvarII asdf \fi
\ifvar
\if false
\if false
\if 0
\iffalse
\ifvar
Text
\fi
\fi
\fi
\fi
\fi
\fi
\iffalse I shall be gone (iffalse block)!\else Expect me (else block of iffalse)!\fi
\iftrue Expect me (iftrue block)!\else I shall be gone (else block of iftrue)!\fi
\iffalse
\iffalt
\fi
\newcommand{\red}[1]{{\color{red} #1}}
hello test \red{hello
test \red{hello}}
test
% content after this line should not be cleaned if \end{document} is in a comment
\input{figures/figure_included.tex}
% \input{figures/figure_not_included.tex}
% Test for tikzpicture feature
% should be replaced
\tikzsetnextfilename{test1}
\begin{tikzpicture}
\node (test) at (0,0) {Test1};
\end{tikzpicture}
% should be replaced in included file
\input{figures/figure_included.tikz}
% should not be replaced - no preceding tikzsetnextfilename command
\begin{tikzpicture}
\node (test) at (0,0) {Test3};
\end{tikzpicture}
\tikzsetnextfilename{test_no_match}
\begin{tikzpicture}
\node (test) at (0,0) {Test4};
\end{tikzpicture}
\end{document}
This should be ignored.
================================================
FILE: test_data/tex/not_included/figures/data_included.txt
================================================
================================================
FILE: test_data/tex_arXiv_png2jpg_true/figures/data_included.txt
================================================
================================================
FILE: test_data/tex_arXiv_png2jpg_true/figures/figure_included.tex
================================================
\includegraphics{images/im2_included.jpg}
\addplot{figures/data_included.txt}
================================================
FILE: test_data/tex_arXiv_png2jpg_true/figures/figure_included.tikz
================================================
\includegraphics{ext_tikz/test2.pdf}
================================================
FILE: test_data/tex_arXiv_png2jpg_true/main.bbl
================================================
BBL content, should be intact.
================================================
FILE: test_data/tex_arXiv_png2jpg_true/main.tex
================================================
\begin{document}
Text
Text%
This is a percent \%.
\includegraphics{images/im1_included.jpg}
\includegraphics{images/im3_included.jpg}
\includegraphics{%
images/im4_included.jpg%
}
\includegraphics[width=.5\linewidth]{%
images/im5_included.jpg}
\includegraphics{./images/im3_included.jpg}
This line should not be separated
%
from this one.
\newif\ifvar
\newif \ifvarII
\ifvarII asdf \fi
\ifvar
\fi
Expect me (else block of iffalse)!
Expect me (iftrue block)!
\newcommand{\red}[1]{{\color{red} #1}}
hello test hello
test hello
test
\input{figures/figure_included.tex}
\includegraphics{ext_tikz/test1.pdf}
\input{figures/figure_included.tikz}
\begin{tikzpicture}
\node (test) at (0,0) {Test3};
\end{tikzpicture}
\tikzsetnextfilename{test_no_match}
\begin{tikzpicture}
\node (test) at (0,0) {Test4};
\end{tikzpicture}
\end{document}
================================================
FILE: test_data/tex_arXiv_true/figures/data_included.txt
================================================
================================================
FILE: test_data/tex_arXiv_true/figures/figure_included.tex
================================================
\includegraphics{images/im2_included.jpg}
\addplot{figures/data_included.txt}
================================================
FILE: test_data/tex_arXiv_true/figures/figure_included.tikz
================================================
\includegraphics{ext_tikz/test2.pdf}
================================================
FILE: test_data/tex_arXiv_true/main.bbl
================================================
BBL content, should be intact.
================================================
FILE: test_data/tex_arXiv_true/main.tex
================================================
\begin{document}
Text
Text%
This is a percent \%.
\includegraphics{images/im1_included.png}
\includegraphics{images/im3_included.png}
\includegraphics{%
images/im4_included.png%
}
\includegraphics[width=.5\linewidth]{%
images/im5_included.jpg}
\includegraphics{./images/im3_included.png}
This line should not be separated
%
from this one.
\newif\ifvar
\newif \ifvarII
\ifvarII asdf \fi
\ifvar
\fi
Expect me (else block of iffalse)!
Expect me (iftrue block)!
\newcommand{\red}[1]{{\color{red} #1}}
hello test hello
test hello
test
\input{figures/figure_included.tex}
\includegraphics{ext_tikz/test1.pdf}
\input{figures/figure_included.tikz}
\begin{tikzpicture}
\node (test) at (0,0) {Test3};
\end{tikzpicture}
\tikzsetnextfilename{test_no_match}
\begin{tikzpicture}
\node (test) at (0,0) {Test4};
\end{tikzpicture}
\end{document}
SYMBOL INDEX (58 symbols across 3 files)
FILE: arxiv_latex_cleaner/__main__.py
function if_prefixed (line 138) | def if_prefixed(orig_string):
FILE: arxiv_latex_cleaner/arxiv_latex_cleaner.py
function new_os_join (line 44) | def new_os_join(path, *args):
function _create_dir_erase_if_exists (line 55) | def _create_dir_erase_if_exists(path):
function _create_dir_if_not_exists (line 61) | def _create_dir_if_not_exists(path):
function _keep_pattern (line 66) | def _keep_pattern(haystack, patterns_to_keep):
function _remove_pattern (line 75) | def _remove_pattern(haystack, patterns_to_remove):
function _list_all_files (line 84) | def _list_all_files(in_folder, ignore_dirs=None):
function _copy_file (line 97) | def _copy_file(filename, params):
function _remove_command (line 107) | def _remove_command(text, command, keep_text=False):
function _remove_environment (line 169) | def _remove_environment(text, environment):
function _simplify_conditional_blocks (line 180) | def _simplify_conditional_blocks(text, if_exceptions=[]):
function _remove_comments_inline (line 373) | def _remove_comments_inline(text):
function _strip_tex_contents (line 414) | def _strip_tex_contents(lines, end_str):
function _read_file_content (line 425) | def _read_file_content(filename):
function _read_all_tex_contents (line 432) | def _read_all_tex_contents(tex_files, parameters):
function _write_file_content (line 441) | def _write_file_content(content, filename):
function _remove_comments_and_commands_to_delete (line 447) | def _remove_comments_and_commands_to_delete(content, parameters):
function _replace_tikzpictures (line 463) | def _replace_tikzpictures(content, figures):
function _replace_includesvg (line 489) | def _replace_includesvg(content, svg_inkscape_files):
function _resize_and_copy_figure (line 511) | def _resize_and_copy_figure(
function _update_image_references (line 612) | def _update_image_references(tex_contents_dict, old_filename, new_filena...
function _resize_pdf_figure (line 685) | def _resize_pdf_figure(
function _copy_only_referenced_non_tex_not_in_root (line 704) | def _copy_only_referenced_non_tex_not_in_root(parameters, contents, spli...
function _resize_and_copy_figures_if_referenced (line 710) | def _resize_and_copy_figures_if_referenced(parameters, contents, splits):
function _search_reference (line 747) | def _search_reference(filename, contents, strict=False):
function _keep_only_referenced (line 789) | def _keep_only_referenced(filenames, contents, strict=False):
function _keep_only_referenced_tex (line 801) | def _keep_only_referenced_tex(contents, splits):
function _add_root_tex_files (line 824) | def _add_root_tex_files(splits):
function _split_all_files (line 834) | def _split_all_files(parameters):
function _create_out_folder (line 893) | def _create_out_folder(input_folder):
function run_arxiv_cleaner (line 901) | def run_arxiv_cleaner(parameters):
function strip_whitespace (line 1041) | def strip_whitespace(text):
function merge_args_into_config (line 1051) | def merge_args_into_config(args, config_params):
function _find_and_replace_patterns (line 1070) | def _find_and_replace_patterns(content, patterns_and_insertions):
FILE: arxiv_latex_cleaner/tests/arxiv_latex_cleaner_test.py
function make_args (line 24) | def make_args(
function make_contents (line 51) | def make_contents():
function make_patterns (line 66) | def make_patterns():
function make_search_reference_tests (line 86) | def make_search_reference_tests():
class UnitTests (line 228) | class UnitTests(parameterized.TestCase):
method test_merge_args_into_config (line 264) | def test_merge_args_into_config(self, args, config_params, final_args):
method test_remove_comments_inline (line 317) | def test_remove_comments_inline(self, line_in, true_output):
method test_remove_command (line 410) | def test_remove_command(self, text_in, keep_text, true_output):
method test_remove_environment (line 433) | def test_remove_environment(self, text_in, true_output):
method test_simplify_conditional_blocks (line 515) | def test_simplify_conditional_blocks(self, text_in, true_output):
method test_keep_pattern (line 534) | def test_keep_pattern(self, inputs, patterns, true_outputs):
method test_remove_pattern (line 553) | def test_remove_pattern(self, inputs, patterns, true_outputs):
method test_find_and_replace_patterns (line 571) | def test_find_and_replace_patterns(
method test_replace_tikzpictures (line 610) | def test_replace_tikzpictures(self, text_in, figures_in, true_output):
method test_replace_includesvg (line 674) | def test_replace_includesvg(self, text_in, figures_in, true_output):
method test_search_reference_weak (line 681) | def test_search_reference_weak(
method test_search_reference_strong (line 697) | def test_search_reference_strong(
method test_search_reference_filewise (line 858) | def test_search_reference_filewise(
class IntegrationTests (line 876) | class IntegrationTests(parameterized.TestCase):
method setUp (line 878) | def setUp(self):
method _compare_files (line 882) | def _compare_files(self, filename, filename_true):
method test_complete (line 910) | def test_complete(self, input_dir):
method test_png2jpg (line 952) | def test_png2jpg(self, input_dir):
method tearDown (line 998) | def tearDown(self):
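The comment handling that the `test_data` fixtures above encode (an inline comment is cut back to a bare `%`, a whole-line comment disappears, and an escaped `\%` is left alone) can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's `_remove_comments_inline`: the helper name `strip_inline_comment` and its regex are hypothetical, and it deliberately ignores edge cases such as a `%` preceded by a double backslash or one inside a verbatim environment.

```python
import re

# Hypothetical helper for illustration; NOT the repository's implementation.
# Matches a '%' that is not escaped by an immediately preceding backslash.
_UNESCAPED_PERCENT = re.compile(r"(?<!\\)%")


def strip_inline_comment(line: str) -> str:
    """Remove a LaTeX comment from a single line.

    - A whole-line comment is dropped entirely (empty string is returned).
    - An inline comment is cut, but the '%' itself is kept so that TeX
      still swallows the line break, as in the fixture 'Text%'.
    - A line with no unescaped '%' is returned unchanged.
    """
    match = _UNESCAPED_PERCENT.search(line)
    if match is None:
        return line
    before = line[: match.start()]
    if before.strip() == "":
        return ""  # nothing but whitespace precedes the '%'
    newline = "\n" if line.endswith("\n") else ""
    return before + "%" + newline


if __name__ == "__main__":
    for sample in (
        "Text% Inline comment\n",
        "% Whole line comment\n",
        "This is a percent \\%.\n",
    ):
        print(repr(strip_inline_comment(sample)))
```

Keeping the trailing `%` rather than deleting it is what lets the expected `main.tex` outputs preserve lines such as `Text%` without introducing a spurious space in the typeset result.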