Repository: google-research/arxiv-latex-cleaner
Branch: main
Commit: fb7ee5c72100
Files: 36
Total size: 109.9 KB

Directory structure:
gitextract_ut9ensog/
├── .github/
│   └── workflows/
│       └── release-workflow.yml
├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
├── MANIFEST.in
├── README.md
├── __init__.py
├── arxiv_latex_cleaner/
│   ├── __init__.py
│   ├── __main__.py
│   ├── _version.py
│   ├── arxiv_latex_cleaner.py
│   └── tests/
│       └── arxiv_latex_cleaner_test.py
├── cleaner_config.yaml
├── requirements.txt
├── setup.py
└── test_data/
    ├── tex/
    │   ├── figures/
    │   │   ├── data_included.txt
    │   │   ├── data_not_included.txt
    │   │   ├── figure_included.tex
    │   │   ├── figure_included.tikz
    │   │   ├── figure_not_included.tex
    │   │   └── figure_not_included_2.tex
    │   ├── main.aux
    │   ├── main.bbl
    │   ├── main.bib
    │   ├── main.tex
    │   └── not_included/
    │       └── figures/
    │           └── data_included.txt
    ├── tex_arXiv_png2jpg_true/
    │   ├── figures/
    │   │   ├── data_included.txt
    │   │   ├── figure_included.tex
    │   │   └── figure_included.tikz
    │   ├── main.bbl
    │   └── main.tex
    └── tex_arXiv_true/
        ├── figures/
        │   ├── data_included.txt
        │   ├── figure_included.tex
        │   └── figure_included.tikz
        ├── main.bbl
        └── main.tex

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/workflows/release-workflow.yml
================================================
name: Create a GitHub and PyPI release

on:
  push:
    tags:
      - 'v*'

jobs:
  build:
    name: Create a GitHub Release
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Create Release
        id: create_release
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          tag_name: ${{ github.ref }}
          release_name: Release ${{ github.ref }}
          body: ${{ github.ref }} release of `arxiv_latex_cleaner`.
          draft: false
          prerelease: false
  deploy:
    name: Create a PyPI Release
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install setuptools wheel twine
      - name: Build
        run: |
          python setup.py sdist bdist_wheel
      - name: Publish
        env:
          TWINE_USERNAME: '__token__'
          TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
        run: |
          python -m twine upload dist/*

================================================
FILE: .gitignore
================================================
*.pyc
.idea
arxiv-latex-cleaner.iml
arxiv-latex-cleaner.ipr
arxiv-latex-cleaner.iws
arxiv_latex_cleaner.egg-info/
build/
dist/
*.DS_Store

================================================
FILE: CONTRIBUTING.md
================================================
# How to Contribute

We'd love to accept your patches and contributions to this project. There are just a few small guidelines you need to follow.

## Contributor License Agreement

Contributions to this project must be accompanied by a Contributor License Agreement. You (or your employer) retain the copyright to your contribution; this simply gives us permission to use and redistribute your contributions as part of the project. Head over to to see your current agreements on file or to sign a new one.

You generally only need to submit a CLA once, so if you've already submitted one (even if it was for a different project), you probably don't need to do it again.

## Code reviews

All submissions, including submissions by project members, require review. We use GitHub pull requests for this purpose. Consult [GitHub Help](https://help.github.com/articles/about-pull-requests/) for more information on using pull requests.

## Community Guidelines

This project follows [Google's Open Source Community Guidelines](https://opensource.google.com/conduct/).
================================================ FILE: LICENSE ================================================ Apache License Version 2.0, January 2004 http://www.apache.org/licenses/ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 1. Definitions. "License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document. "Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License. "Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity. "You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License. "Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files. "Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types. "Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below). "Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. 
For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof. "Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution." "Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work. 2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form. 3. Grant of Patent License. 
Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed. 4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions: (a) You must give any other recipients of the Work or Derivative Works a copy of this License; and (b) You must cause any modified files to carry prominent notices stating that You changed the files; and (c) You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and (d) If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the 
Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License. You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License. 5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions. 6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file. 7. Disclaimer of Warranty. 
Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License. 8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages. 9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability. END OF TERMS AND CONDITIONS APPENDIX: How to apply the Apache License to your work. 
To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets "[]" replaced with your own identifying information. (Don't include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same "printed page" as the copyright notice for easier identification within third-party archives.

Copyright [yyyy] [name of copyright owner]

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

================================================
FILE: MANIFEST.in
================================================
include LICENSE
include README.md
include requirements.txt

================================================
FILE: README.md
================================================
# `arxiv_latex_cleaner`

This tool allows you to easily clean the LaTeX code of your paper to submit to arXiv. From a folder containing all your code, e.g. `/path/to/latex/`, it creates a new folder `/path/to/latex_arXiv/`, that is ready to ZIP and upload to arXiv.

## Example call:

```bash
arxiv_latex_cleaner /path/to/latex --resize_images --im_size 500 --images_allowlist='{"images/im.png":2000}'
```

Or simply from a config file

```bash
arxiv_latex_cleaner /path/to/latex --config cleaner_config.yaml
```

## Installation:

```bash
pip install arxiv-latex-cleaner
```

| :exclamation: arxiv_latex_cleaner is only compatible with Python >=3.9 :exclamation: |
| ---------------------------------------------------------------------------------- |

If using MacOS, you can install using [Homebrew](https://brew.sh/):

```bash
brew install arxiv_latex_cleaner
```

Alternatively, you can download the source code:

```bash
git clone https://github.com/google-research/arxiv-latex-cleaner
cd arxiv-latex-cleaner/
python -m arxiv_latex_cleaner --help
```

And install as a command-line program directly from the source code:

```bash
python setup.py install
```

## Main features:

#### Privacy-oriented

* Removes all auxiliary files (`.aux`, `.log`, `.out`, etc.).
* Removes all comments from your code (yes, those are visible on arXiv and you do not want them to be). These also include `\begin{comment}\end{comment}`, `\iffalse\fi`, and `\if0\fi` environments.
* Optionally removes user-defined commands entered with `commands_to_delete` (such as `\todo{}` that you redefine as the empty string at the end).
* Optionally allows you to define custom regex replacement rules through a `cleaner_config.yaml` file.

#### Size-oriented

There is a 50MB limit on arXiv submissions, so to make it fit:

* Removes all unused `.tex` files (those that are not in the root and not included in any other `.tex` file).
* Removes all unused images that take up space (those that are not actually included in any used `.tex` file).
* Optionally resizes all images to `im_size` pixels, to reduce the size of the submission. You can allowlist some images to skip the global size using `images_allowlist`.
* Optionally compresses `.pdf` files using ghostscript (Linux and Mac only).
  You can allowlist some PDFs to skip the global size using `images_allowlist`.
* Optionally converts PNG images to JPG format to reduce file size.

#### TikZ picture source code concealment

To prevent the upload of tikzpicture source code or raw simulation data, this feature:

* Replaces the tikzpicture environment `\begin{tikzpicture} ... \end{tikzpicture}` with the respective `\includegraphics{EXTERNAL_TIKZ_FOLDER/picture_name.pdf}`.
* Requires externally compiled TikZ pictures as `.pdf` files in folder `EXTERNAL_TIKZ_FOLDER`. See section 52 (Externalization Library) in the [PGF/TikZ manual](https://ctan.org/pkg/pgf?lang=en) on TikZ picture externalization.
* Only replaces environments with preceding `\tikzsetnextfilename{picture_name}` command (as in `\tikzsetnextfilename{picture_name}\begin{tikzpicture} ... \end{tikzpicture}`) where the externalized `picture_name.pdf` filename matches `picture_name`.

#### More sophisticated pattern replacement based on regex group captures

Sometimes it is useful to work with a set of custom LaTeX commands when writing a paper. To get rid of them upon arXiv submission, one can simply revert them to plain LaTeX with a regular expression insertion.

```yaml
{
    "pattern" : '(?:\\figcomp{\s*)(?P<first>.*?)\s*}\s*{\s*(?P<second>.*?)\s*}\s*{\s*(?P<third>.*?)\s*}',
    "insertion" : '\parbox[c]{{ {second} \linewidth}} {{ \includegraphics[width= {third} \linewidth]{{figures/{first} }} }}',
    "description" : "Replace figcomp"
}
```

The pattern above will find all `\figcomp{path}{w1}{w2}` commands and replace them with `\parbox[c]{w1\linewidth}{\includegraphics[width=w2\linewidth]{figures/path}}`. Note that the insertion template is filled with the [named groups captures](https://docs.python.org/3/library/re.html#regular-expression-examples) from the pattern. Note that the replacement is processed **before** all `\includegraphics` commands are processed and corresponding file paths are copied, making sure all figure files are copied to the cleaned version.
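
The named-group mechanics can be sketched with Python's standard `re` module and `str.format`. This is only an illustration under stated assumptions, not the tool's own implementation: the helper name `apply_rule`, the `RULE` dictionary, and the sample input are hypothetical, the braces in the pattern are escaped for readability, and the template is written without padding spaces so that the output matches the replacement quoted in the paragraph.

```python
import re

# Hypothetical rule mirroring the figcomp example: three named groups in the
# pattern, referenced by name in the insertion template.
RULE = {
    "pattern": (
        r"\\figcomp\{\s*(?P<first>.*?)\s*\}"
        r"\s*\{\s*(?P<second>.*?)\s*\}"
        r"\s*\{\s*(?P<third>.*?)\s*\}"
    ),
    "insertion": (
        r"\parbox[c]{{{second}\linewidth}}"
        r"{{\includegraphics[width={third}\linewidth]{{figures/{first}}}}}"
    ),
}


def apply_rule(text, rule):
    """Replace every match of rule['pattern'] with the filled-in template."""

    def fill(match):
        # Named groups become keyword arguments of str.format; doubled
        # braces in the template come out as single literal braces.
        return rule["insertion"].format(**match.groupdict())

    # Passing a function (not a string) keeps the backslashes in the
    # result from being interpreted as regex replacement escapes.
    return re.sub(rule["pattern"], fill, text)


if __name__ == "__main__":
    print(apply_rule(r"\figcomp{plot.pdf}{0.5}{0.9}", RULE))
```

Text that contains no `\figcomp` command passes through unchanged, so a rule of this shape can be applied to every `.tex` file without special-casing.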

See also [cleaner_config.yaml](cleaner_config.yaml) for details on how to specify the patterns.

## Usage:

```
usage: arxiv_latex_cleaner@v1.0.10 [-h] [--resize_images] [--im_size IM_SIZE]
                                   [--compress_pdf]
                                   [--pdf_im_resolution PDF_IM_RESOLUTION]
                                   [--images_allowlist IMAGES_ALLOWLIST]
                                   [--keep_bib]
                                   [--commands_to_delete COMMANDS_TO_DELETE [COMMANDS_TO_DELETE ...]]
                                   [--commands_only_to_delete COMMANDS_ONLY_TO_DELETE [COMMANDS_ONLY_TO_DELETE ...]]
                                   [--environments_to_delete ENVIRONMENTS_TO_DELETE [ENVIRONMENTS_TO_DELETE ...]]
                                   [--if_exceptions IF_EXCEPTIONS [IF_EXCEPTIONS ...]]
                                   [--use_external_tikz USE_EXTERNAL_TIKZ]
                                   [--svg_inkscape [SVG_INKSCAPE]]
                                   [--convert_png_to_jpg]
                                   [--png_quality PNG_QUALITY]
                                   [--png_size_threshold PNG_SIZE_THRESHOLD]
                                   [--config CONFIG] [--verbose]
                                   input_folder

Clean the LaTeX code of your paper to submit to arXiv. Check the README for
more information on the use.

positional arguments:
  input_folder          Input folder containing the LaTeX code.

optional arguments:
  -h, --help            show this help message and exit
  --resize_images       Resize images.
  --im_size IM_SIZE     Size of the output images (in pixels, longest side).
                        Fine tune this to get as close to 10MB as possible.
  --compress_pdf        Compress PDF images using ghostscript (Linux and Mac
                        only).
  --pdf_im_resolution PDF_IM_RESOLUTION
                        Resolution (in dpi) to which the tool resamples the
                        PDF images.
  --images_allowlist IMAGES_ALLOWLIST
                        Images (and PDFs) that won't be resized to the default
                        resolution, but the one provided here. Value is pixel
                        for images, and dpi for PDFs, as in --im_size and
                        --pdf_im_resolution, respectively. Format is a
                        dictionary as: '{"path/to/im.jpg": 1000}'
  --keep_bib            Avoid deleting the *.bib files.
  --commands_to_delete COMMANDS_TO_DELETE [COMMANDS_TO_DELETE ...]
                        LaTeX commands that will be deleted. Useful for e.g.
                        user-defined \todo commands. For example, to delete
                        all occurrences of \todo1{} and \todo2{}, run the tool
                        with `--commands_to_delete todo1 todo2`. Please note
                        that the positional argument `input_folder` cannot
                        come immediately after `commands_to_delete`, as the
                        parser does not have any way to know if it's another
                        command to delete.
  --commands_only_to_delete COMMANDS_ONLY_TO_DELETE [COMMANDS_ONLY_TO_DELETE ...]
                        LaTeX commands that will be deleted but the text
                        wrapped in the commands will be retained. Useful for
                        commands that change text formats and colors, which
                        you may want to remove but keep the text within.
                        Usages are exactly the same as commands_to_delete.
                        Note that if the commands listed here duplicate that
                        after commands_to_delete, the default action will be
                        retaining the wrapped text.
  --environments_to_delete ENVIRONMENTS_TO_DELETE [ENVIRONMENTS_TO_DELETE ...]
                        LaTeX environments that will be deleted. Useful for
                        e.g. user-defined comment environments. For example,
                        to delete all occurrences of \begin{note} ...
                        \end{note}, run the tool with
                        `--environments_to_delete note`. Please note that the
                        positional argument `input_folder` cannot come
                        immediately after `environments_to_delete`, as the
                        parser does not have any way to know if it's another
                        environment to delete.
  --if_exceptions IF_EXCEPTIONS [IF_EXCEPTIONS ...]
                        Constant TeX primitive conditionals (\iffalse,
                        \iftrue, etc.) are simplified, i.e., true branches are
                        kept, false branches deleted. To parse the conditional
                        constructs correctly, all commands starting with `\if`
                        are assumed to be TeX primitive conditionals (e.g.,
                        declared by \newif\ifvar). Some known exceptions to
                        this rule are already included (e.g., \iff,
                        \ifthenelse, etc.), but you can add custom exceptions
                        using `--if_exceptions iffalt`.
  --use_external_tikz USE_EXTERNAL_TIKZ
                        Folder (relative to input folder) containing
                        externalized tikz figures in PDF format.
  --svg_inkscape [SVG_INKSCAPE]
                        Include PDF files generated by Inkscape via the
                        `\includesvg` command from the `svg` package. This is
                        done by replacing the `\includesvg` calls with
                        `\includeinkscape` calls pointing to the generated
                        `.pdf_tex` files. By default, these files and the
                        generated PDFs are located under `./svg-inkscape`
                        (relative to the input folder), but a different path
                        (relative to the input folder) can be provided in case
                        a different `inkscapepath` was set when loading the
                        `svg` package.
  --convert_png_to_jpg  Convert PNG images to JPG format to reduce file size
  --png_quality PNG_QUALITY
                        JPG quality for PNG conversion (0-100, default: 50)
  --png_size_threshold PNG_SIZE_THRESHOLD
                        Minimum PNG file size in MB to apply quality reduction
                        (default: 0.5)
  --config CONFIG       Read settings from `.yaml` config file. If command
                        line arguments are provided additionally, the config
                        file parameters are updated with the command line
                        parameters.
  --verbose             Enable detailed output.
```

## Testing:

```bash
python -m unittest arxiv_latex_cleaner.tests.arxiv_latex_cleaner_test
```

## Note

This is not an officially supported Google product.

================================================
FILE: __init__.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

================================================
FILE: arxiv_latex_cleaner/__init__.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

================================================
FILE: arxiv_latex_cleaner/__main__.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Main module for ``arxiv_latex_cleaner``.

.. code-block:: bash

    $ python -m arxiv_latex_cleaner --help
"""

import argparse
import json
import logging

import yaml

from ._version import __version__
from .arxiv_latex_cleaner import merge_args_into_config
from .arxiv_latex_cleaner import run_arxiv_cleaner

PARSER = argparse.ArgumentParser(
    prog="arxiv_latex_cleaner@{0}".format(__version__),
    description=(
        "Clean the LaTeX code of your paper to submit to arXiv. "
        "Check the README for more information on the use."
    ),
)

PARSER.add_argument(
    "input_folder",
    type=str,
    help="Input folder or zip archive containing the LaTeX code.",
)

PARSER.add_argument(
    "--resize_images",
    action="store_true",
    help="Resize images.",
)

PARSER.add_argument(
    "--im_size",
    default=500,
    type=int,
    help=(
        "Size of the output images (in pixels, longest side). Fine tune this "
        "to get as close to 10MB as possible."
    ),
)

PARSER.add_argument(
    "--compress_pdf",
    action="store_true",
    help="Compress PDF images using ghostscript (Linux and Mac only).",
)

PARSER.add_argument(
    "--pdf_im_resolution",
    default=500,
    type=int,
    help="Resolution (in dpi) to which the tool resamples the PDF images.",
)

PARSER.add_argument(
    "--images_allowlist",
    default={},
    type=json.loads,
    help=(
        "Images (and PDFs) that won't be resized to the default resolution, "
        "but the one provided here. Value is pixel for images, and dpi for "
        "PDFs, as in --im_size and --pdf_im_resolution, respectively. Format "
        "is a dictionary as: '{\"path/to/im.jpg\": 1000}'"
    ),
)

PARSER.add_argument(
    "--keep_bib",
    action="store_true",
    help="Avoid deleting the *.bib files.",
)

PARSER.add_argument(
    "--commands_to_delete",
    nargs="+",
    default=[],
    required=False,
    help=(
        "LaTeX commands that will be deleted. Useful for e.g. user-defined "
        "\\todo commands. For example, to delete all occurrences of \\todo1{} "
        "and \\todo2{}, run the tool with `--commands_to_delete todo1 todo2`. "
        "Please note that the positional argument `input_folder` cannot come "
        "immediately after `commands_to_delete`, as the parser does not have "
        "any way to know if it's another command to delete."
    ),
)

PARSER.add_argument(
    "--commands_only_to_delete",
    nargs="+",
    default=[],
    required=False,
    help=(
        "LaTeX commands that will be deleted but the text wrapped in the"
        " commands will be retained. Useful for commands that change text"
        " formats and colors, which you may want to remove but keep the text"
        " within. Usages are exactly the same as commands_to_delete. Note that"
        " if the commands listed here duplicate that after commands_to_delete,"
        " the default action will be retaining the wrapped text."
    ),
)

PARSER.add_argument(
    "--environments_to_delete",
    nargs="+",
    default=[],
    required=False,
    help=(
        "LaTeX environments that will be deleted. Useful for e.g. user-"
        "defined comment environments. For example, to delete all occurrences "
        "of \\begin{note} ... \\end{note}, run the tool with "
        "`--environments_to_delete note`. Please note that the positional "
        "argument `input_folder` cannot come immediately after "
        "`environments_to_delete`, as the parser does not have any way to "
        "know if it's another environment to delete."
    ),
)


def if_prefixed(orig_string):
  """Strips a leading backslash and checks that the name starts with 'if'."""
  if orig_string.startswith("\\"):
    string = orig_string[1:]
  else:
    string = orig_string
  if not string.startswith("if"):
    raise argparse.ArgumentTypeError(
        f"Expected a string starting with 'if', got '{orig_string}'!"
    )
  return string


PARSER.add_argument(
    "--if_exceptions",
    nargs="+",
    default=[],
    required=False,
    type=if_prefixed,
    help=(
        "Constant TeX primitive conditionals (\\iffalse, \\iftrue, etc.) are "
        "simplified, i.e., true branches are kept, false branches deleted. "
        "To parse the conditional constructs correctly, all commands starting "
        "with `\\if` are assumed to be TeX primitive conditionals (e.g., "
        "declared by \\newif\\ifvar). Some known exceptions to this rule are "
        "already included (e.g., \\iff, \\ifthenelse, etc.), but you can add "
        "custom exceptions using `--if_exceptions iffalt`."
    ),
)

PARSER.add_argument(
    "--use_external_tikz",
    type=str,
    help=(
        "Folder (relative to input folder) containing externalized tikz "
        "figures in PDF format."
    ),
)

PARSER.add_argument(
    "--svg_inkscape",
    nargs="?",
    type=str,
    const="svg-inkscape",
    help=(
        "Include PDF files generated by Inkscape via the `\\includesvg` "
        "command from the `svg` package. This is done by replacing the "
        "`\\includesvg` calls with `\\includeinkscape` calls pointing to the "
        "generated `.pdf_tex` files. By default, these files and the "
        "generated PDFs are located under `./svg-inkscape` (relative to the "
        "input folder), but a different path (relative to the input folder) "
        "can be provided in case a different `inkscapepath` was set when "
        "loading the `svg` package."
    ),
)

PARSER.add_argument(
    "--convert_png_to_jpg",
    action="store_true",
    help=(
        "Convert PNG images to JPG format to reduce file size. Note that "
        "this will override --resize_images for PNG files."
    ),
)

PARSER.add_argument(
    "--png_quality",
    type=int,
    default=50,
    help="JPG quality for PNG conversion (0-100, default: 50)",
)

PARSER.add_argument(
    "--png_size_threshold",
    type=float,
    default=0.5,
    help="Minimum PNG file size in MB to apply quality reduction (default: 0.5)",
)

PARSER.add_argument(
    "--config",
    type=str,
    help=(
        "Read settings from `.yaml` config file. If command line arguments "
        "are provided additionally, the config file parameters are updated "
        "with the command line parameters."
    ),
    required=False,
)

PARSER.add_argument(
    "--verbose",
    action="store_true",
    help="Enable detailed output.",
)

# ARGS is a plain dict (vars() of the parsed namespace), so its entries are
# accessed by key, not by attribute.
ARGS = vars(PARSER.parse_args())

if ARGS["config"] is not None:
  try:
    with open(ARGS["config"], "r") as config_file:
      config_params = yaml.safe_load(config_file)
    final_args = merge_args_into_config(ARGS, config_params)
  except FileNotFoundError:
    print(f"config file {ARGS['config']} not found.")
    final_args = ARGS
    final_args.pop("config", None)
else:
  final_args = ARGS

if final_args.get("verbose", False):
  logging.basicConfig(level=logging.INFO)
else:
  logging.basicConfig(level=logging.ERROR)

run_arxiv_cleaner(final_args)
exit(0)

================================================
FILE: arxiv_latex_cleaner/_version.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. __version__ = "v1.0.10" ================================================ FILE: arxiv_latex_cleaner/arxiv_latex_cleaner.py ================================================ # coding=utf-8 # Copyright 2018 The Google Research Authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. """Cleans the LaTeX code of your paper to submit to arXiv.""" import collections import contextlib import copy import logging import os import pathlib import shutil import subprocess import tempfile from PIL import Image import regex PDF_RESIZE_COMMAND = ( 'gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH ' '-dDownsampleColorImages=true -dColorImageResolution={resolution} ' '-dColorImageDownsampleThreshold=1.0 -dAutoRotatePages=/None ' '-sOutputFile={output} {input}' ) MAX_FILENAME_LENGTH = 120 # Fix for Windows: Even if '\' (os.sep) is the standard way of making paths on # Windows, it interferes with regular expressions. We just change os.sep to '/' # and os.path.join to a version using '/' as Windows will handle it the right # way. 
if os.name == 'nt': global old_os_path_join def new_os_join(path, *args): res = old_os_path_join(path, *args) res = res.replace('\\', '/') return res old_os_path_join = os.path.join os.sep = '/' os.path.join = new_os_join def _create_dir_erase_if_exists(path): if os.path.exists(path): shutil.rmtree(path) os.makedirs(path) def _create_dir_if_not_exists(path): if not os.path.exists(path): os.makedirs(path) def _keep_pattern(haystack, patterns_to_keep): """Keeps the strings that match 'patterns_to_keep'.""" out = [] for item in haystack: if any((regex.findall(rem, item) for rem in patterns_to_keep)): out.append(item) return out def _remove_pattern(haystack, patterns_to_remove): """Removes the strings that match 'patterns_to_remove'.""" return [ item for item in haystack if item not in _keep_pattern([item], patterns_to_remove) ] def _list_all_files(in_folder, ignore_dirs=None): if ignore_dirs is None: ignore_dirs = [] to_consider = [ os.path.join(os.path.relpath(path, in_folder), name) if path != in_folder else name for path, _, files in os.walk(in_folder) for name in files ] return _remove_pattern(to_consider, ignore_dirs) def _copy_file(filename, params): _create_dir_if_not_exists( os.path.join(params['output_folder'], os.path.dirname(filename)) ) shutil.copy( os.path.join(params['input_folder'], filename), os.path.join(params['output_folder'], filename), ) def _remove_command(text, command, keep_text=False): """Removes '\\command{*}' from the string 'text'. 
Regex `base_pattern` used to match balanced parentheses taken from: https://stackoverflow.com/questions/546433/regular-expression-to-match-balanced-parentheses/35271017#35271017 """ base_pattern = ( r'\\' + command + r'(?:\[(?:.*?)\])*\{((?:[^{}]+|\{(?1)\})*)\}(?:\[(?:.*?)\])*' ) def extract_text_inside_curly_braces(text): """Extract text inside of {} from command string""" pattern = r'\{((?:[^{}]|(?R))*)\}' match = regex.search(pattern, text) if match: return match.group(1) else: return '' # Loops in case of nested commands that need to retain text, e.g., # \red{hello \red{world}}. while True: all_substitutions = [] has_match = False for match in regex.finditer(base_pattern, text): # In case there are only spaces or nothing up to the following newline, # adds a percent, not to alter the newlines. has_match = True if not keep_text: new_substring = '' else: temp_substring = text[match.span()[0] : match.span()[1]] new_substring = extract_text_inside_curly_braces(temp_substring) if match.span()[1] < len(text): next_newline = text[match.span()[1] :].find('\n') if next_newline != -1: text_until_newline = text[ match.span()[1] : match.span()[1] + next_newline ] if ( not text_until_newline or text_until_newline.isspace() ) and not keep_text: new_substring = '%' all_substitutions.append( (match.span()[0], match.span()[1], new_substring) ) for start, end, new_substring in reversed(all_substitutions): text = text[:start] + new_substring + text[end:] if not keep_text or not has_match: break return text def _remove_environment(text, environment): """Removes '\\begin{environment}*\\end{environment}' from 'text'.""" # Need to escape '{', to not trigger fuzzy matching if `environment` starts # with one of 'i', 'd', 's', or 'e' return regex.sub( r'\\begin\{' + environment + r'}[\s\S]*?\\end\{' + environment + r'}', '', text, ) def _simplify_conditional_blocks(text, if_exceptions=[]): r"""Simplify possibly nested conditional blocks from 'text'. 
For example, `\iffalse TEST1\else TEST2\fi` is simplified to `TEST2`, and `\iftrue TEST1\else TEST2\fi` is simplified to `TEST1`. Unknown conditionals are left untouched. If the conditional tree is malformed, the function will print a warning to stderr and return the original text. """ p = regex.compile(r'(?!(?<=\\newif\s*))\\if\s*(\w+)|\\else(?!\w)|\\fi(?!\w)') toplevel_tree = {'left': [], 'right': [], 'kind': 'toplevel', 'parent': None} tree = toplevel_tree exceptions = [ # TeX primitives 'iff', # package etoolbox 'ifpatchable', 'ifpatchable*', 'ifbool', 'iftoggle', 'ifdef', 'ifcsdef', 'ifundef', 'ifcsundef', 'ifdefmacro', 'ifcsmacro', 'ifdefparam', 'ifcsparam', 'ifcsprefix', 'ifdefprotected', 'ifcsprotected', 'ifdefltxprotect', 'ifcsltxprotect', 'ifdefempty', 'ifcsempty', 'ifdefvoid', 'ifcsvoid', 'ifdefequal', 'ifcsequal', 'ifdefstring', 'ifcsstring', 'ifdefstrequal', 'ifcsstrequal', 'ifdefcounter', 'ifcscounter', 'ifltxcounter', 'ifdeflength', 'ifcslength', 'ifdefdimen', 'ifcsdimen', 'ifstrequal', 'ifstrempty', 'ifblank', 'ifnumcomp', 'ifnumequal', 'ifnumodd', 'ifdimcomp', 'ifdimequal', 'ifdimgreater', 'ifdimless', 'ifboolexpr', 'ifboolexpe', 'ifinlist', 'ifinlistcs', 'ifrmnum', # package hyperref 'ifpdfstringunicode', # package ifthen 'ifthenelse', ] + if_exceptions def new_subtree(kind): return {'kind': kind, 'left': [], 'right': []} def add_subtree(tree, subtree): if 'else' not in tree: tree['left'].append(subtree) else: tree['right'].append(subtree) subtree['parent'] = tree def print_tree(tree, indent, write): if 'start' in tree: write(' ' * indent + tree['start'].group() + '\n') for subtree in tree['left']: print_tree(subtree, indent + 2, write) if 'else' in tree: write(' ' * indent + tree['else'].group() + '\n') for subtree in tree['right']: print_tree(subtree, indent + 2, write) if 'end' in tree: write(' ' * indent + tree['end'].group() + '\n') def print_abort(error_finding): os.sys.stderr.write( f'Warning: Found {error_finding}! 
Not removing any conditional' ' blocks...\n' ) os.sys.stderr.write( f' This is the matched tree (as built up to the error):\n' ) print_tree(toplevel_tree, indent=9, write=os.sys.stderr.write) os.sys.stderr.write( f' Potentially, you need to supply an exception using' f" --if_exceptions.\n" ) for m in p.finditer(text): m_no_space = m.group().replace(' ', '') if m_no_space == r'\iffalse' or m_no_space == r'\if0': subtree = new_subtree('iffalse') subtree['start'] = m add_subtree(tree, subtree) tree = subtree elif m_no_space == r'\iftrue' or m_no_space == r'\if1': subtree = new_subtree('iftrue') subtree['start'] = m add_subtree(tree, subtree) tree = subtree elif m_no_space.startswith(r'\if'): if m_no_space[1:] in exceptions: continue subtree = new_subtree('unknown') subtree['start'] = m add_subtree(tree, subtree) tree = subtree elif m_no_space == r'\else': if tree['parent'] is None: print_abort(r'unmatched \else') return text elif 'else' in tree: print_abort(r'duplicate \else') return text tree['else'] = m elif m.group() == r'\fi': if tree['parent'] is None: print_abort(r'unmatched \fi') return text tree['end'] = m tree = tree['parent'] else: raise RuntimeError('Unreachable!') if tree['parent'] is not None: print_abort('unmatched ' + tree['start'].group()) return text positions_to_delete = [] def traverse_tree(tree): if tree['kind'] == 'iffalse': if 'else' in tree: positions_to_delete.append((tree['start'].start(), tree['else'].end())) for subtree in tree['right']: traverse_tree(subtree) positions_to_delete.append((tree['end'].start(), tree['end'].end())) else: positions_to_delete.append((tree['start'].start(), tree['end'].end())) elif tree['kind'] == 'iftrue': if 'else' in tree: positions_to_delete.append((tree['start'].start(), tree['start'].end())) for subtree in tree['left']: traverse_tree(subtree) positions_to_delete.append((tree['else'].start(), tree['end'].end())) else: positions_to_delete.append((tree['start'].start(), tree['start'].end())) 
positions_to_delete.append((tree['end'].start(), tree['end'].end())) elif tree['kind'] == 'unknown': for subtree in tree['left']: traverse_tree(subtree) for subtree in tree['right']: traverse_tree(subtree) else: raise ValueError('Unreachable!') for tree in toplevel_tree['left']: traverse_tree(tree) for start, end in reversed(positions_to_delete): if end < len(text) and text[end].isspace(): end_to_del = end + 1 else: end_to_del = end text = text[:start] + text[end_to_del:] return text def _remove_comments_inline(text): """Removes the comments from the string 'text' and ignores % inside \\url{}.""" auto_ignore_pattern = r'(%\s*auto-ignore).*' if regex.search(auto_ignore_pattern, text): return regex.sub(auto_ignore_pattern, r'\1', text) if text.lstrip(' ').lstrip('\t').startswith('%'): return '' url_pattern = r'\\url\{(?>[^{}]|(?R))*\}' def remove_comments(segment): """Check if a segment of text contains a comment and remove it.""" if segment.lstrip().startswith('%'): return '', True match = regex.search(r'(? 
lines[i].index(end_str): return lines[: i + 1] return lines def _read_file_content(filename): with open(filename, 'r', encoding='utf-8') as fp: lines = fp.readlines() lines = _strip_tex_contents(lines, '\\end{document}') return lines def _read_all_tex_contents(tex_files, parameters): contents = {} for fn in tex_files: contents[fn] = _read_file_content( os.path.join(parameters['input_folder'], fn) ) return contents def _write_file_content(content, filename): _create_dir_if_not_exists(os.path.dirname(filename)) with open(filename, 'w', encoding='utf-8') as fp: return fp.write(content) def _remove_comments_and_commands_to_delete(content, parameters): """Erases all LaTeX comments in the content, and writes it.""" content = [_remove_comments_inline(line) for line in content] content = _remove_environment(''.join(content), 'comment') content = _simplify_conditional_blocks( content, parameters.get('if_exceptions', []) ) for environment in parameters.get('environments_to_delete', []): content = _remove_environment(content, environment) for command in parameters.get('commands_only_to_delete', []): content = _remove_command(content, command, True) for command in parameters['commands_to_delete']: content = _remove_command(content, command, False) return content def _replace_tikzpictures(content, figures): """Replaces all tikzpicture environments (with includegraphic commands of external PDF figures) in the content, and writes it. 
""" def get_figure(matchobj): found_tikz_filename = regex.search( r'\\tikzsetnextfilename{(.*?)}', matchobj.group(0) ).group(1) # search in tex split if figure is available matching_tikz_filenames = _keep_pattern( figures, ['/' + found_tikz_filename + '.pdf'] ) if len(matching_tikz_filenames) == 1: return '\\includegraphics{' + matching_tikz_filenames[0] + '}' else: return matchobj.group(0) content = regex.sub( r'\\tikzsetnextfilename{[\s\S]*?\\end{tikzpicture}', get_figure, content ) return content def _replace_includesvg(content, svg_inkscape_files): def repl_svg(matchobj): svg_path = matchobj.group(2) if svg_path.endswith('.svg'): svg_path = '_'.join(svg_path.rsplit('.', 1)) svg_filename = os.path.basename(svg_path) # search in svg_inkscape split if pdf_tex file is available matching_pdf_tex_files = _keep_pattern( svg_inkscape_files, ['/' + svg_filename + '-tex.pdf_tex'] ) if len(matching_pdf_tex_files) == 1: options = '' if matchobj.group(1) is None else matchobj.group(1) res = f'\\includeinkscape{options}{{{matching_pdf_tex_files[0]}}}' return res else: return matchobj.group(0) content = regex.sub(r'\\includesvg(\[.*?\])?{(.*?)}', repl_svg, content) return content def _resize_and_copy_figure( filename, origin_folder, destination_folder, resize_image, image_size, compress_pdf, pdf_resolution, convert_png_to_jpg=False, png_quality=50, png_size_threshold=0.5, verbose=False ): """Resizes and copies the input figure (either JPG, PNG, or PDF). Parameters: filename: The input filename origin_folder: The folder containing the input filename destination_folder: The folder to copy the output filename to resize_image: Whether to resize the image image_size: The maximum size of the image in pixels compress_pdf: Whether to compress the PDF file convert_png_to_jpg: Whether to convert PNG files to JPG format. Note that this will override resize_image for PNG files. 
png_quality: JPG quality for converted PNG files (0-100) png_size_threshold: Minimum file size in MB to apply quality reduction verbose: Enable verbose logging Returns: str: The actual output filename (may differ from input if PNG was converted) """ _create_dir_if_not_exists( os.path.join(destination_folder, os.path.dirname(filename)) ) if convert_png_to_jpg and os.path.splitext(filename)[1].lower() in ['.png']: original_size_mb = os.path.getsize(os.path.join(origin_folder, filename)) / (1024 * 1024) im = Image.open(os.path.join(origin_folder, filename)) # Determine quality based on file size if original_size_mb < png_size_threshold: quality = 100 # Keep high quality for small files if verbose: print(f"Keeping original quality for small PNG: {filename}") else: quality = png_quality if verbose: print(f"Converting PNG to JPG with quality {quality}: {filename}") # Convert PNG to JPG output_filename = os.path.splitext(filename)[0] + '.jpg' rgb_img = im.convert('RGB') rgb_img.save(os.path.join(destination_folder, output_filename), 'JPEG', quality=quality) if verbose: print(f"Converted: {filename} -> {output_filename}") return output_filename if resize_image and os.path.splitext(filename)[1].lower() in [ '.jpg', '.jpeg', '.png', ]: try: im = Image.open(os.path.join(origin_folder, filename)) if max(im.size) > image_size: im = im.resize( tuple([int(x * float(image_size) / max(im.size)) for x in im.size]), Image.Resampling.LANCZOS, ) if os.path.splitext(filename)[1].lower() in ['.jpg', '.jpeg']: im.save(os.path.join(destination_folder, filename), 'JPEG', quality=90) return filename elif os.path.splitext(filename)[1].lower() in ['.png']: im.save(os.path.join(destination_folder, filename), 'PNG') return filename except Exception as e: if verbose: print(f"Failed to process image {filename}: {e}") # Fall back to simple copy shutil.copy( os.path.join(origin_folder, filename), os.path.join(destination_folder, filename), ) return filename elif compress_pdf and 
os.path.splitext(filename)[1].lower() == '.pdf': _resize_pdf_figure( filename, origin_folder, destination_folder, pdf_resolution ) return filename else: shutil.copy( os.path.join(origin_folder, filename), os.path.join(destination_folder, filename), ) return filename def _update_image_references(tex_contents_dict, old_filename, new_filename, verbose=False): """Update references from old_filename to new_filename in all tex content.""" if old_filename == new_filename: return # No change needed old_base = os.path.splitext(old_filename)[0] new_base = os.path.splitext(new_filename)[0] if verbose: print(f"Updating LaTeX references: {old_filename} -> {new_filename}") for tex_file in tex_contents_dict: # Handle both string and list content if isinstance(tex_contents_dict[tex_file], list): content = ''.join(tex_contents_dict[tex_file]) else: content = tex_contents_dict[tex_file] content_changed = False # Pattern 1: Direct filename with full extension, handling comments and newlines pattern1 = r'(\{(?:%\s*\n\s*)?[^}]*?)' + regex.escape(old_filename) + r'((?:%\s*\n\s*)?[^}]*?\})' replacement1 = r'\1' + new_filename + r'\2' new_content = regex.sub(pattern1, replacement1, content, flags=regex.IGNORECASE | regex.DOTALL) if new_content != content: content = new_content content_changed = True if verbose: print(f"Applied pattern 1 (full filename) in {tex_file}") else: # Pattern 2: Base filename without extension, handling comments and newlines # Only apply this if Pattern 1 didn't match to avoid double replacements pattern2 = r'(\{(?:%\s*\n\s*)?[^}]*?)' + regex.escape(old_base) + r'((?:%\s*\n\s*)?[^}]*?\})' replacement2 = r'\1' + new_base + r'.jpg\2' new_content = regex.sub(pattern2, replacement2, content, flags=regex.IGNORECASE | regex.DOTALL) if new_content != content: content = new_content content_changed = True if verbose: print(f"Applied pattern 2 (base filename) in {tex_file}") else: # Pattern 3: Handle cases where extension is split across lines with comments # This 
specifically targets patterns like: images/filename%\n.png pattern3 = r'(\{[^}]*?)' + regex.escape(old_base) + r'(%\s*\n\s*)(\.png)([^}]*?\})' replacement3 = r'\1' + new_base + r'\2.jpg\4' new_content = regex.sub(pattern3, replacement3, content, flags=regex.IGNORECASE | regex.DOTALL) if new_content != content: content = new_content content_changed = True if verbose: print(f"Applied pattern 3 (split extension) in {tex_file}") # Update the content back in the appropriate format if content_changed: if isinstance(tex_contents_dict[tex_file], list): # Convert back to list format, preserving line endings tex_contents_dict[tex_file] = content.split('\n') else: tex_contents_dict[tex_file] = content if verbose: print(f"Updated references in {tex_file}") # Re-write the updated tex files to the output directory if verbose and any(tex_contents_dict.values()): print("Re-writing updated tex files...") return tex_contents_dict def _resize_pdf_figure( filename, origin_folder, destination_folder, resolution, timeout=10 ): input_file = os.path.join(origin_folder, filename) output_file = os.path.join(destination_folder, filename) bash_command = PDF_RESIZE_COMMAND.format( input=input_file, output=output_file, resolution=resolution ) process = subprocess.Popen(bash_command.split(), stdout=subprocess.PIPE) try: process.communicate(timeout=timeout) except subprocess.TimeoutExpired: process.kill() outs, errs = process.communicate() print('Output: ', outs) print('Errors: ', errs) def _copy_only_referenced_non_tex_not_in_root(parameters, contents, splits): for fn in _keep_only_referenced( splits['non_tex_not_in_root'], contents, strict=True ): _copy_file(fn, parameters) def _resize_and_copy_figures_if_referenced(parameters, contents, splits): """Modified to handle PNG to JPG conversion and reference updates.""" image_size = collections.defaultdict(lambda: parameters['im_size']) image_size.update(parameters['images_allowlist']) pdf_resolution = collections.defaultdict( lambda: 
parameters['pdf_im_resolution'] ) pdf_resolution.update(parameters['images_allowlist']) # contents is the full content string for reference checking filename_changes = {} # Track PNG -> JPG filename changes for image_file in _keep_only_referenced( splits['figures'], contents, strict=False ): actual_output_filename = _resize_and_copy_figure( filename=image_file, origin_folder=parameters['input_folder'], destination_folder=parameters['output_folder'], resize_image=parameters['resize_images'], image_size=image_size[image_file], compress_pdf=parameters['compress_pdf'], pdf_resolution=pdf_resolution[image_file], convert_png_to_jpg=parameters.get('convert_png_to_jpg', False), png_quality=parameters.get('png_quality', 50), png_size_threshold=parameters.get('png_size_threshold', 0.5), verbose=parameters.get('verbose', False) ) # Track filename changes for reference updates if actual_output_filename != image_file: filename_changes[image_file] = actual_output_filename return filename_changes def _search_reference(filename, contents, strict=False): """Returns a match object if filename is referenced in contents, and None otherwise. If not strict mode, path prefix and extension are optional. 
""" if strict: # regex pattern for strict=True for path/to/img.ext: # \{[\s%]*path/to/img\.ext[\s%]*\} filename_regex = filename.replace('.', r'\.') else: filename_path = pathlib.Path(filename) # make extension optional root, extension = filename_path.stem, filename_path.suffix basename_regex = '{}({})?'.format( regex.escape(root), regex.escape(extension) ) # iterate through parent fragments to make path prefix optional path_prefix_regex = '' for fragment in reversed(filename_path.parents): if fragment.name == '.': continue fragment = regex.escape(fragment.name) path_prefix_regex = '({}{}{})?'.format( path_prefix_regex, fragment, os.sep ) # Regex pattern for strict=True for path/to/img.ext: # \{[\s%]*()?()?[\s%]*\} filename_regex = path_prefix_regex + basename_regex # Some files 'path/to/file' are referenced in tex as './path/to/file' thus # adds prefix for relative paths starting with './' or '.\' to regex search. filename_regex = r'(.' + os.sep + r')?' + filename_regex # Pads with braces and optional whitespace/comment characters. patn = r'\{{[\s%]*{}[\s%]*\}}'.format(filename_regex) # Picture references in LaTeX are allowed to be in different cases. return regex.search(patn, contents, regex.IGNORECASE) def _keep_only_referenced(filenames, contents, strict=False): """Returns the filenames referenced from contents. If not strict mode, path prefix and extension are optional. """ return [ fn for fn in filenames if _search_reference(fn, contents, strict) is not None ] def _keep_only_referenced_tex(contents, splits): """Returns the filenames referenced from the tex files themselves. It needs various iterations in case one file is referenced from an unreferenced file. 
""" old_referenced = set(splits['tex_in_root'] + splits['tex_not_in_root']) while True: referenced = set(splits['tex_in_root']) for fn in old_referenced: for fn2 in old_referenced: if regex.search( r'(' + os.path.splitext(fn)[0] + r'[.}])', '\n'.join(contents[fn2]) ): referenced.add(fn) if referenced == old_referenced: splits['tex_to_copy'] = list(referenced) return old_referenced = referenced.copy() def _add_root_tex_files(splits): # TODO: Check auto-ignore marker in root to detect the main file. Then check # there is only one non-referenced TeX in root. # Forces the TeX in root to be copied, even if they are not referenced. for fn in splits['tex_in_root']: if fn not in splits['tex_to_copy']: splits['tex_to_copy'].append(fn) def _split_all_files(parameters): """Splits the files into types or location to know what to do with them.""" file_splits = { 'all': _list_all_files( parameters['input_folder'], ignore_dirs=['.git' + os.sep] ), 'in_root': [ f for f in os.listdir(parameters['input_folder']) if os.path.isfile(os.path.join(parameters['input_folder'], f)) ], } file_splits['not_in_root'] = [ f for f in file_splits['all'] if f not in file_splits['in_root'] ] file_splits['to_copy_in_root'] = _remove_pattern( file_splits['in_root'], parameters['to_delete'] + parameters['figures_to_copy_if_referenced'], ) file_splits['to_copy_not_in_root'] = _remove_pattern( file_splits['not_in_root'], parameters['to_delete'] + parameters['figures_to_copy_if_referenced'], ) file_splits['figures'] = _keep_pattern( file_splits['all'], parameters['figures_to_copy_if_referenced'] ) file_splits['tex_in_root'] = _keep_pattern( file_splits['to_copy_in_root'], ['.tex$', '.tikz$'] ) file_splits['tex_not_in_root'] = _keep_pattern( file_splits['to_copy_not_in_root'], ['.tex$', '.tikz$'] ) file_splits['non_tex_in_root'] = _remove_pattern( file_splits['to_copy_in_root'], ['.tex$', '.tikz$'] ) file_splits['non_tex_not_in_root'] = _remove_pattern( file_splits['to_copy_not_in_root'], ['.tex$', 
'.tikz$'] ) if parameters.get('use_external_tikz', None) is not None: file_splits['external_tikz_figures'] = _keep_pattern( file_splits['all'], [parameters['use_external_tikz']] ) else: file_splits['external_tikz_figures'] = [] if parameters.get('svg_inkscape', None) is not None: file_splits['svg_inkscape'] = _keep_pattern( file_splits['all'], [parameters['svg_inkscape']] ) else: file_splits['svg_inkscape'] = [] return file_splits def _create_out_folder(input_folder): """Creates the output folder, erasing it if existed.""" out_folder = os.path.abspath(input_folder).removesuffix('.zip') + '_arXiv' _create_dir_erase_if_exists(out_folder) return out_folder def run_arxiv_cleaner(parameters): """Core of the code, runs the actual arXiv cleaner.""" files_to_delete = [ r'\.aux$', r'\.sh$', r'\.blg$', r'\.brf$', r'\.log$', r'\.out$', r'\.ps$', r'\.dvi$', r'\.synctex.gz$', '~$', r'\.backup$', r'\.gitignore$', r'\.DS_Store$', r'\.svg$', r'^\.idea', r'\.dpth$', r'\.md5$', r'\.dep$', r'\.auxlock$', r'\.fls$', r'\.fdb_latexmk$', ] if not parameters['keep_bib']: files_to_delete.append(r'\.bib$') parameters.update({ 'to_delete': files_to_delete, 'figures_to_copy_if_referenced': [ r'\.png$', r'\.jpg$', r'\.jpeg$', r'\.pdf$', ], }) logging.info('Collecting file structure.') parameters['output_folder'] = _create_out_folder(parameters['input_folder']) from_zip = parameters['input_folder'].endswith('.zip') tempdir_context = ( tempfile.TemporaryDirectory() if from_zip else contextlib.suppress() ) with tempdir_context as tempdir: if from_zip: logging.info('Unzipping input folder.') shutil.unpack_archive(parameters['input_folder'], tempdir) parameters['input_folder'] = tempdir splits = _split_all_files(parameters) logging.info('Reading all tex files') tex_contents = _read_all_tex_contents( splits['tex_in_root'] + splits['tex_not_in_root'], parameters ) for tex_file in tex_contents: logging.info('Removing comments in file %s.', tex_file) tex_contents[tex_file] = 
_remove_comments_and_commands_to_delete( tex_contents[tex_file], parameters ) for tex_file in tex_contents: logging.info('Replacing \\includesvg calls in file %s.', tex_file) tex_contents[tex_file] = _replace_includesvg( tex_contents[tex_file], splits['svg_inkscape'] ) for tex_file in tex_contents: logging.info('Replacing Tikz Pictures in file %s.', tex_file) content = _replace_tikzpictures( tex_contents[tex_file], splits['external_tikz_figures'] ) # If file ends with '\n' already, the split in last line would add an extra # '\n', so we remove it. tex_contents[tex_file] = content.split('\n') _keep_only_referenced_tex(tex_contents, splits) _add_root_tex_files(splits) for tex_file in splits['tex_to_copy']: logging.info('Replacing patterns in file %s.', tex_file) content = '\n'.join(tex_contents[tex_file]) content = _find_and_replace_patterns( content, parameters.get('patterns_and_insertions', list()) ) tex_contents[tex_file] = content new_path = os.path.join(parameters['output_folder'], tex_file) logging.info('Writing modified contents to %s.', new_path) _write_file_content( content, new_path, ) full_content = '\n'.join( ''.join(tex_contents[fn]) for fn in splits['tex_to_copy'] ) _copy_only_referenced_non_tex_not_in_root(parameters, full_content, splits) for non_tex_file in splits['non_tex_in_root']: logging.info('Copying non-tex file %s.', non_tex_file) _copy_file(non_tex_file, parameters) filename_changes = _resize_and_copy_figures_if_referenced(parameters, full_content, splits) logging.info('Outputs written to %s', parameters['output_folder']) # Update LaTeX references for changed filenames if tex_contents_dict is provided if tex_contents and filename_changes: for old_filename, new_filename in filename_changes.items(): tex_contents = _update_image_references( tex_contents, old_filename, new_filename, verbose=parameters.get('verbose', False) ) # Re-write modified tex files with new references after resizing and copying figures for tex_file in splits['tex_to_copy']: 
if tex_file in tex_contents: # Get the updated content if isinstance(tex_contents[tex_file], list): updated_content = ''.join(tex_contents[tex_file]) else: updated_content = tex_contents[tex_file] # Write the updated content back to the output file output_path = os.path.join(parameters['output_folder'], tex_file) logging.info('Re-writing modified tex file with updated references: %s', output_path) _write_file_content(updated_content, output_path) if parameters.get('verbose', False): print(f"Re-wrote {tex_file} with updated image references") if parameters.get('verbose', False): print(f"Updated {len(filename_changes)} image references and re-wrote tex files") def strip_whitespace(text): """Strips all whitespace characters. https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string """ pattern = regex.compile(r'\s+') text = regex.sub(pattern, '', text) return text def merge_args_into_config(args, config_params): final_args = copy.deepcopy(config_params) config_keys = config_params.keys() for key, value in args.items(): if key in config_keys: if any([isinstance(value, t) for t in [str, bool, float, int]]): # Overwrites config value with args value. final_args[key] = value elif isinstance(value, list): # Appends args values to config values. final_args[key] = value + config_params[key] elif isinstance(value, dict): # Updates config params with args params. 
final_args[key].update(**value) else: final_args[key] = value return final_args def _find_and_replace_patterns(content, patterns_and_insertions): r"""content: str patterns_and_insertions: List[Dict] Example for patterns_and_insertions: [ { "pattern" : r"(?:\\figcompfigures{\s*)(?P<first>.*?)\s*}\s*{\s*(?P<second>.*?)\s*}\s*{\s*(?P<third>.*?)\s*}", "insertion" : r"\parbox[c]{{{second}\linewidth}}{{\includegraphics[width={third}\linewidth]{{figures/{first}}}}}}", "description": "Replace figcompfigures" }, ] """ for pattern_and_insertion in patterns_and_insertions: pattern = pattern_and_insertion['pattern'] insertion = pattern_and_insertion['insertion'] description = pattern_and_insertion['description'] logging.info('Processing pattern: %s.', description) p = regex.compile(pattern) m = p.search(content) while m is not None: local_insertion = insertion.format(**m.groupdict()) if pattern_and_insertion.get('strip_whitespace', True): local_insertion = strip_whitespace(local_insertion) logging.info(f'Found {content[m.start():m.end()]:<70}') logging.info(f'Replacing with {local_insertion:<30}') content = content[: m.start()] + local_insertion + content[m.end() :] m = p.search(content) logging.info('Finished pattern: %s.', description) return content ================================================ FILE: arxiv_latex_cleaner/tests/arxiv_latex_cleaner_test.py ================================================ # coding=utf-8 # Copyright 2018 The Google Research Authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
from os import path import shutil import unittest from absl.testing import parameterized from arxiv_latex_cleaner import arxiv_latex_cleaner from PIL import Image def make_args( input_folder='foo/bar', resize_images=False, im_size=500, compress_pdf=False, pdf_im_resolution=500, images_allowlist=None, commands_to_delete=None, use_external_tikz='foo/bar/tikz', ): if images_allowlist is None: images_allowlist = {} if commands_to_delete is None: commands_to_delete = [] args = { 'input_folder': input_folder, 'resize_images': resize_images, 'im_size': im_size, 'compress_pdf': compress_pdf, 'pdf_im_resolution': pdf_im_resolution, 'images_allowlist': images_allowlist, 'commands_to_delete': commands_to_delete, 'use_external_tikz': use_external_tikz, } return args def make_contents(): return ( r'& \figcompfigures{' '\n\timage1.jpg' '\n}{' '\n\t' r'\ww' '\n}{' '\n\t1.0' '\n\t}' '\n& ' r'\figcompfigures{image2.jpg}{\ww}{1.0}' ) def make_patterns(): pattern = r'(?:\\figcompfigures{\s*)(?P<first>.*?)\s*}\s*{\s*(?P<second>.*?)\s*}\s*{\s*(?P<third>.*?)\s*}' insertion = r"""\parbox[c]{{ {second}\linewidth }}{{ \includegraphics[ width={third}\linewidth ]{{ figures/{first} }} }} """ description = 'Replace figcompfigures' output = { 'pattern': pattern, 'insertion': insertion, 'description': description, } return [output] def make_search_reference_tests(): return ( { 'testcase_name': 'prefix1', 'filenames': ['include_image_yes.png', 'include_image.png'], 'contents': '\\include{include_image_yes.png}', 'strict': False, 'true_outputs': ['include_image_yes.png'], }, { 'testcase_name': 'prefix2', 'filenames': ['include_image_yes.png', 'include_image.png'], 'contents': '\\include{include_image.png}', 'strict': False, 'true_outputs': ['include_image.png'], }, { 'testcase_name': 'nested_more_specific', 'filenames': [ 'images/im_included.png', 'images/include/images/im_included.png', ], 'contents': '\\include{images/include/images/im_included.png}', 'strict': False, 'true_outputs': 
['images/include/images/im_included.png'], }, { 'testcase_name': 'nested_less_specific', 'filenames': [ 'images/im_included.png', 'images/include/images/im_included.png', ], 'contents': '\\include{images/im_included.png}', 'strict': False, 'true_outputs': [ 'images/im_included.png', 'images/include/images/im_included.png', ], }, { 'testcase_name': 'nested_substring', 'filenames': ['images/im_included.png', 'im_included.png'], 'contents': '\\include{images/im_included.png}', 'strict': False, 'true_outputs': ['images/im_included.png'], }, { 'testcase_name': 'nested_diffpath', 'filenames': ['images/im_included.png', 'figures/im_included.png'], 'contents': '\\include{images/im_included.png}', 'strict': False, 'true_outputs': ['images/im_included.png'], }, { 'testcase_name': 'diffext', 'filenames': ['tables/demo.tex', 'tables/demo.tikz', 'demo.tex'], 'contents': '\\include{tables/demo.tex}', 'strict': False, 'true_outputs': ['tables/demo.tex'], }, { 'testcase_name': 'diffext2', 'filenames': ['tables/demo.tex', 'tables/demo.tikz', 'demo.tex'], 'contents': '\\include{tables/demo}', 'strict': False, 'true_outputs': ['tables/demo.tex', 'tables/demo.tikz'], }, { 'testcase_name': 'strict_prefix1', 'filenames': ['demo_yes.tex', 'demo.tex'], 'contents': '\\include{demo_yes.tex}', 'strict': True, 'true_outputs': ['demo_yes.tex'], }, { 'testcase_name': 'strict_prefix2', 'filenames': ['demo_yes.tex', 'demo.tex'], 'contents': '\\include{demo.tex}', 'strict': True, 'true_outputs': ['demo.tex'], }, { 'testcase_name': 'strict_nested_more_specific', 'filenames': [ 'tables/table_included.csv', 'tables/include/tables/table_included.csv', ], 'contents': '\\include{tables/include/tables/table_included.csv}', 'strict': True, 'true_outputs': ['tables/include/tables/table_included.csv'], }, { 'testcase_name': 'strict_nested_less_specific', 'filenames': [ 'tables/table_included.csv', 'tables/include/tables/table_included.csv', ], 'contents': '\\include{tables/table_included.csv}', 'strict': 
True, 'true_outputs': ['tables/table_included.csv'], }, { 'testcase_name': 'strict_nested_substring1', 'filenames': ['tables/table_included.csv', 'table_included.csv'], 'contents': '\\include{tables/table_included.csv}', 'strict': True, 'true_outputs': ['tables/table_included.csv'], }, { 'testcase_name': 'strict_nested_substring2', 'filenames': ['tables/table_included.csv', 'table_included.csv'], 'contents': '\\include{table_included.csv}', 'strict': True, 'true_outputs': ['table_included.csv'], }, { 'testcase_name': 'strict_nested_diffpath', 'filenames': ['tables/table_included.csv', 'data/table_included.csv'], 'contents': '\\include{tables/table_included.csv}', 'strict': True, 'true_outputs': ['tables/table_included.csv'], }, { 'testcase_name': 'strict_diffext', 'filenames': ['tables/demo.csv', 'tables/demo.txt', 'demo.csv'], 'contents': '\\include{tables/demo.csv}', 'strict': True, 'true_outputs': ['tables/demo.csv'], }, { 'testcase_name': 'path_starting_with_dot', 'filenames': [ './images/im_included.png', './figures/im_included.png', ], 'contents': '\\include{./images/im_included.png}', 'strict': False, 'true_outputs': ['./images/im_included.png'], }, ) class UnitTests(parameterized.TestCase): @parameterized.named_parameters( { 'testcase_name': 'empty config', 'args': make_args(), 'config_params': {}, 'final_args': make_args(), }, { 'testcase_name': 'empty args', 'args': {}, 'config_params': make_args(), 'final_args': make_args(), }, { 'testcase_name': 'args and config provided', 'args': make_args( images_allowlist={'path1/': 1000}, commands_to_delete=[r'\todo1'] ), 'config_params': make_args( 'foo_/bar_', True, 1000, True, 1000, images_allowlist={'path2/': 1000}, commands_to_delete=[r'\todo2'], use_external_tikz='foo_/bar_/tikz_', ), 'final_args': make_args( images_allowlist={'path1/': 1000, 'path2/': 1000}, commands_to_delete=[r'\todo1', r'\todo2'], ), }, ) def test_merge_args_into_config(self, args, config_params, final_args): self.assertEqual( 
arxiv_latex_cleaner.merge_args_into_config(args, config_params), final_args, ) @parameterized.named_parameters( { 'testcase_name': 'no_comment', 'line_in': 'Foo\n', 'true_output': 'Foo\n', }, { 'testcase_name': 'auto_ignore', 'line_in': '%auto-ignore\n', 'true_output': '%auto-ignore\n', }, { 'testcase_name': 'auto_ignore_middle', 'line_in': 'Foo % auto-ignore Comment\n', 'true_output': 'Foo % auto-ignore\n', }, { 'testcase_name': 'auto_ignore_text_with_comment', 'line_in': 'Foo auto-ignore % Comment\n', 'true_output': 'Foo auto-ignore %\n', }, { 'testcase_name': 'percent', 'line_in': r'100\% accurate\n', 'true_output': r'100\% accurate\n', }, { 'testcase_name': 'comment', 'line_in': ' % Comment\n', 'true_output': '', }, { 'testcase_name': 'comment_inline', 'line_in': 'Foo %Comment\n', 'true_output': 'Foo %\n', }, { 'testcase_name': 'url_with_percent', 'line_in': '\\url{https://www.example.com/hello%20world}\n', 'true_output': '\\url{https://www.example.com/hello%20world}\n', }, { 'testcase_name': 'comment_with_url', 'line_in': 'Foo %\\url{https://www.example.com/hello%20world}\n', 'true_output': 'Foo %\n', }, ) def test_remove_comments_inline(self, line_in, true_output): self.assertEqual( arxiv_latex_cleaner._remove_comments_inline(line_in), true_output ) @parameterized.named_parameters( { 'testcase_name': 'no_command', 'text_in': 'Foo\nFoo2\n', 'keep_text': False, 'true_output': 'Foo\nFoo2\n', }, { 'testcase_name': 'command_not_removed', 'text_in': '\\textit{Foo\nFoo2}\n', 'keep_text': False, 'true_output': '\\textit{Foo\nFoo2}\n', }, { 'testcase_name': 'command_no_end_line_removed', 'text_in': 'A\\todo{B\nC}D\nE\n\\end{document}', 'keep_text': False, 'true_output': 'AD\nE\n\\end{document}', }, { 'testcase_name': 'command_with_end_line_removed', 'text_in': 'A\n\\todo{B\nC}\nD\n\\end{document}', 'keep_text': False, 'true_output': 'A\n%\nD\n\\end{document}', }, { 'testcase_name': 'command_with_optional_arguments_start', 'text_in': 
'A\n\\todo[B]{C\nD}\nE\n\\end{document}', 'keep_text': False, 'true_output': 'A\n%\nE\n\\end{document}', }, { 'testcase_name': 'command_with_optional_arguments_end', 'text_in': 'A\n\\todo{B\nC}[D]\nE\n\\end{document}', 'keep_text': False, 'true_output': 'A\n%\nE\n\\end{document}', }, { 'testcase_name': 'no_command_keep_text', 'text_in': 'Foo\nFoo2\n', 'keep_text': True, 'true_output': 'Foo\nFoo2\n', }, { 'testcase_name': 'command_not_removed_keep_text', 'text_in': '\\textit{Foo\nFoo2}\n', 'keep_text': True, 'true_output': '\\textit{Foo\nFoo2}\n', }, { 'testcase_name': 'command_no_end_line_removed_keep_text', 'text_in': 'A\\todo{B\nC}D\nE\n\\end{document}', 'keep_text': True, 'true_output': 'AB\nCD\nE\n\\end{document}', }, { 'testcase_name': 'command_with_end_line_removed_keep_text', 'text_in': 'A\n\\todo{B\nC}\nD\n\\end{document}', 'keep_text': True, 'true_output': 'A\nB\nC\nD\n\\end{document}', }, { 'testcase_name': 'nested_command_keep_text', 'text_in': 'A\n\\todo{B\n\\todo{C}}\nD\n\\end{document}', 'keep_text': True, 'true_output': 'A\nB\nC\nD\n\\end{document}', }, { 'testcase_name': 'command_with_optional_arguments_start_keep_text', 'text_in': 'A\n\\todo[B]{C\nD}\nE\n\\end{document}', 'keep_text': True, 'true_output': 'A\nC\nD\nE\n\\end{document}', }, { 'testcase_name': 'command_with_optional_arguments_end_keep_text', 'text_in': 'A\n\\todo{B\nC}[D]\nE\n\\end{document}', 'keep_text': True, 'true_output': 'A\nB\nC\nE\n\\end{document}', }, { 'testcase_name': 'deeply_nested_command_keep_text', 'text_in': 'A\n\\todo{B\n\\emph{C\\footnote{\\textbf{D}}}}\nE\n\\end{document}', 'keep_text': True, 'true_output': ( 'A\nB\n\\emph{C\\footnote{\\textbf{D}}}\nE\n\\end{document}' ), }, ) def test_remove_command(self, text_in, keep_text, true_output): self.assertEqual( arxiv_latex_cleaner._remove_command(text_in, 'todo', keep_text), true_output, ) @parameterized.named_parameters( { 'testcase_name': 'no_environment', 'text_in': 'Foo\n', 'true_output': 'Foo\n', }, { 
'testcase_name': 'environment_not_removed', 'text_in': 'Foo\n\\begin{equation}\n3x+2\n\\end{equation}\nFoo', 'true_output': 'Foo\n\\begin{equation}\n3x+2\n\\end{equation}\nFoo', }, { 'testcase_name': 'environment_removed', 'text_in': 'Foo\\begin{comment}\n3x+2\n\\end{comment}\nFoo', 'true_output': 'Foo\nFoo', }, ) def test_remove_environment(self, text_in, true_output): self.assertEqual( arxiv_latex_cleaner._remove_environment(text_in, 'comment'), true_output ) @parameterized.named_parameters( { 'testcase_name': 'no_iffalse', 'text_in': 'Foo\n', 'true_output': 'Foo\n', }, { 'testcase_name': 'if_not_removed', 'text_in': '\\ifvar\n\\ifvar\nFoo\n\\fi\n\\fi\n', 'true_output': '\\ifvar\n\\ifvar\nFoo\n\\fi\n\\fi\n', }, { 'testcase_name': 'if_removed_with_nested_ifvar', 'text_in': '\\ifvar\n\\iffalse\n\\ifvar\nFoo\n\\fi\n\\fi\n\\fi\n', 'true_output': '\\ifvar\n\\fi\n', }, { 'testcase_name': 'if_removed_with_nested_iffalse', 'text_in': '\\ifvar\n\\iffalse\n\\iffalse\nFoo\n\\fi\n\\fi\n\\fi\n', 'true_output': '\\ifvar\n\\fi\n', }, { 'testcase_name': 'if_removed_eof', 'text_in': '\\iffalse\nFoo\n\\fi', 'true_output': '', }, { 'testcase_name': 'if_removed_space', 'text_in': '\\iffalse\nFoo\n\\fi ', 'true_output': '', }, { 'testcase_name': 'if_removed_backslash', 'text_in': '\\iffalse\nFoo\n\\fi\\end{document}', 'true_output': '\\end{document}', }, { 'testcase_name': 'commands_not_removed', 'text_in': '\\newcommand\\figref[1]{Figure~\\ref{fig:\\#1}}', 'true_output': '\\newcommand\\figref[1]{Figure~\\ref{fig:\\#1}}', }, { 'testcase_name': 'iffalse_else_sustained', 'text_in': '\\iffalse not there\\else here\\fi', 'true_output': 'here', }, { 'testcase_name': 'iftrue_else_removed', 'text_in': '\\iftrue expected\\else not expected\\fi', 'true_output': 'expected', }, { 'testcase_name': 'if0_removed', 'text_in': '\\if0 to be removed\\fi', 'true_output': '', }, { 'testcase_name': 'if1 works', 'text_in': '\\if 1 expected\\fi', 'true_output': 'expected', }, { 'testcase_name': 
'new_if_ignored', 'text_in': '\\newif \\ifvar \\ifvar\\iffalse test\\fi\\fi', 'true_output': '\\newif \\ifvar \\ifvar\\fi', }, { 'testcase_name': 'known exceptions (iff) ignored in \\iffalse', 'text_in': '\\iffalse \\iff\\fi', 'true_output': '', }, { 'testcase_name': 'known exceptions (iff) ignored in \\iftrue', 'text_in': '\\iftrue\\iff\\else\\fi', 'true_output': '\\iff', }, ) def test_simplify_conditional_blocks(self, text_in, true_output): self.assertEqual( arxiv_latex_cleaner._simplify_conditional_blocks(text_in), true_output ) @parameterized.named_parameters( { 'testcase_name': 'all_pass', 'inputs': ['abc', 'bca'], 'patterns': ['a'], 'true_outputs': ['abc', 'bca'], }, { 'testcase_name': 'not_all_pass', 'inputs': ['abc', 'bca'], 'patterns': ['a$'], 'true_outputs': ['bca'], }, ) def test_keep_pattern(self, inputs, patterns, true_outputs): self.assertEqual( list(arxiv_latex_cleaner._keep_pattern(inputs, patterns)), true_outputs ) @parameterized.named_parameters( { 'testcase_name': 'all_pass', 'inputs': ['abc', 'bca'], 'patterns': ['a'], 'true_outputs': [], }, { 'testcase_name': 'not_all_pass', 'inputs': ['abc', 'bca'], 'patterns': ['a$'], 'true_outputs': ['abc'], }, ) def test_remove_pattern(self, inputs, patterns, true_outputs): self.assertEqual( list(arxiv_latex_cleaner._remove_pattern(inputs, patterns)), true_outputs, ) @parameterized.named_parameters( { 'testcase_name': 'replace_contents', 'content': make_contents(), 'patterns_and_insertions': make_patterns(), 'true_outputs': ( r'& \parbox[c]{\ww\linewidth}{\includegraphics[width=1.0\linewidth]{figures/image1.jpg}}' '\n' r'& \parbox[c]{\ww\linewidth}{\includegraphics[width=1.0\linewidth]{figures/image2.jpg}}' ), }, ) def test_find_and_replace_patterns( self, content, patterns_and_insertions, true_outputs ): output = arxiv_latex_cleaner._find_and_replace_patterns( content, patterns_and_insertions ) output = arxiv_latex_cleaner.strip_whitespace(output) true_outputs = 
arxiv_latex_cleaner.strip_whitespace(true_outputs) self.assertEqual(output, true_outputs) @parameterized.named_parameters( { 'testcase_name': 'no_tikz', 'text_in': 'Foo\n', 'figures_in': ['ext_tikz/test1.pdf', 'ext_tikz/test2.pdf'], 'true_output': 'Foo\n', }, { 'testcase_name': 'tikz_no_match', 'text_in': ( 'Foo\\tikzsetnextfilename{test_no_match}\n\\begin{tikzpicture}\n\\node' ' (test) at (0,0) {Test1};\n\\end{tikzpicture}\nFoo' ), 'figures_in': ['ext_tikz/test1.pdf', 'ext_tikz/test2.pdf'], 'true_output': ( 'Foo\\tikzsetnextfilename{test_no_match}\n\\begin{tikzpicture}\n\\node' ' (test) at (0,0) {Test1};\n\\end{tikzpicture}\nFoo' ), }, { 'testcase_name': 'tikz_match', 'text_in': ( 'Foo\\tikzsetnextfilename{test2}\n\\begin{tikzpicture}\n\\node' ' (test) at (0,0) {Test1};\n\\end{tikzpicture}\nFoo' ), 'figures_in': ['ext_tikz/test1.pdf', 'ext_tikz/test2.pdf'], 'true_output': 'Foo\\includegraphics{ext_tikz/test2.pdf}\nFoo', }, ) def test_replace_tikzpictures(self, text_in, figures_in, true_output): self.assertEqual( arxiv_latex_cleaner._replace_tikzpictures(text_in, figures_in), true_output, ) @parameterized.named_parameters( { 'testcase_name': 'no_includesvg', 'text_in': 'Foo\n', 'figures_in': [ 'ext_svg/test1-tex.pdf_tex', 'ext_svg/test2-tex.pdf_tex', ], 'true_output': 'Foo\n', }, { 'testcase_name': 'includesvg_no_match', 'text_in': 'Foo\\includesvg{test_no_match}\nFoo', 'figures_in': [ 'ext_svg/test1-tex.pdf_tex', 'ext_svg/test2-tex.pdf_tex', ], 'true_output': 'Foo\\includesvg{test_no_match}\nFoo', }, { 'testcase_name': 'includesvg_match', 'text_in': 'Foo\\includesvg{test2}\nFoo', 'figures_in': [ 'ext_svg/test1-tex.pdf_tex', 'ext_svg/test2-tex.pdf_tex', ], 'true_output': 'Foo\\includeinkscape{ext_svg/test2-tex.pdf_tex}\nFoo', }, { 'testcase_name': 'includesvg_match_with_options', 'text_in': 'Foo\\includesvg[width=\\linewidth,scale=0.40]{figs/persdf/test2}\nFoo', 'figures_in': [ 'ext_svg/test1-tex.pdf_tex', 'ext_svg/test2-tex.pdf_tex', ], 'true_output': 
'Foo\\includeinkscape[width=\\linewidth,scale=0.40]{ext_svg/test2-tex.pdf_tex}\nFoo', }, { 'testcase_name': 'includesvg_match_with_options_with_suffix', 'text_in': 'Foo\\includesvg[width=\\linewidth]{figs/test2.svg}\nFoo', 'figures_in': [ 'ext_svg/test1-tex.pdf_tex', 'ext_svg/test2_svg-tex.pdf_tex', ], 'true_output': 'Foo\\includeinkscape[width=\\linewidth]{ext_svg/test2_svg-tex.pdf_tex}\nFoo', }, { 'testcase_name': 'includesvg_match_with_options_with_dot_with_suffix', 'text_in': ( 'Foo\\includesvg[width=\\linewidth]{figs/test2-0.9.svg}\nFoo' ), 'figures_in': [ 'ext_svg/test1-tex.pdf_tex', 'ext_svg/test2-0.9_svg-tex.pdf_tex', ], 'true_output': 'Foo\\includeinkscape[width=\\linewidth]{ext_svg/test2-0.9_svg-tex.pdf_tex}\nFoo', }, ) def test_replace_includesvg(self, text_in, figures_in, true_output): self.assertEqual( arxiv_latex_cleaner._replace_includesvg(text_in, figures_in), true_output, ) @parameterized.named_parameters(*make_search_reference_tests()) def test_search_reference_weak( self, filenames, contents, strict, true_outputs ): cleaner_outputs = [] for filename in filenames: reference = arxiv_latex_cleaner._search_reference( filename, contents, strict ) if reference is not None: cleaner_outputs.append(filename) # weak check (passes as long as cleaner includes a superset of the true_output) for true_output in true_outputs: self.assertIn(true_output, cleaner_outputs) @parameterized.named_parameters(*make_search_reference_tests()) def test_search_reference_strong( self, filenames, contents, strict, true_outputs ): cleaner_outputs = [] for filename in filenames: reference = arxiv_latex_cleaner._search_reference( filename, contents, strict ) if reference is not None: cleaner_outputs.append(filename) # strong check (set of files must match exactly) weak_check_result = set(true_outputs).issubset(cleaner_outputs) if weak_check_result: msg = 'not fatal, cleaner included more files than necessary' else: msg = 'fatal, see test_search_reference_weak' 
self.assertEqual(cleaner_outputs, true_outputs, msg) @parameterized.named_parameters( { 'testcase_name': 'three_parent', 'filename': 'long/path/to/img.ext', 'content_strs': [ # match '{img.ext}', '{to/img.ext}', '{path/to/img.ext}', '{long/path/to/img.ext}', '{%\nimg.ext }', '{to/img.ext % \n}', '{ \npath/to/img.ext\n}', '{ \n \nlong/path/to/img.ext\n}', '{img}', '{to/img}', '{path/to/img}', '{long/path/to/img}', # dont match '{from/img.ext}', '{from/img}', '{imgoext}', '{from/imgo}', '{ \n long/\npath/to/img.ext\n}', '{path/img.ext}', '{long/img.ext}', '{long/path/img.ext}', '{long/to/img.ext}', '{path/img}', '{long/img}', '{long/path/img}', '{long/to/img}', ], 'strict': False, 'true_outputs': [True] * 12 + [False] * 13, }, { 'testcase_name': 'two_parent', 'filename': 'path/to/img.ext', 'content_strs': [ # match '{img.ext}', '{to/img.ext}', '{path/to/img.ext}', '{%\nimg.ext }', '{to/img.ext % \n}', '{ \npath/to/img.ext\n}', '{img}', '{to/img}', '{path/to/img}', # dont match '{long/path/to/img.ext}', '{ \n \nlong/path/to/img.ext\n}', '{long/path/to/img}', '{from/img.ext}', '{from/img}', '{imgoext}', '{from/imgo}', '{ \n long/\npath/to/img.ext\n}', '{path/img.ext}', '{long/img.ext}', '{long/path/img.ext}', '{long/to/img.ext}', '{path/img}', '{long/img}', '{long/path/img}', '{long/to/img}', ], 'strict': False, 'true_outputs': [True] * 9 + [False] * 16, }, { 'testcase_name': 'one_parent', 'filename': 'to/img.ext', 'content_strs': [ # match '{img.ext}', '{to/img.ext}', '{%\nimg.ext }', '{to/img.ext % \n}', '{img}', '{to/img}', # dont match '{long/path/to/img}', '{path/to/img}', '{ \n \nlong/path/to/img.ext\n}', '{ \npath/to/img.ext\n}', '{long/path/to/img.ext}', '{path/to/img.ext}', '{from/img.ext}', '{from/img}', '{imgoext}', '{from/imgo}', '{ \n long/\npath/to/img.ext\n}', '{path/img.ext}', '{long/img.ext}', '{long/path/img.ext}', '{long/to/img.ext}', '{path/img}', '{long/img}', '{long/path/img}', '{long/to/img}', ], 'strict': False, 'true_outputs': [True] * 6 + 
[False] * 19, }, { 'testcase_name': 'two_parent_strict', 'filename': 'path/to/img.ext', 'content_strs': [ # match '{path/to/img.ext}', '{ \npath/to/img.ext\n}', # dont match '{img.ext}', '{to/img.ext}', '{%\nimg.ext }', '{to/img.ext % \n}', '{img}', '{to/img}', '{path/to/img}', '{long/path/to/img.ext}', '{ \n \nlong/path/to/img.ext\n}', '{long/path/to/img}', '{from/img.ext}', '{from/img}', '{imgoext}', '{from/imgo}', '{ \n long/\npath/to/img.ext\n}', '{path/img.ext}', '{long/img.ext}', '{long/path/img.ext}', '{long/to/img.ext}', '{path/img}', '{long/img}', '{long/path/img}', '{long/to/img}', ], 'strict': True, 'true_outputs': [True] * 2 + [False] * 23, }, ) def test_search_reference_filewise( self, filename, content_strs, strict, true_outputs ): if len(content_strs) != len(true_outputs): raise ValueError( "number of true_outputs doesn't match number of content strs" ) for content, true_output in zip(content_strs, true_outputs): reference = arxiv_latex_cleaner._search_reference( filename, content, strict ) matched = reference is not None msg_not = ' ' if true_output else ' not ' msg_fmt = 'file {} should' + msg_not + 'have matched latex reference {}' msg = msg_fmt.format(filename, content) self.assertEqual(matched, true_output, msg) class IntegrationTests(parameterized.TestCase): def setUp(self): super(IntegrationTests, self).setUp() self.out_path = 'test_data/tex_arXiv' def _compare_files(self, filename, filename_true): if path.splitext(filename)[1].lower() in ['.jpg', '.jpeg', '.png']: with Image.open(filename) as im, Image.open(filename_true) as im_true: # We check only the sizes of the images, checking pixels would be too # complicated in case the resize implementations change. self.assertEqual( im.size, im_true.size, 'Images {:s} was not resized properly.'.format(filename), ) else: # Checks if text files are equal without taking in account end of line # characters. 
with open(filename, 'rb') as f: processed_content = f.read().splitlines() with open(filename_true, 'rb') as f: groundtruth_content = f.read().splitlines() self.assertEqual( processed_content, groundtruth_content, '{:s} and {:s} are not equal.'.format(filename, filename_true), ) @parameterized.named_parameters( {'testcase_name': 'from_dir', 'input_dir': 'test_data/tex'}, {'testcase_name': 'from_zip', 'input_dir': 'test_data/tex.zip'}, ) def test_complete(self, input_dir): out_path_true = 'test_data/tex_arXiv_true' # Make sure the folder does not exist, since we erase it in the test. if path.isdir(self.out_path): raise RuntimeError( 'The folder {:s} should not exist.'.format(self.out_path) ) arxiv_latex_cleaner.run_arxiv_cleaner({ 'input_folder': input_dir, 'images_allowlist': { 'images/im2_included.jpg': 200, 'images/im3_included.png': 400, }, 'resize_images': True, 'im_size': 100, 'compress_pdf': False, 'pdf_im_resolution': 500, 'commands_to_delete': ['mytodo'], 'commands_only_to_delete': ['red'], 'if_exceptions': ['iffalt'], 'environments_to_delete': ['mynote'], 'use_external_tikz': 'ext_tikz', 'keep_bib': False, }) # Checks the set of files is the same as in the true folder. out_files = set(arxiv_latex_cleaner._list_all_files(self.out_path)) out_files_true = set(arxiv_latex_cleaner._list_all_files(out_path_true)) self.assertSetEqual(out_files, out_files_true) # Compares the contents of each file against the true value. for f1 in out_files: self._compare_files( path.join(self.out_path, f1), path.join(out_path_true, f1) ) @parameterized.named_parameters( {'testcase_name': 'from_dir', 'input_dir': 'test_data/tex'}, {'testcase_name': 'from_zip', 'input_dir': 'test_data/tex.zip'}, ) def test_png2jpg(self, input_dir): out_path_true = 'test_data/tex_arXiv_png2jpg_true' # Make sure the folder does not exist, since we erase it in the test. 
if path.isdir(self.out_path): raise RuntimeError( 'The folder {:s} should not exist.'.format(self.out_path) ) arxiv_latex_cleaner.run_arxiv_cleaner({ 'input_folder': input_dir, 'images_allowlist': { # 'images/im2_included.jpg': 200, # 'images/im3_included.png': 400, }, 'resize_images': False, 'im_size': 100, 'compress_pdf': False, 'pdf_im_resolution': 500, 'commands_to_delete': ['mytodo'], 'commands_only_to_delete': ['red'], 'if_exceptions': ['iffalt'], 'environments_to_delete': ['mynote'], 'use_external_tikz': 'ext_tikz', 'keep_bib': False, 'convert_png_to_jpg': True, 'png_quality': 50, 'png_size_threshold': 0.5, }) # Checks the set of files is the same as in the true folder. out_files = set(arxiv_latex_cleaner._list_all_files(self.out_path)) out_files_true = set(arxiv_latex_cleaner._list_all_files(out_path_true)) self.assertSetEqual(out_files, out_files_true) # Compares the contents of each file against the true value. for f1 in out_files: if path.splitext(path.join(self.out_path, f1))[1].lower() in ['.jpg', '.jpeg', '.png']: # check if all png files have been renamed to jpg self.assertTrue(path.splitext(f1)[1].lower() != '.png', f'{f1} is not renamed to jpg') else: self._compare_files( path.join(self.out_path, f1), path.join(out_path_true, f1) ) def tearDown(self): shutil.rmtree(self.out_path) super(IntegrationTests, self).tearDown() if __name__ == '__main__': unittest.main() ================================================ FILE: cleaner_config.yaml ================================================ patterns_and_insertions: [ # Use single ticks for regex patterns # http://blogs.perl.org/users/tinita/2018/03/strings-in-yaml---to-quote-or-not-to-quote.html # You need to escape \ with \\ in the pattern, for instance for \\todo # Use Python named groups https://docs.python.org/3/library/re.html#regular-expression-examples # Escape {{ and }} in the insertion expression # # Optional: # Set strip_whitespace to n to disable white space stripping while replacing the 
pattern. (Default: y) { "pattern" : '(?:\\figcomp{\s*)(?P<first>.*?)\s*}\s*{\s*(?P<second>.*?)\s*}\s*{\s*(?P<third>.*?)\s*}', "insertion" : '\parbox[c]{{ {second} \linewidth}} {{ \includegraphics[width= {third} \linewidth]{{figures/{first} }} }}', "description" : "Replace figcomp", # "strip_whitespace": n }, ] verbose: False commands_to_delete: [ '\\todo', ] ================================================ FILE: requirements.txt ================================================ absl_py>=0.12 pillow pyyaml regex ================================================ FILE: setup.py ================================================ #! /usr/bin/env python # # coding=utf-8 # Copyright 2018 The Google Research Authors. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
from setuptools import setup from setuptools import find_packages from arxiv_latex_cleaner._version import __version__ with open("README.md", "r") as fh: long_description = fh.read() install_requires = [] with open("requirements.txt") as f: for l in f.readlines(): l_c = l.strip() if l_c and not l_c.startswith('#'): install_requires.append(l_c) setup( name="arxiv_latex_cleaner", version=__version__, packages=find_packages(exclude=["*.tests"]), python_requires='>=3', url="https://github.com/google-research/arxiv-latex-cleaner", license="Apache License, Version 2.0", author="Google Research Authors", author_email="jponttuset@gmail.com", description="Cleans the LaTeX code of your paper to submit to arXiv.", long_description=long_description, long_description_content_type="text/markdown", entry_points={ "console_scripts": ["arxiv_latex_cleaner=arxiv_latex_cleaner.__main__:__main__"] }, install_requires=install_requires, classifiers=[ "License :: OSI Approved :: Apache Software License", "Intended Audience :: Science/Research", ], ) ================================================ FILE: test_data/tex/figures/data_included.txt ================================================ ================================================ FILE: test_data/tex/figures/data_not_included.txt ================================================ ================================================ FILE: test_data/tex/figures/figure_included.tex ================================================ \includegraphics{images/im2_included.jpg} \addplot{figures/data_included.txt} ================================================ FILE: test_data/tex/figures/figure_included.tikz ================================================ \tikzsetnextfilename{test2} \begin{tikzpicture} \node {root} child {node {left}} child {node {right} child {node {child}} child {node {child}} }; \end{tikzpicture} ================================================ FILE: test_data/tex/figures/figure_not_included.tex 
================================================ \addplot{figures/data_not_included.txt} \input{figures/figure_not_included_2.tex} ================================================ FILE: test_data/tex/figures/figure_not_included_2.tex ================================================ ================================================ FILE: test_data/tex/main.aux ================================================ ================================================ FILE: test_data/tex/main.bbl ================================================ BBL content, should be intact. ================================================ FILE: test_data/tex/main.bib ================================================ ================================================ FILE: test_data/tex/main.tex ================================================ \begin{document} Text % Whole line comment Text% Inline comment \begin{comment} This is an environment comment. \end{comment} This is a percent \%. % Whole line comment without newline \includegraphics{images/im1_included.png} %\includegraphics{images/im_not_included} \includegraphics{images/im3_included.png} \includegraphics{% images/im4_included.png% } \includegraphics[width=.5\linewidth]{% images/im5_included.jpg} %\includegraphics{% % images/im4_not_included.png % } %\includegraphics[width=.5\linewidth]{% % images/im5_not_included.jpg} % test whether a path starting with a dot works when including graphics \includegraphics{./images/im3_included.png} This line should\mytodo{Do this later} not be separated \mytodo{This is a todo command with a nested \textit{command}. Please remember that up to \texttt{2 levels} of \textit{nesting} are supported.} from this one. \begin{mynote} This is a custom environment that could be excluded. 
\end{mynote} \newif\ifvar \newif \ifvarII \ifvarII asdf \fi \ifvar \if false \if false \if 0 \iffalse \ifvar Text \fi \fi \fi \fi \fi \fi \iffalse I shall be gone (iffalse block)!\else Expect me (else block of iffalse)!\fi \iftrue Expect me (iftrue block)!\else I shall be gone (else block of iftrue)!\fi \iffalse \iffalt \fi \newcommand{\red}[1]{{\color{red} #1}} hello test \red{hello test \red{hello}} test % content after this line should not be cleaned if \end{document} is in a comment \input{figures/figure_included.tex} % \input{figures/figure_not_included.tex} % Test for tikzpicture feature % should be replaced \tikzsetnextfilename{test1} \begin{tikzpicture} \node (test) at (0,0) {Test1}; \end{tikzpicture} % should be replaced in included file \input{figures/figure_included.tikz} % should not be replaced - no preceding tikzsetnextfilename command \begin{tikzpicture} \node (test) at (0,0) {Test3}; \end{tikzpicture} \tikzsetnextfilename{test_no_match} \begin{tikzpicture} \node (test) at (0,0) {Test4}; \end{tikzpicture} \end{document} This should be ignored. 
================================================ FILE: test_data/tex/not_included/figures/data_included.txt ================================================ ================================================ FILE: test_data/tex_arXiv_png2jpg_true/figures/data_included.txt ================================================ ================================================ FILE: test_data/tex_arXiv_png2jpg_true/figures/figure_included.tex ================================================ \includegraphics{images/im2_included.jpg} \addplot{figures/data_included.txt} ================================================ FILE: test_data/tex_arXiv_png2jpg_true/figures/figure_included.tikz ================================================ \includegraphics{ext_tikz/test2.pdf} ================================================ FILE: test_data/tex_arXiv_png2jpg_true/main.bbl ================================================ BBL content, should be intact. ================================================ FILE: test_data/tex_arXiv_png2jpg_true/main.tex ================================================ \begin{document} Text Text% This is a percent \%. \includegraphics{images/im1_included.jpg} \includegraphics{images/im3_included.jpg} \includegraphics{% images/im4_included.jpg% } \includegraphics[width=.5\linewidth]{% images/im5_included.jpg} \includegraphics{./images/im3_included.jpg} This line should not be separated % from this one. \newif\ifvar \newif \ifvarII \ifvarII asdf \fi \ifvar \fi Expect me (else block of iffalse)! Expect me (iftrue block)! 
\newcommand{\red}[1]{{\color{red} #1}} hello test hello test hello test \input{figures/figure_included.tex} \includegraphics{ext_tikz/test1.pdf} \input{figures/figure_included.tikz} \begin{tikzpicture} \node (test) at (0,0) {Test3}; \end{tikzpicture} \tikzsetnextfilename{test_no_match} \begin{tikzpicture} \node (test) at (0,0) {Test4}; \end{tikzpicture} \end{document} ================================================ FILE: test_data/tex_arXiv_true/figures/data_included.txt ================================================ ================================================ FILE: test_data/tex_arXiv_true/figures/figure_included.tex ================================================ \includegraphics{images/im2_included.jpg} \addplot{figures/data_included.txt} ================================================ FILE: test_data/tex_arXiv_true/figures/figure_included.tikz ================================================ \includegraphics{ext_tikz/test2.pdf} ================================================ FILE: test_data/tex_arXiv_true/main.bbl ================================================ BBL content, should be intact. ================================================ FILE: test_data/tex_arXiv_true/main.tex ================================================ \begin{document} Text Text% This is a percent \%. \includegraphics{images/im1_included.png} \includegraphics{images/im3_included.png} \includegraphics{% images/im4_included.png% } \includegraphics[width=.5\linewidth]{% images/im5_included.jpg} \includegraphics{./images/im3_included.png} This line should not be separated % from this one. \newif\ifvar \newif \ifvarII \ifvarII asdf \fi \ifvar \fi Expect me (else block of iffalse)! Expect me (iftrue block)! 
\newcommand{\red}[1]{{\color{red} #1}} hello test hello test hello test \input{figures/figure_included.tex} \includegraphics{ext_tikz/test1.pdf} \input{figures/figure_included.tikz} \begin{tikzpicture} \node (test) at (0,0) {Test3}; \end{tikzpicture} \tikzsetnextfilename{test_no_match} \begin{tikzpicture} \node (test) at (0,0) {Test4}; \end{tikzpicture} \end{document}