[
  {
    "path": ".github/workflows/release-workflow.yml",
    "content": "name: Create a GitHub and PyPI release\non:\n  push:\n    tags:\n      - 'v*'\n\njobs:\n  build:\n    name: Create a GitHub Release\n    runs-on: ubuntu-latest\n    permissions:\n      contents: write\n    steps:\n      - name: Checkout code\n        uses: actions/checkout@v2\n      - name: Create Release\n        id: create_release\n        uses: actions/create-release@v1\n        env:\n          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}\n        with:\n          tag_name: ${{ github.ref }}\n          release_name: Release ${{ github.ref }}\n          body: ${{ github.ref }} release of `arxiv_latex_cleaner`.\n          draft: false\n          prerelease: false\n  deploy:\n    name: Create a PyPI Release\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout code\n        uses: actions/checkout@v2\n      - name: Set up Python\n        uses: actions/setup-python@v2\n        with:\n          python-version: '3.x'\n      - name: Install dependencies\n        run: |\n          python -m pip install --upgrade pip\n          pip install setuptools wheel twine\n      - name: Build\n        run: |\n          python setup.py sdist bdist_wheel\n      - name: Publish\n        env:\n          TWINE_USERNAME: '__token__'\n          TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}\n        run: |\n          python -m twine upload dist/*\n"
  },
  {
    "path": ".gitignore",
    "content": "*.pyc\n.idea\narxiv-latex-cleaner.iml\narxiv-latex-cleaner.ipr\narxiv-latex-cleaner.iws\narxiv_latex_cleaner.egg-info/\nbuild/\ndist/\n\n*.DS_Store\n"
  },
  {
    "path": "CONTRIBUTING.md",
    "content": "# How to Contribute\n\nWe'd love to accept your patches and contributions to this project. There are\njust a few small guidelines you need to follow.\n\n## Contributor License Agreement\n\nContributions to this project must be accompanied by a Contributor License\nAgreement. You (or your employer) retain the copyright to your contribution;\nthis simply gives us permission to use and redistribute your contributions as\npart of the project. Head over to <https://cla.developers.google.com/> to see\nyour current agreements on file or to sign a new one.\n\nYou generally only need to submit a CLA once, so if you've already submitted one\n(even if it was for a different project), you probably don't need to do it\nagain.\n\n## Code reviews\n\nAll submissions, including submissions by project members, require review. We\nuse GitHub pull requests for this purpose. Consult\n[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more\ninformation on using pull requests.\n\n## Community Guidelines\n\nThis project follows\n[Google's Open Source Community Guidelines](https://opensource.google.com/conduct/).\n"
  },
  {
    "path": "LICENSE",
    "content": "\n                                 Apache License\n                           Version 2.0, January 2004\n                        http://www.apache.org/licenses/\n\n   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n\n   1. Definitions.\n\n      \"License\" shall mean the terms and conditions for use, reproduction,\n      and distribution as defined by Sections 1 through 9 of this document.\n\n      \"Licensor\" shall mean the copyright owner or entity authorized by\n      the copyright owner that is granting the License.\n\n      \"Legal Entity\" shall mean the union of the acting entity and all\n      other entities that control, are controlled by, or are under common\n      control with that entity. For the purposes of this definition,\n      \"control\" means (i) the power, direct or indirect, to cause the\n      direction or management of such entity, whether by contract or\n      otherwise, or (ii) ownership of fifty percent (50%) or more of the\n      outstanding shares, or (iii) beneficial ownership of such entity.\n\n      \"You\" (or \"Your\") shall mean an individual or Legal Entity\n      exercising permissions granted by this License.\n\n      \"Source\" form shall mean the preferred form for making modifications,\n      including but not limited to software source code, documentation\n      source, and configuration files.\n\n      \"Object\" form shall mean any form resulting from mechanical\n      transformation or translation of a Source form, including but\n      not limited to compiled object code, generated documentation,\n      and conversions to other media types.\n\n      \"Work\" shall mean the work of authorship, whether in Source or\n      Object form, made available under the License, as indicated by a\n      copyright notice that is included in or attached to the work\n      (an example is provided in the Appendix below).\n\n      \"Derivative Works\" shall mean any work, whether in Source or Object\n      
form, that is based on (or derived from) the Work and for which the\n      editorial revisions, annotations, elaborations, or other modifications\n      represent, as a whole, an original work of authorship. For the purposes\n      of this License, Derivative Works shall not include works that remain\n      separable from, or merely link (or bind by name) to the interfaces of,\n      the Work and Derivative Works thereof.\n\n      \"Contribution\" shall mean any work of authorship, including\n      the original version of the Work and any modifications or additions\n      to that Work or Derivative Works thereof, that is intentionally\n      submitted to Licensor for inclusion in the Work by the copyright owner\n      or by an individual or Legal Entity authorized to submit on behalf of\n      the copyright owner. For the purposes of this definition, \"submitted\"\n      means any form of electronic, verbal, or written communication sent\n      to the Licensor or its representatives, including but not limited to\n      communication on electronic mailing lists, source code control systems,\n      and issue tracking systems that are managed by, or on behalf of, the\n      Licensor for the purpose of discussing and improving the Work, but\n      excluding communication that is conspicuously marked or otherwise\n      designated in writing by the copyright owner as \"Not a Contribution.\"\n\n      \"Contributor\" shall mean Licensor and any individual or Legal Entity\n      on behalf of whom a Contribution has been received by Licensor and\n      subsequently incorporated within the Work.\n\n   2. Grant of Copyright License. 
Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      copyright license to reproduce, prepare Derivative Works of,\n      publicly display, publicly perform, sublicense, and distribute the\n      Work and such Derivative Works in Source or Object form.\n\n   3. Grant of Patent License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      (except as stated in this section) patent license to make, have made,\n      use, offer to sell, sell, import, and otherwise transfer the Work,\n      where such license applies only to those patent claims licensable\n      by such Contributor that are necessarily infringed by their\n      Contribution(s) alone or by combination of their Contribution(s)\n      with the Work to which such Contribution(s) was submitted. If You\n      institute patent litigation against any entity (including a\n      cross-claim or counterclaim in a lawsuit) alleging that the Work\n      or a Contribution incorporated within the Work constitutes direct\n      or contributory patent infringement, then any patent licenses\n      granted to You under this License for that Work shall terminate\n      as of the date such litigation is filed.\n\n   4. Redistribution. 
You may reproduce and distribute copies of the\n      Work or Derivative Works thereof in any medium, with or without\n      modifications, and in Source or Object form, provided that You\n      meet the following conditions:\n\n      (a) You must give any other recipients of the Work or\n          Derivative Works a copy of this License; and\n\n      (b) You must cause any modified files to carry prominent notices\n          stating that You changed the files; and\n\n      (c) You must retain, in the Source form of any Derivative Works\n          that You distribute, all copyright, patent, trademark, and\n          attribution notices from the Source form of the Work,\n          excluding those notices that do not pertain to any part of\n          the Derivative Works; and\n\n      (d) If the Work includes a \"NOTICE\" text file as part of its\n          distribution, then any Derivative Works that You distribute must\n          include a readable copy of the attribution notices contained\n          within such NOTICE file, excluding those notices that do not\n          pertain to any part of the Derivative Works, in at least one\n          of the following places: within a NOTICE text file distributed\n          as part of the Derivative Works; within the Source form or\n          documentation, if provided along with the Derivative Works; or,\n          within a display generated by the Derivative Works, if and\n          wherever such third-party notices normally appear. The contents\n          of the NOTICE file are for informational purposes only and\n          do not modify the License. 
You may add Your own attribution\n          notices within Derivative Works that You distribute, alongside\n          or as an addendum to the NOTICE text from the Work, provided\n          that such additional attribution notices cannot be construed\n          as modifying the License.\n\n      You may add Your own copyright statement to Your modifications and\n      may provide additional or different license terms and conditions\n      for use, reproduction, or distribution of Your modifications, or\n      for any such Derivative Works as a whole, provided Your use,\n      reproduction, and distribution of the Work otherwise complies with\n      the conditions stated in this License.\n\n   5. Submission of Contributions. Unless You explicitly state otherwise,\n      any Contribution intentionally submitted for inclusion in the Work\n      by You to the Licensor shall be under the terms and conditions of\n      this License, without any additional terms or conditions.\n      Notwithstanding the above, nothing herein shall supersede or modify\n      the terms of any separate license agreement you may have executed\n      with Licensor regarding such Contributions.\n\n   6. Trademarks. This License does not grant permission to use the trade\n      names, trademarks, service marks, or product names of the Licensor,\n      except as required for reasonable and customary use in describing the\n      origin of the Work and reproducing the content of the NOTICE file.\n\n   7. Disclaimer of Warranty. Unless required by applicable law or\n      agreed to in writing, Licensor provides the Work (and each\n      Contributor provides its Contributions) on an \"AS IS\" BASIS,\n      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n      implied, including, without limitation, any warranties or conditions\n      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n      PARTICULAR PURPOSE. 
You are solely responsible for determining the\n      appropriateness of using or redistributing the Work and assume any\n      risks associated with Your exercise of permissions under this License.\n\n   8. Limitation of Liability. In no event and under no legal theory,\n      whether in tort (including negligence), contract, or otherwise,\n      unless required by applicable law (such as deliberate and grossly\n      negligent acts) or agreed to in writing, shall any Contributor be\n      liable to You for damages, including any direct, indirect, special,\n      incidental, or consequential damages of any character arising as a\n      result of this License or out of the use or inability to use the\n      Work (including but not limited to damages for loss of goodwill,\n      work stoppage, computer failure or malfunction, or any and all\n      other commercial damages or losses), even if such Contributor\n      has been advised of the possibility of such damages.\n\n   9. Accepting Warranty or Additional Liability. While redistributing\n      the Work or Derivative Works thereof, You may choose to offer,\n      and charge a fee for, acceptance of support, warranty, indemnity,\n      or other liability obligations and/or rights consistent with this\n      License. However, in accepting such obligations, You may act only\n      on Your own behalf and on Your sole responsibility, not on behalf\n      of any other Contributor, and only if You agree to indemnify,\n      defend, and hold each Contributor harmless for any liability\n      incurred by, or claims asserted against, such Contributor by reason\n      of your accepting any such warranty or additional liability.\n\n   END OF TERMS AND CONDITIONS\n\n   APPENDIX: How to apply the Apache License to your work.\n\n      To apply the Apache License to your work, attach the following\n      boilerplate notice, with the fields enclosed by brackets \"[]\"\n      replaced with your own identifying information. 
(Don't include\n      the brackets!)  The text should be enclosed in the appropriate\n      comment syntax for the file format. We also recommend that a\n      file or class name and description of purpose be included on the\n      same \"printed page\" as the copyright notice for easier\n      identification within third-party archives.\n\n   Copyright [yyyy] [name of copyright owner]\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.\n"
  },
  {
    "path": "MANIFEST.in",
    "content": "include LICENSE\ninclude README.md\ninclude requirements.txt\n"
  },
  {
    "path": "README.md",
    "content": "# `arxiv_latex_cleaner`\n\nThis tool allows you to easily clean the LaTeX code of your paper to submit to\narXiv. From a folder containing all your code, e.g. `/path/to/latex/`, it\ncreates a new folder `/path/to/latex_arXiv/`, that is ready to ZIP and upload to\narXiv.\n\n## Example call:\n\n```bash\narxiv_latex_cleaner /path/to/latex --resize_images --im_size 500 --images_allowlist='{\"images/im.png\":2000}'\n```\n\nOr simply from a config file\n\n```bash\narxiv_latex_cleaner /path/to/latex --config cleaner_config.yaml\n```\n\n## Installation:\n\n```bash\npip install arxiv-latex-cleaner\n```\n\n| :exclamation: arxiv_latex_cleaner is only compatible with Python >=3.9 :exclamation: |\n| ---------------------------------------------------------------------------------- |\n\nIf using MacOS, you can install using [Homebrew](https://brew.sh/):\n\n```bash\nbrew install arxiv_latex_cleaner\n```\n\nAlternatively, you can download the source code:\n\n```bash\ngit clone https://github.com/google-research/arxiv-latex-cleaner\ncd arxiv-latex-cleaner/\npython -m arxiv_latex_cleaner --help\n```\n\nAnd install as a command-line program directly from the source code:\n\n```bash\npython setup.py install\n```\n\n## Main features:\n\n#### Privacy-oriented\n\n*   Removes all auxiliary files (`.aux`, `.log`, `.out`, etc.).\n*   Removes all comments from your code (yes, those are visible on arXiv and you\n    do not want them to be). 
These also include `\\begin{comment}\\end{comment}`,\n    `\\iffalse\\fi`, and `\\if0\\fi` environments.\n*   Optionally removes user-defined commands entered with `commands_to_delete`\n    (such as `\\todo{}` that you redefine as the empty string at the end).\n*   Optionally allows you to define custom regex replacement rules through a\n    `cleaner_config.yaml` file.\n\n#### Size-oriented\n\nThere is a 50MB limit on arXiv submissions, so to make it fit:\n\n*   Removes all unused `.tex` files (those that are not in the root and not\n    included in any other `.tex` file).\n*   Removes all unused images that take up space (those that are not actually\n    included in any used `.tex` file).\n*   Optionally resizes all images to `im_size` pixels, to reduce the size of the\n    submission. You can allowlist some images to skip the global size using\n    `images_allowlist`.\n*   Optionally compresses `.pdf` files using ghostscript (Linux and Mac only).\n    You can allowlist some PDFs to skip the global size using\n    `images_allowlist`.\n*   Optionally converts PNG images to JPG format to reduce file size.\n\n#### TikZ picture source code concealment\n\nTo prevent the upload of tikzpicture source code or raw simulation data, this\nfeature:\n\n*   Replaces the tikzpicture environment `\\begin{tikzpicture} ...\n    \\end{tikzpicture}` with the respective\n    `\\includegraphics{EXTERNAL_TIKZ_FOLDER/picture_name.pdf}`.\n*   Requires externally compiled TikZ pictures as `.pdf` files in folder\n    `EXTERNAL_TIKZ_FOLDER`. 
See section 52 (Externalization Library) in the\n    [PGF/TikZ manual](https://ctan.org/pkg/pgf?lang=en) on TikZ picture\n    externalization.\n*   Only replaces environments with preceding\n    `\\tikzsetnextfilename{picture_name}` command (as in\n    `\\tikzsetnextfilename{picture_name}\\begin{tikzpicture} ...\n    \\end{tikzpicture}`) where the externalized `picture_name.pdf` filename\n    matches `picture_name`.\n\n#### More sophisticated pattern replacement based on regex group captures\n\nSometimes it is useful to work with a set of custom LaTeX commands when writing\na paper. To get rid of them upon arXiv submission, one can simply revert them to\nplain LaTeX with a regular expression insertion.\n\n```yaml\n{\n    \"pattern\" : '(?:\\\\figcomp{\\s*)(?P<first>.*?)\\s*}\\s*{\\s*(?P<second>.*?)\\s*}\\s*{\\s*(?P<third>.*?)\\s*}',\n    \"insertion\" : '\\parbox[c]{{ {second} \\linewidth}} {{ \\includegraphics[width= {third} \\linewidth]{{figures/{first} }} }}',\n    \"description\" : \"Replace figcomp\"\n}\n```\n\nThe pattern above will find all `\\figcomp{path}{w1}{w2}` commands and replace\nthem with\n`\\parbox[c]{w1\\linewidth}{\\includegraphics[width=w2\\linewidth]{figures/path}}`.\nNote that the insertion template is filled with the\n[named groups captures](https://docs.python.org/3/library/re.html#regular-expression-examples)\nfrom the pattern. Note that the replacement is processed **before** all\n`\\includegraphics` commands are processed and corresponding file paths are\ncopied, making sure all figure files are copied to the cleaned version. 
See also\n[cleaner_config.yaml](cleaner_config.yaml) for details on how to specify the\npatterns.\n\n## Usage:\n\n```\nusage: arxiv_latex_cleaner@v1.0.10 [-h] [--resize_images] [--im_size IM_SIZE]\n                                   [--compress_pdf]\n                                   [--pdf_im_resolution PDF_IM_RESOLUTION]\n                                   [--images_allowlist IMAGES_ALLOWLIST]\n                                   [--keep_bib]\n                                   [--commands_to_delete COMMANDS_TO_DELETE [COMMANDS_TO_DELETE ...]]\n                                   [--commands_only_to_delete COMMANDS_ONLY_TO_DELETE [COMMANDS_ONLY_TO_DELETE ...]]\n                                   [--environments_to_delete ENVIRONMENTS_TO_DELETE [ENVIRONMENTS_TO_DELETE ...]]\n                                   [--if_exceptions IF_EXCEPTIONS [IF_EXCEPTIONS ...]]\n                                   [--use_external_tikz USE_EXTERNAL_TIKZ]\n                                   [--svg_inkscape [SVG_INKSCAPE]]\n                                   [--convert_png_to_jpg]\n                                   [--png_quality PNG_QUALITY]\n                                   [--png_size_threshold PNG_SIZE_THRESHOLD]\n                                   [--config CONFIG] [--verbose]\n                                   input_folder\n\nClean the LaTeX code of your paper to submit to arXiv. 
Check the README for\nmore information on the use.\n\npositional arguments:\n  input_folder          Input folder containing the LaTeX code.\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --resize_images       Resize images.\n  --im_size IM_SIZE     Size of the output images (in pixels, longest side).\n                        Fine tune this to get as close to 10MB as possible.\n  --compress_pdf        Compress PDF images using ghostscript (Linux and Mac\n                        only).\n  --pdf_im_resolution PDF_IM_RESOLUTION\n                        Resolution (in dpi) to which the tool resamples the\n                        PDF images.\n  --images_allowlist IMAGES_ALLOWLIST\n                        Images (and PDFs) that won't be resized to the default\n                        resolution, but the one provided here. Value is pixel\n                        for images, and dpi for PDFs, as in --im_size and\n                        --pdf_im_resolution, respectively. Format is a\n                        dictionary as: '{\"path/to/im.jpg\": 1000}'\n  --keep_bib            Avoid deleting the *.bib files.\n  --commands_to_delete COMMANDS_TO_DELETE [COMMANDS_TO_DELETE ...]\n                        LaTeX commands that will be deleted. Useful for e.g.\n                        user-defined \\todo commands. 
For example, to delete\n                        all occurrences of \\todo1{} and \\todo2{}, run the tool\n                        with `--commands_to_delete todo1 todo2`. Please note\n                        that the positional argument `input_folder` cannot\n                        come immediately after `commands_to_delete`, as the\n                        parser does not have any way to know if it's another\n                        command to delete.\n  --commands_only_to_delete COMMANDS_ONLY_TO_DELETE [COMMANDS_ONLY_TO_DELETE ...]\n                        LaTeX commands that will be deleted but the text \n                        wrapped in the commands will be retained. Useful for\n                        commands that change text formats and colors, which\n                        you may want to remove but keep the text within. Usages\n                        are exactly the same as commands_to_delete. Note that if\n                        the commands listed here duplicate that after\n                        commands_to_delete, the default action will be retaining\n                        the wrapped text.\n  --environments_to_delete ENVIRONMENTS_TO_DELETE [ENVIRONMENTS_TO_DELETE ...]\n                        LaTeX environments that will be deleted. Useful for e.g. \n                        user-defined comment environments. For example, to \n                        delete all occurrences of \\begin{note} ... \\end{note},\n                        run the tool with `--environments_to_delete note`. \n                        Please note that the positional argument `input_folder`\n                        cannot come immediately after\n                        `environments_to_delete`, as the parser does not have\n                        any way to know if it's another environment to delete.\n  --if_exceptions IF_EXCEPTIONS [IF_EXCEPTIONS ...]\n                        Constant TeX primitive conditionals (\\iffalse, \\iftrue,\n                        etc.) 
are simplified, i.e., true branches are kept, false\n                        branches deleted. To parse the conditional constructs\n                        correctly, all commands starting with `\\if` are assumed to\n                        be TeX primitive conditionals (e.g., declared by\n                        \\newif\\ifvar). Some known exceptions to this rule are\n                        already included (e.g., \\iff, \\ifthenelse, etc.), but you\n                        can add custom exceptions using `--if_exceptions iffalt`.\n  --use_external_tikz USE_EXTERNAL_TIKZ\n                        Folder (relative to input folder) containing\n                        externalized tikz figures in PDF format.\n  --svg_inkscape [SVG_INKSCAPE]\n                        Include PDF files generated by Inkscape via the\n                        `\\includesvg` command from the `svg` package. This is\n                        done by replacing the `\\includesvg` calls with\n                        `\\includeinkscape` calls pointing to the generated\n                        `.pdf_tex` files. By default, these files and the\n                        generated PDFs are located under `./svg-inkscape`\n                        (relative to the input folder), but a different path\n                        (relative to the input folder) can be provided in case a\n                        different `inkscapepath` was set when loading the `svg`\n                        package.\n  --convert_png_to_jpg  Convert PNG images to JPG format to reduce file size\n  --png_quality PNG_QUALITY\n                        JPG quality for PNG conversion (0-100, default: 50)\n  --png_size_threshold PNG_SIZE_THRESHOLD\n                        Minimum PNG file size in MB to apply quality reduction (default: 0.5)\n  --config CONFIG       Read settings from `.yaml` config file. 
If command\n                        line arguments are provided additionally, the config\n                        file parameters are updated with the command line\n                        parameters.\n  --verbose             Enable detailed output.\n```\n\n## Testing:\n\n```bash\npython -m unittest arxiv_latex_cleaner.tests.arxiv_latex_cleaner_test\n```\n\n## Note\n\nThis is not an officially supported Google product.\n"
  },
  {
    "path": "__init__.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n"
  },
  {
    "path": "arxiv_latex_cleaner/__init__.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n"
  },
  {
    "path": "arxiv_latex_cleaner/__main__.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Main module for ``arxiv_latex_cleaner``.\n\n.. code-block:: bash\n\n    $ python -m arxiv_latex_cleaner --help\n\"\"\"\nimport argparse\nimport json\nimport logging\n\nimport yaml\n\nfrom ._version import __version__\nfrom .arxiv_latex_cleaner import merge_args_into_config\nfrom .arxiv_latex_cleaner import run_arxiv_cleaner\n\nPARSER = argparse.ArgumentParser(\n    prog=\"arxiv_latex_cleaner@{0}\".format(__version__),\n    description=(\n        \"Clean the LaTeX code of your paper to submit to arXiv. \"\n        \"Check the README for more information on the use.\"\n    ),\n)\n\nPARSER.add_argument(\n    \"input_folder\",\n    type=str,\n    help=\"Input folder or zip archive containing the LaTeX code.\",\n)\n\nPARSER.add_argument(\n    \"--resize_images\",\n    action=\"store_true\",\n    help=\"Resize images.\",\n)\n\nPARSER.add_argument(\n    \"--im_size\",\n    default=500,\n    type=int,\n    help=(\n        \"Size of the output images (in pixels, longest side). 
Fine tune this \"\n        \"to get as close to 10MB as possible.\"\n    ),\n)\n\nPARSER.add_argument(\n    \"--compress_pdf\",\n    action=\"store_true\",\n    help=\"Compress PDF images using ghostscript (Linux and Mac only).\",\n)\n\nPARSER.add_argument(\n    \"--pdf_im_resolution\",\n    default=500,\n    type=int,\n    help=\"Resolution (in dpi) to which the tool resamples the PDF images.\",\n)\n\nPARSER.add_argument(\n    \"--images_allowlist\",\n    default={},\n    type=json.loads,\n    help=(\n        \"Images (and PDFs) that won't be resized to the default resolution, \"\n        \"but the one provided here. Value is pixel for images, and dpi for \"\n        \"PDFs, as in --im_size and --pdf_im_resolution, respectively. Format \"\n        \"is a dictionary as: '{\\\"path/to/im.jpg\\\": 1000}'\"\n    ),\n)\n\nPARSER.add_argument(\n    \"--keep_bib\",\n    action=\"store_true\",\n    help=\"Avoid deleting the *.bib files.\",\n)\n\nPARSER.add_argument(\n    \"--commands_to_delete\",\n    nargs=\"+\",\n    default=[],\n    required=False,\n    help=(\n        \"LaTeX commands that will be deleted. Useful for e.g. user-defined \"\n        \"\\\\todo commands. For example, to delete all occurrences of \\\\todo1{} \"\n        \"and \\\\todo2{}, run the tool with `--commands_to_delete todo1 todo2`. \"\n        \"Please note that the positional argument `input_folder` cannot come \"\n        \"immediately after `commands_to_delete`, as the parser does not have \"\n        \"any way to know if it's another command to delete.\"\n    ),\n)\n\nPARSER.add_argument(\n    \"--commands_only_to_delete\",\n    nargs=\"+\",\n    default=[],\n    required=False,\n    help=(\n        \"LaTeX commands that will be deleted but the text wrapped in the\"\n        \" commands will be retained. Useful for commands that change text\"\n        \" formats and colors, which you may want to remove but keep the text\"\n        \" within. Usages are exactly the same as commands_to_delete. 
Note that\"\n        \" if the commands listed here duplicate that after commands_to_delete,\"\n        \" the default action will be retaining the wrapped text.\"\n    ),\n)\n\nPARSER.add_argument(\n    \"--environments_to_delete\",\n    nargs=\"+\",\n    default=[],\n    required=False,\n    help=(\n        \"LaTeX environments that will be deleted. Useful for e.g. user-\"\n        \"defined comment environments. For example, to delete all occurrences \"\n        \"of \\\\begin{note} ... \\\\end{note}, run the tool with \"\n        \"`--environments_to_delete note`. Please note that the positional \"\n        \"argument `input_folder` cannot come immediately after \"\n        \"`environments_to_delete`, as the parser does not have any way to \"\n        \"know if it's another environment to delete.\"\n    ),\n)\n\ndef if_prefixed(orig_string):\n  if orig_string.startswith(\"\\\\\"):\n    string = orig_string[1:]\n  else:\n    string = orig_string\n  if not string.startswith(\"if\"):\n    raise argparse.ArgumentTypeError(\n        f\"Expected a string starting with 'if', got '{orig_string}'!\"\n    )\n  return string\n\nPARSER.add_argument(\n    \"--if_exceptions\",\n    nargs=\"+\",\n    default=[],\n    required=False,\n    type=if_prefixed,\n    help=(\n        \"Constant TeX primitive conditionals (\\\\iffalse, \\\\iftrue, etc.) are \"\n        \"simplified, i.e., true branches are kept, false branches deleted. \"\n        \"To parse the conditional constructs correctly, all commands starting \"\n        \"with `\\\\if` are assumed to be TeX primitive conditionals (e.g., \"\n        \"declared by \\\\newif\\\\ifvar). 
Some known exceptions to this rule are \"\n        \"already included (e.g., \\\\iff, \\\\ifthenelse, etc.), but you can add \"\n        \"custom exceptions using `--if_exceptions iffalt`.\"\n    ),\n)\n\n\nPARSER.add_argument(\n    \"--use_external_tikz\",\n    type=str,\n    help=(\n        \"Folder (relative to input folder) containing externalized tikz \"\n        \"figures in PDF format.\"\n    ),\n)\n\nPARSER.add_argument(\n    \"--svg_inkscape\",\n    nargs=\"?\",\n    type=str,\n    const=\"svg-inkscape\",\n    help=(\n        \"Include PDF files generated by Inkscape via the `\\\\includesvg` \"\n        \"command from the `svg` package. This is done by replacing the \"\n        \"`\\\\includesvg` calls with `\\\\includeinkscape` calls pointing to the \"\n        \"generated `.pdf_tex` files. By default, these files and the \"\n        \"generated PDFs are located under `./svg-inkscape` (relative to the \"\n        \"input folder), but a different path (relative to the input folder) \"\n        \"can be provided in case a different `inkscapepath` was set when \"\n        \"loading the `svg` package.\"\n    ),\n)\n\nPARSER.add_argument(\n    \"--convert_png_to_jpg\",\n    action=\"store_true\",\n    help=\"Convert PNG images to JPG format to reduce file size. Note that this will override --resize_images for PNG files.\",\n)\n\nPARSER.add_argument(\n    \"--png_quality\",\n    type=int,\n    default=50,\n    help=\"JPG quality for PNG conversion (0-100, default: 50)\",\n)\n\nPARSER.add_argument(\n    \"--png_size_threshold\",\n    type=float,\n    default=0.5,\n    help=\"Minimum PNG file size in MB to apply quality reduction (default: 0.5)\",\n)\n\nPARSER.add_argument(\n    \"--config\",\n    type=str,\n    help=(\n        \"Read settings from `.yaml` config file. 
If command line arguments \"\n        \"are provided additionally, the config file parameters are updated \"\n        \"with the command line parameters.\"\n    ),\n    required=False,\n)\n\nPARSER.add_argument(\n    \"--verbose\",\n    action=\"store_true\",\n    help=\"Enable detailed output.\",\n)\n\nARGS = vars(PARSER.parse_args())\n\nif ARGS[\"config\"] is not None:\n  try:\n    with open(ARGS[\"config\"], \"r\") as config_file:\n      config_params = yaml.safe_load(config_file)\n    final_args = merge_args_into_config(ARGS, config_params)\n\n  except FileNotFoundError:\n    # ARGS is a plain dict (built with vars()), so index it by key.\n    print(f\"config file {ARGS['config']} not found.\")\n    final_args = ARGS\n    final_args.pop(\"config\", None)\nelse:\n  final_args = ARGS\n\nif final_args.get(\"verbose\", False):\n  logging.basicConfig(level=logging.INFO)\nelse:\n  logging.basicConfig(level=logging.ERROR)\n\nrun_arxiv_cleaner(final_args)\nexit(0)\n"
  },
  {
    "path": "arxiv_latex_cleaner/_version.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\n__version__ = \"v1.0.10\"\n"
  },
  {
    "path": "arxiv_latex_cleaner/arxiv_latex_cleaner.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\"\"\"Cleans the LaTeX code of your paper to submit to arXiv.\"\"\"\nimport collections\nimport contextlib\nimport copy\nimport logging\nimport os\nimport pathlib\nimport shutil\nimport subprocess\nimport tempfile\n\nfrom PIL import Image\nimport regex\n\nPDF_RESIZE_COMMAND = (\n    'gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH '\n    '-dDownsampleColorImages=true -dColorImageResolution={resolution} '\n    '-dColorImageDownsampleThreshold=1.0 -dAutoRotatePages=/None '\n    '-sOutputFile={output} {input}'\n)\nMAX_FILENAME_LENGTH = 120\n\n# Fix for Windows: Even if '\\' (os.sep) is the standard way of making paths on\n# Windows, it interferes with regular expressions. 
We just change os.sep to '/'\n# and os.path.join to a version using '/' as Windows will handle it the right\n# way.\nif os.name == 'nt':\n  global old_os_path_join\n\n  def new_os_join(path, *args):\n    res = old_os_path_join(path, *args)\n    res = res.replace('\\\\', '/')\n    return res\n\n  old_os_path_join = os.path.join\n\n  os.sep = '/'\n  os.path.join = new_os_join\n\n\ndef _create_dir_erase_if_exists(path):\n  if os.path.exists(path):\n    shutil.rmtree(path)\n  os.makedirs(path)\n\n\ndef _create_dir_if_not_exists(path):\n  if not os.path.exists(path):\n    os.makedirs(path)\n\n\ndef _keep_pattern(haystack, patterns_to_keep):\n  \"\"\"Keeps the strings that match 'patterns_to_keep'.\"\"\"\n  out = []\n  for item in haystack:\n    if any((regex.findall(rem, item) for rem in patterns_to_keep)):\n      out.append(item)\n  return out\n\n\ndef _remove_pattern(haystack, patterns_to_remove):\n  \"\"\"Removes the strings that match 'patterns_to_remove'.\"\"\"\n  return [\n      item\n      for item in haystack\n      if item not in _keep_pattern([item], patterns_to_remove)\n  ]\n\n\ndef _list_all_files(in_folder, ignore_dirs=None):\n  if ignore_dirs is None:\n    ignore_dirs = []\n  to_consider = [\n      os.path.join(os.path.relpath(path, in_folder), name)\n      if path != in_folder\n      else name\n      for path, _, files in os.walk(in_folder)\n      for name in files\n  ]\n  return _remove_pattern(to_consider, ignore_dirs)\n\n\ndef _copy_file(filename, params):\n  _create_dir_if_not_exists(\n      os.path.join(params['output_folder'], os.path.dirname(filename))\n  )\n  shutil.copy(\n      os.path.join(params['input_folder'], filename),\n      os.path.join(params['output_folder'], filename),\n  )\n\n\ndef _remove_command(text, command, keep_text=False):\n  \"\"\"Removes '\\\\command{*}' from the string 'text'.\n\n  Regex `base_pattern` used to match balanced parentheses taken from:\n  
https://stackoverflow.com/questions/546433/regular-expression-to-match-balanced-parentheses/35271017#35271017\n  \"\"\"\n  base_pattern = (\n      r'\\\\'\n      + command\n      + r'(?:\\[(?:.*?)\\])*\\{((?:[^{}]+|\\{(?1)\\})*)\\}(?:\\[(?:.*?)\\])*'\n  )\n\n  def extract_text_inside_curly_braces(text):\n    \"\"\"Extract text inside of {} from command string\"\"\"\n    pattern = r'\\{((?:[^{}]|(?R))*)\\}'\n\n    match = regex.search(pattern, text)\n\n    if match:\n      return match.group(1)\n    else:\n      return ''\n\n  # Loops in case of nested commands that need to retain text, e.g.,\n  # \\red{hello \\red{world}}.\n  while True:\n    all_substitutions = []\n    has_match = False\n    for match in regex.finditer(base_pattern, text):\n      # In case there are only spaces or nothing up to the following newline,\n      # adds a percent, not to alter the newlines.\n      has_match = True\n\n      if not keep_text:\n        new_substring = ''\n      else:\n        temp_substring = text[match.span()[0] : match.span()[1]]\n        new_substring = extract_text_inside_curly_braces(temp_substring)\n\n      if match.span()[1] < len(text):\n        next_newline = text[match.span()[1] :].find('\\n')\n        if next_newline != -1:\n          text_until_newline = text[\n              match.span()[1] : match.span()[1] + next_newline\n          ]\n          if (\n              not text_until_newline or text_until_newline.isspace()\n          ) and not keep_text:\n            new_substring = '%'\n      all_substitutions.append(\n          (match.span()[0], match.span()[1], new_substring)\n      )\n\n    for start, end, new_substring in reversed(all_substitutions):\n      text = text[:start] + new_substring + text[end:]\n\n    if not keep_text or not has_match:\n      break\n\n  return text\n\n\ndef _remove_environment(text, environment):\n  \"\"\"Removes '\\\\begin{environment}*\\\\end{environment}' from 'text'.\"\"\"\n  # Need to escape '{', to not trigger fuzzy matching 
if `environment` starts\n  # with one of 'i', 'd', 's', or 'e'\n  return regex.sub(\n      r'\\\\begin\\{' + environment + r'}[\\s\\S]*?\\\\end\\{' + environment + r'}',\n      '',\n      text,\n  )\n\n\ndef _simplify_conditional_blocks(text, if_exceptions=[]):\n  r\"\"\"Simplify possibly nested conditional blocks from 'text'.\n\n  For example, `\\iffalse TEST1\\else TEST2\\fi` is simplified to `TEST2`,\n  and `\\iftrue TEST1\\else TEST2\\fi` is simplified to `TEST1`.\n  Unknown conditionals are left untouched.\n\n  If the conditional tree is malformed, the function will print a warning\n  to stderr and return the original text.\n  \"\"\"\n  p = regex.compile(r'(?!(?<=\\\\newif\\s*))\\\\if\\s*(\\w+)|\\\\else(?!\\w)|\\\\fi(?!\\w)')\n  toplevel_tree = {'left': [], 'right': [], 'kind': 'toplevel', 'parent': None}\n\n  tree = toplevel_tree\n\n  exceptions = [\n      # TeX primitives\n      'iff',\n      # package etoolbox\n      'ifpatchable',\n      'ifpatchable*',\n      'ifbool',\n      'iftoggle',\n      'ifdef',\n      'ifcsdef',\n      'ifundef',\n      'ifcsundef',\n      'ifdefmacro',\n      'ifcsmacro',\n      'ifdefparam',\n      'ifcsparam',\n      'ifcsprefix',\n      'ifdefprotected',\n      'ifcsprotected',\n      'ifdefltxprotect',\n      'ifcsltxprotect',\n      'ifdefempty',\n      'ifcsempty',\n      'ifdefvoid',\n      'ifcsvoid',\n      'ifdefequal',\n      'ifcsequal',\n      'ifdefstring',\n      'ifcsstring',\n      'ifdefstrequal',\n      'ifcsstrequal',\n      'ifdefcounter',\n      'ifcscounter',\n      'ifltxcounter',\n      'ifdeflength',\n      'ifcslength',\n      'ifdefdimen',\n      'ifcsdimen',\n      'ifstrequal',\n      'ifstrempty',\n      'ifblank',\n      'ifnumcomp',\n      'ifnumequal',\n      'ifnumodd',\n      'ifdimcomp',\n      'ifdimequal',\n      'ifdimgreater',\n      'ifdimless',\n      'ifboolexpr',\n      'ifboolexpe',\n      'ifinlist',\n      'ifinlistcs',\n      'ifrmnum',\n      # package hyperref\n      
'ifpdfstringunicode',\n      # package ifthen\n      'ifthenelse',\n  ] + if_exceptions\n\n  def new_subtree(kind):\n    return {'kind': kind, 'left': [], 'right': []}\n\n  def add_subtree(tree, subtree):\n    if 'else' not in tree:\n      tree['left'].append(subtree)\n    else:\n      tree['right'].append(subtree)\n    subtree['parent'] = tree\n\n  def print_tree(tree, indent, write):\n    if 'start' in tree:\n      write(' ' * indent + tree['start'].group() + '\\n')\n    for subtree in tree['left']:\n      print_tree(subtree, indent + 2, write)\n    if 'else' in tree:\n      write(' ' * indent + tree['else'].group() + '\\n')\n    for subtree in tree['right']:\n      print_tree(subtree, indent + 2, write)\n    if 'end' in tree:\n      write(' ' * indent + tree['end'].group() + '\\n')\n\n  def print_abort(error_finding):\n    os.sys.stderr.write(\n        f'Warning: Found {error_finding}! Not removing any conditional'\n        ' blocks...\\n'\n    )\n    os.sys.stderr.write(\n        f'         This is the matched tree (as built up to the error):\\n'\n    )\n    print_tree(toplevel_tree, indent=9, write=os.sys.stderr.write)\n    os.sys.stderr.write(\n        f'         Potentially, you need to supply an exception using'\n        f\" '--if_exceptions'.\\n\"\n    )\n\n  for m in p.finditer(text):\n    m_no_space = m.group().replace(' ', '')\n    if m_no_space == r'\\iffalse' or m_no_space == r'\\if0':\n      subtree = new_subtree('iffalse')\n      subtree['start'] = m\n      add_subtree(tree, subtree)\n      tree = subtree\n    elif m_no_space == r'\\iftrue' or m_no_space == r'\\if1':\n      subtree = new_subtree('iftrue')\n      subtree['start'] = m\n      add_subtree(tree, subtree)\n      tree = subtree\n    elif m_no_space.startswith(r'\\if'):\n      if m_no_space[1:] in exceptions:\n        continue\n      subtree = new_subtree('unknown')\n      subtree['start'] = m\n      add_subtree(tree, subtree)\n      tree = subtree\n    elif m_no_space == r'\\else':\n      if 
tree['parent'] is None:\n        print_abort(r'unmatched \\else')\n        return text\n      elif 'else' in tree:\n        print_abort(r'duplicate \\else')\n        return text\n\n      tree['else'] = m\n    elif m.group() == r'\\fi':\n      if tree['parent'] is None:\n        print_abort(r'unmatched \\fi')\n        return text\n\n      tree['end'] = m\n      tree = tree['parent']\n    else:\n      raise RuntimeError('Unreachable!')\n\n  if tree['parent'] is not None:\n    print_abort('unmatched ' + tree['start'].group())\n    return text\n\n  positions_to_delete = []\n\n  def traverse_tree(tree):\n    if tree['kind'] == 'iffalse':\n      if 'else' in tree:\n        positions_to_delete.append((tree['start'].start(), tree['else'].end()))\n        for subtree in tree['right']:\n          traverse_tree(subtree)\n        positions_to_delete.append((tree['end'].start(), tree['end'].end()))\n      else:\n        positions_to_delete.append((tree['start'].start(), tree['end'].end()))\n    elif tree['kind'] == 'iftrue':\n      if 'else' in tree:\n        positions_to_delete.append((tree['start'].start(), tree['start'].end()))\n        for subtree in tree['left']:\n          traverse_tree(subtree)\n        positions_to_delete.append((tree['else'].start(), tree['end'].end()))\n      else:\n        positions_to_delete.append((tree['start'].start(), tree['start'].end()))\n        positions_to_delete.append((tree['end'].start(), tree['end'].end()))\n    elif tree['kind'] == 'unknown':\n      for subtree in tree['left']:\n        traverse_tree(subtree)\n      for subtree in tree['right']:\n        traverse_tree(subtree)\n    else:\n      raise ValueError('Unreachable!')\n\n  for tree in toplevel_tree['left']:\n    traverse_tree(tree)\n\n  for start, end in reversed(positions_to_delete):\n    if end < len(text) and text[end].isspace():\n      end_to_del = end + 1\n    else:\n      end_to_del = end\n    text = text[:start] + text[end_to_del:]\n\n  return text\n\n\ndef 
_remove_comments_inline(text):\n  \"\"\"Removes the comments from the string 'text' and ignores % inside \\\\url{}.\"\"\"\n  auto_ignore_pattern = r'(%\\s*auto-ignore).*'\n  if regex.search(auto_ignore_pattern, text):\n    return regex.sub(auto_ignore_pattern, r'\\1', text)\n\n  if text.lstrip(' ').lstrip('\\t').startswith('%'):\n    return ''\n\n  url_pattern = r'\\\\url\\{(?>[^{}]|(?R))*\\}'\n\n  def remove_comments(segment):\n    \"\"\"Check if a segment of text contains a comment and remove it.\"\"\"\n    if segment.lstrip().startswith('%'):\n      return '', True\n    match = regex.search(r'(?<!\\\\)%', segment)\n    if match:\n      return segment[: match.end()] + '\\n', True\n    else:\n      return segment, False\n\n  # split the text into segments based on \\url{} tags\n  segments = regex.split(f'({url_pattern})', text)\n\n  for i in range(len(segments)):\n    # only process segments that are not part of a \\url{} tag\n    if not regex.match(url_pattern, segments[i]):\n      segments[i], match = remove_comments(segments[i])\n      if match:\n        # remove all segments after the first inline comment\n        segments = segments[: i + 1]\n        break\n\n  final_text = ''.join(segments)\n  return (\n      final_text\n      if final_text.endswith('\\n') or final_text.endswith('\\\\n')\n      else final_text + '\\n'\n  )\n\n\ndef _strip_tex_contents(lines, end_str):\n  \"\"\"Removes everything after end_str.\"\"\"\n  for i in range(len(lines)):\n    if end_str in lines[i]:\n      if '%' not in lines[i]:\n        return lines[: i + 1]\n      elif lines[i].index('%') > lines[i].index(end_str):\n        return lines[: i + 1]\n  return lines\n\n\ndef _read_file_content(filename):\n  with open(filename, 'r', encoding='utf-8') as fp:\n    lines = fp.readlines()\n    lines = _strip_tex_contents(lines, '\\\\end{document}')\n    return lines\n\n\ndef _read_all_tex_contents(tex_files, parameters):\n  contents = {}\n  for fn in tex_files:\n    contents[fn] = 
_read_file_content(\n        os.path.join(parameters['input_folder'], fn)\n    )\n  return contents\n\n\ndef _write_file_content(content, filename):\n  _create_dir_if_not_exists(os.path.dirname(filename))\n  with open(filename, 'w', encoding='utf-8') as fp:\n    return fp.write(content)\n\n\ndef _remove_comments_and_commands_to_delete(content, parameters):\n  \"\"\"Erases all LaTeX comments in the content, and writes it.\"\"\"\n  content = [_remove_comments_inline(line) for line in content]\n  content = _remove_environment(''.join(content), 'comment')\n  content = _simplify_conditional_blocks(\n      content, parameters.get('if_exceptions', [])\n  )\n  for environment in parameters.get('environments_to_delete', []):\n    content = _remove_environment(content, environment)\n  for command in parameters.get('commands_only_to_delete', []):\n    content = _remove_command(content, command, True)\n  for command in parameters['commands_to_delete']:\n    content = _remove_command(content, command, False)\n  return content\n\n\ndef _replace_tikzpictures(content, figures):\n  \"\"\"Replaces all tikzpicture environments (with includegraphic commands of\n\n  external PDF figures) in the content, and writes it.\n  \"\"\"\n\n  def get_figure(matchobj):\n    found_tikz_filename = regex.search(\n        r'\\\\tikzsetnextfilename{(.*?)}', matchobj.group(0)\n    ).group(1)\n    # search in tex split if figure is available\n    matching_tikz_filenames = _keep_pattern(\n        figures, ['/' + found_tikz_filename + '.pdf']\n    )\n    if len(matching_tikz_filenames) == 1:\n      return '\\\\includegraphics{' + matching_tikz_filenames[0] + '}'\n    else:\n      return matchobj.group(0)\n\n  content = regex.sub(\n      r'\\\\tikzsetnextfilename{[\\s\\S]*?\\\\end{tikzpicture}', get_figure, content\n  )\n\n  return content\n\n\ndef _replace_includesvg(content, svg_inkscape_files):\n  def repl_svg(matchobj):\n    svg_path = matchobj.group(2)\n    if svg_path.endswith('.svg'):\n      
svg_path = '_'.join(svg_path.rsplit('.', 1))\n    svg_filename = os.path.basename(svg_path)\n\n    # search in svg_inkscape split if pdf_tex file is available\n    matching_pdf_tex_files = _keep_pattern(\n        svg_inkscape_files, ['/' + svg_filename + '-tex.pdf_tex']\n    )\n    if len(matching_pdf_tex_files) == 1:\n      options = '' if matchobj.group(1) is None else matchobj.group(1)\n      res = f'\\\\includeinkscape{options}{{{matching_pdf_tex_files[0]}}}'\n      return res\n    else:\n      return matchobj.group(0)\n\n  content = regex.sub(r'\\\\includesvg(\\[.*?\\])?{(.*?)}', repl_svg, content)\n\n  return content\n\ndef _resize_and_copy_figure(\n    filename,\n    origin_folder,\n    destination_folder,\n    resize_image,\n    image_size,\n    compress_pdf,\n    pdf_resolution,\n    convert_png_to_jpg=False,\n    png_quality=50,\n    png_size_threshold=0.5,\n    verbose=False\n):\n    \"\"\"Resizes and copies the input figure (either JPG, PNG, or PDF).\n\n    Parameters:\n        filename: The input filename\n        origin_folder: The folder containing the input filename\n        destination_folder: The folder to copy the output filename to\n        resize_image: Whether to resize the image\n        image_size: The maximum size of the image in pixels\n        compress_pdf: Whether to compress the PDF file\n        convert_png_to_jpg: Whether to convert PNG files to JPG format. 
Note that this will override resize_image for PNG files.\n        png_quality: JPG quality for converted PNG files (0-100)\n        png_size_threshold: Minimum file size in MB to apply quality reduction\n        verbose: Enable verbose logging\n    \n    Returns:\n        str: The actual output filename (may differ from input if PNG was converted)\n    \"\"\"\n    _create_dir_if_not_exists(\n        os.path.join(destination_folder, os.path.dirname(filename))\n    )\n    \n    if convert_png_to_jpg and os.path.splitext(filename)[1].lower() in ['.png']:\n        original_size_mb = os.path.getsize(os.path.join(origin_folder, filename)) / (1024 * 1024)\n        im = Image.open(os.path.join(origin_folder, filename))\n        # Determine quality based on file size\n        if original_size_mb < png_size_threshold:\n            quality = 100  # Keep high quality for small files\n            if verbose:\n                print(f\"Keeping original quality for small PNG: {filename}\")\n        else:\n            quality = png_quality\n            if verbose:\n                print(f\"Converting PNG to JPG with quality {quality}: {filename}\")\n        \n        # Convert PNG to JPG\n        output_filename = os.path.splitext(filename)[0] + '.jpg'\n        rgb_img = im.convert('RGB')\n        rgb_img.save(os.path.join(destination_folder, output_filename), 'JPEG', quality=quality)\n        \n        if verbose:\n            print(f\"Converted: {filename} -> {output_filename}\")\n          \n        return output_filename\n                    \n    if resize_image and os.path.splitext(filename)[1].lower() in [\n        '.jpg',\n        '.jpeg',\n        '.png',\n    ]:\n        try:\n            im = Image.open(os.path.join(origin_folder, filename))\n            if max(im.size) > image_size:\n                im = im.resize(\n                    tuple([int(x * float(image_size) / max(im.size)) for x in im.size]),\n                    Image.Resampling.LANCZOS,\n                )\n 
           \n            if os.path.splitext(filename)[1].lower() in ['.jpg', '.jpeg']:\n                im.save(os.path.join(destination_folder, filename), 'JPEG', quality=90)\n                return filename\n                \n            elif os.path.splitext(filename)[1].lower() in ['.png']:\n                im.save(os.path.join(destination_folder, filename), 'PNG')\n                return filename\n                    \n        except Exception as e:\n            if verbose:\n                print(f\"Failed to process image {filename}: {e}\")\n            # Fall back to simple copy\n            shutil.copy(\n                os.path.join(origin_folder, filename),\n                os.path.join(destination_folder, filename),\n            )\n            return filename\n\n    elif compress_pdf and os.path.splitext(filename)[1].lower() == '.pdf':\n        _resize_pdf_figure(\n            filename, origin_folder, destination_folder, pdf_resolution\n        )\n        return filename\n    else:\n        shutil.copy(\n            os.path.join(origin_folder, filename),\n            os.path.join(destination_folder, filename),\n        )\n        return filename\n\n\ndef _update_image_references(tex_contents_dict, old_filename, new_filename, verbose=False):\n    \"\"\"Update references from old_filename to new_filename in all tex content.\"\"\"\n    if old_filename == new_filename:\n        return  # No change needed\n    \n    old_base = os.path.splitext(old_filename)[0]\n    new_base = os.path.splitext(new_filename)[0]\n    \n    if verbose:\n        print(f\"Updating LaTeX references: {old_filename} -> {new_filename}\")\n    \n    for tex_file in tex_contents_dict:\n        # Handle both string and list content\n        if isinstance(tex_contents_dict[tex_file], list):\n            content = ''.join(tex_contents_dict[tex_file])\n        else:\n            content = tex_contents_dict[tex_file]\n        \n        content_changed = False\n        \n        # Pattern 1: 
Direct filename with full extension, handling comments and newlines\n        pattern1 = r'(\\{(?:%\\s*\\n\\s*)?[^}]*?)' + regex.escape(old_filename) + r'((?:%\\s*\\n\\s*)?[^}]*?\\})'\n        replacement1 = r'\\1' + new_filename + r'\\2'\n        \n        new_content = regex.sub(pattern1, replacement1, content, flags=regex.IGNORECASE | regex.DOTALL)\n        if new_content != content:\n            content = new_content\n            content_changed = True\n            if verbose:\n                print(f\"Applied pattern 1 (full filename) in {tex_file}\")\n        else:\n            # Pattern 2: Base filename without extension, handling comments and newlines\n            # Only apply this if Pattern 1 didn't match to avoid double replacements\n            pattern2 = r'(\\{(?:%\\s*\\n\\s*)?[^}]*?)' + regex.escape(old_base) + r'((?:%\\s*\\n\\s*)?[^}]*?\\})'\n            replacement2 = r'\\1' + new_base + r'.jpg\\2'\n            \n            new_content = regex.sub(pattern2, replacement2, content, flags=regex.IGNORECASE | regex.DOTALL)\n            if new_content != content:\n                content = new_content\n                content_changed = True\n                if verbose:\n                    print(f\"Applied pattern 2 (base filename) in {tex_file}\")\n            else:\n                # Pattern 3: Handle cases where extension is split across lines with comments\n                # This specifically targets patterns like: images/filename%\\n.png\n                pattern3 = r'(\\{[^}]*?)' + regex.escape(old_base) + r'(%\\s*\\n\\s*)(\\.png)([^}]*?\\})'\n                replacement3 = r'\\1' + new_base + r'\\2.jpg\\4'\n                \n                new_content = regex.sub(pattern3, replacement3, content, flags=regex.IGNORECASE | regex.DOTALL)\n                if new_content != content:\n                    content = new_content\n                    content_changed = True\n                    if verbose:\n                        print(f\"Applied pattern 3 
(split extension) in {tex_file}\")\n        \n        # Update the content back in the appropriate format\n        if content_changed:\n            if isinstance(tex_contents_dict[tex_file], list):\n                # Convert back to list format, preserving line endings\n                tex_contents_dict[tex_file] = content.split('\\n')\n            else:\n                tex_contents_dict[tex_file] = content\n            \n            if verbose:\n                print(f\"Updated references in {tex_file}\")\n    \n    # Re-write the updated tex files to the output directory\n    if verbose and any(tex_contents_dict.values()):\n        print(\"Re-writing updated tex files...\")\n    \n    return tex_contents_dict\n\n\ndef _resize_pdf_figure(\n    filename, origin_folder, destination_folder, resolution, timeout=10\n):\n  input_file = os.path.join(origin_folder, filename)\n  output_file = os.path.join(destination_folder, filename)\n  bash_command = PDF_RESIZE_COMMAND.format(\n      input=input_file, output=output_file, resolution=resolution\n  )\n  process = subprocess.Popen(bash_command.split(), stdout=subprocess.PIPE)\n\n  try:\n    process.communicate(timeout=timeout)\n  except subprocess.TimeoutExpired:\n    process.kill()\n    outs, errs = process.communicate()\n    print('Output: ', outs)\n    print('Errors: ', errs)\n\n\ndef _copy_only_referenced_non_tex_not_in_root(parameters, contents, splits):\n  for fn in _keep_only_referenced(\n      splits['non_tex_not_in_root'], contents, strict=True\n  ):\n    _copy_file(fn, parameters)\n\ndef _resize_and_copy_figures_if_referenced(parameters, contents, splits):\n    \"\"\"Modified to handle PNG to JPG conversion and reference updates.\"\"\"\n    image_size = collections.defaultdict(lambda: parameters['im_size'])\n    image_size.update(parameters['images_allowlist'])\n    pdf_resolution = collections.defaultdict(\n        lambda: parameters['pdf_im_resolution']\n    )\n    
pdf_resolution.update(parameters['images_allowlist'])\n    \n    # contents is the full content string for reference checking\n    \n    filename_changes = {}  # Track PNG -> JPG filename changes\n    \n    for image_file in _keep_only_referenced(\n        splits['figures'], contents, strict=False\n    ):\n        actual_output_filename = _resize_and_copy_figure(\n            filename=image_file,\n            origin_folder=parameters['input_folder'],\n            destination_folder=parameters['output_folder'],\n            resize_image=parameters['resize_images'],\n            image_size=image_size[image_file],\n            compress_pdf=parameters['compress_pdf'],\n            pdf_resolution=pdf_resolution[image_file],\n            convert_png_to_jpg=parameters.get('convert_png_to_jpg', False),\n            png_quality=parameters.get('png_quality', 50),\n            png_size_threshold=parameters.get('png_size_threshold', 0.5),\n            verbose=parameters.get('verbose', False)\n        )\n        \n        # Track filename changes for reference updates\n        if actual_output_filename != image_file:\n            filename_changes[image_file] = actual_output_filename\n    \n    return filename_changes\n\n\ndef _search_reference(filename, contents, strict=False):\n  \"\"\"Returns a match object if filename is referenced in contents, and None otherwise.\n\n  If not strict mode, path prefix and extension are optional.\n  \"\"\"\n  if strict:\n    # regex pattern for strict=True for path/to/img.ext:\n    # \\{[\\s%]*path/to/img\\.ext[\\s%]*\\}\n    filename_regex = filename.replace('.', r'\\.')\n  else:\n    filename_path = pathlib.Path(filename)\n\n    # make extension optional\n    root, extension = filename_path.stem, filename_path.suffix\n    basename_regex = '{}({})?'.format(\n        regex.escape(root), regex.escape(extension)\n    )\n\n    # iterate through parent fragments to make path prefix optional\n    path_prefix_regex = ''\n    for fragment in 
reversed(filename_path.parents):\n      if fragment.name == '.':\n        continue\n      fragment = regex.escape(fragment.name)\n      path_prefix_regex = '({}{}{})?'.format(\n          path_prefix_regex, fragment, os.sep\n      )\n\n    # Regex pattern for strict=False for path/to/img.ext:\n    # \\{[\\s%]*(<path_prefix>)?<basename>(<ext>)?[\\s%]*\\}\n    filename_regex = path_prefix_regex + basename_regex\n\n  # Some files 'path/to/file' are referenced in tex as './path/to/file' thus\n  # adds prefix for relative paths starting with './' or '.\\' to regex search.\n  # The dot is escaped so that it only matches a literal '.'.\n  filename_regex = r'(\\.' + os.sep + r')?' + filename_regex\n\n  # Pads with braces and optional whitespace/comment characters.\n  patn = r'\\{{[\\s%]*{}[\\s%]*\\}}'.format(filename_regex)\n  # Picture references in LaTeX are allowed to be in different cases.\n  return regex.search(patn, contents, regex.IGNORECASE)\n\n\ndef _keep_only_referenced(filenames, contents, strict=False):\n  \"\"\"Returns the filenames referenced from contents.\n\n  If not strict mode, path prefix and extension are optional.\n  \"\"\"\n  return [\n      fn\n      for fn in filenames\n      if _search_reference(fn, contents, strict) is not None\n  ]\n\n\ndef _keep_only_referenced_tex(contents, splits):\n  \"\"\"Returns the filenames referenced from the tex files themselves.\n\n  It needs various iterations in case one file is referenced from an\n  unreferenced file.\n  \"\"\"\n  old_referenced = set(splits['tex_in_root'] + splits['tex_not_in_root'])\n  while True:\n    referenced = set(splits['tex_in_root'])\n    for fn in old_referenced:\n      for fn2 in old_referenced:\n        if regex.search(\n            r'(' + os.path.splitext(fn)[0] + r'[.}])', '\\n'.join(contents[fn2])\n        ):\n          referenced.add(fn)\n\n    if referenced == old_referenced:\n      splits['tex_to_copy'] = list(referenced)\n      return\n\n    old_referenced = referenced.copy()\n\n\ndef _add_root_tex_files(splits):\n  # TODO: Check auto-ignore 
marker in root to detect the main file. Then check\n  #  there is only one non-referenced TeX in root.\n\n  # Forces the TeX in root to be copied, even if they are not referenced.\n  for fn in splits['tex_in_root']:\n    if fn not in splits['tex_to_copy']:\n      splits['tex_to_copy'].append(fn)\n\n\ndef _split_all_files(parameters):\n  \"\"\"Splits the files into types or location to know what to do with them.\"\"\"\n  file_splits = {\n      'all': _list_all_files(\n          parameters['input_folder'], ignore_dirs=['.git' + os.sep]\n      ),\n      'in_root': [\n          f\n          for f in os.listdir(parameters['input_folder'])\n          if os.path.isfile(os.path.join(parameters['input_folder'], f))\n      ],\n  }\n\n  file_splits['not_in_root'] = [\n      f for f in file_splits['all'] if f not in file_splits['in_root']\n  ]\n  file_splits['to_copy_in_root'] = _remove_pattern(\n      file_splits['in_root'],\n      parameters['to_delete'] + parameters['figures_to_copy_if_referenced'],\n  )\n  file_splits['to_copy_not_in_root'] = _remove_pattern(\n      file_splits['not_in_root'],\n      parameters['to_delete'] + parameters['figures_to_copy_if_referenced'],\n  )\n  file_splits['figures'] = _keep_pattern(\n      file_splits['all'], parameters['figures_to_copy_if_referenced']\n  )\n\n  file_splits['tex_in_root'] = _keep_pattern(\n      file_splits['to_copy_in_root'], ['.tex$', '.tikz$']\n  )\n  file_splits['tex_not_in_root'] = _keep_pattern(\n      file_splits['to_copy_not_in_root'], ['.tex$', '.tikz$']\n  )\n\n  file_splits['non_tex_in_root'] = _remove_pattern(\n      file_splits['to_copy_in_root'], ['.tex$', '.tikz$']\n  )\n  file_splits['non_tex_not_in_root'] = _remove_pattern(\n      file_splits['to_copy_not_in_root'], ['.tex$', '.tikz$']\n  )\n\n  if parameters.get('use_external_tikz', None) is not None:\n    file_splits['external_tikz_figures'] = _keep_pattern(\n        file_splits['all'], [parameters['use_external_tikz']]\n    )\n  else:\n    
file_splits['external_tikz_figures'] = []\n\n  if parameters.get('svg_inkscape', None) is not None:\n    file_splits['svg_inkscape'] = _keep_pattern(\n        file_splits['all'], [parameters['svg_inkscape']]\n    )\n  else:\n    file_splits['svg_inkscape'] = []\n\n  return file_splits\n\n\ndef _create_out_folder(input_folder):\n  \"\"\"Creates the output folder, erasing it if it already exists.\"\"\"\n  out_folder = os.path.abspath(input_folder).removesuffix('.zip') + '_arXiv'\n  _create_dir_erase_if_exists(out_folder)\n\n  return out_folder\n\n\ndef run_arxiv_cleaner(parameters):\n  \"\"\"Core of the code, runs the actual arXiv cleaner.\"\"\"\n\n  files_to_delete = [\n      r'\\.aux$',\n      r'\\.sh$',\n      r'\\.blg$',\n      r'\\.brf$',\n      r'\\.log$',\n      r'\\.out$',\n      r'\\.ps$',\n      r'\\.dvi$',\n      r'\\.synctex\\.gz$',\n      '~$',\n      r'\\.backup$',\n      r'\\.gitignore$',\n      r'\\.DS_Store$',\n      r'\\.svg$',\n      r'^\\.idea',\n      r'\\.dpth$',\n      r'\\.md5$',\n      r'\\.dep$',\n      r'\\.auxlock$',\n      r'\\.fls$',\n      r'\\.fdb_latexmk$',\n  ]\n\n  if not parameters['keep_bib']:\n    files_to_delete.append(r'\\.bib$')\n\n  parameters.update({\n      'to_delete': files_to_delete,\n      'figures_to_copy_if_referenced': [\n          r'\\.png$',\n          r'\\.jpg$',\n          r'\\.jpeg$',\n          r'\\.pdf$',\n      ],\n  })\n\n  logging.info('Collecting file structure.')\n  parameters['output_folder'] = _create_out_folder(parameters['input_folder'])\n\n  from_zip = parameters['input_folder'].endswith('.zip')\n  tempdir_context = (\n      tempfile.TemporaryDirectory() if from_zip else contextlib.suppress()\n  )\n\n  with tempdir_context as tempdir:\n\n    if from_zip:\n      logging.info('Unzipping input folder.')\n      shutil.unpack_archive(parameters['input_folder'], tempdir)\n      parameters['input_folder'] = tempdir\n\n    splits = _split_all_files(parameters)\n\n    logging.info('Reading all tex files')\n    
tex_contents = _read_all_tex_contents(\n        splits['tex_in_root'] + splits['tex_not_in_root'], parameters\n    )\n\n    for tex_file in tex_contents:\n      logging.info('Removing comments in file %s.', tex_file)\n      tex_contents[tex_file] = _remove_comments_and_commands_to_delete(\n          tex_contents[tex_file], parameters\n      )\n\n    for tex_file in tex_contents:\n      logging.info('Replacing \\\\includesvg calls in file %s.', tex_file)\n      tex_contents[tex_file] = _replace_includesvg(\n          tex_contents[tex_file], splits['svg_inkscape']\n      )\n\n    for tex_file in tex_contents:\n      logging.info('Replacing Tikz Pictures in file %s.', tex_file)\n      content = _replace_tikzpictures(\n          tex_contents[tex_file], splits['external_tikz_figures']\n      )\n      # If file ends with '\\n' already, the split in last line would add an extra\n      # '\\n', so we remove it.\n      tex_contents[tex_file] = content.split('\\n')\n\n    _keep_only_referenced_tex(tex_contents, splits)\n    _add_root_tex_files(splits)\n\n    for tex_file in splits['tex_to_copy']:\n      logging.info('Replacing patterns in file %s.', tex_file)\n      content = '\\n'.join(tex_contents[tex_file])\n      content = _find_and_replace_patterns(\n          content, parameters.get('patterns_and_insertions', list())\n      )\n      tex_contents[tex_file] = content\n      new_path = os.path.join(parameters['output_folder'], tex_file)\n      logging.info('Writing modified contents to %s.', new_path)\n      _write_file_content(\n          content,\n          new_path,\n      )\n\n    full_content = '\\n'.join(\n        ''.join(tex_contents[fn]) for fn in splits['tex_to_copy']\n    )\n    _copy_only_referenced_non_tex_not_in_root(parameters, full_content, splits)\n    for non_tex_file in splits['non_tex_in_root']:\n      logging.info('Copying non-tex file %s.', non_tex_file)\n      _copy_file(non_tex_file, parameters)\n\n    filename_changes = 
_resize_and_copy_figures_if_referenced(parameters, full_content, splits)\n    logging.info('Outputs written to %s', parameters['output_folder'])\n\n    # Update LaTeX references in the tex contents for any renamed figures\n    # (e.g. PNG files converted to JPG).\n    if tex_contents and filename_changes:\n        for old_filename, new_filename in filename_changes.items():\n            tex_contents = _update_image_references(\n                tex_contents, old_filename, new_filename,\n                verbose=parameters.get('verbose', False)\n            )\n\n        # Re-write modified tex files with new references after resizing and copying figures\n        for tex_file in splits['tex_to_copy']:\n            if tex_file in tex_contents:\n                # Get the updated content\n                if isinstance(tex_contents[tex_file], list):\n                    updated_content = ''.join(tex_contents[tex_file])\n                else:\n                    updated_content = tex_contents[tex_file]\n\n                # Write the updated content back to the output file\n                output_path = os.path.join(parameters['output_folder'], tex_file)\n                logging.info('Re-writing modified tex file with updated references: %s', output_path)\n                _write_file_content(updated_content, output_path)\n\n                if parameters.get('verbose', False):\n                    print(f\"Re-wrote {tex_file} with updated image references\")\n\n        if parameters.get('verbose', False):\n            print(f\"Updated {len(filename_changes)} image references and re-wrote tex files\")\n\n\ndef strip_whitespace(text):\n  \"\"\"Strips all whitespace characters.\n\n  https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string\n  \"\"\"\n  pattern = regex.compile(r'\\s+')\n  text = regex.sub(pattern, '', text)\n  return text\n\n\ndef merge_args_into_config(args, config_params):\n  final_args = copy.deepcopy(config_params)\n  config_keys 
= config_params.keys()\n  for key, value in args.items():\n    if key in config_keys:\n      if any([isinstance(value, t) for t in [str, bool, float, int]]):\n        # Overwrites config value with args value.\n        final_args[key] = value\n      elif isinstance(value, list):\n        # Appends args values to config values.\n        final_args[key] = value + config_params[key]\n      elif isinstance(value, dict):\n        # Updates config params with args params.\n        final_args[key].update(**value)\n    else:\n      final_args[key] = value\n  return final_args\n\n\ndef _find_and_replace_patterns(content, patterns_and_insertions):\n  r\"\"\"content: str\n\n  patterns_and_insertions: List[Dict]\n\n  Example for patterns_and_insertions:\n\n      [\n          {\n              \"pattern\" :\n              r\"(?:\\\\figcompfigures{\\s*)(?P<first>.*?)\\s*}\\s*{\\s*(?P<second>.*?)\\s*}\\s*{\\s*(?P<third>.*?)\\s*}\",\n              \"insertion\" :\n              r\"\\parbox[c]{{{second}\\linewidth}}{{\\includegraphics[width={third}\\linewidth]{{figures/{first}}}}}}\",\n              \"description\": \"Replace figcompfigures\"\n          },\n      ]\n  \"\"\"\n  for pattern_and_insertion in patterns_and_insertions:\n    pattern = pattern_and_insertion['pattern']\n    insertion = pattern_and_insertion['insertion']\n    description = pattern_and_insertion['description']\n    logging.info('Processing pattern: %s.', description)\n    p = regex.compile(pattern)\n    m = p.search(content)\n    while m is not None:\n      local_insertion = insertion.format(**m.groupdict())\n      if pattern_and_insertion.get('strip_whitespace', True):\n        local_insertion = strip_whitespace(local_insertion)\n      logging.info(f'Found {content[m.start():m.end()]:<70}')\n      logging.info(f'Replacing with {local_insertion:<30}')\n      content = content[: m.start()] + local_insertion + content[m.end() :]\n      m = p.search(content)\n    logging.info('Finished pattern: %s.', 
description)\n  return content\n"
  },
  {
    "path": "arxiv_latex_cleaner/tests/arxiv_latex_cleaner_test.py",
    "content": "# coding=utf-8\n# Copyright 2018 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom os import path\nimport shutil\nimport unittest\nfrom absl.testing import parameterized\nfrom arxiv_latex_cleaner import arxiv_latex_cleaner\nfrom PIL import Image\n\n\ndef make_args(\n    input_folder='foo/bar',\n    resize_images=False,\n    im_size=500,\n    compress_pdf=False,\n    pdf_im_resolution=500,\n    images_allowlist=None,\n    commands_to_delete=None,\n    use_external_tikz='foo/bar/tikz',\n):\n  if images_allowlist is None:\n    images_allowlist = {}\n  if commands_to_delete is None:\n    commands_to_delete = []\n  args = {\n      'input_folder': input_folder,\n      'resize_images': resize_images,\n      'im_size': im_size,\n      'compress_pdf': compress_pdf,\n      'pdf_im_resolution': pdf_im_resolution,\n      'images_allowlist': images_allowlist,\n      'commands_to_delete': commands_to_delete,\n      'use_external_tikz': use_external_tikz,\n  }\n  return args\n\n\ndef make_contents():\n  return (\n      r'& \\figcompfigures{'\n      '\\n\\timage1.jpg'\n      '\\n}{'\n      '\\n\\t'\n      r'\\ww'\n      '\\n}{'\n      '\\n\\t1.0'\n      '\\n\\t}'\n      '\\n& '\n      r'\\figcompfigures{image2.jpg}{\\ww}{1.0}'\n  )\n\n\ndef make_patterns():\n  pattern = r'(?:\\\\figcompfigures{\\s*)(?P<first>.*?)\\s*}\\s*{\\s*(?P<second>.*?)\\s*}\\s*{\\s*(?P<third>.*?)\\s*}'\n  insertion = r\"\"\"\\parbox[c]{{\n            
{second}\\linewidth\n        }}{{\n            \\includegraphics[\n                width={third}\\linewidth\n            ]{{\n                figures/{first}\n            }}\n        }} \"\"\"\n  description = 'Replace figcompfigures'\n  output = {\n      'pattern': pattern,\n      'insertion': insertion,\n      'description': description,\n  }\n  return [output]\n\n\ndef make_search_reference_tests():\n  return (\n      {\n          'testcase_name': 'prefix1',\n          'filenames': ['include_image_yes.png', 'include_image.png'],\n          'contents': '\\\\include{include_image_yes.png}',\n          'strict': False,\n          'true_outputs': ['include_image_yes.png'],\n      },\n      {\n          'testcase_name': 'prefix2',\n          'filenames': ['include_image_yes.png', 'include_image.png'],\n          'contents': '\\\\include{include_image.png}',\n          'strict': False,\n          'true_outputs': ['include_image.png'],\n      },\n      {\n          'testcase_name': 'nested_more_specific',\n          'filenames': [\n              'images/im_included.png',\n              'images/include/images/im_included.png',\n          ],\n          'contents': '\\\\include{images/include/images/im_included.png}',\n          'strict': False,\n          'true_outputs': ['images/include/images/im_included.png'],\n      },\n      {\n          'testcase_name': 'nested_less_specific',\n          'filenames': [\n              'images/im_included.png',\n              'images/include/images/im_included.png',\n          ],\n          'contents': '\\\\include{images/im_included.png}',\n          'strict': False,\n          'true_outputs': [\n              'images/im_included.png',\n              'images/include/images/im_included.png',\n          ],\n      },\n      {\n          'testcase_name': 'nested_substring',\n          'filenames': ['images/im_included.png', 'im_included.png'],\n          'contents': '\\\\include{images/im_included.png}',\n          'strict': False,\n    
      'true_outputs': ['images/im_included.png'],\n      },\n      {\n          'testcase_name': 'nested_diffpath',\n          'filenames': ['images/im_included.png', 'figures/im_included.png'],\n          'contents': '\\\\include{images/im_included.png}',\n          'strict': False,\n          'true_outputs': ['images/im_included.png'],\n      },\n      {\n          'testcase_name': 'diffext',\n          'filenames': ['tables/demo.tex', 'tables/demo.tikz', 'demo.tex'],\n          'contents': '\\\\include{tables/demo.tex}',\n          'strict': False,\n          'true_outputs': ['tables/demo.tex'],\n      },\n      {\n          'testcase_name': 'diffext2',\n          'filenames': ['tables/demo.tex', 'tables/demo.tikz', 'demo.tex'],\n          'contents': '\\\\include{tables/demo}',\n          'strict': False,\n          'true_outputs': ['tables/demo.tex', 'tables/demo.tikz'],\n      },\n      {\n          'testcase_name': 'strict_prefix1',\n          'filenames': ['demo_yes.tex', 'demo.tex'],\n          'contents': '\\\\include{demo_yes.tex}',\n          'strict': True,\n          'true_outputs': ['demo_yes.tex'],\n      },\n      {\n          'testcase_name': 'strict_prefix2',\n          'filenames': ['demo_yes.tex', 'demo.tex'],\n          'contents': '\\\\include{demo.tex}',\n          'strict': True,\n          'true_outputs': ['demo.tex'],\n      },\n      {\n          'testcase_name': 'strict_nested_more_specific',\n          'filenames': [\n              'tables/table_included.csv',\n              'tables/include/tables/table_included.csv',\n          ],\n          'contents': '\\\\include{tables/include/tables/table_included.csv}',\n          'strict': True,\n          'true_outputs': ['tables/include/tables/table_included.csv'],\n      },\n      {\n          'testcase_name': 'strict_nested_less_specific',\n          'filenames': [\n              'tables/table_included.csv',\n              'tables/include/tables/table_included.csv',\n          ],\n          
'contents': '\\\\include{tables/table_included.csv}',\n          'strict': True,\n          'true_outputs': ['tables/table_included.csv'],\n      },\n      {\n          'testcase_name': 'strict_nested_substring1',\n          'filenames': ['tables/table_included.csv', 'table_included.csv'],\n          'contents': '\\\\include{tables/table_included.csv}',\n          'strict': True,\n          'true_outputs': ['tables/table_included.csv'],\n      },\n      {\n          'testcase_name': 'strict_nested_substring2',\n          'filenames': ['tables/table_included.csv', 'table_included.csv'],\n          'contents': '\\\\include{table_included.csv}',\n          'strict': True,\n          'true_outputs': ['table_included.csv'],\n      },\n      {\n          'testcase_name': 'strict_nested_diffpath',\n          'filenames': ['tables/table_included.csv', 'data/table_included.csv'],\n          'contents': '\\\\include{tables/table_included.csv}',\n          'strict': True,\n          'true_outputs': ['tables/table_included.csv'],\n      },\n      {\n          'testcase_name': 'strict_diffext',\n          'filenames': ['tables/demo.csv', 'tables/demo.txt', 'demo.csv'],\n          'contents': '\\\\include{tables/demo.csv}',\n          'strict': True,\n          'true_outputs': ['tables/demo.csv'],\n      },\n      {\n          'testcase_name': 'path_starting_with_dot',\n          'filenames': [\n              './images/im_included.png',\n              './figures/im_included.png',\n          ],\n          'contents': '\\\\include{./images/im_included.png}',\n          'strict': False,\n          'true_outputs': ['./images/im_included.png'],\n      },\n  )\n\n\nclass UnitTests(parameterized.TestCase):\n\n  @parameterized.named_parameters(\n      {\n          'testcase_name': 'empty config',\n          'args': make_args(),\n          'config_params': {},\n          'final_args': make_args(),\n      },\n      {\n          'testcase_name': 'empty args',\n          'args': {},\n       
   'config_params': make_args(),\n          'final_args': make_args(),\n      },\n      {\n          'testcase_name': 'args and config provided',\n          'args': make_args(\n              images_allowlist={'path1/': 1000}, commands_to_delete=[r'\\todo1']\n          ),\n          'config_params': make_args(\n              'foo_/bar_',\n              True,\n              1000,\n              True,\n              1000,\n              images_allowlist={'path2/': 1000},\n              commands_to_delete=[r'\\todo2'],\n              use_external_tikz='foo_/bar_/tikz_',\n          ),\n          'final_args': make_args(\n              images_allowlist={'path1/': 1000, 'path2/': 1000},\n              commands_to_delete=[r'\\todo1', r'\\todo2'],\n          ),\n      },\n  )\n  def test_merge_args_into_config(self, args, config_params, final_args):\n    self.assertEqual(\n        arxiv_latex_cleaner.merge_args_into_config(args, config_params),\n        final_args,\n    )\n\n  @parameterized.named_parameters(\n      {\n          'testcase_name': 'no_comment',\n          'line_in': 'Foo\\n',\n          'true_output': 'Foo\\n',\n      },\n      {\n          'testcase_name': 'auto_ignore',\n          'line_in': '%auto-ignore\\n',\n          'true_output': '%auto-ignore\\n',\n      },\n      {\n          'testcase_name': 'auto_ignore_middle',\n          'line_in': 'Foo % auto-ignore Comment\\n',\n          'true_output': 'Foo % auto-ignore\\n',\n      },\n      {\n          'testcase_name': 'auto_ignore_text_with_comment',\n          'line_in': 'Foo auto-ignore % Comment\\n',\n          'true_output': 'Foo auto-ignore %\\n',\n      },\n      {\n          'testcase_name': 'percent',\n          'line_in': r'100\\% accurate\\n',\n          'true_output': r'100\\% accurate\\n',\n      },\n      {\n          'testcase_name': 'comment',\n          'line_in': '  % Comment\\n',\n          'true_output': '',\n      },\n      {\n          'testcase_name': 'comment_inline',\n          
'line_in': 'Foo %Comment\\n',\n          'true_output': 'Foo %\\n',\n      },\n      {\n          'testcase_name': 'url_with_percent',\n          'line_in': '\\\\url{https://www.example.com/hello%20world}\\n',\n          'true_output': '\\\\url{https://www.example.com/hello%20world}\\n',\n      },\n      {\n          'testcase_name': 'comment_with_url',\n          'line_in': 'Foo %\\\\url{https://www.example.com/hello%20world}\\n',\n          'true_output': 'Foo %\\n',\n      },\n  )\n  def test_remove_comments_inline(self, line_in, true_output):\n    self.assertEqual(\n        arxiv_latex_cleaner._remove_comments_inline(line_in), true_output\n    )\n\n  @parameterized.named_parameters(\n      {\n          'testcase_name': 'no_command',\n          'text_in': 'Foo\\nFoo2\\n',\n          'keep_text': False,\n          'true_output': 'Foo\\nFoo2\\n',\n      },\n      {\n          'testcase_name': 'command_not_removed',\n          'text_in': '\\\\textit{Foo\\nFoo2}\\n',\n          'keep_text': False,\n          'true_output': '\\\\textit{Foo\\nFoo2}\\n',\n      },\n      {\n          'testcase_name': 'command_no_end_line_removed',\n          'text_in': 'A\\\\todo{B\\nC}D\\nE\\n\\\\end{document}',\n          'keep_text': False,\n          'true_output': 'AD\\nE\\n\\\\end{document}',\n      },\n      {\n          'testcase_name': 'command_with_end_line_removed',\n          'text_in': 'A\\n\\\\todo{B\\nC}\\nD\\n\\\\end{document}',\n          'keep_text': False,\n          'true_output': 'A\\n%\\nD\\n\\\\end{document}',\n      },\n      {\n          'testcase_name': 'command_with_optional_arguments_start',\n          'text_in': 'A\\n\\\\todo[B]{C\\nD}\\nE\\n\\\\end{document}',\n          'keep_text': False,\n          'true_output': 'A\\n%\\nE\\n\\\\end{document}',\n      },\n      {\n          'testcase_name': 'command_with_optional_arguments_end',\n          'text_in': 'A\\n\\\\todo{B\\nC}[D]\\nE\\n\\\\end{document}',\n          'keep_text': False,\n          
'true_output': 'A\\n%\\nE\\n\\\\end{document}',\n      },\n      {\n          'testcase_name': 'no_command_keep_text',\n          'text_in': 'Foo\\nFoo2\\n',\n          'keep_text': True,\n          'true_output': 'Foo\\nFoo2\\n',\n      },\n      {\n          'testcase_name': 'command_not_removed_keep_text',\n          'text_in': '\\\\textit{Foo\\nFoo2}\\n',\n          'keep_text': True,\n          'true_output': '\\\\textit{Foo\\nFoo2}\\n',\n      },\n      {\n          'testcase_name': 'command_no_end_line_removed_keep_text',\n          'text_in': 'A\\\\todo{B\\nC}D\\nE\\n\\\\end{document}',\n          'keep_text': True,\n          'true_output': 'AB\\nCD\\nE\\n\\\\end{document}',\n      },\n      {\n          'testcase_name': 'command_with_end_line_removed_keep_text',\n          'text_in': 'A\\n\\\\todo{B\\nC}\\nD\\n\\\\end{document}',\n          'keep_text': True,\n          'true_output': 'A\\nB\\nC\\nD\\n\\\\end{document}',\n      },\n      {\n          'testcase_name': 'nested_command_keep_text',\n          'text_in': 'A\\n\\\\todo{B\\n\\\\todo{C}}\\nD\\n\\\\end{document}',\n          'keep_text': True,\n          'true_output': 'A\\nB\\nC\\nD\\n\\\\end{document}',\n      },\n      {\n          'testcase_name': 'command_with_optional_arguments_start_keep_text',\n          'text_in': 'A\\n\\\\todo[B]{C\\nD}\\nE\\n\\\\end{document}',\n          'keep_text': True,\n          'true_output': 'A\\nC\\nD\\nE\\n\\\\end{document}',\n      },\n      {\n          'testcase_name': 'command_with_optional_arguments_end_keep_text',\n          'text_in': 'A\\n\\\\todo{B\\nC}[D]\\nE\\n\\\\end{document}',\n          'keep_text': True,\n          'true_output': 'A\\nB\\nC\\nE\\n\\\\end{document}',\n      },\n      {\n          'testcase_name': 'deeply_nested_command_keep_text',\n          'text_in': 'A\\n\\\\todo{B\\n\\\\emph{C\\\\footnote{\\\\textbf{D}}}}\\nE\\n\\\\end{document}',\n          'keep_text': True,\n          'true_output': (\n              
'A\\nB\\n\\\\emph{C\\\\footnote{\\\\textbf{D}}}\\nE\\n\\\\end{document}'\n          ),\n      },\n  )\n  def test_remove_command(self, text_in, keep_text, true_output):\n    self.assertEqual(\n        arxiv_latex_cleaner._remove_command(text_in, 'todo', keep_text),\n        true_output,\n    )\n\n  @parameterized.named_parameters(\n      {\n          'testcase_name': 'no_environment',\n          'text_in': 'Foo\\n',\n          'true_output': 'Foo\\n',\n      },\n      {\n          'testcase_name': 'environment_not_removed',\n          'text_in': 'Foo\\n\\\\begin{equation}\\n3x+2\\n\\\\end{equation}\\nFoo',\n          'true_output': 'Foo\\n\\\\begin{equation}\\n3x+2\\n\\\\end{equation}\\nFoo',\n      },\n      {\n          'testcase_name': 'environment_removed',\n          'text_in': 'Foo\\\\begin{comment}\\n3x+2\\n\\\\end{comment}\\nFoo',\n          'true_output': 'Foo\\nFoo',\n      },\n  )\n  def test_remove_environment(self, text_in, true_output):\n    self.assertEqual(\n        arxiv_latex_cleaner._remove_environment(text_in, 'comment'), true_output\n    )\n\n  @parameterized.named_parameters(\n      {\n          'testcase_name': 'no_iffalse',\n          'text_in': 'Foo\\n',\n          'true_output': 'Foo\\n',\n      },\n      {\n          'testcase_name': 'if_not_removed',\n          'text_in': '\\\\ifvar\\n\\\\ifvar\\nFoo\\n\\\\fi\\n\\\\fi\\n',\n          'true_output': '\\\\ifvar\\n\\\\ifvar\\nFoo\\n\\\\fi\\n\\\\fi\\n',\n      },\n      {\n          'testcase_name': 'if_removed_with_nested_ifvar',\n          'text_in': '\\\\ifvar\\n\\\\iffalse\\n\\\\ifvar\\nFoo\\n\\\\fi\\n\\\\fi\\n\\\\fi\\n',\n          'true_output': '\\\\ifvar\\n\\\\fi\\n',\n      },\n      {\n          'testcase_name': 'if_removed_with_nested_iffalse',\n          'text_in': '\\\\ifvar\\n\\\\iffalse\\n\\\\iffalse\\nFoo\\n\\\\fi\\n\\\\fi\\n\\\\fi\\n',\n          'true_output': '\\\\ifvar\\n\\\\fi\\n',\n      },\n      {\n          'testcase_name': 'if_removed_eof',\n          'text_in': 
'\\\\iffalse\\nFoo\\n\\\\fi',\n          'true_output': '',\n      },\n      {\n          'testcase_name': 'if_removed_space',\n          'text_in': '\\\\iffalse\\nFoo\\n\\\\fi ',\n          'true_output': '',\n      },\n      {\n          'testcase_name': 'if_removed_backslash',\n          'text_in': '\\\\iffalse\\nFoo\\n\\\\fi\\\\end{document}',\n          'true_output': '\\\\end{document}',\n      },\n      {\n          'testcase_name': 'commands_not_removed',\n          'text_in': '\\\\newcommand\\\\figref[1]{Figure~\\\\ref{fig:\\\\#1}}',\n          'true_output': '\\\\newcommand\\\\figref[1]{Figure~\\\\ref{fig:\\\\#1}}',\n      },\n      {\n          'testcase_name': 'iffalse_else_sustained',\n          'text_in': '\\\\iffalse not there\\\\else here\\\\fi',\n          'true_output': 'here',\n      },\n      {\n          'testcase_name': 'iftrue_else_removed',\n          'text_in': '\\\\iftrue expected\\\\else not expected\\\\fi',\n          'true_output': 'expected',\n      },\n      {\n          'testcase_name': 'if0_removed',\n          'text_in': '\\\\if0 to be removed\\\\fi',\n          'true_output': '',\n      },\n      {\n          'testcase_name': 'if1 works',\n          'text_in': '\\\\if 1 expected\\\\fi',\n          'true_output': 'expected',\n      },\n      {\n          'testcase_name': 'new_if_ignored',\n          'text_in': '\\\\newif  \\\\ifvar \\\\ifvar\\\\iffalse test\\\\fi\\\\fi',\n          'true_output': '\\\\newif  \\\\ifvar \\\\ifvar\\\\fi',\n      },\n      {\n          'testcase_name': 'known exceptions (iff) ignored in \\\\iffalse',\n          'text_in': '\\\\iffalse \\\\iff\\\\fi',\n          'true_output': '',\n      },\n      {\n          'testcase_name': 'known exceptions (iff) ignored in \\\\iftrue',\n          'text_in': '\\\\iftrue\\\\iff\\\\else\\\\fi',\n          'true_output': '\\\\iff',\n      },\n  )\n  def test_simplify_conditional_blocks(self, text_in, true_output):\n    self.assertEqual(\n        
arxiv_latex_cleaner._simplify_conditional_blocks(text_in), true_output\n    )\n\n  @parameterized.named_parameters(\n      {\n          'testcase_name': 'all_pass',\n          'inputs': ['abc', 'bca'],\n          'patterns': ['a'],\n          'true_outputs': ['abc', 'bca'],\n      },\n      {\n          'testcase_name': 'not_all_pass',\n          'inputs': ['abc', 'bca'],\n          'patterns': ['a$'],\n          'true_outputs': ['bca'],\n      },\n  )\n  def test_keep_pattern(self, inputs, patterns, true_outputs):\n    self.assertEqual(\n        list(arxiv_latex_cleaner._keep_pattern(inputs, patterns)), true_outputs\n    )\n\n  @parameterized.named_parameters(\n      {\n          'testcase_name': 'all_pass',\n          'inputs': ['abc', 'bca'],\n          'patterns': ['a'],\n          'true_outputs': [],\n      },\n      {\n          'testcase_name': 'not_all_pass',\n          'inputs': ['abc', 'bca'],\n          'patterns': ['a$'],\n          'true_outputs': ['abc'],\n      },\n  )\n  def test_remove_pattern(self, inputs, patterns, true_outputs):\n    self.assertEqual(\n        list(arxiv_latex_cleaner._remove_pattern(inputs, patterns)),\n        true_outputs,\n    )\n\n  @parameterized.named_parameters(\n      {\n          'testcase_name': 'replace_contents',\n          'content': make_contents(),\n          'patterns_and_insertions': make_patterns(),\n          'true_outputs': (\n              r'& \\parbox[c]{\\ww\\linewidth}{\\includegraphics[width=1.0\\linewidth]{figures/image1.jpg}}'\n              '\\n'\n              r'& \\parbox[c]{\\ww\\linewidth}{\\includegraphics[width=1.0\\linewidth]{figures/image2.jpg}}'\n          ),\n      },\n  )\n  def test_find_and_replace_patterns(\n      self, content, patterns_and_insertions, true_outputs\n  ):\n    output = arxiv_latex_cleaner._find_and_replace_patterns(\n        content, patterns_and_insertions\n    )\n    output = arxiv_latex_cleaner.strip_whitespace(output)\n    true_outputs = 
arxiv_latex_cleaner.strip_whitespace(true_outputs)\n    self.assertEqual(output, true_outputs)\n\n  @parameterized.named_parameters(\n      {\n          'testcase_name': 'no_tikz',\n          'text_in': 'Foo\\n',\n          'figures_in': ['ext_tikz/test1.pdf', 'ext_tikz/test2.pdf'],\n          'true_output': 'Foo\\n',\n      },\n      {\n          'testcase_name': 'tikz_no_match',\n          'text_in': (\n              'Foo\\\\tikzsetnextfilename{test_no_match}\\n\\\\begin{tikzpicture}\\n\\\\node'\n              ' (test) at (0,0) {Test1};\\n\\\\end{tikzpicture}\\nFoo'\n          ),\n          'figures_in': ['ext_tikz/test1.pdf', 'ext_tikz/test2.pdf'],\n          'true_output': (\n              'Foo\\\\tikzsetnextfilename{test_no_match}\\n\\\\begin{tikzpicture}\\n\\\\node'\n              ' (test) at (0,0) {Test1};\\n\\\\end{tikzpicture}\\nFoo'\n          ),\n      },\n      {\n          'testcase_name': 'tikz_match',\n          'text_in': (\n              'Foo\\\\tikzsetnextfilename{test2}\\n\\\\begin{tikzpicture}\\n\\\\node'\n              ' (test) at (0,0) {Test1};\\n\\\\end{tikzpicture}\\nFoo'\n          ),\n          'figures_in': ['ext_tikz/test1.pdf', 'ext_tikz/test2.pdf'],\n          'true_output': 'Foo\\\\includegraphics{ext_tikz/test2.pdf}\\nFoo',\n      },\n  )\n  def test_replace_tikzpictures(self, text_in, figures_in, true_output):\n    self.assertEqual(\n        arxiv_latex_cleaner._replace_tikzpictures(text_in, figures_in),\n        true_output,\n    )\n\n  @parameterized.named_parameters(\n      {\n          'testcase_name': 'no_includesvg',\n          'text_in': 'Foo\\n',\n          'figures_in': [\n              'ext_svg/test1-tex.pdf_tex',\n              'ext_svg/test2-tex.pdf_tex',\n          ],\n          'true_output': 'Foo\\n',\n      },\n      {\n          'testcase_name': 'includesvg_no_match',\n          'text_in': 'Foo\\\\includesvg{test_no_match}\\nFoo',\n          'figures_in': [\n              'ext_svg/test1-tex.pdf_tex',\n              
'ext_svg/test2-tex.pdf_tex',\n          ],\n          'true_output': 'Foo\\\\includesvg{test_no_match}\\nFoo',\n      },\n      {\n          'testcase_name': 'includesvg_match',\n          'text_in': 'Foo\\\\includesvg{test2}\\nFoo',\n          'figures_in': [\n              'ext_svg/test1-tex.pdf_tex',\n              'ext_svg/test2-tex.pdf_tex',\n          ],\n          'true_output': 'Foo\\\\includeinkscape{ext_svg/test2-tex.pdf_tex}\\nFoo',\n      },\n      {\n          'testcase_name': 'includesvg_match_with_options',\n          'text_in': 'Foo\\\\includesvg[width=\\\\linewidth,scale=0.40]{figs/persdf/test2}\\nFoo',\n          'figures_in': [\n              'ext_svg/test1-tex.pdf_tex',\n              'ext_svg/test2-tex.pdf_tex',\n          ],\n          'true_output': 'Foo\\\\includeinkscape[width=\\\\linewidth,scale=0.40]{ext_svg/test2-tex.pdf_tex}\\nFoo',\n      },\n      {\n          'testcase_name': 'includesvg_match_with_options_with_suffix',\n          'text_in': 'Foo\\\\includesvg[width=\\\\linewidth]{figs/test2.svg}\\nFoo',\n          'figures_in': [\n              'ext_svg/test1-tex.pdf_tex',\n              'ext_svg/test2_svg-tex.pdf_tex',\n          ],\n          'true_output': 'Foo\\\\includeinkscape[width=\\\\linewidth]{ext_svg/test2_svg-tex.pdf_tex}\\nFoo',\n      },\n      {\n          'testcase_name': 'includesvg_match_with_options_with_dot_with_suffix',\n          'text_in': (\n              'Foo\\\\includesvg[width=\\\\linewidth]{figs/test2-0.9.svg}\\nFoo'\n          ),\n          'figures_in': [\n              'ext_svg/test1-tex.pdf_tex',\n              'ext_svg/test2-0.9_svg-tex.pdf_tex',\n          ],\n          'true_output': 'Foo\\\\includeinkscape[width=\\\\linewidth]{ext_svg/test2-0.9_svg-tex.pdf_tex}\\nFoo',\n      },\n  )\n  def test_replace_includesvg(self, text_in, figures_in, true_output):\n    self.assertEqual(\n        arxiv_latex_cleaner._replace_includesvg(text_in, figures_in),\n        true_output,\n    )\n\n  
@parameterized.named_parameters(*make_search_reference_tests())\n  def test_search_reference_weak(\n      self, filenames, contents, strict, true_outputs\n  ):\n    cleaner_outputs = []\n    for filename in filenames:\n      reference = arxiv_latex_cleaner._search_reference(\n          filename, contents, strict\n      )\n      if reference is not None:\n        cleaner_outputs.append(filename)\n\n    # weak check (passes as long as cleaner includes a superset of the true_output)\n    for true_output in true_outputs:\n      self.assertIn(true_output, cleaner_outputs)\n\n  @parameterized.named_parameters(*make_search_reference_tests())\n  def test_search_reference_strong(\n      self, filenames, contents, strict, true_outputs\n  ):\n    cleaner_outputs = []\n    for filename in filenames:\n      reference = arxiv_latex_cleaner._search_reference(\n          filename, contents, strict\n      )\n      if reference is not None:\n        cleaner_outputs.append(filename)\n\n    # strong check (set of files must match exactly)\n    weak_check_result = set(true_outputs).issubset(cleaner_outputs)\n    if weak_check_result:\n      msg = 'not fatal, cleaner included more files than necessary'\n    else:\n      msg = 'fatal, see test_search_reference_weak'\n    self.assertEqual(cleaner_outputs, true_outputs, msg)\n\n  @parameterized.named_parameters(\n      {\n          'testcase_name': 'three_parent',\n          'filename': 'long/path/to/img.ext',\n          'content_strs': [\n              # match\n              '{img.ext}',\n              '{to/img.ext}',\n              '{path/to/img.ext}',\n              '{long/path/to/img.ext}',\n              '{%\\nimg.ext  }',\n              '{to/img.ext % \\n}',\n              '{  \\npath/to/img.ext\\n}',\n              '{ \\n \\nlong/path/to/img.ext\\n}',\n              '{img}',\n              '{to/img}',\n              '{path/to/img}',\n              '{long/path/to/img}',\n              # dont match\n              '{from/img.ext}',\n   
           '{from/img}',\n              '{imgoext}',\n              '{from/imgo}',\n              '{ \\n long/\\npath/to/img.ext\\n}',\n              '{path/img.ext}',\n              '{long/img.ext}',\n              '{long/path/img.ext}',\n              '{long/to/img.ext}',\n              '{path/img}',\n              '{long/img}',\n              '{long/path/img}',\n              '{long/to/img}',\n          ],\n          'strict': False,\n          'true_outputs': [True] * 12 + [False] * 13,\n      },\n      {\n          'testcase_name': 'two_parent',\n          'filename': 'path/to/img.ext',\n          'content_strs': [\n              # match\n              '{img.ext}',\n              '{to/img.ext}',\n              '{path/to/img.ext}',\n              '{%\\nimg.ext  }',\n              '{to/img.ext % \\n}',\n              '{  \\npath/to/img.ext\\n}',\n              '{img}',\n              '{to/img}',\n              '{path/to/img}',\n              # dont match\n              '{long/path/to/img.ext}',\n              '{ \\n \\nlong/path/to/img.ext\\n}',\n              '{long/path/to/img}',\n              '{from/img.ext}',\n              '{from/img}',\n              '{imgoext}',\n              '{from/imgo}',\n              '{ \\n long/\\npath/to/img.ext\\n}',\n              '{path/img.ext}',\n              '{long/img.ext}',\n              '{long/path/img.ext}',\n              '{long/to/img.ext}',\n              '{path/img}',\n              '{long/img}',\n              '{long/path/img}',\n              '{long/to/img}',\n          ],\n          'strict': False,\n          'true_outputs': [True] * 9 + [False] * 16,\n      },\n      {\n          'testcase_name': 'one_parent',\n          'filename': 'to/img.ext',\n          'content_strs': [\n              # match\n              '{img.ext}',\n              '{to/img.ext}',\n              '{%\\nimg.ext  }',\n              '{to/img.ext % \\n}',\n              '{img}',\n              '{to/img}',\n              # dont match\n      
        '{long/path/to/img}',\n              '{path/to/img}',\n              '{ \\n \\nlong/path/to/img.ext\\n}',\n              '{  \\npath/to/img.ext\\n}',\n              '{long/path/to/img.ext}',\n              '{path/to/img.ext}',\n              '{from/img.ext}',\n              '{from/img}',\n              '{imgoext}',\n              '{from/imgo}',\n              '{ \\n long/\\npath/to/img.ext\\n}',\n              '{path/img.ext}',\n              '{long/img.ext}',\n              '{long/path/img.ext}',\n              '{long/to/img.ext}',\n              '{path/img}',\n              '{long/img}',\n              '{long/path/img}',\n              '{long/to/img}',\n          ],\n          'strict': False,\n          'true_outputs': [True] * 6 + [False] * 19,\n      },\n      {\n          'testcase_name': 'two_parent_strict',\n          'filename': 'path/to/img.ext',\n          'content_strs': [\n              # match\n              '{path/to/img.ext}',\n              '{  \\npath/to/img.ext\\n}',\n              # dont match\n              '{img.ext}',\n              '{to/img.ext}',\n              '{%\\nimg.ext  }',\n              '{to/img.ext % \\n}',\n              '{img}',\n              '{to/img}',\n              '{path/to/img}',\n              '{long/path/to/img.ext}',\n              '{ \\n \\nlong/path/to/img.ext\\n}',\n              '{long/path/to/img}',\n              '{from/img.ext}',\n              '{from/img}',\n              '{imgoext}',\n              '{from/imgo}',\n              '{ \\n long/\\npath/to/img.ext\\n}',\n              '{path/img.ext}',\n              '{long/img.ext}',\n              '{long/path/img.ext}',\n              '{long/to/img.ext}',\n              '{path/img}',\n              '{long/img}',\n              '{long/path/img}',\n              '{long/to/img}',\n          ],\n          'strict': True,\n          'true_outputs': [True] * 2 + [False] * 23,\n      },\n  )\n  def test_search_reference_filewise(\n      self, filename, 
content_strs, strict, true_outputs\n  ):\n    if len(content_strs) != len(true_outputs):\n      raise ValueError(\n          \"number of true_outputs doesn't match number of content_strs\"\n      )\n    for content, true_output in zip(content_strs, true_outputs):\n      reference = arxiv_latex_cleaner._search_reference(\n          filename, content, strict\n      )\n      matched = reference is not None\n      msg_not = ' ' if true_output else ' not '\n      msg_fmt = 'file {} should' + msg_not + 'have matched latex reference {}'\n      msg = msg_fmt.format(filename, content)\n      self.assertEqual(matched, true_output, msg)\n\n\nclass IntegrationTests(parameterized.TestCase):\n\n  def setUp(self):\n    super(IntegrationTests, self).setUp()\n    self.out_path = 'test_data/tex_arXiv'\n\n  def _compare_files(self, filename, filename_true):\n    if path.splitext(filename)[1].lower() in ['.jpg', '.jpeg', '.png']:\n      with Image.open(filename) as im, Image.open(filename_true) as im_true:\n        # We check only the sizes of the images; checking pixels would be too\n        # complicated in case the resize implementations change.\n        self.assertEqual(\n            im.size,\n            im_true.size,\n            'Image {:s} was not resized properly.'.format(filename),\n        )\n    else:\n      # Checks if text files are equal without taking into account end of line\n      # characters.\n      with open(filename, 'rb') as f:\n        processed_content = f.read().splitlines()\n      with open(filename_true, 'rb') as f:\n        groundtruth_content = f.read().splitlines()\n\n      self.assertEqual(\n          processed_content,\n          groundtruth_content,\n          '{:s} and {:s} are not equal.'.format(filename, filename_true),\n      )\n\n  @parameterized.named_parameters(\n      {'testcase_name': 'from_dir', 'input_dir': 'test_data/tex'},\n      {'testcase_name': 'from_zip', 'input_dir': 'test_data/tex.zip'},\n  )\n  def test_complete(self, input_dir):\n  
  out_path_true = 'test_data/tex_arXiv_true'\n\n    # Make sure the folder does not exist, since we erase it in the test.\n    if path.isdir(self.out_path):\n      raise RuntimeError(\n          'The folder {:s} should not exist.'.format(self.out_path)\n      )\n\n    arxiv_latex_cleaner.run_arxiv_cleaner({\n        'input_folder': input_dir,\n        'images_allowlist': {\n            'images/im2_included.jpg': 200,\n            'images/im3_included.png': 400,\n        },\n        'resize_images': True,\n        'im_size': 100,\n        'compress_pdf': False,\n        'pdf_im_resolution': 500,\n        'commands_to_delete': ['mytodo'],\n        'commands_only_to_delete': ['red'],\n        'if_exceptions': ['iffalt'],\n        'environments_to_delete': ['mynote'],\n        'use_external_tikz': 'ext_tikz',\n        'keep_bib': False,\n    })\n\n    # Checks the set of files is the same as in the true folder.\n    out_files = set(arxiv_latex_cleaner._list_all_files(self.out_path))\n    out_files_true = set(arxiv_latex_cleaner._list_all_files(out_path_true))\n    self.assertSetEqual(out_files, out_files_true)\n\n    # Compares the contents of each file against the true value.\n    for f1 in out_files:\n      self._compare_files(\n          path.join(self.out_path, f1), path.join(out_path_true, f1)\n      )\n\n  @parameterized.named_parameters(\n      {'testcase_name': 'from_dir', 'input_dir': 'test_data/tex'},\n      {'testcase_name': 'from_zip', 'input_dir': 'test_data/tex.zip'},\n  )\n  def test_png2jpg(self, input_dir):\n    out_path_true = 'test_data/tex_arXiv_png2jpg_true'\n\n    # Make sure the folder does not exist, since we erase it in the test.\n    if path.isdir(self.out_path):\n      raise RuntimeError(\n          'The folder {:s} should not exist.'.format(self.out_path)\n      )\n\n    arxiv_latex_cleaner.run_arxiv_cleaner({\n        'input_folder': input_dir,\n        'images_allowlist': {\n            # 'images/im2_included.jpg': 200,\n            # 
'images/im3_included.png': 400,\n        },\n        'resize_images': False,\n        'im_size': 100,\n        'compress_pdf': False,\n        'pdf_im_resolution': 500,\n        'commands_to_delete': ['mytodo'],\n        'commands_only_to_delete': ['red'],\n        'if_exceptions': ['iffalt'],\n        'environments_to_delete': ['mynote'],\n        'use_external_tikz': 'ext_tikz',\n        'keep_bib': False,\n        'convert_png_to_jpg': True,\n        'png_quality': 50,\n        'png_size_threshold': 0.5,\n    })\n\n    # Checks the set of files is the same as in the true folder.\n    out_files = set(arxiv_latex_cleaner._list_all_files(self.out_path))\n    out_files_true = set(arxiv_latex_cleaner._list_all_files(out_path_true))\n    self.assertSetEqual(out_files, out_files_true)\n\n    # Compares the contents of each file against the true value.\n    for f1 in out_files:\n      extension = path.splitext(f1)[1].lower()\n      if extension in ['.jpg', '.jpeg', '.png']:\n        # Image contents are not compared here; only check that every PNG\n        # file has been renamed to JPG.\n        self.assertNotEqual(\n            extension, '.png', f'{f1} was not renamed to jpg'\n        )\n      else:\n        self._compare_files(\n            path.join(self.out_path, f1), path.join(out_path_true, f1)\n        )\n\n  def tearDown(self):\n    shutil.rmtree(self.out_path)\n    super(IntegrationTests, self).tearDown()\n\n\nif __name__ == '__main__':\n  unittest.main()\n"
  },
  {
    "path": "cleaner_config.yaml",
    "content": "patterns_and_insertions:\n    [\n        # Use single ticks for regex patterns\n        # http://blogs.perl.org/users/tinita/2018/03/strings-in-yaml---to-quote-or-not-to-quote.html\n        # You need to escape \\ with \\\\ in the pattern, for instance for \\\\todo\n        # Use Python named groups https://docs.python.org/3/library/re.html#regular-expression-examples\n        # Escape {{ and }} in the insertion expression\n        # \n        # Optional:\n        # Set strip_whitespace to n to disable white space stripping while replacing the pattern. (Default: y)\n\n        {\n            \"pattern\" : '(?:\\\\figcomp{\\s*)(?P<first>.*?)\\s*}\\s*{\\s*(?P<second>.*?)\\s*}\\s*{\\s*(?P<third>.*?)\\s*}',\n            \"insertion\" : '\\parbox[c]{{ {second} \\linewidth}} {{ \\includegraphics[width= {third} \\linewidth]{{figures/{first} }} }}',\n            \"description\" : \"Replace figcomp\",\n            # \"strip_whitespace\": n \n        },\n    ]\nverbose: False\ncommands_to_delete: [\n    '\\\\todo',\n]\n"
  },
  {
    "path": "requirements.txt",
    "content": "absl_py>=0.12\npillow\npyyaml\nregex\n"
  },
  {
    "path": "setup.py",
    "content": "#! /usr/bin/env python\n#\n# coding=utf-8\n# Copyright 2018 The Google Research Authors.\n#\n# Licensed under the Apache License, Version 2.0 (the \"License\");\n# you may not use this file except in compliance with the License.\n# You may obtain a copy of the License at\n#\n#     http://www.apache.org/licenses/LICENSE-2.0\n#\n# Unless required by applicable law or agreed to in writing, software\n# distributed under the License is distributed on an \"AS IS\" BASIS,\n# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n# See the License for the specific language governing permissions and\n# limitations under the License.\n\nfrom setuptools import setup\nfrom setuptools import find_packages\n\nfrom arxiv_latex_cleaner._version import __version__\n\nwith open(\"README.md\", \"r\") as fh:\n    long_description = fh.read()\n\ninstall_requires = []\nwith open(\"requirements.txt\") as f:\n    for l in f.readlines():\n        l_c = l.strip()\n        if l_c and not l_c.startswith('#'):\n            install_requires.append(l_c)\n\nsetup(\n    name=\"arxiv_latex_cleaner\",\n    version=__version__,\n    packages=find_packages(exclude=[\"*.tests\"]),\n    python_requires='>=3',\n    url=\"https://github.com/google-research/arxiv-latex-cleaner\",\n    license=\"Apache License, Version 2.0\",\n    author=\"Google Research Authors\",\n    author_email=\"jponttuset@gmail.com\",\n    description=\"Cleans the LaTeX code of your paper to submit to arXiv.\",\n    long_description=long_description,\n    long_description_content_type=\"text/markdown\",\n    entry_points={\n        \"console_scripts\": [\"arxiv_latex_cleaner=arxiv_latex_cleaner.__main__:__main__\"]\n    },\n    install_requires=install_requires,\n    classifiers=[\n        \"License :: OSI Approved :: Apache Software License\",\n        \"Intended Audience :: Science/Research\",\n    ],\n)\n"
  },
  {
    "path": "test_data/tex/figures/data_included.txt",
    "content": ""
  },
  {
    "path": "test_data/tex/figures/data_not_included.txt",
    "content": "\n"
  },
  {
    "path": "test_data/tex/figures/figure_included.tex",
    "content": "\\includegraphics{images/im2_included.jpg}\n\\addplot{figures/data_included.txt}\n"
  },
  {
    "path": "test_data/tex/figures/figure_included.tikz",
    "content": "﻿\\tikzsetnextfilename{test2}\n\\begin{tikzpicture}\n\\node {root}\nchild {node {left}}\nchild {node {right}\nchild {node {child}}\nchild {node {child}}\n};\n\\end{tikzpicture}"
  },
  {
    "path": "test_data/tex/figures/figure_not_included.tex",
    "content": "\\addplot{figures/data_not_included.txt}\n\\input{figures/figure_not_included_2.tex}\n"
  },
  {
    "path": "test_data/tex/figures/figure_not_included_2.tex",
    "content": ""
  },
  {
    "path": "test_data/tex/main.aux",
    "content": ""
  },
  {
    "path": "test_data/tex/main.bbl",
    "content": "BBL content, should be intact.\n"
  },
  {
    "path": "test_data/tex/main.bib",
    "content": ""
  },
  {
    "path": "test_data/tex/main.tex",
    "content": "\\begin{document}\nText\n% Whole line comment\n\nText% Inline comment\n\\begin{comment}\nThis is an environment comment.\n\\end{comment}\n\nThis is a percent \\%.\n% Whole line comment without newline\n\\includegraphics{images/im1_included.png}\n%\\includegraphics{images/im_not_included}\n\\includegraphics{images/im3_included.png}\n\\includegraphics{%\n  images/im4_included.png%\n  }\n\\includegraphics[width=.5\\linewidth]{%\n  images/im5_included.jpg}\n%\\includegraphics{%\n%  images/im4_not_included.png\n%  }\n%\\includegraphics[width=.5\\linewidth]{%\n%  images/im5_not_included.jpg}\n\n% test whatever the path satrting with dot works when include graphics\n\\includegraphics{./images/im3_included.png}\n\nThis line should\\mytodo{Do this later} not be separated\n\\mytodo{This is a todo command with a nested \\textit{command}.\nPlease remember that up to \\texttt{2 levels} of \\textit{nesting} are supported.}\nfrom this one.\n\n\\begin{mynote}\n  This is a custom environment that could be excluded.\n\\end{mynote}\n\n\\newif\\ifvar\n\\newif  \\ifvarII\n\n\\ifvarII asdf \\fi\n\n\\ifvar\n\\if    false\n\\if false\n\\if 0\n\\iffalse\n\\ifvar\nText\n\\fi\n\\fi\n\\fi\n\\fi\n\\fi\n\\fi\n\n\\iffalse I shall be gone (iffalse block)!\\else Expect me (else block of iffalse)!\\fi\n\n\\iftrue Expect me (iftrue block)!\\else I shall be gone (else block of iftrue)!\\fi\n\n\\iffalse\n\\iffalt\n\\fi\n\n\\newcommand{\\red}[1]{{\\color{red} #1}}\nhello test \\red{hello\ntest \\red{hello}}\ntest\n\n% content after this line should not be cleaned if \\end{document} is in a comment\n\n\\input{figures/figure_included.tex}\n% \\input{figures/figure_not_included.tex}\n\n% Test for tikzpicture feature\n% should be replaced\n\\tikzsetnextfilename{test1}\n\\begin{tikzpicture}\n    \\node (test) at (0,0) {Test1};\n\\end{tikzpicture}\n\n% should be replaced in included file\n\\input{figures/figure_included.tikz}\n\n% should not be be replaced - no preceding tikzsetnextfilename 
command\n\\begin{tikzpicture}\n    \\node (test) at (0,0) {Test3};\n\\end{tikzpicture}\n\n\\tikzsetnextfilename{test_no_match}\n\\begin{tikzpicture}\n    \\node (test) at (0,0) {Test4};\n\\end{tikzpicture}\n\n\\end{document}\n\nThis should be ignored.\n"
  },
  {
    "path": "test_data/tex/not_included/figures/data_included.txt",
    "content": ""
  },
  {
    "path": "test_data/tex_arXiv_png2jpg_true/figures/data_included.txt",
    "content": ""
  },
  {
    "path": "test_data/tex_arXiv_png2jpg_true/figures/figure_included.tex",
    "content": "\\includegraphics{images/im2_included.jpg}\n\\addplot{figures/data_included.txt}\n"
  },
  {
    "path": "test_data/tex_arXiv_png2jpg_true/figures/figure_included.tikz",
    "content": "﻿\\includegraphics{ext_tikz/test2.pdf}"
  },
  {
    "path": "test_data/tex_arXiv_png2jpg_true/main.bbl",
    "content": "BBL content, should be intact.\n"
  },
  {
    "path": "test_data/tex_arXiv_png2jpg_true/main.tex",
    "content": "\\begin{document}\nText\n\nText%\n\n\nThis is a percent \\%.\n\\includegraphics{images/im1_included.jpg}\n\\includegraphics{images/im3_included.jpg}\n\\includegraphics{%\n  images/im4_included.jpg%\n  }\n\\includegraphics[width=.5\\linewidth]{%\n  images/im5_included.jpg}\n\n\\includegraphics{./images/im3_included.jpg}\n\nThis line should not be separated\n%\nfrom this one.\n\n\n\n\\newif\\ifvar\n\\newif  \\ifvarII\n\n\\ifvarII asdf \\fi\n\n\\ifvar\n\\fi\n\nExpect me (else block of iffalse)!\nExpect me (iftrue block)!\n\n\\newcommand{\\red}[1]{{\\color{red} #1}}\nhello test hello\ntest hello\ntest\n\n\n\\input{figures/figure_included.tex}\n\n\\includegraphics{ext_tikz/test1.pdf}\n\n\\input{figures/figure_included.tikz}\n\n\\begin{tikzpicture}\n    \\node (test) at (0,0) {Test3};\n\\end{tikzpicture}\n\n\\tikzsetnextfilename{test_no_match}\n\\begin{tikzpicture}\n    \\node (test) at (0,0) {Test4};\n\\end{tikzpicture}\n\n\\end{document}\n"
  },
  {
    "path": "test_data/tex_arXiv_true/figures/data_included.txt",
    "content": ""
  },
  {
    "path": "test_data/tex_arXiv_true/figures/figure_included.tex",
    "content": "\\includegraphics{images/im2_included.jpg}\n\\addplot{figures/data_included.txt}\n"
  },
  {
    "path": "test_data/tex_arXiv_true/figures/figure_included.tikz",
    "content": "﻿\\includegraphics{ext_tikz/test2.pdf}"
  },
  {
    "path": "test_data/tex_arXiv_true/main.bbl",
    "content": "BBL content, should be intact.\n"
  },
  {
    "path": "test_data/tex_arXiv_true/main.tex",
    "content": "\\begin{document}\nText\n\nText%\n\n\nThis is a percent \\%.\n\\includegraphics{images/im1_included.png}\n\\includegraphics{images/im3_included.png}\n\\includegraphics{%\n  images/im4_included.png%\n  }\n\\includegraphics[width=.5\\linewidth]{%\n  images/im5_included.jpg}\n\n\\includegraphics{./images/im3_included.png}\n\nThis line should not be separated\n%\nfrom this one.\n\n\n\n\\newif\\ifvar\n\\newif  \\ifvarII\n\n\\ifvarII asdf \\fi\n\n\\ifvar\n\\fi\n\nExpect me (else block of iffalse)!\nExpect me (iftrue block)!\n\n\\newcommand{\\red}[1]{{\\color{red} #1}}\nhello test hello\ntest hello\ntest\n\n\n\\input{figures/figure_included.tex}\n\n\\includegraphics{ext_tikz/test1.pdf}\n\n\\input{figures/figure_included.tikz}\n\n\\begin{tikzpicture}\n    \\node (test) at (0,0) {Test3};\n\\end{tikzpicture}\n\n\\tikzsetnextfilename{test_no_match}\n\\begin{tikzpicture}\n    \\node (test) at (0,0) {Test4};\n\\end{tikzpicture}\n\n\\end{document}\n"
  }
]