Repository: google-research/arxiv-latex-cleaner
Branch: main
Commit: fb7ee5c72100
Files: 36
Total size: 109.9 KB

Directory structure:
gitextract_ut9ensog/

├── .github/
│   └── workflows/
│       └── release-workflow.yml
├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
├── MANIFEST.in
├── README.md
├── __init__.py
├── arxiv_latex_cleaner/
│   ├── __init__.py
│   ├── __main__.py
│   ├── _version.py
│   ├── arxiv_latex_cleaner.py
│   └── tests/
│       └── arxiv_latex_cleaner_test.py
├── cleaner_config.yaml
├── requirements.txt
├── setup.py
└── test_data/
    ├── tex/
    │   ├── figures/
    │   │   ├── data_included.txt
    │   │   ├── data_not_included.txt
    │   │   ├── figure_included.tex
    │   │   ├── figure_included.tikz
    │   │   ├── figure_not_included.tex
    │   │   └── figure_not_included_2.tex
    │   ├── main.aux
    │   ├── main.bbl
    │   ├── main.bib
    │   ├── main.tex
    │   └── not_included/
    │       └── figures/
    │           └── data_included.txt
    ├── tex_arXiv_png2jpg_true/
    │   ├── figures/
    │   │   ├── data_included.txt
    │   │   ├── figure_included.tex
    │   │   └── figure_included.tikz
    │   ├── main.bbl
    │   └── main.tex
    └── tex_arXiv_true/
        ├── figures/
        │   ├── data_included.txt
        │   ├── figure_included.tex
        │   └── figure_included.tikz
        ├── main.bbl
        └── main.tex

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/workflows/release-workflow.yml
================================================
name: Create a GitHub and PyPI release
on:
  push:
    tags:
      - 'v*'

jobs:
  build:
    name: Create a GitHub Release
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Create Release
        id: create_release
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          tag_name: ${{ github.ref }}
          release_name: Release ${{ github.ref }}
          body: ${{ github.ref }} release of `arxiv_latex_cleaner`.
          draft: false
          prerelease: false
  deploy:
    name: Create a PyPI Release
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
      - name: Set up Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install setuptools wheel twine
      - name: Build
        run: |
          python setup.py sdist bdist_wheel
      - name: Publish
        env:
          TWINE_USERNAME: '__token__'
          TWINE_PASSWORD: ${{ secrets.PYPI_TOKEN }}
        run: |
          python -m twine upload dist/*


================================================
FILE: .gitignore
================================================
*.pyc
.idea
arxiv-latex-cleaner.iml
arxiv-latex-cleaner.ipr
arxiv-latex-cleaner.iws
arxiv_latex_cleaner.egg-info/
build/
dist/

*.DS_Store


================================================
FILE: CONTRIBUTING.md
================================================
# How to Contribute

We'd love to accept your patches and contributions to this project. There are
just a few small guidelines you need to follow.

## Contributor License Agreement

Contributions to this project must be accompanied by a Contributor License
Agreement. You (or your employer) retain the copyright to your contribution;
this simply gives us permission to use and redistribute your contributions as
part of the project. Head over to <https://cla.developers.google.com/> to see
your current agreements on file or to sign a new one.

You generally only need to submit a CLA once, so if you've already submitted one
(even if it was for a different project), you probably don't need to do it
again.

## Code reviews

All submissions, including submissions by project members, require review. We
use GitHub pull requests for this purpose. Consult
[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
information on using pull requests.

## Community Guidelines

This project follows
[Google's Open Source Community Guidelines](https://opensource.google.com/conduct/).


================================================
FILE: LICENSE
================================================

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: MANIFEST.in
================================================
include LICENSE
include README.md
include requirements.txt


================================================
FILE: README.md
================================================
# `arxiv_latex_cleaner`

This tool cleans the LaTeX code of your paper so that it is ready to submit to
arXiv. Given a folder containing all your code, e.g. `/path/to/latex/`, it
creates a new folder, `/path/to/latex_arXiv/`, that is ready to ZIP and upload
to arXiv.

## Example call:

```bash
arxiv_latex_cleaner /path/to/latex --resize_images --im_size 500 --images_allowlist='{"images/im.png":2000}'
```

Or simply from a config file:

```bash
arxiv_latex_cleaner /path/to/latex --config cleaner_config.yaml
```

## Installation:

```bash
pip install arxiv-latex-cleaner
```

| :exclamation: arxiv_latex_cleaner is only compatible with Python >=3.9 :exclamation: |
| ---------------------------------------------------------------------------------- |

On macOS, you can install it using [Homebrew](https://brew.sh/):

```bash
brew install arxiv_latex_cleaner
```

Alternatively, you can download the source code:

```bash
git clone https://github.com/google-research/arxiv-latex-cleaner
cd arxiv-latex-cleaner/
python -m arxiv_latex_cleaner --help
```

And install as a command-line program directly from the source code:

```bash
python setup.py install
```

## Main features:

#### Privacy-oriented

*   Removes all auxiliary files (`.aux`, `.log`, `.out`, etc.).
*   Removes all comments from your code (yes, those are visible on arXiv and you
    do not want them to be). These also include `\begin{comment}\end{comment}`,
    `\iffalse\fi`, and `\if0\fi` environments.
*   Optionally removes user-defined commands entered with `commands_to_delete`
    (such as `\todo{}` that you redefine as the empty string at the end).
*   Optionally allows you to define custom regex replacement rules through a
    `cleaner_config.yaml` file.
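
As a rough illustration of the comment-removal item above, the sketch below
strips everything after an unescaped `%` on a line. It is not the tool's
implementation: `strip_comment` is a hypothetical helper, keeping the `%`
itself (so that TeX still ignores the line break) is an assumption, and
verbatim-like environments are not handled.

```python
def strip_comment(line):
    """Drop the text after the first unescaped '%' on a single line.

    Hypothetical helper for illustration only. The '%' itself is kept so
    that the end-of-line behaviour of the TeX source does not change.
    """
    i = 0
    while i < len(line):
        if line[i] == "\\":
            # A backslash escapes the next character (e.g. '\%' or '\\'),
            # so skip both characters.
            i += 2
            continue
        if line[i] == "%":
            return line[:i] + "%"
        i += 1
    return line


def strip_comments(text):
    """Apply strip_comment to every line of a LaTeX source string."""
    return "\n".join(strip_comment(line) for line in text.split("\n"))


print(strip_comment(r"Accuracy is 50\% here % TODO: double-check this number"))
```

An escaped percent sign (`\%`) survives, while the trailing note after the
unescaped `%` is removed.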

#### Size-oriented

There is a 50MB limit on arXiv submissions, so to make it fit:

*   Removes all unused `.tex` files (those that are not in the root and not
    included in any other `.tex` file).
*   Removes all unused images that take up space (those that are not actually
    included in any used `.tex` file).
*   Optionally resizes all images to `im_size` pixels, to reduce the size of the
    submission. You can allowlist some images to skip the global size using
    `images_allowlist`.
*   Optionally compresses `.pdf` files using ghostscript (Linux and Mac only).
    You can allowlist some PDFs to skip the global size using
    `images_allowlist`.
*   Optionally converts PNG images to JPG format to reduce file size.
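
The interplay between `im_size` and `images_allowlist` can be pictured with
the arithmetic below. This is a minimal sketch under stated assumptions, not
the tool's code: `limit_for` and `target_size` are hypothetical names, and the
choice never to upscale images that are already small enough is an assumption.

```python
def limit_for(path, im_size, images_allowlist):
    """Return the size limit for one image: its allowlist entry if present,
    otherwise the global im_size. (Hypothetical helper.)"""
    return images_allowlist.get(path, im_size)


def target_size(width, height, limit):
    """Scale (width, height) so the longest side is at most `limit` pixels,
    preserving the aspect ratio. Assumes small images are left untouched."""
    longest = max(width, height)
    if longest <= limit:
        return width, height
    scale = limit / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


allowlist = {"images/im.png": 2000}
print(target_size(4000, 2000, limit_for("images/other.png", 500, allowlist)))
print(target_size(4000, 2000, limit_for("images/im.png", 500, allowlist)))
```

The first call uses the global limit of 500 pixels; the second picks up the
per-image limit of 2000 pixels from the allowlist.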

#### TikZ picture source code concealment

To prevent the upload of tikzpicture source code or raw simulation data, this
feature:

*   Replaces the tikzpicture environment `\begin{tikzpicture} ...
    \end{tikzpicture}` with the respective
    `\includegraphics{EXTERNAL_TIKZ_FOLDER/picture_name.pdf}`.
*   Requires externally compiled TikZ pictures as `.pdf` files in folder
    `EXTERNAL_TIKZ_FOLDER`. See section 52 (Externalization Library) in the
    [PGF/TikZ manual](https://ctan.org/pkg/pgf?lang=en) on TikZ picture
    externalization.
*   Only replaces environments with preceding
    `\tikzsetnextfilename{picture_name}` command (as in
    `\tikzsetnextfilename{picture_name}\begin{tikzpicture} ...
    \end{tikzpicture}`) where the externalized `picture_name.pdf` filename
    matches `picture_name`.
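
The replacement rule described above can be sketched with a single regular
expression. This is an illustration only, not the tool's implementation:
`replace_tikz`, its parameters, and the `available_pdfs` set are hypothetical,
and nested `tikzpicture` environments are not handled.

```python
import re

# Matches \tikzsetnextfilename{name} immediately followed by a tikzpicture
# environment. DOTALL lets '.' span multiple lines inside the environment.
TIKZ_PATTERN = re.compile(
    r"\\tikzsetnextfilename\{(?P<name>[^}]*)\}\s*"
    r"\\begin\{tikzpicture\}.*?\\end\{tikzpicture\}",
    re.DOTALL,
)


def replace_tikz(text, external_folder, available_pdfs):
    """Swap named tikzpicture environments for \\includegraphics calls.

    Only pictures whose `<name>.pdf` is listed in `available_pdfs` are
    replaced; everything else is left as it is. (Hypothetical helper.)
    """

    def swap(match):
        name = match.group("name")
        if name + ".pdf" not in available_pdfs:
            return match.group(0)
        return "\\includegraphics{" + external_folder + "/" + name + ".pdf}"

    return TIKZ_PATTERN.sub(swap, text)


source = (
    "\\tikzsetnextfilename{plot}\n"
    "\\begin{tikzpicture}\\draw (0,0) -- (1,1);\\end{tikzpicture}"
)
print(replace_tikz(source, "ext_tikz", {"plot.pdf"}))
```

A `tikzpicture` without a preceding `\tikzsetnextfilename`, or one whose PDF
is missing from the external folder, passes through unchanged.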

#### More sophisticated pattern replacement based on regex group captures

Sometimes it is useful to work with a set of custom LaTeX commands when writing
a paper. To get rid of them upon arXiv submission, one can simply revert them to
plain LaTeX with a regular expression insertion.

```yaml
{
    "pattern" : '(?:\\figcomp{\s*)(?P<first>.*?)\s*}\s*{\s*(?P<second>.*?)\s*}\s*{\s*(?P<third>.*?)\s*}',
    "insertion" : '\parbox[c]{{ {second} \linewidth}} {{ \includegraphics[width= {third} \linewidth]{{figures/{first} }} }}',
    "description" : "Replace figcomp"
}
```

The pattern above will find all `\figcomp{path}{w1}{w2}` commands and replace
them with
`\parbox[c]{w1\linewidth}{\includegraphics[width=w2\linewidth]{figures/path}}`.
Note that the insertion template is filled with the
[named group captures](https://docs.python.org/3/library/re.html#regular-expression-examples)
from the pattern. The replacement is applied **before** the
`\includegraphics` commands are processed and the corresponding file paths are
copied, which ensures that all figure files are copied to the cleaned version.
See also [cleaner_config.yaml](cleaner_config.yaml) for details on how to
specify the patterns.
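
To make the mechanics concrete, here is a minimal sketch of how such a rule
could be applied in Python. It is not taken from the tool's source:
`apply_rule` is a hypothetical helper, and filling the template with
`str.format` is an assumption suggested by the doubled braces (`{{`, `}}`) in
the insertion string.

```python
import re

# Pattern and insertion copied from the figcomp rule shown above.
PATTERN = (
    r"(?:\\figcomp{\s*)(?P<first>.*?)\s*}\s*{\s*(?P<second>.*?)\s*}"
    r"\s*{\s*(?P<third>.*?)\s*}"
)
INSERTION = (
    r"\parbox[c]{{ {second} \linewidth}} "
    r"{{ \includegraphics[width= {third} \linewidth]{{figures/{first} }} }}"
)


def apply_rule(text, pattern=PATTERN, insertion=INSERTION):
    """Replace every match of `pattern` with `insertion`, filling the
    template's {name} fields from the match's named groups."""

    def fill(match):
        # '{{' and '}}' in the template become literal braces.
        return insertion.format(**match.groupdict())

    return re.sub(pattern, fill, text)


print(apply_rule(r"\figcomp{img.png}{0.5}{0.9}"))
```

Apart from some extra spaces carried over from the template, the printed
result is `\parbox[c]{0.5\linewidth}{\includegraphics[width=0.9\linewidth]{figures/img.png}}`.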

## Usage:

```
usage: arxiv_latex_cleaner@v1.0.10 [-h] [--resize_images] [--im_size IM_SIZE]
                                   [--compress_pdf]
                                   [--pdf_im_resolution PDF_IM_RESOLUTION]
                                   [--images_allowlist IMAGES_ALLOWLIST]
                                   [--keep_bib]
                                   [--commands_to_delete COMMANDS_TO_DELETE [COMMANDS_TO_DELETE ...]]
                                   [--commands_only_to_delete COMMANDS_ONLY_TO_DELETE [COMMANDS_ONLY_TO_DELETE ...]]
                                   [--environments_to_delete ENVIRONMENTS_TO_DELETE [ENVIRONMENTS_TO_DELETE ...]]
                                   [--if_exceptions IF_EXCEPTIONS [IF_EXCEPTIONS ...]]
                                   [--use_external_tikz USE_EXTERNAL_TIKZ]
                                   [--svg_inkscape [SVG_INKSCAPE]]
                                   [--convert_png_to_jpg]
                                   [--png_quality PNG_QUALITY]
                                   [--png_size_threshold PNG_SIZE_THRESHOLD]
                                   [--config CONFIG] [--verbose]
                                   input_folder

Clean the LaTeX code of your paper to submit to arXiv. Check the README for
more information on the use.

positional arguments:
  input_folder          Input folder or zip archive containing the LaTeX code.

optional arguments:
  -h, --help            show this help message and exit
  --resize_images       Resize images.
  --im_size IM_SIZE     Size of the output images (in pixels, longest side).
                        Fine tune this to get as close to 10MB as possible.
  --compress_pdf        Compress PDF images using ghostscript (Linux and Mac
                        only).
  --pdf_im_resolution PDF_IM_RESOLUTION
                        Resolution (in dpi) to which the tool resamples the
                        PDF images.
  --images_allowlist IMAGES_ALLOWLIST
                        Images (and PDFs) that won't be resized to the default
                        resolution, but the one provided here. Value is pixel
                        for images, and dpi for PDFs, as in --im_size and
                        --pdf_im_resolution, respectively. Format is a
                        dictionary as: '{"path/to/im.jpg": 1000}'
  --keep_bib            Avoid deleting the *.bib files.
  --commands_to_delete COMMANDS_TO_DELETE [COMMANDS_TO_DELETE ...]
                        LaTeX commands that will be deleted. Useful for e.g.
                        user-defined \todo commands. For example, to delete
                        all occurrences of \todo1{} and \todo2{}, run the tool
                        with `--commands_to_delete todo1 todo2`. Please note
                        that the positional argument `input_folder` cannot
                        come immediately after `commands_to_delete`, as the
                        parser does not have any way to know if it's another
                        command to delete.
  --commands_only_to_delete COMMANDS_ONLY_TO_DELETE [COMMANDS_ONLY_TO_DELETE ...]
                        LaTeX commands that will be deleted but the text 
                        wrapped in the commands will be retained. Useful for
                        commands that change text formats and colors, which
                        you may want to remove but keep the text within. Usages
                        are exactly the same as commands_to_delete. Note that if
                        the commands listed here duplicate that after
                        commands_to_delete, the default action will be retaining
                        the wrapped text.
  --environments_to_delete ENVIRONMENTS_TO_DELETE [ENVIRONMENTS_TO_DELETE ...]
                        LaTeX environments that will be deleted. Useful for e.g. 
                        user-defined comment environments. For example, to 
                        delete all occurrences of \begin{note} ... \end{note},
                        run the tool with `--environments_to_delete note`. 
                        Please note that the positional argument `input_folder`
                        cannot come immediately after
                        `environments_to_delete`, as the parser does not have
                        any way to know if it's another environment to delete.
  --if_exceptions IF_EXCEPTIONS [IF_EXCEPTIONS ...]
                        Constant TeX primitive conditionals (\iffalse, \iftrue,
                        etc.) are simplified, i.e., true branches are kept, false
                        branches deleted. To parse the conditional constructs
                        correctly, all commands starting with `\if` are assumed to
                        be TeX primitive conditionals (e.g., declared by
                        \newif\ifvar). Some known exceptions to this rule are
                        already included (e.g., \iff, \ifthenelse, etc.), but you
                        can add custom exceptions using `--if_exceptions iffalt`.
  --use_external_tikz USE_EXTERNAL_TIKZ
                        Folder (relative to input folder) containing
                        externalized tikz figures in PDF format.
  --svg_inkscape [SVG_INKSCAPE]
                        Include PDF files generated by Inkscape via the
                        `\includesvg` command from the `svg` package. This is
                        done by replacing the `\includesvg` calls with
                        `\includeinkscape` calls pointing to the generated
                        `.pdf_tex` files. By default, these files and the
                        generated PDFs are located under `./svg-inkscape`
                        (relative to the input folder), but a different path
                        (relative to the input folder) can be provided in case a
                        different `inkscapepath` was set when loading the `svg`
                        package.
  --convert_png_to_jpg  Convert PNG images to JPG format to reduce file size
  --png_quality PNG_QUALITY
                        JPG quality for PNG conversion (0-100, default: 50)
  --png_size_threshold PNG_SIZE_THRESHOLD
                        Minimum PNG file size in MB to apply quality reduction (default: 0.5)
  --config CONFIG       Read settings from `.yaml` config file. If command
                        line arguments are provided additionally, the config
                        file parameters are updated with the command line
                        parameters.
  --verbose             Enable detailed output.
```
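
The warning about placing `input_folder` directly after `--commands_to_delete`
follows from how `argparse` treats `nargs="+"`. The sketch below reproduces
the effect with a hypothetical two-argument parser (named `demo`) that only
mirrors the shape of the real command line.

```python
import argparse

# Hypothetical parser with the same argument shapes as the real CLI.
parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("input_folder", type=str)
parser.add_argument("--commands_to_delete", nargs="+", default=[])

# Positional first: the folder is unambiguous.
good = parser.parse_args(["paper/", "--commands_to_delete", "todo1", "todo2"])
print(good.input_folder, good.commands_to_delete)

# Positional last: nargs="+" also consumes "paper/" as a command name, so
# argparse reports that input_folder is missing and exits.
try:
    parser.parse_args(["--commands_to_delete", "todo1", "todo2", "paper/"])
    swallowed = False
except SystemExit:
    swallowed = True
print("positional swallowed:", swallowed)
```

Putting `input_folder` first, or placing another option between the command
list and the folder, avoids the ambiguity.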

## Testing:

```bash
python -m unittest arxiv_latex_cleaner.tests.arxiv_latex_cleaner_test
```

## Note

This is not an officially supported Google product.


================================================
FILE: __init__.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


================================================
FILE: arxiv_latex_cleaner/__init__.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


================================================
FILE: arxiv_latex_cleaner/__main__.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Main module for ``arxiv_latex_cleaner``.

.. code-block:: bash

    $ python -m arxiv_latex_cleaner --help
"""
import argparse
import json
import logging

import yaml

from ._version import __version__
from .arxiv_latex_cleaner import merge_args_into_config
from .arxiv_latex_cleaner import run_arxiv_cleaner

PARSER = argparse.ArgumentParser(
    prog="arxiv_latex_cleaner@{0}".format(__version__),
    description=(
        "Clean the LaTeX code of your paper to submit to arXiv. "
        "Check the README for more information on the use."
    ),
)

PARSER.add_argument(
    "input_folder",
    type=str,
    help="Input folder or zip archive containing the LaTeX code.",
)

PARSER.add_argument(
    "--resize_images",
    action="store_true",
    help="Resize images.",
)

PARSER.add_argument(
    "--im_size",
    default=500,
    type=int,
    help=(
        "Size of the output images (in pixels, longest side). Fine tune this "
        "to get as close to 10MB as possible."
    ),
)

PARSER.add_argument(
    "--compress_pdf",
    action="store_true",
    help="Compress PDF images using ghostscript (Linux and Mac only).",
)

PARSER.add_argument(
    "--pdf_im_resolution",
    default=500,
    type=int,
    help="Resolution (in dpi) to which the tool resamples the PDF images.",
)

PARSER.add_argument(
    "--images_allowlist",
    default={},
    type=json.loads,
    help=(
        "Images (and PDFs) that won't be resized to the default resolution, "
        "but the one provided here. Value is pixel for images, and dpi for "
        "PDFs, as in --im_size and --pdf_im_resolution, respectively. Format "
        "is a dictionary as: '{\"path/to/im.jpg\": 1000}'"
    ),
)

PARSER.add_argument(
    "--keep_bib",
    action="store_true",
    help="Avoid deleting the *.bib files.",
)

PARSER.add_argument(
    "--commands_to_delete",
    nargs="+",
    default=[],
    required=False,
    help=(
        "LaTeX commands that will be deleted. Useful for e.g. user-defined "
        "\\todo commands. For example, to delete all occurrences of \\todo1{} "
        "and \\todo2{}, run the tool with `--commands_to_delete todo1 todo2`. "
        "Please note that the positional argument `input_folder` cannot come "
        "immediately after `commands_to_delete`, as the parser does not have "
        "any way to know if it's another command to delete."
    ),
)

PARSER.add_argument(
    "--commands_only_to_delete",
    nargs="+",
    default=[],
    required=False,
    help=(
        "LaTeX commands that will be deleted but the text wrapped in the"
        " commands will be retained. Useful for commands that change text"
        " formats and colors, which you may want to remove but keep the text"
        " within. Usage is exactly the same as for commands_to_delete. Note"
        " that if a command is listed both here and in commands_to_delete,"
        " the wrapped text is retained."
    ),
)

PARSER.add_argument(
    "--environments_to_delete",
    nargs="+",
    default=[],
    required=False,
    help=(
        "LaTeX environments that will be deleted. Useful for e.g. user-"
        "defined comment environments. For example, to delete all occurrences "
        "of \\begin{note} ... \\end{note}, run the tool with "
        "`--environments_to_delete note`. Please note that the positional "
        "argument `input_folder` cannot come immediately after "
        "`environments_to_delete`, as the parser does not have any way to "
        "know if it's another environment to delete."
    ),
)

def if_prefixed(orig_string):
  if orig_string.startswith("\\"):
    string = orig_string[1:]
  else:
    string = orig_string
  if not string.startswith("if"):
    raise argparse.ArgumentTypeError(
        f"Expected a string starting with 'if', got '{orig_string}'!"
    )
  return string

PARSER.add_argument(
    "--if_exceptions",
    nargs="+",
    default=[],
    required=False,
    type=if_prefixed,
    help=(
        "Constant TeX primitive conditionals (\\iffalse, \\iftrue, etc.) are "
        "simplified, i.e., true branches are kept, false branches deleted. "
        "To parse the conditional constructs correctly, all commands starting "
        "with `\\if` are assumed to be TeX primitive conditionals (e.g., "
        "declared by \\newif\\ifvar). Some known exceptions to this rule are "
        "already included (e.g., \\iff, \\ifthenelse, etc.), but you can add "
        "custom exceptions using `--if_exceptions iffalt`."
    ),
)


PARSER.add_argument(
    "--use_external_tikz",
    type=str,
    help=(
        "Folder (relative to input folder) containing externalized tikz "
        "figures in PDF format."
    ),
)

PARSER.add_argument(
    "--svg_inkscape",
    nargs="?",
    type=str,
    const="svg-inkscape",
    help=(
        "Include PDF files generated by Inkscape via the `\\includesvg` "
        "command from the `svg` package. This is done by replacing the "
        "`\\includesvg` calls with `\\includeinkscape` calls pointing to the "
        "generated `.pdf_tex` files. By default, these files and the "
        "generated PDFs are located under `./svg-inkscape` (relative to the "
        "input folder), but a different path (relative to the input folder) "
        "can be provided in case a different `inkscapepath` was set when "
        "loading the `svg` package."
    ),
)

PARSER.add_argument(
    "--convert_png_to_jpg",
    action="store_true",
    help="Convert PNG images to JPG format to reduce file size. Note that this will override --resize_images for PNG files.",
)

PARSER.add_argument(
    "--png_quality",
    type=int,
    default=50,
    help="JPG quality for PNG conversion (0-100, default: 50)",
)

PARSER.add_argument(
    "--png_size_threshold",
    type=float,
    default=0.5,
    help="Minimum PNG file size in MB to apply quality reduction (default: 0.5)",
)

PARSER.add_argument(
    "--config",
    type=str,
    help=(
        "Read settings from `.yaml` config file. If command line arguments "
        "are provided additionally, the config file parameters are updated "
        "with the command line parameters."
    ),
    required=False,
)

PARSER.add_argument(
    "--verbose",
    action="store_true",
    help="Enable detailed output.",
)

ARGS = vars(PARSER.parse_args())

if ARGS["config"] is not None:
  try:
    with open(ARGS["config"], "r") as config_file:
      config_params = yaml.safe_load(config_file)
    final_args = merge_args_into_config(ARGS, config_params)

  except FileNotFoundError:
    print(f"config file {ARGS['config']} not found.")
    final_args = ARGS
    final_args.pop("config", None)
else:
  final_args = ARGS

if final_args.get("verbose", False):
  logging.basicConfig(level=logging.INFO)
else:
  logging.basicConfig(level=logging.ERROR)

run_arxiv_cleaner(final_args)
exit(0)


================================================
FILE: arxiv_latex_cleaner/_version.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

__version__ = "v1.0.10"


================================================
FILE: arxiv_latex_cleaner/arxiv_latex_cleaner.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Cleans the LaTeX code of your paper to submit to arXiv."""
import collections
import contextlib
import copy
import logging
import os
import pathlib
import shutil
import subprocess
import tempfile

from PIL import Image
import regex

PDF_RESIZE_COMMAND = (
    'gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH '
    '-dDownsampleColorImages=true -dColorImageResolution={resolution} '
    '-dColorImageDownsampleThreshold=1.0 -dAutoRotatePages=/None '
    '-sOutputFile={output} {input}'
)
MAX_FILENAME_LENGTH = 120

# Fix for Windows: Even if '\' (os.sep) is the standard way of making paths on
# Windows, it interferes with regular expressions. We just change os.sep to '/'
# and os.path.join to a version using '/' as Windows will handle it the right
# way.
if os.name == 'nt':
  global old_os_path_join

  def new_os_join(path, *args):
    res = old_os_path_join(path, *args)
    res = res.replace('\\', '/')
    return res

  old_os_path_join = os.path.join

  os.sep = '/'
  os.path.join = new_os_join


def _create_dir_erase_if_exists(path):
  if os.path.exists(path):
    shutil.rmtree(path)
  os.makedirs(path)


def _create_dir_if_not_exists(path):
  if not os.path.exists(path):
    os.makedirs(path)


def _keep_pattern(haystack, patterns_to_keep):
  """Keeps the strings that match 'patterns_to_keep'."""
  out = []
  for item in haystack:
    if any((regex.findall(rem, item) for rem in patterns_to_keep)):
      out.append(item)
  return out


def _remove_pattern(haystack, patterns_to_remove):
  """Removes the strings that match 'patterns_to_remove'."""
  return [
      item
      for item in haystack
      if item not in _keep_pattern([item], patterns_to_remove)
  ]


def _list_all_files(in_folder, ignore_dirs=None):
  if ignore_dirs is None:
    ignore_dirs = []
  to_consider = [
      os.path.join(os.path.relpath(path, in_folder), name)
      if path != in_folder
      else name
      for path, _, files in os.walk(in_folder)
      for name in files
  ]
  return _remove_pattern(to_consider, ignore_dirs)


def _copy_file(filename, params):
  _create_dir_if_not_exists(
      os.path.join(params['output_folder'], os.path.dirname(filename))
  )
  shutil.copy(
      os.path.join(params['input_folder'], filename),
      os.path.join(params['output_folder'], filename),
  )


def _remove_command(text, command, keep_text=False):
  """Removes '\\command{*}' from the string 'text'.

  Regex `base_pattern` used to match balanced parentheses taken from:
  https://stackoverflow.com/questions/546433/regular-expression-to-match-balanced-parentheses/35271017#35271017
  """
  base_pattern = (
      r'\\'
      + command
      + r'(?:\[(?:.*?)\])*\{((?:[^{}]+|\{(?1)\})*)\}(?:\[(?:.*?)\])*'
  )

  def extract_text_inside_curly_braces(text):
    """Extract text inside of {} from command string"""
    pattern = r'\{((?:[^{}]|(?R))*)\}'

    match = regex.search(pattern, text)

    if match:
      return match.group(1)
    else:
      return ''

  # Loops in case of nested commands that need to retain text, e.g.,
  # \red{hello \red{world}}.
  while True:
    all_substitutions = []
    has_match = False
    for match in regex.finditer(base_pattern, text):
      # In case there are only spaces or nothing up to the following newline,
      # adds a percent, not to alter the newlines.
      has_match = True

      if not keep_text:
        new_substring = ''
      else:
        temp_substring = text[match.span()[0] : match.span()[1]]
        new_substring = extract_text_inside_curly_braces(temp_substring)

      if match.span()[1] < len(text):
        next_newline = text[match.span()[1] :].find('\n')
        if next_newline != -1:
          text_until_newline = text[
              match.span()[1] : match.span()[1] + next_newline
          ]
          if (
              not text_until_newline or text_until_newline.isspace()
          ) and not keep_text:
            new_substring = '%'
      all_substitutions.append(
          (match.span()[0], match.span()[1], new_substring)
      )

    for start, end, new_substring in reversed(all_substitutions):
      text = text[:start] + new_substring + text[end:]

    if not keep_text or not has_match:
      break

  return text


def _remove_environment(text, environment):
  """Removes '\\begin{environment}*\\end{environment}' from 'text'."""
  # Need to escape '{', to not trigger fuzzy matching if `environment` starts
  # with one of 'i', 'd', 's', or 'e'
  return regex.sub(
      r'\\begin\{' + environment + r'}[\s\S]*?\\end\{' + environment + r'}',
      '',
      text,
  )


def _simplify_conditional_blocks(text, if_exceptions=[]):
  r"""Simplify possibly nested conditional blocks from 'text'.

  For example, `\iffalse TEST1\else TEST2\fi` is simplified to `TEST2`,
  and `\iftrue TEST1\else TEST2\fi` is simplified to `TEST1`.
  Unknown conditionals are left untouched.

  If the conditional tree is malformed, the function will print a warning
  to stderr and return the original text.
  """
  p = regex.compile(r'(?!(?<=\\newif\s*))\\if\s*(\w+)|\\else(?!\w)|\\fi(?!\w)')
  toplevel_tree = {'left': [], 'right': [], 'kind': 'toplevel', 'parent': None}

  tree = toplevel_tree

  exceptions = [
      # math macro \iff (not a conditional)
      'iff',
      # package etoolbox
      'ifpatchable',
      'ifpatchable*',
      'ifbool',
      'iftoggle',
      'ifdef',
      'ifcsdef',
      'ifundef',
      'ifcsundef',
      'ifdefmacro',
      'ifcsmacro',
      'ifdefparam',
      'ifcsparam',
      'ifcsprefix',
      'ifdefprotected',
      'ifcsprotected',
      'ifdefltxprotect',
      'ifcsltxprotect',
      'ifdefempty',
      'ifcsempty',
      'ifdefvoid',
      'ifcsvoid',
      'ifdefequal',
      'ifcsequal',
      'ifdefstring',
      'ifcsstring',
      'ifdefstrequal',
      'ifcsstrequal',
      'ifdefcounter',
      'ifcscounter',
      'ifltxcounter',
      'ifdeflength',
      'ifcslength',
      'ifdefdimen',
      'ifcsdimen',
      'ifstrequal',
      'ifstrempty',
      'ifblank',
      'ifnumcomp',
      'ifnumequal',
      'ifnumodd',
      'ifdimcomp',
      'ifdimequal',
      'ifdimgreater',
      'ifdimless',
      'ifboolexpr',
      'ifboolexpe',
      'ifinlist',
      'ifinlistcs',
      'ifrmnum',
      # package hyperref
      'ifpdfstringunicode',
      # package ifthen
      'ifthenelse',
  ] + if_exceptions

  def new_subtree(kind):
    return {'kind': kind, 'left': [], 'right': []}

  def add_subtree(tree, subtree):
    if 'else' not in tree:
      tree['left'].append(subtree)
    else:
      tree['right'].append(subtree)
    subtree['parent'] = tree

  def print_tree(tree, indent, write):
    if 'start' in tree:
      write(' ' * indent + tree['start'].group() + '\n')
    for subtree in tree['left']:
      print_tree(subtree, indent + 2, write)
    if 'else' in tree:
      write(' ' * indent + tree['else'].group() + '\n')
    for subtree in tree['right']:
      print_tree(subtree, indent + 2, write)
    if 'end' in tree:
      write(' ' * indent + tree['end'].group() + '\n')

  def print_abort(error_finding):
    os.sys.stderr.write(
        f'Warning: Found {error_finding}! Not removing any conditional'
        ' blocks...\n'
    )
    os.sys.stderr.write(
        '         This is the matched tree (as built up to the error):\n'
    )
    print_tree(toplevel_tree, indent=9, write=os.sys.stderr.write)
    os.sys.stderr.write(
        '         Potentially, you need to supply an exception using'
        ' --if_exceptions.\n'
    )

  for m in p.finditer(text):
    m_no_space = m.group().replace(' ', '')
    if m_no_space == r'\iffalse' or m_no_space == r'\if0':
      subtree = new_subtree('iffalse')
      subtree['start'] = m
      add_subtree(tree, subtree)
      tree = subtree
    elif m_no_space == r'\iftrue' or m_no_space == r'\if1':
      subtree = new_subtree('iftrue')
      subtree['start'] = m
      add_subtree(tree, subtree)
      tree = subtree
    elif m_no_space.startswith(r'\if'):
      if m_no_space[1:] in exceptions:
        continue
      subtree = new_subtree('unknown')
      subtree['start'] = m
      add_subtree(tree, subtree)
      tree = subtree
    elif m_no_space == r'\else':
      if tree['parent'] is None:
        print_abort(r'unmatched \else')
        return text
      elif 'else' in tree:
        print_abort(r'duplicate \else')
        return text

      tree['else'] = m
    elif m.group() == r'\fi':
      if tree['parent'] is None:
        print_abort(r'unmatched \fi')
        return text

      tree['end'] = m
      tree = tree['parent']
    else:
      raise RuntimeError('Unreachable!')

  if tree['parent'] is not None:
    print_abort('unmatched ' + tree['start'].group())
    return text

  positions_to_delete = []

  def traverse_tree(tree):
    if tree['kind'] == 'iffalse':
      if 'else' in tree:
        positions_to_delete.append((tree['start'].start(), tree['else'].end()))
        for subtree in tree['right']:
          traverse_tree(subtree)
        positions_to_delete.append((tree['end'].start(), tree['end'].end()))
      else:
        positions_to_delete.append((tree['start'].start(), tree['end'].end()))
    elif tree['kind'] == 'iftrue':
      if 'else' in tree:
        positions_to_delete.append((tree['start'].start(), tree['start'].end()))
        for subtree in tree['left']:
          traverse_tree(subtree)
        positions_to_delete.append((tree['else'].start(), tree['end'].end()))
      else:
        positions_to_delete.append((tree['start'].start(), tree['start'].end()))
        positions_to_delete.append((tree['end'].start(), tree['end'].end()))
    elif tree['kind'] == 'unknown':
      for subtree in tree['left']:
        traverse_tree(subtree)
      for subtree in tree['right']:
        traverse_tree(subtree)
    else:
      raise ValueError('Unreachable!')

  for tree in toplevel_tree['left']:
    traverse_tree(tree)

  for start, end in reversed(positions_to_delete):
    if end < len(text) and text[end].isspace():
      end_to_del = end + 1
    else:
      end_to_del = end
    text = text[:start] + text[end_to_del:]

  return text


def _remove_comments_inline(text):
  """Removes the comments from the string 'text' and ignores % inside \\url{}."""
  auto_ignore_pattern = r'(%\s*auto-ignore).*'
  if regex.search(auto_ignore_pattern, text):
    return regex.sub(auto_ignore_pattern, r'\1', text)

  if text.lstrip(' ').lstrip('\t').startswith('%'):
    return ''

  url_pattern = r'\\url\{(?>[^{}]|(?R))*\}'

  def remove_comments(segment):
    """Check if a segment of text contains a comment and remove it."""
    if segment.lstrip().startswith('%'):
      return '', True
    match = regex.search(r'(?<!\\)%', segment)
    if match:
      return segment[: match.end()] + '\n', True
    else:
      return segment, False

  # split the text into segments based on \url{} tags
  segments = regex.split(f'({url_pattern})', text)

  for i in range(len(segments)):
    # only process segments that are not part of a \url{} tag
    if not regex.match(url_pattern, segments[i]):
      segments[i], match = remove_comments(segments[i])
      if match:
        # remove all segments after the first inline comment
        segments = segments[: i + 1]
        break

  final_text = ''.join(segments)
  return (
      final_text
      if final_text.endswith('\n') or final_text.endswith('\\n')
      else final_text + '\n'
  )


def _strip_tex_contents(lines, end_str):
  """Removes everything after end_str."""
  for i in range(len(lines)):
    if end_str in lines[i]:
      if '%' not in lines[i]:
        return lines[: i + 1]
      elif lines[i].index('%') > lines[i].index(end_str):
        return lines[: i + 1]
  return lines


def _read_file_content(filename):
  with open(filename, 'r', encoding='utf-8') as fp:
    lines = fp.readlines()
    lines = _strip_tex_contents(lines, '\\end{document}')
    return lines


def _read_all_tex_contents(tex_files, parameters):
  contents = {}
  for fn in tex_files:
    contents[fn] = _read_file_content(
        os.path.join(parameters['input_folder'], fn)
    )
  return contents


def _write_file_content(content, filename):
  _create_dir_if_not_exists(os.path.dirname(filename))
  with open(filename, 'w', encoding='utf-8') as fp:
    return fp.write(content)


def _remove_comments_and_commands_to_delete(content, parameters):
  """Erases LaTeX comments and the requested environments and commands."""
  content = [_remove_comments_inline(line) for line in content]
  content = _remove_environment(''.join(content), 'comment')
  content = _simplify_conditional_blocks(
      content, parameters.get('if_exceptions', [])
  )
  for environment in parameters.get('environments_to_delete', []):
    content = _remove_environment(content, environment)
  for command in parameters.get('commands_only_to_delete', []):
    content = _remove_command(content, command, True)
  for command in parameters['commands_to_delete']:
    content = _remove_command(content, command, False)
  return content


def _replace_tikzpictures(content, figures):
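  """Replaces tikzpicture environments with includegraphics commands.

  Each tikzpicture environment in the content that has a matching external
  PDF figure is replaced by an includegraphics command pointing to that PDF.
  The updated content is returned.
  """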

  def get_figure(matchobj):
    found_tikz_filename = regex.search(
        r'\\tikzsetnextfilename{(.*?)}', matchobj.group(0)
    ).group(1)
    # search in tex split if figure is available
    matching_tikz_filenames = _keep_pattern(
        figures, ['/' + found_tikz_filename + '.pdf']
    )
    if len(matching_tikz_filenames) == 1:
      return '\\includegraphics{' + matching_tikz_filenames[0] + '}'
    else:
      return matchobj.group(0)

  content = regex.sub(
      r'\\tikzsetnextfilename{[\s\S]*?\\end{tikzpicture}', get_figure, content
  )

  return content


def _replace_includesvg(content, svg_inkscape_files):
  def repl_svg(matchobj):
    svg_path = matchobj.group(2)
    if svg_path.endswith('.svg'):
      svg_path = '_'.join(svg_path.rsplit('.', 1))
    svg_filename = os.path.basename(svg_path)

    # search in svg_inkscape split if pdf_tex file is available
    matching_pdf_tex_files = _keep_pattern(
        svg_inkscape_files, ['/' + svg_filename + '-tex.pdf_tex']
    )
    if len(matching_pdf_tex_files) == 1:
      options = '' if matchobj.group(1) is None else matchobj.group(1)
      res = f'\\includeinkscape{options}{{{matching_pdf_tex_files[0]}}}'
      return res
    else:
      return matchobj.group(0)

  content = regex.sub(r'\\includesvg(\[.*?\])?{(.*?)}', repl_svg, content)

  return content

def _resize_and_copy_figure(
    filename,
    origin_folder,
    destination_folder,
    resize_image,
    image_size,
    compress_pdf,
    pdf_resolution,
    convert_png_to_jpg=False,
    png_quality=50,
    png_size_threshold=0.5,
    verbose=False
):
    """Resizes and copies the input figure (either JPG, PNG, or PDF).

    Parameters:
        filename: The input filename
        origin_folder: The folder containing the input filename
        destination_folder: The folder to copy the output filename to
        resize_image: Whether to resize the image
        image_size: The maximum size of the image in pixels
        compress_pdf: Whether to compress the PDF file
        pdf_resolution: The resolution (in dpi) to resample PDF images to
        convert_png_to_jpg: Whether to convert PNG files to JPG format. Note that this will override resize_image for PNG files.
        png_quality: JPG quality for converted PNG files (0-100)
        png_size_threshold: Minimum file size in MB to apply quality reduction
        verbose: Enable verbose logging
    
    Returns:
        str: The actual output filename (may differ from input if PNG was converted)
    """
    _create_dir_if_not_exists(
        os.path.join(destination_folder, os.path.dirname(filename))
    )
    
    if convert_png_to_jpg and os.path.splitext(filename)[1].lower() in ['.png']:
        original_size_mb = os.path.getsize(os.path.join(origin_folder, filename)) / (1024 * 1024)
        im = Image.open(os.path.join(origin_folder, filename))
        # Determine quality based on file size
        if original_size_mb < png_size_threshold:
            quality = 100  # Keep high quality for small files
            if verbose:
                print(f"Keeping original quality for small PNG: {filename}")
        else:
            quality = png_quality
            if verbose:
                print(f"Converting PNG to JPG with quality {quality}: {filename}")
        
        # Convert PNG to JPG
        output_filename = os.path.splitext(filename)[0] + '.jpg'
        rgb_img = im.convert('RGB')
        rgb_img.save(os.path.join(destination_folder, output_filename), 'JPEG', quality=quality)
        
        if verbose:
            print(f"Converted: {filename} -> {output_filename}")
          
        return output_filename
                    
    if resize_image and os.path.splitext(filename)[1].lower() in [
        '.jpg',
        '.jpeg',
        '.png',
    ]:
        try:
            im = Image.open(os.path.join(origin_folder, filename))
            if max(im.size) > image_size:
                im = im.resize(
                    tuple([int(x * float(image_size) / max(im.size)) for x in im.size]),
                    Image.Resampling.LANCZOS,
                )
            
            if os.path.splitext(filename)[1].lower() in ['.jpg', '.jpeg']:
                im.save(os.path.join(destination_folder, filename), 'JPEG', quality=90)
                return filename
                
            elif os.path.splitext(filename)[1].lower() in ['.png']:
                im.save(os.path.join(destination_folder, filename), 'PNG')
                return filename
                    
        except Exception as e:
            if verbose:
                print(f"Failed to process image {filename}: {e}")
            # Fall back to simple copy
            shutil.copy(
                os.path.join(origin_folder, filename),
                os.path.join(destination_folder, filename),
            )
            return filename

    elif compress_pdf and os.path.splitext(filename)[1].lower() == '.pdf':
        _resize_pdf_figure(
            filename, origin_folder, destination_folder, pdf_resolution
        )
        return filename
    else:
        shutil.copy(
            os.path.join(origin_folder, filename),
            os.path.join(destination_folder, filename),
        )
        return filename


def _update_image_references(tex_contents_dict, old_filename, new_filename, verbose=False):
    """Update references from old_filename to new_filename in all tex content."""
    if old_filename == new_filename:
        return tex_contents_dict  # No change needed
    
    old_base = os.path.splitext(old_filename)[0]
    new_base = os.path.splitext(new_filename)[0]
    
    if verbose:
        print(f"Updating LaTeX references: {old_filename} -> {new_filename}")
    
    for tex_file in tex_contents_dict:
        # Handle both string and list content
        if isinstance(tex_contents_dict[tex_file], list):
            content = ''.join(tex_contents_dict[tex_file])
        else:
            content = tex_contents_dict[tex_file]
        
        content_changed = False
        
        # Pattern 1: Direct filename with full extension, handling comments and newlines
        pattern1 = r'(\{(?:%\s*\n\s*)?[^}]*?)' + regex.escape(old_filename) + r'((?:%\s*\n\s*)?[^}]*?\})'
        replacement1 = r'\1' + new_filename + r'\2'
        
        new_content = regex.sub(pattern1, replacement1, content, flags=regex.IGNORECASE | regex.DOTALL)
        if new_content != content:
            content = new_content
            content_changed = True
            if verbose:
                print(f"Applied pattern 1 (full filename) in {tex_file}")
        else:
            # Pattern 2: Base filename without extension, handling comments and newlines
            # Only apply this if Pattern 1 didn't match to avoid double replacements
            pattern2 = r'(\{(?:%\s*\n\s*)?[^}]*?)' + regex.escape(old_base) + r'((?:%\s*\n\s*)?[^}]*?\})'
            replacement2 = r'\1' + new_base + r'.jpg\2'
            
            new_content = regex.sub(pattern2, replacement2, content, flags=regex.IGNORECASE | regex.DOTALL)
            if new_content != content:
                content = new_content
                content_changed = True
                if verbose:
                    print(f"Applied pattern 2 (base filename) in {tex_file}")
            else:
                # Pattern 3: Handle cases where extension is split across lines with comments
                # This specifically targets patterns like: images/filename%\n.png
                pattern3 = r'(\{[^}]*?)' + regex.escape(old_base) + r'(%\s*\n\s*)(\.png)([^}]*?\})'
                replacement3 = r'\1' + new_base + r'\2.jpg\4'
                
                new_content = regex.sub(pattern3, replacement3, content, flags=regex.IGNORECASE | regex.DOTALL)
                if new_content != content:
                    content = new_content
                    content_changed = True
                    if verbose:
                        print(f"Applied pattern 3 (split extension) in {tex_file}")
        
        # Update the content back in the appropriate format
        if content_changed:
            if isinstance(tex_contents_dict[tex_file], list):
                # Convert back to list format, preserving line endings
                tex_contents_dict[tex_file] = content.splitlines(keepends=True)
            else:
                tex_contents_dict[tex_file] = content
            
            if verbose:
                print(f"Updated references in {tex_file}")
    
    # Re-write the updated tex files to the output directory
    if verbose and any(tex_contents_dict.values()):
        print("Re-writing updated tex files...")
    
    return tex_contents_dict


def _resize_pdf_figure(
    filename, origin_folder, destination_folder, resolution, timeout=10
):
  input_file = os.path.join(origin_folder, filename)
  output_file = os.path.join(destination_folder, filename)
  bash_command = PDF_RESIZE_COMMAND.format(
      input=input_file, output=output_file, resolution=resolution
  )
  process = subprocess.Popen(bash_command.split(), stdout=subprocess.PIPE)

  try:
    process.communicate(timeout=timeout)
  except subprocess.TimeoutExpired:
    process.kill()
    outs, errs = process.communicate()
    print('Output: ', outs)
    print('Errors: ', errs)


def _copy_only_referenced_non_tex_not_in_root(parameters, contents, splits):
  for fn in _keep_only_referenced(
      splits['non_tex_not_in_root'], contents, strict=True
  ):
    _copy_file(fn, parameters)

def _resize_and_copy_figures_if_referenced(parameters, contents, splits):
    """Resizes and copies referenced figures; returns PNG to JPG renames."""
    image_size = collections.defaultdict(lambda: parameters['im_size'])
    image_size.update(parameters['images_allowlist'])
    pdf_resolution = collections.defaultdict(
        lambda: parameters['pdf_im_resolution']
    )
    pdf_resolution.update(parameters['images_allowlist'])
    
    # contents is the full content string for reference checking
    
    filename_changes = {}  # Track PNG -> JPG filename changes
    
    for image_file in _keep_only_referenced(
        splits['figures'], contents, strict=False
    ):
        actual_output_filename = _resize_and_copy_figure(
            filename=image_file,
            origin_folder=parameters['input_folder'],
            destination_folder=parameters['output_folder'],
            resize_image=parameters['resize_images'],
            image_size=image_size[image_file],
            compress_pdf=parameters['compress_pdf'],
            pdf_resolution=pdf_resolution[image_file],
            convert_png_to_jpg=parameters.get('convert_png_to_jpg', False),
            png_quality=parameters.get('png_quality', 50),
            png_size_threshold=parameters.get('png_size_threshold', 0.5),
            verbose=parameters.get('verbose', False)
        )
        
        # Track filename changes for reference updates
        if actual_output_filename != image_file:
            filename_changes[image_file] = actual_output_filename
    
    return filename_changes


def _search_reference(filename, contents, strict=False):
  """Returns a match object if filename is referenced in contents, and None otherwise.

  If not strict mode, path prefix and extension are optional.
  """
  if strict:
    # regex pattern for strict=True for path/to/img.ext:
    # \{[\s%]*path/to/img\.ext[\s%]*\}
    filename_regex = filename.replace('.', r'\.')
  else:
    filename_path = pathlib.Path(filename)

    # make extension optional
    root, extension = filename_path.stem, filename_path.suffix
    basename_regex = '{}({})?'.format(
        regex.escape(root), regex.escape(extension)
    )

    # iterate through parent fragments to make path prefix optional
    path_prefix_regex = ''
    for fragment in reversed(filename_path.parents):
      if fragment.name == '.':
        continue
      fragment = regex.escape(fragment.name)
      path_prefix_regex = '({}{}{})?'.format(
          path_prefix_regex, fragment, os.sep
      )

    # Regex pattern for strict=False for path/to/img.ext:
    # \{[\s%]*(<path_prefix>)?<basename>(<ext>)?[\s%]*\}
    filename_regex = path_prefix_regex + basename_regex

  # Some files 'path/to/file' are referenced in tex as './path/to/file' thus
  # adds prefix for relative paths starting with './' or '.\' to regex search.
  filename_regex = r'(\.' + os.sep + r')?' + filename_regex

  # Pads with braces and optional whitespace/comment characters.
  patn = r'\{{[\s%]*{}[\s%]*\}}'.format(filename_regex)
  # Picture references in LaTeX are allowed to be in different cases.
  return regex.search(patn, contents, regex.IGNORECASE)


def _keep_only_referenced(filenames, contents, strict=False):
  """Returns the filenames referenced from contents.

  If not strict mode, path prefix and extension are optional.
  """
  return [
      fn
      for fn in filenames
      if _search_reference(fn, contents, strict) is not None
  ]


def _keep_only_referenced_tex(contents, splits):
  """Stores in splits['tex_to_copy'] the tex files referenced from other tex files.

  Several iterations are needed, since a file may only be referenced from a
  file that is itself unreferenced.
  """
  old_referenced = set(splits['tex_in_root'] + splits['tex_not_in_root'])
  while True:
    referenced = set(splits['tex_in_root'])
    for fn in old_referenced:
      # The filename is escaped so that regex metacharacters in it (e.g. '+'
      # or '(') are matched literally.
      pattern = r'(' + regex.escape(os.path.splitext(fn)[0]) + r'[.}])'
      for fn2 in old_referenced:
        if regex.search(pattern, '\n'.join(contents[fn2])):
          referenced.add(fn)
          break

    if referenced == old_referenced:
      splits['tex_to_copy'] = list(referenced)
      return

    old_referenced = referenced.copy()


def _add_root_tex_files(splits):
  # TODO: Check auto-ignore marker in root to detect the main file. Then check
  #  there is only one non-referenced TeX in root.

  # Forces the TeX in root to be copied, even if they are not referenced.
  for fn in splits['tex_in_root']:
    if fn not in splits['tex_to_copy']:
      splits['tex_to_copy'].append(fn)


def _split_all_files(parameters):
  """Splits the files into types or location to know what to do with them."""
  file_splits = {
      'all': _list_all_files(
          parameters['input_folder'], ignore_dirs=['.git' + os.sep]
      ),
      'in_root': [
          f
          for f in os.listdir(parameters['input_folder'])
          if os.path.isfile(os.path.join(parameters['input_folder'], f))
      ],
  }

  file_splits['not_in_root'] = [
      f for f in file_splits['all'] if f not in file_splits['in_root']
  ]
  file_splits['to_copy_in_root'] = _remove_pattern(
      file_splits['in_root'],
      parameters['to_delete'] + parameters['figures_to_copy_if_referenced'],
  )
  file_splits['to_copy_not_in_root'] = _remove_pattern(
      file_splits['not_in_root'],
      parameters['to_delete'] + parameters['figures_to_copy_if_referenced'],
  )
  file_splits['figures'] = _keep_pattern(
      file_splits['all'], parameters['figures_to_copy_if_referenced']
  )

  # The dot is escaped so that only real '.tex'/'.tikz' extensions match (an
  # unescaped '.tex$' would also match a file named e.g. 'latex').
  file_splits['tex_in_root'] = _keep_pattern(
      file_splits['to_copy_in_root'], [r'\.tex$', r'\.tikz$']
  )
  file_splits['tex_not_in_root'] = _keep_pattern(
      file_splits['to_copy_not_in_root'], [r'\.tex$', r'\.tikz$']
  )

  file_splits['non_tex_in_root'] = _remove_pattern(
      file_splits['to_copy_in_root'], [r'\.tex$', r'\.tikz$']
  )
  file_splits['non_tex_not_in_root'] = _remove_pattern(
      file_splits['to_copy_not_in_root'], [r'\.tex$', r'\.tikz$']
  )

  if parameters.get('use_external_tikz', None) is not None:
    file_splits['external_tikz_figures'] = _keep_pattern(
        file_splits['all'], [parameters['use_external_tikz']]
    )
  else:
    file_splits['external_tikz_figures'] = []

  if parameters.get('svg_inkscape', None) is not None:
    file_splits['svg_inkscape'] = _keep_pattern(
        file_splits['all'], [parameters['svg_inkscape']]
    )
  else:
    file_splits['svg_inkscape'] = []

  return file_splits


def _create_out_folder(input_folder):
  """Creates the output folder, erasing it first if it already exists."""
  out_folder = os.path.abspath(input_folder).removesuffix('.zip') + '_arXiv'
  _create_dir_erase_if_exists(out_folder)

  return out_folder


def run_arxiv_cleaner(parameters):
  """Core of the code, runs the actual arXiv cleaner."""

  files_to_delete = [
      r'\.aux$',
      r'\.sh$',
      r'\.blg$',
      r'\.brf$',
      r'\.log$',
      r'\.out$',
      r'\.ps$',
      r'\.dvi$',
      r'\.synctex\.gz$',
      '~$',
      r'\.backup$',
      r'\.gitignore$',
      r'\.DS_Store$',
      r'\.svg$',
      r'^\.idea',
      r'\.dpth$',
      r'\.md5$',
      r'\.dep$',
      r'\.auxlock$',
      r'\.fls$',
      r'\.fdb_latexmk$',
  ]

  if not parameters['keep_bib']:
    files_to_delete.append(r'\.bib$')

  parameters.update({
      'to_delete': files_to_delete,
      'figures_to_copy_if_referenced': [
          r'\.png$',
          r'\.jpg$',
          r'\.jpeg$',
          r'\.pdf$',
      ],
  })

  logging.info('Collecting file structure.')
  parameters['output_folder'] = _create_out_folder(parameters['input_folder'])

  from_zip = parameters['input_folder'].endswith('.zip')
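  # A temporary directory is only needed to unpack zipped inputs; otherwise a
  # no-op context manager is used.
  tempdir_context = (
      tempfile.TemporaryDirectory() if from_zip else contextlib.nullcontext()
  )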

  with tempdir_context as tempdir:

    if from_zip:
      logging.info('Unzipping input folder.')
      shutil.unpack_archive(parameters['input_folder'], tempdir)
      parameters['input_folder'] = tempdir

    splits = _split_all_files(parameters)

    logging.info('Reading all tex files')
    tex_contents = _read_all_tex_contents(
        splits['tex_in_root'] + splits['tex_not_in_root'], parameters
    )

    for tex_file in tex_contents:
      logging.info('Removing comments in file %s.', tex_file)
      tex_contents[tex_file] = _remove_comments_and_commands_to_delete(
          tex_contents[tex_file], parameters
      )

    for tex_file in tex_contents:
      logging.info('Replacing \\includesvg calls in file %s.', tex_file)
      tex_contents[tex_file] = _replace_includesvg(
          tex_contents[tex_file], splits['svg_inkscape']
      )

    for tex_file in tex_contents:
      logging.info('Replacing Tikz Pictures in file %s.', tex_file)
      content = _replace_tikzpictures(
          tex_contents[tex_file], splits['external_tikz_figures']
      )
      # Stores the content as a list of lines. If the content ends with '\n',
      # the last element is an empty string, so joining the lines with '\n'
      # restores the original text.
      tex_contents[tex_file] = content.split('\n')

    _keep_only_referenced_tex(tex_contents, splits)
    _add_root_tex_files(splits)

    for tex_file in splits['tex_to_copy']:
      logging.info('Replacing patterns in file %s.', tex_file)
      content = '\n'.join(tex_contents[tex_file])
      content = _find_and_replace_patterns(
          content, parameters.get('patterns_and_insertions', list())
      )
      tex_contents[tex_file] = content
      new_path = os.path.join(parameters['output_folder'], tex_file)
      logging.info('Writing modified contents to %s.', new_path)
      _write_file_content(
          content,
          new_path,
      )

    # At this point tex_contents[fn] is already a single string for every file
    # in splits['tex_to_copy'].
    full_content = '\n'.join(
        tex_contents[fn] for fn in splits['tex_to_copy']
    )
    _copy_only_referenced_non_tex_not_in_root(parameters, full_content, splits)
    for non_tex_file in splits['non_tex_in_root']:
      logging.info('Copying non-tex file %s.', non_tex_file)
      _copy_file(non_tex_file, parameters)

    filename_changes = _resize_and_copy_figures_if_referenced(
        parameters, full_content, splits
    )
    logging.info('Outputs written to %s', parameters['output_folder'])

    # Updates the LaTeX references of the figures whose filename changed.
    if tex_contents and filename_changes:
      for old_filename, new_filename in filename_changes.items():
        tex_contents = _update_image_references(
            tex_contents,
            old_filename,
            new_filename,
            verbose=parameters.get('verbose', False),
        )

      # Re-writes the tex files so that the output uses the new references.
      for tex_file in splits['tex_to_copy']:
        if tex_file in tex_contents:
          if isinstance(tex_contents[tex_file], list):
            updated_content = ''.join(tex_contents[tex_file])
          else:
            updated_content = tex_contents[tex_file]

          output_path = os.path.join(parameters['output_folder'], tex_file)
          logging.info(
              'Re-writing modified tex file with updated references: %s',
              output_path,
          )
          _write_file_content(updated_content, output_path)

          if parameters.get('verbose', False):
            print(f'Re-wrote {tex_file} with updated image references')

      if parameters.get('verbose', False):
        print(
            f'Updated {len(filename_changes)} image references and re-wrote'
            ' tex files'
        )


def strip_whitespace(text):
  """Strips all whitespace characters.

  https://stackoverflow.com/questions/8270092/remove-all-whitespace-in-a-string
  """
  pattern = regex.compile(r'\s+')
  text = regex.sub(pattern, '', text)
  return text


def merge_args_into_config(args, config_params):
  """Merges the command-line args into the config params.

  For keys present in both, scalar args override the config value, list args
  are concatenated with the config list (args first), and dict args update the
  config dict. Keys only present in args are added as-is.
  """
  final_args = copy.deepcopy(config_params)
  config_keys = config_params.keys()
  for key, value in args.items():
    if key in config_keys:
      if isinstance(value, (str, bool, float, int)):
        # Overwrites config value with args value.
        final_args[key] = value
      elif isinstance(value, list):
        # Concatenates args values and config values, args first.
        final_args[key] = value + config_params[key]
      elif isinstance(value, dict):
        # Updates config params with args params.
        final_args[key].update(**value)
    else:
      final_args[key] = value
  return final_args


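def _find_and_replace_patterns(content, patterns_and_insertions):
  r"""Replaces every match of each pattern in content with its insertion.

  Args:
    content: str, the text to process.
    patterns_and_insertions: List[Dict], each with a regex 'pattern', an
      'insertion' template that is formatted with the named groups of the
      match, a 'description' used for logging, and an optional
      'strip_whitespace' flag (default True) that strips all whitespace from
      the formatted insertion.

  Returns:
    The content with all the matches replaced.

  Example for patterns_and_insertions:

      [
          {
              "pattern" :
              r"(?:\\figcompfigures{\s*)(?P<first>.*?)\s*}\s*{\s*(?P<second>.*?)\s*}\s*{\s*(?P<third>.*?)\s*}",
              "insertion" :
              r"\parbox[c]{{{second}\linewidth}}{{\includegraphics[width={third}\linewidth]{{figures/{first}}}}}",
              "description": "Replace figcompfigures"
          },
      ]
  """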
  for pattern_and_insertion in patterns_and_insertions:
    pattern = pattern_and_insertion['pattern']
    insertion = pattern_and_insertion['insertion']
    description = pattern_and_insertion['description']
    logging.info('Processing pattern: %s.', description)
    p = regex.compile(pattern)
    m = p.search(content)
    while m is not None:
      local_insertion = insertion.format(**m.groupdict())
      if pattern_and_insertion.get('strip_whitespace', True):
        local_insertion = strip_whitespace(local_insertion)
      logging.info(f'Found {content[m.start():m.end()]:<70}')
      logging.info(f'Replacing with {local_insertion:<30}')
      content = content[: m.start()] + local_insertion + content[m.end() :]
      m = p.search(content)
    logging.info('Finished pattern: %s.', description)
  return content


================================================
FILE: arxiv_latex_cleaner/tests/arxiv_latex_cleaner_test.py
================================================
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from os import path
import shutil
import unittest
from absl.testing import parameterized
from arxiv_latex_cleaner import arxiv_latex_cleaner
from PIL import Image


def make_args(
    input_folder='foo/bar',
    resize_images=False,
    im_size=500,
    compress_pdf=False,
    pdf_im_resolution=500,
    images_allowlist=None,
    commands_to_delete=None,
    use_external_tikz='foo/bar/tikz',
):
  if images_allowlist is None:
    images_allowlist = {}
  if commands_to_delete is None:
    commands_to_delete = []
  args = {
      'input_folder': input_folder,
      'resize_images': resize_images,
      'im_size': im_size,
      'compress_pdf': compress_pdf,
      'pdf_im_resolution': pdf_im_resolution,
      'images_allowlist': images_allowlist,
      'commands_to_delete': commands_to_delete,
      'use_external_tikz': use_external_tikz,
  }
  return args


def make_contents():
  return (
      r'& \figcompfigures{'
      '\n\timage1.jpg'
      '\n}{'
      '\n\t'
      r'\ww'
      '\n}{'
      '\n\t1.0'
      '\n\t}'
      '\n& '
      r'\figcompfigures{image2.jpg}{\ww}{1.0}'
  )


def make_patterns():
  pattern = r'(?:\\figcompfigures{\s*)(?P<first>.*?)\s*}\s*{\s*(?P<second>.*?)\s*}\s*{\s*(?P<third>.*?)\s*}'
  insertion = r"""\parbox[c]{{
            {second}\linewidth
        }}{{
            \includegraphics[
                width={third}\linewidth
            ]{{
                figures/{first}
            }}
        }} """
  description = 'Replace figcompfigures'
  output = {
      'pattern': pattern,
      'insertion': insertion,
      'description': description,
  }
  return [output]


def make_search_reference_tests():
  return (
      {
          'testcase_name': 'prefix1',
          'filenames': ['include_image_yes.png', 'include_image.png'],
          'contents': '\\include{include_image_yes.png}',
          'strict': False,
          'true_outputs': ['include_image_yes.png'],
      },
      {
          'testcase_name': 'prefix2',
          'filenames': ['include_image_yes.png', 'include_image.png'],
          'contents': '\\include{include_image.png}',
          'strict': False,
          'true_outputs': ['include_image.png'],
      },
      {
          'testcase_name': 'nested_more_specific',
          'filenames': [
              'images/im_included.png',
              'images/include/images/im_included.png',
          ],
          'contents': '\\include{images/include/images/im_included.png}',
          'strict': False,
          'true_outputs': ['images/include/images/im_included.png'],
      },
      {
          'testcase_name': 'nested_less_specific',
          'filenames': [
              'images/im_included.png',
              'images/include/images/im_included.png',
          ],
          'contents': '\\include{images/im_included.png}',
          'strict': False,
          'true_outputs': [
              'images/im_included.png',
              'images/include/images/im_included.png',
          ],
      },
      {
          'testcase_name': 'nested_substring',
          'filenames': ['images/im_included.png', 'im_included.png'],
          'contents': '\\include{images/im_included.png}',
          'strict': False,
          'true_outputs': ['images/im_included.png'],
      },
      {
          'testcase_name': 'nested_diffpath',
          'filenames': ['images/im_included.png', 'figures/im_included.png'],
          'contents': '\\include{images/im_included.png}',
          'strict': False,
          'true_outputs': ['images/im_included.png'],
      },
      {
          'testcase_name': 'diffext',
          'filenames': ['tables/demo.tex', 'tables/demo.tikz', 'demo.tex'],
          'contents': '\\include{tables/demo.tex}',
          'strict': False,
          'true_outputs': ['tables/demo.tex'],
      },
      {
          'testcase_name': 'diffext2',
          'filenames': ['tables/demo.tex', 'tables/demo.tikz', 'demo.tex'],
          'contents': '\\include{tables/demo}',
          'strict': False,
          'true_outputs': ['tables/demo.tex', 'tables/demo.tikz'],
      },
      {
          'testcase_name': 'strict_prefix1',
          'filenames': ['demo_yes.tex', 'demo.tex'],
          'contents': '\\include{demo_yes.tex}',
          'strict': True,
          'true_outputs': ['demo_yes.tex'],
      },
      {
          'testcase_name': 'strict_prefix2',
          'filenames': ['demo_yes.tex', 'demo.tex'],
          'contents': '\\include{demo.tex}',
          'strict': True,
          'true_outputs': ['demo.tex'],
      },
      {
          'testcase_name': 'strict_nested_more_specific',
          'filenames': [
              'tables/table_included.csv',
              'tables/include/tables/table_included.csv',
          ],
          'contents': '\\include{tables/include/tables/table_included.csv}',
          'strict': True,
          'true_outputs': ['tables/include/tables/table_included.csv'],
      },
      {
          'testcase_name': 'strict_nested_less_specific',
          'filenames': [
              'tables/table_included.csv',
              'tables/include/tables/table_included.csv',
          ],
          'contents': '\\include{tables/table_included.csv}',
          'strict': True,
          'true_outputs': ['tables/table_included.csv'],
      },
      {
          'testcase_name': 'strict_nested_substring1',
          'filenames': ['tables/table_included.csv', 'table_included.csv'],
          'contents': '\\include{tables/table_included.csv}',
          'strict': True,
          'true_outputs': ['tables/table_included.csv'],
      },
      {
          'testcase_name': 'strict_nested_substring2',
          'filenames': ['tables/table_included.csv', 'table_included.csv'],
          'contents': '\\include{table_included.csv}',
          'strict': True,
          'true_outputs': ['table_included.csv'],
      },
      {
          'testcase_name': 'strict_nested_diffpath',
          'filenames': ['tables/table_included.csv', 'data/table_included.csv'],
          'contents': '\\include{tables/table_included.csv}',
          'strict': True,
          'true_outputs': ['tables/table_included.csv'],
      },
      {
          'testcase_name': 'strict_diffext',
          'filenames': ['tables/demo.csv', 'tables/demo.txt', 'demo.csv'],
          'contents': '\\include{tables/demo.csv}',
          'strict': True,
          'true_outputs': ['tables/demo.csv'],
      },
      {
          'testcase_name': 'path_starting_with_dot',
          'filenames': [
              './images/im_included.png',
              './figures/im_included.png',
          ],
          'contents': '\\include{./images/im_included.png}',
          'strict': False,
          'true_outputs': ['./images/im_included.png'],
      },
  )


class UnitTests(parameterized.TestCase):

  @parameterized.named_parameters(
      {
          'testcase_name': 'empty config',
          'args': make_args(),
          'config_params': {},
          'final_args': make_args(),
      },
      {
          'testcase_name': 'empty args',
          'args': {},
          'config_params': make_args(),
          'final_args': make_args(),
      },
      {
          'testcase_name': 'args and config provided',
          'args': make_args(
              images_allowlist={'path1/': 1000}, commands_to_delete=[r'\todo1']
          ),
          'config_params': make_args(
              'foo_/bar_',
              True,
              1000,
              True,
              1000,
              images_allowlist={'path2/': 1000},
              commands_to_delete=[r'\todo2'],
              use_external_tikz='foo_/bar_/tikz_',
          ),
          'final_args': make_args(
              images_allowlist={'path1/': 1000, 'path2/': 1000},
              commands_to_delete=[r'\todo1', r'\todo2'],
          ),
      },
  )
  def test_merge_args_into_config(self, args, config_params, final_args):
    self.assertEqual(
        arxiv_latex_cleaner.merge_args_into_config(args, config_params),
        final_args,
    )

  @parameterized.named_parameters(
      {
          'testcase_name': 'no_comment',
          'line_in': 'Foo\n',
          'true_output': 'Foo\n',
      },
      {
          'testcase_name': 'auto_ignore',
          'line_in': '%auto-ignore\n',
          'true_output': '%auto-ignore\n',
      },
      {
          'testcase_name': 'auto_ignore_middle',
          'line_in': 'Foo % auto-ignore Comment\n',
          'true_output': 'Foo % auto-ignore\n',
      },
      {
          'testcase_name': 'auto_ignore_text_with_comment',
          'line_in': 'Foo auto-ignore % Comment\n',
          'true_output': 'Foo auto-ignore %\n',
      },
      {
          'testcase_name': 'percent',
          'line_in': '100\\% accurate\n',
          'true_output': '100\\% accurate\n',
      },
      {
          'testcase_name': 'comment',
          'line_in': '  % Comment\n',
          'true_output': '',
      },
      {
          'testcase_name': 'comment_inline',
          'line_in': 'Foo %Comment\n',
          'true_output': 'Foo %\n',
      },
      {
          'testcase_name': 'url_with_percent',
          'line_in': '\\url{https://www.example.com/hello%20world}\n',
          'true_output': '\\url{https://www.example.com/hello%20world}\n',
      },
      {
          'testcase_name': 'comment_with_url',
          'line_in': 'Foo %\\url{https://www.example.com/hello%20world}\n',
          'true_output': 'Foo %\n',
      },
  )
  def test_remove_comments_inline(self, line_in, true_output):
    self.assertEqual(
        arxiv_latex_cleaner._remove_comments_inline(line_in), true_output
    )

  @parameterized.named_parameters(
      {
          'testcase_name': 'no_command',
          'text_in': 'Foo\nFoo2\n',
          'keep_text': False,
          'true_output': 'Foo\nFoo2\n',
      },
      {
          'testcase_name': 'command_not_removed',
          'text_in': '\\textit{Foo\nFoo2}\n',
          'keep_text': False,
          'true_output': '\\textit{Foo\nFoo2}\n',
      },
      {
          'testcase_name': 'command_no_end_line_removed',
          'text_in': 'A\\todo{B\nC}D\nE\n\\end{document}',
          'keep_text': False,
          'true_output': 'AD\nE\n\\end{document}',
      },
      {
          'testcase_name': 'command_with_end_line_removed',
          'text_in': 'A\n\\todo{B\nC}\nD\n\\end{document}',
          'keep_text': False,
          'true_output': 'A\n%\nD\n\\end{document}',
      },
      {
          'testcase_name': 'command_with_optional_arguments_start',
          'text_in': 'A\n\\todo[B]{C\nD}\nE\n\\end{document}',
          'keep_text': False,
          'true_output': 'A\n%\nE\n\\end{document}',
      },
      {
          'testcase_name': 'command_with_optional_arguments_end',
          'text_in': 'A\n\\todo{B\nC}[D]\nE\n\\end{document}',
          'keep_text': False,
          'true_output': 'A\n%\nE\n\\end{document}',
      },
      {
          'testcase_name': 'no_command_keep_text',
          'text_in': 'Foo\nFoo2\n',
          'keep_text': True,
          'true_output': 'Foo\nFoo2\n',
      },
      {
          'testcase_name': 'command_not_removed_keep_text',
          'text_in': '\\textit{Foo\nFoo2}\n',
          'keep_text': True,
          'true_output': '\\textit{Foo\nFoo2}\n',
      },
      {
          'testcase_name': 'command_no_end_line_removed_keep_text',
          'text_in': 'A\\todo{B\nC}D\nE\n\\end{document}',
          'keep_text': True,
          'true_output': 'AB\nCD\nE\n\\end{document}',
      },
      {
          'testcase_name': 'command_with_end_line_removed_keep_text',
          'text_in': 'A\n\\todo{B\nC}\nD\n\\end{document}',
          'keep_text': True,
          'true_output': 'A\nB\nC\nD\n\\end{document}',
      },
      {
          'testcase_name': 'nested_command_keep_text',
          'text_in': 'A\n\\todo{B\n\\todo{C}}\nD\n\\end{document}',
          'keep_text': True,
          'true_output': 'A\nB\nC\nD\n\\end{document}',
      },
      {
          'testcase_name': 'command_with_optional_arguments_start_keep_text',
          'text_in': 'A\n\\todo[B]{C\nD}\nE\n\\end{document}',
          'keep_text': True,
          'true_output': 'A\nC\nD\nE\n\\end{document}',
      },
      {
          'testcase_name': 'command_with_optional_arguments_end_keep_text',
          'text_in': 'A\n\\todo{B\nC}[D]\nE\n\\end{document}',
          'keep_text': True,
          'true_output': 'A\nB\nC\nE\n\\end{document}',
      },
      {
          'testcase_name': 'deeply_nested_command_keep_text',
          'text_in': 'A\n\\todo{B\n\\emph{C\\footnote{\\textbf{D}}}}\nE\n\\end{document}',
          'keep_text': True,
          'true_output': (
              'A\nB\n\\emph{C\\footnote{\\textbf{D}}}\nE\n\\end{document}'
          ),
      },
  )
  def test_remove_command(self, text_in, keep_text, true_output):
    self.assertEqual(
        arxiv_latex_cleaner._remove_command(text_in, 'todo', keep_text),
        true_output,
    )

  @parameterized.named_parameters(
      {
          'testcase_name': 'no_environment',
          'text_in': 'Foo\n',
          'true_output': 'Foo\n',
      },
      {
          'testcase_name': 'environment_not_removed',
          'text_in': 'Foo\n\\begin{equation}\n3x+2\n\\end{equation}\nFoo',
          'true_output': 'Foo\n\\begin{equation}\n3x+2\n\\end{equation}\nFoo',
      },
      {
          'testcase_name': 'environment_removed',
          'text_in': 'Foo\\begin{comment}\n3x+2\n\\end{comment}\nFoo',
          'true_output': 'Foo\nFoo',
      },
  )
  def test_remove_environment(self, text_in, true_output):
    self.assertEqual(
        arxiv_latex_cleaner._remove_environment(text_in, 'comment'), true_output
    )

  @parameterized.named_parameters(
      {
          'testcase_name': 'no_iffalse',
          'text_in': 'Foo\n',
          'true_output': 'Foo\n',
      },
      {
          'testcase_name': 'if_not_removed',
          'text_in': '\\ifvar\n\\ifvar\nFoo\n\\fi\n\\fi\n',
          'true_output': '\\ifvar\n\\ifvar\nFoo\n\\fi\n\\fi\n',
      },
      {
          'testcase_name': 'if_removed_with_nested_ifvar',
          'text_in': '\\ifvar\n\\iffalse\n\\ifvar\nFoo\n\\fi\n\\fi\n\\fi\n',
          'true_output': '\\ifvar\n\\fi\n',
      },
      {
          'testcase_name': 'if_removed_with_nested_iffalse',
          'text_in': '\\ifvar\n\\iffalse\n\\iffalse\nFoo\n\\fi\n\\fi\n\\fi\n',
          'true_output': '\\ifvar\n\\fi\n',
      },
      {
          'testcase_name': 'if_removed_eof',
          'text_in': '\\iffalse\nFoo\n\\fi',
          'true_output': '',
      },
      {
          'testcase_name': 'if_removed_space',
          'text_in': '\\iffalse\nFoo\n\\fi ',
          'true_output': '',
      },
      {
          'testcase_name': 'if_removed_backslash',
          'text_in': '\\iffalse\nFoo\n\\fi\\end{document}',
          'true_output': '\\end{document}',
      },
      {
          'testcase_name': 'commands_not_removed',
          'text_in': '\\newcommand\\figref[1]{Figure~\\ref{fig:\\#1}}',
          'true_output': '\\newcommand\\figref[1]{Figure~\\ref{fig:\\#1}}',
      },
      {
          'testcase_name': 'iffalse_else_sustained',
          'text_in': '\\iffalse not there\\else here\\fi',
          'true_output': 'here',
      },
      {
          'testcase_name': 'iftrue_else_removed',
          'text_in': '\\iftrue expected\\else not expected\\fi',
          'true_output': 'expected',
      },
      {
          'testcase_name': 'if0_removed',
          'text_in': '\\if0 to be removed\\fi',
          'true_output': '',
      },
      {
          'testcase_name': 'if1 works',
          'text_in': '\\if 1 expected\\fi',
          'true_output': 'expected',
      },
      {
          'testcase_name': 'new_if_ignored',
          'text_in': '\\newif  \\ifvar \\ifvar\\iffalse test\\fi\\fi',
          'true_output': '\\newif  \\ifvar \\ifvar\\fi',
      },
      {
          'testcase_name': 'known exceptions (iff) ignored in \\iffalse',
          'text_in': '\\iffalse \\iff\\fi',
          'true_output': '',
      },
      {
          'testcase_name': 'known exceptions (iff) ignored in \\iftrue',
          'text_in': '\\iftrue\\iff\\else\\fi',
          'true_output': '\\iff',
      },
  )
  def test_simplify_conditional_blocks(self, text_in, true_output):
    self.assertEqual(
        arxiv_latex_cleaner._simplify_conditional_blocks(text_in), true_output
    )

  @parameterized.named_parameters(
      {
          'testcase_name': 'all_pass',
          'inputs': ['abc', 'bca'],
          'patterns': ['a'],
          'true_outputs': ['abc', 'bca'],
      },
      {
          'testcase_name': 'not_all_pass',
          'inputs': ['abc', 'bca'],
          'patterns': ['a$'],
          'true_outputs': ['bca'],
      },
  )
  def test_keep_pattern(self, inputs, patterns, true_outputs):
    self.assertEqual(
        list(arxiv_latex_cleaner._keep_pattern(inputs, patterns)), true_outputs
    )

  @parameterized.named_parameters(
      {
          'testcase_name': 'all_pass',
          'inputs': ['abc', 'bca'],
          'patterns': ['a'],
          'true_outputs': [],
      },
      {
          'testcase_name': 'not_all_pass',
          'inputs': ['abc', 'bca'],
          'patterns': ['a$'],
          'true_outputs': ['abc'],
      },
  )
  def test_remove_pattern(self, inputs, patterns, true_outputs):
    self.assertEqual(
        list(arxiv_latex_cleaner._remove_pattern(inputs, patterns)),
        true_outputs,
    )

  @parameterized.named_parameters(
      {
          'testcase_name': 'replace_contents',
          'content': make_contents(),
          'patterns_and_insertions': make_patterns(),
          'true_outputs': (
              r'& \parbox[c]{\ww\linewidth}{\includegraphics[width=1.0\linewidth]{figures/image1.jpg}}'
              '\n'
              r'& \parbox[c]{\ww\linewidth}{\includegraphics[width=1.0\linewidth]{figures/image2.jpg}}'
          ),
      },
  )
  def test_find_and_replace_patterns(
      self, content, patterns_and_insertions, true_outputs
  ):
    output = arxiv_latex_cleaner._find_and_replace_patterns(
        content, patterns_and_insertions
    )
    output = arxiv_latex_cleaner.strip_whitespace(output)
    true_outputs = arxiv_latex_cleaner.strip_whitespace(true_outputs)
    self.assertEqual(output, true_outputs)

  @parameterized.named_parameters(
      {
          'testcase_name': 'no_tikz',
          'text_in': 'Foo\n',
          'figures_in': ['ext_tikz/test1.pdf', 'ext_tikz/test2.pdf'],
          'true_output': 'Foo\n',
      },
      {
          'testcase_name': 'tikz_no_match',
          'text_in': (
              'Foo\\tikzsetnextfilename{test_no_match}\n\\begin{tikzpicture}\n\\node'
              ' (test) at (0,0) {Test1};\n\\end{tikzpicture}\nFoo'
          ),
          'figures_in': ['ext_tikz/test1.pdf', 'ext_tikz/test2.pdf'],
          'true_output': (
              'Foo\\tikzsetnextfilename{test_no_match}\n\\begin{tikzpicture}\n\\node'
              ' (test) at (0,0) {Test1};\n\\end{tikzpicture}\nFoo'
          ),
      },
      {
          'testcase_name': 'tikz_match',
          'text_in': (
              'Foo\\tikzsetnextfilename{test2}\n\\begin{tikzpicture}\n\\node'
              ' (test) at (0,0) {Test1};\n\\end{tikzpicture}\nFoo'
          ),
          'figures_in': ['ext_tikz/test1.pdf', 'ext_tikz/test2.pdf'],
          'true_output': 'Foo\\includegraphics{ext_tikz/test2.pdf}\nFoo',
      },
  )
  def test_replace_tikzpictures(self, text_in, figures_in, true_output):
    self.assertEqual(
        arxiv_latex_cleaner._replace_tikzpictures(text_in, figures_in),
        true_output,
    )

  @parameterized.named_parameters(
      {
          'testcase_name': 'no_includesvg',
          'text_in': 'Foo\n',
          'figures_in': [
              'ext_svg/test1-tex.pdf_tex',
              'ext_svg/test2-tex.pdf_tex',
          ],
          'true_output': 'Foo\n',
      },
      {
          'testcase_name': 'includesvg_no_match',
          'text_in': 'Foo\\includesvg{test_no_match}\nFoo',
          'figures_in': [
              'ext_svg/test1-tex.pdf_tex',
              'ext_svg/test2-tex.pdf_tex',
          ],
          'true_output': 'Foo\\includesvg{test_no_match}\nFoo',
      },
      {
          'testcase_name': 'includesvg_match',
          'text_in': 'Foo\\includesvg{test2}\nFoo',
          'figures_in': [
              'ext_svg/test1-tex.pdf_tex',
              'ext_svg/test2-tex.pdf_tex',
          ],
          'true_output': 'Foo\\includeinkscape{ext_svg/test2-tex.pdf_tex}\nFoo',
      },
      {
          'testcase_name': 'includesvg_match_with_options',
          'text_in': 'Foo\\includesvg[width=\\linewidth,scale=0.40]{figs/persdf/test2}\nFoo',
          'figures_in': [
              'ext_svg/test1-tex.pdf_tex',
              'ext_svg/test2-tex.pdf_tex',
          ],
          'true_output': 'Foo\\includeinkscape[width=\\linewidth,scale=0.40]{ext_svg/test2-tex.pdf_tex}\nFoo',
      },
      {
          'testcase_name': 'includesvg_match_with_options_with_suffix',
          'text_in': 'Foo\\includesvg[width=\\linewidth]{figs/test2.svg}\nFoo',
          'figures_in': [
              'ext_svg/test1-tex.pdf_tex',
              'ext_svg/test2_svg-tex.pdf_tex',
          ],
          'true_output': 'Foo\\includeinkscape[width=\\linewidth]{ext_svg/test2_svg-tex.pdf_tex}\nFoo',
      },
      {
          'testcase_name': 'includesvg_match_with_options_with_dot_with_suffix',
          'text_in': (
              'Foo\\includesvg[width=\\linewidth]{figs/test2-0.9.svg}\nFoo'
          ),
          'figures_in': [
              'ext_svg/test1-tex.pdf_tex',
              'ext_svg/test2-0.9_svg-tex.pdf_tex',
          ],
          'true_output': 'Foo\\includeinkscape[width=\\linewidth]{ext_svg/test2-0.9_svg-tex.pdf_tex}\nFoo',
      },
  )
  def test_replace_includesvg(self, text_in, figures_in, true_output):
    self.assertEqual(
        arxiv_latex_cleaner._replace_includesvg(text_in, figures_in),
        true_output,
    )

  @parameterized.named_parameters(*make_search_reference_tests())
  def test_search_reference_weak(
      self, filenames, contents, strict, true_outputs
  ):
    cleaner_outputs = []
    for filename in filenames:
      reference = arxiv_latex_cleaner._search_reference(
          filename, contents, strict
      )
      if reference is not None:
        cleaner_outputs.append(filename)

    # weak check (passes as long as cleaner includes a superset of the true_output)
    for true_output in true_outputs:
      self.assertIn(true_output, cleaner_outputs)

  @parameterized.named_parameters(*make_search_reference_tests())
  def test_search_reference_strong(
      self, filenames, contents, strict, true_outputs
  ):
    cleaner_outputs = []
    for filename in filenames:
      reference = arxiv_latex_cleaner._search_reference(
          filename, contents, strict
      )
      if reference is not None:
        cleaner_outputs.append(filename)

    # strong check (set of files must match exactly)
    weak_check_result = set(true_outputs).issubset(cleaner_outputs)
    if weak_check_result:
      msg = 'not fatal, cleaner included more files than necessary'
    else:
      msg = 'fatal, see test_search_reference_weak'
    self.assertEqual(cleaner_outputs, true_outputs, msg)

  @parameterized.named_parameters(
      {
          'testcase_name': 'three_parent',
          'filename': 'long/path/to/img.ext',
          'content_strs': [
              # match
              '{img.ext}',
              '{to/img.ext}',
              '{path/to/img.ext}',
              '{long/path/to/img.ext}',
              '{%\nimg.ext  }',
              '{to/img.ext % \n}',
              '{  \npath/to/img.ext\n}',
              '{ \n \nlong/path/to/img.ext\n}',
              '{img}',
              '{to/img}',
              '{path/to/img}',
              '{long/path/to/img}',
              # don't match
              '{from/img.ext}',
              '{from/img}',
              '{imgoext}',
              '{from/imgo}',
              '{ \n long/\npath/to/img.ext\n}',
              '{path/img.ext}',
              '{long/img.ext}',
              '{long/path/img.ext}',
              '{long/to/img.ext}',
              '{path/img}',
              '{long/img}',
              '{long/path/img}',
              '{long/to/img}',
          ],
          'strict': False,
          'true_outputs': [True] * 12 + [False] * 13,
      },
      {
          'testcase_name': 'two_parent',
          'filename': 'path/to/img.ext',
          'content_strs': [
              # match
              '{img.ext}',
              '{to/img.ext}',
              '{path/to/img.ext}',
              '{%\nimg.ext  }',
              '{to/img.ext % \n}',
              '{  \npath/to/img.ext\n}',
              '{img}',
              '{to/img}',
              '{path/to/img}',
              # don't match
              '{long/path/to/img.ext}',
              '{ \n \nlong/path/to/img.ext\n}',
              '{long/path/to/img}',
              '{from/img.ext}',
              '{from/img}',
              '{imgoext}',
              '{from/imgo}',
              '{ \n long/\npath/to/img.ext\n}',
              '{path/img.ext}',
              '{long/img.ext}',
              '{long/path/img.ext}',
              '{long/to/img.ext}',
              '{path/img}',
              '{long/img}',
              '{long/path/img}',
              '{long/to/img}',
          ],
          'strict': False,
          'true_outputs': [True] * 9 + [False] * 16,
      },
      {
          'testcase_name': 'one_parent',
          'filename': 'to/img.ext',
          'content_strs': [
              # match
              '{img.ext}',
              '{to/img.ext}',
              '{%\nimg.ext  }',
              '{to/img.ext % \n}',
              '{img}',
              '{to/img}',
              # don't match
              '{long/path/to/img}',
              '{path/to/img}',
              '{ \n \nlong/path/to/img.ext\n}',
              '{  \npath/to/img.ext\n}',
              '{long/path/to/img.ext}',
              '{path/to/img.ext}',
              '{from/img.ext}',
              '{from/img}',
              '{imgoext}',
              '{from/imgo}',
              '{ \n long/\npath/to/img.ext\n}',
              '{path/img.ext}',
              '{long/img.ext}',
              '{long/path/img.ext}',
              '{long/to/img.ext}',
              '{path/img}',
              '{long/img}',
              '{long/path/img}',
              '{long/to/img}',
          ],
          'strict': False,
          'true_outputs': [True] * 6 + [False] * 19,
      },
      {
          'testcase_name': 'two_parent_strict',
          'filename': 'path/to/img.ext',
          'content_strs': [
              # match
              '{path/to/img.ext}',
              '{  \npath/to/img.ext\n}',
              # don't match
              '{img.ext}',
              '{to/img.ext}',
              '{%\nimg.ext  }',
              '{to/img.ext % \n}',
              '{img}',
              '{to/img}',
              '{path/to/img}',
              '{long/path/to/img.ext}',
              '{ \n \nlong/path/to/img.ext\n}',
              '{long/path/to/img}',
              '{from/img.ext}',
              '{from/img}',
              '{imgoext}',
              '{from/imgo}',
              '{ \n long/\npath/to/img.ext\n}',
              '{path/img.ext}',
              '{long/img.ext}',
              '{long/path/img.ext}',
              '{long/to/img.ext}',
              '{path/img}',
              '{long/img}',
              '{long/path/img}',
              '{long/to/img}',
          ],
          'strict': True,
          'true_outputs': [True] * 2 + [False] * 23,
      },
  )
  def test_search_reference_filewise(
      self, filename, content_strs, strict, true_outputs
  ):
    if len(content_strs) != len(true_outputs):
      raise ValueError(
          "number of true_outputs doesn't match number of content strs"
      )
    for content, true_output in zip(content_strs, true_outputs):
      reference = arxiv_latex_cleaner._search_reference(
          filename, content, strict
      )
      matched = reference is not None
      msg_not = ' ' if true_output else ' not '
      msg_fmt = 'file {} should' + msg_not + 'have matched latex reference {}'
      msg = msg_fmt.format(filename, content)
      self.assertEqual(matched, true_output, msg)


class IntegrationTests(parameterized.TestCase):

  def setUp(self):
    super(IntegrationTests, self).setUp()
    self.out_path = 'test_data/tex_arXiv'

  def _compare_files(self, filename, filename_true):
    if path.splitext(filename)[1].lower() in ['.jpg', '.jpeg', '.png']:
      with Image.open(filename) as im, Image.open(filename_true) as im_true:
        # We check only the sizes of the images, checking pixels would be too
        # complicated in case the resize implementations change.
        self.assertEqual(
            im.size,
            im_true.size,
            'Image {:s} was not resized properly.'.format(filename),
        )
    else:
      # Checks that text files are equal, ignoring end-of-line characters.
      with open(filename, 'rb') as f:
        processed_content = f.read().splitlines()
      with open(filename_true, 'rb') as f:
        groundtruth_content = f.read().splitlines()

      self.assertEqual(
          processed_content,
          groundtruth_content,
          '{:s} and {:s} are not equal.'.format(filename, filename_true),
      )

  @parameterized.named_parameters(
      {'testcase_name': 'from_dir', 'input_dir': 'test_data/tex'},
      {'testcase_name': 'from_zip', 'input_dir': 'test_data/tex.zip'},
  )
  def test_complete(self, input_dir):
    out_path_true = 'test_data/tex_arXiv_true'

    # Make sure the folder does not exist, since we erase it in the test.
    if path.isdir(self.out_path):
      raise RuntimeError(
          'The folder {:s} should not exist.'.format(self.out_path)
      )

    arxiv_latex_cleaner.run_arxiv_cleaner({
        'input_folder': input_dir,
        'images_allowlist': {
            'images/im2_included.jpg': 200,
            'images/im3_included.png': 400,
        },
        'resize_images': True,
        'im_size': 100,
        'compress_pdf': False,
        'pdf_im_resolution': 500,
        'commands_to_delete': ['mytodo'],
        'commands_only_to_delete': ['red'],
        'if_exceptions': ['iffalt'],
        'environments_to_delete': ['mynote'],
        'use_external_tikz': 'ext_tikz',
        'keep_bib': False,
    })

    # Checks the set of files is the same as in the true folder.
    out_files = set(arxiv_latex_cleaner._list_all_files(self.out_path))
    out_files_true = set(arxiv_latex_cleaner._list_all_files(out_path_true))
    self.assertSetEqual(out_files, out_files_true)

    # Compares the contents of each file against the true value.
    for f1 in out_files:
      self._compare_files(
          path.join(self.out_path, f1), path.join(out_path_true, f1)
      )

  @parameterized.named_parameters(
      {'testcase_name': 'from_dir', 'input_dir': 'test_data/tex'},
      {'testcase_name': 'from_zip', 'input_dir': 'test_data/tex.zip'},
  )
  def test_png2jpg(self, input_dir):
    out_path_true = 'test_data/tex_arXiv_png2jpg_true'

    # Make sure the folder does not exist, since we erase it in the test.
    if path.isdir(self.out_path):
      raise RuntimeError(
          'The folder {:s} should not exist.'.format(self.out_path)
      )

    arxiv_latex_cleaner.run_arxiv_cleaner({
        'input_folder': input_dir,
        'images_allowlist': {
            # 'images/im2_included.jpg': 200,
            # 'images/im3_included.png': 400,
        },
        'resize_images': False,
        'im_size': 100,
        'compress_pdf': False,
        'pdf_im_resolution': 500,
        'commands_to_delete': ['mytodo'],
        'commands_only_to_delete': ['red'],
        'if_exceptions': ['iffalt'],
        'environments_to_delete': ['mynote'],
        'use_external_tikz': 'ext_tikz',
        'keep_bib': False,
        'convert_png_to_jpg': True,
        'png_quality': 50,
        'png_size_threshold': 0.5,
    })

    # Checks the set of files is the same as in the true folder.
    out_files = set(arxiv_latex_cleaner._list_all_files(self.out_path))
    out_files_true = set(arxiv_latex_cleaner._list_all_files(out_path_true))
    self.assertSetEqual(out_files, out_files_true)

    # Compares the contents of each file against the true value.
    for f1 in out_files:
      if path.splitext(f1)[1].lower() in ['.jpg', '.jpeg', '.png']:
        # Check that every png file has been renamed to jpg.
        self.assertNotEqual(
            path.splitext(f1)[1].lower(),
            '.png',
            f'{f1} is not renamed to jpg',
        )

      else:
        self._compare_files(
            path.join(self.out_path, f1), path.join(out_path_true, f1)
        )

  def tearDown(self):
    shutil.rmtree(self.out_path)
    super(IntegrationTests, self).tearDown()

if __name__ == '__main__':
  unittest.main()
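
A note on the text comparison in `_compare_files` above: reading both files in binary mode and calling `splitlines()` means that files differing only in their line endings compare equal. A minimal standalone sketch of that idea follows; the helper name `same_text_ignoring_eol` is hypothetical and is not part of the test suite.

```python
def same_text_ignoring_eol(data_a: bytes, data_b: bytes) -> bool:
    """Hypothetical helper: compare two byte strings line by line.

    bytes.splitlines() splits on \n, \r and \r\n and drops the separators,
    so the comparison ignores which end-of-line convention each side uses.
    """
    return data_a.splitlines() == data_b.splitlines()


unix_text = b'Text\n\\end{document}\n'
windows_text = b'Text\r\n\\end{document}\r\n'

print(same_text_ignoring_eol(unix_text, windows_text))       # True
print(same_text_ignoring_eol(unix_text, b'Other text\n'))    # False
```

A missing trailing newline is ignored as well, since `splitlines()` does not produce an empty final element for it.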


================================================
FILE: cleaner_config.yaml
================================================
patterns_and_insertions:
    [
        # Use single ticks for regex patterns
        # http://blogs.perl.org/users/tinita/2018/03/strings-in-yaml---to-quote-or-not-to-quote.html
        # You need to escape \ with \\ in the pattern, for instance for \\todo
        # Use Python named groups https://docs.python.org/3/library/re.html#regular-expression-examples
        # Escape {{ and }} in the insertion expression
        # 
        # Optional:
        # Set strip_whitespace to n to disable whitespace stripping while replacing the pattern. (Default: y)

        {
            "pattern" : '(?:\\figcomp{\s*)(?P<first>.*?)\s*}\s*{\s*(?P<second>.*?)\s*}\s*{\s*(?P<third>.*?)\s*}',
            "insertion" : '\parbox[c]{{ {second} \linewidth}} {{ \includegraphics[width= {third} \linewidth]{{figures/{first} }} }}',
            "description" : "Replace figcomp",
            # "strip_whitespace": n 
        },
    ]
verbose: False
commands_to_delete: [
    '\\todo',
]
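
As an illustration of how a `pattern`/`insertion` pair like the one above could be applied, here is a minimal sketch using the standard-library `re` module and `str.format`. The function name `apply_pattern` is hypothetical, and the whitespace handling (removing all whitespace from the formatted insertion when `strip_whitespace` is enabled) is an assumption, not a statement about the cleaner's actual implementation.

```python
import re

# Pattern and insertion copied from the config above. YAML single quotes keep
# backslashes literal, so the values map directly onto Python raw strings.
PATTERN = r'(?:\\figcomp{\s*)(?P<first>.*?)\s*}\s*{\s*(?P<second>.*?)\s*}\s*{\s*(?P<third>.*?)\s*}'
INSERTION = r'\parbox[c]{{ {second} \linewidth}} {{ \includegraphics[width= {third} \linewidth]{{figures/{first} }} }}'


def apply_pattern(text, pattern, insertion, strip_whitespace=True):
    """Hypothetical helper: replace every match of `pattern` in `text`.

    The named groups of each match are passed to str.format, which is why
    literal braces in the insertion are written as {{ and }}.
    """

    def build_replacement(match):
        replacement = insertion.format(**match.groupdict())
        if strip_whitespace:
            # Assumption: stripping removes all whitespace in the insertion.
            replacement = re.sub(r'\s+', '', replacement)
        return replacement

    # A function replacement is inserted literally, so the backslashes in the
    # LaTeX output are not interpreted as regex escapes.
    return re.sub(pattern, build_replacement, text)


result = apply_pattern(r'See \figcomp{im.png}{0.5}{0.9} here', PATTERN, INSERTION)
print(result)
# See \parbox[c]{0.5\linewidth}{\includegraphics[width=0.9\linewidth]{figures/im.png}} here
```

Passing `strip_whitespace=False` keeps the spaces written in the insertion template, which mirrors the optional `"strip_whitespace": n` setting in the config.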


================================================
FILE: requirements.txt
================================================
absl_py>=0.12
pillow
pyyaml
regex


================================================
FILE: setup.py
================================================
#! /usr/bin/env python
#
# coding=utf-8
# Copyright 2018 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from setuptools import setup
from setuptools import find_packages

from arxiv_latex_cleaner._version import __version__

with open("README.md", "r") as fh:
    long_description = fh.read()

install_requires = []
with open("requirements.txt") as f:
    for line in f:
        requirement = line.strip()
        if requirement and not requirement.startswith('#'):
            install_requires.append(requirement)

setup(
    name="arxiv_latex_cleaner",
    version=__version__,
    packages=find_packages(exclude=["*.tests"]),
    python_requires='>=3',
    url="https://github.com/google-research/arxiv-latex-cleaner",
    license="Apache License, Version 2.0",
    author="Google Research Authors",
    author_email="jponttuset@gmail.com",
    description="Cleans the LaTeX code of your paper to submit to arXiv.",
    long_description=long_description,
    long_description_content_type="text/markdown",
    entry_points={
        "console_scripts": ["arxiv_latex_cleaner=arxiv_latex_cleaner.__main__:__main__"]
    },
    install_requires=install_requires,
    classifiers=[
        "License :: OSI Approved :: Apache Software License",
        "Intended Audience :: Science/Research",
    ],
)


================================================
FILE: test_data/tex/figures/data_included.txt
================================================


================================================
FILE: test_data/tex/figures/data_not_included.txt
================================================


================================================
FILE: test_data/tex/figures/figure_included.tex
================================================
\includegraphics{images/im2_included.jpg}
\addplot{figures/data_included.txt}


================================================
FILE: test_data/tex/figures/figure_included.tikz
================================================
﻿\tikzsetnextfilename{test2}
\begin{tikzpicture}
\node {root}
child {node {left}}
child {node {right}
child {node {child}}
child {node {child}}
};
\end{tikzpicture}

================================================
FILE: test_data/tex/figures/figure_not_included.tex
================================================
\addplot{figures/data_not_included.txt}
\input{figures/figure_not_included_2.tex}


================================================
FILE: test_data/tex/figures/figure_not_included_2.tex
================================================


================================================
FILE: test_data/tex/main.aux
================================================


================================================
FILE: test_data/tex/main.bbl
================================================
BBL content, should be intact.


================================================
FILE: test_data/tex/main.bib
================================================


================================================
FILE: test_data/tex/main.tex
================================================
\begin{document}
Text
% Whole line comment

Text% Inline comment
\begin{comment}
This is an environment comment.
\end{comment}

This is a percent \%.
% Whole line comment without newline
\includegraphics{images/im1_included.png}
%\includegraphics{images/im_not_included}
\includegraphics{images/im3_included.png}
\includegraphics{%
  images/im4_included.png%
  }
\includegraphics[width=.5\linewidth]{%
  images/im5_included.jpg}
%\includegraphics{%
%  images/im4_not_included.png
%  }
%\includegraphics[width=.5\linewidth]{%
%  images/im5_not_included.jpg}

% Test whether a path starting with a dot works with includegraphics
\includegraphics{./images/im3_included.png}

This line should\mytodo{Do this later} not be separated
\mytodo{This is a todo command with a nested \textit{command}.
Please remember that up to \texttt{2 levels} of \textit{nesting} are supported.}
from this one.

\begin{mynote}
  This is a custom environment that could be excluded.
\end{mynote}

\newif\ifvar
\newif  \ifvarII

\ifvarII asdf \fi

\ifvar
\if    false
\if false
\if 0
\iffalse
\ifvar
Text
\fi
\fi
\fi
\fi
\fi
\fi

\iffalse I shall be gone (iffalse block)!\else Expect me (else block of iffalse)!\fi

\iftrue Expect me (iftrue block)!\else I shall be gone (else block of iftrue)!\fi

\iffalse
\iffalt
\fi

\newcommand{\red}[1]{{\color{red} #1}}
hello test \red{hello
test \red{hello}}
test

% content after this line should not be cleaned if \end{document} is in a comment

\input{figures/figure_included.tex}
% \input{figures/figure_not_included.tex}

% Test for tikzpicture feature
% should be replaced
\tikzsetnextfilename{test1}
\begin{tikzpicture}
    \node (test) at (0,0) {Test1};
\end{tikzpicture}

% should be replaced in included file
\input{figures/figure_included.tikz}

% should not be replaced - no preceding tikzsetnextfilename command
\begin{tikzpicture}
    \node (test) at (0,0) {Test3};
\end{tikzpicture}

\tikzsetnextfilename{test_no_match}
\begin{tikzpicture}
    \node (test) at (0,0) {Test4};
\end{tikzpicture}

\end{document}

This should be ignored.


================================================
FILE: test_data/tex/not_included/figures/data_included.txt
================================================


================================================
FILE: test_data/tex_arXiv_png2jpg_true/figures/data_included.txt
================================================


================================================
FILE: test_data/tex_arXiv_png2jpg_true/figures/figure_included.tex
================================================
\includegraphics{images/im2_included.jpg}
\addplot{figures/data_included.txt}


================================================
FILE: test_data/tex_arXiv_png2jpg_true/figures/figure_included.tikz
================================================
﻿\includegraphics{ext_tikz/test2.pdf}

================================================
FILE: test_data/tex_arXiv_png2jpg_true/main.bbl
================================================
BBL content, should be intact.


================================================
FILE: test_data/tex_arXiv_png2jpg_true/main.tex
================================================
\begin{document}
Text

Text%


This is a percent \%.
\includegraphics{images/im1_included.jpg}
\includegraphics{images/im3_included.jpg}
\includegraphics{%
  images/im4_included.jpg%
  }
\includegraphics[width=.5\linewidth]{%
  images/im5_included.jpg}

\includegraphics{./images/im3_included.jpg}

This line should not be separated
%
from this one.


\newif\ifvar
\newif  \ifvarII

\ifvarII asdf \fi

\ifvar
\fi

Expect me (else block of iffalse)!
Expect me (iftrue block)!

\newcommand{\red}[1]{{\color{red} #1}}
hello test hello
test hello
test


\input{figures/figure_included.tex}

\includegraphics{ext_tikz/test1.pdf}

\input{figures/figure_included.tikz}

\begin{tikzpicture}
    \node (test) at (0,0) {Test3};
\end{tikzpicture}

\tikzsetnextfilename{test_no_match}
\begin{tikzpicture}
    \node (test) at (0,0) {Test4};
\end{tikzpicture}

\end{document}


================================================
FILE: test_data/tex_arXiv_true/figures/data_included.txt
================================================


================================================
FILE: test_data/tex_arXiv_true/figures/figure_included.tex
================================================
\includegraphics{images/im2_included.jpg}
\addplot{figures/data_included.txt}


================================================
FILE: test_data/tex_arXiv_true/figures/figure_included.tikz
================================================
﻿\includegraphics{ext_tikz/test2.pdf}

================================================
FILE: test_data/tex_arXiv_true/main.bbl
================================================
BBL content, should be intact.


================================================
FILE: test_data/tex_arXiv_true/main.tex
================================================
\begin{document}
Text

Text%


This is a percent \%.
\includegraphics{images/im1_included.png}
\includegraphics{images/im3_included.png}
\includegraphics{%
  images/im4_included.png%
  }
\includegraphics[width=.5\linewidth]{%
  images/im5_included.jpg}

\includegraphics{./images/im3_included.png}

This line should not be separated
%
from this one.


\newif\ifvar
\newif  \ifvarII

\ifvarII asdf \fi

\ifvar
\fi

Expect me (else block of iffalse)!
Expect me (iftrue block)!

\newcommand{\red}[1]{{\color{red} #1}}
hello test hello
test hello
test


\input{figures/figure_included.tex}

\includegraphics{ext_tikz/test1.pdf}

\input{figures/figure_included.tikz}

\begin{tikzpicture}
    \node (test) at (0,0) {Test3};
\end{tikzpicture}

\tikzsetnextfilename{test_no_match}
\begin{tikzpicture}
    \node (test) at (0,0) {Test4};
\end{tikzpicture}

\end{document}