[
  {
    "path": ".gitignore",
    "content": "# ---> Python\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\n# mylib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.nox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*.cover\n*.py,cover\n.hypothesis/\n.pytest_cache/\ncover/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\nlocal_settings.py\ndb.sqlite3\ndb.sqlite3-journal\n\n# Flask stuff:\ninstance/\n.webassets-cache\n\n# Scrapy stuff:\n.scrapy\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\n.pybuilder/\ntarget/\n\n# Jupyter Notebook\n.ipynb_checkpoints\n\n# IPython\nprofile_default/\nipython_config.py\n\n# pyenv\n#   For a library or package, you might want to ignore these files since the code is\n#   intended to run in multiple environments; otherwise, check them in:\n# .python-version\n\n# pipenv\n#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.\n#   However, in case of collaboration, if having platform-specific dependencies or dependencies\n#   having no cross-platform support, pipenv may install dependencies that don't work, or not\n#   install all needed dependencies.\n#Pipfile.lock\n\n# poetry\n#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.\n#   This is especially recommended for binary packages to ensure reproducibility, and is more\n#   commonly ignored for libraries.\n#   
https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control\n#poetry.lock\n\n# pdm\n#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.\n#pdm.lock\n#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it\n#   in version control.\n#   https://pdm.fming.dev/#use-with-ide\n.pdm.toml\n\n# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm\n__pypackages__/\n\n# Celery stuff\ncelerybeat-schedule\ncelerybeat.pid\n\n# SageMath parsed files\n*.sage.py\n\n# Environments\n.env\n.venv\nenv/\nvenv/\nENV/\nenv.bak/\nvenv.bak/\n\n# Spyder project settings\n.spyderproject\n.spyproject\n\n# Rope project settings\n.ropeproject\n\n# mkdocs documentation\n/site\n\n# mypy\n.mypy_cache/\n.dmypy.json\ndmypy.json\n\n# Pyre type checker\n.pyre/\n\n# pytype static type analyzer\n.pytype/\n\n# Cython debug symbols\ncython_debug/\n\n# PyCharm\n#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can\n#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore\n#  and can be added to the global gitignore or merged into this file.  For a more nuclear\n#  option (not recommended) you can uncomment the following to ignore the entire idea folder.\n.idea/\n\ncsv/\ndata/\nlog/\ntemp_zip\nurls/\n*.txt"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2020 silenceagle\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# paper_downloader\n\nDownload papers and supplemental materials only from **OPEN ACCESS** paper\nwebsite, such as **AAAI**, **AAMAS**, **AISTATS**, **COLT**, **CORL**, **CVPR**, **ECCV**,\n**ICCV**, **ICLR**, **ICML**, **IJCAI**, **JMLR**, **NIPS**,\n**RSS**, **WACV**.\n\n---\n\nThe number of papers that could be downloaded using this repo (also provide **Aliyundrive** or **123Pan** share link and `access code`):\n\n<sub>\n<sup>\n\n|  year\\conf   | [AAAI](https://aaai.org/aaai-publications/aaai-conference-proceedings/#aaai) | [AAMAS](https://www.ifaamas.org/Proceedings/aamas2024/) |                                  [ACCV](https://openaccess.thecvf.com/menu)                                  |          [AISTATS](https://www.aistats.org/)           |           [COLT](http://learningtheory.org/)           | [CORL](https://www.corl.org/) |                                   [CVPR](http://openaccess.thecvf.com/menu)                                    |         [ECCV](https://www.ecva.net/papers.php)         |                                   [ICCV](http://openaccess.thecvf.com/menu)                                    |                    [ICLR](https://iclr.cc/)                    |                [ICML](https://icml.cc/)                 |            [IJCAI](https://www.ijcai.org/)             | [JMLR](http://www.jmlr.org/) |                [NIPS ](https://nips.cc/)                | [RSS](https://www.roboticsproceedings.org/index.html) |                                  [WACV](https://openaccess.thecvf.com/menu)                                  
|\n|:------------:|:----------------------------------------------------------------------------:|:-------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------:|:------------------------------------------------------:|:-----------------------------:|:--------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------:|:-------------------------------------------------------:|:------------------------------------------------------:|:----------------------------:|:-------------------------------------------------------:|:-----------------------------------------------------:|:------------------------------------------------------------------------------------------------------------:|\n|   **1969**   |                                      --                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                           64                   
        |              --              |                           --                            |                          --                           |                                                      --                                                      |\n|   **1971**   |                                      --                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                           66                           |              --              |                           --                            |                          --                           |                                                      --                                                      |\n|   **1973**   |                                      --                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                    
                                   --                                                       |                               --                               |                           --                            |                           85                           |              --              |                           --                            |                          --                           |                                                      --                                                      |\n|   **1975**   |                                      --                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                          146                           |              --              |                           --                            |                          --                           |                                                      --                                                      |\n|   **1977**   |                                      --                                      |                           --                            |                                                      --                                                      |                           --                           |                
           --                           |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                          251                           |              --              |                           --                            |                          --                           |                                                      --                                                      |\n|   **1979**   |                                      --                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                           12                           |              --              |                           --                            |                          --                           |                                                      --                                                      |\n|   **1980**   |            
[95](https://www.aliyundrive.com/s/ucngMrKSTmi)`96eg`             |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                           --                           |              --              |                           --                            |                          --                           |                                                      --                                                      |\n|   **1981**   |                                      --                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                          108                           |              --      
        |                           --                            |                          --                           |                                                      --                                                      |\n|   **1982**   |                                     104                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                           --                           |              --              |                           --                            |                          --                           |                                                      --                                                      |\n|   **1983**   |            [92](https://www.aliyundrive.com/s/L3GfxhEqyWg)`09jo`             |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                                                   
    --                                                       |                               --                               |                           --                            |                          237                           |              --              |                           --                            |                          --                           |                                                      --                                                      |\n|   **1984**   |                                      69                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                           --                           |              --              |                           --                            |                          --                           |                                                      --                                                      |\n|   **1985**   |                                      --                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                  
         |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                          259                           |              --              |                           --                            |                          --                           |                                                      --                                                      |\n|   **1986**   |                                     194                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                           --                           |              --              |                           --                            |                          --                           |                                                      --                                                      |\n|   **1987**   |                                     149                                      |         
                  --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                          246                           |              --              |                           90                            |                          --                           |                                                      --                                                      |\n|   **1988**   |                                     159                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                           --                           |              --              |                           94                            |         
                 --                           |                                                      --                                                      |\n|   **1989**   |                                      --                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                          269                           |              --              |                           101                           |                          --                           |                                                      --                                                      |\n|   **1990**   |                                     173                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           49                            |                                                       --                                                       |              
                 --                               |                           --                            |                           --                           |              --              |                           143                           |                          --                           |                                                      --                                                      |\n|   **1991**   |                                     144                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                          192                           |              --              |                           144                           |                          --                           |                                                      --                                                      |\n|   **1992**   |                                     134                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                  
                     --                                                       |                           49                            |                                                       --                                                       |                               --                               |                           --                            |                           --                           |              --              |                           127                           |                          --                           |                                                      --                                                      |\n|   **1993**   |                                     135                                      |                           --                            |                                                      --                                                      |                           --                           |                           --                           |              --               |                                                       --                                                       |                           --                            |                                                       --                                                       |                               --                               |                           --                            |                          138                           |              --              |                           158                           |                          --                           |                                                      --                                                      |\n|   **1994**   |                                     302                                      |                           --                            |                           
-- | -- | -- | -- | -- | 98 | -- | -- | -- | -- | -- | 140 | -- | -- |
| **1995** | -- | -- | -- | 64 | -- | -- | -- | -- | -- | -- | -- | 282 | -- | 152 | -- | -- |
| **1996** | 275 | -- | -- | -- | -- | -- | -- | 98 | -- | -- | -- | -- | -- | 152 | -- | -- |
| **1997** | 186 | -- | -- | 57 | -- | -- | -- | -- | -- | -- | -- | 180 | -- | 150 | -- | -- |
| **1998** | 187 | -- | -- | -- | -- | -- | -- | 98 | -- | -- | -- | -- | -- | 151 | -- | -- |
| **1999** | 182 | -- | -- | 17 | -- | -- | -- | -- | -- | -- | -- | 204 | -- | 150 | -- | -- |
| **2000/v1** | 221 | -- | -- | -- | -- | -- | -- | 98 | -- | -- | -- | -- | 11 | 152 | -- | -- |
| **2001/v2** | -- | -- | -- | 46 | -- | -- | -- | -- | -- | -- | -- | 17 | 31 | 197 | -- | -- |
| **2002/v3** | 187 | / | -- | -- | -- | -- | -- | 196 | -- | -- | -- | -- | 59 | 207 | -- | -- |
| **2003/v4** | -- | / | -- | 44 | -- | -- | -- | -- | -- | -- | 121 | 297 | 59 | 198 | -- | -- |
| **2004/v5** | 177 | / | -- | -- | -- | -- | -- | 190 | -- | -- | 118 | -- | 56 | 207 | -- | -- |
| **2005/v6** | 328 | / | -- | 56 | -- | -- | -- | -- | -- | -- | 133 | 350 | 73 | 207 | 48 | -- |
| **2006/v7** | 393 | / | -- | -- | -- | -- | -- | 192+11 | -- | -- | -- | -- | 100 | 204 | 39 | -- |
| **2007/v8** | 375 | / | -- | 86 | -- | -- | -- | -- | -- | -- | 150 | 478 | 91 | 217 | 41 | -- |
| **2008/v9** | 355 | 254 | -- | -- | -- | -- | -- | 196 | -- | -- | 158 | -- | 97 | 250 | 40 | -- |
| **2009/v10** | -- | 130 | -- | 84 | -- | -- | -- | -- | -- | -- | 160 | 342 | 100 | 262 | 39 | -- |
| **2010/v11** | 300 | 163 | -- | 126 | -- | -- | -- | 286+63 | -- | -- | 159 | -- | 118 | 292 | 40 | -- |
| **2011/v12** | 302 | 125 | -- | 108 | 43 | -- | -- | -- | -- | -- | 153 | 490 | 105 | 306 | 45 | -- |
| **2012/v13** | 353 | 136 | -- | 160 | 46 | -- | -- | 329+147 | -- | -- | 243 | -- | 119 | 368 | 60 | -- |
| **2013/v14** | 251 | 321 | -- | 72 | 50 | -- | [471](https://www.aliyundrive.com/s/ZFvga9JZ5aY)`5p0q`+156 | -- | 455+142 | 14+9 | 283 | 496 | 84 | 360 | 55 | -- |
| **2014/v15** | 447 | 378 | -- | 124 | 61 | -- | 545+125 | 334+158 | -- | 35 | 310 | -- | 120 | 411 | 57 | -- |
| **2015/v16** | 455 | 363 | -- | 134 | 77 | -- | 602+133 | -- | 526+133 | 42 | 270 | 656 | 118 | 403 | 49 | -- |
| **2016/v17** | 676 | 280 | -- | 168 | 70 | -- | 643+194 | 372+132 | -- | 80 | 322 | 658 | 236 | 568 | 47 | -- |
| **2017/v18** | 765 | 318 | -- | 175 | 75 | 48 | 783+281 | -- | 621+353 | 198 | 434 | 781 | 234 | 679 | 75 | -- |
| **2018/v19** | 1102 | 390 | -- | 230 | 94 | 75 | 979+346 | 732+262 | -- | 336 | 466 | 870 | 84 | 1009 | 71 | -- |
| **2019/v20** | 1343 | 433 | -- | 403 | 127 | 110 | 1294+612 | -- | 1075+498 | 502 | 773 | 964 | 184 | 1428 | 84 | -- |
| **2020/v21** | [1864](https://www.aliyundrive.com/s/kbWKUpHGR3k)`5ls6` | 369 | [254](https://www.aliyundrive.com/s/Dt2ErKCmePQ)`dn93`+[13](https://www.aliyundrive.com/s/AhGvgotrMUv)`d9o6` | [796](https://www.aliyundrive.com/s/iQ4AWTHG4bk)`61yu` | [126](https://www.aliyundrive.com/s/apP8KUFLPe4)`3mv9` | 165 | [1467](https://www.aliyundrive.com/s/eJF4BTFzFJq)`y89b`+[517](https://www.aliyundrive.com/s/5wk7Mjo9XyU)`0fz9` | [1358](https://www.aliyundrive.com/s/EYyjxRmmg8d)`a5i0` | -- | [687](https://www.aliyundrive.com/s/cVRD5Bu2SgN)`4x1c` | [1084](https://www.aliyundrive.com/s/BHqtEbi6Dix)`5yw0` | [776](https://www.aliyundrive.com/s/vMZpsjCbWMV)`4xq3` | 254 | [1899](https://www.aliyundrive.com/s/GEMFqxKeHWu)`3g3d` | 103 | [378](https://www.aliyundrive.com/s/gfFKwcKrCP1)`l1m8`+[24](https://www.aliyundrive.com/s/2uCW6cq9WHk)`me08` |
| **2021/v22** | [1961](https://www.aliyundrive.com/s/cdeGciNZch8)`b69m` | 304 | -- | [845](https://www.aliyundrive.com/s/3hbAhxYFHER)`93ig` | [140](https://www.aliyundrive.com/s/gwhdNT1vGDD)`96ln` | 166 | 1660+[517](https://www.aliyundrive.com/s/ziBfXVKPXSY)`le14` | -- | [1612](https://www.aliyundrive.com/s/ME21PfkyAec)`99uu`+[465](https://www.aliyundrive.com/s/ZahPmXSn9an)`16es` | [860](https://www.aliyundrive.com/s/wGos6n5R93v)`ef43` | [1183](https://www.aliyundrive.com/s/SYTtH38GiVS)`g8b1` | [723](https://www.aliyundrive.com/s/io3sAjsN5pw)`40is` | 290 | [2334](https://www.aliyundrive.com/s/13sHmhuEdxA)`v6g1` | 92 | [406](https://www.aliyundrive.com/s/kTwfaX9tren)`1id9`+[23](https://www.aliyundrive.com/s/7Joy4svvUfy)`90rl` |
| **2022/v23** | [1624](https://www.aliyundrive.com/s/ePXvUw4VFdQ)`fp76` | 306 | [279](https://www.aliyundrive.com/s/zCCTJMPrfSr)`47jy`+[25](https://www.aliyundrive.com/s/f4kdMXixwJL)`s7a9` | [492](https://www.aliyundrive.com/s/xj2fRMwZxfC)`f16o` | 155 | 197 | [2077](https://www.aliyundrive.com/s/Q8DG9dKbx6S)`i16a`+[562](https://www.aliyundrive.com/s/f9Zx3hFFyq4)`11kj` | [1645](https://www.aliyundrive.com/s/dv4fhuueRHs)`6d7j` | -- | [54+176+865](https://www.aliyundrive.com/s/gfANcdbM9TC)`b1l3` | [1234](https://www.aliyundrive.com/s/eopQ5H8Hz2a)`81ov` | [862](https://www.aliyundrive.com/s/DBVKNsqN2UZ)`ea46` | 351 | [2673](https://www.aliyundrive.com/s/VFLmfnzSAsA)`eh49` | 74 | [406](https://www.aliyundrive.com/s/xRhdpencLQU)`ab53`+[80](https://www.aliyundrive.com/s/JCCcQXij7WX)`q6d2` |
| **2023/v24** | 2021 | 527 | -- | [496](https://www.aliyundrive.com/s/CD3Kz9cxu1U)`l5m9` | 170 | 199 | [2358+698](./sharelinks.md) | -- | 2161+491 | [90+284+1205](https://www.aliyundrive.com/s/PZ1Wann4B8A)`29sf` | 1805 | 846 | 397 | 67+378+2773 | 112 | [639](https://www.aliyundrive.com/s/fP52KxJEUE5)`mo78`+[74](https://www.aliyundrive.com/s/XZG992JqQfn)`nj80` |
| **2024/v25** | 2581 | 460 | 268+46 | 547 | 170 | 264 | 2716+773 | 2387 | -- | 86+369+1810 | 144+191+2275 | 1048 | 419 | 61+326+3650 | 131 | 846+120 |
| **2025/v26** | 3028 | 479 | -- | 583 | 182 | 263 | 2871+659 | -- | 2701+765 | 208+373+3060+6+6+56 | 108+211+2967 | 1276 | 308 | 77+683+4515 | 163 | 929 |
| **2026/v27** | 2375 | 29 May | 18 Dec | 2 May | 3 July | 12 Nov | 7 June | 13 Sep | 29 Sep | 225+5131 | 11 July | 21 Aug | 50 | 13 Dec | 17 July | 831+191 |

</sup>
</sub>

<!--| **2023/v24** |
| | | | | | | | 
                                                             |                                                             |                              |                                                              |      |                                                                                                             |-->\n\n[Download from 123pan.com](https://www.123pan.com/s/PwXljv-QErwd.html)\n(ACCESS CODE: `FdX2`)\n\n(Some papers may be missing because an older version of 123pan limited the length of filenames.)\n\nNOTE: all the shared papers' PDF files were collected from the internet; the original authors/providers hold the copyrights.\n\n---\n\n## Usage\n\n**For example: download AAAI-2022 papers**\n\n1. Install [Internet Downloader Manager/IDM](https://www.internetdownloadmanager.com/) [*Windows*] [*OPTIONAL*]\n\n   **Note:** If IDM is NOT installed at the DEFAULT location, the\n   code in [lib/IDM.py](./lib/IDM.py) should also be modified:\n\n   ```python\n   # replace with your IDM path\n   idm_path = '\"your path to IDMan.exe\"'\n\n   # default:\n   # idm_path = '\"C:\\Program Files (x86)\\Internet Download Manager\\IDMan.exe\"'\n   ```\n\n   **Useful tip**: [Disabling IDM's download popup pages is recommended](https://github.com/SilenceEagle/paper_downloader/issues/17#issuecomment-773763300)\n\n2. Install [Chrome](https://www.google.com/chrome) [Needed for `ICLR`, `ICML`, some of `NIPS` and `CORL` papers]\n3. 
Change the code block at the end of\n   [code/paper_downloader_AAAI.py](./code/paper_downloader_AAAI.py)\n\n   ```python\n   if __name__ == '__main__':\n      year = 2022\n      total_paper_number = save_csv(year)  # save paper urls to csv/AAAI_2022.csv\n      download_from_csv(\n         year,\n         save_dir=f'..\\\\AAAI_{year}',  # change to your save location\n         time_step_in_seconds=5,  # time step (seconds) between two download requests\n         total_paper_number=total_paper_number,\n         downloader=None  # use the python \"requests\" package to download papers, works on Windows/MacOS/Linux\n         # downloader='IDM'  # use the Internet Download Manager software to\n                              # download papers, Windows only\n      )\n   ```\n\n4. Then run the script:\n\n   ```bash\n   python code/paper_downloader_AAAI.py  # download AAAI papers\n   ```\n\n---\n\n**This repo also provides functions to process supplemental material:**\n\n1. Merge the supplemental material PDF and the main paper into one single PDF file;\n2. Move the supplemental material PDF files (extracted from the downloaded zip files if present) into the main papers' folder.\n\n## Star history\n\n[![Star History Chart](https://api.star-history.com/svg?repos=SilenceEagle/paper_downloader&type=Date)](https://star-history.com/#SilenceEagle/paper_downloader&Date)\n"
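The second supplement-processing step can be sketched with the standard library alone. This is a simplified illustration, not the repo's actual implementation in `lib/supplement_porcess.py`; `move_supplements`, `supp_dir` and `paper_dir` are hypothetical names.

```python
import os
import shutil
import zipfile


def move_supplements(supp_dir, paper_dir):
    """Illustrative sketch: extract supplement PDFs from downloaded zips
    and move loose supplement PDFs into the main papers' folder."""
    moved = []
    os.makedirs(paper_dir, exist_ok=True)
    for name in sorted(os.listdir(supp_dir)):
        path = os.path.join(supp_dir, name)
        if name.lower().endswith('.zip'):
            # pull only the PDF members out of the archive
            with zipfile.ZipFile(path) as zf:
                for member in zf.namelist():
                    if member.lower().endswith('.pdf'):
                        zf.extract(member, paper_dir)
                        moved.append(member)
        elif name.lower().endswith('.pdf'):
            # loose supplement pdf: move it next to the main papers
            shutil.move(path, os.path.join(paper_dir, name))
            moved.append(name)
    return moved
```

Merging a supplement PDF into its main paper (step 1) additionally needs a PDF library such as `pypdf`, so it is omitted from this stdlib-only sketch.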
  },
  {
    "path": "code/paper_downloader_AAAI.py",
    "content": "\"\"\"paper_downloader_AAAI.py\"\"\"\r\nimport time\r\nfrom bs4 import BeautifulSoup\r\nimport pickle\r\nimport os\r\nfrom tqdm import tqdm\r\nfrom slugify import slugify\r\nimport csv\r\nimport sys\r\nimport random\r\n\r\nroot_folder = os.path.abspath(\r\n    os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\nsys.path.append(root_folder)\r\nfrom lib import csv_process\r\nfrom lib.user_agents import user_agents\r\nfrom lib.my_request import urlopen_with_retry\r\n\r\n\r\ndef get_track_urls(year):\r\n    \"\"\"\r\n    get all the technical tracks urls given AAAI proceeding year\r\n    Args:\r\n        year (int): AAAI proceeding year, such 2023\r\n\r\n    Returns:\r\n        dict : All the urls of technical tracks included in\r\n            the given AAAI proceeding. Keys are the tracks name-volume,\r\n            and values are the corresponding urls.\r\n    \"\"\"\r\n    # assert int(year) >= 2023, f\"only support year >= 2023, but get {year}!!!\"\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    dat_file_pathname = os.path.join(\r\n        project_root_folder, 'urls', f'track_archive_url_AAAI_{year}.dat'\r\n    )\r\n    proceeding_th_dict = {\r\n        1980: 1,\r\n        1902: 2,\r\n        1983: 3,\r\n        1984: 4,\r\n        1986: 5,\r\n        1987: 6,\r\n        1988: 7,\r\n        1990: 8,\r\n        1991: 9,\r\n        1992: 10,\r\n        1993: 11,\r\n        1994: 12,\r\n        1996: 13,\r\n        1997: 14,\r\n        1998: 15,\r\n        1999: 16,\r\n        2000: 17,\r\n        2002: 18,\r\n        2004: 19,\r\n        2005: 20,\r\n        2006: 21,\r\n        2007: 22,\r\n        2008: 23\r\n    }\r\n    if year >= 2023:\r\n        base_url = r'https://ojs.aaai.org/index.php/AAAI/issue/archive'\r\n        headers = {\r\n            'User-Agent': user_agents[-1],\r\n            'Host': 'ojs.aaai.org',\r\n            'Referer': 
\"https://ojs.aaai.org\",\r\n            'GET': base_url\r\n        }\r\n        if os.path.exists(dat_file_pathname):\r\n            with open(dat_file_pathname, 'rb') as f:\r\n                content = pickle.load(f)\r\n        else:\r\n            content = urlopen_with_retry(url=base_url, headers=headers)\r\n            # req = urllib.request.Request(url=base_url, headers=headers)\r\n            # content = urllib.request.urlopen(req).read()\r\n            with open(dat_file_pathname, 'wb') as f:\r\n                pickle.dump(content, f)\r\n        soup = BeautifulSoup(content, 'html5lib')\r\n        tracks = soup.find('ul', {'class': 'issues_archive'}).find_all('li')\r\n        track_urls = dict()\r\n        for tr in tracks:\r\n            h2 = tr.find('h2')\r\n            this_track = slugify(h2.a.text)\r\n            if this_track.startswith(f'aaai-{year-2000}'):\r\n                this_track += slugify(h2.div.text) + '-' + this_track\r\n                this_url = h2.a.get('href')\r\n                track_urls[this_track] = this_url\r\n                print(f'find track: {this_track}({this_url})')\r\n    else:\r\n        if year >= 2010:\r\n            proceeding_th = year - 1986\r\n        elif year in proceeding_th_dict:\r\n            proceeding_th = proceeding_th_dict[year]\r\n        else:\r\n            print(f'ERROR: AAAI proceeding was not held in year {year}!!!')\r\n            return\r\n\r\n        base_url = f'https://aaai.org/proceeding/aaai-{proceeding_th:02d}-{year}/'\r\n        headers = {\r\n            'User-Agent': user_agents[-1],\r\n            'Host': 'aaai.org',\r\n            'Referer': \"https://aaai.org\",\r\n            'GET': base_url\r\n        }\r\n        if os.path.exists(dat_file_pathname):\r\n            with open(dat_file_pathname, 'rb') as f:\r\n                content = pickle.load(f)\r\n        else:\r\n            # req = urllib.request.Request(url=base_url, headers=headers)\r\n            # content = 
urllib.request.urlopen(req).read()\r\n            content = urlopen_with_retry(url=base_url, headers=headers)\r\n            # content = open(f'..\\\\AAAI_{year}.html', 'rb').read()\r\n            with open(dat_file_pathname, 'wb') as f:\r\n                pickle.dump(content, f)\r\n        soup = BeautifulSoup(content, 'html5lib')\r\n        tracks = soup.find('main', {'class': 'content'}).find_all('li')\r\n        track_urls = dict()\r\n        for tr in tracks:\r\n            this_track = slugify(tr.a.text)\r\n            this_url = tr.a.get('href')\r\n            track_urls[this_track] = this_url\r\n            print(f'find track: {this_track}({this_url})')\r\n    return track_urls\r\n\r\n\r\ndef get_papers_of_track_ojs(track_url):\r\n    \"\"\"\r\n    get all the papers' title, belonging track group name and download link.\r\n    the link should be hosted on https://ojs.aaai.org/\r\n    Args:\r\n        track_url (str): track url\r\n\r\n    Returns:\r\n        list[dict]: a list contains all the collected papers' information,\r\n            each item in list is a dictionary, whose keys include\r\n            ['title', 'main link', 'group']\r\n            And the group is the specific track name.\r\n    \"\"\"\r\n    debug = False\r\n    paper_list = []\r\n    headers = {\r\n        'User-Agent': user_agents[-1],\r\n        'Host': 'ojs.aaai.org',\r\n        'Referer': \"https://ojs.aaai.org\",\r\n        'GET': track_url\r\n    }\r\n    content = urlopen_with_retry(url=track_url, headers=headers)\r\n\r\n    soup = BeautifulSoup(content, 'html5lib')\r\n    tracks = soup.find('div', {'class': 'sections'}).find_all(\r\n        'div', {'class': 'section'})\r\n    for tr in tracks:\r\n        this_group = slugify(tr.h2.text)\r\n        this_paper_dict = {\r\n            'group': this_group,\r\n            'title': '',\r\n            'main link': ''\r\n        }\r\n        papers = tr.find_all('li')\r\n        for p in papers:\r\n            this_paper_dict['title'] 
= ''\r\n            this_paper_dict['main link'] = ''\r\n            try:\r\n                title = slugify(p.find('h3', {'class': 'title'}).text)\r\n                link = p.find(\r\n                    'a', {'class': 'obj_galley_link pdf'}\r\n                ).get('href').replace('view', 'download')\r\n                this_paper_dict['title'] = title\r\n                this_paper_dict['main link'] = link\r\n                paper_list.append(this_paper_dict.copy())\r\n                if debug:\r\n                    print(\r\n                        f'paper: {title}\\n\\tlink:{link}\\n\\tgroup:{this_group}')\r\n            except Exception as e:\r\n                # skip unwanted target\r\n                # print(f'ERROR: {str(e)}')\r\n                pass\r\n                # continue\r\n\r\n    return paper_list\r\n\r\n\r\ndef get_papers_of_track(track_url):\r\n    \"\"\"\r\n    get all the papers' title, belonging track group name and download link.\r\n    the link should be hosted on https://aaai.org/\r\n    Args:\r\n        track_url (str): track url\r\n\r\n    Returns:\r\n        list[dict]: a list contains all the collected papers' information,\r\n            each item in list is a dictionary, whose keys include\r\n            ['title', 'main link', 'group']\r\n            And the group is the specific track name.\r\n    \"\"\"\r\n    debug = False\r\n    paper_list = []\r\n    headers = {\r\n        'User-Agent': user_agents[-1],\r\n        'Host': 'aaai.org',\r\n        'Referer': \"https://aaai.org\",\r\n        'GET': track_url\r\n    }\r\n    content = urlopen_with_retry(url=track_url, headers=headers)\r\n    soup = BeautifulSoup(content, 'html5lib')\r\n    tracks = soup.find('main', {'id': 'genesis-content'}).find_all(\r\n        'div', {'class': 'track-wrap'})\r\n    for tr in tracks:\r\n        this_group = slugify(tr.h2.text)\r\n        this_paper_dict = {\r\n            'group': this_group,\r\n            'title': '',\r\n            'main link': 
''\r\n        }\r\n        papers = tr.find_all('li')\r\n        for p in papers:\r\n            this_paper_dict['title'] = ''\r\n            this_paper_dict['main link'] = ''\r\n            try:\r\n                title = slugify(p.find('h5').text)\r\n                link = p.find(\r\n                    'a', {'class': 'wp-block-button'}\r\n                ).get('href')\r\n                this_paper_dict['title'] = title\r\n                this_paper_dict['main link'] = link\r\n                paper_list.append(this_paper_dict.copy())\r\n                if debug:\r\n                    print(\r\n                        f'paper: {title}\\n\\tlink:{link}\\n\\tgroup:{this_group}')\r\n            except Exception as e:\r\n                # skip unwanted target\r\n                # print(f'ERROR: {str(e)}')\r\n                pass\r\n                # continue\r\n\r\n    return paper_list\r\n\r\n\r\ndef save_csv(year):\r\n    \"\"\"\r\n    write AAAI papers' urls in one csv file\r\n    :param year: int, AAAI year, such as 2019\r\n    :return: paper_index: int, the total number of papers\r\n    \"\"\"\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    csv_file_pathname = os.path.join(\r\n        project_root_folder, 'csv', f'AAAI_{year}.csv'\r\n    )\r\n    error_log = []\r\n    paper_index = 0\r\n    with open(csv_file_pathname, 'w', newline='') as csvfile:\r\n        fieldnames = ['title', 'main link', 'group']\r\n        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\r\n        writer.writeheader()\r\n        track_urls = get_track_urls(year)\r\n        for tr_name in track_urls:\r\n            tr_url = track_urls[tr_name]\r\n            print(f'collecting papers from {tr_name}({tr_url})')\r\n            if year >= 2023:\r\n                papers_dict_list = get_papers_of_track_ojs(tr_url)\r\n            else:\r\n                papers_dict_list = get_papers_of_track(tr_url)\r\n            
print(f'\\tfind {len(papers_dict_list)} papers')\r\n            for p in papers_dict_list:\r\n                paper_index += 1\r\n                writer.writerow(p)\r\n            csvfile.flush()\r\n            s = random.randint(3, 7)\r\n            print(f'random sleeping {s} seconds...')\r\n            time.sleep(s)  # avoid requesting too frequently\r\n\r\n    #  write error log\r\n    print('write error log')\r\n    log_file_pathname = os.path.join(\r\n        project_root_folder, 'log', 'download_err_log.txt'\r\n    )\r\n    with open(log_file_pathname, 'w') as f:\r\n        for log in tqdm(error_log):\r\n            for e in log:\r\n                if e is not None:\r\n                    f.write(e)\r\n                else:\r\n                    f.write('None')\r\n                f.write('\\n')\r\n\r\n            f.write('\\n')\r\n    return paper_index\r\n\r\n\r\ndef download_from_csv(\r\n        year, save_dir, time_step_in_seconds=5, total_paper_number=None,\r\n        csv_filename=None, downloader='IDM'):\r\n    \"\"\"\r\n    download all AAAI papers of a given year\r\n    :param year: int, AAAI year, such as 2019\r\n    :param save_dir: str, paper and supplement material's save path\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds\r\n    :param total_paper_number: int, the total number of papers that are going\r\n        to be downloaded\r\n    :param csv_filename: None or str, the csv file's name, None means to use\r\n        the default setting\r\n    :param downloader: str, the downloader to use, could be 'IDM' or\r\n        'Thunder', defaults to 'IDM'\r\n    :return: None\r\n    \"\"\"\r\n    postfix = f'AAAI_{year}'\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    csv_file_path = os.path.join(\r\n        project_root_folder, 'csv',\r\n        f'AAAI_{year}.csv' if csv_filename is None else csv_filename)\r\n    
csv_process.download_from_csv(\r\n        postfix=postfix,\r\n        save_dir=save_dir,\r\n        csv_file_path=csv_file_path,\r\n        is_download_supplement=False,\r\n        time_step_in_seconds=time_step_in_seconds,\r\n        total_paper_number=total_paper_number,\r\n        downloader=downloader\r\n    )\r\n\r\n\r\nif __name__ == '__main__':\r\n    year = 2025\r\n    # total_paper_number = 3028\r\n    total_paper_number = save_csv(year)\r\n    download_from_csv(\r\n        year,\r\n        save_dir=fr'D:\\AAAI_{year}',\r\n        time_step_in_seconds=15,\r\n        total_paper_number=total_paper_number)\r\n    # for year in range(2012, 2018, 2):\r\n    #     print(year)\r\n    #     total_paper_number = None\r\n    #     # total_paper_number = save_csv(year)\r\n    #     download_from_csv(year, save_dir=f'..\\\\AAAI_{year}',\r\n    #                       time_step_in_seconds=10,\r\n    #                       total_paper_number=total_paper_number)\r\n    #     time.sleep(2)\r\n    # for i in range(1, 12):\r\n    #     print(f'issue {i}/{11}')\r\n    #     year = 2022\r\n    #     total_paper_number = save_csv_given_urls(\r\n    #         urls=f'https://www.aaai.org/Library/AAAI/aaai{year - 2000}-issue{i:0>2}.php',\r\n    #         csv_filename=f'.\\AAAI_{year}_issue_{i}.csv'\r\n    #     )\r\n    #     # total_paper_number = 156\r\n    #     download_from_csv(\r\n    #         year=year,\r\n    #         csv_filename=f'.\\AAAI_{year}_issue_{i}.csv',\r\n    #         save_dir=rf'D:\\AAAI_{year}',\r\n    #         time_step_in_seconds=1,\r\n    #         total_paper_number=total_paper_number)\r\n\r\n    # print(get_track_urls(1980))\r\n    # get_papers_of_track(r'https://ojs.aaai.org/index.php/AAAI/issue/view/548')\r\n\r\n    pass\r\n"
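`get_track_urls()` above fetches each archive page once and pickles the raw bytes to a `.dat` file under `urls/`, so later runs reuse the cached copy instead of re-requesting the site. The pattern in isolation, as a stdlib-only sketch (`fetch_with_cache` and the `fetch` callable are hypothetical names standing in for `urlopen_with_retry`):

```python
import os
import pickle


def fetch_with_cache(url, cache_path, fetch):
    """Return the page content for url, using a pickled on-disk cache.

    fetch: a callable taking the url and returning the raw content;
    it is only invoked when cache_path does not exist yet.
    """
    if os.path.exists(cache_path):
        # cache hit: no network request at all
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    content = fetch(url)
    # cache miss: persist the raw content for later runs
    with open(cache_path, 'wb') as f:
        pickle.dump(content, f)
    return content
```

Deleting the `.dat` file under `urls/` forces a fresh fetch, which is how the scripts pick up a newly published proceeding.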
  },
  {
    "path": "code/paper_downloader_AAMAS.py",
    "content": "\"\"\"paper_downloader_AAMAS.py\n\"\"\"\n\nimport time\nimport urllib\nfrom urllib.error import HTTPError\nfrom bs4 import BeautifulSoup\nimport pickle\nimport os\nfrom tqdm import tqdm\nfrom slugify import slugify\nimport csv\nimport sys\n\nroot_folder = os.path.abspath(\n    os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\nsys.path.append(root_folder)\nfrom lib import csv_process\nfrom lib.my_request import urlopen_with_retry\n\n\ndef save_csv(year):\n    \"\"\"\n    write AAMAS papers' urls in one csv file\n    :param year: int, AAMAS year, such 2023\n    :return: peper_index: int, the total number of papers\n    \"\"\"\n    conference = \"AAMAS\"\n    project_root_folder = os.path.abspath(\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\n    csv_file_pathname = os.path.join(\n        project_root_folder, 'csv', f'{conference}_{year}.csv'\n    )\n\n    init_url_dict = {\n        2010: 'https://www.ifaamas.org/Proceedings/aamas2010/resources/_fullpapers.html',\n        2009: 'https://www.ifaamas.org/Proceedings/aamas2009/TOC/01_FP/FP_Session.html',\n        2008: 'https://www.ifaamas.org/Proceedings/aamas2008/proceedings/mainTrackPapers.htm',\n    }\n\n    error_log = []\n    paper_index = 0\n    with open(csv_file_pathname, 'w', newline='') as csvfile:\n        fieldnames = ['title', 'group', 'main link', 'supplemental link']\n        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\n        writer.writeheader()\n        if year >=  2013:\n            init_url = f'https://www.ifaamas.org/Proceedings/aamas{year}' \\\n                f'/forms/contents.htm'\n        elif year >= 2011:\n            init_url = f'https://www.ifaamas.org/Proceedings/aamas{year}'\\\n                f'/resources/fullpapers.html'\n        elif year in init_url_dict:\n            init_url = init_url_dict[year]\n        else:   \n            # TODO: support downloading 2002 ~ 2007 papers\n            return\n        url_file_pathname 
= os.path.join(\n            project_root_folder, 'urls',\n            f'init_url_{conference}_{year}.dat'\n        )\n        if os.path.exists(url_file_pathname):\n            with open(url_file_pathname, 'rb') as f:\n                content = pickle.load(f)\n        else:\n            headers = {\n                'User-Agent':\n                    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '\n                    'AppleWebKit/537.36 (KHTML, like Gecko) '\n                    'Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0'}\n            content = urlopen_with_retry(url=init_url, headers=headers)\n            with open(url_file_pathname, 'wb') as f:\n                pickle.dump(content, f)\n\n        soup = BeautifulSoup(content, 'html5lib')\n        # soup = BeautifulSoup(content, 'html.parser')\n        if year >= 2013:\n            group_list = soup.find('tbody').find_all('tr', recursive=False)[3:]\n            # skip \"conference title\", \"Table of Contents\" and \"Contents table\"\n\n            group_list_bar = tqdm(group_list)\n            paper_index = 0\n            is_start = False\n            for group in group_list_bar:\n                if not is_start:\n                    # if group.find('a', {'id': 'KT'}): # year 2019, 2023, 2024\n                    #     is_start = True\n                    if group.find('strong'):\n                        group_text = slugify(group.find('strong').text)\n                        if not group_text.startswith('table') and \\\n                            not group_text.startswith('aamas'):\n                            # skip Table of Contents, AAMAS 20xx\n                            is_start = True\n                        else:\n                            continue\n                    else:\n                        continue\n\n                try:\n                    tds = group.find_all('td', recursive=False)\n                    if len(tds) < 2:\n                        
continue\n                    group = tds[1]\n                    papers = group.find_all('p')\n\n                    for p in papers:\n                        # group title is in <strong>...</strong>\n                        if p.find('strong', recursive=False):\n                            group_title = slugify(p.text)\n                            continue\n                        paper_dict = {'title': '',\n                                    'group': group_title,\n                                    'main link': '',\n                                    'supplemental link': ''}\n                        if p.find('a') is None and p.find('b') is None:\n                            # last empty <p>...</p> in some <tr>...</tr>\n                            continue\n                        a = p.find('a')\n                        if a is None:\n                            title = slugify(p.find('b').text)\n                            main_link = ''\n                            print(f'\\nWarning: No link found for {title}!')\n                        else:\n                            title = slugify(a.text)\n                            main_link = urllib.parse.urljoin(init_url, a.get('href'))\n                        \n                        paper_dict['title'] = title\n                        paper_dict['main link'] = main_link\n                        paper_index += 1\n                        group_list_bar.set_description_str(\n                            f'Collected paper {paper_index}: {title}')\n                        writer.writerow(paper_dict)\n                        csvfile.flush()  # write to file immediately\n                except Exception as e:\n                    print(f'Warning: {str(e)}\\n'\n                        f'Current group: {group_title}\\nCurrent paper: {title}')\n        elif year >= 2010:\n            class_name = {\n                2010: 'plist',\n                2011: 'plist',\n                2012: 'pindex'\n            }\n           
 papers = soup.find('div', {'class': class_name[year]}).find_all(['h2', 'div'])\n            papers_bar = tqdm(papers)\n            paper_index = 0\n            for p in papers_bar:\n                if p.name == 'h2': # group title\n                    group_title = slugify(p.text)\n                else:  # div, paper\n                    paper_dict = {'title': '',\n                                'group': group_title,\n                                'main link': '',\n                                'supplemental link': ''}\n                    a = p.find('span', {'class': 'title'}).find('a')\n                    # title = slugify(a.find(string=True, recursive=False)) # drop abs\n                    direct_text = ''.join(child for child in a.contents \n                                          if isinstance(child, str)).strip()\n                    title = slugify(direct_text)\n                    main_link = urllib.parse.urljoin(init_url, a.get('href'))\n                    paper_dict['title'] = title\n                    paper_dict['main link'] = main_link\n                    paper_index += 1\n                    papers_bar.set_description_str(\n                        f'Collected paper {paper_index}: {title}')\n                    writer.writerow(paper_dict)\n                    csvfile.flush()  # write to file immediately\n        elif year == 2009:\n            group_list = soup.find('div', {'id': 'mainContent'}).find_all('p')\n            group_list_bar = tqdm(group_list)\n            paper_index = 0\n            is_start = False\n            for group in group_list_bar:\n                if not is_start:\n                    if group.find('strong'):\n                        group_text = slugify(group.find('strong').text)\n                        is_start = True\n                    else:\n                        continue\n                if group.find('strong'):\n                    group_title = slugify(group.text)\n                    continue\n           
     try:\n                    papers = group.find_all('a')\n                    for p in papers:\n                        paper_dict = {'title': '',\n                                    'group': group_title,\n                                    'main link': '',\n                                    'supplemental link': ''}\n                        title = slugify(p.text)\n                        main_link = urllib.parse.urljoin(init_url, p.get('href'))\n                        \n                        paper_dict['title'] = title\n                        paper_dict['main link'] = main_link\n                        paper_index += 1\n                        group_list_bar.set_description_str(\n                            f'Collected paper {paper_index}: {title}')\n                        writer.writerow(paper_dict)\n                        csvfile.flush()  # write to file immediately\n                except Exception as e:\n                    print(f'Warning: {str(e)}\\n'\n                        f'Current group: {group_title}\\nCurrent paper: {title}')\n        elif year == 2008:\n            # papers = soup.find_all(lambda tag: \n            #     (tag.name == 'p' and 'title' in tag.get('class', [])) or \n            #     tag.name == 'a'\n            # )\n            group_list = soup.find('div', {'id': 'mainbody'}).find(\n                'table').find('tbody').find_all('tr', recursive=False)[2:]\n            # skip \"conference title\", \"Table of Contents\" \n            \n            group_list_bar = tqdm(group_list)\n            paper_index = 0\n            for group in group_list_bar:\n                \n                try:\n                    p_class_title = group.find('p', {'class': 'title'})\n                    h3 = group.find('h3')\n                    if p_class_title:                       \n                        group_title = slugify(p_class_title.text)\n                    elif h3:  # find <h3></h3>\n                        group_title = 
slugify(h3.text)\n                    else:\n                        raise ValueError('Parse group title failed!')\n\n                    papers = group.find_all('a')\n\n                    for p in papers:\n                        paper_dict = {'title': '',\n                                    'group': group_title,\n                                    'main link': '',\n                                    'supplemental link': ''}\n                        \n                        title = slugify(p.text)\n                        if not p.get('href'):\n                            continue # group title\n                        main_link = urllib.parse.urljoin(init_url, p.get('href'))\n                        \n                        paper_dict['title'] = title\n                        paper_dict['main link'] = main_link\n                        paper_index += 1\n                        group_list_bar.set_description_str(\n                            f'Collected paper {paper_index}: {title}')\n                        writer.writerow(paper_dict)\n                        csvfile.flush()  # write to file immediately\n                except Exception as e:\n                    print(f'Warning: {str(e)}\\n'\n                        f'Current group: {group_title}\\nCurrent paper: {title}')\n        else:\n            # TODO: support downloading 2002 ~ 2008 papers\n            return\n\n    #  write error log\n    print('write error log')\n    log_file_pathname = os.path.join(\n        project_root_folder, 'log', 'download_err_log.txt'\n    )\n    with open(log_file_pathname, 'w') as f:\n        for log in tqdm(error_log):\n            for e in log:\n                if e is not None:\n                    f.write(e)\n                else:\n                    f.write('None')\n                f.write('\\n')\n\n            f.write('\\n')\n    return paper_index\n\n\ndef download_from_csv(\n        year, save_dir, time_step_in_seconds=5, total_paper_number=None,\n        
csv_filename=None, downloader='IDM', is_random_step=True,\n        proxy_ip_port=None):\n    \"\"\"\n    download all AAMAS papers of a given year\n    :param year: int, AAMAS year, such as 2019\n    :param save_dir: str, paper and supplement material's save path\n    :param time_step_in_seconds: int, the interval time between two download\n        requests in seconds\n    :param total_paper_number: int, the total number of papers that are going\n        to be downloaded\n    :param csv_filename: None or str, the csv file's name, None means to use\n        the default setting\n    :param downloader: str, the downloader to use, could be 'IDM' or\n        'Thunder', defaults to 'IDM'\n    :param is_random_step: bool, whether to randomly sample the time step\n        between two adjacent download requests. If True, the time step will be\n        sampled from Uniform(0.5t, 1.5t), where t is the given\n        time_step_in_seconds. Default: True.\n    :param proxy_ip_port: str or None, proxy server ip address with or without\n        protocol prefix, eg: \"127.0.0.1:7890\", \"http://127.0.0.1:7890\".\n        Default: None\n    :return: None\n    \"\"\"\n    conference = \"AAMAS\"\n    postfix = f'{conference}_{year}'\n    project_root_folder = os.path.abspath(\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\n    csv_file_path = os.path.join(\n        project_root_folder, 'csv',\n        f'{conference}_{year}.csv' if csv_filename is None else csv_filename)\n    csv_process.download_from_csv(\n        postfix=postfix,\n        save_dir=save_dir,\n        csv_file_path=csv_file_path,\n        is_download_supplement=False,\n        time_step_in_seconds=time_step_in_seconds,\n        total_paper_number=total_paper_number,\n        downloader=downloader,\n        is_random_step=is_random_step,\n        proxy_ip_port=proxy_ip_port\n    )\n\n\nif __name__ == '__main__':\n    year = 2025\n    # total_paper_number = 2021\n    total_paper_number = save_csv(year)\n    
download_from_csv(\n        year,\n        save_dir=fr'D:\\AAMAS_{year}',\n        time_step_in_seconds=5,\n        total_paper_number=total_paper_number)\n    # for year in range(2008, 2025, 1):\n    #     print(year)\n    #     # total_paper_number = 134\n    #     total_paper_number = save_csv(year)\n    #     download_from_csv(year, save_dir=fr'E:\\AAMAS\\AAMAS_{year}',\n    #                       time_step_in_seconds=10,\n    #                       total_paper_number=total_paper_number)\n    #     time.sleep(2)\n\n    pass"
  },
  {
    "path": "code/paper_downloader_AISTATS.py",
    "content": "\"\"\"paper_downloader_AISTATS.py\"\"\"\r\nimport os\r\nimport sys\r\nroot_folder = os.path.abspath(\r\n    os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\nsys.path.append(root_folder)\r\nimport lib.pmlr as pmlr\r\nfrom lib.supplement_porcess import merge_main_supplement, move_main_and_supplement_2_one_directory, \\\r\n    move_main_and_supplement_2_one_directory_with_group\r\n\r\n\r\ndef download_paper(year, save_dir, is_download_supplement=True, time_step_in_seconds=5, downloader='IDM'):\r\n    \"\"\"\r\n    download all AISTATS papers and supplement files of the given year, and\r\n    store them in save_dir/main_paper and save_dir/supplement respectively\r\n    :param year: int, AISTATS year, such as 2019\r\n    :param save_dir: str, paper and supplement material's save path\r\n    :param is_download_supplement: bool, True for downloading supplemental\r\n        material\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds\r\n    :param downloader: str, the downloader to use, could be 'IDM' or\r\n        'Thunder', default to 'IDM'\r\n    :return: True\r\n    \"\"\"\r\n    AISTATS_year_dict = {\r\n        2025: 258,\r\n        2024: 238,\r\n        2023: 206,\r\n        2022: 151,\r\n        2021: 130,\r\n        2020: 108,\r\n        2019: 89,\r\n        2018: 84,\r\n        2017: 54,\r\n        2016: 51,\r\n        2015: 38,\r\n        2014: 33,\r\n        2013: 31,\r\n        2012: 22,\r\n        2011: 15,\r\n        2010: 9,\r\n        2009: 5,\r\n        2007: 2\r\n    }\r\n    AISTATS_year_dict_R = {\r\n        1995: 0,\r\n        1997: 1,\r\n        1999: 2,\r\n        2001: 3,\r\n        2003: 4,\r\n        2005: 5\r\n    }\r\n    if year in AISTATS_year_dict.keys():\r\n        volume = f'v{AISTATS_year_dict[year]}'\r\n    elif year in AISTATS_year_dict_R.keys():\r\n        volume = f'r{AISTATS_year_dict_R[year]}'\r\n    else:\r\n        raise ValueError(\"the given year's url is unknown!\")\r\n    postfix = f'AISTATS_{year}'\r\n\r\n    pmlr.download_paper_given_volume(\r\n        volume=volume,\r\n        save_dir=save_dir,\r\n        postfix=postfix,\r\n        is_download_supplement=is_download_supplement,\r\n        time_step_in_seconds=time_step_in_seconds,\r\n        downloader=downloader\r\n    )\r\n\r\n\r\nif __name__ == '__main__':\r\n    year = 2025\r\n    download_paper(\r\n        year,\r\n        rf'D:\\AISTATS_{year}',\r\n        is_download_supplement=True,\r\n        time_step_in_seconds=25,\r\n        downloader='IDM'\r\n    )\r\n    # move_main_and_supplement_2_one_directory(\r\n    #     main_path=rf'D:\\AISTATS_{year}\\main_paper',\r\n    #     supplement_path=rf'D:\\AISTATS_{year}\\supplement',\r\n    #     supp_pdf_save_path=rf'D:\\AISTATS_{year}\\supplement_pdf'\r\n    # )\r\n    pass\r\n"
  },
  {
    "path": "code/paper_downloader_COLT.py",
    "content": "\"\"\"paper_downloader_COLT.py\"\"\"\r\nimport os\r\nimport sys\r\nroot_folder = os.path.abspath(\r\n    os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\nsys.path.append(root_folder)\r\nimport lib.pmlr as pmlr\r\n\r\n\r\ndef download_paper(year, save_dir, is_download_supplement=False, time_step_in_seconds=5, downloader='IDM'):\r\n    \"\"\"\r\n    download all COLT papers and supplement files of the given year, and\r\n    store them in save_dir/main_paper and save_dir/supplement respectively\r\n    :param year: int, COLT year, such as 2019\r\n    :param save_dir: str, paper and supplement material's save path\r\n    :param is_download_supplement: bool, True for downloading supplemental\r\n        material\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds\r\n    :param downloader: str, the downloader to use, could be 'IDM' or\r\n        'Thunder', default to 'IDM'\r\n    :return: True\r\n    \"\"\"\r\n    COLT_year_dict = {\r\n        2025: 291,\r\n        2024: 247,\r\n        2023: 195,\r\n        2022: 178,\r\n        2021: 134,\r\n        2020: 125,\r\n        2019: 99,\r\n        2018: 75,\r\n        2017: 65,\r\n        2016: 49,\r\n        2015: 40,\r\n        2014: 35,\r\n        2013: 30,\r\n        2012: 23,\r\n        2011: 19\r\n    }\r\n    if year in COLT_year_dict.keys():\r\n        volume = f'v{COLT_year_dict[year]}'\r\n    else:\r\n        raise ValueError(\"the given year's url is unknown!\")\r\n    postfix = f'COLT_{year}'\r\n\r\n    pmlr.download_paper_given_volume(\r\n        volume=volume,\r\n        save_dir=save_dir,\r\n        postfix=postfix,\r\n        is_download_supplement=is_download_supplement,\r\n        time_step_in_seconds=time_step_in_seconds,\r\n        downloader=downloader\r\n    )\r\n\r\n\r\nif __name__ == '__main__':\r\n    year = 2025\r\n    download_paper(\r\n        year,\r\n        rf'D:\\COLT_{year}',\r\n        is_download_supplement=False,\r\n        time_step_in_seconds=3,\r\n        downloader='IDM'\r\n    )\r\n    pass\r\n"
  },
  {
    "path": "code/paper_downloader_CORL.py",
    "content": "\"\"\"paper_downloader_CORL.py\"\"\"\nimport os\nimport sys\nroot_folder = os.path.abspath(\n    os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\nsys.path.append(root_folder)\nimport lib.pmlr as pmlr\nimport lib.openreview as openreview\n\n\ndef download_paper(year, save_dir, is_download_supplement=False,\n                   time_step_in_seconds=5, downloader='IDM',\n                   source=None, proxy_ip_port=None):\n    \"\"\"\n    download all CORL papers and supplement files of the given year, and\n    store them in save_dir/main_paper and save_dir/supplement respectively\n    :param year: int, CORL year, such as 2019\n    :param save_dir: str, paper and supplement material's save path\n    :param is_download_supplement: bool, True for downloading supplemental\n        material\n    :param time_step_in_seconds: int, the interval time between two download\n        requests in seconds\n    :param downloader: str, the downloader to use, could be 'IDM' or\n        'Thunder', default to 'IDM'\n    :param source: str or None, download source, supports \"pmlr\" and\n        \"openreview\". Defaults to None, which means first trying to download\n        from pmlr and, if that fails, from openreview.\n    :param proxy_ip_port: str or None, proxy ip address and port,\n        eg: \"127.0.0.1:7890\". Only useful for the webdriver and request\n        downloader (downloader=None). Default: None.\n    :type proxy_ip_port: str | None\n    :return: True\n    \"\"\"\n    CORL_year_dict = {\n        2025: 305,\n        2024: 270,\n        2023: 229,\n        2022: 205,\n        2021: 164,\n        2020: 155,\n        2019: 100,\n        2018: 87,\n        2017: 78\n    }\n    postfix = f'CORL_{year}'\n\n    if source != 'openreview':\n        if year in CORL_year_dict.keys():  # download from pmlr\n            volume = f'v{CORL_year_dict[year]}'\n            pmlr.download_paper_given_volume(\n                volume=volume,\n                save_dir=save_dir,\n                postfix=postfix,\n                is_download_supplement=is_download_supplement,\n                time_step_in_seconds=time_step_in_seconds,\n                downloader=downloader\n            )\n            return True\n        elif source == 'pmlr':\n            raise ValueError(f'CoRL {year} not found in pmlr!')\n\n    # try to download from openreview\n    base_url = f'https://openreview.net/group?id=robot-learning.org/'\\\n               f'CoRL/{year}/Conference'\n    group_id_dict = {\n        2023: ['accept--oral-', 'accept--poster-'],\n        2024: ['accept']\n    }\n    if year not in group_id_dict:\n        raise ValueError(f'CoRL {year} not found in openreview!')\n    for gid in group_id_dict[year]:\n        openreview.download_papers_given_url_and_group_id(\n            save_dir=save_dir,\n            year=year,\n            base_url=f'{base_url}#{gid}',\n            group_id=gid,\n            conference='CORL',\n            time_step_in_seconds=time_step_in_seconds,\n            downloader=downloader,\n            proxy_ip_port=proxy_ip_port\n        )\n    return True\n\n\nif __name__ == '__main__':\n    year = 2025\n    download_paper(\n        year,\n        rf'D:\\CORL\\CORL_{year}',\n        is_download_supplement=False,\n        time_step_in_seconds=30,\n        downloader='IDM'\n        # downloader = None\n    )\n    pass\n"
  },
  {
    "path": "code/paper_downloader_CVF.py",
    "content": "\"\"\"paper_downloader_CVF.py\"\"\"\r\n\r\nimport urllib\r\nfrom bs4 import BeautifulSoup\r\nimport pickle\r\nimport os\r\nfrom slugify import slugify\r\nimport csv\r\nimport sys\r\nroot_folder = os.path.abspath(\r\n    os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\nsys.path.append(root_folder)\r\nfrom lib.supplement_porcess import merge_main_supplement, move_main_and_supplement_2_one_directory, \\\r\n    move_main_and_supplement_2_one_directory_with_group, \\\r\n    rename_2_short_name, rename_2_short_name_within_group\r\nfrom lib.cvf import get_paper_dict_list\r\nfrom lib import csv_process\r\nimport time\r\nfrom lib.my_request import urlopen_with_retry\r\n\r\n\r\ndef save_csv(year, conference, proxy_ip_port=None):\r\n    \"\"\"\r\n    write CVF conference papers' and supplemental material's urls in one csv file\r\n    :param year: int\r\n    :param conference: str, one of ['CVPR', 'ICCV', 'WACV', 'ACCV']\r\n    :param proxy_ip_port: str or None, proxy server ip address with or without\r\n        protocol prefix, eg: \"127.0.0.1:7890\", \"http://127.0.0.1:7890\".\r\n        Default: None\r\n    :return: int, the number of papers collected\r\n    \"\"\"\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    if conference not in ['CVPR', 'ICCV', 'WACV', 'ACCV']:\r\n        raise ValueError(f'{conference} is not found in '\r\n                         f'https://openaccess.thecvf.com/menu, '\r\n                         f'maybe a spelling mistake!')\r\n    csv_file_pathname = os.path.join(\r\n        project_root_folder, 'csv', f'{conference}_{year}.csv'\r\n    )\r\n    print(f'saving {conference}-{year} paper urls into {csv_file_pathname}')\r\n    with open(csv_file_pathname, 'w', newline='') as csvfile:\r\n        fieldnames = ['title', 'main link', 'supplemental link', 'arxiv']\r\n        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\r\n        writer.writeheader()\r\n        init_url 
= f'http://openaccess.thecvf.com/{conference}{year}'\r\n        if conference == 'ICCV' and year == 2021:\r\n            init_url = 'https://openaccess.thecvf.com/ICCV2021?day=all'\r\n        elif conference == 'CVPR' and year >= 2022:\r\n            init_url = f'https://openaccess.thecvf.com/CVPR{year}?day=all'\r\n        url_file_pathname = os.path.join(\r\n            project_root_folder, 'urls', f'init_url_{conference}_{year}.dat'\r\n        )\r\n        if os.path.exists(url_file_pathname):\r\n            with open(url_file_pathname, 'rb') as f:\r\n                content = pickle.load(f)\r\n        else:\r\n            headers = {\r\n                'User-Agent':\r\n                    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '\r\n                    'Gecko/20100101 Firefox/23.0'}\r\n            content = urlopen_with_retry(\r\n                url=init_url, headers=headers, proxy_ip_port=proxy_ip_port)\r\n            with open(url_file_pathname, 'wb') as f:\r\n                pickle.dump(content, f)\r\n\r\n        soup = BeautifulSoup(content, 'html5lib')\r\n        tmp_list = soup.find('div', {'id': 'content'}).find_all('dt')\r\n        if len(tmp_list) <= 1:\r\n            paper_different_days_list_bar = soup.find(\r\n                'div', {'id': 'content'}).find_all('dd')\r\n            paper_index = 0\r\n            for group in paper_different_days_list_bar:\r\n                # get group name\r\n                a = group.find('a')\r\n                print(a.text)\r\n                group_link = urllib.parse.urljoin(init_url, a.get('href'))\r\n                group_paper_dict_list, _ = get_paper_dict_list(\r\n                    url=group_link\r\n                )\r\n                paper_index += len(group_paper_dict_list)\r\n                for paper_dict in group_paper_dict_list:\r\n                    writer.writerow(paper_dict)\r\n            return paper_index\r\n        else:\r\n            paper_dict_list, content = get_paper_dict_list(\r\n 
               url=init_url,\r\n                content=content)\r\n            for paper_dict in paper_dict_list:\r\n                writer.writerow(paper_dict)\r\n            return len(paper_dict_list)\r\n\r\n\r\ndef save_csv_workshops(year, conference, proxy_ip_port=None):\r\n    \"\"\"\r\n    write CVF workshops papers' and supplemental material's urls in one csv file\r\n    :param year: int\r\n    :param conference: str, one of ['CVPR', 'ICCV', 'WACV', 'ACCV']\r\n    :param proxy_ip_port: str or None, proxy server ip address with or without\r\n        protocol prefix, eg: \"127.0.0.1:7890\", \"http://127.0.0.1:7890\".\r\n        Default: None\r\n    :return: True\r\n    \"\"\"\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    if conference not in ['CVPR', 'ICCV', 'WACV', 'ACCV']:\r\n        raise ValueError(f'{conference} is not found in '\r\n                         f'https://openaccess.thecvf.com/menu, '\r\n                         f'maybe a spelling mistake!')\r\n    csv_file_pathname = os.path.join(\r\n        project_root_folder, 'csv', f'{conference}_WS_{year}.csv'\r\n    )\r\n    print(f'saving {conference}-WS-{year} paper urls into {csv_file_pathname}')\r\n    with open(csv_file_pathname, 'w', newline='') as csvfile:\r\n        fieldnames = ['group', 'title', 'main link', 'supplemental link',\r\n                      'arxiv']\r\n        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\r\n        writer.writeheader()\r\n\r\n        headers = {\r\n            'User-Agent':\r\n                'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '\r\n                'Gecko/20100101 Firefox/23.0'}\r\n\r\n        init_url = f'https://openaccess.thecvf.com/' \\\r\n                   f'{conference}{year}_workshops/menu'\r\n        url_file_pathname = os.path.join(\r\n            project_root_folder, 'urls', f'init_url_{conference}_WS_{year}.dat'\r\n        )\r\n        if 
os.path.exists(url_file_pathname):\r\n            with open(url_file_pathname, 'rb') as f:\r\n                content = pickle.load(f)\r\n        else:\r\n            content = urlopen_with_retry(\r\n                url=init_url, headers=headers, proxy_ip_port=proxy_ip_port)\r\n            # content = open(f'..\\\\{conference}_WS_{year}.html', 'rb').read()\r\n            with open(url_file_pathname, 'wb') as f:\r\n                pickle.dump(content, f)\r\n        soup = BeautifulSoup(content, 'html5lib')\r\n        paper_group_list_bar = soup.find('div', {'id': 'content'}).find_all('dd')\r\n        paper_index = 0\r\n        for group in paper_group_list_bar:\r\n            # get group name\r\n            a = group.find('a')\r\n            group_name = slugify(a.text)\r\n            print(f'GROUP: {group_name}')\r\n\r\n            group_link = urllib.parse.urljoin(init_url, a.get('href'))\r\n\r\n            group_paper_dict_list = []  # stays empty if every retry fails\r\n            repeat_time = 3\r\n            for r in range(repeat_time):\r\n                try:\r\n                    group_paper_dict_list, _ = get_paper_dict_list(\r\n                        url=group_link,\r\n                        group_name=group_name,\r\n                        timeout=20,\r\n                    )\r\n                    time.sleep(1)\r\n                    break\r\n                except Exception as e:\r\n                    if r + 1 == repeat_time:\r\n                        print(f'ERROR: {str(e)}')\r\n                        continue\r\n\r\n            paper_index += len(group_paper_dict_list)\r\n            for paper_dict in group_paper_dict_list:\r\n                writer.writerow(paper_dict)\r\n    return paper_index\r\n\r\n\r\ndef download_from_csv(\r\n        year, conference, save_dir, is_download_main_paper=True,\r\n        is_download_supplement=True, time_step_in_seconds=5,\r\n        total_paper_number=None, is_workshops=False, downloader='IDM',\r\n        proxy_ip_port=None):\r\n    \"\"\"\r\n    download all CVF papers and supplement 
files of the given year, and store them in\r\n    save_dir/main_paper and save_dir/supplement respectively\r\n    :param year: int, CVF year, such as 2019\r\n    :param conference: str, one of ['CVPR', 'ICCV', 'WACV']\r\n    :param save_dir: str, paper and supplement material's save path\r\n    :param is_download_main_paper: bool, True for downloading main paper\r\n    :param is_download_supplement: bool, True for downloading supplemental\r\n        material\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds\r\n    :param total_paper_number: int, the total number of papers to download\r\n    :param is_workshops: bool, whether to download workshops from the csv file.\r\n    :param downloader: str, the downloader to use, could be 'IDM' or\r\n        None, default to 'IDM'.\r\n    :param proxy_ip_port: str or None, proxy server ip address with or without\r\n        protocol prefix, eg: \"127.0.0.1:7890\", \"http://127.0.0.1:7890\".\r\n        Default: None\r\n    :return: True\r\n    \"\"\"\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    postfix = f'{conference}_{year}'\r\n    if is_workshops:\r\n        postfix = f'{conference}_WS_{year}'\r\n    csv_file_path = os.path.join(\r\n        project_root_folder,\r\n        'csv',\r\n        f'{conference}_{year}.csv' if not is_workshops else\r\n        f'{conference}_WS_{year}.csv'\r\n    )\r\n    csv_process.download_from_csv(\r\n        postfix=postfix,\r\n        save_dir=save_dir,\r\n        csv_file_path=csv_file_path,\r\n        is_download_main_paper=is_download_main_paper,\r\n        is_download_supplement=is_download_supplement,\r\n        time_step_in_seconds=time_step_in_seconds,\r\n        total_paper_number=total_paper_number,\r\n        downloader=downloader,\r\n        proxy_ip_port=proxy_ip_port\r\n    )\r\n    return True\r\n\r\n\r\ndef download_paper(\r\n        year, conference, save_dir, 
is_download_main_paper=True,\r\n        is_download_supplement=True, time_step_in_seconds=5,\r\n        is_download_main_conference=True, is_download_workshops=True,\r\n        downloader='IDM', proxy_ip_port=None):\r\n    \"\"\"\r\n    download all CVF papers of the given year, supporting both the main\r\n    conference and workshops.\r\n    :param year: int, CVF year, such as 2019.\r\n    :param conference: str, one of {'CVPR', 'ICCV', 'WACV'}.\r\n    :param save_dir: str, paper and supplement material's save path.\r\n    :param is_download_main_paper: bool, True for downloading main paper.\r\n    :param is_download_supplement: bool, True for downloading supplemental\r\n        material.\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds.\r\n    :param is_download_main_conference: bool, whether to download main\r\n        conference papers; it is an upper-level control flag over\r\n        is_download_main_paper and is_download_supplement. eg. after setting\r\n        is_download_main_conference=True, is_download_main_paper=False,\r\n        is_download_supplement=True, only the supplement materials of the\r\n        main conference (vs. workshops) will be downloaded.\r\n    :param is_download_workshops: bool, True for downloading workshops papers;\r\n        similar to is_download_main_conference.\r\n    :param downloader: str, the downloader to use, could be 'IDM' or\r\n        None, default to 'IDM'.\r\n    :param proxy_ip_port: str or None, proxy server ip address with or without\r\n        protocol prefix, eg: \"127.0.0.1:7890\", \"http://127.0.0.1:7890\".\r\n        Default: None\r\n    :return:\r\n    \"\"\"\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    # main conference\r\n    if is_download_main_conference:\r\n        csv_file_path = os.path.join(\r\n            project_root_folder, 'csv', f'{conference}_{year}.csv')\r\n        if not os.path.exists(csv_file_path):\r\n            total_paper_number = save_csv(\r\n                year=year, conference=conference, proxy_ip_port=proxy_ip_port)\r\n        else:\r\n            with open(csv_file_path, newline='') as csvfile:\r\n                myreader = csv.DictReader(csvfile, delimiter=',')\r\n                total_paper_number = sum(1 for row in myreader)\r\n\r\n        download_from_csv(\r\n            year=year,\r\n            conference=conference,\r\n            save_dir=os.path.join(save_dir, f'{conference}_{year}'),\r\n            is_download_main_paper=is_download_main_paper,\r\n            is_download_supplement=is_download_supplement,\r\n            time_step_in_seconds=time_step_in_seconds,\r\n            total_paper_number=total_paper_number,\r\n            is_workshops=False,\r\n            downloader=downloader,\r\n            proxy_ip_port=proxy_ip_port\r\n        )\r\n\r\n    # workshops\r\n    if is_download_workshops:\r\n        csv_file_path = os.path.join(\r\n            project_root_folder, 'csv', f'{conference}_WS_{year}.csv')\r\n        if not os.path.exists(csv_file_path):\r\n            total_paper_number = 
save_csv_workshops(\r\n                year=year, conference=conference, proxy_ip_port=proxy_ip_port)\r\n        else:\r\n            with open(csv_file_path, newline='') as csvfile:\r\n                myreader = csv.DictReader(csvfile, delimiter=',')\r\n                total_paper_number = sum(1 for row in myreader)\r\n        download_from_csv(\r\n            year=year,\r\n            conference=conference,\r\n            save_dir=os.path.join(save_dir, f'{conference}_WS_{year}'),\r\n            is_download_main_paper=is_download_main_paper,\r\n            is_download_supplement=is_download_supplement,\r\n            time_step_in_seconds=time_step_in_seconds,\r\n            total_paper_number=total_paper_number,\r\n            is_workshops=True,\r\n            downloader=downloader,\r\n            proxy_ip_port=proxy_ip_port\r\n        )\r\n\r\n\r\nif __name__ == '__main__':\r\n    year = 2025\r\n    conference = 'CVPR'\r\n    download_paper(\r\n        year,\r\n        conference=conference,\r\n        save_dir=fr'D:\\{conference}',\r\n        is_download_main_paper=True,\r\n        is_download_supplement=True,\r\n        time_step_in_seconds=10,\r\n        is_download_main_conference=True,\r\n        is_download_workshops=True,\r\n        # proxy_ip_port='127.0.0.1:7897'\r\n    )\r\n    #\r\n    # move_main_and_supplement_2_one_directory(\r\n    #     main_path=rf'E:\\{conference}\\{conference}_{year}\\main_paper',\r\n    #     supplement_path=rf'E:\\{conference}\\{conference}_{year}\\supplement',\r\n    #     supp_pdf_save_path=rf'E:\\{conference}\\{conference}_{year}\\main_paper'\r\n    # )\r\n    # move_main_and_supplement_2_one_directory_with_group(\r\n    #     main_path=rf'E:\\{conference}\\{conference}_WS_{year}\\main_paper',\r\n    #     supplement_path=rf'E:\\{conference}\\{conference}_WS_{year}\\supplement',\r\n    #     supp_pdf_save_path=rf'E:\\{conference}\\{conference}_WS_{year}\\main_paper'\r\n    # )\r\n\r\n    # rename to short filename for 
uploading to 123pan\r\n    # rename_2_short_name(\r\n    #     src_path=r'E:\\CVPR\\CVPR_2024\\main_paper',\r\n    #     save_path=r'E:\\short_name_cvpr2024',\r\n    #     target_max_length=128\r\n    # )\r\n    # rename_2_short_name_within_group(\r\n    #     src_path=r'E:\\CVPR\\CVPR_WS_2024\\main_paper',\r\n    #     save_path=r'E:\\short_name_cvpr2024_ws',\r\n    #     target_max_length=128\r\n    # )\r\n    pass\r\n"
  },
  {
    "path": "code/paper_downloader_ECCV.py",
    "content": "\"\"\"paper_downloader_ECCV.py\"\"\"\r\n\r\nimport urllib\r\nfrom bs4 import BeautifulSoup\r\nimport pickle\r\nimport os\r\nfrom tqdm import tqdm\r\nfrom slugify import slugify\r\nimport csv\r\nimport sys\r\n\r\nroot_folder = os.path.abspath(\r\n    os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\nsys.path.append(root_folder)\r\nfrom lib.supplement_porcess import move_main_and_supplement_2_one_directory\r\nimport lib.springer as springer\r\nfrom lib import csv_process\r\nfrom lib.downloader import Downloader\r\nfrom lib.my_request import urlopen_with_retry\r\n\r\n\r\ndef save_csv(year):\r\n    \"\"\"\r\n    write ECCV papers' and supplemental material's urls in one csv file\r\n    :param year: int\r\n    :return: int, the number of papers collected\r\n    \"\"\"\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    csv_file_pathname = os.path.join(\r\n        project_root_folder, 'csv', f'ECCV_{year}.csv')\r\n    with open(csv_file_pathname, 'w', newline='') as csvfile:\r\n        fieldnames = ['title', 'main link', 'supplemental link']\r\n        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\r\n        writer.writeheader()\r\n        headers = {\r\n            'User-Agent':\r\n                'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '\r\n                'Gecko/20100101 Firefox/23.0'}\r\n        dat_file_pathname = os.path.join(\r\n            project_root_folder, 'urls', f'init_url_ECCV_{year}.dat')\r\n        if year >= 2018:\r\n            init_url = 'https://www.ecva.net/papers.php'\r\n            if os.path.exists(dat_file_pathname):\r\n                with open(dat_file_pathname, 'rb') as f:\r\n                    content = pickle.load(f)\r\n            else:\r\n                content = urlopen_with_retry(url=init_url, headers=headers)\r\n                with open(dat_file_pathname, 'wb') as f:\r\n                    pickle.dump(content, f)\r\n            soup = 
BeautifulSoup(content, 'html5lib')\r\n            paper_list_bar = tqdm(soup.find_all(['dt', 'dd']))\r\n            paper_index = 0\r\n            paper_dict = {'title': '',\r\n                          'main link': '',\r\n                          'supplemental link': ''}\r\n            for paper in paper_list_bar:\r\n                is_new_paper = False\r\n\r\n                # get title\r\n                try:\r\n                    if 'dt' == paper.name and \\\r\n                            'ptitle' == paper.get('class')[0] and \\\r\n                            year == int(paper.a.get('href').split('_')[1][:4]):  # title:\r\n                        # this_year = int(paper.a.get('href').split('_')[1][:4])\r\n                        title = slugify(paper.text.strip())\r\n                        paper_dict['title'] = title\r\n                        paper_index += 1\r\n                        paper_list_bar.set_description_str(\r\n                            f'Downloading paper {paper_index}: {title}')\r\n                    elif '' != paper_dict['title'] and 'dd' == paper.name:\r\n                        all_as = paper.find_all('a')\r\n                        for a in all_as:\r\n                            if 'pdf' == slugify(a.text.strip()):\r\n                                main_link = urllib.parse.urljoin(init_url,\r\n                                                                 a.get('href'))\r\n                                paper_dict['main link'] = main_link\r\n                                is_new_paper = True\r\n                            elif 'supp' in slugify(a.text.strip()):\r\n                                supp_link = urllib.parse.urljoin(init_url,\r\n                                                                 a.get('href'))\r\n                                paper_dict['supplemental link'] = supp_link\r\n                                break\r\n                except:\r\n                    pass\r\n                if is_new_paper:\r\n  
                  writer.writerow(paper_dict)\r\n                    paper_dict = {'title': '',\r\n                                  'main link': '',\r\n                                  'supplemental link': ''}\r\n        else:\r\n            init_url = f'http://www.eccv{year}.org/main-conference/'\r\n            if os.path.exists(dat_file_pathname):\r\n                with open(dat_file_pathname, 'rb') as f:\r\n                    content = pickle.load(f)\r\n            else:\r\n                content = urlopen_with_retry(url=init_url, headers=headers)\r\n                with open(dat_file_pathname, 'wb') as f:\r\n                    pickle.dump(content, f)\r\n            soup = BeautifulSoup(content, 'html5lib')\r\n            paper_list_bar = tqdm(\r\n                soup.find('div', {'class': 'entry-content'}).find_all(['p']))\r\n            paper_index = 0\r\n            paper_dict = {'title': '',\r\n                          'main link': '',\r\n                          'supplemental link': ''}\r\n            for paper in paper_list_bar:\r\n                try:\r\n                    if len(paper.find_all(['strong'])) and len(\r\n                            paper.find_all(['a'])) and len(\r\n                            paper.find_all(['img'])):\r\n                        paper_index += 1\r\n                        title = slugify(paper.find('strong').text)\r\n                        paper_dict['title'] = title\r\n                        paper_list_bar.set_description_str(\r\n                            f'Downloading paper {paper_index}: {title}')\r\n                        main_link = paper.find('a').get('href')\r\n                        paper_dict['main link'] = main_link\r\n                        writer.writerow(paper_dict)\r\n                        paper_dict = {'title': '',\r\n                                      'main link': '',\r\n                                      'supplemental link': ''}\r\n                except Exception as e:\r\n           
         print(f'ERROR: {str(e)}')\r\n    return paper_index\r\n\r\n\r\ndef download_from_csv(\r\n        year, save_dir, is_download_supplement=True, time_step_in_seconds=5,\r\n        total_paper_number=None,\r\n        is_workshops=False, downloader='IDM'):\r\n    \"\"\"\r\n    download all ECCV papers and supplement files of the given year, and\r\n    store them in save_dir/main_paper and save_dir/supplement respectively\r\n    :param year: int, ECCV year, such as 2019\r\n    :param save_dir: str, paper and supplement material's save path\r\n    :param is_download_supplement: bool, True for downloading supplemental\r\n        material\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds\r\n    :param total_paper_number: int, the total number of papers to download\r\n    :param is_workshops: bool, whether to download workshops from the csv file.\r\n    :param downloader: str, the downloader to use, could be 'IDM' or\r\n        'Thunder', default to 'IDM'\r\n    :return: True\r\n    \"\"\"\r\n    postfix = f'ECCV_{year}'\r\n    if is_workshops:\r\n        postfix = f'ECCV_WS_{year}'\r\n    csv_file_name = f'ECCV_{year}.csv' if not is_workshops else \\\r\n        f'ECCV_WS_{year}.csv'\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    csv_file_name = os.path.join(project_root_folder, 'csv', csv_file_name)\r\n    csv_process.download_from_csv(\r\n        postfix=postfix,\r\n        save_dir=save_dir,\r\n        csv_file_path=csv_file_name,\r\n        is_download_supplement=is_download_supplement,\r\n        time_step_in_seconds=time_step_in_seconds,\r\n        total_paper_number=total_paper_number,\r\n        downloader=downloader\r\n    )\r\n\r\n\r\ndef download_from_springer(\r\n        year, save_dir, is_workshops=False, time_sleep_in_seconds=5,\r\n        downloader='IDM'):\r\n    os.makedirs(save_dir, exist_ok=True)\r\n    if 
2018 == year:\r\n        if not is_workshops:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/978-3-030-01246-5',\r\n                'https://link.springer.com/book/10.1007/978-3-030-01216-8',\r\n                'https://link.springer.com/book/10.1007/978-3-030-01219-9',\r\n                'https://link.springer.com/book/10.1007/978-3-030-01225-0',\r\n                'https://link.springer.com/book/10.1007/978-3-030-01228-1',\r\n                'https://link.springer.com/book/10.1007/978-3-030-01231-1',\r\n                'https://link.springer.com/book/10.1007/978-3-030-01234-2',\r\n                'https://link.springer.com/book/10.1007/978-3-030-01237-3',\r\n                'https://link.springer.com/book/10.1007/978-3-030-01240-3',\r\n                'https://link.springer.com/book/10.1007/978-3-030-01249-6',\r\n                'https://link.springer.com/book/10.1007/978-3-030-01252-6',\r\n                'https://link.springer.com/book/10.1007/978-3-030-01258-8',\r\n                'https://link.springer.com/book/10.1007/978-3-030-01261-8',\r\n                'https://link.springer.com/book/10.1007/978-3-030-01264-9',\r\n                'https://link.springer.com/book/10.1007/978-3-030-01267-0',\r\n                'https://link.springer.com/book/10.1007/978-3-030-01270-0'\r\n            ]\r\n        else:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/978-3-030-11009-3',\r\n                'https://link.springer.com/book/10.1007/978-3-030-11012-3',\r\n                'https://link.springer.com/book/10.1007/978-3-030-11015-4',\r\n                'https://link.springer.com/book/10.1007/978-3-030-11018-5',\r\n                'https://link.springer.com/book/10.1007/978-3-030-11021-5',\r\n                'https://link.springer.com/book/10.1007/978-3-030-11024-6'\r\n            ]\r\n    elif 2016 == year:\r\n        if not is_workshops:\r\n            urls_list = [\r\n                
'https://link.springer.com/book/10.1007%2F978-3-319-46448-0',\r\n                'https://link.springer.com/book/10.1007%2F978-3-319-46475-6',\r\n                'https://link.springer.com/book/10.1007%2F978-3-319-46487-9',\r\n                'https://link.springer.com/book/10.1007%2F978-3-319-46493-0',\r\n                'https://link.springer.com/book/10.1007%2F978-3-319-46454-1',\r\n                'https://link.springer.com/book/10.1007%2F978-3-319-46466-4',\r\n                'https://link.springer.com/book/10.1007%2F978-3-319-46478-7',\r\n                'https://link.springer.com/book/10.1007%2F978-3-319-46484-8'\r\n            ]\r\n        else:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007%2F978-3-319-46604-0',\r\n                'https://link.springer.com/book/10.1007%2F978-3-319-48881-3',\r\n                'https://link.springer.com/book/10.1007%2F978-3-319-49409-8'\r\n            ]\r\n    elif 2014 == year:\r\n        if not is_workshops:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/978-3-319-10590-1',\r\n                'https://link.springer.com/book/10.1007/978-3-319-10605-2',\r\n                'https://link.springer.com/book/10.1007/978-3-319-10578-9',\r\n                'https://link.springer.com/book/10.1007/978-3-319-10593-2',\r\n                'https://link.springer.com/book/10.1007/978-3-319-10602-1',\r\n                'https://link.springer.com/book/10.1007/978-3-319-10599-4',\r\n                'https://link.springer.com/book/10.1007/978-3-319-10584-0'\r\n            ]\r\n        else:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/978-3-319-16178-5',\r\n                'https://link.springer.com/book/10.1007/978-3-319-16181-5',\r\n                'https://link.springer.com/book/10.1007/978-3-319-16199-0',\r\n                'https://link.springer.com/book/10.1007/978-3-319-16220-1'\r\n            ]\r\n    elif 
2012 == year:\r\n        if not is_workshops:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/978-3-642-33718-5',\r\n                'https://link.springer.com/book/10.1007/978-3-642-33709-3',\r\n                'https://link.springer.com/book/10.1007/978-3-642-33712-3',\r\n                'https://link.springer.com/book/10.1007/978-3-642-33765-9',\r\n                'https://link.springer.com/book/10.1007/978-3-642-33715-4',\r\n                'https://link.springer.com/book/10.1007/978-3-642-33783-3',\r\n                'https://link.springer.com/book/10.1007/978-3-642-33786-4'\r\n            ]\r\n        else:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/978-3-642-33863-2',\r\n                'https://link.springer.com/book/10.1007/978-3-642-33868-7',\r\n                'https://link.springer.com/book/10.1007/978-3-642-33885-4'\r\n            ]\r\n    elif 2010 == year:\r\n        if not is_workshops:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/978-3-642-15549-9',\r\n                'https://link.springer.com/book/10.1007/978-3-642-15552-9',\r\n                'https://link.springer.com/book/10.1007/978-3-642-15558-1',\r\n                'https://link.springer.com/book/10.1007/978-3-642-15561-1',\r\n                'https://link.springer.com/book/10.1007/978-3-642-15555-0',\r\n                'https://link.springer.com/book/10.1007/978-3-642-15567-3'\r\n            ]\r\n        else:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/978-3-642-35749-7',\r\n                'https://link.springer.com/book/10.1007/978-3-642-35740-4'\r\n            ]\r\n    elif 2008 == year:\r\n        if not is_workshops:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/978-3-540-88682-2',\r\n                'https://link.springer.com/book/10.1007/978-3-540-88688-4',\r\n            
    'https://link.springer.com/book/10.1007/978-3-540-88690-7',\r\n                'https://link.springer.com/book/10.1007/978-3-540-88693-8'\r\n            ]\r\n        else:\r\n            urls_list = []\r\n    elif 2006 == year:\r\n        if not is_workshops:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/11744023',\r\n                'https://link.springer.com/book/10.1007/11744047',\r\n                'https://link.springer.com/book/10.1007/11744078',\r\n                'https://link.springer.com/book/10.1007/11744085'\r\n            ]\r\n        else:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/11754336'\r\n            ]\r\n    elif 2004 == year:\r\n        if not is_workshops:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/b97865',\r\n                'https://link.springer.com/book/10.1007/b97866',\r\n                'https://link.springer.com/book/10.1007/b97871',\r\n                'https://link.springer.com/book/10.1007/b97873'\r\n            ]\r\n        else:\r\n            urls_list = [\r\n\r\n            ]\r\n    elif 2002 == year:\r\n        if not is_workshops:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/3-540-47969-4',\r\n                'https://link.springer.com/book/10.1007/3-540-47967-8',\r\n                'https://link.springer.com/book/10.1007/3-540-47977-5',\r\n                'https://link.springer.com/book/10.1007/3-540-47979-1'\r\n            ]\r\n        else:\r\n            urls_list = [\r\n\r\n            ]\r\n    elif 2000 == year:\r\n        if not is_workshops:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/3-540-45054-8',\r\n                'https://link.springer.com/book/10.1007/3-540-45053-X'\r\n            ]\r\n        else:\r\n            urls_list = [\r\n\r\n            ]\r\n    elif 1998 == year:\r\n        if not 
is_workshops:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/BFb0055655',\r\n                'https://link.springer.com/book/10.1007/BFb0054729'\r\n            ]\r\n        else:\r\n            urls_list = []\r\n    elif 1996 == year:\r\n        if not is_workshops:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/BFb0015518',\r\n                'https://link.springer.com/book/10.1007/3-540-61123-1'\r\n            ]\r\n        else:\r\n            urls_list = []\r\n    elif 1994 == year:\r\n        if not is_workshops:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/3-540-57956-7',\r\n                'https://link.springer.com/book/10.1007/BFb0028329'\r\n            ]\r\n        else:\r\n            urls_list = []\r\n    elif 1992 == year:\r\n        if not is_workshops:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/3-540-55426-2'\r\n            ]\r\n        else:\r\n            urls_list = []\r\n    elif 1990 == year:\r\n        if not is_workshops:\r\n            urls_list = [\r\n                'https://link.springer.com/book/10.1007/BFb0014843'\r\n            ]\r\n        else:\r\n            urls_list = []\r\n    else:\r\n        raise ValueError(f'ECCV {year} is currently not available!')\r\n    for url in urls_list:\r\n        __download_from_springer(\r\n            url, save_dir, year, is_workshops=is_workshops,\r\n            time_sleep_in_seconds=time_sleep_in_seconds,\r\n            downloader=downloader)\r\n\r\n\r\ndef __download_from_springer(\r\n        url, save_dir, year, is_workshops=False, time_sleep_in_seconds=5,\r\n        downloader='IDM'):\r\n    downloader = Downloader(downloader)\r\n    for i in range(3):\r\n        try:\r\n            papers_dict = 
springer.get_paper_name_link_from_url(url)\r\n            break\r\n        except Exception as e:\r\n            print(str(e))\r\n    else:\r\n        # all three retries failed; skip this volume instead of raising a\r\n        # NameError on the undefined papers_dict below\r\n        print(f'failed to open {url} after 3 retries!')\r\n        return\r\n    # total_paper_number = len(papers_dict)\r\n    pbar = tqdm(papers_dict.keys())\r\n    postfix = f'ECCV_{year}'\r\n    if is_workshops:\r\n        postfix = f'ECCV_WS_{year}'\r\n\r\n    for name in pbar:\r\n        pbar.set_description(f'Downloading paper {name}')\r\n        if not os.path.exists(os.path.join(save_dir, f'{name}_{postfix}.pdf')):\r\n            downloader.download(\r\n                papers_dict[name],\r\n                os.path.join(save_dir, f'{name}_{postfix}.pdf'),\r\n                time_sleep_in_seconds)\r\n\r\n\r\nif __name__ == '__main__':\r\n    year = 2024\r\n    # total_paper_number = 2387\r\n    total_paper_number = save_csv(year)\r\n    download_from_csv(year,\r\n                      save_dir=fr'Z:\\all_papers\\ECCV\\ECCV_{year}',\r\n                      is_download_supplement=True,\r\n                      time_step_in_seconds=5,\r\n                      total_paper_number=total_paper_number,\r\n                      is_workshops=False)\r\n    # move_main_and_supplement_2_one_directory(\r\n    #     main_path=f'E:\\\\ECCV_{year}\\\\main_paper',\r\n    #     supplement_path=f'E:\\\\ECCV_{year}\\\\supplement',\r\n    #     supp_pdf_save_path=f'E:\\\\ECCV_{year}\\\\main_paper'\r\n    # )\r\n    # for year in range(2018, 2017, -2):\r\n    #     # download_from_springer(\r\n    #     #     save_dir=f'F:\\\\ECCV_{year}',\r\n    #     #     year=year,\r\n    #     #     is_workshops=False, time_sleep_in_seconds=30)\r\n    #     download_from_springer(\r\n    #         save_dir=f'F:\\\\ECCV_WS_{year}',\r\n    #         year=year,\r\n    #         is_workshops=True, time_sleep_in_seconds=30)\r\n    # pass\r\n"
  },
  {
    "path": "code/paper_downloader_ICLR.py",
    "content": "\"\"\"paper_downloader_ICLR.py\"\"\"\r\n\r\nfrom tqdm import tqdm\r\nimport os\r\n# https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename\r\nfrom slugify import slugify\r\nfrom bs4 import BeautifulSoup\r\nimport pickle\r\nfrom urllib.request import urlopen\r\nimport urllib\r\nimport sys\r\n\r\nroot_folder = os.path.abspath(\r\n    os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\nsys.path.append(root_folder)\r\nfrom lib.downloader import Downloader\r\nfrom lib.openreview import download_iclr_papers_given_url_and_group_id\r\nfrom lib.arxiv import get_pdf_link_from_arxiv\r\n\r\n\r\ndef download_iclr_oral_papers(save_dir, year, base_url=None,\r\n                              time_step_in_seconds=10, downloader='IDM',\r\n                              start_page=1, proxy_ip_port=None):\r\n    \"\"\"\r\n    Download ICLR oral papers for years 2013, 2017 ~ 2022 and 2024 ~ 2026.\r\n    :param save_dir: str, paper save path\r\n    :param year: int, ICLR year; only the years listed in group_id_dict\r\n        below are supported\r\n    :param base_url: str, paper website url\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds.\r\n    :param downloader: str, the downloader to use, could be 'IDM' or\r\n        None, default to 'IDM'.\r\n    :param start_page: int, the initial downloading webpage number, only the\r\n        pages whose number is equal to or greater than this number will be\r\n        processed. Currently, this parameter is only used in year 2024.\r\n        Default: 1.\r\n    :param proxy_ip_port: str or None, proxy ip address and port, e.g.\r\n        \"127.0.0.1:7890\". 
Default: None.\r\n    :type proxy_ip_port: str | None\r\n    :return:\r\n    \"\"\"\r\n    group_id_dict = {\r\n        2026: \"tab-accept-oral\",\r\n        2025: \"tab-accept-oral\",\r\n        2024: \"tab-accept-oral\",\r\n        2022: \"oral-submissions\",\r\n        2021: \"oral-presentations\",\r\n        2020: \"oral-presentations\",\r\n        2019: \"oral-presentations\",\r\n        2018: \"accepted-oral-papers\",\r\n        2017: \"oral-presentations\",\r\n        2013: \"conferenceoral-iclr2013-conference\"\r\n    }\r\n\r\n    if base_url is None:\r\n        if year in group_id_dict:\r\n            base_url = 'https://openreview.net/group?id=ICLR.cc/' \\\r\n                f'{year}/Conference#{group_id_dict[year]}'\r\n        else:\r\n            raise ValueError('the website url is not given for this year!')\r\n\r\n    print(f'Downloading ICLR-{year} oral papers...')\r\n    group_id = group_id_dict[year].replace('tab-', '')\r\n    download_iclr_papers_given_url_and_group_id(\r\n        save_dir=save_dir,\r\n        year=year,\r\n        base_url=base_url,\r\n        group_id=group_id,\r\n        start_page=start_page,\r\n        time_step_in_seconds=time_step_in_seconds,\r\n        downloader=downloader,\r\n        proxy_ip_port=proxy_ip_port,\r\n        is_have_pages=(year > 2021)\r\n    )\r\n\r\n\r\ndef download_iclr_conditional_oral_papers(save_dir, year, base_url=None,\r\n                              time_step_in_seconds=10, downloader='IDM',\r\n                              start_page=1, proxy_ip_port=None):\r\n    \"\"\"\r\n    Download ICLR conditional oral papers for year 2025.\r\n    :param save_dir: str, paper save path\r\n    :param year: int, ICLR year; currently only 2025 is supported\r\n    :param base_url: str, paper website url\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds.\r\n    :param downloader: str, the downloader to use, could be 'IDM' or\r\n        
None, default to 'IDM'.\r\n    :param start_page: int, the initial downloading webpage number, only the\r\n        pages whose number is equal to or greater than this number will be\r\n        processed. Default: 1.\r\n    :param proxy_ip_port: str or None, proxy ip address and port, e.g.\r\n        \"127.0.0.1:7890\". Default: None.\r\n    :type proxy_ip_port: str | None\r\n    :return:\r\n    \"\"\"\r\n    group_id_dict = {\r\n        2025: \"tab-accept-conditional-oral\"\r\n    }\r\n    no_pages_year = [2025]\r\n    if base_url is None:\r\n        if year in group_id_dict:\r\n            base_url = 'https://openreview.net/group?id=ICLR.cc/' \\\r\n                f'{year}/Conference#{group_id_dict[year]}'\r\n        else:\r\n            raise ValueError('the website url is not given for this year!')\r\n    print(f'Downloading ICLR-{year} conditional oral papers...')\r\n    group_id = group_id_dict[year].replace('tab-', '')\r\n    download_iclr_papers_given_url_and_group_id(\r\n        save_dir=save_dir,\r\n        year=year,\r\n        base_url=base_url,\r\n        group_id=group_id,\r\n        start_page=start_page,\r\n        time_step_in_seconds=time_step_in_seconds,\r\n        downloader=downloader,\r\n        proxy_ip_port=proxy_ip_port,\r\n        is_have_pages=(year not in no_pages_year)\r\n    )\r\n\r\n\r\ndef download_iclr_top5_papers(save_dir, year, base_url=None, start_page=1,\r\n                              time_step_in_seconds=10, downloader='IDM',\r\n                              proxy_ip_port=None):\r\n    \"\"\"\r\n    Download ICLR notable-top-5% papers for year 2023.\r\n    :param save_dir: str, paper save path\r\n    :param year: int, ICLR year\r\n    :type year: int\r\n    :param base_url: str, paper website url\r\n    :param start_page: int, the initial downloading webpage number, only the\r\n        pages whose number is equal to or greater than this number will be\r\n       
 processed. Default: 1\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds. Default: 10.\r\n    :type time_step_in_seconds: int\r\n    :param downloader: str, the downloader to use, could be 'IDM' or\r\n        None. Default: 'IDM'.\r\n    :param proxy_ip_port: str or None, proxy ip address and port, e.g.\r\n        \"127.0.0.1:7890\". Default: None.\r\n    :type proxy_ip_port: str | None\r\n    :return:\r\n    \"\"\"\r\n    if base_url is None:\r\n        if year == 2023:\r\n            base_url = \"https://openreview.net/group?id=ICLR.cc/\" \\\r\n                       \"2023/Conference#notable-top-5-\"\r\n        else:\r\n            raise ValueError('the website url is not given for this year!')\r\n    print(f'Downloading ICLR-{year} top5 papers...')\r\n    group_id = \"notable-top-5-\"\r\n    return download_iclr_papers_given_url_and_group_id(\r\n        save_dir=save_dir,\r\n        year=year,\r\n        base_url=base_url,\r\n        group_id=group_id,\r\n        start_page=start_page,\r\n        time_step_in_seconds=time_step_in_seconds,\r\n        downloader=downloader,\r\n        proxy_ip_port=proxy_ip_port\r\n    )\r\n\r\n\r\ndef download_iclr_poster_papers(save_dir, year, base_url=None, start_page=1,\r\n                                time_step_in_seconds=10, downloader='IDM',\r\n                                proxy_ip_port=None):\r\n    \"\"\"\r\n    Download ICLR poster papers for years 2013 and 2017 ~ 2026.\r\n    :param save_dir: str, paper save path\r\n    :param year: int, ICLR year; only the years listed in group_id_dict\r\n        below are supported\r\n    :param base_url: str, paper website url\r\n    :param start_page: int, the initial downloading webpage number, only the\r\n        pages whose number is equal to or greater than this number will be\r\n        processed. 
Default: 1\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds\r\n    :param downloader: str, the downloader to use, could be 'IDM' or\r\n        None. Default: 'IDM'\r\n    :param proxy_ip_port: str or None, proxy ip address and port, e.g.\r\n        \"127.0.0.1:7890\". Default: None.\r\n    :type proxy_ip_port: str | None\r\n    :return:\r\n    \"\"\"\r\n    group_id_dict = {\r\n        2026: \"tab-accept-poster\",\r\n        2025: \"tab-accept-poster\",\r\n        2024: \"tab-accept-poster\",\r\n        2023: \"poster\",\r\n        2022: \"poster-submissions\",\r\n        2021: \"poster-presentations\",\r\n        2020: \"poster-presentations\",\r\n        2019: \"poster-presentations\",\r\n        2018: \"accepted-poster-papers\",\r\n        2017: \"poster-presentations\",\r\n        2013: \"conferenceposter-iclr2013-conference\"\r\n    }\r\n    if base_url is None:\r\n        if year in group_id_dict:\r\n            base_url = 'https://openreview.net/group?id=ICLR.cc/' \\\r\n                f'{year}/Conference#{group_id_dict[year]}'\r\n        else:\r\n            raise ValueError('the website url is not given for this year!')\r\n    print(f'Downloading ICLR-{year} poster papers...')\r\n    no_pages_year = [2013, 2018, 2019, 2020, 2021]\r\n    download_iclr_papers_given_url_and_group_id(\r\n        save_dir=save_dir,\r\n        year=year,\r\n        base_url=base_url,\r\n        group_id=group_id_dict[year].replace('tab-', ''),\r\n        start_page=start_page,\r\n        time_step_in_seconds=time_step_in_seconds,\r\n        downloader=downloader,\r\n        proxy_ip_port=proxy_ip_port,\r\n        is_have_pages=(year not in no_pages_year),\r\n        is_need_click_group_button=(year == 2018)\r\n    )\r\n\r\n\r\ndef download_iclr_conditional_poster_papers(save_dir, year, base_url=None,\r\n                              time_step_in_seconds=10, downloader='IDM',\r\n                              
start_page=1, proxy_ip_port=None):\r\n    \"\"\"\r\n    Download ICLR conditional poster papers for year 2025.\r\n    :param save_dir: str, paper save path\r\n    :param year: int, ICLR year; currently only 2025 is supported\r\n    :param base_url: str, paper website url\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds.\r\n    :param downloader: str, the downloader to use, could be 'IDM' or\r\n        None, default to 'IDM'.\r\n    :param start_page: int, the initial downloading webpage number, only the\r\n        pages whose number is equal to or greater than this number will be\r\n        processed. Default: 1.\r\n    :param proxy_ip_port: str or None, proxy ip address and port, e.g.\r\n        \"127.0.0.1:7890\". Default: None.\r\n    :type proxy_ip_port: str | None\r\n    :return:\r\n    \"\"\"\r\n    group_id_dict = {\r\n        2025: \"tab-accept-conditional-poster\"\r\n    }\r\n    if base_url is None:\r\n        if year in group_id_dict:\r\n            base_url = 'https://openreview.net/group?id=ICLR.cc/' \\\r\n                f'{year}/Conference#{group_id_dict[year]}'\r\n        else:\r\n            raise ValueError('the website url is not given for this year!')\r\n    print(f'Downloading ICLR-{year} conditional poster papers...')\r\n    group_id = group_id_dict[year].replace('tab-', '')\r\n    download_iclr_papers_given_url_and_group_id(\r\n        save_dir=save_dir,\r\n        year=year,\r\n        base_url=base_url,\r\n        group_id=group_id,\r\n        start_page=start_page,\r\n        time_step_in_seconds=time_step_in_seconds,\r\n        downloader=downloader,\r\n        proxy_ip_port=proxy_ip_port,\r\n        is_have_pages=(year > 2021)\r\n    )\r\n\r\n\r\ndef download_iclr_spotlight_papers(save_dir, year, base_url=None,\r\n                                   time_step_in_seconds=10, downloader='IDM',\r\n             
                      start_page=1, proxy_ip_port=None):\r\n    \"\"\"\r\n    Download ICLR spotlight papers for years 2020 ~ 2022 and 2024 ~ 2025.\r\n    :param save_dir: str, paper save path\r\n    :param year: int, ICLR year; only the years listed in group_id_dict\r\n        below are supported\r\n    :param base_url: str, paper website url\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds\r\n    :param downloader: str, the downloader to use, could be 'IDM' or\r\n        None, default to 'IDM'\r\n    :param start_page: int, the initial downloading webpage number, only the\r\n        pages whose number is equal to or greater than this number will be\r\n        processed. Currently, this parameter is only used in year 2024.\r\n        Default: 1.\r\n    :param proxy_ip_port: str or None, proxy ip address and port, e.g.\r\n        \"127.0.0.1:7890\". Default: None.\r\n    :return:\r\n    \"\"\"\r\n    group_id_dict = {\r\n        2025: \"tab-accept-spotlight\",\r\n        2024: \"tab-accept-spotlight\",\r\n        2022: \"spotlight-submissions\",\r\n        2021: \"spotlight-presentations\",\r\n        2020: \"spotlight-presentations\",\r\n    }\r\n    if base_url is None:\r\n        if year in group_id_dict:\r\n            base_url = 'https://openreview.net/group?id=ICLR.cc/' \\\r\n                f'{year}/Conference#{group_id_dict[year]}'\r\n        else:\r\n            raise ValueError('the website url is not given for this year!')\r\n    print(f'Downloading ICLR-{year} spotlight papers...')\r\n    no_pages_year = [2020, 2021]\r\n    download_iclr_papers_given_url_and_group_id(\r\n        save_dir=save_dir,\r\n        year=year,\r\n        base_url=base_url,\r\n        group_id=group_id_dict[year].replace('tab-', ''),\r\n        start_page=start_page,\r\n        time_step_in_seconds=time_step_in_seconds,\r\n        downloader=downloader,\r\n        proxy_ip_port=proxy_ip_port,\r\n        is_have_pages=(year not in 
no_pages_year)\r\n    )\r\n\r\n\r\ndef download_iclr_conditional_spotlight_papers(save_dir, year, base_url=None,\r\n                              time_step_in_seconds=10, downloader='IDM',\r\n                              start_page=1, proxy_ip_port=None):\r\n    \"\"\"\r\n    Download ICLR conditional spotlight papers for year 2025.\r\n    :param save_dir: str, paper save path\r\n    :param year: int, ICLR year; currently only 2025 is supported\r\n    :param base_url: str, paper website url\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds.\r\n    :param downloader: str, the downloader to use, could be 'IDM' or\r\n        None, default to 'IDM'.\r\n    :param start_page: int, the initial downloading webpage number, only the\r\n        pages whose number is equal to or greater than this number will be\r\n        processed. Default: 1.\r\n    :param proxy_ip_port: str or None, proxy ip address and port, e.g.\r\n        \"127.0.0.1:7890\". 
Default: None.\r\n    :type proxy_ip_port: str | None\r\n    :return:\r\n    \"\"\"\r\n    group_id_dict = {\r\n        2025: \"tab-accept-conditional-spotlight\"\r\n    }\r\n    no_pages_year = [2025]\r\n    if base_url is None:\r\n        if year in group_id_dict:\r\n            base_url = 'https://openreview.net/group?id=ICLR.cc/' \\\r\n                f'{year}/Conference#{group_id_dict[year]}'\r\n        else:\r\n            raise ValueError('the website url is not given for this year!')\r\n    print(f'Downloading ICLR-{year} conditional spotlight papers...')\r\n    group_id = group_id_dict[year].replace('tab-', '')\r\n    download_iclr_papers_given_url_and_group_id(\r\n        save_dir=save_dir,\r\n        year=year,\r\n        base_url=base_url,\r\n        group_id=group_id,\r\n        start_page=start_page,\r\n        time_step_in_seconds=time_step_in_seconds,\r\n        downloader=downloader,\r\n        proxy_ip_port=proxy_ip_port,\r\n        is_have_pages=(year not in no_pages_year)\r\n    )\r\n\r\n\r\ndef download_iclr_top25_papers(save_dir, year, base_url=None, start_page=1,\r\n                               time_step_in_seconds=10, downloader='IDM',\r\n                               proxy_ip_port=None):\r\n    \"\"\"\r\n    Download ICLR notable-top-25% papers for year 2023.\r\n    :param save_dir: str, paper save path\r\n    :param year: int, ICLR year\r\n    :type year: int\r\n    :param base_url: str, paper website url\r\n    :param start_page: int, the initial downloading webpage number, only the\r\n        pages whose number is equal to or greater than this number will be\r\n        processed. Default: 1\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds. Default: 10.\r\n    :type time_step_in_seconds: int\r\n    :param downloader: str, the downloader to use, could be 'IDM' or\r\n        None. 
Default: 'IDM'.\r\n    :param proxy_ip_port: str or None, proxy ip address and port, e.g.\r\n        \"127.0.0.1:7890\". Default: None.\r\n    :type proxy_ip_port: str | None\r\n    :return:\r\n    \"\"\"\r\n    if base_url is None:\r\n        if year == 2023:\r\n            base_url = \"https://openreview.net/group?id=ICLR.cc/\" \\\r\n                       \"2023/Conference#notable-top-25-\"\r\n        else:\r\n            raise ValueError('the website url is not given for this year!')\r\n    print(f'Downloading ICLR-{year} top25 papers...')\r\n    group_id = \"notable-top-25-\"\r\n    download_iclr_papers_given_url_and_group_id(\r\n        save_dir=save_dir,\r\n        year=year,\r\n        base_url=base_url,\r\n        group_id=group_id,\r\n        start_page=start_page,\r\n        time_step_in_seconds=time_step_in_seconds,\r\n        downloader=downloader,\r\n        proxy_ip_port=proxy_ip_port\r\n    )\r\n\r\n\r\ndef download_iclr_paper(save_dir, year, base_url=None,\r\n                        time_step_in_seconds=10, downloader='IDM',\r\n                        start_page=1, proxy_ip_port=None):\r\n    \"\"\"\r\n    Download ICLR papers for years 2013 ~ 2026.\r\n    :param save_dir: str, paper save path\r\n    :param year: int, ICLR year; only the years handled below are supported\r\n    :param base_url: str, paper website url\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds.\r\n    :param downloader: str, the downloader to use, could be 'IDM' or\r\n        None, default to 'IDM'.\r\n    :param start_page: int, the initial downloading webpage number, only the\r\n        pages whose number is equal to or greater than this number will be\r\n        processed. Currently, this parameter is only used in year 2024.\r\n        Default: 1.\r\n    :param proxy_ip_port: str or None, proxy ip address and port, e.g.\r\n        \"127.0.0.1:7890\". 
Default: None.\r\n    :type proxy_ip_port: str | None\r\n    :return:\r\n    \"\"\"\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n\r\n    year_no_group = [2014]\r\n    year_no_group_iclrcc = [2015, 2016]\r\n    year_oral_poster = [2013, 2017, 2018, 2019, 2026]\r\n    year_oral_spotlight_poster = [2020, 2021, 2022, 2024, 2025]\r\n    year_top5_top25_poster = [2023]\r\n    year_oral_spotlight_poster_conditional = [2025]\r\n\r\n    # no group, openreview website\r\n    if year in year_no_group:\r\n        if base_url is None:\r\n            if year == 2014:\r\n                base_url = 'https://openreview.net/group?id=ICLR.cc/2014/conference'\r\n            else:\r\n                raise ValueError('the website url is not given for this year!')\r\n        print(f'Downloading ICLR-{year} oral papers...')\r\n        group_id_dict = {\r\n            2014: \"submitted-papers\"\r\n        }\r\n        group_id = group_id_dict[year]\r\n        no_pages_year = [2014]\r\n        return download_iclr_papers_given_url_and_group_id(\r\n            save_dir=save_dir,\r\n            year=year,\r\n            base_url=base_url,\r\n            group_id=group_id,\r\n            start_page=start_page,\r\n            time_step_in_seconds=time_step_in_seconds,\r\n            downloader=downloader,\r\n            proxy_ip_port=proxy_ip_port,\r\n            is_have_pages=(year not in no_pages_year)\r\n        )\r\n    # no group, iclr.cc website\r\n    if year in year_no_group_iclrcc:\r\n        downloader = Downloader(downloader=downloader)\r\n        paper_postfix = f'ICLR_{year}'\r\n        if base_url is None:\r\n            if year == 2016:\r\n                base_url = 'https://iclr.cc/archive/www/doku.php%3Fid=iclr2016:main.html'\r\n            elif year == 2015:\r\n                base_url = 'https://iclr.cc/archive/www/doku.php%3Fid=iclr2015:main.html'\r\n            elif year == 2014:\r\n                
base_url = 'https://iclr.cc/archive/2014/conference-proceedings/'\r\n            else:\r\n                raise ValueError('the website url is not given for this year!')\r\n        os.makedirs(save_dir, exist_ok=True)\r\n        if year == 2015:  # oral and poster separated\r\n            oral_save_path = os.path.join(save_dir, 'oral')\r\n            poster_save_path = os.path.join(save_dir, 'poster')\r\n            workshop_save_path = os.path.join(save_dir, 'ws')\r\n            os.makedirs(oral_save_path, exist_ok=True)\r\n            os.makedirs(poster_save_path, exist_ok=True)\r\n            os.makedirs(workshop_save_path, exist_ok=True)\r\n        dat_file_pathname = os.path.join(\r\n            project_root_folder, 'urls', f'init_url_iclr_{year}.dat'\r\n        )\r\n        if os.path.exists(dat_file_pathname):\r\n            with open(dat_file_pathname, 'rb') as f:\r\n                content = pickle.load(f)\r\n        else:\r\n            headers = {\r\n                'User-Agent':\r\n                    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '\r\n                    'Gecko/20100101 Firefox/23.0'}\r\n            req = urllib.request.Request(url=base_url, headers=headers)\r\n            content = urllib.request.urlopen(req).read()\r\n            # cache under the computed path instead of a hard-coded\r\n            # cwd-relative Windows path\r\n            os.makedirs(os.path.dirname(dat_file_pathname), exist_ok=True)\r\n            with open(dat_file_pathname, 'wb') as f:\r\n                pickle.dump(content, f)\r\n        error_log = []\r\n        soup = BeautifulSoup(content, 'html.parser')\r\n        print('open url successfully!')\r\n        if year == 2016:\r\n            papers = soup.find('h3',\r\n                               {\r\n                                   'id': 'accepted_papers_conference_track'}).findNext(\r\n                'div').find_all('a')\r\n            for paper in tqdm(papers):\r\n                link = paper.get('href')\r\n                if link.startswith('http://arxiv'):\r\n                    title = slugify(paper.text)\r\n                    pdf_name = 
f'{title}_{paper_postfix}.pdf'\r\n                    try:\r\n                        if not os.path.exists(\r\n                                os.path.join(save_dir,\r\n                                             title + f'_{paper_postfix}.pdf')):\r\n                            pdf_link = get_pdf_link_from_arxiv(link)\r\n                            print(f'downloading {title}')\r\n                            downloader.download(\r\n                                urls=pdf_link,\r\n                                save_path=os.path.join(save_dir, pdf_name),\r\n                                time_sleep_in_seconds=time_step_in_seconds\r\n                            )\r\n                    except Exception as e:\r\n                        # error_flag = True\r\n                        print('Error: ' + title + ' - ' + str(e))\r\n                        error_log.append(\r\n                            (title, link, 'paper download error', str(e)))\r\n            # workshops\r\n            papers = soup.find('h3',\r\n                               {\r\n                                   'id': 'workshop_track_posters_may_2nd'}).findNext(\r\n                'div').find_all('a')\r\n            for paper in tqdm(papers):\r\n                link = paper.get('href')\r\n                if link.startswith('http://beta.openreview'):\r\n                    title = slugify(paper.text)\r\n                    pdf_name = f'{title}_ICLR_WS_{year}.pdf'\r\n                    try:\r\n                        if not os.path.exists(\r\n                                os.path.join(save_dir, 'ws', pdf_name)):\r\n                            pdf_link = get_pdf_link_from_openreview(link)\r\n                            print(f'downloading {title}')\r\n                            downloader.download(\r\n                                urls=pdf_link,\r\n                                save_path=os.path.join(save_dir, 'ws',\r\n                                                       pdf_name),\r\n  
                              time_sleep_in_seconds=time_step_in_seconds\r\n                            )\r\n                    except Exception as e:\r\n                        # error_flag = True\r\n                        print('Error: ' + title + ' - ' + str(e))\r\n                        error_log.append(\r\n                            (title, link, 'paper download error', str(e)))\r\n            papers = soup.find('h3',\r\n                               {\r\n                                   'id': 'workshop_track_posters_may_3rd'}).findNext(\r\n                'div').find_all('a')\r\n            for paper in tqdm(papers):\r\n                link = paper.get('href')\r\n                if link.startswith('http://beta.openreview'):\r\n                    title = slugify(paper.text)\r\n                    pdf_name = f'{title}_ICLR_WS_{year}.pdf'\r\n                    try:\r\n                        if not os.path.exists(\r\n                                os.path.join(save_dir, 'ws', pdf_name)):\r\n                            pdf_link = get_pdf_link_from_openreview(link)\r\n                            print(f'downloading {title}')\r\n                            downloader.download(\r\n                                urls=pdf_link,\r\n                                save_path=os.path.join(save_dir, 'ws',\r\n                                                       pdf_name),\r\n                                time_sleep_in_seconds=time_step_in_seconds\r\n                            )\r\n                    except Exception as e:\r\n                        # error_flag = True\r\n                        print('Error: ' + title + ' - ' + str(e))\r\n                        error_log.append(\r\n                            (title, link, 'paper download error', str(e)))\r\n        elif year == 2015:\r\n            # oral papers\r\n            oral_papers = soup.find('h3', {\r\n                'id': 'conference_oral_presentations'}).findNext(\r\n                
'div').find_all(\r\n                'a')\r\n            for paper in tqdm(oral_papers):\r\n                link = paper.get('href')\r\n                if link.startswith('http://arxiv'):\r\n                    title = slugify(paper.text)\r\n                    pdf_name = f'{title}_{paper_postfix}.pdf'\r\n                    try:\r\n                        if not os.path.exists(\r\n                                os.path.join(oral_save_path,\r\n                                             title + f'_{paper_postfix}.pdf')):\r\n                            pdf_link = get_pdf_link_from_arxiv(link)\r\n                            print(f'downloading {title}')\r\n                            downloader.download(\r\n                                urls=pdf_link,\r\n                                save_path=os.path.join(oral_save_path,\r\n                                                       pdf_name),\r\n                                time_sleep_in_seconds=time_step_in_seconds\r\n                            )\r\n                    except Exception as e:\r\n                        # error_flag = True\r\n                        print('Error: ' + title + ' - ' + str(e))\r\n                        error_log.append(\r\n                            (title, link, 'paper download error', str(e)))\r\n\r\n            # workshop papers\r\n            workshop_papers = soup.find('h3', {\r\n                'id': 'may_7_workshop_poster_session'}).findNext(\r\n                'div').find_all(\r\n                'a')\r\n            # extend, not append: find_all returns a list, and appending it\r\n            # would nest the second day's links inside a single element\r\n            workshop_papers.extend(\r\n                soup.find('h3',\r\n                          {'id': 'may_8_workshop_poster_session'}).findNext(\r\n                    'div').find_all('a'))\r\n            for paper in tqdm(workshop_papers):\r\n                link = paper.get('href')\r\n                if link.startswith('http://arxiv'):\r\n                    title = slugify(paper.text)\r\n                    pdf_name = f'{title}_ICLR_WS_{year}.pdf'\r\n                    try:\r\n                        # check the same filename that is saved below, so\r\n                        # re-runs skip already-downloaded workshop papers\r\n                        if not os.path.exists(\r\n                                os.path.join(workshop_save_path, pdf_name)):\r\n                            pdf_link = get_pdf_link_from_arxiv(link)\r\n                            print(f'downloading {title}')\r\n                            downloader.download(\r\n                                urls=pdf_link,\r\n                                save_path=os.path.join(workshop_save_path,\r\n                                                       pdf_name),\r\n                                time_sleep_in_seconds=time_step_in_seconds)\r\n                    except Exception as e:\r\n                        # error_flag = True\r\n                        print('Error: ' + title + ' - ' + str(e))\r\n                        error_log.append(\r\n                            (title, link, 'paper download error', str(e)))\r\n            # poster papers\r\n            poster_papers = soup.find('h3', {\r\n                'id': 'may_9_conference_poster_session'}).findNext(\r\n                'div').find_all(\r\n                'a')\r\n            for paper in tqdm(poster_papers):\r\n                link = paper.get('href')\r\n                if link.startswith('http://arxiv'):\r\n                    title = slugify(paper.text)\r\n                    pdf_name = f'{title}_{paper_postfix}.pdf'\r\n                    try:\r\n                        if not os.path.exists(\r\n                                os.path.join(poster_save_path,\r\n                                             title + f'_{paper_postfix}.pdf')):\r\n                            pdf_link = get_pdf_link_from_arxiv(link)\r\n                            print(f'downloading {title}')\r\n                            downloader.download(\r\n                                urls=pdf_link,\r\n                                save_path=os.path.join(poster_save_path,\r\n                         
                              pdf_name),\r\n                                time_sleep_in_seconds=time_step_in_seconds)\r\n                    except Exception as e:\r\n                        # error_flag = True\r\n                        print('Error: ' + title + ' - ' + str(e))\r\n                        error_log.append(\r\n                            (title, link, 'paper download error', str(e)))\r\n        elif year == 2014:\r\n            papers = soup.find('div',\r\n                               {'id': 'sites-canvas-main-content'}).find_all(\r\n                'a')\r\n            for paper in tqdm(papers):\r\n                link = paper.get('href')\r\n                if link.startswith('http://arxiv'):\r\n                    title = slugify(paper.text)\r\n                    pdf_name = f'{title}_{paper_postfix}.pdf'\r\n                    try:\r\n                        if not os.path.exists(os.path.join(save_dir, pdf_name)):\r\n                            pdf_link = get_pdf_link_from_arxiv(link)\r\n                            print(f'downloading {title}')\r\n                            downloader.download(\r\n                                urls=pdf_link,\r\n                                save_path=os.path.join(save_dir, pdf_name),\r\n                                time_sleep_in_seconds=time_step_in_seconds)\r\n                    except Exception as e:\r\n                        # error_flag = True\r\n                        print('Error: ' + title + ' - ' + str(e))\r\n                        error_log.append(\r\n                            (title, link, 'paper download error', str(e)))\r\n\r\n            # workshops\r\n            paper_postfix = f'ICLR_WS_{year}'\r\n            base_url = 'https://sites.google.com/site/representationlearning2014/' \\\r\n                       'workshop-proceedings'\r\n            headers = {\r\n                'User-Agent':\r\n                    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '\r\n                    
'Gecko/20100101 Firefox/23.0'}\r\n            req = urllib.request.Request(url=base_url, headers=headers)\r\n            content = urllib.request.urlopen(req).read()\r\n            soup = BeautifulSoup(content, 'html.parser')\r\n            workshop_save_path = os.path.join(save_dir, 'WS')\r\n            os.makedirs(workshop_save_path, exist_ok=True)\r\n            papers = soup.find(\r\n                'div', {'id': 'sites-canvas-main-content'}).find_all('a')\r\n            for paper in tqdm(papers):\r\n                link = paper.get('href')\r\n                if link.startswith('http://arxiv'):\r\n                    title = slugify(paper.text)\r\n                    pdf_name = f'{title}_{paper_postfix}.pdf'\r\n                    try:\r\n                        if not os.path.exists(\r\n                                os.path.join(workshop_save_path, pdf_name)):\r\n                            pdf_link = get_pdf_link_from_arxiv(link)\r\n                            print(f'downloading {title}')\r\n                            downloader.download(\r\n                                urls=pdf_link,\r\n                                save_path=os.path.join(workshop_save_path,\r\n                                                       pdf_name),\r\n                                time_sleep_in_seconds=time_step_in_seconds)\r\n                    except Exception as e:\r\n                        # error_flag = True\r\n                        print('Error: ' + title + ' - ' + str(e))\r\n                        error_log.append(\r\n                            (title, link, 'paper download error', str(e)))\r\n\r\n        # write error log\r\n        print('write error log')\r\n        log_file_pathname = os.path.join(\r\n            project_root_folder, 'log', 'download_err_log.txt')\r\n        with open(log_file_pathname, 'w') as f:\r\n            for log in tqdm(error_log):\r\n                for e in log:\r\n                    if e is not None:\r\n                      
  f.write(e)\r\n                    else:\r\n                        f.write('None')\r\n                    f.write('\\n')\r\n\r\n                f.write('\\n')\r\n        return True\r\n\r\n    # oral openreview\r\n    if year in (year_oral_poster + year_oral_spotlight_poster):\r\n        save_dir_oral = os.path.join(save_dir, 'oral')\r\n        download_iclr_oral_papers(\r\n            save_dir_oral,\r\n            year,\r\n            time_step_in_seconds=time_step_in_seconds,\r\n            downloader=downloader,\r\n            start_page=start_page,\r\n            proxy_ip_port=proxy_ip_port\r\n        )\r\n    \r\n    # conditional oral openreview\r\n    if year in (year_oral_spotlight_poster_conditional):\r\n        save_dir_cond_oral = os.path.join(save_dir, 'conditional-oral')\r\n        download_iclr_conditional_oral_papers(\r\n            save_dir_cond_oral,\r\n            year,\r\n            time_step_in_seconds=time_step_in_seconds,\r\n            downloader=downloader,\r\n            start_page=start_page,\r\n            proxy_ip_port=proxy_ip_port\r\n        )\r\n\r\n    # poster openreview\r\n    if year in (year_oral_poster + year_oral_spotlight_poster +\r\n                year_top5_top25_poster):\r\n        save_dir_poster = os.path.join(save_dir, 'poster')\r\n        download_iclr_poster_papers(\r\n            save_dir_poster,\r\n            year,\r\n            time_step_in_seconds=time_step_in_seconds,\r\n            downloader=downloader,\r\n            start_page=start_page,\r\n            proxy_ip_port=proxy_ip_port\r\n        )\r\n    \r\n    # conditional poster openreview\r\n    if year in (year_oral_spotlight_poster_conditional):\r\n        save_dir_cond_poster = os.path.join(save_dir, 'conditional-poster')\r\n        download_iclr_conditional_poster_papers(\r\n            save_dir_cond_poster,\r\n            year,\r\n            time_step_in_seconds=time_step_in_seconds,\r\n            downloader=downloader,\r\n            
start_page=start_page,\r\n            proxy_ip_port=proxy_ip_port\r\n        )\r\n\r\n    # spotlight openreview\r\n    if year in year_oral_spotlight_poster:\r\n        save_dir_spotlight = os.path.join(save_dir, 'spotlight')\r\n        download_iclr_spotlight_papers(\r\n            save_dir_spotlight,\r\n            year,\r\n            time_step_in_seconds=time_step_in_seconds,\r\n            downloader=downloader,\r\n            start_page=start_page,\r\n            proxy_ip_port=proxy_ip_port\r\n        )\r\n\r\n    # conditional spotlight openreview\r\n    if year in (year_oral_spotlight_poster_conditional):\r\n        save_dir_cond_spotlight = os.path.join(save_dir, 'conditional-spotlight')\r\n        download_iclr_conditional_spotlight_papers(\r\n            save_dir_cond_spotlight,\r\n            year,\r\n            time_step_in_seconds=time_step_in_seconds,\r\n            downloader=downloader,\r\n            start_page=start_page,\r\n            proxy_ip_port=proxy_ip_port\r\n        )\r\n\r\n    # top5 openreview\r\n    if year in year_top5_top25_poster:\r\n        save_dir_top5 = os.path.join(save_dir, 'top5')\r\n        download_iclr_top5_papers(\r\n            save_dir_top5,\r\n            year,\r\n            time_step_in_seconds=time_step_in_seconds,\r\n            downloader=downloader,\r\n            start_page=start_page,\r\n            proxy_ip_port=proxy_ip_port\r\n        )\r\n\r\n    # top25 openreview\r\n    if year in year_top5_top25_poster:\r\n        save_dir_top25 = os.path.join(save_dir, 'top25')\r\n        download_iclr_top25_papers(\r\n            save_dir_top25,\r\n            year,\r\n            time_step_in_seconds=time_step_in_seconds,\r\n            downloader=downloader,\r\n            start_page=start_page,\r\n            proxy_ip_port=proxy_ip_port\r\n        )\r\n\r\n\r\ndef get_pdf_link_from_openreview(abs_link):\r\n    return abs_link.replace('beta.', '').replace('forum', 'pdf')\r\n\r\n\r\nif __name__ == '__main__':\r\n  
  year = 2025\r\n    save_dir_iclr = rf'E:\\ICLR_{year}'\r\n    # save_dir_iclr_oral = os.path.join(save_dir_iclr, 'oral')\r\n    # save_dir_iclr_top5 = os.path.join(save_dir_iclr, 'top5')\r\n    # save_dir_iclr_spotlight = os.path.join(save_dir_iclr, 'spotlight')\r\n    # save_dir_iclr_top25 = os.path.join(save_dir_iclr, 'top25')\r\n    # save_dir_iclr_poster = os.path.join(save_dir_iclr, 'poster')\r\n    proxy_ip_port = None\r\n    # proxy_ip_port = \"http://127.0.0.1:7890\"\r\n    # download_iclr_oral_papers(save_dir_iclr_oral, year,\r\n    #                           time_step_in_seconds=5)\r\n    # download_iclr_top5_papers(save_dir_iclr_top5, year, start_page=1,\r\n    #                           time_step_in_seconds=5,\r\n    #                           proxy_ip_port=proxy_ip_port)\r\n    # download_iclr_top25_papers(save_dir_iclr_top25, year, start_page=1,\r\n    #                           time_step_in_seconds=5,\r\n    #                           proxy_ip_port=proxy_ip_port)\r\n    # download_iclr_spotlight_papers(save_dir_iclr_spotlight, year,\r\n    #                                time_step_in_seconds=5)\r\n    # download_iclr_poster_papers(save_dir_iclr_poster, year, start_page=1,\r\n    #                             time_step_in_seconds=5,\r\n    #                           proxy_ip_port=proxy_ip_port)\r\n    download_iclr_paper(save_dir_iclr, year, time_step_in_seconds=5,\r\n                        proxy_ip_port=proxy_ip_port)\r\n"
  },
  {
    "path": "code/paper_downloader_ICML.py",
    "content": "\"\"\"paper_downloader_ICML.py\"\"\"\r\n\r\nimport urllib\r\nfrom bs4 import BeautifulSoup\r\nimport pickle\r\nimport os\r\nfrom tqdm import tqdm\r\nfrom slugify import slugify\r\nimport sys\r\nroot_folder = os.path.abspath(\r\n    os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\nsys.path.append(root_folder)\r\nfrom lib.downloader import Downloader\r\nimport lib.pmlr as pmlr\r\nfrom lib.supplement_porcess import merge_main_supplement\r\nfrom lib.openreview import download_icml_papers_given_url_and_group_id\r\nfrom lib.my_request import urlopen_with_retry\r\n\r\n\r\ndef download_paper(year, save_dir, is_download_supplement=True,\r\n                   time_step_in_seconds=5, downloader='IDM', source='pmlr',\r\n                   proxy_ip_port=None):\r\n    \"\"\"\r\n    download all ICML paper and supplement files given year, restore in\r\n        save_dir/main_paper and save_dir/supplement\r\n    respectively\r\n    :param year: int, ICML year, such 2019\r\n    :param save_dir: str, paper and supplement material's save path\r\n    :param is_download_supplement: bool, True for downloading supplemental\r\n        material\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        request in seconds\r\n    :param downloader: str, the downloader to download, could be 'IDM' or\r\n        'Thunder', default to 'IDM'\r\n    :param source: str, source website, 'pmlr' or 'openreview'\r\n    :param proxy_ip_port: str or None, proxy ip address and port, eg.\r\n        eg: \"127.0.0.1:7890\". 
Default: None.\r\n    :type proxy_ip_port: str | None\r\n    :return: True\r\n    \"\"\"\r\n    assert source in ['pmlr', 'openreview'], \\\r\n        f'only support source pmlr or openreview, but get {source}'\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    downloader = Downloader(downloader=downloader, proxy_ip_port=proxy_ip_port)\r\n    ICML_year_dict = {\r\n        2024: 235,\r\n        2023: 202,\r\n        2022: 162,\r\n        2021: 139,\r\n        2020: 119,\r\n        2019: 97,\r\n        2018: 80,\r\n        2017: 70,\r\n        2016: 48,\r\n        2015: 37,\r\n        2014: 32,\r\n        2013: 28\r\n    }\r\n    if source == 'openreview':\r\n        init_url = f'https://openreview.net/group?id=ICML.cc/{year}/Conference'\r\n    else:  # pmlr\r\n        if year >= 2013:\r\n            init_url = f'http://proceedings.mlr.press/v{ICML_year_dict[year]}/'\r\n        elif year == 2012:\r\n            init_url = 'https://icml.cc/2012/papers.1.html'\r\n        elif year == 2011:\r\n            init_url = 'http://www.icml-2011.org/papers.php'\r\n        elif 2009 == year:\r\n            init_url = 'https://icml.cc/Conferences/2009/abstracts.html'\r\n        elif 2008 == year:\r\n            init_url = 'http://www.machinelearning.org/archive/icml2008/' \\\r\n                       'abstracts.shtml'\r\n        elif 2007 == year:\r\n            init_url = 'https://icml.cc/Conferences/2007/paperlist.html'\r\n        elif year in [2006, 2004, 2005]:\r\n            init_url = f'https://icml.cc/Conferences/{year}/proceedings.html'\r\n        elif 2003 == year:\r\n            init_url = 'https://aaai.org/Library/ICML/icml03contents.php'\r\n        else:\r\n            raise ValueError('''the given year's url is unknown !''')\r\n\r\n    postfix = f'ICML_{year}'\r\n    if source == 'openreview':  # download from openreview website:\r\n        # oral paper\r\n        group_id = 'oral'\r\n        
save_dir_oral = os.path.join(save_dir, group_id)\r\n        os.makedirs(save_dir_oral, exist_ok=True)\r\n        download_icml_papers_given_url_and_group_id(\r\n            save_dir=save_dir_oral,\r\n            year=year,\r\n            base_url=init_url,\r\n            group_id=group_id,\r\n            start_page=1,\r\n            time_step_in_seconds=time_step_in_seconds,\r\n            downloader=downloader.downloader,\r\n            proxy_ip_port=proxy_ip_port\r\n        )\r\n        # poster paper\r\n        group_id = 'poster'\r\n        save_dir_poster = os.path.join(save_dir, group_id)\r\n        os.makedirs(save_dir_poster, exist_ok=True)\r\n        download_icml_papers_given_url_and_group_id(\r\n            save_dir=os.path.join(save_dir, 'poster'),\r\n            year=year,\r\n            base_url=init_url,\r\n            group_id=group_id,\r\n            start_page=1,\r\n            time_step_in_seconds=time_step_in_seconds,\r\n            downloader=downloader.downloader,\r\n            proxy_ip_port=proxy_ip_port\r\n        )\r\n        # spotlight paper\r\n        group_id = 'spotlight'\r\n        save_dir_poster = os.path.join(save_dir, group_id)\r\n        os.makedirs(save_dir_poster, exist_ok=True)\r\n        try:\r\n            download_icml_papers_given_url_and_group_id(\r\n                save_dir=os.path.join(save_dir, 'spotlight'),\r\n                year=year,\r\n                base_url=init_url,\r\n                group_id=group_id,\r\n                start_page=1,\r\n                time_step_in_seconds=time_step_in_seconds,\r\n                downloader=downloader.downloader,\r\n                proxy_ip_port=proxy_ip_port\r\n            )\r\n        except ValueError as e:  # no spotlight paper\r\n            print(f\"WARNING: {str(e)}\")\r\n        return\r\n\r\n    dat_file_pathname = os.path.join(\r\n        project_root_folder, 'urls', f'init_url_icml_{year}.dat')\r\n    if os.path.exists(dat_file_pathname):\r\n        with 
open(dat_file_pathname, 'rb') as f:\r\n            content = pickle.load(f)\r\n    else:\r\n        headers = {\r\n            'User-Agent':\r\n                'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '\r\n                'Gecko/20100101 Firefox/23.0'}\r\n        content = urlopen_with_retry(url=init_url, headers=headers)\r\n        # content = open(f'..\\\\ICML_{year}.html', 'rb').read()\r\n        with open(dat_file_pathname, 'wb') as f:\r\n            pickle.dump(content, f)\r\n    # soup = BeautifulSoup(content, 'html.parser')\r\n    soup = BeautifulSoup(content, 'html5lib')\r\n    # soup = BeautifulSoup(open(r'..\\ICML_2011.html', 'rb'), 'html.parser')\r\n    error_log = []\r\n    if year >= 2013:\r\n        if year in ICML_year_dict.keys():\r\n            volume = f'v{ICML_year_dict[year]}'\r\n        else:\r\n            raise ValueError('''the given year's url is unknown !''')\r\n\r\n        pmlr.download_paper_given_volume(\r\n            volume=volume,\r\n            save_dir=save_dir,\r\n            postfix=postfix,\r\n            is_download_supplement=is_download_supplement,\r\n            time_step_in_seconds=time_step_in_seconds,\r\n            downloader=downloader.downloader\r\n        )\r\n    elif 2012 == year:  # 2012\r\n        # base_url = f'https://icml.cc/{year}/'\r\n        paper_list_bar = tqdm(soup.find_all('div', {'class': 'paper'}))\r\n        paper_index = 0\r\n        for paper in paper_list_bar:\r\n            paper_index += 1\r\n            title = ''\r\n            title = slugify(paper.find('h2').text)\r\n            link = None\r\n            for a in paper.find_all('a'):\r\n                if 'ICML version (pdf)' == a.text:\r\n                    link = urllib.parse.urljoin(init_url, a.get('href'))\r\n                    break\r\n            if link is not None:\r\n                this_paper_main_path = os.path.join(\r\n                    save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_'))\r\n                
paper_list_bar.set_description(\r\n                    f'find paper {paper_index}:{title}')\r\n                if not os.path.exists(this_paper_main_path):\r\n                    paper_list_bar.set_description(\r\n                        f'downloading paper {paper_index}:{title}')\r\n                    downloader.download(\r\n                        urls=link,\r\n                        save_path=this_paper_main_path,\r\n                        time_sleep_in_seconds=time_step_in_seconds\r\n                    )\r\n            else:\r\n                error_log.append((title, 'no main link error'))\r\n    elif 2011 == year:\r\n        paper_list_bar = tqdm(soup.find_all('a'))\r\n        paper_index = 0\r\n        title = None  # guard against a 'download' link before the first h3\r\n        for paper in paper_list_bar:\r\n            h3 = paper.find('h3')\r\n            if h3 is not None:\r\n                title = slugify(h3.text)\r\n                paper_index += 1\r\n            if 'download' == slugify(paper.text.strip()):\r\n                link = paper.get('href')\r\n                link = urllib.parse.urljoin(init_url, link)\r\n                if link is not None and title is not None:\r\n                    this_paper_main_path = os.path.join(\r\n                        save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_'))\r\n                    paper_list_bar.set_description(\r\n                        f'find paper {paper_index}:{title}')\r\n                    if not os.path.exists(this_paper_main_path):\r\n                        paper_list_bar.set_description(\r\n                            f'downloading paper {paper_index}:{title}')\r\n                        downloader.download(\r\n                            urls=link,\r\n                            save_path=this_paper_main_path,\r\n                            time_sleep_in_seconds=time_step_in_seconds\r\n                        )\r\n                else:\r\n                    error_log.append((title, 'no main link error'))\r\n    elif year in [2009, 2008]:\r\n        if 2009 == year:\r\n            paper_list_bar = tqdm(\r\n                soup.find('div', {'id': 'right_column'}).find_all(['h3', 'a']))\r\n        elif 2008 == year:\r\n            paper_list_bar = tqdm(\r\n                soup.find('div', {'class': 'content'}).find_all(['h3', 'a']))\r\n        paper_index = 0\r\n        title = None\r\n        for paper in paper_list_bar:\r\n            if 'h3' == paper.name:\r\n                title = slugify(paper.text)\r\n                paper_index += 1\r\n            elif 'full-paper' == slugify(paper.text.strip()):  # a\r\n                link = paper.get('href')\r\n                if link is not None and title is not None:\r\n                    link = urllib.parse.urljoin(init_url, link)\r\n                    this_paper_main_path = os.path.join(\r\n                        save_dir, f'{title}_{postfix}.pdf')\r\n                    paper_list_bar.set_description(\r\n                        f'find paper {paper_index}:{title}')\r\n                    if not os.path.exists(this_paper_main_path):\r\n                        paper_list_bar.set_description(\r\n                            f'downloading paper {paper_index}:{title}')\r\n                        downloader.download(\r\n                            urls=link,\r\n                            save_path=this_paper_main_path,\r\n                            time_sleep_in_seconds=time_step_in_seconds\r\n                        )\r\n                    title = None\r\n                else:\r\n                    error_log.append((title, 'no main link error'))\r\n    elif year in [2006, 2005]:\r\n        paper_list_bar = tqdm(soup.find_all('a'))\r\n        paper_index = 0\r\n        for paper in paper_list_bar:\r\n            title = slugify(paper.text.strip())\r\n            link = paper.get('href')\r\n            paper_index += 1\r\n            if link is not None and title is not None and \\\r\n                    ('pdf' == link[-3:] or 'ps' == link[-2:]):\r\n                link = 
urllib.parse.urljoin(init_url, link)\r\n                this_paper_main_path = os.path.join(\r\n                    save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_'))\r\n                paper_list_bar.set_description(\r\n                    f'find paper {paper_index}:{title}')\r\n                if not os.path.exists(this_paper_main_path):\r\n                    paper_list_bar.set_description(\r\n                        f'downloading paper {paper_index}:{title}')\r\n                    downloader.download(\r\n                        urls=link,\r\n                        save_path=this_paper_main_path,\r\n                        time_sleep_in_seconds=time_step_in_seconds\r\n                    )\r\n    elif 2004 == year:\r\n        paper_index = 0\r\n        paper_list_bar = tqdm(\r\n            soup.find('table', {'class': 'proceedings'}).find_all('tr'))\r\n        title = None\r\n        for paper in paper_list_bar:\r\n            tr_class = None\r\n            try:\r\n                tr_class = paper.get('class')[0]\r\n            except (TypeError, IndexError):  # <tr> without a class attribute\r\n                pass\r\n            if 'proc_2004_title' == tr_class:  # title\r\n                title = slugify(paper.text.strip())\r\n                paper_index += 1\r\n            else:\r\n                for a in paper.find_all('a'):\r\n                    if '[Paper]' == a.text:\r\n                        link = a.get('href')\r\n                        if link is not None and title is not None:\r\n                            link = urllib.parse.urljoin(init_url, link)\r\n                            this_paper_main_path = os.path.join(\r\n                                save_dir,\r\n                                f'{title}_{postfix}.pdf'.replace(' ', '_'))\r\n                            paper_list_bar.set_description(\r\n                                f'find paper {paper_index}:{title}')\r\n                            if not os.path.exists(this_paper_main_path):\r\n                                
paper_list_bar.set_description(\r\n                                    f'downloading paper {paper_index}:{title}')\r\n                                downloader.download(\r\n                                    urls=link,\r\n                                    save_path=this_paper_main_path,\r\n                                    time_sleep_in_seconds=time_step_in_seconds\r\n                                )\r\n                        break\r\n    elif 2003 == year:\r\n        paper_index = 0\r\n        paper_list_bar = tqdm(\r\n            soup.find('div', {'id': 'content'}).find_all(\r\n                'p', {'class': 'left'}))\r\n        for paper in paper_list_bar:\r\n            abs_link = None\r\n            title = None\r\n            link = None\r\n            for a in paper.find_all('a'):\r\n                abs_link = urllib.parse.urljoin(init_url, a.get('href'))\r\n                if abs_link is not None:\r\n                    title = slugify(a.text.strip())\r\n                    break\r\n            if title is not None:\r\n                paper_index += 1\r\n                this_paper_main_path = os.path.join(\r\n                    save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_'))\r\n                paper_list_bar.set_description(\r\n                    f'find paper {paper_index}:{title}')\r\n                if not os.path.exists(this_paper_main_path):\r\n                    if abs_link is not None:\r\n                        headers = {'User-Agent':\r\n                                       'Mozilla/5.0 (Windows NT 6.1; WOW64; '\r\n                                       'rv:23.0) Gecko/20100101 Firefox/23.0'}\r\n                        abs_content = urlopen_with_retry(\r\n                            url=abs_link, headers=headers,\r\n                            raise_error_if_failed=False)\r\n                        if abs_content is None:\r\n                            print('error'+title)\r\n                            error_log.append(\r\n       
                         (title, abs_link, 'download error'))\r\n                            continue\r\n                        abs_soup = BeautifulSoup(abs_content, 'html5lib')\r\n                        for a in abs_soup.find_all('a'):\r\n                            try:\r\n                                if 'pdf' == a.get('href')[-3:]:\r\n                                    link = urllib.parse.urljoin(\r\n                                        abs_link, a.get('href'))\r\n                                    if link is not None:\r\n                                        paper_list_bar.set_description(\r\n                                            f'downloading paper {paper_index}:'\r\n                                            f'{title}')\r\n                                        downloader.download(\r\n                                            urls=link,\r\n                                            save_path=this_paper_main_path,\r\n                                            time_sleep_in_seconds=time_step_in_seconds\r\n                                        )\r\n                                    break\r\n                            except:\r\n                                pass\r\n\r\n    # write error log\r\n    print('write error log')\r\n    log_file_pathname = os.path.join(\r\n        project_root_folder, 'log', 'download_err_log.txt')\r\n    with open(log_file_pathname, 'w') as f:\r\n        for log in tqdm(error_log):\r\n            for e in log:\r\n                if e is not None:\r\n                    f.write(e)\r\n                else:\r\n                    f.write('None')\r\n                f.write('\\n')\r\n\r\n            f.write('\\n')\r\n\r\n\r\ndef rename_downloaded_paper(year, source_path):\r\n    \"\"\"\r\n    rename the downloaded ICML paper to {title}_ICML_2010.pdf and save to\r\n    source_path\r\n    :param year: int, year\r\n    :param source_path: str, whose structure should be\r\n        source_path/papers/pdf files 
(2010)\r\n                   /index.html       (2010)\r\n        source_path/icml2007_proc.html (2007)\r\n   :return:\r\n    \"\"\"\r\n    if not os.path.exists(source_path):\r\n        raise ValueError(f'can not find {source_path}')\r\n    postfix = f'ICML_{year}'\r\n    if 2010 == year:\r\n        soup = BeautifulSoup(\r\n            open(os.path.join(source_path, 'index.html'), 'rb'), 'html5lib')\r\n        paper_list_bar = tqdm(soup.find_all('span', {'class': 'boxpopup3'}))\r\n\r\n        for paper in paper_list_bar:\r\n            a = paper.find('a')\r\n            title = slugify(a.text)\r\n            ori_name = os.path.join(\r\n                source_path, 'papers', a.get('href').split('/')[-1])\r\n            os.rename(ori_name, os.path.join(\r\n                source_path, f'{title}_{postfix}.pdf'))\r\n            paper_list_bar.set_description(f'processing {title}')\r\n    elif 2007 == year:\r\n        soup = BeautifulSoup(open(os.path.join(\r\n            source_path, 'icml2007_proc.html'), 'rb'), 'html5lib')\r\n        paper_list_bar = tqdm(soup.find_all('td', {'colspan': '2'}))\r\n        for paper in paper_list_bar:\r\n            all_as = paper.find_all('a')\r\n            if len(all_as) <= 1:\r\n                title = slugify(paper.text.strip())\r\n            else:\r\n                for a in all_as:\r\n                    if '[Paper]' == a.text:\r\n                        sub_path = a.get('href')\r\n                        os.rename(os.path.join(source_path, sub_path),\r\n                                  os.path.join(\r\n                                      source_path, f'{title}_{postfix}.pdf'))\r\n                        paper_list_bar.set_description_str(\r\n                            (f'processing {title}'))\r\n                        break\r\n\r\n\r\nif __name__ == '__main__':\r\n    year = 2025\r\n    download_paper(\r\n        year,\r\n        rf'E:\\ICML_{year}',\r\n        is_download_supplement=True,\r\n        
time_step_in_seconds=10,\r\n        downloader='IDM',\r\n        source='openreview'\r\n    ) \r\n    # merge_main_supplement(main_path=f'..\\\\ICML_{year}\\\\main_paper',\r\n    #                       supplement_path=f'..\\\\ICML_{year}\\\\supplement',\r\n    #                       save_path=f'..\\\\ICML_{year}',\r\n    #                       is_delete_ori_files=False)\r\n    # rename_downloaded_paper(year, f'..\\\\ICML_{year}')\r\n    pass\r\n"
  },
  {
    "path": "code/paper_downloader_IJCAI.py",
    "content": "\"\"\"paper_downloader_IJCAI.py\"\"\"\r\n\r\nimport urllib\r\nfrom bs4 import BeautifulSoup\r\nimport pickle\r\nimport os\r\nfrom tqdm import tqdm\r\nfrom slugify import slugify\r\nimport csv\r\nimport sys\r\nroot_folder = os.path.abspath(\r\n    os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\nsys.path.append(root_folder)\r\nfrom lib import csv_process\r\nfrom lib.my_request import urlopen_with_retry\r\n\r\n\r\ndef save_csv(year):\r\n    \"\"\"\r\n    write IJCAI papers' urls in one csv file\r\n    :param year: int, IJCAI year, such as 2019\r\n    :return: paper_index: int, the total number of papers\r\n    \"\"\"\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    csv_file_pathname = os.path.join(\r\n        project_root_folder, 'csv', f'IJCAI_{year}.csv'\r\n    )\r\n    with open(csv_file_pathname, 'w', newline='') as csvfile:\r\n        fieldnames = ['title', 'main link', 'group']\r\n        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\r\n        writer.writeheader()\r\n        if year >= 2003:\r\n            init_urls = [f'https://www.ijcai.org/proceedings/{year}/']\r\n        elif year >= 1977:\r\n            init_urls = [f'https://www.ijcai.org/Proceedings/{year}-1/',\r\n                         f'https://www.ijcai.org/Proceedings/{year}-2/']\r\n        elif year >= 1969:\r\n            init_urls = [f'https://www.ijcai.org/Proceedings/{year}/']\r\n        else:\r\n            raise ValueError('invalid year!')\r\n        error_log = []\r\n        user_agents = [\r\n            'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) '\r\n            'Gecko/20071127 Firefox/2.0.0.11',\r\n\r\n            'Opera/9.25 (Windows NT 5.1; U; en)',\r\n\r\n            'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; '\r\n            '.NET CLR 1.1.4322; .NET CLR 2.0.50727)',\r\n\r\n            'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) '\r\n     
       'KHTML/3.5.5 (like Gecko) (Kubuntu)',\r\n\r\n            'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) '\r\n            'Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',\r\n\r\n            'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',\r\n\r\n            \"Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 \"\r\n            \"(KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 \"\r\n            \"Chrome/16.0.912.77 Safari/535.7\",\r\n\r\n            \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) \"\r\n            \"Gecko/20100101 Firefox/10.0 \",\r\n\r\n            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '\r\n            'AppleWebKit/537.36 (KHTML, like Gecko) '\r\n            'Chrome/105.0.0.0 Safari/537.36'\r\n\r\n        ]\r\n        headers = {\r\n            'User-Agent': user_agents[-1],\r\n            'Host': 'www.ijcai.org',\r\n            'Referer': \"https://www.ijcai.org\",\r\n            'GET': init_urls[0]\r\n        }\r\n        if len(init_urls) == 1:\r\n            data_file_pathname = os.path.join(\r\n                project_root_folder, 'urls', f'init_url_IJCAI_{year}.dat'\r\n            )\r\n            if os.path.exists(data_file_pathname):\r\n                with open(data_file_pathname, 'rb') as f:\r\n                    content = pickle.load(f)\r\n            else:\r\n                content = urlopen_with_retry(url=init_urls[0], headers=headers)\r\n                with open(data_file_pathname, 'wb') as f:\r\n                    pickle.dump(content, f)\r\n            contents = [content]\r\n        else:\r\n            contents = []\r\n            data_file_pathname = os.path.join(\r\n                project_root_folder, 'urls', f'init_url_IJCAI_0_{year}.dat'\r\n            )\r\n            if os.path.exists(data_file_pathname):\r\n                with open(data_file_pathname, 'rb') as f:\r\n                    content = pickle.load(f)\r\n            else:\r\n                content = 
urlopen_with_retry(url=init_urls[0], headers=headers)\r\n                with open(data_file_pathname, 'wb') as f:\r\n                    pickle.dump(content, f)\r\n            contents.append(content)\r\n            data_file_pathname = os.path.join(\r\n                project_root_folder, 'urls', f'init_url_IJCAI_1_{year}.dat'\r\n            )\r\n            if os.path.exists(data_file_pathname):\r\n                with open(data_file_pathname, 'rb') as f:\r\n                    content = pickle.load(f)\r\n            else:\r\n                content = urlopen_with_retry(url=init_urls[1], headers=headers)\r\n                with open(data_file_pathname, 'wb') as f:\r\n                    pickle.dump(content, f)\r\n            contents.append(content)\r\n        paper_index = 0\r\n        for content in contents:\r\n            soup = BeautifulSoup(content, 'html5lib')\r\n            if year >= 2017:\r\n                pbar = tqdm(soup.find_all('div', {'class': 'section_title'}))\r\n                for section in pbar:\r\n                    this_group = slugify(section.text)\r\n                    papers = section.parent.find_all(\r\n                        'div', {'class': ['paper_wrapper', 'subsection_title']})\r\n                    sub_group = ''\r\n                    for paper in papers:\r\n                        if 'subsection_title' == paper.get('class')[0]:\r\n                            sub_group = slugify(paper.text)\r\n                            continue\r\n                        paper_index += 1\r\n                        is_get_link = False\r\n                        title = slugify(\r\n                            paper.find('div', {'class': 'title'}).text)\r\n                        pbar.set_description(\r\n                            f'downloading paper {paper_index}: {title}')\r\n                        for a in paper.find(\r\n                                'div', {'class': 'details'}).find_all('a'):\r\n                            if 'PDF' == 
a.text:\r\n                                link = urllib.parse.urljoin(\r\n                                    init_urls[0], a.get('href'))\r\n                                is_get_link = True\r\n                                break\r\n                        if is_get_link:\r\n                            paper_dict = {'title': title,\r\n                                          'main link': link,\r\n                                          'group': this_group + '--' +\r\n                                                   sub_group if\r\n                                          sub_group != '' else this_group}\r\n                        else:\r\n                            paper_dict = {'title': title,\r\n                                          'main link': 'error',\r\n                                          'group': this_group + '--' +\r\n                                                   sub_group if\r\n                                          sub_group != '' else this_group}\r\n                            print(f'get link for {title}_{year} failed!')\r\n                            error_log.append((title, 'no link'))\r\n                        writer.writerow(paper_dict)\r\n            elif year in [2016]:  # no group\r\n                papers_bar = tqdm(soup.find_all('p'))\r\n                for paper in papers_bar:\r\n                    all_as = paper.find_all('a')\r\n                    if len(all_as) >= 2:  # paper pdf and abstract\r\n                        paper_index += 1\r\n                        title = slugify(paper.text.split('\\n')[0])\r\n                        papers_bar.set_description(\r\n                            f'downloading paper {paper_index}: {title}')\r\n                        is_get_link = False\r\n                        for a in all_as:\r\n                            if 'PDF' == a.text:\r\n                                link = 'https://www.ijcai.org' + a.get('href')\r\n                                is_get_link = True\r\n                                break\r\n                        if is_get_link:\r\n                            paper_dict = {'title': title,\r\n                                          'main link': link,\r\n                                          'group': ''}\r\n                        else:\r\n                            paper_dict = {'title': title,\r\n                                          'main link': 'error',\r\n                                          'group': ''}\r\n                            print(f'get link for {title}_{year} failed!')\r\n                            error_log.append((title, 'no link'))\r\n                        writer.writerow(paper_dict)\r\n            elif year in [2015]:  # p group 'PDF'\r\n                div_content = soup.find('div', {'id': 'content'})\r\n                papers_bar = tqdm(div_content.find_all(['h2', 'p', 'h3']))\r\n                is_start = False\r\n                this_group = ''\r\n                for paper in papers_bar:\r\n                    if not is_start:\r\n                        if 'h2' == paper.name:  # find 'content'\r\n                            if 'Contents' == paper.text:\r\n                                is_start = True\r\n                    else:\r\n                        if 'h3' == paper.name: # group\r\n                            this_group = slugify(paper.text)\r\n                        elif 'p' == paper.name:  # paper\r\n                            all_as = paper.find_all('a')\r\n                            if len(all_as) >= 2:  # paper pdf and abstract\r\n                                paper_index += 1\r\n                                title = slugify(paper.text.split('\\n')[0])\r\n                                papers_bar.set_description(\r\n                                    f'downloading paper {paper_index}: {title}')\r\n                                is_get_link = False\r\n                                for a in all_as:\r\n                                    if 'PDF' == a.text:\r\n                                        link = 'https://www.ijcai.org' + \\\r\n                                               a.get('href')\r\n                                        is_get_link = True\r\n                                        break\r\n                                if is_get_link:\r\n                                    paper_dict = {'title': title,\r\n                                                  'main link': link,\r\n                                                  'group': this_group}\r\n                                else:\r\n                                    paper_dict = {'title': title,\r\n                                                  'main link': 'error',\r\n                                                  'group': this_group}\r\n                                    print(f'get link for {title}_{year} failed!')\r\n                                    error_log.append((title, 'no link'))\r\n                                writer.writerow(paper_dict)\r\n            elif year in [2013, 2011, 2009, 2007]:  # p group\r\n                div_content = soup.find('div', {'id': 'content'})\r\n                papers_bar = tqdm(div_content.find_all(['h2', 'p', 'h3', 'h4']))\r\n                # papers_bar = div_content.find_all(['h2', 'p', 'h3', 'h4'])\r\n                is_start = False\r\n                this_group = ''\r\n                this_group_v3 = ''\r\n                this_group_v4 = ''\r\n                for paper in papers_bar:\r\n                    if not is_start:\r\n                        if 'h2' == paper.name:  # find 'content'\r\n                            if 'Contents' == paper.text or \\\r\n                                    'IJCAI-09 Contents' == paper.text or \\\r\n                                    'IJCAI-07 Contents' == paper.text:\r\n                                is_start = True\r\n                    else:\r\n                        if 'h3' == paper.name: # group\r\n                            this_group_v3 = slugify(paper.text)\r\n                            this_group = this_group_v3\r\n                        elif 'h4' == paper.name: # group\r\n                            this_group_v4 = slugify(paper.text)\r\n                            this_group = this_group_v3 + '--' + this_group_v4\r\n                        elif 'p' == paper.name:  # paper\r\n                            try:\r\n                                all_as = paper.find_all('a')\r\n                            except:\r\n                                continue\r\n                            if len(all_as) >= 1:  # paper\r\n                                paper_index += 1\r\n                                is_get_link = False\r\n                                for a in all_as:\r\n                                    if 'abstract' != slugify(a.text.strip()):\r\n                                        title = slugify(a.text)\r\n                                        link = a.get('href')\r\n                                        is_get_link = True\r\n                                        papers_bar.set_description(\r\n                                            f'downloading paper {paper_index}: '\r\n                                            f'{title}')\r\n                                        break\r\n                                if is_get_link:\r\n                                    paper_dict = {'title': title,\r\n                                                  'main link': link,\r\n                                                  'group': this_group}\r\n                                else:\r\n                                    paper_dict = {'title': title,\r\n                                                  'main link': 'error',\r\n                                                  'group': this_group}\r\n                                    print(f'get link for {title}_{year} failed!')\r\n                                    error_log.append((title, 'no link'))\r\n         
                       # papers_bar.set_description(f'downloading\r\n                                # paper {paper_index}: {title}')\r\n                                writer.writerow(paper_dict)\r\n            elif year in [2005]:\r\n                div_content = soup.find('div', {'id': 'content'})\r\n                papers_bar = tqdm(div_content.find_all(['p']))\r\n                this_group = ''\r\n                for paper in papers_bar:\r\n                    try:\r\n                        paper_class = paper.get('class')[0]\r\n                    except:\r\n                        continue\r\n                    if 'docsection' == paper_class:  # group\r\n                        this_group = slugify(paper.text)\r\n                    elif 'doctitle' == paper_class:  # paper\r\n                        paper_index += 1\r\n                        title = slugify(paper.a.text)\r\n                        link = paper.a.get('href')\r\n                        papers_bar.set_description(\r\n                            f'downloading paper {paper_index}: {title}')\r\n                        paper_dict = {'title': title,\r\n                                      'main link': link,\r\n                                      'group': this_group}\r\n                        writer.writerow(paper_dict)\r\n            elif year in [2003]:\r\n                div_content = soup.find('div', {'id': 'content'})\r\n                papers_bar = tqdm(div_content.find_all(['p']))\r\n                this_group = ''\r\n                base_url = 'https://www.ijcai.org'\r\n                for paper in papers_bar:\r\n                    try:\r\n                        this_group = slugify(paper.b.text)\r\n                    except:\r\n                        pass\r\n                    try:\r\n                        title = slugify(paper.a.text)\r\n                        link = base_url + paper.a.get('href')\r\n                        paper_index += 1\r\n                        
papers_bar.set_description(\r\n                            f'downloading paper {paper_index}: {title}')\r\n                        paper_dict = {'title': title,\r\n                                      'main link': link,\r\n                                      'group': this_group}\r\n                        writer.writerow(paper_dict)\r\n                    except:\r\n                        continue\r\n            elif year in [2001]:\r\n                div_content = soup.find('div', {'id': 'content'})\r\n                papers_bar = tqdm(div_content.find_all(['p']))\r\n                this_group = ''\r\n                for paper in papers_bar:\r\n                    try:\r\n                        title = slugify(paper.a.text)\r\n                        link = paper.a.get('href')\r\n                        paper_index += 1\r\n                        papers_bar.set_description(\r\n                            f'downloading paper {paper_index}: {title}')\r\n                        paper_dict = {'title': title,\r\n                                      'main link': link,\r\n                                      'group': this_group}\r\n                        writer.writerow(paper_dict)\r\n                    except:\r\n                        continue\r\n            elif year in [1999, 1997, 1995, 1993, 1991, 1989, 1987, 1981, 1979,\r\n                          1977, 1969]:  # group in capital in p.b.text\r\n                div_content = soup.find('div', {'id': 'content'})\r\n                papers_bar = tqdm(div_content.find_all(['p']))\r\n                this_group = ''\r\n                for paper in papers_bar:\r\n                    try:\r\n                        if paper.b.text.isupper():\r\n                            # print(paper.b.text)\r\n                            this_group = slugify(paper.b.text)\r\n                    except:\r\n                        pass\r\n                    try:\r\n                        for a in paper.find_all('a'):\r\n                            title = slugify(a.text.strip())\r\n                            link = a.get('href')\r\n                            if link[-3:] == 'pdf' and '' != title:\r\n                                paper_index += 1\r\n                                papers_bar.set_description(\r\n                                    f'downloading paper {paper_index}: {title}')\r\n                                paper_dict = {'title': title,\r\n                                              'main link': link,\r\n                                              'group': this_group}\r\n                                writer.writerow(paper_dict)\r\n                                break\r\n                            else:\r\n                                continue\r\n\r\n                    except:\r\n                        continue\r\n            elif year in [1985, 1975, 1971]:  # no group, paper in 'p'\r\n                div_content = soup.find('div', {'id': 'content'})\r\n                papers_bar = tqdm(div_content.find_all(['p']))\r\n                this_group = ''\r\n                for paper in papers_bar:\r\n                    try:\r\n                        for a in paper.find_all('a'):\r\n                            title = slugify(a.text.strip())\r\n                            link = a.get('href')\r\n                            if link[-3:] == 'pdf' and '' != title:\r\n                                paper_index += 1\r\n                                papers_bar.set_description(\r\n                                    f'downloading paper {paper_index}: {title}')\r\n                                paper_dict = {'title': title,\r\n                                              'main link': link,\r\n                                              'group': this_group}\r\n                                writer.writerow(paper_dict)\r\n                                break\r\n                            else:\r\n                                continue\r\n\r\n                    except:\r\n                        continue\r\n            elif year in [1983]:  # group in capital p.text\r\n                div_content = soup.find('div', {'id': 'content'})\r\n                papers_bar = tqdm(div_content.find_all(['p']))\r\n                this_group = ''\r\n                for paper in papers_bar:\r\n                    try:\r\n                        if paper.text.isupper():\r\n                            this_group = slugify(paper.text)\r\n                    except:\r\n                        pass\r\n                    try:\r\n                        for a in paper.find_all('a'):\r\n                            title = slugify(a.text.strip())\r\n                            link = a.get('href')\r\n                            if link[-3:] == 'pdf' and '' != title:\r\n                                paper_index += 1\r\n                                papers_bar.set_description(\r\n                                    f'downloading paper {paper_index}: {title}')\r\n                                paper_dict = {'title': title,\r\n                                              'main link': link,\r\n                                              'group': this_group}\r\n                                writer.writerow(paper_dict)\r\n                                break\r\n                            else:\r\n                                continue\r\n\r\n                    except:\r\n                        continue\r\n            elif year in [1973]:  # group in p.b\r\n                div_content = soup.find('div', {'id': 'content'})\r\n                papers_bar = tqdm(div_content.find_all(['p']))\r\n                this_group = ''\r\n                for paper in papers_bar:\r\n                    try:\r\n                        if '' != paper.b.text.strip():\r\n                            this_group = slugify(paper.b.text.strip())\r\n                    except:\r\n                        pass\r\n                    try:\r\n                        for a in paper.find_all('a'):\r\n                            title = slugify(a.text.strip())\r\n                            link = a.get('href')\r\n                            if link[-3:] == 'pdf' and '' != title:\r\n                                paper_index += 1\r\n                                papers_bar.set_description(\r\n                                    f'downloading paper {paper_index}: {title}')\r\n                                paper_dict = {'title': title,\r\n                                              'main link': link,\r\n                                              'group': this_group}\r\n                                writer.writerow(paper_dict)\r\n                                break\r\n                            else:\r\n                                continue\r\n\r\n                    except:\r\n                        continue\r\n        #  write error log\r\n        print('write error log')\r\n        log_file_pathname = os.path.join(\r\n            project_root_folder, 'log', 'download_err_log.txt')\r\n        with open(log_file_pathname, 'w') as f:\r\n            for log in tqdm(error_log):\r\n                for e in log:\r\n                    if e is not None:\r\n                        f.write(e)\r\n                    else:\r\n                        f.write('None')\r\n                    f.write('\\n')\r\n\r\n                f.write('\\n')\r\n\r\n    return paper_index if paper_index is not None else None\r\n\r\n\r\ndef download_from_csv(\r\n        year, save_dir, time_step_in_seconds=5, total_paper_number=None, downloader='IDM'):\r\n    \"\"\"\r\n    download all IJCAI papers of the given year\r\n    :param year: int, IJCAI year, such as 2019\r\n    :param save_dir: str, paper and supplement material's save path\r\n    :param time_step_in_seconds: int, the interval time between two download requests in seconds\r\n    :param total_paper_number: int, the total number of papers to download\r\n    :param downloader: str, 
the downloader to download, could be 'IDM' or 'Thunder', default to 'IDM'\r\n    :return: True\r\n    \"\"\"\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    postfix = f'IJCAI_{year}'\r\n    csv_filename = f'IJCAI_{year}.csv'\r\n    csv_filename = os.path.join(project_root_folder, 'csv', csv_filename)\r\n    csv_process.download_from_csv(\r\n        postfix=postfix,\r\n        save_dir=save_dir,\r\n        csv_file_path=csv_filename,\r\n        is_download_supplement=False,\r\n        time_step_in_seconds=time_step_in_seconds,\r\n        total_paper_number=total_paper_number,\r\n        downloader=downloader\r\n    )\r\n\r\n\r\nif __name__ == '__main__':\r\n    # for year in  range(1993, 1968, -2):\r\n    #     print(year)\r\n    #     # save_csv(year)\r\n    #     # time.sleep(2)\r\n    #     download_from_csv(year, save_dir=f'..\\\\IJCAI_{year}',\r\n    #     time_step_in_seconds=1)\r\n    year = 2024\r\n    # total_paper_number = 723\r\n    total_paper_number = save_csv(year)\r\n    download_from_csv(\r\n        year,\r\n        save_dir=fr'E:\\IJCAI_{year}',\r\n        time_step_in_seconds=5,\r\n        total_paper_number=total_paper_number,\r\n        downloader=None)\r\n\r\n    pass\r\n"
  },
  {
    "path": "code/paper_downloader_JMLR.py",
    "content": "\"\"\"paper_downloader_JMLR.py\"\"\"\r\n\r\nimport urllib\r\nfrom bs4 import BeautifulSoup\r\nimport pickle\r\nimport os\r\nfrom tqdm import tqdm\r\nfrom slugify import slugify\r\nimport time\r\nimport sys\r\nroot_folder = os.path.abspath(\r\n    os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\nsys.path.append(root_folder)\r\nfrom lib.downloader import Downloader\r\nfrom lib.my_request import urlopen_with_retry\r\n\r\n\r\ndef download_paper(\r\n        volumn, save_dir, time_step_in_seconds=5, downloader='IDM', url=None,\r\n        is_use_url=False, refresh_paper_list=True):\r\n    \"\"\"\r\n    download all JMLR paper files of the given volumn and save them in\r\n    save_dir\r\n    :param volumn: int, JMLR volume, such as 2019\r\n    :param save_dir: str, paper and supplement material's save path\r\n    :param time_step_in_seconds: int, the interval time between two download requests in seconds\r\n    :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder', default to 'IDM'\r\n    :param url: None or str, None means to download the volume's papers.\r\n    :param is_use_url: bool, if to download papers from 'url'. 
url couldn't be None when is_use_url is True.\r\n    :param refresh_paper_list: bool, whether to refresh the saved paper list, default\r\n        True, which means the \"dat\" file that contains the papers' information\r\n        will be re-downloaded.\r\n    :return: True\r\n    \"\"\"\r\n    downloader = Downloader(downloader=downloader)\r\n    # create current dict\r\n    title_list = []\r\n    # paper_dict = dict()\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n\r\n    headers = {\r\n        'User-Agent':\r\n            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}\r\n    if not is_use_url:\r\n        init_url = f'http://jmlr.org/papers/v{volumn}/'\r\n        postfix = f'JMLR_v{volumn}'\r\n        dat_file_pathname = os.path.join(\r\n            project_root_folder, 'urls', f'init_url_JMLR_v{volumn}.dat')\r\n        if not refresh_paper_list and \\\r\n                os.path.exists(dat_file_pathname):\r\n            with open(dat_file_pathname, 'rb') as f:\r\n                content = pickle.load(f)\r\n        else:\r\n            print('collecting papers from website...')\r\n            content = urlopen_with_retry(url=init_url, headers=headers)\r\n            # content = open(f'..\\\\JMLR_{volumn}.html', 'rb').read()\r\n            with open(dat_file_pathname, 'wb') as f:\r\n                pickle.dump(content, f)\r\n    elif url is not None:\r\n        content = urlopen_with_retry(url=url, headers=headers)\r\n        postfix = 'JMLR'\r\n    else:\r\n        raise ValueError(''''url' could not be None when 'is_use_url'=True!!!''')\r\n    # soup = BeautifulSoup(content, 'html.parser')\r\n    soup = BeautifulSoup(content, 'html5lib')\r\n    # soup = BeautifulSoup(open(r'..\\JMLR_2011.html', 'rb'), 'html.parser')\r\n    error_log = []\r\n    os.makedirs(save_dir, exist_ok=True)\r\n\r\n    if (not is_use_url) and volumn <= 4:\r\n        paper_list = 
soup.find('div', {'id': 'content'}).find_all('tr')\r\n    else:\r\n        paper_list = soup.find('div', {'id': 'content'}).find_all('dl')\r\n    # num_download = 5 # number of papers to download\r\n    num_download = len(paper_list)\r\n    print(f'total paper count: {num_download}, start downloading...')\r\n    for paper in tqdm(zip(paper_list, range(num_download))):\r\n        # get title\r\n        this_paper = paper[0]\r\n        title = slugify(this_paper.find('dt').text)\r\n        title_list.append(title)\r\n\r\n        this_paper_main_path = os.path.join(save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_'))\r\n        if os.path.exists(this_paper_main_path):\r\n            continue\r\n\r\n        # get abstract page url\r\n        links = this_paper.find_all('a')\r\n        main_link = None\r\n        for link in links:\r\n            if '[pdf]' == link.text or 'pdf' == link.text:\r\n                main_link = urllib.parse.urljoin('http://jmlr.org', link.get('href'))\r\n                break\r\n\r\n        # try 1 time\r\n        # error_flag = False\r\n        for d_iter in range(1):\r\n            try:\r\n                # download paper with IDM\r\n                if not os.path.exists(this_paper_main_path) and main_link is not None:\r\n                    try:\r\n                        print('Downloading paper {}/{}: {}'.format(paper[1] + 1, num_download, title))\r\n                    except UnicodeEncodeError:  # console may fail to print non-ascii titles\r\n                        print(title.encode('utf8'))\r\n                    downloader.download(\r\n                        urls=main_link,\r\n                        save_path=this_paper_main_path,\r\n                        time_sleep_in_seconds=time_step_in_seconds\r\n                    )\r\n            except Exception as e:\r\n                # error_flag = True\r\n                print('Error: ' + title + ' - ' + str(e))\r\n                error_log.append((title, main_link, 'main paper download error', str(e)))\r\n\r\n    # store the results\r\n    
# 1. store in the pickle file\r\n    # with open(f'{postfix}_pre.dat', 'wb') as f:\r\n    #     pickle.dump(paper_dict, f)\r\n\r\n    # 2. write error log\r\n    print('write error log')\r\n    log_file_pathname = os.path.join(\r\n        project_root_folder, 'log', 'download_err_log.txt')\r\n    with open(log_file_pathname, 'w') as f:\r\n        for log in tqdm(error_log):\r\n            for e in log:\r\n                if e is not None:\r\n                    f.write(e)\r\n                else:\r\n                    f.write('None')\r\n                f.write('\\n')\r\n\r\n            f.write('\\n')\r\n\r\n\r\ndef download_special_topics_and_issues_paper(save_dir, time_step_in_seconds=5, downloader='IDM'):\r\n    \"\"\"\r\n    download all JMLR special-topics and special-issues paper files and save\r\n    them in save_dir\r\n    :param save_dir: str, paper and supplement material's saving path\r\n    :param time_step_in_seconds: int, the interval time between two download requests in seconds\r\n    :param downloader: str, the downloader to use, either 'IDM' or 'Thunder'; defaults to 'IDM'\r\n    :return: True\r\n    \"\"\"\r\n    homepage = 'https://www.jmlr.org/papers/'\r\n    headers = {\r\n        'User-Agent':\r\n            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}\r\n    # postfix = f'JMLR_v{volumn}'\r\n\r\n    content = urlopen_with_retry(url=homepage, headers=headers)\r\n    soup = BeautifulSoup(content, 'html5lib')\r\n    # soup = BeautifulSoup(open(r'..\\JMLR_2011.html', 'rb'), 'html.parser')\r\n\r\n    all_topics = soup.find('div', {'id': 'content'}).find_all(['h2', 'p'])\r\n    is_topic = False\r\n    is_issue = False\r\n    for topic in all_topics:\r\n        if 'h2' == topic.name and slugify(topic.text.strip()) == 'special-topics':\r\n            is_topic = True\r\n        elif 'h2' == topic.name:\r\n            is_topic = False\r\n            if 'special-issues' == 
slugify(topic.text.strip()):\r\n                is_issue = True\r\n        if is_topic and 'p' == topic.name:\r\n            topic_name = slugify(topic.text.strip())\r\n            topic_url = urllib.parse.urljoin(homepage, topic.a.get('href'))\r\n            # print(f'T: {topic_name} url:{topic_url}')\r\n            print(f'processing special topic: {topic_name}')\r\n            download_paper(\r\n                volumn=1000,\r\n                save_dir=os.path.join(save_dir, 'special-topics', topic_name),\r\n                time_step_in_seconds=time_step_in_seconds,\r\n                downloader=downloader,\r\n                url=topic_url,\r\n                is_use_url=True\r\n            )\r\n            time.sleep(time_step_in_seconds)\r\n        if is_issue and 'p' == topic.name:\r\n            issue_name = slugify(topic.text.strip())\r\n            issue_url = urllib.parse.urljoin(homepage, topic.a.get('href'))\r\n            # print(f'T: {issue_name} url:{issue_url}')\r\n            print(f'processing special issue: {issue_name}')\r\n            download_paper(\r\n                volumn=1000,\r\n                save_dir=os.path.join(save_dir, 'special-issues', issue_name),\r\n                time_step_in_seconds=time_step_in_seconds,\r\n                downloader=downloader,\r\n                url=issue_url,\r\n                is_use_url=True\r\n            )\r\n            time.sleep(time_step_in_seconds)\r\n\r\n\r\nif __name__ == '__main__':\r\n    volumn = 25\r\n    download_paper(volumn, rf'W:\\all_papers\\JMLR\\JMLR_v{volumn}',\r\n                   time_step_in_seconds=3)\r\n    # download_special_topics_and_issues_paper(\r\n    #     rf'Z:\\all_papers\\JMLR', time_step_in_seconds=3, downloader='IDM')\r\n    pass\r\n"
  },
  {
    "path": "code/paper_downloader_NIPS.py",
    "content": "\"\"\"paper_downloader_NIPS.py\"\"\"\n\nimport urllib\nimport time\nfrom bs4 import BeautifulSoup\nimport pickle\nimport os\nfrom tqdm import tqdm\nfrom slugify import slugify\nimport csv\nimport sys\nroot_folder = os.path.abspath(\n    os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\nsys.path.append(root_folder)\nfrom lib.supplement_porcess import move_main_and_supplement_2_one_directory\nfrom lib.downloader import Downloader\nfrom lib import csv_process\nfrom lib.openreview import download_nips_papers_given_url\nfrom lib.my_request import urlopen_with_retry\n\n\ndef save_csv(year):\n    \"\"\"\n    write nips papers' and supplemental material's urls in one csv file\n    :param year: int\n    :return: num_download: int, the total number of papers.\n    \"\"\"\n    project_root_folder = os.path.abspath(\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\n    csv_file_pathname = os.path.join(\n        project_root_folder, 'csv', f'NIPS_{year}.csv'\n    )\n    with open(csv_file_pathname, 'w', newline='') as csvfile:\n        fieldnames = ['title', 'main link', 'supplemental link']\n        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\n        writer.writeheader()\n        headers = {\n            'User-Agent':\n                'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '\n                'Gecko/20100101 Firefox/23.0'}\n        init_url = f'https://proceedings.neurips.cc/paper/{year}'\n        dat_file_pathname = os.path.join(\n            project_root_folder, 'urls', f'init_url_nips_{year}.dat')\n        if os.path.exists(dat_file_pathname):\n            with open(dat_file_pathname, 'rb') as f:\n                content = pickle.load(f)\n        else:\n            content = urlopen_with_retry(url=init_url, headers=headers)\n            with open(dat_file_pathname, 'wb') as f:\n                pickle.dump(content, f)\n        soup = BeautifulSoup(content, 'html.parser')\n        paper_list = soup.find(\n    
        'div', {'class': 'container-fluid'}).find_all('li')\n        # num_download = 5 # number of papers to download\n        num_download = len(paper_list)\n        paper_list_bar = tqdm(zip(paper_list, range(num_download)))\n        # iterate the tqdm bar itself so set_description updates it,\n        # instead of creating a second progress bar\n        for paper in paper_list_bar:\n            paper_dict = {'title': '',\n                          'main link': '',\n                          'supplemental link': ''}\n            # get title\n            # print('\\n')\n            this_paper = paper[0]\n            title = slugify(this_paper.a.text)\n            paper_dict['title'] = title\n            # print('Downloading paper {}/{}: {}'.format(\n            # paper[1] + 1, num_download, title))\n            paper_list_bar.set_description(\n                'Tracing paper {}/{}: {}'.format(\n                    paper[1] + 1, num_download, title))\n\n            # get abstract page url\n            url2 = this_paper.a.get('href')\n            abs_url = urllib.parse.urljoin(init_url, url2)\n            abs_content = urlopen_with_retry(url=abs_url, headers=headers,\n                                             raise_error_if_failed=False)\n            if abs_content is not None:\n                soup_temp = BeautifulSoup(abs_content, 'html.parser')\n                # abstract = soup_temp.find(\n                # 'p', {'class': 'abstract'}).text.strip()\n                # paper_dict[title] = abstract\n                all_a = soup_temp.findAll('a')\n                for a in all_a:\n                    # print(a.text[:-2])\n                    # print(a.text[:-2].strip().lower())\n                    if 'paper' == a.text[:-2].strip().lower():\n                        paper_dict['main link'] = urllib.parse.urljoin(\n                            abs_url, a.get('href'))\n                    elif 'supplemental' == a.text[:-2].strip().lower():\n                        paper_dict['supplemental link'] = \\\n                            urllib.parse.urljoin(abs_url, 
a.get('href'))\n                        break\n            else:\n                print('Error: ' + title)\n                if paper_dict['main link'] == '':\n                    paper_dict['main link'] = 'error'\n                if paper_dict['supplemental link'] == '':\n                    paper_dict['supplemental link'] = 'error'\n            writer.writerow(paper_dict)\n            time.sleep(1)\n    return num_download\n\n\ndef download_from_csv(\n        year, save_dir, is_download_mainpaper=True, is_download_supplement=True,\n        time_step_in_seconds=5, total_paper_number=None, downloader='IDM'):\n    \"\"\"\n    download all NIPS paper and supplement files for the given year, and save\n    them in save_dir/main_paper and save_dir/supplement respectively\n    :param year: int, NIPS year, such as 2019\n    :param save_dir: str, paper and supplement material's save path\n    :param is_download_mainpaper: bool, True for downloading main papers\n    :param is_download_supplement: bool, True for downloading supplemental\n        material\n    :param time_step_in_seconds: int, the interval time between two download\n        requests in seconds\n    :param total_paper_number: int, the total number of papers to be\n        downloaded\n    :param downloader: str, the downloader to use, either 'IDM' or\n        'Thunder'; defaults to 'IDM'\n    :return: True\n    \"\"\"\n    project_root_folder = os.path.abspath(\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\n    postfix = f'NIPS_{year}'\n    csv_file_path = os.path.join(project_root_folder, 'csv', f'NIPS_{year}.csv')\n    return csv_process.download_from_csv(\n        postfix=postfix,\n        save_dir=save_dir,\n        csv_file_path=csv_file_path,\n        is_download_supplement=is_download_supplement,\n        time_step_in_seconds=time_step_in_seconds,\n        total_paper_number=total_paper_number,\n        downloader=downloader\n    )\n\n\n# def rename_supp( year, supp_dir):\n#     
\"\"\"\n#     rename supplemental material\n#     :param year: int, NIPS year, such 2019\n#     :param supp_dir: str, supplement material's save path\n#     :return: True\n#     \"\"\"\n#     if not os.path.exists(supp_dir):\n#         raise ValueError(f'''can't find path {supp_dir}''')\n#\n#     postfix = f'NIPS_{year}'\n#     with open(f'..\\\\csv\\\\NIPS_{year}.csv', newline='') as csvfile:\n#         myreader = csv.DictReader(csvfile, delimiter=',')\n#         pbar = tqdm(myreader)\n#         for this_paper in pbar:\n#             title = slugify(this_paper['title'])\n#             this_paper_supp_path_no_ext = os.path.join(\n#             supp_dir, f'{title}_{postfix}_supp.')\n#\n#             if '' != this_paper['supplemental link']:\n#                 supp_ori_name = this_paper['supplemental link'].split('/')[-1]\n#                 supp_type = supp_ori_name.split('.')[-1]\n#                 if os.path.exists(os.path.join(supp_dir, supp_ori_name)) and \\\n#                 not os.path.exists(\n#                         this_paper_supp_path_no_ext + supp_type):\n#                     os.rename(\n#                         os.path.join(supp_dir, supp_ori_name),\n#                         this_paper_supp_path_no_ext + supp_type\n#                     )\n#                 pbar.set_description(f'Renaming paper: {title}...')\n\n\nif __name__ == '__main__':\n    year = 2024\n    # total_paper_number = 1899\n    # total_paper_number = save_csv(year)\n    # download_from_csv(\n    #     year, f'..\\\\NIPS_{year}',\n    #     is_download_mainpaper=False,\n    #     is_download_supplement=True,\n    #     time_step_in_seconds=20,\n    #     total_paper_number=total_paper_number,\n    #     downloader='IDM')\n    download_nips_papers_given_url(\n        save_dir=rf'E:\\NIPS_{year}',\n        year=year,\n        base_url=f'https://openreview.net/group?id=NeurIPS.cc/'\n                 f'{year}/Conference',\n        time_step_in_seconds=10,\n        # 
download_groups=['poster'],\n        downloader='IDM')\n    # move_main_and_supplement_2_one_directory(\n    #     main_path=rf'F:\\workspace\\python3_ws\\paper_downloader-master\\NIPS_{year}\\main_paper',\n    #     supplement_path=rf'F:\\workspace\\python3_ws\\paper_downloader-master\\NIPS_{year}\\supplement',\n    #     supp_pdf_save_path=rf'F:\\workspace\\python3_ws\\paper_downloader-master\\NIPS_{year}\\supplement_pdf'\n    # )\n"
  },
  {
    "path": "code/paper_downloader_RSS.py",
    "content": "\"\"\"paper_downloader_RSS.py\n20240322\"\"\"\nimport time\nimport urllib\nfrom urllib.error import HTTPError\nfrom bs4 import BeautifulSoup\nimport pickle\nimport os\nfrom tqdm import tqdm\nfrom slugify import slugify\nimport csv\nimport sys\nfrom datetime import datetime\n\nroot_folder = os.path.abspath(\n    os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\nsys.path.append(root_folder)\nfrom lib import csv_process\nfrom lib.my_request import urlopen_with_retry\n\n\ndef get_paper_pdf_link(abs_url):\n    \"\"\"get the paper pdf link from the abstract page url.\n       For the newest papers that have not yet been added to\n       \"https://www.roboticsproceedings.org/rss19/index.html\"\n\n    Args:\n        abs_url (str): paper abstract page url.\n    \"\"\"\n    headers = {\n                'User-Agent':\n                    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '\n                    'Gecko/20100101 Firefox/23.0'}\n    content = urlopen_with_retry(url=abs_url, headers=headers)\n    soup = BeautifulSoup(content, 'html5lib')\n    paper_pdf_div = soup.find('div', {'class': 'paper-pdf'})\n    paper_pdf_div = paper_pdf_div.find('a').get('href')\n    return paper_pdf_div\n\n\ndef save_csv(year):\n    \"\"\"\n    write RSS papers' urls in one csv file\n    :param year: int, RSS year, such as 2023\n    :return: paper_index: int, the total number of papers\n    \"\"\"\n    conference = \"RSS\"\n    project_root_folder = os.path.abspath(\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\n    csv_file_pathname = os.path.join(\n        project_root_folder, 'csv', f'{conference}_{year}.csv'\n    )\n    error_log = []\n    paper_index = 0\n    with open(csv_file_pathname, 'w', newline='') as csvfile:\n        fieldnames = ['title', 'main link', 'supplemental link']\n        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)\n        writer.writeheader()\n        is_from_proceed = True\n        # True to get papers from 
\"https://www.roboticsproceedings.org\"\n        # False to get papers from \"https://roboticsconference.org/\"\n        init_url = f'https://www.roboticsproceedings.org/rss' \\\n                   f'{year-2004 :0>2d}/index.html'\n        # determine whether this year's papers have been added to\n        # \"https://www.roboticsproceedings.org\"\n        # If not, get papers from \"https://roboticsconference.org/\"\n        try:\n            headers = {\n                'User-Agent':\n                    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '\n                    'Gecko/20100101 Firefox/23.0'}\n            req = urllib.request.Request(url=init_url, headers=headers)\n            urllib.request.urlopen(req, timeout=20)\n        except HTTPError as e:\n            if e.code == 404:  # not added\n                current_year = datetime.now().year\n                if year == current_year:\n                    init_url = f'https://roboticsconference.org/program/papers/'\n                else:\n                    init_url = f'https://roboticsconference.org/{year}/program/papers/'\n                is_from_proceed = False\n        url_file_pathname = os.path.join(\n            project_root_folder, 'urls', \n            f'init_url_{conference}_{year}_'\n            f'''{'proc' if is_from_proceed else 'conf'}.dat'''\n        )\n        if os.path.exists(url_file_pathname):\n            with open(url_file_pathname, 'rb') as f:\n                content = pickle.load(f)\n        else:\n            headers = {\n                'User-Agent':\n                    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '\n                    'Gecko/20100101 Firefox/23.0'}\n            content = urlopen_with_retry(url=init_url, headers=headers)\n            with open(url_file_pathname, 'wb') as f:\n                pickle.dump(content, f)\n\n        soup = BeautifulSoup(content, 'html5lib')\n        if is_from_proceed:\n            paper_list = soup.find('div', {'class': 
'content'}).find_all('tr')\n        else:\n            paper_list = soup.find('table', {'id': 'myTable'}).find_all('tr')\n        paper_list_bar = tqdm(paper_list)\n        paper_index = 0\n        title_index = 0\n        for i, paper in enumerate(paper_list_bar):\n            paper_dict = {'title': '',\n                          'main link': '',\n                          'supplemental link': ''}\n            # get title\n            try:\n                if not is_from_proceed and i == 0:\n                    # header\n                    fields = paper.find_all('th')\n                    fields = [f.text.lower() for f in fields]\n                    title_index = fields.index('title')\n                tds = paper.find_all('td')\n                if len(tds) < 2:  # separator row\n                    continue\n                if is_from_proceed:\n                    title = slugify(tds[0].a.text)\n                    main_link = tds[1].a.get('href')\n                    main_link = urllib.parse.urljoin(init_url, main_link)\n                else:\n                    title = slugify(tds[title_index].a.text)\n                    abs_link = tds[title_index].a.get('href')\n                    abs_link = urllib.parse.urljoin(init_url, abs_link)\n                    main_link = get_paper_pdf_link(abs_link)\n\n                paper_dict['title'] = title\n                paper_dict['main link'] = main_link\n                paper_index += 1\n                paper_list_bar.set_description_str(\n                    f'Collected paper {paper_index}: {title}')\n                writer.writerow(paper_dict)\n                csvfile.flush()  # write to file immediately\n            except Exception as e:\n                print(f'Warning: {str(e)}')\n\n    # write error log\n    print('write error log')\n    log_file_pathname = os.path.join(\n        project_root_folder, 'log', 'download_err_log.txt'\n    )\n    with open(log_file_pathname, 'w') as f:\n        for log 
in tqdm(error_log):\n            for e in log:\n                if e is not None:\n                    f.write(e)\n                else:\n                    f.write('None')\n                f.write('\\n')\n\n            f.write('\\n')\n    return paper_index\n\n\ndef download_from_csv(\n        year, save_dir, time_step_in_seconds=5, total_paper_number=None,\n        csv_filename=None, downloader='IDM', is_random_step=True,\n        proxy_ip_port=None):\n    \"\"\"\n    download all RSS papers for the given year\n    :param year: int, RSS year, such as 2019\n    :param save_dir: str, paper and supplement material's save path\n    :param time_step_in_seconds: int, the interval time between two download\n        requests in seconds\n    :param total_paper_number: int, the total number of papers to be\n        downloaded\n    :param csv_filename: None or str, the csv file's name, None means to use\n        default setting\n    :param downloader: str, the downloader to use, either 'IDM' or\n        'Thunder'; defaults to 'IDM'\n    :param is_random_step: bool, whether to randomly sample the time step between\n        two adjacent download requests. 
If True, the time step will be sampled\n        from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds.\n        Default: True.\n    :param proxy_ip_port: str or None, proxy server ip address with or without\n        protocol prefix, eg: \"127.0.0.1:7890\", \"http://127.0.0.1:7890\".\n        Default: None\n    :return: True\n    \"\"\"\n    conference = \"RSS\"\n    postfix = f'{conference}_{year}'\n    project_root_folder = os.path.abspath(\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\n    csv_file_path = os.path.join(\n        project_root_folder, 'csv',\n        f'{conference}_{year}.csv' if csv_filename is None else csv_filename)\n    csv_process.download_from_csv(\n        postfix=postfix,\n        save_dir=save_dir,\n        csv_file_path=csv_file_path,\n        is_download_supplement=False,\n        time_step_in_seconds=time_step_in_seconds,\n        total_paper_number=total_paper_number,\n        downloader=downloader,\n        is_random_step=is_random_step,\n        proxy_ip_port=proxy_ip_port\n    )\n\n\nif __name__ == '__main__':\n    year = 2025\n    total_paper_number = save_csv(year)\n    # total_paper_number = 134\n    download_from_csv(year, save_dir=fr'E:\\RSS\\RSS_{year}',\n                        time_step_in_seconds=15,\n                        total_paper_number=total_paper_number)\n    time.sleep(2)\n\n    pass\n"
  },
  {
    "path": "lib/IDM.py",
    "content": "import subprocess\r\nimport os\r\nimport time\r\nimport random\r\n\r\n\r\ndef download(urls, save_path, time_sleep_in_seconds=5, is_random_step=True,\r\n             verbose=False):\r\n    \"\"\"\r\n    download file from the given url and save it to the given path\r\n    :param urls: str, the download url\r\n    :param save_path: str, full path\r\n    :param time_sleep_in_seconds: int, sleep seconds after call\r\n    :param is_random_step: bool, whether to randomly sample the time step between\r\n        two adjacent download requests. If True, the time step will be sampled\r\n        from Uniform(0.5t, 1.5t), where t is the given time_sleep_in_seconds.\r\n        Default: True.\r\n    :param verbose: bool, whether to display time step information.\r\n        Default: False\r\n    :return: None\r\n    \"\"\"\r\n    idm_path = '\"C:\\Program Files (x86)\\Internet Download Manager\\IDMan.exe\"'  # should be replaced by the local IDM path\r\n    basic_command = [idm_path, '/d', 'xxxx', '/p', 'xxx', '/f', 'xxxx', '/n']\r\n    head, tail = os.path.split(save_path)\r\n    if '' != head:\r\n        os.makedirs(head, exist_ok=True)\r\n    basic_command[2] = urls\r\n    basic_command[4] = head\r\n    basic_command[6] = tail\r\n    p = subprocess.Popen(' '.join(basic_command))\r\n    # p.wait()\r\n    if is_random_step:\r\n        time_sleep_in_seconds = random.uniform(\r\n            0.5 * time_sleep_in_seconds,\r\n            1.5 * time_sleep_in_seconds,\r\n        )\r\n    if verbose:\r\n        print(f'\\t random sleep {time_sleep_in_seconds: .2f} seconds')\r\n    time.sleep(time_sleep_in_seconds)\r\n\r\n\r\n"
  },
  {
    "path": "lib/__init__.py",
    "content": ""
  },
  {
    "path": "lib/arxiv.py",
    "content": "\"\"\"\narxiv.py\n20240218\n\"\"\"\nfrom bs4 import BeautifulSoup\nfrom .my_request import urlopen_with_retry\n\ndef get_pdf_link_from_arxiv(abs_link, is_use_mirror=False):\n    headers = {\n        'User-Agent':\n            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '\n            'Gecko/20100101 Firefox/23.0'}\n    mirror = 'cn.arxiv.org'\n    if is_use_mirror:\n        abs_link = abs_link.replace('arxiv.org', mirror)\n\n    abs_content = urlopen_with_retry(\n        url=abs_link, headers=headers, raise_error_if_failed=False)\n    if abs_content is None:\n        return None\n    abs_soup = BeautifulSoup(abs_content, 'html.parser')\n    pdf_link = 'http://arxiv.org' + abs_soup.find('div', {\n        'class': 'full-text'}).find('ul').find('a').get('href')\n    if pdf_link[-3:] != 'pdf':\n        pdf_link += '.pdf'\n    if is_use_mirror:\n        pdf_link = pdf_link.replace('arxiv.org', mirror)\n    return pdf_link\n"
  },
  {
    "path": "lib/csv_process.py",
    "content": "\"\"\"\r\ncsv_process.py\r\n20210617\r\n\"\"\"\r\n\r\nimport os\r\nfrom tqdm import tqdm\r\nfrom slugify import slugify\r\nimport csv\r\nfrom lib.downloader import Downloader\r\n\r\n\r\ndef download_from_csv(\r\n        postfix, save_dir, csv_file_path, is_download_main_paper=True,\r\n        is_download_bib=True, is_download_supplement=True,\r\n        time_step_in_seconds=5, total_paper_number=None,\r\n        downloader='IDM', is_random_step=True, proxy_ip_port=None,\r\n        max_length_filename=128\r\n):\r\n    \"\"\"\r\n    download paper, bibtex and supplement files and save them to\r\n        save_dir/main_paper and save_dir/supplement respectively\r\n    :param postfix: str, postfix that will be added at the end of papers' title\r\n    :param save_dir: str, paper and supplement material's save path\r\n    :param csv_file_path: str, the full path to csv file\r\n    :param is_download_main_paper: bool, True for downloading main paper\r\n    :param is_download_bib: bool, True for downloading the bibtex file\r\n    :param is_download_supplement: bool, True for downloading supplemental\r\n        material\r\n    :param time_step_in_seconds: int, the interval time between two download\r\n        requests in seconds\r\n    :param total_paper_number: int, the total number of papers to be\r\n        downloaded\r\n    :param downloader: str, the downloader to use, either 'IDM' or None;\r\n        defaults to 'IDM'.\r\n    :param is_random_step: bool, whether to randomly sample the time step between\r\n        two adjacent download requests. If True, the time step will be sampled\r\n        from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds.\r\n        Default: True.\r\n    :param proxy_ip_port: str or None, proxy server ip address with or without\r\n        protocol prefix, eg: \"127.0.0.1:7890\", \"http://127.0.0.1:7890\".\r\n        Default: None\r\n    :param max_length_filename: int or None, max file name length. 
All the\r\n            files whose name length is not less than this will be renamed\r\n            before saving, the others will stay unchanged. None means\r\n            no limitation. Default: 128.\r\n    :return: True\r\n    \"\"\"\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    downloader = Downloader(\r\n        downloader=downloader, is_random_step=is_random_step,\r\n        proxy_ip_port=proxy_ip_port)\r\n    if not os.path.exists(csv_file_path):\r\n        raise ValueError(f'ERROR: file not found in {csv_file_path}!!!')\r\n\r\n    main_save_path = os.path.join(save_dir, 'main_paper')\r\n    if is_download_main_paper:\r\n        os.makedirs(main_save_path, exist_ok=True)\r\n    if is_download_supplement:\r\n        supplement_save_path = os.path.join(save_dir, 'supplement')\r\n        os.makedirs(supplement_save_path, exist_ok=True)\r\n\r\n    error_log = []\r\n    with open(csv_file_path, newline='') as csvfile:\r\n        myreader = csv.DictReader(csvfile, delimiter=',')\r\n        pbar = tqdm(myreader, total=total_paper_number)\r\n        i = 0\r\n        for this_paper in pbar:\r\n            is_download_bib &= ('bib' in this_paper)\r\n            is_grouped = ('group' in this_paper)\r\n            i += 1\r\n            # get title\r\n            if is_grouped:\r\n                group = slugify(this_paper['group'])\r\n            title = slugify(this_paper['title'])\r\n            title_main_pdf = short_name(\r\n                name=f'{title}_{postfix}.pdf',\r\n                max_length=max_length_filename\r\n            )\r\n            if total_paper_number is not None:\r\n                pbar.set_description(\r\n                    f'Downloading {postfix} paper {i} /{total_paper_number}')\r\n            else:\r\n                pbar.set_description(f'Downloading {postfix} paper {i}')\r\n            this_paper_main_path = os.path.join(\r\n                main_save_path, 
title_main_pdf)\r\n            if is_grouped:\r\n                this_paper_main_path = os.path.join(\r\n                    main_save_path, group, title_main_pdf)\r\n            if is_download_supplement:\r\n                this_paper_supp_title_no_ext = short_name(\r\n                    name=f'{title}_{postfix}_supp.',\r\n                    max_length=max_length_filename-3  # zip or pdf, so 3\r\n                )\r\n                this_paper_supp_path_no_ext = os.path.join(\r\n                    supplement_save_path, this_paper_supp_title_no_ext)\r\n                if is_grouped:\r\n                    this_paper_supp_path_no_ext = os.path.join(\r\n                        supplement_save_path, group,\r\n                        this_paper_supp_title_no_ext\r\n                    )\r\n                if '' != this_paper['supplemental link'] and os.path.exists(\r\n                        this_paper_main_path) and \\\r\n                        (os.path.exists(\r\n                            this_paper_supp_path_no_ext + 'zip') or\r\n                         os.path.exists(\r\n                            this_paper_supp_path_no_ext + 'pdf')):\r\n                    continue\r\n                elif '' == this_paper['supplemental link'] and \\\r\n                        os.path.exists(this_paper_main_path):\r\n                    continue\r\n            elif os.path.exists(this_paper_main_path):\r\n                continue\r\n            if 'error' == this_paper['main link']:\r\n                error_log.append((title, 'no MAIN link'))\r\n            elif '' != this_paper['main link']:\r\n                if is_grouped:\r\n                    if is_download_main_paper:\r\n                        os.makedirs(os.path.join(main_save_path, group),\r\n                                    exist_ok=True)\r\n                    if is_download_supplement:\r\n                        os.makedirs(os.path.join(supplement_save_path, group),\r\n                                    
exist_ok=True)\r\n                if is_download_main_paper:\r\n                    try:\r\n                        # download paper with IDM\r\n                        if not os.path.exists(this_paper_main_path):\r\n                            downloader.download(\r\n                                urls=this_paper['main link'].replace(\r\n                                    ' ', '%20'),\r\n                                save_path=os.path.join(\r\n                                    os.getcwd(), this_paper_main_path),\r\n                                time_sleep_in_seconds=time_step_in_seconds\r\n                            )\r\n                    except Exception as e:\r\n                        # error_flag = True\r\n                        print('Error: ' + title + ' - ' + str(e))\r\n                        error_log.append((title, this_paper['main link'],\r\n                                          'main paper download error', str(e)))\r\n                # download supp\r\n                if is_download_supplement:\r\n                    # check whether the supp can be downloaded\r\n                    if not (os.path.exists(\r\n                            this_paper_supp_path_no_ext + 'zip') or\r\n                            os.path.exists(\r\n                                this_paper_supp_path_no_ext + 'pdf')):\r\n                        if 'error' == this_paper['supplemental link']:\r\n                            error_log.append((title, 'no SUPPLEMENTAL link'))\r\n                        elif '' != this_paper['supplemental link']:\r\n                            supp_type = \\\r\n                            this_paper['supplemental link'].split('.')[-1]\r\n                            try:\r\n                                downloader.download(\r\n                                    urls=this_paper['supplemental link'],\r\n                                    save_path=os.path.join(\r\n                                        os.getcwd(),\r\n               
                         this_paper_supp_path_no_ext + supp_type),\r\n                                    time_sleep_in_seconds=time_step_in_seconds\r\n                                )\r\n                            except Exception as e:\r\n                                # error_flag = True\r\n                                print('Error: ' + title + ' - ' + str(e))\r\n                                error_log.append((title, this_paper[\r\n                                    'supplemental link'],\r\n                                                  'supplement download error',\r\n                                                  str(e)))\r\n                # download bibtex file\r\n                if is_download_bib:\r\n                    bib_path = this_paper_main_path[:-3] + 'bib'\r\n                    if not os.path.exists(bib_path):\r\n                        if 'error' == this_paper['bib']:\r\n                            error_log.append((title, 'no bibtex link'))\r\n                        elif '' != this_paper['bib']:\r\n                            try:\r\n                                downloader.download(\r\n                                    urls=this_paper['bib'],\r\n                                    save_path=os.path.join(os.getcwd(),\r\n                                                           bib_path),\r\n                                    time_sleep_in_seconds=time_step_in_seconds\r\n                                )\r\n                            except Exception as e:\r\n                                # error_flag = True\r\n                                print('Error: ' + title + ' - ' + str(e))\r\n                                error_log.append((title, this_paper['bib'],\r\n                                                  'bibtex download error',\r\n                                                  str(e)))\r\n\r\n        # 2. 
write error log\r\n        print('write error log')\r\n        log_file_pathname = os.path.join(\r\n            project_root_folder, 'log', 'download_err_log.txt'\r\n        )\r\n        # the log folder may not exist yet\r\n        os.makedirs(os.path.dirname(log_file_pathname), exist_ok=True)\r\n        with open(log_file_pathname, 'w') as f:\r\n            for log in tqdm(error_log):\r\n                for e in log:\r\n                    if e is not None:\r\n                        f.write(e)\r\n                    else:\r\n                        f.write('None')\r\n                    f.write('\\n')\r\n\r\n                f.write('\\n')\r\n\r\n    return True\r\n\r\n\r\ndef short_name(name, max_length, verbose=False):\r\n    \"\"\"\r\n    rename to a shorter name\r\n    Args:\r\n        name (str): original name\r\n        max_length (int): max file name length. All the\r\n            files whose name length is not less than this will be renamed\r\n            before saving, the others will stay unchanged. None means\r\n            no limitation.\r\n        verbose (bool): whether to print debug information. 
Default: False.\r\n    Returns:\r\n        new_name (str): short name.\r\n    \"\"\"\r\n    if max_length is None or len(name) < max_length:\r\n        # None means no length limitation\r\n        new_name = name\r\n    else:\r\n        # rename\r\n        try:\r\n            [title, postfix] = name.split('_', 1)  # only split to 2 parts\r\n            new_title = title[:max_length - len(postfix) - 2]\r\n            new_name = f'{new_title}_{postfix}'\r\n            if verbose:\r\n                print(f'\\nrenaming {name} \\n\\t-> {new_name}')\r\n        except ValueError:\r\n            # ValueError: not enough values to unpack (expected 2, got 1)\r\n            if verbose:\r\n                print(f'\\nWARNING!!!:\\n\\tunable to parse postfix from {name}')\r\n                print('\\tSo, it will just be truncated to a short name')\r\n            ext = os.path.splitext(name)[1]\r\n            new_title = name[:max_length - len(ext) - 1]\r\n            new_name = f'{new_title}{ext}'\r\n            if verbose:\r\n                print(f'\\nrenaming {name} \\n\\t-> {new_name}')\r\n    return new_name\r\n"
  },
  {
    "path": "lib/cvf.py",
    "content": "\"\"\"\r\ncvf.py\r\n20210617\r\n\"\"\"\r\n\r\nimport urllib\r\nfrom bs4 import BeautifulSoup\r\nfrom tqdm import tqdm\r\nfrom slugify import slugify\r\nfrom .my_request import urlopen_with_retry\r\n\r\n\r\ndef get_paper_dict_list(url=None, content=None, group_name=None, timeout=10):\r\n    \"\"\"\r\n    parse papers' titles, main links and supplemental links from content, and save them in a list of dictionaries with keys \"title\",\r\n        \"main link\", \"supplemental link\", \"arxiv\" and \"group\" (optional, if group_name is not None)\r\n    :param url: str or None, url\r\n    :param content: None or the object returned by urlopen\r\n    :param group_name: str or None, the group name of the papers in the given content\r\n    :param timeout: int, the timeout value in seconds for opening the url, default to 10\r\n    :return: paper_dict_list, list of dictionaries of papers with keys \"title\",\r\n        \"main link\", \"supplemental link\", \"arxiv\" and \"group\" (optional, if group_name is not None)\r\n        content, the object returned by urlopen\r\n    \"\"\"\r\n    if url is None and content is None:\r\n        raise ValueError('''one of \"url\" and \"content\" should be provided!''')\r\n    paper_dict_list = []\r\n    paper_dict = {'title': '', 'main link': '', 'supplemental link': '', 'arxiv': ''} if group_name is None else \\\r\n        {'group': group_name, 'title': '', 'main link': '', 'supplemental link': '', 'arxiv': ''}\r\n\r\n    if content is None:\r\n        headers = {\r\n            'User-Agent':\r\n                'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}\r\n        content = urlopen_with_retry(url=url, headers=headers, time_out=timeout)\r\n    soup = BeautifulSoup(content, 'html5lib')\r\n    paper_list_bar = tqdm(soup.find('div', {'id': 'content'}).find_all(['dd', 'dt']))\r\n    paper_index = 0\r\n    for paper in paper_list_bar:\r\n        is_new_paper = False\r\n\r\n        # get title\r\n        try:\r\n            if 'dt' == paper.name and 
'ptitle' == paper.get('class')[0]:  # title:\r\n                title = slugify(paper.text.strip())\r\n                paper_dict['title'] = title\r\n                paper_index += 1\r\n                paper_list_bar.set_description_str(f'Collecting paper {paper_index}: {title}')\r\n            elif 'dd' == paper.name:\r\n                all_as = paper.find_all('a')\r\n                for a in all_as:\r\n                    if 'pdf' == slugify(a.text.strip()):\r\n                        main_link = urllib.parse.urljoin(url, a.get('href'))\r\n                        paper_dict['main link'] = main_link\r\n                        is_new_paper = True\r\n                    elif 'supp' == slugify(a.text.strip()):\r\n                        supp_link = urllib.parse.urljoin(url, a.get('href'))\r\n                        paper_dict['supplemental link'] = supp_link\r\n                    elif 'arxiv' == slugify(a.text.strip()):\r\n                        arxiv = urllib.parse.urljoin(url, a.get('href'))\r\n                        paper_dict['arxiv'] = arxiv\r\n                        break\r\n        except Exception as e:\r\n            print(f'Warning: {str(e)}')\r\n\r\n        if is_new_paper:\r\n            paper_dict_list.append(paper_dict.copy())\r\n            paper_dict['title'] = ''\r\n            paper_dict['main link'] = ''\r\n            paper_dict['supplemental link'] = ''\r\n            paper_dict['arxiv'] = ''\r\n\r\n    return paper_dict_list, content\r\n\r\n\r\n"
  },
  {
    "path": "lib/downloader.py",
    "content": "\"\"\"\r\ndownloader.py\r\n20210624\r\n\"\"\"\r\nimport time\r\nfrom lib import IDM\r\nimport requests\r\nimport os\r\nimport random\r\nfrom tqdm import tqdm\r\nfrom threading import Thread\r\nfrom lib.proxy import get_proxy_4_requests\r\n\r\n\r\ndef _download(urls, save_path, time_sleep_in_seconds=5, is_random_step=True,\r\n              verbose=False, proxy_ip_port=None):\r\n    \"\"\"\r\n    download file from given urls and save it to given path\r\n    :param urls: str, urls\r\n    :param save_path: str, full path\r\n    :param time_sleep_in_seconds: int, sleep seconds after call\r\n    :param is_random_step: bool, whether to randomly sample the time step between two\r\n        adjacent download requests. If True, the time step will be sampled\r\n        from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds.\r\n        Default: True.\r\n    :param verbose: bool, whether to display time step information.\r\n        Default: False\r\n    :param proxy_ip_port: str or None, proxy server ip address with or without\r\n        protocol prefix, eg: \"127.0.0.1:7890\", \"http://127.0.0.1:7890\".\r\n    :return: None\r\n    \"\"\"\r\n\r\n    def __download(urls, save_path, proxy_ip_port):\r\n        head, tail = os.path.split(save_path)\r\n        # debug\r\n        # print(f'downloading {tail}')\r\n        proxies = get_proxy_4_requests(proxy_ip_port)\r\n        r = requests.get(urls, stream=True, proxies=proxies)\r\n        # file size in MB; the content-length header may be absent\r\n        length = round(int(r.headers.get('content-length', 0)) / 1024**2, 2)\r\n        process_bar = tqdm(\r\n            colour='blue', total=length, unit='MB', desc=tail, initial=0)\r\n\r\n        if '' != head:\r\n            os.makedirs(head, exist_ok=True)\r\n\r\n        # open once in 'wb' mode so a retried download overwrites any\r\n        # partial file instead of appending to it\r\n        with open(save_path, 'wb') as file:\r\n            for part in r.iter_content(1024 ** 2):\r\n                process_bar.update(1)\r\n                file.write(part)\r\n        r.close()\r\n\r\n    # set daemon as False to continue 
downloading even if the main thread\r\n    # has been killed due to KeyboardInterrupt\r\n    t = Thread(\r\n        target=__download, args=(urls, save_path, proxy_ip_port), daemon=False)\r\n    t.start()\r\n\r\n    if is_random_step:\r\n        time_sleep_in_seconds = random.uniform(\r\n            0.5 * time_sleep_in_seconds,\r\n            1.5 * time_sleep_in_seconds,\r\n        )\r\n    if verbose:\r\n        print(f'\\t random sleep {time_sleep_in_seconds: .2f} seconds')\r\n    time.sleep(time_sleep_in_seconds)\r\n\r\n\r\nclass Downloader(object):\r\n    def __init__(self, downloader=None, is_random_step=True,\r\n                 proxy_ip_port=None):\r\n        \"\"\"\r\n        :param downloader: None or str, the downloader's name.\r\n            if downloader is None, python's requests library will be used to\r\n            download files; if downloader is 'IDM', the\r\n            \"Internet Download Manager\" will be used to download\r\n            files; otherwise a ValueError will be raised.\r\n        :param is_random_step: bool, whether to randomly sample the time step between\r\n            two adjacent download requests. If True, the time step will be\r\n            sampled from Uniform(0.5t, 1.5t), where t is the given\r\n            time_step_in_seconds. 
Default: True.\r\n        :param proxy_ip_port: str or None, proxy server ip address with or without\r\n            protocol prefix, eg: \"127.0.0.1:7890\", \"http://127.0.0.1:7890\".\r\n            (only useful for None|\"request\" downloader)\r\n            Default: None\r\n        \"\"\"\r\n        super(Downloader, self).__init__()\r\n        if downloader is not None and downloader.lower() not in ['idm']:\r\n            raise ValueError(\r\n                f'''ERROR: Unsupported downloader: {downloader}, '''\r\n                f'''we currently only support'''\r\n                f''' None (means python's requests) or \"IDM\" '''\r\n            )\r\n\r\n        self.downloader = downloader\r\n        self.is_random_step = is_random_step\r\n        self.proxy_ip_port = proxy_ip_port\r\n\r\n    def download(self, urls, save_path, time_sleep_in_seconds=5):\r\n        \"\"\"\r\n        download file from given urls and save it to given path\r\n        :param urls: str, urls\r\n        :param save_path: str, full path\r\n        :param time_sleep_in_seconds: int, sleep seconds after call\r\n        :return: None\r\n        \"\"\"\r\n        if self.downloader is None:\r\n            _download(\r\n                urls=urls,\r\n                save_path=save_path,\r\n                time_sleep_in_seconds=time_sleep_in_seconds,\r\n                is_random_step=self.is_random_step,\r\n                proxy_ip_port=self.proxy_ip_port\r\n            )\r\n        elif self.downloader.lower() == 'idm':\r\n            IDM.download(\r\n                urls=urls,\r\n                save_path=save_path,\r\n                time_sleep_in_seconds=time_sleep_in_seconds,\r\n                is_random_step=self.is_random_step\r\n            )\r\n"
  },
  {
    "path": "lib/my_request.py",
    "content": "\"\"\"\nmy_request.py\n20240412\n\"\"\"\n\nimport time\nimport urllib.request\nimport random\nfrom urllib.error import URLError, HTTPError\nfrom lib.proxy import set_proxy_4_urllib_request\n\n\ndef urlopen_with_retry(url, headers=dict(), retry_time=3, time_out=20,\n                       raise_error_if_failed=True, proxy_ip_port=None):\n    \"\"\"\n    load content from url with given headers. Retry if an error occurs.\n    Args:\n        url (str): url.\n        headers (dict): request headers. Default: {}.\n        retry_time (int): max number of retries. Default: 3.\n        time_out (int): timeout in seconds. Default: 20.\n        raise_error_if_failed (bool): whether to raise an error if failed.\n            Default: True.\n        proxy_ip_port(str|None): proxy server ip address with or without\n            protocol prefix, eg: \"127.0.0.1:7890\", \"http://127.0.0.1:7890\".\n            Default: None\n\n    Returns:\n        content(bytes|None): url content. None will be returned if failed.\n\n    \"\"\"\n    set_proxy_4_urllib_request(proxy_ip_port)\n    req = urllib.request.Request(url=url, headers=headers)\n    for r in range(retry_time):\n        try:\n            content = urllib.request.urlopen(req, timeout=time_out).read()\n            return content\n        except HTTPError as e:\n            print('The server couldn\\'t fulfill the request.')\n            print('Error code: ', e.code)\n            s = random.randint(3, 7)\n            print(f'random sleeping {s} seconds before retry '\n                  f'{r + 1}/{retry_time}...')\n            time.sleep(s)\n        except URLError as e:\n            print('We failed to reach a server.')\n            print('Reason: ', e.reason)\n            s = random.randint(3, 7)\n            print(f'random sleeping {s} seconds before retry '\n                  f'{r + 1}/{retry_time}...')\n            time.sleep(s)\n    if raise_error_if_failed:\n        raise ValueError(f'Failed to open {url} after trying {retry_time} '\n                         f'times!')\n    
else:\n        return None\n\n\n"
  },
  {
    "path": "lib/openreview.py",
    "content": "\"\"\"\nopenreview.py\n20230104\n\"\"\"\n\nimport time\nfrom tqdm import tqdm\nfrom selenium import webdriver\nfrom selenium.webdriver import ActionChains\nfrom selenium.webdriver.chrome.options import Options\nfrom selenium.webdriver.chrome.service import Service\nfrom selenium.webdriver.support.ui import WebDriverWait\nfrom selenium.webdriver.common.by import By\nfrom selenium.webdriver.support import expected_conditions as EC\nfrom webdriver_manager.chrome import ChromeDriverManager\nfrom selenium.webdriver.common.keys import Keys\nfrom selenium.common.exceptions import NoSuchElementException\nfrom selenium.common.exceptions import StaleElementReferenceException\nimport os\n# https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename\nfrom slugify import slugify\nfrom lib.downloader import Downloader\nfrom lib.proxy import get_proxy\nimport urllib\nfrom lib.arxiv import get_pdf_link_from_arxiv\n\n\ndef get_driver(proxy_ip_port=None):\n    # driver = webdriver.Chrome(driver_path)\n    capabilities = webdriver.DesiredCapabilities.CHROME\n    if proxy_ip_port is not None:\n        proxy = get_proxy(proxy_ip_port)\n        proxy.add_to_capabilities(capabilities)\n    \n    # https://stackoverflow.com/a/78797164\n    chrome_install = ChromeDriverManager().install()\n    folder = os.path.dirname(chrome_install)\n    chromedriver_path = os.path.join(folder, \"chromedriver.exe\")\n    driver = webdriver.Chrome(\n        service=Service(executable_path=chromedriver_path),\n        desired_capabilities=capabilities)\n    return driver\n\n\ndef __download_papers_given_divs(driver, divs, save_dir, paper_postfix,\n                                 time_step_in_seconds=10, downloader='IDM',\n                                 proxy_ip_port=None):\n    error_log = []\n    downloader = Downloader(downloader=downloader, proxy_ip_port=proxy_ip_port)\n    \n    # scroll to top of page\n    # 
https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium\n    driver.find_element(By.TAG_NAME, 'body').send_keys(\n        Keys.CONTROL + Keys.HOME)\n    time.sleep(0.3)\n\n    # titles = [d.text for d in divs]\n    titles = []\n    for d in divs:\n        for i in range(3):  # temp workaround for stale element errors\n            try:\n                titles.append(d.text)\n                break\n            except Exception as e:\n                if i == 2:\n                    print(f'\\tget Exception: {str(e)}')\n                    titles.append('')  # keep titles aligned with divs\n                time.sleep(0.3)\n\n    valid_divs = []\n    for i, t in enumerate(titles):\n        if len(t):\n            valid_divs.append(divs[i])\n    num_papers = len(valid_divs)\n    print('found number of papers:', num_papers)\n    name = None\n    for index, paper in enumerate(valid_divs):\n        is_get_paper = False\n        try:\n            a_hrefs = paper.find_elements(By.TAG_NAME, \"a\")\n            name = slugify(a_hrefs[0].text.strip())\n            if a_hrefs[1].get_attribute('class') == 'pdf-link':\n                # has pdf button\n                link = a_hrefs[1].get_attribute('href')\n                link = urllib.parse.urljoin('https://openreview.net', link)\n            else:\n                # raise ValueError('pdf link not found!')\n                print('\\tWarning: pdf link not found, skip this download...')\n                if name is not None:\n                    error_log.append((name, str(index)))\n                else:\n                    error_log.append((str(index), str(index)))\n                continue\n                # TODO: find pdf link in paper abstract page\n            if name == '':\n                continue\n            is_get_paper = True\n        except Exception as e:\n            print(f'\\tget Exception: {str(e)}')\n            print('\\tskip this download...')\n            if name is not None:\n                error_log.append((name, 
str(index)))\n            else:\n                error_log.append((str(index), str(index)))\n        if not is_get_paper:\n            continue\n\n        # name = slugify(paper.find_element_by_class_name('note_content_title').text)\n        # link = paper.find_element_by_class_name('note_content_pdf').get_attribute('href')\n        pdf_name = name + '_' + paper_postfix + '.pdf'\n        if not os.path.exists(os.path.join(save_dir, pdf_name)):\n            print('Downloading paper {}/{}: {}'.format(index + 1, num_papers,\n                                                       name))\n            # get pdf link of arxiv if the original link is on arxiv.org\n            if \"arxiv.org/abs\" in link:\n                link = get_pdf_link_from_arxiv(abs_link=link)\n            # try once\n            success_flag = False\n            for d_iter in range(1):\n                try:\n                    downloader.download(\n                        urls=link,\n                        save_path=os.path.join(save_dir, pdf_name),\n                        time_sleep_in_seconds=time_step_in_seconds\n                    )\n                    success_flag = True\n                    break\n                except Exception as e:\n                    print('Error: ' + name + ' - ' + str(e))\n            if not success_flag:\n                error_log.append((name, link))\n    return error_log, num_papers\n\n\ndef __get_into_pages_given_number(driver, page_number, pages, wait_fn,\n                                  condition=None):\n    wait_fn(driver, condition)\n    for page in pages:\n        if page.text.isnumeric() and int(page.text) == page_number:\n            page_link = page.find_element(By.TAG_NAME, \"a\")\n            page_link.click()\n            wait_fn(driver, condition)\n            return page\n\n    return None\n\n\ndef download_nips_papers_given_url(\n        save_dir, year, base_url, conference='NIPS', start_page=1,\n        time_step_in_seconds=10, 
download_groups='all', downloader='IDM',\n        proxy_ip_port=None):\n    \"\"\"\n    download NeurIPS papers from the given web url.\n    :param save_dir: str, paper save path\n    :type save_dir: str\n    :param year: int, conference year, currently only supports year >= 2018\n    :type year: int\n    :param base_url: str, paper website url\n    :type base_url: str\n    :param conference: str, conference name, such as NIPS.\n    :param start_page: int, the initial downloading webpage number, only the pages whose number is\n                            equal to or greater than this number will be processed.\n    :param time_step_in_seconds: int, the interval time between two download requests in seconds\n    :param download_groups: group name(s), such as 'oral', 'spotlight', 'poster'.\n        Default: 'all'.\n    :type download_groups: str | list[str]\n    :param downloader: str, the downloader to download, could be 'IDM' or None,\n        default to 'IDM'\n    :param proxy_ip_port: str or None, proxy server ip address with or without\n        protocol prefix, eg: \"127.0.0.1:7890\", \"http://127.0.0.1:7890\".\n        (only useful for None|\"request\" downloader and webdriver)\n        Default: None\n    :return:\n    \"\"\"\n    project_root_folder = os.path.abspath(\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\n    if year < 2023:\n        sub_xpath = '''id=\"accepted-papers\"'''\n    else:\n        sub_xpath = '''class=\"submissions-list\"'''\n    def mywait(driver, condition=None):\n        # wait for the select element to become visible\n        # print('Starting web driver wait...')\n        # ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,)\n        # wait = WebDriverWait(driver, 20, ignored_exceptions=ignored_exceptions)\n        wait = WebDriverWait(driver, 20)\n        # print('Starting web driver wait... 
finished')\n        # res = wait.until(EC.presence_of_element_located((By.ID, \"notes\")))\n        # print(\"Successful load the website!->\", res)\n        # res = wait.until(\n        #     EC.presence_of_element_located((By.CLASS_NAME, \"note\")))\n        res = wait.until(\n            EC.presence_of_element_located((By.ID, \"notes\")))\n        # print(\"Successful load the website notes!->\", res)\n        res = wait.until(EC.presence_of_element_located(\n            (By.XPATH, f'''//*[@{sub_xpath}]/nav''')))\n        # print(\"Successful load the website pagination!->\", res)\n        time.sleep(2)  # seconds, workaround for bugs\n\n    def find_divs_of_papers():\n        if year < 2023:\n            divs = driver.find_element(By.ID, group_id). \\\n                find_elements(By.CLASS_NAME, 'note ')\n        else:\n            # divs = driver.find_element(By.ID, group_id). \\\n            #     find_elements(By.XPATH, '//*[@class=\"note  undefined\"]')\n            divs = driver.find_element(By.ID, group_id).find_elements(\n                By.XPATH,\n                '//*[contains(@class, \"note\") and contains(@class, \"undefined\")]'\n            )\n        return divs\n\n    paper_postfix = f'{conference}_{year}'\n    error_log = []\n    driver = get_driver(proxy_ip_port=proxy_ip_port)\n    driver.get(base_url)\n\n    if not os.path.exists(save_dir):\n        os.makedirs(save_dir)\n\n    mywait(driver)\n    # pages = driver.find_elements_by_xpath('//*[@id=\"accepted-papers\"]/nav/ul/li')\n\n    # download grouped papers, such as \"Accepted Papers\" for years before 2023,\n    # \"Accept (oral)\", \"Accept (spotlight)\", \"Accept (poster)\" for year 2023\n    groups = driver.find_elements(\n        By.XPATH, f'//*[@id=\"notes\"]/div/div[1]/ul/li')\n    accept_groups = []\n    for g in groups:\n        if 'accept' in g.text.lower():\n            # whether to download this group\n            is_download_group = True\n            if not 'all' == 
download_groups:\n                is_download_group = False\n                for dg in download_groups:\n                    if dg.lower() in g.text.lower():\n                        is_download_group = True\n                        break\n            if is_download_group:\n                accept_groups.append(g)\n    group_name = None\n    group_save_dir = save_dir\n    for ag in accept_groups:\n        group_name = slugify(ag.text)\n        group_save_dir = os.path.join(save_dir, group_name)\n        print(f'Downloading {group_name}...')\n        os.makedirs(group_save_dir, exist_ok=True)\n        number_paper_group = 0\n        accept_group_link = ag.find_element(By.TAG_NAME, \"a\")\n        # group_id = accept_group_link.get_attribute('aria-controls')\n        group_id = accept_group_link.get_attribute('href').split('#')[-1]\n        # scroll to top of page; if not at top, the click action does not work\n        # https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium\n        driver.find_element(By.TAG_NAME, 'body').send_keys(\n            Keys.CONTROL + Keys.HOME)\n        time.sleep(0.2)\n        accept_group_link.click()\n        mywait(driver)\n        pages = driver.find_elements(\n            By.XPATH, f'//*[@{sub_xpath}]/nav[1]/ul/li')\n        page_str_list = get_pages_str(pages)\n        # print(f'Current page navigation bar:\\n{page_str_list}')\n        current_page = 1\n        ind_page = 2  # 0 << ; 1 <\n        # << | < | 1, 2, 3, ... 
| > | >>\n        total_pages_number = get_max_page_number(page_str_list)\n        last_total_pages = total_pages_number\n        # get into start pages\n        while current_page < start_page:\n            if total_pages_number < start_page:  # flip pages until seeing the start page\n                current_page = total_pages_number\n                __get_into_pages_given_number(\n                    driver=driver, page_number=current_page, pages=pages,\n                    wait_fn=mywait)\n                print(f'getting into web page {current_page}...')\n                # res = wait.until(EC.presence_of_element_located(\n                #     (By.XPATH, '//*[@id=\"accepted-papers\"]/ul/li/h4/a')))\n                # res = wait.until(EC.presence_of_element_located(\n                #     (By.XPATH, '''//*[@id=\"accepted-papers\"]/nav''')))\n                mywait(driver)\n\n                # print(\"Successful load the website pagination!->\", res)\n                # pages = driver.find_elements_by_xpath('//*[@id=\"accepted-papers\"]/nav/ul/li')\n                pages = driver.find_elements(\n                    By.XPATH, f'//*[@{sub_xpath}]/nav[1]/ul/li')\n                page_str_list = get_pages_str(pages)\n                total_pages_number = get_max_page_number(page_str_list)\n                # # print(f'Current page navigation bar:\\n{page_str_list}')\n                if total_pages_number == last_total_pages:  # total page count remains unchanged after reload\n                    print(f'reached last({total_pages_number}-th) webpage')\n                    # we got the last page, but its number is still less than start_page, so\n                    # the start page doesn't exist. 
PRINT ERROR and return\n                    print(f'ERROR: THE {start_page}-th webpage not found!')\n                    return\n            else:\n                current_page = start_page\n\n        page = __get_into_pages_given_number(\n            driver=driver, page_number=current_page, pages=pages,\n            wait_fn=mywait)\n\n        while current_page <= total_pages_number:\n            if page is None:\n                break\n            print(f'downloading papers in page: {current_page}')\n            mywait(driver)\n\n            # divs = driver.find_elements_by_xpath('//*[@id=\"accepted-papers\"]/ul/li')\n            # divs = driver.find_elements(By.XPATH, '//*[@id=\"accepted-papers\"]/ul/li')\n            divs = find_divs_of_papers()\n\n            # temp workaround: probe the first and last entries to make sure\n            # the page has fully loaded\n            repeat_times = 3\n            is_find_paper = False\n            for r in range(repeat_times):\n                try:\n                    a_hrefs = divs[0].find_elements(By.TAG_NAME, \"a\")\n                    name = slugify(a_hrefs[0].text.strip())\n                    link = a_hrefs[1].get_attribute('href')\n                    a_hrefs = divs[-1].find_elements(By.TAG_NAME, \"a\")\n                    name = slugify(a_hrefs[0].text.strip())\n                    link = a_hrefs[1].get_attribute('href')\n                    is_find_paper = True\n                    break\n                except Exception as e:\n                    if (r + 1) < repeat_times:\n                        print(f'\\terror occurred: {str(e)}')\n                        print(f'\\tsleep {(r + 1) * 5} seconds...')\n                        time.sleep((r + 1) * 5)\n                        print(f'{r + 1}-th reloading page')\n                        divs = find_divs_of_papers()\n                    else:\n                        print('\\tskip this page.')\n            if not is_find_paper:\n                continue\n\n            # time.sleep(time_step_in_seconds)\n            this_error_log, 
this_number_paper = __download_papers_given_divs(\n                driver=driver,\n                divs=divs,\n                save_dir=group_save_dir,\n                paper_postfix=paper_postfix,\n                time_step_in_seconds=time_step_in_seconds,\n                downloader=downloader,\n                proxy_ip_port=proxy_ip_port\n            )\n            for e in this_error_log:\n                error_log.append(e)\n            number_paper_group += this_number_paper\n            # get into next page\n            current_page += 1\n            # pages = driver.find_elements_by_xpath('//*[@id=\"accepted-papers\"]/nav/ul/li')\n            pages = driver.find_elements(\n                By.XPATH, f'//*[@{sub_xpath}]/nav[1]/ul/li')\n            page_str_list = get_pages_str(pages)\n            total_pages_number = get_max_page_number(page_str_list)\n            # print(f'Current page navigation bar:\\n{page_str_list}')\n            # if we do not re-read the pages, they become stale and raise\n            # selenium.common.exceptions.StaleElementReferenceException:\n            # Message: stale element reference: element is not attached to the page document\n            page = __get_into_pages_given_number(driver=driver,\n                                                 page_number=current_page,\n                                                 pages=pages,\n                                                 wait_fn=mywait)\n        # display total number of papers\n        print(f'number of papers in {group_name}: {number_paper_group}')\n\n    driver.quit()\n    # 2. 
write error log\n    print('write error log')\n    log_file_pathname = os.path.join(\n        project_root_folder, 'log', 'download_err_log.txt'\n    )\n    with open(log_file_pathname, 'w') as f:\n        for log in tqdm(error_log):\n            for e in log:\n                f.write(e)\n                f.write('\\n')\n            f.write('\\n')\n\n\ndef download_iclr_papers_given_url_and_group_id(\n        save_dir, year, base_url, group_id, conference='ICLR', start_page=1,\n        time_step_in_seconds=10, downloader='IDM', proxy_ip_port=None,\n        is_have_pages=True, is_need_click_group_button=False):\n    \"\"\"\n    download ICLR papers for the given web URL and paper group id\n    :param save_dir: str, paper save path\n    :type save_dir: str\n    :param year: int, ICLR year; currently only year >= 2018 is supported\n    :type year: int\n    :param base_url: str, paper website url\n    :type base_url: str\n    :param group_id: str, paper group id, such as \"notable-top-5-\",\n        \"notable-top-25-\", \"poster\", \"oral-submissions\",\n        \"spotlight-submissions\", \"poster-submissions\", etc.\n    :type group_id: str\n    :param conference: str, conference name, such as ICLR. Default: ICLR\n    :param start_page: int, the initial downloading webpage number, only the\n        pages whose number is equal to or greater than this number will be\n        processed. Default: 1\n    :param time_step_in_seconds: int, the interval time between two download\n        requests in seconds. Default: 10\n    :param downloader: str, the downloader to download, could be 'IDM' or\n        'Thunder'. Default: 'IDM'\n    :param proxy_ip_port: str or None, proxy ip address and port, e.g.\n        \"127.0.0.1:7890\". Only useful for webdriver and request\n        downloader (downloader=None). Default: None.\n    :type proxy_ip_port: str | None\n    :param is_have_pages: bool, whether the webpage has pagination. 
Default:\n        True.\n    :type is_have_pages: bool\n    :param is_need_click_group_button: bool, is there need to click the\n        group button in webpage. For some years, for example 2018, the\n        navigation part \"#xxxxx\" in base url will not work. And it should\n        be clicked before reading content from webpage. Default: False.\n    :type is_need_click_group_button: bool\n    :return:\n    \"\"\"\n    project_root_folder = os.path.abspath(\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\n    def _get_pages_xpath(year):\n        if year <= 2023:\n            xpath = f'''//*[@id=\"{group_id}\"]/nav/ul/li'''\n        else:\n            xpath = f'''//*[@id=\"{group_id}\"]/div/div/nav/ul/li'''\n        return xpath\n\n    def mywait(driver, condition=None):\n        # wait for the select element to become visible\n        # print('Starting web driver wait...')\n        # ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,)\n        # wait = WebDriverWait(driver, 20, ignored_exceptions=ignored_exceptions)\n        wait = WebDriverWait(driver, 20)\n        # print('Starting web driver wait... 
finished')\n        # res = wait.until(EC.presence_of_element_located((By.ID, \"notes\")))\n        # print(\"Successful load the website!->\", res)\n        if year <= 2023:\n            res = wait.until(\n                EC.presence_of_element_located((By.CLASS_NAME, \"note\")))\n        # print(\"Successful load the website notes!->\", res)\n        # res = wait.until(EC.presence_of_element_located(\n        #     (By.XPATH, f'''//*[@id=\"{group_id}\"]/nav''')))\n        if is_have_pages:\n            # scroll to bottom of page\n            # https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium\n            driver.find_element(By.TAG_NAME, 'body').send_keys(\n                Keys.CONTROL + Keys.END)\n            if year <= 2023:\n                wait.until(EC.element_to_be_clickable(\n                    (By.XPATH, f'{_get_pages_xpath(year)}[3]/a')))\n            else:\n                wait.until(EC.element_to_be_clickable(\n                    (By.XPATH, f'{_get_pages_xpath(year)}[3]/a')))\n            # print(\"Successful load the website pagination!->\", res)\n        time.sleep(2)  # seconds, workaround for bugs\n\n    paper_postfix = f'{conference}_{year}'\n    error_log = []\n    driver = get_driver(proxy_ip_port=proxy_ip_port)\n    driver.get(base_url)\n\n    if not os.path.exists(save_dir):\n        os.makedirs(save_dir)\n\n    if is_need_click_group_button:\n        archive_is_have_pages = is_have_pages\n        is_have_pages = False\n        mywait(driver)\n        aria_controls = base_url.split('#')[-1]\n        # scroll to home of page\n        driver.find_element(By.TAG_NAME, 'body').send_keys(\n            Keys.CONTROL + Keys.HOME)\n        group_button = driver.find_element(\n            By.XPATH, f\"\"\"//a[@aria-controls=\"{aria_controls}\"]\"\"\"\n        )\n        group_button.click()\n        is_have_pages = archive_is_have_pages\n    mywait(driver)\n    if is_have_pages:\n        pages = 
driver.find_elements(By.XPATH, _get_pages_xpath(year))\n        current_page = 1\n        ind_page = 2  # 0 << ; 1 <\n        total_pages_number = int(pages[-3].text)\n        # << | < | 1, 2, 3, ... | > | >>\n        last_total_pages = total_pages_number\n        # get into start pages\n        while current_page < start_page:\n            # flip pages until seeing the start page\n            if total_pages_number < start_page:\n                current_page = total_pages_number\n                __get_into_pages_given_number(\n                    driver=driver, page_number=current_page, pages=pages,\n                    wait_fn=mywait)\n                print(f'getting into web page {current_page}...')\n                # res = wait.until(EC.presence_of_element_located(\n                #     (By.XPATH, f'//*[@id=\"{group_id}\"]/ul/li/h4/a')))\n                # res = wait.until(EC.presence_of_element_located(\n                #     (By.XPATH, f'''//*[@id=\"{group_id}\"]/nav''')))\n                mywait(driver)\n\n                # print(\"Successful load the website pagination!->\", res)\n                pages = driver.find_elements(\n                    By.XPATH, _get_pages_xpath(year))\n                total_pages_number = int(pages[-3].text)\n                # total page count remains unchanged after reload\n                if total_pages_number == last_total_pages:\n                    print(f'reached last({total_pages_number}-th) webpage')\n                    # the last page is reached, but its number is still\n                    # less than start page, so the start page doesn't exist.\n                    # PRINT ERROR and return\n                    print(f'ERROR: THE {start_page}-th webpage not found!')\n                    return\n            else:\n                current_page = start_page\n\n        page = __get_into_pages_given_number(\n            driver=driver, page_number=current_page, pages=pages, wait_fn=mywait)\n\n        while current_page <= 
total_pages_number:\n            if page is None:\n                break\n            print(f'downloading {group_id} papers in page: {current_page}')\n            mywait(driver)\n\n            divs = driver.find_element(By.ID, group_id). \\\n                find_elements(By.CLASS_NAME, 'note ')\n\n            # temp workaround\n            repeat_times = 3\n            is_find_paper = False\n            for r in range(repeat_times):\n                try:\n                    a_hrefs = divs[0].find_elements(By.TAG_NAME, \"a\")\n                    name = slugify(a_hrefs[0].text.strip())\n                    link = a_hrefs[1].get_attribute('href')\n                    a_hrefs = divs[-1].find_elements(By.TAG_NAME, \"a\")\n                    name = slugify(a_hrefs[0].text.strip())\n                    link = a_hrefs[1].get_attribute('href')\n                    is_find_paper = True\n                    break\n                except Exception as e:\n                    if (r + 1) < repeat_times:\n                        print(f'\\terror occurred: {str(e)}')\n                        print(f'\\tsleep {(r + 1) * 5} seconds...')\n                        time.sleep((r + 1) * 5)\n                        print(f'{r + 1}-th reloading page')\n                        divs = driver.find_element(By.ID, group_id). 
\\\n                            find_elements(By.CLASS_NAME, 'note ')\n                    else:\n                        print('\\tskip this page.')\n            if not is_find_paper:\n                continue\n\n            # time.sleep(time_step_in_seconds)\n            this_error_log, this_number_paper = __download_papers_given_divs(\n                driver=driver,\n                divs=divs,\n                save_dir=save_dir,\n                paper_postfix=paper_postfix,\n                time_step_in_seconds=time_step_in_seconds,\n                downloader=downloader,\n                proxy_ip_port=proxy_ip_port\n            )\n            for e in this_error_log:\n                error_log.append(e)\n            # get into next page\n            current_page += 1\n            pages = driver.find_elements(\n                By.XPATH, _get_pages_xpath(year))\n            total_pages_number = int(pages[-3].text)\n            # if we do not reread the pages, all the pages will be not available\n            # with an exception:\n            # selenium.common.exceptions.StaleElementReferenceException:\n            # Message: stale element reference: element is not attached to the\n            # page document\n            page = __get_into_pages_given_number(\n                driver=driver, page_number=current_page, pages=pages,\n                wait_fn=mywait)\n    else:  # no pages\n        divs = driver.find_element(By.ID, group_id). 
\\\n            find_elements(By.CLASS_NAME, 'note ')\n        # temp workaround\n        repeat_times = 3\n        is_find_paper = False\n        for r in range(repeat_times):\n            try:\n                a_hrefs = divs[0].find_elements(By.TAG_NAME, \"a\")\n                name = slugify(a_hrefs[0].text.strip())\n                link = a_hrefs[1].get_attribute('href')\n                a_hrefs = divs[-1].find_elements(By.TAG_NAME, \"a\")\n                name = slugify(a_hrefs[0].text.strip())\n                link = a_hrefs[1].get_attribute('href')\n                is_find_paper = True\n                break\n            except Exception as e:\n                if (r + 1) < repeat_times:\n                    print(f'\\terror occurred: {str(e)}')\n                    print(f'\\tsleep {(r + 1) * 5} seconds...')\n                    time.sleep((r + 1) * 5)\n                    print(f'{r + 1}-th reloading page')\n                    divs = driver.find_element(By.ID, group_id). \\\n                        find_elements(By.CLASS_NAME, 'note ')\n                else:\n                    print('\\tskipped!!!')\n        if is_find_paper:\n            # time.sleep(time_step_in_seconds)\n            this_error_log, this_number_paper = __download_papers_given_divs(\n                driver=driver,\n                divs=divs,\n                save_dir=save_dir,\n                paper_postfix=paper_postfix,\n                time_step_in_seconds=time_step_in_seconds,\n                downloader=downloader,\n                proxy_ip_port=proxy_ip_port\n            )\n            for e in this_error_log:\n                error_log.append(e)\n\n    driver.quit()\n    # 2. 
write error log\n    print('write error log')\n    log_file_pathname = os.path.join(\n        project_root_folder, 'log', 'download_err_log.txt'\n    )\n    with open(log_file_pathname, 'w') as f:\n        for log in tqdm(error_log):\n            for e in log:\n                f.write(e)\n                f.write('\\n')\n            f.write('\\n')\n\n\ndef download_icml_papers_given_url_and_group_id(\n        save_dir, year, base_url, group_id, conference='ICML', start_page=1,\n        time_step_in_seconds=10, downloader='IDM', proxy_ip_port=None):\n    \"\"\"\n    download ICML papers for the given web URL and paper group id\n    :param save_dir: str, paper save path\n    :type save_dir: str\n    :param year: int, ICML year; currently only year >= 2018 is supported\n    :type year: int\n    :param base_url: str, paper website url\n    :type base_url: str\n    :param group_id: str, paper group id, such as \"poster\" and \"oral\".\n    :type group_id: str\n    :param conference: str, conference name, such as ICML. Default: ICML\n    :param start_page: int, the initial downloading webpage number, only the\n        pages whose number is equal to or greater than this number will be\n        processed. Default: 1\n    :param time_step_in_seconds: int, the interval time between two download\n        requests in seconds. Default: 10\n    :param downloader: str, the downloader to download, could be 'IDM' or\n        'Thunder'. Default: 'IDM'\n    :param proxy_ip_port: str or None, proxy ip address and port, e.g.\n        \"127.0.0.1:7890\". Only useful for webdriver and request\n        downloader (downloader=None). 
Default: None.\n    :type proxy_ip_port: str | None\n    :return:\n    \"\"\"\n    project_root_folder = os.path.abspath(\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\n    def mywait(driver, aria_controls=None):\n        # wait for the select element to become visible\n        # print('Starting web driver wait...')\n        wait = WebDriverWait(driver, 20)\n        # ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,)\n        # wait = WebDriverWait(driver, 20, ignored_exceptions=ignored_exceptions)\n        # print('Starting web driver wait... finished')\n        # res = wait.until(EC.presence_of_element_located((By.ID, \"notes\")))\n        # print(\"Successful load the website!->\", res)\n        res = wait.until(EC.presence_of_element_located((By.ID, \"notes\")))\n        res = wait.until(EC.presence_of_element_located((By.CLASS_NAME, \"submissions-list\")))\n        # print(\"Successful load the website notes!->\", res)\n        # res = wait.until(EC.presence_of_element_located(\n        #     (By.XPATH, f'''//*[@id=\"{group_id}\"]/nav''')))\n        # scroll to bottom of page\n        # https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium\n        driver.find_element(By.TAG_NAME, 'body').send_keys(\n            Keys.CONTROL + Keys.END)\n        time.sleep(0.3)\n        if aria_controls is None:\n            wait.until(EC.element_to_be_clickable(\n                (By.XPATH, f'//*[@class=\"submissions-list\"]/nav/ul/li[3]/a''')))\n        else:\n            wait.until(EC.element_to_be_clickable(\n                (By.XPATH,\n                 f'''//*[@id='{aria_controls}']/div/div/nav/ul/li[3]/a''')))\n            wait.until(EC.presence_of_element_located(\n                (By.XPATH,\n                 f'''//*[@id='{aria_controls}']/div/div/ul/li[1]/div/h4/a[1]''')))\n        # print(\"Successful load the website pagination!->\", res)\n        time.sleep(2)  # 
seconds, workaround for bugs\n\n    paper_postfix = f'{conference}_{year}'\n    error_log = []\n    driver = get_driver(proxy_ip_port=proxy_ip_port)\n    driver.get(base_url)\n\n    if not os.path.exists(save_dir):\n        os.makedirs(save_dir)\n\n    # wait = WebDriverWait(driver, 20)\n    mywait(driver)\n\n    # get into poster or oral page\n    nav_tap = driver.find_elements(\n        By.XPATH, f'//ul[@class=\"nav nav-tabs\"]/li')\n    is_found_group = False\n    for li in nav_tap:\n        if group_id in li.text.lower():\n            if 'poster' in group_id and 'spotlight' in li.text.lower():\n                # spotlight-poster should be recognized as spotlight rather\n                # than poster\n                continue\n            page_link = li.find_element(By.TAG_NAME, \"a\")\n            # scroll to top of page, if not at top, the click action not work\n            # https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium\n            driver.find_element(By.TAG_NAME, 'body').send_keys(\n                Keys.CONTROL + Keys.HOME)\n            aria_controls = page_link.get_attribute('aria-controls')\n            page_link.click()\n            mywait(driver, aria_controls)  # there is no request in here\n            is_found_group = True\n            break\n    if not is_found_group:\n        raise ValueError(f'not found {group_id} papers at {base_url}!!!')\n\n    # pages = driver.find_elements(\n    #     By.XPATH, f'//nav[@aria-label=\"page navigation\"]/ul/li')\n    pages = driver.find_elements(\n        By.XPATH, f'''//*[@id='{aria_controls}']/div/div/nav/ul/li''')\n    current_page = 1\n    # ind_page = 2  # 0 << ; 1 <\n    total_pages_number = int(pages[-3].text)  # << | < | 1, 2, 3, ... 
| > | >>\n    last_total_pages = total_pages_number\n    # get into start pages\n    while current_page < start_page:\n        # flip pages until seeing the start page\n        if total_pages_number < start_page:\n            current_page = total_pages_number\n            __get_into_pages_given_number(\n                driver=driver, page_number=current_page, pages=pages,\n                wait_fn=mywait, condition=aria_controls)\n            print(f'getting into web page {current_page}...')\n\n            # print(\"Successful load the website pagination!->\", res)\n            pages = driver.find_elements(\n                By.XPATH, f'''//*[@id='{aria_controls}']/div/div/nav/ul/li''')\n            total_pages_number = int(pages[-3].text)\n            # total page count remains unchanged after reload\n            if total_pages_number == last_total_pages:\n                print(f'reached last({total_pages_number}-th) webpage')\n                # the last page is reached, but its number is still less than\n                # start page, so the start page doesn't exist. 
PRINT ERROR and\n                # return\n                print(f'ERROR: THE {start_page}-th webpage not found!')\n                return\n        else:\n            current_page = start_page\n\n    page = __get_into_pages_given_number(\n        driver=driver, page_number=current_page, pages=pages, wait_fn=mywait,\n        condition=aria_controls)\n\n    while current_page <= total_pages_number:\n        if page is None:\n            break\n        print(f'downloading {group_id} papers in page: {current_page}')\n\n        divs = driver.find_elements(\n            By.XPATH, f'''//*[@id='{aria_controls}']/div/div/ul/li''')\n\n        # temp workaround\n        repeat_times = 3\n        is_find_paper = False\n        for r in range(repeat_times):\n            try:\n                a_hrefs = divs[0].find_elements(By.TAG_NAME, \"a\")\n                name = slugify(a_hrefs[0].text.strip())\n                link = a_hrefs[1].get_attribute('href')\n                a_hrefs = divs[-1].find_elements(By.TAG_NAME, \"a\")\n                name = slugify(a_hrefs[0].text.strip())\n                link = a_hrefs[1].get_attribute('href')\n                is_find_paper = True\n                break\n            except Exception as e:\n                if (r+1) < repeat_times:\n                    print(f'\\terror occurred: {str(e)}')\n                    print(f'\\tsleep {(r+1)*5} seconds...')\n                    time.sleep((r+1)*5)\n                    print(f'{r+1}-th reloading page')\n                    divs = driver.find_elements(\n                        By.XPATH,\n                        f'''//*[@id='{aria_controls}']/div/div/ul/li''')\n                else:\n                    print('\\tskip this page.')\n        if not is_find_paper:\n            continue\n        # time.sleep(time_step_in_seconds)\n        this_error_log, this_number_paper = __download_papers_given_divs(\n            driver=driver,\n            divs=divs,\n            save_dir=save_dir,\n            
paper_postfix=paper_postfix,\n            time_step_in_seconds=time_step_in_seconds,\n            downloader=downloader,\n            proxy_ip_port=proxy_ip_port\n        )\n        for e in this_error_log:\n            error_log.append(e)\n        # get into next page\n        current_page += 1\n        pages = driver.find_elements(\n            By.XPATH, f'''//*[@id='{aria_controls}']/div/div/nav/ul/li''')\n        total_pages_number = int(pages[-3].text)\n        # if we do not reread the pages, all the pages will be not available\n        # with an exception:\n        # selenium.common.exceptions.StaleElementReferenceException:\n        # Message: stale element reference: element is not attached to the\n        # page document\n        page = __get_into_pages_given_number(\n            driver=driver, page_number=current_page, pages=pages,\n            wait_fn=mywait, condition=aria_controls)\n\n    driver.quit()\n    # 2. write error log\n    print('write error log')\n    log_file_pathname = os.path.join(\n        project_root_folder, 'log', 'download_err_log.txt'\n    )\n    with open(log_file_pathname, 'w') as f:\n        for log in tqdm(error_log):\n            for e in log:\n                f.write(e)\n                f.write('\\n')\n            f.write('\\n')\n\n\ndef get_pages_str(pages):\n    page_str_list = [p.text for p in pages]\n    # print(f'Current page navigation bar:\\n{page_str_list}')\n    return page_str_list\n\n\ndef get_max_page_number(page_str_list):\n    is_find_number = False\n    for i, page_str in enumerate(page_str_list):\n        if not page_str.isnumeric() and is_find_number:\n            return int(page_str_list[i-1])\n        if page_str.isnumeric():\n            is_find_number = True\n    return int(page_str_list[-1])\n\n\ndef download_papers_given_url_and_group_id(\n        save_dir, year, base_url, group_id, conference, start_page=1,\n        time_step_in_seconds=10, downloader='IDM', proxy_ip_port=None,\n        
is_have_pages=True, is_need_click_group_button=False):\n    \"\"\"\n    download papers for the given web URL and paper group id\n    :param save_dir: str, paper save path\n    :type save_dir: str\n    :param year: int, conference year; currently only year >= 2018 is supported\n    :type year: int\n    :param base_url: str, paper website url\n    :type base_url: str\n    :param group_id: str, paper group id, such as \"notable-top-5-\",\n        \"notable-top-25-\", \"poster\", \"oral-submissions\",\n        \"spotlight-submissions\", \"poster-submissions\", etc.\n    :type group_id: str\n    :param conference: str, conference name, such as CORL.\n    :param start_page: int, the initial downloading webpage number, only the\n        pages whose number is equal to or greater than this number will be\n        processed. Default: 1\n    :param time_step_in_seconds: int, the interval time between two download\n        requests in seconds. Default: 10\n    :param downloader: str, the downloader to download, could be 'IDM' or\n        'Thunder'. Default: 'IDM'\n    :param proxy_ip_port: str or None, proxy ip address and port, e.g.\n        \"127.0.0.1:7890\". Only useful for webdriver and request\n        downloader (downloader=None). Default: None.\n    :type proxy_ip_port: str | None\n    :param is_have_pages: bool, whether the webpage has pagination. Default:\n        True.\n    :type is_have_pages: bool\n    :param is_need_click_group_button: bool, whether the group button in the\n        webpage needs to be clicked. For some years, for example 2018, the\n        navigation part \"#xxxxx\" in the base url will not work, so the button\n        must be clicked before reading content from the webpage. 
Default: False.\n    :type is_need_click_group_button: bool\n    :return:\n    \"\"\"\n    project_root_folder = os.path.abspath(\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\n    def _get_pages_xpath(year):\n        if year <= 2023:\n            xpath = f'''//*[@id=\"{group_id}\"]/nav/ul/li'''\n        else:\n            xpath = f'''//*[@id=\"{group_id}\"]/div/div/nav/ul/li'''\n        return xpath\n\n    def mywait(driver, condition=None):\n        # wait for the select element to become visible\n        # print('Starting web driver wait...')\n        # ignored_exceptions = (NoSuchElementException, \n        # StaleElementReferenceException,)\n        # wait = WebDriverWait(driver, 20, ignored_exceptions=ignored_exceptions)\n        wait = WebDriverWait(driver, 20)\n        # print('Starting web driver wait... finished')\n        # res = wait.until(EC.presence_of_element_located((By.ID, \"notes\")))\n        # print(\"Successful load the website!->\", res)\n        # if year <= 2023:\n        #     res = wait.until(\n        #         EC.presence_of_element_located((By.CLASS_NAME, \"note\")))\n        # print(\"Successful load the website notes!->\", res)\n        # res = wait.until(EC.presence_of_element_located(\n        #     (By.XPATH, f'''//*[@id=\"{group_id}\"]/nav''')))\n        if is_have_pages:\n            # scroll to bottom of page\n            # https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium\n            driver.find_element(By.TAG_NAME, 'body').send_keys(\n                Keys.CONTROL + Keys.END)\n            if year <= 2023:\n                wait.until(EC.element_to_be_clickable(\n                    (By.XPATH, f'{_get_pages_xpath(year)}[3]/a')))\n            else:\n                wait.until(EC.element_to_be_clickable(\n                    (By.XPATH, f'{_get_pages_xpath(year)}[3]/a')))\n            # print(\"Successful load the website pagination!->\", res)\n        
time.sleep(2)  # seconds, workaround for bugs\n\n    paper_postfix = f'{conference}_{year}'\n    error_log = []\n    \n    driver = get_driver(proxy_ip_port=proxy_ip_port)\n    driver.get(base_url)\n\n    if not os.path.exists(save_dir):\n        os.makedirs(save_dir)\n\n    if is_need_click_group_button:\n        archive_is_have_pages = is_have_pages\n        is_have_pages = False\n        mywait(driver)\n        aria_controls = base_url.split('#')[-1]\n        # scroll to home of page\n        driver.find_element(By.TAG_NAME, 'body').send_keys(\n            Keys.CONTROL + Keys.HOME)\n        group_button = driver.find_element(\n            By.XPATH, f\"\"\"//a[@aria-controls=\"{aria_controls}\"]\"\"\"\n        )\n        group_button.click()\n        is_have_pages = archive_is_have_pages\n    mywait(driver)\n    if is_have_pages:\n        pages = driver.find_elements(By.XPATH, _get_pages_xpath(year))\n        current_page = 1\n        ind_page = 2  # 0 << ; 1 <\n        total_pages_number = int(pages[-3].text)\n        # << | < | 1, 2, 3, ... 
| > | >>\n        last_total_pages = total_pages_number\n        # get into start pages\n        while current_page < start_page:\n            # flip pages until seeing the start page\n            if total_pages_number < start_page:\n                current_page = total_pages_number\n                __get_into_pages_given_number(\n                    driver=driver, page_number=current_page, pages=pages,\n                    wait_fn=mywait)\n                print(f'getting into web page {current_page}...')\n                # res = wait.until(EC.presence_of_element_located(\n                #     (By.XPATH, f'//*[@id=\"{group_id}\"]/ul/li/h4/a')))\n                # res = wait.until(EC.presence_of_element_located(\n                #     (By.XPATH, f'''//*[@id=\"{group_id}\"]/nav''')))\n                mywait(driver)\n\n                # print(\"Successful load the website pagination!->\", res)\n                pages = driver.find_elements(\n                    By.XPATH, _get_pages_xpath(year))\n                total_pages_number = int(pages[-3].text)\n                # total page count remains unchanged after reload\n                if total_pages_number == last_total_pages:\n                    print(f'reached last({total_pages_number}-th) webpage')\n                    # the last page is reached, but its number is still\n                    # less than start page, so the start page doesn't exist.\n                    # PRINT ERROR and return\n                    print(f'ERROR: THE {start_page}-th webpage not found!')\n                    return\n            else:\n                current_page = start_page\n\n        page = __get_into_pages_given_number(\n            driver=driver, page_number=current_page, pages=pages, wait_fn=mywait)\n\n        while current_page <= total_pages_number:\n            if page is None:\n                break\n            print(f'downloading {group_id} papers in page: {current_page}')\n            mywait(driver)\n\n            divs = 
driver.find_element(By.ID, group_id). \\\n                find_elements(By.CLASS_NAME, 'note ')\n\n            # temp workaround\n            repeat_times = 3\n            is_find_paper = False\n            for r in range(repeat_times):\n                try:\n                    a_hrefs = divs[0].find_elements(By.TAG_NAME, \"a\")\n                    name = slugify(a_hrefs[0].text.strip())\n                    link = a_hrefs[1].get_attribute('href')\n                    a_hrefs = divs[-1].find_elements(By.TAG_NAME, \"a\")\n                    name = slugify(a_hrefs[0].text.strip())\n                    link = a_hrefs[1].get_attribute('href')\n                    is_find_paper = True\n                    break\n                except Exception as e:\n                    if (r + 1) < repeat_times:\n                        print(f'\\terror occurred: {str(e)}')\n                        print(f'\\tsleep {(r + 1) * 5} seconds...')\n                        time.sleep((r + 1) * 5)\n                        print(f'{r + 1}-th reloading page')\n                        divs = driver.find_element(By.ID, group_id). 
\\\n                            find_elements(By.CLASS_NAME, 'note ')\n                    else:\n                        print('\\tskip this page.')\n            if not is_find_paper:\n                continue\n\n            # time.sleep(time_step_in_seconds)\n            this_error_log, this_number_paper = __download_papers_given_divs(\n                driver=driver,\n                divs=divs,\n                save_dir=save_dir,\n                paper_postfix=paper_postfix,\n                time_step_in_seconds=time_step_in_seconds,\n                downloader=downloader,\n                proxy_ip_port=proxy_ip_port\n            )\n            for e in this_error_log:\n                error_log.append(e)\n            # get into next page\n            current_page += 1\n            pages = driver.find_elements(\n                By.XPATH, _get_pages_xpath(year))\n            total_pages_number = int(pages[-3].text)\n            # if we do not reread the pages, all the pages will be not available\n            # with an exception:\n            # selenium.common.exceptions.StaleElementReferenceException:\n            # Message: stale element reference: element is not attached to the\n            # page document\n            page = __get_into_pages_given_number(\n                driver=driver, page_number=current_page, pages=pages,\n                wait_fn=mywait)\n    else:  # no pages\n        divs = driver.find_element(By.ID, group_id). 
\\\n            find_elements(By.CLASS_NAME, 'note ')\n        # temp workaround\n        repeat_times = 3\n        is_find_paper = False\n        for r in range(repeat_times):\n            try:\n                a_hrefs = divs[0].find_elements(By.TAG_NAME, \"a\")\n                name = slugify(a_hrefs[0].text.strip())\n                link = a_hrefs[1].get_attribute('href')\n                a_hrefs = divs[-1].find_elements(By.TAG_NAME, \"a\")\n                name = slugify(a_hrefs[0].text.strip())\n                link = a_hrefs[1].get_attribute('href')\n                is_find_paper = True\n                break\n            except Exception as e:\n                if (r + 1) < repeat_times:\n                    print(f'\\terror occurred: {str(e)}')\n                    print(f'\\tsleep {(r + 1) * 5} seconds...')\n                    time.sleep((r + 1) * 5)\n                    print(f'{r + 1}-th reloading page')\n                    divs = driver.find_element(By.ID, group_id). \\\n                        find_elements(By.CLASS_NAME, 'note ')\n                else:\n                    print('\\tskipped!!!')\n        if is_find_paper:\n            # time.sleep(time_step_in_seconds)\n            this_error_log, this_number_paper = __download_papers_given_divs(\n                driver=driver,\n                divs=divs,\n                save_dir=save_dir,\n                paper_postfix=paper_postfix,\n                time_step_in_seconds=time_step_in_seconds,\n                downloader=downloader,\n                proxy_ip_port=proxy_ip_port\n            )\n            for e in this_error_log:\n                error_log.append(e)\n\n    driver.quit()\n    # 2. 
write error log\n    print('write error log')\n    log_file_pathname = os.path.join(\n        project_root_folder, 'log', 'download_err_log.txt'\n    )\n    with open(log_file_pathname, 'w') as f:\n        for log in tqdm(error_log):\n            for e in log:\n                f.write(e)\n                f.write('\\n')\n            f.write('\\n')\n\n\n\nif __name__ == \"__main__\":\n    year = 2023\n    save_dir = rf'E:\\ICML_{year}'\n    base_url = 'https://openreview.net/group?id=ICML.cc/2023/Conference'\n    # download_nips_papers_given_url(\n    #     save_dir, year, base_url,\n    #     start_page=1,\n    #     time_step_in_seconds=10,\n    #     downloader='IDM')\n    # download_icml_papers_given_url_and_group_id(\n    #     save_dir, year, base_url, group_id='oral', start_page=1,\n    #     time_step_in_seconds=10, )\n"
  },
  {
    "path": "lib/pmlr.py",
"content": "\"\"\"\r\npmlr.py\r\n20210618\r\n\"\"\"\r\nfrom bs4 import BeautifulSoup\r\nimport os\r\nfrom tqdm import tqdm\r\nfrom slugify import slugify\r\nfrom lib.downloader import Downloader\r\nfrom .my_request import urlopen_with_retry\r\n\r\n\r\ndef download_paper_given_volume(\r\n        volume, save_dir, postfix, is_download_supplement=True,\r\n        time_step_in_seconds=5, downloader='IDM', is_random_step=True):\r\n    \"\"\"\r\n    download main and supplement papers from PMLR.\r\n    :param volume: str, such as 'v1', 'r1'\r\n    :param save_dir: str, paper and supplement material's save path\r\n    :param postfix: str, the postfix that will be appended to the end of papers' titles\r\n    :param is_download_supplement: bool, True for downloading supplemental material\r\n    :param time_step_in_seconds: int, the time interval between two download\r\n        requests, in seconds\r\n    :param downloader: str, the downloader to use, could be 'IDM' or None.\r\n        Default: 'IDM'\r\n    :param is_random_step: bool, whether to randomly sample the time step between\r\n        two adjacent download requests. 
If True, the time step will be sampled\r\n        from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds.\r\n        Default: True.\r\n    :return: True\r\n    \"\"\"\r\n    downloader = Downloader(\r\n        downloader=downloader, is_random_step=is_random_step)\r\n    init_url = f'http://proceedings.mlr.press/{volume}/'\r\n\r\n    if is_download_supplement:\r\n        main_save_path = os.path.join(save_dir, 'main_paper')\r\n        supplement_save_path = os.path.join(save_dir, 'supplement')\r\n        os.makedirs(main_save_path, exist_ok=True)\r\n        os.makedirs(supplement_save_path, exist_ok=True)\r\n    else:\r\n        main_save_path = save_dir\r\n        os.makedirs(main_save_path, exist_ok=True)\r\n    headers = {\r\n        'User-Agent':\r\n            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '\r\n            'Gecko/20100101 Firefox/23.0'}\r\n    content = urlopen_with_retry(url=init_url, headers=headers)\r\n    soup = BeautifulSoup(content, 'html.parser')\r\n    paper_list = soup.find_all('div', {'class': 'paper'})\r\n    error_log = []\r\n    title_list = []\r\n    num_download = len(paper_list)\r\n    pbar = tqdm(zip(paper_list, range(num_download)), total=num_download)\r\n    for paper in pbar:\r\n        # get title\r\n        this_paper = paper[0]\r\n        title = slugify(this_paper.find_all('p', {'class': 'title'})[0].text)\r\n        try:\r\n            pbar.set_description(\r\n                f'Downloading {postfix} paper {paper[1] + 1}/{num_download}:'\r\n                f' {title}')\r\n        except Exception:\r\n            pbar.set_description(\r\n                f'''Downloading {postfix} paper {paper[1] + 1}/{num_download}: '''\r\n                f'''{title.encode('utf8')}''')\r\n        title_list.append(title)\r\n\r\n        this_paper_main_path = os.path.join(main_save_path,\r\n                                            f'{title}_{postfix}.pdf')\r\n        if is_download_supplement:\r\n            this_paper_supp_path = 
os.path.join(\r\n                supplement_save_path, f'{title}_{postfix}_supp.pdf')\r\n            this_paper_supp_path_no_ext = os.path.join(\r\n                supplement_save_path, f'{title}_{postfix}_supp.')\r\n\r\n            if os.path.exists(this_paper_main_path) and os.path.exists(\r\n                    this_paper_supp_path):\r\n                continue\r\n        else:\r\n            if os.path.exists(this_paper_main_path):\r\n                continue\r\n\r\n        # get abstract page url\r\n        links = this_paper.find_all('p', {'class': 'links'})[0].find_all('a')\r\n        supp_link = None\r\n        main_link = None\r\n        for link in links:\r\n            if 'Download PDF' == link.text or 'pdf' == link.text:\r\n                main_link = link.get('href')\r\n            elif is_download_supplement and \\\r\n                    ('Supplementary PDF' == link.text or\r\n                     'Supplementary Material' == link.text or\r\n                     'supplementary' == link.text or\r\n                     'Supplementary ZIP' == link.text or\r\n                     'Other Files' == link.text):\r\n                supp_link = link.get('href')\r\n                if supp_link[-3:] != 'pdf':\r\n                    this_paper_supp_path = this_paper_supp_path_no_ext + \\\r\n                                           supp_link[-3:]\r\n\r\n        # try 1 time\r\n        # error_flag = False\r\n        for d_iter in range(1):\r\n            try:\r\n                # download paper with IDM\r\n                if not os.path.exists(\r\n                        this_paper_main_path) and main_link is not None:\r\n                    downloader.download(\r\n                        urls=main_link,\r\n                        save_path=this_paper_main_path,\r\n                        time_sleep_in_seconds=time_step_in_seconds\r\n                    )\r\n            except Exception as e:\r\n                # error_flag = True\r\n                print('Error: 
' + title + ' - ' + str(e))\r\n                error_log.append(\r\n                    (title, main_link, 'main paper download error', str(e)))\r\n            # download supp\r\n            if is_download_supplement:\r\n                # check whether the supp can be downloaded\r\n                if not os.path.exists(\r\n                        this_paper_supp_path) and supp_link is not None:\r\n                    try:\r\n                        downloader.download(\r\n                            urls=supp_link,\r\n                            save_path=this_paper_supp_path,\r\n                            time_sleep_in_seconds=time_step_in_seconds\r\n                        )\r\n                    except Exception as e:\r\n                        # error_flag = True\r\n                        print('Error: ' + title + ' - ' + str(e))\r\n                        error_log.append((title, supp_link,\r\n                                          'supplement download error', str(e)))\r\n\r\n    # write error log\r\n    print('writing error log...')\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    log_file_pathname = os.path.join(\r\n        project_root_folder, 'log', 'download_err_log.txt')\r\n    with open(log_file_pathname, 'w') as f:\r\n        for log in tqdm(error_log):\r\n            for e in log:\r\n                if e is not None:\r\n                    f.write(e)\r\n                else:\r\n                    f.write('None')\r\n                f.write('\\n')\r\n            f.write('\\n')\r\n\r\n    return True\r\n\r\n\r\nif __name__ == '__main__':\r\n    download_paper_given_volume(\r\n        volume='v150',\r\n        save_dir=r'D:\\The_KDD21_Workshop_on_Causal_Discovery',\r\n        postfix='',\r\n        is_download_supplement=False,\r\n        time_step_in_seconds=5,\r\n        downloader='IDM'\r\n    )\r\n"
  },
  {
    "path": "lib/proxy.py",
"content": "\"\"\"\nproxy.py\n20230228\n\"\"\"\nfrom selenium.webdriver.common.proxy import Proxy, ProxyType\nimport urllib.request\n\n\ndef get_proxy(ip_port: str):\n    \"\"\"\n    setup proxy\n    :param ip_port: str, proxy server ip address without protocol prefix,\n        eg: \"127.0.0.1:7890\"\n    :return: proxy (instance of selenium.webdriver.common.proxy.Proxy)\n    Then the proxy could be passed to webdriver.Chrome:\n        capabilities = webdriver.DesiredCapabilities.CHROME\n        proxy.add_to_capabilities(capabilities)\n        driver = webdriver.Chrome(\n            service=Service(ChromeDriverManager().install()),\n            desired_capabilities=capabilities)\n    \"\"\"\n    proxy = Proxy()\n    proxy.proxy_type = ProxyType.MANUAL\n    proxy.http_proxy = ip_port\n    proxy.ssl_proxy = ip_port\n    return proxy\n\n\ndef set_proxy_4_urllib_request(ip_port: str):\n    \"\"\"\n    setup proxy\n    :param ip_port: str or None, proxy server ip address with or without\n        protocol prefix, eg: \"127.0.0.1:7890\", \"http://127.0.0.1:7890\".\n    :return: proxies, dict with keys \"http\" and \"https\" or None.\n    \"\"\"\n    if ip_port is None:\n        proxies = None\n    else:\n        if not ip_port.startswith('http'):\n            ip_port = 'http://' + ip_port\n        proxies = {\n            'http': ip_port,\n            'https': ip_port\n        }\n        proxy_support = urllib.request.ProxyHandler(proxies)\n        opener = urllib.request.build_opener(proxy_support)\n        urllib.request.install_opener(opener)\n    return proxies\n\n\ndef get_proxy_4_requests(ip_port: str):\n    \"\"\"\n    setup proxy\n    :param ip_port: str or None, proxy server ip address with or without\n        protocol prefix, eg: \"127.0.0.1:7890\", \"http://127.0.0.1:7890\".\n    :return: proxies, dict with keys \"http\" and \"https\" or None.\n    \"\"\"\n    if ip_port is None:\n        proxies = None\n    else:\n        if not ip_port.startswith('http'):\n            
ip_port = 'http://' + ip_port\n        proxies = {\n            'http': ip_port,\n            'https': ip_port\n        }\n    return proxies\n\n\nif __name__ == \"__main__\":\n    # get my ip\n    import json\n    set_proxy_4_urllib_request('127.0.0.1:7897')\n    url = \"http://ip-api.com/json\"  # ipv4\n    response = urllib.request.urlopen(url)\n    data = json.load(response)\n    if data['status'] == 'success':\n        ip = data['query']\n        print(f'ip: {ip}')\n        print(f'details: {data}')\n    else:\n        print(f'failed, try again: {data}')\n\n\n\n"
  },
  {
    "path": "lib/springer.py",
    "content": "\"\"\"\r\nspringer.py\r\nsome function for springer\r\n20201106\r\n\"\"\"\r\n\r\nimport urllib\r\nfrom bs4 import BeautifulSoup\r\nfrom tqdm import tqdm\r\nfrom slugify import slugify\r\nfrom .my_request import urlopen_with_retry\r\nimport re\r\n\r\n\r\ndef get_paper_name_link_from_url(url):\r\n    headers = {\r\n        'User-Agent':\r\n            'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}\r\n    paper_dict = dict()\r\n    content = urlopen_with_retry(url=url, headers=headers)\r\n    soup = BeautifulSoup(content, 'html5lib')\r\n    paper_list_bar = tqdm(\r\n        soup.find('section', {'data-title': 'Table of contents'}).find(\r\n            'div', {'class': 'c-book-section'}).find_all(\r\n            ['li'], {'data-test': 'chapter'}))\r\n    for paper in paper_list_bar:\r\n        try:\r\n            title = slugify(\r\n                paper.find(['h3', 'h4'], {'class': 'app-card-open__heading'}).text)\r\n            link = urllib.parse.urljoin(\r\n                url, \r\n                paper.find(\r\n                    ['h3', 'h4'], {'class': 'app-card-open__heading'}\r\n                    ).a.get('href'))\r\n            # 'https://link.springer.com/chapter/10.1007/978-3-642-33718-5_2' \r\n            # >>\r\n            # 'https://link.springer.com/content/pdf/10.1007/978-3-642-33718-5_2.pdf'\r\n            link = f'''{link.replace('/chapter/', '/content/pdf/')}.pdf'''\r\n            paper_dict[title] = link\r\n        except Exception as e:\r\n            print(f'ERROR: {str(e)}')\r\n    return paper_dict\r\n\r\n\r\n\r\n\r\nif __name__ == '__main__':\r\n    papers = get_paper_name_link_from_url('https://link.springer.com/book/10.1007%2F978-3-319-46448-0')"
  },
  {
    "path": "lib/supplement_porcess.py",
"content": "\"\"\"\r\n    supplement_process.py\r\n\"\"\"\r\nfrom PyPDF3 import PdfFileMerger\r\nimport zipfile\r\nimport os\r\nimport shutil\r\nfrom tqdm import tqdm\r\n\r\ndef unzipfile(zip_file, save_path):\r\n    \"\"\"\r\n    unzip zip file to save_path\r\n    :param zip_file: str, zip file's full pathname.\r\n    :param save_path: str, the path to store unzipped files.\r\n    :return: None\r\n    \"\"\"\r\n    zip_ref = zipfile.ZipFile(zip_file, 'r')\r\n    zip_ref.extractall(save_path)\r\n    zip_ref.close()\r\n\r\n\r\ndef get_potential_supp_pdf(path):\r\n    \"\"\"\r\n    get all the potential supplemental pdf file pathnames\r\n    :param path: str, the path of unzipped files\r\n    :return: supp_pdf_list, List of str, pdf files' full pathnames\r\n    \"\"\"\r\n    supp_pdf_list = [f.path for f in os.scandir(path) if f.name.endswith('.pdf')]\r\n    if len(supp_pdf_list) == 0:\r\n        supp_pdf_list = []\r\n        for dir in os.scandir(path):\r\n            if dir.is_dir() and not dir.name.startswith('__'):\r\n                for pdf in os.scandir(dir.path):\r\n                    if pdf.name.endswith('.pdf'):\r\n                        supp_pdf_list.append(pdf.path)\r\n    if len(supp_pdf_list) == 0:\r\n        supp_pdf_list = []\r\n        for dir in os.scandir(path):\r\n            if dir.is_dir() and not dir.name.startswith('__'):\r\n                for sub_dir in os.scandir(dir):\r\n                    if sub_dir.is_dir() and not sub_dir.name.startswith('__'):\r\n                        for pdf in os.scandir(sub_dir.path):\r\n                            if pdf.name.endswith('.pdf'):\r\n                                supp_pdf_list.append(pdf.path)\r\n    return supp_pdf_list\r\n\r\n\r\ndef move_main_and_supplement_2_one_directory_with_group(main_path, supplement_path, supp_pdf_save_path):\r\n    \"\"\"\r\n    unzip supplemental zip files to get the pdf files, copy and\r\n        rename them into the given path (supp_pdf_save_path/group_name)\r\n    :param 
main_path: str, the main papers' path\r\n    :param supplement_path: str, the supplemental material's path\r\n    :param supp_pdf_save_path: str, the supplemental pdf files' save path\r\n    \"\"\"\r\n    if not os.path.exists(main_path):\r\n        raise ValueError(f'''can not open '{main_path}' !''')\r\n    if not os.path.exists(supplement_path):\r\n        raise ValueError(f'''can not open '{supplement_path}' !''')\r\n    error_log = []\r\n    # make temp dir to unzip zip file\r\n    temp_zip_dir = '.\\\\temp_zip'\r\n    if not os.path.exists(temp_zip_dir):\r\n        os.mkdir(temp_zip_dir)\r\n    else:\r\n        # remove all files\r\n        for unzip_file in os.listdir(temp_zip_dir):\r\n            if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)):\r\n                os.remove(os.path.join(temp_zip_dir, unzip_file))\r\n            elif os.path.isdir(os.path.join(temp_zip_dir, unzip_file)):\r\n                shutil.rmtree(os.path.join(temp_zip_dir, unzip_file))\r\n            else:\r\n                print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file))\r\n    for group in os.scandir(main_path):\r\n        if group.is_dir():\r\n            paper_bar = tqdm(os.scandir(group.path))\r\n            for paper in paper_bar:\r\n                if paper.is_file():\r\n                    name, extension = os.path.splitext(paper.name)\r\n                    if '.pdf' == extension:\r\n                        paper_bar.set_description(f'''processing {name}''')\r\n                        supp_pdf_path = None\r\n                        # error_flag = False\r\n                        if os.path.exists(os.path.join(supplement_path, group.name, f'{name}_supp.pdf')):\r\n                            supp_pdf_path = os.path.join(supplement_path, group.name,  f'{name}_supp.pdf')\r\n                            shutil.copyfile(\r\n                                supp_pdf_path, os.path.join(supp_pdf_save_path, group.name,  f'{name}_supp.pdf'))\r\n                  
      elif os.path.exists(os.path.join(supplement_path, group.name, f'{name}_supp.zip')):\r\n                            try:\r\n                                unzipfile(\r\n                                    zip_file=os.path.join(supplement_path, group.name, f'{name}_supp.zip'),\r\n                                    save_path=temp_zip_dir\r\n                                )\r\n                            except Exception as e:\r\n                                print('Error: ' + name + ' - ' + str(e))\r\n                                error_log.append((paper.path, supp_pdf_path, str(e)))\r\n                            try:\r\n                                # find if there is a pdf file (by listing all files in the dir)\r\n                                supp_pdf_list = get_potential_supp_pdf(temp_zip_dir)\r\n                                # rename the first pdf file\r\n                                if len(supp_pdf_list) >= 1:\r\n                                    # by default, we only deal with the first pdf\r\n                                    supp_pdf_path = os.path.join(supp_pdf_save_path, group.name, name+'_supp.pdf')\r\n                                    if not os.path.exists(supp_pdf_path):\r\n                                        shutil.move(supp_pdf_list[0], supp_pdf_path)\r\n                                    if len(supp_pdf_list) > 1:\r\n                                        for i in range(1, len(supp_pdf_list)):\r\n                                            supp_pdf_path = os.path.join(\r\n                                                supp_pdf_save_path, group.name, name + f'_supp_{i}.pdf')\r\n                                            if not os.path.exists(supp_pdf_path):\r\n                                                shutil.move(supp_pdf_list[i], supp_pdf_path)\r\n                                # empty the temp_folder (both the dirs and files)\r\n                                for unzip_file in os.listdir(temp_zip_dir):\r\n 
                                   if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)):\r\n                                        os.remove(os.path.join(temp_zip_dir, unzip_file))\r\n                                    elif os.path.isdir(os.path.join(temp_zip_dir, unzip_file)):\r\n                                        shutil.rmtree(os.path.join(temp_zip_dir, unzip_file))\r\n                                    else:\r\n                                        print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file))\r\n                            except Exception as e:\r\n                                print('Error: ' + name + ' - ' + str(e))\r\n                                error_log.append((paper.path, supp_pdf_path, str(e)))\r\n\r\n    # 2. write error log\r\n    print('write error log')\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    log_file_pathname = os.path.join(\r\n        project_root_folder, 'log', 'merge_err_log.txt'\r\n    )\r\n    with open(log_file_pathname, 'w') as f:\r\n        for log in tqdm(error_log):\r\n            for e in log:\r\n                if e is None:\r\n                    f.write('None')\r\n                else:\r\n                    f.write(e)\r\n                f.write('\\n')\r\n\r\n            f.write('\\n')\r\n\r\n\r\ndef move_main_and_supplement_2_one_directory(main_path, supplement_path, supp_pdf_save_path):\r\n    \"\"\"\r\n    unzip supplemental zip files to get the pdf files, copy and\r\n    rename them into given path(supp_pdf_save_path)\r\n    :param main_path: str, the main papers' path\r\n    :param supplement_path: str, the supplemental material's path\r\n    :param supp_pdf_save_path: str, the supplemental pdf files' save path\r\n    \"\"\"\r\n    if not os.path.exists(main_path):\r\n        raise ValueError(f'''can not open '{main_path}' !''')\r\n    if not os.path.exists(supplement_path):\r\n        raise 
ValueError(f'''can not open '{supplement_path}' !''')\r\n    os.makedirs(supp_pdf_save_path, exist_ok=True)\r\n    error_log = []\r\n    # make temp dir to unzip zip file\r\n    temp_zip_dir = '..\\\\temp_zip'\r\n    if not os.path.exists(temp_zip_dir):\r\n        os.mkdir(temp_zip_dir)\r\n    else:\r\n        # remove all files\r\n        for unzip_file in os.listdir(temp_zip_dir):\r\n            if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)):\r\n                os.remove(os.path.join(temp_zip_dir, unzip_file))\r\n            elif os.path.isdir(os.path.join(temp_zip_dir, unzip_file)):\r\n                shutil.rmtree(os.path.join(temp_zip_dir, unzip_file))\r\n            else:\r\n                print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file))\r\n\r\n    paper_bar = tqdm(os.scandir(main_path))\r\n    for paper in paper_bar:\r\n        if paper.is_file():\r\n            name, extension = os.path.splitext(paper.name)\r\n            if '.pdf' == extension:\r\n                paper_bar.set_description(f'''processing {name}''')\r\n                supp_pdf_path = None\r\n                # error_flag = False\r\n                if os.path.exists(os.path.join(supp_pdf_save_path, f'{name}_supp.pdf')):\r\n                    continue\r\n                elif os.path.exists(os.path.join(supplement_path, f'{name}_supp.pdf')):\r\n                    supp_pdf_path = os.path.join(supplement_path, f'{name}_supp.pdf')\r\n                    shutil.copyfile(supp_pdf_path, os.path.join(supp_pdf_save_path, f'{name}_supp.pdf'))\r\n                elif os.path.exists(os.path.join(supplement_path, f'{name}_supp.zip')):\r\n                    try:\r\n                        unzipfile(\r\n                            zip_file=os.path.join(supplement_path, f'{name}_supp.zip'),\r\n                            save_path=temp_zip_dir)\r\n                    except Exception as e:\r\n                        print('Error: ' + name + ' - ' + str(e))\r\n                     
   error_log.append((paper.path, supp_pdf_path, str(e)))\r\n                    try:\r\n                        # find if there is a pdf file (by listing all files in the dir)\r\n                        supp_pdf_list = get_potential_supp_pdf(temp_zip_dir)\r\n\r\n                        # rename the first pdf file\r\n                        if len(supp_pdf_list) >= 1:\r\n                            # by default, we only deal with the first pdf\r\n                            supp_pdf_path = os.path.join(supp_pdf_save_path, name+'_supp.pdf')\r\n                            if not os.path.exists(supp_pdf_path):\r\n                                shutil.move(supp_pdf_list[0], supp_pdf_path)\r\n                            if len(supp_pdf_list) > 1:\r\n                                for i in range(1, len(supp_pdf_list)):\r\n                                    supp_pdf_path = os.path.join(supp_pdf_save_path, name + f'_supp_{i}.pdf')\r\n                                    if not os.path.exists(supp_pdf_path):\r\n                                        shutil.move(supp_pdf_list[i], supp_pdf_path)\r\n                        # empty the temp_folder (both the dirs and files)\r\n                        for unzip_file in os.listdir(temp_zip_dir):\r\n                            if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)):\r\n                                os.remove(os.path.join(temp_zip_dir, unzip_file))\r\n                            elif os.path.isdir(os.path.join(temp_zip_dir, unzip_file)):\r\n                                shutil.rmtree(os.path.join(temp_zip_dir, unzip_file))\r\n                            else:\r\n                                print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file))\r\n                    except Exception as e:\r\n                        print('Error: ' + name + ' - ' + str(e))\r\n                        error_log.append((paper.path, supp_pdf_path, str(e)))\r\n\r\n    # 2. 
write error log\r\n    print('write error log')\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    log_file_pathname = os.path.join(\r\n        project_root_folder, 'log', 'merge_err_log.txt'\r\n    )\r\n    with open(log_file_pathname, 'w') as f:\r\n        for log in tqdm(error_log):\r\n            for e in log:\r\n                if e is None:\r\n                    f.write('None')\r\n                else:\r\n                    f.write(e)\r\n                f.write('\\n')\r\n\r\n            f.write('\\n')\r\n\r\n\r\ndef merge_main_supplement(main_path, supplement_path, save_path, is_delete_ori_files=False):\r\n    \"\"\"\r\n    merge the main paper and supplemental material into a single pdf file\r\n    :param main_path: str, the main papers' path\r\n    :param supplement_path: str, the supplemental material's path\r\n    :param save_path: str, merged pdf files' save path\r\n    :param is_delete_ori_files: bool, True for deleting the original main and supplemental material after merging\r\n    \"\"\"\r\n    if not os.path.exists(main_path):\r\n        raise ValueError(f'''can not open '{main_path}' !''')\r\n    if not os.path.exists(supplement_path):\r\n        raise ValueError(f'''can not open '{supplement_path}' !''')\r\n    os.makedirs(save_path, exist_ok=True)\r\n    error_log = []\r\n    # make temp dir to unzip zip file\r\n    temp_zip_dir = '.\\\\temp_zip'\r\n    if not os.path.exists(temp_zip_dir):\r\n        os.mkdir(temp_zip_dir)\r\n    else:\r\n        # remove all files\r\n        for unzip_file in os.listdir(temp_zip_dir):\r\n            if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)):\r\n                os.remove(os.path.join(temp_zip_dir, unzip_file))\r\n            elif os.path.isdir(os.path.join(temp_zip_dir, unzip_file)):\r\n                shutil.rmtree(os.path.join(temp_zip_dir, unzip_file))\r\n            else:\r\n                print('Cannot Remove - ' + 
os.path.join(temp_zip_dir, unzip_file))\r\n    paper_bar = tqdm(os.scandir(main_path))\r\n    for paper in paper_bar:\r\n        if paper.is_file():\r\n            name, extension = os.path.splitext(paper.name)\r\n            if '.pdf' == extension:\r\n                paper_bar.set_description(f'''processing {name}''')\r\n                if os.path.exists(os.path.join(save_path, paper.name)):\r\n                    continue\r\n                supp_pdf_path = None\r\n                error_flag = False\r\n                if os.path.exists(os.path.join(supplement_path, f'{name}_supp.pdf')):\r\n                    supp_pdf_path = os.path.join(supplement_path, f'{name}_supp.pdf')\r\n                elif os.path.exists(os.path.join(supplement_path, f'{name}_supp.zip')):\r\n                    try:\r\n                        unzipfile(\r\n                            zip_file=os.path.join(supplement_path, f'{name}_supp.zip'),\r\n                            save_path=temp_zip_dir\r\n                        )\r\n                    except Exception as e:\r\n                        print('Error: ' + name + ' - ' + str(e))\r\n                        error_log.append((paper.path, supp_pdf_path, str(e)))\r\n                    try:\r\n                        # find if there is a pdf file (by listing all files in the dir)\r\n                        supp_pdf_list = get_potential_supp_pdf(temp_zip_dir)\r\n                        # rename the first pdf file\r\n                        if len(supp_pdf_list) >= 1:\r\n                            # by default, we only deal with the first pdf\r\n                            supp_pdf_path = os.path.join(supplement_path, name+'_supp.pdf')\r\n                            if not os.path.exists(supp_pdf_path):\r\n                                shutil.move(supp_pdf_list[0], supp_pdf_path)\r\n                        # empty the temp_folder (both the dirs and files)\r\n                        for unzip_file in os.listdir(temp_zip_dir):\r\n         
                   if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)):\r\n                                os.remove(os.path.join(temp_zip_dir, unzip_file))\r\n                            elif os.path.isdir(os.path.join(temp_zip_dir, unzip_file)):\r\n                                shutil.rmtree(os.path.join(temp_zip_dir, unzip_file))\r\n                            else:\r\n                                print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file))\r\n                    except Exception as e:\r\n                        error_flag = True\r\n                        print('Error: ' + name + ' - ' + str(e))\r\n                        error_log.append((paper.path, supp_pdf_path, str(e)))\r\n                        # empty the temp_folder (both the dirs and files)\r\n                        for unzip_file in os.listdir(temp_zip_dir):\r\n                            if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)):\r\n                                os.remove(os.path.join(temp_zip_dir, unzip_file))\r\n                            elif os.path.isdir(os.path.join(temp_zip_dir, unzip_file)):\r\n                                shutil.rmtree(os.path.join(temp_zip_dir, unzip_file))\r\n                            else:\r\n                                print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file))\r\n                        continue\r\n                if supp_pdf_path is not None:\r\n                    try:\r\n                        merger = PdfFileMerger()\r\n                        f_handle1 = open(paper.path, 'rb')\r\n                        merger.append(f_handle1)\r\n                        f_handle2 = open(supp_pdf_path, 'rb')\r\n                        merger.append(f_handle2)\r\n                        with open(os.path.join(save_path, paper.name), 'wb') as fout:\r\n                            merger.write(fout)\r\n                            print('\\tmerged!')\r\n                        f_handle1.close()\r\n  
                      f_handle2.close()\r\n                        merger.close()\r\n                        if is_delete_ori_files:\r\n                            os.remove(paper.path)\r\n                            if os.path.exists(os.path.join(supplement_path, f'{name}_supp.zip')):\r\n                                os.remove(os.path.join(supplement_path, f'{name}_supp.zip'))\r\n                            if os.path.exists(os.path.join(supplement_path, f'{name}_supp.pdf')):\r\n                                os.remove(os.path.join(supplement_path, f'{name}_supp.pdf'))\r\n                    except Exception as e:\r\n                        print('Error: ' + name + ' - ' + str(e))\r\n                        error_log.append((paper.path, supp_pdf_path, str(e)))\r\n                        if os.path.exists(os.path.join(save_path, paper.name)):\r\n                            os.remove(os.path.join(save_path, paper.name))\r\n\r\n                else:\r\n                    if is_delete_ori_files:\r\n                        shutil.move(paper.path, os.path.join(save_path, paper.name))\r\n                    else:\r\n                        shutil.copyfile(paper.path, os.path.join(save_path, paper.name))\r\n\r\n    # 2. 
write error log\r\n    print('write error log')\r\n    project_root_folder = os.path.abspath(\r\n        os.path.dirname(os.path.dirname(os.path.abspath(__file__))))\r\n    log_file_pathname = os.path.join(\r\n        project_root_folder, 'log', 'merge_err_log.txt'\r\n    )\r\n    with open(log_file_pathname, 'w') as f:\r\n        for log in tqdm(error_log):\r\n            for e in log:\r\n                if e is None:\r\n                    f.write('None')\r\n                else:\r\n                    f.write(e)\r\n                f.write('\\n')\r\n\r\n            f.write('\\n')\r\n\r\n\r\ndef rename_2_short_name(src_path, save_path, target_max_length=128,\r\n                        extension='pdf'):\r\n    \"\"\"\r\n    rename file to short filename while remain the conference postfix\r\n    Args:\r\n        src_path (str): path that contains files directly.\r\n        save_path (str): path to save the renamed files.\r\n        target_max_length (int): max filen name length after renaming. All the\r\n            files whose name length is not less than this will be renamed, the\r\n            others will stay unchanged and copy into the save path. Default:\r\n            128.\r\n        extension (str | None): only the files with this extension will be\r\n            processed. None means all file will be processed. 
Default: 'pdf'.\r\n    Returns:\r\n        None\r\n    \"\"\"\r\n    if not os.path.exists(src_path):\r\n        raise ValueError(f'Path not found: {src_path}!')\r\n\r\n    os.makedirs(save_path, exist_ok=True)\r\n\r\n    for f in tqdm(os.scandir(src_path)):\r\n        f_name = f.name\r\n\r\n        # compare extension\r\n        ext = os.path.splitext(f_name)[1]\r\n        if extension is not None and ext[1:] != extension:\r\n            continue\r\n        # compare file name length\r\n        name_len = len(f_name)\r\n        if name_len < target_max_length:\r\n            if not os.path.exists(os.path.join(save_path, f_name)):\r\n                print(f'\\ncopying {f_name}')\r\n                shutil.copyfile(f.path, os.path.join(save_path, f_name))\r\n        else:\r\n            # rename: truncate the title part, keep the postfix\r\n            try:\r\n                [title, postfix] = f_name.split('_', 1)  # only split into 2 parts\r\n                new_title = title[:target_max_length - len(postfix) - 2]\r\n                new_name = f'{new_title}_{postfix}'\r\n                if not os.path.exists(os.path.join(save_path, new_name)):\r\n                    print(f'\\nrenaming {f_name} \\n\\t-> {new_name}')\r\n                    shutil.copyfile(f.path, os.path.join(save_path, new_name))\r\n            except ValueError:\r\n                # ValueError: not enough values to unpack (expected 2, got 1)\r\n                print(f'\\nWARNING!!!:\\n\\tunable to parse postfix from {f.path}')\r\n                print('\\tSo, it will just be copied with a truncated name')\r\n                new_title = f_name[:target_max_length - len(ext) - 1]\r\n                new_name = f'{new_title}{ext}'\r\n                if not os.path.exists(os.path.join(save_path, new_name)):\r\n                    print(f'\\nrenaming {f_name} \\n\\t-> {new_name}')\r\n                    shutil.copyfile(f.path, os.path.join(save_path, new_name))\r\n\r\n\r\ndef rename_2_short_name_within_group(src_path, save_path, target_max_length=128,\r\n               
          extension='pdf'):\r\n    \"\"\"\r\n    Rename files to short filenames while retaining the conference postfix.\r\n    Args:\r\n        src_path (str): path that contains files organized as:\r\n            src_path/group_name/files\r\n        save_path (str): path to save the renamed files.\r\n        target_max_length (int): max file name length after renaming. All the\r\n            files whose name length is not less than this will be renamed; the\r\n            others will stay unchanged and be copied into the save path.\r\n            Default: 128.\r\n        extension (str | None): only the files with this extension will be\r\n            processed. None means all files will be processed. Default: 'pdf'.\r\n    Returns:\r\n        None\r\n    \"\"\"\r\n    if not os.path.exists(src_path):\r\n        raise ValueError(f'Path not found: {src_path}!')\r\n\r\n    os.makedirs(save_path, exist_ok=True)\r\n\r\n    for d in tqdm(os.scandir(src_path)):\r\n        if not d.is_dir():\r\n            continue\r\n        print(f'\\nprocessing {d.name}')\r\n        d_name = d.name\r\n        d_name = d_name[:min(len(d_name), target_max_length - 1)]\r\n        rename_2_short_name(\r\n            src_path=d.path,\r\n            save_path=os.path.join(save_path, d_name),\r\n            target_max_length=target_max_length,\r\n            extension=extension\r\n        )\r\n\r\n"
  },
  {
    "path": "lib/user_agents.py",
    "content": "\"\"\"\nuser_agents.py\n\nuser agents\n20230702\n\"\"\"\n\n\nuser_agents = [\n    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) '\n    'Gecko/20071127 Firefox/2.0.0.11',\n\n    'Opera/9.25 (Windows NT 5.1; U; en)',\n\n    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; '\n    '.NET CLR 1.1.4322; .NET CLR 2.0.50727)',\n\n    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) '\n    'KHTML/3.5.5 (like Gecko) (Kubuntu)',\n\n    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) '\n    'Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',\n\n    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',\n\n    \"Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 \"\n    \"(KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 \"\n    \"Chrome/16.0.912.77 Safari/535.7\",\n\n    \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) \"\n    \"Gecko/20100101 Firefox/10.0\",\n\n    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '\n    'AppleWebKit/537.36 (KHTML, like Gecko) '\n    'Chrome/105.0.0.0 Safari/537.36'\n]"
  },
  {
    "path": "sharelinks.md",
    "content": "# SHARE LINKS\nAliyun share links\n\nNote: Aliyun Drive has updated its protocol so that **a single share link can contain at most 500 files**. The collection has therefore been split into multiple links, 499 files per link, until everything is shared.\n\n## CVPR\n\n### main conference\n\n| year | index |                       share link                       | access code |\n|:----:|:-----:|:------------------------------------------------------:|:-----------:|\n| 2023 |   1   |   [1-499](https://www.aliyundrive.com/s/SGMUABYNoRM)   |   `63un`    |\n| 2023 |   2   |  [500-998](https://www.aliyundrive.com/s/XeXJz53AVKn)  |   `7ws5`    |\n| 2023 |   3   | [999-1497](https://www.aliyundrive.com/s/9wjv8gaE95i)  |   `1er4`    |\n| 2023 |   4   | [1498-1996](https://www.aliyundrive.com/s/kqt4GNYmSYR) |   `lf58`    |\n| 2023 |   5   | [1997-2358](https://www.aliyundrive.com/s/GyyyD4XnqhZ) |   `f47s`    |\n\n\n### workshops\n\n| year | index |                      share link                      | access code |\n|:----:|:-----:|:----------------------------------------------------:|:-----------:|\n| 2023 |   1   |  [1-485](https://www.aliyundrive.com/s/gPtPRYcyttz)  |   `4n5t`    |\n| 2023 |   2   | [486-698](https://www.aliyundrive.com/s/x18A9AxPJGp) |   `x40h`    |"
  }
]