Repository: SilenceEagle/paper_downloader Branch: master Commit: 7a76ffa26612 Files: 30 Total size: 345.0 KB Directory structure: gitextract_691ya0bm/ ├── .gitignore ├── LICENSE ├── README.md ├── code/ │ ├── paper_downloader_AAAI.py │ ├── paper_downloader_AAMAS.py │ ├── paper_downloader_AISTATS.py │ ├── paper_downloader_COLT.py │ ├── paper_downloader_CORL.py │ ├── paper_downloader_CVF.py │ ├── paper_downloader_ECCV.py │ ├── paper_downloader_ICLR.py │ ├── paper_downloader_ICML.py │ ├── paper_downloader_IJCAI.py │ ├── paper_downloader_JMLR.py │ ├── paper_downloader_NIPS.py │ └── paper_downloader_RSS.py ├── lib/ │ ├── IDM.py │ ├── __init__.py │ ├── arxiv.py │ ├── csv_process.py │ ├── cvf.py │ ├── downloader.py │ ├── my_request.py │ ├── openreview.py │ ├── pmlr.py │ ├── proxy.py │ ├── springer.py │ ├── supplement_porcess.py │ └── user_agents.py └── sharelinks.md ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitignore ================================================ # ---> Python # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] *$py.class # C extensions *.so # Distribution / packaging .Python build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ # mylib/ lib64/ parts/ sdist/ var/ wheels/ share/python-wheels/ *.egg-info/ .installed.cfg *.egg MANIFEST # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. *.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .nox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *.cover *.py,cover .hypothesis/ .pytest_cache/ cover/ # Translations *.mo *.pot # Django stuff: *.log local_settings.py db.sqlite3 db.sqlite3-journal # Flask stuff: instance/ .webassets-cache # Scrapy stuff: .scrapy # Sphinx documentation docs/_build/ # PyBuilder .pybuilder/ target/ # Jupyter Notebook .ipynb_checkpoints # IPython profile_default/ ipython_config.py # pyenv # For a library or package, you might want to ignore these files since the code is # intended to run in multiple environments; otherwise, check them in: # .python-version # pipenv # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control. # However, in case of collaboration, if having platform-specific dependencies or dependencies # having no cross-platform support, pipenv may install dependencies that don't work, or not # install all needed dependencies. #Pipfile.lock # poetry # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. # This is especially recommended for binary packages to ensure reproducibility, and is more # commonly ignored for libraries. # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control #poetry.lock # pdm # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. #pdm.lock # pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it # in version control. # https://pdm.fming.dev/#use-with-ide .pdm.toml # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm __pypackages__/ # Celery stuff celerybeat-schedule celerybeat.pid # SageMath parsed files *.sage.py # Environments .env .venv env/ venv/ ENV/ env.bak/ venv.bak/ # Spyder project settings .spyderproject .spyproject # Rope project settings .ropeproject # mkdocs documentation /site # mypy .mypy_cache/ .dmypy.json dmypy.json # Pyre type checker .pyre/ # pytype static type analyzer .pytype/ # Cython debug symbols cython_debug/ # PyCharm # JetBrains specific template is maintained in a separate JetBrains.gitignore that can # be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore # and can be added to the global gitignore or merged into this file. For a more nuclear # option (not recommended) you can uncomment the following to ignore the entire idea folder. .idea/ csv/ data/ log/ temp_zip urls/ *.txt ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2020 silenceagle Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.md ================================================ # paper_downloader Download papers and supplemental materials only from **OPEN ACCESS** paper website, such as **AAAI**, **AAMAS**, **AISTATS**, **COLT**, **CORL**, **CVPR**, **ECCV**, **ICCV**, **ICLR**, **ICML**, **IJCAI**, **JMLR**, **NIPS**, **RSS**, **WACV**. --- The number of papers that could be downloaded using this repo (also provide **Aliyundrive** or **123Pan** share link and `access code`): | year\conf | [AAAI](https://aaai.org/aaai-publications/aaai-conference-proceedings/#aaai) | [AAMAS](https://www.ifaamas.org/Proceedings/aamas2024/) | [ACCV](https://openaccess.thecvf.com/menu) | [AISTATS](https://www.aistats.org/) | [COLT](http://learningtheory.org/) | [CORL](https://www.corl.org/) | [CVPR](http://openaccess.thecvf.com/menu) | [ECCV](https://www.ecva.net/papers.php) | [ICCV](http://openaccess.thecvf.com/menu) | [ICLR](https://iclr.cc/) | [ICML](https://icml.cc/) | [IJCAI](https://www.ijcai.org/) | [JMLR](http://www.jmlr.org/) | [NIPS ](https://nips.cc/) | [RSS](https://www.roboticsproceedings.org/index.html) | [WACV](https://openaccess.thecvf.com/menu) | |:------------:|:----------------------------------------------------------------------------:|:-------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------:|:------------------------------------------------------:|:-----------------------------:|:--------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------:|:-------------------------------------------------------:|:------------------------------------------------------:|:----------------------------:|:-------------------------------------------------------:|:-----------------------------------------------------:|:------------------------------------------------------------------------------------------------------------:| | **1969** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 64 | -- | -- | -- | -- | | **1971** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 66 | -- | -- | -- | -- | | **1973** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 85 | -- | -- | -- | -- | | **1975** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 146 | -- | -- | -- | -- | | **1977** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 251 | -- | -- | -- | -- | | **1979** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 12 | -- | -- | -- | -- | | **1980** | [95](https://www.aliyundrive.com/s/ucngMrKSTmi)`96eg` | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | | **1981** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 108 | -- | -- | -- | -- | | **1982** | 104 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | | **1983** | [92](https://www.aliyundrive.com/s/L3GfxhEqyWg)`09jo` | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 237 | -- | -- | -- | -- | | **1984** | 69 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | | **1985** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 259 | -- | -- | -- | -- | | **1986** | 194 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | | **1987** | 149 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 246 | -- | 90 | -- | -- | | **1988** | 159 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 94 | -- | -- | | **1989** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 269 | -- | 101 | -- | -- | | **1990** | 173 | -- | -- | -- | -- | -- | -- | 49 | -- | -- | -- | -- | -- | 143 | -- | -- | | **1991** | 144 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 192 | -- | 144 | -- | -- | | **1992** | 134 | -- | -- | -- | -- | -- | -- | 49 | -- | -- | -- | -- | -- | 127 | -- | -- | | **1993** | 135 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 138 | -- | 158 | -- | -- | | **1994** | 302 | -- | -- | -- | -- | -- | -- | 98 | -- | -- | -- | -- | -- | 140 | -- | -- | | **1995** | -- | -- | -- | 64 | -- | -- | -- | -- | -- | -- | -- | 282 | -- | 152 | -- | -- | | **1996** | 275 | -- | -- | -- | -- | -- | -- | 98 | -- | -- | -- | -- | -- | 152 | -- | -- | | **1997** | 186 | -- | -- | 57 | -- | -- | -- | -- | -- | -- | -- | 180 | -- | 150 | -- | -- | | **1998** | 187 | -- | -- | -- | -- | -- | -- | 98 | -- | -- | -- | -- | -- | 151 | -- | -- | | **1999** | 182 | -- | -- | 17 | -- | -- | -- | -- | -- | -- | -- | 204 | -- | 150 | -- | -- | | **2000/v1** | 221 | -- | -- | -- | -- | -- | -- | 98 | -- | -- | -- | -- | 11 | 152 | -- | -- | | **2001/v2** | -- | -- | -- | 46 | -- | -- | -- | -- | -- | -- | -- | 17 | 31 | 197 | -- | -- | | **2002/v3** | 187 | / | -- | -- | -- | -- | -- | 196 | -- | -- | -- | -- | 59 | 207 | -- | -- | | **2003/v4** | --- | / | -- | 44 | -- | -- | -- | -- | -- | -- | 121 | 297 | 59 | 198 | -- | -- | | **2004/v5** | 177 | / | -- | -- | -- | -- | -- | 190 | -- | -- | 118 | -- | 56 | 207 | -- | -- | | **2005/v6** | 328 | / | -- | 56 | -- | -- | -- | -- | -- | -- | 133 | 350 | 73 | 207 | 48 | -- | | **2006/v7** | 393 | / | -- | -- | -- | -- | -- | 192+11 | -- | -- | -- | -- | 100 | 204 | 39 | -- | | **2007/v8** | 375 | / | -- | 86 | -- | -- | -- | -- | -- | -- | 150 | 478 | 91 | 217 | 41 | -- | | **2008/v9** | 355 | 254 | -- | -- | -- | -- | -- | 196 | -- | -- | 158 | -- | 97 | 250 | 40 | -- | | **2009/v10** | -- | 130 | -- | 84 | -- | -- | -- | -- | -- | -- | 160 | 342 | 100 | 262 | 39 | -- | | **2010/v11** | 300 | 163 | -- | 126 | -- | -- | -- | 286+63 | -- | -- | 159 | -- | 118 | 292 | 40 | -- | | **2011/v12** | 302 | 125 | -- | 108 | 43 | -- | -- | -- | -- | -- | 153 | 490 | 105 | 306 | 45 | -- | | **2012/v13** | 353 | 136 | -- | 160 | 46 | -- | -- | 329+147 | -- | -- | 243 | -- | 119 | 368 | 60 | -- | | **2013/v14** | 251 | 321 | -- | 72 | 50 | -- | [471](https://www.aliyundrive.com/s/ZFvga9JZ5aY)`5p0q`+156 | -- | 455+142 | 14+9 | 283 | 496 | 84 | 360 | 55 | -- | | **2014/v15** | 447 | 378 | -- | 124 | 61 | -- | 545+125 | 334+158 | -- | 35 | 310 | -- | 120 | 411 | 57 | -- | | **2015/v16** | 455 | 363 | -- | 134 | 77 | -- | 602+133 | -- | 526+133 | 42 | 270 | 656 | 118 | 403 | 49 | -- | | **2016/v17** | 676 | 280 | -- | 168 | 70 | -- | 643+194 | 372+132 | -- | 80 | 322 | 658 | 236 | 568 | 47 | -- | | **2017/v18** | 765 | 318 | -- | 175 | 75 | 48 | 783+281 | -- | 621+353 | 198 | 434 | 781 | 234 | 679 | 75 | -- | | **2018/v19** | 1102 | 390 | -- | 230 | 94 | 75 | 979+346 | 732+262 | -- | 336 | 466 | 870 | 84 | 1009 | 71 | -- | | **2019/v20** | 1343 | 433 | -- | 403 | 127 | 110 | 1294+612 | -- | 1075+498 | 502 | 773 | 964 | 184 | 1428 | 84 | -- | | **2020/v21** | [1864](https://www.aliyundrive.com/s/kbWKUpHGR3k)`5ls6` | 369 | [254](https://www.aliyundrive.com/s/Dt2ErKCmePQ)`dn93`+[13](https://www.aliyundrive.com/s/AhGvgotrMUv)`d9o6` | [796](https://www.aliyundrive.com/s/iQ4AWTHG4bk)`61yu` | [126](https://www.aliyundrive.com/s/apP8KUFLPe4)`3mv9` | 165 | [1467](https://www.aliyundrive.com/s/eJF4BTFzFJq)`y89b`+[517](https://www.aliyundrive.com/s/5wk7Mjo9XyU)`0fz9` | [1358](https://www.aliyundrive.com/s/EYyjxRmmg8d)`a5i0` | -- | [687](https://www.aliyundrive.com/s/cVRD5Bu2SgN)`4x1c` | [1084](https://www.aliyundrive.com/s/BHqtEbi6Dix)`5yw0` | [776](https://www.aliyundrive.com/s/vMZpsjCbWMV)`4xq3` | 254 | [1899](https://www.aliyundrive.com/s/GEMFqxKeHWu)`3g3d` | 103 | [378](https://www.aliyundrive.com/s/gfFKwcKrCP1)`l1m8`+[24](https://www.aliyundrive.com/s/2uCW6cq9WHk)`me08` | | **2021/v22** | [1961](https://www.aliyundrive.com/s/cdeGciNZch8)`b69m` | 304 | -- | [845](https://www.aliyundrive.com/s/3hbAhxYFHER)`93ig` | [140](https://www.aliyundrive.com/s/gwhdNT1vGDD)`96ln` | 166 | 1660+[517](https://www.aliyundrive.com/s/ziBfXVKPXSY)`le14` | -- | [1612](https://www.aliyundrive.com/s/ME21PfkyAec)`99uu`+[465](https://www.aliyundrive.com/s/ZahPmXSn9an)`16es` | [860](https://www.aliyundrive.com/s/wGos6n5R93v)`ef43` | [1183](https://www.aliyundrive.com/s/SYTtH38GiVS)`g8b1` | [723](https://www.aliyundrive.com/s/io3sAjsN5pw)`40is` | 290 | [2334](https://www.aliyundrive.com/s/13sHmhuEdxA)`v6g1` | 92 | [406](https://www.aliyundrive.com/s/kTwfaX9tren)`1id9`+[23](https://www.aliyundrive.com/s/7Joy4svvUfy)`90rl` | | **2022/v23** | [1624](https://www.aliyundrive.com/s/ePXvUw4VFdQ)`fp76` | 306 | [279](https://www.aliyundrive.com/s/zCCTJMPrfSr)`47jy`+[25](https://www.aliyundrive.com/s/f4kdMXixwJL)`s7a9` | [492](https://www.aliyundrive.com/s/xj2fRMwZxfC)`f16o` | 155 | 197 | [2077](https://www.aliyundrive.com/s/Q8DG9dKbx6S)`i16a`+[562](https://www.aliyundrive.com/s/f9Zx3hFFyq4)`11kj` | [1645](https://www.aliyundrive.com/s/dv4fhuueRHs)`6d7j` | -- | [54+176+865](https://www.aliyundrive.com/s/gfANcdbM9TC)`b1l3` | [1234](https://www.aliyundrive.com/s/eopQ5H8Hz2a)`81ov` | [862](https://www.aliyundrive.com/s/DBVKNsqN2UZ)`ea46` | 351 | [2673](https://www.aliyundrive.com/s/VFLmfnzSAsA)`eh49` | 74 | [406](https://www.aliyundrive.com/s/xRhdpencLQU)`ab53`+[80](https://www.aliyundrive.com/s/JCCcQXij7WX)`q6d2` | | **2023/v24** | 2021 | 527 | -- | [496](https://www.aliyundrive.com/s/CD3Kz9cxu1U)`l5m9` | 170 | 199 | [2358+698](./sharelinks.md) | -- | 2161+491 | [90+284+1205](https://www.aliyundrive.com/s/PZ1Wann4B8A)`29sf` | 1805 | 846 | 397 | 67+378+2773 | 112 | [639](https://www.aliyundrive.com/s/fP52KxJEUE5)`mo78`+[74](https://www.aliyundrive.com/s/XZG992JqQfn)`nj80` | | **2024/v25** | 2581 | 460 | 268+46 | 547 | 170 | 264 | 2716+773 | 2387 | -- | 86+369+1810 | 144+191+2275 | 1048 | 419 | 61+326+3650 | 131 | 846+120 | | **2025/v26** | 3028 | 479 | --- | 583 | 182 | 263 | 2871+659 | -- | 2701+765 | 208+373+3060+6+6+56 | 108+211+2967 | 1276 | 308 | 77+683+4515 | 163 | 929 | | **2026/v27** | 2375 | 29 May | 18 Dec. | 2 May | 3 July | 12 Nov. | 7 June | 13 Sep. | 29 Sep. | 225+5131 | 11 July | 21 Aug | 50 | 13 Dec | 17 July | 831+191 | [Download from 123pan.com](https://www.123pan.com/s/PwXljv-QErwd.html) (ACCESS CODE: `FdX2`) (May miss some papers due to the (older version of) 123pan's limitation on the length of filename) NOTE: all the shared papers' pdf files are collected from network, and the original authors/providers hold the copyrights. --- ## Usage **For example: download AAAI-2022 papers** 1. Install [Internet Downloader Manager/IDM](https://www.internetdownloadmanager.com/) [*Windows*] [*OPTIONAL*] **Note:** If the IDM is NOT installed at the DEFAULT location, then the code in [lib/IDM.py](./lib/IDM.py) should also be modified: ```python # should replace with your IDM path idm_path = '"your path to IDMan.exe"' # default: # idm_path = '"C:\Program Files (x86)\Internet Download Manager\IDMan.exe"' ``` **Uesful tip**: [Disable the downloading popup pages of IDM would be better](https://github.com/SilenceEagle/paper_downloader/issues/17#issuecomment-773763300) 2. Install [Chrome](https://www.google.com/chrome) [Needed for `ICLR`, `ICML`, some of `NIPS` and `CORL` papers] 3. Change the code block at the end of [code/paper_downloader_AAAI.py](./code/paper_downloader_AAAI.py) ```python if __name__ == '__main__': year = 2022 total_paper_number = save_csv(year) # save papers urls to csv/AAAI_2022.csv download_from_csv( year, save_dir=f'..\\AAAI_{year}', # change to your save location time_step_in_seconds=5, # time step (seconds) between two downloading requests total_paper_number=total_paper_number, downloader=None # use python "requests" package to download papers, workable on Windows/MacOS/Linux # downloader='IDM' # use Internet Download Manager software to # download papers, Windows only ) ``` 4. Then run the code: ```python python code/paper_downloader_AAAI.py # download AAAI papers ``` --- **This repo also provides the function to process supplemental material:** 1. Merge the main supplemental material pdf file and the main paper into one single pdf file; 2. Move the supplemental material pdf files (extracted from the downloaded zip files if presented) into the main papers' folder. ## Star history [![Star History Chart](https://api.star-history.com/svg?repos=SilenceEagle/paper_downloader&type=Date)](https://star-history.com/#SilenceEagle/paper_downloader&Date) ================================================ FILE: code/paper_downloader_AAAI.py ================================================ """paper_downloader_AAAI.py""" import time from bs4 import BeautifulSoup import pickle import os from tqdm import tqdm from slugify import slugify import csv import sys import random root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) sys.path.append(root_folder) from lib import csv_process from lib.user_agents import user_agents from lib.my_request import urlopen_with_retry def get_track_urls(year): """ get all the technical tracks urls given AAAI proceeding year Args: year (int): AAAI proceeding year, such 2023 Returns: dict : All the urls of technical tracks included in the given AAAI proceeding. Keys are the tracks name-volume, and values are the corresponding urls. """ # assert int(year) >= 2023, f"only support year >= 2023, but get {year}!!!" project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) dat_file_pathname = os.path.join( project_root_folder, 'urls', f'track_archive_url_AAAI_{year}.dat' ) proceeding_th_dict = { 1980: 1, 1902: 2, 1983: 3, 1984: 4, 1986: 5, 1987: 6, 1988: 7, 1990: 8, 1991: 9, 1992: 10, 1993: 11, 1994: 12, 1996: 13, 1997: 14, 1998: 15, 1999: 16, 2000: 17, 2002: 18, 2004: 19, 2005: 20, 2006: 21, 2007: 22, 2008: 23 } if year >= 2023: base_url = r'https://ojs.aaai.org/index.php/AAAI/issue/archive' headers = { 'User-Agent': user_agents[-1], 'Host': 'ojs.aaai.org', 'Referer': "https://ojs.aaai.org", 'GET': base_url } if os.path.exists(dat_file_pathname): with open(dat_file_pathname, 'rb') as f: content = pickle.load(f) else: content = urlopen_with_retry(url=base_url, headers=headers) # req = urllib.request.Request(url=base_url, headers=headers) # content = urllib.request.urlopen(req).read() with open(dat_file_pathname, 'wb') as f: pickle.dump(content, f) soup = BeautifulSoup(content, 'html5lib') tracks = soup.find('ul', {'class': 'issues_archive'}).find_all('li') track_urls = dict() for tr in tracks: h2 = tr.find('h2') this_track = slugify(h2.a.text) if this_track.startswith(f'aaai-{year-2000}'): this_track += slugify(h2.div.text) + '-' + this_track this_url = h2.a.get('href') track_urls[this_track] = this_url print(f'find track: {this_track}({this_url})') else: if year >= 2010: proceeding_th = year - 1986 elif year in proceeding_th_dict: proceeding_th = proceeding_th_dict[year] else: print(f'ERROR: AAAI proceeding was not held in year {year}!!!') return base_url = f'https://aaai.org/proceeding/aaai-{proceeding_th:02d}-{year}/' headers = { 'User-Agent': user_agents[-1], 'Host': 'aaai.org', 'Referer': "https://aaai.org", 'GET': base_url } if os.path.exists(dat_file_pathname): with open(dat_file_pathname, 'rb') as f: content = pickle.load(f) else: # req = urllib.request.Request(url=base_url, headers=headers) # content = urllib.request.urlopen(req).read() content = urlopen_with_retry(url=base_url, headers=headers) # content = open(f'..\\AAAI_{year}.html', 'rb').read() with open(dat_file_pathname, 'wb') as f: pickle.dump(content, f) soup = BeautifulSoup(content, 'html5lib') tracks = soup.find('main', {'class': 'content'}).find_all('li') track_urls = dict() for tr in tracks: this_track = slugify(tr.a.text) this_url = tr.a.get('href') track_urls[this_track] = this_url print(f'find track: {this_track}({this_url})') return track_urls def get_papers_of_track_ojs(track_url): """ get all the papers' title, belonging track group name and download link. the link should be hosted on https://ojs.aaai.org/ Args: track_url (str): track url Returns: list[dict]: a list contains all the collected papers' information, each item in list is a dictionary, whose keys include ['title', 'main link', 'group'] And the group is the specific track name. """ debug = False paper_list = [] headers = { 'User-Agent': user_agents[-1], 'Host': 'ojs.aaai.org', 'Referer': "https://ojs.aaai.org", 'GET': track_url } content = urlopen_with_retry(url=track_url, headers=headers) soup = BeautifulSoup(content, 'html5lib') tracks = soup.find('div', {'class': 'sections'}).find_all( 'div', {'class': 'section'}) for tr in tracks: this_group = slugify(tr.h2.text) this_paper_dict = { 'group': this_group, 'title': '', 'main link': '' } papers = tr.find_all('li') for p in papers: this_paper_dict['title'] = '' this_paper_dict['main link'] = '' try: title = slugify(p.find('h3', {'class': 'title'}).text) link = p.find( 'a', {'class': 'obj_galley_link pdf'} ).get('href').replace('view', 'download') this_paper_dict['title'] = title this_paper_dict['main link'] = link paper_list.append(this_paper_dict.copy()) if debug: print( f'paper: {title}\n\tlink:{link}\n\tgroup:{this_group}') except Exception as e: # skip unwanted target # print(f'ERROR: {str(e)}') pass # continue return paper_list def get_papers_of_track(track_url): """ get all the papers' title, belonging track group name and download link. the link should be hosted on https://aaai.org/ Args: track_url (str): track url Returns: list[dict]: a list contains all the collected papers' information, each item in list is a dictionary, whose keys include ['title', 'main link', 'group'] And the group is the specific track name. """ debug = False paper_list = [] headers = { 'User-Agent': user_agents[-1], 'Host': 'aaai.org', 'Referer': "https://aaai.org", 'GET': track_url } content = urlopen_with_retry(url=track_url, headers=headers) soup = BeautifulSoup(content, 'html5lib') tracks = soup.find('main', {'id': 'genesis-content'}).find_all( 'div', {'class': 'track-wrap'}) for tr in tracks: this_group = slugify(tr.h2.text) this_paper_dict = { 'group': this_group, 'title': '', 'main link': '' } papers = tr.find_all('li') for p in papers: this_paper_dict['title'] = '' this_paper_dict['main link'] = '' try: title = slugify(p.find('h5').text) link = p.find( 'a', {'class': 'wp-block-button'} ).get('href') this_paper_dict['title'] = title this_paper_dict['main link'] = link paper_list.append(this_paper_dict.copy()) if debug: print( f'paper: {title}\n\tlink:{link}\n\tgroup:{this_group}') except Exception as e: # skip unwanted target # print(f'ERROR: {str(e)}') pass # continue return paper_list def save_csv(year): """ write AAAI papers' urls in one csv file :param year: int, AAAI year, such 2019 :return: peper_index: int, the total number of papers """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) csv_file_pathname = os.path.join( project_root_folder, 'csv', f'AAAI_{year}.csv' ) error_log = [] paper_index = 0 with open(csv_file_pathname, 'w', newline='') as csvfile: fieldnames = ['title', 'main link', 'group'] writer = csv.DictWriter(csvfile, fieldnames=fieldnames) writer.writeheader() track_urls = get_track_urls(year) for tr_name in track_urls: tr_url = track_urls[tr_name] print(f'collecting paper from {tr_name}({tr_url})') if year >= 2023: papers_dict_list = get_papers_of_track_ojs(tr_url) else: papers_dict_list = get_papers_of_track(tr_url) print(f'\tfind {len(papers_dict_list)} papers') for p in papers_dict_list: paper_index += 1 writer.writerow(p) csvfile.flush() s = random.randint(3, 7) print(f'random sleeping {s} seconds...') time.sleep(s) # avoid requesting too frequently # write error log print('write error log') log_file_pathname = os.path.join( project_root_folder, 'log', 'download_err_log.txt' ) with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: if e is not None: f.write(e) else: f.write('None') f.write('\n') f.write('\n') return paper_index def download_from_csv( year, save_dir, time_step_in_seconds=5, total_paper_number=None, csv_filename=None, downloader='IDM'): """ download all AAAI paper given year :param year: int, AAAI year, such 2019 :param save_dir: str, paper and supplement material's save path :param time_step_in_seconds: int, the interval time between two download request in seconds :param total_paper_number: int, the total number of papers that is going to download :param csv_filename: None or str, the csv file's name, None means to use default setting :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder', default to 'IDM' :return: True """ postfix = f'AAAI_{year}' project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) csv_file_path = os.path.join( project_root_folder, 'csv', f'AAAI_{year}.csv' if csv_filename is None else csv_filename) csv_process.download_from_csv( postfix=postfix, save_dir=save_dir, csv_file_path=csv_file_path, is_download_supplement=False, time_step_in_seconds=time_step_in_seconds, total_paper_number=total_paper_number, downloader=downloader ) if __name__ == '__main__': year = 2025 # total_paper_number = 3028 total_paper_number = save_csv(year) download_from_csv( year, save_dir=fr'D:\AAAI_{year}', time_step_in_seconds=15, total_paper_number=total_paper_number) # for year in range(2012, 2018, 2): # print(year) # total_paper_number = None # # total_paper_number = save_csv(year) # download_from_csv(year, save_dir=f'..\\AAAI_{year}', # time_step_in_seconds=10, # total_paper_number=total_paper_number) # time.sleep(2) # for i in range(1, 12): # print(f'issue {i}/{11}') # year = 2022 # total_paper_number = save_csv_given_urls( # urls=f'https://www.aaai.org/Library/AAAI/aaai{year - 2000}-issue{i:0>2}.php', # csv_filename=f'.\AAAI_{year}_issue_{i}.csv' # ) # # total_paper_number = 156 # download_from_csv( # year=year, # csv_filename=f'.\AAAI_{year}_issue_{i}.csv', # save_dir=rf'D:\AAAI_{year}', # time_step_in_seconds=1, # total_paper_number=total_paper_number) # print(get_track_urls(1980)) # get_papers_of_track(r'https://ojs.aaai.org/index.php/AAAI/issue/view/548') pass ================================================ FILE: code/paper_downloader_AAMAS.py ================================================ """paper_downloader_AAMAS.py """ import time import urllib from urllib.error import HTTPError from bs4 import BeautifulSoup import pickle import os from tqdm import tqdm from slugify import slugify import csv import sys root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) sys.path.append(root_folder) from lib import csv_process from lib.my_request import urlopen_with_retry def save_csv(year): """ write AAMAS papers' urls in one csv file :param year: int, AAMAS year, such 2023 :return: peper_index: int, the total number of papers """ conference = "AAMAS" project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) csv_file_pathname = os.path.join( project_root_folder, 'csv', f'{conference}_{year}.csv' ) init_url_dict = { 2010: 'https://www.ifaamas.org/Proceedings/aamas2010/resources/_fullpapers.html', 2009: 'https://www.ifaamas.org/Proceedings/aamas2009/TOC/01_FP/FP_Session.html', 2008: 'https://www.ifaamas.org/Proceedings/aamas2008/proceedings/mainTrackPapers.htm', } error_log = [] paper_index = 0 with open(csv_file_pathname, 'w', newline='') as csvfile: fieldnames = ['title', 'group', 'main link', 'supplemental link'] writer = csv.DictWriter(csvfile, fieldnames=fieldnames) writer.writeheader() if year >= 2013: init_url = f'https://www.ifaamas.org/Proceedings/aamas{year}' \ f'/forms/contents.htm' elif year >= 2011: init_url = f'https://www.ifaamas.org/Proceedings/aamas{year}'\ f'/resources/fullpapers.html' elif year in init_url_dict: init_url = init_url_dict[year] else: # TODO: support downloading 2002 ~ 2007 papers return url_file_pathname = os.path.join( project_root_folder, 'urls', f'init_url_{conference}_{year}.dat''' ) if os.path.exists(url_file_pathname): with open(url_file_pathname, 'rb') as f: content = pickle.load(f) else: headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' 'AppleWebKit/537.36 (KHTML, like Gecko) ' 'Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0'} content = urlopen_with_retry(url=init_url, headers=headers) with open(url_file_pathname, 'wb') as f: pickle.dump(content, f) soup = BeautifulSoup(content, 'html5lib') # soup = BeautifulSoup(content, 'html.parser') if year >= 2013: group_list = soup.find('tbody').find_all('tr', recursive=False)[3:] # skip "conference title", "Table of Contents" and "Contents table" group_list_bar = tqdm(group_list) paper_index = 0 is_start = False for group in group_list_bar: if not is_start: # if group.find('a', {'id': 'KT'}): # year 2019, 2023, 2024 # is_start = True if group.find('strong'): group_text = slugify(group.find('strong').text) if not group_text.startswith('table') and \ not group_text.startswith('aamas'): # skip Table of Contents, AAMAS 20xx is_start = True else: continue else: continue try: tds = group.find_all('td', recursive=False) if len(tds) < 2: continue group = tds[1] papers = group.find_all('p') for p in papers: # group title is in ... if p.find('strong', recursive=False): group_title = slugify(p.text) continue paper_dict = {'title': '', 'group': group_title, 'main link': '', 'supplemental link': ''} if p.find('a') is None and p.find('b') is None: # last empty

...

in some ... continue a = p.find('a') if a is None: title = slugify(p.find('b').text) main_link = '' print(f'\nWarning: No link found for {title}!') else: title = slugify(a.text) main_link = urllib.parse.urljoin(init_url, a.get('href')) paper_dict['title'] = title paper_dict['main link'] = main_link paper_index += 1 group_list_bar.set_description_str( f'Collected paper {paper_index}: {title}') writer.writerow(paper_dict) csvfile.flush() # write to file immediately except Exception as e: print(f'Warning: {str(e)}\n' f'Current group: {group_title}\nCurrent paper: {title}') elif year >= 2010: class_name = { 2010: 'plist', 2011: 'plist', 2012: 'pindex' } papers = soup.find('div', {'class': class_name[year]}).find_all(['h2', 'div']) papers_bar = tqdm(papers) paper_index = 0 for p in papers_bar: if p.name == 'h2': # group title group_title = slugify(p.text) else: # div, paper paper_dict = {'title': '', 'group': group_title, 'main link': '', 'supplemental link': ''} a = p.find('span', {'class': 'title'}).find('a') # title = slugify(a.find(string=True, recursive=False)) # drop abs direct_text = ''.join(child for child in a.contents if isinstance(child, str)).strip() title = slugify(direct_text) main_link = urllib.parse.urljoin(init_url, a.get('href')) paper_dict['title'] = title paper_dict['main link'] = main_link paper_index += 1 papers_bar.set_description_str( f'Collected paper {paper_index}: {title}') writer.writerow(paper_dict) csvfile.flush() # write to file immediately elif year == 2009: group_list = soup.find('div', {'id': 'mainContent'}).find_all('p') group_list_bar = tqdm(group_list) paper_index = 0 is_start = False for group in group_list_bar: if not is_start: if group.find('strong'): group_text = slugify(group.find('strong').text) is_start = True else: continue if group.find('strong'): group_title = slugify(group.text) continue try: papers = group.find_all('a') for p in papers: paper_dict = {'title': '', 'group': group_title, 'main link': '', 'supplemental link': ''} title = slugify(p.text) main_link = urllib.parse.urljoin(init_url, p.get('href')) paper_dict['title'] = title paper_dict['main link'] = main_link paper_index += 1 group_list_bar.set_description_str( f'Collected paper {paper_index}: {title}') writer.writerow(paper_dict) csvfile.flush() # write to file immediately except Exception as e: print(f'Warning: {str(e)}\n' f'Current group: {group_title}\nCurrent paper: {title}') elif year == 2008: # papers = soup.find_all(lambda tag: # (tag.name == 'p' and 'title' in tag.get('class', [])) or # tag.name == 'a' # ) group_list = soup.find('div', {'id': 'mainbody'}).find( 'table').find('tbody').find_all('tr', recursive=False)[2:] # skip "conference title", "Table of Contents" group_list_bar = tqdm(group_list) paper_index = 0 for group in group_list_bar: try: p_class_title = group.find('p', {'class': 'title'}) h3 = group.find('h3') if p_class_title: group_title = slugify(p_class_title.text) elif h3: # find

group_title = slugify(h3.text) else: raise ValueError('Parse group title failed!') papers = group.find_all('a') for p in papers: paper_dict = {'title': '', 'group': group_title, 'main link': '', 'supplemental link': ''} title = slugify(p.text) if not p.get('href'): continue # group title main_link = urllib.parse.urljoin(init_url, p.get('href')) paper_dict['title'] = title paper_dict['main link'] = main_link paper_index += 1 group_list_bar.set_description_str( f'Collected paper {paper_index}: {title}') writer.writerow(paper_dict) csvfile.flush() # write to file immediately except Exception as e: print(f'Warning: {str(e)}\n' f'Current group: {group_title}\nCurrent paper: {title}') else: # TODO: support downloading 2002 ~ 2008 papers return # write error log print('write error log') log_file_pathname = os.path.join( project_root_folder, 'log', 'download_err_log.txt' ) with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: if e is not None: f.write(e) else: f.write('None') f.write('\n') f.write('\n') return paper_index def download_from_csv( year, save_dir, time_step_in_seconds=5, total_paper_number=None, csv_filename=None, downloader='IDM', is_random_step=True, proxy_ip_port=None): """ download all AAMAS paper given year :param year: int, AAMAS year, such as 2019 :param save_dir: str, paper and supplement material's save path :param time_step_in_seconds: int, the interval time between two download request in seconds :param total_paper_number: int, the total number of papers that is going to download :param csv_filename: None or str, the csv file's name, None means to use default setting :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder', default to 'IDM' :param is_random_step: bool, whether random sample the time step between two adjacent download requests. If True, the time step will be sampled from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds. Default: True. :param proxy_ip_port: str or None, proxy server ip address with or without protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890". Default: None :return: True """ conference = "AAMAS" postfix = f'{conference}_{year}' project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) csv_file_path = os.path.join( project_root_folder, 'csv', f'{conference}_{year}.csv' if csv_filename is None else csv_filename) csv_process.download_from_csv( postfix=postfix, save_dir=save_dir, csv_file_path=csv_file_path, is_download_supplement=False, time_step_in_seconds=time_step_in_seconds, total_paper_number=total_paper_number, downloader=downloader, is_random_step=is_random_step, proxy_ip_port=proxy_ip_port ) if __name__ == '__main__': year = 2025 # total_paper_number = 2021 total_paper_number = save_csv(year) download_from_csv( year, save_dir=fr'D:\AAMAS_{year}', time_step_in_seconds=5, total_paper_number=total_paper_number) # for year in range(2008, 2025, 1): # print(year) # # total_paper_number = 134 # total_paper_number = save_csv(year) # download_from_csv(year, save_dir=fr'E:\AAMAS\AAMAS_{year}', # time_step_in_seconds=10, # total_paper_number=total_paper_number) # time.sleep(2) pass ================================================ FILE: code/paper_downloader_AISTATS.py ================================================ """paper_downloader_AISTATS.py""" import os import sys root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) sys.path.append(root_folder) import lib.pmlr as pmlr from lib.supplement_porcess import merge_main_supplement, move_main_and_supplement_2_one_directory, \ move_main_and_supplement_2_one_directory_with_group def download_paper(year, save_dir, is_download_supplement=True, time_step_in_seconds=5, downloader='IDM'): """ download all AISTATS paper and supplement files given year, restore in save_dir/main_paper and save_dir/supplement respectively :param year: int, AISTATS year, such as 2019 :param save_dir: str, paper and supplement material's save path :param is_download_supplement: bool, True for downloading supplemental material :param time_step_in_seconds: int, the interval time between two download request in seconds :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder', default to 'IDM' :return: True """ AISTATS_year_dict = { 2025: 258, 2024: 238, 2023: 206, 2022: 151, 2021: 130, 2020: 108, 2019: 89, 2018: 84, 2017: 54, 2016: 51, 2015: 38, 2014: 33, 2013: 31, 2012: 22, 2011: 15, 2010: 9, 2009: 5, 2007: 2 } AISTATS_year_dict_R = { 1995: 0, 1997: 1, 1999: 2, 2001: 3, 2003: 4, 2005: 5 } if year in AISTATS_year_dict.keys(): volume = f'v{AISTATS_year_dict[year]}' elif year in AISTATS_year_dict_R.keys(): volume = f'r{AISTATS_year_dict_R[year]}' else: raise ValueError('''the given year's url is unknown !''') postfix = f'AISTATS_{year}' pmlr.download_paper_given_volume( volume=volume, save_dir=save_dir, postfix=postfix, is_download_supplement=is_download_supplement, time_step_in_seconds=time_step_in_seconds, downloader=downloader ) if __name__ == '__main__': year = 2025 download_paper( year, rf'D:\AISTATS_{year}', is_download_supplement=True, time_step_in_seconds=25, downloader='IDM' ) # move_main_and_supplement_2_one_directory( # main_path=rf'D:\AISTATS_{year}\main_paper', # supplement_path=rf'D:\AISTATS_{year}\supplement', # supp_pdf_save_path=rf'D:\AISTATS_{year}\supplement_pdf' # ) pass ================================================ FILE: code/paper_downloader_COLT.py ================================================ """paper_downloader_COLT.py""" import os import sys root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) sys.path.append(root_folder) import lib.pmlr as pmlr def download_paper(year, save_dir, is_download_supplement=False, time_step_in_seconds=5, downloader='IDM'): """ download all COLT paper and supplement files given year, restore in save_dir/main_paper and save_dir/supplement respectively :param year: int, COLT year, such as 2019 :param save_dir: str, paper and supplement material's save path :param is_download_supplement: bool, True for downloading supplemental material :param time_step_in_seconds: int, the interval time between two download request in seconds :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder', default to 'IDM' :return: True """ COLT_year_dict = { 2025: 291, 2024: 247, 2023: 195, 2022: 178, 2021: 134, 2020: 125, 2019: 99, 2018: 75, 2017: 65, 2016: 49, 2015: 40, 2014: 35, 2013: 30, 2012: 23, 2011: 19 } if year in COLT_year_dict.keys(): volume = f'v{COLT_year_dict[year]}' else: raise ValueError('''the given year's url is unknown !''') postfix = f'COLT_{year}' pmlr.download_paper_given_volume( volume=volume, save_dir=save_dir, postfix=postfix, is_download_supplement=is_download_supplement, time_step_in_seconds=time_step_in_seconds, downloader=downloader ) if __name__ == '__main__': year = 2025 download_paper( year, rf'D:\COLT_{year}', is_download_supplement=False, time_step_in_seconds=3, downloader='IDM' ) pass ================================================ FILE: code/paper_downloader_CORL.py ================================================ """paper_downloader_CORL.py""" import os import sys root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) sys.path.append(root_folder) import lib.pmlr as pmlr import lib.openreview as openreview def download_paper(year, save_dir, is_download_supplement=False, time_step_in_seconds=5, downloader='IDM', source=None, proxy_ip_port=None): """ download all CORL paper and supplement files given year, restore in save_dir/main_paper and save_dir/supplement respectively :param year: int, CORL year, such as 2019 :param save_dir: str, paper and supplement material's save path :param is_download_supplement: bool, True for downloading supplemental material :param time_step_in_seconds: int, the interval time between two download request in seconds :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder', default to 'IDM' :param source: str, download source, support "pmlr" and "openreview". Defaults to None, means first try to download from pmlr. If failed, then try to download from openreview. :param proxy_ip_port: str or None, proxy ip address and port, eg. eg: "127.0.0.1:7890". Only useful for webdriver and request downloader (downloader=None). Default: None. :type proxy_ip_port: str | None :return: True """ CORL_year_dict = { 2025: 305, 2024: 270, 2023: 229, 2022: 205, 2021: 164, 2020: 155, 2019: 100, 2018: 87, 2017: 78 } postfix = f'CORL_{year}' if source != 'openreview': if year in CORL_year_dict.keys(): # download from pmlr volume = f'v{CORL_year_dict[year]}' pmlr.download_paper_given_volume( volume=volume, save_dir=save_dir, postfix=postfix, is_download_supplement=is_download_supplement, time_step_in_seconds=time_step_in_seconds, downloader=downloader ) return True elif source == 'pmlr': raise ValueError(f'Not found CoRL {year} in pmlr!') # try to download from openreview base_url = f'https://openreview.net/group?id=robot-learning.org/'\ f'CoRL/{year}/Conference' group_id_dict = { 2023: ['accept--oral-', 'accept--poster-'], 2024: ['accept'] } for gid in group_id_dict[year]: openreview.download_papers_given_url_and_group_id( save_dir=save_dir, year=year, base_url=f'{base_url}#{gid}', group_id=gid, conference='CORL', time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port ) return True if __name__ == '__main__': year=2025 download_paper( year, rf'D:\CORL\CORL_{year}', is_download_supplement=False, time_step_in_seconds=30, downloader='IDM' # downloader = None ) pass ================================================ FILE: code/paper_downloader_CVF.py ================================================ """paper_downloader_CVF.py""" import urllib from bs4 import BeautifulSoup import pickle import os from slugify import slugify import csv import sys root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) sys.path.append(root_folder) from lib.supplement_porcess import merge_main_supplement, move_main_and_supplement_2_one_directory, \ move_main_and_supplement_2_one_directory_with_group, \ rename_2_short_name, rename_2_short_name_within_group from lib.cvf import get_paper_dict_list from lib import csv_process import time from lib.my_request import urlopen_with_retry def save_csv(year, conference, proxy_ip_port=None): """ write CVF conference papers' and supplemental material's urls in one csv file :param year: int :param conference: str, one of ['CVPR', 'ICCV', 'WACV', 'ACCV'] :param proxy_ip_port: str or None, proxy server ip address with or without protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890". Default: None :return: True """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) if conference not in ['CVPR', 'ICCV', 'WACV', 'ACCV']: raise ValueError(f'{conference} is not found in ' f'https://openaccess.thecvf.com/menu, ' f'maybe a spelling mistake!') csv_file_pathname = os.path.join( project_root_folder, 'csv', f'{conference}_{year}.csv' ) print(f'saving {conference}-{year} paper urls into {csv_file_pathname}') with open(csv_file_pathname, 'w', newline='') as csvfile: fieldnames = ['title', 'main link', 'supplemental link', 'arxiv'] writer = csv.DictWriter(csvfile, fieldnames=fieldnames) writer.writeheader() init_url = f'http://openaccess.thecvf.com/{conference}{year}' if conference == 'ICCV' and year == 2021: init_url = 'https://openaccess.thecvf.com/ICCV2021?day=all' elif conference == 'CVPR' and year >= 2022: init_url = f'https://openaccess.thecvf.com/CVPR{year}?day=all' url_file_pathname = os.path.join( project_root_folder, 'urls', f'init_url_{conference}_{year}.dat' ) if os.path.exists(url_file_pathname): with open(url_file_pathname, 'rb') as f: content = pickle.load(f) else: headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) ' 'Gecko/20100101 Firefox/23.0'} content = urlopen_with_retry( url=init_url, headers=headers, proxy_ip_port=proxy_ip_port) with open(url_file_pathname, 'wb') as f: pickle.dump(content, f) soup = BeautifulSoup(content, 'html5lib') tmp_list = soup.find('div', {'id': 'content'}).find_all('dt') if len(tmp_list) <= 1: paper_different_days_list_bar = soup.find( 'div', {'id': 'content'}).find_all('dd') paper_index = 0 for group in paper_different_days_list_bar: # get group name a = group.find('a') print(a.text) group_link = urllib.parse.urljoin(init_url, a.get('href')) group_paper_dict_list, _ = get_paper_dict_list( url=group_link ) paper_index += len(group_paper_dict_list) for paper_dict in group_paper_dict_list: writer.writerow(paper_dict) return paper_index else: paper_dict_list, content = get_paper_dict_list( url=init_url, content=content) for paper_dict in paper_dict_list: writer.writerow(paper_dict) return len(paper_dict_list) def save_csv_workshops(year, conference, proxy_ip_port=None): """ write CVF workshops papers' and supplemental material's urls in one csv file :param year: int :param conference: str, one of ['CVPR', 'ICCV', 'WACV', 'ACCV'] :param proxy_ip_port: str or None, proxy server ip address with or without protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890". Default: None :return: True """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) if conference not in ['CVPR', 'ICCV', 'WACV', 'ACCV']: raise ValueError(f'{conference} is not found in ' f'https://openaccess.thecvf.com/menu, ' f'maybe a spelling mistake!') csv_file_pathname = os.path.join( project_root_folder, 'csv', f'{conference}_WS_{year}.csv' ) print(f'saving {conference}-WS-{year} paper urls into {csv_file_pathname}') with open(csv_file_pathname, 'w', newline='') as csvfile: fieldnames = ['group', 'title', 'main link', 'supplemental link', 'arxiv'] writer = csv.DictWriter(csvfile, fieldnames=fieldnames) writer.writeheader() headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) ' 'Gecko/20100101 Firefox/23.0'} init_url = f'https://openaccess.thecvf.com/' \ f'{conference}{year}_workshops/menu' url_file_pathname = os.path.join( project_root_folder, 'urls', f'init_url_{conference}_WS_{year}.dat' ) if os.path.exists(url_file_pathname): with open(url_file_pathname, 'rb') as f: content = pickle.load(f) else: content = urlopen_with_retry( url=init_url, headers=headers, proxy_ip_port=proxy_ip_port) # content = open(f'..\\{conference}_WS_{year}.html', 'rb').read() with open(url_file_pathname, 'wb') as f: pickle.dump(content, f) soup = BeautifulSoup(content, 'html5lib') paper_group_list_bar = soup.find('div', {'id': 'content'}).find_all('dd') paper_index = 0 for group in paper_group_list_bar: # get group name a = group.find('a') group_name = slugify(a.text) print(f'GROUP: {group_name}') group_link = urllib.parse.urljoin(init_url, a.get('href')) repeat_time = 3 for r in range(repeat_time): try: group_paper_dict_list, _ = get_paper_dict_list( url=group_link, group_name=group_name, timeout=20, ) time.sleep(1) break except Exception as e: if r + 1 == repeat_time: print(f'ERROR: {str(e)}') continue paper_index += len(group_paper_dict_list) for paper_dict in group_paper_dict_list: writer.writerow(paper_dict) return paper_index def download_from_csv( year, conference, save_dir, is_download_main_paper=True, is_download_supplement=True, time_step_in_seconds=5, total_paper_number=None, is_workshops=False, downloader='IDM', proxy_ip_port=None): """ download all CVF paper and supplement files given year, restore in save_dir/main_paper and save_dir/supplement respectively :param year: int, CVF year, such 2019 :param conference: str, one of ['CVPR', 'ICCV', 'WACV'] :param save_dir: str, paper and supplement material's save path :param is_download_main_paper: bool, True for downloading main paper :param is_download_supplement: bool, True for downloading supplemental material :param time_step_in_seconds: int, the interval time between two downloading request in seconds :param total_paper_number: int, the total number of papers that is going to download :param is_workshops: bool, is to download workshops from csv file. :param downloader: str, the downloader to download, could be 'IDM' or None, default to 'IDM'. :param proxy_ip_port: str or None, proxy server ip address with or without protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890". Default: None :return: True """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) postfix = f'{conference}_{year}' if is_workshops: postfix = f'{conference}_WS_{year}' csv_file_path = os.path.join( project_root_folder, 'csv', f'{conference}_{year}.csv' if not is_workshops else f'{conference}_WS_{year}.csv' ) csv_process.download_from_csv( postfix=postfix, save_dir=save_dir, csv_file_path=csv_file_path, is_download_main_paper=is_download_main_paper, is_download_supplement=is_download_supplement, time_step_in_seconds=time_step_in_seconds, total_paper_number=total_paper_number, downloader=downloader, ) return True def download_paper( year, conference, save_dir, is_download_main_paper=True, is_download_supplement=True, time_step_in_seconds=5, is_download_main_conference=True, is_download_workshops=True, downloader='IDM', proxy_ip_port=None): """ download all CVF papers in given year, support downloading main conference and workshops. :param year: int, CVF year, such 2019. :param conference: str, one of {'CVPR', 'ICCV', 'WACV'}. :param save_dir: str, paper and supplement material's save path. :param is_download_main_paper: bool, True for downloading main paper. :param is_download_supplement: bool, True for downloading supplemental material. :param time_step_in_seconds: int, the interval time between two downloading request in seconds. :param is_download_main_conference: bool, this parameter controls whether to download main conference papers, it is a upper level control flag of parameters is_download_main_paper and is_download_supplement. eg. After setting is_download_main_conference=True, is_download_main_paper=False, is_download_supplement=True, the only the supplement materials of the conference (vs. workshops) will be downloaded. :param is_download_workshops: bool, True for downloading workshops paper and is similar with is_download_main_conference. :param downloader: str, the downloader to download, could be 'IDM' or None, default to 'IDM'. :param proxy_ip_port: str or None, proxy server ip address with or without protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890". Default: None :return: """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) # main conference if is_download_main_conference: csv_file_path = os.path.join( project_root_folder, 'csv', f'{conference}_{year}.csv') if not os.path.exists(csv_file_path): total_paper_number = save_csv( year=year, conference=conference, proxy_ip_port=proxy_ip_port) else: with open(csv_file_path, newline='') as csvfile: myreader = csv.DictReader(csvfile, delimiter=',') total_paper_number = sum(1 for row in myreader) download_from_csv( year=year, conference=conference, save_dir=os.path.join(save_dir, f'{conference}_{year}'), is_download_main_paper=is_download_main_paper, is_download_supplement=is_download_supplement, time_step_in_seconds=time_step_in_seconds, total_paper_number=total_paper_number, is_workshops=False, downloader=downloader, proxy_ip_port=proxy_ip_port ) # workshops if is_download_workshops: csv_file_path = os.path.join( project_root_folder, 'csv', f'{conference}_WS_{year}.csv') if not os.path.exists(csv_file_path): total_paper_number = save_csv_workshops( year=year, conference=conference, proxy_ip_port=proxy_ip_port) else: with open(csv_file_path, newline='') as csvfile: myreader = csv.DictReader(csvfile, delimiter=',') total_paper_number = sum(1 for row in myreader) download_from_csv( year=year, conference=conference, save_dir=os.path.join(save_dir, f'{conference}_WS_{year}'), is_download_main_paper=is_download_main_paper, is_download_supplement=is_download_supplement, time_step_in_seconds=time_step_in_seconds, total_paper_number=total_paper_number, is_workshops=True, downloader=downloader, proxy_ip_port=proxy_ip_port ) if __name__ == '__main__': year = 2025 conference = 'CVPR' download_paper( year, conference=conference, save_dir=fr'D:\{conference}', is_download_main_paper=True, is_download_supplement=True, time_step_in_seconds=10, is_download_main_conference=True, is_download_workshops=True, # proxy_ip_port='127.0.0.1:7897' ) # # move_main_and_supplement_2_one_directory( # main_path=rf'E:\{conference}\{conference}_{year}\main_paper', # supplement_path=rf'E:\{conference}\{conference}_{year}\supplement', # supp_pdf_save_path=rf'E:\{conference}\{conference}_{year}\main_paper' # ) # move_main_and_supplement_2_one_directory_with_group( # main_path=rf'E:\{conference}\{conference}_WS_{year}\main_paper', # supplement_path=rf'E:\{conference}\{conference}_WS_{year}\supplement', # supp_pdf_save_path=rf'E:\{conference}\{conference}_WS_{year}\main_paper' # ) # rename to short filename for uploading to 123pan # rename_2_short_name( # src_path=r'E:\CVPR\CVPR_2024\main_paper', # save_path=r'E:\short_name_cvpr2024', # target_max_length=128 # ) # rename_2_short_name_within_group( # src_path=r'E:\CVPR\CVPR_WS_2024\main_paper', # save_path=r'E:\short_name_cvpr2024_ws', # target_max_length=128 # ) pass ================================================ FILE: code/paper_downloader_ECCV.py ================================================ """paper_downloader_ECCV.py""" import urllib from bs4 import BeautifulSoup import pickle import os from tqdm import tqdm from slugify import slugify import csv import sys root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) sys.path.append(root_folder) from lib.supplement_porcess import move_main_and_supplement_2_one_directory import lib.springer as springer from lib import csv_process from lib.downloader import Downloader from lib.my_request import urlopen_with_retry def save_csv(year): """ write ECCV papers' and supplemental material's urls in one csv file :param year: int :return: True """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) csv_file_pathname = os.path.join( project_root_folder, 'csv', f'ECCV_{year}.csv') with open(csv_file_pathname, 'w', newline='') as csvfile: fieldnames = ['title', 'main link', 'supplemental link'] writer = csv.DictWriter(csvfile, fieldnames=fieldnames) writer.writeheader() headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) ' 'Gecko/20100101 Firefox/23.0'} dat_file_pathname = os.path.join( project_root_folder, 'urls', f'init_url_ECCV_{year}.dat') if year >= 2018: init_url = f'https://www.ecva.net/papers.php' if os.path.exists(dat_file_pathname): with open(dat_file_pathname, 'rb') as f: content = pickle.load(f) else: content = urlopen_with_retry(url=init_url, headers=headers) with open(dat_file_pathname, 'wb') as f: pickle.dump(content, f) soup = BeautifulSoup(content, 'html5lib') paper_list_bar = tqdm(soup.find_all(['dt', 'dd'])) paper_index = 0 paper_dict = {'title': '', 'main link': '', 'supplemental link': ''} for paper in paper_list_bar: is_new_paper = False # get title try: if 'dt' == paper.name and \ 'ptitle' == paper.get('class')[0] and \ year == int(paper.a.get('href').split('_')[1][:4]): # title: # this_year = int(paper.a.get('href').split('_')[1][:4]) title = slugify(paper.text.strip()) paper_dict['title'] = title paper_index += 1 paper_list_bar.set_description_str( f'Downloading paper {paper_index}: {title}') elif '' != paper_dict['title'] and 'dd' == paper.name: all_as = paper.find_all('a') for a in all_as: if 'pdf' == slugify(a.text.strip()): main_link = urllib.parse.urljoin(init_url, a.get('href')) paper_dict['main link'] = main_link is_new_paper = True elif 'supp' in slugify(a.text.strip()): supp_link = urllib.parse.urljoin(init_url, a.get('href')) paper_dict['supplemental link'] = supp_link break except: pass if is_new_paper: writer.writerow(paper_dict) paper_dict = {'title': '', 'main link': '', 'supplemental link': ''} else: init_url = f'http://www.eccv{year}.org/main-conference/' if os.path.exists(dat_file_pathname): with open(dat_file_pathname, 'rb') as f: content = pickle.load(f) else: content = urlopen_with_retry(url=init_url, headers=headers) with open(dat_file_pathname, 'wb') as f: pickle.dump(content, f) soup = BeautifulSoup(content, 'html5lib') paper_list_bar = tqdm( soup.find('div', {'class': 'entry-content'}).find_all(['p'])) paper_index = 0 paper_dict = {'title': '', 'main link': '', 'supplemental link': ''} for paper in paper_list_bar: try: if len(paper.find_all(['strong'])) and len( paper.find_all(['a'])) and len( paper.find_all(['img'])): paper_index += 1 title = slugify(paper.find('strong').text) paper_dict['title'] = title paper_list_bar.set_description_str( f'Downloading paper {paper_index}: {title}') main_link = paper.find('a').get('href') paper_dict['main link'] = main_link writer.writerow(paper_dict) paper_dict = {'title': '', 'main link': '', 'supplemental link': ''} except Exception as e: print(f'ERROR: {str(e)}') return paper_index def download_from_csv( year, save_dir, is_download_supplement=True, time_step_in_seconds=5, total_paper_number=None, is_workshops=False, downloader='IDM'): """ download all ECCV paper and supplement files given year, restore in save_dir/main_paper and save_dir/supplement respectively :param year: int, ECCV year, such 2019 :param save_dir: str, paper and supplement material's save path :param is_download_supplement: bool, True for downloading supplemental material :param time_step_in_seconds: int, the interval time between two downlaod request in seconds :param total_paper_number: int, the total number of papers that is going to download :param is_workshops: bool, is to download workshops from csv file. :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder', default to 'IDM' :return: True """ postfix = f'ECCV_{year}' if is_workshops: postfix = f'ECCV_WS_{year}' csv_file_name = f'ECCV_{year}.csv' if not is_workshops else \ f'ECCV_WS_{year}.csv' project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) csv_file_name = os.path.join(project_root_folder, 'csv', csv_file_name) csv_process.download_from_csv( postfix=postfix, save_dir=save_dir, csv_file_path=csv_file_name, is_download_supplement=is_download_supplement, time_step_in_seconds=time_step_in_seconds, total_paper_number=total_paper_number, downloader=downloader ) def download_from_springer( year, save_dir, is_workshops=False, time_sleep_in_seconds=5, downloader='IDM'): os.makedirs(save_dir, exist_ok=True) if 2018 == year: if not is_workshops: urls_list = [ 'https://link.springer.com/book/10.1007/978-3-030-01246-5', 'https://link.springer.com/book/10.1007/978-3-030-01216-8', 'https://link.springer.com/book/10.1007/978-3-030-01219-9', 'https://link.springer.com/book/10.1007/978-3-030-01225-0', 'https://link.springer.com/book/10.1007/978-3-030-01228-1', 'https://link.springer.com/book/10.1007/978-3-030-01231-1', 'https://link.springer.com/book/10.1007/978-3-030-01234-2', 'https://link.springer.com/book/10.1007/978-3-030-01237-3', 'https://link.springer.com/book/10.1007/978-3-030-01240-3', 'https://link.springer.com/book/10.1007/978-3-030-01249-6', 'https://link.springer.com/book/10.1007/978-3-030-01252-6', 'https://link.springer.com/book/10.1007/978-3-030-01258-8', 'https://link.springer.com/book/10.1007/978-3-030-01261-8', 'https://link.springer.com/book/10.1007/978-3-030-01264-9', 'https://link.springer.com/book/10.1007/978-3-030-01267-0', 'https://link.springer.com/book/10.1007/978-3-030-01270-0' ] else: urls_list = [ 'https://link.springer.com/book/10.1007/978-3-030-11009-3', 'https://link.springer.com/book/10.1007/978-3-030-11012-3', 'https://link.springer.com/book/10.1007/978-3-030-11015-4', 'https://link.springer.com/book/10.1007/978-3-030-11018-5', 'https://link.springer.com/book/10.1007/978-3-030-11021-5', 'https://link.springer.com/book/10.1007/978-3-030-11024-6' ] elif 2016 == year: if not is_workshops: urls_list = [ 'https://link.springer.com/book/10.1007%2F978-3-319-46448-0', 'https://link.springer.com/book/10.1007%2F978-3-319-46475-6', 'https://link.springer.com/book/10.1007%2F978-3-319-46487-9', 'https://link.springer.com/book/10.1007%2F978-3-319-46493-0', 'https://link.springer.com/book/10.1007%2F978-3-319-46454-1', 'https://link.springer.com/book/10.1007%2F978-3-319-46466-4', 'https://link.springer.com/book/10.1007%2F978-3-319-46478-7', 'https://link.springer.com/book/10.1007%2F978-3-319-46484-8' ] else: urls_list = [ 'https://link.springer.com/book/10.1007%2F978-3-319-46604-0', 'https://link.springer.com/book/10.1007%2F978-3-319-48881-3', 'https://link.springer.com/book/10.1007%2F978-3-319-49409-8' ] elif 2014 == year: if not is_workshops: urls_list = [ 'https://link.springer.com/book/10.1007/978-3-319-10590-1', 'https://link.springer.com/book/10.1007/978-3-319-10605-2', 'https://link.springer.com/book/10.1007/978-3-319-10578-9', 'https://link.springer.com/book/10.1007/978-3-319-10593-2', 'https://link.springer.com/book/10.1007/978-3-319-10602-1', 'https://link.springer.com/book/10.1007/978-3-319-10599-4', 'https://link.springer.com/book/10.1007/978-3-319-10584-0' ] else: urls_list = [ 'https://link.springer.com/book/10.1007/978-3-319-16178-5', 'https://link.springer.com/book/10.1007/978-3-319-16181-5', 'https://link.springer.com/book/10.1007/978-3-319-16199-0', 'https://link.springer.com/book/10.1007/978-3-319-16220-1' ] elif 2012 == year: if not is_workshops: urls_list = [ 'https://link.springer.com/book/10.1007/978-3-642-33718-5', 'https://link.springer.com/book/10.1007/978-3-642-33709-3', 'https://link.springer.com/book/10.1007/978-3-642-33712-3', 'https://link.springer.com/book/10.1007/978-3-642-33765-9', 'https://link.springer.com/book/10.1007/978-3-642-33715-4', 'https://link.springer.com/book/10.1007/978-3-642-33783-3', 'https://link.springer.com/book/10.1007/978-3-642-33786-4' ] else: urls_list = [ 'https://link.springer.com/book/10.1007/978-3-642-33863-2', 'https://link.springer.com/book/10.1007/978-3-642-33868-7', 'https://link.springer.com/book/10.1007/978-3-642-33885-4' ] elif 2010 == year: if not is_workshops: urls_list = [ 'https://link.springer.com/book/10.1007/978-3-642-15549-9', 'https://link.springer.com/book/10.1007/978-3-642-15552-9', 'https://link.springer.com/book/10.1007/978-3-642-15558-1', 'https://link.springer.com/book/10.1007/978-3-642-15561-1', 'https://link.springer.com/book/10.1007/978-3-642-15555-0', 'https://link.springer.com/book/10.1007/978-3-642-15567-3' ] else: urls_list = [ 'https://link.springer.com/book/10.1007/978-3-642-35749-7', 'https://link.springer.com/book/10.1007/978-3-642-35740-4' ] elif 2008 == year: if not is_workshops: urls_list = [ 'https://link.springer.com/book/10.1007/978-3-540-88682-2', 'https://link.springer.com/book/10.1007/978-3-540-88688-4', 'https://link.springer.com/book/10.1007/978-3-540-88690-7', 'https://link.springer.com/book/10.1007/978-3-540-88693-8' ] else: urls_list = [] elif 2006 == year: if not is_workshops: urls_list = [ 'https://link.springer.com/book/10.1007/11744023', 'https://link.springer.com/book/10.1007/11744047', 'https://link.springer.com/book/10.1007/11744078', 'https://link.springer.com/book/10.1007/11744085' ] else: urls_list = [ 'https://link.springer.com/book/10.1007/11754336' ] elif 2004 == year: if not is_workshops: urls_list = [ 'https://link.springer.com/book/10.1007/b97865', 'https://link.springer.com/book/10.1007/b97866', 'https://link.springer.com/book/10.1007/b97871', 'https://link.springer.com/book/10.1007/b97873' ] else: urls_list = [ ] elif 2002 == year: if not is_workshops: urls_list = [ 'https://link.springer.com/book/10.1007/3-540-47969-4', 'https://link.springer.com/book/10.1007/3-540-47967-8', 'https://link.springer.com/book/10.1007/3-540-47977-5', 'https://link.springer.com/book/10.1007/3-540-47979-1' ] else: urls_list = [ ] elif 2000 == year: if not is_workshops: urls_list = [ 'https://link.springer.com/book/10.1007/3-540-45054-8', 'https://link.springer.com/book/10.1007/3-540-45053-X' ] else: urls_list = [ ] elif 1998 == year: if not is_workshops: urls_list = [ 'https://link.springer.com/book/10.1007/BFb0055655', 'https://link.springer.com/book/10.1007/BFb0054729' ] else: urls_list = [ ] elif 1996 == year: if not is_workshops: urls_list = [ 'https://link.springer.com/book/10.1007/BFb0015518', 'https://link.springer.com/book/10.1007/3-540-61123-1' ] else: urls_list = [ ] elif 1994 == year: if not is_workshops: urls_list = [ 'https://link.springer.com/book/10.1007/3-540-57956-7', 'https://link.springer.com/book/10.1007/BFb0028329' ] else: urls_list = [ ] elif 1992 == year: if not is_workshops: urls_list = [ 'https://link.springer.com/book/10.1007/3-540-55426-2' ] else: urls_list = [ ] elif 1990 == year: if not is_workshops: urls_list = [ 'https://link.springer.com/book/10.1007/BFb0014843' ] else: urls_list = [ ] else: raise ValueError(f'ECCV {year} is current not available!') for url in urls_list: __download_from_springer( url, save_dir, year, is_workshops=is_workshops, time_sleep_in_seconds=time_sleep_in_seconds, downloader=downloader) def __download_from_springer( url, save_dir, year, is_workshops=False, time_sleep_in_seconds=5, downloader='IDM'): downloader = Downloader(downloader) for i in range(3): try: papers_dict = springer.get_paper_name_link_from_url(url) break except Exception as e: print(str(e)) # total_paper_number = len(papers_dict) pbar = tqdm(papers_dict.keys()) postfix = f'ECCV_{year}' if is_workshops: postfix = f'ECCV_WS_{year}' for name in pbar: pbar.set_description(f'Downloading paper {name}') if not os.path.exists(os.path.join(save_dir, f'{name}_{postfix}.pdf')): downloader.download( papers_dict[name], os.path.join(save_dir, f'{name}_{postfix}.pdf'), time_sleep_in_seconds) if __name__ == '__main__': year = 2024 # total_paper_number = 2387 total_paper_number = save_csv(year) download_from_csv(year, save_dir=fr'Z:\all_papers\ECCV\ECCV_{year}', is_download_supplement=True, time_step_in_seconds=5, total_paper_number=total_paper_number, is_workshops=False) # move_main_and_supplement_2_one_directory( # main_path=f'E:\\ECCV_{year}\\main_paper', # supplement_path=f'E:\\ECCV_{year}\\supplement', # supp_pdf_save_path=f'E:\\ECCV_{year}\\main_paper' # ) # for year in range(2018, 2017, -2): # # download_from_springer( # # save_dir=f'F:\\ECCV_{year}', # # year=year, # # is_workshops=False, time_sleep_in_seconds=30) # download_from_springer( # save_dir=f'F:\\ECCV_WS_{year}', # year=year, # is_workshops=True, time_sleep_in_seconds=30) # pass ================================================ FILE: code/paper_downloader_ICLR.py ================================================ """paper_downloader_ICLR.py""" from tqdm import tqdm import os # https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename from slugify import slugify from bs4 import BeautifulSoup import pickle from urllib.request import urlopen import urllib import sys root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) sys.path.append(root_folder) from lib.downloader import Downloader from lib.openreview import download_iclr_papers_given_url_and_group_id from lib.arxiv import get_pdf_link_from_arxiv def download_iclr_oral_papers(save_dir, year, base_url=None, time_step_in_seconds=10, downloader='IDM', start_page=1, proxy_ip_port=None): """ Download iclr oral papers for year 2017 ~ 2022, 2024~2025. :param save_dir: str, paper save path :param year: int, iclr year, current only support year >= 2018 :param base_url: str, paper website url :param time_step_in_seconds: int, the interval time between two download request in seconds. :param downloader: str, the downloader to download, could be 'IDM' or None, default to 'IDM'. :param start_page: int, the initial downloading webpage number, only the pages whose number is equal to or greater than this number will be processed. Currently, this parameter is only used in year 2024. Default: 1. :param proxy_ip_port: str or None, proxy ip address and port, eg. eg: "127.0.0.1:7890". Default: None. :type proxy_ip_port: str | None :return: """ group_id_dict = { 2026: "tab-accept-oral", 2025: "tab-accept-oral", 2024: "tab-accept-oral", 2022: "oral-submissions", 2021: "oral-presentations", 2020: "oral-presentations", 2019: "oral-presentations", 2018: "accepted-oral-papers", 2017: "oral-presentations", 2013: "conferenceoral-iclr2013-conference" } if base_url is None: if year in group_id_dict: base_url = 'https://openreview.net/group?id=ICLR.cc/' \ f'{year}/Conference#{group_id_dict[year]}' else: raise ValueError('the website url is not given for this year!') print(f'Downloading ICLR-{year} oral papers...') group_id = group_id_dict[year].replace('tab-', '') download_iclr_papers_given_url_and_group_id( save_dir=save_dir, year=year, base_url=base_url, group_id=group_id, start_page=start_page, time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port, is_have_pages=(year > 2021) ) def download_iclr_conditional_oral_papers(save_dir, year, base_url=None, time_step_in_seconds=10, downloader='IDM', start_page=1, proxy_ip_port=None): """ Download iclr conditional oral papers for year 2025. :param save_dir: str, paper save path :param year: int, iclr year, current only support year >= 2018 :param base_url: str, paper website url :param time_step_in_seconds: int, the interval time between two download request in seconds. :param downloader: str, the downloader to download, could be 'IDM' or None, default to 'IDM'. :param start_page: int, the initial downloading webpage number, only the pages whose number is equal to or greater than this number will be processed. Currently, this parameter is only used in year 2024. Default: 1. :param proxy_ip_port: str or None, proxy ip address and port, eg. eg: "127.0.0.1:7890". Default: None. :type proxy_ip_port: str | None :return: """ group_id_dict = { 2025: "tab-accept-conditional-oral" } no_pages_year = [2025] if base_url is None: if year in group_id_dict: base_url = 'https://openreview.net/group?id=ICLR.cc/' \ f'{year}/Conference#{group_id_dict[year]}' else: raise ValueError('the website url is not given for this year!') print(f'Downloading ICLR-{year} conditional oral papers...') group_id = group_id_dict[year].replace('tab-', '') download_iclr_papers_given_url_and_group_id( save_dir=save_dir, year=year, base_url=base_url, group_id=group_id, start_page=start_page, time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port, is_have_pages=(year not in no_pages_year) ) def download_iclr_top5_papers(save_dir, year, base_url=None, start_page=1, time_step_in_seconds=10, downloader='IDM', proxy_ip_port=None): """ Download iclr notable-top-5% papers for year 2023. :param save_dir: str, paper save path :param year: int, iclr year :type year: int :param base_url: str, paper website url :param start_page: int, the initial downloading webpage number, only the pages whose number is equal to or greater than this number will be processed. Default: 1 :param time_step_in_seconds: int, the interval time between two downlaod request in seconds. Default: 10. :type time_step_in_seconds: int :param downloader: str, the downloader to download, could be 'IDM' or None. Default: 'IDM'. :param proxy_ip_port: str or None, proxy ip address and port, eg. eg: "127.0.0.1:7890". Default: None. :type proxy_ip_port: str | None :return: """ if base_url is None: if year == 2023: base_url = "https://openreview.net/group?id=ICLR.cc/" \ "2023/Conference#notable-top-5-" else: raise ValueError('the website url is not given for this year!') print(f'Downloading ICLR-{year} top5 papers...') group_id = "notable-top-5-" return download_iclr_papers_given_url_and_group_id( save_dir=save_dir, year=year, base_url=base_url, group_id=group_id, start_page=start_page, time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port ) def download_iclr_poster_papers(save_dir, year, base_url=None, start_page=1, time_step_in_seconds=10, downloader='IDM', proxy_ip_port=None): """ Download iclr poster papers from year 2013, 2017 ~ 2024. :param save_dir: str, paper save path :param year: int, iclr year, current only support year :param base_url: str, paper website url :param start_page: int, the initial downloading webpage number, only the pages whose number is equal to or greater than this number will be processed. Default: 1 :param time_step_in_seconds: int, the interval time between two downlaod request in seconds :param downloader: str, the downloader to download, could be 'IDM' or None. Default: 'IDM' :param proxy_ip_port: str or None, proxy ip address and port, eg. eg: "127.0.0.1:7890". Default: None. :type proxy_ip_port: str | None :return: """ group_id_dict = { 2026: "tab-accept-poster", 2025: "tab-accept-poster", 2024: "tab-accept-poster", 2023: "poster", 2022: "poster-submissions", 2021: "poster-presentations", 2020: "poster-presentations", 2019: "poster-presentations", 2018: "accepted-poster-papers", 2017: "poster-presentations", 2013: "conferenceposter-iclr2013-conference" } if base_url is None: if year in group_id_dict: base_url = 'https://openreview.net/group?id=ICLR.cc/' \ f'{year}/Conference#{group_id_dict[year]}' else: raise ValueError('the website url is not given for this year!') print(f'Downloading ICLR-{year} poster papers...') no_pages_year = [2013, 2018, 2019, 2020, 2021] download_iclr_papers_given_url_and_group_id( save_dir=save_dir, year=year, base_url=base_url, group_id=group_id_dict[year].replace('tab-', ''), start_page=start_page, time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port, is_have_pages=(year not in no_pages_year), is_need_click_group_button=(year == 2018) ) def download_iclr_conditional_poster_papers(save_dir, year, base_url=None, time_step_in_seconds=10, downloader='IDM', start_page=1, proxy_ip_port=None): """ Download iclr conditional poster papers for year 2025. :param save_dir: str, paper save path :param year: int, iclr year, current only support year >= 2018 :param base_url: str, paper website url :param time_step_in_seconds: int, the interval time between two download request in seconds. :param downloader: str, the downloader to download, could be 'IDM' or None, default to 'IDM'. :param start_page: int, the initial downloading webpage number, only the pages whose number is equal to or greater than this number will be processed. Currently, this parameter is only used in year 2024. Default: 1. :param proxy_ip_port: str or None, proxy ip address and port, eg. eg: "127.0.0.1:7890". Default: None. :type proxy_ip_port: str | None :return: """ group_id_dict = { 2025: "tab-accept-conditional-poster" } if base_url is None: if year in group_id_dict: base_url = 'https://openreview.net/group?id=ICLR.cc/' \ f'{year}/Conference#{group_id_dict[year]}' else: raise ValueError('the website url is not given for this year!') print(f'Downloading ICLR-{year} conditional poster papers...') group_id = group_id_dict[year].replace('tab-', '') download_iclr_papers_given_url_and_group_id( save_dir=save_dir, year=year, base_url=base_url, group_id=group_id, start_page=start_page, time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port, is_have_pages=(year > 2021) ) def download_iclr_spotlight_papers(save_dir, year, base_url=None, time_step_in_seconds=10, downloader='IDM', start_page=1, proxy_ip_port=None): """ Download iclr spotlight papers between year 2020 and 2022, 2024~2025. :param save_dir: str, paper save path :param year: int, iclr year, current only support year >= 2018 :param base_url: str, paper website url :param time_step_in_seconds: int, the interval time between two download request in seconds :param downloader: str, the downloader to download, could be 'IDM' or None, default to 'IDM' :param start_page: int, the initial downloading webpage number, only the pages whose number is equal to or greater than this number will be processed. Currently, this parameter is only used in year 2024. Default: 1. :param proxy_ip_port: str or None, proxy ip address and port, eg. eg: "127.0.0.1:7890". Default: None. :return: """ group_id_dict = { 2025: "tab-accept-spotlight", 2024: "tab-accept-spotlight", 2022: "spotlight-submissions", 2021: "spotlight-presentations", 2020: "spotlight-presentations", } if base_url is None: if year in group_id_dict: base_url = 'https://openreview.net/group?id=ICLR.cc/' \ f'{year}/Conference#{group_id_dict[year]}' else: raise ValueError('the website url is not given for this year!') print(f'Downloading ICLR-{year} spotlight papers...') no_pages_year = [2020, 2021] download_iclr_papers_given_url_and_group_id( save_dir=save_dir, year=year, base_url=base_url, group_id=group_id_dict[year].replace('tab-', ''), start_page=start_page, time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port, is_have_pages=(year not in no_pages_year) ) def download_iclr_conditional_spotlight_papers(save_dir, year, base_url=None, time_step_in_seconds=10, downloader='IDM', start_page=1, proxy_ip_port=None): """ Download iclr conditional spotlight papers for year 2025. :param save_dir: str, paper save path :param year: int, iclr year, current only support year >= 2018 :param base_url: str, paper website url :param time_step_in_seconds: int, the interval time between two download request in seconds. :param downloader: str, the downloader to download, could be 'IDM' or None, default to 'IDM'. :param start_page: int, the initial downloading webpage number, only the pages whose number is equal to or greater than this number will be processed. Currently, this parameter is only used in year 2024. Default: 1. :param proxy_ip_port: str or None, proxy ip address and port, eg. eg: "127.0.0.1:7890". Default: None. :type proxy_ip_port: str | None :return: """ group_id_dict = { 2025: "tab-accept-conditional-spotlight" } no_pages_year = [2025] if base_url is None: if year in group_id_dict: base_url = 'https://openreview.net/group?id=ICLR.cc/' \ f'{year}/Conference#{group_id_dict[year]}' else: raise ValueError('the website url is not given for this year!') print(f'Downloading ICLR-{year} conditional spotlight papers...') group_id = group_id_dict[year].replace('tab-', '') download_iclr_papers_given_url_and_group_id( save_dir=save_dir, year=year, base_url=base_url, group_id=group_id, start_page=start_page, time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port, is_have_pages=(year not in no_pages_year) ) def download_iclr_top25_papers(save_dir, year, base_url=None, start_page=1, time_step_in_seconds=10, downloader='IDM', proxy_ip_port=None): """ Download iclr notable-top-25% papers for year 2023. :param save_dir: str, paper save path :param year: int, iclr year :type year: int :param base_url: str, paper website url :param start_page: int, the initial downloading webpage number, only the pages whose number is equal to or greater than this number will be processed. Default: 1 :param time_step_in_seconds: int, the interval time between two downlaod request in seconds. Default: 10. :type time_step_in_seconds: int :param downloader: str, the downloader to download, could be 'IDM' or None. Default: 'IDM'. :param proxy_ip_port: str or None, proxy ip address and port, eg. eg: "127.0.0.1:7890". Default: None. :type proxy_ip_port: str | None :return: """ if base_url is None: if year == 2023: base_url = "https://openreview.net/group?id=ICLR.cc/" \ "2023/Conference#notable-top-25-" else: raise ValueError('the website url is not given for this year!') print(f'Downloading ICLR-{year} top25 papers...') group_id = "notable-top-25-" download_iclr_papers_given_url_and_group_id( save_dir=save_dir, year=year, base_url=base_url, group_id=group_id, start_page=start_page, time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port ) def download_iclr_paper(save_dir, year, base_url=None, time_step_in_seconds=10, downloader='IDM', start_page=1, proxy_ip_port=None): """ Download iclr papers between year 2013 and 2024. :param save_dir: str, paper save path :param year: int, iclr year, current only support year >= 2018 :param base_url: str, paper website url :param time_step_in_seconds: int, the interval time between two download request in seconds. :param downloader: str, the downloader to download, could be 'IDM' or None, default to 'IDM'. :param start_page: int, the initial downloading webpage number, only the pages whose number is equal to or greater than this number will be processed. Currently, this parameter is only used in year 2024. Default: 1. :param proxy_ip_port: str or None, proxy ip address and port, eg. eg: "127.0.0.1:7890". Default: None. :type proxy_ip_port: str | None :return: """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) year_no_group = [2014] year_no_group_iclrcc = [2015, 2016] year_oral_poster = [2013, 2017, 2018, 2019, 2026] year_oral_spotlight_poster = [2020, 2021, 2022, 2024, 2025] year_top5_top25_poster = [2023] year_oral_spotlight_poster_conditional = [2025] # no group, openreview website if year in year_no_group: if base_url is None: if year == 2014: base_url = 'https://openreview.net/group?id=ICLR.cc/2014/conference' else: raise ValueError('the website url is not given for this year!') print(f'Downloading ICLR-{year} oral papers...') group_id_dict = { 2014: "submitted-papers" } group_id = group_id_dict[year] no_pages_year = [2014] return download_iclr_papers_given_url_and_group_id( save_dir=save_dir, year=year, base_url=base_url, group_id=group_id, start_page=start_page, time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port, is_have_pages=(year not in no_pages_year) ) # no group, iclr.cc website if year in year_no_group_iclrcc: downloader = Downloader(downloader=downloader) paper_postfix = f'ICLR_{year}' if base_url is None: if year == 2016: base_url = 'https://iclr.cc/archive/www/doku.php%3Fid=iclr2016:main.html' elif year == 2015: base_url = 'https://iclr.cc/archive/www/doku.php%3Fid=iclr2015:main.html' elif year == 2014: base_url = 'https://iclr.cc/archive/2014/conference-proceedings/' else: raise ValueError('the website url is not given for this year!') os.makedirs(save_dir, exist_ok=True) if year == 2015: # oral and poster seperated oral_save_path = os.path.join(save_dir, 'oral') poster_save_path = os.path.join(save_dir, 'poster') workshop_save_path = os.path.join(save_dir, 'ws') os.makedirs(oral_save_path, exist_ok=True) os.makedirs(poster_save_path, exist_ok=True) os.makedirs(workshop_save_path, exist_ok=True) dat_file_pathname = os.path.join( project_root_folder, 'urls', f'init_url_iclr_{year}.dat' ) if os.path.exists(dat_file_pathname): with open(dat_file_pathname, 'rb') as f: content = pickle.load(f) else: headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) ' 'Gecko/20100101 Firefox/23.0'} req = urllib.request.Request(url=base_url, headers=headers) content = urllib.request.urlopen(req).read() with open(f'..\\urls\\init_url_iclr_{year}.dat', 'wb') as f: pickle.dump(content, f) error_log = [] soup = BeautifulSoup(content, 'html.parser') print('open url successfully!') if year == 2016: papers = soup.find('h3', { 'id': 'accepted_papers_conference_track'}).findNext( 'div').find_all('a') for paper in tqdm(papers): link = paper.get('href') if link.startswith('http://arxiv'): title = slugify(paper.text) pdf_name = f'{title}_{paper_postfix}.pdf' try: if not os.path.exists( os.path.join(save_dir, title + f'_{paper_postfix}.pdf')): pdf_link = get_pdf_link_from_arxiv(link) print(f'downloading {title}') downloader.download( urls=pdf_link, save_path=os.path.join(save_dir, pdf_name), time_sleep_in_seconds=time_step_in_seconds ) except Exception as e: # error_flag = True print('Error: ' + title + ' - ' + str(e)) error_log.append( (title, link, 'paper download error', str(e))) # workshops papers = soup.find('h3', { 'id': 'workshop_track_posters_may_2nd'}).findNext( 'div').find_all('a') for paper in tqdm(papers): link = paper.get('href') if link.startswith('http://beta.openreview'): title = slugify(paper.text) pdf_name = f'{title}_ICLR_WS_{year}.pdf' try: if not os.path.exists( os.path.join(save_dir, 'ws', pdf_name)): pdf_link = get_pdf_link_from_openreview(link) print(f'downloading {title}') downloader.download( urls=pdf_link, save_path=os.path.join(save_dir, 'ws', pdf_name), time_sleep_in_seconds=time_step_in_seconds ) except Exception as e: # error_flag = True print('Error: ' + title + ' - ' + str(e)) error_log.append( (title, link, 'paper download error', str(e))) papers = soup.find('h3', { 'id': 'workshop_track_posters_may_3rd'}).findNext( 'div').find_all('a') for paper in tqdm(papers): link = paper.get('href') if link.startswith('http://beta.openreview'): title = slugify(paper.text) pdf_name = f'{title}_ICLR_WS_{year}.pdf' try: if not os.path.exists( os.path.join(save_dir, 'ws', pdf_name)): pdf_link = get_pdf_link_from_openreview(link) print(f'downloading {title}') downloader.download( urls=pdf_link, save_path=os.path.join(save_dir, 'ws', pdf_name), time_sleep_in_seconds=time_step_in_seconds ) except Exception as e: # error_flag = True print('Error: ' + title + ' - ' + str(e)) error_log.append( (title, link, 'paper download error', str(e))) elif year == 2015: # oral papers oral_papers = soup.find('h3', { 'id': 'conference_oral_presentations'}).findNext( 'div').find_all( 'a') for paper in tqdm(oral_papers): link = paper.get('href') if link.startswith('http://arxiv'): title = slugify(paper.text) pdf_name = f'{title}_{paper_postfix}.pdf' try: if not os.path.exists( os.path.join(oral_save_path, title + f'_{paper_postfix}.pdf')): pdf_link = get_pdf_link_from_arxiv(link) print(f'downloading {title}') downloader.download( urls=pdf_link, save_path=os.path.join(oral_save_path, pdf_name), time_sleep_in_seconds=time_step_in_seconds ) except Exception as e: # error_flag = True print('Error: ' + title + ' - ' + str(e)) error_log.append( (title, link, 'paper download error', str(e))) # workshops papers workshop_papers = soup.find('h3', { 'id': 'may_7_workshop_poster_session'}).findNext( 'div').find_all( 'a') workshop_papers.append( soup.find('h3', {'id': 'may_8_workshop_poster_session'}).findNext( 'div').find_all('a')) for paper in tqdm(workshop_papers): link = paper.get('href') if link.startswith('http://arxiv'): title = slugify(paper.text) pdf_name = f'{title}_ICLR_WS_{year}.pdf' try: if not os.path.exists( os.path.join(workshop_save_path, title + f'_{paper_postfix}.pdf')): pdf_link = get_pdf_link_from_arxiv(link) print(f'downloading {title}') downloader.download( urls=pdf_link, save_path=os.path.join(workshop_save_path, pdf_name), time_sleep_in_seconds=time_step_in_seconds) except Exception as e: # error_flag = True print('Error: ' + title + ' - ' + str(e)) error_log.append( (title, link, 'paper download error', str(e))) # poster papers poster_papers = soup.find('h3', { 'id': 'may_9_conference_poster_session'}).findNext( 'div').find_all( 'a') for paper in tqdm(poster_papers): link = paper.get('href') if link.startswith('http://arxiv'): title = slugify(paper.text) pdf_name = f'{title}_{paper_postfix}.pdf' try: if not os.path.exists( os.path.join(poster_save_path, title + f'_{paper_postfix}.pdf')): pdf_link = get_pdf_link_from_arxiv(link) print(f'downloading {title}') downloader.download( urls=pdf_link, save_path=os.path.join(poster_save_path, pdf_name), time_sleep_in_seconds=time_step_in_seconds) except Exception as e: # error_flag = True print('Error: ' + title + ' - ' + str(e)) error_log.append( (title, link, 'paper download error', str(e))) elif year == 2014: papers = soup.find('div', {'id': 'sites-canvas-main-content'}).find_all( 'a') for paper in tqdm(papers): link = paper.get('href') if link.startswith('http://arxiv'): title = slugify(paper.text) pdf_name = f'{title}_{paper_postfix}.pdf' try: if not os.path.exists(os.path.join(save_dir, pdf_name)): pdf_link = get_pdf_link_from_arxiv(link) print(f'downloading {title}') downloader.download( urls=pdf_link, save_path=os.path.join(save_dir, pdf_name), time_sleep_in_seconds=time_step_in_seconds) except Exception as e: # error_flag = True print('Error: ' + title + ' - ' + str(e)) error_log.append( (title, link, 'paper download error', str(e))) # workshops paper_postfix = f'ICLR_WS_{year}' base_url = 'https://sites.google.com/site/representationlearning2014/' \ 'workshop-proceedings' headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) ' 'Gecko/20100101 Firefox/23.0'} req = urllib.request.Request(url=base_url, headers=headers) content = urllib.request.urlopen(req).read() soup = BeautifulSoup(content, 'html.parser') workshop_save_path = os.path.join(save_dir, 'WS') os.makedirs(workshop_save_path, exist_ok=True) papers = soup.find( 'div', {'id': 'sites-canvas-main-content'}).find_all('a') for paper in tqdm(papers): link = paper.get('href') if link.startswith('http://arxiv'): title = slugify(paper.text) pdf_name = f'{title}_{paper_postfix}.pdf' try: if not os.path.exists( os.path.join(workshop_save_path, pdf_name)): pdf_link = get_pdf_link_from_arxiv(link) print(f'downloading {title}') downloader.download( urls=pdf_link, save_path=os.path.join(workshop_save_path, pdf_name), time_sleep_in_seconds=time_step_in_seconds) except Exception as e: # error_flag = True print('Error: ' + title + ' - ' + str(e)) error_log.append( (title, link, 'paper download error', str(e))) # write error log print('write error log') log_file_pathname = os.path.join( project_root_folder, 'log', 'download_err_log.txt') with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: if e is not None: f.write(e) else: f.write('None') f.write('\n') f.write('\n') return True # oral openreview if year in (year_oral_poster + year_oral_spotlight_poster): save_dir_oral = os.path.join(save_dir, 'oral') download_iclr_oral_papers( save_dir_oral, year, time_step_in_seconds=time_step_in_seconds, downloader=downloader, start_page=start_page, proxy_ip_port=proxy_ip_port ) # conditional oral openreview if year in (year_oral_spotlight_poster_conditional): save_dir_cond_oral = os.path.join(save_dir, 'conditional-oral') download_iclr_conditional_oral_papers( save_dir_cond_oral, year, time_step_in_seconds=time_step_in_seconds, downloader=downloader, start_page=start_page, proxy_ip_port=proxy_ip_port ) # poster openreview if year in (year_oral_poster + year_oral_spotlight_poster + year_top5_top25_poster): save_dir_poster = os.path.join(save_dir, 'poster') download_iclr_poster_papers( save_dir_poster, year, time_step_in_seconds=time_step_in_seconds, downloader=downloader, start_page=start_page, proxy_ip_port=proxy_ip_port ) # conditional poster openreview if year in (year_oral_spotlight_poster_conditional): save_dir_cond_poster = os.path.join(save_dir, 'conditional-poster') download_iclr_conditional_poster_papers( save_dir_cond_poster, year, time_step_in_seconds=time_step_in_seconds, downloader=downloader, start_page=start_page, proxy_ip_port=proxy_ip_port ) # spotlight openreview if year in year_oral_spotlight_poster: save_dir_spotlight = os.path.join(save_dir, 'spotlight') download_iclr_spotlight_papers( save_dir_spotlight, year, time_step_in_seconds=time_step_in_seconds, downloader=downloader, start_page=start_page, proxy_ip_port=proxy_ip_port ) # conditional spotlight openreview if year in (year_oral_spotlight_poster_conditional): save_dir_cond_spotlight = os.path.join(save_dir, 'conditional-spotlight') download_iclr_conditional_spotlight_papers( save_dir_cond_spotlight, year, time_step_in_seconds=time_step_in_seconds, downloader=downloader, start_page=start_page, proxy_ip_port=proxy_ip_port ) # top5 openreview if year in year_top5_top25_poster: save_dir_top5 = os.path.join(save_dir, 'top5') download_iclr_top5_papers( save_dir_top5, year, time_step_in_seconds=time_step_in_seconds, downloader=downloader, start_page=start_page, proxy_ip_port=proxy_ip_port ) # top25 openreview if year in year_top5_top25_poster: save_dir_top25 = os.path.join(save_dir, 'top25') download_iclr_top25_papers( save_dir_top25, year, time_step_in_seconds=time_step_in_seconds, downloader=downloader, start_page=start_page, proxy_ip_port=proxy_ip_port ) def get_pdf_link_from_openreview(abs_link): return abs_link.replace('beta.', '').replace('forum', 'pdf') if __name__ == '__main__': year = 2025 save_dir_iclr = rf'E:\ICLR_{year}' # save_dir_iclr_oral = os.path.join(save_dir_iclr, 'oral') # save_dir_iclr_top5 = os.path.join(save_dir_iclr, 'top5') # save_dir_iclr_spotlight = os.path.join(save_dir_iclr, 'spotlight') # save_dir_iclr_top25 = os.path.join(save_dir_iclr, 'top25') # save_dir_iclr_poster = os.path.join(save_dir_iclr, 'poster') proxy_ip_port = None # proxy_ip_port = "http://127.0.0.1:7890" # download_iclr_oral_papers(save_dir_iclr_oral, year, # time_step_in_seconds=5) # download_iclr_top5_papers(save_dir_iclr_top5, year, start_page=1, # time_step_in_seconds=5, # proxy_ip_port=proxy_ip_port) # download_iclr_top25_papers(save_dir_iclr_top25, year, start_page=1, # time_step_in_seconds=5, # proxy_ip_port=proxy_ip_port) # download_iclr_spotlight_papers(save_dir_iclr_spotlight, year, # time_step_in_seconds=5) # download_iclr_poster_papers(save_dir_iclr_poster, year, start_page=1, # time_step_in_seconds=5, # proxy_ip_port=proxy_ip_port) download_iclr_paper(save_dir_iclr, year, time_step_in_seconds=5, proxy_ip_port=proxy_ip_port) ================================================ FILE: code/paper_downloader_ICML.py ================================================ """paper_downloader_ICML.py""" import urllib from bs4 import BeautifulSoup import pickle import os from tqdm import tqdm from slugify import slugify import sys root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) sys.path.append(root_folder) from lib.downloader import Downloader import lib.pmlr as pmlr from lib.supplement_porcess import merge_main_supplement from lib.openreview import download_icml_papers_given_url_and_group_id from lib.my_request import urlopen_with_retry def download_paper(year, save_dir, is_download_supplement=True, time_step_in_seconds=5, downloader='IDM', source='pmlr', proxy_ip_port=None): """ download all ICML paper and supplement files given year, restore in save_dir/main_paper and save_dir/supplement respectively :param year: int, ICML year, such 2019 :param save_dir: str, paper and supplement material's save path :param is_download_supplement: bool, True for downloading supplemental material :param time_step_in_seconds: int, the interval time between two download request in seconds :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder', default to 'IDM' :param source: str, source website, 'pmlr' or 'openreview' :param proxy_ip_port: str or None, proxy ip address and port, eg. eg: "127.0.0.1:7890". Default: None. :type proxy_ip_port: str | None :return: True """ assert source in ['pmlr', 'openreview'], \ f'only support source pmlr or openreview, but get {source}' project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) downloader = Downloader(downloader=downloader, proxy_ip_port=proxy_ip_port) ICML_year_dict = { 2024: 235, 2023: 202, 2022: 162, 2021: 139, 2020: 119, 2019: 97, 2018: 80, 2017: 70, 2016: 48, 2015: 37, 2014: 32, 2013: 28 } if source == 'openreview': init_url = f'https://openreview.net/group?id=ICML.cc/{year}/Conference' else: # pmlr if year >= 2013: init_url = f'http://proceedings.mlr.press/v{ICML_year_dict[year]}/' elif year == 2012: init_url = 'https://icml.cc/2012/papers.1.html' elif year == 2011: init_url = 'http://www.icml-2011.org/papers.php' elif 2009 == year: init_url = 'https://icml.cc/Conferences/2009/abstracts.html' elif 2008 == year: init_url = 'http://www.machinelearning.org/archive/icml2008/' \ 'abstracts.shtml' elif 2007 == year: init_url = 'https://icml.cc/Conferences/2007/paperlist.html' elif year in [2006, 2004, 2005]: init_url = f'https://icml.cc/Conferences/{year}/proceedings.html' elif 2003 == year: init_url = 'https://aaai.org/Library/ICML/icml03contents.php' else: raise ValueError('''the given year's url is unknown !''') postfix = f'ICML_{year}' if source == 'openreview': # download from openreview website: # oral paper group_id = 'oral' save_dir_oral = os.path.join(save_dir, group_id) os.makedirs(save_dir_oral, exist_ok=True) download_icml_papers_given_url_and_group_id( save_dir=save_dir_oral, year=year, base_url=init_url, group_id=group_id, start_page=1, time_step_in_seconds=time_step_in_seconds, downloader=downloader.downloader, proxy_ip_port=proxy_ip_port ) # poster paper group_id = 'poster' save_dir_poster = os.path.join(save_dir, group_id) os.makedirs(save_dir_poster, exist_ok=True) download_icml_papers_given_url_and_group_id( save_dir=os.path.join(save_dir, 'poster'), year=year, base_url=init_url, group_id=group_id, start_page=1, time_step_in_seconds=time_step_in_seconds, downloader=downloader.downloader, proxy_ip_port=proxy_ip_port ) # spotlight paper group_id = 'spotlight' save_dir_poster = os.path.join(save_dir, group_id) os.makedirs(save_dir_poster, exist_ok=True) try: download_icml_papers_given_url_and_group_id( save_dir=os.path.join(save_dir, 'spotlight'), year=year, base_url=init_url, group_id=group_id, start_page=1, time_step_in_seconds=time_step_in_seconds, downloader=downloader.downloader, proxy_ip_port=proxy_ip_port ) except ValueError as e: # no spotlight paper print(f"WARNING: {str(e)}") return dat_file_pathname = os.path.join( project_root_folder, 'urls', f'init_url_icml_{year}.dat') if os.path.exists(dat_file_pathname): with open(dat_file_pathname, 'rb') as f: content = pickle.load(f) else: headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) ' 'Gecko/20100101 Firefox/23.0'} content = urlopen_with_retry(url=init_url, headers=headers) # content = open(f'..\\ICML_{year}.html', 'rb').read() with open(dat_file_pathname, 'wb') as f: pickle.dump(content, f) # soup = BeautifulSoup(content, 'html.parser') soup = BeautifulSoup(content, 'html5lib') # soup = BeautifulSoup(open(r'..\ICML_2011.html', 'rb'), 'html.parser') error_log = [] if year >= 2013: if year in ICML_year_dict.keys(): volume = f'v{ICML_year_dict[year]}' else: raise ValueError('''the given year's url is unknown !''') pmlr.download_paper_given_volume( volume=volume, save_dir=save_dir, postfix=postfix, is_download_supplement=is_download_supplement, time_step_in_seconds=time_step_in_seconds, downloader=downloader.downloader ) elif 2012 == year: # 2012 # base_url = f'https://icml.cc/{year}/' paper_list_bar = tqdm(soup.find_all('div', {'class': 'paper'})) paper_index = 0 for paper in paper_list_bar: paper_index += 1 title = '' title = slugify(paper.find('h2').text) link = None for a in paper.find_all('a'): if 'ICML version (pdf)' == a.text: link = urllib.parse.urljoin(init_url, a.get('href')) break if link is not None: this_paper_main_path = os.path.join( save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_')) paper_list_bar.set_description( f'find paper {paper_index}:{title}') if not os.path.exists(this_paper_main_path) : paper_list_bar.set_description( f'downloading paper {paper_index}:{title}') downloader.download( urls=link, save_path=this_paper_main_path, time_sleep_in_seconds=time_step_in_seconds ) else: error_log.append((title, 'no main link error')) elif 2011 == year: paper_list_bar = tqdm(soup.find_all('a')) paper_index = 0 for paper in paper_list_bar: h3 = paper.find('h3') if h3 is not None: title = slugify(h3.text) paper_index += 1 if 'download' == slugify(paper.text.strip()): link = paper.get('href') link = urllib.parse.urljoin(init_url, link) if link is not None: this_paper_main_path = os.path.join( save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_')) paper_list_bar.set_description( f'find paper {paper_index}:{title}') if not os.path.exists(this_paper_main_path) : paper_list_bar.set_description( f'downloading paper {paper_index}:{title}') downloader.download( urls=link, save_path=this_paper_main_path, time_sleep_in_seconds=time_step_in_seconds ) else: error_log.append((title, 'no main link error')) elif year in [2009, 2008]: if 2009 == year: paper_list_bar = tqdm( soup.find('div', {'id': 'right_column'}).find_all(['h3','a'])) elif 2008 == year: paper_list_bar = tqdm( soup.find('div', {'class': 'content'}).find_all(['h3','a'])) paper_index = 0 title = None for paper in paper_list_bar: if 'h3' == paper.name: title = slugify(paper.text) paper_index += 1 elif 'full-paper' == slugify(paper.text.strip()): # a link = paper.get('href') if link is not None and title is not None: link = urllib.parse.urljoin(init_url, link) this_paper_main_path = os.path.join( save_dir, f'{title}_{postfix}.pdf') paper_list_bar.set_description( f'find paper {paper_index}:{title}') if not os.path.exists(this_paper_main_path): paper_list_bar.set_description( f'downloading paper {paper_index}:{title}') downloader.download( urls=link, save_path=this_paper_main_path, time_sleep_in_seconds=time_step_in_seconds ) title = None else: error_log.append((title, 'no main link error')) elif year in [2006, 2005]: paper_list_bar = tqdm(soup.find_all('a')) paper_index = 0 for paper in paper_list_bar: title = slugify(paper.text.strip()) link = paper.get('href') paper_index += 1 if link is not None and title is not None and \ ('pdf' == link[-3:] or 'ps' == link[-2:]): link = urllib.parse.urljoin(init_url, link) this_paper_main_path = os.path.join( save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_')) paper_list_bar.set_description( f'find paper {paper_index}:{title}') if not os.path.exists(this_paper_main_path): paper_list_bar.set_description( f'downloading paper {paper_index}:{title}') downloader.download( urls=link, save_path=this_paper_main_path, time_sleep_in_seconds=time_step_in_seconds ) elif 2004 == year: paper_index = 0 paper_list_bar = tqdm( soup.find('table', {'class': 'proceedings'}).find_all('tr')) title = None for paper in paper_list_bar: tr_class = None try: tr_class = paper.get('class')[0] except: pass if 'proc_2004_title' == tr_class: # title title = slugify(paper.text.strip()) paper_index += 1 else: for a in paper.find_all('a'): if '[Paper]' == a.text: link = a.get('href') if link is not None and title is not None: link = urllib.parse.urljoin(init_url, link) this_paper_main_path = os.path.join( save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_')) paper_list_bar.set_description( f'find paper {paper_index}:{title}') if not os.path.exists(this_paper_main_path): paper_list_bar.set_description( f'downloading paper {paper_index}:{title}') downloader.download( urls=link, save_path=this_paper_main_path, time_sleep_in_seconds=time_step_in_seconds ) break elif 2003 == year: paper_index = 0 paper_list_bar = tqdm( soup.find('div', {'id': 'content'}).find_all( 'p', {'class': 'left'})) for paper in paper_list_bar: abs_link = None title = None link = None for a in paper.find_all('a'): abs_link = urllib.parse.urljoin(init_url, a.get('href')) if abs_link is not None: title = slugify(a.text.strip()) break if title is not None: paper_index += 1 this_paper_main_path = os.path.join( save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_')) paper_list_bar.set_description( f'find paper {paper_index}:{title}') if not os.path.exists(this_paper_main_path): if abs_link is not None: headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; ' 'rv:23.0) Gecko/20100101 Firefox/23.0'} abs_content = urlopen_with_retry( url=abs_link, headers=headers, raise_error_if_failed=False) if abs_content is None: print('error'+title) error_log.append( (title, abs_link, 'download error')) continue abs_soup = BeautifulSoup(abs_content, 'html5lib') for a in abs_soup.find_all('a'): try: if 'pdf' == a.get('href')[-3:]: link = urllib.parse.urljoin( abs_link, a.get('href')) if link is not None: paper_list_bar.set_description( f'downloading paper {paper_index}:' f'{title}') downloader.download( urls=link, save_path=this_paper_main_path, time_sleep_in_seconds=time_step_in_seconds ) break except: pass # write error log print('write error log') log_file_pathname = os.path.join( project_root_folder, 'log', 'download_err_log.txt') with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: if e is not None: f.write(e) else: f.write('None') f.write('\n') f.write('\n') def rename_downloaded_paper(year, source_path): """ rename the downloaded ICML paper to {title}_ICML_2010.pdf and save to source_path :param year: int, year :param source_path: str, whose structure should be source_path/papers/pdf files (2010) /index.html (2010) source_path/icml2007_proc.html (2007) :return: """ if not os.path.exists(source_path): raise ValueError(f'can not find {source_path}') postfix = f'ICML_{year}' if 2010 == year: soup = BeautifulSoup( open(os.path.join(source_path, 'index.html'), 'rb'), 'html5lib') paper_list_bar = tqdm(soup.find_all('span', {'class': 'boxpopup3'})) for paper in paper_list_bar: a = paper.find('a') title = slugify(a.text) ori_name = os.path.join( source_path, 'papers', a.get('href').split('/')[-1]) os.rename(ori_name, os.path.join( source_path, f'{title}_{postfix}.pdf')) paper_list_bar.set_description(f'processing {title}') elif 2007 == year: soup = BeautifulSoup(open(os.path.join( source_path, 'icml2007_proc.html'), 'rb'), 'html5lib') paper_list_bar = tqdm(soup.find_all('td', {'colspan': '2'})) for paper in paper_list_bar: all_as = paper.find_all('a') if len(all_as) <= 1: title = slugify(paper.text.strip()) else: for a in all_as: if '[Paper]' == a.text: sub_path = a.get('href') os.rename(os.path.join(source_path, sub_path), os.path.join( source_path, f'{title}_{postfix}.pdf')) paper_list_bar.set_description_str( (f'processing {title}')) break if __name__ == '__main__': year = 2025 download_paper( year, rf'E:\ICML_{year}', is_download_supplement=True, time_step_in_seconds=10, downloader='IDM', source='openreview' ) # merge_main_supplement(main_path=f'..\\ICML_{year}\\main_paper', # supplement_path=f'..\\ICML_{year}\\supplement', # save_path=f'..\\ICML_{year}', # is_delete_ori_files=False) # rename_downloaded_paper(year, f'..\\ICML_{year}') pass ================================================ FILE: code/paper_downloader_IJCAI.py ================================================ """paper_downloader_IJCAI.py""" import urllib from bs4 import BeautifulSoup import pickle import os from tqdm import tqdm from slugify import slugify import csv import sys root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) sys.path.append(root_folder) from lib import csv_process from lib.my_request import urlopen_with_retry def save_csv(year): """ write IJCAI papers' urls in one csv file :param year: int, IJCAI year, such 2019 :return: peper_index: int, the total number of papers """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) csv_file_pathname = os.path.join( project_root_folder, 'csv', f'IJCAI_{year}.csv' ) with open(csv_file_pathname, 'w', newline='') as csvfile: fieldnames = ['title', 'main link', 'group'] writer = csv.DictWriter(csvfile, fieldnames=fieldnames) writer.writeheader() if year >= 2003: init_urls = [f'https://www.ijcai.org/proceedings/{year}/'] elif year >= 1977: init_urls = [f'https://www.ijcai.org/Proceedings/{year}-1/', f'https://www.ijcai.org/Proceedings/{year}-2/'] elif year >= 1969: init_urls = [f'https://www.ijcai.org/Proceedings/{year}/'] else: raise ValueError('invalid year!') error_log = [] user_agents = [ 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) ' 'Gecko/20071127 Firefox/2.0.0.11', 'Opera/9.25 (Windows NT 5.1; U; en)', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; ' '.NET CLR 1.1.4322; .NET CLR 2.0.50727)', 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) ' 'KHTML/3.5.5 (like Gecko) (Kubuntu)', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) ' 'Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12', 'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9', "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 " "(KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 " "Chrome/16.0.912.77 Safari/535.7", "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) " "Gecko/20100101 Firefox/10.0 ", 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' 'AppleWebKit/537.36 (KHTML, like Gecko) ' 'Chrome/105.0.0.0 Safari/537.36' ] headers = { 'User-Agent': user_agents[-1], 'Host': 'www.ijcai.org', 'Referer': "https://www.ijcai.org", 'GET': init_urls[0] } if len(init_urls) == 1: data_file_pathname = os.path.join( project_root_folder, 'urls', f'init_url_IJCAI_{year}.dat' ) if os.path.exists(data_file_pathname): with open(data_file_pathname, 'rb') as f: content = pickle.load(f) else: content = urlopen_with_retry(url=init_urls[0], headers=headers) with open(data_file_pathname, 'wb') as f: pickle.dump(content, f) contents = [content] else: contents = [] data_file_pathname = os.path.join( project_root_folder, 'urls', f'init_url_IJCAI_0_{year}.dat' ) if os.path.exists(data_file_pathname): with open(data_file_pathname, 'rb') as f: content = pickle.load(f) else: content = urlopen_with_retry(url=init_urls[0], headers=headers) with open(data_file_pathname, 'wb') as f: pickle.dump(content, f) contents.append(content) data_file_pathname = os.path.join( project_root_folder, 'urls', f'init_url_IJCAI_1_{year}.dat' ) if os.path.exists(data_file_pathname): with open(data_file_pathname, 'rb') as f: content = pickle.load(f) else: content = urlopen_with_retry(url=init_urls[1], headers=headers) with open(data_file_pathname, 'wb') as f: pickle.dump(content, f) contents.append(content) paper_index = 0 for content in contents: soup = BeautifulSoup(content, 'html5lib') if year >= 2017: pbar = tqdm(soup.find_all('div', {'class': 'section_title'})) for section in pbar: this_group = slugify(section.text) papers = section.parent.find_all( 'div', {'class': ['paper_wrapper', 'subsection_title']}) sub_group = '' for paper in papers: if 'subsection_title' == paper.get('class')[0]: sub_group = slugify(paper.text) continue paper_index += 1 is_get_link = False title = slugify( paper.find('div', {'class': 'title'}).text) pbar.set_description( f'downloading paper {paper_index}: {title}') for a in paper.find( 'div', {'class': 'details'}).find_all('a'): if 'PDF' == a.text: link = urllib.parse.urljoin( init_urls[0], a.get('href')) is_get_link = True break if is_get_link: paper_dict = {'title': title, 'main link': link, 'group': this_group + '--' + sub_group if sub_group != '' else this_group} else: paper_dict = {'title': title, 'main link': 'error', 'group': this_group + '--' + sub_group if sub_group != '' else this_group} print(f'get link for {title}_{year} failed!') error_log.apend(title, 'no link') writer.writerow(paper_dict) elif year in [2016]: # no group papers_bar = tqdm(soup.find_all('p')) for paper in papers_bar: all_as = paper.find_all('a') if len(all_as) >= 2: # paper pdf and abstract paper_index += 1 title = slugify(paper.text.split('\n')[0]) papers_bar.set_description( f'downloading paper {paper_index}: {title}') is_get_link = False for a in all_as: if 'PDF' == a.text: link = 'https://www.ijcai.org' + a.get('href') is_get_link = True break if is_get_link: paper_dict = {'title': title, 'main link': link, 'group': ''} else: paper_dict = {'title': title, 'main link': 'error', 'group': ''} print(f'get link for {title}_{year} failed!') error_log.apend(title, 'no link') writer.writerow(paper_dict) elif year in [2015]: # p group 'PDF' div_content = soup.find('div', {'id': 'content'}) papers_bar = tqdm(div_content.find_all(['h2', 'p', 'h3'])) is_start = False this_group = '' for paper in papers_bar: if not is_start: if 'h2' == paper.name: # find 'content' if 'Contents' == paper.text: is_start = True else: if 'h3' == paper.name: # group this_group = slugify(paper.text) elif 'p' == paper.name: # paper all_as = paper.find_all('a') if len(all_as) >= 2: # paper pdf and abstract paper_index += 1 title = slugify(paper.text.split('\n')[0]) papers_bar.set_description( f'downloading paper {paper_index}: {title}') is_get_link = False for a in all_as: if 'PDF' == a.text: link = 'https://www.ijcai.org' + \ a.get('href') is_get_link = True break if is_get_link: paper_dict = {'title': title, 'main link': link, 'group': this_group} else: paper_dict = {'title': title, 'main link': 'error', 'group': this_group} print(f'get link for {title}_{year} failed!') error_log.apend(title, 'no link') writer.writerow(paper_dict) elif year in [2013, 2011, 2009, 2007]: # p group div_content = soup.find('div', {'id': 'content'}) papers_bar = tqdm(div_content.find_all(['h2', 'p', 'h3', 'h4'])) # papers_bar = div_content.find_all(['h2', 'p', 'h3', 'h4']) is_start = False this_group = '' this_group_v3 = '' this_group_v4 = '' for paper in papers_bar: if not is_start: if 'h2' == paper.name: # find 'content' if 'Contents' == paper.text or \ 'IJCAI-09 Contents' == paper.text or \ 'IJCAI-07 Contents' == paper.text: is_start = True else: if 'h3' == paper.name: # group this_group_v3 = slugify(paper.text) this_group = this_group_v3 elif 'h4' == paper.name: # group this_group_v4 = slugify(paper.text) this_group = this_group_v3 + '--' + this_group_v4 elif 'p' == paper.name: # paper try: all_as = paper.find_all('a') except: continue if len(all_as) >= 1: # paper paper_index += 1 is_get_link = False for a in all_as: if 'abstract' != slugify(a.text.strip()): title = slugify(a.text) link = a.get('href') is_get_link = True papers_bar.set_description( f'downloading paper {paper_index}: ' f'{title}') break if is_get_link: paper_dict = {'title': title, 'main link': link, 'group': this_group} else: paper_dict = {'title': title, 'main link': 'error', 'group': this_group} print(f'get link for {title}_{year} failed!') error_log.append((title, 'no link')) # papers_bar.set_description(f'downloading # paper {paper_index}: {title}') writer.writerow(paper_dict) elif year in [2005]: div_content = soup.find('div', {'id': 'content'}) papers_bar = tqdm(div_content.find_all(['p'])) this_group = '' for paper in papers_bar: try: paper_class = paper.get('class')[0] except: continue if 'docsection' == paper_class: # group this_group = slugify(paper.text) elif 'doctitle' == paper_class: # paper paper_index += 1 title = slugify(paper.a.text) link = paper.a.get('href') papers_bar.set_description( f'downloading paper {paper_index}: {title}') paper_dict = {'title': title, 'main link': link, 'group': this_group} writer.writerow(paper_dict) elif year in [2003]: div_content = soup.find('div', {'id': 'content'}) papers_bar = tqdm(div_content.find_all(['p'])) this_group = '' base_url = 'https://www.ijcai.org' for paper in papers_bar: try: this_group = slugify(paper.b.text) except: pass try: title = slugify(paper.a.text) link = base_url + paper.a.get('href') paper_index += 1 papers_bar.set_description( f'downloading paper {paper_index}: {title}') paper_dict = {'title': title, 'main link': link, 'group': this_group} writer.writerow(paper_dict) except: continue elif year in [2001]: div_content = soup.find('div', {'id': 'content'}) papers_bar = tqdm(div_content.find_all(['p'])) this_group = '' for paper in papers_bar: try: title = slugify(paper.a.text) link = paper.a.get('href') paper_index += 1 papers_bar.set_description( f'downloading paper {paper_index}: {title}') paper_dict = {'title': title, 'main link': link, 'group': this_group} writer.writerow(paper_dict) except: continue elif year in [1999, 1997, 1995, 1993, 1991, 1989, 1987, 1981, 1979, 1977, 1969]: # goup in capital in p.b.text div_content = soup.find('div', {'id': 'content'}) papers_bar = tqdm(div_content.find_all(['p'])) this_group = '' for paper in papers_bar: try: if paper.b.text.isupper(): # print(paper.b.text) this_group = slugify(paper.b.text) except: pass try: for a in paper.find_all('a'): title = slugify(a.text.strip()) link = a.get('href') if link[-3:] == 'pdf' and '' != title: paper_index += 1 papers_bar.set_description( f'downloading paper {paper_index}: {title}') paper_dict = {'title': title, 'main link': link, 'group': this_group} writer.writerow(paper_dict) break else: continue except: continue elif year in [1985, 1975, 1971]: # no group, paper in 'p' div_content = soup.find('div', {'id': 'content'}) papers_bar = tqdm(div_content.find_all(['p'])) this_group = '' for paper in papers_bar: try: for a in paper.find_all('a'): title = slugify(a.text.strip()) link = a.get('href') if link[-3:] == 'pdf' and '' != title: paper_index += 1 papers_bar.set_description( f'downloading paper {paper_index}: {title}') paper_dict = {'title': title, 'main link': link, 'group': this_group} writer.writerow(paper_dict) break else: continue except: continue elif year in [1983]: # goup in capital p.text div_content = soup.find('div', {'id': 'content'}) papers_bar = tqdm(div_content.find_all(['p'])) this_group = '' for paper in papers_bar: try: if paper.text.isupper(): this_group = slugify(paper.text) except: pass try: for a in paper.find_all('a'): title = slugify(a.text.strip()) link = a.get('href') if link[-3:] == 'pdf' and '' != title: paper_index += 1 papers_bar.set_description( f'downloading paper {paper_index}: {title}') paper_dict = {'title': title, 'main link': link, 'group': this_group} writer.writerow(paper_dict) break else: continue except: continue elif year in [1973]: # goup in p.b div_content = soup.find('div', {'id': 'content'}) papers_bar = tqdm(div_content.find_all(['p'])) this_group = '' for paper in papers_bar: try: if '' != paper.b.text.strip(): this_group = slugify(paper.b.text.strip()) except: pass try: for a in paper.find_all('a'): title = slugify(a.text.strip()) link = a.get('href') if link[-3:] == 'pdf' and '' != title: paper_index += 1 papers_bar.set_description( f'downloading paper {paper_index}: {title}') paper_dict = {'title': title, 'main link': link, 'group': this_group} writer.writerow(paper_dict) break else: continue except: continue # write error log print('write error log') log_file_pathname = os.path.join( project_root_folder, 'log', 'download_err_log.txt') with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: if e is not None: f.write(e) else: f.write('None') f.write('\n') f.write('\n') return paper_index if paper_index is not None else None def download_from_csv( year, save_dir, time_step_in_seconds=5, total_paper_number=None, downloader='IDM'): """ download all IJCAI paper given year :param year: int, IJCAI year, such 2019 :param save_dir: str, paper and supplement material's save path :param time_step_in_seconds: int, the interval time between two downlaod request in seconds :param total_paper_number: int, the total number of papers that is going to download :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder', default to 'IDM' :return: True """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) postfix = f'IJCAI_{year}' csv_filename = f'IJCAI_{year}.csv' csv_filename = os.path.join(project_root_folder, 'csv', csv_filename) csv_process.download_from_csv( postfix=postfix, save_dir=save_dir, csv_file_path=csv_filename, is_download_supplement=False, time_step_in_seconds=time_step_in_seconds, total_paper_number=total_paper_number, downloader=downloader ) if __name__ == '__main__': # for year in range(1993, 1968, -2): # print(year) # # save_csv(year) # # time.sleep(2) # download_from_csv(year, save_dir=f'..\\IJCAI_{year}', # time_step_in_seconds=1) year = 2024 # total_paper_number = 723 total_paper_number = save_csv(year) download_from_csv( year, save_dir=fr'E:\IJCAI_{year}', time_step_in_seconds=5, total_paper_number=total_paper_number, downloader=None) pass ================================================ FILE: code/paper_downloader_JMLR.py ================================================ """paper_downloader_JMLR.py""" import urllib from bs4 import BeautifulSoup import pickle import os from tqdm import tqdm from slugify import slugify import time import sys root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) sys.path.append(root_folder) from lib.downloader import Downloader from lib.my_request import urlopen_with_retry def download_paper( volumn, save_dir, time_step_in_seconds=5, downloader='IDM', url=None, is_use_url=False, refresh_paper_list=True): """ download all JMLR paper files given volumn and restore in save_dir respectively :param volumn: int, JMLR volumn, such as 2019 :param save_dir: str, paper and supplement material's saving path :param time_step_in_seconds: int, the interval time between two downlaod request in seconds :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder', default to 'IDM' :param url: None or str, None means to download volumn papers. :param is_use_url: bool, if to download papers from 'url'. url couldn't be None when is_use_url is True. :param refresh_paper_list: bool, if to refresh the saved paper list, default true, which means the "dat" file that contains the papers' information will be re-downloaded. :return: True """ downloader = Downloader(downloader=downloader) # create current dict title_list = [] # paper_dict = dict() project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'} if not is_use_url: init_url = f'http://jmlr.org/papers/v{volumn}/' postfix = f'JMLR_v{volumn}' dat_file_pathname = os.path.join( project_root_folder, 'urls', f'init_url_JMLR_v{volumn}.dat') if not refresh_paper_list and \ os.path.exists(dat_file_pathname): with open(dat_file_pathname, 'rb') as f: content = pickle.load(f) else: print('collecting papers from website...') content = urlopen_with_retry(url=init_url, headers=headers) # content = open(f'..\\JMLR_{volumn}.html', 'rb').read() with open(dat_file_pathname, 'wb') as f: pickle.dump(content, f) elif url is not None: content = urlopen_with_retry(url=url, headers=headers) postfix = f'JMLR' else: raise ValueError(''''url' could not be None when 'is_use_url'=True!!!''') # soup = BeautifulSoup(content, 'html.parser') soup = BeautifulSoup(content, 'html5lib') # soup = BeautifulSoup(open(r'..\JMLR_2011.html', 'rb'), 'html.parser') error_log = [] os.makedirs(save_dir, exist_ok=True) if (not is_use_url) and volumn <= 4: paper_list = soup.find('div', {'id': 'content'}).find_all('tr') else: paper_list = soup.find('div', {'id': 'content'}).find_all('dl') # num_download = 5 # number of papers to download num_download = len(paper_list) print(f'total papers counting: {num_download}, start downloading...') for paper in tqdm(zip(paper_list, range(num_download))): # get title this_paper = paper[0] title = slugify(this_paper.find('dt').text) title_list.append(title) this_paper_main_path = os.path.join(save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_')) if os.path.exists(this_paper_main_path): continue # get abstract page url links = this_paper.find_all('a') main_link = None for link in links: if '[pdf]' == link.text or 'pdf' == link.text: main_link = urllib.parse.urljoin('http://jmlr.org', link.get('href')) break # try 1 time # error_flag = False for d_iter in range(1): try: # download paper with IDM if not os.path.exists(this_paper_main_path) and main_link is not None: try: print('Downloading paper {}/{}: {}'.format(paper[1] + 1, num_download, title)) except: print(title.encode('utf8')) downloader.download( urls=main_link, save_path=this_paper_main_path, time_sleep_in_seconds=time_step_in_seconds ) except Exception as e: # error_flag = True print('Error: ' + title + ' - ' + str(e)) error_log.append((title, main_link, 'main paper download error', str(e))) # store the results # 1. store in the pickle file # with open(f'{postfix}_pre.dat', 'wb') as f: # pickle.dump(paper_dict, f) # 2. write error log print('write error log') log_file_pathname = os.path.join( project_root_folder, 'log', 'download_err_log.txt') with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: if e is not None: f.write(e) else: f.write('None') f.write('\n') f.write('\n') def download_special_topics_and_issues_paper(save_dir, time_step_in_seconds=5, downloader='IDM'): """ download all JMLR special topics and issues paper files given volumn and restore in save_dir respectively :param save_dir: str, paper and supplement material's saving path :param time_step_in_seconds: int, the interval time between two downlaod request in seconds :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder', default to 'IDM' :return: True """ homepage = 'https://www.jmlr.org/papers/' headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'} # postfix = f'JMLR_v{volumn}' content = urlopen_with_retry(url=homepage, headers=headers) soup = BeautifulSoup(content, 'html5lib') # soup = BeautifulSoup(open(r'..\JMLR_2011.html', 'rb'), 'html.parser') all_topics = soup.find('div', {'id': 'content'}).find_all(['h2', 'p']) is_topic = False is_issue = False for topic in all_topics: if 'h2' == topic.name and slugify(topic.text.strip()) == 'special-topics': is_topic = True elif 'h2' == topic.name: is_topic = False if 'special-issues' == slugify(topic.text.strip()): is_issue = True if is_topic and 'p' == topic.name: topic_name = slugify(topic.text.strip()) topic_url = urllib.parse.urljoin(homepage, topic.a.get('href')) # print(f'T: {topic_name} url:{topic_url}') print(f'processing special topic: {topic_name}') download_paper( volumn=1000, save_dir=os.path.join(save_dir, 'special-topics', topic_name), time_step_in_seconds=time_step_in_seconds, downloader=downloader, url=topic_url, is_use_url=True ) time.sleep(time_step_in_seconds) if is_issue and 'p' == topic.name: issue_name = slugify(topic.text.strip()) issue_url = urllib.parse.urljoin(homepage, topic.a.get('href')) # print(f'T: {issue_name} url:{issue_url}') print(f'processing special issue: {issue_name}') download_paper( volumn=1000, save_dir=os.path.join(save_dir, 'special-issues', issue_name), time_step_in_seconds=time_step_in_seconds, downloader=downloader, url=issue_url, is_use_url=True ) time.sleep(time_step_in_seconds) if __name__ == '__main__': volumn = 25 download_paper(volumn, rf'W:\all_papers\JMLR\JMLR_v{volumn}', time_step_in_seconds=3) # download_special_topics_and_issues_paper( # rf'Z:\all_papers\JMLR', time_step_in_seconds=3, downloader='IDM') pass ================================================ FILE: code/paper_downloader_NIPS.py ================================================ """paper_downloader_NIPS.py""" import urllib import time from bs4 import BeautifulSoup import pickle import os from tqdm import tqdm from slugify import slugify import csv import sys root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) sys.path.append(root_folder) from lib.supplement_porcess import move_main_and_supplement_2_one_directory from lib.downloader import Downloader from lib import csv_process from lib.openreview import download_nips_papers_given_url from lib.my_request import urlopen_with_retry def save_csv(year): """ write nips papers' and supplemental material's urls in one csv file :param year: int :return: num_download: int, the total number of papers. """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) csv_file_pathname = os.path.join( project_root_folder, 'csv', f'NIPS_{year}.csv' ) with open(csv_file_pathname, 'w', newline='') as csvfile: fieldnames = ['title', 'main link', 'supplemental link'] writer = csv.DictWriter(csvfile, fieldnames=fieldnames) writer.writeheader() headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) ' 'Gecko/20100101 Firefox/23.0'} init_url = f'https://proceedings.neurips.cc/paper/{year}' dat_file_pathname = os.path.join( project_root_folder, 'urls', f'init_url_nips_{year}.dat') if os.path.exists(dat_file_pathname): with open(dat_file_pathname, 'rb') as f: content = pickle.load(f) else: content = urlopen_with_retry(url=init_url, headers=headers) with open(dat_file_pathname, 'wb') as f: pickle.dump(content, f) soup = BeautifulSoup(content, 'html.parser') paper_list = soup.find( 'div', {'class': 'container-fluid'}).find_all('li') # num_download = 5 # number of papers to download num_download = len(paper_list) paper_list_bar = tqdm(zip(paper_list, range(num_download))) for paper in tqdm(zip(paper_list, range(num_download))): paper_dict = {'title': '', 'main link': '', 'supplemental link': ''} # get title # print('\n') this_paper = paper[0] title = slugify(this_paper.a.text) paper_dict['title'] = title # print('Downloading paper {}/{}: {}'.format( # paper[1] + 1, num_download, title)) paper_list_bar.set_description( 'Tracing paper {}/{}: {}'.format( paper[1] + 1, num_download, title)) # get abstract page url url2 = this_paper.a.get('href') abs_url = urllib.parse.urljoin(init_url, url2) abs_content = urlopen_with_retry(url=abs_url, headers=headers, raise_error_if_failed=False) if abs_content is not None: soup_temp = BeautifulSoup(abs_content, 'html.parser') # abstract = soup_temp.find( # 'p', {'class': 'abstract'}).text.strip() # paper_dict[title] = abstract all_a = soup_temp.findAll('a') for a in all_a: # print(a.text[:-2]) # print(a.text[:-2].strip().lower()) if 'paper' == a.text[:-2].strip().lower(): paper_dict['main link'] = urllib.parse.urljoin( abs_url, a.get('href')) elif 'supplemental' == a.text[:-2].strip().lower(): paper_dict['supplemental link'] = \ urllib.parse.urljoin(abs_url, a.get('href')) break else: print('Error: ' + title) if paper_dict['main link'] == '': paper_dict['main link'] = 'error' if paper_dict['supplemental link'] == '': paper_dict['supplemental link'] = 'error' writer.writerow(paper_dict) time.sleep(1) return num_download def download_from_csv( year, save_dir, is_download_mainpaper=True, is_download_supplement=True, time_step_in_seconds=5, total_paper_number=None, downloader='IDM'): """ download all NIPS paper and supplement files given year, restore in save_dir/main_paper and save_dir/supplement respectively :param year: int, NIPS year, such 2019 :param save_dir: str, paper and supplement material's save path :param is_download_mainpaper: boot, True for downloading main papers :param is_download_supplement: bool, True for downloading supplemental material :param time_step_in_seconds: int, the interval time between two download request in seconds :param total_paper_number: int, the total number of papers that is going to download :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder', default to 'IDM' :return: True """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) postfix = f'NIPS_{year}' csv_file_path = os.path.join(project_root_folder, 'csv', f'NIPS_{year}.csv') return csv_process.download_from_csv( postfix=postfix, save_dir=save_dir, csv_file_path=csv_file_path, is_download_supplement=is_download_supplement, time_step_in_seconds=time_step_in_seconds, total_paper_number=total_paper_number, downloader=downloader ) # def rename_supp( year, supp_dir): # """ # rename supplemental material # :param year: int, NIPS year, such 2019 # :param supp_dir: str, supplement material's save path # :return: True # """ # if not os.path.exists(supp_dir): # raise ValueError(f'''can't find path {supp_dir}''') # # postfix = f'NIPS_{year}' # with open(f'..\\csv\\NIPS_{year}.csv', newline='') as csvfile: # myreader = csv.DictReader(csvfile, delimiter=',') # pbar = tqdm(myreader) # for this_paper in pbar: # title = slugify(this_paper['title']) # this_paper_supp_path_no_ext = os.path.join( # supp_dir, f'{title}_{postfix}_supp.') # # if '' != this_paper['supplemental link']: # supp_ori_name = this_paper['supplemental link'].split('/')[-1] # supp_type = supp_ori_name.split('.')[-1] # if os.path.exists(os.path.join(supp_dir, supp_ori_name)) and \ # not os.path.exists( # this_paper_supp_path_no_ext + supp_type): # os.rename( # os.path.join(supp_dir, supp_ori_name), # this_paper_supp_path_no_ext + supp_type # ) # pbar.set_description(f'Renaming paper: {title}...') if __name__ == '__main__': year = 2024 # total_paper_number = 1899 # total_paper_number = save_csv(year) # download_from_csv( # year, f'..\\NIPS_{year}', # is_download_mainpaper=False, # is_download_supplement=True, # time_step_in_seconds=20, # total_paper_number=total_paper_number, # downloader='IDM') download_nips_papers_given_url( save_dir=rf'E:\NIPS_{year}', year=year, base_url=f'https://openreview.net/group?id=NeurIPS.cc/' f'{year}/Conference', time_step_in_seconds=10, # download_groups=['poster'], downloader='IDM') # move_main_and_supplement_2_one_directory( # main_path=rf'F:\workspace\python3_ws\paper_downloader-master\NIPS_{year}\main_paper', # supplement_path=rf'F:\workspace\python3_ws\paper_downloader-master\NIPS_{year}\supplement', # supp_pdf_save_path=rf'F:\workspace\python3_ws\paper_downloader-master\NIPS_{year}\supplement_pdf' # ) ================================================ FILE: code/paper_downloader_RSS.py ================================================ """paper_downloader_RSS.py 20240322""" import time import urllib from urllib.error import HTTPError from bs4 import BeautifulSoup import pickle import os from tqdm import tqdm from slugify import slugify import csv import sys from datetime import datetime root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) sys.path.append(root_folder) from lib import csv_process from lib.my_request import urlopen_with_retry def get_paper_pdf_link(abs_url): """get paper pdf link in the abstract url. For newest papers that have not been added to "https://www.roboticsproceedings.org/rss19/index.html" Args: abs_url (str): paper abstract page url. """ headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) ' 'Gecko/20100101 Firefox/23.0'} content = urlopen_with_retry(url=abs_url, headers=headers) soup = BeautifulSoup(content, 'html5lib') paper_pdf_div = soup.find('div', {'class': 'paper-pdf'}) paper_pdf_div = paper_pdf_div.find('a').get('href') return paper_pdf_div def save_csv(year): """ write RSS papers' urls in one csv file :param year: int, RSS year, such 2023 :return: peper_index: int, the total number of papers """ conference = "RSS" project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) csv_file_pathname = os.path.join( project_root_folder, 'csv', f'{conference}_{year}.csv' ) error_log = [] paper_index = 0 with open(csv_file_pathname, 'w', newline='') as csvfile: fieldnames = ['title', 'main link', 'supplemental link'] writer = csv.DictWriter(csvfile, fieldnames=fieldnames) writer.writeheader() is_from_proceed = True # True to get papaers from "https://www.roboticsproceedings.org" # False to get papers from "https://roboticsconference.org/" init_url = f'https://www.roboticsproceedings.org/rss' \ f'{year-2004 :0>2d}/index.html' # determine whether this year's papers had been added to # "https://www.roboticsproceedings.org" # If not, get papers from "https://roboticsconference.org/" try: headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) ' 'Gecko/20100101 Firefox/23.0'} req = urllib.request.Request(url=init_url, headers=headers) urllib.request.urlopen(req, timeout=20) except HTTPError as e: if e.code == 404: # not added current_year = datetime.now().year if year == current_year: init_url = f'https://roboticsconference.org/program/papers/' else: init_url = f'https://roboticsconference.org/{year}/program/papers/' is_from_proceed = False url_file_pathname = os.path.join( project_root_folder, 'urls', f'init_url_{conference}_{year}_' f'''{'proc' if is_from_proceed else 'conf'}.dat''' ) if os.path.exists(url_file_pathname): with open(url_file_pathname, 'rb') as f: content = pickle.load(f) else: headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) ' 'Gecko/20100101 Firefox/23.0'} content = urlopen_with_retry(url=init_url, headers=headers) with open(url_file_pathname, 'wb') as f: pickle.dump(content, f) soup = BeautifulSoup(content, 'html5lib') if is_from_proceed: paper_list = soup.find('div', {'class': 'content'}).find_all('tr') else: paper_list = soup.find('table', {'id': 'myTable'}).find_all('tr') paper_list_bar = tqdm(paper_list) paper_index = 0 title_index = 0 for i, paper in enumerate(paper_list_bar): paper_dict = {'title': '', 'main link': '', 'supplemental link': ''} # get title try: if not is_from_proceed and i == 0: # header fields = paper.find_all('th') fields = [f.text.lower() for f in fields] title_index = fields.index('title') tds = paper.find_all('td') if len(tds) < 2: # seperator continue if is_from_proceed: title = slugify(tds[0].a.text) main_link = tds[1].a.get('href') main_link = urllib.parse.urljoin(init_url, main_link) else: title = slugify(tds[title_index].a.text) abs_link = tds[title_index].a.get('href') abs_link = urllib.parse.urljoin(init_url, abs_link) main_link = get_paper_pdf_link(abs_link) paper_dict['title'] = title paper_dict['main link'] = main_link paper_index += 1 paper_list_bar.set_description_str( f'Collected paper {paper_index}: {title}') writer.writerow(paper_dict) csvfile.flush() # write to file immediately except Exception as e: print(f'Warning: {str(e)}') # write error log print('write error log') log_file_pathname = os.path.join( project_root_folder, 'log', 'download_err_log.txt' ) with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: if e is not None: f.write(e) else: f.write('None') f.write('\n') f.write('\n') return paper_index def download_from_csv( year, save_dir, time_step_in_seconds=5, total_paper_number=None, csv_filename=None, downloader='IDM', is_random_step=True, proxy_ip_port=None): """ download all RSS paper given year :param year: int, RSS year, such as 2019 :param save_dir: str, paper and supplement material's save path :param time_step_in_seconds: int, the interval time between two download request in seconds :param total_paper_number: int, the total number of papers that is going to download :param csv_filename: None or str, the csv file's name, None means to use default setting :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder', default to 'IDM' :param is_random_step: bool, whether random sample the time step between two adjacent download requests. If True, the time step will be sampled from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds. Default: True. :param proxy_ip_port: str or None, proxy server ip address with or without protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890". Default: None :return: True """ conference = "RSS" postfix = f'{conference}_{year}' project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) csv_file_path = os.path.join( project_root_folder, 'csv', f'{conference}_{year}.csv' if csv_filename is None else csv_filename) csv_process.download_from_csv( postfix=postfix, save_dir=save_dir, csv_file_path=csv_file_path, is_download_supplement=False, time_step_in_seconds=time_step_in_seconds, total_paper_number=total_paper_number, downloader=downloader, is_random_step=is_random_step, proxy_ip_port=proxy_ip_port ) if __name__ == '__main__': year = 2025 total_paper_number = save_csv(year) # total_paper_number = 134 download_from_csv(year, save_dir=fr'E:\RSS\RSS_{year}', time_step_in_seconds=15, total_paper_number=total_paper_number) time.sleep(2) pass ================================================ FILE: lib/IDM.py ================================================ import subprocess import os import time import random def download(urls, save_path, time_sleep_in_seconds=5, is_random_step=True, verbose=False): """ download file from given urls and save it to given path :param urls: str, urls :param save_path: str, full path :param time_sleep_in_seconds: int, sleep seconds after call :param is_random_step: bool, whether random sample the time step between two adjacent download requests. If True, the time step will be sampled from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds. Default: True. :param verbose: bool, whether to display time step information. Default: False :return: None """ idm_path = '"C:\Program Files (x86)\Internet Download Manager\IDMan.exe"' # should replace by the local IDM path basic_command = [idm_path, '/d', 'xxxx', '/p', 'xxx', '/f', 'xxxx', '/n'] head, tail = os.path.split(save_path) if '' != head: os.makedirs(head, exist_ok=True) basic_command[2] = urls basic_command[4] = head basic_command[6] = tail p = subprocess.Popen(' '.join(basic_command)) # p.wait() if is_random_step: time_sleep_in_seconds = random.uniform( 0.5 * time_sleep_in_seconds, 1.5 * time_sleep_in_seconds, ) if verbose: print(f'\t random sleep {time_sleep_in_seconds: .2f} seconds') time.sleep(time_sleep_in_seconds) ================================================ FILE: lib/__init__.py ================================================ ================================================ FILE: lib/arxiv.py ================================================ """ arxiv.py 20240218 """ from bs4 import BeautifulSoup from .my_request import urlopen_with_retry def get_pdf_link_from_arxiv(abs_link, is_use_mirror=False): headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) ' 'Gecko/20100101 Firefox/23.0'} mirror = 'cn.arxiv.org' if is_use_mirror: abs_link = abs_link.replace('arxiv.org', mirror) abs_content = urlopen_with_retry( url=abs_link, headers=headers, raise_error_if_failed=False) if abs_content is None: return None abs_soup = BeautifulSoup(abs_content, 'html.parser') pdf_link = 'http://arxiv.org' + abs_soup.find('div', { 'class': 'full-text'}).find('ul').find('a').get('href') if pdf_link[-3:] != 'pdf': pdf_link += '.pdf' if is_use_mirror: pdf_link = pdf_link.replace('arxiv.org', mirror) return pdf_link ================================================ FILE: lib/csv_process.py ================================================ """ csv_process.py 20210617 """ import os from tqdm import tqdm from slugify import slugify import csv from lib.downloader import Downloader def download_from_csv( postfix, save_dir, csv_file_path, is_download_main_paper=True, is_download_bib=True, is_download_supplement=True, time_step_in_seconds=5, total_paper_number=None, downloader='IDM', is_random_step=True, proxy_ip_port=None, max_length_filename=128 ): """ download paper, bibtex and supplement files and save them to save_dir/main_paper and save_dir/supplement respectively :param postfix: str, postfix that will be added at the end of papers' title :param save_dir: str, paper and supplement material's save path :param csv_file_path: str, the full path to csv file :param is_download_main_paper: bool, True for downloading main paper :param is_download_supplement: bool, True for downloading supplemental material :param time_step_in_seconds: int, the interval time between two downloading request in seconds :param total_paper_number: int, the total number of papers that is going to download :param downloader: str, the downloader to download, could be 'IDM' or None, default to 'IDM'. :param is_random_step: bool, whether random sample the time step between two adjacent download requests. If True, the time step will be sampled from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds. Default: True. :param proxy_ip_port: str or None, proxy server ip address with or without protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890". Default: None :param max_length_filename: int or None, max filen name length. All the files whose name length is not less than this will be renamed before saving, the others will stay unchanged. None means no limitation. Default: 128. :return: True """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) downloader = Downloader( downloader=downloader, is_random_step=is_random_step, proxy_ip_port=proxy_ip_port) if not os.path.exists(csv_file_path): raise ValueError(f'ERROR: file not found in {csv_file_path}!!!') main_save_path = os.path.join(save_dir, 'main_paper') if is_download_main_paper: os.makedirs(main_save_path, exist_ok=True) if is_download_supplement: supplement_save_path = os.path.join(save_dir, 'supplement') os.makedirs(supplement_save_path, exist_ok=True) error_log = [] with open(csv_file_path, newline='') as csvfile: myreader = csv.DictReader(csvfile, delimiter=',') pbar = tqdm(myreader, total=total_paper_number) i = 0 for this_paper in pbar: is_download_bib &= ('bib' in this_paper) is_grouped = ('group' in this_paper) i += 1 # get title if is_grouped: group = slugify(this_paper['group']) title = slugify(this_paper['title']) title_main_pdf = short_name( name=f'{title}_{postfix}.pdf', max_length=max_length_filename ) if total_paper_number is not None: pbar.set_description( f'Downloading {postfix} paper {i} /{total_paper_number}') else: pbar.set_description(f'Downloading {postfix} paper {i}') this_paper_main_path = os.path.join( main_save_path, title_main_pdf) if is_grouped: this_paper_main_path = os.path.join( main_save_path, group, title_main_pdf) if is_download_supplement: this_paper_supp_title_no_ext = short_name( name=f'{title}_{postfix}_supp.', max_length=max_length_filename-3 # zip or pdf, so 3 ) this_paper_supp_path_no_ext = os.path.join( supplement_save_path, this_paper_supp_title_no_ext) if is_grouped: this_paper_supp_path_no_ext = os.path.join( supplement_save_path, group, this_paper_supp_title_no_ext ) if '' != this_paper['supplemental link'] and os.path.exists( this_paper_main_path) and \ (os.path.exists( this_paper_supp_path_no_ext + 'zip') or os.path.exists( this_paper_supp_path_no_ext + 'pdf')): continue elif '' == this_paper['supplemental link'] and \ os.path.exists(this_paper_main_path): continue elif os.path.exists(this_paper_main_path): continue if 'error' == this_paper['main link']: error_log.append((title, 'no MAIN link')) elif '' != this_paper['main link']: if is_grouped: if is_download_main_paper: os.makedirs(os.path.join(main_save_path, group), exist_ok=True) if is_download_supplement: os.makedirs(os.path.join(supplement_save_path, group), exist_ok=True) if is_download_main_paper: try: # download paper with IDM if not os.path.exists(this_paper_main_path): downloader.download( urls=this_paper['main link'].replace( ' ', '%20'), save_path=os.path.join( os.getcwd(), this_paper_main_path), time_sleep_in_seconds=time_step_in_seconds ) except Exception as e: # error_flag = True print('Error: ' + title + ' - ' + str(e)) error_log.append((title, this_paper['main link'], 'main paper download error', str(e))) # download supp if is_download_supplement: # check whether the supp can be downloaded if not (os.path.exists( this_paper_supp_path_no_ext + 'zip') or os.path.exists( this_paper_supp_path_no_ext + 'pdf')): if 'error' == this_paper['supplemental link']: error_log.append((title, 'no SUPPLEMENTAL link')) elif '' != this_paper['supplemental link']: supp_type = \ this_paper['supplemental link'].split('.')[-1] try: downloader.download( urls=this_paper['supplemental link'], save_path=os.path.join( os.getcwd(), this_paper_supp_path_no_ext + supp_type), time_sleep_in_seconds=time_step_in_seconds ) except Exception as e: # error_flag = True print('Error: ' + title + ' - ' + str(e)) error_log.append((title, this_paper[ 'supplemental link'], 'supplement download error', str(e))) # download bibtex file if is_download_bib: bib_path = this_paper_main_path[:-3] + 'bib' if not os.path.exists(bib_path): if 'error' == this_paper['bib']: error_log.append((title, 'no bibtex link')) elif '' != this_paper['bib']: try: downloader.download( urls=this_paper['bib'], save_path=os.path.join(os.getcwd(), bib_path), time_sleep_in_seconds=time_step_in_seconds ) except Exception as e: # error_flag = True print('Error: ' + title + ' - ' + str(e)) error_log.append((title, this_paper['bib'], 'bibtex download error', str(e))) # 2. write error log print('write error log') log_file_pathname = os.path.join( project_root_folder, 'log', 'download_err_log.txt' ) with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: if e is not None: f.write(e) else: f.write('None') f.write('\n') f.write('\n') return True def short_name(name, max_length, verbose=False): """ rename to shorter name Args: name (str): original name max_length (int): max filen name length. All the files whose name length is not less than this will be renamed before saving, the others will stay unchanged. None means no limitation. verbose (bool): whether to print debug information. Default: False. Returns: new_name (str): short name. """ if len(name) < max_length: new_name = name else: # rename try: [title, postfix] = name.split('_', 1) # only split to 2 parts new_title = title[:max_length - len(postfix) - 2] new_name = f'{new_title}_{postfix}' if verbose: print(f'\nrenaming {name} \n\t-> {new_name}') except ValueError: # ValueError: not enough values to unpack (expected 2, got 1) if verbose: print(f'\nWARNING!!!:\n\tunable to parse postfix from {name}') print('\tSo, it will be just rename to short name') ext = os.path.splitext(name)[1] new_title = name[:max_length - len(ext) - 1] new_name = f'{new_title}{ext}' if verbose: print(f'\nrenaming {name} \n\t-> {new_name}') return new_name ================================================ FILE: lib/cvf.py ================================================ """ cvf.py 20210617 """ import urllib from bs4 import BeautifulSoup from tqdm import tqdm from slugify import slugify from .my_request import urlopen_with_retry def get_paper_dict_list(url=None, content=None, group_name=None, timeout=10): """ parse papers' title, link, supp link from content, and save in a list contains dictionaries with key "title", "main link", "supplemental link" and "group"(optional, if group_name is not None), :param url: str or None, url :param content: None of object return by urlopen :param group_name: str or None, the group name of the papers in given content :param timeout: int, the timeout value for open url, default to 10 :return: paper_dict_list, list of dictionaries, that contains the dictionaries of papers with key "title", "main link", "supplemental link" and "group"(optional, if group_name is not None) content, object return by urlopen """ if url is None and content is None: raise ValueError('''one of "url" and "content" should be provide!!!''') paper_dict_list = [] paper_dict = {'title': '', 'main link': '', 'supplemental link': '', 'arxiv': ''} if group_name is None else \ {'group': group_name, 'title': '', 'main link': '', 'supplemental link': '', 'arxiv': ''} if content is None: headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'} content = urlopen_with_retry(url=url, headers=headers) soup = BeautifulSoup(content, 'html5lib') paper_list_bar = tqdm(soup.find('div', {'id': 'content'}).find_all(['dd', 'dt'])) paper_index = 0 for paper in paper_list_bar: is_new_paper = False # get title try: if 'dt' == paper.name and 'ptitle' == paper.get('class')[0]: # title: title = slugify(paper.text.strip()) paper_dict['title'] = title paper_index += 1 paper_list_bar.set_description_str(f'Collecting paper {paper_index}: {title}') elif 'dd' == paper.name: all_as = paper.find_all('a') for a in all_as: if 'pdf' == slugify(a.text.strip()): main_link = urllib.parse.urljoin(url, a.get('href')) paper_dict['main link'] = main_link is_new_paper = True elif 'supp' == slugify(a.text.strip()): supp_link = urllib.parse.urljoin(url, a.get('href')) paper_dict['supplemental link'] = supp_link elif 'arxiv' == slugify(a.text.strip()): arxiv = urllib.parse.urljoin(url, a.get('href')) paper_dict['arxiv'] = arxiv break except Exception as e: print(f'Warning: {str(e)}') if is_new_paper: paper_dict_list.append(paper_dict.copy()) paper_dict['title'] = '' paper_dict['main link'] = '' paper_dict['supplemental link'] = '' paper_dict['arxiv'] = '' return paper_dict_list, content ================================================ FILE: lib/downloader.py ================================================ """ downloader.py 20210624 """ import time from lib import IDM import requests import os import random from tqdm import tqdm from threading import Thread from lib.proxy import get_proxy_4_requests def _download(urls, save_path, time_sleep_in_seconds=5, is_random_step=True, verbose=False, proxy_ip_port=None): """ download file from given urls and save it to given path :param urls: str, urls :param save_path: str, full path :param time_sleep_in_seconds: int, sleep seconds after call :param is_random_step: bool, whether random sample the time step between two adjacent download requests. If True, the time step will be sampled from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds. Default: True. :param verbose: bool, whether to display time step information. Default: False :param proxy_ip_port: str or None, proxy server ip address with or without protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890". :return: None """ def __download(urls, save_path, proxy_ip_port): head, tail = os.path.split(save_path) # debug # print(f'downloading {tail}') proxies = get_proxy_4_requests(proxy_ip_port) r = requests.get(urls, stream=True, proxies=proxies) # file size in MB length = round(int(r.headers['content-length']) / 1024**2, 2) process_bar = tqdm( colour='blue', total=length, unit='MB',desc=tail, initial=0) if '' != head: os.makedirs(head, exist_ok=True) for part in r.iter_content(1024 ** 2): process_bar.update(1) with open(save_path, 'ab') as file: file.write(part) r.close() # set daemon as False to continue downloading even if the main threading # has been killed due to KeyboardInterrupt t = Thread( target=__download, args=(urls, save_path, proxy_ip_port), daemon=False) t.start() if is_random_step: time_sleep_in_seconds = random.uniform( 0.5 * time_sleep_in_seconds, 1.5 * time_sleep_in_seconds, ) if verbose: print(f'\t random sleep {time_sleep_in_seconds: .2f} seconds') time.sleep(time_sleep_in_seconds) class Downloader(object): def __init__(self, downloader=None, is_random_step=True, proxy_ip_port=None): """ :param downloader: None or str, the downloader's name. if downloader is None, 'request' will be used to download files; if downloader is 'IDM', the "Internet Downloader Manager" will be used to download files; or a ValueError will be raised. :param is_random_step: bool, whether random sample the time step between two adjacent download requests. If True, the time step will be sampled from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds. Default: True. :param proxy_ip_port: str or None, proxy server ip address with or without protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890". (only useful for None|"request" downloader) Default: None """ super(Downloader, self).__init__() if downloader is not None and downloader.lower() not in ['idm']: raise ValueError( f'''ERROR: Unsupported downloader: {downloader}, ''' f'''we currently only support''' f''' None (means python's requests) or "IDM" ''' ) self.downloader = downloader self.is_random_step = is_random_step self.proxy_ip_port = proxy_ip_port def download(self, urls, save_path, time_sleep_in_seconds=5): """ download file from given urls and save it to given path :param urls: str, urls :param save_path: str, full path :param time_sleep_in_seconds: int, sleep seconds after call :return: None """ if self.downloader is None: _download( urls=urls, save_path=save_path, time_sleep_in_seconds=time_sleep_in_seconds, is_random_step=self.is_random_step, proxy_ip_port=self.proxy_ip_port ) elif self.downloader.lower() == 'idm': IDM.download( urls=urls, save_path=save_path, time_sleep_in_seconds=time_sleep_in_seconds, is_random_step=self.is_random_step ) ================================================ FILE: lib/my_request.py ================================================ """ my_request.py 20240412 """ import urllib import random from urllib.error import URLError, HTTPError from lib.proxy import set_proxy_4_urllib_request def urlopen_with_retry(url, headers=dict(), retry_time=3, time_out=20, raise_error_if_failed=True, proxy_ip_port=None): """ load content from url with given headers. Retry if error occurs. Args: url (str): url. headers (dict): request headers. Default: {}. retry_time (int): max retry time. Default: 3. time_out (int): time out in seconds. Default: 10. raise_error_if_failed (bool): whether to raise error if failed. Default: True. proxy_ip_port(str|None): proxy server ip address with or without protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890". Default: None Returns: content(str|None): url content. None will be returned if failed. """ set_proxy_4_urllib_request(proxy_ip_port) req = urllib.request.Request(url=url, headers=headers) for r in range(retry_time): try: content = urllib.request.urlopen(req, timeout=time_out).read() return content except HTTPError as e: print('The server couldn\'t fulfill the request.') print('Error code: ', e.code) s = random.randint(3, 7) print(f'random sleeping {s} seconds and doing {r + 1}/{retry_time}' f'-th retrying...') except URLError as e: print('We failed to reach a server.') print('Reason: ', e.reason) s = random.randint(3, 7) print(f'random sleeping {s} seconds and doing {r + 1}/{retry_time}' f'-th retrying...') if raise_error_if_failed: raise ValueError(f'Failed to open {url} after trying {retry_time} ' f'times!') else: return None ================================================ FILE: lib/openreview.py ================================================ """ openreview.py 20230104 """ import time from tqdm import tqdm from selenium import webdriver from selenium.webdriver import ActionChains from selenium.webdriver.chrome.options import Options from selenium.webdriver.chrome.service import Service from selenium.webdriver.support.ui import WebDriverWait from selenium.webdriver.common.by import By from selenium.webdriver.support import expected_conditions as EC from webdriver_manager.chrome import ChromeDriverManager from selenium.webdriver.common.keys import Keys from selenium.common.exceptions import NoSuchElementException from selenium.common.exceptions import StaleElementReferenceException import os # https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename from slugify import slugify from lib.downloader import Downloader from lib.proxy import get_proxy import urllib from lib.arxiv import get_pdf_link_from_arxiv def get_driver(proxy_ip_port=None): # driver = webdriver.Chrome(driver_path) capabilities = webdriver.DesiredCapabilities.CHROME if proxy_ip_port is not None: proxy = get_proxy(proxy_ip_port) proxy.add_to_capabilities(capabilities) # https://stackoverflow.com/a/78797164 chrome_install = ChromeDriverManager().install() folder = os.path.dirname(chrome_install) chromedriver_path = os.path.join(folder, "chromedriver.exe") driver = webdriver.Chrome( service=Service(executable_path=chromedriver_path), desired_capabilities=capabilities) return driver def __download_papers_given_divs(driver, divs, save_dir, paper_postfix, time_step_in_seconds=10, downloader='IDM', proxy_ip_port=None): error_log = [] downloader = Downloader(downloader=downloader, proxy_ip_port=proxy_ip_port) # scroll to top of page # https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium driver.find_element(By.TAG_NAME, 'body').send_keys( Keys.CONTROL + Keys.HOME) time.sleep(0.3) # titles = [d.text for d in divs] titles = [] for d in divs: for i in range(3): # temp workaround try: titles.append(d.text) break except Exception as e: if i == 2: print(f'\tget Exception: {str(e.msg)}') time.sleep(0.3) valid_divs = [] for i, t in enumerate(titles): if len(t): valid_divs.append(divs[i]) num_papers = len(valid_divs) print('found number of papers:', num_papers) name = None for index, paper in enumerate(valid_divs): is_get_paper = False try: a_hrefs = paper.find_elements(By.TAG_NAME, "a") name = slugify(a_hrefs[0].text.strip()) if a_hrefs[1].get_attribute('class') == 'pdf-link': # has pdf button link = a_hrefs[1].get_attribute('href') link = urllib.parse.urljoin('https://openreview.net', link) else: # raise ValueError('pdf link not found!') print('\tWarning: pdf link not found, skip this download...') if name is not None: error_log.append((name, str(index))) else: error_log.append((str(index), str(index))) continue # TODO: find pdf link in paper abstract page if name == '': continue is_get_paper = True except Exception as e: print(f'\tget Exception: {str(e.msg)}') print('\tskip this download...') if name is not None: error_log.append((name, str(index))) else: error_log.append((str(index), str(index))) if not is_get_paper: continue # name = slugify(paper.find_element_by_class_name('note_content_title').text) # link = paper.find_element_by_class_name('note_content_pdf').get_attribute('href') pdf_name = name + '_' + paper_postfix + '.pdf' if not os.path.exists(os.path.join(save_dir, pdf_name)): print('Downloading paper {}/{}: {}'.format(index + 1, num_papers, name)) # get pdf link of arxiv if the original link is on arxiv.org if "arxiv.org/abs" in link: link = get_pdf_link_from_arxiv(abs_link=link) # try 1 times success_flag = False for d_iter in range(1): try: downloader.download( urls=link, save_path=os.path.join(save_dir, pdf_name), time_sleep_in_seconds=time_step_in_seconds ) success_flag = True break except Exception as e: print('Error: ' + name + ' - ' + str(e)) if not success_flag: error_log.append((name, link)) return error_log, num_papers def __get_into_pages_given_number(driver, page_number, pages, wait_fn, condition=None): wait_fn(driver, condition) for page in pages: if page.text.isnumeric() and int(page.text) == page_number: page_link = page.find_element(By.TAG_NAME, "a") page_link.click() wait_fn(driver, condition) return page return None def download_nips_papers_given_url( save_dir, year, base_url, conference='NIPS', start_page=1, time_step_in_seconds=10, download_groups='all', downloader='IDM', proxy_ip_port=None): """ download NeurIPS papers from the given web url. :param save_dir: str, paper save path :type save_dir: str :param year: int, iclr year, current only support year >= 2018 :type year: int :param base_url: str, paper website url :type base_url: str :param conference: str, conference name, such as NIPS. :param start_page: int, the initial downloading webpage number, only the pages whose number is equal to or greater than this number will be processed. :param time_step_in_seconds: int, the interval time between two downlaod request in seconds :param groups: group name, such as 'oral', 'spotlight', 'poster'. Default: 'all'. :type download_groups: str | list[str] :param downloader: str, the downloader to download, could be 'IDM' or None, default to 'IDM' :param proxy_ip_port: str or None, proxy server ip address with or without protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890". (only useful for None|"request" downloader and webdriver) Default: None :return: """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) if year < 2023: sub_xpath = '''id="accepted-papers"''' else: sub_xpath = '''class="submissions-list"''' def mywait(driver, condition=None): # wait for the select element to become visible # print('Starting web driver wait...') # ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,) # wait = WebDriverWait(driver, 20, ignored_exceptions=ignored_exceptions) wait = WebDriverWait(driver, 20) # print('Starting web driver wait... finished') # res = wait.until(EC.presence_of_element_located((By.ID, "notes"))) # print("Successful load the website!->", res) # res = wait.until( # EC.presence_of_element_located((By.CLASS_NAME, "note"))) res = wait.until( EC.presence_of_element_located((By.ID, "notes"))) # print("Successful load the website notes!->", res) res = wait.until(EC.presence_of_element_located( (By.XPATH, f'''//*[@{sub_xpath}]/nav'''))) # print("Successful load the website pagination!->", res) time.sleep(2) # seconds, workaround for bugs def find_divs_of_papers(): if year < 2023: divs = driver.find_element(By.ID, group_id). \ find_elements(By.CLASS_NAME, 'note ') else: # divs = driver.find_element(By.ID, group_id). \ # find_elements(By.XPATH, '//*[@class="note undefined"]') divs = driver.find_element(By.ID, group_id).find_elements( By.XPATH, '//*[contains(@class, "note") and contains(@class, "undefined")]' ) return divs paper_postfix = f'{conference}_{year}' error_log = [] driver = get_driver(proxy_ip_port=proxy_ip_port) driver.get(base_url) if not os.path.exists(save_dir): os.makedirs(save_dir) mywait(driver) # pages = driver.find_elements_by_xpath('//*[@id="accepted-papers"]/nav/ul/li') # download grouped papers, such as "Accepted Papaers" for year before 2023 # "Accept (oral)", "Accept (spotlight)", "Accept (poster)" for year 2023 groups = driver.find_elements( By.XPATH, f'//*[@id="notes"]/div/div[1]/ul/li') accept_groups = [] for g in groups: if 'accept' in g.text.lower(): # whether download this group is_download_group = True if not 'all' == download_groups: is_download_group = False for dg in download_groups: if dg.lower() in g.text.lower(): is_download_group = True break if is_download_group: accept_groups.append(g) group_name = None group_save_dir = save_dir for ag in accept_groups: group_name = slugify(ag.text) group_save_dir = os.path.join(save_dir, group_name) print(f'Downloading {group_name}...') os.makedirs(group_save_dir, exist_ok=True) number_paper_group = 0 accept_group_link = ag.find_element(By.TAG_NAME, "a") # group_id = accept_group_link.get_attribute('aria-controls') group_id = accept_group_link.get_attribute('href').split('#')[-1] # scroll to top of page, if not at top, the click action not work # https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium driver.find_element(By.TAG_NAME, 'body').send_keys( Keys.CONTROL + Keys.HOME) time.sleep(0.2) accept_group_link.click() mywait(driver) pages = driver.find_elements( By.XPATH, f'//*[@{sub_xpath}]/nav[1]/ul/li') page_str_list = get_pages_str(pages) # print(f'Current page navigation bar:\n{page_str_list}') current_page = 1 ind_page = 2 # 0 << ; 1 < # << | < | 1, 2, 3, ... | > | >> total_pages_number = get_max_page_number(page_str_list) last_total_pages = total_pages_number # get into start pages while current_page < start_page: if total_pages_number < start_page: # flip pages until seeing the start page current_page = total_pages_number __get_into_pages_given_number( driver=driver, page_number=current_page, pages=pages, wait_fn=mywait) print(f'getting into web page {current_page}...') # res = wait.until(EC.presence_of_element_located( # (By.XPATH, '//*[@id="accepted-papers"]/ul/li/h4/a'))) # res = wait.until(EC.presence_of_element_located( # (By.XPATH, '''//*[@id="accepted-papers"]/nav'''))) mywait(driver) # print("Successful load the website pagination!->", res) # pages = driver.find_elements_by_xpath('//*[@id="accepted-papers"]/nav/ul/li') pages = pages = driver.find_elements( By.XPATH, f'//*[@{sub_xpath}]/nav[1]/ul/li') page_str_list = get_pages_str(pages) total_pages_number = get_max_page_number(page_str_list) # # print(f'Current page navigation bar:\n{page_str_list}') if total_pages_number == last_total_pages: # total page remain unchanged after reload print(f'reached last({total_pages_number}-th) webpage') # when get the last page, but the page number is till less than start page, so # the start page doesn't exist. PRINT ERROR and return print(f'ERROR: THE {start_page}-th webpage not found!') return else: current_page = start_page page = __get_into_pages_given_number( driver=driver, page_number=current_page, pages=pages, wait_fn=mywait) while current_page <= total_pages_number: if page is None: break print(f'downloading papers in page: {current_page}') mywait(driver) # divs = driver.find_elements_by_xpath('//*[@id="accepted-papers"]/ul/li') # divs = driver.find_elements(By.XPATH, '//*[@id="accepted-papers"]/ul/li') divs = find_divs_of_papers() # temp workaround repeat_times = 3 is_find_paper = False for r in range(repeat_times): try: a_hrefs = divs[0].find_elements(By.TAG_NAME, "a") name = slugify(a_hrefs[0].text.strip()) link = a_hrefs[1].get_attribute('href') a_hrefs = divs[-1].find_elements(By.TAG_NAME, "a") name = slugify(a_hrefs[0].text.strip()) link = a_hrefs[1].get_attribute('href') is_find_paper = True break except Exception as e: if (r + 1) < repeat_times: print(f'\terror occurre: {str(e)}') print(f'\tsleep {(r + 1) * 5} seconds...') time.sleep((r + 1) * 5) print(f'{r + 1}-th reloading page') divs = find_divs_of_papers() else: print('\tskip this page.') if not is_find_paper: continue # time.sleep(time_step_in_seconds) this_error_log, this_number_paper = __download_papers_given_divs( driver=driver, divs=divs, save_dir=group_save_dir, paper_postfix=paper_postfix, time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port ) for e in this_error_log: error_log.append(e) number_paper_group += this_number_paper # get into next page current_page += 1 # pages = driver.find_elements_by_xpath('//*[@id="accepted-papers"]/nav/ul/li') pages = driver.find_elements( By.XPATH, f'//*[@{sub_xpath}]/nav[1]/ul/li') page_str_list = get_pages_str(pages) total_pages_number = get_max_page_number(page_str_list) # print(f'Current page navigation bar:\n{page_str_list}') # if we do not reread the pages, all the pages will be not available with an exception: # selenium.common.exceptions.StaleElementReferenceException: # Message: stale element reference: element is not attached to the page document page = __get_into_pages_given_number(driver=driver, page_number=current_page, pages=pages, wait_fn=mywait) # display total number of papers print(f'number of papers in {group_name}: {number_paper_group}') driver.quit() # 2. write error log print('write error log') log_file_pathname = os.path.join( project_root_folder, 'log', 'download_err_log.txt' ) with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: f.write(e) f.write('\n') f.write('\n') def download_iclr_papers_given_url_and_group_id( save_dir, year, base_url, group_id, conference='ICLR', start_page=1, time_step_in_seconds=10, downloader='IDM', proxy_ip_port=None, is_have_pages=True, is_need_click_group_button=False): """ downlaod ICLR papers for the given web url and the paper group id :param save_dir: str, paper save path :type save_dir: str :param year: int, iclr year, current only support year >= 2018 :type year: int :param base_url: str, paper website url :type base_url: str :param group_id: str, paper group id, such as "notable-top-5-", "notable-top-25-", "poster", "oral-submissions", "spotlight-submissions", "poster-submissions", etc. :type group_id: str :param conference: str, conference name, such as ICLR. Default: ICLR :param start_page: int, the initial downloading webpage number, only the pages whose number is equal to or greater than this number will be processed. Default: 1 :param time_step_in_seconds: int, the interval time between two download request in seconds. Default: 10 :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder'. Default: 'IDM' :param proxy_ip_port: str or None, proxy ip address and port, eg. eg: "127.0.0.1:7890". Only useful for webdriver and request downloader (downloader=None). Default: None. :type proxy_ip_port: str | None :param is_have_pages: bool, is there pages in webpage. Default: True. :type is_have_pages: bool :param is_need_click_group_button: bool, is there need to click the group button in webpage. For some years, for example 2018, the navigation part "#xxxxx" in base url will not work. And it should be clicked before reading content from webpage. Default: False. :type is_need_click_group_button: bool :return: """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) def _get_pages_xpath(year): if year <= 2023: xpath = f'''//*[@id="{group_id}"]/nav/ul/li''' else: xpath = f'''//*[@id="{group_id}"]/div/div/nav/ul/li''' return xpath def mywait(driver, condition=None): # wait for the select element to become visible # print('Starting web driver wait...') # ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,) # wait = WebDriverWait(driver, 20, ignored_exceptions=ignored_exceptions) wait = WebDriverWait(driver, 20) # print('Starting web driver wait... finished') # res = wait.until(EC.presence_of_element_located((By.ID, "notes"))) # print("Successful load the website!->", res) if year <= 2023: res = wait.until( EC.presence_of_element_located((By.CLASS_NAME, "note"))) # print("Successful load the website notes!->", res) # res = wait.until(EC.presence_of_element_located( # (By.XPATH, f'''//*[@id="{group_id}"]/nav'''))) if is_have_pages: # scroll to bottom of page # https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium driver.find_element(By.TAG_NAME, 'body').send_keys( Keys.CONTROL + Keys.END) if year <= 2023: wait.until(EC.element_to_be_clickable( (By.XPATH, f'{_get_pages_xpath(year)}[3]/a'))) else: wait.until(EC.element_to_be_clickable( (By.XPATH, f'{_get_pages_xpath(year)}[3]/a'))) # print("Successful load the website pagination!->", res) time.sleep(2) # seconds, workaround for bugs paper_postfix = f'{conference}_{year}' error_log = [] driver = get_driver(proxy_ip_port=proxy_ip_port) driver.get(base_url) if not os.path.exists(save_dir): os.makedirs(save_dir) if is_need_click_group_button: archive_is_have_pages = is_have_pages is_have_pages = False mywait(driver) aria_controls = base_url.split('#')[-1] # scroll to home of page driver.find_element(By.TAG_NAME, 'body').send_keys( Keys.CONTROL + Keys.HOME) group_button = driver.find_element( By.XPATH, f"""//a[@aria-controls="{aria_controls}"]""" ) group_button.click() is_have_pages = archive_is_have_pages mywait(driver) if is_have_pages: pages = driver.find_elements(By.XPATH, _get_pages_xpath(year)) current_page = 1 ind_page = 2 # 0 << ; 1 < total_pages_number = int(pages[-3].text) # << | < | 1, 2, 3, ... | > | >> last_total_pages = total_pages_number # get into start pages while current_page < start_page: # flip pages until seeing the start page if total_pages_number < start_page: current_page = total_pages_number __get_into_pages_given_number( driver=driver, page_number=current_page, pages=pages, wait_fn=mywait) print(f'getting into web page {current_page}...') # res = wait.until(EC.presence_of_element_located( # (By.XPATH, f'//*[@id="{group_id}"]/ul/li/h4/a'))) # res = wait.until(EC.presence_of_element_located( # (By.XPATH, f'''//*[@id="{group_id}"]/nav'''))) mywait(driver) # print("Successful load the website pagination!->", res) pages = driver.find_elements( By.XPATH, _get_pages_xpath(year)) total_pages_number = int(pages[-3].text) # total page remain unchanged after reload if total_pages_number == last_total_pages: print(f'reached last({total_pages_number}-th) webpage') # when get the last page, but the page number is till # less than start page, so the start page doesn't exist. # PRINT ERROR and return print(f'ERROR: THE {start_page}-th webpage not found!') return else: current_page = start_page page = __get_into_pages_given_number( driver=driver, page_number=current_page, pages=pages, wait_fn=mywait) while current_page <= total_pages_number: if page is None: break print(f'downloading {group_id} papers in page: {current_page}') mywait(driver) divs = driver.find_element(By.ID, group_id). \ find_elements(By.CLASS_NAME, 'note ') # temp workaround repeat_times = 3 is_find_paper = False for r in range(repeat_times): try: a_hrefs = divs[0].find_elements(By.TAG_NAME, "a") name = slugify(a_hrefs[0].text.strip()) link = a_hrefs[1].get_attribute('href') a_hrefs = divs[-1].find_elements(By.TAG_NAME, "a") name = slugify(a_hrefs[0].text.strip()) link = a_hrefs[1].get_attribute('href') is_find_paper = True break except Exception as e: if (r + 1) < repeat_times: print(f'\terror occurre: {str(e.msg)}') print(f'\tsleep {(r + 1) * 5} seconds...') time.sleep((r + 1) * 5) print(f'{r + 1}-th reloading page') divs = driver.find_element(By.ID, group_id). \ find_elements(By.CLASS_NAME, 'note ') else: print('\tskip this page.') if not is_find_paper: continue # time.sleep(time_step_in_seconds) this_error_log, this_number_paper = __download_papers_given_divs( driver=driver, divs=divs, save_dir=save_dir, paper_postfix=paper_postfix, time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port ) for e in this_error_log: error_log.append(e) # get into next page current_page += 1 pages = driver.find_elements( By.XPATH, _get_pages_xpath(year)) total_pages_number = int(pages[-3].text) # if we do not reread the pages, all the pages will be not available # with an exception: # selenium.common.exceptions.StaleElementReferenceException: # Message: stale element reference: element is not attached to the # page document page = __get_into_pages_given_number( driver=driver, page_number=current_page, pages=pages, wait_fn=mywait) else: # no pages divs = driver.find_element(By.ID, group_id). \ find_elements(By.CLASS_NAME, 'note ') # temp workaround repeat_times = 3 is_find_paper = False for r in range(repeat_times): try: a_hrefs = divs[0].find_elements(By.TAG_NAME, "a") name = slugify(a_hrefs[0].text.strip()) link = a_hrefs[1].get_attribute('href') a_hrefs = divs[-1].find_elements(By.TAG_NAME, "a") name = slugify(a_hrefs[0].text.strip()) link = a_hrefs[1].get_attribute('href') is_find_paper = True break except Exception as e: if (r + 1) < repeat_times: print(f'\terror occurre: {str(e.msg)}') print(f'\tsleep {(r + 1) * 5} seconds...') time.sleep((r + 1) * 5) print(f'{r + 1}-th reloading page') divs = driver.find_element(By.ID, group_id). \ find_elements(By.CLASS_NAME, 'note ') else: print('\tskipped!!!') if is_find_paper: # time.sleep(time_step_in_seconds) this_error_log, this_number_paper = __download_papers_given_divs( driver=driver, divs=divs, save_dir=save_dir, paper_postfix=paper_postfix, time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port ) for e in this_error_log: error_log.append(e) driver.quit() # 2. write error log print('write error log') log_file_pathname = os.path.join( project_root_folder, 'log', 'download_err_log.txt' ) with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: f.write(e) f.write('\n') f.write('\n') def download_icml_papers_given_url_and_group_id( save_dir, year, base_url, group_id, conference='ICML', start_page=1, time_step_in_seconds=10, downloader='IDM', proxy_ip_port=None): """ downlaod ICLR papers for the given web url and the paper group id :param save_dir: str, paper save path :type save_dir: str :param year: int, iclr year, current only support year >= 2018 :type year: int :param base_url: str, paper website url :type base_url: str :param group_id: str, paper group id, such as "poster" and "oral". :type group_id: str :param conference: str, conference name, such as ICLR. Default: ICLR :param start_page: int, the initial downloading webpage number, only the pages whose number is equal to or greater than this number will be processed. Default: 1 :param time_step_in_seconds: int, the interval time between two download request in seconds. Default: 10 :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder'. Default: 'IDM' :param proxy_ip_port: str or None, proxy ip address and port, eg. eg: "127.0.0.1:7890". Only useful for webdriver and request downloader (downloader=None). Default: None. :type proxy_ip_port: str | None :return: """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) def mywait(driver, aria_controls=None): # wait for the select element to become visible # print('Starting web driver wait...') wait = WebDriverWait(driver, 20) # ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,) # wait = WebDriverWait(driver, 20, ignored_exceptions=ignored_exceptions) # print('Starting web driver wait... finished') # res = wait.until(EC.presence_of_element_located((By.ID, "notes"))) # print("Successful load the website!->", res) res = wait.until(EC.presence_of_element_located((By.ID, "notes"))) res = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "submissions-list"))) # print("Successful load the website notes!->", res) # res = wait.until(EC.presence_of_element_located( # (By.XPATH, f'''//*[@id="{group_id}"]/nav'''))) # scroll to bottom of page # https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium driver.find_element(By.TAG_NAME, 'body').send_keys( Keys.CONTROL + Keys.END) time.sleep(0.3) if aria_controls is None: wait.until(EC.element_to_be_clickable( (By.XPATH, f'//*[@class="submissions-list"]/nav/ul/li[3]/a'''))) else: wait.until(EC.element_to_be_clickable( (By.XPATH, f'''//*[@id='{aria_controls}']/div/div/nav/ul/li[3]/a'''))) wait.until(EC.presence_of_element_located( (By.XPATH, f'''//*[@id='{aria_controls}']/div/div/ul/li[1]/div/h4/a[1]'''))) # print("Successful load the website pagination!->", res) time.sleep(2) # seconds, workaround for bugs paper_postfix = f'{conference}_{year}' error_log = [] driver = get_driver(proxy_ip_port=proxy_ip_port) driver.get(base_url) if not os.path.exists(save_dir): os.makedirs(save_dir) # wait = WebDriverWait(driver, 20) mywait(driver) # get into poster or oral page nav_tap = driver.find_elements( By.XPATH, f'//ul[@class="nav nav-tabs"]/li') is_found_group = False for li in nav_tap: if group_id in li.text.lower(): if 'poster' in group_id and 'spotlight' in li.text.lower(): # spotlight-poster should be recognized as spotlight rather # than poster continue page_link = li.find_element(By.TAG_NAME, "a") # scroll to top of page, if not at top, the click action not work # https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium driver.find_element(By.TAG_NAME, 'body').send_keys( Keys.CONTROL + Keys.HOME) aria_controls = page_link.get_attribute('aria-controls') page_link.click() mywait(driver, aria_controls) # there is no request in here is_found_group = True break if not is_found_group: raise ValueError(f'not found {group_id} papers at {base_url}!!!') # pages = driver.find_elements( # By.XPATH, f'//nav[@aria-label="page navigation"]/ul/li') pages = driver.find_elements( By.XPATH, f'''//*[@id='{aria_controls}']/div/div/nav/ul/li''') current_page = 1 # ind_page = 2 # 0 << ; 1 < total_pages_number = int(pages[-3].text) # << | < | 1, 2, 3, ... | > | >> last_total_pages = total_pages_number # get into start pages while current_page < start_page: # flip pages until seeing the start page if total_pages_number < start_page: current_page = total_pages_number __get_into_pages_given_number( driver=driver, page_number=current_page, pages=pages, wait_fn=mywait, condition=aria_controls) print(f'getting into web page {current_page}...') # print("Successful load the website pagination!->", res) pages = driver.find_elements( By.XPATH, f'''//*[@id='{aria_controls}']/div/div/nav/ul/li''') total_pages_number = int(pages[-3].text) # total page remain unchanged after reload if total_pages_number == last_total_pages: print(f'reached last({total_pages_number}-th) webpage') # when get the last page, but the page number is till less than # start page, so the start page doesn't exist. PRINT ERROR and # return print(f'ERROR: THE {start_page}-th webpage not found!') return else: current_page = start_page page = __get_into_pages_given_number( driver=driver, page_number=current_page, pages=pages, wait_fn=mywait, condition=aria_controls) while current_page <= total_pages_number: if page is None: break print(f'downloading {group_id} papers in page: {current_page}') divs = driver.find_elements( By.XPATH, f'''//*[@id='{aria_controls}']/div/div/ul/li''') # temp workaround repeat_times = 3 is_find_paper = False for r in range(repeat_times): try: a_hrefs = divs[0].find_elements(By.TAG_NAME, "a") name = slugify(a_hrefs[0].text.strip()) link = a_hrefs[1].get_attribute('href') a_hrefs = divs[-1].find_elements(By.TAG_NAME, "a") name = slugify(a_hrefs[0].text.strip()) link = a_hrefs[1].get_attribute('href') is_find_paper = True break except Exception as e: if (r+1) < repeat_times: print(f'\terror occurre: {str(e.msg)}') print(f'\tsleep {(r+1)*5} seconds...') time.sleep((r+1)*5) print(f'{r+1}-th reloading page') divs = driver.find_elements( By.XPATH, f'''//*[@id='{aria_controls}']/div/div/ul/li''') else: print('\tskip this page.') if not is_find_paper: continue # time.sleep(time_step_in_seconds) this_error_log, this_number_paper = __download_papers_given_divs( driver=driver, divs=divs, save_dir=save_dir, paper_postfix=paper_postfix, time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port ) for e in this_error_log: error_log.append(e) # get into next page current_page += 1 pages = driver.find_elements( By.XPATH, f'''//*[@id='{aria_controls}']/div/div/nav/ul/li''') total_pages_number = int(pages[-3].text) # if we do not reread the pages, all the pages will be not available # with an exception: # selenium.common.exceptions.StaleElementReferenceException: # Message: stale element reference: element is not attached to the # page document page = __get_into_pages_given_number( driver=driver, page_number=current_page, pages=pages, wait_fn=mywait, condition=aria_controls) driver.quit() # 2. write error log print('write error log') log_file_pathname = os.path.join( project_root_folder, 'log', 'download_err_log.txt' ) with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: f.write(e) f.write('\n') f.write('\n') def get_pages_str(pages): page_str_list = [p.text for p in pages] # print(f'Current page navigation bar:\n{page_str_list}') return page_str_list def get_max_page_number(page_str_list): is_find_number = False for i, page_str in enumerate(page_str_list): if not page_str.isnumeric() and is_find_number: return int(page_str_list[i-1]) if page_str.isnumeric(): is_find_number = True return int(page_str_list[-1]) def download_papers_given_url_and_group_id( save_dir, year, base_url, group_id, conference, start_page=1, time_step_in_seconds=10, downloader='IDM', proxy_ip_port=None, is_have_pages=True, is_need_click_group_button=False): """ downlaod papers for the given web url and the paper group id :param save_dir: str, paper save path :type save_dir: str :param year: int, iclr year, current only support year >= 2018 :type year: int :param base_url: str, paper website url :type base_url: str :param group_id: str, paper group id, such as "notable-top-5-", "notable-top-25-", "poster", "oral-submissions", "spotlight-submissions", "poster-submissions", etc. :type group_id: str :param conference: str, conference name, such as CORL. :param start_page: int, the initial downloading webpage number, only the pages whose number is equal to or greater than this number will be processed. Default: 1 :param time_step_in_seconds: int, the interval time between two download request in seconds. Default: 10 :param downloader: str, the downloader to download, could be 'IDM' or 'Thunder'. Default: 'IDM' :param proxy_ip_port: str or None, proxy ip address and port, eg. eg: "127.0.0.1:7890". Only useful for webdriver and request downloader (downloader=None). Default: None. :type proxy_ip_port: str | None :param is_have_pages: bool, is there pages in webpage. Default: True. :type is_have_pages: bool :param is_need_click_group_button: bool, is there need to click the group button in webpage. For some years, for example 2018, the navigation part "#xxxxx" in base url will not work. And it should be clicked before reading content from webpage. Default: False. :type is_need_click_group_button: bool :return: """ project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) def _get_pages_xpath(year): if year <= 2023: xpath = f'''//*[@id="{group_id}"]/nav/ul/li''' else: xpath = f'''//*[@id="{group_id}"]/div/div/nav/ul/li''' return xpath def mywait(driver, condition=None): # wait for the select element to become visible # print('Starting web driver wait...') # ignored_exceptions = (NoSuchElementException, # StaleElementReferenceException,) # wait = WebDriverWait(driver, 20, ignored_exceptions=ignored_exceptions) wait = WebDriverWait(driver, 20) # print('Starting web driver wait... finished') # res = wait.until(EC.presence_of_element_located((By.ID, "notes"))) # print("Successful load the website!->", res) # if year <= 2023: # res = wait.until( # EC.presence_of_element_located((By.CLASS_NAME, "note"))) # print("Successful load the website notes!->", res) # res = wait.until(EC.presence_of_element_located( # (By.XPATH, f'''//*[@id="{group_id}"]/nav'''))) if is_have_pages: # scroll to bottom of page # https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium driver.find_element(By.TAG_NAME, 'body').send_keys( Keys.CONTROL + Keys.END) if year <= 2023: wait.until(EC.element_to_be_clickable( (By.XPATH, f'{_get_pages_xpath(year)}[3]/a'))) else: wait.until(EC.element_to_be_clickable( (By.XPATH, f'{_get_pages_xpath(year)}[3]/a'))) # print("Successful load the website pagination!->", res) time.sleep(2) # seconds, workaround for bugs paper_postfix = f'{conference}_{year}' error_log = [] driver = get_driver(proxy_ip_port=proxy_ip_port) driver.get(base_url) if not os.path.exists(save_dir): os.makedirs(save_dir) if is_need_click_group_button: archive_is_have_pages = is_have_pages is_have_pages = False mywait(driver) aria_controls = base_url.split('#')[-1] # scroll to home of page driver.find_element(By.TAG_NAME, 'body').send_keys( Keys.CONTROL + Keys.HOME) group_button = driver.find_element( By.XPATH, f"""//a[@aria-controls="{aria_controls}"]""" ) group_button.click() is_have_pages = archive_is_have_pages mywait(driver) if is_have_pages: pages = driver.find_elements(By.XPATH, _get_pages_xpath(year)) current_page = 1 ind_page = 2 # 0 << ; 1 < total_pages_number = int(pages[-3].text) # << | < | 1, 2, 3, ... | > | >> last_total_pages = total_pages_number # get into start pages while current_page < start_page: # flip pages until seeing the start page if total_pages_number < start_page: current_page = total_pages_number __get_into_pages_given_number( driver=driver, page_number=current_page, pages=pages, wait_fn=mywait) print(f'getting into web page {current_page}...') # res = wait.until(EC.presence_of_element_located( # (By.XPATH, f'//*[@id="{group_id}"]/ul/li/h4/a'))) # res = wait.until(EC.presence_of_element_located( # (By.XPATH, f'''//*[@id="{group_id}"]/nav'''))) mywait(driver) # print("Successful load the website pagination!->", res) pages = driver.find_elements( By.XPATH, _get_pages_xpath(year)) total_pages_number = int(pages[-3].text) # total page remain unchanged after reload if total_pages_number == last_total_pages: print(f'reached last({total_pages_number}-th) webpage') # when get the last page, but the page number is till # less than start page, so the start page doesn't exist. # PRINT ERROR and return print(f'ERROR: THE {start_page}-th webpage not found!') return else: current_page = start_page page = __get_into_pages_given_number( driver=driver, page_number=current_page, pages=pages, wait_fn=mywait) while current_page <= total_pages_number: if page is None: break print(f'downloading {group_id} papers in page: {current_page}') mywait(driver) divs = driver.find_element(By.ID, group_id). \ find_elements(By.CLASS_NAME, 'note ') # temp workaround repeat_times = 3 is_find_paper = False for r in range(repeat_times): try: a_hrefs = divs[0].find_elements(By.TAG_NAME, "a") name = slugify(a_hrefs[0].text.strip()) link = a_hrefs[1].get_attribute('href') a_hrefs = divs[-1].find_elements(By.TAG_NAME, "a") name = slugify(a_hrefs[0].text.strip()) link = a_hrefs[1].get_attribute('href') is_find_paper = True break except Exception as e: if (r + 1) < repeat_times: print(f'\terror occurre: {str(e.msg)}') print(f'\tsleep {(r + 1) * 5} seconds...') time.sleep((r + 1) * 5) print(f'{r + 1}-th reloading page') divs = driver.find_element(By.ID, group_id). \ find_elements(By.CLASS_NAME, 'note ') else: print('\tskip this page.') if not is_find_paper: continue # time.sleep(time_step_in_seconds) this_error_log, this_number_paper = __download_papers_given_divs( driver=driver, divs=divs, save_dir=save_dir, paper_postfix=paper_postfix, time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port ) for e in this_error_log: error_log.append(e) # get into next page current_page += 1 pages = driver.find_elements( By.XPATH, _get_pages_xpath(year)) total_pages_number = int(pages[-3].text) # if we do not reread the pages, all the pages will be not available # with an exception: # selenium.common.exceptions.StaleElementReferenceException: # Message: stale element reference: element is not attached to the # page document page = __get_into_pages_given_number( driver=driver, page_number=current_page, pages=pages, wait_fn=mywait) else: # no pages divs = driver.find_element(By.ID, group_id). \ find_elements(By.CLASS_NAME, 'note ') # temp workaround repeat_times = 3 is_find_paper = False for r in range(repeat_times): try: a_hrefs = divs[0].find_elements(By.TAG_NAME, "a") name = slugify(a_hrefs[0].text.strip()) link = a_hrefs[1].get_attribute('href') a_hrefs = divs[-1].find_elements(By.TAG_NAME, "a") name = slugify(a_hrefs[0].text.strip()) link = a_hrefs[1].get_attribute('href') is_find_paper = True break except Exception as e: if (r + 1) < repeat_times: print(f'\terror occurre: {str(e.msg)}') print(f'\tsleep {(r + 1) * 5} seconds...') time.sleep((r + 1) * 5) print(f'{r + 1}-th reloading page') divs = driver.find_element(By.ID, group_id). \ find_elements(By.CLASS_NAME, 'note ') else: print('\tskipped!!!') if is_find_paper: # time.sleep(time_step_in_seconds) this_error_log, this_number_paper = __download_papers_given_divs( driver=driver, divs=divs, save_dir=save_dir, paper_postfix=paper_postfix, time_step_in_seconds=time_step_in_seconds, downloader=downloader, proxy_ip_port=proxy_ip_port ) for e in this_error_log: error_log.append(e) driver.quit() # 2. write error log print('write error log') log_file_pathname = os.path.join( project_root_folder, 'log', 'download_err_log.txt' ) with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: f.write(e) f.write('\n') f.write('\n') if __name__ == "__main__": year = 2023 save_dir = rf'E:\ICML_{year}' base_url = 'https://openreview.net/group?id=ICML.cc/2023/Conference' # download_nips_papers_given_url( # save_dir, year, base_url, # start_page=1, # time_step_in_seconds=10, # downloader='IDM') # download_icml_papers_given_url_and_group_id( # save_dir, year, base_url, group_id='oral', start_page=1, # time_step_in_seconds=10, ) ================================================ FILE: lib/pmlr.py ================================================ """ pmlr.py 20210618 """ from bs4 import BeautifulSoup import os from tqdm import tqdm from slugify import slugify from lib.downloader import Downloader from .my_request import urlopen_with_retry def download_paper_given_volume( volume, save_dir, postfix, is_download_supplement=True, time_step_in_seconds=5, downloader='IDM', is_random_step=True): """ download main and supplement papers from PMLR. :param volume: str, such as 'v1', 'r1' :param save_dir: str, paper and supplement material's save path :param postfix: str, the postfix will be appended to the end of papers' titles :param is_download_supplement: bool, True for downloading supplemental material :param time_step_in_seconds: int, the interval time between two downloading requests in seconds :param downloader: str, the downloader to download, could be 'IDM' or None, Default: 'IDM' :param is_random_step: bool, whether random sample the time step between two adjacent download requests. If True, the time step will be sampled from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds. Default: True. :return: True """ downloader = Downloader( downloader=downloader, is_random_step=is_random_step) init_url = f'http://proceedings.mlr.press/{volume}/' if is_download_supplement: main_save_path = os.path.join(save_dir, 'main_paper') supplement_save_path = os.path.join(save_dir, 'supplement') os.makedirs(main_save_path, exist_ok=True) os.makedirs(supplement_save_path, exist_ok=True) else: main_save_path = save_dir os.makedirs(main_save_path, exist_ok=True) headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) ' 'Gecko/20100101 Firefox/23.0'} content = urlopen_with_retry(url=init_url, headers=headers) soup = BeautifulSoup(content, 'html.parser') paper_list = soup.find_all('div', {'class': 'paper'}) error_log = [] title_list = [] num_download = len(paper_list) pbar = tqdm(zip(paper_list, range(num_download)), total=num_download) for paper in pbar: # get title this_paper = paper[0] title = slugify(this_paper.find_all('p', {'class': 'title'})[0].text) try: pbar.set_description( f'Downloading {postfix} paper {paper[1] + 1}/{num_download}:' f' {title}') except: pbar.set_description( f'''Downloading {postfix} paper {paper[1] + 1}/{num_download}: ''' f'''{title.encode('utf8')}''') title_list.append(title) this_paper_main_path = os.path.join(main_save_path, f'{title}_{postfix}.pdf') if is_download_supplement: this_paper_supp_path = os.path.join( supplement_save_path, f'{title}_{postfix}_supp.pdf') this_paper_supp_path_no_ext = os.path.join( supplement_save_path, f'{title}_{postfix}_supp.') if os.path.exists(this_paper_main_path) and os.path.exists( this_paper_supp_path): continue else: if os.path.exists(this_paper_main_path): continue # get abstract page url links = this_paper.find_all('p', {'class': 'links'})[0].find_all('a') supp_link = None main_link = None for link in links: if 'Download PDF' == link.text or 'pdf' == link.text: main_link = link.get('href') elif is_download_supplement and \ ('Supplementary PDF' == link.text or 'Supplementary Material' == link.text or 'supplementary' == link.text or 'Supplementary ZIP' == link.text or 'Other Files' == link.text): supp_link = link.get('href') if supp_link[-3:] != 'pdf': this_paper_supp_path = this_paper_supp_path_no_ext + \ supp_link[-3:] # try 1 time # error_flag = False for d_iter in range(1): try: # download paper with IDM if not os.path.exists( this_paper_main_path) and main_link is not None: downloader.download( urls=main_link, save_path=this_paper_main_path, time_sleep_in_seconds=time_step_in_seconds ) except Exception as e: # error_flag = True print('Error: ' + title + ' - ' + str(e)) error_log.append( (title, main_link, 'main paper download error', str(e))) # download supp if is_download_supplement: # check whether the supp can be downloaded if not os.path.exists( this_paper_supp_path) and supp_link is not None: try: downloader.download( urls=supp_link, save_path=this_paper_supp_path, time_sleep_in_seconds=time_step_in_seconds ) except Exception as e: # error_flag = True print('Error: ' + title + ' - ' + str(e)) error_log.append((title, supp_link, 'supplement download error', str(e))) # write error log print('writing error log...') project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) log_file_pathname = os.path.join( project_root_folder, 'log', 'download_err_log.txt') with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: if e is not None: f.write(e) else: f.write('None') f.write('\n') f.write('\n') return True if __name__ == '__main__': download_paper_given_volume( volume=150, save_dir=r'D:\The_KDD21_Workshop_on_Causal_Discovery', postfix=f'', is_download_supplement=False, time_step_in_seconds=5, downloader='IDM' ) ================================================ FILE: lib/proxy.py ================================================ """ proxy.py 20230228 """ from selenium.webdriver.common.proxy import Proxy, ProxyType import urllib def get_proxy(ip_port: str): """ setup proxy :param ip_port: str, proxy server ip address without protocol prefix, eg: "127.0.0.1:7890" :return: proxy (instance of selenium.webdriver.common.proxy.Proxy) Then the proxy could be to webdriver.Chrome: capabilities = webdriver.DesiredCapabilities.CHROME proxy.add_to_capabilities(capabilities) driver = webdriver.Chrome( service=Service(ChromeDriverManager().install()), desired_capabilities=capabilities) """ proxy = Proxy() proxy.proxy_type = ProxyType.MANUAL proxy.http_proxy = ip_port proxy.ssl_proxy = ip_port return proxy def set_proxy_4_urllib_request(ip_port: str): """ setup proxy :param ip_port: str or None, proxy server ip address with or without protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890". :return: proxies, dict with keys "http" and "https" or None. """ if ip_port is None: proxies = None else: if not ip_port.startswith('http'): ip_port = 'http://' + ip_port proxies = { 'http': ip_port, 'https': ip_port } proxy_support = urllib.request.ProxyHandler(proxies) opener = urllib.request.build_opener(proxy_support) urllib.request.install_opener(opener) return proxies def get_proxy_4_requests(ip_port: str): """ setup proxy :param ip_port: str or None, proxy server ip address with or without protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890". :return: proxies, dict with keys "http" and "https" or None. """ if ip_port is None: proxies = None else: if not ip_port.startswith('http'): ip_port = 'http://' + ip_port proxies = { 'http': ip_port, 'https': ip_port } return proxies if __name__ == "__main__": # get my ip import json set_proxy_4_urllib_request('127.0.0.1:7897') url = "http://ip-api.com/json" # ipv4 response = urllib.request.urlopen(url) data = json.load(response) if data['status'] == 'success': ip = data['query'] print(f'ip: {ip}') print(f'details: {data}') else: print(f'failed, try agin: {data}') ================================================ FILE: lib/springer.py ================================================ """ springer.py some function for springer 20201106 """ import urllib from bs4 import BeautifulSoup from tqdm import tqdm from slugify import slugify from .my_request import urlopen_with_retry import re def get_paper_name_link_from_url(url): headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'} paper_dict = dict() content = urlopen_with_retry(url=url, headers=headers) soup = BeautifulSoup(content, 'html5lib') paper_list_bar = tqdm( soup.find('section', {'data-title': 'Table of contents'}).find( 'div', {'class': 'c-book-section'}).find_all( ['li'], {'data-test': 'chapter'})) for paper in paper_list_bar: try: title = slugify( paper.find(['h3', 'h4'], {'class': 'app-card-open__heading'}).text) link = urllib.parse.urljoin( url, paper.find( ['h3', 'h4'], {'class': 'app-card-open__heading'} ).a.get('href')) # 'https://link.springer.com/chapter/10.1007/978-3-642-33718-5_2' # >> # 'https://link.springer.com/content/pdf/10.1007/978-3-642-33718-5_2.pdf' link = f'''{link.replace('/chapter/', '/content/pdf/')}.pdf''' paper_dict[title] = link except Exception as e: print(f'ERROR: {str(e)}') return paper_dict if __name__ == '__main__': papers = get_paper_name_link_from_url('https://link.springer.com/book/10.1007%2F978-3-319-46448-0') ================================================ FILE: lib/supplement_porcess.py ================================================ """ supplement_process.py """ from PyPDF3 import PdfFileMerger import zipfile import os import shutil from tqdm import tqdm def unzipfile(zip_file, save_path): """ unzip zip file to save_path :param zipfile: str, zip file's full pathname. :param save_path: str, the path store unzipped files. :return: None """ zip_ref = zipfile.ZipFile(zip_file, 'r') zip_ref.extractall(save_path) zip_ref.close() def get_potential_supp_pdf(path): """ get all the potential supplemental pdf file pathname :param path: str, the path of unzipped files :return: supp_pdf_list, List of str, pdf files' full pathnames """ supp_pdf_list = [f for f in os.scandir(path) if f.name.endswith('.pdf')] if len(supp_pdf_list) == 0: supp_pdf_list = [] for dir in os.scandir(path): if dir.is_dir() and not dir.name.startswith('__'): for pdf in os.scandir(dir.path): if pdf.name.endswith('.pdf'): supp_pdf_list.append(pdf.path) if len(supp_pdf_list) == 0: supp_pdf_list = [] for dir in os.scandir(path): if dir.is_dir() and not dir.name.startswith('__'): for sub_dir in os.scandir(dir): if sub_dir.is_dir() and not sub_dir.name.startswith('__'): for pdf in os.scandir(sub_dir.path): if pdf.name.endswith('.pdf'): supp_pdf_list.append(pdf.path) return supp_pdf_list def move_main_and_supplement_2_one_directory_with_group(main_path, supplement_path, supp_pdf_save_path): """ unzip supplemental zip files to get the pdf files, copy and rename them into given path(supp_pdf_save_path/group_name) :param main_path: str, the main papers' path :param supplement_path: str, the supplemental material 's path :param supp_pdf_save_path: str, the supplemental pdf files' save path """ if not os.path.exists(main_path): raise ValueError(f'''can not open '{main_path}' !''') if not os.path.exists(supplement_path): raise ValueError(f'''can not open '{supplement_path}' !''') error_log = [] # make temp dir to unzip zip file temp_zip_dir = '.\\temp_zip' if not os.path.exists(temp_zip_dir): os.mkdir(temp_zip_dir) else: # remove all files for unzip_file in os.listdir(temp_zip_dir): if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)): os.remove(os.path.join(temp_zip_dir, unzip_file)) if os.path.isdir(os.path.join(temp_zip_dir, unzip_file)): shutil.rmtree(os.path.join(temp_zip_dir, unzip_file)) else: print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file)) for group in os.scandir(main_path): if group.is_dir(): paper_bar = tqdm(os.scandir(group.path)) for paper in paper_bar: if paper.is_file(): name, extension = os.path.splitext(paper.name) if '.pdf' == extension: paper_bar.set_description(f'''processing {name}''') supp_pdf_path = None # error_flag = False if os.path.exists(os.path.join(supplement_path, group.name, f'{name}_supp.pdf')): supp_pdf_path = os.path.join(supplement_path, group.name, f'{name}_supp.pdf') shutil.copyfile( supp_pdf_path, os.path.join(supp_pdf_save_path, group.name, f'{name}_supp.pdf')) elif os.path.exists(os.path.join(supplement_path, group.name, f'{name}_supp.zip')): try: unzipfile( zip_file=os.path.join(supplement_path, group.name, f'{name}_supp.zip'), save_path=temp_zip_dir ) except Exception as e: print('Error: ' + name + ' - ' + str(e)) error_log.append((paper.path, supp_pdf_path, str(e))) try: # find if there is a pdf file (by listing all files in the dir) supp_pdf_list = get_potential_supp_pdf(temp_zip_dir) # rename the first pdf file if len(supp_pdf_list) >= 1: # by default, we only deal with the first pdf supp_pdf_path = os.path.join(supp_pdf_save_path, group.name, name+'_supp.pdf') if not os.path.exists(supp_pdf_path): shutil.move(supp_pdf_list[0], supp_pdf_path) if len(supp_pdf_list) > 1: for i in range(1, len(supp_pdf_list)): supp_pdf_path = os.path.join( supp_pdf_save_path, group.name, name + f'_supp_{i}.pdf') if not os.path.exists(supp_pdf_path): shutil.move(supp_pdf_list[i], supp_pdf_path) # empty the temp_folder (both the dirs and files) for unzip_file in os.listdir(temp_zip_dir): if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)): os.remove(os.path.join(temp_zip_dir, unzip_file)) elif os.path.isdir(os.path.join(temp_zip_dir, unzip_file)): shutil.rmtree(os.path.join(temp_zip_dir, unzip_file)) else: print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file)) except Exception as e: print('Error: ' + name + ' - ' + str(e)) error_log.append((paper.path, supp_pdf_path, str(e))) # 2. write error log print('write error log') project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) log_file_pathname = os.path.join( project_root_folder, 'log', 'merge_err_log.txt' ) with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: if e is None: f.write('None') else: f.write(e) f.write('\n') f.write('\n') def move_main_and_supplement_2_one_directory(main_path, supplement_path, supp_pdf_save_path): """ unzip supplemental zip files to get the pdf files, copy and rename them into given path(supp_pdf_save_path) :param main_path: str, the main papers' path :param supplement_path: str, the supplemental material's path :param supp_pdf_save_path: str, the supplemental pdf files' save path """ if not os.path.exists(main_path): raise ValueError(f'''can not open '{main_path}' !''') if not os.path.exists(supplement_path): raise ValueError(f'''can not open '{supplement_path}' !''') os.makedirs(supp_pdf_save_path, exist_ok=True) error_log = [] # make temp dir to unzip zip file temp_zip_dir = '..\\temp_zip' if not os.path.exists(temp_zip_dir): os.mkdir(temp_zip_dir) else: # remove all files for unzip_file in os.listdir(temp_zip_dir): if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)): os.remove(os.path.join(temp_zip_dir, unzip_file)) if os.path.isdir(os.path.join(temp_zip_dir, unzip_file)): shutil.rmtree(os.path.join(temp_zip_dir, unzip_file)) else: print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file)) paper_bar = tqdm(os.scandir(main_path)) for paper in paper_bar: if paper.is_file(): name, extension = os.path.splitext(paper.name) if '.pdf' == extension: paper_bar.set_description(f'''processing {name}''') supp_pdf_path = None # error_flag = False if os.path.exists(os.path.join(supp_pdf_save_path, f'{name}_supp.pdf')): continue elif os.path.exists(os.path.join(supplement_path, f'{name}_supp.pdf')): supp_pdf_path = os.path.join(supplement_path, f'{name}_supp.pdf') shutil.copyfile(supp_pdf_path, os.path.join(supp_pdf_save_path, f'{name}_supp.pdf')) elif os.path.exists(os.path.join(supplement_path, f'{name}_supp.zip')): try: unzipfile( zip_file=os.path.join(supplement_path, f'{name}_supp.zip'), save_path=temp_zip_dir) except Exception as e: print('Error: ' + name + ' - ' + str(e)) error_log.append((paper.path, supp_pdf_path, str(e))) try: # find if there is a pdf file (by listing all files in the dir) supp_pdf_list = get_potential_supp_pdf(temp_zip_dir) # rename the first pdf file if len(supp_pdf_list) >= 1: # by default, we only deal with the first pdf supp_pdf_path = os.path.join(supp_pdf_save_path, name+'_supp.pdf') if not os.path.exists(supp_pdf_path): shutil.move(supp_pdf_list[0], supp_pdf_path) if len(supp_pdf_list) > 1: for i in range(1, len(supp_pdf_list)): supp_pdf_path = os.path.join(supp_pdf_save_path, name + f'_supp_{i}.pdf') if not os.path.exists(supp_pdf_path): shutil.move(supp_pdf_list[i], supp_pdf_path) # empty the temp_folder (both the dirs and files) for unzip_file in os.listdir(temp_zip_dir): if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)): os.remove(os.path.join(temp_zip_dir, unzip_file)) elif os.path.isdir(os.path.join(temp_zip_dir, unzip_file)): shutil.rmtree(os.path.join(temp_zip_dir, unzip_file)) else: print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file)) except Exception as e: print('Error: ' + name + ' - ' + str(e)) error_log.append((paper.path, supp_pdf_path, str(e))) # 2. write error log print('write error log') project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) log_file_pathname = os.path.join( project_root_folder, 'log', 'merge_err_log.txt' ) with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: if e is None: f.write('None') else: f.write(e) f.write('\n') f.write('\n') def merge_main_supplement(main_path, supplement_path, save_path, is_delete_ori_files=False): """ merge the main paper and supplemental material into one single pdf file :param main_path: str, the main papers' path :param supplement_path: str, the supplemental material 's path :param save_path: str, merged pdf files's save path :param is_delete_ori_files: Bool, True for deleting the original main and supplemental material after merging """ if not os.path.exists(main_path): raise ValueError(f'''can not open '{main_path}' !''') if not os.path.exists(supplement_path): raise ValueError(f'''can not open '{supplement_path}' !''') os.makedirs(save_path, exist_ok=True) error_log = [] # make temp dir to unzip zip file temp_zip_dir = '.\\temp_zip' if not os.path.exists(temp_zip_dir): os.mkdir(temp_zip_dir) else: # remove all files for unzip_file in os.listdir(temp_zip_dir): if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)): os.remove(os.path.join(temp_zip_dir, unzip_file)) if os.path.isdir(os.path.join(temp_zip_dir, unzip_file)): shutil.rmtree(os.path.join(temp_zip_dir, unzip_file)) else: print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file)) paper_bar = tqdm(os.scandir(main_path)) for paper in paper_bar: if paper.is_file(): name, extension = os.path.splitext(paper.name) if '.pdf' == extension: paper_bar.set_description(f'''processing {name}''') if os.path.exists(os.path.join(save_path, paper.name)): continue supp_pdf_path = None error_floa = False if os.path.exists(os.path.join(supplement_path, f'{name}_supp.pdf')): supp_pdf_path = os.path.join(supplement_path, f'{name}_supp.pdf') elif os.path.exists(os.path.join(supplement_path, f'{name}_supp.zip')): try: unzipfile( zip_file=os.path.join(supplement_path, f'{name}_supp.zip'), save_path=temp_zip_dir ) except Exception as e: print('Error: ' + name + ' - ' + str(e)) error_log.append((paper.path, supp_pdf_path, str(e))) try: # find if there is a pdf file (by listing all files in the dir) supp_pdf_list = get_potential_supp_pdf(temp_zip_dir) # rename the first pdf file if len(supp_pdf_list) >= 1: # by default, we only deal with the first pdf supp_pdf_path = os.path.join(supplement_path, name+'_supp.pdf') if not os.path.exists(supp_pdf_path): shutil.move(supp_pdf_list[0], supp_pdf_path) # empty the temp_folder (both the dirs and files) for unzip_file in os.listdir(temp_zip_dir): if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)): os.remove(os.path.join(temp_zip_dir, unzip_file)) elif os.path.isdir(os.path.join(temp_zip_dir, unzip_file)): shutil.rmtree(os.path.join(temp_zip_dir, unzip_file)) else: print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file)) except Exception as e: error_floa = True print('Error: ' + name + ' - ' + str(e)) error_log.append((paper.path, supp_pdf_path, str(e))) # empty the temp_folder (both the dirs and files) for unzip_file in os.listdir(temp_zip_dir): if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)): os.remove(os.path.join(temp_zip_dir, unzip_file)) elif os.path.isdir(os.path.join(temp_zip_dir, unzip_file)): shutil.rmtree(os.path.join(temp_zip_dir, unzip_file)) else: print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file)) continue if supp_pdf_path is not None: try: merger = PdfFileMerger() f_handle1 = open(paper.path, 'rb') merger.append(f_handle1) f_handle2 = open(supp_pdf_path, 'rb') merger.append(f_handle2) with open(os.path.join(save_path, paper.name), 'wb') as fout: merger.write(fout) print('\tmerged!') f_handle1.close() f_handle2.close() merger.close() if is_delete_ori_files: os.remove(paper.path) if os.path.exists(os.path.join(supplement_path, f'{name}_supp.zip')): os.remove(os.path.join(supplement_path, f'{name}_supp.zip')) if os.path.exists(os.path.join(supplement_path, f'{name}_supp.pdf')): os.remove(os.path.join(supplement_path, f'{name}_supp.pdf')) except Exception as e: print('Error: ' + name + ' - ' + str(e)) error_log.append((paper.path, supp_pdf_path, str(e))) if os.path.exists(os.path.join(save_path, paper.name)): os.remove(os.path.join(save_path, paper.name)) else: if is_delete_ori_files: shutil.move(paper.path, os.path.join(save_path, paper.name)) else: shutil.copyfile(paper.path, os.path.join(save_path, paper.name)) # 2. write error log print('write error log') project_root_folder = os.path.abspath( os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) log_file_pathname = os.path.join( project_root_folder, 'log', 'merge_err_log.txt' ) with open(log_file_pathname, 'w') as f: for log in tqdm(error_log): for e in log: if e is None: f.write('None') else: f.write(e) f.write('\n') f.write('\n') def rename_2_short_name(src_path, save_path, target_max_length=128, extension='pdf'): """ rename file to short filename while remain the conference postfix Args: src_path (str): path that contains files directly. save_path (str): path to save the renamed files. target_max_length (int): max filen name length after renaming. All the files whose name length is not less than this will be renamed, the others will stay unchanged and copy into the save path. Default: 128. extension (str | None): only the files with this extension will be processed. None means all file will be processed. Default: 'pdf'. Returns: None """ if not os.path.exists(src_path): raise ValueError(f'Path not found: {src_path}!') os.makedirs(save_path, exist_ok=True) for f in tqdm(os.scandir(src_path)): f_name = f.name # compare extension ext = os.path.splitext(f_name)[1] if extension is not None and ext[1:] != extension: continue # compare file name length l = len(f_name) if l < target_max_length: if not os.path.exists(os.path.join(save_path, f_name)): print(f'\ncopying {f_name}') shutil.copyfile(f.path, os.path.join(save_path, f_name)) else: # rename try: [title, postfix] = f_name.split('_', 1) # only split to 2 parts new_title = title[:target_max_length-len(postfix)-2] new_name = f'{new_title}_{postfix}' if not os.path.exists(os.path.join(save_path, new_name)): print(f'\nrenaming {f_name} \n\t-> {new_name}') shutil.copyfile(f.path, os.path.join(save_path, new_name)) except ValueError: # ValueError: not enough values to unpack (expected 2, got 1) print(f'\nWARNING!!!:\n\tunable to parse postfix from {f.path}') print('\tSo, it will be just copy/rename to short name') new_title = f_name[:target_max_length - len(ext) - 1] new_name = f'{new_title}{ext}' if not os.path.exists(os.path.join(save_path, new_name)): print(f'\nrenaming {f_name} \n\t-> {new_name}') shutil.copyfile(f.path, os.path.join(save_path, new_name)) def rename_2_short_name_within_group(src_path, save_path, target_max_length=128, extension='pdf'): """ rename file to short filename while remain the conference postfix Args: src_path (str): path that contains files: src_path/group_name/files save_path (str): path to save the renamed files. target_max_length (int): max filen name length after renaming. All the files whose name length is not less than this will be renamed, the others will stay unchanged and copy into the save path. Default: 128. extension (str | None): only the files with this extension will be processed. None means all file will be processed. Default: 'pdf'. Returns: None """ if not os.path.exists(src_path): raise ValueError(f'Path not found: {src_path}!') os.makedirs(save_path, exist_ok=True) for d in tqdm(os.scandir(src_path)): if not d.is_dir(): continue print(f'\nprocessing {d.name}') d_name = d.name d_name = d_name[:min(len(d_name), target_max_length-1)] rename_2_short_name( src_path=d.path, save_path=os.path.join(save_path, d_name), target_max_length=target_max_length, extension=extension ) ================================================ FILE: lib/user_agents.py ================================================ """ user_agents.py user agents 20230702 """ user_agents = [ 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) ' 'Gecko/20071127 Firefox/2.0.0.11', 'Opera/9.25 (Windows NT 5.1; U; en)', 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; ' '.NET CLR 1.1.4322; .NET CLR 2.0.50727)', 'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) ' 'KHTML/3.5.5 (like Gecko) (Kubuntu)', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) ' 'Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12', 'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9', "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 " "(KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 " "Chrome/16.0.912.77 Safari/535.7", "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) " "Gecko/20100101 Firefox/10.0 ", 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ' 'AppleWebKit/537.36 (KHTML, like Gecko) ' 'Chrome/105.0.0.0 Safari/537.36' ] ================================================ FILE: sharelinks.md ================================================ # SHARE LINKS Aliyun share links 注:阿里云盘更新了协议,**一个分享链接最多只能分享不超过500个文件**,所以我进行了拆分,一个链接放499个文件,直至分享完。 ## CVPR ### main conference | year | index | share link | access code | |:----:|:-----:|:------------------------------------------------------:|:-----------:| | 2023 | 1 | [1-499](https://www.aliyundrive.com/s/SGMUABYNoRM) | `63un` | | 2023 | 2 | [500-998](https://www.aliyundrive.com/s/XeXJz53AVKn) | `7ws5` | | 2023 | 3 | [999-1497](https://www.aliyundrive.com/s/9wjv8gaE95i) | `1er4` | | 2023 | 4 | [1498-1996](https://www.aliyundrive.com/s/kqt4GNYmSYR) | `lf58` | | 2023 | 5 | [1997-2358](https://www.aliyundrive.com/s/GyyyD4XnqhZ) | `f47s` | ### workshops | year | index | share link | access code | |:----:|:-----:|:----------------------------------------------------:|:-----------:| | 2023 | 1 | [1-485](https://www.aliyundrive.com/s/gPtPRYcyttz) | `4n5t` | | 2023 | 2 | [486-698](https://www.aliyundrive.com/s/x18A9AxPJGp) | `x40h` |