Repository: SilenceEagle/paper_downloader
Branch: master
Commit: 7a76ffa26612
Files: 30
Total size: 345.0 KB
Directory structure:
gitextract_691ya0bm/
├── .gitignore
├── LICENSE
├── README.md
├── code/
│ ├── paper_downloader_AAAI.py
│ ├── paper_downloader_AAMAS.py
│ ├── paper_downloader_AISTATS.py
│ ├── paper_downloader_COLT.py
│ ├── paper_downloader_CORL.py
│ ├── paper_downloader_CVF.py
│ ├── paper_downloader_ECCV.py
│ ├── paper_downloader_ICLR.py
│ ├── paper_downloader_ICML.py
│ ├── paper_downloader_IJCAI.py
│ ├── paper_downloader_JMLR.py
│ ├── paper_downloader_NIPS.py
│ └── paper_downloader_RSS.py
├── lib/
│ ├── IDM.py
│ ├── __init__.py
│ ├── arxiv.py
│ ├── csv_process.py
│ ├── cvf.py
│ ├── downloader.py
│ ├── my_request.py
│ ├── openreview.py
│ ├── pmlr.py
│ ├── proxy.py
│ ├── springer.py
│ ├── supplement_porcess.py
│ └── user_agents.py
└── sharelinks.md
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
# ---> Python
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
# mylib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml
# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
csv/
data/
log/
temp_zip
urls/
*.txt
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2020 silenceagle
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# paper_downloader
Download papers and supplemental materials only from **OPEN ACCESS** paper
websites, such as **AAAI**, **AAMAS**, **AISTATS**, **COLT**, **CORL**, **CVPR**, **ECCV**,
**ICCV**, **ICLR**, **ICML**, **IJCAI**, **JMLR**, **NIPS**,
**RSS**, **WACV**.
---
The number of papers that can be downloaded with this repo (**Aliyundrive** or **123Pan** share links and `access code`s are also provided where available):
| year\conf | [AAAI](https://aaai.org/aaai-publications/aaai-conference-proceedings/#aaai) | [AAMAS](https://www.ifaamas.org/Proceedings/aamas2024/) | [ACCV](https://openaccess.thecvf.com/menu) | [AISTATS](https://www.aistats.org/) | [COLT](http://learningtheory.org/) | [CORL](https://www.corl.org/) | [CVPR](http://openaccess.thecvf.com/menu) | [ECCV](https://www.ecva.net/papers.php) | [ICCV](http://openaccess.thecvf.com/menu) | [ICLR](https://iclr.cc/) | [ICML](https://icml.cc/) | [IJCAI](https://www.ijcai.org/) | [JMLR](http://www.jmlr.org/) | [NIPS ](https://nips.cc/) | [RSS](https://www.roboticsproceedings.org/index.html) | [WACV](https://openaccess.thecvf.com/menu) |
|:------------:|:----------------------------------------------------------------------------:|:-------------------------------------------------------:|:------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------:|:------------------------------------------------------:|:-----------------------------:|:--------------------------------------------------------------------------------------------------------------:|:-------------------------------------------------------:|:--------------------------------------------------------------------------------------------------------------:|:--------------------------------------------------------------:|:-------------------------------------------------------:|:------------------------------------------------------:|:----------------------------:|:-------------------------------------------------------:|:-----------------------------------------------------:|:------------------------------------------------------------------------------------------------------------:|
| **1969** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 64 | -- | -- | -- | -- |
| **1971** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 66 | -- | -- | -- | -- |
| **1973** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 85 | -- | -- | -- | -- |
| **1975** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 146 | -- | -- | -- | -- |
| **1977** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 251 | -- | -- | -- | -- |
| **1979** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 12 | -- | -- | -- | -- |
| **1980** | [95](https://www.aliyundrive.com/s/ucngMrKSTmi)`96eg` | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| **1981** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 108 | -- | -- | -- | -- |
| **1982** | 104 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| **1983** | [92](https://www.aliyundrive.com/s/L3GfxhEqyWg)`09jo` | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 237 | -- | -- | -- | -- |
| **1984** | 69 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| **1985** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 259 | -- | -- | -- | -- |
| **1986** | 194 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| **1987** | 149 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 246 | -- | 90 | -- | -- |
| **1988** | 159 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 94 | -- | -- |
| **1989** | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 269 | -- | 101 | -- | -- |
| **1990** | 173 | -- | -- | -- | -- | -- | -- | 49 | -- | -- | -- | -- | -- | 143 | -- | -- |
| **1991** | 144 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 192 | -- | 144 | -- | -- |
| **1992** | 134 | -- | -- | -- | -- | -- | -- | 49 | -- | -- | -- | -- | -- | 127 | -- | -- |
| **1993** | 135 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | 138 | -- | 158 | -- | -- |
| **1994** | 302 | -- | -- | -- | -- | -- | -- | 98 | -- | -- | -- | -- | -- | 140 | -- | -- |
| **1995** | -- | -- | -- | 64 | -- | -- | -- | -- | -- | -- | -- | 282 | -- | 152 | -- | -- |
| **1996** | 275 | -- | -- | -- | -- | -- | -- | 98 | -- | -- | -- | -- | -- | 152 | -- | -- |
| **1997** | 186 | -- | -- | 57 | -- | -- | -- | -- | -- | -- | -- | 180 | -- | 150 | -- | -- |
| **1998** | 187 | -- | -- | -- | -- | -- | -- | 98 | -- | -- | -- | -- | -- | 151 | -- | -- |
| **1999** | 182 | -- | -- | 17 | -- | -- | -- | -- | -- | -- | -- | 204 | -- | 150 | -- | -- |
| **2000/v1** | 221 | -- | -- | -- | -- | -- | -- | 98 | -- | -- | -- | -- | 11 | 152 | -- | -- |
| **2001/v2** | -- | -- | -- | 46 | -- | -- | -- | -- | -- | -- | -- | 17 | 31 | 197 | -- | -- |
| **2002/v3** | 187 | / | -- | -- | -- | -- | -- | 196 | -- | -- | -- | -- | 59 | 207 | -- | -- |
| **2003/v4** | --- | / | -- | 44 | -- | -- | -- | -- | -- | -- | 121 | 297 | 59 | 198 | -- | -- |
| **2004/v5** | 177 | / | -- | -- | -- | -- | -- | 190 | -- | -- | 118 | -- | 56 | 207 | -- | -- |
| **2005/v6** | 328 | / | -- | 56 | -- | -- | -- | -- | -- | -- | 133 | 350 | 73 | 207 | 48 | -- |
| **2006/v7** | 393 | / | -- | -- | -- | -- | -- | 192+11 | -- | -- | -- | -- | 100 | 204 | 39 | -- |
| **2007/v8** | 375 | / | -- | 86 | -- | -- | -- | -- | -- | -- | 150 | 478 | 91 | 217 | 41 | -- |
| **2008/v9** | 355 | 254 | -- | -- | -- | -- | -- | 196 | -- | -- | 158 | -- | 97 | 250 | 40 | -- |
| **2009/v10** | -- | 130 | -- | 84 | -- | -- | -- | -- | -- | -- | 160 | 342 | 100 | 262 | 39 | -- |
| **2010/v11** | 300 | 163 | -- | 126 | -- | -- | -- | 286+63 | -- | -- | 159 | -- | 118 | 292 | 40 | -- |
| **2011/v12** | 302 | 125 | -- | 108 | 43 | -- | -- | -- | -- | -- | 153 | 490 | 105 | 306 | 45 | -- |
| **2012/v13** | 353 | 136 | -- | 160 | 46 | -- | -- | 329+147 | -- | -- | 243 | -- | 119 | 368 | 60 | -- |
| **2013/v14** | 251 | 321 | -- | 72 | 50 | -- | [471](https://www.aliyundrive.com/s/ZFvga9JZ5aY)`5p0q`+156 | -- | 455+142 | 14+9 | 283 | 496 | 84 | 360 | 55 | -- |
| **2014/v15** | 447 | 378 | -- | 124 | 61 | -- | 545+125 | 334+158 | -- | 35 | 310 | -- | 120 | 411 | 57 | -- |
| **2015/v16** | 455 | 363 | -- | 134 | 77 | -- | 602+133 | -- | 526+133 | 42 | 270 | 656 | 118 | 403 | 49 | -- |
| **2016/v17** | 676 | 280 | -- | 168 | 70 | -- | 643+194 | 372+132 | -- | 80 | 322 | 658 | 236 | 568 | 47 | -- |
| **2017/v18** | 765 | 318 | -- | 175 | 75 | 48 | 783+281 | -- | 621+353 | 198 | 434 | 781 | 234 | 679 | 75 | -- |
| **2018/v19** | 1102 | 390 | -- | 230 | 94 | 75 | 979+346 | 732+262 | -- | 336 | 466 | 870 | 84 | 1009 | 71 | -- |
| **2019/v20** | 1343 | 433 | -- | 403 | 127 | 110 | 1294+612 | -- | 1075+498 | 502 | 773 | 964 | 184 | 1428 | 84 | -- |
| **2020/v21** | [1864](https://www.aliyundrive.com/s/kbWKUpHGR3k)`5ls6` | 369 | [254](https://www.aliyundrive.com/s/Dt2ErKCmePQ)`dn93`+[13](https://www.aliyundrive.com/s/AhGvgotrMUv)`d9o6` | [796](https://www.aliyundrive.com/s/iQ4AWTHG4bk)`61yu` | [126](https://www.aliyundrive.com/s/apP8KUFLPe4)`3mv9` | 165 | [1467](https://www.aliyundrive.com/s/eJF4BTFzFJq)`y89b`+[517](https://www.aliyundrive.com/s/5wk7Mjo9XyU)`0fz9` | [1358](https://www.aliyundrive.com/s/EYyjxRmmg8d)`a5i0` | -- | [687](https://www.aliyundrive.com/s/cVRD5Bu2SgN)`4x1c` | [1084](https://www.aliyundrive.com/s/BHqtEbi6Dix)`5yw0` | [776](https://www.aliyundrive.com/s/vMZpsjCbWMV)`4xq3` | 254 | [1899](https://www.aliyundrive.com/s/GEMFqxKeHWu)`3g3d` | 103 | [378](https://www.aliyundrive.com/s/gfFKwcKrCP1)`l1m8`+[24](https://www.aliyundrive.com/s/2uCW6cq9WHk)`me08` |
| **2021/v22** | [1961](https://www.aliyundrive.com/s/cdeGciNZch8)`b69m` | 304 | -- | [845](https://www.aliyundrive.com/s/3hbAhxYFHER)`93ig` | [140](https://www.aliyundrive.com/s/gwhdNT1vGDD)`96ln` | 166 | 1660+[517](https://www.aliyundrive.com/s/ziBfXVKPXSY)`le14` | -- | [1612](https://www.aliyundrive.com/s/ME21PfkyAec)`99uu`+[465](https://www.aliyundrive.com/s/ZahPmXSn9an)`16es` | [860](https://www.aliyundrive.com/s/wGos6n5R93v)`ef43` | [1183](https://www.aliyundrive.com/s/SYTtH38GiVS)`g8b1` | [723](https://www.aliyundrive.com/s/io3sAjsN5pw)`40is` | 290 | [2334](https://www.aliyundrive.com/s/13sHmhuEdxA)`v6g1` | 92 | [406](https://www.aliyundrive.com/s/kTwfaX9tren)`1id9`+[23](https://www.aliyundrive.com/s/7Joy4svvUfy)`90rl` |
| **2022/v23** | [1624](https://www.aliyundrive.com/s/ePXvUw4VFdQ)`fp76` | 306 | [279](https://www.aliyundrive.com/s/zCCTJMPrfSr)`47jy`+[25](https://www.aliyundrive.com/s/f4kdMXixwJL)`s7a9` | [492](https://www.aliyundrive.com/s/xj2fRMwZxfC)`f16o` | 155 | 197 | [2077](https://www.aliyundrive.com/s/Q8DG9dKbx6S)`i16a`+[562](https://www.aliyundrive.com/s/f9Zx3hFFyq4)`11kj` | [1645](https://www.aliyundrive.com/s/dv4fhuueRHs)`6d7j` | -- | [54+176+865](https://www.aliyundrive.com/s/gfANcdbM9TC)`b1l3` | [1234](https://www.aliyundrive.com/s/eopQ5H8Hz2a)`81ov` | [862](https://www.aliyundrive.com/s/DBVKNsqN2UZ)`ea46` | 351 | [2673](https://www.aliyundrive.com/s/VFLmfnzSAsA)`eh49` | 74 | [406](https://www.aliyundrive.com/s/xRhdpencLQU)`ab53`+[80](https://www.aliyundrive.com/s/JCCcQXij7WX)`q6d2` |
| **2023/v24** | 2021 | 527 | -- | [496](https://www.aliyundrive.com/s/CD3Kz9cxu1U)`l5m9` | 170 | 199 | [2358+698](./sharelinks.md) | -- | 2161+491 | [90+284+1205](https://www.aliyundrive.com/s/PZ1Wann4B8A)`29sf` | 1805 | 846 | 397 | 67+378+2773 | 112 | [639](https://www.aliyundrive.com/s/fP52KxJEUE5)`mo78`+[74](https://www.aliyundrive.com/s/XZG992JqQfn)`nj80` |
| **2024/v25** | 2581 | 460 | 268+46 | 547 | 170 | 264 | 2716+773 | 2387 | -- | 86+369+1810 | 144+191+2275 | 1048 | 419 | 61+326+3650 | 131 | 846+120 |
| **2025/v26** | 3028 | 479 | --- | 583 | 182 | 263 | 2871+659 | -- | 2701+765 | 208+373+3060+6+6+56 | 108+211+2967 | 1276 | 308 | 77+683+4515 | 163 | 929 |
| **2026/v27** | 2375 | 29 May | 18 Dec. | 2 May | 3 July | 12 Nov. | 7 June | 13 Sep. | 29 Sep. | 225+5131 | 11 July | 21 Aug | 50 | 13 Dec | 17 July | 831+191 |
[Download from 123pan.com](https://www.123pan.com/s/PwXljv-QErwd.html)
(ACCESS CODE: `FdX2`)
(Some papers may be missing due to the filename-length limit of older 123pan versions)
NOTE: all the shared papers' pdf files were collected from the internet, and the original authors/providers hold the copyrights.
---
## Usage
**For example: download AAAI-2022 papers**
1. Install [Internet Download Manager/IDM](https://www.internetdownloadmanager.com/) [*Windows*] [*OPTIONAL*]
**Note:** If IDM is NOT installed at the DEFAULT location, then the
code in [lib/IDM.py](./lib/IDM.py) should also be modified:
```python
# should replace with your IDM path
idm_path = '"your path to IDMan.exe"'
# default:
# idm_path = '"C:\Program Files (x86)\Internet Download Manager\IDMan.exe"'
```
**Useful tip**: [Disabling IDM's download popup pages is recommended](https://github.com/SilenceEagle/paper_downloader/issues/17#issuecomment-773763300)
2. Install [Chrome](https://www.google.com/chrome) [Needed for `ICLR`, `ICML`, some of `NIPS` and `CORL` papers]
3. Change the code block at the end of
[code/paper_downloader_AAAI.py](./code/paper_downloader_AAAI.py)
```python
if __name__ == '__main__':
    year = 2022
    total_paper_number = save_csv(year)  # save papers urls to csv/AAAI_2022.csv
    download_from_csv(
        year,
        save_dir=f'..\\AAAI_{year}',  # change to your save location
        time_step_in_seconds=5,  # time step (seconds) between two downloading requests
        total_paper_number=total_paper_number,
        downloader=None  # use python "requests" package to download papers, workable on Windows/MacOS/Linux
        # downloader='IDM'  # use Internet Download Manager software to
        # download papers, Windows only
    )
```
4. Then run the code:
```bash
python code/paper_downloader_AAAI.py # download AAAI papers
```
---
**This repo also provides functions to process supplemental material:**
1. Merge the supplemental material pdf file and the main paper into one single pdf file;
2. Move the supplemental material pdf files (extracted from the downloaded zip files, if present) into the main papers' folder (see the sketch below).
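For example, here is a minimal sketch of step 2, run from the repository root and assuming the same call signature as the commented-out example in [code/paper_downloader_AISTATS.py](./code/paper_downloader_AISTATS.py); the paths below are placeholders:
```python
# minimal sketch: collect downloaded supplement pdfs next to the main papers.
# The paths are placeholders; the signature follows the commented-out call in
# code/paper_downloader_AISTATS.py.
from lib.supplement_porcess import move_main_and_supplement_2_one_directory

year = 2020
move_main_and_supplement_2_one_directory(
    main_path=rf'D:\AISTATS_{year}\main_paper',              # downloaded main papers
    supplement_path=rf'D:\AISTATS_{year}\supplement',        # downloaded supplements (pdf/zip)
    supp_pdf_save_path=rf'D:\AISTATS_{year}\supplement_pdf'  # where extracted supplement pdfs go
)
```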
## Star history
[](https://star-history.com/#SilenceEagle/paper_downloader&Date)
================================================
FILE: code/paper_downloader_AAAI.py
================================================
"""paper_downloader_AAAI.py"""
import time
from bs4 import BeautifulSoup
import pickle
import os
from tqdm import tqdm
from slugify import slugify
import csv
import sys
import random
root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_folder)
from lib import csv_process
from lib.user_agents import user_agents
from lib.my_request import urlopen_with_retry
def get_track_urls(year):
"""
get all the technical tracks urls given AAAI proceeding year
Args:
year (int): AAAI proceeding year, such as 2023
Returns:
dict : All the urls of technical tracks included in
the given AAAI proceeding. Keys are the track name-volume strings,
and values are the corresponding urls.
"""
# assert int(year) >= 2023, f"only support year >= 2023, but get {year}!!!"
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
dat_file_pathname = os.path.join(
project_root_folder, 'urls', f'track_archive_url_AAAI_{year}.dat'
)
proceeding_th_dict = {
1980: 1,
1982: 2,
1983: 3,
1984: 4,
1986: 5,
1987: 6,
1988: 7,
1990: 8,
1991: 9,
1992: 10,
1993: 11,
1994: 12,
1996: 13,
1997: 14,
1998: 15,
1999: 16,
2000: 17,
2002: 18,
2004: 19,
2005: 20,
2006: 21,
2007: 22,
2008: 23
}
if year >= 2023:
base_url = r'https://ojs.aaai.org/index.php/AAAI/issue/archive'
headers = {
'User-Agent': user_agents[-1],
'Host': 'ojs.aaai.org',
'Referer': "https://ojs.aaai.org",
'GET': base_url
}
if os.path.exists(dat_file_pathname):
with open(dat_file_pathname, 'rb') as f:
content = pickle.load(f)
else:
content = urlopen_with_retry(url=base_url, headers=headers)
# req = urllib.request.Request(url=base_url, headers=headers)
# content = urllib.request.urlopen(req).read()
with open(dat_file_pathname, 'wb') as f:
pickle.dump(content, f)
soup = BeautifulSoup(content, 'html5lib')
tracks = soup.find('ul', {'class': 'issues_archive'}).find_all('li')
track_urls = dict()
for tr in tracks:
h2 = tr.find('h2')
this_track = slugify(h2.a.text)
if this_track.startswith(f'aaai-{year-2000}'):
this_track += slugify(h2.div.text) + '-' + this_track
this_url = h2.a.get('href')
track_urls[this_track] = this_url
print(f'find track: {this_track}({this_url})')
else:
if year >= 2010:
proceeding_th = year - 1986
elif year in proceeding_th_dict:
proceeding_th = proceeding_th_dict[year]
else:
print(f'ERROR: AAAI proceeding was not held in year {year}!!!')
return
base_url = f'https://aaai.org/proceeding/aaai-{proceeding_th:02d}-{year}/'
headers = {
'User-Agent': user_agents[-1],
'Host': 'aaai.org',
'Referer': "https://aaai.org",
'GET': base_url
}
if os.path.exists(dat_file_pathname):
with open(dat_file_pathname, 'rb') as f:
content = pickle.load(f)
else:
# req = urllib.request.Request(url=base_url, headers=headers)
# content = urllib.request.urlopen(req).read()
content = urlopen_with_retry(url=base_url, headers=headers)
# content = open(f'..\\AAAI_{year}.html', 'rb').read()
with open(dat_file_pathname, 'wb') as f:
pickle.dump(content, f)
soup = BeautifulSoup(content, 'html5lib')
tracks = soup.find('main', {'class': 'content'}).find_all('li')
track_urls = dict()
for tr in tracks:
this_track = slugify(tr.a.text)
this_url = tr.a.get('href')
track_urls[this_track] = this_url
print(f'find track: {this_track}({this_url})')
return track_urls
def get_papers_of_track_ojs(track_url):
"""
get all the papers' title, belonging track group name and download link.
the link should be hosted on https://ojs.aaai.org/
Args:
track_url (str): track url
Returns:
list[dict]: a list contains all the collected papers' information,
each item in list is a dictionary, whose keys include
['title', 'main link', 'group']
And the group is the specific track name.
"""
debug = False
paper_list = []
headers = {
'User-Agent': user_agents[-1],
'Host': 'ojs.aaai.org',
'Referer': "https://ojs.aaai.org",
'GET': track_url
}
content = urlopen_with_retry(url=track_url, headers=headers)
soup = BeautifulSoup(content, 'html5lib')
tracks = soup.find('div', {'class': 'sections'}).find_all(
'div', {'class': 'section'})
for tr in tracks:
this_group = slugify(tr.h2.text)
this_paper_dict = {
'group': this_group,
'title': '',
'main link': ''
}
papers = tr.find_all('li')
for p in papers:
this_paper_dict['title'] = ''
this_paper_dict['main link'] = ''
try:
title = slugify(p.find('h3', {'class': 'title'}).text)
link = p.find(
'a', {'class': 'obj_galley_link pdf'}
).get('href').replace('view', 'download')
this_paper_dict['title'] = title
this_paper_dict['main link'] = link
paper_list.append(this_paper_dict.copy())
if debug:
print(
f'paper: {title}\n\tlink:{link}\n\tgroup:{this_group}')
except Exception as e:
# skip unwanted target
# print(f'ERROR: {str(e)}')
pass
# continue
return paper_list
def get_papers_of_track(track_url):
"""
get all the papers' title, belonging track group name and download link.
the link should be hosted on https://aaai.org/
Args:
track_url (str): track url
Returns:
list[dict]: a list contains all the collected papers' information,
each item in list is a dictionary, whose keys include
['title', 'main link', 'group']
And the group is the specific track name.
"""
debug = False
paper_list = []
headers = {
'User-Agent': user_agents[-1],
'Host': 'aaai.org',
'Referer': "https://aaai.org",
'GET': track_url
}
content = urlopen_with_retry(url=track_url, headers=headers)
soup = BeautifulSoup(content, 'html5lib')
tracks = soup.find('main', {'id': 'genesis-content'}).find_all(
'div', {'class': 'track-wrap'})
for tr in tracks:
this_group = slugify(tr.h2.text)
this_paper_dict = {
'group': this_group,
'title': '',
'main link': ''
}
papers = tr.find_all('li')
for p in papers:
this_paper_dict['title'] = ''
this_paper_dict['main link'] = ''
try:
title = slugify(p.find('h5').text)
link = p.find(
'a', {'class': 'wp-block-button'}
).get('href')
this_paper_dict['title'] = title
this_paper_dict['main link'] = link
paper_list.append(this_paper_dict.copy())
if debug:
print(
f'paper: {title}\n\tlink:{link}\n\tgroup:{this_group}')
except Exception as e:
# skip unwanted target
# print(f'ERROR: {str(e)}')
pass
# continue
return paper_list
def save_csv(year):
"""
write AAAI papers' urls in one csv file
:param year: int, AAAI year, such as 2019
:return: paper_index: int, the total number of papers
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
csv_file_pathname = os.path.join(
project_root_folder, 'csv', f'AAAI_{year}.csv'
)
error_log = []
paper_index = 0
with open(csv_file_pathname, 'w', newline='') as csvfile:
fieldnames = ['title', 'main link', 'group']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
track_urls = get_track_urls(year)
for tr_name in track_urls:
tr_url = track_urls[tr_name]
print(f'collecting paper from {tr_name}({tr_url})')
if year >= 2023:
papers_dict_list = get_papers_of_track_ojs(tr_url)
else:
papers_dict_list = get_papers_of_track(tr_url)
print(f'\tfind {len(papers_dict_list)} papers')
for p in papers_dict_list:
paper_index += 1
writer.writerow(p)
csvfile.flush()
s = random.randint(3, 7)
print(f'random sleeping {s} seconds...')
time.sleep(s) # avoid requesting too frequently
# write error log
print('write error log')
log_file_pathname = os.path.join(
project_root_folder, 'log', 'download_err_log.txt'
)
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
if e is not None:
f.write(e)
else:
f.write('None')
f.write('\n')
f.write('\n')
return paper_index
def download_from_csv(
year, save_dir, time_step_in_seconds=5, total_paper_number=None,
csv_filename=None, downloader='IDM'):
"""
download all AAAI paper given year
:param year: int, AAAI year, such as 2019
:param save_dir: str, paper and supplement material's save path
:param time_step_in_seconds: int, the interval time between two download
requests in seconds
:param total_paper_number: int, the total number of papers that are going
to be downloaded
:param csv_filename: None or str, the csv file's name, None means to use
the default setting
:param downloader: str, the downloader to use, could be 'IDM' or
'Thunder', defaults to 'IDM'
:return: True
"""
postfix = f'AAAI_{year}'
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
csv_file_path = os.path.join(
project_root_folder, 'csv',
f'AAAI_{year}.csv' if csv_filename is None else csv_filename)
csv_process.download_from_csv(
postfix=postfix,
save_dir=save_dir,
csv_file_path=csv_file_path,
is_download_supplement=False,
time_step_in_seconds=time_step_in_seconds,
total_paper_number=total_paper_number,
downloader=downloader
)
if __name__ == '__main__':
year = 2025
# total_paper_number = 3028
total_paper_number = save_csv(year)
download_from_csv(
year,
save_dir=fr'D:\AAAI_{year}',
time_step_in_seconds=15,
total_paper_number=total_paper_number)
# for year in range(2012, 2018, 2):
# print(year)
# total_paper_number = None
# # total_paper_number = save_csv(year)
# download_from_csv(year, save_dir=f'..\\AAAI_{year}',
# time_step_in_seconds=10,
# total_paper_number=total_paper_number)
# time.sleep(2)
# for i in range(1, 12):
# print(f'issue {i}/{11}')
# year = 2022
# total_paper_number = save_csv_given_urls(
# urls=f'https://www.aaai.org/Library/AAAI/aaai{year - 2000}-issue{i:0>2}.php',
# csv_filename=f'.\AAAI_{year}_issue_{i}.csv'
# )
# # total_paper_number = 156
# download_from_csv(
# year=year,
# csv_filename=f'.\AAAI_{year}_issue_{i}.csv',
# save_dir=rf'D:\AAAI_{year}',
# time_step_in_seconds=1,
# total_paper_number=total_paper_number)
# print(get_track_urls(1980))
# get_papers_of_track(r'https://ojs.aaai.org/index.php/AAAI/issue/view/548')
pass
================================================
FILE: code/paper_downloader_AAMAS.py
================================================
"""paper_downloader_AAMAS.py
"""
import time
import urllib
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import pickle
import os
from tqdm import tqdm
from slugify import slugify
import csv
import sys
root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_folder)
from lib import csv_process
from lib.my_request import urlopen_with_retry
def save_csv(year):
"""
write AAMAS papers' urls in one csv file
:param year: int, AAMAS year, such as 2023
:return: paper_index: int, the total number of papers
"""
conference = "AAMAS"
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
csv_file_pathname = os.path.join(
project_root_folder, 'csv', f'{conference}_{year}.csv'
)
init_url_dict = {
2010: 'https://www.ifaamas.org/Proceedings/aamas2010/resources/_fullpapers.html',
2009: 'https://www.ifaamas.org/Proceedings/aamas2009/TOC/01_FP/FP_Session.html',
2008: 'https://www.ifaamas.org/Proceedings/aamas2008/proceedings/mainTrackPapers.htm',
}
error_log = []
paper_index = 0
with open(csv_file_pathname, 'w', newline='') as csvfile:
fieldnames = ['title', 'group', 'main link', 'supplemental link']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
if year >= 2013:
init_url = f'https://www.ifaamas.org/Proceedings/aamas{year}' \
f'/forms/contents.htm'
elif year >= 2011:
init_url = f'https://www.ifaamas.org/Proceedings/aamas{year}'\
f'/resources/fullpapers.html'
elif year in init_url_dict:
init_url = init_url_dict[year]
else:
# TODO: support downloading 2002 ~ 2007 papers
return
url_file_pathname = os.path.join(
project_root_folder, 'urls',
f'init_url_{conference}_{year}.dat'
)
if os.path.exists(url_file_pathname):
with open(url_file_pathname, 'rb') as f:
content = pickle.load(f)
else:
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0'}
content = urlopen_with_retry(url=init_url, headers=headers)
with open(url_file_pathname, 'wb') as f:
pickle.dump(content, f)
soup = BeautifulSoup(content, 'html5lib')
# soup = BeautifulSoup(content, 'html.parser')
if year >= 2013:
group_list = soup.find('tbody').find_all('tr', recursive=False)[3:]
# skip "conference title", "Table of Contents" and "Contents table"
group_list_bar = tqdm(group_list)
paper_index = 0
is_start = False
for group in group_list_bar:
if not is_start:
# if group.find('a', {'id': 'KT'}): # year 2019, 2023, 2024
# is_start = True
if group.find('strong'):
group_text = slugify(group.find('strong').text)
if not group_text.startswith('table') and \
not group_text.startswith('aamas'):
# skip Table of Contents, AAMAS 20xx
is_start = True
else:
continue
else:
continue
try:
tds = group.find_all('td', recursive=False)
if len(tds) < 2:
continue
group = tds[1]
papers = group.find_all('p')
for p in papers:
# group title is in ...
if p.find('strong', recursive=False):
group_title = slugify(p.text)
continue
paper_dict = {'title': '',
'group': group_title,
'main link': '',
'supplemental link': ''}
if p.find('a') is None and p.find('b') is None:
# skip the last empty <p> in some groups
continue
a = p.find('a')
if a is None:
title = slugify(p.find('b').text)
main_link = ''
print(f'\nWarning: No link found for {title}!')
else:
title = slugify(a.text)
main_link = urllib.parse.urljoin(init_url, a.get('href'))
paper_dict['title'] = title
paper_dict['main link'] = main_link
paper_index += 1
group_list_bar.set_description_str(
f'Collected paper {paper_index}: {title}')
writer.writerow(paper_dict)
csvfile.flush() # write to file immediately
except Exception as e:
print(f'Warning: {str(e)}\n'
f'Current group: {group_title}\nCurrent paper: {title}')
elif year >= 2010:
class_name = {
2010: 'plist',
2011: 'plist',
2012: 'pindex'
}
papers = soup.find('div', {'class': class_name[year]}).find_all(['h2', 'div'])
papers_bar = tqdm(papers)
paper_index = 0
for p in papers_bar:
if p.name == 'h2': # group title
group_title = slugify(p.text)
else: # div, paper
paper_dict = {'title': '',
'group': group_title,
'main link': '',
'supplemental link': ''}
a = p.find('span', {'class': 'title'}).find('a')
# title = slugify(a.find(string=True, recursive=False)) # drop abs
direct_text = ''.join(child for child in a.contents
if isinstance(child, str)).strip()
title = slugify(direct_text)
main_link = urllib.parse.urljoin(init_url, a.get('href'))
paper_dict['title'] = title
paper_dict['main link'] = main_link
paper_index += 1
papers_bar.set_description_str(
f'Collected paper {paper_index}: {title}')
writer.writerow(paper_dict)
csvfile.flush() # write to file immediately
elif year == 2009:
group_list = soup.find('div', {'id': 'mainContent'}).find_all('p')
group_list_bar = tqdm(group_list)
paper_index = 0
is_start = False
for group in group_list_bar:
if not is_start:
if group.find('strong'):
group_text = slugify(group.find('strong').text)
is_start = True
else:
continue
if group.find('strong'):
group_title = slugify(group.text)
continue
try:
papers = group.find_all('a')
for p in papers:
paper_dict = {'title': '',
'group': group_title,
'main link': '',
'supplemental link': ''}
title = slugify(p.text)
main_link = urllib.parse.urljoin(init_url, p.get('href'))
paper_dict['title'] = title
paper_dict['main link'] = main_link
paper_index += 1
group_list_bar.set_description_str(
f'Collected paper {paper_index}: {title}')
writer.writerow(paper_dict)
csvfile.flush() # write to file immediately
except Exception as e:
print(f'Warning: {str(e)}\n'
f'Current group: {group_title}\nCurrent paper: {title}')
elif year == 2008:
# papers = soup.find_all(lambda tag:
# (tag.name == 'p' and 'title' in tag.get('class', [])) or
# tag.name == 'a'
# )
group_list = soup.find('div', {'id': 'mainbody'}).find(
'table').find('tbody').find_all('tr', recursive=False)[2:]
# skip "conference title", "Table of Contents"
group_list_bar = tqdm(group_list)
paper_index = 0
for group in group_list_bar:
try:
p_class_title = group.find('p', {'class': 'title'})
h3 = group.find('h3')
if p_class_title:
group_title = slugify(p_class_title.text)
elif h3: # find
group_title = slugify(h3.text)
else:
raise ValueError('Parse group title failed!')
papers = group.find_all('a')
for p in papers:
paper_dict = {'title': '',
'group': group_title,
'main link': '',
'supplemental link': ''}
title = slugify(p.text)
if not p.get('href'):
continue # group title
main_link = urllib.parse.urljoin(init_url, p.get('href'))
paper_dict['title'] = title
paper_dict['main link'] = main_link
paper_index += 1
group_list_bar.set_description_str(
f'Collected paper {paper_index}: {title}')
writer.writerow(paper_dict)
csvfile.flush() # write to file immediately
except Exception as e:
print(f'Warning: {str(e)}\n'
f'Current group: {group_title}\nCurrent paper: {title}')
else:
# TODO: support downloading 2002 ~ 2008 papers
return
# write error log
print('write error log')
log_file_pathname = os.path.join(
project_root_folder, 'log', 'download_err_log.txt'
)
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
if e is not None:
f.write(e)
else:
f.write('None')
f.write('\n')
f.write('\n')
return paper_index
def download_from_csv(
year, save_dir, time_step_in_seconds=5, total_paper_number=None,
csv_filename=None, downloader='IDM', is_random_step=True,
proxy_ip_port=None):
"""
download all AAMAS paper given year
:param year: int, AAMAS year, such as 2019
:param save_dir: str, paper and supplement material's save path
:param time_step_in_seconds: int, the interval time between two download
requests in seconds
:param total_paper_number: int, the total number of papers that are going
to be downloaded
:param csv_filename: None or str, the csv file's name, None means to use
the default setting
:param downloader: str, the downloader to use, could be 'IDM' or
'Thunder', defaults to 'IDM'
:param is_random_step: bool, whether to randomly sample the time step between two
adjacent download requests. If True, the time step will be sampled
from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds.
Default: True.
:param proxy_ip_port: str or None, proxy server ip address with or without
protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890".
Default: None
:return: True
"""
conference = "AAMAS"
postfix = f'{conference}_{year}'
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
csv_file_path = os.path.join(
project_root_folder, 'csv',
f'{conference}_{year}.csv' if csv_filename is None else csv_filename)
csv_process.download_from_csv(
postfix=postfix,
save_dir=save_dir,
csv_file_path=csv_file_path,
is_download_supplement=False,
time_step_in_seconds=time_step_in_seconds,
total_paper_number=total_paper_number,
downloader=downloader,
is_random_step=is_random_step,
proxy_ip_port=proxy_ip_port
)
if __name__ == '__main__':
year = 2025
# total_paper_number = 2021
total_paper_number = save_csv(year)
download_from_csv(
year,
save_dir=fr'D:\AAMAS_{year}',
time_step_in_seconds=5,
total_paper_number=total_paper_number)
# for year in range(2008, 2025, 1):
# print(year)
# # total_paper_number = 134
# total_paper_number = save_csv(year)
# download_from_csv(year, save_dir=fr'E:\AAMAS\AAMAS_{year}',
# time_step_in_seconds=10,
# total_paper_number=total_paper_number)
# time.sleep(2)
pass
================================================
FILE: code/paper_downloader_AISTATS.py
================================================
"""paper_downloader_AISTATS.py"""
import os
import sys
root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_folder)
import lib.pmlr as pmlr
from lib.supplement_porcess import merge_main_supplement, move_main_and_supplement_2_one_directory, \
move_main_and_supplement_2_one_directory_with_group
def download_paper(year, save_dir, is_download_supplement=True, time_step_in_seconds=5, downloader='IDM'):
"""
download all AISTATS papers and supplement files of the given year, stored in
save_dir/main_paper and save_dir/supplement
respectively
:param year: int, AISTATS year, such as 2019
:param save_dir: str, paper and supplement material's save path
:param is_download_supplement: bool, True for downloading supplemental
material
:param time_step_in_seconds: int, the interval time between two download
requests in seconds
:param downloader: str, the downloader to use, could be 'IDM' or
'Thunder', defaults to 'IDM'
:return: True
"""
AISTATS_year_dict = {
2025: 258,
2024: 238,
2023: 206,
2022: 151,
2021: 130,
2020: 108,
2019: 89,
2018: 84,
2017: 54,
2016: 51,
2015: 38,
2014: 33,
2013: 31,
2012: 22,
2011: 15,
2010: 9,
2009: 5,
2007: 2
}
AISTATS_year_dict_R = {
1995: 0,
1997: 1,
1999: 2,
2001: 3,
2003: 4,
2005: 5
}
if year in AISTATS_year_dict.keys():
volume = f'v{AISTATS_year_dict[year]}'
elif year in AISTATS_year_dict_R.keys():
volume = f'r{AISTATS_year_dict_R[year]}'
else:
raise ValueError('''the given year's url is unknown !''')
postfix = f'AISTATS_{year}'
pmlr.download_paper_given_volume(
volume=volume,
save_dir=save_dir,
postfix=postfix,
is_download_supplement=is_download_supplement,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader
)
if __name__ == '__main__':
year = 2025
download_paper(
year,
rf'D:\AISTATS_{year}',
is_download_supplement=True,
time_step_in_seconds=25,
downloader='IDM'
)
# move_main_and_supplement_2_one_directory(
# main_path=rf'D:\AISTATS_{year}\main_paper',
# supplement_path=rf'D:\AISTATS_{year}\supplement',
# supp_pdf_save_path=rf'D:\AISTATS_{year}\supplement_pdf'
# )
pass
================================================
FILE: code/paper_downloader_COLT.py
================================================
"""paper_downloader_COLT.py"""
import os
import sys
root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_folder)
import lib.pmlr as pmlr
def download_paper(year, save_dir, is_download_supplement=False, time_step_in_seconds=5, downloader='IDM'):
"""
download all COLT papers and supplement files of the given year, stored in
save_dir/main_paper and save_dir/supplement
respectively
:param year: int, COLT year, such as 2019
:param save_dir: str, paper and supplement material's save path
:param is_download_supplement: bool, True for downloading supplemental
material
:param time_step_in_seconds: int, the interval time between two download
requests in seconds
:param downloader: str, the downloader to use, could be 'IDM' or
'Thunder', defaults to 'IDM'
:return: True
"""
COLT_year_dict = {
2025: 291,
2024: 247,
2023: 195,
2022: 178,
2021: 134,
2020: 125,
2019: 99,
2018: 75,
2017: 65,
2016: 49,
2015: 40,
2014: 35,
2013: 30,
2012: 23,
2011: 19
}
if year in COLT_year_dict.keys():
volume = f'v{COLT_year_dict[year]}'
else:
raise ValueError('''the given year's url is unknown !''')
postfix = f'COLT_{year}'
pmlr.download_paper_given_volume(
volume=volume,
save_dir=save_dir,
postfix=postfix,
is_download_supplement=is_download_supplement,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader
)
if __name__ == '__main__':
year = 2025
download_paper(
year,
rf'D:\COLT_{year}',
is_download_supplement=False,
time_step_in_seconds=3,
downloader='IDM'
)
pass
================================================
FILE: code/paper_downloader_CORL.py
================================================
"""paper_downloader_CORL.py"""
import os
import sys
root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_folder)
import lib.pmlr as pmlr
import lib.openreview as openreview
def download_paper(year, save_dir, is_download_supplement=False,
time_step_in_seconds=5, downloader='IDM',
source=None, proxy_ip_port=None):
"""
download all CORL papers and supplement files of the given year, stored in
save_dir/main_paper and save_dir/supplement
respectively
:param year: int, CORL year, such as 2019
:param save_dir: str, paper and supplement material's save path
:param is_download_supplement: bool, True for downloading supplemental
material
:param time_step_in_seconds: int, the interval time between two download
requests in seconds
:param downloader: str, the downloader to use, could be 'IDM' or
'Thunder', defaults to 'IDM'
:param source: str, download source, supports "pmlr" and "openreview".
Defaults to None, which means first trying to download from pmlr and,
if that fails, falling back to openreview.
:param proxy_ip_port: str or None, proxy ip address and port,
eg: "127.0.0.1:7890". Only used by the webdriver and the "requests"
downloader (downloader=None). Default: None.
:type proxy_ip_port: str | None
:return: True
"""
CORL_year_dict = {
2025: 305,
2024: 270,
2023: 229,
2022: 205,
2021: 164,
2020: 155,
2019: 100,
2018: 87,
2017: 78
}
postfix = f'CORL_{year}'
if source != 'openreview':
if year in CORL_year_dict.keys(): # download from pmlr
volume = f'v{CORL_year_dict[year]}'
pmlr.download_paper_given_volume(
volume=volume,
save_dir=save_dir,
postfix=postfix,
is_download_supplement=is_download_supplement,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader
)
return True
elif source == 'pmlr':
raise ValueError(f'Not found CoRL {year} in pmlr!')
# try to download from openreview
base_url = f'https://openreview.net/group?id=robot-learning.org/'\
f'CoRL/{year}/Conference'
group_id_dict = {
2023: ['accept--oral-', 'accept--poster-'],
2024: ['accept']
}
for gid in group_id_dict[year]:
openreview.download_papers_given_url_and_group_id(
save_dir=save_dir,
year=year,
base_url=f'{base_url}#{gid}',
group_id=gid,
conference='CORL',
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port
)
return True
if __name__ == '__main__':
year=2025
download_paper(
year,
rf'D:\CORL\CORL_{year}',
is_download_supplement=False,
time_step_in_seconds=30,
downloader='IDM'
# downloader = None
)
pass
================================================
FILE: code/paper_downloader_CVF.py
================================================
"""paper_downloader_CVF.py"""
import urllib
from bs4 import BeautifulSoup
import pickle
import os
from slugify import slugify
import csv
import sys
root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_folder)
from lib.supplement_porcess import merge_main_supplement, move_main_and_supplement_2_one_directory, \
move_main_and_supplement_2_one_directory_with_group, \
rename_2_short_name, rename_2_short_name_within_group
from lib.cvf import get_paper_dict_list
from lib import csv_process
import time
from lib.my_request import urlopen_with_retry
def save_csv(year, conference, proxy_ip_port=None):
"""
write CVF conference papers' and supplemental material's urls in one csv file
:param year: int
:param conference: str, one of ['CVPR', 'ICCV', 'WACV', 'ACCV']
:param proxy_ip_port: str or None, proxy server ip address with or without
protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890".
Default: None
:return: int, the total number of collected papers
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
if conference not in ['CVPR', 'ICCV', 'WACV', 'ACCV']:
raise ValueError(f'{conference} is not found in '
f'https://openaccess.thecvf.com/menu, '
f'maybe a spelling mistake!')
csv_file_pathname = os.path.join(
project_root_folder, 'csv', f'{conference}_{year}.csv'
)
print(f'saving {conference}-{year} paper urls into {csv_file_pathname}')
with open(csv_file_pathname, 'w', newline='') as csvfile:
fieldnames = ['title', 'main link', 'supplemental link', 'arxiv']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
init_url = f'http://openaccess.thecvf.com/{conference}{year}'
if conference == 'ICCV' and year == 2021:
init_url = 'https://openaccess.thecvf.com/ICCV2021?day=all'
elif conference == 'CVPR' and year >= 2022:
init_url = f'https://openaccess.thecvf.com/CVPR{year}?day=all'
url_file_pathname = os.path.join(
project_root_folder, 'urls', f'init_url_{conference}_{year}.dat'
)
if os.path.exists(url_file_pathname):
with open(url_file_pathname, 'rb') as f:
content = pickle.load(f)
else:
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
'Gecko/20100101 Firefox/23.0'}
content = urlopen_with_retry(
url=init_url, headers=headers, proxy_ip_port=proxy_ip_port)
with open(url_file_pathname, 'wb') as f:
pickle.dump(content, f)
soup = BeautifulSoup(content, 'html5lib')
tmp_list = soup.find('div', {'id': 'content'}).find_all('dt')
if len(tmp_list) <= 1:
paper_different_days_list_bar = soup.find(
'div', {'id': 'content'}).find_all('dd')
paper_index = 0
for group in paper_different_days_list_bar:
# get group name
a = group.find('a')
print(a.text)
group_link = urllib.parse.urljoin(init_url, a.get('href'))
group_paper_dict_list, _ = get_paper_dict_list(
url=group_link
)
paper_index += len(group_paper_dict_list)
for paper_dict in group_paper_dict_list:
writer.writerow(paper_dict)
return paper_index
else:
paper_dict_list, content = get_paper_dict_list(
url=init_url,
content=content)
for paper_dict in paper_dict_list:
writer.writerow(paper_dict)
return len(paper_dict_list)
def save_csv_workshops(year, conference, proxy_ip_port=None):
"""
write CVF workshops papers' and supplemental material's urls in one csv file
:param year: int
:param conference: str, one of ['CVPR', 'ICCV', 'WACV', 'ACCV']
:param proxy_ip_port: str or None, proxy server ip address with or without
protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890".
Default: None
:return: int, the total number of collected papers
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
if conference not in ['CVPR', 'ICCV', 'WACV', 'ACCV']:
raise ValueError(f'{conference} is not found in '
f'https://openaccess.thecvf.com/menu, '
f'maybe a spelling mistake!')
csv_file_pathname = os.path.join(
project_root_folder, 'csv', f'{conference}_WS_{year}.csv'
)
print(f'saving {conference}-WS-{year} paper urls into {csv_file_pathname}')
with open(csv_file_pathname, 'w', newline='') as csvfile:
fieldnames = ['group', 'title', 'main link', 'supplemental link',
'arxiv']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
'Gecko/20100101 Firefox/23.0'}
init_url = f'https://openaccess.thecvf.com/' \
f'{conference}{year}_workshops/menu'
url_file_pathname = os.path.join(
project_root_folder, 'urls', f'init_url_{conference}_WS_{year}.dat'
)
if os.path.exists(url_file_pathname):
with open(url_file_pathname, 'rb') as f:
content = pickle.load(f)
else:
content = urlopen_with_retry(
url=init_url, headers=headers, proxy_ip_port=proxy_ip_port)
# content = open(f'..\\{conference}_WS_{year}.html', 'rb').read()
with open(url_file_pathname, 'wb') as f:
pickle.dump(content, f)
soup = BeautifulSoup(content, 'html5lib')
paper_group_list_bar = soup.find('div', {'id': 'content'}).find_all('dd')
paper_index = 0
for group in paper_group_list_bar:
# get group name
a = group.find('a')
group_name = slugify(a.text)
print(f'GROUP: {group_name}')
group_link = urllib.parse.urljoin(init_url, a.get('href'))
repeat_time = 3
for r in range(repeat_time):
try:
group_paper_dict_list, _ = get_paper_dict_list(
url=group_link,
group_name=group_name,
timeout=20,
)
time.sleep(1)
break
except Exception as e:
if r + 1 == repeat_time:
print(f'ERROR: {str(e)}')
continue
paper_index += len(group_paper_dict_list)
for paper_dict in group_paper_dict_list:
writer.writerow(paper_dict)
return paper_index
def download_from_csv(
year, conference, save_dir, is_download_main_paper=True,
is_download_supplement=True, time_step_in_seconds=5,
total_paper_number=None, is_workshops=False, downloader='IDM',
proxy_ip_port=None):
"""
download all CVF papers and supplement files of the given year, stored in
save_dir/main_paper and save_dir/supplement
respectively
:param year: int, CVF year, such as 2019
:param conference: str, one of ['CVPR', 'ICCV', 'WACV', 'ACCV']
:param save_dir: str, paper and supplement material's save path
:param is_download_main_paper: bool, True for downloading main paper
:param is_download_supplement: bool, True for downloading supplemental
material
:param time_step_in_seconds: int, the interval time between two download
requests in seconds
:param total_paper_number: int, the total number of papers that are going
to be downloaded
:param is_workshops: bool, whether to download workshop papers from the csv file.
:param downloader: str, the downloader to use, could be 'IDM' or
None, defaults to 'IDM'.
:param proxy_ip_port: str or None, proxy server ip address with or without
protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890".
Default: None
:return: True
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
postfix = f'{conference}_{year}'
if is_workshops:
postfix = f'{conference}_WS_{year}'
csv_file_path = os.path.join(
project_root_folder,
'csv',
f'{conference}_{year}.csv' if not is_workshops else
f'{conference}_WS_{year}.csv'
)
csv_process.download_from_csv(
postfix=postfix,
save_dir=save_dir,
csv_file_path=csv_file_path,
is_download_main_paper=is_download_main_paper,
is_download_supplement=is_download_supplement,
time_step_in_seconds=time_step_in_seconds,
total_paper_number=total_paper_number,
downloader=downloader,
)
return True
def download_paper(
year, conference, save_dir, is_download_main_paper=True,
is_download_supplement=True, time_step_in_seconds=5,
is_download_main_conference=True, is_download_workshops=True,
downloader='IDM', proxy_ip_port=None):
"""
download all CVF papers of the given year, supports downloading the main
conference and the workshops.
:param year: int, CVF year, such as 2019.
:param conference: str, one of {'CVPR', 'ICCV', 'WACV', 'ACCV'}.
:param save_dir: str, paper and supplement material's save path.
:param is_download_main_paper: bool, True for downloading main paper.
:param is_download_supplement: bool, True for downloading supplemental
material.
:param time_step_in_seconds: int, the interval time between two download
requests in seconds.
:param is_download_main_conference: bool, controls whether to download the
main conference papers;
it is an upper-level control flag over is_download_main_paper
and is_download_supplement. E.g., after setting
is_download_main_conference=True, is_download_main_paper=False,
is_download_supplement=True, only the supplement materials of the
main conference (not the workshops) will be downloaded.
:param is_download_workshops: bool, True for downloading workshop papers;
behaves similarly to is_download_main_conference.
:param downloader: str, the downloader to use, could be 'IDM' or
None, defaults to 'IDM'.
:param proxy_ip_port: str or None, proxy server ip address with or without
protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890".
Default: None
:return:
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
# main conference
if is_download_main_conference:
csv_file_path = os.path.join(
project_root_folder, 'csv', f'{conference}_{year}.csv')
if not os.path.exists(csv_file_path):
total_paper_number = save_csv(
year=year, conference=conference, proxy_ip_port=proxy_ip_port)
else:
with open(csv_file_path, newline='') as csvfile:
myreader = csv.DictReader(csvfile, delimiter=',')
total_paper_number = sum(1 for row in myreader)
download_from_csv(
year=year,
conference=conference,
save_dir=os.path.join(save_dir, f'{conference}_{year}'),
is_download_main_paper=is_download_main_paper,
is_download_supplement=is_download_supplement,
time_step_in_seconds=time_step_in_seconds,
total_paper_number=total_paper_number,
is_workshops=False,
downloader=downloader,
proxy_ip_port=proxy_ip_port
)
# workshops
if is_download_workshops:
csv_file_path = os.path.join(
project_root_folder, 'csv', f'{conference}_WS_{year}.csv')
if not os.path.exists(csv_file_path):
total_paper_number = save_csv_workshops(
year=year, conference=conference, proxy_ip_port=proxy_ip_port)
else:
with open(csv_file_path, newline='') as csvfile:
myreader = csv.DictReader(csvfile, delimiter=',')
total_paper_number = sum(1 for row in myreader)
download_from_csv(
year=year,
conference=conference,
save_dir=os.path.join(save_dir, f'{conference}_WS_{year}'),
is_download_main_paper=is_download_main_paper,
is_download_supplement=is_download_supplement,
time_step_in_seconds=time_step_in_seconds,
total_paper_number=total_paper_number,
is_workshops=True,
downloader=downloader,
proxy_ip_port=proxy_ip_port
)
if __name__ == '__main__':
year = 2025
conference = 'CVPR'
download_paper(
year,
conference=conference,
save_dir=fr'D:\{conference}',
is_download_main_paper=True,
is_download_supplement=True,
time_step_in_seconds=10,
is_download_main_conference=True,
is_download_workshops=True,
# proxy_ip_port='127.0.0.1:7897'
)
#
# move_main_and_supplement_2_one_directory(
# main_path=rf'E:\{conference}\{conference}_{year}\main_paper',
# supplement_path=rf'E:\{conference}\{conference}_{year}\supplement',
# supp_pdf_save_path=rf'E:\{conference}\{conference}_{year}\main_paper'
# )
# move_main_and_supplement_2_one_directory_with_group(
# main_path=rf'E:\{conference}\{conference}_WS_{year}\main_paper',
# supplement_path=rf'E:\{conference}\{conference}_WS_{year}\supplement',
# supp_pdf_save_path=rf'E:\{conference}\{conference}_WS_{year}\main_paper'
# )
# rename to short filename for uploading to 123pan
# rename_2_short_name(
# src_path=r'E:\CVPR\CVPR_2024\main_paper',
# save_path=r'E:\short_name_cvpr2024',
# target_max_length=128
# )
# rename_2_short_name_within_group(
# src_path=r'E:\CVPR\CVPR_WS_2024\main_paper',
# save_path=r'E:\short_name_cvpr2024_ws',
# target_max_length=128
# )
pass
================================================
FILE: code/paper_downloader_ECCV.py
================================================
"""paper_downloader_ECCV.py"""
import urllib
from bs4 import BeautifulSoup
import pickle
import os
from tqdm import tqdm
from slugify import slugify
import csv
import sys
root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_folder)
from lib.supplement_porcess import move_main_and_supplement_2_one_directory
import lib.springer as springer
from lib import csv_process
from lib.downloader import Downloader
from lib.my_request import urlopen_with_retry
def save_csv(year):
"""
write ECCV papers' and supplemental material's urls in one csv file
:param year: int
:return: int, the total number of collected papers
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
csv_file_pathname = os.path.join(
project_root_folder, 'csv', f'ECCV_{year}.csv')
with open(csv_file_pathname, 'w', newline='') as csvfile:
fieldnames = ['title', 'main link', 'supplemental link']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
'Gecko/20100101 Firefox/23.0'}
dat_file_pathname = os.path.join(
project_root_folder, 'urls', f'init_url_ECCV_{year}.dat')
if year >= 2018:
init_url = f'https://www.ecva.net/papers.php'
if os.path.exists(dat_file_pathname):
with open(dat_file_pathname, 'rb') as f:
content = pickle.load(f)
else:
content = urlopen_with_retry(url=init_url, headers=headers)
with open(dat_file_pathname, 'wb') as f:
pickle.dump(content, f)
soup = BeautifulSoup(content, 'html5lib')
paper_list_bar = tqdm(soup.find_all(['dt', 'dd']))
paper_index = 0
paper_dict = {'title': '',
'main link': '',
'supplemental link': ''}
for paper in paper_list_bar:
is_new_paper = False
# get title
try:
if 'dt' == paper.name and \
'ptitle' == paper.get('class')[0] and \
year == int(paper.a.get('href').split('_')[1][:4]): # title:
# this_year = int(paper.a.get('href').split('_')[1][:4])
title = slugify(paper.text.strip())
paper_dict['title'] = title
paper_index += 1
paper_list_bar.set_description_str(
f'Downloading paper {paper_index}: {title}')
elif '' != paper_dict['title'] and 'dd' == paper.name:
all_as = paper.find_all('a')
for a in all_as:
if 'pdf' == slugify(a.text.strip()):
main_link = urllib.parse.urljoin(init_url,
a.get('href'))
paper_dict['main link'] = main_link
is_new_paper = True
elif 'supp' in slugify(a.text.strip()):
supp_link = urllib.parse.urljoin(init_url,
a.get('href'))
paper_dict['supplemental link'] = supp_link
break
except:
pass
if is_new_paper:
writer.writerow(paper_dict)
paper_dict = {'title': '',
'main link': '',
'supplemental link': ''}
else:
init_url = f'http://www.eccv{year}.org/main-conference/'
if os.path.exists(dat_file_pathname):
with open(dat_file_pathname, 'rb') as f:
content = pickle.load(f)
else:
content = urlopen_with_retry(url=init_url, headers=headers)
with open(dat_file_pathname, 'wb') as f:
pickle.dump(content, f)
soup = BeautifulSoup(content, 'html5lib')
paper_list_bar = tqdm(
soup.find('div', {'class': 'entry-content'}).find_all(['p']))
paper_index = 0
paper_dict = {'title': '',
'main link': '',
'supplemental link': ''}
for paper in paper_list_bar:
try:
if len(paper.find_all(['strong'])) and len(
paper.find_all(['a'])) and len(
paper.find_all(['img'])):
paper_index += 1
title = slugify(paper.find('strong').text)
paper_dict['title'] = title
paper_list_bar.set_description_str(
f'Downloading paper {paper_index}: {title}')
main_link = paper.find('a').get('href')
paper_dict['main link'] = main_link
writer.writerow(paper_dict)
paper_dict = {'title': '',
'main link': '',
'supplemental link': ''}
except Exception as e:
print(f'ERROR: {str(e)}')
return paper_index
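# Minimal usage sketch (the year below is illustrative only): build the csv of
# paper/supplement urls first, then feed the returned paper count into
# download_from_csv defined below.
# total_paper_number = save_csv(2022)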
def download_from_csv(
year, save_dir, is_download_supplement=True, time_step_in_seconds=5,
total_paper_number=None,
is_workshops=False, downloader='IDM'):
"""
download all ECCV papers and supplement files of the given year, stored in
save_dir/main_paper and save_dir/supplement respectively
:param year: int, ECCV year, such as 2019
:param save_dir: str, paper and supplement material's save path
:param is_download_supplement: bool, True for downloading supplemental
material
:param time_step_in_seconds: int, the interval time between two download
requests in seconds
:param total_paper_number: int, the total number of papers that are going
to be downloaded
:param is_workshops: bool, whether to download workshop papers from the csv
file.
:param downloader: str, the downloader to download, could be 'IDM' or
'Thunder', default to 'IDM'
:return: True
"""
postfix = f'ECCV_{year}'
if is_workshops:
postfix = f'ECCV_WS_{year}'
csv_file_name = f'ECCV_{year}.csv' if not is_workshops else \
f'ECCV_WS_{year}.csv'
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
csv_file_name = os.path.join(project_root_folder, 'csv', csv_file_name)
csv_process.download_from_csv(
postfix=postfix,
save_dir=save_dir,
csv_file_path=csv_file_name,
is_download_supplement=is_download_supplement,
time_step_in_seconds=time_step_in_seconds,
total_paper_number=total_paper_number,
downloader=downloader
)
def download_from_springer(
year, save_dir, is_workshops=False, time_sleep_in_seconds=5,
downloader='IDM'):
os.makedirs(save_dir, exist_ok=True)
if 2018 == year:
if not is_workshops:
urls_list = [
'https://link.springer.com/book/10.1007/978-3-030-01246-5',
'https://link.springer.com/book/10.1007/978-3-030-01216-8',
'https://link.springer.com/book/10.1007/978-3-030-01219-9',
'https://link.springer.com/book/10.1007/978-3-030-01225-0',
'https://link.springer.com/book/10.1007/978-3-030-01228-1',
'https://link.springer.com/book/10.1007/978-3-030-01231-1',
'https://link.springer.com/book/10.1007/978-3-030-01234-2',
'https://link.springer.com/book/10.1007/978-3-030-01237-3',
'https://link.springer.com/book/10.1007/978-3-030-01240-3',
'https://link.springer.com/book/10.1007/978-3-030-01249-6',
'https://link.springer.com/book/10.1007/978-3-030-01252-6',
'https://link.springer.com/book/10.1007/978-3-030-01258-8',
'https://link.springer.com/book/10.1007/978-3-030-01261-8',
'https://link.springer.com/book/10.1007/978-3-030-01264-9',
'https://link.springer.com/book/10.1007/978-3-030-01267-0',
'https://link.springer.com/book/10.1007/978-3-030-01270-0'
]
else:
urls_list = [
'https://link.springer.com/book/10.1007/978-3-030-11009-3',
'https://link.springer.com/book/10.1007/978-3-030-11012-3',
'https://link.springer.com/book/10.1007/978-3-030-11015-4',
'https://link.springer.com/book/10.1007/978-3-030-11018-5',
'https://link.springer.com/book/10.1007/978-3-030-11021-5',
'https://link.springer.com/book/10.1007/978-3-030-11024-6'
]
elif 2016 == year:
if not is_workshops:
urls_list = [
'https://link.springer.com/book/10.1007%2F978-3-319-46448-0',
'https://link.springer.com/book/10.1007%2F978-3-319-46475-6',
'https://link.springer.com/book/10.1007%2F978-3-319-46487-9',
'https://link.springer.com/book/10.1007%2F978-3-319-46493-0',
'https://link.springer.com/book/10.1007%2F978-3-319-46454-1',
'https://link.springer.com/book/10.1007%2F978-3-319-46466-4',
'https://link.springer.com/book/10.1007%2F978-3-319-46478-7',
'https://link.springer.com/book/10.1007%2F978-3-319-46484-8'
]
else:
urls_list = [
'https://link.springer.com/book/10.1007%2F978-3-319-46604-0',
'https://link.springer.com/book/10.1007%2F978-3-319-48881-3',
'https://link.springer.com/book/10.1007%2F978-3-319-49409-8'
]
elif 2014 == year:
if not is_workshops:
urls_list = [
'https://link.springer.com/book/10.1007/978-3-319-10590-1',
'https://link.springer.com/book/10.1007/978-3-319-10605-2',
'https://link.springer.com/book/10.1007/978-3-319-10578-9',
'https://link.springer.com/book/10.1007/978-3-319-10593-2',
'https://link.springer.com/book/10.1007/978-3-319-10602-1',
'https://link.springer.com/book/10.1007/978-3-319-10599-4',
'https://link.springer.com/book/10.1007/978-3-319-10584-0'
]
else:
urls_list = [
'https://link.springer.com/book/10.1007/978-3-319-16178-5',
'https://link.springer.com/book/10.1007/978-3-319-16181-5',
'https://link.springer.com/book/10.1007/978-3-319-16199-0',
'https://link.springer.com/book/10.1007/978-3-319-16220-1'
]
elif 2012 == year:
if not is_workshops:
urls_list = [
'https://link.springer.com/book/10.1007/978-3-642-33718-5',
'https://link.springer.com/book/10.1007/978-3-642-33709-3',
'https://link.springer.com/book/10.1007/978-3-642-33712-3',
'https://link.springer.com/book/10.1007/978-3-642-33765-9',
'https://link.springer.com/book/10.1007/978-3-642-33715-4',
'https://link.springer.com/book/10.1007/978-3-642-33783-3',
'https://link.springer.com/book/10.1007/978-3-642-33786-4'
]
else:
urls_list = [
'https://link.springer.com/book/10.1007/978-3-642-33863-2',
'https://link.springer.com/book/10.1007/978-3-642-33868-7',
'https://link.springer.com/book/10.1007/978-3-642-33885-4'
]
elif 2010 == year:
if not is_workshops:
urls_list = [
'https://link.springer.com/book/10.1007/978-3-642-15549-9',
'https://link.springer.com/book/10.1007/978-3-642-15552-9',
'https://link.springer.com/book/10.1007/978-3-642-15558-1',
'https://link.springer.com/book/10.1007/978-3-642-15561-1',
'https://link.springer.com/book/10.1007/978-3-642-15555-0',
'https://link.springer.com/book/10.1007/978-3-642-15567-3'
]
else:
urls_list = [
'https://link.springer.com/book/10.1007/978-3-642-35749-7',
'https://link.springer.com/book/10.1007/978-3-642-35740-4'
]
elif 2008 == year:
if not is_workshops:
urls_list = [
'https://link.springer.com/book/10.1007/978-3-540-88682-2',
'https://link.springer.com/book/10.1007/978-3-540-88688-4',
'https://link.springer.com/book/10.1007/978-3-540-88690-7',
'https://link.springer.com/book/10.1007/978-3-540-88693-8'
]
else:
urls_list = []
elif 2006 == year:
if not is_workshops:
urls_list = [
'https://link.springer.com/book/10.1007/11744023',
'https://link.springer.com/book/10.1007/11744047',
'https://link.springer.com/book/10.1007/11744078',
'https://link.springer.com/book/10.1007/11744085'
]
else:
urls_list = [
'https://link.springer.com/book/10.1007/11754336'
]
elif 2004 == year:
if not is_workshops:
urls_list = [
'https://link.springer.com/book/10.1007/b97865',
'https://link.springer.com/book/10.1007/b97866',
'https://link.springer.com/book/10.1007/b97871',
'https://link.springer.com/book/10.1007/b97873'
]
else:
urls_list = [
]
elif 2002 == year:
if not is_workshops:
urls_list = [
'https://link.springer.com/book/10.1007/3-540-47969-4',
'https://link.springer.com/book/10.1007/3-540-47967-8',
'https://link.springer.com/book/10.1007/3-540-47977-5',
'https://link.springer.com/book/10.1007/3-540-47979-1'
]
else:
urls_list = [
]
elif 2000 == year:
if not is_workshops:
urls_list = [
'https://link.springer.com/book/10.1007/3-540-45054-8',
'https://link.springer.com/book/10.1007/3-540-45053-X'
]
else:
urls_list = [
]
elif 1998 == year:
if not is_workshops:
urls_list = [
'https://link.springer.com/book/10.1007/BFb0055655',
'https://link.springer.com/book/10.1007/BFb0054729'
]
else:
urls_list = [
]
elif 1996 == year:
if not is_workshops:
urls_list = [
'https://link.springer.com/book/10.1007/BFb0015518',
'https://link.springer.com/book/10.1007/3-540-61123-1'
]
else:
urls_list = [
]
elif 1994 == year:
if not is_workshops:
urls_list = [
'https://link.springer.com/book/10.1007/3-540-57956-7',
'https://link.springer.com/book/10.1007/BFb0028329'
]
else:
urls_list = [
]
elif 1992 == year:
if not is_workshops:
urls_list = [
'https://link.springer.com/book/10.1007/3-540-55426-2'
]
else:
urls_list = [
]
elif 1990 == year:
if not is_workshops:
urls_list = [
'https://link.springer.com/book/10.1007/BFb0014843'
]
else:
urls_list = [
]
else:
raise ValueError(f'ECCV {year} is currently not available!')
for url in urls_list:
__download_from_springer(
url, save_dir, year, is_workshops=is_workshops,
time_sleep_in_seconds=time_sleep_in_seconds,
downloader=downloader)
def __download_from_springer(
url, save_dir, year, is_workshops=False, time_sleep_in_seconds=5,
downloader='IDM'):
downloader = Downloader(downloader)
papers_dict = None
for i in range(3):
try:
papers_dict = springer.get_paper_name_link_from_url(url)
break
except Exception as e:
print(str(e))
if papers_dict is None:
# all three attempts failed, skip this volume instead of raising NameError
print(f'failed to get the paper list from {url}, skip it!')
return
# total_paper_number = len(papers_dict)
pbar = tqdm(papers_dict.keys())
postfix = f'ECCV_{year}'
if is_workshops:
postfix = f'ECCV_WS_{year}'
for name in pbar:
pbar.set_description(f'Downloading paper {name}')
if not os.path.exists(os.path.join(save_dir, f'{name}_{postfix}.pdf')):
downloader.download(
papers_dict[name],
os.path.join(save_dir, f'{name}_{postfix}.pdf'),
time_sleep_in_seconds)
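# Minimal usage sketch (hedged): fetch a single ECCV 2018 volume from one of
# the Springer book urls listed above; the save_dir is an example path only.
# __download_from_springer(
#     'https://link.springer.com/book/10.1007/978-3-030-01246-5',
#     save_dir=r'F:\ECCV_2018', year=2018, is_workshops=False,
#     time_sleep_in_seconds=30, downloader='IDM')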
if __name__ == '__main__':
year = 2024
# total_paper_number = 2387
total_paper_number = save_csv(year)
download_from_csv(year,
save_dir=fr'Z:\all_papers\ECCV\ECCV_{year}',
is_download_supplement=True,
time_step_in_seconds=5,
total_paper_number=total_paper_number,
is_workshops=False)
# move_main_and_supplement_2_one_directory(
# main_path=f'E:\\ECCV_{year}\\main_paper',
# supplement_path=f'E:\\ECCV_{year}\\supplement',
# supp_pdf_save_path=f'E:\\ECCV_{year}\\main_paper'
# )
# for year in range(2018, 2017, -2):
# # download_from_springer(
# # save_dir=f'F:\\ECCV_{year}',
# # year=year,
# # is_workshops=False, time_sleep_in_seconds=30)
# download_from_springer(
# save_dir=f'F:\\ECCV_WS_{year}',
# year=year,
# is_workshops=True, time_sleep_in_seconds=30)
# pass
================================================
FILE: code/paper_downloader_ICLR.py
================================================
"""paper_downloader_ICLR.py"""
from tqdm import tqdm
import os
# https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename
from slugify import slugify
from bs4 import BeautifulSoup
import pickle
from urllib.request import urlopen
import urllib
import sys
root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_folder)
from lib.downloader import Downloader
from lib.openreview import download_iclr_papers_given_url_and_group_id
from lib.arxiv import get_pdf_link_from_arxiv
def download_iclr_oral_papers(save_dir, year, base_url=None,
time_step_in_seconds=10, downloader='IDM',
start_page=1, proxy_ip_port=None):
"""
Download iclr oral papers for years 2013, 2017 ~ 2022 and 2024 ~ 2026.
:param save_dir: str, paper save path
:param year: int, iclr year; see group_id_dict for the supported years
:param base_url: str, paper website url
:param time_step_in_seconds: int, the interval time between two download
requests in seconds.
:param downloader: str, the downloader to download, could be 'IDM' or
None, default to 'IDM'.
:param start_page: int, the initial downloading webpage number, only the
pages whose number is equal to or greater than this number will be
processed. Currently, this parameter is only used in year 2024.
Default: 1.
:param proxy_ip_port: str or None, proxy ip address and port,
e.g. "127.0.0.1:7890". Default: None.
:type proxy_ip_port: str | None
:return:
"""
group_id_dict = {
2026: "tab-accept-oral",
2025: "tab-accept-oral",
2024: "tab-accept-oral",
2022: "oral-submissions",
2021: "oral-presentations",
2020: "oral-presentations",
2019: "oral-presentations",
2018: "accepted-oral-papers",
2017: "oral-presentations",
2013: "conferenceoral-iclr2013-conference"
}
if base_url is None:
if year in group_id_dict:
base_url = 'https://openreview.net/group?id=ICLR.cc/' \
f'{year}/Conference#{group_id_dict[year]}'
else:
raise ValueError('the website url is not given for this year!')
print(f'Downloading ICLR-{year} oral papers...')
group_id = group_id_dict[year].replace('tab-', '')
download_iclr_papers_given_url_and_group_id(
save_dir=save_dir,
year=year,
base_url=base_url,
group_id=group_id,
start_page=start_page,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port,
is_have_pages=(year > 2021)
)
def download_iclr_conditional_oral_papers(save_dir, year, base_url=None,
time_step_in_seconds=10, downloader='IDM',
start_page=1, proxy_ip_port=None):
"""
Download iclr conditional oral papers for year 2025.
:param save_dir: str, paper save path
:param year: int, iclr year, currently only 2025 is supported
:param base_url: str, paper website url
:param time_step_in_seconds: int, the interval time between two download
requests in seconds.
:param downloader: str, the downloader to download, could be 'IDM' or
None, default to 'IDM'.
:param start_page: int, the initial downloading webpage number, only the
pages whose number is equal to or greater than this number will be
processed. Currently, this parameter is only used in year 2024.
Default: 1.
:param proxy_ip_port: str or None, proxy ip address and port,
e.g. "127.0.0.1:7890". Default: None.
:type proxy_ip_port: str | None
:return:
"""
group_id_dict = {
2025: "tab-accept-conditional-oral"
}
no_pages_year = [2025]
if base_url is None:
if year in group_id_dict:
base_url = 'https://openreview.net/group?id=ICLR.cc/' \
f'{year}/Conference#{group_id_dict[year]}'
else:
raise ValueError('the website url is not given for this year!')
print(f'Downloading ICLR-{year} conditional oral papers...')
group_id = group_id_dict[year].replace('tab-', '')
download_iclr_papers_given_url_and_group_id(
save_dir=save_dir,
year=year,
base_url=base_url,
group_id=group_id,
start_page=start_page,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port,
is_have_pages=(year not in no_pages_year)
)
def download_iclr_top5_papers(save_dir, year, base_url=None, start_page=1,
time_step_in_seconds=10, downloader='IDM',
proxy_ip_port=None):
"""
Download iclr notable-top-5% papers for year 2023.
:param save_dir: str, paper save path
:param year: int, iclr year
:type year: int
:param base_url: str, paper website url
:param start_page: int, the initial downloading webpage number, only the
pages whose number is equal to or greater than this number will be
processed. Default: 1
:param time_step_in_seconds: int, the interval time between two download
requests in seconds. Default: 10.
:type time_step_in_seconds: int
:param downloader: str, the downloader to download, could be 'IDM' or
None. Default: 'IDM'.
:param proxy_ip_port: str or None, proxy ip address and port,
e.g. "127.0.0.1:7890". Default: None.
:type proxy_ip_port: str | None
:return:
"""
if base_url is None:
if year == 2023:
base_url = "https://openreview.net/group?id=ICLR.cc/" \
"2023/Conference#notable-top-5-"
else:
raise ValueError('the website url is not given for this year!')
print(f'Downloading ICLR-{year} top5 papers...')
group_id = "notable-top-5-"
return download_iclr_papers_given_url_and_group_id(
save_dir=save_dir,
year=year,
base_url=base_url,
group_id=group_id,
start_page=start_page,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port
)
def download_iclr_poster_papers(save_dir, year, base_url=None, start_page=1,
time_step_in_seconds=10, downloader='IDM',
proxy_ip_port=None):
"""
Download iclr poster papers for years 2013 and 2017 ~ 2026.
:param save_dir: str, paper save path
:param year: int, iclr year; see group_id_dict for the supported years
:param base_url: str, paper website url
:param start_page: int, the initial downloading webpage number, only the
pages whose number is equal to or greater than this number will be
processed. Default: 1
:param time_step_in_seconds: int, the interval time between two download
requests in seconds
:param downloader: str, the downloader to download, could be 'IDM' or
None. Default: 'IDM'
:param proxy_ip_port: str or None, proxy ip address and port,
e.g. "127.0.0.1:7890". Default: None.
:type proxy_ip_port: str | None
:return:
"""
group_id_dict = {
2026: "tab-accept-poster",
2025: "tab-accept-poster",
2024: "tab-accept-poster",
2023: "poster",
2022: "poster-submissions",
2021: "poster-presentations",
2020: "poster-presentations",
2019: "poster-presentations",
2018: "accepted-poster-papers",
2017: "poster-presentations",
2013: "conferenceposter-iclr2013-conference"
}
if base_url is None:
if year in group_id_dict:
base_url = 'https://openreview.net/group?id=ICLR.cc/' \
f'{year}/Conference#{group_id_dict[year]}'
else:
raise ValueError('the website url is not given for this year!')
print(f'Downloading ICLR-{year} poster papers...')
no_pages_year = [2013, 2018, 2019, 2020, 2021]
download_iclr_papers_given_url_and_group_id(
save_dir=save_dir,
year=year,
base_url=base_url,
group_id=group_id_dict[year].replace('tab-', ''),
start_page=start_page,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port,
is_have_pages=(year not in no_pages_year),
is_need_click_group_button=(year == 2018)
)
def download_iclr_conditional_poster_papers(save_dir, year, base_url=None,
time_step_in_seconds=10, downloader='IDM',
start_page=1, proxy_ip_port=None):
"""
Download iclr conditional poster papers for year 2025.
:param save_dir: str, paper save path
:param year: int, iclr year, currently only 2025 is supported
:param base_url: str, paper website url
:param time_step_in_seconds: int, the interval time between two download
requests in seconds.
:param downloader: str, the downloader to download, could be 'IDM' or
None, default to 'IDM'.
:param start_page: int, the initial downloading webpage number, only the
pages whose number is equal to or greater than this number will be
processed. Currently, this parameter is only used in year 2024.
Default: 1.
:param proxy_ip_port: str or None, proxy ip address and port,
e.g. "127.0.0.1:7890". Default: None.
:type proxy_ip_port: str | None
:return:
"""
group_id_dict = {
2025: "tab-accept-conditional-poster"
}
if base_url is None:
if year in group_id_dict:
base_url = 'https://openreview.net/group?id=ICLR.cc/' \
f'{year}/Conference#{group_id_dict[year]}'
else:
raise ValueError('the website url is not given for this year!')
print(f'Downloading ICLR-{year} conditional poster papers...')
group_id = group_id_dict[year].replace('tab-', '')
download_iclr_papers_given_url_and_group_id(
save_dir=save_dir,
year=year,
base_url=base_url,
group_id=group_id,
start_page=start_page,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port,
is_have_pages=(year > 2021)
)
def download_iclr_spotlight_papers(save_dir, year, base_url=None,
time_step_in_seconds=10, downloader='IDM',
start_page=1, proxy_ip_port=None):
"""
Download iclr spotlight papers between year 2020 and 2022, 2024~2025.
:param save_dir: str, paper save path
:param year: int, iclr year; see group_id_dict for the supported years
:param base_url: str, paper website url
:param time_step_in_seconds: int, the interval time between two download
requests in seconds
:param downloader: str, the downloader to download, could be 'IDM' or
None, default to 'IDM'
:param start_page: int, the initial downloading webpage number, only the
pages whose number is equal to or greater than this number will be
processed. Currently, this parameter is only used in year 2024.
Default: 1.
:param proxy_ip_port: str or None, proxy ip address and port,
e.g. "127.0.0.1:7890". Default: None.
:return:
"""
group_id_dict = {
2025: "tab-accept-spotlight",
2024: "tab-accept-spotlight",
2022: "spotlight-submissions",
2021: "spotlight-presentations",
2020: "spotlight-presentations",
}
if base_url is None:
if year in group_id_dict:
base_url = 'https://openreview.net/group?id=ICLR.cc/' \
f'{year}/Conference#{group_id_dict[year]}'
else:
raise ValueError('the website url is not given for this year!')
print(f'Downloading ICLR-{year} spotlight papers...')
no_pages_year = [2020, 2021]
download_iclr_papers_given_url_and_group_id(
save_dir=save_dir,
year=year,
base_url=base_url,
group_id=group_id_dict[year].replace('tab-', ''),
start_page=start_page,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port,
is_have_pages=(year not in no_pages_year)
)
def download_iclr_conditional_spotlight_papers(save_dir, year, base_url=None,
time_step_in_seconds=10, downloader='IDM',
start_page=1, proxy_ip_port=None):
"""
Download iclr conditional spotlight papers for year 2025.
:param save_dir: str, paper save path
:param year: int, iclr year, currently only 2025 is supported
:param base_url: str, paper website url
:param time_step_in_seconds: int, the interval time between two download
requests in seconds.
:param downloader: str, the downloader to download, could be 'IDM' or
None, default to 'IDM'.
:param start_page: int, the initial downloading webpage number, only the
pages whose number is equal to or greater than this number will be
processed. Currently, this parameter is only used in year 2024.
Default: 1.
:param proxy_ip_port: str or None, proxy ip address and port,
e.g. "127.0.0.1:7890". Default: None.
:type proxy_ip_port: str | None
:return:
"""
group_id_dict = {
2025: "tab-accept-conditional-spotlight"
}
no_pages_year = [2025]
if base_url is None:
if year in group_id_dict:
base_url = 'https://openreview.net/group?id=ICLR.cc/' \
f'{year}/Conference#{group_id_dict[year]}'
else:
raise ValueError('the website url is not given for this year!')
print(f'Downloading ICLR-{year} conditional spotlight papers...')
group_id = group_id_dict[year].replace('tab-', '')
download_iclr_papers_given_url_and_group_id(
save_dir=save_dir,
year=year,
base_url=base_url,
group_id=group_id,
start_page=start_page,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port,
is_have_pages=(year not in no_pages_year)
)
def download_iclr_top25_papers(save_dir, year, base_url=None, start_page=1,
time_step_in_seconds=10, downloader='IDM',
proxy_ip_port=None):
"""
Download iclr notable-top-25% papers for year 2023.
:param save_dir: str, paper save path
:param year: int, iclr year
:type year: int
:param base_url: str, paper website url
:param start_page: int, the initial downloading webpage number, only the
pages whose number is equal to or greater than this number will be
processed. Default: 1
:param time_step_in_seconds: int, the interval time between two download
requests in seconds. Default: 10.
:type time_step_in_seconds: int
:param downloader: str, the downloader to download, could be 'IDM' or
None. Default: 'IDM'.
:param proxy_ip_port: str or None, proxy ip address and port,
e.g. "127.0.0.1:7890". Default: None.
:type proxy_ip_port: str | None
:return:
"""
if base_url is None:
if year == 2023:
base_url = "https://openreview.net/group?id=ICLR.cc/" \
"2023/Conference#notable-top-25-"
else:
raise ValueError('the website url is not given for this year!')
print(f'Downloading ICLR-{year} top25 papers...')
group_id = "notable-top-25-"
download_iclr_papers_given_url_and_group_id(
save_dir=save_dir,
year=year,
base_url=base_url,
group_id=group_id,
start_page=start_page,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port
)
def download_iclr_paper(save_dir, year, base_url=None,
time_step_in_seconds=10, downloader='IDM',
start_page=1, proxy_ip_port=None):
"""
Download iclr papers between year 2013 and 2026.
:param save_dir: str, paper save path
:param year: int, iclr year, currently 2013 ~ 2026 are supported
:param base_url: str, paper website url
:param time_step_in_seconds: int, the interval time between two download
requests in seconds.
:param downloader: str, the downloader to download, could be 'IDM' or
None, default to 'IDM'.
:param start_page: int, the initial downloading webpage number, only the
pages whose number is equal to or greater than this number will be
processed. Currently, this parameter is only used in year 2024.
Default: 1.
:param proxy_ip_port: str or None, proxy ip address and port,
e.g. "127.0.0.1:7890". Default: None.
:type proxy_ip_port: str | None
:return:
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
year_no_group = [2014]
year_no_group_iclrcc = [2015, 2016]
year_oral_poster = [2013, 2017, 2018, 2019, 2026]
year_oral_spotlight_poster = [2020, 2021, 2022, 2024, 2025]
year_top5_top25_poster = [2023]
year_oral_spotlight_poster_conditional = [2025]
# no group, openreview website
if year in year_no_group:
if base_url is None:
if year == 2014:
base_url = 'https://openreview.net/group?id=ICLR.cc/2014/conference'
else:
raise ValueError('the website url is not given for this year!')
print(f'Downloading ICLR-{year} oral papers...')
group_id_dict = {
2014: "submitted-papers"
}
group_id = group_id_dict[year]
no_pages_year = [2014]
return download_iclr_papers_given_url_and_group_id(
save_dir=save_dir,
year=year,
base_url=base_url,
group_id=group_id,
start_page=start_page,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port,
is_have_pages=(year not in no_pages_year)
)
# no group, iclr.cc website
if year in year_no_group_iclrcc:
downloader = Downloader(downloader=downloader)
paper_postfix = f'ICLR_{year}'
if base_url is None:
if year == 2016:
base_url = 'https://iclr.cc/archive/www/doku.php%3Fid=iclr2016:main.html'
elif year == 2015:
base_url = 'https://iclr.cc/archive/www/doku.php%3Fid=iclr2015:main.html'
elif year == 2014:
base_url = 'https://iclr.cc/archive/2014/conference-proceedings/'
else:
raise ValueError('the website url is not given for this year!')
os.makedirs(save_dir, exist_ok=True)
if year == 2015: # oral and poster separated
oral_save_path = os.path.join(save_dir, 'oral')
poster_save_path = os.path.join(save_dir, 'poster')
workshop_save_path = os.path.join(save_dir, 'ws')
os.makedirs(oral_save_path, exist_ok=True)
os.makedirs(poster_save_path, exist_ok=True)
os.makedirs(workshop_save_path, exist_ok=True)
dat_file_pathname = os.path.join(
project_root_folder, 'urls', f'init_url_iclr_{year}.dat'
)
if os.path.exists(dat_file_pathname):
with open(dat_file_pathname, 'rb') as f:
content = pickle.load(f)
else:
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
'Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url=base_url, headers=headers)
content = urllib.request.urlopen(req).read()
with open(dat_file_pathname, 'wb') as f:
pickle.dump(content, f)
error_log = []
soup = BeautifulSoup(content, 'html.parser')
print('open url successfully!')
if year == 2016:
papers = soup.find('h3',
{
'id': 'accepted_papers_conference_track'}).findNext(
'div').find_all('a')
for paper in tqdm(papers):
link = paper.get('href')
if link.startswith('http://arxiv'):
title = slugify(paper.text)
pdf_name = f'{title}_{paper_postfix}.pdf'
try:
if not os.path.exists(
os.path.join(save_dir,
title + f'_{paper_postfix}.pdf')):
pdf_link = get_pdf_link_from_arxiv(link)
print(f'downloading {title}')
downloader.download(
urls=pdf_link,
save_path=os.path.join(save_dir, pdf_name),
time_sleep_in_seconds=time_step_in_seconds
)
except Exception as e:
# error_flag = True
print('Error: ' + title + ' - ' + str(e))
error_log.append(
(title, link, 'paper download error', str(e)))
# workshops
papers = soup.find('h3',
{
'id': 'workshop_track_posters_may_2nd'}).findNext(
'div').find_all('a')
for paper in tqdm(papers):
link = paper.get('href')
if link.startswith('http://beta.openreview'):
title = slugify(paper.text)
pdf_name = f'{title}_ICLR_WS_{year}.pdf'
try:
if not os.path.exists(
os.path.join(save_dir, 'ws', pdf_name)):
pdf_link = get_pdf_link_from_openreview(link)
print(f'downloading {title}')
downloader.download(
urls=pdf_link,
save_path=os.path.join(save_dir, 'ws',
pdf_name),
time_sleep_in_seconds=time_step_in_seconds
)
except Exception as e:
# error_flag = True
print('Error: ' + title + ' - ' + str(e))
error_log.append(
(title, link, 'paper download error', str(e)))
papers = soup.find('h3',
{
'id': 'workshop_track_posters_may_3rd'}).findNext(
'div').find_all('a')
for paper in tqdm(papers):
link = paper.get('href')
if link.startswith('http://beta.openreview'):
title = slugify(paper.text)
pdf_name = f'{title}_ICLR_WS_{year}.pdf'
try:
if not os.path.exists(
os.path.join(save_dir, 'ws', pdf_name)):
pdf_link = get_pdf_link_from_openreview(link)
print(f'downloading {title}')
downloader.download(
urls=pdf_link,
save_path=os.path.join(save_dir, 'ws',
pdf_name),
time_sleep_in_seconds=time_step_in_seconds
)
except Exception as e:
# error_flag = True
print('Error: ' + title + ' - ' + str(e))
error_log.append(
(title, link, 'paper download error', str(e)))
elif year == 2015:
# oral papers
oral_papers = soup.find('h3', {
'id': 'conference_oral_presentations'}).findNext(
'div').find_all(
'a')
for paper in tqdm(oral_papers):
link = paper.get('href')
if link.startswith('http://arxiv'):
title = slugify(paper.text)
pdf_name = f'{title}_{paper_postfix}.pdf'
try:
if not os.path.exists(
os.path.join(oral_save_path,
title + f'_{paper_postfix}.pdf')):
pdf_link = get_pdf_link_from_arxiv(link)
print(f'downloading {title}')
downloader.download(
urls=pdf_link,
save_path=os.path.join(oral_save_path,
pdf_name),
time_sleep_in_seconds=time_step_in_seconds
)
except Exception as e:
# error_flag = True
print('Error: ' + title + ' - ' + str(e))
error_log.append(
(title, link, 'paper download error', str(e)))
# workshops papers
workshop_papers = soup.find('h3', {
'id': 'may_7_workshop_poster_session'}).findNext(
'div').find_all(
'a')
# extend (not append) so the second session's links are added as
# individual <a> tags rather than one nested list
workshop_papers.extend(
soup.find('h3',
{'id': 'may_8_workshop_poster_session'}).findNext(
'div').find_all('a'))
for paper in tqdm(workshop_papers):
link = paper.get('href')
if link.startswith('http://arxiv'):
title = slugify(paper.text)
pdf_name = f'{title}_ICLR_WS_{year}.pdf'
try:
if not os.path.exists(
os.path.join(workshop_save_path, pdf_name)):
pdf_link = get_pdf_link_from_arxiv(link)
print(f'downloading {title}')
downloader.download(
urls=pdf_link,
save_path=os.path.join(workshop_save_path,
pdf_name),
time_sleep_in_seconds=time_step_in_seconds)
except Exception as e:
# error_flag = True
print('Error: ' + title + ' - ' + str(e))
error_log.append(
(title, link, 'paper download error', str(e)))
# poster papers
poster_papers = soup.find('h3', {
'id': 'may_9_conference_poster_session'}).findNext(
'div').find_all(
'a')
for paper in tqdm(poster_papers):
link = paper.get('href')
if link.startswith('http://arxiv'):
title = slugify(paper.text)
pdf_name = f'{title}_{paper_postfix}.pdf'
try:
if not os.path.exists(
os.path.join(poster_save_path,
title + f'_{paper_postfix}.pdf')):
pdf_link = get_pdf_link_from_arxiv(link)
print(f'downloading {title}')
downloader.download(
urls=pdf_link,
save_path=os.path.join(poster_save_path,
pdf_name),
time_sleep_in_seconds=time_step_in_seconds)
except Exception as e:
# error_flag = True
print('Error: ' + title + ' - ' + str(e))
error_log.append(
(title, link, 'paper download error', str(e)))
elif year == 2014:
papers = soup.find('div',
{'id': 'sites-canvas-main-content'}).find_all(
'a')
for paper in tqdm(papers):
link = paper.get('href')
if link.startswith('http://arxiv'):
title = slugify(paper.text)
pdf_name = f'{title}_{paper_postfix}.pdf'
try:
if not os.path.exists(os.path.join(save_dir, pdf_name)):
pdf_link = get_pdf_link_from_arxiv(link)
print(f'downloading {title}')
downloader.download(
urls=pdf_link,
save_path=os.path.join(save_dir, pdf_name),
time_sleep_in_seconds=time_step_in_seconds)
except Exception as e:
# error_flag = True
print('Error: ' + title + ' - ' + str(e))
error_log.append(
(title, link, 'paper download error', str(e)))
# workshops
paper_postfix = f'ICLR_WS_{year}'
base_url = 'https://sites.google.com/site/representationlearning2014/' \
'workshop-proceedings'
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
'Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url=base_url, headers=headers)
content = urllib.request.urlopen(req).read()
soup = BeautifulSoup(content, 'html.parser')
workshop_save_path = os.path.join(save_dir, 'WS')
os.makedirs(workshop_save_path, exist_ok=True)
papers = soup.find(
'div', {'id': 'sites-canvas-main-content'}).find_all('a')
for paper in tqdm(papers):
link = paper.get('href')
if link.startswith('http://arxiv'):
title = slugify(paper.text)
pdf_name = f'{title}_{paper_postfix}.pdf'
try:
if not os.path.exists(
os.path.join(workshop_save_path, pdf_name)):
pdf_link = get_pdf_link_from_arxiv(link)
print(f'downloading {title}')
downloader.download(
urls=pdf_link,
save_path=os.path.join(workshop_save_path,
pdf_name),
time_sleep_in_seconds=time_step_in_seconds)
except Exception as e:
# error_flag = True
print('Error: ' + title + ' - ' + str(e))
error_log.append(
(title, link, 'paper download error', str(e)))
# write error log
print('write error log')
log_file_pathname = os.path.join(
project_root_folder, 'log', 'download_err_log.txt')
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
if e is not None:
f.write(e)
else:
f.write('None')
f.write('\n')
f.write('\n')
return True
# oral openreview
if year in (year_oral_poster + year_oral_spotlight_poster):
save_dir_oral = os.path.join(save_dir, 'oral')
download_iclr_oral_papers(
save_dir_oral,
year,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
start_page=start_page,
proxy_ip_port=proxy_ip_port
)
# conditional oral openreview
if year in (year_oral_spotlight_poster_conditional):
save_dir_cond_oral = os.path.join(save_dir, 'conditional-oral')
download_iclr_conditional_oral_papers(
save_dir_cond_oral,
year,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
start_page=start_page,
proxy_ip_port=proxy_ip_port
)
# poster openreview
if year in (year_oral_poster + year_oral_spotlight_poster +
year_top5_top25_poster):
save_dir_poster = os.path.join(save_dir, 'poster')
download_iclr_poster_papers(
save_dir_poster,
year,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
start_page=start_page,
proxy_ip_port=proxy_ip_port
)
# conditional poster openreview
if year in (year_oral_spotlight_poster_conditional):
save_dir_cond_poster = os.path.join(save_dir, 'conditional-poster')
download_iclr_conditional_poster_papers(
save_dir_cond_poster,
year,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
start_page=start_page,
proxy_ip_port=proxy_ip_port
)
# spotlight openreview
if year in year_oral_spotlight_poster:
save_dir_spotlight = os.path.join(save_dir, 'spotlight')
download_iclr_spotlight_papers(
save_dir_spotlight,
year,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
start_page=start_page,
proxy_ip_port=proxy_ip_port
)
# conditional spotlight openreview
if year in (year_oral_spotlight_poster_conditional):
save_dir_cond_spotlight = os.path.join(save_dir, 'conditional-spotlight')
download_iclr_conditional_spotlight_papers(
save_dir_cond_spotlight,
year,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
start_page=start_page,
proxy_ip_port=proxy_ip_port
)
# top5 openreview
if year in year_top5_top25_poster:
save_dir_top5 = os.path.join(save_dir, 'top5')
download_iclr_top5_papers(
save_dir_top5,
year,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
start_page=start_page,
proxy_ip_port=proxy_ip_port
)
# top25 openreview
if year in year_top5_top25_poster:
save_dir_top25 = os.path.join(save_dir, 'top25')
download_iclr_top25_papers(
save_dir_top25,
year,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
start_page=start_page,
proxy_ip_port=proxy_ip_port
)
def get_pdf_link_from_openreview(abs_link):
return abs_link.replace('beta.', '').replace('forum', 'pdf')
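# example: get_pdf_link_from_openreview('http://beta.openreview.net/forum?id=xxxx')
# returns 'http://openreview.net/pdf?id=xxxx'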
if __name__ == '__main__':
year = 2025
save_dir_iclr = rf'E:\ICLR_{year}'
# save_dir_iclr_oral = os.path.join(save_dir_iclr, 'oral')
# save_dir_iclr_top5 = os.path.join(save_dir_iclr, 'top5')
# save_dir_iclr_spotlight = os.path.join(save_dir_iclr, 'spotlight')
# save_dir_iclr_top25 = os.path.join(save_dir_iclr, 'top25')
# save_dir_iclr_poster = os.path.join(save_dir_iclr, 'poster')
proxy_ip_port = None
# proxy_ip_port = "http://127.0.0.1:7890"
# download_iclr_oral_papers(save_dir_iclr_oral, year,
# time_step_in_seconds=5)
# download_iclr_top5_papers(save_dir_iclr_top5, year, start_page=1,
# time_step_in_seconds=5,
# proxy_ip_port=proxy_ip_port)
# download_iclr_top25_papers(save_dir_iclr_top25, year, start_page=1,
# time_step_in_seconds=5,
# proxy_ip_port=proxy_ip_port)
# download_iclr_spotlight_papers(save_dir_iclr_spotlight, year,
# time_step_in_seconds=5)
# download_iclr_poster_papers(save_dir_iclr_poster, year, start_page=1,
# time_step_in_seconds=5,
# proxy_ip_port=proxy_ip_port)
download_iclr_paper(save_dir_iclr, year, time_step_in_seconds=5,
proxy_ip_port=proxy_ip_port)
================================================
FILE: code/paper_downloader_ICML.py
================================================
"""paper_downloader_ICML.py"""
import urllib
from bs4 import BeautifulSoup
import pickle
import os
from tqdm import tqdm
from slugify import slugify
import sys
root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_folder)
from lib.downloader import Downloader
import lib.pmlr as pmlr
from lib.supplement_porcess import merge_main_supplement
from lib.openreview import download_icml_papers_given_url_and_group_id
from lib.my_request import urlopen_with_retry
def download_paper(year, save_dir, is_download_supplement=True,
time_step_in_seconds=5, downloader='IDM', source='pmlr',
proxy_ip_port=None):
"""
download all ICML papers and supplement files of the given year, stored in
save_dir/main_paper and save_dir/supplement
respectively
:param year: int, ICML year, such as 2019
:param save_dir: str, paper and supplement material's save path
:param is_download_supplement: bool, True for downloading supplemental
material
:param time_step_in_seconds: int, the interval time between two download
requests in seconds
:param downloader: str, the downloader to download, could be 'IDM' or
'Thunder', default to 'IDM'
:param source: str, source website, 'pmlr' or 'openreview'
:param proxy_ip_port: str or None, proxy ip address and port,
e.g. "127.0.0.1:7890". Default: None.
:type proxy_ip_port: str | None
:return: True
"""
assert source in ['pmlr', 'openreview'], \
f'only support source pmlr or openreview, but get {source}'
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
downloader = Downloader(downloader=downloader, proxy_ip_port=proxy_ip_port)
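# Mapping from ICML year to its PMLR proceedings volume, e.g. ICML 2024 is
# hosted at http://proceedings.mlr.press/v235/ .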
ICML_year_dict = {
2024: 235,
2023: 202,
2022: 162,
2021: 139,
2020: 119,
2019: 97,
2018: 80,
2017: 70,
2016: 48,
2015: 37,
2014: 32,
2013: 28
}
if source == 'openreview':
init_url = f'https://openreview.net/group?id=ICML.cc/{year}/Conference'
else: # pmlr
if year >= 2013:
init_url = f'http://proceedings.mlr.press/v{ICML_year_dict[year]}/'
elif year == 2012:
init_url = 'https://icml.cc/2012/papers.1.html'
elif year == 2011:
init_url = 'http://www.icml-2011.org/papers.php'
elif 2009 == year:
init_url = 'https://icml.cc/Conferences/2009/abstracts.html'
elif 2008 == year:
init_url = 'http://www.machinelearning.org/archive/icml2008/' \
'abstracts.shtml'
elif 2007 == year:
init_url = 'https://icml.cc/Conferences/2007/paperlist.html'
elif year in [2006, 2004, 2005]:
init_url = f'https://icml.cc/Conferences/{year}/proceedings.html'
elif 2003 == year:
init_url = 'https://aaai.org/Library/ICML/icml03contents.php'
else:
raise ValueError('''the given year's url is unknown !''')
postfix = f'ICML_{year}'
if source == 'openreview': # download from openreview website:
# oral paper
group_id = 'oral'
save_dir_oral = os.path.join(save_dir, group_id)
os.makedirs(save_dir_oral, exist_ok=True)
download_icml_papers_given_url_and_group_id(
save_dir=save_dir_oral,
year=year,
base_url=init_url,
group_id=group_id,
start_page=1,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader.downloader,
proxy_ip_port=proxy_ip_port
)
# poster paper
group_id = 'poster'
save_dir_poster = os.path.join(save_dir, group_id)
os.makedirs(save_dir_poster, exist_ok=True)
download_icml_papers_given_url_and_group_id(
save_dir=os.path.join(save_dir, 'poster'),
year=year,
base_url=init_url,
group_id=group_id,
start_page=1,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader.downloader,
proxy_ip_port=proxy_ip_port
)
# spotlight paper
group_id = 'spotlight'
save_dir_spotlight = os.path.join(save_dir, group_id)
os.makedirs(save_dir_spotlight, exist_ok=True)
try:
download_icml_papers_given_url_and_group_id(
save_dir=os.path.join(save_dir, 'spotlight'),
year=year,
base_url=init_url,
group_id=group_id,
start_page=1,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader.downloader,
proxy_ip_port=proxy_ip_port
)
except ValueError as e: # no spotlight paper
print(f"WARNING: {str(e)}")
return
dat_file_pathname = os.path.join(
project_root_folder, 'urls', f'init_url_icml_{year}.dat')
if os.path.exists(dat_file_pathname):
with open(dat_file_pathname, 'rb') as f:
content = pickle.load(f)
else:
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
'Gecko/20100101 Firefox/23.0'}
content = urlopen_with_retry(url=init_url, headers=headers)
# content = open(f'..\\ICML_{year}.html', 'rb').read()
with open(dat_file_pathname, 'wb') as f:
pickle.dump(content, f)
# soup = BeautifulSoup(content, 'html.parser')
soup = BeautifulSoup(content, 'html5lib')
# soup = BeautifulSoup(open(r'..\ICML_2011.html', 'rb'), 'html.parser')
error_log = []
if year >= 2013:
if year in ICML_year_dict.keys():
volume = f'v{ICML_year_dict[year]}'
else:
raise ValueError('''the given year's url is unknown !''')
pmlr.download_paper_given_volume(
volume=volume,
save_dir=save_dir,
postfix=postfix,
is_download_supplement=is_download_supplement,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader.downloader
)
elif 2012 == year: # 2012
# base_url = f'https://icml.cc/{year}/'
paper_list_bar = tqdm(soup.find_all('div', {'class': 'paper'}))
paper_index = 0
for paper in paper_list_bar:
paper_index += 1
title = slugify(paper.find('h2').text)
link = None
for a in paper.find_all('a'):
if 'ICML version (pdf)' == a.text:
link = urllib.parse.urljoin(init_url, a.get('href'))
break
if link is not None:
this_paper_main_path = os.path.join(
save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_'))
paper_list_bar.set_description(
f'find paper {paper_index}:{title}')
if not os.path.exists(this_paper_main_path) :
paper_list_bar.set_description(
f'downloading paper {paper_index}:{title}')
downloader.download(
urls=link,
save_path=this_paper_main_path,
time_sleep_in_seconds=time_step_in_seconds
)
else:
error_log.append((title, 'no main link error'))
elif 2011 == year:
paper_list_bar = tqdm(soup.find_all('a'))
paper_index = 0
for paper in paper_list_bar:
h3 = paper.find('h3')
if h3 is not None:
title = slugify(h3.text)
paper_index += 1
if 'download' == slugify(paper.text.strip()):
link = paper.get('href')
link = urllib.parse.urljoin(init_url, link)
if link is not None:
this_paper_main_path = os.path.join(
save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_'))
paper_list_bar.set_description(
f'find paper {paper_index}:{title}')
if not os.path.exists(this_paper_main_path) :
paper_list_bar.set_description(
f'downloading paper {paper_index}:{title}')
downloader.download(
urls=link,
save_path=this_paper_main_path,
time_sleep_in_seconds=time_step_in_seconds
)
else:
error_log.append((title, 'no main link error'))
elif year in [2009, 2008]:
if 2009 == year:
paper_list_bar = tqdm(
soup.find('div', {'id': 'right_column'}).find_all(['h3','a']))
elif 2008 == year:
paper_list_bar = tqdm(
soup.find('div', {'class': 'content'}).find_all(['h3','a']))
paper_index = 0
title = None
for paper in paper_list_bar:
if 'h3' == paper.name:
title = slugify(paper.text)
paper_index += 1
elif 'full-paper' == slugify(paper.text.strip()): # a
link = paper.get('href')
if link is not None and title is not None:
link = urllib.parse.urljoin(init_url, link)
this_paper_main_path = os.path.join(
save_dir, f'{title}_{postfix}.pdf')
paper_list_bar.set_description(
f'find paper {paper_index}:{title}')
if not os.path.exists(this_paper_main_path):
paper_list_bar.set_description(
f'downloading paper {paper_index}:{title}')
downloader.download(
urls=link,
save_path=this_paper_main_path,
time_sleep_in_seconds=time_step_in_seconds
)
title = None
else:
error_log.append((title, 'no main link error'))
elif year in [2006, 2005]:
paper_list_bar = tqdm(soup.find_all('a'))
paper_index = 0
for paper in paper_list_bar:
title = slugify(paper.text.strip())
link = paper.get('href')
paper_index += 1
if link is not None and title is not None and \
('pdf' == link[-3:] or 'ps' == link[-2:]):
link = urllib.parse.urljoin(init_url, link)
this_paper_main_path = os.path.join(
save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_'))
paper_list_bar.set_description(
f'find paper {paper_index}:{title}')
if not os.path.exists(this_paper_main_path):
paper_list_bar.set_description(
f'downloading paper {paper_index}:{title}')
downloader.download(
urls=link,
save_path=this_paper_main_path,
time_sleep_in_seconds=time_step_in_seconds
)
elif 2004 == year:
paper_index = 0
paper_list_bar = tqdm(
soup.find('table', {'class': 'proceedings'}).find_all('tr'))
title = None
for paper in paper_list_bar:
tr_class = None
try:
tr_class = paper.get('class')[0]
except:
pass
if 'proc_2004_title' == tr_class: # title
title = slugify(paper.text.strip())
paper_index += 1
else:
for a in paper.find_all('a'):
if '[Paper]' == a.text:
link = a.get('href')
if link is not None and title is not None:
link = urllib.parse.urljoin(init_url, link)
this_paper_main_path = os.path.join(
save_dir,
f'{title}_{postfix}.pdf'.replace(' ', '_'))
paper_list_bar.set_description(
f'find paper {paper_index}:{title}')
if not os.path.exists(this_paper_main_path):
paper_list_bar.set_description(
f'downloading paper {paper_index}:{title}')
downloader.download(
urls=link,
save_path=this_paper_main_path,
time_sleep_in_seconds=time_step_in_seconds
)
break
elif 2003 == year:
paper_index = 0
paper_list_bar = tqdm(
soup.find('div', {'id': 'content'}).find_all(
'p', {'class': 'left'}))
for paper in paper_list_bar:
abs_link = None
title = None
link = None
for a in paper.find_all('a'):
abs_link = urllib.parse.urljoin(init_url, a.get('href'))
if abs_link is not None:
title = slugify(a.text.strip())
break
if title is not None:
paper_index += 1
this_paper_main_path = os.path.join(
save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_'))
paper_list_bar.set_description(
f'find paper {paper_index}:{title}')
if not os.path.exists(this_paper_main_path):
if abs_link is not None:
headers = {'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; '
'rv:23.0) Gecko/20100101 Firefox/23.0'}
abs_content = urlopen_with_retry(
url=abs_link, headers=headers,
raise_error_if_failed=False)
if abs_content is None:
print('error'+title)
error_log.append(
(title, abs_link, 'download error'))
continue
abs_soup = BeautifulSoup(abs_content, 'html5lib')
for a in abs_soup.find_all('a'):
try:
if 'pdf' == a.get('href')[-3:]:
link = urllib.parse.urljoin(
abs_link, a.get('href'))
if link is not None:
paper_list_bar.set_description(
f'downloading paper {paper_index}:'
f'{title}')
downloader.download(
urls=link,
save_path=this_paper_main_path,
time_sleep_in_seconds=time_step_in_seconds
)
break
except:
pass
# write error log
print('write error log')
log_file_pathname = os.path.join(
project_root_folder, 'log', 'download_err_log.txt')
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
if e is not None:
f.write(e)
else:
f.write('None')
f.write('\n')
f.write('\n')
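# Minimal usage sketch (hedged): grab ICML 2019 from PMLR volume 97 (see
# ICML_year_dict above); the save path below is an example only.
# download_paper(2019, r'E:\ICML_2019', is_download_supplement=True,
#                time_step_in_seconds=10, downloader='IDM', source='pmlr')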
def rename_downloaded_paper(year, source_path):
"""
rename the downloaded ICML papers to {title}_ICML_{year}.pdf and save to
source_path
:param year: int, year
:param source_path: str, whose structure should be
source_path/papers/pdf files (2010)
/index.html (2010)
source_path/icml2007_proc.html (2007)
:return:
"""
if not os.path.exists(source_path):
raise ValueError(f'can not find {source_path}')
postfix = f'ICML_{year}'
if 2010 == year:
soup = BeautifulSoup(
open(os.path.join(source_path, 'index.html'), 'rb'), 'html5lib')
paper_list_bar = tqdm(soup.find_all('span', {'class': 'boxpopup3'}))
for paper in paper_list_bar:
a = paper.find('a')
title = slugify(a.text)
ori_name = os.path.join(
source_path, 'papers', a.get('href').split('/')[-1])
os.rename(ori_name, os.path.join(
source_path, f'{title}_{postfix}.pdf'))
paper_list_bar.set_description(f'processing {title}')
elif 2007 == year:
soup = BeautifulSoup(open(os.path.join(
source_path, 'icml2007_proc.html'), 'rb'), 'html5lib')
paper_list_bar = tqdm(soup.find_all('td', {'colspan': '2'}))
for paper in paper_list_bar:
all_as = paper.find_all('a')
if len(all_as) <= 1:
title = slugify(paper.text.strip())
else:
for a in all_as:
if '[Paper]' == a.text:
sub_path = a.get('href')
os.rename(os.path.join(source_path, sub_path),
os.path.join(
source_path, f'{title}_{postfix}.pdf'))
paper_list_bar.set_description_str(
(f'processing {title}'))
break
if __name__ == '__main__':
year = 2025
download_paper(
year,
rf'E:\ICML_{year}',
is_download_supplement=True,
time_step_in_seconds=10,
downloader='IDM',
source='openreview'
)
# merge_main_supplement(main_path=f'..\\ICML_{year}\\main_paper',
# supplement_path=f'..\\ICML_{year}\\supplement',
# save_path=f'..\\ICML_{year}',
# is_delete_ori_files=False)
# rename_downloaded_paper(year, f'..\\ICML_{year}')
pass
================================================
FILE: code/paper_downloader_IJCAI.py
================================================
"""paper_downloader_IJCAI.py"""
import urllib
from bs4 import BeautifulSoup
import pickle
import os
from tqdm import tqdm
from slugify import slugify
import csv
import sys
root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_folder)
from lib import csv_process
from lib.my_request import urlopen_with_retry
def save_csv(year):
"""
write IJCAI papers' urls in one csv file
:param year: int, IJCAI year, such as 2019
:return: paper_index: int, the total number of papers
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
csv_file_pathname = os.path.join(
project_root_folder, 'csv', f'IJCAI_{year}.csv'
)
with open(csv_file_pathname, 'w', newline='') as csvfile:
fieldnames = ['title', 'main link', 'group']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
if year >= 2003:
init_urls = [f'https://www.ijcai.org/proceedings/{year}/']
elif year >= 1977:
init_urls = [f'https://www.ijcai.org/Proceedings/{year}-1/',
f'https://www.ijcai.org/Proceedings/{year}-2/']
elif year >= 1969:
init_urls = [f'https://www.ijcai.org/Proceedings/{year}/']
else:
raise ValueError('invalid year!')
error_log = []
user_agents = [
'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) '
'Gecko/20071127 Firefox/2.0.0.11',
'Opera/9.25 (Windows NT 5.1; U; en)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; '
'.NET CLR 1.1.4322; .NET CLR 2.0.50727)',
'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) '
'KHTML/3.5.5 (like Gecko) (Kubuntu)',
'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) '
'Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
"Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 "
"(KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 "
"Chrome/16.0.912.77 Safari/535.7",
"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) "
"Gecko/20100101 Firefox/10.0 ",
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/105.0.0.0 Safari/537.36'
]
headers = {
'User-Agent': user_agents[-1],
'Host': 'www.ijcai.org',
'Referer': "https://www.ijcai.org",
'GET': init_urls[0]
}
if len(init_urls) == 1:
data_file_pathname = os.path.join(
project_root_folder, 'urls', f'init_url_IJCAI_{year}.dat'
)
if os.path.exists(data_file_pathname):
with open(data_file_pathname, 'rb') as f:
content = pickle.load(f)
else:
content = urlopen_with_retry(url=init_urls[0], headers=headers)
with open(data_file_pathname, 'wb') as f:
pickle.dump(content, f)
contents = [content]
else:
contents = []
data_file_pathname = os.path.join(
project_root_folder, 'urls', f'init_url_IJCAI_0_{year}.dat'
)
if os.path.exists(data_file_pathname):
with open(data_file_pathname, 'rb') as f:
content = pickle.load(f)
else:
content = urlopen_with_retry(url=init_urls[0], headers=headers)
with open(data_file_pathname, 'wb') as f:
pickle.dump(content, f)
contents.append(content)
data_file_pathname = os.path.join(
project_root_folder, 'urls', f'init_url_IJCAI_1_{year}.dat'
)
if os.path.exists(data_file_pathname):
with open(data_file_pathname, 'rb') as f:
content = pickle.load(f)
else:
content = urlopen_with_retry(url=init_urls[1], headers=headers)
with open(data_file_pathname, 'wb') as f:
pickle.dump(content, f)
contents.append(content)
paper_index = 0
for content in contents:
soup = BeautifulSoup(content, 'html5lib')
if year >= 2017:
pbar = tqdm(soup.find_all('div', {'class': 'section_title'}))
for section in pbar:
this_group = slugify(section.text)
papers = section.parent.find_all(
'div', {'class': ['paper_wrapper', 'subsection_title']})
sub_group = ''
for paper in papers:
if 'subsection_title' == paper.get('class')[0]:
sub_group = slugify(paper.text)
continue
paper_index += 1
is_get_link = False
title = slugify(
paper.find('div', {'class': 'title'}).text)
pbar.set_description(
f'downloading paper {paper_index}: {title}')
for a in paper.find(
'div', {'class': 'details'}).find_all('a'):
if 'PDF' == a.text:
link = urllib.parse.urljoin(
init_urls[0], a.get('href'))
is_get_link = True
break
if is_get_link:
paper_dict = {'title': title,
'main link': link,
'group': this_group + '--' +
sub_group if
sub_group != '' else this_group}
else:
paper_dict = {'title': title,
'main link': 'error',
'group': this_group + '--' +
sub_group if
sub_group != '' else this_group}
print(f'get link for {title}_{year} failed!')
error_log.append((title, 'no link'))
writer.writerow(paper_dict)
elif year in [2016]: # no group
papers_bar = tqdm(soup.find_all('p'))
for paper in papers_bar:
all_as = paper.find_all('a')
if len(all_as) >= 2: # paper pdf and abstract
paper_index += 1
title = slugify(paper.text.split('\n')[0])
papers_bar.set_description(
f'downloading paper {paper_index}: {title}')
is_get_link = False
for a in all_as:
if 'PDF' == a.text:
link = 'https://www.ijcai.org' + a.get('href')
is_get_link = True
break
if is_get_link:
paper_dict = {'title': title,
'main link': link,
'group': ''}
else:
paper_dict = {'title': title,
'main link': 'error',
'group': ''}
print(f'get link for {title}_{year} failed!')
error_log.append((title, 'no link'))
writer.writerow(paper_dict)
elif year in [2015]: # p group 'PDF'
div_content = soup.find('div', {'id': 'content'})
papers_bar = tqdm(div_content.find_all(['h2', 'p', 'h3']))
is_start = False
this_group = ''
for paper in papers_bar:
if not is_start:
if 'h2' == paper.name: # find 'content'
if 'Contents' == paper.text:
is_start = True
else:
if 'h3' == paper.name: # group
this_group = slugify(paper.text)
elif 'p' == paper.name: # paper
all_as = paper.find_all('a')
if len(all_as) >= 2: # paper pdf and abstract
paper_index += 1
title = slugify(paper.text.split('\n')[0])
papers_bar.set_description(
f'downloading paper {paper_index}: {title}')
is_get_link = False
for a in all_as:
if 'PDF' == a.text:
link = 'https://www.ijcai.org' + \
a.get('href')
is_get_link = True
break
if is_get_link:
paper_dict = {'title': title,
'main link': link,
'group': this_group}
else:
paper_dict = {'title': title,
'main link': 'error',
'group': this_group}
print(f'get link for {title}_{year} failed!')
error_log.append((title, 'no link'))
writer.writerow(paper_dict)
elif year in [2013, 2011, 2009, 2007]: # p group
div_content = soup.find('div', {'id': 'content'})
papers_bar = tqdm(div_content.find_all(['h2', 'p', 'h3', 'h4']))
# papers_bar = div_content.find_all(['h2', 'p', 'h3', 'h4'])
is_start = False
this_group = ''
this_group_v3 = ''
this_group_v4 = ''
for paper in papers_bar:
if not is_start:
if 'h2' == paper.name: # find 'content'
if 'Contents' == paper.text or \
'IJCAI-09 Contents' == paper.text or \
'IJCAI-07 Contents' == paper.text:
is_start = True
else:
if 'h3' == paper.name: # group
this_group_v3 = slugify(paper.text)
this_group = this_group_v3
elif 'h4' == paper.name: # group
this_group_v4 = slugify(paper.text)
this_group = this_group_v3 + '--' + this_group_v4
elif 'p' == paper.name: # paper
try:
all_as = paper.find_all('a')
except:
continue
if len(all_as) >= 1: # paper
paper_index += 1
is_get_link = False
for a in all_as:
if 'abstract' != slugify(a.text.strip()):
title = slugify(a.text)
link = a.get('href')
is_get_link = True
papers_bar.set_description(
f'downloading paper {paper_index}: '
f'{title}')
break
if is_get_link:
paper_dict = {'title': title,
'main link': link,
'group': this_group}
else:
paper_dict = {'title': title,
'main link': 'error',
'group': this_group}
print(f'get link for {title}_{year} failed!')
error_log.append((title, 'no link'))
# papers_bar.set_description(f'downloading
# paper {paper_index}: {title}')
writer.writerow(paper_dict)
elif year in [2005]:
div_content = soup.find('div', {'id': 'content'})
papers_bar = tqdm(div_content.find_all(['p']))
this_group = ''
for paper in papers_bar:
try:
paper_class = paper.get('class')[0]
except:
continue
if 'docsection' == paper_class: # group
this_group = slugify(paper.text)
elif 'doctitle' == paper_class: # paper
paper_index += 1
title = slugify(paper.a.text)
link = paper.a.get('href')
papers_bar.set_description(
f'downloading paper {paper_index}: {title}')
paper_dict = {'title': title,
'main link': link,
'group': this_group}
writer.writerow(paper_dict)
elif year in [2003]:
div_content = soup.find('div', {'id': 'content'})
papers_bar = tqdm(div_content.find_all(['p']))
this_group = ''
base_url = 'https://www.ijcai.org'
for paper in papers_bar:
try:
this_group = slugify(paper.b.text)
except:
pass
try:
title = slugify(paper.a.text)
link = base_url + paper.a.get('href')
paper_index += 1
papers_bar.set_description(
f'downloading paper {paper_index}: {title}')
paper_dict = {'title': title,
'main link': link,
'group': this_group}
writer.writerow(paper_dict)
except:
continue
elif year in [2001]:
div_content = soup.find('div', {'id': 'content'})
papers_bar = tqdm(div_content.find_all(['p']))
this_group = ''
for paper in papers_bar:
try:
title = slugify(paper.a.text)
link = paper.a.get('href')
paper_index += 1
papers_bar.set_description(
f'downloading paper {paper_index}: {title}')
paper_dict = {'title': title,
'main link': link,
'group': this_group}
writer.writerow(paper_dict)
except:
continue
elif year in [1999, 1997, 1995, 1993, 1991, 1989, 1987, 1981, 1979,
1977, 1969]: # group in capitals in p.b.text
div_content = soup.find('div', {'id': 'content'})
papers_bar = tqdm(div_content.find_all(['p']))
this_group = ''
for paper in papers_bar:
try:
if paper.b.text.isupper():
# print(paper.b.text)
this_group = slugify(paper.b.text)
except:
pass
try:
for a in paper.find_all('a'):
title = slugify(a.text.strip())
link = a.get('href')
if link[-3:] == 'pdf' and '' != title:
paper_index += 1
papers_bar.set_description(
f'downloading paper {paper_index}: {title}')
paper_dict = {'title': title,
'main link': link,
'group': this_group}
writer.writerow(paper_dict)
break
else:
continue
except:
continue
elif year in [1985, 1975, 1971]: # no group, paper in 'p'
div_content = soup.find('div', {'id': 'content'})
papers_bar = tqdm(div_content.find_all(['p']))
this_group = ''
for paper in papers_bar:
try:
for a in paper.find_all('a'):
title = slugify(a.text.strip())
link = a.get('href')
if link[-3:] == 'pdf' and '' != title:
paper_index += 1
papers_bar.set_description(
f'downloading paper {paper_index}: {title}')
paper_dict = {'title': title,
'main link': link,
'group': this_group}
writer.writerow(paper_dict)
break
else:
continue
except:
continue
elif year in [1983]: # group in capitals in p.text
div_content = soup.find('div', {'id': 'content'})
papers_bar = tqdm(div_content.find_all(['p']))
this_group = ''
for paper in papers_bar:
try:
if paper.text.isupper():
this_group = slugify(paper.text)
except:
pass
try:
for a in paper.find_all('a'):
title = slugify(a.text.strip())
link = a.get('href')
if link[-3:] == 'pdf' and '' != title:
paper_index += 1
papers_bar.set_description(
f'downloading paper {paper_index}: {title}')
paper_dict = {'title': title,
'main link': link,
'group': this_group}
writer.writerow(paper_dict)
break
else:
continue
except:
continue
elif year in [1973]: # group in p.b
div_content = soup.find('div', {'id': 'content'})
papers_bar = tqdm(div_content.find_all(['p']))
this_group = ''
for paper in papers_bar:
try:
if '' != paper.b.text.strip():
this_group = slugify(paper.b.text.strip())
except:
pass
try:
for a in paper.find_all('a'):
title = slugify(a.text.strip())
link = a.get('href')
if link[-3:] == 'pdf' and '' != title:
paper_index += 1
papers_bar.set_description(
f'downloading paper {paper_index}: {title}')
paper_dict = {'title': title,
'main link': link,
'group': this_group}
writer.writerow(paper_dict)
break
else:
continue
except:
continue
# write error log
print('write error log')
log_file_pathname = os.path.join(
project_root_folder, 'log', 'download_err_log.txt')
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
if e is not None:
f.write(e)
else:
f.write('None')
f.write('\n')
f.write('\n')
return paper_index
def download_from_csv(
year, save_dir, time_step_in_seconds=5, total_paper_number=None, downloader='IDM'):
"""
download all IJCAI papers of the given year
:param year: int, IJCAI year, such as 2019
:param save_dir: str, paper and supplement material's save path
:param time_step_in_seconds: int, the interval time between two download requests in seconds
:param total_paper_number: int, the total number of papers that is going to be downloaded
:param downloader: str, the downloader to use, could be 'IDM' or None, default to 'IDM'
:return: True
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
postfix = f'IJCAI_{year}'
csv_filename = f'IJCAI_{year}.csv'
csv_filename = os.path.join(project_root_folder, 'csv', csv_filename)
csv_process.download_from_csv(
postfix=postfix,
save_dir=save_dir,
csv_file_path=csv_filename,
is_download_supplement=False,
time_step_in_seconds=time_step_in_seconds,
total_paper_number=total_paper_number,
downloader=downloader
)
if __name__ == '__main__':
# for year in range(1993, 1968, -2):
# print(year)
# # save_csv(year)
# # time.sleep(2)
# download_from_csv(year, save_dir=f'..\\IJCAI_{year}',
# time_step_in_seconds=1)
year = 2024
# total_paper_number = 723
total_paper_number = save_csv(year)
download_from_csv(
year,
save_dir=fr'E:\IJCAI_{year}',
time_step_in_seconds=5,
total_paper_number=total_paper_number,
downloader=None)
pass
================================================
FILE: code/paper_downloader_JMLR.py
================================================
"""paper_downloader_JMLR.py"""
import urllib
from bs4 import BeautifulSoup
import pickle
import os
from tqdm import tqdm
from slugify import slugify
import time
import sys
root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_folder)
from lib.downloader import Downloader
from lib.my_request import urlopen_with_retry
def download_paper(
volumn, save_dir, time_step_in_seconds=5, downloader='IDM', url=None,
is_use_url=False, refresh_paper_list=True):
"""
download all JMLR paper files of the given volume (volumn) and store them in
save_dir
:param volumn: int, JMLR volume number, such as 25
:param save_dir: str, paper and supplement material's saving path
:param time_step_in_seconds: int, the interval time between two download requests in seconds
:param downloader: str, the downloader to use, could be 'IDM' or None, default to 'IDM'
:param url: None or str, None means to download the given volume's papers from the default index page.
:param is_use_url: bool, whether to download papers from 'url'. 'url' must not be None when is_use_url is True.
:param refresh_paper_list: bool, whether to refresh the saved paper list, default
True, which means the "dat" file that contains the papers' information
will be re-downloaded.
:return: True
"""
downloader = Downloader(downloader=downloader)
# create current dict
title_list = []
# paper_dict = dict()
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
if not is_use_url:
init_url = f'http://jmlr.org/papers/v{volumn}/'
postfix = f'JMLR_v{volumn}'
dat_file_pathname = os.path.join(
project_root_folder, 'urls', f'init_url_JMLR_v{volumn}.dat')
if not refresh_paper_list and \
os.path.exists(dat_file_pathname):
with open(dat_file_pathname, 'rb') as f:
content = pickle.load(f)
else:
print('collecting papers from website...')
content = urlopen_with_retry(url=init_url, headers=headers)
# content = open(f'..\\JMLR_{volumn}.html', 'rb').read()
with open(dat_file_pathname, 'wb') as f:
pickle.dump(content, f)
elif url is not None:
content = urlopen_with_retry(url=url, headers=headers)
postfix = 'JMLR'
else:
raise ValueError(''''url' could not be None when 'is_use_url'=True!!!''')
# soup = BeautifulSoup(content, 'html.parser')
soup = BeautifulSoup(content, 'html5lib')
# soup = BeautifulSoup(open(r'..\JMLR_2011.html', 'rb'), 'html.parser')
error_log = []
os.makedirs(save_dir, exist_ok=True)
if (not is_use_url) and volumn <= 4:
paper_list = soup.find('div', {'id': 'content'}).find_all('tr')
else:
paper_list = soup.find('div', {'id': 'content'}).find_all('dl')
# num_download = 5 # number of papers to download
num_download = len(paper_list)
print(f'total paper count: {num_download}, start downloading...')
for paper in tqdm(zip(paper_list, range(num_download))):
# get title
this_paper = paper[0]
title = slugify(this_paper.find('dt').text)
title_list.append(title)
this_paper_main_path = os.path.join(save_dir, f'{title}_{postfix}.pdf'.replace(' ', '_'))
if os.path.exists(this_paper_main_path):
continue
# get abstract page url
links = this_paper.find_all('a')
main_link = None
for link in links:
if '[pdf]' == link.text or 'pdf' == link.text:
main_link = urllib.parse.urljoin('http://jmlr.org', link.get('href'))
break
# try 1 time
# error_flag = False
for d_iter in range(1):
try:
# download paper with IDM
if not os.path.exists(this_paper_main_path) and main_link is not None:
try:
print('Downloading paper {}/{}: {}'.format(paper[1] + 1, num_download, title))
except:
print(title.encode('utf8'))
downloader.download(
urls=main_link,
save_path=this_paper_main_path,
time_sleep_in_seconds=time_step_in_seconds
)
except Exception as e:
# error_flag = True
print('Error: ' + title + ' - ' + str(e))
error_log.append((title, main_link, 'main paper download error', str(e)))
# store the results
# 1. store in the pickle file
# with open(f'{postfix}_pre.dat', 'wb') as f:
# pickle.dump(paper_dict, f)
# 2. write error log
print('write error log')
log_file_pathname = os.path.join(
project_root_folder, 'log', 'download_err_log.txt')
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
if e is not None:
f.write(e)
else:
f.write('None')
f.write('\n')
f.write('\n')
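# A minimal usage sketch (hypothetical save path; downloader=None uses python
# requests instead of IDM):
# download_paper(volumn=25, save_dir=r'..\JMLR_v25', time_step_in_seconds=3,
#                downloader=None)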
def download_special_topics_and_issues_paper(save_dir, time_step_in_seconds=5, downloader='IDM'):
"""
download all JMLR special topics and special issues paper files and store them in save_dir
respectively
:param save_dir: str, paper and supplement material's saving path
:param time_step_in_seconds: int, the interval time between two download requests in seconds
:param downloader: str, the downloader to use, could be 'IDM' or None, default to 'IDM'
:return: True
"""
homepage = 'https://www.jmlr.org/papers/'
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
# postfix = f'JMLR_v{volumn}'
content = urlopen_with_retry(url=homepage, headers=headers)
soup = BeautifulSoup(content, 'html5lib')
# soup = BeautifulSoup(open(r'..\JMLR_2011.html', 'rb'), 'html.parser')
all_topics = soup.find('div', {'id': 'content'}).find_all(['h2', 'p'])
is_topic = False
is_issue = False
for topic in all_topics:
if 'h2' == topic.name and slugify(topic.text.strip()) == 'special-topics':
is_topic = True
elif 'h2' == topic.name:
is_topic = False
if 'special-issues' == slugify(topic.text.strip()):
is_issue = True
if is_topic and 'p' == topic.name:
topic_name = slugify(topic.text.strip())
topic_url = urllib.parse.urljoin(homepage, topic.a.get('href'))
# print(f'T: {topic_name} url:{topic_url}')
print(f'processing special topic: {topic_name}')
download_paper(
volumn=1000,
save_dir=os.path.join(save_dir, 'special-topics', topic_name),
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
url=topic_url,
is_use_url=True
)
time.sleep(time_step_in_seconds)
if is_issue and 'p' == topic.name:
issue_name = slugify(topic.text.strip())
issue_url = urllib.parse.urljoin(homepage, topic.a.get('href'))
# print(f'T: {issue_name} url:{issue_url}')
print(f'processing special issue: {issue_name}')
download_paper(
volumn=1000,
save_dir=os.path.join(save_dir, 'special-issues', issue_name),
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
url=issue_url,
is_use_url=True
)
time.sleep(time_step_in_seconds)
if __name__ == '__main__':
volumn = 25
download_paper(volumn, rf'W:\all_papers\JMLR\JMLR_v{volumn}',
time_step_in_seconds=3)
# download_special_topics_and_issues_paper(
# rf'Z:\all_papers\JMLR', time_step_in_seconds=3, downloader='IDM')
pass
================================================
FILE: code/paper_downloader_NIPS.py
================================================
"""paper_downloader_NIPS.py"""
import urllib
import time
from bs4 import BeautifulSoup
import pickle
import os
from tqdm import tqdm
from slugify import slugify
import csv
import sys
root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_folder)
from lib.supplement_porcess import move_main_and_supplement_2_one_directory
from lib.downloader import Downloader
from lib import csv_process
from lib.openreview import download_nips_papers_given_url
from lib.my_request import urlopen_with_retry
def save_csv(year):
"""
write NIPS papers' and supplemental materials' urls in one csv file
:param year: int
:return: num_download: int, the total number of papers.
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
csv_file_pathname = os.path.join(
project_root_folder, 'csv', f'NIPS_{year}.csv'
)
with open(csv_file_pathname, 'w', newline='') as csvfile:
fieldnames = ['title', 'main link', 'supplemental link']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
'Gecko/20100101 Firefox/23.0'}
init_url = f'https://proceedings.neurips.cc/paper/{year}'
dat_file_pathname = os.path.join(
project_root_folder, 'urls', f'init_url_nips_{year}.dat')
if os.path.exists(dat_file_pathname):
with open(dat_file_pathname, 'rb') as f:
content = pickle.load(f)
else:
content = urlopen_with_retry(url=init_url, headers=headers)
with open(dat_file_pathname, 'wb') as f:
pickle.dump(content, f)
soup = BeautifulSoup(content, 'html.parser')
paper_list = soup.find(
'div', {'class': 'container-fluid'}).find_all('li')
# num_download = 5 # number of papers to download
num_download = len(paper_list)
paper_list_bar = tqdm(zip(paper_list, range(num_download)))
for paper in paper_list_bar:
paper_dict = {'title': '',
'main link': '',
'supplemental link': ''}
# get title
# print('\n')
this_paper = paper[0]
title = slugify(this_paper.a.text)
paper_dict['title'] = title
# print('Downloading paper {}/{}: {}'.format(
# paper[1] + 1, num_download, title))
paper_list_bar.set_description(
'Tracing paper {}/{}: {}'.format(
paper[1] + 1, num_download, title))
# get abstract page url
url2 = this_paper.a.get('href')
abs_url = urllib.parse.urljoin(init_url, url2)
abs_content = urlopen_with_retry(url=abs_url, headers=headers,
raise_error_if_failed=False)
if abs_content is not None:
soup_temp = BeautifulSoup(abs_content, 'html.parser')
# abstract = soup_temp.find(
# 'p', {'class': 'abstract'}).text.strip()
# paper_dict[title] = abstract
all_a = soup_temp.findAll('a')
for a in all_a:
# print(a.text[:-2])
# print(a.text[:-2].strip().lower())
if 'paper' == a.text[:-2].strip().lower():
paper_dict['main link'] = urllib.parse.urljoin(
abs_url, a.get('href'))
elif 'supplemental' == a.text[:-2].strip().lower():
paper_dict['supplemental link'] = \
urllib.parse.urljoin(abs_url, a.get('href'))
break
else:
print('Error: ' + title)
if paper_dict['main link'] == '':
paper_dict['main link'] = 'error'
if paper_dict['supplemental link'] == '':
paper_dict['supplemental link'] = 'error'
writer.writerow(paper_dict)
time.sleep(1)
return num_download
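# A minimal usage sketch (hypothetical year and save path; save_csv() writes
# csv/NIPS_<year>.csv and returns the paper count that download_from_csv()
# can use as total_paper_number):
# num = save_csv(2022)
# download_from_csv(2022, save_dir=r'..\NIPS_2022', total_paper_number=num)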
def download_from_csv(
year, save_dir, is_download_mainpaper=True, is_download_supplement=True,
time_step_in_seconds=5, total_paper_number=None, downloader='IDM'):
"""
download all NIPS paper and supplement files given year, and store them in
save_dir/main_paper and save_dir/supplement
respectively
:param year: int, NIPS year, such as 2019
:param save_dir: str, paper and supplement material's save path
:param is_download_mainpaper: bool, True for downloading main papers
:param is_download_supplement: bool, True for downloading supplemental
material
:param time_step_in_seconds: int, the interval time between two download
request in seconds
:param total_paper_number: int, the total number of papers that is going to
download
:param downloader: str, the downloader to use, could be 'IDM' or
None, default to 'IDM'
:return: True
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
postfix = f'NIPS_{year}'
csv_file_path = os.path.join(project_root_folder, 'csv', f'NIPS_{year}.csv')
return csv_process.download_from_csv(
postfix=postfix,
save_dir=save_dir,
csv_file_path=csv_file_path,
is_download_supplement=is_download_supplement,
time_step_in_seconds=time_step_in_seconds,
total_paper_number=total_paper_number,
downloader=downloader
)
# def rename_supp( year, supp_dir):
# """
# rename supplemental material
# :param year: int, NIPS year, such 2019
# :param supp_dir: str, supplement material's save path
# :return: True
# """
# if not os.path.exists(supp_dir):
# raise ValueError(f'''can't find path {supp_dir}''')
#
# postfix = f'NIPS_{year}'
# with open(f'..\\csv\\NIPS_{year}.csv', newline='') as csvfile:
# myreader = csv.DictReader(csvfile, delimiter=',')
# pbar = tqdm(myreader)
# for this_paper in pbar:
# title = slugify(this_paper['title'])
# this_paper_supp_path_no_ext = os.path.join(
# supp_dir, f'{title}_{postfix}_supp.')
#
# if '' != this_paper['supplemental link']:
# supp_ori_name = this_paper['supplemental link'].split('/')[-1]
# supp_type = supp_ori_name.split('.')[-1]
# if os.path.exists(os.path.join(supp_dir, supp_ori_name)) and \
# not os.path.exists(
# this_paper_supp_path_no_ext + supp_type):
# os.rename(
# os.path.join(supp_dir, supp_ori_name),
# this_paper_supp_path_no_ext + supp_type
# )
# pbar.set_description(f'Renaming paper: {title}...')
if __name__ == '__main__':
year = 2024
# total_paper_number = 1899
# total_paper_number = save_csv(year)
# download_from_csv(
# year, f'..\\NIPS_{year}',
# is_download_mainpaper=False,
# is_download_supplement=True,
# time_step_in_seconds=20,
# total_paper_number=total_paper_number,
# downloader='IDM')
download_nips_papers_given_url(
save_dir=rf'E:\NIPS_{year}',
year=year,
base_url=f'https://openreview.net/group?id=NeurIPS.cc/'
f'{year}/Conference',
time_step_in_seconds=10,
# download_groups=['poster'],
downloader='IDM')
# move_main_and_supplement_2_one_directory(
# main_path=rf'F:\workspace\python3_ws\paper_downloader-master\NIPS_{year}\main_paper',
# supplement_path=rf'F:\workspace\python3_ws\paper_downloader-master\NIPS_{year}\supplement',
# supp_pdf_save_path=rf'F:\workspace\python3_ws\paper_downloader-master\NIPS_{year}\supplement_pdf'
# )
================================================
FILE: code/paper_downloader_RSS.py
================================================
"""paper_downloader_RSS.py
20240322"""
import time
import urllib
from urllib.error import HTTPError
from bs4 import BeautifulSoup
import pickle
import os
from tqdm import tqdm
from slugify import slugify
import csv
import sys
from datetime import datetime
root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
sys.path.append(root_folder)
from lib import csv_process
from lib.my_request import urlopen_with_retry
def get_paper_pdf_link(abs_url):
"""get paper pdf link in the abstract url.
For newest papers that have not been added to
"https://www.roboticsproceedings.org/rss19/index.html"
Args:
abs_url (str): paper abstract page url.
"""
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
'Gecko/20100101 Firefox/23.0'}
content = urlopen_with_retry(url=abs_url, headers=headers)
soup = BeautifulSoup(content, 'html5lib')
paper_pdf_div = soup.find('div', {'class': 'paper-pdf'})
paper_pdf_div = paper_pdf_div.find('a').get('href')
return paper_pdf_div
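# A minimal usage sketch (the abstract url below is a hypothetical placeholder;
# the function expects a page containing a <div class="paper-pdf"> element):
# pdf_url = get_paper_pdf_link(
#     'https://roboticsconference.org/program/papers/<paper-id>/')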
def save_csv(year):
"""
write RSS papers' urls in one csv file
:param year: int, RSS year, such 2023
:return: paper_index: int, the total number of papers
"""
conference = "RSS"
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
csv_file_pathname = os.path.join(
project_root_folder, 'csv', f'{conference}_{year}.csv'
)
error_log = []
paper_index = 0
with open(csv_file_pathname, 'w', newline='') as csvfile:
fieldnames = ['title', 'main link', 'supplemental link']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
is_from_proceed = True
# True to get papers from "https://www.roboticsproceedings.org"
# False to get papers from "https://roboticsconference.org/"
init_url = f'https://www.roboticsproceedings.org/rss' \
f'{year-2004 :0>2d}/index.html'
# determine whether this year's papers had been added to
# "https://www.roboticsproceedings.org"
# If not, get papers from "https://roboticsconference.org/"
try:
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
'Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url=init_url, headers=headers)
urllib.request.urlopen(req, timeout=20)
except HTTPError as e:
if e.code == 404: # not added
current_year = datetime.now().year
if year == current_year:
init_url = f'https://roboticsconference.org/program/papers/'
else:
init_url = f'https://roboticsconference.org/{year}/program/papers/'
is_from_proceed = False
url_file_pathname = os.path.join(
project_root_folder, 'urls',
f'init_url_{conference}_{year}_'
f'''{'proc' if is_from_proceed else 'conf'}.dat'''
)
if os.path.exists(url_file_pathname):
with open(url_file_pathname, 'rb') as f:
content = pickle.load(f)
else:
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
'Gecko/20100101 Firefox/23.0'}
content = urlopen_with_retry(url=init_url, headers=headers)
with open(url_file_pathname, 'wb') as f:
pickle.dump(content, f)
soup = BeautifulSoup(content, 'html5lib')
if is_from_proceed:
paper_list = soup.find('div', {'class': 'content'}).find_all('tr')
else:
paper_list = soup.find('table', {'id': 'myTable'}).find_all('tr')
paper_list_bar = tqdm(paper_list)
paper_index = 0
title_index = 0
for i, paper in enumerate(paper_list_bar):
paper_dict = {'title': '',
'main link': '',
'supplemental link': ''}
# get title
try:
if not is_from_proceed and i == 0:
# header
fields = paper.find_all('th')
fields = [f.text.lower() for f in fields]
title_index = fields.index('title')
tds = paper.find_all('td')
if len(tds) < 2: # separator
continue
if is_from_proceed:
title = slugify(tds[0].a.text)
main_link = tds[1].a.get('href')
main_link = urllib.parse.urljoin(init_url, main_link)
else:
title = slugify(tds[title_index].a.text)
abs_link = tds[title_index].a.get('href')
abs_link = urllib.parse.urljoin(init_url, abs_link)
main_link = get_paper_pdf_link(abs_link)
paper_dict['title'] = title
paper_dict['main link'] = main_link
paper_index += 1
paper_list_bar.set_description_str(
f'Collected paper {paper_index}: {title}')
writer.writerow(paper_dict)
csvfile.flush() # write to file immediately
except Exception as e:
print(f'Warning: {str(e)}')
# write error log
print('write error log')
log_file_pathname = os.path.join(
project_root_folder, 'log', 'download_err_log.txt'
)
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
if e is not None:
f.write(e)
else:
f.write('None')
f.write('\n')
f.write('\n')
return paper_index
def download_from_csv(
year, save_dir, time_step_in_seconds=5, total_paper_number=None,
csv_filename=None, downloader='IDM', is_random_step=True,
proxy_ip_port=None):
"""
download all RSS papers of the given year
:param year: int, RSS year, such as 2019
:param save_dir: str, paper and supplement material's save path
:param time_step_in_seconds: int, the interval time between two download
requests in seconds
:param total_paper_number: int, the total number of papers that is going to
be downloaded
:param csv_filename: None or str, the csv file's name, None means to use
the default setting
:param downloader: str, the downloader to use, could be 'IDM' or
None, default to 'IDM'
:param is_random_step: bool, whether random sample the time step between two
adjacent download requests. If True, the time step will be sampled
from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds.
Default: True.
:param proxy_ip_port: str or None, proxy server ip address with or without
protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890".
Default: None
:return: True
"""
conference = "RSS"
postfix = f'{conference}_{year}'
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
csv_file_path = os.path.join(
project_root_folder, 'csv',
f'{conference}_{year}.csv' if csv_filename is None else csv_filename)
csv_process.download_from_csv(
postfix=postfix,
save_dir=save_dir,
csv_file_path=csv_file_path,
is_download_supplement=False,
time_step_in_seconds=time_step_in_seconds,
total_paper_number=total_paper_number,
downloader=downloader,
is_random_step=is_random_step,
proxy_ip_port=proxy_ip_port
)
if __name__ == '__main__':
year = 2025
total_paper_number = save_csv(year)
# total_paper_number = 134
download_from_csv(year, save_dir=fr'E:\RSS\RSS_{year}',
time_step_in_seconds=15,
total_paper_number=total_paper_number)
time.sleep(2)
pass
================================================
FILE: lib/IDM.py
================================================
import subprocess
import os
import time
import random
def download(urls, save_path, time_sleep_in_seconds=5, is_random_step=True,
verbose=False):
"""
download file from given urls and save it to given path
:param urls: str, urls
:param save_path: str, full path
:param time_sleep_in_seconds: int, sleep seconds after call
:param is_random_step: bool, whether random sample the time step between two
adjacent download requests. If True, the time step will be sampled
from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds.
Default: True.
:param verbose: bool, whether to display time step information.
Default: False
:return: None
"""
idm_path = '"C:\Program Files (x86)\Internet Download Manager\IDMan.exe"' # should replace by the local IDM path
basic_command = [idm_path, '/d', 'xxxx', '/p', 'xxx', '/f', 'xxxx', '/n']
head, tail = os.path.split(save_path)
if '' != head:
os.makedirs(head, exist_ok=True)
basic_command[2] = urls
basic_command[4] = head
basic_command[6] = tail
p = subprocess.Popen(' '.join(basic_command))
# p.wait()
if is_random_step:
time_sleep_in_seconds = random.uniform(
0.5 * time_sleep_in_seconds,
1.5 * time_sleep_in_seconds,
)
if verbose:
print(f'\t random sleep {time_sleep_in_seconds: .2f} seconds')
time.sleep(time_sleep_in_seconds)
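# A minimal usage sketch (assuming IDM is installed at idm_path above; the url
# and save path are hypothetical):
# download(urls='https://example.org/paper.pdf',
#          save_path=r'..\papers\paper_example.pdf',
#          time_sleep_in_seconds=5)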
================================================
FILE: lib/__init__.py
================================================
================================================
FILE: lib/arxiv.py
================================================
"""
arxiv.py
20240218
"""
from bs4 import BeautifulSoup
from .my_request import urlopen_with_retry
def get_pdf_link_from_arxiv(abs_link, is_use_mirror=False):
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
'Gecko/20100101 Firefox/23.0'}
mirror = 'cn.arxiv.org'
if is_use_mirror:
abs_link = abs_link.replace('arxiv.org', mirror)
abs_content = urlopen_with_retry(
url=abs_link, headers=headers, raise_error_if_failed=False)
if abs_content is None:
return None
abs_soup = BeautifulSoup(abs_content, 'html.parser')
pdf_link = 'http://arxiv.org' + abs_soup.find('div', {
'class': 'full-text'}).find('ul').find('a').get('href')
if pdf_link[-3:] != 'pdf':
pdf_link += '.pdf'
if is_use_mirror:
pdf_link = pdf_link.replace('arxiv.org', mirror)
return pdf_link
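# A minimal usage sketch (the arXiv id below is a placeholder; the function
# returns the pdf link found on the abstract page, or None if the page could
# not be loaded):
# pdf_link = get_pdf_link_from_arxiv('https://arxiv.org/abs/<paper-id>')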
================================================
FILE: lib/csv_process.py
================================================
"""
csv_process.py
20210617
"""
import os
from tqdm import tqdm
from slugify import slugify
import csv
from lib.downloader import Downloader
def download_from_csv(
postfix, save_dir, csv_file_path, is_download_main_paper=True,
is_download_bib=True, is_download_supplement=True,
time_step_in_seconds=5, total_paper_number=None,
downloader='IDM', is_random_step=True, proxy_ip_port=None,
max_length_filename=128
):
"""
download paper, bibtex and supplement files and save them to
save_dir/main_paper and save_dir/supplement respectively
:param postfix: str, postfix that will be added at the end of papers' title
:param save_dir: str, paper and supplement material's save path
:param csv_file_path: str, the full path to csv file
:param is_download_main_paper: bool, True for downloading main paper
:param is_download_supplement: bool, True for downloading supplemental
material
:param time_step_in_seconds: int, the interval time between two downloading
request in seconds
:param total_paper_number: int, the total number of papers that is going to
download
:param downloader: str, the downloader to download, could be 'IDM' or None,
default to 'IDM'.
:param is_random_step: bool, whether random sample the time step between two
adjacent download requests. If True, the time step will be sampled
from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds.
Default: True.
:param proxy_ip_port: str or None, proxy server ip address with or without
protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890".
Default: None
:param max_length_filename: int or None, max file name length. All the
files whose name length is not less than this will be renamed
before saving, the others will stay unchanged. None means
no limitation. Default: 128.
:return: True
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
downloader = Downloader(
downloader=downloader, is_random_step=is_random_step,
proxy_ip_port=proxy_ip_port)
if not os.path.exists(csv_file_path):
raise ValueError(f'ERROR: file not found in {csv_file_path}!!!')
main_save_path = os.path.join(save_dir, 'main_paper')
if is_download_main_paper:
os.makedirs(main_save_path, exist_ok=True)
if is_download_supplement:
supplement_save_path = os.path.join(save_dir, 'supplement')
os.makedirs(supplement_save_path, exist_ok=True)
error_log = []
with open(csv_file_path, newline='') as csvfile:
myreader = csv.DictReader(csvfile, delimiter=',')
pbar = tqdm(myreader, total=total_paper_number)
i = 0
for this_paper in pbar:
is_download_bib &= ('bib' in this_paper)
is_grouped = ('group' in this_paper)
i += 1
# get title
if is_grouped:
group = slugify(this_paper['group'])
title = slugify(this_paper['title'])
title_main_pdf = short_name(
name=f'{title}_{postfix}.pdf',
max_length=max_length_filename
)
if total_paper_number is not None:
pbar.set_description(
f'Downloading {postfix} paper {i} /{total_paper_number}')
else:
pbar.set_description(f'Downloading {postfix} paper {i}')
this_paper_main_path = os.path.join(
main_save_path, title_main_pdf)
if is_grouped:
this_paper_main_path = os.path.join(
main_save_path, group, title_main_pdf)
if is_download_supplement:
this_paper_supp_title_no_ext = short_name(
name=f'{title}_{postfix}_supp.',
max_length=max_length_filename-3 # zip or pdf, so 3
)
this_paper_supp_path_no_ext = os.path.join(
supplement_save_path, this_paper_supp_title_no_ext)
if is_grouped:
this_paper_supp_path_no_ext = os.path.join(
supplement_save_path, group,
this_paper_supp_title_no_ext
)
if '' != this_paper['supplemental link'] and os.path.exists(
this_paper_main_path) and \
(os.path.exists(
this_paper_supp_path_no_ext + 'zip') or
os.path.exists(
this_paper_supp_path_no_ext + 'pdf')):
continue
elif '' == this_paper['supplemental link'] and \
os.path.exists(this_paper_main_path):
continue
elif os.path.exists(this_paper_main_path):
continue
if 'error' == this_paper['main link']:
error_log.append((title, 'no MAIN link'))
elif '' != this_paper['main link']:
if is_grouped:
if is_download_main_paper:
os.makedirs(os.path.join(main_save_path, group),
exist_ok=True)
if is_download_supplement:
os.makedirs(os.path.join(supplement_save_path, group),
exist_ok=True)
if is_download_main_paper:
try:
# download paper with IDM
if not os.path.exists(this_paper_main_path):
downloader.download(
urls=this_paper['main link'].replace(
' ', '%20'),
save_path=os.path.join(
os.getcwd(), this_paper_main_path),
time_sleep_in_seconds=time_step_in_seconds
)
except Exception as e:
# error_flag = True
print('Error: ' + title + ' - ' + str(e))
error_log.append((title, this_paper['main link'],
'main paper download error', str(e)))
# download supp
if is_download_supplement:
# check whether the supp can be downloaded
if not (os.path.exists(
this_paper_supp_path_no_ext + 'zip') or
os.path.exists(
this_paper_supp_path_no_ext + 'pdf')):
if 'error' == this_paper['supplemental link']:
error_log.append((title, 'no SUPPLEMENTAL link'))
elif '' != this_paper['supplemental link']:
supp_type = \
this_paper['supplemental link'].split('.')[-1]
try:
downloader.download(
urls=this_paper['supplemental link'],
save_path=os.path.join(
os.getcwd(),
this_paper_supp_path_no_ext + supp_type),
time_sleep_in_seconds=time_step_in_seconds
)
except Exception as e:
# error_flag = True
print('Error: ' + title + ' - ' + str(e))
error_log.append((title, this_paper[
'supplemental link'],
'supplement download error',
str(e)))
# download bibtex file
if is_download_bib:
bib_path = this_paper_main_path[:-3] + 'bib'
if not os.path.exists(bib_path):
if 'error' == this_paper['bib']:
error_log.append((title, 'no bibtex link'))
elif '' != this_paper['bib']:
try:
downloader.download(
urls=this_paper['bib'],
save_path=os.path.join(os.getcwd(),
bib_path),
time_sleep_in_seconds=time_step_in_seconds
)
except Exception as e:
# error_flag = True
print('Error: ' + title + ' - ' + str(e))
error_log.append((title, this_paper['bib'],
'bibtex download error',
str(e)))
# 2. write error log
print('write error log')
log_file_pathname = os.path.join(
project_root_folder, 'log', 'download_err_log.txt'
)
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
if e is not None:
f.write(e)
else:
f.write('None')
f.write('\n')
f.write('\n')
return True
def short_name(name, max_length, verbose=False):
"""
rename to shorter name
Args:
name (str): original name
max_length (int): max file name length. All the
files whose name length is not less than this will be renamed
before saving, the others will stay unchanged. None means
no limitation.
verbose (bool): whether to print debug information. Default: False.
Returns:
new_name (str): short name.
"""
if len(name) < max_length:
new_name = name
else:
# rename
try:
[title, postfix] = name.split('_', 1) # only split to 2 parts
new_title = title[:max_length - len(postfix) - 2]
new_name = f'{new_title}_{postfix}'
if verbose:
print(f'\nrenaming {name} \n\t-> {new_name}')
except ValueError:
# ValueError: not enough values to unpack (expected 2, got 1)
if verbose:
print(f'\nWARNING!!!:\n\tunable to parse postfix from {name}')
print('\tSo, it will be just rename to short name')
ext = os.path.splitext(name)[1]
new_title = name[:max_length - len(ext) - 1]
new_name = f'{new_title}{ext}'
if verbose:
print(f'\nrenaming {name} \n\t-> {new_name}')
return new_name
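# A minimal usage sketch (hypothetical file name): names shorter than max_length
# are returned unchanged; otherwise the part before the first '_' is truncated so
# that the '<conference>_<year>.pdf' postfix is kept:
# short_name('some-very-long-slugified-title_NIPS_2022.pdf', max_length=40)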
================================================
FILE: lib/cvf.py
================================================
"""
cvf.py
20210617
"""
import urllib
from bs4 import BeautifulSoup
from tqdm import tqdm
from slugify import slugify
from .my_request import urlopen_with_retry
def get_paper_dict_list(url=None, content=None, group_name=None, timeout=10):
"""
parse papers' titles, main links and supplemental links from content, and collect them in a list of
dictionaries with keys "title", "main link", "supplemental link" and "group" (optional, if group_name is not None)
:param url: str or None, url
:param content: None or the object returned by urlopen
:param group_name: str or None, the group name of the papers in the given content
:param timeout: int, the timeout value for opening the url, default to 10
:return: paper_dict_list, list of dictionaries, one dictionary per paper, with keys "title",
"main link", "supplemental link" and "group" (optional, if group_name is not None)
content, the object returned by urlopen
"""
if url is None and content is None:
raise ValueError('''one of "url" and "content" should be provided!!!''')
paper_dict_list = []
paper_dict = {'title': '', 'main link': '', 'supplemental link': '', 'arxiv': ''} if group_name is None else \
{'group': group_name, 'title': '', 'main link': '', 'supplemental link': '', 'arxiv': ''}
if content is None:
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
content = urlopen_with_retry(url=url, headers=headers)
soup = BeautifulSoup(content, 'html5lib')
paper_list_bar = tqdm(soup.find('div', {'id': 'content'}).find_all(['dd', 'dt']))
paper_index = 0
for paper in paper_list_bar:
is_new_paper = False
# get title
try:
if 'dt' == paper.name and 'ptitle' == paper.get('class')[0]: # title:
title = slugify(paper.text.strip())
paper_dict['title'] = title
paper_index += 1
paper_list_bar.set_description_str(f'Collecting paper {paper_index}: {title}')
elif 'dd' == paper.name:
all_as = paper.find_all('a')
for a in all_as:
if 'pdf' == slugify(a.text.strip()):
main_link = urllib.parse.urljoin(url, a.get('href'))
paper_dict['main link'] = main_link
is_new_paper = True
elif 'supp' == slugify(a.text.strip()):
supp_link = urllib.parse.urljoin(url, a.get('href'))
paper_dict['supplemental link'] = supp_link
elif 'arxiv' == slugify(a.text.strip()):
arxiv = urllib.parse.urljoin(url, a.get('href'))
paper_dict['arxiv'] = arxiv
break
except Exception as e:
print(f'Warning: {str(e)}')
if is_new_paper:
paper_dict_list.append(paper_dict.copy())
paper_dict['title'] = ''
paper_dict['main link'] = ''
paper_dict['supplemental link'] = ''
paper_dict['arxiv'] = ''
return paper_dict_list, content
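# A minimal usage sketch (the CVF open access url below is a hypothetical
# placeholder for a conference index page):
# paper_dict_list, content = get_paper_dict_list(
#     url='https://openaccess.thecvf.com/<conference-page>', group_name=None)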
================================================
FILE: lib/downloader.py
================================================
"""
downloader.py
20210624
"""
import time
from lib import IDM
import requests
import os
import random
from tqdm import tqdm
from threading import Thread
from lib.proxy import get_proxy_4_requests
def _download(urls, save_path, time_sleep_in_seconds=5, is_random_step=True,
verbose=False, proxy_ip_port=None):
"""
download file from given urls and save it to given path
:param urls: str, urls
:param save_path: str, full path
:param time_sleep_in_seconds: int, sleep seconds after call
:param is_random_step: bool, whether random sample the time step between two
adjacent download requests. If True, the time step will be sampled
from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds.
Default: True.
:param verbose: bool, whether to display time step information.
Default: False
:param proxy_ip_port: str or None, proxy server ip address with or without
protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890".
:return: None
"""
def __download(urls, save_path, proxy_ip_port):
head, tail = os.path.split(save_path)
# debug
# print(f'downloading {tail}')
proxies = get_proxy_4_requests(proxy_ip_port)
r = requests.get(urls, stream=True, proxies=proxies)
# file size in MB
length = round(int(r.headers['content-length']) / 1024**2, 2)
process_bar = tqdm(
colour='blue', total=length, unit='MB', desc=tail, initial=0)
if '' != head:
os.makedirs(head, exist_ok=True)
for part in r.iter_content(1024 ** 2):
process_bar.update(1)
with open(save_path, 'ab') as file:
file.write(part)
r.close()
# set daemon to False to continue downloading even if the main thread
# has been killed due to KeyboardInterrupt
t = Thread(
target=__download, args=(urls, save_path, proxy_ip_port), daemon=False)
t.start()
if is_random_step:
time_sleep_in_seconds = random.uniform(
0.5 * time_sleep_in_seconds,
1.5 * time_sleep_in_seconds,
)
if verbose:
print(f'\t random sleep {time_sleep_in_seconds: .2f} seconds')
time.sleep(time_sleep_in_seconds)
class Downloader(object):
def __init__(self, downloader=None, is_random_step=True,
proxy_ip_port=None):
"""
:param downloader: None or str, the downloader's name.
if downloader is None, 'request' will be used to
download files; if downloader is 'IDM', the
"Internet Downloader Manager" will be used to download
files; or a ValueError will be raised.
:param is_random_step: bool, whether random sample the time step between
two adjacent download requests. If True, the time step will be
sampled from Uniform(0.5t, 1.5t), where t is the given
time_step_in_seconds. Default: True.
:param proxy_ip_port: str or None, proxy server ip address with or without
protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890".
(only useful for None|"request" downloader)
Default: None
"""
super(Downloader, self).__init__()
if downloader is not None and downloader.lower() not in ['idm']:
raise ValueError(
f'''ERROR: Unsupported downloader: {downloader}, '''
f'''we currently only support'''
f''' None (means python's requests) or "IDM" '''
)
self.downloader = downloader
self.is_random_step = is_random_step
self.proxy_ip_port = proxy_ip_port
def download(self, urls, save_path, time_sleep_in_seconds=5):
"""
download file from given urls and save it to given path
:param urls: str, urls
:param save_path: str, full path
:param time_sleep_in_seconds: int, sleep seconds after call
:return: None
"""
if self.downloader is None:
_download(
urls=urls,
save_path=save_path,
time_sleep_in_seconds=time_sleep_in_seconds,
is_random_step=self.is_random_step,
proxy_ip_port=self.proxy_ip_port
)
elif self.downloader.lower() == 'idm':
IDM.download(
urls=urls,
save_path=save_path,
time_sleep_in_seconds=time_sleep_in_seconds,
is_random_step=self.is_random_step
)
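# A minimal usage sketch (hypothetical url and save path; downloader=None falls
# back to python requests, downloader='IDM' delegates to lib/IDM.py):
# d = Downloader(downloader=None)
# d.download(urls='https://example.org/paper.pdf',
#            save_path='paper_example.pdf', time_sleep_in_seconds=5)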
================================================
FILE: lib/my_request.py
================================================
"""
my_request.py
20240412
"""
import urllib
import random
import time
from urllib.error import URLError, HTTPError
from lib.proxy import set_proxy_4_urllib_request
def urlopen_with_retry(url, headers=dict(), retry_time=3, time_out=20,
raise_error_if_failed=True, proxy_ip_port=None):
"""
load content from url with given headers. Retry if error occurs.
Args:
url (str): url.
headers (dict): request headers. Default: {}.
retry_time (int): max retry time. Default: 3.
time_out (int): time out in seconds. Default: 20.
raise_error_if_failed (bool): whether to raise error if failed.
Default: True.
proxy_ip_port(str|None): proxy server ip address with or without
protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890".
Default: None
Returns:
content(str|None): url content. None will be returned if failed.
"""
set_proxy_4_urllib_request(proxy_ip_port)
req = urllib.request.Request(url=url, headers=headers)
for r in range(retry_time):
try:
content = urllib.request.urlopen(req, timeout=time_out).read()
return content
except HTTPError as e:
print('The server couldn\'t fulfill the request.')
print('Error code: ', e.code)
s = random.randint(3, 7)
print(f'random sleeping {s} seconds and doing {r + 1}/{retry_time}'
f'-th retrying...')
time.sleep(s)
except URLError as e:
print('We failed to reach a server.')
print('Reason: ', e.reason)
s = random.randint(3, 7)
print(f'random sleeping {s} seconds and doing {r + 1}/{retry_time}'
f'-th retrying...')
time.sleep(s)
if raise_error_if_failed:
raise ValueError(f'Failed to open {url} after trying {retry_time} '
f'times!')
else:
return None
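# A minimal usage sketch (hypothetical url; returns the raw page content, or
# None when raise_error_if_failed is False and all retries fail):
# content = urlopen_with_retry(
#     url='https://example.org', headers={'User-Agent': 'Mozilla/5.0'},
#     retry_time=3, raise_error_if_failed=False)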
================================================
FILE: lib/openreview.py
================================================
"""
openreview.py
20230104
"""
import time
from tqdm import tqdm
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
import os
# https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename
from slugify import slugify
from lib.downloader import Downloader
from lib.proxy import get_proxy
import urllib
from lib.arxiv import get_pdf_link_from_arxiv
def get_driver(proxy_ip_port=None):
# driver = webdriver.Chrome(driver_path)
capabilities = webdriver.DesiredCapabilities.CHROME
if proxy_ip_port is not None:
proxy = get_proxy(proxy_ip_port)
proxy.add_to_capabilities(capabilities)
# https://stackoverflow.com/a/78797164
chrome_install = ChromeDriverManager().install()
folder = os.path.dirname(chrome_install)
chromedriver_path = os.path.join(folder, "chromedriver.exe")
driver = webdriver.Chrome(
service=Service(executable_path=chromedriver_path),
desired_capabilities=capabilities)
return driver
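# A minimal usage sketch (hypothetical proxy address; simply opens the site in
# the managed Chrome driver):
# driver = get_driver(proxy_ip_port='127.0.0.1:7890')
# driver.get('https://openreview.net')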
def __download_papers_given_divs(driver, divs, save_dir, paper_postfix,
time_step_in_seconds=10, downloader='IDM',
proxy_ip_port=None):
error_log = []
downloader = Downloader(downloader=downloader, proxy_ip_port=proxy_ip_port)
# scroll to top of page
# https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium
driver.find_element(By.TAG_NAME, 'body').send_keys(
Keys.CONTROL + Keys.HOME)
time.sleep(0.3)
# titles = [d.text for d in divs]
titles = []
for d in divs:
for i in range(3): # temp workaround
try:
titles.append(d.text)
break
except Exception as e:
if i == 2:
print(f'\tget Exception: {str(e.msg)}')
time.sleep(0.3)
valid_divs = []
for i, t in enumerate(titles):
if len(t):
valid_divs.append(divs[i])
num_papers = len(valid_divs)
print('found number of papers:', num_papers)
name = None
for index, paper in enumerate(valid_divs):
is_get_paper = False
try:
a_hrefs = paper.find_elements(By.TAG_NAME, "a")
name = slugify(a_hrefs[0].text.strip())
if a_hrefs[1].get_attribute('class') == 'pdf-link':
# has pdf button
link = a_hrefs[1].get_attribute('href')
link = urllib.parse.urljoin('https://openreview.net', link)
else:
# raise ValueError('pdf link not found!')
print('\tWarning: pdf link not found, skip this download...')
if name is not None:
error_log.append((name, str(index)))
else:
error_log.append((str(index), str(index)))
continue
# TODO: find pdf link in paper abstract page
if name == '':
continue
is_get_paper = True
except Exception as e:
print(f'\tget Exception: {str(e.msg)}')
print('\tskip this download...')
if name is not None:
error_log.append((name, str(index)))
else:
error_log.append((str(index), str(index)))
if not is_get_paper:
continue
# name = slugify(paper.find_element_by_class_name('note_content_title').text)
# link = paper.find_element_by_class_name('note_content_pdf').get_attribute('href')
pdf_name = name + '_' + paper_postfix + '.pdf'
if not os.path.exists(os.path.join(save_dir, pdf_name)):
print('Downloading paper {}/{}: {}'.format(index + 1, num_papers,
name))
# get pdf link of arxiv if the original link is on arxiv.org
if "arxiv.org/abs" in link:
link = get_pdf_link_from_arxiv(abs_link=link)
# try 1 times
success_flag = False
for d_iter in range(1):
try:
downloader.download(
urls=link,
save_path=os.path.join(save_dir, pdf_name),
time_sleep_in_seconds=time_step_in_seconds
)
success_flag = True
break
except Exception as e:
print('Error: ' + name + ' - ' + str(e))
if not success_flag:
error_log.append((name, link))
return error_log, num_papers
def __get_into_pages_given_number(driver, page_number, pages, wait_fn,
condition=None):
wait_fn(driver, condition)
for page in pages:
if page.text.isnumeric() and int(page.text) == page_number:
page_link = page.find_element(By.TAG_NAME, "a")
page_link.click()
wait_fn(driver, condition)
return page
return None
def download_nips_papers_given_url(
save_dir, year, base_url, conference='NIPS', start_page=1,
time_step_in_seconds=10, download_groups='all', downloader='IDM',
proxy_ip_port=None):
"""
download NeurIPS papers from the given web url.
:param save_dir: str, paper save path
:type save_dir: str
:param year: int, conference year, currently only supports year >= 2018
:type year: int
:param base_url: str, paper website url
:type base_url: str
:param conference: str, conference name, such as NIPS.
:param start_page: int, the initial downloading webpage number, only the pages whose number is
equal to or greater than this number will be processed.
:param time_step_in_seconds: int, the interval time between two download requests in seconds
:param download_groups: group name(s), such as 'oral', 'spotlight', 'poster', or 'all'.
Default: 'all'.
:type download_groups: str | list[str]
:param downloader: str, the downloader to download, could be 'IDM' or None,
default to 'IDM'
:param proxy_ip_port: str or None, proxy server ip address with or without
protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890".
(only useful for None|"request" downloader and webdriver)
Default: None
:return:
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
if year < 2023:
sub_xpath = '''id="accepted-papers"'''
else:
sub_xpath = '''class="submissions-list"'''
def mywait(driver, condition=None):
# wait for the select element to become visible
# print('Starting web driver wait...')
# ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,)
# wait = WebDriverWait(driver, 20, ignored_exceptions=ignored_exceptions)
wait = WebDriverWait(driver, 20)
# print('Starting web driver wait... finished')
# res = wait.until(EC.presence_of_element_located((By.ID, "notes")))
# print("Successful load the website!->", res)
# res = wait.until(
# EC.presence_of_element_located((By.CLASS_NAME, "note")))
res = wait.until(
EC.presence_of_element_located((By.ID, "notes")))
# print("Successful load the website notes!->", res)
res = wait.until(EC.presence_of_element_located(
(By.XPATH, f'''//*[@{sub_xpath}]/nav''')))
# print("Successful load the website pagination!->", res)
time.sleep(2) # seconds, workaround for bugs
def find_divs_of_papers():
if year < 2023:
divs = driver.find_element(By.ID, group_id). \
find_elements(By.CLASS_NAME, 'note ')
else:
# divs = driver.find_element(By.ID, group_id). \
# find_elements(By.XPATH, '//*[@class="note undefined"]')
divs = driver.find_element(By.ID, group_id).find_elements(
By.XPATH,
'//*[contains(@class, "note") and contains(@class, "undefined")]'
)
return divs
paper_postfix = f'{conference}_{year}'
error_log = []
driver = get_driver(proxy_ip_port=proxy_ip_port)
driver.get(base_url)
if not os.path.exists(save_dir):
os.makedirs(save_dir)
mywait(driver)
# pages = driver.find_elements_by_xpath('//*[@id="accepted-papers"]/nav/ul/li')
# download grouped papers, such as "Accepted Papers" for years before 2023
# "Accept (oral)", "Accept (spotlight)", "Accept (poster)" for year 2023
groups = driver.find_elements(
By.XPATH, f'//*[@id="notes"]/div/div[1]/ul/li')
accept_groups = []
for g in groups:
if 'accept' in g.text.lower():
# whether download this group
is_download_group = True
if not 'all' == download_groups:
is_download_group = False
for dg in download_groups:
if dg.lower() in g.text.lower():
is_download_group = True
break
if is_download_group:
accept_groups.append(g)
group_name = None
group_save_dir = save_dir
for ag in accept_groups:
group_name = slugify(ag.text)
group_save_dir = os.path.join(save_dir, group_name)
print(f'Downloading {group_name}...')
os.makedirs(group_save_dir, exist_ok=True)
number_paper_group = 0
accept_group_link = ag.find_element(By.TAG_NAME, "a")
# group_id = accept_group_link.get_attribute('aria-controls')
group_id = accept_group_link.get_attribute('href').split('#')[-1]
# scroll to top of page, if not at top, the click action not work
# https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium
driver.find_element(By.TAG_NAME, 'body').send_keys(
Keys.CONTROL + Keys.HOME)
time.sleep(0.2)
accept_group_link.click()
mywait(driver)
pages = driver.find_elements(
By.XPATH, f'//*[@{sub_xpath}]/nav[1]/ul/li')
page_str_list = get_pages_str(pages)
# print(f'Current page navigation bar:\n{page_str_list}')
current_page = 1
ind_page = 2 # 0 << ; 1 <
# << | < | 1, 2, 3, ... | > | >>
total_pages_number = get_max_page_number(page_str_list)
last_total_pages = total_pages_number
# get into start pages
while current_page < start_page:
if total_pages_number < start_page: # flip pages until seeing the start page
current_page = total_pages_number
__get_into_pages_given_number(
driver=driver, page_number=current_page, pages=pages,
wait_fn=mywait)
print(f'getting into web page {current_page}...')
# res = wait.until(EC.presence_of_element_located(
# (By.XPATH, '//*[@id="accepted-papers"]/ul/li/h4/a')))
# res = wait.until(EC.presence_of_element_located(
# (By.XPATH, '''//*[@id="accepted-papers"]/nav''')))
mywait(driver)
# print("Successful load the website pagination!->", res)
# pages = driver.find_elements_by_xpath('//*[@id="accepted-papers"]/nav/ul/li')
pages = driver.find_elements(
By.XPATH, f'//*[@{sub_xpath}]/nav[1]/ul/li')
page_str_list = get_pages_str(pages)
total_pages_number = get_max_page_number(page_str_list)
# # print(f'Current page navigation bar:\n{page_str_list}')
if total_pages_number == last_total_pages: # total page count remains unchanged after reload
print(f'reached the last ({total_pages_number}-th) webpage')
# we reached the last page, but its number is still less than start_page, so
# the start page doesn't exist. Print an error and return
print(f'ERROR: THE {start_page}-th webpage not found!')
return
else:
current_page = start_page
page = __get_into_pages_given_number(
driver=driver, page_number=current_page, pages=pages,
wait_fn=mywait)
while current_page <= total_pages_number:
if page is None:
break
print(f'downloading papers in page: {current_page}')
mywait(driver)
# divs = driver.find_elements_by_xpath('//*[@id="accepted-papers"]/ul/li')
# divs = driver.find_elements(By.XPATH, '//*[@id="accepted-papers"]/ul/li')
divs = find_divs_of_papers()
# temp workaround
repeat_times = 3
is_find_paper = False
for r in range(repeat_times):
try:
a_hrefs = divs[0].find_elements(By.TAG_NAME, "a")
name = slugify(a_hrefs[0].text.strip())
link = a_hrefs[1].get_attribute('href')
a_hrefs = divs[-1].find_elements(By.TAG_NAME, "a")
name = slugify(a_hrefs[0].text.strip())
link = a_hrefs[1].get_attribute('href')
is_find_paper = True
break
except Exception as e:
if (r + 1) < repeat_times:
                        print(f'\terror occurred: {str(e)}')
print(f'\tsleep {(r + 1) * 5} seconds...')
time.sleep((r + 1) * 5)
print(f'{r + 1}-th reloading page')
divs = find_divs_of_papers()
else:
print('\tskip this page.')
if not is_find_paper:
continue
# time.sleep(time_step_in_seconds)
this_error_log, this_number_paper = __download_papers_given_divs(
driver=driver,
divs=divs,
save_dir=group_save_dir,
paper_postfix=paper_postfix,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port
)
for e in this_error_log:
error_log.append(e)
number_paper_group += this_number_paper
# get into next page
current_page += 1
# pages = driver.find_elements_by_xpath('//*[@id="accepted-papers"]/nav/ul/li')
pages = driver.find_elements(
By.XPATH, f'//*[@{sub_xpath}]/nav[1]/ul/li')
page_str_list = get_pages_str(pages)
total_pages_number = get_max_page_number(page_str_list)
# print(f'Current page navigation bar:\n{page_str_list}')
            # if we do not reread the pages, they will no longer be available and
            # selenium will raise:
            # selenium.common.exceptions.StaleElementReferenceException:
            # Message: stale element reference: element is not attached to the page document
page = __get_into_pages_given_number(driver=driver,
page_number=current_page,
pages=pages,
wait_fn=mywait)
# display total number of papers
print(f'number of papers in {group_name}: {number_paper_group}')
driver.quit()
# 2. write error log
print('write error log')
log_file_pathname = os.path.join(
project_root_folder, 'log', 'download_err_log.txt'
)
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
f.write(e)
f.write('\n')
f.write('\n')
def download_iclr_papers_given_url_and_group_id(
save_dir, year, base_url, group_id, conference='ICLR', start_page=1,
time_step_in_seconds=10, downloader='IDM', proxy_ip_port=None,
is_have_pages=True, is_need_click_group_button=False):
"""
    download ICLR papers for the given web url and the paper group id
:param save_dir: str, paper save path
:type save_dir: str
    :param year: int, ICLR year, currently only year >= 2018 is supported
:type year: int
:param base_url: str, paper website url
:type base_url: str
:param group_id: str, paper group id, such as "notable-top-5-",
"notable-top-25-", "poster", "oral-submissions",
"spotlight-submissions", "poster-submissions", etc.
:type group_id: str
:param conference: str, conference name, such as ICLR. Default: ICLR
:param start_page: int, the initial downloading webpage number, only the
pages whose number is equal to or greater than this number will be
processed. Default: 1
    :param time_step_in_seconds: int, the interval time between two download
        requests in seconds. Default: 10
:param downloader: str, the downloader to download, could be 'IDM' or
'Thunder'. Default: 'IDM'
    :param proxy_ip_port: str or None, proxy ip address and port,
        eg: "127.0.0.1:7890". Only used by the webdriver and the request
        downloader (downloader=None). Default: None.
:type proxy_ip_port: str | None
    :param is_have_pages: bool, whether the webpage has pagination. Default:
        True.
:type is_have_pages: bool
    :param is_need_click_group_button: bool, whether the group button in the
        webpage needs to be clicked. For some years, e.g. 2018, the navigation
        fragment "#xxxxx" in the base url does not work, so the button must be
        clicked before reading content from the webpage. Default: False.
:type is_need_click_group_button: bool
:return:
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
def _get_pages_xpath(year):
if year <= 2023:
xpath = f'''//*[@id="{group_id}"]/nav/ul/li'''
else:
xpath = f'''//*[@id="{group_id}"]/div/div/nav/ul/li'''
return xpath
def mywait(driver, condition=None):
# wait for the select element to become visible
# print('Starting web driver wait...')
# ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,)
# wait = WebDriverWait(driver, 20, ignored_exceptions=ignored_exceptions)
wait = WebDriverWait(driver, 20)
# print('Starting web driver wait... finished')
# res = wait.until(EC.presence_of_element_located((By.ID, "notes")))
# print("Successful load the website!->", res)
if year <= 2023:
res = wait.until(
EC.presence_of_element_located((By.CLASS_NAME, "note")))
# print("Successful load the website notes!->", res)
# res = wait.until(EC.presence_of_element_located(
# (By.XPATH, f'''//*[@id="{group_id}"]/nav''')))
if is_have_pages:
# scroll to bottom of page
# https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium
driver.find_element(By.TAG_NAME, 'body').send_keys(
Keys.CONTROL + Keys.END)
            # both branches of the original year check used the same locator,
            # since _get_pages_xpath(year) already accounts for the year
            wait.until(EC.element_to_be_clickable(
                (By.XPATH, f'{_get_pages_xpath(year)}[3]/a')))
# print("Successful load the website pagination!->", res)
time.sleep(2) # seconds, workaround for bugs
paper_postfix = f'{conference}_{year}'
error_log = []
driver = get_driver(proxy_ip_port=proxy_ip_port)
driver.get(base_url)
if not os.path.exists(save_dir):
os.makedirs(save_dir)
if is_need_click_group_button:
archive_is_have_pages = is_have_pages
is_have_pages = False
mywait(driver)
aria_controls = base_url.split('#')[-1]
# scroll to home of page
driver.find_element(By.TAG_NAME, 'body').send_keys(
Keys.CONTROL + Keys.HOME)
group_button = driver.find_element(
By.XPATH, f"""//a[@aria-controls="{aria_controls}"]"""
)
group_button.click()
is_have_pages = archive_is_have_pages
mywait(driver)
if is_have_pages:
pages = driver.find_elements(By.XPATH, _get_pages_xpath(year))
current_page = 1
ind_page = 2 # 0 << ; 1 <
total_pages_number = int(pages[-3].text)
# << | < | 1, 2, 3, ... | > | >>
last_total_pages = total_pages_number
# get into start pages
while current_page < start_page:
# flip pages until seeing the start page
if total_pages_number < start_page:
current_page = total_pages_number
__get_into_pages_given_number(
driver=driver, page_number=current_page, pages=pages,
wait_fn=mywait)
print(f'getting into web page {current_page}...')
# res = wait.until(EC.presence_of_element_located(
# (By.XPATH, f'//*[@id="{group_id}"]/ul/li/h4/a')))
# res = wait.until(EC.presence_of_element_located(
# (By.XPATH, f'''//*[@id="{group_id}"]/nav''')))
mywait(driver)
# print("Successful load the website pagination!->", res)
pages = driver.find_elements(
By.XPATH, _get_pages_xpath(year))
total_pages_number = int(pages[-3].text)
                # total page count remains unchanged after reload
if total_pages_number == last_total_pages:
print(f'reached last({total_pages_number}-th) webpage')
                    # we reached the last page but its number is still
                    # less than the start page, so the start page doesn't
                    # exist. PRINT ERROR and return
print(f'ERROR: THE {start_page}-th webpage not found!')
return
else:
current_page = start_page
page = __get_into_pages_given_number(
driver=driver, page_number=current_page, pages=pages, wait_fn=mywait)
while current_page <= total_pages_number:
if page is None:
break
print(f'downloading {group_id} papers in page: {current_page}')
mywait(driver)
divs = driver.find_element(By.ID, group_id). \
find_elements(By.CLASS_NAME, 'note ')
# temp workaround
repeat_times = 3
is_find_paper = False
for r in range(repeat_times):
try:
a_hrefs = divs[0].find_elements(By.TAG_NAME, "a")
name = slugify(a_hrefs[0].text.strip())
link = a_hrefs[1].get_attribute('href')
a_hrefs = divs[-1].find_elements(By.TAG_NAME, "a")
name = slugify(a_hrefs[0].text.strip())
link = a_hrefs[1].get_attribute('href')
is_find_paper = True
break
except Exception as e:
if (r + 1) < repeat_times:
                        print(f'\terror occurred: {str(e)}')
print(f'\tsleep {(r + 1) * 5} seconds...')
time.sleep((r + 1) * 5)
print(f'{r + 1}-th reloading page')
divs = driver.find_element(By.ID, group_id). \
find_elements(By.CLASS_NAME, 'note ')
else:
print('\tskip this page.')
if not is_find_paper:
continue
# time.sleep(time_step_in_seconds)
this_error_log, this_number_paper = __download_papers_given_divs(
driver=driver,
divs=divs,
save_dir=save_dir,
paper_postfix=paper_postfix,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port
)
for e in this_error_log:
error_log.append(e)
# get into next page
current_page += 1
pages = driver.find_elements(
By.XPATH, _get_pages_xpath(year))
total_pages_number = int(pages[-3].text)
            # if we do not reread the pages, they will no longer be
            # available and selenium will raise:
            # selenium.common.exceptions.StaleElementReferenceException:
            # Message: stale element reference: element is not attached to the
            # page document
page = __get_into_pages_given_number(
driver=driver, page_number=current_page, pages=pages,
wait_fn=mywait)
else: # no pages
divs = driver.find_element(By.ID, group_id). \
find_elements(By.CLASS_NAME, 'note ')
# temp workaround
repeat_times = 3
is_find_paper = False
for r in range(repeat_times):
try:
a_hrefs = divs[0].find_elements(By.TAG_NAME, "a")
name = slugify(a_hrefs[0].text.strip())
link = a_hrefs[1].get_attribute('href')
a_hrefs = divs[-1].find_elements(By.TAG_NAME, "a")
name = slugify(a_hrefs[0].text.strip())
link = a_hrefs[1].get_attribute('href')
is_find_paper = True
break
except Exception as e:
if (r + 1) < repeat_times:
                    print(f'\terror occurred: {str(e)}')
print(f'\tsleep {(r + 1) * 5} seconds...')
time.sleep((r + 1) * 5)
print(f'{r + 1}-th reloading page')
divs = driver.find_element(By.ID, group_id). \
find_elements(By.CLASS_NAME, 'note ')
else:
print('\tskipped!!!')
if is_find_paper:
# time.sleep(time_step_in_seconds)
this_error_log, this_number_paper = __download_papers_given_divs(
driver=driver,
divs=divs,
save_dir=save_dir,
paper_postfix=paper_postfix,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port
)
for e in this_error_log:
error_log.append(e)
driver.quit()
# 2. write error log
print('write error log')
log_file_pathname = os.path.join(
project_root_folder, 'log', 'download_err_log.txt'
)
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
f.write(e)
f.write('\n')
f.write('\n')
def download_icml_papers_given_url_and_group_id(
save_dir, year, base_url, group_id, conference='ICML', start_page=1,
time_step_in_seconds=10, downloader='IDM', proxy_ip_port=None):
"""
    download ICML papers for the given web url and the paper group id
:param save_dir: str, paper save path
:type save_dir: str
    :param year: int, ICML year, currently only year >= 2018 is supported
:type year: int
:param base_url: str, paper website url
:type base_url: str
:param group_id: str, paper group id, such as "poster" and "oral".
:type group_id: str
    :param conference: str, conference name, such as ICML. Default: ICML
:param start_page: int, the initial downloading webpage number, only the
pages whose number is equal to or greater than this number will be
processed. Default: 1
    :param time_step_in_seconds: int, the interval time between two download
        requests in seconds. Default: 10
:param downloader: str, the downloader to download, could be 'IDM' or
'Thunder'. Default: 'IDM'
    :param proxy_ip_port: str or None, proxy ip address and port,
        eg: "127.0.0.1:7890". Only used by the webdriver and the request
        downloader (downloader=None). Default: None.
:type proxy_ip_port: str | None
:return:
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
def mywait(driver, aria_controls=None):
# wait for the select element to become visible
# print('Starting web driver wait...')
wait = WebDriverWait(driver, 20)
# ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,)
# wait = WebDriverWait(driver, 20, ignored_exceptions=ignored_exceptions)
# print('Starting web driver wait... finished')
# res = wait.until(EC.presence_of_element_located((By.ID, "notes")))
# print("Successful load the website!->", res)
res = wait.until(EC.presence_of_element_located((By.ID, "notes")))
res = wait.until(EC.presence_of_element_located((By.CLASS_NAME, "submissions-list")))
# print("Successful load the website notes!->", res)
# res = wait.until(EC.presence_of_element_located(
# (By.XPATH, f'''//*[@id="{group_id}"]/nav''')))
# scroll to bottom of page
# https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium
driver.find_element(By.TAG_NAME, 'body').send_keys(
Keys.CONTROL + Keys.END)
time.sleep(0.3)
if aria_controls is None:
wait.until(EC.element_to_be_clickable(
(By.XPATH, f'//*[@class="submissions-list"]/nav/ul/li[3]/a''')))
else:
wait.until(EC.element_to_be_clickable(
(By.XPATH,
f'''//*[@id='{aria_controls}']/div/div/nav/ul/li[3]/a''')))
wait.until(EC.presence_of_element_located(
(By.XPATH,
f'''//*[@id='{aria_controls}']/div/div/ul/li[1]/div/h4/a[1]''')))
# print("Successful load the website pagination!->", res)
time.sleep(2) # seconds, workaround for bugs
paper_postfix = f'{conference}_{year}'
error_log = []
driver = get_driver(proxy_ip_port=proxy_ip_port)
driver.get(base_url)
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# wait = WebDriverWait(driver, 20)
mywait(driver)
# get into poster or oral page
nav_tap = driver.find_elements(
By.XPATH, f'//ul[@class="nav nav-tabs"]/li')
is_found_group = False
for li in nav_tap:
if group_id in li.text.lower():
if 'poster' in group_id and 'spotlight' in li.text.lower():
# spotlight-poster should be recognized as spotlight rather
# than poster
continue
page_link = li.find_element(By.TAG_NAME, "a")
            # scroll to top of the page; if not at the top, the click action will not work
# https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium
driver.find_element(By.TAG_NAME, 'body').send_keys(
Keys.CONTROL + Keys.HOME)
aria_controls = page_link.get_attribute('aria-controls')
page_link.click()
mywait(driver, aria_controls) # there is no request in here
is_found_group = True
break
if not is_found_group:
        raise ValueError(f'{group_id} papers not found at {base_url}!!!')
# pages = driver.find_elements(
# By.XPATH, f'//nav[@aria-label="page navigation"]/ul/li')
pages = driver.find_elements(
By.XPATH, f'''//*[@id='{aria_controls}']/div/div/nav/ul/li''')
current_page = 1
# ind_page = 2 # 0 << ; 1 <
total_pages_number = int(pages[-3].text) # << | < | 1, 2, 3, ... | > | >>
last_total_pages = total_pages_number
# get into start pages
while current_page < start_page:
# flip pages until seeing the start page
if total_pages_number < start_page:
current_page = total_pages_number
__get_into_pages_given_number(
driver=driver, page_number=current_page, pages=pages,
wait_fn=mywait, condition=aria_controls)
print(f'getting into web page {current_page}...')
# print("Successful load the website pagination!->", res)
pages = driver.find_elements(
By.XPATH, f'''//*[@id='{aria_controls}']/div/div/nav/ul/li''')
total_pages_number = int(pages[-3].text)
            # total page count remains unchanged after reload
if total_pages_number == last_total_pages:
print(f'reached last({total_pages_number}-th) webpage')
                # we reached the last page but its number is still less than
                # the start page, so the start page doesn't exist. PRINT ERROR
                # and return
print(f'ERROR: THE {start_page}-th webpage not found!')
return
else:
current_page = start_page
page = __get_into_pages_given_number(
driver=driver, page_number=current_page, pages=pages, wait_fn=mywait,
condition=aria_controls)
while current_page <= total_pages_number:
if page is None:
break
print(f'downloading {group_id} papers in page: {current_page}')
divs = driver.find_elements(
By.XPATH, f'''//*[@id='{aria_controls}']/div/div/ul/li''')
# temp workaround
repeat_times = 3
is_find_paper = False
for r in range(repeat_times):
try:
a_hrefs = divs[0].find_elements(By.TAG_NAME, "a")
name = slugify(a_hrefs[0].text.strip())
link = a_hrefs[1].get_attribute('href')
a_hrefs = divs[-1].find_elements(By.TAG_NAME, "a")
name = slugify(a_hrefs[0].text.strip())
link = a_hrefs[1].get_attribute('href')
is_find_paper = True
break
except Exception as e:
if (r+1) < repeat_times:
                    print(f'\terror occurred: {str(e)}')
print(f'\tsleep {(r+1)*5} seconds...')
time.sleep((r+1)*5)
print(f'{r+1}-th reloading page')
divs = driver.find_elements(
By.XPATH,
f'''//*[@id='{aria_controls}']/div/div/ul/li''')
else:
print('\tskip this page.')
if not is_find_paper:
continue
# time.sleep(time_step_in_seconds)
this_error_log, this_number_paper = __download_papers_given_divs(
driver=driver,
divs=divs,
save_dir=save_dir,
paper_postfix=paper_postfix,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port
)
for e in this_error_log:
error_log.append(e)
# get into next page
current_page += 1
pages = driver.find_elements(
By.XPATH, f'''//*[@id='{aria_controls}']/div/div/nav/ul/li''')
total_pages_number = int(pages[-3].text)
        # if we do not reread the pages, they will no longer be
        # available and selenium will raise:
        # selenium.common.exceptions.StaleElementReferenceException:
        # Message: stale element reference: element is not attached to the
        # page document
page = __get_into_pages_given_number(
driver=driver, page_number=current_page, pages=pages,
wait_fn=mywait, condition=aria_controls)
driver.quit()
# 2. write error log
print('write error log')
log_file_pathname = os.path.join(
project_root_folder, 'log', 'download_err_log.txt'
)
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
f.write(e)
f.write('\n')
f.write('\n')
def get_pages_str(pages):
page_str_list = [p.text for p in pages]
# print(f'Current page navigation bar:\n{page_str_list}')
return page_str_list
def get_max_page_number(page_str_list):
is_find_number = False
for i, page_str in enumerate(page_str_list):
if not page_str.isnumeric() and is_find_number:
return int(page_str_list[i-1])
if page_str.isnumeric():
is_find_number = True
return int(page_str_list[-1])
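# A minimal usage sketch of the two helpers above (the pagination bar text is
# hypothetical; the live openreview navigation bar may differ):
#   get_max_page_number(['<<', '<', '1', '2', '3', '>', '>>'])  ->  3
# i.e. the last numeric entry before the trailing arrows is taken as the
# maximum page number.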
def download_papers_given_url_and_group_id(
save_dir, year, base_url, group_id, conference, start_page=1,
time_step_in_seconds=10, downloader='IDM', proxy_ip_port=None,
is_have_pages=True, is_need_click_group_button=False):
"""
    download papers for the given web url and the paper group id
:param save_dir: str, paper save path
:type save_dir: str
    :param year: int, conference year, currently only year >= 2018 is supported
:type year: int
:param base_url: str, paper website url
:type base_url: str
:param group_id: str, paper group id, such as "notable-top-5-",
"notable-top-25-", "poster", "oral-submissions",
"spotlight-submissions", "poster-submissions", etc.
:type group_id: str
:param conference: str, conference name, such as CORL.
:param start_page: int, the initial downloading webpage number, only the
pages whose number is equal to or greater than this number will be
processed. Default: 1
    :param time_step_in_seconds: int, the interval time between two download
        requests in seconds. Default: 10
:param downloader: str, the downloader to download, could be 'IDM' or
'Thunder'. Default: 'IDM'
    :param proxy_ip_port: str or None, proxy ip address and port,
        eg: "127.0.0.1:7890". Only used by the webdriver and the request
        downloader (downloader=None). Default: None.
:type proxy_ip_port: str | None
    :param is_have_pages: bool, whether the webpage has pagination. Default:
        True.
:type is_have_pages: bool
    :param is_need_click_group_button: bool, whether the group button in the
        webpage needs to be clicked. For some years, e.g. 2018, the navigation
        fragment "#xxxxx" in the base url does not work, so the button must be
        clicked before reading content from the webpage. Default: False.
:type is_need_click_group_button: bool
:return:
"""
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
def _get_pages_xpath(year):
if year <= 2023:
xpath = f'''//*[@id="{group_id}"]/nav/ul/li'''
else:
xpath = f'''//*[@id="{group_id}"]/div/div/nav/ul/li'''
return xpath
def mywait(driver, condition=None):
# wait for the select element to become visible
# print('Starting web driver wait...')
# ignored_exceptions = (NoSuchElementException,
# StaleElementReferenceException,)
# wait = WebDriverWait(driver, 20, ignored_exceptions=ignored_exceptions)
wait = WebDriverWait(driver, 20)
# print('Starting web driver wait... finished')
# res = wait.until(EC.presence_of_element_located((By.ID, "notes")))
# print("Successful load the website!->", res)
# if year <= 2023:
# res = wait.until(
# EC.presence_of_element_located((By.CLASS_NAME, "note")))
# print("Successful load the website notes!->", res)
# res = wait.until(EC.presence_of_element_located(
# (By.XPATH, f'''//*[@id="{group_id}"]/nav''')))
if is_have_pages:
# scroll to bottom of page
# https://stackoverflow.com/questions/45576958/scrolling-to-top-of-the-page-in-python-using-selenium
driver.find_element(By.TAG_NAME, 'body').send_keys(
Keys.CONTROL + Keys.END)
            # both branches of the original year check used the same locator,
            # since _get_pages_xpath(year) already accounts for the year
            wait.until(EC.element_to_be_clickable(
                (By.XPATH, f'{_get_pages_xpath(year)}[3]/a')))
# print("Successful load the website pagination!->", res)
time.sleep(2) # seconds, workaround for bugs
paper_postfix = f'{conference}_{year}'
error_log = []
driver = get_driver(proxy_ip_port=proxy_ip_port)
driver.get(base_url)
if not os.path.exists(save_dir):
os.makedirs(save_dir)
if is_need_click_group_button:
archive_is_have_pages = is_have_pages
is_have_pages = False
mywait(driver)
aria_controls = base_url.split('#')[-1]
# scroll to home of page
driver.find_element(By.TAG_NAME, 'body').send_keys(
Keys.CONTROL + Keys.HOME)
group_button = driver.find_element(
By.XPATH, f"""//a[@aria-controls="{aria_controls}"]"""
)
group_button.click()
is_have_pages = archive_is_have_pages
mywait(driver)
if is_have_pages:
pages = driver.find_elements(By.XPATH, _get_pages_xpath(year))
current_page = 1
ind_page = 2 # 0 << ; 1 <
total_pages_number = int(pages[-3].text)
# << | < | 1, 2, 3, ... | > | >>
last_total_pages = total_pages_number
# get into start pages
while current_page < start_page:
# flip pages until seeing the start page
if total_pages_number < start_page:
current_page = total_pages_number
__get_into_pages_given_number(
driver=driver, page_number=current_page, pages=pages,
wait_fn=mywait)
print(f'getting into web page {current_page}...')
# res = wait.until(EC.presence_of_element_located(
# (By.XPATH, f'//*[@id="{group_id}"]/ul/li/h4/a')))
# res = wait.until(EC.presence_of_element_located(
# (By.XPATH, f'''//*[@id="{group_id}"]/nav''')))
mywait(driver)
# print("Successful load the website pagination!->", res)
pages = driver.find_elements(
By.XPATH, _get_pages_xpath(year))
total_pages_number = int(pages[-3].text)
                # total page count remains unchanged after reload
if total_pages_number == last_total_pages:
print(f'reached last({total_pages_number}-th) webpage')
                    # we reached the last page but its number is still
                    # less than the start page, so the start page doesn't
                    # exist. PRINT ERROR and return
print(f'ERROR: THE {start_page}-th webpage not found!')
return
else:
current_page = start_page
page = __get_into_pages_given_number(
driver=driver, page_number=current_page, pages=pages, wait_fn=mywait)
while current_page <= total_pages_number:
if page is None:
break
print(f'downloading {group_id} papers in page: {current_page}')
mywait(driver)
divs = driver.find_element(By.ID, group_id). \
find_elements(By.CLASS_NAME, 'note ')
# temp workaround
repeat_times = 3
is_find_paper = False
for r in range(repeat_times):
try:
a_hrefs = divs[0].find_elements(By.TAG_NAME, "a")
name = slugify(a_hrefs[0].text.strip())
link = a_hrefs[1].get_attribute('href')
a_hrefs = divs[-1].find_elements(By.TAG_NAME, "a")
name = slugify(a_hrefs[0].text.strip())
link = a_hrefs[1].get_attribute('href')
is_find_paper = True
break
except Exception as e:
if (r + 1) < repeat_times:
                        print(f'\terror occurred: {str(e)}')
print(f'\tsleep {(r + 1) * 5} seconds...')
time.sleep((r + 1) * 5)
print(f'{r + 1}-th reloading page')
divs = driver.find_element(By.ID, group_id). \
find_elements(By.CLASS_NAME, 'note ')
else:
print('\tskip this page.')
if not is_find_paper:
continue
# time.sleep(time_step_in_seconds)
this_error_log, this_number_paper = __download_papers_given_divs(
driver=driver,
divs=divs,
save_dir=save_dir,
paper_postfix=paper_postfix,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port
)
for e in this_error_log:
error_log.append(e)
# get into next page
current_page += 1
pages = driver.find_elements(
By.XPATH, _get_pages_xpath(year))
total_pages_number = int(pages[-3].text)
            # if we do not reread the pages, they will no longer be
            # available and selenium will raise:
            # selenium.common.exceptions.StaleElementReferenceException:
            # Message: stale element reference: element is not attached to the
            # page document
page = __get_into_pages_given_number(
driver=driver, page_number=current_page, pages=pages,
wait_fn=mywait)
else: # no pages
divs = driver.find_element(By.ID, group_id). \
find_elements(By.CLASS_NAME, 'note ')
# temp workaround
repeat_times = 3
is_find_paper = False
for r in range(repeat_times):
try:
a_hrefs = divs[0].find_elements(By.TAG_NAME, "a")
name = slugify(a_hrefs[0].text.strip())
link = a_hrefs[1].get_attribute('href')
a_hrefs = divs[-1].find_elements(By.TAG_NAME, "a")
name = slugify(a_hrefs[0].text.strip())
link = a_hrefs[1].get_attribute('href')
is_find_paper = True
break
except Exception as e:
if (r + 1) < repeat_times:
                    print(f'\terror occurred: {str(e)}')
print(f'\tsleep {(r + 1) * 5} seconds...')
time.sleep((r + 1) * 5)
print(f'{r + 1}-th reloading page')
divs = driver.find_element(By.ID, group_id). \
find_elements(By.CLASS_NAME, 'note ')
else:
print('\tskipped!!!')
if is_find_paper:
# time.sleep(time_step_in_seconds)
this_error_log, this_number_paper = __download_papers_given_divs(
driver=driver,
divs=divs,
save_dir=save_dir,
paper_postfix=paper_postfix,
time_step_in_seconds=time_step_in_seconds,
downloader=downloader,
proxy_ip_port=proxy_ip_port
)
for e in this_error_log:
error_log.append(e)
driver.quit()
# 2. write error log
print('write error log')
log_file_pathname = os.path.join(
project_root_folder, 'log', 'download_err_log.txt'
)
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
f.write(e)
f.write('\n')
f.write('\n')
if __name__ == "__main__":
year = 2023
save_dir = rf'E:\ICML_{year}'
base_url = 'https://openreview.net/group?id=ICML.cc/2023/Conference'
# download_nips_papers_given_url(
# save_dir, year, base_url,
# start_page=1,
# time_step_in_seconds=10,
# downloader='IDM')
# download_icml_papers_given_url_and_group_id(
# save_dir, year, base_url, group_id='oral', start_page=1,
# time_step_in_seconds=10, )
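    # a minimal usage sketch of the generic helper above (the group_id and the
    # "#..." fragment below are assumptions; check them against the live
    # openreview page before running):
    # download_papers_given_url_and_group_id(
    #     save_dir=save_dir, year=year,
    #     base_url=base_url + '#accept-oral',
    #     group_id='accept-oral', conference='ICML',
    #     start_page=1, time_step_in_seconds=10, downloader='IDM')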
================================================
FILE: lib/pmlr.py
================================================
"""
pmlr.py
20210618
"""
from bs4 import BeautifulSoup
import os
from tqdm import tqdm
from slugify import slugify
from lib.downloader import Downloader
from .my_request import urlopen_with_retry
def download_paper_given_volume(
volume, save_dir, postfix, is_download_supplement=True,
time_step_in_seconds=5, downloader='IDM', is_random_step=True):
"""
download main and supplement papers from PMLR.
:param volume: str, such as 'v1', 'r1'
:param save_dir: str, paper and supplement material's save path
    :param postfix: str, the postfix that will be appended to the end of papers' titles
:param is_download_supplement: bool, True for downloading supplemental material
:param time_step_in_seconds: int, the interval time between two downloading
requests in seconds
:param downloader: str, the downloader to download, could be 'IDM' or None,
Default: 'IDM'
    :param is_random_step: bool, whether to randomly sample the time step between two
adjacent download requests. If True, the time step will be sampled
from Uniform(0.5t, 1.5t), where t is the given time_step_in_seconds.
Default: True.
:return: True
"""
downloader = Downloader(
downloader=downloader, is_random_step=is_random_step)
init_url = f'http://proceedings.mlr.press/{volume}/'
if is_download_supplement:
main_save_path = os.path.join(save_dir, 'main_paper')
supplement_save_path = os.path.join(save_dir, 'supplement')
os.makedirs(main_save_path, exist_ok=True)
os.makedirs(supplement_save_path, exist_ok=True)
else:
main_save_path = save_dir
os.makedirs(main_save_path, exist_ok=True)
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) '
'Gecko/20100101 Firefox/23.0'}
content = urlopen_with_retry(url=init_url, headers=headers)
soup = BeautifulSoup(content, 'html.parser')
paper_list = soup.find_all('div', {'class': 'paper'})
error_log = []
title_list = []
num_download = len(paper_list)
pbar = tqdm(zip(paper_list, range(num_download)), total=num_download)
for paper in pbar:
# get title
this_paper = paper[0]
title = slugify(this_paper.find_all('p', {'class': 'title'})[0].text)
try:
pbar.set_description(
f'Downloading {postfix} paper {paper[1] + 1}/{num_download}:'
f' {title}')
except:
pbar.set_description(
f'''Downloading {postfix} paper {paper[1] + 1}/{num_download}: '''
f'''{title.encode('utf8')}''')
title_list.append(title)
this_paper_main_path = os.path.join(main_save_path,
f'{title}_{postfix}.pdf')
if is_download_supplement:
this_paper_supp_path = os.path.join(
supplement_save_path, f'{title}_{postfix}_supp.pdf')
this_paper_supp_path_no_ext = os.path.join(
supplement_save_path, f'{title}_{postfix}_supp.')
if os.path.exists(this_paper_main_path) and os.path.exists(
this_paper_supp_path):
continue
else:
if os.path.exists(this_paper_main_path):
continue
# get abstract page url
links = this_paper.find_all('p', {'class': 'links'})[0].find_all('a')
supp_link = None
main_link = None
for link in links:
if 'Download PDF' == link.text or 'pdf' == link.text:
main_link = link.get('href')
elif is_download_supplement and \
('Supplementary PDF' == link.text or
'Supplementary Material' == link.text or
'supplementary' == link.text or
'Supplementary ZIP' == link.text or
'Other Files' == link.text):
supp_link = link.get('href')
if supp_link[-3:] != 'pdf':
this_paper_supp_path = this_paper_supp_path_no_ext + \
supp_link[-3:]
# try 1 time
# error_flag = False
for d_iter in range(1):
try:
# download paper with IDM
if not os.path.exists(
this_paper_main_path) and main_link is not None:
downloader.download(
urls=main_link,
save_path=this_paper_main_path,
time_sleep_in_seconds=time_step_in_seconds
)
except Exception as e:
# error_flag = True
print('Error: ' + title + ' - ' + str(e))
error_log.append(
(title, main_link, 'main paper download error', str(e)))
# download supp
if is_download_supplement:
# check whether the supp can be downloaded
if not os.path.exists(
this_paper_supp_path) and supp_link is not None:
try:
downloader.download(
urls=supp_link,
save_path=this_paper_supp_path,
time_sleep_in_seconds=time_step_in_seconds
)
except Exception as e:
# error_flag = True
print('Error: ' + title + ' - ' + str(e))
error_log.append((title, supp_link,
'supplement download error', str(e)))
# write error log
print('writing error log...')
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
log_file_pathname = os.path.join(
project_root_folder, 'log', 'download_err_log.txt')
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
if e is not None:
f.write(e)
else:
f.write('None')
f.write('\n')
f.write('\n')
return True
if __name__ == '__main__':
download_paper_given_volume(
        volume='v150',
save_dir=r'D:\The_KDD21_Workshop_on_Causal_Discovery',
postfix=f'',
is_download_supplement=False,
time_step_in_seconds=5,
downloader='IDM'
)
================================================
FILE: lib/proxy.py
================================================
"""
proxy.py
20230228
"""
from selenium.webdriver.common.proxy import Proxy, ProxyType
import urllib
def get_proxy(ip_port: str):
"""
setup proxy
:param ip_port: str, proxy server ip address without protocol prefix,
eg: "127.0.0.1:7890"
:return: proxy (instance of selenium.webdriver.common.proxy.Proxy)
        Then the proxy could be passed to webdriver.Chrome:
capabilities = webdriver.DesiredCapabilities.CHROME
proxy.add_to_capabilities(capabilities)
driver = webdriver.Chrome(
service=Service(ChromeDriverManager().install()),
desired_capabilities=capabilities)
"""
proxy = Proxy()
proxy.proxy_type = ProxyType.MANUAL
proxy.http_proxy = ip_port
proxy.ssl_proxy = ip_port
return proxy
def set_proxy_4_urllib_request(ip_port: str):
"""
setup proxy
:param ip_port: str or None, proxy server ip address with or without
protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890".
:return: proxies, dict with keys "http" and "https" or None.
"""
if ip_port is None:
proxies = None
else:
if not ip_port.startswith('http'):
ip_port = 'http://' + ip_port
proxies = {
'http': ip_port,
'https': ip_port
}
proxy_support = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
return proxies
def get_proxy_4_requests(ip_port: str):
"""
setup proxy
:param ip_port: str or None, proxy server ip address with or without
protocol prefix, eg: "127.0.0.1:7890", "http://127.0.0.1:7890".
:return: proxies, dict with keys "http" and "https" or None.
"""
if ip_port is None:
proxies = None
else:
if not ip_port.startswith('http'):
ip_port = 'http://' + ip_port
proxies = {
'http': ip_port,
'https': ip_port
}
return proxies
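# A minimal usage sketch for get_proxy_4_requests (assuming the requests
# package is installed; the url is just an example ip-echo service):
#   import requests
#   proxies = get_proxy_4_requests('127.0.0.1:7890')
#   resp = requests.get('http://ip-api.com/json', proxies=proxies)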
if __name__ == "__main__":
# get my ip
import json
set_proxy_4_urllib_request('127.0.0.1:7897')
url = "http://ip-api.com/json" # ipv4
response = urllib.request.urlopen(url)
data = json.load(response)
if data['status'] == 'success':
ip = data['query']
print(f'ip: {ip}')
print(f'details: {data}')
else:
        print(f'failed, try again: {data}')
================================================
FILE: lib/springer.py
================================================
"""
springer.py
some function for springer
20201106
"""
import urllib
from bs4 import BeautifulSoup
from tqdm import tqdm
from slugify import slugify
from .my_request import urlopen_with_retry
import re
def get_paper_name_link_from_url(url):
headers = {
'User-Agent':
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
paper_dict = dict()
content = urlopen_with_retry(url=url, headers=headers)
soup = BeautifulSoup(content, 'html5lib')
paper_list_bar = tqdm(
soup.find('section', {'data-title': 'Table of contents'}).find(
'div', {'class': 'c-book-section'}).find_all(
['li'], {'data-test': 'chapter'}))
for paper in paper_list_bar:
try:
title = slugify(
paper.find(['h3', 'h4'], {'class': 'app-card-open__heading'}).text)
link = urllib.parse.urljoin(
url,
paper.find(
['h3', 'h4'], {'class': 'app-card-open__heading'}
).a.get('href'))
# 'https://link.springer.com/chapter/10.1007/978-3-642-33718-5_2'
# >>
# 'https://link.springer.com/content/pdf/10.1007/978-3-642-33718-5_2.pdf'
link = f'''{link.replace('/chapter/', '/content/pdf/')}.pdf'''
paper_dict[title] = link
except Exception as e:
print(f'ERROR: {str(e)}')
return paper_dict
if __name__ == '__main__':
papers = get_paper_name_link_from_url('https://link.springer.com/book/10.1007%2F978-3-319-46448-0')
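    # a follow-up sketch (assuming lib.downloader.Downloader, the same helper
    # used in lib/pmlr.py): iterate the returned {slugified-title: pdf-link}
    # dict and download each file, e.g.
    #   from lib.downloader import Downloader
    #   d = Downloader(downloader=None)
    #   for title, link in papers.items():
    #       d.download(urls=link, save_path=f'{title}.pdf',
    #                  time_sleep_in_seconds=5)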
================================================
FILE: lib/supplement_porcess.py
================================================
"""
supplement_process.py
"""
from PyPDF3 import PdfFileMerger
import zipfile
import os
import shutil
from tqdm import tqdm
def unzipfile(zip_file, save_path):
"""
unzip zip file to save_path
    :param zip_file: str, zip file's full pathname.
:param save_path: str, the path store unzipped files.
:return: None
"""
zip_ref = zipfile.ZipFile(zip_file, 'r')
zip_ref.extractall(save_path)
zip_ref.close()
def get_potential_supp_pdf(path):
"""
get all the potential supplemental pdf file pathname
:param path: str, the path of unzipped files
:return: supp_pdf_list, List of str, pdf files' full pathnames
"""
supp_pdf_list = [f for f in os.scandir(path) if f.name.endswith('.pdf')]
if len(supp_pdf_list) == 0:
supp_pdf_list = []
for dir in os.scandir(path):
if dir.is_dir() and not dir.name.startswith('__'):
for pdf in os.scandir(dir.path):
if pdf.name.endswith('.pdf'):
supp_pdf_list.append(pdf.path)
if len(supp_pdf_list) == 0:
supp_pdf_list = []
for dir in os.scandir(path):
if dir.is_dir() and not dir.name.startswith('__'):
for sub_dir in os.scandir(dir):
if sub_dir.is_dir() and not sub_dir.name.startswith('__'):
for pdf in os.scandir(sub_dir.path):
if pdf.name.endswith('.pdf'):
supp_pdf_list.append(pdf.path)
return supp_pdf_list
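# A sketch of the unzipped layouts the helper above handles (file and folder
# names are hypothetical):
#   temp_zip/appendix.pdf                      -> found by the first pass
#   temp_zip/supp_dir/appendix.pdf             -> found by the second pass
#   temp_zip/supp_dir/material/appendix.pdf    -> found by the third pass
# directories whose names start with '__' (e.g. __MACOSX) are skipped.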
def move_main_and_supplement_2_one_directory_with_group(main_path, supplement_path, supp_pdf_save_path):
"""
    unzip supplemental zip files to get the pdf files, then copy and
    rename them into the given path (supp_pdf_save_path/group_name)
    :param main_path: str, the main papers' path
    :param supplement_path: str, the supplemental material's path
:param supp_pdf_save_path: str, the supplemental pdf files' save path
"""
if not os.path.exists(main_path):
raise ValueError(f'''can not open '{main_path}' !''')
if not os.path.exists(supplement_path):
raise ValueError(f'''can not open '{supplement_path}' !''')
error_log = []
# make temp dir to unzip zip file
temp_zip_dir = '.\\temp_zip'
if not os.path.exists(temp_zip_dir):
os.mkdir(temp_zip_dir)
else:
# remove all files
for unzip_file in os.listdir(temp_zip_dir):
if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)):
os.remove(os.path.join(temp_zip_dir, unzip_file))
if os.path.isdir(os.path.join(temp_zip_dir, unzip_file)):
shutil.rmtree(os.path.join(temp_zip_dir, unzip_file))
else:
print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file))
for group in os.scandir(main_path):
if group.is_dir():
paper_bar = tqdm(os.scandir(group.path))
for paper in paper_bar:
if paper.is_file():
name, extension = os.path.splitext(paper.name)
if '.pdf' == extension:
paper_bar.set_description(f'''processing {name}''')
supp_pdf_path = None
# error_flag = False
if os.path.exists(os.path.join(supplement_path, group.name, f'{name}_supp.pdf')):
supp_pdf_path = os.path.join(supplement_path, group.name, f'{name}_supp.pdf')
shutil.copyfile(
supp_pdf_path, os.path.join(supp_pdf_save_path, group.name, f'{name}_supp.pdf'))
elif os.path.exists(os.path.join(supplement_path, group.name, f'{name}_supp.zip')):
try:
unzipfile(
zip_file=os.path.join(supplement_path, group.name, f'{name}_supp.zip'),
save_path=temp_zip_dir
)
except Exception as e:
print('Error: ' + name + ' - ' + str(e))
error_log.append((paper.path, supp_pdf_path, str(e)))
try:
# find if there is a pdf file (by listing all files in the dir)
supp_pdf_list = get_potential_supp_pdf(temp_zip_dir)
# rename the first pdf file
if len(supp_pdf_list) >= 1:
# by default, we only deal with the first pdf
supp_pdf_path = os.path.join(supp_pdf_save_path, group.name, name+'_supp.pdf')
if not os.path.exists(supp_pdf_path):
shutil.move(supp_pdf_list[0], supp_pdf_path)
if len(supp_pdf_list) > 1:
for i in range(1, len(supp_pdf_list)):
supp_pdf_path = os.path.join(
supp_pdf_save_path, group.name, name + f'_supp_{i}.pdf')
if not os.path.exists(supp_pdf_path):
shutil.move(supp_pdf_list[i], supp_pdf_path)
# empty the temp_folder (both the dirs and files)
for unzip_file in os.listdir(temp_zip_dir):
if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)):
os.remove(os.path.join(temp_zip_dir, unzip_file))
elif os.path.isdir(os.path.join(temp_zip_dir, unzip_file)):
shutil.rmtree(os.path.join(temp_zip_dir, unzip_file))
else:
print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file))
except Exception as e:
print('Error: ' + name + ' - ' + str(e))
error_log.append((paper.path, supp_pdf_path, str(e)))
# 2. write error log
print('write error log')
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
log_file_pathname = os.path.join(
project_root_folder, 'log', 'merge_err_log.txt'
)
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
if e is None:
f.write('None')
else:
f.write(e)
f.write('\n')
f.write('\n')
def move_main_and_supplement_2_one_directory(main_path, supplement_path, supp_pdf_save_path):
"""
    unzip supplemental zip files to get the pdf files, then copy and
    rename them into the given path (supp_pdf_save_path)
:param main_path: str, the main papers' path
:param supplement_path: str, the supplemental material's path
:param supp_pdf_save_path: str, the supplemental pdf files' save path
"""
if not os.path.exists(main_path):
raise ValueError(f'''can not open '{main_path}' !''')
if not os.path.exists(supplement_path):
raise ValueError(f'''can not open '{supplement_path}' !''')
os.makedirs(supp_pdf_save_path, exist_ok=True)
error_log = []
# make temp dir to unzip zip file
temp_zip_dir = '..\\temp_zip'
if not os.path.exists(temp_zip_dir):
os.mkdir(temp_zip_dir)
else:
# remove all files
for unzip_file in os.listdir(temp_zip_dir):
if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)):
os.remove(os.path.join(temp_zip_dir, unzip_file))
if os.path.isdir(os.path.join(temp_zip_dir, unzip_file)):
shutil.rmtree(os.path.join(temp_zip_dir, unzip_file))
else:
print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file))
paper_bar = tqdm(os.scandir(main_path))
for paper in paper_bar:
if paper.is_file():
name, extension = os.path.splitext(paper.name)
if '.pdf' == extension:
paper_bar.set_description(f'''processing {name}''')
supp_pdf_path = None
# error_flag = False
if os.path.exists(os.path.join(supp_pdf_save_path, f'{name}_supp.pdf')):
continue
elif os.path.exists(os.path.join(supplement_path, f'{name}_supp.pdf')):
supp_pdf_path = os.path.join(supplement_path, f'{name}_supp.pdf')
shutil.copyfile(supp_pdf_path, os.path.join(supp_pdf_save_path, f'{name}_supp.pdf'))
elif os.path.exists(os.path.join(supplement_path, f'{name}_supp.zip')):
try:
unzipfile(
zip_file=os.path.join(supplement_path, f'{name}_supp.zip'),
save_path=temp_zip_dir)
except Exception as e:
print('Error: ' + name + ' - ' + str(e))
error_log.append((paper.path, supp_pdf_path, str(e)))
try:
# find if there is a pdf file (by listing all files in the dir)
supp_pdf_list = get_potential_supp_pdf(temp_zip_dir)
# rename the first pdf file
if len(supp_pdf_list) >= 1:
# by default, we only deal with the first pdf
supp_pdf_path = os.path.join(supp_pdf_save_path, name+'_supp.pdf')
if not os.path.exists(supp_pdf_path):
shutil.move(supp_pdf_list[0], supp_pdf_path)
if len(supp_pdf_list) > 1:
for i in range(1, len(supp_pdf_list)):
supp_pdf_path = os.path.join(supp_pdf_save_path, name + f'_supp_{i}.pdf')
if not os.path.exists(supp_pdf_path):
shutil.move(supp_pdf_list[i], supp_pdf_path)
# empty the temp_folder (both the dirs and files)
for unzip_file in os.listdir(temp_zip_dir):
if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)):
os.remove(os.path.join(temp_zip_dir, unzip_file))
elif os.path.isdir(os.path.join(temp_zip_dir, unzip_file)):
shutil.rmtree(os.path.join(temp_zip_dir, unzip_file))
else:
print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file))
except Exception as e:
print('Error: ' + name + ' - ' + str(e))
error_log.append((paper.path, supp_pdf_path, str(e)))
# 2. write error log
print('write error log')
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
log_file_pathname = os.path.join(
project_root_folder, 'log', 'merge_err_log.txt'
)
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
if e is None:
f.write('None')
else:
f.write(e)
f.write('\n')
f.write('\n')
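# Expected file layout for the function above (names are hypothetical): for
# every main pdf
#   main_path/some-title_NIPS_2022.pdf
# it looks for
#   supplement_path/some-title_NIPS_2022_supp.pdf   (copied as-is), or
#   supplement_path/some-title_NIPS_2022_supp.zip   (unzipped; the contained
#   pdfs are kept)
# and writes the result to
#   supp_pdf_save_path/some-title_NIPS_2022_supp.pdf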
def merge_main_supplement(main_path, supplement_path, save_path, is_delete_ori_files=False):
"""
merge the main paper and supplemental material into one single pdf file
:param main_path: str, the main papers' path
    :param supplement_path: str, the supplemental material's path
    :param save_path: str, merged pdf files' save path
:param is_delete_ori_files: Bool, True for deleting the original main and supplemental material after merging
"""
if not os.path.exists(main_path):
raise ValueError(f'''can not open '{main_path}' !''')
if not os.path.exists(supplement_path):
raise ValueError(f'''can not open '{supplement_path}' !''')
os.makedirs(save_path, exist_ok=True)
error_log = []
# make temp dir to unzip zip file
temp_zip_dir = '.\\temp_zip'
if not os.path.exists(temp_zip_dir):
os.mkdir(temp_zip_dir)
else:
# remove all files
for unzip_file in os.listdir(temp_zip_dir):
if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)):
os.remove(os.path.join(temp_zip_dir, unzip_file))
if os.path.isdir(os.path.join(temp_zip_dir, unzip_file)):
shutil.rmtree(os.path.join(temp_zip_dir, unzip_file))
else:
print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file))
paper_bar = tqdm(os.scandir(main_path))
for paper in paper_bar:
if paper.is_file():
name, extension = os.path.splitext(paper.name)
if '.pdf' == extension:
paper_bar.set_description(f'''processing {name}''')
if os.path.exists(os.path.join(save_path, paper.name)):
continue
supp_pdf_path = None
                error_flag = False
if os.path.exists(os.path.join(supplement_path, f'{name}_supp.pdf')):
supp_pdf_path = os.path.join(supplement_path, f'{name}_supp.pdf')
elif os.path.exists(os.path.join(supplement_path, f'{name}_supp.zip')):
try:
unzipfile(
zip_file=os.path.join(supplement_path, f'{name}_supp.zip'),
save_path=temp_zip_dir
)
except Exception as e:
print('Error: ' + name + ' - ' + str(e))
error_log.append((paper.path, supp_pdf_path, str(e)))
try:
# find if there is a pdf file (by listing all files in the dir)
supp_pdf_list = get_potential_supp_pdf(temp_zip_dir)
# rename the first pdf file
if len(supp_pdf_list) >= 1:
# by default, we only deal with the first pdf
supp_pdf_path = os.path.join(supplement_path, name+'_supp.pdf')
if not os.path.exists(supp_pdf_path):
shutil.move(supp_pdf_list[0], supp_pdf_path)
# empty the temp_folder (both the dirs and files)
for unzip_file in os.listdir(temp_zip_dir):
if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)):
os.remove(os.path.join(temp_zip_dir, unzip_file))
elif os.path.isdir(os.path.join(temp_zip_dir, unzip_file)):
shutil.rmtree(os.path.join(temp_zip_dir, unzip_file))
else:
print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file))
except Exception as e:
                        error_flag = True
print('Error: ' + name + ' - ' + str(e))
error_log.append((paper.path, supp_pdf_path, str(e)))
# empty the temp_folder (both the dirs and files)
for unzip_file in os.listdir(temp_zip_dir):
if os.path.isfile(os.path.join(temp_zip_dir, unzip_file)):
os.remove(os.path.join(temp_zip_dir, unzip_file))
elif os.path.isdir(os.path.join(temp_zip_dir, unzip_file)):
shutil.rmtree(os.path.join(temp_zip_dir, unzip_file))
else:
print('Cannot Remove - ' + os.path.join(temp_zip_dir, unzip_file))
continue
if supp_pdf_path is not None:
try:
merger = PdfFileMerger()
f_handle1 = open(paper.path, 'rb')
merger.append(f_handle1)
f_handle2 = open(supp_pdf_path, 'rb')
merger.append(f_handle2)
with open(os.path.join(save_path, paper.name), 'wb') as fout:
merger.write(fout)
print('\tmerged!')
f_handle1.close()
f_handle2.close()
merger.close()
if is_delete_ori_files:
os.remove(paper.path)
if os.path.exists(os.path.join(supplement_path, f'{name}_supp.zip')):
os.remove(os.path.join(supplement_path, f'{name}_supp.zip'))
if os.path.exists(os.path.join(supplement_path, f'{name}_supp.pdf')):
os.remove(os.path.join(supplement_path, f'{name}_supp.pdf'))
except Exception as e:
print('Error: ' + name + ' - ' + str(e))
error_log.append((paper.path, supp_pdf_path, str(e)))
if os.path.exists(os.path.join(save_path, paper.name)):
os.remove(os.path.join(save_path, paper.name))
else:
if is_delete_ori_files:
shutil.move(paper.path, os.path.join(save_path, paper.name))
else:
shutil.copyfile(paper.path, os.path.join(save_path, paper.name))
# 2. write error log
print('write error log')
project_root_folder = os.path.abspath(
os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
log_file_pathname = os.path.join(
project_root_folder, 'log', 'merge_err_log.txt'
)
with open(log_file_pathname, 'w') as f:
for log in tqdm(error_log):
for e in log:
if e is None:
f.write('None')
else:
f.write(e)
f.write('\n')
f.write('\n')
def rename_2_short_name(src_path, save_path, target_max_length=128,
extension='pdf'):
"""
    rename files to short filenames while retaining the conference postfix
Args:
src_path (str): path that contains files directly.
save_path (str): path to save the renamed files.
        target_max_length (int): max file name length after renaming. All
            files whose name length is not less than this will be renamed;
            the others stay unchanged and are copied into the save path.
            Default: 128.
        extension (str | None): only the files with this extension will be
            processed. None means all files will be processed. Default: 'pdf'.
Returns:
None
"""
if not os.path.exists(src_path):
raise ValueError(f'Path not found: {src_path}!')
os.makedirs(save_path, exist_ok=True)
for f in tqdm(os.scandir(src_path)):
f_name = f.name
# compare extension
ext = os.path.splitext(f_name)[1]
if extension is not None and ext[1:] != extension:
continue
# compare file name length
l = len(f_name)
if l < target_max_length:
if not os.path.exists(os.path.join(save_path, f_name)):
print(f'\ncopying {f_name}')
shutil.copyfile(f.path, os.path.join(save_path, f_name))
else:
# rename
try:
[title, postfix] = f_name.split('_', 1) # only split to 2 parts
new_title = title[:target_max_length-len(postfix)-2]
new_name = f'{new_title}_{postfix}'
if not os.path.exists(os.path.join(save_path, new_name)):
print(f'\nrenaming {f_name} \n\t-> {new_name}')
shutil.copyfile(f.path, os.path.join(save_path, new_name))
except ValueError:
# ValueError: not enough values to unpack (expected 2, got 1)
print(f'\nWARNING!!!:\n\tunable to parse postfix from {f.path}')
print('\tSo, it will be just copy/rename to short name')
new_title = f_name[:target_max_length - len(ext) - 1]
new_name = f'{new_title}{ext}'
if not os.path.exists(os.path.join(save_path, new_name)):
print(f'\nrenaming {f_name} \n\t-> {new_name}')
shutil.copyfile(f.path, os.path.join(save_path, new_name))
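# A minimal sketch of the renaming rule above (the file name is hypothetical),
# with target_max_length=40: the conference postfix after the first '_' is
# kept and the title part is truncated, e.g.
#   'an-extremely-long-slugified-paper-title-that-overflows_ICML_2023.pdf'
#   -> 'an-extremely-long-slugifi_ICML_2023.pdf'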
def rename_2_short_name_within_group(src_path, save_path, target_max_length=128,
extension='pdf'):
"""
    rename files to short filenames while retaining the conference postfix
Args:
src_path (str): path that contains files:
src_path/group_name/files
save_path (str): path to save the renamed files.
        target_max_length (int): max file name length after renaming. All
            files whose name length is not less than this will be renamed;
            the others stay unchanged and are copied into the save path.
            Default: 128.
        extension (str | None): only the files with this extension will be
            processed. None means all files will be processed. Default: 'pdf'.
Returns:
None
"""
if not os.path.exists(src_path):
raise ValueError(f'Path not found: {src_path}!')
os.makedirs(save_path, exist_ok=True)
for d in tqdm(os.scandir(src_path)):
if not d.is_dir():
continue
print(f'\nprocessing {d.name}')
d_name = d.name
d_name = d_name[:min(len(d_name), target_max_length-1)]
rename_2_short_name(
src_path=d.path,
save_path=os.path.join(save_path, d_name),
target_max_length=target_max_length,
extension=extension
)
================================================
FILE: lib/user_agents.py
================================================
"""
user_agents.py
user agents
20230702
"""
user_agents = [
'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) '
'Gecko/20071127 Firefox/2.0.0.11',
'Opera/9.25 (Windows NT 5.1; U; en)',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; '
'.NET CLR 1.1.4322; .NET CLR 2.0.50727)',
'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) '
'KHTML/3.5.5 (like Gecko) (Kubuntu)',
'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) '
'Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
"Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 "
"(KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 "
"Chrome/16.0.912.77 Safari/535.7",
"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) "
"Gecko/20100101 Firefox/10.0 ",
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/105.0.0.0 Safari/537.36'
]
================================================
FILE: sharelinks.md
================================================
# SHARE LINKS
Aliyun share links
Note: Aliyun Drive has updated its sharing rules, and **a single share link can contain at most 500 files**, so the files have been split across multiple links, 499 files per link, until everything is shared.
## CVPR
### main conference
| year | index | share link | access code |
|:----:|:-----:|:------------------------------------------------------:|:-----------:|
| 2023 | 1 | [1-499](https://www.aliyundrive.com/s/SGMUABYNoRM) | `63un` |
| 2023 | 2 | [500-998](https://www.aliyundrive.com/s/XeXJz53AVKn) | `7ws5` |
| 2023 | 3 | [999-1497](https://www.aliyundrive.com/s/9wjv8gaE95i) | `1er4` |
| 2023 | 4 | [1498-1996](https://www.aliyundrive.com/s/kqt4GNYmSYR) | `lf58` |
| 2023 | 5 | [1997-2358](https://www.aliyundrive.com/s/GyyyD4XnqhZ) | `f47s` |
### workshops
| year | index | share link | access code |
|:----:|:-----:|:----------------------------------------------------:|:-----------:|
| 2023 | 1 | [1-485](https://www.aliyundrive.com/s/gPtPRYcyttz) | `4n5t` |
| 2023 | 2 | [486-698](https://www.aliyundrive.com/s/x18A9AxPJGp) | `x40h` |