Repository: jhao104/proxy_pool
Branch: master
Commit: 50cc52ea50da
Files: 59
Total size: 98.2 KB
Directory structure:
gitextract_arxmexr3/
├── .github/
│ └── workflows/
│ ├── docker-image-latest.yml
│ └── docker-image-tags.yml
├── .gitignore
├── .travis.yml
├── Dockerfile
├── LICENSE
├── README.md
├── _config.yml
├── api/
│ ├── __init__.py
│ └── proxyApi.py
├── db/
│ ├── __init__.py
│ ├── dbClient.py
│ ├── redisClient.py
│ └── ssdbClient.py
├── docker-compose.yml
├── docs/
│ ├── Makefile
│ ├── changelog.rst
│ ├── conf.py
│ ├── dev/
│ │ ├── ext_fetcher.rst
│ │ ├── ext_validator.rst
│ │ └── index.rst
│ ├── index.rst
│ ├── make.bat
│ └── user/
│ ├── how_to_config.rst
│ ├── how_to_run.rst
│ ├── how_to_use.rst
│ └── index.rst
├── fetcher/
│ ├── __init__.py
│ └── proxyFetcher.py
├── handler/
│ ├── __init__.py
│ ├── configHandler.py
│ ├── logHandler.py
│ └── proxyHandler.py
├── helper/
│ ├── __init__.py
│ ├── check.py
│ ├── fetch.py
│ ├── launcher.py
│ ├── proxy.py
│ ├── scheduler.py
│ └── validator.py
├── proxyPool.py
├── requirements.txt
├── setting.py
├── start.sh
├── test/
│ ├── __init__.py
│ ├── testConfigHandler.py
│ ├── testDbClient.py
│ ├── testLogHandler.py
│ ├── testProxyClass.py
│ ├── testProxyFetcher.py
│ ├── testProxyValidator.py
│ ├── testRedisClient.py
│ └── testSsdbClient.py
├── test.py
└── util/
├── __init__.py
├── lazyProperty.py
├── singleton.py
├── six.py
└── webRequest.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/workflows/docker-image-latest.yml
================================================
name: Publish Docker image latest
on:
push:
branches:
- 'master'
jobs:
push_to_registry:
name: Push Docker image to Docker Hub
runs-on: ubuntu-latest
steps:
- name: Check out the repo
uses: actions/checkout@v2
- name: Log in to Docker Hub
uses: docker/login-action@v1
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Extract metadata (tags, labels) for Docker
id: meta
uses: docker/metadata-action@v3
with:
images: jhao104/proxy_pool
- name: Build and push Docker image
uses: docker/build-push-action@v2
with:
context: .
push: true
tags: jhao104/proxy_pool:latest
================================================
FILE: .github/workflows/docker-image-tags.yml
================================================
name: Publish Docker image tags
on:
push:
tags:
- '*'
jobs:
push_to_registry:
name: Push Docker image to Docker Hub
runs-on: ubuntu-latest
steps:
- name: Check out the repo
uses: actions/checkout@v2
- name: Log in to Docker Hub
uses: docker/login-action@v1
with:
username: ${{ secrets.DOCKERHUB_USERNAME }}
password: ${{ secrets.DOCKERHUB_TOKEN }}
- name: Extract metadata (tags, labels) for Docker
id: meta
uses: docker/metadata-action@v3
with:
images: jhao104/proxy_pool
- name: Build and push Docker image
uses: docker/build-push-action@v2
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
================================================
FILE: .gitignore
================================================
.idea/
docs/_build
*.pyc
*.log
================================================
FILE: .travis.yml
================================================
language: python
python:
- "2.7"
- "3.5"
- "3.6"
- "3.7"
- "3.8"
- "3.9"
- "3.10"
- "3.11"
os:
- linux
install:
- pip install -r requirements.txt
script: python test.py
================================================
FILE: Dockerfile
================================================
FROM python:3.6-alpine
MAINTAINER jhao104 <j_hao104@163.com>
WORKDIR /app
COPY ./requirements.txt .
# apk repository
RUN sed -i 's/dl-cdn.alpinelinux.org/mirrors.ustc.edu.cn/g' /etc/apk/repositories
# timezone
RUN apk add -U tzdata && cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && apk del tzdata
# runtime environment
RUN apk add musl-dev gcc libxml2-dev libxslt-dev && \
pip install --no-cache-dir -r requirements.txt && \
apk del gcc musl-dev
COPY . .
EXPOSE 5010
ENTRYPOINT [ "sh", "start.sh" ]
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2017 J_hao104
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
ProxyPool 爬虫代理IP池
=======
[](https://travis-ci.org/jhao104/proxy_pool)
[](http://www.spiderpy.cn/blog/)
[](https://github.com/jhao104/proxy_pool/blob/master/LICENSE)
[](https://github.com/jhao104/proxy_pool/graphs/contributors)
[](https://github.com/jhao104/proxy_pool)
______ ______ _
| ___ \_ | ___ \ | |
| |_/ / \__ __ __ _ __ _ | |_/ /___ ___ | |
| __/| _// _ \ \ \/ /| | | || __// _ \ / _ \ | |
| | | | | (_) | > < \ |_| || | | (_) | (_) || |___
\_| |_| \___/ /_/\_\ \__ |\_| \___/ \___/ \_____\
__ / /
/___ /
### ProxyPool
爬虫代理IP池项目,主要功能为定时采集网上发布的免费代理验证入库,定时验证入库的代理保证代理的可用性,提供API和CLI两种使用方式。同时你也可以扩展代理源以增加代理池IP的质量和数量。
* 文档: [document](https://proxy-pool.readthedocs.io/zh/latest/) [](https://proxy-pool.readthedocs.io/zh/latest/?badge=latest)
* 支持版本: [](https://docs.python.org/2.7/)
[](https://docs.python.org/3.5/)
[](https://docs.python.org/3.6/)
[](https://docs.python.org/3.7/)
[](https://docs.python.org/3.8/)
[](https://docs.python.org/3.9/)
[](https://docs.python.org/3.10/)
[](https://docs.python.org/3.11/)
* 测试地址: http://demo.spiderpy.cn (勿压谢谢)
* 付费代理推荐: [luminati-china](https://get.brightdata.com/github_jh). 国外的亮数据BrightData(以前叫luminati)被认为是代理市场领导者,覆盖全球的7200万IP,大部分是真人住宅IP,成功率扛扛的。付费套餐多种,需要高质量代理IP的可以注册后联系中文客服。[申请免费试用](https://get.brightdata.com/github_jh) 目前有50%折扣优惠活动。(PS:用不明白的同学可以参考这个[使用教程](https://www.cnblogs.com/jhao/p/15611785.html))。
### 运行项目
##### 下载代码:
* git clone
```bash
git clone git@github.com:jhao104/proxy_pool.git
```
* releases
```bash
https://github.com/jhao104/proxy_pool/releases 下载对应zip文件
```
##### 安装依赖:
```bash
pip install -r requirements.txt
```
##### 更新配置:
```python
# setting.py 为项目配置文件
# 配置API服务
HOST = "0.0.0.0" # IP
PORT = 5000 # 监听端口
# 配置数据库
DB_CONN = 'redis://:pwd@127.0.0.1:8888/0'
# 配置 ProxyFetcher
PROXY_FETCHER = [
"freeProxy01", # 这里是启用的代理抓取方法名,所有fetch方法位于fetcher/proxyFetcher.py
"freeProxy02",
# ....
]
```
#### 启动项目:
```bash
# 如果已经具备运行条件, 可用通过proxyPool.py启动。
# 程序分为: schedule 调度程序 和 server Api服务
# 启动调度程序
python proxyPool.py schedule
# 启动webApi服务
python proxyPool.py server
```
### Docker Image
```bash
docker pull jhao104/proxy_pool
docker run --env DB_CONN=redis://:password@ip:port/0 -p 5010:5010 jhao104/proxy_pool:latest
```
### docker-compose
项目目录下运行:
``` bash
docker-compose up -d
```
### 使用
* Api
启动web服务后, 默认配置下会开启 http://127.0.0.1:5010 的api接口服务:
| api | method | Description | params|
| ----| ---- | ---- | ----|
| / | GET | api介绍 | None |
| /get | GET | 随机获取一个代理| 可选参数: `?type=https` 过滤支持https的代理|
| /pop | GET | 获取并删除一个代理| 可选参数: `?type=https` 过滤支持https的代理|
| /all | GET | 获取所有代理 |可选参数: `?type=https` 过滤支持https的代理|
| /count | GET | 查看代理数量 |None|
| /delete | GET | 删除代理 |`?proxy=host:ip`|
* 爬虫使用
如果要在爬虫代码中使用的话, 可以将此api封装成函数直接使用,例如:
```python
import requests
def get_proxy():
return requests.get("http://127.0.0.1:5010/get/").json()
def delete_proxy(proxy):
requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))
# your spider code
def getHtml():
# ....
retry_count = 5
proxy = get_proxy().get("proxy")
while retry_count > 0:
try:
html = requests.get('http://www.example.com', proxies={"http": "http://{}".format(proxy)})
# 使用代理访问
return html
except Exception:
retry_count -= 1
# 删除代理池中代理
delete_proxy(proxy)
return None
```
### 扩展代理
项目默认包含几个免费的代理获取源,但是免费的毕竟质量有限,所以如果直接运行可能拿到的代理质量不理想。所以,提供了代理获取的扩展方法。
添加一个新的代理源方法如下:
* 1、首先在[ProxyFetcher](https://github.com/jhao104/proxy_pool/blob/1a3666283806a22ef287fba1a8efab7b94e94bac/fetcher/proxyFetcher.py#L21)类中添加自定义的获取代理的静态方法,
该方法需要以生成器(yield)形式返回`host:ip`格式的代理,例如:
```python
class ProxyFetcher(object):
# ....
# 自定义代理源获取方法
@staticmethod
def freeProxyCustom1(): # 命名不和已有重复即可
# 通过某网站或者某接口或某数据库获取代理
# 假设你已经拿到了一个代理列表
proxies = ["x.x.x.x:3128", "x.x.x.x:80"]
for proxy in proxies:
yield proxy
# 确保每个proxy都是 host:ip正确的格式返回
```
* 2、添加好方法后,修改[setting.py](https://github.com/jhao104/proxy_pool/blob/1a3666283806a22ef287fba1a8efab7b94e94bac/setting.py#L47)文件中的`PROXY_FETCHER`项:
在`PROXY_FETCHER`下添加自定义方法的名字:
```python
PROXY_FETCHER = [
"freeProxy01",
"freeProxy02",
# ....
"freeProxyCustom1" # # 确保名字和你添加方法名字一致
]
```
`schedule` 进程会每隔一段时间抓取一次代理,下次抓取时会自动识别调用你定义的方法。
### 免费代理源
目前实现的采集免费代理网站有(排名不分先后, 下面仅是对其发布的免费代理情况, 付费代理测评可以参考[这里](https://zhuanlan.zhihu.com/p/33576641)):
| 代理名称 | 状态 | 更新速度 | 可用率 | 地址 | 代码 |
|---------------| ---- | -------- | ------ | ----- |------------------------------------------------|
| 66代理 | ✔ | ★ | * | [地址](http://www.66ip.cn/) | [`freeProxy02`](/fetcher/proxyFetcher.py#L50) |
| 开心代理 | ✔ | ★ | * | [地址](http://www.kxdaili.com/) | [`freeProxy03`](/fetcher/proxyFetcher.py#L63) |
| FreeProxyList | ✔ | ★ | * | [地址](https://www.freeproxylists.net/zh/) | [`freeProxy04`](/fetcher/proxyFetcher.py#L74) |
| 快代理 | ✔ | ★ | * | [地址](https://www.kuaidaili.com/) | [`freeProxy05`](/fetcher/proxyFetcher.py#L92) |
| 冰凌代理 | ✔ | ★★★ | * | [地址](https://www.binglx.cn/) | [`freeProxy06`](/fetcher/proxyFetcher.py#L111) |
| 云代理 | ✔ | ★ | * | [地址](http://www.ip3366.net/) | [`freeProxy07`](/fetcher/proxyFetcher.py#L123) |
| 小幻代理 | ✔ | ★★ | * | [地址](https://ip.ihuan.me/) | [`freeProxy08`](/fetcher/proxyFetcher.py#L133) |
| 免费代理库 | ✔ | ☆ | * | [地址](http://ip.jiangxianli.com/) | [`freeProxy09`](/fetcher/proxyFetcher.py#L143) |
| 89代理 | ✔ | ☆ | * | [地址](https://www.89ip.cn/) | [`freeProxy10`](/fetcher/proxyFetcher.py#L154) |
| 稻壳代理 | ✔ | ★★ | *** | [地址](https://www.docip.ne) | [`freeProxy11`](/fetcher/proxyFetcher.py#L164) |
如果还有其他好的免费代理网站, 可以在提交在[issues](https://github.com/jhao104/proxy_pool/issues/71), 下次更新时会考虑在项目中支持。
### 问题反馈
任何问题欢迎在[Issues](https://github.com/jhao104/proxy_pool/issues) 中反馈,同时也可以到我的[博客](http://www.spiderpy.cn/blog/message)中留言。
你的反馈会让此项目变得更加完美。
### 贡献代码
本项目仅作为基本的通用的代理池架构,不接收特有功能(当然,不限于特别好的idea)。
本项目依然不够完善,如果发现bug或有新的功能添加,请在[Issues](https://github.com/jhao104/proxy_pool/issues)中提交bug(或新功能)描述,我会尽力改进,使她更加完美。
这里感谢以下contributor的无私奉献:
[@kangnwh](https://github.com/kangnwh) | [@bobobo80](https://github.com/bobobo80) | [@halleywj](https://github.com/halleywj) | [@newlyedward](https://github.com/newlyedward) | [@wang-ye](https://github.com/wang-ye) | [@gladmo](https://github.com/gladmo) | [@bernieyangmh](https://github.com/bernieyangmh) | [@PythonYXY](https://github.com/PythonYXY) | [@zuijiawoniu](https://github.com/zuijiawoniu) | [@netAir](https://github.com/netAir) | [@scil](https://github.com/scil) | [@tangrela](https://github.com/tangrela) | [@highroom](https://github.com/highroom) | [@luocaodan](https://github.com/luocaodan) | [@vc5](https://github.com/vc5) | [@1again](https://github.com/1again) | [@obaiyan](https://github.com/obaiyan) | [@zsbh](https://github.com/zsbh) | [@jiannanya](https://github.com/jiannanya) | [@Jerry12228](https://github.com/Jerry12228)
### Release Notes
[changelog](https://github.com/jhao104/proxy_pool/blob/master/docs/changelog.rst)
<a href="https://hellogithub.com/repository/92a066e658d147cc8bd8397a1cb88183" target="_blank"><img src="https://api.hellogithub.com/v1/widgets/recommend.svg?rid=92a066e658d147cc8bd8397a1cb88183&claim_uid=DR60NequsjP54Lc" alt="Featured|HelloGitHub" style="width: 250px; height: 54px;" width="250" height="54" /></a>
================================================
FILE: _config.yml
================================================
theme: jekyll-theme-cayman
================================================
FILE: api/__init__.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: __init__.py
Description :
Author : JHao
date: 2016/12/3
-------------------------------------------------
Change Activity:
2016/12/3:
-------------------------------------------------
"""
__author__ = 'JHao'
================================================
FILE: api/proxyApi.py
================================================
# -*- coding: utf-8 -*-
# !/usr/bin/env python
"""
-------------------------------------------------
File Name: ProxyApi.py
Description : WebApi
Author : JHao
date: 2016/12/4
-------------------------------------------------
Change Activity:
2016/12/04: WebApi
2019/08/14: 集成Gunicorn启动方式
2020/06/23: 新增pop接口
2022/07/21: 更新count接口
-------------------------------------------------
"""
__author__ = 'JHao'
import platform
from werkzeug.wrappers import Response
from flask import Flask, jsonify, request
from util.six import iteritems
from helper.proxy import Proxy
from handler.proxyHandler import ProxyHandler
from handler.configHandler import ConfigHandler
app = Flask(__name__)
conf = ConfigHandler()
proxy_handler = ProxyHandler()
class JsonResponse(Response):
@classmethod
def force_type(cls, response, environ=None):
if isinstance(response, (dict, list)):
response = jsonify(response)
return super(JsonResponse, cls).force_type(response, environ)
app.response_class = JsonResponse
api_list = [
{"url": "/get", "params": "type: ''https'|''", "desc": "get a proxy"},
{"url": "/pop", "params": "", "desc": "get and delete a proxy"},
{"url": "/delete", "params": "proxy: 'e.g. 127.0.0.1:8080'", "desc": "delete an unable proxy"},
{"url": "/all", "params": "type: ''https'|''", "desc": "get all proxy from proxy pool"},
{"url": "/count", "params": "", "desc": "return proxy count"}
# 'refresh': 'refresh proxy pool',
]
@app.route('/')
def index():
return {'url': api_list}
@app.route('/get/')
def get():
https = request.args.get("type", "").lower() == 'https'
proxy = proxy_handler.get(https)
return proxy.to_dict if proxy else {"code": 0, "src": "no proxy"}
@app.route('/pop/')
def pop():
https = request.args.get("type", "").lower() == 'https'
proxy = proxy_handler.pop(https)
return proxy.to_dict if proxy else {"code": 0, "src": "no proxy"}
@app.route('/refresh/')
def refresh():
# TODO refresh会有守护程序定时执行,由api直接调用性能较差,暂不使用
return 'success'
@app.route('/all/')
def getAll():
https = request.args.get("type", "").lower() == 'https'
proxies = proxy_handler.getAll(https)
return jsonify([_.to_dict for _ in proxies])
@app.route('/delete/', methods=['GET'])
def delete():
proxy = request.args.get('proxy')
status = proxy_handler.delete(Proxy(proxy))
return {"code": 0, "src": status}
@app.route('/count/')
def getCount():
proxies = proxy_handler.getAll()
http_type_dict = {}
source_dict = {}
for proxy in proxies:
http_type = 'https' if proxy.https else 'http'
http_type_dict[http_type] = http_type_dict.get(http_type, 0) + 1
for source in proxy.source.split('/'):
source_dict[source] = source_dict.get(source, 0) + 1
return {"http_type": http_type_dict, "source": source_dict, "count": len(proxies)}
def runFlask():
if platform.system() == "Windows":
app.run(host=conf.serverHost, port=conf.serverPort)
else:
import gunicorn.app.base
class StandaloneApplication(gunicorn.app.base.BaseApplication):
def __init__(self, app, options=None):
self.options = options or {}
self.application = app
super(StandaloneApplication, self).__init__()
def load_config(self):
_config = dict([(key, value) for key, value in iteritems(self.options)
if key in self.cfg.settings and value is not None])
for key, value in iteritems(_config):
self.cfg.set(key.lower(), value)
def load(self):
return self.application
_options = {
'bind': '%s:%s' % (conf.serverHost, conf.serverPort),
'workers': 4,
'accesslog': '-', # log to stdout
'access_log_format': '%(h)s %(l)s %(t)s "%(r)s" %(s)s "%(a)s"'
}
StandaloneApplication(app, _options).run()
if __name__ == '__main__':
runFlask()
================================================
FILE: db/__init__.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: __init__.py.py
Description :
Author : JHao
date: 2016/12/2
-------------------------------------------------
Change Activity:
2016/12/2:
-------------------------------------------------
"""
================================================
FILE: db/dbClient.py
================================================
# -*- coding: utf-8 -*-
# !/usr/bin/env python
"""
-------------------------------------------------
File Name: DbClient.py
Description : DB工厂类
Author : JHao
date: 2016/12/2
-------------------------------------------------
Change Activity:
2016/12/02: DB工厂类
2020/07/03: 取消raw_proxy储存
-------------------------------------------------
"""
__author__ = 'JHao'
import os
import sys
from util.six import urlparse, withMetaclass
from util.singleton import Singleton
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
class DbClient(withMetaclass(Singleton)):
"""
DbClient DB工厂类 提供get/put/update/pop/delete/exists/getAll/clean/getCount/changeTable方法
抽象方法定义:
get(): 随机返回一个proxy;
put(proxy): 存入一个proxy;
pop(): 顺序返回并删除一个proxy;
update(proxy): 更新指定proxy信息;
delete(proxy): 删除指定proxy;
exists(proxy): 判断指定proxy是否存在;
getAll(): 返回所有代理;
clean(): 清除所有proxy信息;
getCount(): 返回proxy统计信息;
changeTable(name): 切换操作对象
所有方法需要相应类去具体实现:
ssdb: ssdbClient.py
redis: redisClient.py
mongodb: mongodbClient.py
"""
def __init__(self, db_conn):
"""
init
:return:
"""
self.parseDbConn(db_conn)
self.__initDbClient()
@classmethod
def parseDbConn(cls, db_conn):
db_conf = urlparse(db_conn)
cls.db_type = db_conf.scheme.upper().strip()
cls.db_host = db_conf.hostname
cls.db_port = db_conf.port
cls.db_user = db_conf.username
cls.db_pwd = db_conf.password
cls.db_name = db_conf.path[1:]
return cls
def __initDbClient(self):
"""
init DB Client
:return:
"""
__type = None
if "SSDB" == self.db_type:
__type = "ssdbClient"
elif "REDIS" == self.db_type:
__type = "redisClient"
else:
pass
assert __type, 'type error, Not support DB type: {}'.format(self.db_type)
self.client = getattr(__import__(__type), "%sClient" % self.db_type.title())(host=self.db_host,
port=self.db_port,
username=self.db_user,
password=self.db_pwd,
db=self.db_name)
def get(self, https, **kwargs):
return self.client.get(https, **kwargs)
def put(self, key, **kwargs):
return self.client.put(key, **kwargs)
def update(self, key, value, **kwargs):
return self.client.update(key, value, **kwargs)
def delete(self, key, **kwargs):
return self.client.delete(key, **kwargs)
def exists(self, key, **kwargs):
return self.client.exists(key, **kwargs)
def pop(self, https, **kwargs):
return self.client.pop(https, **kwargs)
def getAll(self, https):
return self.client.getAll(https)
def clear(self):
return self.client.clear()
def changeTable(self, name):
self.client.changeTable(name)
def getCount(self):
return self.client.getCount()
def test(self):
return self.client.test()
================================================
FILE: db/redisClient.py
================================================
# -*- coding: utf-8 -*-
"""
-----------------------------------------------------
File Name: redisClient.py
Description : 封装Redis相关操作
Author : JHao
date: 2019/8/9
------------------------------------------------------
Change Activity:
2019/08/09: 封装Redis相关操作
2020/06/23: 优化pop方法, 改用hscan命令
2021/05/26: 区别http/https代理
------------------------------------------------------
"""
__author__ = 'JHao'
from redis.exceptions import TimeoutError, ConnectionError, ResponseError
from redis.connection import BlockingConnectionPool
from handler.logHandler import LogHandler
from random import choice
from redis import Redis
import json
class RedisClient(object):
"""
Redis client
Redis中代理存放的结构为hash:
key为ip:port, value为代理属性的字典;
"""
def __init__(self, **kwargs):
"""
init
:param host: host
:param port: port
:param password: password
:param db: db
:return:
"""
self.name = ""
kwargs.pop("username")
self.__conn = Redis(connection_pool=BlockingConnectionPool(decode_responses=True,
timeout=5,
socket_timeout=5,
**kwargs))
def get(self, https):
"""
返回一个代理
:return:
"""
if https:
items = self.__conn.hvals(self.name)
proxies = list(filter(lambda x: json.loads(x).get("https"), items))
return choice(proxies) if proxies else None
else:
proxies = self.__conn.hkeys(self.name)
proxy = choice(proxies) if proxies else None
return self.__conn.hget(self.name, proxy) if proxy else None
def put(self, proxy_obj):
"""
将代理放入hash, 使用changeTable指定hash name
:param proxy_obj: Proxy obj
:return:
"""
data = self.__conn.hset(self.name, proxy_obj.proxy, proxy_obj.to_json)
return data
def pop(self, https):
"""
弹出一个代理
:return: dict {proxy: value}
"""
proxy = self.get(https)
if proxy:
self.__conn.hdel(self.name, json.loads(proxy).get("proxy", ""))
return proxy if proxy else None
def delete(self, proxy_str):
"""
移除指定代理, 使用changeTable指定hash name
:param proxy_str: proxy str
:return:
"""
return self.__conn.hdel(self.name, proxy_str)
def exists(self, proxy_str):
"""
判断指定代理是否存在, 使用changeTable指定hash name
:param proxy_str: proxy str
:return:
"""
return self.__conn.hexists(self.name, proxy_str)
def update(self, proxy_obj):
"""
更新 proxy 属性
:param proxy_obj:
:return:
"""
return self.__conn.hset(self.name, proxy_obj.proxy, proxy_obj.to_json)
def getAll(self, https):
"""
字典形式返回所有代理, 使用changeTable指定hash name
:return:
"""
items = self.__conn.hvals(self.name)
if https:
return list(filter(lambda x: json.loads(x).get("https"), items))
else:
return items
def clear(self):
"""
清空所有代理, 使用changeTable指定hash name
:return:
"""
return self.__conn.delete(self.name)
def getCount(self):
"""
返回代理数量
:return:
"""
proxies = self.getAll(https=False)
return {'total': len(proxies), 'https': len(list(filter(lambda x: json.loads(x).get("https"), proxies)))}
def changeTable(self, name):
"""
切换操作对象
:param name:
:return:
"""
self.name = name
def test(self):
log = LogHandler('redis_client')
try:
self.getCount()
except TimeoutError as e:
log.error('redis connection time out: %s' % str(e), exc_info=True)
return e
except ConnectionError as e:
log.error('redis connection error: %s' % str(e), exc_info=True)
return e
except ResponseError as e:
log.error('redis connection error: %s' % str(e), exc_info=True)
return e
================================================
FILE: db/ssdbClient.py
================================================
# -*- coding: utf-8 -*-
# !/usr/bin/env python
"""
-------------------------------------------------
File Name: ssdbClient.py
Description : 封装SSDB操作
Author : JHao
date: 2016/12/2
-------------------------------------------------
Change Activity:
2016/12/2:
2017/09/22: PY3中 redis-py返回的数据是bytes型
2017/09/27: 修改pop()方法 返回{proxy:value}字典
2020/07/03: 2.1.0 优化代码结构
2021/05/26: 区分http和https代理
-------------------------------------------------
"""
__author__ = 'JHao'
from redis.exceptions import TimeoutError, ConnectionError, ResponseError
from redis.connection import BlockingConnectionPool
from handler.logHandler import LogHandler
from random import choice
from redis import Redis
import json
class SsdbClient(object):
"""
SSDB client
SSDB中代理存放的结构为hash:
key为代理的ip:por, value为代理属性的字典;
"""
def __init__(self, **kwargs):
"""
init
:param host: host
:param port: port
:param password: password
:return:
"""
self.name = ""
kwargs.pop("username")
self.__conn = Redis(connection_pool=BlockingConnectionPool(decode_responses=True,
timeout=5,
socket_timeout=5,
**kwargs))
def get(self, https):
"""
从hash中随机返回一个代理
:return:
"""
if https:
items_dict = self.__conn.hgetall(self.name)
proxies = list(filter(lambda x: json.loads(x).get("https"), items_dict.values()))
return choice(proxies) if proxies else None
else:
proxies = self.__conn.hkeys(self.name)
proxy = choice(proxies) if proxies else None
return self.__conn.hget(self.name, proxy) if proxy else None
def put(self, proxy_obj):
"""
将代理放入hash
:param proxy_obj: Proxy obj
:return:
"""
result = self.__conn.hset(self.name, proxy_obj.proxy, proxy_obj.to_json)
return result
def pop(self, https):
"""
顺序弹出一个代理
:return: proxy
"""
proxy = self.get(https)
if proxy:
self.__conn.hdel(self.name, json.loads(proxy).get("proxy", ""))
return proxy if proxy else None
def delete(self, proxy_str):
"""
移除指定代理, 使用changeTable指定hash name
:param proxy_str: proxy str
:return:
"""
self.__conn.hdel(self.name, proxy_str)
def exists(self, proxy_str):
"""
判断指定代理是否存在, 使用changeTable指定hash name
:param proxy_str: proxy str
:return:
"""
return self.__conn.hexists(self.name, proxy_str)
def update(self, proxy_obj):
"""
更新 proxy 属性
:param proxy_obj:
:return:
"""
self.__conn.hset(self.name, proxy_obj.proxy, proxy_obj.to_json)
def getAll(self, https):
"""
字典形式返回所有代理, 使用changeTable指定hash name
:return:
"""
item_dict = self.__conn.hgetall(self.name)
if https:
return list(filter(lambda x: json.loads(x).get("https"), item_dict.values()))
else:
return item_dict.values()
def clear(self):
"""
清空所有代理, 使用changeTable指定hash name
:return:
"""
return self.__conn.delete(self.name)
def getCount(self):
"""
返回代理数量
:return:
"""
proxies = self.getAll(https=False)
return {'total': len(proxies), 'https': len(list(filter(lambda x: json.loads(x).get("https"), proxies)))}
def changeTable(self, name):
"""
切换操作对象
:param name:
:return:
"""
self.name = name
def test(self):
log = LogHandler('ssdb_client')
try:
self.getCount()
except TimeoutError as e:
log.error('ssdb connection time out: %s' % str(e), exc_info=True)
return e
except ConnectionError as e:
log.error('ssdb connection error: %s' % str(e), exc_info=True)
return e
except ResponseError as e:
log.error('ssdb connection error: %s' % str(e), exc_info=True)
return e
================================================
FILE: docker-compose.yml
================================================
version: '2'
services:
proxy_pool:
build: .
container_name: proxy_pool
ports:
- "5010:5010"
links:
- proxy_redis
environment:
DB_CONN: "redis://@proxy_redis:6379/0"
proxy_redis:
image: "redis"
container_name: proxy_redis
================================================
FILE: docs/Makefile
================================================
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
================================================
FILE: docs/changelog.rst
================================================
.. _changelog:
ChangeLog
==========
2.4.2 (2024-01-18)
------------------
1. 代理格式检查支持需认证的代理格式 `username:password@ip:port` ; (2023-03-10)
2. 新增代理源 **稻壳代理**; (2023-05-15)
3. 新增代理源 **冰凌代理**; (2023-01-18)
2.4.1 (2022-07-17)
------------------
1. 新增代理源 **FreeProxyList**; (2022-07-21)
2. 新增代理源 **FateZero**; (2022-08-01)
3. 新增代理属性 ``region``; (2022-08-16)
2.4.0 (2021-11-17)
------------------
1. 移除无效代理源 **神鸡代理**; (2021-11-16)
2. 移除无效代理源 **极速代理**; (2021-11-16)
3. 移除代理源 **西拉代理**; (2021-11-16)
4. 新增代理源 **蝶鸟IP**; (2021-11-16)
5. 新增代理源 **PROXY11**; (2021-11-16)
6. 多线程采集代理; (2021-11-17)
2.3.0 (2021-05-27)
------------------
1. 修复Dockerfile时区问题; (2021-04-12)
2. 新增Proxy属性 ``source``, 标记代理来源; (2021-04-13)
3. 新增Proxy属性 ``https``, 标记支持https的代理; (2021-05-27)
2.2.0 (2021-04-08)
------------------
1. 启动时检查数据库连通性;
2. 新增免费代理源 **米扑代理**;
3. 新增免费代理源 **Pzzqz**;
4. 新增免费代理源 **神鸡代理**;
5. 新增免费代理源 **极速代理**;
6. 新增免费代理源 **小幻代理**;
2.1.1 (2021-02-23)
------------------
1. Fix Bug `#493`_, 新增时区配置; (2020-08-12)
2. 修复 **66代理** 采集; (2020-11-04)
3. 修复 **全网代理** 采集, 解决HTML端口加密问题; (2020-11-04)
4. 新增 **代理盒子** 免费源; (2020-11-04)
5. 新增 ``POOL_SIZE_MIN`` 配置项, runProxyCheck时, 剩余代理少于POOL_SIZE_MIN触发抓取; (2021-02-23)
.. _#493: https://github.com/jhao104/proxy_pool/issues/493
2.1.0 (2020.07)
------------------
1. 新增免费代理源 **西拉代理** (2020-03-30)
2. Fix Bug `#356`_ `#401`_
3. 优化Docker镜像体积; (2020-06-19)
4. 优化配置方式;
5. 优化代码结构;
6. 不再储存raw_proxy, 抓取后直接验证入库;
.. _#401: https://github.com/jhao104/proxy_pool/issues/401
.. _#356: https://github.com/jhao104/proxy_pool/issues/356
2.0.1 (2019.10)
-----------------
1. 新增免费代理源 **89免费代理**;
#. 新增免费代理源 **齐云代理**
2.0.0 (2019.08)
------------------
1. WebApi集成Gunicorn方式启动, Windows平台暂不支持;
#. 优化Proxy调度程序;
#. 扩展Proxy属性;
#. 新增cli工具, 更加方便启动proxyPool
1.14 (2019.07)
-----------------
1. 修复 Queue阻塞导致的 ``ProxyValidSchedule`` 假死bug;
#. 修改代理源 **云代理** 抓取;
#. 修改代理源 **码农代理** 抓取;
#. 修改代理源 **代理66** 抓取, 引入 ``PyExecJS`` 模块破解加速乐动态Cookies加密;
1.13 (2019.02)
-----------------
1. 使用.py文件替换.ini作为配置文件;
#. 优化代理采集部分;
1.12 (2018.04)
-----------------
1. 优化代理格式检查;
#. 增加代理源;
#. fix bug `#122`_ `#126`_
.. _#122: https://github.com/jhao104/proxy_pool/issues/122
.. _#126: https://github.com/jhao104/proxy_pool/issues/126
1.11 (2017.08)
-----------------
1. 使用多线程验证useful_pool;
1.10 (2016.11)
-----------------
1. 第一版;
#. 支持PY2/PY3;
#. 代理池基本功能;
================================================
FILE: docs/conf.py
================================================
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))
import sphinx_rtd_theme
# -- Project information -----------------------------------------------------
project = 'ProxyPool'
copyright = '2020, jhao104'
author = 'jhao104'
master_doc = 'index'
# The full version, including alpha/beta/rc tags
release = '2.1.0'
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
]
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = "sphinx"
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = 'zh_CN'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = 'sphinx_rtd_theme'
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
================================================
FILE: docs/dev/ext_fetcher.rst
================================================
.. ext_fetcher
扩展代理源
-----------
项目默认包含几个免费的代理获取源,但是免费的毕竟质量有限,如果直接运行可能拿到的代理质量不理想。因此提供了用户自定义扩展代理获取的方法。
如果要添加一个新的代理获取方法, 过程如下:
1. 首先在 `ProxyFetcher`_ 类中添加自定义的获取代理的静态方法,该方法需要以生成器(yield)形式返回 ``host:ip`` 格式的代理字符串, 例如:
.. code-block:: python
class ProxyFetcher(object):
# ....
# 自定义代理源获取方法
@staticmethod
def freeProxyCustom01(): # 命名不和已有重复即可
# 通过某网站或者某接口或某数据库获取代理
# 假设你已经拿到了一个代理列表
proxies = ["x.x.x.x:3128", "x.x.x.x:80"]
for proxy in proxies:
yield proxy
# 确保每个proxy都是 host:ip正确的格式返回
2. 添加好方法后,修改配置文件 `setting.py`_ 中的 ``PROXY_FETCHER`` 项, 加入刚才添加的自定义方法的名字:
.. code-block:: python
PROXY_FETCHER = [
# ....
"freeProxyCustom01" # # 确保名字和你添加方法名字一致
]
.. _ProxyFetcher: https://github.com/jhao104/proxy_pool/blob/1a3666283806a22ef287fba1a8efab7b94e94bac/fetcher/proxyFetcher.py#L20
.. _setting.py: https://github.com/jhao104/proxy_pool/blob/1a3666283806a22ef287fba1a8efab7b94e94bac/setting.py#L47
================================================
FILE: docs/dev/ext_validator.rst
================================================
.. ext_validator
代理校验
-----------
内置校验
>>>>>>>>>
项目中使用的代理校验方法全部定义在 `validator.py`_ 中, 通过 `ProxyValidator`_ 类中提供的装饰器来区分。校验方法返回 ``True`` 表示
校验通过, 返回 ``False`` 表示校验不通过。
* 代理校验方法分为三类: ``preValidator`` 、 ``httpValidator`` 、 ``httpsValidator``:
* **preValidator**: 预校验,在代理抓取后验证前调用,目前实现了 `formatValidator`_ 校验代理IP格式是否合法;
* **httpValidator**: 代理可用性校验,通过则认为代理可用, 目前实现了 `httpTimeOutValidator`_ 校验;
* **httpsValidator**: 校验代理是否支持https,目前实现了 `httpsTimeOutValidator`_ 校验。
.. _validator.py: https://github.com/jhao104/proxy_pool/blob/release-2.3.0/helper/validator.py
.. _ProxyValidator: https://github.com/jhao104/proxy_pool/blob/release-2.3.0/helper/validator.py#L29
.. _formatValidator: https://github.com/jhao104/proxy_pool/blob/release-2.3.0/helper/validator.py#L51
.. _httpTimeOutValidator: https://github.com/jhao104/proxy_pool/blob/release-2.3.0/helper/validator.py#L58
.. _httpsTimeOutValidator: https://github.com/jhao104/proxy_pool/blob/release-2.3.0/helper/validator.py#L71
每种校验可以定义多个方法,只有 **所有** 方法都返回 ``True`` 的情况下才视为该校验通过,校验方法执行顺序为: 先执行 **httpValidator** , 前者通过后再执行 **httpsValidator** 。
只有 `preValidator` 校验通过的代理才会进入可用性校验, `httpValidator` 校验通过后认为代理可用准备更新入代理池, `httpValidator` 校验通过后视为代理支持https更新代理的 `https` 属性为 `True` 。
扩展校验
>>>>>>>>>
在 `validator.py`_ 已有自定义校验的示例,自定义函数需返回True或者False,使用 `ProxyValidator`_ 中提供的装饰器来区分校验类型。 下面是两个例子:
* 1. 自定义一个代理可用性的校验(``addHttpValidator``):
.. code-block:: python
@ProxyValidator.addHttpValidator
def customValidatorExample01(proxy):
"""自定义代理可用性校验函数"""
proxies = {"http": "http://{proxy}".format(proxy=proxy)}
try:
r = requests.get("http://www.baidu.com/", headers=HEADER, proxies=proxies, timeout=5)
return True if r.status_code == 200 and len(r.content) > 200 else False
except Exception as e:
return False
* 2. 自定义一个代理是否支持https的校验(``addHttpsValidator``):
.. code-block:: python
@ProxyValidator.addHttpsValidator
def customValidatorExample02(proxy):
"""自定义代理是否支持https校验函数"""
proxies = {"https": "https://{proxy}".format(proxy=proxy)}
try:
r = requests.get("https://www.baidu.com/", headers=HEADER, proxies=proxies, timeout=5, verify=False)
return True if r.status_code == 200 and len(r.content) > 200 else False
except Exception as e:
return False
注意,比如在运行代理可用性校验时,所有被 ``ProxyValidator.addHttpValidator`` 装饰的函数会被依次按定义顺序执行,只有当所有函数都返回True时才会判断代理可用。 ``HttpsValidator`` 运行机制也是如此。
================================================
FILE: docs/dev/index.rst
================================================
=========
开发指南
=========
.. module:: dev
.. toctree::
:maxdepth: 2
ext_fetcher
ext_validator
================================================
FILE: docs/index.rst
================================================
.. ProxyPool documentation master file, created by
sphinx-quickstart on Wed Jul 8 16:13:42 2020.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
ProxyPool
=====================================
::
****************************************************************
*** ______ ********************* ______ *********** _ ********
*** | ___ \_ ******************** | ___ \ ********* | | ********
*** | |_/ / \__ __ __ _ __ _ | |_/ /___ * ___ | | ********
*** | __/| _// _ \ \ \/ /| | | || __// _ \ / _ \ | | ********
*** | | | | | (_) | > < \ |_| || | | (_) | (_) || |___ ****
*** \_| |_| \___/ /_/\_\ \__ |\_| \___/ \___/ \_____/ ****
**** __ / / *****
************************* /___ / *******************************
************************* ********************************
****************************************************************
Python爬虫代理IP池
安装
-----
* 下载代码
.. code-block:: console
$ git clone git@github.com:jhao104/proxy_pool.git
* 安装依赖
.. code-block:: console
$ pip install -r requirements.txt
* 更新配置
.. code-block:: python
HOST = "0.0.0.0"
PORT = 5000
DB_CONN = 'redis://@127.0.0.1:8888'
PROXY_FETCHER = [
"freeProxy01",
"freeProxy02",
# ....
]
* 启动项目
.. code-block:: console
$ python proxyPool.py schedule
$ python proxyPool.py server
使用
______
* API
============ ======== ================ ==============
Api Method Description Params
============ ======== ================ ==============
/ GET API介绍 无
/get GET 返回一个代理 可选参数: `?type=https` 过滤支持https的代理
/pop GET 返回并删除一个代理 可选参数: `?type=https` 过滤支持https的代理
/all GET 返回所有代理 可选参数: `?type=https` 过滤支持https的代理
/count GET 返回代理数量 无
/delete GET 删除指定代理 `?proxy=host:ip`
============ ======== ================ ==============
* 爬虫
.. code-block:: python
import requests
def get_proxy():
return requests.get("http://127.0.0.1:5010/get?type=https").json()
def delete_proxy(proxy):
requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))
# your spider code
def getHtml():
# ....
retry_count = 5
proxy = get_proxy().get("proxy")
while retry_count > 0:
try:
html = requests.get('https://www.example.com', proxies={"http": "http://{}".format(proxy), "https": "https://{}".format(proxy)})
# 使用代理访问
return html
except Exception:
retry_count -= 1
# 删除代理池中代理
delete_proxy(proxy)
return None
Contents
--------
.. toctree::
:maxdepth: 2
user/index
dev/index
changelog
================================================
FILE: docs/make.bat
================================================
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=.
set BUILDDIR=_build
if "%1" == "" goto help
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.http://sphinx-doc.org/
exit /b 1
)
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd
================================================
FILE: docs/user/how_to_config.rst
================================================
.. how_to_config
配置参考
---------
配置文件 ``setting.py`` 位于项目的主目录下, 配置主要分为四类: **服务配置** 、 **数据库配置** 、 **采集配置** 、 **校验配置**.
服务配置
>>>>>>>>>
* ``HOST``
API服务监听的IP, 本机访问设置为 ``127.0.0.1``, 开启远程访问设置为: ``0.0.0.0``.
* ``PORT``
API服务监听的端口.
数据库配置
>>>>>>>>>>>
* ``DB_CONN``
用户存放代理IP的数据库URI, 配置格式为: ``db_type://[[user]:[pwd]]@ip:port/[db]``.
目前支持的db_type有: ``ssdb`` 、 ``redis``.
配置示例:
.. code-block:: python
# SSDB IP: 127.0.0.1 Port: 8888
DB_CONN = 'ssdb://@127.0.0.1:8888'
# SSDB IP: 127.0.0.1 Port: 8899 Password: 123456
DB_CONN = 'ssdb://:123456@127.0.0.1:8888'
# Redis IP: 127.0.0.1 Port: 6379
DB_CONN = 'redis://@127.0.0.1:6379'
# Redis IP: 127.0.0.1 Port: 6379 Password: 123456
DB_CONN = 'redis://:123456@127.0.0.1:6379'
# Redis IP: 127.0.0.1 Port: 6379 Password: 123456 DB: 15
DB_CONN = 'redis://:123456@127.0.0.1:6379/15'
* ``TABLE_NAME``
存放代理的数据载体名称, ssdb和redis的存放结构为hash.
采集配置
>>>>>>>>>
* ``PROXY_FETCHER``
启用的代理采集方法名, 代理采集方法位于 ``fetcher/proxyFetcher.py`` 类中.
由于各个代理源的稳定性不容易掌握, 当某个代理采集方法失效时, 可以该配置中注释掉其名称.
如果有增加某些代理采集方法, 也请在该配置中添加其方法名, 具体请参考 :doc:`/dev/extend_fetcher`.
调度程序每次执行采集任务时都会再次加载该配置, 保证每次运行的采集方法都是有效的.
校验配置
>>>>>>>>>
* ``HTTP_URL``
用于检验代理是否可用的地址, 默认为 ``http://httpbin.org``, 可根据使用场景修改为其他地址.
* ``HTTPS_URL``
用于检验代理是否支持HTTPS的地址, 默认为 ``https://www.qq.com``, 可根据使用场景修改为其他地址.
* ``VERIFY_TIMEOUT``
检验代理的超时时间, 默认为 ``10`` , 单位秒. 使用代理访问 ``HTTP(S)_URL`` 耗时超过 ``VERIFY_TIMEOUT`` 时, 视为代理不可用.
* ``MAX_FAIL_COUNT``
检验代理允许最大失败次数, 默认为 ``0``, 即出错一次即删除.
* ``POOL_SIZE_MIN``
代理检测定时任务运行前若代理数量小于 `POOL_SIZE_MIN`, 则先运行抓取程序.
================================================
FILE: docs/user/how_to_run.rst
================================================
.. how_to_run
如何运行
---------
下载代码
>>>>>>>>>
本项目需要下载代码到本地运行, 通过 ``git`` 下载:
.. code-block:: console
$ git clone git@github.com:jhao104/proxy_pool.git
或者下载特定的 ``release`` 版本:
.. code-block:: console
https://github.com/jhao104/proxy_pool/releases
安装依赖
>>>>>>>>>
到项目目录下使用 ``pip`` 安装依赖库:
.. code-block:: console
$ pip install -r requirements.txt
更新配置
>>>>>>>>>
配置文件 ``setting.py`` 位于项目的主目录下:
.. code-block:: python
# 配置API服务
HOST = "0.0.0.0" # IP
PORT = 5000 # 监听端口
# 配置数据库
DB_CONN = 'redis://@127.0.0.1:8888/0'
# 配置 ProxyFetcher
PROXY_FETCHER = [
"freeProxy01", # 这里是启用的代理抓取方法,所有fetch方法位于fetcher/proxyFetcher.py
"freeProxy02",
# ....
]
更多配置请参考 :doc:`/user/how_to_config`
启动项目
>>>>>>>>>
如果已配置好运行环境, 具备运行条件, 可以通过 ``proxyPool.py`` 启动. ``proxyPool.py`` 是项目的CLI入口.
完整程序包含两部份: ``schedule`` 调度程序和 ``server`` API服务, 调度程序负责采集和验证代理, API服务提供代理服务HTTP接口.
通过命令行程序分别启动调度程序和API服务:
.. code-block:: console
# 启动调度程序
$ python proxyPool.py schedule
# 启动webApi服务
$ python proxyPool.py server
================================================
FILE: docs/user/how_to_use.rst
================================================
.. how_to_use
如何使用
----------
爬虫代码要对接代理池目前有两种方式: 一是通过调用API接口使用, 二是直接读取数据库.
调用API
>>>>>>>>>
启动ProxyPool的 ``server`` 后会提供如下几个http接口:
============ ======== ================ ==============
Api Method Description Arg
============ ======== ================ ==============
/ GET API介绍 无
/get GET 随机返回一个代理 无
/get_all GET 返回所有代理 无
/get_status GET 返回代理数量 无
/delete GET 删除指定代理 proxy=host:ip
============ ======== ================ ==============
在代码中可以通过封装上面的API接口来使用代理, 例子:
.. code-block:: python
import requests
def get_proxy():
return requests.get("http://127.0.0.1:5010/get/").json()
def delete_proxy(proxy):
requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))
# your spider code
def getHtml():
# ....
retry_count = 5
proxy = get_proxy().get("proxy")
while retry_count > 0:
try:
# 使用代理访问
html = requests.get('http://www.example.com', proxies={"http": "http://{}".format(proxy)})
return html
except Exception:
retry_count -= 1
# 删除代理池中代理
delete_proxy(proxy)
return None
本例中我们在本地 ``127.0.0.1`` 启动端口为 ``5010`` 的 ``server``, 使用 ``/get`` 接口获取代理, ``/delete`` 删除代理.
读数据库
>>>>>>>>>
目前支持配置两种数据库: ``REDIS`` 、 ``SSDB``.
* **REDIS** 储存结构为 ``hash``, hash name为配置项中的 **TABLE_NAME**
* **SSDB** 储存结构为 ``hash``, hash name为配置项中的 **TABLE_NAME**
可以在代码中自行读取.
================================================
FILE: docs/user/index.rst
================================================
=========
用户指南
=========
.. module:: user
.. toctree::
:maxdepth: 2
how_to_run
how_to_use
how_to_config
================================================
FILE: fetcher/__init__.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: __init__.py
Description :
Author : JHao
date: 2016/11/25
-------------------------------------------------
Change Activity:
2016/11/25:
-------------------------------------------------
"""
================================================
FILE: fetcher/proxyFetcher.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: proxyFetcher
Description :
Author : JHao
date: 2016/11/25
-------------------------------------------------
Change Activity:
2016/11/25: proxyFetcher
-------------------------------------------------
"""
__author__ = 'JHao'
import re
import json
from time import sleep
from util.webRequest import WebRequest
class ProxyFetcher(object):
"""
proxy getter
"""
@staticmethod
def freeProxy01():
"""
站大爷 https://www.zdaye.com/dayProxy.html
"""
start_url = "https://www.zdaye.com/dayProxy.html"
html_tree = WebRequest().get(start_url, verify=False).tree
latest_page_time = html_tree.xpath("//span[@class='thread_time_info']/text()")[0].strip()
from datetime import datetime
interval = datetime.now() - datetime.strptime(latest_page_time, "%Y/%m/%d %H:%M:%S")
if interval.seconds < 300: # 只采集5分钟内的更新
target_url = "https://www.zdaye.com/" + html_tree.xpath("//h3[@class='thread_title']/a/@href")[0].strip()
while target_url:
_tree = WebRequest().get(target_url, verify=False).tree
for tr in _tree.xpath("//table//tr"):
ip = "".join(tr.xpath("./td[1]/text()")).strip()
port = "".join(tr.xpath("./td[2]/text()")).strip()
yield "%s:%s" % (ip, port)
next_page = _tree.xpath("//div[@class='page']/a[@title='下一页']/@href")
target_url = "https://www.zdaye.com/" + next_page[0].strip() if next_page else False
sleep(5)
@staticmethod
def freeProxy02():
"""
代理66 http://www.66ip.cn/
"""
url = "http://www.66ip.cn/"
resp = WebRequest().get(url, timeout=10).tree
for i, tr in enumerate(resp.xpath("(//table)[3]//tr")):
if i > 0:
ip = "".join(tr.xpath("./td[1]/text()")).strip()
port = "".join(tr.xpath("./td[2]/text()")).strip()
yield "%s:%s" % (ip, port)
@staticmethod
def freeProxy03():
""" 开心代理 """
target_urls = ["http://www.kxdaili.com/dailiip.html", "http://www.kxdaili.com/dailiip/2/1.html"]
for url in target_urls:
tree = WebRequest().get(url).tree
for tr in tree.xpath("//table[@class='active']//tr")[1:]:
ip = "".join(tr.xpath('./td[1]/text()')).strip()
port = "".join(tr.xpath('./td[2]/text()')).strip()
yield "%s:%s" % (ip, port)
@staticmethod
def freeProxy04():
""" FreeProxyList https://www.freeproxylists.net/zh/ """
url = "https://www.freeproxylists.net/zh/?c=CN&pt=&pr=&a%5B%5D=0&a%5B%5D=1&a%5B%5D=2&u=50"
tree = WebRequest().get(url, verify=False).tree
from urllib import parse
def parse_ip(input_str):
html_str = parse.unquote(input_str)
ips = re.findall(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', html_str)
return ips[0] if ips else None
for tr in tree.xpath("//tr[@class='Odd']") + tree.xpath("//tr[@class='Even']"):
ip = parse_ip("".join(tr.xpath('./td[1]/script/text()')).strip())
port = "".join(tr.xpath('./td[2]/text()')).strip()
if ip:
yield "%s:%s" % (ip, port)
@staticmethod
def freeProxy05(page_count=1):
""" 快代理 https://www.kuaidaili.com """
url_pattern = [
'https://www.kuaidaili.com/free/inha/{}/',
'https://www.kuaidaili.com/free/intr/{}/'
]
url_list = []
for page_index in range(1, page_count + 1):
for pattern in url_pattern:
url_list.append(pattern.format(page_index))
for url in url_list:
tree = WebRequest().get(url).tree
proxy_list = tree.xpath('.//table//tr')
sleep(1) # 必须sleep 不然第二条请求不到数据
for tr in proxy_list[1:]:
yield ':'.join(tr.xpath('./td/text()')[0:2])
@staticmethod
def freeProxy06():
""" 冰凌代理 https://www.binglx.cn """
url = "https://www.binglx.cn/?page=1"
try:
tree = WebRequest().get(url).tree
proxy_list = tree.xpath('.//table//tr')
for tr in proxy_list[1:]:
yield ':'.join(tr.xpath('./td/text()')[0:2])
except Exception as e:
print(e)
@staticmethod
def freeProxy07():
""" 云代理 """
urls = ['http://www.ip3366.net/free/?stype=1', "http://www.ip3366.net/free/?stype=2"]
for url in urls:
r = WebRequest().get(url, timeout=10)
proxies = re.findall(r'<td>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td>[\s\S]*?<td>(\d+)</td>', r.text)
for proxy in proxies:
yield ":".join(proxy)
@staticmethod
def freeProxy08():
""" 小幻代理 """
urls = ['https://ip.ihuan.me/address/5Lit5Zu9.html']
for url in urls:
r = WebRequest().get(url, timeout=10)
proxies = re.findall(r'>\s*?(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s*?</a></td><td>(\d+)</td>', r.text)
for proxy in proxies:
yield ":".join(proxy)
@staticmethod
def freeProxy09(page_count=1):
""" 免费代理库 """
for i in range(1, page_count + 1):
url = 'http://ip.jiangxianli.com/?country=中国&page={}'.format(i)
html_tree = WebRequest().get(url, verify=False).tree
for index, tr in enumerate(html_tree.xpath("//table//tr")):
if index == 0:
continue
yield ":".join(tr.xpath("./td/text()")[0:2]).strip()
@staticmethod
def freeProxy10():
""" 89免费代理 """
r = WebRequest().get("https://www.89ip.cn/index_1.html", timeout=10)
proxies = re.findall(
r'<td.*?>[\s\S]*?(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})[\s\S]*?</td>[\s\S]*?<td.*?>[\s\S]*?(\d+)[\s\S]*?</td>',
r.text)
for proxy in proxies:
yield ':'.join(proxy)
@staticmethod
def freeProxy11():
""" 稻壳代理 https://www.docip.net/ """
r = WebRequest().get("https://www.docip.net/data/free.json", timeout=10)
try:
for each in r.json['data']:
yield each['ip']
except Exception as e:
print(e)
# @staticmethod
# def wallProxy01():
# """
# PzzQz https://pzzqz.com/
# """
# from requests import Session
# from lxml import etree
# session = Session()
# try:
# index_resp = session.get("https://pzzqz.com/", timeout=20, verify=False).text
# x_csrf_token = re.findall('X-CSRFToken": "(.*?)"', index_resp)
# if x_csrf_token:
# data = {"http": "on", "ping": "3000", "country": "cn", "ports": ""}
# proxy_resp = session.post("https://pzzqz.com/", verify=False,
# headers={"X-CSRFToken": x_csrf_token[0]}, json=data).json()
# tree = etree.HTML(proxy_resp["proxy_html"])
# for tr in tree.xpath("//tr"):
# ip = "".join(tr.xpath("./td[1]/text()"))
# port = "".join(tr.xpath("./td[2]/text()"))
# yield "%s:%s" % (ip, port)
# except Exception as e:
# print(e)
# @staticmethod
# def freeProxy10():
# """
# 墙外网站 cn-proxy
# :return:
# """
# urls = ['http://cn-proxy.com/', 'http://cn-proxy.com/archives/218']
# request = WebRequest()
# for url in urls:
# r = request.get(url, timeout=10)
# proxies = re.findall(r'<td>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td>[\w\W]<td>(\d+)</td>', r.text)
# for proxy in proxies:
# yield ':'.join(proxy)
# @staticmethod
# def freeProxy11():
# """
# https://proxy-list.org/english/index.php
# :return:
# """
# urls = ['https://proxy-list.org/english/index.php?p=%s' % n for n in range(1, 10)]
# request = WebRequest()
# import base64
# for url in urls:
# r = request.get(url, timeout=10)
# proxies = re.findall(r"Proxy\('(.*?)'\)", r.text)
# for proxy in proxies:
# yield base64.b64decode(proxy).decode()
# @staticmethod
# def freeProxy12():
# urls = ['https://list.proxylistplus.com/Fresh-HTTP-Proxy-List-1']
# request = WebRequest()
# for url in urls:
# r = request.get(url, timeout=10)
# proxies = re.findall(r'<td>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})</td>[\s\S]*?<td>(\d+)</td>', r.text)
# for proxy in proxies:
# yield ':'.join(proxy)
if __name__ == '__main__':
p = ProxyFetcher()
for _ in p.freeProxy06():
print(_)
# http://nntime.com/proxy-list-01.htm
================================================
FILE: handler/__init__.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: __init__.py
Description :
Author : JHao
date: 2016/12/3
-------------------------------------------------
Change Activity:
2016/12/3:
-------------------------------------------------
"""
__author__ = 'JHao'
# from handler.ProxyManager import ProxyManager
================================================
FILE: handler/configHandler.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: configHandler
Description :
Author : JHao
date: 2020/6/22
-------------------------------------------------
Change Activity:
2020/6/22:
-------------------------------------------------
"""
__author__ = 'JHao'
import os
import setting
from util.singleton import Singleton
from util.lazyProperty import LazyProperty
from util.six import reload_six, withMetaclass
class ConfigHandler(withMetaclass(Singleton)):
def __init__(self):
pass
@LazyProperty
def serverHost(self):
return os.environ.get("HOST", setting.HOST)
@LazyProperty
def serverPort(self):
return os.environ.get("PORT", setting.PORT)
@LazyProperty
def dbConn(self):
return os.getenv("DB_CONN", setting.DB_CONN)
@LazyProperty
def tableName(self):
return os.getenv("TABLE_NAME", setting.TABLE_NAME)
@property
def fetchers(self):
reload_six(setting)
return setting.PROXY_FETCHER
@LazyProperty
def httpUrl(self):
return os.getenv("HTTP_URL", setting.HTTP_URL)
@LazyProperty
def httpsUrl(self):
return os.getenv("HTTPS_URL", setting.HTTPS_URL)
@LazyProperty
def verifyTimeout(self):
return int(os.getenv("VERIFY_TIMEOUT", setting.VERIFY_TIMEOUT))
# @LazyProperty
# def proxyCheckCount(self):
# return int(os.getenv("PROXY_CHECK_COUNT", setting.PROXY_CHECK_COUNT))
@LazyProperty
def maxFailCount(self):
return int(os.getenv("MAX_FAIL_COUNT", setting.MAX_FAIL_COUNT))
# @LazyProperty
# def maxFailRate(self):
# return int(os.getenv("MAX_FAIL_RATE", setting.MAX_FAIL_RATE))
@LazyProperty
def poolSizeMin(self):
return int(os.getenv("POOL_SIZE_MIN", setting.POOL_SIZE_MIN))
@LazyProperty
def proxyRegion(self):
return bool(os.getenv("PROXY_REGION", setting.PROXY_REGION))
@LazyProperty
def timezone(self):
return os.getenv("TIMEZONE", setting.TIMEZONE)
================================================
FILE: handler/logHandler.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: LogHandler.py
Description : 日志操作模块
Author : JHao
date: 2017/3/6
-------------------------------------------------
Change Activity:
2017/03/06: log handler
2017/09/21: 屏幕输出/文件输出 可选(默认屏幕和文件均输出)
2020/07/13: Windows下TimedRotatingFileHandler线程不安全, 不再使用
-------------------------------------------------
"""
__author__ = 'JHao'
import os
import logging
import platform
from logging.handlers import TimedRotatingFileHandler
# 日志级别
CRITICAL = 50
FATAL = CRITICAL
ERROR = 40
WARNING = 30
WARN = WARNING
INFO = 20
DEBUG = 10
NOTSET = 0
CURRENT_PATH = os.path.dirname(os.path.abspath(__file__))
ROOT_PATH = os.path.join(CURRENT_PATH, os.pardir)
LOG_PATH = os.path.join(ROOT_PATH, 'log')
if not os.path.exists(LOG_PATH):
try:
os.mkdir(LOG_PATH)
except FileExistsError:
pass
class LogHandler(logging.Logger):
"""
LogHandler
"""
def __init__(self, name, level=DEBUG, stream=True, file=True):
self.name = name
self.level = level
logging.Logger.__init__(self, self.name, level=level)
if stream:
self.__setStreamHandler__()
if file:
if platform.system() != "Windows":
self.__setFileHandler__()
def __setFileHandler__(self, level=None):
"""
set file handler
:param level:
:return:
"""
file_name = os.path.join(LOG_PATH, '{name}.log'.format(name=self.name))
# 设置日志回滚, 保存在log目录, 一天保存一个文件, 保留15天
file_handler = TimedRotatingFileHandler(filename=file_name, when='D', interval=1, backupCount=15)
file_handler.suffix = '%Y%m%d.log'
if not level:
file_handler.setLevel(self.level)
else:
file_handler.setLevel(level)
formatter = logging.Formatter('%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s')
file_handler.setFormatter(formatter)
self.file_handler = file_handler
self.addHandler(file_handler)
def __setStreamHandler__(self, level=None):
"""
set stream handler
:param level:
:return:
"""
stream_handler = logging.StreamHandler()
formatter = logging.Formatter('%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s')
stream_handler.setFormatter(formatter)
if not level:
stream_handler.setLevel(self.level)
else:
stream_handler.setLevel(level)
self.addHandler(stream_handler)
if __name__ == '__main__':
log = LogHandler('test')
log.info('this is a test msg')
================================================
FILE: handler/proxyHandler.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: ProxyHandler.py
Description :
Author : JHao
date: 2016/12/3
-------------------------------------------------
Change Activity:
2016/12/03:
2020/05/26: 区分http和https
-------------------------------------------------
"""
__author__ = 'JHao'
from helper.proxy import Proxy
from db.dbClient import DbClient
from handler.configHandler import ConfigHandler
class ProxyHandler(object):
""" Proxy CRUD operator"""
def __init__(self):
self.conf = ConfigHandler()
self.db = DbClient(self.conf.dbConn)
self.db.changeTable(self.conf.tableName)
def get(self, https=False):
"""
return a proxy
Args:
https: True/False
Returns:
"""
proxy = self.db.get(https)
return Proxy.createFromJson(proxy) if proxy else None
def pop(self, https):
"""
return and delete a useful proxy
:return:
"""
proxy = self.db.pop(https)
if proxy:
return Proxy.createFromJson(proxy)
return None
def put(self, proxy):
"""
put proxy into use proxy
:return:
"""
self.db.put(proxy)
def delete(self, proxy):
"""
delete useful proxy
:param proxy:
:return:
"""
return self.db.delete(proxy.proxy)
def getAll(self, https=False):
"""
get all proxy from pool as Proxy list
:return:
"""
proxies = self.db.getAll(https)
return [Proxy.createFromJson(_) for _ in proxies]
def exists(self, proxy):
"""
check proxy exists
:param proxy:
:return:
"""
return self.db.exists(proxy.proxy)
def getCount(self):
"""
return raw_proxy and use_proxy count
:return:
"""
total_use_proxy = self.db.getCount()
return {'count': total_use_proxy}
================================================
FILE: helper/__init__.py
================================================
================================================
FILE: helper/check.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: check
Description : 执行代理校验
Author : JHao
date: 2019/8/6
-------------------------------------------------
Change Activity:
2019/08/06: 执行代理校验
2021/05/25: 分别校验http和https
2022/08/16: 获取代理Region信息
-------------------------------------------------
"""
__author__ = 'JHao'
from util.six import Empty
from threading import Thread
from datetime import datetime
from util.webRequest import WebRequest
from handler.logHandler import LogHandler
from helper.validator import ProxyValidator
from handler.proxyHandler import ProxyHandler
from handler.configHandler import ConfigHandler
class DoValidator(object):
""" 执行校验 """
conf = ConfigHandler()
@classmethod
def validator(cls, proxy, work_type):
"""
校验入口
Args:
proxy: Proxy Object
work_type: raw/use
Returns:
Proxy Object
"""
http_r = cls.httpValidator(proxy)
https_r = False if not http_r else cls.httpsValidator(proxy)
proxy.check_count += 1
proxy.last_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
proxy.last_status = True if http_r else False
if http_r:
if proxy.fail_count > 0:
proxy.fail_count -= 1
proxy.https = True if https_r else False
if work_type == "raw":
proxy.region = cls.regionGetter(proxy) if cls.conf.proxyRegion else ""
else:
proxy.fail_count += 1
return proxy
@classmethod
def httpValidator(cls, proxy):
for func in ProxyValidator.http_validator:
if not func(proxy.proxy):
return False
return True
@classmethod
def httpsValidator(cls, proxy):
for func in ProxyValidator.https_validator:
if not func(proxy.proxy):
return False
return True
@classmethod
def preValidator(cls, proxy):
for func in ProxyValidator.pre_validator:
if not func(proxy):
return False
return True
@classmethod
def regionGetter(cls, proxy):
try:
url = 'https://searchplugin.csdn.net/api/v1/ip/get?ip=%s' % proxy.proxy.split(':')[0]
r = WebRequest().get(url=url, retry_time=1, timeout=2).json
return r['data']['address']
except:
return 'error'
class _ThreadChecker(Thread):
""" 多线程检测 """
def __init__(self, work_type, target_queue, thread_name):
Thread.__init__(self, name=thread_name)
self.work_type = work_type
self.log = LogHandler("checker")
self.proxy_handler = ProxyHandler()
self.target_queue = target_queue
self.conf = ConfigHandler()
def run(self):
self.log.info("{}ProxyCheck - {}: start".format(self.work_type.title(), self.name))
while True:
try:
proxy = self.target_queue.get(block=False)
except Empty:
self.log.info("{}ProxyCheck - {}: complete".format(self.work_type.title(), self.name))
break
proxy = DoValidator.validator(proxy, self.work_type)
if self.work_type == "raw":
self.__ifRaw(proxy)
else:
self.__ifUse(proxy)
self.target_queue.task_done()
def __ifRaw(self, proxy):
if proxy.last_status:
if self.proxy_handler.exists(proxy):
self.log.info('RawProxyCheck - {}: {} exist'.format(self.name, proxy.proxy.ljust(23)))
else:
self.log.info('RawProxyCheck - {}: {} pass'.format(self.name, proxy.proxy.ljust(23)))
self.proxy_handler.put(proxy)
else:
self.log.info('RawProxyCheck - {}: {} fail'.format(self.name, proxy.proxy.ljust(23)))
def __ifUse(self, proxy):
if proxy.last_status:
self.log.info('UseProxyCheck - {}: {} pass'.format(self.name, proxy.proxy.ljust(23)))
self.proxy_handler.put(proxy)
else:
if proxy.fail_count > self.conf.maxFailCount:
self.log.info('UseProxyCheck - {}: {} fail, count {} delete'.format(self.name,
proxy.proxy.ljust(23),
proxy.fail_count))
self.proxy_handler.delete(proxy)
else:
self.log.info('UseProxyCheck - {}: {} fail, count {} keep'.format(self.name,
proxy.proxy.ljust(23),
proxy.fail_count))
self.proxy_handler.put(proxy)
def Checker(tp, queue):
"""
run Proxy ThreadChecker
:param tp: raw/use
:param queue: Proxy Queue
:return:
"""
thread_list = list()
for index in range(20):
thread_list.append(_ThreadChecker(tp, queue, "thread_%s" % str(index).zfill(2)))
for thread in thread_list:
thread.setDaemon(True)
thread.start()
for thread in thread_list:
thread.join()
================================================
FILE: helper/fetch.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: fetchScheduler
Description :
Author : JHao
date: 2019/8/6
-------------------------------------------------
Change Activity:
2021/11/18: 多线程采集
-------------------------------------------------
"""
__author__ = 'JHao'
from threading import Thread
from helper.proxy import Proxy
from helper.check import DoValidator
from handler.logHandler import LogHandler
from handler.proxyHandler import ProxyHandler
from fetcher.proxyFetcher import ProxyFetcher
from handler.configHandler import ConfigHandler
class _ThreadFetcher(Thread):
def __init__(self, fetch_source, proxy_dict):
Thread.__init__(self)
self.fetch_source = fetch_source
self.proxy_dict = proxy_dict
self.fetcher = getattr(ProxyFetcher, fetch_source, None)
self.log = LogHandler("fetcher")
self.conf = ConfigHandler()
self.proxy_handler = ProxyHandler()
def run(self):
self.log.info("ProxyFetch - {func}: start".format(func=self.fetch_source))
try:
for proxy in self.fetcher():
self.log.info('ProxyFetch - %s: %s ok' % (self.fetch_source, proxy.ljust(23)))
proxy = proxy.strip()
if proxy in self.proxy_dict:
self.proxy_dict[proxy].add_source(self.fetch_source)
else:
self.proxy_dict[proxy] = Proxy(
proxy, source=self.fetch_source)
except Exception as e:
self.log.error("ProxyFetch - {func}: error".format(func=self.fetch_source))
self.log.error(str(e))
class Fetcher(object):
name = "fetcher"
def __init__(self):
self.log = LogHandler(self.name)
self.conf = ConfigHandler()
def run(self):
"""
fetch proxy with proxyFetcher
:return:
"""
proxy_dict = dict()
thread_list = list()
self.log.info("ProxyFetch : start")
for fetch_source in self.conf.fetchers:
self.log.info("ProxyFetch - {func}: start".format(func=fetch_source))
fetcher = getattr(ProxyFetcher, fetch_source, None)
if not fetcher:
self.log.error("ProxyFetch - {func}: class method not exists!".format(func=fetch_source))
continue
if not callable(fetcher):
self.log.error("ProxyFetch - {func}: must be class method".format(func=fetch_source))
continue
thread_list.append(_ThreadFetcher(fetch_source, proxy_dict))
for thread in thread_list:
thread.setDaemon(True)
thread.start()
for thread in thread_list:
thread.join()
self.log.info("ProxyFetch - all complete!")
for _ in proxy_dict.values():
if DoValidator.preValidator(_.proxy):
yield _
================================================
FILE: helper/launcher.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: launcher
Description : 启动器
Author : JHao
date: 2021/3/26
-------------------------------------------------
Change Activity:
2021/3/26: 启动器
-------------------------------------------------
"""
__author__ = 'JHao'
import sys
from db.dbClient import DbClient
from handler.logHandler import LogHandler
from handler.configHandler import ConfigHandler
log = LogHandler('launcher')
def startServer():
__beforeStart()
from api.proxyApi import runFlask
runFlask()
def startScheduler():
__beforeStart()
from helper.scheduler import runScheduler
runScheduler()
def __beforeStart():
__showVersion()
__showConfigure()
if __checkDBConfig():
log.info('exit!')
sys.exit()
def __showVersion():
from setting import VERSION
log.info("ProxyPool Version: %s" % VERSION)
def __showConfigure():
conf = ConfigHandler()
log.info("ProxyPool configure HOST: %s" % conf.serverHost)
log.info("ProxyPool configure PORT: %s" % conf.serverPort)
log.info("ProxyPool configure PROXY_FETCHER: %s" % conf.fetchers)
def __checkDBConfig():
conf = ConfigHandler()
db = DbClient(conf.dbConn)
log.info("============ DATABASE CONFIGURE ================")
log.info("DB_TYPE: %s" % db.db_type)
log.info("DB_HOST: %s" % db.db_host)
log.info("DB_PORT: %s" % db.db_port)
log.info("DB_NAME: %s" % db.db_name)
log.info("DB_USER: %s" % db.db_user)
log.info("=================================================")
return db.test()
================================================
FILE: helper/proxy.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: Proxy
Description : 代理对象类型封装
Author : JHao
date: 2019/7/11
-------------------------------------------------
Change Activity:
2019/7/11: 代理对象类型封装
-------------------------------------------------
"""
__author__ = 'JHao'
import json
class Proxy(object):
def __init__(self, proxy, fail_count=0, region="", anonymous="",
source="", check_count=0, last_status="", last_time="", https=False):
self._proxy = proxy
self._fail_count = fail_count
self._region = region
self._anonymous = anonymous
self._source = source.split('/')
self._check_count = check_count
self._last_status = last_status
self._last_time = last_time
self._https = https
@classmethod
def createFromJson(cls, proxy_json):
_dict = json.loads(proxy_json)
return cls(proxy=_dict.get("proxy", ""),
fail_count=_dict.get("fail_count", 0),
region=_dict.get("region", ""),
anonymous=_dict.get("anonymous", ""),
source=_dict.get("source", ""),
check_count=_dict.get("check_count", 0),
last_status=_dict.get("last_status", ""),
last_time=_dict.get("last_time", ""),
https=_dict.get("https", False)
)
@property
def proxy(self):
""" 代理 ip:port """
return self._proxy
@property
def fail_count(self):
""" 检测失败次数 """
return self._fail_count
@property
def region(self):
""" 地理位置(国家/城市) """
return self._region
@property
def anonymous(self):
""" 匿名 """
return self._anonymous
@property
def source(self):
""" 代理来源 """
return '/'.join(self._source)
@property
def check_count(self):
""" 代理检测次数 """
return self._check_count
@property
def last_status(self):
""" 最后一次检测结果 True -> 可用; False -> 不可用"""
return self._last_status
@property
def last_time(self):
""" 最后一次检测时间 """
return self._last_time
@property
def https(self):
""" 是否支持https """
return self._https
@property
def to_dict(self):
""" 属性字典 """
return {"proxy": self.proxy,
"https": self.https,
"fail_count": self.fail_count,
"region": self.region,
"anonymous": self.anonymous,
"source": self.source,
"check_count": self.check_count,
"last_status": self.last_status,
"last_time": self.last_time}
@property
def to_json(self):
""" 属性json格式 """
return json.dumps(self.to_dict, ensure_ascii=False)
@fail_count.setter
def fail_count(self, value):
self._fail_count = value
@check_count.setter
def check_count(self, value):
self._check_count = value
@last_status.setter
def last_status(self, value):
self._last_status = value
@last_time.setter
def last_time(self, value):
self._last_time = value
@https.setter
def https(self, value):
self._https = value
@region.setter
def region(self, value):
self._region = value
def add_source(self, source_str):
if source_str:
self._source.append(source_str)
self._source = list(set(self._source))
================================================
FILE: helper/scheduler.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: proxyScheduler
Description :
Author : JHao
date: 2019/8/5
-------------------------------------------------
Change Activity:
2019/08/05: proxyScheduler
2021/02/23: runProxyCheck时,剩余代理少于POOL_SIZE_MIN时执行抓取
-------------------------------------------------
"""
__author__ = 'JHao'
from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.executors.pool import ProcessPoolExecutor
from util.six import Queue
from helper.fetch import Fetcher
from helper.check import Checker
from handler.logHandler import LogHandler
from handler.proxyHandler import ProxyHandler
from handler.configHandler import ConfigHandler
def __runProxyFetch():
proxy_queue = Queue()
proxy_fetcher = Fetcher()
for proxy in proxy_fetcher.run():
proxy_queue.put(proxy)
Checker("raw", proxy_queue)
def __runProxyCheck():
proxy_handler = ProxyHandler()
proxy_queue = Queue()
if proxy_handler.db.getCount().get("total", 0) < proxy_handler.conf.poolSizeMin:
__runProxyFetch()
for proxy in proxy_handler.getAll():
proxy_queue.put(proxy)
Checker("use", proxy_queue)
def runScheduler():
__runProxyFetch()
timezone = ConfigHandler().timezone
scheduler_log = LogHandler("scheduler")
scheduler = BlockingScheduler(logger=scheduler_log, timezone=timezone)
scheduler.add_job(__runProxyFetch, 'interval', minutes=4, id="proxy_fetch", name="proxy采集")
scheduler.add_job(__runProxyCheck, 'interval', minutes=2, id="proxy_check", name="proxy检查")
executors = {
'default': {'type': 'threadpool', 'max_workers': 20},
'processpool': ProcessPoolExecutor(max_workers=5)
}
job_defaults = {
'coalesce': False,
'max_instances': 10
}
scheduler.configure(executors=executors, job_defaults=job_defaults, timezone=timezone)
scheduler.start()
if __name__ == '__main__':
runScheduler()
================================================
FILE: helper/validator.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: _validators
Description : 定义proxy验证方法
Author : JHao
date: 2021/5/25
-------------------------------------------------
Change Activity:
2023/03/10: 支持带用户认证的代理格式 username:password@ip:port
-------------------------------------------------
"""
__author__ = 'JHao'
import re
from requests import head
from util.six import withMetaclass
from util.singleton import Singleton
from handler.configHandler import ConfigHandler
conf = ConfigHandler()
HEADER = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
'Accept': '*/*',
'Connection': 'keep-alive',
'Accept-Language': 'zh-CN,zh;q=0.8'}
IP_REGEX = re.compile(r"(.*:.*@)?\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}:\d{1,5}")
class ProxyValidator(withMetaclass(Singleton)):
pre_validator = []
http_validator = []
https_validator = []
@classmethod
def addPreValidator(cls, func):
cls.pre_validator.append(func)
return func
@classmethod
def addHttpValidator(cls, func):
cls.http_validator.append(func)
return func
@classmethod
def addHttpsValidator(cls, func):
cls.https_validator.append(func)
return func
@ProxyValidator.addPreValidator
def formatValidator(proxy):
"""检查代理格式"""
return True if IP_REGEX.fullmatch(proxy) else False
@ProxyValidator.addHttpValidator
def httpTimeOutValidator(proxy):
""" http检测超时 """
proxies = {"http": "http://{proxy}".format(proxy=proxy), "https": "https://{proxy}".format(proxy=proxy)}
try:
r = head(conf.httpUrl, headers=HEADER, proxies=proxies, timeout=conf.verifyTimeout)
return True if r.status_code == 200 else False
except Exception as e:
return False
@ProxyValidator.addHttpsValidator
def httpsTimeOutValidator(proxy):
"""https检测超时"""
proxies = {"http": "http://{proxy}".format(proxy=proxy), "https": "https://{proxy}".format(proxy=proxy)}
try:
r = head(conf.httpsUrl, headers=HEADER, proxies=proxies, timeout=conf.verifyTimeout, verify=False)
return True if r.status_code == 200 else False
except Exception as e:
return False
@ProxyValidator.addHttpValidator
def customValidatorExample(proxy):
"""自定义validator函数,校验代理是否可用, 返回True/False"""
return True
================================================
FILE: proxyPool.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: proxy_pool
Description : proxy pool 启动入口
Author : JHao
date: 2020/6/19
-------------------------------------------------
Change Activity:
2020/6/19:
-------------------------------------------------
"""
__author__ = 'JHao'
import click
from helper.launcher import startServer, startScheduler
from setting import BANNER, VERSION
CONTEXT_SETTINGS = dict(help_option_names=['-h', '--help'])
@click.group(context_settings=CONTEXT_SETTINGS)
@click.version_option(version=VERSION)
def cli():
"""ProxyPool cli工具"""
@cli.command(name="schedule")
def schedule():
""" 启动调度程序 """
click.echo(BANNER)
startScheduler()
@cli.command(name="server")
def server():
""" 启动api服务 """
click.echo(BANNER)
startServer()
if __name__ == '__main__':
cli()
================================================
FILE: requirements.txt
================================================
requests==2.20.0
gunicorn==19.9.0
lxml==4.9.2
redis==3.5.3
APScheduler==3.10.0;python_version>="3.10"
APScheduler==3.2.0;python_version<"3.10"
click==8.0.1;python_version>"3.6"
click==7.0;python_version<="3.6"
Flask==2.1.1;python_version>"3.6"
Flask==1.0;python_version<="3.6"
werkzeug==2.1.0;python_version>"3.6"
werkzeug==0.15.5;python_version<="3.6"
================================================
FILE: setting.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: setting.py
Description : 配置文件
Author : JHao
date: 2019/2/15
-------------------------------------------------
Change Activity:
2019/2/15:
-------------------------------------------------
"""
BANNER = r"""
****************************************************************
*** ______ ********************* ______ *********** _ ********
*** | ___ \_ ******************** | ___ \ ********* | | ********
*** | |_/ / \__ __ __ _ __ _ | |_/ /___ * ___ | | ********
*** | __/| _// _ \ \ \/ /| | | || __// _ \ / _ \ | | ********
*** | | | | | (_) | > < \ |_| || | | (_) | (_) || |___ ****
*** \_| |_| \___/ /_/\_\ \__ |\_| \___/ \___/ \_____/ ****
**** __ / / *****
************************* /___ / *******************************
************************* ********************************
****************************************************************
"""
VERSION = "2.4.0"
# ############### server config ###############
HOST = "0.0.0.0"
PORT = 5010
# ############### database config ###################
# db connection uri
# example:
# Redis: redis://:password@ip:port/db
# Ssdb: ssdb://:password@ip:port
DB_CONN = 'redis://:pwd@127.0.0.1:6379/0'
# proxy table name
TABLE_NAME = 'use_proxy'
# ###### config the proxy fetch function ######
PROXY_FETCHER = [
"freeProxy01",
"freeProxy02",
"freeProxy03",
"freeProxy04",
"freeProxy05",
"freeProxy06",
"freeProxy07",
"freeProxy08",
"freeProxy09",
"freeProxy10",
"freeProxy11"
]
# ############# proxy validator #################
# 代理验证目标网站
HTTP_URL = "http://httpbin.org"
HTTPS_URL = "https://www.qq.com"
# 代理验证时超时时间
VERIFY_TIMEOUT = 10
# 近PROXY_CHECK_COUNT次校验中允许的最大失败次数,超过则剔除代理
MAX_FAIL_COUNT = 0
# 近PROXY_CHECK_COUNT次校验中允许的最大失败率,超过则剔除代理
# MAX_FAIL_RATE = 0.1
# proxyCheck时代理数量少于POOL_SIZE_MIN触发抓取
POOL_SIZE_MIN = 20
# ############# proxy attributes #################
# 是否启用代理地域属性
PROXY_REGION = True
# ############# scheduler config #################
# Set the timezone for the scheduler forcely (optional)
# If it is running on a VM, and
# "ValueError: Timezone offset does not match system offset"
# was raised during scheduling.
# Please uncomment the following line and set a timezone for the scheduler.
# Otherwise it will detect the timezone from the system automatically.
TIMEZONE = "Asia/Shanghai"
================================================
FILE: start.sh
================================================
#!/usr/bin/env bash
python proxyPool.py server &
python proxyPool.py schedule
================================================
FILE: test/__init__.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: __init__
Description :
Author : JHao
date: 2019/2/15
-------------------------------------------------
Change Activity:
2019/2/15:
-------------------------------------------------
"""
__author__ = 'JHao'
================================================
FILE: test/testConfigHandler.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: testGetConfig
Description : testGetConfig
Author : J_hao
date: 2017/7/31
-------------------------------------------------
Change Activity:
2017/7/31:
-------------------------------------------------
"""
__author__ = 'J_hao'
from handler.configHandler import ConfigHandler
from time import sleep
def testConfig():
"""
:return:
"""
conf = ConfigHandler()
print(conf.dbConn)
print(conf.serverPort)
print(conf.serverHost)
print(conf.tableName)
assert isinstance(conf.fetchers, list)
print(conf.fetchers)
for _ in range(2):
print(conf.fetchers)
sleep(5)
if __name__ == '__main__':
testConfig()
================================================
FILE: test/testDbClient.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: testDbClient
Description :
Author : JHao
date: 2020/6/23
-------------------------------------------------
Change Activity:
2020/6/23:
-------------------------------------------------
"""
__author__ = 'JHao'
from db.dbClient import DbClient
def testDbClient():
# ############### ssdb ###############
ssdb_uri = "ssdb://:password@127.0.0.1:8888"
s = DbClient.parseDbConn(ssdb_uri)
assert s.db_type == "SSDB"
assert s.db_pwd == "password"
assert s.db_host == "127.0.0.1"
assert s.db_port == 8888
# ############### redis ###############
redis_uri = "redis://:password@127.0.0.1:6379/1"
r = DbClient.parseDbConn(redis_uri)
assert r.db_type == "REDIS"
assert r.db_pwd == "password"
assert r.db_host == "127.0.0.1"
assert r.db_port == 6379
assert r.db_name == "1"
print("DbClient ok!")
if __name__ == '__main__':
testDbClient()
================================================
FILE: test/testLogHandler.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: testLogHandler
Description :
Author : J_hao
date: 2017/8/2
-------------------------------------------------
Change Activity:
2017/8/2:
-------------------------------------------------
"""
__author__ = 'J_hao'
from handler.logHandler import LogHandler
def testLogHandler():
log = LogHandler('test')
log.info('this is info')
log.error('this is error')
if __name__ == '__main__':
testLogHandler()
================================================
FILE: test/testProxyClass.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: testProxyClass
Description :
Author : JHao
date: 2019/8/8
-------------------------------------------------
Change Activity:
2019/8/8:
-------------------------------------------------
"""
__author__ = 'JHao'
import json
from helper.proxy import Proxy
def testProxyClass():
proxy = Proxy("127.0.0.1:8080")
print(proxy.to_json)
proxy.source = "test"
proxy_str = json.dumps(proxy.to_dict, ensure_ascii=False)
print(proxy_str)
print(Proxy.createFromJson(proxy_str).to_dict)
if __name__ == '__main__':
testProxyClass()
================================================
FILE: test/testProxyFetcher.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: testProxyFetcher
Description :
Author : JHao
date: 2020/6/23
-------------------------------------------------
Change Activity:
2020/6/23:
-------------------------------------------------
"""
__author__ = 'JHao'
from fetcher.proxyFetcher import ProxyFetcher
from handler.configHandler import ConfigHandler
def testProxyFetcher():
conf = ConfigHandler()
proxy_getter_functions = conf.fetchers
proxy_counter = {_: 0 for _ in proxy_getter_functions}
for proxyGetter in proxy_getter_functions:
for proxy in getattr(ProxyFetcher, proxyGetter.strip())():
if proxy:
print('{func}: fetch proxy {proxy}'.format(func=proxyGetter, proxy=proxy))
proxy_counter[proxyGetter] = proxy_counter.get(proxyGetter) + 1
for key, value in proxy_counter.items():
print(key, value)
if __name__ == '__main__':
testProxyFetcher()
================================================
FILE: test/testProxyValidator.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: testProxyValidator
Description :
Author : JHao
date: 2021/5/25
-------------------------------------------------
Change Activity:
2021/5/25:
-------------------------------------------------
"""
__author__ = 'JHao'
from helper.validator import ProxyValidator
def testProxyValidator():
for _ in ProxyValidator.pre_validator:
print(_)
for _ in ProxyValidator.http_validator:
print(_)
for _ in ProxyValidator.https_validator:
print(_)
if __name__ == '__main__':
testProxyValidator()
================================================
FILE: test/testRedisClient.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: testRedisClient
Description :
Author : JHao
date: 2020/6/23
-------------------------------------------------
Change Activity:
2020/6/23:
-------------------------------------------------
"""
__author__ = 'JHao'
def testRedisClient():
from db.dbClient import DbClient
from helper.proxy import Proxy
uri = "redis://:pwd@127.0.0.1:6379"
db = DbClient(uri)
db.changeTable("use_proxy")
proxy = Proxy.createFromJson('{"proxy": "118.190.79.36:8090", "https": false, "fail_count": 0, "region": "", "anonymous": "", "source": "freeProxy14", "check_count": 4, "last_status": true, "last_time": "2021-05-26 10:58:04"}')
print("put: ", db.put(proxy))
print("get: ", db.get(https=None))
print("exists: ", db.exists("27.38.96.101:9797"))
print("exists: ", db.exists("27.38.96.101:8888"))
print("pop: ", db.pop(https=None))
print("getAll: ", db.getAll(https=None))
print("getCount", db.getCount())
if __name__ == '__main__':
testRedisClient()
================================================
FILE: test/testSsdbClient.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: testSsdbClient
Description :
Author : JHao
date: 2020/7/3
-------------------------------------------------
Change Activity:
2020/7/3:
-------------------------------------------------
"""
__author__ = 'JHao'
def testSsdbClient():
from db.dbClient import DbClient
from helper.proxy import Proxy
uri = "ssdb://@127.0.0.1:8888"
db = DbClient(uri)
db.changeTable("use_proxy")
proxy = Proxy.createFromJson('{"proxy": "118.190.79.36:8090", "https": false, "fail_count": 0, "region": "", "anonymous": "", "source": "freeProxy14", "check_count": 4, "last_status": true, "last_time": "2021-05-26 10:58:04"}')
print("put: ", db.put(proxy))
print("get: ", db.get(https=None))
print("exists: ", db.exists("27.38.96.101:9797"))
print("exists: ", db.exists("27.38.96.101:8888"))
print("getAll: ", db.getAll(https=None))
# print("pop: ", db.pop(https=None))
print("clear: ", db.clear())
print("getCount", db.getCount())
if __name__ == '__main__':
testSsdbClient()
================================================
FILE: test.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: test.py
Description :
Author : JHao
date: 2017/3/7
-------------------------------------------------
Change Activity:
2017/3/7:
-------------------------------------------------
"""
__author__ = 'JHao'
from test import testProxyValidator
from test import testConfigHandler
from test import testLogHandler
from test import testDbClient
if __name__ == '__main__':
print("ConfigHandler:")
testConfigHandler.testConfig()
print("LogHandler:")
testLogHandler.testLogHandler()
print("DbClient:")
testDbClient.testDbClient()
print("ProxyValidator:")
testProxyValidator.testProxyValidator()
================================================
FILE: util/__init__.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: __init__
Description :
Author : JHao
date: 2020/7/6
-------------------------------------------------
Change Activity:
2020/7/6:
-------------------------------------------------
"""
__author__ = 'JHao'
================================================
FILE: util/lazyProperty.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: lazyProperty
Description :
Author : JHao
date: 2016/12/3
-------------------------------------------------
Change Activity:
2016/12/3:
-------------------------------------------------
"""
__author__ = 'JHao'
class LazyProperty(object):
"""
LazyProperty
explain: http://www.spiderpy.cn/blog/5/
"""
def __init__(self, func):
self.func = func
def __get__(self, instance, owner):
if instance is None:
return self
else:
value = self.func(instance)
setattr(instance, self.func.__name__, value)
return value
================================================
FILE: util/singleton.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: singleton
Description :
Author : JHao
date: 2016/12/3
-------------------------------------------------
Change Activity:
2016/12/3:
-------------------------------------------------
"""
__author__ = 'JHao'
class Singleton(type):
"""
Singleton Metaclass
"""
_inst = {}
def __call__(cls, *args, **kwargs):
if cls not in cls._inst:
cls._inst[cls] = super(Singleton, cls).__call__(*args)
return cls._inst[cls]
================================================
FILE: util/six.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: six
Description :
Author : JHao
date: 2020/6/22
-------------------------------------------------
Change Activity:
2020/6/22:
-------------------------------------------------
"""
__author__ = 'JHao'
import sys
PY2 = sys.version_info[0] == 2
PY3 = sys.version_info[0] == 3
if PY3:
def iteritems(d, **kw):
return iter(d.items(**kw))
else:
def iteritems(d, **kw):
return d.iteritems(**kw)
if PY3:
from urllib.parse import urlparse
else:
from urlparse import urlparse
if PY3:
from imp import reload as reload_six
else:
reload_six = reload
if PY3:
from queue import Empty, Queue
else:
from Queue import Empty, Queue
def withMetaclass(meta, *bases):
"""Create a base class with a metaclass."""
# This requires a bit of explanation: the basic idea is to make a dummy
# metaclass for one level of class instantiation that replaces itself with
# the actual metaclass.
class MetaClass(meta):
def __new__(cls, name, this_bases, d):
return meta(name, bases, d)
return type.__new__(MetaClass, 'temporary_class', (), {})
================================================
FILE: util/webRequest.py
================================================
# -*- coding: utf-8 -*-
"""
-------------------------------------------------
File Name: WebRequest
Description : Network Requests Class
Author : J_hao
date: 2017/7/31
-------------------------------------------------
Change Activity:
2017/7/31:
-------------------------------------------------
"""
__author__ = 'J_hao'
from requests.models import Response
from lxml import etree
import requests
import random
import time
from handler.logHandler import LogHandler
requests.packages.urllib3.disable_warnings()
class WebRequest(object):
name = "web_request"
def __init__(self, *args, **kwargs):
self.log = LogHandler(self.name, file=False)
self.response = Response()
@property
def user_agent(self):
"""
return an User-Agent at random
:return:
"""
ua_list = [
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
]
return random.choice(ua_list)
@property
def header(self):
"""
basic header
:return:
"""
return {'User-Agent': self.user_agent,
'Accept': '*/*',
'Connection': 'keep-alive',
'Accept-Language': 'zh-CN,zh;q=0.8'}
def get(self, url, header=None, retry_time=3, retry_interval=5, timeout=5, *args, **kwargs):
"""
get method
:param url: target url
:param header: headers
:param retry_time: retry time
:param retry_interval: retry interval
:param timeout: network timeout
:return:
"""
headers = self.header
if header and isinstance(header, dict):
headers.update(header)
while True:
try:
self.response = requests.get(url, headers=headers, timeout=timeout, *args, **kwargs)
return self
except Exception as e:
self.log.error("requests: %s error: %s" % (url, str(e)))
retry_time -= 1
if retry_time <= 0:
resp = Response()
resp.status_code = 200
return self
self.log.info("retry %s second after" % retry_interval)
time.sleep(retry_interval)
@property
def tree(self):
return etree.HTML(self.response.content)
@property
def text(self):
return self.response.text
@property
def json(self):
try:
return self.response.json()
except Exception as e:
self.log.error(str(e))
return {}
gitextract_arxmexr3/
├── .github/
│ └── workflows/
│ ├── docker-image-latest.yml
│ └── docker-image-tags.yml
├── .gitignore
├── .travis.yml
├── Dockerfile
├── LICENSE
├── README.md
├── _config.yml
├── api/
│ ├── __init__.py
│ └── proxyApi.py
├── db/
│ ├── __init__.py
│ ├── dbClient.py
│ ├── redisClient.py
│ └── ssdbClient.py
├── docker-compose.yml
├── docs/
│ ├── Makefile
│ ├── changelog.rst
│ ├── conf.py
│ ├── dev/
│ │ ├── ext_fetcher.rst
│ │ ├── ext_validator.rst
│ │ └── index.rst
│ ├── index.rst
│ ├── make.bat
│ └── user/
│ ├── how_to_config.rst
│ ├── how_to_run.rst
│ ├── how_to_use.rst
│ └── index.rst
├── fetcher/
│ ├── __init__.py
│ └── proxyFetcher.py
├── handler/
│ ├── __init__.py
│ ├── configHandler.py
│ ├── logHandler.py
│ └── proxyHandler.py
├── helper/
│ ├── __init__.py
│ ├── check.py
│ ├── fetch.py
│ ├── launcher.py
│ ├── proxy.py
│ ├── scheduler.py
│ └── validator.py
├── proxyPool.py
├── requirements.txt
├── setting.py
├── start.sh
├── test/
│ ├── __init__.py
│ ├── testConfigHandler.py
│ ├── testDbClient.py
│ ├── testLogHandler.py
│ ├── testProxyClass.py
│ ├── testProxyFetcher.py
│ ├── testProxyValidator.py
│ ├── testRedisClient.py
│ └── testSsdbClient.py
├── test.py
└── util/
├── __init__.py
├── lazyProperty.py
├── singleton.py
├── six.py
└── webRequest.py
SYMBOL INDEX (173 symbols across 27 files)
FILE: api/proxyApi.py
class JsonResponse (line 33) | class JsonResponse(Response):
method force_type (line 35) | def force_type(cls, response, environ=None):
function index (line 55) | def index():
function get (line 60) | def get():
function pop (line 67) | def pop():
function refresh (line 74) | def refresh():
function getAll (line 80) | def getAll():
function delete (line 87) | def delete():
function getCount (line 94) | def getCount():
function runFlask (line 106) | def runFlask():
FILE: db/dbClient.py
class DbClient (line 26) | class DbClient(withMetaclass(Singleton)):
method __init__ (line 51) | def __init__(self, db_conn):
method parseDbConn (line 60) | def parseDbConn(cls, db_conn):
method __initDbClient (line 70) | def __initDbClient(self):
method get (line 89) | def get(self, https, **kwargs):
method put (line 92) | def put(self, key, **kwargs):
method update (line 95) | def update(self, key, value, **kwargs):
method delete (line 98) | def delete(self, key, **kwargs):
method exists (line 101) | def exists(self, key, **kwargs):
method pop (line 104) | def pop(self, https, **kwargs):
method getAll (line 107) | def getAll(self, https):
method clear (line 110) | def clear(self):
method changeTable (line 113) | def changeTable(self, name):
method getCount (line 116) | def getCount(self):
method test (line 119) | def test(self):
FILE: db/redisClient.py
class RedisClient (line 25) | class RedisClient(object):
method __init__ (line 34) | def __init__(self, **kwargs):
method get (line 50) | def get(self, https):
method put (line 64) | def put(self, proxy_obj):
method pop (line 73) | def pop(self, https):
method delete (line 83) | def delete(self, proxy_str):
method exists (line 91) | def exists(self, proxy_str):
method update (line 99) | def update(self, proxy_obj):
method getAll (line 107) | def getAll(self, https):
method clear (line 118) | def clear(self):
method getCount (line 125) | def getCount(self):
method changeTable (line 133) | def changeTable(self, name):
method test (line 141) | def test(self):
FILE: db/ssdbClient.py
class SsdbClient (line 27) | class SsdbClient(object):
method __init__ (line 35) | def __init__(self, **kwargs):
method get (line 50) | def get(self, https):
method put (line 64) | def put(self, proxy_obj):
method pop (line 73) | def pop(self, https):
method delete (line 83) | def delete(self, proxy_str):
method exists (line 91) | def exists(self, proxy_str):
method update (line 99) | def update(self, proxy_obj):
method getAll (line 107) | def getAll(self, https):
method clear (line 118) | def clear(self):
method getCount (line 125) | def getCount(self):
method changeTable (line 133) | def changeTable(self, name):
method test (line 141) | def test(self):
FILE: fetcher/proxyFetcher.py
class ProxyFetcher (line 22) | class ProxyFetcher(object):
method freeProxy01 (line 28) | def freeProxy01():
method freeProxy02 (line 50) | def freeProxy02():
method freeProxy03 (line 63) | def freeProxy03():
method freeProxy04 (line 74) | def freeProxy04():
method freeProxy05 (line 92) | def freeProxy05(page_count=1):
method freeProxy06 (line 111) | def freeProxy06():
method freeProxy07 (line 123) | def freeProxy07():
method freeProxy08 (line 133) | def freeProxy08():
method freeProxy09 (line 143) | def freeProxy09(page_count=1):
method freeProxy10 (line 154) | def freeProxy10():
method freeProxy11 (line 164) | def freeProxy11():
FILE: handler/configHandler.py
class ConfigHandler (line 22) | class ConfigHandler(withMetaclass(Singleton)):
method __init__ (line 24) | def __init__(self):
method serverHost (line 28) | def serverHost(self):
method serverPort (line 32) | def serverPort(self):
method dbConn (line 36) | def dbConn(self):
method tableName (line 40) | def tableName(self):
method fetchers (line 44) | def fetchers(self):
method httpUrl (line 49) | def httpUrl(self):
method httpsUrl (line 53) | def httpsUrl(self):
method verifyTimeout (line 57) | def verifyTimeout(self):
method maxFailCount (line 65) | def maxFailCount(self):
method poolSizeMin (line 73) | def poolSizeMin(self):
method proxyRegion (line 77) | def proxyRegion(self):
method timezone (line 81) | def timezone(self):
FILE: handler/logHandler.py
class LogHandler (line 44) | class LogHandler(logging.Logger):
method __init__ (line 49) | def __init__(self, name, level=DEBUG, stream=True, file=True):
method __setFileHandler__ (line 59) | def __setFileHandler__(self, level=None):
method __setStreamHandler__ (line 79) | def __setStreamHandler__(self, level=None):
FILE: handler/proxyHandler.py
class ProxyHandler (line 21) | class ProxyHandler(object):
method __init__ (line 24) | def __init__(self):
method get (line 29) | def get(self, https=False):
method pop (line 39) | def pop(self, https):
method put (line 49) | def put(self, proxy):
method delete (line 56) | def delete(self, proxy):
method getAll (line 64) | def getAll(self, https=False):
method exists (line 72) | def exists(self, proxy):
method getCount (line 80) | def getCount(self):
FILE: helper/check.py
class DoValidator (line 27) | class DoValidator(object):
method validator (line 33) | def validator(cls, proxy, work_type):
method httpValidator (line 59) | def httpValidator(cls, proxy):
method httpsValidator (line 66) | def httpsValidator(cls, proxy):
method preValidator (line 73) | def preValidator(cls, proxy):
method regionGetter (line 80) | def regionGetter(cls, proxy):
class _ThreadChecker (line 89) | class _ThreadChecker(Thread):
method __init__ (line 92) | def __init__(self, work_type, target_queue, thread_name):
method run (line 100) | def run(self):
method __ifRaw (line 115) | def __ifRaw(self, proxy):
method __ifUse (line 125) | def __ifUse(self, proxy):
function Checker (line 142) | def Checker(tp, queue):
FILE: helper/fetch.py
class _ThreadFetcher (line 24) | class _ThreadFetcher(Thread):
method __init__ (line 26) | def __init__(self, fetch_source, proxy_dict):
method run (line 35) | def run(self):
class Fetcher (line 51) | class Fetcher(object):
method __init__ (line 54) | def __init__(self):
method run (line 58) | def run(self):
FILE: helper/launcher.py
function startServer (line 23) | def startServer():
function startScheduler (line 29) | def startScheduler():
function __beforeStart (line 35) | def __beforeStart():
function __showVersion (line 43) | def __showVersion():
function __showConfigure (line 48) | def __showConfigure():
function __checkDBConfig (line 55) | def __checkDBConfig():
FILE: helper/proxy.py
class Proxy (line 18) | class Proxy(object):
method __init__ (line 20) | def __init__(self, proxy, fail_count=0, region="", anonymous="",
method createFromJson (line 33) | def createFromJson(cls, proxy_json):
method proxy (line 47) | def proxy(self):
method fail_count (line 52) | def fail_count(self):
method region (line 57) | def region(self):
method anonymous (line 62) | def anonymous(self):
method source (line 67) | def source(self):
method check_count (line 72) | def check_count(self):
method last_status (line 77) | def last_status(self):
method last_time (line 82) | def last_time(self):
method https (line 87) | def https(self):
method to_dict (line 92) | def to_dict(self):
method to_json (line 105) | def to_json(self):
method fail_count (line 110) | def fail_count(self, value):
method check_count (line 114) | def check_count(self, value):
method last_status (line 118) | def last_status(self, value):
method last_time (line 122) | def last_time(self, value):
method https (line 126) | def https(self, value):
method region (line 130) | def region(self, value):
method add_source (line 133) | def add_source(self, source_str):
FILE: helper/scheduler.py
function __runProxyFetch (line 27) | def __runProxyFetch():
function __runProxyCheck (line 37) | def __runProxyCheck():
function runScheduler (line 47) | def runScheduler():
FILE: helper/validator.py
class ProxyValidator (line 31) | class ProxyValidator(withMetaclass(Singleton)):
method addPreValidator (line 37) | def addPreValidator(cls, func):
method addHttpValidator (line 42) | def addHttpValidator(cls, func):
method addHttpsValidator (line 47) | def addHttpsValidator(cls, func):
function formatValidator (line 53) | def formatValidator(proxy):
function httpTimeOutValidator (line 59) | def httpTimeOutValidator(proxy):
function httpsTimeOutValidator (line 72) | def httpsTimeOutValidator(proxy):
function customValidatorExample (line 84) | def customValidatorExample(proxy):
FILE: proxyPool.py
function cli (line 24) | def cli():
function schedule (line 29) | def schedule():
function server (line 36) | def server():
FILE: test/testConfigHandler.py
function testConfig (line 19) | def testConfig():
FILE: test/testDbClient.py
function testDbClient (line 18) | def testDbClient():
FILE: test/testLogHandler.py
function testLogHandler (line 18) | def testLogHandler():
FILE: test/testProxyClass.py
function testProxyClass (line 19) | def testProxyClass():
FILE: test/testProxyFetcher.py
function testProxyFetcher (line 19) | def testProxyFetcher():
FILE: test/testProxyValidator.py
function testProxyValidator (line 18) | def testProxyValidator():
FILE: test/testRedisClient.py
function testRedisClient (line 16) | def testRedisClient():
FILE: test/testSsdbClient.py
function testSsdbClient (line 16) | def testSsdbClient():
FILE: util/lazyProperty.py
class LazyProperty (line 16) | class LazyProperty(object):
method __init__ (line 22) | def __init__(self, func):
method __get__ (line 25) | def __get__(self, instance, owner):
FILE: util/singleton.py
class Singleton (line 16) | class Singleton(type):
method __call__ (line 23) | def __call__(cls, *args, **kwargs):
FILE: util/six.py
function iteritems (line 21) | def iteritems(d, **kw):
function iteritems (line 24) | def iteritems(d, **kw):
function withMetaclass (line 43) | def withMetaclass(meta, *bases):
FILE: util/webRequest.py
class WebRequest (line 26) | class WebRequest(object):
method __init__ (line 29) | def __init__(self, *args, **kwargs):
method user_agent (line 34) | def user_agent(self):
method header (line 52) | def header(self):
method get (line 62) | def get(self, url, header=None, retry_time=3, retry_interval=5, timeou...
method tree (line 90) | def tree(self):
method text (line 94) | def text(self):
method json (line 98) | def json(self):
Condensed preview — 59 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (108K chars).
[
{
"path": ".github/workflows/docker-image-latest.yml",
"chars": 795,
"preview": "name: Publish Docker image latest\n\non:\n push:\n branches:\n - 'master'\n\njobs:\n\n push_to_registry:\n name: Push"
},
{
"path": ".github/workflows/docker-image-tags.yml",
"chars": 840,
"preview": "name: Publish Docker image tags\n\non:\n push:\n tags:\n - '*'\n\njobs:\n\n push_to_registry:\n name: Push Docker ima"
},
{
"path": ".gitignore",
"chars": 31,
"preview": ".idea/\ndocs/_build\n*.pyc\n*.log\n"
},
{
"path": ".travis.yml",
"chars": 190,
"preview": "language: python\npython:\n - \"2.7\"\n - \"3.5\"\n - \"3.6\"\n - \"3.7\"\n - \"3.8\"\n - \"3.9\"\n - \"3.10\"\n - \"3.11\"\nos:\n - linux"
},
{
"path": "Dockerfile",
"chars": 524,
"preview": "FROM python:3.6-alpine\n\nMAINTAINER jhao104 <j_hao104@163.com>\n\nWORKDIR /app\n\nCOPY ./requirements.txt .\n\n# apk repository"
},
{
"path": "LICENSE",
"chars": 1065,
"preview": "MIT License\n\nCopyright (c) 2017 J_hao104\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\no"
},
{
"path": "README.md",
"chars": 8820,
"preview": "\nProxyPool 爬虫代理IP池\n=======\n[](https://travis-"
},
{
"path": "_config.yml",
"chars": 26,
"preview": "theme: jekyll-theme-cayman"
},
{
"path": "api/__init__.py",
"chars": 356,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: __init__.py \n Descrip"
},
{
"path": "api/proxyApi.py",
"chars": 4165,
"preview": "# -*- coding: utf-8 -*-\n# !/usr/bin/env python\n\"\"\"\n-------------------------------------------------\n File Name: P"
},
{
"path": "db/__init__.py",
"chars": 337,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: __init__.py.py \n Desc"
},
{
"path": "db/dbClient.py",
"chars": 3461,
"preview": "# -*- coding: utf-8 -*-\n# !/usr/bin/env python\n\"\"\"\n-------------------------------------------------\n File Name: Db"
},
{
"path": "db/redisClient.py",
"chars": 4366,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-----------------------------------------------------\n File Name: redisClient.py\n De"
},
{
"path": "db/ssdbClient.py",
"chars": 4451,
"preview": "# -*- coding: utf-8 -*-\n# !/usr/bin/env python\n\"\"\"\n-------------------------------------------------\n File Name: s"
},
{
"path": "docker-compose.yml",
"chars": 270,
"preview": "version: '2'\nservices:\n proxy_pool:\n build: .\n container_name: proxy_pool\n ports:\n - \"5010:5010\"\n link"
},
{
"path": "docs/Makefile",
"chars": 634,
"preview": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line, and also\n# from the "
},
{
"path": "docs/changelog.rst",
"chars": 2365,
"preview": ".. _changelog:\n\nChangeLog\n==========\n\n2.4.2 (2024-01-18)\n------------------\n\n1. 代理格式检查支持需认证的代理格式 `username:password@ip:p"
},
{
"path": "docs/conf.py",
"chars": 2526,
"preview": "# Configuration file for the Sphinx documentation builder.\n#\n# This file only contains a selection of the most common op"
},
{
"path": "docs/dev/ext_fetcher.rst",
"chars": 993,
"preview": ".. ext_fetcher\n\n扩展代理源\n-----------\n\n项目默认包含几个免费的代理获取源,但是免费的毕竟质量有限,如果直接运行可能拿到的代理质量不理想。因此提供了用户自定义扩展代理获取的方法。\n\n如果要添加一个新的代理获取方法"
},
{
"path": "docs/dev/ext_validator.rst",
"chars": 2490,
"preview": ".. ext_validator\n\n代理校验\n-----------\n\n内置校验\n>>>>>>>>>\n\n项目中使用的代理校验方法全部定义在 `validator.py`_ 中, 通过 `ProxyValidator`_ 类中提供的装饰器来区"
},
{
"path": "docs/dev/index.rst",
"chars": 105,
"preview": "=========\n开发指南\n=========\n\n.. module:: dev\n\n.. toctree::\n :maxdepth: 2\n\n ext_fetcher\n ext_validator\n"
},
{
"path": "docs/index.rst",
"chars": 3032,
"preview": ".. ProxyPool documentation master file, created by\n sphinx-quickstart on Wed Jul 8 16:13:42 2020.\n You can adapt th"
},
{
"path": "docs/make.bat",
"chars": 760,
"preview": "@ECHO OFF\n\npushd %~dp0\n\nREM Command file for Sphinx documentation\n\nif \"%SPHINXBUILD%\" == \"\" (\n\tset SPHINXBUILD=sphinx-bu"
},
{
"path": "docs/user/how_to_config.rst",
"chars": 1651,
"preview": ".. how_to_config\n\n配置参考\n---------\n\n配置文件 ``setting.py`` 位于项目的主目录下, 配置主要分为四类: **服务配置** 、 **数据库配置** 、 **采集配置** 、 **校验配置**.\n\n"
},
{
"path": "docs/user/how_to_run.rst",
"chars": 1121,
"preview": ".. how_to_run\n\n\n如何运行\n---------\n\n下载代码\n>>>>>>>>>\n\n本项目需要下载代码到本地运行, 通过 ``git`` 下载:\n\n.. code-block:: console\n\n $ git clone"
},
{
"path": "docs/user/how_to_use.rst",
"chars": 1659,
"preview": ".. how_to_use\n\n如何使用\n----------\n\n爬虫代码要对接代理池目前有两种方式: 一是通过调用API接口使用, 二是直接读取数据库.\n\n调用API\n>>>>>>>>>\n\n启动ProxyPool的 ``server`` 后"
},
{
"path": "docs/user/index.rst",
"chars": 119,
"preview": "=========\n用户指南\n=========\n\n.. module:: user\n\n.. toctree::\n :maxdepth: 2\n\n how_to_run\n how_to_use\n how_to_config\n"
},
{
"path": "fetcher/__init__.py",
"chars": 334,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: __init__.py\n Descripti"
},
{
"path": "fetcher/proxyFetcher.py",
"chars": 9076,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: proxyFetcher\n Descript"
},
{
"path": "handler/__init__.py",
"chars": 402,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: __init__.py\n Descripti"
},
{
"path": "handler/configHandler.py",
"chars": 2110,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: configHandler\n Descrip"
},
{
"path": "handler/logHandler.py",
"chars": 2755,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: LogHandler.py\n Descrip"
},
{
"path": "handler/proxyHandler.py",
"chars": 2069,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: ProxyHandler.py\n Descr"
},
{
"path": "helper/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "helper/check.py",
"chars": 5391,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: check\n Description : "
},
{
"path": "helper/fetch.py",
"chars": 2975,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: fetchScheduler\n Descri"
},
{
"path": "helper/launcher.py",
"chars": 1653,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: launcher\n Description "
},
{
"path": "helper/proxy.py",
"chars": 3604,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: Proxy\n Description : "
},
{
"path": "helper/scheduler.py",
"chars": 2063,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: proxyScheduler\n Descri"
},
{
"path": "helper/validator.py",
"chars": 2437,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: _validators\n Descripti"
},
{
"path": "proxyPool.py",
"chars": 916,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: proxy_pool\n Descriptio"
},
{
"path": "requirements.txt",
"chars": 353,
"preview": "requests==2.20.0\ngunicorn==19.9.0\nlxml==4.9.2\nredis==3.5.3\nAPScheduler==3.10.0;python_version>=\"3.10\"\nAPScheduler==3.2.0"
},
{
"path": "setting.py",
"chars": 2544,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: setting.py\n Descriptio"
},
{
"path": "start.sh",
"chars": 77,
"preview": "#!/usr/bin/env bash\npython proxyPool.py server &\npython proxyPool.py schedule"
},
{
"path": "test/__init__.py",
"chars": 348,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: __init__\n Description "
},
{
"path": "test/testConfigHandler.py",
"chars": 807,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: testGetConfig\n Descrip"
},
{
"path": "test/testDbClient.py",
"chars": 1042,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: testDbClient\n Descript"
},
{
"path": "test/testLogHandler.py",
"chars": 560,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: testLogHandler\n Descri"
},
{
"path": "test/testProxyClass.py",
"chars": 696,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: testProxyClass\n Descri"
},
{
"path": "test/testProxyFetcher.py",
"chars": 1036,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: testProxyFetcher\n Desc"
},
{
"path": "test/testProxyValidator.py",
"chars": 668,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: testProxyValidator\n De"
},
{
"path": "test/testRedisClient.py",
"chars": 1140,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: testRedisClient\n Descr"
},
{
"path": "test/testSsdbClient.py",
"chars": 1166,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: testSsdbClient\n Descri"
},
{
"path": "test.py",
"chars": 766,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: test.py \n Description"
},
{
"path": "util/__init__.py",
"chars": 346,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: __init__\n Description "
},
{
"path": "util/lazyProperty.py",
"chars": 745,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: lazyProperty\n Descript"
},
{
"path": "util/singleton.py",
"chars": 601,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: singleton\n Description"
},
{
"path": "util/six.py",
"chars": 1254,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: six\n Description :\n "
},
{
"path": "util/webRequest.py",
"chars": 3421,
"preview": "# -*- coding: utf-8 -*-\n\"\"\"\n-------------------------------------------------\n File Name: WebRequest\n Descriptio"
}
]
About this extraction
This page contains the full source code of the jhao104/proxy_pool GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 59 files (98.2 KB), approximately 28.5k tokens, and a symbol index with 173 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.