Repository: leo8916/wxhub
Branch: master
Commit: bb0935a549e8
Files: 5
Total size: 11.7 MB
Directory structure:
gitextract_ukro0y5t/
├── README.md
├── chromedriver
├── pipe_example.py
├── requirements.txt
└── wxhub.py
================================================
FILE CONTENTS
================================================
================================================
FILE: README.md
================================================
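## WeChat Official Account Article Scraper
Uses the article-link search in the Official Account article editor, which gets around the 10-article limit of the Sogou-based approach. ;-)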
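### 2018.12
- Added scraping of Baidu Pan links and their passwords from account articles (set `-method baidu_pan_links`).
- Added a method that saves the whole HTML page: `-method whole_page`.
- Added `todo.list` and the `mask` variable.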
```
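todo.list records the link data of every article under an account, because calling the article search/paging API too frequently gets you banned.
The current approach therefore uses mask to record which page indexes have already been processed. This guarantees the same position is never paged twice and raises the chance of picking up newly added links.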
```
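The mask bookkeeping described above can be sketched as follows. This is a minimal illustration, not the code in `wxhub.py`: the helper names `extend_mask` and `pending_pages` are hypothetical, and the page size of 5 is assumed from the `pagesize = 5` constant in the script.

```python
import math


def extend_mask(mask, last_total, total, page_size=5):
    """Prepend one '0' (unprocessed) flag for every page added since the last run."""
    old_pages = math.ceil(last_total / page_size)
    new_pages = math.ceil(total / page_size)
    # New articles appear at the front, so new pages are added at the front too.
    return "0" * max(new_pages - old_pages, 0) + mask


def pending_pages(mask):
    """Return the indexes of pages that still have to be fetched."""
    return [i for i, flag in enumerate(mask) if flag != "1"]


if __name__ == "__main__":
    mask = extend_mask("11", last_total=10, total=17)
    print(mask, pending_pages(mask))
```

Because only the pages flagged `0` are requested, a rerun touches the paging API as little as possible.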
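### 2019.01
- Added the `-pl` option, which limits how many pages are fetched per run for an account. Paging too many times in one run gets you banned; 10 or fewer is recommended.
  - N = 0: no paging; only the previously collected URLs (todo.list) are reprocessed.
  - N < 0: no limit (default); paging stops at the last page or on the first error.
  - N > 0: page N times.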
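The three cases of `-pl` can be summarised in a few lines. The function below is a hypothetical sketch of the rule only (the name `pages_to_fetch` does not exist in the script), assuming `pending` is the list of page indexes that have not been processed yet.

```python
def pages_to_fetch(pending, page_limit):
    """Apply the -pl rule to a list of unprocessed page indexes."""
    if page_limit == 0:
        # No paging at all: only todo.list is reprocessed.
        return []
    if page_limit < 0:
        # No limit: fetch every pending page.
        return list(pending)
    # Positive limit: fetch at most N pages in this run.
    return list(pending)[:page_limit]


if __name__ == "__main__":
    print(pages_to_fetch([0, 1, 2], 2))
```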
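### Preparation
- First, you need a [WeChat Official Account; registration is simple](https://mp.weixin.qq.com)
- python 3.6
- [Download ChromeDriver](http://chromedriver.chromium.org/home). It is needed for the manual login on first use.
- Install the dependencies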
```
pip install -r requirements.txt
```
### Structure
```
wxhub/
├── README.md
├── arti.cache.list (generated after use)
├── chromedriver (macOS build by default; on Windows, download your own and rename it)
├── cookies.json (generated after use)
├── gongzhonghao.py (generated after use)
├── output (generated after use)
├── requirements.txt
├── url.cache.list (generated after use)
└── wxhub.py
```
### Usage
```
(py3) isyuu:wxhub isyuu$ python wxhub.py -h
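usage: wxhub.py [-h] -biz BIZ [-chrome CHROME] [-arti ARTI] [-method METHOD]
                [-sleep SLEEP] [-pipe PIPE] [-pl PAGE_LIMIT]

WeChat Official Account articles, all taken care of

optional arguments:
  -h, --help      show this help message and exit
  -biz BIZ        Required: name of the Official Account
  -chrome CHROME  Optional: path to chromedriver; defaults to the chromedriver next to the script
  -arti ARTI      Optional: article title; all articles are processed by default
  -method METHOD  Optional, processing method: all_images, baidu_pan_links, whole_page
  -sleep SLEEP    Sleep time between pages; default 1, i.e. 1 second per page.
  -pipe PIPE      When method is pipe, this sets the pipe chain, e.g. "pipe_example,
                  pipe_example1, pipe_example2, pipe_example3"
  -pl PAGE_LIMIT  Maximum number of pages per run; paging the same account too often
                  gets you banned. 0: no paging, only process todo.list; <0 (default):
                  no limit; >0: number of pages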
```
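A custom pipe is a Python module that exposes a `crawl(arti_url, arti_dir)` function, as `pipe_example.py` shows. The module below is a minimal sketch of one; the module name `my_pipe` and the `links.txt` file name are assumptions made for illustration, and the function only records the link instead of downloading anything.

```python
# my_pipe.py (hypothetical module name)
import os


def crawl(arti_url, arti_dir):
    """Append the article link to <arti_dir>/links.txt and report it as processed."""
    os.makedirs(arti_dir, exist_ok=True)
    with open(os.path.join(arti_dir, "links.txt"), "a", encoding="utf-8") as f:
        f.write(arti_url + "\n")
    # The return value is the list of URLs this pipe finished processing.
    return [arti_url]
```

With the module placed next to `wxhub.py`, it would be selected with `-method pipe -pipe my_pipe`.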
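Caching is supported. The following are currently cached:
- User cookies. --> cookies.json
- Links of articles that have already been crawled. --> arti.cache.list
- Links that have already been downloaded. --> url.cache.list
To download everything again, delete the corresponding file.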
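Both cache files are plain text with one link per line. The following is a small sketch of that idea under stated assumptions: the helper names `load_cache` and `append_cache` are hypothetical, and a set is used here where the script keeps a dict.

```python
import os


def load_cache(path):
    """Read a one-link-per-line cache file; a missing file means an empty cache."""
    if not os.path.isfile(path):
        return set()
    with open(path, "rt", encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def append_cache(path, links):
    """Append links to the cache file, skipping blank entries."""
    with open(path, "a", encoding="utf-8") as f:
        for link in links:
            link = link.strip()
            if link:
                f.write(link + "\n")
```

Deleting the file simply makes `load_cache` return an empty set again, which is why removing a cache file forces a full re-download.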
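### Known issues
- In some cases, once the session stored in the cookies expires, the script fails with a "failed to fetch the page" error. (Deleting cookies.json fixes this.)
- A "searching too frequently" message probably means that WeChat applies anti-scraping limits to the search API. The current workaround is to delete cookies.json and log in with another account, or to wait a few hours. (A future plan is to cache all links first and then crawl them one by one...)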
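Both failures show up in the `base_resp` object of the JSON that the search API returns, which is what the `BaseResp` class in `wxhub.py` reads. Here is a hedged sketch of checking it; the function name `check_base_resp` is hypothetical and the sample payloads are made up for illustration, not real server responses.

```python
import json


def check_base_resp(body):
    """Return (ok, err_msg) from a JSON body that carries a base_resp object."""
    data = json.loads(body)
    base = data.get("base_resp", {})
    # ret == 0 means success; a missing ret is treated as a failure.
    return base.get("ret", -1) == 0, base.get("err_msg", "")


if __name__ == "__main__":
    # Made-up sample payload.
    print(check_base_resp('{"base_resp": {"ret": 0, "err_msg": "ok"}}'))
```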
================================================
FILE: chromedriver
================================================
[File too large to display: 11.7 MB]
================================================
FILE: pipe_example.py
================================================
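# -*- coding: utf-8 -*-
'''
Example of an extension (pipe) processing script.
'''
def crawl(arti_url, arti_dir):
    '''
    Required function; called for every article link that is processed.
    arti_url: link of the Official Account article.
    arti_dir: directory used to store the current account's output.
    Returns: a list holding every URL that was processed.
    '''
    return []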
================================================
FILE: requirements.txt
================================================
selenium==3.14.0
requests==2.18.4
beautifulsoup4==4.6.3
pyquery==1.4.0
================================================
FILE: wxhub.py
================================================
# -*- coding: utf-8 -*-
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import requests
import re
import shutil
import os
import json
import argparse
import traceback
import random
import math
import codecs
class Input:
    fake_name = ""  # e.g. "影想"
    out_dir = "output"
    # Supported crawl methods: all_images, baidu_pan_links, whole_page, pipe
    crawl_method = "all_images"
    url_cache = {}
    arti_cache = {}
    page_sleep = 1
    page_limit = -1
    args = {}
    custom_pipe = []
class Session:
    token = ''
    cookies = []
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'}
class Urls:
    index = 'https://mp.weixin.qq.com'
    editor = 'https://mp.weixin.qq.com/cgi-bin/appmsg?t=media/appmsg_edit&action=edit&type=10&isMul=1&isNew=1&share=1&lang=zh_CN&token={token}'
    query_biz = 'https://mp.weixin.qq.com/cgi-bin/searchbiz?action=search_biz&token={token}&lang=zh_CN&f=json&ajax=1&random={random}&query={query}&begin={begin}&count={count}'
    query_arti = 'https://mp.weixin.qq.com/cgi-bin/appmsg?token={token}&lang=zh_CN&f=json&%E2%80%A65&action=list_ex&begin={begin}&count={count}&query={query}&fakeid={fakeid}&type=9'
class BaseResp:
    def __init__(self, sjson):
        self.data = json.loads(sjson)
        self.base_resp = self.data['base_resp']
    @property
    def ret(self):
        return self.base_resp['ret']
    @property
    def err_msg(self):
        return self.base_resp['err_msg']
    @property
    def is_ok(self):
        return self.base_resp['ret'] == 0
class FakesResp(BaseResp):
    def __init__(self, sjson):
        super(FakesResp, self).__init__(sjson)
        # An error response carries no result fields, so fall back to empty values.
        self.list = self.data['list'] if self.is_ok else []
        self.total = self.data['total'] if self.is_ok else 0
    @property
    def count(self):
        return len(self.list)
class ArtisResp(BaseResp):
    def __init__(self, sjson):
        super(ArtisResp, self).__init__(sjson)
        self.list = self.data['app_msg_list'] if self.is_ok else []
        self.total = self.data['app_msg_cnt'] if self.is_ok else 0
    @property
    def count(self):
        return len(self.list)
def execute_times(driver, times):
    # Scroll to the bottom of the page repeatedly, pausing between scrolls.
    for i in range(times + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)
def login(driver):
    pass
def read_url_set():
    ret = {}
    fn = os.path.join('output', '__urls.json')
    if os.path.isdir('output') and os.path.isfile(fn):
        with open(fn, 'rt') as f:
            ret = json.load(f)
    return ret
def write_url_set(urls):
    fn = os.path.join('output', '__urls.json')
    os.makedirs('output', exist_ok=True)
    with open(fn, 'wb') as f:
        f.write(json.dumps(urls).encode('utf-8'))
def set_cookies(driver, cookies):
    # Share the browser cookies with the requests-based calls.
    Session.cookies = {}
    for item in cookies:
        driver.add_cookie(item)
        Session.cookies[item['name']] = item['value']
def download(url, sname):
    # Try up to three times before giving up.
    for i in range(0, 3):
        result = requests.get(url, headers=Session.headers, stream=True)
        if result.status_code == 200:
            with open(sname, 'wb') as f:
                for chunk in result.iter_content(1024):
                    f.write(chunk)
            return True
    print(f"Error download:{url}")
    return False
def pipe_fakes(fake_name):
    '''Search accounts by name and let the user pick one interactively.'''
    begin = 0
    count = 5
    while True:
        rep = requests.get(Urls.query_biz.format(random=random.random(), token=Session.token, query=fake_name, begin=begin, count=count), cookies=Session.cookies, headers=Session.headers)
        fakes = FakesResp(rep.text)
        if not fakes.is_ok:
            break
        for i, it in enumerate(fakes.list):
            print(f"{i}) {it['nickname']}")
        while True:
            ic = input("Enter a number to select an account, or n for the next page: ").strip()
            if ic.lower() == 'n':
                break
            try:
                if 0 <= int(ic) < len(fakes.list):
                    break
            except ValueError:
                pass
            print("Invalid input, please try again!")
        if ic.lower() == 'n':
            begin = begin + fakes.count
            continue
        return fakes.list[int(ic)]
    return None
def pipe_articles(fakeid, query=''):
    TIME_SLEEP = Input.page_sleep
    todo = load_todo_list(Input.fake_name)
    if not todo:
        todo['data'] = {}
    data = todo['data']
    # mask holds one flag per result page: '1' = processed, '0' = still to fetch.
    mask = list(todo['__mask'] if '__mask' in todo else '')
    last_total = todo['__total_cnt'] if '__total_cnt' in todo else 0
    begin = 0
    pagesize = 5
    last_total_page = math.ceil(last_total / pagesize)
    # Start from the saved totals so a failed or skipped query does not reset them.
    total = last_total
    total_page = last_total_page
    page_limit = Input.page_limit
    rep = requests.get(Urls.query_arti.format(token=Session.token, fakeid=fakeid, begin=begin, count=pagesize, query=query), cookies=Session.cookies, headers=Session.headers)
    artis = ArtisResp(rep.text)
    if not artis.ret and page_limit:
        total = artis.total
        total_page = math.ceil(total / pagesize)
        if total_page > last_total_page:
            mask = (total_page - last_total_page) * ['0'] + mask
        if artis.list and mask and artis.list[0]['link'] not in data:
            mask[0] = '0'  # There is a new article: reset the first page.
        print(f"Collecting all links: found {artis.total} articles, {total_page} pages to go through, please wait ...")
        # When the current page is flagged '0', the next page must be checked as well.
        for i in range(0, len(mask)):
            if not page_limit:
                break
            if mask[i] == '1':
                continue
            print(f"Processing page {i} ...")
            time.sleep(TIME_SLEEP)
            rep = requests.get(Urls.query_arti.format(token=Session.token, fakeid=fakeid, begin=i * pagesize, count=pagesize, query=query), cookies=Session.cookies, headers=Session.headers)
            artis = ArtisResp(rep.text)
            if artis.ret:
                break
            flag = True
            for it in artis.list:
                link = it['link']
                if link in data:
                    continue
                flag = False
                data[link] = it
            mask[i] = '1'
            # New links were found on this page: force a check of the next page.
            if not flag and i < len(mask) - 1:
                mask[i + 1] = '0'
            # Count this page against the limit.
            page_limit -= 1
    elif artis.ret:
        print(f"Search request failed: {artis.ret} {artis.err_msg}")
    curr_searched = sum(map(lambda x: 1 if x == '1' else 0, mask))
    # if not total:
    #     raise Exception('No articles found, or the API is blocking the scraper. Delete cookies.json and retry in a few minutes, or try another account.')
    print(f"This search found {total_page} pages of articles, {curr_searched} pages processed, todo.list holds {len(data)} article links in total ...")
    todo['__total_cnt'] = total
    todo['__mask'] = ''.join(mask)
    save_todo_list(Input.fake_name, todo)
    cnt = 0
    for url, arti_info in data.items():
        if url in Input.arti_cache:
            continue
        print(f"{arti_info['title']} --> {url}")
        if pipe_crawl_articles(arti_info):
            cnt += 1
            append_arti_cache(url)
    print(f" Processed {cnt} article links in this run!")
def verfy_arti_content(html):
    '''Return (valid, message); the page is invalid when it contains an error block.'''
    if not html:
        return False, "failed to fetch from the server"
    pat = re.compile(r'<div class="page_msg')
    if not pat.search(html):
        return True, ""
    pat = re.compile(r'<div class="global_error_msg.*?">(.*?)</div', re.MULTILINE | re.DOTALL)
    ms = pat.findall(html)
    if ms:
        return False, ms[0].strip()
    return False, "the server returned an unknown error"
def crawl_all_images(url, sdir, url_cache, html=None):
    pat = re.compile(r'src="(https://.*?)"')
    # Capture only the format name, not any query parameters that follow it.
    pat2 = re.compile(r'wx_fmt=(\w+)')
    urls = []
    try:
        if not html:
            rep = requests.get(url, cookies=Session.cookies, headers=Session.headers)
            html = rep.text
        mats = pat.findall(html)
        idx = 0
        for m in mats:
            if m in url_cache:
                continue
            pps = pat2.findall(m)
            postfix = pps[0] if pps else 'jpg'
            download(m, os.path.join(sdir, f"{idx}.{postfix}"))
            urls.append(m)
            idx += 1
        append_url_cache(urls)
        return True
    except Exception:
        print(f"failed crawl images from url:{url}")
        sg = traceback.format_exc()
        print(sg)
        return False
def crawl_baidu_pan_link(url, sdir, url_cache):
    # Matches "链接: <Baidu Pan URL> ... 提取码: <4-character code>", with an ASCII or full-width colon.
    pat = re.compile(r'链接\s*[::]\s*(https://pan\.baidu\.com/.*?)提取码\s*[::]\s*(....)')
    try:
        urls = []
        rep = requests.get(url, cookies=Session.cookies, headers=Session.headers)
        html = rep.text
        mats = pat.findall(html)
        if not mats:
            return False
        with open("baidu.pan.links.txt", "a") as myfile:
            for uus in mats:
                uu = uus[0]
                if not uu or uu in Input.url_cache:
                    continue
                pwd = uus[1]
                myfile.write(f"{uu} => {pwd}\n")
                Input.url_cache[uu] = True
                urls.append(uu)
        append_url_cache(urls)
        return True
    except Exception:
        print(f"failed crawl links from url:{url}")
        sg = traceback.format_exc()
        print(sg)
        return False
def crawl_whole_page(url, sdir, url_cache):
    try:
        rep = requests.get(url, cookies=Session.cookies, headers=Session.headers)
        if rep.status_code != 200:
            return False
        html = rep.text
        valid, msg = verfy_arti_content(html)
        if not valid:
            raise Exception(f"Failed to save the page: {msg}")
        os.makedirs(sdir, exist_ok=True)
        with codecs.open(os.path.join(sdir, 'index.html'), "w", 'utf-8') as f:
            f.write(html)
            f.flush()
        # Reuse the downloaded HTML to fetch the images it references.
        return crawl_all_images(url, sdir, Input.url_cache, html=html)
    except Exception:
        print(f"failed crawl page from url:{url}")
        sg = traceback.format_exc()
        print(sg)
        return False
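def crawl_by_custom_pipe(url, sdir, url_cache):
    if not Input.custom_pipe:
        # Import every module named in the comma-separated -pipe argument once.
        sps = (Input.args.pipe if Input.args.pipe else '').split(',')
        for sp in sps:
            if sp.strip():
                Input.custom_pipe.append(__import__(sp.strip()))
    done = []
    for p in Input.custom_pipe:
        # Each pipe returns the list of URLs it finished processing (or nothing).
        for done_url in p.crawl(url, sdir) or []:
            url_cache[done_url] = True
            done.append(done_url)
    return bool(done)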
def pipe_crawl_articles(arti_info):
    # Drop characters that are unsafe in directory names (ASCII and full-width colon included).
    title_4_dir = re.sub(r'[:: /|<>?"]', '', arti_info['title'])
    sdir = os.path.join(Input.out_dir, Input.fake_name, title_4_dir)
    if not os.path.exists(sdir):
        os.makedirs(sdir, exist_ok=True)
    if Input.crawl_method == 'all_images':
        return crawl_all_images(arti_info['link'], sdir, Input.url_cache)
    elif Input.crawl_method == 'baidu_pan_links':
        return crawl_baidu_pan_link(arti_info['link'], sdir, Input.url_cache)
    elif Input.crawl_method == 'whole_page':
        return crawl_whole_page(arti_info['link'], sdir, Input.url_cache)
    elif Input.crawl_method == 'pipe':
        return crawl_by_custom_pipe(arti_info['link'], sdir, Input.url_cache)
    print(f"Unknown method: {Input.crawl_method}")
    return False
def pipe():
    # Query the account.
    fake_info = pipe_fakes(Input.fake_name)
    if not fake_info:
        raise Exception(f"Can not query fakes with input:{Input.fake_name}")
    # Query its articles.
    fakeid = fake_info['fakeid']
    pipe_articles(fakeid)
    input("pipe continue:")
def process_input():
    '''Load the article cache and the URL cache from disk.'''
    Input.arti_cache = {}
    ac = 'arti.cache.list'
    if os.path.isfile(ac):
        with open(ac, 'rt') as fi:
            for line in fi:
                if line.strip():
                    Input.arti_cache[line.strip()] = True
    uc = 'url.cache.list'
    if os.path.isfile(uc):
        with open(uc, 'rt') as fi:
            for line in fi:
                if line.strip():
                    Input.url_cache[line.strip()] = True
def append_arti_cache(arti_link):
    arti_link = arti_link.strip()
    if not arti_link:
        return
    ac = 'arti.cache.list'
    with open(ac, "a") as myfile:
        myfile.write(f"{arti_link}\n")
    Input.arti_cache[arti_link] = True
def append_url_cache(urls):
    ac = 'url.cache.list'
    with open(ac, "a") as myfile:
        for url in urls:
            url = url.strip()
            if not url:
                continue
            myfile.write(f"{url}\n")
            Input.url_cache[url] = True
def load_todo_list(key):
    fn = os.path.join('output', key, "todo.list")
    if os.path.isfile(fn):
        with open(fn, 'rb') as fi:
            return json.load(fi)
    return {}
def save_todo_list(key, dic):
    if not dic:
        return
    fn = os.path.join('output', key, "todo.list")
    os.makedirs(os.path.dirname(fn), exist_ok=True)
    # Use a context manager so the file is flushed and closed.
    with open(fn, 'wb') as fo:
        fo.write(json.dumps(dic).encode('utf-8'))
def main(chrome):
    if not chrome:
        if os.path.isfile('chromedriver'):
            # Use an absolute path so the driver next to the script is found without relying on PATH.
            chrome = os.path.abspath('chromedriver')
        else:
            chrome = input('Enter the chromedriver path: ').strip()
    driver = webdriver.Chrome(executable_path=chrome)
    # Cookies expire; after logging in again they have to be fetched anew.
    cookies = []
    if os.path.isfile('cookies.json'):
        with open('cookies.json', 'rb') as f:
            cookies = json.load(f)
    driver.get(Urls.index)
    if not cookies:
        input("Please log in manually first, then press Enter to continue: ")
        cookies = driver.get_cookies()
        with open('cookies.json', 'wb') as f:
            f.write(json.dumps(cookies).encode('utf-8'))
    set_cookies(driver, cookies)
    driver.get(Urls.index)
    url = driver.current_url
    if 'token' not in url:
        raise Exception("Failed to fetch the page!")
    Session.token = re.findall(r'token=(\w+)', url)[0]
    process_input()
    pipe()
# def test():
#     Input.fake_name = '大J小D'
#     Input.crawl_method = 'baidu_pan_links'
#     main(None)
if __name__ == '__main__':
    # test()
    description = "WeChat Official Account articles, all taken care of"
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument('-biz', dest='biz', type=str, help='Required: name of the Official Account', required=True)
    parser.add_argument('-chrome', dest='chrome', type=str, help='Optional: path to chromedriver; defaults to the chromedriver next to the script')
    parser.add_argument('-arti', dest='arti', type=str, help='Optional: article title; all articles are processed by default')
    parser.add_argument('-method', dest='method', type=str, help='Optional, processing method: all_images, baidu_pan_links, whole_page')
    parser.add_argument('-sleep', dest='sleep', type=str, help='Sleep time between pages; default 1, i.e. 1 second per page.')
    parser.add_argument('-pipe', dest='pipe', type=str, help='When method is pipe, this sets the pipe chain, e.g. "pipe_example, pipe_example1, pipe_example2, pipe_example3"')
    parser.add_argument('-pl', dest='page_limit', type=str, help='Maximum number of pages per run; paging the same account too often gets you banned. 0: no paging, only process todo.list; <0 (default): no limit; >0: number of pages')
    Input.args = parser.parse_args()
    Input.fake_name = Input.args.biz
    Input.crawl_method = Input.args.method if Input.args.method else 'all_images'
    Input.page_sleep = int(Input.args.sleep) if Input.args.sleep else 1
    Input.page_limit = int(Input.args.page_limit) if Input.args.page_limit else -1
    main(Input.args.chrome)
SYMBOL INDEX (36 symbols across 2 files)
FILE: pipe_example.py
function crawl (line 6) | def crawl(arti_url, arti_dir):
FILE: wxhub.py
class Input (line 16) | class Input:
class Session (line 33) | class Session:
class Urls (line 38) | class Urls:
class BaseResp (line 44) | class BaseResp:
method __init__ (line 45) | def __init__(self, sjson):
method ret (line 50) | def ret(self):
method err_msg (line 54) | def err_msg(self):
method is_ok (line 58) | def is_ok(self):
class FakesResp (line 62) | class FakesResp(BaseResp):
method __init__ (line 64) | def __init__(self, sjson):
method count (line 70) | def count(self):
class ArtisResp (line 74) | class ArtisResp(BaseResp):
method __init__ (line 76) | def __init__(self, sjson):
method count (line 82) | def count(self):
function execute_times (line 86) | def execute_times(driver, times):
function login (line 91) | def login(driver):
function read_url_set (line 94) | def read_url_set():
function write_url_set (line 102) | def write_url_set(urls):
function set_cookies (line 109) | def set_cookies(driver, cookies):
function download (line 116) | def download(url, sname):
function pipe_fakes (line 129) | def pipe_fakes(fake_name):
function pipe_articles (line 158) | def pipe_articles(fakeid, query=''):
function verfy_arti_content (line 237) | def verfy_arti_content(html):
function crawl_all_images (line 249) | def crawl_all_images(url, sdir, url_cache, html=None):
function crawl_baidu_pan_link (line 280) | def crawl_baidu_pan_link(url, sdir, url_cache):
function crawl_whole_page (line 306) | def crawl_whole_page(url, sdir, url_cache):
function crawl_by_custom_pipe (line 327) | def crawl_by_custom_pipe(url, sdir, url_cache):
function pipe_crawl_articles (line 342) | def pipe_crawl_articles(arti_info):
function pipe (line 356) | def pipe():
function process_input (line 368) | def process_input():
function append_arti_cache (line 387) | def append_arti_cache(arti_link):
function append_url_cache (line 396) | def append_url_cache(urls):
function load_todo_list (line 406) | def load_todo_list(key):
function save_todo_list (line 413) | def save_todo_list(key, dic):
function main (line 420) | def main(chrome):