[
  {
    "path": ".coveragerc",
    "content": "[run]\ninclude =\n    tests/*\nomit =\n    __init__.py\n\n[report]\nexclude_lines =\n    pragma: no cover\n    def __repr__\n    def __str__\n    if self.debug:\n    if settings.DEBUG\n    except ImportError\n    raise AssertionError\n    raise NotImplementedError\n    if 0:\n    if __name__ == .__main__.:\n"
  },
  {
    "path": ".gitignore",
    "content": "# Created by .ignore support plugin (hsz.mobi)\n\n*.py[cod]\n*.env\n.idea\n.DS_Store\n\nlogs/*\n!logs/index.html\n\n.coverage\nhtmlcov/\n.coveralls.yml\n\ncsv/*\n\n\n# toutiao\nnews/spiders/toutiao.py\ntools/toutiao.py\n\n# middlewares\nnews/middlewares/httpproxy_vps.py\n\n# config\n#config/*\n#!config/__init__.py\n#!config/default.py\n#\n#env_*.sh\n#!env_default.sh\n\n\n# gitbook\ndocs/_book/*\ndocs/node_modules\ndocs/package-lock.json\n"
  },
  {
    "path": ".travis.yml",
    "content": "sudo: no\ndist: trusty\nlanguage: python\npython:\n  - \"2.7\"\n  - \"3.6\"\n# command to install dependencies\ninstall:\n  - if [[ $TRAVIS_PYTHON_VERSION == 2.7 ]]; then pip install -r requirements-py2.txt; fi\n  - if [[ $TRAVIS_PYTHON_VERSION == 3.6 ]]; then pip install -r requirements-py3.txt; fi\n  - pip install coveralls\n  - pip install pyyaml\n# command to run tests\nscript:\n  - export PYTHONPATH=${PWD}\n  - coverage run -a tests/test_date_time.py\n  - coverage run -a tests/test_finger.py\n  - coverage report\nafter_success:\n# upload test report\n  - coveralls\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2018 碎ping子\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "## 新闻抓取\n\n[![Build Status](https://travis-ci.org/zhanghe06/news_spider.svg?branch=master)](https://travis-ci.org/zhanghe06/news_spider)\n[![Coverage Status](https://coveralls.io/repos/github/zhanghe06/news_spider/badge.svg?branch=master)](https://coveralls.io/github/zhanghe06/news_spider?branch=master)\n\n### 项目演示\n\n服务依赖:\n- MariaDB\n- Redis\n- NodeJS\n\n本项目依赖第三方验证码识别服务\n\n更新配置 config/default.py 用户名和密码\n```\nRK_CONFIG = {\n    'username': '******',\n    'password': '******',\n    'soft_id': '93676',\n    'soft_key': '5d0e00b196c244cb9d8413809c62f9d5',\n}\n\n# 斐斐打码\nFF_CONFIG = {\n    'pd_id': '******',\n    'pd_key': '******',\n    'app_id': '312451',\n    'app_key': '5YuN+6isLserKBZti4hoaI6UR2N5UT2j',\n}\n```\n\n```bash\n# python2\nvirtualenv news_spider.env              # 创建虚拟环境\n# python3\nvirtualenv news_spider.env -p python3   # 创建虚拟环境\n\nsource env_default.sh               # 激活虚拟环境\npip install -r requirements-py2.txt # 安装环境依赖\n# 开发环境 模拟单次抓取\npython tasks/job_put_tasks.py wx    # 初次创建任务\npython tasks/jobs_sogou.py          # 初次应对反爬\nscrapy crawl weixin                 # 开启微信爬虫\n# 生产环境 开启持续抓取\nsupervisord                         # 开启守护进程\nsupervisorctl start all             # 开启工作进程\n```\n\n- env_develop.sh   # 开发环境\n- env_product.sh   # 生产环境\n\n### 项目创建过程记录\n\n项目依赖明细\n```bash\npip install requests\npip install scrapy\npip install sqlalchemy\npip install mysqlclient\npip install sqlacodegen==1.1.6  # 注意: 最新版 sqlacodegen==2.0 有bug\npip install redis\npip install PyExecJS\npip install Pillow\npip install psutil\npip install schedule\npip install future          # 兼容py2、py3\npip install supervisor      # 当前主版本3只支持py2，将来主版本4(未发布)会支持py3\n```\n因当前`supervisor`不支持`python3`，故在`requirements.txt`中将其去掉\n\n由于任务调度`apscheduler`不支持Py3（其中的依赖`futures`不支持），这里采用`schedule`\n\n`scrapy`的依赖`cryptography`在`2.2.2`版本中有[安全性问题](https://nvd.nist.gov/vuln/detail/CVE-2018-10903), 强烈建议更新至`2.3`及以上版本, 可以通过更新`scrapy`的方式升级\n\n`scrapy`的依赖`parsel`使用了`functools`的`lru_cache`方法（ 
python2 是`functools32`的`lru_cache`方法；`functools32`是`functools`的反向移植）\n\n\nMac 系统环境依赖（mariadb）\n```bash\nbrew unlink mariadb\nbrew install mariadb-connector-c\nln -s /usr/local/opt/mariadb-connector-c/bin/mariadb_config /usr/local/bin/mysql_config\n# pip install MySQL-python\npip install mysqlclient  # 基于 MySQL-python 兼容py2、py3\nrm /usr/local/bin/mysql_config\nbrew unlink mariadb-connector-c\nbrew link mariadb\n```\n\nCentOS 系统环境依赖\n```bash\nyum install gcc\nyum install mysql-devel\nyum install python-devel\nyum install epel-release\nyum install redis\nyum install nodejs\n```\n\nCentOS 安装 python3 环境（CentOS 默认是不带 python3 的）\n```bash\nyum install python34\nyum install python34-devel\n```\n\nCentOS 安装 pip & virtualenv & git & vim\n```bash\nyum install python-pip\npip install --upgrade pip\npip install virtualenv\nyum install git\nyum install vim\n```\n\n创建项目\n```bash\nscrapy startproject news .\nscrapy genspider weixin mp.weixin.qq.com\n```\n\n启动蜘蛛\n```bash\nscrapy crawl weixin\n```\n\n如需测试微博, 修改以下方法, 更改正确用户名和密码\n\ntools/weibo.py\n```\ndef get_login_data():\n    return {\n        'username': '******',\n        'password': '******'\n    }\n```\n\n### 蜘蛛调试（以微博为例）\n1. 清除中间件去重缓存, 重置调试任务\n```\n127.0.0.1:6379> DEL \"dup:weibo:0\"\n(integer) 1\n127.0.0.1:6379> DEL \"scrapy:tasks_set:weibo\"\n(integer) 1\n127.0.0.1:6379> SADD \"scrapy:tasks_set:weibo\" 130\n(integer) 1\n127.0.0.1:6379>\n```\n2. 清除调试蜘蛛存储数据\n```mysql\nDELETE FROM fetch_result WHERE platform_id=2;\n```\n3. 启动调试蜘蛛\n```bash\nscrapy crawl weibo\n```\n\n\n### 验证码识别\n\n~~http://www.ruokuai.com/~~\n\n~~http://wiki.ruokuai.com/~~\n\n~~价格类型:~~\n~~http://www.ruokuai.com/home/pricetype~~\n\n热心网友反映`若快`已经关闭, 接下来会支持`斐斐打码`, 敬请期待\n\n斐斐打码开发文档 [http://docs.fateadm.com](http://docs.fateadm.com)\n\n\n### 索引说明\n\n联合索引, 注意顺序, 同时注意查询条件字段类型需要与索引字段类型一致\n\n实测, 数据量8万记录以上, 如果没有命中索引, 查询会很痛苦\n\n\n### 项目说明\n\n亮点:\n\n1. 支持分布式, 每个蜘蛛抓取进程对应一个独立的抓取任务\n2. 
采用订阅发布模型的观察者模式, 处理并发场景的验证码识别任务, 避免无效的识别\n\n备注: `mysql`中`text`最大长度为65,535(2的16次方–1)\n\n类型 | 表达式 | 最大字节长度（bytes） | 大致容量\n---: | ---: | ---: | ---:\nTinyText | 2的8次方–1 | 255 | 255B\nText | 2的16次方–1 | 65,535 | 64KB\nMediumText | 2的24次方–1 | 16,777,215 | 16MB\nLongText | 2的32次方–1 | 4,294,967,295 | 4GB\n\n由于微信公众号文章标签过多, 长度超过`Text`的最大值, 故建议采用`MediumText`\n\n\n### 特别说明\n\n头条请求签名\n- M端需要2个参数: as、cp\n- PC端需要3个参数: as、cp、_signature\n\nM端2个参数获取方法已公开, 参考蜘蛛 toutiao_m\n\n~~PC端3个参数获取方法已破解, 由于公开之后会引起头条反爬机制更新, 故没有公开, 如有需要, 敬请私聊, 仅供学习, 谢绝商用~~\n\n因M端已满足数据获取要求, 不再开源PC端签名破解\n\n\n### TODO\n\n微博反爬处理\n"
  },
  {
    "path": "apps/__init__.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: __init__.py.py\n@time: 2018-02-10 17:33\n\"\"\"\n\n\ndef func():\n    pass\n\n\nclass Main(object):\n    def __init__(self):\n        pass\n\n\nif __name__ == '__main__':\n    pass\n"
  },
  {
    "path": "apps/client_db.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: client_db.py\n@time: 2018-02-10 17:34\n\"\"\"\n\n\nfrom sqlalchemy import create_engine\nfrom sqlalchemy import distinct\nfrom sqlalchemy import func\nfrom sqlalchemy.orm import sessionmaker\nimport redis\n\nfrom config import current_config\n\nSQLALCHEMY_DATABASE_URI_MYSQL = current_config.SQLALCHEMY_DATABASE_URI_MYSQL\nSQLALCHEMY_POOL_SIZE = current_config.SQLALCHEMY_POOL_SIZE\nREDIS = current_config.REDIS\n\n\nengine_mysql = create_engine(SQLALCHEMY_DATABASE_URI_MYSQL, pool_size=SQLALCHEMY_POOL_SIZE, max_overflow=0)\ndb_session_mysql = sessionmaker(bind=engine_mysql, autocommit=True)\n\n\nredis_client = redis.Redis(**REDIS)\n\n\ndef get_item(model_class, pk_id):\n    session = db_session_mysql()\n    try:\n        result = session.query(model_class).get(pk_id)\n        return result\n    finally:\n        session.close()\n\n\ndef get_all(model_class, *args, **kwargs):\n    session = db_session_mysql()\n    try:\n        result = session.query(model_class).filter(*args).filter_by(**kwargs).all()\n        return result\n    finally:\n        session.close()\n\n\ndef get_distinct(model_class, field, *args, **kwargs):\n    session = db_session_mysql()\n    try:\n        result = session.query(distinct(getattr(model_class, field)).label(field)).filter(*args).filter_by(**kwargs).all()\n        return result\n    finally:\n        session.close()\n\n\ndef get_group(model_class, field, min_count=0, *args, **kwargs):\n    field_obj = getattr(model_class, field)\n    session = db_session_mysql()\n    try:\n        result = session.query(field_obj, func.count(field_obj).label('c')).filter(*args).filter_by(\n            **kwargs).group_by(field_obj).having(func.count(field_obj) >= min_count).all()\n        return result\n    finally:\n        session.close()\n\n\ndef add_item(model_class, data):\n    session = db_session_mysql()\n    try:\n        ret = 
model_class(**data)\n        session.add(ret)\n        # flush manually so the generated id is available to return\n        session.flush()\n        return ret.id\n    finally:\n        session.close()\n"
  },
  {
    "path": "apps/client_rk.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: client_rk.py\n@time: 2018-02-10 17:34\n\"\"\"\n\n\nfrom libs.rk import RKClient\nfrom libs.counter import CounterClient\nfrom apps.client_db import redis_client\nfrom tools.cookies import len_cookies\n\nfrom config import current_config\n\nRK_CONFIG = current_config.RK_CONFIG\nBASE_DIR = current_config.BASE_DIR\nRK_LIMIT_COUNT_DAILY = current_config.RK_LIMIT_COUNT_DAILY\nCOOKIES_QUEUE_COUNT = current_config.COOKIES_QUEUE_COUNT\n\nrc_client = RKClient(**RK_CONFIG)\n\nrk_counter_client = CounterClient(redis_client, 'rk')\n\n# 正常图形验证码\n# 'im_type_id': 1000     # 任意长度数字\n# 'im_type_id': 2000     # 任意长度字母\n# 'im_type_id': 3000     # 任意长度英数混合\n# 'im_type_id': 4000     # 任意长度汉字\n# 'im_type_id': 5000     # 任意长度中英数三混\n\n\ndef get_img_code(im, im_type_id):\n    \"\"\"\n    获取验证码\n    :param im:\n    :param im_type_id:\n    :return:\n    \"\"\"\n    rc_result = rc_client.rk_create(im, im_type_id)\n    print(rc_result)\n    if 'Error_Code' in rc_result:\n        print(rc_result.get('Error'))\n        return None, None\n    # {u'Result': u'6dx2t8', u'Id': u'c8a897f0-9825-41a1-b19e-6195ba8559ed'}\n    return rc_result['Id'], rc_result['Result']\n\n\ndef img_report_error(im_id):\n    rc_client.rk_report_error(im_id)\n\n\ndef check_counter_limit():\n    \"\"\"\n    检查是否超过限制（True: 没有超过; False: 超过限制）\n    :return:\n    \"\"\"\n    rk_counter = rk_counter_client.get()\n    return rk_counter < RK_LIMIT_COUNT_DAILY\n\n\ndef check_cookies_count(spider_name):\n    \"\"\"\n    检查 cookies 长度是否达到要求（True: 没有达到; False: 达到要求）\n    :param spider_name:\n    :return:\n    \"\"\"\n    return len_cookies(spider_name) < COOKIES_QUEUE_COUNT\n\n\ndef counter_clear():\n    \"\"\"\n    计数器清零（每天0点）\n    :return:\n    \"\"\"\n    rk_counter_client.clear()\n"
  },
  {
    "path": "config/__init__.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: __init__.py\n@time: 2018-02-10 15:02\n\"\"\"\n\nfrom __future__ import unicode_literals\nfrom __future__ import print_function\n\nimport os\nfrom importlib import import_module\n\n\nMODE = os.environ.get('MODE') or 'default'\n\n\ntry:\n    current_config = import_module('config.' + MODE)\n    print('[√] 当前环境变量: %s' % MODE)\nexcept ImportError:\n    print('[!] 配置错误，请初始化环境变量')\n    print('source env_develop.sh  # 开发环境')\n    print('source env_product.sh  # 生产环境')\n"
  },
  {
    "path": "config/default.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: default.py\n@time: 2018-07-02 17:57\n\"\"\"\n\n\nfrom __future__ import print_function\nfrom __future__ import unicode_literals\n\nimport os\n\nBASE_DIR = os.path.dirname(os.path.dirname(__file__))\n\n# requests 超时设置\nREQUESTS_TIME_OUT = (30, 30)\n\nHOST_IP = '0.0.0.0'\n\n# 数据库 MySQL\nDB_MYSQL = {\n    'host': HOST_IP,\n    'user': 'root',\n    'passwd': '123456',\n    'port': 3306,\n    'db': 'news_spider'\n}\n\nSQLALCHEMY_DATABASE_URI_MYSQL = \\\n    'mysql+mysqldb://%s:%s@%s:%s/%s?charset=utf8' % \\\n    (DB_MYSQL['user'], DB_MYSQL['passwd'], DB_MYSQL['host'], DB_MYSQL['port'], DB_MYSQL['db'])\n\nSQLALCHEMY_POOL_SIZE = 5  # 默认 pool_size=5\n\n# 缓存，队列\nREDIS = {\n    'host': HOST_IP,\n    'port': 6379,\n    # 'password': '123456'  # redis-cli AUTH 123456\n}\n\n# 若快验证码识别\nRK_CONFIG = {\n    'username': '******',\n    'password': '******',\n    'soft_id': '93676',\n    'soft_key': '5d0e00b196c244cb9d8413809c62f9d5',\n}\n\n# 斐斐打码\nFF_CONFIG = {\n    'pd_id': '******',\n    'pd_key': '******',\n    'app_id': '312451',\n    'app_key': '5YuN+6isLserKBZti4hoaI6UR2N5UT2j',\n}\n\n# 熔断机制 每天请求限制（200元==500000快豆）\nRK_LIMIT_COUNT_DAILY = 925\n\n# 队列保留 cookies 数量\nCOOKIES_QUEUE_COUNT = 5\n\n# 分布式文件系统\nWEED_FS_URL = 'http://%s:9333' % HOST_IP\n\n# 优先级配置（深度优先）\nDEPTH_PRIORITY = 1\nPRIORITY_CONFIG = {\n    'list': 600,\n    'next': 500,\n    'detail': 800,\n}\n\n# 启动时间（启动时间之前的内容不抓取, 适用于新闻）\nSTART_TIME = '2018-01-01 00:00:00'\n"
  },
  {
    "path": "db/data/mysql.sql",
    "content": "USE news_spider;\n\n-- 插入用频道信息\nTRUNCATE TABLE `channel`;\nINSERT INTO `channel` VALUES (1, 'recommend', '推荐', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (2, 'hot', '热点', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (3, 'technology', '科技', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (4, 'social', '社会', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (5, 'entertainment', '娱乐', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (6, 'game', '游戏', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (7, 'sports', '体育', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (8, 'car', '汽车', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (9, 'finance', '财经', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (10, 'military', '军事', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (11, 'international', '国际', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (12, 'fashion', '时尚', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (13, 'travel', '旅游', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (14, 'explore', '探索', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (15, 'childcare', '育儿', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (16, 'health', '养生', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (17, 'article', '美文', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (18, 'history', '历史', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (19, 'food', '美食', '', '2017-11-20 10:00:00', 
'2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (20, 'education', '教育', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (21, 'electrical', '电气', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (22, 'machine', '机械', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\nINSERT INTO `channel` VALUES (23, 'medical', '医疗', '', '2017-11-20 10:00:00', '2017-11-20 10:00:00');\n\n-- 插入抓取任务信息\nTRUNCATE TABLE `fetch_task`;\nINSERT INTO `fetch_task` VALUES (11, 3, 0, '6555293927', '制造业那些事儿', '', 'http://m.toutiao.com/profile/6555293927/', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (12, 3, 0, '51555073058', '制造业福星高赵', '', 'http://m.toutiao.com/profile/51555073058/', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (13, 3, 0, '58075853770', 'AI汽车制造业', '', 'http://m.toutiao.com/profile/58075853770/', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (14, 3, 0, '51397533037', '制造业的云时代', '', 'http://m.toutiao.com/profile/51397533037/', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (15, 3, 0, '6157673577', '电器制造业大事件', '', 'http://m.toutiao.com/profile/6157673577/', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (16, 3, 0, '3810739482', '互联网扒皮王', '', 'http://m.toutiao.com/profile/3810739482/', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (17, 3, 0, '5347877887', '互联网智慧驿站', '', 'http://m.toutiao.com/profile/5347877887/', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (18, 1, 0, 'Root_Id', 'Website_Name', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (19, 1, 0, 'chuangbiandao', '创变岛', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (20, 1, 0, 
'changmaiw', '畅脉全球购', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (21, 1, 0, 'BizNext', '企鹅智酷', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (22, 1, 0, 'renhecom', '人和网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (23, 1, 0, 'rsqwyjs', '人生趣味研究所', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (24, 1, 0, 'shiyehome', '食业家', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (25, 1, 0, 'tyjzksp', '食品商', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (26, 1, 0, 'wisesale_lzzd', '联纵智达', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (27, 1, 0, 'sxlh002', '蓝海果业', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (28, 1, 0, 'huxiu_com', '虎嗅网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (29, 1, 0, 'HZKSXFPJLQ', '华中快速消费品经理群', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (30, 1, 0, 'kuaixiao999888', '经销商那些事儿', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (31, 1, 0, 'jingxiaoshang168', '经销商', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (32, 1, 0, 'fmcgchina', '快消品网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (33, 1, 0, 'FMCG-CLUB', '快速消费品精英俱乐部', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (34, 1, 0, 'tyjspb', '食品板', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (35, 1, 0, 'yxts518', '营销透视镜', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 
11:01:05');\nINSERT INTO `fetch_task` VALUES (36, 1, 0, 'salesman66', '营销人', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (37, 1, 0, 'cn-beverage', '饮料行业网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (38, 1, 0, 'youshudejiu', '有数酒业', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (39, 1, 0, 'i-yiou', '亿欧网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (40, 1, 0, 'CLFDA-001', '中国副食流通协会总监联盟', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (41, 1, 0, 'wbfood', '58食品网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (42, 1, 0, 'lanhaiyingxiao', '营销兵法', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (43, 1, 0, 'AutoMan-No1', 'AutoMan', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (44, 1, 0, 'leiphone-sz', '雷锋网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (45, 1, 0, 'coffeeO2O', '餐饮O2O', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (46, 1, 0, 'newso2o', '零售渠道观察', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (47, 1, 0, 'wwwcbocn', '化妆品财经在线', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (48, 1, 0, 'dushekeji', '毒舌科技', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (49, 1, 0, 'zgsppj', '新食品评介', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (50, 1, 0, 'foodinc', '小食代', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (51, 1, 0, 'lookforfoods', 
'食品饮料新零售内参', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (52, 1, 0, 'wow36kr', '36氪', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (53, 1, 0, 'food-gnosis', '食悟', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (54, 1, 0, 'newfortune', '新财富杂志', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (55, 1, 0, 'lp800315111', '快消家', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (56, 1, 0, 'tancaijing', '叶檀财经', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (57, 1, 0, 'yigejubaopen', '市井财经', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (58, 1, 0, 'njss02584195518', '工程机械微管家', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (59, 1, 0, 'jiajucy', '家具产业', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (60, 1, 0, 'chinafood365', '中国食品网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (61, 1, 0, 'dqjswol', '电气自动化控制网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (62, 1, 0, 'zgyybweixin', '中国医药报', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (63, 1, 0, 'fzfzzk', '纺织服装周刊', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (64, 1, 0, 'www-glass-com-cn', '中国玻璃网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (65, 1, 0, 'amdaily', '先进制造业', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (66, 1, 0, 'cmpzhizao', '制造业那些事儿', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 
11:01:05');\nINSERT INTO `fetch_task` VALUES (67, 1, 0, 'zhishexueshuquan', '知社学术圈', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (68, 1, 0, 'keyanquan', '科研圈', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (69, 1, 0, 'iccafe-sh', 'IC咖啡', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (70, 1, 0, 'robotmagazine', '机器人技术与应用', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (71, 1, 0, 'productronicaChina', '慕尼黑上海电子生产设备展', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (72, 1, 0, 'electronicaChina', 'e星球', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (73, 1, 0, 'feelingcar666', '飞灵汽车', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (74, 1, 0, 'depo88', '分布式能源', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (75, 1, 0, 'jianyuecheping', '建约车评', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (76, 1, 0, 'AECC-2016', '中国航发', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (77, 1, 0, 'mesbook', 'MES百科', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (78, 1, 0, 'mtmt-1951', '机床杂志社', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (79, 1, 0, 'AI_era', '新智元', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (80, 1, 0, 'ikanlixiang', '看理想', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (81, 1, 0, 'AVICESI', '中行伊萨', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (82, 1, 
0, 'www_51shape_com', '3D科学谷', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (83, 1, 0, 'i-zhoushuo', '周说', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (84, 1, 0, 'guoguo_innovation', '蝈蝈创新随笔', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (85, 1, 0, 'e-zhizao', 'e制造', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (86, 1, 0, 'RoboSpeak', '机器人大讲堂', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (87, 1, 0, 'The-Intellectual', '知识分子', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (88, 1, 0, 'sdr-china', '软件定义世界', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (89, 1, 0, 'wufutu5', '洞见', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (90, 1, 0, 'siid_2inno', '之新网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (91, 1, 0, 'e-works', '数字化企业', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (92, 1, 0, 'smr8700', '水木然', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (93, 1, 0, 'casic3s', '航天科工系统仿真科技', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (94, 1, 0, 'xiangxt1984', '向小田', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (95, 1, 0, 'gh_7157c03a9f49', '理深科技时评', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (96, 1, 0, 'gh_8189758efb1b', '国富资本熊焰', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (97, 1, 0, 'iscientists', '赛先生', '', '', 1, '', '2017-01-11 
11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (98, 1, 0, 'bjcppmp', '中国造纸杂志社', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (99, 1, 0, 'CPA-PAPER', '中国造纸协会', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (100, 1, 0, 'CTAPI-Paper', '中国造纸学会', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (101, 1, 0, 'zzcywd', '造纸产业', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (102, 1, 0, 'paperCEO', '造纸老板内刊', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (103, 1, 0, 'gh_28281e9f6cc4', '造纸助手', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (104, 1, 0, 'qgzzbwh', '全国造纸工业标准化技术委员会', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (105, 1, 0, 'waysmos', '造纸化学品', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (106, 1, 0, 'wff168_com', '第一家具网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (107, 1, 0, 'jiajuwxw', '家具微新闻', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (108, 1, 0, 'Furniture_China', '上海家具展', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (109, 1, 0, 'jiajuzhuliuMF', '家具主流', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (110, 1, 0, 'jjgle2015', '家具在线', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (111, 1, 0, 'nfsyyjjb', '医药经济报', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (112, 1, 0, 'iyiyaomofang', '医药魔方数据', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT 
INTO `fetch_task` VALUES (113, 1, 0, 'gh_260ce2309fff', 'MIMS医药资讯', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (114, 1, 0, 'yyguancha', '医药观察家网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (115, 1, 0, 'yyshoujibao', '医药手机报', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (116, 1, 0, 'shstpa', '上海医药商业行业协会', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (117, 1, 0, 'fangda_healthcare', '医药法律评论', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (118, 1, 0, 'cmpma1989', '中国医药物资协会', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (119, 1, 0, 'yehenala_678', '医药那些事儿', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (120, 1, 0, 'imrobotic', '机器人在线', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (121, 1, 0, 'CSDN_Tech', 'CSDN技术头条', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (122, 1, 0, 'CSDN_BLOG', 'CSDN博客', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (123, 1, 0, 'CSDNLIB', 'CSDN知识库', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (124, 1, 0, 'csdn_iot', 'CSDN物联网开发', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (125, 2, 0, '1005051627825392', '互联网的那点事', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (126, 2, 0, '1006061787567623', '199IT-互联网数据中心', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (127, 2, 0, '1002061577794853', '互联网的一些事', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 
11:01:05');\nINSERT INTO `fetch_task` VALUES (128, 2, 0, '1002063318777442', '互联网创业刊', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (129, 2, 0, '1006061661377270', '互联网观察网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (130, 2, 0, '1002062210869832', '互联网新闻网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (131, 2, 0, '1006063481197561', '中国互联网安全大会', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (132, 2, 0, '1002061768025224', '互联网周刊', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (133, 2, 0, '1002063819805149', '互联网焦点网', '', '', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\nINSERT INTO `fetch_task` VALUES (134, 3, 0, '55982516338', '奇文志怪', '', 'http://m.toutiao.com/profile/55982516338/', 1, '', '2018-09-06 14:01:05', '2018-09-06 14:01:05');\nINSERT INTO `fetch_task` VALUES (135, 3, 0, '6014591174', '鹏君读书', '', 'http://m.toutiao.com/profile/6014591174/', 1, '', '2017-01-11 11:01:05', '2017-01-11 11:01:05');\n"
  },
  {
    "path": "db/schema/mysql.sql",
    "content": "DROP DATABASE IF EXISTS `news_spider`;\nCREATE DATABASE `news_spider` /*!40100 DEFAULT CHARACTER SET utf8 */;\n\n\nuse news_spider;\n\n\nCREATE TABLE `channel` (\n  `id` INT(11) NOT NULL AUTO_INCREMENT,\n  `code` VARCHAR(20) COMMENT '频道编号',\n  `name` VARCHAR(20) COMMENT '频道名称',\n  `description` VARCHAR(500) DEFAULT '' COMMENT '描述',\n  `create_time` TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间',\n  `update_time` TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '更新时间',\n  PRIMARY KEY (`id`),\n  UNIQUE KEY idx_code (`code`)\n) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='频道表';\n\n\nCREATE TABLE `fetch_task` (\n  `id` INT(11) NOT NULL AUTO_INCREMENT,\n  `platform_id` TINYINT DEFAULT 0 COMMENT '平台id（1:微信;2:微博;3:头条）',\n  `channel_id` TINYINT DEFAULT 0 COMMENT '频道id',\n  `follow_id` VARCHAR(45) DEFAULT '' COMMENT '关注账号id',\n  `follow_name` VARCHAR(45) DEFAULT '' COMMENT '关注账号名称',\n  `avatar_url` VARCHAR(512) DEFAULT '' COMMENT '关注账号头像',\n  `fetch_url` VARCHAR(512) DEFAULT '' COMMENT '抓取入口',\n  `flag_enabled` TINYINT DEFAULT 0 COMMENT '启用标记（0:未启用;1:已启用）',\n  `description` VARCHAR(500) DEFAULT '' COMMENT '描述',\n  `create_time` TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间',\n  `update_time` TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '更新时间',\n  PRIMARY KEY (`id`),\n  UNIQUE KEY idx_platform_follow_id (`platform_id`, `follow_id`)\n) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='抓取任务表';\n\n\nCREATE TABLE `fetch_result` (\n  `id` INT(11) NOT NULL AUTO_INCREMENT,\n  `task_id` INT NOT NULL COMMENT '任务id',\n  `platform_id` TINYINT DEFAULT 0 COMMENT '平台id（1:微信;2:微博;3:头条）',\n  `platform_name` VARCHAR(50) DEFAULT '' COMMENT '平台名称（1:微信;2:微博;3:头条）',\n  `channel_id` TINYINT DEFAULT 0 COMMENT '频道id',\n  `channel_name` VARCHAR(50) DEFAULT '' COMMENT '频道名称',\n  `article_id` VARCHAR(50) DEFAULT '' COMMENT '文章id',\n  `article_url` VARCHAR(512) DEFAULT '' COMMENT '文章链接',\n  `article_title` VARCHAR(100) DEFAULT '' 
COMMENT '文章标题',\n  `article_author_id` VARCHAR(100) DEFAULT '' COMMENT '文章作者id（对应follow_id）',\n  `article_author_name` VARCHAR(100) DEFAULT '' COMMENT '文章作者名称（对应follow_name）',\n  `article_tags` VARCHAR(100) DEFAULT '' COMMENT '文章标签（半角逗号分隔）',\n  `article_abstract` VARCHAR(500) DEFAULT '' COMMENT '文章摘要',\n  `article_content` MEDIUMTEXT COMMENT '文章内容',\n  `article_pub_time` DATETIME DEFAULT '1000-01-01 00:00:00' COMMENT '文章发布时间',\n  `create_time` TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间',\n  `update_time` TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '更新时间',\n  PRIMARY KEY (`id`),\n  KEY idx_task_id (`task_id`),\n  UNIQUE KEY idx_platform_article_id (`platform_id`, `article_id`),\n  KEY idx_platform_author_id (`platform_id`, `article_author_id`),\n  KEY idx_article_pub_time (`article_pub_time`),\n  KEY idx_create_time (`create_time`),\n  KEY idx_update_time (`update_time`)\n) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='抓取结果表';\n\n\nCREATE TABLE `log_task_scheduling` (\n  `id` INT(11) NOT NULL AUTO_INCREMENT,\n  `platform_id` TINYINT DEFAULT 0 COMMENT '平台id（1:微信;2:微博;3:头条）',\n  `platform_name` VARCHAR(50) DEFAULT '' COMMENT '平台名称（1:微信;2:微博;3:头条）',\n  `spider_name` VARCHAR(45) DEFAULT '' COMMENT '蜘蛛名称，一般同平台名称',\n  `task_quantity` INT(11) DEFAULT 0 COMMENT '任务数量',\n  `create_time` TIMESTAMP DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间',\n  `update_time` TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '更新时间',\n  PRIMARY KEY (`id`)\n) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='任务调度日志表';\n\n-- 更新记录[2018-02-13]\n# ALTER TABLE `fetch_result` MODIFY `article_content` MEDIUMTEXT COMMENT '文章内容';\n\n-- 更新记录[2018-05-29]\n# DROP INDEX idx_platform_author_id ON `fetch_result`;\n# ALTER TABLE `fetch_result` ADD INDEX idx_platform_author_id (`platform_id`, `article_author_id`);\n# ALTER TABLE `fetch_result` MODIFY `article_pub_time` DATETIME DEFAULT '1000-01-01 00:00:00' COMMENT '文章发布时间';\n# ALTER TABLE `fetch_result` ADD INDEX 
idx_article_pub_time (`article_pub_time`);\n# ALTER TABLE `fetch_result` ADD INDEX idx_create_time (`create_time`);\n# ALTER TABLE `fetch_result` ADD INDEX idx_update_time (`update_time`);\n"
  },
  {
    "path": "docs/Architecture.md",
    "content": "# 整体架构(Architecture)\n\n- MariaDB\n\n每个公众号/发布号的首页（即爬虫抓取入口）存储于数据库中。\n\n表结构 db/schema/mysql.sql\n\n测试数据 db/data/mysql.sql\n\n\n- Redis\n\n为了支持分布式, 抓取任务单独存放于缓存, 这样在调试时, 需要手动执行创建任务。\n\n参考[启动说明](Spiders/README.md)\n\n为了方便调试, 本项目所有缓存key均以`scrapy:`作为前缀\n\n- NodeJS\n\n部分详情页面的信息抽取, 本项目使用js处理, 避免正则表达式规则的不完全覆盖。\n"
  },
  {
    "path": "docs/Components/MariaDB.md",
    "content": "# MariaDB\n"
  },
  {
    "path": "docs/Components/Redis.md",
    "content": "# Redis\n"
  },
  {
    "path": "docs/Components/SeaweedFS.md",
    "content": "# SeaweedFS\n\n[SeaweedFS 项目地址](https://github.com/chrislusf/seaweedfs)\n\n\n## 安装\n\n### Go (Golang)\n\n下载页面： https://golang.org/dl/\n\n```\n$ wget https://dl.google.com/go/go1.11.1.linux-amd64.tar.gz\n$ sudo tar -C /usr/local -xzf go1.11.1.linux-amd64.tar.gz\n$ sudo vim /etc/profile\n    export GOROOT=/usr/local/go\n    export GOPATH=$HOME/work\n    export PATH=$PATH:$GOROOT/bin:$GOPATH/bin\n$ source /etc/profile\n```\n\n或者仅为当前用户设置环境变量\n```\n$ vim ~/.bashrc\n$ source ~/.bashrc\n```\n\n注意：使用 zsh 的用户, 需要为 zsh 设置环境变量\n```\n$ vim ~/.zshrc\n$ source ~/.zshrc\n```\n\n### Weed\n\n依赖 git (版本控制工具)\n\n```\ngo get github.com/chrislusf/seaweedfs/weed\n```\n\n\n## 启动\n\nStart Master Server\n```\n$ weed master\n```\n\nStart Volume Servers\n```\n$ mkdir /tmp/data1 /tmp/data2\n$ chmod 777 /tmp/data1 /tmp/data2\n$ weed volume -dir=\"/tmp/data1\" -max=5  -mserver=\"localhost:9333\" -port=8080 &\n$ weed volume -dir=\"/tmp/data2\" -max=10 -mserver=\"localhost:9333\" -port=8081 &\n```\n\n```\n$ weed volume -dir=/tmp/data1/ -mserver=\"localhost:9333\" -ip=\"192.168.2.32\" -port=8080\n```\n\n\n## 启动（方式二）\n```\n$ weed server -dir=/tmp/data1/ -filer -filer.port=8000 -master.port=9333 -volume.port=8001\n```\n集群管理: http://127.0.0.1:9333/\n\n归档管理: http://localhost:8000/\n\n卷积管理: http://localhost:8001/ui/index.html\n\n图片地址: http://localhost:8001/\n\n\n上传文件请求\n```\n$ curl http://localhost:9333/dir/assign\n{\"fid\":\"2,055a54a8ec\",\"url\":\"127.0.0.1:8080\",\"publicUrl\":\"127.0.0.1:8080\",\"count\":1}\n```\n\n上传文件\n```\n$ curl -X PUT -F file=@/home/zhanghe/metro.jpg http://127.0.0.1:8080/2,055a54a8ec\n{\"name\":\"metro.jpg\",\"size\":1830848}\n```\n\n删除文件\n```\n$ curl -X DELETE http://127.0.0.1:8080/2,055a54a8ec\n{\"size\":1830869}\n```\n\n文件读取\n```\n$ curl \"http://localhost:9333/dir/lookup?volumeId=2\"\n{\"volumeId\":\"2\",\"locations\":[{\"url\":\"127.0.0.1:8080\",\"publicUrl\":\"127.0.0.1:8080\"}]}\n```\n\n访问文件\n- 
[http://127.0.0.1:8080/2,055a54a8ec.jpg](http://127.0.0.1:8080/2,055a54a8ec.jpg)\n- [http://127.0.0.1:8080/2/055a54a8ec.jpg](http://127.0.0.1:8080/2/055a54a8ec.jpg)\n- [http://127.0.0.1:8080/2/055a54a8ec](http://127.0.0.1:8080/2/055a54a8ec)\n- [http://127.0.0.1:8080/2/055a54a8ec?height=200&width=200](http://127.0.0.1:8080/2/055a54a8ec?height=200&width=200)\n\n\n导出文件打包\n```\n$ weed export -dir=/tmp/data1 -volumeId=1 -o=/tmp/data1.tar -fileNameFormat={{.Name}} -newer='2006-01-02T15:04:05'\n```\n\n解包具体文件\n```\n$ tar -xvf data1.tar\n```\n\n## 快速安装\n```bash\n# Mac系统\n$ wget -c https://github.com/chrislusf/seaweedfs/releases/download/0.76/darwin_amd64.tar.gz -O weed_darwin_amd64.tar.gz\n$ tar -zxvf weed_darwin_amd64.tar.gz\n\n# Linux系统\n$ wget -c https://github.com/chrislusf/seaweedfs/releases/download/0.76/linux_arm64.tar.gz -O weed_linux_arm64.tar.gz\n$ tar -zxvf weed_linux_arm64.tar.gz\n\n# 启动\n$ ./weed server -dir=weed_data/ -filer -filer.port=8000 -master.port=9333 -volume.port=8001 -volume.max=32\n```\n"
  },
  {
    "path": "docs/Components/Squid.md",
    "content": "# Squid\n\n"
  },
  {
    "path": "docs/README.md",
    "content": "# scrapy最佳实践 - 新闻抓取\n\n## GitBook 操作指南\n\n初始化\n```bash\ncd docs\nnpm install -g gitbook-cli\nnpm install --save gitbook-plugin-todo\nnpm install --save gitbook-plugin-mermaid-full\n\ngitbook init  # 或者 gitbook install\n```\n\n开启服务\n```bash\ngitbook serve\n```\n\n访问 [http://localhost:4000](http://localhost:4000)\n"
  },
  {
    "path": "docs/SUMMARY.md",
    "content": "# Summary\n\n* [项目介绍](README.md)\n* [项目架构](Architecture.md)\n* [爬虫模块](Spiders/README.md)\n    * [微信爬虫](Spiders/Weixin.md)\n    * [微博爬虫](Spiders/Weibo.md)\n    * [头条爬虫](Spiders/Toutiao.md)\n* 组件服务\n    * [MariaDB](Components/MariaDB.md)\n    * [Redis](Components/Redis.md)\n    * [SeaweedFS](Components/SeaweedFS.md)\n"
  },
  {
    "path": "docs/Spiders/README.md",
    "content": "# Spiders\n\n1、部署系统依赖\n\n- MariaDB\n- Redis\n- NodeJS\n\n2、部署项目依赖\n\n```\npip install requirements.txt\n```\n\n3、创建数据库, 建立抓取入口\n\n- 建表结构 db/schema/mysql.sql\n- 测试数据 db/data/mysql.sql\n\n4、创建抓取任务, 写入缓存\n```\n(news_spider.env) ➜  news_spider git:(master) ✗ python tasks/job_put_tasks.py\n[√] 当前环境变量: develop\n缺失参数\nExample:\n\tpython job_put_tasks.py wx  # 微信\n\tpython job_put_tasks.py wb  # 微博\n\tpython job_put_tasks.py tm  # 头条(M)\n```\n参考以上提示, 对应蜘蛛执行各自的脚本完成任务创建\n\n5、微信抓取, 需要初始化cookie, 其他两个蜘蛛不需要\n\n\n生成环境, 可以使用`supervisor`自动守护`scrapy.ini`、`tasks.ini`这两组进程, 根据需要自行修改\n"
  },
  {
    "path": "docs/Spiders/Toutiao.md",
    "content": "# 头条(M端)\n\n创建任务详情\n```mysql\nINSERT INTO `fetch_task` VALUES (134, 3, 0, '55982516338', '奇文志怪', '', 'http://m.toutiao.com/profile/55982516338/', 1, '', '2018-09-06 14:01:05', '2018-09-06 14:01:05');\n```\n\n进入redis, 检查调度任务数量\n```\n127.0.0.1:6379> SCARD \"scrapy:tasks_set:toutiao_m\"\n(integer) 439\n```\n\n如果没有调度任务, 需要创建调度任务\n```\npython tasks/job_put_tasks.py tm\n```\n\n开启爬虫\n```\nscrapy crawl toutiao_m\n```\n"
  },
  {
    "path": "docs/Spiders/Weibo.md",
    "content": "# 微博\n\n进入redis, 检查调度任务数量\n```\n127.0.0.1:6379> SCARD \"scrapy:tasks_set:weibo\"\n(integer) 0\n```\n\n如果没有调度任务, 需要创建调度任务\n```\npython tasks/job_put_tasks.py wb\n```\n\n开启爬虫\n```\nscrapy crawl weibo\n```\n"
  },
  {
    "path": "docs/Spiders/Weixin.md",
    "content": "# 微信\n\n进入redis, 检查调度任务数量\n```\n127.0.0.1:6379> SCARD \"scrapy:tasks_set:weixin\"\n(integer) 0\n```\n\n如果没有调度任务, 需要创建调度任务\n```\npython tasks/job_put_tasks.py wx\n```\n\n开启爬虫\n```\nscrapy crawl weixin\n```\n"
  },
  {
    "path": "docs/book.json",
    "content": "{\n    \"language\": \"zh-hans\",\n    \"author\": \"碎ping子\",\n    \"plugins\": [\n        \"todo\",\n        \"mermaid-full@>=0.5.1\"\n    ]\n}\n"
  },
  {
    "path": "env_default.sh",
    "content": "#!/usr/bin/env bash\n\nsource news_spider.env/bin/activate\n\nexport PATH=${PWD}:${PATH}\nexport PYTHONPATH=${PWD}\nexport PYTHONIOENCODING=utf-8\nexport MODE=default\n"
  },
  {
    "path": "etc/scrapy.ini",
    "content": "[group:scrapy]\nprograms=weixin,weibo,toutiao\n\n\n[program:weixin]\ncommand=scrapy crawl weixin\ndirectory=news\nstartsecs=0\nstopwaitsecs=0\nautostart=false\nautorestart=true\nredirect_stderr=true\nstdout_logfile=logs/scrapy_weixin.log\n\n\n[program:weibo]\ncommand=scrapy crawl weibo\ndirectory=news\nstartsecs=0\nstopwaitsecs=0\nautostart=false\nautorestart=true\nredirect_stderr=true\nstdout_logfile=logs/scrapy_weibo.log\n\n\n[program:toutiao]\ncommand=scrapy crawl toutiao\ndirectory=news\nstartsecs=0\nstopwaitsecs=0\nautostart=false\nautorestart=true\nredirect_stderr=true\nstdout_logfile=logs/scrapy_toutiao.log\n"
  },
  {
    "path": "etc/scrapyd.ini",
    "content": "[program:scrapyd]\ncommand=scrapyd\ndirectory=news\npriority=200\nstartsecs=0\nstopwaitsecs=0\nautostart=false\nautorestart=true\nredirect_stderr=true\nstdout_logfile=logs/scrapyd.log\n"
  },
  {
    "path": "etc/supervisord.conf",
    "content": "; Sample supervisor config file.\n;\n; For more information on the config file, please see:\n; http://supervisord.org/configuration.html\n;\n; Notes:\n;  - Shell expansion (\"~\" or \"$HOME\") is not supported.  Environment\n;    variables can be expanded using this syntax: \"%(ENV_HOME)s\".\n;  - Comments must have a leading space: \"a=b ;comment\" not \"a=b;comment\".\n\n;[unix_http_server]\n;file=/tmp/supervisor.sock   ; (the path to the socket file)\n;chmod=0700                 ; socket file mode (default 0700)\n;chown=nobody:nogroup       ; socket file uid:gid owner\n;username=user              ; (default is no username (open server))\n;password=123               ; (default is no password (open server))\n\n[inet_http_server]         ; inet (TCP) server disabled by default\nport=127.0.0.1:9001        ; (ip_address:port specifier, *:port for all iface)\nusername=user              ; (default is no username (open server))\npassword=123               ; (default is no password (open server))\n\n[supervisord]\nlogfile=/tmp/supervisord.log ; (main log file;default $CWD/supervisord.log)\nlogfile_maxbytes=50MB        ; (max main logfile bytes b4 rotation;default 50MB)\nlogfile_backups=10           ; (num of main logfile rotation backups;default 10)\nloglevel=info                ; (log level;default info; others: debug,warn,trace)\npidfile=/tmp/supervisord.pid ; (supervisord pidfile;default supervisord.pid)\nnodaemon=false               ; (start in foreground if true;default false)\nminfds=1024                  ; (min. avail startup file descriptors;default 1024)\nminprocs=200                 ; (min. 
avail process descriptors;default 200)\n;umask=022                   ; (process file creation umask;default 022)\n;user=chrism                 ; (default is current user, required if root)\n;identifier=supervisor       ; (supervisord identifier, default is 'supervisor')\n;directory=/tmp              ; (default is not to cd during start)\n;nocleanup=true              ; (don't clean up tempfiles at start;default false)\n;childlogdir=/tmp            ; ('AUTO' child log dir, default $TEMP)\n;environment=KEY=\"value\"     ; (key value pairs to add to environment)\n;strip_ansi=false            ; (strip ansi escape codes in logs; def. false)\n\n; the below section must remain in the config file for RPC\n; (supervisorctl/web interface) to work, additional interfaces may be\n; added by defining them in separate rpcinterface: sections\n[rpcinterface:supervisor]\nsupervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface\n\n[supervisorctl]\n;serverurl=unix:///tmp/supervisor.sock ; use a unix:// URL  for a unix socket\nserverurl=http://127.0.0.1:9001 ; use an http:// url to specify an inet socket\nusername=user               ; should be same as http_username if set\npassword=123                ; should be same as http_password if set\n;prompt=mysupervisor         ; cmd line prompt (default \"supervisor\")\n;history_file=~/.sc_history  ; use readline history if available\n\n; The below sample program section shows all possible program subsection values,\n; create one or more 'real' program: sections to be able to control them under\n; supervisor.\n\n;[program:theprogramname]\n;command=/bin/cat              ; the program (relative uses PATH, can take args)\n;process_name=%(program_name)s ; process_name expr (default %(program_name)s)\n;numprocs=1                    ; number of processes copies to start (def 1)\n;directory=/tmp                ; directory to cwd to before exec (def no cwd)\n;umask=022                     ; umask for process (default 
None)\n;priority=999                  ; the relative start priority (default 999)\n;autostart=true                ; start at supervisord start (default: true)\n;startsecs=1                   ; # of secs prog must stay up to be running (def. 1)\n;startretries=3                ; max # of serial start failures when starting (default 3)\n;autorestart=unexpected        ; when to restart if exited after running (def: unexpected)\n;exitcodes=0,2                 ; 'expected' exit codes used with autorestart (default 0,2)\n;stopsignal=QUIT               ; signal used to kill process (default TERM)\n;stopwaitsecs=10               ; max num secs to wait b4 SIGKILL (default 10)\n;stopasgroup=false             ; send stop signal to the UNIX process group (default false)\n;killasgroup=false             ; SIGKILL the UNIX process group (def false)\n;user=chrism                   ; setuid to this UNIX account to run the program\n;redirect_stderr=true          ; redirect proc stderr to stdout (default false)\n;stdout_logfile=/a/path        ; stdout log path, NONE for none; default AUTO\n;stdout_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)\n;stdout_logfile_backups=10     ; # of stdout logfile backups (default 10)\n;stdout_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)\n;stdout_events_enabled=false   ; emit events on stdout writes (default false)\n;stderr_logfile=/a/path        ; stderr log path, NONE for none; default AUTO\n;stderr_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)\n;stderr_logfile_backups=10     ; # of stderr logfile backups (default 10)\n;stderr_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)\n;stderr_events_enabled=false   ; emit events on stderr writes (default false)\n;environment=A=\"1\",B=\"2\"       ; process environment additions (def no adds)\n;serverurl=AUTO                ; override serverurl computation (childutils)\n\n; The below sample eventlistener section 
shows all possible\n; eventlistener subsection values, create one or more 'real'\n; eventlistener: sections to be able to handle event notifications\n; sent by supervisor.\n\n;[eventlistener:theeventlistenername]\n;command=/bin/eventlistener    ; the program (relative uses PATH, can take args)\n;process_name=%(program_name)s ; process_name expr (default %(program_name)s)\n;numprocs=1                    ; number of processes copies to start (def 1)\n;events=EVENT                  ; event notif. types to subscribe to (req'd)\n;buffer_size=10                ; event buffer queue size (default 10)\n;directory=/tmp                ; directory to cwd to before exec (def no cwd)\n;umask=022                     ; umask for process (default None)\n;priority=-1                   ; the relative start priority (default -1)\n;autostart=true                ; start at supervisord start (default: true)\n;startsecs=1                   ; # of secs prog must stay up to be running (def. 1)\n;startretries=3                ; max # of serial start failures when starting (default 3)\n;autorestart=unexpected        ; autorestart if exited after running (def: unexpected)\n;exitcodes=0,2                 ; 'expected' exit codes used with autorestart (default 0,2)\n;stopsignal=QUIT               ; signal used to kill process (default TERM)\n;stopwaitsecs=10               ; max num secs to wait b4 SIGKILL (default 10)\n;stopasgroup=false             ; send stop signal to the UNIX process group (default false)\n;killasgroup=false             ; SIGKILL the UNIX process group (def false)\n;user=chrism                   ; setuid to this UNIX account to run the program\n;redirect_stderr=false         ; redirect_stderr=true is not allowed for eventlisteners\n;stdout_logfile=/a/path        ; stdout log path, NONE for none; default AUTO\n;stdout_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)\n;stdout_logfile_backups=10     ; # of stdout logfile backups (default 
10)\n;stdout_events_enabled=false   ; emit events on stdout writes (default false)\n;stderr_logfile=/a/path        ; stderr log path, NONE for none; default AUTO\n;stderr_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)\n;stderr_logfile_backups=10     ; # of stderr logfile backups (default 10)\n;stderr_events_enabled=false   ; emit events on stderr writes (default false)\n;environment=A=\"1\",B=\"2\"       ; process environment additions\n;serverurl=AUTO                ; override serverurl computation (childutils)\n\n; The below sample group section shows all possible group values,\n; create one or more 'real' group: sections to create \"heterogeneous\"\n; process groups.\n\n;[group:thegroupname]\n;programs=progname1,progname2  ; each refers to 'x' in [program:x] definitions\n;priority=999                  ; the relative start priority (default 999)\n\n; The [include] section can just contain the \"files\" setting.  This\n; setting can list multiple files (separated by whitespace or\n; newlines).  It can also contain wildcards.  The filenames are\n; interpreted as relative to this file.  Included files *cannot*\n; include files themselves.\n\n;[include]\n;files = relative/directory/*.ini\n\n;[include]\n;files = scrapy.ini tasks.ini\n\n[include]\nfiles = toutiao.ini\n"
  },
  {
    "path": "etc/tasks.ini",
    "content": "[group:tasks]\nprograms=counter_clear,put_tasks_toutiao,put_tasks_weibo,put_tasks_weixin,sogou_cookies,weixin_cookies\n\n\n[program:counter_clear]\ncommand=python tasks/run_job_counter_clear.py\nstartsecs=0\nstopwaitsecs=0\nautostart=false\nautorestart=true\nredirect_stderr=true\nstdout_logfile=logs/counter_clear.log\n\n\n[program:put_tasks_toutiao]\ncommand=python tasks/run_job_put_tasks_toutiao.py\nstartsecs=0\nstopwaitsecs=0\nautostart=false\nautorestart=true\nredirect_stderr=true\nstdout_logfile=logs/put_tasks_toutiao.log\n\n\n[program:put_tasks_weibo]\ncommand=python tasks/run_job_put_tasks_weibo.py\nstartsecs=0\nstopwaitsecs=0\nautostart=false\nautorestart=true\nredirect_stderr=true\nstdout_logfile=logs/put_tasks_weibo.log\n\n\n[program:put_tasks_weixin]\ncommand=python tasks/run_job_put_tasks_weixin.py\nstartsecs=0\nstopwaitsecs=0\nautostart=false\nautorestart=true\nredirect_stderr=true\nstdout_logfile=logs/put_tasks_weixin.log\n\n\n[program:sogou_cookies]\ncommand=python tasks/run_job_sogou_cookies.py\nstartsecs=0\nstopwaitsecs=0\nautostart=false\nautorestart=true\nredirect_stderr=true\nstdout_logfile=logs/sogou_cookies.log\n\n\n[program:weixin_cookies]\ncommand=python tasks/run_job_weixin_cookies.py\nstartsecs=0\nstopwaitsecs=0\nautostart=false\nautorestart=true\nredirect_stderr=true\nstdout_logfile=logs/weixin_cookies.log\n"
  },
  {
    "path": "etc/toutiao.ini",
    "content": "[group:toutiao]\nprograms=put_tasks,scrapy\n\n[program:put_tasks]\ncommand=python tasks/run_job_put_tasks_toutiao.py\nstartsecs=0\nstopwaitsecs=0\nautostart=false\nautorestart=true\nredirect_stderr=true\nstdout_logfile=logs/put_tasks_toutiao.log\n\n[program:scrapy]\ncommand=scrapy crawl toutiao\ndirectory=news\nstartsecs=0\nstopwaitsecs=0\nautostart=false\nautorestart=true\nredirect_stderr=true\nstdout_logfile=logs/scrapy_toutiao.log\n\n;[program:reboot_net]\n;command=python tasks/run_job_reboot_net_china_net.py\n;startsecs=0\n;stopwaitsecs=0\n;autostart=false\n;autorestart=true\n;redirect_stderr=true\n;stdout_logfile=logs/reboot_net_china_net.log\n"
  },
  {
    "path": "libs/__init__.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: __init__.py.py\n@time: 2018-02-10 15:24\n\"\"\"\n\n\ndef func():\n    pass\n\n\nclass Main(object):\n    def __init__(self):\n        pass\n\n\nif __name__ == '__main__':\n    pass\n"
  },
  {
    "path": "libs/counter.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: counter.py\n@time: 2018-02-10 15:24\n\"\"\"\n\nfrom redis import Redis\n\n\nclass CounterClient(object):\n    \"\"\"\n    计数器\n    \"\"\"\n\n    def __init__(self, redis_client, entity_name, prefix='counter'):\n        \"\"\"\n        :param redis_client:\n        :param entity_name:\n        :param prefix:\n        \"\"\"\n        self.redis_client = redis_client  # type: Redis\n        self.counter_key = \"%s:%s\" % (prefix, entity_name)\n\n    def increase(self, amount=1):\n        \"\"\"\n        增加计数\n        :param amount:\n        :return:\n        \"\"\"\n        return int(self.redis_client.incr(self.counter_key, amount))\n\n    def decrease(self, amount=1):\n        \"\"\"\n        减少计数\n        :param amount:\n        :return:\n        \"\"\"\n        return int(self.redis_client.decr(self.counter_key, amount))\n\n    def get(self):\n        \"\"\"\n        获取计数\n        :return:\n        \"\"\"\n        return int(self.redis_client.get(self.counter_key) or 0)\n\n    def clear(self):\n        \"\"\"\n        清除计数\n        :return:\n        \"\"\"\n        return self.redis_client.delete(self.counter_key)\n"
  },
  {
    "path": "libs/ft.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: ff.py\n@time: 2019-05-26 14:26\n\"\"\"\n\nimport base64\nimport hashlib\nimport time\nimport requests\n\n\nURL = \"http://pred.fateadm.com\"\n\n\nclass FTClient(object):\n    def __init__(self, pd_id, pd_key, app_id='', app_key=''):\n        self.pd_id = pd_id\n        self.pd_key = pd_key\n        self.app_id = app_id\n        self.app_key = app_key\n        self.host = URL\n        self.s = requests.session()\n        self.timeout = 30\n\n    @staticmethod\n    def calc_sign(pd_id, pd_key, timestamp):\n        md5 = hashlib.md5()\n        md5.update(timestamp + pd_key)\n        sign_a = md5.hexdigest()\n\n        md5 = hashlib.md5()\n        md5.update(pd_id + timestamp + sign_a)\n        sign_b = md5.hexdigest()\n        return sign_b\n\n    @staticmethod\n    def calc_card_sign(card_id, card_key, timestamp, pd_key):\n        md5 = hashlib.md5()\n        md5.update(pd_key + timestamp + card_id + card_key)\n        return md5.hexdigest()\n\n    def query_balance(self):\n        \"\"\"查询余额\"\"\"\n        tm = str(int(time.time()))\n        sign = self.calc_sign(self.pd_id, self.pd_key, tm)\n        param = {\n            \"user_id\": self.pd_id,\n            \"timestamp\": tm,\n            \"sign\": sign\n        }\n        url = self.host + \"/api/custval\"\n        rsp = self.s.post(url, param, timeout=self.timeout).json()\n        return rsp\n\n    def query_tts(self, predict_type):\n        \"\"\"查询网络延迟\"\"\"\n        tm = str(int(time.time()))\n        sign = self.calc_sign(self.pd_id, self.pd_key, tm)\n        param = {\n            \"user_id\": self.pd_id,\n            \"timestamp\": tm,\n            \"sign\": sign,\n            \"predict_type\": predict_type,\n        }\n        if self.app_id != \"\":\n            asign = self.calc_sign(self.app_id, self.app_key, tm)\n            param[\"appid\"] = self.app_id\n            param[\"asign\"] = 
asign\n        url = self.host + \"/api/qcrtt\"\n        rsp = self.s.post(url, param, timeout=self.timeout).json()\n        return rsp\n\n    def predict(self, predict_type, img_data):\n        \"\"\"识别验证码\"\"\"\n        tm = str(int(time.time()))\n        sign = self.calc_sign(self.pd_id, self.pd_key, tm)\n        img_base64 = base64.b64encode(img_data)\n        param = {\n            \"user_id\": self.pd_id,\n            \"timestamp\": tm,\n            \"sign\": sign,\n            \"predict_type\": predict_type,\n            \"img_data\": img_base64,\n        }\n        if self.app_id != \"\":\n            asign = self.calc_sign(self.app_id, self.app_key, tm)\n            param[\"appid\"] = self.app_id\n            param[\"asign\"] = asign\n        url = self.host + \"/api/capreg\"\n        rsp = self.s.post(url, param, timeout=self.timeout).json()\n        return rsp\n\n    def predict_from_file(self, predict_type, file_name):\n        \"\"\"从文件进行验证码识别\"\"\"\n        with open(file_name, \"rb+\") as f:\n            data = f.read()\n        return self.predict(predict_type, data)\n\n    def justice(self, request_id):\n        \"\"\"识别失败，进行退款请求\"\"\"\n        if request_id == \"\":\n            return\n        tm = str(int(time.time()))\n        sign = self.calc_sign(self.pd_id, self.pd_key, tm)\n        param = {\n            \"user_id\": self.pd_id,\n            \"timestamp\": tm,\n            \"sign\": sign,\n            \"request_id\": request_id\n        }\n        url = self.host + \"/api/capjust\"\n        rsp = self.s.post(url, param, timeout=self.timeout).json()\n        return rsp\n\n    def charge(self, card_id, card_key):\n        \"\"\"充值接口\"\"\"\n        tm = str(int(time.time()))\n        sign = self.calc_sign(self.pd_id, self.pd_key, tm)\n        card_sign = self.calc_card_sign(card_id, card_key, tm, self.pd_key)\n        param = {\n            \"user_id\": self.pd_id,\n            \"timestamp\": tm,\n            \"sign\": sign,\n            
'cardid': card_id,\n            'csign': card_sign\n        }\n        url = self.host + \"/api/charge\"\n        rsp = self.s.post(url, param, timeout=self.timeout).json()\n        return rsp\n\n\ndef test_ft():\n    \"\"\"\n    测试\n    {u'RspData': u'{\"cust_val\":1010}', u'RetCode': u'0', u'ErrMsg': u'succ', u'RequestId': u''}\n    {u'RspData': u'{\"result\": \"8x4g\"}', u'RetCode': u'0', u'ErrMsg': u'', u'RequestId': u'2019052615005042ad98b2000518d493'}\n    :return:\n    \"\"\"\n    pd_id = \"xxxxxx\"\n    pd_key = \"xxxxxx\"\n    app_id = \"312451\"\n    app_key = \"5YuN+6isLserKBZti4hoaI6UR2N5UT2j\"\n    predict_type = \"30400\"\n    api = FTClient(pd_id, pd_key, app_id, app_key)\n    # 查询余额接口\n    res = api.query_balance()\n    print(res)\n    file_name = \"img.jpg\"\n    rsp = api.predict_from_file(predict_type, file_name)\n    print(rsp)\n\nif __name__ == \"__main__\":\n    test_ft()\n"
  },
  {
    "path": "libs/optical_modem.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: optical_modem.py\n@time: 2018-05-27 00:24\n\"\"\"\n\nimport base64\nimport json\nimport time\nimport re\nimport random\nimport hashlib\nimport requests\nfrom scrapy.selector import Selector\n\n\nclass OpticalModemChinaNet(object):\n    \"\"\"\n    电信光猫\n    \"\"\"\n    s = requests.session()\n\n    def __init__(self, host='192.168.1.1', username='useradmin', password='crcun'):\n\n        self.host = host\n        self.username = username\n        self.password = password\n\n        self.url_login = 'http://%s/login.cgi' % self.host\n        self.url_get_wan_wifi_status = 'http://%s/gatewayManage.cmd' % self.host\n        self.url_reboot = 'http://%s/gatewayManage.cmd' % self.host\n\n        self.timeout = 180\n\n        self.net_ip_o = None\n        self.net_ip_n = None\n\n    @staticmethod\n    def _get_tc():\n        tc = str('%13d' % (time.time() * 1000))\n        return tc\n\n    def login(self):\n        \"\"\"\n        登录\n        :return:\n        \"\"\"\n        params = {\n            'username': self.username,\n            'psd': self.password,\n        }\n        res = self.s.get(self.url_login, params=params, timeout=self.timeout)\n        print(res.status_code, res.url)\n\n    def get_wan_wifi_status(self):\n        \"\"\"\n        获取wifi状态\n        :return:\n        \"\"\"\n        headers = {\n            'X-Requested-With': 'XMLHttpRequest',\n        }\n        params = {\n            'timeStamp': self._get_tc(),\n        }\n        json_cfg = {\n            'RPCMethod': 'Post1',\n            'ID': '123',\n            'Parameter': base64.urlsafe_b64encode(\"{'CmdType':'GET_WAN_WIFI_STATUS'}\")\n        }\n        data = \"jsonCfg=%s\" % json.dumps(json_cfg)\n        res = self.s.post(self.url_get_wan_wifi_status, headers=headers, params=params, data=data, timeout=self.timeout)\n        print(res.status_code, res.url)\n        
return_parameter = json.loads(base64.decodestring(res.json().get('return_Parameter', '')))\n        print(return_parameter)\n        print(return_parameter.get('ipAddr'))\n        wan_ip = return_parameter.get('ipAddr')\n        return wan_ip\n\n    def reboot(self):\n        \"\"\"\n        重启\n        :return:\n        \"\"\"\n        headers = {'X-Requested-With': 'XMLHttpRequest'}\n        params = {\n            'timeStamp': self._get_tc(),\n        }\n        json_cfg = {\n            'RPCMethod': 'Post1',\n            'ID': '123',\n            'Parameter': base64.urlsafe_b64encode(\"{'CmdType':'HG_COMMAND_REBOOT'}\")\n        }\n        data = \"jsonCfg=%s\" % json.dumps(json_cfg)\n        res = self.s.post(self.url_reboot, headers=headers, params=params, data=data, timeout=self.timeout)\n        print(res.status_code, res.url)\n        return_parameter = json.loads(base64.decodestring(res.json().get('return_Parameter', '')))\n        print(return_parameter)\n\n    def get_net_ip(self):\n        \"\"\"\n        获取网络IP，这里使用requests不用session，因为重启之后，session会断开\n        :return:\n        \"\"\"\n        url = 'https://ip.cn/'\n        res = requests.get(url, timeout=self.timeout)\n        response = Selector(res)\n        info = response.xpath('//div[@class=\"well\"]//code/text()').extract()\n        ip_info = dict(zip(['ip', 'address'], info))\n        net_ip = ip_info['ip']\n        print(net_ip)\n        return net_ip\n\n    def check_reboot_status(self):\n        reboot_status = self.net_ip_o != self.net_ip_n\n        print(reboot_status)\n        return reboot_status\n\n\nclass OpticalModemChinaMobile(object):\n    \"\"\"\n    移动光猫\n    登录密码表单SHA256加密\n    \"\"\"\n    s = requests.session()\n    pid = 1002\n    session_token = 0\n\n    def __init__(self, host='192.168.1.1', username='user', password='gkw4p3uv'):\n\n        self.host = host\n        self.username = username\n        self.password = password\n\n        self.pwd_random = 
self._get_pwd_random()\n        self.encryption_pwd = self._get_encryption_pwd(self.password, self.pwd_random)\n        self.token = self._get_token()\n\n        self.url_login = 'http://%s/' % self.host\n\n        self.timeout = 180\n\n        self.net_ip_o = None\n        self.net_ip_n = None\n\n    @staticmethod\n    def _get_pwd_random():\n        pwd_random = str(int(round(random.random() * 89999999)) + 10000000)\n        return pwd_random\n\n    @staticmethod\n    def _get_encryption_pwd(pwd, r):\n        encryption_pwd = hashlib.sha256(''.join([pwd, r])).hexdigest()\n        return encryption_pwd\n\n    def _get_token(self):\n        url = 'http://%s' % self.host\n        res = self.s.get(url)\n        html_body = res.text\n        token_re = re.compile(r'getObj\\(\"Frm_Logintoken\"\\)\\.value = \"(\\d+)\";')\n        token_list = re.findall(token_re, html_body)\n        return int(token_list[0]) if token_list else 0\n\n    def _get_pid(self):\n        url = 'http://%s/template.gch' % self.host\n        res = self.s.get(url, timeout=self.timeout)\n        html_body = res.text\n        pid_re = re.compile(r'\"getpage\\.gch\\?pid=(\\d+)&nextpage=\"')\n        pid_list = re.findall(pid_re, html_body)\n        self.pid = int(pid_list[0]) if pid_list else self.pid\n        return self.pid\n\n    def _get_session_token(self):\n        url = 'http://%s/getpage.gch?pid=%s&nextpage=manager_dev_restart_t.gch' % (self.host, self.pid)\n        res = self.s.get(url, timeout=self.timeout)\n        html_body = res.text\n        session_token_re = re.compile(r'var session_token = \"(\\d+)\";')\n        session_token_list = re.findall(session_token_re, html_body)\n        self.session_token = int(session_token_list[0]) if session_token_list else self.session_token\n        return self.session_token\n\n    def login(self):\n        \"\"\"\n        登录\n        :return:\n        \"\"\"\n        payload = {\n            'frashnum': '',\n            'action': 'login',\n           
 'Frm_Logintoken': self.token,\n            'UserRandomNum': self.pwd_random,\n            'Username': self.username,\n            'Password': self.encryption_pwd,\n        }\n        res = self.s.post(self.url_login, data=payload, timeout=self.timeout)\n        return 'mainFrame' in res.text\n\n    def reboot(self):\n        url = 'http://%s/getpage.gch?pid=%s&nextpage=manager_dev_restart_t.gch' % (self.host, self._get_pid())\n        payload = {\n            'IF_ACTION': 'devrestart',\n            'IF_ERRORSTR': 'SUCC',\n            'IF_ERRORPARAM': 'SUCC',\n            'IF_ERRORTYPE': -1,\n            'flag': 1,\n            '_SESSION_TOKEN': self._get_session_token(),\n        }\n\n        res = self.s.post(url, data=payload, timeout=self.timeout)\n        return '设备重启需要2~3分钟，请耐心等待。' in res.text\n\n    def get_net_ip(self):\n        \"\"\"\n        获取网络IP，这里使用requests不用session，因为重启之后，session会断开\n        :return:\n        \"\"\"\n        url = 'https://ip.cn/'\n        res = requests.get(url, timeout=self.timeout)\n        response = Selector(res)\n        info = response.xpath('//div[@class=\"well\"]//code/text()').extract()\n        ip_info = dict(zip(['ip', 'address'], info))\n        net_ip = ip_info['ip']\n        print(net_ip)\n        return net_ip\n\n    def check_reboot_status(self):\n        reboot_status = self.net_ip_o != self.net_ip_n\n        print(reboot_status)\n        return reboot_status\n\n\ndef test_china_net():\n    om_cn = OpticalModemChinaNet()\n\n    om_cn.net_ip_o = om_cn.get_net_ip()\n\n    om_cn.login()  # 默认用户名、密码\n    om_cn.reboot()\n\n    time.sleep(10)\n    c = 3\n    while 1:\n        if c <= 0:\n            break\n        try:\n            om_cn.net_ip_n = om_cn.get_net_ip()\n            break\n        except Exception as e:\n            c -= 1\n            print(e)\n\n    om_cn.check_reboot_status()\n\n\ndef test_china_mobile():\n    om_cm = OpticalModemChinaMobile()\n\n    om_cm.net_ip_o = om_cm.get_net_ip()\n\n    
om_cm.login()\n    om_cm.reboot()\n\n    time.sleep(10)\n    c = 3\n    while 1:\n        if c <= 0:\n            break\n        try:\n            om_cm.net_ip_n = om_cm.get_net_ip()\n            break\n        except Exception as e:\n            c -= 1\n            print(e)\n\n    om_cm.check_reboot_status()\n\n\nif __name__ == '__main__':\n    # test_china_net()\n    test_china_mobile()\n"
  },
  {
    "path": "libs/redis_pub_sub.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: redis_pub_sub.py\n@time: 2018-02-10 15:24\n\"\"\"\n\nimport redis\n\n\nclass RedisPubSub(object):\n    \"\"\"\n    Pub/Sub\n        队列中存储的数据必须是序列化之后的数据\n        生产消息: 入队前, 序列化\n        消费消息: 出队后, 反序列化\n    \"\"\"\n\n    def __init__(self, name, namespace='pub/sub', redis_client=None, **redis_kwargs):\n        \"\"\"The default connection parameters are: host='localhost', port=6379, db=0\"\"\"\n        self.__db = redis_client or redis.Redis(**redis_kwargs)\n        self.key = '%s:%s' % (namespace, name)\n\n    def pub(self, k, v):\n        \"\"\"\n        Pub\n        :param k:\n        :param v:\n        :return:\n        \"\"\"\n        ch = '%s:%s' % (self.key, k)\n        self.__db.publish(ch, v)\n\n    def sub(self, k):\n        \"\"\"\n        Sub\n        :param k:\n        :return:\n        \"\"\"\n        ps = self.__db.pubsub()\n        ch = '%s:%s' % (self.key, k)\n        ps.subscribe(ch)\n        for item in ps.listen():\n            # {'pattern': None, 'type': 'subscribe', 'channel': 'pub/sub:test:hh', 'data': 1L}\n            yield item\n            if item['type'] == 'message':\n                yield item.get('data')\n\n    def p_sub(self, k):\n        \"\"\"\n        PSub\n        订阅一个或多个符合给定模式的频道\n        每个模式以 * 作为匹配符\n        注意 psubscribe 与 subscribe 区别\n        :param k:\n        :return:\n        \"\"\"\n        ps = self.__db.pubsub()\n        ch = '%s:%s' % (self.key, k)\n        ps.psubscribe(ch)\n        for item in ps.listen():\n            # {'pattern': None, 'type': 'psubscribe', 'channel': 'pub/sub:test:*:hh', 'data': 1L}\n            # yield item\n            if item['type'] == 'pmessage':\n                # {'pattern': 'pub/sub:test:*:hh', 'type': 'pmessage', 'channel': 'pub/sub:test:aa:hh', 'data': '123'}\n                yield item.get('data')\n\n    def sub_not_loop(self, k):\n        \"\"\"\n        Sub 
非无限循环，取到结果即退出\n        :param k:\n        :return:\n        \"\"\"\n        ps = self.__db.pubsub()\n        ch = '%s:%s' % (self.key, k)\n        ps.subscribe(ch)\n        for item in ps.listen():\n            if item['type'] == 'message':\n                return item.get('data')\n\n    def p_sub_not_loop(self, k):\n        \"\"\"\n        PSub 非无限循环，取到结果即退出\n        :param k:\n        :return:\n        \"\"\"\n        ps = self.__db.pubsub()\n        ch = '%s:%s' % (self.key, k)\n        ps.psubscribe(ch)\n        for item in ps.listen():\n            if item['type'] == 'pmessage':\n                return item.get('data')\n"
  },
  {
    "path": "libs/redis_queue.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: redis_queue.py\n@time: 2018-02-10 15:25\n\"\"\"\n\nimport redis\n\n\nclass RedisQueue(object):\n    \"\"\"Simple Queue with Redis Backend\"\"\"\n\n    def __init__(self, name, namespace='queue', redis_client=None, **redis_kwargs):\n        \"\"\"The default connection parameters are: host='localhost', port=6379, db=0\"\"\"\n        self.__db = redis_client or redis.Redis(**redis_kwargs)\n        self.key = '%s:%s' % (namespace, name)\n\n    def qsize(self):\n        \"\"\"Return the approximate size of the queue.\"\"\"\n        return self.__db.llen(self.key)\n\n    def empty(self):\n        \"\"\"Return True if the queue is empty, False otherwise.\"\"\"\n        return self.qsize() == 0\n\n    def put(self, item):\n        \"\"\"Put item into the queue.\"\"\"\n        self.__db.rpush(self.key, item)\n\n    def get(self, block=True, timeout=None):\n        \"\"\"Remove and return an item from the queue.\n\n        If optional args block is true and timeout is None (the default), block\n        if necessary until an item is available.\"\"\"\n        if block:\n            # ('queue:test', 'hello world')\n            item = self.__db.blpop(self.key, timeout=timeout)\n        else:\n            # hello world\n            item = self.__db.lpop(self.key)\n\n        if isinstance(item, tuple):\n            item = item[1]\n        return item\n\n    def get_nowait(self):\n        \"\"\"Equivalent to get(False).\"\"\"\n        return self.get(False)\n"
  },
  {
    "path": "libs/rk.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: rk.py\n@time: 2018-02-10 15:25\n\"\"\"\n\nfrom hashlib import md5\n\nimport requests\n\n\nclass RKClient(object):\n    def __init__(self, username, password, soft_id, soft_key):\n        self.username = username\n        self.password = md5(password).hexdigest()\n        self.soft_id = soft_id\n        self.soft_key = soft_key\n        self.base_params = {\n            'username': self.username,\n            'password': self.password,\n            'softid': self.soft_id,\n            'softkey': self.soft_key,\n        }\n        self.headers = {\n            'Connection': 'Keep-Alive',\n            'Expect': '100-continue',\n            'User-Agent': 'ben',\n        }\n\n    def rk_create(self, im, im_type, timeout=60):\n        \"\"\"\n        im: 图片字节\n        im_type: 题目类型\n        \"\"\"\n        params = {\n            'typeid': im_type,\n            'timeout': timeout,\n        }\n        params.update(self.base_params)\n        files = {'image': ('a.jpg', im)}\n        r = requests.post(\n            'http://api.ruokuai.com/create.json',\n            data=params,\n            files=files,\n            headers=self.headers,\n            timeout=timeout\n        )\n        return r.json()\n\n    def rk_report_error(self, im_id):\n        \"\"\"\n        im_id:报错题目的ID\n        \"\"\"\n        params = {\n            'id': im_id,\n        }\n        params.update(self.base_params)\n        r = requests.post(\n            'http://api.ruokuai.com/reporterror.json',\n            data=params,\n            headers=self.headers,\n            timeout=30\n        )\n        return r.json()\n\n\nif __name__ == '__main__':\n    rc = RKClient('username', 'password', 'soft_id', 'soft_key')\n    im = open('a.jpg', 'rb').read()\n    print(rc.rk_create(im, 3040))\n"
  },
  {
    "path": "libs/weed_fs.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: weed_fs.py\n@time: 2018-02-10 15:25\n\"\"\"\n\nimport csv\n\n# from urlparse import urlparse                 # PY2\n# from urllib.parse import urlparse             # PY3\nfrom future.moves.urllib.parse import urlparse\n\nimport requests\n\nfrom config import current_config\n\n\nREQUESTS_TIME_OUT = current_config.REQUESTS_TIME_OUT\n\n\nclass WeedFSClient(object):\n    request_headers = {\n        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',\n        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',\n        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0'\n    }\n\n    def __init__(self, weed_fs_url):\n        self.weed_fs_url = weed_fs_url\n\n    def _get_assign(self):\n        \"\"\"\n        获取分配的资源（url fid）\n        接口消息 - 正确:\n            {\"fid\":\"1,014e123ade\",\"url\":\"127.0.0.1:8080\",\"publicUrl\":\"127.0.0.1:8080\",\"count\":1}\n        接口消息 - 错误:\n            {\"error\":\"No free volumes left!\"}\n        \"\"\"\n        url = '%s/dir/assign' % self.weed_fs_url\n        res = requests.get(url, timeout=REQUESTS_TIME_OUT).json()\n        if 'error' in res:\n            raise Exception(res['error'])\n        return res\n\n    def _get_locations(self, fid):\n        \"\"\"\n        获取文件服务器列表\n        {\"volumeId\":\"1\",\"locations\":[{\"url\":\"127.0.0.1:8080\",\"publicUrl\":\"127.0.0.1:8080\"}]}\n        \"\"\"\n        volume_id = fid.split(',')[0]\n        url = '%s/dir/lookup?volumeId=%s' % (self.weed_fs_url, volume_id)\n        return requests.get(url, timeout=REQUESTS_TIME_OUT).json()\n\n    def save_file(self, local_file_path=None, remote_file_path=None, file_obj=None):\n        \"\"\"\n        保存本地文件至weed_fs文件系统\n        {\"name\":\"test.csv\",\"size\":425429}\n        \"\"\"\n        assign = self._get_assign()\n        url = 
'http://%s/%s' % (assign['url'], assign['fid'])\n\n        if local_file_path:\n            file_obj = open(local_file_path, 'rb')\n        elif remote_file_path:\n            headers = {'Host': urlparse(remote_file_path).netloc}  # 防反爬, 指定图片 Host\n            headers.update(self.request_headers)\n            res = requests.get(remote_file_path, headers=headers, timeout=REQUESTS_TIME_OUT)\n            if res.status_code == 200:\n                file_obj = res.content\n            else:\n                raise Exception('File does not exist')\n        elif not file_obj:\n            raise Exception('File does not exist')\n\n        res = requests.post(url, files={'file': file_obj}, timeout=REQUESTS_TIME_OUT)\n        return dict(res.json(), **assign)\n\n    def get_file_url(self, fid, separator=None):\n        \"\"\"\n        获取文件链接\n        \"\"\"\n        locations = self._get_locations(fid)\n        public_url = locations['locations'][0]['publicUrl']\n        return 'http://%s/%s' % (public_url, fid.replace(',', separator) if separator else fid)\n\n    def read_csv(self, fid, encoding=None):\n        \"\"\"\n        逐行读取远程csv文件\n        :param fid:\n        :param encoding: 'gbk'/'utf-8'\n        :return:\n        \"\"\"\n        file_url = self.get_file_url(fid)\n        download = requests.get(file_url, timeout=REQUESTS_TIME_OUT)\n        csv_rows = csv.reader(download.iter_lines(), delimiter=',', quotechar='\"')\n        for csv_row in csv_rows:\n            line = [item.decode(encoding, 'ignore') if encoding else item for item in csv_row]\n            yield line\n"
  },
  {
    "path": "logs/index.html",
    "content": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"UTF-8\">\n    <title>Title</title>\n</head>\n<body>\n\n</body>\n</html>"
  },
  {
    "path": "maps/__init__.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: __init__.py.py\n@time: 2018-02-10 17:58\n\"\"\"\n\n\ndef func():\n    pass\n\n\nclass Main(object):\n    def __init__(self):\n        pass\n\n\nif __name__ == '__main__':\n    pass\n"
  },
  {
    "path": "maps/channel.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: channel.py\n@time: 2018-02-10 18:13\n\"\"\"\n\n\nchannel_name_map = {\n}\n"
  },
  {
    "path": "maps/platform.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: platform.py\n@time: 2018-02-10 17:58\n\"\"\"\n\n\nWEIXIN = 1\nWEIBO = 2\nTOUTIAO = 3\n\n\nplatform_name_map = {\n    1: u'微信',\n    2: u'微博',\n    3: u'头条',\n}\n"
  },
  {
    "path": "models/__init__.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: __init__.py.py\n@time: 2018-02-10 17:10\n\"\"\"\n\n\ndef func():\n    pass\n\n\nclass Main(object):\n    def __init__(self):\n        pass\n\n\nif __name__ == '__main__':\n    pass\n"
  },
  {
    "path": "models/news.py",
    "content": "# coding: utf-8\nfrom sqlalchemy import Column, DateTime, Index, Integer, String, text\nfrom sqlalchemy.ext.declarative import declarative_base\n\n\nBase = declarative_base()\nmetadata = Base.metadata\n\n\ndef to_dict(self):\n    return {c.name: getattr(self, c.name, None) for c in self.__table__.columns}\n\nBase.to_dict = to_dict\n\n\nclass Channel(Base):\n    __tablename__ = 'channel'\n\n    id = Column(Integer, primary_key=True)\n    code = Column(String(20), unique=True)\n    name = Column(String(20))\n    description = Column(String(500), server_default=text(\"''\"))\n    create_time = Column(DateTime, nullable=False, server_default=text(\"CURRENT_TIMESTAMP\"))\n    update_time = Column(DateTime, nullable=False, server_default=text(\"CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP\"))\n\n\nclass FetchResult(Base):\n    __tablename__ = 'fetch_result'\n    __table_args__ = (\n        Index('idx_platform_author_id', 'platform_id', 'article_author_id'),\n        Index('idx_platform_article_id', 'platform_id', 'article_id', unique=True)\n    )\n\n    id = Column(Integer, primary_key=True)\n    task_id = Column(Integer, nullable=False, index=True)\n    platform_id = Column(Integer, server_default=text(\"'0'\"))\n    platform_name = Column(String(50), server_default=text(\"''\"))\n    channel_id = Column(Integer, server_default=text(\"'0'\"))\n    channel_name = Column(String(50), server_default=text(\"''\"))\n    article_id = Column(String(50), server_default=text(\"''\"))\n    article_url = Column(String(512), server_default=text(\"''\"))\n    article_title = Column(String(100), server_default=text(\"''\"))\n    article_author_id = Column(String(100), server_default=text(\"''\"))\n    article_author_name = Column(String(100), server_default=text(\"''\"))\n    article_tags = Column(String(100), server_default=text(\"''\"))\n    article_abstract = Column(String(500), server_default=text(\"''\"))\n    article_content = Column(String)\n    
article_pub_time = Column(DateTime, index=True, server_default=text(\"'1000-01-01 00:00:00'\"))\n    create_time = Column(DateTime, nullable=False, index=True, server_default=text(\"CURRENT_TIMESTAMP\"))\n    update_time = Column(DateTime, nullable=False, index=True, server_default=text(\"CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP\"))\n\n\nclass FetchTask(Base):\n    __tablename__ = 'fetch_task'\n    __table_args__ = (\n        Index('idx_platform_follow_id', 'platform_id', 'follow_id', unique=True),\n    )\n\n    id = Column(Integer, primary_key=True)\n    platform_id = Column(Integer, server_default=text(\"'0'\"))\n    channel_id = Column(Integer, server_default=text(\"'0'\"))\n    follow_id = Column(String(45), server_default=text(\"''\"))\n    follow_name = Column(String(45), server_default=text(\"''\"))\n    avatar_url = Column(String(512), server_default=text(\"''\"))\n    fetch_url = Column(String(512), server_default=text(\"''\"))\n    flag_enabled = Column(Integer, server_default=text(\"'0'\"))\n    description = Column(String(500), server_default=text(\"''\"))\n    create_time = Column(DateTime, nullable=False, server_default=text(\"CURRENT_TIMESTAMP\"))\n    update_time = Column(DateTime, nullable=False, server_default=text(\"CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP\"))\n\n\nclass LogTaskScheduling(Base):\n    __tablename__ = 'log_task_scheduling'\n\n    id = Column(Integer, primary_key=True)\n    platform_id = Column(Integer, server_default=text(\"'0'\"))\n    platform_name = Column(String(50), server_default=text(\"''\"))\n    spider_name = Column(String(45), server_default=text(\"''\"))\n    task_quantity = Column(Integer, server_default=text(\"'0'\"))\n    create_time = Column(DateTime, nullable=False, server_default=text(\"CURRENT_TIMESTAMP\"))\n    update_time = Column(DateTime, nullable=False, server_default=text(\"CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP\"))\n"
  },
  {
    "path": "news/__init__.py",
    "content": ""
  },
  {
    "path": "news/items.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Define here the models for your scraped items\n#\n# See documentation in:\n# http://doc.scrapy.org/en/latest/topics/items.html\n\nimport scrapy\n\n\nclass FetchTaskItem(scrapy.Item):\n    \"\"\"\n    table_name: fetch_task\n    primary_key: id\n    \"\"\"\n    follow_id = scrapy.Field()\n    fetch_url = scrapy.Field()\n    description = scrapy.Field()\n    platform_id = scrapy.Field()\n    channel_id = scrapy.Field()\n    avatar_url = scrapy.Field()\n    flag_enabled = scrapy.Field()\n    follow_name = scrapy.Field()\n\n\nclass FetchResultItem(scrapy.Item):\n    \"\"\"\n    table_name: fetch_result\n    primary_key: id\n    \"\"\"\n    article_title = scrapy.Field()\n    platform_name = scrapy.Field()\n    task_id = scrapy.Field()\n    channel_id = scrapy.Field()\n    article_author_name = scrapy.Field()\n    article_content = scrapy.Field()\n    platform_id = scrapy.Field()\n    channel_name = scrapy.Field()\n    article_url = scrapy.Field()\n    article_abstract = scrapy.Field()\n    article_author_id = scrapy.Field()\n    article_tags = scrapy.Field()\n    article_id = scrapy.Field()\n    article_pub_time = scrapy.Field()\n\n\nclass ChannelItem(scrapy.Item):\n    \"\"\"\n    table_name: channel\n    primary_key: id\n    \"\"\"\n    code = scrapy.Field()\n    description = scrapy.Field()\n    name = scrapy.Field()\n"
  },
  {
    "path": "news/middlewares/__init__.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: __init__.py.py\n@time: 2018-02-10 17:10\n\"\"\"\n\n\ndef func():\n    pass\n\n\nclass Main(object):\n    def __init__(self):\n        pass\n\n\nif __name__ == '__main__':\n    pass\n"
  },
  {
    "path": "news/middlewares/anti_spider.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# http://doc.scrapy.org/en/latest/topics/spider-middleware.html\n\nfrom __future__ import unicode_literals\n\nimport time\nfrom scrapy.exceptions import IgnoreRequest\nfrom scrapy.exceptions import NotConfigured\n\nfrom tools.cookies import del_cookies\nfrom tasks.jobs_weixin import set_anti_spider_task, sub_anti_spider\n\n\nclass AntiSpiderMiddleware(object):\n    \"\"\"\n    反爬中间件\n    配置说明:\n        RETRY_ENABLED 默认: True\n        RETRY_TIMES 默认: 2\n        RETRY_HTTP_CODES 默认: [500, 502, 503, 504, 400, 408]\n    \"\"\"\n    def __init__(self, settings):\n        if not settings.getbool('RETRY_ENABLED'):\n            raise NotConfigured\n        self.max_retry_times = settings.getint('RETRY_TIMES')\n        self.retry_http_codes = set(int(x) for x in settings.getlist('RETRY_HTTP_CODES'))\n        self.priority_adjust = settings.getint('RETRY_PRIORITY_ADJUST') or 1\n\n    @classmethod\n    def from_crawler(cls, crawler):\n        return cls(crawler.settings)\n\n    def process_request(self, request, spider):\n        # 处理微信反爬(反爬机制一, sogou)\n        if spider.name in ['weixin'] and 'antispider' in request.url:\n            # 获取来源链接\n            redirect_urls = request.meta['redirect_urls']\n\n            # 清理失效 cookies\n            cookies_id = request.meta['cookiejar']\n            del_cookies(spider.name, cookies_id)\n\n            # spider.log(message='AntiSpider cookies_id: %s; url: %s' % (cookies_id, redirect_urls[0]))\n            raise IgnoreRequest(\n                'Spider: %s, AntiSpider cookies_id: %s; url: %s' % (spider.name, cookies_id, redirect_urls[0]))\n\n    def process_response(self, request, response, spider):\n        # 处理微信反爬(反爬机制二, weixin)\n        if spider.name in ['weixin']:\n            title = response.xpath('//title/text()').extract_first(default='').strip()\n            if title == '请输入验证码':\n                # 
设置反爬处理任务\n                msg = {\n                    'url': response.url,\n                    'time': time.strftime('%Y-%m-%d %H:%M:%S')\n                }\n                set_anti_spider_task(spider.name, msg)\n\n                # 订阅处理结果\n                anti_spider_result = sub_anti_spider(spider.name)\n                if not anti_spider_result.get('status'):\n                    return response\n\n                # 请求重试\n                retry_req = request.copy()\n                retry_req.dont_filter = True  # 必须设置(禁止重复请求被过滤掉)\n                retry_req.priority = request.priority + self.priority_adjust\n                return retry_req\n        return response\n"
  },
  {
    "path": "news/middlewares/content_type.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# http://doc.scrapy.org/en/latest/topics/spider-middleware.html\n\n\nclass ContentTypeGb2312Middleware(object):\n    \"\"\"\n    处理不规范的页面（优先级降低至580之后才能生效）\n    原因:\n        默认配置的 DOWNLOADER_MIDDLEWARES 包含 MetaRefreshMiddleware\n        当请求页面存在如 Content-Location 类似的 header 时, 会触发重定向请求\n    指定 Content-Type 为 gb2312\n    \"\"\"\n    def process_response(self, request, response, spider):\n        response.headers['Content-Type'] = 'text/html; charset=gb2312'\n        return response\n"
  },
  {
    "path": "news/middlewares/de_duplication_request.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# http://doc.scrapy.org/en/latest/topics/spider-middleware.html\n\n\nfrom scrapy.exceptions import IgnoreRequest\n\nfrom tools.duplicate import is_dup_detail\n\n\nclass DeDuplicationRequestMiddleware(object):\n    \"\"\"\n    去重 - 请求\n    (数据结构：集合)\n    \"\"\"\n    def process_request(self, request, spider):\n        if not request.url:\n            return None\n        channel_id = request.meta.get('channel_id', 0)\n        # 处理详情页面（忽略列表页面）与pipeline配合\n        if is_dup_detail(request.url, spider.name, channel_id):\n            raise IgnoreRequest(\"Spider: %s, DeDuplicationRequest: %s\" % (spider.name, request.url))\n"
  },
  {
    "path": "news/middlewares/httpproxy.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# http://doc.scrapy.org/en/latest/topics/spider-middleware.html\n\n\nfrom scrapy.exceptions import NotConfigured\nfrom tools.proxies import get_proxy, del_proxy\n\n\nclass HttpProxyMiddleware(object):\n    \"\"\"\n    代理中间件\n    \"\"\"\n    def __init__(self, settings):\n        if not settings.getbool('RETRY_ENABLED'):\n            raise NotConfigured\n        self.max_retry_times = settings.getint('RETRY_TIMES')\n        self.retry_http_codes = set(int(x) for x in settings.getlist('RETRY_HTTP_CODES'))\n        self.priority_adjust = settings.getint('RETRY_PRIORITY_ADJUST') or 1\n\n    @classmethod\n    def from_crawler(cls, crawler):\n        return cls(crawler.settings)\n\n    def process_request(self, request, spider):\n        # request.meta['proxy'] = \"http://YOUR_PROXY_IP:PORT\"\n        # 当前请求代理（保证重试过程，代理一致）\n        request_proxy = request.meta.get('proxy') or get_proxy(spider.name)\n        request.meta['proxy'] = request_proxy\n        spider.log(request.meta)\n\n    def process_exception(self, request, exception, spider):\n        error_proxy = request.meta.get('proxy')\n        if not error_proxy:\n            return None\n        # 重试失败（默认重试2次，共请求3次），删除代理\n        if request.meta.get('retry_times', 0) >= self.max_retry_times:\n            del_proxy(spider.name, error_proxy)\n            spider.log('%s del proxy: %s, error reason: %s' % (spider.name, error_proxy, exception))\n            return None\n"
  },
  {
    "path": "news/middlewares/useragent.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# http://doc.scrapy.org/en/latest/topics/spider-middleware.html\n\n\nimport random\n\n\nclass UserAgentMiddleware(object):\n    \"\"\"\n    Randomly rotate user agents based on a list of predefined ones\n    \"\"\"\n    def __init__(self, agents):\n        self.agents = agents\n\n    @classmethod\n    def from_crawler(cls, crawler):\n        return cls(crawler.settings.getlist('USER_AGENTS'))\n\n    def process_request(self, request, spider):\n        request.headers.setdefault('User-Agent', random.choice(self.agents))\n        # request.headers.setdefault('User-Agent', self.agents[0])\n"
  },
  {
    "path": "news/middlewares.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Define here the models for your spider middleware\n#\n# See documentation in:\n# https://doc.scrapy.org/en/latest/topics/spider-middleware.html\n\nfrom scrapy import signals\n\n\nclass NewsSpiderMiddleware(object):\n    # Not all methods need to be defined. If a method is not defined,\n    # scrapy acts as if the spider middleware does not modify the\n    # passed objects.\n\n    @classmethod\n    def from_crawler(cls, crawler):\n        # This method is used by Scrapy to create your spiders.\n        s = cls()\n        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)\n        return s\n\n    def process_spider_input(self, response, spider):\n        # Called for each response that goes through the spider\n        # middleware and into the spider.\n\n        # Should return None or raise an exception.\n        return None\n\n    def process_spider_output(self, response, result, spider):\n        # Called with the results returned from the Spider, after\n        # it has processed the response.\n\n        # Must return an iterable of Request, dict or Item objects.\n        for i in result:\n            yield i\n\n    def process_spider_exception(self, response, exception, spider):\n        # Called when a spider or process_spider_input() method\n        # (from other spider middleware) raises an exception.\n\n        # Should return either None or an iterable of Response, dict\n        # or Item objects.\n        pass\n\n    def process_start_requests(self, start_requests, spider):\n        # Called with the start requests of the spider, and works\n        # similarly to the process_spider_output() method, except\n        # that it doesn’t have a response associated.\n\n        # Must return only requests (not items).\n        for r in start_requests:\n            yield r\n\n    def spider_opened(self, spider):\n        spider.logger.info('Spider opened: %s' % spider.name)\n\n\nclass 
NewsDownloaderMiddleware(object):\n    # Not all methods need to be defined. If a method is not defined,\n    # scrapy acts as if the downloader middleware does not modify the\n    # passed objects.\n\n    @classmethod\n    def from_crawler(cls, crawler):\n        # This method is used by Scrapy to create your spiders.\n        s = cls()\n        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)\n        return s\n\n    def process_request(self, request, spider):\n        # Called for each request that goes through the downloader\n        # middleware.\n\n        # Must either:\n        # - return None: continue processing this request\n        # - or return a Response object\n        # - or return a Request object\n        # - or raise IgnoreRequest: process_exception() methods of\n        #   installed downloader middleware will be called\n        return None\n\n    def process_response(self, request, response, spider):\n        # Called with the response returned from the downloader.\n\n        # Must either;\n        # - return a Response object\n        # - return a Request object\n        # - or raise IgnoreRequest\n        return response\n\n    def process_exception(self, request, exception, spider):\n        # Called when a download handler or a process_request()\n        # (from other downloader middleware) raises an exception.\n\n        # Must either:\n        # - return None: continue processing this exception\n        # - return a Response object: stops process_exception() chain\n        # - return a Request object: stops process_exception() chain\n        pass\n\n    def spider_opened(self, spider):\n        spider.logger.info('Spider opened: %s' % spider.name)\n"
  },
  {
    "path": "news/pipelines/__init__.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: __init__.py.py\n@time: 2018-02-10 17:10\n\"\"\"\n\n\ndef func():\n    pass\n\n\nclass Main(object):\n    def __init__(self):\n        pass\n\n\nif __name__ == '__main__':\n    pass\n"
  },
  {
    "path": "news/pipelines/de_duplication_request.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES setting\n# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html\n\n\nfrom news.items import FetchResultItem\n\nfrom tools.duplicate import is_dup_detail, add_dup_detail\n\n\nclass DeDuplicationRequestPipeline(object):\n    \"\"\"\n    去重 - 请求\n    注意:\n        1、置于数据存储 pipeline 之后\n        2、与 DeDuplicationRequestMiddleware 配合使用\n    \"\"\"\n    def process_item(self, item, spider):\n\n        spider_name = spider.name\n        if isinstance(item, FetchResultItem):\n            # 详细页url 加入去重集合\n            if not is_dup_detail(item['article_url'], spider_name, item['channel_id']):\n                add_dup_detail(item['article_url'], spider_name, item['channel_id'])\n        return item\n"
  },
  {
    "path": "news/pipelines/de_duplication_store_mysql.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES setting\n# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html\n\n\nfrom models.news import FetchResult\nfrom news.items import FetchResultItem\nfrom apps.client_db import db_session_mysql\nfrom tools.weixin import get_finger\nfrom maps.platform import WEIXIN, WEIBO\n\nfrom scrapy.exceptions import DropItem\n\n\nclass DeDuplicationStoreMysqlPipeline(object):\n    \"\"\"\n    去重 - 入库\n    注意:\n        1、置于数据存储 pipeline 之前\n    \"\"\"\n    def process_item(self, item, spider):\n\n        session = db_session_mysql()\n        try:\n            if isinstance(item, FetchResultItem):\n                if spider.name == 'weixin':\n                    # 标题（微信只能通过标题去重, 因为链接带过期签名）\n                    article_id_count = session.query(FetchResult) \\\n                        .filter(FetchResult.platform_id == WEIXIN,\n                                FetchResult.article_id == get_finger(item['article_title'])) \\\n                        .count()\n                    if article_id_count:\n                        raise DropItem(\n                            '%s Has been duplication of article_title: %s' % (spider.name, item['article_title']))\n\n                if spider.name == 'weibo':\n                    # 详细链接（微博可以直接通过链接去重）\n                    article_url_count = session.query(FetchResult) \\\n                        .filter(FetchResult.platform_id == WEIBO,\n                                FetchResult.article_id == get_finger(item['article_url'])) \\\n                        .count()\n                    if article_url_count:\n                        raise DropItem(\n                            '%s Has been duplication of article_url: %s' % (spider.name, item['article_url']))\n\n            return item\n        except Exception as e:\n            raise e\n        finally:\n            session.close()\n"
  },
  {
    "path": "news/pipelines/exporter_csv.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES setting\n# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html\n\n\nfrom scrapy import signals\nfrom scrapy.exporters import CsvItemExporter\n\n\nclass CsvExportPipeline(object):\n    def __init__(self):\n        self.files = {}\n        self.exporter = None\n\n    @classmethod\n    def from_crawler(cls, crawler):\n        pipeline = cls()\n        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)\n        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)\n        return pipeline\n\n    def spider_opened(self, spider):\n        file_csv = open('%s_items.csv' % spider.name, 'w+b')\n        self.files[spider] = file_csv\n        self.exporter = CsvItemExporter(file_csv)\n        self.exporter.start_exporting()\n\n    def spider_closed(self, spider):\n        self.exporter.finish_exporting()\n        file_csv = self.files.pop(spider)\n        file_csv.close()\n\n    def process_item(self, item, spider):\n        self.exporter.export_item(item)\n        return item\n"
  },
  {
    "path": "news/pipelines/img_remote_to_local_fs.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES setting\n# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html\n\n\nimport re\n\n# from urlparse import urljoin                  # PY2\n# from urllib.parse import urljoin              # PY3\nfrom future.moves.urllib.parse import urljoin\n\nfrom news.items import FetchResultItem\n\nfrom libs.weed_fs import WeedFSClient\nfrom config import current_config\n\nWEED_FS_URL = current_config.WEED_FS_URL\n\nweed_fs_client = WeedFSClient(WEED_FS_URL)\n\n\ndef remote_to_local(remote_file_path):\n    \"\"\"\n    保存远程图片文件\n    :param remote_file_path:\n    :return:\n    \"\"\"\n    remote_file_save_result = weed_fs_client.save_file(remote_file_path=remote_file_path)\n    local_file_url = weed_fs_client.get_file_url(remote_file_save_result['fid'], '/')\n    return local_file_url\n\n\ndef add_src(html_body, base=''):\n    \"\"\"\n    添加图片文件链接（1、添加真实链接；2、替换本地链接）\n    :param html_body:\n    :param base:\n    :return:\n    \"\"\"\n    rule = r'data-src=\"(.*?)\"'\n    img_data_src_list = re.compile(rule, re.I).findall(html_body)\n    for img_src in img_data_src_list:\n        # 处理相对链接\n        if base:\n            new_img_src = urljoin(base, img_src)\n        if new_img_src.startswith('/'):\n            continue\n        # 远程转本地\n        local_img_src = remote_to_local(new_img_src)\n        img_dict = {\n            'img_src': img_src,\n            'local_img_src': local_img_src\n        }\n        html_body = html_body.replace(img_src, '%(img_src)s\" src=\"%(local_img_src)s' % img_dict)\n    return html_body\n\n\ndef replace_src(html_body, base=''):\n    \"\"\"\n    替换图片文件链接（替换本地链接）\n    :param html_body:\n    :param base:\n    :return:\n    \"\"\"\n    rule = r'src=\"(.*?)\"'\n    img_data_src_list = re.compile(rule, re.I).findall(html_body)\n    for img_src in img_data_src_list:\n        # 处理//,补充协议\n        if 
img_src.startswith('//'):\n            img_src = 'http:%s' % img_src\n        # 处理相对链接\n        if base:\n            new_img_src = urljoin(base, img_src)\n        if new_img_src.startswith('/'):\n            continue\n        # 远程转本地\n        local_img_src = remote_to_local(new_img_src)\n        img_dict = {\n            'img_src': img_src,\n            'local_img_src': local_img_src\n        }\n        html_body = html_body.replace(img_src, '%(local_img_src)s\" data-src=\"%(img_src)s' % img_dict)\n    return html_body\n\n\nclass ImgRemoteToLocalFSPipeline(object):\n    \"\"\"\n    图片 远程链接 转 本地文件系统链接\n    注意:\n        1、置于数据存储 pipeline 之前\n    \"\"\"\n\n    def process_item(self, item, spider):\n\n        spider_name = spider.name\n        # 读取抓取内容\n        if isinstance(item, FetchResultItem):\n            if spider_name in ['weixin']:\n                html_body = item['article_content']\n                base = item['article_url']\n                item['article_content'] = add_src(html_body, base)\n            if spider_name in ['weibo']:\n                html_body = item['article_content']\n                base = item['article_url']\n                item['article_content'] = replace_src(html_body, base)\n            if spider_name in ['toutiao', 'toutiao_m']:\n                html_body = item['article_content']\n                base = item['article_url']\n                item['article_content'] = replace_src(html_body, base)\n        return item\n"
  },
  {
    "path": "news/pipelines/store_mysql.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES setting\n# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html\n\n\nfrom models.news import FetchResult\nfrom news.items import FetchResultItem\nfrom apps.client_db import db_session_mysql\n\n\nclass StoreMysqlPipeline(object):\n    \"\"\"\n    基于 MySQL 的存储\n    \"\"\"\n\n    def process_item(self, item, spider):\n        session = db_session_mysql()\n        try:\n            if isinstance(item, FetchResultItem):\n                fetch_result = FetchResult(**item)\n                # 数据入库\n                session.add(fetch_result)\n                session.flush()\n                # session.commit()\n            return item\n        except Exception as e:\n            raise e\n        finally:\n            session.close()\n"
  },
  {
    "path": "news/pipelines.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Define your item pipelines here\n#\n# Don't forget to add your pipeline to the ITEM_PIPELINES setting\n# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html\n\n\nclass NewsPipeline(object):\n    def process_item(self, item, spider):\n        return item\n"
  },
  {
    "path": "news/settings.py",
    "content": "# -*- coding: utf-8 -*-\n\n# Scrapy settings for news project\n#\n# For simplicity, this file contains only settings considered important or\n# commonly used. You can find more settings consulting the documentation:\n#\n#     https://doc.scrapy.org/en/latest/topics/settings.html\n#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html\n#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html\n\nBOT_NAME = 'news'\n\nSPIDER_MODULES = ['news.spiders']\nNEWSPIDER_MODULE = 'news.spiders'\n\n\n# Crawl responsibly by identifying yourself (and your website) on the user-agent\n#USER_AGENT = 'news (+http://www.yourdomain.com)'\n\n# Obey robots.txt rules\nROBOTSTXT_OBEY = False\n\n# Configure maximum concurrent requests performed by Scrapy (default: 16)\n#CONCURRENT_REQUESTS = 32\n\n# Configure a delay for requests for the same website (default: 0)\n# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay\n# See also autothrottle settings and docs\nDOWNLOAD_DELAY = 2\n# The download delay setting will honor only one of:\n#CONCURRENT_REQUESTS_PER_DOMAIN = 16\n#CONCURRENT_REQUESTS_PER_IP = 16\n\n# Disable cookies (enabled by default)\nCOOKIES_ENABLED = True\nCOOKIES_DEBUG = True\n\n# Disable Telnet Console (enabled by default)\n#TELNETCONSOLE_ENABLED = False\n\n# Override the default request headers:\n#DEFAULT_REQUEST_HEADERS = {\n#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',\n#   'Accept-Language': 'en',\n#}\nDEFAULT_REQUEST_HEADERS = {\n  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',\n  'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',\n}\n\n# Enable or disable spider middlewares\n# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html\n#SPIDER_MIDDLEWARES = {\n#    'news.middlewares.NewsSpiderMiddleware': 543,\n#}\n\n# Enable or disable downloader middlewares\n# See 
https://doc.scrapy.org/en/latest/topics/downloader-middleware.html\n#DOWNLOADER_MIDDLEWARES = {\n#    'news.middlewares.NewsDownloaderMiddleware': 543,\n#}\n\n# Enable or disable extensions\n# See https://doc.scrapy.org/en/latest/topics/extensions.html\n#EXTENSIONS = {\n#    'scrapy.extensions.telnet.TelnetConsole': None,\n#}\n\n# Configure item pipelines\n# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html\n#ITEM_PIPELINES = {\n#    'news.pipelines.NewsPipeline': 300,\n#}\nITEM_PIPELINES = {\n   'news.pipelines.store_mysql.StoreMysqlPipeline': 400,\n}\n\n# Enable and configure the AutoThrottle extension (disabled by default)\n# See https://doc.scrapy.org/en/latest/topics/autothrottle.html\n#AUTOTHROTTLE_ENABLED = True\n# The initial download delay\n#AUTOTHROTTLE_START_DELAY = 5\n# The maximum download delay to be set in case of high latencies\n#AUTOTHROTTLE_MAX_DELAY = 60\n# The average number of requests Scrapy should be sending in parallel to\n# each remote server\n#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0\n# Enable showing throttling stats for every response received:\n#AUTOTHROTTLE_DEBUG = False\n\n# Enable and configure HTTP caching (disabled by default)\n# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings\n#HTTPCACHE_ENABLED = True\n#HTTPCACHE_EXPIRATION_SECS = 0\n#HTTPCACHE_DIR = 'httpcache'\n#HTTPCACHE_IGNORE_HTTP_CODES = []\n#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'\n\n# USER_AGENTS\nUSER_AGENTS = [\n    \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36\",\n    \"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)\",\n    \"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)\",\n    \"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 
5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)\",\n    \"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)\",\n    \"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)\",\n    \"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)\",\n    \"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)\",\n    \"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)\",\n    \"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6\",\n    \"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1\",\n    \"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0\",\n    \"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5\",\n    \"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6\",\n    \"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11\",\n    \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20\",\n    \"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52\"\n]\n"
  },
  {
    "path": "news/spiders/__init__.py",
    "content": "# This package will contain the spiders of your Scrapy project\n#\n# Please refer to the documentation for information on how to create and manage\n# your spiders.\n"
  },
  {
    "path": "news/spiders/ip.py",
    "content": "# -*- coding: utf-8 -*-\nimport scrapy\n\n\nclass IpSpider(scrapy.Spider):\n    \"\"\"\n    IP代理测试 蜘蛛\n    重试3次，每次超时10秒\n    使用：\n    进入项目目录\n    $ scrapy crawl ip\n    \"\"\"\n    name = \"ip\"\n    allowed_domains = [\"ip.cn\"]\n    start_urls = (\n        'https://ip.cn',\n    )\n\n    custom_settings = dict(\n        COOKIES_ENABLED=True,\n        DEFAULT_REQUEST_HEADERS={\n            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',\n            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',\n            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0'\n        },\n        USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0',\n        DOWNLOADER_MIDDLEWARES={\n            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,\n            'news.middlewares.useragent.UserAgentMiddleware': 500,\n            'news.middlewares.httpproxy.HttpProxyMiddleware': 720,  # 代理（cookie需要与代理IP关联）\n        },\n        ITEM_PIPELINES={\n            'news.pipelines.store_mysql.StoreMysqlPipeline': 450,\n        },\n        DOWNLOAD_TIMEOUT=10\n    )\n\n    def parse(self, response):\n        info = response.xpath('//div[@class=\"well\"]//code/text()').extract()\n        ip_info = dict(zip(['ip', 'address'], info))\n        yield ip_info\n"
  },
  {
    "path": "news/spiders/toutiao_m.py",
    "content": "# -*- coding: utf-8 -*-\n\n\nfrom __future__ import print_function\nfrom __future__ import unicode_literals\n\nimport json\nimport time\n\nimport scrapy\n\nfrom apps.client_db import get_item\nfrom maps.channel import channel_name_map\nfrom maps.platform import platform_name_map\nfrom models.news import FetchTask\nfrom news.items import FetchResultItem\nfrom tools.date_time import time_local_to_utc\nfrom tools.scrapy_tasks import pop_task\nfrom tools.toutiao_m import get_as_cp, ParseJsTt, parse_toutiao_js_body\nfrom tools.url import get_update_url\n\n\nclass ToutiaoMSpider(scrapy.Spider):\n    \"\"\"\n    头条蜘蛛\n    \"\"\"\n    name = 'toutiao_m'\n    allowed_domains = ['toutiao.com', 'snssdk.com']\n\n    custom_settings = dict(\n        COOKIES_ENABLED=True,\n        DEFAULT_REQUEST_HEADERS={\n            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',\n            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',\n            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0'\n        },\n        USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0',\n        DOWNLOADER_MIDDLEWARES={\n            'news.middlewares.de_duplication_request.DeDuplicationRequestMiddleware': 140,  # 去重请求\n            # 'news.middlewares.anti_spider.AntiSpiderMiddleware': 160,  # 反爬处理\n            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,\n            'news.middlewares.useragent.UserAgentMiddleware': 500,\n            # 'news.middlewares.httpproxy.HttpProxyMiddleware': 720,\n        },\n        ITEM_PIPELINES={\n            'news.pipelines.de_duplication_store_mysql.DeDuplicationStoreMysqlPipeline': 400,  # 去重存储\n            'news.pipelines.store_mysql.StoreMysqlPipeline': 450,\n            'news.pipelines.de_duplication_request.DeDuplicationRequestPipeline': 500,  # 去重请求\n        },\n        
DOWNLOAD_DELAY=0.5\n    )\n\n    # start_urls = ['http://toutiao.com/']\n    # start_urls = ['https://www.toutiao.com/ch/news_finance/']\n\n    def start_requests(self):\n        \"\"\"\n        入口准备\n        :return:\n        \"\"\"\n        url_params = {\n            'version_code': '6.4.2',\n            'version_name': '',\n            'device_platform': 'iphone',\n            'tt_from': 'weixin',\n            'utm_source': 'weixin',\n            'utm_medium': 'toutiao_ios',\n            'utm_campaign': 'client_share',\n            'wxshare_count': '1',\n        }\n\n        task_id = pop_task(self.name)\n\n        if not task_id:\n            print('%s task is empty' % self.name)\n            return\n        print('%s task id: %s' % (self.name, task_id))\n\n        task_item = get_item(FetchTask, task_id)\n        fetch_url = 'http://m.toutiao.com/profile/%s/' % task_item.follow_id\n        url_profile = get_update_url(fetch_url, url_params)\n        meta = {\n            'task_id': task_item.id,\n            'platform_id': task_item.platform_id,\n            'channel_id': task_item.channel_id,\n            'follow_id': task_item.follow_id,\n            'follow_name': task_item.follow_name,\n        }\n        yield scrapy.Request(url=url_profile, callback=self.get_profile, meta=meta)\n\n    def get_profile(self, response):\n        userid = response.xpath('//button[@itemid=\"topsharebtn\"]/@data-userid').extract_first(default='')\n        mediaid = response.xpath('//button[@itemid=\"topsharebtn\"]/@data-mediaid').extract_first(default='')\n\n        meta = dict(response.meta, userid=userid, mediaid=mediaid)\n\n        url = 'http://open.snssdk.com/jssdk_signature/'\n        url_params = {\n            'appid': 'wxe8b89be1715734a6',\n            'noncestr': 'Wm3WZYTPz0wzccnW',\n            'timestamp': '%13d' % (time.time() * 1000),\n            'callback': 'jsonp2',\n        }\n        url_jssdk_signature = get_update_url(url, url_params)\n        yield 
scrapy.Request(url=url_jssdk_signature, callback=self.jssdk_signature, meta=meta)\n\n    def jssdk_signature(self, response):\n        AS, CP = get_as_cp()\n        jsonp_index = 3\n\n        url = 'https://www.toutiao.com/pgc/ma/'\n        url_params = {\n            'page_type': 1,\n            'max_behot_time': '',\n            'uid': response.meta['userid'],\n            'media_id': response.meta['mediaid'],\n            'output': 'json',\n            'is_json': 1,\n            'count': 20,\n            'from': 'user_profile_app',\n            'version': 2,\n            'as': AS,\n            'cp': CP,\n            'callback': 'jsonp%d' % jsonp_index,\n        }\n        url_article_list = get_update_url(url, url_params)\n\n        meta = dict(response.meta, jsonp_index=jsonp_index)\n\n        yield scrapy.Request(url=url_article_list, callback=self.parse_article_list, meta=meta)\n\n    def parse_article_list(self, response):\n        \"\"\"\n        文章列表\n        :param response:\n        :return:\n        \"\"\"\n        body = response.body_as_unicode()\n        jsonp_text = 'jsonp%d' % response.meta.get('jsonp_index', 0)\n        result = json.loads(body.lstrip('%s(' % jsonp_text).rstrip(')'))\n        # 翻页 TODO FIX\n        has_more = result.get('has_more')\n        if has_more:\n            max_behot_time = result['next']['max_behot_time']\n            AS, CP = get_as_cp()\n            jsonp_index = response.meta.get('jsonp_index', 0) + 1\n\n            url_params_next = {\n                'max_behot_time': max_behot_time,\n                'as': AS,\n                'cp': CP,\n                'callback': 'jsonp%d' % jsonp_index,\n            }\n\n            url_article_list_next = get_update_url(response.url, url_params_next)\n\n            meta = dict(response.meta, jsonp_index=jsonp_index)\n            yield scrapy.Request(url=url_article_list_next, callback=self.parse_article_list, meta=meta)\n        # 详情\n        data_list = result.get('data', [])\n 
       for data_item in data_list:\n            detail_url = data_item.get('source_url')\n            meta = dict(response.meta, detail_url=detail_url)\n            yield scrapy.Request(url=detail_url, callback=self.parse_article_detail, meta=meta)\n\n    def parse_article_detail(self, response):\n        \"\"\"\n        文章详情\n        :param response:\n        :return:\n        \"\"\"\n        toutiao_body = response.body_as_unicode()\n        js_body = parse_toutiao_js_body(toutiao_body, response.meta['detail_url'])\n        if not js_body:\n            return\n        pj = ParseJsTt(js_body=js_body)\n\n        article_id = pj.parse_js_item_id()\n        article_title = pj.parse_js_title()\n        article_abstract = pj.parse_js_abstract()\n        article_content = pj.parse_js_content()\n        article_pub_time = pj.parse_js_pub_time()\n        article_tags = pj.parse_js_tags()\n\n        fetch_result_item = FetchResultItem()\n        fetch_result_item['task_id'] = response.meta['task_id']\n        fetch_result_item['platform_id'] = response.meta['platform_id']\n        fetch_result_item['platform_name'] = platform_name_map.get(response.meta['platform_id'], '')\n        fetch_result_item['channel_id'] = response.meta['channel_id']\n        fetch_result_item['channel_name'] = channel_name_map.get(response.meta['channel_id'], '')\n        fetch_result_item['article_id'] = article_id\n        fetch_result_item['article_title'] = article_title\n        fetch_result_item['article_author_id'] = response.meta['follow_id']\n        fetch_result_item['article_author_name'] = response.meta['follow_name']\n        fetch_result_item['article_pub_time'] = time_local_to_utc(article_pub_time).strftime('%Y-%m-%d %H:%M:%S')\n        fetch_result_item['article_url'] = response.url or response.meta['detail_url']\n        fetch_result_item['article_tags'] = article_tags\n        fetch_result_item['article_abstract'] = article_abstract\n        fetch_result_item['article_content'] = 
article_content\n\n        yield fetch_result_item\n"
  },
  {
    "path": "news/spiders/weibo.py",
    "content": "# -*- coding: utf-8 -*-\n\n\nfrom __future__ import print_function\nfrom __future__ import unicode_literals\n\nimport json\nimport re\nimport time\nfrom datetime import datetime\n\nimport scrapy\nimport six\nfrom lxml.html import fromstring, tostring\n\nfrom apps.client_db import get_item\nfrom maps.channel import channel_name_map\nfrom maps.platform import platform_name_map\nfrom models.news import FetchTask\nfrom news.items import FetchResultItem\nfrom tools.date_time import time_local_to_utc\nfrom tools.scrapy_tasks import pop_task\nfrom tools.url import get_update_url, get_request_finger\nfrom tools.weibo import get_su, get_login_data\n\n\nclass WeiboSpider(scrapy.Spider):\n    \"\"\"\n    微博蜘蛛\n    \"\"\"\n    name = 'weibo'\n    allowed_domains = ['weibo.com', 'weibo.cn', 'sina.com.cn', 'sina.cn']\n\n    custom_settings = dict(\n        COOKIES_ENABLED=True,\n        DEFAULT_REQUEST_HEADERS={\n            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',\n            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',\n            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0'\n        },\n        USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0',\n        DOWNLOADER_MIDDLEWARES={\n            'news.middlewares.de_duplication_request.DeDuplicationRequestMiddleware': 140,  # 去重请求\n            # 'news.middlewares.anti_spider.AntiSpiderMiddleware': 160,  # 反爬处理\n            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,\n            'news.middlewares.useragent.UserAgentMiddleware': 500,\n            # 'news.middlewares.httpproxy.HttpProxyMiddleware': 720,\n        },\n        ITEM_PIPELINES={\n            'news.pipelines.de_duplication_store_mysql.DeDuplicationStoreMysqlPipeline': 400,  # 去重存储\n            'news.pipelines.store_mysql.StoreMysqlPipeline': 450,\n            
'news.pipelines.de_duplication_request.DeDuplicationRequestPipeline': 500,  # 去重请求\n        },\n        DOWNLOAD_DELAY=0.5\n    )\n\n    passport_weibo_login_url = 'https://passport.weibo.cn/signin/login'\n\n    start_urls = ['http://weibo.cn/']\n\n    uid = 0\n\n    login_form_data = {\n        'username': '',\n        'password': '',\n        'savestate': '1',\n        'r': '',\n        'ec': '0',\n        'pagerefer': '',\n        'entry': 'mweibo',\n        'wentry': '',\n        'loginfrom': '',\n        'client_id': '',\n        'code': '',\n        'qq': '',\n        'mainpageflag': '1',\n        'hff': '',\n        'hfp': ''\n    }\n\n    def parse(self, response):\n        return self.passport_weibo_login()\n\n    def passport_weibo_login(self):\n        yield scrapy.Request(url=self.passport_weibo_login_url, callback=self.login_sina_sso_prelogin)\n\n    def login_sina_sso_prelogin(self, response):\n        login_data = get_login_data()\n        self.login_form_data.update(login_data)\n        login_sina_sso_prelogin_url = 'https://login.sina.com.cn/sso/prelogin.php'\n        query_payload = {\n            'checkpin': '1',\n            'entry': 'mweibo',\n            'su': get_su(login_data.get('username', '')),\n            'callback': 'jsonpcallback%13d' % (time.time()*1000),\n        }\n        request_url = get_update_url(login_sina_sso_prelogin_url, query_payload)\n\n        yield scrapy.Request(url=request_url, callback=self.passport_weibo_sso_login)\n\n    def passport_weibo_sso_login(self, response):\n        passport_weibo_sso_login_url = 'https://passport.weibo.cn/sso/login'\n\n        yield scrapy.FormRequest(\n            url=passport_weibo_sso_login_url,\n            formdata=self.login_form_data,\n            callback=self.after_login\n        )\n\n    def after_login(self, response):\n        data = {\n            'savestate': '1',\n            'callback': 'jsonpcallback%13d' % (time.time()*1000),\n        }\n\n        res = 
response.body_as_unicode()\n        info = json.loads(res)\n\n        crossdomainlist = info['data']['crossdomainlist']\n        self.uid = info['data']['uid']\n\n        url_weibo_com = get_update_url(crossdomainlist['weibo.com'], data)\n        url_sina_com_cn = get_update_url(crossdomainlist['sina.com.cn'], data)\n        url_weibo_cn = get_update_url(crossdomainlist['weibo.cn'], data)\n\n        url_items = {\n            'url_weibo_com': url_weibo_com,\n            'url_sina_com_cn': url_sina_com_cn,\n            'url_weibo_cn': url_weibo_cn,\n        }\n\n        meta = dict(response.meta, **url_items)\n\n        # 跨域处理 weibo.com\n        yield scrapy.Request(url=url_weibo_com, callback=self.crossdomain_weibo_com, meta=meta)\n\n    def crossdomain_weibo_com(self, response):\n        \"\"\"\n        跨域处理 weibo.com\n        :param response:\n        :return:\n        \"\"\"\n        # 跨域处理 sina.com.cn\n        url_sina_com_cn = response.meta['url_sina_com_cn']\n        yield scrapy.Request(url=url_sina_com_cn, callback=self.crossdomain_sina_com_cn, meta=response.meta)\n\n    def crossdomain_sina_com_cn(self, response):\n        \"\"\"\n        跨域处理 sina.com.cn\n        :param response:\n        :return:\n        \"\"\"\n        # 跨域处理 weibo.cn\n        url_weibo_cn = response.meta['url_weibo_cn']\n        yield scrapy.Request(url=url_weibo_cn, callback=self.crossdomain_weibo_cn, meta=response.meta)\n\n    def crossdomain_weibo_cn(self, response):\n        \"\"\"\n        跨域处理 weibo.cn\n        :param response:\n        :return:\n        \"\"\"\n        # 获取登录状态 weibo.cn\n        yield scrapy.Request(url='https://weibo.cn/', callback=self.weibo_cn_index)\n\n    def weibo_cn_index(self, response):\n        \"\"\"\n        获取登录状态\n        :param response:\n        :return:\n        \"\"\"\n        print(response.url)\n        title = response.xpath('//title/text()').extract_first()\n        if title == '我的首页':\n            print('登录成功')\n            # follow_url = 
'https://weibo.cn/%s/follow' % self.uid\n            # yield scrapy.Request(url=follow_url, callback=self.parse_follow_list)\n            # 获取登录状态 weibo.com\n            yield scrapy.Request(url='https://weibo.com/', callback=self.weibo_com_index)\n        else:\n            print('登录失败')\n\n    def weibo_com_index(self, response):\n        \"\"\"\n        获取登录状态\n        :param response:\n        :return:\n        \"\"\"\n        print(response.url)\n        title = response.xpath('//title/text()').extract_first(default='')\n        if '我的首页' in title:\n            print('登录成功')\n            # follow_url = 'https://weibo.cn/%s/follow' % self.uid\n            # yield scrapy.Request(url=follow_url, callback=self.parse_follow_list)\n            return self.get_article_task()\n        else:\n            print('登录失败')\n\n    def parse_follow_list(self, response):\n        \"\"\"\n        已关注列表\n        \"\"\"\n        print(response.url)\n        # 进入关注用户页面\n        follows = response.xpath('//table//tr/td/a[1]/@href').extract()\n        for follow in follows:\n            yield scrapy.Request(url=follow, callback=self.follow_home_list)\n\n        # 关注列表翻页\n        next_url = response.xpath('//div[@id=\"pagelist\"]//a[contains(text(), \"下页\")]/@href').extract_first(default='')\n        next_url = response.urljoin(next_url)\n        if next_url == response.url:\n            print('当前条件列表页最后一页：%s' % response.url)\n        else:\n            yield scrapy.Request(url=next_url, callback=self.parse_follow_list)\n\n    def follow_home_list(self, response):\n        \"\"\"\n        已关注用户首页列表\n        \"\"\"\n        contents = response.xpath('//div[@class=\"c\"]//span[@class=\"ctt\"]/text()').extract()\n        for content in contents:\n            print(content)\n\n    def get_article_task(self):\n        \"\"\"\n        文章抓取入口\n        :return:\n        \"\"\"\n        task_id = pop_task(self.name)\n\n        if not task_id:\n            print('%s task is empty' % self.name)\n            return\n        print('%s task id: %s' % (self.name, task_id))\n\n        task_item = get_item(FetchTask, task_id)\n\n        article_id = task_item.follow_id\n\n        article_list_url = 'https://weibo.com/p/%s/wenzhang' % article_id\n\n        meta = {\n            'task_id': task_item.id,\n            'platform_id': task_item.platform_id,\n            'channel_id': task_item.channel_id,\n            'follow_id': task_item.follow_id,\n            'follow_name': task_item.follow_name,\n        }\n\n        yield scrapy.Request(url=article_list_url, callback=self.parse_article_list, meta=meta)\n\n    @staticmethod\n    def replace_all(input_html, replace_dict):\n        \"\"\"\n        用字典实现批量替换\n        \"\"\"\n        for k, v in six.iteritems(replace_dict):\n            input_html = input_html.replace(k, v)\n        return input_html\n\n    def parse_article_list(self, response):\n        \"\"\"\n        文章列表解析\n        没有翻页特征 <a class=\\\"page next S_txt1 S_line1 page_dis\\\"><span>下一页<\\/span>\n        解析链接 href=\\\"\\/p\\/1005051627825392\\/wenzhang?pids=Pl_Core_ArticleList__61&cfs=600&Pl_Core_ArticleList__61_filter=&Pl_Core_ArticleList__61_page=6#Pl_Core_ArticleList__61\\\"\n        \"\"\"\n        print('task_url: %s' % response.url)\n        # 页面解析(微博是JS动态数据, 无法直接解析页面)\n        article_list_body = response.body_as_unicode()\n\n        article_list_rule = r'<script>FM.view\\({\"ns\":\"pl.content.miniTab.index\",\"domid\":\"Pl_Core_ArticleList__\\d+\".*?\"html\":\"(.*?)\"}\\)</script>'\n        article_list_re_parse = re.compile(article_list_rule, re.S).findall(article_list_body)\n        if not article_list_re_parse:\n            return\n        article_list_html = ''.join(article_list_re_parse)\n\n        # 转义字符处理\n        article_list_html = article_list_html.replace('\\\\r', '')\n        article_list_html = article_list_html.replace('\\\\t', '')\n        article_list_html = article_list_html.replace('\\\\n', '')\n        article_list_html = 
article_list_html.replace('\\\\\"', '\"')\n        article_list_html = article_list_html.replace('\\\\/', '/')\n\n        article_list_doc = fromstring(article_list_html)\n        article_list_doc_parse = article_list_doc.xpath('//div[@class=\"text_box\"]')\n\n        for article_item in article_list_doc_parse:\n            article_detail_url = article_item.xpath('./div[@class=\"title W_autocut\"]/a[@class=\"W_autocut S_txt1\"]/@href')\n            article_detail_title = article_item.xpath('./div[@class=\"title W_autocut\"]/a[@class=\"W_autocut S_txt1\"]/text()')\n            article_detail_abstract = article_item.xpath('./div[@class=\"text\"]/a[@class=\"S_txt1\"]/text()')\n            if not (article_detail_url and article_detail_title):\n                continue\n            article_detail_url = article_detail_url[0].strip()\n            article_detail_url = response.urljoin(article_detail_url)\n            article_detail_title = article_detail_title[0].strip()\n\n            article_detail_abstract = article_detail_abstract[0].strip() if article_detail_abstract else ''\n\n            meta_article_item = {\n                'article_url': article_detail_url,\n                'article_title': article_detail_title,\n                'article_abstract': article_detail_abstract,\n                'article_id': get_request_finger(article_detail_url),\n            }\n\n            meta = dict(response.meta, **meta_article_item)\n\n            # 两种不同类型页面\n            if '/ttarticle/p/show?id=' in article_detail_url:\n                yield scrapy.Request(url=article_detail_url, callback=self.parse_article_detail_html, meta=meta)\n            else:\n                yield scrapy.Request(url=article_detail_url, callback=self.parse_article_detail_js, meta=meta)\n\n        # 翻页处理\n        next_url_parse = article_list_doc.xpath('//a[@class=\"page next S_txt1 S_line1\"]/@href')\n        if not next_url_parse:\n            print('当前条件列表页最后一页：%s' % response.url)\n        else:\n    
        next_url = next_url_parse[0]\n            next_url = response.urljoin(next_url)\n            print(next_url)\n            yield scrapy.Request(url=next_url, callback=self.parse_article_list, meta=response.meta)\n\n    def parse_article_detail_html(self, response):\n        \"\"\"\n        文章详情解析 html 版\n        :param response:\n        :return:\n        \"\"\"\n        article_title = response.xpath('//div[@class=\"title\"]/text()').extract_first(default='')\n        article_pub_time = response.xpath('//span[@class=\"time\"]/text()').extract_first(default='')\n        article_content = response.xpath('//div[@class=\"WB_editor_iframe\"]').extract_first(default='')\n        fetch_result_item = FetchResultItem()\n        fetch_result_item['task_id'] = response.meta['task_id']\n        fetch_result_item['platform_id'] = response.meta['platform_id']\n        fetch_result_item['platform_name'] = platform_name_map.get(response.meta['platform_id'], '')\n        fetch_result_item['channel_id'] = response.meta['channel_id']\n        fetch_result_item['channel_name'] = channel_name_map.get(response.meta['channel_id'], '')\n        fetch_result_item['article_id'] = response.meta['article_id']\n        fetch_result_item['article_title'] = article_title\n        fetch_result_item['article_author_id'] = response.meta['follow_id']\n        fetch_result_item['article_author_name'] = response.meta['follow_name']\n        fetch_result_item['article_pub_time'] = article_pub_time\n        fetch_result_item['article_url'] = response.url\n        fetch_result_item['article_tags'] = ''\n        fetch_result_item['article_abstract'] = response.meta['article_abstract']\n        fetch_result_item['article_content'] = article_content\n        yield fetch_result_item\n\n    @staticmethod\n    def trans_time(time_str):\n        \"\"\"\n        时间转换\n        :param time_str:\n        :return:\n        \"\"\"\n        time_rule = r'(\\d+)年(\\d+)月(\\d+)日 (\\d+):(\\d+)'\n        time_parse 
= re.compile(time_rule, re.S).findall(time_str)\n        if not time_parse:\n            return time.strftime('%Y-%m-%d %H:%M:%S')\n        return datetime(*[int(i) for i in time_parse[0]]).strftime('%Y-%m-%d %H:%M:%S')\n\n    def parse_article_detail_js(self, response):\n        \"\"\"\n        文章详情解析 js 版\n        :param response:\n        :return:\n        \"\"\"\n        article_detail_body = response.body_as_unicode()\n        article_detail_rule = r'<script>FM.view\\({\"ns\":.*?\"html\":\"(.*?)\"}\\)</script>'\n        article_detail_re_parse = re.compile(article_detail_rule, re.S).findall(article_detail_body)\n        if not article_detail_re_parse:\n            return\n        article_detail_html = ''.join(article_detail_re_parse)\n\n        # 转义字符处理\n        article_detail_html = article_detail_html.replace('\\\\r', '')\n        article_detail_html = article_detail_html.replace('\\\\t', '')\n        article_detail_html = article_detail_html.replace('\\\\n', '')\n        article_detail_html = article_detail_html.replace('\\\\\"', '\"')\n        article_detail_html = article_detail_html.replace('\\\\/', '/')\n\n        article_detail_doc = fromstring(article_detail_html)\n\n        article_title_parse = article_detail_doc.xpath('//h1[@class=\"title\"]/text()')\n        article_title = article_title_parse[0].strip() if article_title_parse else ''\n\n        article_pub_time_parse = article_detail_doc.xpath('//span[@class=\"time\"]/text()')\n        article_pub_time = self.trans_time(article_pub_time_parse[0].strip()) if article_pub_time_parse else time.strftime('%Y-%m-%d %H:%M:%S')\n\n        article_content_parse = article_detail_doc.xpath('//div[@class=\"WBA_content\"]')\n        article_content = tostring(article_content_parse[0], encoding='unicode').strip() if article_content_parse else ''\n\n        fetch_result_item = FetchResultItem()\n        fetch_result_item['task_id'] = response.meta['task_id']\n        fetch_result_item['platform_id'] = 
response.meta['platform_id']\n        fetch_result_item['platform_name'] = platform_name_map.get(response.meta['platform_id'], '')\n        fetch_result_item['channel_id'] = response.meta['channel_id']\n        fetch_result_item['channel_name'] = channel_name_map.get(response.meta['channel_id'], '')\n        fetch_result_item['article_id'] = response.meta['article_id']\n        fetch_result_item['article_title'] = article_title\n        fetch_result_item['article_author_id'] = response.meta['follow_id']\n        fetch_result_item['article_author_name'] = response.meta['follow_name']\n        fetch_result_item['article_pub_time'] = time_local_to_utc(article_pub_time).strftime('%Y-%m-%d %H:%M:%S')\n        fetch_result_item['article_url'] = response.url\n        fetch_result_item['article_tags'] = ''\n        fetch_result_item['article_abstract'] = response.meta['article_abstract']\n        fetch_result_item['article_content'] = article_content\n        yield fetch_result_item\n"
  },
  {
    "path": "news/spiders/weixin.py",
    "content": "# -*- coding: utf-8 -*-\n\n\nfrom __future__ import print_function\nfrom __future__ import unicode_literals\n\nimport scrapy\n\nfrom apps.client_db import get_item\nfrom maps.channel import channel_name_map\nfrom maps.platform import platform_name_map\nfrom models.news import FetchTask\nfrom news.items import FetchResultItem\nfrom tools.cookies import get_cookies\nfrom tools.date_time import time_local_to_utc\nfrom tools.scrapy_tasks import pop_task\nfrom tools.url import get_update_url\nfrom tools.weixin import parse_weixin_js_body, ParseJsWc, check_article_title_duplicate\n\n\nclass WeixinSpider(scrapy.Spider):\n    \"\"\"\n    微信公众号蜘蛛\n    因微信公众号详情链接是带有效期签名的动态链接, 故无法使用请求去重中间件\n    \"\"\"\n    name = 'weixin'\n    allowed_domains = ['mp.weixin.qq.com', 'weixin.qq.com', 'qq.com', 'sogou.com']\n\n    custom_settings = dict(\n        COOKIES_ENABLED=True,\n        DEFAULT_REQUEST_HEADERS={\n            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',\n            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',\n            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0'\n        },\n        USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0',\n        DOWNLOADER_MIDDLEWARES={\n            # 'news.middlewares.de_duplication_request.DeDuplicationRequestMiddleware': 140,  # 去重请求\n            'news.middlewares.anti_spider.AntiSpiderMiddleware': 160,  # 反爬处理\n            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,\n            'news.middlewares.useragent.UserAgentMiddleware': 500,\n            # 'news.middlewares.httpproxy.HttpProxyMiddleware': 720,  # 代理（cookie需要与代理IP关联）\n        },\n        ITEM_PIPELINES={\n            'news.pipelines.de_duplication_store_mysql.DeDuplicationStoreMysqlPipeline': 400,  # 去重存储\n            # 'news.pipelines.img_remote_to_local_fs.ImgRemoteToLocalFSPipeline': 
440,\n            'news.pipelines.store_mysql.StoreMysqlPipeline': 450,\n            # 'news.pipelines.de_duplication_request.DeDuplicationRequestPipeline': 500,  # 去重请求\n        },\n        DOWNLOAD_DELAY=0.5\n    )\n\n    def start_requests(self):\n        \"\"\"\n        入口准备\n        :return:\n        \"\"\"\n        boot_url = 'http://weixin.sogou.com/weixin'\n\n        task_id = pop_task(self.name)\n\n        if not task_id:\n            print('%s task is empty' % self.name)\n            return\n        print('%s task id: %s' % (self.name, task_id))\n\n        task_item = get_item(FetchTask, task_id)\n\n        cookies_id, cookies = get_cookies(self.name)\n        url_params = {\n            'type': 1,\n            # 'query': task_item.follow_id,\n            'query': task_item.follow_name.encode('utf-8'),\n        }\n        url_profile = get_update_url(boot_url, url_params)\n        meta = {\n            'cookiejar': cookies_id,\n            'task_id': task_item.id,\n            'platform_id': task_item.platform_id,\n            'channel_id': task_item.channel_id,\n            'follow_id': task_item.follow_id,\n            'follow_name': task_item.follow_name,\n        }\n\n        yield scrapy.Request(url=url_profile, cookies=cookies, callback=self.parse_account_search_list, meta=meta)\n\n    def parse_article_search_list(self, response):\n        \"\"\"\n        解析微信文章 搜索列表页面 (废弃)\n        :param response:\n        :return:\n        \"\"\"\n        news_links = response.xpath('//div[@class=\"txt-box\"]/h3/a/@href').extract()\n        for new_link in news_links:\n            yield scrapy.Request(url=new_link, callback=self.parse_detail)\n\n    def parse_account_search_list(self, response):\n        \"\"\"\n        解析公众账号 搜索列表页面\n        :param response:\n        :return:\n        \"\"\"\n        account_link = response.xpath('//div[@class=\"txt-box\"]//a/@href').extract_first()\n        if account_link:\n            yield scrapy.Request(url=account_link, 
callback=self.parse_account_article_list, meta=response.meta)\n\n    def parse_account_article_list(self, response):\n        \"\"\"\n        解析公众账号 文章列表页面\n        :param response:\n        :return:\n        \"\"\"\n        article_list_body = response.body_as_unicode()\n        js_body = parse_weixin_js_body(article_list_body, response.url)\n        if not js_body:\n            return\n        pj = ParseJsWc(js_body=js_body)\n        article_list = pj.parse_js_msg_list()\n\n        for article_item in article_list:\n            # 标题去重\n            if check_article_title_duplicate(article_item['article_title']):\n                continue\n            meta = dict(response.meta, **article_item)\n            yield scrapy.Request(url=article_item['article_url'], callback=self.parse_detail, meta=meta)\n\n    def parse_detail(self, response):\n        \"\"\"\n        详细页面\n        :param response:\n        :return:\n        \"\"\"\n        article_content = ''.join([i.strip() for i in response.xpath('//div[@id=\"js_content\"]/*').extract()])\n\n        # 原创内容处理（处理内容为空）\n        if not article_content:\n            share_source_url = response.xpath('//a[@id=\"js_share_source\"]/@href').extract_first()\n            if share_source_url:\n                yield scrapy.Request(url=share_source_url, callback=self.parse_detail, meta=response.meta)\n            return\n\n        fetch_result_item = FetchResultItem()\n        fetch_result_item['task_id'] = response.meta['task_id']\n        fetch_result_item['platform_id'] = response.meta['platform_id']\n        fetch_result_item['platform_name'] = platform_name_map.get(response.meta['platform_id'], '')\n        fetch_result_item['channel_id'] = response.meta['channel_id']\n        fetch_result_item['channel_name'] = channel_name_map.get(response.meta['channel_id'], '')\n        fetch_result_item['article_id'] = response.meta['article_id']\n        fetch_result_item['article_title'] = response.meta['article_title']\n        fetch_result_item['article_author_id'] = response.meta['follow_id']\n        fetch_result_item['article_author_name'] = response.meta['follow_name']\n        fetch_result_item['article_pub_time'] = time_local_to_utc(response.meta['article_pub_time']).strftime('%Y-%m-%d %H:%M:%S')\n        fetch_result_item['article_url'] = response.meta['article_url']\n        fetch_result_item['article_tags'] = ''\n        fetch_result_item['article_abstract'] = response.meta['article_abstract']\n        fetch_result_item['article_content'] = article_content\n\n        yield fetch_result_item\n"
  },
  {
    "path": "requirements-py2.txt",
    "content": "asn1crypto==0.24.0\nattrs==19.1.0\nAutomat==0.7.0\ncertifi==2019.3.9\ncffi==1.12.3\nchardet==3.0.4\nconstantly==15.1.0\ncryptography==2.6.1\ncssselect==1.0.3\nenum34==1.1.6\nfunctools32==3.2.3.post2\nfuture==0.17.1\nhyperlink==19.0.0\nidna==2.8\nincremental==17.5.0\ninflect==2.1.0\nipaddress==1.0.22\nlxml==4.3.3\nmysqlclient==1.4.2.post1\nparsel==1.5.1\nPillow==6.0.0\npsutil==5.6.2\npyasn1==0.4.5\npyasn1-modules==0.2.5\npycparser==2.19\nPyDispatcher==2.0.5\nPyExecJS==1.5.1\nPyHamcrest==1.9.0\npyOpenSSL==19.0.0\nqueuelib==1.5.0\nredis==3.2.1\nrequests==2.22.0\nschedule==0.6.0\nScrapy==1.6.0\nservice-identity==18.1.0\nsix==1.12.0\nsqlacodegen==1.1.6\nSQLAlchemy==1.3.3\nTwisted==19.2.0\nurllib3==1.25.3\nw3lib==1.20.0\nzope.interface==4.6.0\n"
  },
  {
    "path": "requirements-py3.txt",
    "content": "asn1crypto==0.24.0\nattrs==19.1.0\nAutomat==0.7.0\ncertifi==2019.3.9\ncffi==1.12.3\nchardet==3.0.4\nconstantly==15.1.0\ncryptography==2.6.1\ncssselect==1.0.3\nfuture==0.17.1\nhyperlink==19.0.0\nidna==2.8\nincremental==17.5.0\ninflect==2.1.0\nlxml==4.3.3\nmysqlclient==1.4.2.post1\nparsel==1.5.1\nPillow==6.0.0\npsutil==5.6.2\npyasn1==0.4.5\npyasn1-modules==0.2.5\npycparser==2.19\nPyDispatcher==2.0.5\nPyExecJS==1.5.1\nPyHamcrest==1.9.0\npyOpenSSL==19.0.0\nqueuelib==1.5.0\nredis==3.2.1\nrequests==2.22.0\nschedule==0.6.0\nScrapy==1.6.0\nservice-identity==18.1.0\nsix==1.12.0\nsqlacodegen==1.1.6\nSQLAlchemy==1.3.3\nTwisted==19.2.0\nurllib3==1.25.3\nw3lib==1.20.0\nzope.interface==4.6.0\n"
  },
  {
    "path": "scrapy.cfg",
    "content": "# Automatically created by: scrapy startproject\n#\n# For more information about the [deploy] section see:\n# https://scrapyd.readthedocs.io/en/latest/deploy.html\n\n[settings]\ndefault = news.settings\n\n[deploy]\n#url = http://localhost:6800/\nproject = news\n"
  },
  {
    "path": "tasks/__init__.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: __init__.py.py\n@time: 2018-02-10 17:10\n\"\"\"\n\n\ndef func():\n    pass\n\n\nclass Main(object):\n    def __init__(self):\n        pass\n\n\nif __name__ == '__main__':\n    pass\n"
  },
  {
    "path": "tasks/job_put_tasks.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: job_put_tasks.py\n@time: 2018-02-10 17:16\n\"\"\"\n\nimport sys\n\nfrom models.news import FetchTask, FetchResult, LogTaskScheduling\nfrom apps.client_db import get_group, get_all\nfrom maps.platform import WEIXIN, WEIBO, TOUTIAO\nfrom tools.scrapy_tasks import put_task, get_tasks_count\n\n\ndef job_put_tasks(spider_name):\n    # 如果任务队列没有消耗完毕, 不处理\n    tasks_count = get_tasks_count(spider_name)\n    if tasks_count:\n        return True\n\n    spider_map = {\n        'weixin': WEIXIN,\n        'weibo': WEIBO,\n        'toutiao': TOUTIAO,\n        'toutiao_m': TOUTIAO,\n    }\n\n    # TODO 稳定运行之后需要去掉\n    # task_exclude = [i.task_id for i in get_group(FetchResult, 'task_id', min_count=1)]\n\n    task_list = get_all(FetchTask, FetchTask.platform_id == spider_map.get(spider_name))\n\n    c = 0\n    for task in task_list:\n        # 排除任务\n        # if task.id in task_exclude:\n        #     continue\n        put_task(spider_name, task.id)\n        c += 1\n        if c % 100 == 0:\n            print(c)\n    print('put %s tasks count: %s' % (spider_name, c))\n    return True\n\n\ndef usage():\n    contents = [\n        'Example:',\n        '\\tpython job_put_tasks.py wx  # 微信',\n        '\\tpython job_put_tasks.py wb  # 微博',\n        '\\tpython job_put_tasks.py tm  # 头条(M)',\n        '\\tpython job_put_tasks.py tt  # 头条(PC)',\n    ]\n    print('\\n'.join(contents))\n\n\ndef run():\n    \"\"\"\n    入口\n    \"\"\"\n    # print(sys.argv)\n    spider_name_maps = {\n        'wx': 'weixin',\n        'wb': 'weibo',\n        'tt': 'toutiao',\n        'tm': 'toutiao_m',\n    }\n    try:\n        if len(sys.argv) > 1:\n            spider_name = spider_name_maps.get(sys.argv[1])\n            if not spider_name:\n                raise Exception('参数错误')\n            job_put_tasks(spider_name)\n        else:\n            raise Exception('缺失参数')\n    except Exception as e:\n  
      print(e)\n        usage()\n\n\nif __name__ == '__main__':\n    run()\n"
  },
  {
    "path": "tasks/job_reboot_net_china_net.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: job_reboot_net_china_net.py\n@time: 2018-05-28 19:40\n\"\"\"\n\n\nimport time\nfrom libs.optical_modem import OpticalModemChinaNet\nfrom tools.net_status import get_reboot_net_status, del_reboot_net_status\n\nnet_name = 'optical_modem_china_net'\n\n\ndef job_reboot_net_china_net():\n    \"\"\"\n    重启中国电信光猫\n    :return:\n    \"\"\"\n    # reboot_net_status = get_reboot_net_status(net_name)\n    # if not reboot_net_status:\n    #     return\n\n    om_cn = OpticalModemChinaNet()\n    om_cn.net_ip_o = om_cn.get_net_ip()\n    om_cn.login()  # 默认用户名、密码\n    om_cn.reboot()\n    time.sleep(10)\n    om_cn.net_ip_n = om_cn.get_net_ip()\n    om_cn.check_reboot_status()\n\n    del_reboot_net_status(net_name)\n"
  },
  {
    "path": "tasks/jobs_proxies.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: jobs_proxies.py\n@time: 2018-03-13 17:22\n\"\"\"\n\n\nfrom __future__ import print_function\n\nimport sys\n\nfrom tools.proxies import add_proxy, len_proxy, fetch_proxy\n\n\ndef job_proxies(spider_name, min_num=0):\n    if len_proxy(spider_name) <= min_num:\n        proxy_list = fetch_proxy()\n        if not proxy_list:\n            return\n        add_proxy(spider_name, *proxy_list)\n        print('%s add proxies: %s' % (spider_name, len(proxy_list)))\n\n\ndef usage():\n    contents = [\n        'Example:',\n        '\\tpython jobs_proxies.py ip  # test',\n        '\\tpython jobs_proxies.py wx  # Weixin',\n        '\\tpython jobs_proxies.py wb  # Weibo',\n        '\\tpython jobs_proxies.py tt  # Toutiao',\n    ]\n    print('\\n'.join(contents))\n\n\ndef run():\n    \"\"\"\n    Entry point\n    \"\"\"\n    # print(sys.argv)\n    spider_name_maps = {\n        'wx': 'weixin',\n        'wb': 'weibo',\n        'tt': 'toutiao',\n    }\n    try:\n        if len(sys.argv) > 1:\n            spider_name = spider_name_maps.get(sys.argv[1], sys.argv[1])\n            if not spider_name:\n                raise Exception('invalid argument')\n            job_proxies(spider_name)\n        else:\n            raise Exception('missing argument')\n    except Exception as e:\n        print(e)\n        usage()\n\n\nif __name__ == '__main__':\n    run()\n"
  },
  {
    "path": "tasks/jobs_sogou.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: jobs_sogou.py\n@time: 2018-02-10 18:05\n\"\"\"\n\n\nfrom tools.cookies import add_cookies\nfrom tools.anti_spider_sogou import auto_cookies as sogou_cookies\nfrom apps.client_rk import rk_counter_client, check_counter_limit, check_cookies_count\n\n\ndef job_sogou_cookies(spider_name):\n    \"\"\"\n    sogou cookies\n    :return:\n    \"\"\"\n    # 判断每天限制额度\n    if not check_counter_limit():\n        print('spider_name: %s, There is not enough available quantity' % spider_name)\n        return False\n\n    # 判断 cookie 队列长度\n    if not check_cookies_count(spider_name):\n        print('spider_name: %s, The quantity of cookies is enough' % spider_name)\n        return False\n\n    sogou_cookies_obj = sogou_cookies()\n\n    if not sogou_cookies_obj:\n        return False\n\n    add_cookies(spider_name, sogou_cookies_obj)\n    rk_counter_client.increase(1)\n    return True\n\n\nif __name__ == '__main__':\n    job_sogou_cookies('weixin')\n"
  },
  {
    "path": "tasks/jobs_weixin.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: jobs_weixin.py\n@time: 2018-02-10 18:06\n\"\"\"\n\n\nimport json\nimport time\nimport sys\n\nfrom libs.redis_pub_sub import RedisPubSub\nfrom libs.redis_queue import RedisQueue\nfrom tools.anti_spider_weixin import auto_cookies as weixin_cookies\nfrom apps.client_db import redis_client\nfrom apps.client_rk import rk_counter_client, check_counter_limit\n\n\ndef set_anti_spider_task(spider_name, msg):\n    \"\"\"\n    设置任务队列\n    msg = {\n        'url': url,\n        'time': time.strftime(\"%Y-%m-%d %H:%M:%S\")\n    }\n    :param spider_name:\n    :param msg:\n    :return:\n    \"\"\"\n    key = 'scrapy:anti_spider_task_weixin:%s' % spider_name\n    q_task = RedisQueue(key, redis_client=redis_client)\n    q_msg = json.dumps(msg) if isinstance(msg, dict) else msg\n    # 因为微信反爬策略是通过IP限制, 这里仅仅处理一个任务\n    if q_task.empty():\n        q_task.put(q_msg)\n\n\ndef _get_anti_spider_task(spider_name):\n    \"\"\"获取任务队列\"\"\"\n    key = 'scrapy:anti_spider_task_weixin:%s' % spider_name\n    q_task = RedisQueue(key, redis_client=redis_client)\n    result = q_task.get(timeout=60)\n    return json.loads(result) if result else {}\n\n\ndef _set_anti_spider_result(spider_name, msg):\n    \"\"\"设置结果队列\"\"\"\n    key = 'scrapy:anti_spider_result_weixin:%s' % spider_name\n    q_result = RedisQueue(key, redis_client=redis_client)\n    q_msg = json.dumps(msg) if isinstance(msg, dict) else msg\n    q_result.put(q_msg)\n\n\ndef _get_anti_spider_result(spider_name):\n    \"\"\"获取任务队列\"\"\"\n    key = 'scrapy:anti_spider_result_weixin:%s' % spider_name\n    q_result = RedisQueue(key, redis_client=redis_client)\n    result = q_result.get(timeout=60)\n    return json.loads(result) if result else {}\n\n\ndef sub_anti_spider(spider_name):\n    \"\"\"\n    蜘蛛订阅验证码处理结果\n    :param spider_name:\n    :return:\n    \"\"\"\n    q = RedisPubSub('scrapy:anti_spider', 
redis_client=redis_client)\n    r = q.sub_not_loop(spider_name)\n    return json.loads(r) if r else {}\n\n\ndef _pub_anti_spider(spider_name, msg):\n    \"\"\"\n    将对应蜘蛛的验证码处理结果发布给对应订阅者\n    :param spider_name:\n    :return:\n    \"\"\"\n    q = RedisPubSub('scrapy:anti_spider', redis_client=redis_client)\n    msg = json.dumps(msg) if isinstance(msg, dict) else msg\n    q.pub(spider_name, msg)\n\n\ndef job_weixin_cookies(spider_name):\n    \"\"\"\n    weixin cookies\n    :return:\n    \"\"\"\n    # 判断每天限制额度\n    if not check_counter_limit():\n        print('spider_name: %s, There is not enough available quantity' % spider_name)\n        return False\n\n    # 读取验证码任务队列(超时1分钟)\n    task = _get_anti_spider_task(spider_name)\n    if not task:\n        return False\n\n    # 设置验证码结果队列\n    url = task.get('url')\n    msg = {\n        'url': url,\n        'status': False,\n        'time': time.strftime(\"%Y-%m-%d %H:%M:%S\")\n    }\n    try:\n        weixin_cookies_status = weixin_cookies(url)\n        msg['status'] = weixin_cookies_status\n\n        _set_anti_spider_result(spider_name, msg)\n\n        # 读取验证码结果队列(超时1分钟)\n        msg = _get_anti_spider_result(spider_name)\n\n        _pub_anti_spider(spider_name, msg)\n        rk_counter_client.increase(1)\n        return True\n    except Exception as e:\n        print(e)\n        _pub_anti_spider(spider_name, msg)\n        return False\n\n\ndef usage():\n    print('python tasks/jobs_weixin.py <function> <spider_name>')\n    print('\\tpython tasks/jobs_weixin.py job_weixin_cookies weixin')\n\n\ndef run():\n    \"\"\"\n    启动入口\n    \"\"\"\n    # print sys.argv\n    try:\n        if len(sys.argv) >= 3:\n            fun_name = globals()[sys.argv[1]]\n            fun_name(sys.argv[2])\n        else:\n            usage()\n    except NameError as e:\n        print(e)\n\n\nif __name__ == '__main__':\n    job_weixin_cookies('weixin')\n    # run()\n    # python tasks/jobs_weixin.py job_weixin_cookies weixin\n"
  },
  {
    "path": "tasks/run_job_counter_clear.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: run_job_counter_clear.py\n@time: 2018-05-02 10:24\n\"\"\"\n\nimport time\n\nimport schedule\n\nfrom apps.client_rk import counter_clear as job_counter_clear\nfrom tools import catch_keyboard_interrupt\n\n# 计数清零\nschedule.every().day.at('00:00').do(job_counter_clear)\n\n\n@catch_keyboard_interrupt\ndef run():\n    while True:\n        schedule.run_pending()\n        time.sleep(1)\n\n\nif __name__ == '__main__':\n    run()\n"
  },
  {
    "path": "tasks/run_job_put_tasks_toutiao.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: run_job_put_tasks_toutiao.py\n@time: 2018-05-02 10:23\n\"\"\"\n\nimport time\n\nimport schedule\n\nfrom tasks.job_put_tasks import job_put_tasks\nfrom tools import catch_keyboard_interrupt\n\n\n# 分布式任务调度 - 头条\nschedule.every(1).minutes.do(job_put_tasks, spider_name='toutiao')\n\n\n@catch_keyboard_interrupt\ndef run():\n    while True:\n        schedule.run_pending()\n        time.sleep(1)\n\n\nif __name__ == '__main__':\n    run()\n"
  },
  {
    "path": "tasks/run_job_put_tasks_weibo.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: run_job_put_tasks_weibo.py\n@time: 2018-05-02 10:23\n\"\"\"\n\nimport time\n\nimport schedule\n\nfrom tasks.job_put_tasks import job_put_tasks\nfrom tools import catch_keyboard_interrupt\n\n\n# 分布式任务调度 - 微博\nschedule.every(5).minutes.do(job_put_tasks, spider_name='weibo')\n\n\n@catch_keyboard_interrupt\ndef run():\n    while True:\n        schedule.run_pending()\n        time.sleep(1)\n\n\nif __name__ == '__main__':\n    run()\n"
  },
  {
    "path": "tasks/run_job_put_tasks_weixin.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: run_job_put_tasks_weixin.py\n@time: 2018-05-02 10:23\n\"\"\"\n\nimport time\n\nimport schedule\n\nfrom tasks.job_put_tasks import job_put_tasks\nfrom tools import catch_keyboard_interrupt\n\n\n# 分布式任务调度 - 微信\nschedule.every(5).minutes.do(job_put_tasks, spider_name='weixin')\n\n\n@catch_keyboard_interrupt\ndef run():\n    while True:\n        schedule.run_pending()\n        time.sleep(1)\n\n\nif __name__ == '__main__':\n    run()\n"
  },
  {
    "path": "tasks/run_job_reboot_net_china_net.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: run_job_optical_modem_china_net.py\n@time: 2018-05-28 19:35\n\"\"\"\n\nimport time\n\nimport schedule\n\nfrom tasks.job_reboot_net_china_net import job_reboot_net_china_net\nfrom tools import catch_keyboard_interrupt\n\n\n# 电信光猫重启\nschedule.every(15).minutes.do(job_reboot_net_china_net)\n\n\n@catch_keyboard_interrupt\ndef run():\n    while True:\n        schedule.run_pending()\n        time.sleep(1)\n\n\nif __name__ == '__main__':\n    run()\n"
  },
  {
    "path": "tasks/run_job_sogou_cookies.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: run_job_sogou_cookies.py\n@time: 2018-05-02 10:21\n\"\"\"\n\nimport time\n\nimport schedule\n\nfrom tasks.jobs_sogou import job_sogou_cookies\nfrom tools import catch_keyboard_interrupt\n\n# sogou 反爬任务\nschedule.every(5).minutes.do(job_sogou_cookies, spider_name='weixin')\n\n\n@catch_keyboard_interrupt\ndef run():\n    while True:\n        schedule.run_pending()\n        time.sleep(1)\n\n\nif __name__ == '__main__':\n    run()\n"
  },
  {
    "path": "tasks/run_job_weixin_cookies.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: run_job_weixin_cookies.py\n@time: 2018-05-02 10:22\n\"\"\"\n\nimport time\n\nimport schedule\n\nfrom tasks.jobs_weixin import job_weixin_cookies\nfrom tools import catch_keyboard_interrupt\n\n# weixin 反爬任务\nschedule.every(5).minutes.do(job_weixin_cookies, spider_name='weixin')\n\n\n@catch_keyboard_interrupt\ndef run():\n    while True:\n        schedule.run_pending()\n        time.sleep(1)\n\n\nif __name__ == '__main__':\n    run()\n"
  },
  {
    "path": "tasks/run_jobs.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: run_jobs.py\n@time: 2018-04-18 11:10\n\"\"\"\n\n\nimport schedule\nimport time\nfrom tools import catch_keyboard_interrupt\n\nfrom tasks.job_put_tasks import job_put_tasks\nfrom tasks.jobs_sogou import job_sogou_cookies\nfrom tasks.jobs_weixin import job_weixin_cookies\nfrom apps.client_rk import counter_clear as job_counter_clear\n\n\n# Sogou anti-spider task\nschedule.every(5).minutes.do(job_sogou_cookies, spider_name='weixin')\n# Weixin anti-spider task\nschedule.every(5).minutes.do(job_weixin_cookies, spider_name='weixin')\n# distributed task scheduling - Weixin\nschedule.every(5).minutes.do(job_put_tasks, spider_name='weixin')\n# distributed task scheduling - Weibo\nschedule.every(5).minutes.do(job_put_tasks, spider_name='weibo')\n# distributed task scheduling - Toutiao\nschedule.every(5).minutes.do(job_put_tasks, spider_name='toutiao')\n# reset daily counter\nschedule.every().day.at('00:00').do(job_counter_clear)\n\n\n@catch_keyboard_interrupt\ndef run():\n    while True:\n        schedule.run_pending()\n        time.sleep(1)\n\n\nif __name__ == '__main__':\n    run()\n"
  },
  {
    "path": "tasks/run_jobs_apscheduler.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: run_jobs_apscheduler.py\n@time: 2018-02-10 18:01\n\"\"\"\n\n# Deprecated\n\n\nfrom apscheduler.schedulers.blocking import BlockingScheduler\n\nfrom config import current_config\n\nfrom tasks import job_put_tasks\nfrom tasks.jobs_sogou import job_sogou_cookies\nfrom tasks.jobs_weixin import job_weixin_cookies\nfrom apps.client_rk import counter_clear as job_counter_clear\n\nREDIS = current_config.REDIS\n\nscheduler = BlockingScheduler()\n\njob_store_redis_alias = 'news_spider'\n\n\ndef add_job_store_redis():\n    \"\"\"\n    127.0.0.1:6379> TYPE \"example.jobs\"\n    hash\n    127.0.0.1:6379> TYPE \"example.run_times\"\n    zset\n    127.0.0.1:6379> HGETALL \"example.jobs\"\n    1) \"45431465e6104f3c924ec01852ed1aeb\"\n    2) \"\\x80\\x02}q\\x01(U\\x04argsq\\x02)U\\bexecutorq\\x03U\\adefaultq\\x04U\\rmax_instancesq\\x05K\\x01U\\x04funcq\\x06U\\x10__main__:task_03q\\aU\\x02idq\\bU 45431465e6104f3c924ec01852ed1aebq\\tU\\rnext_run_timeq\\ncdatetime\\ndatetime\\nq\\x0bU\\n\\a\\xe1\\x0c\\b\\x02\\x01\\x00\\x00\\x00\\x00cpytz\\n_p\\nq\\x0c(U\\rAsia/Shanghaiq\\rM\\x80pK\\x00U\\x03CSTq\\x0etRq\\x0f\\x86Rq\\x10U\\x04nameq\\x11U\\atask_03q\\x12U\\x12misfire_grace_timeq\\x13K\\x01U\\atriggerq\\x14capscheduler.triggers.cron\\nCronTrigger\\nq\\x15)\\x81q\\x16}q\\x17(U\\btimezoneq\\x18h\\x0c(h\\rM\\xe8qK\\x00U\\x03LMTq\\x19tRq\\x1aU\\aversionq\\x1bK\\x01U\\nstart_dateq\\x1cNU\\bend_dateq\\x1dNU\\x06fieldsq\\x1e]q\\x1f(capscheduler.triggers.cron.fields\\nBaseField\\nq )\\x81q!}q\\\"(U\\nis_defaultq#\\x88U\\x0bexpressionsq$]q%capscheduler.triggers.cron.expressions\\nAllExpression\\nq&)\\x81q'}q(U\\x04stepq)Nsbah\\x11U\\x04yearq*ubh 
)\\x81q+}q,(h#\\x88h$]q-h&)\\x81q.}q/h)Nsbah\\x11U\\x05monthq0ubcapscheduler.triggers.cron.fields\\nDayOfMonthField\\nq1)\\x81q2}q3(h#\\x88h$]q4h&)\\x81q5}q6h)Nsbah\\x11U\\x03dayq7ubcapscheduler.triggers.cron.fields\\nWeekField\\nq8)\\x81q9}q:(h#\\x88h$]q;h&)\\x81q<}q=h)Nsbah\\x11U\\x04weekq>ubcapscheduler.triggers.cron.fields\\nDayOfWeekField\\nq?)\\x81q@}qA(h#\\x88h$]qBh&)\\x81qC}qDh)Nsbah\\x11U\\x0bday_of_weekqEubh )\\x81qF}qG(h#\\x89h$]qHcapscheduler.triggers.cron.expressions\\nRangeExpression\\nqI)\\x81qJ}qK(h)NU\\x04lastqLK\\x16U\\x05firstqMK\\x00ubah\\x11U\\x04hourqNubh )\\x81qO}qP(h#\\x89h$]qQhI)\\x81qR}qS(h)NhLK\\x01hMK\\x01ubah\\x11U\\x06minuteqTubh )\\x81qU}qV(h#\\x88h$]qWhI)\\x81qX}qY(h)NhLK\\x00hMK\\x00ubah\\x11U\\x06secondqZubeubU\\bcoalesceq[\\x88h\\x1bK\\x01U\\x06kwargsq\\\\}q]u.\"\n    3) \"f5637d98946848c291da09a4ceb08027\"\n    4) \"\\x80\\x02}q\\x01(U\\x04argsq\\x02)U\\bexecutorq\\x03U\\adefaultq\\x04U\\rmax_instancesq\\x05K\\x01U\\x04funcq\\x06U\\x10__main__:task_04q\\aU\\x02idq\\bU f5637d98946848c291da09a4ceb08027q\\tU\\rnext_run_timeq\\ncdatetime\\ndatetime\\nq\\x0bU\\n\\a\\xe1\\x0c\\b\\x012\\x00\\x00\\x00\\x00cpytz\\n_p\\nq\\x0c(U\\rAsia/Shanghaiq\\rM\\x80pK\\x00U\\x03CSTq\\x0etRq\\x0f\\x86Rq\\x10U\\x04nameq\\x11U\\atask_04q\\x12U\\x12misfire_grace_timeq\\x13K\\x01U\\atriggerq\\x14capscheduler.triggers.cron\\nCronTrigger\\nq\\x15)\\x81q\\x16}q\\x17(U\\btimezoneq\\x18h\\x0c(h\\rM\\xe8qK\\x00U\\x03LMTq\\x19tRq\\x1aU\\aversionq\\x1bK\\x01U\\nstart_dateq\\x1cNU\\bend_dateq\\x1dNU\\x06fieldsq\\x1e]q\\x1f(capscheduler.triggers.cron.fields\\nBaseField\\nq )\\x81q!}q\\\"(U\\nis_defaultq#\\x88U\\x0bexpressionsq$]q%capscheduler.triggers.cron.expressions\\nAllExpression\\nq&)\\x81q'}q(U\\x04stepq)Nsbah\\x11U\\x04yearubh 
)\\x81q*}q+(h#\\x88h$]q,h&)\\x81q-}q.h)Nsbah\\x11U\\x05monthubcapscheduler.triggers.cron.fields\\nDayOfMonthField\\nq/)\\x81q0}q1(h#\\x88h$]q2h&)\\x81q3}q4h)Nsbah\\x11U\\x03dayubcapscheduler.triggers.cron.fields\\nWeekField\\nq5)\\x81q6}q7(h#\\x88h$]q8h&)\\x81q9}q:h)Nsbah\\x11U\\x04weekubcapscheduler.triggers.cron.fields\\nDayOfWeekField\\nq;)\\x81q<}q=(h#\\x88h$]q>h&)\\x81q?}q@h)Nsbah\\x11U\\x0bday_of_weekubh )\\x81qA}qB(h#\\x89h$]qCcapscheduler.triggers.cron.expressions\\nRangeExpression\\nqD)\\x81qE}qF(h)NU\\x04lastqGK\\x16U\\x05firstqHK\\x00ubah\\x11U\\x04hourubh )\\x81qI}qJ(h#\\x89h$]qKh&)\\x81qL}qMh)K\\x01sbah\\x11U\\x06minuteubh )\\x81qN}qO(h#\\x88h$]qPhD)\\x81qQ}qR(h)NhGK\\x00hHK\\x00ubah\\x11U\\x06secondubeubU\\bcoalesceqS\\x88h\\x1bK\\x01U\\x06kwargsqT}qUu.\"\n    5) \"ba044f7b253a4cb1961e7abf036f8ef7\"\n    6) \"\\x80\\x02}q\\x01(U\\x04argsq\\x02)U\\bexecutorq\\x03U\\adefaultq\\x04U\\rmax_instancesq\\x05K\\x01U\\x04funcq\\x06U\\x10__main__:task_02q\\aU\\x02idq\\bU ba044f7b253a4cb1961e7abf036f8ef7q\\tU\\rnext_run_timeq\\ncdatetime\\ndatetime\\nq\\x0bU\\n\\a\\xe1\\x0c\\b\\x012\\r\\x0f5\\xf9cpytz\\n_p\\nq\\x0c(U\\rAsia/Shanghaiq\\rM\\x80pK\\x00U\\x03CSTq\\x0etRq\\x0f\\x86Rq\\x10U\\x04nameq\\x11U\\atask_02q\\x12U\\x12misfire_grace_timeq\\x13K\\x01U\\atriggerq\\x14capscheduler.triggers.interval\\nIntervalTrigger\\nq\\x15)\\x81q\\x16}q\\x17(U\\btimezoneq\\x18h\\x0c(h\\rM\\xe8qK\\x00U\\x03LMTq\\x19tRq\\x1aU\\aversionq\\x1bK\\x01U\\nstart_dateq\\x1ch\\x0bU\\n\\a\\xe1\\x0c\\b\\x01.\\r\\x0f5\\xf9h\\x0f\\x86Rq\\x1dU\\bend_dateq\\x1eNU\\bintervalq\\x1fcdatetime\\ntimedelta\\nq K\\x00K<K\\x00\\x87Rq!ubU\\bcoalesceq\\\"\\x88h\\x1bK\\x01U\\x06kwargsq#}q$u.\"\n    127.0.0.1:6379> ZCARD \"example.run_times\"\n    (integer) 3\n    127.0.0.1:6379> ZRANGE \"example.run_times\" 0 2 WITHSCORES\n    1) \"f5637d98946848c291da09a4ceb08027\"\n    2) \"1512669060\"\n    3) \"ba044f7b253a4cb1961e7abf036f8ef7\"\n    4) \"1512669073.9968569\"\n    5) 
\"45431465e6104f3c924ec01852ed1aeb\"\n    6) \"1512669660\"\n\n    # 清理数据\n    127.0.0.1:6379> DEL example.jobs\n    (integer) 1\n    127.0.0.1:6379> DEL example.run_times\n    (integer) 1\n    :return:\n    \"\"\"\n    scheduler.add_jobstore(\n        'redis',\n        alias=job_store_redis_alias,\n        jobs_key='news_spider.jobs',\n        run_times_key='news_spider.run_times',\n        **REDIS\n    )\n\n\ndef add_job():\n    # sogou 反爬任务\n    scheduler.add_job(\n        job_sogou_cookies,\n        'interval',\n        kwargs={'spider_name': 'weixin'},\n        minutes=5,\n        id='job_sogou_cookies',\n        replace_existing=True\n    )\n\n    # weixin 反爬任务\n    scheduler.add_job(\n        job_weixin_cookies,\n        'interval',\n        kwargs={'spider_name': 'weixin'},\n        minutes=2,\n        id='job_weixin_cookies',\n        replace_existing=True\n    )\n\n    # 分布式任务调度 - 微信\n    scheduler.add_job(\n        job_put_tasks,\n        'interval',\n        kwargs={'spider_name': 'weixin'},\n        minutes=5,\n        id='job_put_tasks_weixin',\n        replace_existing=True\n    )\n\n    # 分布式任务调度 - 微博\n    scheduler.add_job(\n        job_put_tasks,\n        'interval',\n        kwargs={'spider_name': 'weibo'},\n        minutes=5,\n        id='job_put_tasks_weibo',\n        replace_existing=True\n    )\n\n    # 分布式任务调度 - 头条\n    scheduler.add_job(\n        job_put_tasks,\n        'interval',\n        kwargs={'spider_name': 'toutiao'},\n        minutes=5,\n        id='job_put_tasks_toutiao',\n        replace_existing=True\n    )\n\n    # 计数清零\n    scheduler.add_job(\n        job_counter_clear,\n        'cron',\n        day='*',\n        hour='0',\n        id='job_counter_clear',\n        replace_existing=True\n    )\n\n\ndef run_blocking():\n    try:\n        # add_job_store_redis()   # 后端存储 基于redis(可选)\n        add_job()               # 添加任务\n        scheduler.start()       # 开启调度\n    except (KeyboardInterrupt, SystemExit):\n        
scheduler.shutdown()    # 关闭调度\n\n\nif __name__ == '__main__':\n    run_blocking()\n"
  },
  {
    "path": "tests/__init__.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: __init__.py\n@time: 2018-02-10 17:39\n\"\"\"\n\n\ndef func():\n    pass\n\n\nclass Main(object):\n    def __init__(self):\n        pass\n\n\nif __name__ == '__main__':\n    pass\n"
  },
  {
    "path": "tests/test_date_time.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: test_date_time.py\n@time: 2018-06-25 17:55\n\"\"\"\n\n\nfrom __future__ import unicode_literals\n\nimport unittest\n\nimport time\nimport datetime\nfrom tools.date_time import time_local_to_utc, time_utc_to_local\n\n\nclass DateTimeTest(unittest.TestCase):\n    \"\"\"\n    Date/time tests\n    \"\"\"\n    def setUp(self):\n        \"\"\"\n        Get the system time zone and set up a matching pair of local and UTC times\n        1. Assert the converted time offset is correct\n        2. Assert the converted time is correct\n        :return:\n        \"\"\"\n        self.time_offset = time.timezone\n        self.local_time = '2018-06-06 18:12:26'\n        local_time_obj = datetime.datetime.strptime(self.local_time, '%Y-%m-%d %H:%M:%S')\n        self.utc_time = (local_time_obj + datetime.timedelta(hours=self.time_offset/60/60)).strftime('%Y-%m-%d %H:%M:%S')\n\n    def test_local_to_utc(self):\n        \"\"\"\n        Test local-to-UTC conversion\n        :return:\n        \"\"\"\n        local_time_obj = datetime.datetime.strptime(self.local_time, '%Y-%m-%d %H:%M:%S')\n        utc_time_obj = time_local_to_utc(self.local_time)\n\n        self.assertEqual(utc_time_obj, local_time_obj + datetime.timedelta(seconds=self.time_offset))\n        self.assertEqual(self.utc_time, utc_time_obj.strftime('%Y-%m-%d %H:%M:%S'))\n\n    def test_utc_to_local(self):\n        \"\"\"\n        Test UTC-to-local conversion\n        :return:\n        \"\"\"\n        utc_time_obj = datetime.datetime.strptime(self.utc_time, '%Y-%m-%d %H:%M:%S')\n        local_time_obj = time_utc_to_local(self.utc_time)\n\n        self.assertEqual(utc_time_obj, local_time_obj + datetime.timedelta(seconds=self.time_offset))\n        self.assertEqual(self.local_time, local_time_obj.strftime('%Y-%m-%d %H:%M:%S'))\n\n    def tearDown(self):\n        pass\n\n\nif __name__ == '__main__':\n    unittest.main()\n"
  },
  {
    "path": "tests/test_finger.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: test_finger.py\n@time: 2018-02-11 00:06\n\"\"\"\n\nfrom __future__ import unicode_literals\n\nimport hashlib\nimport unittest\n\nfrom scrapy.http import Request\nfrom scrapy.utils import request\n\n\nclass FingerTest(unittest.TestCase):\n    \"\"\"\n    Fingerprint tests\n    \"\"\"\n\n    def setUp(self):\n        self.url_01 = 'https://www.baidu.com/s?wd=openstack&rsv_spt=1'\n        self.url_02 = 'https://www.baidu.com/s?rsv_spt=1&wd=openstack'\n\n    def test_request(self):\n        \"\"\"\n        Same query params in a different order yield the same request fingerprint\n        :return:\n        \"\"\"\n        req_01 = Request(url=self.url_01)\n        result_01 = request.request_fingerprint(req_01)\n\n        req_02 = Request(url=self.url_02)\n        result_02 = request.request_fingerprint(req_02)\n\n        self.assertEqual(result_01, result_02)\n\n    def tearDown(self):\n        pass\n\n\nclass MD5Test(unittest.TestCase):\n    \"\"\"\n    MD5 tests\n    \"\"\"\n\n    def setUp(self):\n        self.url_01 = 'https://www.baidu.com/s?wd=openstack&rsv_spt=1'\n        self.url_02 = 'https://www.baidu.com/s?rsv_spt=1&wd=openstack'\n\n    def test_request(self):\n        \"\"\"\n        Same query params in a different order yield different MD5 digests\n        :return:\n        \"\"\"\n        m1 = hashlib.md5()\n        m1.update(self.url_01.encode('utf-8'))\n        result_01 = m1.hexdigest()\n\n        m2 = hashlib.md5()\n        m2.update(self.url_02.encode('utf-8'))\n        result_02 = m2.hexdigest()\n\n        self.assertNotEqual(result_01, result_02)\n\n    def tearDown(self):\n        pass\n\n\nif __name__ == '__main__':\n    unittest.main()\n"
  },
  {
    "path": "tools/__init__.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: __init__.py\n@time: 2018-02-10 17:10\n\"\"\"\n\nfrom functools import wraps\n\n\ndef catch_keyboard_interrupt(func):\n    \"\"\"\n    Decorator: exit quietly on Ctrl-C instead of dumping a traceback\n    \"\"\"\n    @wraps(func)\n    def wrapper(*args, **kwargs):\n        try:\n            return func(*args, **kwargs)\n        except KeyboardInterrupt:\n            print('\\nForced exit')\n\n    return wrapper\n"
  },
  {
    "path": "tools/anti_spider_sogou.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: anti_spider_sogou.py\n@time: 2018-02-10 17:24\n\"\"\"\n\n\nfrom __future__ import print_function\nfrom __future__ import unicode_literals\n\nfrom future.builtins import input               # PY2(raw_input)\n\nimport random\nimport time\nimport json\n\nimport requests\n\nfrom apps.client_rk import get_img_code, img_report_error\n\nfrom config import current_config\n\n\nREQUESTS_TIME_OUT = current_config.REQUESTS_TIME_OUT\n\n\ncookies = {}\n\n\ns = requests.session()\n\n\nheaders = {\n    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',\n    'Accept-Encoding': 'gzip, deflate',\n    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',\n    'Connection': 'keep-alive',\n    # 'Host': 'weixin.sogou.com',\n    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0'\n}\n\n\ndef _get_tc():\n    tc = str('%13d' % (time.time() * 1000))\n    return tc\n\n\ndef _save_img(res):\n    # 保存验证码图片\n    img_name = 'sogou_%s.jpg' % _get_tc()\n    print('图片名称: %s' % img_name)\n    img_content = res.content\n    with open(img_name, 'wb') as f:\n        f.write(img_content)\n    time.sleep(1)\n\n\ndef anti_spider():\n    url = 'http://weixin.sogou.com/antispider/?from=/weixin?type=2&query=chuangbiandao'\n\n    request_headers = headers.copy()\n    request_headers['Host'] = 'weixin.sogou.com'\n\n    request_cookie = {\n        'refresh': '1'\n    }\n\n    res = s.get(url, headers=request_headers, cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n    cookies.update(res.cookies)\n    print('.', end='')\n    # print cookies\n\n\ndef code_img_save():\n    url = 'http://weixin.sogou.com/antispider/util/seccode.php'\n\n    request_headers = headers.copy()\n    request_headers['Host'] = 'weixin.sogou.com'\n\n    request_cookie = cookies.copy()\n\n    params = {\n  
      'tc': _get_tc(),\n    }\n    res = s.get(url, params=params, headers=request_headers, cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n\n    # 保存图片\n    _save_img(res)\n\n    cookies.update(res.cookies)\n    print('.', end='')\n    # print cookies\n\n\ndef code_img_obj():\n    url = 'http://weixin.sogou.com/antispider/util/seccode.php'\n\n    request_headers = headers.copy()\n    request_headers['Host'] = 'weixin.sogou.com'\n\n    request_cookie = cookies.copy()\n\n    params = {\n        'tc': _get_tc(),\n    }\n    res = s.get(url, params=params, headers=request_headers, cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n    print('.', end='')\n    return res.content\n\n\ndef pv_refresh():\n    url = 'http://pb.sogou.com/pv.gif'\n\n    request_headers = headers.copy()\n    request_headers['Host'] = 'pb.sogou.com'\n\n    request_cookie = {\n        'IPLOC': cookies['IPLOC'],\n        'SUIR': cookies['SUIR'],\n    }\n    params = {\n        'uigs_productid': 'webapp',\n        'type': 'antispider',\n        'subtype': 'refresh',\n        'domain': 'weixin',\n        'suv': '',\n        'snuid': '',\n        't': _get_tc(),\n    }\n    res = s.get(url, params=params, headers=request_headers, cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n    cookies.update(res.cookies)\n    print('.', end='')\n\n\ndef pv_index():\n    url = 'http://pb.sogou.com/pv.gif'\n\n    request_headers = headers.copy()\n    request_headers['Host'] = 'pb.sogou.com'\n\n    request_cookie = {\n        'IPLOC': cookies['IPLOC'],\n        'SUIR': cookies['SUIR'],\n    }\n    params = {\n        'uigs_productid': 'webapp',\n        'type': 'antispider',\n        'subtype': 'index',\n        'domain': 'weixin',\n        'suv': '',\n        'snuid': '',\n        't': _get_tc(),\n    }\n    res = s.get(url, params=params, headers=request_headers, cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n    cookies.update(res.cookies)\n    print('.', end='')\n\n\ndef pv_img_cost():\n    url = 
'http://pb.sogou.com/pv.gif'\n\n    request_headers = headers.copy()\n    request_headers['Host'] = 'pb.sogou.com'\n\n    request_cookie = {\n        'IPLOC': cookies['IPLOC'],\n        'SUIR': cookies['SUIR'],\n    }\n\n    params = {\n        'uigs_productid': 'webapp',\n        'type': 'antispider',\n        'subtype': 'imgCost',\n        'domain': 'weixin',\n        'suv': '',\n        'snuid': '',\n        't': _get_tc(),\n        'cost': '27',\n    }\n    res = s.get(url, params=params, headers=request_headers, cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n    cookies.update(res.cookies)\n    print('.', end='')\n\n\ndef pv_mouse():\n    url = 'http://pb.sogou.com/pv.gif'\n\n    request_headers = headers.copy()\n    request_headers['Host'] = 'pb.sogou.com'\n\n    request_cookie = {\n        'IPLOC': cookies['IPLOC'],\n        'SUIR': cookies['SUIR'],\n        'SUV': cookies['SUV'],\n    }\n\n    params = {\n        'uigs_productid': 'webapp',\n        'type': 'antispider',\n        'subtype': 'mouse',\n        'domain': 'weixin',\n        'suv': cookies['SUV'],\n        'snuid': '',\n        't': _get_tc(),\n    }\n    res = s.get(url, params=params, headers=request_headers, cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n    cookies.update(res.cookies)\n    print('.', end='')\n\n\ndef pv_img_success():\n    url = 'http://pb.sogou.com/pv.gif'\n\n    request_headers = headers.copy()\n    request_headers['Host'] = 'pb.sogou.com'\n\n    request_cookie = {\n        'IPLOC': cookies['IPLOC'],\n        'SUIR': cookies['SUIR'],\n        'SUV': cookies['SUV'],\n    }\n\n    params = {\n        'uigs_productid': 'webapp',\n        'type': 'antispider',\n        'subtype': 'imgSuccess',\n        'domain': 'weixin',\n        'suv': cookies['SUV'],\n        'snuid': '',\n        't': _get_tc(),\n    }\n    res = s.get(url, params=params, headers=request_headers, cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n    cookies.update(res.cookies)\n    print('.', 
end='')\n\n\ndef pv_real_index():\n    url = 'http://pb.sogou.com/pv.gif'\n\n    request_headers = headers.copy()\n    request_headers['Host'] = 'pb.sogou.com'\n\n    request_cookie = {\n        'IPLOC': cookies['IPLOC'],\n        'SUIR': cookies['SUIR'],\n        'SUV': cookies['SUV'],\n    }\n\n    params = {\n        'uigs_productid': 'webapp',\n        'type': 'antispider',\n        'subtype': 'realIndex',\n        'domain': 'weixin',\n        'suv': cookies['SUV'],\n        'snuid': '',\n        't': _get_tc(),\n    }\n    res = s.get(url, params=params, headers=request_headers, cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n    cookies.update(res.cookies)\n    print('.', end='')\n\n\ndef pv_seccode_focus():\n    url = 'http://pb.sogou.com/pv.gif'\n\n    request_headers = headers.copy()\n    request_headers['Host'] = 'pb.sogou.com'\n\n    request_cookie = {\n        'IPLOC': cookies['IPLOC'],\n        'SUIR': cookies['SUIR'],\n        'SUV': cookies['SUV'],\n    }\n\n    params = {\n        'uigs_productid': 'webapp',\n        'type': 'antispider',\n        'subtype': 'seccodeFocus',\n        'domain': 'weixin',\n        'suv': cookies['SUV'],\n        'snuid': '',\n        't': _get_tc(),\n    }\n    res = s.get(url, params=params, headers=request_headers, cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n    cookies.update(res.cookies)\n    print('.', end='')\n\n\ndef pv_seccode_input():\n    url = 'http://pb.sogou.com/pv.gif'\n\n    request_headers = headers.copy()\n    request_headers['Host'] = 'pb.sogou.com'\n\n    request_cookie = {\n        'IPLOC': cookies['IPLOC'],\n        'SUIR': cookies['SUIR'],\n        'SUV': cookies['SUV'],\n    }\n\n    params = {\n        'uigs_productid': 'webapp',\n        'type': 'antispider',\n        'subtype': 'seccodeInput',\n        'domain': 'weixin',\n        'suv': cookies['SUV'],\n        'snuid': '',\n        't': _get_tc(),\n    }\n    res = s.get(url, params=params, headers=request_headers, 
cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n    cookies.update(res.cookies)\n    print('.', end='')\n\n\ndef pv_seccode_blur():\n    url = 'http://pb.sogou.com/pv.gif'\n\n    request_headers = headers.copy()\n    request_headers['Host'] = 'pb.sogou.com'\n\n    request_cookie = {\n        'IPLOC': cookies['IPLOC'],\n        'SUIR': cookies['SUIR'],\n        'SUV': cookies['SUV'],\n    }\n\n    params = {\n        'uigs_productid': 'webapp',\n        'type': 'antispider',\n        'subtype': 'seccodeBlur',\n        'domain': 'weixin',\n        'suv': cookies['SUV'],\n        'snuid': '',\n        't': _get_tc(),\n    }\n    res = s.get(url, params=params, headers=request_headers, cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n    cookies.update(res.cookies)\n    print('.', end='')\n\n\ndef thank(code_anti_spider):\n    url = 'http://weixin.sogou.com/antispider/thank.php'\n\n    request_headers = headers.copy()\n    request_headers['X-Requested-With'] = 'XMLHttpRequest'\n\n    request_cookie = {\n        'ABTEST': cookies['ABTEST'],\n        'IPLOC': cookies['IPLOC'],\n        'SUID': cookies['SUID'],\n        'PHPSESSID': cookies['PHPSESSID'],\n        'SUIR': cookies['SUIR'],\n        'SUV': cookies['SUV'],\n    }\n\n    data = {\n        'c': code_anti_spider,\n        'r': '%2Fweixin%3Ftype%3D2',\n        'v': '5',\n    }\n\n    res = s.post(url, data=data, headers=request_headers, cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n    cookies.update(res.cookies)\n    # print cookies\n\n    json_msg = json.loads(res.content)\n    print(json_msg)\n    return json_msg\n    # {\"code\": 0,\"msg\": \"解封成功，正在为您跳转来源地址...\", \"id\": \"ECB542781D1B4105B09FB4461E0587D4\"}\n    # {\"code\": 2,\"msg\": \"未知访问来源\"}\n    # {\"code\": 3,\"msg\": \"验证码输入错误, 请重新输入！\"}\n\n\ndef _get_cookies():\n    print(cookies)\n    return cookies\n\n\ndef check_n():\n    url = 'http://weixin.sogou.com/weixin?query=chuangbiandao&type=1'\n    res = requests.get(url, headers=headers, 
timeout=REQUESTS_TIME_OUT)\n    print(res.content)\n\n\ndef check_y():\n    url = 'http://weixin.sogou.com/weixin?query=chuangbiandao&type=1'\n    res = s.get(url, headers=headers, cookies=cookies, timeout=REQUESTS_TIME_OUT)\n    print(res.content)\n\n\ndef manual_cookies():\n    \"\"\"\n    获取 cookies - 手动填验证码\n    :return:\n    \"\"\"\n    anti_spider()\n    code_img_save()\n\n    # 模拟用户行为\n    pv_refresh()\n    pv_index()\n    pv_img_cost()\n\n    # 模拟鼠标滑过\n    pv_mouse()\n    pv_img_success()\n    pv_real_index()\n\n    # 模拟表单输入\n    pv_seccode_focus()\n    pv_seccode_input()\n    pv_seccode_blur()\n\n    input_code = input('code << ')\n\n    thank(input_code)\n\n    return _get_cookies()\n\n\ndef auto_cookies():\n    \"\"\"\n    获取 cookies - 第三方识别验证码\n    :return:\n    \"\"\"\n    anti_spider()\n\n    im = code_img_obj()\n    # 6位英数混合 白天:15快豆 夜间:18.75快豆 超时:60秒\n    img_id, img_code = get_img_code(im, im_type_id=3060)\n    if not img_id:\n        return None\n    print(img_id, img_code)\n\n    # 模拟用户行为\n    pv_refresh()\n    pv_index()\n    pv_img_cost()\n\n    # 模拟鼠标滑过\n    pv_mouse()\n    pv_img_success()\n    pv_real_index()\n\n    # 模拟表单输入\n    pv_seccode_focus()\n    pv_seccode_input()\n    pv_seccode_blur()\n\n    # 重试3次\n    c = 3\n    while c > 0:\n        c -= 1\n        res = thank(img_code)\n        if res.get('code') == 0:\n            # 识别成功\n            cookies['SNUID'] = res.get('id', '')\n            break\n        elif res.get('code') == 3:\n            # 报告错误识别\n            img_report_error(img_id)\n\n            # 出错随机等待后重试\n            time.sleep(random.randint(1, 5))\n\n            # 换张图片再来一次\n            im = code_img_obj()\n            # 6位英数混合 白天:15快豆 夜间:18.75快豆 超时:60秒\n            img_id, img_code = get_img_code(im, im_type_id=3060)\n            print(img_id, img_code)\n        else:\n            print('Error')\n            print(res)\n            return None\n\n    return _get_cookies() if c > 0 else None\n\n\nif __name__ == 
'__main__':\n    # manual_cookies()\n    auto_cookies()\n    # check_n()\n    # check_y()\n"
  },
  {
    "path": "tools/anti_spider_weixin.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: anti_spider_weixin.py\n@time: 2018-02-10 17:24\n\"\"\"\n\n\nfrom __future__ import print_function\nfrom __future__ import unicode_literals\nfrom future.builtins import input               # PY2(raw_input)\n\nimport random\nimport time\nimport json\n\nfrom lxml.html import fromstring\n\nimport requests\n\nfrom apps.client_rk import get_img_code, img_report_error\n\nfrom config import current_config\n\n\nREQUESTS_TIME_OUT = current_config.REQUESTS_TIME_OUT\n\n\ncookies = {}\n\n\ns = requests.session()\n\n\nheaders = {\n    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',\n    'Accept-Encoding': 'gzip, deflate',\n    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',\n    'Connection': 'keep-alive',\n    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:57.0) Gecko/20100101 Firefox/57.0'\n}\n\n\ndef _get_tc():\n    tc = str('%13d' % (time.time() * 1000))\n    return tc\n\n\ndef _save_img(res):\n    # 保存验证码图片\n    img_name = 'weixin_%s.jpg' % _get_tc()\n    print('图片名称: %s' % img_name)\n    img_content = res.content\n    with open(img_name, 'wb') as f:\n        f.write(img_content)\n    time.sleep(1)\n\n\ndef anti_spider(url):\n    # url = 'https://mp.weixin.qq.com/profile?src=3&timestamp=1512923946&ver=1&signature=RZh61VIthXnp4HUsow1pgQXJbGxi*v-n4Pr1W6e5PVkmJSbRknd6LMT-EFoQqX4gaM6uGyHREmDPsN6lXkeYfg=='\n\n    request_headers = headers.copy()\n    request_headers['Host'] = 'mp.weixin.qq.com'\n\n    res = s.get(url, headers=request_headers, timeout=REQUESTS_TIME_OUT)\n    cookies.update(res.cookies)\n    print('.', end='')\n\n    doc = fromstring(res.text)\n    title = u''.join(i.strip() for i in doc.xpath('//title/text()'))\n    print(title)\n    return title == '请输入验证码'\n\n\ndef code_img_save():\n    url = 'https://mp.weixin.qq.com/mp/verifycode'\n\n    
request_headers = headers.copy()\n    request_headers['Host'] = 'mp.weixin.qq.com'\n\n    request_cookie = cookies.copy()\n\n    params = {\n        'cert': _get_tc(),\n    }\n    res = s.get(url, params=params, headers=request_headers, cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n\n    # 保存图片\n    _save_img(res)\n\n    cookies.update(res.cookies)\n    print('.', end='')\n    # print cookies\n\n\ndef code_img_obj():\n    url = 'https://mp.weixin.qq.com/mp/verifycode'\n\n    request_headers = headers.copy()\n    request_headers['Host'] = 'mp.weixin.qq.com'\n\n    request_cookie = cookies.copy()\n\n    params = {\n        'cert': _get_tc(),\n    }\n    res = s.get(url, params=params, headers=request_headers, cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n    print('.', end='')\n    return res.content\n\n\ndef verify_code(input_code):\n    url = 'https://mp.weixin.qq.com/mp/verifycode'\n    request_headers = headers.copy()\n    request_headers['Host'] = 'mp.weixin.qq.com'\n    request_headers['X-Requested-With'] = 'XMLHttpRequest'\n\n    request_cookie = cookies.copy()\n\n    data = {\n        'cert': _get_tc(),\n        'input': input_code,\n        'appmsg_token': '',\n    }\n\n    res = s.post(url, data=data, headers=request_headers, cookies=request_cookie, timeout=REQUESTS_TIME_OUT)\n    cookies.update(res.cookies)\n    # print cookies\n\n    json_msg = json.loads(res.content)\n    print(json_msg)\n    return json_msg\n    # {u'cookie_count': 0, u'errmsg': u'', u'ret': 0}\n    # {u'cookie_count': 0, u'errmsg': u'', u'ret': 501} 验证码有误\n\n\ndef _get_cookies():\n    print(cookies)\n    return cookies\n\n\ndef manual_cookies():\n    url = input('url << ')\n    anti_spider(url)\n    code_img_save()\n\n    input_code = input('code << ')\n\n    verify_code(input_code)\n\n    return _get_cookies()\n\n\ndef auto_cookies(url):\n    need_status = anti_spider(url)\n    if not need_status:\n        return True\n\n    im = code_img_obj()\n    # 4位纯英文字母 白天:10快豆 
夜间:12.5快豆 超时:60秒\n    img_id, img_code = get_img_code(im, im_type_id=2040)\n    print(img_id, img_code)\n\n    # 重试3次\n    c = 3\n    while c > 0:\n        c -= 1\n        res = verify_code(img_code)\n        if res.get('ret') == 0:\n            # 识别成功\n            break\n        elif res.get('ret') == 501:\n            # 报告错误识别\n            img_report_error(img_id)\n\n            # 出错随机等待后重试\n            time.sleep(random.randint(1, 5))\n\n            # 换张图片再来一次\n            im = code_img_obj()\n            # 4位纯英文字母 白天:10快豆 夜间:12.5快豆 超时:60秒\n            img_id, img_code = get_img_code(im, im_type_id=2040)\n            print(img_id, img_code)\n        else:\n            print('Error')\n            print(res)\n            return False\n\n    return True if c > 0 else False\n\n\nif __name__ == '__main__':\n    # manual_cookies()\n    anti_spider_url = 'http://mp.weixin.qq.com/profile?src=3&timestamp=1513650933&ver=1&signature=zzgwSdnYIm68Nu5eFz1X8-Heqjojhy4ozHmg4cUz*hEo*QuXma9-qkMrOFxzOGDfzJHHfyechg0AVCFPpsXpuA=='\n    print(auto_cookies(anti_spider_url))\n"
  },
  {
    "path": "tools/char.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: char.py\n@time: 2018-02-10 17:48\n\"\"\"\n\n\nimport execjs\n\n# from HTMLParser import HTMLParser     # PY2\n# from html.parser import HTMLParser    # PY3\nfrom future.moves.html.parser import HTMLParser\n\nhtml_parser = HTMLParser()\n\n\ndef un_escape(char_str):\n    \"\"\"\n    反转译\n    :param char_str:\n    :return:\n    \"\"\"\n    return html_parser.unescape(char_str)\n\n\ndef get_js_36_str(i):\n    \"\"\"\n    整数、浮点数 js方式转36进制\n    :param i:\n    :return:\n    \"\"\"\n    js_body = '''\n        function get_36_str(i) {\n            return i.toString(36);\n        };\n    '''\n    ctx = execjs.compile(js_body)\n    return ctx.call(\"get_36_str\", i)\n\n\nif __name__ == '__main__':\n    a = '&#21152;&#20837;&#21040;&#34;&#25105;&#30340;&#20070;&#30446;&#36873;&#21333;&#34;&#20013;'\n    b = '\\xe5\\xbd\\x93\\xe5\\x89\\x8d\\xe5\\xb7\\xb2\\xe8\\xbe\\xbe\\xe5\\x88\\xb0\\xe6\\x8a\\x93\\xe5\\x8f\\x96\\xe9\\x85\\x8d\\xe7\\xbd\\xae\\xe7\\x9a\\x84\\xe6\\x9c\\x80\\xe5\\xa4\\xa7\\xe9\\xa1\\xb5\\xe7\\xa0\\x81'\n    c = 'https://mp.weixin.qq.com/s?timestamp=1511432702&amp;src=3&amp;ver=1&amp;signature=lAC8MtonFiHnlc5-j4z48WcPRpfP1Nn4zxCmY4ZjCjdXQscLcB5uyi5Jb395m5yaZQHTqqSlqzy*HRR0nAPZHsz0*Efu3w*Y2B8XbIL5v8pZQsGt9cwZQTuvI0GZqAsZobqzaeDptAQzHLB4QKL-qExOz0ANOTG*QAvJ7-ZurMg='\n    d = 'http://mp.weixin.qq.com/mp/homepage?__biz=MzAxNzU2Mjc4NQ==&amp;hid=2&amp;sn=8177890cc7e468d3df6f3050d49951c5#wechat_redirect'\n    print(un_escape(a))\n    print(un_escape(b))\n    print(un_escape(c))\n    print(un_escape(d))\n"
  },
  {
    "path": "tools/cookies.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: cookies.py\n@time: 2018-02-10 17:49\n\"\"\"\n\nfrom __future__ import print_function\nfrom __future__ import unicode_literals\n\nimport json\nimport hashlib\n\nfrom apps.client_db import redis_client\n\n\ndef _get_cookies_str(cookies_dict):\n    \"\"\"\n    In [1]: import json\n\n    In [2]: sd = {'c':1, 'b':2, 'a':3}\n\n    In [3]: sd\n    Out[3]: {'a': 3, 'b': 2, 'c': 1}\n\n    In [4]: items = sd.items()\n\n    In [5]: items\n    Out[5]: [('a', 3), ('c', 1), ('b', 2)]\n\n    In [6]: sorted(items)\n    Out[6]: [('a', 3), ('b', 2), ('c', 1)]\n\n    In [7]: sorted(items, reverse=True)\n    Out[7]: [('c', 1), ('b', 2), ('a', 3)]\n\n    In [8]: json.dumps(sorted(items))\n    Out[8]: '[[\"a\", 3], [\"b\", 2], [\"c\", 1]]'\n\n    In [9]: json.loads(json.dumps(sorted(items)))\n    Out[9]: [[u'a', 3], [u'b', 2], [u'c', 1]]\n\n    In [10]: dict(json.loads(json.dumps(sorted(items))))\n    Out[10]: {u'a': 3, u'b': 2, u'c': 1}\n    :param cookies_dict:\n    :return:\n    \"\"\"\n    cookies_str = json.dumps(sorted(cookies_dict.items()))\n    return cookies_str\n\n\ndef _get_finger(cookies_str):\n    \"\"\"\n    :param cookies_str:\n    :return:\n    \"\"\"\n    m = hashlib.md5()\n    # md5 needs bytes; a PY2/PY3-safe check (the former `unicode` test breaks on PY3)\n    m.update(cookies_str if isinstance(cookies_str, bytes) else cookies_str.encode('utf-8'))\n    finger = m.hexdigest()\n    return finger\n\n\ndef get_cookies(spider_name):\n    \"\"\"\n    Get cookies\n    Tolerates redis having no cookies pool\n    :param spider_name:\n    :return:\n    \"\"\"\n    key_set = 'scrapy:cookies_set:%(spider_name)s' % {'spider_name': spider_name}\n    cookies_id = redis_client.srandmember(key_set)\n\n    key_id = 'scrapy:cookies_id:%(cookies_id)s' % {'cookies_id': cookies_id}\n    cookies_str = redis_client.get(key_id)\n    cookies_obj = dict(json.loads(cookies_str or '[]'))\n\n    return cookies_id, cookies_obj\n\n\ndef add_cookies(spider_name, cookies_obj):\n    
\"\"\"\n    添加 cookies\n    :param spider_name:\n    :param cookies_obj:\n    :return:\n    \"\"\"\n    cookies_str = _get_cookies_str(cookies_obj)\n    cookies_id = _get_finger(cookies_str)\n\n    key_id = 'scrapy:cookies_id:%(cookies_id)s' % {'cookies_id': cookies_id}\n    key_set = 'scrapy:cookies_set:%(spider_name)s' % {'spider_name': spider_name}\n\n    if redis_client.sismember(key_set, cookies_id):\n        return False\n\n    redis_client.set(key_id, cookies_str)\n    redis_client.sadd(key_set, cookies_id)\n    return True\n\n\ndef del_cookies(spider_name, cookies_id):\n    \"\"\"\n    删除 cookies\n    :param spider_name:\n    :param cookies_id:\n    :return:\n    \"\"\"\n    key_id = 'scrapy:cookies_id:%(cookies_id)s' % {'cookies_id': cookies_id}\n    key_set = 'scrapy:cookies_set:%(spider_name)s' % {'spider_name': spider_name}\n\n    redis_client.delete(key_id)\n    redis_client.srem(key_set, cookies_id)\n\n\ndef len_cookies(spider_name):\n    \"\"\"\n    获取 cookies 长度\n    :param spider_name:\n    :return:\n    \"\"\"\n    key_set = 'scrapy:cookies_set:%(spider_name)s' % {'spider_name': spider_name}\n    cookies_len = redis_client.scard(key_set)\n    return cookies_len\n\n\n\"\"\"\n集合\nkey: cookies_id\n\n字符串\ncookies_id_key: cookies_obj\n\"\"\"\n"
  },
  {
    "path": "tools/date_time.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: date_time.py\n@time: 2018-06-25 16:44\n\"\"\"\n\n\nfrom __future__ import unicode_literals\nimport six\n\nimport time\nimport calendar\nfrom datetime import datetime, timedelta, date\n\n\ndef get_tc():\n    \"\"\"\n    获取13位字符串时间戳\n    :return:\n    \"\"\"\n    tc = str('%13d' % (time.time() * 1000))\n    return tc\n\n\ndef get_current_day_time_ends():\n    \"\"\"\n    获取当天开始结束时刻\n    :return:\n    \"\"\"\n    today = datetime.today()\n    start_time = datetime(today.year, today.month, today.day, 0, 0, 0)\n    end_time = datetime(today.year, today.month, today.day, 23, 59, 59)\n    return start_time, end_time\n\n\ndef get_current_month_time_ends():\n    \"\"\"\n    获取当月开始结束时刻\n    :return:\n    \"\"\"\n    today = datetime.today()\n    _, days = calendar.monthrange(today.year, today.month)\n    start_time = datetime(today.year, today.month, 1, 0, 0, 0)\n    end_time = datetime(today.year, today.month, days, 23, 59, 59)\n    return start_time, end_time\n\n\ndef get_current_year_time_ends():\n    \"\"\"\n    获取当年开始结束时刻\n    :return:\n    \"\"\"\n    today = datetime.today()\n    start_time = datetime(today.year, 1, 1, 0, 0, 0)\n    end_time = datetime(today.year, 12, 31, 23, 59, 59)\n    return start_time, end_time\n\n\ndef get_hours(zerofill=True):\n    \"\"\"\n    列出1天所有24小时\n    :return:\n    \"\"\"\n    if zerofill:\n        return ['%02d' % i for i in range(24)]\n    else:\n        return range(24)\n\n\ndef get_days(year=1970, month=1, zerofill=True):\n    \"\"\"\n    列出当月的所有日期\n    :param year:\n    :param month:\n    :param zerofill:\n    :return:\n    \"\"\"\n    year = int(year)\n    month = int(month)\n    _, days = calendar.monthrange(year, month)\n    if zerofill:\n        return ['%02d' % i for i in range(1, days+1)]\n    else:\n        return range(1, days+1)\n\n\ndef get_weeks():\n    \"\"\"\n    列出所有星期\n    :return:\n    \"\"\"\n    return 
['周一', '周二', '周三', '周四', '周五', '周六', '周日']\n\n\ndef get_months(zerofill=True):\n    \"\"\"\n    列出1年所有12月份\n    :return:\n    \"\"\"\n    if zerofill:\n        return ['%02d' % i for i in range(1, 13)]\n    else:\n        return [i for i in range(1, 13)]\n\n\ndef time_local_to_utc(local_time):\n    \"\"\"\n    本地时间转UTC时间\n    :param local_time:\n    :return:\n    \"\"\"\n    # 字符串处理\n    if isinstance(local_time, six.string_types) and len(local_time) == 10:\n        local_time = datetime.strptime(local_time, '%Y-%m-%d')\n    elif isinstance(local_time, six.string_types) and len(local_time) >= 19:\n        local_time = datetime.strptime(local_time[:19], '%Y-%m-%d %H:%M:%S')\n    elif not (isinstance(local_time, datetime) or isinstance(local_time, date)):\n        local_time = datetime.now()\n    # 时间转换\n    utc_time = local_time + timedelta(seconds=time.timezone)\n    return utc_time\n\n\ndef time_utc_to_local(utc_time):\n    \"\"\"\n    UTC时间转本地时间\n    :param utc_time:\n    :return:\n    \"\"\"\n    # 字符串处理\n    if isinstance(utc_time, six.string_types) and len(utc_time) == 10:\n        utc_time = datetime.strptime(utc_time, '%Y-%m-%d')\n    elif isinstance(utc_time, six.string_types) and len(utc_time) >= 19:\n        utc_time = datetime.strptime(utc_time[:19], '%Y-%m-%d %H:%M:%S')\n    elif not (isinstance(utc_time, datetime) or isinstance(utc_time, date)):\n        utc_time = datetime.utcnow()\n    # 时间转换\n    local_time = utc_time - timedelta(seconds=time.timezone)\n    return local_time\n\n\nif __name__ == '__main__':\n    print(get_current_day_time_ends())\n    print(get_current_month_time_ends())\n    print(get_current_year_time_ends())\n    print(get_hours(zerofill=False))\n    print(get_hours(zerofill=True))\n    print(get_days(zerofill=False))\n    print(get_days(zerofill=True))\n    print(get_months(zerofill=False))\n    print(get_months(zerofill=True))\n"
  },
  {
    "path": "tools/duplicate.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: duplicate.py\n@time: 2018-02-10 17:39\n\"\"\"\n\nfrom __future__ import print_function\nfrom __future__ import unicode_literals\n\nfrom apps.client_db import redis_client\nfrom tools.url import get_request_finger\n\n\ndef is_dup_detail(detail_url, spider_name, channel_id=0):\n    \"\"\"\n    检查详细页是否重复\n    :param detail_url:\n    :param spider_name:\n    :param channel_id:\n    :return:\n    \"\"\"\n    detail_dup_key = 'scrapy:dup:%s:%s' % (spider_name, channel_id)\n    detail_url_finger = get_request_finger(detail_url)\n    return redis_client.sismember(detail_dup_key, detail_url_finger)\n\n\ndef add_dup_detail(detail_url, spider_name, channel_id=0):\n    \"\"\"\n    把当前详细页加入集合\n    :param detail_url:\n    :param spider_name:\n    :param channel_id:\n    :return:\n    \"\"\"\n    detail_dup_key = 'scrapy:dup:%s:%s' % (spider_name, channel_id)\n    detail_url_finger = get_request_finger(detail_url)\n    return redis_client.sadd(detail_dup_key, detail_url_finger)\n"
  },
  {
    "path": "tools/gen.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: gen.py\n@time: 2018-02-10 17:19\n\"\"\"\n\nfrom __future__ import print_function\nfrom __future__ import unicode_literals\n\nimport os\nimport sys\nfrom sqlalchemy.ext.declarative.api import DeclarativeMeta\nfrom sqlalchemy.inspection import inspect\nfrom config import current_config\n\n\nBASE_DIR = current_config.BASE_DIR\nSQLALCHEMY_DATABASE_URI = current_config.SQLALCHEMY_DATABASE_URI_MYSQL\n\n\ndef gen_models():\n    \"\"\"\n    创建 models\n    $ python gen.py gen_models\n    \"\"\"\n    file_path = os.path.join(BASE_DIR, 'models/news.py')\n    cmd = 'sqlacodegen %s --noinflect --outfile %s' % (SQLALCHEMY_DATABASE_URI, file_path)\n\n    output = os.popen(cmd)\n    result = output.read()\n    print(result)\n\n    # 更新 model 文件\n    with open(file_path, b'r') as f:\n        lines = f.readlines()\n    # 新增 model 转 dict 方法\n    with open(file_path, b'w') as f:\n        lines.insert(9, b'def to_dict(self):\\n')\n        lines.insert(10, b'    return {c.name: getattr(self, c.name, None) for c in self.__table__.columns}\\n')\n        lines.insert(11, b'\\n')\n        lines.insert(12, b'Base.to_dict = to_dict\\n')\n        lines.insert(13, b'\\n\\n')\n        f.write(b''.join(lines))\n\n\ndef gen_items():\n    \"\"\"\n    创建 items\n    $ python gen.py gen_items\n    字段规则： 去除自增主键，非自增是需要的。\n    \"\"\"\n    from models import news\n\n    file_path = os.path.join(BASE_DIR, 'news/items.py')\n\n    model_list = [(k, v) for k, v in news.__dict__.items() if isinstance(v, DeclarativeMeta) and k != 'Base']\n\n    with open(file_path, b'w') as f:\n        f.write(b'# -*- coding: utf-8 -*-\\n\\n')\n        f.write(b'# Define here the models for your scraped items\\n#\\n')\n        f.write(b'# See documentation in:\\n')\n        f.write(b'# http://doc.scrapy.org/en/latest/topics/items.html\\n\\n')\n        f.write(b'import scrapy\\n')\n\n        for model_name, 
model_class in model_list:\n            result = model_class().to_dict()\n            table_name = model_class().__tablename__\n            model_pk = inspect(model_class).primary_key[0].name\n            f.write(b'\\n\\nclass %sItem(scrapy.Item):\\n' % model_name)\n            f.write(b'    \"\"\"\\n')\n            f.write(b'    table_name: %s\\n' % table_name)\n            f.write(b'    primary_key: %s\\n' % model_pk)\n            f.write(b'    \"\"\"\\n')\n            for field_name in list(result.keys()):\n                if field_name in [model_pk, 'create_time', 'update_time']:\n                    continue\n                f.write(b'    %s = scrapy.Field()\\n' % field_name)\n\n\ndef run():\n    \"\"\"\n    入口\n    \"\"\"\n    # print sys.argv\n    try:\n        if len(sys.argv) > 1:\n            fun_name = globals()[sys.argv[1]]\n            fun_name()\n        else:\n            print('缺失参数\\n')\n            usage()\n    except NameError as e:\n        print(e)\n        print('未定义的方法[%s]' % sys.argv[1])\n\n\ndef usage():\n    print(\"\"\"\n创建 models\n$ python gen.py gen_models\n\n创建 items\n$ python gen.py gen_items\n\"\"\")\n\n\nif __name__ == '__main__':\n    run()\n    # print BASE_DIR\n    # print SQLALCHEMY_DATABASE_URI\n"
  },
  {
    "path": "tools/img.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: img.py\n@time: 2018-03-20 14:24\n\"\"\"\n\n\nfrom __future__ import print_function\nfrom __future__ import unicode_literals\n\nimport imghdr\n\nimport requests\nfrom PIL import Image\nfrom six import BytesIO\n\nfrom config import current_config\n\nREQUESTS_TIME_OUT = current_config.REQUESTS_TIME_OUT\n\n\ndef filter_img_size(min_width=0, min_height=0, *img_url):\n    \"\"\"\n    过滤尺寸不符要求的图片\n    :param min_width:\n    :param min_height:\n    :param img_url:\n    :return:\n    \"\"\"\n    result = []\n    for i in img_url:\n        try:\n            img_res = requests.get(i, stream=True, timeout=REQUESTS_TIME_OUT)\n            if img_res.status_code == 200:\n                orig_image = Image.open(BytesIO(img_res.content))\n                img_width, img_height = orig_image.size\n                if img_width >= min_width and img_height >= min_height:\n                    result.append(i)\n        except Exception as e:\n            print('check images error: %s' % img_url)\n            print(e.message)\n            continue\n    return result\n\n\ndef filter_local_img_type(ignore_type='gif', *img_path):\n    \"\"\"\n    过滤指定类型本地图片\n    :param ignore_type:\n    :param img_path:\n    :return:\n    \"\"\"\n    result = []\n    for i in img_path:\n        img_type = imghdr.what(i)\n        # print(img_type, i)\n        if img_type == ignore_type:\n            continue\n        result.append(i)\n    return result\n\n\ndef filter_remote_img_type(ignore_type='gif', *img_url):\n    \"\"\"\n    过滤指定类型远程图片\n    :param ignore_type:\n    :param img_url:\n    :return:\n    \"\"\"\n    result = []\n    for i in img_url:\n        img_type = imghdr.what(None, requests.get(i).content)\n        # print(img_type, i)\n        if img_type == ignore_type:\n            continue\n        result.append(i)\n    return result\n\n\nif __name__ == '__main__':\n    pass\n"
  },
  {
    "path": "tools/import_task.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: import_csv.py\n@time: 2018-05-17 18:46\n\"\"\"\n\nfrom __future__ import print_function\nfrom __future__ import unicode_literals\n\nimport sys\nimport csv\nimport json\nfrom apps.client_db import add_item\nfrom models.news import FetchTask\n\n\ndef read_csv(filename):\n    \"\"\"\n    读取csv\n    :param filename:\n    :return:\n    \"\"\"\n    count = 0\n    with open(filename) as f:\n        reader = csv.DictReader(f)\n        for line in reader:\n            print(json.dumps(line, indent=4, ensure_ascii=False))\n            count += 1\n            yield line\n    print('读取数量: %s' % count)\n\n\ndef import_csv(filename):\n    \"\"\"\n    导入csv\n    :param filename:\n    :return:\n    \"\"\"\n    count = 0\n    for item in read_csv(filename):\n        result = add_item(FetchTask, item)\n        print(result)\n        count += 1\n    print('导入数量: %s' % count)\n\n\ndef usage():\n    print('''\n导入 csv\n注意 csv 格式, 表头与数据库任务表的字段对应（去掉主键）\n$ python tools/import_task.py example.csv\n''')\n\n\ndef run():\n    \"\"\"\n    入口\n    \"\"\"\n    # print sys.argv\n    try:\n        if len(sys.argv) < 2:\n            raise Exception('缺失参数\\n')\n        import_csv(sys.argv[1])\n    except Exception as e:\n        print('导入异常')\n        print(e)\n        usage()\n\n\nif __name__ == '__main__':\n    run()\n"
  },
  {
    "path": "tools/net_status.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: net_status.py\n@time: 2018-05-28 20:45\n\"\"\"\n\nimport time\nfrom apps.client_db import redis_client\n\n\ndef get_reboot_net_status(net_name='optical_modem_china_net'):\n    key_reboot_net = 'scrapy:reboot_net:%s' % net_name\n    reboot_net_status = redis_client.get(key_reboot_net)\n    return reboot_net_status\n\n\ndef set_reboot_net_status(net_name='optical_modem_china_net'):\n    key_reboot_net = 'scrapy:reboot_net:%s' % net_name\n    reboot_net_status = time.strftime('%Y-%m-%d %H:%M:%S')\n    redis_client.set(key_reboot_net, reboot_net_status)\n\n\ndef del_reboot_net_status(net_name='optical_modem_china_net'):\n    key_reboot_net = 'scrapy:reboot_net:%s' % net_name\n    redis_client.delete(key_reboot_net)\n"
  },
  {
    "path": "tools/proxies.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: proxies.py\n@time: 2018-03-13 16:37\n\"\"\"\n\n\nimport json\nimport requests\nfrom apps.client_db import redis_client\nfrom tools.url import get_update_url\n\nfrom config import current_config\n\n\nREQUESTS_TIME_OUT = current_config.REQUESTS_TIME_OUT\n\n\ndef add_proxy(spider_name, *proxy):\n    key_set = 'scrapy:proxies_set:%(spider_name)s' % {'spider_name': spider_name}\n    return redis_client.sadd(key_set, *proxy)\n\n\ndef del_proxy(spider_name, proxy):\n    key_set = 'scrapy:proxies_set:%(spider_name)s' % {'spider_name': spider_name}\n    return redis_client.srem(key_set, proxy)\n\n\ndef get_proxy(spider_name):\n    key_set = 'scrapy:proxies_set:%(spider_name)s' % {'spider_name': spider_name}\n    return redis_client.srandmember(key_set)\n\n\ndef len_proxy(spider_name):\n    key_set = 'scrapy:proxies_set:%(spider_name)s' % {'spider_name': spider_name}\n    return redis_client.scard(key_set)\n\n\ndef fetch_proxy(country='China', scheme='http'):\n    \"\"\"\n    获取代理\n    :param country:\n    :param scheme:\n    :return:\n    \"\"\"\n    data = {}\n    if country:\n        data['country'] = country\n    if scheme:\n        data['type'] = scheme\n    url = 'http://proxy.nghuyong.top/'\n    url = get_update_url(url, data)\n    res = requests.get(url, timeout=REQUESTS_TIME_OUT).json()\n    return ['%s://%s' % (i['type'], i['ip_and_port']) for i in res.get('data', [])]\n\n\nif __name__ == '__main__':\n    proxy_result = fetch_proxy()\n    print(json.dumps(proxy_result, indent=4))\n"
  },
  {
    "path": "tools/scrapy_tasks.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: scrapy_tasks.py\n@time: 2018-02-10 17:42\n\"\"\"\n\n\nfrom apps.client_db import redis_client\n\n\ndef pop_task(spider_name):\n    key_set = 'scrapy:tasks_set:%(spider_name)s' % {'spider_name': spider_name}\n    return redis_client.spop(key_set)\n\n\ndef put_task(spider_name, *task_ids):\n    key_set = 'scrapy:tasks_set:%(spider_name)s' % {'spider_name': spider_name}\n    redis_client.sadd(key_set, *task_ids)\n\n\ndef get_tasks_count(spider_name):\n    key_set = 'scrapy:tasks_set:%(spider_name)s' % {'spider_name': spider_name}\n    cookies_len = redis_client.scard(key_set)\n    return cookies_len\n"
  },
  {
    "path": "tools/sys_monitor.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: sys_monitor.py\n@time: 2018-02-10 17:43\n\"\"\"\n\n\nimport psutil\nimport time\n\n\ndef bytes2human(n):\n    \"\"\"\n    >>> bytes2human(10000)\n    '9.8 K'\n    >>> bytes2human(100001221)\n    '95.4 M'\n    \"\"\"\n    symbols = ('K', 'M', 'G', 'T', 'P', 'E', 'Z', 'Y')\n    prefix = {}\n    for i, s in enumerate(symbols):\n        prefix[s] = 1 << (i + 1) * 10\n    for s in reversed(symbols):\n        if n >= prefix[s]:\n            value = float(n) / prefix[s]\n            return '%.2f %s' % (value, s)\n    return '%.2f B' % n\n\n\ndef _format_info(k, v):\n    if len(str(v)) <= 5:\n        return '%-25s %5s' % (k, v)\n    elif len(str(v)) <= 10:\n        return '%-20s %10s' % (k, v)\n    else:\n        return '%-15s %15s' % (k, v)\n\n\ndef _print_info(contents, topic=''):\n    if topic:\n        print('\\n[%s]' % topic)\n    contents.insert(0, '-' * 31)\n    contents.append('-' * 31)\n    print('\\n'.join(contents))\n\n\ndef _cpu():\n    contents = [\n        _format_info('cpu_count_logical', psutil.cpu_count()),\n        _format_info('cpu_count_physical', psutil.cpu_count(logical=False)),\n    ]\n    _print_info(contents, 'CPU')\n\n\ndef _memory():\n    mem_virtual = psutil.virtual_memory()\n    mem_swap = psutil.swap_memory()\n\n    contents = [_format_info('mem_virtual_total', bytes2human(mem_virtual.total)),\n                _format_info('mem_virtual_free', bytes2human(mem_virtual.free)),\n                _format_info('mem_virtual_percent', '%s %%' % mem_virtual.percent),\n                _format_info('mem_swap_total', bytes2human(mem_swap.total)),\n                _format_info('mem_swap_free', bytes2human(mem_swap.free)),\n                _format_info('mem_swap_percent', '%s %%' % mem_swap.percent)]\n    _print_info(contents, 'Memory')\n\n\ndef _disks():\n    sdisk_part = psutil.disk_partitions()\n\n    contents = []\n\n    for i in sdisk_part:\n 
       contents.append(_format_info(i.device, i.mountpoint))\n\n        sdisk_usage = psutil.disk_usage(i.mountpoint)\n        contents.append(_format_info('disk_usage_total', bytes2human(sdisk_usage.total)))\n        contents.append(_format_info('disk_usage_free', bytes2human(sdisk_usage.free)))\n        contents.append(_format_info('disk_usage_percent', '%s %%' % sdisk_usage.percent))\n\n    _print_info(contents, 'Disks')\n\n\ndef _network(speed=True):\n    snetio = psutil.net_io_counters()\n    contents = [_format_info('bytes_sent', bytes2human(snetio.bytes_sent)),\n                _format_info('bytes_recv', bytes2human(snetio.bytes_recv))]\n\n    if speed:\n        time.sleep(1)\n        snetio_after = psutil.net_io_counters()\n        contents.append(_format_info('speed_sent', '%s/S' % bytes2human(snetio_after.bytes_sent - snetio.bytes_sent)))\n        contents.append(_format_info('speed_recv', '%s/S' % bytes2human(snetio_after.bytes_recv - snetio.bytes_recv)))\n    _print_info(contents, 'Network')\n\n\ndef _sensors():\n\n    contents = []\n\n    if hasattr(psutil, \"sensors_temperatures\"):\n        sensors_temperatures = psutil.sensors_temperatures()\n        for name, entries in sensors_temperatures.items():\n            for entry in entries:\n                contents.append(\n                    _format_info(entry.label or name, '%s °C' % entry.current))\n\n    sbattery = psutil.sensors_battery()\n\n    if sbattery:\n        contents.append(_format_info('battery_percent', '%s %%' % sbattery.percent))\n        contents.append(_format_info('secsleft', sbattery.secsleft))\n        contents.append(_format_info('power_plugged', sbattery.power_plugged))\n    _print_info(contents, 'Sensors')\n\n\ndef stats():\n    _cpu()\n    _memory()\n    _disks()\n    _network()\n    _sensors()\n\n\nif __name__ == '__main__':\n    stats()\n"
  },
  {
    "path": "tools/toutiao_m.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: toutiao_m.py\n@time: 2018-02-28 14:14\n\"\"\"\n\nimport hashlib\nimport math\nimport re\nimport time\n\nimport execjs\n\nfrom tools.char import un_escape\n\n\ndef get_as_cp():\n    t = int(math.floor(time.time()))\n    e = hex(t).upper()[2:]\n    m = hashlib.md5()\n    m.update(str(t).encode(encoding='utf-8'))\n    i = m.hexdigest().upper()\n\n    if len(e) != 8:\n        AS = '479BB4B7254C150'\n        CP = '7E0AC8874BB0985'\n        return AS, CP\n\n    n = i[0:5]\n    a = i[-5:]\n    s = ''\n    r = ''\n    for o in range(5):\n        s += n[o] + e[o]\n        r += e[o + 3] + a[o]\n\n    AS = 'A1' + s + e[-3:]\n    CP = e[0:3] + r + 'E1'\n    return AS, CP\n\n\ndef parse_toutiao_js_body(html_body, url=''):\n    \"\"\"\n    解析js\n    :param html_body:\n    :param url:\n    :return:\n    \"\"\"\n    rule = r'<script>(var BASE_DATA = {.*?};)</script>'\n    js_list = re.compile(rule, re.S).findall(html_body)\n    if not js_list:\n        print('parse error url: %s' % url)\n        print(html_body)\n    return ''.join(js_list)\n\n\nclass ParseJsTt(object):\n    \"\"\"\n    解析头条动态数据\n    \"\"\"\n\n    def __init__(self, js_body):\n        self.js_body = js_body\n\n        self._add_js_item_id_fn()\n        self._add_js_title_fn()\n        self._add_js_abstract_fn()\n        self._add_js_content_fn()\n        self._add_js_pub_time()\n        self._add_js_tags_fn()\n\n        self.ctx = execjs.compile(self.js_body)\n\n    def _add_js_item_id_fn(self):\n        js_item_id_fn = \"\"\"\n        function r_item_id() {\n            return BASE_DATA.articleInfo.itemId;\n        };\n        \"\"\"\n        self.js_body += js_item_id_fn\n\n    def _add_js_title_fn(self):\n        js_title_fn = \"\"\"\n        function r_title() {\n            return BASE_DATA.articleInfo.title;\n        };\n        \"\"\"\n        self.js_body += js_title_fn\n\n    def 
_add_js_abstract_fn(self):\n        js_abstract_fn = \"\"\"\n        function r_abstract() {\n            return BASE_DATA.shareInfo.abstract;\n        };\n        \"\"\"\n        self.js_body += js_abstract_fn\n\n    def _add_js_content_fn(self):\n        js_content_fn = \"\"\"\n        function r_content() {\n            return BASE_DATA.articleInfo.content;\n        };\n        \"\"\"\n        self.js_body += js_content_fn\n\n    def _add_js_pub_time(self):\n        js_pub_time_fn = \"\"\"\n                function r_pub_time() {\n                    return BASE_DATA.articleInfo.subInfo.time;\n                };\n                \"\"\"\n        self.js_body += js_pub_time_fn\n\n    def _add_js_tags_fn(self):\n        js_tags_fn = \"\"\"\n        function r_tags() {\n            return BASE_DATA.articleInfo.tagInfo.tags;\n        };\n        \"\"\"\n        self.js_body += js_tags_fn\n\n    def parse_js_item_id(self):\n        return self.ctx.call('r_item_id') or ''\n\n    def parse_js_title(self):\n        return self.ctx.call('r_title') or ''\n\n    def parse_js_abstract(self):\n        return self.ctx.call('r_abstract') or ''\n\n    def parse_js_content(self):\n        return un_escape(self.ctx.call('r_content')) or ''\n\n    def parse_js_pub_time(self):\n        return self.ctx.call('r_pub_time') or time.strftime('%Y-%m-%d %H:%M:%S')\n\n    def parse_js_tags(self):\n        return ','.join([tag['name'] or '' for tag in self.ctx.call('r_tags')])\n\n\nif __name__ == '__main__':\n    print(get_as_cp())\n"
  },
  {
    "path": "tools/url.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: url.py\n@time: 2018-02-10 17:38\n\"\"\"\n\n\n# from urllib import urlencode                                                      # PY2\n# from urlparse import urlparse, urlunparse, parse_qsl                              # PY2\n# from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode               # PY3\nfrom future.moves.urllib.parse import urlparse, urlunparse, parse_qsl, urlencode\n\nfrom scrapy.utils import request\nfrom scrapy.http import Request\n\n\ndef get_update_url(url, data):\n    \"\"\"\n    获取更新后的url\n    :param url:\n    :param data:\n    :return:\n    \"\"\"\n    result = urlparse(url)\n    query_payload = dict(parse_qsl(result.query), **data)\n    query_param = urlencode(query_payload)\n    return urlunparse((result.scheme, result.netloc, result.path, result.params, query_param, result.fragment))\n\n\ndef get_url_query_param(url, param):\n    \"\"\"\n    获取url参数值\n    :param url:\n    :param param:\n    :return:\n    \"\"\"\n    result = urlparse(url)\n    return dict(parse_qsl(result.query)).get(param)\n\n\ndef get_request_finger(url):\n    \"\"\"\n    获取 url 指纹（允许参数无序）\n    :param url:\n    :return:\n    \"\"\"\n    req = Request(url=url)\n    return request.request_fingerprint(req)\n\n\ndef allow_url(url, allow_domains):\n    url_parse = urlparse(url)\n    result = False\n    for domain in allow_domains:\n        if url_parse.netloc.endswith(domain):\n            result = True\n    return result\n\n\nif __name__ == '__main__':\n    print(get_update_url('http://www.abc.com/def/', {'b': 2}))\n    print(get_update_url('http://www.abc.com/def/?a=1', {'b': 2}))\n    print(get_update_url('http://www.abc.com/def/?a=1', {'a': 2}))\n    print(get_url_query_param('http://www.abc.com/def/?a=1&b=2', 'a'))\n    print(allow_url('http://www.abc.com', ['abc.com']))\n    print(allow_url('http://www.abc.com', ['b.com']))\n"
  },
  {
    "path": "tools/weibo.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: weibo.py\n@time: 2018-02-13 16:20\n\"\"\"\n\n\nimport base64\n\n# from urllib import quote                      # PY2\n# from urllib.parse import quote                # PY3\nfrom future.moves.urllib.parse import quote\nfrom future.builtins import input               # PY2(raw_input)\n\n\ndef get_su(user_name):\n    return base64.b64encode(quote(user_name.strip()))\n\n\ndef get_login_data():\n    print('Please type username and password!')\n    username = input('username < ')\n    password = input('password < ')\n    if not(username and password):\n        raise Exception('Method or function hasn\\'t been implemented yet.')\n    return {\n        'username': username,\n        'password': password\n    }\n\n\nif __name__ == '__main__':\n    pass\n\n"
  },
  {
    "path": "tools/weixin.py",
    "content": "#!/usr/bin/env python\n# encoding: utf-8\n\n\"\"\"\n@author: zhanghe\n@software: PyCharm\n@file: weixin.py\n@time: 2018-02-10 17:55\n\"\"\"\n\n\nimport re\nimport time\nimport hashlib\n\n# from urlparse import urljoin                  # PY2\n# from urllib.parse import urljoin              # PY3\nfrom future.moves.urllib.parse import urljoin\n\nimport execjs\nfrom tools.char import un_escape\nfrom config import current_config\nfrom models.news import FetchResult\nfrom news.items import FetchResultItem\nfrom apps.client_db import db_session_mysql\nfrom maps.platform import WEIXIN, WEIBO\n\nBASE_DIR = current_config.BASE_DIR\n\n\ndef get_finger(content_str):\n    \"\"\"\n    :param content_str:\n    :return:\n    \"\"\"\n    m = hashlib.md5()\n    m.update(content_str.encode('utf-8') if isinstance(content_str, unicode) else content_str)\n    finger = m.hexdigest()\n    return finger\n\n\ndef parse_weixin_js_body(html_body, url=''):\n    \"\"\"\n    解析js\n    :param html_body:\n    :param url:\n    :return:\n    \"\"\"\n    rule = r'<script type=\"text/javascript\">.*?(var msgList.*?)seajs.use\\(\"sougou/profile.js\"\\);.*?</script>'\n    js_list = re.compile(rule, re.S).findall(html_body)\n    if not js_list:\n        print('parse error url: %s' % url)\n    return ''.join(js_list)\n\n\ndef parse_weixin_article_id(html_body):\n    rule = r'<script nonce=\"(\\d+)\" type=\"text\\/javascript\">'\n    article_id_list = re.compile(rule, re.I).findall(html_body)\n    return article_id_list[0]\n\n\ndef add_img_src(html_body):\n    rule = r'data-src=\"(.*?)\"'\n    img_data_src_list = re.compile(rule, re.I).findall(html_body)\n    print(img_data_src_list)\n    for img_src in img_data_src_list:\n        print(img_src)\n        html_body = html_body.replace(img_src, '%(img_src)s\" src=\"%(img_src)s' % {'img_src': img_src})\n    return html_body\n\n\ndef get_img_src_list(html_body, host_name='/', limit=None):\n    rule = r'src=\"(%s.*?)\"' % host_name\n    
img_data_src_list = re.compile(rule, re.I).findall(html_body)\n    if limit:\n        return img_data_src_list[:limit]\n    return img_data_src_list\n\n\ndef check_article_title_duplicate(article_title):\n    \"\"\"\n    检查标题重复\n    :param article_title:\n    :return:\n    \"\"\"\n    session = db_session_mysql()\n    article_id_count = session.query(FetchResult) \\\n        .filter(FetchResult.platform_id == WEIXIN,\n                FetchResult.article_id == get_finger(article_title)) \\\n        .count()\n    return article_id_count\n\n\nclass ParseJsWc(object):\n    \"\"\"\n    解析微信动态数据\n    \"\"\"\n    def __init__(self, js_body):\n        self.js_body = js_body\n\n        self._add_js_msg_list_fn()\n\n        self.ctx = execjs.compile(self.js_body)\n        # print(self.ctx)\n\n    def _add_js_msg_list_fn(self):\n        js_msg_list_fn = \"\"\"\n        function r_msg_list() {\n            return msgList.list;\n        };\n        \"\"\"\n        self.js_body += js_msg_list_fn\n\n    def parse_js_msg_list(self):\n        msg_list = self.ctx.call('r_msg_list')\n        app_msg_ext_info_list = [i['app_msg_ext_info'] for i in msg_list]\n        comm_msg_info_date_time_list = [time.strftime(\"%Y-%m-%d %H:%M:%S\", time.localtime(i['comm_msg_info']['datetime'])) for i in msg_list]\n        # msg_id_list = [i['comm_msg_info']['id'] for i in msg_list]\n        msg_data_list = [\n            {\n                # 'article_id': '%s_000' % msg_id_list[index],\n                'article_id': get_finger(i['title']),\n                'article_url': urljoin('https://mp.weixin.qq.com', un_escape(i['content_url'])),\n                'article_title': i['title'],\n                'article_abstract': i['digest'],\n                'article_pub_time': comm_msg_info_date_time_list[index],\n            } for index, i in enumerate(app_msg_ext_info_list)\n        ]\n        msg_ext_list = [i['multi_app_msg_item_list'] for i in app_msg_ext_info_list]\n        for index_j, j in 
enumerate(msg_ext_list):\n            for index_i, i in enumerate(j):\n                msg_data_list.append(\n                    {\n                        # 'article_id': '%s_%03d' % (msg_id_list[index_j], index_i + 1),\n                        'article_id': get_finger(i['title']),\n                        'article_url': urljoin('https://mp.weixin.qq.com', un_escape(i['content_url'])),\n                        'article_title': i['title'],\n                        'article_abstract': i['digest'],\n                        'article_pub_time': comm_msg_info_date_time_list[index_j],\n                    }\n                )\n        return msg_data_list\n"
  }
]