Repository: binux/pyspider
Branch: master
Commit: 897891cafb21
Files: 165
Total size: 775.9 KB

Directory structure:
gitextract_jmd7ykkk/

├── .coveragerc
├── .github/
│   └── ISSUE_TEMPLATE.md
├── .gitignore
├── .travis.yml
├── Dockerfile
├── LICENSE
├── MANIFEST.in
├── README.md
├── config_example.json
├── docker-compose.yaml
├── docs/
│   ├── About-Projects.md
│   ├── About-Tasks.md
│   ├── Architecture.md
│   ├── Command-Line.md
│   ├── Deployment-demo.pyspider.org.md
│   ├── Deployment.md
│   ├── Frequently-Asked-Questions.md
│   ├── Quickstart.md
│   ├── Running-pyspider-with-Docker.md
│   ├── Script-Environment.md
│   ├── Working-with-Results.md
│   ├── apis/
│   │   ├── @catch_status_code_error.md
│   │   ├── @every.md
│   │   ├── Response.md
│   │   ├── index.md
│   │   ├── self.crawl.md
│   │   └── self.send_message.md
│   ├── conf.py
│   ├── index.md
│   └── tutorial/
│       ├── AJAX-and-more-HTTP.md
│       ├── HTML-and-CSS-Selector.md
│       ├── Render-with-PhantomJS.md
│       └── index.md
├── mkdocs.yml
├── pyspider/
│   ├── __init__.py
│   ├── database/
│   │   ├── __init__.py
│   │   ├── base/
│   │   │   ├── __init__.py
│   │   │   ├── projectdb.py
│   │   │   ├── resultdb.py
│   │   │   └── taskdb.py
│   │   ├── basedb.py
│   │   ├── couchdb/
│   │   │   ├── __init__.py
│   │   │   ├── couchdbbase.py
│   │   │   ├── projectdb.py
│   │   │   ├── resultdb.py
│   │   │   └── taskdb.py
│   │   ├── elasticsearch/
│   │   │   ├── __init__.py
│   │   │   ├── projectdb.py
│   │   │   ├── resultdb.py
│   │   │   └── taskdb.py
│   │   ├── local/
│   │   │   ├── __init__.py
│   │   │   └── projectdb.py
│   │   ├── mongodb/
│   │   │   ├── __init__.py
│   │   │   ├── mongodbbase.py
│   │   │   ├── projectdb.py
│   │   │   ├── resultdb.py
│   │   │   └── taskdb.py
│   │   ├── mysql/
│   │   │   ├── __init__.py
│   │   │   ├── mysqlbase.py
│   │   │   ├── projectdb.py
│   │   │   ├── resultdb.py
│   │   │   └── taskdb.py
│   │   ├── redis/
│   │   │   ├── __init__.py
│   │   │   └── taskdb.py
│   │   ├── sqlalchemy/
│   │   │   ├── __init__.py
│   │   │   ├── projectdb.py
│   │   │   ├── resultdb.py
│   │   │   ├── sqlalchemybase.py
│   │   │   └── taskdb.py
│   │   └── sqlite/
│   │       ├── __init__.py
│   │       ├── projectdb.py
│   │       ├── resultdb.py
│   │       ├── sqlitebase.py
│   │       └── taskdb.py
│   ├── fetcher/
│   │   ├── __init__.py
│   │   ├── cookie_utils.py
│   │   ├── phantomjs_fetcher.js
│   │   ├── puppeteer_fetcher.js
│   │   ├── splash_fetcher.lua
│   │   └── tornado_fetcher.py
│   ├── libs/
│   │   ├── ListIO.py
│   │   ├── __init__.py
│   │   ├── base_handler.py
│   │   ├── bench.py
│   │   ├── counter.py
│   │   ├── dataurl.py
│   │   ├── log.py
│   │   ├── multiprocessing_queue.py
│   │   ├── pprint.py
│   │   ├── response.py
│   │   ├── result_dump.py
│   │   ├── sample_handler.py
│   │   ├── url.py
│   │   ├── utils.py
│   │   └── wsgi_xmlrpc.py
│   ├── logging.conf
│   ├── message_queue/
│   │   ├── __init__.py
│   │   ├── kombu_queue.py
│   │   ├── rabbitmq.py
│   │   └── redis_queue.py
│   ├── processor/
│   │   ├── __init__.py
│   │   ├── processor.py
│   │   └── project_module.py
│   ├── result/
│   │   ├── __init__.py
│   │   └── result_worker.py
│   ├── run.py
│   ├── scheduler/
│   │   ├── __init__.py
│   │   ├── scheduler.py
│   │   ├── task_queue.py
│   │   └── token_bucket.py
│   └── webui/
│       ├── __init__.py
│       ├── app.py
│       ├── bench_test.py
│       ├── debug.py
│       ├── index.py
│       ├── login.py
│       ├── result.py
│       ├── static/
│       │   ├── .babelrc
│       │   ├── package.json
│       │   ├── src/
│       │   │   ├── css_selector_helper.js
│       │   │   ├── debug.js
│       │   │   ├── debug.less
│       │   │   ├── index.js
│       │   │   ├── index.less
│       │   │   ├── result.less
│       │   │   ├── splitter.js
│       │   │   ├── task.less
│       │   │   ├── tasks.less
│       │   │   └── variable.less
│       │   └── webpack.config.js
│       ├── task.py
│       ├── templates/
│       │   ├── debug.html
│       │   ├── index.html
│       │   ├── result.html
│       │   ├── task.html
│       │   └── tasks.html
│       └── webdav.py
├── requirements.txt
├── run.py
├── setup.py
├── tests/
│   ├── __init__.py
│   ├── data_fetcher_processor_handler.py
│   ├── data_handler.py
│   ├── data_sample_handler.py
│   ├── data_test_webpage.py
│   ├── test_base_handler.py
│   ├── test_bench.py
│   ├── test_counter.py
│   ├── test_database.py
│   ├── test_fetcher.py
│   ├── test_fetcher_processor.py
│   ├── test_message_queue.py
│   ├── test_processor.py
│   ├── test_response.py
│   ├── test_result_dump.py
│   ├── test_result_worker.py
│   ├── test_run.py
│   ├── test_scheduler.py
│   ├── test_task_queue.py
│   ├── test_utils.py
│   ├── test_webdav.py
│   ├── test_webui.py
│   └── test_xmlrpc.py
├── tools/
│   └── migrate.py
└── tox.ini

================================================
FILE CONTENTS
================================================

================================================
FILE: .coveragerc
================================================
[run]
source =
    pyspider
parallel = True

[report]
omit =
    pyspider/libs/sample_handler.py
    pyspider/libs/pprint.py

exclude_lines =
    pragma: no cover
    def __repr__
    if self.debug:
    if settings.DEBUG
    raise AssertionError
    raise NotImplementedError
    if 0:
    if __name__ == .__main__.:
    except ImportError:
    pass


================================================
FILE: .github/ISSUE_TEMPLATE.md
================================================
<!--
Thanks for using pyspider!

If you need to ask your question in Chinese, please submit it to https://segmentfault.com/t/pyspider
-->

* pyspider version:
* Operating system:
* Start up command:

### Expected behavior

<!-- What do you think should happen? -->

### Actual behavior

<!-- What actually happens? -->

### How to reproduce

<!-- 

The best chance of getting help is to provide enough information to reproduce the issue you have.

If it's related to API or extraction behavior, please paste the script of your project.
If it's related to the scheduling of a whole project, please paste a screenshot of the queue status at the top of the dashboard.

-->


================================================
FILE: .gitignore
================================================
*.py[cod]
data/*
.venv
.idea
# C extensions
*.so

# Packages
*.egg
*.egg-info
dist
build
eggs
parts
bin
var
sdist
develop-eggs
.installed.cfg
lib
lib64
__pycache__

# Installer logs
pip-log.txt

# Unit test / coverage reports
.coverage
.tox
nosetests.xml

# Translations
*.mo

# Mr Developer
.mr.developer.cfg
.project
.pydevproject
.idea


================================================
FILE: .travis.yml
================================================
language: python
cache: pip
python:
  - 3.5
  - 3.6
  - 3.7
  #- 3.8
services:
    - docker
    - mongodb
    - rabbitmq
    - redis
    - mysql
    # - elasticsearch
    - postgresql
addons:
  postgresql: "9.4"
  apt:
    packages:
    - rabbitmq-server
env:
    - IGNORE_COUCHDB=1

before_install:
    - sudo apt-get update -qq
    - curl -O https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/2.4.0/elasticsearch-2.4.0.deb && sudo dpkg -i --force-confnew elasticsearch-2.4.0.deb && sudo service elasticsearch restart
    - npm install express puppeteer
    - sudo docker pull scrapinghub/splash
    - sudo docker run -d --net=host scrapinghub/splash
before_script:
    - psql -c "CREATE DATABASE pyspider_test_taskdb ENCODING 'UTF8' TEMPLATE=template0;" -U postgres
    - psql -c "CREATE DATABASE pyspider_test_projectdb ENCODING 'UTF8' TEMPLATE=template0;" -U postgres
    - psql -c "CREATE DATABASE pyspider_test_resultdb ENCODING 'UTF8' TEMPLATE=template0;" -U postgres
    - sleep 10
install:
    - pip install https://github.com/marcus67/easywebdav/archive/master.zip
    - sudo apt-get install libgnutls28-dev
    - pip install -e .[all,test]
    - pip install coveralls
script:
    - coverage run setup.py test
after_success:
    - coverage combine
    - coveralls


================================================
FILE: Dockerfile
================================================
FROM python:3.6
MAINTAINER binux <roy@binux.me>

# install phantomjs
RUN mkdir -p /opt/phantomjs \
        && cd /opt/phantomjs \
        && wget -O phantomjs.tar.bz2 https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-2.1.1-linux-x86_64.tar.bz2 \
        && tar xavf phantomjs.tar.bz2 --strip-components 1 \
        && ln -s /opt/phantomjs/bin/phantomjs /usr/local/bin/phantomjs \
        && rm phantomjs.tar.bz2
# Fix Error: libssl_conf.so: cannot open shared object file: No such file or directory
ENV OPENSSL_CONF=/etc/ssl/

# install nodejs
ENV NODEJS_VERSION=8.15.0 \
    PATH=$PATH:/opt/node/bin
WORKDIR "/opt/node"
RUN apt-get -qq update && apt-get -qq install -y curl ca-certificates libx11-xcb1 libxtst6 libnss3 libasound2 libatk-bridge2.0-0 libgtk-3-0 --no-install-recommends && \
    curl -sL https://nodejs.org/dist/v${NODEJS_VERSION}/node-v${NODEJS_VERSION}-linux-x64.tar.gz | tar xz --strip-components=1 && \
    rm -rf /var/lib/apt/lists/*
RUN npm install puppeteer express

# install requirements
COPY requirements.txt /opt/pyspider/requirements.txt
RUN pip install -r /opt/pyspider/requirements.txt

# add all repo
ADD ./ /opt/pyspider

# run test
WORKDIR /opt/pyspider
RUN pip install -e .[all]

# Create a symbolic link to node_modules
RUN ln -s /opt/node/node_modules ./node_modules

#VOLUME ["/opt/pyspider"]
ENTRYPOINT ["pyspider"]

EXPOSE 5000 23333 24444 25555 22222


================================================
FILE: LICENSE
================================================
Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "{}"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright 2014 Binux

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.



================================================
FILE: MANIFEST.in
================================================
include README.md
include requirements.txt
include Dockerfile
include LICENSE
include pyspider/logging.conf
include pyspider/webui/static/*
include pyspider/webui/templates/*


================================================
FILE: README.md
================================================
pyspider [![Build Status]][Travis CI] [![Coverage Status]][Coverage]
========

A Powerful Spider(Web Crawler) System in Python.

- Write script in Python
- Powerful WebUI with script editor, task monitor, project manager and result viewer
- [MySQL](https://www.mysql.com/), [MongoDB](https://www.mongodb.org/), [Redis](http://redis.io/), [SQLite](https://www.sqlite.org/), [Elasticsearch](https://www.elastic.co/products/elasticsearch); [PostgreSQL](http://www.postgresql.org/) with [SQLAlchemy](http://www.sqlalchemy.org/) as database backend
- [RabbitMQ](http://www.rabbitmq.com/), [Redis](http://redis.io/) and [Kombu](http://kombu.readthedocs.org/) as message queue
- Task priority, retry, periodical, recrawl by age, etc...
- Distributed architecture, Crawl Javascript pages, Python 2.{6,7}, 3.{3,4,5,6} support, etc...

Tutorial: [http://docs.pyspider.org/en/latest/tutorial/](http://docs.pyspider.org/en/latest/tutorial/)  
Documentation: [http://docs.pyspider.org/](http://docs.pyspider.org/)  
Release notes: [https://github.com/binux/pyspider/releases](https://github.com/binux/pyspider/releases)  

Sample Code 
-----------

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```


Installation
------------

* `pip install pyspider`
* run command `pyspider`, visit [http://localhost:5000/](http://localhost:5000/)

**WARNING:** WebUI is open to the public by default. It can be used to execute any command, which may harm your system. Please use it in an internal network or [enable `need-auth` for webui](http://docs.pyspider.org/en/latest/Command-Line/#-config).

Quickstart: [http://docs.pyspider.org/en/latest/Quickstart/](http://docs.pyspider.org/en/latest/Quickstart/)

Contribute
----------

* Use It
* Open [Issue], send PR
* [User Group]
* [Chinese Q&A](http://segmentfault.com/t/pyspider)


TODO
----

### v0.4.0

- [ ] a visual scraping interface like [portia](https://github.com/scrapinghub/portia)


License
-------
Licensed under the Apache License, Version 2.0


[Build Status]:         https://img.shields.io/travis/binux/pyspider/master.svg?style=flat
[Travis CI]:            https://travis-ci.org/binux/pyspider
[Coverage Status]:      https://img.shields.io/coveralls/binux/pyspider.svg?branch=master&style=flat
[Coverage]:             https://coveralls.io/r/binux/pyspider
[Try]:                  https://img.shields.io/badge/try-pyspider-blue.svg?style=flat
[Issue]:                https://github.com/binux/pyspider/issues
[User Group]:           https://groups.google.com/group/pyspider-users


================================================
FILE: config_example.json
================================================
{
  "taskdb": "couchdb+taskdb://user:password@couchdb:5984",
  "projectdb": "couchdb+projectdb://user:password@couchdb:5984",
  "resultdb": "couchdb+resultdb://user:password@couchdb:5984",
  "message_queue": "amqp://rabbitmq:5672/%2F",
  "webui": {
    "username": "username",
    "password": "password",
    "need-auth": true,
    "scheduler-rpc": "http://scheduler:23333",
    "fetcher-rpc": "http://fetcher:24444"
  }
}


================================================
FILE: docker-compose.yaml
================================================
version: "3.7"

# replace /path/to/dir/ to point to config.json

# The RabbitMQ and CouchDB services can take some time to startup.
# During this time most of the pyspider services will exit and restart.
# Once RabbitMQ and CouchDB are fully up and running everything should run as normal.

services:
  rabbitmq:
    image: rabbitmq:alpine
    container_name: rabbitmq
    networks:
      - pyspider
    command: rabbitmq-server
  mysql:
    image: mysql:latest
    container_name: mysql
    volumes:
      - /tmp:/var/lib/mysql
    environment:
      - MYSQL_ALLOW_EMPTY_PASSWORD=yes
    networks:
      - pyspider
  phantomjs:
    image: pyspider:latest
    container_name: phantomjs
    networks:
      - pyspider
    volumes:
      - ./config_example.json:/opt/pyspider/config.json
    command: -c config.json phantomjs
    depends_on:
      - couchdb
      - rabbitmq
    restart: unless-stopped
  result:
    image: pyspider:latest
    container_name: result
    networks:
      - pyspider
    volumes:
      - ./config_example.json:/opt/pyspider/config.json
    command: -c config.json result_worker
    depends_on:
      - couchdb
      - rabbitmq
    restart: unless-stopped # Sometimes we'll get a connection refused error because couchdb has yet to fully start
  processor:
    container_name: processor
    image: pyspider:latest
    networks:
      - pyspider
    volumes:
      - ./config_example.json:/opt/pyspider/config.json
    command: -c config.json processor
    depends_on:
      - couchdb
      - rabbitmq
    restart: unless-stopped
  fetcher:
    image: pyspider:latest
    container_name: fetcher
    networks:
      - pyspider
    volumes:
      - ./config_example.json:/opt/pyspider/config.json
    command : -c config.json fetcher
    depends_on:
      - couchdb
      - rabbitmq
    restart: unless-stopped
  scheduler:
    image: pyspider:latest
    container_name: scheduler
    networks:
      - pyspider
    volumes:
      - ./config_example.json:/opt/pyspider/config.json
    command: -c config.json scheduler
    depends_on:
      - couchdb
      - rabbitmq
    restart: unless-stopped
  webui:
    image: pyspider:latest
    container_name: webui
    ports:
      - "5050:5000"
    networks:
      - pyspider
    volumes:
      - ./config_example.json:/opt/pyspider/config.json
    command: -c config.json webui
    depends_on:
      - couchdb
      - rabbitmq
    restart: unless-stopped

networks:
  pyspider:
    external:
      name: pyspider
  default:
    driver: bridge


================================================
FILE: docs/About-Projects.md
================================================
About Projects
==============

In most cases, a project is one script you write for one website.

* Projects are independent, but you can import another project as a module with `from projects import other_project`
* A project has 5 statuses: `TODO`, `STOP`, `CHECKING`, `DEBUG` and `RUNNING`
    - `TODO` - the script has just been created and is yet to be written
    - `STOP` - mark a project as `STOP` if you want it to stop.
    - `CHECKING` - when a running project is modified, its status is set to `CHECKING` automatically so that an incomplete modification does not run.
    - `DEBUG`/`RUNNING` - the spider treats these two statuses the same. It is still good practice to mark a project as `DEBUG` on its first run and change it to `RUNNING` after it has been checked.
* The crawl rate is controlled by `rate` and `burst` with the [token-bucket](http://en.wikipedia.org/wiki/Token_bucket) algorithm.
    - `rate` - how many requests are sent per second
    - `burst` - consider `rate/burst = 0.1/3`: the spider crawls 1 page every 10 seconds. Suppose all tasks are finished and the project checks for recently updated items every minute. If 3 new items are found, pyspider will "burst" and crawl the 3 tasks without waiting 3*10 seconds. The fourth task, however, has to wait 10 seconds.
* To delete a project, set `group` to `delete` and status to `STOP`, then wait 24 hours.
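
The `rate`/`burst` behaviour described above can be sketched with a small token bucket. This is a minimal illustration, not pyspider's own implementation: the `TokenBucket` class, its method names and the explicit `now` arguments are assumptions made so the example is deterministic.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: `rate` tokens are added per second, up to `burst`."""

    def __init__(self, rate, burst, now=None):
        self.rate = rate
        self.burst = burst
        self.tokens = burst  # start full so an idle project can burst immediately
        self.last = time.monotonic() if now is None else now

    def _refill(self, now):
        # Add the tokens earned since the last call, capped at `burst`.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now

    def try_consume(self, now=None):
        """Take one token if available; return True when a request may be sent."""
        now = time.monotonic() if now is None else now
        self._refill(now)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=0.1, burst=3, now=0.0)
# Three new tasks found at t=0 are sent at once; the fourth is held back.
results = [bucket.try_consume(now=0.0) for _ in range(4)]
# Ten seconds later one token has been refilled, so the fourth task can go.
later = bucket.try_consume(now=10.0)
```

With `rate=0.1` and `burst=3`, the first three calls succeed immediately and the fourth succeeds only after roughly 10 seconds, which matches the example in the list above.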


`on_finished` callback
--------------------
You can override the `on_finished` method in the project. The method is triggered when the task_queue goes to 0.

Example 1: When you start a project to crawl a website with 100 pages, the `on_finished` callback is fired once all 100 pages have either been crawled successfully or failed after retries.

Example 2: A project with `auto_recrawl` tasks will **NEVER** trigger the `on_finished` callback, because the time queue never becomes 0 while there are auto_recrawl tasks in it.

Example 3: A project with an `@every` decorated method will trigger the `on_finished` callback every time the newly submitted tasks are finished.
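
One way to picture the trigger condition in Examples 1 and 2 is a counter of queued tasks. The sketch below is a hypothetical model only: the `PendingTasks` class and its `submit`/`finish` methods are invented for illustration and are not part of pyspider.

```python
class PendingTasks:
    """Hypothetical stand-in for a project's task queue."""

    def __init__(self, on_finished):
        self.pending = 0
        self.on_finished = on_finished

    def submit(self, count=1):
        self.pending += count

    def finish(self, auto_recrawl=False):
        """Mark one task as done (crawled, or failed after retries)."""
        if auto_recrawl:
            # The task is queued again right away, so the queue never shrinks.
            return
        self.pending -= 1
        if self.pending == 0:
            self.on_finished()


# Example 1: 100 pages, the callback fires once when the last one is done.
fired = []
project_queue = PendingTasks(on_finished=lambda: fired.append("finished"))
project_queue.submit(100)
for _ in range(100):
    project_queue.finish()

# Example 2: an auto_recrawl task keeps the queue non-empty forever.
recrawl_fired = []
recrawl_queue = PendingTasks(on_finished=lambda: recrawl_fired.append("finished"))
recrawl_queue.submit(1)
for _ in range(5):
    recrawl_queue.finish(auto_recrawl=True)
```

In this model `fired` ends up with a single entry, while `recrawl_fired` stays empty because the pending count never reaches 0.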


================================================
FILE: docs/About-Tasks.md
================================================
About Tasks
===========

Tasks are the basic unit to be scheduled.

Basis
-----

* A task is differentiated by its `taskid`. (Default: `md5(url)`, can be changed by overriding the `def get_taskid(self, task)` method)
* Tasks are isolated between different projects.
* A task has 4 statuses:
    - active
    - failed
    - success
    - bad - not used
* Only tasks in active status will be scheduled.
* Tasks are served in order of `priority`.
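
As a rough illustration of the default `taskid` mentioned above, the sketch below computes an MD5 digest of the URL with the standard library. The function names are hypothetical, and `taskid_with_body` only shows the kind of change an overridden `get_taskid` might make; it is not pyspider's code.

```python
import hashlib


def default_taskid(url):
    """Default task id as described above: the MD5 hex digest of the URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def taskid_with_body(url, body):
    """Hypothetical override: mix in the request body so that two POSTs to
    the same URL with different payloads count as different tasks."""
    return hashlib.md5((url + body).encode("utf-8")).hexdigest()


# The same URL always maps to the same task ...
same = default_taskid("http://example.com/") == default_taskid("http://example.com/")
# ... unless something else, such as the body, is part of the id.
different = (
    taskid_with_body("http://example.com/", "a=1")
    != taskid_with_body("http://example.com/", "a=2")
)
```

Because the id is derived from the URL alone by default, submitting the same URL twice refers to one task, which is why the scheduling rules below talk about tasks that are "already in the queue".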

Schedule
--------

#### new task

When a new task (never seen before) comes in:

* If `exetime` is set but has not arrived yet, it will be put into a time-based queue to wait.
* Otherwise it will be accepted.

When the task is already in the queue:

* It is ignored unless `force_update` is set.

When a completed task comes out:

* If `age` is set and `last_crawl_time + age < now`, it will be accepted. Otherwise it is discarded.
* If `itag` is set and not equal to its previous value, it will be accepted. Otherwise it is discarded.
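
The two rules for completed tasks can be written as a small predicate. This is a sketch under assumptions: the function name and its parameters are hypothetical, and it covers only the `age` and `itag` checks listed above.

```python
def should_recrawl(last_crawl_time, now, age=None, itag=None, previous_itag=None):
    """Return True if a request for an already completed task is accepted."""
    # Rule 1: the previous result is older than `age` seconds.
    if age is not None and last_crawl_time + age < now:
        return True
    # Rule 2: the `itag` marker differs from the one seen last time.
    if itag is not None and itag != previous_itag:
        return True
    return False


day = 24 * 60 * 60
crawled_at = 1000000  # arbitrary timestamp in seconds

# Crawled one day ago with a 10 day age: still fresh, so discarded.
fresh = should_recrawl(crawled_at, now=crawled_at + day, age=10 * day)
# Crawled 11 days ago with a 10 day age: expired, so accepted.
expired = should_recrawl(crawled_at, now=crawled_at + 11 * day, age=10 * day)
# Still fresh, but the itag changed: accepted.
changed = should_recrawl(
    crawled_at, now=crawled_at + day, age=10 * day, itag="v2", previous_itag="v1"
)
```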


#### task retry

When a fetch error or script error happens, the task will retry 3 times by default.

By default, retries are executed after 30 seconds, 1 hour, 6 hours and 12 hours respectively; any further retries are postponed by 24 hours.

If `age` is specified, the retry delay will not be larger than `age`.

You can configure the retry delay by adding a variable named `retry_delay` to the handler. `retry_delay` is a dict that specifies retry intervals. Its items are `{retried: seconds}`, and the special key `''` (empty string) specifies the default delay for retry counts that are not listed.

For example, the default `retry_delay` is declared like this:


```
from pyspider.libs.base_handler import BaseHandler


class MyHandler(BaseHandler):
    retry_delay = {
        0: 30,
        1: 1*60*60,
        2: 6*60*60,
        3: 12*60*60,
        '': 24*60*60
    }
```
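
To make the lookup rule concrete, here is a minimal sketch of how such a dict could be read. The helper `get_retry_delay` is hypothetical and written only from the description above (exact retry count first, then the `''` default, capped by `age`); it is not taken from pyspider's scheduler.

```python
DEFAULT_RETRY_DELAY = {
    0: 30,
    1: 1 * 60 * 60,
    2: 6 * 60 * 60,
    3: 12 * 60 * 60,
    '': 24 * 60 * 60,
}


def get_retry_delay(retried, retry_delay=DEFAULT_RETRY_DELAY, age=None):
    """Return the delay in seconds before the next retry.

    `retried` is the number of retries already made. Counts that are not
    listed fall back to the '' entry, and the result never exceeds `age`.
    """
    delay = retry_delay.get(retried, retry_delay[''])
    if age is not None:
        delay = min(delay, age)
    return delay


first = get_retry_delay(0)          # listed: 30 seconds
second = get_retry_delay(1)         # listed: 1 hour
fallback = get_retry_delay(7)       # not listed: the '' default of 24 hours
capped = get_retry_delay(3, age=3600)  # 12 hours, capped to an age of 1 hour
```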


================================================
FILE: docs/Architecture.md
================================================
Architecture
============

This document describes the reason why I made pyspider and the architecture.

Why
---
Two years ago, I was working on a vertical search engine. We faced the following crawling needs:

1. collect 100-200 websites, which may go online/offline or change their templates at any time
> We need a really powerful monitor to find out which websites are changing, and a good tool to help us write a script/template for each website.

2. data should be collected within 5 minutes of a website being updated
> We solve this problem by checking index pages frequently and using something like 'last update time' or 'last reply time' to determine which pages have changed. In addition, we recheck pages after X days to prevent omissions.  
> **pyspider will never stop as WWW is changing all the time**

Furthermore, we have some APIs from our partners; these APIs may need POST, a proxy, request signatures, etc. Full control from the script is more convenient than global parameters on components.

Overview
--------
The following diagram shows an overview of the pyspider architecture with its components and an outline of the data flow that takes place inside the system.

![pyspider](imgs/pyspider-arch.png)

Components are connected by message queues. Every component, including the message queue, runs in its own process/thread and is replaceable. That means that when processing is slow, you can run many processor instances to make full use of multiple CPUs, or deploy to multiple machines. This architecture makes pyspider really fast. [benchmarking](https://gist.github.com/binux/67b276c51e988f8e2c31#comment-1339242).

Components
----------

### Scheduler
The Scheduler receives tasks from the processor via `newtask_queue`. It decides whether a task is new or requires a re-crawl, sorts tasks by priority and feeds them to the fetcher with traffic control ([token bucket](http://en.wikipedia.org/wiki/Token_bucket) algorithm). It also takes care of periodic tasks, lost tasks and failed tasks, and retries them later.

All of the above can be set via the `self.crawl` [API](apis/).

Note that in the current implementation of the scheduler, only one scheduler is allowed.
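
The token bucket idea behind the traffic control can be sketched in a few lines. This is a generic, minimal version of the algorithm and not pyspider's own implementation; the class name, the `try_consume` method and the injectable `clock` argument are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allow `rate` tasks per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self.clock = clock
        self.tokens = self.burst      # start with a full bucket
        self.last = clock()

    def _refill(self):
        # Add the tokens earned since the last call, capped at `burst`.
        now = self.clock()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_consume(self, n=1):
        """Take `n` tokens if available; True means the task may be sent."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A bucket with `rate=1.0` and `burst=3` lets three tasks through immediately and then roughly one per second, which is how a per-project rate and burst setting shapes the traffic sent to the fetcher.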

### Fetcher
The Fetcher is responsible for fetching web pages and sending the results to the processor. For flexibility, the fetcher supports [Data URI](http://en.wikipedia.org/wiki/Data_URI_scheme) and pages rendered by JavaScript (via [phantomjs](http://phantomjs.org/)). The fetch method, headers, cookies, proxy, etag, etc. can be controlled by the script via the [API](apis/self.crawl/#fetch).

### Phantomjs Fetcher
Phantomjs Fetcher works like a proxy. It is connected to the general Fetcher, fetches and renders pages with JavaScript enabled, and outputs plain HTML back to the Fetcher:

```
scheduler -> fetcher -> processor
                |
            phantomjs
                |
             internet
```

### Processor
The Processor is responsible for running the scripts written by users to parse and extract information. Your script runs in an unrestricted environment. Although we provide various tools (like [PyQuery](https://pythonhosted.org/pyquery/)) for extracting information and links, you can use anything you want to deal with the response. Refer to [Script Environment](Script-Environment) and [API Reference](apis/) for more information about scripts.

The Processor captures exceptions and logs, sends status (task track) and new tasks to the `scheduler`, and sends results to the `Result Worker`.

### Result Worker (optional)
The Result Worker receives results from the `Processor`. pyspider has a built-in result worker that saves results to `resultdb`. Override it to handle results according to your needs.

### WebUI
WebUI is a web frontend for everything. It contains:

* script editor, debugger
* project manager
* task monitor
* result viewer, exporter

The WebUI is perhaps the most attractive part of pyspider. With this powerful UI, you can debug your scripts step by step just as pyspider does, start or stop a project, find out which project is going wrong and which request failed, and try it again with the debugger.

Data flow
---------
The data flow in pyspider is as shown in the diagram above:

1. Each script has a callback named `on_start`. When you press the `Run` button on the WebUI, a new `on_start` task is submitted to the Scheduler as the entry point of the project.
2. The Scheduler dispatches this `on_start` task with a Data URI as a normal task to the Fetcher.
3. The Fetcher makes the request and gets a response (for a Data URI it is a fake request and response, but it is no different from other normal tasks), then feeds it to the Processor.
4. The Processor calls the `on_start` method, which generates some new URLs to crawl. The Processor sends a message to the Scheduler that this task is finished, and sends the new tasks to the Scheduler via the message queue (there are no results for `on_start` in most cases; if there are results, the Processor sends them to `result_queue`).
5. The Scheduler receives the new tasks, looks them up in the database and determines whether each task is new or requires a re-crawl; if so, it puts them into the task queue. Tasks are dispatched in order.
6. The process repeats (from step 3) and won't stop until the WWW is dead ;-). The Scheduler will check periodic tasks to crawl the latest data.
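
The loop described in the steps above can be mimicked with a toy, single-process sketch. Everything here is a stand-in: `FAKE_WEB`, `fetch`, `process` and `run` are hypothetical names, there is no real network access, and re-crawl rules, priorities and rate limits are left out.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> links found on that page.
FAKE_WEB = {
    "data:,on_start": ["http://example.com/"],
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}


def fetch(url):
    """Fetcher stand-in: return a fake response for a URL."""
    return {"url": url, "links": FAKE_WEB.get(url, [])}


def process(response):
    """Processor stand-in: return new tasks and an optional result."""
    is_entry = response["url"].startswith("data:")
    result = None if is_entry else {"url": response["url"]}
    return response["links"], result


def run(start_url="data:,on_start"):
    seen = set()            # stands in for the scheduler's task database
    task_queue = deque()    # scheduler -> fetcher
    results = []            # what a result worker would store

    def schedule(url):
        if url not in seen:     # only new tasks are accepted
            seen.add(url)
            task_queue.append(url)

    schedule(start_url)
    while task_queue:
        response = fetch(task_queue.popleft())
        new_tasks, result = process(response)
        for url in new_tasks:
            schedule(url)
        if result is not None:
            results.append(result)
    return results
```

Calling `run()` walks the three fake pages once each; the entry task itself produces no result, matching the note in step 4.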


================================================
FILE: docs/Command-Line.md
================================================
Command Line
============

Global Config
-------------

You can get command help via `pyspider --help`, and subcommand help via e.g. `pyspider all --help`.

Global options work for all subcommands.

```
Usage: pyspider [OPTIONS] COMMAND [ARGS]...

  A powerful spider system in python.

Options:
  -c, --config FILENAME    a json file with default values for subcommands.
                           {"webui": {"port":5001}}
  --logging-config TEXT    logging config file for built-in python logging
                           module  [default: pyspider/pyspider/logging.conf]
  --debug                  debug mode
  --queue-maxsize INTEGER  maxsize of queue
  --taskdb TEXT            database url for taskdb, default: sqlite
  --projectdb TEXT         database url for projectdb, default: sqlite
  --resultdb TEXT          database url for resultdb, default: sqlite
  --message-queue TEXT     connection url to message queue, default: builtin
                           multiprocessing.Queue
  --amqp-url TEXT          [deprecated] amqp url for rabbitmq. please use
                           --message-queue instead.
  --beanstalk TEXT         [deprecated] beanstalk config for beanstalk queue.
                           please use --message-queue instead.
  --phantomjs-proxy TEXT   phantomjs proxy ip:port
  --data-path TEXT         data dir path
  --version                Show the version and exit.
  --help                   Show this message and exit.
```

#### --config

The config file is a JSON file with config values for global options or subcommands (a sub-dict named after the subcommand). [example](/Deployment/#configjson)

``` json
{
  "taskdb": "mysql+taskdb://username:password@host:port/taskdb",
  "projectdb": "mysql+projectdb://username:password@host:port/projectdb",
  "resultdb": "mysql+resultdb://username:password@host:port/resultdb",
  "message_queue": "amqp://username:password@host:port/%2F",
  "webui": {
    "username": "some_name",
    "password": "some_passwd",
    "need-auth": true
  }
}
```

#### --queue-maxsize

Queue size limit; 0 means no limit.

#### --taskdb, --projectdb, --resultdb

```
mysql:
    mysql+type://user:passwd@host:port/database
sqlite:
    # relative path
    sqlite+type:///path/to/database.db
    # absolute path
    sqlite+type:////path/to/database.db
    # memory database
    sqlite+type://
mongodb:
    mongodb+type://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:portN]]][/[database][?options]]
    more: http://docs.mongodb.org/manual/reference/connection-string/
couchdb:
    couchdb+type://[username:password@]host[:port]
sqlalchemy:
    sqlalchemy+postgresql+type://user:passwd@host:port/database
    sqlalchemy+mysql+mysqlconnector+type://user:passwd@host:port/database
    more: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html
local:
    local+projectdb://filepath,filepath
    
type:
    should be one of `taskdb`, `projectdb`, `resultdb`.
```
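
The `engine+type://...` convention above can be taken apart with plain string handling. The helper below is a hypothetical sketch for checking your own URLs, not the parser pyspider uses.

```python
def split_database_url(url):
    """Split 'engine[+driver...]+type://rest' into its parts."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError("not a database URL: %r" % url)
    parts = scheme.split("+")
    if len(parts) < 2:
        raise ValueError("missing +type suffix in scheme: %r" % scheme)
    engine, dbtype = parts[0], parts[-1]
    if dbtype not in ("taskdb", "projectdb", "resultdb"):
        raise ValueError("unknown database type: %r" % dbtype)
    return {
        "engine": engine,                   # e.g. mysql, sqlite, sqlalchemy
        "drivers": "+".join(parts[1:-1]),   # e.g. mysql+mysqlconnector
        "type": dbtype,
        "rest": rest,                       # credentials, host, path ...
    }
```

For a relative sqlite path such as `sqlite+resultdb:///path/to/database.db`, the part after `://` is `/path/to/database.db`; the absolute form has one more leading slash.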


#### --message-queue

```
rabbitmq:
    amqp://username:password@host:5672/%2F
    see https://www.rabbitmq.com/uri-spec.html
redis:
    redis://host:6379/db
    redis://host1:port1,host2:port2,...,hostn:portn (for redis 3.x in cluster mode)
kombu:
    kombu+transport://userid:password@hostname:port/virtual_host
    see http://kombu.readthedocs.org/en/latest/userguide/connections.html#urls
builtin:
    None
```

#### --phantomjs-proxy

The phantomjs proxy address. You need phantomjs installed and the phantomjs proxy running, started with the command [`pyspider phantomjs`](#phantomjs).

#### --data-path

Path where the SQLite databases and counter dump files are saved.


all
---

```
Usage: pyspider all [OPTIONS]

  Run all the components in subprocess or thread

Options:
  --fetcher-num INTEGER         instance num of fetcher
  --processor-num INTEGER       instance num of processor
  --result-worker-num INTEGER   instance num of result worker
  --run-in [subprocess|thread]  run each components in thread or subprocess.
                                always using thread for windows.
  --help                        Show this message and exit.
```


one
---

```
Usage: pyspider one [OPTIONS] [SCRIPTS]...

  One mode not only means all-in-one, it runs every thing in one process
  over tornado.ioloop, for debug purpose

Options:
  -i, --interactive  enable interactive mode, you can choose crawl url.
  --phantomjs        enable phantomjs, will spawn a subprocess for phantomjs
  --help             Show this message and exit.
```

**NOTE: WebUI is not running in one mode.**

In `one` mode, results will be written to stdout by default. You can capture them via `pyspider one > result.txt`.

#### [SCRIPTS]

The script file paths of the projects. The project status is RUNNING; `rate` and `burst` can be set via script comments:

```
# rate: 1.0
# burst: 3
```

When SCRIPTS is set, `taskdb` and `resultdb` will use an in-memory sqlite db by default (this can be overridden by the global options `--taskdb` and `--resultdb`). The `on_start` callback will be triggered on start.
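
One way such comments could be read from a script is sketched below. The regular expression, the function name and the default values are assumptions for illustration; pyspider's own parsing may differ.

```python
import re

# Matches comment lines such as "# rate: 1.0" or "# burst: 3".
_SETTING_RE = re.compile(
    r"^\s*#\s*(rate|burst)\s*:\s*(\d+(?:\.\d+)?)\s*$", re.M)


def read_rate_and_burst(script_text, default_rate=1.0, default_burst=3.0):
    """Return the rate/burst settings found in a script's comments."""
    settings = {"rate": default_rate, "burst": default_burst}
    for name, value in _SETTING_RE.findall(script_text):
        settings[name] = float(value)
    return settings
```

Scripts without these comments simply keep the (assumed) defaults passed to the function.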

#### -i, --interactive

In interactive mode, pyspider starts an interactive console asking what to do in the next loop of the process. In the console, you can use:

``` python
crawl(url, project=None, **kwargs)
    Crawl given url, same parameters as BaseHandler.crawl

    url - url or taskid, parameters will be used if in taskdb
    project - can be omitted if only one project exists.
    
quit_interactive()
    Quit interactive mode
    
quit_pyspider()
    Close pyspider
```

You can use `pyspider.libs.utils.python_console()` to open an interactive console in your script.

bench
-----

```
Usage: pyspider bench [OPTIONS]

  Run Benchmark test. In bench mode, in-memory sqlite database is used
  instead of on-disk sqlite database.

Options:
  --fetcher-num INTEGER         instance num of fetcher
  --processor-num INTEGER       instance num of processor
  --result-worker-num INTEGER   instance num of result worker
  --run-in [subprocess|thread]  run each components in thread or subprocess.
                                always using thread for windows.
  --total INTEGER               total url in test page
  --show INTEGER                show how many urls in a page
  --help                        Show this message and exit.
```


scheduler
---------

```
Usage: pyspider scheduler [OPTIONS]

  Run Scheduler, only one scheduler is allowed.

Options:
  --xmlrpc / --no-xmlrpc
  --xmlrpc-host TEXT
  --xmlrpc-port INTEGER
  --inqueue-limit INTEGER  size limit of task queue for each project, tasks
                           will been ignored when overflow
  --delete-time INTEGER    delete time before marked as delete
  --active-tasks INTEGER   active log size
  --loop-limit INTEGER     maximum number of tasks due with in a loop
  --scheduler-cls TEXT     scheduler class to be used.
  --help                   Show this message and exit.
```

#### --scheduler-cls

Set this option to use a customized Scheduler class.

phantomjs
---------

```
Usage: run.py phantomjs [OPTIONS] [ARGS]...

  Run phantomjs fetcher if phantomjs is installed.

Options:
  --phantomjs-path TEXT  phantomjs path
  --port INTEGER         phantomjs port
  --auto-restart TEXT    auto restart phantomjs if crashed
  --help                 Show this message and exit.
```

#### ARGS

Additional arguments passed to the phantomjs command line.

fetcher
-------

```
Usage: pyspider fetcher [OPTIONS]

  Run Fetcher.

Options:
  --xmlrpc / --no-xmlrpc
  --xmlrpc-host TEXT
  --xmlrpc-port INTEGER
  --poolsize INTEGER      max simultaneous fetches
  --proxy TEXT            proxy host:port
  --user-agent TEXT       user agent
  --timeout TEXT          default fetch timeout
  --fetcher-cls TEXT      Fetcher class to be used.
  --help                  Show this message and exit.
```

#### --proxy

Default proxy used by the fetcher; it can be overridden by the `self.crawl` option. [DOC](apis/self.crawl/#fetch)


processor
---------

```
Usage: pyspider processor [OPTIONS]

  Run Processor.

Options:
  --processor-cls TEXT  Processor class to be used.
  --help                Show this message and exit.
```

result_worker
-------------

```
Usage: pyspider result_worker [OPTIONS]

  Run result worker.

Options:
  --result-cls TEXT  ResultWorker class to be used.
  --help             Show this message and exit.
```


webui
-----

```
Usage: pyspider webui [OPTIONS]

  Run WebUI

Options:
  --host TEXT            webui bind to host
  --port INTEGER         webui bind to host
  --cdn TEXT             js/css cdn server
  --scheduler-rpc TEXT   xmlrpc path of scheduler
  --fetcher-rpc TEXT     xmlrpc path of fetcher
  --max-rate FLOAT       max rate for each project
  --max-burst FLOAT      max burst for each project
  --username TEXT        username of lock -ed projects
  --password TEXT        password of lock -ed projects
  --need-auth            need username and password
  --webui-instance TEXT  webui Flask Application instance to be used.
  --help                 Show this message and exit.
```

#### --cdn

JS/CSS libs CDN service; the URL must be compatible with [cdnjs](https://cdnjs.com/).

#### --fetcher-rpc

XML-RPC path URI for fetcher XMLRPC server. If not set, use a Fetcher instance.

#### --need-auth

If true, all pages require username and password specified via `--username` and `--password`.




================================================
FILE: docs/Deployment-demo.pyspider.org.md
================================================
Deployment of demo.pyspider.org
===============================

[demo.pyspider.org](http://demo.pyspider.org/) is running on three VPSs connected through a private network using [tinc](http://www.tinc-vpn.org/).

1vCore 4GB RAM | 1vCore 2GB RAM * 2
---------------|----------------
database<br>message queue<br>scheduler | phantomjs * 2<br>phantomjs-lb * 1<br>fetcher * 1<br>fetcher-lb * 1<br>processor * 2<br>result-worker * 1<br>webui * 4<br>webui-lb * 1<br>nginx * 1<br>

All components are running inside docker containers.

database / message queue / scheduler
------------------------------------

The database is postgresql and the message queue is redis.

The scheduler may perform a lot of database operations, so it's better to put it close to the database.

```bash
docker run --name postgres -v /data/postgres/:/var/lib/postgresql/data -d -p $LOCAL_IP:5432:5432 -e POSTGRES_PASSWORD="" postgres
docker run --name redis -d -p  $LOCAL_IP:6379:6379 redis
docker run --name scheduler -d -p $LOCAL_IP:23333:23333 --restart=always binux/pyspider \
 --taskdb "sqlalchemy+postgresql+taskdb://binux@10.21.0.7/taskdb" \
 --resultdb "sqlalchemy+postgresql+resultdb://binux@10.21.0.7/resultdb" \
 --projectdb "sqlalchemy+postgresql+projectdb://binux@10.21.0.7/projectdb" \
 --message-queue "redis://10.21.0.7:6379/1" \
 scheduler --inqueue-limit 5000 --delete-time 43200
```

other components
----------------

fetcher, processor and result_worker run on two boxes with the same configuration, managed with [docker-compose](https://docs.docker.com/compose/).

```yaml
phantomjs:
  image: 'binux/pyspider:latest'
  command: phantomjs
  cpu_shares: 512
  environment:
    - 'EXCLUDE_PORTS=5000,23333,24444'
  expose:
    - '25555'
  mem_limit: 512m
  restart: always
phantomjs-lb:
  image: 'dockercloud/haproxy:latest'
  links:
    - phantomjs
  restart: always
  
fetcher:
  image: 'binux/pyspider:latest'
  command: '--message-queue "redis://10.21.0.7:6379/1" --phantomjs-proxy "phantomjs:80" fetcher --xmlrpc'
  cpu_shares: 512
  environment:
    - 'EXCLUDE_PORTS=5000,25555,23333'
  links:
    - 'phantomjs-lb:phantomjs'
  mem_limit: 128m
  restart: always
fetcher-lb:
  image: 'dockercloud/haproxy:latest'
  links:
    - fetcher
  restart: always
  
processor:
  image: 'binux/pyspider:latest'
  command: '--projectdb "sqlalchemy+postgresql+projectdb://binux@10.21.0.7/projectdb" --message-queue "redis://10.21.0.7:6379/1" processor'
  cpu_shares: 512
  mem_limit: 256m
  restart: always
  
result-worker:
  image: 'binux/pyspider:latest'
  command: '--taskdb "sqlalchemy+postgresql+taskdb://binux@10.21.0.7/taskdb"  --projectdb "sqlalchemy+postgresql+projectdb://binux@10.21.0.7/projectdb" --resultdb "sqlalchemy+postgresql+resultdb://binux@10.21.0.7/resultdb" --message-queue "redis://10.21.0.7:6379/1" result_worker'
  cpu_shares: 512
  mem_limit: 256m
  restart: always
  
webui:
  image: 'binux/pyspider:latest'
  command: '--taskdb "sqlalchemy+postgresql+taskdb://binux@10.21.0.7/taskdb"  --projectdb "sqlalchemy+postgresql+projectdb://binux@10.21.0.7/projectdb" --resultdb "sqlalchemy+postgresql+resultdb://binux@10.21.0.7/resultdb" --message-queue "redis://10.21.0.7:6379/1" webui --max-rate 0.2 --max-burst 3 --scheduler-rpc "http://o4.i.binux.me:23333/" --fetcher-rpc "http://fetcher/"'

  cpu_shares: 512
  environment:
    - 'EXCLUDE_PORTS=24444,25555,23333'
  links:
    - 'fetcher-lb:fetcher'
  mem_limit: 256m
  restart: always
webui-lb:
  image: 'dockercloud/haproxy:latest'
  links:
    - webui
  restart: always
  
nginx:
  image: 'nginx'
  links:
    - 'webui-lb:HAPROXY'
  ports:
    - '0.0.0.0:80:80'
  volumes:
    - /home/binux/nfs/profile/nginx/nginx.conf:/etc/nginx/nginx.conf
    - /home/binux/nfs/profile/nginx/conf.d/:/etc/nginx/conf.d/
  restart: always
```

With this config, you can change the scale with `docker-compose scale phantomjs=2 processor=2 webui=4` when needed.

#### load balance

phantomjs-lb, fetcher-lb and webui-lb are automatically configured haproxy instances that allow any number of upstreams.

#### phantomjs

phantomjs has a memory leak issue, so a memory limit is applied, and it's recommended to restart it every hour.

#### fetcher

fetcher is implemented with async IO and supports 100 concurrent connections. If the upstream queues are not choked, one fetcher should be enough.

#### processor

processor is a CPU-bound component. The recommended number of instances is the number of CPU cores + 1~2, or CPU cores * 10%~15% when you have more than 20 cores.

#### result-worker

If you haven't overridden result-worker, it only writes results into the database and should be very fast.


================================================
FILE: docs/Deployment.md
================================================
Deployment
===========

Since pyspider has various components, you can just run `pyspider` to start a standalone instance that needs no third-party services, or use MySQL or MongoDB together with RabbitMQ to deploy a distributed crawl cluster.

To deploy pyspider in a production environment, running each component in its own process and storing data in a database service is more reliable and flexible.

Installation
------------

To deploy pyspider components in separate processes, you need at least one database service. pyspider now supports [MySQL](http://www.mysql.com/), [CouchDB](https://couchdb.apache.org), [MongoDB](http://www.mongodb.org/) and [PostgreSQL](http://www.postgresql.org/). You can choose one of them.

And you need a message queue service to connect the components together. You can use [RabbitMQ](http://www.rabbitmq.com/) or [Redis](http://redis.io/) as message queue.

`pip install --allow-all-external pyspider[all]`

> Even if you have installed pyspider using `pip` before, installing with `pyspider[all]` is necessary to get the requirements for MySQL/MongoDB/RabbitMQ.

If you are using Ubuntu, try:
```
apt-get install python python-dev python-distribute python-pip libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml
```
to install binary packages.

Deployment
----------

**This document is based on MySQL + RabbitMQ**

### config.json

Although you can specify the parameters on the command line, a config file is a better choice.

```
{
  "taskdb": "mysql+taskdb://username:password@host:port/taskdb",
  "projectdb": "mysql+projectdb://username:password@host:port/projectdb",
  "resultdb": "mysql+resultdb://username:password@host:port/resultdb",
  "message_queue": "amqp://username:password@host:port/%2F",
  "webui": {
    "username": "some_name",
    "password": "some_passwd",
    "need-auth": true
  }
}
```

You can get the complete list of options by running `pyspider --help`, and `pyspider webui --help` for subcommands. `"webui"` in the JSON holds the config for that subcommand. You can add parameters for other components in the same way.

#### Database Connection URI
`"taskdb"`, `"projectdb"` and `"resultdb"` use database connection URIs in the format below:

```
mysql:
    mysql+type://user:passwd@host:port/database
sqlite:
    # relative path
    sqlite+type:///path/to/database.db
    # absolute path
    sqlite+type:////path/to/database.db
    # memory database
    sqlite+type://
mongodb:
    mongodb+type://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:portN]]][/[database][?options]]
    more: http://docs.mongodb.org/manual/reference/connection-string/
couchdb:
    couchdb+type://[username:password@]host[:port][?options]
sqlalchemy:
    sqlalchemy+postgresql+type://user:passwd@host:port/database
    sqlalchemy+mysql+mysqlconnector+type://user:passwd@host:port/database
    more: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html
local:
    local+projectdb://filepath,filepath
    
type:
    should be one of `taskdb`, `projectdb`, `resultdb`.
```

#### Message Queue URL
You can use connection URL to specify the message queue:

```
rabbitmq:
    amqp://username:password@host:5672/%2F
    Refer: https://www.rabbitmq.com/uri-spec.html
redis:
    redis://host:6379/db
    redis://host1:port1,host2:port2,...,hostn:portn (for redis 3.x in cluster mode)
builtin:
    None
```

> Hint for postgresql: you need to create the database with encoding utf8 on your own. pyspider will not create the database for you.

running
-------

You should run each component separately with its subcommand. You may add `&` after a command to run it in the background, and use [screen](http://linux.die.net/man/1/screen) or [nohup](http://linux.die.net/man/1/nohup) to keep it from exiting after your ssh session ends. **It's recommended to manage components with [Supervisor](http://supervisord.org/).**

```
# start **only one** scheduler instance
pyspider -c config.json scheduler

# phantomjs
pyspider -c config.json phantomjs

# start fetcher / processor / result_worker instances as many as your needs
pyspider -c config.json --phantomjs-proxy="localhost:25555" fetcher
pyspider -c config.json processor
pyspider -c config.json result_worker

# start webui, set `--scheduler-rpc` if scheduler is not running on the same host as webui
pyspider -c config.json webui
```

Running with Docker
-------------------
[Running pyspider with Docker](Running-pyspider-with-Docker)


Deployment of demo.pyspider.org
-------------------------------
[Deployment of demo.pyspider.org](Deployment-demo.pyspider.org)



================================================
FILE: docs/Frequently-Asked-Questions.md
================================================
Frequently Asked Questions
==========================

Does pyspider Work with Windows?
--------------------------------
Yes, it should; some users have made it work on Windows. But as I don't have a Windows development environment, I cannot test it. Here are some tips for users who want to use pyspider on Windows:

- Some packages need binary libs (e.g. pycurl, lxml) that you may not be able to install via pip. Windows binary packages can be found at [http://www.lfd.uci.edu/~gohlke/pythonlibs/](http://www.lfd.uci.edu/~gohlke/pythonlibs/).
- Make a clean environment with [virtualenv](https://virtualenv.readthedocs.org/en/latest/)
- Try the 32-bit version of Python, especially if you are facing crash issues.
- Avoid using Python 3.4.1 ([#194](https://github.com/binux/pyspider/issues/194), [#217](https://github.com/binux/pyspider/issues/217))

Unreadable Code (乱码) Returned from Phantomjs
---------------------------------------------

Phantomjs doesn't support gzip, don't set `Accept-Encoding` header with `gzip`.


How to Delete a Project?
------------------------

Set `group` to `delete` and `status` to `STOP`, then wait 24 hours. You can change the time before a project is deleted via `scheduler.DELETE_TIME`.

How to Restart a Project?
-------------------------
#### Why
This happens after you have modified a script and want to crawl everything again with the new strategy. But as the [age](/apis/self.crawl/#age) of the URLs has not expired, the Scheduler will discard all of the new requests.

#### Solution
1. Create a new project.
2. Use an [itag](/apis/self.crawl/#itag) within `Handler.crawl_config` to specify the version of your script.

How to Use WebDAV Mode?
-----------------------
Mount `http://hostname/dav/` to your filesystem, edit or create scripts with your favourite editor.

> OSX: `mount_webdav http://hostname/dav/ /Volumes/dav`  
> Linux: Install davfs2, `mount.davfs http://hostname/dav/ /mnt/dav`  
> VIM: `vim http://hostname/dav/script_name.py`

When you are editing a script outside the WebUI, you need to switch it to `WebDAV Mode` while debugging. After you save the script in your editor, the WebUI can load and use the latest script to debug your code.

What does the progress bar mean on the dashboard?
-------------------------------------------------
When the mouse moves onto the progress bar, you can see the explanations.

For 5m, 1h and 1d, the numbers are the events triggered in the last 5 minutes, 1 hour and 1 day. For the `all` progress bar, they are the total numbers of tasks in the corresponding status.

Only the tasks in DEBUG/RUNNING status will show the progress.

How many scheduler/fetcher/processor/result_worker do I need? or pyspider stop working
--------------------------------------------------------------------------------------
You can only have one scheduler, and multiple fetchers/processors/result_workers depending on the bottleneck. You can use the queue status on the dashboard to find the bottleneck of the system:

![run one step](imgs/queue_status.png)

For example, the number between scheduler and fetcher indicates the size of the queue from the scheduler to the fetchers. When it hits 100 (the default maximum queue size), the fetcher might have crashed, or you should consider adding more fetchers.

The number `0+0` below fetcher indicates the queue sizes of new tasks and status packs between the processors and the scheduler. You can put your mouse over the numbers to see the tips.

================================================
FILE: docs/Quickstart.md
================================================
Quickstart
==========

Installation
------------

* `pip install pyspider`
* run command `pyspider`, visit [http://localhost:5000/](http://localhost:5000/)

If you are using Ubuntu, try:
```
apt-get install python python-dev python-distribute python-pip \
libcurl4-openssl-dev libxml2-dev libxslt1-dev python-lxml \
libssl-dev zlib1g-dev
```
to install binary packages first.


Please install PhantomJS if needed: http://phantomjs.org/build.html

Note that PhantomJS will be enabled only if it is executable in the `PATH` or in the System Environment.

**Note:** the `pyspider` command runs pyspider in `all` mode, which runs the components in threads or subprocesses. For a production environment, please refer to [Deployment](Deployment).

**WARNING:** The WebUI is open to the public by default, and it can be used to execute any command, which may harm your system. Please use it on an internal network or [enable `need-auth` for webui](http://docs.pyspider.org/en/latest/Command-Line/#-config).

Your First Script
-----------------

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    @config(priority=2)
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```

> * `def on_start(self)` is the entry point of the script. It will be called when you click the `run` button on the dashboard.
> * [`self.crawl(url, callback=self.index_page)`*](/apis/self.crawl) is the most important API here. It adds a new task to be crawled. Most of the options are specified via `self.crawl` arguments.
> * `def index_page(self, response)` gets a [`Response`*](/apis/Response) object. [`response.doc`*](/apis/Response/#responsedoc) is a [pyquery](https://pythonhosted.org/pyquery/) object which has a jQuery-like API to select the elements to be extracted.
> * `def detail_page(self, response)` returns a `dict` object as the result. The result will be captured into `resultdb` by default. You can override the `on_result(self, result)` method to manage the result yourself.


More things you may want to know:

> * [`@every(minutes=24*60, seconds=0)`*](/apis/@every/) is a helper that tells the scheduler that the `on_start` method should be called every day.
> * [`@config(age=10 * 24 * 60 * 60)`*](/apis/self.crawl/#configkwargs) specifies the default `age` parameter of `self.crawl` for the page type `index_page` (when `callback=self.index_page`). The parameter [`age`*](/apis/self.crawl/#age) can also be specified via `self.crawl(url, age=10*24*60*60)` (highest priority) and `crawl_config` (lowest priority).
> * [`age=10 * 24 * 60 * 60`*](/apis/self.crawl/#age) tells the scheduler to discard the request if the page has been crawled within the last 10 days. By default pyspider will not crawl the same URL twice (it is discarded forever), even if you have modified the code. It is very common for beginners to run a project, modify it and run it a second time, only to find that it does not crawl again (read [`itag`](/apis/self.crawl/#itag) for the solution).
> * [`@config(priority=2)`*](/apis/self.crawl/#schedule) marks that detail pages should be crawled first.
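
The priority order for a parameter such as `age` can be pictured as three dicts merged from lowest to highest priority. This is a standalone sketch of that ordering only; the function and argument names are hypothetical and it is not how pyspider stores these options internally.

```python
def resolve_crawl_options(crawl_config, config_decorator, crawl_kwargs):
    """Merge option sources from lowest to highest priority."""
    options = {}
    options.update(crawl_config)        # lowest: Handler.crawl_config
    options.update(config_decorator)    # middle: @config(...) on the callback
    options.update(crawl_kwargs)        # highest: self.crawl(url, ...)
    return options
```

For example, with `crawl_config = {"age": 1}` and `@config(age=2)`, the effective `age` is 2 unless `self.crawl(url, age=3)` is used, in which case it is 3.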

You can test your script step by step by clicking the green `run` button. Switch to the `follows` panel and click the play button to move on.

![run one step](imgs/run_one_step.png)

Start Running
-------------

1. Save your script.
2. Go back to the dashboard and find your project.
3. Change the `status` to `DEBUG` or `RUNNING`.
4. Click the `run` button.

![index demo](imgs/index_page.png)

Your script is running now!


================================================
FILE: docs/Running-pyspider-with-Docker.md
================================================
```shell
# mysql
docker run --name mysql -d -v /data/mysql:/var/lib/mysql -e MYSQL_ALLOW_EMPTY_PASSWORD=yes mysql:latest
# rabbitmq
docker run --name rabbitmq -d rabbitmq:latest

# phantomjs
docker run --name phantomjs -d binux/pyspider:latest phantomjs

# result worker
docker run --name result_worker -m 128m -d --link mysql:mysql --link rabbitmq:rabbitmq binux/pyspider:latest result_worker
# processor, run multiple instance if needed.
docker run --name processor -m 256m -d --link mysql:mysql --link rabbitmq:rabbitmq binux/pyspider:latest processor
# fetcher, run multiple instance if needed.
docker run --name fetcher -m 256m -d --link phantomjs:phantomjs --link rabbitmq:rabbitmq binux/pyspider:latest fetcher --no-xmlrpc
# scheduler
docker run --name scheduler -d --link mysql:mysql --link rabbitmq:rabbitmq binux/pyspider:latest scheduler
# webui
docker run --name webui -m 256m -d -p 5000:5000 --link mysql:mysql --link rabbitmq:rabbitmq --link scheduler:scheduler --link phantomjs:phantomjs binux/pyspider:latest webui
```

or run it with [Docker Compose](https://docs.docker.com/compose/) using `docker-compose.yml`:

NOTE: It's recommended to run mysql and rabbitmq outside compose, as they should not be restarted together with pyspider. You can find the commands to start the mysql and rabbitmq services above.

```
phantomjs:
  image: binux/pyspider:latest
  command: phantomjs
result:
  image: binux/pyspider:latest
  external_links:
    - mysql
    - rabbitmq
  command: result_worker
processor:
  image: binux/pyspider:latest
  external_links:
    - mysql
    - rabbitmq
  command: processor
fetcher:
  image: binux/pyspider:latest
  external_links:
    - rabbitmq
  links:
    - phantomjs
  command: fetcher
scheduler:
  image: binux/pyspider:latest
  external_links:
    - mysql
    - rabbitmq
  command: scheduler
webui:
  image: binux/pyspider:latest
  external_links:
    - mysql
    - rabbitmq
  links:
    - scheduler
    - phantomjs
  command: webui
  ports:
    - "5000:5000"
```

`docker-compose up`




================================================
FILE: docs/Script-Environment.md
================================================
Script Environment
==================

Variables
---------
* `self.project_name`
* `self.project` information about current project
* `self.response`
* `self.task`

About Script
------------
* The name of the `Handler` class does not matter, but you need at least one class that inherits from `BaseHandler`
* A third parameter can be set to get the task object: `def callback(self, response, task)`
* Non-200 responses are not submitted to the callback by default. Use `@catch_status_code_error` to receive them.

About Environment
-----------------
* `logging`, `print` and exceptions will be captured.
* You can import other projects as module with `from projects import some_project`

### Web view

* view the page as a browser would render (approximately)

### HTML view

* view the HTML of the current callback (index_page, detail_page, etc.)

### Follows view

* view the callbacks that can be made from the current callback
* index_page follows view will show the detail_page callbacks that can be executed.

### Messages view

* shows the messages sent by the [`self.send_message`](apis/self.send_message) API.

### Enable CSS Selector Helper

* Enables a CSS Selector Helper in the Web view. It gets the CSS selector of the element you clicked and adds it to your script.


================================================
FILE: docs/Working-with-Results.md
================================================
Working with Results
====================
Downloading and viewing your data from the WebUI is convenient, but may not be suitable for programmatic processing.

Working with ResultDB
---------------------
Resultdb is only designed for result preview and is not suitable for large scale storage. If you still want to grab data from resultdb, the database API offers a simple way to connect and select the data:

```
from pyspider.database import connect_database
resultdb = connect_database("<your resultdb connection url>")
for project in resultdb.projects:
    for result in resultdb.select(project):
        assert result['taskid']
        assert result['url']
        assert result['result']
```

The `result['result']` is the object submitted by `return` statement from your script.
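
As a hedged illustration of what to do with the selected rows, the sketch below writes them out as JSON Lines. `export_results` and the sample rows are hypothetical; the only assumption taken from the snippet above is that each row exposes the `taskid`, `url` and `result` keys.

```python
import io
import json


def export_results(results, fp):
    """Write one JSON object per line for each result row; return the count."""
    count = 0
    for row in results:
        record = {
            "taskid": row["taskid"],
            "url": row["url"],
            "result": row["result"],
        }
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


# Stand-in rows; real rows would come from resultdb.select(project).
sample_rows = [
    {"taskid": "a1", "url": "http://example.com/1", "result": {"title": "One"}},
    {"taskid": "b2", "url": "http://example.com/2", "result": {"title": "Two"}},
]
buffer = io.StringIO()
written = export_results(sample_rows, buffer)
```

In a real script the `fp` argument would typically be an open file instead of an in-memory buffer.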

Working with ResultWorker
-------------------------
In a production environment, you may want to connect pyspider to your own system or post-processing pipeline rather than store results in resultdb. Overriding ResultWorker is the recommended way to do this.

```
from pyspider.result import ResultWorker

class MyResultWorker(ResultWorker):
    def on_result(self, task, result):
        assert task['taskid']
        assert task['project']
        assert task['url']
        assert result
        # your processing code goes here
```

`result` is the object submitted by `return` statement from your script.

You can put this script (e.g., `my_result_worker.py`) in the folder where you launch pyspider, then add the `--result-cls` argument to the `result_worker` subcommand:

`pyspider result_worker --result-cls=my_result_worker.MyResultWorker`

Or

```
{
  ...
  "result_worker": {
    "result_cls": "my_result_worker.MyResultWorker"
  }
  ...
}
```

if you are using a config file. Please refer to [Deployment](/Deployment).

Design Your Own Database Schema
-------------------------------
The results stored in the database are encoded as JSON for compatibility. It's highly recommended to design your own database schema and override the ResultWorker as described above.
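
A minimal sketch of such a schema, using SQLite from the standard library. The table layout and the helper names `create_schema` and `store_result` are assumptions for illustration; an `on_result` override could call something like `store_result` with the `task` and `result` it receives.

```python
import json
import sqlite3


def create_schema(conn):
    # One row per task. `result` is kept as JSON text in this sketch, but
    # individual fields could just as well get their own columns.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS results (
            taskid  TEXT PRIMARY KEY,
            project TEXT NOT NULL,
            url     TEXT NOT NULL,
            result  TEXT NOT NULL
        )
        """
    )


def store_result(conn, task, result):
    # INSERT OR REPLACE keeps only the latest result per taskid.
    conn.execute(
        "INSERT OR REPLACE INTO results (taskid, project, url, result) "
        "VALUES (?, ?, ?, ?)",
        (task["taskid"], task["project"], task["url"], json.dumps(result)),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
create_schema(conn)
task = {"taskid": "a1", "project": "demo", "url": "http://example.com/"}
store_result(conn, task, {"title": "old"})
store_result(conn, task, {"title": "new"})
rows = conn.execute("SELECT url, result FROM results").fetchall()
```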

TIPS about Results
-------------------
#### Want to return more than one result in callback?
As resultdb de-duplicates results by taskid (url), the latest result overwrites previous ones.

One workaround is to use the `send_message` API to make a `fake` taskid for each result.

```
def detail_page(self, response):
    for li in response.doc('li').items():
        self.send_message(self.project_name, {
            ...
        }, url=response.url+"#"+li('a.product-sku').text())

def on_message(self, project, msg):
    return msg
```

See Also: [apis/self.send_message](/apis/self.send_message)
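
Why the fragment trick works can be sketched with `hashlib`, assuming the default taskid is the MD5 of the URL (as the `taskid` parameter of `self.crawl` describes). `default_taskid` is a stand-in written for this example, not a pyspider function.

```python
import hashlib


def default_taskid(url):
    # Assumption: the default taskid is the MD5 hex digest of the URL.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


page = "http://example.com/list"
skus = ["A-100", "B-200", "A-100"]

# Each distinct "#sku" suffix yields a distinct taskid, so results no longer
# overwrite each other. Identical SKUs still collapse into one task.
taskids = {default_taskid(page + "#" + sku) for sku in skus}
print(len(taskids))
```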


================================================
FILE: docs/apis/@catch_status_code_error.md
================================================
@catch_status_code_error
========================

A non-200 response is regarded as a failed fetch and is not passed to the callback. Use this decorator to override that behavior.

```python
def on_start(self):
    self.crawl('http://httpbin.org/status/404', self.callback)

@catch_status_code_error  
def callback(self, response):
    ...
```

>  Without the decorator, the `callback` would not be executed because the request failed (with status code 404). With the `@catch_status_code_error` decorator, the `callback` is executed even if the request failed.



================================================
FILE: docs/apis/@every.md
================================================
@every(minutes=0, seconds=0)
============================

The decorated method will be called every `minutes` or `seconds`.


```python
@every(minutes=24 * 60)
def on_start(self):
    for url in urllist:
        self.crawl(url, callback=self.index_page)
```

The urls would be restarted every 24 hours. Note that if `age` is also used and its period is longer than that of `@every`, the crawl request would be discarded because the page is regarded as not changed:

```python
@every(minutes=24 * 60)
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.index_page)

@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    ...
```

> Even though the crawl request is triggered every day, it is discarded and the page is only recrawled every 10 days.
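
The interaction between `@every` and `age` can be checked with a small simulation. This is a simplified model written for illustration, not the scheduler's code; `simulate` is a hypothetical helper and the exact boundary behavior of the real scheduler may differ.

```python
DAY = 24 * 60 * 60


def simulate(every, age, days):
    """Return the days (0-based) on which the page is actually fetched."""
    fetched_on = []
    last_fetch = None
    for day in range(days):
        now = day * DAY
        if now % every != 0:
            continue  # on_start is not triggered at this tick
        if last_fetch is not None and now - last_fetch < age:
            continue  # still within `age`: the request is discarded
        last_fetch = now
        fetched_on.append(day)
    return fetched_on


fetch_days = simulate(every=1 * DAY, age=10 * DAY, days=25)
print(fetch_days)
```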



================================================
FILE: docs/apis/Response.md
================================================
Response
========

The attributes of Response object.

### Response.url

final URL.

### Response.text

Content of response, in unicode.

If `Response.encoding` is None and the `chardet` module is available, the encoding of the content will be guessed.

### Response.content

Content of response, in bytes.

### Response.doc

A [PyQuery](https://pythonhosted.org/pyquery/) object of the response's content. Links are made absolute by default.

Refer to the documentation of PyQuery: [https://pythonhosted.org/pyquery/](https://pythonhosted.org/pyquery/)

It's important enough to repeat: refer to the documentation of PyQuery: [https://pythonhosted.org/pyquery/](https://pythonhosted.org/pyquery/)

### Response.etree

A [lxml](http://lxml.de/) object of the response's content.

### Response.json

The JSON-encoded content of the response, if any.

### Response.status_code

### Response.orig_url

If there was any redirection during the request, this is the url you originally submitted via `self.crawl`.

### Response.headers

A case-insensitive dict that holds the headers of the response.

### Response.cookies

### Response.error

The error message, if the fetch failed.

### Response.time

Time used during fetching.

### Response.ok

True if `status_code` is 200 and no error.

### Response.encoding

Encoding of Response.content.

If `Response.encoding` is None, the encoding will be guessed from the headers, the content, or `chardet` (if available).

Setting the encoding manually overrides the guessed encoding.

### Response.save

The object saved by [`self.crawl`](/apis/self.crawl/#save) API

### Response.js_script_result

content returned by JS script

### Response.raise_for_status()

Raise HTTPError if status code is not 200 or `Response.error` exists.



================================================
FILE: docs/apis/index.md
================================================
API Reference
=============
    
- [self.crawl](self.crawl)
- [Response](Response)
- [self.send_message](self.send_message)
- [@every](@every)
- [@catch_status_code_error](@catch_status_code_error)


================================================
FILE: docs/apis/self.crawl.md
================================================
self.crawl
===========

self.crawl(url, **kwargs)
-------------------------

`self.crawl` is the main interface to tell pyspider which url(s) should be crawled.

### Parameters:

##### url
the url or url list to be crawled.

##### callback
the method to parse the response. _default: `__call__` _


```python
def on_start(self):
    self.crawl('http://scrapy.org/', callback=self.index_page)
```

the following parameters are optional

##### age

the period of validity of the task, in seconds. The page is regarded as not modified during this period. _default: -1 (never recrawl)_

```python
@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    ...
```
> Every page parsed by the callback `index_page` is regarded as unchanged for 10 days. If you submit the task within 10 days of the last crawl, it is discarded.

##### priority

the priority of the task to be scheduled; tasks with a higher value are scheduled first. _default: 0_

```python
def index_page(self, response):
    self.crawl('http://www.example.org/page2.html', callback=self.index_page)
    self.crawl('http://www.example.org/233.html', callback=self.detail_page,
               priority=1)
```
> The page `233.html` would be crawled before `page2.html`. You can use this parameter to do a [BFS](http://en.wikipedia.org/wiki/Breadth-first_search) and reduce the number of tasks in the queue (which may cost more memory resources).
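
The effect of `priority` on ordering can be sketched with a heap. This is an illustration of "higher priority first", not the scheduler's actual queue; `push` and `pop` are hypothetical helpers.

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker: keeps insertion order for equal priority


def push(queue, url, priority=0):
    # heapq is a min-heap, so the priority is negated to pop the highest first.
    heapq.heappush(queue, (-priority, next(_counter), url))


def pop(queue):
    return heapq.heappop(queue)[2]


queue = []
push(queue, "http://www.example.org/page2.html")            # default priority 0
push(queue, "http://www.example.org/233.html", priority=1)
push(queue, "http://www.example.org/page3.html")

order = [pop(queue) for _ in range(3)]
print(order)
```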

##### exetime

the time at which the task should be executed, as a Unix timestamp. _default: 0 (immediately)_

```python
import time
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               exetime=time.time()+30*60)
```
> The page would be crawled 30 minutes later.

##### retries

number of retries when the task fails. _default: 3_

##### itag

a marker from the frontier page that reveals a potential modification of the task. It is compared to its last value, and the task is recrawled when it has changed. _default: None_

```python
def index_page(self, response):
    for item in response.doc('.item').items():
        self.crawl(item.find('a').attr.href, callback=self.detail_page,
                   itag=item.find('.update-time').text())
```
> In the sample, `.update-time` is used as the itag. If it has not changed, the request is discarded.

Or you can use `itag` with `Handler.crawl_config` to specify the script version if you want to restart all of the tasks.

```python
class Handler(BaseHandler):
    crawl_config = {
        'itag': 'v223'
    }
```
> Change the value of itag after you have modified the script and click the run button again. It doesn't matter if it was not set before.
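
How `age` and `itag` combine can be summarized as a small decision function. This is a simplified model based only on the descriptions above, not the scheduler's implementation; `should_recrawl` and its parameters are hypothetical.

```python
def should_recrawl(now, last_crawl_time, age, old_itag=None, new_itag=None):
    """Decide whether a submitted task would be crawled again (simplified model)."""
    if last_crawl_time is None:
        return True                      # never crawled before
    if new_itag is not None and new_itag != old_itag:
        return True                      # itag changed: recrawl regardless of age
    if age < 0:
        return False                     # age=-1: never recrawl
    return now - last_crawl_time >= age  # recrawl only once the age has passed


DAY = 24 * 60 * 60
# Crawled one day ago with a 10 day age: discarded.
print(should_recrawl(now=DAY, last_crawl_time=0, age=10 * DAY))
# Same task, but the script version itag was bumped: crawled again.
print(should_recrawl(now=DAY, last_crawl_time=0, age=10 * DAY,
                     old_itag=None, new_itag="v223"))
```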

##### auto_recrawl

when enabled, the task is recrawled every `age` seconds. _default: False_

```python
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               age=5*60*60, auto_recrawl=True)
```
> The page would be recrawled every 5 hours (the value of `age`).

##### method
    
HTTP method to use. _default: GET_ 

##### params

dictionary of URL parameters to append to the URL. 

```python
def on_start(self):
    self.crawl('http://httpbin.org/get', callback=self.callback,
               params={'a': 123, 'b': 'c'})
    self.crawl('http://httpbin.org/get?a=123&b=c', callback=self.callback)
```
> The two requests are the same.
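
The equivalence can be reproduced with the standard library. `with_params` is a hypothetical helper that mimics appending `params` to a URL; it is not how pyspider builds the request internally.

```python
from urllib.parse import urlencode, urlsplit


def with_params(url, params):
    # Append the encoded parameters, using "&" if the URL already has a query.
    separator = "&" if urlsplit(url).query else "?"
    return url + separator + urlencode(params)


built = with_params("http://httpbin.org/get", {"a": 123, "b": "c"})
literal = "http://httpbin.org/get?a=123&b=c"
print(built == literal)
```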

##### data

the body to attach to the request. If a dictionary is provided, form-encoding will take place. 

```python
def on_start(self):
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST', data={'a': 123, 'b': 'c'})
```

##### files

dictionary of `{field: {filename: 'content'}}` files for multipart upload.

##### user_agent

the User-Agent of the request

##### headers

dictionary of headers to send. 

##### cookies

dictionary of cookies to attach to this request. 

##### connect_timeout

timeout for initial connection in seconds. _default: 20_

##### timeout

maximum time in seconds to fetch the page. _default: 120_ 

##### allow_redirects

follow `30x` redirects. _default: True_

##### validate_cert

For HTTPS requests, validate the server’s certificate? _default: True_ 

##### proxy

proxy server to use, in the form `username:password@hostname:port`. Only HTTP proxies are currently supported.

```python
class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'localhost:8080'
    }
```
> `Handler.crawl_config` can be used with `proxy` to set a proxy for whole project.

##### etag 

use the HTTP ETag mechanism to skip processing if the content of the page has not changed. _default: True_

##### last_modified

use the HTTP Last-Modified header mechanism to skip processing if the content of the page has not changed. _default: True_

##### fetch_type

set to `js` to enable JavaScript fetcher. _default: None_ 

##### js_script

JavaScript to run before or after the page is loaded. It should be wrapped in a function like `function() { document.write("binux"); }`.


```python
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               fetch_type='js', js_script='''
               function() {
                   window.scrollTo(0,document.body.scrollHeight);
                   return 123;
               }
               ''')
```
> The script would scroll the page to bottom. The value returned in function could be captured via `Response.js_script_result`.

##### js_run_at

run JavaScript specified via `js_script` at `document-start` or `document-end`. _default: `document-end`_ 

##### js_viewport_width/js_viewport_height

set the size of the viewport used by the JavaScript fetcher for the layout process.

##### load_images

load images when JavaScript fetcher enabled. _default: False_ 

##### save

an object passed to the callback method. It can be accessed via `response.save`.


```python
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               save={'a': 123})

def callback(self, response):
    return response.save['a']
```
> `123` would be returned in `callback`

##### taskid
    
unique id to identify the task. The default is the MD5 checksum of the URL; it can be overridden by the method `def get_taskid(self, task)`

```python
import json
from pyspider.libs.utils import md5string
def get_taskid(self, task):
    return md5string(task['url']+json.dumps(task['fetch'].get('data', '')))
```
> Only the url is MD5-hashed as the taskid by default; the code above adds the `data` of a POST request as part of the taskid.

##### force_update
    
force update task params even if the task is in `ACTIVE` status. 

##### cancel

cancel a task. It should be used with `force_update` to cancel an active task. To cancel an `auto_recrawl` task, you should set `auto_recrawl=False` as well.

cURL command
------------

`self.crawl(curl_command)`

cURL is a command line tool for making HTTP requests. You can easily get a cURL command from Chrome DevTools > Network panel: right click the request and choose "Copy as cURL".

You can use a cURL command as the first argument of `self.crawl`. It will parse the command and make the HTTP request just as curl does.
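
To give an idea of what parsing such a command involves, here is a minimal sketch built on `shlex`. It is not pyspider's parser: `parse_curl` is a hypothetical helper that only understands a handful of common curl flags, and flags that take a value but are not listed here would be misread.

```python
import shlex


def parse_curl(command):
    """Extract url, method, headers and body from a simple curl command."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "curl":
        raise ValueError("not a curl command")

    parsed = {"url": None, "method": None, "headers": {}, "data": None}
    it = iter(tokens[1:])
    for token in it:
        if token in ("-H", "--header"):
            name, _, value = next(it).partition(":")
            parsed["headers"][name.strip()] = value.strip()
        elif token in ("-X", "--request"):
            parsed["method"] = next(it)
        elif token in ("-d", "--data", "--data-raw", "--data-binary"):
            parsed["data"] = next(it)
        elif token.startswith("-"):
            continue  # flags without a value (e.g. --compressed) are ignored
        else:
            parsed["url"] = token

    if parsed["method"] is None:
        # curl switches to POST when a request body is given
        parsed["method"] = "POST" if parsed["data"] is not None else "GET"
    return parsed


command = (
    "curl 'http://httpbin.org/post' -H 'User-Agent: demo' "
    "-H 'Accept: */*' --data 'a=123&b=c' --compressed"
)
request = parse_curl(command)
print(request["method"], request["url"])
```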

@config(**kwargs)
-----------------
default parameters of `self.crawl` when the decorated method is used as the callback. For example:

```python
@config(age=15*60)
def index_page(self, response):
    self.crawl('http://www.example.org/list-1.html', callback=self.index_page)
    self.crawl('http://www.example.org/product-233', callback=self.detail_page)
    
@config(age=10*24*60*60)
def detail_page(self, response):
    return {...}
```

The `age` of `list-1.html` is 15 minutes, while the `age` of `product-233` is 10 days. Because the callback of `product-233` is `detail_page`, it is treated as a `detail_page` and shares the config of `detail_page`.

Handler.crawl_config = {}
-------------------------
default parameters of `self.crawl` for the whole project. The parameters in `crawl_config` for the scheduler (priority, retries, exetime, age, itag, force_update, auto_recrawl, cancel) are joined when the task is created; the parameters for the fetcher and processor are joined when the task is executed. You can use this mechanism to change the fetch config (e.g. cookies) afterwards.

```python
class Handler(BaseHandler):
    crawl_config = {
        'headers': {
            'User-Agent': 'GoogleBot',
        }
    }
    
    ...
```
> `crawl_config` sets a project-level user agent.



================================================
FILE: docs/apis/self.send_message.md
================================================
self.send_message
=================

self.send_message(project, msg, [url])
--------------------------------------
send messages to another project. They are received by the `def on_message(self, project, message)` callback.

- `project` - other project name
- `msg` - any json-able object
- `url` - a result will be overwritten if it has the same `taskid`. Messages sent by `send_message` share the same `taskid` by default. Change this to return multiple results from one response.

```python
def detail_page(self, response):
    for i, each in enumerate(response.json['products']):
        self.send_message(self.project_name, {
                "name": each['name'],
                'price': each['prices'],
             }, url="%s#%s" % (response.url, i))

def on_message(self, project, msg):
    return msg
``` 

pyspider send_message [OPTIONS] PROJECT MESSAGE
-----------------------------------------------

You can also send message from command line.

```
Usage: pyspider send_message [OPTIONS] PROJECT MESSAGE

  Send Message to project from command line

Options:
  --scheduler-rpc TEXT  xmlrpc path of scheduler
  --help                Show this message and exit.
```

def on_message(self, project, message)
--------------------------------------
receives messages sent from other projects


================================================
FILE: docs/conf.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<roy@binux.me>
#         http://binux.me
# Created on 2015-11-10 01:31:54

import sys
from unittest.mock import MagicMock
from recommonmark.parser import CommonMarkParser

class Mock(MagicMock):
    @classmethod
    def __getattr__(cls, name):
        return Mock()

MOCK_MODULES = ['pycurl', 'lxml', 'psycopg2']
sys.modules.update((mod_name, Mock()) for mod_name in MOCK_MODULES)

source_parsers = {
        '.md': CommonMarkParser,
}

source_suffix = ['.rst', '.md']


================================================
FILE: docs/index.md
================================================
pyspider [![Build Status][Build Status]][Travis CI] [![Coverage Status][Coverage Status]][Coverage] [![Try][Try]][Demo]
========

A Powerful Spider(Web Crawler) System in Python. **[TRY IT NOW!][Demo]**

- Write script in Python
- Powerful WebUI with script editor, task monitor, project manager and result viewer
- [MySQL](https://www.mysql.com/), [CouchDB](https://couchdb.apache.org), [MongoDB](https://www.mongodb.org/), [Redis](http://redis.io/), [SQLite](https://www.sqlite.org/), [Elasticsearch](https://www.elastic.co/products/elasticsearch); [PostgreSQL](http://www.postgresql.org/) with [SQLAlchemy](http://www.sqlalchemy.org/) as database backend
- [RabbitMQ](http://www.rabbitmq.com/), [Redis](http://redis.io/) and [Kombu](http://kombu.readthedocs.org/) as message queue
- Task priority, retry, periodical, recrawl by age, etc...
- Distributed architecture, Crawl Javascript pages, Python 2&3, etc...

Tutorial: [http://docs.pyspider.org/en/latest/tutorial/](http://docs.pyspider.org/en/latest/tutorial/)  
Documentation: [http://docs.pyspider.org/](http://docs.pyspider.org/)  
Release notes: [https://github.com/binux/pyspider/releases](https://github.com/binux/pyspider/releases)  

Sample Code 
-----------

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
```

[![Demo][Demo Img]][Demo]


Installation
------------

* `pip install pyspider`
* run command `pyspider`, visit [http://localhost:5000/](http://localhost:5000/)

Quickstart: [http://docs.pyspider.org/en/latest/Quickstart/](http://docs.pyspider.org/en/latest/Quickstart/)

Contribute
----------

* Use It
* Open [Issue], send PR
* [User Group]
* [中文问答](http://segmentfault.com/t/pyspider)


TODO
----

### v0.4.0

- [x] local mode, load script from file.
- [x] works as a framework (all components running in one process, no threads)
- [x] redis
- [x] shell mode like `scrapy shell` 
- [ ] a visual scraping interface like [portia](https://github.com/scrapinghub/portia)


### more

- [x] edit script with vim via [WebDAV](http://en.wikipedia.org/wiki/WebDAV)


License
-------
Licensed under the Apache License, Version 2.0


[Build Status]:         https://img.shields.io/travis/binux/pyspider/master.svg?style=flat
[Travis CI]:            https://travis-ci.org/binux/pyspider
[Coverage Status]:      https://img.shields.io/coveralls/binux/pyspider.svg?branch=master&style=flat
[Coverage]:             https://coveralls.io/r/binux/pyspider
[Try]:                  https://img.shields.io/badge/try-pyspider-blue.svg?style=flat
[Demo]:                 http://demo.pyspider.org/
[Demo Img]:             imgs/demo.png
[Issue]:                https://github.com/binux/pyspider/issues
[User Group]:           https://groups.google.com/group/pyspider-users


================================================
FILE: docs/tutorial/AJAX-and-more-HTTP.md
================================================
Level 2: AJAX and More HTTP
===========================

In the last article, we discussed how to extract links and information from HTML documents. However, web content is becoming more complicated with technologies like AJAX. You may find that the fetched page looks different from what you see in the browser, and that the information you want to extract is not in the HTML of the page.

In this article, we will not write complete scraping scripts, but only snippets for web pages that use technologies like AJAX or need some HTTP parameters besides the URL.

AJAX
----

[AJAX] is short for asynchronous JavaScript + XML. AJAX uses existing standards to update parts of a web page without reloading the whole page. A common use of AJAX is loading [JSON] data and rendering it to HTML on the client side.

You may find elements missing in the HTML fetched by pyspider or [wget](https://www.gnu.org/software/wget/). When you open the page in a browser, some elements appear after the page has loaded, sometimes with a 'loading' animation or message. For example, we want to scrape all channels of Dota 2 from [http://www.twitch.tv/directory/game/Dota%202](http://www.twitch.tv/directory/game/Dota%202)

![twitch](../imgs/twitch.png)

But you may find nothing in the page. 

### Finding the request

As [AJAX] data is transferred in [HTTP], we can find the real request with the help of [Chrome Developer Tools](https://developer.chrome.com/devtools).

0. Open a new tab.
1. Use `Ctrl`+`Shift`+`I` (or `Cmd`+`Opt`+`I` on Mac) to open the DevTools.
2. Switch to Network panel.
3. Open the URL [http://www.twitch.tv/directory/game/Dota%202](http://www.twitch.tv/directory/game/Dota%202) in this tab.

While resources are being loaded, you will see a table of the requested resources.

![developer tools network](../imgs/developer-tools-network.png)

AJAX uses the [XMLHttpRequest](https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest) object to send and retrieve data, which is generally abbreviated as "XHR". Use the Filter (funnel icon) to show only the XHR requests. Glance over each request using the preview:

![find request](../imgs/search-for-request.png)

To determine which one is the key request, you can use a filter to reduce the number of requests, guess the purpose of each request from its path and parameters, then view the response contents for confirmation. Here we found the request: [http://api.twitch.tv/kraken/streams?limit=20&offset=0&game=Dota+2&broadcaster_language=&on_site=1](http://api.twitch.tv/kraken/streams?limit=20&offset=0&game=Dota+2&broadcaster_language=&on_site=1)

Now open the URL in a new tab; you will see [JSON] data containing the channel list. You can use the extension [JSONView](https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc) ([for Firefox](http://jsonview.com/)) to get a pretty-printed view of the JSON. The sample code below extracts the name, current title and viewers of each channel.

```
class Handler(BaseHandler):
    @every(minutes=10)
    def on_start(self):
        self.crawl('http://api.twitch.tv/kraken/streams?limit=20&offset=0&game=Dota+2&broadcaster_language=&on_site=1', callback=self.index_page)

    @config(age=10*60)
    def index_page(self, response):
        return [{
                "name": x['channel']['display_name'],
                "viewers": x['viewers'],
                "status": x['channel'].get('status'),
             } for x in response.json['streams']]
```

> * You can use `response.json` to convert the content to a Python `dict` object.
> * As the channel list changes frequently, we update it every 10 minutes and use [`@config(age=10*60)`](/apis/self.crawl/#configkwargs) to set the age. Otherwise, the request would be ignored because the scheduler thinks the content is new enough and refuses to update it.
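
The extraction step can be tried offline with the `json` module. The payload below is a made-up sample shaped like the fields the handler above reads; it is not real API output, and `extract_channels` is a hypothetical standalone version of the list comprehension.

```python
import json

# Hypothetical payload containing only the fields used by the handler above.
payload = '''
{
  "streams": [
    {"viewers": 1200, "channel": {"display_name": "alpha", "status": "Ranked"}},
    {"viewers": 35, "channel": {"display_name": "beta"}}
  ]
}
'''


def extract_channels(data):
    return [{
        "name": x["channel"]["display_name"],
        "viewers": x["viewers"],
        "status": x["channel"].get("status"),  # None when the key is missing
    } for x in data["streams"]]


channels = extract_channels(json.loads(payload))
print(channels)
```

Using `.get('status')` rather than `['status']` keeps the extraction from failing when a channel has no status field.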

Here is an online demo for twitch, together with an approach using [PhantomJS] which will be discussed in the next level: [http://demo.pyspider.org/debug/tutorial_twitch](http://demo.pyspider.org/debug/tutorial_twitch)

HTTP
----

[HTTP] is the protocol used to exchange or transfer hypertext. We already used it in the last article, where `self.crawl` and a URL fetched HTML content transferred via [HTTP].

When you get `403 Forbidden` or the site requires a login, you need to send the right HTTP request parameters.

A typical HTTP request message to [http://example.com/](http://example.com/) looks like:

```
GET / HTTP/1.1
Host: example.com
Connection: keep-alive
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.45 Safari/537.36
Referer: http://en.wikipedia.org/wiki/Example.com
Accept-Encoding: gzip, deflate, sdch
Accept-Language: zh-CN,zh;q=0.8
If-None-Match: "359670651"
If-Modified-Since: Fri, 09 Aug 2013 23:54:35 GMT
```

> * the first line contains [HTTP method](http://www.w3schools.com/tags/ref_httpmethods.asp), path and HTTP version
> * several lines of request header fields in `key: value` format.
> * if the request has a message body (say, a POST request), an empty line and the message body are appended to the end of the request message.
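
The three parts listed above can be picked apart with a few lines of Python. This is a teaching sketch, not a complete HTTP parser; `parse_request` and the sample request are made up for the example.

```python
def parse_request(raw):
    """Split a raw HTTP request into (method, path, version, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")  # the empty line separates the body
    lines = head.split("\r\n")
    method, path, version = lines[0].split(" ", 2)  # the request line
    headers = {}
    for line in lines[1:]:                          # "key: value" header fields
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return method, path, version, headers, body


raw = (
    "POST /login HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Content-Type: application/x-www-form-urlencoded\r\n"
    "Content-Length: 9\r\n"
    "\r\n"
    "a=123&b=c"
)
method, path, version, headers, body = parse_request(raw)
print(method, path, version, headers["Host"], body)
```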

You can see this in the Network panel of [Chrome Developer Tools](https://developer.chrome.com/devtools), which we used in the section above:

![request header](../imgs/request-headers.png)

In most cases, the only thing left to do is to copy the right URL + method + headers + body from the Network panel.

cURL command
------------

`self.crawl` supports a `cURL` command as its argument to make the HTTP request. It parses the arguments in the command and uses them as fetch parameters.

With `Copy as cURL` on a request, you can get a `cURL` command and paste it into `self.crawl(command)` to make crawling easy.

HTTP Method
-----------

[HTTP] defines methods to indicate the desired action to be performed on the identified resource. The two most commonly used methods are GET and POST. GET, which is what happens when you open a URL, requests the content of a specified resource. POST is used to submit data to the server.
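
The difference can be seen without sending anything by building the two request objects with the standard library. This sketch uses `urllib.request` purely for illustration; in a pyspider script the corresponding call is `self.crawl(url, method='POST', data={...})` as documented in the API reference.

```python
from urllib.parse import urlencode
from urllib.request import Request

# A GET request: the parameters travel in the URL.
get_request = Request("http://httpbin.org/get?" + urlencode({"q": "pyspider"}))

# A POST request: the form-encoded parameters travel in the body.
post_request = Request(
    "http://httpbin.org/post",
    data=urlencode({"a": 123, "b": "c"}).encode("ascii"),
)

# Nothing is sent here; the objects only describe the requests.
print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```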

TODO: need example here.

HTTP Headers
------------

[HTTP Headers](http://en.wikipedia.org/wiki/List_of_HTTP_header_fields) are a list of parameters of a request. Some headers you need to pay attention to while scraping:

### User-Agent

A [user agent string](http://en.wikipedia.org/wiki/User_agent_string) tells the server the application type, operating system, or software revision of the client that sent the HTTP request.

pyspider's default user agent string is: `pyspider/VERSION (+http://pyspider.org/)`

### Referer

[Referer](http://en.wikipedia.org/wiki/HTTP_referer) is the address of the previous webpage from which a link to the currently requested page was followed. Some websites check this on image resources to prevent deep linking.

TODO: need example here.

HTTP Cookie
-----------

[HTTP Cookie](http://en.wikipedia.org/wiki/HTTP_cookie) is a field in the HTTP headers used for tracking which user is making the request. It is generally used for user login and to prevent unauthorized requests.

You can use [`self.crawl(cookies={"key": value})`](/apis/self.crawl/#fetch) to set cookies via a dict-like API.
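
On the wire, such a dict ends up as a single `Cookie` request header. The sketch below shows that serialization; `cookie_header` is a hypothetical helper and the cookie names are made up.

```python
def cookie_header(cookies):
    # Cookies are sent as one "Cookie" header of "name=value" pairs
    # separated by "; ".
    return "; ".join("{}={}".format(name, value) for name, value in cookies.items())


header = cookie_header({"sessionid": "abc123", "lang": "en"})
print("Cookie: " + header)
```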

TODO: need example here.

[PhantomJS]:           http://phantomjs.org/
[AJAX]:          http://en.wikipedia.org/wiki/Ajax_%28programming%29
[JSON]:          http://en.wikipedia.org/wiki/JSON
[HTTP]:          http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol


================================================
FILE: docs/tutorial/HTML-and-CSS-Selector.md
================================================
Level 1: HTML and CSS Selector
==============================

In this tutorial, we will scrape information of movies and TV from [IMDb].

An online demo with completed code is: [http://demo.pyspider.org/debug/tutorial_imdb](http://demo.pyspider.org/debug/tutorial_imdb) .


Before Start
------------

You should have pyspider installed. You can refer to the documentation [QuickStart](Quickstart). Or test your code on [demo.pyspider.org](http://demo.pyspider.org).

Some basics you should know before scraping:

* The [Web][WWW] is a system of interlinked hypertext pages.
* Pages are identified on the Web via a uniform resource locator ([URL]).
* Pages are transferred via the Hypertext Transfer Protocol ([HTTP]).
* Web pages are structured using HyperText Markup Language ([HTML]).

To scrape information from the web, we need to:

1. Find the URLs of the pages that contain the information we want.
2. Fetch the pages via HTTP.
3. Extract the information from the HTML.
4. Find more URLs that contain what we want, and go back to step 2.
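
The four steps can be sketched as a loop over a tiny in-memory "web", so the example runs without network access. Everything here (`PAGES`, `PageParser`, `crawl`) is made up for illustration and is unrelated to how pyspider schedules and fetches tasks.

```python
from collections import deque
from html.parser import HTMLParser

# A tiny in-memory "web" standing in for real HTTP fetches.
PAGES = {
    "/list?page=1": '<a href="/movie/1">One</a> <a href="/list?page=2">Next</a>',
    "/list?page=2": '<a href="/movie/2">Two</a>',
    "/movie/1": "<title>One</title>",
    "/movie/2": "<title>Two</title>",
}


class PageParser(HTMLParser):
    """Collect link targets and the text of the <title> element."""

    def __init__(self):
        super().__init__()
        self.links, self.title, self._in_title = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)
        self._in_title = tag == "title"

    def handle_endtag(self, tag):
        self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = data


def crawl(start_url):
    queue, seen, results = deque([start_url]), {start_url}, {}
    while queue:
        url = queue.popleft()             # step 1: a URL we want
        parser = PageParser()
        parser.feed(PAGES[url])           # step 2: "fetch" the page
        if parser.title:
            results[url] = parser.title   # step 3: extract information
        for link in parser.links:         # step 4: find more URLs, repeat
            if link in PAGES and link not in seen:
                seen.add(link)
                queue.append(link)
    return results


results = crawl("/list?page=1")
print(results)
```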


Pick a start URL
----------------

As we want to get all of the movies on [IMDb], the first thing to do is find a list. A good list page may:

* contain links to as many [movies](http://www.imdb.com/title/tt0167260/) as possible.
* let you traverse all of the movies by following the next page.
* be sorted by last updated time, which is a great help for getting the latest movies.

By looking around at the index page of [IMDb], I found this:

![IMDb front page](../imgs/tutorial_imdb_front.png)

[http://www.imdb.com/search/title?count=100&title_type=feature,tv_series,tv_movie&ref_=nv_ch_mm_1](http://www.imdb.com/search/title?count=100&title_type=feature,tv_series,tv_movie&ref_=nv_ch_mm_1)

### Creating a project

You can find "Create" on the bottom right of the dashboard. Click it and name a project.

![Creating a project](../imgs/creating_a_project.png)

Change the crawl URL in the `on_start` callback:

```
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.imdb.com/search/title?count=100&title_type=feature,tv_series,tv_movie&ref_=nv_ch_mm_1', callback=self.index_page)
```

> * `self.crawl` fetches the page and calls the `callback` method to parse the response.  
> * The [`@every` decorator](http://docs.pyspider.org/en/latest/apis/@every/) makes `on_start` run once a day, so that no new movies are missed.

Click the green `run` button. You should see a red 1 above `follows`. Switch to the `follows` panel and click the green play button:

![Run one step](../imgs/run_one_step.png)

Index Page
----------

From the [index page](http://www.imdb.com/search/title?count=100&title_type=feature,tv_series,tv_movie&ref_=nv_ch_mm_1), we need to extract two things:

* links of the movies like `http://www.imdb.com/title/tt0167260/`
* links of [Next](http://www.imdb.com/search/title?count=100&ref_=nv_ch_mm_1&start=101&title_type=feature,tv_series,tv_movie) page

### Find Movies

As you can see, the sample handler has already extracted 1900+ links from the page. One way to pick out the movie pages is to filter the links with a regular expression:

```
import re
...

    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            if re.match(r"http://www\.imdb\.com/title/tt\d+/$", each.attr.href):
                self.crawl(each.attr.href, callback=self.detail_page)
```

> * `callback` is `self.detail_page` here to use another callback method to parse.
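
The filter can be tried on its own, outside pyspider. In the sketch below the pattern is written as a raw string with the dots escaped, so `.` only matches a literal dot. Apart from the movie link used in this tutorial, the sample URLs are made up.

```python
import re

# Raw string, with the dots escaped so that "." only matches a literal dot.
MOVIE_URL = re.compile(r"http://www\.imdb\.com/title/tt\d+/$")

links = [
    "http://www.imdb.com/title/tt0167260/",             # movie page
    "http://www.imdb.com/title/tt0167260/fullcredits",  # sub-page, rejected by "$"
    "http://www.imdb.com/name/nm0000001/",              # not a title page
]

movie_links = [link for link in links if MOVIE_URL.match(link)]
print(movie_links)
```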

Remember that you can always use the power of Python, or anything else you are familiar with, to extract information. However, using tools like CSS selectors is recommended.

### Next page

#### CSS Selectors

CSS selectors are patterns used by [CSS] to select the HTML elements to be styled. As elements containing information may have different styles in the document, it's appropriate to use CSS selectors to select the elements we want. More information about CSS selectors can be found in the links below:

* [CSS Selectors](http://www.w3schools.com/css/css_selectors.asp)
* [CSS Selector Reference](http://www.w3schools.com/cssref/css_selectors.asp)

You can use CSS selectors with the built-in `response.doc` object, which is provided by [PyQuery]; you can find the full reference there.

#### CSS Selector Helper

pyspider provides a tool called `CSS selector helper` to make it easier to generate a selector pattern for the element you click. Enable the CSS selector helper by clicking the button and switching to the `web` panel.

![CSS Selector helper](../imgs/css_selector_helper.png)

The element is highlighted in yellow when you hover over it. When you click it, a pre-selected CSS selector pattern is shown on the bar above. You can edit the features used to locate the element and add the pattern to your source code.

Click "Next »" in the page and add the selector pattern to your code:

```
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            if re.match(r"http://www\.imdb\.com/title/tt\d+/$", each.attr.href):
                self.crawl(each.attr.href, callback=self.detail_page)
        self.crawl(response.doc('#right a').attr.href, callback=self.index_page)
```

Click `run` again and move to the next page. We find that "« Prev" has the same selector pattern as "Next »". With the code above, pyspider may select the link of "« Prev" instead of "Next »". One solution is to select both of them:

```
        self.crawl([x.attr.href for x in response.doc('#right a').items()], callback=self.index_page)
```

Extracting Information
----------------------

Click `run` again and follow to detail page.

Add the keys you need to the result dict and collect the values using the `CSS selector helper` repeatedly:

```
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('.header > [itemprop="name"]').text(),
            "rating": response.doc('.star-box-giga-star').text(),
            "director": [x.text() for x in response.doc('[itemprop="director"] span').items()],
        }
```

Note that the `CSS selector helper` may not always work. You can write the selector pattern manually with tools like [Chrome Dev Tools](https://developer.chrome.com/devtools):

![inspect element](../imgs/inspect_element.png)

You don't need to write every ancestor element in the selector pattern; only the elements that distinguish the target from unwanted elements are needed. However, it takes experience in scraping or web development to know which attributes are important and can be used as locators. You can also test a CSS selector in the JavaScript console by using `$$`, like `$$('[itemprop="director"] span')`.

Running
-------

1. After testing your code, don't forget to save it.
2. Go back to the dashboard and find your project.
3. Change the `status` to `DEBUG` or `RUNNING`.
4. Press the `run` button.

![index demo](../imgs/index_page.png)

Notes
-----

The script is just a simple example; you may find more issues when scraping IMDb:

* The `ref_` parameter in the list page URL is for tracking users; it's better to remove it.
* IMDb does not serve more than 100000 results for any query, so you need to find more lists with fewer results, like [this](http://www.imdb.com/search/title?genres=action&title_type=feature&sort=moviemeter,asc)
* You may need a list sorted by last updated time and update it with a shorter interval.
* Some attributes are hard to extract; you may need to write the selector pattern by hand, or use [XPATH](http://www.w3schools.com/xpath/xpath_syntax.asp) and/or some Python code to extract the information.
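
As a sketch of the first note, the tracking parameter can be dropped with the standard library before the URL is passed to `self.crawl`. `strip_params` is a hypothetical helper, and treating `ref_` as the only parameter to remove is an assumption based on the list URL used in this tutorial.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_params(url, names=("ref_",)):
    """Return url with the given query parameters removed."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k not in names]
    # safe="," keeps the commas in title_type readable instead of %2C.
    return urlunsplit(parts._replace(query=urlencode(query, safe=",")))


url = ("http://www.imdb.com/search/title"
       "?count=100&title_type=feature,tv_series,tv_movie&ref_=nv_ch_mm_1")
print(strip_params(url))
```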

[IMDb]:          http://www.imdb.com/
[WWW]:           http://en.wikipedia.org/wiki/World_Wide_Web
[HTTP]:          http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
[HTML]:          http://en.wikipedia.org/wiki/HTML
[URL]:           http://en.wikipedia.org/wiki/Uniform_resource_locator
[CSS]:           https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Getting_Started/What_is_CSS
[PyQuery]:       https://pythonhosted.org/pyquery/


================================================
FILE: docs/tutorial/Render-with-PhantomJS.md
================================================
Level 3: Render with PhantomJS
==============================

Sometimes a web page is too complex for you to find the underlying API request. It's time to meet the power of [PhantomJS].

To use PhantomJS, you should have PhantomJS [installed](http://phantomjs.org/download.html). If you are running pyspider in `all` mode, PhantomJS is enabled if the executable is in the `PATH`.

Make sure phantomjs is working by running
```
$ pyspider phantomjs
```

Continue with the rest of the tutorial if the output is
```
Web server running on port 25555
```

Use PhantomJS
-------------

When pyspider is connected to PhantomJS, you can enable this feature by adding the parameter `fetch_type='js'` to `self.crawl`. We use PhantomJS to scrape the channel list of [http://www.twitch.tv/directory/game/Dota%202](http://www.twitch.tv/directory/game/Dota%202), which is loaded with the AJAX technique we discussed in [Level 2](tutorial/AJAX-and-more-HTTP#ajax):

```
class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://www.twitch.tv/directory/game/Dota%202',
                   fetch_type='js', callback=self.index_page)
             
    def index_page(self, response):
        return {
            "url": response.url,
            "channels": [{
                "title": x('.title').text(),
                "viewers": x('.info').contents()[2],
                "name": x('.info a').text(),
            } for x in response.doc('.stream.item').items()]
        }
```
> I used some PyQuery APIs to handle the list of streams. You can find the complete API reference at [PyQuery complete API](https://pythonhosted.org/pyquery/api.html)

Running JavaScript on Page
--------------------------

We will try to scrape images from [http://www.pinterest.com/categories/popular/](http://www.pinterest.com/categories/popular/) in this section. Only 25 images are shown at the beginning; more images are loaded when you scroll to the bottom of the page.

To scrape as many images as possible, we can use the [`js_script` parameter](/apis/self.crawl/#enable-javascript-fetcher-need-support-by-fetcher) to pass function-wrapped JavaScript code that simulates the scroll action:

```
class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://www.pinterest.com/categories/popular/',
                   fetch_type='js', js_script="""
                   function() {
                       window.scrollTo(0,document.body.scrollHeight);
                   }
                   """, callback=self.index_page)

    def index_page(self, response):
        return {
            "url": response.url,
            "images": [{
                "title": x('.richPinGridTitle').text(),
                "img": x('.pinImg').attr('src'),
                "author": x('.creditName').text(),
            } for x in response.doc('.item').items() if x('.pinImg')]
        }
```

> * The script is executed after the page has loaded (this can be changed via the [`js_run_at` parameter](/apis/self.crawl/#enable-javascript-fetcher-need-support-by-fetcher)).
> * We scroll once after the page has loaded. You can scroll multiple times using [`setTimeout`](https://developer.mozilla.org/en-US/docs/Web/API/WindowTimers.setTimeout). PhantomJS will fetch as many items as possible before the timeout is reached.
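
A sketch of the multiple-scroll idea, with the JavaScript kept in a Python string as in the handler above. The number of scrolls and the one-second delay are arbitrary values chosen for illustration, and `SCROLL_SCRIPT` is a hypothetical name.

```python
# JavaScript kept in a Python string, as expected by the js_script parameter.
# The number of scrolls and the delay are arbitrary values for illustration.
SCROLL_SCRIPT = """
function() {
    var remaining = 5;
    function scrollDown() {
        window.scrollTo(0, document.body.scrollHeight);
        remaining -= 1;
        if (remaining > 0) {
            setTimeout(scrollDown, 1000);
        }
    }
    scrollDown();
}
"""

print(SCROLL_SCRIPT)
```

It would be passed as `js_script=SCROLL_SCRIPT` together with `fetch_type='js'`. The fetch timeout still limits how many of the scheduled scrolls actually complete.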

Online demo: [http://demo.pyspider.org/debug/tutorial_pinterest](http://demo.pyspider.org/debug/tutorial_pinterest)



[PhantomJS]:           http://phantomjs.org/


================================================
FILE: docs/tutorial/index.md
================================================
pyspider Tutorial
=================

> The best way to learn how to scrape is to learn by building a scraper.

* [Level 1: HTML and CSS Selector](HTML-and-CSS-Selector)
* [Level 2: AJAX and More HTTP](AJAX-and-more-HTTP)
* [Level 3: Render with PhantomJS](Render-with-PhantomJS)

If you have problems using pyspider, the [user group](https://groups.google.com/group/pyspider-users) is the place to discuss them.


================================================
FILE: mkdocs.yml
================================================
site_name: pyspider
site_description: A Powerful Spider(Web Crawler) System in Python.
site_author: binux
repo_url: https://github.com/binux/pyspider
pages:
- Introduction: index.md
- Quickstart: Quickstart.md
- Command Line: Command-Line.md
- Tutorial:
  - Index: tutorial/index.md
  - 'Level 1: HTML and CSS Selector': tutorial/HTML-and-CSS-Selector.md
  - 'Level 2: AJAX and More HTTP': tutorial/AJAX-and-more-HTTP.md
  - 'Level 3: Render with PhantomJS': tutorial/Render-with-PhantomJS.md
- About pyspider:
  - Architecture: Architecture.md
  - About Tasks: About-Tasks.md
  - About Projects: About-Projects.md
  - Script Environment: Script-Environment.md
  - Working with Results: Working-with-Results.md
- API Reference:
  - Index: apis/index.md
  - self.crawl: apis/self.crawl.md
  - Response: apis/Response.md
  - self.send_message: apis/self.send_message.md
  - '@catch_status_code_error': apis/@catch_status_code_error.md
  - '@every': apis/@every.md
- Deployment: Deployment.md
- Running pyspider with Docker: Running-pyspider-with-Docker.md
- Deployment of demo.pyspider.org: Deployment-demo.pyspider.org.md
- Frequently Asked Questions: Frequently-Asked-Questions.md

theme: readthedocs
markdown_extensions: ['toc(permalink=true)', ]


================================================
FILE: pyspider/__init__.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<i@binux.me>
#         http://binux.me
# Created on 2014-11-17 19:17:12

__version__ = '0.4.0'


================================================
FILE: pyspider/database/__init__.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<i@binux.me>
#         http://binux.me
# Created on 2014-10-08 15:04:08

import os, requests, json
from six.moves.urllib.parse import urlparse, parse_qs


def connect_database(url):
    """
    create database object by url

    mysql:
        mysql+type://user:passwd@host:port/database
    sqlite:
        # relative path
        sqlite+type:///path/to/database.db
        # absolute path
        sqlite+type:////path/to/database.db
        # memory database
        sqlite+type://
    mongodb:
        mongodb+type://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:portN]]][/[database][?options]]
        more: http://docs.mongodb.org/manual/reference/connection-string/
    sqlalchemy:
        sqlalchemy+postgresql+type://user:passwd@host:port/database
        sqlalchemy+mysql+mysqlconnector+type://user:passwd@host:port/database
        more: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html
    redis:
        redis+taskdb://host:port/db
    elasticsearch:
        elasticsearch+type://host:port/?index=pyspider
    couchdb:
        couchdb+type://[username:password@]host[:port]
    local:
        local+projectdb://filepath,filepath

    type:
        taskdb
        projectdb
        resultdb

    """
    db = _connect_database(url)
    db.copy = lambda: _connect_database(url)
    return db


def _connect_database(url):  # NOQA
    parsed = urlparse(url)

    scheme = parsed.scheme.split('+')
    if len(scheme) == 1:
        raise Exception('wrong scheme format: %s' % parsed.scheme)
    else:
        engine, dbtype = scheme[0], scheme[-1]
        other_scheme = "+".join(scheme[1:-1])

    if dbtype not in ('taskdb', 'projectdb', 'resultdb'):
        raise LookupError('unknown database type: %s, '
                          'type should be one of ["taskdb", "projectdb", "resultdb"]' % dbtype)

    if engine == 'mysql':
        return _connect_mysql(parsed, dbtype)
    elif engine == 'sqlite':
        return _connect_sqlite(parsed, dbtype)
    elif engine == 'mongodb':
        return _connect_mongodb(parsed, dbtype, url)
    elif engine == 'sqlalchemy':
        return _connect_sqlalchemy(parsed, dbtype, url, other_scheme)
    elif engine == 'redis':
        if dbtype == 'taskdb':
            from .redis.taskdb import TaskDB
            return TaskDB(parsed.hostname, parsed.port,
                          int(parsed.path.strip('/') or 0))
        else:
            raise LookupError('not supported dbtype: %s' % dbtype)
    elif engine == 'local':
        scripts = url.split('//', 1)[1].split(',')
        if dbtype == 'projectdb':
            from .local.projectdb import ProjectDB
            return ProjectDB(scripts)
        else:
            raise LookupError('not supported dbtype: %s' % dbtype)
    elif engine == 'elasticsearch' or engine == 'es':
        return _connect_elasticsearch(parsed, dbtype)

    elif engine == 'couchdb':
        return _connect_couchdb(parsed, dbtype, url)

    else:
        raise Exception('unknown engine: %s' % engine)


def _connect_mysql(parsed, dbtype):
    parames = {}
    if parsed.username:
        parames['user'] = parsed.username
    if parsed.password:
        parames['passwd'] = parsed.password
    if parsed.hostname:
        parames['host'] = parsed.hostname
    if parsed.port:
        parames['port'] = parsed.port
    if parsed.path.strip('/'):
        parames['database'] = parsed.path.strip('/')

    if dbtype == 'taskdb':
        from .mysql.taskdb import TaskDB
        return TaskDB(**parames)
    elif dbtype == 'projectdb':
        from .mysql.projectdb import ProjectDB
        return ProjectDB(**parames)
    elif dbtype == 'resultdb':
        from .mysql.resultdb import ResultDB
        return ResultDB(**parames)
    else:
        raise LookupError


def _connect_sqlite(parsed, dbtype):
    if parsed.path.startswith('//'):
        path = '/' + parsed.path.strip('/')
    elif parsed.path.startswith('/'):
        path = './' + parsed.path.strip('/')
    elif not parsed.path:
        path = ':memory:'
    else:
        raise Exception('error path: %s' % parsed.path)

    if dbtype == 'taskdb':
        from .sqlite.taskdb import TaskDB
        return TaskDB(path)
    elif dbtype == 'projectdb':
        from .sqlite.projectdb import ProjectDB
        return ProjectDB(path)
    elif dbtype == 'resultdb':
        from .sqlite.resultdb import ResultDB
        return ResultDB(path)
    else:
        raise LookupError


def _connect_mongodb(parsed, dbtype, url):
    url = url.replace(parsed.scheme, 'mongodb')
    parames = {}
    if parsed.path.strip('/'):
        parames['database'] = parsed.path.strip('/')

    if dbtype == 'taskdb':
        from .mongodb.taskdb import TaskDB
        return TaskDB(url, **parames)
    elif dbtype == 'projectdb':
        from .mongodb.projectdb import ProjectDB
        return ProjectDB(url, **parames)
    elif dbtype == 'resultdb':
        from .mongodb.resultdb import ResultDB
        return ResultDB(url, **parames)
    else:
        raise LookupError


def _connect_sqlalchemy(parsed, dbtype, url, other_scheme):
    if not other_scheme:
        raise Exception('wrong scheme format: %s' % parsed.scheme)
    url = url.replace(parsed.scheme, other_scheme)
    if dbtype == 'taskdb':
        from .sqlalchemy.taskdb import TaskDB
        return TaskDB(url)
    elif dbtype == 'projectdb':
        from .sqlalchemy.projectdb import ProjectDB
        return ProjectDB(url)
    elif dbtype == 'resultdb':
        from .sqlalchemy.resultdb import ResultDB
        return ResultDB(url)
    else:
        raise LookupError


def _connect_elasticsearch(parsed, dbtype):
    # in python 2.6, for a url like "http://host/?query", the query is not split out
    if parsed.path.startswith('/?'):
        index = parse_qs(parsed.path[2:])
    else:
        index = parse_qs(parsed.query)
    if 'index' in index and index['index']:
        index = index['index'][0]
    else:
        index = 'pyspider'

    if dbtype == 'projectdb':
        from .elasticsearch.projectdb import ProjectDB
        return ProjectDB([parsed.netloc], index=index)
    elif dbtype == 'resultdb':
        from .elasticsearch.resultdb import ResultDB
        return ResultDB([parsed.netloc], index=index)
    elif dbtype == 'taskdb':
        from .elasticsearch.taskdb import TaskDB
        return TaskDB([parsed.netloc], index=index)
    else:
        raise LookupError


def _connect_couchdb(parsed, dbtype, url):
    if os.environ.get('COUCHDB_HTTPS'):
        url = "https://" + parsed.netloc + "/"
    else:
        url = "http://" + parsed.netloc + "/"
    params = {}

    # default to env, then url, then hard coded
    params['username'] = os.environ.get('COUCHDB_USER') or parsed.username
    params['password'] = os.environ.get('COUCHDB_PASSWORD') or parsed.password

    if dbtype == 'taskdb':
        from .couchdb.taskdb import TaskDB
        return TaskDB(url, **params)
    elif dbtype == 'projectdb':
        from .couchdb.projectdb import ProjectDB
        return ProjectDB(url, **params)
    elif dbtype == 'resultdb':
        from .couchdb.resultdb import ResultDB
        return ResultDB(url, **params)
    else:
        raise LookupError


================================================
FILE: pyspider/database/base/__init__.py
================================================


================================================
FILE: pyspider/database/base/projectdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<i@binux.me>
#         http://binux.me
# Created on 2014-02-09 11:28:52

import re

# NOTE: When get/get_all/check_update from database with default fields,
#       all following fields should be included in output dict.
{
    'project': {
        'name': str,
        'group': str,
        'status': str,
        'script': str,
        # 'config': str,
        'comments': str,
        # 'priority': int,
        'rate': int,
        'burst': int,
        'updatetime': int,
    }
}


class ProjectDB(object):
    status_str = [
        'TODO',
        'STOP',
        'CHECKING',
        'DEBUG',
        'RUNNING',
    ]

    def insert(self, name, obj={}):
        raise NotImplementedError

    def update(self, name, obj={}, **kwargs):
        raise NotImplementedError

    def get_all(self, fields=None):
        raise NotImplementedError

    def get(self, name, fields):
        raise NotImplementedError

    def drop(self, name):
        raise NotImplementedError

    def check_update(self, timestamp, fields=None):
        raise NotImplementedError

    def split_group(self, group, lower=True):
        if lower:
            return re.split(r"\W+", (group or '').lower())
        else:
            return re.split(r"\W+", group or '')

    def verify_project_name(self, name):
        if len(name) > 64:
            return False
        if re.search(r"[^\w]", name):
            return False
        return True

    def copy(self):
        '''
        database should be able to copy itself to create new connection

        it's implemented automatically by pyspider.database.connect_database
        if you do not create the database connection via the connect_database
        method, you should implement this yourself
        '''
        raise NotImplementedError


================================================
FILE: pyspider/database/base/resultdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<i@binux.me>
#         http://binux.me
# Created on 2014-10-11 18:40:03

# result schema
{
    'result': {
        'taskid': str,  # new, not changeable
        'project': str,  # new, not changeable
        'url': str,  # new, not changeable
        'result': str,  # json string
        'updatetime': int,
    }
}


class ResultDB(object):
    """
    database for result
    """
    projects = set()  # projects in resultdb

    def save(self, project, taskid, url, result):
        raise NotImplementedError

    def select(self, project, fields=None, offset=0, limit=None):
        raise NotImplementedError

    def count(self, project):
        raise NotImplementedError

    def get(self, project, taskid, fields=None):
        raise NotImplementedError

    def drop(self, project):
        raise NotImplementedError

    def copy(self):
        '''
        database should be able to copy itself to create new connection

        it's implemented automatically by pyspider.database.connect_database
        if you do not create the database connection via the connect_database
        method, you should implement this yourself
        '''
        raise NotImplementedError


================================================
FILE: pyspider/database/base/taskdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<i@binux.me>
#         http://binux.me
# Created on 2014-02-08 10:28:48

# task schema
{
    'task': {
        'taskid': str,  # new, not change
        'project': str,  # new, not change
        'url': str,  # new, not change
        'status': int,  # change
        'schedule': {
            'priority': int,
            'retries': int,
            'retried': int,
            'exetime': int,
            'age': int,
            'itag': str,
            # 'recrawl': int
        },  # new and restart
        'fetch': {
            'method': str,
            'headers': dict,
            'data': str,
            'timeout': int,
            'save': dict,
        },  # new and restart
        'process': {
            'callback': str,
        },  # new and restart
        'track': {
            'fetch': {
                'ok': bool,
                'time': int,
                'status_code': int,
                'headers': dict,
                'encoding': str,
                'content': str,
            },
            'process': {
                'ok': bool,
                'time': int,
                'follows': int,
                'outputs': int,
                'logs': str,
                'exception': str,
            },
            'save': object,  # jsonable object saved by processor
        },  # finish
        'lastcrawltime': int,  # keep between request
        'updatetime': int,  # keep between request
    }
}


class TaskDB(object):
    ACTIVE = 1
    SUCCESS = 2
    FAILED = 3
    BAD = 4

    projects = set()  # projects in taskdb

    def load_tasks(self, status, project=None, fields=None):
        raise NotImplementedError

    def get_task(self, project, taskid, fields=None):
        raise NotImplementedError

    def status_count(self, project):
        '''
        return a dict
        '''
        raise NotImplementedError

    def insert(self, project, taskid, obj={}):
        raise NotImplementedError

    def update(self, project, taskid, obj={}, **kwargs):
        raise NotImplementedError

    def drop(self, project):
        raise NotImplementedError

    @staticmethod
    def status_to_string(status):
        return {
            1: 'ACTIVE',
            2: 'SUCCESS',
            3: 'FAILED',
            4: 'BAD',
        }.get(status, 'UNKNOWN')

    @staticmethod
    def status_to_int(status):
        return {
            'ACTIVE': 1,
            'SUCCESS': 2,
            'FAILED': 3,
            'BAD': 4,
        }.get(status, 4)

    def copy(self):
        '''
        database should be able to copy itself to create new connection

        it's implemented automatically by pyspider.database.connect_database
        if you do not create the database connection via the connect_database
        method, you should implement this yourself
        '''
        raise NotImplementedError


================================================
FILE: pyspider/database/basedb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<i@binux.com>
#         http://binux.me
# Created on 2012-08-30 17:43:49

from __future__ import unicode_literals, division, absolute_import

import logging
logger = logging.getLogger('database.basedb')

from six import itervalues
from pyspider.libs import utils


class BaseDB:

    '''
    BaseDB

    dbcur should be overridden
    '''
    __tablename__ = None
    placeholder = '%s'
    maxlimit = -1

    @staticmethod
    def escape(string):
        return '`%s`' % string

    @property
    def dbcur(self):
        raise NotImplementedError

    def _execute(self, sql_query, values=[]):
        dbcur = self.dbcur
        dbcur.execute(sql_query, values)
        return dbcur

    def _select(self, tablename=None, what="*", where="", where_values=[], offset=0, limit=None):
        tablename = self.escape(tablename or self.__tablename__)
        if isinstance(what, list) or isinstance(what, tuple) or what is None:
            what = ','.join(self.escape(f) for f in what) if what else '*'

        sql_query = "SELECT %s FROM %s" % (what, tablename)
        if where:
            sql_query += " WHERE %s" % where
        if limit:
            sql_query += " LIMIT %d, %d" % (offset, limit)
        elif offset:
            sql_query += " LIMIT %d, %d" % (offset, self.maxlimit)
        logger.debug("<sql: %s>", sql_query)

        for row in self._execute(sql_query, where_values):
            yield row

    def _select2dic(self, tablename=None, what="*", where="", where_values=[],
                    order=None, offset=0, limit=None):
        tablename = self.escape(tablename or self.__tablename__)
        if isinstance(what, list) or isinstance(what, tuple) or what is None:
            what = ','.join(self.escape(f) for f in what) if what else '*'

        sql_query = "SELECT %s FROM %s" % (what, tablename)
        if where:
            sql_query += " WHERE %s" % where
        if order:
            sql_query += ' ORDER BY %s' % order
        if limit:
            sql_query += " LIMIT %d, %d" % (offset, limit)
        elif offset:
            sql_query += " LIMIT %d, %d" % (offset, self.maxlimit)
        logger.debug("<sql: %s>", sql_query)

        dbcur = self._execute(sql_query, where_values)

        # f[0] may return bytes type
        # https://github.com/mysql/mysql-connector-python/pull/37
        fields = [utils.text(f[0]) for f in dbcur.description]

        for row in dbcur:
            yield dict(zip(fields, row))

    def _replace(self, tablename=None, **values):
        tablename = self.escape(tablename or self.__tablename__)
        if values:
            _keys = ", ".join(self.escape(k) for k in values)
            _values = ", ".join([self.placeholder, ] * len(values))
            sql_query = "REPLACE INTO %s (%s) VALUES (%s)" % (tablename, _keys, _values)
        else:
            sql_query = "REPLACE INTO %s DEFAULT VALUES" % tablename
        logger.debug("<sql: %s>", sql_query)

        if values:
            dbcur = self._execute(sql_query, list(itervalues(values)))
        else:
            dbcur = self._execute(sql_query)
        return dbcur.lastrowid

    def _insert(self, tablename=None, **values):
        tablename = self.escape(tablename or self.__tablename__)
        if values:
            _keys = ", ".join((self.escape(k) for k in values))
            _values = ", ".join([self.placeholder, ] * len(values))
            sql_query = "INSERT INTO %s (%s) VALUES (%s)" % (tablename, _keys, _values)
        else:
            sql_query = "INSERT INTO %s DEFAULT VALUES" % tablename
        logger.debug("<sql: %s>", sql_query)

        if values:
            dbcur = self._execute(sql_query, list(itervalues(values)))
        else:
            dbcur = self._execute(sql_query)
        return dbcur.lastrowid

    def _update(self, tablename=None, where="1=0", where_values=[], **values):
        tablename = self.escape(tablename or self.__tablename__)
        _key_values = ", ".join([
            "%s = %s" % (self.escape(k), self.placeholder) for k in values
        ])
        sql_query = "UPDATE %s SET %s WHERE %s" % (tablename, _key_values, where)
        logger.debug("<sql: %s>", sql_query)

        return self._execute(sql_query, list(itervalues(values)) + list(where_values))

    def _delete(self, tablename=None, where="1=0", where_values=[]):
        tablename = self.escape(tablename or self.__tablename__)
        sql_query = "DELETE FROM %s" % tablename
        if where:
            sql_query += " WHERE %s" % where
        logger.debug("<sql: %s>", sql_query)

        return self._execute(sql_query, where_values)

if __name__ == "__main__":
    import sqlite3

    class DB(BaseDB):
        __tablename__ = "test"
        placeholder = "?"

        def __init__(self):
            self.conn = sqlite3.connect(":memory:")
            cursor = self.conn.cursor()
            cursor.execute(
                '''CREATE TABLE `%s` (id INTEGER PRIMARY KEY AUTOINCREMENT, name, age)'''
                % self.__tablename__
            )

        @property
        def dbcur(self):
            return self.conn.cursor()

    db = DB()
    assert db._insert(db.__tablename__, name="binux", age=23) == 1
    # Use the next() builtin: generator objects have no .next() method on Python 3.
    assert next(db._select(db.__tablename__, "name, age")) == ("binux", 23)
    assert next(db._select2dic(db.__tablename__, "name, age"))["name"] == "binux"
    assert next(db._select2dic(db.__tablename__, "name, age"))["age"] == 23
    db._replace(db.__tablename__, id=1, age=24)
    assert next(db._select(db.__tablename__, "name, age")) == (None, 24)
    db._update(db.__tablename__, "id = 1", age=16)
    assert next(db._select(db.__tablename__, "name, age")) == (None, 16)
    db._delete(db.__tablename__, "id = 1")
    assert [row for row in db._select(db.__tablename__)] == []
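

# Illustrative sketch (not part of the original module).
# A minimal, standalone example of the parameterised INSERT that `_insert`
# builds: identifiers are escaped and interpolated into the SQL text, while
# the values travel separately as bound parameters. `build_insert` is a
# hypothetical helper; the backtick escaping and the "?" placeholder are
# assumptions chosen to match the sqlite3 test class above.
import sqlite3


def build_insert(tablename, placeholder="?", **values):
    """Return (sql, params) for a parameterised INSERT statement."""
    def escape(name):
        # Assumed escaping style; real backends may quote identifiers differently.
        return "`%s`" % name

    if not values:
        return "INSERT INTO %s DEFAULT VALUES" % escape(tablename), []
    keys = ", ".join(escape(k) for k in values)
    marks = ", ".join([placeholder] * len(values))
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (escape(tablename), keys, marks)
    # Keyword arguments keep their call order on Python 3.6+, so params line
    # up with the column list built above.
    return sql, list(values.values())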


================================================
FILE: pyspider/database/couchdb/__init__.py
================================================


================================================
FILE: pyspider/database/couchdb/couchdbbase.py
================================================
import time

import requests
from requests.auth import HTTPBasicAuth

class SplitTableMixin(object):
    UPDATE_PROJECTS_TIME = 10 * 60

    def __init__(self):
        self.session = requests.session()
        if self.username:
            self.session.auth = HTTPBasicAuth(self.username, self.password)
        self.session.headers.update({'Content-Type': 'application/json'})

    def _collection_name(self, project):
        if self.collection_prefix:
            return "%s_%s" % (self.collection_prefix, project)
        else:
            return project


    @property
    def projects(self):
        if time.time() - getattr(self, '_last_update_projects', 0) > self.UPDATE_PROJECTS_TIME:
            self._list_project()
        return self._projects


    @projects.setter
    def projects(self, value):
        self._projects = value


    def _list_project(self):
        self._last_update_projects = time.time()
        self.projects = set()
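        if self.collection_prefix:
            # Same "_" separator that _collection_name() uses.
            prefix = "%s_" % self.collection_prefix
        else:
            prefix = ''
        # Per-project databases are named "<database>_<prefix><project>".
        db_prefix = self.database + "_" + prefix

        url = self.base_url + "_all_dbs"
        res = self.session.get(url, json={}).json()
        for each in res:
            if each.startswith('_'):
                continue
            # Match the full prefix so the base database itself is not
            # reported as a project with an empty name.
            if each.startswith(db_prefix):
                self.projects.add(each[len(db_prefix):])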


    def create_database(self, name):
        url = self.base_url + name
        res = self.session.put(url).json()
        if 'error' in res and res['error'] == 'unauthorized':
            # Never echo the password back in an error message or log.
            raise Exception("Supplied credentials are incorrect. Reason: {} for User: {}".format(
                res['reason'], self.username))
        return res


    def get_doc(self, db_name, doc_id):
        url = self.base_url + db_name + "/" + doc_id
        res = self.session.get(url).json()
        if "error" in res and res["error"] == "not_found":
            return None
        return res


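    def get_docs(self, db_name, selector):
        url = self.base_url + db_name + "/_find"
        # Work on a copy so the caller's query dict is not mutated, and only
        # pass use_index once an index has actually been created.
        payload = dict(selector)
        if getattr(self, 'index', None):
            payload['use_index'] = self.index
        res = self.session.post(url, json=payload).json()
        if 'error' in res and res['error'] == 'not_found':
            return []
        return res['docs']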


    def get_all_docs(self, db_name):
        return self.get_docs(db_name, {"selector": {}})


    def insert_doc(self, db_name, doc_id, doc):
        url = self.base_url + db_name + "/" + doc_id
        return self.session.put(url, json=doc).json()


    def update_doc(self, db_name, doc_id, new_doc):
        doc = self.get_doc(db_name, doc_id)
        if doc is None:
            return self.insert_doc(db_name, doc_id, new_doc)
        for key in new_doc:
            doc[key] = new_doc[key]
        url = self.base_url + db_name + "/" + doc_id
        return self.session.put(url, json=doc).json()


    def delete(self, url):
        return self.session.delete(url).json()



================================================
FILE: pyspider/database/couchdb/projectdb.py
================================================
import time

import requests
from requests.auth import HTTPBasicAuth
from pyspider.database.base.projectdb import ProjectDB as BaseProjectDB


class ProjectDB(BaseProjectDB):
    __collection_name__ = 'projectdb'

    def __init__(self, url, database='projectdb', username=None, password=None):
        self.username = username
        self.password = password
        self.url = url + self.__collection_name__ + "_" + database + "/"
        self.database = database

        self.session = requests.session()
        if username:
            self.session.auth = HTTPBasicAuth(self.username, self.password)
        self.session.headers.update({'Content-Type': 'application/json'})

        # Create the db
        res = self.session.put(self.url).json()
        if 'error' in res and res['error'] == 'unauthorized':
            # Never echo the password back in an error message or log.
            raise Exception(
                "Supplied credentials are incorrect. Reason: {} for User: {}".format(
                    res['reason'], self.username))
        # create index
        payload = {
            'index': {
                'fields': ['name']
            },
            'name': self.__collection_name__ + "_" + database
        }
        res = self.session.post(self.url + "_index", json=payload).json()
        self.index = res['id']

    def _default_fields(self, each):
        if each is None:
            return each
        each.setdefault('group', None)
        each.setdefault('status', 'TODO')
        each.setdefault('script', '')
        each.setdefault('comments', None)
        each.setdefault('rate', 0)
        each.setdefault('burst', 0)
        each.setdefault('updatetime', 0)
        return each

    def insert(self, name, obj={}):
        url = self.url + name
        obj = dict(obj)
        obj['name'] = name
        obj['updatetime'] = time.time()
        res = self.session.put(url, json=obj).json()
        return res

    def update(self, name, obj={}, **kwargs):
        # object contains the fields to update and their new values
        update = self.get(name) # update will contain _rev
        if update is None:
            return None
        obj = dict(obj)
        obj['updatetime'] = time.time()
        obj.update(kwargs)
        for key in obj:
            update[key] = obj[key]
        return self.insert(name, update)

    def get_all(self, fields=None):
        if fields is None:
            fields = []
        payload = {
            "selector": {},
            "fields": fields,
            "use_index": self.index
        }
        url = self.url + "_find"
        res = self.session.post(url, json=payload).json()
        for doc in res['docs']:
            yield self._default_fields(doc)

    def get(self, name, fields=None):
        if fields is None:
            fields = []
        payload = {
            "selector": {"name": name},
            "fields": fields,
            "limit": 1,
            "use_index": self.index
        }
        url = self.url + "_find"
        res = self.session.post(url, json=payload).json()
        if len(res['docs']) == 0:
            return None
        return self._default_fields(res['docs'][0])

    def check_update(self, timestamp, fields=None):
        if fields is None:
            fields = []
        for project in self.get_all(fields=('updatetime', 'name')):
            if project['updatetime'] > timestamp:
                project = self.get(project['name'], fields)
                yield self._default_fields(project)

    def drop(self, name):
        doc = self.get(name)
        if doc is None:
            # Unknown project: nothing to delete, and doc["_rev"] would raise.
            return None
        payload = {"rev": doc["_rev"]}
        url = self.url + name
        return self.session.delete(url, params=payload).json()

    def drop_database(self):
        return self.session.delete(self.url).json()


================================================
FILE: pyspider/database/couchdb/resultdb.py
================================================
import time
from pyspider.database.base.resultdb import ResultDB as BaseResultDB
from .couchdbbase import SplitTableMixin


class ResultDB(SplitTableMixin, BaseResultDB):
    collection_prefix = ''

    def __init__(self, url, database='resultdb', username=None, password=None):
        self.username = username
        self.password = password
        self.base_url = url
        self.url = url + database + "/"
        self.database = database

        super().__init__()
        self.create_database(database)
        self.index = None

    def _get_collection_name(self, project):
        return self.database + "_" + self._collection_name(project)

    def _create_project(self, project):
        collection_name = self._get_collection_name(project)
        self.create_database(collection_name)
        # create index
        payload = {
            'index': {
                'fields': ['taskid']
            },
            'name': collection_name
        }

        res = self.session.post(self.base_url + collection_name + "/_index", json=payload).json()
        self.index = res['id']
        self._list_project()

    def save(self, project, taskid, url, result):
        if project not in self.projects:
            self._create_project(project)
        collection_name = self._get_collection_name(project)
        obj = {
            'taskid': taskid,
            'url': url,
            'result': result,
            'updatetime': time.time(),
        }
        return self.update_doc(collection_name, taskid, obj)

    def select(self, project, fields=None, offset=0, limit=0):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return
        offset = offset or 0
        limit = limit or 0
        collection_name = self._get_collection_name(project)
        if fields is None:
            fields = []
        sel = {
            'selector': {},
            'fields': fields,
            'skip': offset
        }
        # Only send "limit" when the caller asked for one.
        if limit:
            sel['limit'] = limit
        for result in self.get_docs(collection_name, sel):
            yield result

    def count(self, project):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return
        collection_name = self._get_collection_name(project)
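        # NOTE: _find returns at most 25 documents unless "limit" is set, so
        # this undercounts projects that hold more results than that.
        return len(self.get_all_docs(collection_name))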

    def get(self, project, taskid, fields=None):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return
        collection_name = self._get_collection_name(project)
        if fields is None:
            fields = []
        sel = {
            'selector': {'taskid': taskid},
            'fields': fields
        }
        ret = self.get_docs(collection_name, sel)
        if len(ret) == 0:
            return None
        return ret[0]

    def drop_database(self):
        return self.delete(self.url)

    def drop(self, project):
        # drop the project
        collection_name = self._get_collection_name(project)
        url = self.base_url + collection_name
        return self.delete(url)
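

# Illustrative sketch (not part of the original module).
# A minimal, standalone example of the query body that `select` and `get`
# above hand to CouchDB's `_find` endpoint. `build_find_payload` is a
# hypothetical helper introduced only for illustration; it assumes, as the
# methods above do, that a limit of 0 means "do not send a limit".


def build_find_payload(selector=None, fields=None, offset=0, limit=0, use_index=None):
    """Return a dict suitable for POSTing as JSON to `<db>/_find`."""
    payload = {
        'selector': selector or {},
        'fields': fields or [],
        'skip': offset or 0,
    }
    # Optional keys are only added when they carry a value.
    if limit:
        payload['limit'] = limit
    if use_index:
        payload['use_index'] = use_index
    return payload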

================================================
FILE: pyspider/database/couchdb/taskdb.py
================================================
import time
from pyspider.database.base.taskdb import TaskDB as BaseTaskDB
from .couchdbbase import SplitTableMixin


class TaskDB(SplitTableMixin, BaseTaskDB):
    collection_prefix = ''

    def __init__(self, url, database='taskdb', username=None, password=None):
        self.username = username
        self.password = password
        self.base_url = url
        self.url = url + database + "/"
        self.database = database
        self.index = None

        super().__init__()

        self.create_database(database)
        self.projects = set()
        self._list_project()

    def _get_collection_name(self, project):
        return self.database + "_" + self._collection_name(project)

    def _create_project(self, project):
        collection_name = self._get_collection_name(project)
        self.create_database(collection_name)
        # create index
        payload = {
            'index': {
                'fields': ['status', 'taskid']
            },
            'name': collection_name
        }
        res = self.session.post(self.base_url + collection_name + "/_index", json=payload).json()
        self.index = res['id']
        self._list_project()

    def load_tasks(self, status, project=None, fields=None):
        if not project:
            self._list_project()
        if fields is None:
            fields = []
        if project:
            projects = [project, ]
        else:
            projects = self.projects
        for project in projects:
            collection_name = self._get_collection_name(project)
            for task in self.get_docs(collection_name, {"selector": {"status": status}, "fields": fields}):
                yield task

    def get_task(self, project, taskid, fields=None):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return
        if fields is None:
            fields = []
        collection_name = self._get_collection_name(project)
        ret = self.get_docs(collection_name, {"selector": {"taskid": taskid}, "fields": fields})
        if len(ret) == 0:
            return None
        return ret[0]

    def status_count(self, project):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return {}
        collection_name = self._get_collection_name(project)

        # Count the documents in each status; statuses with no tasks are omitted.
        result = {}
        for status in (self.ACTIVE, self.SUCCESS, self.FAILED):
            total = len(self.get_docs(collection_name, {"selector": {'status': status}}))
            if total:
                result[status] = total
        return result

    def insert(self, project, taskid, obj={}):
        if project not in self.projects:
            self._create_project(project)
        obj = dict(obj)
        obj['taskid'] = taskid
        obj['project'] = project
        obj['updatetime'] = time.time()
        return self.update(project, taskid, obj=obj)

    def update(self, project, taskid, obj={}, **kwargs):
        obj = dict(obj)
        obj.update(kwargs)
        obj['updatetime'] = time.time()
        collection_name = self._get_collection_name(project)
        return self.update_doc(collection_name, taskid, obj)

    def drop_database(self):
        return self.delete(self.url)

    def drop(self, project):
        collection_name = self._get_collection_name(project)
        url = self.base_url + collection_name
        return self.delete(url)

================================================
FILE: pyspider/database/elasticsearch/__init__.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<roy@binux.me>
#         http://binux.me
# Created on 2016-01-17 18:31:58


================================================
FILE: pyspider/database/elasticsearch/projectdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<roy@binux.me>
#         http://binux.me
# Created on 2016-01-17 18:32:33

import time

import elasticsearch.helpers
from elasticsearch import Elasticsearch
from pyspider.database.base.projectdb import ProjectDB as BaseProjectDB


class ProjectDB(BaseProjectDB):
    __type__ = 'project'

    def __init__(self, hosts, index='pyspider'):
        self.index = index
        self.es = Elasticsearch(hosts=hosts)

        self.es.indices.create(index=self.index, ignore=400)
        if not self.es.indices.get_mapping(index=self.index, doc_type=self.__type__):
            self.es.indices.put_mapping(index=self.index, doc_type=self.__type__, body={
                "_all": {"enabled": False},
                "properties": {
                    "updatetime": {"type": "double"}
                }
            })

    def insert(self, name, obj={}):
        obj = dict(obj)
        obj['name'] = name
        obj['updatetime'] = time.time()

        obj.setdefault('group', '')
        obj.setdefault('status', 'TODO')
        obj.setdefault('script', '')
        obj.setdefault('comments', '')
        obj.setdefault('rate', 0)
        obj.setdefault('burst', 0)

        return self.es.index(index=self.index, doc_type=self.__type__, body=obj, id=name,
                             refresh=True)

    def update(self, name, obj={}, **kwargs):
        obj = dict(obj)
        obj.update(kwargs)
        obj['updatetime'] = time.time()
        return self.es.update(index=self.index, doc_type=self.__type__,
                              body={'doc': obj}, id=name, refresh=True, ignore=404)

    def get_all(self, fields=None):
        for record in elasticsearch.helpers.scan(self.es, index=self.index, doc_type=self.__type__,
                                                 query={'query': {"match_all": {}}},
                                                 _source_include=fields or []):
            yield record['_source']

    def get(self, name, fields=None):
        ret = self.es.get(index=self.index, doc_type=self.__type__, id=name,
                          _source_include=fields or [], ignore=404)
        return ret.get('_source', None)

    def check_update(self, timestamp, fields=None):
        for record in elasticsearch.helpers.scan(self.es, index=self.index, doc_type=self.__type__,
                                                 query={'query': {"range": {
                                                     "updatetime": {"gte": timestamp}
                                                 }}}, _source_include=fields or []):
            yield record['_source']

    def drop(self, name):
        return self.es.delete(index=self.index, doc_type=self.__type__, id=name, refresh=True)


================================================
FILE: pyspider/database/elasticsearch/resultdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<roy@binux.me>
#         http://binux.me
# Created on 2016-01-18 19:41:24


import time

import elasticsearch.helpers
from elasticsearch import Elasticsearch
from pyspider.database.base.resultdb import ResultDB as BaseResultDB


class ResultDB(BaseResultDB):
    __type__ = 'result'

    def __init__(self, hosts, index='pyspider'):
        self.index = index
        self.es = Elasticsearch(hosts=hosts)

        self.es.indices.create(index=self.index, ignore=400)
        if not self.es.indices.get_mapping(index=self.index, doc_type=self.__type__):
            self.es.indices.put_mapping(index=self.index, doc_type=self.__type__, body={
                "_all": {"enabled": True},
                "properties": {
                    "taskid": {"enabled": False},
                    "project": {"type": "string", "index": "not_analyzed"},
                    "url": {"enabled": False},
                }
            })

    @property
    def projects(self):
        ret = self.es.search(index=self.index, doc_type=self.__type__,
                             body={"aggs": {"projects": {
                                 "terms": {"field": "project"}
                             }}}, _source=False)
        return [each['key'] for each in ret['aggregations']['projects'].get('buckets', [])]

    def save(self, project, taskid, url, result):
        obj = {
            'taskid': taskid,
            'project': project,
            'url': url,
            'result': result,
            'updatetime': time.time(),
        }
        return self.es.index(index=self.index, doc_type=self.__type__,
                             body=obj, id='%s:%s' % (project, taskid))

    def select(self, project, fields=None, offset=0, limit=0):
        offset = offset or 0
        limit = limit or 0
        if not limit:
            for record in elasticsearch.helpers.scan(self.es, index=self.index, doc_type=self.__type__,
                                                     query={'query': {'term': {'project': project}}},
                                                     _source_include=fields or [], from_=offset,
                                                     sort="updatetime:desc"):
                yield record['_source']
        else:
            for record in self.es.search(index=self.index, doc_type=self.__type__,
                                         body={'query': {'term': {'project': project}}},
                                         _source_include=fields or [], from_=offset, size=limit,
                                         sort="updatetime:desc"
                                         ).get('hits', {}).get('hits', []):
                yield record['_source']

    def count(self, project):
        return self.es.count(index=self.index, doc_type=self.__type__,
                             body={'query': {'term': {'project': project}}}
                             ).get('count', 0)

    def get(self, project, taskid, fields=None):
        ret = self.es.get(index=self.index, doc_type=self.__type__, id="%s:%s" % (project, taskid),
                          _source_include=fields or [], ignore=404)
        return ret.get('_source', None)

    def drop(self, project):
        self.refresh()
        for record in elasticsearch.helpers.scan(self.es, index=self.index, doc_type=self.__type__,
                                                 query={'query': {'term': {'project': project}}},
                                                 _source=False):
            self.es.delete(index=self.index, doc_type=self.__type__, id=record['_id'])

    def refresh(self):
        """
        Explicitly refresh the index, making all operations
        performed since the last refresh available for search.
        """
        self.es.indices.refresh(index=self.index)


================================================
FILE: pyspider/database/elasticsearch/taskdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<roy@binux.me>
#         http://binux.me
# Created on 2016-01-20 20:20:55


import time
import json

import elasticsearch.helpers
from elasticsearch import Elasticsearch
from pyspider.database.base.taskdb import TaskDB as BaseTaskDB


class TaskDB(BaseTaskDB):
    __type__ = 'task'

    def __init__(self, hosts, index='pyspider'):
        self.index = index
        self._changed = False
        self.es = Elasticsearch(hosts=hosts)

        self.es.indices.create(index=self.index, ignore=400)
        if not self.es.indices.get_mapping(index=self.index, doc_type=self.__type__):
            self.es.indices.put_mapping(index=self.index, doc_type=self.__type__, body={
                "_all": {"enabled": False},
                "properties": {
                    "project": {"type": "string", "index": "not_analyzed"},
                    "status": {"type": "byte"},
                }
            })

    def _parse(self, data):
        if not data:
            return data
        for each in ('schedule', 'fetch', 'process', 'track'):
            if each in data:
                if data[each]:
                    data[each] = json.loads(data[each])
                else:
                    data[each] = {}
        return data

    def _stringify(self, data):
        for each in ('schedule', 'fetch', 'process', 'track'):
            if each in data:
                data[each] = json.dumps(data[each])
        return data

    @property
    def projects(self):
        ret = self.es.search(index=self.index, doc_type=self.__type__,
                             body={"aggs": {"projects": {
                                 "terms": {"field": "project"}
                             }}}, _source=False)
        return [each['key'] for each in ret['aggregations']['projects'].get('buckets', [])]

    def load_tasks(self, status, project=None, fields=None):
        self.refresh()
        if project is None:
            for project in self.projects:
                for each in self.load_tasks(status, project, fields):
                    yield each
        else:
            for record in elasticsearch.helpers.scan(self.es, index=self.index, doc_type=self.__type__,
                                                     query={'query': {'bool': {
                                                         'must': {'term': {'project': project}},
                                                         'should': [{'term': {'status': status}}],
                                                         'minimum_should_match': 1,
                                                     }}}, _source_include=fields or []):
                yield self._parse(record['_source'])

    def get_task(self, project, taskid, fields=None):
        if self._changed:
            self.refresh()
        ret = self.es.get(index=self.index, doc_type=self.__type__, id="%s:%s" % (project, taskid),
                          _source_include=fields or [], ignore=404)
        return self._parse(ret.get('_source', None))

    def status_count(self, project):
        self.refresh()
        ret = self.es.search(index=self.index, doc_type=self.__type__,
                             body={"query": {'term': {'project': project}},
                                   "aggs": {"status": {
                                       "terms": {"field": "status"}
                                   }}}, _source=False)
        result = {}
        for each in ret['aggregations']['status'].get('buckets', []):
            result[each['key']] = each['doc_count']
        return result

    def insert(self, project, taskid, obj={}):
        self._changed = True
        obj = dict(obj)
        obj['taskid'] = taskid
        obj['project'] = project
        obj['updatetime'] = time.time()
        return self.es.index(index=self.index, doc_type=self.__type__,
                             body=self._stringify(obj), id='%s:%s' % (project, taskid))

    def update(self, project, taskid, obj={}, **kwargs):
        self._changed = True
        obj = dict(obj)
        obj.update(kwargs)
        obj['updatetime'] = time.time()
        return self.es.update(index=self.index, doc_type=self.__type__, id='%s:%s' % (project, taskid),
                              body={"doc": self._stringify(obj)}, ignore=404)

    def drop(self, project):
        self.refresh()
        for record in elasticsearch.helpers.scan(self.es, index=self.index, doc_type=self.__type__,
                                                 query={'query': {'term': {'project': project}}},
                                                 _source=False):
            self.es.delete(index=self.index, doc_type=self.__type__, id=record['_id'])
        self.refresh()

    def refresh(self):
        """
        Explicitly refresh the index, making all operations
        performed since the last refresh available for search.
        """
        self._changed = False
        self.es.indices.refresh(index=self.index)


================================================
FILE: pyspider/database/local/__init__.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<roy@binux.me>
#         http://binux.me
# Created on 2015-01-17 20:56:50


================================================
FILE: pyspider/database/local/projectdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<roy@binux.me>
#         http://binux.me
# Created on 2015-01-17 12:32:17

import os
import re
import six
import glob
import logging

from pyspider.database.base.projectdb import ProjectDB as BaseProjectDB


class ProjectDB(BaseProjectDB):
    """ProjectDB loading scripts from local file."""

    def __init__(self, files):
        self.files = files
        self.projects = {}
        self.load_scripts()

    def load_scripts(self):
        project_names = set(self.projects.keys())
        for path in self.files:
            for filename in glob.glob(path):
                name = os.path.splitext(os.path.basename(filename))[0]
                if name in project_names:
                    project_names.remove(name)
                updatetime = os.path.getmtime(filename)
                if name not in self.projects or updatetime > self.projects[name]['updatetime']:
                    project = self._build_project(filename)
                    if not project:
                        continue
                    self.projects[project['name']] = project

        for name in project_names:
            del self.projects[name]

    rate_re = re.compile(r'^\s*#\s*rate.*?(\d+(\.\d+)?)', re.I | re.M)
    burst_re = re.compile(r'^\s*#\s*burst.*?(\d+(\.\d+)?)', re.I | re.M)

    def _build_project(self, filename):
        try:
            with open(filename) as fp:
                script = fp.read()
            m = self.rate_re.search(script)
            if m:
                rate = float(m.group(1))
            else:
                rate = 1

            m = self.burst_re.search(script)
            if m:
                burst = float(m.group(1))
            else:
                burst = 3

            return {
                'name': os.path.splitext(os.path.basename(filename))[0],
                'group': None,
                'status': 'RUNNING',
                'script': script,
                'comments': None,
                'rate': rate,
                'burst': burst,
                'updatetime': os.path.getmtime(filename),
            }
        # open() raises IOError on Python 2, which is not a subclass of OSError there.
        except (IOError, OSError) as e:
            logging.error('loading project script error: %s', e)
            return None

    def get_all(self, fields=None):
        for projectname in self.projects:
            yield self.get(projectname, fields)

    def get(self, name, fields=None):
        if name not in self.projects:
            return None
        project = self.projects[name]
        result = {}
        for f in fields or project:
            if f in project:
                result[f] = project[f]
            else:
                result[f] = None
        return result

    def check_update(self, timestamp, fields=None):
        self.load_scripts()
        for projectname, project in six.iteritems(self.projects):
            if project['updatetime'] > timestamp:
                yield self.get(projectname, fields)
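

# Illustrative sketch (not part of the original module).
# A minimal, standalone example of how the `rate_re` / `burst_re` patterns
# above pick throttling hints out of a script's header comments. The helper
# name `parse_rate_burst` and the sample inputs in the usage note are
# hypothetical; the defaults (rate=1, burst=3) mirror `_build_project`.
import re

_RATE_RE = re.compile(r'^\s*#\s*rate.*?(\d+(\.\d+)?)', re.I | re.M)
_BURST_RE = re.compile(r'^\s*#\s*burst.*?(\d+(\.\d+)?)', re.I | re.M)


def parse_rate_burst(script, default_rate=1, default_burst=3):
    """Return (rate, burst) read from `# rate: N` / `# burst: N` comment lines."""
    m = _RATE_RE.search(script)
    rate = float(m.group(1)) if m else default_rate
    m = _BURST_RE.search(script)
    burst = float(m.group(1)) if m else default_burst
    return rate, burst

# Usage: a script starting with "# rate: 2.5" and "# burst: 10" yields
# (2.5, 10.0); a script without either comment falls back to the defaults.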


================================================
FILE: pyspider/database/mongodb/__init__.py
================================================


================================================
FILE: pyspider/database/mongodb/mongodbbase.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<roy@binux.me>
#         http://binux.me
# Created on 2014-11-22 20:42:01

import time


class SplitTableMixin(object):
    UPDATE_PROJECTS_TIME = 10 * 60

    def _collection_name(self, project):
        if self.collection_prefix:
            return "%s.%s" % (self.collection_prefix, project)
        else:
            return project

    @property
    def projects(self):
        if time.time() - getattr(self, '_last_update_projects', 0) > self.UPDATE_PROJECTS_TIME:
            self._list_project()
        return self._projects

    @projects.setter
    def projects(self, value):
        self._projects = value

    def _list_project(self):
        self._last_update_projects = time.time()
        self.projects = set()
        if self.collection_prefix:
            prefix = "%s." % self.collection_prefix
        else:
            prefix = ''
        for each in self.database.collection_names():
            if each.startswith('system.'):
                continue
            if each.startswith(prefix):
                self.projects.add(each[len(prefix):])

    def drop(self, project):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return
        collection_name = self._collection_name(project)
        self.database[collection_name].drop()
        self._list_project()


================================================
FILE: pyspider/database/mongodb/projectdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<i@binux.me>
#         http://binux.me
# Created on 2014-10-12 12:22:42

import time
from pymongo import MongoClient

from pyspider.database.base.projectdb import ProjectDB as BaseProjectDB


class ProjectDB(BaseProjectDB):
    __collection_name__ = 'projectdb'

    def __init__(self, url, database='projectdb'):
        self.conn = MongoClient(url)
        self.conn.admin.command("ismaster")
        self.database = self.conn[database]
        self.collection = self.database[self.__collection_name__]

        self.collection.ensure_index('name', unique=True)

    def _default_fields(self, each):
        if each is None:
            return each
        each.setdefault('group', None)
        each.setdefault('status', 'TODO')
        each.setdefault('script', '')
        each.setdefault('comments', None)
        each.setdefault('rate', 0)
        each.setdefault('burst', 0)
        each.setdefault('updatetime', 0)
        return each

    def insert(self, name, obj={}):
        obj = dict(obj)
        obj['name'] = name
        obj['updatetime'] = time.time()
        return self.collection.update({'name': name}, {'$set': obj}, upsert=True)

    def update(self, name, obj={}, **kwargs):
        obj = dict(obj)
        obj.update(kwargs)
        obj['updatetime'] = time.time()
        return self.collection.update({'name': name}, {'$set': obj})

    def get_all(self, fields=None):
        for each in self.collection.find({}, fields):
            if each and '_id' in each:
                del each['_id']
            yield self._default_fields(each)

    def get(self, name, fields=None):
        each = self.collection.find_one({'name': name}, fields)
        if each and '_id' in each:
            del each['_id']
        return self._default_fields(each)

    def check_update(self, timestamp, fields=None):
        for project in self.get_all(fields=('updatetime', 'name')):
            if project['updatetime'] > timestamp:
                project = self.get(project['name'], fields)
                yield self._default_fields(project)

    def drop(self, name):
        return self.collection.remove({'name': name})


================================================
FILE: pyspider/database/mongodb/resultdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<i@binux.me>
#         http://binux.me
# Created on 2014-10-13 22:18:36

import json
import time

from pymongo import MongoClient

from pyspider.database.base.resultdb import ResultDB as BaseResultDB
from .mongodbbase import SplitTableMixin


class ResultDB(SplitTableMixin, BaseResultDB):
    collection_prefix = ''

    def __init__(self, url, database='resultdb'):
        self.conn = MongoClient(url)
        self.conn.admin.command("ismaster")
        self.database = self.conn[database]
        self.projects = set()

        self._list_project()
        # We suggest building indexes manually in advance instead of
        # indexing every collection during startup:
        # for project in self.projects:
        #     collection_name = self._collection_name(project)
        #     self.database[collection_name].ensure_index('taskid')

    def _create_project(self, project):
        collection_name = self._collection_name(project)
        self.database[collection_name].ensure_index('taskid')
        self._list_project()

    def _parse(self, data):
        data['_id'] = str(data['_id'])
        if 'result' in data:
            data['result'] = json.loads(data['result'])
        return data

    def _stringify(self, data):
        if 'result' in data:
            data['result'] = json.dumps(data['result'])
        return data

    def save(self, project, taskid, url, result):
        if project not in self.projects:
            self._create_project(project)
        collection_name = self._collection_name(project)
        obj = {
            'taskid'    : taskid,
            'url'       : url,
            'result'    : result,
            'updatetime': time.time(),
        }
        return self.database[collection_name].update(
            {'taskid': taskid}, {"$set": self._stringify(obj)}, upsert=True
        )

    def select(self, project, fields=None, offset=0, limit=0):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return
        offset = offset or 0
        limit = limit or 0
        collection_name = self._collection_name(project)
        for result in self.database[collection_name].find({}, fields, skip=offset, limit=limit):
            yield self._parse(result)

    def count(self, project):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return 0
        collection_name = self._collection_name(project)
        return self.database[collection_name].count()

    def get(self, project, taskid, fields=None):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return
        collection_name = self._collection_name(project)
        ret = self.database[collection_name].find_one({'taskid': taskid}, fields)
        if not ret:
            return ret
        return self._parse(ret)


================================================
FILE: pyspider/database/mongodb/taskdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<i@binux.me>
#         http://binux.me
# Created on 2014-10-11 23:54:50

import json
import time

from pymongo import MongoClient

from pyspider.database.base.taskdb import TaskDB as BaseTaskDB
from .mongodbbase import SplitTableMixin


class TaskDB(SplitTableMixin, BaseTaskDB):
    collection_prefix = ''

    def __init__(self, url, database='taskdb'):
        self.conn = MongoClient(url)
        self.conn.admin.command("ismaster")
        self.database = self.conn[database]
        self.projects = set()

        self._list_project()
        # We suggest building indexes manually in advance instead of
        # indexing every collection during startup:
        # for project in self.projects:
        #     collection_name = self._collection_name(project)
        #     self.database[collection_name].ensure_index('status')
        #     self.database[collection_name].ensure_index('taskid')

    def _create_project(self, project):
        collection_name = self._collection_name(project)
        self.database[collection_name].ensure_index('status')
        self.database[collection_name].ensure_index('taskid')
        self._list_project()

    def _parse(self, data):
        if '_id' in data:
            del data['_id']
        for each in ('schedule', 'fetch', 'process', 'track'):
            if each in data:
                if data[each]:
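                    if isinstance(data[each], bytearray):
                        # str(bytearray) yields the repr on Python 3; decode instead
                        data[each] = data[each].decode('utf8')
                    # json.loads() no longer accepts `encoding` (removed in Python 3.9)
                    data[each] = json.loads(data[each])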
                else:
                    data[each] = {}
        return data

    def _stringify(self, data):
        for each in ('schedule', 'fetch', 'process', 'track'):
            if each in data:
                data[each] = json.dumps(data[each])
        return data

    def load_tasks(self, status, project=None, fields=None):
        if not project:
            self._list_project()

        if project:
            projects = [project, ]
        else:
            projects = self.projects

        for project in projects:
            collection_name = self._collection_name(project)
            for task in self.database[collection_name].find({'status': status}, fields):
                yield self._parse(task)

    def get_task(self, project, taskid, fields=None):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return
        collection_name = self._collection_name(project)
        ret = self.database[collection_name].find_one({'taskid': taskid}, fields)
        if not ret:
            return ret
        return self._parse(ret)

    def status_count(self, project):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return {}
        collection_name = self._collection_name(project)

        # When the task collection holds a lot of data, an aggregate operation
        # takes a very long time and makes the scheduler start up slowly.

        # ret = self.database[collection_name].aggregate([
        #     {'$group': {
        #         '_id'  : '$status',
        #         'total': {
        #             '$sum': 1
        #         }
        #     }
        #     }])

        # Instead of aggregate, count each status through the indexed
        # `status` field and skip statuses that have no tasks.
        c = self.database[collection_name]
        result = {}
        for status in (self.ACTIVE, self.SUCCESS, self.FAILED):
            total = c.find({'status': status}).count()
            if total:
                result[status] = total
        return result

    def insert(self, project, taskid, obj={}):
        if project not in self.projects:
            self._create_project(project)
        obj = dict(obj)
        obj['taskid'] = taskid
        obj['project'] = project
        obj['updatetime'] = time.time()
        return self.update(project, taskid, obj=obj)

    def update(self, project, taskid, obj={}, **kwargs):
        obj = dict(obj)
        obj.update(kwargs)
        obj['updatetime'] = time.time()
        collection_name = self._collection_name(project)
        return self.database[collection_name].update(
            {'taskid': taskid},
            {"$set": self._stringify(obj)},
            upsert=True
        )


================================================
FILE: pyspider/database/mysql/__init__.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<i@binux.me>
#         http://binux.me
# Created on 2014-07-17 20:12:54


================================================
FILE: pyspider/database/mysql/mysqlbase.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<i@binux.me>
#         http://binux.me
# Created on 2014-11-05 10:42:24

import time
import mysql.connector


class MySQLMixin(object):
    maxlimit = 18446744073709551615

    @property
    def dbcur(self):
        try:
            if self.conn.unread_result:
                self.conn.get_rows()
                if hasattr(self.conn, 'free_result'):
                    self.conn.free_result()
            return self.conn.cursor()
        except (mysql.connector.OperationalError, mysql.connector.InterfaceError):
            self.conn.ping(reconnect=True)
            self.conn.database = self.database_name
            return self.conn.cursor()


class SplitTableMixin(object):
    UPDATE_PROJECTS_TIME = 10 * 60

    def _tablename(self, project):
        if self.__tablename__:
            return '%s_%s' % (self.__tablename__, project)
        else:
            return project

    @property
    def projects(self):
        if time.time() - getattr(self, '_last_update_projects', 0) \
                > self.UPDATE_PROJECTS_TIME:
            self._list_project()
        return self._projects

    @projects.setter
    def projects(self, value):
        self._projects = value

    def _list_project(self):
        self._last_update_projects = time.time()
        self.projects = set()
        if self.__tablename__:
            prefix = '%s_' % self.__tablename__
        else:
            prefix = ''
        for project, in self._execute('show tables;'):
            if project.startswith(prefix):
                project = project[len(prefix):]
                self.projects.add(project)

    def drop(self, project):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return
        tablename = self._tablename(project)
        self._execute("DROP TABLE %s" % self.escape(tablename))
        self._list_project()


================================================
FILE: pyspider/database/mysql/projectdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<i@binux.me>
#         http://binux.me
# Created on 2014-07-17 21:06:43

import time
import mysql.connector

from pyspider.database.base.projectdb import ProjectDB as BaseProjectDB
from pyspider.database.basedb import BaseDB
from .mysqlbase import MySQLMixin


class ProjectDB(MySQLMixin, BaseProjectDB, BaseDB):
    __tablename__ = 'projectdb'

    def __init__(self, host='localhost', port=3306, database='projectdb',
                 user='root', passwd=None):
        self.database_name = database
        self.conn = mysql.connector.connect(user=user, password=passwd,
                                            host=host, port=port, autocommit=True)
        if database not in [x[0] for x in self._execute('show databases')]:
            self._execute('CREATE DATABASE %s' % self.escape(database))
        self.conn.database = database

        self._execute('''CREATE TABLE IF NOT EXISTS %s (
            `name` varchar(64) PRIMARY KEY,
            `group` varchar(64),
            `status` varchar(16),
            `script` TEXT,
            `comments` varchar(1024),
            `rate` float(11, 4),
            `burst` float(11, 4),
            `updatetime` double(16, 4)
            ) ENGINE=InnoDB CHARSET=utf8''' % self.escape(self.__tablename__))

    def insert(self, name, obj={}):
        obj = dict(obj)
        obj['name'] = name
        obj['updatetime'] = time.time()
        return self._insert(**obj)

    def update(self, name, obj={}, **kwargs):
        obj = dict(obj)
        obj.update(kwargs)
        obj['updatetime'] = time.time()
        ret = self._update(where="`name` = %s" % self.placeholder, where_values=(name, ), **obj)
        return ret.rowcount

    def get_all(self, fields=None):
        return self._select2dic(what=fields)

    def get(self, name, fields=None):
        where = "`name` = %s" % self.placeholder
        for each in self._select2dic(what=fields, where=where, where_values=(name, )):
            return each
        return None

    def drop(self, name):
        where = "`name` = %s" % self.placeholder
        return self._delete(where=where, where_values=(name, ))

    def check_update(self, timestamp, fields=None):
        where = "`updatetime` >= %f" % timestamp
        return self._select2dic(what=fields, where=where)


================================================
FILE: pyspider/database/mysql/resultdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<i@binux.me>
#         http://binux.me
# Created on 2014-10-13 22:02:57

import re
import six
import time
import json
import mysql.connector

from pyspider.libs import utils
from pyspider.database.base.resultdb import ResultDB as BaseResultDB
from pyspider.database.basedb import BaseDB
from .mysqlbase import MySQLMixin, SplitTableMixin


class ResultDB(MySQLMixin, SplitTableMixin, BaseResultDB, BaseDB):
    __tablename__ = ''

    def __init__(self, host='localhost', port=3306, database='resultdb',
                 user='root', passwd=None):
        self.database_name = database
        self.conn = mysql.connector.connect(user=user, password=passwd,
                                            host=host, port=port, autocommit=True)
        if database not in [x[0] for x in self._execute('show databases')]:
            self._execute('CREATE DATABASE %s' % self.escape(database))
        self.conn.database = database
        self._list_project()

    def _create_project(self, project):
        assert re.match(r'^\w+$', project) is not None
        tablename = self._tablename(project)
        if tablename in [x[0] for x in self._execute('show tables')]:
            return
        self._execute('''CREATE TABLE %s (
            `taskid` varchar(64) PRIMARY KEY,
            `url` varchar(1024),
            `result` MEDIUMBLOB,
            `updatetime` double(16, 4)
            ) ENGINE=InnoDB CHARSET=utf8''' % self.escape(tablename))

    def _parse(self, data):
        for key, value in list(six.iteritems(data)):
            if isinstance(value, (bytearray, six.binary_type)):
                data[key] = utils.text(value)
        if 'result' in data:
            data['result'] = json.loads(data['result'])
        return data

    def _stringify(self, data):
        if 'result' in data:
            data['result'] = json.dumps(data['result'])
        return data

    def save(self, project, taskid, url, result):
        tablename = self._tablename(project)
        if project not in self.projects:
            self._create_project(project)
            self._list_project()
        obj = {
            'taskid': taskid,
            'url': url,
            'result': result,
            'updatetime': time.time(),
        }
        return self._replace(tablename, **self._stringify(obj))

    def select(self, project, fields=None, offset=0, limit=None):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return
        tablename = self._tablename(project)

        for task in self._select2dic(tablename, what=fields, order='updatetime DESC',
                                     offset=offset, limit=limit):
            yield self._parse(task)

    def count(self, project):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return 0
        tablename = self._tablename(project)
        for count, in self._execute("SELECT count(1) FROM %s" % self.escape(tablename)):
            return count

    def get(self, project, taskid, fields=None):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return
        tablename = self._tablename(project)
        where = "`taskid` = %s" % self.placeholder
        for task in self._select2dic(tablename, what=fields,
                                     where=where, where_values=(taskid, )):
            return self._parse(task)


================================================
FILE: pyspider/database/mysql/taskdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<i@binux.me>
#         http://binux.me
# Created on 2014-07-17 18:53:01


import re
import six
import time
import json
import mysql.connector

from pyspider.libs import utils
from pyspider.database.base.taskdb import TaskDB as BaseTaskDB
from pyspider.database.basedb import BaseDB
from .mysqlbase import MySQLMixin, SplitTableMixin


class TaskDB(MySQLMixin, SplitTableMixin, BaseTaskDB, BaseDB):
    __tablename__ = ''

    def __init__(self, host='localhost', port=3306, database='taskdb',
                 user='root', passwd=None):
        self.database_name = database
        self.conn = mysql.connector.connect(user=user, password=passwd,
                                            host=host, port=port, autocommit=True)
        if database not in [x[0] for x in self._execute('show databases')]:
            self._execute('CREATE DATABASE %s' % self.escape(database))
        self.conn.database = database
        self._list_project()

    def _create_project(self, project):
        assert re.match(r'^\w+$', project) is not None
        tablename = self._tablename(project)
        if tablename in [x[0] for x in self._execute('show tables')]:
            return
        self._execute('''CREATE TABLE IF NOT EXISTS %s (
            `taskid` varchar(64) PRIMARY KEY,
            `project` varchar(64),
            `url` varchar(1024),
            `status` int(1),
            `schedule` BLOB,
            `fetch` BLOB,
            `process` BLOB,
            `track` BLOB,
            `lastcrawltime` double(16, 4),
            `updatetime` double(16, 4),
            INDEX `status_index` (`status`)
            ) ENGINE=InnoDB CHARSET=utf8''' % self.escape(tablename))

    def _parse(self, data):
        for key, value in list(six.iteritems(data)):
            if isinstance(value, (bytearray, six.binary_type)):
                data[key] = utils.text(value)
        for each in ('schedule', 'fetch', 'process', 'track'):
            if each in data:
                if data[each]:
                    data[each] = json.loads(data[each])
                else:
                    data[each] = {}
        return data

    def _stringify(self, data):
        for each in ('schedule', 'fetch', 'process', 'track'):
            if each in data:
                data[each] = json.dumps(data[each])
        return data

    def load_tasks(self, status, project=None, fields=None):
        if project and project not in self.projects:
            return
        where = "`status` = %s" % self.placeholder

        if project:
            projects = [project, ]
        else:
            projects = self.projects

        for project in projects:
            tablename = self._tablename(project)
            for each in self._select2dic(
                tablename, what=fields, where=where, where_values=(status, )
            ):
                yield self._parse(each)

    def get_task(self, project, taskid, fields=None):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return None
        where = "`taskid` = %s" % self.placeholder
        tablename = self._tablename(project)
        for each in self._select2dic(tablename, what=fields, where=where, where_values=(taskid, )):
            return self._parse(each)
        return None

    def status_count(self, project):
        result = dict()
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return result
        tablename = self._tablename(project)
        for status, count in self._execute("SELECT `status`, count(1) FROM %s GROUP BY `status`" %
                                           self.escape(tablename)):
            result[status] = count
        return result

    def insert(self, project, taskid, obj={}):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            self._create_project(project)
            self._list_project()
        obj = dict(obj)
        obj['taskid'] = taskid
        obj['project'] = project
        obj['updatetime'] = time.time()
        tablename = self._tablename(project)
        return self._insert(tablename, **self._stringify(obj))

    def update(self, project, taskid, obj={}, **kwargs):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            raise LookupError
        tablename = self._tablename(project)
        obj = dict(obj)
        obj.update(kwargs)
        obj['updatetime'] = time.time()
        return self._update(
            tablename,
            where="`taskid` = %s" % self.placeholder,
            where_values=(taskid, ),
            **self._stringify(obj)
        )


================================================
FILE: pyspider/database/redis/__init__.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<roy@binux.me>
#         http://binux.me
# Created on 2015-05-17 01:34:21



================================================
FILE: pyspider/database/redis/taskdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<roy@binux.me>
#         http://binux.me
# Created on 2015-05-16 21:01:52

import six
import time
import json
import redis
import logging
import itertools

from pyspider.libs import utils
from pyspider.database.base.taskdb import TaskDB as BaseTaskDB


class TaskDB(BaseTaskDB):
    UPDATE_PROJECTS_TIME = 10 * 60
    __prefix__ = 'taskdb_'

    def __init__(self, host='localhost', port=6379, db=0):
        self.redis = redis.StrictRedis(host=host, port=port, db=db)
        try:
            self.redis.scan(count=1)
            self.scan_available = True
        except Exception as e:
            logging.debug("redis_scan disabled: %r", e)
            self.scan_available = False

    def _gen_key(self, project, taskid):
        return "%s%s_%s" % (self.__prefix__, project, taskid)

    def _gen_status_key(self, project, status):
        return '%s%s_status_%d' % (self.__prefix__, project, status)

    def _parse(self, data):
        if six.PY3:
            result = {}
            for key, value in data.items():
                if isinstance(value, bytes):
                    value = utils.text(value)
                result[utils.text(key)] = value
            data = result

        for each in ('schedule', 'fetch', 'process', 'track'):
            if each in data:
                if data[each]:
                    data[each] = json.loads(data[each])
                else:
                    data[each] = {}
        if 'status' in data:
            data['status'] = int(data['status'])
        if 'lastcrawltime' in data:
            data['lastcrawltime'] = float(data['lastcrawltime'] or 0)
        if 'updatetime' in data:
            data['updatetime'] = float(data['updatetime'] or 0)
        return data

    def _stringify(self, data):
        for each in ('schedule', 'fetch', 'process', 'track'):
            if each in data:
                data[each] = json.dumps(data[each])
        return data

    @property
    def projects(self):
        if time.time() - getattr(self, '_last_update_projects', 0) \
                > self.UPDATE_PROJECTS_TIME:
            self._projects = set(utils.text(x) for x in self.redis.smembers(
                self.__prefix__ + 'projects'))
        return self._projects

    def load_tasks(self, status, project=None, fields=None):
        if project is None:
            project = self.projects
        elif not isinstance(project, list):
            project = [project, ]

        if self.scan_available:
            scan_method = self.redis.sscan_iter
        else:
            scan_method = self.redis.smembers

        if fields:
            def get_method(key):
                obj = self.redis.hmget(key, fields)
                if all(x is None for x in obj):
                    return None
                return dict(zip(fields, obj))
        else:
            get_method = self.redis.hgetall

        for p in project:
            status_key = self._gen_status_key(p, status)
            for taskid in scan_method(status_key):
                obj = get_method(self._gen_key(p, utils.text(taskid)))
                if not obj:
                    #self.redis.srem(status_key, taskid)
                    continue
                else:
                    yield self._parse(obj)

    def get_task(self, project, taskid, fields=None):
        if fields:
            obj = self.redis.hmget(self._gen_key(project, taskid), fields)
            if all(x is None for x in obj):
                return None
            obj = dict(zip(fields, obj))
        else:
            obj = self.redis.hgetall(self._gen_key(project, taskid))

        if not obj:
            return None
        return self._parse(obj)

    def status_count(self, project):
        '''
        Return a dict mapping each status code (1-4) to its task count,
        omitting statuses with no tasks.
        '''
        pipe = self.redis.pipeline(transaction=False)
        for status in range(1, 5):
            pipe.scard(self._gen_status_key(project, status))
        ret = pipe.execute()

        result = {}
        for status, count in enumerate(ret):
            if count > 0:
                result[status + 1] = count
        return result

    def insert(self, project, taskid, obj={}):
        obj = dict(obj)
        obj['taskid'] = taskid
        obj['project'] = project
        obj['updatetime'] = time.time()
        obj.setdefault('status', self.ACTIVE)

        task_key = self._gen_key(project, taskid)

        pipe = self.redis.pipeline(transaction=False)
        if project not in self.projects:
            pipe.sadd(self.__prefix__ + 'projects', project)
        pipe.hmset(task_key, self._stringify(obj))
        pipe.sadd(self._gen_status_key(project, obj['status']), taskid)
        pipe.execute()

    def update(self, project, taskid, obj={}, **kwargs):
        obj = dict(obj)
        obj.update(kwargs)
        obj['updatetime'] = time.time()

        pipe = self.redis.pipeline(transaction=False)
        pipe.hmset(self._gen_key(project, taskid), self._stringify(obj))
        if 'status' in obj:
            for status in range(1, 5):
                if status == obj['status']:
                    pipe.sadd(self._gen_status_key(project, status), taskid)
                else:
                    pipe.srem(self._gen_status_key(project, status), taskid)
        pipe.execute()

    def drop(self, project):
        self.redis.srem(self.__prefix__ + 'projects', project)

        if self.scan_available:
            scan_method = self.redis.scan_iter
        else:
            scan_method = self.redis.keys

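        # Delete keys in batches of 100 (the apparent intent); itertools.tee()
        # would instead replay the full key stream 100 times.
        keys = iter(scan_method("%s%s_*" % (self.__prefix__, project)))
        while True:
            each = list(itertools.islice(keys, 100))
            if not each:
                break
            self.redis.delete(*each)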


================================================
FILE: pyspider/database/sqlalchemy/__init__.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<roy@binux.me>
#         http://binux.me
# Created on 2014-12-04 20:11:04



================================================
FILE: pyspider/database/sqlalchemy/projectdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<roy@binux.me>
#         http://binux.me
# Created on 2014-12-04 23:25:10

import six
import time
import sqlalchemy.exc

from sqlalchemy import create_engine, MetaData, Table, Column, String, Float, Text
from sqlalchemy.engine.url import make_url
from pyspider.libs import utils
from pyspider.database.base.projectdb import ProjectDB as BaseProjectDB
from .sqlalchemybase import result2dict


class ProjectDB(BaseProjectDB):
    __tablename__ = 'projectdb'

    def __init__(self, url):
        self.table = Table(self.__tablename__, MetaData(),
                           Column('name', String(64), primary_key=True),
                           Column('group', String(64)),
                           Column('status', String(16)),
                           Column('script', Text),
                           Column('comments', String(1024)),
                           Column('rate', Float(11)),
                           Column('burst', Float(11)),
                           Column('updatetime', Float(32)),
                           mysql_engine='InnoDB',
                           mysql_charset='utf8'
                           )

        self.url = make_url(url)
        if self.url.database:
            database = self.url.database
            self.url.database = None
            try:
                engine = create_engine(self.url, convert_unicode=True, pool_recycle=3600)
                conn = engine.connect()
                conn.execute("commit")
                conn.execute("CREATE DATABASE %s" % database)
            except sqlalchemy.exc.SQLAlchemyError:
                pass
            self.url.database = database
        self.engine = create_engine(url, convert_unicode=True, pool_recycle=3600)
        self.table.create(self.engine, checkfirst=True)

    @staticmethod
    def _parse(data):
        return data

    @staticmethod
    def _stringify(data):
        return data

    def insert(self, name, obj={}):
        obj = dict(obj)
        obj['name'] = name
        obj['updatetime'] = time.time()
        return self.engine.execute(self.table.insert()
                                   .values(**self._stringify(obj)))

    def update(self, name, obj={}, **kwargs):
        obj = dict(obj)
        obj.update(kwargs)
        obj['updatetime'] = time.time()
        return self.engine.execute(self.table.update()
                                   .where(self.table.c.name == name)
                                   .values(**self._stringify(obj)))

    def get_all(self, fields=None):
        columns = [getattr(self.table.c, f, f) for f in fields] if fields else self.table.c
        for task in self.engine.execute(self.table.select()
                                        .with_only_columns(columns)):
            yield self._parse(result2dict(columns, task))

    def get(self, name, fields=None):
        columns = [getattr(self.table.c, f, f) for f in fields] if fields else self.table.c
        for task in self.engine.execute(self.table.select()
                                        .where(self.table.c.name == name)
                                        .limit(1)
                                        .with_only_columns(columns)):
            return self._parse(result2dict(columns, task))

    def drop(self, name):
        return self.engine.execute(self.table.delete()
                                   .where(self.table.c.name == name))

    def check_update(self, timestamp, fields=None):
        columns = [getattr(self.table.c, f, f) for f in fields] if fields else self.table.c
        for task in self.engine.execute(self.table.select()
                                        .with_only_columns(columns)
                                        .where(self.table.c.updatetime >= timestamp)):
            yield self._parse(result2dict(columns, task))


================================================
FILE: pyspider/database/sqlalchemy/resultdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<roy@binux.me>
#         http://binux.me
# Created on 2014-12-04 18:48:15

import re
import six
import time
import json
import sqlalchemy.exc

from sqlalchemy import (create_engine, MetaData, Table, Column,
                        String, Float, Text)
from sqlalchemy.engine.url import make_url
from pyspider.database.base.resultdb import ResultDB as BaseResultDB
from pyspider.libs import utils
from .sqlalchemybase import SplitTableMixin, result2dict


class ResultDB(SplitTableMixin, BaseResultDB):
    __tablename__ = ''

    def __init__(self, url):
        self.table = Table('__tablename__', MetaData(),
                           Column('taskid', String(64), primary_key=True, nullable=False),
                           Column('url', String(1024)),
                           Column('result', Text()),
                           Column('updatetime', Float(32)),
                           mysql_engine='InnoDB',
                           mysql_charset='utf8'
                           )

        self.url = make_url(url)
        if self.url.database:
            database = self.url.database
            self.url.database = None
            try:
                engine = create_engine(self.url, convert_unicode=True, pool_recycle=3600)
                conn = engine.connect()
                conn.execute("commit")
                conn.execute("CREATE DATABASE %s" % database)
            except sqlalchemy.exc.SQLAlchemyError:
                pass
            self.url.database = database
        self.engine = create_engine(url, convert_unicode=True,
                                    pool_recycle=3600)

        self._list_project()

    def _create_project(self, project):
        assert re.match(r'^\w+$', project) is not None
        if project in self.projects:
            return
        self.table.name = self._tablename(project)
        self.table.create(self.engine)

    @staticmethod
    def _parse(data):
        for key, value in list(six.iteritems(data)):
            if isinstance(value, six.binary_type):
                data[key] = utils.text(value)
        if 'result' in data:
            if data['result']:
                data['result'] = json.loads(data['result'])
            else:
                data['result'] = {}
        return data

    @staticmethod
    def _stringify(data):
        if 'result' in data:
            if data['result']:
                data['result'] = json.dumps(data['result'])
            else:
                data['result'] = json.dumps({})
        return data

    def save(self, project, taskid, url, result):
        if project not in self.projects:
            self._create_project(project)
            self._list_project()
        self.table.name = self._tablename(project)
        obj = {
            'taskid': taskid,
            'url': url,
            'result': result,
            'updatetime': time.time(),
        }
        if self.get(project, taskid, ('taskid', )):
            del obj['taskid']
            return self.engine.execute(self.table.update()
                                       .where(self.table.c.taskid == taskid)
                                       .values(**self._stringify(obj)))
        else:
            return self.engine.execute(self.table.insert()
                                       .values(**self._stringify(obj)))

    def select(self, project, fields=None, offset=0, limit=None):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return
        self.table.name = self._tablename(project)

        columns = [getattr(self.table.c, f, f) for f in fields] if fields else self.table.c
        for task in self.engine.execute(self.table.select()
                                        .with_only_columns(columns=columns)
                                        .order_by(self.table.c.updatetime.desc())
                                        .offset(offset).limit(limit)
                                        .execution_options(autocommit=True)):
            yield self._parse(result2dict(columns, task))

    def count(self, project):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return 0
        self.table.name = self._tablename(project)

        for count, in self.engine.execute(self.table.count()):
            return count

    def get(self, project, taskid, fields=None):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return
        self.table.name = self._tablename(project)

        columns = [getattr(self.table.c, f, f) for f in fields] if fields else self.table.c
        for task in self.engine.execute(self.table.select()
                                        .with_only_columns(columns=columns)
                                        .where(self.table.c.taskid == taskid)
                                        .limit(1)):
            return self._parse(result2dict(columns, task))
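
Note (illustrative sketch, not part of the repository): `_stringify` and `_parse` above store the `result` field as a JSON string and decode it on the way out, mapping any falsy value to an empty dict. The stand-alone functions below mirror that round trip under one assumption that differs from the original: they copy the input dict instead of mutating it in place. The names `stringify` and `parse` are hypothetical.

```python
import json


def stringify(data):
    """Return a copy of `data` with 'result' JSON-encoded (falsy values become {})."""
    data = dict(data)
    if 'result' in data:
        data['result'] = json.dumps(data['result'] or {})
    return data


def parse(data):
    """Return a copy of `data` with 'result' JSON-decoded (empty values become {})."""
    data = dict(data)
    if 'result' in data:
        data['result'] = json.loads(data['result']) if data['result'] else {}
    return data


row = stringify({'taskid': 't1', 'url': 'http://example.com/', 'result': {'title': 'x'}})
restored = parse(row)
empty = parse(stringify({'taskid': 't2', 'result': None}))
```

One consequence worth noting: a falsy result such as `[]`, `0` or `''` does not survive the round trip and comes back as `{}`.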


================================================
FILE: pyspider/database/sqlalchemy/sqlalchemybase.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<roy@binux.me>
#         http://binux.me
# Created on 2014-12-04 18:48:47

import time


def result2dict(columns, task):
    return dict(task)


class SplitTableMixin(object):
    UPDATE_PROJECTS_TIME = 10 * 60

    def _tablename(self, project):
        if self.__tablename__:
            return '%s_%s' % (self.__tablename__, project)
        else:
            return project

    @property
    def projects(self):
        if time.time() - getattr(self, '_last_update_projects', 0) \
                > self.UPDATE_PROJECTS_TIME:
            self._list_project()
        return self._projects

    @projects.setter
    def projects(self, value):
        self._projects = value

    def _list_project(self):
        self._last_update_projects = time.time()
        self.projects = set()
        if self.__tablename__:
            prefix = '%s_' % self.__tablename__
        else:
            prefix = ''

        for project in self.engine.table_names():
            if project.startswith(prefix):
                project = project[len(prefix):]
                self.projects.add(project)

    def drop(self, project):
        if project not in self.projects:
            self._list_project()
        if project not in self.projects:
            return
        self.table.name = self._tablename(project)
        self.table.drop(self.engine)
        self._list_project()
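
Note (illustrative sketch, not part of the repository): `SplitTableMixin.projects` above is a lazily refreshed cache. The set of projects is reloaded only when it is older than `UPDATE_PROJECTS_TIME`, or when `_list_project()` is called explicitly. The class below reproduces that pattern without a database. `ProjectCache`, its `loader` argument (standing in for `engine.table_names()`) and the `reloads` counter are assumptions made for this example.

```python
import time


class ProjectCache(object):
    """Hypothetical stand-in for SplitTableMixin; `loader` replaces engine.table_names()."""
    UPDATE_PROJECTS_TIME = 10 * 60  # seconds between automatic refreshes

    def __init__(self, loader, prefix=''):
        self._loader = loader
        self._prefix = prefix
        self._projects = set()
        self._last_update_projects = 0
        self.reloads = 0  # counts how often the loader was consulted

    @property
    def projects(self):
        # Refresh lazily once the cached set is older than the TTL.
        if time.time() - self._last_update_projects > self.UPDATE_PROJECTS_TIME:
            self._list_project()
        return self._projects

    def _list_project(self):
        self._last_update_projects = time.time()
        self.reloads += 1
        # Keep only names carrying the prefix, with the prefix stripped.
        self._projects = set(
            name[len(self._prefix):]
            for name in self._loader()
            if name.startswith(self._prefix)
        )


tables = ['taskdb_news', 'taskdb_blog', 'other']
cache = ProjectCache(lambda: tables, prefix='taskdb_')
first = cache.projects          # triggers the first load
tables.append('taskdb_shop')    # a table created behind the cache's back
second = cache.projects         # still served from the cache
cache._list_project()           # explicit refresh
third = cache.projects
```

This is why callers such as `save` and `drop` invoke `_list_project()` themselves after creating or dropping a table: the cached set would otherwise stay stale until the interval expires.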


================================================
FILE: pyspider/database/sqlalchemy/taskdb.py
================================================
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:
# Author: Binux<roy@binux.me>
#         http://binux.me
# Created on 2014-12-04 22:33:43

import re
import six
import time
import json
import sqlalchemy.exc

from sqlalchemy import (create_engine, MetaData, Table, Column, Index,
                        Integer, String, Float, Text, func)
from sqlalchemy.engine.url import make_url
from pyspider.libs import utils
from pyspider.database.base.taskdb import TaskDB as BaseTaskDB
from .sqlalchemybase import SplitTableMixin, result2dict


class TaskDB(SplitTableMixin, BaseTaskDB):
    __tablename__ = ''

    def __init__(self, url):
        self.table = Table('__tablename__', MetaData(),
                           Column('taskid', String(64), primary_key=True, nullable=False),
                           Column('project', String(64)),
                           Column('url', String(1024)),
                           Column('status', Integer),
                           Column('schedule', Text()),
                           Column('fetch', Text()),
                           Column('process', Text()),
                           Column('track', Text()),
                           Column('lastcrawltime', Float(32)),
                           Column('updatetime', Float(32)),
                           mysql_engine='InnoDB',
                           mysql_charset='utf8'
                           )

        self.url = make_url(url)
        if self.url.database:
            database = self.url.database
            self.url.database = None
            try:
                engine = create_engine(self.url, convert_unicode=True, pool_recycle=3600)
                conn = engine.connect()

SYMBOL INDEX (1327 symbols across 95 files)

FILE: docs/conf.py
  class Mock (line 12) | class Mock(MagicMock):
    method __getattr__ (line 14) | def __getattr__(cls, name):

FILE: pyspider/database/__init__.py
  function connect_database (line 12) | def connect_database(url):
  function _connect_database (line 52) | def _connect_database(url):  # NOQA
  function _connect_mysql (line 102) | def _connect_mysql(parsed,dbtype):
  function _connect_sqlite (line 128) | def _connect_sqlite(parsed,dbtype):
  function _connect_mongodb (line 151) | def _connect_mongodb(parsed,dbtype,url):
  function _connect_sqlalchemy (line 170) | def _connect_sqlalchemy(parsed, dbtype,url, other_scheme):
  function _connect_elasticsearch (line 187) | def _connect_elasticsearch(parsed, dbtype):
  function _connect_couchdb (line 209) | def _connect_couchdb(parsed, dbtype, url):

FILE: pyspider/database/base/projectdb.py
  class ProjectDB (line 28) | class ProjectDB(object):
    method insert (line 37) | def insert(self, name, obj={}):
    method update (line 40) | def update(self, name, obj={}, **kwargs):
    method get_all (line 43) | def get_all(self, fields=None):
    method get (line 46) | def get(self, name, fields):
    method drop (line 49) | def drop(self, name):
    method check_update (line 52) | def check_update(self, timestamp, fields=None):
    method split_group (line 55) | def split_group(self, group, lower=True):
    method verify_project_name (line 61) | def verify_project_name(self, name):
    method copy (line 68) | def copy(self):

FILE: pyspider/database/base/resultdb.py
  class ResultDB (line 20) | class ResultDB(object):
    method save (line 26) | def save(self, project, taskid, url, result):
    method select (line 29) | def select(self, project, fields=None, offset=0, limit=None):
    method count (line 32) | def count(self, project):
    method get (line 35) | def get(self, project, taskid, fields=None):
    method drop (line 38) | def drop(self, project):
    method copy (line 41) | def copy(self):

FILE: pyspider/database/base/taskdb.py
  class TaskDB (line 59) | class TaskDB(object):
    method load_tasks (line 67) | def load_tasks(self, status, project=None, fields=None):
    method get_task (line 70) | def get_task(self, project, taskid, fields=None):
    method status_count (line 73) | def status_count(self, project):
    method insert (line 79) | def insert(self, project, taskid, obj={}):
    method update (line 82) | def update(self, project, taskid, obj={}, **kwargs):
    method drop (line 85) | def drop(self, project):
    method status_to_string (line 89) | def status_to_string(status):
    method status_to_int (line 98) | def status_to_int(status):
    method copy (line 106) | def copy(self):

FILE: pyspider/database/basedb.py
  class BaseDB (line 17) | class BaseDB:
    method escape (line 29) | def escape(string):
    method dbcur (line 33) | def dbcur(self):
    method _execute (line 36) | def _execute(self, sql_query, values=[]):
    method _select (line 41) | def _select(self, tablename=None, what="*", where="", where_values=[],...
    method _select2dic (line 58) | def _select2dic(self, tablename=None, what="*", where="", where_values...
    method _replace (line 84) | def _replace(self, tablename=None, **values):
    method _insert (line 100) | def _insert(self, tablename=None, **values):
    method _update (line 116) | def _update(self, tablename=None, where="1=0", where_values=[], **valu...
    method _delete (line 126) | def _delete(self, tablename=None, where="1=0", where_values=[]):
  class DB (line 138) | class DB(BaseDB):
    method __init__ (line 142) | def __init__(self):
    method dbcur (line 151) | def dbcur(self):

FILE: pyspider/database/couchdb/couchdbbase.py
  class SplitTableMixin (line 4) | class SplitTableMixin(object):
    method __init__ (line 7) | def __init__(self):
    method _collection_name (line 13) | def _collection_name(self, project):
    method projects (line 21) | def projects(self):
    method projects (line 28) | def projects(self, value):
    method _list_project (line 32) | def _list_project(self):
    method create_database (line 49) | def create_database(self, name):
    method get_doc (line 57) | def get_doc(self, db_name, doc_id):
    method get_docs (line 65) | def get_docs(self, db_name, selector):
    method get_all_docs (line 74) | def get_all_docs(self, db_name):
    method insert_doc (line 78) | def insert_doc(self, db_name, doc_id, doc):
    method update_doc (line 83) | def update_doc(self, db_name, doc_id, new_doc):
    method delete (line 93) | def delete(self, url):

FILE: pyspider/database/couchdb/projectdb.py
  class ProjectDB (line 6) | class ProjectDB(BaseProjectDB):
    method __init__ (line 9) | def __init__(self, url, database='projectdb', username=None, password=...
    method _default_fields (line 37) | def _default_fields(self, each):
    method insert (line 49) | def insert(self, name, obj={}):
    method update (line 57) | def update(self, name, obj={}, **kwargs):
    method get_all (line 69) | def get_all(self, fields=None):
    method get (line 82) | def get(self, name, fields=None):
    method check_update (line 97) | def check_update(self, timestamp, fields=None):
    method drop (line 105) | def drop(self, name):
    method drop_database (line 111) | def drop_database(self):

FILE: pyspider/database/couchdb/resultdb.py
  class ResultDB (line 6) | class ResultDB(SplitTableMixin, BaseResultDB):
    method __init__ (line 9) | def __init__(self, url, database='resultdb', username=None, password=N...
    method _get_collection_name (line 20) | def _get_collection_name(self, project):
    method _create_project (line 23) | def _create_project(self, project):
    method save (line 38) | def save(self, project, taskid, url, result):
    method select (line 50) | def select(self, project, fields=None, offset=0, limit=0):
    method count (line 76) | def count(self, project):
    method get (line 84) | def get(self, project, taskid, fields=None):
    method drop_database (line 101) | def drop_database(self):
    method drop (line 104) | def drop(self, project):

FILE: pyspider/database/couchdb/taskdb.py
  class TaskDB (line 6) | class TaskDB(SplitTableMixin, BaseTaskDB):
    method __init__ (line 9) | def __init__(self, url, database='taskdb', username=None, password=None):
    method _get_collection_name (line 23) | def _get_collection_name(self, project):
    method _create_project (line 26) | def _create_project(self, project):
    method load_tasks (line 40) | def load_tasks(self, status, project=None, fields=None):
    method get_task (line 54) | def get_task(self, project, taskid, fields=None):
    method status_count (line 67) | def status_count(self, project):
    method insert (line 88) | def insert(self, project, taskid, obj={}):
    method update (line 97) | def update(self, project, taskid, obj={}, **kwargs):
    method drop_database (line 104) | def drop_database(self):
    method drop (line 107) | def drop(self, project):

FILE: pyspider/database/elasticsearch/projectdb.py
  class ProjectDB (line 15) | class ProjectDB(BaseProjectDB):
    method __init__ (line 18) | def __init__(self, hosts, index='pyspider'):
    method insert (line 31) | def insert(self, name, obj={}):
    method update (line 46) | def update(self, name, obj={}, **kwargs):
    method get_all (line 53) | def get_all(self, fields=None):
    method get (line 59) | def get(self, name, fields=None):
    method check_update (line 64) | def check_update(self, timestamp, fields=None):
    method drop (line 71) | def drop(self, name):

FILE: pyspider/database/elasticsearch/resultdb.py
  class ResultDB (line 16) | class ResultDB(BaseResultDB):
    method __init__ (line 19) | def __init__(self, hosts, index='pyspider'):
    method projects (line 35) | def projects(self):
    method save (line 42) | def save(self, project, taskid, url, result):
    method select (line 53) | def select(self, project, fields=None, offset=0, limit=0):
    method count (line 70) | def count(self, project):
    method get (line 75) | def get(self, project, taskid, fields=None):
    method drop (line 80) | def drop(self, project):
    method refresh (line 87) | def refresh(self):

FILE: pyspider/database/elasticsearch/taskdb.py
  class TaskDB (line 17) | class TaskDB(BaseTaskDB):
    method __init__ (line 20) | def __init__(self, hosts, index='pyspider'):
    method _parse (line 35) | def _parse(self, data):
    method _stringify (line 46) | def _stringify(self, data):
    method projects (line 53) | def projects(self):
    method load_tasks (line 60) | def load_tasks(self, status, project=None, fields=None):
    method get_task (line 75) | def get_task(self, project, taskid, fields=None):
    method status_count (line 82) | def status_count(self, project):
    method insert (line 94) | def insert(self, project, taskid, obj={}):
    method update (line 103) | def update(self, project, taskid, obj={}, **kwargs):
    method drop (line 111) | def drop(self, project):
    method refresh (line 119) | def refresh(self):

FILE: pyspider/database/local/projectdb.py
  class ProjectDB (line 17) | class ProjectDB(BaseProjectDB):
    method __init__ (line 20) | def __init__(self, files):
    method load_scripts (line 25) | def load_scripts(self):
    method _build_project (line 45) | def _build_project(self, filename):
    method get_all (line 75) | def get_all(self, fields=None):
    method get (line 79) | def get(self, name, fields=None):
    method check_update (line 91) | def check_update(self, timestamp, fields=None):

FILE: pyspider/database/mongodb/mongodbbase.py
  class SplitTableMixin (line 11) | class SplitTableMixin(object):
    method _collection_name (line 14) | def _collection_name(self, project):
    method projects (line 21) | def projects(self):
    method projects (line 27) | def projects(self, value):
    method _list_project (line 30) | def _list_project(self):
    method drop (line 43) | def drop(self, project):

FILE: pyspider/database/mongodb/projectdb.py
  class ProjectDB (line 14) | class ProjectDB(BaseProjectDB):
    method __init__ (line 17) | def __init__(self, url, database='projectdb'):
    method _default_fields (line 25) | def _default_fields(self, each):
    method insert (line 37) | def insert(self, name, obj={}):
    method update (line 43) | def update(self, name, obj={}, **kwargs):
    method get_all (line 49) | def get_all(self, fields=None):
    method get (line 55) | def get(self, name, fields=None):
    method check_update (line 61) | def check_update(self, timestamp, fields=None):
    method drop (line 67) | def drop(self, name):

FILE: pyspider/database/mongodb/resultdb.py
  class ResultDB (line 17) | class ResultDB(SplitTableMixin, BaseResultDB):
    method __init__ (line 20) | def __init__(self, url, database='resultdb'):
    method _create_project (line 34) | def _create_project(self, project):
    method _parse (line 39) | def _parse(self, data):
    method _stringify (line 45) | def _stringify(self, data):
    method save (line 50) | def save(self, project, taskid, url, result):
    method select (line 64) | def select(self, project, fields=None, offset=0, limit=0):
    method count (line 75) | def count(self, project):
    method get (line 83) | def get(self, project, taskid, fields=None):

FILE: pyspider/database/mongodb/taskdb.py
  class TaskDB (line 17) | class TaskDB(SplitTableMixin, BaseTaskDB):
    method __init__ (line 20) | def __init__(self, url, database='taskdb'):
    method _create_project (line 34) | def _create_project(self, project):
    method _parse (line 40) | def _parse(self, data):
    method _stringify (line 53) | def _stringify(self, data):
    method load_tasks (line 59) | def load_tasks(self, status, project=None, fields=None):
    method get_task (line 73) | def get_task(self, project, taskid, fields=None):
    method status_count (line 84) | def status_count(self, project):
    method insert (line 123) | def insert(self, project, taskid, obj={}):
    method update (line 132) | def update(self, project, taskid, obj={}, **kwargs):

FILE: pyspider/database/mysql/mysqlbase.py
  class MySQLMixin (line 12) | class MySQLMixin(object):
    method dbcur (line 16) | def dbcur(self):
  class SplitTableMixin (line 29) | class SplitTableMixin(object):
    method _tablename (line 32) | def _tablename(self, project):
    method projects (line 39) | def projects(self):
    method projects (line 46) | def projects(self, value):
    method _list_project (line 49) | def _list_project(self):
    method drop (line 61) | def drop(self, project):

FILE: pyspider/database/mysql/projectdb.py
  class ProjectDB (line 16) | class ProjectDB(MySQLMixin, BaseProjectDB, BaseDB):
    method __init__ (line 19) | def __init__(self, host='localhost', port=3306, database='projectdb',
    method insert (line 39) | def insert(self, name, obj={}):
    method update (line 45) | def update(self, name, obj={}, **kwargs):
    method get_all (line 52) | def get_all(self, fields=None):
    method get (line 55) | def get(self, name, fields=None):
    method drop (line 61) | def drop(self, name):
    method check_update (line 65) | def check_update(self, timestamp, fields=None):

FILE: pyspider/database/mysql/resultdb.py
  class ResultDB (line 20) | class ResultDB(MySQLMixin, SplitTableMixin, BaseResultDB, BaseDB):
    method __init__ (line 23) | def __init__(self, host='localhost', port=3306, database='resultdb',
    method _create_project (line 33) | def _create_project(self, project):
    method _parse (line 45) | def _parse(self, data):
    method _stringify (line 53) | def _stringify(self, data):
    method save (line 58) | def save(self, project, taskid, url, result):
    method select (line 71) | def select(self, project, fields=None, offset=0, limit=None):
    method count (line 82) | def count(self, project):
    method get (line 91) | def get(self, project, taskid, fields=None):

FILE: pyspider/database/mysql/taskdb.py
  class TaskDB (line 21) | class TaskDB(MySQLMixin, SplitTableMixin, BaseTaskDB, BaseDB):
    method __init__ (line 24) | def __init__(self, host='localhost', port=3306, database='taskdb',
    method _create_project (line 34) | def _create_project(self, project):
    method _parse (line 53) | def _parse(self, data):
    method _stringify (line 65) | def _stringify(self, data):
    method load_tasks (line 71) | def load_tasks(self, status, project=None, fields=None):
    method get_task (line 88) | def get_task(self, project, taskid, fields=None):
    method status_count (line 99) | def status_count(self, project):
    method insert (line 111) | def insert(self, project, taskid, obj={}):
    method update (line 124) | def update(self, project, taskid, obj={}, **kwargs):

FILE: pyspider/database/redis/taskdb.py
  class TaskDB (line 19) | class TaskDB(BaseTaskDB):
    method __init__ (line 23) | def __init__(self, host='localhost', port=6379, db=0):
    method _gen_key (line 32) | def _gen_key(self, project, taskid):
    method _gen_status_key (line 35) | def _gen_status_key(self, project, status):
    method _parse (line 38) | def _parse(self, data):
    method _stringify (line 61) | def _stringify(self, data):
    method projects (line 68) | def projects(self):
    method load_tasks (line 75) | def load_tasks(self, status, project=None, fields=None):
    method get_task (line 105) | def get_task(self, project, taskid, fields=None):
    method status_count (line 118) | def status_count(self, project):
    method insert (line 133) | def insert(self, project, taskid, obj={}):
    method update (line 149) | def update(self, project, taskid, obj={}, **kwargs):
    method drop (line 164) | def drop(self, project):

FILE: pyspider/database/sqlalchemy/projectdb.py
  class ProjectDB (line 19) | class ProjectDB(BaseProjectDB):
    method __init__ (line 22) | def __init__(self, url):
    method _parse (line 52) | def _parse(data):
    method _stringify (line 56) | def _stringify(data):
    method insert (line 59) | def insert(self, name, obj={}):
    method update (line 66) | def update(self, name, obj={}, **kwargs):
    method get_all (line 74) | def get_all(self, fields=None):
    method get (line 80) | def get(self, name, fields=None):
    method drop (line 88) | def drop(self, name):
    method check_update (line 92) | def check_update(self, timestamp, fields=None):

FILE: pyspider/database/sqlalchemy/resultdb.py
  class ResultDB (line 22) | class ResultDB(SplitTableMixin, BaseResultDB):
    method __init__ (line 25) | def __init__(self, url):
    method _create_project (line 52) | def _create_project(self, project):
    method _parse (line 60) | def _parse(data):
    method _stringify (line 72) | def _stringify(data):
    method save (line 80) | def save(self, project, taskid, url, result):
    method select (line 100) | def select(self, project, fields=None, offset=0, limit=None):
    method count (line 115) | def count(self, project):
    method get (line 125) | def get(self, project, taskid, fields=None):

FILE: pyspider/database/sqlalchemy/sqlalchemybase.py
  function result2dict (line 11) | def result2dict(columns, task):
  class SplitTableMixin (line 15) | class SplitTableMixin(object):
    method _tablename (line 18) | def _tablename(self, project):
    method projects (line 25) | def projects(self):
    method projects (line 32) | def projects(self, value):
    method _list_project (line 35) | def _list_project(self):
    method drop (line 48) | def drop(self, project):

FILE: pyspider/database/sqlalchemy/taskdb.py
  class TaskDB (line 22) | class TaskDB(SplitTableMixin, BaseTaskDB):
    method __init__ (line 25) | def __init__(self, url):
    method _create_project (line 57) | def _create_project(self, project):
    method _parse (line 67) | def _parse(data):
    method _stringify (line 80) | def _stringify(data):
    method load_tasks (line 89) | def load_tasks(self, status, project=None, fields=None):
    method get_task (line 106) | def get_task(self, project, taskid, fields=None):
    method status_count (line 120) | def status_count(self, project):
    method insert (line 135) | def insert(self, project, taskid, obj={}):
    method update (line 149) | def update(self, project, taskid, obj={}, **kwargs):

FILE: pyspider/database/sqlite/projectdb.py
  class ProjectDB (line 15) | class ProjectDB(SQLiteMixin, BaseProjectDB, BaseDB):
    method __init__ (line 19) | def __init__(self, path):
    method insert (line 30) | def insert(self, name, obj={}):
    method update (line 36) | def update(self, name, obj={}, **kwargs):
    method get_all (line 43) | def get_all(self, fields=None):
    method get (line 46) | def get(self, name, fields=None):
    method check_update (line 52) | def check_update(self, timestamp, fields=None):
    method drop (line 56) | def drop(self, name):

FILE: pyspider/database/sqlite/resultdb.py
  class ResultDB (line 17) | class ResultDB(SQLiteMixin, SplitTableMixin, BaseResultDB, BaseDB):
    method __init__ (line 21) | def __init__(self, path):
    method _create_project (line 27) | def _create_project(self, project):
    method _parse (line 37) | def _parse(self, data):
    method _stringify (line 42) | def _stringify(self, data):
    method save (line 47) | def save(self, project, taskid, url, result):
    method select (line 60) | def select(self, project, fields=None, offset=0, limit=None):
    method count (line 71) | def count(self, project):
    method get (line 80) | def get(self, project, taskid, fields=None):

FILE: pyspider/database/sqlite/sqlitebase.py
  class SQLiteMixin (line 14) | class SQLiteMixin(object):
    method dbcur (line 17) | def dbcur(self):
  class SplitTableMixin (line 25) | class SplitTableMixin(object):
    method _tablename (line 28) | def _tablename(self, project):
    method projects (line 35) | def projects(self):
    method projects (line 42) | def projects(self, value):
    method _list_project (line 45) | def _list_project(self):
    method drop (line 58) | def drop(self, project):

FILE: pyspider/database/sqlite/taskdb.py
  class TaskDB (line 17) | class TaskDB(SQLiteMixin, SplitTableMixin, BaseTaskDB, BaseDB):
    method __init__ (line 21) | def __init__(self, path):
    method _create_project (line 27) | def _create_project(self, project):
    method _parse (line 42) | def _parse(self, data):
    method _stringify (line 51) | def _stringify(self, data):
    method load_tasks (line 57) | def load_tasks(self, status, project=None, fields=None):
    method get_task (line 72) | def get_task(self, project, taskid, fields=None):
    method status_count (line 85) | def status_count(self, project):
    method insert (line 100) | def insert(self, project, taskid, obj={}):
    method update (line 111) | def update(self, project, taskid, obj={}, **kwargs):

FILE: pyspider/fetcher/cookie_utils.py
  class MockResponse (line 11) | class MockResponse(object):
    method __init__ (line 13) | def __init__(self, headers):
    method info (line 16) | def info(self):
    method getheaders (line 19) | def getheaders(self, name):
    method get_all (line 23) | def get_all(self, name, default=None):
  function extract_cookies_to_jar (line 30) | def extract_cookies_to_jar(jar, request, response):

FILE: pyspider/fetcher/phantomjs_fetcher.js
  function make_result (line 135) | function make_result(page) {
  function _make_result (line 179) | function _make_result(page) {

FILE: pyspider/fetcher/puppeteer_fetcher.js
  function fetch (line 35) | async function fetch(options) {
  function _fetch (line 51) | async function _fetch(page, options) {
  function make_result (line 146) | async function make_result(page, options, error) {

FILE: pyspider/fetcher/tornado_fetcher.py
  class MyCurlAsyncHTTPClient (line 39) | class MyCurlAsyncHTTPClient(CurlAsyncHTTPClient):
    method free_size (line 41) | def free_size(self):
    method size (line 44) | def size(self):
  class MySimpleAsyncHTTPClient (line 48) | class MySimpleAsyncHTTPClient(SimpleAsyncHTTPClient):
    method free_size (line 50) | def free_size(self):
    method size (line 53) | def size(self):
  class Fetcher (line 66) | class Fetcher(object):
    method __init__ (line 81) | def __init__(self, inqueue, outqueue, poolsize=100, proxy=None, async_...
    method send_result (line 108) | def send_result(self, type, task, result):
    method fetch (line 116) | def fetch(self, task, callback=None):
    method async_fetch (line 123) | def async_fetch(self, task, callback=None):
    method sync_fetch (line 155) | def sync_fetch(self, task):
    method data_fetch (line 178) | def data_fetch(self, url, task):
    method handle_error (line 202) | def handle_error(self, type, url, task, start_time, error):
    method pack_tornado_request_parameters (line 220) | def pack_tornado_request_parameters(self, url, task):
    method can_fetch (line 290) | def can_fetch(self, user_agent, url):
    method clear_robot_txt_cache (line 320) | def clear_robot_txt_cache(self):
    method http_fetch (line 327) | def http_fetch(self, url, task):
    method phantomjs_fetch (line 431) | def phantomjs_fetch(self, url, task):
    method splash_fetch (line 532) | def splash_fetch(self, url, task):
    method puppeteer_fetch (line 640) | def puppeteer_fetch(self, url, task):
    method run (line 743) | def run(self):
    method quit (line 780) | def quit(self):
    method size (line 789) | def size(self):
    method xmlrpc_run (line 792) | def xmlrpc_run(self, port=24444, bind='127.0.0.1', logRequests=False):
    method on_fetch (line 827) | def on_fetch(self, type, task):
    method on_result (line 831) | def on_result(self, type, task, result):

FILE: pyspider/libs/ListIO.py
  class ListO (line 9) | class ListO(object):
    method __init__ (line 13) | def __init__(self, buffer=None):
    method isatty (line 18) | def isatty(self):
    method close (line 21) | def close(self):
    method flush (line 24) | def flush(self):
    method seek (line 27) | def seek(self, n, mode=0):
    method readline (line 30) | def readline(self):
    method reset (line 33) | def reset(self):
    method write (line 36) | def write(self, x):
    method writelines (line 39) | def writelines(self, x):

FILE: pyspider/libs/base_handler.py
  function catch_status_code_error (line 26) | def catch_status_code_error(func):
  function not_send_status (line 35) | def not_send_status(func):
  function config (line 49) | def config(_config=None, **kwargs):
  class NOTSET (line 64) | class NOTSET(object):
  function every (line 68) | def every(minutes=NOTSET, seconds=NOTSET):
  class BaseHandlerMeta (line 100) | class BaseHandlerMeta(type):
    method __new__ (line 102) | def __new__(cls, name, bases, attrs):
  class BaseHandler (line 123) | class BaseHandler(object):
    method _reset (line 136) | def _reset(self):
    method _run_func (line 145) | def _run_func(self, function, *arguments):
    method _run_task (line 160) | def _run_task(self, task, response):
    method run_task (line 178) | def run_task(self, module, task, response):
    method task_join_crawl_config (line 228) | def task_join_crawl_config(task, crawl_config):
    method _crawl (line 255) | def _crawl(self, url, **kwargs):
    method get_taskid (line 342) | def get_taskid(self, task):
    method crawl (line 347) | def crawl(self, url, **kwargs):
    method is_debugger (line 400) | def is_debugger(self):
    method send_message (line 404) | def send_message(self, project, msg, url='data:,on_message'):
    method on_message (line 408) | def on_message(self, project, msg):
    method on_result (line 412) | def on_result(self, result):
    method on_finished (line 422) | def on_finished(self, response, task):
    method _on_message (line 430) | def _on_message(self, response):
    method _on_cronjob (line 435) | def _on_cronjob(self, response, task):
    method _on_get_info (line 451) | def _on_get_info(self, response, task):

FILE: pyspider/libs/bench.py
  function bench_test_taskdb (line 22) | def bench_test_taskdb(taskdb):
  function bench_test_message_queue (line 129) | def bench_test_message_queue(queue):
  class BenchMixin (line 190) | class BenchMixin(object):
    method _bench_init (line 192) | def _bench_init(self):
    method _bench_report (line 198) | def _bench_report(self, name, prefix=0, rjust=0):
  class BenchScheduler (line 213) | class BenchScheduler(Scheduler, BenchMixin):
    method __init__ (line 214) | def __init__(self, *args, **kwargs):
    method on_task_status (line 218) | def on_task_status(self, task):
  class BenchFetcher (line 223) | class BenchFetcher(Fetcher, BenchMixin):
    method __init__ (line 224) | def __init__(self, *args, **kwargs):
    method on_result (line 228) | def on_result(self, type, task, result):
  class BenchProcessor (line 233) | class BenchProcessor(Processor, BenchMixin):
    method __init__ (line 234) | def __init__(self, *args, **kwargs):
    method on_task (line 238) | def on_task(self, task, response):
  class BenchResultWorker (line 243) | class BenchResultWorker(ResultWorker, BenchMixin):
    method __init__ (line 244) | def __init__(self, *args, **kwargs):
    method on_result (line 248) | def on_result(self, task, result):
  class Handler (line 256) | class Handler(BaseHandler):
    method on_start (line 257) | def on_start(self, response):
    method index_page (line 262) | def index_page(self, response):

FILE: pyspider/libs/counter.py
  class BaseCounter (line 23) | class BaseCounter(object):
    method __init__ (line 25) | def __init__(self):
    method event (line 28) | def event(self, value=1):
    method value (line 32) | def value(self, value):
    method avg (line 37) | def avg(self):
    method sum (line 42) | def sum(self):
    method empty (line 46) | def empty(self):
  class TotalCounter (line 51) | class TotalCounter(BaseCounter):
    method __init__ (line 54) | def __init__(self):
    method event (line 58) | def event(self, value=1):
    method value (line 61) | def value(self, value):
    method avg (line 65) | def avg(self):
    method sum (line 69) | def sum(self):
    method empty (line 72) | def empty(self):
  class AverageWindowCounter (line 76) | class AverageWindowCounter(BaseCounter):
    method __init__ (line 81) | def __init__(self, window_size=300):
    method event (line 86) | def event(self, value=1):
    method avg (line 92) | def avg(self):
    method sum (line 96) | def sum(self):
    method empty (line 99) | def empty(self):
  class TimebaseAverageEventCounter (line 104) | class TimebaseAverageEventCounter(BaseCounter):
    method __init__ (line 111) | def __init__(self, window_size=30, window_interval=10):
    method event (line 125) | def event(self, value=1):
    method value (line 147) | def value(self, value):
    method _trim_window (line 150) | def _trim_window(self):
    method avg (line 170) | def avg(self):
    method sum (line 177) | def sum(self):
    method empty (line 181) | def empty(self):
    method on_append (line 186) | def on_append(self, value, time):
  class TimebaseAverageWindowCounter (line 190) | class TimebaseAverageWindowCounter(BaseCounter):
    method __init__ (line 197) | def __init__(self, window_size=30, window_interval=10):
    method event (line 209) | def event(self, value=1):
    method value (line 227) | def value(self, value):
    method _trim_window (line 230) | def _trim_window(self):
    method avg (line 248) | def avg(self):
    method sum (line 255) | def sum(self):
    method empty (line 259) | def empty(self):
    method on_append (line 264) | def on_append(self, value, time):
  class CounterValue (line 268) | class CounterValue(DictMixin):
    method __init__ (line 273) | def __init__(self, manager, keys):
    method __getitem__ (line 277) | def __getitem__(self, key):
    method __len__ (line 299) | def __len__(self):
    method __iter__ (line 302) | def __iter__(self):
    method __contains__ (line 305) | def __contains__(self, key):
    method keys (line 308) | def keys(self):
    method to_dict (line 316) | def to_dict(self, get_value=None):
  class CounterManager (line 329) | class CounterManager(DictMixin):
    method __init__ (line 340) | def __init__(self, cls=TimebaseAverageWindowCounter):
    method event (line 345) | def event(self, key, value=1):
    method value (line 355) | def value(self, key, value=1):
    method trim (line 366) | def trim(self):
    method __getitem__ (line 372) | def __getitem__(self, key):
    method __delitem__ (line 389) | def __delitem__(self, key):
    method __iter__ (line 398) | def __iter__(self):
    method __len__ (line 401) | def __len__(self):
    method keys (line 404) | def keys(self):
    method to_dict (line 410) | def to_dict(self, get_value=None):
    method dump (line 423) | def dump(self, filename):
    method load (line 433) | def load(self, filename):

FILE: pyspider/libs/dataurl.py
  function encode (line 14) | def encode(data, mime_type='', charset='utf-8', base64=True):
  function decode (line 41) | def decode(data_url):

FILE: pyspider/libs/log.py
  class LogFormatter (line 18) | class LogFormatter(_LogFormatter, object):
    method __init__ (line 20) | def __init__(self, fmt=None, datefmt=None, color=True, *args, **kwargs):
  class SaveLogHandler (line 26) | class SaveLogHandler(logging.Handler):
    method __init__ (line 29) | def __init__(self, saveto=None, *args, **kwargs):
    method emit (line 33) | def emit(self, record):
  function enable_pretty_logging (line 40) | def enable_pretty_logging(logger=logging.getLogger()):

FILE: pyspider/libs/multiprocessing_queue.py
  class SharedCounter (line 10) | class SharedCounter(object):
    method __init__ (line 22) | def __init__(self, n=0):
    method increment (line 25) | def increment(self, n=1):
    method value (line 31) | def value(self):
  class MultiProcessingQueue (line 36) | class MultiProcessingQueue(BaseQueue):
    method __init__ (line 47) | def __init__(self, *args, **kwargs):
    method put (line 51) | def put(self, *args, **kwargs):
    method get (line 55) | def get(self, *args, **kwargs):
    method qsize (line 60) | def qsize(self):
  function Queue (line 67) | def Queue(maxsize=0):
  function Queue (line 70) | def Queue(maxsize=0):

FILE: pyspider/libs/pprint.py
  function pprint (line 54) | def pprint(object, stream=None, indent=1, width=80, depth=None):
  function pformat (line 61) | def pformat(object, indent=1, width=80, depth=None):
  function saferepr (line 66) | def saferepr(object):
  function isreadable (line 71) | def isreadable(object):
  function isrecursive (line 76) | def isrecursive(object):
  function _sorted (line 81) | def _sorted(iterable):
  class PrettyPrinter (line 85) | class PrettyPrinter:
    method __init__ (line 87) | def __init__(self, indent=1, width=80, depth=None, stream=None):
    method pprint (line 118) | def pprint(self, object):
    method pformat (line 122) | def pformat(self, object):
    method isrecursive (line 127) | def isrecursive(self, object):
    method isreadable (line 130) | def isreadable(self, object):
    method _format (line 134) | def _format(self, object, stream, indent, allowance, context, level):
    method _repr (line 234) | def _repr(self, object, context, level):
    method format (line 243) | def format(self, object, context, maxlevels, level):
  function _safe_repr (line 253) | def _safe_repr(object, context, maxlevels, level):
  function _recursion (line 359) | def _recursion(object):
  function _perfcheck (line 364) | def _perfcheck(object=None):

FILE: pyspider/libs/response.py
  class Response (line 22) | class Response(object):
    method __init__ (line 24) | def __init__(self, status_code=None, url=None, orig_url=None, headers=...
    method __repr__ (line 40) | def __repr__(self):
    method __bool__ (line 43) | def __bool__(self):
    method __nonzero__ (line 47) | def __nonzero__(self):
    method ok (line 52) | def ok(self):
    method encoding (line 61) | def encoding(self):
    method encoding (line 89) | def encoding(self, value):
    method text (line 98) | def text(self):
    method json (line 129) | def json(self):
    method doc (line 140) | def doc(self):
    method etree (line 150) | def etree(self):
    method raise_for_status (line 165) | def raise_for_status(self, allow_redirects=True):
    method isok (line 186) | def isok(self):
  function rebuild_response (line 194) | def rebuild_response(r):
  function get_encoding (line 211) | def get_encoding(headers, content):

FILE: pyspider/libs/result_dump.py
  function result_formater (line 16) | def result_formater(results):
  function dump_as_json (line 46) | def dump_as_json(results, valid=False):
  function dump_as_txt (line 64) | def dump_as_txt(results):
  function dump_as_csv (line 72) | def dump_as_csv(results):

FILE: pyspider/libs/sample_handler.py
  class Handler (line 9) | class Handler(BaseHandler):
    method on_start (line 14) | def on_start(self):
    method index_page (line 18) | def index_page(self, response):
    method detail_page (line 23) | def detail_page(self, response):

FILE: pyspider/libs/url.py
  function get_content_type (line 16) | def get_content_type(filename):
  function _encode_multipart_formdata (line 24) | def _encode_multipart_formdata(fields, files):
  function _build_url (line 29) | def _build_url(url, _params):
  function quote_chinese (line 62) | def quote_chinese(url, encodeing="utf-8"):
  function curl_to_arguments (line 73) | def curl_to_arguments(curl):

FILE: pyspider/libs/utils.py
  class ReadOnlyDict (line 23) | class ReadOnlyDict(dict):
    method __setitem__ (line 26) | def __setitem__(self, key, value):
  function getitem (line 30) | def getitem(obj, key=0, default=None):
  function hide_me (line 38) | def hide_me(tb, g=globals()):
  function run_in_thread (line 54) | def run_in_thread(func, *args, **kwargs):
  function run_in_subprocess (line 63) | def run_in_subprocess(func, *args, **kwargs):
  function format_date (line 72) | def format_date(date, gmt_offset=0, relative=True, shorter=False, full_f...
  function fix_full_format (line 133) | def fix_full_format(days, seconds, relative, shorter, local_date, local_...
  class TimeoutError (line 160) | class TimeoutError(Exception):
  class timeout (line 168) | class timeout:
    method __init__ (line 176) | def __init__(self, seconds=1, error_message='Timeout'):
    method handle_timeout (line 180) | def handle_timeout(self, signum, frame):
    method __enter__ (line 183) | def __enter__(self):
    method __exit__ (line 191) | def __exit__(self, type, value, traceback):
  class timeout (line 198) | class timeout:
    method __init__ (line 203) | def __init__(self, seconds=1, error_message='Timeout'):
    method __enter__ (line 206) | def __enter__(self):
    method __exit__ (line 209) | def __exit__(self, type, value, traceback):
  function utf8 (line 213) | def utf8(string):
  function text (line 227) | def text(string, encoding='utf8'):
  function pretty_unicode (line 241) | def pretty_unicode(string):
  function unicode_string (line 253) | def unicode_string(string):
  function unicode_dict (line 267) | def unicode_dict(_dict):
  function unicode_list (line 277) | def unicode_list(_list):
  function unicode_obj (line 284) | def unicode_obj(obj):
  function decode_unicode_string (line 307) | def decode_unicode_string(string):
  function decode_unicode_obj (line 316) | def decode_unicode_obj(obj):
  class Get (line 333) | class Get(object):
    method __init__ (line 338) | def __init__(self, getter):
    method __get__ (line 341) | def __get__(self, instance, owner):
  class ObjectDict (line 345) | class ObjectDict(dict):
    method __getattr__ (line 352) | def __getattr__(self, name):
  function load_object (line 359) | def load_object(name):
  function get_python_console (line 373) | def get_python_console(namespace=None):
  function python_console (line 418) | def python_console(namespace=None):
  function check_port_open (line 434) | def check_port_open(port, addr='127.0.0.1'):

FILE: pyspider/libs/wsgi_xmlrpc.py
  class WSGIXMLRPCApplication (line 24) | class WSGIXMLRPCApplication(object):
    method __init__ (line 27) | def __init__(self, instance=None, methods=None):
    method register_instance (line 42) | def register_instance(self, instance):
    method register_function (line 45) | def register_function(self, function, name=None):
    method handler (line 48) | def handler(self, environ, start_response):
    method handle_POST (line 57) | def handle_POST(self, environ, start_response):
    method __call__ (line 94) | def __call__(self, environ, start_response):

FILE: pyspider/message_queue/__init__.py
  function connect_message_queue (line 16) | def connect_message_queue(name, url=None, maxsize=0, lazy_limit=True):

FILE: pyspider/message_queue/kombu_queue.py
  class KombuQueue (line 20) | class KombuQueue(object):
    method __init__ (line 31) | def __init__(self, name, url="amqp://", maxsize=0, lazy_limit=True):
    method qsize (line 51) | def qsize(self):
    method empty (line 57) | def empty(self):
    method full (line 63) | def full(self):
    method put (line 69) | def put(self, obj, block=True, timeout=None):
    method put_nowait (line 87) | def put_nowait(self, obj):
    method get (line 96) | def get(self, block=True, timeout=None):
    method get_nowait (line 103) | def get_nowait(self):
    method delete (line 110) | def delete(self):
    method __del__ (line 113) | def __del__(self):

FILE: pyspider/message_queue/rabbitmq.py
  function catch_error (line 24) | def catch_error(func):
  class PikaQueue (line 52) | class PikaQueue(object):
    method __init__ (line 61) | def __init__(self, name, amqp_url='amqp://guest:guest@localhost:5672/%...
    method reconnect (line 91) | def reconnect(self):
    method qsize (line 106) | def qsize(self):
    method empty (line 111) | def empty(self):
    method full (line 117) | def full(self):
    method put (line 124) | def put(self, obj, block=True, timeout=None):
    method put_nowait (line 143) | def put_nowait(self, obj):
    method get (line 155) | def get(self, block=True, timeout=None, ack=False):
    method get_nowait (line 174) | def get_nowait(self, ack=False):
    method delete (line 184) | def delete(self):
  class AmqpQueue (line 189) | class AmqpQueue(PikaQueue):
    method __init__ (line 194) | def __init__(self, name, amqp_url='amqp://guest:guest@localhost:5672/%...
    method reconnect (line 224) | def reconnect(self):
    method qsize (line 241) | def qsize(self):
    method put_nowait (line 248) | def put_nowait(self, obj):
    method get_nowait (line 261) | def get_nowait(self, ack=False):

FILE: pyspider/message_queue/redis_queue.py
  class RedisQueue (line 14) | class RedisQueue(object):
    method __init__ (line 23) | def __init__(self, name, host='localhost', port=6379, db=0,
    method qsize (line 43) | def qsize(self):
    method empty (line 47) | def empty(self):
    method full (line 53) | def full(self):
    method put_nowait (line 59) | def put_nowait(self, obj):
    method put (line 67) | def put(self, obj, block=True, timeout=None):
    method get_nowait (line 85) | def get_nowait(self):
    method get (line 91) | def get(self, block=True, timeout=None):

FILE: pyspider/processor/processor.py
  class ProcessorResult (line 23) | class ProcessorResult(object):
    method __init__ (line 26) | def __init__(self, result=None, follows=(), messages=(),
    method rethrow (line 38) | def rethrow(self):
    method logstr (line 44) | def logstr(self):
  class Processor (line 62) | class Processor(object):
    method __init__ (line 69) | def __init__(self, projectdb, inqueue, status_queue, newtask_queue, re...
    method enable_projects_import (line 91) | def enable_projects_import(self):
    method __del__ (line 99) | def __del__(self):
    method on_task (line 102) | def on_task(self, task, response):
    method quit (line 205) | def quit(self):
    method run (line 209) | def run(self):

FILE: pyspider/processor/project_module.py
  class ProjectManager (line 23) | class ProjectManager(object):
    method build_module (line 32) | def build_module(project, env=None):
    method __init__ (line 89) | def __init__(self, projectdb, env):
    method _need_update (line 96) | def _need_update(self, project_name, updatetime=None, md5sum=None):
    method _check_projects (line 108) | def _check_projects(self):
    method _update_project (line 118) | def _update_project(self, project_name):
    method _load_project (line 125) | def _load_project(self, project):
    method get (line 148) | def get(self, project_name, updatetime=None, md5sum=None):
  class ProjectLoader (line 157) | class ProjectLoader(object):
    method __init__ (line 160) | def __init__(self, project, mod=None):
    method load_module (line 166) | def load_module(self, fullname):
    method is_package (line 182) | def is_package(self, fullname):
    method get_code (line 185) | def get_code(self, fullname):
    method get_source (line 188) | def get_source(self, fullname):
  class ProjectFinder (line 196) | class ProjectFinder(object):
    method __init__ (line 199) | def __init__(self, projectdb):
    method projectdb (line 203) | def projectdb(self):
    method find_module (line 206) | def find_module(self, fullname, path=None):
    method load_module (line 218) | def load_module(self, fullname):
    method is_package (line 226) | def is_package(self, fullname):
  class ProjectFinder (line 231) | class ProjectFinder(importlib.abc.MetaPathFinder):
    method __init__ (line 234) | def __init__(self, projectdb):
    method projectdb (line 238) | def projectdb(self):
    method find_spec (line 241) | def find_spec(self, fullname, path, target=None):
    method find_module (line 246) | def find_module(self, fullname, path):
  class ProjectsLoader (line 258) | class ProjectsLoader(importlib.abc.InspectLoader):
    method load_module (line 259) | def load_module(self, fullname):
    method module_repr (line 269) | def module_repr(self, module):
    method is_package (line 272) | def is_package(self, fullname):
    method get_source (line 275) | def get_source(self, path):
    method get_code (line 278) | def get_code(self, fullname):
  class ProjectLoader (line 281) | class ProjectLoader(ProjectLoader, importlib.abc.Loader):
    method create_module (line 282) | def create_module(self, spec):
    method exec_module (line 285) | def exec_module(self, module):
    method module_repr (line 288) | def module_repr(self, module):

FILE: pyspider/result/result_worker.py
  class ResultWorker (line 15) | class ResultWorker(object):
    method __init__ (line 22) | def __init__(self, resultdb, inqueue):
    method on_result (line 27) | def on_result(self, task, result):
    method quit (line 44) | def quit(self):
    method run (line 47) | def run(self):
  class OneResultWorker (line 69) | class OneResultWorker(ResultWorker):
    method on_result (line 71) | def on_result(self, task, result):

FILE: pyspider/run.py
  function read_config (line 25) | def read_config(ctx, param, value):
  function connect_db (line 40) | def connect_db(ctx, param, value):
  function load_cls (line 46) | def load_cls(ctx, param, value):
  function connect_rpc (line 52) | def connect_rpc(ctx, param, value):
  function cli (line 91) | def cli(ctx, **kwargs):
  function scheduler (line 198) | def scheduler(ctx, xmlrpc, no_xmlrpc, xmlrpc_host, xmlrpc_port,
  function fetcher (line 245) | def fetcher(ctx, xmlrpc, no_xmlrpc, xmlrpc_host, xmlrpc_port, poolsize, ...
  function processor (line 286) | def processor(ctx, processor_cls, process_time_limit, enable_stdout_capt...
  function result_worker (line 310) | def result_worker(ctx, result_cls, get_object=False):
  function webui (line 346) | def webui(ctx, host, port, cdn, scheduler_rpc, fetcher_rpc, max_rate, ma...
  function phantomjs (line 414) | def phantomjs(ctx, phantomjs_path, port, auto_restart, args):
  function puppeteer (line 462) | def puppeteer(ctx, port, auto_restart, args):
  function all (line 510) | def all(ctx, fetcher_num, processor_num, result_worker_num, run_in):
  function bench (line 601) | def bench(ctx, fetcher_num, processor_num, result_worker_num, run_in, to...
  function one (line 735) | def one(ctx, interactive, enable_phantomjs, enable_puppeteer, scripts):
  function send_message (line 813) | def send_message(ctx, scheduler_rpc, project, message):
  function main (line 838) | def main():

FILE: pyspider/scheduler/scheduler.py
  class Project (line 26) | class Project(object):
    method __init__ (line 30) | def __init__(self, scheduler, project_info):
    method paused (line 52) | def paused(self):
    method update (line 104) | def update(self, project_info):
    method on_get_info (line 129) | def on_get_info(self, info):
    method active (line 136) | def active(self):
  class Scheduler (line 140) | class Scheduler(object):
    method __init__ (line 170) | def __init__(self, taskdb, projectdb, newtask_queue, status_queue,
    method _update_projects (line 206) | def _update_projects(self):
    method _update_project (line 222) | def _update_project(self, project):
    method _load_tasks (line 263) | def _load_tasks(self, project):
    method _update_project_cnt (line 282) | def _update_project_cnt(self, project_name):
    method task_verify (line 297) | def task_verify(self, task):
    method insert_task (line 317) | def insert_task(self, task):
    method update_task (line 321) | def update_task(self, task):
    method put_task (line 325) | def put_task(self, task):
    method send_task (line 334) | def send_task(self, task, force=True):
    method _check_task_done (line 348) | def _check_task_done(self):
    method _check_request (line 374) | def _check_request(self):
    method _check_cronjob (line 419) | def _check_cronjob(self):
    method _check_select (line 463) | def _check_select(self):
    method _load_put_task (line 568) | def _load_put_task(self, project, taskid):
    method _print_counter_log (line 578) | def _print_counter_log(self):
    method _dump_cnt (line 616) | def _dump_cnt(self):
    method _try_dump_cnt (line 622) | def _try_dump_cnt(self):
    method _check_delete (line 630) | def _check_delete(self):
    method __len__ (line 650) | def __len__(self):
    method quit (line 653) | def quit(self):
    method run_once (line 661) | def run_once(self):
    method run (line 673) | def run(self):
    method trigger_on_start (line 694) | def trigger_on_start(self, project):
    method xmlrpc_run (line 705) | def xmlrpc_run(self, port=23333, bind='127.0.0.1', logRequests=False):
    method on_request (line 813) | def on_request(self, task):
    method on_new_request (line 825) | def on_new_request(self, task):
    method on_old_request (line 839) | def on_old_request(self, task, old_task):
    method on_task_status (line 889) | def on_task_status(self, task):
    method on_task_done (line 914) | def on_task_done(self, task):
    method on_task_failed (line 937) | def on_task_failed(self, task):
    method on_select_task (line 990) | def on_select_task(self, task):
  class OneScheduler (line 1014) | class OneScheduler(Scheduler):
    method _check_select (line 1022) | def _check_select(self):
    method __getattr__ (line 1106) | def __getattr__(self, name):
    method on_task_status (line 1112) | def on_task_status(self, task):
    method init_one (line 1136) | def init_one(self, ioloop, fetcher, processor,
    method do_task (line 1146) | def do_task(self, task):
    method send_task (line 1162) | def send_task(self, task, force=True):
    method run (line 1170) | def run(self):
    method quit (line 1176) | def quit(self):
  class ThreadBaseScheduler (line 1186) | class ThreadBaseScheduler(Scheduler):
    method __init__ (line 1187) | def __init__(self, threads=4, *args, **kwargs):
    method taskdb (line 1207) | def taskdb(self):
    method taskdb (line 1213) | def taskdb(self, taskdb):
    method projectdb (line 1217) | def projectdb(self):
    method projectdb (line 1223) | def projectdb(self, projectdb):
    method resultdb (line 1227) | def resultdb(self):
    method resultdb (line 1233) | def resultdb(self, resultdb):
    method _start_threads (line 1236) | def _start_threads(self):
    method _thread_worker (line 1245) | def _thread_worker(self, queue):
    method _run_in_thread (line 1253) | def _run_in_thread(self, method, *args, **kwargs):
    method _wait_thread (line 1277) | def _wait_thread(self):
    method _update_project (line 1283) | def _update_project(self, project):
    method on_task_status (line 1286) | def on_task_status(self, task):
    method on_request (line 1290) | def on_request(self, task):
    method _load_put_task (line 1294) | def _load_put_task(self, project, taskid):
    method run_once (line 1298) | def run_once(self):

FILE: pyspider/scheduler/task_queue.py
  class AtomInt (line 28) | class AtomInt(object):
    method get_value (line 33) | def get_value(cls):
  class InQueueTask (line 41) | class InQueueTask(DictMixin):
    method __init__ (line 49) | def __init__(self, taskid, priority=0, exetime=0):
    method __cmp__ (line 55) | def __cmp__(self, other):
    method __lt__ (line 65) | def __lt__(self, other):
  class PriorityTaskQueue (line 69) | class PriorityTaskQueue(Queue.Queue):
    method _init (line 76) | def _init(self, maxsize):
    method _qsize (line 80) | def _qsize(self, len=len):
    method _put (line 83) | def _put(self, item, heappush=heapq.heappush):
    method _get (line 97) | def _get(self, heappop=heapq.heappop):
    method top (line 107) | def top(self):
    method _resort (line 114) | def _resort(self):
    method __contains__ (line 117) | def __contains__(self, taskid):
    method __getitem__ (line 120) | def __getitem__(self, taskid):
    method __setitem__ (line 123) | def __setitem__(self, taskid, item):
    method __delitem__ (line 127) | def __delitem__(self, taskid):
  class TaskQueue (line 131) | class TaskQueue(object):
    method __init__ (line 137) | def __init__(self, rate=0, burst=0):
    method rate (line 145) | def rate(self):
    method rate (line 149) | def rate(self, value):
    method burst (line 153) | def burst(self):
    method burst (line 157) | def burst(self, value):
    method check_update (line 160) | def check_update(self):
    method _check_time_queue (line 169) | def _check_time_queue(self):
    method _check_processing (line 178) | def _check_processing(self):
    method put (line 190) | def put(self, taskid, priority=0, exetime=0):
    method get (line 227) | def get(self):
    method done (line 244) | def done(self, taskid):
    method delete (line 254) | def delete(self, taskid):
    method size (line 269) | def size(self):
    method is_processing (line 272) | def is_processing(self, taskid):
    method __len__ (line 278) | def __len__(self):
    method __contains__ (line 281) | def __contains__(self, taskid):

FILE: pyspider/scheduler/token_bucket.py
  class Bucket (line 15) | class Bucket(object):
    method __init__ (line 23) | def __init__(self, rate=1, burst=None):
    method get (line 33) | def get(self):
    method set (line 49) | def set(self, value):
    method desc (line 53) | def desc(self, value=1):

FILE: pyspider/webui/app.py
  class QuitableFlask (line 24) | class QuitableFlask(Flask):
    method logger (line 28) | def logger(self):
    method run (line 31) | def run(self, host=None, port=None, debug=None, **options):
    method quit (line 80) | def quit(self):
  function cdn_url_handler (line 104) | def cdn_url_handler(error, endpoint, kwargs):

FILE: pyspider/webui/bench_test.py
  function bench_test (line 19) | def bench_test():

FILE: pyspider/webui/debug.py
  function debug (line 39) | def debug(project):
  function enable_projects_import (line 65) | def enable_projects_import():
  function run (line 70) | def run(project):
  function save (line 170) | def save(project):
  function get_script (line 209) | def get_script(project):
  function blank_html (line 219) | def blank_html():

FILE: pyspider/webui/index.py
  function index (line 24) | def index():
  function get_queues (line 32) | def get_queues():
  function project_update (line 49) | def project_update():
  function counter (line 95) | def counter():
  function runtask (line 116) | def runtask():
  function robots (line 153) | def robots():

FILE: pyspider/webui/login.py
  class AnonymousUser (line 20) | class AnonymousUser(login.AnonymousUserMixin):
    method is_anonymous (line 22) | def is_anonymous(self):
    method is_active (line 25) | def is_active(self):
    method is_authenticated (line 28) | def is_authenticated(self):
    method get_id (line 31) | def get_id(self):
  class User (line 35) | class User(login.UserMixin):
    method __init__ (line 37) | def __init__(self, id, password):
    method is_authenticated (line 41) | def is_authenticated(self):
    method is_active (line 49) | def is_active(self):
  function load_user_from_request (line 57) | def load_user_from_request(request):
  function before_request (line 74) | def before_request():

FILE: pyspider/webui/result.py
  function result (line 17) | def result():
  function dump_result (line 34) | def dump_result(project, _format):

FILE: pyspider/webui/static/src/css_selector_helper.js
  function arrayEquals (line 8) | function arrayEquals(a, b) {
  function getOffset (line 21) | function getOffset(elem) {
  function merge_name (line 31) | function merge_name(features) {
  function merge_pattern (line 40) | function merge_pattern(path, end) {
  function path_info (line 75) | function path_info(doc, element) {
  class CSSSelectorHelperServer (line 179) | class CSSSelectorHelperServer extends EventEmitter {
    method constructor (line 180) | constructor(window) {
    method overlay (line 198) | overlay(elements) {
    method heightlight (line 221) | heightlight(elements) {
    method getElementByXpath (line 245) | getElementByXpath(path) {

FILE: pyspider/webui/static/src/debug.js
  function merge_name (line 14) | function merge_name(p) {
  function merge_pattern (line 27) | function merge_pattern(path, end) {
  function selector_changed (line 62) | function selector_changed(path) {
  function render_selector_helper (line 67) | function render_selector_helper(path) {
  function adjustHelper (line 137) | function adjustHelper() {
  function escape (line 221) | function escape(text) {

FILE: pyspider/webui/static/src/index.js
  function init_editable (line 17) | function init_editable(projects_app) {
  function init_sortable (line 82) | function init_sortable() {
  function update_counters (line 110) | function update_counters() {
  function update_queues (line 145) | function update_queues() {

FILE: pyspider/webui/static/src/splitter.js
  function moveSplitter (line 113) | function moveSplitter(pos) {
  function resetPrev (line 151) | function resetPrev() {

FILE: pyspider/webui/task.py
  function task (line 16) | def task(taskid):
  function task_in_json (line 36) | def task_in_json(taskid):
  function tasks (line 51) | def tasks():
  function active_tasks (line 81) | def active_tasks():

FILE: pyspider/webui/webdav.py
  function check_user (line 21) | def check_user(environ):
  class ContentIO (line 39) | class ContentIO(BytesIO):
    method close (line 40) | def close(self):
  class ScriptResource (line 45) | class ScriptResource(DAVNonCollection):
    method __init__ (line 46) | def __init__(self, path, environ, app, project=None):
    method project (line 58) | def project(self):
    method readonly (line 80) | def readonly(self):
    method getContentLength (line 90) | def getContentLength(self):
    method getContentType (line 93) | def getContentType(self):
    method getLastModified (line 96) | def getLastModified(self):
    method getContent (line 99) | def getContent(self):
    method beginWrite (line 102) | def beginWrite(self, contentType=None):
    method endWrite (line 109) | def endWrite(self, withErrors):
  class RootCollection (line 133) | class RootCollection(DAVCollection):
    method __init__ (line 134) | def __init__(self, path, environ, app):
    method getMemberList (line 139) | def getMemberList(self):
    method getMemberNames (line 155) | def getMemberNames(self):
  class ScriptProvider (line 165) | class ScriptProvider(DAVProvider):
    method __init__ (line 166) | def __init__(self, app):
    method __repr__ (line 170) | def __repr__(self):
    method getResourceInst (line 173) | def getResourceInst(self, path, environ):
  class NeedAuthController (line 182) | class NeedAuthController(object):
    method __init__ (line 183) | def __init__(self, app):
    method getDomainRealm (line 186) | def getDomainRealm(self, inputRelativeURL, environ):
    method requireAuthentication (line 189) | def requireAuthentication(self, realmname, environ):
    method isRealmUser (line 192) | def isRealmUser(self, realmname, username, environ):
    method getRealmUserPassword (line 195) | def getRealmUserPassword(self, realmname, username, environ):
    method authDomainUser (line 198) | def authDomainUser(self, realmname, username, password, environ):

FILE: tests/data_fetcher_processor_handler.py
  class Handler (line 10) | class Handler(BaseHandler):
    method not_send_status (line 13) | def not_send_status(self, response):
    method url_deduplicated (line 17) | def url_deduplicated(self, response):
    method catch_http_error (line 25) | def catch_http_error(self, response):
    method json (line 29) | def json(self, response):
    method html (line 32) | def html(self, response):
    method links (line 35) | def links(self, response):
    method cookies (line 38) | def cookies(self, response):
    method get_save (line 41) | def get_save(self, response):
    method get_process_save (line 44) | def get_process_save(self, response):
    method set_process_save (line 47) | def set_process_save(self, response):
  class IgnoreHandler (line 50) | class IgnoreHandler(BaseHandler):

FILE: tests/data_handler.py
  class IgnoreHandler (line 12) | class IgnoreHandler(object):
  class TestHandler (line 15) | class TestHandler(BaseHandler):
    method hello (line 21) | def hello(self):
    method echo (line 24) | def echo(self, response):
    method saved (line 27) | def saved(self, response):
    method echo_task (line 30) | def echo_task(self, response, task):
    method catch_status_code (line 34) | def catch_status_code(self, response):
    method raise_exception (line 37) | def raise_exception(self):
    method add_task (line 44) | def add_task(self, response):
    method on_cronjob1 (line 49) | def on_cronjob1(self, response):
    method on_cronjob2 (line 53) | def on_cronjob2(self, response):
    method generator (line 56) | def generator(self, response):
    method sleep (line 60) | def sleep(self, response):

FILE: tests/data_sample_handler.py
  class Handler (line 9) | class Handler(BaseHandler):
    method on_start (line 14) | def on_start(self):
    method index_page (line 18) | def index_page(self, response):
    method detail_page (line 23) | def detail_page(self, response):

FILE: tests/data_test_webpage.py
  function test_page (line 11) | def test_page():
  function test_ajax (line 30) | def test_ajax():
  function test_ajax_click (line 49) | def test_ajax_click():

FILE: tests/test_base_handler.py
  class TestBaseHandler (line 13) | class TestBaseHandler(unittest.TestCase):
    method test_task_join_crawl_config (line 36) | def test_task_join_crawl_config(self):

FILE: tests/test_bench.py
  class TestBench (line 19) | class TestBench(unittest.TestCase):
    method setUpClass (line 22) | def setUpClass(self):
    method tearDownClass (line 27) | def tearDownClass(self):
    method test_10_bench (line 30) | def test_10_bench(self):

FILE: tests/test_counter.py
  class TestCounter (line 14) | class TestCounter(unittest.TestCase):
    method test_010_TimebaseAverageEventCounter (line 15) | def test_010_TimebaseAverageEventCounter(self):
    method test_020_TotalCounter (line 24) | def test_020_TotalCounter(self):
    method test_030_AverageWindowCounter (line 31) | def test_030_AverageWindowCounter(self):
    method test_020_delete (line 42) | def test_020_delete(self):

FILE: tests/test_database.py
  class TaskDBCase (line 19) | class TaskDBCase(object):
    method setUpClass (line 69) | def setUpClass(self):
    method test_20_insert (line 81) | def test_20_insert(self):
    method test_25_get_task (line 85) | def test_25_get_task(self):
    method test_30_status_count (line 104) | def test_30_status_count(self):
    method test_40_update_and_status_count (line 110) | def test_40_update_and_status_count(self):
    method test_50_load_tasks (line 120) | def test_50_load_tasks(self):
    method test_60_relist_projects (line 137) | def test_60_relist_projects(self):
    method test_z10_drop (line 142) | def test_z10_drop(self):
    method test_z20_update_projects (line 149) | def test_z20_update_projects(self):
  class ProjectDBCase (line 158) | class ProjectDBCase(object):
    method setUpClass (line 168) | def setUpClass(self):
    method test_10_insert (line 171) | def test_10_insert(self):
    method test_20_get_all (line 177) | def test_20_get_all(self):
    method test_30_update (line 201) | def test_30_update(self):
    method test_40_check_update (line 206) | def test_40_check_update(self):
    method test_45_check_update_when_bootup (line 221) | def test_45_check_update_when_bootup(self):
    method test_50_get (line 227) | def test_50_get(self):
    method test_z10_drop (line 240) | def test_z10_drop(self):
  class ResultDBCase (line 248) | class ResultDBCase(object):
    method setUpClass (line 251) | def setUpClass(self):
    method test_10_save (line 254) | def test_10_save(self):
    method test_20_get (line 265) | def test_20_get(self):
    method test_30_select (line 283) | def test_30_select(self):
    method test_35_select_limit (line 297) | def test_35_select_limit(self):
    method test_40_count (line 304) | def test_40_count(self):
    method test_50_select_not_finished (line 307) | def test_50_select_not_finished(self):
    method test_60_relist_projects (line 312) | def test_60_relist_projects(self):
    method test_z10_drop (line 317) | def test_z10_drop(self):
    method test_z20_update_projects (line 324) | def test_z20_update_projects(self):
  class TestSqliteTaskDB (line 333) | class TestSqliteTaskDB(TaskDBCase, unittest.TestCase):
    method setUpClass (line 336) | def setUpClass(self):
    method tearDownClass (line 341) | def tearDownClass(self):
  class TestSqliteProjectDB (line 345) | class TestSqliteProjectDB(ProjectDBCase, unittest.TestCase):
    method setUpClass (line 348) | def setUpClass(self):
    method tearDownClass (line 353) | def tearDownClass(self):
  class TestSqliteResultDB (line 357) | class TestSqliteResultDB(ResultDBCase, unittest.TestCase):
    method setUpClass (line 360) | def setUpClass(self):
    method tearDownClass (line 365) | def tearDownClass(self):
  class TestMysqlTaskDB (line 370) | class TestMysqlTaskDB(TaskDBCase, unittest.TestCase):
    method setUpClass (line 373) | def setUpClass(self):
    method tearDownClass (line 378) | def tearDownClass(self):
  class TestMysqlProjectDB (line 383) | class TestMysqlProjectDB(ProjectDBCase, unittest.TestCase):
    method setUpClass (line 386) | def setUpClass(self):
    method tearDownClass (line 393) | def tearDownClass(self):
  class TestMysqlResultDB (line 398) | class TestMysqlResultDB(ResultDBCase, unittest.TestCase):
    method setUpClass (line 401) | def setUpClass(self):
    method tearDownClass (line 408) | def tearDownClass(self):
  class TestMongoDBTaskDB (line 413) | class TestMongoDBTaskDB(TaskDBCase, unittest.TestCase):
    method setUpClass (line 416) | def setUpClass(self):
    method tearDownClass (line 423) | def tearDownClass(self):
    method test_create_project (line 426) | def test_create_project(self):
  class TestMongoDBProjectDB (line 433) | class TestMongoDBProjectDB(ProjectDBCase, unittest.TestCase):
    method setUpClass (line 436) | def setUpClass(self):
    method tearDownClass (line 443) | def tearDownClass(self):
  class TestMongoDBResultDB (line 448) | class TestMongoDBResultDB(ResultDBCase, unittest.TestCase):
    method setUpClass (line 451) | def setUpClass(self):
    method tearDownClass (line 458) | def tearDownClass(self):
    method test_create_project (line 461) | def test_create_project(self):
  class TestSQLAlchemyMySQLTaskDB (line 468) | class TestSQLAlchemyMySQLTaskDB(TaskDBCase, unittest.TestCase):
    method setUpClass (line 471) | def setUpClass(self):
    method tearDownClass (line 478) | def tearDownClass(self):
  class TestSQLAlchemyMySQLProjectDB (line 483) | class TestSQLAlchemyMySQLProjectDB(ProjectDBCase, unittest.TestCase):
    method setUpClass (line 486) | def setUpClass(self):
    method tearDownClass (line 493) | def tearDownClass(self):
  class TestSQLAlchemyMySQLResultDB (line 498) | class TestSQLAlchemyMySQLResultDB(ResultDBCase, unittest.TestCase):
    method setUpClass (line 501) | def setUpClass(self):
    method tearDownClass (line 508) | def tearDownClass(self):
  class TestSQLAlchemyTaskDB (line 512) | class TestSQLAlchemyTaskDB(TaskDBCase, unittest.TestCase):
    method setUpClass (line 515) | def setUpClass(self):
    method tearDownClass (line 522) | def tearDownClass(self):
  class TestSQLAlchemyProjectDB (line 526) | class TestSQLAlchemyProjectDB(ProjectDBCase, unittest.TestCase):
    method setUpClass (line 529) | def setUpClass(self):
    method tearDownClass (line 536) | def tearDownClass(self):
  class TestSQLAlchemyResultDB (line 540) | class TestSQLAlchemyResultDB(ResultDBCase, unittest.TestCase):
    method setUpClass (line 543) | def setUpClass(self):
    method tearDownClass (line 550) | def tearDownClass(self):
  class TestPGTaskDB (line 555) | class TestPGTaskDB(TaskDBCase, unittest.TestCase):
    method setUpClass (line 558) | def setUpClass(self):
    method tearDownClass (line 566) | def tearDownClass(self):
  class TestPGProjectDB (line 572) | class TestPGProjectDB(ProjectDBCase, unittest.TestCase):
    method setUpClass (line 575) | def setUpClass(self):
    method tearDownClass (line 583) | def tearDownClass(self):
  class TestPGResultDB (line 589) | class TestPGResultDB(ResultDBCase, unittest.TestCase):
    method setUpClass (line 592) | def setUpClass(self):
    method tearDownClass (line 600) | def tearDownClass(self):
  class TestRedisTaskDB (line 606) | class TestRedisTaskDB(TaskDBCase, unittest.TestCase):
    method setUpClass (line 609) | def setUpClass(self):
    method tearDownClass (line 615) | def tearDownClass(self):
  class TestESProjectDB (line 621) | class TestESProjectDB(ProjectDBCase, unittest.TestCase):
    method setUpClass (line 624) | def setUpClass(self):
    method tearDownClass (line 632) | def tearDownClass(self):
  class TestESResultDB (line 637) | class TestESResultDB(ResultDBCase, unittest.TestCase):
    method setUpClass (line 640) | def setUpClass(self):
    method tearDownClass (line 648) | def tearDownClass(self):
    method test_15_save (line 651) | def test_15_save(self):
    method test_30_select (line 654) | def test_30_select(self):
    method test_35_select_limit (line 670) | def test_35_select_limit(self):
    method test_z20_update_projects (line 673) | def test_z20_update_projects(self):
  class TestESTaskDB (line 679) | class TestESTaskDB(TaskDBCase, unittest.TestCase):
    method setUpClass (line 682) | def setUpClass(self):
    method tearDownClass (line 690) | def tearDownClass(self):
  class TestCouchDBProjectDB (line 695) | class TestCouchDBProjectDB(ProjectDBCase, unittest.TestCase):
    method setUpClass (line 698) | def setUpClass(self):
    method tearDownClass (line 706) | def tearDownClass(self):
  class TestCouchDBResultDB (line 712) | class TestCouchDBResultDB(ResultDBCase, unittest.TestCase):
    method setUpClass (line 715) | def setUpClass(self):
    method tearDownClass (line 723) | def tearDownClass(self):
    method test_create_project (line 727) | def test_create_project(self):
  class TestCouchDBTaskDB (line 734) | class TestCouchDBTaskDB(TaskDBCase, unittest.TestCase):
    method setUpClass (line 737) | def setUpClass(self):
    method tearDownClass (line 746) | def tearDownClass(self):
    method test_create_project (line 752) | def test_create_project(self):

FILE: tests/test_fetcher.py
  class TestFetcher (line 31) | class TestFetcher(unittest.TestCase):
    method setUpClass (line 55) | def setUpClass(self):
    method tearDownClass (line 83) | def tearDownClass(self):
    method test_10_http_get (line 103) | def test_10_http_get(self):
    method test_15_http_post (line 117) | def test_15_http_post(self):
    method test_20_dataurl_get (line 136) | def test_20_dataurl_get(self):
    method test_30_with_queue (line 145) | def test_30_with_queue(self):
    method test_40_with_rpc (line 155) | def test_40_with_rpc(self):
    method test_50_base64_data (line 164) | def test_50_base64_data(self):
    method test_55_base64_data (line 178) | def test_55_base64_data(self):
    method test_60_timeout (line 191) | def test_60_timeout(self):
    method test_65_418 (line 206) | def test_65_418(self):
    method test_69_no_phantomjs (line 216) | def test_69_no_phantomjs(self):
    method test_70_phantomjs_url (line 232) | def test_70_phantomjs_url(self):
    method test_75_phantomjs_robots (line 249) | def test_75_phantomjs_robots(self):
    method test_80_phantomjs_timeout (line 261) | def test_80_phantomjs_timeout(self):
    method test_90_phantomjs_js_script (line 276) | def test_90_phantomjs_js_script(self):
    method test_a100_phantomjs_sharp_url (line 287) | def test_a100_phantomjs_sharp_url(self):
    method test_a110_dns_error (line 300) | def test_a110_dns_error(self):
    method test_a120_http_get_with_proxy_fail (line 314) | def test_a120_http_get_with_proxy_fail(self):
    method test_a130_http_get_with_proxy_ok (line 324) | def test_a130_http_get_with_proxy_ok(self):
    method test_a140_redirect (line 340) | def test_a140_redirect(self):
    method test_a150_too_much_redirect (line 350) | def test_a150_too_much_redirect(self):
    method test_a160_cookie (line 359) | def test_a160_cookie(self):
    method test_a170_validate_cert (line 368) | def test_a170_validate_cert(self):
    method test_a180_max_redirects (line 377) | def test_a180_max_redirects(self):
    method test_a200_robots_txt (line 386) | def test_a200_robots_txt(self):
    method test_zzzz_issue375 (line 401) | def test_zzzz_issue375(self):
  class TestSplashFetcher (line 418) | class TestSplashFetcher(unittest.TestCase):
    method sample_task_http (line 420) | def sample_task_http(self):
    method setUpClass (line 444) | def setUpClass(self):
    method tearDownClass (line 464) | def tearDownClass(self):
    method test_69_no_splash (line 482) | def test_69_no_splash(self):
    method test_70_splash_url (line 496) | def test_70_splash_url(self):
    method test_75_splash_robots (line 512) | def test_75_splash_robots(self):
    method test_80_splash_timeout (line 522) | def test_80_splash_timeout(self):
    method test_90_splash_js_script (line 535) | def test_90_splash_js_script(self):
    method test_95_splash_js_script_2 (line 544) | def test_95_splash_js_script_2(self):
    method test_a100_splash_sharp_url (line 557) | def test_a100_splash_sharp_url(self):
    method test_a120_http_get_with_proxy_fail_1 (line 568) | def test_a120_http_get_with_proxy_fail_1(self):
    method test_a120_http_get_with_proxy_fail (line 578) | def test_a120_http_get_with_proxy_fail(self):
    method test_a130_http_get_with_proxy_ok_1 (line 589) | def test_a130_http_get_with_proxy_ok_1(self):
    method test_a130_http_get_with_proxy_ok (line 605) | def test_a130_http_get_with_proxy_ok(self):

FILE: tests/test_fetcher_processor.py
  class TestFetcherProcessor (line 22) | class TestFetcherProcessor(Handler, unittest.TestCase):
    method setUpClass (line 25) | def setUpClass(self):
    method tearDownClass (line 46) | def tearDownClass(self):
    method crawl (line 53) | def crawl(self, url=None, track=None, **kwargs):
    method assertStatusOk (line 80) | def assertStatusOk(self, status):
    method status_ok (line 85) | def status_ok(self, status, type):
    method test_10_not_status (line 90) | def test_10_not_status(self):
    method test_20_url_deduplicated (line 97) | def test_20_url_deduplicated(self):
    method test_30_catch_status_code_error (line 108) | def test_30_catch_status_code_error(self):
    method test_40_method (line 141) | def test_40_method(self):
    method test_50_params (line 154) | def test_50_params(self):
    method test_60_data (line 164) | def test_60_data(self):
    method test_70_redirect (line 174) | def test_70_redirect(self):
    method test_80_redirect_too_many (line 181) | def test_80_redirect_too_many(self):
    method test_90_files (line 190) | def test_90_files(self):
    method test_a100_files_with_data (line 199) | def test_a100_files_with_data(self):
    method test_a110_headers (line 212) | def test_a110_headers(self):
    method test_a115_user_agent (line 223) | def test_a115_user_agent(self):
    method test_a120_cookies (line 231) | def test_a120_cookies(self):
    method test_a130_cookies_with_headers (line 242) | def test_a130_cookies_with_headers(self):
    method test_a140_response_cookie (line 258) | def test_a140_response_cookie(self):
    method test_a145_redirect_cookie (line 265) | def test_a145_redirect_cookie(self):
    method test_a150_timeout (line 272) | def test_a150_timeout(self):
    method test_a160_etag (line 280) | def test_a160_etag(self):
    method test_a170_last_modified (line 287) | def test_a170_last_modified(self):
    method test_a180_save (line 294) | def test_a180_save(self):
    method test_a190_taskid (line 302) | def test_a190_taskid(self):
    method test_a200_no_proxy (line 311) | def test_a200_no_proxy(self):
    method test_a210_proxy_failed (line 323) | def test_a210_proxy_failed(self):
    method test_a220_proxy_ok (line 337) | def test_a220_proxy_ok(self):
    method test_a230_proxy_parameter_fail (line 351) | def test_a230_proxy_parameter_fail(self):
    method test_a240_proxy_parameter_ok (line 362) | def test_a240_proxy_parameter_ok(self):
    method test_a250_proxy_userpass (line 375) | def test_a250_proxy_userpass(self):
    method test_a260_process_save (line 386) | def test_a260_process_save(self):
    method test_zzz_links (line 400) | def test_zzz_links(self):
    method test_zzz_html (line 407) | def test_zzz_html(self):
    method test_zzz_etag_enabled (line 414) | def test_zzz_etag_enabled(self):
    method test_zzz_etag_not_working (line 425) | def test_zzz_etag_not_working(self):
    method test_zzz_unexpected_crawl_argument (line 436) | def test_zzz_unexpected_crawl_argument(self):
    method test_zzz_curl_get (line 440) | def test_zzz_curl_get(self):
    method test_zzz_curl_post (line 449) | def test_zzz_curl_post(self):
    method test_zzz_curl_put (line 458) | def test_zzz_curl_put(self):
    method test_zzz_curl_no_url (line 467) | def test_zzz_curl_no_url(self):
    method test_zzz_curl_bad_option (line 473) | def test_zzz_curl_bad_option(self):
    method test_zzz_robots_txt (line 484) | def test_zzz_robots_txt(self):
    method test_zzz_connect_timeout (line 489) | def test_zzz_connect_timeout(self):

FILE: tests/test_message_queue.py
  class TestMessageQueue (line 17) | class TestMessageQueue(object):
    method setUpClass (line 20) | def setUpClass(self):
    method test_10_put (line 23) | def test_10_put(self):
    method test_20_get (line 32) | def test_20_get(self):
    method test_30_full (line 40) | def test_30_full(self):
    method test_40_multiple_threading_error (line 53) | def test_40_multiple_threading_error(self):
  class BuiltinQueue (line 67) | class BuiltinQueue(TestMessageQueue, unittest.TestCase):
    method setUpClass (line 69) | def setUpClass(self):
  class TestPikaRabbitMQ (line 78) | class TestPikaRabbitMQ(TestMessageQueue, unittest.TestCase):
    method setUpClass (line 81) | def setUpClass(self):
    method tearDownClass (line 93) | def tearDownClass(self):
    method test_30_full (line 100) | def test_30_full(self):
  class TestAmqpRabbitMQ (line 120) | class TestAmqpRabbitMQ(TestMessageQueue, unittest.TestCase):
    method setUpClass (line 123) | def setUpClass(self):
    method tearDownClass (line 138) | def tearDownClass(self):
    method test_30_full (line 145) | def test_30_full(self):
  class TestRedisQueue (line 164) | class TestRedisQueue(TestMessageQueue, unittest.TestCase):
    method setUpClass (line 167) | def setUpClass(self):
    method tearDownClass (line 183) | def tearDownClass(self):
  class TestKombuQueue (line 191) | class TestKombuQueue(TestMessageQueue, unittest.TestCase):
    method setUpClass (line 195) | def setUpClass(self):
    method tearDownClass (line 209) | def tearDownClass(self):
  class TestKombuAmpqQueue (line 222) | class TestKombuAmpqQueue(TestKombuQueue):
  class TestKombuRedisQueue (line 227) | class TestKombuRedisQueue(TestKombuQueue):
  class TestKombuMongoDBQueue (line 231) | class TestKombuMongoDBQueue(TestKombuQueue):

FILE: tests/test_processor.py
  class TestProjectModule (line 20) | class TestProjectModule(unittest.TestCase):
    method base_task (line 23) | def base_task(self):
    method fetch_result (line 51) | def fetch_result(self):
    method setUp (line 66) | def setUp(self):
    method test_2_hello (line 83) | def test_2_hello(self):
    method test_3_echo (line 90) | def test_3_echo(self):
    method test_4_saved (line 97) | def test_4_saved(self):
    method test_5_echo_task (line 104) | def test_5_echo_task(self):
    method test_6_catch_status_code (line 111) | def test_6_catch_status_code(self):
    method test_7_raise_exception (line 120) | def test_7_raise_exception(self):
    method test_8_add_task (line 130) | def test_8_add_task(self):
    method test_10_cronjob (line 138) | def test_10_cronjob(self):
    method test_20_get_info (line 175) | def test_20_get_info(self):
    method test_30_generator (line 197) | def test_30_generator(self):
    method test_40_sleep (line 204) | def test_40_sleep(self):
    method test_50_timeout (line 214) | def test_50_timeout(self):
    method test_60_timeout_in_thread (line 231) | def test_60_timeout_in_thread(self):
  class TestProcessor (line 253) | class TestProcessor(unittest.TestCase):
    method setUpClass (line 257) | def setUpClass(self):
    method tearDownClass (line 278) | def tearDownClass(self):
    method test_10_update_project (line 285) | def test_10_update_project(self):
    method test_20_broken_project (line 315) | def test_20_broken_project(self):
    method test_30_new_task (line 331) | def test_30_new_task(self):
    method test_40_index_page (line 357) | def test_40_index_page(self):
    method test_50_fetch_error (line 401) | def test_50_fetch_error(self):
    method test_60_call_broken_project (line 446) | def test_60_call_broken_project(self):
    method test_70_update_project (line 481) | def test_70_update_project(self):
    method test_80_import_project (line 549) | def test_80_import_project(self):

FILE: tests/test_response.py
  class TestResponse (line 23) | class TestResponse(unittest.TestCase):
    method setUpClass (line 31) | def setUpClass(self):
    method tearDownClass (line 38) | def tearDownClass(self):
    method get (line 41) | def get(self, url, **kwargs):
    method test_10_html (line 51) | def test_10_html(self):
    method test_20_xml (line 56) | def test_20_xml(self):
    method test_30_gzip (line 61) | def test_30_gzip(self):
    method test_40_deflate (line 66) | def test_40_deflate(self):
    method test_50_ok (line 71) | def test_50_ok(self):
    method test_60_not_ok (line 81) | def test_60_not_ok(self):
    method test_70_reraise_exception (line 92) | def test_70_reraise_exception(self):

FILE: tests/test_result_dump.py
  class TestResultDump (line 45) | class TestResultDump(unittest.TestCase):
    method test_result_formater_1 (line 46) | def test_result_formater_1(self):
    method test_result_formater_2 (line 50) | def test_result_formater_2(self):
    method test_result_formater_error (line 54) | def test_result_formater_error(self):
    method test_dump_as_json (line 58) | def test_dump_as_json(self):
    method test_dump_as_json_valid (line 63) | def test_dump_as_json_valid(self):
    method test_dump_as_txt (line 68) | def test_dump_as_txt(self):
    method test_dump_as_csv (line 74) | def test_dump_as_csv(self):
    method test_dump_as_csv_case_1 (line 79) | def test_dump_as_csv_case_1(self):

FILE: tests/test_result_worker.py
  class TestProcessor (line 21) | class TestProcessor(unittest.TestCase):
    method setUpClass (line 25) | def setUpClass(self):
    method tearDownClass (line 41) | def tearDownClass(self):
    method test_10_bad_result (line 48) | def test_10_bad_result(self):
    method test_10_bad_result_2 (line 54) | def test_10_bad_result_2(self):
    method test_20_insert_result (line 60) | def test_20_insert_result(self):
    method test_30_overwrite (line 77) | def test_30_overwrite(self):
    method test_40_insert_list (line 87) | def test_40_insert_list(self):

FILE: tests/test_run.py
  class TestRun (line 25) | class TestRun(unittest.TestCase):
    method setUpClass (line 28) | def setUpClass(self):
    method tearDownClass (line 38) | def tearDownClass(self):
    method test_10_cli (line 50) | def test_10_cli(self):
    method test_20_cli_config (line 61) | def test_20_cli_config(self):
    method test_30_cli_command_line (line 81) | def test_30_cli_command_line(self):
    method test_30a_cli_command_line (line 94) | def test_30a_cli_command_line(self):
    method test_40_cli_env (line 107) | def test_40_cli_env(self):
    method test_50_docker_rabbitmq (line 120) | def test_50_docker_rabbitmq(self):
    method test_60_docker_mongodb (line 139) | def test_60_docker_mongodb(self):
    method test_60a_docker_couchdb (line 156) | def test_60a_docker_couchdb(self):
    method test_70_docker_mysql (line 176) | def test_70_docker_mysql(self):
    method test_80_docker_phantomjs (line 192) | def test_80_docker_phantomjs(self):
    method test_90_docker_scheduler (line 206) | def test_90_docker_scheduler(self):
    method test_a100_all (line 226) | def test_a100_all(self):
    method test_a110_one (line 273) | def test_a110_one(self):
  class TestSendMessage (line 336) | class TestSendMessage(unittest.TestCase):
    method setUpClass (line 339) | def setUpClass(self):
    method tearDownClass (line 358) | def tearDownClass(self):
    method test_10_send_message (line 372) | def test_10_send_message(self):

FILE: tests/test_scheduler.py
  class TestTaskQueue (line 20) | class TestTaskQueue(unittest.TestCase):
    method setUpClass (line 23) | def setUpClass(self):
    method test_10_put (line 29) | def test_10_put(self):
    method test_20_update (line 36) | def test_20_update(self):
    method test_30_get_from_priority_queue (line 42) | def test_30_get_from_priority_queue(self):
    method test_40_time_queue_1 (line 46) | def test_40_time_queue_1(self):
    method test_50_time_queue_2 (line 51) | def test_50_time_queue_2(self):
    method test_60_processing_queue (line 58) | def test_60_processing_queue(self):
    method test_70_done (line 68) | def test_70_done(self):
  class TestBucket (line 80) | class TestBucket(unittest.TestCase):
    method test_bucket (line 82) | def test_bucket(self):
  class TestScheduler (line 105) | class TestScheduler(unittest.TestCase):
    method setUpClass (line 113) | def setUpClass(self):
    method tearDownClass (line 152) | def tearDownClass(self):
    method test_10_new_task_ignore (line 166) | def test_10_new_task_ignore(self):
    method test_20_new_project (line 178) | def test_20_new_project(self):
    method test_30_update_project (line 192) | def test_30_update_project(self):
    method test_32_get_info (line 207) | def test_32_get_info(self):
    method test_34_new_not_used_project (line 218) | def test_34_new_not_used_project(self):
    method test_35_new_task (line 234) | def test_35_new_task(self):
    method test_37_force_update_processing_task (line 267) | def test_37_force_update_processing_task(self):
    method test_40_taskdone_error_no_project (line 283) | def test_40_taskdone_error_no_project(self):
    method test_50_taskdone_error_no_track (line 295) | def test_50_taskdone_error_no_track(self):
    method test_60_taskdone_failed_retry (line 315) | def test_60_taskdone_failed_retry(self):
    method test_70_taskdone_ok (line 338) | def test_70_taskdone_ok(self):
    method test_75_on_finished_msg (line 358) | def test_75_on_finished_msg(self):
    method test_80_newtask_age_ignore (line 379) | def test_80_newtask_age_ignore(self):
    method test_82_newtask_via_rpc (line 400) | def test_82_newtask_via_rpc(self):
    method test_90_newtask_with_itag (line 421) | def test_90_newtask_with_itag(self):
    method test_a10_newtask_restart_by_age (line 450) | def test_a10_newtask_restart_by_age(self):
    method test_a20_failed_retry (line 470) | def test_a20_failed_retry(self):
    method test_a30_task_verify (line 511) | def test_a30_task_verify(self):
    method test_a40_success_recrawl (line 538) | def test_a40_success_recrawl(self):
    method test_a50_failed_recrawl (line 585) | def test_a50_failed_recrawl(self):
    method test_a60_disable_recrawl (line 620) | def test_a60_disable_recrawl(self):
    method test_38_cancel_task (line 648) | def test_38_cancel_task(self):
    method test_x10_inqueue_limit (line 691) | def test_x10_inqueue_limit(self):
    method test_x20_delete_project (line 716) | def test_x20_delete_project(self):
    method test_z10_startup (line 726) | def test_z10_startup(self):
    method test_z20_quit (line 729) | def test_z20_quit(self):
  class TestProject (line 741) | class TestProject(unittest.TestCase):
    method setUpClass (line 795) | def setUpClass(self):
    method test_pause_10_unpaused (line 809) | def test_pause_10_unpaused(self):
    method test_pause_20_no_enough_fail_tasks (line 812) | def test_pause_20_no_enough_fail_tasks(self):
    method test_pause_30_paused (line 833) | def test_pause_30_paused(self):
    method test_pause_40_unpause_checking (line 840) | def test_pause_40_unpause_checking(self):
    method test_pause_50_paused_again (line 844) | def test_pause_50_paused_again(self):
    method test_pause_60_unpause_checking (line 849) | def test_pause_60_unpause_checking(self):
    method test_pause_70_unpaused (line 853) | def test_pause_70_unpaused(self):
    method test_pause_x_disable_auto_pause (line 863) | def test_pause_x_disable_auto_pause(self):

FILE: tests/test_task_queue.py
  class TestTaskQueue (line 13) | class TestTaskQueue(unittest.TestCase):
    method test_task_queue_in_time_order (line 18) | def test_task_queue_in_time_order(self):
  class TestTimeQueue (line 54) | class TestTimeQueue(unittest.TestCase):
    method test_time_queue (line 55) | def test_time_queue(self):

FILE: tests/test_utils.py
  class TestFetcher (line 14) | class TestFetcher(unittest.TestCase):
    method test_readonlydict (line 15) | def test_readonlydict(self):
    method test_getitem (line 23) | def test_getitem(self):
    method test_format_data (line 37) | def test_format_data(self):

FILE: tests/test_webdav.py
  class TestWebDav (line 22) | class TestWebDav(unittest.TestCase):
    method setUpClass (line 24) | def setUpClass(self):
    method tearDownClass (line 50) | def tearDownClass(self):
    method test_10_ls (line 64) | def test_10_ls(self):
    method test_20_create_error (line 67) | def test_20_create_error(self):
    method test_30_create_ok (line 76) | def test_30_create_ok(self):
    method test_40_get_404 (line 81) | def test_40_get_404(self):
    method test_50_get (line 88) | def test_50_get(self):
    method test_60_edit (line 99) | def test_60_edit(self):
    method test_70_get (line 102) | def test_70_get(self):
    method test_80_password (line 108) | def test_80_password(self):
  class TestWebDavNeedAuth (line 124) | class TestWebDavNeedAuth(unittest.TestCase):
    method setUpClass (line 126) | def setUpClass(self):
    method tearDownClass (line 153) | def tearDownClass(self):
    method test_10_ls (line 167) | def test_10_ls(self):
    method test_30_create_ok (line 173) | def test_30_create_ok(self):
    method test_50_get (line 177) | def test_50_get(self):

FILE: tests/test_webui.py
  class TestWebUI (line 20) | class TestWebUI(unittest.TestCase):
    method setUpClass (line 23) | def setUpClass(self):
    method tearDownClass (line 73) | def tearDownClass(self):
    method test_10_index_page (line 92) | def test_10_index_page(self):
    method test_20_debug (line 97) | def test_20_debug(self):
    method test_25_debug_post (line 112) | def test_25_debug_post(self):
    method test_30_run (line 133) | def test_30_run(self):
    method test_32_run_bad_task (line 144) | def test_32_run_bad_task(self):
    method test_33_run_bad_script (line 154) | def test_33_run_bad_script(self):
    method test_35_run_http_task (line 164) | def test_35_run_http_task(self):
    method test_40_save (line 173) | def test_40_save(self):
    method test_42_get (line 180) | def test_42_get(self):
    method test_45_run_with_saved_script (line 187) | def test_45_run_with_saved_script(self):
    method test_50_index_page_list (line 199) | def test_50_index_page_list(self):
    method test_52_change_status (line 204) | def test_52_change_status(self):
    method test_55_reopen (line 213) | def test_55_reopen(self):
    method test_57_resave (line 218) | def test_57_resave(self):
    method test_58_index_page_list (line 225) | def test_58_index_page_list(self):
    method test_60_change_rate (line 230) | def test_60_change_rate(self):
    method test_70_change_status (line 239) | def test_70_change_status(self):
    method test_80_change_group (line 248) | def test_80_change_group(self):
    method test_90_run (line 261) | def test_90_run(self):
    method test_a10_counter (line 269) | def test_a10_counter(self):
    method test_a15_queues (line 285) | def test_a15_queues(self):
    method test_a20_tasks (line 296) | def test_a20_tasks(self):
    method test_a22_active_tasks (line 315) | def test_a22_active_tasks(self):
    method test_a24_task (line 334) | def test_a24_task(self):
    method test_a25_task_json (line 339) | def test_a25_task_json(self):
    method test_a26_debug_task (line 344) | def test_a26_debug_task(self):
    method test_a30_results (line 348) | def test_a30_results(self):
    method test_a30_export_json (line 354) | def test_a30_export_json(self):
    method test_a32_export_json_style_full (line 359) | def test_a32_export_json_style_full(self):
    method test_a34_export_json_style_full_limit_1 (line 365) | def test_a34_export_json_style_full_limit_1(self):
    method test_a40_export_url_json (line 371) | def test_a40_export_url_json(self):
    method test_a50_export_csv (line 376) | def test_a50_export_csv(self):
    method test_a60_fetch_via_cannot_connect_fetcher (line 381) | def test_a60_fetch_via_cannot_connect_fetcher(self):
    method test_a70_fetch_via_fetcher (line 396) | def test_a70_fetch_via_fetcher(self):
    method test_h000_auth (line 412) | def test_h000_auth(self):
    method test_h005_no_such_project (line 422) | def test_h005_no_such_project(self):
    method test_h005_unknown_field (line 430) | def test_h005_unknown_field(self):
    method test_h005_rate_wrong_format (line 438) | def test_h005_rate_wrong_format(self):
    method test_h010_change_group (line 446) | def test_h010_change_group(self):
    method test_h020_change_group_lock_failed (line 459) | def test_h020_change_group_lock_failed(self):
    method test_h020_change_group_lock_ok (line 467) | def test_h020_change_group_lock_ok(self):
    method test_h030_need_auth (line 477) | def test_h030_need_auth(self):
    method test_h040_auth_fail (line 488) | def test_h040_auth_fail(self):
    method test_h050_auth_fail2 (line 492) | def test_h050_auth_fail2(self):
    method test_h060_auth_fail3 (line 498) | def test_h060_auth_fail3(self):
    method test_h070_auth_ok (line 504) | def test_h070_auth_ok(self):
    method test_x0_disconnected_scheduler (line 510) | def test_x0_disconnected_scheduler(self):
    method test_x10_project_update (line 518) | def test_x10_project_update(self):
    method test_x20_counter (line 527) | def test_x20_counter(self):
    method test_x30_run_not_exists_project (line 532) | def test_x30_run_not_exists_project(self):
    method test_x30_run (line 538) | def test_x30_run(self):
    method test_x40_debug_save (line 545) | def test_x40_debug_save(self):
    method test_x50_tasks (line 552) | def test_x50_tasks(self):
    method test_x60_robots (line 556) | def test_x60_robots(self):
    method test_x70_bench (line 561) | def test_x70_bench(self):

FILE: tests/test_xmlrpc.py
  class TestXMLRPCServer (line 23) | class TestXMLRPCServer(unittest.TestCase):
    method setUpClass (line 25) | def setUpClass(self):
    method tearDownClass (line 48) | def tearDownClass(self):
    method test_xmlrpc_server (line 52) | def test_xmlrpc_server(self, uri='http://127.0.0.1:3423'):

FILE: tools/migrate.py
  function taskdb_migrating (line 20) | def taskdb_migrating(project, from_connection, to_connection):
  function resultdb_migrating (line 30) | def resultdb_migrating(project, from_connection, to_connection):
  function migrate (line 43) | def migrate(pool, from_connection, to_connection):

Condensed preview: 165 files, each showing path, character count, and a content snippet (the full structured content is 832K chars).

[
  {
    "path": ".coveragerc",
    "chars": 350,
    "preview": "[run]\nsource =\n    pyspider\nparallel = True\n\n[report]\nomit =\n    pyspider/libs/sample_handler.py\n    pyspider/libs/pprin"
  },
  {
    "path": ".github/ISSUE_TEMPLATE.md",
    "chars": 621,
    "preview": "<!--\nThanks for using pyspider!\n\n如果你需要使用中文提问，请将问题提交到 https://segmentfault.com/t/pyspider\n-->\n\n* pyspider version:\n* Oper"
  },
  {
    "path": ".gitignore",
    "chars": 339,
    "preview": "*.py[cod]\ndata/*\n.venv\n.idea\n# C extensions\n*.so\n\n# Packages\n*.egg\n*.egg-info\ndist\nbuild\neggs\nparts\nbin\nvar\nsdist\ndevelo"
  },
  {
    "path": ".travis.yml",
    "chars": 1322,
    "preview": "language: python\ncache: pip\npython:\n  - 3.5\n  - 3.6\n  - 3.7\n  #- 3.8\nservices:\n    - docker\n    - mongodb\n    - rabbitmq"
  },
  {
    "path": "Dockerfile",
    "chars": 1398,
    "preview": "FROM python:3.6\nMAINTAINER binux <roy@binux.me>\n\n# install phantomjs\nRUN mkdir -p /opt/phantomjs \\\n        && cd /opt/ph"
  },
  {
    "path": "LICENSE",
    "chars": 11303,
    "preview": "Apache License\n                           Version 2.0, January 2004\n                        http://www.apache.org/licens"
  },
  {
    "path": "MANIFEST.in",
    "chars": 175,
    "preview": "include README.md\ninclude requirements.txt\ninclude Dockerfile\ninclude LICENSE\ninclude pyspider/logging.conf\ninclude pysp"
  },
  {
    "path": "README.md",
    "chars": 3044,
    "preview": "pyspider [![Build Status]][Travis CI] [![Coverage Status]][Coverage]\n========\n\nA Powerful Spider(Web Crawler) System in "
  },
  {
    "path": "config_example.json",
    "chars": 423,
    "preview": "{\n  \"taskdb\": \"couchdb+taskdb://user:password@couchdb:5984\",\n  \"projectdb\": \"couchdb+projectdb://user:password@couchdb:5"
  },
  {
    "path": "docker-compose.yaml",
    "chars": 2514,
    "preview": "version: \"3.7\"\n\n# replace /path/to/dir/ to point to config.json\n\n# The RabbitMQ and CouchDB services can take some time "
  },
  {
    "path": "docs/About-Projects.md",
    "chars": 2037,
    "preview": "About Projects\n==============\n\nIn most cases, a project is one script you write for one website.\n\n* Projects are indepen"
  },
  {
    "path": "docs/About-Tasks.md",
    "chars": 1734,
    "preview": "About Tasks\n===========\n\nTasks are the basic unit to be scheduled.\n\nBasis\n-----\n\n* A task is differentiated by its `task"
  },
  {
    "path": "docs/Architecture.md",
    "chars": 5163,
    "preview": "Architecture\n============\n\nThis document describes the reason why I made pyspider and the architecture.\n\nWhy\n---\nTwo yea"
  },
  {
    "path": "docs/Command-Line.md",
    "chars": 9260,
    "preview": "Command Line\n============\n\nGlobal Config\n-------------\n\nYou can get command help via `pyspider --help` and `pyspider all"
  },
  {
    "path": "docs/Deployment-demo.pyspider.org.md",
    "chars": 4593,
    "preview": "Deployment of demo.pyspider.org\n===============================\n\n[demo.pyspider.org](http://demo.pyspider.org/) is runni"
  },
  {
    "path": "docs/Deployment.md",
    "chars": 4500,
    "preview": "Deployment\n===========\n\nSince pyspider has various components, you can just run `pyspider` to start a standalone and thi"
  },
  {
    "path": "docs/Frequently-Asked-Questions.md",
    "chars": 3337,
    "preview": "Frequently Asked Questions\n==========================\n\nDoes pyspider Work with Windows?\n--------------------------------"
  },
  {
    "path": "docs/Quickstart.md",
    "chars": 3866,
    "preview": "Quickstart\n==========\n\nInstallation\n------------\n\n* `pip install pyspider`\n* run command `pyspider`, visit [http://local"
  },
  {
    "path": "docs/Running-pyspider-with-Docker.md",
    "chars": 2013,
    "preview": "```shell\n# mysql\ndocker run --name mysql -d -v /data/mysql:/var/lib/mysql -e MYSQL_ALLOW_EMPTY_PASSWORD=yes mysql:latest"
  },
  {
    "path": "docs/Script-Environment.md",
    "chars": 1232,
    "preview": "Script Environment\n==================\n\nVariables\n---------\n* `self.project_name`\n* `self.project` information about curr"
  },
  {
    "path": "docs/Working-with-Results.md",
    "chars": 2603,
    "preview": "Working with Results\n====================\nDownloading and viewing your data from WebUI is convenient, but may not suitab"
  },
  {
    "path": "docs/apis/@catch_status_code_error.md",
    "chars": 542,
    "preview": "@catch_status_code_error\n========================\n\nnon-200 response will been regarded as fetch failed and will not pass"
  },
  {
    "path": "docs/apis/@every.md",
    "chars": 729,
    "preview": "@every(minutes=0, seconds=0)\n============================\n\nmethod will been called every `minutes` or `seconds`\n\n\n```pyt"
  },
  {
    "path": "docs/apis/Response.md",
    "chars": 1726,
    "preview": "Response\n========\n\nThe attributes of Response object.\n\n### Response.url\n\nfinal URL.\n\n### Response.text\n\nContent of respo"
  },
  {
    "path": "docs/apis/index.md",
    "chars": 198,
    "preview": "API Reference\n=============\n    \n- [self.crawl](self.crawl)\n- [Response](Response)\n- [self.send_message](self.send_messa"
  },
  {
    "path": "docs/apis/self.crawl.md",
    "chars": 8232,
    "preview": "self.crawl\n===========\n\nself.crawl(url, **kwargs)\n-------------------------\n\n`self.crawl` is the main interface to tell "
  },
  {
    "path": "docs/apis/self.send_message.md",
    "chars": 1258,
    "preview": "self.send_message\n=================\n\nself.send_message(project, msg, [url])\n--------------------------------------\nsend "
  },
  {
    "path": "docs/conf.py",
    "chars": 585,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "docs/index.md",
    "chars": 3257,
    "preview": "pyspider [![Build Status][Build Status]][Travis CI] [![Coverage Status][Coverage Status]][Coverage] [![Try][Try]][Demo]\n"
  },
  {
    "path": "docs/tutorial/AJAX-and-more-HTTP.md",
    "chars": 7345,
    "preview": "Level 2: AJAX and More HTTP\n===========================\n\nIn the last article, we discussed how to extract links and info"
  },
  {
    "path": "docs/tutorial/HTML-and-CSS-Selector.md",
    "chars": 7937,
    "preview": "Level 1: HTML and CSS Selector\n==============================\n\nIn this tutorial, we will scrape information of movies an"
  },
  {
    "path": "docs/tutorial/Render-with-PhantomJS.md",
    "chars": 3392,
    "preview": "Level 3: Render with PhantomJS\n==============================\n\nSometimes web page is too complex to find out the API req"
  },
  {
    "path": "docs/tutorial/index.md",
    "chars": 396,
    "preview": "pyspider Tutorial\n=================\n\n> The best way to learn how to scrap is learning how to make it.\n\n* [Level 1: HTML "
  },
  {
    "path": "mkdocs.yml",
    "chars": 1248,
    "preview": "site_name: pyspider\nsite_description: A Powerful Spider(Web Crawler) System in Python.\nsite_author: binux\nrepo_url: http"
  },
  {
    "path": "pyspider/__init__.py",
    "chars": 207,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/database/__init__.py",
    "chars": 7260,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/database/base/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "pyspider/database/base/projectdb.py",
    "chars": 1879,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/database/base/resultdb.py",
    "chars": 1280,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/database/base/taskdb.py",
    "chars": 2951,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/database/basedb.py",
    "chars": 5896,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/database/couchdb/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "pyspider/database/couchdb/couchdbbase.py",
    "chars": 2962,
    "preview": "import time, requests, json\nfrom requests.auth import HTTPBasicAuth\n\nclass SplitTableMixin(object):\n    UPDATE_PROJECTS_"
  },
  {
    "path": "pyspider/database/couchdb/projectdb.py",
    "chars": 3937,
    "preview": "import time, requests, json\nfrom requests.auth import HTTPBasicAuth\nfrom pyspider.database.base.projectdb import Project"
  },
  {
    "path": "pyspider/database/couchdb/resultdb.py",
    "chars": 3368,
    "preview": "import time, json\nfrom pyspider.database.base.resultdb import ResultDB as BaseResultDB\nfrom .couchdbbase import SplitTab"
  },
  {
    "path": "pyspider/database/couchdb/taskdb.py",
    "chars": 3771,
    "preview": "import json, time\nfrom pyspider.database.base.taskdb import TaskDB as BaseTaskDB\nfrom .couchdbbase import SplitTableMixi"
  },
  {
    "path": "pyspider/database/elasticsearch/__init__.py",
    "chars": 186,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/database/elasticsearch/projectdb.py",
    "chars": 2821,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/database/elasticsearch/resultdb.py",
    "chars": 3922,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/database/elasticsearch/taskdb.py",
    "chars": 5082,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/database/local/__init__.py",
    "chars": 186,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/database/local/projectdb.py",
    "chars": 3013,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/database/mongodb/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "pyspider/database/mongodb/mongodbbase.py",
    "chars": 1476,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/database/mongodb/projectdb.py",
    "chars": 2252,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/database/mongodb/resultdb.py",
    "chars": 3073,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/database/mongodb/taskdb.py",
    "chars": 4789,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/database/mysql/__init__.py",
    "chars": 184,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/database/mysql/mysqlbase.py",
    "chars": 2014,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/database/mysql/projectdb.py",
    "chars": 2401,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/database/mysql/resultdb.py",
    "chars": 3637,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/database/mysql/taskdb.py",
    "chars": 4906,
    "preview": "#!/usr/bin/envutils\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux.m"
  },
  {
    "path": "pyspider/database/redis/__init__.py",
    "chars": 187,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/database/redis/taskdb.py",
    "chars": 5783,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/database/sqlalchemy/__init__.py",
    "chars": 187,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/database/sqlalchemy/projectdb.py",
    "chars": 3931,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/database/sqlalchemy/resultdb.py",
    "chars": 5169,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/database/sqlalchemy/sqlalchemybase.py",
    "chars": 1482,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/database/sqlalchemy/taskdb.py",
    "chars": 6160,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/database/sqlite/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "pyspider/database/sqlite/projectdb.py",
    "chars": 1836,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/database/sqlite/resultdb.py",
    "chars": 2899,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/database/sqlite/sqlitebase.py",
    "chars": 1886,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/database/sqlite/taskdb.py",
    "chars": 4019,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/fetcher/__init__.py",
    "chars": 37,
    "preview": "from .tornado_fetcher import Fetcher\n"
  },
  {
    "path": "pyspider/fetcher/cookie_utils.py",
    "chars": 898,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/fetcher/phantomjs_fetcher.js",
    "chars": 6721,
    "preview": "// vim: set et sw=2 ts=2 sts=2 ff=unix fenc=utf8:\n// Author: Binux<i@binux.me>\n//         http://binux.me\n// Created on "
  },
  {
    "path": "pyspider/fetcher/puppeteer_fetcher.js",
    "chars": 6112,
    "preview": "const express = require(\"express\");\nconst puppeteer = require('puppeteer');\nconst bodyParser = require('body-parser');\n\n"
  },
  {
    "path": "pyspider/fetcher/splash_fetcher.lua",
    "chars": 6394,
    "preview": "--#! /usr/bin/env lua\n--\n-- splash_fetcher.lua\n-- Copyright (C) 2016 Binux <roy@binux.me>\n--\n-- Distributed under terms "
  },
  {
    "path": "pyspider/fetcher/tornado_fetcher.py",
    "chars": 32232,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/libs/ListIO.py",
    "chars": 723,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/libs/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "pyspider/libs/base_handler.py",
    "chars": 15509,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/libs/bench.py",
    "chars": 8396,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/libs/counter.py",
    "chars": 12784,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/libs/dataurl.py",
    "chars": 1318,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/libs/log.py",
    "chars": 1183,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/libs/multiprocessing_queue.py",
    "chars": 2808,
    "preview": "import six\nimport platform\nimport multiprocessing\nfrom multiprocessing.queues import Queue as BaseQueue\n\n\n# The SharedCo"
  },
  {
    "path": "pyspider/libs/pprint.py",
    "chars": 12676,
    "preview": "#  Author:      Fred L. Drake, Jr.\n#               fdrake@...\n#\n#  This is a simple little module I wrote to make life e"
  },
  {
    "path": "pyspider/libs/response.py",
    "chars": 7618,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/libs/result_dump.py",
    "chars": 3952,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/libs/sample_handler.py",
    "chars": 684,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# Created on __DATE__\n# Project: __PROJECT_NAME__\n\nfrom pyspider.libs.ba"
  },
  {
    "path": "pyspider/libs/url.py",
    "chars": 3818,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/libs/utils.py",
    "chars": 12492,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/libs/wsgi_xmlrpc.py",
    "chars": 3839,
    "preview": "#   Copyright (c) 2006-2007 Open Source Applications Foundation\n#\n#   Licensed under the Apache License, Version 2.0 (th"
  },
  {
    "path": "pyspider/logging.conf",
    "chars": 763,
    "preview": "[loggers]\nkeys=root,scheduler,fetcher,processor,webui,bench,werkzeug\n\n[logger_root]\nlevel=INFO\nhandlers=screen\n\n[logger_"
  },
  {
    "path": "pyspider/message_queue/__init__.py",
    "chars": 2448,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/message_queue/kombu_queue.py",
    "chars": 3274,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/message_queue/rabbitmq.py",
    "chars": 8706,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<1717529"
  },
  {
    "path": "pyspider/message_queue/redis_queue.py",
    "chars": 3256,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/processor/__init__.py",
    "chars": 50,
    "preview": "from .processor import ProcessorResult, Processor\n"
  },
  {
    "path": "pyspider/processor/processor.py",
    "chars": 8215,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/processor/project_module.py",
    "chars": 9724,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/result/__init__.py",
    "chars": 242,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/result/result_worker.py",
    "chars": 2536,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/run.py",
    "chars": 32634,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/scheduler/__init__.py",
    "chars": 76,
    "preview": "from .scheduler import Scheduler, OneScheduler, ThreadBaseScheduler  # NOQA\n"
  },
  {
    "path": "pyspider/scheduler/scheduler.py",
    "chars": 47379,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/scheduler/task_queue.py",
    "chars": 9168,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/scheduler/token_bucket.py",
    "chars": 1418,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/webui/__init__.py",
    "chars": 238,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/webui/app.py",
    "chars": 3625,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/webui/bench_test.py",
    "chars": 887,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/webui/debug.py",
    "chars": 7306,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/webui/index.py",
    "chars": 4743,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/webui/login.py",
    "chars": 1876,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "pyspider/webui/result.py",
    "chars": 1803,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/webui/static/.babelrc",
    "chars": 28,
    "preview": "{\n  \"presets\": [\"es2015\"]\n}\n"
  },
  {
    "path": "pyspider/webui/static/package.json",
    "chars": 627,
    "preview": "{\n  \"name\": \"pyspider-webui\",\n  \"version\": \"0.3.9\",\n  \"description\": \"webui of pyspider\",\n  \"scripts\": {\n    \"build\": \"w"
  },
  {
    "path": "pyspider/webui/static/src/css_selector_helper.js",
    "chars": 6661,
    "preview": "// vim: set et sw=2 ts=2 sts=2 ff=unix fenc=utf8:\n// Author: Binux<i@binux.me>\n//         http://binux.me\n// Created on "
  },
  {
    "path": "pyspider/webui/static/src/debug.js",
    "chars": 19767,
    "preview": "// vim: set et sw=2 ts=2 sts=2 ff=unix fenc=utf8:\n// Author: Binux<i@binux.me>\n//         http://binux.me\n// Created on "
  },
  {
    "path": "pyspider/webui/static/src/debug.less",
    "chars": 7063,
    "preview": "/* vim: set et sw=2 ts=2 sts=2 ff=unix fenc=utf8: */\n/* Author: Binux<i@binux.me> */\n/*         http://binux.me */\n/* Cr"
  },
  {
    "path": "pyspider/webui/static/src/index.js",
    "chars": 6465,
    "preview": "// vim: set et sw=2 ts=2 sts=2 ff=unix fenc=utf8:\n// Author: Binux<i@binux.me>\n//         http://binux.me\n// Created on "
  },
  {
    "path": "pyspider/webui/static/src/index.less",
    "chars": 2103,
    "preview": "/* vim: set et sw=2 ts=2 sts=2 ff=unix fenc=utf8: */\n/* Author: Binux<i@binux.me> */\n/*         http://binux.me */\n/* Cr"
  },
  {
    "path": "pyspider/webui/static/src/result.less",
    "chars": 641,
    "preview": "/* vim: set et sw=2 ts=2 sts=2 ff=unix fenc=utf8: */\n/* Author: Binux<i@binux.me> */\n/*         http://binux.me */\n/* Cr"
  },
  {
    "path": "pyspider/webui/static/src/splitter.js",
    "chars": 9861,
    "preview": "// vim: set et sw=2 ts=2 sts=2 ff=unix fenc=utf8:\n// Author: Binux<i@binux.me>\n//         http://binux.me\n// Created on "
  },
  {
    "path": "pyspider/webui/static/src/task.less",
    "chars": 1023,
    "preview": "/* vim: set et sw=2 ts=2 sts=2 ff=unix fenc=utf8: */\n/* Author: Binux<i@binux.me> */\n/*         http://binux.me */\n/* Cr"
  },
  {
    "path": "pyspider/webui/static/src/tasks.less",
    "chars": 556,
    "preview": "/* vim: set et sw=2 ts=2 sts=2 ff=unix fenc=utf8: */\n/* Author: Binux<i@binux.me> */\n/*         http://binux.me */\n/* Cr"
  },
  {
    "path": "pyspider/webui/static/src/variable.less",
    "chars": 545,
    "preview": "/* vim: set et sw=2 ts=2 sts=2 ff=unix fenc=utf8: */\n/* Author: Binux<i@binux.me> */\n/*         http://binux.me */\n/* Cr"
  },
  {
    "path": "pyspider/webui/static/webpack.config.js",
    "chars": 723,
    "preview": "var webpack = require(\"webpack\");\nvar ExtractTextPlugin = require(\"extract-text-webpack-plugin\");\n\nmodule.exports = {\n  "
  },
  {
    "path": "pyspider/webui/task.py",
    "chars": 3102,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "pyspider/webui/templates/debug.html",
    "chars": 5821,
    "preview": "<!DOCTYPE html>\n<html lang=\"en\">\n  <head>\n    <meta charset=\"utf-8\">\n    <title>{{ project_name }} - Debugger - pyspider"
  },
  {
    "path": "pyspider/webui/templates/index.html",
    "chars": 9491,
    "preview": "<!DOCTYPE html>\n<html lang=\"en\">\n  <head>\n    <meta charset=\"utf-8\">\n    <title>Dashboard - pyspider</title>\n    <!--[if"
  },
  {
    "path": "pyspider/webui/templates/result.html",
    "chars": 3587,
    "preview": "<!DOCTYPE html>\n<html lang=\"en\">\n  <head>\n    <meta charset=\"utf-8\">\n    <title>Results - {{ project }} - pyspider</titl"
  },
  {
    "path": "pyspider/webui/templates/task.html",
    "chars": 3781,
    "preview": "<!DOCTYPE html>\n<html lang=\"en\">\n  <head>\n    <meta charset=\"utf-8\">\n    <title>Task - {{ task.project }}:{{ task.taskid"
  },
  {
    "path": "pyspider/webui/templates/tasks.html",
    "chars": 2338,
    "preview": "<!DOCTYPE html>\n<html lang=\"en\">\n  <head>\n    <meta charset=\"utf-8\">\n    <title>Tasks - pyspider</title>\n    <!--[if lt "
  },
  {
    "path": "pyspider/webui/webdav.py",
    "chars": 6954,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "requirements.txt",
    "chars": 385,
    "preview": "Flask==0.10\nJinja2==2.7\nchardet==3.0.4\ncssselect==0.9\nlxml==4.3.3\npycurl==7.43.0.3\npyquery==1.4.0\nrequests==2.24.0\ntorna"
  },
  {
    "path": "run.py",
    "chars": 256,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "setup.py",
    "chars": 2777,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "tests/__init__.py",
    "chars": 295,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "tests/data_fetcher_processor_handler.py",
    "chars": 1412,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "tests/data_handler.py",
    "chars": 1451,
    "preview": "\n#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binu"
  },
  {
    "path": "tests/data_sample_handler.py",
    "chars": 712,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# Created on __DATE__\n# Project: __PROJECT_NAME__\n\nfrom pyspider.libs.ba"
  },
  {
    "path": "tests/data_test_webpage.py",
    "chars": 1727,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "tests/test_base_handler.py",
    "chars": 1988,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "tests/test_bench.py",
    "chars": 1259,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "tests/test_counter.py",
    "chars": 1413,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "tests/test_database.py",
    "chars": 26753,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "tests/test_fetcher.py",
    "chars": 25015,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "tests/test_fetcher_processor.py",
    "chars": 23240,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "tests/test_message_queue.py",
    "chars": 8290,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "tests/test_processor.py",
    "chars": 20782,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "tests/test_response.py",
    "chars": 2934,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "tests/test_result_dump.py",
    "chars": 3006,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "tests/test_result_worker.py",
    "chars": 3042,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "tests/test_run.py",
    "chars": 14321,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "tests/test_scheduler.py",
    "chars": 28356,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<i@binux"
  },
  {
    "path": "tests/test_task_queue.py",
    "chars": 3972,
    "preview": "#!/usr/bin/env python\n# -*- coding: utf-8 -*-\n\nimport time\nimport unittest\n\nimport six\nfrom six.moves import queue as Qu"
  },
  {
    "path": "tests/test_utils.py",
    "chars": 2400,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "tests/test_webdav.py",
    "chars": 6724,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "tests/test_webui.py",
    "chars": 20165,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "tests/test_xmlrpc.py",
    "chars": 1991,
    "preview": "#   Copyright (c) 2006-2007 Open Source Applications Foundation\n#\n#   Licensed under the Apache License, Version 2.0 (th"
  },
  {
    "path": "tools/migrate.py",
    "chars": 2239,
    "preview": "#!/usr/bin/env python\n# -*- encoding: utf-8 -*-\n# vim: set et sw=4 ts=4 sts=4 ff=unix fenc=utf8:\n# Author: Binux<roy@bin"
  },
  {
    "path": "tox.ini",
    "chars": 300,
    "preview": "[tox]\nenvlist = py35,py36,py37,py38\n[testenv]\ninstall_command = \n    pip install --allow-all-external 'https://dev.mysql"
  }
]

About this extraction

This page contains the full source code of the binux/pyspider GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 165 files (775.9 KB), approximately 191.5k tokens, and a symbol index with 1327 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
