Full Code of BruceDone/awesome-crawler for AI

master 832efd3e8f8a cached
4 files
14.9 KB
4.1k tokens
1 requests
Download .txt
Repository: BruceDone/awesome-crawler
Branch: master
Commit: 832efd3e8f8a
Files: 4
Total size: 14.9 KB

Directory structure:
gitextract_v5ft3wbp/

├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
└── README.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Created by .ignore support plugin (hsz.mobi)
### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# IPython Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# dotenv
.env

# virtualenv
venv/
ENV/

# Spyder project settings
.spyderproject

# Rope project settings
.ropeproject

# pycharm ide
.idea


================================================
FILE: CONTRIBUTING.md
================================================
# Contribution Guidelines

## Adding to this list

Please ensure your pull request adheres to the following guidelines:

- Make an individual pull request for each suggestion.
- Use the following format: `[Resource Name](link) - Short description.`
- Don't repeat the resource name in the description.
- Link additions should be added in alphabetical order to the relevant category.
- Titles should be [capitalized](http://grammar.yourdictionary.com/capitalization/rules-for-capitalization-in-titles.html).
- Check your spelling and grammar.
- New categories or improvements to the existing categorization are welcome.
- The pull request and commit should have a useful title.

Thank you for your suggestions!

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2016 Bruce Tang

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# Awesome-crawler ![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/badge.svg)
A collection of awesome web crawler,spider and resources in different languages.

## Contents

- [Python](#python)
- [Java](#java)
- [C#](#c)
- [JavaScript](#javascript)
- [PHP](#php)
- [C++](#c-1)
- [C](#c-2)
- [Ruby](#ruby)
- [Rust](#rust)
- [R](#r)
- [Erlang](#erlang)
- [Perl](#perl)
- [Go](#go)
- [Scala](#scala)

## Python 
* [Scrapy](https://github.com/scrapy/scrapy) - A fast high-level screen scraping and web crawling framework.
    * [django-dynamic-scraper](https://github.com/holgerd77/django-dynamic-scraper) - Creating Scrapy scrapers via the Django admin interface.
    * [Scrapy-Redis](https://github.com/rolando/scrapy-redis) - Redis-based components for Scrapy.
    * [scrapy-cluster](https://github.com/istresearch/scrapy-cluster) - Uses Redis and Kafka to create a distributed on demand scraping cluster.
    * [distribute_crawler](https://github.com/gnemoug/distribute_crawler) - Uses scrapy,redis, mongodb,graphite to create a distributed spider.
* [pyspider](https://github.com/binux/pyspider) - A powerful spider system.
* [CoCrawler](https://github.com/cocrawler/cocrawler) - A versatile web crawler built using modern tools and concurrency.
* [cola](https://github.com/chineking/cola) - A distributed crawling framework.
* [Demiurge](https://github.com/matiasb/demiurge) - PyQuery-based scraping micro-framework.
* [Scrapely](https://github.com/scrapy/scrapely) - A pure-python HTML screen-scraping library.
* [feedparser](http://pythonhosted.org/feedparser/) - Universal feed parser.
* [you-get](https://github.com/soimort/you-get) -  Dumb downloader that scrapes the web.
* [MechanicalSoup](https://github.com/hickford/MechanicalSoup) - A Python library for automating interaction with websites.
* [portia](https://github.com/scrapinghub/portia) - Visual scraping for Scrapy.
* [crawley](https://github.com/jmg/crawley) - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
* [RoboBrowser](https://github.com/jmcarp/robobrowser) - A simple, Pythonic library for browsing the web without a standalone web browser.
* [MSpider](https://github.com/manning23/MSpider) - A simple ,easy spider using gevent and js render. 
* [brownant](https://github.com/douban/brownant) - A lightweight web data extracting framework.
* [PSpider](https://github.com/xianhu/PSpider) - A simple spider frame in Python3.
* [Gain](https://github.com/gaojiuli/gain) - Web crawling framework based on asyncio for everyone.
* [sukhoi](https://github.com/iogf/sukhoi) - Minimalist and powerful Web Crawler.
* [spidy](https://github.com/rivermont/spidy) - The simple, easy to use command line web crawler. 
* [newspaper](https://github.com/codelucas/newspaper) - News, full-text, and article metadata extraction in Python 3
* [aspider](https://github.com/howie6879/aspider) - An async web scraping micro-framework based on asyncio. 

## Java
* [ACHE Crawler](https://github.com/ViDA-NYU/ache) - An easy to use web crawler for domain-specific search.
* [Apache Nutch](http://nutch.apache.org/) - Highly extensible, highly scalable web crawler for production environment.
    * [anthelion](https://github.com/yahoo/anthelion) - A plugin for Apache Nutch to crawl semantic annotations within HTML pages.
* [Crawler4j](https://github.com/yasserg/crawler4j) - Simple and lightweight web crawler.
* [JSoup](http://jsoup.org/) - Scrapes, parses, manipulates and cleans HTML.
* [websphinx](http://www.cs.cmu.edu/~rcm/websphinx/) - Website-Specific Processors for HTML information extraction.
* [Open Search Server](http://www.opensearchserver.com/) - A full set of search functions. Build your own indexing strategy. Parsers extract full-text data. The crawlers can index everything.
* [Gecco](https://github.com/xtuhcy/gecco) - A easy to use lightweight web crawler
* [WebCollector](https://github.com/CrawlScript/WebCollector) - Simple interfaces for crawling the Web,you can setup a multi-threaded web crawler in less than 5 minutes.
* [Webmagic](https://github.com/code4craft/webmagic) - A scalable crawler framework.
* [Spiderman](https://git.oschina.net/l-weiwei/spiderman) - A scalable ,extensible, multi-threaded web crawler.
    * [Spiderman2](http://git.oschina.net/l-weiwei/Spiderman2) - A distributed  web crawler framework,support js render.
* [Heritrix3](https://github.com/internetarchive/heritrix3) -  Extensible, web-scale, archival-quality web crawler project.
* [SeimiCrawler](https://github.com/zhegexiaohuozi/SeimiCrawler) - An agile, distributed crawler framework.
* [StormCrawler](http://github.com/DigitalPebble/storm-crawler/) - An open source collection of resources for building low-latency, scalable web crawlers on Apache Storm
* [Spark-Crawler](https://github.com/USCDataScience/sparkler) - Evolving Apache Nutch to run on Spark.
* [webBee](https://github.com/pkwenda/webBee) - A DFS web spider.
* [spider-flow](https://github.com/ssssssss-team/spider-flow) - A visual spider framework, it's so good that you don't need to write any code to crawl the website.
* [Norconex Web Crawler](https://github.com/Norconex/collector-http) - Norconex HTTP Collector is a full-featured web crawler (or spider) that can manipulate and store collected data into a repository of your choice (e.g. a search engine). Can be used as a stand alone application or be embedded into Java applications.


## C# 
* [ccrawler](http://www.findbestopensource.com/product/ccrawler) - Built in C# 3.5 version. it contains a simple extension of web content categorizer, which can separate between the web page depending on their content.
* [SimpleCrawler](https://github.com/lei-zhu/SimpleCrawler) - Simple spider base on mutithreading, regluar expression.
* [DotnetSpider](https://github.com/zlzforever/DotnetSpider) - This is a cross platfrom, ligth spider develop by C#.
* [Abot](https://github.com/sjdirect/abot) - C# web crawler built for speed and flexibility.
* [Hawk](https://github.com/ferventdesert/Hawk) - Advanced Crawler and ETL tool written in C#/WPF.
* [SkyScraper](https://github.com/JonCanning/SkyScraper) - An asynchronous web scraper / web crawler using async / await and Reactive Extensions.
* [Infinity Crawler](https://github.com/TurnerSoftware/InfinityCrawler) - A simple but powerful web crawler library in C#.

## JavaScript
* [scraperjs](https://github.com/ruipgil/scraperjs) - A complete and versatile web scraper.
* [scrape-it](https://github.com/IonicaBizau/scrape-it) - A Node.js scraper for humans.
* [simplecrawler](https://github.com/cgiffard/node-simplecrawler) - Event driven web crawler.
* [node-crawler](https://github.com/bda-research/node-crawler) - Node-crawler has clean,simple api.
* [js-crawler](https://github.com/antivanov/js-crawler) - Web crawler for Node.JS, both HTTP and HTTPS are supported.
* [webster](https://github.com/zhuyingda/webster) - A reliable web crawling framework which can scrape ajax and js rendered content in a web page.
* [x-ray](https://github.com/lapwinglabs/x-ray) - Web scraper with pagination and crawler support.
* [node-osmosis](https://github.com/rchipka/node-osmosis) - HTML/XML parser and web scraper for Node.js.
* [web-scraper-chrome-extension](https://github.com/martinsbalodis/web-scraper-chrome-extension) - Web data extraction tool implemented as chrome extension.
* [supercrawler](https://github.com/brendonboshell/supercrawler) - Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits. 
* [headless-chrome-crawler](https://github.com/yujiosaka/headless-chrome-crawler) - Headless Chrome crawls with jQuery support
* [Squidwarc](https://github.com/n0tan3rd/squidwarc) - High fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
* [crawlee](https://github.com/apify/crawlee) - A web scraping and browser automation library for Node.js that helps you build reliable crawlers. Fast. 


## PHP
* [Goutte](https://github.com/FriendsOfPHP/Goutte) - A screen scraping and web crawling library for PHP.
    * [laravel-goutte](https://github.com/dweidner/laravel-goutte) - Laravel 5 Facade for Goutte.
* [dom-crawler](https://github.com/symfony/dom-crawler) - The DomCrawler component eases DOM navigation for HTML and XML documents.
* [QueryList](https://github.com/jae-jae/QueryList) - The progressive PHP crawler framework.
* [pspider](https://github.com/hightman/pspider) - Parallel web crawler written in PHP.
* [php-spider](https://github.com/mvdbos/php-spider) - A configurable and extensible PHP web spider.
* [spatie/crawler](https://github.com/spatie/crawler) - An easy to use, powerful crawler implemented in PHP. Can execute Javascript.
* [crawlzone/crawlzone](https://github.com/crawlzone/crawlzone) - Crawlzone is a fast asynchronous internet crawling framework for PHP.
* [PHPScraper](https://github.com/spekulatius/PHPScraper) - PHPScraper is a scraper & crawler built for simplicity.

## C++
* [open-source-search-engine](https://github.com/gigablast/open-source-search-engine) - A distributed open source search engine and spider/crawler written in C/C++.

## C
* [httrack](https://github.com/xroche/httrack) - Copy websites to your computer.

## Ruby
* [Nokogiri](https://github.com/sparklemotion/nokogiri) - A Rubygem providing HTML, XML, SAX, and Reader parsers with XPath and CSS selector support.
* [upton](https://github.com/propublica/upton) - A batteries-included framework for easy web-scraping. Just add CSS(Or do more).
* [wombat](https://github.com/felipecsl/wombat) - Lightweight Ruby web crawler/scraper with an elegant DSL which extracts structured data from pages.
* [RubyRetriever](https://github.com/joenorton/rubyretriever) - RubyRetriever is a Web Crawler, Scraper & File Harvester.
* [Spidr](https://github.com/postmodern/spidr) - Spider a site, multiple domains, certain links or infinitely.
* [Cobweb](https://github.com/stewartmckee/cobweb) - Web crawler with very flexible crawling options, standalone or using sidekiq.
* [mechanize](https://github.com/sparklemotion/mechanize) - Automated web interaction & crawling.

## Rust
* [spider](https://github.com/spider-rs/spider) - The fastest web crawler and indexer.
* [crawler](https://github.com/a11ywatch/crawler) - A gRPC web indexer turbo charged for performance.

## R
* [rvest](https://github.com/hadley/rvest) - Simple web scraping for R.

## Erlang 
* [ebot](https://github.com/matteoredaelli/ebot) - A scalable, distribuited and highly configurable web cawler.

## Perl
* [web-scraper](https://github.com/miyagawa/web-scraper) - Web Scraping Toolkit using HTML and CSS Selectors or XPath expressions.

## Go
* [pholcus](https://github.com/henrylee2cn/pholcus) -  A distributed, high concurrency and powerful web crawler.
* [gocrawl](https://github.com/PuerkitoBio/gocrawl) - Polite, slim and concurrent web crawler.
* [fetchbot](https://github.com/PuerkitoBio/fetchbot) - A simple and flexible web crawler that follows the robots.txt policies and crawl delays.
* [go_spider](https://github.com/hu17889/go_spider) - An awesome Go concurrent Crawler(spider) framework. 
* [dht](https://github.com/shiyanhui/dht) - BitTorrent DHT Protocol && DHT Spider.
* [ants-go](https://github.com/wcong/ants-go) - A open source, distributed, restful crawler engine in golang.
* [scrape](https://github.com/yhat/scrape) - A simple, higher level interface for Go web scraping.
* [creeper](https://github.com/wspl/creeper) - The Next Generation Crawler Framework (Go).
* [colly](https://github.com/asciimoo/colly) - Fast and Elegant Scraping Framework for Gophers.
* [ferret](https://github.com/MontFerret/ferret) - Declarative web scraping.
* [Dataflow kit](https://github.com/slotix/dataflowkit) - Extract structured data from web pages. Web sites scraping.
* [Hakrawler](https://github.com/hakluke/hakrawler) - Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application


## Scala
* [crawler](https://github.com/bplawler/crawler) - Scala DSL for web crawling.
* [scrala](https://github.com/gaocegege/scrala) - Scala crawler(spider) framework, inspired by scrapy.
* [ferrit](https://github.com/reggoodwin/ferrit) - Ferrit is a web crawler service written in Scala using Akka, Spray and Cassandra.
Download .txt
gitextract_v5ft3wbp/

├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
└── README.md
Condensed preview — 4 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (16K chars).
[
  {
    "path": ".gitignore",
    "chars": 1133,
    "preview": "# Created by .ignore support plugin (hsz.mobi)\n### Python template\n# Byte-compiled / optimized / DLL files\n__pycache__/\n"
  },
  {
    "path": "CONTRIBUTING.md",
    "chars": 709,
    "preview": "# Contribution Guidelines\n\n## Adding to this list\n\nPlease ensure your pull request adheres to the following guidelines:\n"
  },
  {
    "path": "LICENSE",
    "chars": 1067,
    "preview": "MIT License\n\nCopyright (c) 2016 Bruce Tang\n\nPermission is hereby granted, free of charge, to any person obtaining a copy"
  },
  {
    "path": "README.md",
    "chars": 12335,
    "preview": "# Awesome-crawler ![Awesome](https://cdn.rawgit.com/sindresorhus/awesome/d7305f38d29fed78fa85652e3a63e154dd8e8829/media/"
  }
]

About this extraction

This page contains the full source code of the BruceDone/awesome-crawler GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 4 files (14.9 KB), approximately 4.1k tokens. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Copied to clipboard!