Repository: lorien/awesome-web-scraping
Branch: master
Commit: 00ae7cdf2272
Files: 14
Total size: 108.0 KB
Directory structure:
gitextract_m3a45r4y/
├── .gitignore
├── CONTRIBUTING.md
├── LICENSE
├── Makefile
├── README.md
├── cli.md
├── golang.md
├── java.md
├── javascript.md
├── manuals.md
├── perl.md
├── php.md
├── python.md
└── ruby.md
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
*.swp
*.swo
*.orig
.idea
html
Pipfile.lock
================================================
FILE: CONTRIBUTING.md
================================================
# How to Contribute
## IMPORTANT. READ THIS FIRST.
DO NOT ADD WEB-SERVICES. THIS LIST IS FOR STANDALONE SOFTWARE.
DO NOT ADD FRESH PROJECTS. ANY PROJECT LESS THAN HALF A YEAR OLD WILL BE REJECTED.
## How to create a pull request
* Clone the awesome-web-scraping repo
* Add section if needed:
* Add section description
* Add section title to Table of contents
* Search previous suggestions before making a new one, as yours may be a duplicate
* Add your link: `* [project-name](http://example.com/) - description of software`
* Description must be brief
* Description must fit on one line
* Check your spelling and grammar
* Create a pull request. Specify what you have changed/added in the pull request's description.
================================================
FILE: LICENSE
================================================
Creative Commons Attribution 4.0 International License (CC BY 4.0)
http://creativecommons.org/licenses/by/4.0/
================================================
FILE: Makefile
================================================
.PHONY: html
html:
	python -m markdown README.md > html/README.html
	python -m markdown python.md > html/python.html
	python -m markdown web_service.md > html/web_service.html
================================================
FILE: README.md
================================================
# Awesome Web Scraping
Lists of packages, services and manuals related to web scraping.
## Topics
* [Python](https://github.com/lorien/awesome-web-scraping/blob/master/python.md) - Python packages
* [PHP](https://github.com/lorien/awesome-web-scraping/blob/master/php.md) - PHP packages
* [Ruby](https://github.com/lorien/awesome-web-scraping/blob/master/ruby.md) - Ruby packages
* [JavaScript](https://github.com/lorien/awesome-web-scraping/blob/master/javascript.md) - JavaScript packages
* [Go](https://github.com/lorien/awesome-web-scraping/blob/master/golang.md) - Go packages
* [Command Line Tools](https://github.com/lorien/awesome-web-scraping/blob/master/cli.md) - tools with a command line interface
* [Web Scraping Manuals](https://github.com/lorien/awesome-web-scraping/blob/master/manuals.md) - list of articles and books teaching web scraping
* [dhamaniasad / HeadlessBrowsers](https://github.com/dhamaniasad/HeadlessBrowsers) - list of (almost) all headless web browsers in existence
* [DNS over HTTPS providers](https://github.com/curl/curl/wiki/DNS-over-HTTPS) - list of DNS over HTTPS providers
* [Awesome Pastebins](https://github.com/lorien/awesome-pastebins) - list of pastebin sites
## Captcha Solving Services
* [https://2captcha.com](https://2captcha.com/?from=3019071)
## Proxy Server Marketplaces
* https://www.blackhatworld.com/forums/proxies-for-sale.112/
* https://forum.antichat.com/forums/147/
## Telegram Discussion Groups
* [@grablab](https://t.me/grablab) - talks in English
* [@grablab_ru](https://t.me/grablab_ru) - talks in Russian
## How to Contribute to This List
See the [Contributing](https://github.com/lorien/awesome-web-scraping/blob/master/CONTRIBUTING.md) guide.
## Credits
The list was initially based on data from these sources: [awesome-python](https://github.com/vinta/awesome-python), [awesome-php](https://github.com/ziadoz/awesome-php), [awesome-ruby](https://github.com/markets/awesome-ruby), [ruby-nlp](https://github.com/diasks2/ruby-nlp), [awesome-javascript](https://github.com/sorrycc/awesome-javascript).
================================================
FILE: cli.md
================================================
# Command Line Tools
This list contains network and data processing tools with a command line interface, written in any programming language.
## Contents
* [Network](#network)
* [Web Scraping](#web-scraping)
* [URLs](#urls)
## Network
EMPTY CONTENT
## Web Scraping
* [pipet](https://github.com/bjesus/pipet) - A swiss-army tool for scraping and extracting data using selectors, JavaScript and unix pipes
* [trafilatura](https://github.com/adbar/trafilatura) - Gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
## URLs
* [courlan](https://github.com/adbar/courlan) - Clean, filter and sample URLs to optimize data collection: Deduplication, spam, content and language filters
================================================
FILE: golang.md
================================================
# Golang Web Scraping
This list contains Golang libraries related to web scraping and data processing.
* [Golang Web Scraping](#golang-web-scraping)
* [Network](#network)
* [Web-scraping Frameworks](#web-scraping-frameworks)
* [HTML/XML Parsing](#htmlxml-parsing)
* [Text processing](#text-processing)
* [Specific Formats Processing](#specific-formats-processing)
* [Natural Language Processing](#natural-language-processing)
* [Browser automation and emulation](#browser-automation-and-emulation)
* [Multiprocessing](#multiprocessing)
* [Queue](#queue)
* [Email](#email)
* [URL and Network Address Manipulation](#url-and-network-address-manipulation)
* [Web Content Extracting](#web-content-extracting)
* [Asynchronous](#asynchronous)
* [WebSocket](#websocket)
* [DNS Resolving](#dns-resolving)
* [Computer Vision](#computer-vision)
* [Proxy Server](#proxy-server)
* [Other Golang Lists](#other-golang-lists)
## Network
* General
* [net](https://golang.org/pkg/net/) - built-in package for network I/O
* [net/http](https://golang.org/pkg/net/http/) - built-in package providing HTTP client and server implementations
* Asynchronous
* [goroutine](https://tour.golang.org/concurrency/1) - primitive green thread in Golang
## Web-Scraping Frameworks
* Full Featured Crawlers
* [Pholcus](https://github.com/henrylee2cn/pholcus) - Pholcus is a distributed, high concurrency and powerful web crawler software.
* [go_spider](https://github.com/hu17889/go_spider) - A flexible, modular and extensible concurrent crawler (spider) framework for Go.
* [ants-go](https://github.com/wcong/ants-go) - A distributed, restful crawler engine in golang.
* Full Featured Scrapers
* [geziyor](https://github.com/geziyor/geziyor) - Geziyor, a blazing fast web scraping framework, supports JS rendering.
* [colly](https://github.com/gocolly/colly) - Fast and elegant scraping framework
* [dataflow kit](https://github.com/slotix/dataflowkit) - Dataflow Kit - extract structured data from web sites.
* [flyscrape](https://github.com/philippta/flyscrape) - flyscrape is a standalone and scriptable web scraper.
* [goscrapy](https://github.com/tech-engine/goscrapy) - Scrapy inspired webscraping framework.
* Other
* [ferret](https://github.com/MontFerret/ferret) - A web scraping tool with a declarative query language.
## HTML/XML Parsing
* [encoding/xml](https://golang.org/pkg/encoding/xml/) - A built-in package that implements a simple XML 1.0 parser.
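
For illustration, `encoding/xml` can decode a sitemap-like document into a struct. This is a sketch only: the element names (`urlset`, `url`, `loc`) and the helper `parseLocs` are assumptions chosen for the example.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// urlSet mirrors a minimal sitemap-like document.
// The element names are assumptions made for this example.
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// parseLocs (hypothetical helper) returns the text of every <loc> element.
func parseLocs(data []byte) ([]string, error) {
	var set urlSet
	if err := xml.Unmarshal(data, &set); err != nil {
		return nil, err
	}
	locs := make([]string, 0, len(set.URLs))
	for _, u := range set.URLs {
		locs = append(locs, u.Loc)
	}
	return locs, nil
}

func main() {
	doc := []byte(`<urlset>
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>`)

	locs, err := parseLocs(doc)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, l := range locs {
		fmt.Println(l)
	}
}
```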
## Text Processing
*Libraries for parsing and manipulating plain texts.*
* General
* [regexp](https://golang.org/pkg/regexp/) - A built-in package that implements regular expression search.
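
A small sketch of `regexp` applied to scraped plain text: pulling dollar amounts out of a string. The pattern and the helper name `extractPrices` are assumptions for this example; real price formats vary.

```go
package main

import (
	"fmt"
	"regexp"
)

// priceRe matches a dollar sign followed by digits and an optional
// two-digit fraction; the capture group holds the number without the sign.
var priceRe = regexp.MustCompile(`\$(\d+(?:\.\d{2})?)`)

// extractPrices (hypothetical helper) returns every captured amount in text.
func extractPrices(text string) []string {
	var out []string
	for _, m := range priceRe.FindAllStringSubmatch(text, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	text := "Widget: $12.50, gadget: $7, bundle: $100.00"
	fmt.Println(extractPrices(text))
}
```

Regular expressions suit flat text like this; for HTML structure, a parser from the HTML/XML Parsing section is usually the safer choice.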
## Specific Formats Processing
*Libraries for parsing and manipulating specific text formats.*
* General
* [encoding/json](https://golang.org/pkg/encoding/json/) - A built-in package that implements encoding and decoding of JSON as defined in RFC 4627.
* [allot](https://github.com/sbstjn/allot) - Placeholder and wildcard text parsing for CLI tools and bots
* [bbConvert](https://github.com/CalebQ42/bbConvert) - Converts bbCode to HTML that allows you to add support for custom bbCode tags
* [blackfriday](https://github.com/russross/blackfriday) - Markdown processor in Go
* [bluemonday](https://github.com/microcosm-cc/bluemonday) - HTML Sanitizer
* [editorconfig-core-go](https://github.com/editorconfig/editorconfig-core-go) - Editorconfig file parser and manipulator for Go
* [enca](https://github.com/endeveit/enca) - Minimal cgo bindings for [libenca](http://cihar.com/software/enca/).
* [genex](https://github.com/alixaxel/genex) - Count and expand Regular Expressions into all matching Strings
* [github_flavored_markdown](https://godoc.org/github.com/shurcooL/github_flavored_markdown) - GitHub Flavored Markdown renderer (using blackfriday) with fenced code block highlighting, clickable header anchor links.
* [go-humanize](https://github.com/dustin/go-humanize) - Formatters for time, numbers, and memory size to human readable format.
* [go-nmea](https://github.com/adrianmo/go-nmea) - NMEA parser library for the Go language.
* [go-runewidth](https://github.com/mattn/go-runewidth) - Functions to get fixed width of the character or string.
* [go-slugify](https://github.com/mozillazg/go-slugify) - Make pretty slug with multiple languages support.
* [go-vcard](https://github.com/emersion/go-vcard) - Parse and format vCard
* [gofeed](https://github.com/mmcdole/gofeed) - Parse RSS and Atom feeds in Go
* [gographviz](https://github.com/awalterschulze/gographviz) - Parses the Graphviz DOT language.
* [gommon/bytes](https://github.com/labstack/gommon/tree/master/bytes) - Format bytes to string.
* [gonameparts](https://github.com/polera/gonameparts) - Parses human names into individual name parts
* [GoQuery](https://github.com/PuerkitoBio/goquery) - GoQuery brings a syntax and a set of features similar to jQuery to the Go language.
* [goregen](https://github.com/zach-klippenstein/goregen) - A library for generating random strings from regular expressions.
* [gotext](https://github.com/leonelquinteros/gotext) - GNU gettext utilities for Go.
* [guesslanguage](https://github.com/endeveit/guesslanguage) - Functions to determine the natural language of a unicode text.
* [inject](https://github.com/facebookgo/inject) - Package inject provides a reflect based injector.
* [mxj](https://github.com/clbanning/mxj) - Encode / decode XML as JSON or map[string]interface{}; extract values with dot-notation paths and wildcards. Replaces x2j and j2x packages.
* [sh](https://github.com/mvdan/sh) - A shell parser and formatter
* [slug](https://github.com/gosimple/slug) - URL-friendly slugify with multiple languages support.
* [Slugify](https://github.com/avelino/slugify) - A Go slugify application that handles string.
* [toml](https://github.com/BurntSushi/toml) - TOML configuration format (encoder/decoder with reflection).
* [xpath](https://github.com/antchfx/xpath) - XPath package for Go.
* [xquery](https://github.com/antchfx/xquery) - XQuery lets you extract data from HTML/XML documents using XPath expression.
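
As a quick illustration of the built-in `encoding/json` package listed at the top of this section, the sketch below decodes a JSON array into typed structs. The `product` type, its fields and the helper `decodeProducts` are assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// product is a hypothetical record shape; the struct tags map
// JSON keys to Go fields.
type product struct {
	Name  string  `json:"name"`
	Price float64 `json:"price"`
}

// decodeProducts (hypothetical helper) parses a JSON array of products.
func decodeProducts(data []byte) ([]product, error) {
	var items []product
	if err := json.Unmarshal(data, &items); err != nil {
		return nil, err
	}
	return items, nil
}

func main() {
	data := []byte(`[{"name":"widget","price":12.5},{"name":"gadget","price":7}]`)
	items, err := decodeProducts(data)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, p := range items {
		fmt.Printf("%s costs %.2f\n", p.Name, p.Price)
	}
}
```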
## Natural Language Processing
*Libraries for working with human languages.*
* [dpar](https://github.com/danieldk/dpar/) - Transition-based statistical dependency parser.
* [go-eco](https://github.com/ThePaw/go-eco) - Similarity, dissimilarity and distance matrices; diversity, equitability and inequality measures; species richness estimators; coenocline models.
* [go-i18n](https://github.com/nicksnyder/go-i18n/) - A package and an accompanying tool to work with localized text.
* [go-mystem](https://github.com/dveselov/mystem) - CGo bindings to Yandex.Mystem - Russian morphology analyzer.
* [go-nlp](https://github.com/nuance/go-nlp) - Utilities for working with discrete probability distributions and other tools useful for doing NLP work.
* [go-stem](https://github.com/agonopol/go-stem) - Implementation of the porter stemming algorithm.
* [go-unidecode](https://github.com/mozillazg/go-unidecode) - ASCII transliterations of Unicode text.
* [go2vec](https://github.com/danieldk/go2vec) - Reader and utility functions for word2vec embeddings.
* [gojieba](https://github.com/yanyiwu/gojieba) - This is a Go implementation of [jieba](https://github.com/fxsjy/jieba), which is a Chinese word splitting algorithm.
* [golibstemmer](https://github.com/rjohnsondev/golibstemmer) - Go bindings for the snowball libstemmer library including porter 2
* [gounidecode](https://github.com/fiam/gounidecode) - Unicode transliterator (also known as unidecode) for Go
* [icu](https://github.com/goodsign/icu) - Cgo binding for icu4c C library detection and conversion functions. Guaranteed compatibility with version 50.1.
* [libtextcat](https://github.com/goodsign/libtextcat) - Cgo binding for libtextcat C library. Guaranteed compatibility with version 2.2.
* [MMSEGO](https://github.com/awsong/MMSEGO) - This is a Go implementation of [MMSEG](http://technology.chtsai.org/mmseg/), which is a Chinese word splitting algorithm.
* [paicehusk](https://github.com/rookii/paicehusk) - Golang implementation of the Paice/Husk Stemming Algorithm
* [porter](https://github.com/a2800276/porter) - This is a fairly straightforward port of Martin Porter's C implementation of the Porter stemming algorithm.
* [porter2](https://github.com/zhenjl/porter2) - Really fast Porter 2 stemmer.
* [prose](https://github.com/jdkato/prose) - A library for text processing that supports tokenization, part-of-speech tagging, named-entity extraction, and more.
* [RAKE.go](https://github.com/Obaied/RAKE.go) - A Go port of the Rapid Automatic Keyword Extraction Algorithm (RAKE)
* [segment](https://github.com/blevesearch/segment) - A Go library for performing Unicode Text Segmentation as described in [Unicode Standard Annex #29](http://www.unicode.org/reports/tr29/)
* [sentences](https://github.com/neurosnap/sentences) - A sentence tokenizer: converts text into a list of sentences.
* [snowball](https://github.com/goodsign/snowball) - Snowball stemmer port (cgo wrapper) for Go. Provides word stem extraction functionality [Snowball native](http://snowball.tartarus.org/).
* [stemmer](https://github.com/dchest/stemmer) - Stemmer packages for Go programming language. Includes English and German stemmers.
* [textcat](https://github.com/pebbe/textcat) - A Go package for n-gram based text categorization, with support for utf-8 and raw text
* [whatlanggo](https://github.com/abadojack/whatlanggo) - A natural language detection package for Go. Supports 84 languages and 24 scripts (writing systems e.g. Latin, Cyrillic, etc).
* [when](https://github.com/olebedev/when) - A natural EN and RU language date/time parser with pluggable rules
## Browser automation and emulation
* [chromedp](https://github.com/chromedp/chromedp) - A faster, simpler way to drive browsers supporting the Chrome DevTools Protocol
## Multiprocessing
* TODO
## Asynchronous
*Libraries for asynchronous networking programming.*
* TODO
## Queue
* [NSQ](https://github.com/nsqio/nsq) - A realtime distributed messaging platform.
* [NATS](https://github.com/nats-io/go-nats) - Golang client for NATS, the cloud native messaging system.
## Email
*Libraries for parsing email.*
* [douceur](https://github.com/aymerick/douceur) - CSS inliner for your HTML emails.
* [email](https://github.com/jordan-wright/email) - A robust and flexible email library for Go.
* [go-dkim](https://github.com/toorop/go-dkim) - A DKIM library, to sign & verify email.
* [go-imap](https://github.com/emersion/go-imap) - An IMAP library for clients and servers
* [go-message](https://github.com/emersion/go-message) - A streaming library for the Internet Message Format and mail messages
* [Gomail](https://github.com/go-gomail/gomail/) - Gomail is a very simple and powerful package to send emails.
* [Hectane](https://github.com/hectane/hectane) - Lightweight SMTP client providing an HTTP API
* [hermes](https://github.com/matcornic/hermes) - Golang package that generates clean, responsive HTML e-mails
* [MailHog](https://github.com/mailhog/MailHog) - Email and SMTP testing with web and API interface
* [SendGrid](https://github.com/sendgrid/sendgrid-go) - SendGrid's Go library for sending email
* [smtp](https://github.com/mailhog/smtp) - SMTP server protocol state machine
## URL and Network Address Manipulation
*Libraries for parsing/modifying URLs and network addresses.*
* URL
* [net/url](https://golang.org/pkg/net/url/)
* Network Address
* TODO
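
A common scraping task for `net/url` is turning the relative links found on a page into absolute URLs. The sketch below does this with `ResolveReference`; the helper name `absoluteLinks` and the example URLs are assumptions.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteLinks (hypothetical helper) resolves each href against the
// URL of the page it was found on. Hrefs that fail to parse are skipped.
func absoluteLinks(base string, hrefs []string) ([]string, error) {
	baseURL, err := url.Parse(base)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, h := range hrefs {
		ref, err := url.Parse(h)
		if err != nil {
			continue
		}
		out = append(out, baseURL.ResolveReference(ref).String())
	}
	return out, nil
}

func main() {
	links, err := absoluteLinks(
		"https://example.com/catalog/page1.html",
		[]string{"item?id=2", "/about", "../img/a.png", "https://other.example/x"},
	)
	if err != nil {
		fmt.Println("bad base URL:", err)
		return
	}
	for _, l := range links {
		fmt.Println(l)
	}
}
```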
## Web Content Extracting
*Libraries for extracting web contents.*
* Text and Meta Data from HTML pages
* [x/net/html](https://pkg.go.dev/golang.org/x/net/html)
## WebSocket
*Libraries for working with WebSocket.*
* [gorilla/websocket](https://github.com/gorilla/websocket)
## DNS Resolving
* [net](https://golang.org/pkg/net/) - Built-in package with some DNS-related functions.
* [miekg/dns](https://github.com/miekg/dns) - A DNS library in Go.
## Computer Vision
* TODO
## Proxy Server
* [gin](https://github.com/codegangsta/gin) - Live reload utility for Go web servers.
* [Caddy](https://github.com/caddyserver/caddy) - Fast, cross-platform HTTP/2 web server with automatic HTTPS, also can serve as a reverse proxy server.
## Other Golang lists
* TODO
================================================
FILE: java.md
================================================
# Java Web Scraping
This list contains Java libraries related to web scraping and data processing.
* [Java Web Scraping](#java-web-scraping)
* [Network](#network)
* [Web-scraping Frameworks](#web-scraping-frameworks)
* [HTML/XML Parsing](#htmlxml-parsing)
* [Text processing](#text-processing)
* [Specific Formats Processing](#specific-formats-processing)
* [Natural Language Processing](#natural-language-processing)
* [Browser automation and emulation](#browser-automation-and-emulation)
* [Multiprocessing](#multiprocessing)
* [Queue](#queue)
* [Email](#email)
* [URL and Network Address Manipulation](#url-and-network-address-manipulation)
* [Web Content Extracting](#web-content-extracting)
* [Asynchronous](#asynchronous)
* [WebSocket](#websocket)
* [DNS Resolving](#dns-resolving)
* [Computer Vision](#computer-vision)
* [Proxy Server](#proxy-server)
* [Other Java Lists](#other-java-lists)
## Network
* General
* [Apache HttpClient](https://hc.apache.org/)
* [okhttp3](http://square.github.io/okhttp/)
* Asynchronous
* [Apache Async HttpClient](https://hc.apache.org/)
* [AsyncHttpClient](https://github.com/AsyncHttpClient/async-http-client)
## Web-Scraping Frameworks
* Full Featured Crawlers
* [ACHE Crawler](https://github.com/ViDA-NYU/ache)
* [Apache Nutch](http://nutch.apache.org/)
* Other
* [Crawler4j](https://github.com/yasserg/crawler4j)
* [StormCrawler](https://github.com/DigitalPebble/storm-crawler)
## HTML/XML Parsing
* [Apache Tika](https://tika.apache.org/)
## Text Processing
*Libraries for parsing and manipulating plain texts.*
* General
* [Apache Tika](https://tika.apache.org/)
## Specific Formats Processing
*Libraries for parsing and manipulating specific text formats.*
* General
* [Apache Tika](https://tika.apache.org/)
* Something
* TODO
## Natural Language Processing
*Libraries for working with human languages.*
* [Apache OpenNLP](https://opennlp.apache.org/)
* [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/)
* [Apache Tika](https://tika.apache.org/)
## Browser automation and emulation
* [htmlunit](http://htmlunit.sourceforge.net/)
## Multiprocessing
* TODO
## Asynchronous
*Libraries for asynchronous networking programming.*
* TODO
## Queue
* TODO
## Email
*Libraries for parsing email.*
* TODO
## URL and Network Address Manipulation
*Libraries for parsing/modifying URLs and network addresses.*
* URL
* TODO
* Network Address
* TODO
## Web Content Extracting
*Libraries for extracting web contents.*
* Text and Meta Data from HTML pages
* [Boilerpipe](https://github.com/kohlschutter/boilerpipe)
* [Apache Tika](https://tika.apache.org/)
## WebSocket
*Libraries for working with WebSocket.*
* TODO
## DNS Resolving
* [dnsjava](http://www.dnsjava.org/)
* [spotify-dns-java](https://github.com/spotify/dns-java)
## Computer Vision
* TODO
## Proxy Server
* TODO
## Other Java lists
* TODO
================================================
FILE: javascript.md
================================================
# JavaScript Web Scraping
This list contains JavaScript libraries related to web scraping and data processing. It focuses on libraries that can run in Node.js (without a real web browser).
* [JavaScript Web Scraping](#javascript-web-scraping)
* [Network](#network)
* [Web-scraping Frameworks](#web-scraping-frameworks)
* [HTML/XML Parsing](#htmlxml-parsing)
* [Text processing](#text-processing)
* [Specific Formats Processing](#specific-formats-processing)
* [Natural Language Processing](#natural-language-processing)
* [Browser automation and emulation](#browser-automation-and-emulation)
* [Multiprocessing](#multiprocessing)
* [Queue](#queue)
* [Email](#email)
* [URL and Network Address Manipulation](#url-and-network-address-manipulation)
* [Web Content Extracting](#web-content-extracting)
* [Asynchronous](#asynchronous)
* [WebSocket](#websocket)
* [DNS Resolving](#dns-resolving)
* [Computer Vision](#computer-vision)
* [Proxy Server](#proxy-server)
* [Other JavaScript Lists](#other-javascript-lists)
* [Data Structure](#data-structure)
## Network
* [request](https://github.com/request/request) - Simplified HTTP request client.
* [socks5-http-client](https://github.com/mattcg/socks5-http-client) - SOCKS v5 HTTP client implementation in JavaScript for Node.js
* [rest](https://github.com/cujojs/rest) - RESTful HTTP client for JavaScript
* [wreck](https://github.com/hapijs/wreck) - HTTP Client Utilities
* [got](https://github.com/sindresorhus/got) - Simplified HTTP requests
* [node-fetch](https://github.com/bitinn/node-fetch) - A light-weight module that brings window.fetch to Node.js
* [bent](https://github.com/mikeal/bent) - Functional HTTP client for Node.js w/ async/await
* [axios](https://github.com/axios/axios) - Promise based HTTP client for the browser and node.js
* [superagent](https://github.com/visionmedia/superagent) - Ajax for Node.js and browsers (JS HTTP client)
* [urllib](https://github.com/node-modules/urllib) - Request HTTP(s) URLs in a complex world
* [needle](https://github.com/tomas/needle) - Nimble, streamable HTTP client for Node.js. With proxy, iconv, cookie, deflate & multipart support
## Web-Scraping Frameworks
* [webparsy](https://github.com/joseconstela/webparsy) - NodeJS lib and cli for scraping websites using Puppeteer and YAML
* [node-crawler](https://github.com/sylvinus/node-crawler) - Web Crawler/Spider for NodeJS + server-side jQuery
* [node-simplecrawler](https://github.com/cgiffard/node-simplecrawler) - Flexible event driven crawler for node
* [Crawlee](https://github.com/apify/crawlee) - Node.js and TypeScript library that crawls with Cheerio, JSDOM, Playwright and Puppeteer while enhancing them with anti-blocking features, queue, storages and more.
* [Ayakashi](https://github.com/ayakashi-io/ayakashi) - The next generation web scraping framework. Features all the necessary tools to create reliable and maintainable scraping and automation systems.
* [pjscrape](https://github.com/nrabinowitz/pjscrape) - A web-scraping framework written in Javascript, using PhantomJS and jQuery
## HTML/XML Parsing
* General
* [parse5](https://github.com/inikulin/parse5) - WHATWG HTML5 specification-compliant, fast and ready for production HTML parsing/serialization toolset for Node and io.js
* [htmlparser2](https://github.com/fb55/htmlparser2) - forgiving html and xml parser
* [sax-js](https://github.com/isaacs/sax-js) - A sax style parser for JS
* [cheerio](https://github.com/cheeriojs/cheerio) - Fast, flexible, and lean implementation of core jQuery designed specifically for the server
* Sanitizing
* [js-xss](https://github.com/leizongmin/js-xss) - Sanitize untrusted HTML (to prevent XSS) with a configuration specified by a Whitelist.
* [surgeon](https://github.com/gajus/surgeon) - Declarative DOM extraction expression evaluator
## Text Processing
*Libraries for parsing and manipulating plain texts.*
* General
* [string.js](https://github.com/jprichardson/string.js) - Extra JavaScript string methods.
* [accounting.js](https://github.com/openexchangerates/accounting.js) - A lightweight JavaScript library for number, money and currency formatting - fully localisable, zero dependencies.
* [validator.js](https://github.com/chriso/validator.js) - String validation and sanitization.
* Date and time
* [moment](https://github.com/moment/moment) - Parse, validate, manipulate, and display dates in javascript.
* [moment-timezone](https://github.com/moment/moment-timezone) - Timezone support for moment.js.
* [date](https://github.com/MatthewMueller/date) - Date() for humans.
* [ms.js](https://github.com/guille/ms.js) - Tiny millisecond conversion utility.
* HTML entities
* [he](https://github.com/mathiasbynens/he) - A robust HTML entity encoder/decoder written in JavaScript.
* Money
* [money.js](https://github.com/openexchangerates/money.js) - Simple and tiny JavaScript library for realtime currency conversion and exchange rate calculation, from any currency, to any currency.
* Color
* [chroma.js](https://github.com/gka/chroma.js) - JavaScript library for all kinds of color manipulations.
* [color](https://github.com/harthur/color) - JavaScript color conversion and manipulation library.
* [TinyColor](https://github.com/bgrins/TinyColor) - Fast, small color manipulation and conversion for JavaScript.
* User Agent
* [UAParser.js](https://github.com/faisalman/ua-parser-js) - Lightweight JavaScript-based User-Agent string parser. Supports browser & node.js environment.
* Semantic Version
* [node-semver](https://github.com/npm/node-semver) - The semver parser for node
## Specific Formats Processing
*Libraries for parsing and manipulating specific text formats.*
* General
* [jBinary](https://github.com/jDataView/jBinary) - High-level I/O (loading, parsing, manipulating, serializing, saving) for binary files with declarative syntax for describing file types and data structures.
* Office
* [js-xlsx](https://github.com/SheetJS/js-xlsx) - XLSX / XLSM / XLSB / XLS / SpreadsheetML (Excel Spreadsheet) / ODS parser and writer
* CSV
* [BabyParse](https://github.com/Rich-Harris/BabyParse) - Fast and reliable CSV parser based on Papa Parse. Papa Parse is for the browser, Baby Parse is for Node.js.
* [CSV](https://github.com/knrz/CSV.js) - A simple, blazing-fast CSV parser and encoder. Full RFC 4180 compliance.
* JSON
* [json3](https://github.com/bestiejs/json3) - A modern JSON implementation compatible with nearly all JavaScript platforms.
* EXIF
* [exif-js](https://github.com/exif-js/exif-js) - JavaScript library for reading EXIF image metadata
* CSS
* [parse-css](https://github.com/tabatkins/parse-css) - Standards-based CSS Parser
* [parser-lib CSS parser](https://github.com/CSSLint/parser-lib) - The ParserLib CSS parser is a CSS3 SAX-inspired parser written in JavaScript. By default, the parser only deals with standard CSS syntax and doesn't do validation (checking of property names and values).
* Torrent
* [parse-torrent](https://github.com/feross/parse-torrent) - Parse a torrent identifier (magnet uri, .torrent file, info hash)
* SQL
* [SQL Parser](https://github.com/forward/sql-parser) - SQL Parser is a lexer, grammar and parser for SQL written in JS. Currently it is only capable of parsing fairly basic SELECT queries.
* YAML
* [JS-YAML](https://github.com/nodeca/js-yaml) - JavaScript YAML parser and dumper. Very fast.
* Markdown
* [markdown-it](https://github.com/markdown-it/markdown-it) - Markdown parser, done right. 100% CommonMark support, extensions, syntax plugins & high speed
* Atom/RSS
* [node-feedparser](https://github.com/danmactough/node-feedparser) - Robust RSS, Atom, and RDF feed parsing in Node.js
* Netscape Bookmarks (Firefox, Google Chrome, ...)
* [node-bookmarks-parser](https://github.com/calibr/node-bookmarks-parser) - Parses Firefox/Chrome HTML bookmarks files
## Natural Language Processing
*Libraries for working with human languages.*
* General
* [natural](https://github.com/NaturalNode/natural) - general natural language facilities for node
* [nlp_compromise](https://github.com/spencermountain/nlp_compromise) - natural language processing
* [Hanzi](https://github.com/nieldlr/Hanzi) - HanziJS is a Chinese character and NLP module for Chinese language processing for Node.js
* [salient](https://github.com/nyxtom/salient) - Machine Learning, Natural Language Processing and Sentiment Analysis Toolkit for Node.js
* [node-summary](https://github.com/jbrooksuk/node-summary) - Node module that summarizes text using a naive summarization algorithm
* Stemmer
* [snowball-js](https://github.com/fortnightlabs/snowball-js) - javascript implementation of the popular snowball word stemming nlp algorithm
* [porter-stemmer](https://github.com/jedp/porter-stemmer) - Martin Porter's stemmer for node.js
* [Porter-Stemmer](https://github.com/kristopolous/Porter-Stemmer) - A Javascript Implementation of the Porter Stemmer
* [lunr-languages](https://github.com/MihaiValentin/lunr-languages) - A collection of language stemmers and stopwords for the Lunr JavaScript library
* Language detection
* [franc](https://github.com/wooorm/franc) - Natural language detection
* [guessLanguage.js](https://github.com/richtr/guessLanguage.js) - A natural language detection library based on trigram statistical analysis for Node.js
## Browser automation and emulation
* [phantomjs](https://github.com/ariya/phantomjs) - Scriptable Headless WebKit.
* [slimerjs](https://github.com/laurentj/slimerjs) - A PhantomJS-like tool running Gecko.
* [casperjs](https://github.com/n1k0/casperjs) - Navigation scripting & testing utility for PhantomJS and SlimerJS.
* [zombie](https://github.com/assaf/zombie) - Insanely fast, full-stack, headless browser testing using node.js.
* [nightmare](https://github.com/segmentio/nightmare) - Nightmare is a high level wrapper for PhantomJS that lets you automate browser tasks
* [puppeteer](https://github.com/GoogleChrome/puppeteer) - Puppeteer is a Node library which provides a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome or Chromium.
* [headless-chrome-crawler](https://github.com/yujiosaka/headless-chrome-crawler) - Distributed crawler powered by Headless Chrome
* [puppeteer-recorder](https://github.com/checkly/puppeteer-recorder) - Puppeteer recorder is a Chrome extension that records your browser interactions and generates a Puppeteer script.
* [wendigo](https://github.com/angrykoala/wendigo) - Test-oriented headless browser, built on top of Puppeteer.
* [Playwright](https://github.com/microsoft/playwright) - Node.js library to automate Chromium, Firefox and WebKit with a single API
## Multiprocessing
* [nexpect](https://github.com/nodejitsu/nexpect) - spawn and control child processes in node.js with ease
* [respawn](https://github.com/mafintosh/respawn) - Spawn a process and restart it if it crashes
* [node-webworker](https://github.com/pgriess/node-webworker) - A WebWorkers implementation for NodeJS
## Asynchronous
*Libraries for asynchronous networking programming.*
* [socket.io](https://github.com/socketio/socket.io) - Realtime application framework (Node.JS server)
* [engine.io](https://github.com/socketio/engine.io) - Engine.IO is the implementation of transport-based cross-browser/cross-device bi-directional communication layer for Socket.IO
* [async](https://github.com/caolan/async) - Async utilities for node and the browser
## Queue
* [kue](https://github.com/Automattic/kue) - Kue is a priority job queue backed by redis, built for node.js
* [bull](https://github.com/OptimalBits/bull) - A lightweight, robust and fast job processing queue. Carefully written for rock solid stability and atomicity.
## Email
*Libraries for parsing email.*
* [mailparser](https://github.com/andris9/mailparser) - Decode mime formatted e-mails
## URL and Network Address Manipulation
*Libraries for parsing/modifying URLs and network addresses.*
* URL
* [query-string](https://github.com/sindresorhus/query-string) - Parse and stringify URL query strings.
* [URI.js](https://github.com/medialize/URI.js/) - Javascript URL mutation library.
* [jsurl](https://github.com/Mikhus/jsurl) - Lightweight URL manipulation with JavaScript.
* [arg.js](https://github.com/stretchr/arg.js) - Lightweight URL argument and parameter parser
* Network Address
* [node-ip](https://github.com/indutny/node-ip) - IP address tools for node.js
* [ip-address](https://github.com/beaugunderson/ip-address) - A library for parsing and manipulating IPv6 (and v4) addresses in JavaScript
## Web Content Extracting
*Libraries for extracting web contents.*
* [node-read](https://github.com/bndr/node-read) - Get Readable Content from any page. Based on Arc90's readability project using cheerio engine.
* [node-ytdl-core](https://github.com/fent/node-ytdl-core) - YouTube video downloader in JavaScript
* [ImageResolver](https://github.com/mauricesvay/ImageResolver) - Does its best to determine the main image on a URL without loading all images.
## WebSocket
*Libraries for working with WebSocket.*
* [websocket.io](https://github.com/LearnBoost/websocket.io) - WebSocket.IO is an abstraction of the websocket server previously used by Socket.IO. It has the broadest support for websocket protocol/specifications and an API that allows for interoperability with higher-level frameworks such as Engine, Socket.IO's realtime core.
* [WebSocket-Node](https://github.com/theturtle32/WebSocket-Node) - A WebSocket Implementation for Node.JS (Draft -08 through the final RFC 6455)
## DNS Resolving
* [multicast-dns](https://github.com/mafintosh/multicast-dns) - Low level multicast-dns implementation in pure javascript
* [node-dns](https://github.com/tjfontaine/node-dns) - Replacement dns module in pure javascript for node.js
## Computer Vision
* [tracking.js](https://github.com/eduardolundgren/tracking.js) - A modern approach for Computer Vision on the web.
* [ocrad.js](https://github.com/antimatter15/ocrad.js) - OCR in Javascript via Emscripten.
## Proxy Server
* [toxy](https://github.com/h2non/toxy) - Hackable HTTP proxy to simulate server failure scenarios and unexpected network conditions
* [proxy-chain](https://github.com/apifytech/proxy-chain) - Node.js implementation of a proxy server (think Squid) with support for SSL, authentication and upstream proxy chaining
## Data Structure
* [immutable](https://github.com/facebook/immutable-js) - Immutable persistent data collections for Javascript which increase efficiency and simplicity.
* [lodash](https://github.com/lodash/lodash) - More consistent cross-environment iteration support for arrays, strings, objects, and arguments objects
## Other JavaScript lists
* [awesome-javascript](https://github.com/sorrycc/awesome-javascript)
================================================
FILE: manuals.md
================================================
# Web Scraping Manuals
## Table of Contents
- [About the List](#about-the-list)
- [Base Things](#base-things)
- [Information Availability](#information-availability)
- [Information Granularity](#information-granularity)
- [How to Contribute](#how-to-contribute)
- [Web Scraping Articles and Topics](#web-scraping-articles-and-topics)
- [HTML](#html)
- [HTTP](#http)
- [DNS](#dns)
- [TCP](#tcp)
- [TLS](#tls)
- [WebSocket](#websocket)
- [Concurrency](#concurrency)
- [Text Encoding](#text-encoding)
- [URL](#url)
- [XMLHttpRequest](#xmlhttprequest)
- [Security](#security)
- [IP Address](#ip-address)
- [Data Structures](#data-structures)
## About the List
This is a list of articles and books teaching web scraping.
### Base Things
Knowing the basics is more important than knowing particular tools or implementations.
It is important to know what HTTP, TCP, TLS, DNS, HTML, XML, XPath, CSS, and DOM are, and how network requests are proxied.
It is LESS important to know how to build a crawler with SuperScrapingFramework or which function of PowerfulHTMLParsingLibrary lets
you extract text from a selected element of the HTML DOM tree. These things are very specific. You do not have to know how to operate
every scraping framework or HTML parsing package in the world. If you know the basics, it takes only a short time
to learn how to apply them with a particular programming package.
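
As a rough illustration of why the basics transfer across tools, here is a minimal sketch of an HTTP/1.1 exchange written against nothing but the standard library. The function names (`build_request`, `parse_response`, `fetch`) are hypothetical and exist only for this example. The parser is deliberately naive: it assumes a complete response held in memory and ignores chunked transfer encoding, redirects, and TLS.

```python
import socket


def build_request(host: str, path: str = "/") -> bytes:
    """Serialize a minimal HTTP/1.1 GET request as raw bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",  # ask the server to close so we can read to EOF
        "",
        "",  # the blank line terminates the header block
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw: bytes):
    """Split a raw response into (status code, headers dict, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return code, headers, body


def fetch(host: str, path: str = "/", port: int = 80, timeout: float = 10.0):
    """Send the request over a plain TCP connection (no TLS) and parse the reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_request(host, path))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return parse_response(b"".join(chunks))
```

Every HTTP client library, in any language, ultimately produces and consumes messages of this shape; the library only changes how conveniently you get there.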
### Information Availability
The list must provide information that is instantly accessible. The list does not accept books whose content is not available online.
### Information Granularity
If a book covers a number of topics, it makes sense to link to the relevant topic of the book in the corresponding section of
the Learning Web Scraping list.
### How to Contribute
You may submit a new issue with an article or book you want to add. I will read the article, or take a look at the animals on
the cover picture of the book, and decide whether it is worth including in the list.
## Web Scraping Articles and Topics
### HTML
- [WHATWG / HTML](https://html.spec.whatwg.org/multipage/)
### HTTP
- [High Performance Browser Networking / HTTP/1.X](https://hpbn.co/http1x/)
- [High Performance Browser Networking / HTTP/2](https://hpbn.co/http2/)
- [HTTP Working Group HTTP Specs](https://httpwg.org/specs/)
### DNS
Nothing here yet.
### TCP
- [High Performance Browser Networking / Building Blocks of TCP](https://hpbn.co/building-blocks-of-tcp/)
### TLS
- [High Performance Browser Networking / Transport Layer Security (TLS)](https://hpbn.co/transport-layer-security-tls/)
### WebSocket
- [High Performance Browser Networking / WebSocket](https://hpbn.co/websocket/)
- [WHATWG / WebSockets](https://websockets.spec.whatwg.org/)
### Concurrency
- [The Little Book of Semaphores](https://greenteapress.com/wp/semaphores/)
### Text Encoding
- [WHATWG / Encoding](https://encoding.spec.whatwg.org/)
### URL
- [WHATWG / URL](https://url.spec.whatwg.org/)
### XMLHttpRequest
- [WHATWG / XMLHttpRequest](https://xhr.spec.whatwg.org/)
- [High Performance Browser Networking / XMLHttpRequest](https://hpbn.co/xmlhttprequest/)
### Security
- [OWASP Web Security Testing Guide](https://owasp.org/www-project-web-security-testing-guide/latest/)
### IP Address
- [Understanding IP Addressing](http://pages.di.unipi.it/ricci/501302.pdf)
### Data Structures
- [Probabilistic Data Structures for Web Analytics and Data Mining](https://dirtysalt.github.io/html/probabilistic-data-structures-for-web-analytics-and-data-mining.html)
================================================
FILE: perl.md
================================================
# Perl Web Scraping
This list contains Perl libraries related to web scraping and data processing
* [Perl Web Scraping](#perl-web-scraping)
* [Network](#network)
* [Web-scraping Frameworks](#web-scraping-frameworks)
* [HTML/XML Parsing](#htmlxml-parsing)
* [Text processing](#text-processing)
* [Specific Formats Processing](#specific-formats-processing)
* [Natural Language Processing](#natural-language-processing)
* [Browser automation and emulation](#browser-automation-and-emulation)
* [Multiprocessing](#multiprocessing)
* [Queue](#queue)
* [Email](#email)
* [URL and Network Address Manipulation](#url-and-network-address-manipulation)
* [Web Content Extracting](#web-content-extracting)
* [Asynchronous](#asynchronous)
* [WebSocket](#websocket)
* [DNS Resolving](#dns-resolving)
* [Computer Vision](#computer-vision)
* [Proxy Server](#proxy-server)
* [Other Perl Lists](#other-perl-lists)
## Network
* General
* TODO
* Asynchronous
* TODO
## Web-Scraping Frameworks
* Full Featured Crawlers
* TODO
* Other
* TODO
## HTML/XML Parsing
* TODO
## Text Processing
*Libraries for parsing and manipulating plain texts.*
* General
* TODO
## Specific Formats Processing
*Libraries for parsing and manipulating specific text formats.*
* General
* TODO
* Something
* TODO
## Natural Language Processing
*Libraries for working with human languages.*
* TODO
## Browser automation and emulation
* TODO
## Multiprocessing
* TODO
## Asynchronous
*Libraries for asynchronous networking programming.*
* TODO
## Queue
* TODO
## Email
*Libraries for parsing email.*
* TODO
## URL and Network Address Manipulation
*Libraries for parsing/modifying URLs and network addresses.*
* URL
* TODO
* Network Address
* TODO
## Web Content Extracting
*Libraries for extracting web contents.*
* Text and Meta Data from HTML pages
* TODO
## WebSocket
*Libraries for working with WebSocket.*
* TODO
## DNS Resolving
* TODO
## Computer Vision
* TODO
## Proxy Server
* TODO
## Other Perl lists
* TODO
================================================
FILE: php.md
================================================
# PHP Web Scraping
This list contains PHP libraries related to web scraping and data processing
* [PHP Web Scraping](#php-web-scraping)
* [Network](#network)
* [Web-scraping Frameworks](#web-scraping-frameworks)
* [HTML/XML Parsing](#htmlxml-parsing)
* [Text processing](#text-processing)
* [Specific Formats Processing](#specific-formats-processing)
* [Natural Language Processing](#natural-language-processing)
* [Browser automation and emulation](#browser-automation-and-emulation)
* [Multiprocessing](#multiprocessing)
* [Queue](#queue)
* [Cloud Computing](#cloud-computing)
* [Email](#email)
* [URL Manipulation](#url-manipulation)
* [Web Content Extracting](#web-content-extracting)
* [Asynchronous](#asynchronous)
* [WebSocket](#websocket)
* [DNS Resolving](#dns-resolving)
* [Computer Vision](#computer-vision)
* [Geocoding](#geocoding)
* [API Clients](#api-clients)
* [Other PHP Lists](#other-php-lists)
## Network
* [Guzzle](https://github.com/guzzle/guzzle) - A comprehensive HTTP client.
* [Buzz](https://github.com/kriswallsmith/Buzz) - Another HTTP client.
* [Requests](https://github.com/rmccue/Requests) - A simple HTTP library.
* [HTTPFul](https://github.com/nategood/httpful) - A chainable HTTP client.
* [Goutte](https://github.com/fabpot/Goutte) - A simple web scraper.
* [PHP Spider](https://github.com/mvdbos/php-spider) - A comprehensive web spider.
## Web-Scraping Frameworks
* [Crawler](https://github.com/crwlrsoft/crawler) - (crwlr) - Library for Rapid (Web) Crawler and Scraper Development
* [Roach](https://github.com/roach-php/core) - A port of the popular Scrapy package for Python. Includes adapters for Laravel and Symfony.
## HTML/XML Parsing
* [HTML5 PHP](https://github.com/Masterminds/html5-php) - An HTML5 parser and serializer library.
* [QueryPath](https://github.com/technosophos/querypath) - a jQuery-like library for working with XML and HTML documents in PHP. It now contains support for HTML5 via the HTML5-PHP project.
* [DiDOM](https://github.com/Imangazaliev/DiDOM) - super fast HTML parser (because it was built on top of plain PHP).
* [PHPScraper](https://github.com/spekulatius/phpscraper) - a highly opinionated web-interface.
* [DomCrawler](https://github.com/symfony/dom-crawler) - (Symfony) - The DomCrawler component eases DOM navigation for HTML and XML documents.
## Text Processing
*Libraries for parsing and manipulating plain texts.*
* General
* [ANSI to HTML5](https://github.com/sensiolabs/ansi-to-html) - An ANSI to HTML5 converter library.
* [Patchwork UTF-8](https://github.com/nicolas-grekas/Patchwork-UTF8) - A portable library for working with UTF-8 strings.
* [Hoa String](https://github.com/hoaproject/Ustring) - Another UTF-8 string library.
* [Stringy](https://github.com/danielstjules/Stringy) - A string manipulation library with multibyte support.
* [Color Jizz](https://github.com/mikeemoo/ColorJizz-PHP) - A library for manipulating and converting colours.
* [Text](https://github.com/kzykhys/Text) - A text manipulation library.
* [Flux](https://github.com/selvinortiz/flux) - A regular expression building library.
* Transliteration
* [Urlify](https://github.com/jbroadway/urlify) - A PHP port of Django's URLify.js.
* [Slugify](https://github.com/cocur/slugify) - A library to convert strings to slugs.
* User-agent
* [CrawlerDetect](https://github.com/JayBizzle/Crawler-Detect) - CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and http_from header.
* [PHPUserAgent](https://github.com/donatj/PhpUserAgent) - A simple, streamlined PHP user-agent parser!
* [AgentZero](https://github.com/hexydec/agentzero) - A library for extracting information from User-Agent strings very fast.
* [Device Detector](https://github.com/piwik/device-detector) - Another library for parsing user agent strings.
* [Mobile-Detect](https://github.com/serbanghita/Mobile-Detect) - A lightweight PHP class for detecting mobile devices (including tablets).
* [UA Parser](https://github.com/ua-parser/uap-php) - A library for parsing user agent strings.
* Units of measure
* [ByteUnits](https://github.com/gabrielelana/byte-units) - A library to parse, format and convert byte units in binary and metric systems.
* [PHP Units of Measure](https://github.com/triplepoint/php-units-of-measure) - A library for converting between units of measure.
* [PHP Conversion](https://github.com/Crisu83/php-conversion) - Another library for converting between units of measure.
* Phone number
* [LibPhoneNumber for PHP](https://github.com/giggsey/libphonenumber-for-php) - A PHP implementation of Google's phone number handling library.
## Specific Formats Processing
*Libraries for parsing and manipulating specific text formats.*
* CSV
* [CSV](https://github.com/thephpleague/csv) - A CSV data manipulation library.
* Office
* [PHPWord](https://github.com/PHPOffice/PHPWord) - A library for working with Microsoft Word documents.
* [PHPExcel](https://github.com/PHPOffice/PHPExcel) - A library for working with Microsoft Excel documents.
* [PHPPowerPoint](https://github.com/PHPOffice/PHPPowerPoint) - A library for working with Microsoft PowerPoint documents.
* [ExcelAnt](https://github.com/Wisembly/ExcelAnt) - A library for manipulating Microsoft Excel documents.
* Markdown
* [PHP Markdown](https://github.com/michelf/php-markdown) - A Markdown parser.
* [CommonMark PHP](https://github.com/thephpleague/commonmark) - A Markdown parser which supports the full [CommonMark spec](http://spec.commonmark.org/).
* [Parsedown](https://github.com/erusev/parsedown) - Another Markdown parser.
* [Ciconia](https://github.com/kzykhys/Ciconia) - Another Markdown parser that supports GitHub flavoured Markdown.
* [Cebe Markdown](https://github.com/cebe/markdown) - A fast and extensible Markdown parser.
* BBCode
* [Decoda](https://github.com/milesj/decoda) - A lightweight lexical string parser for BBCode styled markup.
* JSON
* [JsonMapper](https://github.com/netresearch/jsonmapper) - A library that maps nested JSON structures onto PHP classes.
* vCard
* [vobject](https://github.com/fruux/sabre-vobject) - The VObject library allows you to easily parse and manipulate iCalendar and vCard objects.
* File Type Detection
* [Hoa Mime](https://github.com/hoaproject/Mime) - Another MIME detection library.
* [Canal](https://github.com/dflydev/dflydev-canal) - A library to determine internet media types.
* [Apache MIME Types](https://github.com/dflydev/dflydev-apache-mime-types) - A library that parses Apache MIME types.
* GeoJSON
* [GeoJSON](https://github.com/jmikola/geojson) - A GeoJSON implementation.
## Natural Language Processing
*Libraries for working with human languages.*
* [PHP NlpTools](https://github.com/angeloskath/php-nlp-tools) - Natural Language Processing Tools in PHP
* [nlpTools](https://github.com/atrilla/nlptools) - Natural Language Processing Toolkit for PHP
## Browser automation and emulation
* [php-webdriver](https://github.com/facebook/php-webdriver) - A php client for webdriver.
* [PHP PhantomJS](https://github.com/jonnnnyw/php-phantomjs) - Execute PhantomJS commands through PHP
* [Mink](https://github.com/minkphp/Mink) - universal API for multiple browser emulators (selenium, zombie.js, goutte)
## Multiprocessing
* [Spork](https://github.com/kriswallsmith/spork) - A process forking library.
## Asynchronous
*Libraries for asynchronous networking programming.*
* [React](https://github.com/reactphp/react) - An event driven non-blocking I/O library.
* [Rx.PHP](https://github.com/asm89/Rx.PHP) - A reactive extension library.
* [Hoa EventSource](https://github.com/hoaproject/Eventsource) - An event source library.
* [Evenement](https://github.com/igorw/evenement) - An event dispatcher library.
* [Event](https://github.com/thephpleague/event) - An event library with a focus on domain events.
* [Broadway](https://github.com/qandidate-labs/broadway) - An event source and CQRS library.
## Queue
* [Pheanstalk](https://github.com/pda/pheanstalk) - A Beanstalkd client library.
* [PHP AMQP](https://github.com/videlalvaro/php-amqplib) - A pure PHP AMQP library.
* [Thumper](https://github.com/videlalvaro/Thumper) - A RabbitMQ pattern library.
* [Bernard](https://github.com/bernardphp/bernard) - A multibackend abstraction library.
## Cloud Computing
* TODO
## Email
*Libraries for parsing email.*
* [Email Reply Parser](https://github.com/willdurand/EmailReplyParser) - An email reply parser library.
* [Email Validator](https://github.com/nojacko/email-validator) - A small email address validation library.
## URL Manipulation
*Libraries for parsing URLs.*
* [Purl](https://github.com/jwage/purl) - A URL manipulation library.
* [PHP Domain Parser](https://github.com/jeremykendall/php-domain-parser) - A domain suffix parser library.
* [Uri](https://github.com/thephpleague/uri) (The PHP League) - A simple URL manipulation library (PSR-7 compatible).
* [Url](https://github.com/crwlrsoft/url) (crwlr) - Swiss Army knife for urls.
## Web Content Extracting
* Text and Meta Data from Web Documents
* [Essence](https://github.com/felixgirault/essence) - A library for extracting web media.
* [Embera](https://github.com/mpratt/Embera) - An Oembed consumer library.
* [Embed](https://github.com/oscarotero/Embed) - An awesome library for getting useful information from a webpage.
* Video
* [Youtube-Downloader](https://github.com/jeckman/YouTube-Downloader) - PHP script for downloading videos from YouTube; also parses YouTube feeds into RSS enclosures for podcatchers
## WebSocket
*Libraries for working with WebSocket.*
* [Ratchet](https://github.com/cboden/Ratchet) - A web socket library.
* [Hoa WebSocket](https://github.com/hoaproject/Websocket) - Another web socket library.
* [Elephant.io](https://github.com/Wisembly/Elephant.io) - Yet another web socket library.
## DNS Resolving
* [Net_DNS2](https://github.com/mikepultz/netdns2) - Native PHP DNS Resolver and Updater
## Computer Vision
* [OpenCV-for-PHP](https://github.com/mgdm/OpenCV-for-PHP) - An OpenCV binding for PHP
## Geocoding
* [GeoCoder](http://geocoder-php.org/) - A geocoding library.
* [GeoTools](https://github.com/php-loep/Geotools) - A library of geo-related tools.
## Other PHP lists
* [awesome-php](https://github.com/ziadoz/awesome-php)
================================================
FILE: python.md
================================================
# Python Web Scraping
This list contains Python libraries related to web scraping and data processing
## Contents
* [Network](#network)
* [Web Scraping](#web-scraping)
* [HTML/XML](#htmlxml)
* [Text processing](#text-processing)
* [Structured Formats](#structured-formats)
* [Serialization](#serialization)
* [Natural Language Processing](#natural-language-processing)
* [Browser Automation](#browser-automation)
* [Multiprocessing](#multiprocessing)
* [Job Queue](#job-queue)
* [Message Queue](#message-queue)
* [Cloud Computing](#cloud-computing)
* [URL and Network Address](#url-and-network-address)
* [Web Automation](#web-automation)
* [Asynchronous](#asynchronous)
* [WebSocket](#websocket)
* [DNS Resolving](#dns-resolving)
* [Computer Vision](#computer-vision)
* [Proxy Server](#proxy-server)
* [Whois](#whois)
* [JavaScript Engine Bindings](#javascript-engine-bindings)
* [Captcha Solving](#captcha-solving)
* [Other Python Lists](#other-python-lists)
## Network
### Network : General
* [urllib](https://docs.python.org/3.4/library/urllib.html?highlight=urllib#module-urllib) - network library (stdlib)
* [requests](https://github.com/kennethreitz/requests) - network library
* [pycurl](https://github.com/pycurl/pycurl) - network library (binding to [libcurl](http://curl.haxx.se/libcurl/))
* [urllib3](https://github.com/shazow/urllib3) - Python HTTP library with thread-safe connection pooling, file post support, sanity friendly, and more.
* [httplib2](https://github.com/httplib2/httplib2) - Small, fast HTTP client library. Features persistent connections, cache, and Google App Engine support.
* [RoboBrowser](https://github.com/jmcarp/robobrowser) - A simple, Pythonic library for browsing the web without a standalone web browser.
* [MechanicalSoup](https://github.com/hickford/MechanicalSoup) - A Python library for automating interaction with websites.
* [mechanize](https://github.com/python-mechanize/mechanize) - Stateful programmatic web browsing.
* [socket](https://docs.python.org/3/library/socket.html) - low-level networking interface (stdlib)
* [Unirest for Python](https://github.com/Mashape/unirest-python) - Unirest is a set of lightweight HTTP libraries available in multiple languages
* [hyper](https://github.com/Lukasa/hyper) - HTTP/2 Client for Python
* [PySocks](https://github.com/Anorov/PySocks) - Updated and actively maintained version of SocksiPy, with bug fixes and extra features. Acts as a drop-in replacement to the socket module.
* [curl cffi](https://github.com/lexiforest/curl_cffi) - curl-impersonate fork via cffi
### Network : Asynchronous
* [treq](https://github.com/dreid/treq) - requests like API (twisted based)
* [aiohttp](https://github.com/KeepSafe/aiohttp) - http client/server for asyncio (PEP-3156)
* [httpx](https://github.com/encode/httpx) - A next generation HTTP client for Python, with both sync and async APIs
### Network : Low Level
* [dpkt](https://github.com/kbandla/dpkt) - fast, simple packet creation / parsing, with definitions for the basic TCP/IP protocols
* [pyOpenSSL](https://github.com/pyca/pyopenssl) - A Python wrapper around the OpenSSL library
* [tlslite-ng](https://github.com/tomato42/tlslite-ng) - TLS implementation in pure python
* [scapy](https://github.com/secdev/scapy) - powerful Python-based interactive packet manipulation program and library
* [impacket](https://github.com/SecureAuthCorp/impacket/) - low-level programmatic access to the packets of network protocols
## Web Scraping
### Web Scraping : Frameworks
* [scrapy](https://github.com/scrapy/scrapy) - web-scraping framework (twisted based).
* [pyspider](https://github.com/binux/pyspider) - A powerful spider system.
* [autoscraper](https://github.com/alirezamika/autoscraper) - A smart, automatic and lightweight web scraper
* [ruia](https://github.com/howie6879/ruia) - Async Python 3.6+ web scraping micro-framework based on asyncio
* [cola](https://github.com/chineking/cola) - A distributed crawling framework.
* [frontera](https://github.com/scrapinghub/frontera) - A scalable frontier for web crawlers
* [dude](https://github.com/roniemartinez/dude) - A simple framework for writing web scrapers using decorators.
* [ScrapeGraphAI](https://github.com/ScrapeGraphAI/Scrapegraph-ai) - Web scraping framework that uses AI for extracting data
* [Crawl4AI](https://github.com/unclecode/crawl4ai) - web crawler and scraper
### Web Scraping : Tools
* [portia](https://github.com/scrapinghub/portia) - Visual scraping for Scrapy.
* [restkit](https://github.com/benoitc/restkit) - HTTP resource kit for Python. It allows you to easily access HTTP resources and build objects around them.
* [requests-html](https://github.com/kennethreitz/requests-html) - Pythonic HTML Parsing for Humans.
* [ScrapydWeb](https://github.com/my8100/scrapydweb) - A full-featured web UI for Scrapyd cluster management, which supports Scrapy Log Analysis & Visualization, Auto Packaging, Timer Tasks, Email Notice and so on.
* [Starbelly](https://github.com/HyperionGray/starbelly) - Starbelly is a user-friendly and highly configurable web crawler front end.
* [Gerapy](https://github.com/Gerapy/Gerapy) - Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js
* [crawler-buddy](https://github.com/rumca-js/crawler-buddy) - Crawling server, provides crawl information via JSON interface
* [python-proxy-headers](https://github.com/proxymesh/python-proxy-headers) - extensions for popular libraries to better handle proxy headers
### Web Scraping : Bypass Protection
* [cloudscraper](https://github.com/venomous/cloudscraper) - A Python module to bypass Cloudflare's anti-bot page.
## HTML/XML
### HTML/XML : General
* [lxml](https://github.com/lxml/lxml/) - effective HTML/XML processing library. Supports XPATH. Written in C.
* [cssselect](https://github.com/scrapy/cssselect) - working with DOM tree with CSS selectors
* [pyquery](https://github.com/gawel/pyquery) - working with DOM tree with jQuery-like selectors
* [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) - slow HTML/XML processing library, written in pure python
* [html5lib](https://github.com/html5lib/html5lib-python) - builds DOM of HTML/XML document according to [WHATWG spec](http://www.whatwg.org/). That spec is used in all modern browsers.
* [feedparser](https://github.com/kurtmckee/feedparser) - parsing of RSS/ATOM feeds.
* [brutefeedparser](https://github.com/rumca-js/brutefeedparser) - parsing of RSS/ATOM feeds.
* [MarkupSafe](https://github.com/mitsuhiko/markupsafe) - Implements an XML/HTML/XHTML markup-safe string for Python.
* [xmltodict](https://github.com/martinblech/xmltodict) - Makes working with XML feel like you are working with JSON.
* [xhtml2pdf](https://github.com/chrisglass/xhtml2pdf) - HTML/CSS to PDF converter.
* [untangle](https://github.com/stchris/untangle) - Converts XML documents to Python objects for easy access.
* [hodor](https://github.com/CompileInc/hodor) - Configuration driven wrapper around lxml and cssselect.
* [chopper](https://github.com/jurismarches/chopper) - Tool to extract a part from HTML page with corresponding CSS rules and preserving correct HTML.
* [selectolax](https://github.com/rushter/selectolax) - Python bindings to Modest engine (fast HTML5 parser with CSS selectors).
* [parsel](https://github.com/scrapy/parsel) - Lets you extract data from XML/HTML documents using XPath or CSS selectors.
* [html5-parser](https://github.com/kovidgoyal/html5-parser) - Fast C based HTML 5 parsing for python.
* [gazpacho](https://github.com/maxhumber/gazpacho/) - A simple, fast, and modern web scraping library.
### HTML/XML : Sanitizing
* [Bleach](https://github.com/mozilla/bleach) - cleaning of HTML (requires html5lib)
* [sanitize](https://github.com/Alir3z4/sanitize) - Bringing sanity to world of messed-up data.
### HTML/XML : Metadata
* [extruct](https://github.com/scrapinghub/extruct) - A library for extracting embedded metadata from HTML markup.
## Text Processing
Libraries for parsing and manipulating plain texts.
### Text Processing : General
* [difflib](https://docs.python.org/3/library/difflib.html) - (Python standard library) Helpers for computing deltas.
* [Levenshtein](https://github.com/ztane/python-Levenshtein/) - Fast computation of Levenshtein distance and string similarity.
* [fuzzywuzzy](https://github.com/seatgeek/fuzzywuzzy) - Fuzzy String Matching.
* [esmre](https://code.google.com/p/esmre/) - Regular expression accelerator.
* [ftfy](https://github.com/LuminosoInsight/python-ftfy) - Makes Unicode text less broken and more consistent automagically.
### Text Processing : Transliteration
* [unidecode](https://pypi.python.org/pypi/Unidecode) - ASCII transliterations of Unicode text.
### Text Processing : Character Encoding
* [uniout](https://github.com/moskytw/uniout) - Print readable chars instead of the escaped string.
* [chardet](https://github.com/chardet/chardet) - Python 2/3 compatible character encoding detector.
* [xpinyin](https://github.com/lxneng/xpinyin) - A library to translate Chinese hanzi (漢字) to pinyin (拼音).
* [pangu.py](https://github.com/vinta/pangu.py) - Spacing texts for CJK and alphanumerics.
* [cchardet](https://github.com/PyYoshi/cChardet) - High-speed universal character encoding detector; a binding to uchardet.
### Text Processing : Slugify
* [awesome-slugify](https://github.com/dimka665/awesome-slugify) - A Python slugify library that can preserve unicode.
* [python-slugify](https://github.com/un33k/python-slugify) - A Python slugify library that translates unicode to ASCII.
* [unicode-slugify](https://github.com/mozilla/unicode-slugify) - A slugifier that generates unicode slugs.
* [pytils](https://github.com/j2a/pytils) - Simple tools for processing strings in Russian (including pytils.translit.slugify)
### Text Processing : General Parser
* [PLY](http://www.dabeaz.com/ply/) - Implementation of lex and yacc parsing tools for Python
* [pyparsing](https://github.com/pyparsing/pyparsing) - A general purpose framework for generating parsers.
### Text Processing : Human Names
* [python-nameparser](https://github.com/derek73/python-nameparser) - Parsing human names into their individual components.
### Text Processing : Phone Number
* [phonenumbers](https://github.com/daviddrysdale/python-phonenumbers) - Parsing, formatting, storing and validating international phone numbers.
### Text Processing : User-Agent strings
* [HTTP Agent Parser](https://github.com/shon/httpagentparser) - Python HTTP Agent Parser
* [uap-python](https://github.com/ua-parser/uap-python) - Python implementation of ua-parser
* [python-user-agents](https://github.com/selwin/python-user-agents) - Browser user agent parser.
* [fake-useragent](https://github.com/hellysmile/fake-useragent) - Python user agent string faker, based on real-world browser usage statistics
* [user_agent](https://github.com/lorien/user_agent) - Generator of User-Agent data
### Text Processing : robots.txt
* [reppy](https://github.com/seomoz/reppy) - Modern robots.txt Parser for Python
### Text Processing : Date and Time
* [dateutil](https://github.com/dateutil/dateutil) - Useful extensions to the standard Python datetime features
* [dateparser](https://github.com/scrapinghub/dateparser) - python parser for human readable dates
* [ciso8601](https://github.com/closeio/ciso8601) - converts ISO 8601 or RFC 3339 date time strings into Python datetime objects
### Text Processing : Price and Currency
* [price-parser](https://github.com/scrapinghub/price-parser) - a small library for extracting price and currency from raw text strings.
## Structured Formats
Libraries for parsing and manipulating specific text formats.
### Structured Formats : General
* [tablib](https://github.com/kennethreitz/tablib) - A module for Tabular Datasets in XLS, CSV, JSON, YAML.
* [textract](https://github.com/deanmalmgren/textract) - Extract text from any document, Word, PowerPoint, PDFs, etc.
* [messytables](https://github.com/okfn/messytables) - Tools for parsing messy tabular data
* [rows](https://github.com/turicas/rows) - A common, beautiful interface to tabular data, no matter the format (currently CSV, HTML, XLS, TXT -- more coming!)
### Structured Formats : Office
* [python-docx](https://github.com/python-openxml/python-docx) - Reads, queries and modifies Microsoft Word 2007/2008 docx files.
* [xlwt](https://github.com/python-excel/xlwt) / [xlrd](https://github.com/python-excel/xlrd) - Writing and reading data and formatting information from Excel files.
* [XlsxWriter](https://xlsxwriter.readthedocs.org/) - A Python module for creating Excel .xlsx files.
* [xlwings](http://xlwings.org/) - A BSD-licensed library that makes it easy to call Python from Excel and vice versa.
* [openpyxl](https://openpyxl.readthedocs.org/en/latest/) - A library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files.
* [Marmir](https://github.com/brianray/mm) - Takes Python data structures and turns them into spreadsheets.
### Structured Formats : PDF
* [PDFMiner](https://github.com/euske/pdfminer) - A tool for extracting information from PDF documents.
* [PyPDF2](https://github.com/mstamy2/PyPDF2) - A library capable of splitting, merging and transforming PDF pages.
* [ReportLab](http://www.reportlab.com/opensource/) - Allowing Rapid creation of rich PDF documents.
* [pdftables](https://pypi.python.org/pypi/pdftables) - Extract tables from PDF files directly
### Structured Formats : Markdown
* [Python-Markdown](https://github.com/waylan/Python-Markdown) - A Python implementation of John Gruber’s Markdown.
* [Mistune](https://github.com/lepture/mistune) - A fast, full-featured Markdown parser in pure Python.
* [markdown2](https://pypi.python.org/pypi/markdown2) - A fast and complete Python implementation of Markdown
* [mistletoe](https://github.com/miyuchina/mistletoe) - A fast, extensible and spec-compliant Markdown parser in pure Python
### Structured Formats : YAML
* [PyYAML](https://github.com/yaml/pyyaml) - YAML implementations for Python.
### Structured Formats : CSS
* [cssutils](https://pypi.python.org/pypi/cssutils/) - A CSS library for Python.
### Structured Formats : ATOM/RSS
* [feedparser](http://pythonhosted.org/feedparser/) - Universal feed parser.
### Structured Formats : SQL
* [sqlparse](https://sqlparse.readthedocs.org/) - A non-validating SQL parser.
### Structured Formats : HTTP
* [http-parser](https://github.com/benoitc/http-parser) - HTTP request/response parser for python in C
* [httptools](https://github.com/MagicStack/httptools) - a Python binding for nodejs HTTP parser
### Structured Formats : Microformats
* [opengraph](https://github.com/erikriver/opengraph) - A Python module to parse the Open Graph Protocol tags
### Structured Formats : Portable Executable
* [pefile](https://github.com/erocarrera/pefile) - A multi-platform module to parse and work with Portable Executable (aka PE) files.
### Structured Formats : PSD
* [psd-tools](https://github.com/kmike/psd-tools) - reading Adobe Photoshop PSD files (as described in [specification](https://www.adobe.com/devnet-apps/photoshop/fileformatashtml/PhotoshopFileFormats.htm)) to Python data structures.
### Structured Formats : Bookmarks File
* [bookmarks-parser](https://github.com/bookmarks-tools/bookmarks-parser) - Parses Firefox/Chrome HTML bookmarks files
### Structured Formats : JavaScript Object
* [chompjs](https://github.com/Nykakin/chompjs) - Parsing JavaScript objects into Python dictionaries
### Structured Formats : Email
* [flanker](https://github.com/mailgun/flanker) - An email address and MIME parsing library.
* [Talon](https://github.com/mailgun/talon) - Mailgun library to extract message quotations and signatures.
## Serialization
* [orjson](https://github.com/ijl/orjson) - Fast, correct Python JSON library supporting dataclasses and datetimes
* [ujson](https://github.com/esnme/ultrajson) - Ultra fast JSON decoder and encoder written in C with Python bindings
* [msgspec](https://github.com/jcrist/msgspec) - A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
* [msgpack](https://github.com/msgpack/msgpack-python) - MessagePack serializer implementation for Python
* [pydantic](https://github.com/pydantic/pydantic) - Data validation using Python type hints
* [cloudpickle](https://github.com/cloudpipe/cloudpickle) - Extended pickling support for Python objects
## Natural Language Processing
Libraries for working with human languages.
* [NLTK](http://www.nltk.org/) - A leading platform for building Python programs to work with human language data.
* [spacy](https://github.com/explosion/spaCy) - Enables using State-of-the-Art Deep Learning models for common NLP tasks.
* [fastai](https://github.com/fastai/fastai) - Deep Learning library with free video tutorials + active forum community, downside of lib: GPU needed
* [gensim](https://github.com/RaRe-Technologies/gensim) - library for topic modeling, document indexing and similarity retrieval with large corpora
* [Pattern](https://github.com/clips/pattern) - A web mining module for Python. It has tools for natural language processing and machine learning, among others.
* [TextBlob](http://textblob.readthedocs.org/) - Providing a consistent API for diving into common NLP tasks. Stands on the giant shoulders of NLTK and Pattern.
* [jieba](https://github.com/fxsjy/jieba) - Chinese Words Segmentation Utilities.
* [SnowNLP](https://github.com/isnowfy/snownlp) - A library for processing Chinese text.
* [loso](https://github.com/victorlin/loso) - Another Chinese segmentation library.
* [genius](https://github.com/duanhongyi/genius) - A Chinese segmentation library based on Conditional Random Fields.
* [langid.py](https://github.com/saffsd/langid.py) - Stand-alone language identification system.
* [Korean](https://korean.readthedocs.org/) - A library for [Korean](http://en.wikipedia.org/wiki/Korean_language) morphology.
* [pymorphy2](https://github.com/kmike/pymorphy2) - Morphological analyzer (POS tagger + inflection engine) for Russian language.
* [PyPLN](https://github.com/NAMD/pypln.backend) - A distributed pipeline for natural language processing, made in Python. The goal of the project is to create an easy way to use NLTK for processing big corpora, with a Web interface.
* [langdetect](https://github.com/Mimino666/langdetect) - Port of Google's language-detection library to Python
## Browser Automation
### Browser Automation : Drivers
* [selenium](http://selenium-python.readthedocs.io/) - automating real browsers (Chrome, Firefox, Opera, IE)
* [Ghost.py](http://carrerasrodrigo.github.io/Ghost.py/) - wrapper of QtWebKit (requires PyQT)
* [Spynner](https://github.com/makinacorpus/spynner) - wrapper of QtWebKit (requires PyQT)
* [Splinter](https://github.com/cobrateam/splinter) - universal API to browser emulators (selenium webdrivers, django client, zope)
* [Requestium](https://github.com/tryolabs/requestium) - Integration layer between Requests and Selenium for automation of web actions.
* [Splash](https://github.com/scrapinghub/splash) - Lightweight, scriptable browser as a service with an HTTP API.
* [pyppeteer](https://github.com/miyakogi/pyppeteer) - Headless chrome/chromium automation library (unofficial port of puppeteer)
* [Playwright](https://github.com/microsoft/playwright-python) - Playwright is a Python library to automate Chromium, Firefox and WebKit browsers with a single API
* [seleniumbase](https://github.com/seleniumbase/SeleniumBase) - Python framework for Web/UI testing + RPA. 🤖 🏰 Fast, easy, and reliable.
### Browser Automation : Frameworks
* [botasaurus](https://github.com/omkarcloud/botasaurus) - all-in-one web scraping framework
* [crawlee](https://github.com/apify/crawlee-python) - A web scraping and browser automation library for Python to build reliable crawlers
### Browser Automation : Tools
* [xvfbwrapper](https://github.com/cgoldberg/xvfbwrapper) - Python wrapper for running a display inside X virtual framebuffer (Xvfb)
## Multiprocessing
* [threading](http://docs.python.org/3/library/threading.html) - standard Python library to run threads. Effective for I/O-bound tasks, but gives no speedup for CPU-bound tasks because of the Python GIL.
* [multiprocessing](http://docs.python.org/3/library/multiprocessing.html) - standard python library to run processes.
* [concurrent-futures](https://docs.python.org/3/library/concurrent.futures.html) - The concurrent.futures module provides a high-level interface for asynchronously executing callables.
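
As a minimal sketch of the I/O-bound case mentioned above, the standard-library `concurrent.futures` thread pool can run a blocking callable over many inputs. The helper name `run_io_tasks` and its `max_workers` default are illustrative assumptions, not part of any listed library.

```python
from concurrent.futures import ThreadPoolExecutor


def run_io_tasks(task, items, max_workers=8):
    """Apply `task` to every item using a pool of threads.

    `task` is any blocking callable (for example a function that
    downloads one URL). Results are returned in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves the order of `items` in its results.
        return list(pool.map(task, items))


if __name__ == "__main__":
    # Stand-in for a network call: any callable taking one argument works.
    print(run_io_tasks(len, ["a", "bb", "ccc"]))
```

For CPU-bound work, swapping `ThreadPoolExecutor` for `ProcessPoolExecutor` from the same module is the usual way around the GIL, provided the callable and its arguments can be pickled.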
## Asynchronous
Libraries for asynchronous networking programming.
* [asyncio](https://docs.python.org/3/library/asyncio.html) - (Python standard library in Python 3.4+) Asynchronous I/O, event loop, coroutines and tasks.
* [Twisted](https://twistedmatrix.com/trac/) - An event-driven networking engine.
* [Tornado](http://www.tornadoweb.org/) - A Web framework and asynchronous networking library.
* [pulsar](https://github.com/quantmind/pulsar) - Event-driven concurrent framework for Python.
* [diesel](https://github.com/jamwt/diesel) - Greenlet-based event I/O Framework for Python.
* [gevent](http://www.gevent.org/) - A coroutine-based Python networking library that uses [greenlet](https://github.com/python-greenlet/greenlet).
* [eventlet](http://eventlet.net/) - Asynchronous framework with WSGI support.
* [Tomorrow](https://github.com/madisonmay/Tomorrow) - Magic decorator syntax for asynchronous code.
* [grequests](https://github.com/kennethreitz/grequests) - Make asynchronous HTTP Requests easily.
## Job Queue
* [celery](http://www.celeryproject.org/) - An asynchronous task queue/job queue based on distributed message passing.
* [huey](https://github.com/coleifer/huey) - Little multi-threaded task queue.
* [mrq](https://github.com/pricingassistant/mrq) - Mr. Queue - A distributed worker task queue in Python using Redis & gevent.
* [RQ](https://github.com/rq/rq) - lightweight task queue manager based on redis
* [simpleq](https://github.com/rdegges/simpleq) - A simple, infinitely scalable, Amazon SQS based queue.
* [python-gearman](https://github.com/Yelp/python-gearman) - python API for Gearman
## Message Queue
* [kombu](https://github.com/celery/kombu) - Messaging library for Python
## Cloud Computing
* [picloud](http://docs.picloud.com/) - executing python-code in cloud
* [dominoup.com](http://www.dominoup.com/) - executing R, Python and MATLAB code in cloud
* [minigun-requests](https://github.com/umihico/minigun-requests) - Web scraping API to outsource tons of GET & xpath to cloud computing
* [pythonista-chromeless](https://github.com/umihico/pythonista-chromeless) - AWS Lambda function that executes given Python code on Selenium
## URL and Network Address
Libraries for parsing/modifying URLs, network addresses, domain names.
### URL and Network Address : URL
* [furl](https://github.com/gruns/furl) - A small Python library that makes manipulating URLs simple.
* [purl](https://github.com/codeinthehole/purl) - A simple, immutable URL class with a clean API for interrogation and manipulation.
* [urllib.parse](https://docs.python.org/3/library/urllib.parse.html) - interface to break URL strings up in components (addressing scheme, network location, path etc.), to combine the components back into a URL string, and to convert a “relative URL” to an absolute URL.
* [courlan](https://github.com/adbar/courlan) - Clean, filter and sample URLs to optimize data collection: Deduplication, spam, content and language filters
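
A common scraping use of the standard-library `urllib.parse` listed above is turning a relative link into a canonical absolute URL. The sketch below is one possible approach; the helper name `normalize_link` and the choice to drop fragments and lowercase the host are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_link(base_url: str, href: str) -> str:
    """Resolve `href` against `base_url` and strip the fragment."""
    # Convert a relative URL to an absolute one.
    absolute = urljoin(base_url, href)
    # Break the URL into scheme, network location, path, query, fragment.
    parts = urlsplit(absolute)
    # Recombine without the fragment. Lowercasing the whole netloc is a
    # simplification: it would also lowercase any user:password part.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, parts.query, "")
    )


if __name__ == "__main__":
    print(normalize_link("https://Example.com/a/b.html", "../c?x=1#top"))
```

Normalizing links this way before queueing them helps a crawler avoid fetching the same page twice under slightly different spellings.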
### URL and Network Address : Network Address
* [netaddr](https://github.com/drkjam/netaddr) - A Python library for representing and manipulating network addresses.
* [micawber](https://github.com/coleifer/micawber) - A small library for extracting rich content from URLs.
### Domain Names
* [tldextract](https://github.com/john-kurkowski/tldextract) - Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.
* [find_domains](https://github.com/rushter/find_domains) - a library to search for domain names in text data
## Web Automation
Tools to automate multiple actions on a website.
### Web Automation : Content Extraction
* [newspaper](https://github.com/codelucas/newspaper) - News extraction, article extraction and content curation in Python.
* [python-goose](https://github.com/grangier/python-goose) - HTML Content/Article Extractor.
* [scrapely](https://github.com/scrapy/scrapely) - Library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.
* [htmldate](https://github.com/adbar/htmldate) - Find creation date using common structural patterns or text-based heuristics.
* [lassie](https://github.com/michaelhelmick/lassie) - Web Content Retrieval for Humans.
* [html2text](https://github.com/Alir3z4/html2text) - Convert HTML to Markdown-formatted text.
* [libextract](https://github.com/datalib/libextract) - Extract data from websites.
* [python-readability](https://github.com/buriy/python-readability) - Fast Python port of arc90's readability tool.
* [sumy](https://github.com/miso-belica/sumy) - A module for automatic summarization of text documents and HTML pages.
* [Haul](https://github.com/vinta/Haul) - An Extensible Image Crawler.
* [you-get](http://www.soimort.org/you-get/) - A YouTube/Youku/Niconico video downloader written in Python 3.
* [youtube-dl](http://rg3.github.io/youtube-dl/) - A small command-line program to download videos from YouTube.
* [WikiTeam](https://github.com/WikiTeam/wikiteam) - Tools for downloading and preserving wikis.
* [linkchecker](https://github.com/wummel/linkchecker) - check links in web documents or full websites
* [python-sitemap](https://github.com/c4software/python-sitemap) - Mini website crawler to make sitemap from a website.
* [trafilatura](https://github.com/adbar/trafilatura) - Gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML.
* [advertools](https://github.com/eliasdabbas/advertools) - A customizable crawler to analyze SEO and content of pages and websites.
* [photon](https://github.com/s0md3v/Photon) - Incredibly fast crawler designed for OSINT
* [extractnet](https://github.com/currentsapi/extractnet) - Machine Learning based content and metadata extraction in Python 3
* [visura-api](https://github.com/zornade/visura-api) - REST API for automated Italian cadastral property record extraction from the SISTER portal, using Playwright for browser automation with SPID authentication.
### Web Automation : Account Creation
* [ninjemail](https://github.com/david96182/ninjemail) - Python library for automated email account creation for different providers.
## WebSocket
Libraries for working with WebSocket.
* [Crossbar](https://github.com/crossbario/crossbar/) - Open-source Unified Application Router (Websocket & WAMP for Python on Autobahn).
* [AutobahnPython](https://github.com/tavendo/AutobahnPython) - WebSocket & WAMP for Python on Twisted and [asyncio](https://docs.python.org/3/library/asyncio.html).
* [WebSocket-for-Python](https://github.com/Lawouach/WebSocket-for-Python) - WebSocket client and server library for Python 2 and 3 as well as PyPy.
## DNS Resolving
* [dnspython](https://github.com/rthalley/dnspython) - a powerful DNS toolkit for python
* [dnsyo](https://github.com/samarudge/dnsyo) - Check your DNS against over 1500 global DNS servers.
* [pycares](https://github.com/saghul/pycares) - interface to c-ares. c-ares is a C library that performs DNS requests and name resolutions asynchronously
## Computer Vision
* [OpenCV](https://github.com/Itseez/opencv) - Open Source Computer Vision Library.
* [SimpleCV](https://github.com/sightmachine/SimpleCV) - Concise, readable interface for cameras, image manipulation, feature extraction, and format conversion (based on OpenCV).
* [mahotas](https://github.com/luispedro/mahotas) - fast computer vision algorithms (all implemented in C++) operating over numpy arrays.
## Proxy Server
* [scylla](https://github.com/imWildCat/scylla) - Intelligent proxy pool for Humans
* [ProxyBroker](https://github.com/constverum/Proxybroker) - Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS
* [shadowsocks](https://github.com/shadowsocks/shadowsocks) - A fast tunnel proxy that helps you bypass firewalls (TCP & UDP support, User management API, TCP Fast Open, Workers and graceful restart, Destination IP blacklist)
* [tproxy](https://github.com/benoitc/tproxy) - tproxy is a simple TCP routing proxy (layer 7) built on Gevent that lets you configure the routine logic in Python
## Whois
* [python-whois](https://github.com/joepie91/python-whois) - A python module for retrieving and parsing WHOIS data
## JavaScript Engine Bindings
* [Js2Py](https://github.com/PiotrDabkowski/Js2Py) - JavaScript to Python Translator & JavaScript interpreter written in 100% pure Python
* [v8eval](https://github.com/sony/v8eval/) - Multi-language bindings to JavaScript engine V8
## Captcha Solving
* [captcha_solver](https://github.com/lorien/captcha_solver) - Universal python API to captcha solving services
* [python-anticaptcha](https://github.com/ad-m/python-anticaptcha) - Client library for solving captchas via anti-captcha.com
* [python3-anticaptcha](https://github.com/AndreiDrang/python3-anticaptcha) - Python library for anti-captcha services
* [unicaps](https://github.com/sergey-scat/unicaps) - a unified Python API for CAPTCHA solving services
## Other python lists
* [awesome-python](https://github.com/vinta/awesome-python)
* [pycrumbs](https://github.com/kirang89/pycrumbs)
* [pythonidae](https://github.com/svaksha/pythonidae)
================================================
FILE: ruby.md
================================================
# Ruby Web Scraping
This list contains ruby libraries related to web scraping and data processing
* [Ruby Web Scraping](#ruby-web-scraping)
* [Network](#network)
* [Web-scraping Frameworks](#web-scraping-frameworks)
* [HTML/XML Parsing](#htmlxml-parsing)
* [Text processing](#text-processing)
* [Specific Formats Processing](#specific-formats-processing)
* [Natural Language Processing](#natural-language-processing)
* [Browser automation and emulation](#browser-automation-and-emulation)
* [Multiprocessing](#multiprocessing)
* [Asynchronous](#asynchronous)
* [Queue](#queue)
* [Email](#email)
* [URL Manipulation](#url-manipulation)
* [Web Content Extracting](#web-content-extracting)
* [WebSocket](#websocket)
* [DNS Resolving](#dns-resolving)
* [Computer Vision](#computer-vision)
* [Geolocation](#geolocation)
* [Other Ruby Lists](#other-ruby-lists)
## Network
* [httparty](https://github.com/jnunemaker/httparty) - Makes http fun again!
* [http](https://github.com/tarcieri/http) - A simple Ruby DSL for making HTTP requests
* [excon](https://github.com/excon/excon) - Usable, fast, simple HTTP(S) 1.1 for Ruby
* [nestful](https://github.com/maccman/nestful) - Simple Ruby HTTP/REST client with a sane API
* [EM-HTTP-Request](https://github.com/igrigorik/em-http-request) - EventMachine based asynchronous HTTP client
* [excon](https://github.com/excon/excon) - Usable, fast, simple Ruby HTTP 1.1. It works great as a general HTTP(s) client and is particularly well suited to usage in API clients.
* [Faraday](https://github.com/lostisland/faraday) - an HTTP client lib that provides a common interface over many adapters (such as Net::HTTP) and embraces the concept of Rack middleware when processing the request/response cycle.
* [Http Client](https://github.com/nahi/httpclient) - Gives something like the functionality of libwww-perl (LWP) in Ruby.
* [HTTP](https://github.com/httprb/http.rb) - The HTTP Gem: a simple Ruby DSL for making HTTP requests.
* [Http-2](https://github.com/igrigorik/http-2) - Pure Ruby implementation of HTTP/2 protocol
* [Patron](https://github.com/toland/patron) - Patron is a Ruby HTTP client library based on libcurl.
* [RESTClient](https://github.com/rest-client/rest-client) - Simple HTTP and REST client for Ruby, inspired by microframework syntax for specifying actions.
* [Savon](https://github.com/savonrb/savon) - Savon is a SOAP client for the Ruby programming language.
* [Sawyer](https://github.com/lostisland/sawyer) - Secret user agent of HTTP, built on top of Faraday.
* [Spyke](https://github.com/balvig/spyke) - Interact with REST services in an ActiveRecord-like manner.
* [Typhoeus](https://github.com/typhoeus/typhoeus) - Typhoeus wraps libcurl in order to make fast and reliable requests.
* [Mechanize](https://github.com/sparklemotion/mechanize) - Mechanize is a ruby library that makes automated web interaction easy.
* [wreq](https://github.com/SearchApi/wreq-ruby) - An HTTP client with real browser TLS/HTTP2 fingerprinting, emulating Chrome, Firefox, Safari, Edge, and Opera signatures via BoringSSL.
## Web-Scraping Frameworks
* [upton](https://github.com/propublica/upton) - A batteries-included framework for easy web-scraping
* [Wombat](https://github.com/felipecsl/wombat) - Web scraper with an elegant DSL that parses structured data from web pages.
* [Anemone](https://github.com/chriskite/anemone) - web spider framework that can spider a domain and collect useful information about the pages it visits
* [Spidr](https://github.com/postmodern/spidr) - versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
* [kimuraframework](https://github.com/vifreefly/kimuraframework) - Modern web scraping framework written in Ruby which works out of the box with Headless Chromium/Firefox, PhantomJS, or simple HTTP requests and lets you scrape and interact with JavaScript-rendered websites
* [arachnid2](https://github.com/samnissen/arachnid2) - A simple, fast, framework-less crawler with sensible defaults and lots of options. Crawls the page and runs your code directly against either Typhoeus responses or a Watir browser.
## HTML/XML Parsing
* [nokogiri](https://github.com/sparklemotion/nokogiri) - HTML, XML, SAX, and Reader parser with XPath and CSS selector support
* [loofah](https://github.com/flavorjones/loofah) - HTML/XML manipulation and sanitization based on Nokogiri
* [HappyMapper](https://github.com/dam5s/happymapper) - allows you to parse XML data and convert it quickly and easily into ruby data structures.
* [HTML::Pipeline](https://github.com/jch/html-pipeline) - HTML processing filters and utilities.
* [Oga](https://github.com/YorickPeterse/oga) - An XML/HTML parser written in Ruby. Oga does not require system libraries such as libxml, making it easier and faster to install on various platforms.
* [Ox](https://github.com/ohler55/ox) - A fast XML parser and Object marshaller.
* [ROXML](https://github.com/Empact/roxml) - Custom mapping and bidirectional marshalling between Ruby and XML using annotation-style class methods, via Nokogiri or LibXML.
* [equivalent-xml](https://github.com/mbklein/equivalent-xml) - Easy tests of equivalency of XML documents for Nokogiri::XML
* [nokolexbor](https://github.com/serpapi/nokolexbor) - A performance-focused HTML5 parser for Ruby based on Lexbor. It supports both CSS selectors and XPath.
## Text Processing
*Libraries for parsing and manipulating plain texts.*
* General
    * [Kiba](https://github.com/thbar/kiba) - library for writing reliable, concise, well-tested & maintainable data-processing code
    * [diffy](https://github.com/samg/diffy) - a convenient way to generate a diff from two strings or files
    * [CommonRegexRuby](https://github.com/talyssonoc/CommonRegexRuby) - find a lot of kinds of common information in a string
* Phone number
    * [GlobalPhone](https://github.com/sstephenson/global_phone) - Parse, validate, and format phone numbers in Ruby using Google's libphonenumber database.
* Country names
    * [i18n_data](https://github.com/grosser/i18n_data) - country/language names and 2-letter-code pairs, in 85 languages, for country/language i18n.
    * [normalize_country](https://github.com/sshaw/normalize_country) - Convert country names and codes to a standard, includes a conversion program for XMLs, CSVs and DBs.
* User agent
    * [Device Detector](https://github.com/podigee/device_detector) - A precise and fast user agent parser and device detector, backed by the largest and most up-to-date user agent database.
* General parser
    * [Parslet](http://kschiess.github.io/parslet/) - A small Ruby library for constructing parsers in the PEG (Parsing Expression Grammar) fashion.
    * [Treetop](https://github.com/cjheath/treetop) - PEG (Parsing Expression Grammar) parser.
    * [rley](https://github.com/famished-tiger/Rley) - Ruby gem implementing a general context-free grammar parser based on Earley's algorithm
* Date & time
    * [Chronic](https://github.com/mojombo/chronic) - A natural language date/time parser written in pure Ruby.
    * [yymmdd](https://github.com/sshaw/yymmdd) - Tiny DSL for idiomatic date parsing and formatting.
    * [Chronic Between](https://github.com/jrobertson/chronic_between) - a simple Ruby natural language parser for date and time ranges
    * [Chronic Duration](https://github.com/hpoydar/chronic_duration) - a simple Ruby natural language parser for elapsed time
    * [Kronic](https://github.com/xaviershay/kronic) - a dirt simple library for parsing and formatting human readable dates
    * [Nickel](https://github.com/iainbeeston/nickel) - extracts date, time, and message information from naturally worded text
    * [Tickle](https://github.com/yb66/tickle) - a natural language parser for recurring events
* Human Names
    * [nameable](https://github.com/chorn/nameable) - A Ruby gem that provides parsing and output of person names, as well as Gender & Ethnicity matching
* N-grams
    * [N-Gram](https://github.com/reddavis/N-Gram) - N-Gram generator in Ruby
    * [ngram](https://github.com/tkellen/ruby-ngram) - break words and phrases into ngrams
    * [raingrams](https://github.com/postmodern/raingrams) - a flexible and general-purpose ngrams library written in Ruby
* Text Similarity
    * [FuzzyMatch](https://github.com/seamusabshere/fuzzy_match) - find a needle in a haystack based on string similarity and regular expression rules
    * [fuzzy-string-match](https://github.com/kiyoka/fuzzy-string-match) - fuzzy string matching library for ruby
    * [FuzzyTools](https://github.com/brianhempel/fuzzy_tools) - In-memory TF-IDF fuzzy document finding with a fancy default tokenizer tuned on diverse record linkage datasets for easy out-of-the-box use
    * [Going the Distance](https://github.com/schneems/going_the_distance) - contains scripts that do various distance calculations
    * [hotwater](https://github.com/colinsurprenant/hotwater) - Fast Ruby FFI string edit distance algorithms
    * [levenshtein-ffi](https://github.com/dbalatero/levenshtein-ffi) - fast string edit distance computation, using the Damerau-Levenshtein algorithm
    * [TF-IDF](https://github.com/reddavis/TF-IDF) - Term Frequency - Inverse Document Frequency in Ruby
    * [tf-idf-similarity](https://github.com/jpmckinney/tf-idf-similarity) - calculate the similarity between texts using tf*idf
## Specific Formats Processing
*Libraries for parsing and manipulating specific text formats.*
* General
* [markup](https://github.com/github/markup) - GitHub library to convert markdown, rst, creole, etc. into HTML
* Office
* [Yomu](https://github.com/Erol) - Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
* [spreadsheet](https://github.com/zdavatz/spreadsheet) - The Spreadsheet Library is designed to read and write Spreadsheet Documents.
* [roo](https://github.com/Empact/roo) - Roo implements read access for all spreadsheet types and read/write access for Google spreadsheets.
* [google-spreadsheet-ruby](https://github.com/gimite/google-spreadsheet-ruby) - This is a library to read/write Google Spreadsheet.
* [rubyXL](https://github.com/weshatheleopard/rubyXL) - rubyXL is a gem which allows the parsing, creation, and manipulation of Microsoft Excel (.xlsx/.xlsm) Documents
* [remote_table](https://github.com/seamusabshere/remote_table) - Open local or remote XLSX, XLS, ODS, CSV (comma separated), TSV (tab separated), other delimited, fixed-width files, and Google Docs.
* [sheets](https://github.com/bspaulding/Sheets) - Work with spreadsheets easily in a native ruby format.
* [workbook](https://github.com/murb/workbook) - Workbook contains workbooks, as in a table, contains rows, contains cells, reads/writes excel, ods and csv and tab separated files...
* [oxcelix](https://github.com/gbiczo/oxcelix) - A fast Excel 2007/2010 (.xlsx) file parser that returns a collection of Matrix objects
* [wrap_excel](https://github.com/tomiacannondale/wrap_excel) - WrapExcel wraps win32ole to make Excel operations easy to use from Ruby. See the README for a detailed description.
* libpcap
* [PacketFu](https://github.com/packetfu/packetfu) - A library for reading and writing packets to an interface or to a libpcap-formatted file.
* JSON
* [JsonCompare](https://github.com/a2design-company/json-compare) - Returns the difference between two JSON files
* [JSON](https://github.com/flori/json) — includes pure Ruby and C implementation for JSON.
* [JSON::Stream](https://github.com/dgraham/json-stream) — a streaming JSON parser that generates SAX-like events.
* [YAJL](https://github.com/brianmario/yajl-ruby) — a streaming JSON parsing and encoding library for Ruby (C bindings to YAJL).
* [OJ](https://github.com/ohler55/oj) — Optimized JSON, as the name implies, was written to provide speed optimized JSON handling. So far it has achieved that, and is about 2 times faster than any other Ruby JSON parser, and 3 or more times faster at serializing JSON.
* Markdown
* [kramdown](https://github.com/gettalong/kramdown) - Kramdown is yet-another-markdown-parser but fast, pure Ruby, using a strict syntax definition and supporting several common extensions.
* [Maruku](https://github.com/bhollis/maruku) - A pure-Ruby Markdown-superset interpreter.
* [Redcarpet](https://github.com/vmg/redcarpet) - A fast, safe and extensible Markdown to (X)HTML parser.
* ATOM/RSS
* [Feed normalizer](https://github.com/aasmith/feed-normalizer) - Extensible Ruby wrapper for Atom and RSS parsers.
* [Feedjira](https://github.com/feedjira/feedjira) - A feed fetching and parsing library.
* [Ratom](https://github.com/seangeo/ratom) - A fast, libxml based, Ruby Atom library.
* [Simple rss](https://github.com/cardmagic/simple-rss) - A simple, flexible, extensible, and liberal RSS and Atom reader.
* BSON
* [BSON](https://github.com/mongodb/bson-ruby) — Ruby implementation of the BSON Specification (2.0.0+), http://bsonspec.org
* MessagePack
* [MessagePack](https://github.com/msgpack/msgpack-ruby) — an efficient binary serialization format. It lets you exchange data among multiple languages like JSON but it's faster and smaller. For example, small integers (like flags or error code) are encoded into a single byte, and typical short strings only require an extra byte in addition to the strings themselves. See http://msgpack.org
* Protobuf
* [Protobuf](https://github.com/localshred/protobuf) — Ruby implementation for Protocol Buffers.
* RDF
* [rdf](https://github.com/ruby-rdf/rdf) - pure-Ruby library for working with Resource Description Framework (RDF) data
## Natural Language Processing
*Libraries for working with human languages.*
* General
* [Treat](https://github.com/louismullie/treat) - Treat is a toolkit for natural language processing and computational linguistics in Ruby
* [Pragmatic Segmenter](https://github.com/diasks2/pragmatic_segmenter) - Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
* [Text](https://github.com/threedaymonk/text) - A collection of text algorithms including Levenshtein distance, Metaphone, Soundex 2, Porter stemming & White similarity.
* [whatlanguage](https://github.com/peterc/whatlanguage) - a language detection library for Ruby that uses bloom filters for speed
* [nlp](https://github.com/knife/nlp) - NLP tools for the Polish language
* [NlpToolz](https://github.com/LeFnord/nlp_toolz) - Basic NLP tools, mostly based on OpenNLP; currently implements a sentence finder, tokenizer and POS tagger, plus the Berkeley Parser
* [Open NLP (Ruby bindings)](https://github.com/louismullie/open-nlp)
* [Stanford Core NLP (Ruby bindings)](https://github.com/louismullie/stanford-core-nlp)
* [ve](https://github.com/Kimtaro/ve) - a linguistic framework that's easy to use
* [zipf](https://github.com/pks/zipf) - a collection of various NLP tools and libraries
* [ruby-ner](https://github.com/mblongii/ruby-ner) - named entity recognition with Stanford NER and Ruby
* [ruby-nlp](https://github.com/tiendung/ruby-nlp) - Ruby binding for the Stanford POS Tagger and Named Entity Recognizer
* [linkparser](https://github.com/ged/linkparser) - a Ruby binding for the Abiword version of CMU's Link Grammar, a syntactic parser of English
* Part-of-Speech Tagger
* [engtagger](https://github.com/yohasebe/engtagger) - English Part-of-Speech Tagger Library; a Ruby port of Lingua::EN::Tagger
* [rbtagger](http://rbtagger.rubyforge.org/) - a simple ruby rule-based part of speech tagger
* [TreeTagger for Ruby](https://github.com/LeFnord/rstt) - Ruby based wrapper for the TreeTagger by Helmut Schmid
* Sentence segmentation
* [Pragmatic Segmenter](https://github.com/diasks2/pragmatic_segmenter)
* [Punkt Segmenter](https://github.com/lfcipriani/punkt-segmenter)
* [TactfulTokenizer](https://github.com/zencephalon/Tactful_Tokenizer)
* [Scalpel](https://github.com/louismullie/scalpel)
* [SRX English](https://github.com/apohllo/srx-english)
* Stemmers
* [Greek stemmer](https://github.com/skroutz/greek_stemmer) - a Greek stemmer
* [Ruby-Stemmer](https://github.com/aurelian/ruby-stemmer) - Ruby-Stemmer exposes the SnowBall API to Ruby
* [Turkish stemmer](https://github.com/skroutz/turkish_stemmer) - a Turkish stemmer
* [uea-stemmer](https://github.com/ealdent/uea-stemmer) - a conservative stemmer for search and indexing
* Summarization
* [Epitome](https://github.com/McFreely/epitome) - A small gem to make your text shorter; an implementation of the Lexrank algorithm
* [ots](https://github.com/deepfryed/ots) - Ruby bindings to open text summarizer
* [summarize](https://github.com/ssoper/summarize) - Ruby C wrapper for Open Text Summarizer
* Tokenizers
* [Jieba](https://github.com/mimosa/jieba-jruby) - Chinese tokenizer and segmenter (jRuby)
* [MeCab](https://github.com/markburns/mecab) - Japanese morphological analyzer [[MeCab Heroku buildpack](https://github.com/diasks2/heroku-buildpack-mecab)]
* [NLP Pure](https://github.com/parhamr/nlp-pure) - natural language processing algorithms implemented in pure Ruby with minimal dependencies
* [rseg](https://github.com/yzhang/rseg) - a Chinese Word Segmentation (中文分词) routine in pure Ruby
* [thailang4r](https://github.com/veer66/thailang4r) - Thai tokenizer
* [tiny_segmenter](https://github.com/6/tiny_segmenter) - Ruby port of TinySegmenter.js for tokenizing Japanese text
* [tokenizer](https://github.com/arbox/tokenizer) - a simple multilingual tokenizer
* Word Count
* [wc](https://github.com/thesp0nge/wc) - a rubygem to count word occurrences in a given text
* [word_count](https://github.com/AtelierConvivialite/word_count) - a word counter for String and Hash in Ruby
* [Word Count Analyzer](https://github.com/diasks2/word_count_analyzer) - analyzes a string for potential areas of the text that might cause word count discrepancies depending on the tool used
* [WordsCounted](https://github.com/abitdodgy/words_counted) - a highly customisable Ruby text analyser
## Browser automation and emulation
* [selenium](https://github.com/seleniumhq/selenium) - A browser automation framework and ecosystem
* [Watir](https://github.com/watir/watir) - Watir implementation built on WebDriver's Ruby bindings
* [capybara-webkit](https://github.com/thoughtbot/capybara-webkit) - A Capybara driver for headless WebKit to test JavaScript web apps
* [poltergeist](https://github.com/teampoltergeist/poltergeist) - A PhantomJS driver for Capybara
## Multiprocessing
* [Celluloid](https://github.com/celluloid/celluloid) - Actor-based concurrent object framework for Ruby
* [Parallel](https://github.com/grosser/parallel) - Run any code in parallel Processes (> use all CPUs) or Threads (> speedup blocking operations).
* [Concurrent Ruby](https://github.com/ruby-concurrency/concurrent-ruby) - Modern concurrency tools including agents, futures, promises, thread pools, supervisors, and more.
* [childprocess](https://github.com/jarib/childprocess) - Cross-platform ruby library for managing child processes.
* [forkoff](https://github.com/ahoward/forkoff) - brain-dead simple parallel processing for ruby.
* [posix-spawn](https://github.com/rtomayko/posix-spawn) - Fast Process::spawn for Rubys >= 1.8.7 based on the posix_spawn() system interfaces.
* [thread](https://github.com/meh/ruby-thread) — extensions to the thread library (includes thread pool).
* [Spawnling](https://github.com/dreikanter/ruby-bookmarks) - spawn gem for Rails to easily fork or thread long-running code blocks.
## Asynchronous
*Libraries for asynchronous networking programming.*
* [EventMachine](https://github.com/eventmachine/eventmachine) - event-driven I/O and lightweight concurrency library
## Queue
* [Resque](https://github.com/resque/resque) - A Redis-backed Ruby library for creating background jobs and placing them on multiple queues.
* [Delayed::Job](https://github.com/tobi/delayed_job) — Database backed asynchronous priority queue.
* [Qu](https://github.com/bkeepers/qu) - A Ruby library for queuing and processing background jobs.
* [Sidekiq](http://sidekiq.org) - A full-featured background processing framework for Ruby. It aims to be simple to integrate with any modern Rails application and much higher performance than other existing solutions.
* [Sneakers](https://github.com/jondot/sneakers) - A fast background processing framework for Ruby and RabbitMQ
* [Backburner](https://github.com/nesquena/backburner) - Backburner is a beanstalkd-powered job queue that can handle a very high volume of jobs.
* [Delayed::Job](https://github.com/collectiveidea/delayed_job) - Database backed asynchronous priority queue.
* [Que](https://github.com/chanks/que) - A Ruby job queue that uses PostgreSQL's advisory locks for speed and reliability.
* [Shoryuken](https://github.com/phstc/shoryuken) - A super efficient AWS SQS thread based message processor for Ruby.
* [Sucker Punch](https://github.com/brandonhilkert/sucker_punch) - A single process background processing library using Celluloid. Aimed to be Sidekiq's little brother.
## Email
*Libraries for parsing email.*
* [mail](https://github.com/mikel/mail) - A Really Ruby Mail Library
## URL Manipulation
*Libraries for parsing URLs.*
* [addressable](https://github.com/sporkmonger/addressable) - Addressable is a replacement for the URI implementation that is part of Ruby's standard library. It more closely conforms to RFC 3986, RFC 3987, and RFC 6570 (level 4), providing support for IRIs and URI templates.
## Web Content Extracting
*Libraries for extracting web contents.*
* [Metainspector](https://github.com/jaimeiniesta/metainspector) - scrapes a given URL, and returns its title, meta description, meta keywords, an array with all the links, all the images in it, etc
* [LinkThumbnailer](https://github.com/gottfrois/link_thumbnailer) - Ruby gem that generates thumbnail images and videos from a given URL, much like the link previews on popular social websites.
* [docsplit](http://documentcloud.github.io/docsplit/) - Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts
* [Ruby Readability](https://github.com/cantino/ruby-readability) - a tool for extracting the primary readable content of a webpage
## WebSocket
*Libraries for working with WebSocket.*
* [em-websocket](https://github.com/igrigorik/em-websocket) - EventMachine based WebSocket server
* [Faye](http://faye.jcoglan.com/ruby.html) - A set of tools for simple publish-subscribe messaging between web clients.
* [Firehose](https://github.com/polleverywhere/firehose) - Build realtime Ruby web applications.
* [Slanger](https://github.com/stevegraham/slanger) - Open Pusher implementation compatible with Pusher libraries.
## DNS Resolving
* [em-resolv-replace](https://github.com/mperham/em-resolv-replace) - EventMachine-aware pure Ruby DNS resolution
* [Celluloid::DNS](https://github.com/celluloid/celluloid-dns) - a high-performance DNS client resolver and server which can be easily integrated into other projects or used as a stand-alone daemon. It was forked from RubyDNS which is now implemented in terms of this library.
## Computer Vision
* [ruby-opencv](https://github.com/ruby-opencv/ruby-opencv) - An OpenCV wrapper for Ruby.
## Geolocation
* [geocoder](https://github.com/alexreisner/geocoder) - A complete geocoding solution for Ruby. With Rails it adds geocoding (by street or IP address), reverse geocoding (find street address based on given coordinates), and distance queries.
* [Geokit](https://github.com/geokit/geokit) - Geokit gem provides geocoding and distance/heading calculations.
* [geoip](https://github.com/cjheath/geoip) - Searches a GeoIP database for a given host or IP address, and returns information about the country where the IP address is allocated, and the city, ISP and other information.
## Other Ruby Lists
* [awesome-ruby](https://github.com/markets/awesome-ruby/blob/master/README.md) by markets
* [awesome-ruby](https://github.com/Sdogruyol/awesome-ruby) by Sdogruyol
* [ruby-nlp](https://github.com/diasks2/ruby-nlp) - a collection of Natural Language Processing (NLP) Ruby libraries, tools and software