Showing preview only (270K chars total). Download the full file or copy to clipboard to get everything.
Repository: gocolly/colly
Branch: master
Commit: abd17898f26e
Files: 67
Total size: 253.0 KB
Directory structure:
gitextract_san38b80/
├── .codecov.yml
├── .github/
│ ├── ISSUE_TEMPLATE/
│ │ ├── bug_report.md
│ │ ├── config.yml
│ │ └── feature_request.md
│ └── workflows/
│ └── ci.yml
├── CHANGELOG.md
├── CONTRIBUTING.md
├── LICENSE.txt
├── README.md
├── VERSION
├── _examples/
│ ├── README.md
│ ├── basic/
│ │ └── basic.go
│ ├── coursera_courses/
│ │ └── coursera_courses.go
│ ├── cryptocoinmarketcap/
│ │ └── cryptocoinmarketcap.go
│ ├── error_handling/
│ │ └── error_handling.go
│ ├── factba.se/
│ │ └── factbase.go
│ ├── google_groups/
│ │ └── google_groups.go
│ ├── hackernews_comments/
│ │ └── hackernews_comments.go
│ ├── instagram/
│ │ └── instagram.go
│ ├── local_files/
│ │ ├── html/
│ │ │ ├── child_page/
│ │ │ │ ├── one.html
│ │ │ │ ├── three.html
│ │ │ │ └── two.html
│ │ │ └── index.html
│ │ └── local_files.go
│ ├── login/
│ │ └── login.go
│ ├── max_depth/
│ │ └── max_depth.go
│ ├── multipart/
│ │ └── multipart.go
│ ├── openedx_courses/
│ │ └── openedx_courses.go
│ ├── parallel/
│ │ └── parallel.go
│ ├── proxy_switcher/
│ │ └── proxy_switcher.go
│ ├── queue/
│ │ └── queue.go
│ ├── random_delay/
│ │ └── random_delay.go
│ ├── rate_limit/
│ │ └── rate_limit.go
│ ├── reddit/
│ │ └── reddit.go
│ ├── request_context/
│ │ └── request_context.go
│ ├── scraper_server/
│ │ └── scraper_server.go
│ ├── shopify_sitemap/
│ │ └── shopify_sitemap.go
│ ├── url_filter/
│ │ └── url_filter.go
│ └── xkcd_store/
│ └── xkcd_store.go
├── cmd/
│ └── colly/
│ └── colly.go
├── colly.go
├── colly_test.go
├── context.go
├── context_test.go
├── debug/
│ ├── debug.go
│ ├── logdebugger.go
│ └── webdebugger.go
├── extensions/
│ ├── extensions.go
│ ├── random_user_agent.go
│ ├── referer.go
│ └── url_length_filter.go
├── go.mod
├── go.sum
├── htmlelement.go
├── http_backend.go
├── http_trace.go
├── http_trace_test.go
├── proxy/
│ └── proxy.go
├── queue/
│ ├── queue.go
│ └── queue_test.go
├── request.go
├── response.go
├── storage/
│ └── storage.go
├── unmarshal.go
├── unmarshal_test.go
├── xmlelement.go
└── xmlelement_test.go
================================================
FILE CONTENTS
================================================
================================================
FILE: .codecov.yml
================================================
comment: false
================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.md
================================================
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''
---
<!--
Remember to include a code sample that reproduces the bug, if possible.
Love colly? Please consider supporting our collective:
👉 https://opencollective.com/colly/donate
-->
================================================
FILE: .github/ISSUE_TEMPLATE/config.yml
================================================
blank_issues_enabled: true
contact_links:
- name: Question
url: https://stackoverflow.com/
about: Questions should go to Stack Overflow. You can use go-colly tag.
================================================
FILE: .github/ISSUE_TEMPLATE/feature_request.md
================================================
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: ''
assignees: ''
---
<!--
Love colly? Please consider supporting our collective:
👉 https://opencollective.com/colly/donate
-->
================================================
FILE: .github/workflows/ci.yml
================================================
---
name: CI
on:
push:
branches:
- '**'
pull_request:
jobs:
test:
name: Test ${{matrix.go}}
runs-on: [ubuntu-latest]
strategy:
fail-fast: false
max-parallel: 4
matrix:
go: [
"1.24",
"1.23",
"1.22",
"1.21",
]
steps:
- name: Checkout branch
uses: actions/checkout@v2
- name: Setup go
uses: actions/setup-go@v2
with:
go-version: ${{matrix.go}}
- name: Test
run: |
go install golang.org/x/lint/golint@latest
OUT="$(go get -a)"; test -z "$OUT" || (echo "$OUT" && return 1)
OUT="$(gofmt -l -d ./)"; test -z "$OUT" || (echo "$OUT" && return 1)
golint -set_exit_status
go vet -v ./...
go test -race -v -coverprofile=coverage.txt -covermode=atomic ./...
build:
name: Build ${{matrix.go}}
runs-on: [ubuntu-latest]
strategy:
fail-fast: false
max-parallel: 4
matrix:
go: [
"1.24",
"1.23",
"1.22",
"1.21",
]
steps:
- name: Checkout branch
uses: actions/checkout@v2
- name: Setup go
uses: actions/setup-go@v2
with:
go-version: ${{matrix.go}}
- name: Build
run: |
go install golang.org/x/lint/golint@latest
OUT="$(go get -a)"; test -z "$OUT" || (echo "$OUT" && return 1)
OUT="$(gofmt -l -d ./)"; test -z "$OUT" || (echo "$OUT" && return 1)
golint -set_exit_status
go build
codecov:
name: Codecov
runs-on: [ubuntu-latest]
needs:
- test
- build
steps:
- name: Run Codecov
run: bash <(curl -s https://codecov.io/bash)
================================================
FILE: CHANGELOG.md
================================================
# 2.1.0 - 2020.06.09
- HTTP tracing support
- New callback: OnResponseHeader
- Queue fixes
- New collector option: Collector.CheckHead
- Proxy fixes
- Fixed POST revisit checking
- Updated dependencies
# 2.0.0 - 2019.11.28
- Breaking change: Change Collector.RedirectHandler member to Collector.SetRedirectHandler function
- Go module support
- Collector.HasVisited method added to be able to check if an url has been visited
- Collector.SetClient method introduced
- HTMLElement.ChildTexts method added
- New user agents
- Multiple bugfixes
# 1.2.0 - 2019.02.13
- Compatibility with the latest htmlquery package
- New request shortcut for HEAD requests
- Check URL availability before visiting
- Fix proxy URL value
- Request counter fix
- Minor fixes in examples
# 1.1.0 - 2018.08.13
- Appengine integration takes context.Context instead of http.Request (API change)
- Added "Accept" http header by default to every request
- Support slices of pointers in unmarshal
- Fixed a race condition in queues
- ForEachWithBreak method added to HTMLElement
- Added a local file example
- Support gzip decompression of response bodies
- Don't share waitgroup when cloning a collector
- Fixed instagram example
# 1.0.0 - 2018.05.13
================================================
FILE: CONTRIBUTING.md
================================================
# Contribute
## Introduction
First, thank you for considering contributing to colly! It's people like you that make the open source community such a great community! 😊
We welcome any type of contribution, not only code. You can help with
- **QA**: file bug reports, the more details you can give the better (e.g. screenshots with the console open)
- **Marketing**: writing blog posts, howto's, printing stickers, ...
- **Community**: presenting the project at meetups, organizing a dedicated meetup for the local community, ...
- **Code**: take a look at the [open issues](https://github.com/gocolly/colly/issues). Even if you can't write code, commenting on them, showing that you care about a given issue matters. It helps us triage them.
- **Money**: we welcome financial contributions in full transparency on our [open collective](https://opencollective.com/colly).
## Your First Contribution
Working on your first Pull Request? You can learn how from this *free* series, [How to Contribute to an Open Source Project on GitHub](https://app.egghead.io/playlists/how-to-contribute-to-an-open-source-project-on-github).
## Submitting code
Any code change should be submitted as a pull request. The description should explain what the code does and give steps to execute it. The pull request should also contain tests.
## Code review process
The bigger the pull request, the longer it will take to review and merge. Try to break down large pull requests in smaller chunks that are easier to review and merge.
It is also always helpful to have some context for your pull request. What was the purpose? Why does it matter to you?
## Financial contributions
We also welcome financial contributions in full transparency on our [open collective](https://opencollective.com/colly).
Anyone can file an expense. If the expense makes sense for the development of the community, it will be "merged" in the ledger of our open collective by the core contributors and the person who filed the expense will be reimbursed.
## Questions
If you have any questions, create an [issue](https://github.com/gocolly/colly/issues/new) (protip: do a quick search first to see if someone else didn't ask the same question before!).
You can also reach us at hello@colly.opencollective.com.
## Credits
### Contributors
Thank you to all the people who have already contributed to colly!
<a href="graphs/contributors"><img src="https://opencollective.com/colly/contributors.svg?width=890" /></a>
### Backers
Thank you to all our backers! [[Become a backer](https://opencollective.com/colly#backer)]
<a href="https://opencollective.com/colly#backers" target="_blank"><img src="https://opencollective.com/colly/backers.svg?width=890"></a>
### Sponsors
Thank you to all our sponsors! (please ask your company to also support this open source project by [becoming a sponsor](https://opencollective.com/colly#sponsor))
<a href="https://opencollective.com/colly/sponsor/0/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/0/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/1/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/1/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/2/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/2/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/3/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/3/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/4/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/4/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/5/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/5/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/6/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/6/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/7/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/7/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/8/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/8/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/9/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/9/avatar.svg"></a>
<!-- This `CONTRIBUTING.md` is based on @nayafia's template https://github.com/nayafia/contributing-template -->
================================================
FILE: LICENSE.txt
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: README.md
================================================
# Colly
Lightning Fast and Elegant Scraping Framework for Gophers
Colly provides a clean interface to write any kind of crawler/scraper/spider.
With Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.
[](https://pkg.go.dev/github.com/gocolly/colly/v2)
[](#backers) [](#sponsors) [](https://github.com/gocolly/colly/actions/workflows/ci.yml)
[](http://goreportcard.com/report/gocolly/colly)
[](https://github.com/gocolly/colly/tree/master/_examples)
[](https://codecov.io/github/gocolly/colly?branch=master)
[](https://app.fossa.io/projects/git%2Bgithub.com%2Fgocolly%2Fcolly?ref=badge_shield)
[](https://twitter.com/gocolly)
## Features
- Clean API
- Fast (>1k request/sec on a single core)
- Manages request delays and maximum concurrency per domain
- Automatic cookie and session handling
- Sync/async/parallel scraping
- Caching
- Automatic encoding of non-unicode responses
- Robots.txt support
- Distributed scraping
- Configuration via environment variables
- Extensions
## Example
```go
import (
"fmt"
"github.com/gocolly/colly/v2"
)
func main() {
c := colly.NewCollector()
// Find and visit all links
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
e.Request.Visit(e.Attr("href"))
})
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL)
})
c.Visit("http://go-colly.org/")
}
```
See [examples folder](https://github.com/gocolly/colly/tree/master/_examples) for more detailed examples.
## Installation
`go get github.com/gocolly/colly/v2`
## Bugs
Bugs or suggestions? Visit the [issue tracker](https://github.com/gocolly/colly/issues) or join `#colly` on freenode
## Other Projects Using Colly
Below is a list of public, open source projects that use Colly:
- [greenpeace/check-my-pages](https://github.com/greenpeace/check-my-pages) Scraping script to test the Spanish Greenpeace web archive.
- [altsab/gowap](https://github.com/altsab/gowap) Wappalyzer implementation in Go.
- [jesuiscamille/goquotes](https://github.com/jesuiscamille/goquotes) A quotes scraper, making your day a little better!
- [jivesearch/jivesearch](https://github.com/jivesearch/jivesearch) A search engine that doesn't track you.
- [Leagify/colly-draft-prospects](https://github.com/Leagify/colly-draft-prospects) A scraper for future NFL Draft prospects.
- [lucasepe/go-ps4](https://github.com/lucasepe/go-ps4) Search playstation store for your favorite PS4 games using the command line.
- [yringler/inside-chassidus-scraper](https://github.com/yringler/inside-chassidus-scraper) Scrapes Rabbi Paltiel's web site for lesson metadata.
- [gamedb/gamedb](https://github.com/gamedb/gamedb) A database of Steam games.
- [lawzava/scrape](https://github.com/lawzava/scrape) CLI for email scraping from any website.
- [eureka101v/WeiboSpiderGo](https://github.com/eureka101v/WeiboSpiderGo) A sina weibo(chinese twitter) scraper
- [Go-phie/gophie](https://github.com/Go-phie/gophie) Search, Download and Stream movies from your terminal
- [imthaghost/goclone](https://github.com/imthaghost/goclone) Clone websites to your computer within seconds.
- [superiss/spidy](https://github.com/superiss/spidy) Crawl the web and collect expired domains.
- [docker-slim/docker-slim](https://github.com/docker-slim/docker-slim) Optimize your Docker containers to make them smaller and better.
- [seversky/gachifinder](https://github.com/seversky/gachifinder) an agent for asynchronous scraping, parsing and writing to some storages(elasticsearch for now)
- [eval-exec/goodreads](https://github.com/eval-exec/goodreads) crawl all tags and all pages of quotes from goodreads.
If you are using Colly in a project please send a pull request to add it to the list.
## Contributors
This project exists thanks to all the people who contribute. [[Contribute]](CONTRIBUTING.md).
<a href="https://github.com/gocolly/colly/graphs/contributors"><img src="https://opencollective.com/colly/contributors.svg?width=890" /></a>
## Backers
Thank you to all our backers! 🙏 [[Become a backer](https://opencollective.com/colly#backer)]
<a href="https://opencollective.com/colly#backers" target="_blank"><img src="https://opencollective.com/colly/backers.svg?width=890"></a>
## Sponsors
Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [[Become a sponsor](https://opencollective.com/colly#sponsor)]
<a href="https://opencollective.com/colly/sponsor/0/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/0/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/1/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/1/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/2/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/2/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/3/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/3/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/4/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/4/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/5/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/5/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/6/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/6/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/7/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/7/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/8/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/8/avatar.svg"></a>
<a href="https://opencollective.com/colly/sponsor/9/website" target="_blank"><img src="https://opencollective.com/colly/sponsor/9/avatar.svg"></a>
## License
[](https://app.fossa.io/projects/git%2Bgithub.com%2Fgocolly%2Fcolly?ref=badge_large)
================================================
FILE: VERSION
================================================
2.1.0
================================================
FILE: _examples/README.md
================================================
# Colly examples
This folder provides easy to understand code snippets on how to get started with colly.
To execute an example run `go run [example/example.go]`
## Demo
```
$ go run rate_limit/rate_limit.go
[000001] 1 [ 1 - request] map["url":"https://httpbin.org/delay/2?n=4"] (60.872µs)
[000002] 1 [ 2 - request] map["url":"https://httpbin.org/delay/2?n=2"] (154.425µs)
[000003] 1 [ 3 - request] map["url":"https://httpbin.org/delay/2?n=0"] (158.374µs)
[000004] 1 [ 5 - request] map["url":"https://httpbin.org/delay/2?n=3"] (426.999µs)
[000005] 1 [ 4 - request] map["url":"https://httpbin.org/delay/2?n=1"] (448.75µs)
[000007] 1 [ 2 - response] map["url":"https://httpbin.org/delay/2?n=2" "status":"OK"] (2.855764394s)
[000008] 1 [ 2 - scraped] map["url":"https://httpbin.org/delay/2?n=2"] (2.855797868s)
[000006] 1 [ 1 - response] map["url":"https://httpbin.org/delay/2?n=4" "status":"OK"] (2.855756753s)
[000009] 1 [ 1 - scraped] map["url":"https://httpbin.org/delay/2?n=4"] (2.855819581s)
[000010] 1 [ 3 - response] map["status":"OK" "url":"https://httpbin.org/delay/2?n=0"] (5.002065299s)
[000011] 1 [ 3 - scraped] map["url":"https://httpbin.org/delay/2?n=0"] (5.002103755s)
[000012] 1 [ 5 - response] map["status":"OK" "url":"https://httpbin.org/delay/2?n=3"] (5.012080614s)
[000013] 1 [ 5 - scraped] map["url":"https://httpbin.org/delay/2?n=3"] (5.012101056s)
[000014] 1 [ 4 - response] map["url":"https://httpbin.org/delay/2?n=1" "status":"OK"] (7.155725591s)
[000015] 1 [ 4 - scraped] map["url":"https://httpbin.org/delay/2?n=1"] (7.155759136s)
```
================================================
FILE: _examples/basic/basic.go
================================================
package main
import (
"fmt"
"github.com/gocolly/colly/v2"
)
func main() {
// Instantiate default collector
c := colly.NewCollector(
// Visit only domains: hackerspaces.org, wiki.hackerspaces.org
colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
)
// On every a element which has href attribute call callback
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
// Print link
fmt.Printf("Link found: %q -> %s\n", e.Text, link)
// Visit link found on page
// Only those links are visited which are in AllowedDomains
c.Visit(e.Request.AbsoluteURL(link))
})
// Before making a request print "Visiting ..."
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL.String())
})
// Start scraping on https://hackerspaces.org
c.Visit("https://hackerspaces.org/")
}
================================================
FILE: _examples/coursera_courses/coursera_courses.go
================================================
package main
import (
"encoding/json"
"log"
"os"
"strings"
"time"
"github.com/gocolly/colly/v2"
)
// Course stores information about a coursera course
type Course struct {
Title string
Description string
Creator string
Level string
URL string
Language string
Commitment string
Rating string
}
func main() {
fName := "courses.json"
file, err := os.Create(fName)
if err != nil {
log.Fatalf("Cannot create file %q: %s\n", fName, err)
return
}
defer file.Close()
// Instantiate default collector
c := colly.NewCollector(
// Visit only domains: coursera.org, www.coursera.org
colly.AllowedDomains("coursera.org", "www.coursera.org"),
// Cache responses to prevent multiple download of pages
// even if the collector is restarted
colly.CacheDir("./coursera_cache"),
// Cached responses older than the specified duration will be refreshed
colly.CacheExpiration(24*time.Hour),
)
// Create another collector to scrape course details
detailCollector := c.Clone()
courses := make([]Course, 0, 200)
// On every <a> element which has "href" attribute call callback
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
// If attribute class is this long string return from callback
// As this a is irrelevant
if e.Attr("class") == "Button_1qxkboh-o_O-primary_cv02ee-o_O-md_28awn8-o_O-primaryLink_109aggg" {
return
}
link := e.Attr("href")
// If link start with browse or includes either signup or login return from callback
if !strings.HasPrefix(link, "/browse") || strings.Index(link, "=signup") > -1 || strings.Index(link, "=login") > -1 {
return
}
// start scaping the page under the link found
e.Request.Visit(link)
})
// Before making a request print "Visiting ..."
c.OnRequest(func(r *colly.Request) {
log.Println("visiting", r.URL.String())
})
// On every <a> element with collection-product-card class call callback
c.OnHTML(`a.collection-product-card`, func(e *colly.HTMLElement) {
// Activate detailCollector if the link contains "coursera.org/learn"
courseURL := e.Request.AbsoluteURL(e.Attr("href"))
if strings.Index(courseURL, "coursera.org/learn") != -1 {
detailCollector.Visit(courseURL)
}
})
// Extract details of the course
detailCollector.OnHTML(`div[id=rendered-content]`, func(e *colly.HTMLElement) {
log.Println("Course found", e.Request.URL)
title := e.ChildText(".banner-title")
if title == "" {
log.Println("No title found", e.Request.URL)
}
course := Course{
Title: title,
URL: e.Request.URL.String(),
Description: e.ChildText("div.content"),
Creator: e.ChildText("li.banner-instructor-info > a > div > div > span"),
Rating: e.ChildText("span.number-rating"),
}
// Iterate over div components and add details to course
e.ForEach(".AboutCourse .ProductGlance > div", func(_ int, el *colly.HTMLElement) {
svgTitle := strings.Split(el.ChildText("div:nth-child(1) svg title"), " ")
lastWord := svgTitle[len(svgTitle)-1]
switch lastWord {
// svg Title: Available Languages
case "languages":
course.Language = el.ChildText("div:nth-child(2) > div:nth-child(1)")
// svg Title: Mixed/Beginner/Intermediate/Advanced Level
case "Level":
course.Level = el.ChildText("div:nth-child(2) > div:nth-child(1)")
// svg Title: Hours to complete
case "complete":
course.Commitment = el.ChildText("div:nth-child(2) > div:nth-child(1)")
}
})
courses = append(courses, course)
})
// Start scraping on http://coursera.com/browse
c.Visit("https://coursera.org/browse")
enc := json.NewEncoder(file)
enc.SetIndent("", " ")
// Dump json to the standard output
enc.Encode(courses)
}
================================================
FILE: _examples/cryptocoinmarketcap/cryptocoinmarketcap.go
================================================
package main
import (
"encoding/csv"
"log"
"os"
"github.com/gocolly/colly/v2"
)
func main() {
fName := "cryptocoinmarketcap.csv"
file, err := os.Create(fName)
if err != nil {
log.Fatalf("Cannot create file %q: %s\n", fName, err)
return
}
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
// Write CSV header
writer.Write([]string{"Name", "Symbol", "Market Cap (USD)", "Price (USD)", "Circulating Supply (USD)", "Volume (24h)", "Change (1h)", "Change (24h)", "Change (7d)"})
// Instantiate default collector
c := colly.NewCollector()
c.OnHTML("tbody tr", func(e *colly.HTMLElement) {
writer.Write([]string{
e.ChildText(".cmc-table__column-name"),
e.ChildText(".cmc-table__cell--sort-by__symbol"),
e.ChildText(".cmc-table__cell--sort-by__market-cap"),
e.ChildText(".cmc-table__cell--sort-by__price"),
e.ChildText(".cmc-table__cell--sort-by__circulating-supply"),
e.ChildText(".cmc-table__cell--sort-by__volume-24-h"),
e.ChildText(".cmc-table__cell--sort-by__percent-change-1-h"),
e.ChildText(".cmc-table__cell--sort-by__percent-change-24-h"),
e.ChildText(".cmc-table__cell--sort-by__percent-change-7-d"),
})
})
c.Visit("https://coinmarketcap.com/all/views/all/")
log.Printf("Scraping finished, check file %q for results\n", fName)
}
================================================
FILE: _examples/error_handling/error_handling.go
================================================
package main
import (
"fmt"
"github.com/gocolly/colly/v2"
)
func main() {
// Create a collector
c := colly.NewCollector()
// Set HTML callback
// Won't be called if error occurs
c.OnHTML("*", func(e *colly.HTMLElement) {
fmt.Println(e)
})
// Set error handler
c.OnError(func(r *colly.Response, err error) {
fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
})
// Start scraping
c.Visit("https://definitely-not-a.website/")
}
================================================
FILE: _examples/factba.se/factbase.go
================================================
package main
import (
"encoding/json"
"fmt"
"os"
"strconv"
"github.com/gocolly/colly/v2"
)
var baseSearchURL = "https://factba.se/json/json-transcript.php?q=&f=&dt=&p="
var baseTranscriptURL = "https://factba.se/transcript/"
type result struct {
Slug string `json:"slug"`
Date string `json:"date"`
}
type results struct {
Data []*result `json:"data"`
}
type transcript struct {
Speaker string
Text string
}
func main() {
c := colly.NewCollector(
colly.AllowedDomains("factba.se"),
)
d := c.Clone()
d.OnHTML("body", func(e *colly.HTMLElement) {
t := make([]transcript, 0)
e.ForEach(".topic-media-row", func(_ int, el *colly.HTMLElement) {
t = append(t, transcript{
Speaker: el.ChildText(".speaker-label"),
Text: el.ChildText(".transcript-text-block"),
})
})
jsonData, err := json.MarshalIndent(t, "", " ")
if err != nil {
return
}
os.WriteFile(colly.SanitizeFileName(e.Request.Ctx.Get("date")+"_"+e.Request.Ctx.Get("slug"))+".json", jsonData, 0644)
})
stop := false
c.OnResponse(func(r *colly.Response) {
rs := &results{}
err := json.Unmarshal(r.Body, rs)
if err != nil || len(rs.Data) == 0 {
stop = true
return
}
for _, res := range rs.Data {
u := baseTranscriptURL + res.Slug
ctx := colly.NewContext()
ctx.Put("date", res.Date)
ctx.Put("slug", res.Slug)
d.Request("GET", u, nil, ctx, nil)
}
})
for i := 1; i < 1000; i++ {
if stop {
break
}
if err := c.Visit(baseSearchURL + strconv.Itoa(i)); err != nil {
fmt.Println("Error:", err)
break
}
}
}
================================================
FILE: _examples/google_groups/google_groups.go
================================================
package main
import (
"encoding/json"
"flag"
"log"
"os"
"strings"
"github.com/gocolly/colly/v2"
)
// Mail is the container of a single e-mail
type Mail struct {
Title string
Link string
Author string
Date string
Message string
}
func main() {
var groupName string
flag.StringVar(&groupName, "group", "hspbp", "Google Groups group name")
flag.Parse()
threads := make(map[string][]Mail)
threadCollector := colly.NewCollector()
mailCollector := colly.NewCollector()
// Collect threads
threadCollector.OnHTML("tr", func(e *colly.HTMLElement) {
ch := e.DOM.Children()
author := ch.Eq(1).Text()
// deleted topic
if author == "" {
return
}
title := ch.Eq(0).Text()
link, _ := ch.Eq(0).Children().Eq(0).Attr("href")
// fix link to point to the pure HTML version of the thread
link = strings.Replace(link, ".com/d/topic", ".com/forum/?_escaped_fragment_=topic", 1)
date := ch.Eq(2).Text()
log.Printf("Thread found: %s %q %s %s\n", link, title, author, date)
mailCollector.Visit(link)
})
// Visit next page
threadCollector.OnHTML("body > a[href]", func(e *colly.HTMLElement) {
log.Println("Next page link found:", e.Attr("href"))
e.Request.Visit(e.Attr("href"))
})
// Extract mails
mailCollector.OnHTML("body", func(e *colly.HTMLElement) {
// Find subject
threadSubject := e.ChildText("h2")
if _, ok := threads[threadSubject]; !ok {
threads[threadSubject] = make([]Mail, 0, 8)
}
// Extract mails
e.ForEach("table tr", func(_ int, el *colly.HTMLElement) {
mail := Mail{
Title: el.ChildText("td:nth-of-type(1)"),
Link: el.ChildAttr("td:nth-of-type(1)", "href"),
Author: el.ChildText("td:nth-of-type(2)"),
Date: el.ChildText("td:nth-of-type(3)"),
Message: el.ChildText("td:nth-of-type(4)"),
}
threads[threadSubject] = append(threads[threadSubject], mail)
})
// Follow next page link
if link, found := e.DOM.Find("> a[href]").Attr("href"); found {
e.Request.Visit(link)
} else {
log.Printf("Thread %q done\n", threadSubject)
}
})
threadCollector.Visit("https://groups.google.com/forum/?_escaped_fragment_=forum/" + groupName)
enc := json.NewEncoder(os.Stdout)
enc.SetIndent("", " ")
// Dump json to the standard output
enc.Encode(threads)
}
================================================
FILE: _examples/hackernews_comments/hackernews_comments.go
================================================
package main
import (
"encoding/json"
"flag"
"log"
"os"
"strconv"
"strings"
"github.com/gocolly/colly/v2"
)
type comment struct {
Author string `selector:"a.hnuser"`
URL string `selector:".age a[href]" attr:"href"`
Comment string `selector:".comment"`
Replies []*comment
depth int
}
func main() {
var itemID string
flag.StringVar(&itemID, "id", "", "hackernews post id")
flag.Parse()
if itemID == "" {
log.Println("Hackernews post id required")
os.Exit(1)
}
comments := make([]*comment, 0)
// Instantiate default collector
c := colly.NewCollector()
// Extract comment
c.OnHTML(".comment-tree tr.athing", func(e *colly.HTMLElement) {
width, err := strconv.Atoi(e.ChildAttr("td.ind img", "width"))
if err != nil {
return
}
// hackernews uses 40px spacers to indent comment replies,
// so we have to divide the width with it to get the depth
// of the comment
depth := width / 40
c := &comment{
Replies: make([]*comment, 0),
depth: depth,
}
e.Unmarshal(c)
c.Comment = strings.TrimSpace(c.Comment[:len(c.Comment)-5])
if depth == 0 {
comments = append(comments, c)
return
}
parent := comments[len(comments)-1]
// append comment to its parent
for i := 0; i < depth-1; i++ {
parent = parent.Replies[len(parent.Replies)-1]
}
parent.Replies = append(parent.Replies, c)
})
c.Visit("https://news.ycombinator.com/item?id=" + itemID)
enc := json.NewEncoder(os.Stdout)
enc.SetIndent("", " ")
// Dump json to the standard output
enc.Encode(comments)
}
================================================
FILE: _examples/instagram/instagram.go
================================================
package main
import (
"crypto/md5"
"encoding/json"
"fmt"
"log"
"net/url"
"os"
"regexp"
"strings"
"github.com/gocolly/colly/v2"
)
// "id": user id, "after": end cursor
const nextPageURL string = `https://www.instagram.com/graphql/query/?query_hash=%s&variables=%s`
const nextPagePayload string = `{"id":"%s","first":50,"after":"%s"}`
var requestID string
var requestIds [][]byte
var queryIdPattern = regexp.MustCompile(`queryId:".{32}"`)
type pageInfo struct {
EndCursor string `json:"end_cursor"`
NextPage bool `json:"has_next_page"`
}
type mainPageData struct {
Rhxgis string `json:"rhx_gis"`
EntryData struct {
ProfilePage []struct {
Graphql struct {
User struct {
Id string `json:"id"`
Media struct {
Edges []struct {
Node struct {
ImageURL string `json:"display_url"`
ThumbnailURL string `json:"thumbnail_src"`
IsVideo bool `json:"is_video"`
Date int `json:"date"`
Dimensions struct {
Width int `json:"width"`
Height int `json:"height"`
} `json:"dimensions"`
} `json:node"`
} `json:"edges"`
PageInfo pageInfo `json:"page_info"`
} `json:"edge_owner_to_timeline_media"`
} `json:"user"`
} `json:"graphql"`
} `json:"ProfilePage"`
} `json:"entry_data"`
}
type nextPageData struct {
Data struct {
User struct {
Container struct {
PageInfo pageInfo `json:"page_info"`
Edges []struct {
Node struct {
ImageURL string `json:"display_url"`
ThumbnailURL string `json:"thumbnail_src"`
IsVideo bool `json:"is_video"`
Date int `json:"taken_at_timestamp"`
Dimensions struct {
Width int `json:"width"`
Height int `json:"height"`
}
}
} `json:"edges"`
} `json:"edge_owner_to_timeline_media"`
}
} `json:"data"`
}
func main() {
if len(os.Args) != 2 {
log.Println("Missing account name argument")
os.Exit(1)
}
var actualUserId string
instagramAccount := os.Args[1]
outputDir := fmt.Sprintf("./instagram_%s/", instagramAccount)
c := colly.NewCollector(
//colly.CacheDir("./_instagram_cache/"),
colly.UserAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"),
)
c.OnRequest(func(r *colly.Request) {
r.Headers.Set("X-Requested-With", "XMLHttpRequest")
r.Headers.Set("Referer", "https://www.instagram.com/"+instagramAccount)
if r.Ctx.Get("gis") != "" {
gis := fmt.Sprintf("%s:%s", r.Ctx.Get("gis"), r.Ctx.Get("variables"))
h := md5.New()
h.Write([]byte(gis))
gisHash := fmt.Sprintf("%x", h.Sum(nil))
r.Headers.Set("X-Instagram-GIS", gisHash)
}
})
c.OnHTML("html", func(e *colly.HTMLElement) {
d := c.Clone()
d.OnResponse(func(r *colly.Response) {
requestIds = queryIdPattern.FindAll(r.Body, -1)
requestID = string(requestIds[1][9:41])
})
requestIDURL := e.Request.AbsoluteURL(e.ChildAttr(`link[as="script"]`, "href"))
d.Visit(requestIDURL)
dat := e.ChildText("body > script:first-of-type")
jsonData := dat[strings.Index(dat, "{") : len(dat)-1]
data := &mainPageData{}
err := json.Unmarshal([]byte(jsonData), data)
if err != nil {
log.Fatal(err)
}
log.Println("saving output to ", outputDir)
os.MkdirAll(outputDir, os.ModePerm)
page := data.EntryData.ProfilePage[0]
actualUserId = page.Graphql.User.Id
for _, obj := range page.Graphql.User.Media.Edges {
// skip videos
if obj.Node.IsVideo {
continue
}
c.Visit(obj.Node.ImageURL)
}
nextPageVars := fmt.Sprintf(nextPagePayload, actualUserId, page.Graphql.User.Media.PageInfo.EndCursor)
e.Request.Ctx.Put("variables", nextPageVars)
if page.Graphql.User.Media.PageInfo.NextPage {
u := fmt.Sprintf(
nextPageURL,
requestID,
url.QueryEscape(nextPageVars),
)
log.Println("Next page found", u)
e.Request.Ctx.Put("gis", data.Rhxgis)
e.Request.Visit(u)
}
})
c.OnError(func(r *colly.Response, e error) {
log.Println("error:", e, r.Request.URL, string(r.Body))
})
c.OnResponse(func(r *colly.Response) {
if strings.Index(r.Headers.Get("Content-Type"), "image") > -1 {
r.Save(outputDir + r.FileName())
return
}
if strings.Index(r.Headers.Get("Content-Type"), "json") == -1 {
return
}
data := &nextPageData{}
err := json.Unmarshal(r.Body, data)
if err != nil {
log.Fatal(err)
}
for _, obj := range data.Data.User.Container.Edges {
// skip videos
if obj.Node.IsVideo {
continue
}
c.Visit(obj.Node.ImageURL)
}
if data.Data.User.Container.PageInfo.NextPage {
nextPageVars := fmt.Sprintf(nextPagePayload, actualUserId, data.Data.User.Container.PageInfo.EndCursor)
r.Request.Ctx.Put("variables", nextPageVars)
u := fmt.Sprintf(
nextPageURL,
requestID,
url.QueryEscape(nextPageVars),
)
log.Println("Next page found", u)
r.Request.Visit(u)
}
})
c.Visit("https://instagram.com/" + instagramAccount)
}
================================================
FILE: _examples/local_files/html/child_page/one.html
================================================
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Document</title>
</head>
<body>
<h1>Child Page One</h1>
</body>
</html>
================================================
FILE: _examples/local_files/html/child_page/three.html
================================================
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Document</title>
</head>
<body>
<h1>Child Page Three</h1>
</body>
</html>
================================================
FILE: _examples/local_files/html/child_page/two.html
================================================
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Document</title>
</head>
<body>
<h1>Child Page Two</h1>
</body>
</html>
================================================
FILE: _examples/local_files/html/index.html
================================================
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta http-equiv="X-UA-Compatible" content="ie=edge">
<title>Document</title>
</head>
<body>
<h1>Index.html</h1>
<ul>
<li><a href="/child_page/one.html"></a></li>
<li><a href="/child_page/two.html"></a></li>
<li><a href="/child_page/three.html"></a></li>
</ul>
</body>
</html>
================================================
FILE: _examples/local_files/local_files.go
================================================
package main
import (
"fmt"
"net/http"
"os"
"path/filepath"
"github.com/gocolly/colly/v2"
)
func main() {
dir, err := filepath.Abs(filepath.Dir(os.Args[0]))
if err != nil {
panic(err)
}
t := &http.Transport{}
t.RegisterProtocol("file", http.NewFileTransport(http.Dir("/")))
c := colly.NewCollector()
c.WithTransport(t)
pages := []string{}
c.OnHTML("h1", func(e *colly.HTMLElement) {
pages = append(pages, e.Text)
})
c.OnHTML("a", func(e *colly.HTMLElement) {
c.Visit("file://" + dir + "/html" + e.Attr("href"))
})
fmt.Println("file://" + dir + "/html/index.html")
c.Visit("file://" + dir + "/html/index.html")
c.Wait()
for i, p := range pages {
fmt.Printf("%d : %s\n", i, p)
}
}
================================================
FILE: _examples/login/login.go
================================================
package main
import (
"log"
"github.com/gocolly/colly/v2"
)
func main() {
// create a new collector
c := colly.NewCollector()
// authenticate
err := c.Post("http://example.com/login", map[string]string{"username": "admin", "password": "admin"})
if err != nil {
log.Fatal(err)
}
// attach callbacks after login
c.OnResponse(func(r *colly.Response) {
log.Println("response received", r.StatusCode)
})
// start scraping
c.Visit("https://example.com/")
}
================================================
FILE: _examples/max_depth/max_depth.go
================================================
package main
import (
"fmt"
"github.com/gocolly/colly/v2"
)
func main() {
// Instantiate default collector
c := colly.NewCollector(
// MaxDepth is 1, so only the links on the scraped page
// is visited, and no further links are followed
colly.MaxDepth(1),
)
// On every a element which has href attribute call callback
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
// Print link
fmt.Println(link)
// Visit link found on page
e.Request.Visit(link)
})
// Start scraping on https://en.wikipedia.org
c.Visit("https://en.wikipedia.org/")
}
================================================
FILE: _examples/multipart/multipart.go
================================================
package main
import (
"fmt"
"io"
"net/http"
"os"
"time"
"github.com/gocolly/colly/v2"
)
func generateFormData() map[string][]byte {
f, _ := os.Open("gocolly.jpg")
defer f.Close()
imgData, _ := io.ReadAll(f)
return map[string][]byte{
"firstname": []byte("one"),
"lastname": []byte("two"),
"email": []byte("onetwo@example.com"),
"file": imgData,
}
}
func setupServer() {
var handler http.HandlerFunc = func(w http.ResponseWriter, r *http.Request) {
fmt.Println("received request")
err := r.ParseMultipartForm(10000000)
if err != nil {
fmt.Println("server: Error")
w.WriteHeader(500)
w.Write([]byte("<html><body>Internal Server Error</body></html>"))
return
}
w.WriteHeader(200)
fmt.Println("server: OK")
w.Write([]byte("<html><body>Success</body></html>"))
}
go http.ListenAndServe(":8080", handler)
}
func main() {
// Start a single route http server to post an image to.
setupServer()
c := colly.NewCollector(colly.AllowURLRevisit(), colly.MaxDepth(5))
// On every a element which has href attribute call callback
c.OnHTML("html", func(e *colly.HTMLElement) {
fmt.Println(e.Text)
time.Sleep(1 * time.Second)
e.Request.PostMultipart("http://localhost:8080/", generateFormData())
})
// Before making a request print "Visiting ..."
c.OnRequest(func(r *colly.Request) {
fmt.Println("Posting gocolly.jpg to", r.URL.String())
})
// Start scraping
c.PostMultipart("http://localhost:8080/", generateFormData())
c.Wait()
}
================================================
FILE: _examples/openedx_courses/openedx_courses.go
================================================
package main
import (
"encoding/json"
"fmt"
"strings"
"time"
"github.com/gocolly/colly/v2"
)
// DATE_FORMAT default format date used in openedx
const DATE_FORMAT = "02 Jan, 2006"
// Course store openedx course data
type Course struct {
CourseID string
Run string
Name string
Number string
StartDate *time.Time
EndDate *time.Time
URL string
}
func main() {
// Instantiate default collector
c := colly.NewCollector(
// Using IndonesiaX as sample
colly.AllowedDomains("indonesiax.co.id", "www.indonesiax.co.id"),
// Cache responses to prevent multiple download of pages
// even if the collector is restarted
colly.CacheDir("./cache"),
)
courses := make([]Course, 0, 200)
// On every a element which has href attribute call callback
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
if !strings.HasPrefix(link, "/courses/") {
return
}
// start scraping the page under the link found
e.Request.Visit(link)
})
c.OnHTML("div[class=main-container]", func(e *colly.HTMLElement) {
if e.DOM.Find("section#course-info").Length() == 0 {
return
}
title := strings.Split(e.ChildText(".course-info__title"), "\n")[0]
course_id := e.ChildAttr("input[name=course_id]", "value")
texts := e.ChildTexts("span[data-datetime]")
start_date, _ := time.Parse(DATE_FORMAT, texts[0])
end_date, _ := time.Parse(DATE_FORMAT, texts[1])
var run string
if len(strings.Split(course_id, "_")) > 1 {
run = strings.Split(course_id, "_")[1]
}
course := Course{
CourseID: course_id,
Run: run,
Name: title,
Number: e.ChildText("span.course-number"),
StartDate: &start_date,
EndDate: &end_date,
URL: fmt.Sprintf("/courses/%s/about", course_id),
}
courses = append(courses, course)
})
// Start scraping on https://openedxdomain/courses
c.Visit("https://www.indonesiax.co.id/courses")
// Convert results to JSON data if the scraping job has finished
jsonData, err := json.MarshalIndent(courses, "", " ")
if err != nil {
panic(err)
}
// Dump json to the standard output (can be redirected to a file)
fmt.Println(string(jsonData))
}
================================================
FILE: _examples/parallel/parallel.go
================================================
package main
import (
"fmt"
"github.com/gocolly/colly/v2"
)
func main() {
// Instantiate default collector
c := colly.NewCollector(
// MaxDepth is 2, so only the links on the scraped page
// and links on those pages are visited
colly.MaxDepth(2),
colly.Async(),
)
// Limit the maximum parallelism to 2
// This is necessary if the goroutines are dynamically
// created to control the limit of simultaneous requests.
//
// Parallelism can be controlled also by spawning fixed
// number of go routines.
c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})
// On every a element which has href attribute call callback
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
// Print link
fmt.Println(link)
// Visit link found on page on a new thread
e.Request.Visit(link)
})
// Start scraping on https://en.wikipedia.org
c.Visit("https://en.wikipedia.org/")
// Wait until threads are finished
c.Wait()
}
================================================
FILE: _examples/proxy_switcher/proxy_switcher.go
================================================
package main
import (
"bytes"
"log"
"github.com/gocolly/colly/v2"
"github.com/gocolly/colly/v2/proxy"
)
func main() {
// Instantiate default collector
c := colly.NewCollector(colly.AllowURLRevisit())
// Rotate two socks5 proxies
rp, err := proxy.RoundRobinProxySwitcher("socks5://127.0.0.1:1337", "socks5://127.0.0.1:1338")
if err != nil {
log.Fatal(err)
}
c.SetProxyFunc(rp)
// Print the response
c.OnResponse(func(r *colly.Response) {
log.Printf("Proxy Address: %s\n", r.Request.ProxyURL)
log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
})
// Fetch httpbin.org/ip five times
for i := 0; i < 5; i++ {
c.Visit("https://httpbin.org/ip")
}
}
================================================
FILE: _examples/queue/queue.go
================================================
package main
import (
"fmt"
"github.com/gocolly/colly/v2"
"github.com/gocolly/colly/v2/queue"
)
func main() {
url := "https://httpbin.org/delay/1"
// Instantiate default collector
c := colly.NewCollector(colly.AllowURLRevisit())
// create a request queue with 2 consumer threads
q, _ := queue.New(
2, // Number of consumer threads
&queue.InMemoryQueueStorage{MaxSize: 10000}, // Use default queue storage
)
c.OnRequest(func(r *colly.Request) {
fmt.Println("visiting", r.URL)
if r.ID < 15 {
r2, err := r.New("GET", fmt.Sprintf("%s?x=%v", url, r.ID), nil)
if err == nil {
q.AddRequest(r2)
}
}
})
for i := 0; i < 5; i++ {
// Add URLs to the queue
q.AddURL(fmt.Sprintf("%s?n=%d", url, i))
}
// Consume URLs
q.Run(c)
}
================================================
FILE: _examples/random_delay/random_delay.go
================================================
package main
import (
"fmt"
"time"
"github.com/gocolly/colly/v2"
"github.com/gocolly/colly/v2/debug"
)
func main() {
url := "https://httpbin.org/delay/2"
// Instantiate default collector
c := colly.NewCollector(
// Attach a debugger to the collector
colly.Debugger(&debug.LogDebugger{}),
colly.Async(),
)
// Limit the number of threads started by colly to two
// when visiting links which domains' matches "*httpbin.*" glob
c.Limit(&colly.LimitRule{
DomainGlob: "*httpbin.*",
Parallelism: 2,
RandomDelay: 5 * time.Second,
})
// Start scraping in four threads on https://httpbin.org/delay/2
for i := 0; i < 4; i++ {
c.Visit(fmt.Sprintf("%s?n=%d", url, i))
}
// Start scraping on https://httpbin.org/delay/2
c.Visit(url)
// Wait until threads are finished
c.Wait()
}
================================================
FILE: _examples/rate_limit/rate_limit.go
================================================
package main
import (
"fmt"
"github.com/gocolly/colly/v2"
"github.com/gocolly/colly/v2/debug"
)
func main() {
url := "https://httpbin.org/delay/2"
// Instantiate default collector
c := colly.NewCollector(
// Turn on asynchronous requests
colly.Async(),
// Attach a debugger to the collector
colly.Debugger(&debug.LogDebugger{}),
)
// Limit the number of threads started by colly to two
// when visiting links which domains' matches "*httpbin.*" glob
c.Limit(&colly.LimitRule{
DomainGlob: "*httpbin.*",
Parallelism: 2,
//Delay: 5 * time.Second,
})
// Start scraping in five threads on https://httpbin.org/delay/2
for i := 0; i < 5; i++ {
c.Visit(fmt.Sprintf("%s?n=%d", url, i))
}
// Wait until threads are finished
c.Wait()
}
================================================
FILE: _examples/reddit/reddit.go
================================================
package main
import (
"fmt"
"os"
"time"
"github.com/gocolly/colly/v2"
)
type item struct {
StoryURL string
Source string
comments string
CrawledAt time.Time
Comments string
Title string
}
func main() {
stories := []item{}
// Instantiate default collector
c := colly.NewCollector(
// Visit only domains: old.reddit.com
colly.AllowedDomains("old.reddit.com"),
// Parallelism
colly.Async(true),
)
// On every a element which has .top-matter attribute call callback
// This class is unique to the div that holds all information about a story
c.OnHTML(".top-matter", func(e *colly.HTMLElement) {
temp := item{}
temp.StoryURL = e.ChildAttr("a[data-event-action=title]", "href")
temp.Source = "https://old.reddit.com/r/programming/"
temp.Title = e.ChildText("a[data-event-action=title]")
temp.Comments = e.ChildAttr("a[data-event-action=comments]", "href")
temp.CrawledAt = time.Now()
stories = append(stories, temp)
})
// On every span tag with the class next-button
c.OnHTML("span.next-button", func(h *colly.HTMLElement) {
t := h.ChildAttr("a", "href")
c.Visit(t)
})
// Set max Parallelism and introduce a Random Delay
c.Limit(&colly.LimitRule{
Parallelism: 2,
RandomDelay: 5 * time.Second,
})
// Before making a request print "Visiting ..."
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL.String())
})
// Crawl all reddits the user passes in
reddits := os.Args[1:]
for _, reddit := range reddits {
c.Visit(reddit)
}
c.Wait()
fmt.Println(stories)
}
================================================
FILE: _examples/request_context/request_context.go
================================================
package main
import (
"fmt"
"github.com/gocolly/colly/v2"
)
func main() {
// Instantiate default collector
c := colly.NewCollector()
// Before making a request put the URL with
// the key of "url" into the context of the request
c.OnRequest(func(r *colly.Request) {
r.Ctx.Put("url", r.URL.String())
})
// After making a request get "url" from
// the context of the request
c.OnResponse(func(r *colly.Response) {
fmt.Println(r.Ctx.Get("url"))
})
// Start scraping on https://en.wikipedia.org
c.Visit("https://en.wikipedia.org/")
}
================================================
FILE: _examples/scraper_server/scraper_server.go
================================================
package main
import (
"encoding/json"
"log"
"net/http"
"github.com/gocolly/colly/v2"
)
type pageInfo struct {
StatusCode int
Links map[string]int
}
func handler(w http.ResponseWriter, r *http.Request) {
URL := r.URL.Query().Get("url")
if URL == "" {
log.Println("missing URL argument")
return
}
log.Println("visiting", URL)
c := colly.NewCollector()
p := &pageInfo{Links: make(map[string]int)}
// count links
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Request.AbsoluteURL(e.Attr("href"))
if link != "" {
p.Links[link]++
}
})
// extract status code
c.OnResponse(func(r *colly.Response) {
log.Println("response received", r.StatusCode)
p.StatusCode = r.StatusCode
})
c.OnError(func(r *colly.Response, err error) {
log.Println("error:", r.StatusCode, err)
p.StatusCode = r.StatusCode
})
c.Visit(URL)
// dump results
b, err := json.Marshal(p)
if err != nil {
log.Println("failed to serialize response:", err)
return
}
w.Header().Add("Content-Type", "application/json")
w.Write(b)
}
func main() {
// example usage: curl -s 'http://127.0.0.1:7171/?url=http://go-colly.org/'
addr := ":7171"
http.HandleFunc("/", handler)
log.Println("listening on", addr)
log.Fatal(http.ListenAndServe(addr, nil))
}
================================================
FILE: _examples/shopify_sitemap/shopify_sitemap.go
================================================
package main
import (
"fmt"
"github.com/gocolly/colly/v2"
)
func main() {
// Array containing all the known URLs in a sitemap
knownUrls := []string{}
// Create a Collector specifically for Shopify
c := colly.NewCollector(colly.AllowedDomains("www.shopify.com"))
// Create a callback on the XPath query searching for the URLs
c.OnXML("//urlset/url/loc", func(e *colly.XMLElement) {
knownUrls = append(knownUrls, e.Text)
})
// Start the collector
c.Visit("https://www.shopify.com/sitemap.xml")
fmt.Println("All known URLs:")
for _, url := range knownUrls {
fmt.Println("\t", url)
}
fmt.Println("Collected", len(knownUrls), "URLs")
}
================================================
FILE: _examples/url_filter/url_filter.go
================================================
package main
import (
"fmt"
"regexp"
"github.com/gocolly/colly/v2"
)
func main() {
// Instantiate default collector
c := colly.NewCollector(
// Visit only root url and urls which start with "e" or "h" on httpbin.org
colly.URLFilters(
regexp.MustCompile("http://httpbin\\.org/(|e.+)$"),
regexp.MustCompile("http://httpbin\\.org/h.+"),
),
)
// On every a element which has href attribute call callback
c.OnHTML("a[href]", func(e *colly.HTMLElement) {
link := e.Attr("href")
// Print link
fmt.Printf("Link found: %q -> %s\n", e.Text, link)
// Visit link found on page
// Only those links are visited which are matched by any of the URLFilter regexps
c.Visit(e.Request.AbsoluteURL(link))
})
// Before making a request print "Visiting ..."
c.OnRequest(func(r *colly.Request) {
fmt.Println("Visiting", r.URL.String())
})
// Start scraping on http://httpbin.org
c.Visit("http://httpbin.org/")
}
================================================
FILE: _examples/xkcd_store/xkcd_store.go
================================================
package main
import (
"encoding/csv"
"log"
"os"
"github.com/gocolly/colly/v2"
)
func main() {
fName := "xkcd_store_items.csv"
file, err := os.Create(fName)
if err != nil {
log.Fatalf("Cannot create file %q: %s\n", fName, err)
return
}
defer file.Close()
writer := csv.NewWriter(file)
defer writer.Flush()
// Write CSV header
writer.Write([]string{"Name", "Price", "URL", "Image URL"})
// Instantiate default collector
c := colly.NewCollector(
// Allow requests only to store.xkcd.com
colly.AllowedDomains("store.xkcd.com"),
)
// Extract product details
c.OnHTML(".product-grid-item", func(e *colly.HTMLElement) {
writer.Write([]string{
e.ChildAttr("a", "title"),
e.ChildText("span"),
e.Request.AbsoluteURL(e.ChildAttr("a", "href")),
"https:" + e.ChildAttr("img", "src"),
})
})
// Find and visit next page links
c.OnHTML(`.next a[href]`, func(e *colly.HTMLElement) {
e.Request.Visit(e.Attr("href"))
})
c.Visit("https://store.xkcd.com/collections/everything")
log.Printf("Scraping finished, check file %q for results\n", fName)
// Display collector's statistics
log.Println(c)
}
================================================
FILE: cmd/colly/colly.go
================================================
// Copyright 2018 Adam Tauber
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package main
import (
"bytes"
"fmt"
"log"
"os"
"strings"
"github.com/jawher/mow.cli"
)
var scraperHeadTemplate = `package main
import (
"log"
"github.com/gocolly/colly/v2"
)
func main() {
c := colly.NewCollector()
`
var scraperEndTemplate = `
c.Visit("https://yourdomain.com/")
}
`
var htmlCallbackTemplate = `
c.OnHTML("element-selector", func(e *colly.HTMLElement) {
log.Println(e.Text)
})
`
var requestCallbackTemplate = `
c.OnRequest(func(r *colly.Request) {
log.Println("Visiting", r.URL)
})
`
var responseCallbackTemplate = `
c.OnResponse(func(r *colly.Response) {
log.Println("Visited", r.Request.URL, r.StatusCode)
})
`
var errorCallbackTemplate = `
c.OnError(func(r *colly.Response, err error) {
log.Printf("Error on %s: %s", r.Request.URL, err)
})
`
func main() {
app := cli.App("colly", "Scraping Framework for Gophers")
app.Command("new", "Create new scraper", func(cmd *cli.Cmd) {
var (
callbacks = cmd.StringOpt("callbacks", "", "Add callbacks to the template. (E.g. '--callbacks=html,response,error')")
hosts = cmd.StringOpt("hosts", "", "Specify scraper's allowed hosts. (e.g. '--hosts=xy.com,abcd.com')")
path = cmd.StringArg("PATH", "", "Path of the new scraper")
)
cmd.Spec = "[--callbacks] [--hosts] [PATH]"
cmd.Action = func() {
scraper := bytes.NewBufferString(scraperHeadTemplate)
outfile := os.Stdout
if *path != "" {
var err error
outfile, err = os.Create(*path)
if err != nil {
log.Fatal(err)
}
defer outfile.Close()
}
if *hosts != "" {
scraper.WriteString("\n c.AllowedDomains = []string{")
for i, h := range strings.Split(*hosts, ",") {
if i > 0 {
scraper.WriteString(", ")
}
scraper.WriteString(fmt.Sprintf("%q", h))
}
scraper.WriteString("}\n")
}
if len(*callbacks) > 0 {
for _, c := range strings.Split(*callbacks, ",") {
switch c {
case "html":
scraper.WriteString(htmlCallbackTemplate)
case "request":
scraper.WriteString(requestCallbackTemplate)
case "response":
scraper.WriteString(responseCallbackTemplate)
case "error":
scraper.WriteString(errorCallbackTemplate)
}
}
}
scraper.WriteString(scraperEndTemplate)
outfile.Write(scraper.Bytes())
}
})
app.Run(os.Args)
}
================================================
FILE: colly.go
================================================
// Copyright 2018 Adam Tauber
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
// Package colly implements a HTTP scraping framework
package colly
import (
"bytes"
"context"
"crypto/rand"
"encoding/json"
"errors"
"fmt"
"hash/fnv"
"io"
"log"
"net/http"
"net/http/cookiejar"
"net/url"
"os"
"path/filepath"
"regexp"
"slices"
"strconv"
"strings"
"sync"
"sync/atomic"
"time"
"github.com/PuerkitoBio/goquery"
"github.com/antchfx/htmlquery"
"github.com/antchfx/xmlquery"
"github.com/gocolly/colly/v2/debug"
"github.com/gocolly/colly/v2/storage"
"github.com/kennygrant/sanitize"
whatwgUrl "github.com/nlnwa/whatwg-url/url"
"github.com/temoto/robotstxt"
"google.golang.org/appengine/urlfetch"
)
// A CollectorOption sets an option on a Collector.
type CollectorOption func(*Collector)
// Collector provides the scraper instance for a scraping job
type Collector struct {
// UserAgent is the User-Agent string used by HTTP requests
UserAgent string
// Custom headers for the request
Headers *http.Header
// MaxDepth limits the recursion depth of visited URLs.
// Set it to 0 for infinite recursion (default).
MaxDepth int
// AllowedDomains is a domain whitelist.
// Leave it blank to allow any domains to be visited
AllowedDomains []string
// DisallowedDomains is a domain blacklist.
DisallowedDomains []string
// DisallowedURLFilters is a list of regular expressions which restricts
// visiting URLs. If any of the rules matches to a URL the
// request will be stopped. DisallowedURLFilters will
// be evaluated before URLFilters
// Leave it blank to allow any URLs to be visited
DisallowedURLFilters []*regexp.Regexp
// URLFilters is a list of regular expressions which restricts
// visiting URLs. If any of the rules matches to a URL the
// request won't be stopped. DisallowedURLFilters will
// be evaluated before URLFilters
// Leave it blank to allow any URLs to be visited
URLFilters []*regexp.Regexp
// AllowURLRevisit allows multiple downloads of the same URL
AllowURLRevisit bool
// MaxBodySize is the limit of the retrieved response body in bytes.
// 0 means unlimited.
// The default value for MaxBodySize is 10MB (10 * 1024 * 1024 bytes).
MaxBodySize int
// CacheDir specifies a location where GET requests are cached as files.
// When it's not defined, caching is disabled.
CacheDir string
// IgnoreRobotsTxt allows the Collector to ignore any restrictions set by
// the target host's robots.txt file. See http://www.robotstxt.org/ for more
// information.
IgnoreRobotsTxt bool
// Async turns on asynchronous network communication. Use Collector.Wait() to
// be sure all requests have been finished.
Async bool
// ParseHTTPErrorResponse allows parsing HTTP responses with non 2xx status codes.
// By default, Colly parses only successful HTTP responses. Set ParseHTTPErrorResponse
// to true to enable it.
ParseHTTPErrorResponse bool
// ID is the unique identifier of a collector
ID uint32
// DetectCharset can enable character encoding detection for non-utf8 response bodies
// without explicit charset declaration. This feature uses https://github.com/saintfish/chardet
DetectCharset bool
// RedirectHandler allows control on how a redirect will be managed
// use c.SetRedirectHandler to set this value
redirectHandler func(req *http.Request, via []*http.Request) error
// CheckHead performs a HEAD request before every GET to pre-validate the response
CheckHead bool
// TraceHTTP enables capturing and reporting request performance for crawler tuning.
// When set to true, the Response.Trace will be filled in with an HTTPTrace object.
TraceHTTP bool
// Context is the context that will be used for HTTP requests. You can set this
// to support clean cancellation of scraping.
Context context.Context
// MaxRequests limit the number of requests done by the instance.
// Set it to 0 for infinite requests (default).
MaxRequests uint32
store storage.Storage
debugger debug.Debugger
robotsMap map[string]*robotstxt.RobotsData
htmlCallbacks []*htmlCallbackContainer
xmlCallbacks []*xmlCallbackContainer
requestCallbacks []RequestCallback
responseCallbacks []ResponseCallback
responseHeadersCallbacks []ResponseHeadersCallback
requestHeadersCallbacks []RequestCallback
errorCallbacks []ErrorCallback
scrapedCallbacks []ScrapedCallback
requestCount atomic.Uint32
responseCount atomic.Uint32
backend *httpBackend
wg *sync.WaitGroup
lock *sync.RWMutex
// CacheExpiration sets the maximum age for cache files.
// If a cached file is older than this duration, it will be ignored and refreshed.
CacheExpiration time.Duration
}
// RequestCallback is a type alias for OnRequest callback functions
type RequestCallback func(*Request)
// ResponseHeadersCallback is a type alias for OnResponseHeaders callback functions
type ResponseHeadersCallback func(*Response)
// ResponseCallback is a type alias for OnResponse callback functions
type ResponseCallback func(*Response)
// HTMLCallback is a type alias for OnHTML callback functions
type HTMLCallback func(*HTMLElement)
// XMLCallback is a type alias for OnXML callback functions
type XMLCallback func(*XMLElement)
// ErrorCallback is a type alias for OnError callback functions
type ErrorCallback func(*Response, error)
// ScrapedCallback is a type alias for OnScraped callback functions
type ScrapedCallback func(*Response)
// ProxyFunc is a type alias for proxy setter functions.
type ProxyFunc func(*http.Request) (*url.URL, error)
// AlreadyVisitedError is the error type for already visited URLs.
//
// It's returned synchronously by Visit when the URL passed to Visit
// is already visited.
//
// When already visited URL is encountered after following
// redirects, this error appears in OnError callback, and if Async
// mode is not enabled, is also returned by Visit.
type AlreadyVisitedError struct {
// Destination is the URL that was attempted to be visited.
// It might not match the URL passed to Visit if redirect
// was followed.
Destination *url.URL
}
// Error implements error interface.
func (e *AlreadyVisitedError) Error() string {
return fmt.Sprintf("%q already visited", e.Destination)
}
type htmlCallbackContainer struct {
Selector string
Function HTMLCallback
active atomic.Bool
}
type xmlCallbackContainer struct {
Query string
Function XMLCallback
active atomic.Bool
}
type cookieJarSerializer struct {
store storage.Storage
lock *sync.RWMutex
}
var collectorCounter uint32
// The key type is unexported to prevent collisions with context keys defined in
// other packages.
type key int
// ProxyURLKey is the context key for the request proxy address.
const (
ProxyURLKey key = iota
CheckRevisitKey
)
// The prefix for environment variables of Colly settings
const envVariablePrefix = "COLLY_"
var (
// ErrForbiddenDomain is the error thrown if visiting
// a domain which is not allowed in AllowedDomains
ErrForbiddenDomain = errors.New("Forbidden domain")
// ErrMissingURL is the error type for missing URL errors
ErrMissingURL = errors.New("Missing URL")
// ErrMaxDepth is the error type for exceeding max depth
ErrMaxDepth = errors.New("Max depth limit reached")
// ErrForbiddenURL is the error thrown if visiting
// a URL which is not allowed by URLFilters
ErrForbiddenURL = errors.New("ForbiddenURL")
// ErrNoURLFiltersMatch is the error thrown if visiting
// a URL which is not allowed by URLFilters
ErrNoURLFiltersMatch = errors.New("No URLFilters match")
// ErrRobotsTxtBlocked is the error type for robots.txt errors
ErrRobotsTxtBlocked = errors.New("URL blocked by robots.txt")
// ErrNoCookieJar is the error type for missing cookie jar
ErrNoCookieJar = errors.New("Cookie jar is not available")
// ErrNoPattern is the error type for LimitRules without patterns
ErrNoPattern = errors.New("No pattern defined in LimitRule")
// ErrEmptyProxyURL is the error type for empty Proxy URL list
ErrEmptyProxyURL = errors.New("Proxy URL list is empty")
// ErrAbortedAfterHeaders is the error returned when OnResponseHeaders aborts the transfer.
ErrAbortedAfterHeaders = errors.New("Aborted after receiving response headers")
// ErrAbortedBeforeRequest is the error returned when OnResponseHeaders aborts the transfer.
ErrAbortedBeforeRequest = errors.New("Aborted before Do Request")
// ErrQueueFull is the error returned when the queue is full
ErrQueueFull = errors.New("Queue MaxSize reached")
// ErrMaxRequests is the error returned when exceeding max requests
ErrMaxRequests = errors.New("Max Requests limit reached")
// ErrRetryBodyUnseekable is the error when retry with not seekable body
ErrRetryBodyUnseekable = errors.New("Retry Body Unseekable")
)
var envMap = map[string]func(*Collector, string){
"ALLOWED_DOMAINS": func(c *Collector, val string) {
c.AllowedDomains = strings.Split(val, ",")
},
"CACHE_DIR": func(c *Collector, val string) {
c.CacheDir = val
},
"DETECT_CHARSET": func(c *Collector, val string) {
c.DetectCharset = isYesString(val)
},
"DISABLE_COOKIES": func(c *Collector, _ string) {
c.backend.Client.Jar = nil
},
"DISALLOWED_DOMAINS": func(c *Collector, val string) {
c.DisallowedDomains = strings.Split(val, ",")
},
"IGNORE_ROBOTSTXT": func(c *Collector, val string) {
c.IgnoreRobotsTxt = isYesString(val)
},
"FOLLOW_REDIRECTS": func(c *Collector, val string) {
if !isYesString(val) {
c.redirectHandler = func(req *http.Request, via []*http.Request) error {
return http.ErrUseLastResponse
}
}
},
"MAX_BODY_SIZE": func(c *Collector, val string) {
size, err := strconv.Atoi(val)
if err == nil {
c.MaxBodySize = size
}
},
"MAX_DEPTH": func(c *Collector, val string) {
maxDepth, err := strconv.Atoi(val)
if err == nil {
c.MaxDepth = maxDepth
}
},
"MAX_REQUESTS": func(c *Collector, val string) {
maxRequests, err := strconv.ParseUint(val, 0, 32)
if err == nil {
c.MaxRequests = uint32(maxRequests)
}
},
"PARSE_HTTP_ERROR_RESPONSE": func(c *Collector, val string) {
c.ParseHTTPErrorResponse = isYesString(val)
},
"TRACE_HTTP": func(c *Collector, val string) {
c.TraceHTTP = isYesString(val)
},
"USER_AGENT": func(c *Collector, val string) {
c.UserAgent = val
},
}
var urlParser = whatwgUrl.NewParser(whatwgUrl.WithPercentEncodeSinglePercentSign())
// NewCollector creates a new Collector instance with default configuration
func NewCollector(options ...CollectorOption) *Collector {
c := &Collector{}
c.Init()
for _, f := range options {
f(c)
}
c.parseSettingsFromEnv()
return c
}
// UserAgent sets the user agent used by the Collector.
func UserAgent(ua string) CollectorOption {
return func(c *Collector) {
c.UserAgent = ua
}
}
// Headers sets the custom headers used by the Collector.
func Headers(headers map[string]string) CollectorOption {
return func(c *Collector) {
customHeaders := make(http.Header)
for header, value := range headers {
customHeaders.Add(header, value)
}
c.Headers = &customHeaders
}
}
// MaxDepth limits the recursion depth of visited URLs.
func MaxDepth(depth int) CollectorOption {
return func(c *Collector) {
c.MaxDepth = depth
}
}
// MaxRequests limit the number of requests done by the instance.
// Set it to 0 for infinite requests (default).
func MaxRequests(max uint32) CollectorOption {
return func(c *Collector) {
c.MaxRequests = max
}
}
// AllowedDomains sets the domain whitelist used by the Collector.
func AllowedDomains(domains ...string) CollectorOption {
return func(c *Collector) {
c.AllowedDomains = domains
}
}
// ParseHTTPErrorResponse allows parsing responses with HTTP errors
func ParseHTTPErrorResponse() CollectorOption {
return func(c *Collector) {
c.ParseHTTPErrorResponse = true
}
}
// DisallowedDomains sets the domain blacklist used by the Collector.
func DisallowedDomains(domains ...string) CollectorOption {
return func(c *Collector) {
c.DisallowedDomains = domains
}
}
// DisallowedURLFilters sets the list of regular expressions which restricts
// visiting URLs. If any of the rules matches to a URL the request will be stopped.
func DisallowedURLFilters(filters ...*regexp.Regexp) CollectorOption {
return func(c *Collector) {
c.DisallowedURLFilters = filters
}
}
// URLFilters sets the list of regular expressions which restricts
// visiting URLs. If any of the rules matches to a URL the request won't be stopped.
func URLFilters(filters ...*regexp.Regexp) CollectorOption {
return func(c *Collector) {
c.URLFilters = filters
}
}
// AllowURLRevisit instructs the Collector to allow multiple downloads of the same URL
func AllowURLRevisit() CollectorOption {
return func(c *Collector) {
c.AllowURLRevisit = true
}
}
// MaxBodySize sets the limit of the retrieved response body in bytes.
func MaxBodySize(sizeInBytes int) CollectorOption {
return func(c *Collector) {
c.MaxBodySize = sizeInBytes
}
}
// CacheDir specifies the location where GET requests are cached as files.
func CacheDir(path string) CollectorOption {
return func(c *Collector) {
c.CacheDir = path
}
}
// IgnoreRobotsTxt instructs the Collector to ignore any restrictions
// set by the target host's robots.txt file.
func IgnoreRobotsTxt() CollectorOption {
return func(c *Collector) {
c.IgnoreRobotsTxt = true
}
}
// TraceHTTP instructs the Collector to collect and report request trace data
// on the Response.Trace.
func TraceHTTP() CollectorOption {
return func(c *Collector) {
c.TraceHTTP = true
}
}
// StdlibContext sets the context that will be used for HTTP requests.
// You can set this to support clean cancellation of scraping.
func StdlibContext(ctx context.Context) CollectorOption {
return func(c *Collector) {
c.Context = ctx
}
}
// ID sets the unique identifier of the Collector.
func ID(id uint32) CollectorOption {
return func(c *Collector) {
c.ID = id
}
}
// Async turns on asynchronous network requests.
func Async(a ...bool) CollectorOption {
return func(c *Collector) {
if len(a) > 0 {
c.Async = a[0]
} else {
c.Async = true
}
}
}
// DetectCharset enables character encoding detection for non-utf8 response bodies
// without explicit charset declaration. This feature uses https://github.com/saintfish/chardet
func DetectCharset() CollectorOption {
return func(c *Collector) {
c.DetectCharset = true
}
}
// Debugger sets the debugger used by the Collector.
func Debugger(d debug.Debugger) CollectorOption {
return func(c *Collector) {
d.Init()
c.debugger = d
}
}
// CheckHead performs a HEAD request before every GET to pre-validate the response
func CheckHead() CollectorOption {
return func(c *Collector) {
c.CheckHead = true
}
}
// CacheExpiration sets the maximum age for cache files.
// If a cached file is older than this duration, it will be ignored and refreshed.
func CacheExpiration(d time.Duration) CollectorOption {
return func(c *Collector) {
c.CacheExpiration = d
}
}
// Init initializes the Collector's private variables and sets default
// configuration for the Collector
func (c *Collector) Init() {
c.UserAgent = "colly - https://github.com/gocolly/colly"
c.Headers = nil
c.MaxDepth = 0
c.MaxRequests = 0
c.store = &storage.InMemoryStorage{}
c.store.Init()
c.MaxBodySize = 10 * 1024 * 1024
c.backend = &httpBackend{}
jar, _ := cookiejar.New(nil)
c.backend.Init(jar)
c.backend.Client.CheckRedirect = c.checkRedirectFunc()
c.wg = &sync.WaitGroup{}
c.lock = &sync.RWMutex{}
c.robotsMap = make(map[string]*robotstxt.RobotsData)
c.IgnoreRobotsTxt = true
c.ID = atomic.AddUint32(&collectorCounter, 1)
c.TraceHTTP = false
c.Context = context.Background()
}
// Appengine will replace the Collector's backend http.Client
// With an Http.Client that is provided by appengine/urlfetch
// This function should be used when the scraper is run on
// Google App Engine. Example:
//
// func startScraper(w http.ResponseWriter, r *http.Request) {
// ctx := appengine.NewContext(r)
// c := colly.NewCollector()
// c.Appengine(ctx)
// ...
// c.Visit("https://google.ca")
// }
func (c *Collector) Appengine(ctx context.Context) {
client := urlfetch.Client(ctx)
client.Jar = c.backend.Client.Jar
client.CheckRedirect = c.backend.Client.CheckRedirect
client.Timeout = c.backend.Client.Timeout
c.backend.Client = client
}
// Visit starts Collector's collecting job by creating a
// request to the URL specified in parameter.
// Visit also calls the previously provided callbacks
func (c *Collector) Visit(URL string) error {
if c.CheckHead {
if check := c.scrape(URL, "HEAD", 1, nil, nil, nil, true); check != nil {
return check
}
}
return c.scrape(URL, "GET", 1, nil, nil, nil, true)
}
// HasVisited checks if the provided URL has been visited
func (c *Collector) HasVisited(URL string) (bool, error) {
return c.checkHasVisited(URL, nil)
}
// HasPosted checks if the provided URL and requestData has been visited
// This method is useful more likely to prevent re-visit same URL and POST body
func (c *Collector) HasPosted(URL string, requestData map[string]string) (bool, error) {
return c.checkHasVisited(URL, requestData)
}
// Head starts a collector job by creating a HEAD request.
func (c *Collector) Head(URL string) error {
return c.scrape(URL, "HEAD", 1, nil, nil, nil, false)
}
// Post starts a collector job by creating a POST request.
// Post also calls the previously provided callbacks
func (c *Collector) Post(URL string, requestData map[string]string) error {
return c.scrape(URL, "POST", 1, createFormReader(requestData), nil, nil, true)
}
// PostRaw starts a collector job by creating a POST request with raw binary data.
// Post also calls the previously provided callbacks
func (c *Collector) PostRaw(URL string, requestData []byte) error {
return c.scrape(URL, "POST", 1, bytes.NewReader(requestData), nil, nil, true)
}
// PostMultipart starts a collector job by creating a Multipart POST request
// with raw binary data. PostMultipart also calls the previously provided callbacks
func (c *Collector) PostMultipart(URL string, requestData map[string][]byte) error {
boundary := randomBoundary()
hdr := http.Header{}
hdr.Set("Content-Type", "multipart/form-data; boundary="+boundary)
hdr.Set("User-Agent", c.UserAgent)
return c.scrape(URL, "POST", 1, createMultipartReader(boundary, requestData), nil, hdr, true)
}
// Request starts a collector job by creating a custom HTTP request
// where method, context, headers and request data can be specified.
// Set requestData, ctx, hdr parameters to nil if you don't want to use them.
// Valid methods:
// - "GET"
// - "HEAD"
// - "POST"
// - "PUT"
// - "DELETE"
// - "PATCH"
// - "OPTIONS"
func (c *Collector) Request(method, URL string, requestData io.Reader, ctx *Context, hdr http.Header) error {
return c.scrape(URL, method, 1, requestData, ctx, hdr, true)
}
// SetDebugger attaches a debugger to the collector
func (c *Collector) SetDebugger(d debug.Debugger) {
d.Init()
c.debugger = d
}
// UnmarshalRequest creates a Request from serialized data
func (c *Collector) UnmarshalRequest(r []byte) (*Request, error) {
req := &serializableRequest{}
err := json.Unmarshal(r, req)
if err != nil {
return nil, err
}
u, err := url.Parse(req.URL)
if err != nil {
return nil, err
}
ctx := NewContext()
for k, v := range req.Ctx {
ctx.Put(k, v)
}
return &Request{
Method: req.Method,
URL: u,
Depth: req.Depth,
Body: bytes.NewReader(req.Body),
Ctx: ctx,
ID: c.requestCount.Add(1),
Headers: &req.Headers,
collector: c,
}, nil
}
func (c *Collector) scrape(u, method string, depth int, requestData io.Reader, ctx *Context, hdr http.Header, checkRevisit bool) error {
parsedWhatwgURL, err := urlParser.Parse(u)
if err != nil {
return err
}
parsedURL, err := url.Parse(parsedWhatwgURL.Href(false))
if err != nil {
return err
}
if hdr == nil {
hdr = http.Header{}
if c.Headers != nil {
for k, v := range *c.Headers {
for _, value := range v {
hdr.Add(k, value)
}
}
}
}
if _, ok := hdr["User-Agent"]; !ok {
hdr.Set("User-Agent", c.UserAgent)
}
if seeker, ok := requestData.(io.ReadSeeker); ok {
_, err := seeker.Seek(0, io.SeekStart)
if err != nil {
return err
}
}
req, err := http.NewRequest(method, parsedURL.String(), requestData)
if err != nil {
return err
}
req.Header = hdr
// The Go HTTP API ignores "Host" in the headers, preferring the client
// to use the Host field on Request.
if hostHeader := hdr.Get("Host"); hostHeader != "" {
req.Host = hostHeader
}
// note: once 1.13 is minimum supported Go version,
// replace this with http.NewRequestWithContext
req = req.WithContext(context.WithValue(c.Context, CheckRevisitKey, checkRevisit))
if err := c.requestCheck(parsedURL, method, req.GetBody, depth, checkRevisit); err != nil {
return err
}
u = parsedURL.String()
c.wg.Add(1)
if c.Async {
go c.fetch(u, method, depth, requestData, ctx, hdr, req)
return nil
}
return c.fetch(u, method, depth, requestData, ctx, hdr, req)
}
func (c *Collector) fetch(u, method string, depth int, requestData io.Reader, ctx *Context, hdr http.Header, req *http.Request) error {
defer c.wg.Done()
if ctx == nil {
ctx = NewContext()
}
request := &Request{
URL: req.URL,
Headers: &req.Header,
Host: req.Host,
Ctx: ctx,
Depth: depth,
Method: method,
Body: requestData,
collector: c,
ID: c.requestCount.Add(1),
}
if req.Header.Get("Accept") == "" {
req.Header.Set("Accept", "*/*")
}
c.handleOnRequest(request)
if request.abort {
return nil
}
if method == "POST" && req.Header.Get("Content-Type") == "" {
req.Header.Add("Content-Type", "application/x-www-form-urlencoded")
}
var hTrace *HTTPTrace
if c.TraceHTTP {
hTrace = &HTTPTrace{}
req = hTrace.WithTrace(req)
}
origURL := req.URL
checkResponseHeadersFunc := func(req *http.Request, statusCode int, headers http.Header) bool {
if req.URL != origURL {
request.URL = req.URL
request.Headers = &req.Header
}
c.handleOnResponseHeaders(&Response{Ctx: ctx, Request: request, StatusCode: statusCode, Headers: &headers})
return !request.abort
}
checkRequestHeadersFunc := func(req *http.Request) bool {
c.handleOnRequestHeaders(request)
return !request.abort
}
response, err := c.backend.Cache(req, c.MaxBodySize, checkRequestHeadersFunc, checkResponseHeadersFunc, c.CacheDir, c.CacheExpiration)
if proxyURL, ok := req.Context().Value(ProxyURLKey).(string); ok {
request.ProxyURL = proxyURL
}
if err := c.handleOnError(response, err, request, ctx); err != nil {
return err
}
c.responseCount.Add(1)
response.Ctx = ctx
response.Request = request
response.Trace = hTrace
err = response.fixCharset(c.DetectCharset, request.ResponseCharacterEncoding)
if err != nil {
return err
}
c.handleOnResponse(response)
err = c.handleOnHTML(response)
if err != nil {
c.handleOnError(response, err, request, ctx)
}
err = c.handleOnXML(response)
if err != nil {
c.handleOnError(response, err, request, ctx)
}
c.handleOnScraped(response)
return err
}
func (c *Collector) requestCheck(parsedURL *url.URL, method string, getBody func() (io.ReadCloser, error), depth int, checkRevisit bool) error {
u := parsedURL.String()
if c.MaxDepth > 0 && c.MaxDepth < depth {
return ErrMaxDepth
}
if c.MaxRequests > 0 && c.requestCount.Load() >= c.MaxRequests {
return ErrMaxRequests
}
if err := c.checkFilters(u, parsedURL.Hostname()); err != nil {
return err
}
if method != "HEAD" && !c.IgnoreRobotsTxt {
if err := c.checkRobots(parsedURL); err != nil {
return err
}
}
if checkRevisit && !c.AllowURLRevisit {
// TODO weird behaviour, it allows CheckHead to work correctly,
// but it should probably better be solved with
// "check-but-not-save" flag or something
if method != "GET" && getBody == nil {
return nil
}
var body io.ReadCloser
if getBody != nil {
var err error
body, err = getBody()
if err != nil {
return err
}
defer body.Close()
}
uHash := requestHash(u, body)
visited, err := c.store.IsVisited(uHash)
if err != nil {
return err
}
if visited {
return &AlreadyVisitedError{parsedURL}
}
return c.store.Visited(uHash)
}
return nil
}
func (c *Collector) checkFilters(URL, domain string) error {
if len(c.DisallowedURLFilters) > 0 {
if isMatchingFilter(c.DisallowedURLFilters, []byte(URL)) {
return ErrForbiddenURL
}
}
if len(c.URLFilters) > 0 {
if !isMatchingFilter(c.URLFilters, []byte(URL)) {
return ErrNoURLFiltersMatch
}
}
if !c.isDomainAllowed(domain) {
return ErrForbiddenDomain
}
return nil
}
func (c *Collector) isDomainAllowed(domain string) bool {
if slices.Contains(c.DisallowedDomains, domain) {
return false
}
if c.AllowedDomains == nil || len(c.AllowedDomains) == 0 {
return true
}
return slices.Contains(c.AllowedDomains, domain)
}
func (c *Collector) checkRobots(u *url.URL) error {
c.lock.RLock()
robot, ok := c.robotsMap[u.Host]
c.lock.RUnlock()
if !ok {
// no robots file cached
// Prepare request,
req, err := http.NewRequest("GET", u.Scheme+"://"+u.Host+"/robots.txt", nil)
if err != nil {
return err
}
hdr := http.Header{}
if c.Headers != nil {
for k, v := range *c.Headers {
for _, value := range v {
hdr.Add(k, value)
}
}
}
if _, ok := hdr["User-Agent"]; !ok {
hdr.Set("User-Agent", c.UserAgent)
}
req.Header = hdr
// The Go HTTP API ignores "Host" in the headers, preferring the client
// to use the Host field on Request.
if hostHeader := hdr.Get("Host"); hostHeader != "" {
req.Host = hostHeader
}
resp, err := c.backend.Client.Do(req)
if err != nil {
return err
}
defer resp.Body.Close()
robot, err = robotstxt.FromResponse(resp)
if err != nil {
return err
}
c.lock.Lock()
c.robotsMap[u.Host] = robot
c.lock.Unlock()
}
uaGroup := robot.FindGroup(c.UserAgent)
if uaGroup == nil {
return nil
}
eu := u.EscapedPath()
if u.RawQuery != "" {
eu += "?" + u.Query().Encode()
}
if !uaGroup.Test(eu) {
return ErrRobotsTxtBlocked
}
return nil
}
// String is the text representation of the collector.
// It contains useful debug information about the collector's internals
func (c *Collector) String() string {
return fmt.Sprintf(
"Requests made: %d (%d responses) | Callbacks: OnRequest: %d, OnHTML: %d, OnResponse: %d, OnError: %d",
c.requestCount.Load(),
c.responseCount.Load(),
len(c.requestCallbacks),
len(c.htmlCallbacks),
len(c.responseCallbacks),
len(c.errorCallbacks),
)
}
// Wait returns when the collector jobs are finished
func (c *Collector) Wait() {
c.wg.Wait()
}
// OnRequest registers a function. Function will be executed on every
// request made by the Collector
func (c *Collector) OnRequest(f RequestCallback) {
c.lock.Lock()
if c.requestCallbacks == nil {
c.requestCallbacks = make([]RequestCallback, 0, 4)
}
c.requestCallbacks = append(c.requestCallbacks, f)
c.lock.Unlock()
}
// OnResponseHeaders registers a function. Function will be executed on every response
// when headers and status are already received, but body is not yet read.
//
// Like in OnRequest, you can call Request.Abort to abort the transfer. This might be
// useful if, for example, you're following all hyperlinks, but want to avoid
// downloading files.
//
// Be aware that using this will prevent HTTP/1.1 connection reuse, as
// the only way to abort a download is to immediately close the connection.
// HTTP/2 doesn't suffer from this problem, as it's possible to close
// specific stream inside the connection.
func (c *Collector) OnResponseHeaders(f ResponseHeadersCallback) {
c.lock.Lock()
c.responseHeadersCallbacks = append(c.responseHeadersCallbacks, f)
c.lock.Unlock()
}
// OnRequestHeaders registers a function. Function will be executed on every
// request made by the Collector before Request Do
func (c *Collector) OnRequestHeaders(f RequestCallback) {
c.lock.Lock()
c.requestHeadersCallbacks = append(c.requestHeadersCallbacks, f)
c.lock.Unlock()
}
// OnResponse registers a function. Function will be executed on every response
func (c *Collector) OnResponse(f ResponseCallback) {
c.lock.Lock()
if c.responseCallbacks == nil {
c.responseCallbacks = make([]ResponseCallback, 0, 4)
}
c.responseCallbacks = append(c.responseCallbacks, f)
c.lock.Unlock()
}
// OnHTML registers a function. Function will be executed on every HTML
// element matched by the GoQuery Selector parameter.
// GoQuery Selector is a selector used by https://github.com/PuerkitoBio/goquery
func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback) {
c.lock.Lock()
if c.htmlCallbacks == nil {
c.htmlCallbacks = make([]*htmlCallbackContainer, 0, 4)
}
cc := &htmlCallbackContainer{
Selector: goquerySelector,
Function: f,
}
cc.active.Store(true)
c.htmlCallbacks = append(c.htmlCallbacks, cc)
c.lock.Unlock()
}
// OnXML registers a function. Function will be executed on every XML
// element matched by the xpath Query parameter.
// xpath Query is used by https://github.com/antchfx/xmlquery
func (c *Collector) OnXML(xpathQuery string, f XMLCallback) {
c.lock.Lock()
if c.xmlCallbacks == nil {
c.xmlCallbacks = make([]*xmlCallbackContainer, 0, 4)
}
cc := &xmlCallbackContainer{
Query: xpathQuery,
Function: f,
}
cc.active.Store(true)
c.xmlCallbacks = append(c.xmlCallbacks, cc)
c.lock.Unlock()
}
// OnHTMLDetach deregister a function. Function will not be execute after detached
func (c *Collector) OnHTMLDetach(goquerySelector string) {
c.lock.Lock()
defer c.lock.Unlock()
for _, cc := range c.htmlCallbacks {
if cc.Selector == goquerySelector {
cc.active.Store(false)
}
}
}
// OnXMLDetach deregister a function. Function will not be execute after detached
func (c *Collector) OnXMLDetach(xpathQuery string) {
c.lock.Lock()
defer c.lock.Unlock()
for _, cc := range c.xmlCallbacks {
if cc.Query == xpathQuery {
cc.active.Store(false)
}
}
}
// OnError registers a function. Function will be executed if an error
// occurs during the HTTP request.
func (c *Collector) OnError(f ErrorCallback) {
c.lock.Lock()
if c.errorCallbacks == nil {
c.errorCallbacks = make([]ErrorCallback, 0, 4)
}
c.errorCallbacks = append(c.errorCallbacks, f)
c.lock.Unlock()
}
// OnScraped registers a function that will be executed as the final part of
// the scraping, after OnHTML and OnXML have finished.
func (c *Collector) OnScraped(f ScrapedCallback) {
c.lock.Lock()
if c.scrapedCallbacks == nil {
c.scrapedCallbacks = make([]ScrapedCallback, 0, 4)
}
c.scrapedCallbacks = append(c.scrapedCallbacks, f)
c.lock.Unlock()
}
// SetClient will override the previously set http.Client
func (c *Collector) SetClient(client *http.Client) {
c.backend.Client = client
}
// WithTransport allows you to set a custom http.RoundTripper (transport)
func (c *Collector) WithTransport(transport http.RoundTripper) {
c.backend.Client.Transport = transport
}
// DisableCookies turns off cookie handling
func (c *Collector) DisableCookies() {
c.backend.Client.Jar = nil
}
// SetCookieJar overrides the previously set cookie jar
func (c *Collector) SetCookieJar(j http.CookieJar) {
c.backend.Client.Jar = j
}
// SetRequestTimeout overrides the default timeout (10 seconds) for this collector
func (c *Collector) SetRequestTimeout(timeout time.Duration) {
c.backend.Client.Timeout = timeout
}
// SetStorage overrides the default in-memory storage.
// Storage stores scraping related data like cookies and visited urls
func (c *Collector) SetStorage(s storage.Storage) error {
if err := s.Init(); err != nil {
return err
}
c.store = s
c.backend.Client.Jar = createJar(s)
return nil
}
// SetProxy sets a proxy for the collector. This method overrides the previously
// used http.Transport if the type of the transport is not http.RoundTripper.
// The proxy type is determined by the URL scheme. "http"
// and "socks5" are supported. If the scheme is empty,
// "http" is assumed.
func (c *Collector) SetProxy(proxyURL string) error {
proxyParsed, err := url.Parse(proxyURL)
if err != nil {
return err
}
c.SetProxyFunc(http.ProxyURL(proxyParsed))
return nil
}
// SetProxyFunc sets a custom proxy setter/switcher function.
// See built-in ProxyFuncs for more details.
// This method overrides the previously used http.Transport
// if the type of the transport is not *http.Transport.
// The proxy type is determined by the URL scheme. "http"
// and "socks5" are supported. If the scheme is empty,
// "http" is assumed.
func (c *Collector) SetProxyFunc(p ProxyFunc) {
t, ok := c.backend.Client.Transport.(*http.Transport)
if c.backend.Client.Transport != nil && ok {
t.Proxy = p
t.DisableKeepAlives = true
} else {
c.backend.Client.Transport = &http.Transport{
Proxy: p,
DisableKeepAlives: true,
}
}
}
func createEvent(eventType string, requestID, collectorID uint32, kvargs map[string]string) *debug.Event {
return &debug.Event{
CollectorID: collectorID,
RequestID: requestID,
Type: eventType,
Values: kvargs,
}
}
func (c *Collector) handleOnRequest(r *Request) {
if c.debugger != nil {
c.debugger.Event(createEvent("request", r.ID, c.ID, map[string]string{
"url": r.URL.String(),
}))
}
for _, f := range c.requestCallbacks {
f(r)
}
}
func (c *Collector) handleOnResponse(r *Response) {
if c.debugger != nil {
c.debugger.Event(createEvent("response", r.Request.ID, c.ID, map[string]string{
"url": r.Request.URL.String(),
"status": http.StatusText(r.StatusCode),
}))
}
for _, f := range c.responseCallbacks {
f(r)
}
}
func (c *Collector) handleOnResponseHeaders(r *Response) {
if c.debugger != nil {
c.debugger.Event(createEvent("responseHeaders", r.Request.ID, c.ID, map[string]string{
"url": r.Request.URL.String(),
"status": http.StatusText(r.StatusCode),
}))
}
for _, f := range c.responseHeadersCallbacks {
f(r)
}
}
func (c *Collector) handleOnRequestHeaders(r *Request) {
if c.debugger != nil {
c.debugger.Event(createEvent("requestHeaders", r.ID, c.ID, map[string]string{
"url": r.URL.String(),
}))
}
for _, f := range c.requestHeadersCallbacks {
f(r)
}
}
func (c *Collector) handleOnHTML(resp *Response) error {
c.lock.RLock()
htmlCallbacks := slices.Clone(c.htmlCallbacks)
c.lock.RUnlock()
if len(htmlCallbacks) == 0 {
return nil
}
contentType := resp.Headers.Get("Content-Type")
if contentType == "" {
contentType = http.DetectContentType(resp.Body)
}
// implementation of mime.ParseMediaType without parsing the params
// part
mediatype, _, _ := strings.Cut(contentType, ";")
mediatype = strings.TrimSpace(strings.ToLower(mediatype))
// TODO we also want to parse application/xml as XHTML if it has
// appropriate doctype
switch mediatype {
case "text/html", "application/xhtml+xml":
default:
return nil
}
doc, err := goquery.NewDocumentFromReader(bytes.NewBuffer(resp.Body))
if err != nil {
return err
}
if href, found := doc.Find("base[href]").Attr("href"); found {
u, err := urlParser.ParseRef(resp.Request.URL.String(), href)
if err == nil {
baseURL, err := url.Parse(u.Href(false))
if err == nil {
resp.Request.baseURL = baseURL
}
}
}
for _, cc := range htmlCallbacks {
if !cc.active.Load() {
continue
}
i := 0
doc.Find(cc.Selector).Each(func(_ int, s *goquery.Selection) {
for _, n := range s.Nodes {
e := NewHTMLElementFromSelectionNode(resp, s, n, i)
i++
if c.debugger != nil {
c.debugger.Event(createEvent("html", resp.Request.ID, c.ID, map[string]string{
"selector": cc.Selector,
"url": resp.Request.URL.String(),
}))
}
cc.Function(e)
}
})
}
return nil
}
func (c *Collector) handleOnXML(resp *Response) error {
c.lock.RLock()
xmlCallbacks := slices.Clone(c.xmlCallbacks)
c.lock.RUnlock()
if len(xmlCallbacks) == 0 {
return nil
}
contentType := strings.ToLower(resp.Headers.Get("Content-Type"))
isXMLFile := strings.HasSuffix(strings.ToLower(resp.Request.URL.Path), ".xml") || strings.HasSuffix(strings.ToLower(resp.Request.URL.Path), ".xml.gz")
if !strings.Contains(contentType, "html") && (!strings.Contains(contentType, "xml") && !isXMLFile) {
return nil
}
if strings.Contains(contentType, "html") {
doc, err := htmlquery.Parse(bytes.NewBuffer(resp.Body))
if err != nil {
return err
}
if e := htmlquery.FindOne(doc, "//base"); e != nil {
for _, a := range e.Attr {
if a.Key == "href" {
baseURL, err := resp.Request.URL.Parse(a.Val)
if err == nil {
resp.Request.baseURL = baseURL
}
break
}
}
}
for _, cc := range xmlCallbacks {
if !cc.active.Load() {
continue
}
for i, n := range htmlquery.Find(doc, cc.Query) {
e := NewXMLElementFromHTMLNode(resp, n)
e.Index = i
if c.debugger != nil {
c.debugger.Event(createEvent("xml", resp.Request.ID, c.ID, map[string]string{
"selector": cc.Query,
"url": resp.Request.URL.String(),
}))
}
cc.Function(e)
}
}
} else if strings.Contains(contentType, "xml") || isXMLFile {
doc, err := xmlquery.Parse(bytes.NewBuffer(resp.Body))
if err != nil {
return err
}
for _, cc := range xmlCallbacks {
if !cc.active.Load() {
continue
}
xmlquery.FindEach(doc, cc.Query, func(i int, n *xmlquery.Node) {
e := NewXMLElementFromXMLNode(resp, n)
if c.debugger != nil {
c.debugger.Event(createEvent("xml", resp.Request.ID, c.ID, map[string]string{
"selector": cc.Query,
"url": resp.Request.URL.String(),
}))
}
cc.Function(e)
})
}
}
return nil
}
func (c *Collector) handleOnError(response *Response, err error, request *Request, ctx *Context) error {
if err == nil && (c.ParseHTTPErrorResponse || response.StatusCode < 203) {
return nil
}
if err == nil && response.StatusCode >= 203 {
err = errors.New(http.StatusText(response.StatusCode))
}
if response == nil {
response = &Response{
Request: request,
Ctx: ctx,
}
}
if c.debugger != nil {
c.debugger.Event(createEvent("error", request.ID, c.ID, map[string]string{
"url": request.URL.String(),
"status": http.StatusText(response.StatusCode),
}))
}
if response.Request == nil {
response.Request = request
}
if response.Ctx == nil {
response.Ctx = request.Ctx
}
for _, f := range c.errorCallbacks {
f(response, err)
}
return err
}
func (c *Collector) cleanupCallbacks() {
c.lock.Lock()
defer c.lock.Unlock()
// Clean HTML callbacks
c.htmlCallbacks = slices.DeleteFunc(c.htmlCallbacks, func(cc *htmlCallbackContainer) bool {
return !cc.active.Load()
})
// Clean XML callbacks
c.xmlCallbacks = slices.DeleteFunc(c.xmlCallbacks, func(cc *xmlCallbackContainer) bool {
return !cc.active.Load()
})
}
func (c *Collector) handleOnScraped(r *Response) {
if c.debugger != nil {
c.debugger.Event(createEvent("scraped", r.Request.ID, c.ID, map[string]string{
"url": r.Request.URL.String(),
}))
}
for _, f := range c.scrapedCallbacks {
f(r)
}
// Cleanup inactive callbacks after processing each response
c.cleanupCallbacks()
}
// Limit adds a new LimitRule to the collector
func (c *Collector) Limit(rule *LimitRule) error {
return c.backend.Limit(rule)
}
// Limits adds new LimitRules to the collector
func (c *Collector) Limits(rules []*LimitRule) error {
return c.backend.Limits(rules)
}
// SetRedirectHandler instructs the Collector to allow multiple downloads of the same URL
func (c *Collector) SetRedirectHandler(f func(req *http.Request, via []*http.Request) error) {
c.redirectHandler = f
c.backend.Client.CheckRedirect = c.checkRedirectFunc()
}
// SetCookies handles the receipt of the cookies in a reply for the given URL
func (c *Collector) SetCookies(URL string, cookies []*http.Cookie) error {
if c.backend.Client.Jar == nil {
return ErrNoCookieJar
}
u, err := url.Parse(URL)
if err != nil {
return err
}
c.backend.Client.Jar.SetCookies(u, cookies)
return nil
}
// Cookies returns the cookies to send in a request for the given URL.
func (c *Collector) Cookies(URL string) []*http.Cookie {
if c.backend.Client.Jar == nil {
return nil
}
u, err := url.Parse(URL)
if err != nil {
return nil
}
return c.backend.Client.Jar.Cookies(u)
}
// Clone creates an exact copy of a Collector without callbacks.
// HTTP backend, robots.txt cache and cookie jar are shared
// between collectors.
func (c *Collector) Clone() *Collector {
return &Collector{
AllowedDomains: c.AllowedDomains,
AllowURLRevisit: c.AllowURLRevisit,
CacheDir: c.CacheDir,
CacheExpiration: c.CacheExpiration,
DetectCharset: c.DetectCharset,
DisallowedDomains: c.DisallowedDomains,
ID: atomic.AddUint32(&collectorCounter, 1),
IgnoreRobotsTxt: c.IgnoreRobotsTxt,
MaxBodySize: c.MaxBodySize,
MaxDepth: c.MaxDepth,
MaxRequests: c.MaxRequests,
DisallowedURLFilters: c.DisallowedURLFilters,
URLFilters: c.URLFilters,
CheckHead: c.CheckHead,
ParseHTTPErrorResponse: c.ParseHTTPErrorResponse,
UserAgent: c.UserAgent,
Headers: c.Headers,
TraceHTTP: c.TraceHTTP,
Context: c.Context,
store: c.store,
backend: c.backend,
debugger: c.debugger,
Async: c.Async,
redirectHandler: c.redirectHandler,
errorCallbacks: make([]ErrorCallback, 0, 8),
htmlCallbacks: make([]*htmlCallbackContainer, 0, 8),
xmlCallbacks: make([]*xmlCallbackContainer, 0, 8),
scrapedCallbacks: make([]ScrapedCallback, 0, 8),
lock: c.lock,
requestCallbacks: make([]RequestCallback, 0, 8),
responseCallbacks: make([]ResponseCallback, 0, 8),
robotsMap: c.robotsMap,
wg: &sync.WaitGroup{},
}
}
func (c *Collector) checkRedirectFunc() func(req *http.Request, via []*http.Request) error {
return func(req *http.Request, via []*http.Request) error {
if err := c.checkFilters(req.URL.String(), req.URL.Hostname()); err != nil {
return fmt.Errorf("Not following redirect to %q: %w", req.URL, err)
}
// allow redirects to the original destination
// to support websites redirecting to the same page while setting
// session cookies
samePageRedirect := normalizeURL(req.URL.String()) == normalizeURL(via[0].URL.String())
if !c.AllowURLRevisit && !samePageRedirect {
var body io.ReadCloser
if req.GetBody != nil {
var err error
body, err = req.GetBody()
if err != nil {
return err
}
defer body.Close()
}
uHash := requestHash(req.URL.String(), body)
visited, err := c.store.IsVisited(uHash)
if err != nil {
return err
}
if visited {
if checkRevisit, ok := req.Context().Value(CheckRevisitKey).(bool); !ok || checkRevisit {
return &AlreadyVisitedError{req.URL}
}
}
err = c.store.Visited(uHash)
if err != nil {
return err
}
}
if c.redirectHandler != nil {
return c.redirectHandler(req, via)
}
// Honor golangs default of maximum of 10 redirects
if len(via) >= 10 {
return http.ErrUseLastResponse
}
lastRequest := via[len(via)-1]
// If domain has changed, remove the Authorization-header if it exists
if req.URL.Host != lastRequest.URL.Host {
req.Header.Del("Authorization")
}
return nil
}
}
func (c *Collector) parseSettingsFromEnv() {
for _, e := range os.Environ() {
if !strings.HasPrefix(e, envVariablePrefix) {
continue
}
pair := strings.SplitN(e[len(envVariablePrefix):], "=", 2)
if f, ok := envMap[pair[0]]; ok {
f(c, pair[1])
} else {
log.Println("Unknown environment variable:", pair[0])
}
}
}
func (c *Collector) checkHasVisited(URL string, requestData map[string]string) (bool, error) {
hash := requestHash(URL, createFormReader(requestData))
return c.store.IsVisited(hash)
}
// SanitizeFileName replaces dangerous characters in a string
// so the return value can be used as a safe file name.
func SanitizeFileName(fileName string) string {
ext := filepath.Ext(fileName)
cleanExt := sanitize.BaseName(ext)
if cleanExt == "" {
cleanExt = ".unknown"
}
return strings.Replace(fmt.Sprintf(
"%s.%s",
sanitize.BaseName(fileName[:len(fileName)-len(ext)]),
cleanExt[1:],
), "-", "_", -1)
}
func createFormReader(data map[string]string) io.Reader {
form := url.Values{}
for k, v := range data {
form.Add(k, v)
}
return strings.NewReader(form.Encode())
}
func createMultipartReader(boundary string, data map[string][]byte) io.Reader {
dashBoundary := "--" + boundary
body := []byte{}
buffer := bytes.NewBuffer(body)
buffer.WriteString("Content-type: multipart/form-data; boundary=" + boundary + "\n\n")
for contentType, content := range data {
buffer.WriteString(dashBoundary + "\n")
buffer.WriteString("Content-Disposition: form-data; name=" + contentType + "\n")
buffer.WriteString(fmt.Sprintf("Content-Length: %d \n\n", len(content)))
buffer.Write(content)
buffer.WriteString("\n")
}
buffer.WriteString(dashBoundary + "--\n\n")
return bytes.NewReader(buffer.Bytes())
}
// randomBoundary was borrowed from
// github.com/golang/go/mime/multipart/writer.go#randomBoundary
func randomBoundary() string {
var buf [30]byte
_, err := io.ReadFull(rand.Reader, buf[:])
if err != nil {
panic(err)
}
return fmt.Sprintf("%x", buf[:])
}
func isYesString(s string) bool {
switch strings.ToLower(s) {
case "1", "yes", "true", "y":
return true
}
return false
}
func createJar(s storage.Storage) http.CookieJar {
return &cookieJarSerializer{store: s, lock: &sync.RWMutex{}}
}
func (j *cookieJarSerializer) SetCookies(u *url.URL, cookies []*http.Cookie) {
j.lock.Lock()
defer j.lock.Unlock()
cookieStr := j.store.Cookies(u)
// Merge existing cookies, new cookies have precedence.
cnew := make([]*http.Cookie, len(cookies))
copy(cnew, cookies)
existing := storage.UnstringifyCookies(cookieStr)
for _, c := range existing {
if !storage.ContainsCookie(cnew, c.Name) {
cnew = append(cnew, c)
}
}
j.store.SetCookies(u, storage.StringifyCookies(cnew))
}
func (j *cookieJarSerializer) Cookies(u *url.URL) []*http.Cookie {
cookies := storage.UnstringifyCookies(j.store.Cookies(u))
// Filter.
now := time.Now()
cnew := make([]*http.Cookie, 0, len(cookies))
for _, c := range cookies {
// Drop expired cookies.
if c.RawExpires != "" && c.Expires.Before(now) {
continue
}
// Drop secure cookies if not over https.
if c.Secure && u.Scheme != "https" {
continue
}
cnew = append(cnew, c)
}
return cnew
}
func isMatchingFilter(fs []*regexp.Regexp, d []byte) bool {
for _, r := range fs {
if r.Match(d) {
return true
}
}
return false
}
func normalizeURL(u string) string {
parsed, err := urlParser.Parse(u)
if err != nil {
return u
}
return parsed.String()
}
func requestHash(url string, body io.Reader) uint64 {
h := fnv.New64a()
// reparse the url to fix ambiguities such as
// "http://example.com" vs "http://example.com/"
io.WriteString(h, normalizeURL(url))
if body != nil {
io.Copy(h, body)
}
return h.Sum64()
}
================================================
FILE: colly_test.go
================================================
// Copyright 2018 Adam Tauber
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package colly
import (
"bufio"
"bytes"
"context"
"errors"
"fmt"
"net/http"
"net/http/httptest"
"net/url"
"os"
"reflect"
"regexp"
"strings"
"testing"
"time"
"github.com/PuerkitoBio/goquery"
"github.com/gocolly/colly/v2/debug"
)
var serverIndexResponse = []byte("hello world\n")
var callbackTestHTML = []byte(`
<!DOCTYPE html>
<html>
<head>
<title>Callback Test Page</title>
</head>
<body>
<div id="firstElem">First</div>
<div id="secondElem">Second</div>
<div id="thirdElem">Third</div>
</body>
</html>
`)
var robotsFile = `
User-agent: *
Allow: /allowed
Disallow: /disallowed
Disallow: /allowed*q=
`
func newUnstartedTestServer() *httptest.Server {
mux := http.NewServeMux()
mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(200)
w.Write(serverIndexResponse)
})
mux.HandleFunc("/callback_test", func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "text/html")
w.WriteHeader(200)
w.Write(callbackTestHTML)
})
mux.HandleFunc("/html", func(w http.ResponseWriter, r *http.Request) {
if r.URL.Query().Get("no-content-type") != "" {
w.Header()["Content-Type"] = nil
} else {
w.Header().Set("Content-Type", "text/html")
}
w.Write([]byte(`<!DOCTYPE html>
<html>
<head>
<title>Test Page</title>
</head>
<body>
<h1>Hello World</h1>
<p class="description">This is a test page</p>
<p class="description">This is a test paragraph</p>
</body>
</html>
`))
})
mux.HandleFunc("/xml", func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/xml")
w.Write([]byte(`<?xml version="1.0" encoding="UTF-8"?>
<page>
<title>Test Page</title>
<paragraph type="description">This is a test page</paragraph>
<paragraph type="description">This is a test paragraph</paragraph>
</page>
`))
})
mux.HandleFunc("/login", func(w http.ResponseWriter, r *http.Request) {
if r.Method == "POST" {
w.Header().Set("Content-Type", "text/html")
w.Write([]byte(r.FormValue("name")))
}
})
mux.HandleFunc("/robots.txt", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(200)
w.Write([]byte(robotsFile))
})
mux.HandleFunc("/allowed", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(200)
w.Write([]byte("allowed"))
})
mux.HandleFunc("/disallowed", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(200)
w.Write([]byte("disallowed"))
})
mux.Handle("/redirect", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
destination := "/redirected/"
if d := r.URL.Query().Get("d"); d != "" {
destination = d
}
http.Redirect(w, r, destination, http.StatusSeeOther)
}))
mux.Handle("/redirected/", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
fmt.Fprintf(w, `<a href="test">test</a>`)
}))
mux.HandleFunc("/set_cookie", func(w http.ResponseWriter, r *http.Request) {
c := &http.Cookie{Name: "test", Value: "testv", HttpOnly: false}
http.SetCookie(w, c)
w.WriteHeader(200)
w.Write([]byte("ok"))
})
mux.HandleFunc("/check_cookie", func(w http.ResponseWriter, r *http.Request) {
cs := r.Cookies()
if len(cs) != 1 || r.Cookies()[0].Value != "testv" {
w.WriteHeader(500)
w.Write([]byte("nok"))
return
}
w.WriteHeader(200)
w.Write([]byte("ok"))
})
mux.HandleFunc("/500", func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "text/html")
w.WriteHeader(500)
w.Write([]byte("<p>error</p>"))
})
mux.HandleFunc("/user_agent", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(200)
w.Write([]byte(r.Header.Get("User-Agent")))
})
mux.HandleFunc("/host_header", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(200)
w.Write([]byte(r.Host))
})
mux.HandleFunc("/accept_header", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(200)
w.Write([]byte(r.Header.Get("Accept")))
})
mux.HandleFunc("/custom_header", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(200)
w.Write([]byte(r.Header.Get("Test")))
})
mux.HandleFunc("/base", func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "text/html")
w.Write([]byte(`<!DOCTYPE html>
<html>
<head>
<title>Test Page</title>
<base href="http://xy.com/" />
</head>
<body>
<a href="z">link</a>
</body>
</html>
`))
})
mux.HandleFunc("/base_relative", func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "text/html")
w.Write([]byte(`<!DOCTYPE html>
<html>
<head>
<title>Test Page</title>
<base href="/foobar/" />
</head>
<body>
<a href="z">link</a>
</body>
</html>
`))
})
mux.HandleFunc("/tabs_and_newlines", func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "text/html")
w.Write([]byte(`<!DOCTYPE html>
<html>
<head>
<title>Test Page</title>
<base href="/foo bar/" />
</head>
<body>
<a href="x
y">link</a>
</body>
</html>
`))
})
mux.HandleFunc("/foobar/xy", func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "text/html")
w.Write([]byte(`<!DOCTYPE html>
<html>
<head>
<title>Test Page</title>
</head>
<body>
<p>hello</p>
</body>
</html>
`))
})
mux.HandleFunc("/100%25", func(w http.ResponseWriter, r *http.Request) {
w.Write([]byte("100 percent"))
})
mux.HandleFunc("/large_binary", func(w http.ResponseWriter, r *http.Request) {
w.Header().Set("Content-Type", "application/octet-stream")
ww := bufio.NewWriter(w)
defer ww.Flush()
for {
// have to check error to detect client aborting download
if _, err := ww.Write([]byte{0x41}); err != nil {
return
}
}
})
mux.HandleFunc("/slow", func(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(200)
ticker := time.NewTicker(100 * time.Millisecond)
defer ticker.Stop()
i := 0
for {
select {
case <-r.Context().Done():
return
case t := <-ticker.C:
fmt.Fprintf(w, "%s\n", t)
if flusher, ok := w.(http.Flusher); ok {
flusher.Flush()
}
i++
if i == 10 {
return
}
}
}
})
mux.HandleFunc("/sitemap.xml.gz", func(w http.ResponseWriter, r *http.Request) {
// Return a 404 HTML page for a non-existent .xml.gz URL.
// This simulates the scenario in issue #745 where a server
// returns an HTML error page for a missing gzipped sitemap.
w.Header().Set("Content-Type", "text/html")
w.WriteHeader(404)
w.Write([]byte(`<!DOCTYPE html><html><body><h1>404 Not Found</h1></body></html>`))
})
return httptest.NewUnstartedServer(mux)
}
func newTestServer() *httptest.Server {
srv := newUnstartedTestServer()
srv.Start()
return srv
}
var newCollectorTests = map[string]func(*testing.T){
"UserAgent": func(t *testing.T) {
for _, ua := range []string{
"foo",
"bar",
} {
c := NewCollector(UserAgent(ua))
if got, want := c.UserAgent, ua; got != want {
t.Fatalf("c.UserAgent = %q, want %q", got, want)
}
}
},
"MaxDepth": func(t *testing.T) {
for _, depth := range []int{
12,
34,
0,
} {
c := NewCollector(MaxDepth(depth))
if got, want := c.MaxDepth, depth; got != want {
t.Fatalf("c.MaxDepth = %d, want %d", got, want)
}
}
},
"AllowedDomains": func(t *testing.T) {
for _, domains := range [][]string{
{"example.com", "example.net"},
{"example.net"},
{},
nil,
} {
c := NewCollector(AllowedDomains(domains...))
if got, want := c.AllowedDomains, domains; !reflect.DeepEqual(got, want) {
t.Fatalf("c.AllowedDomains = %q, want %q", got, want)
}
}
},
"DisallowedDomains": func(t *testing.T) {
for _, domains := range [][]string{
{"example.com", "example.net"},
{"example.net"},
{},
nil,
} {
c := NewCollector(DisallowedDomains(domains...))
if got, want := c.DisallowedDomains, domains; !reflect.DeepEqual(got, want) {
t.Fatalf("c.DisallowedDomains = %q, want %q", got, want)
}
}
},
"DisallowedURLFilters": func(t *testing.T) {
for _, filters := range [][]*regexp.Regexp{
{regexp.MustCompile(`.*not_allowed.*`)},
} {
c := NewCollector(DisallowedURLFilters(filters...))
if got, want := c.DisallowedURLFilters, filters; !reflect.DeepEqual(got, want) {
t.Fatalf("c.DisallowedURLFilters = %v, want %v", got, want)
}
}
},
"URLFilters": func(t *testing.T) {
for _, filters := range [][]*regexp.Regexp{
{regexp.MustCompile(`\w+`)},
{regexp.MustCompile(`\d+`)},
{},
nil,
} {
c := NewCollector(URLFilters(filters...))
if got, want := c.URLFilters, filters; !reflect.DeepEqual(got, want) {
t.Fatalf("c.URLFilters = %v, want %v", got, want)
}
}
},
"AllowURLRevisit": func(t *testing.T) {
c := NewCollector(AllowURLRevisit())
if !c.AllowURLRevisit {
t.Fatal("c.AllowURLRevisit = false, want true")
}
},
"MaxBodySize": func(t *testing.T) {
for _, sizeInBytes := range []int{
1024 * 1024,
1024,
0,
} {
c := NewCollector(MaxBodySize(sizeInBytes))
if got, want := c.MaxBodySize, sizeInBytes; got != want {
t.Fatalf("c.MaxBodySize = %d, want %d", got, want)
}
}
},
"CacheDir": func(t *testing.T) {
for _, path := range []string{
"/tmp/",
"/var/cache/",
} {
c := NewCollector(CacheDir(path))
if got, want := c.CacheDir, path; got != want {
t.Fatalf("c.CacheDir = %q, want %q", got, want)
}
}
},
"CacheExpiration": func(t *testing.T) {
for _, d := range []time.Duration{
5 * time.Second,
10 * time.Minute,
0,
} {
c := NewCollector(CacheExpiration(d))
if got, want := c.CacheExpiration, d; got != want {
t.Fatalf("c.CacheExpiration = %v, want %v", got, want)
}
}
},
"IgnoreRobotsTxt": func(t *testing.T) {
c := NewCollector(IgnoreRobotsTxt())
if !c.IgnoreRobotsTxt {
t.Fatal("c.IgnoreRobotsTxt = false, want true")
}
},
"ID": func(t *testing.T) {
for _, id := range []uint32{
0,
1,
2,
} {
c := NewCollector(ID(id))
if got, want := c.ID, id; got != want {
t.Fatalf("c.ID = %d, want %d", got, want)
}
}
},
"DetectCharset": func(t *testing.T) {
c := NewCollector(DetectCharset())
if !c.DetectCharset {
t.Fatal("c.DetectCharset = false, want true")
}
},
"Debugger": func(t *testing.T) {
d := &debug.LogDebugger{}
c := NewCollector(Debugger(d))
if got, want := c.debugger, d; got != want {
t.Fatalf("c.debugger = %v, want %v", got, want)
}
},
"CheckHead": func(t *testing.T) {
c := NewCollector(CheckHead())
if !c.CheckHead {
t.Fatal("c.CheckHead = false, want true")
}
},
"Async": func(t *testing.T) {
c := NewCollector(Async())
if !c.Async {
t.Fatal("c.Async = false, want true")
}
},
}
func TestNoAcceptHeader(t *testing.T) {
ts := newTestServer()
defer ts.Close()
var receivedHeader string
// checks if Accept is enabled by default
func() {
c := NewCollector()
c.OnResponse(func(resp *Response) {
receivedHeader = string(resp.Body)
})
c.Visit(ts.URL + "/accept_header")
if receivedHeader != "*/*" {
t.Errorf("default Accept header isn't */*. got: %v", receivedHeader)
}
}()
// checks if Accept can be disabled
func() {
c := NewCollector()
c.OnRequest(func(r *Request) {
r.Headers.Del("Accept")
})
c.OnResponse(func(resp *Response) {
receivedHeader = string(resp.Body)
})
c.Visit(ts.URL + "/accept_header")
if receivedHeader != "" {
t.Errorf("failed to pass request with no Accept header. got: %v", receivedHeader)
}
}()
}
func TestNewCollector(t *testing.T) {
t.Run("Functional Options", func(t *testing.T) {
for name, test := range newCollectorTests {
t.Run(name, test)
}
})
}
func TestCollectorVisit(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
onRequestCalled := false
onResponseCalled := false
onScrapedCalled := false
c.OnRequest(func(r *Request) {
onRequestCalled = true
r.Ctx.Put("x", "y")
})
c.OnResponse(func(r *Response) {
onResponseCalled = true
if r.Ctx.Get("x") != "y" {
t.Error("Failed to retrieve context value for key 'x'")
}
if !bytes.Equal(r.Body, serverIndexResponse) {
t.Error("Response body does not match with the original content")
}
})
c.OnScraped(func(r *Response) {
if !onResponseCalled {
t.Error("OnScraped called before OnResponse")
}
if !onRequestCalled {
t.Error("OnScraped called before OnRequest")
}
onScrapedCalled = true
})
c.Visit(ts.URL)
if !onRequestCalled {
t.Error("Failed to call OnRequest callback")
}
if !onResponseCalled {
t.Error("Failed to call OnResponse callback")
}
if !onScrapedCalled {
t.Error("Failed to call OnScraped callback")
}
}
func TestCollectorVisitWithAllowedDomains(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector(AllowedDomains("localhost", "127.0.0.1", "::1"))
err := c.Visit(ts.URL)
if err != nil {
t.Errorf("Failed to visit url %s", ts.URL)
}
err = c.Visit("http://example.com")
if err != ErrForbiddenDomain {
t.Errorf("c.Visit should return ErrForbiddenDomain, but got %v", err)
}
}
func TestCollectorVisitWithDisallowedDomains(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector(DisallowedDomains("localhost", "127.0.0.1", "::1"))
err := c.Visit(ts.URL)
if err != ErrForbiddenDomain {
t.Errorf("c.Visit should return ErrForbiddenDomain, but got %v", err)
}
c2 := NewCollector(DisallowedDomains("example.com"))
err = c2.Visit("http://example.com:8080")
if err != ErrForbiddenDomain {
t.Errorf("c.Visit should return ErrForbiddenDomain, but got %v", err)
}
err = c2.Visit(ts.URL)
if err != nil {
t.Errorf("Failed to visit url %s", ts.URL)
}
}
func TestCollectorVisitResponseHeaders(t *testing.T) {
ts := newTestServer()
defer ts.Close()
var onResponseHeadersCalled bool
c := NewCollector()
c.OnResponseHeaders(func(r *Response) {
onResponseHeadersCalled = true
if r.Headers.Get("Content-Type") == "application/octet-stream" {
r.Request.Abort()
}
})
c.OnResponse(func(r *Response) {
t.Error("OnResponse was called")
})
c.Visit(ts.URL + "/large_binary")
if !onResponseHeadersCalled {
t.Error("OnResponseHeaders was not called")
}
}
func TestCollectorOnHTML(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
titleCallbackCalled := false
paragraphCallbackCount := 0
c.OnHTML("title", func(e *HTMLElement) {
titleCallbackCalled = true
if e.Text != "Test Page" {
t.Error("Title element text does not match, got", e.Text)
}
})
c.OnHTML("p", func(e *HTMLElement) {
paragraphCallbackCount++
if e.Attr("class") != "description" {
t.Error("Failed to get paragraph's class attribute")
}
})
c.OnHTML("body", func(e *HTMLElement) {
if e.ChildAttr("p", "class") != "description" {
t.Error("Invalid class value")
}
classes := e.ChildAttrs("p", "class")
if len(classes) != 2 {
t.Error("Invalid class values")
}
})
c.Visit(ts.URL + "/html")
if !titleCallbackCalled {
t.Error("Failed to call OnHTML callback for <title> tag")
}
if paragraphCallbackCount != 2 {
t.Error("Failed to find all <p> tags")
}
}
func TestCollectorContentSniffing(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
htmlCallbackCalled := false
c.OnResponse(func(r *Response) {
if (*r.Headers)["Content-Type"] != nil {
t.Error("Content-Type unexpectedly not nil")
}
})
c.OnHTML("html", func(e *HTMLElement) {
htmlCallbackCalled = true
})
err := c.Visit(ts.URL + "/html?no-content-type=yes")
if err != nil {
t.Fatal(err)
}
if !htmlCallbackCalled {
t.Error("OnHTML was not called")
}
}
func TestCollectorURLRevisit(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
visitCount := 0
c.OnRequest(func(r *Request) {
visitCount++
})
c.Visit(ts.URL)
c.Visit(ts.URL)
if visitCount != 1 {
t.Error("URL revisited")
}
c.AllowURLRevisit = true
c.Visit(ts.URL)
c.Visit(ts.URL)
if visitCount != 3 {
t.Error("URL not revisited")
}
}
func TestCollectorPostRevisit(t *testing.T) {
ts := newTestServer()
defer ts.Close()
postValue := "hello"
postData := map[string]string{
"name": postValue,
}
visitCount := 0
c := NewCollector()
c.OnResponse(func(r *Response) {
if postValue != string(r.Body) {
t.Error("Failed to send data with POST")
}
visitCount++
})
c.Post(ts.URL+"/login", postData)
c.Post(ts.URL+"/login", postData)
c.Post(ts.URL+"/login", map[string]string{
"name": postValue,
"lastname": "world",
})
if visitCount != 2 {
t.Error("URL POST revisited")
}
c.AllowURLRevisit = true
c.Post(ts.URL+"/login", postData)
c.Post(ts.URL+"/login", postData)
if visitCount != 4 {
t.Error("URL POST not revisited")
}
}
func TestCollectorURLRevisitCheck(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
visited, err := c.HasVisited(ts.URL)
if err != nil {
t.Error(err.Error())
}
if visited != false {
t.Error("Expected URL to NOT have been visited")
}
c.Visit(ts.URL)
visited, err = c.HasVisited(ts.URL)
if err != nil {
t.Error(err.Error())
}
if visited != true {
t.Error("Expected URL to have been visited")
}
errorTestCases := []struct {
Path string
DestinationError string
}{
{"/", "/"},
{"/redirect?d=/", "/"},
// now that /redirect?d=/ itself is recorded as visited,
// it's now returned in error
{"/redirect?d=/", "/redirect?d=/"},
{"/redirect?d=/redirect%3Fd%3D/", "/redirect?d=/"},
{"/redirect?d=/redirect%3Fd%3D/", "/redirect?d=/redirect%3Fd%3D/"},
{"/redirect?d=/redirect%3Fd%3D/&foo=bar", "/redirect?d=/"},
}
for i, testCase := range errorTestCases {
err := c.Visit(ts.URL + testCase.Path)
if testCase.DestinationError == "" {
if err != nil {
t.Errorf("got unexpected error in test %d: %q", i, err)
}
} else {
var ave *AlreadyVisitedError
if !errors.As(err, &ave) {
t.Errorf("err=%q returned when trying to revisit, expected AlreadyVisitedError", err)
} else {
if got, want := ave.Destination.String(), ts.URL+testCase.DestinationError; got != want {
t.Errorf("wrong destination in AlreadyVisitedError in test %d, got=%q want=%q", i, got, want)
}
}
}
}
}
func TestSetCookieRedirect(t *testing.T) {
type middleware = func(http.Handler) http.Handler
for _, m := range []middleware{
requireSessionCookieSimple,
requireSessionCookieAuthPage,
} {
t.Run("", func(t *testing.T) {
ts := newUnstartedTestServer()
ts.Config.Handler = m(ts.Config.Handler)
ts.Start()
defer ts.Close()
c := NewCollector()
c.OnResponse(func(r *Response) {
if got, want := r.Body, serverIndexResponse; !bytes.Equal(got, want) {
t.Errorf("bad response body got=%q want=%q", got, want)
}
if got, want := r.StatusCode, http.StatusOK; got != want {
t.Errorf("bad response code got=%d want=%d", got, want)
}
})
if err := c.Visit(ts.URL); err != nil {
t.Fatal(err)
}
})
}
}
func TestCollectorPostURLRevisitCheck(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
postValue := "hello"
postData := map[string]string{
"name": postValue,
}
posted, err := c.HasPosted(ts.URL+"/login", postData)
if err != nil {
t.Error(err.Error())
}
if posted != false {
t.Error("Expected URL to NOT have been visited")
}
c.Post(ts.URL+"/login", postData)
posted, err = c.HasPosted(ts.URL+"/login", postData)
if err != nil {
t.Error(err.Error())
}
if posted != true {
t.Error("Expected URL to have been visited")
}
postData["lastname"] = "world"
posted, err = c.HasPosted(ts.URL+"/login", postData)
if err != nil {
t.Error(err.Error())
}
if posted != false {
t.Error("Expected URL to NOT have been visited")
}
c.Post(ts.URL+"/login", postData)
posted, err = c.HasPosted(ts.URL+"/login", postData)
if err != nil {
t.Error(err.Error())
}
if posted != true {
t.Error("Expected URL to have been visited")
}
}
// TestCollectorURLRevisitDomainDisallowed ensures that disallowed URL is not considered visited.
func TestCollectorURLRevisitDomainDisallowed(t *testing.T) {
ts := newTestServer()
defer ts.Close()
parsedURL, err := url.Parse(ts.URL)
if err != nil {
t.Fatal(err)
}
c := NewCollector(DisallowedDomains(parsedURL.Hostname()))
err = c.Visit(ts.URL)
if got, want := err, ErrForbiddenDomain; got != want {
t.Fatalf("wrong error on first visit: got=%v want=%v", got, want)
}
err = c.Visit(ts.URL)
if got, want := err, ErrForbiddenDomain; got != want {
t.Fatalf("wrong error on second visit: got=%v want=%v", got, want)
}
}
func TestCollectorPost(t *testing.T) {
ts := newTestServer()
defer ts.Close()
postValue := "hello"
c := NewCollector()
c.OnResponse(func(r *Response) {
if postValue != string(r.Body) {
t.Error("Failed to send data with POST")
}
})
c.Post(ts.URL+"/login", map[string]string{
"name": postValue,
})
}
func TestCollectorPostRaw(t *testing.T) {
ts := newTestServer()
defer ts.Close()
postValue := "hello"
c := NewCollector()
c.OnResponse(func(r *Response) {
if postValue != string(r.Body) {
t.Error("Failed to send data with POST")
}
})
c.PostRaw(ts.URL+"/login", []byte("name="+postValue))
}
func TestCollectorPostRawRevisit(t *testing.T) {
ts := newTestServer()
defer ts.Close()
postValue := "hello"
postData := "name=" + postValue
visitCount := 0
c := NewCollector()
c.OnResponse(func(r *Response) {
if postValue != string(r.Body) {
t.Error("Failed to send data with POST RAW")
}
visitCount++
})
c.PostRaw(ts.URL+"/login", []byte(postData))
c.PostRaw(ts.URL+"/login", []byte(postData))
c.PostRaw(ts.URL+"/login", []byte(postData+"&lastname=world"))
if visitCount != 2 {
t.Error("URL POST RAW revisited")
}
c.AllowURLRevisit = true
c.PostRaw(ts.URL+"/login", []byte(postData))
c.PostRaw(ts.URL+"/login", []byte(postData))
if visitCount != 4 {
t.Error("URL POST RAW not revisited")
}
}
func TestRedirect(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
c.OnHTML("a[href]", func(e *HTMLElement) {
u := e.Request.AbsoluteURL(e.Attr("href"))
if !strings.HasSuffix(u, "/redirected/test") {
t.Error("Invalid URL after redirect: " + u)
}
})
c.OnResponseHeaders(func(r *Response) {
if !strings.HasSuffix(r.Request.URL.String(), "/redirected/") {
t.Error("Invalid URL in Request after redirect (OnResponseHeaders): " + r.Request.URL.String())
}
})
c.OnResponse(func(r *Response) {
if !strings.HasSuffix(r.Request.URL.String(), "/redirected/") {
t.Error("Invalid URL in Request after redirect (OnResponse): " + r.Request.URL.String())
}
})
c.Visit(ts.URL + "/redirect")
}
func TestIssue594(t *testing.T) {
// This is a regression test for a data race bug. There's no
// assertions because it's meant to be used with race detector
ts := newTestServer()
defer ts.Close()
c := NewCollector()
// if timeout is set, this bug is not triggered
c.SetClient(&http.Client{Timeout: 0 * time.Second})
c.Visit(ts.URL)
}
func TestRedirectWithDisallowedURLs(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
c.DisallowedURLFilters = []*regexp.Regexp{regexp.MustCompile(ts.URL + "/redirected/test")}
c.OnHTML("a[href]", func(e *HTMLElement) {
u := e.Request.AbsoluteURL(e.Attr("href"))
err := c.Visit(u)
if !errors.Is(err, ErrForbiddenURL) {
t.Error("URL should have been forbidden: " + u)
}
})
c.Visit(ts.URL + "/redirect")
}
func TestBaseTag(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
c.OnHTML("a[href]", func(e *HTMLElement) {
u := e.Request.AbsoluteURL(e.Attr("href"))
if u != "http://xy.com/z" {
t.Error("Invalid <base /> tag handling in OnHTML: expected https://xy.com/z, got " + u)
}
})
c.Visit(ts.URL + "/base")
c2 := NewCollector()
c2.OnXML("//a", func(e *XMLElement) {
u := e.Request.AbsoluteURL(e.Attr("href"))
if u != "http://xy.com/z" {
t.Error("Invalid <base /> tag handling in OnXML: expected https://xy.com/z, got " + u)
}
})
c2.Visit(ts.URL + "/base")
}
func TestBaseTagRelative(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
c.OnHTML("a[href]", func(e *HTMLElement) {
u := e.Request.AbsoluteURL(e.Attr("href"))
expected := ts.URL + "/foobar/z"
if u != expected {
t.Errorf("Invalid <base /> tag handling in OnHTML: expected %q, got %q", expected, u)
}
})
c.Visit(ts.URL + "/base_relative")
c2 := NewCollector()
c2.OnXML("//a", func(e *XMLElement) {
u := e.Request.AbsoluteURL(e.Attr("href"))
expected := ts.URL + "/foobar/z"
if u != expected {
t.Errorf("Invalid <base /> tag handling in OnXML: expected %q, got %q", expected, u)
}
})
c2.Visit(ts.URL + "/base_relative")
}
func TestTabsAndNewlines(t *testing.T) {
// this test might look odd, but see step 3 of
// https://url.spec.whatwg.org/#concept-basic-url-parser
ts := newTestServer()
defer ts.Close()
visited := map[string]struct{}{}
expected := map[string]struct{}{
"/tabs_and_newlines": {},
"/foobar/xy": {},
}
c := NewCollector()
c.OnResponse(func(res *Response) {
visited[res.Request.URL.EscapedPath()] = struct{}{}
})
c.OnHTML("a[href]", func(e *HTMLElement) {
if err := e.Request.Visit(e.Attr("href")); err != nil {
t.Errorf("visit failed: %v", err)
}
})
if err := c.Visit(ts.URL + "/tabs_and_newlines"); err != nil {
t.Errorf("visit failed: %v", err)
}
if !reflect.DeepEqual(visited, expected) {
t.Errorf("visited=%v expected=%v", visited, expected)
}
}
func TestLonePercent(t *testing.T) {
ts := newTestServer()
defer ts.Close()
var visitedPath string
c := NewCollector()
c.OnResponse(func(res *Response) {
visitedPath = res.Request.URL.RequestURI()
})
if err := c.Visit(ts.URL + "/100%"); err != nil {
t.Errorf("visit failed: %v", err)
}
// Automatic encoding is not really correct: browsers
// would send bare percent here. However, Go net/http
// cannot send such requests due to
// https://github.com/golang/go/issues/29808. So we have two
// alternatives really: return an error when attempting
// to fetch such URLs, or at least try the encoded variant.
// This test checks that the latter is attempted.
if got, want := visitedPath, "/100%25"; got != want {
t.Errorf("got=%q want=%q", got, want)
}
// invalid URL escape in query component is not a problem,
// but check it anyway
if err := c.Visit(ts.URL + "/?a=100%zz"); err != nil {
t.Errorf("visit failed: %v", err)
}
if got, want := visitedPath, "/?a=100%zz"; got != want {
t.Errorf("got=%q want=%q", got, want)
}
}
func TestCollectorCookies(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
if err := c.Visit(ts.URL + "/set_cookie"); err != nil {
t.Fatal(err)
}
if err := c.Visit(ts.URL + "/check_cookie"); err != nil {
t.Fatalf("Failed to use previously set cookies: %s", err)
}
}
func TestRobotsWhenAllowed(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
c.IgnoreRobotsTxt = false
c.OnResponse(func(resp *Response) {
if resp.StatusCode != 200 {
t.Fatalf("Wrong response code: %d", resp.StatusCode)
}
})
err := c.Visit(ts.URL + "/allowed")
if err != nil {
t.Fatal(err)
}
}
func TestRobotsWhenDisallowed(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
c.IgnoreRobotsTxt = false
c.OnResponse(func(resp *Response) {
t.Fatalf("Received response: %d", resp.StatusCode)
})
err := c.Visit(ts.URL + "/disallowed")
if err.Error() != "URL blocked by robots.txt" {
t.Fatalf("wrong error message: %v", err)
}
}
func TestRobotsWhenDisallowedWithQueryParameter(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
c.IgnoreRobotsTxt = false
c.OnResponse(func(resp *Response) {
t.Fatalf("Received response: %d", resp.StatusCode)
})
err := c.Visit(ts.URL + "/allowed?q=1")
if err.Error() != "URL blocked by robots.txt" {
t.Fatalf("wrong error message: %v", err)
}
}
func TestIgnoreRobotsWhenDisallowed(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
c.IgnoreRobotsTxt = true
c.OnResponse(func(resp *Response) {
if resp.StatusCode != 200 {
t.Fatalf("Wrong response code: %d", resp.StatusCode)
}
})
err := c.Visit(ts.URL + "/disallowed")
if err != nil {
t.Fatal(err)
}
}
func TestConnectionErrorOnRobotsTxtResultsInError(t *testing.T) {
ts := newTestServer()
ts.Close() // immediately close the server to force a connection error
c := NewCollector()
c.IgnoreRobotsTxt = false
err := c.Visit(ts.URL)
if err == nil {
t.Fatal("Error expected")
}
}
func TestEnvSettings(t *testing.T) {
ts := newTestServer()
defer ts.Close()
os.Setenv("COLLY_USER_AGENT", "test")
defer os.Unsetenv("COLLY_USER_AGENT")
c := NewCollector()
valid := false
c.OnResponse(func(resp *Response) {
if string(resp.Body) == "test" {
valid = true
}
})
c.Visit(ts.URL + "/user_agent")
if !valid {
t.Fatalf("Wrong user-agent from environment")
}
}
func TestUserAgent(t *testing.T) {
const exampleUserAgent1 = "Example/1.0"
const exampleUserAgent2 = "Example/2.0"
const defaultUserAgent = "colly - https://github.com/gocolly/colly"
ts := newTestServer()
defer ts.Close()
var receivedUserAgent string
func() {
c := NewCollector()
c.OnResponse(func(resp *Response) {
receivedUserAgent = string(resp.Body)
})
c.Visit(ts.URL + "/user_agent")
if got, want := receivedUserAgent, defaultUserAgent; got != want {
t.Errorf("mismatched User-Agent: got=%q want=%q", got, want)
}
}()
func() {
c := NewCollector(UserAgent(exampleUserAgent1))
c.OnResponse(func(resp *Response) {
receivedUserAgent = string(resp.Body)
})
c.Visit(ts.URL + "/user_agent")
if got, want := receivedUserAgent, exampleUserAgent1; got != want {
t.Errorf("mismatched User-Agent: got=%q want=%q", got, want)
}
}()
func() {
c := NewCollector(UserAgent(exampleUserAgent1))
c.OnResponse(func(resp *Response) {
receivedUserAgent = string(resp.Body)
})
c.Request("GET", ts.URL+"/user_agent", nil, nil, nil)
if got, want := receivedUserAgent, exampleUserAgent1; got != want {
t.Errorf("mismatched User-Agent (nil hdr): got=%q want=%q", got, want)
}
}()
func() {
c := NewCollector(UserAgent(exampleUserAgent1))
c.OnResponse(func(resp *Response) {
receivedUserAgent = string(resp.Body)
})
c.Request("GET", ts.URL+"/user_agent", nil, nil, http.Header{})
if got, want := receivedUserAgent, exampleUserAgent1; got != want {
t.Errorf("mismatched User-Agent (non-nil hdr): got=%q want=%q", got, want)
}
}()
func() {
c := NewCollector(UserAgent(exampleUserAgent1))
c.OnResponse(func(resp *Response) {
receivedUserAgent = string(resp.Body)
})
hdr := http.Header{}
hdr.Set("User-Agent", "")
c.Request("GET", ts.URL+"/user_agent", nil, nil, hdr)
if got, want := receivedUserAgent, ""; got != want {
t.Errorf("mismatched User-Agent (hdr with empty UA): got=%q want=%q", got, want)
}
}()
func() {
c := NewCollector(UserAgent(exampleUserAgent1))
c.OnResponse(func(resp *Response) {
receivedUserAgent = string(resp.Body)
})
hdr := http.Header{}
hdr.Set("User-Agent", exampleUserAgent2)
c.Request("GET", ts.URL+"/user_agent", nil, nil, hdr)
if got, want := receivedUserAgent, exampleUserAgent2; got != want {
t.Errorf("mismatched User-Agent (hdr with UA): got=%q want=%q", got, want)
}
}()
}
func TestHeaders(t *testing.T) {
const exampleHostHeader = "example.com"
const exampleTestHeader = "Testing"
ts := newTestServer()
defer ts.Close()
var receivedHeader string
func() {
c := NewCollector(
Headers(map[string]string{"Host": exampleHostHeader}),
)
c.OnResponse(func(resp *Response) {
receivedHeader = string(resp.Body)
})
c.Visit(ts.URL + "/host_header")
if got, want := receivedHeader, exampleHostHeader; got != want {
t.Errorf("mismatched Host header: got=%q want=%q", got, want)
}
}()
func() {
c := NewCollector(
Headers(map[string]string{"Test": exampleTestHeader}),
)
c.OnResponse(func(resp *Response) {
receivedHeader = string(resp.Body)
})
c.Visit(ts.URL + "/custom_header")
if got, want := receivedHeader, exampleTestHeader; got != want {
t.Errorf("mismatched custom header: got=%q want=%q", got, want)
}
}()
}
func TestParseHTTPErrorResponse(t *testing.T) {
contentCount := 0
ts := newTestServer()
defer ts.Close()
c := NewCollector(
AllowURLRevisit(),
)
c.OnHTML("p", func(e *HTMLElement) {
if e.Text == "error" {
contentCount++
}
})
c.Visit(ts.URL + "/500")
if contentCount != 0 {
t.Fatal("Content is parsed without ParseHTTPErrorResponse enabled")
}
c.ParseHTTPErrorResponse = true
c.Visit(ts.URL + "/500")
if contentCount != 1 {
t.Fatal("Content isn't parsed with ParseHTTPErrorResponse enabled")
}
}
func TestHTMLElement(t *testing.T) {
ctx := &Context{}
resp := &Response{
Request: &Request{
Ctx: ctx,
},
Ctx: ctx,
}
in := `<a href="http://go-colly.org">Colly</a>`
sel := "a[href]"
doc, err := goquery.NewDocumentFromReader(bytes.NewBuffer([]byte(in)))
if err != nil {
t.Fatal(err)
}
elements := []*HTMLElement{}
i := 0
doc.Find(sel).Each(func(_ int, s *goquery.Selection) {
for _, n := range s.Nodes {
elements = append(elements, NewHTMLElementFromSelectionNode(resp, s, n, i))
i++
}
})
elementsLen := len(elements)
if elementsLen != 1 {
t.Errorf("element length mismatch. got %d, expected %d.\n", elementsLen, 1)
}
v := elements[0]
if v.Name != "a" {
t.Errorf("element tag mismatch. got %s, expected %s.\n", v.Name, "a")
}
if v.Text != "Colly" {
t.Errorf("element content mismatch. got %s, expected %s.\n", v.Text, "Colly")
}
if v.Attr("href") != "http://go-colly.org" {
t.Errorf("element href mismatch. got %s, expected %s.\n", v.Attr("href"), "http://go-colly.org")
}
}
func TestCollectorOnXMLWithHtml(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
titleCallbackCalled := false
paragraphCallbackCount := 0
c.OnXML("/html/head/title", func(e *XMLElement) {
titleCallbackCalled = true
if e.Text != "Test Page" {
t.Error("Title element text does not match, got", e.Text)
}
})
c.OnXML("/html/body/p", func(e *XMLElement) {
paragraphCallbackCount++
if e.Attr("class") != "description" {
t.Error("Failed to get paragraph's class attribute")
}
})
c.OnXML("/html/body", func(e *XMLElement) {
if e.ChildAttr("p", "class") != "description" {
t.Error("Invalid class value")
}
classes := e.ChildAttrs("p", "class")
if len(classes) != 2 {
t.Error("Invalid class values")
}
})
c.Visit(ts.URL + "/html")
if !titleCallbackCalled {
t.Error("Failed to call OnXML callback for <title> tag")
}
if paragraphCallbackCount != 2 {
t.Error("Failed to find all <p> tags")
}
}
func TestCollectorOnXMLWithXML(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
titleCallbackCalled := false
paragraphCallbackCount := 0
c.OnXML("//page/title", func(e *XMLElement) {
titleCallbackCalled = true
if e.Text != "Test Page" {
t.Error("Title element text does not match, got", e.Text)
}
})
c.OnXML("//page/paragraph", func(e *XMLElement) {
paragraphCallbackCount++
if e.Attr("type") != "description" {
t.Error("Failed to get paragraph's type attribute")
}
})
c.OnXML("/page", func(e *XMLElement) {
if e.ChildAttr("paragraph", "type") != "description" {
t.Error("Invalid type value")
}
classes := e.ChildAttrs("paragraph", "type")
if len(classes) != 2 {
t.Error("Invalid type values")
}
})
c.Visit(ts.URL + "/xml")
if !titleCallbackCalled {
t.Error("Failed to call OnXML callback for <title> tag")
}
if paragraphCallbackCount != 2 {
t.Error("Failed to find all <paragraph> tags")
}
}
func TestCollectorVisitWithTrace(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector(AllowedDomains("localhost", "127.0.0.1", "::1"), TraceHTTP())
c.OnResponse(func(resp *Response) {
if resp.Trace == nil {
t.Error("Failed to initialize trace")
}
})
err := c.Visit(ts.URL)
if err != nil {
t.Errorf("Failed to visit url %s", ts.URL)
}
}
func TestCollectorVisitWithCheckHead(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector(CheckHead())
var requestMethodChain []string
c.OnResponse(func(resp *Response) {
requestMethodChain = append(requestMethodChain, resp.Request.Method)
})
err := c.Visit(ts.URL)
if err != nil {
t.Errorf("Failed to visit url %s", ts.URL)
}
if requestMethodChain[0] != "HEAD" && requestMethodChain[1] != "GET" {
t.Errorf("Failed to perform a HEAD request before GET")
}
}
func TestCollectorDepth(t *testing.T) {
ts := newTestServer()
defer ts.Close()
maxDepth := 2
c1 := NewCollector(
MaxDepth(maxDepth),
AllowURLRevisit(),
)
requestCount := 0
c1.OnResponse(func(resp *Response) {
requestCount++
if requestCount >= 10 {
return
}
c1.Visit(ts.URL)
})
c1.Visit(ts.URL)
if requestCount < 10 {
t.Errorf("Invalid number of requests: %d (expected 10) without using MaxDepth", requestCount)
}
c2 := c1.Clone()
requestCount = 0
c2.OnResponse(func(resp *Response) {
requestCount++
resp.Request.Visit(ts.URL)
})
c2.Visit(ts.URL)
if requestCount != 2 {
t.Errorf("Invalid number of requests: %d (expected 2) with using MaxDepth 2", requestCount)
}
c1.Visit(ts.URL)
if requestCount < 10 {
t.Errorf("Invalid number of requests: %d (expected 10) without using MaxDepth again", requestCount)
}
requestCount = 0
c2.Visit(ts.URL)
if requestCount != 2 {
t.Errorf("Invalid number of requests: %d (expected 2) with using MaxDepth 2 again", requestCount)
}
}
func TestCollectorRequests(t *testing.T) {
ts := newTestServer()
defer ts.Close()
maxRequests := uint32(5)
c1 := NewCollector(
MaxRequests(maxRequests),
AllowURLRevisit(),
)
requestCount := 0
c1.OnResponse(func(resp *Response) {
requestCount++
c1.Visit(ts.URL)
})
c1.Visit(ts.URL)
if requestCount != 5 {
t.Errorf("Invalid number of requests: %d (expected 5) with MaxRequests", requestCount)
}
}
func TestCollectorContext(t *testing.T) {
// "/slow" takes 1 second to return the response.
// If context does abort the transfer after 0.5 seconds as it should,
// OnError will be called, and the test is passed. Otherwise, test is failed.
ts := newTestServer()
defer ts.Close()
ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
defer cancel()
c := NewCollector(StdlibContext(ctx))
onErrorCalled := false
c.OnResponse(func(resp *Response) {
t.Error("OnResponse was called, expected OnError")
})
c.OnError(func(resp *Response, err error) {
onErrorCalled = true
if err != context.DeadlineExceeded {
t.Errorf("OnError got err=%#v, expected context.DeadlineExceeded", err)
}
})
err := c.Visit(ts.URL + "/slow")
if err != context.DeadlineExceeded {
t.Errorf("Visit return err=%#v, expected context.DeadlineExceeded", err)
}
if !onErrorCalled {
t.Error("OnError was not called")
}
}
func BenchmarkOnHTML(b *testing.B) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
c.OnHTML("p", func(_ *HTMLElement) {})
for n := 0; n < b.N; n++ {
c.Visit(fmt.Sprintf("%s/html?q=%d", ts.URL, n))
}
}
func BenchmarkOnXML(b *testing.B) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
c.OnXML("//p", func(_ *XMLElement) {})
for n := 0; n < b.N; n++ {
c.Visit(fmt.Sprintf("%s/html?q=%d", ts.URL, n))
}
}
func BenchmarkOnResponse(b *testing.B) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
c.AllowURLRevisit = true
c.OnResponse(func(_ *Response) {})
for n := 0; n < b.N; n++ {
c.Visit(ts.URL)
}
}
func requireSessionCookieSimple(handler http.Handler) http.Handler {
const cookieName = "session_id"
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if _, err := r.Cookie(cookieName); err == http.ErrNoCookie {
http.SetCookie(w, &http.Cookie{Name: cookieName, Value: "1"})
http.Redirect(w, r, r.RequestURI, http.StatusFound)
return
}
handler.ServeHTTP(w, r)
})
}
func requireSessionCookieAuthPage(handler http.Handler) http.Handler {
const setCookiePath = "/auth"
const cookieName = "session_id"
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if r.URL.Path == setCookiePath {
destination := r.URL.Query().Get("return")
http.Redirect(w, r, destination, http.StatusFound)
return
}
if _, err := r.Cookie(cookieName); err == http.ErrNoCookie {
http.SetCookie(w, &http.Cookie{Name: cookieName, Value: "1"})
http.Redirect(w, r, setCookiePath+"?return="+url.QueryEscape(r.RequestURI), http.StatusFound)
return
}
handler.ServeHTTP(w, r)
})
}
func TestCallbackDetachment(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
c.AllowURLRevisit = true
var executions [3]int // tracks number of executions of each callback
c.OnHTML("#firstElem", func(e *HTMLElement) {
executions[0]++
// Detach this callback after first execution
c.OnHTMLDetach("#firstElem")
})
c.OnHTML("#secondElem", func(e *HTMLElement) {
executions[1]++
})
c.OnHTML("#thirdElem", func(e *HTMLElement) {
executions[2]++
})
// First visit - all callbacks should execute
c.Visit(ts.URL + "/callback_test")
// Second visit - first callback should NOT execute
c.Visit(ts.URL + "/callback_test")
// Verify callback counts
if executions[0] != 1 {
t.Errorf("firstElem callback executed %d times, expected 1", executions[0])
}
if executions[1] != 2 {
t.Errorf("secondElem callback executed %d times, expected 2", executions[1])
}
if executions[2] != 2 {
t.Errorf("thirdElem callback executed %d times, expected 2", executions[2])
}
}
func TestCollectorPostRetry(t *testing.T) {
ts := newTestServer()
defer ts.Close()
postValue := "hello"
c := NewCollector()
try := false
c.OnResponse(func(r *Response) {
if r.Ctx.Get("notFirst") == "" {
r.Ctx.Put("notFirst", "first")
_ = r.Request.Retry()
return
}
if postValue != string(r.Body) {
t.Error("Failed to send data with POST")
}
try = true
})
c.Post(ts.URL+"/login", map[string]string{
"name": postValue,
})
if !try {
t.Error("OnResponse Retry was not called")
}
}
func TestCollectorGetRetry(t *testing.T) {
ts := newTestServer()
defer ts.Close()
try := false
c := NewCollector()
c.OnResponse(func(r *Response) {
if r.Ctx.Get("notFirst") == "" {
r.Ctx.Put("notFirst", "first")
_ = r.Request.Retry()
return
}
if !bytes.Equal(r.Body, serverIndexResponse) {
t.Error("Response body does not match with the original content")
}
try = true
})
c.Visit(ts.URL)
if !try {
t.Error("OnResponse Retry was not called")
}
}
func TestCollectorPostRetryUnseekable(t *testing.T) {
ts := newTestServer()
defer ts.Close()
try := false
postValue := "hello"
c := NewCollector()
c.OnResponse(func(r *Response) {
if postValue != string(r.Body) {
t.Error("Failed to send data with POST")
}
if r.Ctx.Get("notFirst") == "" {
r.Ctx.Put("notFirst", "first")
err := r.Request.Retry()
if !errors.Is(err, ErrRetryBodyUnseekable) {
t.Errorf("Unexpected error Type ErrRetryBodyUnseekable : %v", err)
}
return
}
try = true
})
c.Request("POST", ts.URL+"/login", bytes.NewBuffer([]byte("name="+postValue)), nil, nil)
if try {
t.Error("OnResponse Retry was called but BodyUnseekable")
}
}
func TestRedirectErrorRetry(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
c.OnError(func(r *Response, err error) {
if r.Ctx.Get("notFirst") == "" {
r.Ctx.Put("notFirst", "first")
_ = r.Request.Retry()
return
}
if e := (&AlreadyVisitedError{}); errors.As(err, &e) {
t.Error("loop AlreadyVisitedError")
}
})
c.OnResponse(func(response *Response) {
//println(1)
})
c.Visit(ts.URL + "/redirected/")
c.Visit(ts.URL + "/redirect")
}
func TestCheckRequestHeadersFunc(t *testing.T) {
ts := newTestServer()
defer ts.Close()
try := false
c := NewCollector()
c.OnRequestHeaders(func(r *Request) {
try = true
r.Abort()
})
c.OnScraped(func(r *Response) {
try = false
})
c.Visit(ts.URL)
if try == false {
t.Error("TestCheckRequestHeadersFunc failed")
}
}
func TestIssue745GzipURLWith404Response(t *testing.T) {
ts := newTestServer()
defer ts.Close()
c := NewCollector()
var responseStatusCode int
c.OnError(func(resp *Response, err error) {
responseStatusCode = resp.StatusCode
// The error should NOT be "gzip: invalid header".
// A 404 response for a .xml.gz URL should be treated as a
// normal HTTP error, not a decompression failure.
if strings.Contains(err.Error(), "gzip") {
t.Errorf("Expected HTTP error, got gzip decompression error: %v", err)
}
})
c.OnResponse(func(resp *Response) {
// A 404 should not reach OnResponse as a successful response
if resp.StatusCode == 404 {
responseStatusCode = resp.StatusCode
}
})
c.Visit(ts.URL + "/sitemap.xml.gz")
// The response should have been received (either via OnError or OnResponse)
// with status 404, not a gzip decompression error
if responseStatusCode != 404 {
t.Errorf("Expected status code 404, got %d", responseStatusCode)
}
}
================================================
FILE: context.go
================================================
// Copyright 2018 Adam Tauber
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package colly
import (
"sync"
)
// Context provides a tiny layer for passing data between callbacks
type Context struct {
contextMap map[string]interface{}
lock *sync.RWMutex
}
// NewContext initializes a new Context instance
func NewContext() *Context {
return &Context{
contextMap: make(map[string]interface{}),
lock: &sync.RWMutex{},
}
}
// UnmarshalBinary decodes Context value to nil
// This function is used by request caching
func (c *Context) UnmarshalBinary(_ []byte) error {
return nil
}
// MarshalBinary encodes Context value
// This function is used by request caching
func (c *Context) MarshalBinary() (_ []byte, _ error) {
return nil, nil
}
// Put stores a value of any type in Context
func (c *Context) Put(key string, value interface{}) {
c.lock.Lock()
c.contextMap[key] = value
c.lock.Unlock()
}
// Get retrieves a string value from Context.
// Get returns an empty string if key not found
func (c *Context) Get(key string) string {
c.lock.RLock()
defer c.lock.RUnlock()
if v, ok := c.contextMap[key]; ok {
return v.(string)
}
return ""
}
// GetAny retrieves a value from Context.
// GetAny returns nil if key not found
func (c *Context) GetAny(key string) interface{} {
c.lock.RLock()
defer c.lock.RUnlock()
if v, ok := c.contextMap[key]; ok {
return v
}
return nil
}
// ForEach iterate context
func (c *Context) ForEach(fn func(k string, v interface{}) interface{}) []interface{} {
c.lock.RLock()
defer c.lock.RUnlock()
ret := make([]interface{}, 0, len(c.contextMap))
for k, v := range c.contextMap {
ret = append(ret, fn(k, v))
}
return ret
}
// Clone clones context
func (c *Context) Clone() *Context {
c.lock.RLock()
defer c.lock.RUnlock()
newCtx := NewContext()
c.ForEach(func(key string, value interface{}) interface{} {
newCtx.Put(key, value)
return nil
})
return newCtx
}
================================================
FILE: context_test.go
================================================
// Copyright 2018 Adam Tauber
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package colly
import (
"strconv"
"testing"
)
func TestContextIteration(t *testing.T) {
ctx := NewContext()
for i := 0; i < 10; i++ {
ctx.Put(strconv.Itoa(i), i)
}
values := ctx.ForEach(func(k string, v interface{}) interface{} {
return v.(int)
})
if len(values) != 10 {
t.Fatal("fail to iterate context")
}
for _, i := range values {
v := i.(int)
if v != ctx.GetAny(strconv.Itoa(v)).(int) {
t.Fatal("value not equal")
}
}
}
func TestContextClone(t *testing.T) {
ctxOrg := NewContext()
for i := 0; i < 10; i++ {
ctxOrg.Put(strconv.Itoa(i), i)
}
ctx := ctxOrg.Clone()
values := ctx.ForEach(func(k string, v interface{}) interface{} {
return v.(int)
})
if len(values) != 10 {
t.Fatal("fail to iterate context")
}
for _, i := range values {
v := i.(int)
if v != ctx.GetAny(strconv.Itoa(v)).(int) {
t.Fatal("value not equal")
}
}
}
================================================
FILE: debug/debug.go
================================================
// Copyright 2018 Adam Tauber
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package debug
// Event represents an action inside a collector
type Event struct {
// Type is the type of the event
Type string
// RequestID identifies the HTTP request of the Event
RequestID uint32
// CollectorID identifies the collector of the Event
CollectorID uint32
// Values contains the event's key-value pairs. Different type of events
// can return different key-value pairs
Values map[string]string
}
// Debugger is an interface for different type of debugging backends
type Debugger interface {
// Init initializes the backend
Init() error
// Event receives a new collector event.
Event(e *Event)
}
================================================
FILE: debug/logdebugger.go
================================================
// Copyright 2018 Adam Tauber
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package debug
import (
"io"
"log"
"os"
"sync/atomic"
"time"
)
// LogDebugger is the simplest debugger which prints log messages to the STDERR
type LogDebugger struct {
// Output is the log destination, anything can be used which implements them
// io.Writer interface. Leave it blank to use STDERR
Output io.Writer
// Prefix appears at the beginning of each generated log line
Prefix string
// Flag defines the logging properties.
Flag int
logger *log.Logger
counter int32
start time.Time
}
// Init initializes the LogDebugger
func (l *LogDebugger) Init() error {
l.counter = 0
l.start = time.Now()
if l.Output == nil {
l.Output = os.Stderr
}
l.logger = log.New(l.Output, l.Prefix, l.Flag)
return nil
}
// Event receives Collector events and prints them to STDERR
func (l *LogDebugger) Event(e *Event) {
i := atomic.AddInt32(&l.counter, 1)
l.logger.Printf("[%06d] %d [%6d - %s] %q (%s)\n", i, e.CollectorID, e.RequestID, e.Type, e.Values, time.Since(l.start))
}
================================================
FILE: debug/webdebugger.go
================================================
// Copyright 2018 Adam Tauber
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package debug
import (
"encoding/json"
"log"
"net/http"
"sync"
"time"
)
// WebDebugger is a web based debugging frontend for colly
type WebDebugger struct {
// Address is the address of the web server. It is 127.0.0.1:7676 by default.
Address string
initialized bool
CurrentRequests map[uint32]requestInfo
RequestLog []requestInfo
sync.Mutex
}
type requestInfo struct {
URL string
Started time.Time
Duration time.Duration
ResponseStatus string
ID uint32
CollectorID uint32
}
// Init initializes the WebDebugger
func (w *WebDebugger) Init() error {
if w.initialized {
return nil
}
defer func() {
w.initialized = true
}()
if w.Address == "" {
w.Address = "127.0.0.1:7676"
}
w.RequestLog = make([]requestInfo, 0)
w.CurrentRequests = make(map[uint32]requestInfo)
http.HandleFunc("/", w.indexHandler)
http.HandleFunc("/status", w.statusHandler)
log.Println("Starting debug webserver on", w.Address)
go http.ListenAndServe(w.Address, nil)
return nil
}
// Event updates the debugger's status
func (w *WebDebugger) Event(e *Event) {
w.Lock()
defer w.Unlock()
switch e.Type {
case "request":
w.CurrentRequests[e.RequestID] = requestInfo{
URL: e.Values["url"],
Started: time.Now(),
ID: e.RequestID,
CollectorID: e.CollectorID,
}
case "response", "error":
r := w.CurrentRequests[e.RequestID]
r.Duration = time.Since(r.Started)
r.ResponseStatus = e.Values["status"]
w.RequestLog = append(w.RequestLog, r)
delete(w.CurrentRequests, e.RequestID)
}
}
func (w *WebDebugger) indexHandler(wr http.ResponseWriter, r *http.Request) {
wr.Write([]byte(`<!DOCTYPE html>
<html>
<head>
<title>Colly Debugger WebUI</title>
<script src="https://code.jquery.com/jquery-latest.min.js" type="text/javascript"></script>
<link rel="stylesheet" type="text/css" href="https://semantic-ui.com/dist/semantic.min.css">
</head>
<body>
<div class="ui inverted vertical masthead center aligned segment" id="menu">
<div class="ui tiny secondary inverted menu">
<a class="item" href="/"><b>Colly WebDebugger</b></a>
</div>
</div>
<div class="ui grid container">
<div class="row">
<div class="eight wide column">
<h1>Current Requests <span id="current_request_count"></span></h1>
<div id="current_requests" class="ui small feed"></div>
</div>
<div class="eight wide column">
<h1>Finished Requests <span id="request_log_count"></span></h1>
<div id="request_log" class="ui small feed"></div>
</div>
</div>
</div>
<script>
function curRequestTpl(url, started, collectorId) {
return '<div class="event"><div class="content"><div class="summary">' + url + '</div><div class="meta">Collector #' + collectorId + ' - ' + started + "</div></div></div>";
}
function requestLogTpl(url, duration, collectorId) {
return '<div class="event"><div class="content"><div class="summary">' + url + '</div><div class="meta">Collector #' + collectorId + ' - ' + (duration/1000000000) + "s</div></div></div>";
}
function fetchStatus() {
$.getJSON("/status", function(data) {
$("#current_requests").html("");
$("#request_log").html("");
$("#current_request_count").text('(' + Object.keys(data.CurrentRequests).length + ')');
$("#request_log_count").text('(' + data.RequestLog.length + ')');
for(var i in data.CurrentRequests) {
var r = data.CurrentRequests[i];
$("#current_requests").append(curRequestTpl(r.URL, r.Started, r.CollectorID));
}
for(var i in data.RequestLog.reverse()) {
var r = data.RequestLog[i];
$("#request_log").append(requestLogTpl(r.URL, r.Duration, r.CollectorID));
}
setTimeout(fetchStatus, 1000);
});
}
$(document).ready(function() {
fetchStatus();
});
</script>
</body>
</html>
`))
}
func (w *WebDebugger) statusHandler(wr http.ResponseWriter, r *http.Request) {
w.Lock()
jsonData, err := json.MarshalIndent(w, "", " ")
w.Unlock()
if err != nil {
panic(err)
}
wr.Write(jsonData)
}
================================================
FILE: extensions/extensions.go
================================================
// Package extensions implements various helper addons for Colly
package extensions
================================================
FILE: extensions/random_user_agent.go
================================================
package extensions
import (
"fmt"
"math/rand"
"strings"
"github.com/gocolly/colly/v2"
)
var uaGens = []func() string{
genFirefoxUA,
genChromeUA,
genEdgeUA,
genOperaUA,
}
var uaGensMobile = []func() string{
genMobilePixel7UA,
genMobilePixel6UA,
genMobilePixel5UA,
genMobilePixel4UA,
genMobileNexus10UA,
}
// RandomUserAgent generates a random DESKTOP browser user-agent on every requests
func RandomUserAgent(c *colly.Collector) {
c.OnRequest(func(r *colly.Request) {
r.Headers.Set("User-Agent", uaGens[rand.Intn(len(uaGens))]())
})
}
// RandomMobileUserAgent generates a random MOBILE browser user-agent on every requests
func RandomMobileUserAgent(c *colly.Collector) {
c.OnRequest(func(r *colly.Request) {
r.Headers.Set("User-Agent", uaGensMobile[rand.Intn(len(uaGensMobile))]())
})
}
var ffVersions = []float32{
// NOTE: Only version released after Jun 1, 2022 will be listed.
// Data source: https://en.wikipedia.org/wiki/Firefox_version_history
// 2022
102.0,
103.0,
104.0,
105.0,
106.0,
107.0,
108.0,
// 2023
109.0,
110.0,
111.0,
112.0,
113.0,
}
var chromeVersions = []string{
// NOTE: Only version released after Jun 1, 2022 will be listed.
// Data source: https://chromereleases.googleblog.com/search/label/Stable%20updates
// https://chromereleases.googleblog.com/2022/06/stable-channel-update-for-desktop.html
"102.0.5005.115",
// https://chromereleases.googleblog.com/2022/06/stable-channel-update-for-desktop_21.html
"103.0.5060.53",
// https://chromereleases.googleblog.com/2022/06/stable-channel-update-for-desktop_27.html
"103.0.5060.66",
// https://chromereleases.googleblog.com/2022/07/stable-channel-update-for-desktop.html
"103.0.5060.114",
// https://chromereleases.googleblog.com/2022/07/stable-channel-update-for-desktop_19.html
"103.0.5060.134",
// https://chromereleases.googleblog.com/2022/08/stable-channel-update-for-desktop.html
"104.0.5112.79",
"104.0.5112.80",
"104.0.5112.81",
// https://chromereleases.googleblog.com/2022/08/stable-channel-update-for-desktop_16.html
"104.0.5112.101",
"104.0.5112.102",
// https://chromereleases.googleblog.com/2022/08/stable-channel-update-for-desktop_30.html
"105.0.5195.52",
"105.0.5195.53",
"105.0.5195.54",
// https://chromereleases.googleblog.com/2022/09/stable-channel-update-for-desktop.html
"105.0.5195.102",
// https://chromereleases.googleblog.com/2022/09/stable-channel-update-for-desktop_14.html
"105.0.5195.125",
"105.0.5195.126",
"105.0.5195.127",
// https://chromereleases.googleblog.com/2022/09/stable-channel-update-for-desktop_27.html
"106.0.5249.61",
"106.0.5249.62",
// https://chromereleases.googleblog.com/2022/09/stable-channel-update-for-desktop_30.html
"106.0.5249.91",
// https://chromereleases.googleblog.com/2022/10/stable-channel-update-for-desktop.html
"106.0.5249.103",
// https://chromereleases.googleblog.com/2022/10/stable-channel-update-for-desktop_11.html
"106.0.5249.119",
// https://chromereleases.googleblog.com/2022/10/stable-channel-update-for-desktop_25.html
"107.0.5304.62",
"107.0.5304.63",
"107.0.5304.68",
// https://chromereleases.googleblog.com/2022/10/stable-channel-update-for-desktop_27.html
"107.0.5304.87",
"107.0.5304.88",
// https://chromereleases.googleblog.com/2022/11/stable-channel-update-for-desktop.html
"107.0.5304.106",
"107.0.5304.107",
"107.0.5304.110",
// https://chromereleases.googleblog.com/2022/11/stable-channel-update-for-desktop_24.html
"107.0.5304.121",
"107.0.5304.122",
// https://chromereleases.googleblog.com/2022/11/stable-channel-update-for-desktop_29.html
"108.0.5359.71",
"108.0.5359.72",
// https://chromereleases.googleblog.com/2022/12/stable-channel-update-for-desktop.html
"108.0.5359.94",
"108.0.5359.95",
// https://chromereleases.googleblog.com/2022/12/stable-channel-update-for-desktop_7.html
"108.0.5359.98",
"108.0.5359.99",
// https://chromereleases.googleblog.com/2022/12/stable-channel-update-for-desktop_13.html
"108.0.5359.124",
"108.0.5359.125",
// https://chromereleases.googleblog.com/2023/01/stable-channel-update-for-desktop.html
"109.0.5414.74",
"109.0.5414.75",
"109.0.5414.87",
// https://chromereleases.googleblog.com/2023/01/stable-channel-update-for-desktop_24.html
"109.0.5414.119",
"109.0.5414.120",
// https://chromereleases.googleblog.com/2023/02/stable-channel-update-for-desktop.html
"110.0.5481.77",
"110.0.5481.78",
// https://chromereleases.googleblog.com/2023/02/stable-channel-desktop-update.html
"110.0.5481.96",
"110.0.5481.97",
// https://chromereleases.googleblog.com/2023/02/stable-channel-desktop-update_14.html
"110.0.5481.100",
// https://chromereleases.googleblog.com/2023/02/stable-channel-desktop-update_16.html
"110.0.5481.104",
// https://chromereleases.googleblog.com/2023/02/stable-channel-desktop-update_22.html
"110.0.5481.177",
"110.0.5481.178",
// https://chromereleases.googleblog.com/2023/02/stable-channel-desktop-update_97.html
"109.0.5414.129",
// https://chromereleases.googleblog.com/2023/03/stable-channel-update-for-desktop.html
"111.0.5563.64",
"111.0.5563.65",
// https://chromereleases.googleblog.com/2023/03/stable-channel-update-for-desktop_21.html
"111.0.5563.110",
"111.0.5563.111",
// https://chromereleases.googleblog.com/2023/03/stable-channel-update-for-desktop_27.html
"111.0.5563.146",
"111.0.5563.147",
// https://chromereleases.googleblog.com/2023/04/stable-channel-update-for-desktop.html
"112.0.5615.49",
"112.0.5615.50",
// https://chromereleases.googleblog.com/2023/04/stable-channel-update-for-desktop_12.html
"112.0.5615.86",
"112.0.5615.87",
// https://chromereleases.googleblog.com/2023/04/stable-channel-update-for-desktop_14.html
"112.0.5615.121",
// https://chromereleases.googleblog.com/2023/04/stable-channel-update-for-desktop_18.html
"112.0.5615.137",
"112.0.5615.138",
"112.0.5615.165",
// https://chromereleases.googleblog.com/2023/05/stable-channel-update-for-desktop.html
"113.0.5672.63",
"113.0.5672.64",
// https://chromereleases.googleblog.com/2023/05/stable-channel-update-for-desktop_8.html
"113.0.5672.92",
"113.0.5672.93",
}
var edgeVersions = []string{
// NOTE: Only version released after Jun 1, 2022 will be listed.
// Data source: https://learn.microsoft.com/en-us/deployedge/microsoft-edge-release-schedule
// 2022
"103.0.0.0,103.0.1264.37",
"104.0.0.0,104.0.1293.47",
"105.0.0.0,105.0.1343.25",
"106.0.0.0,106.0.1370.34",
"107.0.0.0,107.0.1418.24",
"108.0.0.0,108.0.1462.42",
// 2023
"109.0.0.0,109.0.1518.49",
"110.0.0.0,110.0.1587.41",
"111.0.0.0,111.0.1661.41",
"112.0.0.0,112.0.1722.34",
"113.0.0.0,113.0.1774.3",
}
var operaVersions = []string{
// NOTE: Only version released after Jan 1, 2023 will be listed.
// Data source: https://blogs.opera.com/desktop/
// https://blogs.opera.com/desktop/changelog-for-96/
"110.0.5449.0,96.0.4640.0",
"110.0.5464.2,96.0.4653.0",
"110.0.5464.2,96.0.4660.0",
"110.0.5481.30,96.0.4674.0",
"110.0.5481.30,96.0.4691.0",
"110.0.5481.30,96.0.4693.12",
"110.0.5481.77,96.0.4693.16",
"110.0.5481.100,96.0.4693.20",
"110.0.5481.178,96.0.4693.31",
"110.0.5481.178,96.0.4693.50",
"110.0.5481.192,96.0.4693.80",
// https://blogs.opera.com/desktop/changelog-for-97/
"111.0.5532.2,97.0.4711.0",
"111.0.5532.2,97.0.4704.0",
"111.0.5532.2,97.0.4697.0",
"111.0.5562.0,97.0.4718.0",
"111.0.5563.19,97.0.4719.4",
"111.0.5563.19,97.0.4719.11",
"111.0.5563.41,97.0.4719.17",
"111.0.5563.65,97.0.4719.26",
"111.0.5563.65,97.0.4719.28",
"111.0.5563.111,97.0.4719.43",
"111.0.5563.147,97.0.4719.63",
"111.0.5563.147,97.0.4719.83",
// https://blogs.opera.com/desktop/changelog-for-98/
"112.0.5596.2,98.0.4756.0",
"112.0.5596.2,98.0.4746.0",
"112.0.5615.20,98.0.4759.1",
"112.0.5615.50,98.0.4759.3",
"112.0.5615.87,98.0.4759.6",
"112.0.5615.165,98.0.4759.15",
"112.0.5615.165,98.0.4759.21",
"112.0.5615.165,98.0.4759.39",
}
var pixel7AndroidVersions = []string{
// Data source:
// - https://developer.android.com/about/versions
// - https://source.android.com/docs/setup/about/build-numbers#source-code-tags-and-builds
"13",
}
var pixel6AndroidVersions = []string{
// Data source:
// - https://developer.android.com/about/versions
// - https://source.android.com/docs/setup/about/build-numbers#source-code-tags-and-builds
"12",
"13",
}
var pixel5AndroidVersions = []string{
// Data source:
// - https://developer.android.com/about/versions
// - https://source.android.com/docs/setup/about/build-numbers#source-code-tags-and-builds
"11",
"12",
"13",
}
var pixel4AndroidVersions = []string{
// Data source:
// - https://developer.android.com/about/versions
// - https://source.android.com/docs/setup/about/build-numbers#source-code-tags-and-builds
"10",
"11",
"12",
"13",
}
var nexus10AndroidVersions = []string{
// Data source:
// - https://developer.android.com/about/versions
// - https://source.android.com/docs/setup/about/build-numbers#source-code-tags-and-builds
"4.4.2",
"4.4.4",
"5.0",
"5.0.1",
"5.0.2",
"5.1",
"5.1.1",
}
var nexus10Builds = []string{
// Data source: https://source.android.com/docs/setup/about/build-numbers#source-code-tags-and-builds
"LMY49M", // android-5.1.1_r38 (Lollipop)
"LMY49J", // android-5.1.1_r37 (Lollipop)
"LMY49I", // android-5.1.1_r36 (Lollipop)
"LMY49H", // android-5.1.1_r35 (Lollipop)
"LMY49G", // android-5.1.1_r34 (Lollipop)
"LMY49F", // android-5.1.1_r33 (Lollipop)
"LMY48Z", // android-5.1.1_r30 (Lollipop)
"LMY48X", // android-5.1.1_r25 (Lollipop)
"LMY48T", // android-5.1.1_r19 (Lollipop)
"LMY48M", // android-5.1.1_r14 (Lollipop)
"LMY48I", // android-5.1.1_r9 (Lollipop)
"LMY47V", // android-5.1.1_r1 (Lollipop)
"LMY47D", // android-5.1.0_r1 (Lollipop)
"LRX22G", // android-5.0.2_r1 (Lollipop)
"LRX22C", // android-5.0.1_r1 (Lollipop)
"LRX21P", // android-5.0.0_r4.0.1 (Lollipop)
"KTU84P", // android-4.4.4_r1 (KitKat)
"KTU84L", // android-4.4.3_r1 (KitKat)
"KOT49H", // android-4.4.2_r1 (KitKat)
"KOT49E", // android-4.4.1_r1 (KitKat)
"KRT16S", // android-4.4_r1.2 (KitKat)
"JWR66Y", // android-4.3_r1.1 (Jelly Bean)
"JWR66V", // android-4.3_r1 (Jelly Bean)
"JWR66N", // android-4.3_r0.9.1 (Jelly Bean)
"JDQ39 ", // android-4.2.2_r1 (Jelly Bean)
"JOP40F", // android-4.2.1_r1.1 (Jelly Bean)
"JOP40D", // android-4.2.1_r1 (Jelly Bean)
"JOP40C", // android-4.2_r1 (Jelly Bean)
}
var osStrings = []string{
// MacOS - High Sierra
"Macintosh; Intel Mac OS X 10_13",
"Macintosh; Intel Mac OS X 10_13_1",
"Macintosh; Intel Mac OS X 10_13_2",
"Macintosh; Intel Mac OS X 10_13_3",
"Macintosh; Intel Mac OS X 10_13_4",
"Macintosh; Intel Mac OS X 10_13_5",
"Macintosh; Intel Mac OS X 10_13_6",
// MacOS - Mojave
"Macintosh; Intel Mac OS X 10_14",
"Macintosh; Intel Mac OS X 10_14_1",
"Macintosh; Intel Mac OS X 10_14_2",
"Macintosh; Intel Mac OS X 10_14_3",
"Macintosh; Intel Mac OS X 10_14_4",
"Macintosh; Intel Mac OS X 10_14_5",
"Macintosh; Intel Mac OS X 10_14_6",
// MacOS - Catalina
"Macintosh; Intel Mac OS X 10_15",
"Macintosh; Intel Mac OS X 10_15_1",
"Macintosh; Intel Mac OS X 10_15_2",
"Macintosh; Intel Mac OS X 10_15_3",
"Macintosh; Intel Mac OS X 10_15_4",
"Macintosh; Intel Mac OS X 10_15_5",
"Macintosh; Intel Mac OS X 10_15_6",
"Macintosh; Intel Mac OS X 10_15_7",
// MacOS - Big Sur
"Macintosh; Intel Mac OS X 11_0",
"Macintosh; Intel Mac OS X 11_0_1",
"Macintosh; Intel Mac OS X 11_1",
"Macintosh; Intel Mac OS X 11_2",
"Macintosh; Intel Mac OS X 11_2_1",
"Macintosh; Intel Mac OS X 11_2_2",
"Macintosh; Intel Mac OS X 11_2_3",
"Macintosh; Intel Mac OS X 11_3",
"Macintosh; Intel Mac OS X 11_3_1",
"Macintosh; Intel Mac OS X 11_4",
"Macintosh; Intel Mac OS X 11_5",
"Macintosh; Intel Mac OS X 11_5_1",
"Macintosh; Intel Mac OS X 11_5_2",
"Macintosh; Intel Mac OS X 11_6",
"Macintosh; Intel Mac OS X 11_6_1",
"Macintosh; Intel Mac OS X 11_6_2",
"Macintosh; Intel Mac OS X 11_6_3",
"Macintosh; Intel Mac OS X 11_6_4",
"Macintosh; Intel Mac OS X 11_6_5",
"Macintosh; Intel Mac OS X 11_6_6",
"Macintosh; Intel Mac OS X 11_6_7",
"Macintosh; Intel Mac OS X 11_6_8",
"Macintosh; Intel Mac OS X 11_7",
"Macintosh; Intel Mac OS X 11_7_1",
"Macintosh; Intel Mac OS X 11_7_2",
"Macintosh; Intel Mac OS X 11_7_3",
"Macintosh; Intel Mac OS X 11_7_4",
"Macintosh; Intel Mac OS X 11_7_5",
"Macintosh; Intel Mac OS X 11_7_6",
// MacOS - Monterey
"Macintosh; Intel Mac OS X 12_0",
"Macintosh; Intel Mac OS X 12_0_1",
"Macintosh; Intel Mac OS X 12_1",
"Macintosh; Intel Mac OS X 12_2",
"Macintosh; Intel Mac OS X 12_2_1",
"Macintosh; Intel Mac OS X 12_3",
"Macintosh; Intel Mac OS X 12_3_1",
"Macintosh; Intel Mac OS X 12_4",
"Macintosh; Intel Mac OS X 12_5",
"Macintosh; Intel Mac OS X 12_5_1",
"Macintosh; Intel Mac OS X 12_6",
"Macintosh; Intel Mac OS X 12_6_1",
"Macintosh; Intel Mac OS X 12_6_2",
"Macintosh; Intel Mac OS X 12_6_3",
"Macintosh; Intel Mac OS X 12_6_4",
"Macintosh; Intel Mac OS X 12_6_5",
// MacOS - Ventura
"Macintosh; Intel Mac OS X 13_0",
"Macintosh; Intel Mac OS X 13_0_1",
"Macintosh; Intel Mac OS X 13_1",
"Macintosh; Intel Mac OS X 13_2",
"Macintosh; Intel Mac OS X 13_2_1",
"Macintosh; Intel Mac OS X 13_3",
"Macintosh; Intel Mac OS X 13_3_1",
// Windows
"Windows NT 10.0; Win64; x64",
"Windows NT 5.1",
"Windows NT 6.1; WOW64",
"Windows NT 6.1; Win64; x64",
// Linux
"X11; Linux x86_64",
}
// Generates Firefox Browser User-Agent (Desktop)
//
// -> "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:87.0) Gecko/20100101 Firefox/87.0"
func genFirefoxUA() string {
version := ffVersions[rand.Intn(len(ffVersions))]
os := osStrings[rand.Intn(len(osStrings))]
return fmt.Sprintf("Mozilla/5.0 (%s; rv:%.1f) Gecko/20100101 Firefox/%.1f", os, version, version)
}
// Generates Chrome Browser User-Agent (Desktop)
//
// -> "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36"
func genChromeUA() string {
version := chromeVersions[rand.Intn(len(chromeVersions))]
os := osStrings[rand.Intn(len(osStrings))]
return fmt.Sprintf("Mozilla/5.0 (%s) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36", os, version)
}
// Generates Microsoft Edge User-Agent (Desktop)
//
// -> "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36 Edg/90.0.818.39"
func genEdgeUA() string {
version := edgeVersions[rand.Intn(len(edgeVersions))]
chromeVersion := strings.Split(version, ",")[0]
edgeVersion := strings.Split(version, ",")[1]
os := osStrings[rand.Intn(len(osStrings))]
return fmt.Sprintf("Mozilla/5.0 (%s) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36 Edg/%s", os, chromeVersion, edgeVersion)
}
// Generates Opera Browser User-Agent (Desktop)
//
// -> "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_3_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 OPR/98.0.4759.3"
func genOperaUA() string {
version := operaVersions[rand.Intn(len(operaVersions))]
chromeVersion := strings.Split(version, ",")[0]
operaVersion := strings.Split(version, ",")[1]
os := osStrings[rand.Intn(len(osStrings))]
return fmt.Sprintf("Mozilla/5.0 (%s) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36 OPR/%s", os, chromeVersion, operaVersion)
}
// Generates Pixel 7 Browser User-Agent (Mobile)
//
// -> Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Mobile Safari/537.36
func genMobilePixel7UA() string {
android := pixel7AndroidVersions[rand.Intn(len(pixel7AndroidVersions))]
chrome := chromeVersions[rand.Intn(len(chromeVersions))]
return fmt.Sprintf("Mozilla/5.0 (Linux; Android %s; Pixel 7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36", android, chrome)
}
// Generates Pixel 6 Browser User-Agent (Mobile)
//
// -> "Mozilla/5.0 (Linux; Android 13; Pixel 6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Mobile Safari/537.36"
func genMobilePixel6UA() string {
android := pixel6AndroidVersions[rand.Intn(len(pixel6AndroidVersions))]
chrome := chromeVersions[rand.Intn(len(chromeVersions))]
return fmt.Sprintf("Mozilla/5.0 (Linux; Android %s; Pixel 6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36", android, chrome)
}
// Generates Pixel 5 Browser User-Agent (Mobile)
//
// -> "Mozilla/5.0 (Linux; Android 13; Pixel 5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Mobile Safari/537.36"
func genMobilePixel5UA() string {
android := pixel5AndroidVersions[rand.Intn(len(pixel5AndroidVersions))]
chrome := chromeVersions[rand.Intn(len(chromeVersions))]
return fmt.Sprintf("Mozilla/5.0 (Linux; Android %s; Pixel 5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36", android, chrome)
}
// Generates Pixel 4 Browser User-Agent (Mobile)
//
// -> "Mozilla/5.0 (Linux; Android 13; Pixel 4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Mobile Safari/537.36"
func genMobilePixel4UA() string {
android := pixel4AndroidVersions[rand.Intn(len(pixel4AndroidVersions))]
chrome := chromeVersions[rand.Intn(len(chromeVersions))]
return fmt.Sprintf("Mozilla/5.0 (Linux; Android %s; Pixel 4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36", android, chrome)
}
// Generates Nexus 10 Browser User-Agent (Mobile)
//
// -> "Mozilla/5.0 (Linux; Android 5.1.1; Nexus 10 Build/LMY48T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.91 Safari/537.36"
func genMobileNexus10UA() string {
build := nexus10Builds[rand.Intn(len(nexus10Builds))]
android := nexus10AndroidVersions[rand.Intn(len(nexus10AndroidVersions))]
chrome := chromeVersions[rand.Intn(len(chromeVersions))]
return fmt.Sprintf("Mozilla/5.0 (Linux; Android %s; Nexus 10 Build/%s) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36", android, build, chrome)
}
================================================
FILE: extensions/referer.go
================================================
package extensions
import (
"github.com/gocolly/colly/v2"
)
// Referer sets valid Referer HTTP header to requests.
// Warning: this extension works only if you use Request.Visit
// from callbacks instead of Collector.Visit.
func Referer(c *colly.Collector) {
c.OnResponse(func(r *colly.Response) {
r.Ctx.Put("_referer", r.Request.URL.String())
})
c.OnRequest(func(r *colly.Request) {
if ref := r.Ctx.Get("_referer"); ref != "" {
r.Headers.Set("Referer", ref)
}
})
}
================================================
FILE: extensions/url_length_filter.go
================================================
package extensions
import (
"github.com/gocolly/colly/v2"
)
// URLLengthFilter filters out requests with URLs longer than URLLengthLimit
func URLLengthFilter(c *colly.Collector, URLLengthLimit int) {
c.OnRequest(func(r *colly.Request) {
if len(r.URL.String()) > URLLengthLimit {
r.Abort()
}
})
}
================================================
FILE: go.mod
================================================
module github.com/gocolly/colly/v2
go 1.24.0
toolchain go1.24.9
require (
github.com/PuerkitoBio/goquery v1.11.0
github.com/antchfx/htmlquery v1.3.5
github.com/antchfx/xmlquery v1.5.0
github.com/gobwas/glob v0.2.3
github.com/gocolly/colly v1.2.0
github.com/jawher/mow.cli v1.1.0
github.com/kennygrant/sanitize v1.2.4
github.com/nlnwa/whatwg-url v0.6.2
github.com/saintfish/chardet v0.0.0-20230101081208-5e3ef4b5456d
github.com/temoto/robotstxt v1.1.2
golang.org/x/net v0.47.0
google.golang.org/appengine v1.6.8
)
require (
github.com/andybalholm/cascadia v1.3.3 // indirect
github.com/antchfx/xpath v1.3.5 // indirect
github.com/bits-and-blooms/bitset v1.24.4 // indirect
github.com/golang/groupcache v0.0.0-20241129210726-2c02b8208cf8 // indirect
github.com/golang/protobuf v1.5.4 // indirect
golang.org/x/text v0.31.0 // indirect
google.golang.org/protobuf v1.36.10 // indirect
)
================================================
FILE: go.sum
================================================
github.com/PuerkitoBio/goquery v1.10.2 h1:7fh2BdHcG6VFZsK7toXBT/Bh1z5Wmy8Q9MV9HqT2AM8=
github.com/PuerkitoBio/goquery v1.10.2/go.mod h1:0guWGjcLu9AYC7C1GHnpysHy056u9aEkUHwhdnePMCU=
github.com/PuerkitoBio/goquery v1.11.0 h1:jZ7pwMQXIITcUXNH83LLk+txlaEy6NVOfTuP43xxfqw=
github.com/PuerkitoBio/goquery v1.11.0/go.mod h1:wQHgxUOU3JGuj3oD/QFfxUdlzW6xPHfqyHre6VMY4DQ=
github.com/andybalholm/cascadia v1.3.3 h1:AG2YHrzJIm4BZ19iwJ/DAua6Btl3IwJX+VI4kktS1LM=
github.com/andybalholm/cascadia v1.3.3/go.mod h1:xNd9bqTn98Ln4DwST8/nG+H0yuB8Hmgu1YHNnWw0GeA=
github.com/antchfx/htmlquery v1.3.4 h1:Isd0srPkni2iNTWCwVj/72t7uCphFeor5Q8nCzj1jdQ=
github.com/antchfx/htmlquery v1.3.4/go.mod h1:K9os0BwIEmLAvTqaNSua8tXLWRWZpocZIH73OzWQbwM=
github.com/antchfx/htmlquery v1.3.5 h1:aYthDDClnG2a2xePf6tys/UyyM/kRcsFRm+ifhFKoU0=
github.com/antchfx/htmlquery v1.3.5/go.mod h1:5oyIPIa3ovYGtLqMPNjBF2Uf25NPCKsMjCnQ8lvjaoA=
github.com/antchfx/xmlquery v1.4.4 h1:mxMEkdYP3pjKSftxss4nUHfjBhnMk4imGoR96FRY2dg=
github.com/antchfx/xmlquery v1.4.4/go.mod h1:AEPEEPYE9GnA2mj5Ur2L5Q5/2PycJ0N9Fusrx9b12fc=
github.com/antchfx/xmlquery v1.5.0 h1:uAi+mO40ZWfyU6mlUBxRVvL6uBNZ6LMU4M3+mQIBV4c=
github.com/antchfx/xmlquery v1.5.0/go.mod h1:lJfWRXzYMK1ss32zm1GQV3gMIW/HFey3xDZmkP1SuNc=
github.com/antchfx/xpath v1.3.3 h1:tmuPQa1Uye0Ym1Zn65vxPgfltWb/Lxu2jeqIGteJSRs=
github.com/antchfx/xpath v1.3.3/go.mod h1:i54GszH55fYfBmoZXapTHN8T8tkcHfRgLyVwwqzXNcs=
github.com/antchfx/xpath v1.3.5 h1:PqbXLC3TkfeZyakF5eeh3NTWEbYl4VHNVeufANzDbKQ=
github.com/antchfx/xpath v1.3.5/go.mod h1:i54GszH55fYfBmoZXapTHN8T8tkcHfRgLyVwwqzXNcs=
github.com/bits-and-blooms/bitset v1.20.0/go.mod h1:7hO7Gc7Pp1vODcmWvKMRA9BNmbv6a/7QIWpPxHddWR8=
github.com/bits-and-blooms/bitset v1.22.0 h1:Tquv9S8+SGaS3EhyA+up3FXzmkhxPGjQQCkcs2uw7w4=
github.com/bits-and-blooms/bitset v1.22.0/go.mod h1:7hO7Gc7Pp1vODcmWvKMRA9BNmbv6a/7QIWpPxHddWR8=
github.com/bits-and-blooms/bitset v1.24.4 h1:95H15Og1clikBrKr/DuzMXkQzECs1M6hhoGXLwLQOZE=
github.com/bits-and-blooms/bitset v1.24.4/go.mod h1:7hO7Gc7Pp1vODcmWvKMRA9BNmbv6a/7QIWpPxHddWR8=
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/gobwas/glob v0.2.3 h1:A4xDbljILXROh+kObIiy5kIaPYD8e96x1tgBhUI5J+Y=
github.com/gobwas/glob v0.2.3/go.mod h1:d3Ez4x06l9bZtSvzIay5+Yzi0fmZzPgnTbPcKjJAkT8=
github.com/gocolly/colly v1.2.0 h1:qRz9YAn8FIH0qzgNUw+HT9UN7wm1oF9OBAilwEWpyrI=
github.com/gocolly/colly v1.2.0/go.mod h1:Hof5T3ZswNVsOHYmba1u03W65HDWgpV5HifSuueE0EA=
github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc=
github.com/golang/groupcache v0.0.0-20241129210726-2c02b8208cf8 h1:f+oWsMOmNPc8JmEHVZIycC7hBoQxHH9pNKQORJNozsQ=
github.com/golang/groupcache v0.0.0-20241129210726-2c02b8208cf8/go.mod h1:wcDNUvekVysuuOpQKo3191zZyTpiI6se1N1ULghS0sw=
github.com/golang/protobuf v1.5.0/go.mod h1:FsONVRAS9T7sI+LI
gitextract_san38b80/ ├── .codecov.yml ├── .github/ │ ├── ISSUE_TEMPLATE/ │ │ ├── bug_report.md │ │ ├── config.yml │ │ └── feature_request.md │ └── workflows/ │ └── ci.yml ├── CHANGELOG.md ├── CONTRIBUTING.md ├── LICENSE.txt ├── README.md ├── VERSION ├── _examples/ │ ├── README.md │ ├── basic/ │ │ └── basic.go │ ├── coursera_courses/ │ │ └── coursera_courses.go │ ├── cryptocoinmarketcap/ │ │ └── cryptocoinmarketcap.go │ ├── error_handling/ │ │ └── error_handling.go │ ├── factba.se/ │ │ └── factbase.go │ ├── google_groups/ │ │ └── google_groups.go │ ├── hackernews_comments/ │ │ └── hackernews_comments.go │ ├── instagram/ │ │ └── instagram.go │ ├── local_files/ │ │ ├── html/ │ │ │ ├── child_page/ │ │ │ │ ├── one.html │ │ │ │ ├── three.html │ │ │ │ └── two.html │ │ │ └── index.html │ │ └── local_files.go │ ├── login/ │ │ └── login.go │ ├── max_depth/ │ │ └── max_depth.go │ ├── multipart/ │ │ └── multipart.go │ ├── openedx_courses/ │ │ └── openedx_courses.go │ ├── parallel/ │ │ └── parallel.go │ ├── proxy_switcher/ │ │ └── proxy_switcher.go │ ├── queue/ │ │ └── queue.go │ ├── random_delay/ │ │ └── random_delay.go │ ├── rate_limit/ │ │ └── rate_limit.go │ ├── reddit/ │ │ └── reddit.go │ ├── request_context/ │ │ └── request_context.go │ ├── scraper_server/ │ │ └── scraper_server.go │ ├── shopify_sitemap/ │ │ └── shopify_sitemap.go │ ├── url_filter/ │ │ └── url_filter.go │ └── xkcd_store/ │ └── xkcd_store.go ├── cmd/ │ └── colly/ │ └── colly.go ├── colly.go ├── colly_test.go ├── context.go ├── context_test.go ├── debug/ │ ├── debug.go │ ├── logdebugger.go │ └── webdebugger.go ├── extensions/ │ ├── extensions.go │ ├── random_user_agent.go │ ├── referer.go │ └── url_length_filter.go ├── go.mod ├── go.sum ├── htmlelement.go ├── http_backend.go ├── http_trace.go ├── http_trace_test.go ├── proxy/ │ └── proxy.go ├── queue/ │ ├── queue.go │ └── queue_test.go ├── request.go ├── response.go ├── storage/ │ └── storage.go ├── unmarshal.go ├── unmarshal_test.go ├── xmlelement.go └── xmlelement_test.go
SYMBOL INDEX (357 symbols across 49 files)
FILE: _examples/basic/basic.go
function main (line 9) | func main() {
FILE: _examples/coursera_courses/coursera_courses.go
type Course (line 14) | type Course struct
function main (line 25) | func main() {
FILE: _examples/cryptocoinmarketcap/cryptocoinmarketcap.go
function main (line 11) | func main() {
FILE: _examples/error_handling/error_handling.go
function main (line 9) | func main() {
FILE: _examples/factba.se/factbase.go
type result (line 15) | type result struct
type results (line 20) | type results struct
type transcript (line 24) | type transcript struct
function main (line 29) | func main() {
FILE: _examples/google_groups/google_groups.go
type Mail (line 14) | type Mail struct
function main (line 22) | func main() {
FILE: _examples/hackernews_comments/hackernews_comments.go
type comment (line 14) | type comment struct
function main (line 22) | func main() {
FILE: _examples/instagram/instagram.go
constant nextPageURL (line 17) | nextPageURL string = `https://www.instagram.com/graphql/query/?query_has...
constant nextPagePayload (line 18) | nextPagePayload string = `{"id":"%s","first":50,"after":"%s"}`
type pageInfo (line 24) | type pageInfo struct
type mainPageData (line 29) | type mainPageData struct
type nextPageData (line 57) | type nextPageData struct
function main (line 79) | func main() {
FILE: _examples/local_files/local_files.go
function main (line 12) | func main() {
FILE: _examples/login/login.go
function main (line 9) | func main() {
FILE: _examples/max_depth/max_depth.go
function main (line 9) | func main() {
FILE: _examples/multipart/multipart.go
function generateFormData (line 13) | func generateFormData() map[string][]byte {
function setupServer (line 27) | func setupServer() {
function main (line 45) | func main() {
FILE: _examples/openedx_courses/openedx_courses.go
constant DATE_FORMAT (line 13) | DATE_FORMAT = "02 Jan, 2006"
type Course (line 16) | type Course struct
function main (line 26) | func main() {
FILE: _examples/parallel/parallel.go
function main (line 9) | func main() {
FILE: _examples/proxy_switcher/proxy_switcher.go
function main (line 11) | func main() {
FILE: _examples/queue/queue.go
function main (line 10) | func main() {
FILE: _examples/random_delay/random_delay.go
function main (line 11) | func main() {
FILE: _examples/rate_limit/rate_limit.go
function main (line 10) | func main() {
FILE: _examples/reddit/reddit.go
type item (line 11) | type item struct
function main (line 20) | func main() {
FILE: _examples/request_context/request_context.go
function main (line 9) | func main() {
FILE: _examples/scraper_server/scraper_server.go
type pageInfo (line 11) | type pageInfo struct
function handler (line 16) | func handler(w http.ResponseWriter, r *http.Request) {
function main (line 58) | func main() {
FILE: _examples/shopify_sitemap/shopify_sitemap.go
function main (line 9) | func main() {
FILE: _examples/url_filter/url_filter.go
function main (line 10) | func main() {
FILE: _examples/xkcd_store/xkcd_store.go
function main (line 11) | func main() {
FILE: cmd/colly/colly.go
function main (line 68) | func main() {
FILE: colly.go
type CollectorOption (line 53) | type CollectorOption
type Collector (line 56) | type Collector struct
method Init (line 492) | func (c *Collector) Init() {
method Appengine (line 525) | func (c *Collector) Appengine(ctx context.Context) {
method Visit (line 537) | func (c *Collector) Visit(URL string) error {
method HasVisited (line 547) | func (c *Collector) HasVisited(URL string) (bool, error) {
method HasPosted (line 553) | func (c *Collector) HasPosted(URL string, requestData map[string]strin...
method Head (line 558) | func (c *Collector) Head(URL string) error {
method Post (line 564) | func (c *Collector) Post(URL string, requestData map[string]string) er...
method PostRaw (line 570) | func (c *Collector) PostRaw(URL string, requestData []byte) error {
method PostMultipart (line 576) | func (c *Collector) PostMultipart(URL string, requestData map[string][...
method Request (line 595) | func (c *Collector) Request(method, URL string, requestData io.Reader,...
method SetDebugger (line 600) | func (c *Collector) SetDebugger(d debug.Debugger) {
method UnmarshalRequest (line 606) | func (c *Collector) UnmarshalRequest(r []byte) (*Request, error) {
method scrape (line 635) | func (c *Collector) scrape(u, method string, depth int, requestData io...
method fetch (line 690) | func (c *Collector) fetch(u, method string, depth int, requestData io....
method requestCheck (line 773) | func (c *Collector) requestCheck(parsedURL *url.URL, method string, ge...
method checkFilters (line 819) | func (c *Collector) checkFilters(URL, domain string) error {
method isDomainAllowed (line 836) | func (c *Collector) isDomainAllowed(domain string) bool {
method checkRobots (line 846) | func (c *Collector) checkRobots(u *url.URL) error {
method String (line 909) | func (c *Collector) String() string {
method Wait (line 922) | func (c *Collector) Wait() {
method OnRequest (line 928) | func (c *Collector) OnRequest(f RequestCallback) {
method OnResponseHeaders (line 948) | func (c *Collector) OnResponseHeaders(f ResponseHeadersCallback) {
method OnRequestHeaders (line 956) | func (c *Collector) OnRequestHeaders(f RequestCallback) {
method OnResponse (line 963) | func (c *Collector) OnResponse(f ResponseCallback) {
method OnHTML (line 975) | func (c *Collector) OnHTML(goquerySelector string, f HTMLCallback) {
method OnXML (line 992) | func (c *Collector) OnXML(xpathQuery string, f XMLCallback) {
method OnHTMLDetach (line 1007) | func (c *Collector) OnHTMLDetach(goquerySelector string) {
method OnXMLDetach (line 1019) | func (c *Collector) OnXMLDetach(xpathQuery string) {
method OnError (line 1032) | func (c *Collector) OnError(f ErrorCallback) {
method OnScraped (line 1043) | func (c *Collector) OnScraped(f ScrapedCallback) {
method SetClient (line 1053) | func (c *Collector) SetClient(client *http.Client) {
method WithTransport (line 1058) | func (c *Collector) WithTransport(transport http.RoundTripper) {
method DisableCookies (line 1063) | func (c *Collector) DisableCookies() {
method SetCookieJar (line 1068) | func (c *Collector) SetCookieJar(j http.CookieJar) {
method SetRequestTimeout (line 1073) | func (c *Collector) SetRequestTimeout(timeout time.Duration) {
method SetStorage (line 1079) | func (c *Collector) SetStorage(s storage.Storage) error {
method SetProxy (line 1093) | func (c *Collector) SetProxy(proxyURL string) error {
method SetProxyFunc (line 1111) | func (c *Collector) SetProxyFunc(p ProxyFunc) {
method handleOnRequest (line 1133) | func (c *Collector) handleOnRequest(r *Request) {
method handleOnResponse (line 1144) | func (c *Collector) handleOnResponse(r *Response) {
method handleOnResponseHeaders (line 1156) | func (c *Collector) handleOnResponseHeaders(r *Response) {
method handleOnRequestHeaders (line 1167) | func (c *Collector) handleOnRequestHeaders(r *Request) {
method handleOnHTML (line 1178) | func (c *Collector) handleOnHTML(resp *Response) error {
method handleOnXML (line 1240) | func (c *Collector) handleOnXML(resp *Response) error {
method handleOnError (line 1312) | func (c *Collector) handleOnError(response *Response, err error, reque...
method cleanupCallbacks (line 1343) | func (c *Collector) cleanupCallbacks() {
method handleOnScraped (line 1358) | func (c *Collector) handleOnScraped(r *Response) {
method Limit (line 1373) | func (c *Collector) Limit(rule *LimitRule) error {
method Limits (line 1378) | func (c *Collector) Limits(rules []*LimitRule) error {
method SetRedirectHandler (line 1383) | func (c *Collector) SetRedirectHandler(f func(req *http.Request, via [...
method SetCookies (line 1389) | func (c *Collector) SetCookies(URL string, cookies []*http.Cookie) err...
method Cookies (line 1402) | func (c *Collector) Cookies(URL string) []*http.Cookie {
method Clone (line 1416) | func (c *Collector) Clone() *Collector {
method checkRedirectFunc (line 1454) | func (c *Collector) checkRedirectFunc() func(req *http.Request, via []...
method parseSettingsFromEnv (line 1511) | func (c *Collector) parseSettingsFromEnv() {
method checkHasVisited (line 1525) | func (c *Collector) checkHasVisited(URL string, requestData map[string...
type RequestCallback (line 145) | type RequestCallback
type ResponseHeadersCallback (line 148) | type ResponseHeadersCallback
type ResponseCallback (line 151) | type ResponseCallback
type HTMLCallback (line 154) | type HTMLCallback
type XMLCallback (line 157) | type XMLCallback
type ErrorCallback (line 160) | type ErrorCallback
type ScrapedCallback (line 163) | type ScrapedCallback
type ProxyFunc (line 166) | type ProxyFunc
type AlreadyVisitedError (line 176) | type AlreadyVisitedError struct
method Error (line 184) | func (e *AlreadyVisitedError) Error() string {
type htmlCallbackContainer (line 188) | type htmlCallbackContainer struct
type xmlCallbackContainer (line 194) | type xmlCallbackContainer struct
type cookieJarSerializer (line 200) | type cookieJarSerializer struct
method SetCookies (line 1595) | func (j *cookieJarSerializer) SetCookies(u *url.URL, cookies []*http.C...
method Cookies (line 1612) | func (j *cookieJarSerializer) Cookies(u *url.URL) []*http.Cookie {
type key (line 209) | type key
constant ProxyURLKey (line 213) | ProxyURLKey key = iota
constant CheckRevisitKey (line 214) | CheckRevisitKey
constant envVariablePrefix (line 218) | envVariablePrefix = "COLLY_"
function NewCollector (line 313) | func NewCollector(options ...CollectorOption) *Collector {
function UserAgent (line 327) | func UserAgent(ua string) CollectorOption {
function Headers (line 334) | func Headers(headers map[string]string) CollectorOption {
function MaxDepth (line 345) | func MaxDepth(depth int) CollectorOption {
function MaxRequests (line 353) | func MaxRequests(max uint32) CollectorOption {
function AllowedDomains (line 360) | func AllowedDomains(domains ...string) CollectorOption {
function ParseHTTPErrorResponse (line 367) | func ParseHTTPErrorResponse() CollectorOption {
function DisallowedDomains (line 374) | func DisallowedDomains(domains ...string) CollectorOption {
function DisallowedURLFilters (line 382) | func DisallowedURLFilters(filters ...*regexp.Regexp) CollectorOption {
function URLFilters (line 390) | func URLFilters(filters ...*regexp.Regexp) CollectorOption {
function AllowURLRevisit (line 397) | func AllowURLRevisit() CollectorOption {
function MaxBodySize (line 404) | func MaxBodySize(sizeInBytes int) CollectorOption {
function CacheDir (line 411) | func CacheDir(path string) CollectorOption {
function IgnoreRobotsTxt (line 419) | func IgnoreRobotsTxt() CollectorOption {
function TraceHTTP (line 427) | func TraceHTTP() CollectorOption {
function StdlibContext (line 435) | func StdlibContext(ctx context.Context) CollectorOption {
function ID (line 442) | func ID(id uint32) CollectorOption {
function Async (line 449) | func Async(a ...bool) CollectorOption {
function DetectCharset (line 461) | func DetectCharset() CollectorOption {
function Debugger (line 468) | func Debugger(d debug.Debugger) CollectorOption {
function CheckHead (line 476) | func CheckHead() CollectorOption {
function CacheExpiration (line 484) | func CacheExpiration(d time.Duration) CollectorOption {
function createEvent (line 1124) | func createEvent(eventType string, requestID, collectorID uint32, kvargs...
function SanitizeFileName (line 1532) | func SanitizeFileName(fileName string) string {
function createFormReader (line 1545) | func createFormReader(data map[string]string) io.Reader {
function createMultipartReader (line 1553) | func createMultipartReader(boundary string, data map[string][]byte) io.R...
function randomBoundary (line 1574) | func randomBoundary() string {
function isYesString (line 1583) | func isYesString(s string) bool {
function createJar (line 1591) | func createJar(s storage.Storage) http.CookieJar {
function isMatchingFilter (line 1631) | func isMatchingFilter(fs []*regexp.Regexp, d []byte) bool {
function normalizeURL (line 1640) | func normalizeURL(u string) string {
function requestHash (line 1648) | func requestHash(url string, body io.Reader) uint64 {
FILE: colly_test.go
function newUnstartedTestServer (line 59) | func newUnstartedTestServer() *httptest.Server {
function newTestServer (line 296) | func newTestServer() *httptest.Server {
function TestNoAcceptHeader (line 477) | func TestNoAcceptHeader(t *testing.T) {
function TestNewCollector (line 510) | func TestNewCollector(t *testing.T) {
function TestCollectorVisit (line 518) | func TestCollectorVisit(t *testing.T) {
function TestCollectorVisitWithAllowedDomains (line 572) | func TestCollectorVisitWithAllowedDomains(t *testing.T) {
function TestCollectorVisitWithDisallowedDomains (line 588) | func TestCollectorVisitWithDisallowedDomains(t *testing.T) {
function TestCollectorVisitResponseHeaders (line 609) | func TestCollectorVisitResponseHeaders(t *testing.T) {
function TestCollectorOnHTML (line 631) | func TestCollectorOnHTML(t *testing.T) {
function TestCollectorContentSniffing (line 675) | func TestCollectorContentSniffing(t *testing.T) {
function TestCollectorURLRevisit (line 703) | func TestCollectorURLRevisit(t *testing.T) {
function TestCollectorPostRevisit (line 732) | func TestCollectorPostRevisit(t *testing.T) {
function TestCollectorURLRevisitCheck (line 771) | func TestCollectorURLRevisitCheck(t *testing.T) {
function TestSetCookieRedirect (line 832) | func TestSetCookieRedirect(t *testing.T) {
function TestCollectorPostURLRevisitCheck (line 859) | func TestCollectorPostURLRevisitCheck(t *testing.T) {
function TestCollectorURLRevisitDomainDisallowed (line 917) | func TestCollectorURLRevisitDomainDisallowed(t *testing.T) {
function TestCollectorPost (line 938) | func TestCollectorPost(t *testing.T) {
function TestCollectorPostRaw (line 956) | func TestCollectorPostRaw(t *testing.T) {
function TestCollectorPostRawRevisit (line 972) | func TestCollectorPostRawRevisit(t *testing.T) {
function TestRedirect (line 1006) | func TestRedirect(t *testing.T) {
function TestIssue594 (line 1032) | func TestIssue594(t *testing.T) {
function TestRedirectWithDisallowedURLs (line 1045) | func TestRedirectWithDisallowedURLs(t *testing.T) {
function TestBaseTag (line 1062) | func TestBaseTag(t *testing.T) {
function TestBaseTagRelative (line 1085) | func TestBaseTagRelative(t *testing.T) {
function TestTabsAndNewlines (line 1110) | func TestTabsAndNewlines(t *testing.T) {
function TestLonePercent (line 1142) | func TestLonePercent(t *testing.T) {
function TestCollectorCookies (line 1175) | func TestCollectorCookies(t *testing.T) {
function TestRobotsWhenAllowed (line 1190) | func TestRobotsWhenAllowed(t *testing.T) {
function TestRobotsWhenDisallowed (line 1210) | func TestRobotsWhenDisallowed(t *testing.T) {
function TestRobotsWhenDisallowedWithQueryParameter (line 1227) | func TestRobotsWhenDisallowedWithQueryParameter(t *testing.T) {
function TestIgnoreRobotsWhenDisallowed (line 1244) | func TestIgnoreRobotsWhenDisallowed(t *testing.T) {
function TestConnectionErrorOnRobotsTxtResultsInError (line 1265) | func TestConnectionErrorOnRobotsTxtResultsInError(t *testing.T) {
function TestEnvSettings (line 1278) | func TestEnvSettings(t *testing.T) {
function TestUserAgent (line 1302) | func TestUserAgent(t *testing.T) {
function TestHeaders (line 1382) | func TestHeaders(t *testing.T) {
function TestParseHTTPErrorResponse (line 1417) | func TestParseHTTPErrorResponse(t *testing.T) {
function TestHTMLElement (line 1448) | func TestHTMLElement(t *testing.T) {
function TestCollectorOnXMLWithHtml (line 1487) | func TestCollectorOnXMLWithHtml(t *testing.T) {
function TestCollectorOnXMLWithXML (line 1531) | func TestCollectorOnXMLWithXML(t *testing.T) {
function TestCollectorVisitWithTrace (line 1575) | func TestCollectorVisitWithTrace(t *testing.T) {
function TestCollectorVisitWithCheckHead (line 1592) | func TestCollectorVisitWithCheckHead(t *testing.T) {
function TestCollectorDepth (line 1611) | func TestCollectorDepth(t *testing.T) {
function TestCollectorRequests (line 1655) | func TestCollectorRequests(t *testing.T) {
function TestCollectorContext (line 1674) | func TestCollectorContext(t *testing.T) {
function BenchmarkOnHTML (line 1711) | func BenchmarkOnHTML(b *testing.B) {
function BenchmarkOnXML (line 1723) | func BenchmarkOnXML(b *testing.B) {
function BenchmarkOnResponse (line 1735) | func BenchmarkOnResponse(b *testing.B) {
function requireSessionCookieSimple (line 1748) | func requireSessionCookieSimple(handler http.Handler) http.Handler {
function requireSessionCookieAuthPage (line 1761) | func requireSessionCookieAuthPage(handler http.Handler) http.Handler {
function TestCallbackDetachment (line 1780) | func TestCallbackDetachment(t *testing.T) {
function TestCollectorPostRetry (line 1818) | func TestCollectorPostRetry(t *testing.T) {
function TestCollectorGetRetry (line 1844) | func TestCollectorGetRetry(t *testing.T) {
function TestCollectorPostRetryUnseekable (line 1869) | func TestCollectorPostRetryUnseekable(t *testing.T) {
function TestRedirectErrorRetry (line 1897) | func TestRedirectErrorRetry(t *testing.T) {
function TestCheckRequestHeadersFunc (line 1919) | func TestCheckRequestHeadersFunc(t *testing.T) {
function TestIssue745GzipURLWith404Response (line 1939) | func TestIssue745GzipURLWith404Response(t *testing.T) {
FILE: context.go
type Context (line 22) | type Context struct
method UnmarshalBinary (line 37) | func (c *Context) UnmarshalBinary(_ []byte) error {
method MarshalBinary (line 43) | func (c *Context) MarshalBinary() (_ []byte, _ error) {
method Put (line 48) | func (c *Context) Put(key string, value interface{}) {
method Get (line 56) | func (c *Context) Get(key string) string {
method GetAny (line 67) | func (c *Context) GetAny(key string) interface{} {
method ForEach (line 77) | func (c *Context) ForEach(fn func(k string, v interface{}) interface{}...
method Clone (line 90) | func (c *Context) Clone() *Context {
function NewContext (line 28) | func NewContext() *Context {
FILE: context_test.go
function TestContextIteration (line 22) | func TestContextIteration(t *testing.T) {
function TestContextClone (line 41) | func TestContextClone(t *testing.T) {
FILE: debug/debug.go
type Event (line 18) | type Event struct
type Debugger (line 31) | type Debugger interface
FILE: debug/logdebugger.go
type LogDebugger (line 26) | type LogDebugger struct
method Init (line 40) | func (l *LogDebugger) Init() error {
method Event (line 51) | func (l *LogDebugger) Event(e *Event) {
FILE: debug/webdebugger.go
type WebDebugger (line 26) | type WebDebugger struct
method Init (line 45) | func (w *WebDebugger) Init() error {
method Event (line 65) | func (w *WebDebugger) Event(e *Event) {
method indexHandler (line 86) | func (w *WebDebugger) indexHandler(wr http.ResponseWriter, r *http.Req...
method statusHandler (line 145) | func (w *WebDebugger) statusHandler(wr http.ResponseWriter, r *http.Re...
type requestInfo (line 35) | type requestInfo struct
FILE: extensions/random_user_agent.go
function RandomUserAgent (line 27) | func RandomUserAgent(c *colly.Collector) {
function RandomMobileUserAgent (line 34) | func RandomMobileUserAgent(c *colly.Collector) {
function genFirefoxUA (line 458) | func genFirefoxUA() string {
function genChromeUA (line 467) | func genChromeUA() string {
function genEdgeUA (line 476) | func genEdgeUA() string {
function genOperaUA (line 487) | func genOperaUA() string {
function genMobilePixel7UA (line 498) | func genMobilePixel7UA() string {
function genMobilePixel6UA (line 507) | func genMobilePixel6UA() string {
function genMobilePixel5UA (line 516) | func genMobilePixel5UA() string {
function genMobilePixel4UA (line 525) | func genMobilePixel4UA() string {
function genMobileNexus10UA (line 534) | func genMobileNexus10UA() string {
FILE: extensions/referer.go
function Referer (line 10) | func Referer(c *colly.Collector) {
FILE: extensions/url_length_filter.go
function URLLengthFilter (line 8) | func URLLengthFilter(c *colly.Collector, URLLengthLimit int) {
FILE: htmlelement.go
type HTMLElement (line 25) | type HTMLElement struct
method Attr (line 56) | func (h *HTMLElement) Attr(k string) string {
method ChildText (line 67) | func (h *HTMLElement) ChildText(goquerySelector string) string {
method ChildTexts (line 73) | func (h *HTMLElement) ChildTexts(goquerySelector string) []string {
method ChildAttr (line 84) | func (h *HTMLElement) ChildAttr(goquerySelector, attrName string) stri...
method ChildAttrs (line 93) | func (h *HTMLElement) ChildAttrs(goquerySelector, attrName string) []s...
method ForEach (line 105) | func (h *HTMLElement) ForEach(goquerySelector string, callback func(in...
method ForEachWithBreak (line 120) | func (h *HTMLElement) ForEachWithBreak(goquerySelector string, callbac...
function NewHTMLElementFromSelectionNode (line 42) | func NewHTMLElementFromSelectionNode(resp *Response, s *goquery.Selectio...
FILE: http_backend.go
type httpBackend (line 36) | type httpBackend struct
method Init (line 93) | func (h *httpBackend) Init(jar http.CookieJar) {
method GetMatchingRule (line 114) | func (h *httpBackend) GetMatchingRule(domain string) *LimitRule {
method Cache (line 128) | func (h *httpBackend) Cache(request *http.Request, bodySize int, check...
method Do (line 173) | func (h *httpBackend) Do(request *http.Request, bodySize int, checkReq...
method Limit (line 228) | func (h *httpBackend) Limit(rule *LimitRule) error {
method Limits (line 238) | func (h *httpBackend) Limits(rules []*LimitRule) error {
type checkResponseHeadersFunc (line 42) | type checkResponseHeadersFunc
type checkRequestHeadersFunc (line 43) | type checkRequestHeadersFunc
type LimitRule (line 51) | type LimitRule struct
method Init (line 68) | func (r *LimitRule) Init() error {
method Match (line 103) | func (r *LimitRule) Match(domain string) bool {
FILE: http_trace.go
type HTTPTrace (line 10) | type HTTPTrace struct
method trace (line 18) | func (ht *HTTPTrace) trace() *httptrace.ClientTrace {
method WithTrace (line 35) | func (ht *HTTPTrace) WithTrace(req *http.Request) *http.Request {
FILE: http_trace_test.go
constant testDelay (line 10) | testDelay = 200 * time.Millisecond
function newTraceTestServer (line 12) | func newTraceTestServer(delay time.Duration) *httptest.Server {
function TestTraceWithNoDelay (line 27) | func TestTraceWithNoDelay(t *testing.T) {
function TestTraceWithDelay (line 51) | func TestTraceWithDelay(t *testing.T) {
FILE: proxy/proxy.go
type roundRobinSwitcher (line 26) | type roundRobinSwitcher struct
method GetProxy (line 31) | func (r *roundRobinSwitcher) GetProxy(pr *http.Request) (*url.URL, err...
function RoundRobinProxySwitcher (line 45) | func RoundRobinProxySwitcher(ProxyURLs ...string) (colly.ProxyFunc, erro...
FILE: queue/queue.go
constant stop (line 12) | stop = true
type Storage (line 18) | type Storage interface
type Queue (line 32) | type Queue struct
method IsEmpty (line 75) | func (q *Queue) IsEmpty() bool {
method AddURL (line 81) | func (q *Queue) AddURL(URL string) error {
method AddRequest (line 102) | func (q *Queue) AddRequest(r *colly.Request) error {
method storeRequest (line 117) | func (q *Queue) storeRequest(r *colly.Request) error {
method Size (line 126) | func (q *Queue) Size() (int, error) {
method Run (line 133) | func (q *Queue) Run(c *colly.Collector) error {
method Stop (line 154) | func (q *Queue) Stop() {
method loop (line 160) | func (q *Queue) loop(c *colly.Collector, requestc chan<- *colly.Reques...
method loadRequest (line 214) | func (q *Queue) loadRequest(c *colly.Collector) (*colly.Request, error) {
type InMemoryQueueStorage (line 43) | type InMemoryQueueStorage struct
method Init (line 225) | func (q *InMemoryQueueStorage) Init() error {
method AddRequest (line 231) | func (q *InMemoryQueueStorage) AddRequest(r []byte) error {
method GetRequest (line 250) | func (q *InMemoryQueueStorage) GetRequest() ([]byte, error) {
method QueueSize (line 263) | func (q *InMemoryQueueStorage) QueueSize() (int, error) {
type inMemoryQueueItem (line 53) | type inMemoryQueueItem struct
function New (line 60) | func New(threads int, s Storage) (*Queue, error) {
function independentRunner (line 207) | func independentRunner(requestc <-chan *colly.Request, complete chan<- s...
FILE: queue/queue_test.go
function TestQueue (line 15) | func TestQueue(t *testing.T) {
function serverHandler (line 78) | func serverHandler(w http.ResponseWriter, req *http.Request) {
function serverRoute (line 84) | func serverRoute(w http.ResponseWriter, req *http.Request) bool {
function serveDelay (line 91) | func serveDelay(w http.ResponseWriter, req *http.Request) error {
function shutdown (line 102) | func shutdown(w http.ResponseWriter) {
FILE: request.go
type Request (line 27) | type Request struct
method New (line 67) | func (r *Request) New(method, URL string, body io.Reader) (*Request, e...
method Abort (line 89) | func (r *Request) Abort() {
method IsAbort (line 94) | func (r *Request) IsAbort() bool {
method AbsoluteURL (line 101) | func (r *Request) AbsoluteURL(u string) string {
method Visit (line 122) | func (r *Request) Visit(URL string) error {
method HasVisited (line 127) | func (r *Request) HasVisited(URL string) (bool, error) {
method Post (line 134) | func (r *Request) Post(URL string, requestData map[string]string) error {
method PostRaw (line 141) | func (r *Request) PostRaw(URL string, requestData []byte) error {
method PostMultipart (line 148) | func (r *Request) PostMultipart(URL string, requestData map[string][]b...
method Retry (line 157) | func (r *Request) Retry() error {
method Do (line 166) | func (r *Request) Do() error {
method Marshal (line 171) | func (r *Request) Marshal() ([]byte, error) {
type serializableRequest (line 55) | type serializableRequest struct
FILE: response.go
type Response (line 31) | type Response struct
method Save (line 48) | func (r *Response) Save(fileName string) error {
method FileName (line 54) | func (r *Response) FileName() string {
method fixCharset (line 65) | func (r *Response) fixCharset(detectCharset bool, defaultEncoding stri...
function encodeBytes (line 110) | func encodeBytes(b []byte, contentType string) ([]byte, error) {
FILE: storage/storage.go
type Storage (line 30) | type Storage interface
type InMemoryStorage (line 47) | type InMemoryStorage struct
method Init (line 54) | func (s *InMemoryStorage) Init() error {
method Visited (line 70) | func (s *InMemoryStorage) Visited(requestID uint64) error {
method IsVisited (line 78) | func (s *InMemoryStorage) IsVisited(requestID uint64) (bool, error) {
method Cookies (line 86) | func (s *InMemoryStorage) Cookies(u *url.URL) string {
method SetCookies (line 91) | func (s *InMemoryStorage) SetCookies(u *url.URL, cookies string) {
method Close (line 96) | func (s *InMemoryStorage) Close() error {
function StringifyCookies (line 101) | func StringifyCookies(cookies []*http.Cookie) string {
function UnstringifyCookies (line 111) | func UnstringifyCookies(s string) []*http.Cookie {
function ContainsCookie (line 121) | func ContainsCookie(cookies []*http.Cookie, name string) bool {
FILE: unmarshal.go
method Unmarshal (line 26) | func (h *HTMLElement) Unmarshal(v interface{}) error {
method UnmarshalWithMap (line 31) | func (h *HTMLElement) UnmarshalWithMap(v interface{}, structMap map[stri...
function UnmarshalHTML (line 51) | func UnmarshalHTML(v interface{}, s *goquery.Selection, structMap map[st...
function unmarshalSelector (line 86) | func unmarshalSelector(s *goquery.Selection, attrV reflect.Value, select...
function unmarshalAttr (line 120) | func unmarshalAttr(s *goquery.Selection, attrV reflect.Value, attrT refl...
function unmarshalStruct (line 150) | func unmarshalStruct(s *goquery.Selection, selector string, attrV reflec...
function unmarshalPtr (line 167) | func unmarshalPtr(s *goquery.Selection, selector string, attrV reflect.V...
function unmarshalSlice (line 188) | func unmarshalSlice(s *goquery.Selection, selector, htmlAttr string, att...
function getDOMValue (line 217) | func getDOMValue(s *goquery.Selection, attr string) string {
FILE: unmarshal_test.go
function TestBasicUnmarshal (line 28) | func TestBasicUnmarshal(t *testing.T) {
function TestNestedUnmarshalMap (line 51) | func TestNestedUnmarshalMap(t *testing.T) {
function TestNestedUnmarshal (line 85) | func TestNestedUnmarshal(t *testing.T) {
function TestPointerSliceUnmarshall (line 109) | func TestPointerSliceUnmarshall(t *testing.T) {
function TestStructSliceUnmarshall (line 137) | func TestStructSliceUnmarshall(t *testing.T) {
FILE: xmlelement.go
type XMLElement (line 26) | type XMLElement struct
method Attr (line 72) | func (h *XMLElement) Attr(k string) string {
method ChildText (line 91) | func (h *XMLElement) ChildText(xpathQuery string) string {
method ChildAttr (line 109) | func (h *XMLElement) ChildAttr(xpathQuery, attrName string) string {
method ChildAttrs (line 135) | func (h *XMLElement) ChildAttrs(xpathQuery, attrName string) []string {
method ChildTexts (line 159) | func (h *XMLElement) ChildTexts(xpathQuery string) []string {
function NewXMLElementFromHTMLNode (line 45) | func NewXMLElementFromHTMLNode(resp *Response, s *html.Node) *XMLElement {
function NewXMLElementFromXMLNode (line 58) | func NewXMLElementFromXMLNode(resp *Response, s *xmlquery.Node) *XMLElem...
FILE: xmlelement_test.go
constant htmlPage (line 27) | htmlPage = `
function TestAttr (line 51) | func TestAttr(t *testing.T) {
function TestChildText (line 66) | func TestChildText(t *testing.T) {
function TestChildTexts (line 80) | func TestChildTexts(t *testing.T) {
function TestChildAttr (line 93) | func TestChildAttr(t *testing.T) {
function TestChildAttrs (line 107) | func TestChildAttrs(t *testing.T) {
Condensed preview — 67 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (285K chars).
[
{
"path": ".codecov.yml",
"chars": 15,
"preview": "comment: false\n"
},
{
"path": ".github/ISSUE_TEMPLATE/bug_report.md",
"chars": 284,
"preview": "---\nname: Bug report\nabout: Create a report to help us improve\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n\n<!--\nRemember to"
},
{
"path": ".github/ISSUE_TEMPLATE/config.yml",
"chars": 173,
"preview": "blank_issues_enabled: true\ncontact_links:\n - name: Question\n url: https://stackoverflow.com/\n about: Questions sh"
},
{
"path": ".github/ISSUE_TEMPLATE/feature_request.md",
"chars": 214,
"preview": "---\nname: Feature request\nabout: Suggest an idea for this project\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n\n<!--\nLove col"
},
{
"path": ".github/workflows/ci.yml",
"chars": 1784,
"preview": "---\nname: CI\non:\n push:\n branches:\n - '**'\n pull_request:\n\njobs:\n test:\n name: Test ${{matrix.go}}\n run"
},
{
"path": "CHANGELOG.md",
"chars": 1261,
"preview": "# 2.1.0 - 2020.06.09\n\n - HTTP tracing support\n - New callback: OnResponseHeader\n - Queue fixes\n - New collector option: "
},
{
"path": "CONTRIBUTING.md",
"chars": 4493,
"preview": "# Contribute\n\n## Introduction\n\nFirst, thank you for considering contributing to colly! It's people like you that make th"
},
{
"path": "LICENSE.txt",
"chars": 11358,
"preview": "\n Apache License\n Version 2.0, January 2004\n "
},
{
"path": "README.md",
"chars": 6933,
"preview": "# Colly\n\nLightning Fast and Elegant Scraping Framework for Gophers\n\nColly provides a clean interface to write any kind o"
},
{
"path": "VERSION",
"chars": 6,
"preview": "2.1.0\n"
},
{
"path": "_examples/README.md",
"chars": 1625,
"preview": "# Colly examples\n\nThis folder provides easy to understand code snippets on how to get started with colly.\n\nTo execute an"
},
{
"path": "_examples/basic/basic.go",
"chars": 840,
"preview": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// Instantiate default collector\n\tc := c"
},
{
"path": "_examples/coursera_courses/coursera_courses.go",
"chars": 3713,
"preview": "package main\n\nimport (\n\t\"encoding/json\"\n\t\"log\"\n\t\"os\"\n\t\"strings\"\n\t\"time\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\n// Course sto"
},
{
"path": "_examples/cryptocoinmarketcap/cryptocoinmarketcap.go",
"chars": 1313,
"preview": "package main\n\nimport (\n\t\"encoding/csv\"\n\t\"log\"\n\t\"os\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\tfName := \"cryptoco"
},
{
"path": "_examples/error_handling/error_handling.go",
"chars": 486,
"preview": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// Create a collector\n\tc := colly.NewCol"
},
{
"path": "_examples/factba.se/factbase.go",
"chars": 1566,
"preview": "package main\n\nimport (\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"os\"\n\t\"strconv\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nvar baseSearchURL = \""
},
{
"path": "_examples/google_groups/google_groups.go",
"chars": 2284,
"preview": "package main\n\nimport (\n\t\"encoding/json\"\n\t\"flag\"\n\t\"log\"\n\t\"os\"\n\t\"strings\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\n// Mail is th"
},
{
"path": "_examples/hackernews_comments/hackernews_comments.go",
"chars": 1544,
"preview": "package main\n\nimport (\n\t\"encoding/json\"\n\t\"flag\"\n\t\"log\"\n\t\"os\"\n\t\"strconv\"\n\t\"strings\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nty"
},
{
"path": "_examples/instagram/instagram.go",
"chars": 4996,
"preview": "package main\n\nimport (\n\t\"crypto/md5\"\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"log\"\n\t\"net/url\"\n\t\"os\"\n\t\"regexp\"\n\t\"strings\"\n\n\t\"github.com/"
},
{
"path": "_examples/local_files/html/child_page/one.html",
"chars": 286,
"preview": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n <meta charset=\"UTF-8\">\n <meta name=\"viewport\" content=\"width=device-width"
},
{
"path": "_examples/local_files/html/child_page/three.html",
"chars": 288,
"preview": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n <meta charset=\"UTF-8\">\n <meta name=\"viewport\" content=\"width=device-width"
},
{
"path": "_examples/local_files/html/child_page/two.html",
"chars": 286,
"preview": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n <meta charset=\"UTF-8\">\n <meta name=\"viewport\" content=\"width=device-width"
},
{
"path": "_examples/local_files/html/index.html",
"chars": 462,
"preview": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n <meta charset=\"UTF-8\">\n <meta name=\"viewport\" content=\"width=device-width"
},
{
"path": "_examples/local_files/local_files.go",
"chars": 718,
"preview": "package main\n\nimport (\n\t\"fmt\"\n\t\"net/http\"\n\t\"os\"\n\t\"path/filepath\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\tdir, "
},
{
"path": "_examples/login/login.go",
"chars": 474,
"preview": "package main\n\nimport (\n\t\"log\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// create a new collector\n\tc := colly.Ne"
},
{
"path": "_examples/max_depth/max_depth.go",
"chars": 592,
"preview": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// Instantiate default collector\n\tc := c"
},
{
"path": "_examples/multipart/multipart.go",
"chars": 1504,
"preview": "package main\n\nimport (\n\t\"fmt\"\n\t\"io\"\n\t\"net/http\"\n\t\"os\"\n\t\"time\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc generateFormData()"
},
{
"path": "_examples/openedx_courses/openedx_courses.go",
"chars": 2176,
"preview": "package main\n\nimport (\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"strings\"\n\t\"time\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\n// DATE_FORMAT defa"
},
{
"path": "_examples/parallel/parallel.go",
"chars": 965,
"preview": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// Instantiate default collector\n\tc := c"
},
{
"path": "_examples/proxy_switcher/proxy_switcher.go",
"chars": 688,
"preview": "package main\n\nimport (\n\t\"bytes\"\n\t\"log\"\n\n\t\"github.com/gocolly/colly/v2\"\n\t\"github.com/gocolly/colly/v2/proxy\"\n)\n\nfunc main"
},
{
"path": "_examples/queue/queue.go",
"chars": 764,
"preview": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n\t\"github.com/gocolly/colly/v2/queue\"\n)\n\nfunc main() {\n\turl"
},
{
"path": "_examples/random_delay/random_delay.go",
"chars": 805,
"preview": "package main\n\nimport (\n\t\"fmt\"\n\t\"time\"\n\n\t\"github.com/gocolly/colly/v2\"\n\t\"github.com/gocolly/colly/v2/debug\"\n)\n\nfunc main("
},
{
"path": "_examples/rate_limit/rate_limit.go",
"chars": 769,
"preview": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n\t\"github.com/gocolly/colly/v2/debug\"\n)\n\nfunc main() {\n\turl"
},
{
"path": "_examples/reddit/reddit.go",
"chars": 1554,
"preview": "package main\n\nimport (\n\t\"fmt\"\n\t\"os\"\n\t\"time\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\ntype item struct {\n\tStoryURL string\n\tSou"
},
{
"path": "_examples/request_context/request_context.go",
"chars": 554,
"preview": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// Instantiate default collector\n\tc := c"
},
{
"path": "_examples/scraper_server/scraper_server.go",
"chars": 1283,
"preview": "package main\n\nimport (\n\t\"encoding/json\"\n\t\"log\"\n\t\"net/http\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\ntype pageInfo struct {\n\tSt"
},
{
"path": "_examples/shopify_sitemap/shopify_sitemap.go",
"chars": 657,
"preview": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// Array containing all the known URLs i"
},
{
"path": "_examples/url_filter/url_filter.go",
"chars": 935,
"preview": "package main\n\nimport (\n\t\"fmt\"\n\t\"regexp\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// Instantiate default collect"
},
{
"path": "_examples/xkcd_store/xkcd_store.go",
"chars": 1138,
"preview": "package main\n\nimport (\n\t\"encoding/csv\"\n\t\"log\"\n\t\"os\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\tfName := \"xkcd_sto"
},
{
"path": "cmd/colly/colly.go",
"chars": 2919,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "colly.go",
"chars": 48126,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "colly_test.go",
"chars": 45902,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "context.go",
"chars": 2459,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "context_test.go",
"chars": 1472,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "debug/debug.go",
"chars": 1215,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "debug/logdebugger.go",
"chars": 1587,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "debug/webdebugger.go",
"chars": 4594,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "extensions/extensions.go",
"chars": 84,
"preview": "// Package extensions implements various helper addons for Colly\npackage extensions\n"
},
{
"path": "extensions/random_user_agent.go",
"chars": 17936,
"preview": "package extensions\n\nimport (\n\t\"fmt\"\n\t\"math/rand\"\n\t\"strings\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nvar uaGens = []func() str"
},
{
"path": "extensions/referer.go",
"chars": 482,
"preview": "package extensions\n\nimport (\n\t\"github.com/gocolly/colly/v2\"\n)\n\n// Referer sets valid Referer HTTP header to requests.\n//"
},
{
"path": "extensions/url_length_filter.go",
"chars": 308,
"preview": "package extensions\n\nimport (\n\t\"github.com/gocolly/colly/v2\"\n)\n\n// URLLengthFilter filters out requests with URLs longer "
},
{
"path": "go.mod",
"chars": 905,
"preview": "module github.com/gocolly/colly/v2\n\ngo 1.24.0\n\ntoolchain go1.24.9\n\nrequire (\n\tgithub.com/PuerkitoBio/goquery v1.11.0\n\tgi"
},
{
"path": "go.sum",
"chars": 12851,
"preview": "github.com/PuerkitoBio/goquery v1.10.2 h1:7fh2BdHcG6VFZsK7toXBT/Bh1z5Wmy8Q9MV9HqT2AM8=\ngithub.com/PuerkitoBio/goquery v1"
},
{
"path": "htmlelement.go",
"chars": 4187,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "http_backend.go",
"chars": 6969,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "http_trace.go",
"chars": 1068,
"preview": "package colly\n\nimport (\n\t\"net/http\"\n\t\"net/http/httptrace\"\n\t\"time\"\n)\n\n// HTTPTrace provides a datastructure for storing a"
},
{
"path": "http_trace_test.go",
"chars": 1781,
"preview": "package colly\n\nimport (\n\t\"net/http\"\n\t\"net/http/httptest\"\n\t\"testing\"\n\t\"time\"\n)\n\nconst testDelay = 200 * time.Millisecond\n"
},
{
"path": "proxy/proxy.go",
"chars": 1677,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "queue/queue.go",
"chars": 5881,
"preview": "package queue\n\nimport (\n\t\"net/url\"\n\t\"sync\"\n\n\twhatwgUrl \"github.com/nlnwa/whatwg-url/url\"\n\n\t\"github.com/gocolly/colly/v2\""
},
{
"path": "queue/queue_test.go",
"chars": 2247,
"preview": "package queue\n\nimport (\n\t\"math/rand\"\n\t\"net/http\"\n\t\"net/http/httptest\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"testing\"\n\t\"time\"\n\n\t\"githu"
},
{
"path": "request.go",
"chars": 5908,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "response.go",
"chars": 3220,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "storage/storage.go",
"chars": 3528,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "unmarshal.go",
"chars": 6078,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "unmarshal_test.go",
"chars": 4785,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "xmlelement.go",
"chars": 4850,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
},
{
"path": "xmlelement_test.go",
"chars": 4299,
"preview": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use t"
}
]
About this extraction
This page contains the full source code of the gocolly/colly GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 67 files (253.0 KB), approximately 77.1k tokens, and a symbol index with 357 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.