[
  {
    "path": ".codecov.yml",
    "content": "comment: false\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.md",
    "content": "---\nname: Bug report\nabout: Create a report to help us improve\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n\n<!--\nRemember to include a code sample that reproduces the bug, if possible.\n\nLove colly? Please consider supporting our collective:\n👉  https://opencollective.com/colly/donate\n-->\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/config.yml",
    "content": "blank_issues_enabled: true\ncontact_links:\n  - name: Question\n    url: https://stackoverflow.com/\n    about: Questions should go to Stack Overflow. You can use the go-colly tag.\n"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature_request.md",
    "content": "---\nname: Feature request\nabout: Suggest an idea for this project\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n\n<!--\nLove colly? Please consider supporting our collective:\n👉  https://opencollective.com/colly/donate\n-->\n"
  },
  {
    "path": ".github/workflows/ci.yml",
    "content": "---\nname: CI\non:\n  push:\n    branches:\n      - '**'\n  pull_request:\n\njobs:\n  test:\n    name: Test ${{matrix.go}}\n    runs-on: [ubuntu-latest]\n    strategy:\n      fail-fast: false\n      max-parallel: 4\n      matrix:\n        go: [\n          \"1.24\",\n          \"1.23\",\n          \"1.22\",\n          \"1.21\",\n        ]\n\n    steps:\n      - name: Checkout branch\n        uses: actions/checkout@v2\n\n      - name: Setup go\n        uses: actions/setup-go@v2\n        with:\n          go-version: ${{matrix.go}}\n\n      - name: Test\n        run: |\n          go install golang.org/x/lint/golint@latest\n          OUT=\"$(go get -a)\"; test -z \"$OUT\" || (echo \"$OUT\" && exit 1)\n          OUT=\"$(gofmt -l -d ./)\"; test -z \"$OUT\" || (echo \"$OUT\" && exit 1)\n          golint -set_exit_status\n          go vet -v ./...\n          go test -race -v -coverprofile=coverage.txt -covermode=atomic ./...\n\n  build:\n    name: Build ${{matrix.go}}\n    runs-on: [ubuntu-latest]\n    strategy:\n      fail-fast: false\n      max-parallel: 4\n      matrix:\n        go: [\n          \"1.24\",\n          \"1.23\",\n          \"1.22\",\n          \"1.21\",\n        ]\n\n    steps:\n      - name: Checkout branch\n        uses: actions/checkout@v2\n\n      - name: Setup go\n        uses: actions/setup-go@v2\n        with:\n          go-version: ${{matrix.go}}\n\n      - name: Build\n        run: |\n          go install golang.org/x/lint/golint@latest\n          OUT=\"$(go get -a)\"; test -z \"$OUT\" || (echo \"$OUT\" && exit 1)\n          OUT=\"$(gofmt -l -d ./)\"; test -z \"$OUT\" || (echo \"$OUT\" && exit 1)\n          golint -set_exit_status\n          go build\n\n  codecov:\n    name: Codecov\n    runs-on: [ubuntu-latest]\n    needs:\n      - test\n      - build\n    steps:\n      - name: Run Codecov\n        run: bash <(curl -s https://codecov.io/bash)\n"
  },
  {
    "path": "CHANGELOG.md",
    "content": "# 2.1.0 - 2020.06.09\n\n - HTTP tracing support\n - New callback: OnResponseHeader\n - Queue fixes\n - New collector option: Collector.CheckHead\n - Proxy fixes\n - Fixed POST revisit checking\n - Updated dependencies\n\n# 2.0.0 - 2019.11.28\n\n - Breaking change: Change Collector.RedirectHandler member to Collector.SetRedirectHandler function\n - Go module support\n - Collector.HasVisited method added to be able to check if an url has been visited\n - Collector.SetClient method introduced\n - HTMLElement.ChildTexts method added\n - New user agents\n - Multiple bugfixes\n\n# 1.2.0 - 2019.02.13\n\n - Compatibility with the latest htmlquery package\n - New request shortcut for HEAD requests\n - Check URL availability before visiting\n - Fix proxy URL value\n - Request counter fix\n - Minor fixes in examples\n\n# 1.1.0 - 2018.08.13\n\n - Appengine integration takes context.Context instead of http.Request (API change)\n - Added \"Accept\" http header by default to every request\n - Support slices of pointers in unmarshal\n - Fixed a race condition in queues\n - ForEachWithBreak method added to HTMLElement\n - Added a local file example\n - Support gzip decompression of response bodies\n - Don't share waitgroup when cloning a collector\n - Fixed instagram example\n\n\n# 1.0.0 - 2018.05.13\n"
  },
  {
    "path": "CONTRIBUTING.md",
    "content": "# Contribute\n\n## Introduction\n\nFirst, thank you for considering contributing to colly! It's people like you that make the open source community such a great community! 😊\n\nWe welcome any type of contribution, not only code. You can help with \n- **QA**: file bug reports, the more details you can give the better (e.g. screenshots with the console open)\n- **Marketing**: writing blog posts, howto's, printing stickers, ...\n- **Community**: presenting the project at meetups, organizing a dedicated meetup for the local community, ...\n- **Code**: take a look at the [open issues](https://github.com/gocolly/colly/issues). Even if you can't write code, commenting on them, showing that you care about a given issue matters. It helps us triage them.\n- **Money**: we welcome financial contributions in full transparency on our [open collective](https://opencollective.com/colly).\n\n## Your First Contribution\n\nWorking on your first Pull Request? You can learn how from this *free* series, [How to Contribute to an Open Source Project on GitHub](https://app.egghead.io/playlists/how-to-contribute-to-an-open-source-project-on-github).\n\n## Submitting code\n\nAny code change should be submitted as a pull request. The description should explain what the code does and give steps to execute it. The pull request should also contain tests.\n\n## Code review process\n\nThe bigger the pull request, the longer it will take to review and merge. Try to break down large pull requests in smaller chunks that are easier to review and merge.\nIt is also always helpful to have some context for your pull request. What was the purpose? Why does it matter to you?\n\n## Financial contributions\n\nWe also welcome financial contributions in full transparency on our [open collective](https://opencollective.com/colly).\nAnyone can file an expense. 
If the expense makes sense for the development of the community, it will be \"merged\" in the ledger of our open collective by the core contributors and the person who filed the expense will be reimbursed.\n\n## Questions\n\nIf you have any questions, create an [issue](https://github.com/gocolly/colly/issues/new) (protip: do a quick search first to see if someone else didn't ask the same question before!).\nYou can also reach us at hello@colly.opencollective.com.\n\n## Credits\n\n### Contributors\n\nThank you to all the people who have already contributed to colly!\n<a href=\"graphs/contributors\"><img src=\"https://opencollective.com/colly/contributors.svg?width=890\" /></a>\n\n\n### Backers\n\nThank you to all our backers! [[Become a backer](https://opencollective.com/colly#backer)]\n\n<a href=\"https://opencollective.com/colly#backers\" target=\"_blank\"><img src=\"https://opencollective.com/colly/backers.svg?width=890\"></a>\n\n\n### Sponsors\n\nThank you to all our sponsors! (please ask your company to also support this open source project by [becoming a sponsor](https://opencollective.com/colly#sponsor))\n\n<a href=\"https://opencollective.com/colly/sponsor/0/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/0/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/1/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/1/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/2/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/2/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/3/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/3/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/4/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/4/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/5/website\" target=\"_blank\"><img 
src=\"https://opencollective.com/colly/sponsor/5/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/6/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/6/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/7/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/7/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/8/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/8/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/9/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/9/avatar.svg\"></a>\n\n<!-- This `CONTRIBUTING.md` is based on @nayafia's template https://github.com/nayafia/contributing-template -->\n"
  },
  {
    "path": "LICENSE.txt",
    "content": "\n                                 Apache License\n                           Version 2.0, January 2004\n                        http://www.apache.org/licenses/\n\n   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n\n   1. Definitions.\n\n      \"License\" shall mean the terms and conditions for use, reproduction,\n      and distribution as defined by Sections 1 through 9 of this document.\n\n      \"Licensor\" shall mean the copyright owner or entity authorized by\n      the copyright owner that is granting the License.\n\n      \"Legal Entity\" shall mean the union of the acting entity and all\n      other entities that control, are controlled by, or are under common\n      control with that entity. For the purposes of this definition,\n      \"control\" means (i) the power, direct or indirect, to cause the\n      direction or management of such entity, whether by contract or\n      otherwise, or (ii) ownership of fifty percent (50%) or more of the\n      outstanding shares, or (iii) beneficial ownership of such entity.\n\n      \"You\" (or \"Your\") shall mean an individual or Legal Entity\n      exercising permissions granted by this License.\n\n      \"Source\" form shall mean the preferred form for making modifications,\n      including but not limited to software source code, documentation\n      source, and configuration files.\n\n      \"Object\" form shall mean any form resulting from mechanical\n      transformation or translation of a Source form, including but\n      not limited to compiled object code, generated documentation,\n      and conversions to other media types.\n\n      \"Work\" shall mean the work of authorship, whether in Source or\n      Object form, made available under the License, as indicated by a\n      copyright notice that is included in or attached to the work\n      (an example is provided in the Appendix below).\n\n      \"Derivative Works\" shall mean any work, whether in Source or Object\n      
form, that is based on (or derived from) the Work and for which the\n      editorial revisions, annotations, elaborations, or other modifications\n      represent, as a whole, an original work of authorship. For the purposes\n      of this License, Derivative Works shall not include works that remain\n      separable from, or merely link (or bind by name) to the interfaces of,\n      the Work and Derivative Works thereof.\n\n      \"Contribution\" shall mean any work of authorship, including\n      the original version of the Work and any modifications or additions\n      to that Work or Derivative Works thereof, that is intentionally\n      submitted to Licensor for inclusion in the Work by the copyright owner\n      or by an individual or Legal Entity authorized to submit on behalf of\n      the copyright owner. For the purposes of this definition, \"submitted\"\n      means any form of electronic, verbal, or written communication sent\n      to the Licensor or its representatives, including but not limited to\n      communication on electronic mailing lists, source code control systems,\n      and issue tracking systems that are managed by, or on behalf of, the\n      Licensor for the purpose of discussing and improving the Work, but\n      excluding communication that is conspicuously marked or otherwise\n      designated in writing by the copyright owner as \"Not a Contribution.\"\n\n      \"Contributor\" shall mean Licensor and any individual or Legal Entity\n      on behalf of whom a Contribution has been received by Licensor and\n      subsequently incorporated within the Work.\n\n   2. Grant of Copyright License. 
Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      copyright license to reproduce, prepare Derivative Works of,\n      publicly display, publicly perform, sublicense, and distribute the\n      Work and such Derivative Works in Source or Object form.\n\n   3. Grant of Patent License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      (except as stated in this section) patent license to make, have made,\n      use, offer to sell, sell, import, and otherwise transfer the Work,\n      where such license applies only to those patent claims licensable\n      by such Contributor that are necessarily infringed by their\n      Contribution(s) alone or by combination of their Contribution(s)\n      with the Work to which such Contribution(s) was submitted. If You\n      institute patent litigation against any entity (including a\n      cross-claim or counterclaim in a lawsuit) alleging that the Work\n      or a Contribution incorporated within the Work constitutes direct\n      or contributory patent infringement, then any patent licenses\n      granted to You under this License for that Work shall terminate\n      as of the date such litigation is filed.\n\n   4. Redistribution. 
You may reproduce and distribute copies of the\n      Work or Derivative Works thereof in any medium, with or without\n      modifications, and in Source or Object form, provided that You\n      meet the following conditions:\n\n      (a) You must give any other recipients of the Work or\n          Derivative Works a copy of this License; and\n\n      (b) You must cause any modified files to carry prominent notices\n          stating that You changed the files; and\n\n      (c) You must retain, in the Source form of any Derivative Works\n          that You distribute, all copyright, patent, trademark, and\n          attribution notices from the Source form of the Work,\n          excluding those notices that do not pertain to any part of\n          the Derivative Works; and\n\n      (d) If the Work includes a \"NOTICE\" text file as part of its\n          distribution, then any Derivative Works that You distribute must\n          include a readable copy of the attribution notices contained\n          within such NOTICE file, excluding those notices that do not\n          pertain to any part of the Derivative Works, in at least one\n          of the following places: within a NOTICE text file distributed\n          as part of the Derivative Works; within the Source form or\n          documentation, if provided along with the Derivative Works; or,\n          within a display generated by the Derivative Works, if and\n          wherever such third-party notices normally appear. The contents\n          of the NOTICE file are for informational purposes only and\n          do not modify the License. 
You may add Your own attribution\n          notices within Derivative Works that You distribute, alongside\n          or as an addendum to the NOTICE text from the Work, provided\n          that such additional attribution notices cannot be construed\n          as modifying the License.\n\n      You may add Your own copyright statement to Your modifications and\n      may provide additional or different license terms and conditions\n      for use, reproduction, or distribution of Your modifications, or\n      for any such Derivative Works as a whole, provided Your use,\n      reproduction, and distribution of the Work otherwise complies with\n      the conditions stated in this License.\n\n   5. Submission of Contributions. Unless You explicitly state otherwise,\n      any Contribution intentionally submitted for inclusion in the Work\n      by You to the Licensor shall be under the terms and conditions of\n      this License, without any additional terms or conditions.\n      Notwithstanding the above, nothing herein shall supersede or modify\n      the terms of any separate license agreement you may have executed\n      with Licensor regarding such Contributions.\n\n   6. Trademarks. This License does not grant permission to use the trade\n      names, trademarks, service marks, or product names of the Licensor,\n      except as required for reasonable and customary use in describing the\n      origin of the Work and reproducing the content of the NOTICE file.\n\n   7. Disclaimer of Warranty. Unless required by applicable law or\n      agreed to in writing, Licensor provides the Work (and each\n      Contributor provides its Contributions) on an \"AS IS\" BASIS,\n      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n      implied, including, without limitation, any warranties or conditions\n      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n      PARTICULAR PURPOSE. 
You are solely responsible for determining the\n      appropriateness of using or redistributing the Work and assume any\n      risks associated with Your exercise of permissions under this License.\n\n   8. Limitation of Liability. In no event and under no legal theory,\n      whether in tort (including negligence), contract, or otherwise,\n      unless required by applicable law (such as deliberate and grossly\n      negligent acts) or agreed to in writing, shall any Contributor be\n      liable to You for damages, including any direct, indirect, special,\n      incidental, or consequential damages of any character arising as a\n      result of this License or out of the use or inability to use the\n      Work (including but not limited to damages for loss of goodwill,\n      work stoppage, computer failure or malfunction, or any and all\n      other commercial damages or losses), even if such Contributor\n      has been advised of the possibility of such damages.\n\n   9. Accepting Warranty or Additional Liability. While redistributing\n      the Work or Derivative Works thereof, You may choose to offer,\n      and charge a fee for, acceptance of support, warranty, indemnity,\n      or other liability obligations and/or rights consistent with this\n      License. However, in accepting such obligations, You may act only\n      on Your own behalf and on Your sole responsibility, not on behalf\n      of any other Contributor, and only if You agree to indemnify,\n      defend, and hold each Contributor harmless for any liability\n      incurred by, or claims asserted against, such Contributor by reason\n      of your accepting any such warranty or additional liability.\n\n   END OF TERMS AND CONDITIONS\n\n   APPENDIX: How to apply the Apache License to your work.\n\n      To apply the Apache License to your work, attach the following\n      boilerplate notice, with the fields enclosed by brackets \"[]\"\n      replaced with your own identifying information. 
(Don't include\n      the brackets!)  The text should be enclosed in the appropriate\n      comment syntax for the file format. We also recommend that a\n      file or class name and description of purpose be included on the\n      same \"printed page\" as the copyright notice for easier\n      identification within third-party archives.\n\n   Copyright [yyyy] [name of copyright owner]\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.\n"
  },
  {
    "path": "README.md",
    "content": "# Colly\n\nLightning Fast and Elegant Scraping Framework for Gophers\n\nColly provides a clean interface to write any kind of crawler/scraper/spider.\n\nWith Colly you can easily extract structured data from websites, which can be used for a wide range of applications, like data mining, data processing or archiving.\n\n[![GoDoc](https://godoc.org/github.com/gocolly/colly?status.svg)](https://pkg.go.dev/github.com/gocolly/colly/v2)\n[![Backers on Open Collective](https://opencollective.com/colly/backers/badge.svg)](#backers) [![Sponsors on Open Collective](https://opencollective.com/colly/sponsors/badge.svg)](#sponsors) [![build status](https://github.com/gocolly/colly/actions/workflows/ci.yml/badge.svg)](https://github.com/gocolly/colly/actions/workflows/ci.yml)\n[![report card](https://img.shields.io/badge/report%20card-a%2B-ff3333.svg?style=flat-square)](http://goreportcard.com/report/gocolly/colly)\n[![view examples](https://img.shields.io/badge/learn%20by-examples-0077b3.svg?style=flat-square)](https://github.com/gocolly/colly/tree/master/_examples)\n[![Code Coverage](https://img.shields.io/codecov/c/github/gocolly/colly/master.svg)](https://codecov.io/github/gocolly/colly?branch=master)\n[![FOSSA Status](https://app.fossa.io/api/projects/git%2Bgithub.com%2Fgocolly%2Fcolly.svg?type=shield)](https://app.fossa.io/projects/git%2Bgithub.com%2Fgocolly%2Fcolly?ref=badge_shield)\n[![Twitter URL](https://img.shields.io/badge/twitter-follow-green.svg)](https://twitter.com/gocolly)\n\n\n## Features\n\n-   Clean API\n-   Fast (>1k request/sec on a single core)\n-   Manages request delays and maximum concurrency per domain\n-   Automatic cookie and session handling\n-   Sync/async/parallel scraping\n-   Caching\n-   Automatic encoding of non-unicode responses\n-   Robots.txt support\n-   Distributed scraping\n-   Configuration via environment variables\n-   Extensions\n\n## Example\n\n```go\n\nimport 
(\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\tc := colly.NewCollector()\n\n\t// Find and visit all links\n\tc.OnHTML(\"a[href]\", func(e *colly.HTMLElement) {\n\t\te.Request.Visit(e.Attr(\"href\"))\n\t})\n\n\tc.OnRequest(func(r *colly.Request) {\n\t\tfmt.Println(\"Visiting\", r.URL)\n\t})\n\n\tc.Visit(\"http://go-colly.org/\")\n}\n```\n\nSee [examples folder](https://github.com/gocolly/colly/tree/master/_examples) for more detailed examples.\n\n## Installation\n\n`go get github.com/gocolly/colly/v2`\n\n\n## Bugs\n\nBugs or suggestions? Visit the [issue tracker](https://github.com/gocolly/colly/issues) or join `#colly` on freenode\n\n## Other Projects Using Colly\n\nBelow is a list of public, open source projects that use Colly:\n\n-   [greenpeace/check-my-pages](https://github.com/greenpeace/check-my-pages) Scraping script to test the Spanish Greenpeace web archive.\n-   [altsab/gowap](https://github.com/altsab/gowap) Wappalyzer implementation in Go.\n-   [jesuiscamille/goquotes](https://github.com/jesuiscamille/goquotes) A quotes scraper, making your day a little better!\n-   [jivesearch/jivesearch](https://github.com/jivesearch/jivesearch) A search engine that doesn't track you.\n-   [Leagify/colly-draft-prospects](https://github.com/Leagify/colly-draft-prospects) A scraper for future NFL Draft prospects.\n-   [lucasepe/go-ps4](https://github.com/lucasepe/go-ps4) Search playstation store for your favorite PS4 games using the command line.\n-   [yringler/inside-chassidus-scraper](https://github.com/yringler/inside-chassidus-scraper) Scrapes Rabbi Paltiel's web site for lesson metadata.\n-   [gamedb/gamedb](https://github.com/gamedb/gamedb) A database of Steam games.\n-   [lawzava/scrape](https://github.com/lawzava/scrape) CLI for email scraping from any website.\n-   [eureka101v/WeiboSpiderGo](https://github.com/eureka101v/WeiboSpiderGo) A sina weibo(chinese twitter) scraper\n-   [Go-phie/gophie](https://github.com/Go-phie/gophie) 
Search, Download and Stream movies from your terminal\n-   [imthaghost/goclone](https://github.com/imthaghost/goclone) Clone websites to your computer within seconds.\n-   [superiss/spidy](https://github.com/superiss/spidy) Crawl the web and collect expired domains.\n-   [docker-slim/docker-slim](https://github.com/docker-slim/docker-slim) Optimize your Docker containers to make them smaller and better.\n-   [seversky/gachifinder](https://github.com/seversky/gachifinder) an agent for asynchronous scraping, parsing and writing to some storages(elasticsearch for now)\n-   [eval-exec/goodreads](https://github.com/eval-exec/goodreads) crawl all tags and all pages of quotes from goodreads.\n\nIf you are using Colly in a project please send a pull request to add it to the list.\n\n## Contributors\n\nThis project exists thanks to all the people who contribute. [[Contribute]](CONTRIBUTING.md).\n<a href=\"https://github.com/gocolly/colly/graphs/contributors\"><img src=\"https://opencollective.com/colly/contributors.svg?width=890\" /></a>\n\n## Backers\n\nThank you to all our backers! 🙏 [[Become a backer](https://opencollective.com/colly#backer)]\n\n<a href=\"https://opencollective.com/colly#backers\" target=\"_blank\"><img src=\"https://opencollective.com/colly/backers.svg?width=890\"></a>\n\n## Sponsors\n\nSupport this project by becoming a sponsor. Your logo will show up here with a link to your website. 
[[Become a sponsor](https://opencollective.com/colly#sponsor)]\n\n<a href=\"https://opencollective.com/colly/sponsor/0/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/0/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/1/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/1/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/2/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/2/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/3/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/3/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/4/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/4/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/5/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/5/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/6/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/6/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/7/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/7/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/8/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/8/avatar.svg\"></a>\n<a href=\"https://opencollective.com/colly/sponsor/9/website\" target=\"_blank\"><img src=\"https://opencollective.com/colly/sponsor/9/avatar.svg\"></a>\n\n## License\n\n[![FOSSA Status](https://app.fossa.io/api/projects/git%2Bgithub.com%2Fgocolly%2Fcolly.svg?type=large)](https://app.fossa.io/projects/git%2Bgithub.com%2Fgocolly%2Fcolly?ref=badge_large)\n"
  },
  {
    "path": "VERSION",
    "content": "2.1.0\n"
  },
  {
    "path": "_examples/README.md",
    "content": "# Colly examples\n\nThis folder provides easy to understand code snippets on how to get started with colly.\n\nTo execute an example run `go run [example/example.go]`\n\n\n## Demo\n\n```\n$ go run rate_limit/rate_limit.go\n[000001] 1 [     1 - request] map[\"url\":\"https://httpbin.org/delay/2?n=4\"] (60.872µs)\n[000002] 1 [     2 - request] map[\"url\":\"https://httpbin.org/delay/2?n=2\"] (154.425µs)\n[000003] 1 [     3 - request] map[\"url\":\"https://httpbin.org/delay/2?n=0\"] (158.374µs)\n[000004] 1 [     5 - request] map[\"url\":\"https://httpbin.org/delay/2?n=3\"] (426.999µs)\n[000005] 1 [     4 - request] map[\"url\":\"https://httpbin.org/delay/2?n=1\"] (448.75µs)\n[000007] 1 [     2 - response] map[\"url\":\"https://httpbin.org/delay/2?n=2\" \"status\":\"OK\"] (2.855764394s)\n[000008] 1 [     2 - scraped] map[\"url\":\"https://httpbin.org/delay/2?n=2\"] (2.855797868s)\n[000006] 1 [     1 - response] map[\"url\":\"https://httpbin.org/delay/2?n=4\" \"status\":\"OK\"] (2.855756753s)\n[000009] 1 [     1 - scraped] map[\"url\":\"https://httpbin.org/delay/2?n=4\"] (2.855819581s)\n[000010] 1 [     3 - response] map[\"status\":\"OK\" \"url\":\"https://httpbin.org/delay/2?n=0\"] (5.002065299s)\n[000011] 1 [     3 - scraped] map[\"url\":\"https://httpbin.org/delay/2?n=0\"] (5.002103755s)\n[000012] 1 [     5 - response] map[\"status\":\"OK\" \"url\":\"https://httpbin.org/delay/2?n=3\"] (5.012080614s)\n[000013] 1 [     5 - scraped] map[\"url\":\"https://httpbin.org/delay/2?n=3\"] (5.012101056s)\n[000014] 1 [     4 - response] map[\"url\":\"https://httpbin.org/delay/2?n=1\" \"status\":\"OK\"] (7.155725591s)\n[000015] 1 [     4 - scraped] map[\"url\":\"https://httpbin.org/delay/2?n=1\"] (7.155759136s)\n\n```\n"
  },
  {
    "path": "_examples/basic/basic.go",
    "content": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// Instantiate default collector\n\tc := colly.NewCollector(\n\t\t// Visit only domains: hackerspaces.org, wiki.hackerspaces.org\n\t\tcolly.AllowedDomains(\"hackerspaces.org\", \"wiki.hackerspaces.org\"),\n\t)\n\n\t// On every <a> element which has an href attribute, call the callback\n\tc.OnHTML(\"a[href]\", func(e *colly.HTMLElement) {\n\t\tlink := e.Attr(\"href\")\n\t\t// Print link\n\t\tfmt.Printf(\"Link found: %q -> %s\\n\", e.Text, link)\n\t\t// Visit link found on page\n\t\t// Only links whose domain is in AllowedDomains are visited\n\t\tc.Visit(e.Request.AbsoluteURL(link))\n\t})\n\n\t// Before making a request print \"Visiting ...\"\n\tc.OnRequest(func(r *colly.Request) {\n\t\tfmt.Println(\"Visiting\", r.URL.String())\n\t})\n\n\t// Start scraping on https://hackerspaces.org\n\tc.Visit(\"https://hackerspaces.org/\")\n}\n"
  },
  {
    "path": "_examples/coursera_courses/coursera_courses.go",
    "content": "package main\n\nimport (\n\t\"encoding/json\"\n\t\"log\"\n\t\"os\"\n\t\"strings\"\n\t\"time\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\n// Course stores information about a coursera course\ntype Course struct {\n\tTitle       string\n\tDescription string\n\tCreator     string\n\tLevel       string\n\tURL         string\n\tLanguage    string\n\tCommitment  string\n\tRating      string\n}\n\nfunc main() {\n\tfName := \"courses.json\"\n\tfile, err := os.Create(fName)\n\tif err != nil {\n\t\tlog.Fatalf(\"Cannot create file %q: %s\\n\", fName, err)\n\t}\n\tdefer file.Close()\n\n\t// Instantiate default collector\n\tc := colly.NewCollector(\n\t\t// Visit only domains: coursera.org, www.coursera.org\n\t\tcolly.AllowedDomains(\"coursera.org\", \"www.coursera.org\"),\n\n\t\t// Cache responses to prevent multiple downloads of pages\n\t\t// even if the collector is restarted\n\t\tcolly.CacheDir(\"./coursera_cache\"),\n\t\t// Cached responses older than the specified duration will be refreshed\n\t\tcolly.CacheExpiration(24*time.Hour),\n\t)\n\n\t// Create another collector to scrape course details\n\tdetailCollector := c.Clone()\n\n\tcourses := make([]Course, 0, 200)\n\n\t// On every <a> element which has an \"href\" attribute, call the callback\n\tc.OnHTML(\"a[href]\", func(e *colly.HTMLElement) {\n\t\t// Skip the <a> element whose class attribute is exactly this long string,\n\t\t// as it is irrelevant to the crawl\n\t\tif e.Attr(\"class\") == \"Button_1qxkboh-o_O-primary_cv02ee-o_O-md_28awn8-o_O-primaryLink_109aggg\" {\n\t\t\treturn\n\t\t}\n\t\tlink := e.Attr(\"href\")\n\t\t// Return unless the link starts with /browse; also skip signup and login links\n\t\tif !strings.HasPrefix(link, \"/browse\") || strings.Contains(link, \"=signup\") || strings.Contains(link, \"=login\") {\n\t\t\treturn\n\t\t}\n\t\t// Start scraping the page under the link found\n\t\te.Request.Visit(link)\n\t})\n\n\t// Before making a request print \"Visiting ...\"\n\tc.OnRequest(func(r 
*colly.Request) {\n\t\tlog.Println(\"visiting\", r.URL.String())\n\t})\n\n\t// On every <a> element with collection-product-card class call callback\n\tc.OnHTML(`a.collection-product-card`, func(e *colly.HTMLElement) {\n\t\t// Activate detailCollector if the link contains \"coursera.org/learn\"\n\t\tcourseURL := e.Request.AbsoluteURL(e.Attr(\"href\"))\n\t\tif strings.Index(courseURL, \"coursera.org/learn\") != -1 {\n\t\t\tdetailCollector.Visit(courseURL)\n\t\t}\n\t})\n\n\t// Extract details of the course\n\tdetailCollector.OnHTML(`div[id=rendered-content]`, func(e *colly.HTMLElement) {\n\t\tlog.Println(\"Course found\", e.Request.URL)\n\t\ttitle := e.ChildText(\".banner-title\")\n\t\tif title == \"\" {\n\t\t\tlog.Println(\"No title found\", e.Request.URL)\n\t\t}\n\t\tcourse := Course{\n\t\t\tTitle:       title,\n\t\t\tURL:         e.Request.URL.String(),\n\t\t\tDescription: e.ChildText(\"div.content\"),\n\t\t\tCreator:     e.ChildText(\"li.banner-instructor-info > a > div > div > span\"),\n\t\t\tRating:      e.ChildText(\"span.number-rating\"),\n\t\t}\n\t\t// Iterate over div components and add details to course\n\t\te.ForEach(\".AboutCourse .ProductGlance > div\", func(_ int, el *colly.HTMLElement) {\n\t\t\tsvgTitle := strings.Split(el.ChildText(\"div:nth-child(1) svg title\"), \" \")\n\t\t\tlastWord := svgTitle[len(svgTitle)-1]\n\t\t\tswitch lastWord {\n\t\t\t// svg Title: Available Languages\n\t\t\tcase \"languages\":\n\t\t\t\tcourse.Language = el.ChildText(\"div:nth-child(2) > div:nth-child(1)\")\n\t\t\t// svg Title: Mixed/Beginner/Intermediate/Advanced Level\n\t\t\tcase \"Level\":\n\t\t\t\tcourse.Level = el.ChildText(\"div:nth-child(2) > div:nth-child(1)\")\n\t\t\t// svg Title: Hours to complete\n\t\t\tcase \"complete\":\n\t\t\t\tcourse.Commitment = el.ChildText(\"div:nth-child(2) > div:nth-child(1)\")\n\t\t\t}\n\t\t})\n\t\tcourses = append(courses, course)\n\t})\n\n\t// Start scraping on 
https://coursera.org/browse\n\tc.Visit(\"https://coursera.org/browse\")\n\n\tenc := json.NewEncoder(file)\n\tenc.SetIndent(\"\", \"  \")\n\n\t// Write the JSON to the output file\n\tenc.Encode(courses)\n}\n"
  },
  {
    "path": "_examples/cryptocoinmarketcap/cryptocoinmarketcap.go",
    "content": "package main\n\nimport (\n\t\"encoding/csv\"\n\t\"log\"\n\t\"os\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\tfName := \"cryptocoinmarketcap.csv\"\n\tfile, err := os.Create(fName)\n\tif err != nil {\n\t\tlog.Fatalf(\"Cannot create file %q: %s\\n\", fName, err)\n\t\treturn\n\t}\n\tdefer file.Close()\n\twriter := csv.NewWriter(file)\n\tdefer writer.Flush()\n\n\t// Write CSV header\n\twriter.Write([]string{\"Name\", \"Symbol\", \"Market Cap (USD)\", \"Price (USD)\", \"Circulating Supply (USD)\", \"Volume (24h)\", \"Change (1h)\", \"Change (24h)\", \"Change (7d)\"})\n\n\t// Instantiate default collector\n\tc := colly.NewCollector()\n\n\tc.OnHTML(\"tbody tr\", func(e *colly.HTMLElement) {\n\t\twriter.Write([]string{\n\t\t\te.ChildText(\".cmc-table__column-name\"),\n\t\t\te.ChildText(\".cmc-table__cell--sort-by__symbol\"),\n\t\t\te.ChildText(\".cmc-table__cell--sort-by__market-cap\"),\n\t\t\te.ChildText(\".cmc-table__cell--sort-by__price\"),\n\t\t\te.ChildText(\".cmc-table__cell--sort-by__circulating-supply\"),\n\t\t\te.ChildText(\".cmc-table__cell--sort-by__volume-24-h\"),\n\t\t\te.ChildText(\".cmc-table__cell--sort-by__percent-change-1-h\"),\n\t\t\te.ChildText(\".cmc-table__cell--sort-by__percent-change-24-h\"),\n\t\t\te.ChildText(\".cmc-table__cell--sort-by__percent-change-7-d\"),\n\t\t})\n\t})\n\n\tc.Visit(\"https://coinmarketcap.com/all/views/all/\")\n\n\tlog.Printf(\"Scraping finished, check file %q for results\\n\", fName)\n}\n"
  },
  {
    "path": "_examples/error_handling/error_handling.go",
    "content": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// Create a collector\n\tc := colly.NewCollector()\n\n\t// Set HTML callback\n\t// Won't be called if error occurs\n\tc.OnHTML(\"*\", func(e *colly.HTMLElement) {\n\t\tfmt.Println(e)\n\t})\n\n\t// Set error handler\n\tc.OnError(func(r *colly.Response, err error) {\n\t\tfmt.Println(\"Request URL:\", r.Request.URL, \"failed with response:\", r, \"\\nError:\", err)\n\t})\n\n\t// Start scraping\n\tc.Visit(\"https://definitely-not-a.website/\")\n}\n"
  },
  {
    "path": "_examples/factba.se/factbase.go",
    "content": "package main\n\nimport (\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"os\"\n\t\"strconv\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nvar baseSearchURL = \"https://factba.se/json/json-transcript.php?q=&f=&dt=&p=\"\nvar baseTranscriptURL = \"https://factba.se/transcript/\"\n\ntype result struct {\n\tSlug string `json:\"slug\"`\n\tDate string `json:\"date\"`\n}\n\ntype results struct {\n\tData []*result `json:\"data\"`\n}\n\ntype transcript struct {\n\tSpeaker string\n\tText    string\n}\n\nfunc main() {\n\tc := colly.NewCollector(\n\t\tcolly.AllowedDomains(\"factba.se\"),\n\t)\n\n\td := c.Clone()\n\n\td.OnHTML(\"body\", func(e *colly.HTMLElement) {\n\t\tt := make([]transcript, 0)\n\t\te.ForEach(\".topic-media-row\", func(_ int, el *colly.HTMLElement) {\n\t\t\tt = append(t, transcript{\n\t\t\t\tSpeaker: el.ChildText(\".speaker-label\"),\n\t\t\t\tText:    el.ChildText(\".transcript-text-block\"),\n\t\t\t})\n\t\t})\n\t\tjsonData, err := json.MarshalIndent(t, \"\", \"  \")\n\t\tif err != nil {\n\t\t\treturn\n\t\t}\n\t\tos.WriteFile(colly.SanitizeFileName(e.Request.Ctx.Get(\"date\")+\"_\"+e.Request.Ctx.Get(\"slug\"))+\".json\", jsonData, 0644)\n\t})\n\n\tstop := false\n\tc.OnResponse(func(r *colly.Response) {\n\t\trs := &results{}\n\t\terr := json.Unmarshal(r.Body, rs)\n\t\tif err != nil || len(rs.Data) == 0 {\n\t\t\tstop = true\n\t\t\treturn\n\t\t}\n\t\tfor _, res := range rs.Data {\n\t\t\tu := baseTranscriptURL + res.Slug\n\t\t\tctx := colly.NewContext()\n\t\t\tctx.Put(\"date\", res.Date)\n\t\t\tctx.Put(\"slug\", res.Slug)\n\t\t\td.Request(\"GET\", u, nil, ctx, nil)\n\t\t}\n\t})\n\n\tfor i := 1; i < 1000; i++ {\n\t\tif stop {\n\t\t\tbreak\n\t\t}\n\t\tif err := c.Visit(baseSearchURL + strconv.Itoa(i)); err != nil {\n\t\t\tfmt.Println(\"Error:\", err)\n\t\t\tbreak\n\t\t}\n\t}\n}\n"
  },
  {
    "path": "_examples/google_groups/google_groups.go",
    "content": "package main\n\nimport (\n\t\"encoding/json\"\n\t\"flag\"\n\t\"log\"\n\t\"os\"\n\t\"strings\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\n// Mail is the container of a single e-mail\ntype Mail struct {\n\tTitle   string\n\tLink    string\n\tAuthor  string\n\tDate    string\n\tMessage string\n}\n\nfunc main() {\n\tvar groupName string\n\tflag.StringVar(&groupName, \"group\", \"hspbp\", \"Google Groups group name\")\n\tflag.Parse()\n\n\tthreads := make(map[string][]Mail)\n\n\tthreadCollector := colly.NewCollector()\n\tmailCollector := colly.NewCollector()\n\n\t// Collect threads\n\tthreadCollector.OnHTML(\"tr\", func(e *colly.HTMLElement) {\n\t\tch := e.DOM.Children()\n\t\tauthor := ch.Eq(1).Text()\n\t\t// deleted topic\n\t\tif author == \"\" {\n\t\t\treturn\n\t\t}\n\n\t\ttitle := ch.Eq(0).Text()\n\t\tlink, _ := ch.Eq(0).Children().Eq(0).Attr(\"href\")\n\t\t// fix link to point to the pure HTML version of the thread\n\t\tlink = strings.Replace(link, \".com/d/topic\", \".com/forum/?_escaped_fragment_=topic\", 1)\n\t\tdate := ch.Eq(2).Text()\n\n\t\tlog.Printf(\"Thread found: %s %q %s %s\\n\", link, title, author, date)\n\t\tmailCollector.Visit(link)\n\t})\n\n\t// Visit next page\n\tthreadCollector.OnHTML(\"body > a[href]\", func(e *colly.HTMLElement) {\n\t\tlog.Println(\"Next page link found:\", e.Attr(\"href\"))\n\t\te.Request.Visit(e.Attr(\"href\"))\n\t})\n\n\t// Extract mails\n\tmailCollector.OnHTML(\"body\", func(e *colly.HTMLElement) {\n\t\t// Find subject\n\t\tthreadSubject := e.ChildText(\"h2\")\n\t\tif _, ok := threads[threadSubject]; !ok {\n\t\t\tthreads[threadSubject] = make([]Mail, 0, 8)\n\t\t}\n\n\t\t// Extract mails\n\t\te.ForEach(\"table tr\", func(_ int, el *colly.HTMLElement) {\n\t\t\tmail := Mail{\n\t\t\t\tTitle:   el.ChildText(\"td:nth-of-type(1)\"),\n\t\t\t\tLink:    el.ChildAttr(\"td:nth-of-type(1)\", \"href\"),\n\t\t\t\tAuthor:  el.ChildText(\"td:nth-of-type(2)\"),\n\t\t\t\tDate:    
el.ChildText(\"td:nth-of-type(3)\"),\n\t\t\t\tMessage: el.ChildText(\"td:nth-of-type(4)\"),\n\t\t\t}\n\t\t\tthreads[threadSubject] = append(threads[threadSubject], mail)\n\t\t})\n\n\t\t// Follow next page link\n\t\tif link, found := e.DOM.Find(\"> a[href]\").Attr(\"href\"); found {\n\t\t\te.Request.Visit(link)\n\t\t} else {\n\t\t\tlog.Printf(\"Thread %q done\\n\", threadSubject)\n\t\t}\n\t})\n\n\tthreadCollector.Visit(\"https://groups.google.com/forum/?_escaped_fragment_=forum/\" + groupName)\n\n\tenc := json.NewEncoder(os.Stdout)\n\tenc.SetIndent(\"\", \"  \")\n\n\t// Dump json to the standard output\n\tenc.Encode(threads)\n}\n"
  },
  {
    "path": "_examples/hackernews_comments/hackernews_comments.go",
    "content": "package main\n\nimport (\n\t\"encoding/json\"\n\t\"flag\"\n\t\"log\"\n\t\"os\"\n\t\"strconv\"\n\t\"strings\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\ntype comment struct {\n\tAuthor  string `selector:\"a.hnuser\"`\n\tURL     string `selector:\".age a[href]\" attr:\"href\"`\n\tComment string `selector:\".comment\"`\n\tReplies []*comment\n\tdepth   int\n}\n\nfunc main() {\n\tvar itemID string\n\tflag.StringVar(&itemID, \"id\", \"\", \"hackernews post id\")\n\tflag.Parse()\n\n\tif itemID == \"\" {\n\t\tlog.Println(\"Hackernews post id required\")\n\t\tos.Exit(1)\n\t}\n\n\tcomments := make([]*comment, 0)\n\n\t// Instantiate default collector\n\tc := colly.NewCollector()\n\n\t// Extract comment\n\tc.OnHTML(\".comment-tree tr.athing\", func(e *colly.HTMLElement) {\n\t\twidth, err := strconv.Atoi(e.ChildAttr(\"td.ind img\", \"width\"))\n\t\tif err != nil {\n\t\t\treturn\n\t\t}\n\t\t// hackernews uses 40px spacers to indent comment replies,\n\t\t// so we have to divide the width with it to get the depth\n\t\t// of the comment\n\t\tdepth := width / 40\n\t\tc := &comment{\n\t\t\tReplies: make([]*comment, 0),\n\t\t\tdepth:   depth,\n\t\t}\n\t\te.Unmarshal(c)\n\t\tc.Comment = strings.TrimSpace(c.Comment[:len(c.Comment)-5])\n\t\tif depth == 0 {\n\t\t\tcomments = append(comments, c)\n\t\t\treturn\n\t\t}\n\t\tparent := comments[len(comments)-1]\n\t\t// append comment to its parent\n\t\tfor i := 0; i < depth-1; i++ {\n\t\t\tparent = parent.Replies[len(parent.Replies)-1]\n\t\t}\n\t\tparent.Replies = append(parent.Replies, c)\n\t})\n\n\tc.Visit(\"https://news.ycombinator.com/item?id=\" + itemID)\n\n\tenc := json.NewEncoder(os.Stdout)\n\tenc.SetIndent(\"\", \"  \")\n\n\t// Dump json to the standard output\n\tenc.Encode(comments)\n}\n"
  },
  {
    "path": "_examples/instagram/instagram.go",
    "content": "package main\n\nimport (\n\t\"crypto/md5\"\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"log\"\n\t\"net/url\"\n\t\"os\"\n\t\"regexp\"\n\t\"strings\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\n// \"id\": user id, \"after\": end cursor\nconst nextPageURL string = `https://www.instagram.com/graphql/query/?query_hash=%s&variables=%s`\nconst nextPagePayload string = `{\"id\":\"%s\",\"first\":50,\"after\":\"%s\"}`\n\nvar requestID string\nvar requestIds [][]byte\nvar queryIdPattern = regexp.MustCompile(`queryId:\".{32}\"`)\n\ntype pageInfo struct {\n\tEndCursor string `json:\"end_cursor\"`\n\tNextPage  bool   `json:\"has_next_page\"`\n}\n\ntype mainPageData struct {\n\tRhxgis    string `json:\"rhx_gis\"`\n\tEntryData struct {\n\t\tProfilePage []struct {\n\t\t\tGraphql struct {\n\t\t\t\tUser struct {\n\t\t\t\t\tId    string `json:\"id\"`\n\t\t\t\t\tMedia struct {\n\t\t\t\t\t\tEdges []struct {\n\t\t\t\t\t\t\tNode struct {\n\t\t\t\t\t\t\t\tImageURL     string `json:\"display_url\"`\n\t\t\t\t\t\t\t\tThumbnailURL string `json:\"thumbnail_src\"`\n\t\t\t\t\t\t\t\tIsVideo      bool   `json:\"is_video\"`\n\t\t\t\t\t\t\t\tDate         int    `json:\"date\"`\n\t\t\t\t\t\t\t\tDimensions   struct {\n\t\t\t\t\t\t\t\t\tWidth  int `json:\"width\"`\n\t\t\t\t\t\t\t\t\tHeight int `json:\"height\"`\n\t\t\t\t\t\t\t\t} `json:\"dimensions\"`\n\t\t\t\t\t\t\t} `json:node\"`\n\t\t\t\t\t\t} `json:\"edges\"`\n\t\t\t\t\t\tPageInfo pageInfo `json:\"page_info\"`\n\t\t\t\t\t} `json:\"edge_owner_to_timeline_media\"`\n\t\t\t\t} `json:\"user\"`\n\t\t\t} `json:\"graphql\"`\n\t\t} `json:\"ProfilePage\"`\n\t} `json:\"entry_data\"`\n}\n\ntype nextPageData struct {\n\tData struct {\n\t\tUser struct {\n\t\t\tContainer struct {\n\t\t\t\tPageInfo pageInfo `json:\"page_info\"`\n\t\t\t\tEdges    []struct {\n\t\t\t\t\tNode struct {\n\t\t\t\t\t\tImageURL     string `json:\"display_url\"`\n\t\t\t\t\t\tThumbnailURL string `json:\"thumbnail_src\"`\n\t\t\t\t\t\tIsVideo      bool   
`json:\"is_video\"`\n\t\t\t\t\t\tDate         int    `json:\"taken_at_timestamp\"`\n\t\t\t\t\t\tDimensions   struct {\n\t\t\t\t\t\t\tWidth  int `json:\"width\"`\n\t\t\t\t\t\t\tHeight int `json:\"height\"`\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t} `json:\"edges\"`\n\t\t\t} `json:\"edge_owner_to_timeline_media\"`\n\t\t}\n\t} `json:\"data\"`\n}\n\nfunc main() {\n\tif len(os.Args) != 2 {\n\t\tlog.Println(\"Missing account name argument\")\n\t\tos.Exit(1)\n\t}\n\n\tvar actualUserId string\n\tinstagramAccount := os.Args[1]\n\toutputDir := fmt.Sprintf(\"./instagram_%s/\", instagramAccount)\n\n\tc := colly.NewCollector(\n\t\t//colly.CacheDir(\"./_instagram_cache/\"),\n\t\tcolly.UserAgent(\"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36\"),\n\t)\n\n\tc.OnRequest(func(r *colly.Request) {\n\t\tr.Headers.Set(\"X-Requested-With\", \"XMLHttpRequest\")\n\t\tr.Headers.Set(\"Referer\", \"https://www.instagram.com/\"+instagramAccount)\n\t\tif r.Ctx.Get(\"gis\") != \"\" {\n\t\t\tgis := fmt.Sprintf(\"%s:%s\", r.Ctx.Get(\"gis\"), r.Ctx.Get(\"variables\"))\n\t\t\th := md5.New()\n\t\t\th.Write([]byte(gis))\n\t\t\tgisHash := fmt.Sprintf(\"%x\", h.Sum(nil))\n\t\t\tr.Headers.Set(\"X-Instagram-GIS\", gisHash)\n\t\t}\n\t})\n\n\tc.OnHTML(\"html\", func(e *colly.HTMLElement) {\n\t\td := c.Clone()\n\t\td.OnResponse(func(r *colly.Response) {\n\t\t\trequestIds = queryIdPattern.FindAll(r.Body, -1)\n\t\t\trequestID = string(requestIds[1][9:41])\n\t\t})\n\t\trequestIDURL := e.Request.AbsoluteURL(e.ChildAttr(`link[as=\"script\"]`, \"href\"))\n\t\td.Visit(requestIDURL)\n\n\t\tdat := e.ChildText(\"body > script:first-of-type\")\n\t\tjsonData := dat[strings.Index(dat, \"{\") : len(dat)-1]\n\t\tdata := &mainPageData{}\n\t\terr := json.Unmarshal([]byte(jsonData), data)\n\t\tif err != nil {\n\t\t\tlog.Fatal(err)\n\t\t}\n\n\t\tlog.Println(\"saving output to \", outputDir)\n\t\tos.MkdirAll(outputDir, os.ModePerm)\n\t\tpage := 
data.EntryData.ProfilePage[0]\n\t\tactualUserId = page.Graphql.User.Id\n\t\tfor _, obj := range page.Graphql.User.Media.Edges {\n\t\t\t// skip videos\n\t\t\tif obj.Node.IsVideo {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tc.Visit(obj.Node.ImageURL)\n\t\t}\n\t\tnextPageVars := fmt.Sprintf(nextPagePayload, actualUserId, page.Graphql.User.Media.PageInfo.EndCursor)\n\t\te.Request.Ctx.Put(\"variables\", nextPageVars)\n\t\tif page.Graphql.User.Media.PageInfo.NextPage {\n\t\t\tu := fmt.Sprintf(\n\t\t\t\tnextPageURL,\n\t\t\t\trequestID,\n\t\t\t\turl.QueryEscape(nextPageVars),\n\t\t\t)\n\t\t\tlog.Println(\"Next page found\", u)\n\t\t\te.Request.Ctx.Put(\"gis\", data.Rhxgis)\n\t\t\te.Request.Visit(u)\n\t\t}\n\t})\n\n\tc.OnError(func(r *colly.Response, e error) {\n\t\tlog.Println(\"error:\", e, r.Request.URL, string(r.Body))\n\t})\n\n\tc.OnResponse(func(r *colly.Response) {\n\t\tif strings.Index(r.Headers.Get(\"Content-Type\"), \"image\") > -1 {\n\t\t\tr.Save(outputDir + r.FileName())\n\t\t\treturn\n\t\t}\n\n\t\tif strings.Index(r.Headers.Get(\"Content-Type\"), \"json\") == -1 {\n\t\t\treturn\n\t\t}\n\n\t\tdata := &nextPageData{}\n\t\terr := json.Unmarshal(r.Body, data)\n\t\tif err != nil {\n\t\t\tlog.Fatal(err)\n\t\t}\n\n\t\tfor _, obj := range data.Data.User.Container.Edges {\n\t\t\t// skip videos\n\t\t\tif obj.Node.IsVideo {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tc.Visit(obj.Node.ImageURL)\n\t\t}\n\t\tif data.Data.User.Container.PageInfo.NextPage {\n\t\t\tnextPageVars := fmt.Sprintf(nextPagePayload, actualUserId, data.Data.User.Container.PageInfo.EndCursor)\n\t\t\tr.Request.Ctx.Put(\"variables\", nextPageVars)\n\t\t\tu := fmt.Sprintf(\n\t\t\t\tnextPageURL,\n\t\t\t\trequestID,\n\t\t\t\turl.QueryEscape(nextPageVars),\n\t\t\t)\n\t\t\tlog.Println(\"Next page found\", u)\n\t\t\tr.Request.Visit(u)\n\t\t}\n\t})\n\n\tc.Visit(\"https://instagram.com/\" + instagramAccount)\n}\n"
  },
  {
    "path": "_examples/local_files/html/child_page/one.html",
    "content": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"UTF-8\">\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n    <meta http-equiv=\"X-UA-Compatible\" content=\"ie=edge\">\n    <title>Document</title>\n</head>\n<body>\n    <h1>Child Page One</h1>\n</body>\n</html>"
  },
  {
    "path": "_examples/local_files/html/child_page/three.html",
    "content": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"UTF-8\">\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n    <meta http-equiv=\"X-UA-Compatible\" content=\"ie=edge\">\n    <title>Document</title>\n</head>\n<body>\n    <h1>Child Page Three</h1>\n</body>\n</html>"
  },
  {
    "path": "_examples/local_files/html/child_page/two.html",
    "content": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"UTF-8\">\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n    <meta http-equiv=\"X-UA-Compatible\" content=\"ie=edge\">\n    <title>Document</title>\n</head>\n<body>\n    <h1>Child Page Two</h1>\n</body>\n</html>"
  },
  {
    "path": "_examples/local_files/html/index.html",
    "content": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"UTF-8\">\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n    <meta http-equiv=\"X-UA-Compatible\" content=\"ie=edge\">\n    <title>Document</title>\n</head>\n<body>\n    <h1>Index.html</h1>\n    <ul>\n        <li><a href=\"/child_page/one.html\"></a></li>\n        <li><a href=\"/child_page/two.html\"></a></li>\n        <li><a href=\"/child_page/three.html\"></a></li>\n    </ul>\n</body>\n</html>"
  },
  {
    "path": "_examples/local_files/local_files.go",
    "content": "package main\n\nimport (\n\t\"fmt\"\n\t\"net/http\"\n\t\"os\"\n\t\"path/filepath\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\tdir, err := filepath.Abs(filepath.Dir(os.Args[0]))\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\n\tt := &http.Transport{}\n\tt.RegisterProtocol(\"file\", http.NewFileTransport(http.Dir(\"/\")))\n\n\tc := colly.NewCollector()\n\tc.WithTransport(t)\n\n\tpages := []string{}\n\n\tc.OnHTML(\"h1\", func(e *colly.HTMLElement) {\n\t\tpages = append(pages, e.Text)\n\t})\n\n\tc.OnHTML(\"a\", func(e *colly.HTMLElement) {\n\t\tc.Visit(\"file://\" + dir + \"/html\" + e.Attr(\"href\"))\n\t})\n\n\tfmt.Println(\"file://\" + dir + \"/html/index.html\")\n\tc.Visit(\"file://\" + dir + \"/html/index.html\")\n\tc.Wait()\n\tfor i, p := range pages {\n\t\tfmt.Printf(\"%d : %s\\n\", i, p)\n\t}\n}\n"
  },
  {
    "path": "_examples/login/login.go",
    "content": "package main\n\nimport (\n\t\"log\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// create a new collector\n\tc := colly.NewCollector()\n\n\t// authenticate\n\terr := c.Post(\"http://example.com/login\", map[string]string{\"username\": \"admin\", \"password\": \"admin\"})\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\n\t// attach callbacks after login\n\tc.OnResponse(func(r *colly.Response) {\n\t\tlog.Println(\"response received\", r.StatusCode)\n\t})\n\n\t// start scraping\n\tc.Visit(\"https://example.com/\")\n}\n"
  },
  {
    "path": "_examples/max_depth/max_depth.go",
    "content": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// Instantiate default collector\n\tc := colly.NewCollector(\n\t\t// MaxDepth is 1, so only the links on the scraped page\n\t\t// is visited, and no further links are followed\n\t\tcolly.MaxDepth(1),\n\t)\n\n\t// On every a element which has href attribute call callback\n\tc.OnHTML(\"a[href]\", func(e *colly.HTMLElement) {\n\t\tlink := e.Attr(\"href\")\n\t\t// Print link\n\t\tfmt.Println(link)\n\t\t// Visit link found on page\n\t\te.Request.Visit(link)\n\t})\n\n\t// Start scraping on https://en.wikipedia.org\n\tc.Visit(\"https://en.wikipedia.org/\")\n}\n"
  },
  {
    "path": "_examples/multipart/multipart.go",
    "content": "package main\n\nimport (\n\t\"fmt\"\n\t\"io\"\n\t\"net/http\"\n\t\"os\"\n\t\"time\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc generateFormData() map[string][]byte {\n\tf, _ := os.Open(\"gocolly.jpg\")\n\tdefer f.Close()\n\n\timgData, _ := io.ReadAll(f)\n\n\treturn map[string][]byte{\n\t\t\"firstname\": []byte(\"one\"),\n\t\t\"lastname\":  []byte(\"two\"),\n\t\t\"email\":     []byte(\"onetwo@example.com\"),\n\t\t\"file\":      imgData,\n\t}\n}\n\nfunc setupServer() {\n\tvar handler http.HandlerFunc = func(w http.ResponseWriter, r *http.Request) {\n\t\tfmt.Println(\"received request\")\n\t\terr := r.ParseMultipartForm(10000000)\n\t\tif err != nil {\n\t\t\tfmt.Println(\"server: Error\")\n\t\t\tw.WriteHeader(500)\n\t\t\tw.Write([]byte(\"<html><body>Internal Server Error</body></html>\"))\n\t\t\treturn\n\t\t}\n\t\tw.WriteHeader(200)\n\t\tfmt.Println(\"server: OK\")\n\t\tw.Write([]byte(\"<html><body>Success</body></html>\"))\n\t}\n\n\tgo http.ListenAndServe(\":8080\", handler)\n}\n\nfunc main() {\n\t// Start a single route http server to post an image to.\n\tsetupServer()\n\n\tc := colly.NewCollector(colly.AllowURLRevisit(), colly.MaxDepth(5))\n\n\t// On every a element which has href attribute call callback\n\tc.OnHTML(\"html\", func(e *colly.HTMLElement) {\n\t\tfmt.Println(e.Text)\n\t\ttime.Sleep(1 * time.Second)\n\t\te.Request.PostMultipart(\"http://localhost:8080/\", generateFormData())\n\t})\n\n\t// Before making a request print \"Visiting ...\"\n\tc.OnRequest(func(r *colly.Request) {\n\t\tfmt.Println(\"Posting gocolly.jpg to\", r.URL.String())\n\t})\n\n\t// Start scraping\n\tc.PostMultipart(\"http://localhost:8080/\", generateFormData())\n\tc.Wait()\n}\n"
  },
  {
    "path": "_examples/openedx_courses/openedx_courses.go",
    "content": "package main\n\nimport (\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"strings\"\n\t\"time\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\n// DATE_FORMAT default format date used in openedx\nconst DATE_FORMAT = \"02 Jan, 2006\"\n\n// Course store openedx course data\ntype Course struct {\n\tCourseID  string\n\tRun       string\n\tName      string\n\tNumber    string\n\tStartDate *time.Time\n\tEndDate   *time.Time\n\tURL       string\n}\n\nfunc main() {\n\t// Instantiate default collector\n\tc := colly.NewCollector(\n\t\t// Using IndonesiaX as sample\n\t\tcolly.AllowedDomains(\"indonesiax.co.id\", \"www.indonesiax.co.id\"),\n\n\t\t// Cache responses to prevent multiple download of pages\n\t\t// even if the collector is restarted\n\t\tcolly.CacheDir(\"./cache\"),\n\t)\n\n\tcourses := make([]Course, 0, 200)\n\n\t// On every a element which has href attribute call callback\n\tc.OnHTML(\"a[href]\", func(e *colly.HTMLElement) {\n\t\tlink := e.Attr(\"href\")\n\t\tif !strings.HasPrefix(link, \"/courses/\") {\n\t\t\treturn\n\t\t}\n\t\t// start scraping the page under the link found\n\t\te.Request.Visit(link)\n\t})\n\n\tc.OnHTML(\"div[class=main-container]\", func(e *colly.HTMLElement) {\n\t\tif e.DOM.Find(\"section#course-info\").Length() == 0 {\n\t\t\treturn\n\t\t}\n\t\ttitle := strings.Split(e.ChildText(\".course-info__title\"), \"\\n\")[0]\n\t\tcourse_id := e.ChildAttr(\"input[name=course_id]\", \"value\")\n\t\ttexts := e.ChildTexts(\"span[data-datetime]\")\n\t\tstart_date, _ := time.Parse(DATE_FORMAT, texts[0])\n\t\tend_date, _ := time.Parse(DATE_FORMAT, texts[1])\n\t\tvar run string\n\t\tif len(strings.Split(course_id, \"_\")) > 1 {\n\t\t\trun = strings.Split(course_id, \"_\")[1]\n\t\t}\n\t\tcourse := Course{\n\t\t\tCourseID:  course_id,\n\t\t\tRun:       run,\n\t\t\tName:      title,\n\t\t\tNumber:    e.ChildText(\"span.course-number\"),\n\t\t\tStartDate: &start_date,\n\t\t\tEndDate:   &end_date,\n\t\t\tURL:       fmt.Sprintf(\"/courses/%s/about\", 
course_id),\n\t\t}\n\t\tcourses = append(courses, course)\n\t})\n\n\t// Start scraping on https://openedxdomain/courses\n\tc.Visit(\"https://www.indonesiax.co.id/courses\")\n\n\t// Convert results to JSON data if the scraping job has finished\n\tjsonData, err := json.MarshalIndent(courses, \"\", \"  \")\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\n\t// Dump json to the standard output (can be redirected to a file)\n\tfmt.Println(string(jsonData))\n}\n"
  },
  {
    "path": "_examples/parallel/parallel.go",
    "content": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// Instantiate default collector\n\tc := colly.NewCollector(\n\t\t// MaxDepth is 2, so only the links on the scraped page\n\t\t// and links on those pages are visited\n\t\tcolly.MaxDepth(2),\n\t\tcolly.Async(),\n\t)\n\n\t// Limit the maximum parallelism to 2\n\t// This is necessary if the goroutines are dynamically\n\t// created to control the limit of simultaneous requests.\n\t//\n\t// Parallelism can be controlled also by spawning fixed\n\t// number of go routines.\n\tc.Limit(&colly.LimitRule{DomainGlob: \"*\", Parallelism: 2})\n\n\t// On every a element which has href attribute call callback\n\tc.OnHTML(\"a[href]\", func(e *colly.HTMLElement) {\n\t\tlink := e.Attr(\"href\")\n\t\t// Print link\n\t\tfmt.Println(link)\n\t\t// Visit link found on page on a new thread\n\t\te.Request.Visit(link)\n\t})\n\n\t// Start scraping on https://en.wikipedia.org\n\tc.Visit(\"https://en.wikipedia.org/\")\n\t// Wait until threads are finished\n\tc.Wait()\n}\n"
  },
  {
    "path": "_examples/proxy_switcher/proxy_switcher.go",
    "content": "package main\n\nimport (\n\t\"bytes\"\n\t\"log\"\n\n\t\"github.com/gocolly/colly/v2\"\n\t\"github.com/gocolly/colly/v2/proxy\"\n)\n\nfunc main() {\n\t// Instantiate default collector\n\tc := colly.NewCollector(colly.AllowURLRevisit())\n\n\t// Rotate two socks5 proxies\n\trp, err := proxy.RoundRobinProxySwitcher(\"socks5://127.0.0.1:1337\", \"socks5://127.0.0.1:1338\")\n\tif err != nil {\n\t\tlog.Fatal(err)\n\t}\n\tc.SetProxyFunc(rp)\n\n\t// Print the response\n\tc.OnResponse(func(r *colly.Response) {\n\t\tlog.Printf(\"Proxy Address: %s\\n\", r.Request.ProxyURL)\n\t\tlog.Printf(\"%s\\n\", bytes.Replace(r.Body, []byte(\"\\n\"), nil, -1))\n\t})\n\n\t// Fetch httpbin.org/ip five times\n\tfor i := 0; i < 5; i++ {\n\t\tc.Visit(\"https://httpbin.org/ip\")\n\t}\n}\n"
  },
  {
    "path": "_examples/queue/queue.go",
    "content": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n\t\"github.com/gocolly/colly/v2/queue\"\n)\n\nfunc main() {\n\turl := \"https://httpbin.org/delay/1\"\n\n\t// Instantiate default collector\n\tc := colly.NewCollector(colly.AllowURLRevisit())\n\n\t// create a request queue with 2 consumer threads\n\tq, _ := queue.New(\n\t\t2, // Number of consumer threads\n\t\t&queue.InMemoryQueueStorage{MaxSize: 10000}, // Use default queue storage\n\t)\n\n\tc.OnRequest(func(r *colly.Request) {\n\t\tfmt.Println(\"visiting\", r.URL)\n\t\tif r.ID < 15 {\n\t\t\tr2, err := r.New(\"GET\", fmt.Sprintf(\"%s?x=%v\", url, r.ID), nil)\n\t\t\tif err == nil {\n\t\t\t\tq.AddRequest(r2)\n\t\t\t}\n\t\t}\n\t})\n\n\tfor i := 0; i < 5; i++ {\n\t\t// Add URLs to the queue\n\t\tq.AddURL(fmt.Sprintf(\"%s?n=%d\", url, i))\n\t}\n\t// Consume URLs\n\tq.Run(c)\n\n}\n"
  },
  {
    "path": "_examples/random_delay/random_delay.go",
    "content": "package main\n\nimport (\n\t\"fmt\"\n\t\"time\"\n\n\t\"github.com/gocolly/colly/v2\"\n\t\"github.com/gocolly/colly/v2/debug\"\n)\n\nfunc main() {\n\turl := \"https://httpbin.org/delay/2\"\n\n\t// Instantiate default collector\n\tc := colly.NewCollector(\n\t\t// Attach a debugger to the collector\n\t\tcolly.Debugger(&debug.LogDebugger{}),\n\t\tcolly.Async(),\n\t)\n\n\t// Limit the number of threads started by colly to two\n\t// when visiting links which domains' matches \"*httpbin.*\" glob\n\tc.Limit(&colly.LimitRule{\n\t\tDomainGlob:  \"*httpbin.*\",\n\t\tParallelism: 2,\n\t\tRandomDelay: 5 * time.Second,\n\t})\n\n\t// Start scraping in four threads on https://httpbin.org/delay/2\n\tfor i := 0; i < 4; i++ {\n\t\tc.Visit(fmt.Sprintf(\"%s?n=%d\", url, i))\n\t}\n\t// Start scraping on https://httpbin.org/delay/2\n\tc.Visit(url)\n\t// Wait until threads are finished\n\tc.Wait()\n}\n"
  },
  {
    "path": "_examples/rate_limit/rate_limit.go",
    "content": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n\t\"github.com/gocolly/colly/v2/debug\"\n)\n\nfunc main() {\n\turl := \"https://httpbin.org/delay/2\"\n\n\t// Instantiate default collector\n\tc := colly.NewCollector(\n\t\t// Turn on asynchronous requests\n\t\tcolly.Async(),\n\t\t// Attach a debugger to the collector\n\t\tcolly.Debugger(&debug.LogDebugger{}),\n\t)\n\n\t// Limit the number of threads started by colly to two\n\t// when visiting links which domains' matches \"*httpbin.*\" glob\n\tc.Limit(&colly.LimitRule{\n\t\tDomainGlob:  \"*httpbin.*\",\n\t\tParallelism: 2,\n\t\t//Delay:      5 * time.Second,\n\t})\n\n\t// Start scraping in five threads on https://httpbin.org/delay/2\n\tfor i := 0; i < 5; i++ {\n\t\tc.Visit(fmt.Sprintf(\"%s?n=%d\", url, i))\n\t}\n\t// Wait until threads are finished\n\tc.Wait()\n}\n"
  },
  {
    "path": "_examples/reddit/reddit.go",
    "content": "package main\n\nimport (\n\t\"fmt\"\n\t\"os\"\n\t\"time\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\ntype item struct {\n\tStoryURL  string\n\tSource    string\n\tcomments  string\n\tCrawledAt time.Time\n\tComments  string\n\tTitle     string\n}\n\nfunc main() {\n\tstories := []item{}\n\t// Instantiate default collector\n\tc := colly.NewCollector(\n\t\t// Visit only domains: old.reddit.com\n\t\tcolly.AllowedDomains(\"old.reddit.com\"),\n\t\t// Parallelism\n\t\tcolly.Async(true),\n\t)\n\n\t// On every a element which has .top-matter attribute call callback\n\t// This class is unique to the div that holds all information about a story\n\tc.OnHTML(\".top-matter\", func(e *colly.HTMLElement) {\n\t\ttemp := item{}\n\t\ttemp.StoryURL = e.ChildAttr(\"a[data-event-action=title]\", \"href\")\n\t\ttemp.Source = \"https://old.reddit.com/r/programming/\"\n\t\ttemp.Title = e.ChildText(\"a[data-event-action=title]\")\n\t\ttemp.Comments = e.ChildAttr(\"a[data-event-action=comments]\", \"href\")\n\t\ttemp.CrawledAt = time.Now()\n\t\tstories = append(stories, temp)\n\t})\n\n\t// On every span tag with the class next-button\n\tc.OnHTML(\"span.next-button\", func(h *colly.HTMLElement) {\n\t\tt := h.ChildAttr(\"a\", \"href\")\n\t\tc.Visit(t)\n\t})\n\n\t// Set max Parallelism and introduce a Random Delay\n\tc.Limit(&colly.LimitRule{\n\t\tParallelism: 2,\n\t\tRandomDelay: 5 * time.Second,\n\t})\n\n\t// Before making a request print \"Visiting ...\"\n\tc.OnRequest(func(r *colly.Request) {\n\t\tfmt.Println(\"Visiting\", r.URL.String())\n\n\t})\n\n\t// Crawl all reddits the user passes in\n\treddits := os.Args[1:]\n\tfor _, reddit := range reddits {\n\t\tc.Visit(reddit)\n\n\t}\n\n\tc.Wait()\n\tfmt.Println(stories)\n\n}\n"
  },
  {
    "path": "_examples/request_context/request_context.go",
    "content": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// Instantiate default collector\n\tc := colly.NewCollector()\n\n\t// Before making a request put the URL with\n\t// the key of \"url\" into the context of the request\n\tc.OnRequest(func(r *colly.Request) {\n\t\tr.Ctx.Put(\"url\", r.URL.String())\n\t})\n\n\t// After making a request get \"url\" from\n\t// the context of the request\n\tc.OnResponse(func(r *colly.Response) {\n\t\tfmt.Println(r.Ctx.Get(\"url\"))\n\t})\n\n\t// Start scraping on https://en.wikipedia.org\n\tc.Visit(\"https://en.wikipedia.org/\")\n}\n"
  },
  {
    "path": "_examples/scraper_server/scraper_server.go",
    "content": "package main\n\nimport (\n\t\"encoding/json\"\n\t\"log\"\n\t\"net/http\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\ntype pageInfo struct {\n\tStatusCode int\n\tLinks      map[string]int\n}\n\nfunc handler(w http.ResponseWriter, r *http.Request) {\n\tURL := r.URL.Query().Get(\"url\")\n\tif URL == \"\" {\n\t\tlog.Println(\"missing URL argument\")\n\t\treturn\n\t}\n\tlog.Println(\"visiting\", URL)\n\n\tc := colly.NewCollector()\n\n\tp := &pageInfo{Links: make(map[string]int)}\n\n\t// count links\n\tc.OnHTML(\"a[href]\", func(e *colly.HTMLElement) {\n\t\tlink := e.Request.AbsoluteURL(e.Attr(\"href\"))\n\t\tif link != \"\" {\n\t\t\tp.Links[link]++\n\t\t}\n\t})\n\n\t// extract status code\n\tc.OnResponse(func(r *colly.Response) {\n\t\tlog.Println(\"response received\", r.StatusCode)\n\t\tp.StatusCode = r.StatusCode\n\t})\n\tc.OnError(func(r *colly.Response, err error) {\n\t\tlog.Println(\"error:\", r.StatusCode, err)\n\t\tp.StatusCode = r.StatusCode\n\t})\n\n\tc.Visit(URL)\n\n\t// dump results\n\tb, err := json.Marshal(p)\n\tif err != nil {\n\t\tlog.Println(\"failed to serialize response:\", err)\n\t\treturn\n\t}\n\tw.Header().Add(\"Content-Type\", \"application/json\")\n\tw.Write(b)\n}\n\nfunc main() {\n\t// example usage: curl -s 'http://127.0.0.1:7171/?url=http://go-colly.org/'\n\taddr := \":7171\"\n\n\thttp.HandleFunc(\"/\", handler)\n\n\tlog.Println(\"listening on\", addr)\n\tlog.Fatal(http.ListenAndServe(addr, nil))\n}\n"
  },
  {
    "path": "_examples/shopify_sitemap/shopify_sitemap.go",
    "content": "package main\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// Array containing all the known URLs in a sitemap\n\tknownUrls := []string{}\n\n\t// Create a Collector specifically for Shopify\n\tc := colly.NewCollector(colly.AllowedDomains(\"www.shopify.com\"))\n\n\t// Create a callback on the XPath query searching for the URLs\n\tc.OnXML(\"//urlset/url/loc\", func(e *colly.XMLElement) {\n\t\tknownUrls = append(knownUrls, e.Text)\n\t})\n\n\t// Start the collector\n\tc.Visit(\"https://www.shopify.com/sitemap.xml\")\n\n\tfmt.Println(\"All known URLs:\")\n\tfor _, url := range knownUrls {\n\t\tfmt.Println(\"\\t\", url)\n\t}\n\tfmt.Println(\"Collected\", len(knownUrls), \"URLs\")\n}\n"
  },
  {
    "path": "_examples/url_filter/url_filter.go",
    "content": "package main\n\nimport (\n\t\"fmt\"\n\t\"regexp\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\t// Instantiate default collector\n\tc := colly.NewCollector(\n\t\t// Visit only root url and urls which start with \"e\" or \"h\" on httpbin.org\n\t\tcolly.URLFilters(\n\t\t\tregexp.MustCompile(\"http://httpbin\\\\.org/(|e.+)$\"),\n\t\t\tregexp.MustCompile(\"http://httpbin\\\\.org/h.+\"),\n\t\t),\n\t)\n\n\t// On every a element which has href attribute call callback\n\tc.OnHTML(\"a[href]\", func(e *colly.HTMLElement) {\n\t\tlink := e.Attr(\"href\")\n\t\t// Print link\n\t\tfmt.Printf(\"Link found: %q -> %s\\n\", e.Text, link)\n\t\t// Visit link found on page\n\t\t// Only those links are visited which are matched by  any of the URLFilter regexps\n\t\tc.Visit(e.Request.AbsoluteURL(link))\n\t})\n\n\t// Before making a request print \"Visiting ...\"\n\tc.OnRequest(func(r *colly.Request) {\n\t\tfmt.Println(\"Visiting\", r.URL.String())\n\t})\n\n\t// Start scraping on http://httpbin.org\n\tc.Visit(\"http://httpbin.org/\")\n}\n"
  },
  {
    "path": "_examples/xkcd_store/xkcd_store.go",
    "content": "package main\n\nimport (\n\t\"encoding/csv\"\n\t\"log\"\n\t\"os\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\tfName := \"xkcd_store_items.csv\"\n\tfile, err := os.Create(fName)\n\tif err != nil {\n\t\tlog.Fatalf(\"Cannot create file %q: %s\\n\", fName, err)\n\t\treturn\n\t}\n\tdefer file.Close()\n\twriter := csv.NewWriter(file)\n\tdefer writer.Flush()\n\t// Write CSV header\n\twriter.Write([]string{\"Name\", \"Price\", \"URL\", \"Image URL\"})\n\n\t// Instantiate default collector\n\tc := colly.NewCollector(\n\t\t// Allow requests only to store.xkcd.com\n\t\tcolly.AllowedDomains(\"store.xkcd.com\"),\n\t)\n\n\t// Extract product details\n\tc.OnHTML(\".product-grid-item\", func(e *colly.HTMLElement) {\n\t\twriter.Write([]string{\n\t\t\te.ChildAttr(\"a\", \"title\"),\n\t\t\te.ChildText(\"span\"),\n\t\t\te.Request.AbsoluteURL(e.ChildAttr(\"a\", \"href\")),\n\t\t\t\"https:\" + e.ChildAttr(\"img\", \"src\"),\n\t\t})\n\t})\n\n\t// Find and visit next page links\n\tc.OnHTML(`.next a[href]`, func(e *colly.HTMLElement) {\n\t\te.Request.Visit(e.Attr(\"href\"))\n\t})\n\n\tc.Visit(\"https://store.xkcd.com/collections/everything\")\n\n\tlog.Printf(\"Scraping finished, check file %q for results\\n\", fName)\n\n\t// Display collector's statistics\n\tlog.Println(c)\n}\n"
  },
  {
    "path": "cmd/colly/colly.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage main\n\nimport (\n\t\"bytes\"\n\t\"fmt\"\n\t\"log\"\n\t\"os\"\n\t\"strings\"\n\n\t\"github.com/jawher/mow.cli\"\n)\n\nvar scraperHeadTemplate = `package main\n\nimport (\n\t\"log\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc main() {\n\tc := colly.NewCollector()\n`\n\nvar scraperEndTemplate = `\n\tc.Visit(\"https://yourdomain.com/\")\n}\n`\n\nvar htmlCallbackTemplate = `\n\tc.OnHTML(\"element-selector\", func(e *colly.HTMLElement) {\n\t\tlog.Println(e.Text)\n\t})\n`\n\nvar requestCallbackTemplate = `\n\tc.OnRequest(func(r *colly.Request) {\n\t\tlog.Println(\"Visiting\", r.URL)\n\t})\n`\n\nvar responseCallbackTemplate = `\n\tc.OnResponse(func(r *colly.Response) {\n\t\tlog.Println(\"Visited\", r.Request.URL, r.StatusCode)\n\t})\n`\n\nvar errorCallbackTemplate = `\n\tc.OnError(func(r *colly.Response, err error) {\n\t\tlog.Printf(\"Error on %s: %s\", r.Request.URL, err)\n\t})\n`\n\nfunc main() {\n\tapp := cli.App(\"colly\", \"Scraping Framework for Gophers\")\n\n\tapp.Command(\"new\", \"Create new scraper\", func(cmd *cli.Cmd) {\n\t\tvar (\n\t\t\tcallbacks = cmd.StringOpt(\"callbacks\", \"\", \"Add callbacks to the template. (E.g. '--callbacks=html,response,error')\")\n\t\t\thosts     = cmd.StringOpt(\"hosts\", \"\", \"Specify scraper's allowed hosts. (e.g. 
'--hosts=xy.com,abcd.com')\")\n\t\t\tpath      = cmd.StringArg(\"PATH\", \"\", \"Path of the new scraper\")\n\t\t)\n\n\t\tcmd.Spec = \"[--callbacks] [--hosts] [PATH]\"\n\n\t\tcmd.Action = func() {\n\t\t\tscraper := bytes.NewBufferString(scraperHeadTemplate)\n\t\t\toutfile := os.Stdout\n\t\t\tif *path != \"\" {\n\t\t\t\tvar err error\n\t\t\t\toutfile, err = os.Create(*path)\n\t\t\t\tif err != nil {\n\t\t\t\t\tlog.Fatal(err)\n\t\t\t\t}\n\t\t\t\tdefer outfile.Close()\n\t\t\t}\n\t\t\tif *hosts != \"\" {\n\t\t\t\tscraper.WriteString(\"\\n\tc.AllowedDomains = []string{\")\n\t\t\t\tfor i, h := range strings.Split(*hosts, \",\") {\n\t\t\t\t\tif i > 0 {\n\t\t\t\t\t\tscraper.WriteString(\", \")\n\t\t\t\t\t}\n\t\t\t\t\tscraper.WriteString(fmt.Sprintf(\"%q\", h))\n\t\t\t\t}\n\t\t\t\tscraper.WriteString(\"}\\n\")\n\t\t\t}\n\t\t\tif len(*callbacks) > 0 {\n\t\t\t\tfor _, c := range strings.Split(*callbacks, \",\") {\n\t\t\t\t\tswitch c {\n\t\t\t\t\tcase \"html\":\n\t\t\t\t\t\tscraper.WriteString(htmlCallbackTemplate)\n\t\t\t\t\tcase \"request\":\n\t\t\t\t\t\tscraper.WriteString(requestCallbackTemplate)\n\t\t\t\t\tcase \"response\":\n\t\t\t\t\t\tscraper.WriteString(responseCallbackTemplate)\n\t\t\t\t\tcase \"error\":\n\t\t\t\t\t\tscraper.WriteString(errorCallbackTemplate)\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t\tscraper.WriteString(scraperEndTemplate)\n\t\t\toutfile.Write(scraper.Bytes())\n\t\t}\n\t})\n\n\tapp.Run(os.Args)\n}\n"
  },
  {
    "path": "colly.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\n// Package colly implements a HTTP scraping framework\npackage colly\n\nimport (\n\t\"bytes\"\n\t\"context\"\n\t\"crypto/rand\"\n\t\"encoding/json\"\n\t\"errors\"\n\t\"fmt\"\n\t\"hash/fnv\"\n\t\"io\"\n\t\"log\"\n\t\"net/http\"\n\t\"net/http/cookiejar\"\n\t\"net/url\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"regexp\"\n\t\"slices\"\n\t\"strconv\"\n\t\"strings\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"time\"\n\n\t\"github.com/PuerkitoBio/goquery\"\n\t\"github.com/antchfx/htmlquery\"\n\t\"github.com/antchfx/xmlquery\"\n\t\"github.com/gocolly/colly/v2/debug\"\n\t\"github.com/gocolly/colly/v2/storage\"\n\t\"github.com/kennygrant/sanitize\"\n\twhatwgUrl \"github.com/nlnwa/whatwg-url/url\"\n\t\"github.com/temoto/robotstxt\"\n\t\"google.golang.org/appengine/urlfetch\"\n)\n\n// A CollectorOption sets an option on a Collector.\ntype CollectorOption func(*Collector)\n\n// Collector provides the scraper instance for a scraping job\ntype Collector struct {\n\t// UserAgent is the User-Agent string used by HTTP requests\n\tUserAgent string\n\t// Custom headers for the request\n\tHeaders *http.Header\n\t// MaxDepth limits the recursion depth of visited URLs.\n\t// Set it to 0 for infinite recursion (default).\n\tMaxDepth int\n\t// AllowedDomains is a domain whitelist.\n\t// Leave it blank to allow any domains to be visited\n\tAllowedDomains []string\n\t// DisallowedDomains is a 
domain blacklist.\n\tDisallowedDomains []string\n\t// DisallowedURLFilters is a list of regular expressions which restricts\n\t// visiting URLs. If any of the rules matches to a URL the\n\t// request will be stopped. DisallowedURLFilters will\n\t// be evaluated before URLFilters\n\t// Leave it blank to allow any URLs to be visited\n\tDisallowedURLFilters []*regexp.Regexp\n\t// URLFilters is a list of regular expressions which restricts\n\t// visiting URLs. If any of the rules matches to a URL the\n\t// request won't be stopped. DisallowedURLFilters will\n\t// be evaluated before URLFilters\n\n\t// Leave it blank to allow any URLs to be visited\n\tURLFilters []*regexp.Regexp\n\n\t// AllowURLRevisit allows multiple downloads of the same URL\n\tAllowURLRevisit bool\n\t// MaxBodySize is the limit of the retrieved response body in bytes.\n\t// 0 means unlimited.\n\t// The default value for MaxBodySize is 10MB (10 * 1024 * 1024 bytes).\n\tMaxBodySize int\n\t// CacheDir specifies a location where GET requests are cached as files.\n\t// When it's not defined, caching is disabled.\n\tCacheDir string\n\t// IgnoreRobotsTxt allows the Collector to ignore any restrictions set by\n\t// the target host's robots.txt file.  See http://www.robotstxt.org/ for more\n\t// information.\n\tIgnoreRobotsTxt bool\n\t// Async turns on asynchronous network communication. Use Collector.Wait() to\n\t// be sure all requests have been finished.\n\tAsync bool\n\t// ParseHTTPErrorResponse allows parsing HTTP responses with non 2xx status codes.\n\t// By default, Colly parses only successful HTTP responses. Set ParseHTTPErrorResponse\n\t// to true to enable it.\n\tParseHTTPErrorResponse bool\n\t// ID is the unique identifier of a collector\n\tID uint32\n\t// DetectCharset can enable character encoding detection for non-utf8 response bodies\n\t// without explicit charset declaration. 
This feature uses https://github.com/saintfish/chardet\n\tDetectCharset bool\n\t// RedirectHandler allows control on how a redirect will be managed\n\t// use c.SetRedirectHandler to set this value\n\tredirectHandler func(req *http.Request, via []*http.Request) error\n\t// CheckHead performs a HEAD request before every GET to pre-validate the response\n\tCheckHead bool\n\t// TraceHTTP enables capturing and reporting request performance for crawler tuning.\n\t// When set to true, the Response.Trace will be filled in with an HTTPTrace object.\n\tTraceHTTP bool\n\t// Context is the context that will be used for HTTP requests. You can set this\n\t// to support clean cancellation of scraping.\n\tContext context.Context\n\t// MaxRequests limit the number of requests done by the instance.\n\t// Set it to 0 for infinite requests (default).\n\tMaxRequests uint32\n\n\tstore                    storage.Storage\n\tdebugger                 debug.Debugger\n\trobotsMap                map[string]*robotstxt.RobotsData\n\thtmlCallbacks            []*htmlCallbackContainer\n\txmlCallbacks             []*xmlCallbackContainer\n\trequestCallbacks         []RequestCallback\n\tresponseCallbacks        []ResponseCallback\n\tresponseHeadersCallbacks []ResponseHeadersCallback\n\trequestHeadersCallbacks  []RequestCallback\n\terrorCallbacks           []ErrorCallback\n\tscrapedCallbacks         []ScrapedCallback\n\trequestCount             atomic.Uint32\n\tresponseCount            atomic.Uint32\n\tbackend                  *httpBackend\n\twg                       *sync.WaitGroup\n\tlock                     *sync.RWMutex\n\t// CacheExpiration sets the maximum age for cache files.\n\t// If a cached file is older than this duration, it will be ignored and refreshed.\n\tCacheExpiration time.Duration\n}\n\n// RequestCallback is a type alias for OnRequest callback functions\ntype RequestCallback func(*Request)\n\n// ResponseHeadersCallback is a type alias for OnResponseHeaders callback functions\ntype 
ResponseHeadersCallback func(*Response)\n\n// ResponseCallback is a type alias for OnResponse callback functions\ntype ResponseCallback func(*Response)\n\n// HTMLCallback is a type alias for OnHTML callback functions\ntype HTMLCallback func(*HTMLElement)\n\n// XMLCallback is a type alias for OnXML callback functions\ntype XMLCallback func(*XMLElement)\n\n// ErrorCallback is a type alias for OnError callback functions\ntype ErrorCallback func(*Response, error)\n\n// ScrapedCallback is a type alias for OnScraped callback functions\ntype ScrapedCallback func(*Response)\n\n// ProxyFunc is a type alias for proxy setter functions.\ntype ProxyFunc func(*http.Request) (*url.URL, error)\n\n// AlreadyVisitedError is the error type for already visited URLs.\n//\n// It's returned synchronously by Visit when the URL passed to Visit\n// is already visited.\n//\n// When already visited URL is encountered after following\n// redirects, this error appears in OnError callback, and if Async\n// mode is not enabled, is also returned by Visit.\ntype AlreadyVisitedError struct {\n\t// Destination is the URL that was attempted to be visited.\n\t// It might not match the URL passed to Visit if redirect\n\t// was followed.\n\tDestination *url.URL\n}\n\n// Error implements error interface.\nfunc (e *AlreadyVisitedError) Error() string {\n\treturn fmt.Sprintf(\"%q already visited\", e.Destination)\n}\n\ntype htmlCallbackContainer struct {\n\tSelector string\n\tFunction HTMLCallback\n\tactive   atomic.Bool\n}\n\ntype xmlCallbackContainer struct {\n\tQuery    string\n\tFunction XMLCallback\n\tactive   atomic.Bool\n}\n\ntype cookieJarSerializer struct {\n\tstore storage.Storage\n\tlock  *sync.RWMutex\n}\n\nvar collectorCounter uint32\n\n// The key type is unexported to prevent collisions with context keys defined in\n// other packages.\ntype key int\n\n// ProxyURLKey is the context key for the request proxy address.\nconst (\n\tProxyURLKey key = iota\n\tCheckRevisitKey\n)\n\n// The prefix for 
environment variables of Colly settings\nconst envVariablePrefix = \"COLLY_\"\n\nvar (\n\t// ErrForbiddenDomain is the error thrown if visiting\n\t// a domain which is not allowed in AllowedDomains\n\tErrForbiddenDomain = errors.New(\"Forbidden domain\")\n\t// ErrMissingURL is the error type for missing URL errors\n\tErrMissingURL = errors.New(\"Missing URL\")\n\t// ErrMaxDepth is the error type for exceeding max depth\n\tErrMaxDepth = errors.New(\"Max depth limit reached\")\n\t// ErrForbiddenURL is the error thrown if visiting\n\t// a URL which is matched by DisallowedURLFilters\n\tErrForbiddenURL = errors.New(\"ForbiddenURL\")\n\n\t// ErrNoURLFiltersMatch is the error thrown if visiting\n\t// a URL which is not matched by any of the URLFilters\n\tErrNoURLFiltersMatch = errors.New(\"No URLFilters match\")\n\t// ErrRobotsTxtBlocked is the error type for robots.txt errors\n\tErrRobotsTxtBlocked = errors.New(\"URL blocked by robots.txt\")\n\t// ErrNoCookieJar is the error type for missing cookie jar\n\tErrNoCookieJar = errors.New(\"Cookie jar is not available\")\n\t// ErrNoPattern is the error type for LimitRules without patterns\n\tErrNoPattern = errors.New(\"No pattern defined in LimitRule\")\n\t// ErrEmptyProxyURL is the error type for empty Proxy URL list\n\tErrEmptyProxyURL = errors.New(\"Proxy URL list is empty\")\n\t// ErrAbortedAfterHeaders is the error returned when OnResponseHeaders aborts the transfer.\n\tErrAbortedAfterHeaders = errors.New(\"Aborted after receiving response headers\")\n\t// ErrAbortedBeforeRequest is the error returned when OnRequestHeaders aborts the request before it is sent.\n\tErrAbortedBeforeRequest = errors.New(\"Aborted before Do Request\")\n\t// ErrQueueFull is the error returned when the queue is full\n\tErrQueueFull = errors.New(\"Queue MaxSize reached\")\n\t// ErrMaxRequests is the error returned when exceeding max requests\n\tErrMaxRequests = errors.New(\"Max Requests limit reached\")\n\t// ErrRetryBodyUnseekable is the error when retry with not seekable 
body\n\tErrRetryBodyUnseekable = errors.New(\"Retry Body Unseekable\")\n)\n\nvar envMap = map[string]func(*Collector, string){\n\t\"ALLOWED_DOMAINS\": func(c *Collector, val string) {\n\t\tc.AllowedDomains = strings.Split(val, \",\")\n\t},\n\t\"CACHE_DIR\": func(c *Collector, val string) {\n\t\tc.CacheDir = val\n\t},\n\t\"DETECT_CHARSET\": func(c *Collector, val string) {\n\t\tc.DetectCharset = isYesString(val)\n\t},\n\t\"DISABLE_COOKIES\": func(c *Collector, _ string) {\n\t\tc.backend.Client.Jar = nil\n\t},\n\t\"DISALLOWED_DOMAINS\": func(c *Collector, val string) {\n\t\tc.DisallowedDomains = strings.Split(val, \",\")\n\t},\n\t\"IGNORE_ROBOTSTXT\": func(c *Collector, val string) {\n\t\tc.IgnoreRobotsTxt = isYesString(val)\n\t},\n\t\"FOLLOW_REDIRECTS\": func(c *Collector, val string) {\n\t\tif !isYesString(val) {\n\t\t\tc.redirectHandler = func(req *http.Request, via []*http.Request) error {\n\t\t\t\treturn http.ErrUseLastResponse\n\t\t\t}\n\t\t}\n\t},\n\t\"MAX_BODY_SIZE\": func(c *Collector, val string) {\n\t\tsize, err := strconv.Atoi(val)\n\t\tif err == nil {\n\t\t\tc.MaxBodySize = size\n\t\t}\n\t},\n\t\"MAX_DEPTH\": func(c *Collector, val string) {\n\t\tmaxDepth, err := strconv.Atoi(val)\n\t\tif err == nil {\n\t\t\tc.MaxDepth = maxDepth\n\t\t}\n\t},\n\t\"MAX_REQUESTS\": func(c *Collector, val string) {\n\t\tmaxRequests, err := strconv.ParseUint(val, 0, 32)\n\t\tif err == nil {\n\t\t\tc.MaxRequests = uint32(maxRequests)\n\t\t}\n\t},\n\t\"PARSE_HTTP_ERROR_RESPONSE\": func(c *Collector, val string) {\n\t\tc.ParseHTTPErrorResponse = isYesString(val)\n\t},\n\t\"TRACE_HTTP\": func(c *Collector, val string) {\n\t\tc.TraceHTTP = isYesString(val)\n\t},\n\t\"USER_AGENT\": func(c *Collector, val string) {\n\t\tc.UserAgent = val\n\t},\n}\n\nvar urlParser = whatwgUrl.NewParser(whatwgUrl.WithPercentEncodeSinglePercentSign())\n\n// NewCollector creates a new Collector instance with default configuration\nfunc NewCollector(options ...CollectorOption) *Collector {\n\tc := 
&Collector{}\n\tc.Init()\n\n\tfor _, f := range options {\n\t\tf(c)\n\t}\n\n\tc.parseSettingsFromEnv()\n\n\treturn c\n}\n\n// UserAgent sets the user agent used by the Collector.\nfunc UserAgent(ua string) CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.UserAgent = ua\n\t}\n}\n\n// Headers sets the custom headers used by the Collector.\nfunc Headers(headers map[string]string) CollectorOption {\n\treturn func(c *Collector) {\n\t\tcustomHeaders := make(http.Header)\n\t\tfor header, value := range headers {\n\t\t\tcustomHeaders.Add(header, value)\n\t\t}\n\t\tc.Headers = &customHeaders\n\t}\n}\n\n// MaxDepth limits the recursion depth of visited URLs.\nfunc MaxDepth(depth int) CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.MaxDepth = depth\n\t}\n}\n\n// MaxRequests limit the number of requests done by the instance.\n// Set it to 0 for infinite requests (default).\nfunc MaxRequests(max uint32) CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.MaxRequests = max\n\t}\n}\n\n// AllowedDomains sets the domain whitelist used by the Collector.\nfunc AllowedDomains(domains ...string) CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.AllowedDomains = domains\n\t}\n}\n\n// ParseHTTPErrorResponse allows parsing responses with HTTP errors\nfunc ParseHTTPErrorResponse() CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.ParseHTTPErrorResponse = true\n\t}\n}\n\n// DisallowedDomains sets the domain blacklist used by the Collector.\nfunc DisallowedDomains(domains ...string) CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.DisallowedDomains = domains\n\t}\n}\n\n// DisallowedURLFilters sets the list of regular expressions which restricts\n// visiting URLs. 
If any of the rules matches to a URL the request will be stopped.\nfunc DisallowedURLFilters(filters ...*regexp.Regexp) CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.DisallowedURLFilters = filters\n\t}\n}\n\n// URLFilters sets the list of regular expressions which restricts\n// visiting URLs. If any of the rules matches to a URL the request won't be stopped.\nfunc URLFilters(filters ...*regexp.Regexp) CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.URLFilters = filters\n\t}\n}\n\n// AllowURLRevisit instructs the Collector to allow multiple downloads of the same URL\nfunc AllowURLRevisit() CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.AllowURLRevisit = true\n\t}\n}\n\n// MaxBodySize sets the limit of the retrieved response body in bytes.\nfunc MaxBodySize(sizeInBytes int) CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.MaxBodySize = sizeInBytes\n\t}\n}\n\n// CacheDir specifies the location where GET requests are cached as files.\nfunc CacheDir(path string) CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.CacheDir = path\n\t}\n}\n\n// IgnoreRobotsTxt instructs the Collector to ignore any restrictions\n// set by the target host's robots.txt file.\nfunc IgnoreRobotsTxt() CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.IgnoreRobotsTxt = true\n\t}\n}\n\n// TraceHTTP instructs the Collector to collect and report request trace data\n// on the Response.Trace.\nfunc TraceHTTP() CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.TraceHTTP = true\n\t}\n}\n\n// StdlibContext sets the context that will be used for HTTP requests.\n// You can set this to support clean cancellation of scraping.\nfunc StdlibContext(ctx context.Context) CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.Context = ctx\n\t}\n}\n\n// ID sets the unique identifier of the Collector.\nfunc ID(id uint32) CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.ID = id\n\t}\n}\n\n// Async turns on asynchronous network requests.\nfunc Async(a 
...bool) CollectorOption {\n\treturn func(c *Collector) {\n\t\tif len(a) > 0 {\n\t\t\tc.Async = a[0]\n\t\t} else {\n\t\t\tc.Async = true\n\t\t}\n\t}\n}\n\n// DetectCharset enables character encoding detection for non-utf8 response bodies\n// without explicit charset declaration. This feature uses https://github.com/saintfish/chardet\nfunc DetectCharset() CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.DetectCharset = true\n\t}\n}\n\n// Debugger sets the debugger used by the Collector.\nfunc Debugger(d debug.Debugger) CollectorOption {\n\treturn func(c *Collector) {\n\t\td.Init()\n\t\tc.debugger = d\n\t}\n}\n\n// CheckHead performs a HEAD request before every GET to pre-validate the response\nfunc CheckHead() CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.CheckHead = true\n\t}\n}\n\n// CacheExpiration sets the maximum age for cache files.\n// If a cached file is older than this duration, it will be ignored and refreshed.\nfunc CacheExpiration(d time.Duration) CollectorOption {\n\treturn func(c *Collector) {\n\t\tc.CacheExpiration = d\n\t}\n}\n\n// Init initializes the Collector's private variables and sets default\n// configuration for the Collector\nfunc (c *Collector) Init() {\n\tc.UserAgent = \"colly - https://github.com/gocolly/colly\"\n\tc.Headers = nil\n\tc.MaxDepth = 0\n\tc.MaxRequests = 0\n\tc.store = &storage.InMemoryStorage{}\n\tc.store.Init()\n\tc.MaxBodySize = 10 * 1024 * 1024\n\tc.backend = &httpBackend{}\n\tjar, _ := cookiejar.New(nil)\n\tc.backend.Init(jar)\n\tc.backend.Client.CheckRedirect = c.checkRedirectFunc()\n\tc.wg = &sync.WaitGroup{}\n\tc.lock = &sync.RWMutex{}\n\tc.robotsMap = make(map[string]*robotstxt.RobotsData)\n\tc.IgnoreRobotsTxt = true\n\tc.ID = atomic.AddUint32(&collectorCounter, 1)\n\tc.TraceHTTP = false\n\tc.Context = context.Background()\n}\n\n// Appengine will replace the Collector's backend http.Client\n// With an Http.Client that is provided by appengine/urlfetch\n// This function should be used when the scraper 
is run on\n// Google App Engine. Example:\n//\n//\tfunc startScraper(w http.ResponseWriter, r *http.Request) {\n//\t  ctx := appengine.NewContext(r)\n//\t  c := colly.NewCollector()\n//\t  c.Appengine(ctx)\n//\t   ...\n//\t  c.Visit(\"https://google.ca\")\n//\t}\nfunc (c *Collector) Appengine(ctx context.Context) {\n\tclient := urlfetch.Client(ctx)\n\tclient.Jar = c.backend.Client.Jar\n\tclient.CheckRedirect = c.backend.Client.CheckRedirect\n\tclient.Timeout = c.backend.Client.Timeout\n\n\tc.backend.Client = client\n}\n\n// Visit starts Collector's collecting job by creating a\n// request to the URL specified in parameter.\n// Visit also calls the previously provided callbacks\nfunc (c *Collector) Visit(URL string) error {\n\tif c.CheckHead {\n\t\tif check := c.scrape(URL, \"HEAD\", 1, nil, nil, nil, true); check != nil {\n\t\t\treturn check\n\t\t}\n\t}\n\treturn c.scrape(URL, \"GET\", 1, nil, nil, nil, true)\n}\n\n// HasVisited checks if the provided URL has been visited\nfunc (c *Collector) HasVisited(URL string) (bool, error) {\n\treturn c.checkHasVisited(URL, nil)\n}\n\n// HasPosted checks if the provided URL and requestData has been visited\n// This method is useful more likely to prevent re-visit same URL and POST body\nfunc (c *Collector) HasPosted(URL string, requestData map[string]string) (bool, error) {\n\treturn c.checkHasVisited(URL, requestData)\n}\n\n// Head starts a collector job by creating a HEAD request.\nfunc (c *Collector) Head(URL string) error {\n\treturn c.scrape(URL, \"HEAD\", 1, nil, nil, nil, false)\n}\n\n// Post starts a collector job by creating a POST request.\n// Post also calls the previously provided callbacks\nfunc (c *Collector) Post(URL string, requestData map[string]string) error {\n\treturn c.scrape(URL, \"POST\", 1, createFormReader(requestData), nil, nil, true)\n}\n\n// PostRaw starts a collector job by creating a POST request with raw binary data.\n// Post also calls the previously provided callbacks\nfunc (c *Collector) 
PostRaw(URL string, requestData []byte) error {\n\treturn c.scrape(URL, \"POST\", 1, bytes.NewReader(requestData), nil, nil, true)\n}\n\n// PostMultipart starts a collector job by creating a Multipart POST request\n// with raw binary data.  PostMultipart also calls the previously provided callbacks\nfunc (c *Collector) PostMultipart(URL string, requestData map[string][]byte) error {\n\tboundary := randomBoundary()\n\thdr := http.Header{}\n\thdr.Set(\"Content-Type\", \"multipart/form-data; boundary=\"+boundary)\n\thdr.Set(\"User-Agent\", c.UserAgent)\n\treturn c.scrape(URL, \"POST\", 1, createMultipartReader(boundary, requestData), nil, hdr, true)\n}\n\n// Request starts a collector job by creating a custom HTTP request\n// where method, context, headers and request data can be specified.\n// Set requestData, ctx, hdr parameters to nil if you don't want to use them.\n// Valid methods:\n//   - \"GET\"\n//   - \"HEAD\"\n//   - \"POST\"\n//   - \"PUT\"\n//   - \"DELETE\"\n//   - \"PATCH\"\n//   - \"OPTIONS\"\nfunc (c *Collector) Request(method, URL string, requestData io.Reader, ctx *Context, hdr http.Header) error {\n\treturn c.scrape(URL, method, 1, requestData, ctx, hdr, true)\n}\n\n// SetDebugger attaches a debugger to the collector\nfunc (c *Collector) SetDebugger(d debug.Debugger) {\n\td.Init()\n\tc.debugger = d\n}\n\n// UnmarshalRequest creates a Request from serialized data\nfunc (c *Collector) UnmarshalRequest(r []byte) (*Request, error) {\n\treq := &serializableRequest{}\n\terr := json.Unmarshal(r, req)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\tu, err := url.Parse(req.URL)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\tctx := NewContext()\n\tfor k, v := range req.Ctx {\n\t\tctx.Put(k, v)\n\t}\n\n\treturn &Request{\n\t\tMethod:    req.Method,\n\t\tURL:       u,\n\t\tDepth:     req.Depth,\n\t\tBody:      bytes.NewReader(req.Body),\n\t\tCtx:       ctx,\n\t\tID:        c.requestCount.Add(1),\n\t\tHeaders:   &req.Headers,\n\t\tcollector: c,\n\t}, 
nil\n}\n\nfunc (c *Collector) scrape(u, method string, depth int, requestData io.Reader, ctx *Context, hdr http.Header, checkRevisit bool) error {\n\tparsedWhatwgURL, err := urlParser.Parse(u)\n\tif err != nil {\n\t\treturn err\n\t}\n\tparsedURL, err := url.Parse(parsedWhatwgURL.Href(false))\n\tif err != nil {\n\t\treturn err\n\t}\n\tif hdr == nil {\n\t\thdr = http.Header{}\n\t\tif c.Headers != nil {\n\t\t\tfor k, v := range *c.Headers {\n\t\t\t\tfor _, value := range v {\n\t\t\t\t\thdr.Add(k, value)\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\tif _, ok := hdr[\"User-Agent\"]; !ok {\n\t\thdr.Set(\"User-Agent\", c.UserAgent)\n\t}\n\tif seeker, ok := requestData.(io.ReadSeeker); ok {\n\t\t_, err := seeker.Seek(0, io.SeekStart)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\n\treq, err := http.NewRequest(method, parsedURL.String(), requestData)\n\tif err != nil {\n\t\treturn err\n\t}\n\treq.Header = hdr\n\t// The Go HTTP API ignores \"Host\" in the headers, preferring the client\n\t// to use the Host field on Request.\n\tif hostHeader := hdr.Get(\"Host\"); hostHeader != \"\" {\n\t\treq.Host = hostHeader\n\t}\n\t// note: once 1.13 is minimum supported Go version,\n\t// replace this with http.NewRequestWithContext\n\treq = req.WithContext(context.WithValue(c.Context, CheckRevisitKey, checkRevisit))\n\n\tif err := c.requestCheck(parsedURL, method, req.GetBody, depth, checkRevisit); err != nil {\n\t\treturn err\n\t}\n\tu = parsedURL.String()\n\tc.wg.Add(1)\n\tif c.Async {\n\t\tgo c.fetch(u, method, depth, requestData, ctx, hdr, req)\n\t\treturn nil\n\t}\n\treturn c.fetch(u, method, depth, requestData, ctx, hdr, req)\n}\n\nfunc (c *Collector) fetch(u, method string, depth int, requestData io.Reader, ctx *Context, hdr http.Header, req *http.Request) error {\n\tdefer c.wg.Done()\n\tif ctx == nil {\n\t\tctx = NewContext()\n\t}\n\trequest := &Request{\n\t\tURL:       req.URL,\n\t\tHeaders:   &req.Header,\n\t\tHost:      req.Host,\n\t\tCtx:       ctx,\n\t\tDepth:     
depth,\n\t\tMethod:    method,\n\t\tBody:      requestData,\n\t\tcollector: c,\n\t\tID:        c.requestCount.Add(1),\n\t}\n\n\tif req.Header.Get(\"Accept\") == \"\" {\n\t\treq.Header.Set(\"Accept\", \"*/*\")\n\t}\n\n\tc.handleOnRequest(request)\n\n\tif request.abort {\n\t\treturn nil\n\t}\n\n\tif method == \"POST\" && req.Header.Get(\"Content-Type\") == \"\" {\n\t\treq.Header.Add(\"Content-Type\", \"application/x-www-form-urlencoded\")\n\t}\n\n\tvar hTrace *HTTPTrace\n\tif c.TraceHTTP {\n\t\thTrace = &HTTPTrace{}\n\t\treq = hTrace.WithTrace(req)\n\t}\n\torigURL := req.URL\n\tcheckResponseHeadersFunc := func(req *http.Request, statusCode int, headers http.Header) bool {\n\t\tif req.URL != origURL {\n\t\t\trequest.URL = req.URL\n\t\t\trequest.Headers = &req.Header\n\t\t}\n\t\tc.handleOnResponseHeaders(&Response{Ctx: ctx, Request: request, StatusCode: statusCode, Headers: &headers})\n\t\treturn !request.abort\n\t}\n\tcheckRequestHeadersFunc := func(req *http.Request) bool {\n\t\tc.handleOnRequestHeaders(request)\n\t\treturn !request.abort\n\t}\n\tresponse, err := c.backend.Cache(req, c.MaxBodySize, checkRequestHeadersFunc, checkResponseHeadersFunc, c.CacheDir, c.CacheExpiration)\n\tif proxyURL, ok := req.Context().Value(ProxyURLKey).(string); ok {\n\t\trequest.ProxyURL = proxyURL\n\t}\n\tif err := c.handleOnError(response, err, request, ctx); err != nil {\n\t\treturn err\n\t}\n\tc.responseCount.Add(1)\n\tresponse.Ctx = ctx\n\tresponse.Request = request\n\tresponse.Trace = hTrace\n\n\terr = response.fixCharset(c.DetectCharset, request.ResponseCharacterEncoding)\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tc.handleOnResponse(response)\n\n\terr = c.handleOnHTML(response)\n\tif err != nil {\n\t\tc.handleOnError(response, err, request, ctx)\n\t}\n\n\terr = c.handleOnXML(response)\n\tif err != nil {\n\t\tc.handleOnError(response, err, request, ctx)\n\t}\n\n\tc.handleOnScraped(response)\n\n\treturn err\n}\n\nfunc (c *Collector) requestCheck(parsedURL *url.URL, method 
string, getBody func() (io.ReadCloser, error), depth int, checkRevisit bool) error {\n\tu := parsedURL.String()\n\tif c.MaxDepth > 0 && c.MaxDepth < depth {\n\t\treturn ErrMaxDepth\n\t}\n\tif c.MaxRequests > 0 && c.requestCount.Load() >= c.MaxRequests {\n\t\treturn ErrMaxRequests\n\t}\n\tif err := c.checkFilters(u, parsedURL.Hostname()); err != nil {\n\t\treturn err\n\t}\n\tif method != \"HEAD\" && !c.IgnoreRobotsTxt {\n\t\tif err := c.checkRobots(parsedURL); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\tif checkRevisit && !c.AllowURLRevisit {\n\t\t// TODO weird behaviour, it allows CheckHead to work correctly,\n\t\t// but it should probably better be solved with\n\t\t// \"check-but-not-save\" flag or something\n\t\tif method != \"GET\" && getBody == nil {\n\t\t\treturn nil\n\t\t}\n\n\t\tvar body io.ReadCloser\n\t\tif getBody != nil {\n\t\t\tvar err error\n\t\t\tbody, err = getBody()\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tdefer body.Close()\n\t\t}\n\t\tuHash := requestHash(u, body)\n\t\tvisited, err := c.store.IsVisited(uHash)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tif visited {\n\t\t\treturn &AlreadyVisitedError{parsedURL}\n\t\t}\n\t\treturn c.store.Visited(uHash)\n\t}\n\treturn nil\n}\n\nfunc (c *Collector) checkFilters(URL, domain string) error {\n\tif len(c.DisallowedURLFilters) > 0 {\n\t\tif isMatchingFilter(c.DisallowedURLFilters, []byte(URL)) {\n\t\t\treturn ErrForbiddenURL\n\t\t}\n\t}\n\tif len(c.URLFilters) > 0 {\n\t\tif !isMatchingFilter(c.URLFilters, []byte(URL)) {\n\t\t\treturn ErrNoURLFiltersMatch\n\t\t}\n\t}\n\tif !c.isDomainAllowed(domain) {\n\t\treturn ErrForbiddenDomain\n\t}\n\treturn nil\n}\n\nfunc (c *Collector) isDomainAllowed(domain string) bool {\n\tif slices.Contains(c.DisallowedDomains, domain) {\n\t\treturn false\n\t}\n\tif c.AllowedDomains == nil || len(c.AllowedDomains) == 0 {\n\t\treturn true\n\t}\n\treturn slices.Contains(c.AllowedDomains, domain)\n}\n\nfunc (c *Collector) checkRobots(u *url.URL) error 
{\n\tc.lock.RLock()\n\trobot, ok := c.robotsMap[u.Host]\n\tc.lock.RUnlock()\n\n\tif !ok {\n\t\t// no robots file cached\n\n\t\t// Prepare request,\n\t\treq, err := http.NewRequest(\"GET\", u.Scheme+\"://\"+u.Host+\"/robots.txt\", nil)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\thdr := http.Header{}\n\t\tif c.Headers != nil {\n\t\t\tfor k, v := range *c.Headers {\n\t\t\t\tfor _, value := range v {\n\t\t\t\t\thdr.Add(k, value)\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t\tif _, ok := hdr[\"User-Agent\"]; !ok {\n\t\t\thdr.Set(\"User-Agent\", c.UserAgent)\n\t\t}\n\t\treq.Header = hdr\n\t\t// The Go HTTP API ignores \"Host\" in the headers, preferring the client\n\t\t// to use the Host field on Request.\n\t\tif hostHeader := hdr.Get(\"Host\"); hostHeader != \"\" {\n\t\t\treq.Host = hostHeader\n\t\t}\n\n\t\tresp, err := c.backend.Client.Do(req)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tdefer resp.Body.Close()\n\n\t\trobot, err = robotstxt.FromResponse(resp)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tc.lock.Lock()\n\t\tc.robotsMap[u.Host] = robot\n\t\tc.lock.Unlock()\n\t}\n\n\tuaGroup := robot.FindGroup(c.UserAgent)\n\tif uaGroup == nil {\n\t\treturn nil\n\t}\n\n\teu := u.EscapedPath()\n\tif u.RawQuery != \"\" {\n\t\teu += \"?\" + u.Query().Encode()\n\t}\n\tif !uaGroup.Test(eu) {\n\t\treturn ErrRobotsTxtBlocked\n\t}\n\treturn nil\n}\n\n// String is the text representation of the collector.\n// It contains useful debug information about the collector's internals\nfunc (c *Collector) String() string {\n\treturn fmt.Sprintf(\n\t\t\"Requests made: %d (%d responses) | Callbacks: OnRequest: %d, OnHTML: %d, OnResponse: %d, OnError: %d\",\n\t\tc.requestCount.Load(),\n\t\tc.responseCount.Load(),\n\t\tlen(c.requestCallbacks),\n\t\tlen(c.htmlCallbacks),\n\t\tlen(c.responseCallbacks),\n\t\tlen(c.errorCallbacks),\n\t)\n}\n\n// Wait returns when the collector jobs are finished\nfunc (c *Collector) Wait() {\n\tc.wg.Wait()\n}\n\n// OnRequest registers a function. 
Function will be executed on every\n// request made by the Collector\nfunc (c *Collector) OnRequest(f RequestCallback) {\n\tc.lock.Lock()\n\tif c.requestCallbacks == nil {\n\t\tc.requestCallbacks = make([]RequestCallback, 0, 4)\n\t}\n\tc.requestCallbacks = append(c.requestCallbacks, f)\n\tc.lock.Unlock()\n}\n\n// OnResponseHeaders registers a function. Function will be executed on every response\n// when headers and status are already received, but body is not yet read.\n//\n// Like in OnRequest, you can call Request.Abort to abort the transfer. This might be\n// useful if, for example, you're following all hyperlinks, but want to avoid\n// downloading files.\n//\n// Be aware that using this will prevent HTTP/1.1 connection reuse, as\n// the only way to abort a download is to immediately close the connection.\n// HTTP/2 doesn't suffer from this problem, as it's possible to close\n// specific stream inside the connection.\nfunc (c *Collector) OnResponseHeaders(f ResponseHeadersCallback) {\n\tc.lock.Lock()\n\tc.responseHeadersCallbacks = append(c.responseHeadersCallbacks, f)\n\tc.lock.Unlock()\n}\n\n// OnRequestHeaders registers a function. Function will be executed on every\n// request made by the Collector before Request Do\nfunc (c *Collector) OnRequestHeaders(f RequestCallback) {\n\tc.lock.Lock()\n\tc.requestHeadersCallbacks = append(c.requestHeadersCallbacks, f)\n\tc.lock.Unlock()\n}\n\n// OnResponse registers a function. Function will be executed on every response\nfunc (c *Collector) OnResponse(f ResponseCallback) {\n\tc.lock.Lock()\n\tif c.responseCallbacks == nil {\n\t\tc.responseCallbacks = make([]ResponseCallback, 0, 4)\n\t}\n\tc.responseCallbacks = append(c.responseCallbacks, f)\n\tc.lock.Unlock()\n}\n\n// OnHTML registers a function. 
Function will be executed on every HTML\n// element matched by the GoQuery Selector parameter.\n// GoQuery Selector is a selector used by https://github.com/PuerkitoBio/goquery\nfunc (c *Collector) OnHTML(goquerySelector string, f HTMLCallback) {\n\tc.lock.Lock()\n\tif c.htmlCallbacks == nil {\n\t\tc.htmlCallbacks = make([]*htmlCallbackContainer, 0, 4)\n\t}\n\tcc := &htmlCallbackContainer{\n\t\tSelector: goquerySelector,\n\t\tFunction: f,\n\t}\n\tcc.active.Store(true)\n\tc.htmlCallbacks = append(c.htmlCallbacks, cc)\n\tc.lock.Unlock()\n}\n\n// OnXML registers a function. Function will be executed on every XML\n// element matched by the xpath Query parameter.\n// xpath Query is used by https://github.com/antchfx/xmlquery\nfunc (c *Collector) OnXML(xpathQuery string, f XMLCallback) {\n\tc.lock.Lock()\n\tif c.xmlCallbacks == nil {\n\t\tc.xmlCallbacks = make([]*xmlCallbackContainer, 0, 4)\n\t}\n\tcc := &xmlCallbackContainer{\n\t\tQuery:    xpathQuery,\n\t\tFunction: f,\n\t}\n\tcc.active.Store(true)\n\tc.xmlCallbacks = append(c.xmlCallbacks, cc)\n\tc.lock.Unlock()\n}\n\n// OnHTMLDetach deregisters the callbacks registered for the given selector.\n// Detached callbacks are not executed for subsequent responses.\nfunc (c *Collector) OnHTMLDetach(goquerySelector string) {\n\tc.lock.Lock()\n\tdefer c.lock.Unlock()\n\n\tfor _, cc := range c.htmlCallbacks {\n\t\tif cc.Selector == goquerySelector {\n\t\t\tcc.active.Store(false)\n\t\t}\n\t}\n}\n\n// OnXMLDetach deregisters the callbacks registered for the given query.\n// Detached callbacks are not executed for subsequent responses.\nfunc (c *Collector) OnXMLDetach(xpathQuery string) {\n\tc.lock.Lock()\n\tdefer c.lock.Unlock()\n\n\tfor _, cc := range c.xmlCallbacks {\n\t\tif cc.Query == xpathQuery {\n\t\t\tcc.active.Store(false)\n\t\t}\n\t}\n}\n\n// OnError registers a function. 
Function will be executed if an error\n// occurs during the HTTP request.\nfunc (c *Collector) OnError(f ErrorCallback) {\n\tc.lock.Lock()\n\tif c.errorCallbacks == nil {\n\t\tc.errorCallbacks = make([]ErrorCallback, 0, 4)\n\t}\n\tc.errorCallbacks = append(c.errorCallbacks, f)\n\tc.lock.Unlock()\n}\n\n// OnScraped registers a function that will be executed as the final part of\n// the scraping, after OnHTML and OnXML have finished.\nfunc (c *Collector) OnScraped(f ScrapedCallback) {\n\tc.lock.Lock()\n\tif c.scrapedCallbacks == nil {\n\t\tc.scrapedCallbacks = make([]ScrapedCallback, 0, 4)\n\t}\n\tc.scrapedCallbacks = append(c.scrapedCallbacks, f)\n\tc.lock.Unlock()\n}\n\n// SetClient will override the previously set http.Client\nfunc (c *Collector) SetClient(client *http.Client) {\n\tc.backend.Client = client\n}\n\n// WithTransport allows you to set a custom http.RoundTripper (transport)\nfunc (c *Collector) WithTransport(transport http.RoundTripper) {\n\tc.backend.Client.Transport = transport\n}\n\n// DisableCookies turns off cookie handling\nfunc (c *Collector) DisableCookies() {\n\tc.backend.Client.Jar = nil\n}\n\n// SetCookieJar overrides the previously set cookie jar\nfunc (c *Collector) SetCookieJar(j http.CookieJar) {\n\tc.backend.Client.Jar = j\n}\n\n// SetRequestTimeout overrides the default timeout (10 seconds) for this collector\nfunc (c *Collector) SetRequestTimeout(timeout time.Duration) {\n\tc.backend.Client.Timeout = timeout\n}\n\n// SetStorage overrides the default in-memory storage.\n// Storage stores scraping related data like cookies and visited urls\nfunc (c *Collector) SetStorage(s storage.Storage) error {\n\tif err := s.Init(); err != nil {\n\t\treturn err\n\t}\n\tc.store = s\n\tc.backend.Client.Jar = createJar(s)\n\treturn nil\n}\n\n// SetProxy sets a proxy for the collector. This method overrides the previously\n// used http.Transport if the type of the transport is not *http.Transport.\n// The proxy type is determined by the URL scheme. 
\"http\"\n// and \"socks5\" are supported. If the scheme is empty,\n// \"http\" is assumed.\nfunc (c *Collector) SetProxy(proxyURL string) error {\n\tproxyParsed, err := url.Parse(proxyURL)\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tc.SetProxyFunc(http.ProxyURL(proxyParsed))\n\n\treturn nil\n}\n\n// SetProxyFunc sets a custom proxy setter/switcher function.\n// See built-in ProxyFuncs for more details.\n// This method overrides the previously used http.Transport\n// if the type of the transport is not *http.Transport.\n// The proxy type is determined by the URL scheme. \"http\"\n// and \"socks5\" are supported. If the scheme is empty,\n// \"http\" is assumed.\nfunc (c *Collector) SetProxyFunc(p ProxyFunc) {\n\tt, ok := c.backend.Client.Transport.(*http.Transport)\n\tif c.backend.Client.Transport != nil && ok {\n\t\tt.Proxy = p\n\t\tt.DisableKeepAlives = true\n\t} else {\n\t\tc.backend.Client.Transport = &http.Transport{\n\t\t\tProxy:             p,\n\t\t\tDisableKeepAlives: true,\n\t\t}\n\t}\n}\n\nfunc createEvent(eventType string, requestID, collectorID uint32, kvargs map[string]string) *debug.Event {\n\treturn &debug.Event{\n\t\tCollectorID: collectorID,\n\t\tRequestID:   requestID,\n\t\tType:        eventType,\n\t\tValues:      kvargs,\n\t}\n}\n\nfunc (c *Collector) handleOnRequest(r *Request) {\n\tif c.debugger != nil {\n\t\tc.debugger.Event(createEvent(\"request\", r.ID, c.ID, map[string]string{\n\t\t\t\"url\": r.URL.String(),\n\t\t}))\n\t}\n\tfor _, f := range c.requestCallbacks {\n\t\tf(r)\n\t}\n}\n\nfunc (c *Collector) handleOnResponse(r *Response) {\n\tif c.debugger != nil {\n\t\tc.debugger.Event(createEvent(\"response\", r.Request.ID, c.ID, map[string]string{\n\t\t\t\"url\":    r.Request.URL.String(),\n\t\t\t\"status\": http.StatusText(r.StatusCode),\n\t\t}))\n\t}\n\tfor _, f := range c.responseCallbacks {\n\t\tf(r)\n\t}\n}\n\nfunc (c *Collector) handleOnResponseHeaders(r *Response) {\n\tif c.debugger != nil 
{\n\t\tc.debugger.Event(createEvent(\"responseHeaders\", r.Request.ID, c.ID, map[string]string{\n\t\t\t\"url\":    r.Request.URL.String(),\n\t\t\t\"status\": http.StatusText(r.StatusCode),\n\t\t}))\n\t}\n\tfor _, f := range c.responseHeadersCallbacks {\n\t\tf(r)\n\t}\n}\nfunc (c *Collector) handleOnRequestHeaders(r *Request) {\n\tif c.debugger != nil {\n\t\tc.debugger.Event(createEvent(\"requestHeaders\", r.ID, c.ID, map[string]string{\n\t\t\t\"url\": r.URL.String(),\n\t\t}))\n\t}\n\tfor _, f := range c.requestHeadersCallbacks {\n\t\tf(r)\n\t}\n}\n\nfunc (c *Collector) handleOnHTML(resp *Response) error {\n\tc.lock.RLock()\n\thtmlCallbacks := slices.Clone(c.htmlCallbacks)\n\tc.lock.RUnlock()\n\n\tif len(htmlCallbacks) == 0 {\n\t\treturn nil\n\t}\n\n\tcontentType := resp.Headers.Get(\"Content-Type\")\n\tif contentType == \"\" {\n\t\tcontentType = http.DetectContentType(resp.Body)\n\t}\n\t// implementation of mime.ParseMediaType without parsing the params\n\t// part\n\tmediatype, _, _ := strings.Cut(contentType, \";\")\n\tmediatype = strings.TrimSpace(strings.ToLower(mediatype))\n\n\t// TODO we also want to parse application/xml as XHTML if it has\n\t// appropriate doctype\n\tswitch mediatype {\n\tcase \"text/html\", \"application/xhtml+xml\":\n\tdefault:\n\t\treturn nil\n\t}\n\n\tdoc, err := goquery.NewDocumentFromReader(bytes.NewBuffer(resp.Body))\n\tif err != nil {\n\t\treturn err\n\t}\n\tif href, found := doc.Find(\"base[href]\").Attr(\"href\"); found {\n\t\tu, err := urlParser.ParseRef(resp.Request.URL.String(), href)\n\t\tif err == nil {\n\t\t\tbaseURL, err := url.Parse(u.Href(false))\n\t\t\tif err == nil {\n\t\t\t\tresp.Request.baseURL = baseURL\n\t\t\t}\n\t\t}\n\n\t}\n\tfor _, cc := range htmlCallbacks {\n\t\tif !cc.active.Load() {\n\t\t\tcontinue\n\t\t}\n\t\ti := 0\n\t\tdoc.Find(cc.Selector).Each(func(_ int, s *goquery.Selection) {\n\t\t\tfor _, n := range s.Nodes {\n\t\t\t\te := NewHTMLElementFromSelectionNode(resp, s, n, i)\n\t\t\t\ti++\n\t\t\t\tif 
c.debugger != nil {\n\t\t\t\t\tc.debugger.Event(createEvent(\"html\", resp.Request.ID, c.ID, map[string]string{\n\t\t\t\t\t\t\"selector\": cc.Selector,\n\t\t\t\t\t\t\"url\":      resp.Request.URL.String(),\n\t\t\t\t\t}))\n\t\t\t\t}\n\t\t\t\tcc.Function(e)\n\t\t\t}\n\t\t})\n\t}\n\treturn nil\n}\n\nfunc (c *Collector) handleOnXML(resp *Response) error {\n\tc.lock.RLock()\n\txmlCallbacks := slices.Clone(c.xmlCallbacks)\n\tc.lock.RUnlock()\n\n\tif len(xmlCallbacks) == 0 {\n\t\treturn nil\n\t}\n\tcontentType := strings.ToLower(resp.Headers.Get(\"Content-Type\"))\n\tisXMLFile := strings.HasSuffix(strings.ToLower(resp.Request.URL.Path), \".xml\") || strings.HasSuffix(strings.ToLower(resp.Request.URL.Path), \".xml.gz\")\n\tif !strings.Contains(contentType, \"html\") && (!strings.Contains(contentType, \"xml\") && !isXMLFile) {\n\t\treturn nil\n\t}\n\n\tif strings.Contains(contentType, \"html\") {\n\t\tdoc, err := htmlquery.Parse(bytes.NewBuffer(resp.Body))\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tif e := htmlquery.FindOne(doc, \"//base\"); e != nil {\n\t\t\tfor _, a := range e.Attr {\n\t\t\t\tif a.Key == \"href\" {\n\t\t\t\t\tbaseURL, err := resp.Request.URL.Parse(a.Val)\n\t\t\t\t\tif err == nil {\n\t\t\t\t\t\tresp.Request.baseURL = baseURL\n\t\t\t\t\t}\n\t\t\t\t\tbreak\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\n\t\tfor _, cc := range xmlCallbacks {\n\t\t\tif !cc.active.Load() {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tfor i, n := range htmlquery.Find(doc, cc.Query) {\n\t\t\t\te := NewXMLElementFromHTMLNode(resp, n)\n\t\t\t\te.Index = i\n\t\t\t\tif c.debugger != nil {\n\t\t\t\t\tc.debugger.Event(createEvent(\"xml\", resp.Request.ID, c.ID, map[string]string{\n\t\t\t\t\t\t\"selector\": cc.Query,\n\t\t\t\t\t\t\"url\":      resp.Request.URL.String(),\n\t\t\t\t\t}))\n\t\t\t\t}\n\t\t\t\tcc.Function(e)\n\t\t\t}\n\t\t}\n\t} else if strings.Contains(contentType, \"xml\") || isXMLFile {\n\t\tdoc, err := xmlquery.Parse(bytes.NewBuffer(resp.Body))\n\t\tif err != nil {\n\t\t\treturn 
err\n\t\t}\n\n\t\tfor _, cc := range xmlCallbacks {\n\t\t\tif !cc.active.Load() {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\txmlquery.FindEach(doc, cc.Query, func(i int, n *xmlquery.Node) {\n\t\t\t\te := NewXMLElementFromXMLNode(resp, n)\n\t\t\t\tif c.debugger != nil {\n\t\t\t\t\tc.debugger.Event(createEvent(\"xml\", resp.Request.ID, c.ID, map[string]string{\n\t\t\t\t\t\t\"selector\": cc.Query,\n\t\t\t\t\t\t\"url\":      resp.Request.URL.String(),\n\t\t\t\t\t}))\n\t\t\t\t}\n\t\t\t\tcc.Function(e)\n\t\t\t})\n\t\t}\n\t}\n\treturn nil\n}\n\nfunc (c *Collector) handleOnError(response *Response, err error, request *Request, ctx *Context) error {\n\tif err == nil && (c.ParseHTTPErrorResponse || response.StatusCode < 203) {\n\t\treturn nil\n\t}\n\tif err == nil && response.StatusCode >= 203 {\n\t\terr = errors.New(http.StatusText(response.StatusCode))\n\t}\n\tif response == nil {\n\t\tresponse = &Response{\n\t\t\tRequest: request,\n\t\t\tCtx:     ctx,\n\t\t}\n\t}\n\tif c.debugger != nil {\n\t\tc.debugger.Event(createEvent(\"error\", request.ID, c.ID, map[string]string{\n\t\t\t\"url\":    request.URL.String(),\n\t\t\t\"status\": http.StatusText(response.StatusCode),\n\t\t}))\n\t}\n\tif response.Request == nil {\n\t\tresponse.Request = request\n\t}\n\tif response.Ctx == nil {\n\t\tresponse.Ctx = request.Ctx\n\t}\n\tfor _, f := range c.errorCallbacks {\n\t\tf(response, err)\n\t}\n\treturn err\n}\n\nfunc (c *Collector) cleanupCallbacks() {\n\tc.lock.Lock()\n\tdefer c.lock.Unlock()\n\n\t// Clean HTML callbacks\n\tc.htmlCallbacks = slices.DeleteFunc(c.htmlCallbacks, func(cc *htmlCallbackContainer) bool {\n\t\treturn !cc.active.Load()\n\t})\n\n\t// Clean XML callbacks\n\tc.xmlCallbacks = slices.DeleteFunc(c.xmlCallbacks, func(cc *xmlCallbackContainer) bool {\n\t\treturn !cc.active.Load()\n\t})\n}\n\nfunc (c *Collector) handleOnScraped(r *Response) {\n\tif c.debugger != nil {\n\t\tc.debugger.Event(createEvent(\"scraped\", r.Request.ID, c.ID, map[string]string{\n\t\t\t\"url\": 
r.Request.URL.String(),\n\t\t}))\n\t}\n\tfor _, f := range c.scrapedCallbacks {\n\t\tf(r)\n\t}\n\n\t// Cleanup inactive callbacks after processing each response\n\tc.cleanupCallbacks()\n}\n\n// Limit adds a new LimitRule to the collector\nfunc (c *Collector) Limit(rule *LimitRule) error {\n\treturn c.backend.Limit(rule)\n}\n\n// Limits adds new LimitRules to the collector\nfunc (c *Collector) Limits(rules []*LimitRule) error {\n\treturn c.backend.Limits(rules)\n}\n\n// SetRedirectHandler sets a custom handler that is called for every redirect\n// after the collector's URL filter and revisit checks have passed.\nfunc (c *Collector) SetRedirectHandler(f func(req *http.Request, via []*http.Request) error) {\n\tc.redirectHandler = f\n\tc.backend.Client.CheckRedirect = c.checkRedirectFunc()\n}\n\n// SetCookies handles the receipt of the cookies in a reply for the given URL\nfunc (c *Collector) SetCookies(URL string, cookies []*http.Cookie) error {\n\tif c.backend.Client.Jar == nil {\n\t\treturn ErrNoCookieJar\n\t}\n\tu, err := url.Parse(URL)\n\tif err != nil {\n\t\treturn err\n\t}\n\tc.backend.Client.Jar.SetCookies(u, cookies)\n\treturn nil\n}\n\n// Cookies returns the cookies to send in a request for the given URL.\nfunc (c *Collector) Cookies(URL string) []*http.Cookie {\n\tif c.backend.Client.Jar == nil {\n\t\treturn nil\n\t}\n\tu, err := url.Parse(URL)\n\tif err != nil {\n\t\treturn nil\n\t}\n\treturn c.backend.Client.Jar.Cookies(u)\n}\n\n// Clone creates an exact copy of a Collector without callbacks.\n// HTTP backend, robots.txt cache and cookie jar are shared\n// between collectors.\nfunc (c *Collector) Clone() *Collector {\n\treturn &Collector{\n\t\tAllowedDomains:         c.AllowedDomains,\n\t\tAllowURLRevisit:        c.AllowURLRevisit,\n\t\tCacheDir:               c.CacheDir,\n\t\tCacheExpiration:        c.CacheExpiration,\n\t\tDetectCharset:          c.DetectCharset,\n\t\tDisallowedDomains:      c.DisallowedDomains,\n\t\tID:                     atomic.AddUint32(&collectorCounter, 1),\n\t\tIgnoreRobotsTxt:        
c.IgnoreRobotsTxt,\n\t\tMaxBodySize:            c.MaxBodySize,\n\t\tMaxDepth:               c.MaxDepth,\n\t\tMaxRequests:            c.MaxRequests,\n\t\tDisallowedURLFilters:   c.DisallowedURLFilters,\n\t\tURLFilters:             c.URLFilters,\n\t\tCheckHead:              c.CheckHead,\n\t\tParseHTTPErrorResponse: c.ParseHTTPErrorResponse,\n\t\tUserAgent:              c.UserAgent,\n\t\tHeaders:                c.Headers,\n\t\tTraceHTTP:              c.TraceHTTP,\n\t\tContext:                c.Context,\n\t\tstore:                  c.store,\n\t\tbackend:                c.backend,\n\t\tdebugger:               c.debugger,\n\t\tAsync:                  c.Async,\n\t\tredirectHandler:        c.redirectHandler,\n\t\terrorCallbacks:         make([]ErrorCallback, 0, 8),\n\t\thtmlCallbacks:          make([]*htmlCallbackContainer, 0, 8),\n\t\txmlCallbacks:           make([]*xmlCallbackContainer, 0, 8),\n\t\tscrapedCallbacks:       make([]ScrapedCallback, 0, 8),\n\t\tlock:                   c.lock,\n\t\trequestCallbacks:       make([]RequestCallback, 0, 8),\n\t\tresponseCallbacks:      make([]ResponseCallback, 0, 8),\n\t\trobotsMap:              c.robotsMap,\n\t\twg:                     &sync.WaitGroup{},\n\t}\n}\n\nfunc (c *Collector) checkRedirectFunc() func(req *http.Request, via []*http.Request) error {\n\treturn func(req *http.Request, via []*http.Request) error {\n\t\tif err := c.checkFilters(req.URL.String(), req.URL.Hostname()); err != nil {\n\t\t\treturn fmt.Errorf(\"Not following redirect to %q: %w\", req.URL, err)\n\t\t}\n\n\t\t// allow redirects to the original destination\n\t\t// to support websites redirecting to the same page while setting\n\t\t// session cookies\n\t\tsamePageRedirect := normalizeURL(req.URL.String()) == normalizeURL(via[0].URL.String())\n\n\t\tif !c.AllowURLRevisit && !samePageRedirect {\n\t\t\tvar body io.ReadCloser\n\t\t\tif req.GetBody != nil {\n\t\t\t\tvar err error\n\t\t\t\tbody, err = req.GetBody()\n\t\t\t\tif err != nil {\n\t\t\t\t\treturn 
err\n\t\t\t\t}\n\t\t\t\tdefer body.Close()\n\t\t\t}\n\t\t\tuHash := requestHash(req.URL.String(), body)\n\t\t\tvisited, err := c.store.IsVisited(uHash)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t\tif visited {\n\t\t\t\tif checkRevisit, ok := req.Context().Value(CheckRevisitKey).(bool); !ok || checkRevisit {\n\t\t\t\t\treturn &AlreadyVisitedError{req.URL}\n\t\t\t\t}\n\t\t\t}\n\t\t\terr = c.store.Visited(uHash)\n\t\t\tif err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\n\t\tif c.redirectHandler != nil {\n\t\t\treturn c.redirectHandler(req, via)\n\t\t}\n\n\t\t// Honor golangs default of maximum of 10 redirects\n\t\tif len(via) >= 10 {\n\t\t\treturn http.ErrUseLastResponse\n\t\t}\n\n\t\tlastRequest := via[len(via)-1]\n\n\t\t// If domain has changed, remove the Authorization-header if it exists\n\t\tif req.URL.Host != lastRequest.URL.Host {\n\t\t\treq.Header.Del(\"Authorization\")\n\t\t}\n\n\t\treturn nil\n\t}\n}\n\nfunc (c *Collector) parseSettingsFromEnv() {\n\tfor _, e := range os.Environ() {\n\t\tif !strings.HasPrefix(e, envVariablePrefix) {\n\t\t\tcontinue\n\t\t}\n\t\tpair := strings.SplitN(e[len(envVariablePrefix):], \"=\", 2)\n\t\tif f, ok := envMap[pair[0]]; ok {\n\t\t\tf(c, pair[1])\n\t\t} else {\n\t\t\tlog.Println(\"Unknown environment variable:\", pair[0])\n\t\t}\n\t}\n}\n\nfunc (c *Collector) checkHasVisited(URL string, requestData map[string]string) (bool, error) {\n\thash := requestHash(URL, createFormReader(requestData))\n\treturn c.store.IsVisited(hash)\n}\n\n// SanitizeFileName replaces dangerous characters in a string\n// so the return value can be used as a safe file name.\nfunc SanitizeFileName(fileName string) string {\n\text := filepath.Ext(fileName)\n\tcleanExt := sanitize.BaseName(ext)\n\tif cleanExt == \"\" {\n\t\tcleanExt = \".unknown\"\n\t}\n\treturn strings.Replace(fmt.Sprintf(\n\t\t\"%s.%s\",\n\t\tsanitize.BaseName(fileName[:len(fileName)-len(ext)]),\n\t\tcleanExt[1:],\n\t), \"-\", \"_\", -1)\n}\n\nfunc createFormReader(data 
map[string]string) io.Reader {\n\tform := url.Values{}\n\tfor k, v := range data {\n\t\tform.Add(k, v)\n\t}\n\treturn strings.NewReader(form.Encode())\n}\n\nfunc createMultipartReader(boundary string, data map[string][]byte) io.Reader {\n\tdashBoundary := \"--\" + boundary\n\n\tbody := []byte{}\n\tbuffer := bytes.NewBuffer(body)\n\n\tbuffer.WriteString(\"Content-type: multipart/form-data; boundary=\" + boundary + \"\\n\\n\")\n\tfor contentType, content := range data {\n\t\tbuffer.WriteString(dashBoundary + \"\\n\")\n\t\tbuffer.WriteString(\"Content-Disposition: form-data; name=\" + contentType + \"\\n\")\n\t\tbuffer.WriteString(fmt.Sprintf(\"Content-Length: %d \\n\\n\", len(content)))\n\t\tbuffer.Write(content)\n\t\tbuffer.WriteString(\"\\n\")\n\t}\n\tbuffer.WriteString(dashBoundary + \"--\\n\\n\")\n\treturn bytes.NewReader(buffer.Bytes())\n\n}\n\n// randomBoundary was borrowed from\n// github.com/golang/go/mime/multipart/writer.go#randomBoundary\nfunc randomBoundary() string {\n\tvar buf [30]byte\n\t_, err := io.ReadFull(rand.Reader, buf[:])\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\treturn fmt.Sprintf(\"%x\", buf[:])\n}\n\nfunc isYesString(s string) bool {\n\tswitch strings.ToLower(s) {\n\tcase \"1\", \"yes\", \"true\", \"y\":\n\t\treturn true\n\t}\n\treturn false\n}\n\nfunc createJar(s storage.Storage) http.CookieJar {\n\treturn &cookieJarSerializer{store: s, lock: &sync.RWMutex{}}\n}\n\nfunc (j *cookieJarSerializer) SetCookies(u *url.URL, cookies []*http.Cookie) {\n\tj.lock.Lock()\n\tdefer j.lock.Unlock()\n\tcookieStr := j.store.Cookies(u)\n\n\t// Merge existing cookies, new cookies have precedence.\n\tcnew := make([]*http.Cookie, len(cookies))\n\tcopy(cnew, cookies)\n\texisting := storage.UnstringifyCookies(cookieStr)\n\tfor _, c := range existing {\n\t\tif !storage.ContainsCookie(cnew, c.Name) {\n\t\t\tcnew = append(cnew, c)\n\t\t}\n\t}\n\tj.store.SetCookies(u, storage.StringifyCookies(cnew))\n}\n\nfunc (j *cookieJarSerializer) Cookies(u *url.URL) 
[]*http.Cookie {\n\tcookies := storage.UnstringifyCookies(j.store.Cookies(u))\n\t// Filter.\n\tnow := time.Now()\n\tcnew := make([]*http.Cookie, 0, len(cookies))\n\tfor _, c := range cookies {\n\t\t// Drop expired cookies.\n\t\tif c.RawExpires != \"\" && c.Expires.Before(now) {\n\t\t\tcontinue\n\t\t}\n\t\t// Drop secure cookies if not over https.\n\t\tif c.Secure && u.Scheme != \"https\" {\n\t\t\tcontinue\n\t\t}\n\t\tcnew = append(cnew, c)\n\t}\n\treturn cnew\n}\n\nfunc isMatchingFilter(fs []*regexp.Regexp, d []byte) bool {\n\tfor _, r := range fs {\n\t\tif r.Match(d) {\n\t\t\treturn true\n\t\t}\n\t}\n\treturn false\n}\n\nfunc normalizeURL(u string) string {\n\tparsed, err := urlParser.Parse(u)\n\tif err != nil {\n\t\treturn u\n\t}\n\treturn parsed.String()\n}\n\nfunc requestHash(url string, body io.Reader) uint64 {\n\th := fnv.New64a()\n\t// reparse the url to fix ambiguities such as\n\t// \"http://example.com\" vs \"http://example.com/\"\n\tio.WriteString(h, normalizeURL(url))\n\tif body != nil {\n\t\tio.Copy(h, body)\n\t}\n\treturn h.Sum64()\n}\n"
  },
  {
    "path": "colly_test.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage colly\n\nimport (\n\t\"bufio\"\n\t\"bytes\"\n\t\"context\"\n\t\"errors\"\n\t\"fmt\"\n\t\"net/http\"\n\t\"net/http/httptest\"\n\t\"net/url\"\n\t\"os\"\n\t\"reflect\"\n\t\"regexp\"\n\t\"strings\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/PuerkitoBio/goquery\"\n\n\t\"github.com/gocolly/colly/v2/debug\"\n)\n\nvar serverIndexResponse = []byte(\"hello world\\n\")\nvar callbackTestHTML = []byte(`\n<!DOCTYPE html>\n<html>\n<head>\n<title>Callback Test Page</title>\n</head>\n<body>\n<div id=\"firstElem\">First</div>\n<div id=\"secondElem\">Second</div>\n<div id=\"thirdElem\">Third</div>\n</body>\n</html>\n`)\nvar robotsFile = `\nUser-agent: *\nAllow: /allowed\nDisallow: /disallowed\nDisallow: /allowed*q=\n`\n\nfunc newUnstartedTestServer() *httptest.Server {\n\tmux := http.NewServeMux()\n\n\tmux.HandleFunc(\"/\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.WriteHeader(200)\n\t\tw.Write(serverIndexResponse)\n\t})\n\n\tmux.HandleFunc(\"/callback_test\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\tw.WriteHeader(200)\n\t\tw.Write(callbackTestHTML)\n\t})\n\n\tmux.HandleFunc(\"/html\", func(w http.ResponseWriter, r *http.Request) {\n\t\tif r.URL.Query().Get(\"no-content-type\") != \"\" {\n\t\t\tw.Header()[\"Content-Type\"] = nil\n\t\t} else {\n\t\t\tw.Header().Set(\"Content-Type\", 
\"text/html\")\n\t\t}\n\t\tw.Write([]byte(`<!DOCTYPE html>\n<html>\n<head>\n<title>Test Page</title>\n</head>\n<body>\n<h1>Hello World</h1>\n<p class=\"description\">This is a test page</p>\n<p class=\"description\">This is a test paragraph</p>\n</body>\n</html>\n\t\t`))\n\t})\n\n\tmux.HandleFunc(\"/xml\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.Header().Set(\"Content-Type\", \"application/xml\")\n\t\tw.Write([]byte(`<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<page>\n\t<title>Test Page</title>\n\t<paragraph type=\"description\">This is a test page</paragraph>\n\t<paragraph type=\"description\">This is a test paragraph</paragraph>\n</page>\n\t\t`))\n\t})\n\n\tmux.HandleFunc(\"/login\", func(w http.ResponseWriter, r *http.Request) {\n\t\tif r.Method == \"POST\" {\n\t\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\t\tw.Write([]byte(r.FormValue(\"name\")))\n\t\t}\n\t})\n\n\tmux.HandleFunc(\"/robots.txt\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.WriteHeader(200)\n\t\tw.Write([]byte(robotsFile))\n\t})\n\n\tmux.HandleFunc(\"/allowed\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.WriteHeader(200)\n\t\tw.Write([]byte(\"allowed\"))\n\t})\n\n\tmux.HandleFunc(\"/disallowed\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.WriteHeader(200)\n\t\tw.Write([]byte(\"disallowed\"))\n\t})\n\n\tmux.Handle(\"/redirect\", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\tdestination := \"/redirected/\"\n\t\tif d := r.URL.Query().Get(\"d\"); d != \"\" {\n\t\t\tdestination = d\n\t\t}\n\t\thttp.Redirect(w, r, destination, http.StatusSeeOther)\n\n\t}))\n\n\tmux.Handle(\"/redirected/\", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\tfmt.Fprintf(w, `<a href=\"test\">test</a>`)\n\t}))\n\n\tmux.HandleFunc(\"/set_cookie\", func(w http.ResponseWriter, r *http.Request) {\n\t\tc := &http.Cookie{Name: \"test\", Value: \"testv\", HttpOnly: false}\n\t\thttp.SetCookie(w, 
c)\n\t\tw.WriteHeader(200)\n\t\tw.Write([]byte(\"ok\"))\n\t})\n\n\tmux.HandleFunc(\"/check_cookie\", func(w http.ResponseWriter, r *http.Request) {\n\t\tcs := r.Cookies()\n\t\tif len(cs) != 1 || r.Cookies()[0].Value != \"testv\" {\n\t\t\tw.WriteHeader(500)\n\t\t\tw.Write([]byte(\"nok\"))\n\t\t\treturn\n\t\t}\n\t\tw.WriteHeader(200)\n\t\tw.Write([]byte(\"ok\"))\n\t})\n\n\tmux.HandleFunc(\"/500\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\tw.WriteHeader(500)\n\t\tw.Write([]byte(\"<p>error</p>\"))\n\t})\n\n\tmux.HandleFunc(\"/user_agent\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.WriteHeader(200)\n\t\tw.Write([]byte(r.Header.Get(\"User-Agent\")))\n\t})\n\n\tmux.HandleFunc(\"/host_header\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.WriteHeader(200)\n\t\tw.Write([]byte(r.Host))\n\t})\n\n\tmux.HandleFunc(\"/accept_header\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.WriteHeader(200)\n\t\tw.Write([]byte(r.Header.Get(\"Accept\")))\n\t})\n\n\tmux.HandleFunc(\"/custom_header\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.WriteHeader(200)\n\t\tw.Write([]byte(r.Header.Get(\"Test\")))\n\t})\n\n\tmux.HandleFunc(\"/base\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\tw.Write([]byte(`<!DOCTYPE html>\n<html>\n<head>\n<title>Test Page</title>\n<base href=\"http://xy.com/\" />\n</head>\n<body>\n<a href=\"z\">link</a>\n</body>\n</html>\n\t\t`))\n\t})\n\n\tmux.HandleFunc(\"/base_relative\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\tw.Write([]byte(`<!DOCTYPE html>\n<html>\n<head>\n<title>Test Page</title>\n<base href=\"/foobar/\" />\n</head>\n<body>\n<a href=\"z\">link</a>\n</body>\n</html>\n\t\t`))\n\t})\n\n\tmux.HandleFunc(\"/tabs_and_newlines\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.Header().Set(\"Content-Type\", 
\"text/html\")\n\t\tw.Write([]byte(`<!DOCTYPE html>\n<html>\n<head>\n<title>Test Page</title>\n<base href=\"/foo\tbar/\" />\n</head>\n<body>\n<a href=\"x\ny\">link</a>\n</body>\n</html>\n\t\t`))\n\t})\n\n\tmux.HandleFunc(\"/foobar/xy\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\tw.Write([]byte(`<!DOCTYPE html>\n<html>\n<head>\n<title>Test Page</title>\n</head>\n<body>\n<p>hello</p>\n</body>\n</html>\n\t\t`))\n\t})\n\n\tmux.HandleFunc(\"/100%25\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.Write([]byte(\"100 percent\"))\n\t})\n\n\tmux.HandleFunc(\"/large_binary\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.Header().Set(\"Content-Type\", \"application/octet-stream\")\n\t\tww := bufio.NewWriter(w)\n\t\tdefer ww.Flush()\n\t\tfor {\n\t\t\t// have to check error to detect client aborting download\n\t\t\tif _, err := ww.Write([]byte{0x41}); err != nil {\n\t\t\t\treturn\n\t\t\t}\n\t\t}\n\t})\n\n\tmux.HandleFunc(\"/slow\", func(w http.ResponseWriter, r *http.Request) {\n\t\tw.WriteHeader(200)\n\n\t\tticker := time.NewTicker(100 * time.Millisecond)\n\t\tdefer ticker.Stop()\n\n\t\ti := 0\n\n\t\tfor {\n\t\t\tselect {\n\t\t\tcase <-r.Context().Done():\n\t\t\t\treturn\n\t\t\tcase t := <-ticker.C:\n\t\t\t\tfmt.Fprintf(w, \"%s\\n\", t)\n\t\t\t\tif flusher, ok := w.(http.Flusher); ok {\n\t\t\t\t\tflusher.Flush()\n\t\t\t\t}\n\t\t\t\ti++\n\t\t\t\tif i == 10 {\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t})\n\n\tmux.HandleFunc(\"/sitemap.xml.gz\", func(w http.ResponseWriter, r *http.Request) {\n\t\t// Return a 404 HTML page for a non-existent .xml.gz URL.\n\t\t// This simulates the scenario in issue #745 where a server\n\t\t// returns an HTML error page for a missing gzipped sitemap.\n\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\tw.WriteHeader(404)\n\t\tw.Write([]byte(`<!DOCTYPE html><html><body><h1>404 Not Found</h1></body></html>`))\n\t})\n\n\treturn 
httptest.NewUnstartedServer(mux)\n}\n\nfunc newTestServer() *httptest.Server {\n\tsrv := newUnstartedTestServer()\n\tsrv.Start()\n\treturn srv\n}\n\nvar newCollectorTests = map[string]func(*testing.T){\n\t\"UserAgent\": func(t *testing.T) {\n\t\tfor _, ua := range []string{\n\t\t\t\"foo\",\n\t\t\t\"bar\",\n\t\t} {\n\t\t\tc := NewCollector(UserAgent(ua))\n\n\t\t\tif got, want := c.UserAgent, ua; got != want {\n\t\t\t\tt.Fatalf(\"c.UserAgent = %q, want %q\", got, want)\n\t\t\t}\n\t\t}\n\t},\n\t\"MaxDepth\": func(t *testing.T) {\n\t\tfor _, depth := range []int{\n\t\t\t12,\n\t\t\t34,\n\t\t\t0,\n\t\t} {\n\t\t\tc := NewCollector(MaxDepth(depth))\n\n\t\t\tif got, want := c.MaxDepth, depth; got != want {\n\t\t\t\tt.Fatalf(\"c.MaxDepth = %d, want %d\", got, want)\n\t\t\t}\n\t\t}\n\t},\n\t\"AllowedDomains\": func(t *testing.T) {\n\t\tfor _, domains := range [][]string{\n\t\t\t{\"example.com\", \"example.net\"},\n\t\t\t{\"example.net\"},\n\t\t\t{},\n\t\t\tnil,\n\t\t} {\n\t\t\tc := NewCollector(AllowedDomains(domains...))\n\n\t\t\tif got, want := c.AllowedDomains, domains; !reflect.DeepEqual(got, want) {\n\t\t\t\tt.Fatalf(\"c.AllowedDomains = %q, want %q\", got, want)\n\t\t\t}\n\t\t}\n\t},\n\t\"DisallowedDomains\": func(t *testing.T) {\n\t\tfor _, domains := range [][]string{\n\t\t\t{\"example.com\", \"example.net\"},\n\t\t\t{\"example.net\"},\n\t\t\t{},\n\t\t\tnil,\n\t\t} {\n\t\t\tc := NewCollector(DisallowedDomains(domains...))\n\n\t\t\tif got, want := c.DisallowedDomains, domains; !reflect.DeepEqual(got, want) {\n\t\t\t\tt.Fatalf(\"c.DisallowedDomains = %q, want %q\", got, want)\n\t\t\t}\n\t\t}\n\t},\n\t\"DisallowedURLFilters\": func(t *testing.T) {\n\t\tfor _, filters := range [][]*regexp.Regexp{\n\t\t\t{regexp.MustCompile(`.*not_allowed.*`)},\n\t\t} {\n\t\t\tc := NewCollector(DisallowedURLFilters(filters...))\n\n\t\t\tif got, want := c.DisallowedURLFilters, filters; !reflect.DeepEqual(got, want) {\n\t\t\t\tt.Fatalf(\"c.DisallowedURLFilters = %v, want %v\", got, 
want)\n\t\t\t}\n\t\t}\n\t},\n\t\"URLFilters\": func(t *testing.T) {\n\t\tfor _, filters := range [][]*regexp.Regexp{\n\t\t\t{regexp.MustCompile(`\\w+`)},\n\t\t\t{regexp.MustCompile(`\\d+`)},\n\t\t\t{},\n\t\t\tnil,\n\t\t} {\n\t\t\tc := NewCollector(URLFilters(filters...))\n\n\t\t\tif got, want := c.URLFilters, filters; !reflect.DeepEqual(got, want) {\n\t\t\t\tt.Fatalf(\"c.URLFilters = %v, want %v\", got, want)\n\t\t\t}\n\t\t}\n\t},\n\t\"AllowURLRevisit\": func(t *testing.T) {\n\t\tc := NewCollector(AllowURLRevisit())\n\n\t\tif !c.AllowURLRevisit {\n\t\t\tt.Fatal(\"c.AllowURLRevisit = false, want true\")\n\t\t}\n\t},\n\t\"MaxBodySize\": func(t *testing.T) {\n\t\tfor _, sizeInBytes := range []int{\n\t\t\t1024 * 1024,\n\t\t\t1024,\n\t\t\t0,\n\t\t} {\n\t\t\tc := NewCollector(MaxBodySize(sizeInBytes))\n\n\t\t\tif got, want := c.MaxBodySize, sizeInBytes; got != want {\n\t\t\t\tt.Fatalf(\"c.MaxBodySize = %d, want %d\", got, want)\n\t\t\t}\n\t\t}\n\t},\n\t\"CacheDir\": func(t *testing.T) {\n\t\tfor _, path := range []string{\n\t\t\t\"/tmp/\",\n\t\t\t\"/var/cache/\",\n\t\t} {\n\t\t\tc := NewCollector(CacheDir(path))\n\n\t\t\tif got, want := c.CacheDir, path; got != want {\n\t\t\t\tt.Fatalf(\"c.CacheDir = %q, want %q\", got, want)\n\t\t\t}\n\t\t}\n\t},\n\t\"CacheExpiration\": func(t *testing.T) {\n\t\tfor _, d := range []time.Duration{\n\t\t\t5 * time.Second,\n\t\t\t10 * time.Minute,\n\t\t\t0,\n\t\t} {\n\t\t\tc := NewCollector(CacheExpiration(d))\n\n\t\t\tif got, want := c.CacheExpiration, d; got != want {\n\t\t\t\tt.Fatalf(\"c.CacheExpiration = %v, want %v\", got, want)\n\t\t\t}\n\t\t}\n\t},\n\t\"IgnoreRobotsTxt\": func(t *testing.T) {\n\t\tc := NewCollector(IgnoreRobotsTxt())\n\n\t\tif !c.IgnoreRobotsTxt {\n\t\t\tt.Fatal(\"c.IgnoreRobotsTxt = false, want true\")\n\t\t}\n\t},\n\t\"ID\": func(t *testing.T) {\n\t\tfor _, id := range []uint32{\n\t\t\t0,\n\t\t\t1,\n\t\t\t2,\n\t\t} {\n\t\t\tc := NewCollector(ID(id))\n\n\t\t\tif got, want := c.ID, id; got != want 
{\n\t\t\t\tt.Fatalf(\"c.ID = %d, want %d\", got, want)\n\t\t\t}\n\t\t}\n\t},\n\t\"DetectCharset\": func(t *testing.T) {\n\t\tc := NewCollector(DetectCharset())\n\n\t\tif !c.DetectCharset {\n\t\t\tt.Fatal(\"c.DetectCharset = false, want true\")\n\t\t}\n\t},\n\t\"Debugger\": func(t *testing.T) {\n\t\td := &debug.LogDebugger{}\n\t\tc := NewCollector(Debugger(d))\n\n\t\tif got, want := c.debugger, d; got != want {\n\t\t\tt.Fatalf(\"c.debugger = %v, want %v\", got, want)\n\t\t}\n\t},\n\t\"CheckHead\": func(t *testing.T) {\n\t\tc := NewCollector(CheckHead())\n\n\t\tif !c.CheckHead {\n\t\t\tt.Fatal(\"c.CheckHead = false, want true\")\n\t\t}\n\t},\n\t\"Async\": func(t *testing.T) {\n\t\tc := NewCollector(Async())\n\n\t\tif !c.Async {\n\t\t\tt.Fatal(\"c.Async = false, want true\")\n\t\t}\n\t},\n}\n\nfunc TestNoAcceptHeader(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tvar receivedHeader string\n\t// checks if Accept is enabled by default\n\tfunc() {\n\t\tc := NewCollector()\n\t\tc.OnResponse(func(resp *Response) {\n\t\t\treceivedHeader = string(resp.Body)\n\t\t})\n\t\tc.Visit(ts.URL + \"/accept_header\")\n\t\tif receivedHeader != \"*/*\" {\n\t\t\tt.Errorf(\"default Accept header isn't */*. got: %v\", receivedHeader)\n\t\t}\n\t}()\n\n\t// checks if Accept can be disabled\n\tfunc() {\n\t\tc := NewCollector()\n\t\tc.OnRequest(func(r *Request) {\n\t\t\tr.Headers.Del(\"Accept\")\n\t\t})\n\t\tc.OnResponse(func(resp *Response) {\n\t\t\treceivedHeader = string(resp.Body)\n\t\t})\n\t\tc.Visit(ts.URL + \"/accept_header\")\n\t\tif receivedHeader != \"\" {\n\t\t\tt.Errorf(\"failed to pass request with no Accept header. 
got: %v\", receivedHeader)\n\t\t}\n\t}()\n}\n\nfunc TestNewCollector(t *testing.T) {\n\tt.Run(\"Functional Options\", func(t *testing.T) {\n\t\tfor name, test := range newCollectorTests {\n\t\t\tt.Run(name, test)\n\t\t}\n\t})\n}\n\nfunc TestCollectorVisit(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\n\tonRequestCalled := false\n\tonResponseCalled := false\n\tonScrapedCalled := false\n\n\tc.OnRequest(func(r *Request) {\n\t\tonRequestCalled = true\n\t\tr.Ctx.Put(\"x\", \"y\")\n\t})\n\n\tc.OnResponse(func(r *Response) {\n\t\tonResponseCalled = true\n\n\t\tif r.Ctx.Get(\"x\") != \"y\" {\n\t\t\tt.Error(\"Failed to retrieve context value for key 'x'\")\n\t\t}\n\n\t\tif !bytes.Equal(r.Body, serverIndexResponse) {\n\t\t\tt.Error(\"Response body does not match with the original content\")\n\t\t}\n\t})\n\n\tc.OnScraped(func(r *Response) {\n\t\tif !onResponseCalled {\n\t\t\tt.Error(\"OnScraped called before OnResponse\")\n\t\t}\n\n\t\tif !onRequestCalled {\n\t\t\tt.Error(\"OnScraped called before OnRequest\")\n\t\t}\n\n\t\tonScrapedCalled = true\n\t})\n\n\tc.Visit(ts.URL)\n\n\tif !onRequestCalled {\n\t\tt.Error(\"Failed to call OnRequest callback\")\n\t}\n\n\tif !onResponseCalled {\n\t\tt.Error(\"Failed to call OnResponse callback\")\n\t}\n\n\tif !onScrapedCalled {\n\t\tt.Error(\"Failed to call OnScraped callback\")\n\t}\n}\n\nfunc TestCollectorVisitWithAllowedDomains(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector(AllowedDomains(\"localhost\", \"127.0.0.1\", \"::1\"))\n\terr := c.Visit(ts.URL)\n\tif err != nil {\n\t\tt.Errorf(\"Failed to visit url %s\", ts.URL)\n\t}\n\n\terr = c.Visit(\"http://example.com\")\n\tif err != ErrForbiddenDomain {\n\t\tt.Errorf(\"c.Visit should return ErrForbiddenDomain, but got %v\", err)\n\t}\n}\n\nfunc TestCollectorVisitWithDisallowedDomains(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector(DisallowedDomains(\"localhost\", 
\"127.0.0.1\", \"::1\"))\n\terr := c.Visit(ts.URL)\n\tif err != ErrForbiddenDomain {\n\t\tt.Errorf(\"c.Visit should return ErrForbiddenDomain, but got %v\", err)\n\t}\n\n\tc2 := NewCollector(DisallowedDomains(\"example.com\"))\n\terr = c2.Visit(\"http://example.com:8080\")\n\tif err != ErrForbiddenDomain {\n\t\tt.Errorf(\"c.Visit should return ErrForbiddenDomain, but got %v\", err)\n\t}\n\terr = c2.Visit(ts.URL)\n\tif err != nil {\n\t\tt.Errorf(\"Failed to visit url %s\", ts.URL)\n\t}\n}\n\nfunc TestCollectorVisitResponseHeaders(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tvar onResponseHeadersCalled bool\n\n\tc := NewCollector()\n\tc.OnResponseHeaders(func(r *Response) {\n\t\tonResponseHeadersCalled = true\n\t\tif r.Headers.Get(\"Content-Type\") == \"application/octet-stream\" {\n\t\t\tr.Request.Abort()\n\t\t}\n\t})\n\tc.OnResponse(func(r *Response) {\n\t\tt.Error(\"OnResponse was called\")\n\t})\n\tc.Visit(ts.URL + \"/large_binary\")\n\tif !onResponseHeadersCalled {\n\t\tt.Error(\"OnResponseHeaders was not called\")\n\t}\n}\n\nfunc TestCollectorOnHTML(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\n\ttitleCallbackCalled := false\n\tparagraphCallbackCount := 0\n\n\tc.OnHTML(\"title\", func(e *HTMLElement) {\n\t\ttitleCallbackCalled = true\n\t\tif e.Text != \"Test Page\" {\n\t\t\tt.Error(\"Title element text does not match, got\", e.Text)\n\t\t}\n\t})\n\n\tc.OnHTML(\"p\", func(e *HTMLElement) {\n\t\tparagraphCallbackCount++\n\t\tif e.Attr(\"class\") != \"description\" {\n\t\t\tt.Error(\"Failed to get paragraph's class attribute\")\n\t\t}\n\t})\n\n\tc.OnHTML(\"body\", func(e *HTMLElement) {\n\t\tif e.ChildAttr(\"p\", \"class\") != \"description\" {\n\t\t\tt.Error(\"Invalid class value\")\n\t\t}\n\t\tclasses := e.ChildAttrs(\"p\", \"class\")\n\t\tif len(classes) != 2 {\n\t\t\tt.Error(\"Invalid class values\")\n\t\t}\n\t})\n\n\tc.Visit(ts.URL + \"/html\")\n\n\tif !titleCallbackCalled {\n\t\tt.Error(\"Failed 
to call OnHTML callback for <title> tag\")\n\t}\n\n\tif paragraphCallbackCount != 2 {\n\t\tt.Error(\"Failed to find all <p> tags\")\n\t}\n}\n\nfunc TestCollectorContentSniffing(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\n\thtmlCallbackCalled := false\n\n\tc.OnResponse(func(r *Response) {\n\t\tif (*r.Headers)[\"Content-Type\"] != nil {\n\t\t\tt.Error(\"Content-Type unexpectedly not nil\")\n\t\t}\n\t})\n\n\tc.OnHTML(\"html\", func(e *HTMLElement) {\n\t\thtmlCallbackCalled = true\n\t})\n\n\terr := c.Visit(ts.URL + \"/html?no-content-type=yes\")\n\tif err != nil {\n\t\tt.Fatal(err)\n\t}\n\n\tif !htmlCallbackCalled {\n\t\tt.Error(\"OnHTML was not called\")\n\t}\n}\n\nfunc TestCollectorURLRevisit(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\n\tvisitCount := 0\n\n\tc.OnRequest(func(r *Request) {\n\t\tvisitCount++\n\t})\n\n\tc.Visit(ts.URL)\n\tc.Visit(ts.URL)\n\n\tif visitCount != 1 {\n\t\tt.Error(\"URL revisited\")\n\t}\n\n\tc.AllowURLRevisit = true\n\n\tc.Visit(ts.URL)\n\tc.Visit(ts.URL)\n\n\tif visitCount != 3 {\n\t\tt.Error(\"URL not revisited\")\n\t}\n}\n\nfunc TestCollectorPostRevisit(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tpostValue := \"hello\"\n\tpostData := map[string]string{\n\t\t\"name\": postValue,\n\t}\n\tvisitCount := 0\n\n\tc := NewCollector()\n\tc.OnResponse(func(r *Response) {\n\t\tif postValue != string(r.Body) {\n\t\t\tt.Error(\"Failed to send data with POST\")\n\t\t}\n\t\tvisitCount++\n\t})\n\n\tc.Post(ts.URL+\"/login\", postData)\n\tc.Post(ts.URL+\"/login\", postData)\n\tc.Post(ts.URL+\"/login\", map[string]string{\n\t\t\"name\":     postValue,\n\t\t\"lastname\": \"world\",\n\t})\n\n\tif visitCount != 2 {\n\t\tt.Error(\"URL POST revisited\")\n\t}\n\n\tc.AllowURLRevisit = true\n\n\tc.Post(ts.URL+\"/login\", postData)\n\tc.Post(ts.URL+\"/login\", postData)\n\n\tif visitCount != 4 {\n\t\tt.Error(\"URL POST not revisited\")\n\t}\n}\n\nfunc 
TestCollectorURLRevisitCheck(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\n\tvisited, err := c.HasVisited(ts.URL)\n\n\tif err != nil {\n\t\tt.Error(err.Error())\n\t}\n\n\tif visited != false {\n\t\tt.Error(\"Expected URL to NOT have been visited\")\n\t}\n\n\tc.Visit(ts.URL)\n\n\tvisited, err = c.HasVisited(ts.URL)\n\n\tif err != nil {\n\t\tt.Error(err.Error())\n\t}\n\n\tif visited != true {\n\t\tt.Error(\"Expected URL to have been visited\")\n\t}\n\n\terrorTestCases := []struct {\n\t\tPath             string\n\t\tDestinationError string\n\t}{\n\t\t{\"/\", \"/\"},\n\t\t{\"/redirect?d=/\", \"/\"},\n\t\t// now that /redirect?d=/ itself is recorded as visited,\n\t\t// it's now returned in error\n\t\t{\"/redirect?d=/\", \"/redirect?d=/\"},\n\t\t{\"/redirect?d=/redirect%3Fd%3D/\", \"/redirect?d=/\"},\n\t\t{\"/redirect?d=/redirect%3Fd%3D/\", \"/redirect?d=/redirect%3Fd%3D/\"},\n\t\t{\"/redirect?d=/redirect%3Fd%3D/&foo=bar\", \"/redirect?d=/\"},\n\t}\n\n\tfor i, testCase := range errorTestCases {\n\t\terr := c.Visit(ts.URL + testCase.Path)\n\t\tif testCase.DestinationError == \"\" {\n\t\t\tif err != nil {\n\t\t\t\tt.Errorf(\"got unexpected error in test %d: %q\", i, err)\n\t\t\t}\n\t\t} else {\n\t\t\tvar ave *AlreadyVisitedError\n\t\t\tif !errors.As(err, &ave) {\n\t\t\t\tt.Errorf(\"err=%q returned when trying to revisit, expected AlreadyVisitedError\", err)\n\t\t\t} else {\n\t\t\t\tif got, want := ave.Destination.String(), ts.URL+testCase.DestinationError; got != want {\n\t\t\t\t\tt.Errorf(\"wrong destination in AlreadyVisitedError in test %d, got=%q want=%q\", i, got, want)\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n}\n\nfunc TestSetCookieRedirect(t *testing.T) {\n\ttype middleware = func(http.Handler) http.Handler\n\tfor _, m := range []middleware{\n\t\trequireSessionCookieSimple,\n\t\trequireSessionCookieAuthPage,\n\t} {\n\t\tt.Run(\"\", func(t *testing.T) {\n\t\t\tts := newUnstartedTestServer()\n\t\t\tts.Config.Handler = 
m(ts.Config.Handler)\n\t\t\tts.Start()\n\t\t\tdefer ts.Close()\n\t\t\tc := NewCollector()\n\t\t\tc.OnResponse(func(r *Response) {\n\t\t\t\tif got, want := r.Body, serverIndexResponse; !bytes.Equal(got, want) {\n\t\t\t\t\tt.Errorf(\"bad response body got=%q want=%q\", got, want)\n\t\t\t\t}\n\t\t\t\tif got, want := r.StatusCode, http.StatusOK; got != want {\n\t\t\t\t\tt.Errorf(\"bad response code got=%d want=%d\", got, want)\n\t\t\t\t}\n\t\t\t})\n\t\t\tif err := c.Visit(ts.URL); err != nil {\n\t\t\t\tt.Fatal(err)\n\t\t\t}\n\t\t})\n\t}\n}\n\nfunc TestCollectorPostURLRevisitCheck(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\n\tpostValue := \"hello\"\n\tpostData := map[string]string{\n\t\t\"name\": postValue,\n\t}\n\n\tposted, err := c.HasPosted(ts.URL+\"/login\", postData)\n\n\tif err != nil {\n\t\tt.Error(err.Error())\n\t}\n\n\tif posted != false {\n\t\tt.Error(\"Expected URL to NOT have been visited\")\n\t}\n\n\tc.Post(ts.URL+\"/login\", postData)\n\n\tposted, err = c.HasPosted(ts.URL+\"/login\", postData)\n\n\tif err != nil {\n\t\tt.Error(err.Error())\n\t}\n\n\tif posted != true {\n\t\tt.Error(\"Expected URL to have been visited\")\n\t}\n\n\tpostData[\"lastname\"] = \"world\"\n\tposted, err = c.HasPosted(ts.URL+\"/login\", postData)\n\n\tif err != nil {\n\t\tt.Error(err.Error())\n\t}\n\n\tif posted != false {\n\t\tt.Error(\"Expected URL to NOT have been visited\")\n\t}\n\n\tc.Post(ts.URL+\"/login\", postData)\n\n\tposted, err = c.HasPosted(ts.URL+\"/login\", postData)\n\n\tif err != nil {\n\t\tt.Error(err.Error())\n\t}\n\n\tif posted != true {\n\t\tt.Error(\"Expected URL to have been visited\")\n\t}\n}\n\n// TestCollectorURLRevisitDomainDisallowed ensures that disallowed URL is not considered visited.\nfunc TestCollectorURLRevisitDomainDisallowed(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tparsedURL, err := url.Parse(ts.URL)\n\tif err != nil {\n\t\tt.Fatal(err)\n\t}\n\n\tc := 
NewCollector(DisallowedDomains(parsedURL.Hostname()))\n\terr = c.Visit(ts.URL)\n\tif got, want := err, ErrForbiddenDomain; got != want {\n\t\tt.Fatalf(\"wrong error on first visit: got=%v want=%v\", got, want)\n\t}\n\terr = c.Visit(ts.URL)\n\tif got, want := err, ErrForbiddenDomain; got != want {\n\t\tt.Fatalf(\"wrong error on second visit: got=%v want=%v\", got, want)\n\t}\n\n}\n\nfunc TestCollectorPost(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tpostValue := \"hello\"\n\tc := NewCollector()\n\n\tc.OnResponse(func(r *Response) {\n\t\tif postValue != string(r.Body) {\n\t\t\tt.Error(\"Failed to send data with POST\")\n\t\t}\n\t})\n\n\tc.Post(ts.URL+\"/login\", map[string]string{\n\t\t\"name\": postValue,\n\t})\n}\n\nfunc TestCollectorPostRaw(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tpostValue := \"hello\"\n\tc := NewCollector()\n\n\tc.OnResponse(func(r *Response) {\n\t\tif postValue != string(r.Body) {\n\t\t\tt.Error(\"Failed to send data with POST\")\n\t\t}\n\t})\n\n\tc.PostRaw(ts.URL+\"/login\", []byte(\"name=\"+postValue))\n}\n\nfunc TestCollectorPostRawRevisit(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tpostValue := \"hello\"\n\tpostData := \"name=\" + postValue\n\tvisitCount := 0\n\n\tc := NewCollector()\n\tc.OnResponse(func(r *Response) {\n\t\tif postValue != string(r.Body) {\n\t\t\tt.Error(\"Failed to send data with POST RAW\")\n\t\t}\n\t\tvisitCount++\n\t})\n\n\tc.PostRaw(ts.URL+\"/login\", []byte(postData))\n\tc.PostRaw(ts.URL+\"/login\", []byte(postData))\n\tc.PostRaw(ts.URL+\"/login\", []byte(postData+\"&lastname=world\"))\n\n\tif visitCount != 2 {\n\t\tt.Error(\"URL POST RAW revisited\")\n\t}\n\n\tc.AllowURLRevisit = true\n\n\tc.PostRaw(ts.URL+\"/login\", []byte(postData))\n\tc.PostRaw(ts.URL+\"/login\", []byte(postData))\n\n\tif visitCount != 4 {\n\t\tt.Error(\"URL POST RAW not revisited\")\n\t}\n}\n\nfunc TestRedirect(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc 
:= NewCollector()\n\tc.OnHTML(\"a[href]\", func(e *HTMLElement) {\n\t\tu := e.Request.AbsoluteURL(e.Attr(\"href\"))\n\t\tif !strings.HasSuffix(u, \"/redirected/test\") {\n\t\t\tt.Error(\"Invalid URL after redirect: \" + u)\n\t\t}\n\t})\n\n\tc.OnResponseHeaders(func(r *Response) {\n\t\tif !strings.HasSuffix(r.Request.URL.String(), \"/redirected/\") {\n\t\t\tt.Error(\"Invalid URL in Request after redirect (OnResponseHeaders): \" + r.Request.URL.String())\n\t\t}\n\t})\n\n\tc.OnResponse(func(r *Response) {\n\t\tif !strings.HasSuffix(r.Request.URL.String(), \"/redirected/\") {\n\t\t\tt.Error(\"Invalid URL in Request after redirect (OnResponse): \" + r.Request.URL.String())\n\t\t}\n\t})\n\tc.Visit(ts.URL + \"/redirect\")\n}\n\nfunc TestIssue594(t *testing.T) {\n\t// This is a regression test for a data race bug. There are no\n\t// assertions because it is meant to be run with the race detector.\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\t// if timeout is set, this bug is not triggered\n\tc.SetClient(&http.Client{Timeout: 0 * time.Second})\n\n\tc.Visit(ts.URL)\n}\n\nfunc TestRedirectWithDisallowedURLs(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\tc.DisallowedURLFilters = []*regexp.Regexp{regexp.MustCompile(ts.URL + \"/redirected/test\")}\n\tc.OnHTML(\"a[href]\", func(e *HTMLElement) {\n\t\tu := e.Request.AbsoluteURL(e.Attr(\"href\"))\n\t\terr := c.Visit(u)\n\t\tif !errors.Is(err, ErrForbiddenURL) {\n\t\t\tt.Error(\"URL should have been forbidden: \" + u)\n\t\t}\n\t})\n\n\tc.Visit(ts.URL + \"/redirect\")\n}\n\nfunc TestBaseTag(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\tc.OnHTML(\"a[href]\", func(e *HTMLElement) {\n\t\tu := e.Request.AbsoluteURL(e.Attr(\"href\"))\n\t\tif u != \"http://xy.com/z\" {\n\t\t\tt.Error(\"Invalid <base /> tag handling in OnHTML: expected http://xy.com/z, got \" + u)\n\t\t}\n\t})\n\tc.Visit(ts.URL + \"/base\")\n\n\tc2 := 
NewCollector()\n\tc2.OnXML(\"//a\", func(e *XMLElement) {\n\t\tu := e.Request.AbsoluteURL(e.Attr(\"href\"))\n\t\tif u != \"http://xy.com/z\" {\n\t\t\tt.Error(\"Invalid <base /> tag handling in OnXML: expected http://xy.com/z, got \" + u)\n\t\t}\n\t})\n\tc2.Visit(ts.URL + \"/base\")\n}\n\nfunc TestBaseTagRelative(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\tc.OnHTML(\"a[href]\", func(e *HTMLElement) {\n\t\tu := e.Request.AbsoluteURL(e.Attr(\"href\"))\n\t\texpected := ts.URL + \"/foobar/z\"\n\t\tif u != expected {\n\t\t\tt.Errorf(\"Invalid <base /> tag handling in OnHTML: expected %q, got %q\", expected, u)\n\t\t}\n\t})\n\tc.Visit(ts.URL + \"/base_relative\")\n\n\tc2 := NewCollector()\n\tc2.OnXML(\"//a\", func(e *XMLElement) {\n\t\tu := e.Request.AbsoluteURL(e.Attr(\"href\"))\n\t\texpected := ts.URL + \"/foobar/z\"\n\t\tif u != expected {\n\t\t\tt.Errorf(\"Invalid <base /> tag handling in OnXML: expected %q, got %q\", expected, u)\n\t\t}\n\t})\n\tc2.Visit(ts.URL + \"/base_relative\")\n}\n\nfunc TestTabsAndNewlines(t *testing.T) {\n\t// this test might look odd, but see step 3 of\n\t// https://url.spec.whatwg.org/#concept-basic-url-parser\n\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tvisited := map[string]struct{}{}\n\texpected := map[string]struct{}{\n\t\t\"/tabs_and_newlines\": {},\n\t\t\"/foobar/xy\":         {},\n\t}\n\n\tc := NewCollector()\n\tc.OnResponse(func(res *Response) {\n\t\tvisited[res.Request.URL.EscapedPath()] = struct{}{}\n\t})\n\tc.OnHTML(\"a[href]\", func(e *HTMLElement) {\n\t\tif err := e.Request.Visit(e.Attr(\"href\")); err != nil {\n\t\t\tt.Errorf(\"visit failed: %v\", err)\n\t\t}\n\t})\n\n\tif err := c.Visit(ts.URL + \"/tabs_and_newlines\"); err != nil {\n\t\tt.Errorf(\"visit failed: %v\", err)\n\t}\n\n\tif !reflect.DeepEqual(visited, expected) {\n\t\tt.Errorf(\"visited=%v expected=%v\", visited, expected)\n\t}\n}\n\nfunc TestLonePercent(t *testing.T) {\n\tts := newTestServer()\n\tdefer 
ts.Close()\n\n\tvar visitedPath string\n\n\tc := NewCollector()\n\tc.OnResponse(func(res *Response) {\n\t\tvisitedPath = res.Request.URL.RequestURI()\n\t})\n\tif err := c.Visit(ts.URL + \"/100%\"); err != nil {\n\t\tt.Errorf(\"visit failed: %v\", err)\n\t}\n\t// Automatic encoding is not really correct: browsers\n\t// would send bare percent here. However, Go net/http\n\t// cannot send such requests due to\n\t// https://github.com/golang/go/issues/29808. So we have two\n\t// alternatives really: return an error when attempting\n\t// to fetch such URLs, or at least try the encoded variant.\n\t// This test checks that the latter is attempted.\n\tif got, want := visitedPath, \"/100%25\"; got != want {\n\t\tt.Errorf(\"got=%q want=%q\", got, want)\n\t}\n\t// invalid URL escape in query component is not a problem,\n\t// but check it anyway\n\tif err := c.Visit(ts.URL + \"/?a=100%zz\"); err != nil {\n\t\tt.Errorf(\"visit failed: %v\", err)\n\t}\n\tif got, want := visitedPath, \"/?a=100%zz\"; got != want {\n\t\tt.Errorf(\"got=%q want=%q\", got, want)\n\t}\n}\n\nfunc TestCollectorCookies(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\n\tif err := c.Visit(ts.URL + \"/set_cookie\"); err != nil {\n\t\tt.Fatal(err)\n\t}\n\n\tif err := c.Visit(ts.URL + \"/check_cookie\"); err != nil {\n\t\tt.Fatalf(\"Failed to use previously set cookies: %s\", err)\n\t}\n}\n\nfunc TestRobotsWhenAllowed(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\tc.IgnoreRobotsTxt = false\n\n\tc.OnResponse(func(resp *Response) {\n\t\tif resp.StatusCode != 200 {\n\t\t\tt.Fatalf(\"Wrong response code: %d\", resp.StatusCode)\n\t\t}\n\t})\n\n\terr := c.Visit(ts.URL + \"/allowed\")\n\n\tif err != nil {\n\t\tt.Fatal(err)\n\t}\n}\n\nfunc TestRobotsWhenDisallowed(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\tc.IgnoreRobotsTxt = false\n\n\tc.OnResponse(func(resp *Response) 
{\n\t\tt.Fatalf(\"Received response: %d\", resp.StatusCode)\n\t})\n\n\terr := c.Visit(ts.URL + \"/disallowed\")\n\tif err.Error() != \"URL blocked by robots.txt\" {\n\t\tt.Fatalf(\"wrong error message: %v\", err)\n\t}\n}\n\nfunc TestRobotsWhenDisallowedWithQueryParameter(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\tc.IgnoreRobotsTxt = false\n\n\tc.OnResponse(func(resp *Response) {\n\t\tt.Fatalf(\"Received response: %d\", resp.StatusCode)\n\t})\n\n\terr := c.Visit(ts.URL + \"/allowed?q=1\")\n\tif err.Error() != \"URL blocked by robots.txt\" {\n\t\tt.Fatalf(\"wrong error message: %v\", err)\n\t}\n}\n\nfunc TestIgnoreRobotsWhenDisallowed(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\tc.IgnoreRobotsTxt = true\n\n\tc.OnResponse(func(resp *Response) {\n\t\tif resp.StatusCode != 200 {\n\t\t\tt.Fatalf(\"Wrong response code: %d\", resp.StatusCode)\n\t\t}\n\t})\n\n\terr := c.Visit(ts.URL + \"/disallowed\")\n\n\tif err != nil {\n\t\tt.Fatal(err)\n\t}\n\n}\n\nfunc TestConnectionErrorOnRobotsTxtResultsInError(t *testing.T) {\n\tts := newTestServer()\n\tts.Close() // immediately close the server to force a connection error\n\n\tc := NewCollector()\n\tc.IgnoreRobotsTxt = false\n\terr := c.Visit(ts.URL)\n\n\tif err == nil {\n\t\tt.Fatal(\"Error expected\")\n\t}\n}\n\nfunc TestEnvSettings(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tos.Setenv(\"COLLY_USER_AGENT\", \"test\")\n\tdefer os.Unsetenv(\"COLLY_USER_AGENT\")\n\n\tc := NewCollector()\n\n\tvalid := false\n\n\tc.OnResponse(func(resp *Response) {\n\t\tif string(resp.Body) == \"test\" {\n\t\t\tvalid = true\n\t\t}\n\t})\n\n\tc.Visit(ts.URL + \"/user_agent\")\n\n\tif !valid {\n\t\tt.Fatalf(\"Wrong user-agent from environment\")\n\t}\n}\n\nfunc TestUserAgent(t *testing.T) {\n\tconst exampleUserAgent1 = \"Example/1.0\"\n\tconst exampleUserAgent2 = \"Example/2.0\"\n\tconst defaultUserAgent = \"colly - 
https://github.com/gocolly/colly\"\n\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tvar receivedUserAgent string\n\n\tfunc() {\n\t\tc := NewCollector()\n\t\tc.OnResponse(func(resp *Response) {\n\t\t\treceivedUserAgent = string(resp.Body)\n\t\t})\n\t\tc.Visit(ts.URL + \"/user_agent\")\n\t\tif got, want := receivedUserAgent, defaultUserAgent; got != want {\n\t\t\tt.Errorf(\"mismatched User-Agent: got=%q want=%q\", got, want)\n\t\t}\n\t}()\n\tfunc() {\n\t\tc := NewCollector(UserAgent(exampleUserAgent1))\n\t\tc.OnResponse(func(resp *Response) {\n\t\t\treceivedUserAgent = string(resp.Body)\n\t\t})\n\t\tc.Visit(ts.URL + \"/user_agent\")\n\t\tif got, want := receivedUserAgent, exampleUserAgent1; got != want {\n\t\t\tt.Errorf(\"mismatched User-Agent: got=%q want=%q\", got, want)\n\t\t}\n\t}()\n\tfunc() {\n\t\tc := NewCollector(UserAgent(exampleUserAgent1))\n\t\tc.OnResponse(func(resp *Response) {\n\t\t\treceivedUserAgent = string(resp.Body)\n\t\t})\n\n\t\tc.Request(\"GET\", ts.URL+\"/user_agent\", nil, nil, nil)\n\t\tif got, want := receivedUserAgent, exampleUserAgent1; got != want {\n\t\t\tt.Errorf(\"mismatched User-Agent (nil hdr): got=%q want=%q\", got, want)\n\t\t}\n\t}()\n\tfunc() {\n\t\tc := NewCollector(UserAgent(exampleUserAgent1))\n\t\tc.OnResponse(func(resp *Response) {\n\t\t\treceivedUserAgent = string(resp.Body)\n\t\t})\n\n\t\tc.Request(\"GET\", ts.URL+\"/user_agent\", nil, nil, http.Header{})\n\t\tif got, want := receivedUserAgent, exampleUserAgent1; got != want {\n\t\t\tt.Errorf(\"mismatched User-Agent (non-nil hdr): got=%q want=%q\", got, want)\n\t\t}\n\t}()\n\tfunc() {\n\t\tc := NewCollector(UserAgent(exampleUserAgent1))\n\t\tc.OnResponse(func(resp *Response) {\n\t\t\treceivedUserAgent = string(resp.Body)\n\t\t})\n\t\thdr := http.Header{}\n\t\thdr.Set(\"User-Agent\", \"\")\n\n\t\tc.Request(\"GET\", ts.URL+\"/user_agent\", nil, nil, hdr)\n\t\tif got, want := receivedUserAgent, \"\"; got != want {\n\t\t\tt.Errorf(\"mismatched User-Agent (hdr with empty 
UA): got=%q want=%q\", got, want)\n\t\t}\n\t}()\n\tfunc() {\n\t\tc := NewCollector(UserAgent(exampleUserAgent1))\n\t\tc.OnResponse(func(resp *Response) {\n\t\t\treceivedUserAgent = string(resp.Body)\n\t\t})\n\t\thdr := http.Header{}\n\t\thdr.Set(\"User-Agent\", exampleUserAgent2)\n\n\t\tc.Request(\"GET\", ts.URL+\"/user_agent\", nil, nil, hdr)\n\t\tif got, want := receivedUserAgent, exampleUserAgent2; got != want {\n\t\t\tt.Errorf(\"mismatched User-Agent (hdr with UA): got=%q want=%q\", got, want)\n\t\t}\n\t}()\n}\n\nfunc TestHeaders(t *testing.T) {\n\tconst exampleHostHeader = \"example.com\"\n\tconst exampleTestHeader = \"Testing\"\n\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tvar receivedHeader string\n\n\tfunc() {\n\t\tc := NewCollector(\n\t\t\tHeaders(map[string]string{\"Host\": exampleHostHeader}),\n\t\t)\n\t\tc.OnResponse(func(resp *Response) {\n\t\t\treceivedHeader = string(resp.Body)\n\t\t})\n\t\tc.Visit(ts.URL + \"/host_header\")\n\t\tif got, want := receivedHeader, exampleHostHeader; got != want {\n\t\t\tt.Errorf(\"mismatched Host header: got=%q want=%q\", got, want)\n\t\t}\n\t}()\n\tfunc() {\n\t\tc := NewCollector(\n\t\t\tHeaders(map[string]string{\"Test\": exampleTestHeader}),\n\t\t)\n\t\tc.OnResponse(func(resp *Response) {\n\t\t\treceivedHeader = string(resp.Body)\n\t\t})\n\t\tc.Visit(ts.URL + \"/custom_header\")\n\t\tif got, want := receivedHeader, exampleTestHeader; got != want {\n\t\t\tt.Errorf(\"mismatched custom header: got=%q want=%q\", got, want)\n\t\t}\n\t}()\n}\n\nfunc TestParseHTTPErrorResponse(t *testing.T) {\n\tcontentCount := 0\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector(\n\t\tAllowURLRevisit(),\n\t)\n\n\tc.OnHTML(\"p\", func(e *HTMLElement) {\n\t\tif e.Text == \"error\" {\n\t\t\tcontentCount++\n\t\t}\n\t})\n\n\tc.Visit(ts.URL + \"/500\")\n\n\tif contentCount != 0 {\n\t\tt.Fatal(\"Content is parsed without ParseHTTPErrorResponse enabled\")\n\t}\n\n\tc.ParseHTTPErrorResponse = true\n\n\tc.Visit(ts.URL + 
\"/500\")\n\n\tif contentCount != 1 {\n\t\tt.Fatal(\"Content isn't parsed with ParseHTTPErrorResponse enabled\")\n\t}\n\n}\n\nfunc TestHTMLElement(t *testing.T) {\n\tctx := &Context{}\n\tresp := &Response{\n\t\tRequest: &Request{\n\t\t\tCtx: ctx,\n\t\t},\n\t\tCtx: ctx,\n\t}\n\n\tin := `<a href=\"http://go-colly.org\">Colly</a>`\n\tsel := \"a[href]\"\n\tdoc, err := goquery.NewDocumentFromReader(bytes.NewBuffer([]byte(in)))\n\tif err != nil {\n\t\tt.Fatal(err)\n\t}\n\telements := []*HTMLElement{}\n\ti := 0\n\tdoc.Find(sel).Each(func(_ int, s *goquery.Selection) {\n\t\tfor _, n := range s.Nodes {\n\t\t\telements = append(elements, NewHTMLElementFromSelectionNode(resp, s, n, i))\n\t\t\ti++\n\t\t}\n\t})\n\telementsLen := len(elements)\n\tif elementsLen != 1 {\n\t\tt.Errorf(\"element length mismatch. got %d, expected %d.\\n\", elementsLen, 1)\n\t}\n\tv := elements[0]\n\tif v.Name != \"a\" {\n\t\tt.Errorf(\"element tag mismatch. got %s, expected %s.\\n\", v.Name, \"a\")\n\t}\n\tif v.Text != \"Colly\" {\n\t\tt.Errorf(\"element content mismatch. got %s, expected %s.\\n\", v.Text, \"Colly\")\n\t}\n\tif v.Attr(\"href\") != \"http://go-colly.org\" {\n\t\tt.Errorf(\"element href mismatch. 
got %s, expected %s.\\n\", v.Attr(\"href\"), \"http://go-colly.org\")\n\t}\n}\n\nfunc TestCollectorOnXMLWithHtml(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\n\ttitleCallbackCalled := false\n\tparagraphCallbackCount := 0\n\n\tc.OnXML(\"/html/head/title\", func(e *XMLElement) {\n\t\ttitleCallbackCalled = true\n\t\tif e.Text != \"Test Page\" {\n\t\t\tt.Error(\"Title element text does not match, got\", e.Text)\n\t\t}\n\t})\n\n\tc.OnXML(\"/html/body/p\", func(e *XMLElement) {\n\t\tparagraphCallbackCount++\n\t\tif e.Attr(\"class\") != \"description\" {\n\t\t\tt.Error(\"Failed to get paragraph's class attribute\")\n\t\t}\n\t})\n\n\tc.OnXML(\"/html/body\", func(e *XMLElement) {\n\t\tif e.ChildAttr(\"p\", \"class\") != \"description\" {\n\t\t\tt.Error(\"Invalid class value\")\n\t\t}\n\t\tclasses := e.ChildAttrs(\"p\", \"class\")\n\t\tif len(classes) != 2 {\n\t\t\tt.Error(\"Invalid class values\")\n\t\t}\n\t})\n\n\tc.Visit(ts.URL + \"/html\")\n\n\tif !titleCallbackCalled {\n\t\tt.Error(\"Failed to call OnXML callback for <title> tag\")\n\t}\n\n\tif paragraphCallbackCount != 2 {\n\t\tt.Error(\"Failed to find all <p> tags\")\n\t}\n}\n\nfunc TestCollectorOnXMLWithXML(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\n\ttitleCallbackCalled := false\n\tparagraphCallbackCount := 0\n\n\tc.OnXML(\"//page/title\", func(e *XMLElement) {\n\t\ttitleCallbackCalled = true\n\t\tif e.Text != \"Test Page\" {\n\t\t\tt.Error(\"Title element text does not match, got\", e.Text)\n\t\t}\n\t})\n\n\tc.OnXML(\"//page/paragraph\", func(e *XMLElement) {\n\t\tparagraphCallbackCount++\n\t\tif e.Attr(\"type\") != \"description\" {\n\t\t\tt.Error(\"Failed to get paragraph's type attribute\")\n\t\t}\n\t})\n\n\tc.OnXML(\"/page\", func(e *XMLElement) {\n\t\tif e.ChildAttr(\"paragraph\", \"type\") != \"description\" {\n\t\t\tt.Error(\"Invalid type value\")\n\t\t}\n\t\tclasses := e.ChildAttrs(\"paragraph\", \"type\")\n\t\tif 
len(classes) != 2 {\n\t\t\tt.Error(\"Invalid type values\")\n\t\t}\n\t})\n\n\tc.Visit(ts.URL + \"/xml\")\n\n\tif !titleCallbackCalled {\n\t\tt.Error(\"Failed to call OnXML callback for <title> tag\")\n\t}\n\n\tif paragraphCallbackCount != 2 {\n\t\tt.Error(\"Failed to find all <paragraph> tags\")\n\t}\n}\n\nfunc TestCollectorVisitWithTrace(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector(AllowedDomains(\"localhost\", \"127.0.0.1\", \"::1\"), TraceHTTP())\n\tc.OnResponse(func(resp *Response) {\n\t\tif resp.Trace == nil {\n\t\t\tt.Error(\"Failed to initialize trace\")\n\t\t}\n\t})\n\n\terr := c.Visit(ts.URL)\n\tif err != nil {\n\t\tt.Errorf(\"Failed to visit url %s\", ts.URL)\n\t}\n}\n\nfunc TestCollectorVisitWithCheckHead(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector(CheckHead())\n\tvar requestMethodChain []string\n\tc.OnResponse(func(resp *Response) {\n\t\trequestMethodChain = append(requestMethodChain, resp.Request.Method)\n\t})\n\n\terr := c.Visit(ts.URL)\n\tif err != nil {\n\t\tt.Errorf(\"Failed to visit url %s\", ts.URL)\n\t}\n\t// Fail unless exactly the expected order was observed: HEAD first, then GET.\n\tif len(requestMethodChain) < 2 || requestMethodChain[0] != \"HEAD\" || requestMethodChain[1] != \"GET\" {\n\t\tt.Errorf(\"Failed to perform a HEAD request before GET\")\n\t}\n}\n\nfunc TestCollectorDepth(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\tmaxDepth := 2\n\tc1 := NewCollector(\n\t\tMaxDepth(maxDepth),\n\t\tAllowURLRevisit(),\n\t)\n\trequestCount := 0\n\tc1.OnResponse(func(resp *Response) {\n\t\trequestCount++\n\t\tif requestCount >= 10 {\n\t\t\treturn\n\t\t}\n\t\tc1.Visit(ts.URL)\n\t})\n\tc1.Visit(ts.URL)\n\tif requestCount < 10 {\n\t\tt.Errorf(\"Invalid number of requests: %d (expected 10) without using MaxDepth\", requestCount)\n\t}\n\n\tc2 := c1.Clone()\n\trequestCount = 0\n\tc2.OnResponse(func(resp *Response) {\n\t\trequestCount++\n\t\tresp.Request.Visit(ts.URL)\n\t})\n\tc2.Visit(ts.URL)\n\tif requestCount != 2 {\n\t\tt.Errorf(\"Invalid number of requests: %d 
(expected 2) with using MaxDepth 2\", requestCount)\n\t}\n\n\tc1.Visit(ts.URL)\n\tif requestCount < 10 {\n\t\tt.Errorf(\"Invalid number of requests: %d (expected 10) without using MaxDepth again\", requestCount)\n\t}\n\n\trequestCount = 0\n\tc2.Visit(ts.URL)\n\tif requestCount != 2 {\n\t\tt.Errorf(\"Invalid number of requests: %d (expected 2) with using MaxDepth 2 again\", requestCount)\n\t}\n}\n\nfunc TestCollectorRequests(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\tmaxRequests := uint32(5)\n\tc1 := NewCollector(\n\t\tMaxRequests(maxRequests),\n\t\tAllowURLRevisit(),\n\t)\n\trequestCount := 0\n\tc1.OnResponse(func(resp *Response) {\n\t\trequestCount++\n\t\tc1.Visit(ts.URL)\n\t})\n\tc1.Visit(ts.URL)\n\tif requestCount != 5 {\n\t\tt.Errorf(\"Invalid number of requests: %d (expected 5) with MaxRequests\", requestCount)\n\t}\n}\n\nfunc TestCollectorContext(t *testing.T) {\n\t// \"/slow\" takes 1 second to return the response.\n\t// If context does abort the transfer after 0.5 seconds as it should,\n\t// OnError will be called, and the test is passed. 
Otherwise, test is failed.\n\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)\n\tdefer cancel()\n\n\tc := NewCollector(StdlibContext(ctx))\n\n\tonErrorCalled := false\n\n\tc.OnResponse(func(resp *Response) {\n\t\tt.Error(\"OnResponse was called, expected OnError\")\n\t})\n\n\tc.OnError(func(resp *Response, err error) {\n\t\tonErrorCalled = true\n\t\tif err != context.DeadlineExceeded {\n\t\t\tt.Errorf(\"OnError got err=%#v, expected context.DeadlineExceeded\", err)\n\t\t}\n\t})\n\n\terr := c.Visit(ts.URL + \"/slow\")\n\tif err != context.DeadlineExceeded {\n\t\tt.Errorf(\"Visit return err=%#v, expected context.DeadlineExceeded\", err)\n\t}\n\n\tif !onErrorCalled {\n\t\tt.Error(\"OnError was not called\")\n\t}\n\n}\n\nfunc BenchmarkOnHTML(b *testing.B) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\tc.OnHTML(\"p\", func(_ *HTMLElement) {})\n\n\tfor n := 0; n < b.N; n++ {\n\t\tc.Visit(fmt.Sprintf(\"%s/html?q=%d\", ts.URL, n))\n\t}\n}\n\nfunc BenchmarkOnXML(b *testing.B) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\tc.OnXML(\"//p\", func(_ *XMLElement) {})\n\n\tfor n := 0; n < b.N; n++ {\n\t\tc.Visit(fmt.Sprintf(\"%s/html?q=%d\", ts.URL, n))\n\t}\n}\n\nfunc BenchmarkOnResponse(b *testing.B) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\tc.AllowURLRevisit = true\n\tc.OnResponse(func(_ *Response) {})\n\n\tfor n := 0; n < b.N; n++ {\n\t\tc.Visit(ts.URL)\n\t}\n}\n\nfunc requireSessionCookieSimple(handler http.Handler) http.Handler {\n\tconst cookieName = \"session_id\"\n\n\treturn http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\tif _, err := r.Cookie(cookieName); err == http.ErrNoCookie {\n\t\t\thttp.SetCookie(w, &http.Cookie{Name: cookieName, Value: \"1\"})\n\t\t\thttp.Redirect(w, r, r.RequestURI, http.StatusFound)\n\t\t\treturn\n\t\t}\n\t\thandler.ServeHTTP(w, r)\n\t})\n}\n\nfunc 
requireSessionCookieAuthPage(handler http.Handler) http.Handler {\n\tconst setCookiePath = \"/auth\"\n\tconst cookieName = \"session_id\"\n\n\treturn http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\tif r.URL.Path == setCookiePath {\n\t\t\tdestination := r.URL.Query().Get(\"return\")\n\t\t\thttp.Redirect(w, r, destination, http.StatusFound)\n\t\t\treturn\n\t\t}\n\t\tif _, err := r.Cookie(cookieName); err == http.ErrNoCookie {\n\t\t\thttp.SetCookie(w, &http.Cookie{Name: cookieName, Value: \"1\"})\n\t\t\thttp.Redirect(w, r, setCookiePath+\"?return=\"+url.QueryEscape(r.RequestURI), http.StatusFound)\n\t\t\treturn\n\t\t}\n\t\thandler.ServeHTTP(w, r)\n\t})\n}\n\nfunc TestCallbackDetachment(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\tc.AllowURLRevisit = true\n\n\tvar executions [3]int // tracks number of executions of each callback\n\n\tc.OnHTML(\"#firstElem\", func(e *HTMLElement) {\n\t\texecutions[0]++\n\t\t// Detach this callback after first execution\n\t\tc.OnHTMLDetach(\"#firstElem\")\n\t})\n\tc.OnHTML(\"#secondElem\", func(e *HTMLElement) {\n\t\texecutions[1]++\n\t})\n\tc.OnHTML(\"#thirdElem\", func(e *HTMLElement) {\n\t\texecutions[2]++\n\t})\n\n\t// First visit - all callbacks should execute\n\tc.Visit(ts.URL + \"/callback_test\")\n\t// Second visit - first callback should NOT execute\n\tc.Visit(ts.URL + \"/callback_test\")\n\n\t// Verify callback counts\n\tif executions[0] != 1 {\n\t\tt.Errorf(\"firstElem callback executed %d times, expected 1\", executions[0])\n\t}\n\tif executions[1] != 2 {\n\t\tt.Errorf(\"secondElem callback executed %d times, expected 2\", executions[1])\n\t}\n\tif executions[2] != 2 {\n\t\tt.Errorf(\"thirdElem callback executed %d times, expected 2\", executions[2])\n\t}\n}\n\nfunc TestCollectorPostRetry(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tpostValue := \"hello\"\n\tc := NewCollector()\n\ttry := false\n\tc.OnResponse(func(r *Response) {\n\t\tif 
r.Ctx.Get(\"notFirst\") == \"\" {\n\t\t\tr.Ctx.Put(\"notFirst\", \"first\")\n\t\t\t_ = r.Request.Retry()\n\t\t\treturn\n\t\t}\n\t\tif postValue != string(r.Body) {\n\t\t\tt.Error(\"Failed to send data with POST\")\n\t\t}\n\t\ttry = true\n\t})\n\n\tc.Post(ts.URL+\"/login\", map[string]string{\n\t\t\"name\": postValue,\n\t})\n\tif !try {\n\t\tt.Error(\"OnResponse Retry was not called\")\n\t}\n}\nfunc TestCollectorGetRetry(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\ttry := false\n\n\tc := NewCollector()\n\n\tc.OnResponse(func(r *Response) {\n\t\tif r.Ctx.Get(\"notFirst\") == \"\" {\n\t\t\tr.Ctx.Put(\"notFirst\", \"first\")\n\t\t\t_ = r.Request.Retry()\n\t\t\treturn\n\t\t}\n\t\tif !bytes.Equal(r.Body, serverIndexResponse) {\n\t\t\tt.Error(\"Response body does not match with the original content\")\n\t\t}\n\t\ttry = true\n\t})\n\n\tc.Visit(ts.URL)\n\tif !try {\n\t\tt.Error(\"OnResponse Retry was not called\")\n\t}\n}\n\nfunc TestCollectorPostRetryUnseekable(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\ttry := false\n\tpostValue := \"hello\"\n\tc := NewCollector()\n\n\tc.OnResponse(func(r *Response) {\n\t\tif postValue != string(r.Body) {\n\t\t\tt.Error(\"Failed to send data with POST\")\n\t\t}\n\n\t\tif r.Ctx.Get(\"notFirst\") == \"\" {\n\t\t\tr.Ctx.Put(\"notFirst\", \"first\")\n\t\t\terr := r.Request.Retry()\n\t\t\tif !errors.Is(err, ErrRetryBodyUnseekable) {\n\t\t\t\tt.Errorf(\"Unexpected error Type ErrRetryBodyUnseekable : %v\", err)\n\t\t\t}\n\t\t\treturn\n\t\t}\n\t\ttry = true\n\t})\n\tc.Request(\"POST\", ts.URL+\"/login\", bytes.NewBuffer([]byte(\"name=\"+postValue)), nil, nil)\n\tif try {\n\t\tt.Error(\"OnResponse Retry was called but BodyUnseekable\")\n\t}\n}\n\nfunc TestRedirectErrorRetry(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\tc := NewCollector()\n\tc.OnError(func(r *Response, err error) {\n\t\tif r.Ctx.Get(\"notFirst\") == \"\" {\n\t\t\tr.Ctx.Put(\"notFirst\", \"first\")\n\t\t\t_ = 
r.Request.Retry()\n\t\t\treturn\n\t\t}\n\t\tif e := (&AlreadyVisitedError{}); errors.As(err, &e) {\n\t\t\tt.Error(\"loop AlreadyVisitedError\")\n\t\t}\n\n\t})\n\tc.OnResponse(func(response *Response) {\n\t\t//println(1)\n\t})\n\tc.Visit(ts.URL + \"/redirected/\")\n\tc.Visit(ts.URL + \"/redirect\")\n}\n\nfunc TestCheckRequestHeadersFunc(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\ttry := false\n\n\tc := NewCollector()\n\n\tc.OnRequestHeaders(func(r *Request) {\n\t\ttry = true\n\t\tr.Abort()\n\t})\n\tc.OnScraped(func(r *Response) {\n\t\ttry = false\n\t})\n\tc.Visit(ts.URL)\n\tif try == false {\n\t\tt.Error(\"TestCheckRequestHeadersFunc failed\")\n\t}\n}\n\nfunc TestIssue745GzipURLWith404Response(t *testing.T) {\n\tts := newTestServer()\n\tdefer ts.Close()\n\n\tc := NewCollector()\n\n\tvar responseStatusCode int\n\tc.OnError(func(resp *Response, err error) {\n\t\tresponseStatusCode = resp.StatusCode\n\t\t// The error should NOT be \"gzip: invalid header\".\n\t\t// A 404 response for a .xml.gz URL should be treated as a\n\t\t// normal HTTP error, not a decompression failure.\n\t\tif strings.Contains(err.Error(), \"gzip\") {\n\t\t\tt.Errorf(\"Expected HTTP error, got gzip decompression error: %v\", err)\n\t\t}\n\t})\n\n\tc.OnResponse(func(resp *Response) {\n\t\t// A 404 should not reach OnResponse as a successful response\n\t\tif resp.StatusCode == 404 {\n\t\t\tresponseStatusCode = resp.StatusCode\n\t\t}\n\t})\n\n\tc.Visit(ts.URL + \"/sitemap.xml.gz\")\n\n\t// The response should have been received (either via OnError or OnResponse)\n\t// with status 404, not a gzip decompression error\n\tif responseStatusCode != 404 {\n\t\tt.Errorf(\"Expected status code 404, got %d\", responseStatusCode)\n\t}\n}\n"
  },
  {
    "path": "context.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage colly\n\nimport (\n\t\"sync\"\n)\n\n// Context provides a tiny layer for passing data between callbacks\ntype Context struct {\n\tcontextMap map[string]interface{}\n\tlock       *sync.RWMutex\n}\n\n// NewContext initializes a new Context instance\nfunc NewContext() *Context {\n\treturn &Context{\n\t\tcontextMap: make(map[string]interface{}),\n\t\tlock:       &sync.RWMutex{},\n\t}\n}\n\n// UnmarshalBinary decodes Context value to nil\n// This function is used by request caching\nfunc (c *Context) UnmarshalBinary(_ []byte) error {\n\treturn nil\n}\n\n// MarshalBinary encodes Context value\n// This function is used by request caching\nfunc (c *Context) MarshalBinary() (_ []byte, _ error) {\n\treturn nil, nil\n}\n\n// Put stores a value of any type in Context\nfunc (c *Context) Put(key string, value interface{}) {\n\tc.lock.Lock()\n\tc.contextMap[key] = value\n\tc.lock.Unlock()\n}\n\n// Get retrieves a string value from Context.\n// Get returns an empty string if key not found\nfunc (c *Context) Get(key string) string {\n\tc.lock.RLock()\n\tdefer c.lock.RUnlock()\n\tif v, ok := c.contextMap[key]; ok {\n\t\treturn v.(string)\n\t}\n\treturn \"\"\n}\n\n// GetAny retrieves a value from Context.\n// GetAny returns nil if key not found\nfunc (c *Context) GetAny(key string) interface{} {\n\tc.lock.RLock()\n\tdefer c.lock.RUnlock()\n\tif v, ok := 
c.contextMap[key]; ok {\n\t\treturn v\n\t}\n\treturn nil\n}\n\n// ForEach calls fn for every key-value pair in the Context\n// and returns the collected results\nfunc (c *Context) ForEach(fn func(k string, v interface{}) interface{}) []interface{} {\n\tc.lock.RLock()\n\tdefer c.lock.RUnlock()\n\n\tret := make([]interface{}, 0, len(c.contextMap))\n\tfor k, v := range c.contextMap {\n\t\tret = append(ret, fn(k, v))\n\t}\n\n\treturn ret\n}\n\n// Clone returns a shallow copy of the Context\nfunc (c *Context) Clone() *Context {\n\t// ForEach takes the read lock itself. Locking here as well would be a\n\t// recursive read lock, which sync.RWMutex prohibits.\n\tnewCtx := NewContext()\n\tc.ForEach(func(key string, value interface{}) interface{} {\n\t\tnewCtx.Put(key, value)\n\t\treturn nil\n\t})\n\treturn newCtx\n}\n"
  },
  {
    "path": "context_test.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage colly\n\nimport (\n\t\"strconv\"\n\t\"testing\"\n)\n\nfunc TestContextIteration(t *testing.T) {\n\tctx := NewContext()\n\tfor i := 0; i < 10; i++ {\n\t\tctx.Put(strconv.Itoa(i), i)\n\t}\n\tvalues := ctx.ForEach(func(k string, v interface{}) interface{} {\n\t\treturn v.(int)\n\t})\n\tif len(values) != 10 {\n\t\tt.Fatal(\"fail to iterate context\")\n\t}\n\tfor _, i := range values {\n\t\tv := i.(int)\n\t\tif v != ctx.GetAny(strconv.Itoa(v)).(int) {\n\t\t\tt.Fatal(\"value not equal\")\n\t\t}\n\t}\n}\n\nfunc TestContextClone(t *testing.T) {\n\tctxOrg := NewContext()\n\tfor i := 0; i < 10; i++ {\n\t\tctxOrg.Put(strconv.Itoa(i), i)\n\t}\n\n\tctx := ctxOrg.Clone()\n\tvalues := ctx.ForEach(func(k string, v interface{}) interface{} {\n\t\treturn v.(int)\n\t})\n\tif len(values) != 10 {\n\t\tt.Fatal(\"fail to iterate context\")\n\t}\n\tfor _, i := range values {\n\t\tv := i.(int)\n\t\tif v != ctx.GetAny(strconv.Itoa(v)).(int) {\n\t\t\tt.Fatal(\"value not equal\")\n\t\t}\n\t}\n}\n"
  },
  {
    "path": "debug/debug.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage debug\n\n// Event represents an action inside a collector\ntype Event struct {\n\t// Type is the type of the event\n\tType string\n\t// RequestID identifies the HTTP request of the Event\n\tRequestID uint32\n\t// CollectorID identifies the collector of the Event\n\tCollectorID uint32\n\t// Values contains the event's key-value pairs. Different types of events\n\t// can carry different key-value pairs\n\tValues map[string]string\n}\n\n// Debugger is an interface for different types of debugging backends\ntype Debugger interface {\n\t// Init initializes the backend\n\tInit() error\n\t// Event receives a new collector event.\n\tEvent(e *Event)\n}\n"
  },
  {
    "path": "debug/logdebugger.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage debug\n\nimport (\n\t\"io\"\n\t\"log\"\n\t\"os\"\n\t\"sync/atomic\"\n\t\"time\"\n)\n\n// LogDebugger is the simplest debugger, which prints log messages to STDERR\ntype LogDebugger struct {\n\t// Output is the log destination. Anything that implements the\n\t// io.Writer interface can be used. Leave it blank to use STDERR\n\tOutput io.Writer\n\t// Prefix appears at the beginning of each generated log line\n\tPrefix string\n\t// Flag defines the logging properties.\n\tFlag    int\n\tlogger  *log.Logger\n\tcounter int32\n\tstart   time.Time\n}\n\n// Init initializes the LogDebugger\nfunc (l *LogDebugger) Init() error {\n\tl.counter = 0\n\tl.start = time.Now()\n\tif l.Output == nil {\n\t\tl.Output = os.Stderr\n\t}\n\tl.logger = log.New(l.Output, l.Prefix, l.Flag)\n\treturn nil\n}\n\n// Event receives Collector events and prints them to Output (STDERR by default)\nfunc (l *LogDebugger) Event(e *Event) {\n\ti := atomic.AddInt32(&l.counter, 1)\n\tl.logger.Printf(\"[%06d] %d [%6d - %s] %q (%s)\\n\", i, e.CollectorID, e.RequestID, e.Type, e.Values, time.Since(l.start))\n}\n"
  },
  {
    "path": "debug/webdebugger.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage debug\n\nimport (\n\t\"encoding/json\"\n\t\"log\"\n\t\"net/http\"\n\t\"sync\"\n\t\"time\"\n)\n\n// WebDebugger is a web based debugging frontend for colly\ntype WebDebugger struct {\n\t// Address is the address of the web server. It is 127.0.0.1:7676 by default.\n\tAddress         string\n\tinitialized     bool\n\tCurrentRequests map[uint32]requestInfo\n\tRequestLog      []requestInfo\n\tsync.Mutex\n}\n\ntype requestInfo struct {\n\tURL            string\n\tStarted        time.Time\n\tDuration       time.Duration\n\tResponseStatus string\n\tID             uint32\n\tCollectorID    uint32\n}\n\n// Init initializes the WebDebugger\nfunc (w *WebDebugger) Init() error {\n\tif w.initialized {\n\t\treturn nil\n\t}\n\tdefer func() {\n\t\tw.initialized = true\n\t}()\n\tif w.Address == \"\" {\n\t\tw.Address = \"127.0.0.1:7676\"\n\t}\n\tw.RequestLog = make([]requestInfo, 0)\n\tw.CurrentRequests = make(map[uint32]requestInfo)\n\thttp.HandleFunc(\"/\", w.indexHandler)\n\thttp.HandleFunc(\"/status\", w.statusHandler)\n\tlog.Println(\"Starting debug webserver on\", w.Address)\n\tgo http.ListenAndServe(w.Address, nil)\n\treturn nil\n}\n\n// Event updates the debugger's status\nfunc (w *WebDebugger) Event(e *Event) {\n\tw.Lock()\n\tdefer w.Unlock()\n\n\tswitch e.Type {\n\tcase \"request\":\n\t\tw.CurrentRequests[e.RequestID] = requestInfo{\n\t\t\tURL:         
e.Values[\"url\"],\n\t\t\tStarted:     time.Now(),\n\t\t\tID:          e.RequestID,\n\t\t\tCollectorID: e.CollectorID,\n\t\t}\n\tcase \"response\", \"error\":\n\t\tr := w.CurrentRequests[e.RequestID]\n\t\tr.Duration = time.Since(r.Started)\n\t\tr.ResponseStatus = e.Values[\"status\"]\n\t\tw.RequestLog = append(w.RequestLog, r)\n\t\tdelete(w.CurrentRequests, e.RequestID)\n\t}\n}\n\nfunc (w *WebDebugger) indexHandler(wr http.ResponseWriter, r *http.Request) {\n\twr.Write([]byte(`<!DOCTYPE html>\n<html>\n<head>\n <title>Colly Debugger WebUI</title>\n <script src=\"https://code.jquery.com/jquery-latest.min.js\" type=\"text/javascript\"></script>\n <link rel=\"stylesheet\" type=\"text/css\" href=\"https://semantic-ui.com/dist/semantic.min.css\">\n</head>\n<body>\n<div class=\"ui inverted vertical masthead center aligned segment\" id=\"menu\">\n <div class=\"ui tiny secondary inverted menu\">\n   <a class=\"item\" href=\"/\"><b>Colly WebDebugger</b></a>\n </div>\n</div>\n<div class=\"ui grid container\">\n <div class=\"row\">\n  <div class=\"eight wide column\">\n   <h1>Current Requests <span id=\"current_request_count\"></span></h1>\n   <div id=\"current_requests\" class=\"ui small feed\"></div>\n  </div>\n  <div class=\"eight wide column\">\n   <h1>Finished Requests <span id=\"request_log_count\"></span></h1>\n   <div id=\"request_log\" class=\"ui small feed\"></div>\n  </div>\n </div>\n</div>\n<script>\nfunction curRequestTpl(url, started, collectorId) {\n  return '<div class=\"event\"><div class=\"content\"><div class=\"summary\">' + url + '</div><div class=\"meta\">Collector #' + collectorId + ' - ' + started + \"</div></div></div>\";\n}\nfunction requestLogTpl(url, duration, collectorId) {\n  return '<div class=\"event\"><div class=\"content\"><div class=\"summary\">' + url + '</div><div class=\"meta\">Collector #' + collectorId + ' - ' + (duration/1000000000) + \"s</div></div></div>\";\n}\nfunction fetchStatus() {\n  $.getJSON(\"/status\", function(data) {\n    
$(\"#current_requests\").html(\"\");\n    $(\"#request_log\").html(\"\");\n    $(\"#current_request_count\").text('(' + Object.keys(data.CurrentRequests).length + ')');\n    $(\"#request_log_count\").text('(' + data.RequestLog.length + ')');\n    for(var i in data.CurrentRequests) {\n      var r = data.CurrentRequests[i];\n      $(\"#current_requests\").append(curRequestTpl(r.URL, r.Started, r.CollectorID));\n    }\n    for(var i in data.RequestLog.reverse()) {\n      var r = data.RequestLog[i];\n      $(\"#request_log\").append(requestLogTpl(r.URL, r.Duration, r.CollectorID));\n    }\n    setTimeout(fetchStatus, 1000);\n  });\n}\n$(document).ready(function() {\n    fetchStatus();\n});\n</script>\n</body>\n</html>\n`))\n}\n\nfunc (w *WebDebugger) statusHandler(wr http.ResponseWriter, r *http.Request) {\n\tw.Lock()\n\tjsonData, err := json.MarshalIndent(w, \"\", \"  \")\n\tw.Unlock()\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\twr.Write(jsonData)\n}\n"
  },
  {
    "path": "extensions/extensions.go",
    "content": "// Package extensions implements various helper addons for Colly\npackage extensions\n"
  },
  {
    "path": "extensions/random_user_agent.go",
    "content": "package extensions\n\nimport (\n\t\"fmt\"\n\t\"math/rand\"\n\t\"strings\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nvar uaGens = []func() string{\n\tgenFirefoxUA,\n\tgenChromeUA,\n\tgenEdgeUA,\n\tgenOperaUA,\n}\n\nvar uaGensMobile = []func() string{\n\tgenMobilePixel7UA,\n\tgenMobilePixel6UA,\n\tgenMobilePixel5UA,\n\tgenMobilePixel4UA,\n\tgenMobileNexus10UA,\n}\n\n// RandomUserAgent generates a random DESKTOP browser user-agent on every request\nfunc RandomUserAgent(c *colly.Collector) {\n\tc.OnRequest(func(r *colly.Request) {\n\t\tr.Headers.Set(\"User-Agent\", uaGens[rand.Intn(len(uaGens))]())\n\t})\n}\n\n// RandomMobileUserAgent generates a random MOBILE browser user-agent on every request\nfunc RandomMobileUserAgent(c *colly.Collector) {\n\tc.OnRequest(func(r *colly.Request) {\n\t\tr.Headers.Set(\"User-Agent\", uaGensMobile[rand.Intn(len(uaGensMobile))]())\n\t})\n}\n\nvar ffVersions = []float32{\n\t// NOTE: Only versions released after Jun 1, 2022 are listed.\n\t// Data source: https://en.wikipedia.org/wiki/Firefox_version_history\n\n\t// 2022\n\t102.0,\n\t103.0,\n\t104.0,\n\t105.0,\n\t106.0,\n\t107.0,\n\t108.0,\n\n\t// 2023\n\t109.0,\n\t110.0,\n\t111.0,\n\t112.0,\n\t113.0,\n}\n\nvar chromeVersions = []string{\n\t// NOTE: Only versions released after Jun 1, 2022 are listed.\n\t// Data source: https://chromereleases.googleblog.com/search/label/Stable%20updates\n\n\t// https://chromereleases.googleblog.com/2022/06/stable-channel-update-for-desktop.html\n\t\"102.0.5005.115\",\n\n\t// https://chromereleases.googleblog.com/2022/06/stable-channel-update-for-desktop_21.html\n\t\"103.0.5060.53\",\n\n\t// https://chromereleases.googleblog.com/2022/06/stable-channel-update-for-desktop_27.html\n\t\"103.0.5060.66\",\n\n\t// https://chromereleases.googleblog.com/2022/07/stable-channel-update-for-desktop.html\n\t\"103.0.5060.114\",\n\n\t// 
https://chromereleases.googleblog.com/2022/07/stable-channel-update-for-desktop_19.html\n\t\"103.0.5060.134\",\n\n\t// https://chromereleases.googleblog.com/2022/08/stable-channel-update-for-desktop.html\n\t\"104.0.5112.79\",\n\t\"104.0.5112.80\",\n\t\"104.0.5112.81\",\n\n\t// https://chromereleases.googleblog.com/2022/08/stable-channel-update-for-desktop_16.html\n\t\"104.0.5112.101\",\n\t\"104.0.5112.102\",\n\n\t// https://chromereleases.googleblog.com/2022/08/stable-channel-update-for-desktop_30.html\n\t\"105.0.5195.52\",\n\t\"105.0.5195.53\",\n\t\"105.0.5195.54\",\n\n\t// https://chromereleases.googleblog.com/2022/09/stable-channel-update-for-desktop.html\n\t\"105.0.5195.102\",\n\n\t// https://chromereleases.googleblog.com/2022/09/stable-channel-update-for-desktop_14.html\n\t\"105.0.5195.125\",\n\t\"105.0.5195.126\",\n\t\"105.0.5195.127\",\n\n\t// https://chromereleases.googleblog.com/2022/09/stable-channel-update-for-desktop_27.html\n\t\"106.0.5249.61\",\n\t\"106.0.5249.62\",\n\n\t// https://chromereleases.googleblog.com/2022/09/stable-channel-update-for-desktop_30.html\n\t\"106.0.5249.91\",\n\n\t// https://chromereleases.googleblog.com/2022/10/stable-channel-update-for-desktop.html\n\t\"106.0.5249.103\",\n\n\t// https://chromereleases.googleblog.com/2022/10/stable-channel-update-for-desktop_11.html\n\t\"106.0.5249.119\",\n\n\t// https://chromereleases.googleblog.com/2022/10/stable-channel-update-for-desktop_25.html\n\t\"107.0.5304.62\",\n\t\"107.0.5304.63\",\n\t\"107.0.5304.68\",\n\n\t// https://chromereleases.googleblog.com/2022/10/stable-channel-update-for-desktop_27.html\n\t\"107.0.5304.87\",\n\t\"107.0.5304.88\",\n\n\t// https://chromereleases.googleblog.com/2022/11/stable-channel-update-for-desktop.html\n\t\"107.0.5304.106\",\n\t\"107.0.5304.107\",\n\t\"107.0.5304.110\",\n\n\t// https://chromereleases.googleblog.com/2022/11/stable-channel-update-for-desktop_24.html\n\t\"107.0.5304.121\",\n\t\"107.0.5304.122\",\n\n\t// 
https://chromereleases.googleblog.com/2022/11/stable-channel-update-for-desktop_29.html\n\t\"108.0.5359.71\",\n\t\"108.0.5359.72\",\n\n\t// https://chromereleases.googleblog.com/2022/12/stable-channel-update-for-desktop.html\n\t\"108.0.5359.94\",\n\t\"108.0.5359.95\",\n\n\t// https://chromereleases.googleblog.com/2022/12/stable-channel-update-for-desktop_7.html\n\t\"108.0.5359.98\",\n\t\"108.0.5359.99\",\n\n\t// https://chromereleases.googleblog.com/2022/12/stable-channel-update-for-desktop_13.html\n\t\"108.0.5359.124\",\n\t\"108.0.5359.125\",\n\n\t// https://chromereleases.googleblog.com/2023/01/stable-channel-update-for-desktop.html\n\t\"109.0.5414.74\",\n\t\"109.0.5414.75\",\n\t\"109.0.5414.87\",\n\n\t// https://chromereleases.googleblog.com/2023/01/stable-channel-update-for-desktop_24.html\n\t\"109.0.5414.119\",\n\t\"109.0.5414.120\",\n\n\t// https://chromereleases.googleblog.com/2023/02/stable-channel-update-for-desktop.html\n\t\"110.0.5481.77\",\n\t\"110.0.5481.78\",\n\n\t// https://chromereleases.googleblog.com/2023/02/stable-channel-desktop-update.html\n\t\"110.0.5481.96\",\n\t\"110.0.5481.97\",\n\n\t// https://chromereleases.googleblog.com/2023/02/stable-channel-desktop-update_14.html\n\t\"110.0.5481.100\",\n\n\t// https://chromereleases.googleblog.com/2023/02/stable-channel-desktop-update_16.html\n\t\"110.0.5481.104\",\n\n\t// https://chromereleases.googleblog.com/2023/02/stable-channel-desktop-update_22.html\n\t\"110.0.5481.177\",\n\t\"110.0.5481.178\",\n\n\t// https://chromereleases.googleblog.com/2023/02/stable-channel-desktop-update_97.html\n\t\"109.0.5414.129\",\n\n\t// https://chromereleases.googleblog.com/2023/03/stable-channel-update-for-desktop.html\n\t\"111.0.5563.64\",\n\t\"111.0.5563.65\",\n\n\t// https://chromereleases.googleblog.com/2023/03/stable-channel-update-for-desktop_21.html\n\t\"111.0.5563.110\",\n\t\"111.0.5563.111\",\n\n\t// 
https://chromereleases.googleblog.com/2023/03/stable-channel-update-for-desktop_27.html\n\t\"111.0.5563.146\",\n\t\"111.0.5563.147\",\n\n\t// https://chromereleases.googleblog.com/2023/04/stable-channel-update-for-desktop.html\n\t\"112.0.5615.49\",\n\t\"112.0.5615.50\",\n\n\t// https://chromereleases.googleblog.com/2023/04/stable-channel-update-for-desktop_12.html\n\t\"112.0.5615.86\",\n\t\"112.0.5615.87\",\n\n\t// https://chromereleases.googleblog.com/2023/04/stable-channel-update-for-desktop_14.html\n\t\"112.0.5615.121\",\n\n\t// https://chromereleases.googleblog.com/2023/04/stable-channel-update-for-desktop_18.html\n\t\"112.0.5615.137\",\n\t\"112.0.5615.138\",\n\t\"112.0.5615.165\",\n\n\t// https://chromereleases.googleblog.com/2023/05/stable-channel-update-for-desktop.html\n\t\"113.0.5672.63\",\n\t\"113.0.5672.64\",\n\n\t// https://chromereleases.googleblog.com/2023/05/stable-channel-update-for-desktop_8.html\n\t\"113.0.5672.92\",\n\t\"113.0.5672.93\",\n}\n\nvar edgeVersions = []string{\n\t// NOTE: Only versions released after Jun 1, 2022 are listed.\n\t// Data source: https://learn.microsoft.com/en-us/deployedge/microsoft-edge-release-schedule\n\n\t// 2022\n\t\"103.0.0.0,103.0.1264.37\",\n\t\"104.0.0.0,104.0.1293.47\",\n\t\"105.0.0.0,105.0.1343.25\",\n\t\"106.0.0.0,106.0.1370.34\",\n\t\"107.0.0.0,107.0.1418.24\",\n\t\"108.0.0.0,108.0.1462.42\",\n\n\t// 2023\n\t\"109.0.0.0,109.0.1518.49\",\n\t\"110.0.0.0,110.0.1587.41\",\n\t\"111.0.0.0,111.0.1661.41\",\n\t\"112.0.0.0,112.0.1722.34\",\n\t\"113.0.0.0,113.0.1774.3\",\n}\n\nvar operaVersions = []string{\n\t// NOTE: Only versions released after Jan 1, 2023 are listed.\n\t// Data source: https://blogs.opera.com/desktop/\n\n\t// 
https://blogs.opera.com/desktop/changelog-for-96/\n\t\"110.0.5449.0,96.0.4640.0\",\n\t\"110.0.5464.2,96.0.4653.0\",\n\t\"110.0.5464.2,96.0.4660.0\",\n\t\"110.0.5481.30,96.0.4674.0\",\n\t\"110.0.5481.30,96.0.4691.0\",\n\t\"110.0.5481.30,96.0.4693.12\",\n\t\"110.0.5481.77,96.0.4693.16\",\n\t\"110.0.5481.100,96.0.4693.20\",\n\t\"110.0.5481.178,96.0.4693.31\",\n\t\"110.0.5481.178,96.0.4693.50\",\n\t\"110.0.5481.192,96.0.4693.80\",\n\n\t// https://blogs.opera.com/desktop/changelog-for-97/\n\t\"111.0.5532.2,97.0.4711.0\",\n\t\"111.0.5532.2,97.0.4704.0\",\n\t\"111.0.5532.2,97.0.4697.0\",\n\t\"111.0.5562.0,97.0.4718.0\",\n\t\"111.0.5563.19,97.0.4719.4\",\n\t\"111.0.5563.19,97.0.4719.11\",\n\t\"111.0.5563.41,97.0.4719.17\",\n\t\"111.0.5563.65,97.0.4719.26\",\n\t\"111.0.5563.65,97.0.4719.28\",\n\t\"111.0.5563.111,97.0.4719.43\",\n\t\"111.0.5563.147,97.0.4719.63\",\n\t\"111.0.5563.147,97.0.4719.83\",\n\n\t// https://blogs.opera.com/desktop/changelog-for-98/\n\t\"112.0.5596.2,98.0.4756.0\",\n\t\"112.0.5596.2,98.0.4746.0\",\n\t\"112.0.5615.20,98.0.4759.1\",\n\t\"112.0.5615.50,98.0.4759.3\",\n\t\"112.0.5615.87,98.0.4759.6\",\n\t\"112.0.5615.165,98.0.4759.15\",\n\t\"112.0.5615.165,98.0.4759.21\",\n\t\"112.0.5615.165,98.0.4759.39\",\n}\n\nvar pixel7AndroidVersions = []string{\n\t// Data source:\n\t// - https://developer.android.com/about/versions\n\t// - https://source.android.com/docs/setup/about/build-numbers#source-code-tags-and-builds\n\t\"13\",\n}\n\nvar pixel6AndroidVersions = []string{\n\t// Data source:\n\t// - https://developer.android.com/about/versions\n\t// - https://source.android.com/docs/setup/about/build-numbers#source-code-tags-and-builds\n\t\"12\",\n\t\"13\",\n}\n\nvar pixel5AndroidVersions = []string{\n\t// Data source:\n\t// - https://developer.android.com/about/versions\n\t// - https://source.android.com/docs/setup/about/build-numbers#source-code-tags-and-builds\n\t\"11\",\n\t\"12\",\n\t\"13\",\n}\n\nvar pixel4AndroidVersions = []string{\n\t// Data 
source:\n\t// - https://developer.android.com/about/versions\n\t// - https://source.android.com/docs/setup/about/build-numbers#source-code-tags-and-builds\n\t\"10\",\n\t\"11\",\n\t\"12\",\n\t\"13\",\n}\n\nvar nexus10AndroidVersions = []string{\n\t// Data source:\n\t// - https://developer.android.com/about/versions\n\t// - https://source.android.com/docs/setup/about/build-numbers#source-code-tags-and-builds\n\t\"4.4.2\",\n\t\"4.4.4\",\n\t\"5.0\",\n\t\"5.0.1\",\n\t\"5.0.2\",\n\t\"5.1\",\n\t\"5.1.1\",\n}\n\nvar nexus10Builds = []string{\n\t// Data source: https://source.android.com/docs/setup/about/build-numbers#source-code-tags-and-builds\n\n\t\"LMY49M\", // android-5.1.1_r38 (Lollipop)\n\t\"LMY49J\", // android-5.1.1_r37 (Lollipop)\n\t\"LMY49I\", // android-5.1.1_r36 (Lollipop)\n\t\"LMY49H\", // android-5.1.1_r35 (Lollipop)\n\t\"LMY49G\", // android-5.1.1_r34 (Lollipop)\n\t\"LMY49F\", // android-5.1.1_r33 (Lollipop)\n\t\"LMY48Z\", // android-5.1.1_r30 (Lollipop)\n\t\"LMY48X\", // android-5.1.1_r25 (Lollipop)\n\t\"LMY48T\", // android-5.1.1_r19 (Lollipop)\n\t\"LMY48M\", // android-5.1.1_r14 (Lollipop)\n\t\"LMY48I\", // android-5.1.1_r9 (Lollipop)\n\t\"LMY47V\", // android-5.1.1_r1 (Lollipop)\n\t\"LMY47D\", // android-5.1.0_r1 (Lollipop)\n\t\"LRX22G\", // android-5.0.2_r1 (Lollipop)\n\t\"LRX22C\", // android-5.0.1_r1 (Lollipop)\n\t\"LRX21P\", // android-5.0.0_r4.0.1 (Lollipop)\n\t\"KTU84P\", // android-4.4.4_r1 (KitKat)\n\t\"KTU84L\", // android-4.4.3_r1 (KitKat)\n\t\"KOT49H\", // android-4.4.2_r1 (KitKat)\n\t\"KOT49E\", // android-4.4.1_r1 (KitKat)\n\t\"KRT16S\", // android-4.4_r1.2 (KitKat)\n\t\"JWR66Y\", // android-4.3_r1.1 (Jelly Bean)\n\t\"JWR66V\", // android-4.3_r1 (Jelly Bean)\n\t\"JWR66N\", // android-4.3_r0.9.1 (Jelly Bean)\n\t\"JDQ39 \", // android-4.2.2_r1 (Jelly Bean)\n\t\"JOP40F\", // android-4.2.1_r1.1 (Jelly Bean)\n\t\"JOP40D\", // android-4.2.1_r1 (Jelly Bean)\n\t\"JOP40C\", // android-4.2_r1 (Jelly Bean)\n}\n\nvar osStrings = []string{\n\t// MacOS - 
High Sierra\n\t\"Macintosh; Intel Mac OS X 10_13\",\n\t\"Macintosh; Intel Mac OS X 10_13_1\",\n\t\"Macintosh; Intel Mac OS X 10_13_2\",\n\t\"Macintosh; Intel Mac OS X 10_13_3\",\n\t\"Macintosh; Intel Mac OS X 10_13_4\",\n\t\"Macintosh; Intel Mac OS X 10_13_5\",\n\t\"Macintosh; Intel Mac OS X 10_13_6\",\n\n\t// MacOS - Mojave\n\t\"Macintosh; Intel Mac OS X 10_14\",\n\t\"Macintosh; Intel Mac OS X 10_14_1\",\n\t\"Macintosh; Intel Mac OS X 10_14_2\",\n\t\"Macintosh; Intel Mac OS X 10_14_3\",\n\t\"Macintosh; Intel Mac OS X 10_14_4\",\n\t\"Macintosh; Intel Mac OS X 10_14_5\",\n\t\"Macintosh; Intel Mac OS X 10_14_6\",\n\n\t// MacOS - Catalina\n\t\"Macintosh; Intel Mac OS X 10_15\",\n\t\"Macintosh; Intel Mac OS X 10_15_1\",\n\t\"Macintosh; Intel Mac OS X 10_15_2\",\n\t\"Macintosh; Intel Mac OS X 10_15_3\",\n\t\"Macintosh; Intel Mac OS X 10_15_4\",\n\t\"Macintosh; Intel Mac OS X 10_15_5\",\n\t\"Macintosh; Intel Mac OS X 10_15_6\",\n\t\"Macintosh; Intel Mac OS X 10_15_7\",\n\n\t// MacOS - Big Sur\n\t\"Macintosh; Intel Mac OS X 11_0\",\n\t\"Macintosh; Intel Mac OS X 11_0_1\",\n\t\"Macintosh; Intel Mac OS X 11_1\",\n\t\"Macintosh; Intel Mac OS X 11_2\",\n\t\"Macintosh; Intel Mac OS X 11_2_1\",\n\t\"Macintosh; Intel Mac OS X 11_2_2\",\n\t\"Macintosh; Intel Mac OS X 11_2_3\",\n\t\"Macintosh; Intel Mac OS X 11_3\",\n\t\"Macintosh; Intel Mac OS X 11_3_1\",\n\t\"Macintosh; Intel Mac OS X 11_4\",\n\t\"Macintosh; Intel Mac OS X 11_5\",\n\t\"Macintosh; Intel Mac OS X 11_5_1\",\n\t\"Macintosh; Intel Mac OS X 11_5_2\",\n\t\"Macintosh; Intel Mac OS X 11_6\",\n\t\"Macintosh; Intel Mac OS X 11_6_1\",\n\t\"Macintosh; Intel Mac OS X 11_6_2\",\n\t\"Macintosh; Intel Mac OS X 11_6_3\",\n\t\"Macintosh; Intel Mac OS X 11_6_4\",\n\t\"Macintosh; Intel Mac OS X 11_6_5\",\n\t\"Macintosh; Intel Mac OS X 11_6_6\",\n\t\"Macintosh; Intel Mac OS X 11_6_7\",\n\t\"Macintosh; Intel Mac OS X 11_6_8\",\n\t\"Macintosh; Intel Mac OS X 11_7\",\n\t\"Macintosh; Intel Mac OS X 11_7_1\",\n\t\"Macintosh; Intel Mac OS 
X 11_7_2\",\n\t\"Macintosh; Intel Mac OS X 11_7_3\",\n\t\"Macintosh; Intel Mac OS X 11_7_4\",\n\t\"Macintosh; Intel Mac OS X 11_7_5\",\n\t\"Macintosh; Intel Mac OS X 11_7_6\",\n\n\t// MacOS - Monterey\n\t\"Macintosh; Intel Mac OS X 12_0\",\n\t\"Macintosh; Intel Mac OS X 12_0_1\",\n\t\"Macintosh; Intel Mac OS X 12_1\",\n\t\"Macintosh; Intel Mac OS X 12_2\",\n\t\"Macintosh; Intel Mac OS X 12_2_1\",\n\t\"Macintosh; Intel Mac OS X 12_3\",\n\t\"Macintosh; Intel Mac OS X 12_3_1\",\n\t\"Macintosh; Intel Mac OS X 12_4\",\n\t\"Macintosh; Intel Mac OS X 12_5\",\n\t\"Macintosh; Intel Mac OS X 12_5_1\",\n\t\"Macintosh; Intel Mac OS X 12_6\",\n\t\"Macintosh; Intel Mac OS X 12_6_1\",\n\t\"Macintosh; Intel Mac OS X 12_6_2\",\n\t\"Macintosh; Intel Mac OS X 12_6_3\",\n\t\"Macintosh; Intel Mac OS X 12_6_4\",\n\t\"Macintosh; Intel Mac OS X 12_6_5\",\n\n\t// MacOS - Ventura\n\t\"Macintosh; Intel Mac OS X 13_0\",\n\t\"Macintosh; Intel Mac OS X 13_0_1\",\n\t\"Macintosh; Intel Mac OS X 13_1\",\n\t\"Macintosh; Intel Mac OS X 13_2\",\n\t\"Macintosh; Intel Mac OS X 13_2_1\",\n\t\"Macintosh; Intel Mac OS X 13_3\",\n\t\"Macintosh; Intel Mac OS X 13_3_1\",\n\n\t// Windows\n\t\"Windows NT 10.0; Win64; x64\",\n\t\"Windows NT 5.1\",\n\t\"Windows NT 6.1; WOW64\",\n\t\"Windows NT 6.1; Win64; x64\",\n\n\t// Linux\n\t\"X11; Linux x86_64\",\n}\n\n// Generates Firefox Browser User-Agent (Desktop)\n//\n// -> \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:87.0) Gecko/20100101 Firefox/87.0\"\nfunc genFirefoxUA() string {\n\tversion := ffVersions[rand.Intn(len(ffVersions))]\n\tos := osStrings[rand.Intn(len(osStrings))]\n\treturn fmt.Sprintf(\"Mozilla/5.0 (%s; rv:%.1f) Gecko/20100101 Firefox/%.1f\", os, version, version)\n}\n\n// Generates Chrome Browser User-Agent (Desktop)\n//\n// -> \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36\"\nfunc genChromeUA() string {\n\tversion := chromeVersions[rand.Intn(len(chromeVersions))]\n\tos 
:= osStrings[rand.Intn(len(osStrings))]\n\treturn fmt.Sprintf(\"Mozilla/5.0 (%s) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36\", os, version)\n}\n\n// Generates Microsoft Edge User-Agent (Desktop)\n//\n// -> \"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36 Edg/90.0.818.39\"\nfunc genEdgeUA() string {\n\tversion := edgeVersions[rand.Intn(len(edgeVersions))]\n\tchromeVersion := strings.Split(version, \",\")[0]\n\tedgeVersion := strings.Split(version, \",\")[1]\n\tos := osStrings[rand.Intn(len(osStrings))]\n\treturn fmt.Sprintf(\"Mozilla/5.0 (%s) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36 Edg/%s\", os, chromeVersion, edgeVersion)\n}\n\n// Generates Opera Browser User-Agent (Desktop)\n//\n// -> \"Mozilla/5.0 (Macintosh; Intel Mac OS X 13_3_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36 OPR/98.0.4759.3\"\nfunc genOperaUA() string {\n\tversion := operaVersions[rand.Intn(len(operaVersions))]\n\tchromeVersion := strings.Split(version, \",\")[0]\n\toperaVersion := strings.Split(version, \",\")[1]\n\tos := osStrings[rand.Intn(len(osStrings))]\n\treturn fmt.Sprintf(\"Mozilla/5.0 (%s) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36 OPR/%s\", os, chromeVersion, operaVersion)\n}\n\n// Generates Pixel 7 Browser User-Agent (Mobile)\n//\n// -> Mozilla/5.0 (Linux; Android 13; Pixel 7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Mobile Safari/537.36\nfunc genMobilePixel7UA() string {\n\tandroid := pixel7AndroidVersions[rand.Intn(len(pixel7AndroidVersions))]\n\tchrome := chromeVersions[rand.Intn(len(chromeVersions))]\n\treturn fmt.Sprintf(\"Mozilla/5.0 (Linux; Android %s; Pixel 7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36\", android, chrome)\n}\n\n// Generates Pixel 6 Browser User-Agent (Mobile)\n//\n// -> \"Mozilla/5.0 (Linux; Android 13; Pixel 6) AppleWebKit/537.36 (KHTML, like Gecko) 
Chrome/112.0.0.0 Mobile Safari/537.36\"\nfunc genMobilePixel6UA() string {\n\tandroid := pixel6AndroidVersions[rand.Intn(len(pixel6AndroidVersions))]\n\tchrome := chromeVersions[rand.Intn(len(chromeVersions))]\n\treturn fmt.Sprintf(\"Mozilla/5.0 (Linux; Android %s; Pixel 6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36\", android, chrome)\n}\n\n// Generates Pixel 5 Browser User-Agent (Mobile)\n//\n// -> \"Mozilla/5.0 (Linux; Android 13; Pixel 5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Mobile Safari/537.36\"\nfunc genMobilePixel5UA() string {\n\tandroid := pixel5AndroidVersions[rand.Intn(len(pixel5AndroidVersions))]\n\tchrome := chromeVersions[rand.Intn(len(chromeVersions))]\n\treturn fmt.Sprintf(\"Mozilla/5.0 (Linux; Android %s; Pixel 5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36\", android, chrome)\n}\n\n// Generates Pixel 4 Browser User-Agent (Mobile)\n//\n// -> \"Mozilla/5.0 (Linux; Android 13; Pixel 4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Mobile Safari/537.36\"\nfunc genMobilePixel4UA() string {\n\tandroid := pixel4AndroidVersions[rand.Intn(len(pixel4AndroidVersions))]\n\tchrome := chromeVersions[rand.Intn(len(chromeVersions))]\n\treturn fmt.Sprintf(\"Mozilla/5.0 (Linux; Android %s; Pixel 4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36\", android, chrome)\n}\n\n// Generates Nexus 10 Browser User-Agent (Mobile)\n//\n// -> \"Mozilla/5.0 (Linux; Android 5.1.1; Nexus 10 Build/LMY48T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.91 Safari/537.36\"\nfunc genMobileNexus10UA() string {\n\tbuild := nexus10Builds[rand.Intn(len(nexus10Builds))]\n\tandroid := nexus10AndroidVersions[rand.Intn(len(nexus10AndroidVersions))]\n\tchrome := chromeVersions[rand.Intn(len(chromeVersions))]\n\treturn fmt.Sprintf(\"Mozilla/5.0 (Linux; Android %s; Nexus 10 Build/%s) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/%s Safari/537.36\", android, build, chrome)\n}\n"
  },
  {
    "path": "extensions/referer.go",
    "content": "package extensions\n\nimport (\n\t\"github.com/gocolly/colly/v2\"\n)\n\n// Referer sets a valid Referer HTTP header on requests.\n// Warning: this extension works only if you use Request.Visit\n// from callbacks instead of Collector.Visit.\nfunc Referer(c *colly.Collector) {\n\tc.OnResponse(func(r *colly.Response) {\n\t\tr.Ctx.Put(\"_referer\", r.Request.URL.String())\n\t})\n\tc.OnRequest(func(r *colly.Request) {\n\t\tif ref := r.Ctx.Get(\"_referer\"); ref != \"\" {\n\t\t\tr.Headers.Set(\"Referer\", ref)\n\t\t}\n\t})\n}\n"
  },
  {
    "path": "extensions/url_length_filter.go",
    "content": "package extensions\n\nimport (\n\t\"github.com/gocolly/colly/v2\"\n)\n\n// URLLengthFilter filters out requests with URLs longer than URLLengthLimit\nfunc URLLengthFilter(c *colly.Collector, URLLengthLimit int) {\n\tc.OnRequest(func(r *colly.Request) {\n\t\tif len(r.URL.String()) > URLLengthLimit {\n\t\t\tr.Abort()\n\t\t}\n\t})\n}\n"
  },
  {
    "path": "go.mod",
    "content": "module github.com/gocolly/colly/v2\n\ngo 1.24.0\n\ntoolchain go1.24.9\n\nrequire (\n\tgithub.com/PuerkitoBio/goquery v1.11.0\n\tgithub.com/antchfx/htmlquery v1.3.5\n\tgithub.com/antchfx/xmlquery v1.5.0\n\tgithub.com/gobwas/glob v0.2.3\n\tgithub.com/gocolly/colly v1.2.0\n\tgithub.com/jawher/mow.cli v1.1.0\n\tgithub.com/kennygrant/sanitize v1.2.4\n\tgithub.com/nlnwa/whatwg-url v0.6.2\n\tgithub.com/saintfish/chardet v0.0.0-20230101081208-5e3ef4b5456d\n\tgithub.com/temoto/robotstxt v1.1.2\n\tgolang.org/x/net v0.47.0\n\tgoogle.golang.org/appengine v1.6.8\n)\n\nrequire (\n\tgithub.com/andybalholm/cascadia v1.3.3 // indirect\n\tgithub.com/antchfx/xpath v1.3.5 // indirect\n\tgithub.com/bits-and-blooms/bitset v1.24.4 // indirect\n\tgithub.com/golang/groupcache v0.0.0-20241129210726-2c02b8208cf8 // indirect\n\tgithub.com/golang/protobuf v1.5.4 // indirect\n\tgolang.org/x/text v0.31.0 // indirect\n\tgoogle.golang.org/protobuf v1.36.10 // indirect\n)\n"
  },
  {
    "path": "go.sum",
    "content": "github.com/PuerkitoBio/goquery v1.10.2 h1:7fh2BdHcG6VFZsK7toXBT/Bh1z5Wmy8Q9MV9HqT2AM8=\ngithub.com/PuerkitoBio/goquery v1.10.2/go.mod h1:0guWGjcLu9AYC7C1GHnpysHy056u9aEkUHwhdnePMCU=\ngithub.com/PuerkitoBio/goquery v1.11.0 h1:jZ7pwMQXIITcUXNH83LLk+txlaEy6NVOfTuP43xxfqw=\ngithub.com/PuerkitoBio/goquery v1.11.0/go.mod h1:wQHgxUOU3JGuj3oD/QFfxUdlzW6xPHfqyHre6VMY4DQ=\ngithub.com/andybalholm/cascadia v1.3.3 h1:AG2YHrzJIm4BZ19iwJ/DAua6Btl3IwJX+VI4kktS1LM=\ngithub.com/andybalholm/cascadia v1.3.3/go.mod h1:xNd9bqTn98Ln4DwST8/nG+H0yuB8Hmgu1YHNnWw0GeA=\ngithub.com/antchfx/htmlquery v1.3.4 h1:Isd0srPkni2iNTWCwVj/72t7uCphFeor5Q8nCzj1jdQ=\ngithub.com/antchfx/htmlquery v1.3.4/go.mod h1:K9os0BwIEmLAvTqaNSua8tXLWRWZpocZIH73OzWQbwM=\ngithub.com/antchfx/htmlquery v1.3.5 h1:aYthDDClnG2a2xePf6tys/UyyM/kRcsFRm+ifhFKoU0=\ngithub.com/antchfx/htmlquery v1.3.5/go.mod h1:5oyIPIa3ovYGtLqMPNjBF2Uf25NPCKsMjCnQ8lvjaoA=\ngithub.com/antchfx/xmlquery v1.4.4 h1:mxMEkdYP3pjKSftxss4nUHfjBhnMk4imGoR96FRY2dg=\ngithub.com/antchfx/xmlquery v1.4.4/go.mod h1:AEPEEPYE9GnA2mj5Ur2L5Q5/2PycJ0N9Fusrx9b12fc=\ngithub.com/antchfx/xmlquery v1.5.0 h1:uAi+mO40ZWfyU6mlUBxRVvL6uBNZ6LMU4M3+mQIBV4c=\ngithub.com/antchfx/xmlquery v1.5.0/go.mod h1:lJfWRXzYMK1ss32zm1GQV3gMIW/HFey3xDZmkP1SuNc=\ngithub.com/antchfx/xpath v1.3.3 h1:tmuPQa1Uye0Ym1Zn65vxPgfltWb/Lxu2jeqIGteJSRs=\ngithub.com/antchfx/xpath v1.3.3/go.mod h1:i54GszH55fYfBmoZXapTHN8T8tkcHfRgLyVwwqzXNcs=\ngithub.com/antchfx/xpath v1.3.5 h1:PqbXLC3TkfeZyakF5eeh3NTWEbYl4VHNVeufANzDbKQ=\ngithub.com/antchfx/xpath v1.3.5/go.mod h1:i54GszH55fYfBmoZXapTHN8T8tkcHfRgLyVwwqzXNcs=\ngithub.com/bits-and-blooms/bitset v1.20.0/go.mod h1:7hO7Gc7Pp1vODcmWvKMRA9BNmbv6a/7QIWpPxHddWR8=\ngithub.com/bits-and-blooms/bitset v1.22.0 h1:Tquv9S8+SGaS3EhyA+up3FXzmkhxPGjQQCkcs2uw7w4=\ngithub.com/bits-and-blooms/bitset v1.22.0/go.mod h1:7hO7Gc7Pp1vODcmWvKMRA9BNmbv6a/7QIWpPxHddWR8=\ngithub.com/bits-and-blooms/bitset v1.24.4 
h1:95H15Og1clikBrKr/DuzMXkQzECs1M6hhoGXLwLQOZE=\ngithub.com/bits-and-blooms/bitset v1.24.4/go.mod h1:7hO7Gc7Pp1vODcmWvKMRA9BNmbv6a/7QIWpPxHddWR8=\ngithub.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=\ngithub.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=\ngithub.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=\ngithub.com/gobwas/glob v0.2.3 h1:A4xDbljILXROh+kObIiy5kIaPYD8e96x1tgBhUI5J+Y=\ngithub.com/gobwas/glob v0.2.3/go.mod h1:d3Ez4x06l9bZtSvzIay5+Yzi0fmZzPgnTbPcKjJAkT8=\ngithub.com/gocolly/colly v1.2.0 h1:qRz9YAn8FIH0qzgNUw+HT9UN7wm1oF9OBAilwEWpyrI=\ngithub.com/gocolly/colly v1.2.0/go.mod h1:Hof5T3ZswNVsOHYmba1u03W65HDWgpV5HifSuueE0EA=\ngithub.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da/go.mod h1:cIg4eruTrX1D+g88fzRXU5OdNfaM+9IcxsU14FzY7Hc=\ngithub.com/golang/groupcache v0.0.0-20241129210726-2c02b8208cf8 h1:f+oWsMOmNPc8JmEHVZIycC7hBoQxHH9pNKQORJNozsQ=\ngithub.com/golang/groupcache v0.0.0-20241129210726-2c02b8208cf8/go.mod h1:wcDNUvekVysuuOpQKo3191zZyTpiI6se1N1ULghS0sw=\ngithub.com/golang/protobuf v1.5.0/go.mod h1:FsONVRAS9T7sI+LIUmWTfcYkHO4aIWwzhcaSAoJOfIk=\ngithub.com/golang/protobuf v1.5.2/go.mod h1:XVQd3VNwM+JqD3oG2Ue2ip4fOMUkwXdXDdiuN0vRsmY=\ngithub.com/golang/protobuf v1.5.4 h1:i7eJL8qZTpSEXOPTxNKhASYpMn+8e5Q6AdndVa1dWek=\ngithub.com/golang/protobuf v1.5.4/go.mod h1:lnTiLA8Wa4RWRcIUkrtSVa5nRhsEGBg48fD6rSs7xps=\ngithub.com/google/go-cmp v0.5.5/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=\ngithub.com/google/go-cmp v0.6.0 h1:ofyhxvXcZhMsU5ulbFiLKl/XBFqE1GSq7atu8tAmTRI=\ngithub.com/google/go-cmp v0.6.0/go.mod h1:17dUlkBOakJ0+DkrSSNjCkIjxS6bF9zb3elmeNGIjoY=\ngithub.com/google/go-cmp v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8=\ngithub.com/jawher/mow.cli v1.1.0 h1:NdtHXRc0CwZQ507wMvQ/IS+Q3W3x2fycn973/b8Zuk8=\ngithub.com/jawher/mow.cli v1.1.0/go.mod h1:aNaQlc7ozF3vw6IJ2dHjp2ZFiA4ozMIYY6PyuRJwlUg=\ngithub.com/kennygrant/sanitize v1.2.4 
h1:gN25/otpP5vAsO2djbMhF/LQX6R7+O1TB4yv8NzpJ3o=\ngithub.com/kennygrant/sanitize v1.2.4/go.mod h1:LGsjYYtgxbetdg5owWB2mpgUL6e2nfw2eObZ0u0qvak=\ngithub.com/nlnwa/whatwg-url v0.6.1 h1:Zlefa3aglQFHF/jku45VxbEJwPicDnOz64Ra3F7npqQ=\ngithub.com/nlnwa/whatwg-url v0.6.1/go.mod h1:x0FPXJzzOEieQtsBT/AKvbiBbQ46YlL6Xa7m02M1ECk=\ngithub.com/nlnwa/whatwg-url v0.6.2 h1:jU61lU2ig4LANydbEJmA2nPrtCGiKdtgT0rmMd2VZ/Q=\ngithub.com/nlnwa/whatwg-url v0.6.2/go.mod h1:x0FPXJzzOEieQtsBT/AKvbiBbQ46YlL6Xa7m02M1ECk=\ngithub.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=\ngithub.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=\ngithub.com/saintfish/chardet v0.0.0-20230101081208-5e3ef4b5456d h1:hrujxIzL1woJ7AwssoOcM/tq5JjjG2yYOc8odClEiXA=\ngithub.com/saintfish/chardet v0.0.0-20230101081208-5e3ef4b5456d/go.mod h1:uugorj2VCxiV1x+LzaIdVa9b4S4qGAcH6cbhh4qVxOU=\ngithub.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=\ngithub.com/stretchr/objx v0.2.0/go.mod h1:qt09Ya8vawLte6SNmTgCsAVtYtaKzEcn8ATUoHMkEqE=\ngithub.com/stretchr/testify v1.3.0 h1:TivCn/peBQ7UY8ooIcPgZFpTNSz0Q2U6UrFlUfqbe0Q=\ngithub.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=\ngithub.com/temoto/robotstxt v1.1.2 h1:W2pOjSJ6SWvldyEuiFXNxz3xZ8aiWX5LbfDiOFd7Fxg=\ngithub.com/temoto/robotstxt v1.1.2/go.mod h1:+1AmkuG3IYkh1kv0d2qEB9Le88ehNO0zwOr3ujewlOo=\ngithub.com/yuin/goldmark v1.4.13/go.mod h1:6yULJ656Px+3vBD8DxQVa3kxgyrAnzto9xy5taEt/CY=\ngolang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=\ngolang.org/x/crypto v0.0.0-20210921155107-089bfa567519/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc=\ngolang.org/x/crypto v0.13.0/go.mod h1:y6Z2r+Rw4iayiXXAIxJIDAJ1zMW4yaTpebo8fPOliYc=\ngolang.org/x/crypto v0.19.0/go.mod h1:Iy9bg/ha4yyC70EfRS8jz+B6ybOBKMaSxLj6P6oBDfU=\ngolang.org/x/crypto v0.23.0/go.mod 
h1:CKFgDieR+mRhux2Lsu27y0fO304Db0wZe70UKqHu0v8=\ngolang.org/x/crypto v0.31.0/go.mod h1:kDsLvtWBEx7MV9tJOj9bnXsPbxwJQ6csT/x4KIN4Ssk=\ngolang.org/x/crypto v0.32.0/go.mod h1:ZnnJkOaASj8g0AjIduWNlq2NRxL0PlBrbKVyZ6V/Ugc=\ngolang.org/x/mod v0.6.0-dev.0.20220419223038-86c51ed26bb4/go.mod h1:jJ57K6gSWd91VN4djpZkiMVwK6gcyfeH4XE8wZrZaV4=\ngolang.org/x/mod v0.8.0/go.mod h1:iBbtSCu2XBx23ZKBPSOrRkjjQPZFPuis4dIYUhu/chs=\ngolang.org/x/mod v0.12.0/go.mod h1:iBbtSCu2XBx23ZKBPSOrRkjjQPZFPuis4dIYUhu/chs=\ngolang.org/x/mod v0.15.0/go.mod h1:hTbmBsO62+eylJbnUtE2MGJUyE7QWk4xUqPFrRgJ+7c=\ngolang.org/x/mod v0.17.0/go.mod h1:hTbmBsO62+eylJbnUtE2MGJUyE7QWk4xUqPFrRgJ+7c=\ngolang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=\ngolang.org/x/net v0.0.0-20210226172049-e18ecbb05110/go.mod h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg=\ngolang.org/x/net v0.0.0-20220722155237-a158d28d115b/go.mod h1:XRhObCWvk6IyKnWLug+ECip1KBveYUHfp+8e9klMJ9c=\ngolang.org/x/net v0.6.0/go.mod h1:2Tu9+aMcznHK/AK1HMvgo6xiTLG5rD5rZLDS+rp2Bjs=\ngolang.org/x/net v0.10.0/go.mod h1:0qNGK6F8kojg2nk9dLZ2mShWaEBan6FAoqfSigmmuDg=\ngolang.org/x/net v0.15.0/go.mod h1:idbUs1IY1+zTqbi8yxTbhexhEEk5ur9LInksu6HrEpk=\ngolang.org/x/net v0.21.0/go.mod h1:bIjVDfnllIU7BJ2DNgfnXvpSvtn8VRwhlsaeUTyUS44=\ngolang.org/x/net v0.25.0/go.mod h1:JkAGAh7GEvH74S6FOH42FLoXpXbE/aqXSrIQjXgsiwM=\ngolang.org/x/net v0.33.0/go.mod h1:HXLR5J+9DxmrqMwG9qjGCxZ+zKXxBru04zlTvWlWuN4=\ngolang.org/x/net v0.34.0/go.mod h1:di0qlW3YNM5oh6GqDGQr92MyTozJPmybPK4Ev/Gm31k=\ngolang.org/x/net v0.37.0 h1:1zLorHbz+LYj7MQlSf1+2tPIIgibq2eL5xkrGk6f+2c=\ngolang.org/x/net v0.37.0/go.mod h1:ivrbrMbzFq5J41QOQh0siUuly180yBYtLp+CKbEaFx8=\ngolang.org/x/net v0.47.0 h1:Mx+4dIFzqraBXUugkia1OOvlD6LemFo1ALMHjrXDOhY=\ngolang.org/x/net v0.47.0/go.mod h1:/jNxtkgq5yWUGYkaZGqo27cfGZ1c5Nen03aYrrKpVRU=\ngolang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=\ngolang.org/x/sync 
v0.0.0-20220722155255-886fb9371eb4/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=\ngolang.org/x/sync v0.1.0/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=\ngolang.org/x/sync v0.3.0/go.mod h1:FU7BRWz2tNW+3quACPkgCx/L+uEAv1htQ0V83Z9Rj+Y=\ngolang.org/x/sync v0.6.0/go.mod h1:Czt+wKu1gCyEFDUtn0jG5QVvpJ6rzVqr5aXyt9drQfk=\ngolang.org/x/sync v0.7.0/go.mod h1:Czt+wKu1gCyEFDUtn0jG5QVvpJ6rzVqr5aXyt9drQfk=\ngolang.org/x/sync v0.10.0/go.mod h1:Czt+wKu1gCyEFDUtn0jG5QVvpJ6rzVqr5aXyt9drQfk=\ngolang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=\ngolang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=\ngolang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=\ngolang.org/x/sys v0.0.0-20220520151302-bc2c85ada10a/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=\ngolang.org/x/sys v0.0.0-20220722155257-8c9f86f7a55f/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=\ngolang.org/x/sys v0.5.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=\ngolang.org/x/sys v0.8.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=\ngolang.org/x/sys v0.12.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=\ngolang.org/x/sys v0.17.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=\ngolang.org/x/sys v0.20.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=\ngolang.org/x/sys v0.28.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=\ngolang.org/x/sys v0.29.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=\ngolang.org/x/telemetry v0.0.0-20240228155512-f48c80bd79b2/go.mod h1:TeRTkGYfJXctD9OcfyVLyj2J3IxLnKwHJR8f4D8a3YE=\ngolang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=\ngolang.org/x/term v0.0.0-20210927222741-03fcf44c2211/go.mod h1:jbD1KX2456YbFQfuXm/mYQcufACuNUgVhRMnK/tPxf8=\ngolang.org/x/term v0.5.0/go.mod 
h1:jMB1sMXY+tzblOD4FWmEbocvup2/aLOaQEp7JmGp78k=\ngolang.org/x/term v0.8.0/go.mod h1:xPskH00ivmX89bAKVGSKKtLOWNx2+17Eiy94tnKShWo=\ngolang.org/x/term v0.12.0/go.mod h1:owVbMEjm3cBLCHdkQu9b1opXd4ETQWc3BhuQGKgXgvU=\ngolang.org/x/term v0.17.0/go.mod h1:lLRBjIVuehSbZlaOtGMbcMncT+aqLLLmKrsjNrUguwk=\ngolang.org/x/term v0.20.0/go.mod h1:8UkIAJTvZgivsXaD6/pH6U9ecQzZ45awqEOzuCvwpFY=\ngolang.org/x/term v0.27.0/go.mod h1:iMsnZpn0cago0GOrHO2+Y7u7JPn5AylBrcoWkElMTSM=\ngolang.org/x/term v0.28.0/go.mod h1:Sw/lC2IAUZ92udQNf3WodGtn4k/XoLyZoh8v/8uiwek=\ngolang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=\ngolang.org/x/text v0.3.3/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=\ngolang.org/x/text v0.3.7/go.mod h1:u+2+/6zg+i71rQMx5EYifcz6MCKuco9NR6JIITiCfzQ=\ngolang.org/x/text v0.3.8/go.mod h1:E6s5w1FMmriuDzIBO73fBruAKo1PCIq6d2Q6DHfQ8WQ=\ngolang.org/x/text v0.7.0/go.mod h1:mrYo+phRRbMaCq/xk9113O4dZlRixOauAjOtrjsXDZ8=\ngolang.org/x/text v0.9.0/go.mod h1:e1OnstbJyHTd6l/uOt8jFFHp6TRDWZR/bV3emEE/zU8=\ngolang.org/x/text v0.13.0/go.mod h1:TvPlkZtksWOMsz7fbANvkp4WM8x/WCo/om8BMLbz+aE=\ngolang.org/x/text v0.14.0/go.mod h1:18ZOQIKpY8NJVqYksKHtTdi31H5itFRjB5/qKTNYzSU=\ngolang.org/x/text v0.15.0/go.mod h1:18ZOQIKpY8NJVqYksKHtTdi31H5itFRjB5/qKTNYzSU=\ngolang.org/x/text v0.21.0/go.mod h1:4IBbMaMmOPCJ8SecivzSH54+73PCFmPWxNTLm+vZkEQ=\ngolang.org/x/text v0.23.0 h1:D71I7dUrlY+VX0gQShAThNGHFxZ13dGLBHQLVl1mJlY=\ngolang.org/x/text v0.23.0/go.mod h1:/BLNzu4aZCJ1+kcD0DNRotWKage4q2rGVAg4o22unh4=\ngolang.org/x/text v0.31.0 h1:aC8ghyu4JhP8VojJ2lEHBnochRno1sgL6nEi9WGFGMM=\ngolang.org/x/text v0.31.0/go.mod h1:tKRAlv61yKIjGGHX/4tP1LTbc13YSec1pxVEWXzfoeM=\ngolang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=\ngolang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=\ngolang.org/x/tools v0.1.12/go.mod h1:hNGJHUnrk76NpqgfD5Aqm5Crs+Hm0VOH/i9J2+nxYbc=\ngolang.org/x/tools 
v0.6.0/go.mod h1:Xwgl3UAJ/d3gWutnCtw505GrjyAbvKui8lOU390QaIU=\ngolang.org/x/tools v0.13.0/go.mod h1:HvlwmtVNQAhOuCjW7xxvovg8wbNq7LwfXh/k7wXUl58=\ngolang.org/x/tools v0.21.1-0.20240508182429-e35e4ccd0d2d/go.mod h1:aiJjzUbINMkxbQROHiO6hDPo2LHcIPhhQsa9DLh0yGk=\ngolang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=\ngolang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=\ngoogle.golang.org/appengine v1.6.8 h1:IhEN5q69dyKagZPYMSdIjS2HqprW324FRQZJcGqPAsM=\ngoogle.golang.org/appengine v1.6.8/go.mod h1:1jJ3jBArFh5pcgW8gCtRJnepW8FzD1V44FJffLiz/Ds=\ngoogle.golang.org/protobuf v1.26.0-rc.1/go.mod h1:jlhhOSvTdKEhbULTjvd4ARK9grFBp09yW+WbY/TyQbw=\ngoogle.golang.org/protobuf v1.26.0/go.mod h1:9q0QmTI4eRPtz6boOQmLYwt+qCgq0jsYwAQnmE0givc=\ngoogle.golang.org/protobuf v1.36.6 h1:z1NpPI8ku2WgiWnf+t9wTPsn6eP1L7ksHUlkfLvd9xY=\ngoogle.golang.org/protobuf v1.36.6/go.mod h1:jduwjTPXsFjZGTmRluh+L6NjiWu7pchiJ2/5YcXBHnY=\ngoogle.golang.org/protobuf v1.36.10 h1:AYd7cD/uASjIL6Q9LiTjz8JLcrh/88q5UObnmY3aOOE=\ngoogle.golang.org/protobuf v1.36.10/go.mod h1:HTf+CrKn2C3g5S8VImy6tdcUvCska2kB7j23XfzDpco=\n"
  },
  {
    "path": "htmlelement.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage colly\n\nimport (\n\t\"strings\"\n\n\t\"github.com/PuerkitoBio/goquery\"\n\t\"golang.org/x/net/html\"\n)\n\n// HTMLElement is the representation of a HTML tag.\ntype HTMLElement struct {\n\t// Name is the name of the tag\n\tName       string\n\tText       string\n\tattributes []html.Attribute\n\t// Request is the request object of the element's HTML document\n\tRequest *Request\n\t// Response is the Response object of the element's HTML document\n\tResponse *Response\n\t// DOM is the goquery parsed DOM object of the page. 
DOM is relative\n\t// to the current HTMLElement\n\tDOM *goquery.Selection\n\t// Index stores the position of the current element within all the elements matched by an OnHTML callback\n\tIndex int\n}\n\n// NewHTMLElementFromSelectionNode creates a HTMLElement from a goquery.Selection Node.\nfunc NewHTMLElementFromSelectionNode(resp *Response, s *goquery.Selection, n *html.Node, idx int) *HTMLElement {\n\treturn &HTMLElement{\n\t\tName:       n.Data,\n\t\tRequest:    resp.Request,\n\t\tResponse:   resp,\n\t\tText:       goquery.NewDocumentFromNode(n).Text(),\n\t\tDOM:        s,\n\t\tIndex:      idx,\n\t\tattributes: n.Attr,\n\t}\n}\n\n// Attr returns the selected attribute of a HTMLElement or empty string\n// if no attribute found\nfunc (h *HTMLElement) Attr(k string) string {\n\tfor _, a := range h.attributes {\n\t\tif a.Key == k {\n\t\t\treturn a.Val\n\t\t}\n\t}\n\treturn \"\"\n}\n\n// ChildText returns the concatenated and stripped text content of the matching\n// elements.\nfunc (h *HTMLElement) ChildText(goquerySelector string) string {\n\treturn strings.TrimSpace(h.DOM.Find(goquerySelector).Text())\n}\n\n// ChildTexts returns the stripped text content of all the matching\n// elements.\nfunc (h *HTMLElement) ChildTexts(goquerySelector string) []string {\n\tvar res []string\n\th.DOM.Find(goquerySelector).Each(func(_ int, s *goquery.Selection) {\n\n\t\tres = append(res, strings.TrimSpace(s.Text()))\n\t})\n\treturn res\n}\n\n// ChildAttr returns the stripped text content of the first matching\n// element's attribute.\nfunc (h *HTMLElement) ChildAttr(goquerySelector, attrName string) string {\n\tif attr, ok := h.DOM.Find(goquerySelector).Attr(attrName); ok {\n\t\treturn strings.TrimSpace(attr)\n\t}\n\treturn \"\"\n}\n\n// ChildAttrs returns the stripped text content of all the matching\n// element's attributes.\nfunc (h *HTMLElement) ChildAttrs(goquerySelector, attrName string) []string {\n\tvar res []string\n\th.DOM.Find(goquerySelector).Each(func(_ int, s 
*goquery.Selection) {\n\t\tif attr, ok := s.Attr(attrName); ok {\n\t\t\tres = append(res, strings.TrimSpace(attr))\n\t\t}\n\t})\n\treturn res\n}\n\n// ForEach iterates over the elements matched by the first argument\n// and calls the callback function on every HTMLElement match.\nfunc (h *HTMLElement) ForEach(goquerySelector string, callback func(int, *HTMLElement)) {\n\ti := 0\n\th.DOM.Find(goquerySelector).Each(func(_ int, s *goquery.Selection) {\n\t\tfor _, n := range s.Nodes {\n\t\t\tcallback(i, NewHTMLElementFromSelectionNode(h.Response, s, n, i))\n\t\t\ti++\n\t\t}\n\t})\n}\n\n// ForEachWithBreak iterates over the elements matched by the first argument\n// and calls the callback function on every HTMLElement match.\n// It is identical to ForEach except that it is possible to break\n// out of the loop by returning false from the callback function.\nfunc (h *HTMLElement) ForEachWithBreak(goquerySelector string, callback func(int, *HTMLElement) bool) {\n\ti := 0\n\th.DOM.Find(goquerySelector).EachWithBreak(func(_ int, s *goquery.Selection) bool {\n\t\tfor _, n := range s.Nodes {\n\t\t\tif callback(i, NewHTMLElementFromSelectionNode(h.Response, s, n, i)) {\n\t\t\t\ti++\n\t\t\t\treturn true\n\t\t\t}\n\t\t}\n\t\treturn false\n\t})\n}\n"
  },
  {
    "path": "http_backend.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage colly\n\nimport (\n\t\"crypto/sha1\"\n\t\"encoding/gob\"\n\t\"encoding/hex\"\n\t\"io\"\n\t\"math/rand\"\n\t\"net/http\"\n\t\"os\"\n\t\"path\"\n\t\"regexp\"\n\t\"strings\"\n\t\"sync\"\n\t\"time\"\n\n\t\"compress/gzip\"\n\n\t\"github.com/gobwas/glob\"\n)\n\ntype httpBackend struct {\n\tLimitRules []*LimitRule\n\tClient     *http.Client\n\tlock       *sync.RWMutex\n}\n\ntype checkResponseHeadersFunc func(req *http.Request, statusCode int, header http.Header) bool\ntype checkRequestHeadersFunc func(req *http.Request) bool\n\n// LimitRule provides connection restrictions for domains.\n// Both DomainRegexp and DomainGlob can be used to specify\n// the included domains patterns, but at least one is required.\n// There can be two kind of limitations:\n//   - Parallelism: Set limit for the number of concurrent requests to matching domains\n//   - Delay: Wait specified amount of time between requests (parallelism is 1 in this case)\ntype LimitRule struct {\n\t// DomainRegexp is a regular expression to match against domains\n\tDomainRegexp string\n\t// DomainGlob is a glob pattern to match against domains\n\tDomainGlob string\n\t// Delay is the duration to wait before creating a new request to the matching domains\n\tDelay time.Duration\n\t// RandomDelay is the extra randomized duration to wait added to Delay before creating a new request\n\tRandomDelay 
time.Duration\n\t// Parallelism is the number of the maximum allowed concurrent requests of the matching domains\n\tParallelism    int\n\twaitChan       chan bool\n\tcompiledRegexp *regexp.Regexp\n\tcompiledGlob   glob.Glob\n}\n\n// Init initializes the private members of LimitRule\nfunc (r *LimitRule) Init() error {\n\tr.waitChan = make(chan bool, max(r.Parallelism, 1))\n\thasPattern := false\n\tif r.DomainRegexp != \"\" {\n\t\tc, err := regexp.Compile(r.DomainRegexp)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tr.compiledRegexp = c\n\t\thasPattern = true\n\t}\n\tif r.DomainGlob != \"\" {\n\t\tc, err := glob.Compile(r.DomainGlob)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tr.compiledGlob = c\n\t\thasPattern = true\n\t}\n\tif !hasPattern {\n\t\treturn ErrNoPattern\n\t}\n\treturn nil\n}\n\nfunc (h *httpBackend) Init(jar http.CookieJar) {\n\trand.Seed(time.Now().UnixNano())\n\th.Client = &http.Client{\n\t\tJar:     jar,\n\t\tTimeout: 10 * time.Second,\n\t}\n\th.lock = &sync.RWMutex{}\n}\n\n// Match checks that the domain parameter triggers the rule\nfunc (r *LimitRule) Match(domain string) bool {\n\tmatch := false\n\tif r.compiledRegexp != nil && r.compiledRegexp.MatchString(domain) {\n\t\tmatch = true\n\t}\n\tif r.compiledGlob != nil && r.compiledGlob.Match(domain) {\n\t\tmatch = true\n\t}\n\treturn match\n}\n\nfunc (h *httpBackend) GetMatchingRule(domain string) *LimitRule {\n\tif h.LimitRules == nil {\n\t\treturn nil\n\t}\n\th.lock.RLock()\n\tdefer h.lock.RUnlock()\n\tfor _, r := range h.LimitRules {\n\t\tif r.Match(domain) {\n\t\t\treturn r\n\t\t}\n\t}\n\treturn nil\n}\n\nfunc (h *httpBackend) Cache(request *http.Request, bodySize int, checkRequestHeadersFunc checkRequestHeadersFunc, checkResponseHeadersFunc checkResponseHeadersFunc, cacheDir string, cacheExpiration time.Duration) (*Response, error) {\n\tif cacheDir == \"\" || request.Method != \"GET\" || request.Header.Get(\"Cache-Control\") == \"no-cache\" {\n\t\treturn h.Do(request, bodySize, 
checkRequestHeadersFunc, checkResponseHeadersFunc)\n\t}\n\tsum := sha1.Sum([]byte(request.URL.String()))\n\thash := hex.EncodeToString(sum[:])\n\tdir := path.Join(cacheDir, hash[:2])\n\tfilename := path.Join(dir, hash)\n\n\tif fileInfo, err := os.Stat(filename); err == nil && cacheExpiration > 0 {\n\t\tif time.Since(fileInfo.ModTime()) > cacheExpiration {\n\t\t\t_ = os.Remove(filename)\n\t\t}\n\t}\n\n\tif file, err := os.Open(filename); err == nil {\n\t\tresp := new(Response)\n\t\terr := gob.NewDecoder(file).Decode(resp)\n\t\tfile.Close()\n\t\tcheckResponseHeadersFunc(request, resp.StatusCode, *resp.Headers)\n\t\tif resp.StatusCode < 500 {\n\t\t\treturn resp, err\n\t\t}\n\t}\n\tresp, err := h.Do(request, bodySize, checkRequestHeadersFunc, checkResponseHeadersFunc)\n\tif err != nil || resp.StatusCode >= 500 {\n\t\treturn resp, err\n\t}\n\tif _, err := os.Stat(dir); err != nil {\n\t\tif err := os.MkdirAll(dir, 0750); err != nil {\n\t\t\treturn resp, err\n\t\t}\n\t}\n\tfile, err := os.Create(filename + \"~\")\n\tif err != nil {\n\t\treturn resp, err\n\t}\n\tif err := gob.NewEncoder(file).Encode(resp); err != nil {\n\t\tfile.Close()\n\t\treturn resp, err\n\t}\n\tfile.Close()\n\treturn resp, os.Rename(filename+\"~\", filename)\n}\n\nfunc (h *httpBackend) Do(request *http.Request, bodySize int, checkRequestHeadersFunc checkRequestHeadersFunc, checkResponseHeadersFunc checkResponseHeadersFunc) (*Response, error) {\n\tr := h.GetMatchingRule(request.URL.Host)\n\tif r != nil {\n\t\tr.waitChan <- true\n\t\tdefer func(r *LimitRule) {\n\t\t\trandomDelay := time.Duration(0)\n\t\t\tif r.RandomDelay != 0 {\n\t\t\t\trandomDelay = time.Duration(rand.Int63n(int64(r.RandomDelay)))\n\t\t\t}\n\t\t\ttime.Sleep(r.Delay + randomDelay)\n\t\t\t<-r.waitChan\n\t\t}(r)\n\t}\n\tif !checkRequestHeadersFunc(request) {\n\t\treturn nil, ErrAbortedBeforeRequest\n\t}\n\tres, err := h.Client.Do(request)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tdefer res.Body.Close()\n\n\tfinalRequest := 
request\n\tif res.Request != nil {\n\t\tfinalRequest = res.Request\n\t}\n\tif !checkResponseHeadersFunc(finalRequest, res.StatusCode, res.Header) {\n\t\t// closing res.Body (see defer above) without reading it aborts\n\t\t// the download\n\t\treturn nil, ErrAbortedAfterHeaders\n\t}\n\n\tvar bodyReader io.Reader = res.Body\n\tif bodySize > 0 {\n\t\tbodyReader = io.LimitReader(bodyReader, int64(bodySize))\n\t}\n\tcontentEncoding := strings.ToLower(res.Header.Get(\"Content-Encoding\"))\n\tif !res.Uncompressed && (strings.Contains(contentEncoding, \"gzip\") || (contentEncoding == \"\" && strings.Contains(strings.ToLower(res.Header.Get(\"Content-Type\")), \"gzip\")) || (strings.HasSuffix(strings.ToLower(finalRequest.URL.Path), \".xml.gz\") && res.StatusCode >= 200 && res.StatusCode < 300)) {\n\t\tbodyReader, err = gzip.NewReader(bodyReader)\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\tdefer bodyReader.(*gzip.Reader).Close()\n\t}\n\tbody, err := io.ReadAll(bodyReader)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\treturn &Response{\n\t\tStatusCode: res.StatusCode,\n\t\tBody:       body,\n\t\tHeaders:    &res.Header,\n\t}, nil\n}\n\nfunc (h *httpBackend) Limit(rule *LimitRule) error {\n\th.lock.Lock()\n\tif h.LimitRules == nil {\n\t\th.LimitRules = make([]*LimitRule, 0, 8)\n\t}\n\th.LimitRules = append(h.LimitRules, rule)\n\th.lock.Unlock()\n\treturn rule.Init()\n}\n\nfunc (h *httpBackend) Limits(rules []*LimitRule) error {\n\tfor _, r := range rules {\n\t\tif err := h.Limit(r); err != nil {\n\t\t\treturn err\n\t\t}\n\t}\n\treturn nil\n}\n"
  },
  {
    "path": "http_trace.go",
    "content": "package colly\n\nimport (\n\t\"net/http\"\n\t\"net/http/httptrace\"\n\t\"time\"\n)\n\n// HTTPTrace provides a data structure for storing an HTTP trace.\ntype HTTPTrace struct {\n\tstart, connect    time.Time\n\tConnectDuration   time.Duration\n\tFirstByteDuration time.Duration\n}\n\n// trace returns a httptrace.ClientTrace object to be used with an http\n// request via httptrace.WithClientTrace() that fills in the HTTPTrace.\nfunc (ht *HTTPTrace) trace() *httptrace.ClientTrace {\n\ttrace := &httptrace.ClientTrace{\n\t\tConnectStart: func(network, addr string) { ht.connect = time.Now() },\n\t\tConnectDone: func(network, addr string, err error) {\n\t\t\tht.ConnectDuration = time.Since(ht.connect)\n\t\t},\n\n\t\tGetConn: func(hostPort string) { ht.start = time.Now() },\n\t\tGotFirstResponseByte: func() {\n\t\t\tht.FirstByteDuration = time.Since(ht.start)\n\t\t},\n\t}\n\treturn trace\n}\n\n// WithTrace returns the given HTTP Request with this HTTPTrace added to its\n// context.\nfunc (ht *HTTPTrace) WithTrace(req *http.Request) *http.Request {\n\treturn req.WithContext(httptrace.WithClientTrace(req.Context(), ht.trace()))\n}\n"
  },
  {
    "path": "http_trace_test.go",
    "content": "package colly\n\nimport (\n\t\"net/http\"\n\t\"net/http/httptest\"\n\t\"testing\"\n\t\"time\"\n)\n\nconst testDelay = 200 * time.Millisecond\n\nfunc newTraceTestServer(delay time.Duration) *httptest.Server {\n\tmux := http.NewServeMux()\n\n\tmux.HandleFunc(\"/\", func(w http.ResponseWriter, r *http.Request) {\n\t\ttime.Sleep(delay)\n\t\tw.WriteHeader(200)\n\t})\n\tmux.HandleFunc(\"/error\", func(w http.ResponseWriter, r *http.Request) {\n\t\ttime.Sleep(delay)\n\t\tw.WriteHeader(500)\n\t})\n\n\treturn httptest.NewServer(mux)\n}\n\nfunc TestTraceWithNoDelay(t *testing.T) {\n\tts := newTraceTestServer(0)\n\tdefer ts.Close()\n\n\tclient := ts.Client()\n\treq, err := http.NewRequest(\"GET\", ts.URL, nil)\n\tif err != nil {\n\t\tt.Errorf(\"Failed to construct request %v\", err)\n\t}\n\ttrace := &HTTPTrace{}\n\treq = trace.WithTrace(req)\n\n\tif _, err = client.Do(req); err != nil {\n\t\tt.Errorf(\"Failed to make request %v\", err)\n\t}\n\n\tif trace.ConnectDuration > testDelay {\n\t\tt.Errorf(\"trace ConnectDuration should be (almost) 0, got %v\", trace.ConnectDuration)\n\t}\n\tif trace.FirstByteDuration > testDelay {\n\t\tt.Errorf(\"trace FirstByteDuration should be (almost) 0, got %v\", trace.FirstByteDuration)\n\t}\n}\n\nfunc TestTraceWithDelay(t *testing.T) {\n\tts := newTraceTestServer(testDelay)\n\tdefer ts.Close()\n\n\tclient := ts.Client()\n\treq, err := http.NewRequest(\"GET\", ts.URL, nil)\n\tif err != nil {\n\t\tt.Errorf(\"Failed to construct request %v\", err)\n\t}\n\ttrace := &HTTPTrace{}\n\treq = trace.WithTrace(req)\n\n\tif _, err = client.Do(req); err != nil {\n\t\tt.Errorf(\"Failed to make request %v\", err)\n\t}\n\n\tif trace.ConnectDuration > testDelay {\n\t\tt.Errorf(\"trace ConnectDuration should be (almost) 0, got %v\", trace.ConnectDuration)\n\t}\n\tif trace.FirstByteDuration < testDelay {\n\t\tt.Errorf(\"trace FirstByteDuration should be at least 200ms, got %v\", trace.FirstByteDuration)\n\t}\n}\n"
  },
  {
    "path": "proxy/proxy.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage proxy\n\nimport (\n\t\"context\"\n\t\"net/http\"\n\t\"net/url\"\n\t\"sync/atomic\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\ntype roundRobinSwitcher struct {\n\tproxyURLs []*url.URL\n\tindex     uint32\n}\n\nfunc (r *roundRobinSwitcher) GetProxy(pr *http.Request) (*url.URL, error) {\n\tindex := atomic.AddUint32(&r.index, 1) - 1\n\tu := r.proxyURLs[index%uint32(len(r.proxyURLs))]\n\n\tctx := context.WithValue(pr.Context(), colly.ProxyURLKey, u.String())\n\t*pr = *pr.WithContext(ctx)\n\treturn u, nil\n}\n\n// RoundRobinProxySwitcher creates a proxy switcher function which rotates\n// ProxyURLs on every request.\n// The proxy type is determined by the URL scheme. \"http\", \"https\"\n// and \"socks5\" are supported. If the scheme is empty,\n// \"http\" is assumed.\nfunc RoundRobinProxySwitcher(ProxyURLs ...string) (colly.ProxyFunc, error) {\n\tif len(ProxyURLs) < 1 {\n\t\treturn nil, colly.ErrEmptyProxyURL\n\t}\n\turls := make([]*url.URL, len(ProxyURLs))\n\tfor i, u := range ProxyURLs {\n\t\tparsedU, err := url.Parse(u)\n\t\tif err != nil {\n\t\t\treturn nil, err\n\t\t}\n\t\turls[i] = parsedU\n\t}\n\treturn (&roundRobinSwitcher{urls, 0}).GetProxy, nil\n}\n"
  },
  {
    "path": "queue/queue.go",
    "content": "package queue\n\nimport (\n\t\"net/url\"\n\t\"sync\"\n\n\twhatwgUrl \"github.com/nlnwa/whatwg-url/url\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nconst stop = true\n\nvar urlParser = whatwgUrl.NewParser(whatwgUrl.WithPercentEncodeSinglePercentSign())\n\n// Storage is the interface of the queue's storage backend\n// Storage must be concurrently safe for multiple goroutines.\ntype Storage interface {\n\t// Init initializes the storage\n\tInit() error\n\t// AddRequest adds a serialized request to the queue\n\tAddRequest([]byte) error\n\t// GetRequest pops the next request from the queue\n\t// or returns error if the queue is empty\n\tGetRequest() ([]byte, error)\n\t// QueueSize returns with the size of the queue\n\tQueueSize() (int, error)\n}\n\n// Queue is a request queue which uses a Collector to consume\n// requests in multiple threads\ntype Queue struct {\n\t// Threads defines the number of consumer threads\n\tThreads int\n\tstorage Storage\n\twake    chan struct{}\n\tmut     sync.Mutex // guards wake and running\n\trunning bool\n}\n\n// InMemoryQueueStorage is the default implementation of the Storage interface.\n// InMemoryQueueStorage holds the request queue in memory.\ntype InMemoryQueueStorage struct {\n\t// MaxSize defines the capacity of the queue.\n\t// New requests are discarded if the queue size reaches MaxSize\n\tMaxSize int\n\tlock    *sync.RWMutex\n\tsize    int\n\tfirst   *inMemoryQueueItem\n\tlast    *inMemoryQueueItem\n}\n\ntype inMemoryQueueItem struct {\n\tRequest []byte\n\tNext    *inMemoryQueueItem\n}\n\n// New creates a new queue with a Storage specified in argument\n// A standard InMemoryQueueStorage is used if Storage argument is nil.\nfunc New(threads int, s Storage) (*Queue, error) {\n\tif s == nil {\n\t\ts = &InMemoryQueueStorage{MaxSize: 100000}\n\t}\n\tif err := s.Init(); err != nil {\n\t\treturn nil, err\n\t}\n\treturn &Queue{\n\t\tThreads: threads,\n\t\tstorage: s,\n\t\trunning: true,\n\t}, nil\n}\n\n// IsEmpty 
returns true if the queue is empty\nfunc (q *Queue) IsEmpty() bool {\n\ts, _ := q.Size()\n\treturn s == 0\n}\n\n// AddURL adds a new URL to the queue\nfunc (q *Queue) AddURL(URL string) error {\n\tu, err := urlParser.Parse(URL)\n\tif err != nil {\n\t\treturn err\n\t}\n\tu2, err := url.Parse(u.Href(false))\n\tif err != nil {\n\t\treturn err\n\t}\n\tr := &colly.Request{\n\t\tURL:    u2,\n\t\tMethod: \"GET\",\n\t}\n\td, err := r.Marshal()\n\tif err != nil {\n\t\treturn err\n\t}\n\treturn q.storage.AddRequest(d)\n}\n\n// AddRequest adds a new Request to the queue\nfunc (q *Queue) AddRequest(r *colly.Request) error {\n\tq.mut.Lock()\n\twaken := q.wake != nil\n\tq.mut.Unlock()\n\tif !waken {\n\t\treturn q.storeRequest(r)\n\t}\n\terr := q.storeRequest(r)\n\tif err != nil {\n\t\treturn err\n\t}\n\tq.wake <- struct{}{}\n\treturn nil\n}\n\nfunc (q *Queue) storeRequest(r *colly.Request) error {\n\td, err := r.Marshal()\n\tif err != nil {\n\t\treturn err\n\t}\n\treturn q.storage.AddRequest(d)\n}\n\n// Size returns the size of the queue\nfunc (q *Queue) Size() (int, error) {\n\treturn q.storage.QueueSize()\n}\n\n// Run starts consumer threads and calls the Collector\n// to perform requests. 
Run blocks while the queue has active requests\n// The given Storage must not be used directly while Run blocks.\nfunc (q *Queue) Run(c *colly.Collector) error {\n\tq.mut.Lock()\n\tif q.wake != nil && q.running == true {\n\t\tq.mut.Unlock()\n\t\tpanic(\"cannot call duplicate Queue.Run\")\n\t}\n\tq.wake = make(chan struct{})\n\tq.running = true\n\tq.mut.Unlock()\n\n\trequestc := make(chan *colly.Request)\n\tcomplete, errc := make(chan struct{}), make(chan error, 1)\n\tfor i := 0; i < q.Threads; i++ {\n\t\tgo independentRunner(requestc, complete)\n\t}\n\tgo q.loop(c, requestc, complete, errc)\n\tdefer close(requestc)\n\treturn <-errc\n}\n\n// Stop will stop the running queue\nfunc (q *Queue) Stop() {\n\tq.mut.Lock()\n\tq.running = false\n\tq.mut.Unlock()\n}\n\nfunc (q *Queue) loop(c *colly.Collector, requestc chan<- *colly.Request, complete <-chan struct{}, errc chan<- error) {\n\tvar active int\n\tfor {\n\t\tsize, err := q.storage.QueueSize()\n\t\tif err != nil {\n\t\t\terrc <- err\n\t\t\tbreak\n\t\t}\n\t\tif size == 0 && active == 0 || !q.running {\n\t\t\t// Terminate when\n\t\t\t//   1. No active requests\n\t\t\t//   2. 
Empty queue\n\t\t\terrc <- nil\n\t\t\tbreak\n\t\t}\n\t\tsent := requestc\n\t\tvar req *colly.Request\n\t\tif size > 0 {\n\t\t\treq, err = q.loadRequest(c)\n\t\t\tif err != nil {\n\t\t\t\t// ignore an error returned by GetRequest() or\n\t\t\t\t// UnmarshalRequest()\n\t\t\t\tcontinue\n\t\t\t}\n\t\t} else {\n\t\t\tsent = nil\n\t\t}\n\tSent:\n\t\tfor {\n\t\t\tselect {\n\t\t\tcase sent <- req:\n\t\t\t\tactive++\n\t\t\t\tbreak Sent\n\t\t\tcase <-q.wake:\n\t\t\t\tif sent == nil {\n\t\t\t\t\tbreak Sent\n\t\t\t\t}\n\t\t\tcase <-complete:\n\t\t\t\tactive--\n\t\t\t\tif sent == nil && active == 0 {\n\t\t\t\t\tbreak Sent\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n}\n\nfunc independentRunner(requestc <-chan *colly.Request, complete chan<- struct{}) {\n\tfor req := range requestc {\n\t\treq.Do()\n\t\tcomplete <- struct{}{}\n\t}\n}\n\nfunc (q *Queue) loadRequest(c *colly.Collector) (*colly.Request, error) {\n\tbuf, err := q.storage.GetRequest()\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tcopied := make([]byte, len(buf))\n\tcopy(copied, buf)\n\treturn c.UnmarshalRequest(copied)\n}\n\n// Init implements Storage.Init() function\nfunc (q *InMemoryQueueStorage) Init() error {\n\tq.lock = &sync.RWMutex{}\n\treturn nil\n}\n\n// AddRequest implements Storage.AddRequest() function\nfunc (q *InMemoryQueueStorage) AddRequest(r []byte) error {\n\tq.lock.Lock()\n\tdefer q.lock.Unlock()\n\t// Discard URLs if size limit exceeded\n\tif q.MaxSize > 0 && q.size >= q.MaxSize {\n\t\treturn colly.ErrQueueFull\n\t}\n\ti := &inMemoryQueueItem{Request: r}\n\tif q.first == nil {\n\t\tq.first = i\n\t} else {\n\t\tq.last.Next = i\n\t}\n\tq.last = i\n\tq.size++\n\treturn nil\n}\n\n// GetRequest implements Storage.GetRequest() function\nfunc (q *InMemoryQueueStorage) GetRequest() ([]byte, error) {\n\tq.lock.Lock()\n\tdefer q.lock.Unlock()\n\tif q.size == 0 {\n\t\treturn nil, nil\n\t}\n\tr := q.first.Request\n\tq.first = q.first.Next\n\tq.size--\n\treturn r, nil\n}\n\n// QueueSize implements Storage.QueueSize() 
function\nfunc (q *InMemoryQueueStorage) QueueSize() (int, error) {\n\tq.lock.RLock()\n\tdefer q.lock.RUnlock()\n\treturn q.size, nil\n}\n"
  },
  {
    "path": "queue/queue_test.go",
    "content": "package queue\n\nimport (\n\t\"math/rand\"\n\t\"net/http\"\n\t\"net/http/httptest\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/gocolly/colly/v2\"\n)\n\nfunc TestQueue(t *testing.T) {\n\tserver := httptest.NewServer(http.HandlerFunc(serverHandler))\n\tdefer server.Close()\n\n\trng := rand.New(rand.NewSource(12387123712321232))\n\tvar rngMu sync.Mutex\n\n\tvar (\n\t\titems    uint32\n\t\trequests uint32\n\t\tsuccess  uint32\n\t\tfailure  uint32\n\t)\n\tstorage := &InMemoryQueueStorage{MaxSize: 100000}\n\tq, err := New(10, storage)\n\tif err != nil {\n\t\tpanic(err)\n\t}\n\tput := func() {\n\t\trngMu.Lock()\n\t\tt := time.Duration(rng.Intn(50)) * time.Microsecond\n\t\trngMu.Unlock()\n\t\turl := server.URL + \"/delay?t=\" + t.String()\n\t\tatomic.AddUint32(&items, 1)\n\t\tq.AddURL(url)\n\t}\n\tfor i := 0; i < 3000; i++ {\n\t\tput()\n\t\tstorage.AddRequest([]byte(\"error request\"))\n\t}\n\tc := colly.NewCollector(\n\t\tcolly.AllowURLRevisit(),\n\t)\n\tc.OnRequest(func(req *colly.Request) {\n\t\tatomic.AddUint32(&requests, 1)\n\t})\n\tc.OnResponse(func(resp *colly.Response) {\n\t\tif resp.StatusCode == http.StatusOK {\n\t\t\tatomic.AddUint32(&success, 1)\n\t\t} else {\n\t\t\tatomic.AddUint32(&failure, 1)\n\t\t}\n\t\trngMu.Lock()\n\t\ttoss := rng.Intn(2) == 0\n\t\trngMu.Unlock()\n\t\tif toss {\n\t\t\tput()\n\t\t}\n\t})\n\tc.OnError(func(resp *colly.Response, err error) {\n\t\tatomic.AddUint32(&failure, 1)\n\t})\n\terr = q.Run(c)\n\tif err != nil {\n\t\tt.Fatalf(\"Queue.Run() return an error: %v\", err)\n\t}\n\tif items != requests || success+failure != requests || failure > 0 {\n\t\tt.Fatalf(\"wrong Queue implementation: \"+\n\t\t\t\"items = %d, requests = %d, success = %d, failure = %d\",\n\t\t\titems, requests, success, failure)\n\t}\n}\n\nfunc serverHandler(w http.ResponseWriter, req *http.Request) {\n\tif !serverRoute(w, req) {\n\t\tshutdown(w)\n\t}\n}\n\nfunc serverRoute(w http.ResponseWriter, req *http.Request) bool 
{\n\tif req.URL.Path == \"/delay\" {\n\t\treturn serveDelay(w, req) == nil\n\t}\n\treturn false\n}\n\nfunc serveDelay(w http.ResponseWriter, req *http.Request) error {\n\tq := req.URL.Query()\n\tt, err := time.ParseDuration(q.Get(\"t\"))\n\tif err != nil {\n\t\treturn err\n\t}\n\ttime.Sleep(t)\n\tw.WriteHeader(http.StatusOK)\n\treturn nil\n}\n\nfunc shutdown(w http.ResponseWriter) {\n\ttaker, ok := w.(http.Hijacker)\n\tif !ok {\n\t\treturn\n\t}\n\traw, _, err := taker.Hijack()\n\tif err != nil {\n\t\treturn\n\t}\n\traw.Close()\n}\n"
  },
  {
    "path": "request.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage colly\n\nimport (\n\t\"bytes\"\n\t\"encoding/json\"\n\t\"io\"\n\t\"net/http\"\n\t\"net/url\"\n\t\"strings\"\n)\n\n// Request is the representation of a HTTP request made by a Collector\ntype Request struct {\n\t// URL is the parsed URL of the HTTP request\n\tURL *url.URL\n\t// Headers contains the Request's HTTP headers\n\tHeaders *http.Header\n\t// the Host header\n\tHost string\n\t// Ctx is a context between a Request and a Response\n\tCtx *Context\n\t// Depth is the number of the parents of the request\n\tDepth int\n\t// Method is the HTTP method of the request\n\tMethod string\n\t// Body is the request body which is used on POST/PUT requests\n\tBody io.Reader\n\t// ResponseCharacterencoding is the character encoding of the response body.\n\t// Leave it blank to allow automatic character encoding of the response body.\n\t// It is empty by default and it can be set in OnRequest callback.\n\tResponseCharacterEncoding string\n\t// ID is the Unique identifier of the request\n\tID        uint32\n\tcollector *Collector\n\tabort     bool\n\tbaseURL   *url.URL\n\t// ProxyURL is the proxy address that handles the request\n\tProxyURL string\n}\n\ntype serializableRequest struct {\n\tURL     string\n\tMethod  string\n\tDepth   int\n\tBody    []byte\n\tID      uint32\n\tCtx     map[string]interface{}\n\tHeaders http.Header\n\tHost    string\n}\n\n// New 
creates a new request with the context of the original request\nfunc (r *Request) New(method, URL string, body io.Reader) (*Request, error) {\n\tu, err := urlParser.Parse(URL)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tu2, err := url.Parse(u.Href(false))\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\treturn &Request{\n\t\tMethod:    method,\n\t\tURL:       u2,\n\t\tBody:      body,\n\t\tCtx:       r.Ctx,\n\t\tHeaders:   &http.Header{},\n\t\tHost:      r.Host,\n\t\tID:        r.collector.requestCount.Add(1),\n\t\tcollector: r.collector,\n\t}, nil\n}\n\n// Abort cancels the HTTP request when called in an OnRequest callback\nfunc (r *Request) Abort() {\n\tr.abort = true\n}\n\n// IsAbort returns true if the request has been aborted\nfunc (r *Request) IsAbort() bool {\n\treturn r.abort\n}\n\n// AbsoluteURL returns with the resolved absolute URL of an URL chunk.\n// AbsoluteURL returns empty string if the URL chunk is a fragment or\n// could not be parsed\nfunc (r *Request) AbsoluteURL(u string) string {\n\tif strings.HasPrefix(u, \"#\") {\n\t\treturn \"\"\n\t}\n\tvar base *url.URL\n\tif r.baseURL != nil {\n\t\tbase = r.baseURL\n\t} else {\n\t\tbase = r.URL\n\t}\n\n\tabsURL, err := urlParser.ParseRef(base.String(), u)\n\tif err != nil {\n\t\treturn \"\"\n\t}\n\treturn absURL.Href(false)\n}\n\n// Visit continues Collector's collecting job by creating a\n// request and preserves the Context of the previous request.\n// Visit also calls the previously provided callbacks\nfunc (r *Request) Visit(URL string) error {\n\treturn r.collector.scrape(r.AbsoluteURL(URL), \"GET\", r.Depth+1, nil, r.Ctx, nil, true)\n}\n\n// HasVisited checks if the provided URL has been visited\nfunc (r *Request) HasVisited(URL string) (bool, error) {\n\treturn r.collector.HasVisited(URL)\n}\n\n// Post continues a collector job by creating a POST request and preserves the Context\n// of the previous request.\n// Post also calls the previously provided callbacks\nfunc (r *Request) Post(URL 
string, requestData map[string]string) error {\n\treturn r.collector.scrape(r.AbsoluteURL(URL), \"POST\", r.Depth+1, createFormReader(requestData), r.Ctx, nil, true)\n}\n\n// PostRaw starts a collector job by creating a POST request with raw binary data.\n// PostRaw preserves the Context of the previous request\n// and calls the previously provided callbacks\nfunc (r *Request) PostRaw(URL string, requestData []byte) error {\n\treturn r.collector.scrape(r.AbsoluteURL(URL), \"POST\", r.Depth+1, bytes.NewReader(requestData), r.Ctx, nil, true)\n}\n\n// PostMultipart starts a collector job by creating a Multipart POST request\n// with raw binary data.  PostMultipart also calls the previously provided.\n// callbacks\nfunc (r *Request) PostMultipart(URL string, requestData map[string][]byte) error {\n\tboundary := randomBoundary()\n\thdr := http.Header{}\n\thdr.Set(\"Content-Type\", \"multipart/form-data; boundary=\"+boundary)\n\thdr.Set(\"User-Agent\", r.collector.UserAgent)\n\treturn r.collector.scrape(r.AbsoluteURL(URL), \"POST\", r.Depth+1, createMultipartReader(boundary, requestData), r.Ctx, hdr, true)\n}\n\n// Retry submits HTTP request again with the same parameters\nfunc (r *Request) Retry() error {\n\tr.Headers.Del(\"Cookie\")\n\tif _, ok := r.Body.(io.ReadSeeker); r.Body != nil && !ok {\n\t\treturn ErrRetryBodyUnseekable\n\t}\n\treturn r.collector.scrape(r.URL.String(), r.Method, r.Depth, r.Body, r.Ctx, *r.Headers, false)\n}\n\n// Do submits the request\nfunc (r *Request) Do() error {\n\treturn r.collector.scrape(r.URL.String(), r.Method, r.Depth, r.Body, r.Ctx, *r.Headers, !r.collector.AllowURLRevisit)\n}\n\n// Marshal serializes the Request\nfunc (r *Request) Marshal() ([]byte, error) {\n\tctx := make(map[string]interface{})\n\tif r.Ctx != nil {\n\t\tr.Ctx.ForEach(func(k string, v interface{}) interface{} {\n\t\t\tctx[k] = v\n\t\t\treturn nil\n\t\t})\n\t}\n\tvar err error\n\tvar body []byte\n\tif r.Body != nil {\n\t\tbody, err = io.ReadAll(r.Body)\n\t\tif err 
!= nil {\n\t\t\treturn nil, err\n\t\t}\n\t}\n\tsr := &serializableRequest{\n\t\tURL:    r.URL.String(),\n\t\tHost:   r.Host,\n\t\tMethod: r.Method,\n\t\tDepth:  r.Depth,\n\t\tBody:   body,\n\t\tID:     r.ID,\n\t\tCtx:    ctx,\n\t}\n\tif r.Headers != nil {\n\t\tsr.Headers = *r.Headers\n\t}\n\treturn json.Marshal(sr)\n}\n"
  },
  {
    "path": "response.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage colly\n\nimport (\n\t\"bytes\"\n\t\"fmt\"\n\t\"io\"\n\t\"mime\"\n\t\"net/http\"\n\t\"os\"\n\t\"strings\"\n\n\t\"github.com/saintfish/chardet\"\n\t\"golang.org/x/net/html/charset\"\n)\n\n// Response is the representation of a HTTP response made by a Collector\ntype Response struct {\n\t// StatusCode is the status code of the Response\n\tStatusCode int\n\t// Body is the content of the Response\n\tBody []byte\n\t// Ctx is a context between a Request and a Response\n\tCtx *Context\n\t// Request is the Request object of the response\n\tRequest *Request\n\t// Headers contains the Response's HTTP headers\n\tHeaders *http.Header\n\t// Trace contains the HTTPTrace for the request. 
Will only be set by the\n\t// collector if Collector.TraceHTTP is set to true.\n\tTrace *HTTPTrace\n}\n\n// Save writes response body to disk\nfunc (r *Response) Save(fileName string) error {\n\treturn os.WriteFile(fileName, r.Body, 0644)\n}\n\n// FileName returns the sanitized file name parsed from \"Content-Disposition\"\n// header or from URL\nfunc (r *Response) FileName() string {\n\t_, params, err := mime.ParseMediaType(r.Headers.Get(\"Content-Disposition\"))\n\tif fName, ok := params[\"filename\"]; ok && err == nil {\n\t\treturn SanitizeFileName(fName)\n\t}\n\tif r.Request.URL.RawQuery != \"\" {\n\t\treturn SanitizeFileName(fmt.Sprintf(\"%s_%s\", r.Request.URL.Path, r.Request.URL.RawQuery))\n\t}\n\treturn SanitizeFileName(strings.TrimPrefix(r.Request.URL.Path, \"/\"))\n}\n\nfunc (r *Response) fixCharset(detectCharset bool, defaultEncoding string) error {\n\tif len(r.Body) == 0 {\n\t\treturn nil\n\t}\n\tif defaultEncoding != \"\" {\n\t\ttmpBody, err := encodeBytes(r.Body, \"text/plain; charset=\"+defaultEncoding)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tr.Body = tmpBody\n\t\treturn nil\n\t}\n\tcontentType := strings.ToLower(r.Headers.Get(\"Content-Type\"))\n\n\tif strings.Contains(contentType, \"image/\") ||\n\t\tstrings.Contains(contentType, \"video/\") ||\n\t\tstrings.Contains(contentType, \"audio/\") ||\n\t\tstrings.Contains(contentType, \"font/\") {\n\t\t// These MIME types should not have textual data.\n\n\t\treturn nil\n\t}\n\n\tif !strings.Contains(contentType, \"charset\") {\n\t\tif !detectCharset {\n\t\t\treturn nil\n\t\t}\n\t\td := chardet.NewTextDetector()\n\t\tr, err := d.DetectBest(r.Body)\n\t\tif err != nil {\n\t\t\treturn err\n\t\t}\n\t\tcontentType = \"text/plain; charset=\" + r.Charset\n\t}\n\tif strings.Contains(contentType, \"utf-8\") || strings.Contains(contentType, \"utf8\") {\n\t\treturn nil\n\t}\n\ttmpBody, err := encodeBytes(r.Body, contentType)\n\tif err != nil {\n\t\treturn err\n\t}\n\tr.Body = tmpBody\n\treturn 
nil\n}\n\nfunc encodeBytes(b []byte, contentType string) ([]byte, error) {\n\tr, err := charset.NewReader(bytes.NewReader(b), contentType)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\treturn io.ReadAll(r)\n}\n"
  },
  {
    "path": "storage/storage.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage storage\n\nimport (\n\t\"net/http\"\n\t\"net/http/cookiejar\"\n\t\"net/url\"\n\t\"strings\"\n\t\"sync\"\n)\n\n// Storage is an interface which handles Collector's internal data,\n// like visited urls and cookies.\n// The default Storage of the Collector is the InMemoryStorage.\n// Collector's storage can be changed by calling Collector.SetStorage()\n// function.\ntype Storage interface {\n\t// Init initializes the storage\n\tInit() error\n\t// Visited receives and stores a request ID that is visited by the Collector\n\tVisited(requestID uint64) error\n\t// IsVisited returns true if the request was visited before IsVisited\n\t// is called\n\tIsVisited(requestID uint64) (bool, error)\n\t// Cookies retrieves stored cookies for a given host\n\tCookies(u *url.URL) string\n\t// SetCookies stores cookies for a given host\n\tSetCookies(u *url.URL, cookies string)\n}\n\n// InMemoryStorage is the default storage backend of colly.\n// InMemoryStorage keeps cookies and visited urls in memory\n// without persisting data on the disk.\ntype InMemoryStorage struct {\n\tvisitedURLs map[uint64]bool\n\tlock        *sync.RWMutex\n\tjar         *cookiejar.Jar\n}\n\n// Init initializes InMemoryStorage\nfunc (s *InMemoryStorage) Init() error {\n\tif s.visitedURLs == nil {\n\t\ts.visitedURLs = make(map[uint64]bool)\n\t}\n\tif s.lock == nil {\n\t\ts.lock = 
&sync.RWMutex{}\n\t}\n\tif s.jar == nil {\n\t\tvar err error\n\t\ts.jar, err = cookiejar.New(nil)\n\t\treturn err\n\t}\n\treturn nil\n}\n\n// Visited implements Storage.Visited()\nfunc (s *InMemoryStorage) Visited(requestID uint64) error {\n\ts.lock.Lock()\n\ts.visitedURLs[requestID] = true\n\ts.lock.Unlock()\n\treturn nil\n}\n\n// IsVisited implements Storage.IsVisited()\nfunc (s *InMemoryStorage) IsVisited(requestID uint64) (bool, error) {\n\ts.lock.RLock()\n\tvisited := s.visitedURLs[requestID]\n\ts.lock.RUnlock()\n\treturn visited, nil\n}\n\n// Cookies implements Storage.Cookies()\nfunc (s *InMemoryStorage) Cookies(u *url.URL) string {\n\treturn StringifyCookies(s.jar.Cookies(u))\n}\n\n// SetCookies implements Storage.SetCookies()\nfunc (s *InMemoryStorage) SetCookies(u *url.URL, cookies string) {\n\ts.jar.SetCookies(u, UnstringifyCookies(cookies))\n}\n\n// Close is a no-op: InMemoryStorage holds no resources to release.\n// Close is not part of the Storage interface declared in this file.\nfunc (s *InMemoryStorage) Close() error {\n\treturn nil\n}\n\n// StringifyCookies serializes a list of http.Cookies to a string,\n// one cookie per line\nfunc StringifyCookies(cookies []*http.Cookie) string {\n\t// Stringify cookies.\n\tcs := make([]string, len(cookies))\n\tfor i, c := range cookies {\n\t\tcs[i] = c.String()\n\t}\n\treturn strings.Join(cs, \"\\n\")\n}\n\n// UnstringifyCookies deserializes a cookie string to http.Cookies\nfunc UnstringifyCookies(s string) []*http.Cookie {\n\th := http.Header{}\n\tfor _, c := range strings.Split(s, \"\\n\") {\n\t\th.Add(\"Set-Cookie\", c)\n\t}\n\tr := http.Response{Header: h}\n\treturn r.Cookies()\n}\n\n// ContainsCookie reports whether cookies contains a cookie with the given name\nfunc ContainsCookie(cookies []*http.Cookie, name string) bool {\n\tfor _, c := range cookies {\n\t\tif c.Name == name {\n\t\t\treturn true\n\t\t}\n\t}\n\treturn false\n}\n"
  },
  {
    "path": "unmarshal.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage colly\n\nimport (\n\t\"errors\"\n\t\"reflect\"\n\t\"strings\"\n\n\t\"github.com/PuerkitoBio/goquery\"\n)\n\n// Unmarshal is a shorthand for colly.UnmarshalHTML\nfunc (h *HTMLElement) Unmarshal(v interface{}) error {\n\treturn UnmarshalHTML(v, h.DOM, nil)\n}\n\n// UnmarshalWithMap is a shorthand for colly.UnmarshalHTML, extended to allow maps to be passed in.\nfunc (h *HTMLElement) UnmarshalWithMap(v interface{}, structMap map[string]string) error {\n\treturn UnmarshalHTML(v, h.DOM, structMap)\n}\n\n// UnmarshalHTML declaratively extracts text or attributes to a struct from\n// HTML response using struct tags composed of css selectors.\n// Allowed struct tags:\n//   - \"selector\" (required): CSS (goquery) selector of the desired data\n//   - \"attr\" (optional): Selects the matching element's attribute's value.\n//     Leave it blank or omit to get the text of the element.\n//\n// Example struct declaration:\n//\n//\ttype Nested struct {\n//\t\tString  string   `selector:\"div > p\"`\n//\t   Classes []string `selector:\"li\" attr:\"class\"`\n//\t\tStruct  *Nested  `selector:\"div > div\"`\n//\t}\n//\n// Supported types: struct, *struct, string, []string\nfunc UnmarshalHTML(v interface{}, s *goquery.Selection, structMap map[string]string) error {\n\trv := reflect.ValueOf(v)\n\n\tif rv.Kind() != reflect.Ptr || rv.IsNil() {\n\t\treturn 
errors.New(\"Invalid type or nil-pointer\")\n\t}\n\n\tsv := rv.Elem()\n\tst := reflect.TypeOf(v).Elem()\n\tif structMap != nil {\n\t\tfor k, v := range structMap {\n\t\t\tattrV := sv.FieldByName(k)\n\t\t\tif !attrV.CanAddr() || !attrV.CanSet() {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tif err := unmarshalSelector(s, attrV, v); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\t} else {\n\t\tfor i := 0; i < sv.NumField(); i++ {\n\t\t\tattrV := sv.Field(i)\n\t\t\tif !attrV.CanAddr() || !attrV.CanSet() {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\tif err := unmarshalAttr(s, attrV, st.Field(i)); err != nil {\n\t\t\t\treturn err\n\t\t\t}\n\t\t}\n\t}\n\n\treturn nil\n}\n\nfunc unmarshalSelector(s *goquery.Selection, attrV reflect.Value, selector string) error {\n\t// A selector of \"-\" means the field should be ignored.\n\tif selector == \"-\" {\n\t\treturn nil\n\t}\n\thtmlAttr := \"\"\n\t// TODO support more types\n\tswitch attrV.Kind() {\n\tcase reflect.Slice:\n\t\tif err := unmarshalSlice(s, selector, htmlAttr, attrV); err != nil {\n\t\t\treturn err\n\t\t}\n\tcase reflect.String:\n\t\tvar val string\n\t\tif selector == \"\" && htmlAttr != \"\" {\n\t\t\tval = getDOMValue(s, htmlAttr)\n\t\t} else {\n\t\t\tval = getDOMValue(s.Find(selector), htmlAttr)\n\t\t}\n\t\tattrV.Set(reflect.Indirect(reflect.ValueOf(val)))\n\tcase reflect.Struct:\n\t\tif err := unmarshalStruct(s, selector, attrV); err != nil {\n\t\t\treturn err\n\t\t}\n\tcase reflect.Ptr:\n\t\tif err := unmarshalPtr(s, selector, attrV); err != nil {\n\t\t\treturn err\n\t\t}\n\tdefault:\n\t\treturn errors.New(\"Invalid type: \" + attrV.String())\n\t}\n\treturn nil\n}\n\nfunc unmarshalAttr(s *goquery.Selection, attrV reflect.Value, attrT reflect.StructField) error {\n\tselector := attrT.Tag.Get(\"selector\")\n\t// A selector of \"-\" means the field should be ignored.\n\tif selector == \"-\" {\n\t\treturn nil\n\t}\n\thtmlAttr := attrT.Tag.Get(\"attr\")\n\t// TODO support more types\n\tswitch attrV.Kind() {\n\tcase reflect.Slice:\n\t\tif 
err := unmarshalSlice(s, selector, htmlAttr, attrV); err != nil {\n\t\t\treturn err\n\t\t}\n\tcase reflect.String:\n\t\tval := getDOMValue(s.Find(selector), htmlAttr)\n\t\tattrV.Set(reflect.Indirect(reflect.ValueOf(val)))\n\tcase reflect.Struct:\n\t\tif err := unmarshalStruct(s, selector, attrV); err != nil {\n\t\t\treturn err\n\t\t}\n\tcase reflect.Ptr:\n\t\tif err := unmarshalPtr(s, selector, attrV); err != nil {\n\t\t\treturn err\n\t\t}\n\tdefault:\n\t\treturn errors.New(\"Invalid type: \" + attrV.String())\n\t}\n\treturn nil\n}\n\nfunc unmarshalStruct(s *goquery.Selection, selector string, attrV reflect.Value) error {\n\tnewS := s\n\tif selector != \"\" {\n\t\tnewS = newS.Find(selector)\n\t}\n\tif newS.Nodes == nil {\n\t\treturn nil\n\t}\n\tv := reflect.New(attrV.Type())\n\terr := UnmarshalHTML(v.Interface(), newS, nil)\n\tif err != nil {\n\t\treturn err\n\t}\n\tattrV.Set(reflect.Indirect(v))\n\treturn nil\n}\n\nfunc unmarshalPtr(s *goquery.Selection, selector string, attrV reflect.Value) error {\n\tnewS := s\n\tif selector != \"\" {\n\t\tnewS = newS.Find(selector)\n\t}\n\tif newS.Nodes == nil {\n\t\treturn nil\n\t}\n\te := attrV.Type().Elem()\n\t// Only pointers to structs are supported.\n\tif e.Kind() != reflect.Struct {\n\t\treturn errors.New(\"Invalid pointer type\")\n\t}\n\tv := reflect.New(e)\n\terr := UnmarshalHTML(v.Interface(), newS, nil)\n\tif err != nil {\n\t\treturn err\n\t}\n\tattrV.Set(v)\n\treturn nil\n}\n\nfunc unmarshalSlice(s *goquery.Selection, selector, htmlAttr string, attrV reflect.Value) error {\n\tif attrV.Pointer() == 0 {\n\t\tv := reflect.MakeSlice(attrV.Type(), 0, 0)\n\t\tattrV.Set(v)\n\t}\n\tswitch attrV.Type().Elem().Kind() {\n\tcase reflect.String:\n\t\ts.Find(selector).Each(func(_ int, s *goquery.Selection) {\n\t\t\tval := getDOMValue(s, htmlAttr)\n\t\t\tattrV.Set(reflect.Append(attrV, reflect.Indirect(reflect.ValueOf(val))))\n\t\t})\n\tcase reflect.Ptr:\n\t\ts.Find(selector).Each(func(_ int, innerSel *goquery.Selection) {\n\t\t\tsomeVal := 
reflect.New(attrV.Type().Elem().Elem())\n\t\t\tUnmarshalHTML(someVal.Interface(), innerSel, nil)\n\t\t\tattrV.Set(reflect.Append(attrV, someVal))\n\t\t})\n\tcase reflect.Struct:\n\t\ts.Find(selector).Each(func(_ int, innerSel *goquery.Selection) {\n\t\t\tsomeVal := reflect.New(attrV.Type().Elem())\n\t\t\tUnmarshalHTML(someVal.Interface(), innerSel, nil)\n\t\t\tattrV.Set(reflect.Append(attrV, reflect.Indirect(someVal)))\n\t\t})\n\tdefault:\n\t\treturn errors.New(\"Invalid slice type\")\n\t}\n\treturn nil\n}\n\nfunc getDOMValue(s *goquery.Selection, attr string) string {\n\tif attr == \"\" {\n\t\treturn strings.TrimSpace(s.First().Text())\n\t}\n\tattrV, _ := s.Attr(attr)\n\treturn attrV\n}\n"
  },
  {
    "path": "unmarshal_test.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage colly\n\nimport (\n\t\"bytes\"\n\t\"testing\"\n\n\t\"github.com/PuerkitoBio/goquery\"\n)\n\nvar basicTestData = []byte(`<ul><li class=\"x\">list <span>item</span> 1</li><li>list item 2</li><li>3</li></ul>`)\nvar nestedTestData = []byte(`<div><p>a</p><div><p>b</p><div><p>c</p></div></div></div>`)\nvar pointerSliceTestData = []byte(`<ul class=\"object\"><li class=\"info\">Information: <span>Info 1</span></li><li class=\"info\">Information: <span>Info 2</span></li></ul>`)\n\nfunc TestBasicUnmarshal(t *testing.T) {\n\tdoc, _ := goquery.NewDocumentFromReader(bytes.NewBuffer(basicTestData))\n\te := &HTMLElement{\n\t\tDOM: doc.First(),\n\t}\n\ts := struct {\n\t\tString string   `selector:\"li:first-child\" attr:\"class\"`\n\t\tItems  []string `selector:\"li\"`\n\t\tStruct struct {\n\t\t\tString string `selector:\"li:last-child\"`\n\t\t}\n\t}{}\n\tif err := e.Unmarshal(&s); err != nil {\n\t\tt.Error(\"Cannot unmarshal struct: \" + err.Error())\n\t}\n\tif s.String != \"x\" {\n\t\tt.Errorf(`Invalid data for String: %q, expected \"x\"`, s.String)\n\t}\n\tif s.Struct.String != \"3\" {\n\t\tt.Errorf(`Invalid data for Struct.String: %q, expected \"3\"`, s.Struct.String)\n\t}\n}\n\nfunc TestNestedUnmarshalMap(t *testing.T) {\n\tdoc, _ := goquery.NewDocumentFromReader(bytes.NewBuffer(nestedTestData))\n\te := &HTMLElement{\n\t\tDOM: doc.First(),\n\t}\n\tdoc2, _ 
:= goquery.NewDocumentFromReader(bytes.NewBuffer(basicTestData))\n\te2 := &HTMLElement{\n\t\tDOM: doc2.First(),\n\t}\n\ttype nested struct {\n\t\tString string\n\t}\n\tmapSelector := make(map[string]string)\n\tmapSelector[\"String\"] = \"div > p\"\n\n\tmapSelector2 := make(map[string]string)\n\tmapSelector2[\"String\"] = \"span\"\n\n\ts := nested{}\n\ts2 := nested{}\n\tif err := e.UnmarshalWithMap(&s, mapSelector); err != nil {\n\t\tt.Error(\"Cannot unmarshal struct: \" + err.Error())\n\t}\n\tif err := e2.UnmarshalWithMap(&s2, mapSelector2); err != nil {\n\t\tt.Error(\"Cannot unmarshal struct: \" + err.Error())\n\t}\n\tif s.String != \"a\" {\n\t\tt.Errorf(`Invalid data for String: %q, expected \"a\"`, s.String)\n\t}\n\tif s2.String != \"item\" {\n\t\tt.Errorf(`Invalid data for String: %q, expected \"item\"`, s2.String)\n\t}\n}\n\nfunc TestNestedUnmarshal(t *testing.T) {\n\tdoc, _ := goquery.NewDocumentFromReader(bytes.NewBuffer(nestedTestData))\n\te := &HTMLElement{\n\t\tDOM: doc.First(),\n\t}\n\ttype nested struct {\n\t\tString string  `selector:\"div > p\"`\n\t\tStruct *nested `selector:\"div > div\"`\n\t}\n\ts := nested{}\n\tif err := e.Unmarshal(&s); err != nil {\n\t\tt.Error(\"Cannot unmarshal struct: \" + err.Error())\n\t}\n\tif s.String != \"a\" {\n\t\tt.Errorf(`Invalid data for String: %q, expected \"a\"`, s.String)\n\t}\n\tif s.Struct.String != \"b\" {\n\t\tt.Errorf(`Invalid data for Struct.String: %q, expected \"b\"`, s.Struct.String)\n\t}\n\tif s.Struct.Struct.String != \"c\" {\n\t\tt.Errorf(`Invalid data for Struct.Struct.String: %q, expected \"c\"`, s.Struct.Struct.String)\n\t}\n}\n\nfunc TestPointerSliceUnmarshall(t *testing.T) {\n\ttype info struct {\n\t\tText string `selector:\"span\"`\n\t}\n\ttype object struct {\n\t\tInfo []*info `selector:\"li.info\"`\n\t}\n\n\tdoc, _ := goquery.NewDocumentFromReader(bytes.NewBuffer(pointerSliceTestData))\n\te := HTMLElement{DOM: doc.First()}\n\to := object{}\n\terr := e.Unmarshal(&o)\n\tif err != nil 
{\n\t\tt.Fatalf(\"Failed to unmarshal page: %s\\n\", err.Error())\n\t}\n\n\tif len(o.Info) != 2 {\n\t\tt.Errorf(\"Invalid length for Info: %d, expected 2\", len(o.Info))\n\t}\n\tif o.Info[0].Text != \"Info 1\" {\n\t\tt.Errorf(\"Invalid data for Info.[0].Text: %s, expected Info 1\", o.Info[0].Text)\n\t}\n\tif o.Info[1].Text != \"Info 2\" {\n\t\tt.Errorf(\"Invalid data for Info.[1].Text: %s, expected Info 2\", o.Info[1].Text)\n\t}\n\n}\n\nfunc TestStructSliceUnmarshall(t *testing.T) {\n\ttype info struct {\n\t\tText string `selector:\"span\"`\n\t}\n\ttype object struct {\n\t\tInfo []info `selector:\"li.info\"`\n\t}\n\n\tdoc, _ := goquery.NewDocumentFromReader(bytes.NewBuffer(pointerSliceTestData))\n\te := HTMLElement{DOM: doc.First()}\n\to := object{}\n\terr := e.Unmarshal(&o)\n\tif err != nil {\n\t\tt.Fatalf(\"Failed to unmarshal page: %s\\n\", err.Error())\n\t}\n\n\tif len(o.Info) != 2 {\n\t\tt.Errorf(\"Invalid length for Info: %d, expected 2\", len(o.Info))\n\t}\n\tif o.Info[0].Text != \"Info 1\" {\n\t\tt.Errorf(\"Invalid data for Info.[0].Text: %s, expected Info 1\", o.Info[0].Text)\n\t}\n\tif o.Info[1].Text != \"Info 2\" {\n\t\tt.Errorf(\"Invalid data for Info.[1].Text: %s, expected Info 2\", o.Info[1].Text)\n\t}\n\n}\n"
  },
  {
    "path": "xmlelement.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage colly\n\nimport (\n\t\"strings\"\n\n\t\"github.com/antchfx/htmlquery\"\n\t\"github.com/antchfx/xmlquery\"\n\t\"golang.org/x/net/html\"\n)\n\n// XMLElement is the representation of a XML tag.\ntype XMLElement struct {\n\t// Name is the name of the tag\n\tName       string\n\tText       string\n\tattributes interface{}\n\t// Request is the request object of the element's HTML document\n\tRequest *Request\n\t// Response is the Response object of the element's HTML document\n\tResponse *Response\n\t// DOM is the DOM object of the page. 
DOM is relative\n\t// to the current XMLElement and is either an html.Node or an xmlquery.Node\n\t// based on how the XMLElement was created.\n\tDOM    interface{}\n\tisHTML bool\n\t// Index stores the position of the current element within all the elements matched by an OnXML callback\n\tIndex int\n}\n\n// NewXMLElementFromHTMLNode creates an XMLElement from an html.Node.\nfunc NewXMLElementFromHTMLNode(resp *Response, s *html.Node) *XMLElement {\n\treturn &XMLElement{\n\t\tName:       s.Data,\n\t\tRequest:    resp.Request,\n\t\tResponse:   resp,\n\t\tText:       htmlquery.InnerText(s),\n\t\tDOM:        s,\n\t\tattributes: s.Attr,\n\t\tisHTML:     true,\n\t}\n}\n\n// NewXMLElementFromXMLNode creates an XMLElement from an xmlquery.Node.\nfunc NewXMLElementFromXMLNode(resp *Response, s *xmlquery.Node) *XMLElement {\n\treturn &XMLElement{\n\t\tName:       s.Data,\n\t\tRequest:    resp.Request,\n\t\tResponse:   resp,\n\t\tText:       s.InnerText(),\n\t\tDOM:        s,\n\t\tattributes: s.Attr,\n\t\tisHTML:     false,\n\t}\n}\n\n// Attr returns the selected attribute of an XMLElement or an empty string\n// if no attribute is found.\nfunc (h *XMLElement) Attr(k string) string {\n\tif h.isHTML {\n\t\tfor _, a := range h.attributes.([]html.Attribute) {\n\t\t\tif a.Key == k {\n\t\t\t\treturn a.Val\n\t\t\t}\n\t\t}\n\t} else {\n\t\tfor _, a := range h.attributes.([]xmlquery.Attr) {\n\t\t\tif a.Name.Local == k {\n\t\t\t\treturn a.Value\n\t\t\t}\n\t\t}\n\t}\n\treturn \"\"\n}\n\n// ChildText returns the stripped text content of the first matching\n// element, or an empty string if nothing matches.\nfunc (h *XMLElement) ChildText(xpathQuery string) string {\n\tif h.isHTML {\n\t\tchild := htmlquery.FindOne(h.DOM.(*html.Node), xpathQuery)\n\t\tif child == nil {\n\t\t\treturn \"\"\n\t\t}\n\t\treturn strings.TrimSpace(htmlquery.InnerText(child))\n\t}\n\tchild := xmlquery.FindOne(h.DOM.(*xmlquery.Node), xpathQuery)\n\tif child == nil {\n\t\treturn \"\"\n\t}\n\treturn strings.TrimSpace(child.InnerText())\n}\n\n// ChildAttr 
returns the stripped text content of the first matching\n// element's attribute.\nfunc (h *XMLElement) ChildAttr(xpathQuery, attrName string) string {\n\tif h.isHTML {\n\t\tchild := htmlquery.FindOne(h.DOM.(*html.Node), xpathQuery)\n\t\tif child != nil {\n\t\t\tfor _, attr := range child.Attr {\n\t\t\t\tif attr.Key == attrName {\n\t\t\t\t\treturn strings.TrimSpace(attr.Val)\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t} else {\n\t\tchild := xmlquery.FindOne(h.DOM.(*xmlquery.Node), xpathQuery)\n\t\tif child != nil {\n\t\t\tfor _, attr := range child.Attr {\n\t\t\t\tif attr.Name.Local == attrName {\n\t\t\t\t\treturn strings.TrimSpace(attr.Value)\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\n\treturn \"\"\n}\n\n// ChildAttrs returns the stripped text content of all the matching\n// element's attributes.\nfunc (h *XMLElement) ChildAttrs(xpathQuery, attrName string) []string {\n\tvar res []string\n\tif h.isHTML {\n\t\tfor _, child := range htmlquery.Find(h.DOM.(*html.Node), xpathQuery) {\n\t\t\tfor _, attr := range child.Attr {\n\t\t\t\tif attr.Key == attrName {\n\t\t\t\t\tres = append(res, strings.TrimSpace(attr.Val))\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t} else {\n\t\txmlquery.FindEach(h.DOM.(*xmlquery.Node), xpathQuery, func(i int, child *xmlquery.Node) {\n\t\t\tfor _, attr := range child.Attr {\n\t\t\t\tif attr.Name.Local == attrName {\n\t\t\t\t\tres = append(res, strings.TrimSpace(attr.Value))\n\t\t\t\t}\n\t\t\t}\n\t\t})\n\t}\n\treturn res\n}\n\n// ChildTexts returns an array of strings corresponding to child elements that match the xpath query.\n// Each item in the array is the stripped text content of the corresponding matching child element.\nfunc (h *XMLElement) ChildTexts(xpathQuery string) []string {\n\ttexts := make([]string, 0)\n\tif h.isHTML {\n\t\tfor _, child := range htmlquery.Find(h.DOM.(*html.Node), xpathQuery) {\n\t\t\ttexts = append(texts, strings.TrimSpace(htmlquery.InnerText(child)))\n\t\t}\n\t} else {\n\t\txmlquery.FindEach(h.DOM.(*xmlquery.Node), xpathQuery, func(i int, child 
*xmlquery.Node) {\n\t\t\ttexts = append(texts, strings.TrimSpace(child.InnerText()))\n\t\t})\n\t}\n\treturn texts\n}\n"
  },
  {
    "path": "xmlelement_test.go",
    "content": "// Copyright 2018 Adam Tauber\n//\n// Licensed under the Apache License, Version 2.0 (the \"License\");\n// you may not use this file except in compliance with the License.\n// You may obtain a copy of the License at\n//\n//      http://www.apache.org/licenses/LICENSE-2.0\n//\n// Unless required by applicable law or agreed to in writing, software\n// distributed under the License is distributed on an \"AS IS\" BASIS,\n// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n// See the License for the specific language governing permissions and\n// limitations under the License.\n\npackage colly_test\n\nimport (\n\t\"github.com/antchfx/htmlquery\"\n\t\"github.com/gocolly/colly/v2\"\n\t\"reflect\"\n\t\"strings\"\n\t\"testing\"\n)\n\n// Borrowed from http://infohost.nmt.edu/tcc/help/pubs/xhtml/example.html\n// Added attributes to the `<li>` tags for testing purposes\nconst htmlPage = `\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.1//EN\"\n \"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\">\n  <head>\n    <title>Your page title here</title>\n  </head>\n  <body>\n    <h1>Your major heading here</h1>\n    <p>\n      This is a regular text paragraph.\n    </p>\n    <ul>\n      <li class=\"list-item-1\">\n        First bullet of a bullet list.\n      </li>\n      <li class=\"list-item-2\">\n        This is the <em>second</em> bullet.\n      </li>\n    </ul>\n  </body>\n</html>\n`\n\nfunc TestAttr(t *testing.T) {\n\tresp := &colly.Response{StatusCode: 200, Body: []byte(htmlPage)}\n\tdoc, _ := htmlquery.Parse(strings.NewReader(htmlPage))\n\txmlNode := htmlquery.FindOne(doc, \"/html\")\n\txmlElem := colly.NewXMLElementFromHTMLNode(resp, xmlNode)\n\n\tif xmlElem.Attr(\"xmlns\") != \"http://www.w3.org/1999/xhtml\" {\n\t\tt.Fatalf(\"failed xmlns attribute test: %v != http://www.w3.org/1999/xhtml\", xmlElem.Attr(\"xmlns\"))\n\t}\n\n\tif xmlElem.Attr(\"xml:lang\") != \"en\" 
{\n\t\tt.Fatalf(\"failed lang attribute test: %v != en\", xmlElem.Attr(\"xml:lang\"))\n\t}\n}\n\nfunc TestChildText(t *testing.T) {\n\tresp := &colly.Response{StatusCode: 200, Body: []byte(htmlPage)}\n\tdoc, _ := htmlquery.Parse(strings.NewReader(htmlPage))\n\txmlNode := htmlquery.FindOne(doc, \"/html\")\n\txmlElem := colly.NewXMLElementFromHTMLNode(resp, xmlNode)\n\n\tif text := xmlElem.ChildText(\"//p\"); text != \"This is a regular text paragraph.\" {\n\t\tt.Fatalf(\"failed child tag test: %v != This is a regular text paragraph.\", text)\n\t}\n\tif text := xmlElem.ChildText(\"//dl\"); text != \"\" {\n\t\tt.Fatalf(\"failed child tag test: %v != \\\"\\\"\", text)\n\t}\n}\n\nfunc TestChildTexts(t *testing.T) {\n\tresp := &colly.Response{StatusCode: 200, Body: []byte(htmlPage)}\n\tdoc, _ := htmlquery.Parse(strings.NewReader(htmlPage))\n\txmlNode := htmlquery.FindOne(doc, \"/html\")\n\txmlElem := colly.NewXMLElementFromHTMLNode(resp, xmlNode)\n\texpected := []string{\"First bullet of a bullet list.\", \"This is the second bullet.\"}\n\tif texts := xmlElem.ChildTexts(\"//li\"); !reflect.DeepEqual(texts, expected) {\n\t\tt.Fatalf(\"failed child tags test: %v != %v\", texts, expected)\n\t}\n\tif texts := xmlElem.ChildTexts(\"//dl\"); !reflect.DeepEqual(texts, make([]string, 0)) {\n\t\tt.Fatalf(\"failed child tags test: %v != []\", texts)\n\t}\n}\n\nfunc TestChildAttr(t *testing.T) {\n\tresp := &colly.Response{StatusCode: 200, Body: []byte(htmlPage)}\n\tdoc, _ := htmlquery.Parse(strings.NewReader(htmlPage))\n\txmlNode := htmlquery.FindOne(doc, \"/html\")\n\txmlElem := colly.NewXMLElementFromHTMLNode(resp, xmlNode)\n\n\tif attr := xmlElem.ChildAttr(\"/body/ul/li[1]\", \"class\"); attr != \"list-item-1\" {\n\t\tt.Fatalf(\"failed child attribute test: %v != list-item-1\", attr)\n\t}\n\tif attr := xmlElem.ChildAttr(\"/body/ul/li[2]\", \"class\"); attr != \"list-item-2\" {\n\t\tt.Fatalf(\"failed child attribute test: %v != list-item-2\", attr)\n\t}\n}\n\nfunc 
TestChildAttrs(t *testing.T) {\n\tresp := &colly.Response{StatusCode: 200, Body: []byte(htmlPage)}\n\tdoc, _ := htmlquery.Parse(strings.NewReader(htmlPage))\n\txmlNode := htmlquery.FindOne(doc, \"/html\")\n\txmlElem := colly.NewXMLElementFromHTMLNode(resp, xmlNode)\n\n\tattrs := xmlElem.ChildAttrs(\"/body/ul/li\", \"class\")\n\tif len(attrs) != 2 {\n\t\tt.Fatalf(\"failed child attributes length test: %d != 2\", len(attrs))\n\t}\n\n\tfor _, attr := range attrs {\n\t\tif !(attr == \"list-item-1\" || attr == \"list-item-2\") {\n\t\t\tt.Fatalf(\"failed child attributes values test: %s != list-item-(1 or 2)\", attr)\n\t\t}\n\t}\n}\n"
  }
]