Full Code of commoncrawl/cc-crawl-statistics for AI

master 446d601f16d0 cached

177 files

2.6 MB

677.4k tokens

182 symbols

1 requests

Download .txt

Showing preview only (2,709K chars total). Download the full file or copy to clipboard to get everything.

Repository: commoncrawl/cc-crawl-statistics
Branch: master
Commit: 446d601f16d0
Files: 177
Total size: 2.6 MB

Directory structure:
gitextract_jga75z6d/

├── .github/
│   └── workflows/
│       └── ci.yml
├── .gitignore
├── LICENSE
├── README.md
├── _config.yml
├── _layouts/
│   ├── default.html
│   └── table.html
├── crawlplot.py
├── crawlstats.py
├── get_stats.sh
├── get_stats_and_plot.sh
├── index.md
├── plot/
│   ├── charset.py
│   ├── crawl_size.py
│   ├── crawler_metrics.py
│   ├── domain.py
│   ├── histogram.py
│   ├── language.py
│   ├── mimetype.py
│   ├── mimetype_detected.py
│   ├── overlap.py
│   ├── table.py
│   ├── tld.py
│   └── tld_by_continent.py
├── plot.sh
├── plots/
│   ├── README.md
│   ├── charsets-top-100.html
│   ├── charsets.csv
│   ├── charsets.md
│   ├── crawlermetrics.md
│   ├── crawloverlap.md
│   ├── crawlsize/
│   │   ├── cumulative.csv
│   │   ├── domain.csv
│   │   ├── monthly.csv
│   │   ├── monthly_new.csv
│   │   ├── url_last_n_crawls.csv
│   │   └── url_page_ratio_last_n_crawls.csv
│   ├── crawlsize.md
│   ├── domains-top-500.csv
│   ├── domains-top-500.html
│   ├── domains.md
│   ├── languages-top-200.html
│   ├── languages.csv
│   ├── languages.md
│   ├── mimetypes-top-100.html
│   ├── mimetypes.csv
│   ├── mimetypes.md
│   ├── mimetypes_detected-top-100.html
│   ├── mimetypes_detected.csv
│   ├── tld/
│   │   ├── by-year-and-continent.md
│   │   ├── comparison.md
│   │   ├── groups-percentage.html
│   │   ├── groups.md
│   │   ├── latest-crawl-groups.html
│   │   ├── latest-crawl-tlds.html
│   │   ├── latestcrawl.md
│   │   ├── percentage.md
│   │   ├── selected-crawl-comparison-spearman-all-tlds.html
│   │   ├── selected-crawl-comparison-spearman-frequent-tlds.html
│   │   ├── selected-crawl-comparison.html
│   │   ├── selected-crawls-percentage.html
│   │   ├── selected-tlds-by-year.csv
│   │   ├── selected-tlds-by-year.html
│   │   ├── tlds-by-year-and-continent.csv
│   │   └── tlds-by-year-and-continent.html
│   └── tlds.md
├── requirements.txt
├── requirements_plot.txt
├── run_stats_hadoop.sh
├── setup.py
├── site.Dockerfile
├── stats/
│   ├── crawler/
│   │   ├── CC-MAIN-2016-18.json
│   │   ├── CC-MAIN-2016-22.json
│   │   ├── CC-MAIN-2016-26.json
│   │   ├── CC-MAIN-2016-30.json
│   │   ├── CC-MAIN-2016-36.json
│   │   ├── CC-MAIN-2016-40.json
│   │   ├── CC-MAIN-2016-44.json
│   │   ├── CC-MAIN-2016-50.json
│   │   ├── CC-MAIN-2017-04.json
│   │   ├── CC-MAIN-2017-09.json
│   │   ├── CC-MAIN-2017-13.json
│   │   ├── CC-MAIN-2017-17.json
│   │   ├── CC-MAIN-2017-22.json
│   │   ├── CC-MAIN-2017-26.json
│   │   ├── CC-MAIN-2017-30.json
│   │   ├── CC-MAIN-2017-34.json
│   │   ├── CC-MAIN-2017-39.json
│   │   ├── CC-MAIN-2017-43.json
│   │   ├── CC-MAIN-2017-47.json
│   │   ├── CC-MAIN-2017-51.json
│   │   ├── CC-MAIN-2018-05.json
│   │   ├── CC-MAIN-2018-09.json
│   │   ├── CC-MAIN-2018-13.json
│   │   ├── CC-MAIN-2018-17.json
│   │   ├── CC-MAIN-2018-22.json
│   │   ├── CC-MAIN-2018-26.json
│   │   ├── CC-MAIN-2018-30.json
│   │   ├── CC-MAIN-2018-34.json
│   │   ├── CC-MAIN-2018-39.json
│   │   ├── CC-MAIN-2018-43.json
│   │   ├── CC-MAIN-2018-47.json
│   │   ├── CC-MAIN-2018-51.json
│   │   ├── CC-MAIN-2019-04.json
│   │   ├── CC-MAIN-2019-09.json
│   │   ├── CC-MAIN-2019-13.json
│   │   ├── CC-MAIN-2019-18.json
│   │   ├── CC-MAIN-2019-22.json
│   │   ├── CC-MAIN-2019-26.json
│   │   ├── CC-MAIN-2019-30.json
│   │   ├── CC-MAIN-2019-35.json
│   │   ├── CC-MAIN-2019-39.json
│   │   ├── CC-MAIN-2019-43.json
│   │   ├── CC-MAIN-2019-47.json
│   │   ├── CC-MAIN-2019-51.json
│   │   ├── CC-MAIN-2020-05.json
│   │   ├── CC-MAIN-2020-10.json
│   │   ├── CC-MAIN-2020-16.json
│   │   ├── CC-MAIN-2020-24.json
│   │   ├── CC-MAIN-2020-29.json
│   │   ├── CC-MAIN-2020-34.json
│   │   ├── CC-MAIN-2020-40.json
│   │   ├── CC-MAIN-2020-45.json
│   │   ├── CC-MAIN-2020-50.json
│   │   ├── CC-MAIN-2021-04.json
│   │   ├── CC-MAIN-2021-10.json
│   │   ├── CC-MAIN-2021-17.json
│   │   ├── CC-MAIN-2021-21.json
│   │   ├── CC-MAIN-2021-25.json
│   │   ├── CC-MAIN-2021-31.json
│   │   ├── CC-MAIN-2021-39.json
│   │   ├── CC-MAIN-2021-43.json
│   │   ├── CC-MAIN-2021-49.json
│   │   ├── CC-MAIN-2022-05.json
│   │   ├── CC-MAIN-2022-21.json
│   │   ├── CC-MAIN-2022-27.json
│   │   ├── CC-MAIN-2022-33.json
│   │   ├── CC-MAIN-2022-40.json
│   │   ├── CC-MAIN-2022-49.json
│   │   ├── CC-MAIN-2023-06.json
│   │   ├── CC-MAIN-2023-14.json
│   │   ├── CC-MAIN-2023-23.json
│   │   ├── CC-MAIN-2023-40.json
│   │   ├── CC-MAIN-2023-50.json
│   │   ├── CC-MAIN-2024-10.json
│   │   ├── CC-MAIN-2024-18.json
│   │   ├── CC-MAIN-2024-22.json
│   │   ├── CC-MAIN-2024-26.json
│   │   ├── CC-MAIN-2024-30.json
│   │   ├── CC-MAIN-2024-33.json
│   │   ├── CC-MAIN-2024-38.json
│   │   ├── CC-MAIN-2024-42.json
│   │   ├── CC-MAIN-2024-46.json
│   │   ├── CC-MAIN-2024-51.json
│   │   ├── CC-MAIN-2025-05.json
│   │   ├── CC-MAIN-2025-08.json
│   │   ├── CC-MAIN-2025-13.json
│   │   ├── CC-MAIN-2025-18.json
│   │   ├── CC-MAIN-2025-21.json
│   │   ├── CC-MAIN-2025-26.json
│   │   ├── CC-MAIN-2025-30.json
│   │   ├── CC-MAIN-2025-33.json
│   │   ├── CC-MAIN-2025-38.json
│   │   ├── CC-MAIN-2025-43.json
│   │   ├── CC-MAIN-2025-47.json
│   │   ├── CC-MAIN-2025-51.json
│   │   ├── CC-MAIN-2026-04.json
│   │   ├── CC-MAIN-2026-08.json
│   │   ├── CC-MAIN-2026-12.json
│   │   ├── CC-MAIN-2026-17.json
│   │   └── README.md
│   ├── tld_alexa_top_1m.py
│   ├── tld_cisco_umbrella_top_1m.py
│   └── tld_majestic_top_1m.py
├── stats.Dockerfile
├── tests/
│   └── test_crawlstat.py
└── top_level_domain.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/workflows/ci.yml
================================================
name: CI Pipeline

on:
  push:
    branches: [master]
  pull_request:

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test-and-build-stats:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
    - name: Checkout repository
      uses: actions/checkout@v4

    - name: Set up QEMU
      uses: docker/setup-qemu-action@v3

    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v3

    - name: Log in to Container Registry
      uses: docker/login-action@v3
      with:
        registry: ${{ env.REGISTRY }}
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}

    - name: Extract metadata for stats image
      id: meta-stats
      uses: docker/metadata-action@v5
      with:
        images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}/stats
        tags: |
          type=ref,event=branch
          type=ref,event=pr
          type=sha
          type=raw,value=latest,enable={{is_default_branch}}

    - name: Build stats Docker image
      uses: docker/build-push-action@v5
      with:
        context: .
        file: ./stats.Dockerfile
        load: true
        platforms: linux/amd64
        tags: |
          ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}/stats:${{ github.sha }}

    - name: Run unit tests
      run: |
        docker run --rm \
          -v ${{ github.workspace }}/tests:/app/tests \
          ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}/stats:${{ github.sha }} \
          python -m pytest -s tests/

    - name: Push stats Docker image
      if: |
        success() && (
          github.event_name == 'push' && github.ref == 'refs/heads/master' ||
          github.event_name == 'pull_request' && github.event.pull_request.head.repo.full_name == github.repository
        )
      uses: docker/build-push-action@v5
      with:
        context: .
        file: ./stats.Dockerfile
        push: true
        platforms: linux/amd64,linux/arm64
        tags: ${{ steps.meta-stats.outputs.tags }}
        labels: ${{ steps.meta-stats.outputs.labels }}

  build-site:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write

    steps:
    - name: Checkout repository
      uses: actions/checkout@v4

    - name: Log in to Container Registry
      uses: docker/login-action@v3
      with:
        registry: ${{ env.REGISTRY }}
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}

    - name: Extract metadata for site image
      id: meta-site
      uses: docker/metadata-action@v5
      with:
        images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}/site
        tags: |
          type=ref,event=branch
          type=ref,event=pr
          type=sha
          type=raw,value=latest,enable={{is_default_branch}}

    - name: Build and push site Docker image
      uses: docker/build-push-action@v5
      with:
        context: .
        file: ./site.Dockerfile
        push: ${{ github.event.pull_request.head.repo.full_name == github.repository }}
        tags: ${{ steps.meta-site.outputs.tags }}
        labels: ${{ steps.meta-site.outputs.labels }}


================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# IPython Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# dotenv
.env

# virtualenv
venv/
ENV/

# Spyder project settings
.spyderproject

# Rope project settings
.ropeproject

# Eclipse PyDev
.project
.pydevproject
.settings/

# Jekyll files to run github-pages locally
_site
.sass-cache
.jekyll-metadata
Gemfile
Gemfile.lock
assets
_includes
_sass
js
vendor/
.bundle/
themes/

# crawl statistics files
stats/*.gz
stats/crawls.txt
stats/excerpt/

# generated CSV data
data/

# macOS Desktop Services Store
.DS_Store


================================================
FILE: LICENSE
================================================
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "{}"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright {yyyy} {name of copyright owner}

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.


================================================
FILE: README.md
================================================
Basic Statistics of Common Crawl Monthly Archives
=================================================

Analyze the [Common Crawl](https://commoncrawl.org/) data to get metrics about the monthly crawl archives:
* size of the monthly crawls, number of
  * fetched pages
  * unique URLs
  * unique documents (by content digest)
  * number of different hosts, domains, top-level domains
* distribution of pages/URLs on hosts, domains, top-level domains
* and ...
  * mime types
  * protocols / schemes (http vs. https)
  * content languages (since summer 2018)

This is a description how to generate the statistics from the Common Crawl URL index files.

The results are presented on https://commoncrawl.github.io/cc-crawl-statistics/.


Step 1: Count Items
-------------------

The items (URLs, hosts, domains, etc.) are counted using the Common Crawl index files
on AWS S3 `s3://commoncrawl/cc-index/collections/*/indexes/cdx-*.gz`.

1. define a pattern of cdx files to process - usually from one monthly crawl (here: `CC-MAIN-2016-26`)
   - either smaller set of local files for testing
   ```
   INPUT="test/cdx/cdx-0000[0-3].gz"
   ```
   - or one monthly crawl to be accessed via Hadoop on AWS S3:
   ```
   INPUT="s3a://commoncrawl/cc-index/collections/CC-MAIN-2016-26/indexes/cdx-*.gz"
   ```

2. run `crawlstats.py --job=count` to process the cdx files and count the items:
   ```
   python3 crawlstats.py --job=count --no-exact-counts \
        --no-output --output-dir .../count/ $INPUT
   ```

Help on command-line parameters (including [mrjob](https://pypi.org/project/mrjob/) options) are shown by
`python3 crawlstats.py --help`.
The option `--no-exact-counts` is recommended (and is the default) to save storage space and computation time
when counting URLs and content digests.


Step 2: Aggregate Counts
------------------------

Run `crawlstats.py --job=stats` on the output of step 1:
```
python3 crawlstats.py --job=stats --max-top-hosts-domains=500 \
     --no-output --output-dir .../stats/ .../count/
```
The max. number of most frequent thosts and domains contained in the output is set by the option
`--max-top-hosts-domains=N`.


Step 3: Download the Data
-------------------------

In order to prepare the plots, the the output of step 2 must be downloaded to local disk.
Simplest, the data is fetched from the Common Crawl Public Data Set bucket on AWS S3:
```sh
while read crawl; do
    aws s3 cp s3://commoncrawl/crawl-analysis/$crawl/stats/part-00000.gz ./stats/$crawl.gz
done <<EOF
CC-MAIN-2008-2009
...
EOF
```

One aggregated, gzip-compressed statistics file, is about 1 MiB in size. So you could just run
[get_stats.sh](get_stats.sh) to download the data files for all released monthly crawls.

Also the output of step 1 is provided on `s3://commoncrawl/`. The counts for every crawl is hold
in 10 bzip2-compressed files, together 1 GiB per crawl in average. To download the counts for one crawl:
- if you're on AWS and [AWS CLI]() is installed and configured
  ```sh
  CRAWL=CC-MAIN-2022-05
  aws s3 cp --recursive s3://commoncrawl/crawl-analysis/$CRAWL/count stats/count/$CRAWL
  ```
- otherwise
  ```sh
  CRAWL=CC-MAIN-2022-05
  mkdir -p stats/count/$CRAWL
  for i in $(seq 0 9); do
    curl https://data.commoncrawl.org/crawl-analysis/$CRAWL/count/part-0000$i.bz2 \
      >stats/count/$CRAWL/part-0000$i.bz2
  done
  ```


Step 4: Plot the Data
---------------------

To prepare the plots using the downloaded aggregated data:
```
gzip -dc stats/CC-MAIN-*.gz | python3 plot/crawl_size.py
```
The full list of commands to prepare all plots is found in [plot.sh](plot.sh). Don't forget to install the Python
modules [required for plotting](requirements_plot.txt).


Step 5: Local Site Preview
--------------------------

The [crawl statistics site](https://commoncrawl.github.io/cc-crawl-statistics/) is hosted by [Github pages](https://pages.github.com/). The site is updated as soon as plots or description texts are updated, committed and pushed to the Github repository.

To preview local changes, it's possible to serve the site locally:
1. build the Docker image with Ruby, Jekyll and the content to be served
   ```
   docker build -f site.Dockerfile -t cc-crawl-statistics-site:latest .
   ```
2. run a Docker container to serve the site preview
   ```
   docker run --network=host --rm -ti cc-crawl-statistics-site:latest
   ```
   The site should be served on localhost, port 4000 (http://127.0.0.1:4000).
   If not, the correct location is shown in the output of the `docker run` command.

   If running this on a Mac, you may find that the loopback interface (127.0.0.1) within the container is not accessible, so you can change the line in the [Dockerfile](site.Dockerfile) to:

   ```
   CMD bundle exec jekyll serve --host 0.0.0.0
   ```

   ... and then the site will be served on http://0.0.0.0:4000 instead.  (You will of course need to rebuild the Docker image after updating the Dockerfile.)


Run via Container
-----------------

The whole workflow can be run as a container (docker or podman) including downloading stats files from Common Crawl's S3 bucket and generating new plots.

```bash
# clone the repository (to have the latest crawl IDs)
git clone https://github.com/commoncrawl/cc-crawl-statistics.git
cd cc-crawl-statistics

# download stats and generate plots
# SSH, AWS keys, and stats and plots directories must be mounted into the container
podman run --rm -v ~/.aws:/root/.aws:ro -v $(pwd -P)/stats:/app/stats -v $(pwd -P)/plots:/app/plots ghcr.io/commoncrawl/cc-crawl-statistics/stats:latest 

# if needed you can manually build the container image
podman build -f stats.Dockerfile -t ghcr.io/commoncrawl/cc-crawl-statistics/stats:latest

# for development it is recommend to mount the whole repository into the container
podman run -it -v ~/.aws:/root/.aws:ro -v $(pwd -P):/app ghcr.io/commoncrawl/cc-crawl-statistics/stats:latest /bin/bash

```


Related Projects
----------------

The [columnar index](https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/)
simplifies counting and analytics a lot - easier to maintain, more transparent, reproducible and
extensible than running two MapReduce jobs, see the the list of example
- [SQL queries](https://github.com/commoncrawl/cc-index-table#query-the-table-in-amazon-athena) and
- [Jupyter notebooks](https://github.com/commoncrawl/cc-notebooks)



================================================
FILE: _config.yml
================================================
title: Statistics of Common Crawl Monthly Archives
description: Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
repository: commoncrawl/cc-crawl-statistics
latest_crawl: CC-MAIN-2026-17

show_navigation: True
navlist:
 - title: Home
   url: /
 - title: Size of crawls
   url: /plots/crawlsize
 - title: Top-level domains
   url: /plots/tlds
 - title: Registered domains
   url: /plots/domains
 - title: Crawler metrics
   url: /plots/crawlermetrics
 - title: Crawl overlaps
   url: /plots/crawloverlap
 - title: Media types
   url: /plots/mimetypes
 - title: Character sets
   url: /plots/charsets
 - title: Languages
   url: /plots/languages

theme: jekyll-theme-minimal


================================================
FILE: _layouts/default.html
================================================
<!doctype html>
<html lang="{{ site.lang | default: "en-US" }}">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="chrome=1">
    <title>{{ site.title | default: site.github.repository_name }} by {{ site.github.owner_name }}</title>

    <link rel="stylesheet" href="{{ '/assets/css/style.css?v=' | append: site.github.build_revision | relative_url }}">
    <meta name="viewport" content="width=device-width">
    <!--[if lt IE 9]>
    <script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
  </head>
  <body>
    <div class="wrapper">
      <header>
        <h1>{{ site.title | default: site.github.repository_name }}</h1>
        <p>{{ site.description | default: site.github.project_tagline }}
        <br>Latest crawl: {{site.latest_crawl}}
        </p>

        {% if site.show_navigation %}
          <nav>
            <p>
            {% for node in site.navlist %}
              <a href="{{ site.baseurl }}{{ node.url }}">{{ node.title }}</a><br/>
            {% endfor %}
            </p>
          </nav>
        {% endif %}

      </header>
      <section>

      {{ content }}

      </section>
      <footer>
        {% if site.github.is_project_page %}
          <p class="view"><a href="{{ site.github.repository_url }}">View the Project on GitHub <small>{{ github_name }}</small></a></p>
        {% endif %}
        {% if site.github.is_project_page %}
        <p>This project is maintained by <a href="{{ site.github.owner_url }}">{{ site.github.owner_name }}</a></p>
        {% endif %}
        <p><small>Hosted on GitHub Pages &mdash; Theme by <a href="https://github.com/orderedlist">orderedlist</a></small></p>
      </footer>
    </div>
    <script src="{{ '/assets/js/scale.fix.js' | relative_url }}"></script>

  </body>
</html>
<!--
 Based on:
  https://github.com/pages-themes/minimal
  https://github.com/orderedlist/minimal
-->

================================================
FILE: _layouts/table.html
================================================
<!doctype html>
<html lang="{{ site.lang | default: "en-US" }}">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="chrome=1">
    <title>{{ site.title | default: site.github.repository_name }} by {{ site.github.owner_name }}</title>

    <link rel="stylesheet" href="{{ '/assets/css/style.css?v=' | append: site.github.build_revision | relative_url }}">
    <meta name="viewport" content="width=device-width">
    <!--[if lt IE 9]>
    <script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
    <![endif]-->
    <script src="{{ '/assets/js/jquery-3.7.1.min.js?v=' | append: site.github.build_revision | relative_url }}"></script>
    <script src="{{ '/assets/js/jquery.tablesorter.min.js?v=' | append: site.github.build_revision | relative_url }}"></script>
    <script type="text/javascript">
      $(document).ready(function() {
        $(".tablesorter").tablesorter({{ page.table_sortlist }});
        $("table.iso639-3-language tbody tr th").each(function() {
          $(this).html("<a href='https://en.wikipedia.org/wiki/ISO_639:" + $(this).html() + "'>" + $(this).html() + "</a>");
        });
	    $("#search").on("keyup", function() {
    	  var value = $(this).val().toLowerCase();
          $(".tablesearcher tbody tr").filter(function() {
            $(this).toggle($(this).children('th').text().indexOf(value) > -1);
          });
        });
      });
    </script>
    <style>
      table.tablepercentage { table-layout: auto; }
      table.tablepercentage thead tr:first-child th {
          hyphens: auto; font-size: small;
      }
      table.tablepercentage td { min-width: 60px; }
      table.tablepercentage tbody th { max-width: 200px; hyphens: auto; }
      table.tablepercentage thead tr:last-child .header:not(:first-child):before {
        content: "%";
      }
      table.matrix tbody tr th {
        font-weight: bold;
      }
      table.tablesorter thead tr:last-child .header {
        background-image: url({{ '/assets/img/bg.gif?v=' | append: site.github.build_revision | relative_url }});
        background-repeat: no-repeat;
        background-position: center right;
        cursor: pointer;
      }
      table.tablesorter thead tr:last-child .headerSortUp {
        background-image: url({{ '/assets/img/asc.gif?v=' | append: site.github.build_revision | relative_url }});
      }
      table.tablesorter thead tr:last-child .headerSortDown {
        background-image: url({{ '/assets/img/desc.gif?v=' | append: site.github.build_revision | relative_url }});
      }
      table tbody th { font-weight: normal; }
      table tbody td { text-align: right; }
    </style>

  </head>
  <body>
    <div class="wrapper">
      <header>
        <h1>{{ site.title | default: site.github.repository_name }}</h1>
        <p>{{ site.description | default: site.github.project_tagline }}
        <br>Latest crawl: {{site.latest_crawl}}
        </p>

        {% if site.show_navigation %}
          <nav>
            <p>
            {% for node in site.navlist %}
              <a href="{{ site.baseurl }}{{ node.url }}">{{ node.title }}</a><br/>
            {% endfor %}
            </p>
          </nav>
        {% endif %}

        {% if site.github.is_project_page %}
          <p class="view"><a href="{{ site.github.repository_url }}">View the Project on GitHub <small>{{ github_name }}</small></a></p>
        {% endif %}

      </header>
      <section>

      {{ content }}

      {% if page.table_searcher %}
      <p><input type="text" id="search" placeholder="{{ page.table_searcher }}"></p>
      {% endif %}

      {% for table in page.table_include %}
      {% include_relative {{ table }} %}
      <br>
      {% endfor %}

      </section>
      <footer>
        {% if site.github.is_project_page %}
        <p>This project is maintained by <a href="{{ site.github.owner_url }}">{{ site.github.owner_name }}</a></p>
        {% endif %}
        <p><small>Hosted on GitHub Pages &mdash; Theme by <a href="https://github.com/orderedlist">orderedlist</a></small></p>
      </footer>
    </div>
    <script src="{{ '/assets/js/scale.fix.js' | relative_url }}"></script>

  </body>
</html>
<!--
 Based on:
  https://github.com/pages-themes/minimal
  https://github.com/orderedlist/minimal
  http://tablesorter.com/
-->

================================================
FILE: crawlplot.py
================================================
"""
Base plotting module for Common Crawl statistics visualization.

This module provides the CrawlPlot base class which handles:
- Plot library selection (matplotlib, rpy2/ggplot2, or legacy ggplot)
- Common plot styling to match ggplot2 aesthetics
- Data input from stdin or files
- Output directory management

The plot library is controlled by the PLOTLIB environment variable:
- 'matplotlib' (recommended)
- 'rpy2.ggplot2' (requires R and rpy2)
- 'ggplot' (deprecated)

The output directory is controlled by PLOTDIR (defaults to 'plots/').
"""

import json
import logging
import os
import os.path
import sys
from typing import Literal

import fsspec
import numpy as np


# Supported plot library backends
PlotLibType = Literal["rpy2.ggplot2", "ggplot", "matplotlib"]


class CrawlPlot:
    """
    Base class for Common Crawl statistics plots.

    Provides common functionality for all plot types including:
    - Plot library initialization and configuration
    - Data reading from stdin or gzipped files
    - Line plot generation with consistent styling
    - Output directory management

    Subclasses should implement:
    - add(key, val): Process a single data record
    - plot(): Generate the specific visualization

    Attributes:
        PLOTLIB: The plotting library to use ('matplotlib', 'rpy2.ggplot2', or 'ggplot')
        PLOTDIR: Directory for saving plot output files
        DEFAULT_FIGSIZE: Default figure size in inches (7 = 2100px at 300 DPI)
        DEFAULT_DPI: Default resolution for saved figures
    """

    GGPLOT2_THEME = None
    GGPLOT2_THEME_KWARGS = None

    # figure with square aspect ratio : 7 inches * 300 DPI = 2100 pixels
    DEFAULT_FIGSIZE = 7
    DEFAULT_DPI = 300

    title_fontsize = 15
    title_pad = 20
    title_fontweight = "normal"
    title_loc = "left"
    xlabel_fontsize = 12
    ylabel_fontsize = 12
    ticks_fontsize = 10
    ticks_color = "#E6E6E6"
    ticks_length = 8
    ticks_width = 1.0
    bar_width = 0.8
    legend_fontsize = 10
    legend_title_fontsize = 11
    line_width = 0.75
    marker_size = 4
    grid_major_linewidth = 1.0
    grid_minor_linewidth = 0.5
    grid_major_color = "#E6E6E6"
    grid_minor_color = "#E6E6E6"
    tight_layout_pad = 0.5
    savefig_facecolor = "white"
    savefig_bbox_inches = None

    # -------------------------------------------------------------------------
    # Matplotlib helper methods for reducing code duplication
    # -------------------------------------------------------------------------

    def create_figure(self, ratio=1.0):
        """Create a matplotlib figure with consistent sizing.

        Args:
            ratio: Height ratio relative to width (default: 1.0 for square)

        Returns:
            Tuple of (fig, ax)
        """
        import matplotlib.pyplot as plt
        return plt.subplots(figsize=(self.DEFAULT_FIGSIZE, self.DEFAULT_FIGSIZE * ratio))

    def set_title(self, ax, title):
        """Apply consistent title styling to an axes.

        Args:
            ax: matplotlib Axes object
            title: Title text
        """
        ax.set_title(
            title,
            fontsize=self.title_fontsize,
            fontweight=self.title_fontweight,
            pad=self.title_pad,
            loc=self.title_loc,
        )

    def apply_ggplot2_style(self, ax, show_grid=True, grid_axis='both'):
        """Apply ggplot2-like minimal styling to an axes.

        Removes spines, adds grid lines, and sets axes below plot elements.

        Args:
            ax: matplotlib Axes object
            show_grid: Whether to show grid lines (default: True)
            grid_axis: Which axis to show grid on ('both', 'x', or 'y')
        """
        # Remove all spines
        for spine in ['top', 'right', 'left', 'bottom']:
            ax.spines[spine].set_visible(False)

        # Add grid
        if show_grid:
            ax.grid(True, which='major', linewidth=self.grid_major_linewidth,
                    color=self.grid_major_color, zorder=0, axis=grid_axis)

        ax.set_axisbelow(True)

    def set_tick_labels_black(self, ax):
        """Set all tick labels to black color.

        Args:
            ax: matplotlib Axes object
        """
        for label in ax.get_xticklabels() + ax.get_yticklabels():
            label.set_color('black')

    def apply_nice_ticks(self, ax, axis='y', use_scientific=True):
        """Apply nice tick spacing using the nice_tick_step calculation.

        Sets minor and major ticks at 'nice' intervals (multiples of 1, 2, or 5).
        Optionally applies scientific notation for large values.

        Args:
            ax: matplotlib Axes object
            axis: Which axis to apply to ('x' or 'y')
            use_scientific: Whether to use scientific notation for large values
        """
        from matplotlib.ticker import MultipleLocator, FormatStrFormatter

        if axis == 'y':
            vmin, vmax = ax.get_ylim()
            axis_obj = ax.yaxis
        else:
            vmin, vmax = ax.get_xlim()
            axis_obj = ax.xaxis

        minor = self.nice_tick_step(vmin, vmax, n=8)
        major = 2 * minor

        axis_obj.set_minor_locator(MultipleLocator(minor))
        axis_obj.set_major_locator(MultipleLocator(major))

        if use_scientific and vmax > 1e4:
            axis_obj.set_major_formatter(FormatStrFormatter('%.0e'))

    def save_figure(self, fig, img_path):
        """Save figure with consistent settings and close it.

        Args:
            fig: matplotlib Figure object
            img_path: Output file path

        Returns:
            The figure object (for chaining)
        """
        import matplotlib.pyplot as plt
        plt.tight_layout(pad=self.tight_layout_pad)
        plt.savefig(img_path, dpi=self.DEFAULT_DPI,
                    bbox_inches=self.savefig_bbox_inches,
                    facecolor=self.savefig_facecolor)
        plt.close()
        return fig

    def hide_tick_marks(self, ax, tick_color='#FFFFFF'):
        """Hide tick marks by setting them to a background color.

        The tick labels remain visible but the tick marks themselves are hidden.

        Args:
            ax: matplotlib Axes object
            tick_color: Color to set ticks to (default: white)
        """
        ax.tick_params(axis='both', which='both', colors=tick_color,
                       length=self.ticks_length, width=self.ticks_width)

    def __init__(self):
        """Initialize the plot with library selection and output directory setup."""
        # Settings defined via environment variables
        self.PLOTLIB: PlotLibType = os.environ.get('PLOTLIB', 'matplotlib')
        self.PLOTDIR = os.environ.get('PLOTDIR', 'plots')

        if self.PLOTLIB == 'ggplot':
            # nothing to do here
            pass
        elif self.PLOTLIB == 'rpy2.ggplot2':
            from rpy2.robjects.lib import ggplot2
            from rpy2.robjects import pandas2ri
            pandas2ri.activate()
            # use minimal theme with white background set in plot constructor
            # https://ggplot2.tidyverse.org/reference/ggtheme.html
            self.GGPLOT2_THEME = ggplot2.theme_minimal(base_size=12, base_family="Helvetica")

            self.GGPLOT2_THEME_KWARGS = {
                'panel.background': ggplot2.element_rect(fill='white', color='white'),
                'plot.background': ggplot2.element_rect(fill='white', color='white')
            }

        elif self.PLOTLIB == "matplotlib":
            import matplotlib.pyplot as plt

            # ggplot2-inspired color palette
            ggplot_colors = [
                "#F8766D", "#00BE67", "#00A9FF", "#CD9600", "#7CAE00",
                "#00BFC4", "#C77CFF", "#FF61CC",
            ]

            # Set up ggplot2-like minimal theme with larger fonts
            plt.style.use('default')
            plt.rcParams.update({
                'font.family': 'sans-serif',
                'font.sans-serif': ['Liberation Sans', 'Arial', 'DejaVu Sans'],
                'font.size': 20,  # Much larger base font size
                'axes.linewidth': 1.5,
                'axes.spines.left': True,
                'axes.spines.bottom': True,
                'axes.spines.top': False,
                'axes.spines.right': False,
                'axes.axisbelow': True,
                'axes.grid': True,
                'axes.grid.axis': 'both',
                'grid.linewidth': 1.0,
                'grid.color': '#E6E6E6',  # Gray grid lines
                'axes.facecolor': 'white',  # White background
                'figure.facecolor': 'white',
                'xtick.bottom': True,
                'xtick.top': False,
                'ytick.left': True,
                'ytick.right': False,
                'xtick.direction': 'out',
                'ytick.direction': 'out',
                'axes.prop_cycle':  plt.cycler(color=ggplot_colors),
            })

        else:
            raise ValueError("Invalid PLOTLIB defined")

        # Make sure output directories exists
        os.makedirs(os.path.join(self.PLOTDIR, "crawler"), exist_ok=True)
        os.makedirs(os.path.join(self.PLOTDIR, "crawloverlap"), exist_ok=True)
        os.makedirs(os.path.join(self.PLOTDIR, "crawlsize"), exist_ok=True)
        os.makedirs(os.path.join(self.PLOTDIR, "tld"), exist_ok=True)


    def read_from_stdin_or_file(self):
        """Read statistics data from a file argument or stdin.

        If a file path is provided as the first command line argument,
        reads from that file (supports gzip compression). Otherwise,
        reads from stdin.
        """
        if len(sys.argv) > 1:
            # File provided as argument
            fp = sys.argv[1]
            compression = ("gzip" if fp.endswith(".gz") else None)

            with fsspec.open(fp, 'r', compression=compression) as f:
                self.read_data(f)
        else:
            # No argument, use stdin
            self.read_data(sys.stdin)

    def read_data(self, stream):
        """Parse tab-separated JSON key-value pairs from a stream.

        Args:
            stream: Input stream containing lines of tab-separated JSON data.
                   Each line should have format: JSON_KEY<tab>JSON_VALUE
        """
        for line in stream:
            keyval = line.split('\t')
            if len(keyval) == 2:
                key = json.loads(keyval[0])
                val = json.loads(keyval[1])
                self.add(key, val)
            else:
                logging.error("Not a key-value pair: {}".find(line))

    def line_plot_with_ggplot(
        self,
        data,
        title,
        ylabel,
        img_path,
        x="date",
        y="size",
        c="type",
        clabel="",
        ratio=1.0,
    ):
        """Generate a line plot using the legacy ggplot library (deprecated)."""
        from ggplot import ggplot, aes, ggtitle, ylab, xlab, scale_x_date, date_breaks, geom_line, geom_point

        date_label = "%Y\n%W"  # year + week number
        p = (
            ggplot(data, aes(x=x, y=y, color=c))
            + ggtitle(title)
            + ylab(ylabel)
            + xlab(" ")
            + scale_x_date(breaks=date_breaks("3 months"), labels=date_label)
            + geom_line()
            + geom_point()
        )
        p.save(img_path)
        return p

    def line_plot_with_rpy2_ggplot2(
        self,
        data,
        title,
        ylabel,
        img_path,
        x="date",
        y="size",
        c="type",
        clabel="",
        ratio=1.0,
    ):
        """Generate a line plot using R's ggplot2 via rpy2."""
        from rpy2.robjects.lib import ggplot2

        # Convert y axis to float because R uses 32-bit signed integers
        # and values >= 2 billion (2^31) will overflow
        data[y] = data[y].astype(float)
        if y != "size" and "size" in data.columns:
            data["size"] = data["size"].astype(float)
        p = (
            ggplot2.ggplot(data)
            + ggplot2.aes_string(x=x, y=y, color=c)
            + ggplot2.geom_line(linewidth=0.5)
            + ggplot2.geom_point()
            + self.GGPLOT2_THEME
            + ggplot2.theme(
                **{
                    "legend.position": "bottom",
                    "aspect.ratio": ratio,
                    **self.GGPLOT2_THEME_KWARGS,
                }
            )
            + ggplot2.labs(title=title, x="", y=ylabel, color=clabel)
        )

        p.save(img_path)

        return p

    @staticmethod
    def nice_tick_step(vmin, vmax, n=5):
        """Calculate a 'nice' tick step for axis labels.

        Returns a tick step value that is a multiple of 1, 2, or 5 times
        a power of 10, which produces clean, readable axis labels.

        Args:
            vmin: Minimum value of the axis range
            vmax: Maximum value of the axis range
            n: Approximate number of tick intervals desired (default: 5)

        Returns:
            A 'nice' tick step value (1/2/5 * 10^k)
        """
        span = abs(vmax - vmin)
        if span == 0:
            return 1.0
        raw = span / n
        exp = np.floor(np.log10(raw))
        frac = raw / (10**exp)
        nice_frac = 1 if frac <= 1 else 2 if frac <= 2 else 5 if frac <= 5 else 10
        return nice_frac * 10**exp
    
    @staticmethod
    def center_legend_title(fig, ax, leg_items, leg_title, x_axes=0.1):
        """Center the legend title vertically with respect to legend items."""
        fig.canvas.draw()
        r = fig.canvas.get_renderer()
        bb = leg_items.get_window_extent(r)
        y = fig.transFigure.inverted().transform((0, (bb.y0+bb.y1)/2))[1]
        x = fig.transFigure.inverted().transform(ax.transAxes.transform((x_axes, 0)))[0]
        leg_title.set_bbox_to_anchor((x, y), transform=fig.transFigure)

    def line_plot_with_matplotlib(
        self,
        data,
        title,
        ylabel,
        img_path,
        x="date",
        y="size",
        c="type",
        clabel="",
        ratio=1.0,
    ):
        """Generate a line plot using matplotlib with ggplot2-like styling.

        Creates a multi-series line plot with markers, styled to match
        ggplot2's minimal theme aesthetic.

        Args:
            data: pandas DataFrame containing the plot data
            title: Plot title
            ylabel: Y-axis label
            img_path: Output file path for the saved image
            x: Column name for x-axis values (default: 'date')
            y: Column name for y-axis values (default: 'size')
            c: Column name for grouping/color (default: 'type')
            clabel: Legend title (default: '')
            ratio: Aspect ratio for the plot (default: 1.0)

        Returns:
            matplotlib Figure object
        """
        from matplotlib.ticker import AutoMinorLocator
        from matplotlib.dates import YearLocator, DateFormatter

        # Convert y axis to float for consistency with large values
        data[y] = data[y].astype(float)
        if y != "size" and "size" in data.columns:
            data["size"] = data["size"].astype(float)

        fig, ax = self.create_figure()
        groups = data.groupby(c)

        # Use ggplot2 default colors for small group counts
        colors = ["#F8766D", "#00BA38", "#619CFF"] if len(groups) <= 3 else None

        for i, (group_key, group_df) in enumerate(groups):
            group_color = colors[i] if colors is not None else None
            ax.plot(
                group_df[x], group_df[y], "o-",
                color=group_color, label=group_key,
                linewidth=self.line_width, markersize=self.marker_size,
            )

        self.set_title(ax, title)
        ax.set_xlabel("")
        ax.set_ylabel(ylabel, fontsize=self.ylabel_fontsize)

        # Apply nice y-axis ticks
        self.apply_nice_ticks(ax, axis='y')

        # Axes ratio
        axes_aspect_ratio = 1 / ax.get_data_ratio() * ratio
        if axes_aspect_ratio < 1:
            ax.set_aspect(axes_aspect_ratio)

        # Date formatting for x-axis
        ax.xaxis.set_major_formatter(DateFormatter("%Y"))
        ax.xaxis.set_major_locator(YearLocator(base=5))
        ax.xaxis.set_minor_locator(AutoMinorLocator(2))

        ax.tick_params(axis="both", labelsize=self.ticks_fontsize)

        # Grid with both major and minor lines
        ax.grid(True, which="major", linewidth=self.grid_major_linewidth,
                color=self.grid_major_color, zorder=0)
        ax.grid(True, which="minor", linewidth=self.grid_minor_linewidth,
                color=self.grid_minor_color, zorder=0)
        ax.set_axisbelow(True)

        # Apply ggplot2 style (remove spines)
        for spine in ['top', 'right', 'left', 'bottom']:
            ax.spines[spine].set_visible(False)

        # Hide tick marks but keep labels black
        self.hide_tick_marks(ax)
        self.set_tick_labels_black(ax)

        # Legend setup
        num_legend_items = len(groups)
        ncol = 5 if num_legend_items == 5 else 4

        if clabel:
            leg_items = ax.legend(
                loc="upper center", ncol=ncol, bbox_to_anchor=(0.6, -0.1),
                frameon=False, fontsize=self.legend_fontsize,
            )
            ax.legend(
                [], [], title=clabel, loc="upper center",
                bbox_to_anchor=(0.2, -0.075), frameon=False,
                title_fontsize=self.legend_title_fontsize,
            )
            ax.add_artist(leg_items)
        else:
            ax.legend(
                loc="upper center", bbox_to_anchor=(0.5, -0.1),
                ncol=ncol, frameon=False, fontsize=self.legend_fontsize,
            )

        return self.save_figure(fig, img_path)

    def line_plot(
        self,
        data,
        title,
        ylabel,
        img_file,
        x="date",
        y="size",
        c="type",
        clabel="",
        ratio=1.0,
    ):
        """Generate a line plot using the configured plotting library.

        This is the main entry point for creating line plots. It delegates
        to the appropriate backend based on the PLOTLIB setting.

        Args:
            data: pandas DataFrame containing the plot data
            title: Plot title
            ylabel: Y-axis label
            img_file: Output filename relative to PLOTDIR
            x: Column name for x-axis values (default: 'date')
            y: Column name for y-axis values (default: 'size')
            c: Column name for grouping/color (default: 'type')
            clabel: Legend title (default: '')
            ratio: Aspect ratio for the plot (default: 1.0)

        Returns:
            Plot object (type depends on backend)
        """
        img_path = os.path.join(self.PLOTDIR, img_file)

        if self.PLOTLIB == "ggplot":
            return self.line_plot_with_ggplot(
                data=data,
                title=title,
                ylabel=ylabel,
                img_path=img_path,
                x=x,
                y=y,
                c=c,
                clabel=clabel,
                ratio=ratio,
            )

        elif self.PLOTLIB == "rpy2.ggplot2":
            return self.line_plot_with_rpy2_ggplot2(
                data=data,
                title=title,
                ylabel=ylabel,
                img_path=img_path,
                x=x,
                y=y,
                c=c,
                clabel=clabel,
                ratio=ratio,
            )

        elif self.PLOTLIB == "matplotlib":
            return self.line_plot_with_matplotlib(
                data=data,
                title=title,
                ylabel=ylabel,
                img_path=img_path,
                x=x,
                y=y,
                c=c,
                clabel=clabel,
                ratio=ratio,
            )


================================================
FILE: crawlstats.py
================================================
import heapq
import json
import logging
import os
import re

from collections import defaultdict, Counter
from datetime import date
from enum import Enum
from urllib.parse import urlparse

import mrjob.util
import tldextract
import ujson

from hyperloglog import HyperLogLog
from isoweek import Week
from mrjob.job import MRJob, MRStep
from mrjob.protocol import JSONProtocol, RawValueProtocol


HYPERLOGLOG_ERROR = .01

# threshold when to add a HyperLogLog for SURT domains
MIN_SURT_HLL_SIZE = 50000

LOGGING_FORMAT = '%(asctime)s: [%(levelname)s]: %(message)s'
LOGGING_LEVEL = logging.INFO
LOG = logging.getLogger('CCStatsJob')
mrjob.util.log_to_stream(format=LOGGING_FORMAT,
                         level=LOGGING_LEVEL,
                         name='CCStatsJob')


class MonthlyCrawl:
    """Enumeration of monthly crawl archives"""

    by_name = {
               'CC-MAIN-2008-2009': 88,
               'CC-MAIN-2009-2010': 89,
               'CC-MAIN-2012': 90,
               'CC-MAIN-2013-20': 91,
               'CC-MAIN-2013-48': 92,
               'CC-MAIN-2014-10': 93,
               'CC-MAIN-2014-15': 94,
               'CC-MAIN-2014-23': 95,
               'CC-MAIN-2014-35': 96,
               'CC-MAIN-2014-41': 97,
               'CC-MAIN-2014-42': 98,
               'CC-MAIN-2014-49': 99,
               'CC-MAIN-2014-52': 0,
               'CC-MAIN-2015-06': 1,
               'CC-MAIN-2015-11': 2,
               'CC-MAIN-2015-14': 3,
               'CC-MAIN-2015-18': 4,
               'CC-MAIN-2015-22': 5,
               'CC-MAIN-2015-27': 6,
               'CC-MAIN-2015-32': 7,
               'CC-MAIN-2015-35': 8,
               'CC-MAIN-2015-40': 9,
               'CC-MAIN-2015-48': 10,
               'CC-MAIN-2016-07': 11,
               'CC-MAIN-2016-18': 12,
               'CC-MAIN-2016-22': 13,
               'CC-MAIN-2016-26': 14,
               'CC-MAIN-2016-30': 15,
               'CC-MAIN-2016-36': 16,
               'CC-MAIN-2016-40': 17,
               'CC-MAIN-2016-44': 18,
               'CC-MAIN-2016-50': 19,
               'CC-MAIN-2017-04': 20,
               'CC-MAIN-2017-09': 21,
               'CC-MAIN-2017-13': 22,
               'CC-MAIN-2017-17': 23,
               'CC-MAIN-2017-22': 24,
               'CC-MAIN-2017-26': 25,
               'CC-MAIN-2017-30': 26,
               'CC-MAIN-2017-34': 27,
               'CC-MAIN-2017-39': 28,
               'CC-MAIN-2017-43': 29,
               'CC-MAIN-2017-47': 30,
               'CC-MAIN-2017-51': 31,
               'CC-MAIN-2018-05': 32,
               'CC-MAIN-2018-09': 33,
               'CC-MAIN-2018-13': 34,
               'CC-MAIN-2018-17': 35,
               'CC-MAIN-2018-22': 36,
               'CC-MAIN-2018-26': 37,
               'CC-MAIN-2018-30': 38,
               'CC-MAIN-2018-34': 39,
               'CC-MAIN-2018-39': 40,
               'CC-MAIN-2018-43': 41,
               'CC-MAIN-2018-47': 42,
               'CC-MAIN-2018-51': 43,
               'CC-MAIN-2019-04': 44,
               'CC-MAIN-2019-09': 45,
               'CC-MAIN-2019-13': 46,
               'CC-MAIN-2019-18': 47,
               'CC-MAIN-2019-22': 48,
               'CC-MAIN-2019-26': 49,
               'CC-MAIN-2019-30': 50,
               'CC-MAIN-2019-35': 51,
               'CC-MAIN-2019-39': 52,
               'CC-MAIN-2019-43': 53,
               'CC-MAIN-2019-47': 54,
               'CC-MAIN-2019-51': 55,
               'CC-MAIN-2020-05': 56,
               'CC-MAIN-2020-10': 57,
               'CC-MAIN-2020-16': 58,
               'CC-MAIN-2020-24': 59,
               'CC-MAIN-2020-29': 60,
               'CC-MAIN-2020-34': 61,
               'CC-MAIN-2020-40': 62,
               'CC-MAIN-2020-45': 63,
               'CC-MAIN-2020-50': 64,
               'CC-MAIN-2021-04': 65,
               'CC-MAIN-2021-10': 66,
               'CC-MAIN-2021-17': 67,
               'CC-MAIN-2021-21': 68,
               'CC-MAIN-2021-25': 69,
               'CC-MAIN-2021-31': 70,
               'CC-MAIN-2021-39': 71,
               'CC-MAIN-2021-43': 72,
               'CC-MAIN-2021-49': 73,
               'CC-MAIN-2022-05': 74,
               'CC-MAIN-2022-21': 75,
               'CC-MAIN-2022-27': 76,
               'CC-MAIN-2022-33': 77,
               'CC-MAIN-2022-40': 78,
               'CC-MAIN-2022-49': 79,
               'CC-MAIN-2023-06': 80,
               'CC-MAIN-2023-14': 81,
               'CC-MAIN-2023-23': 82,
               'CC-MAIN-2023-40': 83,
               'CC-MAIN-2023-50': 84,
               'CC-MAIN-2024-10': 85,
               'CC-MAIN-2024-18': 86,
               'CC-MAIN-2024-22': 87,
               'CC-MAIN-2024-26': 100,
               'CC-MAIN-2024-30': 101,
               'CC-MAIN-2024-33': 102,
               'CC-MAIN-2024-38': 103,
               'CC-MAIN-2024-42': 104,
               'CC-MAIN-2024-46': 105,
               'CC-MAIN-2024-51': 106,
               'CC-MAIN-2025-05': 107,
               'CC-MAIN-2025-08': 108,
               'CC-MAIN-2025-13': 109,
               'CC-MAIN-2025-18': 110,
               'CC-MAIN-2025-21': 111,
               'CC-MAIN-2025-26': 112,
               'CC-MAIN-2025-30': 113,
               'CC-MAIN-2025-33': 114,
               'CC-MAIN-2025-38': 115,
               'CC-MAIN-2025-43': 116,
               'CC-MAIN-2025-47': 117,
               'CC-MAIN-2025-51': 118,
               'CC-MAIN-2026-04': 119,
               'CC-MAIN-2026-08': 120,
               'CC-MAIN-2026-12': 121,
               'CC-MAIN-2026-17': 122,
               'CC-MAIN-2026-21': 123,
    }

    by_id = dict(map(reversed, by_name.items()))

    @staticmethod
    def get_by_name(name):
        return MonthlyCrawl.by_name[name]

    @staticmethod
    def to_name(crawl):
        return MonthlyCrawl.by_id[crawl]

    @staticmethod
    def to_bit_mask(crawl):
        return (1 << crawl)

    @staticmethod
    def date_of(crawl):
        if crawl == 'CC-MAIN-2008-2009':
            return date(2009, 1, 12)
        if crawl == 'CC-MAIN-2009-2010':
            return date(2010, 9, 25)
        if crawl == 'CC-MAIN-2012':
            return date(2012, 11, 2)
        [_, _, year, week] = crawl.split('-')
        return Week(int(year), int(week)).monday()

    @staticmethod
    def year_of(crawl):
        return MonthlyCrawl.date_of(crawl).year

    @staticmethod
    def short_name(name):
        return name.replace('CC-MAIN-', '')

    @staticmethod
    def get_latest(n):
        return sorted(MonthlyCrawl.by_name.keys())[-n:]


class MonthlyCrawlSet:
    """Dense representation of a list of monthly crawls.
    Represent in which crawls a given item (URL, but also
    domain, host, digest) occurs.
    """

    def __init__(self, crawls=0):
        self.bits = crawls

    def add(self, crawl):
        self.bits |= MonthlyCrawl.to_bit_mask(crawl)

    def update(self, *others):
        for other in others:
            self.bits |= other.get_bits()

    def clear(self):
        self.bits = 0

    def discard(self, crawl):
        self.bits &= ~MonthlyCrawl.to_bit_mask(crawl)

    def __contains__(self, crawl):
        return (self.bits & MonthlyCrawl.to_bit_mask(crawl)) != 0

    def __len__(self):
        """popcount of a 32 bit integer."""
        i = self.bits
        i = i - ((i >> 1) & 0x55555555)
        i = (i & 0x33333333) + ((i >> 2) & 0x33333333)
        return (((i + (i >> 4) & 0xF0F0F0F) * 0x1010101) & 0xffffffff) >> 24

    def get_bits(self):
        return self.bits

    def get_crawls(self):
        i = self.bits
        r = 0
        while (i):
            if (i & 1):
                yield r
            r += 1
            i >>= 1

    def is_new(self, crawl):
        """True if there are no older crawls in set (no lower id)"""
        if (self.bits == 0):
            return True
        i = self.bits
        i = (i ^ (i - 1)) >> 1  # set trailing 0s to 1s and zero rest
        r = 0
        while (i):
            if r == crawl:
                return True
            r += 1
            i >>= 1
        if (r < crawl):
            return False
        return True

    def is_newest(self, crawl):
        """True if crawl is the newest crawl in set (highest id)"""
        # i = self.bits
        # j = MonthlyCrawl.to_bit_mask(crawl)
        # return (i & ~j) < j
        return self.bits.bit_length() == (crawl + 1)


class CST(Enum):
    """Enum for crawl statistics types.
    Every line (key-value pair) has a marker which indicates the type
    of the count / frequency:
    - pages, URLs, hosts, etc.
    - size (number of unique items), histograms, etc.
    The type marker (the first element in the key tuple) determines
    the format of the line (key-value pair):
      <<type, key_params...>, <values...>>
    The format may vary for different steps (job, mapper, reducer).
    The count job (CCCountJob) uses the numeric types to reduce
    the data size, while CCCountJob outputs the type names for better
    readability.
    Types of countable items
    #   <<type, item, crawl>, <count(s)>>
    # For hosts, domains, etc. MultiCount is used to hold two counts -
    # the number of pages and URLs per item."""
    url = 0
    """(unique) URL"""
    digest = 1
    """(unique) content digest (MD5)"""
    host = 2
    """hostname ("www.commoncrawl.org")"""
    domain = 3
    """pay-level domain or private domain ("commoncrawl.org")"""
    tld = 4
    """public suffix ("org" or "co.uk")
    - not necessarily a TLD / "top-level domain" according to
      https://github.com/google/guava/wiki/InternetDomainNameExplained
    - here following https://github.com/john-kurkowski/tldextract"""
    surt_domain = 5
    """surt_domain :- SURT domain ("org,commoncrawl")
    - Sort-friendly URI Reordering Transform, cf.
      http://crawler.archive.org/articles/user_manual/glossary.html#surt"""
    scheme = 6
    """URI scheme ("http", "https")
    see https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Syntax"""
    mimetype = 7
    """MIME type / media type / content type
    - as sent by the server as "Content-Type" in the HTTP header,
      weakly normalized, not verified"""
    mimetype_detected = 77
    """MIME type detected based on content, URL and HTTP Content-Type"""
    page = 8
    """number of successfully fetched pages (HTTP status 200),
    including URL-level and content-level duplicates"""
    fetch = 9
    """number of fetches, including 404s, redirects, robots.txt, etc.
    - since CC-MAIN-2016-50"""
    http_status = 10
    """detected charset
    - since CC-MAIN-2018-34"""
    charset = 11
    """detected languages or combination of languages
    - since CC-MAIN-2018-34
    NOTE: since gld2 identifies 160 languages and up to 3 languages,
    the number of possible combinations is too high (4 millions) and
    only the more common ones are preserved"""
    languages = 12
    """primary language of the document (first of the detected languages)
    - since CC-MAIN-2018-34"""
    primary_language = 13
    """number of HTTP status codes (200, 404, etc.)
    - since CC-MAIN-2016-50"""
    crawl_status = 55
    """crawl status (successful fetches, 404s, exceptions, etc.)
    - following Nutch CrawlDatum status codes
    - similar to HTTP status but less fine-grained
    - includes crawler-specific statuses (e.g., "denied by robots.txt")"""
    robotstxt_status = 56
    """HTTP status of robots.txt responses"""
    size = 90
    """size of a crawl (number of unique items):
    - pages,
    - URLs (one URL may be fetched multiple times),
    - content digests,
    - domains, hosts, top-level domains
    - mime types
    - etc.
    format:
      <<size, item_type, crawl>, number_of_unique_items>"""
    size_estimate = 91
    """estimates for unique URLs and content digests
    - estimates by HyperLogLog probabilistic counters"""
    size_estimate_for = 92
    """estimates per large-sized item
    (domains, hosts, TLDs, SURT domains)
    - aimed to estimate domain coverage over time / multiple crawls
    - CC-MAIN-2016-44 adds HyperLogLogs for SURT domain (>=50,000 URLs)
    format:
     <<size_estimate_for, per_item_type, per_item, item_type, crawl>, hll>"""
    size_robotstxt = 93
    """number of robots.txt fetches"""
    new_items = 95
    """new items (URLs, content digests) for a given crawl
    - first seen in this crawl, not observed in previous crawls
    - only with exact counts for all crawls
    - could be estimated by HyperLogLog set operations otherwise"""
    histogram = 96
    """frequency of item counts per page or URL
    format:
      <<type, item_type, crawl, counted_per, count>, frequency>"""


class MultiCount(defaultdict):
    """Dictionary with multiple counters for the same key"""

    def __init__(self, size):
        self.default_factory = lambda: [0]*size
        self.size = size

    def incr(self, key, *counts):
        for i in range(0, self.size):
            self[key][i] += counts[i]

    @staticmethod
    def compress(size, counts):
        compress_from = size-1
        last_val = counts[compress_from]
        while compress_from > 0 and last_val == counts[compress_from-1]:
            compress_from -= 1
        if compress_from == 0:
            return counts[0]
        else:
            return counts[0:compress_from+1]

    def get_compressed(self, key):
        return MultiCount.compress(self.size, self.get(key))

    @staticmethod
    def get_count(index, value):
        if isinstance(value, int):
            return value
        if len(value) <= index:
            return value[-1]
        return value[index]

    @staticmethod
    def sum_values(values, compress=True):
        counts = [0]
        size = 1
        for val in values:
            if isinstance(val, int):
                # compressed count, one unique count
                for i in range(0, size):
                    counts[i] += val
            else:
                if len(val) >= size:
                    # enlarge counts array
                    base_count = counts[-1]
                    for j in range(size, len(val)):
                        counts.append(base_count)
                    size = len(val)
                for i in range(0, len(val)):
                    counts[i] += val[i]
                if len(val) < size:
                    for j in range(i+1, size):
                        # add compressed counts
                        counts[j] += val[i]
        if compress:
            return MultiCount.compress(size, counts)
        else:
            return counts


class CrawlStatsJSONEncoder(json.JSONEncoder):

    def default(self, o):
        if isinstance(o, MonthlyCrawlSet):
            return o.get_bits()
        if isinstance(o, HyperLogLog):
            return CrawlStatsJSONEncoder.json_encode_hyperloglog(o)
        return json.JSONEncoder.default(self, o)

    @staticmethod
    def json_encode_hyperloglog(o):
        return {'__type__': 'HyperLogLog',
                'card': o.card(),
                'p': o.p, 'M': o.M, 'm': o.m, 'alpha': o.alpha}


class CrawlStatsJSONDecoder(json.JSONDecoder):

    def __init__(self, *args, **kargs):
        json.JSONDecoder.__init__(self, object_hook=self.dict_to_object,
                                  *args, **kargs)

    def dict_to_object(self, dic):
        if '__type__' not in dic:
            return dic
        if dic['__type__'] == 'HyperLogLog':
            try:
                return CrawlStatsJSONDecoder.json_decode_hyperloglog(dic)
            except Exception as e:
                LOG.error('Cannot decode object of type {0}'.format(
                    dic['__type__']))
                raise e
        return dic

    @staticmethod
    def json_decode_hyperloglog(dic):
        hll = HyperLogLog(HYPERLOGLOG_ERROR)
        hll.p = dic['p']
        hll.m = dic['m']
        hll.alpha = dic['alpha']
        hll.M = dic['M']
        return hll


class HostDomainCount:
    """Counts requiring URL parsing (host, domain, TLD, scheme).
    For each item both total pages and unique URLs are counted.
    """

    IPpattern = re.compile(r'^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}$')

    def __init__(self):
        self.hosts = MultiCount(2)
        self.schemes = MultiCount(2)

    def add(self, url, count):
        uri = urlparse(url)
        host = uri.hostname
        if host is not None:
            host = host.lower().strip('.')
            self.hosts.incr(host, count, 1)
        self.schemes.incr(uri.scheme, count, 1)

    def output(self, crawl):
        domains = MultiCount(3)  # pages, URLs, hosts
        tlds = MultiCount(4)     # pages, URLs, hosts, domains
        for scheme, counts in self.schemes.items():
            yield (CST.scheme.value, scheme, crawl), counts
        for host, counts in self.hosts.items():
            yield (CST.host.value, host, crawl), counts
            try:
                parsedhost = tldextract.extract(host)
                hosttld = parsedhost.suffix
            except TypeError as e:
                LOG.error('Failed to parse host {}: {}'.format(host, e))
                hosttld = None
            if hosttld is None:
                hostdomain = '(invalid)'
            elif hosttld == '':
                hostdomain = parsedhost.domain
                if self.IPpattern.match(host):
                    hosttld = '(ip address)'
            else:
                hostdomain = '.'.join([parsedhost.domain, parsedhost.suffix])
            domains.incr((hostdomain, hosttld),
                         counts[0], counts[1], 1)
        for dom, counts in domains.items():
            tlds.incr(dom[1], counts[0], counts[1], counts[2], 1)
            yield (CST.domain.value, dom[0], crawl), counts
        for tld, counts in tlds.items():
            yield (CST.tld.value, tld, crawl), counts


class SurtDomainCount:
    """Counters for one single SURT prefix/domain."""

    robots_txt_warc_pattern = re.compile(r'/robotstxt/')

    def __init__(self, surt_domain):
        self.surt_domain = surt_domain
        self.pages = 0
        self.url = defaultdict(int)
        self.digest = defaultdict(lambda: [0, 0])
        self.mime = defaultdict(lambda: [0, 0])
        self.mime_detected = defaultdict(lambda: [0, 0])
        self.charset = defaultdict(lambda: [0, 0])
        self.languages = defaultdict(lambda: [0, 0])
        self.http_status = defaultdict(int)
        self.robotstxt_status = defaultdict(lambda: [0, 0])
        self.robotstxt_url = defaultdict(int)

    def add(self, _path, metadata):
        status = -1
        if 'status' in metadata:
            status = int(metadata['status'])
        if self.robots_txt_warc_pattern.search(metadata['filename']):
            self.robotstxt_status[status][0] += 1
            if metadata['url'] not in self.robotstxt_url:
                self.robotstxt_status[status][1] += 1
            self.robotstxt_url[metadata['url']] += 1
            # do not count robots.txt responses as "ordinary" pages
            return
        self.http_status[status] += 1
        if status != 200:
            # skip content-related metrics for non-200 responses
            return
        self.pages += 1
        mime = 'unk'
        if 'mime' in metadata:
            mime = metadata['mime']
        self.mime[mime][0] += 1
        mime_detected = None
        if 'mime-detected' in metadata:
            mime_detected = metadata['mime-detected']
            self.mime_detected[mime_detected][0] += 1
        charset = None
        if 'charset' in metadata:
            charset = metadata['charset']
            self.charset[charset][0] += 1
        languages = None
        if 'languages' in metadata:
            languages = metadata['languages']
            self.languages[languages][0] += 1
        digest = None
        if 'digest' in metadata:
            digest = metadata['digest']
            self.digest[digest][0] += 1
        if metadata['url'] not in self.url:
            if digest:
                self.digest[digest][1] += 1
            self.mime[mime][1] += 1
            if mime_detected:
                self.mime_detected[mime_detected][1] += 1
            if languages:
                self.languages[languages][1] += 1
            if charset:
                self.charset[charset][1] += 1
        self.url[metadata['url']] += 1

    def unique_urls(self):
        return len(self.url)

    def output(self, crawl, exact_count=True, min_surt_hll_size=50000):
        counts = (self.pages, self.unique_urls())
        host_domain_count = HostDomainCount()
        surt_hll = None
        if self.unique_urls() >= min_surt_hll_size:
            surt_hll = HyperLogLog(HYPERLOGLOG_ERROR)
        for url, count in self.url.items():
            host_domain_count.add(url, count)
            if exact_count:
                yield (CST.url.value, self.surt_domain, url), (crawl, count)
            if surt_hll is not None:
                surt_hll.add(url)
        if exact_count:
            for digest, counts in self.digest.items():
                yield (CST.digest.value, digest), (crawl, counts)
        for mime, counts in self.mime.items():
            yield (CST.mimetype.value, mime, crawl), counts
        for mime, counts in self.mime_detected.items():
            yield (CST.mimetype_detected.value, mime, crawl), counts
        for charset, counts in self.charset.items():
            yield (CST.charset.value, charset, crawl), counts
        for languages, counts in self.languages.items():
            yield (CST.languages.value, languages, crawl), counts
            # yield primary language
            prim_l = languages.split(',')[0]
            yield (CST.primary_language.value, prim_l, crawl), counts
        for key, val in host_domain_count.output(crawl):
            yield key, val
        yield((CST.surt_domain.value, self.surt_domain, crawl),
              (self.pages, self.unique_urls(), len(host_domain_count.hosts)))
        if surt_hll is not None:
            yield((CST.size_estimate_for.value, CST.surt_domain.value,
                   self.surt_domain, CST.url.value, crawl),
                  (self.unique_urls(),
                   CrawlStatsJSONEncoder.json_encode_hyperloglog(surt_hll)))
        for status, counts in self.http_status.items():
            yield (CST.http_status.value, status, crawl), counts
        for url, count in self.robotstxt_url.items():
            yield (CST.size_robotstxt.value, CST.url.value, crawl), 1
            yield (CST.size_robotstxt.value, CST.page.value, crawl), count
        for status, counts in self.robotstxt_status.items():
            yield (CST.robotstxt_status.value, status, crawl), counts


class UnhandledTypeError(Exception):
    def __init__(self, outputType):
        self.message = 'Unhandled type {}\n'.format(outputType)


class InputError(Exception):
    def __init__(self, message):
        self.message = message


class CCStatsJob(MRJob):
    '''Job to get crawl statistics from Common Crawl index
       --job=count
            run count job (first step) to get counts
            from Common Crawl index files (cdx-*.gz)
       --job=stats
            run statistics job (second step) on output
            from count job'''

    OUTPUT_PROTOCOL = JSONProtocol

    JOBCONF = {
        'mapreduce.task.timeout': '9600000',
        'mapreduce.map.speculative': 'false',
        'mapreduce.reduce.speculative': 'false',
        'mapreduce.job.jvm.numtasks': '-1',
    }

    s3pattern = re.compile(r'^s3://([^/]+)/(.+)')
    gzpattern = re.compile(r'\.gz$')
    crawlpattern = re.compile(r'(CC-MAIN-2\d{3}-\d{2})')

    def configure_args(self):
        """Custom command line options for common crawl index statistics"""
        super(CCStatsJob, self).configure_args()
        self.add_passthru_arg(
            '--job', dest='job_to_run',
            default='', choices=['count', 'stats', ''],
            help='''Job(s) to run ("count", "stats", or empty to run both)''')
        self.add_passthru_arg(
            '--exact-counts', dest='exact_counts',
            action='store_true', default=None,
            help='''Exact counts for URLs and content digests,
                    this increases the output size significantly''')
        self.add_passthru_arg(
            '--no-exact-counts', dest='exact_counts',
            action='store_false', default=None,
            help='''No exact counts for URLs and content digests
                    to save storage space and computation time''')
        self.add_passthru_arg(
            '--max-top-hosts-domains', dest='max_hosts',
            type=int, default=200,
            help='''Max. number of most frequent hosts or domains shown
                    in final statistics (cf. --min-urls-top-host-domain)''')
        self.add_passthru_arg(
            '--min-urls-top-host-domain', dest='min_domain_frequency',
            type=int, default=1,
            help='''Min. number of URLs required per host or domain shown
                    in final statistics (cf. --max-top-hosts-domains).''')
        self.add_passthru_arg(
            '--min-lang-comb-freq', dest='min_lang_comb_freq',
            type=int, default=1,
            help='''Min. number of pages required for a combination of detected
                    languages to be shown in final statistics.''')
        self.add_passthru_arg(
            '--crawl', dest='crawl', default=None,
            help='''ID/name of the crawl analyzed (if not given detected
                    from input path)''')

    def input_protocol(self):
        if self.options.job_to_run != 'stats':
            LOG.debug('Reading text input from cdx files')
            return RawValueProtocol()
        LOG.debug('Reading JSON input from count job')
        return JSONProtocol()

    def hadoop_input_format(self):
        input_format = self.HADOOP_INPUT_FORMAT
        if self.options.job_to_run != 'stats':
            input_format = 'org.apache.hadoop.mapred.TextInputFormat'
        LOG.info("Setting input format for {} job: {}".format(
            self.options.job_to_run, input_format))
        return input_format

    def count_mapper_init(self):
        """Because cdx.gz files cannot be split and
        mapreduce.input.fileinputformat.split.minsize is set to a value larger
        than any cdx.gz file, the mapper is guaranteed to process the content
        of a single cdx file. Input lines of a cdx file are sorted by SURT URL
        which allows to aggregate URL counts for one SURT domain in memory.
        It may happen that one SURT domain spans over multiple cdx files.
        In this case (and without --exact-counts) the count of unique URLs
        and the URL histograms may be slightly off in case the same URL occurs
        also in a second cdx file. However, this problem is negligible because
        there are only 300 cdx files."""
        self.counters = Counter()
        self.cdx_path = os.environ['mapreduce_map_input_file']
        LOG.info('Reading {0}'.format(self.cdx_path))
        self.crawl_name = None
        self.crawl = None
        if self.options.crawl is not None:
            self.crawl_name = self.options.crawl
        else:
            crawl_name_match = self.crawlpattern.search(self.cdx_path)
            if crawl_name_match is not None:
                self.crawl_name = crawl_name_match.group(1)
            else:
                raise InputError(
                    "Cannot determine ID of monthly crawl from input path {}"
                    .format(self.cdx_path))
        if self.crawl_name is None:
            raise InputError("Name of crawl not given")
        self.crawl = MonthlyCrawl.get_by_name(self.crawl_name)
        self.fetches_total = 0
        self.pages_total = 0
        self.urls_total = 0
        self.urls_hll = HyperLogLog(HYPERLOGLOG_ERROR)
        self.digest_hll = HyperLogLog(HYPERLOGLOG_ERROR)
        self.url_histogram = Counter()
        self.count = None
        # first and last SURT may continue in previous/next cdx
        self.min_surt_hll_size = 1
        self.increment_counter('cdx-stats', 'cdx files processed', 1)

    def count_mapper(self, _, line):
        self.fetches_total += 1
        if (self.fetches_total % 1000) == 0:
            self.increment_counter('cdx-stats', 'cdx lines read', 1000)
            if (self.fetches_total % 100000) == 0:
                LOG.info('Read {0} cdx lines'.format(self.fetches_total))
            else:
                LOG.debug('Read {0} cdx lines'.format(self.fetches_total))
        parts = line.split(' ')
        [surt_domain, path] = parts[0].split(')', 1)
        if self.count is None:
            self.count = SurtDomainCount(surt_domain)
        if surt_domain != self.count.surt_domain:
            # output accumulated statistics for one SURT domain
            for pair in self.count.output(self.crawl,
                                          self.options.exact_counts,
                                          self.min_surt_hll_size):
                yield pair
            self.urls_total += self.count.unique_urls()
            for url, cnt in self.count.url.items():
                self.urls_hll.add(url)
                self.url_histogram[cnt] += 1
            for digest in self.count.digest:
                self.digest_hll.add(digest)
            self.pages_total += self.count.pages
            self.count = SurtDomainCount(surt_domain)
            self.min_surt_hll_size = MIN_SURT_HLL_SIZE
        json_string = ' '.join(parts[2:])
        try:
            metadata = ujson.loads(json_string)
            self.count.add(path, metadata)
        except ValueError as e:
            LOG.error('Failed to parse json: {0} - {1}'.format(
                e, json_string))

    def count_mapper_final(self):
        self.increment_counter('cdx-stats',
                               'cdx lines read', self.fetches_total % 1000)
        if self.count is None:
            return
        for pair in self.count.output(self.crawl, self.options.exact_counts, 1):
            yield pair
        self.urls_total += self.count.unique_urls()
        for url, cnt in self.count.url.items():
            self.urls_hll.add(url)
            self.url_histogram[cnt] += 1
        for digest in self.count.digest:
            self.digest_hll.add(digest)
        self.pages_total += self.count.pages
        if not self.options.exact_counts:
            for count, frequency in self.url_histogram.items():
                yield((CST.histogram.value, CST.url.value, self.crawl,
                       CST.page.value, count), frequency)
        yield (CST.size.value, CST.page.value, self.crawl), self.pages_total
        yield (CST.size.value, CST.fetch.value, self.crawl), self.fetches_total
        if not self.options.exact_counts:
            yield (CST.size.value, CST.url.value, self.crawl), self.urls_total
        yield((CST.size_estimate.value, CST.url.value, self.crawl),
              CrawlStatsJSONEncoder.json_encode_hyperloglog(self.urls_hll))
        yield((CST.size_estimate.value, CST.digest.value, self.crawl),
              CrawlStatsJSONEncoder.json_encode_hyperloglog(self.digest_hll))
        self.increment_counter('cdx-stats', 'cdx files finished', 1)

    def reducer_init(self):
        self.counters = Counter()
        self.mostfrequent = defaultdict(list)

    def count_reducer(self, key, values):
        outputType = key[0]
        if outputType in (CST.size.value, CST.size_robotstxt.value):
            yield key, sum(values)
        elif outputType == CST.histogram.value:
            yield key, sum(values)
        elif outputType in (CST.url.value, CST.digest.value):
            # only with --exact-counts
            crawls = MonthlyCrawlSet()
            new_crawls = set()
            page_count = MultiCount(2)
            for val in values:
                if type(val) is list:
                    if (outputType == CST.url.value):
                        (crawl, pages) = val
                        page_count.incr(crawl, pages, 1)
                    else:  # digest
                        (crawl, (pages, urls)) = val
                        page_count.incr(crawl, pages, urls)
                    crawls.add(crawl)
                    new_crawls.add(crawl)
                else:
                    # crawl set bit mask
                    crawls.update(val)
            yield key, crawls.get_bits()
            for new_crawl in new_crawls:
                if crawls.is_new(new_crawl):
                    self.counters[(CST.new_items.value,
                                   outputType, new_crawl)] += 1
            # url/digest duplicate histograms
            for crawl, counts in page_count.items():
                items = (1+counts[0]-counts[1])
                self.counters[(CST.histogram.value, outputType,
                               crawl, CST.page.value, items)] += 1
            # size in terms of unique URLs and unique content digests
            for crawl, counts in page_count.items():
                self.counters[(CST.size.value, outputType, crawl)] += 1
        elif outputType in (CST.mimetype.value,
                            CST.mimetype_detected.value,
                            CST.charset.value,
                            CST.languages.value,
                            CST.primary_language.value,
                            CST.scheme.value,
                            CST.tld.value,
                            CST.domain.value,
                            CST.surt_domain.value,
                            CST.host.value,
                            CST.http_status.value,
                            CST.robotstxt_status.value):
            yield key, MultiCount.sum_values(values)
        elif outputType == CST.size_estimate.value:
            hll = HyperLogLog(HYPERLOGLOG_ERROR)
            for val in values:
                hll.update(
                    CrawlStatsJSONDecoder.json_decode_hyperloglog(val))
            yield(key,
                  CrawlStatsJSONEncoder.json_encode_hyperloglog(hll))
        elif outputType == CST.size_estimate_for.value:
            res = None
            hll = None
            cnt = 0
            for val in values:
                if res:
                    if hll is None:
                        cnt = res[0]
                        hll = CrawlStatsJSONDecoder.json_decode_hyperloglog(res[1])
                    cnt += val[0]
                    hll.update(CrawlStatsJSONDecoder.json_decode_hyperloglog(val[1]))
                else:
                    res = val
            if hll is not None and cnt >= MIN_SURT_HLL_SIZE:
                yield(key, (cnt, CrawlStatsJSONEncoder.json_encode_hyperloglog(hll)))
            elif res[0] >= MIN_SURT_HLL_SIZE:
                yield(key, res)
        else:
            raise UnhandledTypeError(outputType)

    def stats_mapper_init(self):
        self.counters = Counter()

    def stats_mapper(self, key, value):
        if key[0] in (CST.url.value, CST.digest.value,
                      CST.size_estimate_for.value):
            return
        if ((self.options.min_domain_frequency > 1) and
            (key[0] in (CST.host.value, CST.domain.value,
                        CST.surt_domain.value))):
            # quick skip of infrequent host and domains,
            # significantly limits amount of tuples processed in reducer
            page_count = MultiCount.get_count(0, value)
            url_count = MultiCount.get_count(1, value)
            self.counters[(CST.size.value, key[0], key[2])] += 1
            self.counters[(CST.histogram.value, key[0],
                           key[2], CST.page.value, page_count)] += 1
            self.counters[(CST.histogram.value, key[0],
                           key[2], CST.url.value, url_count)] += 1
            if key[0] in (CST.domain.value, CST.surt_domain.value):
                host_count = MultiCount.get_count(2, value)
                self.counters[(CST.histogram.value, key[0],
                               key[2], CST.host.value, host_count)] += 1
            if url_count < self.options.min_domain_frequency:
                return
        if key[0] == CST.languages.value:
            # yield only frequent language combinations (if configured)
            page_count = MultiCount.get_count(0, value)
            if ((self.options.min_lang_comb_freq > 1) and
                    (page_count < self.options.min_lang_comb_freq) and
                    (',' in key[1])):
                return
        yield key, value

    def stats_mapper_final(self):
        for (counter, count) in self.counters.items():
            yield counter, count

    def stats_reducer(self, key, values):
        outputType = CST(key[0])
        item = key[1]
        crawl = MonthlyCrawl.to_name(key[2])
        if outputType in (CST.size, CST.new_items,
                          CST.size_estimate, CST.size_robotstxt):
            verbose_key = (outputType.name, CST(item).name, crawl)
            if outputType in (CST.size, CST.size_robotstxt):
                val = sum(values)
            elif outputType == CST.new_items:
                val = MultiCount.sum_values(values)
            elif outputType == CST.size_estimate:
                # already "reduced" in count job
                for val in values:
                    break
            yield verbose_key, val
        elif outputType == CST.histogram:
            yield((outputType.name, CST(item).name, crawl,
                   CST(key[3]).name, key[4]), sum(values))
        elif outputType in (CST.mimetype, CST.mimetype_detected, CST.charset,
                            CST.languages, CST.primary_language, CST.scheme,
                            CST.surt_domain, CST.tld, CST.domain, CST.host,
                            CST.http_status, CST.robotstxt_status):
            item = key[1]
            for counts in values:
                page_count = MultiCount.get_count(0, counts)
                url_count = MultiCount.get_count(1, counts)
                if outputType in (CST.domain, CST.surt_domain, CST.tld):
                    host_count = MultiCount.get_count(2, counts)
                if (self.options.min_domain_frequency <= 1 or
                    outputType not in (CST.host, CST.domain,
                                       CST.surt_domain)):
                    self.counters[(CST.size.name, outputType.name, crawl)] += 1
                    self.counters[(CST.histogram.name, outputType.name,
                                   crawl, CST.page.name, page_count)] += 1
                    self.counters[(CST.histogram.name, outputType.name,
                                   crawl, CST.url.name, url_count)] += 1
                    if outputType in (CST.domain, CST.surt_domain, CST.tld):
                        self.counters[(CST.histogram.name, outputType.name,
                                       crawl, CST.host.name, host_count)] += 1
                if outputType == CST.tld:
                    domain_count = MultiCount.get_count(3, counts)
                    self.counters[(CST.histogram.name, outputType.name,
                                   crawl, CST.domain.name, domain_count)] += 1
                if outputType in (CST.domain, CST.host, CST.surt_domain):
                    outKey = (outputType.name, crawl)
                    outVal = (page_count, url_count, item)
                    if outputType in (CST.domain, CST.surt_domain):
                        outVal = (page_count, url_count, host_count, item)
                    # take most common
                    if len(self.mostfrequent[outKey]) < self.options.max_hosts:
                        heapq.heappush(self.mostfrequent[outKey], outVal)
                    else:
                        heapq.heappushpop(self.mostfrequent[outKey], outVal)
                else:
                    yield((outputType.name, item, crawl), counts)
        else:
            raise UnhandledTypeError(outputType)

    def reducer_final(self):
        for (counter, count) in self.counters.items():
            yield counter, count
        for key, mostfrequent in self.mostfrequent.items():
            (outputType, crawl) = key
            if outputType in (CST.domain.name, CST.surt_domain.name):
                for (pages, urls, hosts, item) in mostfrequent:
                    yield((outputType, item, crawl),
                          MultiCount.compress(3, [pages, urls, hosts]))
            else:
                for (pages, urls, item) in mostfrequent:
                    yield((outputType, item, crawl),
                          MultiCount.compress(2, [pages, urls]))

    def steps(self):
        reduces = 10
        cdxminsplitsize = 2**32  # do not split cdx map input files
        if self.options.exact_counts:
            # with exact counts need many reducers to aggregate the counts
            # in reasonable time and to get not too large partitions
            reduces = 200
        count_job = \
            MRStep(mapper_init=self.count_mapper_init,
                   mapper=self.count_mapper,
                   mapper_final=self.count_mapper_final,
                   reducer_init=self.reducer_init,
                   reducer=self.count_reducer,
                   reducer_final=self.reducer_final,
                   jobconf={'mapreduce.job.reduces': reduces,
                            'mapreduce.input.fileinputformat.split.minsize':
                                cdxminsplitsize,
                            'mapreduce.output.fileoutputformat.compress':
                                "true",
                            'mapreduce.output.fileoutputformat.compress.codec':
                                'org.apache.hadoop.io.compress.BZip2Codec'})
        stats_job = \
            MRStep(mapper_init=self.stats_mapper_init,
                   mapper=self.stats_mapper,
                   mapper_final=self.stats_mapper_final,
                   reducer_init=self.reducer_init,
                   reducer=self.stats_reducer,
                   reducer_final=self.reducer_final,
                   jobconf={'mapreduce.job.reduces': 1,
                            'mapreduce.output.fileoutputformat.compress':
                                "true",
                            'mapreduce.output.fileoutputformat.compress.codec':
                                'org.apache.hadoop.io.compress.GzipCodec'})
        if self.options.job_to_run == 'count':
            return [count_job]
        if self.options.job_to_run == 'stats':
            return [stats_job]
        return [count_job, stats_job]


if __name__ == '__main__':
    CCStatsJob.run()


================================================
FILE: get_stats.sh
================================================
#!/bin/bash

set -o pipefail

if aws s3 ls s3://commoncrawl/crawl-analysis/ | sed -E 's@.* @@; s@/$@@' >./stats/crawls.txt; then
    ON_AWS=true;
    echo "Running on AWS (AWS CLI configured for authenticated access)"
else
    echo "Downloading from https://data.commoncrawl.org/ using curl"
    # list of crawls enumerated in crawlstats.py
    python3 -c 'from crawlstats import MonthlyCrawl; [print(c) for c in sorted(MonthlyCrawl.by_name.keys())]' >./stats/crawls.txt
    ON_AWS=false
fi

while read crawl; do
    echo $crawl
    if [ -e stats/$crawl.gz ]; then
        echo "  ... exists"
        continue
    fi
    if $ON_AWS; then
        aws s3 cp s3://commoncrawl/crawl-analysis/$crawl/stats/part-00000.gz ./stats/$crawl.gz
    else
        curl --silent https://data.commoncrawl.org/crawl-analysis/$crawl/stats/part-00000.gz >./stats/$crawl.gz
    fi
done <./stats/crawls.txt


================================================
FILE: get_stats_and_plot.sh
================================================
#!/bin/bash
set -e

echo "Starting ..."

./get_stats.sh

# make sure plot directories exist
mkdir -p plots/crawler
mkdir -p plots/crawloverlap
mkdir -p plots/crawlsize
mkdir -p plots/throughput
mkdir -p plots/tld

./plot.sh

echo "Done."

================================================
FILE: index.md
================================================
Statistics of Common Crawl Monthly Archives
===========================================

Statistics of [Common Crawl](https://commoncrawl.org/)'s [web archives](https://commoncrawl.org/the-data/get-started/) released on a monthly base:

* [size of the crawls](plots/crawlsize) - number of pages, unique URLs, hosts, domains, top-level domains (public suffixes), cumulative growth of crawled data over time
* [top-level domains](plots/tlds) - distribution and comparison
* [top-500 registered domains](plots/domains.md)
* [crawler-related metrics](plots/crawlermetrics) - fetch status, etc.
* [overlaps between monthly crawls](plots/crawloverlap)
* distribution of
    - [media types (MIME)](plots/mimetypes)
	- [character encodings](plots/charsets.md)
	- [languages](plots/languages.md)

All metrics presented here are generated from [Common Crawl's URL index](https://index.commoncrawl.org/) data using the code of the [cc-crawl-statistics project](https://github.com/commoncrawl/cc-crawl-statistics). Inspired by Sebastian Spiegler's [Statistics of the Common Crawl Corpus 2012](https://commoncrawl.org/2013/08/a-look-inside-common-crawls-210tb-2012-web-corpus/).

See also our [Web Graph statistics](https://commoncrawl.github.io/cc-webgraph-statistics/).



================================================
FILE: plot/charset.py
================================================
import sys

from plot.table import TabularStats
from crawlstats import CST, MonthlyCrawl


class CharsetStats(TabularStats):

    MIN_AVERAGE_COUNT = 500
    MAX_CHARSETS = 100

    def __init__(self):
        super().__init__()
        self.MAX_TYPE_VALUES = CharsetStats.MAX_CHARSETS

    def add(self, key, val):
        self.add_check_type(key, val, CST.charset)


if __name__ == '__main__':
    plot_crawls = sys.argv[1:]
    plot_name = 'charsets'
    column_header = 'charset'
    if len(plot_crawls) == 0:
        plot_crawls = MonthlyCrawl.get_latest(3)
        print(plot_crawls)
    else:
        plot_name += '-' + '-'.join(plot_crawls)
    plot = CharsetStats()
    plot.read_from_stdin_or_file()
    plot.transform_data(CharsetStats.MAX_CHARSETS,
                        CharsetStats.MIN_AVERAGE_COUNT,
                        None)
    plot.save_data_percentage(plot_name, dir_name='plots', type_name='charset')
    plot.plot(plot_crawls, plot_name, column_header)


================================================
FILE: plot/crawl_size.py
================================================
"""
Plot crawl size metrics over time.

This module generates visualizations of crawl size statistics including:
- Monthly crawl sizes (pages, URLs, content digests)
- Cumulative sizes over time
- New URLs per crawl
- URL status by year (new, revisit, duplicate)
- Domain/host/TLD counts

The plots show the growth and evolution of the Common Crawl archive.
"""

import os
import re
import types
from collections import defaultdict

import pandas
from hyperloglog import HyperLogLog

from crawlplot import CrawlPlot
from crawlstats import CST, CrawlStatsJSONDecoder, HYPERLOGLOG_ERROR, MonthlyCrawl


class CrawlSizePlot(CrawlPlot):
    """Generate plots showing crawl size metrics over time.

    Tracks various size metrics including page counts, unique URLs,
    unique content digests, and cumulative statistics across crawls.
    Uses HyperLogLog for efficient cardinality estimation.
    """

    def __init__(self):
        super().__init__()

        self.size = defaultdict(dict)
        self.size_by_type = defaultdict(dict)
        self.type_index = defaultdict(dict)
        self.crawls = {}
        self.ncrawls = 0
        self.hll = defaultdict(dict)
        self.N = 0
        self.sum_counts = False

    def add(self, key, val):
        """Process a size or size_estimate record from statistics data."""
        cst = CST[key[0]]
        if cst not in (CST.size, CST.size_estimate):
            return
        item_type = key[1]
        crawl = key[2]
        count = 0
        if cst == CST.size_estimate:
            item_type = ' '.join([item_type, 'estim.'])
            hll = CrawlStatsJSONDecoder.json_decode_hyperloglog(val)
            count = len(hll)
            self.hll[item_type][crawl] = hll
        elif cst == CST.size:
            count = val
        self.add_by_type(crawl, item_type, count)

    def add_by_type(self, crawl, item_type, count):
        """Add a count for a specific crawl and item type combination."""
        if crawl not in self.crawls:
            self.crawls[crawl] = self.ncrawls
            self.size['crawl'][self.ncrawls] = crawl
            date = pandas.Timestamp(MonthlyCrawl.date_of(crawl))
            self.size['date'][self.ncrawls] = date
            self.ncrawls += 1
        else:
            date = self.size['date'][self.crawls[crawl]]
        if item_type in self.size and \
                self.crawls[crawl] in self.size[item_type]:
            # add count to existing record?
            if self.sum_counts:
                count += self.size[item_type][self.crawls[crawl]]
                self.size[item_type][self.crawls[crawl]] = count
                _N = self.type_index[item_type][self.crawls[crawl]]
                self.size_by_type['size'][_N] = count
            return
        self.size[item_type][self.crawls[crawl]] = count
        self.size_by_type['crawl'][self.N] = crawl
        self.size_by_type['date'][self.N] = date
        self.size_by_type['type'][self.N] = item_type
        self.size_by_type['size'][self.N] = count
        self.type_index[item_type][self.crawls[crawl]] = self.N
        self.N += 1

    def cumulative_size(self):
        """Calculate cumulative sizes across crawls using HyperLogLog unions."""
        latest_n_crawls_cumul = [2, 3, 4, 6, 9, 12]
        total_pages = 0
        sorted_crawls = sorted(self.crawls)
        for crawl in sorted_crawls:
            total_pages += self.size['page'][self.crawls[crawl]]
            self.add_by_type(crawl, 'page cumul.', total_pages)
        urls_cumul = defaultdict(dict)
        for item_type in self.hll.keys():
            item_type_cumul = ' '.join([item_type, 'cumul.'])
            item_type_new = ' '.join([item_type, 'new'])
            cumul_hll = HyperLogLog(HYPERLOGLOG_ERROR)
            n = 0
            hlls = []
            for crawl in sorted(self.hll[item_type]):
                n += 1
                hll = self.hll[item_type][crawl]
                last_cumul_hll_len = len(cumul_hll)
                cumul_hll.update(hll)
                # cumulative size
                self.add_by_type(crawl, item_type_cumul, len(cumul_hll))
                # new unseen items this crawl (since the first analyzed crawl)
                unseen = (len(cumul_hll) - last_cumul_hll_len)
                if unseen > len(hll):
                    # 1% error rate for cumulative HLLs is large in comparison
                    # to crawl size, adjust to size of items in this crawl
                    # (there can be no more new items than the size of the crawl)
                    unseen = len(hll)
                self.add_by_type(crawl, item_type_new, unseen)
                hlls.append(hll)
                # cumulative size for last N crawls
                for n_crawls in latest_n_crawls_cumul:
                    item_type_n_crawls = '{} cumul. last {} crawls'.format(
                        item_type, n_crawls)
                    if n_crawls <= len(hlls):
                        cum_hll = HyperLogLog(HYPERLOGLOG_ERROR)
                        for i in range(1, (n_crawls+1)):
                            if i > len(hlls):
                                break
                            cum_hll.update(hlls[-i])
                        size_last_n = len(cum_hll)
                        if item_type == 'url estim.':
                            urls_cumul[crawl][str(n_crawls)] = size_last_n
                    else:
                        size_last_n = 'nan'
                    self.add_by_type(crawl, item_type_n_crawls, size_last_n)
        for n, crawl in enumerate(sorted_crawls):
            for n_crawls in latest_n_crawls_cumul:
                if n_crawls > (n+1):
                    self.add_by_type(crawl,
                                     'page cumul. last {} crawls'.format(n_crawls),
                                     'nan')
                    continue
                cumul_pages = 0
                for c in sorted_crawls[(1+n-n_crawls):(n+1)]:
                    cumul_pages += self.size['page'][self.crawls[c]]
                self.add_by_type(crawl,
                                 'page cumul. last {} crawls'.format(n_crawls),
                                 cumul_pages)
                urls_cumul[crawl][str(n_crawls)] = urls_cumul[crawl][str(n_crawls)]/cumul_pages
        for crawl in urls_cumul:
            for n_crawls in urls_cumul[crawl]:
                self.add_by_type(crawl,
                                 'URLs/pages last {} crawls'.format(n_crawls),
                                 urls_cumul[crawl][n_crawls])

    def transform_data(self):
        """Convert internal dictionaries to pandas DataFrames."""
        self.size = pandas.DataFrame(self.size)
        self.size_by_type = pandas.DataFrame(self.size_by_type)

    def save_data(self):
        """Save size data to CSV files."""
        self.size.to_csv('data/crawlsize.csv')
        self.size_by_type.to_csv('data/crawlsizebytype.csv')

    def duplicate_ratio(self):
        """Calculate and save URL and content duplicate ratios per crawl."""
        data = self.size[['crawl', 'page', 'url', 'digest estim.']]
        data['1-(urls/pages)'] = 100 * (1.0 - (data['url'] / data['page']))
        data['1-(digests/pages)'] = \
            100 * (1.0 - (data['digest estim.'] / data['page']))
        floatf = '{0:.1f}%'.format
        print(data.to_string(formatters={'1-(urls/pages)': floatf,
                                         '1-(digests/pages)': floatf}),
              file=open('data/crawlduplicates.txt', 'w'))

    def plot(self):
        """Generate all crawl size plots."""
        # Size per crawl (pages, URL and content digest)
        row_types = ['page', 'url', 'digest estim.']
        self.size_plot(self.size_by_type, row_types, '',
                       'Crawl Size', 'Pages / Unique Items',
                       'crawlsize/monthly.png',
                       data_export_csv='crawlsize/monthly.csv')
        # -- cumulative size
        row_types = ['page cumul.', 'url estim. cumul.',
                     'digest estim. cumul.']
        self.size_plot(self.size_by_type, row_types, r' cumul\.$',
                       'Crawl Size Cumulative',
                       'Pages / Unique Items Cumulative',
                       'crawlsize/cumulative.png',
                       data_export_csv='crawlsize/cumulative.csv')
        # -- new URLs per crawl
        row_types = ['url estim. new']
        self.size_plot(self.size_by_type, row_types, '',
                       'New URLs per Crawl (not observed in prior crawls)',
                       'New URLs', 'crawlsize/monthly_new.png',
                       data_export_csv='crawlsize/monthly_new.csv')
        # -- cumulative URLs over last N crawls (this and preceding N-1 crawls)
        row_types = ['url', '1 crawl',  # 'url' replaced by '1 crawl'
                     'url estim. cumul. last 2 crawls',
                     'url estim. cumul. last 3 crawls',
                     'url estim. cumul. last 4 crawls',
                     'url estim. cumul. last 6 crawls',
                     'url estim. cumul. last 9 crawls',
                     'url estim. cumul. last 12 crawls']
        data = self.size_by_type
        data = data[data['type'].isin(row_types)]
        data.replace(to_replace='url', value='1 crawl', inplace=True)
        self.size_plot(data, row_types, r'^url estim\. cumul\. last | crawls?$',
                       'URLs Cumulative Over Last N Crawls',
                       'Unique URLs cumulative',
                       'crawlsize/url_last_n_crawls.png',
                       clabel='n crawls',
                       data_export_csv='crawlsize/url_last_n_crawls.csv')
        # -- ratio unique URLs by total page captures over last N crawls (this and preceding N-1 crawls)
        row_types = ['URLs/pages last 2 crawls',
                     'URLs/pages last 3 crawls',
                     'URLs/pages last 4 crawls',
                     'URLs/pages last 6 crawls',
                     'URLs/pages last 9 crawls',
                     'URLs/pages last 12 crawls']
        data = self.size_by_type
        data = data[data['type'].isin(row_types)]
        data.replace(to_replace='url', value='1 crawl', inplace=True)
        self.size_plot(data, row_types, r'^URLs/pages last | crawls?$',
                       'Ratio Unique URLs / Total Pages Captured Over Last N Crawls',
                       'URLs/Pages',
                       'crawlsize/url_page_ratio_last_n_crawls.png',
                       clabel='n crawls',
                       data_export_csv='crawlsize/url_page_ratio_last_n_crawls.csv')
        # -- cumul. digests over last N crawls (this and preceding N-1 crawls)
        row_types = ['digest estim.', '1 crawl',  # 'url' replaced by '1 crawl'
                     'digest estim. cumul. last 2 crawls',
                     'digest estim. cumul. last 3 crawls',
                     'digest estim. cumul. last 6 crawls',
                     'digest estim. cumul. last 12 crawls']
        data = self.size_by_type
        data = data[data['type'].isin(row_types)]
        data.replace(to_replace='digest estim.', value='1 crawl', inplace=True)
        self.size_plot(data, row_types,
                       r'^digest estim\. cumul\. last | crawls?$',
                       'Content Digest Cumulative Over Last N Crawls',
                       'Unique content digests cumulative',
                       'crawlsize/digest_last_n_crawls.png',
                       clabel='n crawls')
        # -- URLs, hosts, domains, tlds (normalized)
        data = self.size_by_type
        row_types = ['url', 'tld', 'domain', 'host']
        data = data[data['type'].isin(row_types)]
        self.export_csv(data, 'crawlsize/domain.csv')
        # --- domains only (not yet normalized)
        self.size_plot(data[data['type'].isin(['domain'])], '', '',
                       'Unique Domains per Crawl',
                       '', 'crawlsize/registered-domains.png')
        # normalize scale (exponent) of counts so that they fit on one plot
        size_norm = data['size'] / 1000.0
        data['size'] = size_norm.where(data['type'] == 'tld',
                                       other=data['size'])
        data.replace(to_replace='tld', value='tld e+04', inplace=True)
        size_norm = size_norm / 10000.0
        data['size'] = size_norm.where(data['type'] == 'host',
                                       other=data['size'])
        data.replace(to_replace='host', value='host e+07', inplace=True)
        data['size'] = size_norm.where(data['type'] == 'domain',
                                       other=data['size'])
        data.replace(to_replace='domain', value='domain e+07', inplace=True)
        size_norm = size_norm / 100.0
        data['size'] = size_norm.where(data.type == 'url',
                                       other=data['size'])
        data.replace(to_replace='url', value='url e+09', inplace=True)
        self.size_plot(data, '', '',
                       'URLs / Hosts / Domains / TLDs per Crawl',
                       'Unique Items', 'crawlsize/domain.png')
        # -- URL status by year:
        # --   duplicates (pages - URLs), known URLs (URLs - new), new URLs
        data = self.size[['crawl', 'page', 'url', 'url estim. new']]
        data['year'] = data['crawl'].apply(lambda c: int(MonthlyCrawl.year_of(c)))
        by_year = data[['year', 'page', 'url', 'url estim. new']] \
            .groupby('year').agg(sum).reset_index()
        by_year['revisit'] = by_year['url'] - by_year['url estim. new']
        by_year['duplicate'] = by_year['page'] - by_year['url']
        by_year['new'] = by_year['url estim. new']
        print('URL status by year:')
        print(by_year)
        by_year_by_type = by_year[['year', 'new', 'revisit', 'duplicate', 'page']].melt(
            id_vars=['year', 'page'],
            value_vars=['new', 'revisit', 'duplicate'],
            var_name='url_status', value_name='page_captures')
        by_year_by_type['ratio'] = by_year_by_type['page_captures'] / by_year_by_type['page']
        by_year_by_type['perc'] = by_year_by_type['ratio'].apply(lambda x: round((100.0*x), 1)).astype(str) + '%'
        by_year_by_type['year'] = pandas.Categorical(by_year_by_type['year'], ordered=True)
        by_year_by_type['url_status'] = pandas.Categorical(by_year_by_type['url_status'],
                                                           ordered=True,
                                                           categories=['duplicate',
                                                                       'revisit', 'new'])
        by_year_by_type['page_captures'] = by_year_by_type['page_captures'].astype(float)

        # url_status_by_year
        img_path = os.path.join(self.PLOTDIR, 'crawlsize', 'url_status_by_year.png')

        if self.PLOTLIB == "rpy2.ggplot2":
            return self.plot_with_rpy2_ggplot2(by_year_by_type, img_path)
        elif self.PLOTLIB == "matplotlib":
            return self.plot_with_matplotlib(by_year_by_type, img_path)
        else:
            raise ValueError("Invalid PLOTLIB")
        
    def plot_with_rpy2_ggplot2(self, by_year_by_type, img_path):
        """Generate URL status by year stacked bar chart using rpy2/ggplot2."""
        from rpy2.robjects.lib import ggplot2
        from rpy2 import robjects
        from rpy2.robjects import pandas2ri
        pandas2ri.activate()

        p = ggplot2.ggplot(by_year_by_type) \
            + ggplot2.aes_string(x='year', y='page_captures', fill='url_status', label='perc') \
            + ggplot2.geom_bar(stat='identity', position='stack') \
            + ggplot2.geom_text(
                data=by_year_by_type[
                    by_year_by_type['url_status'].isin(['new'])
                    & ~by_year_by_type['year'].isin(by_year_by_type['year'].tolist()[0:3])],
                color='black', size=2,
                position=ggplot2.position_dodge(width=.5)) \
            + self.GGPLOT2_THEME \
            + ggplot2.scale_fill_manual(values=robjects.r('c("duplicate"="#00BA38", "revisit"="#619CFF", "new"="#F8766D")')) \
            + ggplot2.theme(**{'legend.position': 'right',
                            'aspect.ratio': .7,
                            **self.GGPLOT2_THEME_KWARGS},
                            **{'axis.text.x':
                            ggplot2.element_text(angle=45, size=10,
                                                    vjust=1, hjust=1)}) \
            + ggplot2.labs(title='Number of Page Captures', x='', y='', fill='URL status')
        p.save(img_path)

        return p


    def plot_with_matplotlib(self, by_year_by_type, img_path):
        """Generate URL status by year stacked bar chart using matplotlib."""
        import numpy as np

        aspect_ratio = 0.7
        bar_label_fontsize = 5
        title = 'Number of Page Captures'

        fig, ax = self.create_figure()

        # Prepare data for stacked bar chart
        years = by_year_by_type['year'].unique()
        url_statuses = ['new', 'revisit', 'duplicate']
        colors = {'duplicate': '#00BA38', 'revisit': '#619CFF', 'new': '#F8766D'}

        # Create stacked bars
        bottoms = np.zeros(len(years))
        bars = {}

        for status in url_statuses:
            status_data = by_year_by_type[by_year_by_type['url_status'] == status]
            values = []
            labels = []

            for year in years:
                year_data = status_data[status_data['year'] == year]
                if len(year_data) > 0:
                    values.append(year_data['page_captures'].iloc[0])
                    labels.append(year_data['perc'].iloc[0])
                else:
                    values.append(0)
                    labels.append('')

            bars[status] = ax.bar(range(len(years)), values, bottom=bottoms,
                                  color=colors[status], label=status, width=self.bar_width)

            # Add text labels only for 'new' status, excluding first 3 years
            if status == 'new':
                for i, (bar, label) in enumerate(zip(bars[status], labels)):
                    if i >= 3 and label:
                        height = bar.get_height()
                        ax.text(bar.get_x() + bar.get_width() / 2.,
                                bottoms[i] + height, label,
                                ha='center', va='top', color='black',
                                fontsize=bar_label_fontsize)

            bottoms += values

        self.set_title(ax, title)
        ax.set_xlabel('')
        ax.set_ylabel('')

        # Format x-axis
        ax.set_xticks(range(len(years)))
        ax.set_xticklabels(years, rotation=45, ha='right', va='top',
                          fontsize=self.ticks_fontsize)
        ax.set_xlim(-0.5, len(years) - 0.5)

        # Axes ratio
        ax.set_aspect(1 / ax.get_data_ratio() * aspect_ratio)

        # Apply nice y-axis ticks
        self.apply_nice_ticks(ax, axis='y')

        # Grid styling
        ax.grid(True, which='minor', linewidth=self.grid_minor_linewidth,
                color=self.grid_minor_color, zorder=0, axis='both')
        ax.grid(True, which='major', linewidth=self.grid_major_linewidth,
                color=self.grid_major_color, zorder=0, axis='both')
        ax.set_axisbelow(True)

        # Apply ggplot2 style
        self.apply_ggplot2_style(ax, show_grid=False)

        # Set tick colors
        ax.tick_params(axis='y', which='both', colors='#FFFFFF',
                       length=self.ticks_length, width=self.grid_major_linewidth,
                       labelsize=self.ticks_fontsize)
        ax.tick_params(axis='x', which='both', colors='#E6E6E6',
                       length=self.ticks_length, width=self.grid_major_linewidth,
                       labelsize=self.ticks_fontsize)
        self.set_tick_labels_black(ax)

        # Position legend on right side with reversed order
        handles, labels = ax.get_legend_handles_labels()
        legend = ax.legend(handles[::-1], labels[::-1], loc='center left',
                          bbox_to_anchor=(1.0, 0.5), frameon=False,
                          fontsize=self.legend_fontsize, title='URL status',
                          title_fontsize=self.legend_title_fontsize)
        legend._legend_box.align = 'left'

        return self.save_figure(fig, img_path)


    def export_csv(self, data, csv):
        """Export pivot table data to CSV file."""
        if csv is not None:
            data.reset_index().pivot(index='crawl',
                                     columns='type', values='size').to_csv(
                                         os.path.join(self.PLOTDIR, csv))

    def norm_data(self, data, row_filter, type_name_norm):
        """Filter and normalize type names in the data for plotting."""
        if len(row_filter) > 0:
            data = data[data['type'].isin(row_filter)]
        if type_name_norm != '':
            for value in row_filter:
                replacement = value
                if isinstance(type_name_norm, str):
                    if re.search(type_name_norm, value):
                        while re.search(type_name_norm, replacement):
                            replacement = re.sub(type_name_norm,
                                                 '', replacement)
                elif isinstance(type_name_norm, types.FunctionType):
                    replacement = type_name_norm(value)
                if replacement != value:
                    data.replace(to_replace=value, value=replacement,
                                 inplace=True)
        return data

    def size_plot(self, data, row_filter, type_name_norm,
                  title, ylabel, img_file, clabel='', data_export_csv=None,
                  x='date', y='size', c='type'):
        """Generate a size plot with filtering and normalization.

        Args:
            data: DataFrame containing the size data
            row_filter: List of type values to include
            type_name_norm: Regex pattern or function to normalize type names
            title: Plot title
            ylabel: Y-axis label
            img_file: Output filename
            clabel: Legend title
            data_export_csv: Optional CSV export path
            x, y, c: Column names for x-axis, y-axis, and color grouping
        """
        data = self.norm_data(data, row_filter, type_name_norm)
        self.export_csv(data, data_export_csv)
        return self.line_plot(data, title, ylabel, img_file,
                              x=x, y=y, c=c, clabel=clabel, ratio=.9)


if __name__ == '__main__':
    plot = CrawlSizePlot()
    plot.read_from_stdin_or_file()
    plot.cumulative_size()
    plot.transform_data()
    plot.save_data()
    plot.duplicate_ratio()
    plot.plot()


================================================
FILE: plot/crawler_metrics.py
================================================
"""
Plot crawler performance metrics.

This module generates visualizations of crawler metrics including:
- Fetch status breakdown (success, redirect, denied, failed, skipped)
- CrawlDb status counts
- HTTP vs HTTPS URL distribution

These metrics help monitor crawler health and performance over time.
"""

import logging
import os
import re

import pandas

from crawlstats import CST, MultiCount
from crawl_size import CrawlSizePlot


LOGGING_LEVEL = logging.INFO
logging.basicConfig(level=LOGGING_LEVEL)


class CrawlerMetrics(CrawlSizePlot):
    """Generate plots showing crawler performance metrics.

    Tracks fetch statuses, CrawlDb sizes, and URL protocol distribution
    across crawls.
    """

    metrics_map = {
        'fetcher:aggr:redirect': ('fetcher:temp_moved', 'fetcher:moved',
                                  'fetcher:redirect_count_exceeded',
                                  'fetcher:redirect_deduplicated',
                                  # new counter names (NUTCH-3132)
                                  # unchanged: 'fetcher:temp_moved', 'fetcher:moved',
                                  'fetcher:redirect_count_exceeded_total',
                                  'fetcher:redirect_deduplicated_total',
                                  'fetcher:redirect_not_created_total'),
        'fetcher:aggr:denied':   ('fetcher:access_denied',
                                  'fetcher:robots_denied',
                                  'fetcher:robots_denied_maxcrawldelay',
                                  'fetcher:robots_defer_visits_dropped',
                                  'fetcher:filter_denied',
                                  # new counter names (NUTCH-3132)
                                  # unchanged: 'fetcher:access_denied',
                                  'fetcher:robots_denied_total',
                                  'fetcher:robots_denied_maxcrawldelay_total',
                                  'fetcher:robots_defer_visits_dropped_total'),
        'fetcher:aggr:failed':   ('fetcher:gone', 'fetcher:notfound',
                                  'fetcher:exception',
                                  # (no) new counter names (NUTCH-3132)
                                  ),
        'fetcher:aggr:skipped':  ('fetcher:hitByThrougputThreshold',
                                  'fetcher:hitByTimeLimit',
                                  'fetcher:AboveExceptionThresholdInQueue',
                                  'fetcher:filtered',
                                  # new counter names (NUTCH-3132)
                                  'fetcher:hit_by_throughput_threshold_total',
                                  'fetcher:hit_by_timelimit_total',
                                  'fetcher:above_exception_threshold_total',
                                  'fetcher:hit_by_timeout_total',
                                  'fetcher:filtered_total')
    }

    def __init__(self):
        super().__init__()
        self.sum_counts = True

    def add(self, key, val):
        """Process crawl status, size, and scheme records."""
        cst = CST[key[0]]
        item_type = key[1]
        crawl = key[2]
        if not (cst == CST.crawl_status or
                (cst == CST.size and item_type in ('page', 'url'))
                or cst == CST.scheme):
            return
        if cst == CST.scheme:
            item_type = 'scheme:' + item_type
            val = MultiCount.get_count(1, val)
        self.add_by_type(crawl, item_type, val)
        for metric in self.metrics_map:
            if item_type in self.metrics_map[metric]:
                logging.debug('Adding metric %s for <%s, %s> = %s', metric, crawl, item_type, val)
                self.add_by_type(crawl, metric, val)

    def save_data(self):
        """Save crawler metrics data to CSV files."""
        self.size.sort_values(['crawl'], inplace=True)
        self.size.to_csv('data/crawlmetrics.csv')
        self.size_by_type.to_csv('data/crawlmetricsbytype.csv')

    def add_percent(self):
        """Calculate percentage values for fetch statuses and schemes."""
        for crawl in self.crawls:
            if self.crawls[crawl] not in self.size['fetcher:total']:
                logging.debug('Crawl %s not found in fetch status data', crawl)
                continue
            total = self.size['fetcher:total'][self.crawls[crawl]]
            for item_type in self.type_index:
                if self.crawls[crawl] not in self.size[item_type]:
                    continue
                count = self.size[item_type][self.crawls[crawl]]
                _N = self.type_index[item_type][self.crawls[crawl]]
                if (item_type.startswith('fetcher:') and
                    item_type != 'fetcher:total'):
                    self.size_by_type['percentage'][_N] = 100.0*count/total
                elif item_type.startswith('scheme:'):
                    total = self.size['url'][self.crawls[crawl]]
                    self.size_by_type['percentage'][_N] = 100.0*count/total

    @staticmethod
    def row2title(row):
        """Convert metric row name to human-readable title."""
        row = re.sub('(?<=^fetch)er(?::aggr)?|^generator:', '', row)
        row = re.sub('[:_]', ' ', row)
        if row == 'page':
            row = 'pages released'
        return row

    def plot(self):
        """Generate all crawler metrics plots."""
        row_types = ['generator:fetch_list',
                     'fetcher:success', 'fetcher:total',
                     'fetcher:aggr:redirect', 'fetcher:notmodified',
                     'fetcher:aggr:failed', 'fetcher:aggr:denied',
                     'fetcher:aggr:skipped', 'page']
        self.size_plot(self.size_by_type, row_types, CrawlerMetrics.row2title,
                       'Crawler Metrics', 'Pages',
                       'crawler/metrics.png')
        # -- stacked bar plot
        row_types = ['fetcher:success', 'fetcher:notmodified',
                     'fetcher:aggr:redirect', 'fetcher:aggr:failed',
                     'fetcher:aggr:denied', 'fetcher:aggr:skipped']
        ratio = 0.1 + self.ncrawls * .05
        self.plot_fetch_status(self.size_by_type, row_types,
                               'crawler/fetch_status_percentage.png',
                               ratio=ratio)
        # -- status of pages in CrawlDb
        row_types = ['crawldb:status:db_fetched',
                     'crawldb:status:db_notmodified',
                     'crawldb:status:db_redir_perm',
                     'crawldb:status:db_redir_temp',
                     'crawldb:status:db_duplicate',
                     'crawldb:status:db_gone',
                     'crawldb:status:db_unfetched',
                     'crawldb:status:db_orphan']
        self.plot_crawldb_status(self.size_by_type, row_types,
                                 'crawler/crawldb_status.png',
                                 ratio=ratio)
        # successfully fetched http:// vs https:// URLs
        self.size_plot(self.size_by_type, ['scheme:http', 'scheme:https'], lambda x: x.split(':')[1],
                       'HTTP vs HTTPS URLs', 'Successfully fetched URLs',
                       'crawler/url_protocols.png')
        self.size_plot(self.size_by_type, ['scheme:http', 'scheme:https'], lambda x: x.split(':')[1],
                       'Percentage of HTTP vs HTTPS URLs', 'Percentage of successfully fetched URLs',
                       'crawler/url_protocols_percentage.png', y='percentage')

    def plot_fetch_status_with_rpy2_ggplot2(self, data, img_path, ratio):
        """Generate fetch status stacked bar chart using rpy2/ggplot2."""
        from rpy2.robjects.lib import ggplot2

        p = ggplot2.ggplot(data) \
            + ggplot2.aes_string(x='crawl', y='percentage', fill='type') \
            + ggplot2.geom_bar(stat='identity', position='stack', width=.9) \
            + ggplot2.coord_flip() \
            + ggplot2.scale_fill_brewer(palette='RdYlGn', type='sequential',
                                        guide=ggplot2.guide_legend(reverse=True)) \
            + self.GGPLOT2_THEME \
            + ggplot2.theme(**{'legend.position': 'bottom',
                            'aspect.ratio': ratio,
                            **self.GGPLOT2_THEME_KWARGS}) \
            + ggplot2.labs(title='Percentage of Fetch Status',
                        x='', y='', fill='')

        p.save(img_path, height = int(7 * ratio), width = 7)

        return p

    def plot_fetch_status_with_matplotlib(self, data, categories, img_path, ratio):
        """Generate fetch status stacked bar chart using matplotlib."""
        import numpy as np
        from matplotlib.ticker import MaxNLocator

        crawls = data['crawl'].unique()
        n_crawls = len(crawls)

        # Define colors from dark green (success) to dark red (denied)
        status_order = ['success', 'skipped', 'redirect', 'notmodified', 'failed', 'denied']
        status_colors = {
            'success': '#1A9850', 'skipped': '#91CF60', 'redirect': '#D9EF8B',
            'notmodified': '#FEE08B', 'failed': '#FC8D59', 'denied': '#D73027'
        }
        categories_ordered = [cat for cat in status_order if cat in categories]

        fig, ax = self.create_figure(ratio=ratio)

        # Prepare data for horizontal stacked bar chart
        bar_positions = np.arange(n_crawls)
        lefts = np.zeros(n_crawls)

        for category in categories_ordered:
            category_data = data[data['type'] == category]
            values = [
                category_data[category_data['crawl'] == crawl]['percentage'].iloc[0]
                if len(category_data[category_data['crawl'] == crawl]) > 0 else 0
                for crawl in crawls
            ]
            ax.barh(bar_positions, values, left=lefts, height=self.bar_width,
                    color=status_colors[category], label=category)
            lefts += values

        self.set_title(ax, 'Percentage of Fetch Status')
        ax.set_xlabel('')
        ax.set_ylabel('')

        # Format y-axis (crawl names)
        ax.set_yticks(bar_positions)
        ax.set_yticklabels(crawls, fontsize=self.ticks_fontsize)
        ax.set_ylim(-0.5, n_crawls - 0.5)

        # Format x-axis (percentage)
        max_value = lefts.max()
        ax.set_xlim(0, max_value * 1.02)
        ax.xaxis.set_major_locator(MaxNLocator(nbins=5))

        # Apply ggplot2-like styling
        self.apply_ggplot2_style(ax, grid_axis='x')

        # Set tick colors
        ax.tick_params(axis='y', which='both', colors='#E6E6E6', length=20,
                       width=1.5, labelsize=self.ticks_fontsize)
        ax.tick_params(axis='x', which='both', colors='#E6E6E6', length=4,
                       width=1.5, labelsize=self.ticks_fontsize)
        self.set_tick_labels_black(ax)

        # Position legend at bottom
        handles, labels = ax.get_legend_handles_labels()
        ax.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5, -0.05),
                  ncol=min(3, len(categories)), frameon=False,
                  fontsize=self.legend_fontsize, title='')

        return self.save_figure(fig, img_path)

    def plot_fetch_status(self, data, row_filter, img_file, ratio=1.0):
        """Generate fetch status percentage stacked bar chart."""
        if row_filter:
            data = data[data['type'].isin(row_filter)]
        data = data[['crawl', 'percentage', 'type']]
        categories = []
        for value in row_filter:
            if re.search('^fetcher:(?:aggr:)?', value):
                replacement = re.sub('^fetcher:(?:aggr:)?', '', value)
                categories.append(replacement)
                data.replace(to_replace=value, value=replacement, inplace=True)
        data['type'] = pandas.Categorical(data['type'], ordered=True,
                                          categories=categories.reverse())
        ratio = 0.1 + len(data['crawl'].unique()) * .03
        img_path = os.path.join(self.PLOTDIR, img_file)

        if self.PLOTLIB == "rpy2.ggplot2":
            return self.plot_fetch_status_with_rpy2_ggplot2(data=data, img_path=img_path, ratio=ratio)
        elif self.PLOTLIB == "matplotlib":
            return self.plot_fetch_status_with_matplotlib(data=data, categories=categories, img_path=img_path, ratio=ratio)
        else:
            raise ValueError("Invalid PLOTLIB")

    def plot_crawldb_status_with_rpy2_ggplot2(self, data, img_path, ratio):
        """Generate CrawlDb status stacked bar chart using rpy2/ggplot2."""
        from rpy2.robjects.lib import ggplot2

        p = ggplot2.ggplot(data) \
            + ggplot2.aes_string(x='crawl', y='size', fill='type') \
            + ggplot2.geom_bar(stat='identity', position='stack', width=.9) \
            + ggplot2.coord_flip() \
            + ggplot2.scale_fill_brewer(palette='Pastel1', type='sequential',
                                        guide=ggplot2.guide_legend(reverse=False)) \
            + self.GGPLOT2_THEME \
            + ggplot2.theme(**{'legend.position': 'bottom',
                            'aspect.ratio': ratio,
                            **self.GGPLOT2_THEME_KWARGS}) \
            + ggplot2.labs(title='CrawlDb Size and Status Counts',
                        x='', y='', fill='')

        p.save(img_path, height = int(7 * ratio), width = 7)
        return p

    def plot_crawldb_status_with_matplotlib(self, data, img_path, ratio):
        """Generate CrawlDb status stacked bar chart using matplotlib."""
        import numpy as np

        crawls = data['crawl'].unique()
        n_crawls = len(crawls)

        # Pastel1 palette colors
        pastel1_colors = ['#FDDAEC', '#E5D8BD', '#FFFFCC', '#FED9A6',
                          '#DECBE4', '#CCEBC5', '#B3CDE3', '#FBB4AE', '#F2F2F2']
        categories_ordered = ['unfetched', 'redir_temp', 'redir_perm', 'orphan',
                              'notmodified', 'gone', 'fetched', 'duplicate']

        fig, ax = self.create_figure(ratio=ratio)

        bar_positions = np.arange(n_crawls)
        lefts = np.zeros(n_crawls)

        for i, category in enumerate(categories_ordered):
            category_data = data[data['type'] == category]
            values = [
                category_data[category_data['crawl'] == crawl]['size'].iloc[0]
                if len(category_data[category_data['crawl'] == crawl]) > 0 else 0
                for crawl in crawls
            ]
            color = pastel1_colors[i % len(pastel1_colors)]
            ax.barh(bar_positions, values, left=lefts, height=self.bar_width,
                    color=color, label=category)
            lefts += values

        self.set_title(ax, 'CrawlDb Size and Status Counts')
        ax.set_xlabel('')
        ax.set_ylabel('')

        # Format y-axis (crawl names)
        ax.set_yticks(bar_positions)
        ax.set_yticklabels(crawls, fontsize=self.ticks_fontsize)
        ax.set_ylim(-0.5, n_crawls - 0.5)

        # Format x-axis (size counts)
        max_value = lefts.max()
        ax.set_xlim(0, max_value * 1.02)

        # Axes ratio
        ax.set_aspect(1 / ax.get_data_ratio() * ratio)

        # Apply nice x-axis ticks
        self.apply_nice_ticks(ax, axis='x')

        # Apply ggplot2-like styling with x-axis grid
        ax.grid(True, which='both', linewidth=self.grid_major_linewidth,
                color=self.grid_major_color, zorder=0, axis='x')
        ax.set_axisbelow(True)
        self.apply_ggplot2_style(ax, show_grid=False)

        # Set tick colors
        ax.tick_params(axis='both', which='both', colors=self.ticks_color,
                       length=self.ticks_length, width=0.8,
                       labelsize=self.ticks_fontsize)
        self.set_tick_labels_black(ax)

        # Position legend at bottom with reversed order
        handles, labels = ax.get_legend_handles_labels()
        ax.legend(handles[::-1], labels[::-1], loc='upper center',
                  bbox_to_anchor=(0.5, -0.05), ncol=min(4, len(categories_ordered)),
                  frameon=False, fontsize=self.legend_fontsize, title='')

        return self.save_figure(fig, img_path)

    def plot_crawldb_status(self, data, row_filter, img_file, ratio=1.0):
        """Generate CrawlDb status stacked bar chart."""
        if row_filter:
            data = data[data['type'].isin(row_filter)]
        categories = []
        for value in row_filter:
            if re.search('^crawldb:status:db_', value):
                replacement = re.sub('^crawldb:status:db_', '', value)
                categories.append(replacement)
                data.replace(to_replace=value, value=replacement, inplace=True)
        data['type'] = pandas.Categorical(data['type'], ordered=True,
                                          categories=categories.reverse())
        data['size'] = data['size'].astype(float)
        ratio = 0.1 + len(data['crawl'].unique()) * .03
        img_path = os.path.join(self.PLOTDIR, img_file)

        if self.PLOTLIB == "rpy2.ggplot2":
            return self.plot_crawldb_status_with_rpy2_ggplot2(
                data=data, img_path=img_path, ratio=ratio
            )

        elif self.PLOTLIB == "matplotlib":
            return self.plot_crawldb_status_with_matplotlib(
                data=data, img_path=img_path, ratio=ratio
            )

        else:
            raise ValueError("Invalid PLOTLIB")


if __name__ == '__main__':
    plot = CrawlerMetrics()
    plot.read_from_stdin_or_file()
    plot.add_percent()
    plot.transform_data()
    plot.save_data()
    plot.plot()


================================================
FILE: plot/domain.py
================================================
import sys

import pandas

from crawlstats import CST, MonthlyCrawl, MultiCount
from plot.table import TabularStats


class DomainStats(TabularStats):

    # defined via crawlstats command-line option --max-top-hosts-domains
    MAX_TOP_DOMAINS = 500

    def __init__(self, crawl):
        super().__init__()
        self.crawl = crawl
        self.N = 0

    def add(self, key, val):
        cst = CST[key[0]]
        if cst not in (CST.size, CST.domain):
            return
        typeval = key[1]
        crawl = key[2]
        if crawl != self.crawl:
            return
        if cst == CST.size:
            self.size[typeval] = val
            return
        self.type_stats['domain'][self.N] = typeval 
        self.type_stats['pages'][self.N] = MultiCount.get_count(0, val)
        self.type_stats['urls'][self.N] = MultiCount.get_count(1, val)
        self.type_stats['hosts'][self.N] = MultiCount.get_count(2, val)
        # self.type_stats['crawl'][self.N] = crawl
        self.N += 1

    def transform_data(self):
        data = pandas.DataFrame(self.type_stats)
        for cnt in ['pages', 'urls']:
            total = self.size[cnt[:-1]]
            data['%' + cnt] = 100.0 * data[cnt] / total
        data.sort_values(ascending=False, inplace=True, by='pages')
        print(data)
        self.type_stats = data

    def save_data(self, name, dir_name='data/'):
        self.type_stats.to_csv('{}/{}-top-{}.csv'.format(self.PLOTDIR, name, self.MAX_TOP_DOMAINS),
                               float_format='%.6f', index=None)

    def plot(self, name):
        data = self.type_stats
        css_classes = ['tablesorter', 'tablesearcher']
        data = data.set_index('domain')
        data.columns.name = 'domain'
        data.index.name = None
        print(data.to_html('{}/{}-top-{}.html'.format(
                            self.PLOTDIR, name, self.MAX_TOP_DOMAINS),
                           float_format='%.6f',
                           classes=css_classes, index='domain'))

if __name__ == '__main__':
    plot_crawls = sys.argv[1:]
    if len(plot_crawls) == 0:
        plot_crawls = MonthlyCrawl.get_latest(1)
        print(plot_crawls)
    latest_crawl = plot_crawls[-1]
    plot_name = 'domains'
    plot = DomainStats(latest_crawl)
    plot.read_from_stdin_or_file()
    plot.transform_data()
    plot.save_data(plot_name, dir_name=plot.PLOTDIR)
    plot.plot(plot_name)


================================================
FILE: plot/histogram.py
================================================
"""
Plot histogram distributions for crawl statistics.

This module generates histogram visualizations showing distributions of:
- Pages per URL (URL-level duplicates)
- URLs per host/domain/TLD
- Cumulative URL coverage by domain

These histograms help understand the distribution patterns in crawl data.
"""

import os.path
import sys
from collections import defaultdict

import pandas

from crawlplot import CrawlPlot
from crawlstats import CST


class CrawlHistogram(CrawlPlot):
    """Generate histogram plots for crawl statistics.

    Produces histograms showing frequency distributions of various metrics
    like duplicate rates, coverage per domain, etc.
    """

    PSEUDO_LOG_BINS = [0, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000,
                       10000, 20000, 50000, 100000, 200000, 500000, 1000000,
                       2*10**6, 5*10**6, 10**7, 2*10**7, 5*10**7, 10**8,
                       2*10**8, 5*10**8, 10**9]

    def __init__(self):
        super().__init__()
        self.histogr = defaultdict(dict)
        self.N = 0

    def add(self, key, frequency):
        """Process a histogram record from statistics data."""
        cst = CST[key[0]]
        if cst != CST.histogram:
            return
        item_type = key[1]
        if item_type == 'surt_domain':
            return
        crawl = key[2]
        type_counted = key[3]
        count = key[4]
        self.histogr['crawl'][self.N] = crawl
        self.histogr['type'][self.N] = item_type
        self.histogr['type_counted'][self.N] = type_counted
        self.histogr['count'][self.N] = count
        self.histogr['frequency'][self.N] = frequency
        self.N += 1

    def transform_data(self):
        """Convert internal dictionary to pandas DataFrame."""
        self.histogr = pandas.DataFrame(self.histogr)

    def save_data(self):
        """Save histogram data to CSV file."""
        self.histogr.to_csv('data/crawlhistogr.csv')

    def plot_dupl_url(self):
        """Plot histogram of pages per URL (URL-level duplicates)."""
        from rpy2.robjects.lib import ggplot2

        row_filter = ['url']
        data = self.histogr
        data = data[data['type'].isin(row_filter)]
        title = 'Pages per URL (URL-level duplicates)'
        p = ggplot2.ggplot(data) \
            + ggplot2.aes_string(x='count', y='frequency') \
            + ggplot2.geom_jitter() \
            + ggplot2.facet_wrap('crawl', ncol=5) \
            + ggplot2.labs(title=title, x='(duplicate) pages per URL',
                           y='log(frequency)') \
            + ggplot2.scale_y_log10()
        # + ggplot2.scale_x_log10()  # could use log-log scale
        img_path = os.path.join(self.PLOTDIR, 'crawler/histogr_url_dupl.png')
        p.save(img_path)
        # data.to_csv(img_path + '.csv')
        return p

    def plot_host_domain_tld(self):
        """Plot histogram of URLs per host/domain/TLD."""
        from rpy2.robjects.lib import ggplot2

        data = self.histogr
        data = data[data['type'].isin(['host', 'domain', 'tld'])]
        data = data[data['type_counted'].isin(['url'])]
        img_path = os.path.join(self.PLOTDIR,
                                'crawler/histogr_host_domain_tld.png')
        # data.to_csv(img_path + '.csv')
        title = 'URLs per Host / Domain / TLD'
        p = ggplot2.ggplot(data) \
            + ggplot2.aes_string(x='count', weight='frequency', color='type') \
            + ggplot2.geom_freqpoly(bins=20) \
            + ggplot2.facet_wrap('crawl', ncol=4) \
            + ggplot2.labs(title='', x=title,
                           y='Frequency') \
            + ggplot2.scale_y_log10() \
            + ggplot2.scale_x_log10()
        p.save(img_path)
        return p

    def plot_domain_cumul_with_rpy2_ggplot2(self, data, title, img_path):
        """Generate cumulative domain coverage plot using rpy2/ggplot2."""
        from rpy2.robjects.lib import ggplot2

        p = ggplot2.ggplot(data) \
            + ggplot2.aes_string(x='cum_domains', y='cum_urls') \
            + ggplot2.geom_line() + ggplot2.geom_point() \
            + self.GGPLOT2_THEME \
            + ggplot2.theme(**self.GGPLOT2_THEME_KWARGS) \
            + ggplot2.labs(title=title, x='domains cumulative',
                            y='URLs cumulative') \
            + ggplot2.scale_y_log10() \
            + ggplot2.scale_x_log10()
        p.save(img_path)
    
        return p
    
    def plot_domain_cumul(self, crawl):
        """Plot cumulative URL coverage by domain for a specific crawl."""
        data = self.histogr
        data = data[data['type'].isin(['domain'])]
        data = data[data['crawl'] == crawl]
        data = data[data['type_counted'].isin(['url'])]
        data['urls'] = data['count']*data['frequency']
        print(data)
        data = data[['urls', 'count', 'frequency']]
        data = data.sort_values(['count'], ascending=0)
        data['cum_domains'] = data['frequency'].cumsum()
        data['cum_urls'] = data['urls'].cumsum()
        data_perc = data.apply(lambda x: round(100.0*x/float(x.sum()), 1))
        data['%domains'] = data_perc['frequency']
        data['%urls'] = data_perc['urls']
        data['%cum_domains'] = data['cum_domains'].apply(
            lambda x: round(100.0*x/float(data['frequency'].sum()), 1))
        data['%cum_urls'] = data['cum_urls'].apply(
            lambda x: round(100.0*x/float(data['urls'].sum()), 1))

        img_path = os.path.join(self.PLOTDIR,
                                'crawler/histogr_domain_cumul.png')
        # data.to_csv(img_path + '.csv')
        title = 'Cumulative URLs for Top Domains'

        if self.PLOTLIB == "rpy2.ggplot2":
            return self.plot_domain_cumul_with_rpy2_ggplot2(data=data, title=title, img_path=img_path)
        
        elif self.PLOTLIB == "matplotlib":
            # this plot is currently not used
            raise NotImplementedError
        
        else:
            raise ValueError("Invalid PLOTLIB")



if __name__ == '__main__':
    latest_crawl = sys.argv[-1]
    plot = CrawlHistogram()
    plot.read_from_stdin_or_file()
    plot.transform_data()
    plot.save_data()
    plot.plot_dupl_url()
    plot.plot_host_domain_tld()
    plot.plot_domain_cumul(latest_crawl)


================================================
FILE: plot/language.py
================================================
import string
import sys

from plot.table import TabularStats
from crawlstats import CST, MonthlyCrawl


class LanguageStats(TabularStats):

    MIN_AVERAGE_COUNT = 1
    MAX_LANGUAGES = 200

    def __init__(self):
        super().__init__()
        self.MAX_TYPE_VALUES = LanguageStats.MAX_LANGUAGES

    def add(self, key, val):
        self.add_check_type(key, val, CST.primary_language)


if __name__ == '__main__':
    plot_crawls = sys.argv[1:]
    plot_name = 'languages'
    column_header = 'language'
    if len(plot_crawls) == 0:
        plot_crawls = MonthlyCrawl.get_latest(3)
        print(plot_crawls)
    else:
        plot_name += '-' + '-'.join(plot_crawls)
    plot = LanguageStats()
    plot.read_from_stdin_or_file()
    plot.transform_data(LanguageStats.MAX_LANGUAGES,
                        LanguageStats.MIN_AVERAGE_COUNT,
                        None)
    plot.save_data_percentage(plot_name, dir_name='plots', type_name='primary_language')
    plot.plot(plot_crawls, plot_name, column_header,
              ['iso639-3-language'])


================================================
FILE: plot/mimetype.py
================================================
import re
import sys

from plot.table import TabularStats
from crawlstats import CST, MonthlyCrawl


class MimeTypeStats(TabularStats):

    MIN_AVERAGE_COUNT = 500
    MAX_MIME_TYPES = 100

    # see https://en.wikipedia.org/wiki/Media_type#Naming
    mime_pattern_str = \
        r'(?:x-)?[a-z]+/[a-z0-9]+' \
        r'(?:[.-](?:c\+\+[a-z]*|[a-z0-9]+))*(?:\+[a-z0-9]+)?'
    mime_pattern = re.compile(r'^'+mime_pattern_str+r'$')
    mime_extract_pattern = re.compile(r'^\s*(?:content\s*=\s*)?["\']?\s*(' +
                                      mime_pattern_str +
                                      r')(?:\s*[;,].*)?\s*["\']?\s*$')

    def __init__(self):
        super().__init__()
        self.MAX_TYPE_VALUES = MimeTypeStats.MAX_MIME_TYPES

    def norm_value(self, mimetype):
        if type(mimetype) is str:
            mimetype = mimetype.lower()
            m = MimeTypeStats.mime_extract_pattern.match(mimetype)
            if m:
                return m.group(1)
            return mimetype.strip('"\', \t')
        return ""

    def add(self, key, val):
        self.add_check_type(key, val, CST.mimetype)


if __name__ == '__main__':
    plot_crawls = sys.argv[1:]
    plot_name = 'mimetypes'
    column_header = 'mimetype'
    if len(plot_crawls) == 0:
        plot_crawls = MonthlyCrawl.get_latest(3)
        print(plot_crawls)
    else:
        plot_name += '-' + '-'.join(plot_crawls)
    plot = MimeTypeStats()
    plot.read_from_stdin_or_file()
    plot.transform_data(MimeTypeStats.MAX_MIME_TYPES,
                        MimeTypeStats.MIN_AVERAGE_COUNT,
                        MimeTypeStats.mime_pattern)
    plot.save_data_percentage(plot_name, dir_name='plots', type_name='mimetype')
    plot.plot(plot_crawls, plot_name, column_header, ['tablesearcher'])


================================================
FILE: plot/mimetype_detected.py
================================================
import sys

from plot.mimetype import MimeTypeStats
from crawlstats import CST, MonthlyCrawl


class MimeTypeDetectedStats(MimeTypeStats):

    def __init__(self):
        super().__init__()
        self.MAX_TYPE_VALUES = MimeTypeStats.MAX_MIME_TYPES

    def norm_value(self, mimetype):
        return mimetype

    def add(self, key, val):
        self.add_check_type(key, val, CST.mimetype_detected)


if __name__ == '__main__':
    plot_crawls = sys.argv[1:]
    plot_name = 'mimetypes_detected'
    column_header = 'mimetype_detected'
    if len(plot_crawls) == 0:
        plot_crawls = MonthlyCrawl.get_latest(3)
        print(plot_crawls)
    else:
        plot_name += '-' + '-'.join(plot_crawls)
    plot = MimeTypeDetectedStats()
    plot.read_from_stdin_or_file()
    plot.transform_data(MimeTypeStats.MAX_MIME_TYPES,
                        MimeTypeStats.MIN_AVERAGE_COUNT,
                        None)
    plot.save_data_percentage(plot_name, dir_name='plots', type_name='mimetype_detected')
    plot.plot(plot_crawls, plot_name, column_header, ['tablesearcher'])


================================================
FILE: plot/overlap.py
================================================
"""
Plot crawl overlap and similarity metrics.

This module generates visualizations showing the overlap between different
crawls based on URL or content digest similarities. Uses Jaccard similarity
to measure the intersection over union of items between crawls.
"""

import copy
import os.path
from collections import defaultdict

import pandas
import pygraphviz

from crawlplot import CrawlPlot
from crawlstats import CST, CrawlStatsJSONDecoder, MonthlyCrawl


class CrawlOverlap(CrawlPlot):
    """Generate overlap and similarity visualizations between crawls.

    Calculates and visualizes the Jaccard similarity between crawls
    based on unique URLs or content digests using HyperLogLog cardinality
    estimation.
    """

    MAX_MATRIX_SIZE = 30

    def __init__(self):
        super().__init__()

        self.crawl_size = defaultdict(dict)
        self.overlap = defaultdict(dict)
        self.similarity = defaultdict(dict)  # Jaccard index

    def add(self, key, val):
        """Process a size_estimate record and store HyperLogLog for overlap calculation."""
        cst = CST[key[0]]
        if cst != CST.size_estimate:
            return
        item_type = key[1]
        crawl = key[2]
        hll = CrawlStatsJSONDecoder.json_decode_hyperloglog(val)
        self.crawl_size[item_type][crawl] = hll

    def fill_overlap_matrix(self):
        """Calculate pairwise overlap and Jaccard similarity between all crawls."""
        for item_type in self.crawl_size:
            for crawl1 in self.crawl_size[item_type]:
                hll1 = self.crawl_size[item_type][crawl1]
                size1 = len(hll1)
                self.overlap[item_type][crawl1] = defaultdict(list)
                self.similarity[item_type][crawl1] = defaultdict(float)
                for crawl2 in self.crawl_size[item_type]:
                    if crawl1 >= crawl2:
                        continue
                    hll2 = self.crawl_size[item_type][crawl2]
                    size2 = len(hll2)
                    union_hll = copy.deepcopy(hll1)
                    union_hll.update(hll2)
                    union = len(union_hll)
                    intersection = size1 + size2 - union
                    jaccard_sim = intersection / union
                    self.overlap[item_type][crawl1][crawl2] \
                        = [intersection, union, size1, size2,
                           (intersection/size2), jaccard_sim]
                    self.similarity[item_type][crawl1][crawl2] = jaccard_sim

    def save_overlap_matrix(self):
        """Save overlap and similarity matrices to CSV files."""
        for item_type in self.overlap:
            data = pandas.DataFrame(self.similarity[item_type])
            data.to_csv('data/crawlsimilarity_' + item_type + '.csv')
            data = pandas.DataFrame(self.overlap[item_type])
            data.to_csv('data/crawloverlap_' + item_type + '.csv')

    def plot_similarity_graph(self, show_edges=False):
        """Visualize similarity as a graph using GraphViz (experimental)."""
        g = pygraphviz.AGraph(directed=False, overlap='scale', splines=True)
        g.node_attr['shape'] = 'plaintext'
        g.node_attr['fontsize'] = '12'
        if show_edges:
            g.edge_attr['color'] = 'lightgrey'
            g.edge_attr['fontcolor'] = 'grey'
            g.edge_attr['fontsize'] = '8'
        else:
            g.edge_attr['style'] = 'invis'
        for crawl1 in sorted(self.similarity['url']):
            for crawl2 in sorted(self.similarity['url'][crawl1]):
                similarity = self.similarity['url'][crawl1][crawl2]
                distance = 1.0 - similarity
                g.add_edge(MonthlyCrawl.short_name(crawl1),
                           MonthlyCrawl.short_name(crawl2),
                           len=(distance),
                           label='{0:.2f}'.format(distance))
        g.write(os.path.join(self.PLOTDIR, 'crawlsimilarity_url.dot'))
        g.draw(os.path.join(self.PLOTDIR, 'crawlsimilarity_url.svg'), prog='fdp')

    def plot_similarity_matrix_with_rpy2_ggplot2(self, data, midpoint, title, textsize, img_path):
        """Generate similarity heatmap using rpy2/ggplot2."""
        from rpy2.robjects.lib import ggplot2

        p = ggplot2.ggplot(data) \
            + ggplot2.aes_string(x='crawl2', y='crawl1',
                                fill='similarity', label='sim_rounded') \
            + ggplot2.geom_tile(color="white") \
            + ggplot2.scale_fill_gradient2(low="red", high="blue", mid="white",
                                        midpoint=midpoint, space="Lab") \
            + self.GGPLOT2_THEME \
            + ggplot2.coord_fixed() \
            + ggplot2.theme(**{'axis.text.x':
                            ggplot2.element_text(angle=45,
                                                    vjust=1, hjust=1),
                            **self.GGPLOT2_THEME_KWARGS}) \
            + ggplot2.labs(title=title, x='', y='') \
            + ggplot2.geom_text(color='black', size=textsize)

        p.save(img_path)
        return p
    
    def plot_similarity_matrix_with_matplotlib(self, data, decimals, title, cell_textsize, img_path):
        """Generate similarity heatmap using matplotlib.

        Creates a color-coded matrix showing Jaccard similarity between crawls,
        with color ranging from red (low) through white to blue (high).
        """
        import matplotlib.pyplot as plt
        import numpy as np
        from matplotlib.colors import LinearSegmentedColormap, Normalize

        # Pivot data to create matrix
        pivot_data = data.pivot(index='crawl1', columns='crawl2', values='similarity')
        pivot_data_rounded = pivot_data.round(decimals)

        fig, ax = self.create_figure()

        # Create color map: red (low) -> white (mid) -> blue (high)
        vmin = pivot_data_rounded.min().min()
        vmax = pivot_data_rounded.max().max()

        if vmin < 0:
            colors = ['#ff0801', '#ff6b48', '#ffa388', '#ffd2c4', '#fff4ef',
                      '#FFFFFF', '#eadaff', '#c6a5ff', '#a073ff', '#6e43ff',
                      '#4020ff', '#1306ff']
        else:
            colors = ['#fff4ef', '#FFFFFF', '#eadaff', '#c6a5ff', '#a073ff',
                      '#6e43ff', '#4020ff', '#1306ff']

        cmap = LinearSegmentedColormap.from_list('red_white_blue', colors, N=256)
        norm = Normalize(vmin=vmin, vmax=vmax)

        # Add grey grid lines behind everything
        ax.set_axisbelow(True)
        ax.grid(True, which='major', linewidth=0.8, color='#E6E6E6', zorder=-1)

        # Create heatmap with origin='lower' to match ggplot2 (bottom-up)
        im = ax.imshow(pivot_data_rounded.values, cmap=cmap, norm=norm,
                       aspect='equal', origin='lower', zorder=1)

        # Add text annotations
        for i in range(len(pivot_data.index)):
            for j in range(len(pivot_data.columns)):
                similarity = pivot_data.iloc[i, j]
                if pandas.isna(similarity):
                    continue

                # Draw white rectangle border around each cell
                rect = plt.Rectangle((j - 0.5, i - 0.5), 1, 1,
                                      fill=False, edgecolor='white',
                                      linewidth=0.5, zorder=1)
                ax.add_patch(rect)

                # Get the rounded text for this cell
                matching_rows = data[(data['crawl1'] == pivot_data.index[i]) &
                                     (data['crawl2'] == pivot_data.columns[j])]
                if len(matching_rows) > 0:
                    text_val = matching_rows['sim_rounded'].iloc[0]
                    ax.text(j, i, text_val, ha='center', va='center',
                            color='black', fontsize=cell_textsize, zorder=2)

        # Set ticks and labels
        ax.set_xticks(np.arange(len(pivot_data.columns)))
        ax.set_yticks(np.arange(len(pivot_data.index)))
        ax.set_xticklabels(pivot_data.columns, fontsize=10)
        ax.set_yticklabels(pivot_data.index, fontsize=10)

        # Hide tick marks but keep labels black
        ax.tick_params(axis='both', which='both', colors='#FFFFFF', zorder=0)
        self.set_tick_labels_black(ax)

        # Rotate x-axis labels
        plt.setp(ax.get_xticklabels(), rotation=45, ha='right', va='top')

        self.set_title(ax, title)
        ax.set_xlabel('')
        ax.set_ylabel('')

        # Add colorbar
        cbar = plt.colorbar(im, ax=ax, aspect=5, pad=0.04, shrink=0.2)
        cbar.ax.set_title('similarity', fontsize=10, pad=10, loc="left")
        cbar.ax.tick_params(labelsize=8)
        cbar.outline.set_visible(False)

        # Apply ggplot2-like styling
        self.apply_ggplot2_style(ax, show_grid=False)

        return self.save_figure(fig, img_path)


    def plot_similarity_matrix(self, item_type, image_file, title):
        """Plot similarities of crawls as a heatmap matrix.

        Args:
            item_type: Type of items to compare ('url' or 'digest')
            image_file: Output filename relative to PLOTDIR
            title: Plot title
        """
        data = defaultdict(dict)
        n = 1
        for crawl1 in self.similarity[item_type]:
            for crawl2 in self.similarity[item_type][crawl1]:
                similarity = self.similarity[item_type][crawl1][crawl2]
                data['crawl1'][n] = MonthlyCrawl.short_name(crawl1)
                data['crawl2'][n] = MonthlyCrawl.short_name(crawl2)
                data['similarity'][n] = similarity
                data['sim_rounded'][n] = similarity  # to be rounded
                n += 1
        data = pandas.DataFrame(data)
        print(data)
        # select median of similarity values as midpoint of similarity scale
        midpoint = data['similarity'].median()
        decimals = 3
        textsize = 2
        minshown = .0005
        cell_textsize = 6

        if (data['similarity'].max()-data['similarity'].min()) > .2:
            decimals = 2
            textsize = 2.8
            minshown = .005
            cell_textsize = 8

        data['sim_rounded'] = data['sim_rounded'].apply(
            lambda x: ('{0:.'+str(decimals)+'f}').format(x).lstrip('0')
            if x >= minshown else '0')
        print('Median of similarities for', item_type, '=', midpoint)
        matrix_size = len(self.similarity[item_type])
        if matrix_size > self.MAX_MATRIX_SIZE:
            n = 0
            for crawl1 in sorted(self.similarity[item_type], reverse=True):
                short_name = MonthlyCrawl.short_name(crawl1)
                if n > self.MAX_MATRIX_SIZE:
                    data = data[data['crawl1'] != short_name]
                    data = data[data['crawl2'] != short_name]
                n += 1

        img_path = os.path.join(self.PLOTDIR, image_file)

        if self.PLOTLIB == "rpy2.ggplot2":
            return self.plot_similarity_matrix_with_rpy2_ggplot2(data=data, midpoint=midpoint, title=title, textsize=textsize, img_path=img_path)
        
        elif self.PLOTLIB == "matplotlib":
            return self.plot_similarity_matrix_with_matplotlib(data=data, decimals=decimals, title=title, cell_textsize=cell_textsize, img_path=img_path)
        
        else:
            raise ValueError("Invalid PLOTLIB")


if __name__ == '__main__':
    plot = CrawlOverlap()
    plot.read_from_stdin_or_file()
    plot.fill_overlap_matrix()
    plot.save_overlap_matrix()
    # plot.plot_similarity_graph()
    plot.plot_similarity_matrix(
        'url', 'crawloverlap/crawlsimilarity_matrix_url.png',
        'URL overlap between crawls (Jaccard similarity)')
    plot.plot_similarity_matrix(
        'digest', 'crawloverlap/crawlsimilarity_matrix_digest.png',
        'Content overlap between crawls (Jaccard similarity on digest)')


================================================
FILE: plot/table.py
================================================
import heapq

import numpy
import pandas

from collections import defaultdict, Counter

from crawlplot import CrawlPlot
from crawlstats import CST, MultiCount


class TabularStats(CrawlPlot):

    def __init__(self):
        super().__init__()

        self.crawls = set()
        self.types = defaultdict(dict)
        self.type_stats = defaultdict(dict)
        self.types_total = Counter()
        self.size = defaultdict(dict)
        self.N = 0

    def norm_value(self, typeval):
        return typeval

    def add_check_type(self, key, val, requ_type_cst):
        cst = CST[key[0]]
        if cst != requ_type_cst and cst != CST.size:
            return
        typeval = key[1]
        crawl = key[2]
        self.crawls.add(crawl)
        typeval = self.norm_value(typeval)
        if cst == CST.size:
            self.size[crawl][typeval] = int(val)
            return
        if crawl in self.types[typeval]:
            self.types[typeval][crawl] = \
                MultiCount.sum_values([val, self.types[typeval][crawl]])
        else:
            self.types[typeval][crawl] = val
        npages = MultiCount.get_count(0, val)
        self.types_total[typeval] += npages
        if 'known_values' not in self.size[crawl]:
            self.size[crawl]['known_values'] = 0
        self.size[crawl]['known_values'] += npages

    def transform_data(self, top_n, min_avg_count, check_pattern=None):
        print("Number of different values after first normalization: {}"
              .format(len(self.types)))
        typevals_for_deletion = set()
        typevals_mostfrequent = []
        for typeval in self.types:
            total_count = self.types_total[typeval]
            average_count = int(total_count / len(self.crawls))
            if average_count >= min_avg_count:
                if not check_pattern or check_pattern.match(typeval):
                    print('{}\t{}\t{}'.format(typeval,
                                              average_count, total_count))
                    fval = (total_count, typeval)
                    if len(typevals_mostfrequent) < top_n:
                        heapq.heappush(typevals_mostfrequent, fval)
                    else:
                        heapq.heappushpop(typevals_mostfrequent, fval)
                    continue  # ok, keep this type value
                else:
                    print('Type value frequent but invalid: <{}> (avg. count = {})'
                          .format(typeval, average_count))
            elif average_count >= (min_avg_count/10):
                if not check_pattern or check_pattern.match(typeval):
                    print('Skipped type value because of low frequency: <{}> (avg. count = {}, min. = {})'
                          .format(typeval, average_count, (min_avg_count/10)))
            typevals_for_deletion.add(typeval)
        # map low frequency or invalid type values to empty type
        keep_typevals = set()
        for (_, typeval) in typevals_mostfrequent:
            keep_typevals.add(typeval)
        for typeval in self.types:
            if (typeval not in keep_typevals and
                    typeval not in typevals_for_deletion):
                print('Skipped type value because not in top {}: <{}> (avg. count = {})'
                      .format(top_n, typeval,
                              int(self.types_total[typeval]/len(self.crawls))))
                typevals_for_deletion.add(typeval)
        typevals_other = dict()
        for typeval in typevals_for_deletion:
            for crawl in self.types[typeval]:
                if crawl in typevals_other:
                    val = typevals_other[crawl]
                else:
                    val = 0
                typevals_other[crawl] = \
                    MultiCount.sum_values([val, self.types[typeval][crawl]])
            self.types.pop(typeval, None)
        self.types['<other>'] = typevals_other
        print('Number of different type values after cleaning and'
              ' removal of low frequency types: {}'
              .format(len(self.types)))
        # unknown values
        for crawl in self.crawls:
            known_values = 0
            if 'known_values' in self.size[crawl]:
                known_values = self.size[crawl]['known_values']
            unknown = (self.size[crawl]['page'] - known_values)
            if unknown > 0:
                print("{} unknown values in {}".format(unknown, crawl))
                self.types['<unknown>'][crawl] = unknown
        for typeval in self.types:
            for crawl in self.types[typeval]:
                self.type_stats['type'][self.N] = typeval
                self.type_stats['crawl'][self.N] = crawl
                value = self.types[typeval][crawl]
                n_pages = MultiCount.get_count(0, value)
                self.type_stats['pages'][self.N] = n_pages
                n_urls = MultiCount.get_count(1, value)
                self.type_stats['urls'][self.N] = n_urls
                self.N += 1
        self.type_stats = pandas.DataFrame(self.type_stats)

    def save_data(self, base_name, dir_name='data/'):
        self.type_stats.to_csv(dir_name + base_name + '.csv')

    def save_data_percentage(self, base_name, dir_name='data/', type_name='type'):
        if dir_name[-1] != '/':
            dir_name += '/'
        data = self.type_stats
        data = data[['crawl', 'type', 'pages', 'urls']]
        sum_data = data.groupby(['crawl']).aggregate({'pages':'sum'}).add_suffix('_sum').reset_index()
        data = data.groupby(['crawl', 'type']).aggregate(numpy.sum).reset_index()
        data = pandas.merge(data, sum_data)
        data['%pages/crawl'] = 100.0 * data['pages'] / data['pages_sum']
        data.drop(['pages_sum'], inplace=True, axis=1)
        data = data.rename(columns={'type': type_name})
        data.to_csv(dir_name + base_name + '.csv', float_format='%.4f', index=None)

    def plot(self, crawls, name, column_header, xtra_css_classes=[]):
        # stats comparison for selected crawls
        field_percentage_formatter = '{0:,.4f}'.format
        data = self.type_stats
        data = data[data['crawl'].isin(crawls)]
        if data.size == 0:
            print("No data points in table for selected crawls ({})"
                  .format(crawls))
            return
        data[column_header] = data['type']
        data = data[['crawl', column_header, 'pages']]
        data = data.groupby(['crawl', column_header]).agg({'pages': 'sum'})
        data = data.groupby(level=0, as_index=False).apply(lambda x: 100.0*x/float(x.sum()))
        data = data.reset_index().pivot(index=column_header,
                                        columns='crawl', values='pages')
        print("\n-----\n")
        formatters = {c: field_percentage_formatter for c in crawls}
        print(data.to_string(formatters=formatters))
        css_classes = ['tablesorter', 'tablepercentage']
        css_classes.extend(xtra_css_classes)
        data.to_html('{}/{}-top-{}.html'.format(
                     self.PLOTDIR, name, self.MAX_TYPE_VALUES),
                     formatters=formatters,
                     classes=css_classes)



================================================
FILE: plot/tld.py
================================================
import sys

from collections import defaultdict

import pandas

from crawlplot import CrawlPlot
from crawlstats import CST, MonthlyCrawl, MultiCount
from top_level_domain import TopLevelDomain
from stats.tld_alexa_top_1m import alexa_top_1m_tlds
from stats.tld_cisco_umbrella_top_1m import cisco_umbrella_top_1m_tlds
from stats.tld_majestic_top_1m import majestic_top_1m_tlds

# min. share of URLs for a TLD to be shown in metrics
min_urls_percentage = .05


class TldStats(CrawlPlot):

    def __init__(self):
        super().__init__()

        self.tlds = defaultdict(dict)
        self.tld_stats = defaultdict(dict)
        self.N = 0

    def add(self, key, val):
        cst = CST[key[0]]
        if cst != CST.tld:
            return
        tld = key[1]
        crawl = key[2]
        self.tlds[tld][crawl] = val

    def transform_data(self):
        crawl_has_host_domain_counts = {}
        for tld in self.tlds:
            tld_repr = tld
            tld_obj = None
            if tld in ('', '(ip address)'):
                continue
            else:
                try:
                    tld_obj = TopLevelDomain(tld)
                    tld_repr = tld_obj.tld
                except:
                    print('error', tld)
                    continue
            for crawl in self.tlds[tld]:
                self.tld_stats['suffix'][self.N] = tld_repr
                self.tld_stats['crawl'][self.N] = crawl
                date = pandas.Timestamp(MonthlyCrawl.date_of(crawl))
                self.tld_stats['date'][self.N] = date
                if tld_obj:
                    self.tld_stats['type'][self.N] \
                        = TopLevelDomain.short_type(tld_obj.tld_type)
                    self.tld_stats['subtype'][self.N] = tld_obj.sub_type
                    self.tld_stats['tld'][self.N] = tld_obj.first_level
                else:
                    self.tld_stats['type'][self.N] = ''
                    self.tld_stats['subtype'][self.N] = ''
                    self.tld_stats['tld'][self.N] = ''
                value = self.tlds[tld][crawl]
                n_pages = MultiCount.get_count(0, value)
                self.tld_stats['pages'][self.N] = n_pages
                n_urls = MultiCount.get_count(1, value)
                self.tld_stats['urls'][self.N] = n_urls
                n_hosts = MultiCount.get_count(2, value)
                self.tld_stats['hosts'][self.N] = n_hosts
                n_domains = MultiCount.get_count(3, value)
                self.tld_stats['domains'][self.N] = n_domains
                if n_urls != n_hosts:
                    # multi counts including host counts are not (yet)
                    # available for all crawls
                    crawl_has_host_domain_counts[crawl] = True
                elif crawl not in crawl_has_host_domain_counts:
                    crawl_has_host_domain_counts[crawl] = False
                self.N += 1
        for crawl in crawl_has_host_domain_counts:
            if not crawl_has_host_domain_counts[crawl]:
                print('No host and domain counts for', crawl)
                for n in self.tld_stats['crawl']:
                    if self.tld_stats['crawl'][n] == crawl:
                        del(self.tld_stats['hosts'][n])
                        del(self.tld_stats['domains'][n])
        self.tld_stats = pandas.DataFrame(self.tld_stats)

    @staticmethod
    def field_percentage_formatter(precision=2, nan='-'):
        f = '{0:,.' + str(precision) + 'f}'
        return lambda x: nan if pandas.isna(x) else f.format(x)


    def save_data(self):
        self.tld_stats.to_csv('data/tlds.csv')

    def percent_agg(self, data, column, index, values, aggregate):
        data = data[[column, index, values]]
        data = data.groupby([column, index]).agg(aggregate)
        data = data.groupby(level=0, as_index=False).apply(lambda x: 100.0*x/float(x.sum()))
        # print("\n-----\n")
        # print(data.to_string(formatters={'urls': TldStats.field_percentage_formatter()}))
        return data

    def pivot_percentage(self, data, column, index, values, aggregate):
        data = self.percent_agg(data, column, index, values, aggregate)
        return data.reset_index().pivot(index=index,
                                        columns=[column], values=values)

    def plot_groups(self):
        title = 'Groups of Top-Level Domains'
        ylabel = 'URLs %'
        clabel = ''
        img_file = 'tld/groups.png'
        data = self.pivot_percentage(self.tld_stats, 'crawl', 'type',
                                     'urls', {'urls': 'sum'})
        data = data.transpose()
        print("\n-----\n")
        types = set(self.tld_stats['type'].tolist())
        formatters = {c: TldStats.field_percentage_formatter() for c in types}
        print(data.to_string(formatters=formatters))
        data.to_html('{}/tld/groups-percentage.html'.format(self.PLOTDIR),
                     formatters=formatters,
                     classes=['tablesorter', 'tablepercentage'])
        data = self.percent_agg(self.tld_stats, 'date', 'type',
                                'urls', {'urls': 'sum'}).reset_index()
        return self.line_plot(data, title, ylabel, img_file,
                              x='date', y='urls', c='type', clabel=clabel)

    def plot(self, crawls, latest_crawl):
        field_formatters = {c: '{:,.0f}'.format
                            for c in ['pages', 'urls', 'hosts', 'domains']}
        for c in ['%urls', '%hosts', '%domains']:
            field_formatters[c] = TldStats.field_percentage_formatter()
        data = self.tld_stats
        data = data[data['crawl'].isin(crawls)]
        crawl_data = data
        top_tlds = []
        # stats per crawl
        for crawl in crawls:
            print("\n-----\n{}\n".format(crawl))
            for aggr_type in ('type', 'tld'):
                data = crawl_data
                data = data[data['crawl'].isin([crawl])]
                data = data[[aggr_type, 'pages', 'urls', 'hosts', 'domains']]
                data = data.set_index([aggr_type])
                data = data.groupby(level=0).sum().sort_values(
                    by=['urls'], ascending=False)
                for count in ('urls', 'hosts', 'domains'):
                    data['%'+count] = 100.0 * data[count] / data[count].sum()
                if aggr_type == 'tld':
                    # skip less frequent TLDs
                    data = data[data['%urls'] >= min_urls_percentage]
                    for tld in data.index.values:
                        top_tlds.append(tld)
                print(data.to_string(formatters=field_formatters))
                print()
                if crawl == latest_crawl:
                    # latest crawl by convention
                    type_name = aggr_type
                    if aggr_type == 'type':
                        type_name = 'group'
                    path = '{}/tld/latest-crawl-{}s.html'.format(
                        self.PLOTDIR, type_name)
                    data.to_html(path,
                                 formatters=field_formatters,
                                 classes=['tablesorter', 'tablesearcher'])
        # stats comparison for selected crawls
        for aggr_type in ('type', 'tld'):
            data = crawl_data
            if aggr_type == 'tld':
                data = data[data['tld'].isin(top_tlds)]
            data = self.pivot_percentage(data, 'crawl', aggr_type,
                                         'urls', {'urls': 'sum'})
            print("\n----- {}\n".format(aggr_type))
            print(data.to_string(formatters={c: TldStats.field_percentage_formatter()
                                             for c in crawls}))
            if aggr_type == 'tld':
                # save as HTML table
                path = '{}/tld/selected-crawls-percentage.html'.format(
                                    self.PLOTDIR, len(crawls))
                data.to_html(path,
                             float_format=TldStats.field_percentage_formatter(4),
                             classes=['tablesorter', 'tablepercentage',
                                      'tablesearcher'])

    def plot_comparison(self, crawl, name, topNlimit=None, method='spearman'):
        print()
        print('Comparison for', crawl, '-', name, '-', method)
        data = self.tld_stats
        data = data[data['crawl'].isin([crawl])]
        data = data[data['urls'] >= topNlimit]
        data = data.set_index(['tld'], drop=False)
        data = data.sum(level='tld')
        print(data)
        data['alexa'] = pandas.Series(alexa_top_1m_tlds)
        data['cisco'] = pandas.Series(cisco_umbrella_top_1m_tlds)
        data['majestic'] = pandas.Series(majestic_top_1m_tlds)
        fields = ('pages', 'urls', 'hosts', 'domains',
                  'alexa', 'cisco', 'majestic')
        formatters = {c: '{0:,.3f}'.format for c in fields}
        # relative frequency (percent)
        for count in fields:
            data[count] = 100.0 * data[count] / data[count].sum()
        # Spearman's rank correlation for all TLDs
        corr = data.corr(method=method, min_periods=1)
        print(corr.to_string(formatters=formatters))
        corr.to_html('{}/tld/{}-comparison-{}-all-tlds.html'
                     .format(self.PLOTDIR, name, method),
                     formatters=formatters,
                     classes=['matrix'])
        if topNlimit is None:
            return
        # Spearman's rank correlation for TLDs covering
        # at least topNlimit % of urls
        data = data[data['urls'] >= topNlimit]
        print()
        print('Top', len(data), 'TLDs (>= ', topNlimit, '%)')
        print(data)
        data.to_html('{}/tld/{}-comparison.html'.format(self.PLOTDIR, name),
                     formatters=formatters,
                     classes=['tablesorter', 'tablepercentage'])
        print()
        corr = data.corr(method=method, min_periods=1)
        print(corr.to_string(formatters=formatters))
        corr.to_html('{}/tld/{}-comparison-{}-frequent-tlds.html'
                     .format(self.PLOTDIR, name, method),
                     formatters=formatters,
                     classes=['matrix'])
        print()

    def plot_comparison_groups(self):
        # Alexa and Cisco types/groups:
        for (name, data) in [('Alexa', alexa_top_1m_tlds),
                             ('Cisco', cisco_umbrella_top_1m_tlds),
                             ('Majestic', majestic_top_1m_tlds)]:
            compare_types = defaultdict(int)
            for tld in data:
                compare_types[TopLevelDomain(tld).tld_type] += data[tld]
            print(name, 'TLD groups:')
            for tld in compare_types:
                c = compare_types[tld]
                print(' {:6d}\t{:4.1f}\t{}'.format(c, (100.0*c/1000000), tld))
            print()


if __name__ == '__main__':
    plot_crawls = sys.argv[1:]
    latest_crawl = plot_crawls[-1]
    if len(plot_crawls) == 0:
        print(sys.argv[0], 'crawl-id...')
        print()
        print('Distribution of top-level domains for (selected) monthly crawls')
        print()
        print('Example:')
        print('', sys.argv[0], '[options]', 'CC-MAIN-2014-52', 'CC-MAIN-2016-50')
        print()
        print('Last argument is considered to be the latest crawl')
        print()
        print('Options:')
        print()
        sys.exit(1)
    plot = TldStats()
    plot.read_data(sys.stdin)
    plot.transform_data()
    plot.save_data()
    plot.plot_groups()
    plot.plot(plot_crawls, latest_crawl)
    if latest_crawl == 'CC-MAIN-2019-09':
        # plot comparison only for crawl of similar date as benchmark data
        plot.plot_comparison(latest_crawl, 'selected-crawl',
                             min_urls_percentage)
#         plot.plot_comparison(latest_crawl, 'selected-crawl',
#                              min_urls_percentage, 'pearson')
    plot.plot_comparison_groups()


================================================
FILE: plot/tld_by_continent.py
================================================
"""
Plot TLD distributions by continent.

This module generates visualizations showing how TLDs are distributed
across geographic continents and major TLD groups (com/net, org, edu, gov/mil).
Maps country-code TLDs to their respective continents using ISO country codes.
"""

import json
import os.path
import sys
from collections import Counter, defaultdict

import fsspec
import matplotlib.pyplot as plt
import pandas
from matplotlib.ticker import MaxNLocator

from crawlplot import CrawlPlot
from crawlstats import MonthlyCrawl, MultiCount
from top_level_domain import TopLevelDomain


tld_counts = defaultdict(lambda: Counter())

# mapping of country-code TLDs to continents
continent_cc_tlds = {
    'Africa': {'ao', 'bf', 'bi', 'bj', 'bw', 'cd', 'cf', 'cg', 'ci', 'cm', 'cv',
               'dj', 'dz', 'eg', 'eh', 'er', 'et', 'ga', 'gh', 'gm', 'gn', 'gq',
               'gw', 'ke', 'km', 'lr', 'ls', 'ly', 'ma', 'mg', 'ml', 'mr', 'mu',
               'mw', 'mz', 'na', 'ne', 'ng', 're', 'rw', 'sc', 'sd', 'sh', 'sl',
               'sn', 'so', 'ss', 'st', 'sz', 'td', 'tg', 'tn', 'tz', 'ug', 'yt',
               'za', 'zm', 'zw'},
    'Antarctica': {'aq'},
    'Asia': {'ae', 'af', 'am', 'az', 'bd', 'bh', 'bn', 'bt', 'cc', 'cn', 'cx',
             'ge', 'hk', 'id', 'il', 'in', 'io', 'iq', 'ir', 'jo', 'jp', 'kg',
             'kh', 'kp', 'kr', 'kw', 'kz', 'la', 'lb', 'lk', 'mm', 'mn', 'mo',
             'mv', 'my', 'np', 'om', 'ph', 'pk', 'ps', 'qa', 'sa', 'sg', 'sy',
             'th', 'tj', 'tm', 'tr', 'tw', 'uz', 'vn', 'ye',
             'tp' # Timor-Leste: deleted in favor of .tl in 2015
             },
    'Europe': {'ad', 'al', 'at', 'ba', 'be', 'bg', 'by', 'ch', 'cy', 'cz',
               'de', 'dk', 'ee', 'es', 'fi', 'fo', 'fr', 'gg', 'gi', 'gr',
               'hr', 'hu', 'ie', 'im', 'is', 'it', 'je', 'li', 'lt', 'lu', 'lv',
               'mc', 'md', 'me', 'mk', 'mt', 'nl', 'no',
               'pl', 'pt', 'ro', 'rs', 'ru', 'se', 'si', 'sj', 'sk', 'sm',
               'ua', 'uk', 'va',
               'xk',  # https://en.wikipedia.org/wiki/.xk
               'bv', # Bouvet Island (inactive, uninhabited Norwegian territory, South Atlantic Ocean)
               'gb' # Great Britain (reserved)
               },
    'North America': {'ag', 'ai', 'an', 'aw', 'bb', 'bm', 'bs', 'bz',
                      'ca', 'cr', 'cu', 'cw', 'dm', 'do', 'gd', 'gl', 'gp', 'gt',
                      'hn', 'ht', 'jm', 'kn', 'ky', 'lc', 'mq',     'ms', 'mx', 'ni',
                      'pa', 'pm', 'pr', 'sv', 'sx', 'tc', 'tt',
                      'us', 'vc', 'vg', 'vi',
                      'bl', # Saint Barthélemy (unused)
                      'bq', # Bonaire, Sint Eustatius and Saba (reserved)
                      'mf', # Saint Martin (unassigned)
                      },
    'Oceania': {'as', 'au', 'ck', 'fj', 'fm', 'gu', 'ki', 'mh', 'mp',
                'nc', 'nf', 'nr', 'nu', 'nz', 'pf', 'pg', 'pn', 'pw',
                'sb', 'tk', 'tl', 'to', 'tv', 'vu', 'wf', 'ws'
                },
    'South America': {'ar', 'bo', 'br', 'cl', 'co', 'ec', 'fk', 'gf', 'gy',
                      'pe', 'py', 'sr', 'uy', 've'},
}

# Geographic TLDs mapped to continents
# https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains#Geographic_top-level_domains
continent_geographic_tlds = {
    'Africa': {'africa', 'capetown', 'durban', 'joburg'},
    'Asia': {'abudhabi', 'arab', 'asia', 'doha', 'dubai', 'krd', 'kyoto',
             'nagoya', 'okinawa', 'osaka', 'ryukyu', 'taipei', 'tokyo', 'yokohama',
             # https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains#Internationalized_geographic_top-level_domains
             'xn--1qqw23a', '佛山', # Foshan, China
             'xn--xhq521b', '广东', # Guangdong, China
             'xn--80adxhks', 'москва', # Moscow, Russia
             'xn--p1acf', 'рус', # Russian language and culture - https://en.wikipedia.org/wiki/.%D1%80%D1%83%D1%81
             'xn--mgbca7dzdo', 'ابوظبي', # Abu Dhabi
             'xn--ngbrx', 'عرب', # Arab
             },
    'Europe': {
        # France
        'alsace', 'bzh', 'corsica', 'eus', 'paris',
        # Spain
        'bcn', 'barcelona', 'cat', 'eus', 'gal', 'madrid',
        # Germany
        'bayern', 'berlin', 'cologne', 'koeln', 'hamburg', 'nrw', 'ruhr', 'saarland',
        # other
        'eu', 'amsterdam', 'bar', 'brussels', 'cymru', 'wales', 'frl', 'gent', 'helsinki', 'irish', 'ist', 'istanbul', 'london', 'moscow', 'scot', 'stockholm', 'swiss', 'tatar', 'tirol', 'vlaanderen', 'wien', 'zuerich', 'su',
        # https://en.wikipedia.org/wiki/.ax
        'ax'
    },
    'North America': {'boston', 'miami', 'nyc', 'quebec', 'vegas'},
    'Oceania': {'kiwi', 'melbourne', 'sydney'},
    'South America': {'lat', 'rio'}
}

# list of "continents" to be shown in the output
continents = ['(other)', 'com,net', 'org', 'edu', 'gov,mil', 'North America', 'South America', 'Oceania', 'Africa', 'Asia', 'Europe']

# lookup tables TLD -> continent
tld_continent = {
    'gov': 'gov,mil', 'mil': 'gov,mil',
    'com': 'com,net', 'net': 'com,net',
    'org': 'org', 'edu': 'edu'
}

# frequency counts of TLDs that cannot be mapped to a continent
tld_unmapped = Counter()

# fill the lookup table with TLD -> continent mappings
for continent in continent_cc_tlds:
    for tld in continent_cc_tlds[continent]:
        tld_continent[tld] = continent

for continent in continent_geographic_tlds:
    for tld in continent_geographic_tlds[continent]:
        tld_continent[tld] = continent

for icctld in TopLevelDomain.tld_ccs:
    if TopLevelDomain.tld_ccs[icctld] in tld_continent:
        tld_continent[icctld] = tld_continent[TopLevelDomain.tld_ccs[icctld]]

def tld2continent(tld):
    """Map a TLD to its corresponding continent."""
    continent = '(other)'
    tld = tld.lower()
    if tld in tld_continent and tld_continent[tld] != 'Antarctica':
        continent = tld_continent[tld]
    return continent


def get_data(f):
    """Parse TLD statistics and aggregate by year and crawl.

    Returns two dictionaries: one aggregated by year, one by crawl name.
    """
    d = defaultdict(lambda: defaultdict(list))
    dd = defaultdict(lambda: defaultdict(list))

    for line in f:
        keyval = line.split('\t')
        if len(keyval) == 2:
            [_, suffix, crawl] = json.loads(keyval[0])
            year = MonthlyCrawl.year_of(crawl)
            val = json.loads(keyval[1])
            tld = suffix.split('.')[-1].lower()
            tld_cnt = tld2continent(tld)
            if tld_cnt == '(other)':
                tld_unmapped[tld] += MultiCount.get_count(0, val)
            if tld:
                # print(tld)
                tld_counts['(any)'][tld] += MultiCount.get_count(0, val)
                tld_counts[str(year)][tld] += MultiCount.get_count(0, val)
            d[str(year)][tld_cnt].append(val)
            dd[MonthlyCrawl.short_name(crawl)][tld_cnt].append(val)

    return d, dd


class TLDByContinentPlot(CrawlPlot):
    """Generate TLD distribution by continent visualizations."""

    def __init__(self):
        super().__init__()

    def plot(self):
        """Generate TLD by continent/year plots and save data tables."""
        # Read from file path or stdin
        if len(sys.argv) > 1 and os.path.exists(sys.argv[-1]):
            with fsspec.open(sys.argv[-1], compression="gzip", mode="rt") as f:
                d, dd = get_data(f)
        else:
            d, dd = get_data(sys.stdin)

        print("\nyear\t{}".format("\t".join(continents)))
        continent_percentages = dict()
        for year in d:
            pages = dict()
            total = 0
            values = []
            for tld in continents:
                d[year][tld].append([0,0,0,0])
                val = MultiCount.sum_values(d[year][tld], False)
                total += val[0]
                values.append(val[0])
                # print("{}\t{}\t{}\t{}\t{}\t{}".format(year, tld, *val))
            percentages = [100*val/total for val in values]
            print("{}\t{}".format(year, "\t".join(
                map(lambda x: '{:.2f}'.format(x), percentages))))
            continent_percentages[year] = percentages
        continent_percentages = pandas.DataFrame.from_dict(continent_percentages,
                                                        orient='index',
                                                        columns=continents)
        continent_percentages.index.name = 'year'
        print(continent_percentages)

        top_tlds = tld_counts['(any)'].most_common(16)
        #print("\n", top_tlds)

        top_tlds_by_year = defaultdict(list)
        print("\nyear\t{}".format("\t".join([x[0] for x in top_tlds])))
        for year in tld_counts:
            total = sum(tld_counts[year].values())
            sys.stdout.write(year)
            for tld in top_tlds:
                perc = 100*tld_counts[year][tld[0]]/total
                sys.stdout.write('\t{:.2f}'.format(perc))
                top_tlds_by_year[year].append(perc)
            sys.stdout.write('\n')

        # table TLDs by year
        selected_tlds = pandas.DataFrame.from_dict(
            top_tlds_by_year,
            orient='index',
            columns=map(lambda tld: tld[0], top_tlds)
        )
        selected_tlds.index.name = 'year'
        selected_tlds.to_csv(
            os.path.join(self.PLOTDIR, 'tld', 'selected-tlds-by-year.csv'),
            index=True)
        css_classes = ['tablepercentage', 'tablesorter']
        selected_tlds.to_html(
            os.path.join(self.PLOTDIR, 'tld', 'selected-tlds-by-year.html'),
            float_format='%.2f',
            classes=css_classes,
            index_names=True)

        print("\ncrawl\t{}".format("\t".join(continents)))
        for crawl in dd:
            pages = dict()
            total = 0
            values = []
            for tld in continents:
                dd[crawl][tld].append([0,0,0,0])
                val = MultiCount.sum_values(dd[crawl][tld], False)
                total += val[0]
                values.append(val[0])
                # print("{}\t{}\t{}\t{}\t{}\t{}".format(year, tld, *val))
            print("{}\t{}".format(crawl, "\t".join(['{:.2f}'.format(100*val/total) for val in values])))

        # print unmapped TLDs to verify whether there are any TLDs
        # that need to be added to the mapping
        print("\n", len(tld_unmapped), " unmapped TLDs: ", str(tld_unmapped), "\n\n")


        data = continent_percentages.melt(id_vars=[], var_name='continent',
                                        value_name='perc', ignore_index=False)
        data['continent'] = pandas.Categorical(data['continent'],
                                            ordered=True,
                                            categories=continents.reverse())
        
        if self.PLOTLIB == "rpy2.ggplot2":
            self.plot_with_rpy2_ggplot2(data=data)

        elif self.PLOTLIB == "matplotlib":
            self.plot_with_matplotlib(data=data)

        else:
            raise ValueError("Invalid PLOTLIB")
        

        ### plot and table for print publication
        #plot = plot + ggplot2.labs(title='',
        #                           x='', y='', fill='TLD / Continent') \
        #            + ggplot2.theme()
        #plot.save(os.path.join(PLOTDIR, 'tld', 'tlds-by-year-and-continent.pdf'))
        #print(continent_percentages.to_latex(index=True, float_format='%.2f'))
        continent_percentages.to_csv(
            os.path.join(self.PLOTDIR, 'tld', 'tlds-by-year-and-continent.csv'),
            index=True)
        css_classes = ['tablepercentage', 'tablesorter']
        continent_percentages.to_html(
            os.path.join(self.PLOTDIR, 'tld', 'tlds-by-year-and-continent.html'),
            float_format='%.2f',
            classes=css_classes)

    def plot_with_rpy2_ggplot2(self, data):
        """Generate TLD by continent stacked bar chart using rpy2/ggplot2."""
        from rpy2.robjects.lib import ggplot2

        plot = ggplot2.ggplot(data.reset_index()) \
                + ggplot2.aes_string(x='year', y='perc', fill='continent', label='perc') \
                + ggplot2.geom_bar(stat='identity', position='stack') \
                + self.GGPLOT2_THEME + ggplot2.scale_fill_hue() \
                + ggplot2.labs(title='Percentage of Page Captures per TLD / Continent',
                            x='', y='Percentage', fill='TLD / Continent') \
                + ggplot2.theme(**{'legend.position': 'right',
                                'aspect.ratio': .7,
                                **self.GGPLOT2_THEME_KWARGS,
                                'axis.text.x':
                                    ggplot2.element_text(angle=45,
                                                        vjust=1, hjust=1)})
        plot.save(os.path.join(self.PLOTDIR, 'tld', 'tlds-by-year-and-continent.png'))

        return plot


    def plot_with_matplotlib(self, data):
        """Generate TLD by continent stacked bar chart using matplotlib."""
        aspect_ratio = 0.7
        title = 'Percentage of Page Captures per TLD / Continent'

        fig, ax = self.create_figure()

        # Colorblind-safe palette (Paul Tol's scheme)
        colors = ['#4477AA', '#EE6677', '#228833', '#CCBB44', '#AA3377',
                  '#66CCEE', '#EE8866', '#44AA99', '#BBBBBB', '#99CC66', '#CC99BB']

        years = sorted(data.reset_index()['year'].unique())
        bottoms = [0] * len(years)
        sorted_continents = sorted(continents)[::-1]

        for i, continent in enumerate(sorted_continents):
            values = []
            for year in years:
                year_data = data.loc[year]
                continent_data = year_data[year_data['continent'] == continent]
                values.append(continent_data['perc'].values[0] if len(continent_data) > 0 else 0)

            ax.bar(range(len(years)), values, bottom=bottoms, label=continent,
                   color=colors[i % len(colors)], width=self.bar_width)
            bottoms = [b + v for b, v in zip(bottoms, values)]

        # Axes ratio
        ax.set_aspect(1 / ax.get_data_ratio() * aspect_ratio)

        self.set_title(ax, title)
        ax.set_xlabel('')
        ax.set_ylabel('Percentage', fontsize=self.ylabel_fontsize)

        # Set x-axis ticks and labels
        ax.set_xticks(range(len(years)))
        ax.set_xticklabels(years, rotation=45, ha='right', fontsize=self.ticks_fontsize)
        ax.set_xlim(-0.5, len(years) - 0.5)

        # Set y-axis formatting
        ax.yaxis.set_major_locator(MaxNLocator(nbins=6))
        ax.set_ylim(0, 100)
        ax.tick_params(axis='y', labelsize=self.ticks_fontsize)

        # Apply ggplot2-like styling with y-axis grid
        ax.grid(True, which='major', linewidth=1.0, color='#E6E6E6', zorder=0, axis='y')
        ax.set_axisbelow(True)

        # Custom spine styling (thin borders at top/bottom)
        ax.spines['top'].set_visible(True)
        ax.spines['top'].set_linewidth(1.0)
        ax.spines['top'].set_color('#E6E6E6')
        ax.spines['right'].set_visible(False)
        ax.spines['left'].set_visible(False)
        ax.spines['bottom'].set_visible(True)
        ax.spines['bottom'].set_linewidth(1.0)
        ax.spines['bottom'].set_color('#E6E6E6')

        # Set tick colors
        ax.tick_params(axis='both', which='both', colors=self.ticks_color,
                       length=self.ticks_length, width=1.0)
        self.set_tick_labels_black(ax)

        # Position legend on right side with reversed order
        handles, labels = ax.get_legend_handles_labels()
        legend = ax.legend(handles[::-1], labels[::-1], loc='center left',
                          bbox_to_anchor=(1.0, 0.5), frameon=False,
                          fontsize=self.legend_fontsize, title='TLD / Continent',
                          title_fontsize=self.legend_title_fontsize)
        legend._legend_box.align = 'left'

        img_path = os.path.join(self.PLOTDIR, 'tld', 'tlds-by-year-and-continent.png')
        return self.save_figure(fig, img_path)


if __name__ == '__main__':
    plot = TLDByContinentPlot()
    plot.plot()


================================================
FILE: plot.sh
================================================
#!/bin/bash

N_CRAWLS=$(python3 -c 'from crawlstats import MonthlyCrawl; print(len(MonthlyCrawl.by_name))')
LATEST_CRAWL=$(python3 -c 'from crawlstats import MonthlyCrawl; print(sorted(MonthlyCrawl.by_name.keys())[-1])')

# verify that all stats files are downloaded, cf. get_stats.sh
N_CRAWLS_STATS_FILES=$(ls stats/CC-MAIN-*.gz | wc -l)
if [[ $N_CRAWLS -ne $N_CRAWLS_STATS_FILES ]]; then
    echo "Number of crawls registered in crawlstats.py ($N_CRAWLS) and"
    echo "the number of statistics files in stats/ ($N_CRAWLS_STATS_FILES) are not equal."
    echo "Exiting!"
    exit 1
fi

echo "Plotting crawl statistics for $N_CRAWLS crawls"
echo "Latest crawl is: $LATEST_CRAWL"
echo


# fail on any kind of error
set -exo pipefail


# register the latest crawl in the website configuration
sed -i 's@^latest_crawl:.*@latest_crawl: '$LATEST_CRAWL'@' _config.yml


function update_excerpt() {
    regex="$1"
    excerpt="$2"
    if [ -e "$excerpt" ]; then
        # short-cut for monthy update plots: only add data from latest crawl
        if ! zgrep -qF "$LATEST_CRAWL" $excerpt; then
            echo "Updating excerpt $excerpt with latest crawl $LATEST_CRAWL"
            zgrep -Eh "$regex" stats/$LATEST_CRAWL.gz | gzip >>$excerpt
        fi
        # sanity check: are all crawls excerpted?
        N_CRAWLS_EXCERPTED=$(zcat $excerpt | cut -f1 | jq -r '.[2]' | uniq | sort -u | wc -l)
        if [[ $N_CRAWLS_EXCERPTED -eq $N_CRAWLS ]]; then
            echo "Excerpt $excerpt includes $N_CRAWLS crawls as expected."
        else
            echo "Number of crawls excerpted in $excerpt ($N_CRAWLS_EXCERPTED) does not equal $N_CRAWLS"

Download .txt

gitextract_jga75z6d/

├── .github/
│   └── workflows/
│       └── ci.yml
├── .gitignore
├── LICENSE
├── README.md
├── _config.yml
├── _layouts/
│   ├── default.html
│   └── table.html
├── crawlplot.py
├── crawlstats.py
├── get_stats.sh
├── get_stats_and_plot.sh
├── index.md
├── plot/
│   ├── charset.py
│   ├── crawl_size.py
│   ├── crawler_metrics.py
│   ├── domain.py
│   ├── histogram.py
│   ├── language.py
│   ├── mimetype.py
│   ├── mimetype_detected.py
│   ├── overlap.py
│   ├── table.py
│   ├── tld.py
│   └── tld_by_continent.py
├── plot.sh
├── plots/
│   ├── README.md
│   ├── charsets-top-100.html
│   ├── charsets.csv
│   ├── charsets.md
│   ├── crawlermetrics.md
│   ├── crawloverlap.md
│   ├── crawlsize/
│   │   ├── cumulative.csv
│   │   ├── domain.csv
│   │   ├── monthly.csv
│   │   ├── monthly_new.csv
│   │   ├── url_last_n_crawls.csv
│   │   └── url_page_ratio_last_n_crawls.csv
│   ├── crawlsize.md
│   ├── domains-top-500.csv
│   ├── domains-top-500.html
│   ├── domains.md
│   ├── languages-top-200.html
│   ├── languages.csv
│   ├── languages.md
│   ├── mimetypes-top-100.html
│   ├── mimetypes.csv
│   ├── mimetypes.md
│   ├── mimetypes_detected-top-100.html
│   ├── mimetypes_detected.csv
│   ├── tld/
│   │   ├── by-year-and-continent.md
│   │   ├── comparison.md
│   │   ├── groups-percentage.html
│   │   ├── groups.md
│   │   ├── latest-crawl-groups.html
│   │   ├── latest-crawl-tlds.html
│   │   ├── latestcrawl.md
│   │   ├── percentage.md
│   │   ├── selected-crawl-comparison-spearman-all-tlds.html
│   │   ├── selected-crawl-comparison-spearman-frequent-tlds.html
│   │   ├── selected-crawl-comparison.html
│   │   ├── selected-crawls-percentage.html
│   │   ├── selected-tlds-by-year.csv
│   │   ├── selected-tlds-by-year.html
│   │   ├── tlds-by-year-and-continent.csv
│   │   └── tlds-by-year-and-continent.html
│   └── tlds.md
├── requirements.txt
├── requirements_plot.txt
├── run_stats_hadoop.sh
├── setup.py
├── site.Dockerfile
├── stats/
│   ├── crawler/
│   │   ├── CC-MAIN-2016-18.json
│   │   ├── CC-MAIN-2016-22.json
│   │   ├── CC-MAIN-2016-26.json
│   │   ├── CC-MAIN-2016-30.json
│   │   ├── CC-MAIN-2016-36.json
│   │   ├── CC-MAIN-2016-40.json
│   │   ├── CC-MAIN-2016-44.json
│   │   ├── CC-MAIN-2016-50.json
│   │   ├── CC-MAIN-2017-04.json
│   │   ├── CC-MAIN-2017-09.json
│   │   ├── CC-MAIN-2017-13.json
│   │   ├── CC-MAIN-2017-17.json
│   │   ├── CC-MAIN-2017-22.json
│   │   ├── CC-MAIN-2017-26.json
│   │   ├── CC-MAIN-2017-30.json
│   │   ├── CC-MAIN-2017-34.json
│   │   ├── CC-MAIN-2017-39.json
│   │   ├── CC-MAIN-2017-43.json
│   │   ├── CC-MAIN-2017-47.json
│   │   ├── CC-MAIN-2017-51.json
│   │   ├── CC-MAIN-2018-05.json
│   │   ├── CC-MAIN-2018-09.json
│   │   ├── CC-MAIN-2018-13.json
│   │   ├── CC-MAIN-2018-17.json
│   │   ├── CC-MAIN-2018-22.json
│   │   ├── CC-MAIN-2018-26.json
│   │   ├── CC-MAIN-2018-30.json
│   │   ├── CC-MAIN-2018-34.json
│   │   ├── CC-MAIN-2018-39.json
│   │   ├── CC-MAIN-2018-43.json
│   │   ├── CC-MAIN-2018-47.json
│   │   ├── CC-MAIN-2018-51.json
│   │   ├── CC-MAIN-2019-04.json
│   │   ├── CC-MAIN-2019-09.json
│   │   ├── CC-MAIN-2019-13.json
│   │   ├── CC-MAIN-2019-18.json
│   │   ├── CC-MAIN-2019-22.json
│   │   ├── CC-MAIN-2019-26.json
│   │   ├── CC-MAIN-2019-30.json
│   │   ├── CC-MAIN-2019-35.json
│   │   ├── CC-MAIN-2019-39.json
│   │   ├── CC-MAIN-2019-43.json
│   │   ├── CC-MAIN-2019-47.json
│   │   ├── CC-MAIN-2019-51.json
│   │   ├── CC-MAIN-2020-05.json
│   │   ├── CC-MAIN-2020-10.json
│   │   ├── CC-MAIN-2020-16.json
│   │   ├── CC-MAIN-2020-24.json
│   │   ├── CC-MAIN-2020-29.json
│   │   ├── CC-MAIN-2020-34.json
│   │   ├── CC-MAIN-2020-40.json
│   │   ├── CC-MAIN-2020-45.json
│   │   ├── CC-MAIN-2020-50.json
│   │   ├── CC-MAIN-2021-04.json
│   │   ├── CC-MAIN-2021-10.json
│   │   ├── CC-MAIN-2021-17.json
│   │   ├── CC-MAIN-2021-21.json
│   │   ├── CC-MAIN-2021-25.json
│   │   ├── CC-MAIN-2021-31.json
│   │   ├── CC-MAIN-2021-39.json
│   │   ├── CC-MAIN-2021-43.json
│   │   ├── CC-MAIN-2021-49.json
│   │   ├── CC-MAIN-2022-05.json
│   │   ├── CC-MAIN-2022-21.json
│   │   ├── CC-MAIN-2022-27.json
│   │   ├── CC-MAIN-2022-33.json
│   │   ├── CC-MAIN-2022-40.json
│   │   ├── CC-MAIN-2022-49.json
│   │   ├── CC-MAIN-2023-06.json
│   │   ├── CC-MAIN-2023-14.json
│   │   ├── CC-MAIN-2023-23.json
│   │   ├── CC-MAIN-2023-40.json
│   │   ├── CC-MAIN-2023-50.json
│   │   ├── CC-MAIN-2024-10.json
│   │   ├── CC-MAIN-2024-18.json
│   │   ├── CC-MAIN-2024-22.json
│   │   ├── CC-MAIN-2024-26.json
│   │   ├── CC-MAIN-2024-30.json
│   │   ├── CC-MAIN-2024-33.json
│   │   ├── CC-MAIN-2024-38.json
│   │   ├── CC-MAIN-2024-42.json
│   │   ├── CC-MAIN-2024-46.json
│   │   ├── CC-MAIN-2024-51.json
│   │   ├── CC-MAIN-2025-05.json
│   │   ├── CC-MAIN-2025-08.json
│   │   ├── CC-MAIN-2025-13.json
│   │   ├── CC-MAIN-2025-18.json
│   │   ├── CC-MAIN-2025-21.json
│   │   ├── CC-MAIN-2025-26.json
│   │   ├── CC-MAIN-2025-30.json
│   │   ├── CC-MAIN-2025-33.json
│   │   ├── CC-MAIN-2025-38.json
│   │   ├── CC-MAIN-2025-43.json
│   │   ├── CC-MAIN-2025-47.json
│   │   ├── CC-MAIN-2025-51.json
│   │   ├── CC-MAIN-2026-04.json
│   │   ├── CC-MAIN-2026-08.json
│   │   ├── CC-MAIN-2026-12.json
│   │   ├── CC-MAIN-2026-17.json
│   │   └── README.md
│   ├── tld_alexa_top_1m.py
│   ├── tld_cisco_umbrella_top_1m.py
│   └── tld_majestic_top_1m.py
├── stats.Dockerfile
├── tests/
│   └── test_crawlstat.py
└── top_level_domain.py

Download .txt

SYMBOL INDEX (182 symbols across 16 files)

FILE: crawlplot.py
  class CrawlPlot (line 33) | class CrawlPlot:
    method create_figure (line 88) | def create_figure(self, ratio=1.0):
    method set_title (line 100) | def set_title(self, ax, title):
    method apply_ggplot2_style (line 115) | def apply_ggplot2_style(self, ax, show_grid=True, grid_axis='both'):
    method set_tick_labels_black (line 136) | def set_tick_labels_black(self, ax):
    method apply_nice_ticks (line 145) | def apply_nice_ticks(self, ax, axis='y', use_scientific=True):
    method save_figure (line 174) | def save_figure(self, fig, img_path):
    method hide_tick_marks (line 192) | def hide_tick_marks(self, ax, tick_color='#FFFFFF'):
    method __init__ (line 204) | def __init__(self):
    method read_from_stdin_or_file (line 272) | def read_from_stdin_or_file(self):
    method read_data (line 290) | def read_data(self, stream):
    method line_plot_with_ggplot (line 306) | def line_plot_with_ggplot(
    method line_plot_with_rpy2_ggplot2 (line 334) | def line_plot_with_rpy2_ggplot2(
    method nice_tick_step (line 375) | def nice_tick_step(vmin, vmax, n=5):
    method center_legend_title (line 399) | def center_legend_title(fig, ax, leg_items, leg_title, x_axes=0.1):
    method line_plot_with_matplotlib (line 408) | def line_plot_with_matplotlib(
    method line_plot (line 518) | def line_plot(

FILE: crawlstats.py
  class MonthlyCrawl (line 35) | class MonthlyCrawl:
    method get_by_name (line 168) | def get_by_name(name):
    method to_name (line 172) | def to_name(crawl):
    method to_bit_mask (line 176) | def to_bit_mask(crawl):
    method date_of (line 180) | def date_of(crawl):
    method year_of (line 191) | def year_of(crawl):
    method short_name (line 195) | def short_name(name):
    method get_latest (line 199) | def get_latest(n):
  class MonthlyCrawlSet (line 203) | class MonthlyCrawlSet:
    method __init__ (line 209) | def __init__(self, crawls=0):
    method add (line 212) | def add(self, crawl):
    method update (line 215) | def update(self, *others):
    method clear (line 219) | def clear(self):
    method discard (line 222) | def discard(self, crawl):
    method __contains__ (line 225) | def __contains__(self, crawl):
    method __len__ (line 228) | def __len__(self):
    method get_bits (line 235) | def get_bits(self):
    method get_crawls (line 238) | def get_crawls(self):
    method is_new (line 247) | def is_new(self, crawl):
    method is_newest (line 263) | def is_newest(self, crawl):
  class CST (line 271) | class CST(Enum):
  class MultiCount (line 375) | class MultiCount(defaultdict):
    method __init__ (line 378) | def __init__(self, size):
    method incr (line 382) | def incr(self, key, *counts):
    method compress (line 387) | def compress(size, counts):
    method get_compressed (line 397) | def get_compressed(self, key):
    method get_count (line 401) | def get_count(index, value):
    method sum_values (line 409) | def sum_values(values, compress=True):
  class CrawlStatsJSONEncoder (line 436) | class CrawlStatsJSONEncoder(json.JSONEncoder):
    method default (line 438) | def default(self, o):
    method json_encode_hyperloglog (line 446) | def json_encode_hyperloglog(o):
  class CrawlStatsJSONDecoder (line 452) | class CrawlStatsJSONDecoder(json.JSONDecoder):
    method __init__ (line 454) | def __init__(self, *args, **kargs):
    method dict_to_object (line 458) | def dict_to_object(self, dic):
    method json_decode_hyperloglog (line 471) | def json_decode_hyperloglog(dic):
  class HostDomainCount (line 480) | class HostDomainCount:
    method __init__ (line 487) | def __init__(self):
    method add (line 491) | def add(self, url, count):
    method output (line 499) | def output(self, crawl):
  class SurtDomainCount (line 529) | class SurtDomainCount:
    method __init__ (line 534) | def __init__(self, surt_domain):
    method add (line 547) | def add(self, _path, metadata):
    method unique_urls (line 595) | def unique_urls(self):
    method output (line 598) | def output(self, crawl, exact_count=True, min_surt_hll_size=50000):
  class UnhandledTypeError (line 642) | class UnhandledTypeError(Exception):
    method __init__ (line 643) | def __init__(self, outputType):
  class InputError (line 647) | class InputError(Exception):
    method __init__ (line 648) | def __init__(self, message):
  class CCStatsJob (line 652) | class CCStatsJob(MRJob):
    method configure_args (line 674) | def configure_args(self):
    method input_protocol (line 711) | def input_protocol(self):
    method hadoop_input_format (line 718) | def hadoop_input_format(self):
    method count_mapper_init (line 726) | def count_mapper_init(self):
    method count_mapper (line 766) | def count_mapper(self, _, line):
    method count_mapper_final (line 801) | def count_mapper_final(self):
    method reducer_init (line 829) | def reducer_init(self):
    method count_reducer (line 833) | def count_reducer(self, key, values):
    method stats_mapper_init (line 910) | def stats_mapper_init(self):
    method stats_mapper (line 913) | def stats_mapper(self, key, value):
    method stats_mapper_final (line 944) | def stats_mapper_final(self):
    method stats_reducer (line 948) | def stats_reducer(self, key, values):
    method reducer_final (line 1007) | def reducer_final(self):
    method steps (line 1021) | def steps(self):

FILE: plot/charset.py
  class CharsetStats (line 7) | class CharsetStats(TabularStats):
    method __init__ (line 12) | def __init__(self):
    method add (line 16) | def add(self, key, val):

FILE: plot/crawl_size.py
  class CrawlSizePlot (line 26) | class CrawlSizePlot(CrawlPlot):
    method __init__ (line 34) | def __init__(self):
    method add (line 46) | def add(self, key, val):
    method add_by_type (line 63) | def add_by_type(self, crawl, item_type, count):
    method cumulative_size (line 90) | def cumulative_size(self):
    method transform_data (line 157) | def transform_data(self):
    method save_data (line 162) | def save_data(self):
    method duplicate_ratio (line 167) | def duplicate_ratio(self):
    method plot (line 178) | def plot(self):
    method plot_with_rpy2_ggplot2 (line 310) | def plot_with_rpy2_ggplot2(self, by_year_by_type, img_path):
    method plot_with_matplotlib (line 340) | def plot_with_matplotlib(self, by_year_by_type, img_path):
    method export_csv (line 434) | def export_csv(self, data, csv):
    method norm_data (line 441) | def norm_data(self, data, row_filter, type_name_norm):
    method size_plot (line 460) | def size_plot(self, data, row_filter, type_name_norm,

FILE: plot/crawler_metrics.py
  class CrawlerMetrics (line 26) | class CrawlerMetrics(CrawlSizePlot):
    method __init__ (line 68) | def __init__(self):
    method add (line 72) | def add(self, key, val):
    method save_data (line 90) | def save_data(self):
    method add_percent (line 96) | def add_percent(self):
    method row2title (line 116) | def row2title(row):
    method plot (line 124) | def plot(self):
    method plot_fetch_status_with_rpy2_ggplot2 (line 162) | def plot_fetch_status_with_rpy2_ggplot2(self, data, img_path, ratio):
    method plot_fetch_status_with_matplotlib (line 183) | def plot_fetch_status_with_matplotlib(self, data, categories, img_path...
    method plot_fetch_status (line 248) | def plot_fetch_status(self, data, row_filter, img_file, ratio=1.0):
    method plot_crawldb_status_with_rpy2_ggplot2 (line 271) | def plot_crawldb_status_with_rpy2_ggplot2(self, data, img_path, ratio):
    method plot_crawldb_status_with_matplotlib (line 291) | def plot_crawldb_status_with_matplotlib(self, data, img_path, ratio):
    method plot_crawldb_status (line 360) | def plot_crawldb_status(self, data, row_filter, img_file, ratio=1.0):

FILE: plot/domain.py
  class DomainStats (line 9) | class DomainStats(TabularStats):
    method __init__ (line 14) | def __init__(self, crawl):
    method add (line 19) | def add(self, key, val):
    method transform_data (line 37) | def transform_data(self):
    method save_data (line 46) | def save_data(self, name, dir_name='data/'):
    method plot (line 50) | def plot(self, name):

FILE: plot/histogram.py
  class CrawlHistogram (line 22) | class CrawlHistogram(CrawlPlot):
    method __init__ (line 34) | def __init__(self):
    method add (line 39) | def add(self, key, frequency):
    method transform_data (line 57) | def transform_data(self):
    method save_data (line 61) | def save_data(self):
    method plot_dupl_url (line 65) | def plot_dupl_url(self):
    method plot_host_domain_tld (line 86) | def plot_host_domain_tld(self):
    method plot_domain_cumul_with_rpy2_ggplot2 (line 108) | def plot_domain_cumul_with_rpy2_ggplot2(self, data, title, img_path):
    method plot_domain_cumul (line 125) | def plot_domain_cumul(self, crawl):

FILE: plot/language.py
  class LanguageStats (line 8) | class LanguageStats(TabularStats):
    method __init__ (line 13) | def __init__(self):
    method add (line 17) | def add(self, key, val):

FILE: plot/mimetype.py
  class MimeTypeStats (line 8) | class MimeTypeStats(TabularStats):
    method __init__ (line 22) | def __init__(self):
    method norm_value (line 26) | def norm_value(self, mimetype):
    method add (line 35) | def add(self, key, val):

FILE: plot/mimetype_detected.py
  class MimeTypeDetectedStats (line 7) | class MimeTypeDetectedStats(MimeTypeStats):
    method __init__ (line 9) | def __init__(self):
    method norm_value (line 13) | def norm_value(self, mimetype):
    method add (line 16) | def add(self, key, val):

FILE: plot/overlap.py
  class CrawlOverlap (line 20) | class CrawlOverlap(CrawlPlot):
    method __init__ (line 30) | def __init__(self):
    method add (line 37) | def add(self, key, val):
    method fill_overlap_matrix (line 47) | def fill_overlap_matrix(self):
    method save_overlap_matrix (line 70) | def save_overlap_matrix(self):
    method plot_similarity_graph (line 78) | def plot_similarity_graph(self, show_edges=False):
    method plot_similarity_matrix_with_rpy2_ggplot2 (line 100) | def plot_similarity_matrix_with_rpy2_ggplot2(self, data, midpoint, tit...
    method plot_similarity_matrix_with_matplotlib (line 122) | def plot_similarity_matrix_with_matplotlib(self, data, decimals, title...
    method plot_similarity_matrix (line 211) | def plot_similarity_matrix(self, item_type, image_file, title):

FILE: plot/table.py
  class TabularStats (line 12) | class TabularStats(CrawlPlot):
    method __init__ (line 14) | def __init__(self):
    method norm_value (line 24) | def norm_value(self, typeval):
    method add_check_type (line 27) | def add_check_type(self, key, val, requ_type_cst):
    method transform_data (line 49) | def transform_data(self, top_n, min_avg_count, check_pattern=None):
    method save_data (line 121) | def save_data(self, base_name, dir_name='data/'):
    method save_data_percentage (line 124) | def save_data_percentage(self, base_name, dir_name='data/', type_name=...
    method plot (line 137) | def plot(self, crawls, name, column_header, xtra_css_classes=[]):

FILE: plot/tld.py
  class TldStats (line 18) | class TldStats(CrawlPlot):
    method __init__ (line 20) | def __init__(self):
    method add (line 27) | def add(self, key, val):
    method transform_data (line 35) | def transform_data(self):
    method field_percentage_formatter (line 89) | def field_percentage_formatter(precision=2, nan='-'):
    method save_data (line 94) | def save_data(self):
    method percent_agg (line 97) | def percent_agg(self, data, column, index, values, aggregate):
    method pivot_percentage (line 105) | def pivot_percentage(self, data, column, index, values, aggregate):
    method plot_groups (line 110) | def plot_groups(self):
    method plot (line 130) | def plot(self, crawls, latest_crawl):
    method plot_comparison (line 187) | def plot_comparison(self, crawl, name, topNlimit=None, method='spearma...
    method plot_comparison_groups (line 232) | def plot_comparison_groups(self):

FILE: plot/tld_by_continent.py
  function tld2continent (line 126) | def tld2continent(tld):
  function get_data (line 135) | def get_data(f):
  class TLDByContinentPlot (line 163) | class TLDByContinentPlot(CrawlPlot):
    method __init__ (line 166) | def __init__(self):
    method plot (line 169) | def plot(self):
    method plot_with_rpy2_ggplot2 (line 280) | def plot_with_rpy2_ggplot2(self, data):
    method plot_with_matplotlib (line 301) | def plot_with_matplotlib(self, data):

FILE: tests/test_crawlstat.py
  function test_monthly_crawl (line 18) | def test_monthly_crawl():
  function test_monthly_crawl_set (line 25) | def test_monthly_crawl_set():
  function test_crawlstatstype (line 78) | def test_crawlstatstype():
  function test_json_hyperloglog (line 83) | def test_json_hyperloglog():
  function test_multicount (line 96) | def test_multicount():

FILE: top_level_domain.py
  class TopLevelDomain (line 6) | class TopLevelDomain:
    method __init__ (line 24) | def __init__(self, tld):
    method __str__ (line 43) | def __str__(self):
    method _read_data (line 58) | def _read_data():
    method short_type (line 117) | def short_type(name):

Download .json

Condensed preview — 177 files, each showing path, character count, and a content snippet. Download the .json file or copy for the full structured content (2,778K chars).

[
  {
    "path": ".github/workflows/ci.yml",
    "chars": 3152,
    "preview": "name: CI Pipeline\n\non:\n  push:\n    branches: [master]\n  pull_request:\n\nenv:\n  REGISTRY: ghcr.io\n  IMAGE_NAME: ${{ github"
  },
  {
    "path": ".gitignore",
    "chars": 1386,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
  },
  {
    "path": "LICENSE",
    "chars": 11357,
    "preview": "                                 Apache License\n                           Version 2.0, January 2004\n                   "
  },
  {
    "path": "README.md",
    "chars": 6393,
    "preview": "Basic Statistics of Common Crawl Monthly Archives\n=================================================\n\nAnalyze the [Common"
  },
  {
    "path": "_config.yml",
    "chars": 755,
    "preview": "title: Statistics of Common Crawl Monthly Archives\ndescription: Number of pages, distribution of top-level domains, craw"
  },
  {
    "path": "_layouts/default.html",
    "chars": 1920,
    "preview": "<!doctype html>\n<html lang=\"{{ site.lang | default: \"en-US\" }}\">\n  <head>\n    <meta charset=\"utf-8\">\n    <meta http-equi"
  },
  {
    "path": "_layouts/table.html",
    "chars": 4305,
    "preview": "<!doctype html>\n<html lang=\"{{ site.lang | default: \"en-US\" }}\">\n  <head>\n    <meta charset=\"utf-8\">\n    <meta http-equi"
  },
  {
    "path": "crawlplot.py",
    "chars": 19840,
    "preview": "\"\"\"\nBase plotting module for Common Crawl statistics visualization.\n\nThis module provides the CrawlPlot base class which"
  },
  {
    "path": "crawlstats.py",
    "chars": 42634,
    "preview": "import heapq\nimport json\nimport logging\nimport os\nimport re\n\nfrom collections import defaultdict, Counter\nfrom datetime "
  },
  {
    "path": "get_stats.sh",
    "chars": 886,
    "preview": "#!/bin/bash\n\nset -o pipefail\n\nif aws s3 ls s3://commoncrawl/crawl-analysis/ | sed -E 's@.* @@; s@/$@@' >./stats/crawls.t"
  },
  {
    "path": "get_stats_and_plot.sh",
    "chars": 237,
    "preview": "#!/bin/bash\nset -e\n\necho \"Starting ...\"\n\n./get_stats.sh\n\n# make sure plot directories exist\nmkdir -p plots/crawler\nmkdir"
  },
  {
    "path": "index.md",
    "chars": 1260,
    "preview": "Statistics of Common Crawl Monthly Archives\n===========================================\n\nStatistics of [Common Crawl](ht"
  },
  {
    "path": "plot/charset.py",
    "chars": 980,
    "preview": "import sys\n\nfrom plot.table import TabularStats\nfrom crawlstats import CST, MonthlyCrawl\n\n\nclass CharsetStats(TabularSta"
  },
  {
    "path": "plot/crawl_size.py",
    "chars": 22869,
    "preview": "\"\"\"\nPlot crawl size metrics over time.\n\nThis module generates visualizations of crawl size statistics including:\n- Month"
  },
  {
    "path": "plot/crawler_metrics.py",
    "chars": 17558,
    "preview": "\"\"\"\nPlot crawler performance metrics.\n\nThis module generates visualizations of crawler metrics including:\n- Fetch status"
  },
  {
    "path": "plot/domain.py",
    "chars": 2408,
    "preview": "import sys\n\nimport pandas\n\nfrom crawlstats import CST, MonthlyCrawl, MultiCount\nfrom plot.table import TabularStats\n\n\ncl"
  },
  {
    "path": "plot/histogram.py",
    "chars": 6275,
    "preview": "\"\"\"\nPlot histogram distributions for crawl statistics.\n\nThis module generates histogram visualizations showing distribut"
  },
  {
    "path": "plot/language.py",
    "chars": 1057,
    "preview": "import string\nimport sys\n\nfrom plot.table import TabularStats\nfrom crawlstats import CST, MonthlyCrawl\n\n\nclass LanguageS"
  },
  {
    "path": "plot/mimetype.py",
    "chars": 1785,
    "preview": "import re\nimport sys\n\nfrom plot.table import TabularStats\nfrom crawlstats import CST, MonthlyCrawl\n\n\nclass MimeTypeStats"
  },
  {
    "path": "plot/mimetype_detected.py",
    "chars": 1078,
    "preview": "import sys\n\nfrom plot.mimetype import MimeTypeStats\nfrom crawlstats import CST, MonthlyCrawl\n\n\nclass MimeTypeDetectedSta"
  },
  {
    "path": "plot/overlap.py",
    "chars": 11841,
    "preview": "\"\"\"\nPlot crawl overlap and similarity metrics.\n\nThis module generates visualizations showing the overlap between differe"
  },
  {
    "path": "plot/table.py",
    "chars": 7162,
    "preview": "import heapq\n\nimport numpy\nimport pandas\n\nfrom collections import defaultdict, Counter\n\nfrom crawlplot import CrawlPlot\n"
  },
  {
    "path": "plot/tld.py",
    "chars": 11983,
    "preview": "import sys\n\nfrom collections import defaultdict\n\nimport pandas\n\nfrom crawlplot import CrawlPlot\nfrom crawlstats import C"
  },
  {
    "path": "plot/tld_by_continent.py",
    "chars": 16180,
    "preview": "\"\"\"\nPlot TLD distributions by continent.\n\nThis module generates visualizations showing how TLDs are distributed\nacross g"
  },
  {
    "path": "plot.sh",
    "chars": 3824,
    "preview": "#!/bin/bash\n\nN_CRAWLS=$(python3 -c 'from crawlstats import MonthlyCrawl; print(len(MonthlyCrawl.by_name))')\nLATEST_CRAWL"
  },
  {
    "path": "plots/README.md",
    "chars": 512,
    "preview": "Plots about Common Crawl Monthly Archives\n=========================================\n\n* [size of the crawls](crawlsize.md"
  },
  {
    "path": "plots/charsets-top-100.html",
    "chars": 5869,
    "preview": "<table border=\"1\" class=\"dataframe tablesorter tablepercentage\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th"
  },
  {
    "path": "plots/charsets.csv",
    "chars": 166158,
    "preview": "crawl,charset,pages,urls,%pages/crawl\nCC-MAIN-2008-2009,<unknown>,1798158091,1798158091,100.0000\nCC-MAIN-2009-2010,<unkn"
  },
  {
    "path": "plots/charsets.md",
    "chars": 630,
    "preview": "---\nlayout: table\ntable_include: charsets-top-100.html\ntable_sortlist: \"{sortList: [[1,1]]}\"\n---\n\nCharacter Encoding of "
  },
  {
    "path": "plots/crawlermetrics.md",
    "chars": 1867,
    "preview": "Crawler-Related Metrics\n=======================\n\nCrawler-related metrics are extracted from the crawler log files, cf. ["
  },
  {
    "path": "plots/crawloverlap.md",
    "chars": 933,
    "preview": "Overlaps between Common Crawl Monthly Archives\n==============================================\n\nOverlaps between monthly "
  },
  {
    "path": "plots/crawlsize/cumulative.csv",
    "chars": 6579,
    "preview": "crawl,digest estim.,page,url estim.\nCC-MAIN-2008-2009,1804803498,1798158091,1799114116\nCC-MAIN-2009-2010,4339999986,4661"
  },
  {
    "path": "plots/crawlsize/domain.csv",
    "chars": 6178,
    "preview": "crawl,domain,host,tld,url\nCC-MAIN-2008-2009,15045431,32086112,1496,1790932667\nCC-MAIN-2009-2010,30794437,68991076,4711,2"
  },
  {
    "path": "plots/crawlsize/monthly.csv",
    "chars": 6057,
    "preview": "crawl,digest estim.,page,url\nCC-MAIN-2008-2009,1804803498,1798158091,1790932667\nCC-MAIN-2009-2010,2631454016,2863495211,"
  },
  {
    "path": "plots/crawlsize/monthly_new.csv",
    "chars": 3246,
    "preview": "crawl,url estim. new\nCC-MAIN-2008-2009,1799114116\nCC-MAIN-2009-2010,2025520640\nCC-MAIN-2012,2875802047\nCC-MAIN-2013-20,1"
  },
  {
    "path": "plots/crawlsize/url_last_n_crawls.csv",
    "chars": 11517,
    "preview": "crawl,1,12,2,3,4,6,9\nCC-MAIN-2008-2009,1790932667,nan,nan,nan,nan,nan,nan\nCC-MAIN-2009-2010,2301135881,nan,3824634756,na"
  },
  {
    "path": "plots/crawlsize/url_page_ratio_last_n_crawls.csv",
    "chars": 15476,
    "preview": "crawl,12,2,3,4,6,9\nCC-MAIN-2009-2010,,0.8204459894859851,,,,\nCC-MAIN-2012,,0.7976991186989425,0.7891972139777857,,,\nCC-M"
  },
  {
    "path": "plots/crawlsize.md",
    "chars": 3011,
    "preview": "Size of Common Crawl Monthly Archives\n=====================================\n\nThe number of released pages per month fluc"
  },
  {
    "path": "plots/domains-top-500.csv",
    "chars": 23086,
    "preview": "domain,pages,urls,hosts,%pages,%urls\nblogspot.com,19780455,19755134,246742,0.902419,0.906434\nwikipedia.org,3899574,38564"
  },
  {
    "path": "plots/domains-top-500.html",
    "chars": 77835,
    "preview": "<table border=\"1\" class=\"dataframe tablesorter tablesearcher\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th>d"
  },
  {
    "path": "plots/domains.md",
    "chars": 941,
    "preview": "---\nlayout: table\ntable_include: domains-top-500.html\ntable_sortlist: \"{sortList: [[1,1]]}\"\ntable_searcher: \"Filter for "
  },
  {
    "path": "plots/languages-top-200.html",
    "chars": 17247,
    "preview": "<table border=\"1\" class=\"dataframe tablesorter tablepercentage iso639-3-language\">\n  <thead>\n    <tr style=\"text-align: "
  },
  {
    "path": "plots/languages.csv",
    "chars": 476573,
    "preview": "crawl,primary_language,pages,urls,%pages/crawl\nCC-MAIN-2008-2009,<unknown>,1798158091,1798158091,100.0000\nCC-MAIN-2009-2"
  },
  {
    "path": "plots/languages.md",
    "chars": 609,
    "preview": "---\nlayout: table\ntable_include: languages-top-200.html\ntable_sortlist: \"{sortList: [[1,1]]}\"\n---\n\nDistribution of Langu"
  },
  {
    "path": "plots/mimetypes-top-100.html",
    "chars": 12542,
    "preview": "<table border=\"1\" class=\"dataframe tablesorter tablepercentage tablesearcher\">\n  <thead>\n    <tr style=\"text-align: righ"
  },
  {
    "path": "plots/mimetypes.csv",
    "chars": 666670,
    "preview": "crawl,mimetype,pages,urls,%pages/crawl\nCC-MAIN-2008-2009,<other>,818049,815434,0.0455\nCC-MAIN-2008-2009,application/atom"
  },
  {
    "path": "plots/mimetypes.md",
    "chars": 730,
    "preview": "---\nlayout: table\ntable_include:\n - mimetypes-top-100.html\n - mimetypes_detected-top-100.html\ntable_sortlist: \"{sortList"
  },
  {
    "path": "plots/mimetypes_detected-top-100.html",
    "chars": 11560,
    "preview": "<table border=\"1\" class=\"dataframe tablesorter tablepercentage tablesearcher\">\n  <thead>\n    <tr style=\"text-align: righ"
  },
  {
    "path": "plots/mimetypes_detected.csv",
    "chars": 458874,
    "preview": "crawl,mimetype_detected,pages,urls,%pages/crawl\nCC-MAIN-2008-2009,<unknown>,1798158091,1798158091,100.0000\nCC-MAIN-2009-"
  },
  {
    "path": "plots/tld/by-year-and-continent.md",
    "chars": 1031,
    "preview": "---\nlayout: table\ntable_include:\n- tlds-by-year-and-continent.html\n- selected-tlds-by-year.html\ntable_sortlist: \"{sortLi"
  },
  {
    "path": "plots/tld/comparison.md",
    "chars": 2422,
    "preview": "---\nlayout: table\ntable_include:\n - selected-crawl-comparison-spearman-frequent-tlds.html\n - selected-crawl-comparison.h"
  },
  {
    "path": "plots/tld/groups-percentage.html",
    "chars": 24025,
    "preview": "<table border=\"1\" class=\"dataframe tablesorter tablepercentage\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th"
  },
  {
    "path": "plots/tld/groups.md",
    "chars": 2508,
    "preview": "---\nlayout: table\ntable_include: groups-percentage.html\ntable_sortlist: \"{sortList: [[0,1]]}\"\n---\n\nGroups of Top-Level D"
  },
  {
    "path": "plots/tld/latest-crawl-groups.html",
    "chars": 1848,
    "preview": "<table border=\"1\" class=\"dataframe tablesorter tablesearcher\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th><"
  },
  {
    "path": "plots/tld/latest-crawl-tlds.html",
    "chars": 18262,
    "preview": "<table border=\"1\" class=\"dataframe tablesorter tablesearcher\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th><"
  },
  {
    "path": "plots/tld/latestcrawl.md",
    "chars": 735,
    "preview": "---\nlayout: table\ntable_include:\n - latest-crawl-groups.html\n - latest-crawl-tlds.html\ntable_sortlist: \"{sortList: [[5,1"
  },
  {
    "path": "plots/tld/percentage.md",
    "chars": 455,
    "preview": "---\nlayout: table\ntable_include: selected-crawls-percentage.html\ntable_sortlist: \"{sortList: [[7,1]]}\"\ntable_searcher: \""
  },
  {
    "path": "plots/tld/selected-crawl-comparison-spearman-all-tlds.html",
    "chars": 1620,
    "preview": "<table border=\"1\" class=\"dataframe matrix\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>page"
  },
  {
    "path": "plots/tld/selected-crawl-comparison-spearman-frequent-tlds.html",
    "chars": 1620,
    "preview": "<table border=\"1\" class=\"dataframe matrix\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>page"
  },
  {
    "path": "plots/tld/selected-crawl-comparison.html",
    "chars": 15782,
    "preview": "<table border=\"1\" class=\"dataframe tablesorter tablepercentage\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th"
  },
  {
    "path": "plots/tld/selected-crawls-percentage.html",
    "chars": 19219,
    "preview": "<table border=\"1\" class=\"dataframe tablesorter tablepercentage tablesearcher\">\n  <thead>\n    <tr style=\"text-align: righ"
  },
  {
    "path": "plots/tld/selected-tlds-by-year.csv",
    "chars": 5540,
    "preview": "year,com,org,ru,net,de,uk,jp,edu,fr,it,pl,nl,br,cz,au,es\n(any),51.89498041088741,6.374690312897634,4.043836128047363,3.7"
  },
  {
    "path": "plots/tld/selected-tlds-by-year.html",
    "chars": 7246,
    "preview": "<table border=\"1\" class=\"dataframe tablepercentage tablesorter\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th"
  },
  {
    "path": "plots/tld/tlds-by-year-and-continent.csv",
    "chars": 3672,
    "preview": "year,(other),\"com,net\",org,edu,\"gov,mil\",North America,South America,Oceania,Africa,Asia,Europe\n2009,2.372710231294118,8"
  },
  {
    "path": "plots/tld/tlds-by-year-and-continent.html",
    "chars": 5077,
    "preview": "<table border=\"1\" class=\"dataframe tablepercentage tablesorter\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th"
  },
  {
    "path": "plots/tlds.md",
    "chars": 1245,
    "preview": "Top-Level Domains\n=================\n\n[Top-level domains](https://en.wikipedia.org/wiki/Top-level_domain) (abbrev. \"TLD\"/"
  },
  {
    "path": "requirements.txt",
    "chars": 118,
    "preview": "hyperloglog==0.0.14\nisoweek==1.3.3\nmrjob==0.7.4\ntldextract==5.1.2\nujson==5.12.0\n\n# tests\npytest\njsonpickle\nsetuptools\n"
  },
  {
    "path": "requirements_plot.txt",
    "chars": 119,
    "preview": "ggplot==0.11.5\nidna==3.7\n#pandas==2.1.4+dfsg\npandas==2.1.4\npygraphviz==1.13\nrpy2==3.5.15\n\nmatplotlib==3.10.7\nfsspec[s3]"
  },
  {
    "path": "run_stats_hadoop.sh",
    "chars": 2516,
    "preview": "#!/bin/bash\n\nCRAWL=\"$1\"\n\nif [ -z \"$CRAWL\" ]; then\n    echo \"Usage: $0 <CRAWL-YEAR-WEEK>\"\n    echo \"  Argument indicating"
  },
  {
    "path": "setup.py",
    "chars": 107,
    "preview": "from setuptools import setup\n\n\nsetup(\n    setup_requires=['pytest-runner'],\n    tests_require=['pytest'],\n)"
  },
  {
    "path": "site.Dockerfile",
    "chars": 965,
    "preview": "# See\n#    https://docs.github.com/en/pages/setting-up-a-github-pages-site-with-jekyll\n#    https://github.com/BillRaymo"
  },
  {
    "path": "stats/crawler/CC-MAIN-2016-18.json",
    "chars": 1351,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-18\"]\t7706034375\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
  },
  {
    "path": "stats/crawler/CC-MAIN-2016-22.json",
    "chars": 1354,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-22\"]\t8087074988\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
  },
  {
    "path": "stats/crawler/CC-MAIN-2016-26.json",
    "chars": 2285,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-26\"]\t7180806080\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
  },
  {
    "path": "stats/crawler/CC-MAIN-2016-30.json",
    "chars": 2362,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-30\"]\t7180518032\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
  },
  {
    "path": "stats/crawler/CC-MAIN-2016-36.json",
    "chars": 2446,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-36\"]\t7218846495\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
  },
  {
    "path": "stats/crawler/CC-MAIN-2016-40.json",
    "chars": 1885,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-40\"]\t6936860504\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
  },
  {
    "path": "stats/crawler/CC-MAIN-2016-44.json",
    "chars": 2448,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-44\"]\t9290101260\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
  },
  {
    "path": "stats/crawler/CC-MAIN-2016-50.json",
    "chars": 2445,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-50\"]\t9779597110\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
  },
  {
    "path": "stats/crawler/CC-MAIN-2017-04.json",
    "chars": 2446,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-04\"]\t10058030146\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2017-09.json",
    "chars": 2450,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-09\"]\t10309950142\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2017-13.json",
    "chars": 2452,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-13\"]\t11054073116\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2017-17.json",
    "chars": 2447,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-17\"]\t11614646341\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2017-22.json",
    "chars": 2388,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-22\"]\t12106403981\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2017-26.json",
    "chars": 2450,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-26\"]\t12986411417\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2017-30.json",
    "chars": 2455,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-30\"]\t13566579384\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2017-34.json",
    "chars": 2580,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-34\"]\t14698581608\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2017-39.json",
    "chars": 2580,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-39\"]\t14981165656\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2017-43.json",
    "chars": 2583,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-43\"]\t15959590811\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2017-47.json",
    "chars": 2581,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-47\"]\t16756526195\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2017-51.json",
    "chars": 2577,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-51\"]\t17390543871\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2018-05.json",
    "chars": 2584,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-05\"]\t14985445343\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2018-09.json",
    "chars": 2584,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-09\"]\t15702763091\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2018-13.json",
    "chars": 2583,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-13\"]\t13830972570\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2018-17.json",
    "chars": 2585,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-17\"]\t14071100780\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2018-22.json",
    "chars": 2582,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-22\"]\t14293230909\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2018-26.json",
    "chars": 2516,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-26\"]\t14114634413\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2018-30.json",
    "chars": 2581,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-30\"]\t12336785966\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2018-34.json",
    "chars": 2578,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-34\"]\t11304753488\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2018-39.json",
    "chars": 1865,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-39\"]\t10322390017\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2018-43.json",
    "chars": 2577,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-43\"]\t10068215197\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2018-47.json",
    "chars": 2578,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-47\"]\t10372257458\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2018-51.json",
    "chars": 2576,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-51\"]\t10696257053\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2019-04.json",
    "chars": 2574,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-04\"]\t11425571649\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2019-09.json",
    "chars": 2574,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-09\"]\t11395325655\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2019-13.json",
    "chars": 2575,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-13\"]\t11250692948\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2019-18.json",
    "chars": 2571,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-18\"]\t11385935631\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2019-22.json",
    "chars": 2577,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-22\"]\t12201452792\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2019-26.json",
    "chars": 2572,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-26\"]\t13037540318\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2019-30.json",
    "chars": 2572,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-30\"]\t13857387307\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2019-35.json",
    "chars": 2577,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-35\"]\t15024860128\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2019-39.json",
    "chars": 2574,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-39\"]\t15879390106\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2019-43.json",
    "chars": 2575,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-43\"]\t17009132379\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2019-47.json",
    "chars": 2576,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-47\"]\t18304035037\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2019-51.json",
    "chars": 2660,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-51\"]\t19131514303\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2020-05.json",
    "chars": 2659,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-05\"]\t20085156050\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2020-10.json",
    "chars": 2662,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-10\"]\t19918325266\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2020-16.json",
    "chars": 2815,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-16\"]\t20380741760\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2020-24.json",
    "chars": 2819,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-24\"]\t21097779271\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2020-29.json",
    "chars": 2818,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-29\"]\t20528820728\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2020-34.json",
    "chars": 2819,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-34\"]\t20403113647\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2020-40.json",
    "chars": 2819,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-40\"]\t21173047327\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2020-45.json",
    "chars": 2814,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-45\"]\t21531016513\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2020-50.json",
    "chars": 2815,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-50\"]\t22014556091\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2021-04.json",
    "chars": 2817,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-04\"]\t22704928999\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2021-10.json",
    "chars": 2816,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-10\"]\t22937361361\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2021-17.json",
    "chars": 2815,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-17\"]\t21968054310\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2021-21.json",
    "chars": 2819,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-21\"]\t23713955373\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2021-25.json",
    "chars": 2104,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-25\"]\t3080558881\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
  },
  {
    "path": "stats/crawler/CC-MAIN-2021-31.json",
    "chars": 2816,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-31\"]\t26047830277\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2021-39.json",
    "chars": 2819,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-39\"]\t26294713963\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2021-43.json",
    "chars": 2817,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-43\"]\t25464021170\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2021-49.json",
    "chars": 2815,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-49\"]\t26305572235\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2022-05.json",
    "chars": 2813,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2022-05\"]\t26716610822\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2022-21.json",
    "chars": 2817,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2022-21\"]\t27142123347\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2022-27.json",
    "chars": 2895,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2022-27\"]\t23976620914\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2022-33.json",
    "chars": 2897,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2022-33\"]\t24023318047\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2022-40.json",
    "chars": 2899,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2022-40\"]\t25121487097\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2022-49.json",
    "chars": 2897,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2022-49\"]\t25536843285\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2023-06.json",
    "chars": 2897,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2023-06\"]\t24063833984\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2023-14.json",
    "chars": 2899,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2023-14\"]\t23193639650\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2023-23.json",
    "chars": 2900,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2023-23\"]\t23471797837\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2023-40.json",
    "chars": 2897,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2023-40\"]\t24130374695\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2023-50.json",
    "chars": 2898,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2023-50\"]\t20650993966\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2024-10.json",
    "chars": 2901,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-10\"]\t20599819450\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2024-18.json",
    "chars": 2897,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-18\"]\t20216718958\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2024-22.json",
    "chars": 2897,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-22\"]\t19759133480\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2024-26.json",
    "chars": 3025,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-26\"]\t20635672053\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2024-30.json",
    "chars": 2960,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-30\"]\t22484038777\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2024-33.json",
    "chars": 2956,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-33\"]\t21902542885\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2024-38.json",
    "chars": 2959,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-38\"]\t23533999592\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2024-42.json",
    "chars": 2955,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-42\"]\t24237973275\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2024-46.json",
    "chars": 3023,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-46\"]\t25404964414\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2024-51.json",
    "chars": 2959,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-51\"]\t25915332801\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2025-05.json",
    "chars": 2961,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-05\"]\t26857631748\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2025-08.json",
    "chars": 3022,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-08\"]\t26387970898\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2025-13.json",
    "chars": 3024,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-13\"]\t27837493836\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2025-18.json",
    "chars": 3023,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-18\"]\t27638605242\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2025-21.json",
    "chars": 3023,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-21\"]\t27797066363\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2025-26.json",
    "chars": 2960,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-26\"]\t28167640016\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2025-30.json",
    "chars": 2960,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-30\"]\t27783766888\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2025-33.json",
    "chars": 3023,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-33\"]\t28259545744\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2025-38.json",
    "chars": 3023,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-38\"]\t27837059074\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2025-43.json",
    "chars": 3023,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-43\"]\t26899650260\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2025-47.json",
    "chars": 2962,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-47\"]\t27080086217\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2025-51.json",
    "chars": 2958,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-51\"]\t26268339705\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2026-04.json",
    "chars": 3023,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2026-04\"]\t25612258765\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2026-08.json",
    "chars": 3023,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2026-08\"]\t24938307108\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2026-12.json",
    "chars": 3314,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2026-12\"]\t25055505905\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/CC-MAIN-2026-17.json",
    "chars": 3322,
    "preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2026-17\"]\t24642277769\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
  },
  {
    "path": "stats/crawler/README.md",
    "chars": 424,
    "preview": "Crawler-Related Metrics\n=======================\n\nJSON files in this folder contain metrics\n- written by the crawler ([Ap"
  },
  {
    "path": "stats/tld_alexa_top_1m.py",
    "chars": 10884,
    "preview": "# derived from\n#   http://s3.amazonaws.com/alexa-static/top-1m.csv.zip\n# fetched 2019-02-06, see also\n#   https://suppor"
  },
  {
    "path": "stats/tld_cisco_umbrella_top_1m.py",
    "chars": 9779,
    "preview": "# derived from\n#   http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip\n# fetched 2019-02-06, see also\n#   h"
  },
  {
    "path": "stats/tld_majestic_top_1m.py",
    "chars": 11217,
    "preview": "# derived from\n#   http://downloads.majestic.com/majestic_million.csv\n# fetched 2019-02-06\n#\n# see also\n#\n#   https://ma"
  },
  {
    "path": "stats.Dockerfile",
    "chars": 1158,
    "preview": "# Replicating pjox/cc-crawl-statistics\nFROM python:3.12\n\n# Install system dependencies\nRUN apt-get update && apt-get ins"
  },
  {
    "path": "tests/test_crawlstat.py",
    "chars": 3121,
    "preview": "import json\nimport sys\n\nimport ujson\nimport jsonpickle\n\nfrom crawlstats import MonthlyCrawl, MonthlyCrawlSet\nfrom crawls"
  },
  {
    "path": "top_level_domain.py",
    "chars": 81321,
    "preview": "import fileinput\nimport idna\nimport re\n\n\nclass TopLevelDomain:\n    \"\"\"Classify top-level domains (TLDs) to provide the f"
  }
]

About this extraction

This page contains the full source code of the commoncrawl/cc-crawl-statistics GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 177 files (2.6 MB), approximately 677.4k tokens, and a symbol index with 182 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.

Extract another repo