Repository: commoncrawl/cc-crawl-statistics
Branch: master
Commit: 446d601f16d0
Files: 177
Total size: 2.6 MB
Directory structure:
gitextract_jga75z6d/
├── .github/
│ └── workflows/
│ └── ci.yml
├── .gitignore
├── LICENSE
├── README.md
├── _config.yml
├── _layouts/
│ ├── default.html
│ └── table.html
├── crawlplot.py
├── crawlstats.py
├── get_stats.sh
├── get_stats_and_plot.sh
├── index.md
├── plot/
│ ├── charset.py
│ ├── crawl_size.py
│ ├── crawler_metrics.py
│ ├── domain.py
│ ├── histogram.py
│ ├── language.py
│ ├── mimetype.py
│ ├── mimetype_detected.py
│ ├── overlap.py
│ ├── table.py
│ ├── tld.py
│ └── tld_by_continent.py
├── plot.sh
├── plots/
│ ├── README.md
│ ├── charsets-top-100.html
│ ├── charsets.csv
│ ├── charsets.md
│ ├── crawlermetrics.md
│ ├── crawloverlap.md
│ ├── crawlsize/
│ │ ├── cumulative.csv
│ │ ├── domain.csv
│ │ ├── monthly.csv
│ │ ├── monthly_new.csv
│ │ ├── url_last_n_crawls.csv
│ │ └── url_page_ratio_last_n_crawls.csv
│ ├── crawlsize.md
│ ├── domains-top-500.csv
│ ├── domains-top-500.html
│ ├── domains.md
│ ├── languages-top-200.html
│ ├── languages.csv
│ ├── languages.md
│ ├── mimetypes-top-100.html
│ ├── mimetypes.csv
│ ├── mimetypes.md
│ ├── mimetypes_detected-top-100.html
│ ├── mimetypes_detected.csv
│ ├── tld/
│ │ ├── by-year-and-continent.md
│ │ ├── comparison.md
│ │ ├── groups-percentage.html
│ │ ├── groups.md
│ │ ├── latest-crawl-groups.html
│ │ ├── latest-crawl-tlds.html
│ │ ├── latestcrawl.md
│ │ ├── percentage.md
│ │ ├── selected-crawl-comparison-spearman-all-tlds.html
│ │ ├── selected-crawl-comparison-spearman-frequent-tlds.html
│ │ ├── selected-crawl-comparison.html
│ │ ├── selected-crawls-percentage.html
│ │ ├── selected-tlds-by-year.csv
│ │ ├── selected-tlds-by-year.html
│ │ ├── tlds-by-year-and-continent.csv
│ │ └── tlds-by-year-and-continent.html
│ └── tlds.md
├── requirements.txt
├── requirements_plot.txt
├── run_stats_hadoop.sh
├── setup.py
├── site.Dockerfile
├── stats/
│ ├── crawler/
│ │ ├── CC-MAIN-2016-18.json
│ │ ├── CC-MAIN-2016-22.json
│ │ ├── CC-MAIN-2016-26.json
│ │ ├── CC-MAIN-2016-30.json
│ │ ├── CC-MAIN-2016-36.json
│ │ ├── CC-MAIN-2016-40.json
│ │ ├── CC-MAIN-2016-44.json
│ │ ├── CC-MAIN-2016-50.json
│ │ ├── CC-MAIN-2017-04.json
│ │ ├── CC-MAIN-2017-09.json
│ │ ├── CC-MAIN-2017-13.json
│ │ ├── CC-MAIN-2017-17.json
│ │ ├── CC-MAIN-2017-22.json
│ │ ├── CC-MAIN-2017-26.json
│ │ ├── CC-MAIN-2017-30.json
│ │ ├── CC-MAIN-2017-34.json
│ │ ├── CC-MAIN-2017-39.json
│ │ ├── CC-MAIN-2017-43.json
│ │ ├── CC-MAIN-2017-47.json
│ │ ├── CC-MAIN-2017-51.json
│ │ ├── CC-MAIN-2018-05.json
│ │ ├── CC-MAIN-2018-09.json
│ │ ├── CC-MAIN-2018-13.json
│ │ ├── CC-MAIN-2018-17.json
│ │ ├── CC-MAIN-2018-22.json
│ │ ├── CC-MAIN-2018-26.json
│ │ ├── CC-MAIN-2018-30.json
│ │ ├── CC-MAIN-2018-34.json
│ │ ├── CC-MAIN-2018-39.json
│ │ ├── CC-MAIN-2018-43.json
│ │ ├── CC-MAIN-2018-47.json
│ │ ├── CC-MAIN-2018-51.json
│ │ ├── CC-MAIN-2019-04.json
│ │ ├── CC-MAIN-2019-09.json
│ │ ├── CC-MAIN-2019-13.json
│ │ ├── CC-MAIN-2019-18.json
│ │ ├── CC-MAIN-2019-22.json
│ │ ├── CC-MAIN-2019-26.json
│ │ ├── CC-MAIN-2019-30.json
│ │ ├── CC-MAIN-2019-35.json
│ │ ├── CC-MAIN-2019-39.json
│ │ ├── CC-MAIN-2019-43.json
│ │ ├── CC-MAIN-2019-47.json
│ │ ├── CC-MAIN-2019-51.json
│ │ ├── CC-MAIN-2020-05.json
│ │ ├── CC-MAIN-2020-10.json
│ │ ├── CC-MAIN-2020-16.json
│ │ ├── CC-MAIN-2020-24.json
│ │ ├── CC-MAIN-2020-29.json
│ │ ├── CC-MAIN-2020-34.json
│ │ ├── CC-MAIN-2020-40.json
│ │ ├── CC-MAIN-2020-45.json
│ │ ├── CC-MAIN-2020-50.json
│ │ ├── CC-MAIN-2021-04.json
│ │ ├── CC-MAIN-2021-10.json
│ │ ├── CC-MAIN-2021-17.json
│ │ ├── CC-MAIN-2021-21.json
│ │ ├── CC-MAIN-2021-25.json
│ │ ├── CC-MAIN-2021-31.json
│ │ ├── CC-MAIN-2021-39.json
│ │ ├── CC-MAIN-2021-43.json
│ │ ├── CC-MAIN-2021-49.json
│ │ ├── CC-MAIN-2022-05.json
│ │ ├── CC-MAIN-2022-21.json
│ │ ├── CC-MAIN-2022-27.json
│ │ ├── CC-MAIN-2022-33.json
│ │ ├── CC-MAIN-2022-40.json
│ │ ├── CC-MAIN-2022-49.json
│ │ ├── CC-MAIN-2023-06.json
│ │ ├── CC-MAIN-2023-14.json
│ │ ├── CC-MAIN-2023-23.json
│ │ ├── CC-MAIN-2023-40.json
│ │ ├── CC-MAIN-2023-50.json
│ │ ├── CC-MAIN-2024-10.json
│ │ ├── CC-MAIN-2024-18.json
│ │ ├── CC-MAIN-2024-22.json
│ │ ├── CC-MAIN-2024-26.json
│ │ ├── CC-MAIN-2024-30.json
│ │ ├── CC-MAIN-2024-33.json
│ │ ├── CC-MAIN-2024-38.json
│ │ ├── CC-MAIN-2024-42.json
│ │ ├── CC-MAIN-2024-46.json
│ │ ├── CC-MAIN-2024-51.json
│ │ ├── CC-MAIN-2025-05.json
│ │ ├── CC-MAIN-2025-08.json
│ │ ├── CC-MAIN-2025-13.json
│ │ ├── CC-MAIN-2025-18.json
│ │ ├── CC-MAIN-2025-21.json
│ │ ├── CC-MAIN-2025-26.json
│ │ ├── CC-MAIN-2025-30.json
│ │ ├── CC-MAIN-2025-33.json
│ │ ├── CC-MAIN-2025-38.json
│ │ ├── CC-MAIN-2025-43.json
│ │ ├── CC-MAIN-2025-47.json
│ │ ├── CC-MAIN-2025-51.json
│ │ ├── CC-MAIN-2026-04.json
│ │ ├── CC-MAIN-2026-08.json
│ │ ├── CC-MAIN-2026-12.json
│ │ ├── CC-MAIN-2026-17.json
│ │ └── README.md
│ ├── tld_alexa_top_1m.py
│ ├── tld_cisco_umbrella_top_1m.py
│ └── tld_majestic_top_1m.py
├── stats.Dockerfile
├── tests/
│ └── test_crawlstat.py
└── top_level_domain.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .github/workflows/ci.yml
================================================
name: CI Pipeline
on:
  push:
    branches: [master]
  pull_request:
env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
jobs:
  test-and-build-stats:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v3
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata for stats image
        id: meta-stats
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}/stats
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=sha
            type=raw,value=latest,enable={{is_default_branch}}
      - name: Build stats Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          file: ./stats.Dockerfile
          load: true
          platforms: linux/amd64
          tags: |
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}/stats:${{ github.sha }}
      - name: Run unit tests
        run: |
          docker run --rm \
            -v ${{ github.workspace }}/tests:/app/tests \
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}/stats:${{ github.sha }} \
            python -m pytest -s tests/
      - name: Push stats Docker image
        if: |
          success() && (
            github.event_name == 'push' && github.ref == 'refs/heads/master' ||
            github.event_name == 'pull_request' && github.event.pull_request.head.repo.full_name == github.repository
          )
        uses: docker/build-push-action@v5
        with:
          context: .
          file: ./stats.Dockerfile
          push: true
          platforms: linux/amd64,linux/arm64
          tags: ${{ steps.meta-stats.outputs.tags }}
          labels: ${{ steps.meta-stats.outputs.labels }}
  build-site:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Extract metadata for site image
        id: meta-site
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}/site
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=sha
            type=raw,value=latest,enable={{is_default_branch}}
      - name: Build and push site Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          file: ./site.Dockerfile
          push: ${{ github.event.pull_request.head.repo.full_name == github.repository }}
          tags: ${{ steps.meta-site.outputs.tags }}
          labels: ${{ steps.meta-site.outputs.labels }}
================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# IPython Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# dotenv
.env
# virtualenv
venv/
ENV/
# Spyder project settings
.spyderproject
# Rope project settings
.ropeproject
# Eclipse PyDev
.project
.pydevproject
.settings/
# Jekyll files to run github-pages locally
_site
.sass-cache
.jekyll-metadata
Gemfile
Gemfile.lock
assets
_includes
_sass
js
vendor/
.bundle/
themes/
# crawl statistics files
stats/*.gz
stats/crawls.txt
stats/excerpt/
# generated CSV data
data/
# macOS Desktop Services Store
.DS_Store
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: README.md
================================================
Basic Statistics of Common Crawl Monthly Archives
=================================================
Analyze the [Common Crawl](https://commoncrawl.org/) data to get metrics about the monthly crawl archives:
* size of the monthly crawls, number of
  * fetched pages
  * unique URLs
  * unique documents (by content digest)
* number of different hosts, domains, top-level domains
* distribution of pages/URLs on hosts, domains, top-level domains
* and ...
  * mime types
  * protocols / schemes (http vs. https)
  * content languages (since summer 2018)
The following steps describe how to generate the statistics from the Common Crawl URL index files.
The results are presented on https://commoncrawl.github.io/cc-crawl-statistics/.
Step 1: Count Items
-------------------
The items (URLs, hosts, domains, etc.) are counted using the Common Crawl index files
on AWS S3 `s3://commoncrawl/cc-index/collections/*/indexes/cdx-*.gz`.
1. define a pattern of cdx files to process - usually from one monthly crawl (here: `CC-MAIN-2016-26`)
- either a smaller set of local files for testing
```
INPUT="test/cdx/cdx-0000[0-3].gz"
```
- or one monthly crawl to be accessed via Hadoop on AWS S3:
```
INPUT="s3a://commoncrawl/cc-index/collections/CC-MAIN-2016-26/indexes/cdx-*.gz"
```
2. run `crawlstats.py --job=count` to process the cdx files and count the items:
```
python3 crawlstats.py --job=count --no-exact-counts \
--no-output --output-dir .../count/ $INPUT
```
Help on command-line parameters (including [mrjob](https://pypi.org/project/mrjob/) options) is shown by
`python3 crawlstats.py --help`.
The option `--no-exact-counts` is recommended (and is the default) to save storage space and computation time
when counting URLs and content digests.
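
As a rough illustration of what the counting step produces, the following Python sketch counts fetched pages, unique URLs and hosts from a few CDX-like lines. It is not the code of `crawlstats.py`: it assumes that each index line has the layout `<SURT key> <timestamp> <JSON record>` and that the record has a `url` field, and it counts exactly using an in-memory set, which would presumably be too expensive for a full monthly crawl. The function name `count_items` and the sample lines are made up for this example.

```python
import json
from collections import Counter
from urllib.parse import urlsplit


def count_items(cdx_lines):
    """Count pages, unique URLs and pages per host from CDX-like lines."""
    pages = 0
    urls = set()
    hosts = Counter()
    for line in cdx_lines:
        # Assumed layout: "<SURT key> <timestamp> <JSON record>"
        _key, _timestamp, record = line.rstrip("\n").split(" ", 2)
        url = json.loads(record)["url"]
        pages += 1            # every index line is one fetched page
        urls.add(url)         # exact unique URL count (hypothetical shortcut)
        hosts[urlsplit(url).hostname] += 1
    return {"pages": pages, "unique_urls": len(urls), "hosts": hosts}


# Hand-written sample lines, not taken from a real crawl
sample = [
    'org,example)/ 20160624000000 {"url": "http://example.org/", "mime": "text/html"}',
    'org,example)/a 20160624000001 {"url": "http://example.org/a", "mime": "text/html"}',
    'org,example)/ 20160625000000 {"url": "http://example.org/", "mime": "text/html"}',
    'com,example,www)/ 20160624000002 {"url": "https://www.example.com/", "mime": "text/html"}',
]
result = count_items(sample)
print(result["pages"], result["unique_urls"], dict(result["hosts"]))
```

The same URL fetched twice counts as two pages but only one unique URL, which is the difference between the first two metrics listed at the top of this README.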
Step 2: Aggregate Counts
------------------------
Run `crawlstats.py --job=stats` on the output of step 1:
```
python3 crawlstats.py --job=stats --max-top-hosts-domains=500 \
--no-output --output-dir .../stats/ .../count/
```
The maximum number of most frequent hosts and domains contained in the output is set by the option
`--max-top-hosts-domains=N`.
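
The effect of limiting the output to the `N` most frequent items can be sketched in a few lines of Python. This is only an illustration of the idea, assuming the partial counts are available as plain dictionaries; the function name `top_items` and the sample counts are hypothetical and do not reflect how `crawlstats.py` implements the aggregation.

```python
from collections import Counter


def top_items(partial_counts, max_top):
    """Sum partial counts and keep only the max_top most frequent items."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)  # adds the counts of items seen before
    return total.most_common(max_top)


# Hypothetical partial counts, e.g. from two chunks of the step 1 output
partial = [
    {"example.org": 3, "example.com": 1},
    {"example.com": 4, "example.net": 2},
]
top = top_items(partial, max_top=2)
print(top)
```

With `max_top=2`, the least frequent domain (`example.net`) is dropped from the result.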
Step 3: Download the Data
-------------------------
To prepare the plots, the output of step 2 must be downloaded to local disk.
The simplest way is to fetch the data from the Common Crawl Public Data Set bucket on AWS S3:
```sh
while read crawl; do
aws s3 cp s3://commoncrawl/crawl-analysis/$crawl/stats/part-00000.gz ./stats/$crawl.gz
done <<EOF
CC-MAIN-2008-2009
...
EOF
```
One aggregated, gzip-compressed statistics file is about 1 MiB in size. So you could just run
[get_stats.sh](get_stats.sh) to download the data files for all released monthly crawls.
The output of step 1 is also provided on `s3://commoncrawl/`. The counts for every crawl are held
in 10 bzip2-compressed files, together about 1 GiB per crawl on average. To download the counts for one crawl:
- if you're on AWS and [AWS CLI]() is installed and configured
```sh
CRAWL=CC-MAIN-2022-05
aws s3 cp --recursive s3://commoncrawl/crawl-analysis/$CRAWL/count stats/count/$CRAWL
```
- otherwise
```sh
CRAWL=CC-MAIN-2022-05
mkdir -p stats/count/$CRAWL
for i in $(seq 0 9); do
curl https://data.commoncrawl.org/crawl-analysis/$CRAWL/count/part-0000$i.bz2 \
>stats/count/$CRAWL/part-0000$i.bz2
done
```
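
If neither the AWS CLI nor curl is at hand, the same download can be sketched in Python with the standard library only. The URL layout is copied from the curl loop above; the function names `count_part_urls` and `download_counts` are made up for this example, and the assumption of 10 parts per crawl is taken from the description above.

```python
import urllib.request
from pathlib import Path

# Base URL as used in the curl example above
BASE_URL = "https://data.commoncrawl.org"


def count_part_urls(crawl, parts=10):
    """Build the URLs of the count files part-00000.bz2 ... part-00009.bz2."""
    return [
        f"{BASE_URL}/crawl-analysis/{crawl}/count/part-{i:05d}.bz2"
        for i in range(parts)
    ]


def download_counts(crawl, target_dir="stats/count"):
    """Download all count files of one crawl (about 1 GiB in total)."""
    out_dir = Path(target_dir) / crawl
    out_dir.mkdir(parents=True, exist_ok=True)
    for url in count_part_urls(crawl):
        file_name = url.rsplit("/", 1)[1]
        urllib.request.urlretrieve(url, str(out_dir / file_name))


# Only list the URLs here; call download_counts("CC-MAIN-2022-05") to fetch them
urls = count_part_urls("CC-MAIN-2022-05")
print(urls[0])
```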
Step 4: Plot the Data
---------------------
To prepare the plots using the downloaded aggregated data:
```
gzip -dc stats/CC-MAIN-*.gz | python3 plot/crawl_size.py
```
The full list of commands to prepare all plots is found in [plot.sh](plot.sh). Don't forget to install the Python
modules [required for plotting](requirements_plot.txt).
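
To inspect the downloaded files without running a plot script, the records can also be read directly in Python. The sketch below assumes that every line of a statistics file holds a JSON-encoded key and a JSON-encoded value separated by a tab, which is the usual mrjob output layout but is not confirmed by this README. The function name `read_records` and the demo record are hypothetical.

```python
import glob
import gzip
import json
import os
import tempfile


def read_records(pattern):
    """Yield (key, value) pairs from all gzip files matching the pattern."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as stream:
            for line in stream:
                # Assumed layout: "<JSON key>\t<JSON value>"
                key, val = line.rstrip("\n").split("\t", 1)
                yield json.loads(key), json.loads(val)


# Demo with a tiny, hand-written file; the key layout is invented
tmpdir = tempfile.mkdtemp()
with gzip.open(os.path.join(tmpdir, "CC-MAIN-demo.gz"), "wt", encoding="utf-8") as out:
    out.write('["size", "page", "CC-MAIN-demo"]\t123\n')

records = list(read_records(os.path.join(tmpdir, "CC-MAIN-*.gz")))
print(records)
```

For the real data, the pattern would be `stats/CC-MAIN-*.gz`, matching the `gzip -dc` command above.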
Step 5: Local Site Preview
--------------------------
The [crawl statistics site](https://commoncrawl.github.io/cc-crawl-statistics/) is hosted by [GitHub Pages](https://pages.github.com/). The site is updated as soon as plots or description texts are updated, committed and pushed to the GitHub repository.
To preview local changes, it's possible to serve the site locally:
1. build the Docker image with Ruby, Jekyll and the content to be served
```
docker build -f site.Dockerfile -t cc-crawl-statistics-site:latest .
```
2. run a Docker container to serve the site preview
```
docker run --network=host --rm -ti cc-crawl-statistics-site:latest
```
The site should be served on localhost, port 4000 (http://127.0.0.1:4000).
If not, the correct location is shown in the output of the `docker run` command.
If running this on a Mac, you may find that the loopback interface (127.0.0.1) within the container is not accessible, so you can change the line in the [Dockerfile](site.Dockerfile) to:
```
CMD bundle exec jekyll serve --host 0.0.0.0
```
... and then the site will be served on http://0.0.0.0:4000 instead. (You will of course need to rebuild the Docker image after updating the Dockerfile.)
Run via Container
-----------------
The whole workflow can be run as a container (docker or podman) including downloading stats files from Common Crawl's S3 bucket and generating new plots.
```bash
# clone the repository (to have the latest crawl IDs)
git clone https://github.com/commoncrawl/cc-crawl-statistics.git
cd cc-crawl-statistics
# download stats and generate plots
# SSH, AWS keys, and stats and plots directories must be mounted into the container
podman run --rm -v ~/.aws:/root/.aws:ro -v $(pwd -P)/stats:/app/stats -v $(pwd -P)/plots:/app/plots ghcr.io/commoncrawl/cc-crawl-statistics/stats:latest
# if needed you can manually build the container image
podman build -f stats.Dockerfile -t ghcr.io/commoncrawl/cc-crawl-statistics/stats:latest
# for development it is recommended to mount the whole repository into the container
podman run -it -v ~/.aws:/root/.aws:ro -v $(pwd -P):/app ghcr.io/commoncrawl/cc-crawl-statistics/stats:latest /bin/bash
```
Related Projects
----------------
The [columnar index](https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/)
simplifies counting and analytics a lot - easier to maintain, more transparent, reproducible and
extensible than running two MapReduce jobs. See the example
- [SQL queries](https://github.com/commoncrawl/cc-index-table#query-the-table-in-amazon-athena) and
- [Jupyter notebooks](https://github.com/commoncrawl/cc-notebooks)
================================================
FILE: _config.yml
================================================
title: Statistics of Common Crawl Monthly Archives
description: Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
repository: commoncrawl/cc-crawl-statistics
latest_crawl: CC-MAIN-2026-17
show_navigation: True
navlist:
  - title: Home
    url: /
  - title: Size of crawls
    url: /plots/crawlsize
  - title: Top-level domains
    url: /plots/tlds
  - title: Registered domains
    url: /plots/domains
  - title: Crawler metrics
    url: /plots/crawlermetrics
  - title: Crawl overlaps
    url: /plots/crawloverlap
  - title: Media types
    url: /plots/mimetypes
  - title: Character sets
    url: /plots/charsets
  - title: Languages
    url: /plots/languages
theme: jekyll-theme-minimal
================================================
FILE: _layouts/default.html
================================================
<!doctype html>
<html lang="{{ site.lang | default: "en-US" }}">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="chrome=1">
<title>{{ site.title | default: site.github.repository_name }} by {{ site.github.owner_name }}</title>
<link rel="stylesheet" href="{{ '/assets/css/style.css?v=' | append: site.github.build_revision | relative_url }}">
<meta name="viewport" content="width=device-width">
<!--[if lt IE 9]>
<script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
</head>
<body>
<div class="wrapper">
<header>
<h1>{{ site.title | default: site.github.repository_name }}</h1>
<p>{{ site.description | default: site.github.project_tagline }}
<br>Latest crawl: {{site.latest_crawl}}
</p>
{% if site.show_navigation %}
<nav>
<p>
{% for node in site.navlist %}
<a href="{{ site.baseurl }}{{ node.url }}">{{ node.title }}</a><br/>
{% endfor %}
</p>
</nav>
{% endif %}
</header>
<section>
{{ content }}
</section>
<footer>
{% if site.github.is_project_page %}
<p class="view"><a href="{{ site.github.repository_url }}">View the Project on GitHub <small>{{ github_name }}</small></a></p>
{% endif %}
{% if site.github.is_project_page %}
<p>This project is maintained by <a href="{{ site.github.owner_url }}">{{ site.github.owner_name }}</a></p>
{% endif %}
<p><small>Hosted on GitHub Pages — Theme by <a href="https://github.com/orderedlist">orderedlist</a></small></p>
</footer>
</div>
<script src="{{ '/assets/js/scale.fix.js' | relative_url }}"></script>
</body>
</html>
<!--
Based on:
https://github.com/pages-themes/minimal
https://github.com/orderedlist/minimal
-->
================================================
FILE: _layouts/table.html
================================================
<!doctype html>
<html lang="{{ site.lang | default: "en-US" }}">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="chrome=1">
<title>{{ site.title | default: site.github.repository_name }} by {{ site.github.owner_name }}</title>
<link rel="stylesheet" href="{{ '/assets/css/style.css?v=' | append: site.github.build_revision | relative_url }}">
<meta name="viewport" content="width=device-width">
<!--[if lt IE 9]>
<script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<script src="{{ '/assets/js/jquery-3.7.1.min.js?v=' | append: site.github.build_revision | relative_url }}"></script>
<script src="{{ '/assets/js/jquery.tablesorter.min.js?v=' | append: site.github.build_revision | relative_url }}"></script>
<script type="text/javascript">
$(document).ready(function() {
$(".tablesorter").tablesorter({{ page.table_sortlist }});
$("table.iso639-3-language tbody tr th").each(function() {
$(this).html("<a href='https://en.wikipedia.org/wiki/ISO_639:" + $(this).html() + "'>" + $(this).html() + "</a>");
});
$("#search").on("keyup", function() {
var value = $(this).val().toLowerCase();
$(".tablesearcher tbody tr").filter(function() {
$(this).toggle($(this).children('th').text().indexOf(value) > -1);
});
});
});
</script>
<style>
table.tablepercentage { table-layout: auto; }
table.tablepercentage thead tr:first-child th {
hyphens: auto; font-size: small;
}
table.tablepercentage td { min-width: 60px; }
table.tablepercentage tbody th { max-width: 200px; hyphens: auto; }
table.tablepercentage thead tr:last-child .header:not(:first-child):before {
content: "%";
}
table.matrix tbody tr th {
font-weight: bold;
}
table.tablesorter thead tr:last-child .header {
background-image: url({{ '/assets/img/bg.gif?v=' | append: site.github.build_revision | relative_url }});
background-repeat: no-repeat;
background-position: center right;
cursor: pointer;
}
table.tablesorter thead tr:last-child .headerSortUp {
background-image: url({{ '/assets/img/asc.gif?v=' | append: site.github.build_revision | relative_url }});
}
table.tablesorter thead tr:last-child .headerSortDown {
background-image: url({{ '/assets/img/desc.gif?v=' | append: site.github.build_revision | relative_url }});
}
table tbody th { font-weight: normal; }
table tbody td { text-align: right; }
</style>
</head>
<body>
<div class="wrapper">
<header>
<h1>{{ site.title | default: site.github.repository_name }}</h1>
<p>{{ site.description | default: site.github.project_tagline }}
<br>Latest crawl: {{site.latest_crawl}}
</p>
{% if site.show_navigation %}
<nav>
<p>
{% for node in site.navlist %}
<a href="{{ site.baseurl }}{{ node.url }}">{{ node.title }}</a><br/>
{% endfor %}
</p>
</nav>
{% endif %}
{% if site.github.is_project_page %}
<p class="view"><a href="{{ site.github.repository_url }}">View the Project on GitHub <small>{{ github_name }}</small></a></p>
{% endif %}
</header>
<section>
{{ content }}
{% if page.table_searcher %}
<p><input type="text" id="search" placeholder="{{ page.table_searcher }}"></p>
{% endif %}
{% for table in page.table_include %}
{% include_relative {{ table }} %}
<br>
{% endfor %}
</section>
<footer>
{% if site.github.is_project_page %}
<p>This project is maintained by <a href="{{ site.github.owner_url }}">{{ site.github.owner_name }}</a></p>
{% endif %}
<p><small>Hosted on GitHub Pages — Theme by <a href="https://github.com/orderedlist">orderedlist</a></small></p>
</footer>
</div>
<script src="{{ '/assets/js/scale.fix.js' | relative_url }}"></script>
</body>
</html>
<!--
Based on:
https://github.com/pages-themes/minimal
https://github.com/orderedlist/minimal
http://tablesorter.com/
-->
================================================
FILE: crawlplot.py
================================================
"""
Base plotting module for Common Crawl statistics visualization.
This module provides the CrawlPlot base class which handles:
- Plot library selection (matplotlib, rpy2/ggplot2, or legacy ggplot)
- Common plot styling to match ggplot2 aesthetics
- Data input from stdin or files
- Output directory management
The plot library is controlled by the PLOTLIB environment variable:
- 'matplotlib' (recommended)
- 'rpy2.ggplot2' (requires R and rpy2)
- 'ggplot' (deprecated)
The output directory is controlled by PLOTDIR (defaults to 'plots/').
"""
import json
import logging
import os
import os.path
import sys
from typing import Literal
import fsspec
import numpy as np
# Supported plot library backends
PlotLibType = Literal["rpy2.ggplot2", "ggplot", "matplotlib"]
class CrawlPlot:
"""
Base class for Common Crawl statistics plots.
Provides common functionality for all plot types including:
- Plot library initialization and configuration
- Data reading from stdin or gzipped files
- Line plot generation with consistent styling
- Output directory management
Subclasses should implement:
- add(key, val): Process a single data record
- plot(): Generate the specific visualization
Attributes:
PLOTLIB: The plotting library to use ('matplotlib', 'rpy2.ggplot2', or 'ggplot')
PLOTDIR: Directory for saving plot output files
DEFAULT_FIGSIZE: Default figure size in inches (7 = 2100px at 300 DPI)
DEFAULT_DPI: Default resolution for saved figures
"""
GGPLOT2_THEME = None
GGPLOT2_THEME_KWARGS = None
# figure with square aspect ratio : 7 inches * 300 DPI = 2100 pixels
DEFAULT_FIGSIZE = 7
DEFAULT_DPI = 300
title_fontsize = 15
title_pad = 20
title_fontweight = "normal"
title_loc = "left"
xlabel_fontsize = 12
ylabel_fontsize = 12
ticks_fontsize = 10
ticks_color = "#E6E6E6"
ticks_length = 8
ticks_width = 1.0
bar_width = 0.8
legend_fontsize = 10
legend_title_fontsize = 11
line_width = 0.75
marker_size = 4
grid_major_linewidth = 1.0
grid_minor_linewidth = 0.5
grid_major_color = "#E6E6E6"
grid_minor_color = "#E6E6E6"
tight_layout_pad = 0.5
savefig_facecolor = "white"
savefig_bbox_inches = None
# -------------------------------------------------------------------------
# Matplotlib helper methods for reducing code duplication
# -------------------------------------------------------------------------
def create_figure(self, ratio=1.0):
"""Create a matplotlib figure with consistent sizing.
Args:
ratio: Height ratio relative to width (default: 1.0 for square)
Returns:
Tuple of (fig, ax)
"""
import matplotlib.pyplot as plt
return plt.subplots(figsize=(self.DEFAULT_FIGSIZE, self.DEFAULT_FIGSIZE * ratio))
def set_title(self, ax, title):
"""Apply consistent title styling to an axes.
Args:
ax: matplotlib Axes object
title: Title text
"""
ax.set_title(
title,
fontsize=self.title_fontsize,
fontweight=self.title_fontweight,
pad=self.title_pad,
loc=self.title_loc,
)
def apply_ggplot2_style(self, ax, show_grid=True, grid_axis='both'):
"""Apply ggplot2-like minimal styling to an axes.
Removes spines, adds grid lines, and sets axes below plot elements.
Args:
ax: matplotlib Axes object
show_grid: Whether to show grid lines (default: True)
grid_axis: Which axis to show grid on ('both', 'x', or 'y')
"""
# Remove all spines
for spine in ['top', 'right', 'left', 'bottom']:
ax.spines[spine].set_visible(False)
# Add grid
if show_grid:
ax.grid(True, which='major', linewidth=self.grid_major_linewidth,
color=self.grid_major_color, zorder=0, axis=grid_axis)
ax.set_axisbelow(True)
def set_tick_labels_black(self, ax):
"""Set all tick labels to black color.
Args:
ax: matplotlib Axes object
"""
for label in ax.get_xticklabels() + ax.get_yticklabels():
label.set_color('black')
def apply_nice_ticks(self, ax, axis='y', use_scientific=True):
"""Apply nice tick spacing using the nice_tick_step calculation.
Sets minor and major ticks at 'nice' intervals (multiples of 1, 2, or 5).
Optionally applies scientific notation for large values.
Args:
ax: matplotlib Axes object
axis: Which axis to apply to ('x' or 'y')
use_scientific: Whether to use scientific notation for large values
"""
from matplotlib.ticker import MultipleLocator, FormatStrFormatter
if axis == 'y':
vmin, vmax = ax.get_ylim()
axis_obj = ax.yaxis
else:
vmin, vmax = ax.get_xlim()
axis_obj = ax.xaxis
minor = self.nice_tick_step(vmin, vmax, n=8)
major = 2 * minor
axis_obj.set_minor_locator(MultipleLocator(minor))
axis_obj.set_major_locator(MultipleLocator(major))
if use_scientific and vmax > 1e4:
axis_obj.set_major_formatter(FormatStrFormatter('%.0e'))
def save_figure(self, fig, img_path):
"""Save figure with consistent settings and close it.
Args:
fig: matplotlib Figure object
img_path: Output file path
Returns:
The figure object (for chaining)
"""
import matplotlib.pyplot as plt
plt.tight_layout(pad=self.tight_layout_pad)
plt.savefig(img_path, dpi=self.DEFAULT_DPI,
bbox_inches=self.savefig_bbox_inches,
facecolor=self.savefig_facecolor)
plt.close()
return fig
def hide_tick_marks(self, ax, tick_color='#FFFFFF'):
"""Hide tick marks by setting them to a background color.
The tick labels remain visible but the tick marks themselves are hidden.
Args:
ax: matplotlib Axes object
tick_color: Color to set ticks to (default: white)
"""
ax.tick_params(axis='both', which='both', colors=tick_color,
length=self.ticks_length, width=self.ticks_width)
def __init__(self):
"""Initialize the plot with library selection and output directory setup."""
# Settings defined via environment variables
self.PLOTLIB: PlotLibType = os.environ.get('PLOTLIB', 'matplotlib')
self.PLOTDIR = os.environ.get('PLOTDIR', 'plots')
if self.PLOTLIB == 'ggplot':
# nothing to do here
pass
elif self.PLOTLIB == 'rpy2.ggplot2':
from rpy2.robjects.lib import ggplot2
from rpy2.robjects import pandas2ri
pandas2ri.activate()
# use minimal theme with white background set in plot constructor
# https://ggplot2.tidyverse.org/reference/ggtheme.html
self.GGPLOT2_THEME = ggplot2.theme_minimal(base_size=12, base_family="Helvetica")
self.GGPLOT2_THEME_KWARGS = {
'panel.background': ggplot2.element_rect(fill='white', color='white'),
'plot.background': ggplot2.element_rect(fill='white', color='white')
}
elif self.PLOTLIB == "matplotlib":
import matplotlib.pyplot as plt
# ggplot2-inspired color palette
ggplot_colors = [
"#F8766D", "#00BE67", "#00A9FF", "#CD9600", "#7CAE00",
"#00BFC4", "#C77CFF", "#FF61CC",
]
# Set up ggplot2-like minimal theme with larger fonts
plt.style.use('default')
plt.rcParams.update({
'font.family': 'sans-serif',
'font.sans-serif': ['Liberation Sans', 'Arial', 'DejaVu Sans'],
'font.size': 20, # Much larger base font size
'axes.linewidth': 1.5,
'axes.spines.left': True,
'axes.spines.bottom': True,
'axes.spines.top': False,
'axes.spines.right': False,
'axes.axisbelow': True,
'axes.grid': True,
'axes.grid.axis': 'both',
'grid.linewidth': 1.0,
'grid.color': '#E6E6E6', # Gray grid lines
'axes.facecolor': 'white', # White background
'figure.facecolor': 'white',
'xtick.bottom': True,
'xtick.top': False,
'ytick.left': True,
'ytick.right': False,
'xtick.direction': 'out',
'ytick.direction': 'out',
'axes.prop_cycle': plt.cycler(color=ggplot_colors),
})
else:
raise ValueError("Invalid PLOTLIB defined")
# Make sure output directories exists
os.makedirs(os.path.join(self.PLOTDIR, "crawler"), exist_ok=True)
os.makedirs(os.path.join(self.PLOTDIR, "crawloverlap"), exist_ok=True)
os.makedirs(os.path.join(self.PLOTDIR, "crawlsize"), exist_ok=True)
os.makedirs(os.path.join(self.PLOTDIR, "tld"), exist_ok=True)
def read_from_stdin_or_file(self):
"""Read statistics data from a file argument or stdin.
If a file path is provided as the first command line argument,
reads from that file (supports gzip compression). Otherwise,
reads from stdin.
"""
if len(sys.argv) > 1:
# File provided as argument
fp = sys.argv[1]
compression = ("gzip" if fp.endswith(".gz") else None)
with fsspec.open(fp, 'r', compression=compression) as f:
self.read_data(f)
else:
# No argument, use stdin
self.read_data(sys.stdin)
def read_data(self, stream):
"""Parse tab-separated JSON key-value pairs from a stream.
Args:
stream: Input stream containing lines of tab-separated JSON data.
Each line should have format: JSON_KEY<tab>JSON_VALUE
"""
for line in stream:
keyval = line.split('\t')
if len(keyval) == 2:
key = json.loads(keyval[0])
val = json.loads(keyval[1])
self.add(key, val)
else:
logging.error("Not a key-value pair: {}".format(line))
def line_plot_with_ggplot(
self,
data,
title,
ylabel,
img_path,
x="date",
y="size",
c="type",
clabel="",
ratio=1.0,
):
"""Generate a line plot using the legacy ggplot library (deprecated)."""
from ggplot import ggplot, aes, ggtitle, ylab, xlab, scale_x_date, date_breaks, geom_line, geom_point
date_label = "%Y\n%W" # year + week number
p = (
ggplot(data, aes(x=x, y=y, color=c))
+ ggtitle(title)
+ ylab(ylabel)
+ xlab(" ")
+ scale_x_date(breaks=date_breaks("3 months"), labels=date_label)
+ geom_line()
+ geom_point()
)
p.save(img_path)
return p
def line_plot_with_rpy2_ggplot2(
self,
data,
title,
ylabel,
img_path,
x="date",
y="size",
c="type",
clabel="",
ratio=1.0,
):
"""Generate a line plot using R's ggplot2 via rpy2."""
from rpy2.robjects.lib import ggplot2
# Convert y axis to float because R uses 32-bit signed integers
# and values >= 2^31 (about 2.1 billion) will overflow
data[y] = data[y].astype(float)
if y != "size" and "size" in data.columns:
data["size"] = data["size"].astype(float)
p = (
ggplot2.ggplot(data)
+ ggplot2.aes_string(x=x, y=y, color=c)
+ ggplot2.geom_line(linewidth=0.5)
+ ggplot2.geom_point()
+ self.GGPLOT2_THEME
+ ggplot2.theme(
**{
"legend.position": "bottom",
"aspect.ratio": ratio,
**self.GGPLOT2_THEME_KWARGS,
}
)
+ ggplot2.labs(title=title, x="", y=ylabel, color=clabel)
)
p.save(img_path)
return p
@staticmethod
def nice_tick_step(vmin, vmax, n=5):
"""Calculate a 'nice' tick step for axis labels.
Returns a tick step value that is a multiple of 1, 2, or 5 times
a power of 10, which produces clean, readable axis labels.
Args:
vmin: Minimum value of the axis range
vmax: Maximum value of the axis range
n: Approximate number of tick intervals desired (default: 5)
Returns:
A 'nice' tick step value (1/2/5 * 10^k)
"""
span = abs(vmax - vmin)
if span == 0:
return 1.0
raw = span / n
exp = np.floor(np.log10(raw))
frac = raw / (10**exp)
nice_frac = 1 if frac <= 1 else 2 if frac <= 2 else 5 if frac <= 5 else 10
return nice_frac * 10**exp
@staticmethod
def center_legend_title(fig, ax, leg_items, leg_title, x_axes=0.1):
"""Center the legend title vertically with respect to legend items."""
fig.canvas.draw()
r = fig.canvas.get_renderer()
bb = leg_items.get_window_extent(r)
y = fig.transFigure.inverted().transform((0, (bb.y0+bb.y1)/2))[1]
x = fig.transFigure.inverted().transform(ax.transAxes.transform((x_axes, 0)))[0]
leg_title.set_bbox_to_anchor((x, y), transform=fig.transFigure)
def line_plot_with_matplotlib(
self,
data,
title,
ylabel,
img_path,
x="date",
y="size",
c="type",
clabel="",
ratio=1.0,
):
"""Generate a line plot using matplotlib with ggplot2-like styling.
Creates a multi-series line plot with markers, styled to match
ggplot2's minimal theme aesthetic.
Args:
data: pandas DataFrame containing the plot data
title: Plot title
ylabel: Y-axis label
img_path: Output file path for the saved image
x: Column name for x-axis values (default: 'date')
y: Column name for y-axis values (default: 'size')
c: Column name for grouping/color (default: 'type')
clabel: Legend title (default: '')
ratio: Aspect ratio for the plot (default: 1.0)
Returns:
matplotlib Figure object
"""
from matplotlib.ticker import AutoMinorLocator
from matplotlib.dates import YearLocator, DateFormatter
# Convert y axis to float for consistency with large values
data[y] = data[y].astype(float)
if y != "size" and "size" in data.columns:
data["size"] = data["size"].astype(float)
fig, ax = self.create_figure()
groups = data.groupby(c)
# Use ggplot2 default colors for small group counts
colors = ["#F8766D", "#00BA38", "#619CFF"] if len(groups) <= 3 else None
for i, (group_key, group_df) in enumerate(groups):
group_color = colors[i] if colors is not None else None
ax.plot(
group_df[x], group_df[y], "o-",
color=group_color, label=group_key,
linewidth=self.line_width, markersize=self.marker_size,
)
self.set_title(ax, title)
ax.set_xlabel("")
ax.set_ylabel(ylabel, fontsize=self.ylabel_fontsize)
# Apply nice y-axis ticks
self.apply_nice_ticks(ax, axis='y')
# Axes ratio
axes_aspect_ratio = 1 / ax.get_data_ratio() * ratio
if axes_aspect_ratio < 1:
ax.set_aspect(axes_aspect_ratio)
# Date formatting for x-axis
ax.xaxis.set_major_formatter(DateFormatter("%Y"))
ax.xaxis.set_major_locator(YearLocator(base=5))
ax.xaxis.set_minor_locator(AutoMinorLocator(2))
ax.tick_params(axis="both", labelsize=self.ticks_fontsize)
# Grid with both major and minor lines
ax.grid(True, which="major", linewidth=self.grid_major_linewidth,
color=self.grid_major_color, zorder=0)
ax.grid(True, which="minor", linewidth=self.grid_minor_linewidth,
color=self.grid_minor_color, zorder=0)
ax.set_axisbelow(True)
# Apply ggplot2 style (remove spines)
for spine in ['top', 'right', 'left', 'bottom']:
ax.spines[spine].set_visible(False)
# Hide tick marks but keep labels black
self.hide_tick_marks(ax)
self.set_tick_labels_black(ax)
# Legend setup
num_legend_items = len(groups)
ncol = 5 if num_legend_items == 5 else 4
if clabel:
leg_items = ax.legend(
loc="upper center", ncol=ncol, bbox_to_anchor=(0.6, -0.1),
frameon=False, fontsize=self.legend_fontsize,
)
ax.legend(
[], [], title=clabel, loc="upper center",
bbox_to_anchor=(0.2, -0.075), frameon=False,
title_fontsize=self.legend_title_fontsize,
)
ax.add_artist(leg_items)
else:
ax.legend(
loc="upper center", bbox_to_anchor=(0.5, -0.1),
ncol=ncol, frameon=False, fontsize=self.legend_fontsize,
)
return self.save_figure(fig, img_path)
def line_plot(
self,
data,
title,
ylabel,
img_file,
x="date",
y="size",
c="type",
clabel="",
ratio=1.0,
):
"""Generate a line plot using the configured plotting library.
This is the main entry point for creating line plots. It delegates
to the appropriate backend based on the PLOTLIB setting.
Args:
data: pandas DataFrame containing the plot data
title: Plot title
ylabel: Y-axis label
img_file: Output filename relative to PLOTDIR
x: Column name for x-axis values (default: 'date')
y: Column name for y-axis values (default: 'size')
c: Column name for grouping/color (default: 'type')
clabel: Legend title (default: '')
ratio: Aspect ratio for the plot (default: 1.0)
Returns:
Plot object (type depends on backend)
"""
img_path = os.path.join(self.PLOTDIR, img_file)
if self.PLOTLIB == "ggplot":
return self.line_plot_with_ggplot(
data=data,
title=title,
ylabel=ylabel,
img_path=img_path,
x=x,
y=y,
c=c,
clabel=clabel,
ratio=ratio,
)
elif self.PLOTLIB == "rpy2.ggplot2":
return self.line_plot_with_rpy2_ggplot2(
data=data,
title=title,
ylabel=ylabel,
img_path=img_path,
x=x,
y=y,
c=c,
clabel=clabel,
ratio=ratio,
)
elif self.PLOTLIB == "matplotlib":
return self.line_plot_with_matplotlib(
data=data,
title=title,
ylabel=ylabel,
img_path=img_path,
x=x,
y=y,
c=c,
clabel=clabel,
ratio=ratio,
)
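
# --- Illustrative sketch (not part of the original module) -------------------
# A minimal, standalone restatement of the rounding rule used by
# CrawlPlot.nice_tick_step above, assuming only the standard library
# (math instead of numpy). The helper name _nice_tick_step_sketch and the
# sample range are hypothetical and chosen only for illustration.
import math


def _nice_tick_step_sketch(vmin, vmax, n=5):
    """Return a step of 1, 2, 5 or 10 times a power of ten for the range."""
    span = abs(vmax - vmin)
    if span == 0:
        return 1.0
    raw = span / n                      # ideal, but "ugly", step width
    exp = math.floor(math.log10(raw))   # order of magnitude of the raw step
    frac = raw / (10 ** exp)            # mantissa in [1, 10)
    nice_frac = 1 if frac <= 1 else 2 if frac <= 2 else 5 if frac <= 5 else 10
    return nice_frac * 10 ** exp


# With n=8 (as in apply_nice_ticks) a range of 0..3.5e9 gives a raw step of
# 4.375e8, which is rounded up to the next "nice" value, 5e8. The major
# tick step would then be twice that.
_minor = _nice_tick_step_sketch(0, 3.5e9, n=8)
_major = 2 * _minor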
================================================
FILE: crawlstats.py
================================================
import heapq
import json
import logging
import os
import re
from collections import defaultdict, Counter
from datetime import date
from enum import Enum
from urllib.parse import urlparse
import mrjob.util
import tldextract
import ujson
from hyperloglog import HyperLogLog
from isoweek import Week
from mrjob.job import MRJob, MRStep
from mrjob.protocol import JSONProtocol, RawValueProtocol
HYPERLOGLOG_ERROR = .01
# threshold when to add a HyperLogLog for SURT domains
MIN_SURT_HLL_SIZE = 50000
LOGGING_FORMAT = '%(asctime)s: [%(levelname)s]: %(message)s'
LOGGING_LEVEL = logging.INFO
LOG = logging.getLogger('CCStatsJob')
mrjob.util.log_to_stream(format=LOGGING_FORMAT,
level=LOGGING_LEVEL,
name='CCStatsJob')
class MonthlyCrawl:
"""Enumeration of monthly crawl archives"""
by_name = {
'CC-MAIN-2008-2009': 88,
'CC-MAIN-2009-2010': 89,
'CC-MAIN-2012': 90,
'CC-MAIN-2013-20': 91,
'CC-MAIN-2013-48': 92,
'CC-MAIN-2014-10': 93,
'CC-MAIN-2014-15': 94,
'CC-MAIN-2014-23': 95,
'CC-MAIN-2014-35': 96,
'CC-MAIN-2014-41': 97,
'CC-MAIN-2014-42': 98,
'CC-MAIN-2014-49': 99,
'CC-MAIN-2014-52': 0,
'CC-MAIN-2015-06': 1,
'CC-MAIN-2015-11': 2,
'CC-MAIN-2015-14': 3,
'CC-MAIN-2015-18': 4,
'CC-MAIN-2015-22': 5,
'CC-MAIN-2015-27': 6,
'CC-MAIN-2015-32': 7,
'CC-MAIN-2015-35': 8,
'CC-MAIN-2015-40': 9,
'CC-MAIN-2015-48': 10,
'CC-MAIN-2016-07': 11,
'CC-MAIN-2016-18': 12,
'CC-MAIN-2016-22': 13,
'CC-MAIN-2016-26': 14,
'CC-MAIN-2016-30': 15,
'CC-MAIN-2016-36': 16,
'CC-MAIN-2016-40': 17,
'CC-MAIN-2016-44': 18,
'CC-MAIN-2016-50': 19,
'CC-MAIN-2017-04': 20,
'CC-MAIN-2017-09': 21,
'CC-MAIN-2017-13': 22,
'CC-MAIN-2017-17': 23,
'CC-MAIN-2017-22': 24,
'CC-MAIN-2017-26': 25,
'CC-MAIN-2017-30': 26,
'CC-MAIN-2017-34': 27,
'CC-MAIN-2017-39': 28,
'CC-MAIN-2017-43': 29,
'CC-MAIN-2017-47': 30,
'CC-MAIN-2017-51': 31,
'CC-MAIN-2018-05': 32,
'CC-MAIN-2018-09': 33,
'CC-MAIN-2018-13': 34,
'CC-MAIN-2018-17': 35,
'CC-MAIN-2018-22': 36,
'CC-MAIN-2018-26': 37,
'CC-MAIN-2018-30': 38,
'CC-MAIN-2018-34': 39,
'CC-MAIN-2018-39': 40,
'CC-MAIN-2018-43': 41,
'CC-MAIN-2018-47': 42,
'CC-MAIN-2018-51': 43,
'CC-MAIN-2019-04': 44,
'CC-MAIN-2019-09': 45,
'CC-MAIN-2019-13': 46,
'CC-MAIN-2019-18': 47,
'CC-MAIN-2019-22': 48,
'CC-MAIN-2019-26': 49,
'CC-MAIN-2019-30': 50,
'CC-MAIN-2019-35': 51,
'CC-MAIN-2019-39': 52,
'CC-MAIN-2019-43': 53,
'CC-MAIN-2019-47': 54,
'CC-MAIN-2019-51': 55,
'CC-MAIN-2020-05': 56,
'CC-MAIN-2020-10': 57,
'CC-MAIN-2020-16': 58,
'CC-MAIN-2020-24': 59,
'CC-MAIN-2020-29': 60,
'CC-MAIN-2020-34': 61,
'CC-MAIN-2020-40': 62,
'CC-MAIN-2020-45': 63,
'CC-MAIN-2020-50': 64,
'CC-MAIN-2021-04': 65,
'CC-MAIN-2021-10': 66,
'CC-MAIN-2021-17': 67,
'CC-MAIN-2021-21': 68,
'CC-MAIN-2021-25': 69,
'CC-MAIN-2021-31': 70,
'CC-MAIN-2021-39': 71,
'CC-MAIN-2021-43': 72,
'CC-MAIN-2021-49': 73,
'CC-MAIN-2022-05': 74,
'CC-MAIN-2022-21': 75,
'CC-MAIN-2022-27': 76,
'CC-MAIN-2022-33': 77,
'CC-MAIN-2022-40': 78,
'CC-MAIN-2022-49': 79,
'CC-MAIN-2023-06': 80,
'CC-MAIN-2023-14': 81,
'CC-MAIN-2023-23': 82,
'CC-MAIN-2023-40': 83,
'CC-MAIN-2023-50': 84,
'CC-MAIN-2024-10': 85,
'CC-MAIN-2024-18': 86,
'CC-MAIN-2024-22': 87,
'CC-MAIN-2024-26': 100,
'CC-MAIN-2024-30': 101,
'CC-MAIN-2024-33': 102,
'CC-MAIN-2024-38': 103,
'CC-MAIN-2024-42': 104,
'CC-MAIN-2024-46': 105,
'CC-MAIN-2024-51': 106,
'CC-MAIN-2025-05': 107,
'CC-MAIN-2025-08': 108,
'CC-MAIN-2025-13': 109,
'CC-MAIN-2025-18': 110,
'CC-MAIN-2025-21': 111,
'CC-MAIN-2025-26': 112,
'CC-MAIN-2025-30': 113,
'CC-MAIN-2025-33': 114,
'CC-MAIN-2025-38': 115,
'CC-MAIN-2025-43': 116,
'CC-MAIN-2025-47': 117,
'CC-MAIN-2025-51': 118,
'CC-MAIN-2026-04': 119,
'CC-MAIN-2026-08': 120,
'CC-MAIN-2026-12': 121,
'CC-MAIN-2026-17': 122,
'CC-MAIN-2026-21': 123,
}
by_id = dict(map(reversed, by_name.items()))
@staticmethod
def get_by_name(name):
return MonthlyCrawl.by_name[name]
@staticmethod
def to_name(crawl):
return MonthlyCrawl.by_id[crawl]
@staticmethod
def to_bit_mask(crawl):
return (1 << crawl)
@staticmethod
def date_of(crawl):
if crawl == 'CC-MAIN-2008-2009':
return date(2009, 1, 12)
if crawl == 'CC-MAIN-2009-2010':
return date(2010, 9, 25)
if crawl == 'CC-MAIN-2012':
return date(2012, 11, 2)
[_, _, year, week] = crawl.split('-')
return Week(int(year), int(week)).monday()
@staticmethod
def year_of(crawl):
return MonthlyCrawl.date_of(crawl).year
@staticmethod
def short_name(name):
return name.replace('CC-MAIN-', '')
@staticmethod
def get_latest(n):
return sorted(MonthlyCrawl.by_name.keys())[-n:]
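
# --- Illustrative sketch (not part of the original module) -------------------
# MonthlyCrawl.date_of above maps a name such as 'CC-MAIN-2024-10' to the
# Monday of ISO week 10 of 2024 via isoweek.Week. The hypothetical helper
# below shows the same mapping with the standard library only, assuming
# Python 3.8+ for date.fromisocalendar. The three legacy crawls with
# hard-coded dates are deliberately not handled here.
from datetime import date as _date


def _crawl_monday_sketch(crawl_name):
    """Return the Monday of the ISO week encoded in 'CC-MAIN-YYYY-WW'."""
    _, _, year, week = crawl_name.split('-')
    return _date.fromisocalendar(int(year), int(week), 1)  # 1 = Monday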
class MonthlyCrawlSet:
"""Dense representation of a list of monthly crawls.
Represent in which crawls a given item (URL, but also
domain, host, digest) occurs.
"""
def __init__(self, crawls=0):
self.bits = crawls
def add(self, crawl):
self.bits |= MonthlyCrawl.to_bit_mask(crawl)
def update(self, *others):
for other in others:
self.bits |= other.get_bits()
def clear(self):
self.bits = 0
def discard(self, crawl):
self.bits &= ~MonthlyCrawl.to_bit_mask(crawl)
def __contains__(self, crawl):
return (self.bits & MonthlyCrawl.to_bit_mask(crawl)) != 0
def __len__(self):
"""Number of crawls in the set (popcount).
Crawl ids exceed 31, so a fixed-width 32-bit popcount would
undercount; count the set bits of the unbounded integer instead."""
return bin(self.bits).count('1')
def get_bits(self):
return self.bits
def get_crawls(self):
i = self.bits
r = 0
while (i):
if (i & 1):
yield r
r += 1
i >>= 1
def is_new(self, crawl):
"""True if there are no older crawls in set (no lower id)"""
if (self.bits == 0):
return True
i = self.bits
i = (i ^ (i - 1)) >> 1 # set trailing 0s to 1s and zero rest
r = 0
while (i):
if r == crawl:
return True
r += 1
i >>= 1
if (r < crawl):
return False
return True
def is_newest(self, crawl):
"""True if crawl is the newest crawl in set (highest id)"""
# i = self.bits
# j = MonthlyCrawl.to_bit_mask(crawl)
# return (i & ~j) < j
return self.bits.bit_length() == (crawl + 1)
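
# --- Illustrative sketch (not part of the original module) -------------------
# How MonthlyCrawlSet above packs crawl ids into a single Python integer,
# shown with plain int operations. The helper name and the ids 0, 3 and 100
# are arbitrary examples. Python integers are unbounded, so ids above 31
# need no special handling as long as no fixed-width bit tricks are used.
def _crawl_bits_sketch(crawl_ids):
    """Return an int with one bit set per crawl id."""
    bits = 0
    for crawl in crawl_ids:
        bits |= 1 << crawl          # same mask as MonthlyCrawl.to_bit_mask
    return bits


_bits = _crawl_bits_sketch([0, 3, 100])
_members = [r for r in range(_bits.bit_length()) if (_bits >> r) & 1]
_count = bin(_bits).count('1')            # number of crawls in the set
_newest = _bits.bit_length() - 1          # highest id, cf. is_newest()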
class CST(Enum):
"""Enum for crawl statistics types.
Every line (key-value pair) has a marker which indicates the type
of the count / frequency:
- pages, URLs, hosts, etc.
- size (number of unique items), histograms, etc.
The type marker (the first element in the key tuple) determines
the format of the line (key-value pair):
<<type, key_params...>, <values...>>
The format may vary for different steps (job, mapper, reducer).
The count job (CCCountJob) uses the numeric types to reduce
the data size, while the statistics job (CCStatsJob) outputs the
type names for better readability.
Types of countable items
# <<type, item, crawl>, <count(s)>>
# For hosts, domains, etc. MultiCount is used to hold two counts -
# the number of pages and URLs per item."""
url = 0
"""(unique) URL"""
digest = 1
"""(unique) content digest (MD5)"""
host = 2
"""hostname ("www.commoncrawl.org")"""
domain = 3
"""pay-level domain or private domain ("commoncrawl.org")"""
tld = 4
"""public suffix ("org" or "co.uk")
- not necessarily a TLD / "top-level domain" according to
https://github.com/google/guava/wiki/InternetDomainNameExplained
- here following https://github.com/john-kurkowski/tldextract"""
surt_domain = 5
"""SURT domain ("org,commoncrawl")
- Sort-friendly URI Reordering Transform, cf.
http://crawler.archive.org/articles/user_manual/glossary.html#surt"""
scheme = 6
"""URI scheme ("http", "https")
see https://en.wikipedia.org/wiki/Uniform_Resource_Identifier#Syntax"""
mimetype = 7
"""MIME type / media type / content type
- as sent by the server as "Content-Type" in the HTTP header,
weakly normalized, not verified"""
mimetype_detected = 77
"""MIME type detected based on content, URL and HTTP Content-Type"""
page = 8
"""number of successfully fetched pages (HTTP status 200),
including URL-level and content-level duplicates"""
fetch = 9
"""number of fetches, including 404s, redirects, robots.txt, etc.
- since CC-MAIN-2016-50"""
http_status = 10
"""number of HTTP status codes (200, 404, etc.)
- since CC-MAIN-2016-50"""
charset = 11
"""detected charset
- since CC-MAIN-2018-34"""
languages = 12
"""detected languages or combination of languages
- since CC-MAIN-2018-34
NOTE: since CLD2 identifies 160 languages and reports up to 3 languages
per document, the number of possible combinations is too high (about
4 million) and only the more common ones are preserved"""
primary_language = 13
"""primary language of the document (first of the detected languages)
- since CC-MAIN-2018-34"""
crawl_status = 55
"""crawl status (successful fetches, 404s, exceptions, etc.)
- following Nutch CrawlDatum status codes
- similar to HTTP status but less fine-grained
- includes crawler-specific statuses (e.g., "denied by robots.txt")"""
robotstxt_status = 56
"""HTTP status of robots.txt responses"""
size = 90
"""size of a crawl (number of unique items):
- pages,
- URLs (one URL may be fetched multiple times),
- content digests,
- domains, hosts, top-level domains
- mime types
- etc.
format:
<<size, item_type, crawl>, number_of_unique_items>"""
size_estimate = 91
"""estimates for unique URLs and content digests
- estimates by HyperLogLog probabilistic counters"""
size_estimate_for = 92
"""estimates per large-sized item
(domains, hosts, TLDs, SURT domains)
- aimed to estimate domain coverage over time / multiple crawls
- CC-MAIN-2016-44 adds HyperLogLogs for SURT domain (>=50,000 URLs)
format:
<<size_estimate_for, per_item_type, per_item, item_type, crawl>, hll>"""
size_robotstxt = 93
"""number of robots.txt fetches"""
new_items = 95
"""new items (URLs, content digests) for a given crawl
- first seen in this crawl, not observed in previous crawls
- only with exact counts for all crawls
- could be estimated by HyperLogLog set operations otherwise"""
histogram = 96
"""frequency of item counts per page or URL
format:
<<type, item_type, crawl, counted_per, count>, frequency>"""
class MultiCount(defaultdict):
"""Dictionary with multiple counters for the same key"""
def __init__(self, size):
self.default_factory = lambda: [0]*size
self.size = size
def incr(self, key, *counts):
for i in range(0, self.size):
self[key][i] += counts[i]
@staticmethod
def compress(size, counts):
compress_from = size-1
last_val = counts[compress_from]
while compress_from > 0 and last_val == counts[compress_from-1]:
compress_from -= 1
if compress_from == 0:
return counts[0]
else:
return counts[0:compress_from+1]
def get_compressed(self, key):
return MultiCount.compress(self.size, self.get(key))
@staticmethod
def get_count(index, value):
if isinstance(value, int):
return value
if len(value) <= index:
return value[-1]
return value[index]
@staticmethod
def sum_values(values, compress=True):
counts = [0]
size = 1
for val in values:
if isinstance(val, int):
# compressed count, one unique count
for i in range(0, size):
counts[i] += val
else:
if len(val) >= size:
# enlarge counts array
base_count = counts[-1]
for j in range(size, len(val)):
counts.append(base_count)
size = len(val)
for i in range(0, len(val)):
counts[i] += val[i]
if len(val) < size:
for j in range(i+1, size):
# add compressed counts
counts[j] += val[i]
if compress:
return MultiCount.compress(size, counts)
else:
return counts
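
# --- Illustrative sketch (not part of the original module) -------------------
# MultiCount.compress above drops trailing counts that repeat the preceding
# value, and MultiCount.get_count reads them back by falling back to the last
# stored value. The standalone helpers below restate both rules. Their names
# and the sample counts in the comments are hypothetical.
def _compress_sketch(counts):
    """Strip trailing duplicates; a fully uniform list collapses to one int."""
    end = len(counts) - 1
    while end > 0 and counts[end] == counts[end - 1]:
        end -= 1
    return counts[0] if end == 0 else counts[:end + 1]


def _get_count_sketch(index, value):
    """Read position `index` from a possibly compressed value."""
    if isinstance(value, int):
        return value
    return value[index] if index < len(value) else value[-1]


# [7, 7, 7] is stored as 7 and [9, 5, 5] as [9, 5]; reading index 2 of the
# compressed [9, 5] falls back to the last stored count, 5.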
class CrawlStatsJSONEncoder(json.JSONEncoder):
def default(self, o):
if isinstance(o, MonthlyCrawlSet):
return o.get_bits()
if isinstance(o, HyperLogLog):
return CrawlStatsJSONEncoder.json_encode_hyperloglog(o)
return json.JSONEncoder.default(self, o)
@staticmethod
def json_encode_hyperloglog(o):
return {'__type__': 'HyperLogLog',
'card': o.card(),
'p': o.p, 'M': o.M, 'm': o.m, 'alpha': o.alpha}
class CrawlStatsJSONDecoder(json.JSONDecoder):
def __init__(self, *args, **kargs):
json.JSONDecoder.__init__(self, object_hook=self.dict_to_object,
*args, **kargs)
def dict_to_object(self, dic):
if '__type__' not in dic:
return dic
if dic['__type__'] == 'HyperLogLog':
try:
return CrawlStatsJSONDecoder.json_decode_hyperloglog(dic)
except Exception as e:
LOG.error('Cannot decode object of type {0}'.format(
dic['__type__']))
raise e
return dic
@staticmethod
def json_decode_hyperloglog(dic):
hll = HyperLogLog(HYPERLOGLOG_ERROR)
hll.p = dic['p']
hll.m = dic['m']
hll.alpha = dic['alpha']
hll.M = dic['M']
return hll
class HostDomainCount:
"""Counts requiring URL parsing (host, domain, TLD, scheme).
For each item both total pages and unique URLs are counted.
"""
# dots are escaped so that only literal dotted quads match
IPpattern = re.compile(r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$')
def __init__(self):
self.hosts = MultiCount(2)
self.schemes = MultiCount(2)
def add(self, url, count):
uri = urlparse(url)
host = uri.hostname
if host is not None:
host = host.lower().strip('.')
self.hosts.incr(host, count, 1)
self.schemes.incr(uri.scheme, count, 1)
def output(self, crawl):
domains = MultiCount(3) # pages, URLs, hosts
tlds = MultiCount(4) # pages, URLs, hosts, domains
for scheme, counts in self.schemes.items():
yield (CST.scheme.value, scheme, crawl), counts
for host, counts in self.hosts.items():
yield (CST.host.value, host, crawl), counts
try:
parsedhost = tldextract.extract(host)
hosttld = parsedhost.suffix
except TypeError as e:
LOG.error('Failed to parse host {}: {}'.format(host, e))
hosttld = None
if hosttld is None:
hostdomain = '(invalid)'
elif hosttld == '':
hostdomain = parsedhost.domain
if self.IPpattern.match(host):
hosttld = '(ip address)'
else:
hostdomain = '.'.join([parsedhost.domain, parsedhost.suffix])
domains.incr((hostdomain, hosttld),
counts[0], counts[1], 1)
for dom, counts in domains.items():
tlds.incr(dom[1], counts[0], counts[1], counts[2], 1)
yield (CST.domain.value, dom[0], crawl), counts
for tld, counts in tlds.items():
yield (CST.tld.value, tld, crawl), counts
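
# --- Illustrative sketch (not part of the original module) -------------------
# A dotted-quad check along the lines of HostDomainCount.IPpattern, assuming
# the dots are meant to match literally (hence the backslashes). It only
# checks the shape of the host name, not that each octet is <= 255. The name
# _ip_shape_sketch and the sample hosts in the comment are hypothetical.
import re as _re

_ip_shape_sketch = _re.compile(r'^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$')

# '192.168.0.1' matches, while an all-digit host without dots such as
# '1234567' does not. With unescaped dots the latter would match as well.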
class SurtDomainCount:
"""Counters for one single SURT prefix/domain."""
robots_txt_warc_pattern = re.compile(r'/robotstxt/')
def __init__(self, surt_domain):
self.surt_domain = surt_domain
self.pages = 0
self.url = defaultdict(int)
self.digest = defaultdict(lambda: [0, 0])
self.mime = defaultdict(lambda: [0, 0])
self.mime_detected = defaultdict(lambda: [0, 0])
self.charset = defaultdict(lambda: [0, 0])
self.languages = defaultdict(lambda: [0, 0])
self.http_status = defaultdict(int)
self.robotstxt_status = defaultdict(lambda: [0, 0])
self.robotstxt_url = defaultdict(int)
def add(self, _path, metadata):
status = -1
if 'status' in metadata:
status = int(metadata['status'])
if self.robots_txt_warc_pattern.search(metadata['filename']):
self.robotstxt_status[status][0] += 1
if metadata['url'] not in self.robotstxt_url:
self.robotstxt_status[status][1] += 1
self.robotstxt_url[metadata['url']] += 1
# do not count robots.txt responses as "ordinary" pages
return
self.http_status[status] += 1
if status != 200:
# skip content-related metrics for non-200 responses
return
self.pages += 1
mime = 'unk'
if 'mime' in metadata:
mime = metadata['mime']
self.mime[mime][0] += 1
mime_detected = None
if 'mime-detected' in metadata:
mime_detected = metadata['mime-detected']
self.mime_detected[mime_detected][0] += 1
charset = None
if 'charset' in metadata:
charset = metadata['charset']
self.charset[charset][0] += 1
languages = None
if 'languages' in metadata:
languages = metadata['languages']
self.languages[languages][0] += 1
digest = None
if 'digest' in metadata:
digest = metadata['digest']
self.digest[digest][0] += 1
if metadata['url'] not in self.url:
if digest:
self.digest[digest][1] += 1
self.mime[mime][1] += 1
if mime_detected:
self.mime_detected[mime_detected][1] += 1
if languages:
self.languages[languages][1] += 1
if charset:
self.charset[charset][1] += 1
self.url[metadata['url']] += 1
def unique_urls(self):
return len(self.url)
def output(self, crawl, exact_count=True, min_surt_hll_size=50000):
counts = (self.pages, self.unique_urls())
host_domain_count = HostDomainCount()
surt_hll = None
if self.unique_urls() >= min_surt_hll_size:
surt_hll = HyperLogLog(HYPERLOGLOG_ERROR)
for url, count in self.url.items():
host_domain_count.add(url, count)
if exact_count:
yield (CST.url.value, self.surt_domain, url), (crawl, count)
if surt_hll is not None:
surt_hll.add(url)
if exact_count:
for digest, counts in self.digest.items():
yield (CST.digest.value, digest), (crawl, counts)
for mime, counts in self.mime.items():
yield (CST.mimetype.value, mime, crawl), counts
for mime, counts in self.mime_detected.items():
yield (CST.mimetype_detected.value, mime, crawl), counts
for charset, counts in self.charset.items():
yield (CST.charset.value, charset, crawl), counts
for languages, counts in self.languages.items():
yield (CST.languages.value, languages, crawl), counts
# yield primary language
prim_l = languages.split(',')[0]
yield (CST.primary_language.value, prim_l, crawl), counts
for key, val in host_domain_count.output(crawl):
yield key, val
yield((CST.surt_domain.value, self.surt_domain, crawl),
(self.pages, self.unique_urls(), len(host_domain_count.hosts)))
if surt_hll is not None:
yield((CST.size_estimate_for.value, CST.surt_domain.value,
self.surt_domain, CST.url.value, crawl),
(self.unique_urls(),
CrawlStatsJSONEncoder.json_encode_hyperloglog(surt_hll)))
for status, counts in self.http_status.items():
yield (CST.http_status.value, status, crawl), counts
for url, count in self.robotstxt_url.items():
yield (CST.size_robotstxt.value, CST.url.value, crawl), 1
yield (CST.size_robotstxt.value, CST.page.value, crawl), count
for status, counts in self.robotstxt_status.items():
yield (CST.robotstxt_status.value, status, crawl), counts
class UnhandledTypeError(Exception):
def __init__(self, outputType):
self.message = 'Unhandled type {}\n'.format(outputType)
class InputError(Exception):
def __init__(self, message):
self.message = message
class CCStatsJob(MRJob):
'''Job to get crawl statistics from Common Crawl index
--job=count
run count job (first step) to get counts
from Common Crawl index files (cdx-*.gz)
--job=stats
run statistics job (second step) on output
from count job'''
OUTPUT_PROTOCOL = JSONProtocol
JOBCONF = {
'mapreduce.task.timeout': '9600000',
'mapreduce.map.speculative': 'false',
'mapreduce.reduce.speculative': 'false',
'mapreduce.job.jvm.numtasks': '-1',
}
s3pattern = re.compile(r'^s3://([^/]+)/(.+)')
gzpattern = re.compile(r'\.gz$')
crawlpattern = re.compile(r'(CC-MAIN-2\d{3}-\d{2})')
def configure_args(self):
"""Custom command line options for common crawl index statistics"""
super(CCStatsJob, self).configure_args()
self.add_passthru_arg(
'--job', dest='job_to_run',
default='', choices=['count', 'stats', ''],
help='''Job(s) to run ("count", "stats", or empty to run both)''')
self.add_passthru_arg(
'--exact-counts', dest='exact_counts',
action='store_true', default=None,
help='''Exact counts for URLs and content digests,
this increases the output size significantly''')
self.add_passthru_arg(
'--no-exact-counts', dest='exact_counts',
action='store_false', default=None,
help='''No exact counts for URLs and content digests
to save storage space and computation time''')
self.add_passthru_arg(
'--max-top-hosts-domains', dest='max_hosts',
type=int, default=200,
help='''Max. number of most frequent hosts or domains shown
in final statistics (cf. --min-urls-top-host-domain)''')
self.add_passthru_arg(
'--min-urls-top-host-domain', dest='min_domain_frequency',
type=int, default=1,
help='''Min. number of URLs required per host or domain shown
in final statistics (cf. --max-top-hosts-domains).''')
self.add_passthru_arg(
'--min-lang-comb-freq', dest='min_lang_comb_freq',
type=int, default=1,
help='''Min. number of pages required for a combination of detected
languages to be shown in final statistics.''')
self.add_passthru_arg(
'--crawl', dest='crawl', default=None,
help='''ID/name of the crawl analyzed (if not given, it is
detected from the input path)''')
def input_protocol(self):
if self.options.job_to_run != 'stats':
LOG.debug('Reading text input from cdx files')
return RawValueProtocol()
LOG.debug('Reading JSON input from count job')
return JSONProtocol()
def hadoop_input_format(self):
input_format = self.HADOOP_INPUT_FORMAT
if self.options.job_to_run != 'stats':
input_format = 'org.apache.hadoop.mapred.TextInputFormat'
LOG.info("Setting input format for {} job: {}".format(
self.options.job_to_run, input_format))
return input_format
def count_mapper_init(self):
"""Because cdx.gz files cannot be split and
mapreduce.input.fileinputformat.split.minsize is set to a value larger
than any cdx.gz file, each mapper is guaranteed to process the content
of a single cdx file. Input lines of a cdx file are sorted by SURT URL,
which allows URL counts for one SURT domain to be aggregated in memory.
One SURT domain may span multiple cdx files.
In this case (and without --exact-counts) the count of unique URLs
and the URL histograms may be slightly off if the same URL also occurs
in a second cdx file. However, this problem is negligible because
there are only 300 cdx files."""
self.counters = Counter()
self.cdx_path = os.environ['mapreduce_map_input_file']
LOG.info('Reading {0}'.format(self.cdx_path))
self.crawl_name = None
self.crawl = None
if self.options.crawl is not None:
self.crawl_name = self.options.crawl
else:
crawl_name_match = self.crawlpattern.search(self.cdx_path)
if crawl_name_match is not None:
self.crawl_name = crawl_name_match.group(1)
else:
raise InputError(
"Cannot determine ID of monthly crawl from input path {}"
.format(self.cdx_path))
if self.crawl_name is None:
raise InputError("Name of crawl not given")
self.crawl = MonthlyCrawl.get_by_name(self.crawl_name)
self.fetches_total = 0
self.pages_total = 0
self.urls_total = 0
self.urls_hll = HyperLogLog(HYPERLOGLOG_ERROR)
self.digest_hll = HyperLogLog(HYPERLOGLOG_ERROR)
self.url_histogram = Counter()
self.count = None
# first and last SURT may continue in previous/next cdx
self.min_surt_hll_size = 1
self.increment_counter('cdx-stats', 'cdx files processed', 1)
def count_mapper(self, _, line):
self.fetches_total += 1
if (self.fetches_total % 1000) == 0:
self.increment_counter('cdx-stats', 'cdx lines read', 1000)
if (self.fetches_total % 100000) == 0:
LOG.info('Read {0} cdx lines'.format(self.fetches_total))
else:
LOG.debug('Read {0} cdx lines'.format(self.fetches_total))
parts = line.split(' ')
[surt_domain, path] = parts[0].split(')', 1)
if self.count is None:
self.count = SurtDomainCount(surt_domain)
if surt_domain != self.count.surt_domain:
# output accumulated statistics for one SURT domain
for pair in self.count.output(self.crawl,
self.options.exact_counts,
self.min_surt_hll_size):
yield pair
self.urls_total += self.count.unique_urls()
for url, cnt in self.count.url.items():
self.urls_hll.add(url)
self.url_histogram[cnt] += 1
for digest in self.count.digest:
self.digest_hll.add(digest)
self.pages_total += self.count.pages
self.count = SurtDomainCount(surt_domain)
self.min_surt_hll_size = MIN_SURT_HLL_SIZE
json_string = ' '.join(parts[2:])
try:
metadata = ujson.loads(json_string)
self.count.add(path, metadata)
except ValueError as e:
LOG.error('Failed to parse json: {0} - {1}'.format(
e, json_string))
def count_mapper_final(self):
self.increment_counter('cdx-stats',
'cdx lines read', self.fetches_total % 1000)
if self.count is None:
return
for pair in self.count.output(self.crawl, self.options.exact_counts, 1):
yield pair
self.urls_total += self.count.unique_urls()
for url, cnt in self.count.url.items():
self.urls_hll.add(url)
self.url_histogram[cnt] += 1
for digest in self.count.digest:
self.digest_hll.add(digest)
self.pages_total += self.count.pages
if not self.options.exact_counts:
for count, frequency in self.url_histogram.items():
yield((CST.histogram.value, CST.url.value, self.crawl,
CST.page.value, count), frequency)
yield (CST.size.value, CST.page.value, self.crawl), self.pages_total
yield (CST.size.value, CST.fetch.value, self.crawl), self.fetches_total
if not self.options.exact_counts:
yield (CST.size.value, CST.url.value, self.crawl), self.urls_total
yield((CST.size_estimate.value, CST.url.value, self.crawl),
CrawlStatsJSONEncoder.json_encode_hyperloglog(self.urls_hll))
yield((CST.size_estimate.value, CST.digest.value, self.crawl),
CrawlStatsJSONEncoder.json_encode_hyperloglog(self.digest_hll))
self.increment_counter('cdx-stats', 'cdx files finished', 1)
def reducer_init(self):
self.counters = Counter()
self.mostfrequent = defaultdict(list)
def count_reducer(self, key, values):
outputType = key[0]
if outputType in (CST.size.value, CST.size_robotstxt.value):
yield key, sum(values)
elif outputType == CST.histogram.value:
yield key, sum(values)
elif outputType in (CST.url.value, CST.digest.value):
# only with --exact-counts
crawls = MonthlyCrawlSet()
new_crawls = set()
page_count = MultiCount(2)
for val in values:
if isinstance(val, list):
if (outputType == CST.url.value):
(crawl, pages) = val
page_count.incr(crawl, pages, 1)
else: # digest
(crawl, (pages, urls)) = val
page_count.incr(crawl, pages, urls)
crawls.add(crawl)
new_crawls.add(crawl)
else:
# crawl set bit mask
crawls.update(val)
yield key, crawls.get_bits()
for new_crawl in new_crawls:
if crawls.is_new(new_crawl):
self.counters[(CST.new_items.value,
outputType, new_crawl)] += 1
# url/digest duplicate histograms
for crawl, counts in page_count.items():
items = (1+counts[0]-counts[1])
self.counters[(CST.histogram.value, outputType,
crawl, CST.page.value, items)] += 1
# size in terms of unique URLs and unique content digests
for crawl, counts in page_count.items():
self.counters[(CST.size.value, outputType, crawl)] += 1
elif outputType in (CST.mimetype.value,
CST.mimetype_detected.value,
CST.charset.value,
CST.languages.value,
CST.primary_language.value,
CST.scheme.value,
CST.tld.value,
CST.domain.value,
CST.surt_domain.value,
CST.host.value,
CST.http_status.value,
CST.robotstxt_status.value):
yield key, MultiCount.sum_values(values)
elif outputType == CST.size_estimate.value:
hll = HyperLogLog(HYPERLOGLOG_ERROR)
for val in values:
hll.update(
CrawlStatsJSONDecoder.json_decode_hyperloglog(val))
yield(key,
CrawlStatsJSONEncoder.json_encode_hyperloglog(hll))
elif outputType == CST.size_estimate_for.value:
res = None
hll = None
cnt = 0
for val in values:
if res:
if hll is None:
cnt = res[0]
hll = CrawlStatsJSONDecoder.json_decode_hyperloglog(res[1])
cnt += val[0]
hll.update(CrawlStatsJSONDecoder.json_decode_hyperloglog(val[1]))
else:
res = val
if hll is not None and cnt >= MIN_SURT_HLL_SIZE:
yield(key, (cnt, CrawlStatsJSONEncoder.json_encode_hyperloglog(hll)))
elif res[0] >= MIN_SURT_HLL_SIZE:
yield(key, res)
else:
raise UnhandledTypeError(outputType)
def stats_mapper_init(self):
self.counters = Counter()
def stats_mapper(self, key, value):
if key[0] in (CST.url.value, CST.digest.value,
CST.size_estimate_for.value):
return
if ((self.options.min_domain_frequency > 1) and
(key[0] in (CST.host.value, CST.domain.value,
CST.surt_domain.value))):
# quick skip of infrequent hosts and domains,
# significantly limits the number of tuples processed in the reducer
page_count = MultiCount.get_count(0, value)
url_count = MultiCount.get_count(1, value)
self.counters[(CST.size.value, key[0], key[2])] += 1
self.counters[(CST.histogram.value, key[0],
key[2], CST.page.value, page_count)] += 1
self.counters[(CST.histogram.value, key[0],
key[2], CST.url.value, url_count)] += 1
if key[0] in (CST.domain.value, CST.surt_domain.value):
host_count = MultiCount.get_count(2, value)
self.counters[(CST.histogram.value, key[0],
key[2], CST.host.value, host_count)] += 1
if url_count < self.options.min_domain_frequency:
return
if key[0] == CST.languages.value:
# yield only frequent language combinations (if configured)
page_count = MultiCount.get_count(0, value)
if ((self.options.min_lang_comb_freq > 1) and
(page_count < self.options.min_lang_comb_freq) and
(',' in key[1])):
return
yield key, value
def stats_mapper_final(self):
for (counter, count) in self.counters.items():
yield counter, count
def stats_reducer(self, key, values):
outputType = CST(key[0])
item = key[1]
crawl = MonthlyCrawl.to_name(key[2])
if outputType in (CST.size, CST.new_items,
CST.size_estimate, CST.size_robotstxt):
verbose_key = (outputType.name, CST(item).name, crawl)
if outputType in (CST.size, CST.size_robotstxt):
val = sum(values)
elif outputType == CST.new_items:
val = MultiCount.sum_values(values)
elif outputType == CST.size_estimate:
# already "reduced" in count job
for val in values:
break
yield verbose_key, val
elif outputType == CST.histogram:
yield((outputType.name, CST(item).name, crawl,
CST(key[3]).name, key[4]), sum(values))
elif outputType in (CST.mimetype, CST.mimetype_detected, CST.charset,
CST.languages, CST.primary_language, CST.scheme,
CST.surt_domain, CST.tld, CST.domain, CST.host,
CST.http_status, CST.robotstxt_status):
item = key[1]
for counts in values:
page_count = MultiCount.get_count(0, counts)
url_count = MultiCount.get_count(1, counts)
if outputType in (CST.domain, CST.surt_domain, CST.tld):
host_count = MultiCount.get_count(2, counts)
if (self.options.min_domain_frequency <= 1 or
outputType not in (CST.host, CST.domain,
CST.surt_domain)):
self.counters[(CST.size.name, outputType.name, crawl)] += 1
self.counters[(CST.histogram.name, outputType.name,
crawl, CST.page.name, page_count)] += 1
self.counters[(CST.histogram.name, outputType.name,
crawl, CST.url.name, url_count)] += 1
if outputType in (CST.domain, CST.surt_domain, CST.tld):
self.counters[(CST.histogram.name, outputType.name,
crawl, CST.host.name, host_count)] += 1
if outputType == CST.tld:
domain_count = MultiCount.get_count(3, counts)
self.counters[(CST.histogram.name, outputType.name,
crawl, CST.domain.name, domain_count)] += 1
if outputType in (CST.domain, CST.host, CST.surt_domain):
outKey = (outputType.name, crawl)
outVal = (page_count, url_count, item)
if outputType in (CST.domain, CST.surt_domain):
outVal = (page_count, url_count, host_count, item)
# take most common
if len(self.mostfrequent[outKey]) < self.options.max_hosts:
heapq.heappush(self.mostfrequent[outKey], outVal)
else:
heapq.heappushpop(self.mostfrequent[outKey], outVal)
else:
yield((outputType.name, item, crawl), counts)
else:
raise UnhandledTypeError(outputType)
def reducer_final(self):
for (counter, count) in self.counters.items():
yield counter, count
for key, mostfrequent in self.mostfrequent.items():
(outputType, crawl) = key
if outputType in (CST.domain.name, CST.surt_domain.name):
for (pages, urls, hosts, item) in mostfrequent:
yield((outputType, item, crawl),
MultiCount.compress(3, [pages, urls, hosts]))
else:
for (pages, urls, item) in mostfrequent:
yield((outputType, item, crawl),
MultiCount.compress(2, [pages, urls]))
def steps(self):
reduces = 10
cdxminsplitsize = 2**32 # do not split cdx map input files
if self.options.exact_counts:
# with exact counts, many reducers are needed to aggregate the counts
# in reasonable time and to keep partitions from getting too large
reduces = 200
count_job = \
MRStep(mapper_init=self.count_mapper_init,
mapper=self.count_mapper,
mapper_final=self.count_mapper_final,
reducer_init=self.reducer_init,
reducer=self.count_reducer,
reducer_final=self.reducer_final,
jobconf={'mapreduce.job.reduces': reduces,
'mapreduce.input.fileinputformat.split.minsize':
cdxminsplitsize,
'mapreduce.output.fileoutputformat.compress':
"true",
'mapreduce.output.fileoutputformat.compress.codec':
'org.apache.hadoop.io.compress.BZip2Codec'})
stats_job = \
MRStep(mapper_init=self.stats_mapper_init,
mapper=self.stats_mapper,
mapper_final=self.stats_mapper_final,
reducer_init=self.reducer_init,
reducer=self.stats_reducer,
reducer_final=self.reducer_final,
jobconf={'mapreduce.job.reduces': 1,
'mapreduce.output.fileoutputformat.compress':
"true",
'mapreduce.output.fileoutputformat.compress.codec':
'org.apache.hadoop.io.compress.GzipCodec'})
if self.options.job_to_run == 'count':
return [count_job]
if self.options.job_to_run == 'stats':
return [stats_job]
return [count_job, stats_job]
if __name__ == '__main__':
CCStatsJob.run()
================================================
FILE: get_stats.sh
================================================
#!/bin/bash
set -o pipefail
if aws s3 ls s3://commoncrawl/crawl-analysis/ | sed -E 's@.* @@; s@/$@@' >./stats/crawls.txt; then
ON_AWS=true;
echo "Running on AWS (AWS CLI configured for authenticated access)"
else
echo "Downloading from https://data.commoncrawl.org/ using curl"
# list of crawls enumerated in crawlstats.py
python3 -c 'from crawlstats import MonthlyCrawl; [print(c) for c in sorted(MonthlyCrawl.by_name.keys())]' >./stats/crawls.txt
ON_AWS=false
fi
while read -r crawl; do
echo "$crawl"
if [ -e "stats/$crawl.gz" ]; then
echo " ... exists"
continue
fi
if $ON_AWS; then
aws s3 cp "s3://commoncrawl/crawl-analysis/$crawl/stats/part-00000.gz" "./stats/$crawl.gz"
else
curl --silent "https://data.commoncrawl.org/crawl-analysis/$crawl/stats/part-00000.gz" >"./stats/$crawl.gz"
fi
done <./stats/crawls.txt
================================================
FILE: get_stats_and_plot.sh
================================================
#!/bin/bash
set -e
echo "Starting ..."
./get_stats.sh
# make sure plot directories exist
mkdir -p plots/crawler
mkdir -p plots/crawloverlap
mkdir -p plots/crawlsize
mkdir -p plots/throughput
mkdir -p plots/tld
./plot.sh
echo "Done."
================================================
FILE: index.md
================================================
Statistics of Common Crawl Monthly Archives
===========================================
Statistics of [Common Crawl](https://commoncrawl.org/)'s [web archives](https://commoncrawl.org/the-data/get-started/) released on a monthly basis:
* [size of the crawls](plots/crawlsize) - number of pages, unique URLs, hosts, domains, top-level domains (public suffixes), cumulative growth of crawled data over time
* [top-level domains](plots/tlds) - distribution and comparison
* [top-500 registered domains](plots/domains.md)
* [crawler-related metrics](plots/crawlermetrics) - fetch status, etc.
* [overlaps between monthly crawls](plots/crawloverlap)
* distribution of
- [media types (MIME)](plots/mimetypes)
- [character encodings](plots/charsets.md)
- [languages](plots/languages.md)
All metrics presented here are generated from [Common Crawl's URL index](https://index.commoncrawl.org/) data using the code of the [cc-crawl-statistics project](https://github.com/commoncrawl/cc-crawl-statistics). Inspired by Sebastian Spiegler's [Statistics of the Common Crawl Corpus 2012](https://commoncrawl.org/2013/08/a-look-inside-common-crawls-210tb-2012-web-corpus/).
See also our [Web Graph statistics](https://commoncrawl.github.io/cc-webgraph-statistics/).
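
As a rough illustration of how the published statistics can be consumed, the sketch below reads records in the form the statistics job appears to write: one record per line, a JSON-encoded key and a JSON-encoded value separated by a tab. The helper names `read_stats` and `crawl_sizes` are hypothetical, and the sample records (including the crawl ID `CC-MAIN-2099-01`) are made up for illustration and are not real crawl data.

```python
import io
import json


def read_stats(stream):
    """Yield (key, value) pairs from lines of the form '<json key>\\t<json value>'."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        key, value = line.split('\t', 1)
        yield json.loads(key), json.loads(value)


def crawl_sizes(stream):
    """Collect {crawl: {item_type: count}} from records whose key starts with "size"."""
    sizes = {}
    for key, value in read_stats(stream):
        if key[0] == 'size':
            _, item_type, crawl = key
            sizes.setdefault(crawl, {})[item_type] = value
    return sizes


# Made-up sample records (illustrative values only, not real crawl data)
sample = io.StringIO(
    '["size", "page", "CC-MAIN-2099-01"]\t1000\n'
    '["size", "url", "CC-MAIN-2099-01"]\t900\n'
    '["charset", "utf-8", "CC-MAIN-2099-01"]\t[950, 860]\n'
)
print(crawl_sizes(sample))
```

For a gzip-compressed statistics file downloaded locally, the same functions should work on a text stream opened with `gzip.open(path, 'rt', encoding='utf-8')`, where `path` is wherever the file was saved.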
================================================
FILE: plot/charset.py
================================================
import sys
from plot.table import TabularStats
from crawlstats import CST, MonthlyCrawl
class CharsetStats(TabularStats):
MIN_AVERAGE_COUNT = 500
MAX_CHARSETS = 100
def __init__(self):
super().__init__()
self.MAX_TYPE_VALUES = CharsetStats.MAX_CHARSETS
def add(self, key, val):
self.add_check_type(key, val, CST.charset)
if __name__ == '__main__':
plot_crawls = sys.argv[1:]
plot_name = 'charsets'
column_header = 'charset'
if len(plot_crawls) == 0:
plot_crawls = MonthlyCrawl.get_latest(3)
print(plot_crawls)
else:
plot_name += '-' + '-'.join(plot_crawls)
plot = CharsetStats()
plot.read_from_stdin_or_file()
plot.transform_data(CharsetStats.MAX_CHARSETS,
CharsetStats.MIN_AVERAGE_COUNT,
None)
plot.save_data_percentage(plot_name, dir_name='plots', type_name='charset')
plot.plot(plot_crawls, plot_name, column_header)
================================================
FILE: plot/crawl_size.py
================================================
"""
Plot crawl size metrics over time.
This module generates visualizations of crawl size statistics including:
- Monthly crawl sizes (pages, URLs, content digests)
- Cumulative sizes over time
- New URLs per crawl
- URL status by year (new, revisit, duplicate)
- Domain/host/TLD counts
The plots show the growth and evolution of the Common Crawl archive.
"""
import os
import re
import types
from collections import defaultdict
import pandas
from hyperloglog import HyperLogLog
from crawlplot import CrawlPlot
from crawlstats import CST, CrawlStatsJSONDecoder, HYPERLOGLOG_ERROR, MonthlyCrawl
class CrawlSizePlot(CrawlPlot):
"""Generate plots showing crawl size metrics over time.
Tracks various size metrics including page counts, unique URLs,
unique content digests, and cumulative statistics across crawls.
Uses HyperLogLog for efficient cardinality estimation.
"""
def __init__(self):
super().__init__()
self.size = defaultdict(dict)
self.size_by_type = defaultdict(dict)
self.type_index = defaultdict(dict)
self.crawls = {}
self.ncrawls = 0
self.hll = defaultdict(dict)
self.N = 0
self.sum_counts = False
def add(self, key, val):
"""Process a size or size_estimate record from statistics data."""
cst = CST[key[0]]
if cst not in (CST.size, CST.size_estimate):
return
item_type = key[1]
crawl = key[2]
count = 0
if cst == CST.size_estimate:
item_type = ' '.join([item_type, 'estim.'])
hll = CrawlStatsJSONDecoder.json_decode_hyperloglog(val)
count = len(hll)
self.hll[item_type][crawl] = hll
elif cst == CST.size:
count = val
self.add_by_type(crawl, item_type, count)
def add_by_type(self, crawl, item_type, count):
"""Add a count for a specific crawl and item type combination."""
if crawl not in self.crawls:
self.crawls[crawl] = self.ncrawls
self.size['crawl'][self.ncrawls] = crawl
date = pandas.Timestamp(MonthlyCrawl.date_of(crawl))
self.size['date'][self.ncrawls] = date
self.ncrawls += 1
else:
date = self.size['date'][self.crawls[crawl]]
if item_type in self.size and \
self.crawls[crawl] in self.size[item_type]:
# add count to existing record?
if self.sum_counts:
count += self.size[item_type][self.crawls[crawl]]
self.size[item_type][self.crawls[crawl]] = count
_N = self.type_index[item_type][self.crawls[crawl]]
self.size_by_type['size'][_N] = count
return
self.size[item_type][self.crawls[crawl]] = count
self.size_by_type['crawl'][self.N] = crawl
self.size_by_type['date'][self.N] = date
self.size_by_type['type'][self.N] = item_type
self.size_by_type['size'][self.N] = count
self.type_index[item_type][self.crawls[crawl]] = self.N
self.N += 1
def cumulative_size(self):
"""Calculate cumulative sizes across crawls using HyperLogLog unions."""
latest_n_crawls_cumul = [2, 3, 4, 6, 9, 12]
total_pages = 0
sorted_crawls = sorted(self.crawls)
for crawl in sorted_crawls:
total_pages += self.size['page'][self.crawls[crawl]]
self.add_by_type(crawl, 'page cumul.', total_pages)
urls_cumul = defaultdict(dict)
for item_type in self.hll.keys():
item_type_cumul = ' '.join([item_type, 'cumul.'])
item_type_new = ' '.join([item_type, 'new'])
cumul_hll = HyperLogLog(HYPERLOGLOG_ERROR)
n = 0
hlls = []
for crawl in sorted(self.hll[item_type]):
n += 1
hll = self.hll[item_type][crawl]
last_cumul_hll_len = len(cumul_hll)
cumul_hll.update(hll)
# cumulative size
self.add_by_type(crawl, item_type_cumul, len(cumul_hll))
# new unseen items this crawl (since the first analyzed crawl)
unseen = (len(cumul_hll) - last_cumul_hll_len)
if unseen > len(hll):
# 1% error rate for cumulative HLLs is large in comparison
# to crawl size, adjust to size of items in this crawl
# (there can be no more new items than the size of the crawl)
unseen = len(hll)
self.add_by_type(crawl, item_type_new, unseen)
hlls.append(hll)
# cumulative size for last N crawls
for n_crawls in latest_n_crawls_cumul:
item_type_n_crawls = '{} cumul. last {} crawls'.format(
item_type, n_crawls)
if n_crawls <= len(hlls):
cum_hll = HyperLogLog(HYPERLOGLOG_ERROR)
for i in range(1, (n_crawls+1)):
if i > len(hlls):
break
cum_hll.update(hlls[-i])
size_last_n = len(cum_hll)
if item_type == 'url estim.':
urls_cumul[crawl][str(n_crawls)] = size_last_n
else:
size_last_n = 'nan'
self.add_by_type(crawl, item_type_n_crawls, size_last_n)
for n, crawl in enumerate(sorted_crawls):
for n_crawls in latest_n_crawls_cumul:
if n_crawls > (n+1):
self.add_by_type(crawl,
'page cumul. last {} crawls'.format(n_crawls),
'nan')
continue
cumul_pages = 0
for c in sorted_crawls[(1+n-n_crawls):(n+1)]:
cumul_pages += self.size['page'][self.crawls[c]]
self.add_by_type(crawl,
'page cumul. last {} crawls'.format(n_crawls),
cumul_pages)
urls_cumul[crawl][str(n_crawls)] = urls_cumul[crawl][str(n_crawls)]/cumul_pages
for crawl in urls_cumul:
for n_crawls in urls_cumul[crawl]:
self.add_by_type(crawl,
'URLs/pages last {} crawls'.format(n_crawls),
urls_cumul[crawl][n_crawls])
def transform_data(self):
"""Convert internal dictionaries to pandas DataFrames."""
self.size = pandas.DataFrame(self.size)
self.size_by_type = pandas.DataFrame(self.size_by_type)
def save_data(self):
"""Save size data to CSV files."""
self.size.to_csv('data/crawlsize.csv')
self.size_by_type.to_csv('data/crawlsizebytype.csv')
def duplicate_ratio(self):
"""Calculate and save URL and content duplicate ratios per crawl."""
data = self.size[['crawl', 'page', 'url', 'digest estim.']].copy()
data['1-(urls/pages)'] = 100 * (1.0 - (data['url'] / data['page']))
data['1-(digests/pages)'] = \
100 * (1.0 - (data['digest estim.'] / data['page']))
floatf = '{0:.1f}%'.format
print(data.to_string(formatters={'1-(urls/pages)': floatf,
'1-(digests/pages)': floatf}),
file=open('data/crawlduplicates.txt', 'w'))
def plot(self):
"""Generate all crawl size plots."""
# Size per crawl (pages, URL and content digest)
row_types = ['page', 'url', 'digest estim.']
self.size_plot(self.size_by_type, row_types, '',
'Crawl Size', 'Pages / Unique Items',
'crawlsize/monthly.png',
data_export_csv='crawlsize/monthly.csv')
# -- cumulative size
row_types = ['page cumul.', 'url estim. cumul.',
'digest estim. cumul.']
self.size_plot(self.size_by_type, row_types, r' cumul\.$',
'Crawl Size Cumulative',
'Pages / Unique Items Cumulative',
'crawlsize/cumulative.png',
data_export_csv='crawlsize/cumulative.csv')
# -- new URLs per crawl
row_types = ['url estim. new']
self.size_plot(self.size_by_type, row_types, '',
'New URLs per Crawl (not observed in prior crawls)',
'New URLs', 'crawlsize/monthly_new.png',
data_export_csv='crawlsize/monthly_new.csv')
# -- cumulative URLs over last N crawls (this and preceding N-1 crawls)
row_types = ['url', '1 crawl', # 'url' replaced by '1 crawl'
'url estim. cumul. last 2 crawls',
'url estim. cumul. last 3 crawls',
'url estim. cumul. last 4 crawls',
'url estim. cumul. last 6 crawls',
'url estim. cumul. last 9 crawls',
'url estim. cumul. last 12 crawls']
data = self.size_by_type
data = data[data['type'].isin(row_types)]
data.replace(to_replace='url', value='1 crawl', inplace=True)
self.size_plot(data, row_types, r'^url estim\. cumul\. last | crawls?$',
'URLs Cumulative Over Last N Crawls',
'Unique URLs cumulative',
'crawlsize/url_last_n_crawls.png',
clabel='n crawls',
data_export_csv='crawlsize/url_last_n_crawls.csv')
# -- ratio unique URLs by total page captures over last N crawls (this and preceding N-1 crawls)
row_types = ['URLs/pages last 2 crawls',
'URLs/pages last 3 crawls',
'URLs/pages last 4 crawls',
'URLs/pages last 6 crawls',
'URLs/pages last 9 crawls',
'URLs/pages last 12 crawls']
data = self.size_by_type
data = data[data['type'].isin(row_types)]
data.replace(to_replace='url', value='1 crawl', inplace=True)
self.size_plot(data, row_types, r'^URLs/pages last | crawls?$',
'Ratio Unique URLs / Total Pages Captured Over Last N Crawls',
'URLs/Pages',
'crawlsize/url_page_ratio_last_n_crawls.png',
clabel='n crawls',
data_export_csv='crawlsize/url_page_ratio_last_n_crawls.csv')
# -- cumul. digests over last N crawls (this and preceding N-1 crawls)
row_types = ['digest estim.', '1 crawl', # 'digest estim.' replaced by '1 crawl'
'digest estim. cumul. last 2 crawls',
'digest estim. cumul. last 3 crawls',
'digest estim. cumul. last 6 crawls',
'digest estim. cumul. last 12 crawls']
data = self.size_by_type
data = data[data['type'].isin(row_types)]
data.replace(to_replace='digest estim.', value='1 crawl', inplace=True)
self.size_plot(data, row_types,
r'^digest estim\. cumul\. last | crawls?$',
'Content Digest Cumulative Over Last N Crawls',
'Unique content digests cumulative',
'crawlsize/digest_last_n_crawls.png',
clabel='n crawls')
# -- URLs, hosts, domains, tlds (normalized)
data = self.size_by_type
row_types = ['url', 'tld', 'domain', 'host']
data = data[data['type'].isin(row_types)]
self.export_csv(data, 'crawlsize/domain.csv')
# --- domains only (not yet normalized)
self.size_plot(data[data['type'].isin(['domain'])], '', '',
'Unique Domains per Crawl',
'', 'crawlsize/registered-domains.png')
# normalize scale (exponent) of counts so that they fit on one plot
size_norm = data['size'] / 1000.0
data['size'] = size_norm.where(data['type'] == 'tld',
other=data['size'])
data.replace(to_replace='tld', value='tld e+04', inplace=True)
size_norm = size_norm / 10000.0
data['size'] = size_norm.where(data['type'] == 'host',
other=data['size'])
data.replace(to_replace='host', value='host e+07', inplace=True)
data['size'] = size_norm.where(data['type'] == 'domain',
other=data['size'])
data.replace(to_replace='domain', value='domain e+07', inplace=True)
size_norm = size_norm / 100.0
data['size'] = size_norm.where(data.type == 'url',
other=data['size'])
data.replace(to_replace='url', value='url e+09', inplace=True)
self.size_plot(data, '', '',
'URLs / Hosts / Domains / TLDs per Crawl',
'Unique Items', 'crawlsize/domain.png')
# -- URL status by year:
# -- duplicates (pages - URLs), known URLs (URLs - new), new URLs
data = self.size[['crawl', 'page', 'url', 'url estim. new']].copy()
data['year'] = data['crawl'].apply(lambda c: int(MonthlyCrawl.year_of(c)))
by_year = data[['year', 'page', 'url', 'url estim. new']] \
.groupby('year').agg('sum').reset_index()
by_year['revisit'] = by_year['url'] - by_year['url estim. new']
by_year['duplicate'] = by_year['page'] - by_year['url']
by_year['new'] = by_year['url estim. new']
print('URL status by year:')
print(by_year)
by_year_by_type = by_year[['year', 'new', 'revisit', 'duplicate', 'page']].melt(
id_vars=['year', 'page'],
value_vars=['new', 'revisit', 'duplicate'],
var_name='url_status', value_name='page_captures')
by_year_by_type['ratio'] = by_year_by_type['page_captures'] / by_year_by_type['page']
by_year_by_type['perc'] = by_year_by_type['ratio'].apply(lambda x: round((100.0*x), 1)).astype(str) + '%'
by_year_by_type['year'] = pandas.Categorical(by_year_by_type['year'], ordered=True)
by_year_by_type['url_status'] = pandas.Categorical(by_year_by_type['url_status'],
ordered=True,
categories=['duplicate',
'revisit', 'new'])
by_year_by_type['page_captures'] = by_year_by_type['page_captures'].astype(float)
# url_status_by_year
img_path = os.path.join(self.PLOTDIR, 'crawlsize', 'url_status_by_year.png')
if self.PLOTLIB == "rpy2.ggplot2":
return self.plot_with_rpy2_ggplot2(by_year_by_type, img_path)
elif self.PLOTLIB == "matplotlib":
return self.plot_with_matplotlib(by_year_by_type, img_path)
else:
raise ValueError("Invalid PLOTLIB")
def plot_with_rpy2_ggplot2(self, by_year_by_type, img_path):
"""Generate URL status by year stacked bar chart using rpy2/ggplot2."""
from rpy2.robjects.lib import ggplot2
from rpy2 import robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
p = ggplot2.ggplot(by_year_by_type) \
+ ggplot2.aes_string(x='year', y='page_captures', fill='url_status', label='perc') \
+ ggplot2.geom_bar(stat='identity', position='stack') \
+ ggplot2.geom_text(
data=by_year_by_type[
by_year_by_type['url_status'].isin(['new'])
& ~by_year_by_type['year'].isin(by_year_by_type['year'].tolist()[0:3])],
color='black', size=2,
position=ggplot2.position_dodge(width=.5)) \
+ self.GGPLOT2_THEME \
+ ggplot2.scale_fill_manual(values=robjects.r('c("duplicate"="#00BA38", "revisit"="#619CFF", "new"="#F8766D")')) \
+ ggplot2.theme(**{'legend.position': 'right',
'aspect.ratio': .7,
**self.GGPLOT2_THEME_KWARGS},
**{'axis.text.x':
ggplot2.element_text(angle=45, size=10,
vjust=1, hjust=1)}) \
+ ggplot2.labs(title='Number of Page Captures', x='', y='', fill='URL status')
p.save(img_path)
return p
def plot_with_matplotlib(self, by_year_by_type, img_path):
"""Generate URL status by year stacked bar chart using matplotlib."""
import numpy as np
aspect_ratio = 0.7
bar_label_fontsize = 5
title = 'Number of Page Captures'
fig, ax = self.create_figure()
# Prepare data for stacked bar chart
years = by_year_by_type['year'].unique()
url_statuses = ['new', 'revisit', 'duplicate']
colors = {'duplicate': '#00BA38', 'revisit': '#619CFF', 'new': '#F8766D'}
# Create stacked bars
bottoms = np.zeros(len(years))
bars = {}
for status in url_statuses:
status_data = by_year_by_type[by_year_by_type['url_status'] == status]
values = []
labels = []
for year in years:
year_data = status_data[status_data['year'] == year]
if len(year_data) > 0:
values.append(year_data['page_captures'].iloc[0])
labels.append(year_data['perc'].iloc[0])
else:
values.append(0)
labels.append('')
bars[status] = ax.bar(range(len(years)), values, bottom=bottoms,
color=colors[status], label=status, width=self.bar_width)
# Add text labels only for 'new' status, excluding first 3 years
if status == 'new':
for i, (bar, label) in enumerate(zip(bars[status], labels)):
if i >= 3 and label:
height = bar.get_height()
ax.text(bar.get_x() + bar.get_width() / 2.,
bottoms[i] + height, label,
ha='center', va='top', color='black',
fontsize=bar_label_fontsize)
bottoms += values
self.set_title(ax, title)
ax.set_xlabel('')
ax.set_ylabel('')
# Format x-axis
ax.set_xticks(range(len(years)))
ax.set_xticklabels(years, rotation=45, ha='right', va='top',
fontsize=self.ticks_fontsize)
ax.set_xlim(-0.5, len(years) - 0.5)
# Axes ratio
ax.set_aspect(1 / ax.get_data_ratio() * aspect_ratio)
# Apply nice y-axis ticks
self.apply_nice_ticks(ax, axis='y')
# Grid styling
ax.grid(True, which='minor', linewidth=self.grid_minor_linewidth,
color=self.grid_minor_color, zorder=0, axis='both')
ax.grid(True, which='major', linewidth=self.grid_major_linewidth,
color=self.grid_major_color, zorder=0, axis='both')
ax.set_axisbelow(True)
# Apply ggplot2 style
self.apply_ggplot2_style(ax, show_grid=False)
# Set tick colors
ax.tick_params(axis='y', which='both', colors='#FFFFFF',
length=self.ticks_length, width=self.grid_major_linewidth,
labelsize=self.ticks_fontsize)
ax.tick_params(axis='x', which='both', colors='#E6E6E6',
length=self.ticks_length, width=self.grid_major_linewidth,
labelsize=self.ticks_fontsize)
self.set_tick_labels_black(ax)
# Position legend on right side with reversed order
handles, labels = ax.get_legend_handles_labels()
legend = ax.legend(handles[::-1], labels[::-1], loc='center left',
bbox_to_anchor=(1.0, 0.5), frameon=False,
fontsize=self.legend_fontsize, title='URL status',
title_fontsize=self.legend_title_fontsize)
legend._legend_box.align = 'left'
return self.save_figure(fig, img_path)
def export_csv(self, data, csv):
"""Export pivot table data to CSV file."""
if csv is not None:
data.reset_index().pivot(index='crawl',
columns='type', values='size').to_csv(
os.path.join(self.PLOTDIR, csv))
    def norm_data(self, data, row_filter, type_name_norm):
        """Filter and normalize type names in the data for plotting."""
        if len(row_filter) > 0:
            data = data[data['type'].isin(row_filter)]
        if type_name_norm != '':
            for value in row_filter:
                replacement = value
                if isinstance(type_name_norm, str):
                    if re.search(type_name_norm, value):
                        while re.search(type_name_norm, replacement):
                            replacement = re.sub(type_name_norm,
                                                 '', replacement)
                elif isinstance(type_name_norm, types.FunctionType):
                    replacement = type_name_norm(value)
                if replacement != value:
                    data.replace(to_replace=value, value=replacement,
                                 inplace=True)
        return data
def size_plot(self, data, row_filter, type_name_norm,
title, ylabel, img_file, clabel='', data_export_csv=None,
x='date', y='size', c='type'):
"""Generate a size plot with filtering and normalization.
Args:
data: DataFrame containing the size data
row_filter: List of type values to include
type_name_norm: Regex pattern or function to normalize type names
title: Plot title
ylabel: Y-axis label
img_file: Output filename
clabel: Legend title
data_export_csv: Optional CSV export path
x, y, c: Column names for x-axis, y-axis, and color grouping
"""
data = self.norm_data(data, row_filter, type_name_norm)
self.export_csv(data, data_export_csv)
return self.line_plot(data, title, ylabel, img_file,
x=x, y=y, c=c, clabel=clabel, ratio=.9)
if __name__ == '__main__':
plot = CrawlSizePlot()
plot.read_from_stdin_or_file()
plot.cumulative_size()
plot.transform_data()
plot.save_data()
plot.duplicate_ratio()
plot.plot()
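# The loop in norm_data() applies type_name_norm to every type value: a regex
# string is stripped repeatedly until it no longer matches, a function is simply
# called on the value. Below is a minimal standalone sketch of that rule without
# pandas. The helper name normalize_type_name and the sample inputs are
# illustrative assumptions, not part of the repository; the sketch also accepts
# any callable, whereas norm_data() checks for types.FunctionType.

```python
import re


def normalize_type_name(value, type_name_norm):
    """Apply a regex string or a callable to one type name (hypothetical helper)."""
    if callable(type_name_norm):
        return type_name_norm(value)
    # Remove matches until none is left: a single re.sub pass can expose a new
    # match, e.g. 'aabb' -> 'ab' -> ''.
    # Caveat: a pattern that matches the empty string would loop forever.
    while re.search(type_name_norm, value):
        value = re.sub(type_name_norm, '', value)
    return value


if __name__ == '__main__':
    # Function-style normalization, as used for the scheme plots.
    print(normalize_type_name('scheme:https', lambda x: x.split(':')[1]))
    # Regex-style normalization with a made-up pattern.
    print(normalize_type_name('aabb', 'ab'))
```

# In norm_data() the resulting name then replaces the original value in the
# DataFrame only if the two differ.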
================================================
FILE: plot/crawler_metrics.py
================================================
"""
Plot crawler performance metrics.
This module generates visualizations of crawler metrics including:
- Fetch status breakdown (success, redirect, denied, failed, skipped)
- CrawlDb status counts
- HTTP vs HTTPS URL distribution
These metrics help monitor crawler health and performance over time.
"""
import logging
import os
import re
import pandas
from crawlstats import CST, MultiCount
from crawl_size import CrawlSizePlot
LOGGING_LEVEL = logging.INFO
logging.basicConfig(level=LOGGING_LEVEL)
class CrawlerMetrics(CrawlSizePlot):
"""Generate plots showing crawler performance metrics.
Tracks fetch statuses, CrawlDb sizes, and URL protocol distribution
across crawls.
"""
metrics_map = {
'fetcher:aggr:redirect': ('fetcher:temp_moved', 'fetcher:moved',
'fetcher:redirect_count_exceeded',
'fetcher:redirect_deduplicated',
# new counter names (NUTCH-3132)
# unchanged: 'fetcher:temp_moved', 'fetcher:moved',
'fetcher:redirect_count_exceeded_total',
'fetcher:redirect_deduplicated_total',
'fetcher:redirect_not_created_total'),
'fetcher:aggr:denied': ('fetcher:access_denied',
'fetcher:robots_denied',
'fetcher:robots_denied_maxcrawldelay',
'fetcher:robots_defer_visits_dropped',
'fetcher:filter_denied',
# new counter names (NUTCH-3132)
# unchanged: 'fetcher:access_denied',
'fetcher:robots_denied_total',
'fetcher:robots_denied_maxcrawldelay_total',
'fetcher:robots_defer_visits_dropped_total'),
'fetcher:aggr:failed': ('fetcher:gone', 'fetcher:notfound',
'fetcher:exception',
# (no) new counter names (NUTCH-3132)
),
'fetcher:aggr:skipped': ('fetcher:hitByThrougputThreshold',
'fetcher:hitByTimeLimit',
'fetcher:AboveExceptionThresholdInQueue',
'fetcher:filtered',
# new counter names (NUTCH-3132)
'fetcher:hit_by_throughput_threshold_total',
'fetcher:hit_by_timelimit_total',
'fetcher:above_exception_threshold_total',
'fetcher:hit_by_timeout_total',
'fetcher:filtered_total')
}
def __init__(self):
super().__init__()
self.sum_counts = True
def add(self, key, val):
"""Process crawl status, size, and scheme records."""
cst = CST[key[0]]
item_type = key[1]
crawl = key[2]
if not (cst == CST.crawl_status or
(cst == CST.size and item_type in ('page', 'url'))
or cst == CST.scheme):
return
if cst == CST.scheme:
item_type = 'scheme:' + item_type
val = MultiCount.get_count(1, val)
self.add_by_type(crawl, item_type, val)
for metric in self.metrics_map:
if item_type in self.metrics_map[metric]:
logging.debug('Adding metric %s for <%s, %s> = %s', metric, crawl, item_type, val)
self.add_by_type(crawl, metric, val)
def save_data(self):
"""Save crawler metrics data to CSV files."""
self.size.sort_values(['crawl'], inplace=True)
self.size.to_csv('data/crawlmetrics.csv')
self.size_by_type.to_csv('data/crawlmetricsbytype.csv')
def add_percent(self):
"""Calculate percentage values for fetch statuses and schemes."""
for crawl in self.crawls:
if self.crawls[crawl] not in self.size['fetcher:total']:
logging.debug('Crawl %s not found in fetch status data', crawl)
continue
total = self.size['fetcher:total'][self.crawls[crawl]]
for item_type in self.type_index:
if self.crawls[crawl] not in self.size[item_type]:
continue
count = self.size[item_type][self.crawls[crawl]]
_N = self.type_index[item_type][self.crawls[crawl]]
if (item_type.startswith('fetcher:') and
item_type != 'fetcher:total'):
self.size_by_type['percentage'][_N] = 100.0*count/total
elif item_type.startswith('scheme:'):
total = self.size['url'][self.crawls[crawl]]
self.size_by_type['percentage'][_N] = 100.0*count/total
@staticmethod
def row2title(row):
"""Convert metric row name to human-readable title."""
row = re.sub('(?<=^fetch)er(?::aggr)?|^generator:', '', row)
row = re.sub('[:_]', ' ', row)
if row == 'page':
row = 'pages released'
return row
def plot(self):
"""Generate all crawler metrics plots."""
row_types = ['generator:fetch_list',
'fetcher:success', 'fetcher:total',
'fetcher:aggr:redirect', 'fetcher:notmodified',
'fetcher:aggr:failed', 'fetcher:aggr:denied',
'fetcher:aggr:skipped', 'page']
self.size_plot(self.size_by_type, row_types, CrawlerMetrics.row2title,
'Crawler Metrics', 'Pages',
'crawler/metrics.png')
# -- stacked bar plot
row_types = ['fetcher:success', 'fetcher:notmodified',
'fetcher:aggr:redirect', 'fetcher:aggr:failed',
'fetcher:aggr:denied', 'fetcher:aggr:skipped']
ratio = 0.1 + self.ncrawls * .05
self.plot_fetch_status(self.size_by_type, row_types,
'crawler/fetch_status_percentage.png',
ratio=ratio)
# -- status of pages in CrawlDb
row_types = ['crawldb:status:db_fetched',
'crawldb:status:db_notmodified',
'crawldb:status:db_redir_perm',
'crawldb:status:db_redir_temp',
'crawldb:status:db_duplicate',
'crawldb:status:db_gone',
'crawldb:status:db_unfetched',
'crawldb:status:db_orphan']
self.plot_crawldb_status(self.size_by_type, row_types,
'crawler/crawldb_status.png',
ratio=ratio)
# successfully fetched http:// vs https:// URLs
self.size_plot(self.size_by_type, ['scheme:http', 'scheme:https'], lambda x: x.split(':')[1],
'HTTP vs HTTPS URLs', 'Successfully fetched URLs',
'crawler/url_protocols.png')
self.size_plot(self.size_by_type, ['scheme:http', 'scheme:https'], lambda x: x.split(':')[1],
'Percentage of HTTP vs HTTPS URLs', 'Percentage of successfully fetched URLs',
'crawler/url_protocols_percentage.png', y='percentage')
def plot_fetch_status_with_rpy2_ggplot2(self, data, img_path, ratio):
"""Generate fetch status stacked bar chart using rpy2/ggplot2."""
from rpy2.robjects.lib import ggplot2
p = ggplot2.ggplot(data) \
+ ggplot2.aes_string(x='crawl', y='percentage', fill='type') \
+ ggplot2.geom_bar(stat='identity', position='stack', width=.9) \
+ ggplot2.coord_flip() \
+ ggplot2.scale_fill_brewer(palette='RdYlGn', type='sequential',
guide=ggplot2.guide_legend(reverse=True)) \
+ self.GGPLOT2_THEME \
+ ggplot2.theme(**{'legend.position': 'bottom',
'aspect.ratio': ratio,
**self.GGPLOT2_THEME_KWARGS}) \
+ ggplot2.labs(title='Percentage of Fetch Status',
x='', y='', fill='')
p.save(img_path, height = int(7 * ratio), width = 7)
return p
def plot_fetch_status_with_matplotlib(self, data, categories, img_path, ratio):
"""Generate fetch status stacked bar chart using matplotlib."""
import numpy as np
from matplotlib.ticker import MaxNLocator
crawls = data['crawl'].unique()
n_crawls = len(crawls)
# Define colors from dark green (success) to dark red (denied)
status_order = ['success', 'skipped', 'redirect', 'notmodified', 'failed', 'denied']
status_colors = {
'success': '#1A9850', 'skipped': '#91CF60', 'redirect': '#D9EF8B',
'notmodified': '#FEE08B', 'failed': '#FC8D59', 'denied': '#D73027'
}
categories_ordered = [cat for cat in status_order if cat in categories]
fig, ax = self.create_figure(ratio=ratio)
# Prepare data for horizontal stacked bar chart
bar_positions = np.arange(n_crawls)
lefts = np.zeros(n_crawls)
for category in categories_ordered:
category_data = data[data['type'] == category]
values = [
category_data[category_data['crawl'] == crawl]['percentage'].iloc[0]
if len(category_data[category_data['crawl'] == crawl]) > 0 else 0
for crawl in crawls
]
ax.barh(bar_positions, values, left=lefts, height=self.bar_width,
color=status_colors[category], label=category)
lefts += values
self.set_title(ax, 'Percentage of Fetch Status')
ax.set_xlabel('')
ax.set_ylabel('')
# Format y-axis (crawl names)
ax.set_yticks(bar_positions)
ax.set_yticklabels(crawls, fontsize=self.ticks_fontsize)
ax.set_ylim(-0.5, n_crawls - 0.5)
# Format x-axis (percentage)
max_value = lefts.max()
ax.set_xlim(0, max_value * 1.02)
ax.xaxis.set_major_locator(MaxNLocator(nbins=5))
# Apply ggplot2-like styling
self.apply_ggplot2_style(ax, grid_axis='x')
# Set tick colors
ax.tick_params(axis='y', which='both', colors='#E6E6E6', length=20,
width=1.5, labelsize=self.ticks_fontsize)
ax.tick_params(axis='x', which='both', colors='#E6E6E6', length=4,
width=1.5, labelsize=self.ticks_fontsize)
self.set_tick_labels_black(ax)
# Position legend at bottom
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles, labels, loc='upper center', bbox_to_anchor=(0.5, -0.05),
ncol=min(3, len(categories)), frameon=False,
fontsize=self.legend_fontsize, title='')
return self.save_figure(fig, img_path)
    def plot_fetch_status(self, data, row_filter, img_file, ratio=1.0):
        """Generate fetch status percentage stacked bar chart."""
        if row_filter:
            data = data[data['type'].isin(row_filter)]
        data = data[['crawl', 'percentage', 'type']]
        categories = []
        for value in row_filter:
            if re.search('^fetcher:(?:aggr:)?', value):
                replacement = re.sub('^fetcher:(?:aggr:)?', '', value)
                categories.append(replacement)
                data.replace(to_replace=value, value=replacement, inplace=True)
        # list.reverse() reverses in place and returns None, which would make
        # pandas infer the categories; pass a reversed copy instead
        data['type'] = pandas.Categorical(data['type'], ordered=True,
                                          categories=categories[::-1])
        # note: the ratio argument is overridden by the number of crawls
        ratio = 0.1 + len(data['crawl'].unique()) * .03
        img_path = os.path.join(self.PLOTDIR, img_file)
        if self.PLOTLIB == "rpy2.ggplot2":
            return self.plot_fetch_status_with_rpy2_ggplot2(data=data, img_path=img_path, ratio=ratio)
        elif self.PLOTLIB == "matplotlib":
            return self.plot_fetch_status_with_matplotlib(data=data, categories=categories, img_path=img_path, ratio=ratio)
        else:
            raise ValueError("Invalid PLOTLIB")
def plot_crawldb_status_with_rpy2_ggplot2(self, data, img_path, ratio):
"""Generate CrawlDb status stacked bar chart using rpy2/ggplot2."""
from rpy2.robjects.lib import ggplot2
p = ggplot2.ggplot(data) \
+ ggplot2.aes_string(x='crawl', y='size', fill='type') \
+ ggplot2.geom_bar(stat='identity', position='stack', width=.9) \
+ ggplot2.coord_flip() \
+ ggplot2.scale_fill_brewer(palette='Pastel1', type='sequential',
guide=ggplot2.guide_legend(reverse=False)) \
+ self.GGPLOT2_THEME \
+ ggplot2.theme(**{'legend.position': 'bottom',
'aspect.ratio': ratio,
**self.GGPLOT2_THEME_KWARGS}) \
+ ggplot2.labs(title='CrawlDb Size and Status Counts',
x='', y='', fill='')
p.save(img_path, height = int(7 * ratio), width = 7)
return p
def plot_crawldb_status_with_matplotlib(self, data, img_path, ratio):
"""Generate CrawlDb status stacked bar chart using matplotlib."""
import numpy as np
crawls = data['crawl'].unique()
n_crawls = len(crawls)
# Pastel1 palette colors
pastel1_colors = ['#FDDAEC', '#E5D8BD', '#FFFFCC', '#FED9A6',
'#DECBE4', '#CCEBC5', '#B3CDE3', '#FBB4AE', '#F2F2F2']
categories_ordered = ['unfetched', 'redir_temp', 'redir_perm', 'orphan',
'notmodified', 'gone', 'fetched', 'duplicate']
fig, ax = self.create_figure(ratio=ratio)
bar_positions = np.arange(n_crawls)
lefts = np.zeros(n_crawls)
for i, category in enumerate(categories_ordered):
category_data = data[data['type'] == category]
values = [
category_data[category_data['crawl'] == crawl]['size'].iloc[0]
if len(category_data[category_data['crawl'] == crawl]) > 0 else 0
for crawl in crawls
]
color = pastel1_colors[i % len(pastel1_colors)]
ax.barh(bar_positions, values, left=lefts, height=self.bar_width,
color=color, label=category)
lefts += values
self.set_title(ax, 'CrawlDb Size and Status Counts')
ax.set_xlabel('')
ax.set_ylabel('')
# Format y-axis (crawl names)
ax.set_yticks(bar_positions)
ax.set_yticklabels(crawls, fontsize=self.ticks_fontsize)
ax.set_ylim(-0.5, n_crawls - 0.5)
# Format x-axis (size counts)
max_value = lefts.max()
ax.set_xlim(0, max_value * 1.02)
# Axes ratio
ax.set_aspect(1 / ax.get_data_ratio() * ratio)
# Apply nice x-axis ticks
self.apply_nice_ticks(ax, axis='x')
# Apply ggplot2-like styling with x-axis grid
ax.grid(True, which='both', linewidth=self.grid_major_linewidth,
color=self.grid_major_color, zorder=0, axis='x')
ax.set_axisbelow(True)
self.apply_ggplot2_style(ax, show_grid=False)
# Set tick colors
ax.tick_params(axis='both', which='both', colors=self.ticks_color,
length=self.ticks_length, width=0.8,
labelsize=self.ticks_fontsize)
self.set_tick_labels_black(ax)
# Position legend at bottom with reversed order
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], loc='upper center',
bbox_to_anchor=(0.5, -0.05), ncol=min(4, len(categories_ordered)),
frameon=False, fontsize=self.legend_fontsize, title='')
return self.save_figure(fig, img_path)
    def plot_crawldb_status(self, data, row_filter, img_file, ratio=1.0):
        """Generate CrawlDb status stacked bar chart."""
        if row_filter:
            data = data[data['type'].isin(row_filter)]
        categories = []
        for value in row_filter:
            if re.search('^crawldb:status:db_', value):
                replacement = re.sub('^crawldb:status:db_', '', value)
                categories.append(replacement)
                data.replace(to_replace=value, value=replacement, inplace=True)
        # list.reverse() reverses in place and returns None, which would make
        # pandas infer the categories; pass a reversed copy instead
        data['type'] = pandas.Categorical(data['type'], ordered=True,
                                          categories=categories[::-1])
        data['size'] = data['size'].astype(float)
        # note: the ratio argument is overridden by the number of crawls
        ratio = 0.1 + len(data['crawl'].unique()) * .03
        img_path = os.path.join(self.PLOTDIR, img_file)
        if self.PLOTLIB == "rpy2.ggplot2":
            return self.plot_crawldb_status_with_rpy2_ggplot2(
                data=data, img_path=img_path, ratio=ratio
            )
        elif self.PLOTLIB == "matplotlib":
            return self.plot_crawldb_status_with_matplotlib(
                data=data, img_path=img_path, ratio=ratio
            )
        else:
            raise ValueError("Invalid PLOTLIB")
if __name__ == '__main__':
plot = CrawlerMetrics()
plot.read_from_stdin_or_file()
plot.add_percent()
plot.transform_data()
plot.save_data()
plot.plot()
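# CrawlerMetrics.add() counts every fetcher counter twice: once under its own
# name and once under the aggregate listed in metrics_map, and add_percent()
# then relates each fetcher value to 'fetcher:total'. The sketch below replays
# that idea on plain dictionaries. It is a simplified illustration only: the
# shortened METRICS_MAP, the function names and the sample counts are
# assumptions, not data or code from the repository.

```python
from collections import defaultdict

# Hypothetical, shortened mapping in the style of CrawlerMetrics.metrics_map.
METRICS_MAP = {
    'fetcher:aggr:redirect': ('fetcher:temp_moved', 'fetcher:moved'),
    'fetcher:aggr:failed': ('fetcher:gone', 'fetcher:notfound',
                            'fetcher:exception'),
}


def aggregate(counters):
    """Keep every counter and add it to each aggregate metric that lists it."""
    totals = defaultdict(int)
    for name, count in counters.items():
        totals[name] += count
        for metric, members in METRICS_MAP.items():
            if name in members:
                totals[metric] += count
    return dict(totals)


def percentages(totals, total_key='fetcher:total'):
    """Express all fetcher counters as a percentage of the total."""
    total = totals[total_key]
    return {name: 100.0 * count / total
            for name, count in totals.items()
            if name.startswith('fetcher:') and name != total_key}


if __name__ == '__main__':
    # Made-up counts for a single crawl.
    sample = {'fetcher:total': 200, 'fetcher:success': 150,
              'fetcher:moved': 20, 'fetcher:temp_moved': 10,
              'fetcher:gone': 5, 'fetcher:notfound': 15}
    for name, perc in sorted(percentages(aggregate(sample)).items()):
        print(name, perc)
```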
================================================
FILE: plot/domain.py
================================================
"""
Tabulate the top domains of a single crawl.
Writes the page, URL and host counts per domain as CSV and as HTML table.
"""
import sys
import pandas
from crawlstats import CST, MonthlyCrawl, MultiCount
from plot.table import TabularStats
class DomainStats(TabularStats):
    """Collect per-domain counts of one crawl and export them as table."""
    # defined via crawlstats command-line option --max-top-hosts-domains
    MAX_TOP_DOMAINS = 500
    def __init__(self, crawl):
        super().__init__()
        self.crawl = crawl
        self.N = 0
    def add(self, key, val):
        """Process size and domain records of the selected crawl."""
        cst = CST[key[0]]
        if cst not in (CST.size, CST.domain):
            return
        typeval = key[1]
        crawl = key[2]
        if crawl != self.crawl:
            return
        if cst == CST.size:
            self.size[typeval] = val
            return
        self.type_stats['domain'][self.N] = typeval
        self.type_stats['pages'][self.N] = MultiCount.get_count(0, val)
        self.type_stats['urls'][self.N] = MultiCount.get_count(1, val)
        self.type_stats['hosts'][self.N] = MultiCount.get_count(2, val)
        # self.type_stats['crawl'][self.N] = crawl
        self.N += 1
    def transform_data(self):
        """Convert the counts to a DataFrame and add percentage columns."""
        data = pandas.DataFrame(self.type_stats)
        for cnt in ['pages', 'urls']:
            # the crawl totals are stored under the singular: 'page', 'url'
            total = self.size[cnt[:-1]]
            data['%' + cnt] = 100.0 * data[cnt] / total
        data.sort_values(ascending=False, inplace=True, by='pages')
        print(data)
        self.type_stats = data
    def save_data(self, name, dir_name='data/'):
        """Save the table as CSV (written to PLOTDIR, dir_name is not used)."""
        self.type_stats.to_csv('{}/{}-top-{}.csv'.format(self.PLOTDIR, name, self.MAX_TOP_DOMAINS),
                               float_format='%.6f', index=None)
    def plot(self, name):
        """Write the table as sortable and searchable HTML table."""
        data = self.type_stats
        css_classes = ['tablesorter', 'tablesearcher']
        data = data.set_index('domain')
        data.columns.name = 'domain'
        data.index.name = None
        # to_html() returns None when writing to a file, nothing to print
        data.to_html('{}/{}-top-{}.html'.format(
            self.PLOTDIR, name, self.MAX_TOP_DOMAINS),
            float_format='%.6f',
            classes=css_classes, index=True)
if __name__ == '__main__':
    plot_crawls = sys.argv[1:]
    if len(plot_crawls) == 0:
        plot_crawls = MonthlyCrawl.get_latest(1)
        print(plot_crawls)
    latest_crawl = plot_crawls[-1]
    plot_name = 'domains'
    plot = DomainStats(latest_crawl)
    plot.read_from_stdin_or_file()
    plot.transform_data()
    plot.save_data(plot_name, dir_name=plot.PLOTDIR)
    plot.plot(plot_name)
================================================
FILE: plot/histogram.py
================================================
"""
Plot histogram distributions for crawl statistics.
This module generates histogram visualizations showing distributions of:
- Pages per URL (URL-level duplicates)
- URLs per host/domain/TLD
- Cumulative URL coverage by domain
These histograms help understand the distribution patterns in crawl data.
"""
import os.path
import sys
from collections import defaultdict
import pandas
from crawlplot import CrawlPlot
from crawlstats import CST
class CrawlHistogram(CrawlPlot):
"""Generate histogram plots for crawl statistics.
Produces histograms showing frequency distributions of various metrics
like duplicate rates, coverage per domain, etc.
"""
PSEUDO_LOG_BINS = [0, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000,
10000, 20000, 50000, 100000, 200000, 500000, 1000000,
2*10**6, 5*10**6, 10**7, 2*10**7, 5*10**7, 10**8,
2*10**8, 5*10**8, 10**9]
def __init__(self):
super().__init__()
self.histogr = defaultdict(dict)
self.N = 0
def add(self, key, frequency):
"""Process a histogram record from statistics data."""
cst = CST[key[0]]
if cst != CST.histogram:
return
item_type = key[1]
if item_type == 'surt_domain':
return
crawl = key[2]
type_counted = key[3]
count = key[4]
self.histogr['crawl'][self.N] = crawl
self.histogr['type'][self.N] = item_type
self.histogr['type_counted'][self.N] = type_counted
self.histogr['count'][self.N] = count
self.histogr['frequency'][self.N] = frequency
self.N += 1
def transform_data(self):
"""Convert internal dictionary to pandas DataFrame."""
self.histogr = pandas.DataFrame(self.histogr)
def save_data(self):
"""Save histogram data to CSV file."""
self.histogr.to_csv('data/crawlhistogr.csv')
def plot_dupl_url(self):
"""Plot histogram of pages per URL (URL-level duplicates)."""
from rpy2.robjects.lib import ggplot2
row_filter = ['url']
data = self.histogr
data = data[data['type'].isin(row_filter)]
title = 'Pages per URL (URL-level duplicates)'
p = ggplot2.ggplot(data) \
+ ggplot2.aes_string(x='count', y='frequency') \
+ ggplot2.geom_jitter() \
+ ggplot2.facet_wrap('crawl', ncol=5) \
+ ggplot2.labs(title=title, x='(duplicate) pages per URL',
y='log(frequency)') \
+ ggplot2.scale_y_log10()
# + ggplot2.scale_x_log10() # could use log-log scale
img_path = os.path.join(self.PLOTDIR, 'crawler/histogr_url_dupl.png')
p.save(img_path)
# data.to_csv(img_path + '.csv')
return p
def plot_host_domain_tld(self):
"""Plot histogram of URLs per host/domain/TLD."""
from rpy2.robjects.lib import ggplot2
data = self.histogr
data = data[data['type'].isin(['host', 'domain', 'tld'])]
data = data[data['type_counted'].isin(['url'])]
img_path = os.path.join(self.PLOTDIR,
'crawler/histogr_host_domain_tld.png')
# data.to_csv(img_path + '.csv')
title = 'URLs per Host / Domain / TLD'
p = ggplot2.ggplot(data) \
+ ggplot2.aes_string(x='count', weight='frequency', color='type') \
+ ggplot2.geom_freqpoly(bins=20) \
+ ggplot2.facet_wrap('crawl', ncol=4) \
+ ggplot2.labs(title='', x=title,
y='Frequency') \
+ ggplot2.scale_y_log10() \
+ ggplot2.scale_x_log10()
p.save(img_path)
return p
def plot_domain_cumul_with_rpy2_ggplot2(self, data, title, img_path):
"""Generate cumulative domain coverage plot using rpy2/ggplot2."""
from rpy2.robjects.lib import ggplot2
p = ggplot2.ggplot(data) \
+ ggplot2.aes_string(x='cum_domains', y='cum_urls') \
+ ggplot2.geom_line() + ggplot2.geom_point() \
+ self.GGPLOT2_THEME \
+ ggplot2.theme(**self.GGPLOT2_THEME_KWARGS) \
+ ggplot2.labs(title=title, x='domains cumulative',
y='URLs cumulative') \
+ ggplot2.scale_y_log10() \
+ ggplot2.scale_x_log10()
p.save(img_path)
return p
    def plot_domain_cumul(self, crawl):
        """Plot cumulative URL coverage by domain for a specific crawl."""
        data = self.histogr
        data = data[data['type'].isin(['domain'])]
        data = data[data['crawl'] == crawl]
        # copy the filtered rows so that columns can be added safely
        data = data[data['type_counted'].isin(['url'])].copy()
        # total URLs per bin: URLs per domain times number of domains
        data['urls'] = data['count']*data['frequency']
        print(data)
        data = data[['urls', 'count', 'frequency']]
        # largest domains first
        data = data.sort_values(['count'], ascending=False)
        data['cum_domains'] = data['frequency'].cumsum()
        data['cum_urls'] = data['urls'].cumsum()
        data_perc = data.apply(lambda x: round(100.0*x/float(x.sum()), 1))
        data['%domains'] = data_perc['frequency']
        data['%urls'] = data_perc['urls']
        data['%cum_domains'] = data['cum_domains'].apply(
            lambda x: round(100.0*x/float(data['frequency'].sum()), 1))
        data['%cum_urls'] = data['cum_urls'].apply(
            lambda x: round(100.0*x/float(data['urls'].sum()), 1))
        img_path = os.path.join(self.PLOTDIR,
                                'crawler/histogr_domain_cumul.png')
        # data.to_csv(img_path + '.csv')
        title = 'Cumulative URLs for Top Domains'
        if self.PLOTLIB == "rpy2.ggplot2":
            return self.plot_domain_cumul_with_rpy2_ggplot2(data=data, title=title, img_path=img_path)
        elif self.PLOTLIB == "matplotlib":
            # this plot is currently not used
            raise NotImplementedError
        else:
            raise ValueError("Invalid PLOTLIB")
if __name__ == '__main__':
latest_crawl = sys.argv[-1]
plot = CrawlHistogram()
plot.read_from_stdin_or_file()
plot.transform_data()
plot.save_data()
plot.plot_dupl_url()
plot.plot_host_domain_tld()
plot.plot_domain_cumul(latest_crawl)
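# plot_domain_cumul() turns histogram bins (URLs per domain, number of domains
# with that count) into cumulative sums, starting with the largest domains.
# The following sketch reproduces the arithmetic with the standard library
# only. The function name cumulative_coverage and the sample histogram are
# assumptions made for illustration, not repository code or real crawl data.

```python
from itertools import accumulate


def cumulative_coverage(histogram):
    """Cumulative domains and URLs, largest domains first.

    histogram: iterable of (urls_per_domain, number_of_domains) pairs.
    Returns a list of (cum_domains, cum_urls, %cum_domains, %cum_urls).
    """
    rows = sorted(histogram, key=lambda row: row[0], reverse=True)
    if not rows:
        return []
    frequencies = [freq for _, freq in rows]
    urls = [count * freq for count, freq in rows]
    cum_domains = list(accumulate(frequencies))
    cum_urls = list(accumulate(urls))
    total_domains, total_urls = cum_domains[-1], cum_urls[-1]
    return [(d, u,
             round(100.0 * d / total_domains, 1),
             round(100.0 * u / total_urls, 1))
            for d, u in zip(cum_domains, cum_urls)]


if __name__ == '__main__':
    # Made-up bins: 10 domains with 2 URLs, 4 with 20 URLs, 1 with 100 URLs.
    for row in cumulative_coverage([(2, 10), (20, 4), (100, 1)]):
        print(row)
```

# In this made-up example the single largest domain is about 6.7% of the
# domains but holds 50% of the URLs, which is the kind of skew the cumulative
# plot is meant to show.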
================================================
FILE: plot/language.py
================================================
import string
import sys
from plot.table import TabularStats
from crawlstats import CST, MonthlyCrawl
class LanguageStats(TabularStats):
    MIN_AVERAGE_COUNT = 1
    MAX_LANGUAGES = 200
    def __init__(self):
        super().__init__()
        self.MAX_TYPE_VALUES = LanguageStats.MAX_LANGUAGES
    def add(self, key, val):
        self.add_check_type(key, val, CST.primary_language)
if __name__ == '__main__':
    plot_crawls = sys.argv[1:]
    plot_name = 'languages'
    column_header = 'language'
    if len(plot_crawls) == 0:
        plot_crawls = MonthlyCrawl.get_latest(3)
        print(plot_crawls)
    else:
        plot_name += '-' + '-'.join(plot_crawls)
    plot = LanguageStats()
    plot.read_from_stdin_or_file()
    plot.transform_data(LanguageStats.MAX_LANGUAGES,
                        LanguageStats.MIN_AVERAGE_COUNT,
                        None)
    plot.save_data_percentage(plot_name, dir_name='plots', type_name='primary_language')
    plot.plot(plot_crawls, plot_name, column_header,
              ['iso639-3-language'])
================================================
FILE: plot/mimetype.py
================================================
import re
import sys
from plot.table import TabularStats
from crawlstats import CST, MonthlyCrawl
class MimeTypeStats(TabularStats):
    MIN_AVERAGE_COUNT = 500
    MAX_MIME_TYPES = 100
    # see https://en.wikipedia.org/wiki/Media_type#Naming
    mime_pattern_str = \
        r'(?:x-)?[a-z]+/[a-z0-9]+' \
        r'(?:[.-](?:c\+\+[a-z]*|[a-z0-9]+))*(?:\+[a-z0-9]+)?'
    # matches a clean MIME type only
    mime_pattern = re.compile(r'^'+mime_pattern_str+r'$')
    # extracts the MIME type from values with a "content=" prefix, quotes
    # or trailing parameters (e.g. "; charset=...")
    mime_extract_pattern = re.compile(r'^\s*(?:content\s*=\s*)?["\']?\s*(' +
                                      mime_pattern_str +
                                      r')(?:\s*[;,].*)?\s*["\']?\s*$')
    def __init__(self):
        super().__init__()
        self.MAX_TYPE_VALUES = MimeTypeStats.MAX_MIME_TYPES
    def norm_value(self, mimetype):
        """Lowercase the value and reduce it to the bare MIME type."""
        if isinstance(mimetype, str):
            mimetype = mimetype.lower()
            m = MimeTypeStats.mime_extract_pattern.match(mimetype)
            if m:
                return m.group(1)
            # no MIME type found: only strip quotes, commas and whitespace
            return mimetype.strip('"\', \t')
        return ""
    def add(self, key, val):
        self.add_check_type(key, val, CST.mimetype)
if __name__ == '__main__':
    plot_crawls = sys.argv[1:]
    plot_name = 'mimetypes'
    column_header = 'mimetype'
    if len(plot_crawls) == 0:
        plot_crawls = MonthlyCrawl.get_latest(3)
        print(plot_crawls)
    else:
        plot_name += '-' + '-'.join(plot_crawls)
    plot = MimeTypeStats()
    plot.read_from_stdin_or_file()
    plot.transform_data(MimeTypeStats.MAX_MIME_TYPES,
                        MimeTypeStats.MIN_AVERAGE_COUNT,
                        MimeTypeStats.mime_pattern)
    plot.save_data_percentage(plot_name, dir_name='plots', type_name='mimetype')
    plot.plot(plot_crawls, plot_name, column_header, ['tablesearcher'])
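# To see what MimeTypeStats.norm_value() does with noisy Content-Type values,
# the two regular expressions can be tried in isolation. The sketch below
# copies the patterns from the class above into a standalone function; the
# function name norm_mimetype and the sample values are illustrative
# assumptions and not taken from crawl data.

```python
import re

# Patterns copied from MimeTypeStats.
MIME_PATTERN_STR = (
    r'(?:x-)?[a-z]+/[a-z0-9]+'
    r'(?:[.-](?:c\+\+[a-z]*|[a-z0-9]+))*(?:\+[a-z0-9]+)?'
)
MIME_EXTRACT_PATTERN = re.compile(
    r'^\s*(?:content\s*=\s*)?["\']?\s*(' + MIME_PATTERN_STR +
    r')(?:\s*[;,].*)?\s*["\']?\s*$')


def norm_mimetype(mimetype):
    """Standalone variant of norm_value() (hypothetical helper)."""
    if isinstance(mimetype, str):
        mimetype = mimetype.lower()
        match = MIME_EXTRACT_PATTERN.match(mimetype)
        if match:
            # keep only the type/subtype part, drop parameters and quotes
            return match.group(1)
        # nothing that looks like a MIME type: strip quotes and whitespace
        return mimetype.strip('"\', \t')
    return ""


if __name__ == '__main__':
    for value in ['Text/HTML; charset=UTF-8', '"image/jpeg"',
                  'application/xhtml+xml', ' unknown ', None]:
        print(repr(value), '->', repr(norm_mimetype(value)))
```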
================================================
FILE: plot/mimetype_detected.py
================================================
import sys
from plot.mimetype import MimeTypeStats
from crawlstats import CST, MonthlyCrawl
class MimeTypeDetectedStats(MimeTypeStats):
    def __init__(self):
        super().__init__()
        self.MAX_TYPE_VALUES = MimeTypeStats.MAX_MIME_TYPES
    def norm_value(self, mimetype):
        return mimetype
    def add(self, key, val):
        self.add_check_type(key, val, CST.mimetype_detected)
if __name__ == '__main__':
    plot_crawls = sys.argv[1:]
    plot_name = 'mimetypes_detected'
    column_header = 'mimetype_detected'
    if len(plot_crawls) == 0:
        plot_crawls = MonthlyCrawl.get_latest(3)
        print(plot_crawls)
    else:
        plot_name += '-' + '-'.join(plot_crawls)
    plot = MimeTypeDetectedStats()
    plot.read_from_stdin_or_file()
    plot.transform_data(MimeTypeStats.MAX_MIME_TYPES,
                        MimeTypeStats.MIN_AVERAGE_COUNT,
                        None)
    plot.save_data_percentage(plot_name, dir_name='plots', type_name='mimetype_detected')
    plot.plot(plot_crawls, plot_name, column_header, ['tablesearcher'])
================================================
FILE: plot/overlap.py
================================================
"""
Plot crawl overlap and similarity metrics.
This module generates visualizations showing the overlap between different
crawls based on URL or content digest similarities. Uses Jaccard similarity
to measure the intersection over union of items between crawls.
"""
import copy
import os.path
from collections import defaultdict
import pandas
import pygraphviz
from crawlplot import CrawlPlot
from crawlstats import CST, CrawlStatsJSONDecoder, MonthlyCrawl
class CrawlOverlap(CrawlPlot):
"""Generate overlap and similarity visualizations between crawls.
Calculates and visualizes the Jaccard similarity between crawls
based on unique URLs or content digests using HyperLogLog cardinality
estimation.
"""
MAX_MATRIX_SIZE = 30
def __init__(self):
super().__init__()
self.crawl_size = defaultdict(dict)
self.overlap = defaultdict(dict)
self.similarity = defaultdict(dict) # Jaccard index
def add(self, key, val):
"""Process a size_estimate record and store HyperLogLog for overlap calculation."""
cst = CST[key[0]]
if cst != CST.size_estimate:
return
item_type = key[1]
crawl = key[2]
hll = CrawlStatsJSONDecoder.json_decode_hyperloglog(val)
self.crawl_size[item_type][crawl] = hll
    def fill_overlap_matrix(self):
        """Calculate pairwise overlap and Jaccard similarity between all crawls."""
        for item_type in self.crawl_size:
            for crawl1 in self.crawl_size[item_type]:
                hll1 = self.crawl_size[item_type][crawl1]
                size1 = len(hll1)
                self.overlap[item_type][crawl1] = defaultdict(list)
                self.similarity[item_type][crawl1] = defaultdict(float)
                for crawl2 in self.crawl_size[item_type]:
                    if crawl1 >= crawl2:
                        continue
                    hll2 = self.crawl_size[item_type][crawl2]
                    size2 = len(hll2)
                    union_hll = copy.deepcopy(hll1)
                    union_hll.update(hll2)
                    union = len(union_hll)
                    intersection = size1 + size2 - union
                    jaccard_sim = intersection / union
                    self.overlap[item_type][crawl1][crawl2] \
                        = [intersection, union, size1, size2,
                           (intersection/size2), jaccard_sim]
                    self.similarity[item_type][crawl1][crawl2] = jaccard_sim
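# fill_overlap_matrix() never computes an intersection directly: it derives it
# from the two sizes and the size of the union (inclusion-exclusion). The
# sketch below shows the same arithmetic with exact Python sets in place of
# HyperLogLog counters, so the numbers can be checked by hand. The function
# name overlap_stats and the sample URLs are assumptions for illustration.

```python
def overlap_stats(items1, items2):
    """Return (intersection, union, jaccard) from sizes and union size only."""
    size1, size2 = len(items1), len(items2)
    union = len(items1 | items2)
    # inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
    intersection = size1 + size2 - union
    return intersection, union, intersection / union


if __name__ == '__main__':
    # Made-up URL sets standing in for two crawls.
    crawl_a = {'u1', 'u2', 'u3', 'u4'}
    crawl_b = {'u3', 'u4', 'u5', 'u6', 'u7', 'u8'}
    print(overlap_stats(crawl_a, crawl_b))
```

# With exact sets the result equals the true intersection. With estimated
# cardinalities, as in the method above, the three sizes each carry an
# estimation error, so the derived intersection is only approximate and can
# even come out slightly negative for crawls with very little overlap.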
def save_overlap_matrix(self):
"""Save overlap and similarity matrices to CSV files."""
for item_type in self.overlap:
data = pandas.DataFrame(self.similarity[item_type])
data.to_csv('data/crawlsimilarity_' + item_type + '.csv')
data = pandas.DataFrame(self.overlap[item_type])
data.to_csv('data/crawloverlap_' + item_type + '.csv')
def plot_similarity_graph(self, show_edges=False):
"""Visualize similarity as a graph using GraphViz (experimental)."""
g = pygraphviz.AGraph(directed=False, overlap='scale', splines=True)
g.node_attr['shape'] = 'plaintext'
g.node_attr['fontsize'] = '12'
if show_edges:
g.edge_attr['color'] = 'lightgrey'
g.edge_attr['fontcolor'] = 'grey'
g.edge_attr['fontsize'] = '8'
else:
g.edge_attr['style'] = 'invis'
for crawl1 in sorted(self.similarity['url']):
for crawl2 in sorted(self.similarity['url'][crawl1]):
similarity = self.similarity['url'][crawl1][crawl2]
distance = 1.0 - similarity
g.add_edge(MonthlyCrawl.short_name(crawl1),
MonthlyCrawl.short_name(crawl2),
len=(distance),
label='{0:.2f}'.format(distance))
g.write(os.path.join(self.PLOTDIR, 'crawlsimilarity_url.dot'))
g.draw(os.path.join(self.PLOTDIR, 'crawlsimilarity_url.svg'), prog='fdp')
def plot_similarity_matrix_with_rpy2_ggplot2(self, data, midpoint, title, textsize, img_path):
"""Generate similarity heatmap using rpy2/ggplot2."""
from rpy2.robjects.lib import ggplot2
p = ggplot2.ggplot(data) \
+ ggplot2.aes_string(x='crawl2', y='crawl1',
fill='similarity', label='sim_rounded') \
+ ggplot2.geom_tile(color="white") \
+ ggplot2.scale_fill_gradient2(low="red", high="blue", mid="white",
midpoint=midpoint, space="Lab") \
+ self.GGPLOT2_THEME \
+ ggplot2.coord_fixed() \
+ ggplot2.theme(**{'axis.text.x':
ggplot2.element_text(angle=45,
vjust=1, hjust=1),
**self.GGPLOT2_THEME_KWARGS}) \
+ ggplot2.labs(title=title, x='', y='') \
+ ggplot2.geom_text(color='black', size=textsize)
p.save(img_path)
return p
def plot_similarity_matrix_with_matplotlib(self, data, decimals, title, cell_textsize, img_path):
"""Generate similarity heatmap using matplotlib.
Creates a color-coded matrix showing Jaccard similarity between crawls,
with color ranging from red (low) through white to blue (high).
"""
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LinearSegmentedColormap, Normalize
# Pivot data to create matrix
pivot_data = data.pivot(index='crawl1', columns='crawl2', values='similarity')
pivot_data_rounded = pivot_data.round(decimals)
fig, ax = self.create_figure()
# Create color map: red (low) -> white (mid) -> blue (high)
vmin = pivot_data_rounded.min().min()
vmax = pivot_data_rounded.max().max()
if vmin < 0:
colors = ['#ff0801', '#ff6b48', '#ffa388', '#ffd2c4', '#fff4ef',
'#FFFFFF', '#eadaff', '#c6a5ff', '#a073ff', '#6e43ff',
'#4020ff', '#1306ff']
else:
colors = ['#fff4ef', '#FFFFFF', '#eadaff', '#c6a5ff', '#a073ff',
'#6e43ff', '#4020ff', '#1306ff']
cmap = LinearSegmentedColormap.from_list('red_white_blue', colors, N=256)
norm = Normalize(vmin=vmin, vmax=vmax)
# Add grey grid lines behind everything
ax.set_axisbelow(True)
ax.grid(True, which='major', linewidth=0.8, color='#E6E6E6', zorder=-1)
# Create heatmap with origin='lower' to match ggplot2 (bottom-up)
im = ax.imshow(pivot_data_rounded.values, cmap=cmap, norm=norm,
aspect='equal', origin='lower', zorder=1)
# Add text annotations
for i in range(len(pivot_data.index)):
for j in range(len(pivot_data.columns)):
similarity = pivot_data.iloc[i, j]
if pandas.isna(similarity):
continue
# Draw white rectangle border around each cell
rect = plt.Rectangle((j - 0.5, i - 0.5), 1, 1,
fill=False, edgecolor='white',
linewidth=0.5, zorder=1)
ax.add_patch(rect)
# Get the rounded text for this cell
matching_rows = data[(data['crawl1'] == pivot_data.index[i]) &
(data['crawl2'] == pivot_data.columns[j])]
if len(matching_rows) > 0:
text_val = matching_rows['sim_rounded'].iloc[0]
ax.text(j, i, text_val, ha='center', va='center',
color='black', fontsize=cell_textsize, zorder=2)
# Set ticks and labels
ax.set_xticks(np.arange(len(pivot_data.columns)))
ax.set_yticks(np.arange(len(pivot_data.index)))
ax.set_xticklabels(pivot_data.columns, fontsize=10)
ax.set_yticklabels(pivot_data.index, fontsize=10)
# Hide tick marks but keep labels black
ax.tick_params(axis='both', which='both', colors='#FFFFFF', zorder=0)
self.set_tick_labels_black(ax)
# Rotate x-axis labels
plt.setp(ax.get_xticklabels(), rotation=45, ha='right', va='top')
self.set_title(ax, title)
ax.set_xlabel('')
ax.set_ylabel('')
# Add colorbar
cbar = plt.colorbar(im, ax=ax, aspect=5, pad=0.04, shrink=0.2)
cbar.ax.set_title('similarity', fontsize=10, pad=10, loc="left")
cbar.ax.tick_params(labelsize=8)
cbar.outline.set_visible(False)
# Apply ggplot2-like styling
self.apply_ggplot2_style(ax, show_grid=False)
return self.save_figure(fig, img_path)
def plot_similarity_matrix(self, item_type, image_file, title):
"""Plot similarities of crawls as a heatmap matrix.
Args:
item_type: Type of items to compare ('url' or 'digest')
image_file: Output filename relative to PLOTDIR
title: Plot title
"""
data = defaultdict(dict)
n = 1
for crawl1 in self.similarity[item_type]:
for crawl2 in self.similarity[item_type][crawl1]:
similarity = self.similarity[item_type][crawl1][crawl2]
data['crawl1'][n] = MonthlyCrawl.short_name(crawl1)
data['crawl2'][n] = MonthlyCrawl.short_name(crawl2)
data['similarity'][n] = similarity
data['sim_rounded'][n] = similarity # to be rounded
n += 1
data = pandas.DataFrame(data)
print(data)
# select median of similarity values as midpoint of similarity scale
midpoint = data['similarity'].median()
decimals = 3
textsize = 2
minshown = .0005
cell_textsize = 6
if (data['similarity'].max()-data['similarity'].min()) > .2:
decimals = 2
textsize = 2.8
minshown = .005
cell_textsize = 8
data['sim_rounded'] = data['sim_rounded'].apply(
lambda x: ('{0:.'+str(decimals)+'f}').format(x).lstrip('0')
if x >= minshown else '0')
print('Median of similarities for', item_type, '=', midpoint)
matrix_size = len(self.similarity[item_type])
if matrix_size > self.MAX_MATRIX_SIZE:
n = 0
for crawl1 in sorted(self.similarity[item_type], reverse=True):
short_name = MonthlyCrawl.short_name(crawl1)
if n >= self.MAX_MATRIX_SIZE:
data = data[data['crawl1'] != short_name]
data = data[data['crawl2'] != short_name]
n += 1
img_path = os.path.join(self.PLOTDIR, image_file)
if self.PLOTLIB == "rpy2.ggplot2":
return self.plot_similarity_matrix_with_rpy2_ggplot2(data=data, midpoint=midpoint, title=title, textsize=textsize, img_path=img_path)
elif self.PLOTLIB == "matplotlib":
return self.plot_similarity_matrix_with_matplotlib(data=data, decimals=decimals, title=title, cell_textsize=cell_textsize, img_path=img_path)
else:
raise ValueError("Invalid PLOTLIB")
if __name__ == '__main__':
plot = CrawlOverlap()
plot.read_from_stdin_or_file()
plot.fill_overlap_matrix()
plot.save_overlap_matrix()
# plot.plot_similarity_graph()
plot.plot_similarity_matrix(
'url', 'crawloverlap/crawlsimilarity_matrix_url.png',
'URL overlap between crawls (Jaccard similarity)')
plot.plot_similarity_matrix(
'digest', 'crawloverlap/crawlsimilarity_matrix_digest.png',
'Content overlap between crawls (Jaccard similarity on digest)')
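# Illustration (not part of the repository): the heatmaps above show the
# Jaccard similarity between crawls and label each cell with a rounded value.
# The following is a minimal sketch of both steps, assuming exact Python sets
# as input. The helper names `jaccard_similarity` and `cell_label` and the
# example URLs are hypothetical.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return len(a & b) / union


def cell_label(similarity, decimals=3, minshown=.0005):
    """Format a similarity for a heatmap cell, mirroring the rounding in
    plot_similarity_matrix: values below `minshown` become '0', all others
    are printed with `decimals` digits and without the leading zero."""
    if similarity < minshown:
        return '0'
    return '{0:.{1}f}'.format(similarity, decimals).lstrip('0')


# hypothetical URL sets of two crawls
crawl_a = {'http://a.example/', 'http://b.example/', 'http://c.example/'}
crawl_b = {'http://b.example/', 'http://c.example/', 'http://d.example/'}

sim = jaccard_similarity(crawl_a, crawl_b)  # 2 shared URLs out of 4 distinct
distance = 1.0 - sim                        # edge length used in the graph plot
print(sim, distance, cell_label(sim))
```

# The same `1.0 - similarity` distance is what plot_similarity_graph passes
# to GraphViz as the edge length.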
================================================
FILE: plot/table.py
================================================
import heapq
import numpy
import pandas
from collections import defaultdict, Counter
from crawlplot import CrawlPlot
from crawlstats import CST, MultiCount
class TabularStats(CrawlPlot):
def __init__(self):
super().__init__()
self.crawls = set()
self.types = defaultdict(dict)
self.type_stats = defaultdict(dict)
self.types_total = Counter()
self.size = defaultdict(dict)
self.N = 0
def norm_value(self, typeval):
return typeval
def add_check_type(self, key, val, requ_type_cst):
cst = CST[key[0]]
if cst != requ_type_cst and cst != CST.size:
return
typeval = key[1]
crawl = key[2]
self.crawls.add(crawl)
typeval = self.norm_value(typeval)
if cst == CST.size:
self.size[crawl][typeval] = int(val)
return
if crawl in self.types[typeval]:
self.types[typeval][crawl] = \
MultiCount.sum_values([val, self.types[typeval][crawl]])
else:
self.types[typeval][crawl] = val
npages = MultiCount.get_count(0, val)
self.types_total[typeval] += npages
if 'known_values' not in self.size[crawl]:
self.size[crawl]['known_values'] = 0
self.size[crawl]['known_values'] += npages
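def transform_data(self, top_n, min_avg_count, check_pattern=None):
"""Keep the top_n most frequent valid type values, merge all others into '<other>', record uncounted pages as '<unknown>' and build the type_stats DataFrame."""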
print("Number of different values after first normalization: {}"
.format(len(self.types)))
typevals_for_deletion = set()
typevals_mostfrequent = []
for typeval in self.types:
total_count = self.types_total[typeval]
average_count = int(total_count / len(self.crawls))
if average_count >= min_avg_count:
if not check_pattern or check_pattern.match(typeval):
print('{}\t{}\t{}'.format(typeval,
average_count, total_count))
fval = (total_count, typeval)
if len(typevals_mostfrequent) < top_n:
heapq.heappush(typevals_mostfrequent, fval)
else:
heapq.heappushpop(typevals_mostfrequent, fval)
continue # ok, keep this type value
else:
print('Type value frequent but invalid: <{}> (avg. count = {})'
.format(typeval, average_count))
elif average_count >= (min_avg_count/10):
if not check_pattern or check_pattern.match(typeval):
print('Skipped type value because of low frequency: <{}> (avg. count = {}, min. = {})'
.format(typeval, average_count, (min_avg_count/10)))
typevals_for_deletion.add(typeval)
# map low-frequency or invalid type values to the '<other>' type
keep_typevals = set()
for (_, typeval) in typevals_mostfrequent:
keep_typevals.add(typeval)
for typeval in self.types:
if (typeval not in keep_typevals and
typeval not in typevals_for_deletion):
print('Skipped type value because not in top {}: <{}> (avg. count = {})'
.format(top_n, typeval,
int(self.types_total[typeval]/len(self.crawls))))
typevals_for_deletion.add(typeval)
typevals_other = dict()
for typeval in typevals_for_deletion:
for crawl in self.types[typeval]:
if crawl in typevals_other:
val = typevals_other[crawl]
else:
val = 0
typevals_other[crawl] = \
MultiCount.sum_values([val, self.types[typeval][crawl]])
self.types.pop(typeval, None)
self.types['<other>'] = typevals_other
print('Number of different type values after cleaning and'
' removal of low frequency types: {}'
.format(len(self.types)))
# unknown values
for crawl in self.crawls:
known_values = 0
if 'known_values' in self.size[crawl]:
known_values = self.size[crawl]['known_values']
unknown = (self.size[crawl]['page'] - known_values)
if unknown > 0:
print("{} unknown values in {}".format(unknown, crawl))
self.types['<unknown>'][crawl] = unknown
for typeval in self.types:
for crawl in self.types[typeval]:
self.type_stats['type'][self.N] = typeval
self.type_stats['crawl'][self.N] = crawl
value = self.types[typeval][crawl]
n_pages = MultiCount.get_count(0, value)
self.type_stats['pages'][self.N] = n_pages
n_urls = MultiCount.get_count(1, value)
self.type_stats['urls'][self.N] = n_urls
self.N += 1
self.type_stats = pandas.DataFrame(self.type_stats)
def save_data(self, base_name, dir_name='data/'):
self.type_stats.to_csv(dir_name + base_name + '.csv')
def save_data_percentage(self, base_name, dir_name='data/', type_name='type'):
if dir_name[-1] != '/':
dir_name += '/'
data = self.type_stats
data = data[['crawl', 'type', 'pages', 'urls']]
sum_data = data.groupby(['crawl']).aggregate({'pages':'sum'}).add_suffix('_sum').reset_index()
data = data.groupby(['crawl', 'type']).aggregate('sum').reset_index()
data = pandas.merge(data, sum_data)
data['%pages/crawl'] = 100.0 * data['pages'] / data['pages_sum']
data.drop(['pages_sum'], inplace=True, axis=1)
data = data.rename(columns={'type': type_name})
data.to_csv(dir_name + base_name + '.csv', float_format='%.4f', index=None)
def plot(self, crawls, name, column_header, xtra_css_classes=()):
# stats comparison for selected crawls
field_percentage_formatter = '{0:,.4f}'.format
data = self.type_stats
data = data[data['crawl'].isin(crawls)]
if data.size == 0:
print("No data points in table for selected crawls ({})"
.format(crawls))
return
data = data.assign(**{column_header: data['type']})
data = data[['crawl', column_header, 'pages']]
data = data.groupby(['crawl', column_header]).agg({'pages': 'sum'})
data = data.groupby(level=0, as_index=False).apply(lambda x: 100.0*x/float(x.sum()))
data = data.reset_index().pivot(index=column_header,
columns='crawl', values='pages')
print("\n-----\n")
formatters = {c: field_percentage_formatter for c in crawls}
print(data.to_string(formatters=formatters))
css_classes = ['tablesorter', 'tablepercentage']
css_classes.extend(xtra_css_classes)
data.to_html('{}/{}-top-{}.html'.format(
self.PLOTDIR, name, self.MAX_TYPE_VALUES),
formatters=formatters,
classes=css_classes)
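# Illustration (not part of the repository): transform_data keeps only the
# top_n most frequent type values by pushing (count, value) pairs onto a
# bounded min-heap and merges everything else into '<other>'. The following
# is a minimal sketch of that selection, assuming plain integer counts. The
# helper name `top_n_by_count` and the example counts are hypothetical.

```python
import heapq


def top_n_by_count(totals, top_n):
    """Return the keys of the `top_n` largest counts in `totals`.

    A min-heap of at most `top_n` items is maintained: once it is full,
    heappushpop() adds the new item and discards the smallest one.
    """
    heap = []
    for key, count in totals.items():
        item = (count, key)
        if len(heap) < top_n:
            heapq.heappush(heap, item)
        else:
            heapq.heappushpop(heap, item)
    return {key for _, key in heap}


# hypothetical page counts per MIME type
totals = {'text/html': 900, 'application/pdf': 50,
          'image/jpeg': 30, 'text/plain': 20}
kept = top_n_by_count(totals, 2)
# everything that is not kept is summed up as '<other>'
other = sum(count for key, count in totals.items() if key not in kept)
print(sorted(kept), other)
```

# The heap never grows beyond top_n entries, so memory stays bounded even
# when there are many distinct type values.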
================================================
FILE: plot/tld.py
================================================
import sys
from collections import defaultdict
import pandas
from crawlplot import CrawlPlot
from crawlstats import CST, MonthlyCrawl, MultiCount
from top_level_domain import TopLevelDomain
from stats.tld_alexa_top_1m import alexa_top_1m_tlds
from stats.tld_cisco_umbrella_top_1m import cisco_umbrella_top_1m_tlds
from stats.tld_majestic_top_1m import majestic_top_1m_tlds
# min. share of URLs for a TLD to be shown in metrics
min_urls_percentage = .05
class TldStats(CrawlPlot):
def __init__(self):
super().__init__()
self.tlds = defaultdict(dict)
self.tld_stats = defaultdict(dict)
self.N = 0
def add(self, key, val):
cst = CST[key[0]]
if cst != CST.tld:
return
tld = key[1]
crawl = key[2]
self.tlds[tld][crawl] = val
def transform_data(self):
crawl_has_host_domain_counts = {}
for tld in self.tlds:
tld_repr = tld
tld_obj = None
if tld in ('', '(ip address)'):
continue
else:
try:
tld_obj = TopLevelDomain(tld)
tld_repr = tld_obj.tld
except Exception:
print('error', tld)
continue
for crawl in self.tlds[tld]:
self.tld_stats['suffix'][self.N] = tld_repr
self.tld_stats['crawl'][self.N] = crawl
date = pandas.Timestamp(MonthlyCrawl.date_of(crawl))
self.tld_stats['date'][self.N] = date
if tld_obj:
self.tld_stats['type'][self.N] \
= TopLevelDomain.short_type(tld_obj.tld_type)
self.tld_stats['subtype'][self.N] = tld_obj.sub_type
self.tld_stats['tld'][self.N] = tld_obj.first_level
else:
self.tld_stats['type'][self.N] = ''
self.tld_stats['subtype'][self.N] = ''
self.tld_stats['tld'][self.N] = ''
value = self.tlds[tld][crawl]
n_pages = MultiCount.get_count(0, value)
self.tld_stats['pages'][self.N] = n_pages
n_urls = MultiCount.get_count(1, value)
self.tld_stats['urls'][self.N] = n_urls
n_hosts = MultiCount.get_count(2, value)
self.tld_stats['hosts'][self.N] = n_hosts
n_domains = MultiCount.get_count(3, value)
self.tld_stats['domains'][self.N] = n_domains
if n_urls != n_hosts:
# multi counts including host counts are not (yet)
# available for all crawls
crawl_has_host_domain_counts[crawl] = True
elif crawl not in crawl_has_host_domain_counts:
crawl_has_host_domain_counts[crawl] = False
self.N += 1
for crawl in crawl_has_host_domain_counts:
if not crawl_has_host_domain_counts[crawl]:
print('No host and domain counts for', crawl)
for n in self.tld_stats['crawl']:
if self.tld_stats['crawl'][n] == crawl:
del(self.tld_stats['hosts'][n])
del(self.tld_stats['domains'][n])
self.tld_stats = pandas.DataFrame(self.tld_stats)
@staticmethod
def field_percentage_formatter(precision=2, nan='-'):
f = '{0:,.' + str(precision) + 'f}'
return lambda x: nan if pandas.isna(x) else f.format(x)
def save_data(self):
self.tld_stats.to_csv('data/tlds.csv')
def percent_agg(self, data, column, index, values, aggregate):
data = data[[column, index, values]]
data = data.groupby([column, index]).agg(aggregate)
data = data.groupby(level=0, as_index=False).apply(lambda x: 100.0*x/float(x.sum()))
# print("\n-----\n")
# print(data.to_string(formatters={'urls': TldStats.field_percentage_formatter()}))
return data
def pivot_percentage(self, data, column, index, values, aggregate):
data = self.percent_agg(data, column, index, values, aggregate)
return data.reset_index().pivot(index=index,
columns=[column], values=values)
def plot_groups(self):
title = 'Groups of Top-Level Domains'
ylabel = 'URLs %'
clabel = ''
img_file = 'tld/groups.png'
data = self.pivot_percentage(self.tld_stats, 'crawl', 'type',
'urls', {'urls': 'sum'})
data = data.transpose()
print("\n-----\n")
types = set(self.tld_stats['type'].tolist())
formatters = {c: TldStats.field_percentage_formatter() for c in types}
print(data.to_string(formatters=formatters))
data.to_html('{}/tld/groups-percentage.html'.format(self.PLOTDIR),
formatters=formatters,
classes=['tablesorter', 'tablepercentage'])
data = self.percent_agg(self.tld_stats, 'date', 'type',
'urls', {'urls': 'sum'}).reset_index()
return self.line_plot(data, title, ylabel, img_file,
x='date', y='urls', c='type', clabel=clabel)
def plot(self, crawls, latest_crawl):
field_formatters = {c: '{:,.0f}'.format
for c in ['pages', 'urls', 'hosts', 'domains']}
for c in ['%urls', '%hosts', '%domains']:
field_formatters[c] = TldStats.field_percentage_formatter()
data = self.tld_stats
data = data[data['crawl'].isin(crawls)]
crawl_data = data
top_tlds = []
# stats per crawl
for crawl in crawls:
print("\n-----\n{}\n".format(crawl))
for aggr_type in ('type', 'tld'):
data = crawl_data
data = data[data['crawl'].isin([crawl])]
data = data[[aggr_type, 'pages', 'urls', 'hosts', 'domains']]
data = data.set_index([aggr_type])
data = data.groupby(level=0).sum().sort_values(
by=['urls'], ascending=False)
for count in ('urls', 'hosts', 'domains'):
data['%'+count] = 100.0 * data[count] / data[count].sum()
if aggr_type == 'tld':
# skip less frequent TLDs
data = data[data['%urls'] >= min_urls_percentage]
for tld in data.index.values:
top_tlds.append(tld)
print(data.to_string(formatters=field_formatters))
print()
if crawl == latest_crawl:
# latest crawl by convention
type_name = aggr_type
if aggr_type == 'type':
type_name = 'group'
path = '{}/tld/latest-crawl-{}s.html'.format(
self.PLOTDIR, type_name)
data.to_html(path,
formatters=field_formatters,
classes=['tablesorter', 'tablesearcher'])
# stats comparison for selected crawls
for aggr_type in ('type', 'tld'):
data = crawl_data
if aggr_type == 'tld':
data = data[data['tld'].isin(top_tlds)]
data = self.pivot_percentage(data, 'crawl', aggr_type,
'urls', {'urls': 'sum'})
print("\n----- {}\n".format(aggr_type))
print(data.to_string(formatters={c: TldStats.field_percentage_formatter()
for c in crawls}))
if aggr_type == 'tld':
# save as HTML table
path = '{}/tld/selected-crawls-percentage.html'.format(self.PLOTDIR)
data.to_html(path,
float_format=TldStats.field_percentage_formatter(4),
classes=['tablesorter', 'tablepercentage',
'tablesearcher'])
def plot_comparison(self, crawl, name, topNlimit=None, method='spearman'):
print()
print('Comparison for', crawl, '-', name, '-', method)
data = self.tld_stats
data = data[data['crawl'].isin([crawl])]
data = data[data['urls'] >= (topNlimit or 0)]
data = data.set_index(['tld'], drop=False)
data = data.groupby(level='tld').sum(numeric_only=True)
print(data)
data['alexa'] = pandas.Series(alexa_top_1m_tlds)
data['cisco'] = pandas.Series(cisco_umbrella_top_1m_tlds)
data['majestic'] = pandas.Series(majestic_top_1m_tlds)
fields = ('pages', 'urls', 'hosts', 'domains',
'alexa', 'cisco', 'majestic')
formatters = {c: '{0:,.3f}'.format for c in fields}
# relative frequency (percent)
for count in fields:
data[count] = 100.0 * data[count] / data[count].sum()
# Spearman's rank correlation for all TLDs
corr = data.corr(method=method, min_periods=1)
print(corr.to_string(formatters=formatters))
corr.to_html('{}/tld/{}-comparison-{}-all-tlds.html'
.format(self.PLOTDIR, name, method),
formatters=formatters,
classes=['matrix'])
if topNlimit is None:
return
# Spearman's rank correlation for TLDs covering
# at least topNlimit % of urls
data = data[data['urls'] >= topNlimit]
print()
print('Top', len(data), 'TLDs (>= ', topNlimit, '%)')
print(data)
data.to_html('{}/tld/{}-comparison.html'.format(self.PLOTDIR, name),
formatters=formatters,
classes=['tablesorter', 'tablepercentage'])
print()
corr = data.corr(method=method, min_periods=1)
print(corr.to_string(formatters=formatters))
corr.to_html('{}/tld/{}-comparison-{}-frequent-tlds.html'
.format(self.PLOTDIR, name, method),
formatters=formatters,
classes=['matrix'])
print()
def plot_comparison_groups(self):
# Alexa, Cisco and Majestic types/groups:
for (name, data) in [('Alexa', alexa_top_1m_tlds),
('Cisco', cisco_umbrella_top_1m_tlds),
('Majestic', majestic_top_1m_tlds)]:
compare_types = defaultdict(int)
for tld in data:
compare_types[TopLevelDomain(tld).tld_type] += data[tld]
print(name, 'TLD groups:')
for tld in compare_types:
c = compare_types[tld]
print(' {:6d}\t{:4.1f}\t{}'.format(c, (100.0*c/1000000), tld))
print()
if __name__ == '__main__':
plot_crawls = sys.argv[1:]
latest_crawl = plot_crawls[-1] if plot_crawls else None
if len(plot_crawls) == 0:
print(sys.argv[0], 'crawl-id...')
print()
print('Distribution of top-level domains for (selected) monthly crawls')
print()
print('Example:')
print('', sys.argv[0], '[options]', 'CC-MAIN-2014-52', 'CC-MAIN-2016-50')
print()
print('Last argument is considered to be the latest crawl')
print()
print('Options:')
print()
sys.exit(1)
plot = TldStats()
plot.read_data(sys.stdin)
plot.transform_data()
plot.save_data()
plot.plot_groups()
plot.plot(plot_crawls, latest_crawl)
if latest_crawl == 'CC-MAIN-2019-09':
# plot comparison only for crawl of similar date as benchmark data
plot.plot_comparison(latest_crawl, 'selected-crawl',
min_urls_percentage)
# plot.plot_comparison(latest_crawl, 'selected-crawl',
# min_urls_percentage, 'pearson')
plot.plot_comparison_groups()
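# Illustration (not part of the repository): plot_comparison delegates the
# correlation to pandas via DataFrame.corr(method='spearman'). The following
# pure-Python sketch shows what Spearman's rank correlation computes: the
# Pearson correlation of the ranks, with tied values sharing their mean rank.
# The helper names and the example shares are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values starting at 1 for the smallest; ties get their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values equal to values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# hypothetical percentage shares of four TLDs in a crawl and in a top list
crawl_share = [48.0, 6.0, 4.0, 5.0]
toplist_share = [52.0, 4.5, 3.0, 3.5]
rho = spearman(crawl_share, toplist_share)
print(round(rho, 3))
```

# Because only the ranks matter, the two columns correlate perfectly here
# although the percentages themselves differ.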
================================================
FILE: plot/tld_by_continent.py
================================================
"""
Plot TLD distributions by continent.
This module generates visualizations showing how TLDs are distributed
across geographic continents and major TLD groups (com/net, org, edu, gov/mil).
Maps country-code TLDs to their respective continents using ISO country codes.
"""
import json
import os.path
import sys
from collections import Counter, defaultdict
import fsspec
import matplotlib.pyplot as plt
import pandas
from matplotlib.ticker import MaxNLocator
from crawlplot import CrawlPlot
from crawlstats import MonthlyCrawl, MultiCount
from top_level_domain import TopLevelDomain
tld_counts = defaultdict(Counter)
# mapping of country-code TLDs to continents
continent_cc_tlds = {
'Africa': {'ao', 'bf', 'bi', 'bj', 'bw', 'cd', 'cf', 'cg', 'ci', 'cm', 'cv',
'dj', 'dz', 'eg', 'eh', 'er', 'et', 'ga', 'gh', 'gm', 'gn', 'gq',
'gw', 'ke', 'km', 'lr', 'ls', 'ly', 'ma', 'mg', 'ml', 'mr', 'mu',
'mw', 'mz', 'na', 'ne', 'ng', 're', 'rw', 'sc', 'sd', 'sh', 'sl',
'sn', 'so', 'ss', 'st', 'sz', 'td', 'tg', 'tn', 'tz', 'ug', 'yt',
'za', 'zm', 'zw'},
'Antarctica': {'aq'},
'Asia': {'ae', 'af', 'am', 'az', 'bd', 'bh', 'bn', 'bt', 'cc', 'cn', 'cx',
'ge', 'hk', 'id', 'il', 'in', 'io', 'iq', 'ir', 'jo', 'jp', 'kg',
'kh', 'kp', 'kr', 'kw', 'kz', 'la', 'lb', 'lk', 'mm', 'mn', 'mo',
'mv', 'my', 'np', 'om', 'ph', 'pk', 'ps', 'qa', 'sa', 'sg', 'sy',
'th', 'tj', 'tm', 'tr', 'tw', 'uz', 'vn', 'ye',
'tp' # Timor-Leste: deleted in favor of .tl in 2015
},
'Europe': {'ad', 'al', 'at', 'ba', 'be', 'bg', 'by', 'ch', 'cy', 'cz',
'de', 'dk', 'ee', 'es', 'fi', 'fo', 'fr', 'gg', 'gi', 'gr',
'hr', 'hu', 'ie', 'im', 'is', 'it', 'je', 'li', 'lt', 'lu', 'lv',
'mc', 'md', 'me', 'mk', 'mt', 'nl', 'no',
'pl', 'pt', 'ro', 'rs', 'ru', 'se', 'si', 'sj', 'sk', 'sm',
'ua', 'uk', 'va',
'xk', # https://en.wikipedia.org/wiki/.xk
'bv', # Bouvet Island (inactive, uninhabited Norwegian territory, South Atlantic Ocean)
'gb' # Great Britain (reserved)
},
'North America': {'ag', 'ai', 'an', 'aw', 'bb', 'bm', 'bs', 'bz',
'ca', 'cr', 'cu', 'cw', 'dm', 'do', 'gd', 'gl', 'gp', 'gt',
'hn', 'ht', 'jm', 'kn', 'ky', 'lc', 'mq', 'ms', 'mx', 'ni',
'pa', 'pm', 'pr', 'sv', 'sx', 'tc', 'tt',
'us', 'vc', 'vg', 'vi',
'bl', # Saint Barthélemy (unused)
'bq', # Bonaire, Sint Eustatius and Saba (reserved)
'mf', # Saint Martin (unassigned)
},
'Oceania': {'as', 'au', 'ck', 'fj', 'fm', 'gu', 'ki', 'mh', 'mp',
'nc', 'nf', 'nr', 'nu', 'nz', 'pf', 'pg', 'pn', 'pw',
'sb', 'tk', 'tl', 'to', 'tv', 'vu', 'wf', 'ws'
},
'South America': {'ar', 'bo', 'br', 'cl', 'co', 'ec', 'fk', 'gf', 'gy',
'pe', 'py', 'sr', 'uy', 've'},
}
# Geographic TLDs mapped to continents
# https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains#Geographic_top-level_domains
continent_geographic_tlds = {
'Africa': {'africa', 'capetown', 'durban', 'joburg'},
'Asia': {'abudhabi', 'arab', 'asia', 'doha', 'dubai', 'krd', 'kyoto',
'nagoya', 'okinawa', 'osaka', 'ryukyu', 'taipei', 'tokyo', 'yokohama',
# https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains#Internationalized_geographic_top-level_domains
'xn--1qqw23a', '佛山', # Foshan, China
'xn--xhq521b', '广东', # Guangdong, China
'xn--mgbca7dzdo', 'ابوظبي', # Abu Dhabi
'xn--ngbrx', 'عرب', # Arab
},
'Europe': {
# France
'alsace', 'bzh', 'corsica', 'eus', 'paris',
# Spain
'bcn', 'barcelona', 'cat', 'eus', 'gal', 'madrid',
# Germany
'bayern', 'berlin', 'cologne', 'koeln', 'hamburg', 'nrw', 'ruhr', 'saarland',
# other
'eu', 'amsterdam', 'bar', 'brussels', 'cymru', 'wales', 'frl', 'gent', 'helsinki', 'irish', 'ist', 'istanbul', 'london', 'moscow', 'scot', 'stockholm', 'swiss', 'tatar', 'tirol', 'vlaanderen', 'wien', 'zuerich', 'su',
# internationalized TLDs, listed with Europe for consistency with 'moscow', 'ru' and 'su'
'xn--80adxhks', 'москва', # Moscow, Russia
'xn--p1acf', 'рус', # Russian language and culture - https://en.wikipedia.org/wiki/.%D1%80%D1%83%D1%81
# https://en.wikipedia.org/wiki/.ax
'ax'
},
'North America': {'boston', 'miami', 'nyc', 'quebec', 'vegas'},
'Oceania': {'kiwi', 'melbourne', 'sydney'},
'South America': {'lat', 'rio'}
}
# list of "continents" to be shown in the output
continents = ['(other)', 'com,net', 'org', 'edu', 'gov,mil', 'North America', 'South America', 'Oceania', 'Africa', 'Asia', 'Europe']
# lookup tables TLD -> continent
tld_continent = {
'gov': 'gov,mil', 'mil': 'gov,mil',
'com': 'com,net', 'net': 'com,net',
'org': 'org', 'edu': 'edu'
}
# frequency counts of TLDs that cannot be mapped to a continent
tld_unmapped = Counter()
# fill the lookup table with TLD -> continent mappings
for continent in continent_cc_tlds:
for tld in continent_cc_tlds[continent]:
tld_continent[tld] = continent
for continent in continent_geographic_tlds:
for tld in continent_geographic_tlds[continent]:
tld_continent[tld] = continent
for icctld in TopLevelDomain.tld_ccs:
if TopLevelDomain.tld_ccs[icctld] in tld_continent:
tld_continent[icctld] = tld_continent[TopLevelDomain.tld_ccs[icctld]]
def tld2continent(tld):
"""Map a TLD to its corresponding continent."""
continent = '(other)'
tld = tld.lower()
if tld in tld_continent and tld_continent[tld] != 'Antarctica':
continent = tld_continent[tld]
return continent
def get_data(f):
"""Parse TLD statistics and aggregate by year and crawl.
Returns two dictionaries: one aggregated by year, one by crawl name.
"""
d = defaultdict(lambda: defaultdict(list))
dd = defaultdict(lambda: defaultdict(list))
for line in f:
keyval = line.split('\t')
if len(keyval) == 2:
[_, suffix, crawl] = json.loads(keyval[0])
year = MonthlyCrawl.year_of(crawl)
val = json.loads(keyval[1])
tld = suffix.split('.')[-1].lower()
tld_cnt = tld2continent(tld)
if tld_cnt == '(other)':
tld_unmapped[tld] += MultiCount.get_count(0, val)
if tld:
# print(tld)
tld_counts['(any)'][tld] += MultiCount.get_count(0, val)
tld_counts[str(year)][tld] += MultiCount.get_count(0, val)
d[str(year)][tld_cnt].append(val)
dd[MonthlyCrawl.short_name(crawl)][tld_cnt].append(val)
return d, dd
class TLDByContinentPlot(CrawlPlot):
"""Generate TLD distribution by continent visualizations."""
def __init__(self):
super().__init__()
def plot(self):
"""Generate TLD by continent/year plots and save data tables."""
# Read from file path or stdin
if len(sys.argv) > 1 and os.path.exists(sys.argv[-1]):
with fsspec.open(sys.argv[-1], compression="gzip", mode="rt") as f:
d, dd = get_data(f)
else:
d, dd = get_data(sys.stdin)
print("\nyear\t{}".format("\t".join(continents)))
continent_percentages = dict()
for year in d:
pages = dict()
total = 0
values = []
for tld in continents:
d[year][tld].append([0,0,0,0])
val = MultiCount.sum_values(d[year][tld], False)
total += val[0]
values.append(val[0])
# print("{}\t{}\t{}\t{}\t{}\t{}".format(year, tld, *val))
percentages = [100*val/total for val in values]
print("{}\t{}".format(year, "\t".join(
map(lambda x: '{:.2f}'.format(x), percentages))))
continent_percentages[year] = percentages
continent_percentages = pandas.DataFrame.from_dict(continent_percentages,
orient='index',
columns=continents)
continent_percentages.index.name = 'year'
print(continent_percentages)
top_tlds = tld_counts['(any)'].most_common(16)
#print("\n", top_tlds)
top_tlds_by_year = defaultdict(list)
print("\nyear\t{}".format("\t".join([x[0] for x in top_tlds])))
for year in tld_counts:
total = sum(tld_counts[year].values())
sys.stdout.write(year)
for tld in top_tlds:
perc = 100*tld_counts[year][tld[0]]/total
sys.stdout.write('\t{:.2f}'.format(perc))
top_tlds_by_year[year].append(perc)
sys.stdout.write('\n')
# table TLDs by year
selected_tlds = pandas.DataFrame.from_dict(
top_tlds_by_year,
orient='index',
columns=[tld for tld, _ in top_tlds]
)
selected_tlds.index.name = 'year'
selected_tlds.to_csv(
os.path.join(self.PLOTDIR, 'tld', 'selected-tlds-by-year.csv'),
index=True)
css_classes = ['tablepercentage', 'tablesorter']
selected_tlds.to_html(
os.path.join(self.PLOTDIR, 'tld', 'selected-tlds-by-year.html'),
float_format='%.2f',
classes=css_classes,
index_names=True)
print("\ncrawl\t{}".format("\t".join(continents)))
for crawl in dd:
pages = dict()
total = 0
values = []
for tld in continents:
dd[crawl][tld].append([0,0,0,0])
val = MultiCount.sum_values(dd[crawl][tld], False)
total += val[0]
values.append(val[0])
# print("{}\t{}\t{}\t{}\t{}\t{}".format(year, tld, *val))
print("{}\t{}".format(crawl, "\t".join(['{:.2f}'.format(100*val/total) for val in values])))
# print unmapped TLDs to verify whether there are any TLDs
# that need to be added to the mapping
print("\n", len(tld_unmapped), " unmapped TLDs: ", str(tld_unmapped), "\n\n")
data = continent_percentages.melt(id_vars=[], var_name='continent',
value_name='perc', ignore_index=False)
data['continent'] = pandas.Categorical(data['continent'],
ordered=True,
categories=continents[::-1])
if self.PLOTLIB == "rpy2.ggplot2":
self.plot_with_rpy2_ggplot2(data=data)
elif self.PLOTLIB == "matplotlib":
self.plot_with_matplotlib(data=data)
else:
raise ValueError("Invalid PLOTLIB")
### plot and table for print publication
#plot = plot + ggplot2.labs(title='',
# x='', y='', fill='TLD / Continent') \
# + ggplot2.theme()
#plot.save(os.path.join(PLOTDIR, 'tld', 'tlds-by-year-and-continent.pdf'))
#print(continent_percentages.to_latex(index=True, float_format='%.2f'))
continent_percentages.to_csv(
os.path.join(self.PLOTDIR, 'tld', 'tlds-by-year-and-continent.csv'),
index=True)
css_classes = ['tablepercentage', 'tablesorter']
continent_percentages.to_html(
os.path.join(self.PLOTDIR, 'tld', 'tlds-by-year-and-continent.html'),
float_format='%.2f',
classes=css_classes)
def plot_with_rpy2_ggplot2(self, data):
"""Generate TLD by continent stacked bar chart using rpy2/ggplot2."""
from rpy2.robjects.lib import ggplot2
plot = ggplot2.ggplot(data.reset_index()) \
+ ggplot2.aes_string(x='year', y='perc', fill='continent', label='perc') \
+ ggplot2.geom_bar(stat='identity', position='stack') \
+ self.GGPLOT2_THEME + ggplot2.scale_fill_hue() \
+ ggplot2.labs(title='Percentage of Page Captures per TLD / Continent',
x='', y='Percentage', fill='TLD / Continent') \
+ ggplot2.theme(**{'legend.position': 'right',
'aspect.ratio': .7,
**self.GGPLOT2_THEME_KWARGS,
'axis.text.x':
ggplot2.element_text(angle=45,
vjust=1, hjust=1)})
plot.save(os.path.join(self.PLOTDIR, 'tld', 'tlds-by-year-and-continent.png'))
return plot
def plot_with_matplotlib(self, data):
"""Generate TLD by continent stacked bar chart using matplotlib."""
aspect_ratio = 0.7
title = 'Percentage of Page Captures per TLD / Continent'
fig, ax = self.create_figure()
# Colorblind-safe palette (Paul Tol's scheme)
        colors = ['#4477AA', '#EE6677', '#228833', '#CCBB44', '#AA3377',
                  '#66CCEE', '#EE8866', '#44AA99', '#BBBBBB', '#99CC66', '#CC99BB']
        years = sorted(data.reset_index()['year'].unique())
        bottoms = [0] * len(years)
        sorted_continents = sorted(continents)[::-1]
        for i, continent in enumerate(sorted_continents):
            values = []
            for year in years:
                year_data = data.loc[year]
                continent_data = year_data[year_data['continent'] == continent]
                values.append(continent_data['perc'].values[0] if len(continent_data) > 0 else 0)
            ax.bar(range(len(years)), values, bottom=bottoms, label=continent,
                   color=colors[i % len(colors)], width=self.bar_width)
            bottoms = [b + v for b, v in zip(bottoms, values)]
        # Axes ratio
        ax.set_aspect(1 / ax.get_data_ratio() * aspect_ratio)
        self.set_title(ax, title)
        ax.set_xlabel('')
        ax.set_ylabel('Percentage', fontsize=self.ylabel_fontsize)
        # Set x-axis ticks and labels
        ax.set_xticks(range(len(years)))
        ax.set_xticklabels(years, rotation=45, ha='right', fontsize=self.ticks_fontsize)
        ax.set_xlim(-0.5, len(years) - 0.5)
        # Set y-axis formatting
        ax.yaxis.set_major_locator(MaxNLocator(nbins=6))
        ax.set_ylim(0, 100)
        ax.tick_params(axis='y', labelsize=self.ticks_fontsize)
        # Apply ggplot2-like styling with y-axis grid
        ax.grid(True, which='major', linewidth=1.0, color='#E6E6E6', zorder=0, axis='y')
        ax.set_axisbelow(True)
        # Custom spine styling (thin borders at top/bottom)
        ax.spines['top'].set_visible(True)
        ax.spines['top'].set_linewidth(1.0)
        ax.spines['top'].set_color('#E6E6E6')
        ax.spines['right'].set_visible(False)
        ax.spines['left'].set_visible(False)
        ax.spines['bottom'].set_visible(True)
        ax.spines['bottom'].set_linewidth(1.0)
        ax.spines['bottom'].set_color('#E6E6E6')
        # Set tick colors
        ax.tick_params(axis='both', which='both', colors=self.ticks_color,
                       length=self.ticks_length, width=1.0)
        self.set_tick_labels_black(ax)
        # Position legend on right side with reversed order
        handles, labels = ax.get_legend_handles_labels()
        legend = ax.legend(handles[::-1], labels[::-1], loc='center left',
                           bbox_to_anchor=(1.0, 0.5), frameon=False,
                           fontsize=self.legend_fontsize, title='TLD / Continent',
                           title_fontsize=self.legend_title_fontsize)
        legend._legend_box.align = 'left'
        img_path = os.path.join(self.PLOTDIR, 'tld', 'tlds-by-year-and-continent.png')
        return self.save_figure(fig, img_path)


if __name__ == '__main__':
    plot = TLDByContinentPlot()
    plot.plot()
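
The stacking loop in `plot_with_matplotlib` above keeps a running `bottoms` list so that each continent's bar segment starts where the previous one ended. A minimal standalone sketch of that bookkeeping, without matplotlib, might look like the following. The helper name `stack_layers` and the sample percentages are hypothetical and not part of the repository.

```python
def stack_layers(layers):
    """Compute the bottom offsets for each layer of a stacked bar chart.

    `layers` is a list of (label, values) pairs; every `values` list must
    hold one number per bar (in the plot above: one per year).
    Returns a list of (label, bottoms, values) tuples in drawing order.
    """
    if not layers:
        return []
    n_bars = len(layers[0][1])
    bottoms = [0.0] * n_bars
    stacked = []
    for label, values in layers:
        if len(values) != n_bars:
            raise ValueError(
                f'layer {label!r} has {len(values)} values, expected {n_bars}')
        # This layer is drawn on top of everything accumulated so far.
        stacked.append((label, list(bottoms), list(values)))
        # Same update as in the plotting loop: raise the baseline per bar.
        bottoms = [b + v for b, v in zip(bottoms, values)]
    return stacked


# Hypothetical percentages for two years (illustrative only, not crawl data).
example_layers = [
    ('Europe', [30.0, 25.0]),
    ('Asia', [20.0, 35.0]),
    ('(other)', [50.0, 40.0]),
]
for label, bottoms, values in stack_layers(example_layers):
    print(label, bottoms, values)
```

Each returned tuple corresponds to one `ax.bar(..., values, bottom=bottoms)` call; keeping the arithmetic separate from the drawing makes it easy to check that the layers of every bar sum to the expected total.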
================================================
FILE: plot.sh
================================================
#!/bin/bash
N_CRAWLS=$(python3 -c 'from crawlstats import MonthlyCrawl; print(len(MonthlyCrawl.by_name))')
LATEST_CRAWL=$(python3 -c 'from crawlstats import MonthlyCrawl; print(sorted(MonthlyCrawl.by_name.keys())[-1])')
# verify that all stats files are downloaded, cf. get_stats.sh
N_CRAWLS_STATS_FILES=$(ls stats/CC-MAIN-*.gz | wc -l)
if [[ $N_CRAWLS -ne $N_CRAWLS_STATS_FILES ]]; then
    echo "Number of crawls registered in crawlstats.py ($N_CRAWLS) and"
    echo "the number of statistics files in stats/ ($N_CRAWLS_STATS_FILES) are not equal."
    echo "Exiting!"
    exit 1
fi
echo "Plotting crawl statistics for $N_CRAWLS crawls"
echo "Latest crawl is: $LATEST_CRAWL"
echo
# fail on any kind of error
set -exo pipefail
# register the latest crawl in the website configuration
sed -i 's@^latest_crawl:.*@latest_crawl: '$LATEST_CRAWL'@' _config.yml
function update_excerpt() {
    regex="$1"
    excerpt="$2"
    if [ -e "$excerpt" ]; then
        # short-cut for monthly update plots: only add data from latest crawl
        if ! zgrep -qF "$LATEST_CRAWL" $excerpt; then
            echo "Updating excerpt $excerpt with latest crawl $LATEST_CRAWL"
            zgrep -Eh "$regex" stats/$LATEST_CRAWL.gz | gzip >>$excerpt
        fi
        # sanity check: are all crawls excerpted?
        N_CRAWLS_EXCERPTED=$(zcat $excerpt | cut -f1 | jq -r '.[2]' | uniq | sort -u | wc -l)
        if [[ $N_CRAWLS_EXCERPTED -eq $N_CRAWLS ]]; then
            echo "Excerpt $excerpt includes $N_CRAWLS crawls as expected."
        else
            echo "Number of crawls excerpted in $excerpt ($N_CRAWLS_EXCERPTED) does not equal $N_CRAWLS"
SYMBOL INDEX (182 symbols across 16 files)
FILE: crawlplot.py
class CrawlPlot (line 33) | class CrawlPlot:
method create_figure (line 88) | def create_figure(self, ratio=1.0):
method set_title (line 100) | def set_title(self, ax, title):
method apply_ggplot2_style (line 115) | def apply_ggplot2_style(self, ax, show_grid=True, grid_axis='both'):
method set_tick_labels_black (line 136) | def set_tick_labels_black(self, ax):
method apply_nice_ticks (line 145) | def apply_nice_ticks(self, ax, axis='y', use_scientific=True):
method save_figure (line 174) | def save_figure(self, fig, img_path):
method hide_tick_marks (line 192) | def hide_tick_marks(self, ax, tick_color='#FFFFFF'):
method __init__ (line 204) | def __init__(self):
method read_from_stdin_or_file (line 272) | def read_from_stdin_or_file(self):
method read_data (line 290) | def read_data(self, stream):
method line_plot_with_ggplot (line 306) | def line_plot_with_ggplot(
method line_plot_with_rpy2_ggplot2 (line 334) | def line_plot_with_rpy2_ggplot2(
method nice_tick_step (line 375) | def nice_tick_step(vmin, vmax, n=5):
method center_legend_title (line 399) | def center_legend_title(fig, ax, leg_items, leg_title, x_axes=0.1):
method line_plot_with_matplotlib (line 408) | def line_plot_with_matplotlib(
method line_plot (line 518) | def line_plot(
FILE: crawlstats.py
class MonthlyCrawl (line 35) | class MonthlyCrawl:
method get_by_name (line 168) | def get_by_name(name):
method to_name (line 172) | def to_name(crawl):
method to_bit_mask (line 176) | def to_bit_mask(crawl):
method date_of (line 180) | def date_of(crawl):
method year_of (line 191) | def year_of(crawl):
method short_name (line 195) | def short_name(name):
method get_latest (line 199) | def get_latest(n):
class MonthlyCrawlSet (line 203) | class MonthlyCrawlSet:
method __init__ (line 209) | def __init__(self, crawls=0):
method add (line 212) | def add(self, crawl):
method update (line 215) | def update(self, *others):
method clear (line 219) | def clear(self):
method discard (line 222) | def discard(self, crawl):
method __contains__ (line 225) | def __contains__(self, crawl):
method __len__ (line 228) | def __len__(self):
method get_bits (line 235) | def get_bits(self):
method get_crawls (line 238) | def get_crawls(self):
method is_new (line 247) | def is_new(self, crawl):
method is_newest (line 263) | def is_newest(self, crawl):
class CST (line 271) | class CST(Enum):
class MultiCount (line 375) | class MultiCount(defaultdict):
method __init__ (line 378) | def __init__(self, size):
method incr (line 382) | def incr(self, key, *counts):
method compress (line 387) | def compress(size, counts):
method get_compressed (line 397) | def get_compressed(self, key):
method get_count (line 401) | def get_count(index, value):
method sum_values (line 409) | def sum_values(values, compress=True):
class CrawlStatsJSONEncoder (line 436) | class CrawlStatsJSONEncoder(json.JSONEncoder):
method default (line 438) | def default(self, o):
method json_encode_hyperloglog (line 446) | def json_encode_hyperloglog(o):
class CrawlStatsJSONDecoder (line 452) | class CrawlStatsJSONDecoder(json.JSONDecoder):
method __init__ (line 454) | def __init__(self, *args, **kargs):
method dict_to_object (line 458) | def dict_to_object(self, dic):
method json_decode_hyperloglog (line 471) | def json_decode_hyperloglog(dic):
class HostDomainCount (line 480) | class HostDomainCount:
method __init__ (line 487) | def __init__(self):
method add (line 491) | def add(self, url, count):
method output (line 499) | def output(self, crawl):
class SurtDomainCount (line 529) | class SurtDomainCount:
method __init__ (line 534) | def __init__(self, surt_domain):
method add (line 547) | def add(self, _path, metadata):
method unique_urls (line 595) | def unique_urls(self):
method output (line 598) | def output(self, crawl, exact_count=True, min_surt_hll_size=50000):
class UnhandledTypeError (line 642) | class UnhandledTypeError(Exception):
method __init__ (line 643) | def __init__(self, outputType):
class InputError (line 647) | class InputError(Exception):
method __init__ (line 648) | def __init__(self, message):
class CCStatsJob (line 652) | class CCStatsJob(MRJob):
method configure_args (line 674) | def configure_args(self):
method input_protocol (line 711) | def input_protocol(self):
method hadoop_input_format (line 718) | def hadoop_input_format(self):
method count_mapper_init (line 726) | def count_mapper_init(self):
method count_mapper (line 766) | def count_mapper(self, _, line):
method count_mapper_final (line 801) | def count_mapper_final(self):
method reducer_init (line 829) | def reducer_init(self):
method count_reducer (line 833) | def count_reducer(self, key, values):
method stats_mapper_init (line 910) | def stats_mapper_init(self):
method stats_mapper (line 913) | def stats_mapper(self, key, value):
method stats_mapper_final (line 944) | def stats_mapper_final(self):
method stats_reducer (line 948) | def stats_reducer(self, key, values):
method reducer_final (line 1007) | def reducer_final(self):
method steps (line 1021) | def steps(self):
FILE: plot/charset.py
class CharsetStats (line 7) | class CharsetStats(TabularStats):
method __init__ (line 12) | def __init__(self):
method add (line 16) | def add(self, key, val):
FILE: plot/crawl_size.py
class CrawlSizePlot (line 26) | class CrawlSizePlot(CrawlPlot):
method __init__ (line 34) | def __init__(self):
method add (line 46) | def add(self, key, val):
method add_by_type (line 63) | def add_by_type(self, crawl, item_type, count):
method cumulative_size (line 90) | def cumulative_size(self):
method transform_data (line 157) | def transform_data(self):
method save_data (line 162) | def save_data(self):
method duplicate_ratio (line 167) | def duplicate_ratio(self):
method plot (line 178) | def plot(self):
method plot_with_rpy2_ggplot2 (line 310) | def plot_with_rpy2_ggplot2(self, by_year_by_type, img_path):
method plot_with_matplotlib (line 340) | def plot_with_matplotlib(self, by_year_by_type, img_path):
method export_csv (line 434) | def export_csv(self, data, csv):
method norm_data (line 441) | def norm_data(self, data, row_filter, type_name_norm):
method size_plot (line 460) | def size_plot(self, data, row_filter, type_name_norm,
FILE: plot/crawler_metrics.py
class CrawlerMetrics (line 26) | class CrawlerMetrics(CrawlSizePlot):
method __init__ (line 68) | def __init__(self):
method add (line 72) | def add(self, key, val):
method save_data (line 90) | def save_data(self):
method add_percent (line 96) | def add_percent(self):
method row2title (line 116) | def row2title(row):
method plot (line 124) | def plot(self):
method plot_fetch_status_with_rpy2_ggplot2 (line 162) | def plot_fetch_status_with_rpy2_ggplot2(self, data, img_path, ratio):
method plot_fetch_status_with_matplotlib (line 183) | def plot_fetch_status_with_matplotlib(self, data, categories, img_path...
method plot_fetch_status (line 248) | def plot_fetch_status(self, data, row_filter, img_file, ratio=1.0):
method plot_crawldb_status_with_rpy2_ggplot2 (line 271) | def plot_crawldb_status_with_rpy2_ggplot2(self, data, img_path, ratio):
method plot_crawldb_status_with_matplotlib (line 291) | def plot_crawldb_status_with_matplotlib(self, data, img_path, ratio):
method plot_crawldb_status (line 360) | def plot_crawldb_status(self, data, row_filter, img_file, ratio=1.0):
FILE: plot/domain.py
class DomainStats (line 9) | class DomainStats(TabularStats):
method __init__ (line 14) | def __init__(self, crawl):
method add (line 19) | def add(self, key, val):
method transform_data (line 37) | def transform_data(self):
method save_data (line 46) | def save_data(self, name, dir_name='data/'):
method plot (line 50) | def plot(self, name):
FILE: plot/histogram.py
class CrawlHistogram (line 22) | class CrawlHistogram(CrawlPlot):
method __init__ (line 34) | def __init__(self):
method add (line 39) | def add(self, key, frequency):
method transform_data (line 57) | def transform_data(self):
method save_data (line 61) | def save_data(self):
method plot_dupl_url (line 65) | def plot_dupl_url(self):
method plot_host_domain_tld (line 86) | def plot_host_domain_tld(self):
method plot_domain_cumul_with_rpy2_ggplot2 (line 108) | def plot_domain_cumul_with_rpy2_ggplot2(self, data, title, img_path):
method plot_domain_cumul (line 125) | def plot_domain_cumul(self, crawl):
FILE: plot/language.py
class LanguageStats (line 8) | class LanguageStats(TabularStats):
method __init__ (line 13) | def __init__(self):
method add (line 17) | def add(self, key, val):
FILE: plot/mimetype.py
class MimeTypeStats (line 8) | class MimeTypeStats(TabularStats):
method __init__ (line 22) | def __init__(self):
method norm_value (line 26) | def norm_value(self, mimetype):
method add (line 35) | def add(self, key, val):
FILE: plot/mimetype_detected.py
class MimeTypeDetectedStats (line 7) | class MimeTypeDetectedStats(MimeTypeStats):
method __init__ (line 9) | def __init__(self):
method norm_value (line 13) | def norm_value(self, mimetype):
method add (line 16) | def add(self, key, val):
FILE: plot/overlap.py
class CrawlOverlap (line 20) | class CrawlOverlap(CrawlPlot):
method __init__ (line 30) | def __init__(self):
method add (line 37) | def add(self, key, val):
method fill_overlap_matrix (line 47) | def fill_overlap_matrix(self):
method save_overlap_matrix (line 70) | def save_overlap_matrix(self):
method plot_similarity_graph (line 78) | def plot_similarity_graph(self, show_edges=False):
method plot_similarity_matrix_with_rpy2_ggplot2 (line 100) | def plot_similarity_matrix_with_rpy2_ggplot2(self, data, midpoint, tit...
method plot_similarity_matrix_with_matplotlib (line 122) | def plot_similarity_matrix_with_matplotlib(self, data, decimals, title...
method plot_similarity_matrix (line 211) | def plot_similarity_matrix(self, item_type, image_file, title):
FILE: plot/table.py
class TabularStats (line 12) | class TabularStats(CrawlPlot):
method __init__ (line 14) | def __init__(self):
method norm_value (line 24) | def norm_value(self, typeval):
method add_check_type (line 27) | def add_check_type(self, key, val, requ_type_cst):
method transform_data (line 49) | def transform_data(self, top_n, min_avg_count, check_pattern=None):
method save_data (line 121) | def save_data(self, base_name, dir_name='data/'):
method save_data_percentage (line 124) | def save_data_percentage(self, base_name, dir_name='data/', type_name=...
method plot (line 137) | def plot(self, crawls, name, column_header, xtra_css_classes=[]):
FILE: plot/tld.py
class TldStats (line 18) | class TldStats(CrawlPlot):
method __init__ (line 20) | def __init__(self):
method add (line 27) | def add(self, key, val):
method transform_data (line 35) | def transform_data(self):
method field_percentage_formatter (line 89) | def field_percentage_formatter(precision=2, nan='-'):
method save_data (line 94) | def save_data(self):
method percent_agg (line 97) | def percent_agg(self, data, column, index, values, aggregate):
method pivot_percentage (line 105) | def pivot_percentage(self, data, column, index, values, aggregate):
method plot_groups (line 110) | def plot_groups(self):
method plot (line 130) | def plot(self, crawls, latest_crawl):
method plot_comparison (line 187) | def plot_comparison(self, crawl, name, topNlimit=None, method='spearma...
method plot_comparison_groups (line 232) | def plot_comparison_groups(self):
FILE: plot/tld_by_continent.py
function tld2continent (line 126) | def tld2continent(tld):
function get_data (line 135) | def get_data(f):
class TLDByContinentPlot (line 163) | class TLDByContinentPlot(CrawlPlot):
method __init__ (line 166) | def __init__(self):
method plot (line 169) | def plot(self):
method plot_with_rpy2_ggplot2 (line 280) | def plot_with_rpy2_ggplot2(self, data):
method plot_with_matplotlib (line 301) | def plot_with_matplotlib(self, data):
FILE: tests/test_crawlstat.py
function test_monthly_crawl (line 18) | def test_monthly_crawl():
function test_monthly_crawl_set (line 25) | def test_monthly_crawl_set():
function test_crawlstatstype (line 78) | def test_crawlstatstype():
function test_json_hyperloglog (line 83) | def test_json_hyperloglog():
function test_multicount (line 96) | def test_multicount():
FILE: top_level_domain.py
class TopLevelDomain (line 6) | class TopLevelDomain:
method __init__ (line 24) | def __init__(self, tld):
method __str__ (line 43) | def __str__(self):
method _read_data (line 58) | def _read_data():
method short_type (line 117) | def short_type(name):
Condensed preview: 177 files, each showing path, character count, and a content snippet (2,778K chars of full structured content).
[
{
"path": ".github/workflows/ci.yml",
"chars": 3152,
"preview": "name: CI Pipeline\n\non:\n push:\n branches: [master]\n pull_request:\n\nenv:\n REGISTRY: ghcr.io\n IMAGE_NAME: ${{ github"
},
{
"path": ".gitignore",
"chars": 1386,
"preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
},
{
"path": "LICENSE",
"chars": 11357,
"preview": " Apache License\n Version 2.0, January 2004\n "
},
{
"path": "README.md",
"chars": 6393,
"preview": "Basic Statistics of Common Crawl Monthly Archives\n=================================================\n\nAnalyze the [Common"
},
{
"path": "_config.yml",
"chars": 755,
"preview": "title: Statistics of Common Crawl Monthly Archives\ndescription: Number of pages, distribution of top-level domains, craw"
},
{
"path": "_layouts/default.html",
"chars": 1920,
"preview": "<!doctype html>\n<html lang=\"{{ site.lang | default: \"en-US\" }}\">\n <head>\n <meta charset=\"utf-8\">\n <meta http-equi"
},
{
"path": "_layouts/table.html",
"chars": 4305,
"preview": "<!doctype html>\n<html lang=\"{{ site.lang | default: \"en-US\" }}\">\n <head>\n <meta charset=\"utf-8\">\n <meta http-equi"
},
{
"path": "crawlplot.py",
"chars": 19840,
"preview": "\"\"\"\nBase plotting module for Common Crawl statistics visualization.\n\nThis module provides the CrawlPlot base class which"
},
{
"path": "crawlstats.py",
"chars": 42634,
"preview": "import heapq\nimport json\nimport logging\nimport os\nimport re\n\nfrom collections import defaultdict, Counter\nfrom datetime "
},
{
"path": "get_stats.sh",
"chars": 886,
"preview": "#!/bin/bash\n\nset -o pipefail\n\nif aws s3 ls s3://commoncrawl/crawl-analysis/ | sed -E 's@.* @@; s@/$@@' >./stats/crawls.t"
},
{
"path": "get_stats_and_plot.sh",
"chars": 237,
"preview": "#!/bin/bash\nset -e\n\necho \"Starting ...\"\n\n./get_stats.sh\n\n# make sure plot directories exist\nmkdir -p plots/crawler\nmkdir"
},
{
"path": "index.md",
"chars": 1260,
"preview": "Statistics of Common Crawl Monthly Archives\n===========================================\n\nStatistics of [Common Crawl](ht"
},
{
"path": "plot/charset.py",
"chars": 980,
"preview": "import sys\n\nfrom plot.table import TabularStats\nfrom crawlstats import CST, MonthlyCrawl\n\n\nclass CharsetStats(TabularSta"
},
{
"path": "plot/crawl_size.py",
"chars": 22869,
"preview": "\"\"\"\nPlot crawl size metrics over time.\n\nThis module generates visualizations of crawl size statistics including:\n- Month"
},
{
"path": "plot/crawler_metrics.py",
"chars": 17558,
"preview": "\"\"\"\nPlot crawler performance metrics.\n\nThis module generates visualizations of crawler metrics including:\n- Fetch status"
},
{
"path": "plot/domain.py",
"chars": 2408,
"preview": "import sys\n\nimport pandas\n\nfrom crawlstats import CST, MonthlyCrawl, MultiCount\nfrom plot.table import TabularStats\n\n\ncl"
},
{
"path": "plot/histogram.py",
"chars": 6275,
"preview": "\"\"\"\nPlot histogram distributions for crawl statistics.\n\nThis module generates histogram visualizations showing distribut"
},
{
"path": "plot/language.py",
"chars": 1057,
"preview": "import string\nimport sys\n\nfrom plot.table import TabularStats\nfrom crawlstats import CST, MonthlyCrawl\n\n\nclass LanguageS"
},
{
"path": "plot/mimetype.py",
"chars": 1785,
"preview": "import re\nimport sys\n\nfrom plot.table import TabularStats\nfrom crawlstats import CST, MonthlyCrawl\n\n\nclass MimeTypeStats"
},
{
"path": "plot/mimetype_detected.py",
"chars": 1078,
"preview": "import sys\n\nfrom plot.mimetype import MimeTypeStats\nfrom crawlstats import CST, MonthlyCrawl\n\n\nclass MimeTypeDetectedSta"
},
{
"path": "plot/overlap.py",
"chars": 11841,
"preview": "\"\"\"\nPlot crawl overlap and similarity metrics.\n\nThis module generates visualizations showing the overlap between differe"
},
{
"path": "plot/table.py",
"chars": 7162,
"preview": "import heapq\n\nimport numpy\nimport pandas\n\nfrom collections import defaultdict, Counter\n\nfrom crawlplot import CrawlPlot\n"
},
{
"path": "plot/tld.py",
"chars": 11983,
"preview": "import sys\n\nfrom collections import defaultdict\n\nimport pandas\n\nfrom crawlplot import CrawlPlot\nfrom crawlstats import C"
},
{
"path": "plot/tld_by_continent.py",
"chars": 16180,
"preview": "\"\"\"\nPlot TLD distributions by continent.\n\nThis module generates visualizations showing how TLDs are distributed\nacross g"
},
{
"path": "plot.sh",
"chars": 3824,
"preview": "#!/bin/bash\n\nN_CRAWLS=$(python3 -c 'from crawlstats import MonthlyCrawl; print(len(MonthlyCrawl.by_name))')\nLATEST_CRAWL"
},
{
"path": "plots/README.md",
"chars": 512,
"preview": "Plots about Common Crawl Monthly Archives\n=========================================\n\n* [size of the crawls](crawlsize.md"
},
{
"path": "plots/charsets-top-100.html",
"chars": 5869,
"preview": "<table border=\"1\" class=\"dataframe tablesorter tablepercentage\">\n <thead>\n <tr style=\"text-align: right;\">\n <th"
},
{
"path": "plots/charsets.csv",
"chars": 166158,
"preview": "crawl,charset,pages,urls,%pages/crawl\nCC-MAIN-2008-2009,<unknown>,1798158091,1798158091,100.0000\nCC-MAIN-2009-2010,<unkn"
},
{
"path": "plots/charsets.md",
"chars": 630,
"preview": "---\nlayout: table\ntable_include: charsets-top-100.html\ntable_sortlist: \"{sortList: [[1,1]]}\"\n---\n\nCharacter Encoding of "
},
{
"path": "plots/crawlermetrics.md",
"chars": 1867,
"preview": "Crawler-Related Metrics\n=======================\n\nCrawler-related metrics are extracted from the crawler log files, cf. ["
},
{
"path": "plots/crawloverlap.md",
"chars": 933,
"preview": "Overlaps between Common Crawl Monthly Archives\n==============================================\n\nOverlaps between monthly "
},
{
"path": "plots/crawlsize/cumulative.csv",
"chars": 6579,
"preview": "crawl,digest estim.,page,url estim.\nCC-MAIN-2008-2009,1804803498,1798158091,1799114116\nCC-MAIN-2009-2010,4339999986,4661"
},
{
"path": "plots/crawlsize/domain.csv",
"chars": 6178,
"preview": "crawl,domain,host,tld,url\nCC-MAIN-2008-2009,15045431,32086112,1496,1790932667\nCC-MAIN-2009-2010,30794437,68991076,4711,2"
},
{
"path": "plots/crawlsize/monthly.csv",
"chars": 6057,
"preview": "crawl,digest estim.,page,url\nCC-MAIN-2008-2009,1804803498,1798158091,1790932667\nCC-MAIN-2009-2010,2631454016,2863495211,"
},
{
"path": "plots/crawlsize/monthly_new.csv",
"chars": 3246,
"preview": "crawl,url estim. new\nCC-MAIN-2008-2009,1799114116\nCC-MAIN-2009-2010,2025520640\nCC-MAIN-2012,2875802047\nCC-MAIN-2013-20,1"
},
{
"path": "plots/crawlsize/url_last_n_crawls.csv",
"chars": 11517,
"preview": "crawl,1,12,2,3,4,6,9\nCC-MAIN-2008-2009,1790932667,nan,nan,nan,nan,nan,nan\nCC-MAIN-2009-2010,2301135881,nan,3824634756,na"
},
{
"path": "plots/crawlsize/url_page_ratio_last_n_crawls.csv",
"chars": 15476,
"preview": "crawl,12,2,3,4,6,9\nCC-MAIN-2009-2010,,0.8204459894859851,,,,\nCC-MAIN-2012,,0.7976991186989425,0.7891972139777857,,,\nCC-M"
},
{
"path": "plots/crawlsize.md",
"chars": 3011,
"preview": "Size of Common Crawl Monthly Archives\n=====================================\n\nThe number of released pages per month fluc"
},
{
"path": "plots/domains-top-500.csv",
"chars": 23086,
"preview": "domain,pages,urls,hosts,%pages,%urls\nblogspot.com,19780455,19755134,246742,0.902419,0.906434\nwikipedia.org,3899574,38564"
},
{
"path": "plots/domains-top-500.html",
"chars": 77835,
"preview": "<table border=\"1\" class=\"dataframe tablesorter tablesearcher\">\n <thead>\n <tr style=\"text-align: right;\">\n <th>d"
},
{
"path": "plots/domains.md",
"chars": 941,
"preview": "---\nlayout: table\ntable_include: domains-top-500.html\ntable_sortlist: \"{sortList: [[1,1]]}\"\ntable_searcher: \"Filter for "
},
{
"path": "plots/languages-top-200.html",
"chars": 17247,
"preview": "<table border=\"1\" class=\"dataframe tablesorter tablepercentage iso639-3-language\">\n <thead>\n <tr style=\"text-align: "
},
{
"path": "plots/languages.csv",
"chars": 476573,
"preview": "crawl,primary_language,pages,urls,%pages/crawl\nCC-MAIN-2008-2009,<unknown>,1798158091,1798158091,100.0000\nCC-MAIN-2009-2"
},
{
"path": "plots/languages.md",
"chars": 609,
"preview": "---\nlayout: table\ntable_include: languages-top-200.html\ntable_sortlist: \"{sortList: [[1,1]]}\"\n---\n\nDistribution of Langu"
},
{
"path": "plots/mimetypes-top-100.html",
"chars": 12542,
"preview": "<table border=\"1\" class=\"dataframe tablesorter tablepercentage tablesearcher\">\n <thead>\n <tr style=\"text-align: righ"
},
{
"path": "plots/mimetypes.csv",
"chars": 666670,
"preview": "crawl,mimetype,pages,urls,%pages/crawl\nCC-MAIN-2008-2009,<other>,818049,815434,0.0455\nCC-MAIN-2008-2009,application/atom"
},
{
"path": "plots/mimetypes.md",
"chars": 730,
"preview": "---\nlayout: table\ntable_include:\n - mimetypes-top-100.html\n - mimetypes_detected-top-100.html\ntable_sortlist: \"{sortList"
},
{
"path": "plots/mimetypes_detected-top-100.html",
"chars": 11560,
"preview": "<table border=\"1\" class=\"dataframe tablesorter tablepercentage tablesearcher\">\n <thead>\n <tr style=\"text-align: righ"
},
{
"path": "plots/mimetypes_detected.csv",
"chars": 458874,
"preview": "crawl,mimetype_detected,pages,urls,%pages/crawl\nCC-MAIN-2008-2009,<unknown>,1798158091,1798158091,100.0000\nCC-MAIN-2009-"
},
{
"path": "plots/tld/by-year-and-continent.md",
"chars": 1031,
"preview": "---\nlayout: table\ntable_include:\n- tlds-by-year-and-continent.html\n- selected-tlds-by-year.html\ntable_sortlist: \"{sortLi"
},
{
"path": "plots/tld/comparison.md",
"chars": 2422,
"preview": "---\nlayout: table\ntable_include:\n - selected-crawl-comparison-spearman-frequent-tlds.html\n - selected-crawl-comparison.h"
},
{
"path": "plots/tld/groups-percentage.html",
"chars": 24025,
"preview": "<table border=\"1\" class=\"dataframe tablesorter tablepercentage\">\n <thead>\n <tr style=\"text-align: right;\">\n <th"
},
{
"path": "plots/tld/groups.md",
"chars": 2508,
"preview": "---\nlayout: table\ntable_include: groups-percentage.html\ntable_sortlist: \"{sortList: [[0,1]]}\"\n---\n\nGroups of Top-Level D"
},
{
"path": "plots/tld/latest-crawl-groups.html",
"chars": 1848,
"preview": "<table border=\"1\" class=\"dataframe tablesorter tablesearcher\">\n <thead>\n <tr style=\"text-align: right;\">\n <th><"
},
{
"path": "plots/tld/latest-crawl-tlds.html",
"chars": 18262,
"preview": "<table border=\"1\" class=\"dataframe tablesorter tablesearcher\">\n <thead>\n <tr style=\"text-align: right;\">\n <th><"
},
{
"path": "plots/tld/latestcrawl.md",
"chars": 735,
"preview": "---\nlayout: table\ntable_include:\n - latest-crawl-groups.html\n - latest-crawl-tlds.html\ntable_sortlist: \"{sortList: [[5,1"
},
{
"path": "plots/tld/percentage.md",
"chars": 455,
"preview": "---\nlayout: table\ntable_include: selected-crawls-percentage.html\ntable_sortlist: \"{sortList: [[7,1]]}\"\ntable_searcher: \""
},
{
"path": "plots/tld/selected-crawl-comparison-spearman-all-tlds.html",
"chars": 1620,
"preview": "<table border=\"1\" class=\"dataframe matrix\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>page"
},
{
"path": "plots/tld/selected-crawl-comparison-spearman-frequent-tlds.html",
"chars": 1620,
"preview": "<table border=\"1\" class=\"dataframe matrix\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>page"
},
{
"path": "plots/tld/selected-crawl-comparison.html",
"chars": 15782,
"preview": "<table border=\"1\" class=\"dataframe tablesorter tablepercentage\">\n <thead>\n <tr style=\"text-align: right;\">\n <th"
},
{
"path": "plots/tld/selected-crawls-percentage.html",
"chars": 19219,
"preview": "<table border=\"1\" class=\"dataframe tablesorter tablepercentage tablesearcher\">\n <thead>\n <tr style=\"text-align: righ"
},
{
"path": "plots/tld/selected-tlds-by-year.csv",
"chars": 5540,
"preview": "year,com,org,ru,net,de,uk,jp,edu,fr,it,pl,nl,br,cz,au,es\n(any),51.89498041088741,6.374690312897634,4.043836128047363,3.7"
},
{
"path": "plots/tld/selected-tlds-by-year.html",
"chars": 7246,
"preview": "<table border=\"1\" class=\"dataframe tablepercentage tablesorter\">\n <thead>\n <tr style=\"text-align: right;\">\n <th"
},
{
"path": "plots/tld/tlds-by-year-and-continent.csv",
"chars": 3672,
"preview": "year,(other),\"com,net\",org,edu,\"gov,mil\",North America,South America,Oceania,Africa,Asia,Europe\n2009,2.372710231294118,8"
},
{
"path": "plots/tld/tlds-by-year-and-continent.html",
"chars": 5077,
"preview": "<table border=\"1\" class=\"dataframe tablepercentage tablesorter\">\n <thead>\n <tr style=\"text-align: right;\">\n <th"
},
{
"path": "plots/tlds.md",
"chars": 1245,
"preview": "Top-Level Domains\n=================\n\n[Top-level domains](https://en.wikipedia.org/wiki/Top-level_domain) (abbrev. \"TLD\"/"
},
{
"path": "requirements.txt",
"chars": 118,
"preview": "hyperloglog==0.0.14\nisoweek==1.3.3\nmrjob==0.7.4\ntldextract==5.1.2\nujson==5.12.0\n\n# tests\npytest\njsonpickle\nsetuptools\n"
},
{
"path": "requirements_plot.txt",
"chars": 119,
"preview": "ggplot==0.11.5\nidna==3.7\n#pandas==2.1.4+dfsg\npandas==2.1.4\npygraphviz==1.13\nrpy2==3.5.15\n\nmatplotlib==3.10.7\nfsspec[s3]"
},
{
"path": "run_stats_hadoop.sh",
"chars": 2516,
"preview": "#!/bin/bash\n\nCRAWL=\"$1\"\n\nif [ -z \"$CRAWL\" ]; then\n echo \"Usage: $0 <CRAWL-YEAR-WEEK>\"\n echo \" Argument indicating"
},
{
"path": "setup.py",
"chars": 107,
"preview": "from setuptools import setup\n\n\nsetup(\n setup_requires=['pytest-runner'],\n tests_require=['pytest'],\n)"
},
{
"path": "site.Dockerfile",
"chars": 965,
"preview": "# See\n# https://docs.github.com/en/pages/setting-up-a-github-pages-site-with-jekyll\n# https://github.com/BillRaymo"
},
{
"path": "stats/crawler/CC-MAIN-2016-18.json",
"chars": 1351,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-18\"]\t7706034375\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
},
{
"path": "stats/crawler/CC-MAIN-2016-22.json",
"chars": 1354,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-22\"]\t8087074988\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
},
{
"path": "stats/crawler/CC-MAIN-2016-26.json",
"chars": 2285,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-26\"]\t7180806080\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
},
{
"path": "stats/crawler/CC-MAIN-2016-30.json",
"chars": 2362,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-30\"]\t7180518032\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
},
{
"path": "stats/crawler/CC-MAIN-2016-36.json",
"chars": 2446,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-36\"]\t7218846495\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
},
{
"path": "stats/crawler/CC-MAIN-2016-40.json",
"chars": 1885,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-40\"]\t6936860504\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
},
{
"path": "stats/crawler/CC-MAIN-2016-44.json",
"chars": 2448,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-44\"]\t9290101260\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
},
{
"path": "stats/crawler/CC-MAIN-2016-50.json",
"chars": 2445,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2016-50\"]\t9779597110\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
},
{
"path": "stats/crawler/CC-MAIN-2017-04.json",
"chars": 2446,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-04\"]\t10058030146\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2017-09.json",
"chars": 2450,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-09\"]\t10309950142\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2017-13.json",
"chars": 2452,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-13\"]\t11054073116\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2017-17.json",
"chars": 2447,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-17\"]\t11614646341\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2017-22.json",
"chars": 2388,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-22\"]\t12106403981\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2017-26.json",
"chars": 2450,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-26\"]\t12986411417\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2017-30.json",
"chars": 2455,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-30\"]\t13566579384\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2017-34.json",
"chars": 2580,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-34\"]\t14698581608\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2017-39.json",
"chars": 2580,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-39\"]\t14981165656\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2017-43.json",
"chars": 2583,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-43\"]\t15959590811\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2017-47.json",
"chars": 2581,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-47\"]\t16756526195\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2017-51.json",
"chars": 2577,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2017-51\"]\t17390543871\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2018-05.json",
"chars": 2584,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-05\"]\t14985445343\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2018-09.json",
"chars": 2584,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-09\"]\t15702763091\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2018-13.json",
"chars": 2583,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-13\"]\t13830972570\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2018-17.json",
"chars": 2585,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-17\"]\t14071100780\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2018-22.json",
"chars": 2582,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-22\"]\t14293230909\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2018-26.json",
"chars": 2516,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-26\"]\t14114634413\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2018-30.json",
"chars": 2581,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-30\"]\t12336785966\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2018-34.json",
"chars": 2578,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-34\"]\t11304753488\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2018-39.json",
"chars": 1865,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-39\"]\t10322390017\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2018-43.json",
"chars": 2577,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-43\"]\t10068215197\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2018-47.json",
"chars": 2578,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-47\"]\t10372257458\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2018-51.json",
"chars": 2576,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2018-51\"]\t10696257053\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2019-04.json",
"chars": 2574,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-04\"]\t11425571649\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2019-09.json",
"chars": 2574,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-09\"]\t11395325655\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2019-13.json",
"chars": 2575,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-13\"]\t11250692948\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2019-18.json",
"chars": 2571,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-18\"]\t11385935631\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2019-22.json",
"chars": 2577,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-22\"]\t12201452792\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2019-26.json",
"chars": 2572,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-26\"]\t13037540318\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2019-30.json",
"chars": 2572,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-30\"]\t13857387307\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2019-35.json",
"chars": 2577,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-35\"]\t15024860128\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2019-39.json",
"chars": 2574,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-39\"]\t15879390106\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2019-43.json",
"chars": 2575,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-43\"]\t17009132379\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2019-47.json",
"chars": 2576,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-47\"]\t18304035037\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2019-51.json",
"chars": 2660,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2019-51\"]\t19131514303\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2020-05.json",
"chars": 2659,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-05\"]\t20085156050\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2020-10.json",
"chars": 2662,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-10\"]\t19918325266\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2020-16.json",
"chars": 2815,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-16\"]\t20380741760\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2020-24.json",
"chars": 2819,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-24\"]\t21097779271\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2020-29.json",
"chars": 2818,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-29\"]\t20528820728\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2020-34.json",
"chars": 2819,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-34\"]\t20403113647\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2020-40.json",
"chars": 2819,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-40\"]\t21173047327\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2020-45.json",
"chars": 2814,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-45\"]\t21531016513\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2020-50.json",
"chars": 2815,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2020-50\"]\t22014556091\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2021-04.json",
"chars": 2817,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-04\"]\t22704928999\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2021-10.json",
"chars": 2816,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-10\"]\t22937361361\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2021-17.json",
"chars": 2815,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-17\"]\t21968054310\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2021-21.json",
"chars": 2819,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-21\"]\t23713955373\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2021-25.json",
"chars": 2104,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-25\"]\t3080558881\n[\"crawl_status\", \"generator:fetch_list\", \"CC-MA"
},
{
"path": "stats/crawler/CC-MAIN-2021-31.json",
"chars": 2816,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-31\"]\t26047830277\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2021-39.json",
"chars": 2819,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-39\"]\t26294713963\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2021-43.json",
"chars": 2817,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-43\"]\t25464021170\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2021-49.json",
"chars": 2815,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2021-49\"]\t26305572235\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2022-05.json",
"chars": 2813,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2022-05\"]\t26716610822\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2022-21.json",
"chars": 2817,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2022-21\"]\t27142123347\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2022-27.json",
"chars": 2895,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2022-27\"]\t23976620914\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2022-33.json",
"chars": 2897,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2022-33\"]\t24023318047\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2022-40.json",
"chars": 2899,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2022-40\"]\t25121487097\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2022-49.json",
"chars": 2897,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2022-49\"]\t25536843285\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2023-06.json",
"chars": 2897,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2023-06\"]\t24063833984\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2023-14.json",
"chars": 2899,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2023-14\"]\t23193639650\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2023-23.json",
"chars": 2900,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2023-23\"]\t23471797837\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2023-40.json",
"chars": 2897,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2023-40\"]\t24130374695\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2023-50.json",
"chars": 2898,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2023-50\"]\t20650993966\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2024-10.json",
"chars": 2901,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-10\"]\t20599819450\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2024-18.json",
"chars": 2897,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-18\"]\t20216718958\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2024-22.json",
"chars": 2897,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-22\"]\t19759133480\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2024-26.json",
"chars": 3025,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-26\"]\t20635672053\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2024-30.json",
"chars": 2960,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-30\"]\t22484038777\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2024-33.json",
"chars": 2956,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-33\"]\t21902542885\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2024-38.json",
"chars": 2959,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-38\"]\t23533999592\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2024-42.json",
"chars": 2955,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-42\"]\t24237973275\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2024-46.json",
"chars": 3023,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-46\"]\t25404964414\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2024-51.json",
"chars": 2959,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2024-51\"]\t25915332801\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2025-05.json",
"chars": 2961,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-05\"]\t26857631748\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2025-08.json",
"chars": 3022,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-08\"]\t26387970898\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2025-13.json",
"chars": 3024,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-13\"]\t27837493836\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2025-18.json",
"chars": 3023,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-18\"]\t27638605242\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2025-21.json",
"chars": 3023,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-21\"]\t27797066363\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2025-26.json",
"chars": 2960,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-26\"]\t28167640016\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2025-30.json",
"chars": 2960,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-30\"]\t27783766888\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2025-33.json",
"chars": 3023,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-33\"]\t28259545744\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2025-38.json",
"chars": 3023,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-38\"]\t27837059074\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2025-43.json",
"chars": 3023,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-43\"]\t26899650260\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2025-47.json",
"chars": 2962,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-47\"]\t27080086217\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2025-51.json",
"chars": 2958,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2025-51\"]\t26268339705\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2026-04.json",
"chars": 3023,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2026-04\"]\t25612258765\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2026-08.json",
"chars": 3023,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2026-08\"]\t24938307108\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2026-12.json",
"chars": 3314,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2026-12\"]\t25055505905\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/CC-MAIN-2026-17.json",
"chars": 3322,
"preview": "[\"crawl_status\", \"generator:crawldb_size\", \"CC-MAIN-2026-17\"]\t24642277769\n[\"crawl_status\", \"generator:fetch_list\", \"CC-M"
},
{
"path": "stats/crawler/README.md",
"chars": 424,
"preview": "Crawler-Related Metrics\n=======================\n\nJSON files in this folder contain metrics\n- written by the crawler ([Ap"
},
{
"path": "stats/tld_alexa_top_1m.py",
"chars": 10884,
"preview": "# derived from\n# http://s3.amazonaws.com/alexa-static/top-1m.csv.zip\n# fetched 2019-02-06, see also\n# https://suppor"
},
{
"path": "stats/tld_cisco_umbrella_top_1m.py",
"chars": 9779,
"preview": "# derived from\n# http://s3-us-west-1.amazonaws.com/umbrella-static/top-1m.csv.zip\n# fetched 2019-02-06, see also\n# h"
},
{
"path": "stats/tld_majestic_top_1m.py",
"chars": 11217,
"preview": "# derived from\n# http://downloads.majestic.com/majestic_million.csv\n# fetched 2019-02-06\n#\n# see also\n#\n# https://ma"
},
{
"path": "stats.Dockerfile",
"chars": 1158,
"preview": "# Replicating pjox/cc-crawl-statistics\nFROM python:3.12\n\n# Install system dependencies\nRUN apt-get update && apt-get ins"
},
{
"path": "tests/test_crawlstat.py",
"chars": 3121,
"preview": "import json\nimport sys\n\nimport ujson\nimport jsonpickle\n\nfrom crawlstats import MonthlyCrawl, MonthlyCrawlSet\nfrom crawls"
},
{
"path": "top_level_domain.py",
"chars": 81321,
"preview": "import fileinput\nimport idna\nimport re\n\n\nclass TopLevelDomain:\n \"\"\"Classify top-level domains (TLDs) to provide the f"
}
]