Repository: janreges/siteone-crawler
Branch: main
Commit: 63203298f00e
Files: 132
Total size: 2.0 MB
Directory structure:
gitextract_j_gpbi69/
├── .githooks/
│   └── pre-commit
├── .github/
│   └── workflows/
│       ├── ci.yml
│       ├── publish.yml
│       └── release.yml
├── .gitignore
├── CHANGELOG.md
├── CLAUDE.md
├── Cargo.toml
├── LICENSE
├── README.md
├── docs/
│   ├── JSON-OUTPUT.md
│   ├── OUTPUT-crawler.siteone.io.json
│   ├── OUTPUT-crawler.siteone.io.txt
│   └── TEXT-OUTPUT.md
├── rustfmt.toml
├── src/
│   ├── analysis/
│   │   ├── accessibility_analyzer.rs
│   │   ├── analyzer.rs
│   │   ├── base_analyzer.rs
│   │   ├── best_practice_analyzer.rs
│   │   ├── caching_analyzer.rs
│   │   ├── content_type_analyzer.rs
│   │   ├── dns_analyzer.rs
│   │   ├── external_links_analyzer.rs
│   │   ├── fastest_analyzer.rs
│   │   ├── headers_analyzer.rs
│   │   ├── manager.rs
│   │   ├── mod.rs
│   │   ├── page404_analyzer.rs
│   │   ├── redirects_analyzer.rs
│   │   ├── result/
│   │   │   ├── analyzer_stats.rs
│   │   │   ├── dns_analysis_result.rs
│   │   │   ├── header_stats.rs
│   │   │   ├── heading_tree_item.rs
│   │   │   ├── mod.rs
│   │   │   ├── security_checked_header.rs
│   │   │   ├── security_result.rs
│   │   │   ├── seo_opengraph_result.rs
│   │   │   └── url_analysis_result.rs
│   │   ├── security_analyzer.rs
│   │   ├── seo_opengraph_analyzer.rs
│   │   ├── skipped_urls_analyzer.rs
│   │   ├── slowest_analyzer.rs
│   │   ├── source_domains_analyzer.rs
│   │   └── ssl_tls_analyzer.rs
│   ├── components/
│   │   ├── mod.rs
│   │   ├── summary/
│   │   │   ├── item.rs
│   │   │   ├── item_status.rs
│   │   │   ├── mod.rs
│   │   │   └── summary.rs
│   │   ├── super_table.rs
│   │   └── super_table_column.rs
│   ├── content_processor/
│   │   ├── astro_processor.rs
│   │   ├── base_processor.rs
│   │   ├── content_processor.rs
│   │   ├── css_processor.rs
│   │   ├── html_processor.rs
│   │   ├── javascript_processor.rs
│   │   ├── manager.rs
│   │   ├── mod.rs
│   │   ├── nextjs_processor.rs
│   │   ├── svelte_processor.rs
│   │   └── xml_processor.rs
│   ├── debugger.rs
│   ├── engine/
│   │   ├── crawler.rs
│   │   ├── found_url.rs
│   │   ├── found_urls.rs
│   │   ├── http_client.rs
│   │   ├── http_response.rs
│   │   ├── initiator.rs
│   │   ├── manager.rs
│   │   ├── mod.rs
│   │   ├── parsed_url.rs
│   │   └── robots_txt.rs
│   ├── error.rs
│   ├── export/
│   │   ├── base_exporter.rs
│   │   ├── exporter.rs
│   │   ├── file_exporter.rs
│   │   ├── html_report/
│   │   │   ├── badge.rs
│   │   │   ├── mod.rs
│   │   │   ├── report.rs
│   │   │   ├── tab.rs
│   │   │   └── template.html
│   │   ├── mailer_exporter.rs
│   │   ├── markdown_exporter.rs
│   │   ├── mod.rs
│   │   ├── offline_website_exporter.rs
│   │   ├── sitemap_exporter.rs
│   │   ├── upload_exporter.rs
│   │   └── utils/
│   │       ├── html_to_markdown.rs
│   │       ├── markdown_site_aggregator.rs
│   │       ├── mod.rs
│   │       ├── offline_url_converter.rs
│   │       └── target_domain_relation.rs
│   ├── extra_column.rs
│   ├── info.rs
│   ├── lib.rs
│   ├── main.rs
│   ├── options/
│   │   ├── core_options.rs
│   │   ├── group.rs
│   │   ├── mod.rs
│   │   ├── option.rs
│   │   ├── option_type.rs
│   │   └── options.rs
│   ├── output/
│   │   ├── json_output.rs
│   │   ├── mod.rs
│   │   ├── multi_output.rs
│   │   ├── output.rs
│   │   ├── output_type.rs
│   │   └── text_output.rs
│   ├── result/
│   │   ├── basic_stats.rs
│   │   ├── manager_stats.rs
│   │   ├── mod.rs
│   │   ├── status.rs
│   │   ├── storage/
│   │   │   ├── file_storage.rs
│   │   │   ├── memory_storage.rs
│   │   │   ├── mod.rs
│   │   │   ├── storage.rs
│   │   │   └── storage_type.rs
│   │   └── visited_url.rs
│   ├── scoring/
│   │   ├── ci_gate.rs
│   │   ├── mod.rs
│   │   ├── quality_score.rs
│   │   └── scorer.rs
│   ├── server.rs
│   ├── types.rs
│   ├── utils.rs
│   ├── version.rs
│   └── wizard/
│       ├── form.rs
│       ├── mod.rs
│       └── presets.rs
└── tests/
    ├── common/
    │   └── mod.rs
    └── integration_crawl.rs
================================================
FILE CONTENTS
================================================
================================================
FILE: .githooks/pre-commit
================================================
#!/bin/bash
# Pre-commit hook: run cargo fmt, clippy, and tests before committing.
set -e
echo "=== Pre-commit: cargo fmt --check ==="
cargo fmt -- --check
echo "=== Pre-commit: cargo clippy ==="
cargo clippy -- -D warnings
echo "=== Pre-commit: cargo test ==="
cargo test
echo "=== Pre-commit checks passed ==="
================================================
FILE: .github/workflows/ci.yml
================================================
name: CI
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
env:
  CARGO_TERM_COLOR: always
jobs:
  check:
    name: Check & Lint
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: dtolnay/rust-toolchain@stable
        with:
          components: rustfmt, clippy
      - uses: actions/cache@v5
        with:
          path: |
            ~/.cargo/registry
            ~/.cargo/git
            target
          key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
          restore-keys: |
            ${{ runner.os }}-cargo-
      - name: Check formatting
        run: cargo fmt -- --check
      - name: Clippy
        run: cargo clippy -- -D warnings
      - name: Build
        run: cargo build
      - name: Run tests
        run: cargo test
================================================
FILE: .github/workflows/publish.yml
================================================
name: Publish to package managers
# Triggers when a draft release is published (manually via GitHub UI)
on:
  release:
    types: [published]
  workflow_dispatch:
    inputs:
      tag:
        description: 'Release tag (e.g. v2.0.1)'
        required: true
permissions:
  contents: write
jobs:
  # ─────────────────────────────────────────────────────────────────
  # Publish to crates.io
  # ─────────────────────────────────────────────────────────────────
  publish-crates:
    name: Publish to crates.io
    runs-on: ubuntu-latest
    if: vars.PUBLISH_CRATES == 'true'
    steps:
      - name: Checkout
        uses: actions/checkout@v6
        with:
          ref: ${{ github.event.release.tag_name || inputs.tag }}
      - name: Determine version
        id: version
        run: |
          TAG="${{ github.event.release.tag_name || inputs.tag }}"
          echo "version=${TAG#v}" >> "$GITHUB_OUTPUT"
      - name: Ensure Cargo.toml has correct version
        env:
          VERSION: ${{ steps.version.outputs.version }}
        run: sed -i "s/^version = .*/version = \"${VERSION}\"/" Cargo.toml
      - name: Install Rust toolchain
        uses: dtolnay/rust-toolchain@stable
      - name: Publish
        env:
          CARGO_REGISTRY_TOKEN: ${{ secrets.CARGO_REGISTRY_TOKEN }}
        run: cargo publish --no-verify --allow-dirty || echo "Already published (skipping)"
  # ─────────────────────────────────────────────────────────────────
  # Update Homebrew tap
  # ─────────────────────────────────────────────────────────────────
  publish-homebrew:
    name: Update Homebrew formula
    runs-on: ubuntu-latest
    if: vars.PUBLISH_HOMEBREW == 'true'
    steps:
      - name: Determine version
        id: version
        run: |
          TAG="${{ github.event.release.tag_name || inputs.tag }}"
          echo "version=${TAG#v}" >> "$GITHUB_OUTPUT"
      - name: Download release archives and compute SHA256
        env:
          VERSION: ${{ steps.version.outputs.version }}
        run: |
          BASE_URL="https://github.com/${{ github.repository }}/releases/download/v${VERSION}"
          for SUFFIX in linux-x64 linux-arm64 macos-x64 macos-arm64; do
            FILE="siteone-crawler-v${VERSION}-${SUFFIX}.tar.gz"
            curl -sfL "${BASE_URL}/${FILE}" -o "${FILE}"
            SHA=$(sha256sum "${FILE}" | cut -d' ' -f1)
            VAR_NAME="SHA_$(echo "${SUFFIX}" | tr '[:lower:]-' '[:upper:]_')"
            echo "${VAR_NAME}=${SHA}" >> "$GITHUB_ENV"
            echo "${VAR_NAME}=${SHA}"
          done
      - name: Clone Homebrew tap
        env:
          TAP_TOKEN: ${{ secrets.HOMEBREW_TAP_TOKEN }}
        run: |
          git clone "https://x-access-token:${TAP_TOKEN}@github.com/janreges/homebrew-tap.git" tap
      - name: Update formula
        env:
          VERSION: ${{ steps.version.outputs.version }}
        run: |
          cat > tap/Formula/siteone-crawler.rb <<'FORMULA'
          class SiteoneCrawler < Formula
            desc "Website crawler and QA toolkit in Rust for security, performance, SEO, and accessibility audits, offline cloning, markdown export, sitemap generation, cache warming, and CI/CD gating — one dependency-free binary for all major platforms, 10 tools in one."
            homepage "https://crawler.siteone.io/"
            version "VERSION_PLACEHOLDER"
            license "MIT"
            on_macos do
              if Hardware::CPU.arm?
                url "https://github.com/janreges/siteone-crawler/releases/download/v#{version}/siteone-crawler-v#{version}-macos-arm64.tar.gz"
                sha256 "SHA_MACOS_ARM64_PLACEHOLDER"
              else
                url "https://github.com/janreges/siteone-crawler/releases/download/v#{version}/siteone-crawler-v#{version}-macos-x64.tar.gz"
                sha256 "SHA_MACOS_X64_PLACEHOLDER"
              end
            end
            on_linux do
              if Hardware::CPU.arm?
                url "https://github.com/janreges/siteone-crawler/releases/download/v#{version}/siteone-crawler-v#{version}-linux-arm64.tar.gz"
                sha256 "SHA_LINUX_ARM64_PLACEHOLDER"
              else
                url "https://github.com/janreges/siteone-crawler/releases/download/v#{version}/siteone-crawler-v#{version}-linux-x64.tar.gz"
                sha256 "SHA_LINUX_X64_PLACEHOLDER"
              end
            end
            def install
              bin.install "siteone-crawler"
            end
            test do
              assert_match "SiteOne Crawler", shell_output("#{bin}/siteone-crawler --version")
            end
          end
          FORMULA
          sed -i "s/VERSION_PLACEHOLDER/${VERSION}/g" tap/Formula/siteone-crawler.rb
          sed -i "s/SHA_MACOS_ARM64_PLACEHOLDER/${SHA_MACOS_ARM64}/g" tap/Formula/siteone-crawler.rb
          sed -i "s/SHA_MACOS_X64_PLACEHOLDER/${SHA_MACOS_X64}/g" tap/Formula/siteone-crawler.rb
          sed -i "s/SHA_LINUX_ARM64_PLACEHOLDER/${SHA_LINUX_ARM64}/g" tap/Formula/siteone-crawler.rb
          sed -i "s/SHA_LINUX_X64_PLACEHOLDER/${SHA_LINUX_X64}/g" tap/Formula/siteone-crawler.rb
      - name: Push updated formula
        run: |
          cd tap
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add Formula/siteone-crawler.rb
          git diff --cached --quiet && echo "Formula already up to date" && exit 0
          git commit -m "chore: update siteone-crawler to v${{ steps.version.outputs.version }}"
          git push
  # ─────────────────────────────────────────────────────────────────
  # Update Scoop bucket
  # ─────────────────────────────────────────────────────────────────
  publish-scoop:
    name: Update Scoop manifest
    runs-on: ubuntu-latest
    if: vars.PUBLISH_SCOOP == 'true'
    steps:
      - name: Determine version
        id: version
        run: |
          TAG="${{ github.event.release.tag_name || inputs.tag }}"
          echo "version=${TAG#v}" >> "$GITHUB_OUTPUT"
      - name: Download Windows archives and compute SHA256
        env:
          VERSION: ${{ steps.version.outputs.version }}
        run: |
          BASE_URL="https://github.com/${{ github.repository }}/releases/download/v${VERSION}"
          for SUFFIX in win-x64 win-arm64; do
            FILE="siteone-crawler-v${VERSION}-${SUFFIX}.zip"
            curl -sfL "${BASE_URL}/${FILE}" -o "${FILE}"
            SHA=$(sha256sum "${FILE}" | cut -d' ' -f1)
            VAR_NAME="SHA_$(echo "${SUFFIX}" | tr '[:lower:]-' '[:upper:]_')"
            echo "${VAR_NAME}=${SHA}" >> "$GITHUB_ENV"
          done
      - name: Clone Scoop bucket
        env:
          BUCKET_TOKEN: ${{ secrets.SCOOP_BUCKET_TOKEN }}
        run: |
          git clone "https://x-access-token:${BUCKET_TOKEN}@github.com/janreges/scoop-siteone.git" bucket
      - name: Update manifest
        env:
          VERSION: ${{ steps.version.outputs.version }}
        run: |
          mkdir -p bucket/bucket
          cat > bucket/bucket/siteone-crawler.json << 'TEMPLATE'
          {
            "version": "VERSION_PLACEHOLDER",
            "description": "Website crawler and QA toolkit in Rust for security, performance, SEO, and accessibility audits, offline cloning, markdown export, sitemap generation, cache warming, and CI/CD gating — one dependency-free binary for all major platforms, 10 tools in one.",
            "homepage": "https://crawler.siteone.io/",
            "license": "MIT",
            "architecture": {
              "64bit": {
                "url": "https://github.com/janreges/siteone-crawler/releases/download/vVERSION_PLACEHOLDER/siteone-crawler-vVERSION_PLACEHOLDER-win-x64.zip",
                "hash": "HASH_X64_PLACEHOLDER"
              },
              "arm64": {
                "url": "https://github.com/janreges/siteone-crawler/releases/download/vVERSION_PLACEHOLDER/siteone-crawler-vVERSION_PLACEHOLDER-win-arm64.zip",
                "hash": "HASH_ARM64_PLACEHOLDER"
              }
            },
            "extract_dir": "siteone-crawler",
            "bin": "siteone-crawler.exe",
            "checkver": "github",
            "autoupdate": {
              "architecture": {
                "64bit": {
                  "url": "https://github.com/janreges/siteone-crawler/releases/download/v$version/siteone-crawler-v$version-win-x64.zip"
                },
                "arm64": {
                  "url": "https://github.com/janreges/siteone-crawler/releases/download/v$version/siteone-crawler-v$version-win-arm64.zip"
                }
              }
            }
          }
          TEMPLATE
          sed -i "s/VERSION_PLACEHOLDER/${VERSION}/g" bucket/bucket/siteone-crawler.json
          sed -i "s/HASH_X64_PLACEHOLDER/${SHA_WIN_X64}/g" bucket/bucket/siteone-crawler.json
          sed -i "s/HASH_ARM64_PLACEHOLDER/${SHA_WIN_ARM64}/g" bucket/bucket/siteone-crawler.json
      - name: Push updated manifest
        run: |
          cd bucket
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add bucket/siteone-crawler.json
          git commit -m "chore: update siteone-crawler to v${{ steps.version.outputs.version }}"
          git push
  # ─────────────────────────────────────────────────────────────────
  # Submit to WinGet
  # ─────────────────────────────────────────────────────────────────
  publish-winget:
    name: Submit to WinGet
    runs-on: windows-latest
    # Requires initial manual submission to microsoft/winget-pkgs first.
    # Once JanReges.SiteOneCrawler exists in winget-pkgs, set PUBLISH_WINGET=true.
    if: vars.PUBLISH_WINGET == 'true'
    steps:
      - name: Determine version
        id: version
        shell: bash
        run: |
          TAG="${{ github.event.release.tag_name || inputs.tag }}"
          echo "version=${TAG#v}" >> "$GITHUB_OUTPUT"
      - name: Install wingetcreate
        run: winget install Microsoft.WingetCreate --accept-source-agreements --accept-package-agreements
      - name: Update WinGet manifest
        env:
          VERSION: ${{ steps.version.outputs.version }}
          WINGET_TOKEN: ${{ secrets.WINGET_TOKEN }}
        run: |
          $url_x64 = "https://github.com/janreges/siteone-crawler/releases/download/v$env:VERSION/siteone-crawler-v$env:VERSION-win-x64.zip"
          $url_arm64 = "https://github.com/janreges/siteone-crawler/releases/download/v$env:VERSION/siteone-crawler-v$env:VERSION-win-arm64.zip"
          wingetcreate update JanReges.SiteOneCrawler `
            --version $env:VERSION `
            --urls $url_x64 $url_arm64 `
            --token $env:WINGET_TOKEN `
            --submit
  # ─────────────────────────────────────────────────────────────────
  # Update AUR package
  # ─────────────────────────────────────────────────────────────────
  publish-aur:
    name: Update AUR package
    runs-on: ubuntu-latest
    if: vars.PUBLISH_AUR == 'true'
    steps:
      - name: Determine version
        id: version
        run: |
          TAG="${{ github.event.release.tag_name || inputs.tag }}"
          echo "version=${TAG#v}" >> "$GITHUB_OUTPUT"
      - name: Compute SHA256 for Linux archives
        env:
          VERSION: ${{ steps.version.outputs.version }}
        run: |
          BASE_URL="https://github.com/${{ github.repository }}/releases/download/v${VERSION}"
          for SUFFIX in linux-x64 linux-arm64; do
            FILE="siteone-crawler-v${VERSION}-${SUFFIX}.tar.gz"
            curl -sfL "${BASE_URL}/${FILE}" -o "${FILE}"
            SHA=$(sha256sum "${FILE}" | cut -d' ' -f1)
            VAR_NAME="SHA_$(echo "${SUFFIX}" | tr '[:lower:]-' '[:upper:]_')"
            echo "${VAR_NAME}=${SHA}" >> "$GITHUB_ENV"
          done
      - name: Setup SSH for AUR
        env:
          AUR_SSH_KEY: ${{ secrets.AUR_SSH_KEY }}
        run: |
          mkdir -p ~/.ssh
          echo "$AUR_SSH_KEY" > ~/.ssh/aur
          chmod 600 ~/.ssh/aur
          echo "Host aur.archlinux.org" >> ~/.ssh/config
          echo "  IdentityFile ~/.ssh/aur" >> ~/.ssh/config
          echo "  User aur" >> ~/.ssh/config
          ssh-keyscan aur.archlinux.org >> ~/.ssh/known_hosts
      - name: Clone AUR repo and update PKGBUILD
        env:
          VERSION: ${{ steps.version.outputs.version }}
        run: |
          git clone ssh://aur@aur.archlinux.org/siteone-crawler-bin.git aur
          cd aur
          cat > PKGBUILD << PKGBUILD
          # Maintainer: Jan Reges <jan.reges@siteone.cz>
          pkgname=siteone-crawler-bin
          pkgver=${VERSION}
          pkgrel=1
          pkgdesc="Website crawler and QA toolkit in Rust for security, performance, SEO, and accessibility audits, offline cloning, markdown export, sitemap generation, cache warming, and CI/CD gating — one dependency-free binary for all major platforms, 10 tools in one."
          arch=('x86_64' 'aarch64')
          url="https://crawler.siteone.io/"
          license=('MIT')
          provides=('siteone-crawler')
          conflicts=('siteone-crawler')
          source_x86_64=("https://github.com/janreges/siteone-crawler/releases/download/v\${pkgver}/siteone-crawler-v\${pkgver}-linux-x64.tar.gz")
          source_aarch64=("https://github.com/janreges/siteone-crawler/releases/download/v\${pkgver}/siteone-crawler-v\${pkgver}-linux-arm64.tar.gz")
          sha256sums_x86_64=('${SHA_LINUX_X64}')
          sha256sums_aarch64=('${SHA_LINUX_ARM64}')
          package() {
            install -Dm755 "\${srcdir}/siteone-crawler/siteone-crawler" "\${pkgdir}/usr/bin/siteone-crawler"
            install -Dm644 "\${srcdir}/siteone-crawler/LICENSE" "\${pkgdir}/usr/share/licenses/\${pkgname}/LICENSE"
          }
          PKGBUILD
          cat > .SRCINFO << SRCINFO
          pkgbase = siteone-crawler-bin
          pkgdesc = Website crawler and QA toolkit in Rust for security, performance, SEO, and accessibility audits, offline cloning, markdown export, sitemap generation, cache warming, and CI/CD gating — one dependency-free binary for all major platforms, 10 tools in one.
          pkgver = ${VERSION}
          pkgrel = 1
          url = https://crawler.siteone.io/
          arch = x86_64
          arch = aarch64
          license = MIT
          provides = siteone-crawler
          conflicts = siteone-crawler
          source_x86_64 = https://github.com/janreges/siteone-crawler/releases/download/v${VERSION}/siteone-crawler-v${VERSION}-linux-x64.tar.gz
          sha256sums_x86_64 = ${SHA_LINUX_X64}
          source_aarch64 = https://github.com/janreges/siteone-crawler/releases/download/v${VERSION}/siteone-crawler-v${VERSION}-linux-arm64.tar.gz
          sha256sums_aarch64 = ${SHA_LINUX_ARM64}
          pkgname = siteone-crawler-bin
          SRCINFO
          git config user.name "Jan Reges"
          git config user.email "jan.reges@siteone.cz"
          git add PKGBUILD .SRCINFO
          git commit -m "chore: update siteone-crawler to v${VERSION}"
          git push
  # ─────────────────────────────────────────────────────────────────
  # Publish .deb and .rpm to Cloudsmith (APT + DNF repository)
  # ─────────────────────────────────────────────────────────────────
  publish-cloudsmith:
    name: Publish to Cloudsmith
    runs-on: ubuntu-latest
    if: vars.PUBLISH_CLOUDSMITH == 'true'
    steps:
      - name: Determine version
        id: version
        run: |
          TAG="${{ github.event.release.tag_name || inputs.tag }}"
          echo "version=${TAG#v}" >> "$GITHUB_OUTPUT"
      - name: Download .deb, .rpm and .apk from release
        env:
          GH_TOKEN: ${{ github.token }}
          VERSION: ${{ steps.version.outputs.version }}
        run: |
          mkdir -p packages
          BASE_URL="https://github.com/${{ github.repository }}/releases/download/v${VERSION}"
          # Download all .deb, .rpm and .apk assets from the release
          for file in $(gh release view "v${VERSION}" --repo "${{ github.repository }}" --json assets -q '.assets[].name' | grep -E '\.(deb|rpm|apk)$'); do
            echo "Downloading ${file} ..."
            curl -sfL "${BASE_URL}/${file}" -o "packages/${file}"
          done
      - name: List packages
        run: ls -lhR packages/
      - name: Install Cloudsmith CLI
        run: pip install cloudsmith-cli
      - name: Upload .deb packages
        env:
          CLOUDSMITH_API_KEY: ${{ secrets.CLOUDSMITH_API_KEY }}
        run: |
          for deb in packages/*.deb; do
            [ -f "$deb" ] || continue
            echo "Uploading $deb ..."
            cloudsmith push deb janreges/siteone-crawler/any-distro/any-version "$deb" --republish
          done
      - name: Upload .rpm packages
        env:
          CLOUDSMITH_API_KEY: ${{ secrets.CLOUDSMITH_API_KEY }}
        run: |
          for rpm in packages/*.rpm; do
            [ -f "$rpm" ] || continue
            echo "Uploading $rpm ..."
            cloudsmith push rpm janreges/siteone-crawler/any-distro/any-version "$rpm" --republish
          done
      - name: Upload .apk packages
        env:
          CLOUDSMITH_API_KEY: ${{ secrets.CLOUDSMITH_API_KEY }}
        run: |
          for apk in packages/*.apk; do
            [ -f "$apk" ] || continue
            echo "Uploading $apk ..."
            cloudsmith push alpine janreges/siteone-crawler/alpine/any-version "$apk" --republish
          done
================================================
FILE: .github/workflows/release.yml
================================================
name: Release
# Trigger: push a tag like v1.0.10
on:
push:
tags:
- 'v*'
# Manual trigger for building artifacts only (no release created)
workflow_dispatch:
inputs:
version:
description: 'Version number (e.g. 1.0.10)'
required: true
permissions:
contents: write
env:
CARGO_TERM_COLOR: always
jobs:
build:
name: Build ${{ matrix.artifact_suffix }}
runs-on: ${{ matrix.os }}
strategy:
fail-fast: false
matrix:
include:
- target: x86_64-unknown-linux-gnu
os: ubuntu-latest
artifact_suffix: linux-x64
archive: tar.gz
- target: aarch64-unknown-linux-gnu
os: ubuntu-latest
artifact_suffix: linux-arm64
archive: tar.gz
cross: true
- target: x86_64-apple-darwin
os: macos-latest
artifact_suffix: macos-x64
archive: tar.gz
- target: aarch64-apple-darwin
os: macos-latest
artifact_suffix: macos-arm64
archive: tar.gz
- target: x86_64-pc-windows-msvc
os: windows-latest
artifact_suffix: win-x64
archive: zip
- target: aarch64-pc-windows-msvc
os: windows-latest
artifact_suffix: win-arm64
archive: zip
- target: x86_64-unknown-linux-musl
os: ubuntu-latest
artifact_suffix: linux-musl-x64
archive: tar.gz
musl: true
- target: aarch64-unknown-linux-musl
os: ubuntu-latest
artifact_suffix: linux-musl-arm64
archive: tar.gz
cross: true
musl: true
steps:
- name: Checkout
uses: actions/checkout@v6
- name: Determine version
id: version
shell: bash
run: |
if [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
VERSION="${{ github.event.inputs.version }}"
else
# Extract from tag: v1.0.10 -> 1.0.10
VERSION="${GITHUB_REF_NAME#v}"
fi
echo "version=${VERSION}" >> "$GITHUB_OUTPUT"
echo "Version: ${VERSION}"
- name: Install Rust toolchain
uses: dtolnay/rust-toolchain@stable
with:
targets: ${{ matrix.target }}
- name: Install cross (for cross-compilation)
if: matrix.cross
run: cargo install cross --git https://github.com/cross-rs/cross
- name: Install musl tools
if: matrix.musl && !matrix.cross
run: sudo apt-get install -y musl-tools
- name: Update version in source
shell: bash
run: |
VERSION="${{ steps.version.outputs.version }}"
DATE_SUFFIX="$(date +%Y%m%d)"
VERSION_CODE="${VERSION}.${DATE_SUFFIX}"
# Update Cargo.toml
sed -i.bak "s/^version = .*/version = \"${VERSION}\"/" Cargo.toml
# Update version.rs
sed -i.bak "s/^pub const CODE: .*/pub const CODE: \&str = \"${VERSION_CODE}\";/" src/version.rs
echo "Cargo.toml version: ${VERSION}"
echo "version.rs CODE: ${VERSION_CODE}"
- name: Build
shell: bash
run: |
if [[ "${{ matrix.cross }}" == "true" ]]; then
cross build --release --target ${{ matrix.target }}
else
cargo build --release --target ${{ matrix.target }}
fi
# ── macOS Code Signing & Notarization ──────────────────────────
- name: Import Apple certificate
if: runner.os == 'macOS'
env:
CERTIFICATE_BASE64: ${{ secrets.APPLE_CERTIFICATE_BASE64 }}
CERTIFICATE_PASSWORD: ${{ secrets.APPLE_CERTIFICATE_PASSWORD }}
run: |
CERTIFICATE_PATH="$RUNNER_TEMP/certificate.p12"
KEYCHAIN_PATH="$RUNNER_TEMP/signing.keychain-db"
KEYCHAIN_PASSWORD="$(openssl rand -hex 16)"
echo -n "$CERTIFICATE_BASE64" | base64 --decode -o "$CERTIFICATE_PATH"
security create-keychain -p "$KEYCHAIN_PASSWORD" "$KEYCHAIN_PATH"
security set-keychain-settings -lut 21600 "$KEYCHAIN_PATH"
security unlock-keychain -p "$KEYCHAIN_PASSWORD" "$KEYCHAIN_PATH"
security import "$CERTIFICATE_PATH" \
-P "$CERTIFICATE_PASSWORD" \
-A -t cert -f pkcs12 \
-k "$KEYCHAIN_PATH"
security set-key-partition-list \
-S apple-tool:,apple: \
-k "$KEYCHAIN_PASSWORD" \
"$KEYCHAIN_PATH"
security list-keychain -d user -s "$KEYCHAIN_PATH"
- name: Sign macOS binary
if: runner.os == 'macOS'
env:
SIGNING_IDENTITY: ${{ secrets.APPLE_SIGNING_IDENTITY }}
run: |
BINARY="target/${{ matrix.target }}/release/siteone-crawler"
codesign --force --options runtime \
--sign "$SIGNING_IDENTITY" \
"$BINARY"
echo "Verifying signature..."
codesign --verify --verbose "$BINARY"
echo "Signature OK"
- name: Notarize macOS binary
if: runner.os == 'macOS'
env:
APPLE_ID: ${{ secrets.APPLE_ID }}
APPLE_ID_PASSWORD: ${{ secrets.APPLE_ID_PASSWORD }}
APPLE_TEAM_ID: ${{ secrets.APPLE_TEAM_ID }}
run: |
BINARY="target/${{ matrix.target }}/release/siteone-crawler"
NOTARIZE_ZIP="$RUNNER_TEMP/notarize.zip"
# ditto is required — Apple's notary service rejects zip-created archives
ditto -c -k --keepParent "$BINARY" "$NOTARIZE_ZIP"
echo "Submitting for notarization..."
xcrun notarytool submit "$NOTARIZE_ZIP" \
--apple-id "$APPLE_ID" \
--password "$APPLE_ID_PASSWORD" \
--team-id "$APPLE_TEAM_ID" \
--wait
echo "Notarization complete"
- name: Clean up keychain
if: runner.os == 'macOS' && always()
run: |
KEYCHAIN_PATH="$RUNNER_TEMP/signing.keychain-db"
if [ -f "$KEYCHAIN_PATH" ]; then
security delete-keychain "$KEYCHAIN_PATH"
fi
# ────────────────────────────────────────────────────────────────
- name: Package (Unix)
if: matrix.archive == 'tar.gz'
shell: bash
run: |
VERSION="${{ steps.version.outputs.version }}"
ARTIFACT="siteone-crawler-v${VERSION}-${{ matrix.artifact_suffix }}"
mkdir -p "staging/siteone-crawler"
cp "target/${{ matrix.target }}/release/siteone-crawler" "staging/siteone-crawler/"
cp README.md "staging/siteone-crawler/" 2>/dev/null || true
cp LICENSE "staging/siteone-crawler/" 2>/dev/null || true
chmod +x "staging/siteone-crawler/siteone-crawler"
(cd staging && tar czf "../${ARTIFACT}.tar.gz" siteone-crawler/)
echo "ARTIFACT_PATH=${ARTIFACT}.tar.gz" >> "$GITHUB_ENV"
- name: Package (Windows)
if: matrix.archive == 'zip'
shell: bash
run: |
VERSION="${{ steps.version.outputs.version }}"
ARTIFACT="siteone-crawler-v${VERSION}-${{ matrix.artifact_suffix }}"
mkdir -p "staging/siteone-crawler"
cp "target/${{ matrix.target }}/release/siteone-crawler.exe" "staging/siteone-crawler/"
cp README.md "staging/siteone-crawler/" 2>/dev/null || true
cp LICENSE "staging/siteone-crawler/" 2>/dev/null || true
(cd staging && 7z a -r "../${ARTIFACT}.zip" siteone-crawler/)
echo "ARTIFACT_PATH=${ARTIFACT}.zip" >> "$GITHUB_ENV"
# ── Build .deb and .rpm packages (Linux only) ──────────────
- name: Install cross-compilation tools (arm64)
if: runner.os == 'Linux' && matrix.cross
run: sudo apt-get install -y binutils-aarch64-linux-gnu
- name: Strip binary (Linux)
if: runner.os == 'Linux'
shell: bash
run: |
BINARY="target/${{ matrix.target }}/release/siteone-crawler"
if [[ "${{ matrix.target }}" == "aarch64"* ]]; then
aarch64-linux-gnu-strip -s "$BINARY" || true
else
strip -s "$BINARY" || true
fi
- name: Build .deb package
if: runner.os == 'Linux'
shell: bash
run: |
cargo install cargo-deb
if [[ "${{ matrix.musl }}" == "true" ]]; then
cargo deb --no-build --no-strip --target ${{ matrix.target }} --variant static
else
cargo deb --no-build --no-strip --target ${{ matrix.target }}
fi
echo "DEB_PATH=$(ls target/${{ matrix.target }}/debian/*.deb)" >> "$GITHUB_ENV"
- name: Build .rpm package
if: runner.os == 'Linux'
shell: bash
run: |
cargo install cargo-generate-rpm
mkdir -p target/release
cp "target/${{ matrix.target }}/release/siteone-crawler" target/release/
if [[ "${{ matrix.musl }}" == "true" ]]; then
# Override package name for static/musl variant
sed -i 's/^name = "siteone-crawler"$/name = "siteone-crawler-static"/' Cargo.toml
fi
cargo generate-rpm --target ${{ matrix.target }}
echo "RPM_PATH=$(find target -name '*.rpm' -path '*/generate-rpm/*' | head -1)" >> "$GITHUB_ENV"
- name: Upload .deb artifact
if: runner.os == 'Linux'
uses: actions/upload-artifact@v7
with:
name: siteone-crawler-${{ matrix.artifact_suffix }}-deb
path: ${{ env.DEB_PATH }}
- name: Upload .rpm artifact
if: runner.os == 'Linux'
uses: actions/upload-artifact@v7
with:
name: siteone-crawler-${{ matrix.artifact_suffix }}-rpm
path: ${{ env.RPM_PATH }}
# ────────────────────────────────────────────────────────────────
- name: Upload artifact
uses: actions/upload-artifact@v7
with:
name: siteone-crawler-${{ matrix.artifact_suffix }}
path: ${{ env.ARTIFACT_PATH }}
# ─────────────────────────────────────────────────────────────────
# Build Alpine .apk packages from musl binaries
# ─────────────────────────────────────────────────────────────────
package-alpine:
name: Build Alpine .apk (${{ matrix.arch }})
needs: build
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
include:
- arch: x86_64
artifact_suffix: linux-musl-x64
- arch: aarch64
artifact_suffix: linux-musl-arm64
steps:
- name: Checkout
uses: actions/checkout@v6
- name: Determine version
id: version
shell: bash
run: |
if [[ "${{ github.event_name }}" == "workflow_dispatch" ]]; then
VERSION="${{ github.event.inputs.version }}"
else
VERSION="${GITHUB_REF_NAME#v}"
fi
echo "version=${VERSION}" >> "$GITHUB_OUTPUT"
- name: Download musl binary
uses: actions/download-artifact@v8
with:
name: siteone-crawler-${{ matrix.artifact_suffix }}
path: dist
- name: Extract binary
run: |
VERSION="${{ steps.version.outputs.version }}"
tar xzf "dist/siteone-crawler-v${VERSION}-${{ matrix.artifact_suffix }}.tar.gz" -C dist
- name: Setup Alpine
uses: jirutka/setup-alpine@v1
with:
arch: ${{ matrix.arch }}
packages: abuild
- name: Prepare signing key
shell: alpine.sh --root {0}
env:
ALPINE_RSA_KEY: ${{ secrets.ALPINE_RSA_PRIVATE_KEY }}
ALPINE_RSA_PUB: ${{ secrets.ALPINE_RSA_PUBLIC_KEY }}
run: |
BUILDER=runner
# Install signing key
mkdir -p /etc/apk/keys
printf '%s\n' "$ALPINE_RSA_PUB" > /etc/apk/keys/siteone.rsa.pub
# Setup abuild config for builder
mkdir -p "/home/$BUILDER/.abuild"
printf '%s\n' "$ALPINE_RSA_KEY" > "/home/$BUILDER/.abuild/siteone.rsa"
printf '%s\n' "$ALPINE_RSA_PUB" > "/home/$BUILDER/.abuild/siteone.rsa.pub"
chmod 600 "/home/$BUILDER/.abuild/siteone.rsa"
cat > "/home/$BUILDER/.abuild/abuild.conf" << 'EOF'
PACKAGER_PRIVKEY="$HOME/.abuild/siteone.rsa"
EOF
chown -R "$BUILDER" "/home/$BUILDER/.abuild"
# Add user to abuild group
addgroup "$BUILDER" abuild
- name: Build .apk
shell: alpine.sh {0}
env:
VERSION: ${{ steps.version.outputs.version }}
run: |
ARCH=$(uname -m)
# Prepare build directory
mkdir -p ~/build
cp "$GITHUB_WORKSPACE/dist/siteone-crawler/siteone-crawler" ~/build/
cp "$GITHUB_WORKSPACE/LICENSE" ~/build/ 2>/dev/null || true
# Create APKBUILD
cat > ~/build/APKBUILD << EOF
# Maintainer: Jan Reges <jan.reges@siteone.cz>
pkgname=siteone-crawler
pkgver=${VERSION}
pkgrel=1
pkgdesc="Website crawler and QA toolkit in Rust"
url="https://crawler.siteone.io/"
arch="${ARCH}"
license="MIT"
source=""
options="!check !strip"
package() {
install -Dm755 "\$startdir/siteone-crawler" "\$pkgdir/usr/bin/siteone-crawler"
install -Dm644 "\$startdir/LICENSE" "\$pkgdir/usr/share/licenses/\$pkgname/LICENSE" 2>/dev/null || true
}
EOF
# Build the package
cd ~/build
abuild -d -P ~/packages
# Copy and rename to include arch (both arches produce the same filename)
mkdir -p "$GITHUB_WORKSPACE/apk-out"
for f in $(find ~/packages -name '*.apk'); do
BASENAME=$(basename "$f" .apk)
cp "$f" "$GITHUB_WORKSPACE/apk-out/${BASENAME}-${ARCH}.apk"
done
- name: Upload .apk artifact
        uses: actions/upload-artifact@v7
        with:
          name: siteone-crawler-alpine-${{ matrix.arch }}
          path: apk-out/*.apk

  release:
    name: Create GitHub Release
    needs: [build, package-alpine]
    runs-on: ubuntu-latest
    if: always() && startsWith(github.ref, 'refs/tags/v') && needs.build.result == 'success'
    steps:
      - name: Checkout
        uses: actions/checkout@v6

      - name: Download all artifacts
        uses: actions/download-artifact@v8
        with:
          path: artifacts
          merge-multiple: true

      - name: Determine version
        id: version
        run: echo "version=${GITHUB_REF_NAME#v}" >> "$GITHUB_OUTPUT"

      - name: List artifacts
        run: ls -lhR artifacts/

      - name: Create Release
        uses: softprops/action-gh-release@v2
        with:
          name: "v${{ steps.version.outputs.version }}"
          body: |
            ### Downloads

            | Platform | Architecture | File |
            |----------|-------------|------|
            | Linux | x64 | `siteone-crawler-v${{ steps.version.outputs.version }}-linux-x64.tar.gz` |
            | Linux | arm64 | `siteone-crawler-v${{ steps.version.outputs.version }}-linux-arm64.tar.gz` |
            | Linux | x64 (musl/static) | `siteone-crawler-v${{ steps.version.outputs.version }}-linux-musl-x64.tar.gz` |
            | Linux | arm64 (musl/static) | `siteone-crawler-v${{ steps.version.outputs.version }}-linux-musl-arm64.tar.gz` |
            | macOS | arm64 (Apple Silicon) | `siteone-crawler-v${{ steps.version.outputs.version }}-macos-arm64.tar.gz` |
            | macOS | x64 (Intel) | `siteone-crawler-v${{ steps.version.outputs.version }}-macos-x64.tar.gz` |
            | Windows | x64 | `siteone-crawler-v${{ steps.version.outputs.version }}-win-x64.zip` |
            | Windows | arm64 | `siteone-crawler-v${{ steps.version.outputs.version }}-win-arm64.zip` |

            ### Linux packages (glibc — best performance, requires glibc 2.39+)

            | Format | Architecture | File |
            |--------|-------------|------|
            | Debian/Ubuntu (.deb) | x64 | `siteone-crawler_${{ steps.version.outputs.version }}-1_amd64.deb` |
            | Debian/Ubuntu (.deb) | arm64 | `siteone-crawler_${{ steps.version.outputs.version }}-1_arm64.deb` |
            | Fedora/RHEL (.rpm) | x64 | `siteone-crawler-${{ steps.version.outputs.version }}-1.x86_64.rpm` |
            | Fedora/RHEL (.rpm) | arm64 | `siteone-crawler-${{ steps.version.outputs.version }}-1.aarch64.rpm` |

            ### Linux packages (musl/static — any Linux, ~50–80% slower)

            | Format | Architecture | File |
            |--------|-------------|------|
            | Debian/Ubuntu (.deb) | x64 | `siteone-crawler-static_${{ steps.version.outputs.version }}-1_amd64.deb` |
            | Debian/Ubuntu (.deb) | arm64 | `siteone-crawler-static_${{ steps.version.outputs.version }}-1_arm64.deb` |
            | Fedora/RHEL (.rpm) | x64 | `siteone-crawler-static-${{ steps.version.outputs.version }}-1.x86_64.rpm` |
            | Fedora/RHEL (.rpm) | arm64 | `siteone-crawler-static-${{ steps.version.outputs.version }}-1.aarch64.rpm` |
            | Alpine (.apk) | x64 | `siteone-crawler-${{ steps.version.outputs.version }}-r1-x86_64.apk` |
            | Alpine (.apk) | arm64 | `siteone-crawler-${{ steps.version.outputs.version }}-r1-aarch64.apk` |

            ### Quick start

            ```bash
            # Extract and run
            tar xzf siteone-crawler-v${{ steps.version.outputs.version }}-linux-x64.tar.gz
            cd siteone-crawler
            ./siteone-crawler --url=https://example.com
            ```

            ### Install via package manager

            ```bash
            # Debian/Ubuntu (glibc — Ubuntu 24.04+, Debian 13+)
            sudo dpkg -i siteone-crawler_${{ steps.version.outputs.version }}-1_amd64.deb

            # Debian/Ubuntu (static/musl — older distributions)
            sudo dpkg -i siteone-crawler-static_${{ steps.version.outputs.version }}-1_amd64.deb

            # Fedora/RHEL
            sudo dnf install ./siteone-crawler-${{ steps.version.outputs.version }}-1.x86_64.rpm
            ```
          files: artifacts/*
          generate_release_notes: true
          draft: true
          prerelease: false
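The "Determine version" step of this workflow relies on POSIX parameter expansion to derive the version from the pushed tag. A minimal shell sketch (the tag name `v1.2.3` is hypothetical; in Actions, `GITHUB_REF_NAME` is set by the runner):

```shell
# Pushing a v-prefixed tag (e.g. "git tag v1.2.3 && git push origin v1.2.3")
# triggers the release job above. The version is derived by stripping the
# leading "v" with ${var#pattern} ("remove shortest matching prefix"):
GITHUB_REF_NAME="v1.2.3"          # hypothetical; provided by the CI runner
version="${GITHUB_REF_NAME#v}"    # strips the leading "v"
echo "version=$version"
```

In the workflow itself this line is appended to `$GITHUB_OUTPUT`, which is how later steps read it as `steps.version.outputs.version`.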
================================================
FILE: .gitignore
================================================
/target
/tmp/
/dist/
*.swp
*.swo
*~
.idea/
.vscode/
*.cache
================================================
FILE: CHANGELOG.md
================================================
### Changelog
All notable changes to this project will be documented in this file. Dates are displayed in UTC.
#### [v1.0.9](https://github.com/janreges/siteone-crawler/compare/v1.0.8...v1.0.9)
- typos: non-exhaustive typo and spelling corrections [`#8`](https://github.com/janreges/siteone-crawler/pull/8)
- offline exporter: new option --ignore-store-file-error for the OfflineWebsiteExporter [`#16`](https://github.com/janreges/siteone-crawler/pull/16)
- url handling: added option --transform-url to force requests for some URLs to be internally transformed so that a different URL/domain (e.g. a local one) is queried, fixes #58 [`#58`](https://github.com/janreges/siteone-crawler/issues/58)
- html report: added option to list which sections to include in the HTML report via --html-report-options (see README.md), fixes #63 [`#63`](https://github.com/janreges/siteone-crawler/issues/63)
- offline export: fix behavior regarding URLs containing various valid UTF-8 characters (German, Chinese, etc.), fixes #65 [`#65`](https://github.com/janreges/siteone-crawler/issues/65)
- seo analysis: fix for an issue that occurs when encoding UTF-8 due to some special characters in the content, fixes #51 [`#51`](https://github.com/janreges/siteone-crawler/issues/51)
- offline website exporter: added option --offline-export-no-auto-redirect-html, which disables the generation of automatic sub-folder.html with meta redirects to sub-folder/index.html, fixes #54 [`#54`](https://github.com/janreges/siteone-crawler/issues/54)
- offline website exporter: fix replacing reference where it is followed by and not an immediate number, fixes #52 [`#52`](https://github.com/janreges/siteone-crawler/issues/52)
- slowest analyzer: fixed typo slowest->slower, fixes #42 [`#42`](https://github.com/janreges/siteone-crawler/issues/42)
- url & sitemaps: as --url it is now possible to specify a URL to a sitemap XML or a sitemap index XML, from which the list of all URLs is extracted, fixes #25 [`#25`](https://github.com/janreges/siteone-crawler/issues/25)
- github: remove all unnecessary files from the release package [`e54029c`](https://github.com/janreges/siteone-crawler/commit/e54029cbef015a259d92e93933e81af2e851a145)
- github: fix release workflow [`c9d5361`](https://github.com/janreges/siteone-crawler/commit/c9d5361acd646b24e47cb6e60e7d07be12cd96c9)
- github: workflow for automatic creation of release archives for all 5 supported platforms/architectures [`0a461ac`](https://github.com/janreges/siteone-crawler/commit/0a461aca0b145982005b6a460d3e852a0767426a)
- webp analysis: if there are avif images on the website (they are more optimized than webp), we will not report the absence of webp [`e067653`](https://github.com/janreges/siteone-crawler/commit/e06765332fa743f9bb22f5eb589cb71a01dc90db)
- term: if TERM is not set or we're not in a TTY, use default width 138 [`eb839e4`](https://github.com/janreges/siteone-crawler/commit/eb839e423abf4df7020822986dc9e2ae43d44971)
- options: handle calling just 'crawler' without any parameters - the complete documentation and a red message about the need to pass at least the --url parameter will be displayed [`fc390ae`](https://github.com/janreges/siteone-crawler/commit/fc390ae693ba201b060effc67a90ad893772558f)
- phpstan: fix errors found by phpstan and increasing the memory limit for phpstan [`650d46a`](https://github.com/janreges/siteone-crawler/commit/650d46abb867ff04df24be01c3c6daebd42b0911)
- tests: fix the tests after removing the underscore for the external domain [`b31a872`](https://github.com/janreges/siteone-crawler/commit/b31a872fdc439a321c364665d18634906ce8ad30)
- Revert "url parser: fix url parsing in some cases when href starts with './'" [`240430b`](https://github.com/janreges/siteone-crawler/commit/240430bc90039063b9e810980360f798afa46f74)
- url parser: fix url parsing in some cases when href starts with './' [`2443532`](https://github.com/janreges/siteone-crawler/commit/244353202c80c152b6a3b63ef83f6046338404e9)
- url parser: fix url parsing in some cases when href starts with './' [`fe33e7b`](https://github.com/janreges/siteone-crawler/commit/fe33e7b6c404a62cf636db20b6193196c4bf6e25)
- website to markdown: added --markdown-remove-links-and-images-from-single-file - useful when used within an AI tool to obtain context from a website (typically with documentation of a solution/framework) [`631e544`](https://github.com/janreges/siteone-crawler/commit/631e544b9eb836a01f68f055e80b8b35b16687dc)
- website to markdown: fixed the problem with incorrect sorting of the root index.md (homepage should be at the beginning) [`c2ffff3`](https://github.com/janreges/siteone-crawler/commit/c2ffff32a48e84c872e132417ef1623015755e7e)
- website to markdown: fine tuning of the resulting markdown files, correct detection of table headers, removal of excess whitespaces [`ee40b29`](https://github.com/janreges/siteone-crawler/commit/ee40b2915611824c676c5d4761a266edba6be0d2)
- website to markdown: added --markdown-export-single-file for the ability to save all website content into one combined markdown file (smart detection and removal of shared headers and footers is also implemented) [`af01376`](https://github.com/janreges/siteone-crawler/commit/af013766830991f473d26dc25dc5804cc88b7c76)
- readme: changed partnership to powered by JetBrains [`e77f755`](https://github.com/janreges/siteone-crawler/commit/e77f755527319e99d66cec6b2b2864dee4d560e4)
- readme: added partnership with JetBrains [`0104646`](https://github.com/janreges/siteone-crawler/commit/0104646b6f209eae1c530bb68160a5fa238f7dda)
- website to markdown: added implicit excluded selectors for typical 'hidden' classes [`b3c57d6`](https://github.com/janreges/siteone-crawler/commit/b3c57d69f9ef52592cab308b493b328b81c29705)
- website to markdown: consecutive links fixes (ignore links without visible text or defined href) [`6d9a310`](https://github.com/janreges/siteone-crawler/commit/6d9a31053a56bb532961395a2e1821f2028e36ac)
- website to markdown: list fixes and prepared auto-removal of duplicates (e.g. desktop & mobile version of menus) [`338b0c6`](https://github.com/janreges/siteone-crawler/commit/338b0c692a434a4f3a2a20160c9f45004526c04a)
- website to markdown: removed unwanted escaping from links/images [`35c6f57`](https://github.com/janreges/siteone-crawler/commit/35c6f579a62080316210ec00734af8069ea32f27)
- website to markdown: refactoring the way ul/ol lists are composed (there were problems with nested lists and whitespaces) [`15ea68c`](https://github.com/janreges/siteone-crawler/commit/15ea68ce7e3e3d962c5813ad971571eec42fe933)
- README: improved introduction and added icons [`737b8c6`](https://github.com/janreges/siteone-crawler/commit/737b8c63bce618bdd613090205866d03bde1d67b)
- docs: added Table of Contents to JSON-OUTPUT.md and TEXT-OUTPUT.md [`2aa2856`](https://github.com/janreges/siteone-crawler/commit/2aa28569de9fa0315219e68d084cc675ead57303)
- docs: added detailed documentation and real sample JSON and TXT output from the crawler for a better idea of its functionality [`09495d1`](https://github.com/janreges/siteone-crawler/commit/09495d187e3f80d4f4c29176c12540e300c5cb6f)
- docs: added detailed documentation and real sample JSON and TXT output from the crawler for a better idea of its functionality [`cb7606b`](https://github.com/janreges/siteone-crawler/commit/cb7606b2c8fe1547cfdf787d0b7050693228ff2e)
- json output docs: first version [`73e8d45`](https://github.com/janreges/siteone-crawler/commit/73e8d45ad93e1cc88a778533d0955e22cec9d6c7)
- output options: added option --timezone (e.g. Europe/Prague, default is UTC) to set the time zone in which dates and times in HTML reports and exported folder/file names should be, refs #57 [`e3d3213`](https://github.com/janreges/siteone-crawler/commit/e3d321315b6c9f0290b5795345a90e78af32a358)
- website to markdown: use link URL as text when link text is empty [`873ffae`](https://github.com/janreges/siteone-crawler/commit/873ffae76a8d96c8a2a4e4670ad09f4ed8527d4a)
- website to markdown: if the link contains nested div/span tags, display the link in markdown as a list-item so that it is on its own line [`c48f346`](https://github.com/janreges/siteone-crawler/commit/c48f34614a057b79e5f9e5d5fbb9877cd7c2d25f)
- website to markdown: removed the use of html2markdown (problematic integration on windows due to cygwin) and replaced with a custom HtmlToMarkdownConverter [`4e1db09`](https://github.com/janreges/siteone-crawler/commit/4e1db090f7b9276663c8fda587e8673d67783340)
- content processor: added justification for skipping URLs due to exceeding --max-depth [`a6bc08a`](https://github.com/janreges/siteone-crawler/commit/a6bc08ac2367b3fb008b51e1278b3b78ae5bfe28)
- README: converting arguments to a table view and adding missing links to the outline [`c23a686`](https://github.com/janreges/siteone-crawler/commit/c23a6860f06918062a159747039bd38e868cd7f8)
- README: added all missing options (--max-reqs-per-sec, --max-heading-level, --websocket-server, --console-width and a few others less important) [`82c48bc`](https://github.com/janreges/siteone-crawler/commit/82c48bccf597a7c3811c16ef6c8b29fc37d7c46c)
- extra columns: added option to extract data using XPath and RegEx to --extra-columns [`cd6d55a`](https://github.com/janreges/siteone-crawler/commit/cd6d55af254f4f38b25399293aa6d122c578f4c7)
- http response: ensuring that the repeated response header is merged into a concatenated string, instead of an array, refs #48 [`c0f3b21`](https://github.com/janreges/siteone-crawler/commit/c0f3b210e3ddca203eb9363f038bcf4e30a3f30c)
- css processor: fix for a situation where some processors could cause CSS content to be NULL [`c8f2ffc`](https://github.com/janreges/siteone-crawler/commit/c8f2ffc45628a2f0f1e477dc7e2ea436c9ebafbe)
- website to markdown: better removal of nested images in situations like [](index.html) [`9ecba5e`](https://github.com/janreges/siteone-crawler/commit/9ecba5e91608fbbcd625e3ff42621869a7e31f00)
- website to markdown: first version of the converter of entire web pages to markdown [`b944edb`](https://github.com/janreges/siteone-crawler/commit/b944edbcc33381c97ba220e1920994574c676225)
- security check: handle case of multiple headers with the same name [`706977e`](https://github.com/janreges/siteone-crawler/commit/706977e545c428ab82714e95a75294841dac5e46)
- html processor: do not remove the schema and host for URLs defined in --ignore-regex [`8be42af`](https://github.com/janreges/siteone-crawler/commit/8be42afea5af076aee097842fa3c4996e66c47ef)
- offline export: added --offline-export-remove-unwanted-code=<1/0> (default is 1) to remove unwanted code for offline mode - typically analytics JS, social networks, cookie consent, cross-origin scripts, etc., refs #37 [`17a11fa`](https://github.com/janreges/siteone-crawler/commit/17a11fa3fe7a2d9c012e0f70c2392e833e02193c)
- loop protection: added --max-non200-responses-per-basename as configurable protection against looping with dynamic non-200 URLs. If a basename (the last part of the URL after the last slash) has more non-200 responses than this limit, other URLs with the same basename will be ignored/skipped [`063bddf`](https://github.com/janreges/siteone-crawler/commit/063bddf47a9fe82dc2b08297acd16fd154001feb)
- bin/swoole-cli: upgrade to latest Swoole 6.0.0 (this version already supports Swoole\Threads - in the future there will be a refactoring that will relieve us of the necessity to use Swoole\Table, which requires memory preallocation for a predefined number of rows + my ticket https://github.com/swoole/swoole-src/issues/5460 has been processed regarding the support of getting the values of repeated header) [`b6e7c23`](https://github.com/janreges/siteone-crawler/commit/b6e7c23c055032a1003605ef2679f1ca59b64a08)
- css processor: fix query string and anchor processing for paths in url() + don't replace url(data:*) with complex information e.g. about svg including brackets, refs #31 [`36eece8`](https://github.com/janreges/siteone-crawler/commit/36eece89c0719602145dbf51673d80355a80bfd2)
- skipped urls: width defined fixed at 60 - better for most situations than the previous dynamic calculation [`8ef462f`](https://github.com/janreges/siteone-crawler/commit/8ef462f2fb52b65213136705536ba575dd2a9511)
- manager: refactored mb_convert_encoding() -> htmlentities() as part of the migration to PHP 8.4.1 [`5c7c903`](https://github.com/janreges/siteone-crawler/commit/5c7c903d7d4d178b691c870e4a71fc862685c21d)
- http cache analysis: added analysis of http cache of all pages and assets - divided by content type, domains, and their combination [`b09cfbd`](https://github.com/janreges/siteone-crawler/commit/b09cfbdf3fe033feef3b64b0fcbbda15dc0308ab)
- css processing: added search for urls in @import url(*.css) [`c964fea`](https://github.com/janreges/siteone-crawler/commit/c964fea1382fec71b990ac2cd89683590694d5b3)
- analysis/report: if there is no URL with code >= 200, there is no point to perform analysis, print empty output of all analyzers and generate full report [`c1bb448`](https://github.com/janreges/siteone-crawler/commit/c1bb448922cb47d0fe7fa28d2c5f540d6961ea94)
- options: fix passing booleans to correctUrl() in case of empty '-u' or '--url' parameters (recognized as boolean flags) [`a297fec`](https://github.com/janreges/siteone-crawler/commit/a297feccb34002604705c180855d8f12cd0e41a2)
- skipped-urls: added an overview of skipped URLs including a summary across domains - not only from a security point of view, it is good to know where external links point and from where JS/CSS/fonts/images are loaded [`84ae146`](https://github.com/janreges/siteone-crawler/commit/84ae1467a6c02b194e9e5631351f00a52b5924e0)
- user-agent: if a manually defined user-agent ends with an exclamation mark (!), do not add the siteone-crawler/version signature and remove the exclamation mark [`cfda3b0`](https://github.com/janreges/siteone-crawler/commit/cfda3b072e208966f9e7078211257d1a027d2bfa)
- options: better response and warning for unfilled required --url [`52e50db`](https://github.com/janreges/siteone-crawler/commit/52e50db58f15bd10ab64b70f4c3f3fbf299c0135)
- dns resolving: added --resolve attribute, which behaves exactly the same as curl, and using the 'domain:port:ip' entry it is possible to provide a custom IP address for the domain:port pair [`4031181`](https://github.com/janreges/siteone-crawler/commit/403118132807f30c65ed89b1b2d8f924a22e3a90)
- windows/cygwin: workarounds for cygwin environment to return as much DNS/SSL/TLS info as possible even if nslookup or dig cannot be called [`bfc4f55`](https://github.com/janreges/siteone-crawler/commit/bfc4f5508e85b4af2e7c17309181791a3a9d5fc1)
- upload timeout: fix that --upload-timeout does not overwrite the primary timeout [`c429639`](https://github.com/janreges/siteone-crawler/commit/c429639e30d44420bc9af017536714df52868813)
- readme: adding a sample report and clone of nextjs.org and a few other updates [`07ad5e1`](https://github.com/janreges/siteone-crawler/commit/07ad5e119b47455ff4a2e3ba6230a21203d40396)
- readme: added description for --allowed-domain-for-external-files and --allowed-domain-for-crawling [`0c8b1b3`](https://github.com/janreges/siteone-crawler/commit/0c8b1b3fb791a5c0e8f540a34d62122682680c19)
- filtering: added --single-foreign-page to ensure that only the linked page and its assets are loaded from the external domain (which second-level domain is not the same as the initialization URL), but not all other pages on the external domain are automatically crawled [`c4af4ec`](https://github.com/janreges/siteone-crawler/commit/c4af4ec5fb76456f4d47eaf6041ba4be4fbb48b8)
- filtering: added --disable-all-assets as a shortcut for calling all --disable-* flags [`7e32c44`](https://github.com/janreges/siteone-crawler/commit/7e32c440fb0ee0260f7b1e2c6b2a01b753ffb149)
- filtering: added --max-depth=<int> for maximum crawling depth (for pages, not assets) and --single-page moved to basic options [`2dbff75`](https://github.com/janreges/siteone-crawler/commit/2dbff756dec3735f8f8c9f293dcd846eb3b3fde6)
- resource filtering: added --single-page for loading only one given URL and their assets [`7325a4b`](https://github.com/janreges/siteone-crawler/commit/7325a4bbf633f60015309e257af509a5f21384d5)
- offline exporter: added the possibility to use --replace-query-string to replace the default behavior where the query string is replaced by a short hash constructed from the query string in filenames, see issue #30 [`1a3482c`](https://github.com/janreges/siteone-crawler/commit/1a3482c6dada06b8482f205ceb181d8b42a62607)
- offline export: added --replace-content=<val> option to replace content in HTML/JS/CSS before saving to disk (with strict text & regexp support) [`81cddaa`](https://github.com/janreges/siteone-crawler/commit/81cddaaf57550ac253b3e1ab322c3f5498374e96)
- revert caps [`76a7418`](https://github.com/janreges/siteone-crawler/commit/76a74184c871714871f537344a84e757069fff0c)
- Revert "Auxiliary commit to revert individual files from b3bb0eea10075aee124cce485379c24ece78df79" [`5878be9`](https://github.com/janreges/siteone-crawler/commit/5878be97f663d8ac70eac9e56578e628faeabb9f)
- robots.txt handling: process Disallow records only for user-agent 'SiteOne-Crawler' or '*' [`9c2c989`](https://github.com/janreges/siteone-crawler/commit/9c2c989c569fed518bb5139c1d496159cc486683)
- new option for the OfflineWebsiteExporter [`2c4bbbc`](https://github.com/janreges/siteone-crawler/commit/2c4bbbc6f0e55a4f3af6a89be50450e15b65cdd2)
- tables: added --rows-limit option (default 200) to hard limit the length of all tables with data from analyses (except Visited URLs) to prevent very long and slow reports .. tables are sorted by severity, so it should be ok [`9798252`](https://github.com/janreges/siteone-crawler/commit/9798252901dd25797d1d38fa26a19c6dbc409fa1)
- video gallery: added display of all found videos with video player (including use of observer for lazy loading and smart option to preload first seconds of video + button to play 2 seconds of each video sequentially) [`411736a`](https://github.com/janreges/siteone-crawler/commit/411736ac3852d07464fe4a4a52c4c0bf171d716f)
- license: change of licensing to MIT [`14b73e2`](https://github.com/janreges/siteone-crawler/commit/14b73e2e10cc924112966d2c5b16812dadf1fc48)
- non-exhaustive typo and spelling corrections [`b3bb0ee`](https://github.com/janreges/siteone-crawler/commit/b3bb0eea10075aee124cce485379c24ece78df79)
#### [v1.0.8](https://github.com/janreges/siteone-crawler/compare/v1.0.7...v1.0.8)
> 24 August 2024
- reports: changed file name composition from report.mydomain.com.* to mydomain.com.report.* [`#9`](https://github.com/janreges/siteone-crawler/pull/9)
- version: update to 1.0.8.20240824 [`6c634e0`](https://github.com/janreges/siteone-crawler/commit/6c634e0f88cce49aa3f5fb9cd69ca55fa5191bd8)
- version 1.0.8.20240824 + changelog [`a02cc7b`](https://github.com/janreges/siteone-crawler/commit/a02cc7bf4c0fc4703189341d9ea0be2345b95796)
- crawler: solved edge-case, which very rarely occurred when the queue processing was already finished, but the last outstanding coroutine still found some new URL [`a85990d`](https://github.com/janreges/siteone-crawler/commit/a85990d662d74af281805cfdf10c0320fee0007a)
- javascript processor: improvement of webpack JS processing in order to correctly replace paths from VueJS during offline export (as e.g. in case of docs.netlify.com) .. without this, HTML had the correct paths in the left menu, but JS immediately broke them because they started with an absolute path with a slash at the beginning [`9bea99b`](https://github.com/janreges/siteone-crawler/commit/9bea99b9684e6059b8abfad4b382fafdad31c9a9)
- offline export: detect and process fonts.googleapis.com/css* as CSS even if there is no .css extension [`da33100`](https://github.com/janreges/siteone-crawler/commit/da33100975635be8305e07c2023a22c300b66216)
- js processor: removed the forgotten var_dump [`5f2c36d`](https://github.com/janreges/siteone-crawler/commit/5f2c36de1666e6987d2c9d88a39e3b6d0a2e1f32)
- offline export: improved search for external JS in the case of webpack (dynamic composition of URLs from an object with the definition of chunks) - it was debugged on docs.netlify.com [`a61e72e`](https://github.com/janreges/siteone-crawler/commit/a61e72e7f5b773a437b4151432db04a5afd7124a)
- offline export: in case the URL ends with a dot and a number (so it looks like an extension), we must not recognize it as an extension in some cases [`c382d95`](https://github.com/janreges/siteone-crawler/commit/c382d959f7440ebfcd95566ec0050e771a2f3495)
- offline url converter: better support for SVG in case the URL does not contain an extension at all, but has e.g. 'icon' in the URL (it's not perfect) [`c9c01a6`](https://github.com/janreges/siteone-crawler/commit/c9c01a69905fefce82f4e8f85e707a0d1abb5e1e)
- offline exporter: warning instead of exception for some edge-cases, e.g. not saving SVG without an extension does not cause the export to stop [`9d285f4`](https://github.com/janreges/siteone-crawler/commit/9d285f4d599ba8892dd8752e8d831cd3c86af178)
- cors: do not set Origin request header for images (otherwise error 403 on cdn.sanity.io for svg, etc.) [`2f3b7eb`](https://github.com/janreges/siteone-crawler/commit/2f3b7eb51a03d42d3d2961c84aadcd118b546e05)
- best practice analyzer: in checking for missing quotes ignore values longer than 1000 characters (fixes, e.g., at skoda-auto.cz the error Compilation failed: regular expression is too large at offset 90936) [`8a009df`](https://github.com/janreges/siteone-crawler/commit/8a009df9734773275fd9805862dc9bfeeccb6079)
- html report: added loading of extra headers to the visited URL list in the HTML report [`781cf17`](https://github.com/janreges/siteone-crawler/commit/781cf17c18088126db74ebc1ef00fee3d6784979)
- Frontload the report names [`62d2aae`](https://github.com/janreges/siteone-crawler/commit/62d2aae57e31c7bfa53720446cc8dfbc59e482af)
- robots.txt: added option --ignore-robots-txt (we often need to view internal or preview domains that are otherwise prohibited from indexing by search engines) [`9017c45`](https://github.com/janreges/siteone-crawler/commit/9017c45a675dd327895b57f14095ad6bd52a02fc)
- http client: added an explicit 'Connection: close' header and an explicit call to $client->close(), even though Swoole was doing it automatically after exiting the coroutine [`86a7346`](https://github.com/janreges/siteone-crawler/commit/86a7346d059452d210b945ca4329e1cc17781dca)
- javascript processor: parse url addresses to import the JS module only in JS files (otherwise imports from HTML documentation, e.g. on the websites svelte.dev or nextjs.org, were parsed by mistake) [`592b618`](https://github.com/janreges/siteone-crawler/commit/592b618c01e75509e16a812fafab7f21f3c7c64d)
- html processor: added obtaining urls from HTML attributes that are not wrapped in quotes (but I am aware that the current regexps can cause problems when spaces are used and not properly escaped) [`f00abab`](https://github.com/janreges/siteone-crawler/commit/f00ababfa459eca27dce7657fe91c70831f86089)
- offline url converter: swapping woff2/woff order for regex because in this case their priority is important and because of that woff2 didn't work properly [`3f318d1`](https://github.com/janreges/siteone-crawler/commit/3f318d19fa0a3757546493ac7f47cca21922b1f5)
- non-200 url basename detection: we no longer consider e.g. image generators that have the same basename and the url to the image in the query parameters as the same basename [`bc15ef1`](https://github.com/janreges/siteone-crawler/commit/bc15ef198bb13fe845fef8cd4946b2cab5c2ea6d)
- supertable: activation of automatic creation of active links also for homepage '/' [`c2e228e`](https://github.com/janreges/siteone-crawler/commit/c2e228e0d475351431cf9b060487e86ce6d33e52)
- analysis and robots.txt: improved the display of url addresses for SEO analysis in the case of a multi-domain website, so that the same url, e.g. '/', cannot appear in the overview multiple times without the domain or scheme being recognized + improved work with robots.txt in SEO detection and in displaying urls banned from indexing [`47c7602`](https://github.com/janreges/siteone-crawler/commit/47c7602217e40a4f6d4f3af5c71d6dff72952aab)
- offline website exporter: we add the suffix '_' to the folder name only in the case of a typical extension of a static file - we don't want this to happen with domain names as well [`d16722a`](https://github.com/janreges/siteone-crawler/commit/d16722a5ad6271270fb0fff11e66a7f02f3b6e9a)
- javascript processor: extract JS urls also from imports like import {xy} from "./path/foo.js" [`aec6cab`](https://github.com/janreges/siteone-crawler/commit/aec6cab051a46df9d89866f5cfd7e66312dafb92)
- visited url: added 'txt' extension to looksLikeStaticFileByUrl() [`460c645`](https://github.com/janreges/siteone-crawler/commit/460c6453d91e85c2889ebaa2b2542fd88c5ffa6a)
- html processor: extract JS urls also from <link href="*.js">, typically with rel="modulepreload" [`c4a92be`](https://github.com/janreges/siteone-crawler/commit/c4a92bee00d96c530431134370a3ba0d2216a1c1)
- html processor: extracting repeated calls to getFullUrl() into a variable [`a5e1306`](https://github.com/janreges/siteone-crawler/commit/a5e1306530717d9edd4f95a7989539a172a38f4a)
- analysis: do not include urls that failed to load (timeout, skipping, etc.) in the analysis of content-types and source-domains - prevention of displaying content type 'unknown' [`b21ecfb`](https://github.com/janreges/siteone-crawler/commit/b21ecfb85f58d07c0a82b93826ad2977ab2cd523)
- cli options: improved method of removing quotes even for options that can be arrays - also fixes --extra-columns='Title' [`97f2761`](https://github.com/janreges/siteone-crawler/commit/97f27611acf2fc4ed24b1e5574be84711ea3fa12)
- url skipping: if there are a lot of URLs with the same basename (the part after the last slash), we will allow a maximum of 5 requests for URLs with the same basename - the purpose is to prevent a lot of 404s from being triggered when there is an incorrect relative link to relative/my-img.jpg on all pages (e.g. on the 404 page on v2.svelte.dev) [`4fbb917`](https://github.com/janreges/siteone-crawler/commit/4fbb91791f9111cc6f9d98b60732fcca7fad2f1f)
- analysis: perform most of the analysis only on URLs from domains for which we have crawling enabled [`313adde`](https://github.com/janreges/siteone-crawler/commit/313addede29ac847273b6ab6ed3a8ab878a6fb4a)
- audio & video: added audio/video file search in <audio> and <video> tags, if file crawling is not disabled [`d72a5a5`](https://github.com/janreges/siteone-crawler/commit/d72a5a51bd6863425a3d8bcffc7a9b5eb831f979)
- best practices: reworded the confusing warning '<h2> after <h0>' to '<h2> without previous heading' [`041b383`](https://github.com/janreges/siteone-crawler/commit/041b3836a8a585158ae1a1a6fb0057b367f3a4f6)
- initial url redirect: in the case that the entered URL redirects to another URL/domain within the same 2nd-level domain (typically http->https or mydomain.tld -> www.mydomain.tld redirects), we continue crawling with the new URL/domain and declare the new URL as the initial URL [`166e617`](https://github.com/janreges/siteone-crawler/commit/166e617fbc893798dc7b340f43de75df2d4cf335)
#### [v1.0.7](https://github.com/janreges/siteone-crawler/compare/v1.0.6...v1.0.7)
> 22 December 2023
- version 1.0.7.20231222 + changelog [`9d2be52`](https://github.com/janreges/siteone-crawler/commit/9d2be52776c081989322953c7a31debfd4947420)
- html report template: updated logo link to crawler.siteone.io [`9892cfe`](https://github.com/janreges/siteone-crawler/commit/9892cfe5708a3da2f5fc355246dd50b2a0c5cb4f)
- http headers analysis: renamed 'Headers' to 'HTTP headers' [`436e6ea`](https://github.com/janreges/siteone-crawler/commit/436e6ea5a9914c8615bb03b444ac0aad15e31c49)
- sitemap generator: added info about crawler to generated sitemap.xml [`7cb7005`](https://github.com/janreges/siteone-crawler/commit/7cb7005bf50b8f93b421c94c57ff51eb99b45912)
- html report: refactor of all inline on* event listeners to data attributes and event listeners added from static JS inside <script>, so that we can disable all inline JS in the online HTML report and allow only our JS signed with hashes by Content-Security-Policy [`b576eef`](https://github.com/janreges/siteone-crawler/commit/b576eef55a5678a67928970fc51aaaefd7abd1a8)
- readme: removed HTTP auth from roadmap (it's already done), improved guide how to implement own upload endpoint and message about SMTP moved under mailer options [`e1567ae`](https://github.com/janreges/siteone-crawler/commit/e1567aee52f9d09c1cef1ad35babaf9eea388175)
- utils: hide passwords/authentication specified in cli parameters as *auth=xyz (e.g. --http-auth=abc:xyz) in html report [`c8bb88f`](https://github.com/janreges/siteone-crawler/commit/c8bb88fc1a65ecdfd53db23fc5d972b841830837)
- readme: fixed formatting of the upload and expert options [`2d14bd5`](https://github.com/janreges/siteone-crawler/commit/2d14bd5972496989624f91617de2689601e1c027)
- readme: added Upload Options [`d8352c5`](https://github.com/janreges/siteone-crawler/commit/d8352c5acfddbeef1c1ae6498556dc296d944e0b)
- upload exporter: added possibility via --upload to upload HTML report to offline URL, by default crawler.siteone.io/html/* [`2a027c3`](https://github.com/janreges/siteone-crawler/commit/2a027c38bfdb8e6e416b9a79ebe81e809c9326d9)
- parsed-url: fixed warning in the case of url without host [`284e844`](https://github.com/janreges/siteone-crawler/commit/284e844f3f94cdb02032ddb76e51caa9a584c120)
- seo and opengraph: fixed false positives 'DENY (robots.txt)' in some cases [`658b649`](https://github.com/janreges/siteone-crawler/commit/658b6494130fa282505ec38f12aa058acf7709b9)
- best practices and inline-svgs: detection and display of the entire icon set in the HTML report in the case of <svg> with more <symbol> or <g> [`3b2772c`](https://github.com/janreges/siteone-crawler/commit/3b2772c59f822b7b4a6f91e15b616815b5ff92c4)
- sitemap generator: sort urls primary by number of dashes and secondary alphabetically (thanks to this, urls of the main levels will be at the beginning) [`bbc47e6`](https://github.com/janreges/siteone-crawler/commit/bbc47e6239f9693c621016a50e624698dc3d242d)
- sitemap generator: only include URLs from the same domain as the initial URL [`9969254`](https://github.com/janreges/siteone-crawler/commit/9969254e35cd8c134f85a7817de8722091f0377c)
- changelog: updated by 'composer changelog' [`0c67fd4`](https://github.com/janreges/siteone-crawler/commit/0c67fd4f8d308d8d51d5b912d9b82cc96fb6e4fb)
- package.json: used by auto-changelog generator [`6ad8789`](https://github.com/janreges/siteone-crawler/commit/6ad87895e5a8ab8bbce3d9cbf92ee5e8b8218cc0)
#### [v1.0.6](https://github.com/janreges/siteone-crawler/compare/v1.0.5...v1.0.6)
> 8 December 2023
- readme: removed bold links from the intro (it didn't look as good on GitHub as it did in the IDE) [`b675873`](https://github.com/janreges/siteone-crawler/commit/b6758733cde67f11322a2f82573b19ec1a0edc9d)
- readme: improved intro and gif animation with the real output [`fd9e2d6`](https://github.com/janreges/siteone-crawler/commit/fd9e2d69c8f940cfaa81ad7bab86f1a74f01b0da)
- http auth: for security reasons, we only send auth data to the same 2nd-level domain (and possibly its subdomains). With HTTP basic auth, the name and password are only base64-encoded, and we would otherwise send them to foreign domains (which are linked from the crawled website) [`4bc8a7f`](https://github.com/janreges/siteone-crawler/commit/4bc8a7f9871064aa1c88c374aa299904409d2817)
- html report: increased the specificity of the .header class for the header, because this class was also used by the generic <td class='header'> in the security tab [`9d270e8`](https://github.com/janreges/siteone-crawler/commit/9d270e884545d6459f20348db71404e513ae8928)
- html report: improved readability of badge colors in light mode [`76c5680`](https://github.com/janreges/siteone-crawler/commit/76c5680397446b84f3b13800590d914b7a9b0533)
- crawler: moved the decrement of active workers to after parsing URLs from the content, where further filling of the queue could occur (for this reason, queue processing could sometimes get stuck in the final stages) [`f8f82ab`](https://github.com/janreges/siteone-crawler/commit/f8f82ab61c1969952bb70f1b598ed3d97938a84e)
- analysis: do not parse/check empty HTML (it produced an unnecessary warning) - it is valid to have content-type: text/html but with content-length: 0 (for example 'gtm.js?id=') [`436d81b`](https://github.com/janreges/siteone-crawler/commit/436d81b81f905178fb972f8b5cd0236bac244bc4)
#### [v1.0.5](https://github.com/janreges/siteone-crawler/compare/v1.0.4...v1.0.5)
> 3 December 2023
- changelog: updated changelog after 3 added commits to still untagged draft release 1.0.5 [`f42fe18`](https://github.com/janreges/siteone-crawler/commit/f42fe18de89676dc0dea4dc033207c934282d04b)
- utils tests: fixed tests of methods getAbsolutePath() and getOutputFormattedPath() [`d4f4576`](https://github.com/janreges/siteone-crawler/commit/d4f4576ff566eb48495c9fb55a898b0989ef42c3)
- crawler.php: replaced preg_match to str_contains [`5b28952`](https://github.com/janreges/siteone-crawler/commit/5b289521cdbb90b6571a29cb9c880e065b852129)
- version: 1.0.5.20231204 + changelog [`7f2e974`](https://github.com/janreges/siteone-crawler/commit/7f2e9741fab25e9369151bc2d79a38b8827e2463)
- option: replace placeholders like '%domain' also in the validateValue() method, because there is also a check whether the path is writable, with an attempt to mkdir [`329143f`](https://github.com/janreges/siteone-crawler/commit/329143fa23925ea523504735b3f724c026fe5ac6)
- swoole in cygwin: improved getBaseDir() to work better even with the version of Swoole that does not have SCRIPT_DIR [`94cc5af`](https://github.com/janreges/siteone-crawler/commit/94cc5af4411a8c7427ee136a937ac629b8637668)
- html processor: it must also process the page with the redirect, because it is needed to replace the URL in the meta redirect tag [`9ce0eee`](https://github.com/janreges/siteone-crawler/commit/9ce0eeeebe1e524b9d46d91dd4cecb2e796db8c3)
- sitemap: use the formatted output path (primarily for better output in the Cygwin environment with the needed C:/foo <-> /cygdrive/c/foo conversion) [`6297a7f`](https://github.com/janreges/siteone-crawler/commit/6297a7f4069f9e09c013268e0df896db2fa91dec)
- file exporter: use the formatted output path (primarily for better output in the Cygwin environment with the needed C:/foo <-> /cygdrive/c/foo conversion) [`426cfb2`](https://github.com/janreges/siteone-crawler/commit/426cfb2b32f854d65abfce841e4e4f4badf04fef)
- options: in the case of dir/file validation, we want to work with absolute paths for more precise error messages [`6df228b`](https://github.com/janreges/siteone-crawler/commit/6df228bdfc87a2c9fb6eee611fdc87d976b7f721)
- crawler.php: improved baseDir detection - we want to work with absolute path in all scenarios [`9d1b2ce`](https://github.com/janreges/siteone-crawler/commit/9d1b2ce9bedb15ede90bcee9641e1cfc62b9c3cc)
- utils: improved getAbsolutePath() for cygwin and added getOutputFormattedPath() with reverse logic for cygwin (C:/foo/bar <-> /cygdrive/c/foo/bar) [`161cfc5`](https://github.com/janreges/siteone-crawler/commit/161cfc5c4fd3fa3675cade409d7d5e11db2da0c6)
- offline export: renamed --offline-export-directory to --offline-export-dir for consistency with --http-cache-dir or --result-storage-dir [`26ef45d`](https://github.com/janreges/siteone-crawler/commit/26ef45d145a1a02a5313067e6298571e26d9618b)
#### [v1.0.4](https://github.com/janreges/siteone-crawler/compare/v1.0.3...v1.0.4)
> 30 November 2023
- dom parsing: handle warnings when some DOM elements cannot be parsed correctly, fixes #3 [`#3`](https://github.com/janreges/siteone-crawler/issues/3)
- version: 1.0.4.20231201 + changelog [`8e15781`](https://github.com/janreges/siteone-crawler/commit/8e15781265cdd9cce10d9dcde57d46b57b50e1cf)
- options: ignore empty values for directives that can be defined repeatedly [`5e30c2f`](https://github.com/janreges/siteone-crawler/commit/5e30c2f8ad6cf00ad819ba1d7d6ec4e6c95a7113)
- http-cache: now the http cache is turned off using the 'off' value (it's more understandable) [`9508409`](https://github.com/janreges/siteone-crawler/commit/9508409fbba2d96dc92cd73bed5abe462d5cea15)
- core options: added --console-width to enforce the definition of the console width and disable automatic detection via 'tput cols' on macOS/Linux or 'mode con' on Windows (used by Electron GUI) [`8cf44b0`](https://github.com/janreges/siteone-crawler/commit/8cf44b06616e15301c486146a7c6b1003ce5137f)
- gui support: added base-dir detection for Windows where the GUI crawler runs in Cygwin [`5ce893a`](https://github.com/janreges/siteone-crawler/commit/5ce893a66c7f1e21af025603b66223e04246e029)
- renaming: renamed 'siteone-website-crawler' to 'siteone-crawler' and 'SiteOne Website Crawler' to 'SiteOne Crawler' [`64ddde4`](https://github.com/janreges/siteone-crawler/commit/64ddde4b53f16679a8c4671c98b3f9c619d94b42)
- utils: fixed color-support detection [`62dbac0`](https://github.com/janreges/siteone-crawler/commit/62dbac07d15ecfa0ff677c277e2a3381a47025bf)
- core options: added --force-color options to bypass tty detection (used by Electron GUI) [`607b4ad`](https://github.com/janreges/siteone-crawler/commit/607b4ad8583845adea209f75edfa27870ac23f9d)
- best practice analysis: when checking an image (e.g. for the existence of WebP/AVIF), we also want to check external images, because websites very often link images from external domains or from image modification/optimization services [`6100187`](https://github.com/janreges/siteone-crawler/commit/6100187347e0bbba6270335e2d9b2faf37475333)
- html report: set scaleDown as default object-fit for image gallery [`91cd300`](https://github.com/janreges/siteone-crawler/commit/91cd300dcd7455c2b9be548fb2746cea7fd7c904)
- offline exporter: added short -oed as alias to --offline-export-directory [`22368d9`](https://github.com/janreges/siteone-crawler/commit/22368d9a892aab8011aa4a0884bf01a8560f6167)
- image gallery: list of all images on the website (except those from srcset, where there would only be duplicates in other sizes or formats), including SVG, with rich filtering options (by image format, size and source tag/attribute) and the option of choosing small/medium/view and scale-down/contain/cover for the object-fit CSS property [`43de0af`](https://github.com/janreges/siteone-crawler/commit/43de0af1c60d398f91b373c192d1a35ac2df2fd1)
- core options: added a shortened version of the command name consisting of only one hyphen and the first letters of the words of the full command (e.g. --memory-limit has short version -ml), added getInitialScheme() [`eb9a3cc`](https://github.com/janreges/siteone-crawler/commit/eb9a3cc62dffc58be2701c52bb21509d39a5dfad)
- visited url: added 'sourceAttr' with information about where the given URL was found and useful helper methods [`6de4e39`](https://github.com/janreges/siteone-crawler/commit/6de4e39c5f8b9ba685e3865193274ccf0ee91a3d)
- found urls: when one URL occurs in several places/attributes, we consider the first one to be the main one (typically the same URL in src and then also in srcset) [`660bb2b`](https://github.com/janreges/siteone-crawler/commit/660bb2b2bd2cb6949fe9c573e72b31e9fb97a9fe)
- url parsing: added more recognition of which attributes the given URL address was parsed from (we need to recognize src and srcset for ImageGallery in particular) [`802c3c6`](https://github.com/janreges/siteone-crawler/commit/802c3c66a40087745e68f47392f0e6e8e9725171)
- supertable and urls: when removing the redundant hostname for a more compact URL output, we also take into account the http:// or https:// scheme of the initial URL (otherwise it sometimes looked like a duplicate) + prevention of ANSI color definitions for bash in the HTML output [`915469e`](https://github.com/janreges/siteone-crawler/commit/915469e2a4a6d0fed337ca70efe9170758751ade)
- title/description/keywords parsing: added HTML entity decoding because some websites use encoded entities (for í, –, etc.) [`920523d`](https://github.com/janreges/siteone-crawler/commit/920523d3c55baf6cd7b2602334d9776b3e40f4d7)
- crawler: added 'sourceAttr' to the swoole table queue and to already visited URLs (we will use it in the Image Gallery for filtering, so as not to display a lot of unnecessary duplicate images from the srcsets that differ only in resolution) [`0345abc`](https://github.com/janreges/siteone-crawler/commit/0345abc6dab770e3196dd88ff0123a2050828644)
- url parameter: it is now possible to omit the scheme, and https:// or http:// will be added automatically (http:// e.g. for localhost) [`85e14e9`](https://github.com/janreges/siteone-crawler/commit/85e14e961b53b83c208ac936972a335cace61bf8)
- disabled images: in the case of a request to remove images, replace their body with a 1x1px transparent GIF and place a semi-transparent hatch with the crawler logo as a background [`c1418c3`](https://github.com/janreges/siteone-crawler/commit/c1418c3154301fd3995dde421b066f16850203e7)
- url regex filtering: added an option that allows you to limit the list of crawled pages according to the declared regexps, while still allowing the crawler to download assets (js, css, images, fonts, documents, etc.) from any URL (but with respect to allowed domains) [`21e67e5`](https://github.com/janreges/siteone-crawler/commit/21e67e5be74050cd5b7c9998654ed66f18db4d85)
- img srcset parsing: because a valid URL can also contain a comma (and various dynamic parametric image generators use them), while in srcset comma+whitespace should separate multiple values, the srcset parsing now reflects this [`0db578b`](https://github.com/janreges/siteone-crawler/commit/0db578bda37c024b2b111c814e35c2107e4751ad)
- websocket server: added the --websocket-server option, which starts a parallel process with a websocket server through which the crawler sends various information about crawling progress (this will also be used by Electron UI applications) [`649132f`](https://github.com/janreges/siteone-crawler/commit/649132f8965421cd1bb3570fbb9f534e6caef313)
- http client: handle scenario when content loaded from cache is not valid (is_bool) [`1ddd099`](https://github.com/janreges/siteone-crawler/commit/1ddd099ecdadc5752016237ec1f0acf80e907dc8)
- HTML report: updated logo with final look [`2a3bb42`](https://github.com/janreges/siteone-crawler/commit/2a3bb428180067a649f2467419920b3d4f70a9fd)
- mailer: shortening and simplifying email content [`e797107`](https://github.com/janreges/siteone-crawler/commit/e7971071f8c5e4cff1472464ce9ec4407c198a59)
- robots.txt: added info about loaded robots.txt to the summary (limited to 10 domains for the case of huge multi-domain crawling) [`00f9365`](https://github.com/janreges/siteone-crawler/commit/00f93659637705bc6389c5f073a29f09b743370f)
- redirects analyzer: handled edge case with empty url [`e9be1e3`](https://github.com/janreges/siteone-crawler/commit/e9be1e350b1d114c54b7099b54277da23467b538)
- text output: added fancy banner with crawler logo (thanks to great SiteOne designers!) and smooth effect [`e011c35`](https://github.com/janreges/siteone-crawler/commit/e011c35f3cbc87fceb9d7a9c56c726817c79b543)
- content processors: added applyContentChangesBeforeUrlParsing() and better NextJS chunks handling [`e5c404f`](https://github.com/janreges/siteone-crawler/commit/e5c404f2d52a7c2ebdb80ae3c93760c7e881dc9a)
- url searches: added ignoring data:, mailto:, tel:, file:// and other non-requestable resources also to FoundUrls [`5349be2`](https://github.com/janreges/siteone-crawler/commit/5349be242f99567b8f5f093537a696ef5fd319ac)
- crawler: added declare(strict_types=1) and banner [`27134d2`](https://github.com/janreges/siteone-crawler/commit/27134d29d16e3e24c633f010f731f11deeeadcb7)
- heading structure analysis: highlighting and calculating errors for duplicate <h1> + added help cursor with a hint [`f5c7db6`](https://github.com/janreges/siteone-crawler/commit/f5c7db6206ed06e0cbaf38a7ae2505be573da2e6)
- core options: added --help and --version, colorized help [`6f1ada1`](https://github.com/janreges/siteone-crawler/commit/6f1ada112898580d2de028c02e32fdeb8ad2a845)
- ./crawler binary - send the output of 'cd -' to /dev/null and hide the unwanted printed script path [`16fe79d`](https://github.com/janreges/siteone-crawler/commit/16fe79d08e24c4a6fbd87d16417413725aaa24e8)
- README: updated paths in the documentation [`86abd99`](https://github.com/janreges/siteone-crawler/commit/86abd998da94971c2512b6018085f39e8dd5db7f)
- options: --workers default for Cygwin runtime is now 1 (instead of 3), because Cygwin runtime is highly unstable when workers > 1 [`f484960`](https://github.com/janreges/siteone-crawler/commit/f4849606fb382e1b759f547c4f1bfe2e5d8b4d02)
#### [v1.0.3](https://github.com/janreges/siteone-crawler/compare/v1.0.2...v1.0.3)
> 10 November 2023
- version: 1.0.3.20231110 + changelog [`5b80965`](https://github.com/janreges/siteone-crawler/commit/5b8096550dcd489a998d34fae44e3d99375e33e3)
- cache/storage: better race-condition handling in a situation where several coroutines could write to the same folder at the same time, after which mkdir reported 'File exists' [`be543dc`](https://github.com/janreges/siteone-crawler/commit/be543dc195e675e49064b20ee091903f1977942a)
#### [v1.0.2](https://github.com/janreges/siteone-crawler/compare/v1.0.1...v1.0.2)
> 10 November 2023
- version: 1.0.2.20231110 + changelog [`230b947`](https://github.com/janreges/siteone-crawler/commit/230b9478a36ee664dfe080447c09da9c4a9bc25c)
- html report: added aria labels to active/important elements [`a329b9d`](https://github.com/janreges/siteone-crawler/commit/a329b9d4e0f040996c17cb3382cf3c07c61a4b35)
- version: 1.0.1.20231109 - changelog [`50dc69c`](https://github.com/janreges/siteone-crawler/commit/50dc69c9ab956691bbf97860355d410a0bdba0c9)
#### [v1.0.1](https://github.com/janreges/siteone-crawler/compare/v1.0.0...v1.0.1)
> 9 November 2023
- version: 1.0.1.20231109 [`e213cb3`](https://github.com/janreges/siteone-crawler/commit/e213cb326db78e2f69fd3e4f04b9728223550a3d)
- offline exporter: fixed the case when an https:// website links to the same path but with the http:// protocol (it overwrote the proper *.html file with just a meta redirect .. real case from nextjs.org) [`4a1be0b`](https://github.com/janreges/siteone-crawler/commit/4a1be0bdfb62167c498f6c3b4c91fe74532ff833)
- html processor: force removal of all anchor listeners when NextJS is detected (it is very hard to achieve a working NextJS offline over the file:// protocol) [`2b1d935`](https://github.com/janreges/siteone-crawler/commit/2b1d935419bade80d8e6ab07b2ae04ded0df131e)
- file exporters: now by default the crawler generates an html/json/txt report to 'tmp/[report|output].%domain%.%datetime%.[html|json|txt]' .. I assume that most people will want to save/see them [`7831c6b`](https://github.com/janreges/siteone-crawler/commit/7831c6b87dd41444a0fca529bc450bf7934ef541)
- security analysis: removed multi-line console output for recommendations .. it was ugly [`310af30`](https://github.com/janreges/siteone-crawler/commit/310af308859dbb2fd5895af468195e2339f2788d)
- json output: added JSON_UNESCAPED_UNICODE for unescaped unicode chars (e.g. czech chars will be readable) [`cf1de9f`](https://github.com/janreges/siteone-crawler/commit/cf1de9f60820963ccb78a00b43ca3aec8b311a77)
- mailer: do not send e-mails in case of interruption of the crawler using ctrl+c [`19c94aa`](https://github.com/janreges/siteone-crawler/commit/19c94aac8211b4550ba11497e1332d604f8cdbc7)
- refactoring: manager stats logic extracted into ManagerStats and implemented also into manager of content processors + stats added into 'Crawler stats' tab in HTML report [`3754200`](https://github.com/janreges/siteone-crawler/commit/3754200652dc91ac05efe22812e64c0e4be84019)
- refactoring: content related logic extracted to content processors based on ContentProcessor interface with methods findUrls():?FoundUrls, applyContentChangesForOfflineVersion():void and isContentTypeRelevant():bool + better division of web framework related logic (NextJS, Astro, Svelte, ...) + better URL handling and maximized usage of ParsedUrl [`6d9f25c`](https://github.com/janreges/siteone-crawler/commit/6d9f25ce82f8a1cfbfbc6bc0b5a6a07262c427b1)
- phpstan: ignore BASE_DIR warning [`6e0370a`](https://github.com/janreges/siteone-crawler/commit/6e0370aafe02d3bb2ca528ea8a9a37995f5ddce6)
- offline website exporter: improved export of a website based on NextJS, but it's not perfect, because the latest NextJS versions do not have some JS/CSS paths in the code; they are generated dynamically from arrays/objects [`c4993ef`](https://github.com/janreges/siteone-crawler/commit/c4993efcb97f7058834713ed273f9c4274be5cad)
- seo analyzer: fixed trim() warning when no <h1> found [`f0c526f`](https://github.com/janreges/siteone-crawler/commit/f0c526f5d2ff7d0155c1bfc7da7a6c0f2f7a1419)
- offline export: a lot of improvements when generating the offline version of the website on NextJS - chunk detection from the manifest, replacing paths, etc. [`98c2e15`](https://github.com/janreges/siteone-crawler/commit/98c2e15acf4e22d25301d160968555c19ddd44cc)
- seo and og: fixed division by zero when no og/twitter tags found [`19e4259`](https://github.com/janreges/siteone-crawler/commit/19e4259c519a3e41eb7aa8eabce80e6364e74639)
- console output: lots of improvements for nice, consistent and minimal word-wrap output [`596a5dc`](https://github.com/janreges/siteone-crawler/commit/596a5dc17945359ffc0fef2ed8ed8ee8bfc1db00)
- basic file/dir structure: created ./crawler (for Linux/macOS) and ./crawler.bat for Windows, init script moved to ./src, small related changes about file/dir path building [`5ce41ee`](https://github.com/janreges/siteone-crawler/commit/5ce41ee8e78425747bf40327152bd99499c64013)
- header status: ignore too dynamic Content-Disposition header [`4e0c6fd`](https://github.com/janreges/siteone-crawler/commit/4e0c6fdf5c356f8c0eea78ccebe29641b90f96b4)
- offline website exporter: added .html extensions to typical dynamic language extensions, because without it the browser will show them as source code [`7130b9e`](https://github.com/janreges/siteone-crawler/commit/7130b9eb666eca5b08c9dbeda91198bc85b31379)
- html report: show tables with details, even if they are without data (it is good to know that the checks were carried out, but nothing was found) [`da019e4`](https://github.com/janreges/siteone-crawler/commit/da019e4591682c21e9f78de1ec26939088d92ccc)
- tests: repaired tests after last changes of file/url building for offline website .. merlot is great! [`7c77c41`](https://github.com/janreges/siteone-crawler/commit/7c77c411ff67c01e07d16cb2acce0e926b264fcd)
- utils: be more precise and do not replace attributes in SVG .. creative designers will not love you when looking at the broken SVG in HTML report [`3fc81bb`](https://github.com/janreges/siteone-crawler/commit/3fc81bb0c47eef2935da2e74721a809a9aff0959)
- utils: be more precise in parsing phone numbers, otherwise people will 'love' you because of false positives .. wine is still great [`51fd574`](https://github.com/janreges/siteone-crawler/commit/51fd574c764d832d74cb5e67eed890bd9d349a5c)
- html parser: better support for formatted html with tags/attributes on multiple lines [`89a36d2`](https://github.com/janreges/siteone-crawler/commit/89a36d2fcf3d96b61c4b3d2e20d5a46f4cb96cb8)
- utils: don't be hungry in stripJavaScript() because you ate half of my html :) wine is already in my head... [`0e00957`](https://github.com/janreges/siteone-crawler/commit/0e0095727638b7940d2e555a6be231ad3dde19e4)
- file result storage: changed cache directory structure for consistency with http client's cache, so it looks like my.domain.tld-443/04/046ec07c.cache [`26bf428`](https://github.com/janreges/siteone-crawler/commit/26bf428f95bc428485d7cf505e74c8a69c94d869)
- http client cache: for better consistency with result storage cache, directory structure now contains also port, so it looks like my.domain.tld-443/b9/b989bdcf2b9389cf0c8e5edb435adc05.cache [`a0b2e09`](https://github.com/janreges/siteone-crawler/commit/a0b2e09d01e36aed56c0208a8001d616755de096)
- http client cache: improved directory structure for large scale and better orientation for partial cache deleting.. current structure in tmp dir: my.domain.tld/b9/b989bdcf2b9389cf0c8e5edb435adc05.cache [`10e02c1`](https://github.com/janreges/siteone-crawler/commit/10e02c189297f28ea563ba6f3792462c2d6790ea)
- offline website exporter: better srcset handling - urls can be defined with or without sizes [`473c1ad`](https://github.com/janreges/siteone-crawler/commit/473c1ad0d753df209aa160b0d90687c4bff21912)
- html report: blue color for search term, looks better [`cb47df9`](https://github.com/janreges/siteone-crawler/commit/cb47df98e230c0375dbcb14c278250709bf3644a)
- offline website exporter: handled situation of the same-name folder/file when both the folder /foo/next.js/ and the file /foo/next.js existed on the website (real case from vercel.com) [`7c27d2c`](https://github.com/janreges/siteone-crawler/commit/7c27d2c2277dd134615563ee4eaa706ec0ee7485)
- exporters: added exec times to summary messages [`41c8873`](https://github.com/janreges/siteone-crawler/commit/41c8873dc33d7f08d91f77d71fcf1bf2fafa30ae)
- crawler: use the port from the URL if defined, or derive it from the scheme .. the previous solution didn't work properly for localhost:port and for URLs parsed to external websites [`324ba04`](https://github.com/janreges/siteone-crawler/commit/324ba04267b962a56817dd10e3ecba7777702aa2)
- heading analysis: changed sorting to DESC by errors, renamed Headings structure -> Heading structure [`dbc1a38`](https://github.com/janreges/siteone-crawler/commit/dbc1a38f33d4094aebe64020531518538e2b3baf)
- security analysis: detection and ignoring of URLs that point to a non-existent static file but return 404 HTML, better description [`193fb7d`](https://github.com/janreges/siteone-crawler/commit/193fb7dcf1f994aba69b646576bf7c6f8701a975)
- super table: added escapeOutputHtml property to column for better escape managing + updated related supertables [`bfb901c`](https://github.com/janreges/siteone-crawler/commit/bfb901cb82b9cda81198df0dc87885b5eceb5c93)
- headings analysis: replaced usage of DOMNode->textContent, because when the headings contain other tags, including <script>, textContent also contains the JS code but without the <script> tag [`5c426c2`](https://github.com/janreges/siteone-crawler/commit/5c426c24969a063aa3366da02520025733cf16e7)
- best practices: better missing quotes detection and minimizing false positives in special cases (HTML/JS in attributes, etc.) [`b03a534`](https://github.com/janreges/siteone-crawler/commit/b03a5345e7f71f880ee4d36fb9f51c230d8c772f)
- best practices: better SVG detection and minimizing false positives (e.g. code snippets with SVG), improved look in HTML report and better descriptions [`c35f7e2`](https://github.com/janreges/siteone-crawler/commit/c35f7e226f6cd384e5c8cf4b9af3a1a0d3be4cfc)
- headers analysis: added [ignored generic values] or [see values below] for specific headers [`a7b444d`](https://github.com/janreges/siteone-crawler/commit/a7b444dab0e1c3949abfa0e0746db18343b9b55d)
- core options: changed --hide-scheme-and-host to --show-scheme-and-host (hiding scheme+host by default is better) [`3c202e9`](https://github.com/janreges/siteone-crawler/commit/3c202e998a824f97b6f481575a24e2924c9dc663)
- truncating: replaced '...' with '…' [`870cf8c`](https://github.com/janreges/siteone-crawler/commit/870cf8cd447fd14e389d76bcc8853b1e691f5349)
- accessibility analyzer: better descriptions [`514b471`](https://github.com/janreges/siteone-crawler/commit/514b47124d101cd4f0bd67148f41ea5644febd62)
- crawler & http client: if the response is loaded from the cache, we do not wait due to rate limiting - very useful for repeated executions [`61fbfab`](https://github.com/janreges/siteone-crawler/commit/61fbfab34ba07c1856099051b8f68dc76b1adf09)
- header stats: added missing strval in values preview [`9e11030`](https://github.com/janreges/siteone-crawler/commit/9e1103064af0962ed4963cace61bf7ad201d19a2)
- content type analyzer: increased column width for MIME type from 20 to 26 (enough for application/octet-stream) [`c806674`](https://github.com/janreges/siteone-crawler/commit/c806674ee82d0aba90a9d61e10ff2b5e2cf6c813)
- SSL/TLS analyzer: fixed issues on Windows with Cygwin where nslookup does not work reliably [`714b9e1`](https://github.com/janreges/siteone-crawler/commit/714b9e12a2426574731b62d460c98f1fed95aa18)
- text output: removed redundant whitespaces from banner after .YYYYMMDD was added to the version number [`8b76205`](https://github.com/janreges/siteone-crawler/commit/8b76205b41ca9cbf4dd32e7d908f4fe932c4a2a3)
- readme: added link to #ready-to-use-releases to summary [`574b39e`](https://github.com/janreges/siteone-crawler/commit/574b39e836794c98e7be8ceaa81d1ab0c50ab149)
- readme: added section Ready-to-use releases [`44d686b`](https://github.com/janreges/siteone-crawler/commit/44d686b910a36747d002ec2886b85c22be5c4864)
- changelog: added changelog by https://github.com/cookpete/auto-changelog/tree/master + added 'composer changelog' [`d11af7e`](https://github.com/janreges/siteone-crawler/commit/d11af7e4d847362276e1dd4cec3c25cad38263fb)
#### v1.0.0
> 7 November 2023
- proxy: added support for --proxy=<host:port>, closes #1 [`#1`](https://github.com/janreges/siteone-crawler/issues/1)
- license: renamed to LICENSE.md [`c0f8ec2`](https://github.com/janreges/siteone-crawler/commit/c0f8ec22a68741b1740981dc98bdec13d8e5182a)
- license: added license CC 4.0 BY [`bd5371b`](https://github.com/janreges/siteone-crawler/commit/bd5371b99363fbb5de29c33f0fcc572d154e467d)
- version: set v1.0.0.20231107 [`bdbf2be`](https://github.com/janreges/siteone-crawler/commit/bdbf2be97e68cfa01fb992fb960c1c5313d5780f)
- version: set v1.0.0 [`a98e61e`](https://github.com/janreges/siteone-crawler/commit/a98e61e161652861541743df6fe1d8c55be446f9)
- SSL/TLS analyzer: uncolorize valid-to in summary item, phpstan fixes (non-functional changes) [`88d1d9f`](https://github.com/janreges/siteone-crawler/commit/88d1d9fec8bc29cd26ab88c18d6c122939b59bba)
- content type analyzer: added table with MIME types [`b744f13`](https://github.com/janreges/siteone-crawler/commit/b744f139e417b625bd22ea282f744b55406853b1)
- seo analysis: added TOP10 non-unique titles and descriptions to tab SEO and OpenGraph + badges [`4ae14c1`](https://github.com/janreges/siteone-crawler/commit/4ae14c13be5163704c2c6a2d55d75bc83f41f801)
- html report: increased sidebar width to prevent wrapping in the case of higher numbers in badges [`c5c8f4c`](https://github.com/janreges/siteone-crawler/commit/c5c8f4cae991bbdd6b6a8a7fab6cbaae1c199344)
- dns analyzer: increased column size to prevent auto-truncation of dns/ip addresses [`b4d4127`](https://github.com/janreges/siteone-crawler/commit/b4d4127b2b67efd63fff53ae0ad27b6c9a987501)
- html report: fixed badge with errors on DNS and SSL tab [`e290403`](https://github.com/janreges/siteone-crawler/commit/e29040349ac4966b22842e52ee4c102a67f9860c)
- html report: ensure that no empty tabs will be in report (e.g. in case where all analyzers will be deactivated by --analyzer-filter-regex='/anything/') [`6dd5bcc`](https://github.com/janreges/siteone-crawler/commit/6dd5bcc67d215bca085ef75cb98398aa162ce5fa)
- html report: improved replacement of non-badged cells to transparent badge for better alignment [`172a074`](https://github.com/janreges/siteone-crawler/commit/172a074c519a55c492d2b72250232e23749cd75b)
- html report: increased visible part of long tables from 500px to 658px (based on typical sidebar height), updated title [`0be355f`](https://github.com/janreges/siteone-crawler/commit/0be355f5474ad6aff461ac3362127569d29eac22)
- utils: selected better colors for ansi->html conversion [`6c2a8e3`](https://github.com/janreges/siteone-crawler/commit/6c2a8e364790e2cdb338f164c572aafd9e3db6c1)
- SSL/TLS analyzer: evaluation and hints about unsafe or recommended protocols, from-to validation, colorized output [`5cea1fe`](https://github.com/janreges/siteone-crawler/commit/5cea1fe51d500db433c4d86fe5fa8660d2ef2a14)
- SEO & OpenGraph analyzers: refactored class names, headings structure moved to own tab, other small improvements [`75a9724`](https://github.com/janreges/siteone-crawler/commit/75a97245af1e896ab3304891dd4459873ad3a26f)
- security analyzer: better vulnerability explanations and better output formatting [`ee172cb`](https://github.com/janreges/siteone-crawler/commit/ee172cb25073e2e5452b38d5a6c52802e9585bcc)
- summary: selected more suitable icons from the utf-8 set that work well in the console and HTML [`ef67483`](https://github.com/janreges/siteone-crawler/commit/ef67483827755895f0edf3149f4f106d28ba1942)
- header stats: addValue() can accept both string and array [`a0d746b`](https://github.com/janreges/siteone-crawler/commit/a0d746ba9f956c03cb4ad1bddee14a26951ff86d)
- headers & redirects - text improvements [`3ac9010`](https://github.com/janreges/siteone-crawler/commit/3ac9010c33e9048f1b3d24182232ae182ae681ca)
- dns analyzer: colorized output and added info about CNAME chain into summary [`7dd1f8a`](https://github.com/janreges/siteone-crawler/commit/7dd1f8ac1eafcdcd92f651d397b561f6383fdcfc)
- best practices analyzer: added SVG sanitization to prevent XSS, fine-tuning of missing quotes detection, typos [`4dc1eb5`](https://github.com/janreges/siteone-crawler/commit/4dc1eb592de3631f61ed67dfb87466a95462d5f3)
- options: added extras option, e.g. for number range validation [`760a865`](https://github.com/janreges/siteone-crawler/commit/760a865082a7cd5f8e439f3fc9094fb7503a78be)
- seo and socials: small type-hint and phpstan fixes [`bf695be`](https://github.com/janreges/siteone-crawler/commit/bf695be5fa859ca49bef67fb6511039e4301bb34)
- best practice analyzer: added found depth to messages about too deep DOM depth [`220b43c`](https://github.com/janreges/siteone-crawler/commit/220b43c77a6d4747a29cf483e11a985dc07ac460)
- analysis: added SSL/TLS analyzer with info about SSL certificate, its validity, supported protocols, issuer .. in the report SSL/TLS info are under tab 'DNS and TLS/SSL' [`3daf175`](https://github.com/janreges/siteone-crawler/commit/3daf1757e1eee765ea3d6b2dca1ed55ffb694d4a)
- super table: show fulltext only for >= 10 rows + visible height of the table in HTML shorten to 500px/20 rows and show 'Show entire table' link .. implemented only with HTML+CSS, so that it also works on devices without JS (e.g. e-mail browser on iOS) [`7fb9e52`](https://github.com/janreges/siteone-crawler/commit/7fb9e52de2514b0fc1a11032238de815f76acb37)
- analysis: added seo & sharing analysis - meta info (title, h1, description, keywords), OG/Twitter data, heading structure details [`53e12e6`](https://github.com/janreges/siteone-crawler/commit/53e12e63102d70b0329194493599523808758716)
- best practices: added checks for WebP and AVIF images [`0ccabc6`](https://github.com/janreges/siteone-crawler/commit/0ccabc633cdae4b7ef7b03aad22ab8cfab1a590f)
- best practices: added brotli support reporting to tables [`7ff2c53`](https://github.com/janreges/siteone-crawler/commit/7ff2c53e56705c19de77d54db578338252007b99)
- super table: added option to specify whether the table should be displayed on the output to the console, html or json [`6bb6217`](https://github.com/janreges/siteone-crawler/commit/6bb62177522a61bab1673b9d5f19e18f50bd54a3)
- headers analysis: analysis of HTTP headers of all requests to the main domain, their detailed breakdown, values and statistics [`1fcc1db`](https://github.com/janreges/siteone-crawler/commit/1fcc1dba38a3ac41f0547a4f11a2aef9af1d876f)
- analysis: fixed search of attributes with missing quotes [`3db31b9`](https://github.com/janreges/siteone-crawler/commit/3db31b9c01317d8c8ac6eba6b98679be79982c3e)
- super table: added the number of found/displayed lines next to the full text [`6e7f3d4`](https://github.com/janreges/siteone-crawler/commit/6e7f3d4b4de0cfa378920c9389291a9902c0c486)
- super table: removed setting column widths for HTML table - works best without forcing widths [`2a785e7`](https://github.com/janreges/siteone-crawler/commit/2a785e70b675ef681b005042a50b289b3b29d600)
- html report: even wider content of the report is allowed, for better functioning for high-resolution displays [`363990c`](https://github.com/janreges/siteone-crawler/commit/363990c3566cb39d653ab2760df6bb4d2acd8149)
- pages 404: truncate too long urls [`082bae6`](https://github.com/janreges/siteone-crawler/commit/082bae6f28d2ba8296591a0885548faa0b38a59a)
- fixes: fixed various minor warnings related to specific content or parameters [`da1802d`](https://github.com/janreges/siteone-crawler/commit/da1802d82f8ccf2de3f4329bf3b952ebefeb3449)
- options: ignore extra comma or empty value in list [`3f5cab6`](https://github.com/janreges/siteone-crawler/commit/3f5cab68bc4981faea7b7bed30b9f687ea773830)
- super table: added useful fulltext search for all super tables [`50a4edf`](https://github.com/janreges/siteone-crawler/commit/50a4edf9caa69f67fdc21c3c32a92d201c211ccc)
- colors: more light color for badge.neutral in light mode because previous was too contrasting [`0dbad09`](https://github.com/janreges/siteone-crawler/commit/0dbad0920f8f8a9f14186f9513e3ea6793fcf297)
- colors: notice is now blue instead of yellow and severity order fix in some places (critical -> warning -> notice -> ok -> info) [`1b50b99`](https://github.com/janreges/siteone-crawler/commit/1b50b99ae079a4d1cdc350038e105d469dec524a)
- colors: changed gray color to more platform-consistent color, otherwise gray was too dark on macOS [`173c9bd`](https://github.com/janreges/siteone-crawler/commit/173c9bd211bf066b69bb3adbde487ec3e99f6da1)
- scripts: removed helper run.tests* scripts [`e9f0c8f`](https://github.com/janreges/siteone-crawler/commit/e9f0c8ff768042737bfab57b5d2270df995c611e)
- analysis: added table with detailed list of security findings and URLs [`5b9e0fe`](https://github.com/janreges/siteone-crawler/commit/5b9e0fe1c3a514941abf2e277bf3f2bd4e017004)
- analysis: added SecurityAnalyzer, which checks the existence and values of security headers and performs HTML analysis for common issues [`0cb7cb9`](https://github.com/janreges/siteone-crawler/commit/0cb7cb9daac5303227e31b72b0f6931218968bf7)
- http auth: added support for basic HTTP authentication by --http-auth=username:password [`147e004`](https://github.com/janreges/siteone-crawler/commit/147e0040e97f6ad37da7897813063cbb73302e22)
- error handling: improved behaviour in case of entering a non-existent domain or problems with DNS resolving [`5c08fb4`](https://github.com/janreges/siteone-crawler/commit/5c08fb4c82409863f73fcdcd66f9a0ba76206c5c)
- html report: implemented completely redesigned html report with useful information, with light/dark mode and possibility to sort tables by clicking on the header .. design inspired by Zanrly from Shuffle.dev [`05da14f`](https://github.com/janreges/siteone-crawler/commit/05da14f50b108deec4827c5c0324bbd1b9775b37)
- http client: fix of extension detection in the case of very non-standard or invalid URLs [`113faa5`](https://github.com/janreges/siteone-crawler/commit/113faa501016f14c017f5f1eaa586a6fae35efbf)
- options: increased default memory limit from 512M to 2048M + fixed refactored 'file-system' -> 'file' in docs for result storage [`1471b28`](https://github.com/janreges/siteone-crawler/commit/1471b2884bcbf1806a388e4ae85cc4f7e1bc11fe)
- utils: fix that date formats are not detected as a phone number in parsePhoneNumbersFromHtml() [`e4e1009`](https://github.com/janreges/siteone-crawler/commit/e4e10097f7e74816dd716d2713516d5ff8eef39a)
- strict types: added declare(strict_types=1) to all classes with related fixes and copyright [`92dd47c`](https://github.com/janreges/siteone-crawler/commit/92dd47c72e4f1aaa5a05187f60f2a9f0a5c285ee)
- dns analyzer: added information about the DNS of the given domain - shows the entire cname/alias chain as well as the final resolved IPv4/IPv6 addresses + tests [`199421d`](https://github.com/janreges/siteone-crawler/commit/199421df3c96e2f2bec20f45230cbd812e9fc21c)
- utils: helper function parsePhoneNumbersFromHtml() used in BestPracticeAnalyzer + tests [`09cc5fb`](https://github.com/janreges/siteone-crawler/commit/09cc5fbbbdf7f4a706ef912221e32d476fa397b4)
- summary consistency: forced dots at the end of each item in the summary list [`4758e38`](https://github.com/janreges/siteone-crawler/commit/4758e38c3b2ab73476516662129e3b6abd78ff44)
- crawler: more lenient parsing of title and meta tags .. e.g. even the title can contain other HTML attributes [`770b339`](https://github.com/janreges/siteone-crawler/commit/770b339fb7b6ac86af56a864feb184977974d37d)
- options: default timeout increased from 3 to 5 seconds .. after testing on a lot of websites, it makes better sense [`eb74207`](https://github.com/janreges/siteone-crawler/commit/eb7420736f5c4d353651ec39d8d030a8485e1486)
- super table: added option to force non-breakable spaces in column cells [`3500818`](https://github.com/janreges/siteone-crawler/commit/35008185064331d33c380e0643606f2dbaeb2b64)
- best practice analyzer: added measurement of individual steps + added checking of active links with phone numbers <a href="tel: 123..."> [`1bb39e8`](https://github.com/janreges/siteone-crawler/commit/1bb39e87a440975e8956fbf1d66b81ef1b424574)
- accessibility analyzer: added measurement of individual steps + removed DOMDocument parsing after refactoring [`2a7c49b`](https://github.com/janreges/siteone-crawler/commit/2a7c49b415dd2864cc37497d409cb083abb99df5)
- analysis: added option to measure the duration and number of analysis steps + the analyzeVisitedUrl() method already accepts DOMDocument (if HTML) so the analyzers themselves do not have to do it twice [`d8b9a3d`](https://github.com/janreges/siteone-crawler/commit/d8b9a3d8e0016ec4cc6da908a1bd9db39370e9da)
- super table: calculated auto-width can't be shorter than column name (label) [`b97484f`](https://github.com/janreges/siteone-crawler/commit/b97484f22d59bee04b935fa204d18c609ba8658c)
- utils: removed ungreedy flag from all regular expressions, it caused problems under some circumstances [`03fc202`](https://github.com/janreges/siteone-crawler/commit/03fc202ed2f30fe4bd2001e8fcaecbea5ca45f7e)
- phpstan: fixed all level 5 issues [`04c21aa`](https://github.com/janreges/siteone-crawler/commit/04c21aaeeed24117740fac22b5756363e3a4769d)
- phpstan: fixed all level 4 issues [`91fee49`](https://github.com/janreges/siteone-crawler/commit/91fee49a0aefa603c4dba9bc1f19d658a7ab413e)
- phpstan: fixed all level 3 issues [`2f7866a`](https://github.com/janreges/siteone-crawler/commit/2f7866a389b05e3c796e7f1f0bd7f6410a23cb05)
- phpstan: fixed all level 2 issues [`e438996`](https://github.com/janreges/siteone-crawler/commit/e4389962be4a476bdcacc6acc18f36c7037b90ee)
- phpstan: installed phpstan with level 2 for now [`b896e6c`](https://github.com/janreges/siteone-crawler/commit/b896e6c0552e4fd938088594a7d44d6af14fc809)
- tests: allowed nextjs.org for crawling (because it was blocked, a couple of tests incorrectly did not pass) [`cdc7f56`](https://github.com/janreges/siteone-crawler/commit/cdc7f5688f6aca0e822c3fa6daee6a3acd99eeeb)
- refactor: moved /Crawler/ into /src/Crawler/ + added file attachment support to mailer [`2f0d26c`](https://github.com/janreges/siteone-crawler/commit/2f0d26c7d2f7cb65495b375dd4b11bf7849888e2)
- sitemap exporter: renamed addErrorToSummary -> addCriticalToSummary [`e46e192`](https://github.com/janreges/siteone-crawler/commit/e46e1926df52a3edfc4137ebd8ede9dee8a45bf1)
- text output: added options --show-inline-criticals and --show-inline-warning which displays the found problems directly under the URL - the displayed table will be less clear, but the problems are clearly visible [`725b212`](https://github.com/janreges/siteone-crawler/commit/725b2124172710895d86503fd4a933e2ea91efaa)
- composer.json: added require declarations for ext-dom, ext-libxml (used in analyzers) and ext-zlib (used in cache/storages) [`3542cf0`](https://github.com/janreges/siteone-crawler/commit/3542cf03829e9a3c745e58e0df1bc2f6284d25ba)
- analysis: added accessibility and best practices analyzers with useful checks [`860316f`](https://github.com/janreges/siteone-crawler/commit/860316fa685509104462412aeb125417dceaee28)
- analysis: added AnalysisManager for better analysis control with the possibility to filter required analyzers using --analyzer-filter-regex [`150569f`](https://github.com/janreges/siteone-crawler/commit/150569fd20c380781ed5971cefd47308762a730a)
- result storage: options --result-storage, --result-storage-dir and --result-storage-compression for storage of response bodies and headers (by default is used memory storage but you can use file storage for extremely large websites) [`d2a8fab`](https://github.com/janreges/siteone-crawler/commit/d2a8fabcef72067500dfcb0065e87ebc4395dac3)
- http cache: added --http-cache-dir and --http-cache-compression parameters (by default http cache is on and set to 'tmp/http-client-cache' and compression is disabled) [`2eb9ed8`](https://github.com/janreges/siteone-crawler/commit/2eb9ed86d9d53b4735a3de3cf6d06b652818dbc0)
- super table: the currentOrderColumn is now optional - sometimes we want to leave the table sorted according to the input array [`4fba880`](https://github.com/janreges/siteone-crawler/commit/4fba880fcf137a6207df4c5177cf3ec80afaa3ae)
- analysis: replaced severity ok/warning/error with ok/notice/warning/critical - it made more sense for analyzers [`18dbaa7`](https://github.com/janreges/siteone-crawler/commit/18dbaa7a4a760874ba39c75af28f7e808fb8eb2e)
- analysis: added support for immediate analysis of visited URLs with the possibility to insert the analyzer's own columns into the main table [`004865f`](https://github.com/janreges/siteone-crawler/commit/004865f223c9ec688c4f522cd8f93d8022458130)
- content types: fixed json/xml detection [`00fc180`](https://github.com/janreges/siteone-crawler/commit/00fc1808838c7a191cc9986e884ffda26f841281)
- content type analyzer: decreased URLs column size from 6 to 5 - that's enough [`2eefbaf`](https://github.com/janreges/siteone-crawler/commit/2eefbafad24f68118a2efe8d6ddedc4d3d45b5cf)
- formatting: unification of duration formatting across the entire application [`412ee7a`](https://github.com/janreges/siteone-crawler/commit/412ee7ab5c5eda19dfc5492a6cc9edbb7c5969c6)
- super table: fixed sorting for array of arrays [`4829be8`](https://github.com/janreges/siteone-crawler/commit/4829be8f8e1d3f0d8201dedfa99d245453601422)
- source domains analyzer: minor formatting improvements [`2d32ced`](https://github.com/janreges/siteone-crawler/commit/2d32cedb59aa13e4e27a1dbe58eff586e4407cd9)
- offline website exporter: added info about successful export to summary [`92e7e46`](https://github.com/janreges/siteone-crawler/commit/92e7e46bdbc1f1cff329cf4aff5ee99dd70332e2)
- help: added red message about invalid CLI parameters also to the end of help output, because help is already too long [`6942e8f`](https://github.com/janreges/siteone-crawler/commit/6942e8f4535d748763a124207634ea7548bbfa83)
- super table: added column property 'formatterWillChangeValueLength' to handle situation with the colored text and broken padding [`7371a68`](https://github.com/janreges/siteone-crawler/commit/7371a68f11191b0b21307e6ca703e362f476b815)
- analyzers: setting a more meaningful analyzers order [`5e8f747`](https://github.com/janreges/siteone-crawler/commit/5e8f747392f291abdfb0140038c42fe84801955c)
- analyzers: added source domains analyzer with summary of domains and downloaded content types (number/size/duration) [`f478f17`](https://github.com/janreges/siteone-crawler/commit/f478f178fb2f79a81e5db89909951816ac6e1c9f)
- super table: added auto-width column feature [`d2c04de`](https://github.com/janreges/siteone-crawler/commit/d2c04dec3312d72ed373236d73f7a4d3bbf8c20d)
- renaming: '--max-workers' to '--workers' with possibility to use shortcut '-w=<num>' + adding possibility to use shortcut '-rps=<num>' for '--max-reqs-per-sec=<num>' [`218f8ff`](https://github.com/janreges/siteone-crawler/commit/218f8ffcca15550853bcb4ace44dedf260d1e735)
- extra columns: added ability to force columns to the required length via "!" + refactoring using ExtraColumn [`def82ff`](https://github.com/janreges/siteone-crawler/commit/def82ff3f5f11efa2e4ef812e086a5c8379ac962)
- readme: divided the features into several groups and restructured the docs accordingly [`c03d231`](https://github.com/janreges/siteone-crawler/commit/c03d2311b618f8aad165ffad39ae51989f60f846)
- offline exporter: export of the website to the offline form has already been fine-tuned (but not perfect yet), --disable-* options to disable JS/CSS/images/fonts/etc. and a lot of other related functionalities [`0d04a98`](https://github.com/janreges/siteone-crawler/commit/0d04a9805bdebea708eba44cc6680bd58995d559)
- crawler: added possibility to set speed via --max-reqs-per-sec (default 10) [`d57cc4a`](https://github.com/janreges/siteone-crawler/commit/d57cc4a39e6ce1882ee3233b015200382d90f06f)
- tests: dividing asserts for URL conversion testing into different detailed groups [`f6221cb`](https://github.com/janreges/siteone-crawler/commit/f6221cb5d3e5e844f146a95940479b20604c37cf)
- html url parser: added support for loading fonts from <link href='...'> [`4c482d1`](https://github.com/janreges/siteone-crawler/commit/4c482d1078fb535e4a3be96f6c3e7ded2ea02d65)
- manager: remove avif/webp support if OfflineWebsiteExporter is active - we want to use only long-supported jpg/png/gif on the local offline version [`3ec81d3`](https://github.com/janreges/siteone-crawler/commit/3ec81d338590ae16ee337cbbfa8a741e01b0522d)
- http response: transformation of the redirect to html with redirection through the <meta> tag [`8f6ff16`](https://github.com/janreges/siteone-crawler/commit/8f6ff161066a82af9ae91a738aae66327fe407b6)
- initiator: skip comments or empty arguments [`12f4c52`](https://github.com/janreges/siteone-crawler/commit/12f4c52b7fe0429926c2a6540e8842eae4882888)
- http client: added crawler signature to User-Agent and X-Crawler-Info header + added possibility to set Origin request header (otherwise some servers block downloading the fonts) [`ae4eaf3`](https://github.com/janreges/siteone-crawler/commit/ae4eaf3298e0bc94c1d913d08393426e380ba4ad)
- visited url: added isStaticFile() [`f1cd5e8`](https://github.com/janreges/siteone-crawler/commit/f1cd5e8e397b734dc3353db943c2928ff46cf520)
- crawler: increased pcre.backtrack_limit and pcre.recursion_limit (100x) to support longer HTML/CSS/JS [`35a6e9a`](https://github.com/janreges/siteone-crawler/commit/35a6e9a4729fffa7ee0a77b0be50621c4077a7b9)
- core options: renamed --headers-to-table to --extra-columns [`7c30988`](https://github.com/janreges/siteone-crawler/commit/7c30988fdecdaeb6aa89aed15a864a033c121d2f)
- crawler: added type for audio and xml + static cache for getContentTypeIdByContentTypeHeader [`386599e`](https://github.com/janreges/siteone-crawler/commit/386599e881051ae8c14b7ec9688690e50c0dd7dc)
- found urls: normalization of URL takes care of spaces + change of source type to int [`c3063a2`](https://github.com/janreges/siteone-crawler/commit/c3063a247f10bf00b8516eb2303bb85cab426c15)
- debugging: possibility to enable debugging through ParsedUrl [`979dc0e`](https://github.com/janreges/siteone-crawler/commit/979dc0e89af063b5ffe04b49275ceb0fa9191db2)
- offline url converter: class for solving the translation of URL addresses to offline/local + tests [`44118e6`](https://github.com/janreges/siteone-crawler/commit/44118e6bf96f6b25c7d8410084f76dfb3eb10188)
- url converter: TargetDomainRelation enum with tests [`fd6cf21`](https://github.com/janreges/siteone-crawler/commit/fd6cf216d903785adf46923ed2a805937f724d15)
- initiator: check only script basename in unknown args check [`888448f`](https://github.com/janreges/siteone-crawler/commit/888448fc9c598a7e8f750e746214b2834722b412)
- offline website export: to run the exporter is necessary to set --offline-export-directory [`33e9f95`](https://github.com/janreges/siteone-crawler/commit/33e9f952814b52bdfc7634cf4b9521d393b87417)
- offline website export: to run the exporter is necessary to set --offline-export-directory [`bcc007b`](https://github.com/janreges/siteone-crawler/commit/bcc007b6a3a9c0e9de23e76bd6f9150c7d2295c9)
- log & tmp: added .gitkeep for versioning of these folders - they are used by some optional features [`065f8ef`](https://github.com/janreges/siteone-crawler/commit/065f8ef27fabe889e8a35b98fd75ce260263d268)
- offline website export & tests: added the already well-functioning option to export the entire website to offline mode working from local static HTML files, including images, fonts, styles, scripts and other files (no documentation yet) + lot of related changes in Crawler + added first test testing some important functionalities about relative URL building [`4633211`](https://github.com/janreges/siteone-crawler/commit/463321199e6f9bac10b097e3f286da6a13f36906)
- composer & phpunit: added composer, phpunit and license CC BY 4.0 [`4979143`](https://github.com/janreges/siteone-crawler/commit/4979143ac2aea9d7b3fe9fcfb9d57f1890c1f114)
- visited-url: added info if is external and if is allowed to crawl it [`268a696`](https://github.com/janreges/siteone-crawler/commit/268a6960f8ff69046c8e6c73beae98d24b73ba1f)
- text-output: added peak memory usage and average traffic bandwidth to total stats [`cb68340`](https://github.com/janreges/siteone-crawler/commit/cb683407e2cdcd62f5484da96baf9ef43e49a4b3)
- crawler: added video support and fixed javascript detection by content-type [`3c3eb96`](https://github.com/janreges/siteone-crawler/commit/3c3eb9625f20657e971249c14cdff97a0a0b8687)
- url parsers: extraction of url parsing from html/css into dedicated classes and FoundUrl with info about source tag/attribute [`d87597d`](https://github.com/janreges/siteone-crawler/commit/d87597d36507c7bd6029f87bf1801586eea9b420)
- manager: ensure that done callback is executed only once [`d99cccd`](https://github.com/janreges/siteone-crawler/commit/d99cccd91b43680e0726f9c037fb568a9e8be1b4)
- http-client: extraction of http client functionality into dedicated classes and implemented cache for HTTP responses (critical for efficient development) [`8439e37`](https://github.com/janreges/siteone-crawler/commit/8439e376c50a346e133a2d99e7406020bb89030a)
- debugging: added debugging related expert options + Debugger class [`2c89682`](https://github.com/janreges/siteone-crawler/commit/2c89682feaf65a4f224da8ebaf05c48aa899eccc)
- parsed-url: added query, it is already needed [`860df08`](https://github.com/janreges/siteone-crawler/commit/860df086ae8c8556420d92e249b3b459b8bf288f)
- status: trim only HTML bodies because trim breaks some types of binary files, e.g. avif [`fca2156`](https://github.com/janreges/siteone-crawler/commit/fca2156a2f9607f705a32833a650ae70d5690772)
- url parsers: unification of extension length in relevant regexes to {1,10} [`96a3548`](https://github.com/janreges/siteone-crawler/commit/96a35484ba5ab0eee7e43837c1eade1aba6f8a57)
- basic-stats: fixed division by zero and nullable times [`8c38b96`](https://github.com/janreges/siteone-crawler/commit/8c38b9660752f132c09e3ceaab596e54176b46e9)
- fastest-analyzer: show only URLs with status 200 on the TOP list [`0085dd1`](https://github.com/janreges/siteone-crawler/commit/0085dd1fcbd3b5657eca73345921fe3fc6f407bc)
- content-type-analyzer: added stats for 42x statuses (429 Too many requests) [`4f49d12`](https://github.com/janreges/siteone-crawler/commit/4f49d124d1d9993abe3babd9a181c9768b5c2903)
- file export: fixed HTML report error after last refactoring [`e77fa6c`](https://github.com/janreges/siteone-crawler/commit/e77fa6cf791da08b522e2124545c303ab5de67ed)
- sitemap: publish only URLs with status 200 OK [`b2d4448`](https://github.com/janreges/siteone-crawler/commit/b2d44488a28aeca3421c36ca1e5ada0030de26d8)
- summary: added missing </ul> and renamed heading Stats to Summary in HTML report [`c645e16`](https://github.com/janreges/siteone-crawler/commit/c645e16016611a49f70c3d5de9e6ab4d58a45048)
- status summary: added summary showing important analyzed metrics with OK/WARNING/CRITICAL icons, ordering by severity and INFO about the export execution + interrupting the script by CTRL+C will also run all analyzers, exporters and display all statistics for already processed URLs [`fd643d0`](https://github.com/janreges/siteone-crawler/commit/fd643d016036f4eed5418375f8b25cfe08549ed0)
- output consistency: ensuring color and formatting consistency of different types of values (status codes, request durations) [`3ffe1d2`](https://github.com/janreges/siteone-crawler/commit/3ffe1d2a939d718a6fae9c1f927646cfbec808f4)
- analyzers: added content-type analyzer with stats for total/avg times, total sizes and statuses 200x, 300x, 400x, 500x [`0475347`](https://github.com/janreges/siteone-crawler/commit/04753478bce1f81dfdab73cd19b0541e725317fe)
- crawler: better content-type handling for statistics and added 'Type' column to URL lists + refactored info from array to class [`346caf4`](https://github.com/janreges/siteone-crawler/commit/346caf45f3a18e75a0cf4d0e65961fbee63c9632)
- supertable: is now able to display from the array-of-arrays as well as from the array-of-objects + it can translate color declarations from bash to HTML colors when rendering to HTML [`80f0b1c`](https://github.com/janreges/siteone-crawler/commit/80f0b1ca3d50ee7dfae9a01eccbe15fcc06a72d5)
- analyzers: TOP slowest/fastest pages analyzer now evaluates only HTML pages, otherwise static content skews the results + decreased minTime for slowest analysis from 0.1 to 0.01 sec (on a very fast and cached website, the results were empty, which is not ideal) [`1390bbc`](https://github.com/janreges/siteone-crawler/commit/1390bbc6daa5484fed8612731dc99f734c406042)
- major refactoring: implementation of the Status class summarizing useful information for analyzers/exporters (replaces the JsonOutput over-use) + implementation of basic analyzers (404, redirects, slow/fast URLs) + SuperTable component that exports data to text and HTML + choice of memory-limit setting + change of some default values [`efb9a60`](https://github.com/janreges/siteone-crawler/commit/efb9a60aa0be5cb8af55b09723a236370fccb904)
- url parsing: fixes for cases when query params are used with htm/html/php/asp etc. + mini readme fix [`af1acfa`](https://github.com/janreges/siteone-crawler/commit/af1acfa9efa536d2ef2e51b2f0a2404ef9d2417a)
- minor refactoring: renaming about core options, small non-functional changes [`1dd258e`](https://github.com/janreges/siteone-crawler/commit/1dd258e81eb4d06658e5e41e62141d5be48ce622)
- major refactoring: better modularity and auto loading in the area of the exporters, analyzers, their configurability and help auto-building + new mailer options --mail-from-name and --mail-subject-template [`0c57dbd`](https://github.com/janreges/siteone-crawler/commit/0c57dbdb30702cc6669a703788b530fbc4d04af6)
- json output: automatic shortening of the URL according to the text width of the console, because if the long URL exceeds the width of the window, the rewriting of the line with the progressbar stops working properly [`106332b`](https://github.com/janreges/siteone-crawler/commit/106332b1d8421dbea5f8725536fa3efed6834564)
- manual exit: captures CTRL+C and ends with the statistics for at least the current URLs [`7f4fc80`](https://github.com/janreges/siteone-crawler/commit/7f4fc80c5f9f0fe47da2d9bee2e139489c36a966)
- error handling: show red error with help when queue or visited tables are full and info how to fix it [`4efbd73`](https://github.com/janreges/siteone-crawler/commit/4efbd734d775aaa2e6dd66d2d8ed7a007871a1dd)
- DOM elements: implemented DOM elements counter and when you add 'DOM' to --headers-to-column you will see DOM elements count [`1837a9c`](https://github.com/janreges/siteone-crawler/commit/1837a9cb12f97a33aec6bcf03a54250bd48545a2)
- sitemap and no-color: implemented xml/txt sitemap generator and --no-color option [`f9ade44`](https://github.com/janreges/siteone-crawler/commit/f9ade44d470d97bcc399039bc91a5ce74a6537c1)
- readme: added table of contents and rewrote intro, features and installation chapters [`469fd1c`](https://github.com/janreges/siteone-crawler/commit/469fd1cf15af4d191c239b2523e0fd8614f7653f)
- readme: removed deprecated and duplicate mailer docs [`c5effe8`](https://github.com/janreges/siteone-crawler/commit/c5effe84aece85f7a6aaa97228cd84a5eade4f8b)
- readme and CLI help: divided the parameters into clear groups and improved parameter descriptions - README.md contains the detailed form, the CLI help a shorter version. [`19ff724`](https://github.com/janreges/siteone-crawler/commit/19ff724ec0d21f08c4d6cf09def06ba27b023598)
- include/ignore regex: added option to limit crawled URLs with the common combination of --include-regex and --ignore-regex [`88e393d`](https://github.com/janreges/siteone-crawler/commit/88e393d33c07fab77173432fd0faf7fe631c2c2c)
- html report: masking passwords, styling, added logo, better info ordering and other small changes [`4cdcdab`](https://github.com/janreges/siteone-crawler/commit/4cdcdabf145ffe6f02d84b3250b2a1fc46a5677a)
- mailer & exports: implemented ability to send HTML report to e-mail via SMTP + exports to HTML/JSON/TXT file + better reporting of HTTP error conditions (timeout, etc.) + requests for assets are sent only as HEAD without the need to download all binary data + updated documentation [`a97c29d`](https://github.com/janreges/siteone-crawler/commit/a97c29d78f07b4d854853c474fb9d0542b6f2796)
- table output: option to set expected column length for better look by 'X-Cache(10)' [`e44f89d`](https://github.com/janreges/siteone-crawler/commit/e44f89d6c3114ccf02c70f38d5ffa5a0f081c1b2)
- output: renamed print*() methods to more meaningful add*() relevant also for JSON output [`1069c4a`](https://github.com/janreges/siteone-crawler/commit/1069c4a346d13878c52a316b5953ffa997ec3700)
- options: default timeout decreased from 10 to 3, --table-url-column-size renamed to --url-column-size and decreased its default value from 100 to 80, new option --hide-progress-bar, changed --truncate-url-to-column-size to --do-not-truncate-url [`e75038c`](https://github.com/janreges/siteone-crawler/commit/e75038c56afcf85ae591b1dbedf33a54fcd84754)
- readme: improved documentation describing use on Windows, macOS or arm64 Linux [`baf2d05`](https://github.com/janreges/siteone-crawler/commit/baf2d0596a3e8367d51fe6ab75793d803e984330)
- readme: added info about really tested crawler on Windows with Cygwin (Cygwin has some output limitations and it is not possible to achieve such nice behavior as on Linux) [`1f195c0`](https://github.com/janreges/siteone-crawler/commit/1f195c0c9c8565a37fcb5786070e69c6aa0b8e0e)
- windows compatibility: ensuring compatibility with running through cygwin Swoole, which I recommend in the documentation for Windows users [`c22cc45`](https://github.com/janreges/siteone-crawler/commit/c22cc4559ed3de2ac5e4e6e2957b4d3233b4fda5)
- json output: implemented nice continuous progress reporting, intentionally on STDERR so the output on STDOUT can be used to save JSON to file + improved README.md [`c095249`](https://github.com/janreges/siteone-crawler/commit/c095249d03c96a00da75553b10dadf7e025a5b0b)
- limits: increased limit of max queue length from 1000 to 2000 (this default will be more suitable even for medium-sized websites) [`c8c3312`](https://github.com/janreges/siteone-crawler/commit/c8c33121c371cc4d0f0791a250178254d9e3a88a)
- major refactoring: splitting the code into classes, improving error handling and implementing other functions (JSON output, assets crawling) [`f6902fc`](https://github.com/janreges/siteone-crawler/commit/f6902fc025943ef96150739ae6834358097b235d)
- readme: added information how to use crawler with Windows, macOS or arm64 architecture + a few other details [`721f4bb`](https://github.com/janreges/siteone-crawler/commit/721f4bb73e92f65ca3aab789219f046dea665931)
- url parsing: handled situations when relative or dotted URLs are also used in HTML, e.g. href='sub/page', href='./sub/page' or href='../sub/page', href='../../sub/page' etc. + few minor optimizations [`c2bbf72`](https://github.com/janreges/siteone-crawler/commit/c2bbf72cf636340a43ebf8472c38008d0fc50f27)
- memory allocation: added optional params --max-queue-length=<n> (default 1000), --max-visited-urls=<n> (default 5000) and --max-url-length=<u> (default 2000) [`947a43f`](https://github.com/janreges/siteone-crawler/commit/947a43f3bb826ad852ca51390ae2778fbff320e0)
- Initial commit with first version 2023.10.1 [`7109788`](https://github.com/janreges/siteone-crawler/commit/71097884df3c1ade6fd7c02b4ac9ac8f5f161a12)
================================================
FILE: CLAUDE.md
================================================
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Setup After Clone
```bash
git config core.hooksPath .githooks # enable pre-commit hook (fmt + clippy + tests)
```
## Build & Test Commands
```bash
cargo fmt # auto-format code (always run before build)
cargo build # debug build
cargo build --release # release build (~11s)
cargo test # unit tests + offline integration tests (~300 tests)
cargo test --test integration_crawl -- --ignored --test-threads=1 # network integration tests (crawls crawler.siteone.io)
cargo test scoring::ci_gate::tests::all_checks_pass # run a single test by name
cargo clippy -- -D warnings # lint (CI enforces zero warnings)
cargo fmt -- --check # format check
```
## Quick Run
```bash
./target/release/siteone-crawler --url=https://example.com --single-page
./target/release/siteone-crawler --url=https://example.com --output=json --http-cache-dir= # no cache
./target/release/siteone-crawler --html-to-markdown=page.html # convert local HTML to markdown (stdout)
./target/release/siteone-crawler --html-to-markdown=page.html --html-to-markdown-output=page.md # convert to file
```
## Architecture
### Crawl Lifecycle (in order)
1. **CLI Parsing** (`Initiator` → `CoreOptions::parse_argv()`): Parses 120+ CLI options, merges config file if present, validates. Exits with code 101 on error, code 2 on `--help`/`--version`. Non-crawl utility modes (`--serve-markdown`, `--serve-offline`, `--html-to-markdown`) exit early in `main.rs` before creating the Manager.
2. **Analyzer Registration** (`Initiator::register_analyzers()`): Creates all 16 analyzer instances (Accessibility, BestPractice, Caching, ContentType, DNS, ExternalLinks, Fastest, Headers, Page404, Redirects, Security, SeoAndOpenGraph, SkippedUrls, Slowest, SourceDomains, SslTls) and registers them with `AnalysisManager`. Some analyzers receive config from CLI options (e.g. `fastest_top_limit`, `max_heading_level`).
3. **Manager Setup** (`Manager::run()`): Creates `Status` (result storage), `Output` (text/json/multi), `HttpClient` (with optional proxy, auth, cache), `ContentProcessorManager` (HTML, CSS, JS, XML, Astro, Next.js, Svelte processors), and the `Crawler` instance.
4. **Robots.txt Fetch** (`Crawler::fetch_robots_txt()`): Before crawling starts, fetches and parses `/robots.txt` from the initial domain. Respects `--ignore-robots-txt` option.
5. **Crawl Loop** (`Crawler::run()`): Breadth-first concurrent URL processing:
- URL queue (`DashMap`) seeded with initial URL
- Tokio tasks limited by `Semaphore` (= `--workers` count) + rate limiting (`--max-reqs-per-sec`)
- Per-URL flow: check robots.txt → HTTP request → on error, store with negative status code → on success, run content processors → extract links from HTML → enqueue discovered URLs
- Content processors (`HtmlProcessor`, `CssProcessor`, etc.) transform response bodies during crawl — used by offline/markdown exporters for URL rewriting
- Each visited URL's response is stored in `Status` for post-crawl analysis
- Per-URL data collected: status code, headers, body, response time, content type, size, redirects
6. **Post-Crawl Analysis** (`Manager::run_post_crawl()`): Sequential pipeline after crawling ends:
- Transfer skipped URLs from crawler to `Status`
- Run all registered analyzers (`AnalysisManager::run_analyzers()`): each analyzer gets read access to `Status` (all crawled data) and write access to `Output` (adds tables/findings)
- Add content processor stats table
7. **Exporters** (`Manager::run_exporters()`): Generate output files based on CLI options:
- `SitemapExporter`: XML/TXT sitemap files
- `OfflineWebsiteExporter`: Static website copy with rewritten relative URLs
- `MarkdownExporter`: HTML→Markdown conversion with relative .md links
- `FileExporter`: Save text/JSON output to file
- `HtmlReport`: Self-contained HTML report (also used by Mailer and Upload)
- `MailerExporter`: Email HTML report via SMTP
- `UploadExporter`: Upload report to remote server
8. **Scoring** (`scorer::calculate_scores()`): Computes quality scores (0–10) across 5 weighted categories (Performance 20%, SEO 20%, Security 25%, Accessibility 20%, Best Practices 15%). Deductions come from summary findings (criticals, warnings) and stats (404s, 5xx, slow responses).
9. **CI/CD Gate** (`ci_gate::evaluate()`): When `--ci` is active, checks scores and stats against configurable thresholds (`--ci-min-score`, `--ci-max-404`, etc.). Returns exit code 10 on failure.
10. **Summary & Output** (`Output::add_summary()`, `Output::end()`): Prints summary table with OK/Warning/Critical counts, finalizes output. Exit code: 0 = success, 3 = no pages crawled, 10 = CI gate failed.
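The weighted scoring in step 8 can be sketched as follows. This is an illustrative stand-in, not the real `scorer::calculate_scores()` (which first applies deductions per category from findings and stats); only the weights and the 0.0–10.0 scale come from the description above, and the function name is hypothetical.

```rust
// Hypothetical sketch: combine per-category scores (0.0-10.0) using the
// documented weights (Performance 20%, SEO 20%, Security 25%,
// Accessibility 20%, Best Practices 15%). Weights sum to 1.0.
fn overall_score(perf: f64, seo: f64, sec: f64, a11y: f64, best: f64) -> f64 {
    let weighted = perf * 0.20 + seo * 0.20 + sec * 0.25 + a11y * 0.20 + best * 0.15;
    (weighted * 10.0).round() / 10.0 // one decimal place, matching the report scale
}

fn main() {
    // A site perfect everywhere except security (8.0) loses 0.5 overall,
    // reflecting security's higher 25% weight.
    assert!((overall_score(10.0, 10.0, 8.0, 10.0, 10.0) - 9.5).abs() < 1e-9);
}
```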
### How Analyzers Work
Each analyzer implements the `Analyzer` trait (`analysis/analyzer.rs`). Analyzers are **post-crawl only** — they don't run during crawling. The `AnalysisManager` calls each analyzer's `analyze(&Status, &mut Output)` method after all URLs have been visited. Analyzers read crawled data from `Status` (visited URLs, response headers, bodies, skipped URLs) and produce `SuperTable` instances that get added to `Output`. Analyzers also add `Item` entries to the `Summary` (OK, Warning, Critical, Info findings) which feed into scoring.
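A minimal sketch of this flow, with `Status` and `Output` reduced to toy stand-ins (the real types in `analysis/analyzer.rs` carry far more data, and the real trait signature may differ):

```rust
// Simplified stand-ins for the real Status/Output types.
struct Status {
    visited: Vec<(String, i32)>, // (url, status code)
}

#[derive(Default)]
struct Output {
    critical: Vec<String>, // findings that feed into the Summary and scoring
}

// Post-crawl only: read access to crawled data, write access to output.
trait Analyzer {
    fn analyze(&self, status: &Status, output: &mut Output);
}

// Toy analogue of Page404Analyzer: flag every 404 as a critical finding.
struct Page404Analyzer;

impl Analyzer for Page404Analyzer {
    fn analyze(&self, status: &Status, output: &mut Output) {
        for (url, code) in &status.visited {
            if *code == 404 {
                output.critical.push(format!("404: {url}"));
            }
        }
    }
}

fn main() {
    let status = Status {
        visited: vec![("/ok".into(), 200), ("/missing".into(), 404)],
    };
    let mut output = Output::default();
    Page404Analyzer.analyze(&status, &mut output);
    assert_eq!(output.critical, vec!["404: /missing"]);
}
```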
### How Content Processors Work
Content processors implement `ContentProcessor` (`content_processor/content_processor.rs`) and run **during crawl** on each URL's response body. They serve two purposes: (1) transform content for offline/markdown export (rewrite URLs to relative paths), and (2) extract metadata (links, assets). Processors are type-specific: `HtmlProcessor` handles HTML, `CssProcessor` handles CSS `url()` references, etc. The `ContentProcessorManager` dispatches to the right processor based on content type.
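The dispatch-by-content-type step can be sketched like this (hypothetical helper; the real `ContentProcessorManager` consults the actual processor instances and handles more types):

```rust
// Illustrative sketch: map a Content-Type header value to a processor name.
// Parameters after ';' (e.g. charset) are ignored for dispatch.
fn processor_for(content_type: &str) -> Option<&'static str> {
    let ct = content_type.split(';').next().unwrap_or("").trim();
    match ct {
        "text/html" => Some("HtmlProcessor"),
        "text/css" => Some("CssProcessor"),
        "application/javascript" | "text/javascript" => Some("JavascriptProcessor"),
        "application/xml" | "text/xml" => Some("XmlProcessor"),
        _ => None, // binary types (images, fonts, ...) pass through untouched
    }
}

fn main() {
    assert_eq!(processor_for("text/html; charset=utf-8"), Some("HtmlProcessor"));
    assert_eq!(processor_for("image/png"), None);
}
```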
### Concurrency Model
The crawler uses tokio for async I/O with a semaphore-based worker pool (`options.workers`). Shared state uses:
- `Arc<DashMap<...>>` for lock-free concurrent maps (URL queue, visited URLs, skipped URLs)
- `Arc<Mutex<...>>` for sequential-access state (Status, Output, AnalysisManager)
- `Arc<AtomicBool/AtomicUsize>` for simple flags and counters
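The `Arc<Atomic...>` counter pattern from the last bullet looks like this in isolation. This sketch uses std threads for a self-contained example; the actual crawler spawns tokio tasks and shares `DashMap`s the same way via `Arc::clone`:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    // Shared lock-free counter, cloned into each worker.
    let visited = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let visited = Arc::clone(&visited);
            thread::spawn(move || {
                for _ in 0..100 {
                    // No mutex needed: fetch_add is atomic.
                    visited.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(visited.load(Ordering::Relaxed), 400);
}
```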
### Key Traits
- **`Analyzer`** (`analysis/analyzer.rs`): Post-crawl analysis (SEO, security, headers, etc.). Each analyzer gets `&Status` and `&mut Output`.
- **`Exporter`** (`export/exporter.rs`): Output generators (HTML report, offline website, markdown, sitemap, mailer, upload).
- **`Output`** (`output/output.rs`): Formatting backend. Implementations: `TextOutput`, `JsonOutput`, `MultiOutput`.
- **`ContentProcessor`** (`content_processor/content_processor.rs`): Per-URL content transformation during crawl (HTML, JS, CSS, XML processors).
### Options System
CLI options are defined in `options/core_options.rs` via `get_options()` which returns an `Options` struct with typed option groups. Parsing flow: `parse_argv()` → merge config file → parse flags → `CoreOptions::from_options()` → `apply_option_value()` for each option. New CLI options require: adding the field to `CoreOptions`, a case in `apply_option_value()`, and an entry in the appropriate option group.
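The `apply_option_value()` step can be sketched as a match on the option name (hypothetical two-field struct; the real `CoreOptions` has 120+ fields grouped into typed option groups):

```rust
// Toy stand-in for CoreOptions with just two options.
#[derive(Default)]
struct CoreOptions {
    workers: usize,
    max_depth: Option<usize>,
}

// One match arm per option; unknown options become a config error
// (which the real binary reports and exits 101 on).
fn apply_option_value(opts: &mut CoreOptions, name: &str, value: &str) -> Result<(), String> {
    match name {
        "--workers" => {
            opts.workers = value.parse().map_err(|_| format!("invalid value for {name}"))?;
        }
        "--max-depth" => {
            opts.max_depth = Some(value.parse().map_err(|_| format!("invalid value for {name}"))?);
        }
        other => return Err(format!("unknown option {other}")),
    }
    Ok(())
}

fn main() {
    let mut opts = CoreOptions::default();
    apply_option_value(&mut opts, "--workers", "8").unwrap();
    assert_eq!(opts.workers, 8);
    assert!(apply_option_value(&mut opts, "--bogus", "1").is_err());
}
```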
### Exit Codes
| Code | Meaning |
|------|---------|
| 0 | Success (with `--ci`: all thresholds passed) |
| 1 | Runtime error |
| 2 | Help/version displayed |
| 3 | No pages successfully crawled (DNS failure, timeout, etc.) |
| 10 | CI/CD quality gate failed |
| 101 | Configuration error |
### HTTP Response Body
`HttpResponse.body` is `Option<Vec<u8>>` (not String) to preserve binary data for images, fonts, etc. Use `body_text()` for string content. Failed HTTP requests return `Ok(HttpResponse)` with negative status codes (-1 connection error, -2 timeout, -4 send error), not `Err`.
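A minimal sketch of that shape (simplified stand-in; the real `HttpResponse` carries headers, timing, and more, and `body_text()` may handle charsets differently):

```rust
// Body stays raw bytes so binary responses (images, fonts) survive intact.
struct HttpResponse {
    status: i32,           // negative on transport failure (-1, -2, -4)
    body: Option<Vec<u8>>, // None when nothing was received
}

impl HttpResponse {
    // Lossy UTF-8 view for callers that want text content.
    fn body_text(&self) -> Option<String> {
        self.body
            .as_ref()
            .map(|b| String::from_utf8_lossy(b).into_owned())
    }
}

fn main() {
    let ok = HttpResponse { status: 200, body: Some(b"<html>".to_vec()) };
    assert_eq!(ok.body_text().unwrap(), "<html>");

    // A timeout is still Ok(HttpResponse), just with a negative status.
    let timeout = HttpResponse { status: -2, body: None };
    assert!(timeout.body_text().is_none());
    assert!(timeout.status < 0);
}
```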
### Testing Structure
- **Unit tests**: In-file `#[cfg(test)] mod tests` blocks (standard Rust convention)
- **Integration tests**: `tests/integration_crawl.rs` with shared helpers in `tests/common/mod.rs`
- Network-dependent integration tests are `#[ignore]` — run explicitly with `--ignored`
### Testing Complex Scenarios with Sample Websites
The crawler has a built-in HTTP server (`--serve-offline=<dir>`) that can serve any local directory as a static website. This enables efficient local testing of edge cases without deploying a real site:
1. Create a sample website directory, e.g. `./tmp/sample-website-xyz/`
2. Add HTML files and assets simulating the desired scenario (spaces in filenames, special characters, redirect chains, broken links, specific heading structures, etc.)
3. Start the built-in server: `./target/release/siteone-crawler --serve-offline=./tmp/sample-website-xyz/ --serve-port=8888`
4. In another terminal, crawl the local site: `./target/release/siteone-crawler --url=http://127.0.0.1:8888/`
5. Verify the crawler handles the scenario correctly (output, offline export, analysis results)
This approach is useful for reproducing bug reports, testing regex edge cases (e.g. URLs with spaces, HTML entities, unusual attribute quoting), validating offline/markdown export for specific HTML structures, and any scenario that would be hard to find on a live website.
### Key Files
- `src/engine/crawler.rs` (~1700 lines): Core crawl loop, URL queue management, HTML/content parsing
- `src/options/core_options.rs` (~2500 lines): All 120+ CLI options, parsing, validation
- `src/export/utils/offline_url_converter.rs` (~1400 lines): URL-to-file-path conversion for offline export
- `src/export/html_report/report.rs`: HTML report generation with embedded template
- `src/scoring/scorer.rs`: Quality score calculation from summary findings
- `src/scoring/ci_gate.rs`: CI/CD threshold evaluation
### Edition & Rust Version
Project uses `edition = "2024"` (Rust 1.85+) with `rust-version = "1.94"`. Edition 2024 features used throughout: `unsafe extern` blocks, `if let` chaining (`if let ... && ...`), `unsafe { std::env::set_var() }`.
### Commit Policy
**Never commit automatically.** Commits are only allowed on explicit user request. Before every commit, always run `git status`, review the changes, and stage only the relevant files — never use `git add -A` or `git add .` blindly.
### Commit Messages
Use [Conventional Commits](https://www.conventionalcommits.org/): `feat:`, `fix:`, `refactor:`, `perf:`, `docs:`, `style:`, `ci:`, `chore:`, `test:`. Examples:
- `feat: add built-in HTTP server for markdown/offline exports`
- `fix: correct non-ASCII text corruption in heading ID generation`
- `perf: eliminate heap allocation in content_type_for_extension`
- `chore: bump version to 2.0.3`
### Releasing a New Version
1. Update version in `Cargo.toml` (`version = "X.Y.Z"`)
2. Update version in `src/version.rs` (`pub const CODE: &str = "X.Y.Z.YYYYMMDD";`)
3. Run `cargo check` so that `Cargo.lock` is updated with the new version
4. Commit all three files (`Cargo.toml`, `src/version.rs`, `Cargo.lock`): `git commit -m "chore: bump version to X.Y.Z"`
5. Tag and push: `git tag vX.Y.Z && git push && git push --tags`
### Important Conventions
- Tables, column order, and formatting must stay consistent across versions. The HTML parser uses the `scraper` crate.
- HTTP cache lives in `tmp/http-client-cache/` by default. Delete it for fresh crawls or use `--http-cache-dir=` to disable.
- `rustls` requires explicit `ring` CryptoProvider installation in `main.rs`.
================================================
FILE: Cargo.toml
================================================
[package]
name = "siteone-crawler"
version = "2.3.0"
edition = "2024"
rust-version = "1.94"
authors = ["Ján Regeš <jan.reges@siteone.cz>"]
description = "Website crawler and QA toolkit in Rust for security, performance, SEO, and accessibility audits, offline cloning, markdown export, sitemap generation, cache warming, and CI/CD gating — one dependency-free binary for all major platforms, 10 tools in one."
license = "MIT"
repository = "https://github.com/janreges/siteone-crawler"
homepage = "https://crawler.siteone.io/"
keywords = ["crawler", "seo", "website-analysis", "accessibility", "security"]
categories = ["command-line-utilities", "web-programming"]
readme = "README.md"
[[bin]]
name = "siteone-crawler"
path = "src/main.rs"
[dependencies]
tokio = { version = "1", features = ["full"] }
reqwest = { version = "0.13", features = ["gzip", "brotli", "deflate", "rustls", "socks", "cookies", "stream", "blocking", "multipart"] }
scraper = "0.25"
regex = "1"
clap = { version = "4", features = ["derive"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
colored = "3"
dashmap = "6"
hickory-resolver = "0.25"
rustls = { version = "0.23", features = ["ring"] }
x509-parser = "0.18"
lettre = { version = "0.11", default-features = false, features = ["tokio1-rustls-tls", "smtp-transport", "builder"] }
flate2 = "1"
brotli = "8"
chrono = { version = "0.4", features = ["serde"] }
chrono-tz = "0.10"
terminal_size = "0.4"
quick-xml = "0.39"
thiserror = "2"
anyhow = "1"
md-5 = "0.10"
url = "2"
percent-encoding = "2"
mime = "0.3"
once_cell = "1"
indexmap = "2"
gethostname = "1.1"
rustls-native-certs = "0.8"
ego-tree = "0.10"
base64 = "0.22"
dirs = "6"
pulldown-cmark = "0.13.1"
inquire = { version = "0.9", default-features = false, features = ["crossterm"] }
crossterm = "0.29"
fancy-regex = "0.17"
[package.metadata.deb]
maintainer = "Ján Regeš <jan.reges@siteone.cz>"
copyright = "2023-2026, Ján Regeš"
depends = "libc6"
section = "web"
priority = "optional"
extended-description = """\
SiteOne Crawler is an ultra-fast, open-source website crawler and QA toolkit \
written in Rust. It helps developers, DevOps teams, QA engineers, and technical \
SEO specialists crawl websites, audit quality, stress-test pages under load, \
clone sites for offline browsing and archiving, export content to markdown, \
generate sitemaps, warm caches, and enforce CI/CD quality gates — all from a \
single, dependency-free binary for Linux, macOS, and Windows.\n\
\n\
It combines multiple website tooling workflows in one application: security, \
performance, SEO, accessibility, and best-practices audits; whole-site quality \
scoring; UX checks that other tools miss (e.g. non-clickable phone numbers, \
missing alt text, broken heading hierarchy); reporting of all external links \
with their source pages, redirects, and 404s; stress/load testing with tunable \
concurrency and rate limits; offline multi-domain cloning with URL rewriting; \
markdown export for documentation, archiving, or AI workflows; sitemap \
generation; post-deploy cache warming; and automated quality checks for CI/CD \
pipelines.\n\
\n\
SiteOne Crawler can output results as interactive HTML reports (including an \
image gallery of all pictures found on the site), structured JSON, or readable \
terminal text, making it suitable both for local development and for automation \
in CI/CD environments. It can also email HTML reports directly via \
the user's own SMTP server and includes a built-in web server for browsing \
generated markdown exports, plus extensive CLI configurability for advanced \
use cases.\n\
\n\
Whether you need a technical website audit, an offline mirror, a load-testing \
helper, a markdown export for LLM/AI processing, or a reliable quality gate \
before deployment, SiteOne Crawler delivers 10 tools in one — as an ultra-fast, \
portable, open-source Rust binary with zero runtime dependencies."""
assets = [
["target/release/siteone-crawler", "usr/bin/", "755"],
["README.md", "usr/share/doc/siteone-crawler/", "644"],
["LICENSE", "usr/share/doc/siteone-crawler/", "644"],
]
[package.metadata.deb.variants.static]
name = "siteone-crawler-static"
depends = ""
conflicts = "siteone-crawler"
provides = "siteone-crawler"
extended-description = """\
Statically linked (musl) variant of SiteOne Crawler for maximum Linux compatibility. \
This version runs on any Linux distribution regardless of the installed glibc version. \
Install this if the standard siteone-crawler package reports a 'GLIBC not found' error. \
Note: ~50–80% slower than the glibc variant for CPU-intensive operations (offline and \
markdown export) due to the musl memory allocator."""
[package.metadata.generate-rpm]
assets = [
{ source = "target/release/siteone-crawler", dest = "/usr/bin/siteone-crawler", mode = "0755" },
{ source = "README.md", dest = "/usr/share/doc/siteone-crawler/README.md", mode = "0644" },
{ source = "LICENSE", dest = "/usr/share/doc/siteone-crawler/LICENSE", mode = "0644" },
]
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2023-2026 Ján Regeš
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# SiteOne Crawler
SiteOne Crawler is a powerful and easy-to-use **website analyzer, cloner, and converter** designed for developers seeking security and performance insights, SEO specialists identifying optimization opportunities, and website owners needing reliable backups and offline versions.
**Now rewritten in Rust** for maximum performance, minimal resource usage, and zero runtime dependencies. The transition from PHP+Swoole to Rust resulted in **25% faster execution** and **30% lower memory consumption** while producing identical output.
**Discover the SiteOne Crawler advantage:**
* **Run Anywhere:** Single native binary for **🪟 Windows**, **🍎 macOS**, and **🐧 Linux** (x64 & arm64). No runtime dependencies.
* **Work Your Way:** Launch the binary without arguments for an **interactive wizard** 🧙 with 10 preset modes, use the extensive **command-line interface** 📟 ([releases](https://github.com/janreges/siteone-crawler/releases), [▶️ video](https://www.youtube.com/watch?v=25T_yx13naA&list=PL9mElgTe-s1Csfg0jXWmDS0MHFN7Cpjwp)) for automation and power, or enjoy the intuitive **desktop GUI application** 💻 ([GUI app](https://github.com/janreges/siteone-crawler-gui), [▶️ video](https://www.youtube.com/watch?v=rFW8LNEVNdw)) for visual control.
* **Rich Output Formats:** Interactive **HTML audit report** 📊 with sortable tables and quality scoring (0.0-10.0) (see [nextjs.org sample](https://crawler.siteone.io/html/2024-08-23/forever/cl8xw4r-fdag8wg-44dd.html)), detailed **JSON** for programmatic consumption, and human-readable **text** for the terminal. Send HTML reports directly to your inbox via **built-in SMTP mailer** 📧.
* **CI/CD Integration:** Built-in **quality gate** (`--ci`) with configurable thresholds — exit code 10 on failure enables automated deployment blocking. Also useful for **cache warming** — crawling the entire site after deployment populates your reverse proxy/CDN cache.
* **Offline & Markdown Power:** Create complete **offline clones** 💾 for browsing without a server ([nextjs.org clone](https://crawler.siteone.io/examples-exports/nextjs.org/)) or convert entire websites into clean **Markdown** 📝 — perfect for backups, documentation, or feeding content to AI models ([examples](https://github.com/janreges/siteone-crawler-markdown-examples/)).
* **Deep Crawling & Analysis:** Thoroughly crawl every page and asset, identify errors (404s, redirects), generate **sitemaps** 🗺️, and even get **email summaries** 📧 (watch [▶️ video example](https://www.youtube.com/watch?v=PHIFSOmk0gk)).
* **Learn More:** Dive into the 🌐 [Project Website](https://crawler.siteone.io/), explore the detailed [Documentation](https://crawler.siteone.io/configuration/command-line-options/), or check the [JSON](docs/JSON-OUTPUT.md)/[Text](docs/TEXT-OUTPUT.md) output specs.
GIF animation of the crawler in action (also available as a [▶️ video](https://www.youtube.com/watch?v=25T_yx13naA&list=PL9mElgTe-s1Csfg0jXWmDS0MHFN7Cpjwp)):

## Table of contents
- [✨ Features](#-features)
* [🕷️ Crawler](#️-crawler)
* [🛠️ Dev/DevOps assistant](#️-devdevops-assistant)
* [📊 Analyzer](#-analyzer)
* [📧 Reporter](#-reporter)
* [💾 Offline website generator](#-offline-website-generator)
* [📝 Website to markdown converter](#-website-to-markdown-converter)
* [🗺️ Sitemap generator](#️-sitemap-generator)
- [🚀 Installation](#-installation)
* [📦 Pre-built binaries](#-pre-built-binaries)
* [🍺 Homebrew (macOS / Linux)](#-homebrew-macos--linux)
* [🐧 Debian / Ubuntu (apt)](#-debian--ubuntu-apt)
* [🎩 Fedora / RHEL (dnf)](#-fedora--rhel-dnf)
* [🦎 openSUSE / SLES (zypper)](#-opensuse--sles-zypper)
* [🏔️ Alpine Linux (apk)](#️-alpine-linux-apk)
* [🔨 Build from source](#-build-from-source)
- [▶️ Usage](#️-usage)
* [Interactive wizard](#interactive-wizard)
* [Basic example](#basic-example)
* [CI/CD example](#cicd-example)
* [Fully-featured example](#fully-featured-example)
* [⚙️ Arguments](#️-arguments)
+ [Basic settings](#basic-settings)
+ [Output settings](#output-settings)
+ [Resource filtering](#resource-filtering)
+ [Advanced crawler settings](#advanced-crawler-settings)
+ [File export settings](#file-export-settings)
+ [Mailer options](#mailer-options)
+ [Upload options](#upload-options)
+ [Offline exporter options](#offline-exporter-options)
+ [Markdown exporter options](#markdown-exporter-options)
+ [Sitemap options](#sitemap-options)
+ [Expert options](#expert-options)
+ [Fastest URL analyzer](#fastest-url-analyzer)
+ [SEO and OpenGraph analyzer](#seo-and-opengraph-analyzer)
+ [Slowest URL analyzer](#slowest-url-analyzer)
+ [Built-in HTTP server](#built-in-http-server)
+ [HTML-to-Markdown conversion](#html-to-markdown-conversion)
+ [CI/CD settings](#cicd-settings)
- [🏆 Quality Scoring](#-quality-scoring)
- [🔄 CI/CD Integration](#-cicd-integration)
- [📄 Output Examples](#-output-examples)
- [🧪 Testing](#-testing)
- [⚠️ Disclaimer](#️-disclaimer)
- [📜 License](#-license)
## ✨ Features
In short, the main benefits can be summarized in these points:
- **🕷️ Crawler** - very powerful crawler of the entire website reporting useful information about each URL (status code,
response time, size, custom headers, titles, etc.)
- **🛠️ Dev/DevOps assistant** - offers stress/load testing with configurable concurrent workers (`--workers`) and request
rate (`--max-reqs-per-sec`), cache warming, localhost testing, and rich URL/content-type filtering
- **📊 Analyzer** - analyzes all webpages and reports strange or erroneous behaviour and useful statistics (404s, redirects, bad
  practices, SEO and security issues, heading structures, etc.)
- **📧 Reporter** - interactive **HTML audit report**, structured **JSON**, and colored **text** output; built-in
**SMTP mailer** sends HTML reports directly to your inbox
- **💾 Offline website generator** - clone entire websites to browsable local HTML files (no server needed) including all
assets. Supports **multi-domain clones** — include subdomains or external domains with intelligent cross-linking.
- **📝 Website to markdown converter** - export the entire website to browsable text markdown (viewable on GitHub or any
text editor), or generate a **single-file markdown** with smart header/footer deduplication — ideal for **feeding to AI
tools**. Includes a **built-in web server** that renders markdown exports as styled HTML pages.
Also supports **standalone HTML-to-Markdown conversion** of local files (`--html-to-markdown`).
See [markdown examples](https://github.com/janreges/siteone-crawler-markdown-examples/).
- **🗺️ Sitemap generator** - allows you to generate `sitemap.xml` and `sitemap.txt` files with a list of all pages on your
website
- **🏆 Quality scoring** - automatic quality scoring (0.0-10.0) across 5 categories: Performance, SEO, Security, Accessibility, Best Practices
- **🔄 CI/CD quality gate** - configurable thresholds with exit code 10 on failure for automated pipelines; also
useful as a **post-deployment cache warmer** for reverse proxies and CDNs
The following features are summarized in greater detail:
### 🕷️ Crawler
- **all major platforms** supported without dependencies (🐧 Linux, 🪟 Windows, 🍎 macOS, arm64) — single native binary
- has incredible **🚀 native Rust performance** with async I/O and multi-threaded crawling
- provides simulation of **different device types** (desktop/mobile/tablet) thanks to predefined User-Agents
- will crawl **all files**, styles, scripts, fonts, images, documents, etc. on your website
- will respect the `robots.txt` file and will not crawl the pages that are not allowed
- has a **beautiful interactive** and **🎨 colourful output**
- it will **clearly warn you** ⚠️ of any incorrect use of the tool (e.g. invalid input parameters or wrong permissions)
- as the `--url` parameter, you can also specify a `sitemap.xml` file (or [sitemap index](https://www.sitemaps.org/protocol.html#index)),
which will be processed as a list of URLs. In sitemap-only mode, the crawler follows only URLs from
the sitemap — it does not discover additional links from HTML pages. Gzip-compressed sitemaps (`*.xml.gz`)
are fully supported, both as direct URLs and when referenced from sitemap index files.
- respects the HTML `<base href>` tag when resolving relative URLs on pages that use it.
### 🛠️ Dev/DevOps assistant
- allows testing **public** and **local projects on specific ports** (e.g. `http://localhost:3000/`)
- works as a **stress/load tester** — configure the number of **concurrent workers** (`--workers`) and the **maximum
requests per second** (`--max-reqs-per-sec`) to simulate various traffic levels and test your infrastructure's
resilience against high load or DoS scenarios
- combine with **rich filtering options** — include/ignore URLs by regex (`--include-regex`, `--ignore-regex`), disable
specific asset types (`--disable-javascript`, `--disable-images`, etc.), or limit crawl depth (`--max-depth`) to focus
the load on specific parts of your website
- will help you **warm up the application cache** or the **cache on the reverse proxy** of the entire website
### 📊 Analyzer
- will **find the weak points** or **strange behavior** of your website
- built-in analyzers cover SEO, security headers, accessibility, best practices, performance, SSL/TLS, caching, and more
### 📧 Reporter
Three output formats:
- **Interactive HTML report** — a self-contained `.html` file with sortable tables, quality scores, color-coded
findings, and sections for SEO, security, accessibility, performance, headers, redirects, 404s, and more. Open it
in any browser — no server needed.
- **JSON output** — structured data with all crawled URLs, response details, analysis findings, scores, and CI/CD gate
results. Ideal for programmatic consumption, dashboards, and integrations.
- **Text output** — human-readable colored terminal output with tables, progress bars, and summaries.
Additional reporting features:
- **Built-in SMTP mailer** — send the HTML audit report directly to one or more email addresses via your own SMTP
server. Configure sender, recipients, subject template, and SMTP credentials via CLI options.
- will provide you with data for **SEO analysis**, just add the `Title`, `Keywords` and `Description` extra columns
- will provide useful **summaries and statistics** at the end of the processing
### 💾 Offline website generator
- will help you **export the entire website** to offline form, where it is possible to browse the site through local
HTML files (without HTTP server) including all documents, images, styles, scripts, fonts, etc.
- supports **multi-domain clones** — include subdomains (`*.mysite.tld`) or entirely different domains in a single
offline export. All URLs across included domains are **intelligently rewritten to relative paths**, so the resulting
offline version cross-links pages between domains seamlessly — you get one unified browsable clone.
- you can **limit what assets** you want to download and export (see the `--disable-*` directives); for some types of
  websites the best result is achieved with the `--disable-javascript` option.
- you can specify by `--allowed-domain-for-external-files` (short `-adf`) from which **external domains** it is possible
to **download** assets (JS, CSS, fonts, images, documents) including `*` option for all domains.
- you can specify by `--allowed-domain-for-crawling` (short `-adc`) which **other domains** should be included in the
  **crawling** if there are any links pointing to them. You can enable e.g. `mysite.*` to export all language versions
that have a different TLD or `*.mysite.tld` to export all subdomains.
- you can use `--single-page` to **export only one page** to which the URL is given (and its assets), but do not follow
other pages.
- you can use `--single-foreign-page` to **export only one page** from another domain (if allowed by `--allowed-domain-for-crawling`),
but do not follow other pages.
- you can use `--replace-content` to **replace content** in HTML/JS/CSS with `foo -> bar` or regexp in PCRE format, e.g.
`/card[0-9]/i -> card`. Can be specified multiple times.
- you can use `--replace-query-string` to **replace chars in query string** in the filename.
- you can use `--max-depth` to set the **maximum crawling depth** (for pages, not assets). `1` means `/about` or `/about/`,
`2` means `/about/contacts` etc.
- you can use it to **export your website to a static form** and host it on GitHub Pages, Netlify, Vercel, etc. as a
static backup and part of your **disaster recovery plan** or **archival/legal needs**
- works great with **older conventional websites** but also **modern ones**, built on frameworks like Next.js, Nuxt.js,
SvelteKit, Astro, Gatsby, etc. When a JS framework is detected, the export also performs some framework-specific code
modifications for optimal results.
- **try it** for your website, and you will be very pleasantly surprised :-)
### 📝 Website to markdown converter
Two export modes:
- **Multi-file markdown** — exports the entire website with all subpages to a directory of **browsable `.md` files**.
The markdown renders nicely when uploaded to GitHub, viewed in VS Code, or any text editor. Links between pages are
converted to relative `.md` links so you can navigate between files. Optionally includes images and other files
(PDF, etc.).
- **Single-file markdown** — combines all pages into **one large markdown file** with smart removal of duplicate website
headers and footers across pages. Ideal for **feeding entire website content to AI tools** (ChatGPT, Claude, etc.)
that process markdown more effectively than raw HTML.
Smart conversion features:
- **collapsible accordions** — large link lists (menus, navigation, footer links with 8+ items) are automatically
collapsed into `<details>` accordions with contextual labels ("Menu", "Links") for better readability
- content before the main heading (typically h1) — such as the site header and navigation — is moved to the end of the
page below a `---` separator, so the actual page content comes first
- you can set multiple selectors (CSS-like) to **remove unwanted elements** from the exported markdown
- **code block detection** and **syntax highlighting** for popular programming languages
- HTML tables are converted to proper **markdown tables**
Built-in web server:
- use `--serve-markdown=<dir>` to start a **built-in HTTP server** that renders your markdown export as styled HTML
pages with tables, dark/light mode, breadcrumb navigation, and accordion support — perfect for browsing and sharing
the export locally or on a network
Standalone HTML-to-Markdown conversion:
- use `--html-to-markdown=<file>` to convert a **local HTML file** directly to Markdown without crawling any website
- outputs clean Markdown to **stdout** (pipe-friendly) or to a file with `--html-to-markdown-output=<file>`
- uses the same conversion pipeline as `--markdown-export-dir` — including all cleanup, accordion collapsing, code language detection, and implicit exclusions (cookie banners, `aria-hidden` elements, `role="menu"` dropdowns)
- respects `--markdown-disable-images`, `--markdown-disable-files`, `--markdown-exclude-selector`, and `--markdown-move-content-before-h1-to-end`
- does **not** rewrite links (`.html` → `.md`) since the file is standalone with no site context
💡 Tip: you can push the exported markdown folder to your GitHub repository, where it will be automatically rendered as browsable
documentation. You can look at the [examples](https://github.com/janreges/siteone-crawler-markdown-examples/) of converted websites to markdown.
See all available [markdown exporter options](#markdown-exporter-options) and [HTML-to-Markdown conversion options](#html-to-markdown-conversion).
### 🗺️ Sitemap generator
- will help you create a `sitemap.xml` and `sitemap.txt` for your website
- you can set the priority of individual pages based on the number of slashes in the URL
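The slash-based priority idea can be illustrated like this (a hypothetical formula for illustration only; the crawler's actual priority mapping may differ):

```rust
// Illustrative sketch: fewer path segments = higher sitemap priority.
// Priority starts at 1.0 at the root and drops 0.2 per extra level,
// clamped to a 0.1 floor.
fn priority_for(url_path: &str) -> f32 {
    let depth = url_path.trim_end_matches('/').matches('/').count();
    (1.0 - 0.2 * depth.saturating_sub(1) as f32).max(0.1)
}

fn main() {
    assert!((priority_for("/") - 1.0).abs() < 1e-6); // homepage
    assert!((priority_for("/about") - 1.0).abs() < 1e-6); // top-level page
    assert!((priority_for("/about/contacts") - 0.8).abs() < 1e-6); // one level deeper
    assert!((priority_for("/a/b/c/d/e/f/g") - 0.1).abs() < 1e-6); // clamped floor
}
```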
Don't hesitate to try it. You will love it as much as we do! ❤️
## 🚀 Installation
### 📦 Pre-built binaries
Download pre-built binaries from [🐙 GitHub releases](https://github.com/janreges/siteone-crawler/releases) for all major platforms (🐧 Linux, 🪟 Windows, 🍎 macOS, x64 & arm64).
The binary is self-contained — no runtime dependencies required.
```bash
# Linux / macOS — download, extract, run
./siteone-crawler --url=https://my.domain.tld
```
**🐧 Linux binary variants:**
For Linux, two binary variants are provided:
| Variant | Compatibility | Performance |
|---------|--------------|-------------|
| **glibc** (primary) | Requires glibc 2.39+ (Ubuntu 24.04+, Debian 13+, Fedora 40+) | Full native performance |
| **musl** (compatible) | Any Linux distribution (statically linked, no dependencies) | ~50–80% slower due to musl memory allocator |
The **glibc** variant is recommended for current distributions — it offers the best performance. If you are running an older distribution (e.g. Ubuntu 22.04, Debian 12) and encounter a `GLIBC_2.xx not found` error, use the **musl** variant instead. The musl binary is fully statically linked and runs on any Linux system regardless of the installed glibc version. The performance difference is mainly noticeable during CPU-intensive operations like offline and markdown exports.
**Note for macOS users**: If macOS refuses to start the crawler from your Downloads folder, move the entire crawler folder **via the terminal** to another location, for example to your home folder `~`.
### 🍺 Homebrew (macOS / Linux)
```bash
brew install janreges/tap/siteone-crawler
siteone-crawler --url=https://my.domain.tld
```
### 🐧 Debian / Ubuntu (apt)
```bash
curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.deb.sh' | sudo -E bash
sudo apt-get install siteone-crawler
```
> **Older distributions (Ubuntu 22.04, Debian 11/12, etc.):** If you get a `GLIBC_X.XX not found` error, install the statically linked variant instead:
> ```bash
> sudo apt-get install siteone-crawler-static
> ```
> See [Linux binary variants](#-pre-built-binaries) for details on the performance difference.
### 🎩 Fedora / RHEL (dnf)
```bash
curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.rpm.sh' | sudo -E bash
sudo dnf install siteone-crawler
```
> **Older distributions:** If you get a `GLIBC_X.XX not found` error, use `sudo dnf install siteone-crawler-static` instead.
> See [Linux binary variants](#-pre-built-binaries) for details.
### 🦎 openSUSE / SLES (zypper)
```bash
curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.rpm.sh' | sudo -E bash
sudo zypper install siteone-crawler
```
> **Older distributions:** If you get a `GLIBC_X.XX not found` error, use `sudo zypper install siteone-crawler-static` instead.
> See [Linux binary variants](#-pre-built-binaries) for details.
### 🏔️ Alpine Linux (apk)
```bash
curl -1sLf 'https://dl.cloudsmith.io/public/janreges/siteone-crawler/setup.alpine.sh' | sudo -E bash
sudo apk add siteone-crawler
```
### 🔨 Build from source
Requires [Rust](https://www.rust-lang.org/tools/install) 1.85 or later.
```bash
git clone https://github.com/janreges/siteone-crawler.git
cd siteone-crawler
# Build optimized release binary
cargo build --release
# Run
./target/release/siteone-crawler --url=https://my.domain.tld
```
**Build statically linked (musl) binary:**
```bash
# Install musl toolchain (Ubuntu/Debian)
sudo apt-get install musl-tools
rustup target add x86_64-unknown-linux-musl
# Build static binary (no system dependencies)
cargo build --release --target x86_64-unknown-linux-musl
# Run — works on any Linux distribution
./target/x86_64-unknown-linux-musl/release/siteone-crawler --url=https://my.domain.tld
```
## ▶️ Usage
### Interactive wizard
Run the binary **without any arguments** and an interactive wizard will guide you through the
configuration. Choose from 10 preset modes, enter the target URL, fine-tune settings with
arrow keys, and the crawler starts immediately — no need to remember CLI flags.
```
? Choose a crawl mode:
❯ Quick Audit Fast site health overview — crawls all pages and assets
SEO Analysis Extract titles, descriptions, keywords, and OpenGraph tags
Performance Test Measure response times with cache disabled — find bottlenecks
Security Check Check SSL/TLS, security headers, and redirects site-wide
Offline Clone Download entire website with all assets for offline browsing
Markdown Export Convert pages to Markdown for AI models or documentation
Stress Test High-concurrency load test with cache-busting random params
Single Page Deep analysis of a single URL — SEO, security, performance
Large Site Crawl High-throughput HTML-only crawl for large sites (100k+ pages)
Custom Start from defaults and configure every option manually
──────────────────────────────────────
Browse offline export Serve a previously exported offline site via HTTP
Browse markdown export Serve a previously exported markdown site via HTTP
[↑↓ to move, enter to select, type to filter]
```
After selecting a preset and entering the URL, the wizard shows a settings form where you can
adjust workers, timeout, content types, export options, and more. A configuration summary with the
equivalent CLI command is displayed before the crawl starts — copy it for future use without the
wizard.
If existing offline or markdown exports are detected in `./tmp/`, the wizard also offers to
**serve them via the built-in HTTP server** directly from the menu.
### Basic example
To run the crawler from the command line, provide the required arguments:
```bash
./siteone-crawler --url=https://mydomain.tld/ --device=mobile
```
### CI/CD example
```bash
# Fail deployment if quality score < 7.0 or any 5xx errors
./siteone-crawler --url=https://mydomain.tld/ --ci --ci-min-score=7.0 --ci-max-5xx=0
echo $? # 0 = pass, 10 = fail
```
### Fully-featured example
```bash
./siteone-crawler --url=https://mydomain.tld/ \
--output=text \
--workers=2 \
--max-reqs-per-sec=10 \
--memory-limit=2048M \
--resolve='mydomain.tld:443:127.0.0.1' \
--timeout=5 \
--proxy=proxy.mydomain.tld:8080 \
--http-auth=myuser:secretPassword123 \
--user-agent="My User-Agent String" \
--extra-columns="DOM,X-Cache(10),Title(40),Keywords(50),Description(50>),Heading1=xpath://h1/text()(20>),ProductPrice=regexp:/Price:\s*\$?(\d+(?:\.\d{2})?)/i#1(10)" \
--accept-encoding="gzip, deflate" \
--url-column-size=100 \
--max-queue-length=3000 \
--max-visited-urls=10000 \
--max-url-length=5000 \
--max-non200-responses-per-basename=10 \
--include-regex="/^.*\/technologies.*/" \
--include-regex="/^.*\/fashion.*/" \
--ignore-regex="/^.*\/downloads\/.*\.pdf$/i" \
--analyzer-filter-regex="/^.*$/i" \
--remove-query-params \
--keep-query-param=page \
--add-random-query-params \
--transform-url="live-site.com -> local-site.local" \
--transform-url="/cdn\.live-site\.com/ -> local-site.local/cdn" \
--show-scheme-and-host \
--do-not-truncate-url \
--output-html-report=tmp/myreport.html \
--html-report-options="summary,seo-opengraph,visited-urls,security,redirects" \
--output-json-file=/dir/report.json \
--output-text-file=/dir/report.txt \
--add-timestamp-to-output-file \
--add-host-to-output-file \
--offline-export-dir=tmp/mydomain.tld \
--replace-content='/<foo[^>]+>/ -> <bar>' \
--ignore-store-file-error \
--sitemap-xml-file=/dir/sitemap.xml \
--sitemap-txt-file=/dir/sitemap.txt \
--sitemap-base-priority=0.5 \
--sitemap-priority-increase=0.1 \
--markdown-export-dir=tmp/mydomain.tld.md \
--markdown-export-single-file=tmp/mydomain.tld.combined.md \
--markdown-move-content-before-h1-to-end \
--markdown-disable-images \
--markdown-disable-files \
--markdown-remove-links-and-images-from-single-file \
--markdown-exclude-selector='.exclude-me' \
--markdown-replace-content='/<foo[^>]+>/ -> <bar>' \
--markdown-replace-query-string='/[a-z]+=[^&]*(&|$)/i -> $1__$2' \
--mail-to=your.name@my-mail.tld \
--mail-to=your.friend.name@my-mail.tld \
--mail-from=crawler@my-mail.tld \
--mail-from-name="SiteOne Crawler" \
--mail-subject-template="Crawler Report for %domain% (%date%)" \
--mail-smtp-host=smtp.my-mail.tld \
--mail-smtp-port=25 \
--mail-smtp-user=smtp.user \
--mail-smtp-pass=secretPassword123 \
--ci --ci-min-score=7.0 --ci-min-security=8.0
```
## ⚙️ Arguments
For a clearer overview, I recommend the documentation: 🌐 https://crawler.siteone.io/configuration/command-line-options/
### Basic settings
| Parameter | Description |
|-----------|-------------|
| `--url=<url>` | Required. HTTP or HTTPS URL address of the website or sitemap xml to be crawled.<br>Use quotation marks `''` if the URL contains query parameters. |
| `--single-page` | Load only the page at the given URL (and its assets); do not follow links to other pages. |
| `--max-depth=<int>` | Maximum crawling depth (for pages, not assets). Default is `0` (no limit). `1` means `/about`<br>or `/about/`, `2` means `/about/contacts` etc. |
| `--device=<val>` | Device type for choosing a predefined User-Agent. Ignored when `--user-agent` is defined.<br>Supported values: `desktop`, `mobile`, `tablet`. Default is `desktop`. |
| `--user-agent=<val>` | Custom User-Agent header. Use quotation marks. If specified, it takes precedence over<br>the device parameter. If you add `!` at the end, the siteone-crawler/version will not be<br>added as a signature at the end of the final user-agent. |
| `--timeout=<int>` | Request timeout in seconds. Default is `5`. |
| `--proxy=<host:port>` | HTTP proxy to use in `host:port` format. Host can be hostname, IPv4 or IPv6. |
| `--http-auth=<user:pass>` | Basic HTTP authentication in `username:password` format. |
| `--config-file=<file>` | Load CLI options from a config file. One option per line, `#` comments allowed.<br>Without this flag, auto-discovers `~/.siteone-crawler.conf` or `/etc/siteone-crawler.conf`.<br>CLI arguments override config file values. |
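As an example, a config file loaded via `--config-file` (or auto-discovered at `~/.siteone-crawler.conf`) contains the same options as the command line, one per line (the values below are illustrative):

```
# ~/.siteone-crawler.conf
--workers=5
--timeout=10
--device=mobile
# CLI arguments override any value set here
```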
### Output settings
| Parameter | Description |
|-----------|-------------|
| `--output=<val>` | Output type. Supported values: `text`, `json`. Default is `text`. |
| `--extra-columns=<values>` | Comma delimited list of extra columns added to output table. You can specify HTTP headers<br>(e.g. `X-Cache`), predefined values (`Title`, `Keywords`, `Description`, `DOM`), or custom<br>extraction from text files (HTML, JS, CSS, TXT, JSON, XML, etc.) using XPath or regexp.<br>For custom extraction, use the format `Custom_column_name=method:pattern#group(length)`, where<br>`method` is `xpath` or `regexp`, `pattern` is the extraction pattern, an optional `#group` specifies the<br>capturing group (or node index for XPath) to return (defaulting to the entire match or first node), and an<br>optional `(length)` sets the maximum output length (append `>` to disable truncation).<br>For example, use `Heading1=xpath://h1/text()(20>)` to extract the text of the first H1 element<br>from the HTML document, and `ProductPrice=regexp:/Price:\s*\$?(\d+(?:\.\d{2})?)/i#1(10)`<br>to extract a numeric price (e.g., "29.99") from a string like "Price: $29.99". |
| `--url-column-size=<num>` | Basic URL column width. By default, it is calculated from the size of your terminal window. |
| `--rows-limit=<num>` | Max. number of rows to display in tables with analysis results.<br>Default is `200`. |
| `--timezone=<val>` | Timezone for datetimes in HTML reports and timestamps in output folders/files, e.g. `Europe/Prague`.<br>Default is `UTC`. |
| `--do-not-truncate-url` | In the text output, long URLs are truncated by default to `--url-column-size` so the table does not<br>wrap due to long URLs. With this option, you can turn off the truncation. |
| `--show-scheme-and-host` | On text output, show scheme and host also for origin domain URLs. |
| `--hide-progress-bar` | Hide progress bar visible in text and JSON output for more compact view. |
| `--hide-columns=<list>` | Hide specified columns from the progress table. Comma-separated list of column names:<br>`type`, `time`, `size`, `cache`. Example: `--hide-columns=cache` or `--hide-columns=cache,type`. |
| `--no-color` | Disable colored output. |
| `--force-color` | Force colored output regardless of support detection. |
| `--show-inline-criticals` | Show criticals from the analyzer directly in the URL table. |
| `--show-inline-warnings` | Show warnings from the analyzer directly in the URL table. |
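To illustrate the capturing-group behavior of the `regexp:` extraction method above: `#1` returns just the first captured group, much like a `sed` group extraction with standard shell tools (a rough analogy, not the crawler's actual engine):

```bash
# Roughly what ProductPrice=regexp:/Price:\s*\$?(\d+(?:\.\d{2})?)/i#1(10)
# extracts from page text: capturing group #1, i.e. the bare number.
# Prints: 29.99
printf 'Special offer, Price: $29.99 today\n' \
  | sed -E 's/.*[Pp]rice: *\$?([0-9]+(\.[0-9]{2})?).*/\1/'
```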
### Resource filtering
| Parameter | Description |
|-----------|-------------|
| `--disable-all-assets` | Disables crawling of all assets and files and only crawls pages in href attributes.<br>Shortcut for calling all other `--disable-*` flags. |
| `--disable-javascript` | Disables JavaScript downloading and removes all JavaScript code from HTML,<br>including `onclick` and other `on*` handlers. |
| `--disable-styles` | Disables CSS file downloading and at the same time removes all style definitions<br>by `<style>` tag or inline by style attributes. |
| `--disable-fonts` | Disables font downloading and also removes all font/font-face definitions from CSS. |
| `--disable-images` | Disables downloading of all images and replaces found images in HTML with placeholder image only. |
| `--disable-files` | Disables downloading of any files (typically downloadable documents) to which various links point. |
| `--remove-all-anchor-listeners` | Remove all event listeners from links on the page. Useful for sites built with modern<br>JS frameworks (React, Svelte, Vue, Angular, etc.) that compose content dynamically. |
### Advanced crawler settings
| Parameter | Description |
|-----------|-------------|
| `--workers=<int>` | Maximum number of concurrent workers (threads).<br>Crawler will not make more simultaneous requests to the server than this number.<br>Use carefully! A high number of workers can cause a DoS attack. Default is `3`. |
| `--max-reqs-per-sec=<val>` | Max requests/s for whole crawler. Be careful not to cause a DoS attack. Default value is `10`. |
| `--memory-limit=<size>` | Memory limit in units `M` (Megabytes) or `G` (Gigabytes). Default is `2048M`. |
| `--resolve=<host:port:ip>` | Custom DNS resolution in `domain:port:ip` format. Same as [curl --resolve](https://everything.curl.dev/usingcurl/connections/name.html?highlight=resolve#provide-a-custom-ip-address-for-a-name).<br>Can be specified multiple times. |
| `--allowed-domain-for-external-files=<domain>` | Enable loading of file content from another domain (e.g. CDN).<br>Can be specified multiple times. Use `*` for all domains. |
| `--allowed-domain-for-crawling=<domain>` | Allow crawling of other listed domains — typically language mutations on other domains.<br>Can be specified multiple times. Use wildcards like `*.mysite.tld`. |
| `--single-foreign-page` | When crawling of other domains is allowed, ensures that only the linked page<br>and its assets are crawled from foreign domains. |
| `--include-regex=<regex>` | PCRE-compatible regular expression for URLs that should be included.<br>Can be specified multiple times. Example: `--include-regex='/^\/public\//'` |
| `--ignore-regex=<regex>` | PCRE-compatible regular expression for URLs that should be ignored.<br>Can be specified multiple times. |
| `--regex-filtering-only-for-pages` | Apply `*-regex` rules only to page URLs, not static assets. |
| `--analyzer-filter-regex` | PCRE-compatible regular expression for filtering analyzers by name. |
| `--accept-encoding=<val>` | Custom `Accept-Encoding` request header. Default is `gzip, deflate, br`. |
| `--remove-query-params` | Remove query parameters from found URLs. |
| `--keep-query-param=<name>` | Keep only the specified query parameter(s) in discovered URLs; all others are removed.<br>Can be specified multiple times. If `--remove-query-params` is also set, all parameters<br>are removed regardless. |
| `--add-random-query-params` | Add random query parameters to each URL to bypass caches. |
| `--transform-url=<from->to>` | Transform URLs before crawling. Use `from -> to` for simple replacement or `/regex/ -> replacement`.<br>Can be specified multiple times. |
| `--force-relative-urls` | Normalize all discovered URLs matching the initial domain (incl. www variant and protocol<br>differences) to canonical form. Prevents duplicate files in offline export when the site<br>uses inconsistent URL formats (http/https, www/non-www). |
| `--ignore-robots-txt` | Ignore robots.txt content. |
| `--http-cache-dir=<dir>` | Cache dir for HTTP responses. Disable with `--http-cache-dir='off'` or `--no-cache`.<br>Default is `~/.cache/siteone-crawler/http-cache` (XDG-compliant, respects `$XDG_CACHE_HOME`). |
| `--http-cache-compression` | Enable compression for HTTP cache storage. |
| `--http-cache-ttl=<val>` | TTL for HTTP cache entries (e.g. `1h`, `7d`, `30m`). Use `0` for infinite. Default is `24h`. |
| `--no-cache` | Disable HTTP cache completely. Shortcut for `--http-cache-dir='off'`. |
| `--max-queue-length=<num>` | Maximum length of the waiting URL queue. Default is `9000`. |
| `--max-visited-urls=<num>` | Maximum number of visited URLs. Default is `10000`. |
| `--max-skipped-urls=<num>` | Maximum number of skipped URLs. Default is `10000`. |
| `--max-url-length=<num>` | Maximum supported URL length in chars. Default is `2083`. |
| `--max-non200-responses-per-basename=<num>` | Protection against looping with dynamic non-200 URLs. Default is `5`. |
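The simple `from -> to` form of `--transform-url` behaves like a plain substring replacement applied to every discovered URL, conceptually similar to this `sed` substitution (a sketch of the mapping, not the crawler's implementation):

```bash
# 'live-site.com -> local-site.local' applied to one discovered URL.
# Prints: https://local-site.local/about
echo 'https://live-site.com/about' | sed 's/live-site\.com/local-site.local/'
```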
### File export settings
| Parameter | Description |
|-----------|-------------|
| `--output-html-report=<file>` | Save HTML report into that file. Set to empty `''` to disable HTML report.<br>By default saved into `tmp/%domain%.report.%datetime%.html`. |
| `--html-report-options=<sections>` | Comma-separated list of sections to include in HTML report.<br>Available sections: `summary`, `seo-opengraph`, `image-gallery`, `video-gallery`, `visited-urls`, `dns-ssl`, `crawler-stats`, `crawler-info`, `headers`, `content-types`, `skipped-urls`, `external-links`, `caching`, `best-practices`, `accessibility`, `security`, `redirects`, `404-pages`, `slowest-urls`, `fastest-urls`, `source-domains`.<br>Default: all sections. |
| `--output-json-file=<file>` | File path for JSON output. Set to empty `''` to disable JSON file.<br>By default saved into `tmp/%domain%.output.%datetime%.json`.<br>See [JSON Output Documentation](docs/JSON-OUTPUT.md) for format details. |
| `--output-text-file=<file>` | File path for TXT output. Set to empty `''` to disable TXT file.<br>By default saved into `tmp/%domain%.output.%datetime%.txt`.<br>See [Text Output Documentation](docs/TEXT-OUTPUT.md) for format details. |
| `--add-timestamp-to-output-file` | Append timestamp to output filenames (HTML report, JSON, TXT) except sitemaps. |
| `--add-host-to-output-file` | Append initial URL host to output filenames (HTML report, JSON, TXT) except sitemaps. |
**Default output directory:** Report files are saved into `./tmp/` in the current working directory. If `./tmp/` cannot be created (e.g. read-only filesystem), the crawler falls back to the platform's XDG data directory (`~/.local/share/siteone-crawler/` on Linux, `~/Library/Application Support/siteone-crawler/` on macOS, `%APPDATA%\siteone-crawler\` on Windows) and prints a notice to stderr.
### Mailer options
| Parameter | Description |
|-----------|-------------|
| `--mail-to=<email>` | Recipients of HTML e-mail reports. Required for mailer activation.<br>You can specify multiple emails separated by comma. |
| `--mail-from=<email>` | E-mail sender address. Default is `siteone-crawler@your-hostname.com`. |
| `--mail-from-name=<val>` | E-mail sender name. Default is `SiteOne Crawler`. |
| `--mail-subject-template=<val>` | E-mail subject template. You can use `%domain%`, `%date%` and `%datetime%`.<br>Default is `Crawler Report for %domain% (%date%)`. |
| `--mail-smtp-host=<host>` | SMTP host for sending emails. Default is `localhost`. |
| `--mail-smtp-port=<port>` | SMTP port for sending emails. Default is `25`. |
| `--mail-smtp-user=<user>` | SMTP user, if your SMTP server requires authentication. |
| `--mail-smtp-pass=<pass>` | SMTP password, if your SMTP server requires authentication. |
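Putting the mailer options together, a report-by-mail run might look like this (addresses and hostnames are illustrative):

```bash
./siteone-crawler --url=https://my.domain.tld/ \
  --mail-to=team@my-mail.tld \
  --mail-from=crawler@my-mail.tld \
  --mail-smtp-host=smtp.my-mail.tld \
  --mail-smtp-port=25
```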
### Upload options
| Parameter | Description |
|-----------|-------------|
| `--upload` | Enable HTML report upload to `--upload-to`. |
| `--upload-to=<url>` | URL of the endpoint where to send the HTML report. Default is `https://crawler.siteone.io/up`. |
| `--upload-retention=<val>` | How long should the HTML report be kept in the online version?<br>Values: 1h / 4h / 12h / 24h / 3d / 7d / 30d / 365d / forever.<br>Default is `30d`. |
| `--upload-password=<val>` | Optional password (user will be 'crawler') to display the online HTML report. |
| `--upload-timeout=<int>` | Upload timeout in seconds. Default is `3600`. |
### Offline exporter options
| Parameter | Description |
|-----------|-------------|
| `--offline-export-dir=<dir>` | Path to directory where to save the offline version of the website. |
| `--offline-export-store-only-url-regex=<regex>` | Debug: store only URLs matching these PCRE regexes. Can be specified multiple times. |
| `--offline-export-remove-unwanted-code=<1/0>` | Remove unwanted code for offline mode (analytics, social networks, etc.). Default is `1`. |
| `--offline-export-no-auto-redirect-html` | Disable automatic creation of redirect HTML files for subfolders containing `index.html`. |
| `--offline-export-preserve-url-structure` | Preserve the original URL path structure. E.g. `/about` is stored as `about/index.html`<br>instead of `about.html`. Useful for web server deployment where the clone should maintain<br>the same URL hierarchy as the original site. |
| `--offline-export-preserve-urls` | Preserve original URL format in exported HTML/CSS/JS — same-domain links become root-relative (`/path`), cross-domain links stay absolute. Ideal for processing with [siteone-chunker](https://github.com/janreges/siteone-chunker) and RAG pipelines where links must resolve to the production website. |
| `--replace-content=<val>` | Replace content in HTML/JS/CSS with `foo -> bar` or PCRE regexp.<br>Can be specified multiple times. |
| `--replace-query-string=<val>` | Replace characters in query string filenames.<br>Can be specified multiple times. |
| `--offline-export-lowercase` | Convert all filenames to lowercase for offline export. Useful for case-insensitive filesystems. |
| `--ignore-store-file-error` | Ignore any file storing errors and continue. |
| `--disable-astro-inline-modules` | Disable inlining of Astro module scripts for offline export.<br>Scripts will remain as external files with corrected relative paths. |
### Markdown exporter options
| Parameter | Description |
|-----------|-------------|
| `--markdown-export-dir=<dir>` | Path to directory where to save the markdown version of the website. |
| `--markdown-export-single-file=<file>` | Path to a file for combined markdown. Requires `--markdown-export-dir`. |
| `--markdown-move-content-before-h1-to-end` | Move content before main H1 heading to the end of the markdown. |
| `--markdown-disable-images` | Do not export and show images in markdown files. |
| `--markdown-disable-files` | Do not export files other than HTML/CSS/JS/fonts/images (e.g. PDF, ZIP). |
| `--markdown-remove-links-and-images-from-single-file` | Remove links and images from combined single file. |
| `--markdown-exclude-selector=<val>` | Exclude DOM elements by CSS selector from markdown export.<br>Can be specified multiple times. |
| `--markdown-replace-content=<val>` | Replace text content with `foo -> bar` or PCRE regexp.<br>Can be specified multiple times. |
| `--markdown-replace-query-string=<val>` | Replace characters in query string filenames.<br>Can be specified multiple times. |
| `--markdown-export-store-only-url-regex=<regex>` | Debug: store only URLs matching these PCRE regexes. Can be specified multiple times. |
| `--markdown-ignore-store-file-error` | Ignore any file storing errors and continue. |
### Sitemap options
| Parameter | Description |
|-----------|-------------|
| `--sitemap-xml-file=<file>` | File path for generated XML Sitemap. Extension `.xml` added if not specified. |
| `--sitemap-txt-file=<file>` | File path for generated TXT Sitemap. Extension `.txt` added if not specified. |
| `--sitemap-base-priority=<num>` | Base priority for XML sitemap. Default is `0.5`. |
| `--sitemap-priority-increase=<num>` | Priority increase based on slashes in URL. Default is `0.1`. |
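For example, to generate both sitemap formats while crawling (file paths are illustrative):

```bash
./siteone-crawler --url=https://my.domain.tld/ \
  --sitemap-xml-file=tmp/sitemap.xml \
  --sitemap-txt-file=tmp/sitemap.txt \
  --sitemap-base-priority=0.5 \
  --sitemap-priority-increase=0.1
```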
### Expert options
| Parameter | Description |
|-----------|-------------|
| `--debug` | Activate debug mode. |
| `--debug-log-file=<file>` | Log file for debug messages. When set without `--debug`, logging is active without visible output. |
| `--debug-url-regex=<regex>` | Regex for URL(s) to debug. Can be specified multiple times. |
| `--result-storage=<val>` | Result storage type. Values: `memory` or `file`. Use `file` for large websites. Default is `memory`. |
| `--result-storage-dir=<dir>` | Directory for `--result-storage=file`. Default is `tmp/result-storage`. |
| `--result-storage-compression` | Enable compression for results storage. |
| `--websocket-server=<host:port>` | Start crawler with websocket server on given host:port. |
| `--console-width=<int>` | Enforce a fixed console width, disabling automatic detection. |
### Fastest URL analyzer
| Parameter | Description |
|-----------|-------------|
| `--fastest-urls-top-limit=<int>` | Number of URLs in TOP fastest list. Default is `20`. |
| `--fastest-urls-max-time=<val>` | Maximum response time for a URL to be considered fast. Default is `1`. |
### SEO and OpenGraph analyzer
| Parameter | Description |
|-----------|-------------|
| `--max-heading-level=<int>` | Max heading level from 1 to 6 for analysis. Default is `3`. |
### Slowest URL analyzer
| Parameter | Description |
|-----------|-------------|
| `--slowest-urls-top-limit=<int>` | Number of URLs in TOP slowest list. Default is `20`. |
| `--slowest-urls-min-time=<val>` | Minimum response time threshold for slow URLs. Default is `0.01`. |
| `--slowest-urls-max-time=<val>` | Maximum response time for very slow evaluation. Default is `3`. |
### Built-in HTTP server
Browse exported markdown or offline HTML files through a local web server with a built-in viewer.
| Parameter | Description |
|-----------|-------------|
| `--serve-markdown=<dir>` | Start built-in HTTP server for browsing a markdown export directory.<br>Renders `.md` files as styled HTML with tables, accordions, dark/light mode, and breadcrumb navigation. |
| `--serve-offline=<dir>` | Start built-in HTTP server for browsing an offline HTML export directory.<br>Serves static files with Content-Security-Policy restricting assets to the same origin. |
| `--serve-port=<int>` | Port for the built-in HTTP server. Default is `8321`. |
| `--serve-bind-address=<addr>` | Bind address for the built-in HTTP server. Default is `127.0.0.1` (localhost only).<br>Use `0.0.0.0` to listen on all network interfaces and their IP addresses. |
**Example:**
```bash
# Browse markdown export
./siteone-crawler --serve-markdown=./exports/markdown
# Browse offline export on custom port, accessible from network
./siteone-crawler --serve-offline=./exports/offline --serve-port=9000 --serve-bind-address=0.0.0.0
```
### HTML-to-Markdown conversion
Convert a local HTML file to clean Markdown without crawling. Uses the same conversion pipeline as the markdown exporter.
| Parameter | Description |
|-----------|-------------|
| `--html-to-markdown=<file>` | Convert a local HTML file to Markdown and print to stdout. No crawling is performed.<br>Respects `--markdown-disable-images`, `--markdown-disable-files`, `--markdown-move-content-before-h1-to-end`, and `--markdown-exclude-selector`. |
| `--html-to-markdown-output=<file>` | Write the converted Markdown to a file instead of stdout. Requires `--html-to-markdown`. |
**Examples:**
```bash
# Convert HTML file to Markdown (printed to stdout)
./siteone-crawler --html-to-markdown=page.html
# Convert and save to a file
./siteone-crawler --html-to-markdown=page.html --html-to-markdown-output=page.md
# Convert with options: remove images, exclude navigation, move header below h1
./siteone-crawler --html-to-markdown=page.html \
--markdown-disable-images \
--markdown-exclude-selector=nav \
--markdown-move-content-before-h1-to-end
# Pipe to other tools (e.g. clipboard, AI, wc)
./siteone-crawler --html-to-markdown=page.html | pbcopy
./siteone-crawler --html-to-markdown=page.html | wc -l
```
### CI/CD settings
| Parameter | Description |
|-----------|-------------|
| `--ci` | Enable CI/CD quality gate. Crawler exits with code 10 if thresholds are not met. Default file outputs (HTML, JSON, TXT reports) are suppressed unless explicitly requested via `--output-*` options. |
| `--ci-min-score=<val>` | Minimum overall quality score (0.0-10.0). Default is `5.0`. |
| `--ci-min-performance=<val>` | Minimum Performance category score (0.0-10.0). Default is `5.0`. |
| `--ci-min-seo=<val>` | Minimum SEO category score (0.0-10.0). Default is `5.0`. |
| `--ci-min-security=<val>` | Minimum Security category score (0.0-10.0). Default is `5.0`. |
| `--ci-min-accessibility=<val>` | Minimum Accessibility category score (0.0-10.0). Default is `3.0`. |
| `--ci-min-best-practices=<val>` | Minimum Best Practices category score (0.0-10.0). Default is `5.0`. |
| `--ci-max-404=<int>` | Maximum number of 404 responses allowed. Default is `0`. |
| `--ci-max-5xx=<int>` | Maximum number of 5xx server error responses allowed. Default is `0`. |
| `--ci-max-criticals=<int>` | Maximum number of critical analysis findings allowed. Default is `0`. |
| `--ci-max-warnings=<int>` | Maximum number of warning analysis findings allowed. Not checked by default. |
| `--ci-max-avg-response=<val>` | Maximum average response time in seconds. Not checked by default. |
| `--ci-min-pages=<int>` | Minimum number of HTML pages that must be found. Default is `10`. |
| `--ci-min-assets=<int>` | Minimum number of assets (JS, CSS, images, fonts) that must be found. Default is `10`. |
| `--ci-min-documents=<int>` | Minimum number of documents (PDF, etc.) that must be found. Default is `0` (not checked). |
**Default behavior with `--ci` alone:** overall score >= 5.0, each category score >= 5.0 (Performance, SEO, Security, Best Practices) and Accessibility >= 3.0, 404 errors <= 0, 5xx errors <= 0, critical findings <= 0, HTML pages >= 10, assets >= 10. File outputs (HTML, JSON, TXT reports) are not generated. To save reports in CI mode, specify the desired output explicitly, e.g. `--ci --output-html-report=report.html`.
## 🏆 Quality Scoring
The crawler automatically calculates a quality score (0.0-10.0) across 5 weighted categories:
| Category | Weight | What it measures |
|----------|--------|------------------|
| **Performance** | 20% | Response times, slow URLs |
| **SEO** | 20% | Missing H1, title uniqueness, meta descriptions, 404s, redirects |
| **Security** | 25% | SSL/TLS certificates, security headers, unsafe protocols |
| **Accessibility** | 20% | Lang attribute, image alt text, form labels, ARIA, heading levels |
| **Best Practices** | 15% | Duplicate/large SVGs, deep DOM, Brotli/WebP support |
The overall score is a weighted average of all categories. Scores are displayed in a colored box in the console output and included in JSON and HTML report outputs.
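As a worked example of the weighted average, with hypothetical category scores plugged into the weights from the table above:

```bash
# Hypothetical category scores combined with the 20/20/25/20/15 % weights.
# Prints: 7.65
awk 'BEGIN {
  perf = 8.0; seo = 7.0; sec = 9.0; a11y = 6.0; bp = 8.0
  overall = perf*0.20 + seo*0.20 + sec*0.25 + a11y*0.20 + bp*0.15
  printf "%.2f\n", overall
}'
```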
Score labels:
- **9.0-10.0** — Excellent (green)
- **7.0-8.9** — Good (blue)
- **5.0-6.9** — Fair (yellow)
- **3.0-4.9** — Poor (purple)
- **0.0-2.9** — Critical (red)
## 🔄 CI/CD Integration
The `--ci` flag enables a quality gate that evaluates configurable thresholds after crawling completes. When any threshold is not met, the crawler exits with **code 10** (distinct from exit code 1 for runtime errors). In CI mode, default file outputs (HTML, JSON, TXT reports) are automatically suppressed — only the console output and exit code matter. If you need report files in CI, specify them explicitly (e.g. `--output-html-report=report.html`).
**Bonus: Cache warming** — running the crawler as a post-deployment step in your CI/CD pipeline crawls every page and asset on your site, which populates the HTML/asset cache on your **reverse proxy** (Varnish, Nginx) or **CDN** (Cloudflare, CloudFront). This way, the first real visitors always hit a warm cache instead of cold origin requests.
### Exit codes
| Code | Meaning |
|------|---------|
| `0` | Success (with `--ci` this also means all quality thresholds passed) |
| `1` | Runtime error |
| `2` | Help/version displayed |
| `3` | No pages crawled |
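In a pipeline script, these exit codes can be branched on explicitly. The `handle_exit` helper below is purely illustrative (it is not part of the crawler — the crawler only sets `$?`); it maps the documented codes, including the gate-failure code `10`, to CI-friendly messages:

```shell
# Illustrative helper mapping the crawler's documented exit codes to messages.
# Call it with $? right after the crawler finishes.
handle_exit() {
  case "$1" in
    0)  echo "OK: crawl finished (and quality gate passed, if --ci was set)" ;;
    2)  echo "OK: help/version displayed, nothing crawled" ;;
    3)  echo "ERROR: no pages crawled" ;;
    10) echo "GATE FAILED: one or more quality thresholds not met" ;;
    *)  echo "ERROR: crawler runtime failure (exit $1)" ;;
  esac
}

handle_exit 10   # -> GATE FAILED: one or more quality thresholds not met
```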
SYMBOL INDEX (1962 symbols across 101 files)
FILE: src/analysis/accessibility_analyzer.rs
constant ANALYSIS_MISSING_IMAGE_ALT_ATTRIBUTES (line 23) | const ANALYSIS_MISSING_IMAGE_ALT_ATTRIBUTES: &str = "Missing image alt a...
constant ANALYSIS_MISSING_FORM_LABELS (line 24) | const ANALYSIS_MISSING_FORM_LABELS: &str = "Missing form labels";
constant ANALYSIS_MISSING_ARIA_LABELS (line 25) | const ANALYSIS_MISSING_ARIA_LABELS: &str = "Missing aria labels";
constant ANALYSIS_MISSING_ROLES (line 26) | const ANALYSIS_MISSING_ROLES: &str = "Missing roles";
constant ANALYSIS_MISSING_LANG_ATTRIBUTE (line 27) | const ANALYSIS_MISSING_LANG_ATTRIBUTE: &str = "Missing html lang attribu...
constant SUPER_TABLE_ACCESSIBILITY (line 29) | const SUPER_TABLE_ACCESSIBILITY: &str = "accessibility";
type AccessibilityAnalyzer (line 31) | pub struct AccessibilityAnalyzer {
method new (line 50) | pub fn new() -> Self {
method check_image_alt_attributes (line 63) | fn check_image_alt_attributes(&mut self, html: &str, result: &mut UrlA...
method check_missing_labels (line 100) | fn check_missing_labels(&mut self, html: &str, result: &mut UrlAnalysi...
method check_missing_aria_labels (line 147) | fn check_missing_aria_labels(&mut self, html: &str, result: &mut UrlAn...
method check_missing_roles (line 232) | fn check_missing_roles(&mut self, html: &str, result: &mut UrlAnalysis...
method check_missing_lang (line 271) | fn check_missing_lang(&mut self, html: &str, result: &mut UrlAnalysisR...
method set_findings_to_summary (line 319) | fn set_findings_to_summary(&self, status: &Status) {
method default (line 44) | fn default() -> Self {
method analyze (line 383) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method analyze_visited_url (line 498) | fn analyze_visited_url(
method show_analyzed_visited_url_result_as_column (line 543) | fn show_analyzed_visited_url_result_as_column(&self) -> Option<ExtraColu...
method should_be_activated (line 547) | fn should_be_activated(&self) -> bool {
method get_order (line 551) | fn get_order(&self) -> i32 {
method get_name (line 555) | fn get_name(&self) -> &str {
method get_exec_times (line 559) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 563) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
function get_opening_tag_html (line 571) | fn get_opening_tag_html(element: &scraper::ElementRef) -> String {
function normalize_tag_for_dedup (line 589) | fn normalize_tag_for_dedup(element: &scraper::ElementRef) -> String {
FILE: src/analysis/analyzer.rs
type Analyzer (line 11) | pub trait Analyzer: Send + Sync {
method analyze (line 14) | fn analyze(&mut self, status: &crate::result::status::Status, output: ...
method analyze_visited_url (line 20) | fn analyze_visited_url(
method show_analyzed_visited_url_result_as_column (line 31) | fn show_analyzed_visited_url_result_as_column(&self) -> Option<ExtraCo...
method should_be_activated (line 36) | fn should_be_activated(&self) -> bool;
method get_order (line 39) | fn get_order(&self) -> i32;
method get_name (line 42) | fn get_name(&self) -> &str;
method get_exec_times (line 45) | fn get_exec_times(&self) -> &HashMap<String, f64>;
method get_exec_counts (line 48) | fn get_exec_counts(&self) -> &HashMap<String, usize>;
FILE: src/analysis/base_analyzer.rs
type BaseAnalyzer (line 10) | pub struct BaseAnalyzer {
method new (line 18) | pub fn new() -> Self {
method measure_exec_time (line 23) | pub fn measure_exec_time(&mut self, class: &str, method: &str, start_t...
method get_exec_times (line 31) | pub fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 35) | pub fn get_exec_counts(&self) -> &HashMap<String, usize> {
FILE: src/analysis/best_practice_analyzer.rs
constant ANALYSIS_LARGE_SVGS (line 23) | const ANALYSIS_LARGE_SVGS: &str = "Large inline SVGs";
constant ANALYSIS_DUPLICATED_SVGS (line 24) | const ANALYSIS_DUPLICATED_SVGS: &str = "Duplicate inline SVGs";
constant ANALYSIS_INVALID_SVGS (line 25) | const ANALYSIS_INVALID_SVGS: &str = "Invalid inline SVGs";
constant ANALYSIS_MISSING_QUOTES (line 26) | const ANALYSIS_MISSING_QUOTES: &str = "Missing quotes on attributes";
constant ANALYSIS_HEADING_STRUCTURE (line 27) | const ANALYSIS_HEADING_STRUCTURE: &str = "Heading structure";
constant ANALYSIS_NON_CLICKABLE_PHONE_NUMBERS (line 28) | const ANALYSIS_NON_CLICKABLE_PHONE_NUMBERS: &str = "Non-clickable phone ...
constant ANALYSIS_DOM_DEPTH (line 29) | const ANALYSIS_DOM_DEPTH: &str = "DOM depth";
constant ANALYSIS_TITLE_UNIQUENESS (line 30) | const ANALYSIS_TITLE_UNIQUENESS: &str = "Title uniqueness";
constant ANALYSIS_DESCRIPTION_UNIQUENESS (line 31) | const ANALYSIS_DESCRIPTION_UNIQUENESS: &str = "Description uniqueness";
constant ANALYSIS_BROTLI_SUPPORT (line 32) | const ANALYSIS_BROTLI_SUPPORT: &str = "Brotli support";
constant ANALYSIS_WEBP_SUPPORT (line 33) | const ANALYSIS_WEBP_SUPPORT: &str = "WebP support";
constant ANALYSIS_AVIF_SUPPORT (line 34) | const ANALYSIS_AVIF_SUPPORT: &str = "AVIF support";
constant SUPER_TABLE_BEST_PRACTICES (line 36) | const SUPER_TABLE_BEST_PRACTICES: &str = "best-practices";
constant SUPER_TABLE_NON_UNIQUE_TITLES (line 37) | const SUPER_TABLE_NON_UNIQUE_TITLES: &str = "non-unique-titles";
constant SUPER_TABLE_NON_UNIQUE_DESCRIPTIONS (line 38) | const SUPER_TABLE_NON_UNIQUE_DESCRIPTIONS: &str = "non-unique-descriptio...
type BestPracticeAnalyzer (line 40) | pub struct BestPracticeAnalyzer {
method new (line 72) | pub fn new() -> Self {
method get_analysis_result (line 97) | fn get_analysis_result(
method analyze_urls (line 113) | fn analyze_urls(&mut self, status: &Status, output: &mut dyn Output) -...
method check_inline_svg (line 173) | fn check_inline_svg(&mut self, html: &str, result: &mut UrlAnalysisRes...
method check_missing_quotes_on_attributes (line 329) | fn check_missing_quotes_on_attributes(&mut self, html: &str, result: &...
method check_max_dom_depth (line 387) | fn check_max_dom_depth(&mut self, html: &str, url: &str, result: &mut ...
method check_heading_structure (line 434) | fn check_heading_structure(&mut self, html: &str, result: &mut UrlAnal...
method check_non_clickable_phone_numbers (line 586) | fn check_non_clickable_phone_numbers(&mut self, html: &str, result: &m...
method check_title_uniqueness (line 615) | fn check_title_uniqueness(
method check_meta_description_uniqueness (line 740) | fn check_meta_description_uniqueness(
method check_brotli_support (line 864) | fn check_brotli_support(&self, urls: &[&VisitedUrl], status: &Status) ...
method check_webp_support (line 884) | fn check_webp_support(&self, urls: &[&VisitedUrl], status: &Status) ->...
method check_avif_support (line 922) | fn check_avif_support(&self, urls: &[&VisitedUrl], status: &Status) ->...
method set_findings_to_summary (line 947) | fn set_findings_to_summary(&self, status: &Status) {
method default (line 66) | fn default() -> Self {
method analyze (line 1076) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method analyze_visited_url (line 1205) | fn analyze_visited_url(
method show_analyzed_visited_url_result_as_column (line 1247) | fn show_analyzed_visited_url_result_as_column(&self) -> Option<ExtraColu...
method should_be_activated (line 1251) | fn should_be_activated(&self) -> bool {
method get_order (line 1255) | fn get_order(&self) -> i32 {
method get_name (line 1259) | fn get_name(&self) -> &str {
method get_exec_times (line 1263) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 1267) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
function validate_svg (line 1273) | fn validate_svg(svg: &str) -> Option<Vec<String>> {
function sanitize_svg (line 1294) | fn sanitize_svg(svg: &str) -> String {
function find_max_depth (line 1303) | fn find_max_depth(node_ref: ego_tree::NodeRef<scraper::Node>, depth: usi...
function parse_phone_numbers_from_html (line 1313) | fn parse_phone_numbers_from_html(html: &str, only_non_clickable: bool) -...
function strip_js_and_css (line 1371) | fn strip_js_and_css(html: &str) -> String {
FILE: src/analysis/caching_analyzer.rs
constant SUPER_TABLE_CACHING_PER_CONTENT_TYPE (line 15) | const SUPER_TABLE_CACHING_PER_CONTENT_TYPE: &str = "caching-per-content-...
constant SUPER_TABLE_CACHING_PER_DOMAIN (line 16) | const SUPER_TABLE_CACHING_PER_DOMAIN: &str = "caching-per-domain";
constant SUPER_TABLE_CACHING_PER_DOMAIN_AND_CONTENT_TYPE (line 17) | const SUPER_TABLE_CACHING_PER_DOMAIN_AND_CONTENT_TYPE: &str = "caching-p...
type CachingAnalyzer (line 19) | pub struct CachingAnalyzer {
method new (line 30) | pub fn new() -> Self {
method update_cache_stat (line 36) | fn update_cache_stat(stat: &mut CacheStat, visited_url: &VisitedUrl) {
method build_lifetime_columns (line 53) | fn build_lifetime_columns(first_col_name: &str, first_col_key: &str) -...
method default (line 24) | fn default() -> Self {
method analyze (line 157) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method should_be_activated (line 297) | fn should_be_activated(&self) -> bool {
method get_order (line 301) | fn get_order(&self) -> i32 {
method get_name (line 305) | fn get_name(&self) -> &str {
method get_exec_times (line 309) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 313) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
type CacheStat (line 319) | struct CacheStat {
type CacheStatWithType (line 328) | struct CacheStatWithType {
method to_row (line 335) | fn to_row(&self) -> HashMap<String, String> {
type CacheStatWithDomain (line 359) | struct CacheStatWithDomain {
method to_row (line 366) | fn to_row(&self) -> HashMap<String, String> {
type CacheStatWithDomainAndType (line 390) | struct CacheStatWithDomainAndType {
method to_row (line 398) | fn to_row(&self) -> HashMap<String, String> {
FILE: src/analysis/content_type_analyzer.rs
constant SUPER_TABLE_CONTENT_TYPES (line 15) | const SUPER_TABLE_CONTENT_TYPES: &str = "content-types";
constant SUPER_TABLE_CONTENT_MIME_TYPES (line 16) | const SUPER_TABLE_CONTENT_MIME_TYPES: &str = "content-types-raw";
type ContentTypeAnalyzer (line 18) | pub struct ContentTypeAnalyzer {
method new (line 29) | pub fn new() -> Self {
method add_content_type_super_table (line 35) | fn add_content_type_super_table(&self, status: &Status, output: &mut d...
method add_content_type_raw_super_table (line 131) | fn add_content_type_raw_super_table(&self, status: &Status, output: &m...
method default (line 23) | fn default() -> Self {
method analyze (line 230) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method should_be_activated (line 235) | fn should_be_activated(&self) -> bool {
method get_order (line 239) | fn get_order(&self) -> i32 {
method get_name (line 243) | fn get_name(&self) -> &str {
method get_exec_times (line 247) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 251) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
type ContentTypeStat (line 256) | struct ContentTypeStat {
type MimeTypeStat (line 271) | struct MimeTypeStat {
function build_content_type_columns (line 284) | fn build_content_type_columns() -> Vec<SuperTableColumn> {
function get_all_content_type_ids (line 503) | fn get_all_content_type_ids() -> Vec<ContentTypeId> {
FILE: src/analysis/dns_analyzer.rs
constant SUPER_TABLE_DNS (line 15) | const SUPER_TABLE_DNS: &str = "dns";
type DnsAnalyzer (line 17) | pub struct DnsAnalyzer {
method new (line 28) | pub fn new() -> Self {
method get_dns_info (line 35) | fn get_dns_info(&self, domain: &str) -> Result<DnsAnalysisResult, Stri...
method get_system_dns_server (line 102) | fn get_system_dns_server() -> Option<String> {
method default (line 22) | fn default() -> Self {
method analyze (line 117) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method should_be_activated (line 271) | fn should_be_activated(&self) -> bool {
method get_order (line 275) | fn get_order(&self) -> i32 {
method get_name (line 279) | fn get_name(&self) -> &str {
method get_exec_times (line 283) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 287) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
FILE: src/analysis/external_links_analyzer.rs
constant SUPER_TABLE_EXTERNAL_URLS (line 17) | const SUPER_TABLE_EXTERNAL_URLS: &str = "external-urls";
constant MAX_SOURCE_PAGES (line 18) | const MAX_SOURCE_PAGES: usize = 5;
type ExternalLinksAnalyzer (line 20) | pub struct ExternalLinksAnalyzer {
method new (line 31) | pub fn new() -> Self {
method default (line 25) | fn default() -> Self {
method analyze (line 39) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method should_be_activated (line 151) | fn should_be_activated(&self) -> bool {
method get_order (line 155) | fn get_order(&self) -> i32 {
method get_name (line 159) | fn get_name(&self) -> &str {
method get_exec_times (line 163) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 167) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
FILE: src/analysis/fastest_analyzer.rs
constant SUPER_TABLE_FASTEST_URLS (line 15) | const SUPER_TABLE_FASTEST_URLS: &str = "fastest-urls";
type FastestAnalyzer (line 17) | pub struct FastestAnalyzer {
method new (line 30) | pub fn new() -> Self {
method set_config (line 39) | pub fn set_config(&mut self, fastest_top_limit: usize, fastest_max_tim...
method default (line 24) | fn default() -> Self {
method analyze (line 46) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method should_be_activated (line 150) | fn should_be_activated(&self) -> bool {
method get_order (line 154) | fn get_order(&self) -> i32 {
method get_name (line 158) | fn get_name(&self) -> &str {
method get_exec_times (line 162) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 166) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
FILE: src/analysis/headers_analyzer.rs
constant SUPER_TABLE_HEADERS (line 17) | const SUPER_TABLE_HEADERS: &str = "headers";
constant SUPER_TABLE_HEADERS_VALUES (line 18) | const SUPER_TABLE_HEADERS_VALUES: &str = "headers-values";
type HeadersAnalyzer (line 20) | pub struct HeadersAnalyzer {
method new (line 32) | pub fn new() -> Self {
method default (line 26) | fn default() -> Self {
method analyze (line 41) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method analyze_visited_url (line 289) | fn analyze_visited_url(
method should_be_activated (line 313) | fn should_be_activated(&self) -> bool {
method get_order (line 317) | fn get_order(&self) -> i32 {
method get_name (line 321) | fn get_name(&self) -> &str {
method get_exec_times (line 325) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 329) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
FILE: src/analysis/manager.rs
constant SUPER_TABLE_ANALYSIS_STATS (line 14) | pub const SUPER_TABLE_ANALYSIS_STATS: &str = "analysis-stats";
type AnalysisManager (line 16) | pub struct AnalysisManager {
method new (line 22) | pub fn new() -> Self {
method register_analyzer (line 31) | pub fn register_analyzer(&mut self, analyzer: Box<dyn Analyzer>) {
method auto_activate_analyzers (line 36) | pub fn auto_activate_analyzers(&mut self) {
method filter_analyzers_by_regex (line 43) | pub fn filter_analyzers_by_regex(&mut self, filter_regex: &str) {
method analyze_visited_url (line 52) | pub fn analyze_visited_url(
method run_analyzers (line 79) | pub fn run_analyzers(&mut self, status: &Status, output: &mut dyn Outp...
method get_analyzers (line 126) | pub fn get_analyzers(&self) -> &[Box<dyn Analyzer>] {
method has_analyzer (line 131) | pub fn has_analyzer(&self, name: &str) -> bool {
method get_extra_columns (line 137) | pub fn get_extra_columns(&self) -> Vec<crate::extra_column::ExtraColum...
method get_analysis_column_values (line 146) | pub fn get_analysis_column_values(
method default (line 170) | fn default() -> Self {
FILE: src/analysis/page404_analyzer.rs
constant SUPER_TABLE_404 (line 14) | const SUPER_TABLE_404: &str = "404";
type Page404Analyzer (line 16) | pub struct Page404Analyzer {
method new (line 27) | pub fn new() -> Self {
method default (line 21) | fn default() -> Self {
method analyze (line 35) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method should_be_activated (line 138) | fn should_be_activated(&self) -> bool {
method get_order (line 142) | fn get_order(&self) -> i32 {
method get_name (line 146) | fn get_name(&self) -> &str {
method get_exec_times (line 150) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 154) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
FILE: src/analysis/redirects_analyzer.rs
constant SUPER_TABLE_REDIRECTS (line 14) | const SUPER_TABLE_REDIRECTS: &str = "redirects";
type RedirectsAnalyzer (line 16) | pub struct RedirectsAnalyzer {
method new (line 27) | pub fn new() -> Self {
method default (line 21) | fn default() -> Self {
method analyze (line 35) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method should_be_activated (line 161) | fn should_be_activated(&self) -> bool {
method get_order (line 165) | fn get_order(&self) -> i32 {
method get_name (line 169) | fn get_name(&self) -> &str {
method get_exec_times (line 173) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 177) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
FILE: src/analysis/result/analyzer_stats.rs
type AnalyzerStats (line 7) | pub struct AnalyzerStats {
method new (line 21) | pub fn new() -> Self {
method add_ok (line 25) | pub fn add_ok(&mut self, analysis_name: &str, subject: Option<&str>) {
method add_warning (line 29) | pub fn add_warning(&mut self, analysis_name: &str, subject: Option<&st...
method add_critical (line 33) | pub fn add_critical(&mut self, analysis_name: &str, subject: Option<&s...
method add_notice (line 37) | pub fn add_notice(&mut self, analysis_name: &str, subject: Option<&str...
method to_table_data (line 41) | pub fn to_table_data(&self) -> Vec<HashMap<String, String>> {
method add_result (line 55) | fn add_result(&mut self, analysis_name: &str, severity: &str, subject:...
type SeverityCounts (line 13) | struct SeverityCounts {
FILE: src/analysis/result/dns_analysis_result.rs
type DnsAnalysisResult (line 5) | pub struct DnsAnalysisResult {
method new (line 18) | pub fn new(
method get_txt_description (line 36) | pub fn get_txt_description(&self) -> String {
FILE: src/analysis/result/header_stats.rs
constant MAX_UNIQUE_VALUES (line 8) | const MAX_UNIQUE_VALUES: usize = 20;
type HeaderStats (line 11) | pub struct HeaderStats {
method new (line 23) | pub fn new(header: String) -> Self {
method add_value (line 36) | pub fn add_value(&mut self, value: &str) {
method get_sorted_unique_values (line 53) | pub fn get_sorted_unique_values(&self) -> Vec<(&String, &usize)> {
method get_formatted_header_name (line 59) | pub fn get_formatted_header_name(&self) -> String {
method is_value_for_min_max_int (line 74) | pub fn is_value_for_min_max_int(&self, header: &str) -> bool {
method is_value_for_min_max_date (line 78) | pub fn is_value_for_min_max_date(&self, header: &str) -> bool {
method ignore_header_values (line 82) | pub fn ignore_header_values(&self, header: &str) -> bool {
method get_min_value (line 86) | pub fn get_min_value(&self) -> Option<String> {
method get_max_value (line 92) | pub fn get_max_value(&self) -> Option<String> {
method get_values_preview (line 98) | pub fn get_values_preview(&self, max_length: usize) -> String {
method add_value_for_min_max_int (line 130) | fn add_value_for_min_max_int(&mut self, value: &str) {
method add_value_for_min_max_date (line 145) | fn add_value_for_min_max_date(&mut self, value: &str) {
FILE: src/analysis/result/heading_tree_item.rs
function html_escape (line 4) | fn html_escape(s: &str) -> String {
type HeadingTreeItem (line 13) | pub struct HeadingTreeItem {
method new (line 29) | pub fn new(level: i32, text: String, id: Option<String>) -> Self {
method has_error (line 40) | pub fn has_error(&self) -> bool {
method get_heading_tree_txt_list (line 45) | pub fn get_heading_tree_txt_list(items: &[HeadingTreeItem]) -> String {
method get_heading_tree_txt (line 55) | fn get_heading_tree_txt(item: &HeadingTreeItem, add_item: bool) -> Str...
method get_heading_tree_ul_li_list (line 77) | pub fn get_heading_tree_ul_li_list(items: &[HeadingTreeItem]) -> String {
method get_heading_tree_ul_li (line 88) | fn get_heading_tree_ul_li(item: &HeadingTreeItem, add_item: bool) -> S...
method get_headings_count (line 147) | pub fn get_headings_count(items: &[HeadingTreeItem]) -> usize {
method get_headings_with_error_count (line 157) | pub fn get_headings_with_error_count(items: &[HeadingTreeItem]) -> usi...
FILE: src/analysis/result/security_checked_header.rs
constant SEVERITY_OK (line 6) | pub const SEVERITY_OK: i32 = 1;
constant SEVERITY_NOTICE (line 7) | pub const SEVERITY_NOTICE: i32 = 2;
constant SEVERITY_WARNING (line 8) | pub const SEVERITY_WARNING: i32 = 3;
constant SEVERITY_CRITICAL (line 9) | pub const SEVERITY_CRITICAL: i32 = 4;
type SecurityCheckedHeader (line 12) | pub struct SecurityCheckedHeader {
method new (line 23) | pub fn new(header: String) -> Self {
method set_finding (line 33) | pub fn set_finding(&mut self, value: Option<&str>, severity: i32, reco...
method get_formatted_header (line 50) | pub fn get_formatted_header(&self) -> String {
method get_severity_name (line 65) | pub fn get_severity_name(&self) -> &'static str {
FILE: src/analysis/result/security_result.rs
type SecurityResult (line 9) | pub struct SecurityResult {
method new (line 14) | pub fn new() -> Self {
method get_checked_header (line 18) | pub fn get_checked_header(&mut self, header: &str) -> &mut SecurityChe...
method get_highest_severity (line 24) | pub fn get_highest_severity(&self) -> i32 {
FILE: src/analysis/result/seo_opengraph_result.rs
constant ROBOTS_INDEX (line 6) | pub const ROBOTS_INDEX: i32 = 1;
constant ROBOTS_NOINDEX (line 7) | pub const ROBOTS_NOINDEX: i32 = 0;
constant ROBOTS_FOLLOW (line 8) | pub const ROBOTS_FOLLOW: i32 = 1;
constant ROBOTS_NOFOLLOW (line 9) | pub const ROBOTS_NOFOLLOW: i32 = 2;
type SeoAndOpenGraphResult (line 12) | pub struct SeoAndOpenGraphResult {
method new (line 45) | pub fn new(url_uq_id: String, url_path_and_query: String) -> Self {
method is_denied_by_robots_txt (line 75) | pub fn is_denied_by_robots_txt(url_path_and_query: &str, robots_txt_co...
FILE: src/analysis/result/url_analysis_result.rs
type UrlAnalysisResult (line 9) | pub struct UrlAnalysisResult {
method new (line 25) | pub fn new() -> Self {
method add_ok (line 29) | pub fn add_ok(&mut self, message: String, analysis_name: &str, detail:...
method add_notice (line 42) | pub fn add_notice(&mut self, message: String, analysis_name: &str, det...
method add_warning (line 58) | pub fn add_warning(&mut self, message: String, analysis_name: &str, de...
method add_critical (line 74) | pub fn add_critical(&mut self, message: String, analysis_name: &str, d...
method get_stats_per_analysis (line 90) | pub fn get_stats_per_analysis(&self) -> &HashMap<String, HashMap<Strin...
method get_ok (line 94) | pub fn get_ok(&self) -> &[String] {
method get_notice (line 98) | pub fn get_notice(&self) -> &[String] {
method get_warning (line 102) | pub fn get_warning(&self) -> &[String] {
method get_critical (line 106) | pub fn get_critical(&self) -> &[String] {
method get_ok_details (line 110) | pub fn get_ok_details(&self) -> &HashMap<String, Vec<String>> {
method get_notice_details (line 114) | pub fn get_notice_details(&self) -> &HashMap<String, Vec<String>> {
method get_warning_details (line 118) | pub fn get_warning_details(&self) -> &HashMap<String, Vec<String>> {
method get_critical_details (line 122) | pub fn get_critical_details(&self) -> &HashMap<String, Vec<String>> {
method get_all_count (line 126) | pub fn get_all_count(&self) -> usize {
method get_details_of_severity_and_analysis_name (line 130) | pub fn get_details_of_severity_and_analysis_name(&self, severity: &str...
method to_icon_string (line 140) | pub fn to_icon_string(&self) -> String {
method to_colorized_string (line 164) | pub fn to_colorized_string(&self, strip_whitespaces: bool) -> String {
method to_not_colorized_string (line 197) | pub fn to_not_colorized_string(&self, strip_whitespaces: bool) -> Stri...
method get_all_details_for_analysis (line 226) | pub fn get_all_details_for_analysis(&self, analysis_name: &str) -> Has...
method fmt (line 249) | fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
FILE: src/analysis/security_analyzer.rs
constant SUPER_TABLE_SECURITY (line 24) | const SUPER_TABLE_SECURITY: &str = "security";
constant ANALYSIS_HEADERS (line 25) | const ANALYSIS_HEADERS: &str = "Security headers";
constant HEADER_ACCESS_CONTROL_ALLOW_ORIGIN (line 27) | const HEADER_ACCESS_CONTROL_ALLOW_ORIGIN: &str = "access-control-allow-o...
constant HEADER_STRICT_TRANSPORT_SECURITY (line 28) | const HEADER_STRICT_TRANSPORT_SECURITY: &str = "strict-transport-security";
constant HEADER_X_FRAME_OPTIONS (line 29) | const HEADER_X_FRAME_OPTIONS: &str = "x-frame-options";
constant HEADER_X_XSS_PROTECTION (line 30) | const HEADER_X_XSS_PROTECTION: &str = "x-xss-protection";
constant HEADER_X_CONTENT_TYPE_OPTIONS (line 31) | const HEADER_X_CONTENT_TYPE_OPTIONS: &str = "x-content-type-options";
constant HEADER_REFERRER_POLICY (line 32) | const HEADER_REFERRER_POLICY: &str = "referrer-policy";
constant HEADER_CONTENT_SECURITY_POLICY (line 33) | const HEADER_CONTENT_SECURITY_POLICY: &str = "content-security-policy";
constant HEADER_FEATURE_POLICY (line 34) | const HEADER_FEATURE_POLICY: &str = "feature-policy";
constant HEADER_PERMISSIONS_POLICY (line 35) | const HEADER_PERMISSIONS_POLICY: &str = "permissions-policy";
constant HEADER_SERVER (line 36) | const HEADER_SERVER: &str = "server";
constant HEADER_X_POWERED_BY (line 37) | const HEADER_X_POWERED_BY: &str = "x-powered-by";
constant HEADER_SET_COOKIE (line 38) | const HEADER_SET_COOKIE: &str = "set-cookie";
constant CHECKED_HEADERS (line 40) | const CHECKED_HEADERS: &[&str] = &[
type SecurityAnalyzer (line 55) | pub struct SecurityAnalyzer {
method new (line 70) | pub fn new() -> Self {
method check_headers (line 80) | fn check_headers(&mut self, headers: &HashMap<String, String>, is_http...
method check_html_security (line 126) | fn check_html_security(&mut self, html: &str, is_https: bool, url_resu...
method get_header_value (line 153) | fn get_header_value(headers: &HashMap<String, String>, header: &str) -...
method check_access_control_allow_origin (line 157) | fn check_access_control_allow_origin(
method check_strict_transport_security (line 192) | fn check_strict_transport_security(
method check_x_frame_options (line 240) | fn check_x_frame_options(&mut self, headers: &HashMap<String, String>,...
method check_x_xss_protection (line 290) | fn check_x_xss_protection(&mut self, headers: &HashMap<String, String>...
method check_x_content_type_options (line 328) | fn check_x_content_type_options(&mut self, headers: &HashMap<String, S...
method check_referrer_policy (line 358) | fn check_referrer_policy(&mut self, headers: &HashMap<String, String>,...
method check_content_security_policy (line 398) | fn check_content_security_policy(&mut self, headers: &HashMap<String, ...
method check_feature_policy (line 418) | fn check_feature_policy(&mut self, headers: &HashMap<String, String>, ...
method check_permissions_policy (line 447) | fn check_permissions_policy(&mut self, headers: &HashMap<String, Strin...
method check_server (line 480) | fn check_server(&mut self, headers: &HashMap<String, String>, url_resu...
method check_x_powered_by (line 536) | fn check_x_powered_by(&mut self, headers: &HashMap<String, String>, ur...
method check_set_cookie (line 569) | fn check_set_cookie(
method check_set_cookie_value (line 589) | fn check_set_cookie_value(&mut self, set_cookie: &str, is_https: bool,...
method set_findings_to_summary (line 625) | fn set_findings_to_summary(&mut self, status: &Status) {
method default (line 64) | fn default() -> Self {
method analyze (line 661) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method analyze_visited_url (line 837) | fn analyze_visited_url(
method should_be_activated (line 870) | fn should_be_activated(&self) -> bool {
method get_order (line 874) | fn get_order(&self) -> i32 {
method get_name (line 878) | fn get_name(&self) -> &str {
method get_exec_times (line 882) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 886) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
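The SecurityAnalyzer's per-header checks (get_header_value plus the check_* methods above) reduce to a case-insensitive header lookup followed by value validation. A minimal sketch of an HSTS-style check; the validation rule shown (header present with a positive max-age) and the helper hsts_looks_ok are illustrative assumptions, not the crate's exact logic:

```rust
use std::collections::HashMap;

// Case-insensitive header lookup, as `get_header_value` implies.
fn get_header_value<'a>(headers: &'a HashMap<String, String>, header: &str) -> Option<&'a str> {
    headers
        .iter()
        .find(|(k, _)| k.eq_ignore_ascii_case(header))
        .map(|(_, v)| v.as_str())
}

// Assumed HSTS rule for illustration: header present with a positive max-age.
fn hsts_looks_ok(headers: &HashMap<String, String>) -> bool {
    get_header_value(headers, "strict-transport-security")
        .and_then(|v| {
            v.split(';')
                .find_map(|part| part.trim().strip_prefix("max-age="))
                .and_then(|n| n.parse::<u64>().ok())
        })
        .map_or(false, |age| age > 0)
}

fn main() {
    let mut headers = HashMap::new();
    headers.insert(
        "Strict-Transport-Security".to_string(),
        "max-age=31536000; includeSubDomains".to_string(),
    );
    assert!(hsts_looks_ok(&headers));
    headers.clear();
    assert!(!hsts_looks_ok(&headers));
    println!("hsts check ok");
}
```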
FILE: src/analysis/seo_opengraph_analyzer.rs
constant SUPER_TABLE_SEO (line 21) | const SUPER_TABLE_SEO: &str = "seo";
constant SUPER_TABLE_OPEN_GRAPH (line 22) | const SUPER_TABLE_OPEN_GRAPH: &str = "open-graph";
constant SUPER_TABLE_SEO_HEADINGS (line 23) | const SUPER_TABLE_SEO_HEADINGS: &str = "seo-headings";
type SeoAndOpenGraphAnalyzer (line 25) | pub struct SeoAndOpenGraphAnalyzer {
method new (line 39) | pub fn new() -> Self {
method set_config (line 49) | pub fn set_config(&mut self, max_heading_level: i32) {
method get_seo_and_opengraph_results (line 53) | fn get_seo_and_opengraph_results(&self, status: &Status) -> Vec<SeoAnd...
method analyze_seo (line 83) | fn analyze_seo(&self, url_results: &[SeoAndOpenGraphResult], status: &...
method analyze_open_graph (line 206) | fn analyze_open_graph(&self, url_results: &[SeoAndOpenGraphResult], st...
method analyze_headings (line 335) | fn analyze_headings(&self, url_results: &[SeoAndOpenGraphResult], stat...
method default (line 33) | fn default() -> Self {
method analyze (line 429) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method should_be_activated (line 464) | fn should_be_activated(&self) -> bool {
method get_order (line 468) | fn get_order(&self) -> i32 {
method get_name (line 472) | fn get_name(&self) -> &str {
method get_exec_times (line 476) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 480) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
function get_url_path_and_query (line 485) | fn get_url_path_and_query(url: &str) -> String {
function extract_seo_metadata (line 498) | fn extract_seo_metadata(document: &Html, result: &mut SeoAndOpenGraphRes...
function extract_opengraph_metadata (line 550) | fn extract_opengraph_metadata(document: &Html, result: &mut SeoAndOpenGr...
function extract_twitter_metadata (line 584) | fn extract_twitter_metadata(document: &Html, result: &mut SeoAndOpenGrap...
function build_heading_tree (line 617) | fn build_heading_tree(document: &Html, result: &mut SeoAndOpenGraphResul...
function seo_results_to_table_data (line 727) | fn seo_results_to_table_data(results: &[SeoAndOpenGraphResult]) -> Vec<H...
function og_results_to_table_data (line 748) | fn og_results_to_table_data(results: &[SeoAndOpenGraphResult]) -> Vec<Ha...
function headings_to_table_data (line 771) | fn headings_to_table_data(results: &[SeoAndOpenGraphResult]) -> Vec<Hash...
FILE: src/analysis/skipped_urls_analyzer.rs
constant SUPER_TABLE_SKIPPED_SUMMARY (line 15) | const SUPER_TABLE_SKIPPED_SUMMARY: &str = "skipped-summary";
constant SUPER_TABLE_SKIPPED (line 16) | const SUPER_TABLE_SKIPPED: &str = "skipped";
type SkippedUrlsAnalyzer (line 18) | pub struct SkippedUrlsAnalyzer {
method new (line 29) | pub fn new() -> Self {
method get_reason_label (line 35) | fn get_reason_label(reason: &SkippedReason) -> &'static str {
method get_source_short_name (line 43) | fn get_source_short_name(source_attr: i32) -> &'static str {
method default (line 23) | fn default() -> Self {
method analyze (line 66) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method should_be_activated (line 304) | fn should_be_activated(&self) -> bool {
method get_order (line 308) | fn get_order(&self) -> i32 {
method get_name (line 312) | fn get_name(&self) -> &str {
method get_exec_times (line 316) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 320) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
FILE: src/analysis/slowest_analyzer.rs
constant SUPER_TABLE_SLOWEST_URLS (line 15) | const SUPER_TABLE_SLOWEST_URLS: &str = "slowest-urls";
type SlowestAnalyzer (line 17) | pub struct SlowestAnalyzer {
method new (line 31) | pub fn new() -> Self {
method set_config (line 41) | pub fn set_config(&mut self, slowest_top_limit: usize, slowest_min_tim...
method default (line 25) | fn default() -> Self {
method analyze (line 49) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method should_be_activated (line 183) | fn should_be_activated(&self) -> bool {
method get_order (line 187) | fn get_order(&self) -> i32 {
method get_name (line 191) | fn get_name(&self) -> &str {
method get_exec_times (line 195) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 199) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
FILE: src/analysis/source_domains_analyzer.rs
constant SUPER_TABLE_SOURCE_DOMAINS (line 15) | const SUPER_TABLE_SOURCE_DOMAINS: &str = "source-domains";
type SourceDomainsAnalyzer (line 17) | pub struct SourceDomainsAnalyzer {
method new (line 28) | pub fn new() -> Self {
method default (line 22) | fn default() -> Self {
method analyze (line 36) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method should_be_activated (line 193) | fn should_be_activated(&self) -> bool {
method get_order (line 197) | fn get_order(&self) -> i32 {
method get_name (line 201) | fn get_name(&self) -> &str {
method get_exec_times (line 205) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 209) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
type DomainContentTypeStat (line 214) | struct DomainContentTypeStat {
function get_all_content_type_ids (line 220) | fn get_all_content_type_ids() -> Vec<ContentTypeId> {
FILE: src/analysis/ssl_tls_analyzer.rs
constant SUPER_TABLE_CERTIFICATE_INFO (line 21) | const SUPER_TABLE_CERTIFICATE_INFO: &str = "certificate-info";
type SslTlsAnalyzer (line 23) | pub struct SslTlsAnalyzer {
method new (line 34) | pub fn new() -> Self {
method get_tls_certificate_info (line 40) | fn get_tls_certificate_info(&self, hostname: &str, port: u16, status: ...
method default (line 28) | fn default() -> Self {
method analyze (line 324) | fn analyze(&mut self, status: &Status, output: &mut dyn Output) {
method should_be_activated (line 438) | fn should_be_activated(&self) -> bool {
method get_order (line 442) | fn get_order(&self) -> i32 {
method get_name (line 446) | fn get_name(&self) -> &str {
method get_exec_times (line 450) | fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 454) | fn get_exec_counts(&self) -> &HashMap<String, usize> {
function format_asn1_time (line 459) | fn format_asn1_time(time: &ASN1Time) -> String {
function add_spaces_around_equals (line 464) | fn add_spaces_around_equals(s: &str) -> String {
function asn1_time_to_datetime (line 470) | fn asn1_time_to_datetime(time: &ASN1Time) -> Option<chrono::DateTime<chr...
function is_hostname_shell_safe (line 478) | fn is_hostname_shell_safe(hostname: &str) -> bool {
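is_hostname_shell_safe suggests hostnames are validated with an allow-list before being used in a shell-adjacent context. A hedged sketch; the concrete character set (ASCII alphanumerics plus '-', '.', '_') is an assumption about what the crate accepts:

```rust
// Allow-list check: accept only characters that are inert in shell contexts.
// The exact allowed set here is an illustrative assumption.
fn is_hostname_shell_safe(hostname: &str) -> bool {
    !hostname.is_empty()
        && hostname
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || matches!(c, '-' | '.' | '_'))
}

fn main() {
    assert!(is_hostname_shell_safe("crawler.siteone.io"));
    assert!(!is_hostname_shell_safe("evil.com; rm -rf /"));
    println!("hostname checks passed");
}
```

An allow-list beats an escape-based deny-list here: rejecting anything outside a small known-safe alphabet leaves no room for overlooked metacharacters.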
FILE: src/components/summary/item.rs
type Item (line 11) | pub struct Item {
method new (line 18) | pub fn new(apl_code: String, text: String, status: ItemStatus) -> Self {
method get_as_html (line 22) | pub fn get_as_html(&self) -> String {
method get_as_console_text (line 37) | pub fn get_as_console_text(&self) -> String {
function html_escape (line 51) | fn html_escape(s: &str) -> String {
FILE: src/components/summary/item_status.rs
type ItemStatus (line 10) | pub enum ItemStatus {
method from_range_id (line 19) | pub fn from_range_id(range_id: i32) -> Result<Self, CrawlerError> {
method from_text (line 33) | pub fn from_text(text: &str) -> Result<Self, CrawlerError> {
method sort_order (line 47) | pub fn sort_order(&self) -> i32 {
FILE: src/components/summary/summary.rs
type Summary (line 11) | pub struct Summary {
method new (line 16) | pub fn new() -> Self {
method add_item (line 20) | pub fn add_item(&mut self, item: Item) {
method get_items (line 24) | pub fn get_items(&self) -> &[Item] {
method sort_items (line 28) | fn sort_items(&mut self) {
method get_as_html (line 32) | pub fn get_as_html(&mut self) -> String {
method get_as_console_text (line 42) | pub fn get_as_console_text(&mut self) -> String {
method get_count_by_item_status (line 55) | pub fn get_count_by_item_status(&self, status: ItemStatus) -> usize {
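The Summary/Item/ItemStatus trio above models a status-ordered findings list (add_item, sort_items, get_count_by_item_status). A minimal sketch with simplified stand-in types; the field set and the specific sort_order values are assumptions (the real Item also carries an apl_code and HTML/console rendering):

```rust
// Simplified stand-ins for Summary/Item/ItemStatus; variants and ordering
// values are illustrative assumptions.
#[derive(Clone, Copy, PartialEq, Eq)]
enum ItemStatus { Critical, Warning, Notice, Ok, Info }

impl ItemStatus {
    // Lower sort order surfaces first, as `sort_items` implies.
    fn sort_order(&self) -> i32 {
        match self {
            ItemStatus::Critical => 0,
            ItemStatus::Warning => 1,
            ItemStatus::Notice => 2,
            ItemStatus::Ok => 3,
            ItemStatus::Info => 4,
        }
    }
}

struct Item { text: String, status: ItemStatus }

#[derive(Default)]
struct Summary { items: Vec<Item> }

impl Summary {
    fn add_item(&mut self, item: Item) { self.items.push(item); }
    fn sort_items(&mut self) {
        self.items.sort_by_key(|i| i.status.sort_order());
    }
    fn get_count_by_item_status(&self, status: ItemStatus) -> usize {
        self.items.iter().filter(|i| i.status == status).count()
    }
}

fn main() {
    let mut s = Summary::default();
    s.add_item(Item { text: "CSP present".into(), status: ItemStatus::Ok });
    s.add_item(Item { text: "HSTS missing".into(), status: ItemStatus::Warning });
    s.sort_items();
    // After sorting, the Warning finding comes before the Ok finding.
    assert_eq!(s.get_count_by_item_status(ItemStatus::Warning), 1);
    println!("{} findings", s.items.len());
}
```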
FILE: src/components/super_table.rs
constant POSITION_BEFORE_URL_TABLE (line 16) | pub const POSITION_BEFORE_URL_TABLE: &str = "before-url-table";
constant POSITION_AFTER_URL_TABLE (line 17) | pub const POSITION_AFTER_URL_TABLE: &str = "after-url-table";
constant RENDER_INTO_HTML (line 19) | pub const RENDER_INTO_HTML: &str = "html";
constant RENDER_INTO_CONSOLE (line 20) | pub const RENDER_INTO_CONSOLE: &str = "console";
type SuperTable (line 25) | pub struct SuperTable {
method new (line 75) | pub fn new(
method set_data (line 117) | pub fn set_data(&mut self, data: Vec<HashMap<String, String>>) {
method get_html_output (line 127) | pub fn get_html_output(&self) -> String {
method get_console_output (line 387) | pub fn get_console_output(&self) -> String {
method get_json_output (line 494) | pub fn get_json_output(&self) -> Option<serde_json::Value> {
method is_position_before_url_table (line 527) | pub fn is_position_before_url_table(&self) -> bool {
method get_data (line 531) | pub fn get_data(&self) -> &[HashMap<String, String>] {
method get_total_rows (line 535) | pub fn get_total_rows(&self) -> usize {
method set_host_to_strip_from_urls (line 539) | pub fn set_host_to_strip_from_urls(&mut self, host: Option<String>, sc...
method set_initial_url (line 544) | pub fn set_initial_url(&mut self, url: Option<String>) {
method set_visibility_in_html (line 548) | pub fn set_visibility_in_html(&mut self, visible: bool) {
method set_visibility_in_console (line 552) | pub fn set_visibility_in_console(&mut self, visible: bool, rows_limit:...
method set_visibility_in_json (line 557) | pub fn set_visibility_in_json(&mut self, visible: bool) {
method is_visible_in_html (line 561) | pub fn is_visible_in_html(&self) -> bool {
method is_visible_in_console (line 565) | pub fn is_visible_in_console(&self) -> bool {
method is_visible_in_json (line 569) | pub fn is_visible_in_json(&self) -> bool {
method disable_fulltext (line 573) | pub fn disable_fulltext(&mut self) {
method set_show_only_columns_with_values (line 577) | pub fn set_show_only_columns_with_values(&mut self, show_only: bool) {
method get_columns (line 581) | pub fn get_columns(&self) -> &[SuperTableColumn] {
method set_hard_rows_limit (line 585) | pub fn set_hard_rows_limit(limit: usize) {
method set_ignore_hard_rows_limit (line 591) | pub fn set_ignore_hard_rows_limit(&mut self, ignore: bool) {
method sort_data (line 595) | fn sort_data(&mut self, column_key: &str, direction: &str) {
method is_fulltext_enabled (line 612) | fn is_fulltext_enabled(&self) -> bool {
method remove_columns_with_empty_data (line 616) | fn remove_columns_with_empty_data(&mut self) {
method apply_hard_rows_limit (line 643) | fn apply_hard_rows_limit(&mut self) {
function html_escape (line 652) | fn html_escape(s: &str) -> String {
function generate_unique_id (line 660) | fn generate_unique_id() -> String {
FILE: src/components/super_table_column.rs
constant AUTO_WIDTH (line 7) | pub const AUTO_WIDTH: i32 = -1;
type FormatterFn (line 9) | pub type FormatterFn = Box<dyn Fn(&str, &str) -> String + Send + Sync>;
type RendererFn (line 10) | pub type RendererFn = Box<dyn Fn(&HashMap<String, String>, &str) -> Stri...
type DataValueCallbackFn (line 11) | pub type DataValueCallbackFn = Box<dyn Fn(&HashMap<String, String>) -> S...
type SuperTableColumn (line 14) | pub struct SuperTableColumn {
method fmt (line 32) | fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
method new (line 51) | pub fn new(
method get_width_px (line 78) | pub fn get_width_px(&self) -> i32 {
method get_auto_width_by_data (line 82) | pub fn get_auto_width_by_data(&self, data: &[HashMap<String, String>])...
method get_data_value (line 106) | pub fn get_data_value(&self, row: &HashMap<String, String>) -> String {
FILE: src/content_processor/astro_processor.rs
type AstroProcessor (line 32) | pub struct AstroProcessor {
method new (line 40) | pub fn new(config: ProcessorConfig) -> Self {
method detect_and_include_other_modules (line 52) | fn detect_and_include_other_modules(
method inline_module_script (line 96) | fn inline_module_script(
method find_urls (line 138) | fn find_urls(&self, content: &str, source_url: &ParsedUrl) -> Option<Fou...
method apply_content_changes_before_url_parsing (line 165) | fn apply_content_changes_before_url_parsing(
method apply_content_changes_for_offline_version (line 174) | fn apply_content_changes_for_offline_version(
method apply_content_changes_for_offline_version_with_loader (line 202) | fn apply_content_changes_for_offline_version_with_loader(
method is_content_type_relevant (line 233) | fn is_content_type_relevant(&self, content_type: ContentTypeId) -> bool {
method get_name (line 237) | fn get_name(&self) -> &str {
method set_debug_mode (line 241) | fn set_debug_mode(&mut self, debug_mode: bool) {
function make_config (line 250) | fn make_config() -> ProcessorConfig {
function test_find_astro_urls (line 255) | fn test_find_astro_urls() {
function test_no_astro_content (line 265) | fn test_no_astro_content() {
function test_module_inlining_with_loader (line 274) | fn test_module_inlining_with_loader() {
function test_module_inlining_without_loader_falls_back (line 304) | fn test_module_inlining_without_loader_falls_back() {
FILE: src/content_processor/base_processor.rs
type ProcessorConfig (line 13) | pub struct ProcessorConfig {
method new (line 33) | pub fn new(initial_url: ParsedUrl) -> Self {
method compile_ignore_regex (line 55) | pub fn compile_ignore_regex(&mut self) {
function is_relevant (line 65) | pub fn is_relevant(content_type: ContentTypeId, relevant_types: &[Conten...
function normalize_path (line 70) | fn normalize_path(path: &str) -> String {
function convert_url_to_relative (line 91) | pub fn convert_url_to_relative(
function initial_url (line 152) | fn initial_url() -> ParsedUrl {
function decode_amp_entity_before_offline_conversion (line 157) | fn decode_amp_entity_before_offline_conversion() {
function decode_numeric_entity_before_offline_conversion (line 168) | fn decode_numeric_entity_before_offline_conversion() {
function preserve_trailing_ampersand (line 178) | fn preserve_trailing_ampersand() {
function skip_data_uri (line 188) | fn skip_data_uri() {
function skip_javascript_uri (line 195) | fn skip_javascript_uri() {
function preserve_urls_same_domain_absolute (line 204) | fn preserve_urls_same_domain_absolute() {
function preserve_urls_same_domain_root_relative (line 217) | fn preserve_urls_same_domain_root_relative() {
function preserve_urls_same_domain_relative (line 224) | fn preserve_urls_same_domain_relative() {
function preserve_urls_cross_domain (line 231) | fn preserve_urls_cross_domain() {
function preserve_urls_with_query_and_fragment (line 244) | fn preserve_urls_with_query_and_fragment() {
function preserve_urls_data_uri_unchanged (line 251) | fn preserve_urls_data_uri_unchanged() {
function preserve_urls_mailto_unchanged (line 258) | fn preserve_urls_mailto_unchanged() {
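base_processor's normalize_path (used by convert_url_to_relative when building offline links) collapses "." and ".." segments. A minimal sketch under the usual resolution rules; edge cases such as leading slashes or ".." escaping the root are left out:

```rust
// Collapse "." and ".." segments in a relative URL path; illustration only.
fn normalize_path(path: &str) -> String {
    let mut segments: Vec<&str> = Vec::new();
    for seg in path.split('/') {
        match seg {
            "." => {}                       // current dir: drop
            ".." => { segments.pop(); }     // parent dir: drop previous segment
            s => segments.push(s),
        }
    }
    segments.join("/")
}

fn main() {
    assert_eq!(normalize_path("a/b/../c/./d"), "a/c/d");
    println!("normalized ok");
}
```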
FILE: src/content_processor/content_processor.rs
type ContentProcessor (line 10) | pub trait ContentProcessor: Send + Sync {
method find_urls (line 12) | fn find_urls(&self, content: &str, source_url: &ParsedUrl) -> Option<F...
method apply_content_changes_before_url_parsing (line 17) | fn apply_content_changes_before_url_parsing(
method apply_content_changes_for_offline_version (line 27) | fn apply_content_changes_for_offline_version(
method apply_content_changes_for_offline_version_with_loader (line 39) | fn apply_content_changes_for_offline_version_with_loader(
method is_content_type_relevant (line 51) | fn is_content_type_relevant(&self, content_type: ContentTypeId) -> bool;
method get_name (line 54) | fn get_name(&self) -> &str;
method set_debug_mode (line 57) | fn set_debug_mode(&mut self, debug_mode: bool);
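The ContentProcessor trait above is the contract every processor in this module implements: URL discovery plus content-rewrite hooks, gated by a content-type relevance check. A minimal sketch of a conforming implementation, with simplified stand-in types for ContentTypeId/FoundUrls and a reduced trait surface (the real signatures also carry ParsedUrl context and the offline-version hooks):

```rust
// Simplified stand-ins; the crate's real types and trait are richer.
#[derive(Clone, Copy, PartialEq)]
enum ContentTypeId { Html, Css }

struct FoundUrls(Vec<String>);

trait ContentProcessor {
    fn find_urls(&self, content: &str) -> Option<FoundUrls>;
    fn is_content_type_relevant(&self, content_type: ContentTypeId) -> bool;
    fn get_name(&self) -> &str;
}

struct CssUrlProcessor;

impl ContentProcessor for CssUrlProcessor {
    // Naive url(...) extraction, illustration only.
    fn find_urls(&self, content: &str) -> Option<FoundUrls> {
        let mut urls = Vec::new();
        let mut rest = content;
        while let Some(start) = rest.find("url(") {
            rest = &rest[start + 4..];
            if let Some(end) = rest.find(')') {
                urls.push(rest[..end].trim_matches(|c: char| c == '"' || c == '\'').to_string());
                rest = &rest[end + 1..];
            } else {
                break;
            }
        }
        if urls.is_empty() { None } else { Some(FoundUrls(urls)) }
    }
    fn is_content_type_relevant(&self, ct: ContentTypeId) -> bool { ct == ContentTypeId::Css }
    fn get_name(&self) -> &str { "css-sketch" }
}

fn main() {
    let p = CssUrlProcessor;
    let found = p.find_urls("body { background: url('/bg.png'); }").unwrap();
    assert_eq!(found.0, vec!["/bg.png".to_string()]);
    println!("{} found {} url(s)", p.get_name(), found.0.len());
}
```

A manager (see ContentProcessorManager below) can then hold `Vec<Box<dyn ContentProcessor>>` and dispatch by `is_content_type_relevant`.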
FILE: src/content_processor/css_processor.rs
type CssProcessor (line 29) | pub struct CssProcessor {
method new (line 36) | pub fn new(config: ProcessorConfig) -> Self {
method remove_unwanted_code_from_css (line 45) | fn remove_unwanted_code_from_css(&self, css: &str) -> String {
method find_urls (line 60) | fn find_urls(&self, content: &str, source_url: &ParsedUrl) -> Option<Fou...
method apply_content_changes_before_url_parsing (line 91) | fn apply_content_changes_before_url_parsing(
method apply_content_changes_for_offline_version (line 100) | fn apply_content_changes_for_offline_version(
method is_content_type_relevant (line 133) | fn is_content_type_relevant(&self, content_type: ContentTypeId) -> bool {
method get_name (line 137) | fn get_name(&self) -> &str {
method set_debug_mode (line 141) | fn set_debug_mode(&mut self, debug_mode: bool) {
function make_config (line 150) | fn make_config() -> ProcessorConfig {
function test_find_css_urls (line 155) | fn test_find_css_urls() {
function test_find_css_urls_disabled_images (line 168) | fn test_find_css_urls_disabled_images() {
FILE: src/content_processor/html_processor.rs
constant JS_VARIABLE_NAME_URL_DEPTH (line 17) | pub const JS_VARIABLE_NAME_URL_DEPTH: &str = "_SiteOneUrlDepth";
constant HTML_PAGES_EXTENSIONS (line 19) | pub const HTML_PAGES_EXTENSIONS: &[&str] = &[
type HtmlProcessor (line 140) | pub struct HtmlProcessor {
method new (line 147) | pub fn new(config: ProcessorConfig) -> Self {
method find_href_urls (line 156) | fn find_href_urls(&self, html: &str, source_url: &ParsedUrl, found_url...
method find_fonts (line 197) | fn find_fonts(&self, html: &str, source_url: &ParsedUrl, found_urls: &...
method find_images (line 216) | fn find_images(&self, html: &str, source_url: &ParsedUrl, found_urls: ...
method find_audio (line 302) | fn find_audio(&self, html: &str, source_url: &ParsedUrl, found_urls: &...
method find_video (line 312) | fn find_video(&self, html: &str, source_url: &ParsedUrl, found_urls: &...
method find_scripts (line 322) | fn find_scripts(&self, html: &str, source_url: &ParsedUrl, found_urls:...
method find_stylesheets (line 372) | fn find_stylesheets(&self, html: &str, source_url: &ParsedUrl, found_u...
method remove_unwanted_code_from_html (line 392) | fn remove_unwanted_code_from_html(&self, html: &str) -> String {
method set_custom_css_for_tile_images (line 414) | fn set_custom_css_for_tile_images(&self, html: &str) -> String {
method set_js_variable_with_url_depth (line 435) | fn set_js_variable_with_url_depth(&self, html: &str, base_url: &str) -...
method set_js_function_to_remove_all_anchor_listeners (line 462) | fn set_js_function_to_remove_all_anchor_listeners(&self, html: &str) -...
method remove_schema_and_host_from_full_origin_urls (line 486) | fn remove_schema_and_host_from_full_origin_urls(&self, url: &ParsedUrl...
method update_html_paths_to_relative (line 535) | fn update_html_paths_to_relative(&self, html: &str, parsed_base_url: &...
method apply_specific_html_changes (line 636) | fn apply_specific_html_changes(
method is_forced_to_remove_anchor_listeners (line 747) | fn is_forced_to_remove_anchor_listeners(&self, html: &str) -> bool {
method find_urls (line 753) | fn find_urls(&self, content: &str, source_url: &ParsedUrl) -> Option<Fou...
method apply_content_changes_before_url_parsing (line 788) | fn apply_content_changes_before_url_parsing(
method apply_content_changes_for_offline_version (line 797) | fn apply_content_changes_for_offline_version(
method is_content_type_relevant (line 854) | fn is_content_type_relevant(&self, content_type: ContentTypeId) -> bool {
method get_name (line 858) | fn get_name(&self) -> &str {
method set_debug_mode (line 862) | fn set_debug_mode(&mut self, debug_mode: bool) {
function html_entity_decode (line 869) | fn html_entity_decode(input: &str) -> String {
function try_decode_entity (line 892) | fn try_decode_entity(s: &str) -> Option<(&'static str, usize)> {
function make_config (line 934) | fn make_config() -> ProcessorConfig {
function test_find_href_urls (line 939) | fn test_find_href_urls() {
function test_find_images (line 950) | fn test_find_images() {
function test_find_scripts (line 959) | fn test_find_scripts() {
function test_single_page_no_hrefs (line 968) | fn test_single_page_no_hrefs() {
function test_find_srcset (line 984) | fn test_find_srcset() {
function test_spaces_in_quoted_img_src (line 993) | fn test_spaces_in_quoted_img_src() {
function test_spaces_in_quoted_a_href (line 1008) | fn test_spaces_in_quoted_a_href() {
function test_spaces_in_quoted_script_src (line 1022) | fn test_spaces_in_quoted_script_src() {
function test_unquoted_src_still_works (line 1036) | fn test_unquoted_src_still_works() {
function test_single_quoted_src_with_spaces (line 1050) | fn test_single_quoted_src_with_spaces() {
function test_unquoted_href_no_spaces (line 1064) | fn test_unquoted_href_no_spaces() {
function test_unquoted_script_src (line 1078) | fn test_unquoted_script_src() {
function test_spaces_in_audio_video_src (line 1092) | fn test_spaces_in_audio_video_src() {
function test_mixed_quoted_and_unquoted (line 1114) | fn test_mixed_quoted_and_unquoted() {
function test_fragment_links_still_skipped (line 1142) | fn test_fragment_links_still_skipped() {
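html_processor's html_entity_decode/try_decode_entity pair implies a small hand-rolled decoder for named entities. A sketch covering a few common ones; the crate's actual entity table (per try_decode_entity) is longer:

```rust
// Decode a handful of common named HTML entities; the real table is larger.
fn html_entity_decode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(pos) = rest.find('&') {
        out.push_str(&rest[..pos]);
        rest = &rest[pos..];
        let mut matched = false;
        for (entity, ch) in [("&amp;", "&"), ("&lt;", "<"), ("&gt;", ">"), ("&quot;", "\"")] {
            if rest.starts_with(entity) {
                out.push_str(ch);
                rest = &rest[entity.len()..];
                matched = true;
                break;
            }
        }
        if !matched {
            // Not a known entity: keep the '&' literally and move on.
            out.push('&');
            rest = &rest[1..];
        }
    }
    out.push_str(rest);
    out
}

fn main() {
    assert_eq!(html_entity_decode("a &amp; b &lt;c&gt;"), "a & b <c>");
    println!("decoded ok");
}
```

Decoding matters before URL extraction: an href of `?a=1&amp;b=2` must become `?a=1&b=2` before it is queued.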
FILE: src/content_processor/javascript_processor.rs
type JavaScriptProcessor (line 42) | pub struct JavaScriptProcessor {
method new (line 50) | pub fn new(config: ProcessorConfig) -> Self {
method find_urls_import_from (line 59) | fn find_urls_import_from(&self, content: &str, source_url: &ParsedUrl)...
method find_urls (line 141) | fn find_urls(&self, content: &str, source_url: &ParsedUrl) -> Option<Fou...
method apply_content_changes_before_url_parsing (line 145) | fn apply_content_changes_before_url_parsing(
method apply_content_changes_for_offline_version (line 154) | fn apply_content_changes_for_offline_version(
method is_content_type_relevant (line 201) | fn is_content_type_relevant(&self, content_type: ContentTypeId) -> bool {
method get_name (line 205) | fn get_name(&self) -> &str {
method set_debug_mode (line 209) | fn set_debug_mode(&mut self, debug_mode: bool) {
function make_config (line 218) | fn make_config() -> ProcessorConfig {
function test_find_import_from (line 223) | fn test_find_import_from() {
function test_skip_html_content (line 233) | fn test_skip_html_content() {
function test_find_quoted_js_paths (line 242) | fn test_find_quoted_js_paths() {
FILE: src/content_processor/manager.rs
constant SUPER_TABLE_CONTENT_PROCESSORS_STATS (line 15) | pub const SUPER_TABLE_CONTENT_PROCESSORS_STATS: &str = "content-processo...
type ContentProcessorManager (line 17) | pub struct ContentProcessorManager {
method new (line 23) | pub fn new() -> Self {
method register_processor (line 32) | pub fn register_processor(&mut self, processor: Box<dyn ContentProcess...
method get_processors (line 42) | pub fn get_processors(&self) -> &[Box<dyn ContentProcessor>] {
method find_urls (line 48) | pub fn find_urls(&mut self, content: &str, content_type: ContentTypeId...
method apply_content_changes_for_offline_version (line 69) | pub fn apply_content_changes_for_offline_version(
method apply_content_changes_for_offline_version_with_loader (line 88) | pub fn apply_content_changes_for_offline_version_with_loader(
method apply_content_changes_before_url_parsing (line 113) | pub fn apply_content_changes_before_url_parsing(
method get_stats (line 130) | pub fn get_stats(&self) -> &ManagerStats {
method default (line 136) | fn default() -> Self {
FILE: src/content_processor/nextjs_processor.rs
type NextJsProcessor (line 45) | pub struct NextJsProcessor {
method new (line 53) | pub fn new(config: ProcessorConfig) -> Self {
method find_urls (line 63) | fn find_urls(&self, content: &str, source_url: &ParsedUrl) -> Option<Fou...
method apply_content_changes_before_url_parsing (line 99) | fn apply_content_changes_before_url_parsing(
method apply_content_changes_for_offline_version (line 115) | fn apply_content_changes_for_offline_version(
method is_content_type_relevant (line 220) | fn is_content_type_relevant(&self, content_type: ContentTypeId) -> bool {
method get_name (line 224) | fn get_name(&self) -> &str {
method set_debug_mode (line 228) | fn set_debug_mode(&mut self, debug_mode: bool) {
function make_config (line 237) | fn make_config() -> ProcessorConfig {
function test_non_manifest_returns_none (line 242) | fn test_non_manifest_returns_none() {
function test_before_url_parsing_removes_dpl (line 251) | fn test_before_url_parsing_removes_dpl() {
FILE: src/content_processor/svelte_processor.rs
type SvelteProcessor (line 17) | pub struct SvelteProcessor {
method new (line 24) | pub fn new(config: ProcessorConfig) -> Self {
method find_urls (line 33) | fn find_urls(&self, _content: &str, _source_url: &ParsedUrl) -> Option<F...
method apply_content_changes_before_url_parsing (line 38) | fn apply_content_changes_before_url_parsing(
method apply_content_changes_for_offline_version (line 47) | fn apply_content_changes_for_offline_version(
method is_content_type_relevant (line 60) | fn is_content_type_relevant(&self, content_type: ContentTypeId) -> bool {
method get_name (line 65) | fn get_name(&self) -> &str {
method set_debug_mode (line 69) | fn set_debug_mode(&mut self, debug_mode: bool) {
function make_config (line 78) | fn make_config() -> ProcessorConfig {
function test_remove_svelte_tags (line 83) | fn test_remove_svelte_tags() {
function test_is_relevant_only_for_html (line 92) | fn test_is_relevant_only_for_html() {
FILE: src/content_processor/xml_processor.rs
type XmlProcessor (line 16) | pub struct XmlProcessor {
method new (line 24) | pub fn new(config: ProcessorConfig) -> Self {
method is_sitemap_xml_index (line 32) | fn is_sitemap_xml_index(content: &str) -> bool {
method is_sitemap_xml (line 36) | fn is_sitemap_xml(content: &str) -> bool {
method get_urls_from_sitemap_xml (line 41) | fn get_urls_from_sitemap_xml(content: &str) -> Vec<String> {
method get_urls_from_sitemap_xml_index (line 82) | fn get_urls_from_sitemap_xml_index(content: &str) -> Vec<String> {
method find_urls (line 131) | fn find_urls(&self, content: &str, source_url: &ParsedUrl) -> Option<Fou...
method apply_content_changes_before_url_parsing (line 163) | fn apply_content_changes_before_url_parsing(
method apply_content_changes_for_offline_version (line 172) | fn apply_content_changes_for_offline_version(
method is_content_type_relevant (line 182) | fn is_content_type_relevant(&self, content_type: ContentTypeId) -> bool {
method get_name (line 186) | fn get_name(&self) -> &str {
method set_debug_mode (line 190) | fn set_debug_mode(&mut self, debug_mode: bool) {
function make_config (line 199) | fn make_config() -> ProcessorConfig {
function test_sitemap_xml (line 204) | fn test_sitemap_xml() {
function test_sitemap_index (line 218) | fn test_sitemap_index() {
function test_non_sitemap_xml (line 235) | fn test_non_sitemap_xml() {
function test_gzip_compressed_sitemap (line 247) | fn test_gzip_compressed_sitemap() {
function test_gzip_compressed_sitemap_index (line 284) | fn test_gzip_compressed_sitemap_index() {
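XmlProcessor's get_urls_from_sitemap_xml pulls page URLs out of `<loc>` elements. A minimal string-scanning sketch; the real implementation also distinguishes sitemap indexes from sitemaps and, per the tests above, handles gzip-compressed input:

```rust
// Extract the text of every <loc>...</loc> element; illustration only.
fn get_urls_from_sitemap_xml(content: &str) -> Vec<String> {
    let mut urls = Vec::new();
    let mut rest = content;
    while let Some(start) = rest.find("<loc>") {
        rest = &rest[start + 5..];
        if let Some(end) = rest.find("</loc>") {
            urls.push(rest[..end].trim().to_string());
            rest = &rest[end + 6..];
        } else {
            break;
        }
    }
    urls
}

fn main() {
    let xml = "<urlset><url><loc>https://example.com/</loc></url>\
               <url><loc>https://example.com/about</loc></url></urlset>";
    let urls = get_urls_from_sitemap_xml(xml);
    assert_eq!(urls.len(), 2);
    assert_eq!(urls[1], "https://example.com/about");
    println!("{} sitemap urls", urls.len());
}
```

A sitemap index uses the same `<loc>` element inside `<sitemap>` entries, which is why the listing has a parallel get_urls_from_sitemap_xml_index.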
FILE: src/debugger.rs
constant DEBUG (line 10) | pub const DEBUG: &str = "debug";
constant INFO (line 11) | pub const INFO: &str = "info";
constant NOTICE (line 12) | pub const NOTICE: &str = "notice";
constant WARNING (line 13) | pub const WARNING: &str = "warning";
constant CRITICAL (line 14) | pub const CRITICAL: &str = "critical";
function debug (line 20) | pub fn debug(category: &str, message: &str, severity: &str, time: Option...
function console_array_debug (line 42) | pub fn console_array_debug(row_data: &[String], col_widths: &[usize]) {
function force_enabled_debug (line 72) | pub fn force_enabled_debug(log_file: Option<&str>) {
function set_config (line 86) | pub fn set_config(debug_enabled: bool, debug_log_file: Option<&str>) {
function print_debug (line 115) | fn print_debug(message: &str) {
function log_debug (line 122) | fn log_debug(message: &str) {
FILE: src/engine/crawler.rs
type QueueEntry (line 50) | pub struct QueueEntry {
type VisitedEntry (line 60) | pub struct VisitedEntry {
type SkippedEntry (line 70) | pub struct SkippedEntry {
constant ACCEPT_HEADER (line 78) | const ACCEPT_HEADER: &str = "text/html,application/xhtml+xml,application...
type Crawler (line 81) | pub struct Crawler {
method new (line 135) | pub fn new(
method run (line 215) | pub async fn run(&mut self) -> CrawlerResult<()> {
method take_next_from_queue (line 359) | fn take_next_from_queue(&self) -> Option<QueueEntry> {
method process_url (line 382) | async fn process_url(
method parse_html_body_and_fill_queue (line 833) | fn parse_html_body_and_fill_queue(
method parse_content_and_fill_url_queue (line 925) | fn parse_content_and_fill_url_queue(
method add_suitable_urls_to_queue (line 1004) | fn add_suitable_urls_to_queue(
method add_url_to_queue (line 1174) | fn add_url_to_queue(&self, url: &ParsedUrl, source_uq_id: Option<&str>...
method add_url_to_queue_static (line 1189) | fn add_url_to_queue_static(
method normalize_url_to_initial (line 1231) | fn normalize_url_to_initial(url: &mut ParsedUrl, initial_url: &ParsedU...
method is_url_suitable_for_queue_static (line 1252) | fn is_url_suitable_for_queue_static(
method is_url_allowed_by_regexes (line 1293) | fn is_url_allowed_by_regexes(
method is_domain_allowed_for_static_files (line 1325) | fn is_domain_allowed_for_static_files(domain: &str, allowed_domains: &...
method hosts_are_www_equivalent (line 1333) | fn hosts_are_www_equivalent(host_a: &str, host_b: &str) -> bool {
method is_external_domain_allowed_for_crawling (line 1343) | fn is_external_domain_allowed_for_crawling(
method add_redirect_location_to_queue_if_suitable (line 1367) | fn add_redirect_location_to_queue_if_suitable(
method process_non200_url (line 1424) | fn process_non200_url(url: &ParsedUrl, non200_basenames: &DashMap<Stri...
method is_url_allowed_by_robots_txt_cached (line 1438) | fn is_url_allowed_by_robots_txt_cached(
method fetch_robots_txt (line 1458) | pub async fn fetch_robots_txt(&self, domain: &str, port: u16, scheme: ...
method get_content_type_id_by_header (line 1542) | fn get_content_type_id_by_header(content_type_header: &str) -> Content...
method build_final_user_agent (line 1583) | fn build_final_user_agent(options: &CoreOptions) -> String {
method get_crawler_user_agent_signature (line 1606) | pub fn get_crawler_user_agent_signature() -> String {
method compute_url_key (line 1611) | fn compute_url_key(url: &ParsedUrl) -> String {
method is_sitemap_url (line 1619) | fn is_sitemap_url(url: &ParsedUrl) -> bool {
method compute_url_uq_id (line 1625) | fn compute_url_uq_id(url: &ParsedUrl) -> String {
method decode_html_entities (line 1634) | fn decode_html_entities(text: &str) -> String {
method current_timestamp (line 1645) | fn current_timestamp() -> f64 {
method add_random_query_params (line 1653) | fn add_random_query_params(path: &str) -> String {
method apply_http_request_transformations (line 1663) | fn apply_http_request_transformations(host: &str, path: &str, transfor...
method remove_avif_and_webp_support_from_accept_header (line 1709) | pub fn remove_avif_and_webp_support_from_accept_header(&mut self) {
method terminate (line 1714) | pub fn terminate(&self) {
method get_forced_ip_for_domain_and_port (line 1719) | pub fn get_forced_ip_for_domain_and_port(&self, domain: &str, port: u1...
method get_cache_type_flags (line 1726) | fn get_cache_type_flags(headers: &HashMap<String, String>) -> u32 {
method get_cache_lifetime (line 1787) | fn get_cache_lifetime(headers: &HashMap<String, String>) -> Option<i32> {
method get_content_processor_manager (line 1804) | pub fn get_content_processor_manager(&self) -> &Arc<Mutex<ContentProce...
method get_initial_parsed_url (line 1808) | pub fn get_initial_parsed_url(&self) -> &ParsedUrl {
method get_options (line 1812) | pub fn get_options(&self) -> &Arc<CoreOptions> {
method get_output (line 1816) | pub fn get_output(&self) -> &Arc<Mutex<Box<dyn Output>>> {
method get_status (line 1820) | pub fn get_status(&self) -> &Arc<Mutex<Status>> {
method get_visited (line 1824) | pub fn get_visited(&self) -> &Arc<DashMap<String, VisitedEntry>> {
method get_queue (line 1828) | pub fn get_queue(&self) -> &Arc<DashMap<String, QueueEntry>> {
method get_skipped (line 1832) | pub fn get_skipped(&self) -> &Arc<DashMap<String, SkippedEntry>> {
method get_analysis_manager (line 1836) | pub fn get_analysis_manager(&self) -> &Arc<Mutex<AnalysisManager>> {
method get_done_urls_count (line 1840) | pub fn get_done_urls_count(&self) -> usize {
function rand_simple (line 1846) | fn rand_simple() -> u64 {
function compile_domain_patterns (line 1854) | fn compile_domain_patterns(domains: &[String]) -> Vec<Regex> {
function filter_query_params (line 1865) | fn filter_query_params(url: &str, keep_params: &[String]) -> String {
function base_href_double_quotes (line 1895) | fn base_href_double_quotes() {
function base_href_single_quotes (line 1902) | fn base_href_single_quotes() {
function base_href_no_quotes (line 1909) | fn base_href_no_quotes() {
function base_href_relative_path (line 1916) | fn base_href_relative_path() {
function base_href_case_insensitive (line 1923) | fn base_href_case_insensitive() {
function base_href_absent (line 1930) | fn base_href_absent() {
function base_href_with_other_attrs (line 1936) | fn base_href_with_other_attrs() {
function sitemap_url_standard (line 1947) | fn sitemap_url_standard() {
function sitemap_url_with_index (line 1953) | fn sitemap_url_with_index() {
function sitemap_url_nested (line 1959) | fn sitemap_url_nested() {
function sitemap_url_case_insensitive (line 1965) | fn sitemap_url_case_insensitive() {
function not_sitemap_regular_page (line 1971) | fn not_sitemap_regular_page() {
function not_sitemap_xml_without_sitemap (line 1977) | fn not_sitemap_xml_without_sitemap() {
function not_sitemap_html_page (line 1983) | fn not_sitemap_html_page() {
function sitemap_url_gzip (line 1989) | fn sitemap_url_gzip() {
function sitemap_url_gzip_nested (line 1995) | fn sitemap_url_gzip_nested() {
function not_sitemap_tar_gz (line 2001) | fn not_sitemap_tar_gz() {
function normalize_www_to_no_www (line 2011) | fn normalize_www_to_no_www() {
function normalize_no_www_to_www (line 2020) | fn normalize_no_www_to_www() {
function normalize_http_to_https (line 2028) | fn normalize_http_to_https() {
function normalize_both_www_and_scheme (line 2036) | fn normalize_both_www_and_scheme() {
function normalize_leaves_different_domain_unchanged (line 2045) | fn normalize_leaves_different_domain_unchanged() {
function normalize_same_url_no_change (line 2053) | fn normalize_same_url_no_change() {
function normalize_preserves_path (line 2062) | fn normalize_preserves_path() {
function filter_query_params_keeps_specified (line 2071) | fn filter_query_params_keeps_specified() {
function filter_query_params_removes_all_when_none_match (line 2078) | fn filter_query_params_removes_all_when_none_match() {
function filter_query_params_no_query_string (line 2085) | fn filter_query_params_no_query_string() {
function filter_query_params_keeps_param_without_value (line 2092) | fn filter_query_params_keeps_param_without_value() {
function filter_query_params_preserves_order (line 2099) | fn filter_query_params_preserves_order() {
function filter_query_params_single_kept_param (line 2106) | fn filter_query_params_single_kept_param() {
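The `filter_query_params` tests above pin down its behavior: listed parameters are kept, everything else is dropped, original order is preserved, and valueless parameters survive. A minimal sketch consistent with those test names (not the actual implementation at line 1865):

```rust
// Hypothetical sketch of `filter_query_params`, inferred from the tests
// above: keep only the listed query parameters, preserve their original
// order, and keep parameters that have no value.
fn filter_query_params(url: &str, keep: &[String]) -> String {
    // No query string: nothing to filter.
    let Some((base, query)) = url.split_once('?') else {
        return url.to_string();
    };
    let kept: Vec<&str> = query
        .split('&')
        .filter(|pair| {
            // A pair like "c" (no value) still has a name.
            let name = pair.split('=').next().unwrap_or(pair);
            keep.iter().any(|k| k == name)
        })
        .collect();
    if kept.is_empty() {
        // All parameters removed: drop the '?' entirely.
        base.to_string()
    } else {
        format!("{}?{}", base, kept.join("&"))
    }
}
```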
FILE: src/engine/found_url.rs
type UrlSource (line 12) | pub enum UrlSource {
method short_name (line 32) | pub fn short_name(&self) -> &'static str {
method from_code (line 53) | pub fn from_code(code: u8) -> Option<Self> {
method fmt (line 76) | fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
type FoundUrl (line 86) | pub struct FoundUrl {
method new (line 96) | pub fn new(url: &str, source_url: &str, source: UrlSource) -> Self {
method is_included_asset (line 106) | pub fn is_included_asset(&self) -> bool {
method fmt (line 112) | fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
function normalize_url (line 119) | fn normalize_url(url: &str, source_url: &str) -> String {
function test_normalize_url_entities (line 167) | fn test_normalize_url_entities() {
function test_normalize_url_spaces (line 173) | fn test_normalize_url_spaces() {
function test_is_included_asset (line 179) | fn test_is_included_asset() {
function test_source_short_name (line 188) | fn test_source_short_name() {
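The test names `test_normalize_url_entities` and `test_normalize_url_spaces` suggest what `normalize_url` does to a raw href: decode HTML entities and percent-encode literal spaces. A simplified sketch under that assumption (the real function at line 119 also resolves against `source_url`, which is omitted here):

```rust
// Hypothetical sketch of URL normalization, inferred from the tests above:
// HTML entities like &amp; are decoded and literal spaces percent-encoded.
fn normalize_url(url: &str) -> String {
    url.trim()
        .replace("&amp;", "&")
        .replace("&#38;", "&")
        .replace(' ', "%20")
}
```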
FILE: src/engine/found_urls.rs
type FoundUrls (line 17) | pub struct FoundUrls {
method new (line 22) | pub fn new() -> Self {
method add_url (line 29) | pub fn add_url(&mut self, found_url: FoundUrl) {
method add_urls_from_text_array (line 35) | pub fn add_urls_from_text_array(&mut self, urls: &[&str], source_url: ...
method get_urls (line 44) | pub fn get_urls(&self) -> &HashMap<String, FoundUrl> {
method get_count (line 49) | pub fn get_count(&self) -> usize {
method default (line 55) | fn default() -> Self {
function md5_hex (line 61) | fn md5_hex(input: &str) -> String {
function is_url_valid_for_crawling (line 71) | fn is_url_valid_for_crawling(url: &str) -> bool {
function test_dedup_by_md5 (line 90) | fn test_dedup_by_md5() {
function test_add_urls_from_text_array (line 98) | fn test_add_urls_from_text_array() {
function test_is_url_valid_for_crawling (line 109) | fn test_is_url_valid_for_crawling() {
FILE: src/engine/http_client.rs
type HttpClient (line 21) | pub struct HttpClient {
method new (line 35) | pub fn new(
method build_shared_client (line 55) | fn build_shared_client(proxy: &Option<String>, accept_invalid_certs: b...
method request (line 75) | pub async fn request(
method get_from_cache (line 238) | fn get_from_cache(&self, cache_key: &str) -> Option<HttpResponse> {
method save_to_cache (line 296) | fn save_to_cache(&self, cache_key: &str, result: &HttpResponse) -> Cra...
method is_url_cached (line 351) | pub fn is_url_cached(
method get_cache_file_path (line 392) | fn get_cache_file_path(&self, cache_key: &str) -> Option<String> {
method get_cache_key (line 399) | fn get_cache_key(&self, host: &str, port: u16, args: &[String], extens...
type CachedResponse (line 412) | struct CachedResponse {
function convert_response_headers (line 422) | fn convert_response_headers(headers: &reqwest::header::HeaderMap) -> Has...
function test_cache_key_generation (line 437) | fn test_cache_key_generation() {
function test_cache_file_path (line 451) | fn test_cache_file_path() {
function test_no_cache_when_disabled (line 465) | fn test_no_cache_when_disabled() {
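`get_cache_key` takes a host, port, args, and extension, and `test_cache_key_generation` checks its determinism. A sketch of the general shape using the standard library's hasher; the real function may well use a different hash and key layout, so treat every detail here as an assumption:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical sketch of a cache key: hash the host, port, request
// arguments and extension into a hex string. Illustrates the shape only;
// the actual `get_cache_key` may differ in hash choice and format.
fn get_cache_key(host: &str, port: u16, args: &[String], extension: &str) -> String {
    let mut h = DefaultHasher::new();
    host.hash(&mut h);
    port.hash(&mut h);
    args.hash(&mut h);
    extension.hash(&mut h);
    // Host and extension stay readable; the hash disambiguates the rest.
    format!("{}.{:016x}.{}", host, h.finish(), extension)
}
```

The key stays partly human-readable (host and extension) so cache files on disk are easy to inspect, while the hash guarantees distinct requests map to distinct files.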
FILE: src/engine/http_response.rs
type HttpResponse (line 12) | pub struct HttpResponse {
method new (line 23) | pub fn new(
method body_text (line 46) | pub fn body_text(&self) -> Option<String> {
method get_formatted_exec_time (line 50) | pub fn get_formatted_exec_time(&self) -> String {
method get_formatted_body_length (line 54) | pub fn get_formatted_body_length(&self) -> String {
method detect_redirect_and_set_meta_redirect (line 60) | fn detect_redirect_and_set_meta_redirect(
method set_loaded_from_cache (line 81) | pub fn set_loaded_from_cache(&mut self, loaded: bool) {
method is_loaded_from_cache (line 85) | pub fn is_loaded_from_cache(&self) -> bool {
method is_skipped (line 89) | pub fn is_skipped(&self) -> bool {
method create_skipped (line 94) | pub fn create_skipped(url: String, reason: String) -> Self {
method get_header (line 109) | pub fn get_header(&self, name: &str) -> Option<&String> {
method get_content_type (line 115) | pub fn get_content_type(&self) -> Option<&String> {
function test_redirect_meta (line 125) | fn test_redirect_meta() {
function test_skipped_response (line 137) | fn test_skipped_response() {
function test_no_redirect_for_200 (line 144) | fn test_no_redirect_for_200() {
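`detect_redirect_and_set_meta_redirect` together with `test_no_redirect_for_200` implies the rule: a 3xx status with a `Location` header marks the response as a redirect, while a 200 never does. A minimal sketch of that rule, with hypothetical names:

```rust
// Hypothetical sketch of redirect detection: a 3xx status plus a Location
// header yields the redirect target; anything else (including 200) does not.
fn detect_redirect(status: u16, location: Option<&str>) -> Option<String> {
    if (300..400).contains(&status) {
        location.map(|l| l.to_string())
    } else {
        None
    }
}
```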
FILE: src/engine/initiator.rs
type Initiator (line 33) | pub struct Initiator {
method new (line 40) | pub fn new(argv: &[String]) -> CrawlerResult<Self> {
method create_manager (line 88) | pub fn create_manager(self) -> CrawlerResult<Manager> {
method get_options (line 93) | pub fn get_options(&self) -> &core_options::CoreOptions {
method register_analyzers (line 98) | fn register_analyzers(analysis_manager: &mut AnalysisManager, options:...
method print_help (line 138) | pub fn print_help() {
FILE: src/engine/manager.rs
type Manager (line 48) | pub struct Manager {
method new (line 55) | pub fn new(options: CoreOptions, analysis_manager: AnalysisManager) ->...
method run (line 84) | pub async fn run(&mut self) -> CrawlerResult<i32> {
method run_post_crawl (line 219) | fn run_post_crawl(&mut self, crawler: &Crawler) -> i32 {
method run_exporters (line 384) | fn run_exporters(&self, crawler: &Crawler) {
method create_output (line 560) | fn create_output(&self, options: &CoreOptions, crawler_info: &Info) ->...
method create_content_processor_manager (line 635) | fn create_content_processor_manager(options: &CoreOptions) -> ContentP...
FILE: src/engine/parsed_url.rs
type ParsedUrl (line 41) | pub struct ParsedUrl {
method hash (line 89) | fn hash<H: std::hash::Hasher>(&self, state: &mut H) {
method new (line 102) | pub fn new(
method get_full_url (line 133) | pub fn get_full_url(&self, include_scheme_and_host: bool, include_frag...
method is_static_file (line 190) | pub fn is_static_file(&self) -> bool {
method is_image (line 212) | pub fn is_image(&self) -> bool {
method is_font (line 219) | pub fn is_font(&self) -> bool {
method is_css (line 224) | pub fn is_css(&self) -> bool {
method is_origin_required (line 229) | pub fn is_origin_required(&self) -> bool {
method estimate_extension (line 234) | pub fn estimate_extension(&self) -> Option<String> {
method set_attributes (line 254) | pub fn set_attributes(&mut self, url: &ParsedUrl, scheme: bool, host: ...
method set_path (line 267) | pub fn set_path(&mut self, path: String) {
method change_depth (line 274) | pub fn change_depth(&mut self, change: i32) {
method set_query (line 296) | pub fn set_query(&mut self, query: Option<String>) {
method set_fragment (line 301) | pub fn set_fragment(&mut self, fragment: Option<String>) {
method set_extension (line 306) | pub fn set_extension(&mut self, extension: Option<String>) {
method set_debug (line 311) | pub fn set_debug(&mut self, debug: bool) {
method is_only_fragment (line 316) | pub fn is_only_fragment(&self) -> bool {
method get_full_homepage_url (line 321) | pub fn get_full_homepage_url(&self) -> String {
method parse (line 336) | pub fn parse(url: &str, base_url: Option<&ParsedUrl>) -> Self {
method is_https (line 447) | pub fn is_https(&self) -> bool {
method extract_2nd_level_domain (line 452) | pub fn extract_2nd_level_domain(host: &str) -> Option<String> {
method get_base_name (line 460) | pub fn get_base_name(&self) -> Option<String> {
method get_depth (line 481) | pub fn get_depth(&self) -> usize {
method clear_cache (line 488) | fn clear_cache(&self) {
method fmt (line 496) | fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
method clone (line 57) | fn clone(&self) -> Self {
method eq (line 75) | fn eq(&self, other: &Self) -> bool {
function extract_extension (line 502) | fn extract_extension(path: &str) -> Option<String> {
function parent_path (line 511) | fn parent_path(path: &str) -> String {
function parse_url_manually (line 521) | fn parse_url_manually(
function test_parse_full_url (line 561) | fn test_parse_full_url() {
function test_depth (line 571) | fn test_depth() {
function test_is_static_file (line 580) | fn test_is_static_file() {
function test_relative_url_resolution (line 592) | fn test_relative_url_resolution() {
function test_get_full_url (line 599) | fn test_get_full_url() {
function test_get_base_name (line 607) | fn test_get_base_name() {
function test_domain_2nd_level (line 616) | fn test_domain_2nd_level() {
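`get_depth` and `test_depth` suggest URL depth is derived from the path. A plausible sketch, assuming depth is the count of non-empty path segments (whether the homepage counts as 0 is an assumption here, not confirmed by the listing):

```rust
// Hypothetical sketch of URL depth: count non-empty path segments,
// so "/" is depth 0 and "/a/b/page.html" is depth 3. The real
// `get_depth` may define the baseline differently.
fn get_depth(path: &str) -> usize {
    path.split('/').filter(|s| !s.is_empty()).count()
}
```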
FILE: src/engine/robots_txt.rs
type RobotsTxt (line 29) | pub struct RobotsTxt {
method parse (line 42) | pub fn parse(content: &str) -> Self {
method is_allowed (line 105) | pub fn is_allowed(&self, url: &str) -> bool {
method get_sitemaps (line 150) | pub fn get_sitemaps(&self) -> &[String] {
method get_disallowed_paths (line 155) | pub fn get_disallowed_paths(&self) -> &[String] {
method get_allowed_paths (line 160) | pub fn get_allowed_paths(&self) -> &[String] {
method get_raw_content (line 165) | pub fn get_raw_content(&self) -> &str {
function path_matches (line 175) | fn path_matches(url_path: &str, pattern: &str) -> bool {
function wildcard_match (line 194) | fn wildcard_match(url_path: &str, pattern: &str, exact_end: bool) -> bool {
function test_parse_basic (line 230) | fn test_parse_basic() {
function test_is_allowed (line 247) | fn test_is_allowed() {
function test_assets_always_allowed (line 261) | fn test_assets_always_allowed() {
function test_wildcard_matching (line 274) | fn test_wildcard_matching() {
function test_wildcard_star (line 281) | fn test_wildcard_star() {
function test_anchor_matching (line 287) | fn test_anchor_matching() {
function test_siteone_crawler_user_agent (line 293) | fn test_siteone_crawler_user_agent() {
function test_comments_stripped (line 308) | fn test_comments_stripped() {
function test_empty_disallow (line 320) | fn test_empty_disallow() {
function test_multiple_sitemaps (line 331) | fn test_multiple_sitemaps() {
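The `wildcard_match(url_path, pattern, exact_end)` signature and the `test_wildcard_*`/`test_anchor_matching` tests point at standard robots.txt pattern semantics: `*` matches any run of characters and a trailing `$` anchors the pattern to the end of the path. A self-contained sketch of those semantics (not the crate's actual implementation):

```rust
// Hypothetical sketch of robots.txt wildcard matching, assuming the usual
// semantics: '*' matches any run of characters, a trailing '$' anchors the
// pattern to the end of the path.
fn wildcard_match(path: &str, pattern: &str) -> bool {
    // A trailing '$' means the pattern must match up to the end of the path.
    let (pattern, exact_end) = match pattern.strip_suffix('$') {
        Some(p) => (p, true),
        None => (pattern, false),
    };
    let parts: Vec<&str> = pattern.split('*').collect();
    let mut pos = 0usize;
    for (i, part) in parts.iter().enumerate() {
        if part.is_empty() {
            continue; // '*' at the pattern edge matches anything (or nothing)
        }
        if i == 0 {
            // The first literal must match at the start of the path.
            if !path.starts_with(part) {
                return false;
            }
            pos = part.len();
        } else {
            // Later literals may match anywhere after the current position.
            match path[pos..].find(part) {
                Some(idx) => pos = pos + idx + part.len(),
                None => return false,
            }
        }
    }
    if exact_end {
        if let Some(last) = parts.last() {
            if !last.is_empty() {
                // The final literal must land exactly at the path's end.
                return path.ends_with(last) && pos == path.len();
            }
        }
    }
    true
}
```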
FILE: src/error.rs
type CrawlerError (line 7) | pub enum CrawlerError {
type CrawlerResult (line 51) | pub type CrawlerResult<T> = std::result::Result<T, CrawlerError>;
FILE: src/export/base_exporter.rs
function get_export_file_path (line 18) | pub fn get_export_file_path(
FILE: src/export/exporter.rs
type Exporter (line 11) | pub trait Exporter: Send + Sync {
method get_name (line 13) | fn get_name(&self) -> &str;
method should_be_activated (line 16) | fn should_be_activated(&self) -> bool;
method export (line 20) | fn export(&mut self, status: &Status, output: &dyn Output) -> CrawlerR...
FILE: src/export/file_exporter.rs
type FileExporter (line 16) | pub struct FileExporter {
method new (line 40) | pub fn new(
method set_text_output_content (line 64) | pub fn set_text_output_content(&mut self, content: String) {
method set_json_output_content (line 69) | pub fn set_json_output_content(&mut self, content: String) {
method set_html_report_content (line 74) | pub fn set_html_report_content(&mut self, content: String) {
method get_export_file_path (line 79) | fn get_export_file_path(&self, file: &str, extension: &str) -> Crawler...
method get_name (line 91) | fn get_name(&self) -> &str {
method should_be_activated (line 95) | fn should_be_activated(&self) -> bool {
method export (line 99) | fn export(&mut self, status: &Status, _output: &dyn Output) -> CrawlerRe...
FILE: src/export/html_report/badge.rs
type BadgeColor (line 6) | pub enum BadgeColor {
method as_css_class (line 15) | pub fn as_css_class(&self) -> &'static str {
type Badge (line 28) | pub struct Badge {
method new (line 35) | pub fn new(value: String, color: BadgeColor) -> Self {
method with_title (line 43) | pub fn with_title(value: String, color: BadgeColor, title: &str) -> Se...
FILE: src/export/html_report/report.rs
constant SUPER_TABLE_VISITED_URLS (line 22) | const SUPER_TABLE_VISITED_URLS: &str = "visited-urls";
constant ST_ANALYSIS_STATS (line 25) | const ST_ANALYSIS_STATS: &str = "analysis-stats";
constant ST_CONTENT_PROCESSORS_STATS (line 28) | const ST_CONTENT_PROCESSORS_STATS: &str = "content-processors-stats";
constant ST_HEADERS (line 31) | const ST_HEADERS: &str = "headers";
constant ST_HEADERS_VALUES (line 32) | const ST_HEADERS_VALUES: &str = "headers-values";
constant ST_SEO (line 33) | const ST_SEO: &str = "seo";
constant ST_OPEN_GRAPH (line 34) | const ST_OPEN_GRAPH: &str = "open-graph";
constant ST_SEO_HEADINGS (line 35) | const ST_SEO_HEADINGS: &str = "seo-headings";
constant ST_DNS (line 36) | const ST_DNS: &str = "dns";
constant ST_CERTIFICATE_INFO (line 37) | const ST_CERTIFICATE_INFO: &str = "certificate-info";
constant ST_NON_UNIQUE_TITLES (line 38) | const ST_NON_UNIQUE_TITLES: &str = "non-unique-titles";
constant ST_NON_UNIQUE_DESCRIPTIONS (line 39) | const ST_NON_UNIQUE_DESCRIPTIONS: &str = "non-unique-descriptions";
constant ST_CONTENT_TYPES (line 40) | const ST_CONTENT_TYPES: &str = "content-types";
constant ST_CONTENT_MIME_TYPES (line 41) | const ST_CONTENT_MIME_TYPES: &str = "content-types-raw";
constant ST_SKIPPED_SUMMARY (line 42) | const ST_SKIPPED_SUMMARY: &str = "skipped-summary";
constant ST_SKIPPED (line 43) | const ST_SKIPPED: &str = "skipped";
constant ST_CACHING_PER_CONTENT_TYPE (line 44) | const ST_CACHING_PER_CONTENT_TYPE: &str = "caching-per-content-type";
constant ST_CACHING_PER_DOMAIN (line 45) | const ST_CACHING_PER_DOMAIN: &str = "caching-per-domain";
constant ST_CACHING_PER_DOMAIN_AND_CONTENT_TYPE (line 46) | const ST_CACHING_PER_DOMAIN_AND_CONTENT_TYPE: &str = "caching-per-domain...
constant ST_REDIRECTS (line 47) | const ST_REDIRECTS: &str = "redirects";
constant ST_404 (line 48) | const ST_404: &str = "404";
constant ST_FASTEST_URLS (line 49) | const ST_FASTEST_URLS: &str = "fastest-urls";
constant ST_SLOWEST_URLS (line 50) | const ST_SLOWEST_URLS: &str = "slowest-urls";
constant ST_BEST_PRACTICES (line 51) | const ST_BEST_PRACTICES: &str = "best-practices";
constant ST_ACCESSIBILITY (line 52) | const ST_ACCESSIBILITY: &str = "accessibility";
constant ST_EXTERNAL_URLS (line 53) | const ST_EXTERNAL_URLS: &str = "external-urls";
constant ST_SECURITY (line 54) | const ST_SECURITY: &str = "security";
constant ST_SOURCE_DOMAINS (line 55) | const ST_SOURCE_DOMAINS: &str = "source-domains";
constant BEST_PRACTICE_ANALYSIS_NAMES (line 58) | const BEST_PRACTICE_ANALYSIS_NAMES: &[&str] = &[
constant ACCESSIBILITY_ANALYSIS_NAMES (line 71) | const ACCESSIBILITY_ANALYSIS_NAMES: &[&str] = &[
constant SECURITY_ANALYSIS_NAMES (line 81) | const SECURITY_ANALYSIS_NAMES: &[&str] = &["Security headers"];
constant SEVERITY_ORDER_CRITICAL (line 84) | const SEVERITY_ORDER_CRITICAL: i32 = 1;
constant SEVERITY_ORDER_WARNING (line 85) | const SEVERITY_ORDER_WARNING: i32 = 2;
constant SEVERITY_ORDER_NOTICE (line 86) | const SEVERITY_ORDER_NOTICE: i32 = 3;
constant MAX_EXAMPLE_URLS (line 89) | const MAX_EXAMPLE_URLS: usize = 5;
constant TEMPLATE_HTML (line 92) | const TEMPLATE_HTML: &str = include_str!("template.html");
constant SKIPPED_SUPER_TABLES (line 95) | const SKIPPED_SUPER_TABLES: &[&str] = &[
type SuperTableInfo (line 112) | struct SuperTableInfo {
function extract_info (line 122) | fn extract_info(st: &SuperTable) -> SuperTableInfo {
function get_super_table_order (line 134) | fn get_super_table_order(apl_code: &str) -> i32 {
function get_section_name_by_apl_code (line 163) | fn get_section_name_by_apl_code(apl_code: &str) -> Option<&'static str> {
type HtmlReport (line 187) | pub struct HtmlReport<'a> {
function new (line 195) | pub fn new(status: &'a Status, max_example_urls: usize, html_report_opti...
function get_html (line 211) | pub fn get_html(&self) -> String {
function is_section_allowed (line 224) | fn is_section_allowed(&self, section_name: &str) -> bool {
function get_template_variables (line 232) | fn get_template_variables(&self) -> HashMap<String, String> {
function finalize_html (line 262) | fn finalize_html(&self, mut html: String) -> String {
function extract_all_super_table_infos (line 348) | fn extract_all_super_table_infos(&self) -> Vec<SuperTableInfo> {
function get_tabs (line 371) | fn get_tabs(&self) -> Vec<Tab> {
function get_super_table_tabs (line 423) | fn get_super_table_tabs(&self) -> Vec<Tab> {
function get_tabs_radios (line 454) | fn get_tabs_radios(&self, tabs: &[Tab]) -> String {
function get_tabs_html (line 474) | fn get_tabs_html(&self, tabs: &[Tab]) -> String {
function get_tabs_content_html (line 512) | fn get_tabs_content_html(&self, tabs: &[Tab]) -> String {
function get_tabs_css (line 535) | fn get_tabs_css(&self, tabs: &[Tab]) -> String {
function get_summary_tab (line 581) | fn get_summary_tab(&self) -> Option<Tab> {
function get_seo_and_opengraph_tab (line 627) | fn get_seo_and_opengraph_tab(&self) -> Option<Tab> {
function get_image_gallery_tab (line 685) | fn get_image_gallery_tab(&self) -> Option<Tab> {
function get_video_gallery_tab (line 753) | fn get_video_gallery_tab(&self) -> Option<Tab> {
function get_dns_and_ssl_tls_tab (line 811) | fn get_dns_and_ssl_tls_tab(&self) -> Option<Tab> {
function get_crawler_stats_tab (line 887) | fn get_crawler_stats_tab(&self) -> Tab {
function get_crawler_info_tab (line 921) | fn get_crawler_info_tab(&self) -> Tab {
function get_visited_urls_tab (line 969) | fn get_visited_urls_tab(&self) -> Tab {
function get_visited_urls_table (line 986) | fn get_visited_urls_table(&self) -> SuperTable {
function get_image_gallery_form_html (line 1213) | fn get_image_gallery_form_html(&self) -> String {
function get_initial_host (line 1283) | fn get_initial_host(&self) -> String {
function get_initial_url (line 1292) | fn get_initial_url(&self) -> String {
function get_initial_scheme (line 1297) | fn get_initial_scheme(&self) -> String {
function build_analysis_detail_tables (line 1307) | fn build_analysis_detail_tables(&self) -> HashMap<String, String> {
function get_data_for_super_tables_with_details (line 1458) | fn get_data_for_super_tables_with_details(&self) -> HashMap<String, Vec<...
function get_tab_content_by_super_table (line 1553) | fn get_tab_content_by_super_table(
function get_visited_urls_badges (line 1595) | fn get_visited_urls_badges(super_table: &SuperTable) -> Vec<Badge> {
function get_super_table_badges_by_apl_code (line 1635) | fn get_super_table_badges_by_apl_code(info: &SuperTableInfo, all_infos: ...
function get_super_table_generic_badges (line 1809) | fn get_super_table_generic_badges(info: &SuperTableInfo) -> Vec<Badge> {
function html_escape (line 1858) | fn html_escape(s: &str) -> String {
function build_quality_scores_html (line 1868) | fn build_quality_scores_html(scores: &crate::scoring::quality_score::Qua...
function remove_whitespaces_from_html (line 1950) | fn remove_whitespaces_from_html(html: &str) -> String {
function aggregate_detail (line 1984) | fn aggregate_detail(detail: &str) -> String {
function aggregate_detail_key (line 2099) | fn aggregate_detail_key(severity: &str, detail: &str) -> String {
constant IMAGE_GALLERY_FILTER_SCRIPT (line 2120) | const IMAGE_GALLERY_FILTER_SCRIPT: &str = r#"<script> function initializ...
constant VIDEO_GALLERY_SCRIPT (line 2295) | const VIDEO_GALLERY_SCRIPT: &str = r#"<script> function playVideos() {
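Among the report helpers, `html_escape` is the one whose contract is easiest to pin down. A minimal sketch of the conventional behavior, assuming it escapes the five characters that are unsafe in HTML text and attribute values (the actual function at line 1858 may cover a different set):

```rust
// Minimal sketch of HTML escaping: replace the five characters that are
// unsafe in HTML text and attribute values.
fn html_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            other => out.push(other),
        }
    }
    out
}
```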
FILE: src/export/html_report/tab.rs
type Tab (line 10) | pub struct Tab {
method new (line 23) | pub fn new(
method set_order (line 48) | pub fn set_order(&mut self, order: Option<i32>) {
method get_final_sort_order (line 53) | pub fn get_final_sort_order(&self) -> i32 {
function sanitize_id (line 63) | fn sanitize_id(name: &str) -> String {
FILE: src/export/mailer_exporter.rs
function set_crawler_interrupted (line 21) | pub fn set_crawler_interrupted(interrupted: bool) {
function is_crawler_interrupted (line 25) | pub fn is_crawler_interrupted() -> bool {
type MailerExporter (line 29) | pub struct MailerExporter {
method new (line 54) | pub fn new(
method set_html_report_content (line 80) | pub fn set_html_report_content(&mut self, content: String) {
method get_email_body (line 85) | fn get_email_body(&self, host: &str) -> String {
method style_html_body_for_email (line 106) | fn style_html_body_for_email(&self, html: &str) -> String {
method build_subject (line 125) | fn build_subject(&self) -> String {
method resolve_mail_from (line 139) | fn resolve_mail_from(&self) -> String {
method send_email (line 145) | fn send_email(
method get_name (line 236) | fn get_name(&self) -> &str {
method should_be_activated (line 240) | fn should_be_activated(&self) -> bool {
method export (line 244) | fn export(&mut self, status: &Status, _output: &dyn Output) -> CrawlerRe...
FILE: src/export/markdown_exporter.rs
constant CONTENT_TYPES_REQUIRING_CHANGES (line 29) | const CONTENT_TYPES_REQUIRING_CHANGES: &[ContentTypeId] = &[ContentTypeI...
type MarkdownExporter (line 33) | pub struct MarkdownExporter {
method new (line 60) | pub fn new() -> Self {
method set_markdown_export_directory (line 81) | pub fn set_markdown_export_directory(&mut self, dir: Option<String>) {
method set_markdown_export_single_file (line 85) | pub fn set_markdown_export_single_file(&mut self, file: Option<String>) {
method set_markdown_disable_images (line 89) | pub fn set_markdown_disable_images(&mut self, disable: bool) {
method set_markdown_disable_files (line 93) | pub fn set_markdown_disable_files(&mut self, disable: bool) {
method set_markdown_remove_links_and_images_from_single_file (line 97) | pub fn set_markdown_remove_links_and_images_from_single_file(&mut self...
method set_markdown_exclude_selector (line 101) | pub fn set_markdown_exclude_selector(&mut self, selectors: Vec<String>) {
method set_markdown_export_store_only_url_regex (line 105) | pub fn set_markdown_export_store_only_url_regex(&mut self, regexes: Ve...
method set_markdown_ignore_store_file_error (line 109) | pub fn set_markdown_ignore_store_file_error(&mut self, ignore: bool) {
method set_markdown_replace_content (line 113) | pub fn set_markdown_replace_content(&mut self, replacements: Vec<Strin...
method set_markdown_replace_query_string (line 117) | pub fn set_markdown_replace_query_string(&mut self, replacements: Vec<...
method set_markdown_move_content_before_h1_to_end (line 121) | pub fn set_markdown_move_content_before_h1_to_end(&mut self, move_cont...
method set_initial_parsed_url (line 125) | pub fn set_initial_parsed_url(&mut self, url: ParsedUrl) {
method set_ignore_regexes (line 129) | pub fn set_ignore_regexes(&mut self, regexes: Vec<String>) {
method set_initial_url (line 133) | pub fn set_initial_url(&mut self, url: String) {
method set_content_processor_manager (line 137) | pub fn set_content_processor_manager(&mut self, cpm: Arc<Mutex<Content...
method get_exported_file_paths (line 142) | pub fn get_exported_file_paths(&self) -> &HashMap<String, String> {
method store_file (line 147) | fn store_file(&mut self, visited_url: &VisitedUrl, status: &Status) ->...
method normalize_markdown_file (line 299) | fn normalize_markdown_file(&self, md_file_path: &str) {
method normalize_markdown_content (line 312) | pub fn normalize_markdown_content(&self, content: &str, replace_html_l...
method remove_excessive_whitespace (line 459) | fn remove_excessive_whitespace(&self, md: &str) -> String {
method remove_empty_lines_in_lists (line 522) | fn remove_empty_lines_in_lists(&self, md: &str) -> String {
method move_content_before_main_heading_to_end (line 573) | fn move_content_before_main_heading_to_end(&self, md: &str) -> String {
method fix_multiline_images (line 630) | fn fix_multiline_images(&self, md: &str) -> String {
method detect_and_set_code_language (line 635) | fn detect_and_set_code_language(&self, md: &str) -> String {
method detect_language (line 651) | fn detect_language(&self, code: &str) -> String {
method should_be_url_stored (line 879) | fn should_be_url_stored(&self, visited_url: &VisitedUrl) -> bool {
method get_relative_file_path_for_file_by_url (line 905) | fn get_relative_file_path_for_file_by_url(&self, visited_url: &Visited...
method is_valid_url (line 955) | fn is_valid_url(url: &str) -> bool {
method default (line 54) | fn default() -> Self {
method get_name (line 961) | fn get_name(&self) -> &str {
method should_be_activated (line 965) | fn should_be_activated(&self) -> bool {
method export (line 969) | fn export(&mut self, status: &Status, _output: &dyn Output) -> CrawlerRe...
function extract_regex_pattern (line 1071) | fn extract_regex_pattern(input: &str) -> Option<String> {
function convert_html_file_to_markdown (line 1093) | pub fn convert_html_file_to_markdown(
function test_should_be_activated (line 1120) | fn test_should_be_activated() {
function test_detect_language_rust (line 1129) | fn test_detect_language_rust() {
function test_detect_language_python (line 1135) | fn test_detect_language_python() {
function test_fix_multiline_images (line 1144) | fn test_fix_multiline_images() {
function normalize (line 1153) | fn normalize(exporter: &MarkdownExporter, content: &str) -> String {
function test_trim_preserves_heading_at_start (line 1160) | fn test_trim_preserves_heading_at_start() {
function test_trim_removes_special_chars_at_end (line 1171) | fn test_trim_removes_special_chars_at_end() {
function test_disable_files_preserves_html_links (line 1181) | fn test_disable_files_preserves_html_links() {
function test_disable_files_preserves_htm_links (line 1195) | fn test_disable_files_preserves_htm_links() {
function test_disable_files_preserves_md_links (line 1207) | fn test_disable_files_preserves_md_links() {
function test_disable_files_preserves_tel_links (line 1221) | fn test_disable_files_preserves_tel_links() {
function test_disable_files_preserves_mailto_links (line 1233) | fn test_disable_files_preserves_mailto_links() {
function test_disable_files_preserves_https_links (line 1245) | fn test_disable_files_preserves_https_links() {
function test_disable_files_removes_pdf (line 1257) | fn test_disable_files_removes_pdf() {
function test_empty_list_items_removed (line 1267) | fn test_empty_list_items_removed() {
function test_disable_images_normalizes_link_whitespace (line 1275) | fn test_disable_images_normalizes_link_whitespace() {
function test_disable_images_removes_standard_images (line 1287) | fn test_disable_images_removes_standard_images() {
function test_orphaned_filename_link_removed (line 1303) | fn test_orphaned_filename_link_removed() {
function test_orphaned_filename_link_with_leading_whitespace (line 1318) | fn test_orphaned_filename_link_with_leading_whitespace() {
function test_real_link_text_not_removed (line 1329) | fn test_real_link_text_not_removed() {
function test_empty_table_rows_removed (line 1343) | fn test_empty_table_rows_removed() {
function test_table_row_with_content_preserved (line 1355) | fn test_table_row_with_content_preserved() {
function test_move_content_before_h1 (line 1364) | fn test_move_content_before_h1() {
function test_move_content_disabled (line 1378) | fn test_move_content_disabled() {
function test_move_content_nothing_before_h1 (line 1386) | fn test_move_content_nothing_before_h1() {
function test_normalize_content_without_html_link_replacement (line 1397) | fn test_normalize_content_without_html_link_replacement() {
function test_normalize_content_with_html_link_replacement (line 1408) | fn test_normalize_content_with_html_link_replacement() {
function test_convert_html_file_basic (line 1419) | fn test_convert_html_file_basic() {
function test_convert_html_file_nonexistent (line 1430) | fn test_convert_html_file_nonexistent() {
function test_convert_html_file_with_disable_images (line 1436) | fn test_convert_html_file_with_disable_images() {
function test_convert_html_file_preserves_html_links (line 1447) | fn test_convert_html_file_preserves_html_links() {
FILE: src/export/offline_website_exporter.rs
constant CONTENT_TYPES_REQUIRING_CHANGES (line 28) | const CONTENT_TYPES_REQUIRING_CHANGES: &[ContentTypeId] = &[
type OfflineWebsiteExporter (line 37) | pub struct OfflineWebsiteExporter {
method new (line 64) | pub fn new() -> Self {
method set_offline_export_directory (line 83) | pub fn set_offline_export_directory(&mut self, dir: Option<String>) {
method set_offline_export_store_only_url_regex (line 87) | pub fn set_offline_export_store_only_url_regex(&mut self, regexes: Vec...
method set_offline_export_remove_unwanted_code (line 91) | pub fn set_offline_export_remove_unwanted_code(&mut self, remove: bool) {
method set_offline_export_no_auto_redirect_html (line 95) | pub fn set_offline_export_no_auto_redirect_html(&mut self, disable: bo...
method set_offline_export_preserve_url_structure (line 99) | pub fn set_offline_export_preserve_url_structure(&mut self, preserve: ...
method set_offline_export_lowercase (line 103) | pub fn set_offline_export_lowercase(&mut self, lowercase: bool) {
method set_ignore_store_file_error (line 107) | pub fn set_ignore_store_file_error(&mut self, ignore: bool) {
method set_replace_content (line 111) | pub fn set_replace_content(&mut self, replacements: Vec<String>) {
method set_replace_query_string (line 115) | pub fn set_replace_query_string(&mut self, replacements: Vec<String>) {
method set_initial_parsed_url (line 119) | pub fn set_initial_parsed_url(&mut self, url: ParsedUrl) {
method set_content_processor_manager (line 123) | pub fn set_content_processor_manager(&mut self, cpm: Arc<Mutex<Content...
method set_domain_callbacks (line 127) | pub fn set_domain_callbacks(
method get_exported_file_paths (line 137) | pub fn get_exported_file_paths(&self) -> &HashMap<String, String> {
method store_file (line 142) | fn store_file(&mut self, visited_url: &VisitedUrl, status: &Status, _o...
method should_be_url_stored (line 278) | fn should_be_url_stored(&self, visited_url: &VisitedUrl) -> bool {
method get_relative_file_path_for_file_by_url (line 320) | fn get_relative_file_path_for_file_by_url(&self, visited_url: &Visited...
method is_valid_url (line 372) | fn is_valid_url(url: &str) -> bool {
method add_redirect_html_to_subfolders (line 395) | fn add_redirect_html_to_subfolders(dir: &str) -> CrawlerResult<()> {
method default (line 58) | fn default() -> Self {
method get_name (line 432) | fn get_name(&self) -> &str {
method should_be_activated (line 436) | fn should_be_activated(&self) -> bool {
method export (line 440) | fn export(&mut self, status: &Status, output: &dyn Output) -> CrawlerRes...
function extract_regex_pattern (line 490) | fn extract_regex_pattern(input: &str) -> Option<String> {
function test_is_valid_url (line 515) | fn test_is_valid_url() {
function test_should_be_activated (line 522) | fn test_should_be_activated() {
FILE: src/export/sitemap_exporter.rs
type SitemapExporter (line 17) | pub struct SitemapExporter {
method new (line 29) | pub fn new(
method collect_sitemap_urls (line 45) | fn collect_sitemap_urls(&self, status: &Status) -> Vec<String> {
method generate_xml_sitemap (line 66) | fn generate_xml_sitemap(&self, output_file: &str, urls: &[String]) -> ...
method generate_txt_sitemap (line 124) | fn generate_txt_sitemap(&self, output_file: &str, urls: &[String]) -> ...
method get_name (line 155) | fn get_name(&self) -> &str {
method should_be_activated (line 159) | fn should_be_activated(&self) -> bool {
method export (line 163) | fn export(&mut self, status: &Status, _output: &dyn Output) -> CrawlerRe...
function escape_xml (line 197) | fn escape_xml(s: &str) -> String {
FILE: src/export/upload_exporter.rs
type UploadExporter (line 19) | pub struct UploadExporter {
method new (line 35) | pub fn new(
method set_html_report_content (line 53) | pub fn set_html_report_content(&mut self, content: String) {
method upload (line 59) | fn upload(&self, html: &str) -> CrawlerResult<String> {
method get_name (line 138) | fn get_name(&self) -> &str {
method should_be_activated (line 142) | fn should_be_activated(&self) -> bool {
method export (line 146) | fn export(&mut self, status: &Status, _output: &dyn Output) -> CrawlerRe...
function get_arch (line 188) | fn get_arch() -> String {
FILE: src/export/utils/html_to_markdown.rs
type HtmlToMarkdownConverter (line 18) | pub struct HtmlToMarkdownConverter {
method new (line 42) | pub fn new(html: &str, excluded_selectors: Vec<String>) -> Self {
method set_strong_delimiter (line 86) | pub fn set_strong_delimiter(&mut self, delimiter: &str) -> &mut Self {
method set_em_delimiter (line 91) | pub fn set_em_delimiter(&mut self, delimiter: &str) -> &mut Self {
method set_bullet_list_marker (line 96) | pub fn set_bullet_list_marker(&mut self, marker: &str) -> &mut Self {
method set_code_block_fence (line 103) | pub fn set_code_block_fence(&mut self, fence: &str) -> &mut Self {
method set_horizontal_rule (line 110) | pub fn set_horizontal_rule(&mut self, rule: &str) -> &mut Self {
method set_heading_style (line 115) | pub fn set_heading_style(&mut self, style: HeadingStyle) -> &mut Self {
method set_escape_mode (line 120) | pub fn set_escape_mode(&mut self, enable: bool) -> &mut Self {
method set_include_images (line 125) | pub fn set_include_images(&mut self, include: bool) -> &mut Self {
method set_convert_tables (line 130) | pub fn set_convert_tables(&mut self, convert: bool) -> &mut Self {
method set_convert_strikethrough (line 135) | pub fn set_convert_strikethrough(&mut self, convert: bool) -> &mut Self {
method set_strikethrough_delimiter (line 140) | pub fn set_strikethrough_delimiter(&mut self, delimiter: &str) -> &mut...
method get_markdown (line 146) | pub fn get_markdown(&self) -> String {
method post_process (line 221) | fn post_process(&self, markdown: &str) -> String {
constant MIN_LINKS_FOR_COLLAPSE (line 231) | pub const MIN_LINKS_FOR_COLLAPSE: usize = 8;
method collapse_large_link_lists (line 235) | pub fn collapse_large_link_lists(markdown: &str) -> String {
method is_list_item (line 301) | fn is_list_item(line: &str) -> bool {
method is_list_continuation (line 310) | fn is_list_continuation(line: &str) -> bool {
method collect_excluded_node_ids (line 317) | fn collect_excluded_node_ids(&self, document: &Html) -> Vec<ego_tree::...
method convert_node (line 354) | fn convert_node(&self, node: &NodeRef<Node>, document: &Html, excluded...
method get_inner_markdown (line 446) | fn get_inner_markdown(&self, node: &NodeRef<Node>, document: &Html, ex...
method is_valid_link_node (line 492) | fn is_valid_link_node(&self, node: &NodeRef<Node>) -> bool {
method extract_text_content (line 513) | fn extract_text_content(&self, node: &NodeRef<Node>) -> String {
method collapse_inline_whitespace (line 524) | fn collapse_inline_whitespace(&self, text: &str) -> String {
method convert_heading (line 532) | fn convert_heading(&self, node: &NodeRef<Node>, document: &Html, exclu...
method convert_link (line 559) | fn convert_link(&self, node: &NodeRef<Node>, document: &Html, excluded...
method convert_image (line 593) | fn convert_image(&self, node: &NodeRef<Node>) -> String {
method convert_inline_code (line 627) | fn convert_inline_code(&self, node: &NodeRef<Node>) -> String {
method convert_code_block (line 651) | fn convert_code_block(&self, node: &NodeRef<Node>, _document: &Html) -...
method convert_blockquote (line 710) | fn convert_blockquote(&self, node: &NodeRef<Node>, document: &Html, ex...
method convert_table (line 726) | fn convert_table(&self, node: &NodeRef<Node>, document: &Html, exclude...
method extract_header_content (line 902) | fn extract_header_content(&self, cell: &NodeRef<Node>, document: &Html...
method convert_consecutive_links_to_table (line 914) | fn convert_consecutive_links_to_table(
method format_table_row (line 945) | fn format_table_row(&self, cells: &[String], max_lengths: &[usize]) ->...
method format_table_separator (line 959) | fn format_table_separator(&self, max_lengths: &[usize]) -> String {
method wrap_with_delimiter (line 970) | fn wrap_with_delimiter(&self, text: &str, delimiter: &str) -> String {
method escape_markdown_chars (line 978) | fn escape_markdown_chars(&self, text: &str) -> String {
method escape_markdown_table_cell_content (line 992) | fn escape_markdown_table_cell_content(&self, text: &str) -> String {
method convert_definition_list (line 997) | fn convert_definition_list(&self, node: &NodeRef<Node>, document: &Htm...
method convert_list_to_markdown (line 1033) | fn convert_list_to_markdown(&self, node: &NodeRef<Node>, document: &Ht...
method process_list (line 1044) | fn process_list(
method extract_li_data (line 1107) | fn extract_li_data(
method normalize_whitespace (line 1150) | fn normalize_whitespace(&self, text: &str) -> String {
type HeadingStyle (line 36) | pub enum HeadingStyle {
function test_simple_paragraph (line 1176) | fn test_simple_paragraph() {
function test_heading_atx (line 1183) | fn test_heading_atx() {
function test_heading_setext (line 1191) | fn test_heading_setext() {
function test_bold (line 1199) | fn test_bold() {
function test_italic (line 1206) | fn test_italic() {
function test_link (line 1213) | fn test_link() {
function test_image (line 1220) | fn test_image() {
function test_unordered_list (line 1227) | fn test_unordered_list() {
function test_ordered_list (line 1235) | fn test_ordered_list() {
function test_code_block (line 1243) | fn test_code_block() {
function test_inline_code (line 1253) | fn test_inline_code() {
function test_blockquote (line 1260) | fn test_blockquote() {
function test_horizontal_rule (line 1267) | fn test_horizontal_rule() {
function test_table (line 1274) | fn test_table() {
function test_strikethrough (line 1287) | fn test_strikethrough() {
function test_excluded_selector (line 1294) | fn test_excluded_selector() {
function test_script_removed (line 1305) | fn test_script_removed() {
function test_aria_hidden_excluded (line 1315) | fn test_aria_hidden_excluded() {
function test_aria_hidden_children_excluded (line 1326) | fn test_aria_hidden_children_excluded() {
function test_role_menu_excluded (line 1338) | fn test_role_menu_excluded() {
function test_adjacent_divs_have_spacing (line 1351) | fn test_adjacent_divs_have_spacing() {
function test_adjacent_sections_have_spacing (line 1364) | fn test_adjacent_sections_have_spacing() {
function test_span_remains_inline (line 1378) | fn test_span_remains_inline() {
function test_nested_divs_no_excessive_whitespace (line 1385) | fn test_nested_divs_no_excessive_whitespace() {
function test_empty_div_produces_no_output (line 1397) | fn test_empty_div_produces_no_output() {
function test_link_aria_label_fallback (line 1407) | fn test_link_aria_label_fallback() {
function test_link_visible_text_preferred_over_aria_label (line 1421) | fn test_link_visible_text_preferred_over_aria_label() {
function test_link_url_fallback_without_aria_label (line 1432) | fn test_link_url_fallback_without_aria_label() {
function test_link_empty_aria_label_falls_back_to_url (line 1443) | fn test_link_empty_aria_label_falls_back_to_url() {
function test_cookie_banner_excluded (line 1459) | fn test_cookie_banner_excluded() {
function test_onetrust_banner_excluded (line 1470) | fn test_onetrust_banner_excluded() {
FILE: src/export/utils/markdown_site_aggregator.rs
constant SIMILARITY_THRESHOLD (line 14) | const SIMILARITY_THRESHOLD: f64 = 80.0;
type MarkdownSiteAggregator (line 18) | pub struct MarkdownSiteAggregator {
method new (line 23) | pub fn new(base_url: &str) -> Self {
method combine_directory (line 30) | pub fn combine_directory(&self, directory_path: &str, remove_links_and...
method get_markdown_files (line 118) | fn get_markdown_files(&self, dir: &str) -> CrawlerResult<Vec<String>> {
method collect_markdown_files (line 125) | fn collect_markdown_files(&self, dir: &str, paths: &mut Vec<String>) -...
method make_url_from_path (line 150) | fn make_url_from_path(&self, file_path: &str, root_dir: &str) -> String {
method detect_common_header (line 183) | fn detect_common_header(&self, pages: &[&Vec<String>]) -> Vec<String> {
method detect_common_footer (line 209) | fn detect_common_footer(&self, pages: &[&Vec<String>]) -> Vec<String> {
method align_common_prefix (line 238) | fn align_common_prefix(&self, lines_a: &[String], lines_b: &[String]) ...
method lines_similar (line 267) | fn lines_similar(&self, a: &str, b: &str) -> bool {
method similar_text_percent (line 288) | fn similar_text_percent(&self, a: &str, b: &str) -> f64 {
method longest_common_substring_len (line 302) | fn longest_common_substring_len(&self, a: &str, b: &str) -> usize {
method remove_prefix (line 333) | fn remove_prefix(&self, lines: &[String], prefix_lines: &[String]) -> ...
method remove_suffix (line 346) | fn remove_suffix(&self, lines: &[String], suffix_lines: &[String]) -> ...
method remove_links_and_images (line 359) | fn remove_links_and_images(&self, markdown: &str) -> String {
function test_make_url_from_path (line 406) | fn test_make_url_from_path() {
function test_lines_similar (line 423) | fn test_lines_similar() {
function test_remove_links_and_images (line 431) | fn test_remove_links_and_images() {
FILE: src/export/utils/offline_url_converter.rs
type OfflineUrlConverter (line 59) | pub struct OfflineUrlConverter {
method new (line 75) | pub fn new(
method set_preserve_url_structure (line 99) | pub fn set_preserve_url_structure(&mut self, preserve: bool) {
method convert_url_to_relative (line 104) | pub fn convert_url_to_relative(&mut self, keep_fragment: bool) -> Stri...
method get_relative_target_url (line 116) | pub fn get_relative_target_url(&self) -> &ParsedUrl {
method get_target_domain_relation (line 120) | pub fn get_target_domain_relation(&self) -> TargetDomainRelation {
method set_replace_query_string (line 125) | pub fn set_replace_query_string(replace: Vec<String>) {
method set_lowercase (line 132) | pub fn set_lowercase(lowercase: bool) {
method get_offline_base_url_depth (line 139) | pub fn get_offline_base_url_depth(url: &ParsedUrl) -> usize {
method get_forced_url_if_needed (line 148) | fn get_forced_url_if_needed(&self) -> Option<String> {
method detect_and_set_file_name_with_extension (line 183) | fn detect_and_set_file_name_with_extension(&mut self) {
method calculate_and_apply_depth (line 277) | fn calculate_and_apply_depth(&mut self) {
method is_domain_allowed_for_static_files (line 320) | fn is_domain_allowed_for_static_files(&self, domain: &str) -> bool {
method is_external_domain_allowed_for_crawling (line 327) | fn is_external_domain_allowed_for_crawling(&self, domain: &str) -> bool {
method sanitize_file_path (line 335) | pub fn sanitize_file_path(file_path: &str, keep_fragment: bool) -> Str...
method get_query_hash_from_query_string (line 475) | fn get_query_hash_from_query_string(query_string: &str) -> String {
function extract_regex_pattern (line 520) | fn extract_regex_pattern(input: &str) -> Option<String> {
function parse_file_path_components (line 544) | fn parse_file_path_components(file_path: &str) -> (String, Option<String...
function html_entities_decode (line 569) | fn html_entities_decode(input: &str) -> String {
function make_converter (line 583) | fn make_converter(initial: &str, base: &str, target: &str, attribute: Op...
function convert (line 612) | fn convert(initial: &str, base: &str, target: &str, attribute: Option<&s...
function depth_root (line 622) | fn depth_root() {
function depth_file (line 630) | fn depth_file() {
function depth_dir (line 638) | fn depth_dir() {
function depth_file_in_dir (line 646) | fn depth_file_in_dir() {
function depth_nested_dir (line 654) | fn depth_nested_dir() {
function depth_root_with_query (line 662) | fn depth_root_with_query() {
function depth_file_with_query (line 671) | fn depth_file_with_query() {
function depth_dir_with_query (line 680) | fn depth_dir_with_query() {
function depth_file_in_dir_with_query (line 689) | fn depth_file_in_dir_with_query() {
function depth_nested_dir_with_query (line 698) | fn depth_nested_dir_with_query() {
function convert_root_to_root (line 711) | fn convert_root_to_root() {
function convert_root_page (line 724) | fn convert_root_page() {
function convert_root_page_trailing_slash (line 737) | fn convert_root_page_trailing_slash() {
function convert_from_subdir_with_fragment (line 750) | fn convert_from_subdir_with_fragment() {
function convert_relative_page (line 761) | fn convert_relative_page() {
function convert_relative_page_dir (line 769) | fn convert_relative_page_dir() {
function convert_relative_plain (line 777) | fn convert_relative_plain() {
function convert_relative_parent (line 785) | fn convert_relative_parent() {
function convert_relative_parent_dir (line 793) | fn convert_relative_parent_dir() {
function convert_from_subpath_same_dir (line 801) | fn convert_from_subpath_same_dir() {
function convert_external_allowed_domain_root (line 816) | fn convert_external_allowed_domain_root() {
function convert_external_allowed_domain_from_subdir (line 829) | fn convert_external_allowed_domain_from_subdir() {
function convert_external_css_file (line 842) | fn convert_external_css_file() {
function convert_backlink_to_initial_domain (line 857) | fn convert_backlink_to_initial_domain() {
function convert_backlink_subpage_to_initial (line 870) | fn convert_backlink_subpage_to_initial() {
function convert_backlink_subdir_to_initial (line 883) | fn convert_backlink_subdir_to_initial() {
function convert_backlink_to_third_domain (line 896) | fn convert_backlink_to_third_domain() {
function convert_protocol_relative_external (line 911) | fn convert_protocol_relative_external() {
function convert_protocol_relative_backlink (line 919) | fn convert_protocol_relative_backlink() {
function convert_fragment_only (line 929) | fn convert_fragment_only() {
function convert_fragment_only_external (line 937) | fn convert_fragment_only_external() {
function convert_page_with_query (line 947) | fn convert_page_with_query() {
function convert_query_only (line 965) | fn convert_query_only() {
function convert_css_with_query (line 977) | fn convert_css_with_query() {
function convert_double_parent_relative (line 992) | fn convert_double_parent_relative() {
function convert_double_parent_relative_dir (line 1005) | fn convert_double_parent_relative_dir() {
function convert_from_external_css_to_external_image (line 1020) | fn convert_from_external_css_to_external_image() {
function convert_from_deep_external_css_to_image (line 1031) | fn convert_from_deep_external_css_to_image() {
function convert_from_external_css_to_initial_domain (line 1042) | fn convert_from_external_css_to_initial_domain() {
function convert_from_external_css_relative_root (line 1053) | fn convert_from_external_css_relative_root() {
function convert_from_external_css_relative_parent (line 1064) | fn convert_from_external_css_relative_parent() {
function convert_unknown_domain_stays_absolute (line 1077) | fn convert_unknown_domain_stays_absolute() {
function convert_unknown_domain_http_stays_absolute (line 1088) | fn convert_unknown_domain_http_stays_absolute() {
function sanitize_utf8_czech (line 1103) | fn sanitize_utf8_czech() {
function sanitize_utf8_german (line 1111) | fn sanitize_utf8_german() {
function sanitize_utf8_chinese (line 1116) | fn sanitize_utf8_chinese() {
function sanitize_url_encoded_czech (line 1121) | fn sanitize_url_encoded_czech() {
function sanitize_url_encoded_german (line 1129) | fn sanitize_url_encoded_german() {
function sanitize_url_encoded_chinese (line 1137) | fn sanitize_url_encoded_chinese() {
function sanitize_dangerous_chars_colon (line 1145) | fn sanitize_dangerous_chars_colon() {
function sanitize_dangerous_chars_asterisk (line 1153) | fn sanitize_dangerous_chars_asterisk() {
function sanitize_dangerous_chars_question (line 1161) | fn sanitize_dangerous_chars_question() {
function sanitize_dangerous_chars_quotes (line 1169) | fn sanitize_dangerous_chars_quotes() {
function sanitize_dangerous_chars_brackets (line 1177) | fn sanitize_dangerous_chars_brackets() {
function sanitize_dangerous_chars_pipes (line 1185) | fn sanitize_dangerous_chars_pipes() {
function sanitize_dangerous_chars_backslash (line 1193) | fn sanitize_dangerous_chars_backslash() {
function sanitize_mixed_utf8_and_dangerous (line 1201) | fn sanitize_mixed_utf8_and_dangerous() {
function sanitize_empty (line 1209) | fn sanitize_empty() {
function sanitize_dots (line 1214) | fn sanitize_dots() {
function convert_simple (line 1223) | fn convert_simple(base: &str, target: &str) -> String {
function simple_from_subdir_to_root_asset (line 1244) | fn simple_from_subdir_to_root_asset() {
function simple_from_subdir_to_root_image (line 1252) | fn simple_from_subdir_to_root_image() {
function simple_from_deep_subdir_to_root_asset (line 1260) | fn simple_from_deep_subdir_to_root_asset() {
function simple_from_root_to_root_asset (line 1268) | fn simple_from_root_to_root_asset() {
function simple_from_root_to_subdir_image (line 1273) | fn simple_from_root_to_subdir_image() {
function convert_utf8 (line 1282) | fn convert_utf8(base: &str, target: &str) -> String {
function utf8_czech_from_root (line 1302) | fn utf8_czech_from_root() {
function utf8_czech_in_subdir (line 1310) | fn utf8_czech_in_subdir() {
function utf8_german_from_root (line 1318) | fn utf8_german_from_root() {
function utf8_chinese_from_root (line 1326) | fn utf8_chinese_from_root() {
function utf8_czech_trailing_slash (line 1334) | fn utf8_czech_trailing_slash() {
function utf8_chinese_trailing_slash (line 1342) | fn utf8_chinese_trailing_slash() {
function utf8_czech_from_subdir (line 1350) | fn utf8_czech_from_subdir() {
function utf8_chinese_from_subdir (line 1358) | fn utf8_chinese_from_subdir() {
function utf8_czech_with_fragment (line 1366) | fn utf8_czech_with_fragment() {
function test_sanitize_file_path_basic (line 1378) | fn test_sanitize_file_path_basic() {
function test_sanitize_file_path_with_query (line 1384) | fn test_sanitize_file_path_with_query() {
function test_extract_regex_pattern (line 1391) | fn test_extract_regex_pattern() {
function convert_preserve (line 1401) | fn convert_preserve(initial: &str, base: &str, target: &str) -> String {
function preserve_extensionless_page_becomes_dir_index (line 1422) | fn preserve_extensionless_page_becomes_dir_index() {
function preserve_trailing_slash_unchanged (line 1435) | fn preserve_trailing_slash_unchanged() {
function preserve_with_real_extension_unchanged (line 1448) | fn preserve_with_real_extension_unchanged() {
function preserve_nested_path (line 1461) | fn preserve_nested_path() {
function preserve_with_query_string (line 1474) | fn preserve_with_query_string() {
function preserve_root_page_unchanged (line 1491) | fn preserve_root_page_unchanged() {
FILE: src/export/utils/target_domain_relation.rs
type TargetDomainRelation (line 9) | pub enum TargetDomainRelation {
method get_by_urls (line 24) | pub fn get_by_urls(initial_url: &ParsedUrl, base_url: &ParsedUrl, targ...
method get_by_hosts (line 34) | pub fn get_by_hosts(initial_host: Option<&str>, base_host: Option<&str...
function initial_same_base_same_relative (line 68) | fn initial_same_base_same_relative() {
function initial_same_base_same_absolute (line 79) | fn initial_same_base_same_absolute() {
function initial_same_base_same_protocol_relative (line 90) | fn initial_same_base_same_protocol_relative() {
function initial_same_base_different_absolute (line 102) | fn initial_same_base_different_absolute() {
function initial_same_base_different_protocol_relative (line 113) | fn initial_same_base_different_protocol_relative() {
function initial_different_base_same_relative (line 125) | fn initial_different_base_same_relative() {
function initial_different_base_same_absolute (line 136) | fn initial_different_base_same_absolute() {
function initial_different_base_same_protocol_relative (line 147) | fn initial_different_base_same_protocol_relative() {
function initial_different_base_different_absolute (line 159) | fn initial_different_base_different_absolute() {
function initial_different_base_different_protocol_relative (line 170) | fn initial_different_base_different_protocol_relative() {
function initial_different_base_different_same_initial_base (line 181) | fn initial_different_base_different_same_initial_base() {
function test_target_empty (line 193) | fn test_target_empty() {
FILE: src/extra_column.rs
constant CUSTOM_METHOD_XPATH (line 9) | pub const CUSTOM_METHOD_XPATH: &str = "xpath";
constant CUSTOM_METHOD_REGEXP (line 10) | pub const CUSTOM_METHOD_REGEXP: &str = "regexp";
type ExtraColumn (line 14) | pub struct ExtraColumn {
method new (line 35) | pub fn new(
method get_length (line 86) | pub fn get_length(&self) -> usize {
method get_truncated_value (line 90) | pub fn get_truncated_value(&self, value: Option<&str>) -> Option<Strin...
method from_text (line 102) | pub fn from_text(text: &str) -> Result<ExtraColumn, CrawlerError> {
method extract_value (line 155) | pub fn extract_value(&self, text: &str) -> Option<String> {
method extract_xpath (line 180) | fn extract_xpath(html: &str, xpath: &str, index: usize) -> Option<Stri...
function default_column_size (line 25) | fn default_column_size(name: &str) -> Option<usize> {
function xpath_to_css (line 217) | fn xpath_to_css(xpath: &str) -> String {
function parse_simple_name_uses_default_length (line 243) | fn parse_simple_name_uses_default_length() {
function parse_name_with_explicit_length (line 251) | fn parse_name_with_explicit_length() {
function parse_name_with_no_truncate (line 259) | fn parse_name_with_no_truncate() {
function parse_regexp_method (line 267) | fn parse_regexp_method() {
function parse_xpath_method (line 274) | fn parse_xpath_method() {
function parse_invalid_method_returns_error (line 280) | fn parse_invalid_method_returns_error() {
function extract_regexp_matching (line 293) | fn extract_regexp_matching() {
function extract_regexp_not_matching (line 307) | fn extract_regexp_not_matching() {
function extract_xpath_h1 (line 323) | fn extract_xpath_h1() {
function extract_xpath_h1_with_text_suffix (line 338) | fn extract_xpath_h1_with_text_suffix() {
function extract_xpath_attribute (line 355) | fn extract_xpath_attribute() {
function extract_xpath_not_found (line 370) | fn extract_xpath_not_found() {
function truncated_value_truncates_when_longer (line 387) | fn truncated_value_truncates_when_longer() {
function truncated_value_none_returns_none (line 394) | fn truncated_value_none_returns_none() {
FILE: src/info.rs
type Info (line 7) | pub struct Info {
method new (line 19) | pub fn new(
method set_final_user_agent (line 39) | pub fn set_final_user_agent(&mut self, final_user_agent: String) {
FILE: src/main.rs
function main (line 8) | async fn main() {
FILE: src/options/core_options.rs
constant GROUP_BASIC_SETTINGS (line 17) | pub const GROUP_BASIC_SETTINGS: &str = "basic-settings";
constant GROUP_OUTPUT_SETTINGS (line 18) | pub const GROUP_OUTPUT_SETTINGS: &str = "output-settings";
constant GROUP_RESOURCE_FILTERING (line 19) | pub const GROUP_RESOURCE_FILTERING: &str = "resource-filtering";
constant GROUP_ADVANCED_CRAWLER_SETTINGS (line 20) | pub const GROUP_ADVANCED_CRAWLER_SETTINGS: &str = "advanced-crawler-sett...
constant GROUP_EXPERT_SETTINGS (line 21) | pub const GROUP_EXPERT_SETTINGS: &str = "expert-settings";
constant GROUP_FILE_EXPORT_SETTINGS (line 22) | pub const GROUP_FILE_EXPORT_SETTINGS: &str = "file-export-settings";
constant GROUP_MAILER_SETTINGS (line 23) | pub const GROUP_MAILER_SETTINGS: &str = "mailer-settings";
constant GROUP_MARKDOWN_EXPORT_SETTINGS (line 24) | pub const GROUP_MARKDOWN_EXPORT_SETTINGS: &str = "markdown-export-settin...
constant GROUP_OFFLINE_EXPORT_SETTINGS (line 25) | pub const GROUP_OFFLINE_EXPORT_SETTINGS: &str = "offline-export-settings";
constant GROUP_SITEMAP_SETTINGS (line 26) | pub const GROUP_SITEMAP_SETTINGS: &str = "sitemap-settings";
constant GROUP_UPLOAD_SETTINGS (line 27) | pub const GROUP_UPLOAD_SETTINGS: &str = "upload-settings";
constant GROUP_FASTEST_ANALYZER (line 28) | pub const GROUP_FASTEST_ANALYZER: &str = "fastest-analyzer";
constant GROUP_SEO_AND_OPENGRAPH_ANALYZER (line 29) | pub const GROUP_SEO_AND_OPENGRAPH_ANALYZER: &str = "seo-and-opengraph-an...
constant GROUP_SLOWEST_ANALYZER (line 30) | pub const GROUP_SLOWEST_ANALYZER: &str = "slowest-analyzer";
constant GROUP_CI_CD_SETTINGS (line 31) | pub const GROUP_CI_CD_SETTINGS: &str = "ci-cd-settings";
constant GROUP_SERVER_SETTINGS (line 32) | pub const GROUP_SERVER_SETTINGS: &str = "server-settings";
type StorageType (line 37) | pub enum StorageType {
method from_text (line 43) | pub fn from_text(text: &str) -> Result<Self, CrawlerError> {
method as_str (line 54) | pub fn as_str(&self) -> &'static str {
type CoreOptions (line 64) | pub struct CoreOptions {
method from_options (line 240) | pub fn from_options(options: &Options) -> Result<Self, CrawlerError> {
method apply_option_value (line 521) | fn apply_option_value(&mut self, property: &str, value: &OptionValue) ...
method has_header_to_table (line 1232) | pub fn has_header_to_table(&self, header_name: &str) -> bool {
method is_url_selected_for_debug (line 1236) | pub fn is_url_selected_for_debug(&self, url: &str) -> bool {
method crawl_only_html_files (line 1252) | pub fn crawl_only_html_files(&self) -> bool {
method get_initial_host (line 1262) | pub fn get_initial_host(&self, include_port_if_defined: bool) -> String {
method get_initial_scheme (line 1275) | pub fn get_initial_scheme(&self) -> String {
function get_options (line 1285) | pub fn get_options() -> Options {
function read_config_file (line 2396) | fn read_config_file(path: &str) -> Result<Vec<String>, CrawlerError> {
function merge_config_file_args (line 2410) | fn merge_config_file_args(argv: &[String]) -> Result<Vec<String>, Crawle...
function parse_argv (line 2458) | pub fn parse_argv(argv: &[String]) -> Result<CoreOptions, CrawlerError> {
function get_help_text (line 2552) | pub fn get_help_text() -> String {
function parse_duration_to_secs (line 2625) | fn parse_duration_to_secs(s: &str) -> u64 {
function default_http_cache_dir (line 2647) | fn default_http_cache_dir() -> String {
function default_output_prefix (line 2662) | fn default_output_prefix() -> String {
function make_default_core_options (line 2692) | fn make_default_core_options() -> CoreOptions {
function ci_defaults (line 2836) | fn ci_defaults() {
function apply_ci_bool (line 2846) | fn apply_ci_bool() {
function apply_ci_min_score (line 2853) | fn apply_ci_min_score() {
function apply_ci_max_404 (line 2860) | fn apply_ci_max_404() {
function apply_ci_max_warnings (line 2867) | fn apply_ci_max_warnings() {
function apply_ci_max_avg_response (line 2874) | fn apply_ci_max_avg_response() {
function apply_unknown_key_no_error (line 2882) | fn apply_unknown_key_no_error() {
function ci_option_group_exists (line 2889) | fn ci_option_group_exists() {
function parse_duration_days (line 2900) | fn parse_duration_days() {
function parse_duration_hours (line 2905) | fn parse_duration_hours() {
function parse_duration_minutes (line 2910) | fn parse_duration_minutes() {
function parse_duration_seconds (line 2915) | fn parse_duration_seconds() {
function parse_duration_invalid_number (line 2921) | fn parse_duration_invalid_number() {
function read_config_file_parses_args (line 2929) | fn read_config_file_parses_args() {
function read_config_file_ignores_comments_and_blank_lines (line 2939) | fn read_config_file_ignores_comments_and_blank_lines() {
function read_config_file_nonexistent_returns_error (line 2949) | fn read_config_file_nonexistent_returns_error() {
function merge_config_file_args_with_explicit_config (line 2955) | fn merge_config_file_args_with_explicit_config() {
function merge_config_file_args_without_config (line 2976) | fn merge_config_file_args_without_config() {
function apply_force_relative_urls (line 2986) | fn apply_force_relative_urls() {
function apply_offline_export_preserve_url_structure (line 2995) | fn apply_offline_export_preserve_url_structure() {
function apply_offline_export_preserve_urls (line 3004) | fn apply_offline_export_preserve_urls() {
FILE: src/options/group.rs
type OptionGroup (line 10) | pub struct OptionGroup {
method new (line 22) | pub fn new(apl_code: &str, name: &str, options: Vec<CrawlerOption>) ->...
FILE: src/options/option.rs
type OptionValue (line 17) | pub enum OptionValue {
method as_bool (line 27) | pub fn as_bool(&self) -> Option<bool> {
method as_int (line 34) | pub fn as_int(&self) -> Option<i64> {
method as_float (line 41) | pub fn as_float(&self) -> Option<f64> {
method as_str (line 48) | pub fn as_str(&self) -> Option<&str> {
method as_array (line 55) | pub fn as_array(&self) -> Option<&Vec<String>> {
method is_none (line 62) | pub fn is_none(&self) -> bool {
type CrawlerOption (line 68) | pub struct CrawlerOption {
method new (line 112) | pub fn new(
method set_value_from_argv (line 141) | pub fn set_value_from_argv(&mut self, argv: &[String]) -> Result<(), C...
method is_explicitly_set (line 259) | pub fn is_explicitly_set(&self) -> bool {
method get_value (line 263) | pub fn get_value(&self) -> Result<&OptionValue, CrawlerError> {
method validate_value (line 279) | fn validate_value(&self, value: Option<&str>, _defined_by_alt_name: bo...
method correct_value_type (line 492) | fn correct_value_type(&self, value: Option<&str>) -> Result<OptionValu...
method set_extras_domain (line 546) | pub fn set_extras_domain(domain: Option<&str>) {
function correct_url (line 555) | fn correct_url(url: &str) -> String {
function unquote_value (line 567) | fn unquote_value(value: &mut String) {
function replace_placeholders (line 579) | fn replace_placeholders(value: &mut String) {
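`OptionValue` exposes typed accessors (`as_bool`, `as_int`, `as_str`, …) that each return an `Option`, so a wrong-type access degrades to `None` rather than panicking. A reduced sketch of that pattern, with fewer variants than the real enum in src/options/option.rs:

```rust
// Simplified stand-in for OptionValue to illustrate the typed-accessor pattern;
// the real enum also carries float, array, and none variants.
enum OptionValue {
    Bool(bool),
    Int(i64),
    Str(String),
}

impl OptionValue {
    fn as_bool(&self) -> Option<bool> {
        match self {
            OptionValue::Bool(b) => Some(*b),
            _ => None, // wrong-type access yields None instead of panicking
        }
    }
    fn as_int(&self) -> Option<i64> {
        match self {
            OptionValue::Int(i) => Some(*i),
            _ => None,
        }
    }
    fn as_str(&self) -> Option<&str> {
        match self {
            OptionValue::Str(s) => Some(s),
            _ => None,
        }
    }
}

fn main() {
    let v = OptionValue::Int(42);
    assert_eq!(v.as_int(), Some(42));
    assert_eq!(v.as_bool(), None); // type mismatch is a soft failure
    assert_eq!(OptionValue::Str("out.json".into()).as_str(), Some("out.json"));
    println!("ok");
}
```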
FILE: src/options/option_type.rs
type OptionType (line 8) | pub enum OptionType {
method fmt (line 25) | fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
FILE: src/options/options.rs
type Options (line 10) | pub struct Options {
method new (line 15) | pub fn new() -> Self {
method add_group (line 21) | pub fn add_group(&mut self, group: OptionGroup) {
method get_groups (line 25) | pub fn get_groups(&self) -> &IndexMap<String, OptionGroup> {
method get_groups_mut (line 29) | pub fn get_groups_mut(&mut self) -> &mut IndexMap<String, OptionGroup> {
method get_group (line 33) | pub fn get_group(&self, apl_code: &str) -> Option<&OptionGroup> {
method get_group_mut (line 37) | pub fn get_group_mut(&mut self, apl_code: &str) -> Option<&mut OptionG...
method is_explicitly_set (line 43) | pub fn is_explicitly_set(&self, property: &str) -> bool {
method default (line 51) | fn default() -> Self {
FILE: src/output/json_output.rs
type JsonOutput (line 19) | pub struct JsonOutput {
method new (line 37) | pub fn new(
method get_json (line 55) | pub fn get_json(&self) -> String {
method add_banner (line 63) | fn add_banner(&mut self) {
method add_used_options (line 70) | fn add_used_options(&mut self) {
method set_extra_columns_from_analysis (line 76) | fn set_extra_columns_from_analysis(&mut self, extra_columns: Vec<ExtraCo...
method add_table_header (line 85) | fn add_table_header(&mut self) {
method add_table_row (line 89) | fn add_table_row(
method add_super_table (line 170) | fn add_super_table(&mut self, table: &SuperTable) {
method add_total_stats (line 183) | fn add_total_stats(&mut self, stats: &BasicStats) {
method add_notice (line 209) | fn add_notice(&mut self, text: &str) {
method add_error (line 222) | fn add_error(&mut self, text: &str) {
method add_quality_scores (line 235) | fn add_quality_scores(&mut self, scores: &QualityScores) {
method add_ci_gate_result (line 241) | fn add_ci_gate_result(&mut self, result: &CiGateResult) {
method add_summary (line 247) | fn add_summary(&mut self, summary: &mut Summary) {
method set_export_file_paths (line 253) | fn set_export_file_paths(
method get_type (line 279) | fn get_type(&self) -> OutputType {
method end (line 283) | fn end(&mut self) {
method get_json_content (line 292) | fn get_json_content(&self) -> Option<String> {
function make_json_output (line 303) | fn make_json_output() -> JsonOutput {
function make_pass_result (line 307) | fn make_pass_result() -> CiGateResult {
function make_fail_result (line 315) | fn make_fail_result() -> CiGateResult {
function parse_json (line 345) | fn parse_json(output: &JsonOutput) -> serde_json::Value {
function ci_gate_present_when_added (line 350) | fn ci_gate_present_when_added() {
function ci_gate_absent_when_not_added (line 358) | fn ci_gate_absent_when_not_added() {
function ci_gate_passed_true (line 365) | fn ci_gate_passed_true() {
function ci_gate_passed_false (line 375) | fn ci_gate_passed_false() {
function ci_gate_checks_array (line 385) | fn ci_gate_checks_array() {
function quality_scores_in_json (line 394) | fn quality_scores_in_json() {
function add_sample_rows (line 412) | fn add_sample_rows(output: &mut JsonOutput) {
function export_file_paths_offline_only (line 455) | fn export_file_paths_offline_only() {
function export_file_paths_both (line 475) | fn export_file_paths_both() {
function export_file_paths_none_changes_nothing (line 500) | fn export_file_paths_none_changes_nothing() {
FILE: src/output/multi_output.rs
type MultiOutput (line 16) | pub struct MultiOutput {
method new (line 21) | pub fn new() -> Self {
method add_output (line 25) | pub fn add_output(&mut self, output: Box<dyn Output>) {
method get_outputs (line 29) | pub fn get_outputs(&self) -> &[Box<dyn Output>] {
method get_outputs_mut (line 33) | pub fn get_outputs_mut(&mut self) -> &mut [Box<dyn Output>] {
method get_output_by_type (line 37) | pub fn get_output_by_type(&self, output_type: OutputType) -> Option<&d...
method get_output_by_type_mut (line 44) | pub fn get_output_by_type_mut(&mut self, output_type: OutputType) -> O...
method add_banner (line 50) | fn add_banner(&mut self) {
method add_used_options (line 56) | fn add_used_options(&mut self) {
method set_extra_columns_from_analysis (line 62) | fn set_extra_columns_from_analysis(&mut self, extra_columns: Vec<ExtraCo...
method add_table_header (line 68) | fn add_table_header(&mut self) {
method add_table_row (line 74) | fn add_table_row(
method add_super_table (line 103) | fn add_super_table(&mut self, table: &SuperTable) {
method add_total_stats (line 109) | fn add_total_stats(&mut self, stats: &BasicStats) {
method add_notice (line 115) | fn add_notice(&mut self, text: &str) {
method add_error (line 121) | fn add_error(&mut self, text: &str) {
method add_quality_scores (line 127) | fn add_quality_scores(&mut self, scores: &QualityScores) {
method add_ci_gate_result (line 133) | fn add_ci_gate_result(&mut self, result: &CiGateResult) {
method add_summary (line 139) | fn add_summary(&mut self, summary: &mut Summary) {
method set_export_file_paths (line 145) | fn set_export_file_paths(
method get_type (line 155) | fn get_type(&self) -> OutputType {
method end (line 159) | fn end(&mut self) {
method get_output_text (line 165) | fn get_output_text(&self) -> Option<String> {
method get_json_content (line 174) | fn get_json_content(&self) -> Option<String> {
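`MultiOutput` mirrors every method of the `Output` trait and forwards each call to all registered outputs, so text and JSON sinks receive the same event stream. A minimal sketch of that fan-out pattern, reduced to a single `add_notice` method for illustration:

```rust
// Fan-out sketch: the composite forwards each call to every registered Output.
// Names follow src/output/multi_output.rs, but the trait is heavily reduced here.
trait Output {
    fn add_notice(&mut self, text: &str);
}

struct TextOut { buf: String }
struct JsonOut { notices: Vec<String> }

impl Output for TextOut {
    fn add_notice(&mut self, text: &str) {
        self.buf.push_str(text);
        self.buf.push('\n');
    }
}
impl Output for JsonOut {
    fn add_notice(&mut self, text: &str) {
        self.notices.push(text.to_string());
    }
}

struct MultiOutput { outputs: Vec<Box<dyn Output>> }

impl MultiOutput {
    fn add_notice(&mut self, text: &str) {
        for out in self.outputs.iter_mut() {
            out.add_notice(text); // same event, every sink
        }
    }
}

fn main() {
    let mut multi = MultiOutput {
        outputs: vec![
            Box::new(TextOut { buf: String::new() }),
            Box::new(JsonOut { notices: Vec::new() }),
        ],
    };
    multi.add_notice("crawl finished");
    let mut t = TextOut { buf: String::new() };
    t.add_notice("crawl finished");
    assert_eq!(t.buf, "crawl finished\n");
    println!("ok");
}
```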
FILE: src/output/output.rs
type Output (line 16) | pub trait Output: Send + Sync {
method add_banner (line 18) | fn add_banner(&mut self);
method add_used_options (line 21) | fn add_used_options(&mut self);
method set_extra_columns_from_analysis (line 24) | fn set_extra_columns_from_analysis(&mut self, extra_columns: Vec<Extra...
method add_table_header (line 27) | fn add_table_header(&mut self);
method add_table_row (line 43) | fn add_table_row(
method add_super_table (line 58) | fn add_super_table(&mut self, table: &SuperTable);
method add_total_stats (line 64) | fn add_total_stats(&mut self, stats: &BasicStats);
method add_notice (line 67) | fn add_notice(&mut self, text: &str);
method add_error (line 70) | fn add_error(&mut self, text: &str);
method add_quality_scores (line 73) | fn add_quality_scores(&mut self, _scores: &QualityScores) {}
method add_ci_gate_result (line 76) | fn add_ci_gate_result(&mut self, _result: &CiGateResult) {}
method add_summary (line 79) | fn add_summary(&mut self, summary: &mut Summary);
method get_type (line 82) | fn get_type(&self) -> OutputType;
method end (line 85) | fn end(&mut self);
method get_output_text (line 89) | fn get_output_text(&self) -> Option<String> {
method get_json_content (line 95) | fn get_json_content(&self) -> Option<String> {
method set_export_file_paths (line 102) | fn set_export_file_paths(
type BasicStats (line 113) | pub struct BasicStats {
type CrawlerInfo (line 129) | pub struct CrawlerInfo {
FILE: src/output/output_type.rs
type OutputType (line 11) | pub enum OutputType {
method from_text (line 18) | pub fn from_text(text: &str) -> Result<Self, CrawlerError> {
method available_text_types (line 30) | pub fn available_text_types() -> Vec<&'static str> {
method as_str (line 34) | pub fn as_str(&self) -> &'static str {
method fmt (line 44) | fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
FILE: src/output/text_output.rs
type TextOutput (line 18) | pub struct TextOutput {
method new (line 57) | pub fn new(
method is_column_hidden (line 118) | fn is_column_hidden(&self, name: &str) -> bool {
method hidden_columns_width (line 123) | fn hidden_columns_width(&self) -> usize {
method add_to_output (line 140) | fn add_to_output(&mut self, output: &str) {
method get_output_text (line 149) | pub fn get_output_text(&self) -> &str {
method get_url_column_size (line 153) | fn get_url_column_size(&mut self) -> usize {
method get_polynomial_delays (line 180) | fn get_polynomial_delays(total_time: f64, iterations: usize, power: u3...
method add_banner (line 197) | fn add_banner(&mut self) {
method add_used_options (line 264) | fn add_used_options(&mut self) {
method set_extra_columns_from_analysis (line 268) | fn set_extra_columns_from_analysis(&mut self, extra_columns: Vec<ExtraCo...
method add_table_header (line 279) | fn add_table_header(&mut self) {
method add_table_row (line 334) | fn add_table_row(
method add_super_table (line 478) | fn add_super_table(&mut self, table: &SuperTable) {
method add_total_stats (line 483) | fn add_total_stats(&mut self, stats: &BasicStats) {
method add_notice (line 561) | fn add_notice(&mut self, text: &str) {
method add_error (line 565) | fn add_error(&mut self, text: &str) {
method add_quality_scores (line 569) | fn add_quality_scores(&mut self, scores: &QualityScores) {
method add_ci_gate_result (line 610) | fn add_ci_gate_result(&mut self, result: &CiGateResult) {
method add_summary (line 736) | fn add_summary(&mut self, summary: &mut Summary) {
method get_type (line 741) | fn get_type(&self) -> OutputType {
method end (line 745) | fn end(&mut self) {
method get_output_text (line 749) | fn get_output_text(&self) -> Option<String> {
function extract_host (line 757) | fn extract_host(url: &str) -> String {
function strip_scheme_and_host (line 766) | fn strip_scheme_and_host(url: &str) -> String {
constant CACHE_TYPE_HAS_NO_STORE (line 780) | const CACHE_TYPE_HAS_NO_STORE: i32 = 2048;
constant CACHE_TYPE_HAS_ETAG (line 781) | const CACHE_TYPE_HAS_ETAG: i32 = 4;
constant CACHE_TYPE_HAS_LAST_MODIFIED (line 782) | const CACHE_TYPE_HAS_LAST_MODIFIED: i32 = 8;
function get_colored_cache_info (line 785) | fn get_colored_cache_info(cache_type_flags: i32, cache_lifetime: Option<...
function format_num (line 833) | fn format_num(v: f64) -> String {
function format_score_line (line 842) | fn format_score_line(
FILE: src/result/basic_stats.rs
type BasicStats (line 13) | pub struct BasicStats {
method new (line 28) | pub fn new(
method from_visited_urls (line 54) | pub fn from_visited_urls(visited_urls: &[&VisitedUrl], start_time: Ins...
method get_as_html (line 102) | pub fn get_as_html(&self) -> String {
FILE: src/result/manager_stats.rs
type ManagerStats (line 12) | pub struct ManagerStats {
method new (line 21) | pub fn new() -> Self {
method measure_exec_time (line 29) | pub fn measure_exec_time(&mut self, class: &str, method: &str, start_t...
method get_super_table (line 37) | pub fn get_super_table(
method get_exec_times (line 171) | pub fn get_exec_times(&self) -> &HashMap<String, f64> {
method get_exec_counts (line 175) | pub fn get_exec_counts(&self) -> &HashMap<String, usize> {
FILE: src/result/status.rs
type Status (line 23) | pub struct Status {
method new (line 83) | pub fn new(storage: Box<dyn Storage>, store_content: bool, crawler_inf...
method add_visited_url (line 100) | pub fn add_visited_url(
method add_summary_item_by_ranges (line 140) | pub fn add_summary_item_by_ranges(
method add_ok_to_summary (line 167) | pub fn add_ok_to_summary(&self, apl_code: &str, text: &str) {
method add_notice_to_summary (line 173) | pub fn add_notice_to_summary(&self, apl_code: &str, text: &str) {
method add_info_to_summary (line 179) | pub fn add_info_to_summary(&self, apl_code: &str, text: &str) {
method add_warning_to_summary (line 185) | pub fn add_warning_to_summary(&self, apl_code: &str, text: &str) {
method add_critical_to_summary (line 191) | pub fn add_critical_to_summary(&self, apl_code: &str, text: &str) {
method get_summary (line 197) | pub fn get_summary(&self) -> Summary {
method with_summary (line 201) | pub fn with_summary<F, R>(&self, f: F) -> Option<R>
method get_url_body (line 209) | pub fn get_url_body(&self, uq_id: &str) -> Option<Vec<u8>> {
method get_url_body_text (line 217) | pub fn get_url_body_text(&self, uq_id: &str) -> Option<String> {
method get_url_headers (line 222) | pub fn get_url_headers(&self, uq_id: &str) -> Option<HashMap<String, S...
method get_visited_urls (line 231) | pub fn get_visited_urls(&self) -> Vec<VisitedUrl> {
method with_visited_urls (line 238) | pub fn with_visited_urls<F, R>(&self, f: F) -> Option<R>
method get_crawler_info (line 245) | pub fn get_crawler_info(&self) -> Info {
method get_storage (line 259) | pub fn get_storage(&self) -> &dyn Storage {
method set_final_user_agent (line 263) | pub fn set_final_user_agent(&self, value: &str) {
method get_basic_stats (line 269) | pub fn get_basic_stats(&self) -> BasicStats {
method add_super_table_at_beginning (line 294) | pub fn add_super_table_at_beginning(&self, super_table: SuperTable) {
method add_super_table_at_end (line 300) | pub fn add_super_table_at_end(&self, super_table: SuperTable) {
method with_super_tables_at_beginning (line 306) | pub fn with_super_tables_at_beginning<F, R>(&self, f: F) -> Option<R>
method with_super_tables_at_beginning_mut (line 313) | pub fn with_super_tables_at_beginning_mut<F, R>(&self, f: F) -> Option<R>
method with_super_tables_at_end (line 323) | pub fn with_super_tables_at_end<F, R>(&self, f: F) -> Option<R>
method with_super_tables_at_end_mut (line 330) | pub fn with_super_tables_at_end_mut<F, R>(&self, f: F) -> Option<R>
method configure_super_table_url_stripping (line 339) | pub fn configure_super_table_url_stripping(&self, table: &mut SuperTab...
method get_super_table_by_apl_code (line 352) | pub fn get_super_table_by_apl_code(&self, apl_code: &str) -> bool {
method get_url_by_uq_id (line 371) | pub fn get_url_by_uq_id(&self, uq_id: &str) -> Option<String> {
method get_origin_header_value_by_source_uq_id (line 378) | pub fn get_origin_header_value_by_source_uq_id(&self, source_uq_id: &s...
method add_url_analysis_result (line 395) | pub fn add_url_analysis_result(&self, visited_url_uq_id: &str, result:...
method get_url_analysis_results (line 401) | pub fn get_url_analysis_results(&self, visited_url_uq_id: &str) -> Vec...
method add_skipped_url (line 409) | pub fn add_skipped_url(&mut self, url: String, reason: SkippedReason, ...
method get_skipped_urls (line 420) | pub fn get_skipped_urls(&self) -> Vec<SkippedUrlEntry> {
method get_details_by_analysis_name_and_severity (line 424) | pub fn get_details_by_analysis_name_and_severity(&self, analysis_name:...
method get_visited_url_to_analysis_result (line 439) | pub fn get_visited_url_to_analysis_result(&self) -> HashMap<String, Ve...
method get_number_of_working_visited_urls (line 447) | pub fn get_number_of_working_visited_urls(&self) -> usize {
method set_robots_txt_content (line 454) | pub fn set_robots_txt_content(&self, scheme: &str, host: &str, port: u...
method get_robots_txt_content (line 461) | pub fn get_robots_txt_content(&self, scheme: &str, host: &str, port: u...
method fmt (line 471) | fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
type SkippedUrlEntry (line 63) | pub struct SkippedUrlEntry {
type UrlAnalysisResultEntry (line 72) | pub struct UrlAnalysisResultEntry {
FILE: src/result/storage/file_storage.rs
type FileStorage (line 13) | pub struct FileStorage {
method new (line 19) | pub fn new(tmp_dir: &str, compress: bool, origin_url_domain: &str) -> ...
method get_file_extension (line 41) | fn get_file_extension(&self) -> &str {
method get_file_path (line 45) | fn get_file_path(&self, uq_id: &str) -> PathBuf {
method create_directory_if_needed (line 57) | fn create_directory_if_needed(&self, path: &Path) -> CrawlerResult<()> {
method fmt (line 134) | fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
method save (line 72) | fn save(&mut self, uq_id: &str, content: &[u8]) -> CrawlerResult<()> {
method load (line 89) | fn load(&self, uq_id: &str) -> CrawlerResult<Vec<u8>> {
method delete (line 108) | fn delete(&mut self, uq_id: &str) -> CrawlerResult<()> {
method delete_all (line 116) | fn delete_all(&mut self) -> CrawlerResult<()> {
method drop (line 127) | fn drop(&mut self) {
FILE: src/result/storage/memory_storage.rs
type MemoryStorage (line 10) | pub struct MemoryStorage {
method new (line 16) | pub fn new(compress: bool) -> Self {
method fmt (line 66) | fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
method save (line 25) | fn save(&mut self, uq_id: &str, content: &[u8]) -> CrawlerResult<()> {
method load (line 38) | fn load(&self, uq_id: &str) -> CrawlerResult<Vec<u8>> {
method delete (line 54) | fn delete(&mut self, uq_id: &str) -> CrawlerResult<()> {
method delete_all (line 59) | fn delete_all(&mut self) -> CrawlerResult<()> {
FILE: src/result/storage/storage.rs
type Storage (line 6) | pub trait Storage: Send + Sync {
method save (line 7) | fn save(&mut self, uq_id: &str, content: &[u8]) -> CrawlerResult<()>;
method load (line 9) | fn load(&self, uq_id: &str) -> CrawlerResult<Vec<u8>>;
method delete (line 11) | fn delete(&mut self, uq_id: &str) -> CrawlerResult<()>;
method delete_all (line 13) | fn delete_all(&mut self) -> CrawlerResult<()>;
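The `Storage` trait is the seam that lets `FileStorage` and `MemoryStorage` swap freely behind `Status`. A sketch of a third implementation against the four-method contract listed above; `CrawlerResult` is a project alias, assumed here to be a plain `Result` with a string error:

```rust
// Sketch of implementing the Storage trait from src/result/storage/storage.rs.
// Assumption: CrawlerResult<T> stands in for the real alias over CrawlerError.
use std::collections::HashMap;

type CrawlerResult<T> = Result<T, String>;

trait Storage: Send + Sync {
    fn save(&mut self, uq_id: &str, content: &[u8]) -> CrawlerResult<()>;
    fn load(&self, uq_id: &str) -> CrawlerResult<Vec<u8>>;
    fn delete(&mut self, uq_id: &str) -> CrawlerResult<()>;
    fn delete_all(&mut self) -> CrawlerResult<()>;
}

#[derive(Default)]
struct MapStorage { entries: HashMap<String, Vec<u8>> }

impl Storage for MapStorage {
    fn save(&mut self, uq_id: &str, content: &[u8]) -> CrawlerResult<()> {
        self.entries.insert(uq_id.to_string(), content.to_vec());
        Ok(())
    }
    fn load(&self, uq_id: &str) -> CrawlerResult<Vec<u8>> {
        self.entries
            .get(uq_id)
            .cloned()
            .ok_or_else(|| format!("no content for uq_id '{uq_id}'"))
    }
    fn delete(&mut self, uq_id: &str) -> CrawlerResult<()> {
        self.entries.remove(uq_id);
        Ok(())
    }
    fn delete_all(&mut self) -> CrawlerResult<()> {
        self.entries.clear();
        Ok(())
    }
}

fn main() {
    let mut s = MapStorage::default();
    s.save("abc123", b"<html>").unwrap();
    assert_eq!(s.load("abc123").unwrap(), b"<html>".to_vec());
    s.delete_all().unwrap();
    assert!(s.load("abc123").is_err());
    println!("ok");
}
```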
FILE: src/result/storage/storage_type.rs
type StorageType (line 10) | pub enum StorageType {
method from_text (line 16) | pub fn from_text(text: &str) -> Result<Self, CrawlerError> {
method available_text_types (line 28) | pub fn available_text_types() -> Vec<&'static str> {
method as_str (line 32) | pub fn as_str(&self) -> &'static str {
method fmt (line 41) | fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
FILE: src/result/visited_url.rs
constant ERROR_CONNECTION_FAIL (line 13) | pub const ERROR_CONNECTION_FAIL: i32 = -1;
constant ERROR_TIMEOUT (line 14) | pub const ERROR_TIMEOUT: i32 = -2;
constant ERROR_SERVER_RESET (line 15) | pub const ERROR_SERVER_RESET: i32 = -3;
constant ERROR_SEND_ERROR (line 16) | pub const ERROR_SEND_ERROR: i32 = -4;
constant ERROR_SKIPPED (line 17) | pub const ERROR_SKIPPED: i32 = -6;
constant CACHE_TYPE_HAS_CACHE_CONTROL (line 20) | pub const CACHE_TYPE_HAS_CACHE_CONTROL: u32 = 1;
constant CACHE_TYPE_HAS_EXPIRES (line 21) | pub const CACHE_TYPE_HAS_EXPIRES: u32 = 2;
constant CACHE_TYPE_HAS_ETAG (line 22) | pub const CACHE_TYPE_HAS_ETAG: u32 = 4;
constant CACHE_TYPE_HAS_LAST_MODIFIED (line 23) | pub const CACHE_TYPE_HAS_LAST_MODIFIED: u32 = 8;
constant CACHE_TYPE_HAS_MAX_AGE (line 24) | pub const CACHE_TYPE_HAS_MAX_AGE: u32 = 16;
constant CACHE_TYPE_HAS_S_MAX_AGE (line 25) | pub const CACHE_TYPE_HAS_S_MAX_AGE: u32 = 32;
constant CACHE_TYPE_HAS_STALE_WHILE_REVALIDATE (line 26) | pub const CACHE_TYPE_HAS_STALE_WHILE_REVALIDATE: u32 = 64;
constant CACHE_TYPE_HAS_STALE_IF_ERROR (line 27) | pub const CACHE_TYPE_HAS_STALE_IF_ERROR: u32 = 128;
constant CACHE_TYPE_HAS_PUBLIC (line 28) | pub const CACHE_TYPE_HAS_PUBLIC: u32 = 256;
constant CACHE_TYPE_HAS_PRIVATE (line 29) | pub const CACHE_TYPE_HAS_PRIVATE: u32 = 512;
constant CACHE_TYPE_HAS_NO_CACHE (line 30) | pub const CACHE_TYPE_HAS_NO_CACHE: u32 = 1024;
constant CACHE_TYPE_HAS_NO_STORE (line 31) | pub const CACHE_TYPE_HAS_NO_STORE: u32 = 2048;
constant CACHE_TYPE_HAS_MUST_REVALIDATE (line 32) | pub const CACHE_TYPE_HAS_MUST_REVALIDATE: u32 = 4096;
constant CACHE_TYPE_HAS_PROXY_REVALIDATE (line 33) | pub const CACHE_TYPE_HAS_PROXY_REVALIDATE: u32 = 8192;
constant CACHE_TYPE_HAS_IMMUTABLE (line 34) | pub const CACHE_TYPE_HAS_IMMUTABLE: u32 = 16384;
constant CACHE_TYPE_NO_CACHE_HEADERS (line 35) | pub const CACHE_TYPE_NO_CACHE_HEADERS: u32 = 32768;
constant CACHE_TYPE_NOT_AVAILABLE (line 36) | pub const CACHE_TYPE_NOT_AVAILABLE: u32 = 65536;
constant SOURCE_INIT_URL (line 39) | pub const SOURCE_INIT_URL: i32 = 5;
constant SOURCE_A_HREF (line 40) | pub const SOURCE_A_HREF: i32 = 10;
constant SOURCE_IMG_SRC (line 41) | pub const SOURCE_IMG_SRC: i32 = 20;
constant SOURCE_IMG_SRCSET (line 42) | pub const SOURCE_IMG_SRCSET: i32 = 21;
constant SOURCE_INPUT_SRC (line 43) | pub const SOURCE_INPUT_SRC: i32 = 22;
constant SOURCE_SOURCE_SRC (line 44) | pub const SOURCE_SOURCE_SRC: i32 = 23;
constant SOURCE_VIDEO_SRC (line 45) | pub const SOURCE_VIDEO_SRC: i32 = 24;
constant SOURCE_AUDIO_SRC (line 46) | pub const SOURCE_AUDIO_SRC: i32 = 25;
constant SOURCE_SCRIPT_SRC (line 47) | pub const SOURCE_SCRIPT_SRC: i32 = 30;
constant SOURCE_INLINE_SCRIPT_SRC (line 48) | pub const SOURCE_INLINE_SCRIPT_SRC: i32 = 40;
constant SOURCE_LINK_HREF (line 49) | pub const SOURCE_LINK_HREF: i32 = 50;
constant SOURCE_CSS_URL (line 50) | pub const SOURCE_CSS_URL: i32 = 60;
constant SOURCE_JS_URL (line 51) | pub const SOURCE_JS_URL: i32 = 70;
constant SOURCE_REDIRECT (line 52) | pub const SOURCE_REDIRECT: i32 = 80;
constant SOURCE_SITEMAP (line 53) | pub const SOURCE_SITEMAP: i32 = 90;
type VisitedUrl (line 56) | pub struct VisitedUrl {
method new (line 111) | pub fn new(
method is_https (line 152) | pub fn is_https(&self) -> bool {
method is_static_file (line 156) | pub fn is_static_file(&self) -> bool {
method is_image (line 171) | pub fn is_image(&self) -> bool {
method is_video (line 175) | pub fn is_video(&self) -> bool {
method get_source_description (line 179) | pub fn get_source_description(&self, source_url: Option<&str>) -> Stri...
method get_source_short_name (line 201) | pub fn get_source_short_name(&self) -> &'static str {
method looks_like_static_file_by_url (line 222) | pub fn looks_like_static_file_by_url(&self) -> bool {
method has_error_status_code (line 232) | pub fn has_error_status_code(&self) -> bool {
method get_scheme (line 236) | pub fn get_scheme(&self) -> Option<String> {
method get_host (line 240) | pub fn get_host(&self) -> Option<String> {
method get_port (line 246) | pub fn get_port(&self) -> u16 {
method get_cache_type_label (line 256) | pub fn get_cache_type_label(&self) -> String {
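The `CACHE_TYPE_*` constants above are power-of-two bitflags packed into a single `u32`, so one integer can record every caching-related header seen on a response. A short sketch of combining and testing them, with values copied from the listing:

```rust
// Bitflag arithmetic over the CACHE_TYPE_* constants from src/result/visited_url.rs.
const CACHE_TYPE_HAS_CACHE_CONTROL: u32 = 1;
const CACHE_TYPE_HAS_ETAG: u32 = 4;
const CACHE_TYPE_HAS_MAX_AGE: u32 = 16;
const CACHE_TYPE_HAS_NO_STORE: u32 = 2048;

fn main() {
    // e.g. a response with "Cache-Control: max-age=3600" plus an ETag header:
    let flags = CACHE_TYPE_HAS_CACHE_CONTROL | CACHE_TYPE_HAS_MAX_AGE | CACHE_TYPE_HAS_ETAG;

    assert!(flags & CACHE_TYPE_HAS_ETAG != 0);     // flag present
    assert!(flags & CACHE_TYPE_HAS_NO_STORE == 0); // flag absent
    assert_eq!(flags, 21);                         // 1 + 4 + 16
    println!("ok");
}
```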
FILE: src/scoring/ci_gate.rs
type CiCheck (line 18) | pub struct CiCheck {
type CiGateResult (line 28) | pub struct CiGateResult {
function evaluate (line 34) | pub fn evaluate(options: &CoreOptions, scores: &QualityScores, stats: &B...
function check_min (line 152) | fn check_min(metric: &str, actual: f64, threshold: f64) -> CiCheck {
function check_max (line 162) | fn check_max(metric: &str, actual: f64, threshold: f64) -> CiCheck {
function count_content_types (line 172) | fn count_content_types(stats: &BasicStats, types: &[ContentTypeId]) -> u...
function find_category_score (line 179) | fn find_category_score(scores: &QualityScores, code: &str) -> f64 {
function make_options (line 195) | fn make_options() -> CoreOptions {
function make_scores (line 338) | fn make_scores(overall: f64) -> QualityScores {
function make_stats (line 369) | fn make_stats(total_urls: usize) -> BasicStats {
function make_stats_with_status (line 381) | fn make_stats_with_status(total_urls: usize, status_counts: &[(i32, usiz...
function all_checks_pass (line 394) | fn all_checks_pass() {
function fail_low_overall_score (line 405) | fn fail_low_overall_score() {
function fail_404_count (line 416) | fn fail_404_count() {
function fail_5xx_count (line 426) | fn fail_5xx_count() {
function fail_criticals (line 436) | fn fail_criticals() {
function optional_warnings (line 451) | fn optional_warnings() {
function optional_avg_response (line 467) | fn optional_avg_response() {
function zero_urls_immediate_fail (line 479) | fn zero_urls_immediate_fail() {
function only_negative_status_codes_immediate_fail (line 490) | fn only_negative_status_codes_immediate_fail() {
function category_threshold (line 502) | fn category_threshold() {
function fail_min_pages (line 515) | fn fail_min_pages() {
function pass_min_pages (line 529) | fn pass_min_pages() {
function fail_min_assets (line 541) | fn fail_min_assets() {
function documents_check_skipped_when_zero (line 556) | fn documents_check_skipped_when_zero() {
function fail_min_documents (line 567) | fn fail_min_documents() {
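The CI gate is built from `check_min`/`check_max` comparisons of actual metrics against configured thresholds. A hedged sketch of those helpers; the real `CiCheck` struct has more fields, and it is reduced here to metric name, actual value, threshold, and pass flag:

```rust
// Sketch of the threshold helpers from src/scoring/ci_gate.rs (field set assumed).
struct CiCheck {
    metric: String,
    actual: f64,
    threshold: f64,
    passed: bool,
}

// "min" gate: actual must reach at least the threshold (e.g. --ci-min-score).
fn check_min(metric: &str, actual: f64, threshold: f64) -> CiCheck {
    CiCheck { metric: metric.to_string(), actual, threshold, passed: actual >= threshold }
}

// "max" gate: actual must not exceed the threshold (e.g. --ci-max-404).
fn check_max(metric: &str, actual: f64, threshold: f64) -> CiCheck {
    CiCheck { metric: metric.to_string(), actual, threshold, passed: actual <= threshold }
}

fn main() {
    let score = check_min("overall-score", 87.0, 80.0);
    let not_found = check_max("404-count", 3.0, 0.0);
    assert!(score.passed);
    assert!(!not_found.passed);
    // the overall gate passes only when every individual check passes
    let gate_passed = [&score, &not_found].iter().all(|c| c.passed);
    assert!(!gate_passed);
    println!("checked {} and {}", score.metric, not_found.metric);
    let _ = (score.actual, score.threshold);
}
```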
FILE: src/scoring/quality_score.rs
type QualityScores (line 8) | pub struct QualityScores {
type CategoryScore (line 15) | pub struct CategoryScore {
method color_hex (line 32) | pub fn color_hex(&self) -> &'static str {
method console_color (line 42) | pub fn console_color(&self) -> &'static str {
type Deduction (line 26) | pub struct Deduction {
function score_label (line 53) | pub fn score_label(score: f64) -> &'static str {
function make_score (line 67) | fn make_score(score: f64) -> CategoryScore {
function score_label_values (line 79) | fn score_label_values() {
function score_label_boundaries (line 88) | fn score_label_boundaries() {
function color_hex_green_for_excellent (line 96) | fn color_hex_green_for_excellent() {
function color_hex_purple_for_poor (line 101) | fn color_hex_purple_for_poor() {
function color_hex_red_for_critical (line 106) | fn color_hex_red_for_critical() {
function color_hex_boundaries (line 111) | fn color_hex_boundaries() {
function console_color_values (line 117) | fn console_color_values() {
FILE: src/scoring/scorer.rs
constant MAX_PER_URL_DEDUCTION (line 15) | const MAX_PER_URL_DEDUCTION: f64 = 5.0;
constant MAX_PER_TYPE_DEDUCTION (line 18) | const MAX_PER_TYPE_DEDUCTION: f64 = 2.5;
function calculate_scores (line 21) | pub fn calculate_scores(summary: &Summary, basic_stats: &BasicStats) -> ...
function score_performance (line 47) | fn score_performance(summary: &Summary, stats: &BasicStats) -> CategoryS...
function score_seo (line 99) | fn score_seo(summary: &Summary, stats: &BasicStats) -> CategoryScore {
function score_security (line 182) | fn score_security(summary: &Summary) -> CategoryScore {
function score_accessibility (line 252) | fn score_accessibility(summary: &Summary) -> CategoryScore {
function score_best_practices (line 329) | fn score_best_practices(summary: &Summary) -> CategoryScore {
function build_category (line 414) | fn build_category(
function per_url_deduct (line 435) | fn per_url_deduct(
function is_not_ok (line 464) | fn is_not_ok(summary: &Summary, apl_code: &str) -> bool {
function is_critical (line 472) | fn is_critical(summary: &Summary, apl_code: &str) -> bool {
function is_warning_or_above (line 480) | fn is_warning_or_above(summary: &Summary, apl_code: &str) -> bool {
function is_warning (line 488) | fn is_warning(summary: &Summary, apl_code: &str) -> bool {
function get_item_count (line 496) | fn get_item_count(summary: &Summary, apl_code: &str) -> Option<usize> {
function get_item_count_for_code (line 505) | fn get_item_count_for_code(summary: &Summary, apl_code: &str) -> Option<...
function extract_first_number (line 515) | fn extract_first_number(text: &str) -> Option<usize> {
function number_regex (line 519) | fn number_regex() -> &'static Regex {
function round1 (line 525) | fn round1(v: f64) -> f64 {
function make_empty_summary (line 535) | fn make_empty_summary() -> Summary {
function make_summary_with_items (line 539) | fn make_summary_with_items(items: Vec<(&str, ItemStatus)>) -> Summary {
function make_basic_stats (line 547) | fn make_basic_stats() -> BasicStats {
function perfect_score_for_clean_site (line 556) | fn perfect_score_for_clean_site() {
function score_label_thresholds (line 564) | fn score_label_thresholds() {
function slow_response_deduction (line 573) | fn slow_response_deduction() {
function categories_have_correct_weights (line 583) | fn categories_have_correct_weights() {
function overall_is_weighted_average (line 592) | fn overall_is_weighted_average() {
function errors_404_deduct_from_seo (line 602) | fn errors_404_deduct_from_seo() {
function warnings_reduce_score (line 612) | fn warnings_reduce_score() {
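`MAX_PER_URL_DEDUCTION` (5.0) and `MAX_PER_TYPE_DEDUCTION` (2.5) suggest that repeated findings of one kind cannot sink a category score without bound. A hedged sketch of that capping idea plus a one-decimal rounding helper in the spirit of `round1`; the real `per_url_deduct` takes more context than this:

```rust
// Capped-deduction sketch after src/scoring/scorer.rs; per-finding weight is assumed.
const MAX_PER_TYPE_DEDUCTION: f64 = 2.5;

fn capped_deduction(finding_count: usize, per_finding: f64) -> f64 {
    // linear in the finding count, but clamped at the per-type ceiling
    (finding_count as f64 * per_finding).min(MAX_PER_TYPE_DEDUCTION)
}

fn round1(v: f64) -> f64 {
    (v * 10.0).round() / 10.0 // one-decimal rounding, as the round1 helper name suggests
}

fn main() {
    assert_eq!(capped_deduction(2, 0.5), 1.0);  // below the cap: linear
    assert_eq!(capped_deduction(50, 0.5), 2.5); // 25.0 clamped to the per-type cap
    assert_eq!(round1(91.67), 91.7);
    println!("ok");
}
```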
FILE: src/server.rs
type ServeMode (line 17) | pub enum ServeMode {
function run (line 23) | pub async fn run(root_dir: PathBuf, mode: ServeMode, port: u16, bind_add...
function handle_connection (line 106) | async fn handle_connection(mut stream: TcpStream, root_dir: &Path, is_ma...
function find_header_end (line 199) | fn find_header_end(response: &[u8]) -> Option<usize> {
function extract_status (line 203) | fn extract_status(response: &[u8]) -> u16 {
function serve_markdown_request (line 214) | fn serve_markdown_request(root_dir: &Path, relative_path: &str) -> Vec<u...
function resolve_markdown_path (line 249) | fn resolve_markdown_path(root_dir: &Path, relative_path: &str) -> Option...
function serve_offline_request (line 283) | fn serve_offline_request(root_dir: &Path, relative_path: &str) -> Vec<u8> {
function resolve_offline_path (line 301) | fn resolve_offline_path(root_dir: &Path, relative_path: &str) -> Option<...
function serve_static_file (line 336) | fn serve_static_file(path: &Path, extra_headers: &[(&str, &str)]) -> Vec...
function is_within_root (line 348) | fn is_within_root(root_dir: &Path, resolved_path: &Path) -> bool {
function build_response (line 358) | fn build_response(status: u16, content_type: &str, body: &[u8], extra_he...
function build_404_response (line 390) | fn build_404_response(is_markdown: bool) -> Vec<u8> {
function content_type_for_extension (line 403) | fn content_type_for_extension(ext: &str) -> &'static str {
function render_markdown_to_html (line 446) | fn render_markdown_to_html(markdown: &str, request_path: &str) -> String {
function clean_markdown_artifacts (line 558) | fn clean_markdown_artifacts(markdown: &str) -> String {
function looks_like_code (line 842) | fn looks_like_code(line: &str) -> bool {
function style_callout_blocks (line 892) | fn style_callout_blocks(html: &str) -> String {
function extract_title (line 965) | fn extract_title(markdown: &str) -> String {
function add_heading_ids (line 993) | fn add_heading_ids(html: &str) -> String {
function detect_heading_level (line 1070) | fn detect_heading_level(line: &str) -> Option<u8> {
function collapse_link_blocks (line 1084) | fn collapse_link_blocks(html: &str) -> String {
function is_link_only_paragraph (line 1230) | fn is_link_only_paragraph(line: &str) -> bool {
function add_accordion_link_counts (line 1251) | fn add_accordion_link_counts(html: &str) -> String {
function strip_html_tags (line 1330) | fn strip_html_tags(html: &str) -> String {
function slugify (line 1345) | fn slugify(text: &str) -> String {
function build_breadcrumb (line 1365) | fn build_breadcrumb(request_path: &str) -> String {
function title_case_segment (line 1396) | fn title_case_segment(segment: &str) -> String {
function html_escape (line 1414) | fn html_escape(s: &str) -> String {
function directory_listing (line 1423) | fn directory_listing(dir_path: &Path, url_path: &str, is_markdown: bool)...
constant MARKDOWN_CSS (line 1535) | const MARKDOWN_CSS: &str = r##"
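`add_heading_ids` together with `slugify` implies that the markdown server derives heading anchors by slugifying heading text. A hedged sketch of such a slugifier; the real implementation in src/server.rs may normalize differently (e.g. around non-ASCII):

```rust
// Hypothetical slugify in the spirit of src/server.rs: lowercase ASCII
// alphanumerics kept, every other run collapsed to a single hyphen.
fn slugify(text: &str) -> String {
    let mut slug = String::new();
    let mut last_was_dash = true; // suppress a leading dash
    for c in text.chars() {
        if c.is_ascii_alphanumeric() {
            slug.push(c.to_ascii_lowercase());
            last_was_dash = false;
        } else if !last_was_dash {
            slug.push('-');
            last_was_dash = true;
        }
    }
    slug.trim_end_matches('-').to_string()
}

fn main() {
    assert_eq!(slugify("JSON Output Format"), "json-output-format");
    assert_eq!(slugify("  Quality & CI gates! "), "quality-ci-gates");
    println!("ok");
}
```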
FILE: src/types.rs
type DeviceType (line 15) | pub enum DeviceType {
method from_text (line 22) | pub fn from_text(text: &str) -> Result<Self, CrawlerError> {
method available_text_types (line 35) | pub fn available_text_types() -> Vec<&'static str> {
method as_str (line 39) | pub fn as_str(&self) -> &'static str {
method fmt (line 49) | fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
type AssetType (line 60) | pub enum AssetType {
method from_text (line 69) | pub fn from_text(text: &str) -> Result<Self, CrawlerError> {
method available_text_types (line 84) | pub fn available_text_types() -> Vec<&'static str> {
method as_str (line 88) | pub fn as_str(&self) -> &'static str {
method fmt (line 100) | fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
type ContentTypeId (line 111) | pub enum ContentTypeId {
method from_i32 (line 127) | pub fn from_i32(value: i32) -> Option<Self> {
method name (line 145) | pub fn name(&self) -> &'static str {
method fmt (line 164) | fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
type SkippedReason (line 175) | pub enum SkippedReason {
method from_i32 (line 182) | pub fn from_i32(value: i32) -> Option<Self> {
method description (line 191) | pub fn description(&self) -> &'static str {
method fmt (line 201) | fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
type OutputType (line 212) | pub enum OutputType {
method from_text (line 219) | pub fn from_text(text: &str) -> Result<Self, CrawlerError> {
method available_text_types (line 231) | pub fn available_text_types() -> Vec<&'static str> {
method as_str (line 235) | pub fn as_str(&self) -> &'static str {
method fmt (line 245) | fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
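Every enum in src/types.rs above repeats the same shape: `from_text`, `available_text_types`, `as_str`, and a `Display` impl. A minimal sketch of that pattern, using `DeviceType` as the example — the variant names (`Desktop`/`Mobile`/`Tablet`) and the `String` error type are assumptions for illustration (the real code returns `CrawlerError`).

```rust
// Hypothetical sketch of the text-keyed enum pattern from src/types.rs.
// Variant names and the error type are assumptions, not the actual code.
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DeviceType {
    Desktop,
    Mobile,
    Tablet,
}

impl DeviceType {
    // Parse a CLI text value into the enum, rejecting unknown inputs.
    pub fn from_text(text: &str) -> Result<Self, String> {
        match text.to_ascii_lowercase().as_str() {
            "desktop" => Ok(Self::Desktop),
            "mobile" => Ok(Self::Mobile),
            "tablet" => Ok(Self::Tablet),
            other => Err(format!("unknown device type '{other}'")),
        }
    }

    // The accepted text values, e.g. for help output and error messages.
    pub fn available_text_types() -> Vec<&'static str> {
        vec!["desktop", "mobile", "tablet"]
    }

    pub fn as_str(&self) -> &'static str {
        match self {
            Self::Desktop => "desktop",
            Self::Mobile => "mobile",
            Self::Tablet => "tablet",
        }
    }
}

impl fmt::Display for DeviceType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}
```

The same skeleton plausibly covers `AssetType` and `OutputType`, which expose identical method sets in the listing.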
FILE: src/utils.rs
function is_regex_pattern (line 14) | pub fn is_regex_pattern(s: &str) -> bool {
function extract_pcre_regex_pattern (line 35) | pub fn extract_pcre_regex_pattern(s: &str) -> String {
constant IMG_SRC_TRANSPARENT_1X1_GIF (line 55) | pub const IMG_SRC_TRANSPARENT_1X1_GIF: &str =
function disable_colors (line 58) | pub fn disable_colors() {
function force_enabled_colors (line 64) | pub fn force_enabled_colors() {
function set_forced_console_width (line 70) | pub fn set_forced_console_width(width: usize) {
function get_formatted_size (line 76) | pub fn get_formatted_size(bytes: i64, precision: usize) -> String {
function get_formatted_duration (line 93) | pub fn get_formatted_duration(duration: f64) -> String {
function get_formatted_age (line 106) | pub fn get_formatted_age(age: i64) -> String {
function strip_trailing_dot_zero (line 128) | fn strip_trailing_dot_zero(s: &str) -> String {
function get_formatted_cache_lifetime (line 132) | pub fn get_formatted_cache_lifetime(seconds: i64) -> String {
function get_color_text (line 148) | pub fn get_color_text(text: &str, color: &str, set_background: bool) -> ...
function atty_is_tty (line 203) | fn atty_is_tty() -> bool {
function isatty (line 209) | fn isatty(fd: i32) -> i32;
function libc_isatty (line 212) | unsafe fn libc_isatty(fd: i32) -> i32 {
function convert_bash_colors_in_text_to_html (line 216) | pub fn convert_bash_colors_in_text_to_html(text: &str) -> String {
function get_html_color_by_bash_color (line 254) | fn get_html_color_by_bash_color(color: &str) -> &'static str {
function truncate_in_two_thirds (line 268) | pub fn truncate_in_two_thirds(
function truncate_url (line 305) | pub fn truncate_url(
function get_progress_bar (line 333) | pub fn get_progress_bar(done: usize, total: usize, segments: usize) -> S...
function remove_ansi_colors (line 346) | pub fn remove_ansi_colors(text: &str) -> String {
function get_http_client_code_with_error_description (line 352) | pub fn get_http_client_code_with_error_description(http_code: i32, short...
function get_console_width (line 393) | pub fn get_console_width() -> usize {
function get_url_without_scheme_and_host (line 406) | pub fn get_url_without_scheme_and_host(
function get_safe_command (line 437) | pub fn get_safe_command(command: &str) -> String {
function get_colored_request_time (line 454) | pub fn get_colored_request_time(request_time: f64, str_pad_to: usize) ->...
function get_colored_status_code (line 469) | pub fn get_colored_status_code(status_code: i32, str_pad_to: usize) -> S...
function get_colored_severity (line 491) | pub fn get_colored_severity(severity: &str) -> String {
function get_colored_criticals (line 500) | pub fn get_colored_criticals(criticals: i32, str_pad_to: usize) -> String {
function get_colored_warnings (line 508) | pub fn get_colored_warnings(warnings: i32, str_pad_to: usize) -> String {
function get_colored_notices (line 516) | pub fn get_colored_notices(notices: i32, str_pad_to: usize) -> String {
function get_content_type_name_by_id (line 524) | pub fn get_content_type_name_by_id(content_type_id: ContentTypeId) -> &'...
function is_href_for_requestable_resource (line 528) | pub fn is_href_for_requestable_resource(href: &str) -> bool {
function get_absolute_url_by_base_url (line 556) | pub fn get_absolute_url_by_base_url(base_url: &str, target_url: &str) ->...
function get_absolute_path (line 568) | pub fn get_absolute_path(path: &str) -> String {
function get_output_formatted_path (line 580) | pub fn get_output_formatted_path(path: &str) -> String {
function mb_str_pad (line 591) | pub fn mb_str_pad(input: &str, pad_length: usize, pad_char: char) -> Str...
function strip_javascript (line 605) | pub fn strip_javascript(html: &str) -> String {
function strip_styles (line 631) | pub fn strip_styles(html: &str) -> String {
function strip_fonts (line 649) | pub fn strip_fonts(html_or_css: &str) -> String {
function strip_images (line 671) | pub fn strip_images(html_or_css: &str, placeholder_image: Option<&str>) ...
function get_colored_cache_lifetime (line 708) | pub fn get_colored_cache_lifetime(cache_lifetime: i64, str_pad_to: usize...
function is_asset_by_content_type (line 730) | pub fn is_asset_by_content_type(content_type: &str) -> bool {
function add_class_to_html_images (line 749) | pub fn add_class_to_html_images(html: &str, class_name: &str) -> String {
function get_flat_response_headers (line 767) | pub fn get_flat_response_headers(
function get_peak_memory_usage (line 775) | pub fn get_peak_memory_usage() -> i64 {
function formatted_size_zero (line 799) | fn formatted_size_zero() {
function formatted_size_bytes (line 804) | fn formatted_size_bytes() {
function formatted_size_kilobytes (line 809) | fn formatted_size_kilobytes() {
function formatted_size_megabytes (line 814) | fn formatted_size_megabytes() {
function formatted_size_gigabytes (line 819) | fn formatted_size_gigabytes() {
function formatted_duration_milliseconds (line 826) | fn formatted_duration_milliseconds() {
function formatted_duration_half_second (line 831) | fn formatted_duration_half_second() {
function formatted_duration_seconds (line 836) | fn formatted_duration_seconds() {
function formatted_age_seconds (line 843) | fn formatted_age_seconds() {
function formatted_age_minutes (line 849) | fn formatted_age_minutes() {
function formatted_age_hours (line 854) | fn formatted_age_hours() {
function formatted_age_days (line 859) | fn formatted_age_days() {
function cache_lifetime_seconds (line 866) | fn cache_lifetime_seconds() {
function cache_lifetime_minutes (line 871) | fn cache_lifetime_minutes() {
function cache_lifetime_hours (line 876) | fn cache_lifetime_hours() {
function cache_lifetime_days (line 882) | fn cache_lifetime_days() {
function cache_lifetime_months (line 888) | fn cache_lifetime_months() {
function regex_pattern_slash_delimited (line 896) | fn regex_pattern_slash_delimited() {
function regex_pattern_hash_delimited (line 901) | fn regex_pattern_hash_delimited() {
function regex_pattern_plain_text (line 906) | fn regex_pattern_plain_text() {
function regex_pattern_empty (line 911) | fn regex_pattern_empty() {
function regex_pattern_single_slash (line 916) | fn regex_pattern_single_slash() {
function extract_pcre_with_case_insensitive (line 923) | fn extract_pcre_with_case_insensitive() {
function extract_pcre_hash_delimiter (line 928) | fn extract_pcre_hash_delimiter() {
function extract_pcre_tilde_with_flags (line 933) | fn extract_pcre_tilde_with_flags() {
function strip_javascript_removes_script_tags (line 942) | fn strip_javascript_removes_script_tags() {
function strip_styles_removes_style_tags (line 950) | fn strip_styles_removes_style_tags() {
function str_pad_shorter_input (line 958) | fn str_pad_shorter_input() {
function str_pad_longer_input (line 963) | fn str_pad_longer_input() {
function requestable_http_url (line 970) | fn requestable_http_url() {
function requestable_javascript_void (line 975) | fn requestable_javascript_void() {
function requestable_mailto (line 980) | fn requestable_mailto() {
function requestable_data_uri (line 985) | fn requestable_data_uri() {
function absolute_url_from_root_relative (line 992) | fn absolute_url_from_root_relative() {
function absolute_url_from_relative (line 1000) | fn absolute_url_from_relative() {
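The `mb_str_pad` signature above, together with the `str_pad_shorter_input`/`str_pad_longer_input` tests, implies PHP-style padding measured in characters rather than bytes (the `mb_` prefix). A sketch under those assumptions — padding side and overflow behavior (returning longer input unchanged) are guesses, not confirmed by the listing.

```rust
// Hypothetical sketch of mb_str_pad from src/utils.rs: right-pads by
// Unicode character count (not byte length), so multibyte UTF-8 strings
// line up in fixed-width console tables.
pub fn mb_str_pad(input: &str, pad_length: usize, pad_char: char) -> String {
    let current = input.chars().count();
    if current >= pad_length {
        // Assumed behavior: inputs already at/over the target are unchanged.
        return input.to_string();
    }
    let mut out = String::with_capacity(input.len() + (pad_length - current));
    out.push_str(input);
    for _ in current..pad_length {
        out.push(pad_char);
    }
    out
}
```

Counting `chars()` instead of `len()` is the point: `"příliš".len()` is larger than its 6 visible characters, so byte-based padding would misalign columns.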
FILE: src/version.rs
constant CODE (line 4) | pub const CODE: &str = "2.3.0.20260330";
FILE: src/wizard/form.rs
type FormSetting (line 17) | pub struct FormSetting {
method new (line 24) | fn new(label: &'static str, options: Vec<&'static str>, default: &str)...
method value (line 33) | pub fn value(&self) -> &str {
method cycle_right (line 37) | fn cycle_right(&mut self) {
method cycle_left (line 41) | fn cycle_left(&mut self) {
constant S_TIMEOUT (line 52) | const S_TIMEOUT: usize = 0;
constant S_WORKERS (line 53) | const S_WORKERS: usize = 1;
constant S_MAX_RPS (line 54) | const S_MAX_RPS: usize = 2;
constant S_MAX_URLS (line 55) | const S_MAX_URLS: usize = 3;
constant S_DEVICE (line 56) | const S_DEVICE: usize = 4;
constant S_JAVASCRIPT (line 57) | const S_JAVASCRIPT: usize = 5;
constant S_CSS (line 58) | const S_CSS: usize = 6;
constant S_FONTS (line 59) | const S_FONTS: usize = 7;
constant S_IMAGES (line 60) | const S_IMAGES: usize = 8;
constant S_FILES (line 61) | const S_FILES: usize = 9;
constant S_SINGLE_PAGE (line 62) | const S_SINGLE_PAGE: usize = 10;
constant S_OFFLINE (line 63) | const S_OFFLINE: usize = 11;
constant S_MARKDOWN (line 64) | const S_MARKDOWN: usize = 12;
constant S_SITEMAP (line 65) | const S_SITEMAP: usize = 13;
constant S_CACHE (line 66) | const S_CACHE: usize = 14;
constant S_STORAGE (line 67) | const S_STORAGE: usize = 15;
constant S_ROBOTS (line 68) | const S_ROBOTS: usize = 16;
function build_form_settings (line 72) | pub fn build_form_settings(state: &WizardState) -> Vec<FormSetting> {
function format_static_timeout (line 182) | fn format_static_timeout(val: u32) -> &'static str {
function format_static_workers (line 194) | fn format_static_workers(val: u32) -> &'static str {
function format_static_rps (line 207) | fn format_static_rps(val: u32) -> &'static str {
function format_static_max_urls (line 219) | fn format_static_max_urls(val: u32) -> &'static str {
function apply_form_to_state (line 234) | pub fn apply_form_to_state(settings: &[FormSetting], state: &mut WizardS...
function parse_timeout (line 276) | fn parse_timeout(val: &str) -> u32 {
function parse_rps (line 280) | fn parse_rps(val: &str) -> u32 {
function parse_max_urls (line 288) | fn parse_max_urls(val: &str) -> u32 {
function run_form (line 299) | pub fn run_form(settings: &mut [FormSetting], preset_name: &str) -> Resu...
function form_event_loop (line 331) | fn form_event_loop(
function render_form (line 389) | fn render_form(
function write_line (line 453) | fn write_line(stdout: &mut io::Stdout, text: &str) {
function parse_timeout_values (line 462) | fn parse_timeout_values() {
function parse_rps_values (line 469) | fn parse_rps_values() {
function parse_max_urls_values (line 476) | fn parse_max_urls_values() {
function format_static_timeout_snaps (line 482) | fn format_static_timeout_snaps() {
function format_static_workers_snaps (line 491) | fn format_static_workers_snaps() {
function cycle_wraps_around (line 499) | fn cycle_wraps_around() {
function apply_form_roundtrip (line 509) | fn apply_form_roundtrip() {
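`FormSetting` above carries `cycle_right`/`cycle_left` and a `cycle_wraps_around` test, which pins down the behavior: stepping past either end of the options list wraps to the other end. A sketch of just that mechanism — field names are assumptions, and the methods are made `pub` here for demonstration (the listing shows them private).

```rust
// Hypothetical sketch of FormSetting's option cycling from src/wizard/form.rs.
// Modular arithmetic gives wrap-around in both directions without branching.
pub struct FormSetting {
    pub options: Vec<&'static str>,
    pub selected: usize,
}

impl FormSetting {
    pub fn value(&self) -> &str {
        self.options[self.selected]
    }

    pub fn cycle_right(&mut self) {
        self.selected = (self.selected + 1) % self.options.len();
    }

    pub fn cycle_left(&mut self) {
        // Adding len() before subtracting avoids usize underflow at index 0.
        self.selected = (self.selected + self.options.len() - 1) % self.options.len();
    }
}
```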
FILE: src/wizard/mod.rs
function is_interactive_tty (line 19) | pub fn is_interactive_tty() -> bool {
function offer_serve_after_export (line 26) | pub fn offer_serve_after_export(crawl_argv: &[String]) -> Option<(String...
function press_enter_to_exit (line 48) | pub fn press_enter_to_exit() {
function run_wizard (line 57) | pub fn run_wizard() -> Result<Vec<String>, WizardError> {
type PresetChoice (line 132) | enum PresetChoice {
constant SERVE_SEPARATOR (line 138) | const SERVE_SEPARATOR: &str = "──────────────────────────────────────";
function prompt_preset_or_serve (line 140) | fn prompt_preset_or_serve() -> Result<PresetChoice, WizardError> {
function prompt_serve_export (line 183) | fn prompt_serve_export(dirs: &[ExportDir], kind: &str) -> Result<PresetC...
type ExportDir (line 201) | struct ExportDir {
function find_export_dirs (line 208) | fn find_export_dirs(kind: &str) -> Vec<ExportDir> {
function resolve_export_paths (line 254) | fn resolve_export_paths(state: &mut WizardState) {
type WizardError (line 271) | pub enum WizardError {
method fmt (line 277) | fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
method from (line 286) | fn from(err: InquireError) -> Self {
function print_banner (line 296) | fn print_banner() {
function prompt_url (line 315) | fn prompt_url() -> Result<String, WizardError> {
function normalize_url_input (line 337) | fn normalize_url_input(input: &str) -> String {
function print_summary (line 348) | fn print_summary(state: &WizardState, argv: &[String]) {
function print_row (line 419) | fn print_row(label: &str, value: &str, label_width: usize) {
FILE: src/wizard/presets.rs
type Preset (line 7) | pub struct Preset {
method fmt (line 33) | fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
constant PRESETS (line 38) | pub const PRESETS: &[Preset] = &[
type WizardState (line 292) | pub struct WizardState {
method from_preset (line 321) | pub fn from_preset(preset: &Preset) -> Self {
method build_argv (line 353) | pub fn build_argv(&self) -> Vec<String> {
method content_summary (line 445) | pub fn content_summary(&self) -> String {
function resolve_export_path (line 468) | pub fn resolve_export_path(template: &str, url: &str) -> String {
function preset_count_is_10 (line 482) | fn preset_count_is_10() {
function last_preset_is_custom (line 487) | fn last_preset_is_custom() {
function build_argv_contains_url (line 492) | fn build_argv_contains_url() {
function build_argv_custom_is_minimal (line 501) | fn build_argv_custom_is_minimal() {
function build_argv_quick_audit (line 510) | fn build_argv_quick_audit() {
function build_argv_seo_disables_assets (line 519) | fn build_argv_seo_disables_assets() {
function build_argv_seo_has_extra_columns (line 533) | fn build_argv_seo_has_extra_columns() {
function build_argv_performance_test (line 541) | fn build_argv_performance_test() {
function build_argv_security_check (line 552) | fn build_argv_security_check() {
function build_argv_offline_clone (line 563) | fn build_argv_offline_clone() {
function build_argv_markdown_export (line 574) | fn build_argv_markdown_export() {
function build_argv_stress_test (line 585) | fn build_argv_stress_test() {
function build_argv_single_page (line 602) | fn build_argv_single_page() {
function build_argv_large_site (line 612) | fn build_argv_large_site() {
function content_summary_all_enabled (line 625) | fn content_summary_all_enabled() {
function content_summary_html_only (line 631) | fn content_summary_html_only() {
function description_lengths_within_range (line 637) | fn description_lengths_within_range() {
FILE: tests/common/mod.rs
function binary_path (line 8) | pub fn binary_path() -> PathBuf {
function run_crawler (line 24) | pub fn run_crawler(args: &[&str]) -> Output {
function run_crawler_json (line 32) | pub fn run_crawler_json(args: &[&str]) -> serde_json::Value {
type TempDir (line 46) | pub struct TempDir {
method new (line 51) | pub fn new(prefix: &str) -> Self {
method drop (line 62) | fn drop(&mut self) {
FILE: tests/integration_crawl.rs
constant GENTLE_FLAGS (line 22) | const GENTLE_FLAGS: [&str; 3] = ["--workers=2", "--max-reqs-per-sec=5", ...
function crawl_siteone_content_type_counts (line 30) | fn crawl_siteone_content_type_counts() {
function crawl_nonexistent_domain_exits_with_code_3 (line 105) | fn crawl_nonexistent_domain_exits_with_code_3() {
function crawl_nonexistent_domain_ci_exits_with_code_10 (line 137) | fn crawl_nonexistent_domain_ci_exits_with_code_10() {
function crawl_siteone_offline_export (line 170) | fn crawl_siteone_offline_export() {
function crawl_siteone_markdown_export (line 252) | fn crawl_siteone_markdown_export() {
function walkdir (line 333) | fn walkdir(dir: &Path, extension: &str) -> usize {
function crawl_siteone_single_page (line 354) | fn crawl_siteone_single_page() {
function version_flag_exits_with_code_2 (line 393) | fn version_flag_exits_with_code_2() {
function help_flag_exits_with_code_2 (line 401) | fn help_flag_exits_with_code_2() {
function invalid_option_exits_with_code_101 (line 416) | fn invalid_option_exits_with_code_101() {
function unknown_option_after_bool_flag_detected (line 432) | fn unknown_option_after_bool_flag_detected() {
function unknown_option_typo_without_value (line 450) | fn unknown_option_typo_without_value() {
function html_to_markdown_basic_conversion (line 470) | fn html_to_markdown_basic_conversion() {
function html_to_markdown_output_to_file (line 491) | fn html_to_markdown_output_to_file() {
function html_to_markdown_nonexistent_file (line 524) | fn html_to_markdown_nonexistent_file() {
function html_to_markdown_with_disable_images (line 536) | fn html_to_markdown_with_disable_images() {
function html_to_markdown_preserves_original_links (line 558) | fn html_to_markdown_preserves_original_links() {
function html_to_markdown_with_exclude_selector (line 588) | fn html_to_markdown_with_exclude_selector() {
function html_to_markdown_aria_hidden_excluded (line 609) | fn html_to_markdown_aria_hidden_excluded() {
function html_to_markdown_output_without_input_fails (line 633) | fn html_to_markdown_output_without_input_fails() {
function html_to_markdown_with_move_before_h1 (line 649) | fn html_to_markdown_with_move_before_h1() {
Condensed preview — 132 files, each showing path, character count, and a content snippet (full structured content: 2,243K chars).
[
{
"path": ".githooks/pre-commit",
"chars": 317,
"preview": "#!/bin/bash\n# Pre-commit hook: run cargo fmt, clippy, and tests before committing.\nset -e\n\necho \"=== Pre-commit: cargo f"
},
{
"path": ".github/workflows/ci.yml",
"chars": 860,
"preview": "name: CI\r\n\r\non:\r\n push:\r\n branches: [main]\r\n pull_request:\r\n branches: [main]\r\n\r\nenv:\r\n CARGO_TERM_COLOR: alway"
},
{
"path": ".github/workflows/publish.yml",
"chars": 17575,
"preview": "name: Publish to package managers\n\n# Triggers when a draft release is published (manually via GitHub UI)\non:\n release:\n"
},
{
"path": ".github/workflows/release.yml",
"chars": 18286,
"preview": "name: Release\n\n# Trigger: push a tag like v1.0.10\non:\n push:\n tags:\n - 'v*'\n # Manual trigger for building art"
},
{
"path": ".gitignore",
"chars": 69,
"preview": "/target\r\n/tmp/\r\n/dist/\r\n*.swp\r\n*.swo\r\n*~\r\n.idea/\r\n.vscode/\r\n*.cache\r\n"
},
{
"path": "CHANGELOG.md",
"chars": 93382,
"preview": "### Changelog\n\nAll notable changes to this project will be documented in this file. Dates are displayed in UTC.\n\n#### [v"
},
{
"path": "CLAUDE.md",
"chars": 11843,
"preview": "# CLAUDE.md\r\n\r\nThis file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.\r\n\r"
},
{
"path": "Cargo.toml",
"chars": 5030,
"preview": "[package]\nname = \"siteone-crawler\"\nversion = \"2.3.0\"\nedition = \"2024\"\nrust-version = \"1.94\"\nauthors = [\"Ján Regeš <jan.r"
},
{
"path": "LICENSE",
"chars": 1070,
"preview": "MIT License\n\nCopyright (c) 2023-2026 Ján Regeš\n\nPermission is hereby granted, free of charge, to any person obtaining a "
},
{
"path": "README.md",
"chars": 54641,
"preview": "# SiteOne Crawler\r\n\r\nSiteOne Crawler is a powerful and easy-to-use **website analyzer, cloner, and converter** designed "
},
{
"path": "docs/JSON-OUTPUT.md",
"chars": 39397,
"preview": "# SiteOne Crawler: JSON Output Documentation\n\n## Table of Contents\n\n* [1. Introduction](#1-introduction)\n* [2. Poten"
},
{
"path": "docs/OUTPUT-crawler.siteone.io.json",
"chars": 290140,
"preview": "{\n \"crawler\": {\n \"command\": \"./siteone-crawler --url=https://crawler.siteone.io/ --output=json --http-cache-dir=\",\n "
},
{
"path": "docs/OUTPUT-crawler.siteone.io.txt",
"chars": 81033,
"preview": "\n #### #### ##### \n #### #### ####### \n #### ### "
},
{
"path": "docs/TEXT-OUTPUT.md",
"chars": 43336,
"preview": "# SiteOne Crawler: Text Output Documentation\n\n## Table of Contents\n\n* [1. Introduction](#1-introduction)\n* [2. Gener"
},
{
"path": "rustfmt.toml",
"chars": 50,
"preview": "max_width = 120\r\nuse_field_init_shorthand = true\r\n"
},
{
"path": "src/analysis/accessibility_analyzer.rs",
"chars": 22086,
"preview": "// SiteOne Crawler - AccessibilityAnalyzer\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\nuse s"
},
{
"path": "src/analysis/analyzer.rs",
"chars": 1738,
"preview": "// SiteOne Crawler - Analyzer trait\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse crate::"
},
{
"path": "src/analysis/base_analyzer.rs",
"chars": 1253,
"preview": "// SiteOne Crawler - BaseAnalyzer\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\nuse std::collections::HashMap;\r\nuse std::t"
},
{
"path": "src/analysis/best_practice_analyzer.rs",
"chars": 50062,
"preview": "// SiteOne Crawler - BestPracticeAnalyzer\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\nuse st"
},
{
"path": "src/analysis/caching_analyzer.rs",
"chars": 13909,
"preview": "// SiteOne Crawler - CachingAnalyzer\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse crate:"
},
{
"path": "src/analysis/content_type_analyzer.rs",
"chars": 16951,
"preview": "// SiteOne Crawler - ContentTypeAnalyzer\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\nuse std::collections::HashMap;\r\n\r\nu"
},
{
"path": "src/analysis/dns_analyzer.rs",
"chars": 10444,
"preview": "// SiteOne Crawler - DnsAnalyzer\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse crate::ana"
},
{
"path": "src/analysis/external_links_analyzer.rs",
"chars": 5346,
"preview": "// SiteOne Crawler - ExternalLinksAnalyzer\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Presents external URLs discover"
},
{
"path": "src/analysis/fastest_analyzer.rs",
"chars": 4959,
"preview": "// SiteOne Crawler - FastestAnalyzer\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse crate:"
},
{
"path": "src/analysis/headers_analyzer.rs",
"chars": 10557,
"preview": "// SiteOne Crawler - HeadersAnalyzer\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse crate:"
},
{
"path": "src/analysis/manager.rs",
"chars": 6253,
"preview": "// SiteOne Crawler - Analysis Manager\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse crate"
},
{
"path": "src/analysis/mod.rs",
"chars": 629,
"preview": "pub mod analyzer;\r\npub mod base_analyzer;\r\npub mod manager;\r\npub mod result;\r\n\r\n// Simple analyzers\r\npub mod caching_ana"
},
{
"path": "src/analysis/page404_analyzer.rs",
"chars": 4479,
"preview": "// SiteOne Crawler - Page404Analyzer\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse crate:"
},
{
"path": "src/analysis/redirects_analyzer.rs",
"chars": 5215,
"preview": "// SiteOne Crawler - RedirectsAnalyzer\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse crat"
},
{
"path": "src/analysis/result/analyzer_stats.rs",
"chars": 2954,
"preview": "// SiteOne Crawler - AnalyzerStats\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\nuse std::collections::HashMap;\r\n\r\n#[deriv"
},
{
"path": "src/analysis/result/dns_analysis_result.rs",
"chars": 2391,
"preview": "// SiteOne Crawler - DnsAnalysisResult\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\n#[derive(Debug, Clone)]\r\npub struct D"
},
{
"path": "src/analysis/result/header_stats.rs",
"chars": 5480,
"preview": "// SiteOne Crawler - HeaderStats\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse crate::uti"
},
{
"path": "src/analysis/result/heading_tree_item.rs",
"chars": 5801,
"preview": "// SiteOne Crawler - HeadingTreeItem\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nfn html_escape(s: &str) -> String {\n s."
},
{
"path": "src/analysis/result/mod.rs",
"chars": 227,
"preview": "pub mod analyzer_stats;\r\npub mod dns_analysis_result;\r\npub mod header_stats;\r\npub mod heading_tree_item;\r\npub mod securi"
},
{
"path": "src/analysis/result/security_checked_header.rs",
"chars": 2312,
"preview": "// SiteOne Crawler - SecurityCheckedHeader\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\npub "
},
{
"path": "src/analysis/result/security_result.rs",
"chars": 946,
"preview": "// SiteOne Crawler - SecurityResult\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse indexmap::IndexMap;\n\nuse super::securit"
},
{
"path": "src/analysis/result/seo_opengraph_result.rs",
"chars": 3398,
"preview": "// SiteOne Crawler - SeoAndOpenGraphResult\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\nuse super::heading_tree_item::Hea"
},
{
"path": "src/analysis/result/url_analysis_result.rs",
"chars": 8040,
"preview": "// SiteOne Crawler - UrlAnalysisResult\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse crat"
},
{
"path": "src/analysis/security_analyzer.rs",
"chars": 35991,
"preview": "// SiteOne Crawler - SecurityAnalyzer\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\nuse std::t"
},
{
"path": "src/analysis/seo_opengraph_analyzer.rs",
"chars": 27322,
"preview": "// SiteOne Crawler - SeoAndOpenGraphAnalyzer\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\nuse"
},
{
"path": "src/analysis/skipped_urls_analyzer.rs",
"chars": 10352,
"preview": "// SiteOne Crawler - SkippedUrlsAnalyzer\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse cr"
},
{
"path": "src/analysis/slowest_analyzer.rs",
"chars": 6251,
"preview": "// SiteOne Crawler - SlowestAnalyzer\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse crate:"
},
{
"path": "src/analysis/source_domains_analyzer.rs",
"chars": 7249,
"preview": "// SiteOne Crawler - SourceDomainsAnalyzer\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse "
},
{
"path": "src/analysis/ssl_tls_analyzer.rs",
"chars": 17584,
"preview": "// SiteOne Crawler - SslTlsAnalyzer\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\nuse std::net"
},
{
"path": "src/components/mod.rs",
"chars": 69,
"preview": "pub mod summary;\r\npub mod super_table;\r\npub mod super_table_column;\r\n"
},
{
"path": "src/components/summary/item.rs",
"chars": 1795,
"preview": "// SiteOne Crawler - Summary Item\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse serde::{Deserialize, Serialize};\n\nuse cra"
},
{
"path": "src/components/summary/item_status.rs",
"chars": 1664,
"preview": "// SiteOne Crawler - Summary ItemStatus\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\nuse serde::{Deserialize, Serialize};"
},
{
"path": "src/components/summary/mod.rs",
"chars": 91,
"preview": "pub mod item;\r\npub mod item_status;\r\n#[allow(clippy::module_inception)]\r\npub mod summary;\r\n"
},
{
"path": "src/components/summary/summary.rs",
"chars": 1552,
"preview": "// SiteOne Crawler - Summary\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse serde::{Deserialize, Serialize};\n\nuse crate::c"
},
{
"path": "src/components/super_table.rs",
"chars": 24123,
"preview": "// SiteOne Crawler - SuperTable\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\nuse std::sync::R"
},
{
"path": "src/components/super_table_column.rs",
"chars": 3692,
"preview": "// SiteOne Crawler - SuperTableColumn\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse serde::Serialize;\nuse std::collection"
},
{
"path": "src/content_processor/astro_processor.rs",
"chars": 11545,
"preview": "// SiteOne Crawler - AstroProcessor\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Handles Astro specific patterns - extr"
},
{
"path": "src/content_processor/base_processor.rs",
"chars": 9436,
"preview": "// SiteOne Crawler - BaseProcessor shared utilities\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Provides shared utilit"
},
{
"path": "src/content_processor/content_processor.rs",
"chars": 2284,
"preview": "// SiteOne Crawler - ContentProcessor trait\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse crate::engine::found_urls::Foun"
},
{
"path": "src/content_processor/css_processor.rs",
"chars": 5954,
"preview": "// SiteOne Crawler - CssProcessor\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Extracts URLs from CSS url() and @import"
},
{
"path": "src/content_processor/html_processor.rs",
"chars": 47500,
"preview": "// SiteOne Crawler - HtmlProcessor\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Extracts URLs from HTML content and app"
},
{
"path": "src/content_processor/javascript_processor.rs",
"chars": 9156,
"preview": "// SiteOne Crawler - JavaScriptProcessor\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Extracts URLs from JS import/from"
},
{
"path": "src/content_processor/manager.rs",
"chars": 4865,
"preview": "// SiteOne Crawler - ContentProcessorManager\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Holds all registered processo"
},
{
"path": "src/content_processor/mod.rs",
"chars": 291,
"preview": "pub mod astro_processor;\r\npub mod base_processor;\r\n#[allow(clippy::module_inception)]\r\npub mod content_processor;\r\npub m"
},
{
"path": "src/content_processor/nextjs_processor.rs",
"chars": 9801,
"preview": "// SiteOne Crawler - NextJsProcessor\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Handles Next.js specific URL extracti"
},
{
"path": "src/content_processor/svelte_processor.rs",
"chars": 2979,
"preview": "// SiteOne Crawler - SvelteProcessor\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Handles SvelteKit specific patterns.\n"
},
{
"path": "src/content_processor/xml_processor.rs",
"chars": 11323,
"preview": "// SiteOne Crawler - XmlProcessor\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Extracts URLs from sitemap.xml and sitem"
},
{
"path": "src/debugger.rs",
"chars": 3948,
"preview": "// SiteOne Crawler - Debugger\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::fs::OpenOptions;\nuse std::io::Write;\nuse"
},
{
"path": "src/engine/crawler.rs",
"chars": 79596,
"preview": "// SiteOne Crawler - Core Crawler Engine\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Main crawling engine with concurr"
},
{
"path": "src/engine/found_url.rs",
"chars": 6283,
"preview": "// SiteOne Crawler - FoundUrl\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse once_cell::sync::Lazy;\nuse regex::Regex;\n\nuse"
},
{
"path": "src/engine/found_urls.rs",
"chars": 3403,
"preview": "// SiteOne Crawler - FoundUrls collection\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse m"
},
{
"path": "src/engine/http_client.rs",
"chars": 17057,
"preview": "// SiteOne Crawler - HttpClient\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\nuse std::path::P"
},
{
"path": "src/engine/http_response.rs",
"chars": 4976,
"preview": "// SiteOne Crawler - HttpResponse\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse crate::ut"
},
{
"path": "src/engine/initiator.rs",
"chars": 6688,
"preview": "// SiteOne Crawler - Initiator\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Parses CLI arguments, validates options, cr"
},
{
"path": "src/engine/manager.rs",
"chars": 28861,
"preview": "// SiteOne Crawler - Manager\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Orchestrates the crawler: initializes all com"
},
{
"path": "src/engine/mod.rs",
"chars": 269,
"preview": "// Engine module - core crawling engine\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\npub mod crawler;\r\npub mod found_url;"
},
{
"path": "src/engine/parsed_url.rs",
"chars": 20275,
"preview": "// SiteOne Crawler - ParsedUrl\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\nuse std::path::Pa"
},
{
"path": "src/engine/robots_txt.rs",
"chars": 10404,
"preview": "// SiteOne Crawler - robots.txt parser\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse once_cell::sync::Lazy;\nuse regex::Re"
},
{
"path": "src/error.rs",
"chars": 1101,
"preview": "// SiteOne Crawler - Error types\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\nuse thiserror::Error;\r\n\r\n#[derive(Error, De"
},
{
"path": "src/export/base_exporter.rs",
"chars": 2136,
"preview": "// SiteOne Crawler - BaseExporter (shared helpers for all exporters)\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n\nuse std"
},
{
"path": "src/export/exporter.rs",
"chars": 820,
"preview": "// SiteOne Crawler - Exporter trait\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n\nuse crate::error::CrawlerResult;\nuse cra"
},
{
"path": "src/export/file_exporter.rs",
"chars": 7127,
"preview": "// SiteOne Crawler - FileExporter\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Saves crawl results to HTML, JSON, and/o"
},
{
"path": "src/export/html_report/badge.rs",
"chars": 1179,
"preview": "// SiteOne Crawler - Badge for HTML Report\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\n/// Badge colors used in HTML rep"
},
{
"path": "src/export/html_report/mod.rs",
"chars": 131,
"preview": "// SiteOne Crawler - HTML Report module\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\npub mod badge;\r\npub mod report;\r\npub"
},
{
"path": "src/export/html_report/report.rs",
"chars": 89392,
"preview": "// SiteOne Crawler - HTML Report Generator\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse "
},
{
"path": "src/export/html_report/tab.rs",
"chars": 1801,
"preview": "// SiteOne Crawler - Tab for HTML Report\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\nuse regex::Regex;\r\n\r\nuse super::bad"
},
{
"path": "src/export/html_report/template.html",
"chars": 28966,
"preview": "<!DOCTYPE html>\r\n<html lang=\"en\">\r\n<head>\r\n <meta charset=\"UTF-8\">\r\n <meta name=\"viewport\" content=\"width=device-w"
},
{
"path": "src/export/mailer_exporter.rs",
"chars": 10955,
"preview": "// SiteOne Crawler - MailerExporter\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Sends crawl report via SMTP email usin"
},
{
"path": "src/export/markdown_exporter.rs",
"chars": 55306,
"preview": "// SiteOne Crawler - MarkdownExporter\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Converts crawled HTML pages to Markd"
},
{
"path": "src/export/mod.rs",
"chars": 238,
"preview": "pub mod exporter;\npub mod html_report;\npub mod utils;\n\npub mod base_exporter;\npub mod file_exporter;\npub mod mailer_expo"
},
{
"path": "src/export/offline_website_exporter.rs",
"chars": 20762,
"preview": "// SiteOne Crawler - OfflineWebsiteExporter\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Saves all crawled pages to loc"
},
{
"path": "src/export/sitemap_exporter.rs",
"chars": 7612,
"preview": "// SiteOne Crawler - SitemapExporter\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Generates sitemap.xml and/or sitemap."
},
{
"path": "src/export/upload_exporter.rs",
"chars": 6627,
"preview": "// SiteOne Crawler - UploadExporter\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Uploads HTML report to crawler.siteone"
},
{
"path": "src/export/utils/html_to_markdown.rs",
"chars": 56640,
"preview": "// SiteOne Crawler - HtmlToMarkdownConverter\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Converts HTML to Markdown for"
},
{
"path": "src/export/utils/markdown_site_aggregator.rs",
"chars": 14773,
"preview": "// SiteOne Crawler - MarkdownSiteAggregator\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Combines multiple markdown fil"
},
{
"path": "src/export/utils/mod.rs",
"chars": 209,
"preview": "// SiteOne Crawler - Export utilities module\n// (c) Jan Reges <jan.reges@siteone.cz>\n\npub mod html_to_markdown;\npub mod "
},
{
"path": "src/export/utils/offline_url_converter.rs",
"chars": 49200,
"preview": "// SiteOne Crawler - OfflineUrlConverter\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Converts absolute URLs to relativ"
},
{
"path": "src/export/utils/target_domain_relation.rs",
"chars": 7879,
"preview": "// SiteOne Crawler - TargetDomainRelation\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse crate::engine::parsed_url::Parsed"
},
{
"path": "src/extra_column.rs",
"chars": 13456,
"preview": "// SiteOne Crawler - ExtraColumn\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse regex::Regex;\nuse scraper::{Html, Selector"
},
{
"path": "src/info.rs",
"chars": 1011,
"preview": "// SiteOne Crawler - Info\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\nuse serde::{Deserialize, Serialize};\r\n\r\n#[derive(D"
},
{
"path": "src/lib.rs",
"chars": 407,
"preview": "// SiteOne Crawler - Library root\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\npub mod analysis;\r\npub mod components;\r\npu"
},
{
"path": "src/main.rs",
"chars": 6613,
"preview": "// SiteOne Crawler - Main entry point\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse siteone_crawler::engine::initiator::I"
},
{
"path": "src/options/core_options.rs",
"chars": 120951,
"preview": "// SiteOne Crawler - Core options (all CLI options)\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n\nuse regex::Regex;\n\nuse c"
},
{
"path": "src/options/group.rs",
"chars": 889,
"preview": "// SiteOne Crawler - Option group for organizing options\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n//\r\n\r\nuse indexmap::I"
},
{
"path": "src/options/mod.rs",
"chars": 255,
"preview": "// SiteOne Crawler - Options module\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n//\r\n// CLI option definitions and parsing\r"
},
{
"path": "src/options/option.rs",
"chars": 21364,
"preview": "// SiteOne Crawler - Option definition and value parsing\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n\nuse std::sync::Mute"
},
{
"path": "src/options/option_type.rs",
"chars": 1229,
"preview": "// SiteOne Crawler - Option type definitions\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n//\r\n\r\nuse std::fmt;\r\n\r\n#[derive(D"
},
{
"path": "src/options/options.rs",
"chars": 1406,
"preview": "// SiteOne Crawler - Options registry\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n//\r\n\r\nuse indexmap::IndexMap;\r\n\r\nuse sup"
},
{
"path": "src/output/json_output.rs",
"chars": 16812,
"preview": "// SiteOne Crawler - JsonOutput (JSON output)\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n\nuse std::collections::HashMap;"
},
{
"path": "src/output/mod.rs",
"chars": 136,
"preview": "pub mod json_output;\npub mod multi_output;\n#[allow(clippy::module_inception)]\npub mod output;\npub mod output_type;\npub m"
},
{
"path": "src/output/multi_output.rs",
"chars": 4750,
"preview": "// SiteOne Crawler - MultiOutput (delegates to multiple outputs)\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n\nuse std::co"
},
{
"path": "src/output/output.rs",
"chars": 5009,
"preview": "// SiteOne Crawler - Output trait\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\nuse std::collections::{BTreeMap, HashMap};"
},
{
"path": "src/output/output_type.rs",
"chars": 1275,
"preview": "// SiteOne Crawler - OutputType enum\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\nuse serde::{Deserialize, Serialize};\r\nu"
},
{
"path": "src/output/text_output.rs",
"chars": 30805,
"preview": "// SiteOne Crawler - TextOutput (console output)\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n\nuse std::collections::HashM"
},
{
"path": "src/result/basic_stats.rs",
"chars": 5335,
"preview": "// SiteOne Crawler - BasicStats\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::BTreeMap;\nuse std::time::"
},
{
"path": "src/result/manager_stats.rs",
"chars": 5955,
"preview": "// SiteOne Crawler - ManagerStats\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\nuse std::time:"
},
{
"path": "src/result/mod.rs",
"chars": 103,
"preview": "pub mod basic_stats;\r\npub mod manager_stats;\r\npub mod status;\r\npub mod storage;\r\npub mod visited_url;\r\n"
},
{
"path": "src/result/status.rs",
"chars": 16357,
"preview": "// SiteOne Crawler - Status (central crawl state)\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap"
},
{
"path": "src/result/storage/file_storage.rs",
"chars": 4419,
"preview": "// SiteOne Crawler - FileStorage\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::fs;\nuse std::io::Write;\nuse std::path"
},
{
"path": "src/result/storage/memory_storage.rs",
"chars": 2089,
"preview": "// SiteOne Crawler - MemoryStorage\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\nuse std::io::"
},
{
"path": "src/result/storage/mod.rs",
"chars": 125,
"preview": "pub mod file_storage;\r\npub mod memory_storage;\r\n#[allow(clippy::module_inception)]\r\npub mod storage;\r\npub mod storage_ty"
},
{
"path": "src/result/storage/storage.rs",
"chars": 406,
"preview": "// SiteOne Crawler - Storage trait\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\nuse crate::error::CrawlerResult;\r\n\r\npub t"
},
{
"path": "src/result/storage/storage_type.rs",
"chars": 1236,
"preview": "// SiteOne Crawler - StorageType\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\nuse serde::{Deserialize, Serialize};\r\n\r\nuse"
},
{
"path": "src/result/visited_url.rs",
"chars": 9493,
"preview": "// SiteOne Crawler - VisitedUrl\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::collections::HashMap;\n\nuse regex::Rege"
},
{
"path": "src/scoring/ci_gate.rs",
"chars": 20124,
"preview": "// SiteOne Crawler - CI/CD Quality Gate\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Evaluates crawler results against "
},
{
"path": "src/scoring/mod.rs",
"chars": 147,
"preview": "// SiteOne Crawler - Quality Scoring module\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\npub mod ci_gate;\r\npub mod qualit"
},
{
"path": "src/scoring/quality_score.rs",
"chars": 3364,
"preview": "// SiteOne Crawler - Quality Score data model\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\nuse serde::Serialize;\r\n\r\n#[der"
},
{
"path": "src/scoring/scorer.rs",
"chars": 18852,
"preview": "// SiteOne Crawler - Quality Scorer\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Computes quality scores (0.0-10.0) acr"
},
{
"path": "src/server.rs",
"chars": 63772,
"preview": "// SiteOne Crawler - Built-in HTTP server for serving exports\n// (c) Jan Reges <jan.reges@siteone.cz>\n//\n// Two modes:\n/"
},
{
"path": "src/types.rs",
"chars": 7521,
"preview": "// SiteOne Crawler - Type definitions\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\nuse serde::{Deserialize, Serialize};\r\n"
},
{
"path": "src/utils.rs",
"chars": 28816,
"preview": "// SiteOne Crawler - Utilities\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::sync::RwLock;\n\nuse regex::Regex;\n\nuse c"
},
{
"path": "src/version.rs",
"chars": 115,
"preview": "// SiteOne Crawler - Version\r\n// (c) Jan Reges <jan.reges@siteone.cz>\r\n\r\npub const CODE: &str = \"2.3.0.20260330\";\r\n"
},
{
"path": "src/wizard/form.rs",
"chars": 16299,
"preview": "// SiteOne Crawler - Interactive settings form with arrow-key cycling\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse cross"
},
{
"path": "src/wizard/mod.rs",
"chars": 14880,
"preview": "// SiteOne Crawler - Interactive wizard for no-args invocation\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nmod form;\nmod pr"
},
{
"path": "src/wizard/presets.rs",
"chars": 23412,
"preview": "// SiteOne Crawler - Wizard preset definitions and state\n// (c) Jan Reges <jan.reges@siteone.cz>\n\nuse std::fmt;\n\n/// A w"
},
{
"path": "tests/common/mod.rs",
"chars": 1898,
"preview": "// Shared helpers for integration tests\n\nuse std::path::PathBuf;\nuse std::process::{Command, Output};\n\n/// Get path to t"
},
{
"path": "tests/integration_crawl.rs",
"chars": 23191,
"preview": "// Integration tests: crawl crawler.siteone.io and verify output correctness.\n//\n// These tests require network access a"
}
]
About this extraction
This page contains the full source code of the janreges/siteone-crawler GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction includes 132 files (2.0 MB), approximately 539.6k tokens, and a symbol index with 1962 extracted functions, classes, methods, constants, and types. Use this with OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input. You can copy the full output to your clipboard or download it as a .txt file.
Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
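The file index above is a JSON array of objects with "path", "chars", and "preview" keys. As a minimal sketch of how a script or AI tool might consume such an index (the inline sample data here is illustrative, and any top-level wrapping around the array in the full file is an assumption):

```python
import json

# Hypothetical two-entry sample mirroring the index's shape
# ("path" / "chars" / "preview" keys, as seen in this document).
index = json.loads("""
[
  {"path": "src/main.rs", "chars": 6613,
   "preview": "// SiteOne Crawler - Main entry point"},
  {"path": "src/engine/crawler.rs", "chars": 79596,
   "preview": "// Main crawling engine"}
]
""")

# Rank files by character count to find the largest sources,
# e.g. to prioritize what to feed into a context-limited LLM.
largest = sorted(index, key=lambda entry: entry["chars"], reverse=True)
for entry in largest:
    print(f'{entry["path"]}: {entry["chars"]} chars')
```

Sorting by "chars" rather than loading files in directory order lets a tool with a limited context window budget its input deliberately.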