Repository: Y2Z/monolith
Branch: master
Commit: 8702e66fed5b
Files: 103
Total size: 338.1 KB
Directory structure:
gitextract_1w8c2ho6/
├── .actor/
│   ├── Dockerfile
│   ├── README.md
│   ├── actor.json
│   ├── bin/
│   │   └── actor.sh
│   ├── dataset_schema.json
│   └── input_schema.json
├── .dockerignore
├── .github/
│   ├── FUNDING.yml
│   └── workflows/
│       ├── build_gnu_linux.yml
│       ├── build_macos.yml
│       ├── build_windows.yml
│       ├── cd.yml
│       ├── ci-netbsd.yml
│       └── ci.yml
├── .gitignore
├── Cargo.toml
├── Dockerfile
├── LICENSE
├── Makefile
├── README.md
├── assets/
│   └── icon/
│       └── icon.blend
├── dist/
│   └── run-in-container.sh
├── monolith.nuspec
├── snap/
│   └── snapcraft.yaml
├── src/
│   ├── cache.rs
│   ├── cookies.rs
│   ├── core.rs
│   ├── css.rs
│   ├── gui.rs
│   ├── html.rs
│   ├── js.rs
│   ├── lib.rs
│   ├── main.rs
│   ├── session.rs
│   └── url.rs
└── tests/
    ├── _data_/
    │   ├── basic/
    │   │   ├── local-file.html
    │   │   ├── local-script.js
    │   │   └── local-style.css
    │   ├── css/
    │   │   ├── index.html
    │   │   └── style.css
    │   ├── import-css-via-data-url/
    │   │   ├── index.html
    │   │   └── style.css
    │   ├── integrity/
    │   │   ├── index.html
    │   │   ├── script.js
    │   │   └── style.css
    │   ├── noscript/
    │   │   ├── index.html
    │   │   ├── nested.html
    │   │   └── script.html
    │   ├── svg/
    │   │   ├── image.html
    │   │   ├── index.html
    │   │   └── svg.html
    │   └── unusual_encodings/
    │       ├── gb2312.html
    │       └── iso-8859-1.html
    ├── cli/
    │   ├── base_url.rs
    │   ├── basic.rs
    │   ├── data_url.rs
    │   ├── local_files.rs
    │   ├── mod.rs
    │   ├── noscript.rs
    │   └── unusual_encodings.rs
    ├── cookies/
    │   ├── cookie/
    │   │   ├── is_expired.rs
    │   │   ├── matches_url.rs
    │   │   └── mod.rs
    │   ├── mod.rs
    │   └── parse_cookie_file_contents.rs
    ├── core/
    │   ├── detect_media_type.rs
    │   ├── format_output_path.rs
    │   ├── mod.rs
    │   ├── options.rs
    │   └── parse_content_type.rs
    ├── css/
    │   ├── embed_css.rs
    │   ├── is_image_url_prop.rs
    │   └── mod.rs
    ├── html/
    │   ├── add_favicon.rs
    │   ├── check_integrity.rs
    │   ├── compose_csp.rs
    │   ├── create_metadata_tag.rs
    │   ├── embed_srcset.rs
    │   ├── get_base_url.rs
    │   ├── get_charset.rs
    │   ├── get_node_attr.rs
    │   ├── get_node_name.rs
    │   ├── has_favicon.rs
    │   ├── is_favicon.rs
    │   ├── mod.rs
    │   ├── parse_link_type.rs
    │   ├── parse_srcset.rs
    │   ├── serialize_document.rs
    │   ├── set_node_attr.rs
    │   └── walk.rs
    ├── js/
    │   ├── attr_is_event_handler.rs
    │   └── mod.rs
    ├── mod.rs
    ├── session/
    │   ├── mod.rs
    │   └── retrieve_asset.rs
    └── url/
        ├── clean_url.rs
        ├── create_data_url.rs
        ├── domain_is_within_domain.rs
        ├── get_referer_url.rs
        ├── is_url_and_has_protocol.rs
        ├── mod.rs
        ├── parse_data_url.rs
        └── resolve_url.rs
================================================
FILE CONTENTS
================================================
================================================
FILE: .actor/Dockerfile
================================================
FROM node:alpine
RUN apk --no-cache add curl bash git monolith jq
RUN npm -g install apify-cli
COPY .actor .actor
CMD ./.actor/bin/actor.sh
================================================
FILE: .actor/README.md
================================================
# Monolith Actor on Apify
This Actor wraps [Monolith](https://crates.io/crates/monolith) to crawl a web page URL and bundle the entire content in a single HTML file, without installing and running the tool locally.
## What are Actors?
[Actors](https://docs.apify.com/platform/actors?fpr=snshn) are serverless microservices running on the [Apify Platform](https://apify.com/?fpr=snshn). They are based on the [Actor SDK](https://docs.apify.com/sdk/js?fpr=snshn) and can be found in the [Apify Store](https://apify.com/store?fpr=snshn). Learn more about Actors in the [Apify Whitepaper](https://whitepaper.actor?fpr=snshn).
## Usage
### Apify Console
1. Go to the Apify Actor page
2. Click "Run"
3. In the input form, fill in **URL(s)** to crawl and bundle
4. The Actor will run and:
- save the bundled HTML files in the run's default key-value store (KVS)
- save a dataset record for each URL containing the original URL, the link to its KVS object, and the monolith process exit status
### Apify CLI
```bash
apify call snshn/monolith --input='{
"urls": ["https://news.ycombinator.com/"]
}'
```
### Using Apify API
```bash
curl --request POST \
--url "https://api.apify.com/v2/acts/snshn~monolith/run" \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer YOUR_API_TOKEN' \
--data '{
"urls": ["https://news.ycombinator.com/"]
}'
```
## Input Parameters
The Actor accepts JSON input with the following structure:
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `urls` | array | Yes | - | List of URLs to monolith |
| `urls[]` | string | Yes | - | URL to monolith |
### Example Input
```json
{
"urls": ["https://news.ycombinator.com/"]
}
```
## Output
The Actor saves each bundled HTML file to the run's default key-value store and pushes one record per URL to the dataset:
### Dataset Record
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `url` | string | Yes | The original start URL for the monolith process |
| `kvsUrl` | string | Yes | A link to the Apify key-value store object where the monolithic HTML is available for download |
| `status` | string | Yes | Exit status of the monolith process |
### Example Dataset Item (JSON)
```json
{
"url": "https://news.ycombinator.com/",
"kvsUrl": "https://api.apify.com/v2/key-value-stores/JRFLHRy9DOtdKGpdm/records/https___news.ycombinator.com_",
"status": "0"
}
```
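The `kvsUrl` in the example is derived from the input URL: the Actor's shell script replaces every character outside `[A-Za-z0-9._-]` with an underscore and appends the result to the key-value store's record path. The following Rust sketch reproduces that mapping for illustration only; `safe_key` and `kvs_record_url` are hypothetical names, and the per-character handling of non-ASCII input is an assumption (the script's `sed` behavior depends on the locale).

```rust
/// Hypothetical helper mirroring the `sed` expression in the Actor's script:
/// every character outside `[A-Za-z0-9._-]` becomes an underscore.
fn safe_key(url: &str) -> String {
    url.chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || matches!(c, '.' | '_' | '-') {
                c
            } else {
                '_'
            }
        })
        .collect()
}

/// Hypothetical helper: builds the record URL for a key in a given store,
/// following the URL shape shown in the example dataset item.
fn kvs_record_url(store_id: &str, key: &str) -> String {
    format!(
        "https://api.apify.com/v2/key-value-stores/{}/records/{}",
        store_id, key
    )
}
```

With the store ID from the example, `kvs_record_url("JRFLHRy9DOtdKGpdm", &safe_key("https://news.ycombinator.com/"))` yields the `kvsUrl` value shown in the dataset item.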
## Performance & Resources
- **Memory Requirements**:
- Minimum: 4168 MB RAM
- **Processing Time**:
- 30s per complex page like [bbc.co.uk](https://bbc.co.uk)
For more help, check the [Monolith Project documentation](https://github.com/Y2Z/monolith) or raise an issue in the [Actor page detail](https://apify.com/snshn/monolith?fpr=snshn) on Apify.
================================================
FILE: .actor/actor.json
================================================
{
"actorSpecification": 1,
"name": "monolith",
"version": "0.0",
"buildTag": "latest",
"environmentVariables": {},
"dockerFile": "./Dockerfile",
"dockerContext": "../",
"input": "./input_schema.json",
"storages": {
"dataset": "./dataset_schema.json"
}
}
================================================
FILE: .actor/bin/actor.sh
================================================
#!/bin/bash
#pwd
#find ./storage
apify actor:get-input > /dev/null
INPUT=`apify actor:get-input | jq -r .urls[] | xargs echo`
echo "INPUT: $INPUT"
for url in $INPUT; do
# support for local usage
# sanitize url to a safe *nix filename - replace non-alphanumeric characters
# https://stackoverflow.com/questions/9847288/is-it-possible-to-use-in-a-filename
# https://serverfault.com/questions/348482/how-to-remove-invalid-characters-from-filenames
safe_filename=`echo $url | sed -e 's/[^A-Za-z0-9._-]/_/g'`
echo "Monolith-ing $url to key $safe_filename"
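monolith "$url" | apify actor:set-value "$safe_filename" --contentType=text/html
# capture monolith's own exit status (first command of the pipeline) before another command overwrites it
result=${PIPESTATUS[0]}
kvs_url="https://api.apify.com/v2/key-value-stores/${APIFY_DEFAULT_KEY_VALUE_STORE_ID}/records/${safe_filename}"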
echo "Pushing result item to the datastore"
echo "{\"url\":\"${url}\",\"status\":\"${result}\", \"kvsUrl\":\"${kvs_url}\"}" | apify actor:push-data
done
exit 0
================================================
FILE: .actor/dataset_schema.json
================================================
{
"actorSpecification": 1,
"fields":{
"title": "Monolith actor dataset",
"description": "This is actor dataset schema",
"type": "object",
"schemaVersion": 1,
"properties": {
"kvsUrl": {
"title": "Object URL",
"type": "string",
"description": "A link to the Apify key-value store object where the monolithic html is available"
},
"status": {
"title": "Exit status",
"type": "string",
"description": "Exit status of the monolith process"
},
"url": {
"title": "URL",
"type": "string",
"description": "The original start URL for the monolith process"
}
},
"required": [
"kvsUrl",
"status",
"url"
]
},
"views": {
"overview": {
"title": "Overview",
"transformation": {
"fields": [
"url",
"kvsUrl",
"status"
]
},
"display": {
"component": "table",
"url": {
"label": "Page URL"
},
"kvsUrl": {
"label": "KVS URL"
},
"status": {
"label": "Status"
}
}
}
}
}
================================================
FILE: .actor/input_schema.json
================================================
{
"title": "Monolith actor input",
"description": "This is actor input schema",
"type": "object",
"schemaVersion": 1,
"properties": {
"urls": {
"title": "Urls",
"type": "array",
"description": "A list of urls of pages to bundle into single HTML document",
"editor": "stringList",
"prefill": ["http://www.google.com"]
}
},
"required": [
"urls"
]
}
================================================
FILE: .dockerignore
================================================
/target/
================================================
FILE: .github/FUNDING.yml
================================================
# These are supported funding model platforms
github: snshn
================================================
FILE: .github/workflows/build_gnu_linux.yml
================================================
name: GNU/Linux
on:
push:
branches: [ master ]
paths-ignore:
- 'assets/'
- 'dist/'
- 'snap/'
- 'Dockerfile'
- 'LICENSE'
- 'Makefile'
- 'monolith.nuspec'
- 'README.md'
jobs:
build:
strategy:
matrix:
os:
- ubuntu-latest
rust:
- stable
runs-on: ${{ matrix.os }}
steps:
- run: git config --global core.autocrlf false
- uses: actions/checkout@v2
- name: Build
run: cargo build --all --locked --verbose
================================================
FILE: .github/workflows/build_macos.yml
================================================
name: macOS
on:
push:
branches: [ master ]
paths-ignore:
- 'assets/'
- 'dist/'
- 'snap/'
- 'Dockerfile'
- 'LICENSE'
- 'Makefile'
- 'monolith.nuspec'
- 'README.md'
jobs:
build:
strategy:
matrix:
os:
- macos-latest
rust:
- stable
runs-on: ${{ matrix.os }}
steps:
- run: git config --global core.autocrlf false
- uses: actions/checkout@v2
- name: Build
run: cargo build --all --locked --verbose
================================================
FILE: .github/workflows/build_windows.yml
================================================
name: Windows
on:
push:
branches: [ master ]
paths-ignore:
- 'assets/'
- 'dist/'
- 'snap/'
- 'Dockerfile'
- 'LICENSE'
- 'Makefile'
- 'monolith.nuspec'
- 'README.md'
jobs:
build:
strategy:
matrix:
os:
- windows-latest
rust:
- stable
runs-on: ${{ matrix.os }}
steps:
- run: git config --global core.autocrlf false
- uses: actions/checkout@v2
- name: Build
run: cargo build --all --locked --verbose
================================================
FILE: .github/workflows/cd.yml
================================================
# CD GitHub Actions workflow for monolith
name: CD
on:
release:
types:
- created
jobs:
gnu_linux_aarch64:
runs-on: ubuntu-20.04
steps:
- name: Checkout the repository
uses: actions/checkout@v4
- name: Prepare cross-platform environment
run: |
sudo mkdir /cross-build
sudo touch /etc/apt/sources.list.d/arm64.list
echo "deb [arch=arm64] http://ports.ubuntu.com/ubuntu-ports/ focal main" | sudo tee -a /etc/apt/sources.list.d/arm64.list
sudo apt-get update
sudo apt-get install -y gcc-aarch64-linux-gnu libc6-arm64-cross libc6-dev-arm64-cross
sudo apt-get download libssl1.1:arm64 libssl-dev:arm64
sudo dpkg -x libssl1.1*.deb /cross-build
sudo dpkg -x libssl-dev*.deb /cross-build
rustup target add aarch64-unknown-linux-gnu
echo "C_INCLUDE_PATH=/cross-build/usr/include" >> $GITHUB_ENV
echo "OPENSSL_INCLUDE_DIR=/cross-build/usr/include/aarch64-linux-gnu" >> $GITHUB_ENV
echo "OPENSSL_LIB_DIR=/cross-build/usr/lib/aarch64-linux-gnu" >> $GITHUB_ENV
echo "PKG_CONFIG_ALLOW_CROSS=1" >> $GITHUB_ENV
echo "RUSTFLAGS=-C linker=aarch64-linux-gnu-gcc -L/usr/aarch64-linux-gnu/lib -L/cross-build/usr/lib/aarch64-linux-gnu" >> $GITHUB_ENV
- name: Build the executable
run: cargo build --release --target=aarch64-unknown-linux-gnu --no-default-features --features cli
- name: Attach artifact to the release
uses: Shopify/upload-to-release@v2.0.0
with:
name: monolith-gnu-linux-aarch64
path: target/aarch64-unknown-linux-gnu/release/monolith
repo-token: ${{ secrets.GITHUB_TOKEN }}
gnu_linux_armhf:
runs-on: ubuntu-20.04
steps:
- name: Checkout the repository
uses: actions/checkout@v4
- name: Prepare cross-platform environment
run: |
sudo mkdir /cross-build
sudo touch /etc/apt/sources.list.d/armhf.list
echo "deb [arch=armhf] http://ports.ubuntu.com/ubuntu-ports/ focal main" | sudo tee -a /etc/apt/sources.list.d/armhf.list
sudo apt-get update
sudo apt-get install -y gcc-arm-linux-gnueabihf libc6-armhf-cross libc6-dev-armhf-cross
sudo apt-get download libssl1.1:armhf libssl-dev:armhf
sudo dpkg -x libssl1.1*.deb /cross-build
sudo dpkg -x libssl-dev*.deb /cross-build
rustup target add arm-unknown-linux-gnueabihf
echo "C_INCLUDE_PATH=/cross-build/usr/include" >> $GITHUB_ENV
echo "OPENSSL_INCLUDE_DIR=/cross-build/usr/include/arm-linux-gnueabihf" >> $GITHUB_ENV
echo "OPENSSL_LIB_DIR=/cross-build/usr/lib/arm-linux-gnueabihf" >> $GITHUB_ENV
echo "PKG_CONFIG_ALLOW_CROSS=1" >> $GITHUB_ENV
echo "RUSTFLAGS=-C linker=arm-linux-gnueabihf-gcc -L/usr/arm-linux-gnueabihf/lib -L/cross-build/usr/lib/arm-linux-gnueabihf -L/cross-build/lib/arm-linux-gnueabihf" >> $GITHUB_ENV
- name: Build the executable
run: cargo build --release --target=arm-unknown-linux-gnueabihf --no-default-features --features cli
- name: Attach artifact to the release
uses: Shopify/upload-to-release@v2.0.0
with:
name: monolith-gnu-linux-armhf
path: target/arm-unknown-linux-gnueabihf/release/monolith
repo-token: ${{ secrets.GITHUB_TOKEN }}
gnu_linux_x86_64:
runs-on: ubuntu-20.04
steps:
- name: Checkout the repository
uses: actions/checkout@v4
- name: Build the executable
run: cargo build --release
- uses: Shopify/upload-to-release@v2.0.0
with:
name: monolith-gnu-linux-x86_64
path: target/release/monolith
repo-token: ${{ secrets.GITHUB_TOKEN }}
windows:
runs-on: windows-2019
steps:
- run: git config --global core.autocrlf false
- name: Checkout the repository
uses: actions/checkout@v4
- name: Build the executable
run: cargo build --release
- uses: Shopify/upload-to-release@v2.0.0
with:
name: monolith.exe
path: target\release\monolith.exe
repo-token: ${{ secrets.GITHUB_TOKEN }}
================================================
FILE: .github/workflows/ci-netbsd.yml
================================================
# CI NetBSD GitHub Actions workflow for monolith
name: CI (NetBSD)
on:
pull_request:
branches: [ master ]
paths-ignore:
- 'assets/'
- 'dist/'
- 'snap/'
- 'Dockerfile'
- 'LICENSE'
- 'Makefile'
- 'monolith.nuspec'
- 'README.md'
jobs:
build_and_test:
runs-on: ubuntu-latest
name: Build and test (netbsd)
steps:
- name: "Checkout repository"
uses: actions/checkout@v4
- name: Test in NetBSD
uses: vmactions/netbsd-vm@v1
with:
usesh: true
prepare: |
/usr/sbin/pkg_add cwrappers gmake mktools pkgconf rust
run: |
cargo build --all --locked --verbose --no-default-features --features cli
cargo test --all --locked --verbose --no-default-features --features cli
================================================
FILE: .github/workflows/ci.yml
================================================
# CI GitHub Actions workflow for monolith
name: CI
on:
pull_request:
branches: [ master ]
paths-ignore:
- 'assets/'
- 'dist/'
- 'snap/'
- 'Dockerfile'
- 'LICENSE'
- 'Makefile'
- 'monolith.nuspec'
- 'README.md'
jobs:
build_and_test:
name: Build and test
strategy:
matrix:
os:
- ubuntu-latest
- macos-latest
- windows-latest
runs-on: ${{ matrix.os }}
steps:
- run: git config --global core.autocrlf false
- name: "Checkout repository"
uses: actions/checkout@v4
- name: Build
run: cargo build --all --locked --verbose
- name: Run tests
run: cargo test --all --locked --verbose
- name: Check code formatting
run: |
rustup component add rustfmt
cargo fmt --all -- --check
================================================
FILE: .gitignore
================================================
# Generated by Cargo
# will have compiled files and executables
/target/
# These are backup files generated by rustfmt
**/*.rs.bk
# Added by Apify CLI
storage
node_modules
.venv
================================================
FILE: Cargo.toml
================================================
[package]
name = "monolith"
version = "2.11.0"
authors = [
"Sunshine <snshn@tutanota.com>",
"Mahdi Robatipoor <mahdi.robatipoor@gmail.com>",
"Emmanuel Delaborde <th3rac25@gmail.com>",
"Emi Simpson <emi@alchemi.dev>",
"rhysd <lin90162@yahoo.co.jp>",
"Andriy Rakhnin <a@rakhnin.com>",
]
edition = "2021"
description = "CLI tool and library for saving web pages as a single HTML file"
homepage = "https://github.com/Y2Z/monolith"
repository = "https://github.com/Y2Z/monolith"
readme = "README.md"
keywords = ["web", "http", "html", "download", "command-line"]
categories = ["command-line-utilities", "web-programming"]
include = ["src/*.rs", "Cargo.toml"]
license = "CC0-1.0"
[dependencies]
atty = "=0.2.14" # Used for highlighting network errors
base64 = "=0.22.1" # Used for integrity attributes
chrono = "=0.4.41" # Used for formatting timestamps
clap = { version = "=4.5.37", features = [
"derive",
], optional = true } # Used for processing CLI arguments
cssparser = "=0.35.0" # Used for dealing with CSS
directories = { version = "=6.0.0", optional = true } # Used for GUI
druid = { version = "=0.8.3", optional = true } # Used for GUI
encoding_rs = "=0.8.35" # Used for parsing and converting document charsets
html5ever = "=0.29.1" # Used for all things DOM
markup5ever_rcdom = "=0.5.0-unofficial" # Used for manipulating DOM
percent-encoding = "=2.3.1" # Used for encoding URLs
sha2 = "=0.10.9" # Used for calculating checksums during integrity checks
redb = "=2.4.0" # Used for on-disk caching of remote assets
tempfile = { version = "=3.19.1", optional = true } # Used for on-disk caching of remote assets
url = "=2.5.4" # Used for parsing URLs
openssl = "=0.10.72" # Used for static linking of the OpenSSL library
# Used for unwrapping NOSCRIPT
[dependencies.regex]
version = "=1.11.1"
default-features = false
features = ["std", "perf-dfa", "unicode-perl"]
# Used for making network requests
[dependencies.reqwest]
version = "=0.12.15"
default-features = false
features = ["default-tls", "blocking", "gzip", "brotli", "deflate"]
[dev-dependencies]
assert_cmd = "=2.0.17"
[features]
default = ["cli", "vendored-openssl"]
cli = ["clap", "tempfile"] # Build a CLI tool that includes main() function
gui = [
"directories",
"druid",
"tempfile",
] # Build a GUI executable that includes main() function
vendored-openssl = [
"openssl/vendored",
] # Compile and statically link a copy of OpenSSL
[lib]
name = "monolith"
path = "src/lib.rs"
[[bin]]
name = "monolith"
path = "src/main.rs"
required-features = ["cli"]
[[bin]]
name = "monolith-gui"
path = "src/gui.rs"
required-features = ["gui"]
================================================
FILE: Dockerfile
================================================
FROM clux/muslrust:stable as builder
RUN curl -L -o monolith.tar.gz $(curl -s https://api.github.com/repos/y2z/monolith/releases/latest \
| grep "tarball_url.*\"," \
| cut -d '"' -f 4)
RUN tar xfz monolith.tar.gz \
&& mv Y2Z-monolith-* monolith \
&& rm monolith.tar.gz
WORKDIR monolith/
RUN make install
FROM alpine
RUN apk update && \
apk add --no-cache openssl && \
rm -rf /var/cache/apk/*
COPY --from=builder /root/.cargo/bin/monolith /usr/bin/monolith
WORKDIR /tmp
ENTRYPOINT ["/usr/bin/monolith"]
================================================
FILE: LICENSE
================================================
Creative Commons Legal Code
CC0 1.0 Universal
CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
HEREUNDER.
Statement of Purpose
The laws of most jurisdictions throughout the world automatically confer
exclusive Copyright and Related Rights (defined below) upon the creator
and subsequent owner(s) (each and all, an "owner") of an original work of
authorship and/or a database (each, a "Work").
Certain owners wish to permanently relinquish those rights to a Work for
the purpose of contributing to a commons of creative, cultural and
scientific works ("Commons") that the public can reliably and without fear
of later claims of infringement build upon, modify, incorporate in other
works, reuse and redistribute as freely as possible in any form whatsoever
and for any purposes, including without limitation commercial purposes.
These owners may contribute to the Commons to promote the ideal of a free
culture and the further production of creative, cultural and scientific
works, or to gain reputation or greater distribution for their Work in
part through the use and efforts of others.
For these and/or other purposes and motivations, and without any
expectation of additional consideration or compensation, the person
associating CC0 with a Work (the "Affirmer"), to the extent that he or she
is an owner of Copyright and Related Rights in the Work, voluntarily
elects to apply CC0 to the Work and publicly distribute the Work under its
terms, with knowledge of his or her Copyright and Related Rights in the
Work and the meaning and intended legal effect of CC0 on those rights.
1. Copyright and Related Rights. A Work made available under CC0 may be
protected by copyright and related or neighboring rights ("Copyright and
Related Rights"). Copyright and Related Rights include, but are not
limited to, the following:
i. the right to reproduce, adapt, distribute, perform, display,
communicate, and translate a Work;
ii. moral rights retained by the original author(s) and/or performer(s);
iii. publicity and privacy rights pertaining to a person's image or
likeness depicted in a Work;
iv. rights protecting against unfair competition in regards to a Work,
subject to the limitations in paragraph 4(a), below;
v. rights protecting the extraction, dissemination, use and reuse of data
in a Work;
vi. database rights (such as those arising under Directive 96/9/EC of the
European Parliament and of the Council of 11 March 1996 on the legal
protection of databases, and under any national implementation
thereof, including any amended or successor version of such
directive); and
vii. other similar, equivalent or corresponding rights throughout the
world based on applicable law or treaty, and any national
implementations thereof.
2. Waiver. To the greatest extent permitted by, but not in contravention
of, applicable law, Affirmer hereby overtly, fully, permanently,
irrevocably and unconditionally waives, abandons, and surrenders all of
Affirmer's Copyright and Related Rights and associated claims and causes
of action, whether now known or unknown (including existing as well as
future claims and causes of action), in the Work (i) in all territories
worldwide, (ii) for the maximum duration provided by applicable law or
treaty (including future time extensions), (iii) in any current or future
medium and for any number of copies, and (iv) for any purpose whatsoever,
including without limitation commercial, advertising or promotional
purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
member of the public at large and to the detriment of Affirmer's heirs and
successors, fully intending that such Waiver shall not be subject to
revocation, rescission, cancellation, termination, or any other legal or
equitable action to disrupt the quiet enjoyment of the Work by the public
as contemplated by Affirmer's express Statement of Purpose.
3. Public License Fallback. Should any part of the Waiver for any reason
be judged legally invalid or ineffective under applicable law, then the
Waiver shall be preserved to the maximum extent permitted taking into
account Affirmer's express Statement of Purpose. In addition, to the
extent the Waiver is so judged Affirmer hereby grants to each affected
person a royalty-free, non transferable, non sublicensable, non exclusive,
irrevocable and unconditional license to exercise Affirmer's Copyright and
Related Rights in the Work (i) in all territories worldwide, (ii) for the
maximum duration provided by applicable law or treaty (including future
time extensions), (iii) in any current or future medium and for any number
of copies, and (iv) for any purpose whatsoever, including without
limitation commercial, advertising or promotional purposes (the
"License"). The License shall be deemed effective as of the date CC0 was
applied by Affirmer to the Work. Should any part of the License for any
reason be judged legally invalid or ineffective under applicable law, such
partial invalidity or ineffectiveness shall not invalidate the remainder
of the License, and in such case Affirmer hereby affirms that he or she
will not (i) exercise any of his or her remaining Copyright and Related
Rights in the Work or (ii) assert any associated claims and causes of
action with respect to the Work, in either case contrary to Affirmer's
express Statement of Purpose.
4. Limitations and Disclaimers.
a. No trademark or patent rights held by Affirmer are waived, abandoned,
surrendered, licensed or otherwise affected by this document.
b. Affirmer offers the Work as-is and makes no representations or
warranties of any kind concerning the Work, express, implied,
statutory or otherwise, including without limitation warranties of
title, merchantability, fitness for a particular purpose, non
infringement, or the absence of latent or other defects, accuracy, or
the present or absence of errors, whether or not discoverable, all to
the greatest extent permissible under applicable law.
c. Affirmer disclaims responsibility for clearing rights of other persons
that may apply to the Work or any use thereof, including without
limitation any person's Copyright and Related Rights in the Work.
Further, Affirmer disclaims responsibility for obtaining any necessary
consents, permissions or other rights required for any use of the
Work.
d. Affirmer understands and acknowledges that Creative Commons is not a
party to this document and has no duty or obligation with respect to
this CC0 or use of the Work.
================================================
FILE: Makefile
================================================
# Makefile for monolith
all: build build-gui
.PHONY: all
build:
@cargo build --locked
.PHONY: build
build-gui:
@cargo build --locked --bin monolith-gui --features="gui"
.PHONY: build-gui
clean:
@cargo clean
.PHONY: clean
format:
@cargo fmt --all --
.PHONY: format
format-check:
@cargo fmt --all -- --check
.PHONY: format-check
install:
@cargo install --force --locked --path .
.PHONY: install
lint:
@cargo clippy --fix --allow-dirty --allow-staged
# @cargo fix --allow-dirty --allow-staged
.PHONY: lint
lint-check:
@cargo clippy --
.PHONY: lint-check
test: build
@cargo test --locked
.PHONY: test
uninstall:
@cargo uninstall
.PHONY: uninstall
update-lock-file:
@cargo update
.PHONY: update-lock-file
================================================
FILE: README.md
================================================
```
_____ _____________ __________ ___________________ ___
| \ / \ | | | | | |
| \/ __ \| __ | | ___ ___ |__| |
| | | | | | | | | | | |
| |\ /| |__| |__| |___| | | | | __ |
| | \__/ | |\ | | | | | | |
|___| |__________| \___________________| |___| |___| |___|
```
A data hoarder’s dream come true: bundle any web page into a single HTML file. You can finally replace that gazillion of open tabs with a gazillion of .html files stored somewhere on your precious little drive.
Unlike the conventional “Save page as”, `monolith` not only saves the target document, it embeds CSS, image, and JavaScript assets **all at once**, producing a single HTML5 document that is a joy to store and share.
Compared to saving websites with `wget -mpk`, this tool embeds all assets as data URLs and therefore lets browsers render the saved page exactly the way it was on the Internet, even when no network connection is available.
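To illustrate what embedding an asset as a data URL means, here is a minimal Rust sketch that turns raw bytes and a media type into a `data:` URL. It is not monolith's implementation: the function names are hypothetical, and the Base64 encoder is hand-rolled only to keep the example free of external crates.

```rust
// Standard Base64 alphabet (RFC 4648).
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Encodes bytes as padded Base64, three input bytes per four output characters.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, padding with zeros.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        out.push(ALPHABET[((n >> 18) & 63) as usize] as char);
        out.push(ALPHABET[((n >> 12) & 63) as usize] as char);
        // Emit '=' padding for the positions that have no input byte behind them.
        if chunk.len() > 1 {
            out.push(ALPHABET[((n >> 6) & 63) as usize] as char);
        } else {
            out.push('=');
        }
        if chunk.len() > 2 {
            out.push(ALPHABET[(n & 63) as usize] as char);
        } else {
            out.push('=');
        }
    }
    out
}

/// Hypothetical helper: wraps an asset's bytes in a Base64 data URL.
fn data_url(media_type: &str, bytes: &[u8]) -> String {
    format!("data:{};base64,{}", media_type, base64_encode(bytes))
}
```

An `<img src="logo.png">` reference could then be rewritten to `<img src="data:image/png;base64,...">`, so the saved document no longer depends on the original file.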
---------------------------------------------------
## Installation
#### Using [Cargo](https://crates.io/crates/monolith) (cross-platform)
```console
cargo install monolith
```
#### Via [Homebrew](https://formulae.brew.sh/formula/monolith) (macOS and GNU/Linux)
```console
brew install monolith
```
#### Via [Chocolatey](https://community.chocolatey.org/packages/monolith) (Windows)
```console
choco install monolith
```
#### Via [Scoop](https://scoop.sh/#/apps?q=monolith) (Windows)
```console
scoop install main/monolith
```
#### Via [Winget](https://winstall.app/apps/Y2Z.Monolith) (Windows)
```console
winget install --id=Y2Z.Monolith -e
```
#### Via [MacPorts](https://ports.macports.org/port/monolith/summary) (macOS)
```console
sudo port install monolith
```
#### Using [Snapcraft](https://snapcraft.io/monolith) (GNU/Linux)
```console
snap install monolith
```
#### Using [Guix](https://packages.guix.gnu.org/packages/monolith) (GNU/Linux)
```console
guix install monolith
```
#### Using [NixPkgs](https://search.nixos.org/packages?channel=unstable&show=monolith&query=monolith)
```console
nix-env -iA nixpkgs.monolith
```
#### Using [Flox](https://flox.dev)
```console
flox install monolith
```
#### Using [Pacman](https://archlinux.org/packages/extra/x86_64/monolith) (Arch Linux)
```console
pacman -S monolith
```
#### Using [aports](https://pkgs.alpinelinux.org/packages?name=monolith) (Alpine Linux)
```console
apk add monolith
```
#### Using [XBPS Package Manager](https://voidlinux.org/packages/?q=monolith) (Void Linux)
```console
xbps-install -S monolith
```
#### Using [FreeBSD packages](https://svnweb.freebsd.org/ports/head/www/monolith/) (FreeBSD)
```console
pkg install monolith
```
#### Using [FreeBSD ports](https://www.freshports.org/www/monolith/) (FreeBSD)
```console
cd /usr/ports/www/monolith/
make install clean
```
#### Using [pkgsrc](https://pkgsrc.se/www/monolith) (NetBSD, OpenBSD, Haiku, etc)
```console
cd /usr/pkgsrc/www/monolith
make install clean
```
#### Using [containers](https://www.docker.com/)
```console
docker build -t y2z/monolith .
sudo install -b dist/run-in-container.sh /usr/local/bin/monolith
```
#### From [source](https://github.com/Y2Z/monolith)
Dependencies: `libssl`, `cargo`
<details>
<summary>Install cargo (GNU/Linux)</summary>
Check if cargo is installed
```console
cargo -v
```
If cargo is not already installed, install and add it to your existing ```$PATH``` (paraphrasing the [official installation instructions](https://doc.rust-lang.org/cargo/getting-started/installation.html)):
```console
curl https://sh.rustup.rs -sSf | sh
. "$HOME/.cargo/env"
```
Proceed with installing from source:
</details>
```console
git clone https://github.com/Y2Z/monolith.git
cd monolith
make install
```
#### Using [pre-built binaries](https://github.com/Y2Z/monolith/releases) (Windows, ARM-based devices, etc)
Every release contains pre-built binaries for Windows, GNU/Linux, as well as platforms with non-standard CPU architecture.
---------------------------------------------------
## Usage
```console
monolith https://lyrics.github.io/db/P/Portishead/Dummy/Roads/ -o %title%.%timestamp%.html
```
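The output path in the command above contains `%title%` and `%timestamp%` placeholders. The Rust sketch below shows one way such a template might be expanded. It is an illustration under assumptions, not monolith's code: `expand_output_template` is a hypothetical name, the timestamp is passed in as an already formatted string, and the rule for sanitizing the title (replacing `/`, `\` and `:` with `_`) is invented for the example.

```rust
/// Hypothetical helper: expands `%title%` and `%timestamp%` in an output path template.
fn expand_output_template(template: &str, title: &str, timestamp: &str) -> String {
    // Assumed sanitization: keep the title from introducing path separators.
    let safe_title: String = title
        .chars()
        .map(|c| if matches!(c, '/' | '\\' | ':') { '_' } else { c })
        .collect();

    // Substitute the timestamp first so that placeholder-like text inside
    // the page title is never expanded a second time.
    template
        .replace("%timestamp%", timestamp)
        .replace("%title%", &safe_title)
}
```

For instance, `expand_output_template("%title%.%timestamp%.html", "Roads", "1700000000")` returns `Roads.1700000000.html`.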
```console
cat some-site-page.html | monolith -aIiFfcMv -b https://some.site/ - > some-site-page-with-assets.html
```
---------------------------------------------------
## Options
- `-a`: Exclude audio sources
- `-b`: Use `custom base URL`
- `-B`: Forbid retrieving assets from specified domain(s)
- `-c`: Exclude CSS
- `-C`: Read cookies from `file`
- `-d`: Allow retrieving assets only from specified `domain(s)`
- `-e`: Ignore network errors
- `-E`: Save document using `custom encoding`
- `-f`: Omit frames
- `-F`: Exclude web fonts
- `-h`: Print help information
- `-i`: Remove images
- `-I`: Isolate the document
- `-j`: Exclude JavaScript
- `-k`: Accept invalid X.509 (TLS) certificates
- `-m`: Output in MHTML format instead of HTML
- `-M`: Don't add timestamp and URL information
- `-n`: Extract contents of NOSCRIPT elements
- `-o`: Write output to `file` (use “-” for STDOUT)
- `-q`: Be quiet
- `-t`: Adjust `network request timeout`
- `-u`: Provide `custom User-Agent`
- `-v`: Exclude videos
- `-V`: Print version number
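
The file passed to `-C` is read by `parse_cookie_file_contents` in `src/cookies.rs`, which expects a header line (`# HTTP Cookie File` or `# Netscape HTTP Cookie File`) followed by lines of seven tab-separated fields. As a minimal sketch of how one such line maps onto those fields, consider the following; `CookieLine` and `parse_cookie_line` are hypothetical names used only here, and this version simply returns `None` for any line it cannot parse.

```rust
/// Hypothetical struct mirroring the fields of `Cookie` in src/cookies.rs.
#[derive(Debug, PartialEq)]
struct CookieLine {
    domain: String,
    include_subdomains: bool,
    path: String,
    https_only: bool,
    expires: u64, // 0 marks a session cookie
    name: String,
    value: String,
}

/// Hypothetical helper: parse one non-header line of a cookie file.
fn parse_cookie_line(line: &str) -> Option<CookieLine> {
    // Comment lines carry no cookie data
    if line.starts_with('#') {
        return None;
    }
    // Exactly seven tab-separated fields are required
    let fields: Vec<&str> = line.split('\t').collect();
    if fields.len() != 7 {
        return None;
    }
    Some(CookieLine {
        domain: fields[0].to_lowercase(),
        include_subdomains: fields[1] == "TRUE",
        path: fields[2].to_string(),
        https_only: fields[3] == "TRUE",
        expires: fields[4].parse().ok()?,
        name: fields[5].to_string(),
        value: fields[6].to_string(),
    })
}

fn main() {
    let line = ".example.com\tTRUE\t/\tFALSE\t0\tsession_id\tabc123";
    println!("{:?}", parse_cookie_line(line));
}
```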
---------------------------------------------------
## Whitelisting and blacklisting domains
Options `-d` and `-B` control which domains assets may be retrieved from: `-d` on its own allows only the listed domains, while adding `-B` forbids the listed domains instead, e.g.:
```console
monolith -I -d example.com -d www.example.com https://example.com -o example-only.html
```
```console
monolith -I -B -d .googleusercontent.com -d googleanalytics.com -d .google.com https://example.com -o example-no-ads.html
```
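The matching rule behind `-d` and `-B` can be sketched roughly as follows. This is an illustration, not monolith's actual code: it assumes that a pattern with a leading dot (as in `.google.com`) covers the domain and all of its subdomains while a bare domain matches only exactly, and the helper names `domain_matches` and `may_retrieve` are hypothetical.

```rust
/// Hypothetical helper: does `host` fall under `pattern`?
/// Assumption: ".example.com" covers example.com and every subdomain,
/// while "example.com" matches only that exact host.
fn domain_matches(host: &str, pattern: &str) -> bool {
    let host = host.to_ascii_lowercase();
    let pattern = pattern.to_ascii_lowercase();
    match pattern.strip_prefix('.') {
        Some(bare) => host == bare || host.ends_with(pattern.as_str()),
        None => host == pattern,
    }
}

/// Hypothetical helper: without `blacklist` the list acts as an allow-list (`-d`),
/// with it the same list acts as a deny-list (`-B -d`).
fn may_retrieve(host: &str, domains: &[&str], blacklist: bool) -> bool {
    let listed = domains.iter().any(|d| domain_matches(host, d));
    if blacklist {
        !listed
    } else {
        listed
    }
}

fn main() {
    let ads = [".googleusercontent.com", "googleanalytics.com", ".google.com"];
    for host in ["example.com", "ads.google.com", "googleanalytics.com"] {
        println!("{}: {}", host, may_retrieve(host, &ads, true));
    }
}
```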
---------------------------------------------------
## Dynamic content
Monolith doesn't include a JavaScript engine, so websites that retrieve and display data after the initial load may require additional tools.
For example, Chromium (Chrome) can be used to act as a pre-processor for such pages:
```console
chromium --headless --window-size=1920,1080 --run-all-compositor-stages-before-draw --virtual-time-budget=9000 --incognito --dump-dom https://github.com | monolith - -I -b https://github.com -o github.html
```
---------------------------------------------------
## Authentication
```console
monolith https://username:password@example.com -o example-basic-auth.html
```
---------------------------------------------------
## Proxies
To use a proxy, set the `https_proxy`, `http_proxy`, and `no_proxy` environment variables.
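
When monolith is launched from another program rather than from an interactive shell, these variables can be set on the child process. The sketch below builds (but does not run) such a command in Rust. The function name `monolith_via_proxy`, the proxy address `http://127.0.0.1:8080`, and the `no_proxy` value are placeholders chosen for illustration.

```rust
use std::process::Command;

/// Hypothetical helper: prepare a monolith invocation that uses a proxy.
fn monolith_via_proxy(proxy: &str, target: &str, output: &str) -> Command {
    let mut cmd = Command::new("monolith");
    cmd.arg(target).arg("-o").arg(output);
    // The variable names come from the README; the values are placeholders
    cmd.env("http_proxy", proxy)
        .env("https_proxy", proxy)
        .env("no_proxy", "localhost,127.0.0.1");
    cmd
}

fn main() {
    let cmd = monolith_via_proxy(
        "http://127.0.0.1:8080",
        "https://example.com",
        "example.html",
    );
    // Print instead of spawning, so the example runs without monolith installed
    println!("{:?}", cmd);
}
```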
---------------------------------------------------
## Apify Actor usage
<a href="https://apify.com/snshn/monolith?fpr=snshn"><img src="https://apify.com/ext/run-on-apify.png" alt="Run Monolith Actor on Apify" width="176" height="39" /></a>
You can run Monolith in the cloud without installation using the [Monolith Actor](https://apify.com/snshn/monolith?fpr=snshn) on [Apify](https://apify.com?fpr=snshn) free of charge.
``` bash
echo '{"urls": ["https://news.ycombinator.com/"]}' | apify call -so snshn/monolith
[{
"url": "https://news.ycombinator.com/",
"status": "0",
"kvsUrl": "https://api.apify.com/v2/key-value-stores/of9xNgvpon4elPLbc/records/https___news.ycombinator.com_"
}]
```
Read more about the [Monolith Actor](.actor/README.md), including how to use it via the Apify UI, API and CLI without installation.
---------------------------------------------------
## Contributing
Please open an issue if something is wrong; that helps make this project better.
---------------------------------------------------
## License
To the extent possible under law, the author(s) have dedicated all copyright related and neighboring rights to this software to the public domain worldwide.
This software is distributed without any warranty.
================================================
FILE: dist/run-in-container.sh
================================================
#!/bin/sh
DOCKER=docker
if command -v podman > /dev/null 2>&1; then
DOCKER=podman
fi
ORG_NAME=y2z
PROG_NAME=monolith
$DOCKER run --rm $ORG_NAME/$PROG_NAME "$@"
================================================
FILE: monolith.nuspec
================================================
<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://schemas.microsoft.com/packaging/2015/06/nuspec.xsd">
<metadata>
<id>monolith</id>
<version>2.8.1</version>
<title>Monolith</title>
<authors>Sunshine, Mahdi Robatipoor, Emmanuel Delaborde, Emi Simpson, rhysd</authors>
<projectUrl>https://github.com/Y2Z/monolith</projectUrl>
<iconUrl>https://raw.githubusercontent.com/Y2Z/monolith/master/assets/icon/icon.png</iconUrl>
<licenseUrl>https://raw.githubusercontent.com/Y2Z/monolith/master/LICENSE</licenseUrl>
<requireLicenseAcceptance>false</requireLicenseAcceptance>
<description>CLI tool for saving complete web pages as a single HTML file
A data hoarder’s dream come true: bundle any web page into a single HTML file. You can finally replace that gazillion of open tabs with a gazillion of .html files stored somewhere on your precious little drive.
Unlike the conventional “Save page as”, monolith not only saves the target document, it embeds CSS, image, and JavaScript assets all at once, producing a single HTML5 document that is a joy to store and share.
Compared to saving websites using wget, this tool embeds all assets as data URLs, so browsers can render the saved page exactly as it appeared on the Internet, even when no network connection is available.
</description>
<copyright>Public Domain</copyright>
<language>en-US</language>
<tags>scraping archiving</tags>
<docsUrl>https://github.com/Y2Z/monolith/blob/master/README.md</docsUrl>
</metadata>
</package>
================================================
FILE: snap/snapcraft.yaml
================================================
name: monolith
base: core18
# Version data defined inside the monolith part below
adopt-info: monolith
summary: Monolith - Save HTML pages with ease
description: |
A data hoarder's dream come true: bundle any web page into a single
HTML file. You can finally replace that gazillion of open tabs with
a gazillion of .html files stored somewhere on your precious little
drive.
Unlike conventional "Save page as…", monolith not only saves the
target document, it embeds CSS, image, and JavaScript assets all
at once, producing a single HTML5 document that is a joy to store
and share.
If compared to saving websites with wget -mpk, monolith embeds
all assets as data URLs and therefore displays the saved page
exactly the same, being completely separated from the Internet.
confinement: strict
architectures:
- build-on: amd64
- build-on: arm64
- build-on: armhf
- build-on: i386
- build-on: ppc64el
- build-on: s390x
parts:
monolith:
plugin: rust
source: .
build-packages:
- libssl-dev
- pkg-config
override-pull: |
snapcraftctl pull
# Determine the current tag
last_committed_tag="$(git describe --tags --abbrev=0)"
last_committed_tag_ver="$(echo "${last_committed_tag}" | sed 's/^v//')"
# Determine the most recent version in the beta channel in the Snap Store
last_released_tag="$(snap info $SNAPCRAFT_PROJECT_NAME | awk '$1 == "beta:" { print $2 }')"
# If the latest tag from the upstream project has not been released to
# beta, build that tag instead of master.
if [ "${last_committed_tag_ver}" != "${last_released_tag}" ]; then
git fetch
git checkout "${last_committed_tag}"
fi
# set version number of the snap based on what we did above
snapcraftctl set-version $(git describe --tags --abbrev=0)
apps:
monolith:
command: monolith
plugs:
- home
- network
- removable-media
================================================
FILE: src/cache.rs
================================================
use std::collections::HashMap;
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::Path;
use redb::{Database, Error, TableDefinition};
pub struct CacheMetadataItem {
data: Option<Vec<u8>>, // Asset's blob; used for caching small files or if on-disk database isn't utilized
media_type: Option<String>, // MIME-type, things like "text/plain", "image/png"...
charset: Option<String>, // "UTF-8", "UTF-16"...
}
// #[derive(Debug)]
pub struct Cache {
min_file_size: usize, // Only use database for assets larger than this size (in bytes), otherwise keep them in RAM
metadata: HashMap<String, CacheMetadataItem>, // Dictionary of metadata (and occasionally data [mostly for very small files])
db: Option<Database>, // Pointer to database instance; None if not yet initialized or if failed to initialize
db_ok: Option<bool>, // None by default, Some(true) if was able to initialize database, Some (false) if an error occurred
db_file_path: Option<String>, // Filesystem path to file used for storing database
}
const FILE_WRITE_BUF_LEN: usize = 1024 * 100; // On-disk cache file write buffer size (in bytes)
const TABLE: TableDefinition<&str, &[u8]> = TableDefinition::new("_");
impl Cache {
pub fn new(min_file_size: usize, db_file_path: Option<String>) -> Cache {
let mut cache = Cache {
min_file_size,
metadata: HashMap::new(),
db: None,
db_ok: None,
db_file_path: db_file_path.clone(),
};
if db_file_path.is_some() {
// Attempt to initialize on-disk database
match Database::create(Path::new(&db_file_path.unwrap())) {
Ok(db) => {
cache.db = Some(db);
cache.db_ok = Some(true);
cache
}
Err(..) => {
cache.db_ok = Some(false);
cache
}
}
} else {
cache.db_ok = Some(false);
cache
}
}
pub fn set(&mut self, key: &str, data: &Vec<u8>, media_type: String, charset: String) {
let mut cache_metadata_item: CacheMetadataItem = CacheMetadataItem {
data: if self.db_ok.is_some() && self.db_ok.unwrap() {
None
} else {
Some(data.to_owned().to_vec())
},
media_type: Some(media_type.to_owned()),
charset: Some(charset),
};
if (self.db_ok.is_none() || !self.db_ok.unwrap()) || data.len() <= self.min_file_size {
cache_metadata_item.data = Some(data.to_owned().to_vec());
} else {
match self.db.as_ref().unwrap().begin_write() {
Ok(write_txn) => {
{
let mut table = write_txn.open_table(TABLE).unwrap();
table.insert(key, &*data.to_owned()).unwrap();
}
write_txn.commit().unwrap();
}
Err(..) => {
// Fall back to caching everything in memory
cache_metadata_item.data = Some(data.to_owned().to_vec());
}
}
}
self.metadata
.insert((*key).to_string(), cache_metadata_item);
}
pub fn get(&self, key: &str) -> Result<(Vec<u8>, String, String), Error> {
if self.metadata.contains_key(key) {
let metadata_item = self.metadata.get(key).unwrap();
if metadata_item.data.is_some() {
return Ok((
metadata_item.data.as_ref().unwrap().to_vec(),
metadata_item.media_type.as_ref().expect("").to_string(),
metadata_item.charset.as_ref().expect("").to_string(),
));
} else if self.db_ok.is_some() && self.db_ok.unwrap() {
let read_txn = self.db.as_ref().unwrap().begin_read()?;
let table = read_txn.open_table(TABLE)?;
let data = table.get(key)?;
let bytes = data.unwrap();
return Ok((
bytes.value().to_vec(),
metadata_item.media_type.as_ref().expect("").to_string(),
metadata_item.charset.as_ref().expect("").to_string(),
));
}
}
Err(Error::TransactionInProgress) // XXX
}
pub fn contains_key(&self, key: &str) -> bool {
self.metadata.contains_key(key)
}
pub fn destroy_database_file(&mut self) {
if self.db_ok.is_none() || !self.db_ok.unwrap() {
return;
}
// Destroy database instance (prevents writes into file)
self.db = None;
self.db_ok = Some(false);
// Wipe database file
if let Some(db_file_path) = self.db_file_path.to_owned() {
// Overwrite file with zeroes
if let Ok(temp_file) = File::options()
.read(true)
.write(true)
.open(db_file_path.clone())
{
let mut buffer = [0; FILE_WRITE_BUF_LEN];
let mut remaining_size: usize = temp_file.metadata().unwrap().len() as usize;
let mut writer = BufWriter::new(temp_file);
while remaining_size > 0 {
let bytes_to_write: usize = if remaining_size < FILE_WRITE_BUF_LEN {
remaining_size
} else {
FILE_WRITE_BUF_LEN
};
let buffer = &mut buffer[..bytes_to_write];
// write_all keeps writing until the whole chunk is out; a bare write may stop short
writer.write_all(buffer).unwrap();
remaining_size -= bytes_to_write;
}
}
}
}
}
================================================
FILE: src/cookies.rs
================================================
use std::time::{SystemTime, UNIX_EPOCH};
use crate::url::Url;
pub struct Cookie {
pub domain: String,
pub include_subdomains: bool,
pub path: String,
pub https_only: bool,
pub expires: u64,
pub name: String,
pub value: String,
}
#[derive(Debug)]
pub enum CookieFileContentsParseError {
InvalidHeader,
}
impl Cookie {
pub fn is_expired(&self) -> bool {
if self.expires == 0 {
return false; // Session, never expires
}
let start = SystemTime::now();
let since_the_epoch = start
.duration_since(UNIX_EPOCH)
.expect("Time went backwards");
self.expires < since_the_epoch.as_secs()
}
pub fn matches_url(&self, url: &str) -> bool {
match Url::parse(url) {
Ok(url) => {
// Check protocol scheme
match url.scheme() {
"http" => {
if self.https_only {
return false;
}
}
"https" => {}
_ => {
// Should never match URLs of protocols other than HTTP(S)
return false;
}
}
// Check host
if let Some(url_host) = url.host_str() {
if self.domain.starts_with(".") && self.include_subdomains {
if !url_host.to_lowercase().ends_with(&self.domain)
&& !url_host
.eq_ignore_ascii_case(&self.domain[1..])
{
return false;
}
} else if !url_host.eq_ignore_ascii_case(&self.domain) {
return false;
}
} else {
return false;
}
// Check path
if !url.path().eq_ignore_ascii_case(&self.path)
&& !url.path().starts_with(&self.path)
{
return false;
}
}
Err(_) => {
return false;
}
}
true
}
}
pub fn parse_cookie_file_contents(
cookie_file_contents: &str,
) -> Result<Vec<Cookie>, CookieFileContentsParseError> {
let mut cookies: Vec<Cookie> = Vec::new();
for (i, line) in cookie_file_contents.lines().enumerate() {
if i == 0 {
// Parsing first line
if !line.eq("# HTTP Cookie File") && !line.eq("# Netscape HTTP Cookie File") {
return Err(CookieFileContentsParseError::InvalidHeader);
}
} else {
// Ignore comment lines
if line.starts_with("#") {
continue;
}
// Attempt to parse values
let fields: Vec<&str> = line.split('\t').collect();
if fields.len() != 7 {
continue;
}
// Skip lines with a non-numeric expiry instead of panicking on them
let expires: u64 = match fields[4].parse::<u64>() {
Ok(expires) => expires,
Err(_) => continue,
};
cookies.push(Cookie {
domain: fields[0].to_lowercase(),
include_subdomains: fields[1] == "TRUE",
path: fields[2].to_string(),
https_only: fields[3] == "TRUE",
expires,
name: fields[5].to_string(),
value: fields[6].to_string(),
});
}
}
Ok(cookies)
}
================================================
FILE: src/core.rs
================================================
use std::env;
use std::error::Error;
use std::fmt;
use std::fs;
use std::io::{self, Write};
use std::path::Path;
use chrono::{SecondsFormat, Utc};
use encoding_rs::Encoding;
use markup5ever_rcdom::RcDom;
use url::Url;
use crate::html::{
add_favicon, create_metadata_tag, get_base_url, get_charset, get_robots, get_title,
has_favicon, html_to_dom, serialize_document, set_base_url, set_charset, set_robots, walk,
};
use crate::session::Session;
use crate::url::{create_data_url, resolve_url};
#[derive(Debug)]
pub struct MonolithError {
details: String,
}
impl MonolithError {
fn new(msg: &str) -> MonolithError {
MonolithError {
details: msg.to_string(),
}
}
}
impl fmt::Display for MonolithError {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", self.details)
}
}
impl Error for MonolithError {
fn description(&self) -> &str {
&self.details
}
}
#[derive(Clone, Debug, PartialEq, Eq, Default)]
pub enum MonolithOutputFormat {
#[default]
HTML,
MHTML,
// WARC,
// ZIM,
// HAR,
}
#[derive(Default)]
pub struct MonolithOptions {
pub base_url: Option<String>,
pub blacklist_domains: bool,
pub domains: Option<Vec<String>>,
pub encoding: Option<String>,
pub ignore_errors: bool,
pub insecure: bool,
pub isolate: bool,
pub no_audio: bool,
pub no_css: bool,
pub no_fonts: bool,
pub no_frames: bool,
pub no_images: bool,
pub no_js: bool,
pub no_metadata: bool,
pub no_video: bool,
pub output_format: MonolithOutputFormat,
pub silent: bool,
pub timeout: u64,
pub unwrap_noscript: bool,
pub user_agent: Option<String>,
}
const ANSI_COLOR_RED: &str = "\x1b[31m";
const ANSI_COLOR_RESET: &str = "\x1b[0m";
const FILE_SIGNATURES: [[&[u8]; 2]; 18] = [
// Image
[b"GIF87a", b"image/gif"],
[b"GIF89a", b"image/gif"],
[b"\xFF\xD8\xFF", b"image/jpeg"],
[b"\x89PNG\x0D\x0A\x1A\x0A", b"image/png"],
[b"<svg ", b"image/svg+xml"],
[b"RIFF....WEBPVP8 ", b"image/webp"],
[b"\x00\x00\x01\x00", b"image/x-icon"],
// Audio
[b"ID3", b"audio/mpeg"],
[b"\xFF\x0E", b"audio/mpeg"],
[b"\xFF\x0F", b"audio/mpeg"],
[b"OggS", b"audio/ogg"],
[b"RIFF....WAVEfmt ", b"audio/wav"],
[b"fLaC", b"audio/x-flac"],
// Video
[b"RIFF....AVI LIST", b"video/avi"],
[b"....ftyp", b"video/mp4"],
[b"\x00\x00\x01\x0B", b"video/mpeg"],
[b"....moov", b"video/quicktime"],
[b"\x1A\x45\xDF\xA3", b"video/webm"],
];
// All known non-"text/..." plaintext media types
const PLAINTEXT_MEDIA_TYPES: &[&str] = &[
"application/javascript", // .js
"application/json", // .json
"application/ld+json", // .jsonld
"application/x-sh", // .sh
"application/xhtml+xml", // .xhtml
"application/xml", // .xml
"application/vnd.mozilla.xul+xml", // .xul
"image/svg+xml", // .svg
];
pub fn create_monolithic_document_from_data(
mut session: Session,
input_data: Vec<u8>,
input_encoding: Option<String>,
input_target: Option<String>,
) -> Result<(Vec<u8>, Option<String>), MonolithError> {
// Validate options
{
// Check if custom encoding value is acceptable
if let Some(custom_output_encoding) = session.options.encoding.clone() {
if Encoding::for_label_no_replacement(custom_output_encoding.as_bytes()).is_none() {
return Err(MonolithError::new(&format!(
"unknown encoding \"{}\"",
&custom_output_encoding
)));
}
}
}
let mut base_url: Url = if input_target.is_some() {
Url::parse(&input_target.clone().unwrap()).unwrap()
} else {
Url::parse("data:text/html,").unwrap()
};
let mut document_encoding: String = input_encoding.clone().unwrap_or("utf-8".to_string());
let mut dom: RcDom;
// Initial parse
dom = html_to_dom(&input_data, document_encoding.clone());
// Attempt to determine document's encoding
if let Some(html_charset) = get_charset(&dom.document) {
if !html_charset.is_empty() {
// Check if the charset specified inside HTML is valid
if let Some(document_charset) =
Encoding::for_label_no_replacement(html_charset.as_bytes())
{
document_encoding = html_charset;
dom = html_to_dom(&input_data, document_charset.name().to_string());
}
}
}
// Use custom base URL if specified; read and use what's in the DOM otherwise
let custom_base_url: String = session.options.base_url.clone().unwrap_or_default();
if custom_base_url.is_empty() {
// No custom base URL is specified; try to see if document has BASE element
if let Some(existing_base_url) = get_base_url(&dom.document) {
base_url = resolve_url(&base_url, &existing_base_url);
}
} else {
// Custom base URL provided
match Url::parse(&custom_base_url) {
Ok(parsed_url) => {
if parsed_url.scheme() == "file" {
// File base URLs can only work with documents saved from filesystem
if base_url.scheme() == "file" {
base_url = parsed_url;
}
} else {
base_url = parsed_url;
}
}
Err(_) => {
// Failed to parse given base URL, perhaps it's a filesystem path?
if base_url.scheme() == "file" {
// Relative paths could work for documents saved from filesystem
let path: &Path = Path::new(&custom_base_url);
if path.exists() {
match Url::from_file_path(fs::canonicalize(path).unwrap()) {
Ok(file_url) => {
base_url = file_url;
}
Err(_) => {
return Err(MonolithError::new(&format!(
"could not map given path to base URL \"{}\"",
custom_base_url
)));
}
}
}
}
}
}
}
// Traverse through the document and embed remote assets
walk(&mut session, &base_url, &dom.document);
// Update or add new BASE element to reroute network requests and hash-links
if let Some(new_base_url) = session.options.base_url.clone() {
dom = set_base_url(&dom.document, new_base_url);
}
// Request and embed /favicon.ico (unless it's already linked in the document)
if !session.options.no_images
&& (base_url.scheme() == "http" || base_url.scheme() == "https")
&& (input_target.is_some()
&& (input_target.as_ref().unwrap().starts_with("http:")
|| input_target.as_ref().unwrap().starts_with("https:")))
&& !has_favicon(&dom.document)
{
let favicon_ico_url: Url = resolve_url(&base_url, "/favicon.ico");
match session.retrieve_asset(/*&target_url, */ &base_url, &favicon_ico_url) {
Ok((data, final_url, media_type, charset)) => {
let favicon_data_url: Url =
create_data_url(&media_type, &charset, &data, &final_url);
dom = add_favicon(&dom.document, favicon_data_url.to_string());
}
Err(_) => {
// Failed to retrieve /favicon.ico
}
}
}
// Append noindex META-tag
let meta_robots_content_value = get_robots(&dom.document).unwrap_or_default();
if meta_robots_content_value.trim().is_empty() || meta_robots_content_value != "none" {
dom = set_robots(dom, "none");
}
// Save using specified charset, if given
if let Some(custom_encoding) = session.options.encoding.clone() {
document_encoding = custom_encoding;
dom = set_charset(dom, document_encoding.clone());
}
let document_title: Option<String> = get_title(&dom.document);
if session.options.output_format == MonolithOutputFormat::HTML {
// Serialize DOM tree
let mut result: Vec<u8> = serialize_document(dom, document_encoding, &session.options);
// Prepend metadata comment tag
if !session.options.no_metadata && !input_target.clone().unwrap_or_default().is_empty() {
let mut metadata_comment: String =
create_metadata_tag(&Url::parse(&input_target.unwrap_or_default()).unwrap());
// let mut metadata_comment: String = create_metadata_tag(target);
metadata_comment += "\n";
result.splice(0..0, metadata_comment.as_bytes().to_vec());
}
// Ensure newline at end of result
if result.last() != Some(&b"\n"[0]) {
result.extend_from_slice(b"\n");
}
Ok((result, document_title))
} else if session.options.output_format == MonolithOutputFormat::MHTML {
// Serialize DOM tree
let mut result: Vec<u8> = serialize_document(dom, document_encoding, &session.options);
// Prepend metadata comment tag
if !session.options.no_metadata && !input_target.clone().unwrap_or_default().is_empty() {
let mut metadata_comment: String =
create_metadata_tag(&Url::parse(&input_target.unwrap_or_default()).unwrap());
// let mut metadata_comment: String = create_metadata_tag(target);
metadata_comment += "\n";
result.splice(0..0, metadata_comment.as_bytes().to_vec());
}
// Extremely hacky way to convert output to MIME
let mime = "MIME-Version: 1.0\r\n\
Content-Type: multipart/related; boundary=\"----=_NextPart_000_0000\"\r\n\
\r\n\
------=_NextPart_000_0000\r\n\
Content-Type: text/html; charset=\"utf-8\"\r\n\
Content-Location: http://example.com/\r\n\
\r\n";
result.splice(0..0, mime.as_bytes().to_vec());
let mime = "\r\n------=_NextPart_000_0000--\r\n";
result.extend_from_slice(mime.as_bytes());
Ok((result, document_title))
} else {
Ok((vec![], document_title))
}
}
pub fn create_monolithic_document(
mut session: Session,
target: String,
) -> Result<(Vec<u8>, Option<String>), MonolithError> {
// Check if target was provided
if target.is_empty() {
return Err(MonolithError::new("no target specified"));
}
// Validate options
{
// Check if custom encoding value is acceptable
if let Some(custom_encoding) = session.options.encoding.clone() {
if Encoding::for_label_no_replacement(custom_encoding.as_bytes()).is_none() {
return Err(MonolithError::new(&format!(
"unknown encoding \"{}\"",
&custom_encoding
)));
}
}
}
let mut target_url = match target.as_str() {
target_str => match Url::parse(target_str) {
Ok(target_url) => match target_url.scheme() {
"data" | "file" | "http" | "https" => target_url,
unsupported_scheme => {
return Err(MonolithError::new(&format!(
"unsupported target URL scheme \"{}\"",
unsupported_scheme
)));
}
},
Err(_) => {
// Failed to parse given target as a URL (perhaps it's a filesystem path?)
let path: &Path = Path::new(&target_str);
match path.exists() {
true => match path.is_file() {
true => {
let canonical_path = fs::canonicalize(path).unwrap();
match Url::from_file_path(canonical_path) {
Ok(url) => url,
Err(_) => {
return Err(MonolithError::new(&format!(
"could not generate file URL out of given path \"{}\"",
&target_str
)));
}
}
}
false => {
return Err(MonolithError::new(&format!(
"local target \"{}\" is not a file",
&target_str
)));
}
},
false => {
// It is not a FS path, now we do what browsers do:
// prepend "http://" and hope it points to a website
Url::parse(&format!("http://{}", &target_str)).unwrap()
}
}
}
},
};
let data: Vec<u8>;
let document_encoding: Option<String>;
// Retrieve target document
if target_url.scheme() == "file"
|| target_url.scheme() == "http"
|| target_url.scheme() == "https"
|| target_url.scheme() == "data"
{
match session.retrieve_asset(&target_url, &target_url) {
Ok((retrieved_data, final_url, media_type, charset)) => {
if !media_type.eq_ignore_ascii_case("text/html")
&& !media_type.eq_ignore_ascii_case("application/xhtml+xml")
{
// Provide output as text (without processing it, the way browsers do)
return Ok((retrieved_data, None));
}
// If got redirected, set target_url to that
if final_url != target_url {
target_url = final_url.clone();
}
data = retrieved_data;
document_encoding = Some(charset);
}
Err(_) => {
return Err(MonolithError::new("could not retrieve target document"));
}
}
} else {
return Err(MonolithError::new("unsupported target"));
}
create_monolithic_document_from_data(
session,
data,
document_encoding,
Some(target_url.to_string()),
)
}
pub fn detect_media_type(data: &[u8], url: &Url) -> String {
// At first attempt to read file's header
for file_signature in FILE_SIGNATURES.iter() {
if data.starts_with(file_signature[0]) {
return String::from_utf8(file_signature[1].to_vec()).unwrap();
}
}
// If header didn't match any known magic signatures,
// try to guess media type from file name
let parts: Vec<&str> = url.path().split('/').collect();
detect_media_type_by_file_name(parts.last().unwrap())
}
pub fn detect_media_type_by_file_name(filename: &str) -> String {
let filename_lowercased: &str = &filename.to_lowercase();
let parts: Vec<&str> = filename_lowercased.split('.').collect();
let mime: &str = match parts.last() {
Some(v) => match *v {
"avi" => "video/avi",
"bmp" => "image/bmp",
"css" => "text/css",
"flac" => "audio/flac",
"gif" => "image/gif",
"htm" | "html" => "text/html",
"ico" => "image/x-icon",
"jpeg" | "jpg" => "image/jpeg",
"js" => "text/javascript",
"json" => "application/json",
"jsonld" => "application/ld+json",
"mp3" => "audio/mpeg",
"mp4" | "m4v" => "video/mp4",
"ogg" => "audio/ogg",
"ogv" => "video/ogg",
"pdf" => "application/pdf",
"png" => "image/png",
"svg" => "image/svg+xml",
"swf" => "application/x-shockwave-flash",
"tif" | "tiff" => "image/tiff",
"txt" => "text/plain",
"wav" => "audio/wav",
"webp" => "image/webp",
"woff" => "font/woff",
"woff2" => "font/woff2",
"xhtml" => "application/xhtml+xml",
"xml" => "text/xml",
&_ => "",
},
None => "",
};
mime.to_string()
}
pub fn format_output_path(
path: &str,
document_title: &str,
output_format: MonolithOutputFormat,
) -> String {
let datetime: &str = &Utc::now().to_rfc3339_opts(SecondsFormat::Secs, true);
path.replace("%timestamp%", &datetime.replace(':', "_"))
.replace(
"%title%",
document_title
.to_string()
.replace(['/', '\\'], "_")
.replace('<', "[")
.replace('>', "]")
.replace(':', " - ")
.replace('\"', "")
.replace('|', "-")
.replace('?', "")
.trim_start_matches('.'),
)
.replace(
"%ext%",
if output_format == MonolithOutputFormat::HTML {
"htm"
} else if output_format == MonolithOutputFormat::MHTML {
"mht"
} else {
""
},
)
.replace(
"%extension%",
if output_format == MonolithOutputFormat::HTML {
"html"
} else if output_format == MonolithOutputFormat::MHTML {
"mhtml"
} else {
""
},
)
.to_string()
}
pub fn is_plaintext_media_type(media_type: &str) -> bool {
media_type.to_lowercase().as_str().starts_with("text/")
|| PLAINTEXT_MEDIA_TYPES.contains(&media_type.to_lowercase().as_str())
}
pub fn parse_content_type(content_type: &str) -> (String, String, bool) {
let mut media_type: String = "text/plain".to_string();
let mut charset: String = "US-ASCII".to_string();
let mut is_base64: bool = false;
// Parse meta data
let content_type_items: Vec<&str> = content_type.split(';').collect();
// Use enumerate(); a manual i8 counter would overflow past 127 items
for (i, item) in content_type_items.iter().enumerate() {
if i == 0 {
if !item.trim().is_empty() {
media_type = item.trim().to_string();
}
} else if item.trim().eq_ignore_ascii_case("base64") {
is_base64 = true;
} else if item.trim().starts_with("charset=") {
charset = item.trim().chars().skip(8).collect();
}
}
(media_type, charset, is_base64)
}
pub fn print_error_message(text: &str) {
let stderr = io::stderr();
let mut handle = stderr.lock();
const ENV_VAR_NO_COLOR: &str = "NO_COLOR";
const ENV_VAR_TERM: &str = "TERM";
let mut no_color = env::var_os(ENV_VAR_NO_COLOR).is_some() || atty::isnt(atty::Stream::Stderr);
if let Some(term) = env::var_os(ENV_VAR_TERM) {
if term == "dumb" {
no_color = true;
}
}
if handle
.write_all(
format!(
"{}{}{}\n",
if no_color { "" } else { ANSI_COLOR_RED },
&text,
if no_color { "" } else { ANSI_COLOR_RESET },
)
.as_bytes(),
)
.is_ok()
{}
}
pub fn print_info_message(text: &str) {
let stderr = io::stderr();
let mut handle = stderr.lock();
if handle.write_all(format!("{}\n", &text).as_bytes()).is_ok() {}
}
================================================
FILE: src/css.rs
================================================
use cssparser::{
serialize_identifier, serialize_string, ParseError, Parser, ParserInput, SourcePosition, Token,
};
use crate::session::Session;
use crate::url::{create_data_url, resolve_url, Url, EMPTY_IMAGE_DATA_URL};
const CSS_PROPS_WITH_IMAGE_URLS: &[&str] = &[
// Universal
"background",
"background-image",
"border-image",
"border-image-source",
"content",
"cursor",
"list-style",
"list-style-image",
"mask",
"mask-image",
// Specific to @counter-style
"additive-symbols",
"negative",
"pad",
"prefix",
"suffix",
"symbols",
];
pub fn embed_css(session: &mut Session, document_url: &Url, css: &str) -> String {
let mut input = ParserInput::new(css);
let mut parser = Parser::new(&mut input);
process_css(session, document_url, &mut parser, "", "", "").unwrap()
}
pub fn format_ident(ident: &str) -> String {
let mut res: String = "".to_string();
let _ = serialize_identifier(ident, &mut res);
res = res.trim_end().to_string();
res
}
pub fn format_quoted_string(string: &str) -> String {
let mut res: String = "".to_string();
let _ = serialize_string(string, &mut res);
res
}
pub fn is_image_url_prop(prop_name: &str) -> bool {
CSS_PROPS_WITH_IMAGE_URLS
.iter()
.any(|p| prop_name.eq_ignore_ascii_case(p))
}
pub fn process_css<'a>(
session: &mut Session,
document_url: &Url,
parser: &mut Parser,
rule_name: &str,
prop_name: &str,
func_name: &str,
) -> Result<String, ParseError<'a, String>> {
let mut result: String = "".to_string();
let mut curr_rule: String = rule_name.to_string();
let mut curr_prop: String = prop_name.to_string();
let mut token: &Token;
let mut token_offset: SourcePosition;
loop {
token_offset = parser.position();
token = match parser.next_including_whitespace_and_comments() {
Ok(token) => token,
Err(_) => {
break;
}
};
match *token {
Token::Comment(_) => {
let token_slice = parser.slice_from(token_offset);
result.push_str(token_slice);
}
Token::Semicolon => result.push(';'),
Token::Colon => result.push(':'),
Token::Comma => result.push(','),
Token::ParenthesisBlock | Token::SquareBracketBlock | Token::CurlyBracketBlock => {
if session.options.no_fonts && curr_rule == "font-face" {
continue;
}
let closure: &str;
if token == &Token::ParenthesisBlock {
result.push('(');
closure = ")";
} else if token == &Token::SquareBracketBlock {
result.push('[');
closure = "]";
} else {
result.push('{');
closure = "}";
}
let block_css: String = parser
.parse_nested_block(|parser| {
process_css(
session,
document_url,
parser,
rule_name,
curr_prop.as_str(),
func_name,
)
})
.unwrap();
result.push_str(block_css.as_str());
result.push_str(closure);
}
Token::CloseParenthesis => result.push(')'),
Token::CloseSquareBracket => result.push(']'),
Token::CloseCurlyBracket => result.push('}'),
Token::IncludeMatch => result.push_str("~="),
Token::DashMatch => result.push_str("|="),
Token::PrefixMatch => result.push_str("^="),
Token::SuffixMatch => result.push_str("$="),
Token::SubstringMatch => result.push_str("*="),
Token::CDO => result.push_str("<!--"),
Token::CDC => result.push_str("-->"),
Token::WhiteSpace(value) => {
result.push_str(value);
}
// div...
Token::Ident(ref value) => {
curr_rule = "".to_string();
curr_prop = value.to_string();
result.push_str(&format_ident(value));
}
// @import, @font-face, @charset, @media...
Token::AtKeyword(ref value) => {
curr_rule = value.to_string();
if session.options.no_fonts && curr_rule == "font-face" {
continue;
}
result.push('@');
result.push_str(value);
}
Token::Hash(ref value) => {
result.push('#');
result.push_str(value);
}
Token::QuotedString(ref value) => {
if curr_rule == "import" {
// Reset current at-rule value
curr_rule = "".to_string();
// Skip empty import values
if value.is_empty() {
result.push_str("''");
continue;
}
let import_full_url: Url = resolve_url(document_url, value);
match session.retrieve_asset(document_url, &import_full_url) {
Ok((
import_contents,
import_final_url,
import_media_type,
import_charset,
)) => {
let mut import_data_url = create_data_url(
&import_media_type,
&import_charset,
embed_css(
session,
&import_final_url,
&String::from_utf8_lossy(&import_contents),
)
.as_bytes(),
&import_final_url,
);
import_data_url.set_fragment(import_full_url.fragment());
result
.push_str(format_quoted_string(import_data_url.as_ref()).as_str());
}
Err(_) => {
// Keep remote reference if unable to retrieve the asset
if import_full_url.scheme() == "http"
|| import_full_url.scheme() == "https"
{
result.push_str(
format_quoted_string(import_full_url.as_ref()).as_str(),
);
}
}
}
} else if func_name == "url" {
// Skip empty url()'s
if value.is_empty() {
continue;
}
if session.options.no_images && is_image_url_prop(curr_prop.as_str()) {
result.push_str(format_quoted_string(EMPTY_IMAGE_DATA_URL).as_str());
} else {
let resolved_url: Url = resolve_url(document_url, value);
match session.retrieve_asset(document_url, &resolved_url) {
Ok((data, final_url, media_type, charset)) => {
// TODO: if it's @font-face, exclude definitions of non-woff/woff-2 fonts (if woff/woff-2 are present)
let mut data_url =
create_data_url(&media_type, &charset, &data, &final_url);
data_url.set_fragment(resolved_url.fragment());
result.push_str(format_quoted_string(data_url.as_ref()).as_str());
}
Err(_) => {
// Keep remote reference if unable to retrieve the asset
if resolved_url.scheme() == "http"
|| resolved_url.scheme() == "https"
{
result.push_str(
format_quoted_string(resolved_url.as_ref()).as_str(),
);
}
}
}
}
} else {
result.push_str(format_quoted_string(value).as_str());
}
}
Token::Number {
ref has_sign,
ref value,
..
} => {
if *has_sign && *value >= 0. {
result.push('+');
}
result.push_str(&value.to_string())
}
Token::Percentage {
ref has_sign,
ref unit_value,
..
} => {
if *has_sign && *unit_value >= 0. {
result.push('+');
}
result.push_str(&(unit_value * 100.0).to_string());
result.push('%');
}
Token::Dimension {
ref has_sign,
ref value,
ref unit,
..
} => {
if *has_sign && *value >= 0. {
result.push('+');
}
result.push_str(&value.to_string());
result.push_str(unit.as_ref());
}
// #selector, #id...
Token::IDHash(ref value) => {
curr_rule = "".to_string();
result.push('#');
result.push_str(&format_ident(value));
}
// url()
Token::UnquotedUrl(ref value) => {
let is_import: bool = curr_rule == "import";
if is_import {
// Reset current at-rule value
curr_rule = "".to_string();
}
// Skip empty url()'s
if value.is_empty() {
result.push_str("url()");
continue;
} else if value.starts_with('#') {
result.push_str("url(");
result.push_str(value);
result.push(')');
continue;
}
result.push_str("url(");
if is_import {
let full_url: Url = resolve_url(document_url, value);
match session.retrieve_asset(document_url, &full_url) {
Ok((css, final_url, media_type, charset)) => {
let mut data_url = create_data_url(
&media_type,
&charset,
embed_css(session, &final_url, &String::from_utf8_lossy(&css))
.as_bytes(),
&final_url,
);
data_url.set_fragment(full_url.fragment());
result.push_str(format_quoted_string(data_url.as_ref()).as_str());
}
Err(_) => {
// Keep remote reference if unable to retrieve the asset
if full_url.scheme() == "http" || full_url.scheme() == "https" {
result.push_str(format_quoted_string(full_url.as_ref()).as_str());
}
}
}
} else if is_image_url_prop(curr_prop.as_str()) && session.options.no_images {
result.push_str(format_quoted_string(EMPTY_IMAGE_DATA_URL).as_str());
} else {
let full_url: Url = resolve_url(document_url, value);
match session.retrieve_asset(document_url, &full_url) {
Ok((data, final_url, media_type, charset)) => {
let mut data_url =
create_data_url(&media_type, &charset, &data, &final_url);
data_url.set_fragment(full_url.fragment());
result.push_str(format_quoted_string(data_url.as_ref()).as_str());
}
Err(_) => {
// Keep remote reference if unable to retrieve the asset
if full_url.scheme() == "http" || full_url.scheme() == "https" {
result.push_str(format_quoted_string(full_url.as_ref()).as_str());
}
}
}
}
result.push(')');
}
// Single-character delimiters, e.g. =
Token::Delim(ref value) => result.push(*value),
Token::Function(ref name) => {
let function_name: &str = &name.clone();
result.push_str(function_name);
result.push('(');
let block_css: String = parser
.parse_nested_block(|parser| {
process_css(
session,
document_url,
parser,
curr_rule.as_str(),
curr_prop.as_str(),
function_name,
)
})
.unwrap();
result.push_str(block_css.as_str());
result.push(')');
}
Token::BadUrl(_) | Token::BadString(_) => {}
}
}
// Ensure empty CSS is really empty
if !result.is_empty() && result.trim().is_empty() {
result = result.trim().to_string()
}
Ok(result)
}
================================================
FILE: src/gui.rs
================================================
use std::fs;
use std::io::Write;
use std::path;
use std::thread;
use directories::UserDirs;
use druid::widget::{Button, Checkbox, Either, Flex, Label, Spinner, TextBox};
use druid::{
commands, AppDelegate, AppLauncher, Command, Data, DelegateCtx, Env, FileDialogOptions,
FileSpec, Handled, Lens, LocalizedString, PlatformError, Target, Widget, WidgetExt, WindowDesc,
};
use tempfile::{Builder, NamedTempFile};
use monolith::cache::Cache;
use monolith::core::{
create_monolithic_document, format_output_path, MonolithError, MonolithOptions,
MonolithOutputFormat,
};
use monolith::session::Session;
const CACHE_ASSET_FILE_SIZE_THRESHOLD: usize = 1024 * 20; // Minimum file size for on-disk caching (in bytes)
const FILESPEC_HTML: FileSpec = FileSpec::new("HTML files", &["html"]);
const MONOLITH_GUI_WRITE_OUTPUT: druid::Selector<(Vec<u8>, Option<String>)> =
druid::Selector::new("monolith-gui.write-output");
const MONOLITH_GUI_ERROR: druid::Selector<MonolithError> =
druid::Selector::new("monolith-gui.error");
const TEXT_BOX_WIDTH: f64 = 512_f64;
struct Delegate;
#[derive(Clone, Data, Lens)]
struct AppState {
target: String,
keep_fonts: bool,
keep_frames: bool,
keep_images: bool,
keep_scripts: bool,
keep_styles: bool,
output_path: String,
isolate: bool,
unwrap_noscript: bool,
busy: bool,
}
fn main() -> Result<(), PlatformError> {
let mut program_name: String = env!("CARGO_PKG_NAME").to_string();
if let Some(l) = program_name.get_mut(0..1) {
l.make_ascii_uppercase();
}
let main_window = WindowDesc::new(ui_builder())
.title(program_name)
.with_min_size((720_f64, 360_f64));
let state = AppState {
target: "".to_string(),
keep_fonts: false,
keep_frames: true,
keep_images: true,
keep_scripts: true,
keep_styles: true,
output_path: match UserDirs::new()
.and_then(|dirs| dirs.download_dir().map(|dir| dir.to_path_buf()))
{
// Avoid panicking when the platform reports no download directory
Some(download_dir) => download_dir.join("%title%.%ext%").display().to_string(),
None => "%title%.%ext%".to_string(),
},
isolate: true,
unwrap_noscript: false,
busy: false,
};
AppLauncher::with_window(main_window)
.delegate(Delegate)
.launch(state)
}
fn ui_builder() -> impl Widget<AppState> {
let target_label: Label<AppState> = Label::new("Target:");
let target_input = TextBox::new()
.with_placeholder("URL or filesystem path")
.fix_width(TEXT_BOX_WIDTH)
.lens(AppState::target)
.disabled_if(|state: &AppState, _env| state.busy);
let target_button = Button::new(LocalizedString::new("Open file"))
.on_click(|ctx, _, _| {
ctx.submit_command(
commands::SHOW_OPEN_PANEL.with(
FileDialogOptions::new()
.allowed_types(vec![FILESPEC_HTML])
.default_type(FILESPEC_HTML),
),
)
})
.disabled_if(|state: &AppState, _env| state.busy)
.padding(5.0);
let output_path_label: Label<AppState> = Label::new("Output path:");
let output_path_input = TextBox::new()
.with_placeholder("Filesystem path")
.fix_width(TEXT_BOX_WIDTH)
.lens(AppState::output_path)
.disabled_if(|state: &AppState, _env| state.busy);
let output_path_button = Button::new(LocalizedString::new("Browse"))
.on_click(|ctx, state: &mut AppState, _env| {
ctx.submit_command(
commands::SHOW_SAVE_PANEL.with(
FileDialogOptions::new()
// .force_starting_directory(
// state
// .output_path.clone()
// .split(path::MAIN_SEPARATOR).collect::<Vec<&str>>()[..2]
// .join(&path::MAIN_SEPARATOR.to_string())
// )
.default_name(
state
.output_path
.clone()
.split(path::MAIN_SEPARATOR)
.last()
.unwrap_or_default(),
),
),
)
})
.disabled_if(|state: &AppState, _env| state.busy)
.padding(5.0);
let fonts_checkbox = Checkbox::new("Include fonts")
.lens(AppState::keep_fonts)
.disabled_if(|state: &AppState, _env| state.busy)
.padding(5.0);
let frames_checkbox = Checkbox::new("Include frames")
.lens(AppState::keep_frames)
.disabled_if(|state: &AppState, _env| state.busy)
.padding(5.0);
let images_checkbox = Checkbox::new("Include images")
.lens(AppState::keep_images)
.disabled_if(|state: &AppState, _env| state.busy)
.padding(5.0);
let styles_checkbox = Checkbox::new("Include styles")
.lens(AppState::keep_styles)
.disabled_if(|state: &AppState, _env| state.busy)
.padding(5.0);
let scripts_checkbox = Checkbox::new("Include scripts")
.lens(AppState::keep_scripts)
.disabled_if(|state: &AppState, _env| state.busy)
.padding(5.0);
let isolate_checkbox = Checkbox::new("Isolate document")
.lens(AppState::isolate)
.disabled_if(|state: &AppState, _env| state.busy)
.padding(5.0);
let unwrap_noscript_checkbox = Checkbox::new("Unwrap NOSCRIPT")
.lens(AppState::unwrap_noscript)
.disabled_if(|state: &AppState, _env| state.busy)
.padding(5.0);
let start_stop_button = Button::new(LocalizedString::new("Start"))
.on_click(|ctx, state: &mut AppState, _env| {
if state.busy {
return;
}
let mut options: MonolithOptions = MonolithOptions::default();
options.ignore_errors = true;
options.insecure = true;
options.silent = true;
options.no_frames = !state.keep_frames;
options.no_fonts = !state.keep_fonts;
options.no_images = !state.keep_images;
options.no_css = !state.keep_styles;
options.no_js = !state.keep_scripts;
options.isolate = state.isolate;
options.unwrap_noscript = state.unwrap_noscript;
let handle = ctx.get_external_handle();
let thread_state = state.clone();
state.busy = true;
// Set up cache (attempt to create temporary file)
let temp_cache_file: Option<NamedTempFile> =
Builder::new().prefix("monolith-").tempfile().ok();
let cache = Some(Cache::new(
CACHE_ASSET_FILE_SIZE_THRESHOLD,
temp_cache_file
.as_ref()
.map(|file| file.path().display().to_string()),
));
let session: Session = Session::new(cache, None, options);
thread::spawn(
move || match create_monolithic_document(session, thread_state.target) {
Ok(result) => {
handle
.submit_command(MONOLITH_GUI_WRITE_OUTPUT, result, Target::Auto)
.unwrap();
// TODO: make it work again
//cache.unwrap().destroy_database_file();
}
Err(error) => {
handle
.submit_command(MONOLITH_GUI_ERROR, error, Target::Auto)
.unwrap();
// TODO: make it work again
//cache.unwrap().destroy_database_file();
}
},
);
})
.disabled_if(|state: &AppState, _env| {
state.busy || state.target.is_empty() || state.output_path.is_empty()
})
.padding(5.0);
let spinner = Either::new(
|state: &AppState, _env| state.busy,
Spinner::new(),
Label::new(""),
)
.padding(5.0);
Flex::column()
.with_spacer(5_f64)
.with_child(
Flex::row()
.with_child(target_label)
.with_spacer(5_f64)
.with_child(target_input)
.with_child(target_button),
)
.with_child(fonts_checkbox)
.with_child(frames_checkbox)
.with_child(images_checkbox)
.with_child(scripts_checkbox)
.with_child(styles_checkbox)
.with_child(
Flex::row()
.with_child(output_path_label)
.with_spacer(5_f64)
.with_child(output_path_input)
.with_child(output_path_button),
)
.with_child(
Flex::row()
.with_child(isolate_checkbox)
.with_child(unwrap_noscript_checkbox),
)
.with_child(start_stop_button)
.with_child(spinner)
.with_spacer(5_f64)
}
impl AppDelegate<AppState> for Delegate {
fn command(
&mut self,
_ctx: &mut DelegateCtx,
_target: Target,
cmd: &Command,
state: &mut AppState,
_env: &Env,
) -> Handled {
// Handle "Open file" button next to target input
if let Some(file_info) = cmd.get(commands::OPEN_FILE) {
state.target = file_info.path().display().to_string();
return Handled::Yes;
}
// Handle "Browse" button next to output path input
else if let Some(file_info) = cmd.get(commands::SAVE_FILE_AS) {
state.output_path = file_info.path().display().to_string();
return Handled::Yes;
}
// Write output
else if let Some(result) = cmd.get(MONOLITH_GUI_WRITE_OUTPUT) {
let (html, title) = result;
if !state.output_path.is_empty() {
match fs::File::create(format_output_path(
&state.output_path,
&title.clone().unwrap_or_default(),
MonolithOutputFormat::HTML,
)) {
Ok(mut file) => {
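// write() may stop after a partial write; write_all() writes the whole buffer
if file.write_all(html).is_err() {
eprintln!("Error: could not write output");
}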
}
Err(_) => {
eprintln!("Error: could not write output");
}
}
} else {
eprintln!("Error: no output specified");
}
state.busy = false;
return Handled::Yes;
}
// Handle errors
else if let Some(_error) = cmd.get(MONOLITH_GUI_ERROR) {
state.busy = false;
return Handled::Yes;
}
Handled::No
}
}
================================================
FILE: src/html.rs
================================================
use base64::{prelude::BASE64_STANDARD, Engine};
use chrono::{SecondsFormat, Utc};
use encoding_rs::Encoding;
use html5ever::interface::{Attribute, QualName};
use html5ever::parse_document;
use html5ever::serialize::{serialize, SerializeOpts};
use html5ever::tendril::{format_tendril, TendrilSink};
use html5ever::tree_builder::{create_element, TreeSink};
use html5ever::{namespace_url, ns, LocalName};
use markup5ever_rcdom::{Handle, NodeData, RcDom, SerializableHandle};
use regex::Regex;
use sha2::{Digest, Sha256, Sha384, Sha512};
use std::default::Default;
use crate::core::{parse_content_type, MonolithOptions};
use crate::css::embed_css;
use crate::js::attr_is_event_handler;
use crate::session::Session;
use crate::url::{
clean_url, create_data_url, is_url_and_has_protocol, resolve_url, Url, EMPTY_IMAGE_DATA_URL,
};
const FAVICON_VALUES: &[&str] = &["icon", "shortcut icon"];
const WHITESPACES: &[char] = &[' ', '\t', '\n', '\x0c', '\r']; // ASCII whitespaces
#[derive(PartialEq, Eq)]
pub enum LinkType {
Alternate,
AppleTouchIcon,
DnsPrefetch,
Favicon,
Preload,
Stylesheet,
}
pub struct SrcSetItem<'a> {
pub path: &'a str,
pub descriptor: &'a str, // Width or pixel density descriptor
}
pub fn add_favicon(document: &Handle, favicon_data_url: String) -> RcDom {
let mut buf: Vec<u8> = Vec::new();
serialize(
&mut buf,
&SerializableHandle::from(document.clone()),
SerializeOpts::default(),
)
.expect("unable to serialize DOM into buffer");
let dom = html_to_dom(&buf, "utf-8".to_string());
for head in find_nodes(&dom.document, vec!["html", "head"]).iter() {
let favicon_node = create_element(
&dom,
QualName::new(None, ns!(), LocalName::from("link")),
vec![
Attribute {
name: QualName::new(None, ns!(), LocalName::from("rel")),
value: format_tendril!("icon"),
},
Attribute {
name: QualName::new(None, ns!(), LocalName::from("href")),
value: format_tendril!("{}", favicon_data_url),
},
],
);
// Insert favicon LINK tag into HEAD
head.children.borrow_mut().push(favicon_node.clone());
}
dom
}
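/// Returns true if the base64-encoded SHA-256, SHA-384 or SHA-512 digest of `data`
/// matches the given `integrity` value; any other algorithm prefix yields false.
pub fn check_integrity(data: &[u8], integrity: &str) -> bool {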
if integrity.starts_with("sha256-") {
let mut hasher = Sha256::new();
hasher.update(data);
BASE64_STANDARD.encode(hasher.finalize()) == integrity[7..]
} else if integrity.starts_with("sha384-") {
let mut hasher = Sha384::new();
hasher.update(data);
BASE64_STANDARD.encode(hasher.finalize()) == integrity[7..]
} else if integrity.starts_with("sha512-") {
let mut hasher = Sha512::new();
hasher.update(data);
BASE64_STANDARD.encode(hasher.finalize()) == integrity[7..]
} else {
false
}
}
pub fn compose_csp(options: &MonolithOptions) -> String {
let mut string_list = vec![];
if options.isolate {
string_list.push("default-src 'unsafe-eval' 'unsafe-inline' data:;");
}
if options.no_css {
string_list.push("style-src 'none';");
}
if options.no_fonts {
string_list.push("font-src 'none';");
}
if options.no_frames {
string_list.push("frame-src 'none';");
string_list.push("child-src 'none';");
}
if options.no_js {
string_list.push("script-src 'none';");
}
if options.no_images {
// Note: "data:" is required for transparent pixel images to work
string_list.push("img-src data:;");
}
string_list.join(" ")
}
pub fn create_metadata_tag(url: &Url) -> String {
let datetime: &str = &Utc::now().to_rfc3339_opts(SecondsFormat::Secs, true);
let mut clean_url: Url = clean_url(url.clone());
// Prevent credentials from getting into metadata
if clean_url.scheme() == "http" || clean_url.scheme() == "https" {
// Only HTTP(S) URLs can contain credentials
clean_url.set_username("").unwrap();
clean_url.set_password(None).unwrap();
}
format!(
"<!-- Saved from {} at {} using {} v{} -->",
if clean_url.scheme() == "http" || clean_url.scheme() == "https" {
clean_url.as_str()
} else {
"local source"
},
datetime,
env!("CARGO_PKG_NAME"),
env!("CARGO_PKG_VERSION"),
)
}
pub fn embed_srcset(session: &mut Session, document_url: &Url, srcset: &str) -> String {
let srcset_items: Vec<SrcSetItem> = parse_srcset(srcset);
// Embed assets
let mut result: String = "".to_string();
let mut i: usize = srcset_items.len();
for srcset_item in srcset_items {
if session.options.no_images {
result.push_str(EMPTY_IMAGE_DATA_URL);
} else {
let image_full_url: Url = resolve_url(document_url, srcset_item.path);
match session.retrieve_asset(document_url, &image_full_url) {
Ok((image_data, image_final_url, image_media_type, image_charset)) => {
let mut image_data_url = create_data_url(
&image_media_type,
&image_charset,
&image_data,
&image_final_url,
);
// Append retrieved asset as a data URL
image_data_url.set_fragment(image_full_url.fragment());
result.push_str(image_data_url.as_ref());
}
Err(_) => {
// Keep remote reference if unable to retrieve the asset
if image_full_url.scheme() == "http" || image_full_url.scheme() == "https" {
result.push_str(image_full_url.as_ref());
} else {
// Avoid breaking the srcset structure if it is not an HTTP(S) URL
result.push_str(EMPTY_IMAGE_DATA_URL);
}
}
}
}
if !srcset_item.descriptor.is_empty() {
result.push(' ');
result.push_str(srcset_item.descriptor);
}
if i > 1 {
result.push_str(", ");
}
i -= 1;
}
result
}
pub fn find_nodes(node: &Handle, mut path: Vec<&str>) -> Vec<Handle> {
let mut result = vec![];
while !path.is_empty() {
match node.data {
NodeData::Document | NodeData::Element { .. } => {
// Dig deeper
for child in node.children.borrow().iter() {
if get_node_name(child)
.unwrap_or_default()
.eq_ignore_ascii_case(path[0])
{
if path.len() == 1 {
result.push(child.clone());
} else {
result.append(&mut find_nodes(child, path[1..].to_vec()));
}
}
}
}
_ => {}
}
path.remove(0);
}
result
}
pub fn get_base_url(handle: &Handle) -> Option<String> {
// Only the first BASE tag matters (the rest, if any, are ignored)
find_nodes(handle, vec!["html", "head", "base"])
.first()
.and_then(|base_node| get_node_attr(base_node, "href"))
}
pub fn get_charset(node: &Handle) -> Option<String> {
for meta_node in find_nodes(node, vec!["html", "head", "meta"]).iter() {
if let Some(meta_charset_node_attr_value) = get_node_attr(meta_node, "charset") {
// Processing <meta charset="..." />
return Some(meta_charset_node_attr_value);
}
if get_node_attr(meta_node, "http-equiv")
.unwrap_or_default()
.eq_ignore_ascii_case("content-type")
{
if let Some(meta_content_type_node_attr_value) = get_node_attr(meta_node, "content") {
// Processing <meta http-equiv="content-type" content="text/html; charset=..." />
let (_media_type, charset, _is_base64) =
parse_content_type(&meta_content_type_node_attr_value);
return Some(charset);
}
}
}
None
}
// TODO: get rid of this function (replace with find_nodes)
pub fn get_child_node_by_name(parent: &Handle, node_name: &str) -> Option<Handle> {
let children = parent.children.borrow();
let matching_children = children.iter().find(|child| match child.data {
NodeData::Element { ref name, .. } => &*name.local == node_name,
_ => false,
});
matching_children.cloned()
}
pub fn get_node_attr(node: &Handle, attr_name: &str) -> Option<String> {
match &node.data {
NodeData::Element { attrs, .. } => {
for attr in attrs.borrow().iter() {
if &*attr.name.local == attr_name {
return Some(attr.value.to_string());
}
}
None
}
_ => None,
}
}
pub fn get_node_name(node: &Handle) -> Option<&'_ str> {
match &node.data {
NodeData::Element { name, .. } => Some(name.local.as_ref()),
_ => None,
}
}
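pub fn get_parent_node(child: &Handle) -> Handle {
// Cell::take() leaves None behind, so restore the weak reference after upgrading it
let weak_parent = child.parent.take().expect("node has no parent");
let parent = weak_parent.upgrade().expect("dangling parent reference");
child.parent.set(Some(weak_parent));
parent
}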
pub fn get_robots(handle: &Handle) -> Option<String> {
for meta_node in find_nodes(handle, vec!["html", "head", "meta"]).iter() {
// Only the first robots META tag matters (the rest, if any, are ignored)
if get_node_attr(meta_node, "name")
.unwrap_or_default()
.eq_ignore_ascii_case("robots")
{
return get_node_attr(meta_node, "content");
}
}
None
}
pub fn get_title(node: &Handle) -> Option<String> {
for title_node in find_nodes(node, vec!["html", "head", "title"]).iter() {
for child_node in title_node.children.borrow().iter() {
if let NodeData::Text { ref contents } = child_node.data {
return Some(contents.borrow().to_string());
}
}
}
None
}
pub fn has_favicon(handle: &Handle) -> bool {
let mut found_favicon: bool = false;
for link_node in find_nodes(handle, vec!["html", "head", "link"]).iter() {
if let Some(attr_value) = get_node_attr(link_node, "rel") {
if is_favicon(attr_value.trim()) {
found_favicon = true;
break;
}
}
}
found_favicon
}
pub fn html_to_dom(data: &Vec<u8>, document_encoding: String) -> RcDom {
let s: String;
if let Some(encoding) = Encoding::for_label(document_encoding.as_bytes()) {
let (string, _, _) = encoding.decode(data);
s = string.to_string();
} else {
s = String::from_utf8_lossy(data).to_string();
}
parse_document(RcDom::default(), Default::default())
.from_utf8()
.read_from(&mut s.as_bytes())
.unwrap()
}
pub fn is_favicon(attr_value: &str) -> bool {
FAVICON_VALUES.contains(&attr_value.to_lowercase().as_str())
}
pub fn parse_link_type(link_attr_rel_value: &str) -> Vec<LinkType> {
let mut types: Vec<LinkType> = vec![];
for link_attr_rel_type in link_attr_rel_value.split_whitespace() {
if link_attr_rel_type.eq_ignore_ascii_case("alternate") {
types.push(LinkType::Alternate);
} else if link_attr_rel_type.eq_ignore_ascii_case("dns-prefetch") {
types.push(LinkType::DnsPrefetch);
} else if link_attr_rel_type.eq_ignore_ascii_case("preload") {
types.push(LinkType::Preload);
} else if link_attr_rel_type.eq_ignore_ascii_case("stylesheet") {
types.push(LinkType::Stylesheet);
} else if is_favicon(link_attr_rel_type) {
types.push(LinkType::Favicon);
} else if link_attr_rel_type.eq_ignore_ascii_case("apple-touch-icon") {
types.push(LinkType::AppleTouchIcon);
}
}
types
}
pub fn parse_srcset(srcset: &str) -> Vec<SrcSetItem> {
let mut srcset_items: Vec<SrcSetItem> = vec![];
// Parse srcset
let mut partials: Vec<&str> = srcset.split(WHITESPACES).collect();
let mut path: Option<&str> = None;
let mut descriptor: Option<&str> = None;
let mut i = 0;
while i < partials.len() {
let partial = partials[i];
i += 1;
// Skip empty strings
if partial.is_empty() {
continue;
}
if partial.ends_with(',') {
if path.is_none() {
path = Some(partial.strip_suffix(',').unwrap());
descriptor = Some("")
} else {
descriptor = Some(partial.strip_suffix(',').unwrap());
}
} else if path.is_none() {
path = Some(partial);
} else {
let mut chunks: Vec<&str> = partial.split(',').collect();
if !chunks.is_empty() && chunks.first().unwrap().ends_with(['x', 'w']) {
descriptor = Some(chunks.first().unwrap());
chunks.remove(0);
}
if !chunks.is_empty() {
if descriptor.is_some() {
partials.insert(0, &partial[descriptor.unwrap().len()..]);
} else {
partials.insert(0, partial);
}
}
}
if path.is_some() && descriptor.is_some() {
srcset_items.push(SrcSetItem {
path: path.unwrap(),
descriptor: descriptor.unwrap(),
});
path = None;
descriptor = None;
}
}
// Final attempt to process what was found
if path.is_some() {
srcset_items.push(SrcSetItem {
path: path.unwrap(),
descriptor: descriptor.unwrap_or_default(),
});
}
srcset_items
}
pub fn set_base_url(document: &Handle, base_href_value: String) -> RcDom {
let mut buf: Vec<u8> = Vec::new();
serialize(
&mut buf,
&SerializableHandle::from(document.clone()),
SerializeOpts::default(),
)
.expect("unable to serialize DOM into buffer");
let dom = html_to_dom(&buf, "utf-8".to_string());
if let Some(html_node) = get_child_node_by_name(&dom.document, "html") {
if let Some(head_node) = get_child_node_by_name(&html_node, "head") {
// Check if BASE node already exists in the DOM tree
if let Some(base_node) = get_child_node_by_name(&head_node, "base") {
set_node_attr(&base_node, "href", Some(base_href_value));
} else {
let base_node = create_element(
&dom,
QualName::new(None, ns!(), LocalName::from("base")),
vec![Attribute {
name: QualName::new(None, ns!(), LocalName::from("href")),
value: format_tendril!("{}", base_href_value),
}],
);
// Insert newly created BASE node into HEAD
head_node.children.borrow_mut().push(base_node.clone());
}
}
}
dom
}
pub fn set_charset(dom: RcDom, charset: String) -> RcDom {
for meta_node in find_nodes(&dom.document, vec!["html", "head", "meta"]).iter() {
if get_node_attr(meta_node, "charset").is_some() {
set_node_attr(meta_node, "charset", Some(charset));
return dom;
}
if get_node_attr(meta_node, "http-equiv")
.unwrap_or_default()
.eq_ignore_ascii_case("content-type")
&& get_node_attr(meta_node, "content").is_some()
{
set_node_attr(
meta_node,
"content",
Some(format!("text/html;charset={}", charset)),
);
return dom;
}
}
// Manually append charset META node to HEAD
{
let meta_charset_node: Handle = create_element(
&dom,
QualName::new(None, ns!(), LocalName::from("meta")),
vec![Attribute {
name: QualName::new(None, ns!(), LocalName::from("charset")),
value: format_tendril!("{}", charset),
}],
);
// Insert newly created META charset node into HEAD
for head_node in find_nodes(&dom.document, vec!["html", "head"]).iter() {
head_node
.children
.borrow_mut()
.push(meta_charset_node.clone());
break;
}
}
dom
}
pub fn set_node_attr(node: &Handle, attr_name: &str, attr_value: Option<String>) {
if let NodeData::Element { attrs, .. } = &node.data {
let attrs_mut = &mut attrs.borrow_mut();
let mut i = 0;
let mut found_existing_attr: bool = false;
while i < attrs_mut.len() {
if &attrs_mut[i].name.local == attr_name {
found_existing_attr = true;
if let Some(attr_value) = attr_value.clone() {
attrs_mut[i].value.clear();
attrs_mut[i].value.push_slice(attr_value.as_str());
} else {
// Remove attr completely if attr_value is not defined
attrs_mut.remove(i);
continue;
}
}
i += 1;
}
if !found_existing_attr {
// Add new attribute (since originally the target node didn't have it)
if let Some(attr_value) = attr_value.clone() {
let name = LocalName::from(attr_name);
attrs_mut.push(Attribute {
name: QualName::new(None, ns!(), name),
value: format_tendril!("{}", attr_value),
});
}
}
};
}
pub fn set_robots(dom: RcDom, content_value: &str) -> RcDom {
for meta_node in find_nodes(&dom.document, vec!["html", "head", "meta"]).iter() {
if get_node_attr(meta_node, "name")
.unwrap_or_default()
.eq_ignore_ascii_case("robots")
{
set_node_attr(meta_node, "content", Some(content_value.to_string()));
return dom;
}
}
// Manually append robots META node to HEAD
{
let meta_robots_node: Handle = create_element(
&dom,
QualName::new(None, ns!(), LocalName::from("meta")),
vec![
Attribute {
name: QualName::new(None, ns!(), LocalName::from("name")),
value: format_tendril!("robots"),
},
Attribute {
name: QualName::new(None, ns!(), LocalName::from("content")),
value: format_tendril!("{}", content_value),
},
],
);
// Insert newly created META robots node into HEAD
for head_node in find_nodes(&dom.document, vec!["html", "head"]).iter() {
head_node
.children
.borrow_mut()
.push(meta_robots_node.clone());
break;
}
}
dom
}
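/// Serializes the DOM into bytes: prepends a CSP META tag when any restricting
/// option is set, optionally unwraps NOSCRIPT elements, and re-encodes the
/// result into `document_encoding` when that label is recognized.
pub fn serialize_document(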
dom: RcDom,
document_encoding: String,
options: &MonolithOptions,
) -> Vec<u8> {
let mut buf: Vec<u8> = Vec::new();
if options.isolate
|| options.no_css
|| options.no_fonts
|| options.no_frames
|| options.no_js
|| options.no_images
{
// Take care of CSP
if let Some(html) = get_child_node_by_name(&dom.document, "html") {
if let Some(head) = get_child_node_by_name(&html, "head") {
let meta = create_element(
&dom,
QualName::new(None, ns!(), LocalName::from("meta")),
vec![
Attribute {
name: QualName::new(None, ns!(), LocalName::from("http-equiv")),
value: format_tendril!("Content-Security-Policy"),
},
Attribute {
name: QualName::new(None, ns!(), LocalName::from("content")),
value: format_tendril!("{}", compose_csp(options)),
},
],
);
// The CSP meta-tag has to be prepended, never appended,
// since there already may be one defined in the original document,
// and browsers don't allow re-defining them (for obvious reasons)
head.children.borrow_mut().insert(0, meta.clone());
}
}
}
let serializable: SerializableHandle = dom.document.into();
serialize(&mut buf, &serializable, SerializeOpts::default())
.expect("Unable to serialize DOM into buffer");
// Unwrap NOSCRIPT elements
if options.unwrap_noscript {
let s: &str = &String::from_utf8_lossy(&buf);
let noscript_re = Regex::new(r"<(?P<c>/?noscript[^>]*)>").unwrap();
buf = noscript_re.replace_all(s, "<!--$c-->").as_bytes().to_vec();
}
if !document_encoding.is_empty() {
if let Some(encoding) = Encoding::for_label(document_encoding.as_bytes()) {
let s: &str = &String::from_utf8_lossy(&buf);
let (data, _, _) = encoding.encode(s);
buf = data.to_vec();
}
}
buf
}
pub fn retrieve_and_embed_asset(
session: &mut Session,
document_url: &Url,
node: &Handle,
attr_name: &str,
attr_value: &str,
) {
let resolved_url: Url = resolve_url(document_url, attr_value);
match session.retrieve_asset(document_url, &resolved_url) {
Ok((data, final_url, media_type, charset)) => {
let node_name: &str = get_node_name(node).unwrap();
// Check integrity if it's a LINK or SCRIPT element
let mut ok_to_include: bool = true;
if node_name == "link" || node_name == "script" {
// Check integrity
if let Some(node_integrity_attr_value) = get_node_attr(node, "integrity") {
if !node_integrity_attr_value.is_empty() {
ok_to_include = check_integrity(&data, &node_integrity_attr_value);
}
// Wipe the integrity attribute
set_node_attr(node, "integrity", None);
}
}
if ok_to_include {
if node_name == "link"
&& parse_link_type(&get_node_attr(node, "rel").unwrap_or_default())
.contains(&LinkType::Stylesheet)
{
let stylesheet: String;
if let Some(encoding) = Encoding::for_label(charset.as_bytes()) {
let (string, _, _) = encoding.decode(&data);
stylesheet = string.to_string();
} else {
stylesheet = String::from_utf8_lossy(&data).to_string();
}
// Stylesheet LINK elements require special treatment
let css: String = embed_css(session, &final_url, &stylesheet);
// Create and embed data URL
let css_data_url =
create_data_url(&media_type, &charset, css.as_bytes(), &final_url);
set_node_attr(node, attr_name, Some(css_data_url.to_string()));
} else if node_name == "frame" || node_name == "iframe" {
// (I)FRAMEs are also quite different from conventional resources
let frame_dom = html_to_dom(&data, charset.clone());
walk(session, &final_url, &frame_dom.document);
let mut frame_data: Vec<u8> = Vec::new();
let serializable: SerializableHandle = frame_dom.document.into();
serialize(&mut frame_data, &serializable, SerializeOpts::default()).unwrap();
// Create and embed data URL
let mut frame_data_url =
create_data_url(&media_type, &charset, &frame_data, &final_url);
frame_data_url.set_fragment(resolved_url.fragment());
set_node_attr(node, attr_name, Some(frame_data_url.to_string()));
} else {
// Every other type of element gets processed here
// Parse media type for SCRIPT elements
if node_name == "script" {
let script_media_type =
get_node_attr(node, "type").unwrap_or(String::from("text/javascript"));
if script_media_type == "text/javascript"
|| script_media_type == "application/javascript"
{
// Embed javascript code instead of using data URLs
let script_dom: RcDom =
parse_document(RcDom::default(), Default::default())
.one("<script>;</script>");
for script_node in
find_nodes(&script_dom.document, vec!["html", "head", "script"])
.iter()
{
let text_node = &script_node.children.borrow()[0];
if let NodeData::Text { ref contents } = text_node.data {
let mut tendril = contents.borrow_mut();
tendril.clear();
tendril.push_slice(
&String::from_utf8_lossy(&data)
.replace("</script>", "<\\/script>"),
);
}
node.children.borrow_mut().push(text_node.clone());
set_node_attr(node, attr_name, None);
}
} else {
// Create and embed data URL
let mut data_url =
create_data_url(&script_media_type, &charset, &data, &final_url);
data_url.set_fragment(resolved_url.fragment());
set_node_attr(node, attr_name, Some(data_url.to_string()));
}
} else {
// Create and embed data URL
let mut data_url =
create_data_url(&media_type, &charset, &data, &final_url);
data_url.set_fragment(resolved_url.fragment());
set_node_attr(node, attr_name, Some(data_url.to_string()));
}
}
}
}
Err(_) => {
if resolved_url.scheme() == "http" || resolved_url.scheme() == "https" {
// Keep remote references if unable to retrieve the asset
set_node_attr(node, attr_name, Some(resolved_url.to_string()));
} else {
// Remove local references if they can't be successfully embedded as data URLs
set_node_attr(node, attr_name, None);
}
}
}
}
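// A minimal sketch, not part of the original source, of the escaping applied above when
// external JavaScript is inlined into a SCRIPT element: a literal "</script>" inside the
// code would otherwise terminate the element early. The function name is hypothetical.
#[allow(dead_code)]
fn example_escape_inline_script(code: &str) -> String {
    // Same replacement as in retrieve_and_embed_asset(); the match is case-sensitive
    code.replace("</script>", "<\\/script>")
}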
pub fn walk(session: &mut Session, document_url: &Url, node: &Handle) {
match node.data {
NodeData::Document => {
// Dig deeper
for child_node in node.children.borrow().iter() {
walk(session, document_url, child_node);
}
}
NodeData::Element {
ref name,
ref attrs,
..
} => {
match name.local.as_ref() {
"meta" => {
if let Some(meta_attr_http_equiv_value) = get_node_attr(node, "http-equiv") {
let meta_attr_http_equiv_value: &str = &meta_attr_http_equiv_value;
if meta_attr_http_equiv_value.eq_ignore_ascii_case("refresh")
|| meta_attr_http_equiv_value.eq_ignore_ascii_case("location")
{
// Remove http-equiv attributes from META nodes if they're able to control the page
set_node_attr(node, "http-equiv", None);
}
}
}
"link" => {
let link_node_types: Vec<LinkType> =
parse_link_type(&get_node_attr(node, "rel").unwrap_or_default());
if link_node_types.contains(&LinkType::Favicon)
|| link_node_types.contains(&LinkType::AppleTouchIcon)
{
// Find and resolve LINK's href attribute
if let Some(link_attr_href_value) = get_node_attr(node, "href") {
if !session.options.no_images && !link_attr_href_value.is_empty() {
retrieve_and_embed_asset(
session,
document_url,
node,
"href",
&link_attr_href_value,
);
} else {
set_node_attr(node, "href", None);
}
}
} else if link_node_types.contains(&LinkType::Stylesheet) {
// Resolve LINK's href attribute
if let Some(link_attr_href_value) = get_node_attr(node, "href") {
if session.options.no_css {
set_node_attr(node, "href", None);
// Wipe integrity attribute
set_node_attr(node, "integrity", None);
} else if !link_attr_href_value.is_empty() {
retrieve_and_embed_asset(
session,
document_url,
node,
"href",
&link_attr_href_value,
);
}
}
} else if link_node_types.contains(&LinkType::Preload)
|| link_node_types.contains(&LinkType::DnsPrefetch)
{
// Since all resources are embedded as data URLs, preloading and prefetching are not necessary
set_node_attr(node, "rel", None);
} else {
// Make sure that all other LINKs' href attributes are full URLs
if let Some(link_attr_href_value) = get_node_attr(node, "href") {
let href_full_url: Url =
resolve_url(document_url, &link_attr_href_value);
set_node_attr(node, "href", Some(href_full_url.to_string()));
}
}
}
"base" => {
if document_url.scheme() == "http" || document_url.scheme() == "https" {
// Ensure the BASE node doesn't have a relative URL
if let Some(base_attr_href_value) = get_node_attr(node, "href") {
let href_full_url: Url =
resolve_url(document_url, &base_attr_href_value);
set_node_attr(node, "href", Some(href_full_url.to_string()));
}
}
}
"body" => {
// Read and remember background attribute value of this BODY node
if let Some(body_attr_background_value) = get_node_attr(node, "background") {
// Remove background BODY node attribute by default
set_node_attr(node, "background", None);
if !session.options.no_images && !body_attr_background_value.is_empty() {
retrieve_and_embed_asset(
session,
document_url,
node,
"background",
&body_attr_background_value,
);
}
}
}
"img" => {
// Find src and data-src attribute(s)
let img_attr_src_value: Option<String> = get_node_attr(node, "src");
let img_attr_data_src_value: Option<String> = get_node_attr(node, "data-src");
if session.options.no_images {
// Put empty images into src and data-src attributes
if img_attr_src_value.is_some() {
set_node_attr(node, "src", Some(EMPTY_IMAGE_DATA_URL.to_string()));
}
if img_attr_data_src_value.is_some() {
set_node_attr(node, "data-src", Some(EMPTY_IMAGE_DATA_URL.to_string()));
}
} else if img_attr_src_value.as_deref().unwrap_or_default().is_empty()
&& img_attr_data_src_value
.as_deref()
.unwrap_or_default()
.is_empty()
{
// Add empty src attribute
set_node_attr(node, "src", Some("".to_string()));
} else {
// Embed the image as a data URL in src, preferring data-src over src as the source
let img_full_url: String = if !img_attr_data_src_value
.as_deref()
.unwrap_or_default()
.is_empty()
{
img_attr_data_src_value.unwrap_or_default()
} else {
img_attr_src_value.unwrap_or_default()
};
retrieve_and_embed_asset(session, document_url, node, "src", &img_full_url);
}
// Resolve srcset attribute
if let Some(img_srcset) = get_node_attr(node, "srcset") {
if !img_srcset.is_empty() {
let resolved_srcset: String =
embed_srcset(session, document_url, &img_srcset);
set_node_attr(node, "srcset", Some(resolved_srcset));
}
}
}
"input" => {
if let Some(input_attr_type_value) = get_node_attr(node, "type") {
if input_attr_type_value.eq_ignore_ascii_case("image") {
if let Some(input_attr_src_value) = get_node_attr(node, "src") {
if session.options.no_images || input_attr_src_value.is_empty() {
let value = if input_attr_src_value.is_empty() {
""
} else {
EMPTY_IMAGE_DATA_URL
};
set_node_attr(node, "src", Some(value.to_string()));
} else {
retrieve_and_embed_asset(
session,
document_url,
node,
"src",
&input_attr_src_value,
);
}
}
}
}
}
"svg" => {
if session.options.no_images {
// Remove all children
node.children.borrow_mut().clear();
}
}
"image" => {
let attr_names: [&str; 2] = ["href", "xlink:href"];
for attr_name in attr_names.into_iter() {
if let Some(image_attr_href_value) = get_node_attr(node, attr_name) {
if session.options.no_images {
set_node_attr(node, attr_name, None);
} else {
retrieve_and_embed_asset(
session,
document_url,
node,
attr_name,
&image_attr_href_value,
);
}
}
}
}
"use" => {
let attr_names: [&str; 2] = ["href", "xlink:href"];
for attr_name in attr_names.into_iter() {
if let Some(use_attr_href_value) = get_node_attr(node, attr_name) {
if session.options.no_images {
set_node_attr(node, attr_name, None);
} else {
let image_asset_url: Url =
resolve_url(document_url, &use_attr_href_value);
match session.retrieve_asset(document_url, &image_asset_url) {
Ok((data, final_url, media_type, charset)) => {
if media_type == "image/svg+xml" {
// Parse SVG
let svg_dom: RcDom = parse_document(
RcDom::default(),
Default::default(),
)
.from_utf8()
.read_from(&mut data.as_slice())
.unwrap();
if image_asset_url.fragment().is_some() {
// Take only that one #fragment symbol from SVG and replace this image|use with that node
let single_symbol_node = create_element(
&svg_dom,
QualName::new(
None,
ns!(),
LocalName::from("symbol"),
),
vec![],
);
for symbol_node in find_nodes(
&svg_dom.document,
vec!["html", "body", "svg", "defs", "symbol"],
)
.iter()
{
if get_node_attr(symbol_node, "id")
.unwrap_or_default()
== image_asset_url.fragment().unwrap()
{
svg_dom.reparent_children(
symbol_node,
&single_symbol_node,
);
set_node_attr(
&single_symbol_node,
"id",
Some(
image_asset_url
.fragment()
.unwrap()
.to_string(),
),
);
set_node_attr(
node,
attr_name,
Some(format!(
"#{}",
image_asset_url.fragment().unwrap()
)),
);
break;
}
}
node.children
.borrow_mut()
.push(single_symbol_node.clone());
} else {
// Replace this image|use with whole DOM of that SVG file
// (only the first SVG root found is used)
if let Some(svg_node) = find_nodes(
&svg_dom.document,
vec!["html", "body", "svg"],
)
.first()
{
svg_dom.reparent_children(svg_node, node);
}
// TODO: decide if we resort to using data URL here or stick with embedding the DOM
}
} else {
// It's likely a raster image; embed it as data URL
let image_asset_data: Url = create_data_url(
&media_type,
&charset,
&data,
&final_url,
);
set_node_attr(
node,
attr_name,
Some(image_asset_data.to_string()),
);
}
}
Err(_) => {
set_node_attr(
node,
attr_name,
Some(image_asset_url.to_string()),
);
}
}
}
}
}
}
"source" => {
let parent_node = get_parent_node(node);
let parent_node_name: &str = get_node_name(&parent_node).unwrap_or_default();
if let Some(source_attr_src_value) = get_node_attr(node, "src") {
if parent_node_name == "audio" {
if session.options.no_audio {
set_node_attr(node, "src", None);
} else {
retrieve_and_embed_asset(
session,
document_url,
node,
"src",
&source_attr_src_value,
);
}
} else if parent_node_name == "video" {
if session.options.no_video {
set_node_attr(node, "src", None);
} else {
retrieve_and_embed_asset(
session,
document_url,
node,
"src",
&source_attr_src_value,
);
}
}
}
if let Some(source_attr_srcset_value) = get_node_attr(node, "srcset") {
if parent_node_name == "picture" && !source_attr_srcset_value.is_empty() {
if session.options.no_images {
set_node_attr(
node,
"srcset",
Some(EMPTY_IMAGE_DATA_URL.to_string()),
);
} else {
let resolved_srcset: String =
embed_srcset(session, document_url, &source_attr_srcset_value);
set_node_attr(node, "srcset", Some(resolved_srcset));
}
}
}
}
"a" | "area" => {
if let Some(anchor_attr_href_value) = get_node_attr(node, "href") {
if anchor_attr_href_value.trim().starts_with("javascript:") {
if session.options.no_js {
// Replace with empty JS call to preserve original behavior
set_node_attr(node, "href", Some("javascript:;".to_string()));
}
} else {
// Don't touch hrefs which begin with a hash sign or already have a scheme (e.g. mailto:)
if !anchor_attr_href_value.starts_with('#')
&& !is_url_and_has_protocol(&anchor_attr_href_value)
{
let href_full_url: Url =
resolve_url(document_url, &anchor_attr_href_value);
set_node_attr(node, "href", Some(href_full_url.to_string()));
}
}
}
}
"script" => {
// Read value of the src attribute
let script_attr_src: &str = &get_node_attr(node, "src").unwrap_or_default();
if session.options.no_js {
// Empty inner content
node.children.borrow_mut().clear();
// Remove src attribute
if !script_attr_src.is_empty() {
set_node_attr(node, "src", None);
// Wipe integrity attribute
set_node_attr(node, "integrity", None);
}
} else if !script_attr_src.is_empty() {
retrieve_and_embed_asset(
session,
document_url,
node,
"src",
script_attr_src,
);
}
}
"style" => {
if session.options.no_css {
// Empty inner content of STYLE tags
node.children.borrow_mut().clear();
} else {
for child_node in node.children.borrow_mut().iter_mut() {
if let NodeData::Text { ref contents } = child_node.data {
let mut tendril = contents.borrow_mut();
let replacement =
embed_css(session, document_url, tendril.as_ref());
tendril.clear();
tendril.push_slice(&replacement);
}
}
}
}
"form" => {
if let Some(form_attr_action_value) = get_node_attr(node, "action") {
// Modify action property to ensure it's a full URL
let form_action_full_url: Url =
resolve_url(document_url, &form_attr_action_value);
set_node_attr(node, "action", Some(form_action_full_url.to_string()));
}
}
"frame" | "iframe" => {
if let Some(frame_attr_src_value) = get_node_attr(node, "src") {
if session.options.no_frames {
// Empty the src attribute
set_node_attr(node, "src", Some("".to_string()));
} else {
// Ignore (i)frames with empty source (they cause infinite loops)
if !frame_attr_src_value.trim().is_empty() {
retrieve_and_embed_asset(
session,
document_url,
node,
"src",
&frame_attr_src_value,
);
}
}
}
}
"audio" => {
// Embed audio source
if let Some(audio_attr_src_value) = get_node_attr(node, "src") {
if session.options.no_audio {
set_node_attr(node, "src", None);
} else {
retrieve_and_embed_asset(
session,
document_url,
node,
"src",
&audio_attr_src_value,
);
}
}
}
"video" => {
// Embed video source
if let Some(video_attr_src_value) = get_node_attr(node, "src") {
if session.options.no_video {
set_node_attr(node, "src", None);
} else {
retrieve_and_embed_asset(
session,
document_url,
node,
"src",
&video_attr_src_value,
);
}
}
// Embed poster images
if let Some(video_attr_poster_value) = get_node_attr(node, "poster") {
// Skip posters with empty source
if !video_attr_poster_value.is_empty() {
if session.options.no_images {
set_node_attr(
node,
"poster",
Some(EMPTY_IMAGE_DATA_URL.to_string()),
);
} else {
retrieve_and_embed_asset(
session,
document_url,
node,
"poster",
&video_attr_poster_value,
);
}
}
}
}
"noscript" => {
for child_node in node.children.borrow_mut().iter_mut() {
if let NodeData::Text { ref contents } = child_node.data {
// Get contents of NOSCRIPT node
let mut noscript_contents = contents.borrow_mut();
// Parse contents of NOSCRIPT node as DOM
let noscript_contents_dom: RcDom =
html_to_dom(&noscript_contents.as_bytes().to_vec(), "".to_string());
// Embed assets of NOSCRIPT node contents
walk(session, document_url, &noscript_contents_dom.document);
// Get rid of original contents
noscript_contents.clear();
// Insert HTML containing embedded assets into NOSCRIPT node
if let Some(html) =
get_child_node_by_name(&noscript_contents_dom.document, "html")
{
if let Some(body) = get_child_node_by_name(&html, "body") {
let mut buf: Vec<u8> = Vec::new();
let serializable: SerializableHandle = body.into();
serialize(&mut buf, &serializable, SerializeOpts::default())
.expect("Unable to serialize DOM into buffer");
let result = String::from_utf8_lossy(&buf);
noscript_contents.push_slice(&result);
}
}
}
}
}
_ => {}
}
// Process style attributes
if session.options.no_css {
// Get rid of style attributes
set_node_attr(node, "style", None);
} else {
// Embed URLs found within the style attribute of this node
if let Some(node_attr_style_value) = get_node_attr(node, "style") {
let embedded_style = embed_css(session, document_url, &node_attr_style_value);
set_node_attr(node, "style", Some(embedded_style));
}
}
// Strip all JS from document
if session.options.no_js {
// Get rid of JS event attributes
attrs
.borrow_mut()
.retain(|attr| !attr_is_event_handler(&attr.name.local));
}
// Dig deeper
for child_node in node.children.borrow().iter() {
walk(session, document_url, child_node);
}
}
_ => {
// Note: when options.no_js is set, there's no need to strip comments that may contain
// scripts, e.g. <!--[if IE]><script>..., since conditional comments are not part of the
// W3C standard and therefore get ignored by browsers other than
// Internet Explorer 5 through 9
}
}
}
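// A minimal sketch, not part of the original source, of how the "img" branch of walk()
// picks the URL to embed: a non-empty data-src wins over src, and None means that both
// are missing or empty. The function name is hypothetical.
#[allow(dead_code)]
fn example_pick_img_url<'a>(src: Option<&'a str>, data_src: Option<&'a str>) -> Option<&'a str> {
    data_src
        .filter(|value| !value.is_empty())
        .or(src.filter(|value| !value.is_empty()))
}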
================================================
FILE: src/js.rs
================================================
const JS_DOM_EVENT_ATTRS: &[&str] = &[
// From WHATWG HTML spec 8.1.5.2 "Event handlers on elements, Document objects, and Window objects":
// https://html.spec.whatwg.org/#event-handlers-on-elements,-document-objects,-and-window-objects
// https://html.spec.whatwg.org/#attributes-3 (table "List of event handler content attributes")
// Global event handlers
"onabort",
"onauxclick",
"onblur",
"oncancel",
"oncanplay",
"oncanplaythrough",
"onchange",
"onclick",
"onclose",
"oncontextmenu",
"oncuechange",
"ondblclick",
"ondrag",
"ondragend",
"ondragenter",
"ondragexit",
"ondragleave",
"ondragover",
"ondragstart",
"ondrop",
"ondurationchange",
"onemptied",
"onended",
"onerror",
"onfocus",
"onformdata",
"oninput",
"oninvalid",
"onkeydown",
"onkeypress",
"onkeyup",
"onload",
"onloadeddata",
"onloadedmetadata",
"onloadstart",
"onmousedown",
"onmouseenter",
"onmouseleave",
"onmousemove",
"onmouseout",
"onmouseover",
"onmouseup",
"onwheel",
"onpause",
"onplay",
"onplaying",
"onprogress",
"onratechange",
"onreset",
"onresize",
"onscroll",
"onsecuritypolicyviolation",
"onseeked",
"onseeking",
"onselect",
"onslotchange",
"onstalled",
"onsubmit",
"onsuspend",
"ontimeupdate",
"ontoggle",
"onvolumechange",
"onwaiting",
"onwebkitanimationend",
"onwebkitanimationiteration",
"onwebkitanimationstart",
"onwebkittransitionend",
// Event handlers for <body/> and <frameset/> elements
"onafterprint",
"onbeforeprint",
"onbeforeunload",
"onhashchange",
"onlanguagechange",
"onmessage",
"onmessageerror",
"onoffline",
"ononline",
"onpagehide",
"onpageshow",
"onpopstate",
"onrejectionhandled",
"onstorage",
"onunhandledrejection",
"onunload",
// Event handlers for <html/> element
"oncut",
"oncopy",
"onpaste",
];
// Returns true if DOM attribute name matches a native JavaScript event handler
pub fn attr_is_event_handler(attr_name: &str) -> bool {
JS_DOM_EVENT_ATTRS
.iter()
.any(|a| attr_name.eq_ignore_ascii_case(a))
}
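// A minimal sketch, not part of the original module, of the case-insensitive filtering
// that walk() applies when JavaScript is removed. To stay self-contained it uses a short
// hypothetical sample list instead of JS_DOM_EVENT_ATTRS; all names here are illustrative.
#[allow(dead_code)]
fn example_strip_event_handlers(attr_names: &mut Vec<String>) {
    const SAMPLE_EVENT_ATTRS: &[&str] = &["onclick", "onload", "onerror"];
    // Keep only the attributes that do not name an event handler
    attr_names.retain(|name| {
        !SAMPLE_EVENT_ATTRS
            .iter()
            .any(|a| name.eq_ignore_ascii_case(a))
    });
}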
================================================
FILE: src/lib.rs
================================================
pub mod cache;
pub mod cookies;
pub mod core;
pub mod css;
pub mod html;
pub mod js;
pub mod session;
pub mod url;
================================================
FILE: src/main.rs
================================================
use std::fs;
use std::io::{self, Error as IoError, Read, Write};
use std::process;
use clap::Parser;
use tempfile::{Builder, NamedTempFile};
use monolith::cache::Cache;
use monolith::cookies::{parse_cookie_file_contents, Cookie};
use monolith::core::{
create_monolithic_document, create_monolithic_document_from_data, format_output_path,
print_error_message, MonolithOptions, MonolithOutputFormat,
};
use monolith::session::Session;
const ASCII: &str = " \
_____ _____________ __________ ___________________ ___
| \\ / \\ | | | | | |
| \\/ __ \\| __ | | ___ ___ |__| |
| | | | | | | | | | | |
| |\\ /| |__| |__| |___| | | | | __ |
| | \\__/ | |\\ | | | | | | |
|___| |__________| \\___________________| |___| |___| |___|
";
const CACHE_ASSET_FILE_SIZE_THRESHOLD: usize = 1024 * 10; // Minimum file size for on-disk caching (in bytes)
const DEFAULT_NETWORK_TIMEOUT: u64 = 120; // Maximum time to retrieve each remote asset (in seconds)
const DEFAULT_USER_AGENT: &str =
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:135.0) Gecko/20100101 Firefox/135.0";
#[derive(Parser)]
#[command(name = env!("CARGO_PKG_NAME"))]
#[command(version)] // Read version from Cargo.toml
#[command(about = ASCII.to_owned() + "\n" + env!("CARGO_PKG_NAME") + " " + env!("CARGO_PKG_VERSION") + "\n\n" + env!("CARGO_PKG_DESCRIPTION"), long_about = None)]
struct Cli {
/// Remove audio sources
#[arg(short = 'a', long)]
no_audio: bool,
/// Set custom base URL
#[arg(short, long, value_name = "http://localhost/")]
base_url: Option<String>,
/// Treat specified domains as blacklist
#[arg(short = 'B', long)]
blacklist_domains: bool,
/// Remove CSS
#[arg(short = 'c', long)]
no_css: bool,
/// Specify cookie file
#[arg(short = 'C', long, value_name = "cookies.txt")]
cookie_file: Option<String>,
/// Specify domains to use for white/black-listing
#[arg(short = 'd', long = "domain", value_name = "example.com")]
domains: Vec<String>,
/// Ignore network errors
#[arg(short = 'e', long)]
ignore_errors: bool,
/// Enforce custom charset
#[arg(short = 'E', long, value_name = "UTF-8")]
encoding: Option<String>,
/// Remove frames and iframes
#[arg(short = 'f', long)]
no_frames: bool,
/// Remove fonts
#[arg(short = 'F', long)]
no_fonts: bool,
/// Remove images
#[arg(short = 'i', long)]
no_images: bool,
/// Cut off document from the Internet
#[arg(short = 'I', long)]
isolate: bool,
/// Remove JavaScript
#[arg(short = 'j', long)]
no_js: bool,
/// Allow invalid X.509 (TLS) certificates
#[arg(short = 'k', long)]
insecure: bool,
/// Use MHTML as output format
#[arg(short = 'm', long)]
mhtml: bool,
/// Exclude timestamp and source information
#[arg(short = 'M', long)]
no_metadata: bool,
/// Replace NOSCRIPT elements with their contents
#[arg(short = 'n', long)]
unwrap_noscript: bool,
/// File to write to, use - for STDOUT
#[arg(short, long, value_name = "result.html")]
output: Option<String>,
/// Suppress verbosity
#[arg(short, long)]
quiet: bool,
/// Adjust network request timeout
#[arg(short, long, value_name = "60")]
timeout: Option<u64>,
/// Set custom User-Agent string
#[arg(short, long, value_name = "Firefox")]
user_agent: Option<String>,
/// Remove video sources
#[arg(short = 'v', long)]
no_video: bool,
/// URL or file path, use - for STDIN
target: String,
}
pub enum Output {
Stdout(io::Stdout),
File(fs::File),
}
impl Output {
fn new(
destination: &str,
document_title: &str,
format: MonolithOutputFormat,
) -> Result<Output, IoError> {
if destination.is_empty() || destination.eq("-") {
Ok(Output::Stdout(io::stdout()))
} else {
let final_destination = format_output_path(destination, document_title, format);
Ok(Output::File(fs::File::create(final_destination)?))
}
}
fn write(&mut self, bytes: &[u8]) -> Result<(), IoError> {
match self {
Output::Stdout(stdout) => {
stdout.write_all(bytes)?;
stdout.flush()
}
Output::File(file) => {
file.write_all(bytes)?;
file.flush()
}
}
}
}
pub fn read_stdin() -> Vec<u8> {
let mut buffer: Vec<u8> = vec![];
// Read errors are ignored; whatever was read before the failure is returned
let _ = io::stdin().lock().read_to_end(&mut buffer);
buffer
}
fn main() {
let cli = Cli::parse();
let cookie_file_path;
let mut exit_code = 0;
let mut options: MonolithOptions = MonolithOptions::default();
let destination;
// Process the command
{
options.base_url = cli.base_url;
options.blacklist_domains = cli.blacklist_domains;
options.encoding = cli.encoding;
if !cli.domains.is_empty() {
options.domains = Some(cli.domains);
}
options.ignore_errors = cli.ignore_errors;
options.insecure = cli.insecure;
options.isolate = cli.isolate;
options.no_audio = cli.no_audio;
options.no_css = cli.no_css;
options.no_fonts = cli.no_fonts;
options.no_frames = cli.no_frames;
options.no_images = cli.no_images;
options.no_js = cli.no_js;
if cli.mhtml {
options.output_format = MonolithOutputFormat::MHTML;
// The MHTML format doesn't allow JavaScript
options.no_js = true;
}
options.no_metadata = cli.no_metadata;
options.no_video = cli.no_video;
options.silent = cli.quiet;
options.timeout = cli.timeout.unwrap_or(DEFAULT_NETWORK_TIMEOUT);
options.unwrap_noscript = cli.unwrap_noscript;
// Fall back to the default User-Agent string if none was specified
options.user_agent = Some(
cli.user_agent
.unwrap_or_else(|| DEFAULT_USER_AGENT.to_string()),
);
cookie_file_path = cli.cookie_file;
destination = cli.output.clone();
}
// Set up cache (attempt to create temporary file)
let temp_cache_file: Option<NamedTempFile> =
Builder::new().prefix("monolith-").tempfile().ok();
let cache = Some(Cache::new(
CACHE_ASSET_FILE_SIZE_THRESHOLD,
// Path of the temporary file (None if it could not be created)
temp_cache_file
.as_ref()
.map(|file| file.path().display().to_string()),
));
// Read and parse cookie file
let mut cookies: Option<Vec<Cookie>> = None;
if let Some(opt_cookie_file) = cookie_file_path.clone() {
match fs::read_to_string(&opt_cookie_file) {
Ok(str) => match parse_cookie_file_contents(&str) {
Ok(parsed_cookies_from_file) => {
cookies = Some(parsed_cookies_from_file);
}
Err(_) => {
if !options.silent {
print_error_message(&format!(
"could not parse specified cookie file \"{}\"",
opt_cookie_file
));
}
process::exit(1);
}
},
Err(_) => {
if !options.silent {
print_error_message(&format!(
"could not read specified cookie file \"{}\"",
opt_cookie_file
));
}
process::exit(1);
}
}
}
// Initiate session
let output_format = options.output_format.clone();
let silent = options.silent;
let session: Session = Session::new(cache, cookies, options);
// Retrieve target from source and output result
if cli.target == "-" {
// Read input from pipe (STDIN)
let data: Vec<u8> = read_stdin();
match create_monolithic_document_from_data(session, data, None, None) {
Ok((result, title)) => {
// Define output
let mut output = Output::new(
&destination.unwrap_or(String::new()),
&title.unwrap_or_default(),
output_format,
)
.expect("could not prepare output");
// Write result into STDOUT or file
output.write(&result).expect("could not write output");
}
Err(error) => {
if !silent {
print_error_message(&format!("Error: {}", error));
}
exit_code = 1;
}
}
} else {
match create_monolithic_document(session, cli.target) {
Ok((result, title)) => {
// Define output
let mut output = Output::new(
&destination.unwrap_or(String::new()),
&title.unwrap_or_default(),
output_format,
)
.expect("could not prepare output");
// Write result into STDOUT or file
output.write(&result).expect("could not write output");
}
Err(error) => {
if !silent {
print_error_message(&format!("Error: {}", error));
}
exit_code = 1;
}
}
}
// TODO: bring this back
// Clean up (shred database file)
//cache.unwrap().destroy_database_file();
if exit_code > 0 {
process::exit(exit_code);
}
}
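// A minimal sketch, not part of the original source, of how the effective network timeout
// is derived: main() falls back to DEFAULT_NETWORK_TIMEOUT (120 s) when no timeout is given,
// and Session::new() substitutes 600 s for a timeout of 0. The function name is hypothetical.
#[allow(dead_code)]
fn example_effective_timeout_secs(cli_timeout: Option<u64>) -> u64 {
    let requested = cli_timeout.unwrap_or(120);
    if requested > 0 {
        requested
    } else {
        600
    }
}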
================================================
FILE: src/session.rs
================================================
use std::fs;
use std::path::{Path, PathBuf};
use std::time::Duration;
use reqwest::blocking::Client;
use reqwest::header::{HeaderMap, HeaderValue, CONTENT_TYPE, COOKIE, REFERER, USER_AGENT};
use crate::cache::Cache;
use crate::cookies::Cookie;
use crate::core::{
detect_media_type, parse_content_type, print_error_message, print_info_message, MonolithOptions,
};
use crate::url::{clean_url, domain_is_within_domain, get_referer_url, parse_data_url, Url};
pub struct Session {
cache: Option<Cache>,
client: Client,
cookies: Option<Vec<Cookie>>,
pub options: MonolithOptions,
urls: Vec<String>,
}
impl Session {
pub fn new(
cache: Option<Cache>,
cookies: Option<Vec<Cookie>>,
options: MonolithOptions,
) -> Self {
let mut header_map = HeaderMap::new();
if let Some(user_agent) = &options.user_agent {
header_map.insert(
USER_AGENT,
HeaderValue::from_str(user_agent).expect("Invalid User-Agent header specified"),
);
}
let client = Client::builder()
.timeout(Duration::from_secs(if options.timeout > 0 {
options.timeout
} else {
// We have to specify something that eventually makes the program fail
// (prevent it from hanging forever)
600 // 10 minutes in seconds
}))
.danger_accept_invalid_certs(options.insecure)
.default_headers(header_map)
.build()
.expect("Failed to initialize HTTP client");
Session {
cache,
cookies,
client,
options,
urls: Vec::new(),
}
}
pub fn retrieve_asset(
&mut self,
parent_url: &Url,
url: &Url,
) -> Result<(Vec<u8>, Url, String, String), reqwest::Error> {
let cache_key: String = clean_url(url.clone()).as_str().to_string();
if !self.urls.contains(&url.as_str().to_string()) {
self.urls.push(url.as_str().to_string());
}
if url.scheme() == "data" {
let (media_type, charset, data) = parse_data_url(url);
Ok((data, url.clone(), media_type, charset))
} else if url.scheme() == "file" {
// Check if parent_url is also a file:// URL (if not, then we don't embed the asset)
if parent_url.scheme() != "file" {
if !self.options.silent {
print_error_message(&format!("{} (security error)", &cache_key));
}
// Provoke error
self.client.get("").send()?;
}
let path_buf: PathBuf = url.to_file_path().unwrap().clone();
let path: &Path = path_buf.as_path();
if path.exists() {
if path.is_dir() {
if !self.options.silent {
print_error_message(&format!("{} (is a directory)", &cache_key));
}
// Provoke error
Err(self.client.get("").send().unwrap_err())
} else {
if !self.options.silent {
print_info_message(&cache_key.to_string());
}
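let file_blob: Vec<u8> = fs::read(path).expect("unable to read file");
// Detect the media type first so that the file contents can be moved, not cloned
let media_type: String = detect_media_type(&file_blob, url);
Ok((file_blob, url.clone(), media_type, "".to_string()))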
}
} else {
if !self.options.silent {
print_error_message(&format!("{} (file not found)", &url));
}
// Provoke error
Err(self.client.get("").send().unwrap_err())
}
} else if self.cache.is_some() && self.cache.as_ref().unwrap().contains_key(&cache_key) {
// URL is in cache, we get and return it
if !self.options.silent {
print_info_message(&format!("{} (from cache)", &cache_key));
}
// Look the entry up once instead of once per tuple field
let (data, media_type, charset) =
self.cache.as_ref().unwrap().get(&cache_key).unwrap();
Ok((data.to_vec(), url.clone(), media_type, charset))
} else {
if let Some(domains) = &self.options.domains {
// URLs without a host (e.g. about:blank) never match any domain
let domain_matches = domains.iter().any(|d| {
url.host_str()
.map_or(false, |host| domain_is_within_domain(host, d.trim()))
});
if (self.options.blacklist_domains && domain_matches)
|| (!self.options.blacklist_domains && !domain_matches)
{
return Err(self.client.get("").send().unwrap_err());
}
}
// URL not in cache, we retrieve the file
let mut headers = HeaderMap::new();
if let Some(cookies) = &self.cookies {
// Combine all matching cookies into a single Cookie header
// (inserting them one at a time would keep only the last one)
let cookie_pairs: Vec<String> = cookies
.iter()
.filter(|cookie| !cookie.is_expired() && cookie.matches_url(url.as_str()))
.map(|cookie| format!("{}={}", cookie.name, cookie.value))
.collect();
if !cookie_pairs.is_empty() {
headers.insert(
COOKIE,
HeaderValue::from_str(&cookie_pairs.join("; ")).unwrap(),
);
}
}
// Add referer header for page resource requests
if ["https", "http"].contains(&parent_url.scheme()) && parent_url != url {
headers.insert(
REFERER,
HeaderValue::from_str(get_referer_url(parent_url.clone()).as_str()).unwrap(),
);
}
match self.client.get(url.as_str()).headers(headers).send() {
Ok(response) => {
if !self.options.ignore_errors && response.status() != reqwest::StatusCode::OK {
if !self.options.silent {
print_error_message(&format!("{} ({})", &cache_key, response.status()));
}
// Provoke error
return Err(self.client.get("").send().unwrap_err());
}
let response_url: Url = response.url().clone();
if !self.options.silent {
if url.as_str() == response_url.as_str() {
print_info_message(&cache_key.to_string());
} else {
print_info_message(&format!("{} -> {}", &cache_key, &response_url));
}
}
// Attempt to obtain media type and charset by reading Content-Type header
let content_type: &str = response
.headers()
.get(CONTENT_TYPE)
.and_then(|header| header.to_str().ok())
.unwrap_or("");
let (media_type, charset, _is_base64) = parse_content_type(content_type);
// Convert response into a byte array
let mut data: Vec<u8> = vec![];
match response.bytes() {
Ok(b) => {
data = b.to_vec();
}
Err(error) => {
if !self.options.silent {
print_error_message(&format!("{}", error));
}
}
}
// Add retrieved resource to cache
if self.cache.is_some() {
let new_cache_key: String = clean_url(response_url.clone()).to_string();
self.cache.as_mut().unwrap().set(
&new_cache_key,
&data,
media_type.clone(),
charset.clone(),
);
}
// Return
Ok((data, response_url, media_type, charset))
}
Err(error) => {
if !self.options.silent {
print_error_message(&format!("{} ({})", &cache_key, error));
}
Err(self.client.get("").send().unwrap_err())
}
}
}
}
}
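// A minimal sketch, not part of the original source, of the allow/deny decision made in
// retrieve_asset() when domains are specified: in blacklist mode a matching domain is
// skipped, otherwise (whitelist mode) a non-matching one is. The function name is hypothetical.
#[allow(dead_code)]
fn example_is_request_blocked(blacklist_domains: bool, domain_matches: bool) -> bool {
    (blacklist_domains && domain_matches) || (!blacklist_domains && !domain_matches)
}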
================================================
FILE: src/url.rs
================================================
use base64::{prelude::BASE64_STANDARD, Engine};
use percent_encoding::percent_decode_str;
pub use url::Url;
use crate::core::{detect_media_type, parse_content_type};
pub const EMPTY_IMAGE_DATA_URL: &str = "data:image/png,\
%89PNG%0D%0A%1A%0A%00%00%00%0DIHDR%00%00%00%0D%00%00%00%0D%08%04%00%00%00%D8%E2%2C%F7%00%00%00%11IDATx%DAcd%C0%09%18G%A5%28%96%02%00%0A%F8%00%0E%CB%8A%EB%16%00%00%00%00IEND%AEB%60%82";
/// Returns the given URL with its fragment removed (used, for instance, to build cache keys).
pub fn clean_url(mut url: Url) -> Url {
    // Clear fragment (if any)
    url.set_fragment(None);

    url
}
/// Builds a base64-encoded `data:` URL out of a media type, a charset, and a payload.
///
/// An empty `media_type` gets detected from `data` and `final_asset_url`.
/// The charset parameter is omitted when `charset` is blank or US-ASCII.
pub fn create_data_url(media_type: &str, charset: &str, data: &[u8], final_asset_url: &Url) -> Url {
    // TODO: move this block out of this function
    let media_type: String = if media_type.is_empty() {
        detect_media_type(data, final_asset_url)
    } else {
        media_type.to_string()
    };

    let mut data_url: Url = Url::parse("data:,").unwrap();

    let charset: &str = charset.trim();
    let c: String = if !charset.is_empty() && !charset.eq_ignore_ascii_case("US-ASCII") {
        format!(";charset={}", charset)
    } else {
        String::new()
    };

    data_url.set_path(&format!(
        "{}{};base64,{}",
        media_type,
        c,
        BASE64_STANDARD.encode(data)
    ));

    data_url
}
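For illustration, the string composition performed by `create_data_url` can be sketched without external crates. This is a minimal sketch, not monolith's code: `base64_encode` and `sketch_data_url` are hypothetical names, the hand-rolled encoder merely stands in for `BASE64_STANDARD.encode`, media type detection is skipped, and the result is a plain `String` rather than a `Url`.

```rust
// Standard base64 alphabet (RFC 4648), 64 symbols.
const ALPHABET: &[u8; 64] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Hypothetical stand-in for `BASE64_STANDARD.encode` (illustration only).
fn base64_encode(data: &[u8]) -> String {
    let mut out = String::with_capacity((data.len() + 2) / 3 * 4);
    for chunk in data.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling missing bytes
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        // Positions that carry no input bits are replaced with '=' padding
        out.push(if chunk.len() > 1 { ALPHABET[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Hypothetical helper: assembles "data:<media type>[;charset=<charset>];base64,<payload>".
fn sketch_data_url(media_type: &str, charset: &str, data: &[u8]) -> String {
    let charset = charset.trim();
    // The charset parameter is left out when blank or US-ASCII
    let charset_part = if !charset.is_empty() && !charset.eq_ignore_ascii_case("US-ASCII") {
        format!(";charset={}", charset)
    } else {
        String::new()
    };
    format!("data:{}{};base64,{}", media_type, charset_part, base64_encode(data))
}
```

Under these assumptions, `sketch_data_url("text/html", "utf-8", b"Hi")` would yield `data:text/html;charset=utf-8;base64,SGk=`.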
/// Checks whether `domain` falls within `domain_to_match_against`.
///
/// Labels are compared right to left, ignoring ASCII case and trailing dots.
/// A leading dot in `domain_to_match_against` (e.g. ".example.com") lets subdomains match as well;
/// without a leading dot, `domain` is expected to have exactly the same labels.
/// Empty labels in `domain_to_match_against` match any label.
/// A lone "." matches every domain, and an empty string matches none.
pub fn domain_is_within_domain(domain: &str, domain_to_match_against: &str) -> bool {
    if domain_to_match_against.is_empty() {
        return false;
    }

    if domain_to_match_against == "." {
        return true;
    }

    let domain_partials: Vec<&str> = domain.trim_end_matches('.').rsplit('.').collect();
    let domain_to_match_against_partials: Vec<&str> = domain_to_match_against
        .trim_end_matches('.')
        .rsplit('.')
        .collect();
    let domain_to_match_against_starts_with_a_dot = domain_to_match_against.starts_with('.');

    let l: usize = std::cmp::max(
        domain_partials.len(),
        domain_to_match_against_partials.len(),
    );

    for i in 0..l {
        // Return false if went out of bounds of domain to match against, and it didn't start with a dot
        if !domain_to_match_against_starts_with_a_dot
            && domain_to_match_against_partials.len() < i + 1
        {
            return false;
        }

        // Labels missing on either side are treated as empty strings
        let domain_partial: &str = domain_partials.get(i).copied().unwrap_or("");
        let domain_to_match_against_partial: &str = domain_to_match_against_partials
            .get(i)
            .copied()
            .unwrap_or("");

        let parts_match = domain_to_match_against_partial.eq_ignore_ascii_case(domain_partial);
        if !parts_match && !domain_to_match_against_partial.is_empty() {
            return false;
        }
    }

    true
}
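The right-to-left label comparison used by `domain_is_within_domain` can be condensed into a few lines. What follows is a simplified sketch under stated assumptions rather than a drop-in replacement: `domain_matches` is a hypothetical name, and corner cases such as empty labels in the middle of a pattern may behave differently from the function above.

```rust
/// Hypothetical helper (not part of monolith): label-wise, case-insensitive
/// domain matching where a leading dot in `pattern` also admits subdomains.
fn domain_matches(domain: &str, pattern: &str) -> bool {
    if pattern.is_empty() {
        return false;
    }
    if pattern == "." {
        return true;
    }

    let allow_subdomains = pattern.starts_with('.');

    // Split into labels, right-most label first, ignoring surrounding dots
    let domain_labels: Vec<&str> = domain.trim_end_matches('.').rsplit('.').collect();
    let pattern_labels: Vec<&str> = pattern.trim_matches('.').rsplit('.').collect();

    // The domain needs at least as many labels as the pattern,
    // and exactly as many when subdomains are not allowed
    if domain_labels.len() < pattern_labels.len() {
        return false;
    }
    if !allow_subdomains && domain_labels.len() != pattern_labels.len() {
        return false;
    }

    // Compare the overlapping labels pairwise
    domain_labels
        .iter()
        .zip(pattern_labels.iter())
        .all(|(d, p)| d.eq_ignore_ascii_case(p))
}
```

With this sketch, `news.ycombinator.com` matches the pattern `.ycombinator.com` but not `ycombinator.com`, which is the distinction the leading dot is meant to express.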
/// Returns `true` if `input` parses as an absolute URL with a non-empty scheme.
pub fn is_url_and_has_protocol(input: &str) -> bool {
    match Url::parse(input) {
        Ok(parsed_url) => !parsed_url.scheme().is_empty(),
        Err(_) => false,
    }
}
/// Splits a `data:` URL into its media type, charset, and decoded payload.
///
/// A payload that is flagged as base64 but fails to decode yields empty data.
pub fn parse_data_url(url: &Url) -> (String, String, Vec<u8>) {
    let path: &str = url.path();

    // Split data URL into meta data and raw data at the first comma
    let (content_type, data): (&str, &str) = path.split_once(',').unwrap_or((path, ""));

    // Parse meta data
    let (media_type, charset, is_base64) = parse_content_type(content_type);

    // Parse raw data into vector of bytes
    let text: String = percent_decode_str(data).decode_utf8_lossy().to_string();
    let blob: Vec<u8> = if is_base64 {
        BASE64_STANDARD.decode(&text).unwrap_or_default()
    } else {
        text.as_bytes().to_vec()
    };

    (media_type, charset, blob)
}
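The non-base64 branch of `parse_data_url` boils down to two steps: split at the first comma, then percent-decode the payload. The sketch below illustrates them with the standard library only. It makes several assumptions: `percent_decode` and `split_data_url` are hypothetical names, the decoder is a simplified stand-in for `percent_decode_str`, and neither media type parsing nor base64 decoding is covered.

```rust
/// Converts an ASCII hex digit into its numeric value.
fn hex_val(byte: u8) -> Option<u8> {
    (byte as char).to_digit(16).map(|d| d as u8)
}

/// Hypothetical std-only stand-in for `percent_decode_str` (illustration only).
fn percent_decode(input: &str) -> Vec<u8> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        // A '%' followed by two hex digits encodes a single byte
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            if let (Some(hi), Some(lo)) = (hex_val(bytes[i + 1]), hex_val(bytes[i + 2])) {
                out.push(hi * 16 + lo);
                i += 3;
                continue;
            }
        }
        // Anything else (including a malformed escape) is copied through as is
        out.push(bytes[i]);
        i += 1;
    }
    out
}

/// Hypothetical helper: returns the meta data part and the decoded payload,
/// or `None` if the input does not use the `data:` scheme.
fn split_data_url(url: &str) -> Option<(&str, Vec<u8>)> {
    let rest = url.strip_prefix("data:")?;
    // Everything before the first comma is meta data; a missing comma means no payload
    let (meta, payload) = rest.split_once(',').unwrap_or((rest, ""));
    Some((meta, percent_decode(payload)))
}
```

Applied to the target used throughout the CLI tests, `data:text/html,Hello%2C%20World!`, the sketch separates `text/html` from the bytes of `Hello, World!`.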
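/// Returns a copy of the URL that is suitable for use in the Referer header.
pub fn get_referer_url(mut url: Url) -> Url {
    // Spec: https://httpwg.org/specs/rfc9110.html#field.referer
    // Must not include the fragment and userinfo components of the URI
    url.set_fragment(None);

    // These setters fail only for URLs that cannot carry userinfo (no host, or cannot-be-a-base),
    // in which case there is nothing to strip, so the result is ignored instead of unwrapped
    let _ = url.set_username("");
    let _ = url.set_password(None);

    url
}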
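/// Resolves `to` against `from`.
///
/// Absolute URLs are returned as is; input that can be neither parsed nor joined
/// results in an empty data URL ("data:,").
pub fn resolve_url(from: &Url, to: &str) -> Url {
    match Url::parse(to) {
        Ok(parsed_url) => parsed_url,
        Err(_) => match from.join(to) {
            Ok(joined) => joined,
            Err(_) => Url::parse("data:,").unwrap(),
        },
    }
}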
================================================
FILE: tests/_data_/basic/local-file.html
================================================
<!doctype html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Local HTML file</title>
<link href="local-style.css" rel="stylesheet" type="text/css" />
<link href="local-style-does-not-exist.css" rel="stylesheet" type="text/css" />
</head>
<body>
<img src="monolith.png" alt="" />
<a href="//local-file.html">Tricky href</a>
<a href="https://github.com/Y2Z/monolith">Remote URL</a>
<script src="local-script.js"></script>
</body>
</html>
================================================
FILE: tests/_data_/basic/local-script.js
================================================
document.body.style.backgroundColor = "green";
document.body.style.color = "red";
================================================
FILE: tests/_data_/basic/local-style.css
================================================
body {
background-color: #000;
color: #fff;
}
================================================
FILE: tests/_data_/css/index.html
================================================
<style>
@charset 'UTF-8';
@import 'style.css';
@import url(style.css);
@import url('style.css');
</style>
================================================
FILE: tests/_data_/css/style.css
================================================
body{background-color:#000;color:#fff}
================================================
FILE: tests/_data_/import-css-via-data-url/index.html
================================================
<!doctype html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Attempt to import CSS via data URL asset</title>
<style>
body {
background-color: white;
color: black;
}
</style>
<link href="data:text/css;base64,QGltcG9ydCAic3R5bGUuY3NzIjsK" rel="stylesheet" type="text/css" />
</head>
<body>
<p>If you see pink background with white foreground then we’re in trouble</p>
</body>
</html>
================================================
FILE: tests/_data_/import-css-via-data-url/style.css
================================================
body {
background-color: pink;
color: white;
}
================================================
FILE: tests/_data_/integrity/index.html
================================================
<!doctype html>
<html lang="en">
<head>
<title>Local HTML file</title>
<link
href="style.css"
rel="stylesheet"
type="text/css"
integrity="sha512-IWaCTORHkRhOWzcZeILSVmV6V6gPTHgNem6o6rsFAyaKTieDFkeeMrWjtO0DuWrX3bqZY46CVTZXUu0mia0qXQ=="
crossorigin="anonymous"
/>
<link
href="style.css"
rel="stylesheet"
type="text/css"
integrity="sha512-vWBzl4NE9oIg8NFOPAyOZbaam0UXWr6aDHPaY2kodSzAFl+mKoj/RMNc6C31NDqK4mE2i68IWxYWqWJPLCgPOw=="
crossorigin="anonymous"
/>
</head>
<body>
<p>
This page should have black background and white foreground, but
only when served via http: (not via file:)
</p>
<script
src="script.js"
integrity="sha256-B8CIe6TRGtUNifdy1eY4C9iK46VgAsS5URTNMjjL6+c="
></script>
<script
src="script.js"
integrity="sha256-6idk9dK0bOkVdG7Oz4/0YLXSJya8xZHqbRZKMhYrt6o="
></script>
</body>
</html>
================================================
FILE: tests/_data_/integrity/script.js
================================================
function noop() {
console.log("</script>");
}
================================================
FILE: tests/_data_/integrity/style.css
================================================
body {
background-color: #000;
color: #FFF;
}
================================================
FILE: tests/_data_/noscript/index.html
================================================
<body><noscript><img src="image.svg" /></noscript></body>
================================================
FILE: tests/_data_/noscript/nested.html
================================================
<body><noscript><h1>JS is not active</h1><noscript><img src="image.svg" /></noscript></noscript></body>
================================================
FILE: tests/_data_/noscript/script.html
================================================
<body><noscript><script>alert(1);</script><img src="image.svg" /></noscript></body>
================================================
FILE: tests/_data_/svg/image.html
================================================
<html>
<body>
<svg height="24" width="24">
<image href="image.svg" width="24" height="24"></use>
</svg>
</body>
</html>
================================================
FILE: tests/_data_/svg/index.html
================================================
<div style="background-image: url('image.svg')"></div>
================================================
FILE: tests/_data_/svg/svg.html
================================================
<html>
<body>
<button class="tm-votes-lever__button" data-test-id="votes-lever-upvote-button" title="Like" type="button">
<svg class="tm-svg-img tm-votes-lever__icon" height="24" width="24">
<title>Like</title>
<use xlink:href="icons.svg#icon-1"></use>
</svg>
</button>
</body>
</html>
================================================
FILE: tests/_data_/unusual_encodings/gb2312.html
================================================
<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=GB2312"/>
<title>߳˼ֻת--áƼ-- </title>
</head>
<body>
<h1>߳˼ֻת</h1>
</body>
</html>
================================================
FILE: tests/_data_/unusual_encodings/iso-8859-1.html
================================================
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
© Some Company
</body>
</html>
================================================
FILE: tests/cli/base_url.rs
================================================
// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗
// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝
// ██████╔╝███████║███████╗███████╗██║██╔██╗ ██║██║ ███╗
// ██╔═══╝ ██╔══██║╚════██║╚════██║██║██║╚██╗██║██║ ██║
// ██║ ██║ ██║███████║███████║██║██║ ╚████║╚██████╔╝
// ╚═╝ ╚═╝ ╚═╝╚══════╝╚══════╝╚═╝╚═╝ ╚═══╝ ╚═════╝
#[cfg(test)]
mod passing {
use assert_cmd::prelude::*;
use std::env;
use std::process::Command;
#[test]
fn add_new_when_provided() {
let mut cmd = Command::cargo_bin(env!("CARGO_PKG_NAME")).unwrap();
let out = cmd
.arg("-M")
.arg("-b")
.arg("http://localhost:30701/")
.arg("data:text/html,Hello%2C%20World!")
.output()
.unwrap();
// STDERR should be empty
assert_eq!(String::from_utf8_lossy(&out.stderr), "");
// STDOUT should contain newly added base URL
assert_eq!(
String::from_utf8_lossy(&out.stdout),
r#"<html><head><base href="http://localhost:30701/"></base><meta name="robots" content="none"></meta></head><body>Hello, World!</body></html>
"#
);
// Exit code should be 0
out.assert().code(0);
}
#[test]
fn keep_existing_when_none_provided() {
let mut cmd = Command::cargo_bin(env!("CARGO_PKG_NAME")).unwrap();
let out = cmd
.arg("-M")
.arg("data:text/html,<base href=\"http://localhost:30701/\" />Hello%2C%20World!")
.output()
.unwrap();
// STDERR should be empty
assert_eq!(String::from_utf8_lossy(&out.stderr), "");
// STDOUT should contain the existing base URL, left unchanged
assert_eq!(
String::from_utf8_lossy(&out.stdout),
r#"<html><head><base href="http://localhost:30701/"><meta name="robots" content="none"></meta></head><body>Hello, World!</body></html>
"#
);
// Exit code should be 0
out.assert().code(0);
}
#[test]
fn override_existing_when_provided() {
let mut cmd = Command::cargo_bin(env!("CARGO_PKG_NAME")).unwrap();
let out = cmd
.arg("-M")
.arg("-b")
.arg("http://localhost/")
.arg("data:text/html,<base href=\"http://localhost:30701/\" />Hello%2C%20World!")
.output()
.unwrap();
// STDERR should be empty
assert_eq!(String::from_utf8_lossy(&out.stderr), "");
// STDOUT should contain the overridden base URL
assert_eq!(
String::from_utf8_lossy(&out.stdout),
r#"<html><head><base href="http://localhost/"><meta name="robots" content="none"></meta></head><body>Hello, World!</body></html>
"#
);
// Exit code should be 0
out.assert().code(0);
}
#[test]
fn set_existing_to_empty_when_empty_provided() {
let mut cmd = Command::cargo_bin(env!("CARGO_PKG_NAME")).unwrap();
let out = cmd
.arg("-M")
.arg("-b")
.arg("")
.arg("data:text/html,<base href=\"http://localhost:30701/\" />Hello%2C%20World!")
.output()
.unwrap();
// STDERR should be empty
assert_eq!(String::from_utf8_lossy(&out.stderr), "");
// STDOUT should contain the base URL set to an empty value
assert_eq!(
String::from_utf8_lossy(&out.stdout),
r#"<html><head><base href=""><meta name="robots" content="none"></meta></head><body>Hello, World!</body></html>
"#
);
// Exit code should be 0
out.assert().code(0);
}
}
================================================
FILE: tests/cli/basic.rs
================================================
// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗
// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝
// ██████╔╝███████║███████╗███████╗██║██╔██╗ ██║██║ ███╗
// ██╔═══╝ ██╔══██║╚════██║╚════██║██║██║╚██╗██║██║ ██║
// ██║ ██║ ██║███████║███████║██║██║ ╚████║╚██████╔╝
// ╚═╝ ╚═╝ ╚═╝╚══════╝╚══════╝╚═╝╚═╝ ╚═══╝ ╚═════╝
#[cfg(test)]
mod passing {
use assert_cmd::prelude::*;
use std::env;
use std::fs;
use std::path::Path;
use std::process::{Command, Stdio};
use url::Url;
#[test]
fn print_help_information() {
let mut cmd = Command::cargo_bin(env!("CARGO_PKG_NAME")).unwrap();
let out = cmd.arg("-h").output().unwrap();
// STDERR should be empty
assert_eq!(String::from_utf8_lossy(&out.stderr), "");
// STDOUT should contain program name, version, and usage information
// TODO
// Exit code should be 0
out.assert().code(0);
}
#[test]
fn print_version() {
let mut cmd = Command::cargo_bin(env!("CARGO_PKG_NAME")).unwrap();
let out = cmd.arg("-V").output().unwrap();
// STDERR should be empty
assert_eq!(String::from_utf8_lossy(&out.stderr), "");
// STDOUT should contain program name and version
assert_eq!(
String::from_utf8_lossy(&out.stdout),
format!("{} {}\n", env!("CARGO_PKG_NAME"), env!("CARGO_PKG_VERSION"))
);
// Exit code should be 0
out.assert().code(0);
}
#[test]
fn stdin_target_input() {
let mut echo = Command::new("echo")
.arg("Hello from STDIN")
.stdout(Stdio::piped())
.spawn()
.unwrap();
let echo_out = echo.stdout.take().unwrap();
echo.wait().unwrap();
let mut cmd = Command::cargo_bin(env!("CARGO_PKG_NAME")).unwrap();
cmd.stdin(echo_out);
let out = cmd.arg("-M").arg("-").output().unwrap();
// STDERR should be empty
assert_eq!(String::from_utf8_lossy(&out.stderr), "");
// STDOUT should contain HTML created out of STDIN
assert_eq!(
String::from_utf8_lossy(&out.stdout),
r#"<html><head><meta name="robots" content="none"></meta></head><body>Hello from STDIN
</body></html>
"#
);
// Exit code should be 0
out.assert().code(0);
}
#[test]
fn css_import_string() {
let mut cmd = Command::cargo_bin(env!("CARGO_PKG_NAME")).unwrap();
let path_html: &Path = Path::new("tests/_data_/css/index.html");
let path_css: &Path = Path::new("tests/_data_/css/style.css");
assert!(path_html.is_file());
assert!(path_css.is_file());
let out = cmd.arg("-M").arg(path_html.as_os_str()).output().unwrap();
// STDERR should list files that got retrieved
assert_eq!(
String::from_utf8_lossy(&out.stderr),
format!(
"\
{file_url_html}\n\
{file_url_css}\n\
{file_url_css}\n\
{file_url_css}\n\
",
file_url_html = Url::from_file_path(fs::canonicalize(path_html).unwrap()).unwrap(),
file_url_css = Url::from_file_path(fs::canonicalize(path_css).unwrap()).unwrap(),
)
);
// STDOUT should contain embedded CSS url()'s
assert_eq!(
String::from_utf8_lossy(&out.stdout),
r##"<html><head><style>
@charset "UTF-8";
@import "data:text/css;base64,Ym9keXtiYWNrZ3JvdW5kLWNvbG9yOiMwMDA7Y29sb3I6I2ZmZn0K";
@import url("data:text/css;base64,Ym9keXtiYWNrZ3JvdW5kLWNvbG9yOiMwMDA7Y29sb3I6I2ZmZn0K");
@import url("data:text/css;base64,Ym9keXtiYWNrZ3JvdW5kLWNvbG9yOiMwMDA7Y29sb3I6I2ZmZn0K");
</style>
<meta name="robots" content="none"></meta></head><body></body></html>
"##
);
// Exit code should be 0
out.assert().code(0);
}
}
// ███████╗ █████╗ ██╗██╗ ██╗███╗ ██╗ ██████╗
// ██╔════╝██╔══██╗██║██║ ██║████╗ ██║██╔════╝
// █████╗ ███████║██║██║ ██║██╔██╗ ██║██║ ███╗
// ██╔══╝ ██╔══██║██║██║ ██║██║╚██╗██║██║ ██║
// ██║ ██║ ██║██║███████╗██║██║ ╚████║╚██████╔╝
// ╚═╝ ╚═╝ ╚═╝╚═╝╚══════╝╚═╝╚═╝ ╚═══╝ ╚═════╝
#[cfg(test)]
mod failing {
use assert_cmd::prelude::*;
use std::env;
use std::process::Command;
#[test]
fn bad_input_empty_target() {
let mut cmd = Command::cargo_bin(env!("CARGO_PKG_NAME")).unwrap();
let out = cmd.arg("").output().unwrap();
// STDERR should contain error description
assert_eq!(
String::from_utf8_lossy(&out.stderr),
"Error: no target specified\n"
);
// STDOUT should be empty
assert_eq!(String::from_utf8_lossy(&out.stdout), "");
// Exit code should be 1
out.assert().code(1);
}
#[test]
fn unsupported_scheme() {
let mut cmd = Command::cargo_bin(env!("CARGO_PKG_NAME")).unwrap();
let out = cmd.arg("mailto:snshn@tutanota.com").output().unwrap();
// STDERR should contain error description
assert_eq!(
String::from_utf8_lossy(&out.stderr),
"Error: unsupported target URL scheme \"mailto\"\n"
);
// STDOUT should be empty
assert_eq!(String::from_utf8_lossy(&out.stdout), "");
// Exit code should be 1
out.assert().code(1);
}
}
================================================
FILE: tests/cli/data_url.rs
================================================
// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗
// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝
// ██████╔╝███████║███████╗███████╗██║██╔██╗ ██║██║ ███╗
// ██╔═══╝ ██╔══██║╚════██║╚════██║██║██║╚██╗██║██║ ██║
// ██║ ██║ ██║███████║███████║██║██║ ╚████║╚██████╔╝
// ╚═╝ ╚═╝ ╚═╝╚══════╝╚══════╝╚═╝╚═╝ ╚═══╝ ╚═════╝
#[cfg(test)]
mod passing {
use assert_cmd::prelude::*;
use std::env;
use std::process::Command;
use monolith::url::EMPTY_IMAGE_DATA_URL;
#[test]
fn isolate_data_url() {
let mut cmd = Command::cargo_bin(env!("CARGO_PKG_NAME")).unwrap();
let out = cmd
.arg("-M")
.arg("-I")
.arg("data:text/html,Hello%2C%20World!")
.output()
.unwrap();
// STDERR should be empty
assert_eq!(String::from_utf8_lossy(&out.stderr), "");
// STDOUT should contain isolated HTML
assert_eq!(
String::from_utf8_lossy(&out.stdout),
r#"<html><head><meta http-equiv="Content-Security-Policy" content="default-src 'unsafe-eval' 'unsafe-inline' data:;"></meta><meta name="robots" content="none"></meta></head><body>Hello, World!</body></html>
"#
);
// Exit code should be 0
out.assert().code(0);
}
#[test]
fn remove_css_from_data_url() {
let mut cmd = Command::cargo_bin(env!("CARGO_PKG_NAME")).unwrap();
let out = cmd
.arg("-M")
.arg("-c")
.arg("data:text/html,<style>body{background-color:pink}</style>Hello")
.output()
.unwrap();
// STDERR should be empty
assert_eq!(String::from_utf8_lossy(&out.stderr), "");
// STDOUT should contain HTML with no CSS
assert_eq!(
String::from_utf8_lossy(&out.stdout),
r#"<html><head><meta http-equiv="Content-Security-Policy" content="style-src 'none';"></meta><style></style><meta name="robots" content="none"></meta></head><body>Hello</body></html>
"#
);
// Exit code should be 0
out.assert().code(0);
}
#[test]
fn remove_fonts_from_data_url() {
let mut cmd = Command::cargo_bin(env!("CARGO_PKG_NAME")).unwrap();
let out = cmd
.arg("-M")
.arg("-F")
.arg("data:text/html,<style>@font-face { font-family: myFont; src: url(font.woff); }</style>Hi")
.output()
.unwrap();
// STDERR should be empty
assert_eq!(String::from_utf8_lossy(&out.stderr), "");
// STDOUT should contain HTML with no web fonts
assert_eq!(
String::from_utf8_lossy(&out.stdout),
r#"<html><head><meta http-equiv="Content-Security-Policy" content="font-src 'none';"></meta><style></style><meta name="robots" content="none"></meta></head><body>Hi</body></html>
"#
);
// Exit code should be 0
out.assert().code(0);
}
#[test]
fn remove_frames_from_data_url() {
let mut cmd = Command::cargo_bin(env!("CARGO_PKG_NAME")).unwrap();
let out = cmd
.arg("-M")
.arg("-f")
.arg(r#"data:text/html,<iframe src="https://duckduckgo.com"></iframe>Hi"#)
.output()
.unwrap();
// STDERR should be empty
assert_eq!(String::from_utf8_lossy(&out.stderr), "");
// STDOUT should contain HTML with no iframes
assert_eq!(
String::from_utf8_lossy(&out.stdout),
r#"<html><head><meta http-equiv="Content-Security-Policy" content="frame-src 'none'; child-src 'none';"></meta><meta name="robots" content="none"></meta></head><body><iframe src=""></iframe>Hi</body></html>
"#
);
// Exit code should be 0
out.assert().code(0);
}
#[test]
fn remove_images_from_data_url() {
let mut cmd = Command::cargo_bin(env!("CARGO_PKG_NAME")).unwrap();
let out = cmd
.arg("-M")
.arg("-i")
.arg("data:text/html,<img src=\"https://google.com\"/>Hi")
.output()
.unwrap();
// STDERR should be empty
assert_eq!(String::from_utf8_lossy(&out.stderr), "");
// STDOUT should contain HTML with no images
assert_eq!(
String::from_utf8_lossy(&out.stdout),
format!(
r#"<html><head><meta http-equiv="Content-Security-Policy" content="img-src data:;"></meta><meta name="robots" content="none"></meta></head><body><img src="{empty_image}">Hi</body></html>
"#,
empty_image = EMPTY_IMAGE_DATA_URL,
)
SYMBOL INDEX (386 symbols across 51 files)
FILE: src/cache.rs
type CacheMetadataItem (line 8) | pub struct CacheMetadataItem {
type Cache (line 15) | pub struct Cache {
method new (line 27) | pub fn new(min_file_size: usize, db_file_path: Option<String>) -> Cache {
method set (line 55) | pub fn set(&mut self, key: &str, data: &Vec<u8>, media_type: String, c...
method get (line 88) | pub fn get(&self, key: &str) -> Result<(Vec<u8>, String, String), Erro...
method contains_key (line 115) | pub fn contains_key(&self, key: &str) -> bool {
method destroy_database_file (line 119) | pub fn destroy_database_file(&mut self) {
constant FILE_WRITE_BUF_LEN (line 23) | const FILE_WRITE_BUF_LEN: usize = 1024 * 100;
constant TABLE (line 24) | const TABLE: TableDefinition<&str, &[u8]> = TableDefinition::new("_");
FILE: src/cookies.rs
type Cookie (line 5) | pub struct Cookie {
method is_expired (line 21) | pub fn is_expired(&self) -> bool {
method matches_url (line 34) | pub fn matches_url(&self, url: &str) -> bool {
type CookieFileContentsParseError (line 16) | pub enum CookieFileContentsParseError {
function parse_cookie_file_contents (line 83) | pub fn parse_cookie_file_contents(
FILE: src/core.rs
type MonolithError (line 21) | pub struct MonolithError {
method new (line 26) | fn new(msg: &str) -> MonolithError {
method fmt (line 34) | fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
method description (line 40) | fn description(&self) -> &str {
type MonolithOutputFormat (line 46) | pub enum MonolithOutputFormat {
type MonolithOptions (line 56) | pub struct MonolithOptions {
constant ANSI_COLOR_RED (line 79) | const ANSI_COLOR_RED: &str = "\x1b[31m";
constant ANSI_COLOR_RESET (line 80) | const ANSI_COLOR_RESET: &str = "\x1b[0m";
constant FILE_SIGNATURES (line 81) | const FILE_SIGNATURES: [[&[u8]; 2]; 18] = [
constant PLAINTEXT_MEDIA_TYPES (line 105) | const PLAINTEXT_MEDIA_TYPES: &[&str] = &[
function create_monolithic_document_from_data (line 116) | pub fn create_monolithic_document_from_data(
function create_monolithic_document (line 299) | pub fn create_monolithic_document(
function detect_media_type (line 410) | pub fn detect_media_type(data: &[u8], url: &Url) -> String {
function detect_media_type_by_file_name (line 424) | pub fn detect_media_type_by_file_name(filename: &str) -> String {
function format_output_path (line 464) | pub fn format_output_path(
function is_plaintext_media_type (line 508) | pub fn is_plaintext_media_type(media_type: &str) -> bool {
function parse_content_type (line 513) | pub fn parse_content_type(content_type: &str) -> (String, String, bool) {
function print_error_message (line 538) | pub fn print_error_message(text: &str) {
function print_info_message (line 566) | pub fn print_info_message(text: &str) {
FILE: src/css.rs
constant CSS_PROPS_WITH_IMAGE_URLS (line 8) | const CSS_PROPS_WITH_IMAGE_URLS: &[&str] = &[
function embed_css (line 29) | pub fn embed_css(session: &mut Session, document_url: &Url, css: &str) -...
function format_ident (line 36) | pub fn format_ident(ident: &str) -> String {
function format_quoted_string (line 43) | pub fn format_quoted_string(string: &str) -> String {
function is_image_url_prop (line 49) | pub fn is_image_url_prop(prop_name: &str) -> bool {
function process_css (line 55) | pub fn process_css<'a>(
FILE: src/gui.rs
constant CACHE_ASSET_FILE_SIZE_THRESHOLD (line 21) | const CACHE_ASSET_FILE_SIZE_THRESHOLD: usize = 1024 * 20;
constant FILESPEC_HTML (line 22) | const FILESPEC_HTML: FileSpec = FileSpec::new("HTML files", &["html"]);
constant MONOLITH_GUI_WRITE_OUTPUT (line 23) | const MONOLITH_GUI_WRITE_OUTPUT: druid::Selector<(Vec<u8>, Option<String...
constant MONOLITH_GUI_ERROR (line 25) | const MONOLITH_GUI_ERROR: druid::Selector<MonolithError> =
constant TEXT_BOX_WIDTH (line 27) | const TEXT_BOX_WIDTH: f64 = 512_f64;
type Delegate (line 29) | struct Delegate;
method command (line 265) | fn command(
type AppState (line 32) | struct AppState {
function main (line 45) | fn main() -> Result<(), PlatformError> {
function ui_builder (line 77) | fn ui_builder() -> impl Widget<AppState> {
FILE: src/html.rs
constant FAVICON_VALUES (line 23) | const FAVICON_VALUES: &[&str] = &["icon", "shortcut icon"];
constant WHITESPACES (line 24) | const WHITESPACES: &[char] = &[' ', '\t', '\n', '\x0c', '\r'];
type LinkType (line 27) | pub enum LinkType {
type SrcSetItem (line 36) | pub struct SrcSetItem<'a> {
function add_favicon (line 41) | pub fn add_favicon(document: &Handle, favicon_data_url: String) -> RcDom {
function check_integrity (line 74) | pub fn check_integrity(data: &[u8], integrity: &str) -> bool {
function compose_csp (line 92) | pub fn compose_csp(options: &MonolithOptions) -> String {
function create_metadata_tag (line 124) | pub fn create_metadata_tag(url: &Url) -> String {
function embed_srcset (line 148) | pub fn embed_srcset(session: &mut Session, document_url: &Url, srcset: &...
function find_nodes (line 198) | pub fn find_nodes(node: &Handle, mut path: Vec<&str>) -> Vec<Handle> {
function get_base_url (line 227) | pub fn get_base_url(handle: &Handle) -> Option<String> {
function get_charset (line 236) | pub fn get_charset(node: &Handle) -> Option<String> {
function get_child_node_by_name (line 260) | pub fn get_child_node_by_name(parent: &Handle, node_name: &str) -> Optio...
function get_node_attr (line 269) | pub fn get_node_attr(node: &Handle, attr_name: &str) -> Option<String> {
function get_node_name (line 283) | pub fn get_node_name(node: &Handle) -> Option<&'_ str> {
function get_parent_node (line 290) | pub fn get_parent_node(child: &Handle) -> Handle {
function get_robots (line 295) | pub fn get_robots(handle: &Handle) -> Option<String> {
function get_title (line 309) | pub fn get_title(node: &Handle) -> Option<String> {
function has_favicon (line 321) | pub fn has_favicon(handle: &Handle) -> bool {
function html_to_dom (line 336) | pub fn html_to_dom(data: &Vec<u8>, document_encoding: String) -> RcDom {
function is_favicon (line 352) | pub fn is_favicon(attr_value: &str) -> bool {
function parse_link_type (line 356) | pub fn parse_link_type(link_attr_rel_value: &str) -> Vec<LinkType> {
function parse_srcset (line 378) | pub fn parse_srcset(srcset: &str) -> Vec<SrcSetItem> {
function set_base_url (line 445) | pub fn set_base_url(document: &Handle, base_href_value: String) -> RcDom {
function set_charset (line 479) | pub fn set_charset(dom: RcDom, charset: String) -> RcDom {
function set_node_attr (line 524) | pub fn set_node_attr(node: &Handle, attr_name: &str, attr_value: Option<...
function set_robots (line 561) | pub fn set_robots(dom: RcDom, content_value: &str) -> RcDom {
function serialize_document (line 602) | pub fn serialize_document(
function retrieve_and_embed_asset (line 665) | pub fn retrieve_and_embed_asset(
function walk (line 788) | pub fn walk(session: &mut Session, document_url: &Url, node: &Handle) {
FILE: src/js.rs
constant JS_DOM_EVENT_ATTRS (line 1) | const JS_DOM_EVENT_ATTRS: &[&str] = &[
function attr_is_event_handler (line 98) | pub fn attr_is_event_handler(attr_name: &str) -> bool {
FILE: src/main.rs
constant ASCII (line 16) | const ASCII: &str = " \
constant CACHE_ASSET_FILE_SIZE_THRESHOLD (line 25) | const CACHE_ASSET_FILE_SIZE_THRESHOLD: usize = 1024 * 10;
constant DEFAULT_NETWORK_TIMEOUT (line 26) | const DEFAULT_NETWORK_TIMEOUT: u64 = 120;
constant DEFAULT_USER_AGENT (line 27) | const DEFAULT_USER_AGENT: &str =
type Cli (line 34) | struct Cli {
type Output (line 127) | pub enum Output {
method new (line 133) | fn new(
method write (line 146) | fn write(&mut self, bytes: &Vec<u8>) -> Result<(), IoError> {
function read_stdin (line 160) | pub fn read_stdin() -> Vec<u8> {
function main (line 169) | fn main() {
FILE: src/session.rs
type Session (line 15) | pub struct Session {
method new (line 24) | pub fn new(
method retrieve_asset (line 58) | pub fn retrieve_asset(
FILE: src/url.rs
constant EMPTY_IMAGE_DATA_URL (line 7) | pub const EMPTY_IMAGE_DATA_URL: &str = "data:image/png,\
function clean_url (line 10) | pub fn clean_url(url: Url) -> Url {
function create_data_url (line 19) | pub fn create_data_url(media_type: &str, charset: &str, data: &[u8], fin...
function domain_is_within_domain (line 49) | pub fn domain_is_within_domain(domain: &str, domain_to_match_against: &s...
function is_url_and_has_protocol (line 105) | pub fn is_url_and_has_protocol(input: &str) -> bool {
function parse_data_url (line 112) | pub fn parse_data_url(url: &Url) -> (String, String, Vec<u8>) {
function get_referer_url (line 134) | pub fn get_referer_url(url: Url) -> Url {
function resolve_url (line 145) | pub fn resolve_url(from: &Url, to: &str) -> Url {
FILE: tests/_data_/integrity/script.js
function noop (line 1) | function noop() {
FILE: tests/cli/base_url.rs
function add_new_when_provided (line 15) | fn add_new_when_provided() {
function keep_existing_when_none_provided (line 40) | fn keep_existing_when_none_provided() {
function override_existing_when_provided (line 63) | fn override_existing_when_provided() {
function set_existing_to_empty_when_empty_provided (line 88) | fn set_existing_to_empty_when_empty_provided() {
FILE: tests/cli/basic.rs
function print_help_information (line 18) | fn print_help_information() {
function print_version (line 33) | fn print_version() {
function stdin_target_input (line 51) | fn stdin_target_input() {
function css_import_string (line 80) | fn css_import_string() {
function bad_input_empty_target (line 142) | fn bad_input_empty_target() {
function unsupported_scheme (line 160) | fn unsupported_scheme() {
FILE: tests/cli/data_url.rs
function isolate_data_url (line 17) | fn isolate_data_url() {
function remove_css_from_data_url (line 41) | fn remove_css_from_data_url() {
function remove_fonts_from_data_url (line 65) | fn remove_fonts_from_data_url() {
function remove_frames_from_data_url (line 89) | fn remove_frames_from_data_url() {
function remove_images_from_data_url (line 113) | fn remove_images_from_data_url() {
function remove_js_from_data_url (line 140) | fn remove_js_from_data_url() {
function bad_input_data_url (line 178) | fn bad_input_data_url() {
function security_disallow_local_assets_within_data_url_targets (line 193) | fn security_disallow_local_assets_within_data_url_targets() {
FILE: tests/cli/local_files.rs
function local_file_target_input_relative_target_path (line 20) | fn local_file_target_input_relative_target_path() {
function local_file_target_input_absolute_target_path (line 81) | fn local_file_target_input_absolute_target_path() {
function local_file_url_target_input (line 131) | fn local_file_url_target_input() {
function embed_file_url_local_asset_within_style_attribute (line 190) | fn embed_file_url_local_asset_within_style_attribute() {
function embed_svg_local_asset_via_use (line 222) | fn embed_svg_local_asset_via_use() {
function embed_svg_local_asset_via_image (line 264) | fn embed_svg_local_asset_via_image() {
function discard_integrity_for_local_files (line 301) | fn discard_integrity_for_local_files() {
FILE: tests/cli/noscript.rs
function parse_noscript_contents (line 18) | fn parse_noscript_contents() {
function unwrap_noscript_contents (line 49) | fn unwrap_noscript_contents() {
function unwrap_noscript_contents_nested (line 80) | fn unwrap_noscript_contents_nested() {
function unwrap_noscript_contents_with_script (line 111) | fn unwrap_noscript_contents_with_script() {
function unwrap_noscript_contents_attr_data_url (line 144) | fn unwrap_noscript_contents_attr_data_url() {
FILE: tests/cli/unusual_encodings.rs
function properly_save_document_with_gb2312 (line 17) | fn properly_save_document_with_gb2312() {
function properly_save_document_with_gb2312_from_stdin (line 68) | fn properly_save_document_with_gb2312_from_stdin() {
function properly_save_document_with_gb2312_custom_charset (line 114) | fn properly_save_document_with_gb2312_custom_charset() {
function properly_save_document_with_gb2312_custom_charset_bad (line 160) | fn properly_save_document_with_gb2312_custom_charset_bad() {
function change_iso88591_to_utf8_to_properly_display_html_entities (line 202) | fn change_iso88591_to_utf8_to_properly_display_html_entities() {
FILE: tests/cookies/cookie/is_expired.rs
function never_expires (line 13) | fn never_expires() {
function expires_long_from_now (line 28) | fn expires_long_from_now() {
function expired (line 55) | fn expired() {
FILE: tests/cookies/cookie/matches_url.rs
function secure_url (line 13) | fn secure_url() {
function non_secure_url (line 27) | fn non_secure_url() {
function subdomain (line 41) | fn subdomain() {
function empty_url (line 67) | fn empty_url() {
function wrong_hostname (line 81) | fn wrong_hostname() {
function wrong_path (line 95) | fn wrong_path() {
FILE: tests/cookies/parse_cookie_file_contents.rs
function parse_file (line 13) | fn parse_file() {
function parse_multiline_file (line 28) | fn parse_multiline_file() {
function empty (line 65) | fn empty() {
function no_header (line 72) | fn no_header() {
function spaces_instead_of_tabs (line 85) | fn spaces_instead_of_tabs() {
FILE: tests/core/detect_media_type.rs
function image_gif87 (line 15) | fn image_gif87() {
function image_gif89 (line 21) | fn image_gif89() {
function image_jpeg (line 27) | fn image_jpeg() {
function image_png (line 33) | fn image_png() {
function image_svg (line 42) | fn image_svg() {
function image_webp (line 48) | fn image_webp() {
function image_icon (line 57) | fn image_icon() {
function image_svg_filename (line 66) | fn image_svg_filename() {
function image_svg_url_uppercase (line 72) | fn image_svg_url_uppercase() {
function audio_mpeg (line 78) | fn audio_mpeg() {
function audio_mpeg_2 (line 84) | fn audio_mpeg_2() {
function audio_mpeg_3 (line 90) | fn audio_mpeg_3() {
function audio_ogg (line 96) | fn audio_ogg() {
function audio_wav (line 102) | fn audio_wav() {
function audio_flac (line 111) | fn audio_flac() {
function video_avi (line 117) | fn video_avi() {
function video_mp4 (line 126) | fn video_mp4() {
function video_mpeg (line 132) | fn video_mpeg() {
function video_quicktime (line 141) | fn video_quicktime() {
function video_webm (line 150) | fn video_webm() {
function unknown_media_type (line 173) | fn unknown_media_type() {
FILE: tests/core/format_output_path.rs
function as_is (line 13) | fn as_is() {
function substitute_title (line 24) | fn substitute_title() {
function substitute_title_multi (line 38) | fn substitute_title_multi() {
function sanitize (line 52) | fn sanitize() {
function level_up (line 66) | fn level_up() {
function file_name_extension (line 74) | fn file_name_extension() {
function file_name_extension_mhtml (line 82) | fn file_name_extension_mhtml() {
function file_name_extension_short (line 90) | fn file_name_extension_short() {
function file_name_extension_short_mhtml (line 98) | fn file_name_extension_short_mhtml() {
FILE: tests/core/options.rs
function defaults (line 13) | fn defaults() {
FILE: tests/core/parse_content_type.rs
function text_plain_utf8 (line 13) | fn text_plain_utf8() {
function text_plain_utf8_spaces (line 21) | fn text_plain_utf8_spaces() {
function empty (line 29) | fn empty() {
function base64 (line 37) | fn base64() {
function text_html_base64 (line 45) | fn text_html_base64() {
function only_media_type (line 53) | fn only_media_type() {
function only_media_type_colon (line 61) | fn only_media_type_colon() {
function media_type_gb2312_filename (line 69) | fn media_type_gb2312_filename() {
function media_type_filename_gb2312 (line 78) | fn media_type_filename_gb2312() {
FILE: tests/css/embed_css.rs
function empty_input (line 18) | fn empty_input() {
function trim_if_empty (line 27) | fn trim_if_empty() {
function style_exclude_unquoted_images (line 39) | fn style_exclude_unquoted_images() {
function style_exclude_single_quoted_images (line 70) | fn style_exclude_single_quoted_images() {
function style_block (line 101) | fn style_block() {
function attribute_selectors (line 119) | fn attribute_selectors() {
function import_string (line 159) | fn import_string() {
function hash_urls (line 186) | fn hash_urls() {
function transform_percentages_and_degrees (line 206) | fn transform_percentages_and_degrees() {
function unusual_indents (line 224) | fn unusual_indents() {
function exclude_fonts (line 244) | fn exclude_fonts() {
function content (line 288) | fn content() {
function ie_css_hack (line 309) | fn ie_css_hack() {
FILE: tests/css/is_image_url_prop.rs
function background (line 13) | fn background() {
function background_image (line 18) | fn background_image() {
function background_image_uppercase (line 23) | fn background_image_uppercase() {
function border_image (line 28) | fn border_image() {
function content (line 33) | fn content() {
function cursor (line 38) | fn cursor() {
function list_style (line 43) | fn list_style() {
function list_style_image (line 48) | fn list_style_image() {
function mask_image (line 53) | fn mask_image() {
function empty (line 70) | fn empty() {
function width (line 75) | fn width() {
function color (line 80) | fn color() {
function z_index (line 85) | fn z_index() {
FILE: tests/html/add_favicon.rs
function basic (line 16) | fn basic() {
FILE: tests/html/check_integrity.rs
function empty_input_sha256 (line 13) | fn empty_input_sha256() {
function sha256 (line 21) | fn sha256() {
function sha384 (line 29) | fn sha384() {
function sha512 (line 37) | fn sha512() {
function empty_hash (line 57) | fn empty_hash() {
function empty_input_empty_hash (line 62) | fn empty_input_empty_hash() {
function sha256 (line 67) | fn sha256() {
function sha384 (line 75) | fn sha384() {
function sha512 (line 83) | fn sha512() {
FILE: tests/html/compose_csp.rs
function isolated (line 14) | fn isolated() {
function no_css (line 26) | fn no_css() {
function no_fonts (line 35) | fn no_fonts() {
function no_frames (line 44) | fn no_frames() {
function no_js (line 53) | fn no_js() {
function no_images (line 62) | fn no_images() {
function all (line 71) | fn all() {
FILE: tests/html/create_metadata_tag.rs
function http_url (line 16) | fn http_url() {
function file_url (line 34) | fn file_url() {
function data_url (line 51) | fn data_url() {
FILE: tests/html/embed_srcset.rs
function small_medium_large (line 18) | fn small_medium_large() {
function small_medium_only_medium_has_scale (line 37) | fn small_medium_only_medium_has_scale() {
function commas_within_file_names (line 53) | fn commas_within_file_names() {
function narrow_whitespaces_within_file_names (line 69) | fn narrow_whitespaces_within_file_names() {
function tabs_and_newlines_after_commas (line 85) | fn tabs_and_newlines_after_commas() {
function no_whitespace_after_commas (line 104) | fn no_whitespace_after_commas() {
function last_without_descriptor (line 123) | fn last_without_descriptor() {
function trailing_comma (line 159) | fn trailing_comma() {
FILE: tests/html/get_base_url.rs
function present (line 13) | fn present() {
function multiple_tags (line 31) | fn multiple_tags() {
function absent (line 62) | fn absent() {
function no_href (line 76) | fn no_href() {
function empty_href (line 91) | fn empty_href() {
FILE: tests/html/get_charset.rs
function meta_content_type (line 13) | fn meta_content_type() {
function meta_charset (line 28) | fn meta_charset() {
function multiple_conflicting_meta_charset_first (line 43) | fn multiple_conflicting_meta_charset_first() {
function multiple_conflicting_meta_content_type_first (line 58) | fn multiple_conflicting_meta_content_type_first() {
FILE: tests/html/get_node_attr.rs
function div_two_style_attributes (line 15) | fn div_two_style_attributes() {
FILE: tests/html/get_node_name.rs
function parent_node_names (line 15) | fn parent_node_names() {
FILE: tests/html/has_favicon.rs
function icon (line 13) | fn icon() {
function shortcut_icon (line 22) | fn shortcut_icon() {
function absent (line 43) | fn absent() {
FILE: tests/html/is_favicon.rs
function icon (line 13) | fn icon() {
function shortcut_icon_capitalized (line 18) | fn shortcut_icon_capitalized() {
function icon_uppercase (line 23) | fn icon_uppercase() {
function apple_touch_icon (line 40) | fn apple_touch_icon() {
function mask_icon (line 45) | fn mask_icon() {
function fluid_icon (line 50) | fn fluid_icon() {
function stylesheet (line 55) | fn stylesheet() {
function empty_string (line 60) | fn empty_string() {
FILE: tests/html/parse_link_type.rs
function icon (line 13) | fn icon() {
function shortcut_icon_capitalized (line 18) | fn shortcut_icon_capitalized() {
function stylesheet (line 23) | fn stylesheet() {
function preload_stylesheet (line 28) | fn preload_stylesheet() {
function apple_touch_icon (line 33) | fn apple_touch_icon() {
function mask_icon (line 50) | fn mask_icon() {
function fluid_icon (line 55) | fn fluid_icon() {
function empty_string (line 60) | fn empty_string() {
FILE: tests/html/parse_srcset.rs
function three_items_with_width_descriptors_and_newlines (line 13) | fn three_items_with_width_descriptors_and_newlines() {
FILE: tests/html/serialize_document.rs
function div_as_root_element (line 14) | fn div_as_root_element() {
function full_page_with_no_html_head_or_body (line 26) | fn full_page_with_no_html_head_or_body() {
function doctype_and_the_rest_no_html_head_or_body (line 54) | fn doctype_and_the_rest_no_html_head_or_body() {
function doctype_and_the_rest_no_html_head_or_body_forbid_frames (line 78) | fn doctype_and_the_rest_no_html_head_or_body_forbid_frames() {
function doctype_and_the_rest_all_forbidden (line 102) | fn doctype_and_the_rest_all_forbidden() {
FILE: tests/html/set_node_attr.rs
function html_lang_and_body_style (line 15) | fn html_lang_and_body_style() {
function body_background (line 68) | fn body_background() {
FILE: tests/html/walk.rs
function basic (line 20) | fn basic() {
function ensure_no_recursive_iframe (line 47) | fn ensure_no_recursive_iframe() {
function ensure_no_recursive_frame (line 74) | fn ensure_no_recursive_frame() {
function no_css (line 101) | fn no_css() {
function no_images (line 145) | fn no_images() {
function no_body_background_images (line 186) | fn no_body_background_images() {
function no_frames (line 215) | fn no_frames() {
function no_iframes (line 251) | fn no_iframes() {
function no_js (line 286) | fn no_js() {
function keeps_integrity_for_unfamiliar_links (line 329) | fn keeps_integrity_for_unfamiliar_links() {
function discards_integrity_for_known_links_nojs_nocss (line 365) | fn discards_integrity_for_known_links_nojs_nocss() {
function discards_integrity_for_embedded_assets (line 407) | fn discards_integrity_for_embedded_assets() {
function removes_unwanted_meta_tags (line 450) | fn removes_unwanted_meta_tags() {
function processes_noscript_tags (line 498) | fn processes_noscript_tags() {
function preserves_script_type_json (line 545) | fn preserves_script_type_json() {
FILE: tests/js/attr_is_event_handler.rs
function onblur_camelcase (line 13) | fn onblur_camelcase() {
function onclick_lowercase (line 18) | fn onclick_lowercase() {
function onclick_camelcase (line 23) | fn onclick_camelcase() {
function href (line 40) | fn href() {
function empty_string (line 45) | fn empty_string() {
function class (line 50) | fn class() {
FILE: tests/session/retrieve_asset.rs
function read_data_url (line 18) | fn read_data_url() {
function read_local_file_with_file_url_parent (line 45) | fn read_local_file_with_file_url_parent() {
function read_local_file_with_data_url_parent (line 105) | fn read_local_file_with_data_url_parent() {
function read_local_file_with_https_parent (line 126) | fn read_local_file_with_https_parent() {
FILE: tests/url/clean_url.rs
function preserve_original (line 15) | fn preserve_original() {
function removes_fragment (line 25) | fn removes_fragment() {
function removes_empty_fragment (line 33) | fn removes_empty_fragment() {
function removes_empty_fragment_and_keeps_empty_query (line 41) | fn removes_empty_fragment_and_keeps_empty_query() {
function removes_empty_fragment_and_keeps_query (line 49) | fn removes_empty_fragment_and_keeps_query() {
function keeps_credentials (line 57) | fn keeps_credentials() {
FILE: tests/url/create_data_url.rs
function encode_string_with_specific_media_type (line 15) | fn encode_string_with_specific_media_type() {
function encode_append_fragment (line 32) | fn encode_append_fragment() {
function encode_string_with_specific_media_type_and_charset (line 48) | fn encode_string_with_specific_media_type_and_charset() {
function create_data_url_with_us_ascii_charset (line 66) | fn create_data_url_with_us_ascii_charset() {
function create_data_url_with_utf8_charset (line 81) | fn create_data_url_with_utf8_charset() {
function create_data_url_with_media_type_text_plain_and_utf8_charset (line 96) | fn create_data_url_with_media_type_text_plain_and_utf8_charset() {
FILE: tests/url/domain_is_within_domain.rs
function sub_domain_is_within_dotted_sub_domain (line 13) | fn sub_domain_is_within_dotted_sub_domain() {
function domain_is_within_dotted_domain (line 21) | fn domain_is_within_dotted_domain() {
function sub_domain_is_within_dotted_domain (line 29) | fn sub_domain_is_within_dotted_domain() {
function sub_domain_is_within_dotted_top_level_domain (line 37) | fn sub_domain_is_within_dotted_top_level_domain() {
function domain_is_within_itself (line 42) | fn domain_is_within_itself() {
function domain_with_trailing_dot_is_within_itself (line 50) | fn domain_with_trailing_dot_is_within_itself() {
function domain_with_trailing_dot_is_within_single_dot (line 58) | fn domain_with_trailing_dot_is_within_single_dot() {
function domain_matches_single_dot (line 63) | fn domain_matches_single_dot() {
function dotted_domain_must_be_within_dotted_domain (line 68) | fn dotted_domain_must_be_within_dotted_domain() {
function empty_is_within_dot (line 76) | fn empty_is_within_dot() {
function both_dots (line 81) | fn both_dots() {
function sub_domain_must_not_be_within_domain (line 98) | fn sub_domain_must_not_be_within_domain() {
function domain_must_not_be_within_top_level_domain (line 106) | fn domain_must_not_be_within_top_level_domain() {
function different_domains_must_not_be_within_one_another (line 111) | fn different_domains_must_not_be_within_one_another() {
function sub_domain_is_not_within_wrong_top_level_domain (line 119) | fn sub_domain_is_not_within_wrong_top_level_domain() {
function dotted_domain_is_not_within_domain (line 124) | fn dotted_domain_is_not_within_domain() {
function different_domain_is_not_within_dotted_domain (line 132) | fn different_domain_is_not_within_dotted_domain() {
function no_domain_can_be_within_empty_domain (line 140) | fn no_domain_can_be_within_empty_domain() {
function both_can_not_be_empty (line 145) | fn both_can_not_be_empty() {
FILE: tests/url/get_referer_url.rs
function preserve_original (line 15) | fn preserve_original() {
function removes_fragment (line 26) | fn removes_fragment() {
function removes_empty_fragment (line 35) | fn removes_empty_fragment() {
function removes_empty_fragment_and_keeps_empty_query (line 43) | fn removes_empty_fragment_and_keeps_empty_query() {
function removes_empty_fragment_and_keeps_query (line 51) | fn removes_empty_fragment_and_keeps_query() {
function removes_credentials (line 60) | fn removes_credentials() {
function removes_empty_credentials (line 69) | fn removes_empty_credentials() {
function removes_empty_username_credentials (line 77) | fn removes_empty_username_credentials() {
function removes_empty_password_credentials (line 85) | fn removes_empty_password_credentials() {
FILE: tests/url/is_url_and_has_protocol.rs
function mailto (line 13) | fn mailto() {
function tel (line 20) | fn tel() {
function ftp_no_slashes (line 25) | fn ftp_no_slashes() {
function ftp_with_credentials (line 30) | fn ftp_with_credentials() {
function javascript (line 37) | fn javascript() {
function http (line 42) | fn http() {
function https (line 47) | fn https() {
function file (line 52) | fn file() {
function mailto_uppercase (line 57) | fn mailto_uppercase() {
function empty_data_url (line 64) | fn empty_data_url() {
function empty_data_url_surrounded_by_spaces (line 69) | fn empty_data_url_surrounded_by_spaces() {
function url_with_no_protocol (line 86) | fn url_with_no_protocol() {
function relative_path (line 93) | fn relative_path() {
function relative_to_root_path (line 100) | fn relative_to_root_path() {
function empty_string (line 105) | fn empty_string() {
FILE: tests/url/parse_data_url.rs
function parse_text_html_base64 (line 15) | fn parse_text_html_base64() {
function parse_text_html_utf8 (line 27) | fn parse_text_html_utf8() {
function parse_text_html_plaintext (line 41) | fn parse_text_html_plaintext() {
function parse_text_css_url_encoded (line 58) | fn parse_text_css_url_encoded() {
function parse_no_media_type_base64 (line 68) | fn parse_no_media_type_base64() {
function parse_no_media_type_no_encoding (line 78) | fn parse_no_media_type_no_encoding() {
function empty_data_url (line 102) | fn empty_data_url() {
FILE: tests/url/resolve_url.rs
function basic_httsp_relative (line 15) | fn basic_httsp_relative() {
function basic_httsp_absolute (line 29) | fn basic_httsp_absolute() {
function from_https_to_level_up_relative (line 43) | fn from_https_to_level_up_relative() {
function from_https_url_to_url_with_no_protocol (line 57) | fn from_https_url_to_url_with_no_protocol() {
function from_https_url_to_url_with_no_protocol_and_on_different_hostname (line 69) | fn from_https_url_to_url_with_no_protocol_and_on_different_hostname() {
function from_https_url_to_absolute_path (line 81) | fn from_https_url_to_absolute_path() {
function from_https_to_just_filename (line 93) | fn from_https_to_just_filename() {
function from_data_url_to_https (line 105) | fn from_data_url_to_https() {
function from_data_url_to_data_url (line 118) | fn from_data_url_to_data_url() {
function from_file_url_to_relative_path (line 131) | fn from_file_url_to_relative_path() {
function from_file_url_to_relative_path_with_backslashes (line 143) | fn from_file_url_to_relative_path_with_backslashes() {
function from_data_url_to_file_url (line 155) | fn from_data_url_to_file_url() {
function preserve_fragment (line 168) | fn preserve_fragment() {
function resolve_from_file_url_to_file_url (line 180) | fn resolve_from_file_url_to_file_url() {
function from_data_url_to_url_with_no_protocol (line 217) | fn from_data_url_to_url_with_no_protocol() {
Condensed preview: 103 files, each showing path, character count, and a content snippet.
[
{
"path": ".actor/Dockerfile",
"chars": 141,
"preview": "FROM node:alpine\n\nRUN apk --no-cache add curl bash git monolith jq\nRUN npm -g install apify-cli\nCOPY .actor .actor\nCMD ."
},
{
"path": ".actor/README.md",
"chars": 2852,
"preview": "# Monolith Actor on Apify\n\n[](https://apify.com/sns"
},
{
"path": ".actor/actor.json",
"chars": 274,
"preview": "{\n\t\"actorSpecification\": 1,\n\t\"name\": \"monolith\",\n\t\"version\": \"0.0\",\n\t\"buildTag\": \"latest\",\n\t\"environmentVariables\": {},\n"
},
{
"path": ".actor/bin/actor.sh",
"chars": 937,
"preview": "#!/bin/bash\n#pwd\n#find ./storage\napify actor:get-input > /dev/null\nINPUT=`apify actor:get-input | jq -r .urls[] | xargs "
},
{
"path": ".actor/dataset_schema.json",
"chars": 1394,
"preview": "{\n \"actorSpecification\": 1,\n \"fields\":{\n \"title\": \"Sherlock actor input\",\n \"description\": \"This is actor"
},
{
"path": ".actor/input_schema.json",
"chars": 405,
"preview": "{\n \"title\": \"Sherlock actor input\",\n \"description\": \"This is actor input schema\",\n \"type\": \"object\",\n \"schemaVersion"
},
{
"path": ".dockerignore",
"chars": 9,
"preview": "/target/\n"
},
{
"path": ".github/FUNDING.yml",
"chars": 61,
"preview": "# These are supported funding model platforms\n\ngithub: snshn\n"
},
{
"path": ".github/workflows/build_gnu_linux.yml",
"chars": 519,
"preview": "name: GNU/Linux\n\non:\n push:\n branches: [ master ]\n paths-ignore:\n - 'assets/'\n - 'dist/'\n - 'snap/'\n "
},
{
"path": ".github/workflows/build_macos.yml",
"chars": 514,
"preview": "name: macOS\n\non:\n push:\n branches: [ master ]\n paths-ignore:\n - 'assets/'\n - 'dist/'\n - 'snap/'\n - 'D"
},
{
"path": ".github/workflows/build_windows.yml",
"chars": 518,
"preview": "name: Windows\n\non:\n push:\n branches: [ master ]\n paths-ignore:\n - 'assets/'\n - 'dist/'\n - 'snap/'\n - "
},
{
"path": ".github/workflows/cd.yml",
"chars": 4097,
"preview": "# CD GitHub Actions workflow for monolith\n\nname: CD\n\non:\n release:\n types:\n - created\n\njobs:\n\n gnu_linux_aarch64"
},
{
"path": ".github/workflows/ci-netbsd.yml",
"chars": 794,
"preview": "# CI NetBSD GitHub Actions workflow for monolith\n\nname: CI (NetBSD)\n\non:\n pull_request:\n branches: [ master ]\n pa"
},
{
"path": ".github/workflows/ci.yml",
"chars": 838,
"preview": "# CI GitHub Actions workflow for monolith\n\nname: CI\n\non:\n pull_request:\n branches: [ master ]\n paths-ignore:\n "
},
{
"path": ".gitignore",
"chars": 180,
"preview": "# Generated by Cargo\n# will have compiled files and executables\n/target/\n\n# These are backup files generated by rustfmt\n"
},
{
"path": "Cargo.toml",
"chars": 2648,
"preview": "[package]\nname = \"monolith\"\nversion = \"2.11.0\"\nauthors = [\n \"Sunshine <snshn@tutanota.com>\",\n \"Mahdi Robatipoor <m"
},
{
"path": "Dockerfile",
"chars": 588,
"preview": "FROM clux/muslrust:stable as builder\n\nRUN curl -L -o monolith.tar.gz $(curl -s https://api.github.com/repos/y2z/monolith"
},
{
"path": "LICENSE",
"chars": 7048,
"preview": "Creative Commons Legal Code\n\nCC0 1.0 Universal\n\n CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE\n"
},
{
"path": "Makefile",
"chars": 707,
"preview": "# Makefile for monolith\n\nall: build build-gui\n.PHONY: all\n\nbuild:\n\t@cargo build --locked\n.PHONY: build\n\nbuild-gui:\n\t@car"
},
{
"path": "README.md",
"chars": 8599,
"preview": "[](https://github."
},
{
"path": "dist/run-in-container.sh",
"chars": 161,
"preview": "#!/bin/sh\n\nDOCKER=docker\nif which podman 2>&1 > /dev/null; then\n DOCKER=podman\nfi\nORG_NAME=y2z\nPROG_NAME=monolith\n\n$D"
},
{
"path": "monolith.nuspec",
"chars": 1561,
"preview": "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<package xmlns=\"http://schemas.microsoft.com/packaging/2015/06/nuspec.xsd\">\n <me"
},
{
"path": "snap/snapcraft.yaml",
"chars": 1961,
"preview": "name: monolith\nbase: core18 \n# Version data defined inside the monolith part below\nadopt-info: monolith\nsummary: Monolit"
},
{
"path": "src/cache.rs",
"chars": 5829,
"preview": "use std::collections::HashMap;\nuse std::fs::File;\nuse std::io::{BufWriter, Write};\nuse std::path::Path;\n\nuse redb::{Data"
},
{
"path": "src/cookies.rs",
"chars": 3616,
"preview": "use std::time::{SystemTime, UNIX_EPOCH};\n\nuse crate::url::Url;\n\npub struct Cookie {\n pub domain: String,\n pub incl"
},
{
"path": "src/core.rs",
"chars": 19641,
"preview": "use std::env;\nuse std::error::Error;\nuse std::fmt;\nuse std::fs;\nuse std::io::{self, Write};\nuse std::path::Path;\n\nuse ch"
},
{
"path": "src/css.rs",
"chars": 14354,
"preview": "use cssparser::{\n serialize_identifier, serialize_string, ParseError, Parser, ParserInput, SourcePosition, Token,\n};\n"
},
{
"path": "src/gui.rs",
"chars": 11299,
"preview": "use std::fs;\nuse std::io::Write;\nuse std::path;\nuse std::thread;\n\nuse directories::UserDirs;\nuse druid::widget::{Button,"
},
{
"path": "src/html.rs",
"chars": 57986,
"preview": "use base64::{prelude::BASE64_STANDARD, Engine};\nuse chrono::{SecondsFormat, Utc};\nuse encoding_rs::Encoding;\nuse html5ev"
},
{
"path": "src/js.rs",
"chars": 2312,
"preview": "const JS_DOM_EVENT_ATTRS: &[&str] = &[\n // From WHATWG HTML spec 8.1.5.2 \"Event handlers on elements, Document object"
},
{
"path": "src/lib.rs",
"chars": 115,
"preview": "pub mod cache;\npub mod cookies;\npub mod core;\npub mod css;\npub mod html;\npub mod js;\npub mod session;\npub mod url;\n"
},
{
"path": "src/main.rs",
"chars": 10111,
"preview": "use std::fs;\nuse std::io::{self, Error as IoError, Read, Write};\nuse std::process;\n\nuse clap::Parser;\nuse tempfile::{Bui"
},
{
"path": "src/session.rs",
"chars": 8866,
"preview": "use std::fs;\nuse std::path::{Path, PathBuf};\nuse std::time::Duration;\n\nuse reqwest::blocking::Client;\nuse reqwest::heade"
},
{
"path": "src/url.rs",
"chars": 4539,
"preview": "use base64::{prelude::BASE64_STANDARD, Engine};\nuse percent_encoding::percent_decode_str;\npub use url::Url;\n\nuse crate::"
},
{
"path": "tests/_data_/basic/local-file.html",
"chars": 512,
"preview": "<!doctype html>\n\n<html lang=\"en\">\n\n<head>\n <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\n <tit"
},
{
"path": "tests/_data_/basic/local-script.js",
"chars": 82,
"preview": "document.body.style.backgroundColor = \"green\";\ndocument.body.style.color = \"red\";\n"
},
{
"path": "tests/_data_/basic/local-style.css",
"chars": 54,
"preview": "body {\n background-color: #000;\n color: #fff;\n}\n"
},
{
"path": "tests/_data_/css/index.html",
"chars": 127,
"preview": "<style>\n\n @charset 'UTF-8';\n\n @import 'style.css';\n\n @import url(style.css);\n\n @import url('style.css');\n\n</"
},
{
"path": "tests/_data_/css/style.css",
"chars": 39,
"preview": "body{background-color:#000;color:#fff}\n"
},
{
"path": "tests/_data_/import-css-via-data-url/index.html",
"chars": 461,
"preview": "<!doctype html>\n\n<html lang=\"en\">\n\n<head>\n <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\n <tit"
},
{
"path": "tests/_data_/import-css-via-data-url/style.css",
"chars": 51,
"preview": "body {\n background-color: pink;\n color: white;\n}\n"
},
{
"path": "tests/_data_/integrity/index.html",
"chars": 1097,
"preview": "<!doctype html>\n\n<html lang=\"en\">\n <head>\n <title>Local HTML file</title>\n <link\n href=\"styl"
},
{
"path": "tests/_data_/integrity/script.js",
"chars": 48,
"preview": "function noop() {\n console.log(\"</script>\");\n}\n"
},
{
"path": "tests/_data_/integrity/style.css",
"chars": 54,
"preview": "body {\n background-color: #000;\n color: #FFF;\n}\n"
},
{
"path": "tests/_data_/noscript/index.html",
"chars": 58,
"preview": "<body><noscript><img src=\"image.svg\" /></noscript></body>\n"
},
{
"path": "tests/_data_/noscript/nested.html",
"chars": 104,
"preview": "<body><noscript><h1>JS is not active</h1><noscript><img src=\"image.svg\" /></noscript></noscript></body>\n"
},
{
"path": "tests/_data_/noscript/script.html",
"chars": 84,
"preview": "<body><noscript><script>alert(1);</script><img src=\"image.svg\" /></noscript></body>\n"
},
{
"path": "tests/_data_/svg/image.html",
"chars": 156,
"preview": "<html>\n <body>\n <svg height=\"24\" width=\"24\">\n <image href=\"image.svg\" width=\"24\" height=\"24\"></use>"
},
{
"path": "tests/_data_/svg/index.html",
"chars": 55,
"preview": "<div style=\"background-image: url('image.svg')\"></div>\n"
},
{
"path": "tests/_data_/svg/svg.html",
"chars": 298,
"preview": "<html>\n<body>\n<button class=\"tm-votes-lever__button\" data-test-id=\"votes-lever-upvote-button\" title=\"Like\" type=\"button\""
},
{
"path": "tests/_data_/unusual_encodings/gb2312.html",
"chars": 167,
"preview": "<html>\n<head>\n <meta http-equiv=\"content-type\" content=\"text/html;charset=GB2312\"/>\n <title>߳˼ֻת--áƼ-- </title>\n</"
},
{
"path": "tests/_data_/unusual_encodings/iso-8859-1.html",
"chars": 170,
"preview": "<html>\n <head>\n <meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">\n </head>\n <body"
},
{
"path": "tests/cli/base_url.rs",
"chars": 3644,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/cli/basic.rs",
"chars": 5488,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/cli/data_url.rs",
"chars": 7221,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/cli/local_files.rs",
"chars": 13471,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/cli/mod.rs",
"chars": 93,
"preview": "mod base_url;\nmod basic;\nmod data_url;\nmod local_files;\nmod noscript;\nmod unusual_encodings;\n"
},
{
"path": "tests/cli/noscript.rs",
"chars": 7854,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/cli/unusual_encodings.rs",
"chars": 7719,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/cookies/cookie/is_expired.rs",
"chars": 1949,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/cookies/cookie/matches_url.rs",
"chars": 3232,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/cookies/cookie/mod.rs",
"chars": 33,
"preview": "mod is_expired;\nmod matches_url;\n"
},
{
"path": "tests/cookies/mod.rs",
"chars": 44,
"preview": "mod cookie;\nmod parse_cookie_file_contents;\n"
},
{
"path": "tests/cookies/parse_cookie_file_contents.rs",
"chars": 3103,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/core/detect_media_type.rs",
"chars": 5031,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/core/format_output_path.rs",
"chars": 2992,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/core/mod.rs",
"chars": 84,
"preview": "mod detect_media_type;\nmod format_output_path;\nmod options;\nmod parse_content_type;\n"
},
{
"path": "tests/core/options.rs",
"chars": 1191,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/core/parse_content_type.rs",
"chars": 2724,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/css/embed_css.rs",
"chars": 10899,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/css/is_image_url_prop.rs",
"chars": 2026,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/css/mod.rs",
"chars": 38,
"preview": "mod embed_css;\nmod is_image_url_prop;\n"
},
{
"path": "tests/html/add_favicon.rs",
"chars": 1194,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/html/check_integrity.rs",
"chars": 2342,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/html/compose_csp.rs",
"chars": 2497,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/html/create_metadata_tag.rs",
"chars": 2099,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/html/embed_srcset.rs",
"chars": 5919,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/html/get_base_url.rs",
"chars": 2542,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/html/get_charset.rs",
"chars": 2039,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/html/get_node_attr.rs",
"chars": 1876,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/html/get_node_name.rs",
"chars": 1937,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/html/has_favicon.rs",
"chars": 1564,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/html/is_favicon.rs",
"chars": 1488,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/html/mod.rs",
"chars": 296,
"preview": "mod add_favicon;\nmod check_integrity;\nmod compose_csp;\nmod create_metadata_tag;\nmod embed_srcset;\nmod get_base_url;\nmod "
},
{
"path": "tests/html/parse_link_type.rs",
"chars": 1786,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/html/parse_srcset.rs",
"chars": 1566,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/html/serialize_document.rs",
"chars": 6139,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/html/set_node_attr.rs",
"chars": 3942,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/html/walk.rs",
"chars": 17779,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/js/attr_is_event_handler.rs",
"chars": 1354,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/js/mod.rs",
"chars": 27,
"preview": "mod attr_is_event_handler;\n"
},
{
"path": "tests/mod.rs",
"chars": 81,
"preview": "mod cli;\nmod cookies;\nmod core;\nmod css;\nmod html;\nmod js;\nmod session;\nmod url;\n"
},
{
"path": "tests/session/mod.rs",
"chars": 20,
"preview": "mod retrieve_asset;\n"
},
{
"path": "tests/session/retrieve_asset.rs",
"chars": 4915,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/url/clean_url.rs",
"chars": 1928,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/url/create_data_url.rs",
"chars": 3089,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/url/domain_is_within_domain.rs",
"chars": 3825,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/url/get_referer_url.rs",
"chars": 2756,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/url/is_url_and_has_protocol.rs",
"chars": 2679,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/url/mod.rs",
"chars": 152,
"preview": "mod clean_url;\nmod create_data_url;\nmod domain_is_within_domain;\nmod get_referer_url;\nmod is_url_and_has_protocol;\nmod p"
},
{
"path": "tests/url/parse_data_url.rs",
"chars": 3655,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
},
{
"path": "tests/url/resolve_url.rs",
"chars": 6777,
"preview": "// ██████╗ █████╗ ███████╗███████╗██╗███╗ ██╗ ██████╗\n// ██╔══██╗██╔══██╗██╔════╝██╔════╝██║████╗ ██║██╔════╝\n// "
}
]
// ... and 1 more file (not shown in this preview)
About this extraction
This page contains the full source code of the Y2Z/monolith GitHub repository, extracted and formatted as plain text. The extraction includes 103 files (338.1 KB), approximately 90.2k tokens, and a symbol index with 386 extracted functions, classes, methods, constants, and types.