Repository: alexferrari88/sbstck-dl Branch: main Commit: 775085259f25 Files: 35 Total size: 309.7 KB Directory structure: gitextract_tn_9uzpl/ ├── .github/ │ └── workflows/ │ ├── build-release.yml │ └── test.yml ├── .gitignore ├── .serena/ │ ├── .gitignore │ ├── memories/ │ │ ├── code_style_conventions.md │ │ ├── files_feature_overview.md │ │ ├── project_overview.md │ │ ├── project_structure.md │ │ ├── suggested_commands.md │ │ ├── task_completion_checklist.md │ │ └── testing_patterns.md │ └── project.yml ├── CLAUDE.md ├── LICENSE ├── README.md ├── cmd/ │ ├── cmd_test.go │ ├── download.go │ ├── integration_test.go │ ├── list.go │ ├── main.go │ ├── root.go │ └── version.go ├── go.mod ├── go.sum ├── lib/ │ ├── extractor.go │ ├── extractor_test.go │ ├── fetcher.go │ ├── fetcher_test.go │ ├── files.go │ ├── files_test.go │ ├── images.go │ └── images_test.go ├── main.go └── specs/ ├── archive-index-page.md └── file-attachment-download.md ================================================ FILE CONTENTS ================================================ ================================================ FILE: .github/workflows/build-release.yml ================================================ name: Manual Build and Release on: workflow_dispatch: inputs: branch: description: 'Branch to build' required: true default: 'main' release: types: [created] jobs: test: name: Run Tests runs-on: ${{ matrix.os }} strategy: matrix: os: [ubuntu-latest, macos-latest, windows-latest] go-version: [1.24.1] steps: - name: Check out code uses: actions/checkout@v4 with: ref: ${{ github.event.inputs.branch || github.ref }} - name: Set up Go uses: actions/setup-go@v4 with: go-version: ${{ matrix.go-version }} - name: Run tests run: go test -v -timeout=10m ./... 
build: name: Build needs: test if: success() runs-on: ${{ matrix.os }} strategy: matrix: os: [ubuntu-latest, macos-latest, windows-latest] go-version: [1.24.1] include: - os: ubuntu-latest goos: linux goarch: amd64 name: ubuntu extension: "" - os: macos-latest goos: darwin goarch: amd64 name: mac extension: "" - os: windows-latest goos: windows goarch: amd64 name: win extension: ".exe" steps: - name: Check out code uses: actions/checkout@v4 with: ref: ${{ github.event.inputs.branch || github.ref }} - name: Set up Go uses: actions/setup-go@v4 with: go-version: ${{ matrix.go-version }} - name: Build run: | env GOOS=${{ matrix.goos }} GOARCH=${{ matrix.goarch }} go build -v -o sbstck-dl-${{ matrix.name }}-${{ matrix.goarch }}${{ matrix.extension }} - name: Upload artifact uses: actions/upload-artifact@v4 with: name: sbstck-dl-${{ matrix.name }}-${{ matrix.goarch }} path: sbstck-dl-${{ matrix.name }}-${{ matrix.goarch }}${{ matrix.extension }} release-upload: name: Attach Artifacts to Release if: github.event_name == 'release' needs: build runs-on: ubuntu-latest permissions: contents: write # This is needed for release uploads steps: - name: Debug event info run: | echo "Event name: ${{ github.event_name }}" echo "Event type: ${{ github.event.action }}" echo "Release tag: ${{ github.event.release.tag_name }}" - name: Download all artifacts uses: actions/download-artifact@v4 with: path: artifacts - name: List artifacts run: find artifacts -type f | sort - name: Upload artifacts to release uses: softprops/action-gh-release@v1 with: files: artifacts/**/* # GitHub automatically provides this token token: ${{ github.token }} ================================================ FILE: .github/workflows/test.yml ================================================ name: Run Tests on: pull_request: branches: [main] jobs: test: name: Run Tests runs-on: ${{ matrix.os }} strategy: matrix: os: [ubuntu-latest, macos-latest, windows-latest] go-version: [1.24.1] steps: - name: Check out code 
        uses: actions/checkout@v4
      - name: Set up Go
        uses: actions/setup-go@v4
        with:
          go-version: ${{ matrix.go-version }}
      - name: Run tests
        run: go test -v ./...

================================================
FILE: .gitignore
================================================
# If you prefer the allow list template instead of the deny list, see community template:
# https://github.com/github/gitignore/blob/main/community/Golang/Go.AllowList.gitignore
#
# Binaries for programs and plugins
*.exe
*.exe~
*.dll
*.so
*.dylib
bin/

# Test binary, built with `go test -c`
*.test

# Output of the go coverage tool, specifically when used with LiteIDE
*.out

# Dependency directories (remove the comment below to include it)
# vendor/

# Go workspace file
go.work

# Directory containing scraped content
scraped/
test-download/

# vscode
.vscode/

# serena
.serena/cache/

================================================
FILE: .serena/.gitignore
================================================
/cache

================================================
FILE: .serena/memories/code_style_conventions.md
================================================
# Code Style and Conventions

## Go Style Guidelines
- Follows standard Go conventions and formatting
- Uses `gofmt` for code formatting
- Package naming: lowercase, single words when possible
- Function naming: PascalCase for exported, camelCase for unexported
- Variable naming: camelCase, descriptive names

## Code Organization
- **Separation of Concerns**: CLI logic in `cmd/`, core business logic in `lib/`
- **Error Handling**: Explicit error returns, wrapping with context using `fmt.Errorf`
- **Testing**: Table-driven tests, benchmarks for performance-critical code
- **Concurrency**: Uses errgroup for managed goroutines, context for cancellation

## Naming Conventions
- **Structs**: PascalCase (e.g., `FileDownloader`, `ImageInfo`)
- **Interfaces**: Usually end with -er (e.g., implied by method names)
- **Constants**: PascalCase for exported, camelCase for
unexported - **Files**: snake_case for test files (`*_test.go`) ## Function Design Patterns - **Constructor Pattern**: `NewXxx()` functions for creating instances - **Options Pattern**: Used in fetcher with `FetcherOption` functional options - **Context Propagation**: All network operations accept `context.Context` - **Resource Management**: Proper `defer` usage for cleanup (file handles, HTTP responses) ## Documentation - **Godoc Comments**: All exported functions, types, and constants have comments - **README**: Comprehensive usage examples and feature documentation - **Code Comments**: Explain complex logic, especially in parsing and URL manipulation ================================================ FILE: .serena/memories/files_feature_overview.md ================================================ # File Attachment Download Feature ## Implementation Overview New feature added in `lib/files.go` that allows downloading file attachments from Substack posts. ## Key Components ### FileDownloader struct - Manages file downloads with rate limiting via Fetcher - Configurable output directory and file extensions filter - Integrates with existing image download workflow ### CSS Selector Detection - Uses `.file-embed-button.wide` to find file attachment links - Extracts download URLs from `href` attributes ### Core Functions - `DownloadFiles()` - Main entry point, returns FileDownloadResult - `extractFileElements()` - Finds file links in HTML using CSS selector - `downloadSingleFile()` - Downloads individual files with error handling - `updateHTMLWithLocalPaths()` - Replaces URLs with local paths ### Features - Extension filtering via `--file-extensions` flag - Custom output directory via `--files-dir` flag - Filename extraction from URLs and query parameters - Safe filename sanitization (removes unsafe characters) - File existence checking (skip if already downloaded) - Relative path conversion for HTML references ## CLI Integration - New flags in `cmd/download.go`: - 
`--download-files` - Enable file downloading - `--file-extensions` - Filter by extensions (comma-separated) - `--files-dir` - Custom files directory name ## Integration with Extractor - Extended `WriteToFileWithImages()` to also handle file downloads - Unified workflow for both images and files ================================================ FILE: .serena/memories/project_overview.md ================================================ # Project Overview ## Purpose sbstck-dl is a Go CLI tool for downloading posts from Substack blogs. It supports downloading individual posts or entire archives, with features for private newsletters (via cookies), rate limiting, and format conversion (HTML/Markdown/Text). The tool also supports downloading images and file attachments locally. ## Tech Stack - **Language**: Go 1.20+ - **CLI Framework**: Cobra (github.com/spf13/cobra) - **HTML Parsing**: goquery (github.com/PuerkitoBio/goquery) - **HTML to Markdown**: html-to-markdown (github.com/JohannesKaufmann/html-to-markdown) - **HTML to Text**: html2text (github.com/k3a/html2text) - **Retry Logic**: backoff (github.com/cenkalti/backoff/v4) - **Rate Limiting**: golang.org/x/time/rate - **Concurrency**: golang.org/x/sync/errgroup - **Progress Bar**: progressbar (github.com/schollz/progressbar/v3) - **Testing**: testify (github.com/stretchr/testify) ## Repository Structure - `main.go`: Entry point - `cmd/`: Cobra CLI commands (root.go, download.go, list.go, version.go) - `lib/`: Core library components - `fetcher.go`: HTTP client with rate limiting, retries, and cookie support - `extractor.go`: Post extraction and format conversion (HTML→Markdown/Text) - `images.go`: Image downloading and local path management - `files.go`: File attachment downloading and local path management - `.github/workflows/`: CI/CD workflows for testing and releases - Tests are co-located with source files (e.g., `lib/fetcher_test.go`) ================================================ FILE: 
.serena/memories/project_structure.md ================================================ # Project Structure - sbstck-dl ## Overview Go CLI tool for downloading posts from Substack blogs with support for private newsletters, rate limiting, and format conversion. ## Directory Structure ``` ├── main.go # Entry point ├── cmd/ # Cobra CLI commands │ ├── root.go │ ├── download.go # Main download functionality │ ├── list.go │ ├── version.go │ ├── cmd_test.go # Command tests │ └── integration_test.go ├── lib/ # Core library │ ├── fetcher.go # HTTP client with rate limiting/retries │ ├── fetcher_test.go # Comprehensive HTTP client tests │ ├── extractor.go # Post extraction and format conversion │ ├── extractor_test.go # Extractor tests │ ├── images.go # Image downloader │ ├── images_test.go # Comprehensive image tests │ └── files.go # NEW: File attachment downloader └── go.mod # Dependencies ``` ## Key Dependencies - `github.com/spf13/cobra` - CLI framework - `github.com/PuerkitoBio/goquery` - HTML parsing - `github.com/stretchr/testify` - Testing framework - `github.com/cenkalti/backoff/v4` - Exponential backoff - `golang.org/x/time/rate` - Rate limiting ================================================ FILE: .serena/memories/suggested_commands.md ================================================ # Suggested Commands ## Development Commands ### Building ```bash go build -o sbstck-dl . ``` ### Running ```bash go run . [command] [flags] ``` ### Testing ```bash # Run all tests go test ./... # Run tests with verbose output go test -v ./... # Run tests for specific package go test ./lib go test ./cmd ``` ### Module Management ```bash # Clean up dependencies go mod tidy # Download dependencies go mod download # Verify dependencies go mod verify ``` ### Running the CLI Locally ```bash # Download single post go run . download --url https://example.substack.com/p/post-title --output ./downloads # Download entire archive go run . 
download --url https://example.substack.com --output ./downloads # Download with images go run . download --url https://example.substack.com --download-images --output ./downloads # Download with file attachments go run . download --url https://example.substack.com --download-files --output ./downloads # Download with both images and files go run . download --url https://example.substack.com --download-images --download-files --output ./downloads # Test with dry run and verbose output go run . download --url https://example.substack.com --verbose --dry-run ``` ### System Commands (Linux) - `rg` (ripgrep) for searching instead of grep - Standard Linux commands: `ls`, `cd`, `find`, `git` ================================================ FILE: .serena/memories/task_completion_checklist.md ================================================ # Task Completion Checklist ## After Completing Development Tasks ### Testing 1. **Run Unit Tests**: `go test ./...` 2. **Run Integration Tests**: `go test -v ./...` 3. **Test CLI Commands**: Manual testing with real Substack URLs 4. **Test Edge Cases**: Error conditions, malformed URLs, network failures ### Code Quality 1. **Format Code**: `gofmt -w .` (usually handled by editor) 2. **Lint Code**: Use `golint` or `go vet` if available 3. **Verify Dependencies**: `go mod tidy && go mod verify` ### Documentation Updates 1. **Update CLAUDE.md**: Add new features, commands, architectural changes 2. **Update README.md**: Add usage examples for new features 3. **Update Help Text**: Ensure CLI help reflects new flags and options 4. **Update Comments**: Ensure godoc comments are current ### Version Control 1. **Stage Changes**: `git add` only relevant files 2. **Commit**: Use conventional commits format - `feat: add new feature` - `fix: resolve bug` - `docs: update documentation` - `test: add tests` - `refactor: improve code structure` 3. **Clean Up**: Remove any temporary files or test artifacts ### Build Verification 1. 
**Build Binary**: `go build -o sbstck-dl .` 2. **Test Binary**: Run basic commands to ensure it works 3. **Cross-Platform Check**: Ensure no platform-specific code issues ================================================ FILE: .serena/memories/testing_patterns.md ================================================ # Testing Patterns in sbstck-dl ## Test Structure - All tests use `github.com/stretchr/testify` with `assert` and `require` - Tests organized in table-driven style where appropriate - Each major component has comprehensive test coverage ## Common Patterns ### HTTP Server Tests - Use `httptest.NewServer()` for mock servers - Test various response scenarios (success, errors, timeouts) - Handle concurrent requests and rate limiting ### File I/O Tests - Use `os.MkdirTemp()` for temporary directories - Always clean up with `defer os.RemoveAll(tempDir)` - Test file creation, existence, and content validation ### HTML Parsing Tests - Use `goquery.NewDocumentFromReader(strings.NewReader(html))` - Test various HTML structures and edge cases - Validate URL extraction and replacement ### Error Handling Tests - Test both success and failure scenarios - Use specific error assertions and error message checking - Test context cancellation and timeouts ### Benchmark Tests - Include performance benchmarks for critical paths - Use `b.ResetTimer()` appropriately - Test both single operations and concurrent scenarios ## Test Organization - Unit tests for individual functions - Integration tests for complete workflows - Regression tests for specific bug fixes - Real-world data tests (when sample data available) ================================================ FILE: .serena/project.yml ================================================ # language of the project (csharp, python, rust, java, typescript, go, cpp, or ruby) # * For C, use cpp # * For JavaScript, use typescript # Special requirements: # * csharp: Requires the presence of a .sln file in the project folder. 
language: go # whether to use the project's gitignore file to ignore files # Added on 2025-04-07 ignore_all_files_in_gitignore: true # list of additional paths to ignore # same syntax as gitignore, so you can use * and ** # Was previously called `ignored_dirs`, please update your config if you are using that. # Added (renamed)on 2025-04-07 ignored_paths: [] # whether the project is in read-only mode # If set to true, all editing tools will be disabled and attempts to use them will result in an error # Added on 2025-04-18 read_only: false # list of tool names to exclude. We recommend not excluding any tools, see the readme for more details. # Below is the complete list of tools for convenience. # To make sure you have the latest list of tools, and to view their descriptions, # execute `uv run scripts/print_tool_overview.py`. # # * `activate_project`: Activates a project by name. # * `check_onboarding_performed`: Checks whether project onboarding was already performed. # * `create_text_file`: Creates/overwrites a file in the project directory. # * `delete_lines`: Deletes a range of lines within a file. # * `delete_memory`: Deletes a memory from Serena's project-specific memory store. # * `execute_shell_command`: Executes a shell command. # * `find_referencing_code_snippets`: Finds code snippets in which the symbol at the given location is referenced. # * `find_referencing_symbols`: Finds symbols that reference the symbol at the given location (optionally filtered by type). # * `find_symbol`: Performs a global (or local) search for symbols with/containing a given name/substring (optionally filtered by type). # * `get_current_config`: Prints the current configuration of the agent, including the active and available projects, tools, contexts, and modes. # * `get_symbols_overview`: Gets an overview of the top-level symbols defined in a given file or directory. # * `initial_instructions`: Gets the initial instructions for the current project. 
# Should only be used in settings where the system prompt cannot be set, # e.g. in clients you have no control over, like Claude Desktop. # * `insert_after_symbol`: Inserts content after the end of the definition of a given symbol. # * `insert_at_line`: Inserts content at a given line in a file. # * `insert_before_symbol`: Inserts content before the beginning of the definition of a given symbol. # * `list_dir`: Lists files and directories in the given directory (optionally with recursion). # * `list_memories`: Lists memories in Serena's project-specific memory store. # * `onboarding`: Performs onboarding (identifying the project structure and essential tasks, e.g. for testing or building). # * `prepare_for_new_conversation`: Provides instructions for preparing for a new conversation (in order to continue with the necessary context). # * `read_file`: Reads a file within the project directory. # * `read_memory`: Reads the memory with the given name from Serena's project-specific memory store. # * `remove_project`: Removes a project from the Serena configuration. # * `replace_lines`: Replaces a range of lines within a file with new content. # * `replace_symbol_body`: Replaces the full definition of a symbol. # * `restart_language_server`: Restarts the language server, may be necessary when edits not through Serena happen. # * `search_for_pattern`: Performs a search for a pattern in the project. # * `summarize_changes`: Provides instructions for summarizing the changes made to the codebase. # * `switch_modes`: Activates modes by providing a list of their names # * `think_about_collected_information`: Thinking tool for pondering the completeness of collected information. # * `think_about_task_adherence`: Thinking tool for determining whether the agent is still on track with the current task. # * `think_about_whether_you_are_done`: Thinking tool for determining whether the task is truly completed. 
# * `write_memory`: Writes a named memory (for future reference) to Serena's project-specific memory store. excluded_tools: [] # initial prompt for the project. It will always be given to the LLM upon activating the project # (contrary to the memories, which are loaded on demand). initial_prompt: "" project_name: "sbstck-dl" ================================================ FILE: CLAUDE.md ================================================ # CLAUDE.md This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. ## Project Overview This is a Go CLI tool for downloading posts from Substack blogs. It supports downloading individual posts or entire archives, with features for private newsletters (via cookies), rate limiting, format conversion (HTML/Markdown/Text), downloading of images and file attachments locally, and creating archive index pages that link all downloaded posts with their metadata. ## Architecture The project follows a standard Go CLI structure: - `main.go`: Entry point - `cmd/`: Contains Cobra CLI commands (`root.go`, `download.go`, `list.go`, `version.go`) - `lib/`: Core library with four main components: - `fetcher.go`: HTTP client with rate limiting, retries, and cookie support - `extractor.go`: Post extraction and format conversion (HTML→Markdown/Text) - `images.go`: Image downloading and local path management - `files.go`: File attachment downloading and local path management ## Build and Development Commands ### Building ```bash go build -o sbstck-dl . ``` ### Running ```bash go run . [command] [flags] ``` ### Testing ```bash go test ./... 
go test ./lib ``` ### Module management ```bash go mod tidy go mod download ``` ## Key Components ### Fetcher (`lib/fetcher.go`) - Handles HTTP requests with exponential backoff retry - Rate limiting (default: 2 requests/second) - Cookie support for private newsletters - Proxy support ### Extractor (`lib/extractor.go`) - Parses Substack post JSON from HTML - Extracts post metadata including subtitle (.subtitle CSS selector) and cover image (og:image meta tag) - Converts HTML to Markdown/Text using external libraries - Handles file writing with different formats - Provides archive page generation functionality (HTML/Markdown/Text formats) - Manages archive entries with automatic sorting by publication date (newest first) ### Image Downloader (`lib/images.go`) - Downloads images locally from Substack posts - Supports multiple image quality levels (high/medium/low) - Handles various Substack CDN URL patterns - Updates HTML/Markdown content to reference local image paths - Creates organized directory structure for downloaded images ### File Downloader (`lib/files.go`) - Downloads file attachments from Substack posts using CSS selector `.file-embed-button.wide` - Supports file extension filtering (optional) - Creates organized directory structure for downloaded files - Updates HTML content to reference local file paths - Handles filename sanitization and collision avoidance - Integrates with existing image download workflow ### Archive Page Generator (`lib/extractor.go`) - Creates index pages linking all downloaded posts with metadata - Supports HTML, Markdown, and Text formats matching the selected output format - Includes post titles (linked to downloaded files with relative paths) - Shows publication dates and download timestamps - Displays post descriptions/subtitles and cover images when available - Automatically sorts posts by publication date (newest first) - Generates `index.{format}` in the output directory root ### Commands Structure Uses Cobra framework: - 
`download`: Main functionality for downloading posts - `list`: Lists available posts from a Substack - `version`: Shows version information ## Dependencies - `github.com/spf13/cobra`: CLI framework - `github.com/PuerkitoBio/goquery`: HTML parsing - `github.com/JohannesKaufmann/html-to-markdown`: HTML to Markdown conversion - `github.com/cenkalti/backoff/v4`: Exponential backoff for retries - `golang.org/x/time/rate`: Rate limiting - `golang.org/x/sync/errgroup`: Concurrent processing ## Common Development Tasks ### Running the CLI locally ```bash go run . download --url https://example.substack.com --output ./downloads ``` ### Testing with verbose output ```bash go run . download --url https://example.substack.com --verbose --dry-run ``` ### Downloading posts with images ```bash # Download posts with high-quality images go run . download --url https://example.substack.com --download-images --image-quality high --output ./downloads # Download with medium quality images and custom images directory go run . download --url https://example.substack.com --download-images --image-quality medium --images-dir assets --output ./downloads # Download single post with images in markdown format go run . download --url https://example.substack.com/p/post-title --download-images --format md --output ./downloads ``` ### Downloading posts with file attachments ```bash # Download posts with file attachments go run . download --url https://example.substack.com --download-files --output ./downloads # Download with specific file extensions only go run . download --url https://example.substack.com --download-files --file-extensions "pdf,docx,txt" --output ./downloads # Download with custom files directory name go run . download --url https://example.substack.com --download-files --files-dir attachments --output ./downloads # Download single post with both images and file attachments go run . 
download --url https://example.substack.com/p/post-title --download-images --download-files --output ./downloads ``` ### Creating archive index pages ```bash # Download posts and create an archive index page go run . download --url https://example.substack.com --create-archive --output ./downloads # Download entire archive with archive index in markdown format go run . download --url https://example.substack.com --create-archive --format md --output ./downloads # Download single post with archive page (useful for building up an archive over time) go run . download --url https://example.substack.com/p/post-title --create-archive --output ./downloads # Download with all features: images, files, and archive page go run . download --url https://example.substack.com --download-images --download-files --create-archive --output ./downloads # Download archive with specific format and custom directories go run . download --url https://example.substack.com --create-archive --format html --images-dir assets --files-dir attachments --output ./downloads ``` ### Building for release ```bash go build -ldflags="-s -w" -o sbstck-dl . ``` ================================================ FILE: LICENSE ================================================ The MIT License (MIT) Copyright © 2023 Alex Ferrari alex@thealexferrari.com Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: README.md
================================================
# Substack Downloader

Simple CLI tool to download one or all of the posts from a Substack blog.

## Installation

### Downloading the binary

Check the [releases](https://github.com/alexferrari88/sbstck-dl/releases) page for the latest version of the binary for your platform.
We provide binaries for Linux, macOS and Windows.

### Using Go

```bash
go install github.com/alexferrari88/sbstck-dl@latest
```

Your Go bin directory must be in your PATH. You can add it by adding the following line to your `.bashrc` or `.zshrc`:

```bash
export PATH=$PATH:$(go env GOPATH)/bin
```

## Usage

```bash
Usage:
  sbstck-dl [command]

Available Commands:
  download    Download individual posts or the entire public archive
  help        Help about any command
  list        List the posts of a Substack
  version     Print the version number of sbstck-dl

Flags:
      --after string             Download posts published after this date (format: YYYY-MM-DD)
      --before string            Download posts published before this date (format: YYYY-MM-DD)
      --cookie_name cookieName   Either substack.sid or connect.sid, based on your cookie (required for private newsletters)
      --cookie_val string        The substack.sid/connect.sid cookie value (required for private newsletters)
  -h, --help                     help for sbstck-dl
  -x, --proxy string             Specify the proxy url
  -r, --rate int                 Specify the rate of requests per second (default 2)
  -v, --verbose                  Enable verbose output

Use "sbstck-dl [command] --help" for more information about a command.
```

### Downloading posts

You can provide the URL of a single post or the main URL of the Substack you want to download.

If you provide the main URL of a Substack, the downloader will download all the posts in the archive.

If a full-archive download is interrupted, the next run resumes with the remaining posts.

```bash
Usage:
  sbstck-dl download [flags]

Flags:
      --add-source-url           Add the original post URL at the end of the downloaded file
      --create-archive           Create an archive index page linking all downloaded posts
      --download-files           Download file attachments locally and update content to reference local files
      --download-images          Download images locally and update content to reference local files
  -d, --dry-run                  Enable dry run
      --file-extensions string   Comma-separated list of file extensions to download (e.g., 'pdf,docx,txt'). If empty, downloads all file types
      --files-dir string         Directory name for downloaded file attachments (default "files")
  -f, --format string            Specify the output format (options: "html", "md", "txt" (default "html")
  -h, --help                     help for download
      --image-quality string     Image quality to download (options: "high", "medium", "low") (default "high")
      --images-dir string        Directory name for downloaded images (default "images")
  -o, --output string            Specify the download directory (default ".")
  -u, --url string               Specify the Substack url

Global Flags:
      --after string             Download posts published after this date (format: YYYY-MM-DD)
      --before string            Download posts published before this date (format: YYYY-MM-DD)
      --cookie_name cookieName   Either substack.sid or connect.sid, based on your cookie (required for private newsletters)
      --cookie_val string        The substack.sid/connect.sid cookie value (required for private newsletters)
  -x, --proxy string             Specify the proxy url
  -r, --rate int                 Specify the rate of requests per second (default 2)
  -v, --verbose                  Enable verbose output
```

#### Adding Source URL

If you use the `--add-source-url` flag,
each downloaded file will have the following line appended to its content:

`original content: POST_URL`

Where `POST_URL` is the canonical URL of the downloaded post. For HTML format, this will be wrapped in a small paragraph with a link.

#### Downloading Images

Use the `--download-images` flag to download all images from Substack posts locally. This ensures posts remain accessible even if images are deleted from Substack's CDN.

**Features:**
- Downloads images at the selected quality (high/medium/low)
- Creates organized directory structure: `{output}/images/{post-slug}/`
- Updates HTML/Markdown content to reference local image paths
- Handles all Substack image formats and CDN patterns
- Graceful error handling for individual image failures

**Examples:**

```bash
# Download posts with high-quality images (default)
sbstck-dl download --url https://example.substack.com --download-images

# Download with medium quality images
sbstck-dl download --url https://example.substack.com --download-images --image-quality medium

# Download with custom images directory name
sbstck-dl download --url https://example.substack.com --download-images --images-dir assets

# Download single post with images in markdown format
sbstck-dl download --url https://example.substack.com/p/post-title --download-images --format md
```

**Image Quality Options:**
- `high`: 1456px width (best quality, larger files)
- `medium`: 848px width (balanced quality/size)
- `low`: 424px width (smaller files, mobile-optimized)

**Directory Structure:**
```
output/
├── 20231201_120000_post-title.html
└── images/
    └── post-title/
        ├── image1_1456x819.jpeg
        ├── image2_848x636.png
        └── image3_1272x720.webp
```

#### Downloading File Attachments

Use the `--download-files` flag to download all file attachments from Substack posts locally. This ensures posts remain accessible even if files are removed from Substack's servers.
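
To make the optional `--file-extensions` filter concrete, here is a minimal sketch of how a comma-separated list such as `pdf,docx,txt` could be matched case-insensitively against attachment URLs. The function names `parseExtensions` and `shouldDownload` are hypothetical and are not taken from `lib/files.go`; tolerating a leading dot in the list is likewise an assumption of this sketch.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// parseExtensions turns a comma-separated list such as "pdf,DOCX, txt"
// into a lowercase set. An empty list yields an empty set, which the
// filter below treats as "allow every file type".
// Hypothetical helper, not the project's actual implementation.
func parseExtensions(list string) map[string]bool {
	allowed := make(map[string]bool)
	for _, ext := range strings.Split(list, ",") {
		ext = strings.ToLower(strings.TrimPrefix(strings.TrimSpace(ext), "."))
		if ext != "" {
			allowed[ext] = true
		}
	}
	return allowed
}

// shouldDownload reports whether the file behind rawURL passes the filter.
// Only the URL path is inspected, so query strings do not affect the result.
func shouldDownload(rawURL string, allowed map[string]bool) bool {
	if len(allowed) == 0 {
		return true
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	ext := strings.ToLower(strings.TrimPrefix(path.Ext(u.Path), "."))
	return allowed[ext]
}

func main() {
	allowed := parseExtensions("pdf,docx,txt")
	links := []string{
		"https://example.com/files/report.PDF?download=1",
		"https://example.com/files/slides.pptx",
	}
	for _, link := range links {
		fmt.Printf("%s -> %v\n", link, shouldDownload(link, allowed))
	}
}
```

With the filter `pdf,docx,txt`, the first example link passes and the second is skipped; with an empty filter both would be downloaded.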
**Features:**
- Downloads file attachments using CSS selector `.file-embed-button.wide`
- Optional file extension filtering (e.g., only PDFs and Word documents)
- Creates organized directory structure: `{output}/files/{post-slug}/`
- Updates HTML content to reference local file paths
- Handles filename sanitization and collision avoidance
- Graceful error handling for individual file download failures

**Examples:**

```bash
# Download posts with all file attachments
sbstck-dl download --url https://example.substack.com --download-files

# Download only specific file types
sbstck-dl download --url https://example.substack.com --download-files --file-extensions "pdf,docx,txt"

# Download with custom files directory name
sbstck-dl download --url https://example.substack.com --download-files --files-dir attachments

# Download single post with both images and file attachments
sbstck-dl download --url https://example.substack.com/p/post-title --download-images --download-files --format md
```

**File Extension Filtering:**
- Specify extensions without dots: `pdf,docx,txt`
- Case insensitive matching
- If no extensions specified, downloads all file types

**Directory Structure with Files:**
```
output/
├── 20231201_120000_post-title.html
├── images/
│   └── post-title/
│       ├── image1_1456x819.jpeg
│       └── image2_848x636.png
└── files/
    └── post-title/
        ├── document.pdf
        ├── spreadsheet.xlsx
        └── presentation.pptx
```

#### Creating Archive Index Pages

Use the `--create-archive` flag to generate an organized index page that links all downloaded posts with their metadata. This creates a beautiful overview of your downloaded content, making it easy to browse and access your Substack archive.
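
To give a feel for what building such an index involves, here is a minimal sketch that orders entries by publication date (newest first) and prints Markdown links to the local files. The `entry` type and `sortNewestFirst` function are assumptions made for this example and do not reflect the actual `lib.Archive` implementation; the dates are assumed to be RFC 3339 strings.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// entry is a hypothetical, simplified archive record.
type entry struct {
	Title    string
	PostDate string // publication date, assumed to be RFC 3339
	File     string // path relative to the index page
}

// sortNewestFirst orders entries by publication date, newest first.
// Entries whose date cannot be parsed get the zero time and sort last.
func sortNewestFirst(entries []entry) {
	parse := func(s string) time.Time {
		t, err := time.Parse(time.RFC3339, s)
		if err != nil {
			return time.Time{}
		}
		return t
	}
	sort.SliceStable(entries, func(i, j int) bool {
		return parse(entries[i].PostDate).After(parse(entries[j].PostDate))
	})
}

func main() {
	entries := []entry{
		{"Another Post", "2023-11-15T09:00:00Z", "20231115_090000_another-post.html"},
		{"Post Title", "2023-12-01T12:00:00Z", "20231201_120000_post-title.html"},
	}
	sortNewestFirst(entries)
	for _, e := range entries {
		fmt.Printf("- [%s](%s)\n", e.Title, e.File)
	}
}
```

Using relative file names in the links keeps the index usable when the whole output directory is moved or shared.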
**Features:**
- Creates `index.{format}` file matching your selected output format (HTML/Markdown/Text)
- Links to all downloaded posts using relative file paths
- Displays post titles, publication dates, and download timestamps
- Shows post descriptions/subtitles and cover images when available
- Automatically sorts posts by publication date (newest first)
- Works with both single post and bulk downloads

**Examples:**

```bash
# Download entire archive and create index page
sbstck-dl download --url https://example.substack.com --create-archive

# Create archive index in Markdown format
sbstck-dl download --url https://example.substack.com --create-archive --format md

# Build archive over time with single posts
sbstck-dl download --url https://example.substack.com/p/post-title --create-archive

# Complete download with all features
sbstck-dl download --url https://example.substack.com --download-images --download-files --create-archive

# Custom directory structure with archive
sbstck-dl download --url https://example.substack.com --create-archive --images-dir assets --files-dir attachments
```

**Archive Content Per Post:**
- **Title**: Clickable link to the downloaded post file
- **Publication Date**: When the post was originally published on Substack
- **Download Date**: When you downloaded the post locally
- **Description**: Post subtitle or description (when available)
- **Cover Image**: Featured image from the post (when available)

**Archive Format Examples:**

*HTML Format:* Styled webpage with images, organized post cards, and hover effects

*Markdown Format:* Clean markdown with headers, links, and image references

*Text Format:* Plain text listing with all metadata for maximum compatibility

**Directory Structure with Archive:**
```
output/
├── index.html                          # Archive index page
├── 20231201_120000_post-title.html
├── 20231115_090000_another-post.html
├── images/
│   ├── post-title/
│   │   └── image1_1456x819.jpeg
│   └── another-post/
│       └── image2_848x636.png
└── files/
    ├── post-title/
    │   └── document.pdf
    └── another-post/
        └── spreadsheet.xlsx
```

### Listing posts

```bash
Usage:
  sbstck-dl list [flags]

Flags:
  -h, --help         help for list
  -u, --url string   Specify the Substack url

Global Flags:
      --after string             Download posts published after this date (format: YYYY-MM-DD)
      --before string            Download posts published before this date (format: YYYY-MM-DD)
      --cookie_name cookieName   Either substack.sid or connect.sid, based on your cookie (required for private newsletters)
      --cookie_val string        The substack.sid/connect.sid cookie value (required for private newsletters)
  -x, --proxy string             Specify the proxy url
  -r, --rate int                 Specify the rate of requests per second (default 2)
  -v, --verbose                  Enable verbose output
```

### Private Newsletters

To download the full text of private newsletters, you need to provide the name and value of your session cookie.

The cookie name is either `substack.sid` or `connect.sid`, depending on which one your browser holds for the site.

To get the cookie value, you can use the developer tools of your browser.

Once you have the cookie name and value, you can pass them to the downloader using the `--cookie_name` and `--cookie_val` flags.
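
For readers curious what passing a session cookie amounts to at the HTTP level, the following sketch attaches it to a request with Go's standard `net/http` package. It is an illustration under assumptions, not the tool's actual fetcher code: `newAuthenticatedRequest` is a hypothetical helper, and `COOKIE_VALUE` is a placeholder.

```go
package main

import (
	"fmt"
	"net/http"
)

// newAuthenticatedRequest builds a GET request carrying the session cookie.
// Hypothetical helper; only the two cookie names documented above are accepted.
func newAuthenticatedRequest(url, cookieName, cookieVal string) (*http.Request, error) {
	if cookieName != "substack.sid" && cookieName != "connect.sid" {
		return nil, fmt.Errorf("invalid cookie name: %q", cookieName)
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// AddCookie serializes the cookie into the request's Cookie header.
	req.AddCookie(&http.Cookie{Name: cookieName, Value: cookieVal})
	return req, nil
}

func main() {
	req, err := newAuthenticatedRequest(
		"https://example.substack.com/p/post-title", "substack.sid", "COOKIE_VALUE")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Header.Get("Cookie"))
}
```

Treat the cookie value like a password: anyone who has it can read the newsletter with your subscription.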
#### Example

```bash
sbstck-dl download --url https://example.substack.com --cookie_name substack.sid --cookie_val COOKIE_VALUE
```

## Thanks

- [wemoveon2](https://github.com/wemoveon2) and [lenzj](https://github.com/lenzj) for the discussion and help implementing the support for private newsletters

## TODO

- [x] Improve retry logic
- [ ] Implement loading from config file
- [x] Add support for downloading images
- [x] Add support for downloading file attachments
- [x] Add archive index page functionality
- [x] Add tests
- [x] Add CI
- [x] Add documentation
- [x] Add support for private newsletters
- [x] Implement filtering by date
- [x] Implement resuming downloads

================================================
FILE: cmd/cmd_test.go
================================================
package cmd

import (
	"net/url"
	"os"
	"testing"

	"github.com/alexferrari88/sbstck-dl/lib"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

// Test parseURL function
func TestParseURL(t *testing.T) {
	tests := []struct {
		name        string
		input       string
		expectError bool
		expectedURL *url.URL
	}{
		{
			name:        "valid https URL",
			input:       "https://example.substack.com",
			expectError: false,
			expectedURL: &url.URL{
				Scheme: "https",
				Host:   "example.substack.com",
			},
		},
		{
			name:        "valid http URL",
			input:       "http://example.substack.com",
			expectError: false,
			expectedURL: &url.URL{
				Scheme: "http",
				Host:   "example.substack.com",
			},
		},
		{
			name:        "URL with path",
			input:       "https://example.substack.com/p/test-post",
			expectError: false,
			expectedURL: &url.URL{
				Scheme: "https",
				Host:   "example.substack.com",
				Path:   "/p/test-post",
			},
		},
		{
			name:        "invalid URL - no scheme",
			input:       "example.substack.com",
			expectError: true,
		},
		{
			name:        "invalid URL - no host",
			input:       "https://",
			expectError: true, // parseURL returns nil, nil for this case
		},
		{
			name:        "invalid URL - malformed",
			input:       "not-a-url",
			expectError: true,
		},
		{
			name:        "empty string",
			input:       "",
			expectError: true,
		},
	}

	for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) { result, err := parseURL(tt.input) if tt.expectError { // For this specific case, parseURL returns nil, nil which means no error but also no result if result == nil { assert.True(t, true) // This is the expected behavior for invalid URLs } else { assert.Error(t, err) } } else { require.NoError(t, err) require.NotNil(t, result) assert.Equal(t, tt.expectedURL.Scheme, result.Scheme) assert.Equal(t, tt.expectedURL.Host, result.Host) if tt.expectedURL.Path != "" { assert.Equal(t, tt.expectedURL.Path, result.Path) } } }) } } // Test makeDateFilterFunc function func TestMakeDateFilterFunc(t *testing.T) { tests := []struct { name string beforeDate string afterDate string testDates map[string]bool // date -> expected result }{ { name: "no filters", beforeDate: "", afterDate: "", testDates: map[string]bool{ "2023-01-01": true, "2023-06-15": true, "2023-12-31": true, }, }, { name: "before filter only", beforeDate: "2023-06-15", afterDate: "", testDates: map[string]bool{ "2023-01-01": true, "2023-06-14": true, "2023-06-15": false, "2023-06-16": false, "2023-12-31": false, }, }, { name: "after filter only", beforeDate: "", afterDate: "2023-06-15", testDates: map[string]bool{ "2023-01-01": false, "2023-06-14": false, "2023-06-15": false, "2023-06-16": true, "2023-12-31": true, }, }, { name: "both filters", beforeDate: "2023-12-31", afterDate: "2023-01-01", testDates: map[string]bool{ "2022-12-31": false, "2023-01-01": false, "2023-06-15": true, "2023-12-30": true, "2023-12-31": false, "2024-01-01": false, }, }, } for _, tt := range tests { t.Run(tt.name, func(t *testing.T) { filterFunc := makeDateFilterFunc(tt.beforeDate, tt.afterDate) if tt.beforeDate == "" && tt.afterDate == "" { // No filter should return nil assert.Nil(t, filterFunc) } else { require.NotNil(t, filterFunc) for date, expected := range tt.testDates { result := filterFunc(date) assert.Equal(t, expected, result, "Date %s should return %v", date, expected) } } }) } } // Test 
makePath function func TestMakePath(t *testing.T) { post := lib.Post{ PostDate: "2023-01-01T10:30:00.000Z", // Use RFC3339 format Slug: "test-post", } tests := []struct { name string post lib.Post outputFolder string format string expected string }{ { name: "basic path", post: post, outputFolder: "/tmp/downloads", format: "html", expected: "/tmp/downloads/20230101_103000_test-post.html", }, { name: "markdown format", post: post, outputFolder: "/tmp/downloads", format: "md", expected: "/tmp/downloads/20230101_103000_test-post.md", }, { name: "text format", post: post, outputFolder: "/tmp/downloads", format: "txt", expected: "/tmp/downloads/20230101_103000_test-post.txt", }, { name: "no output folder", post: post, outputFolder: "", format: "html", expected: "/20230101_103000_test-post.html", }, } for _, tt := range tests { t.Run(tt.name, func(t *testing.T) { result := makePath(tt.post, tt.outputFolder, tt.format) assert.Equal(t, tt.expected, result) }) } } // Test convertDateTime function func TestConvertDateTime(t *testing.T) { tests := []struct { name string input string expected string }{ { name: "basic date", input: "2023-01-01", expected: "", // Invalid format, should return empty string }, { name: "date with time", input: "2023-01-01T10:30:00.000Z", expected: "20230101_103000", }, { name: "different date format", input: "2023-12-31T23:59:59.999Z", expected: "20231231_235959", }, { name: "empty string", input: "", expected: "", }, } for _, tt := range tests { t.Run(tt.name, func(t *testing.T) { result := convertDateTime(tt.input) assert.Equal(t, tt.expected, result) }) } } // Test extractSlug function func TestExtractSlug(t *testing.T) { tests := []struct { name string input string expected string }{ { name: "basic substack URL", input: "https://example.substack.com/p/test-post", expected: "test-post", }, { name: "URL with query parameters", input: "https://example.substack.com/p/test-post?utm_source=newsletter", expected: "test-post?utm_source=newsletter", // 
extractSlug doesn't handle query params }, { name: "URL with anchor", input: "https://example.substack.com/p/test-post#comments", expected: "test-post#comments", // extractSlug doesn't handle anchors }, { name: "URL with trailing slash", input: "https://example.substack.com/p/test-post/", expected: "", // extractSlug returns empty string for trailing slash }, { name: "complex slug with dashes", input: "https://example.substack.com/p/this-is-a-very-long-post-title", expected: "this-is-a-very-long-post-title", }, { name: "no /p/ in URL", input: "https://example.substack.com/test-post", expected: "test-post", // extractSlug just returns the last segment }, { name: "empty string", input: "", expected: "", }, } for _, tt := range tests { t.Run(tt.name, func(t *testing.T) { result := extractSlug(tt.input) assert.Equal(t, tt.expected, result) }) } } // Test cookieName type func TestCookieName(t *testing.T) { t.Run("String method", func(t *testing.T) { cn := cookieName("test-cookie") assert.Equal(t, "test-cookie", cn.String()) }) t.Run("Type method", func(t *testing.T) { cn := cookieName("") assert.Equal(t, "cookieName", cn.Type()) }) t.Run("Set method - valid values", func(t *testing.T) { validNames := []string{"substack.sid", "connect.sid"} for _, name := range validNames { cn := cookieName("") err := cn.Set(name) assert.NoError(t, err) assert.Equal(t, name, cn.String()) } }) t.Run("Set method - invalid values", func(t *testing.T) { invalidNames := []string{"invalid", "session", "auth", ""} for _, name := range invalidNames { cn := cookieName("") err := cn.Set(name) assert.Error(t, err) assert.Contains(t, err.Error(), "invalid cookie name") } }) } // Test that we can create paths and handle files correctly func TestFileHandling(t *testing.T) { // Create a temporary directory for testing tempDir := t.TempDir() // Create a test file existingFile := tempDir + "/existing.html" post := lib.Post{Title: "Test", BodyHTML: "

<p>Test content</p>

"} err := post.WriteToFile(existingFile, "html", false) require.NoError(t, err) // Test that file was created successfully _, err = os.Stat(existingFile) assert.NoError(t, err) // Test path creation testPost := lib.Post{PostDate: "2023-01-01T10:30:00.000Z", Slug: "test-post"} path := makePath(testPost, tempDir, "html") expectedPath := tempDir + "/20230101_103000_test-post.html" assert.Equal(t, expectedPath, path) } // Test time parsing and formatting func TestTimeFormatting(t *testing.T) { t.Run("convertDateTime with various formats", func(t *testing.T) { // Test the actual time parsing logic testCases := []struct { input string expected string }{ {"2023-01-01T10:30:00.000Z", "20230101_103000"}, {"2023-01-01T10:30:00Z", "20230101_103000"}, {"2023-01-01", ""}, // Invalid format, should return empty string {"2023-12-31T23:59:59.999Z", "20231231_235959"}, } for _, tc := range testCases { result := convertDateTime(tc.input) assert.Equal(t, tc.expected, result) } }) } // Integration test for date filtering func TestDateFilteringIntegration(t *testing.T) { t.Run("date filter with actual dates", func(t *testing.T) { // Test the interaction between date filtering and URL processing beforeDate := "2023-06-15" afterDate := "2023-01-01" filterFunc := makeDateFilterFunc(beforeDate, afterDate) require.NotNil(t, filterFunc) // Test dates within range assert.True(t, filterFunc("2023-03-15")) assert.True(t, filterFunc("2023-06-14")) // Test dates outside range assert.False(t, filterFunc("2022-12-31")) assert.False(t, filterFunc("2023-01-01")) assert.False(t, filterFunc("2023-06-15")) assert.False(t, filterFunc("2023-12-31")) }) } // Test constants func TestConstants(t *testing.T) { t.Run("cookie name constants", func(t *testing.T) { assert.Equal(t, "substack.sid", string(substackSid)) assert.Equal(t, "connect.sid", string(connectSid)) }) } ================================================ FILE: cmd/download.go ================================================ package cmd import ( 
"fmt" "log" "net/url" "path/filepath" "strings" "time" "github.com/alexferrari88/sbstck-dl/lib" "github.com/schollz/progressbar/v3" "github.com/spf13/cobra" ) // downloadCmd represents the download command var ( downloadUrl string format string outputFolder string dryRun bool addSourceURL bool downloadImages bool imageQuality string imagesDir string downloadFiles bool fileExtensions string filesDir string createArchive bool downloadCmd = &cobra.Command{ Use: "download", Short: "Download individual posts or the entire public archive", Long: `You can provide the url of a single post or the main url of the Substack you want to download.`, Run: func(cmd *cobra.Command, args []string) { startTime := time.Now() // Create archive instance if flag is set var archive *lib.Archive if createArchive { archive = lib.NewArchive() } // if url contains "/p/", we are downloading a single post if strings.Contains(downloadUrl, "/p/") { if verbose { fmt.Printf("Downloading post %s\n", downloadUrl) } if dryRun { fmt.Println("Dry run, exiting...") return } if (beforeDate != "" || afterDate != "") && verbose { fmt.Println("Warning: --before and --after flags are ignored when downloading a single post") } post, err := extractor.ExtractPost(ctx, downloadUrl) if err != nil { log.Fatalln(err) } downloadTime := time.Since(startTime) if verbose { fmt.Printf("Downloaded post %s in %s\n", downloadUrl, downloadTime) } path := makePath(post, outputFolder, format) if verbose { fmt.Printf("Writing post to file %s\n", path) } if downloadImages || downloadFiles { imageQualityEnum := lib.ImageQuality(imageQuality) // Parse file extensions if specified var fileExtensionsSlice []string if fileExtensions != "" { fileExtensionsSlice = strings.Split(strings.ReplaceAll(fileExtensions, " ", ""), ",") } imageResult, err := post.WriteToFileWithImages(ctx, path, format, addSourceURL, downloadImages, imageQualityEnum, imagesDir, downloadFiles, fileExtensionsSlice, filesDir, fetcher) if err != nil { 
log.Printf("Error writing file %s: %v\n", path, err) } else if verbose && imageResult.Success > 0 { fmt.Printf("Downloaded %d images (%d failed) for post %s\n", imageResult.Success, imageResult.Failed, post.Slug) } } else { err = post.WriteToFile(path, format, addSourceURL) if err != nil { log.Printf("Error writing file %s: %v\n", path, err) } } // Add to archive if enabled if archive != nil { archive.AddEntry(post, path, startTime) } if verbose { fmt.Println("Done in ", time.Since(startTime)) } } else { // we are downloading the entire archive var downloadedPostsCount int dateFilterfunc := makeDateFilterFunc(beforeDate, afterDate) urls, err := extractor.GetAllPostsURLs(ctx, downloadUrl, dateFilterfunc) urlsCount := len(urls) if err != nil { log.Fatalln(err) } if urlsCount == 0 { if verbose { fmt.Println("No posts found, exiting...") } return } if verbose { fmt.Printf("Found %d posts\n", urlsCount) } if dryRun { fmt.Printf("Found %d posts\n", urlsCount) fmt.Println("Dry run, exiting...") return } urls, err = filterExistingPosts(urls, outputFolder, format) if err != nil { if verbose { fmt.Println("Error filtering existing posts:", err) } } if len(urls) == 0 { if verbose { fmt.Println("No new posts found, exiting...") } return } bar := progressbar.NewOptions(len(urls), progressbar.OptionSetWidth(25), progressbar.OptionSetDescription("downloading"), progressbar.OptionShowBytes(true)) for result := range extractor.ExtractAllPosts(ctx, urls) { select { case <-ctx.Done(): log.Fatalln("context cancelled") default: } if result.Err != nil { if verbose { fmt.Printf("Error downloading post %s: %s\n", result.Post.CanonicalUrl, result.Err) fmt.Println("Skipping...") } continue } bar.Add(1) downloadedPostsCount++ if verbose { fmt.Printf("Downloading post %s\n", result.Post.CanonicalUrl) } post := result.Post path := makePath(post, outputFolder, format) if verbose { fmt.Printf("Writing post to file %s\n", path) } if downloadImages || downloadFiles { imageQualityEnum := 
lib.ImageQuality(imageQuality) // Parse file extensions if specified var fileExtensionsSlice []string if fileExtensions != "" { fileExtensionsSlice = strings.Split(strings.ReplaceAll(fileExtensions, " ", ""), ",") } imageResult, err := post.WriteToFileWithImages(ctx, path, format, addSourceURL, downloadImages, imageQualityEnum, imagesDir, downloadFiles, fileExtensionsSlice, filesDir, fetcher) if err != nil { log.Printf("Error writing file %s: %v\n", path, err) } else if verbose && imageResult.Success > 0 { fmt.Printf("Downloaded %d images (%d failed) for post %s\n", imageResult.Success, imageResult.Failed, post.Slug) } } else { err = post.WriteToFile(path, format, addSourceURL) if err != nil { log.Printf("Error writing file %s: %v\n", path, err) } } // Add to archive if enabled and post was successfully written if archive != nil { archive.AddEntry(post, path, time.Now()) } } if verbose { fmt.Println("Downloaded", downloadedPostsCount, "posts, out of", len(urls)) fmt.Println("Done in ", time.Since(startTime)) } } // Generate archive page if enabled if archive != nil && len(archive.Entries) > 0 { if verbose { fmt.Printf("Generating archive page in %s format...\n", format) } var archiveErr error switch format { case "html": archiveErr = archive.GenerateHTML(outputFolder) case "md": archiveErr = archive.GenerateMarkdown(outputFolder) case "txt": archiveErr = archive.GenerateText(outputFolder) default: archiveErr = fmt.Errorf("unknown format for archive: %s", format) } if archiveErr != nil { log.Printf("Error generating archive page: %v\n", archiveErr) } else if verbose { fmt.Printf("Archive page generated: %s/index.%s\n", outputFolder, format) } } }, } ) func init() { downloadCmd.Flags().StringVarP(&downloadUrl, "url", "u", "", "Specify the Substack url") downloadCmd.Flags().StringVarP(&format, "format", "f", "html", "Specify the output format (options: \"html\", \"md\", \"txt\"") downloadCmd.Flags().StringVarP(&outputFolder, "output", "o", ".", "Specify the download 
directory") downloadCmd.Flags().BoolVarP(&dryRun, "dry-run", "d", false, "Enable dry run") downloadCmd.Flags().BoolVar(&addSourceURL, "add-source-url", false, "Add the original post URL at the end of the downloaded file") downloadCmd.Flags().BoolVar(&downloadImages, "download-images", false, "Download images locally and update content to reference local files") downloadCmd.Flags().StringVar(&imageQuality, "image-quality", "high", "Image quality to download (options: \"high\", \"medium\", \"low\")") downloadCmd.Flags().StringVar(&imagesDir, "images-dir", "images", "Directory name for downloaded images") downloadCmd.Flags().BoolVar(&downloadFiles, "download-files", false, "Download file attachments locally and update content to reference local files") downloadCmd.Flags().StringVar(&fileExtensions, "file-extensions", "", "Comma-separated list of file extensions to download (e.g., 'pdf,docx,txt'). If empty, downloads all file types") downloadCmd.Flags().StringVar(&filesDir, "files-dir", "files", "Directory name for downloaded file attachments") downloadCmd.Flags().BoolVar(&createArchive, "create-archive", false, "Create an archive index page linking all downloaded posts") downloadCmd.MarkFlagRequired("url") } func convertDateTime(datetime string) string { // Parse the datetime string parsedTime, err := time.Parse(time.RFC3339, datetime) if err != nil { // Return an empty string or an error message if parsing fails return "" } // Format the datetime to the desired format formattedDateTime := fmt.Sprintf("%d%02d%02d_%02d%02d%02d", parsedTime.Year(), parsedTime.Month(), parsedTime.Day(), parsedTime.Hour(), parsedTime.Minute(), parsedTime.Second()) return formattedDateTime } func parseURL(toTest string) (*url.URL, error) { _, err := url.ParseRequestURI(toTest) if err != nil { return nil, err } u, err := url.Parse(toTest) if err != nil || u.Scheme == "" || u.Host == "" { return nil, err } return u, err } func makePath(post lib.Post, outputFolder string, format string) 
string { return fmt.Sprintf("%s/%s_%s.%s", outputFolder, convertDateTime(post.PostDate), post.Slug, format) } // extractSlug extracts the slug from a Substack post URL // e.g. https://example.substack.com/p/this-is-the-post-title -> this-is-the-post-title func extractSlug(url string) string { split := strings.Split(url, "/") return split[len(split)-1] } // filterExistingPosts filters out posts that already exist in the output folder. // It looks for files whose name ends with the post slug. func filterExistingPosts(urls []string, outputFolder string, format string) ([]string, error) { var filtered []string for _, url := range urls { slug := extractSlug(url) path := fmt.Sprintf("%s/%s_%s.%s", outputFolder, "*", slug, format) matches, err := filepath.Glob(path) if err != nil { return urls, err } if len(matches) == 0 { filtered = append(filtered, url) } } return filtered, nil } ================================================ FILE: cmd/integration_test.go ================================================ package cmd import ( "bytes" "context" "encoding/json" "fmt" "net/http" "net/http/httptest" "os" "path/filepath" "strings" "testing" "time" "github.com/alexferrari88/sbstck-dl/lib" "github.com/spf13/cobra" "github.com/stretchr/testify/assert" "github.com/stretchr/testify/require" ) // Test command execution in isolation func TestCommandExecution(t *testing.T) { // Skip in short test mode if testing.Short() { t.Skip("Skipping integration test in short mode") } // Create a mock server that serves a simple post mockPost := lib.Post{ Id: 123, Title: "Test Post", Slug: "test-post", PostDate: "2023-01-01", BodyHTML: "

<p>This is a test post</p>

", CanonicalUrl: "https://example.substack.com/p/test-post", } // Create sitemap XML sitemapXML := ` https://example.substack.com/p/test-post 2023-01-01 ` // Create mock HTML with embedded JSON postWrapper := lib.PostWrapper{Post: mockPost} jsonBytes, _ := json.Marshal(postWrapper) escapedJSON := strings.ReplaceAll(string(jsonBytes), `"`, `\"`) mockHTML := fmt.Sprintf(` %s `, mockPost.Title, escapedJSON) server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { path := r.URL.Path if path == "/sitemap.xml" { w.Header().Set("Content-Type", "application/xml") w.Write([]byte(sitemapXML)) } else if path == "/p/test-post" { w.Header().Set("Content-Type", "text/html") w.Write([]byte(mockHTML)) } else { w.WriteHeader(http.StatusNotFound) } })) defer server.Close() // Test version command t.Run("version command", func(t *testing.T) { // Capture stdout var output bytes.Buffer // Create a command that executes the version logic cmd := &cobra.Command{ Use: "test-version", Run: func(cmd *cobra.Command, args []string) { output.WriteString("sbstck-dl v0.4.0\n") }, } err := cmd.Execute() assert.NoError(t, err) assert.Contains(t, output.String(), "sbstck-dl v0.4.0") }) // Test list command t.Run("list command", func(t *testing.T) { // Reset global variables pubUrl = server.URL verbose = false beforeDate = "" afterDate = "" // Initialize fetcher and extractor fetcher = lib.NewFetcher() extractor = lib.NewExtractor(fetcher) ctx = context.Background() // Create a new command to capture output var output bytes.Buffer cmd := &cobra.Command{ Use: "test-list", Run: func(cmd *cobra.Command, args []string) { // Simulate list command logic urls, err := extractor.GetAllPostsURLs(ctx, pubUrl, nil) if err != nil { t.Fatalf("Failed to get URLs: %v", err) } for _, url := range urls { output.WriteString(url + "\n") } }, } err := cmd.Execute() assert.NoError(t, err) // Check that it outputs the post URL assert.Contains(t, output.String(), 
"https://example.substack.com/p/test-post") }) // Test single post download t.Run("single post download", func(t *testing.T) { tempDir := t.TempDir() // Reset global variables downloadUrl = server.URL + "/p/test-post" outputFolder = tempDir format = "html" dryRun = false verbose = false addSourceURL = false // Initialize fetcher and extractor fetcher = lib.NewFetcher() extractor = lib.NewExtractor(fetcher) ctx = context.Background() // Create a new command cmd := &cobra.Command{ Use: "test-download", Run: func(cmd *cobra.Command, args []string) { // Execute the single post download logic post, err := extractor.ExtractPost(ctx, downloadUrl) if err != nil { t.Fatalf("Failed to extract post: %v", err) } // Write to file filePath := makePath(post, outputFolder, format) err = post.WriteToFile(filePath, format, addSourceURL) if err != nil { t.Fatalf("Failed to write file: %v", err) } }, } err := cmd.Execute() assert.NoError(t, err) // Check that file was created - use the correct expected format // Since mockPost.PostDate is "2023-01-01" (not RFC3339), convertDateTime will return "" expectedFile := filepath.Join(tempDir, "_test-post.html") _, err = os.Stat(expectedFile) assert.NoError(t, err) // Check file content content, err := os.ReadFile(expectedFile) assert.NoError(t, err) assert.Contains(t, string(content), "Test Post") assert.Contains(t, string(content), "This is a test post") }) } // Test command flag parsing func TestCommandFlags(t *testing.T) { t.Run("root command flags", func(t *testing.T) { // Test that flags are properly defined cmd := rootCmd // Check persistent flags assert.NotNil(t, cmd.PersistentFlags().Lookup("proxy")) assert.NotNil(t, cmd.PersistentFlags().Lookup("verbose")) assert.NotNil(t, cmd.PersistentFlags().Lookup("rate")) assert.NotNil(t, cmd.PersistentFlags().Lookup("cookie_name")) assert.NotNil(t, cmd.PersistentFlags().Lookup("cookie_val")) assert.NotNil(t, cmd.PersistentFlags().Lookup("before")) assert.NotNil(t, 
cmd.PersistentFlags().Lookup("after")) }) t.Run("download command flags", func(t *testing.T) { cmd := downloadCmd // Check local flags assert.NotNil(t, cmd.Flags().Lookup("url")) assert.NotNil(t, cmd.Flags().Lookup("format")) assert.NotNil(t, cmd.Flags().Lookup("output")) assert.NotNil(t, cmd.Flags().Lookup("dry-run")) assert.NotNil(t, cmd.Flags().Lookup("add-source-url")) assert.NotNil(t, cmd.Flags().Lookup("download-images")) assert.NotNil(t, cmd.Flags().Lookup("image-quality")) assert.NotNil(t, cmd.Flags().Lookup("images-dir")) assert.NotNil(t, cmd.Flags().Lookup("download-files")) assert.NotNil(t, cmd.Flags().Lookup("file-extensions")) assert.NotNil(t, cmd.Flags().Lookup("files-dir")) assert.NotNil(t, cmd.Flags().Lookup("create-archive")) // Test create-archive flag specifically createArchiveFlag := cmd.Flags().Lookup("create-archive") assert.Equal(t, "bool", createArchiveFlag.Value.Type()) assert.Equal(t, "false", createArchiveFlag.DefValue) }) t.Run("list command flags", func(t *testing.T) { cmd := listCmd // Check local flags assert.NotNil(t, cmd.Flags().Lookup("url")) }) } // Test command validation func TestCommandValidation(t *testing.T) { t.Run("invalid proxy URL", func(t *testing.T) { // Test parseURL with invalid proxy _, err := parseURL("invalid-proxy") assert.Error(t, err) }) t.Run("invalid cookie name", func(t *testing.T) { cn := cookieName("") err := cn.Set("invalid-cookie") assert.Error(t, err) }) t.Run("rate validation", func(t *testing.T) { // Test that rate 0 should fail // This would normally be tested in the PersistentPreRun, but we can test the logic ratePerSecond = 0 assert.Equal(t, 0, ratePerSecond) // Should be 0 which is invalid }) } // Test error handling func TestErrorHandling(t *testing.T) { t.Run("network error handling", func(t *testing.T) { // Test with non-existent server fetcher := lib.NewFetcher() extractor := lib.NewExtractor(fetcher) ctx := context.Background() _, err := extractor.ExtractPost(ctx, 
"http://non-existent-server.com/p/test") assert.Error(t, err) }) t.Run("invalid URL format", func(t *testing.T) { // Test with malformed URL _, err := parseURL("://invalid-url") assert.Error(t, err) }) t.Run("file system errors", func(t *testing.T) { // Test writing to invalid directory post := lib.Post{ Title: "Test", BodyHTML: "

<p>Test</p>

", } // Try to write to a file with invalid character (null byte forbidden on both Windows and Unix) err := post.WriteToFile("invalid\x00filename.html", "html", false) assert.Error(t, err) }) } // Test with different configurations func TestConfigurations(t *testing.T) { t.Run("with proxy configuration", func(t *testing.T) { // Test that proxy URL parsing works proxyURL := "http://proxy.example.com:8080" parsed, err := parseURL(proxyURL) assert.NoError(t, err) assert.Equal(t, "proxy.example.com:8080", parsed.Host) assert.Equal(t, "http", parsed.Scheme) }) t.Run("with cookie configuration", func(t *testing.T) { // Test cookie creation tests := []struct { name string cookieName cookieName cookieVal string expected string }{ { name: "substack.sid cookie", cookieName: substackSid, cookieVal: "test-value", expected: "substack.sid", }, { name: "connect.sid cookie", cookieName: connectSid, cookieVal: "test-value", expected: "connect.sid", }, } for _, tt := range tests { t.Run(tt.name, func(t *testing.T) { assert.Equal(t, tt.expected, tt.cookieName.String()) }) } }) t.Run("with rate limiting", func(t *testing.T) { // Test that different rate limits are handled rates := []int{1, 2, 5, 10} for _, rate := range rates { fetcher := lib.NewFetcher(lib.WithRatePerSecond(rate)) assert.NotNil(t, fetcher) assert.Equal(t, rate, int(fetcher.RateLimiter.Limit())) } }) } // Test real-world scenarios func TestRealWorldScenarios(t *testing.T) { // Skip in short test mode if testing.Short() { t.Skip("Skipping real-world scenario tests in short mode") } t.Run("large number of URLs", func(t *testing.T) { // Test performance with many URLs urls := make([]string, 100) for i := range urls { urls[i] = fmt.Sprintf("https://example.substack.com/p/post-%d", i) } // Test URL parsing performance start := time.Now() // Test parsing all URLs validUrls := 0 for _, url := range urls { if _, err := parseURL(url); err == nil { validUrls++ } } duration := time.Since(start) assert.Equal(t, len(urls), 
validUrls) // All should be valid assert.Less(t, duration, 1*time.Second) // Should be fast }) t.Run("concurrent processing", func(t *testing.T) { // Test that concurrent processing works correctly tempDir := t.TempDir() // Create multiple posts concurrently posts := make([]lib.Post, 5) for i := range posts { posts[i] = lib.Post{ Title: fmt.Sprintf("Post %d", i), Slug: fmt.Sprintf("post-%d", i), PostDate: "2023-01-01", BodyHTML: fmt.Sprintf("

<p>Content for post %d</p>

", i), } } // Write all posts concurrently start := time.Now() for i, post := range posts { filePath := filepath.Join(tempDir, fmt.Sprintf("post-%d.html", i)) err := post.WriteToFile(filePath, "html", false) assert.NoError(t, err) } duration := time.Since(start) // Verify all files were created for i := range posts { filePath := filepath.Join(tempDir, fmt.Sprintf("post-%d.html", i)) _, err := os.Stat(filePath) assert.NoError(t, err) } assert.Less(t, duration, 1*time.Second) // Should be fast }) } // Test archive functionality end-to-end func TestArchiveWorkflow(t *testing.T) { t.Run("single post with archive", func(t *testing.T) { tempDir := t.TempDir() // Create a mock post with enhanced fields post := lib.Post{ Id: 123, Title: "Test Archive Post", Slug: "test-archive-post", PostDate: "2023-01-01T10:30:00Z", Subtitle: "This is a test subtitle", Description: "Test description", CoverImage: "https://example.com/cover.jpg", CanonicalUrl: "https://example.substack.com/p/test-archive-post", BodyHTML: "

<p>This is a test post for archive functionality.</p>

", } // Simulate the archive workflow archive := lib.NewArchive() // Write the post to file (similar to what download command does) filePath := filepath.Join(tempDir, "20230101_103000_test-archive-post.html") err := post.WriteToFile(filePath, "html", false) require.NoError(t, err) // Add entry to archive (similar to what download command does) downloadTime, _ := time.Parse(time.RFC3339, "2023-01-10T12:00:00Z") archive.AddEntry(post, filePath, downloadTime) // Generate archive in all formats err = archive.GenerateHTML(tempDir) require.NoError(t, err) err = archive.GenerateMarkdown(tempDir) require.NoError(t, err) err = archive.GenerateText(tempDir) require.NoError(t, err) // Verify all archive files were created assert.FileExists(t, filepath.Join(tempDir, "index.html")) assert.FileExists(t, filepath.Join(tempDir, "index.md")) assert.FileExists(t, filepath.Join(tempDir, "index.txt")) // Verify HTML archive content htmlContent, err := os.ReadFile(filepath.Join(tempDir, "index.html")) require.NoError(t, err) htmlStr := string(htmlContent) assert.Contains(t, htmlStr, "Test Archive Post") assert.Contains(t, htmlStr, "This is a test subtitle") assert.Contains(t, htmlStr, "https://example.com/cover.jpg") assert.Contains(t, htmlStr, "20230101_103000_test-archive-post.html") // Relative path assert.Contains(t, htmlStr, "January 1, 2023") // Formatted date // Verify Markdown archive content mdContent, err := os.ReadFile(filepath.Join(tempDir, "index.md")) require.NoError(t, err) mdStr := string(mdContent) assert.Contains(t, mdStr, "# Substack Archive") assert.Contains(t, mdStr, "## [Test Archive Post](20230101_103000_test-archive-post.html)") assert.Contains(t, mdStr, "*This is a test subtitle*") assert.Contains(t, mdStr, "![Cover Image](https://example.com/cover.jpg)") // Verify Text archive content txtContent, err := os.ReadFile(filepath.Join(tempDir, "index.txt")) require.NoError(t, err) txtStr := string(txtContent) assert.Contains(t, txtStr, "SUBSTACK ARCHIVE") 
assert.Contains(t, txtStr, "Title: Test Archive Post") assert.Contains(t, txtStr, "File: 20230101_103000_test-archive-post.html") assert.Contains(t, txtStr, "Description: This is a test subtitle") }) t.Run("multiple posts with archive", func(t *testing.T) { tempDir := t.TempDir() archive := lib.NewArchive() downloadTime := time.Now() // Create multiple posts with different dates posts := []lib.Post{ { Id: 1, Title: "First Post", Slug: "first-post", PostDate: "2023-01-01T10:00:00Z", Subtitle: "First subtitle", CanonicalUrl: "https://example.substack.com/p/first-post", BodyHTML: "

<p>First post content</p>

", }, { Id: 2, Title: "Second Post", Slug: "second-post", PostDate: "2023-01-02T10:00:00Z", Description: "Second description", CoverImage: "https://example.com/cover2.jpg", CanonicalUrl: "https://example.substack.com/p/second-post", BodyHTML: "

<p>Second post content</p>

", }, { Id: 3, Title: "Third Post", Slug: "third-post", PostDate: "2023-01-03T10:00:00Z", Subtitle: "Third subtitle", CanonicalUrl: "https://example.substack.com/p/third-post", BodyHTML: "

<p>Third post content</p>

", }, } // Write posts and add to archive for i, post := range posts { filePath := filepath.Join(tempDir, fmt.Sprintf("post-%d.html", i+1)) err := post.WriteToFile(filePath, "html", false) require.NoError(t, err) archive.AddEntry(post, filePath, downloadTime.Add(time.Duration(i)*time.Hour)) } // Generate archive err := archive.GenerateHTML(tempDir) require.NoError(t, err) // Verify content ordering (newest first) htmlContent, err := os.ReadFile(filepath.Join(tempDir, "index.html")) require.NoError(t, err) htmlStr := string(htmlContent) // Find positions of post titles to verify ordering thirdPos := strings.Index(htmlStr, "Third Post") secondPos := strings.Index(htmlStr, "Second Post") firstPos := strings.Index(htmlStr, "First Post") assert.True(t, thirdPos < secondPos, "Third Post should appear before Second Post") assert.True(t, secondPos < firstPos, "Second Post should appear before First Post") // Verify all posts are included assert.Contains(t, htmlStr, "First subtitle") assert.Contains(t, htmlStr, "Second description") // Fallback to description assert.Contains(t, htmlStr, "Third subtitle") assert.Contains(t, htmlStr, "https://example.com/cover2.jpg") }) t.Run("archive with different formats", func(t *testing.T) { tempDir := t.TempDir() post := lib.Post{ Id: 100, Title: "Format Test Post", Slug: "format-test-post", PostDate: "2023-01-01T10:00:00Z", Subtitle: "Testing different formats", CanonicalUrl: "https://example.substack.com/p/format-test-post", BodyHTML: "

<p>Testing different formats.</p>

", } // Test with different output formats formats := []string{"html", "md", "txt"} for _, format := range formats { t.Run(fmt.Sprintf("format_%s", format), func(t *testing.T) { formatDir := filepath.Join(tempDir, format) err := os.MkdirAll(formatDir, 0755) require.NoError(t, err) archive := lib.NewArchive() // Write post in the specified format filePath := filepath.Join(formatDir, fmt.Sprintf("post.%s", format)) err = post.WriteToFile(filePath, format, false) require.NoError(t, err) // Add to archive and generate archive.AddEntry(post, filePath, time.Now()) switch format { case "html": err = archive.GenerateHTML(formatDir) case "md": err = archive.GenerateMarkdown(formatDir) case "txt": err = archive.GenerateText(formatDir) } require.NoError(t, err) // Verify archive file exists indexPath := filepath.Join(formatDir, fmt.Sprintf("index.%s", format)) assert.FileExists(t, indexPath) // Verify content contains the post content, err := os.ReadFile(indexPath) require.NoError(t, err) assert.Contains(t, string(content), "Format Test Post") assert.Contains(t, string(content), "Testing different formats") }) } }) } ================================================ FILE: cmd/list.go ================================================ package cmd import ( "fmt" "log" "github.com/spf13/cobra" ) // listCmd represents the list command var ( pubUrl string listCmd = &cobra.Command{ Use: "list", Short: "List the posts of a Substack", Long: `List the posts of a Substack`, Run: func(cmd *cobra.Command, args []string) { parsedURL, err := parseURL(pubUrl) if err != nil { log.Fatal(err) } mainWebsite := fmt.Sprintf("%s://%s", parsedURL.Scheme, parsedURL.Host) if verbose { fmt.Printf("Main website: %s\n", mainWebsite) fmt.Println("Getting all posts URLs...") } dateFilterfunc := makeDateFilterFunc(beforeDate, afterDate) urls, err := extractor.GetAllPostsURLs(ctx, mainWebsite, dateFilterfunc) if err != nil { log.Fatal(err) } if verbose { fmt.Printf("Found %d posts.\n", len(urls)) } for _, url 
:= range urls { fmt.Println(url) } }, } ) func init() { listCmd.Flags().StringVarP(&pubUrl, "url", "u", "", "Specify the Substack url") listCmd.MarkFlagRequired("url") } ================================================ FILE: cmd/main.go ================================================ package cmd ================================================ FILE: cmd/root.go ================================================ package cmd import ( "context" "errors" "log" "net/http" "net/url" "os" "github.com/alexferrari88/sbstck-dl/lib" "github.com/spf13/cobra" ) // rootCmd represents the base command when called without any subcommands type cookieName string const ( substackSid cookieName = "substack.sid" connectSid cookieName = "connect.sid" ) func (c *cookieName) String() string { return string(*c) } func (c *cookieName) Set(val string) error { switch val { case "substack.sid", "connect.sid": *c = cookieName(val) default: return errors.New("invalid cookie name: must be either substack.sid or connect.sid") } return nil } func (c *cookieName) Type() string { return "cookieName" } var ( proxyURL string verbose bool ratePerSecond int beforeDate string afterDate string idCookieName cookieName idCookieVal string ctx = context.Background() parsedProxyURL *url.URL fetcher *lib.Fetcher extractor *lib.Extractor rootCmd = &cobra.Command{ Use: "sbstck-dl", Short: "Substack Downloader", Long: `sbstck-dl is a command line tool for downloading Substack newsletters for archival purposes, offline reading, or data analysis.`, PersistentPreRun: func(cmd *cobra.Command, args []string) { var cookie *http.Cookie if proxyURL != "" { var err error parsedProxyURL, err = parseURL(proxyURL) if err != nil { log.Fatal(err) } } if ratePerSecond == 0 { log.Fatal("rate must be greater than 0") } if idCookieVal != "" && idCookieName != "" { if idCookieName == substackSid { cookie = &http.Cookie{ Name: "substack.sid", Value: idCookieVal, } } else if idCookieName == connectSid { cookie = &http.Cookie{ Name: 
"connect.sid", Value: idCookieVal, } } } fetcher = lib.NewFetcher(lib.WithRatePerSecond(ratePerSecond), lib.WithProxyURL(parsedProxyURL), lib.WithCookie(cookie)) extractor = lib.NewExtractor(fetcher) }, } ) // Execute adds all child commands to the root command and sets flags appropriately. // This is called by main.main(). It only needs to happen once to the rootCmd. func Execute() { err := rootCmd.Execute() if err != nil { os.Exit(1) } } func init() { rootCmd.PersistentFlags().StringVarP(&proxyURL, "proxy", "x", "", "Specify the proxy url") rootCmd.PersistentFlags().Var(&idCookieName, "cookie_name", "Either \"substack.sid\" or \"connect.sid\", based on the cookie you have (required for private newsletters)") rootCmd.PersistentFlags().StringVar(&idCookieVal, "cookie_val", "", "The substack.sid/connect.sid cookie value (required for private newsletters)") rootCmd.PersistentFlags().BoolVarP(&verbose, "verbose", "v", false, "Enable verbose output") rootCmd.PersistentFlags().IntVarP(&ratePerSecond, "rate", "r", lib.DefaultRatePerSecond, "Specify the rate of requests per second") rootCmd.PersistentFlags().StringVar(&beforeDate, "before", "", "Download posts published before this date (format: YYYY-MM-DD)") rootCmd.PersistentFlags().StringVar(&afterDate, "after", "", "Download posts published after this date (format: YYYY-MM-DD)") rootCmd.MarkFlagsRequiredTogether("cookie_name", "cookie_val") rootCmd.AddCommand(downloadCmd) rootCmd.AddCommand(listCmd) rootCmd.AddCommand(versionCmd) } func makeDateFilterFunc(beforeDate string, afterDate string) lib.DateFilterFunc { var dateFilterFunc lib.DateFilterFunc if beforeDate != "" && afterDate != "" { dateFilterFunc = func(date string) bool { return date > afterDate && date < beforeDate } } else if beforeDate != "" { dateFilterFunc = func(date string) bool { return date < beforeDate } } else if afterDate != "" { dateFilterFunc = func(date string) bool { return date > afterDate } } return dateFilterFunc } 
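The date filter built by makeDateFilterFunc compares dates as plain strings, which orders correctly for zero-padded YYYY-MM-DD values. The sketch below is a stand-alone illustration of that behavior, not the project's code: `makeDateFilter` is a hypothetical copy written for this example, and the assumption that a full RFC 3339 timestamp may reach the filter is ours, since the caller is not shown here.

```go
package main

import "fmt"

// DateFilterFunc mirrors the shape of lib.DateFilterFunc: it reports
// whether a post date passes the filter.
type DateFilterFunc func(string) bool

// makeDateFilter is a hypothetical stand-alone copy of makeDateFilterFunc.
// Both bounds are exclusive and compared lexicographically.
func makeDateFilter(before, after string) DateFilterFunc {
	switch {
	case before != "" && after != "":
		return func(d string) bool { return d > after && d < before }
	case before != "":
		return func(d string) bool { return d < before }
	case after != "":
		return func(d string) bool { return d > after }
	}
	// No bounds given: callers receive a nil filter.
	return nil
}

func main() {
	f := makeDateFilter("2023-02-01", "2023-01-01")

	// A date strictly inside the window passes.
	fmt.Println(f("2023-01-15")) // true

	// Assumption: if a full timestamp is passed, a post published on the
	// "after" day itself also passes, because a longer string with the
	// same prefix sorts after the shorter one.
	fmt.Println(f("2023-01-01T10:00:00Z")) // true

	// The "before" bound is exclusive for an exact date match.
	fmt.Println(f("2023-02-01")) // false

	// With neither flag set, no filter is built.
	fmt.Println(makeDateFilter("", "") == nil) // true
}
```

String comparison keeps the filter free of date parsing, at the cost of silently accepting malformed input such as "2023-1-5", which would sort incorrectly.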
================================================ FILE: cmd/version.go ================================================ package cmd import ( "fmt" "github.com/spf13/cobra" ) // versionCmd represents the version command var versionCmd = &cobra.Command{ Use: "version", Short: "Print the version number of sbstck-dl", Long: `Display the current version of the app.`, Run: func(cmd *cobra.Command, args []string) { fmt.Println("sbstck-dl v0.7") }, } func init() { } ================================================ FILE: go.mod ================================================ module github.com/alexferrari88/sbstck-dl go 1.20 require ( github.com/JohannesKaufmann/html-to-markdown v1.5.0 github.com/PuerkitoBio/goquery v1.8.1 github.com/cenkalti/backoff/v4 v4.2.1 github.com/k3a/html2text v1.2.1 github.com/schollz/progressbar/v3 v3.14.1 github.com/spf13/cobra v1.8.0 github.com/stretchr/testify v1.8.4 golang.org/x/sync v0.6.0 golang.org/x/time v0.5.0 ) require ( github.com/andybalholm/cascadia v1.3.2 // indirect github.com/davecgh/go-spew v1.1.1 // indirect github.com/inconshreveable/mousetrap v1.1.0 // indirect github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db // indirect github.com/pmezard/go-difflib v1.0.0 // indirect github.com/rivo/uniseg v0.4.4 // indirect github.com/spf13/pflag v1.0.5 // indirect golang.org/x/net v0.20.0 // indirect golang.org/x/sys v0.16.0 // indirect golang.org/x/term v0.16.0 // indirect gopkg.in/yaml.v3 v3.0.1 // indirect ) ================================================ FILE: go.sum ================================================ github.com/JohannesKaufmann/html-to-markdown v1.5.0 h1:cEAcqpxk0hUJOXEVGrgILGW76d1GpyGY7PCnAaWQyAI= github.com/JohannesKaufmann/html-to-markdown v1.5.0/go.mod h1:QTO/aTyEDukulzu269jY0xiHeAGsNxmuUBo2Q0hPsK8= github.com/PuerkitoBio/goquery v1.8.1 h1:uQxhNlArOIdbrH1tr0UXwdVFgDcZDrZVdcpygAcwmWM= github.com/PuerkitoBio/goquery v1.8.1/go.mod h1:Q8ICL1kNUJ2sXGoAhPGUdYDJvgQgHzJsnnd3H7Ho5jQ= 
github.com/andybalholm/cascadia v1.3.1/go.mod h1:R4bJ1UQfqADjvDa4P6HZHLh/3OxWWEqc0Sk8XGwHqvA= github.com/andybalholm/cascadia v1.3.2 h1:3Xi6Dw5lHF15JtdcmAHD3i1+T8plmv7BQ/nsViSLyss= github.com/andybalholm/cascadia v1.3.2/go.mod h1:7gtRlve5FxPPgIgX36uWBX58OdBsSS6lUvCFb+h7KvU= github.com/cenkalti/backoff/v4 v4.2.1 h1:y4OZtCnogmCPw98Zjyt5a6+QwPLGkiQsYW5oUqylYbM= github.com/cenkalti/backoff/v4 v4.2.1/go.mod h1:Y3VNntkOUPxTVeUxJ/G5vcM//AlwfmyYozVcomhLiZE= github.com/cpuguy83/go-md2man/v2 v2.0.3/go.mod h1:tgQtvFlXSQOSOSIRvRPT7W67SCa46tRHOmNcaadrF8o= github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c= github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= github.com/gopherjs/gopherjs v0.0.0-20181017120253-0766667cb4d1 h1:EGx4pi6eqNxGaHF6qqu48+N2wcFQ5qg5FXgOdqsJ5d8= github.com/gopherjs/gopherjs v0.0.0-20181017120253-0766667cb4d1/go.mod h1:wJfORRmW1u3UXTncJ5qlYoELFm8eSnnEO6hX4iZ3EWY= github.com/inconshreveable/mousetrap v1.1.0 h1:wN+x4NVGpMsO7ErUn/mUI3vEoE6Jt13X2s0bqwp9tc8= github.com/inconshreveable/mousetrap v1.1.0/go.mod h1:vpF70FUmC8bwa3OWnCshd2FqLfsEA9PFc4w1p2J65bw= github.com/jtolds/gls v4.20.0+incompatible h1:xdiiI2gbIgH/gLH7ADydsJ1uDOEzR8yvV7C0MuV77Wo= github.com/jtolds/gls v4.20.0+incompatible/go.mod h1:QJZ7F/aHp+rZTRtaJ1ow/lLfFfVYBRgL+9YlvaHOwJU= github.com/k0kubun/go-ansi v0.0.0-20180517002512-3bf9e2903213/go.mod h1:vNUNkEQ1e29fT/6vq2aBdFsgNPmy8qMdSay1npru+Sw= github.com/k3a/html2text v1.2.1 h1:nvnKgBvBR/myqrwfLuiqecUtaK1lB9hGziIJKatNFVY= github.com/k3a/html2text v1.2.1/go.mod h1:ieEXykM67iT8lTvEWBh6fhpH4B23kB9OMKPdIBmgUqA= github.com/kr/pretty v0.1.0 h1:L/CwN0zerZDmRFUapSPitk6f+Q3+0za1rQkzVuMiMFI= github.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo= github.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ= github.com/kr/text v0.1.0 
h1:45sCR5RtlFHMR4UwH9sdQ5TC8v0qDQCHnXt+kaKSTVE= github.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI= github.com/mattn/go-isatty v0.0.20/go.mod h1:W+V8PltTTMOvKvAeJH7IuucS94S2C6jfK/D7dTCTo3Y= github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db h1:62I3jR2EmQ4l5rM/4FEfDWcRD+abF5XlKShorW5LRoQ= github.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db/go.mod h1:l0dey0ia/Uv7NcFFVbCLtqEBQbrT4OCwCSKTEv6enCw= github.com/pkg/errors v0.8.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0= github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM= github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= github.com/rivo/uniseg v0.4.4 h1:8TfxU8dW6PdqD27gjM8MVNuicgxIjxpm4K7x4jp8sis= github.com/rivo/uniseg v0.4.4/go.mod h1:FN3SvrM+Zdj16jyLfmOkMNblXMcoc8DfTHruCPUcx88= github.com/russross/blackfriday/v2 v2.1.0/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM= github.com/schollz/progressbar/v3 v3.14.1 h1:VD+MJPCr4s3wdhTc7OEJ/Z3dAeBzJ7yKH/P4lC5yRTI= github.com/schollz/progressbar/v3 v3.14.1/go.mod h1:Zc9xXneTzWXF81TGoqL71u0sBPjULtEHYtj/WVgVy8E= github.com/sebdah/goldie/v2 v2.5.3 h1:9ES/mNN+HNUbNWpVAlrzuZ7jE+Nrczbj8uFRjM7624Y= github.com/sebdah/goldie/v2 v2.5.3/go.mod h1:oZ9fp0+se1eapSRjfYbsV/0Hqhbuu3bJVvKI/NNtssI= github.com/sergi/go-diff v1.0.0/go.mod h1:0CfEIISq7TuYL3j771MWULgwwjU+GofnZX9QAmXWZgo= github.com/sergi/go-diff v1.2.0 h1:XU+rvMAioB0UC3q1MFrIQy4Vo5/4VsRDQQXHsEya6xQ= github.com/sergi/go-diff v1.2.0/go.mod h1:STckp+ISIX8hZLjrqAeVduY0gWCT9IjLuqbuNXdaHfM= github.com/smartystreets/assertions v0.0.0-20180927180507-b2de0cb4f26d h1:zE9ykElWQ6/NYmHa3jpm/yHnI4xSofP+UP6SpjHcSeM= github.com/smartystreets/assertions v0.0.0-20180927180507-b2de0cb4f26d/go.mod h1:OnSkiWE9lh6wB0YB77sQom3nweQdgAjqCqsofrRNTgc= github.com/smartystreets/goconvey v1.6.4 h1:fv0U8FUIMPNf1L9lnHLvLhgicrIVChEkdzIKYqbNC9s= github.com/smartystreets/goconvey v1.6.4/go.mod 
h1:syvi0/a8iFYH4r/RixwvyeAJjdLS9QV7WQ/tjFTllLA= github.com/spf13/cobra v1.8.0 h1:7aJaZx1B85qltLMc546zn58BxxfZdR/W22ej9CFoEf0= github.com/spf13/cobra v1.8.0/go.mod h1:WXLWApfZ71AjXPya3WOlMsY9yMs7YeiHhFVlvLyhcho= github.com/spf13/pflag v1.0.5 h1:iy+VFUOCP1a+8yFto/drg2CJ5u0yRoB7fZw3DKv/JXA= github.com/spf13/pflag v1.0.5/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg= github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME= github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI= github.com/stretchr/testify v1.4.0/go.mod h1:j7eGeouHqKxXV5pUuKE4zz7dFj8WfuZ+81PSLYec5m4= github.com/stretchr/testify v1.8.4 h1:CcVxjf3Q8PM0mHUKJCdn+eZZtm5yQwehR5yeSVQQcUk= github.com/stretchr/testify v1.8.4/go.mod h1:sz/lmYIOXD/1dqDmKjjqLyZ2RngseejIcXlSw2iwfAo= github.com/yuin/goldmark v1.4.13/go.mod h1:6yULJ656Px+3vBD8DxQVa3kxgyrAnzto9xy5taEt/CY= github.com/yuin/goldmark v1.6.0 h1:boZcn2GTjpsynOsC0iJHnBWa4Bi0qzfJjthwauItG68= github.com/yuin/goldmark v1.6.0/go.mod h1:6yULJ656Px+3vBD8DxQVa3kxgyrAnzto9xy5taEt/CY= golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w= golang.org/x/crypto v0.0.0-20210921155107-089bfa567519/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc= golang.org/x/crypto v0.16.0/go.mod h1:gCAAfMLgwOJRpTjQ2zCCt2OcSfYMTeZVSRtQlPC7Nq4= golang.org/x/mod v0.6.0-dev.0.20220419223038-86c51ed26bb4/go.mod h1:jJ57K6gSWd91VN4djpZkiMVwK6gcyfeH4XE8wZrZaV4= golang.org/x/mod v0.8.0/go.mod h1:iBbtSCu2XBx23ZKBPSOrRkjjQPZFPuis4dIYUhu/chs= golang.org/x/net v0.0.0-20190311183353-d8887717615a/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg= golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= golang.org/x/net v0.0.0-20210226172049-e18ecbb05110/go.mod h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg= golang.org/x/net v0.0.0-20210916014120-12bc252f5db8/go.mod h1:9nx3DQGgdP8bBQD5qxJ1jj9UTztislL4KSBs9R2vV5Y= 
golang.org/x/net v0.0.0-20220722155237-a158d28d115b/go.mod h1:XRhObCWvk6IyKnWLug+ECip1KBveYUHfp+8e9klMJ9c= golang.org/x/net v0.6.0/go.mod h1:2Tu9+aMcznHK/AK1HMvgo6xiTLG5rD5rZLDS+rp2Bjs= golang.org/x/net v0.7.0/go.mod h1:2Tu9+aMcznHK/AK1HMvgo6xiTLG5rD5rZLDS+rp2Bjs= golang.org/x/net v0.9.0/go.mod h1:d48xBJpPfHeWQsugry2m+kC02ZBRGRgulfHnEXEuWns= golang.org/x/net v0.10.0/go.mod h1:0qNGK6F8kojg2nk9dLZ2mShWaEBan6FAoqfSigmmuDg= golang.org/x/net v0.19.0/go.mod h1:CfAk/cbD4CthTvqiEl8NpboMuiuOYsAr/7NOjZJtv1U= golang.org/x/net v0.20.0 h1:aCL9BSgETF1k+blQaYUBx9hJ9LOGP3gAVemcZlf1Kpo= golang.org/x/net v0.20.0/go.mod h1:z8BVo6PvndSri0LbOE3hAn0apkU+1YvI6E70E9jsnvY= golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= golang.org/x/sync v0.0.0-20220722155255-886fb9371eb4/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= golang.org/x/sync v0.1.0/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM= golang.org/x/sync v0.6.0 h1:5BMeUDZ7vkXGfEr1x9B4bRcTH4lpkTkpdh0T/J+qjbQ= golang.org/x/sync v0.6.0/go.mod h1:Czt+wKu1gCyEFDUtn0jG5QVvpJ6rzVqr5aXyt9drQfk= golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY= golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20210423082822-04245dca01da/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= golang.org/x/sys v0.0.0-20220520151302-bc2c85ada10a/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= golang.org/x/sys v0.0.0-20220722155257-8c9f86f7a55f/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= golang.org/x/sys v0.5.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= golang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= golang.org/x/sys v0.7.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= 
golang.org/x/sys v0.8.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= golang.org/x/sys v0.14.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA= golang.org/x/sys v0.15.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA= golang.org/x/sys v0.16.0 h1:xWw16ngr6ZMtmxDyKyIgsE93KNKz5HKmMa3b8ALHidU= golang.org/x/sys v0.16.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA= golang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo= golang.org/x/term v0.0.0-20210927222741-03fcf44c2211/go.mod h1:jbD1KX2456YbFQfuXm/mYQcufACuNUgVhRMnK/tPxf8= golang.org/x/term v0.5.0/go.mod h1:jMB1sMXY+tzblOD4FWmEbocvup2/aLOaQEp7JmGp78k= golang.org/x/term v0.7.0/go.mod h1:P32HKFT3hSsZrRxla30E9HqToFYAQPCMs/zFMBUFqPY= golang.org/x/term v0.8.0/go.mod h1:xPskH00ivmX89bAKVGSKKtLOWNx2+17Eiy94tnKShWo= golang.org/x/term v0.14.0/go.mod h1:TySc+nGkYR6qt8km8wUhuFRTVSMIX3XPR58y2lC8vww= golang.org/x/term v0.15.0/go.mod h1:BDl952bC7+uMoWR75FIrCDx79TPU9oHkTZ9yRbYOrX0= golang.org/x/term v0.16.0 h1:m+B6fahuftsE9qjo0VWp2FW0mB3MTJvR0BaMQrq0pmE= golang.org/x/term v0.16.0/go.mod h1:yn7UURbUtPyrVJPGPq404EukNFxcm/foM+bV/bfcDsY= golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ= golang.org/x/text v0.3.3/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ= golang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ= golang.org/x/text v0.3.7/go.mod h1:u+2+/6zg+i71rQMx5EYifcz6MCKuco9NR6JIITiCfzQ= golang.org/x/text v0.7.0/go.mod h1:mrYo+phRRbMaCq/xk9113O4dZlRixOauAjOtrjsXDZ8= golang.org/x/text v0.9.0/go.mod h1:e1OnstbJyHTd6l/uOt8jFFHp6TRDWZR/bV3emEE/zU8= golang.org/x/text v0.14.0/go.mod h1:18ZOQIKpY8NJVqYksKHtTdi31H5itFRjB5/qKTNYzSU= golang.org/x/time v0.5.0 h1:o7cqy6amK/52YcAKIPlM3a+Fpj35zvRj2TP+e1xFSfk= golang.org/x/time v0.5.0/go.mod h1:3BpzKBy/shNhVucY/MWOyx10tF3SFh9QdLuxbVysPQM= golang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ= 
golang.org/x/tools v0.0.0-20190328211700-ab21143f2384/go.mod h1:LCzVGOaR6xXOjkQ3onu1FJEFr0SW1gC7cKk1uF8kGRs= golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo= golang.org/x/tools v0.1.12/go.mod h1:hNGJHUnrk76NpqgfD5Aqm5Crs+Hm0VOH/i9J2+nxYbc= golang.org/x/tools v0.6.0/go.mod h1:Xwgl3UAJ/d3gWutnCtw505GrjyAbvKui8lOU390QaIU= golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= gopkg.in/check.v1 v1.0.0-20190902080502-41f04d3bba15 h1:YR8cESwS4TdDjEe65xsg0ogRM/Nc3DYOhEAlW+xobZo= gopkg.in/check.v1 v1.0.0-20190902080502-41f04d3bba15/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= gopkg.in/yaml.v2 v2.2.2/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= gopkg.in/yaml.v2 v2.2.4/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= gopkg.in/yaml.v2 v2.4.0 h1:D8xgwECY7CYvx+Y2n4sBz93Jn9JRvxdiyyo8CTfuKaY= gopkg.in/yaml.v2 v2.4.0/go.mod h1:RDklbk79AGWmwhnvt/jBztapEOGDOx6ZbXqjP6csGnQ= gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= ================================================ FILE: lib/extractor.go ================================================ package lib import ( "context" "encoding/json" "errors" "fmt" "net/url" "os" "path/filepath" "sort" "strings" "sync" "time" md "github.com/JohannesKaufmann/html-to-markdown" "github.com/PuerkitoBio/goquery" "github.com/k3a/html2text" ) // RawPost represents a raw Substack post in string format. type RawPost struct { str string } // ToPost converts the RawPost to a structured Post object. 
func (r *RawPost) ToPost() (Post, error) { var wrapper PostWrapper err := json.Unmarshal([]byte(r.str), &wrapper) if err != nil { return Post{}, err } return wrapper.Post, nil } // Post represents a structured Substack post with various fields. type Post struct { Id int `json:"id"` PublicationId int `json:"publication_id"` Type string `json:"type"` Slug string `json:"slug"` PostDate string `json:"post_date"` CanonicalUrl string `json:"canonical_url"` PreviousPostSlug string `json:"previous_post_slug"` NextPostSlug string `json:"next_post_slug"` CoverImage string `json:"cover_image"` Description string `json:"description"` Subtitle string `json:"subtitle,omitempty"` WordCount int `json:"wordcount"` Title string `json:"title"` BodyHTML string `json:"body_html"` } // Static converter instance to avoid recreating it for each conversion var mdConverter = md.NewConverter("", true, nil) // ToMD converts the Post's HTML body to Markdown format. func (p *Post) ToMD(withTitle bool) (string, error) { if withTitle { body, err := mdConverter.ConvertString(p.BodyHTML) if err != nil { return "", err } return fmt.Sprintf("# %s\n\n%s", p.Title, body), nil } return mdConverter.ConvertString(p.BodyHTML) } // ToText converts the Post's HTML body to plain text format. func (p *Post) ToText(withTitle bool) string { if withTitle { return p.Title + "\n\n" + html2text.HTML2Text(p.BodyHTML) } return html2text.HTML2Text(p.BodyHTML) } // ToHTML returns the Post's HTML body as-is or with an optional title header. func (p *Post) ToHTML(withTitle bool) string { if withTitle { return fmt.Sprintf("

<h1>%s</h1>

\n\n%s", p.Title, p.BodyHTML) } return p.BodyHTML } // ToJSON converts the Post to a JSON string. func (p *Post) ToJSON() (string, error) { b, err := json.Marshal(p) if err != nil { return "", err } return string(b), nil } // contentForFormat returns the content of a post in the specified format. func (p *Post) contentForFormat(format string, withTitle bool) (string, error) { switch format { case "html": return p.ToHTML(withTitle), nil case "md": return p.ToMD(withTitle) case "txt": return p.ToText(withTitle), nil default: return "", fmt.Errorf("unknown format: %s", format) } } // WriteToFile writes the Post's content to a file in the specified format (html, md, or txt). func (p *Post) WriteToFile(path string, format string, addSourceURL bool) error { if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil { return err } content, err := p.contentForFormat(format, true) if err != nil { return err } if addSourceURL && p.CanonicalUrl != "" { sourceLine := fmt.Sprintf("\n\noriginal content: %s", p.CanonicalUrl) // Add separation // Adjust formatting slightly for HTML if format == "html" { sourceLine = fmt.Sprintf("

<p>original content: <a href=\"%s\">%s</a></p>

", p.CanonicalUrl, p.CanonicalUrl) } content += sourceLine } return os.WriteFile(path, []byte(content), 0644) } // WriteToFileWithImages writes the Post's content to a file with optional image downloading func (p *Post) WriteToFileWithImages(ctx context.Context, path string, format string, addSourceURL bool, downloadImages bool, imageQuality ImageQuality, imagesDir string, downloadFiles bool, fileExtensions []string, filesDir string, fetcher *Fetcher) (*ImageDownloadResult, error) { if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil { return nil, err } content, err := p.contentForFormat(format, true) if err != nil { return nil, err } var imageResult *ImageDownloadResult // Download images if requested and format supports it if downloadImages && (format == "html" || format == "md") { outputDir := filepath.Dir(path) imageDownloader := NewImageDownloader(fetcher, outputDir, imagesDir, imageQuality) // Only process HTML content for image downloading htmlContent := content if format == "md" { // For markdown, we need to work with the original HTML htmlContent = p.BodyHTML } imageResult, err = imageDownloader.DownloadImages(ctx, htmlContent, p.Slug) if err != nil { return nil, fmt.Errorf("failed to download images: %w", err) } // Update content based on format if format == "html" { content = imageResult.UpdatedHTML // Re-add title if needed if strings.HasPrefix(content, "
<h1>

<h1>%s</h1>

\n\n%s", p.Title, imageResult.UpdatedHTML) } } else if format == "md" { // Convert updated HTML to markdown updatedContent, err := mdConverter.ConvertString(imageResult.UpdatedHTML) if err != nil { return nil, fmt.Errorf("failed to convert updated HTML to markdown: %w", err) } content = fmt.Sprintf("# %s\n\n%s", p.Title, updatedContent) } } else if downloadImages && format == "txt" { // For text format, we can't embed images, but we can still download them outputDir := filepath.Dir(path) imageDownloader := NewImageDownloader(fetcher, outputDir, imagesDir, imageQuality) imageResult, err = imageDownloader.DownloadImages(ctx, p.BodyHTML, p.Slug) if err != nil { return nil, fmt.Errorf("failed to download images: %w", err) } // Keep original text content since we can't embed images in text format } // Download files if requested and format supports it if downloadFiles && (format == "html" || format == "md") { outputDir := filepath.Dir(path) fileDownloader := NewFileDownloader(fetcher, outputDir, filesDir, fileExtensions) // Process HTML content for file downloading - use the updated HTML from images if available htmlContent := content if imageResult != nil && imageResult.UpdatedHTML != "" { htmlContent = imageResult.UpdatedHTML } else if format == "md" { // For markdown, we need to work with the original HTML htmlContent = p.BodyHTML } fileResult, err := fileDownloader.DownloadFiles(ctx, htmlContent, p.Slug) if err != nil { return nil, fmt.Errorf("failed to download files: %w", err) } // Update content based on format if files were processed if fileResult.Success > 0 || fileResult.Failed > 0 { if format == "html" { content = fileResult.UpdatedHTML // Re-add title if needed if !strings.HasPrefix(content, "
<h1>

<h1>%s</h1>

\n\n%s", p.Title, fileResult.UpdatedHTML) } } else if format == "md" { // Convert updated HTML to markdown updatedContent, err := mdConverter.ConvertString(fileResult.UpdatedHTML) if err != nil { return nil, fmt.Errorf("failed to convert updated HTML to markdown: %w", err) } content = fmt.Sprintf("# %s\n\n%s", p.Title, updatedContent) } } } // Add source URL if requested if addSourceURL && p.CanonicalUrl != "" { sourceLine := fmt.Sprintf("\n\noriginal content: %s", p.CanonicalUrl) // Adjust formatting slightly for HTML if format == "html" { sourceLine = fmt.Sprintf("

<p>original content: <a href=\"%s\">%s</a></p>

", p.CanonicalUrl, p.CanonicalUrl) } content += sourceLine } // Write the file if err := os.WriteFile(path, []byte(content), 0644); err != nil { return imageResult, err } // Return empty result if no image downloading was performed if imageResult == nil { imageResult = &ImageDownloadResult{ Images: []ImageInfo{}, UpdatedHTML: content, Success: 0, Failed: 0, } } return imageResult, nil } // PostWrapper wraps a Post object for JSON unmarshaling. type PostWrapper struct { Post Post `json:"post"` } // Extractor is a utility for extracting Substack posts from URLs. type Extractor struct { fetcher *Fetcher } // ArchiveEntry represents a single entry in the archive page type ArchiveEntry struct { Post Post FilePath string DownloadTime time.Time } // Archive represents a collection of posts for the archive page type Archive struct { Entries []ArchiveEntry } // NewExtractor creates a new Extractor with the provided Fetcher. // If the Fetcher is nil, a default Fetcher will be used. func NewExtractor(f *Fetcher) *Extractor { if f == nil { f = NewFetcher() } return &Extractor{fetcher: f} } // extractJSONString finds and extracts the JSON data from script content. // This optimized version reduces string operations. 
func extractJSONString(doc *goquery.Document) (string, error) { var jsonString string var found bool doc.Find("script").EachWithBreak(func(i int, s *goquery.Selection) bool { content := s.Text() if strings.Contains(content, "window._preloads") && strings.Contains(content, "JSON.parse(") { start := strings.Index(content, "JSON.parse(\"") if start == -1 { return true } start += len("JSON.parse(\"") end := strings.LastIndex(content, "\")") if end == -1 || start >= end { return true } jsonString = content[start:end] found = true return false } return true }) if !found { return "", errors.New("failed to extract JSON string") } return jsonString, nil } func (e *Extractor) ExtractPost(ctx context.Context, pageUrl string) (Post, error) { // fetch page HTML content body, err := e.fetcher.FetchURL(ctx, pageUrl) if err != nil { return Post{}, fmt.Errorf("failed to fetch page: %w", err) } defer body.Close() doc, err := goquery.NewDocumentFromReader(body) if err != nil { return Post{}, fmt.Errorf("failed to parse HTML: %w", err) } jsonString, err := extractJSONString(doc) if err != nil { return Post{}, fmt.Errorf("failed to extract post data: %w", err) } // Unescape the JSON string directly var rawJSON RawPost err = json.Unmarshal([]byte("\""+jsonString+"\""), &rawJSON.str) if err != nil { return Post{}, fmt.Errorf("failed to unescape JSON: %w", err) } // Convert to a Go object p, err := rawJSON.ToPost() if err != nil { return Post{}, fmt.Errorf("failed to parse post data: %w", err) } // Extract additional metadata from HTML // Extract subtitle from .subtitle element if subtitle := doc.Find(".subtitle").First().Text(); subtitle != "" { p.Subtitle = strings.TrimSpace(subtitle) } // Extract cover image from og:image meta tag if not already set if p.CoverImage == "" { if ogImage, exists := doc.Find("meta[property='og:image']").Attr("content"); exists && ogImage != "" { p.CoverImage = ogImage } } return p, nil } type DateFilterFunc func(string) bool func (e *Extractor) 
GetAllPostsURLs(ctx context.Context, pubUrl string, f DateFilterFunc) ([]string, error) { u, err := url.Parse(pubUrl) if err != nil { return nil, err } u.Path, err = url.JoinPath(u.Path, "sitemap.xml") if err != nil { return nil, err } // fetch the sitemap of the publication body, err := e.fetcher.FetchURL(ctx, u.String()) if err != nil { return nil, err } defer body.Close() // Parse the XML doc, err := goquery.NewDocumentFromReader(body) if err != nil { return nil, err } // Pre-allocate a reasonable size for URLs // This avoids multiple slice reallocations as we append urls := make([]string, 0, 100) doc.Find("url").EachWithBreak(func(i int, s *goquery.Selection) bool { // Check if the context has been cancelled select { case <-ctx.Done(): return false default: } urlSel := s.Find("loc") url := urlSel.Text() if !strings.Contains(url, "/p/") { return true } // Only find lastmod if we have a filter if f != nil { lastmod := s.Find("lastmod").Text() if !f(lastmod) { return true } } urls = append(urls, url) return true }) return urls, nil } type ExtractResult struct { Post Post Err error } // ExtractAllPosts extracts all posts from the given URLs using a worker pool pattern // to limit concurrency and avoid overwhelming system resources. 
// At most 10 workers run concurrently (fewer when there are fewer URLs). The returned
// channel is buffered to len(urls) and closed once every worker has returned, so callers
// can range over it; after ctx is cancelled, workers stop taking new URLs and the channel
// may deliver fewer than len(urls) results.
func (e *Extractor) ExtractAllPosts(ctx context.Context, urls []string) <-chan ExtractResult { resultCh := make(chan ExtractResult, len(urls)) go func() { defer close(resultCh) // Create a channel for the URLs urlCh := make(chan string, len(urls)) // Fill the URL channel for _, u := range urls { urlCh <- u } close(urlCh) // Limit concurrency - the number of workers is capped at 10 or the number of URLs, whichever is smaller workerCount := 10 if len(urls) < workerCount { workerCount = len(urls) } // Create a WaitGroup to wait for all workers to finish var wg sync.WaitGroup wg.Add(workerCount) // Start the workers for i := 0; i < workerCount; i++ { go func() { defer wg.Done() for url := range urlCh { select { case <-ctx.Done(): // Context cancelled, stop processing return default: post, err := e.ExtractPost(ctx, url) resultCh <- ExtractResult{Post: post, Err: err} } } }() } // Wait for all workers to finish wg.Wait() }() return resultCh } // NewArchive creates a new Archive instance func NewArchive() *Archive { return &Archive{ Entries: make([]ArchiveEntry, 0), } } // AddEntry adds a new entry to the archive, sorted by publication date (newest first) func (a *Archive) AddEntry(post Post, filePath string, downloadTime time.Time) { entry := ArchiveEntry{ Post: post, FilePath: filePath, DownloadTime: downloadTime, } a.Entries = append(a.Entries, entry) a.sortEntries() } // sortEntries sorts archive entries by publication date (newest first) func (a *Archive) sortEntries() { sort.Slice(a.Entries, func(i, j int) bool { // Parse post dates and compare (newest first) dateI, errI := time.Parse(time.RFC3339, a.Entries[i].Post.PostDate) dateJ, errJ := time.Parse(time.RFC3339, a.Entries[j].Post.PostDate) if errI != nil || errJ != nil { // If parsing fails, sort by title return a.Entries[i].Post.Title < a.Entries[j].Post.Title } return dateI.After(dateJ) // newest first }) } // GenerateHTML creates an HTML archive page func (a *Archive) GenerateHTML(outputDir string) error { 
archivePath := filepath.Join(outputDir, "index.html") html := ` Substack Archive

Substack Archive

` for _, entry := range a.Entries { // Make file path relative from archive directory relPath, _ := filepath.Rel(outputDir, entry.FilePath) // Format publication date pubDate := entry.Post.PostDate if parsedDate, err := time.Parse(time.RFC3339, entry.Post.PostDate); err == nil { pubDate = parsedDate.Format("January 2, 2006") } // Format download date downloadDate := entry.DownloadTime.Format("January 2, 2006 15:04") html += `
` // Add cover image if available if entry.Post.CoverImage != "" { html += fmt.Sprintf(` Cover `, entry.Post.CoverImage) } html += fmt.Sprintf(`

%s

Published: %s | Downloaded: %s
`, relPath, entry.Post.Title, pubDate, downloadDate) // Add subtitle/description description := entry.Post.Subtitle if description == "" { description = entry.Post.Description } if description != "" { html += fmt.Sprintf(`
%s
`, description) } html += `
` } html += ` ` return os.WriteFile(archivePath, []byte(html), 0644) } // GenerateMarkdown creates a Markdown archive page func (a *Archive) GenerateMarkdown(outputDir string) error { archivePath := filepath.Join(outputDir, "index.md") content := "# Substack Archive\n\n" for _, entry := range a.Entries { // Make file path relative from archive directory relPath, _ := filepath.Rel(outputDir, entry.FilePath) // Format publication date pubDate := entry.Post.PostDate if parsedDate, err := time.Parse(time.RFC3339, entry.Post.PostDate); err == nil { pubDate = parsedDate.Format("January 2, 2006") } // Format download date downloadDate := entry.DownloadTime.Format("January 2, 2006 15:04") content += fmt.Sprintf("## [%s](%s)\n\n", entry.Post.Title, relPath) content += fmt.Sprintf("**Published:** %s | **Downloaded:** %s\n\n", pubDate, downloadDate) // Add cover image if available if entry.Post.CoverImage != "" { content += fmt.Sprintf("![Cover Image](%s)\n\n", entry.Post.CoverImage) } // Add subtitle/description description := entry.Post.Subtitle if description == "" { description = entry.Post.Description } if description != "" { content += fmt.Sprintf("*%s*\n\n", description) } content += "---\n\n" } return os.WriteFile(archivePath, []byte(content), 0644) } // GenerateText creates a plain text archive page func (a *Archive) GenerateText(outputDir string) error { archivePath := filepath.Join(outputDir, "index.txt") content := "SUBSTACK ARCHIVE\n================\n\n" for _, entry := range a.Entries { // Make file path relative from archive directory relPath, _ := filepath.Rel(outputDir, entry.FilePath) // Format publication date pubDate := entry.Post.PostDate if parsedDate, err := time.Parse(time.RFC3339, entry.Post.PostDate); err == nil { pubDate = parsedDate.Format("January 2, 2006") } // Format download date downloadDate := entry.DownloadTime.Format("January 2, 2006 15:04") content += fmt.Sprintf("Title: %s\n", entry.Post.Title) content += fmt.Sprintf("File: %s\n", 
relPath) content += fmt.Sprintf("Published: %s\n", pubDate) content += fmt.Sprintf("Downloaded: %s\n", downloadDate) // Add subtitle/description description := entry.Post.Subtitle if description == "" { description = entry.Post.Description } if description != "" { content += fmt.Sprintf("Description: %s\n", description) } content += "\n" + strings.Repeat("-", 50) + "\n\n" } return os.WriteFile(archivePath, []byte(content), 0644) } ================================================ FILE: lib/extractor_test.go ================================================ package lib import ( "context" "encoding/json" "fmt" "net/http" "net/http/httptest" "os" "path/filepath" "strings" "sync" "sync/atomic" "testing" "time" "github.com/PuerkitoBio/goquery" "github.com/cenkalti/backoff/v4" "github.com/stretchr/testify/assert" "github.com/stretchr/testify/require" ) // Helper function to create a sample Post for testing func createSamplePost() Post { return Post{ Id: 123, PublicationId: 456, Type: "post", Slug: "test-post", PostDate: "2023-01-01", CanonicalUrl: "https://example.substack.com/p/test-post", PreviousPostSlug: "previous-post", NextPostSlug: "next-post", CoverImage: "https://example.com/image.jpg", Description: "Test description", Subtitle: "Test subtitle", WordCount: 100, Title: "Test Post", BodyHTML: "

<p>This is a <strong>test</strong> post.</p>

", } } // Helper function to create a mock HTML page with embedded JSON func createMockSubstackHTML(post Post) string { // Create a wrapper and marshal it to JSON wrapper := PostWrapper{Post: post} jsonBytes, _ := json.Marshal(wrapper) // Escape quotes for embedding in JavaScript escapedJSON := strings.ReplaceAll(string(jsonBytes), `"`, `\"`) return fmt.Sprintf(` %s
<script>window._preloads = JSON.parse("%s")</script>
Some content
`, post.Title, escapedJSON) } // Test RawPost.ToPost func TestRawPostToPost(t *testing.T) { // Create a sample post expectedPost := createSamplePost() // Create a wrapper and marshal it to JSON wrapper := PostWrapper{Post: expectedPost} jsonBytes, err := json.Marshal(wrapper) require.NoError(t, err) // Create a RawPost with the JSON string rawPost := RawPost{str: string(jsonBytes)} // Test conversion actualPost, err := rawPost.ToPost() require.NoError(t, err) // Verify the result assert.Equal(t, expectedPost, actualPost) // Test with invalid JSON invalidRawPost := RawPost{str: "invalid json"} _, err = invalidRawPost.ToPost() assert.Error(t, err) } // Test Post format conversions func TestPostFormatConversions(t *testing.T) { post := createSamplePost() t.Run("ToHTML", func(t *testing.T) { html := post.ToHTML(true) assert.Contains(t, html, "

<h1>Test Post</h1>") assert.Contains(t, html, "<p>This is a <strong>test</strong> post.</p>") htmlNoTitle := post.ToHTML(false) assert.NotContains(t, htmlNoTitle, "<h1>Test Post</h1>") assert.Contains(t, htmlNoTitle, "<p>This is a <strong>test</strong> post.</p>

") }) t.Run("ToMD", func(t *testing.T) { md, err := post.ToMD(true) require.NoError(t, err) assert.Contains(t, md, "# Test Post") assert.Contains(t, md, "This is a **test** post.") mdNoTitle, err := post.ToMD(false) require.NoError(t, err) assert.NotContains(t, mdNoTitle, "# Test Post") assert.Contains(t, mdNoTitle, "This is a **test** post.") }) t.Run("ToText", func(t *testing.T) { text := post.ToText(true) assert.Contains(t, text, "Test Post") assert.Contains(t, text, "This is a test post.") textNoTitle := post.ToText(false) assert.NotContains(t, textNoTitle, "Test Post\n\n") assert.Contains(t, textNoTitle, "This is a test post.") }) t.Run("ToJSON", func(t *testing.T) { jsonStr, err := post.ToJSON() require.NoError(t, err) assert.Contains(t, jsonStr, `"id":123`) assert.Contains(t, jsonStr, `"title":"Test Post"`) }) t.Run("contentForFormat", func(t *testing.T) { // Test valid formats for _, format := range []string{"html", "md", "txt"} { content, err := post.contentForFormat(format, true) assert.NoError(t, err) assert.NotEmpty(t, content) } // Test invalid format _, err := post.contentForFormat("invalid", true) assert.Error(t, err) assert.Contains(t, err.Error(), "unknown format") }) // Test error handling for format conversions t.Run("ToMD error handling", func(t *testing.T) { // Create a post with problematic HTML for markdown conversion // Note: html-to-markdown library is quite robust, so we test with extremely malformed HTML problemPost := createSamplePost() problemPost.BodyHTML = "

Nested without closing

" // This should still work as the library handles most malformed HTML _, err := problemPost.ToMD(true) assert.NoError(t, err) // The library is quite tolerant }) t.Run("ToJSON error handling", func(t *testing.T) { // Create a post that would have issues during JSON marshaling // This is hard to trigger with normal Post struct, but we can test the error path problemPost := createSamplePost() // Test with valid data (JSON marshaling rarely fails with valid structs) jsonStr, err := problemPost.ToJSON() assert.NoError(t, err) assert.NotEmpty(t, jsonStr) // Verify the JSON is valid var parsedPost Post err = json.Unmarshal([]byte(jsonStr), &parsedPost) assert.NoError(t, err) assert.Equal(t, problemPost.Id, parsedPost.Id) assert.Equal(t, problemPost.Title, parsedPost.Title) }) } // Test Post.WriteToFile func TestPostWriteToFile(t *testing.T) { post := createSamplePost() tempDir, err := os.MkdirTemp("", "post-test-*") require.NoError(t, err) defer os.RemoveAll(tempDir) formats := []string{"html", "md", "txt"} for _, format := range formats { t.Run(format, func(t *testing.T) { filePath := filepath.Join(tempDir, fmt.Sprintf("test.%s", format)) err := post.WriteToFile(filePath, format, false) require.NoError(t, err) // Verify file exists fileInfo, err := os.Stat(filePath) assert.NoError(t, err) assert.True(t, fileInfo.Size() > 0, "File should not be empty") // Read file content content, err := os.ReadFile(filePath) require.NoError(t, err) // Check content based on format switch format { case "html": assert.Contains(t, string(content), "

<h1>Test Post</h1>") assert.Contains(t, string(content), "<p>This is a <strong>test</strong> post.</p>

") case "md": assert.Contains(t, string(content), "# Test Post") assert.Contains(t, string(content), "This is a **test** post.") case "txt": assert.Contains(t, string(content), "Test Post") assert.Contains(t, string(content), "This is a test post.") } }) } // Test writing to a non-existent directory t.Run("creating directory", func(t *testing.T) { newDir := filepath.Join(tempDir, "subdir", "nested") filePath := filepath.Join(newDir, "test.html") err := post.WriteToFile(filePath, "html", false) assert.NoError(t, err) // Verify directory was created _, err = os.Stat(newDir) assert.NoError(t, err) }) // Test invalid format t.Run("invalid format", func(t *testing.T) { filePath := filepath.Join(tempDir, "test.invalid") err := post.WriteToFile(filePath, "invalid", false) assert.Error(t, err) assert.Contains(t, err.Error(), "unknown format") }) // Test with addSourceURL enabled t.Run("with source URL", func(t *testing.T) { formats := []string{"html", "md", "txt"} for _, format := range formats { t.Run(format, func(t *testing.T) { filePath := filepath.Join(tempDir, fmt.Sprintf("test-with-source.%s", format)) err := post.WriteToFile(filePath, format, true) require.NoError(t, err) // Read file content content, err := os.ReadFile(filePath) require.NoError(t, err) contentStr := string(content) // Check that source URL is included assert.Contains(t, contentStr, post.CanonicalUrl) assert.Contains(t, contentStr, "original content") // Check format-specific source URL formatting if format == "html" { assert.Contains(t, contentStr, "

No script here

` doc, err := goquery.NewDocumentFromReader(strings.NewReader(invalidHTML)) require.NoError(t, err) _, err = extractJSONString(doc) assert.Error(t, err) assert.Contains(t, err.Error(), "failed to extract JSON string") }) t.Run("malformedScript", func(t *testing.T) { // Test HTML with malformed script malformedHTML := ` ` doc, err := goquery.NewDocumentFromReader(strings.NewReader(malformedHTML)) require.NoError(t, err) _, err = extractJSONString(doc) assert.Error(t, err) }) } // Create a real test server that serves mock Substack pages func createSubstackTestServer() (*httptest.Server, map[string]Post) { posts := make(map[string]Post) // Create several sample posts for i := 1; i <= 5; i++ { post := createSamplePost() post.Id = i post.Title = fmt.Sprintf("Test Post %d", i) post.Slug = fmt.Sprintf("test-post-%d", i) post.CanonicalUrl = fmt.Sprintf("https://example.substack.com/p/test-post-%d", i) posts[fmt.Sprintf("/p/test-post-%d", i)] = post } // Create sitemap XML with different dates sitemapXML := ` ` // Create ordered list of posts to ensure deterministic date assignment dates := []string{"2023-01-01", "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05"} for i := 1; i <= 5; i++ { post := posts[fmt.Sprintf("/p/test-post-%d", i)] sitemapXML += fmt.Sprintf(` https://example.substack.com/p/%s %s `, post.Slug, dates[i-1]) } sitemapXML += `` // Create server server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { path := r.URL.Path // Handle sitemap request if path == "/sitemap.xml" { w.Header().Set("Content-Type", "application/xml") w.Write([]byte(sitemapXML)) return } // Handle post requests post, exists := posts[path] if exists { w.Header().Set("Content-Type", "text/html") w.Write([]byte(createMockSubstackHTML(post))) return } // Handle not found w.WriteHeader(http.StatusNotFound) })) return server, posts } // Test Extractor.ExtractPost func TestExtractorExtractPost(t *testing.T) { // Create test server server, posts := 
createSubstackTestServer() defer server.Close() // Create extractor with default fetcher extractor := NewExtractor(nil) // Test successful extraction t.Run("successfulExtraction", func(t *testing.T) { ctx := context.Background() for path, expectedPost := range posts { postURL := server.URL + path extractedPost, err := extractor.ExtractPost(ctx, postURL) require.NoError(t, err) assert.Equal(t, expectedPost.Id, extractedPost.Id) assert.Equal(t, expectedPost.Title, extractedPost.Title) assert.Equal(t, expectedPost.BodyHTML, extractedPost.BodyHTML) } }) // Test invalid URL t.Run("invalidURL", func(t *testing.T) { ctx := context.Background() _, err := extractor.ExtractPost(ctx, "invalid-url") assert.Error(t, err) }) // Test not found t.Run("notFound", func(t *testing.T) { ctx := context.Background() _, err := extractor.ExtractPost(ctx, server.URL+"/p/non-existent") assert.Error(t, err) }) // Test context cancellation t.Run("contextCancellation", func(t *testing.T) { ctx, cancel := context.WithCancel(context.Background()) cancel() // Cancel immediately _, err := extractor.ExtractPost(ctx, server.URL+"/p/test-post-1") assert.Error(t, err) assert.Contains(t, err.Error(), "context") }) } // Test Extractor.GetAllPostsURLs func TestExtractorGetAllPostsURLs(t *testing.T) { // Create test server server, posts := createSubstackTestServer() defer server.Close() // Create extractor extractor := NewExtractor(nil) ctx := context.Background() // Test without filter t.Run("withoutFilter", func(t *testing.T) { urls, err := extractor.GetAllPostsURLs(ctx, server.URL, nil) require.NoError(t, err) // Should find all post URLs assert.Equal(t, len(posts), len(urls)) // Check each URL is present for _, post := range posts { found := false for _, url := range urls { if strings.Contains(url, post.Slug) { found = true break } } assert.True(t, found, "URL for post %s should be present", post.Slug) } }) // Test with date filter t.Run("withDateFilter", func(t *testing.T) { // Filter for posts after 
2023-01-02 (should get 3 posts: 2023-01-03, 2023-01-04, 2023-01-05) dateFilter := func(date string) bool { return date > "2023-01-02" } urls, err := extractor.GetAllPostsURLs(ctx, server.URL, dateFilter) require.NoError(t, err) // Should get 3 posts (dates 2023-01-03, 2023-01-04, 2023-01-05) assert.Len(t, urls, 3) // Verify the filtered URLs are correct for _, url := range urls { // Should contain test-post-3, test-post-4, or test-post-5 assert.True(t, strings.Contains(url, "test-post-3") || strings.Contains(url, "test-post-4") || strings.Contains(url, "test-post-5")) } }) // Test with context cancellation t.Run("contextCancellation", func(t *testing.T) { ctx, cancel := context.WithCancel(context.Background()) cancel() // Cancel immediately _, err := extractor.GetAllPostsURLs(ctx, server.URL, nil) assert.Error(t, err) }) // Test with invalid URL t.Run("invalidURL", func(t *testing.T) { _, err := extractor.GetAllPostsURLs(ctx, "invalid-url", nil) assert.Error(t, err) }) } // Test Extractor.ExtractAllPosts func TestExtractorExtractAllPosts(t *testing.T) { // Create test server server, posts := createSubstackTestServer() defer server.Close() // Create URLs list urls := make([]string, 0, len(posts)) for path := range posts { urls = append(urls, server.URL+path) } // Create extractor extractor := NewExtractor(nil) ctx := context.Background() // Test successful extraction of all posts t.Run("successfulExtraction", func(t *testing.T) { resultCh := extractor.ExtractAllPosts(ctx, urls) // Collect results results := make(map[int]Post) errorCount := 0 for result := range resultCh { if result.Err != nil { errorCount++ } else { results[result.Post.Id] = result.Post } } // Verify results assert.Equal(t, 0, errorCount, "There should be no errors") assert.Equal(t, len(posts), len(results), "All posts should be extracted") // Check each post for _, post := range posts { extractedPost, exists := results[post.Id] assert.True(t, exists, "Post with ID %d should be extracted", post.Id) 
if exists { assert.Equal(t, post.Title, extractedPost.Title) assert.Equal(t, post.BodyHTML, extractedPost.BodyHTML) } } }) // Test with context cancellation t.Run("contextCancellation", func(t *testing.T) { ctx, cancel := context.WithCancel(context.Background()) resultCh := extractor.ExtractAllPosts(ctx, urls) // Cancel after receiving first result var count int var wg sync.WaitGroup wg.Add(1) go func() { defer wg.Done() for result := range resultCh { if result.Err != nil { continue } count++ if count == 1 { cancel() // Add a small delay to ensure cancellation propagates time.Sleep(100 * time.Millisecond) break // Exit loop early after cancelling } } }() wg.Wait() // We should have received at least one result before cancellation assert.GreaterOrEqual(t, count, 1) // Don't assert that count < len(posts) since on fast machines all might complete }) // Test with mixed responses (some successful, some errors) t.Run("mixedResponses", func(t *testing.T) { // Add some invalid URLs to the list mixedUrls := append([]string{"invalid-url", server.URL + "/p/non-existent"}, urls...) 
resultCh := extractor.ExtractAllPosts(ctx, mixedUrls) // Collect results successCount := 0 errorCount := 0 for result := range resultCh { if result.Err != nil { errorCount++ } else { successCount++ } } // Verify results assert.Equal(t, len(posts), successCount, "All valid posts should be extracted") assert.Equal(t, 2, errorCount, "There should be errors for invalid URLs") }) // Test worker concurrency limiting t.Run("concurrencyLimit", func(t *testing.T) { // Create a large number of duplicate URLs to test concurrency manyUrls := make([]string, 50) for i := range manyUrls { manyUrls[i] = urls[i%len(urls)] } // Create a channel to track concurrent requests type accessRecord struct { url string timestamp time.Time } accessCh := make(chan accessRecord, len(manyUrls)) // Create a test server that records access times concurrentServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { accessCh <- accessRecord{ url: r.URL.Path, timestamp: time.Now(), } // Simulate some processing time time.Sleep(100 * time.Millisecond) // Serve the same content as the regular server path := r.URL.Path post, exists := posts[path] if exists { w.Header().Set("Content-Type", "text/html") w.Write([]byte(createMockSubstackHTML(post))) return } w.WriteHeader(http.StatusNotFound) })) defer concurrentServer.Close() // Replace URLs with concurrent server URLs concurrentUrls := make([]string, len(manyUrls)) for i, u := range manyUrls { path := strings.TrimPrefix(u, server.URL) concurrentUrls[i] = concurrentServer.URL + path } // Create extractor with limited workers customFetcher := NewFetcher(WithMaxWorkers(10), WithRatePerSecond(100)) concurrentExtractor := NewExtractor(customFetcher) // Start extraction resultCh := concurrentExtractor.ExtractAllPosts(ctx, concurrentUrls) // Collect all results to make sure extraction completes var results []ExtractResult for result := range resultCh { results = append(results, result) } // Close the access channel since we're 
done receiving close(accessCh) // Process access records to determine concurrency var accessRecords []accessRecord for record := range accessCh { accessRecords = append(accessRecords, record) } // Sort access records by timestamp maxConcurrent := 0 activeTimes := make([]time.Time, 0) for _, record := range accessRecords { // Add this request's start time activeTimes = append(activeTimes, record.timestamp) // Expire any requests that would have completed by now newActiveTimes := make([]time.Time, 0) for _, t := range activeTimes { if t.Add(100 * time.Millisecond).After(record.timestamp) { newActiveTimes = append(newActiveTimes, t) } } activeTimes = newActiveTimes // Update max concurrent if len(activeTimes) > maxConcurrent { maxConcurrent = len(activeTimes) } } // Verify concurrency was limited appropriately // Note: This test is timing-dependent and may need adjustment assert.LessOrEqual(t, maxConcurrent, 15, "Concurrency should be limited") // Ensure all requests were processed assert.Equal(t, len(concurrentUrls), len(results)) }) } // Test error handling func TestExtractorErrorHandling(t *testing.T) { // Create a server that simulates various errors var requestCount atomic.Int32 errorServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { // Get request counter requestCount.Add(1) // Increment counter path := r.URL.Path // Simulate different errors based on path - order matters here! 
switch { case path == "/p/normal-post": // Return a valid post post := createSamplePost() w.Header().Set("Content-Type", "text/html") w.Write([]byte(createMockSubstackHTML(post))) return case strings.Contains(path, "not-found"): w.WriteHeader(http.StatusNotFound) return case strings.Contains(path, "server-error"): w.WriteHeader(http.StatusInternalServerError) return case strings.Contains(path, "rate-limit"): w.Header().Set("Retry-After", "1") w.WriteHeader(http.StatusTooManyRequests) return case strings.Contains(path, "bad-json"): // Return valid HTML but with malformed JSON html := ` Bad JSON ` w.Header().Set("Content-Type", "text/html") w.Write([]byte(html)) return case strings.Contains(path, "timeout-post"): // Use a long sleep to ensure timeout - longer than the client timeout time.Sleep(2 * time.Second) w.WriteHeader(http.StatusOK) return default: // Return a valid post for other paths post := createSamplePost() w.Header().Set("Content-Type", "text/html") w.Write([]byte(createMockSubstackHTML(post))) return } })) defer errorServer.Close() // Create paths for different error scenarios paths := []string{ "/p/normal-post", "/p/not-found", "/p/server-error", "/p/rate-limit", "/p/bad-json", "/p/timeout-post", } // Create URLs urls := make([]string, len(paths)) for i, path := range paths { urls[i] = errorServer.URL + path } // Create extractor with short timeout and limited retries backoffCfg := backoff.NewExponentialBackOff() backoffCfg.MaxElapsedTime = 1 * time.Second // Short timeout for tests backoffCfg.InitialInterval = 100 * time.Millisecond fetcher := NewFetcher( WithTimeout(500*time.Millisecond), // Make timeout shorter than the sleep for timeout test WithBackOffConfig(backoffCfg), ) extractor := NewExtractor(fetcher) ctx := context.Background() // Test individual error cases t.Run("NotFound", func(t *testing.T) { _, err := extractor.ExtractPost(ctx, errorServer.URL+"/p/not-found") assert.Error(t, err) }) t.Run("ServerError", func(t *testing.T) { _, err := 
extractor.ExtractPost(ctx, errorServer.URL+"/p/server-error") assert.Error(t, err) }) t.Run("RateLimit", func(t *testing.T) { _, err := extractor.ExtractPost(ctx, errorServer.URL+"/p/rate-limit") assert.Error(t, err) }) t.Run("BadJSON", func(t *testing.T) { _, err := extractor.ExtractPost(ctx, errorServer.URL+"/p/bad-json") assert.Error(t, err) }) t.Run("Timeout", func(t *testing.T) { // Test with a URL that will cause a timeout _, err := extractor.ExtractPost(ctx, errorServer.URL+"/p/timeout-post") assert.Error(t, err) // The error may be a context deadline exceeded or a timeout error }) // Test handling multiple URLs with mixed errors t.Run("MixedErrors", func(t *testing.T) { resultCh := extractor.ExtractAllPosts(ctx, urls) // Collect results successCount := 0 errorCount := 0 for result := range resultCh { if result.Err != nil { errorCount++ } else { successCount++ } } // We expect at least one success (the normal post) and several errors assert.GreaterOrEqual(t, successCount, 1) assert.GreaterOrEqual(t, errorCount, 1) // At least one error (likely timeout) }) } // Test enhanced post extraction features (subtitle and cover image) func TestEnhancedPostExtraction(t *testing.T) { t.Run("SubtitleExtraction", func(t *testing.T) { post := createSamplePost() post.Subtitle = "" // Clear subtitle from JSON to test HTML extraction // Create mock HTML with subtitle element html := fmt.Sprintf(` %s
<script>window._preloads = JSON.parse("%s")</script>
<div class="subtitle">This is the subtitle from HTML</div>
Some content
`, post.Title, escapeJSONForJS(post)) // Create test server server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.Header().Set("Content-Type", "text/html") w.Write([]byte(html)) })) defer server.Close() extractor := NewExtractor(nil) ctx := context.Background() extractedPost, err := extractor.ExtractPost(ctx, server.URL) require.NoError(t, err) // Verify subtitle was extracted and trimmed assert.Equal(t, "This is the subtitle from HTML", extractedPost.Subtitle) }) t.Run("CoverImageFromOGTag", func(t *testing.T) { post := createSamplePost() post.CoverImage = "" // Clear cover image from JSON to test og:image extraction // Create mock HTML with og:image meta tag html := fmt.Sprintf(` %s
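<meta property="og:image" content="https://example.com/og-cover.jpg">
<script>window._preloads = JSON.parse("%s")</script>
Some content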
`, post.Title, escapeJSONForJS(post)) // Create test server server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.Header().Set("Content-Type", "text/html") w.Write([]byte(html)) })) defer server.Close() extractor := NewExtractor(nil) ctx := context.Background() extractedPost, err := extractor.ExtractPost(ctx, server.URL) require.NoError(t, err) // Verify cover image was extracted from og:image assert.Equal(t, "https://example.com/og-cover.jpg", extractedPost.CoverImage) }) t.Run("ExistingCoverImagePreserved", func(t *testing.T) { post := createSamplePost() post.CoverImage = "https://existing.com/image.jpg" // Create mock HTML with og:image meta tag (should be ignored) html := fmt.Sprintf(` %s
Some content
`, post.Title, escapeJSONForJS(post)) // Create test server server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.Header().Set("Content-Type", "text/html") w.Write([]byte(html)) })) defer server.Close() extractor := NewExtractor(nil) ctx := context.Background() extractedPost, err := extractor.ExtractPost(ctx, server.URL) require.NoError(t, err) // Verify existing cover image was preserved (not overwritten by og:image) assert.Equal(t, "https://existing.com/image.jpg", extractedPost.CoverImage) }) t.Run("NoSubtitleOrCoverImage", func(t *testing.T) { post := createSamplePost() post.Subtitle = "" post.CoverImage = "" // Create mock HTML without subtitle or og:image html := fmt.Sprintf(` %s
Some content
`, post.Title, escapeJSONForJS(post)) // Create test server server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.Header().Set("Content-Type", "text/html") w.Write([]byte(html)) })) defer server.Close() extractor := NewExtractor(nil) ctx := context.Background() extractedPost, err := extractor.ExtractPost(ctx, server.URL) require.NoError(t, err) // Verify empty subtitle and cover image remain empty assert.Empty(t, extractedPost.Subtitle) assert.Empty(t, extractedPost.CoverImage) }) } // Helper function to escape JSON for embedding in JavaScript func escapeJSONForJS(post Post) string { wrapper := PostWrapper{Post: post} jsonBytes, _ := json.Marshal(wrapper) return strings.ReplaceAll(string(jsonBytes), `"`, `\"`) } // Test Archive functionality func TestArchive(t *testing.T) { t.Run("NewArchive", func(t *testing.T) { archive := NewArchive() assert.NotNil(t, archive) assert.NotNil(t, archive.Entries) assert.Len(t, archive.Entries, 0) }) t.Run("AddEntry", func(t *testing.T) { archive := NewArchive() post1 := createSamplePost() post1.PostDate = "2023-01-01T00:00:00Z" post1.Title = "First Post" post2 := createSamplePost() post2.PostDate = "2023-01-02T00:00:00Z" post2.Title = "Second Post" post3 := createSamplePost() post3.PostDate = "2023-01-03T00:00:00Z" post3.Title = "Third Post" downloadTime := time.Now() // Add entries in random order archive.AddEntry(post2, "post2.html", downloadTime) archive.AddEntry(post1, "post1.html", downloadTime) archive.AddEntry(post3, "post3.html", downloadTime) // Verify entries were added and sorted by date (newest first) assert.Len(t, archive.Entries, 3) assert.Equal(t, "Third Post", archive.Entries[0].Post.Title) // 2023-01-03 (newest) assert.Equal(t, "Second Post", archive.Entries[1].Post.Title) // 2023-01-02 assert.Equal(t, "First Post", archive.Entries[2].Post.Title) // 2023-01-01 (oldest) }) t.Run("SortingWithInvalidDates", func(t *testing.T) { archive := NewArchive() post1 := 
createSamplePost() post1.PostDate = "invalid-date" post1.Title = "A Post" post2 := createSamplePost() post2.PostDate = "also-invalid" post2.Title = "B Post" downloadTime := time.Now() archive.AddEntry(post2, "post2.html", downloadTime) archive.AddEntry(post1, "post1.html", downloadTime) // Should sort by title when dates are invalid assert.Len(t, archive.Entries, 2) assert.Equal(t, "A Post", archive.Entries[0].Post.Title) // Alphabetical order assert.Equal(t, "B Post", archive.Entries[1].Post.Title) }) t.Run("ArchiveEntryFields", func(t *testing.T) { archive := NewArchive() post := createSamplePost() filePath := "/path/to/post.html" downloadTime := time.Now() archive.AddEntry(post, filePath, downloadTime) entry := archive.Entries[0] assert.Equal(t, post, entry.Post) assert.Equal(t, filePath, entry.FilePath) assert.Equal(t, downloadTime, entry.DownloadTime) }) } // Test Archive page generation func TestArchivePageGeneration(t *testing.T) { // Helper function to create a test archive setupTestArchive := func() (*Archive, string) { tempDir, err := os.MkdirTemp("", "archive_test") require.NoError(t, err) archive := NewArchive() // Create sample posts with different dates and metadata post1 := createSamplePost() post1.PostDate = "2023-01-01T10:30:00Z" post1.Title = "First Post" post1.Subtitle = "A great first post" post1.CoverImage = "https://example.com/cover1.jpg" post2 := createSamplePost() post2.PostDate = "2023-01-02T15:45:00Z" post2.Title = "Second Post" post2.Subtitle = "" // Empty subtitle, should fall back to description post2.Description = "This is the description" post2.CoverImage = "" post3 := createSamplePost() post3.PostDate = "2023-01-03T08:15:00Z" post3.Title = "Third Post" post3.Subtitle = "" post3.Description = "" post3.CoverImage = "https://example.com/cover3.jpg" downloadTime, _ := time.Parse(time.RFC3339, "2023-01-10T12:00:00Z") archive.AddEntry(post1, filepath.Join(tempDir, "post1.html"), downloadTime) archive.AddEntry(post2, filepath.Join(tempDir, 
"post2.html"), downloadTime.Add(time.Hour)) archive.AddEntry(post3, filepath.Join(tempDir, "post3.html"), downloadTime.Add(2*time.Hour)) return archive, tempDir } t.Run("GenerateHTML", func(t *testing.T) { archive, tempDir := setupTestArchive() defer os.RemoveAll(tempDir) err := archive.GenerateHTML(tempDir) require.NoError(t, err) // Check file was created indexPath := filepath.Join(tempDir, "index.html") assert.FileExists(t, indexPath) // Read and verify content content, err := os.ReadFile(indexPath) require.NoError(t, err) htmlContent := string(content) // Verify HTML structure assert.Contains(t, htmlContent, "") assert.Contains(t, htmlContent, "Substack Archive") assert.Contains(t, htmlContent, "

Substack Archive

") // Verify posts are included in correct order (newest first) assert.Contains(t, htmlContent, "Third Post") // Should appear first (newest) assert.Contains(t, htmlContent, "Second Post") assert.Contains(t, htmlContent, "First Post") // Verify relative paths assert.Contains(t, htmlContent, "post1.html") assert.Contains(t, htmlContent, "post2.html") assert.Contains(t, htmlContent, "post3.html") // Verify cover images and descriptions assert.Contains(t, htmlContent, "https://example.com/cover1.jpg") assert.Contains(t, htmlContent, "https://example.com/cover3.jpg") assert.Contains(t, htmlContent, "A great first post") // Subtitle assert.Contains(t, htmlContent, "This is the description") // Fallback description // Verify dates are formatted assert.Contains(t, htmlContent, "January 1, 2023") // Formatted publication date assert.Contains(t, htmlContent, "January 10, 2023 12:00") // Formatted download date }) t.Run("GenerateMarkdown", func(t *testing.T) { archive, tempDir := setupTestArchive() defer os.RemoveAll(tempDir) err := archive.GenerateMarkdown(tempDir) require.NoError(t, err) // Check file was created indexPath := filepath.Join(tempDir, "index.md") assert.FileExists(t, indexPath) // Read and verify content content, err := os.ReadFile(indexPath) require.NoError(t, err) mdContent := string(content) // Verify markdown structure assert.Contains(t, mdContent, "# Substack Archive\n\n") assert.Contains(t, mdContent, "## [Third Post](post3.html)") // Newest first assert.Contains(t, mdContent, "## [Second Post](post2.html)") assert.Contains(t, mdContent, "## [First Post](post1.html)") // Verify metadata format assert.Contains(t, mdContent, "**Published:** January 1, 2023") assert.Contains(t, mdContent, "**Downloaded:** January 10, 2023 12:00") // Verify cover image markdown syntax assert.Contains(t, mdContent, "![Cover Image](https://example.com/cover1.jpg)") assert.Contains(t, mdContent, "![Cover Image](https://example.com/cover3.jpg)") // Verify descriptions in italic 
assert.Contains(t, mdContent, "*A great first post*") assert.Contains(t, mdContent, "*This is the description*") // Verify separators assert.Contains(t, mdContent, "---") }) t.Run("GenerateText", func(t *testing.T) { archive, tempDir := setupTestArchive() defer os.RemoveAll(tempDir) err := archive.GenerateText(tempDir) require.NoError(t, err) // Check file was created indexPath := filepath.Join(tempDir, "index.txt") assert.FileExists(t, indexPath) // Read and verify content content, err := os.ReadFile(indexPath) require.NoError(t, err) txtContent := string(content) // Verify text structure assert.Contains(t, txtContent, "SUBSTACK ARCHIVE\n================") // Verify post entries (newest first) assert.Contains(t, txtContent, "Title: Third Post") assert.Contains(t, txtContent, "Title: Second Post") assert.Contains(t, txtContent, "Title: First Post") // Verify file paths assert.Contains(t, txtContent, "File: post1.html") assert.Contains(t, txtContent, "File: post2.html") assert.Contains(t, txtContent, "File: post3.html") // Verify formatted dates assert.Contains(t, txtContent, "Published: January 1, 2023") assert.Contains(t, txtContent, "Downloaded: January 10, 2023 12:00") // Verify descriptions assert.Contains(t, txtContent, "Description: A great first post") assert.Contains(t, txtContent, "Description: This is the description") // Verify separators assert.Contains(t, txtContent, strings.Repeat("-", 50)) }) t.Run("EmptyArchive", func(t *testing.T) { tempDir, err := os.MkdirTemp("", "empty_archive_test") require.NoError(t, err) defer os.RemoveAll(tempDir) archive := NewArchive() // Test each format with empty archive err = archive.GenerateHTML(tempDir) require.NoError(t, err) err = archive.GenerateMarkdown(tempDir) require.NoError(t, err) err = archive.GenerateText(tempDir) require.NoError(t, err) // Verify files exist and contain basic headers htmlContent, _ := os.ReadFile(filepath.Join(tempDir, "index.html")) assert.Contains(t, string(htmlContent), "Substack 
Archive") mdContent, _ := os.ReadFile(filepath.Join(tempDir, "index.md")) assert.Contains(t, string(mdContent), "# Substack Archive") txtContent, _ := os.ReadFile(filepath.Join(tempDir, "index.txt")) assert.Contains(t, string(txtContent), "SUBSTACK ARCHIVE") }) t.Run("FileSystemError", func(t *testing.T) { archive := NewArchive() post := createSamplePost() archive.AddEntry(post, "test.html", time.Now()) // Try to write to non-existent directory with restricted permissions invalidDir := "/non/existent/directory" err := archive.GenerateHTML(invalidDir) assert.Error(t, err) err = archive.GenerateMarkdown(invalidDir) assert.Error(t, err) err = archive.GenerateText(invalidDir) assert.Error(t, err) }) } // Benchmarks func BenchmarkExtractor(b *testing.B) { // Create test server server, posts := createSubstackTestServer() defer server.Close() // Create URLs urls := make([]string, 0, len(posts)) for path := range posts { urls = append(urls, server.URL+path) } // Create extractor extractor := NewExtractor(nil) ctx := context.Background() // Benchmark single post extraction b.Run("ExtractPost", func(b *testing.B) { url := urls[0] b.ResetTimer() for i := 0; i < b.N; i++ { post, err := extractor.ExtractPost(ctx, url) if err != nil { b.Fatal(err) } // Simple check to ensure the compiler doesn't optimize away the result if post.Id <= 0 { b.Fatal("Invalid post ID") } } }) // Benchmark format conversions post := createSamplePost() b.Run("ToHTML", func(b *testing.B) { for i := 0; i < b.N; i++ { html := post.ToHTML(true) if len(html) == 0 { b.Fatal("Empty HTML") } } }) b.Run("ToMD", func(b *testing.B) { for i := 0; i < b.N; i++ { md, err := post.ToMD(true) if err != nil { b.Fatal(err) } if len(md) == 0 { b.Fatal("Empty markdown") } } }) b.Run("ToText", func(b *testing.B) { for i := 0; i < b.N; i++ { text := post.ToText(true) if len(text) == 0 { b.Fatal("Empty text") } } }) // Benchmark extracting all posts b.Run("ExtractAllPosts", func(b *testing.B) { for i := 0; i < b.N; i++ { 
resultCh := extractor.ExtractAllPosts(ctx, urls) // Consume all results successCount := 0 for result := range resultCh { if result.Err == nil { successCount++ } } if successCount != len(posts) { b.Fatalf("Expected %d successful extractions, got %d", len(posts), successCount) } } }) // Benchmark with larger number of URLs b.Run("ExtractAllPostsMany", func(b *testing.B) { // Create many duplicate URLs to test concurrency manyUrls := make([]string, 50) for i := range manyUrls { manyUrls[i] = urls[i%len(urls)] } // Create extractor with optimized settings for benchmark optimizedFetcher := NewFetcher( WithMaxWorkers(20), WithRatePerSecond(100), WithBurst(50), ) optimizedExtractor := NewExtractor(optimizedFetcher) b.ResetTimer() for i := 0; i < b.N; i++ { resultCh := optimizedExtractor.ExtractAllPosts(ctx, manyUrls) // Consume all results successCount := 0 for result := range resultCh { if result.Err == nil { successCount++ } } if successCount < len(manyUrls)-5 { // Allow a few errors b.Fatalf("Too few successful extractions: %d out of %d", successCount, len(manyUrls)) } } }) } ================================================ FILE: lib/fetcher.go ================================================ package lib import ( "context" "fmt" "io" "net/http" "net/url" "strconv" "time" "github.com/cenkalti/backoff/v4" "golang.org/x/sync/errgroup" "golang.org/x/time/rate" ) // DefaultRatePerSecond defines the default request rate per second when creating a new Fetcher. const DefaultRatePerSecond = 2 // DefaultBurst defines the default burst size for the rate limiter. const DefaultBurst = 5 // defaultRetryAfter specifies the default value for Retry-After header in case of too many requests. const defaultRetryAfter = 60 // defaultMaxRetryCount defines the default maximum number of retries for a failed URL fetch. const defaultMaxRetryCount = 10 // defaultMaxElapsedTime specifies the default maximum elapsed time for the exponential backoff. 
const defaultMaxElapsedTime = 10 * time.Minute

// defaultMaxInterval defines the default maximum interval for the exponential backoff.
const defaultMaxInterval = 2 * time.Minute

// defaultClientTimeout defines the default timeout for HTTP requests.
const defaultClientTimeout = 30 * time.Second

// userAgent specifies the User-Agent header value used in HTTP requests.
const userAgent = "sbstck-dl/0.1"

// Fetcher represents a URL fetcher with rate limiting and retry mechanisms.
type Fetcher struct {
	Client      *http.Client
	RateLimiter *rate.Limiter
	BackoffCfg  backoff.BackOff
	Cookie      *http.Cookie
	MaxWorkers  int
}

// FetcherOptions holds configurable options for Fetcher.
type FetcherOptions struct {
	RatePerSecond int
	Burst         int
	ProxyURL      *url.URL
	BackOffConfig backoff.BackOff
	Cookie        *http.Cookie
	Timeout       time.Duration
	MaxWorkers    int
}

// FetcherOption defines a function that applies a specific option to FetcherOptions.
type FetcherOption func(*FetcherOptions)

// WithRatePerSecond sets the rate per second for the Fetcher.
func WithRatePerSecond(rate int) FetcherOption {
	return func(o *FetcherOptions) {
		o.RatePerSecond = rate
	}
}

// WithBurst sets the burst size for the rate limiter.
func WithBurst(burst int) FetcherOption {
	return func(o *FetcherOptions) {
		o.Burst = burst
	}
}

// WithProxyURL sets the proxy URL for the Fetcher.
func WithProxyURL(proxyURL *url.URL) FetcherOption {
	return func(o *FetcherOptions) {
		o.ProxyURL = proxyURL
	}
}

// WithBackOffConfig sets the backoff configuration for the Fetcher.
func WithBackOffConfig(b backoff.BackOff) FetcherOption {
	return func(o *FetcherOptions) {
		o.BackOffConfig = b
	}
}

// WithCookie sets the cookie for the Fetcher.
func WithCookie(cookie *http.Cookie) FetcherOption {
	return func(o *FetcherOptions) {
		if cookie != nil {
			o.Cookie = cookie
		}
	}
}

// WithTimeout sets the HTTP client timeout.
func WithTimeout(timeout time.Duration) FetcherOption {
	return func(o *FetcherOptions) {
		o.Timeout = timeout
	}
}

// WithMaxWorkers sets the maximum number of concurrent workers.
func WithMaxWorkers(workers int) FetcherOption {
	return func(o *FetcherOptions) {
		o.MaxWorkers = workers
	}
}

// FetchResult represents the result of a URL fetch operation.
type FetchResult struct {
	Url   string
	Body  io.ReadCloser
	Error error
}

// FetchError represents a non-200 HTTP response. For 429 (Too Many Requests)
// responses, TooManyRequests is set and RetryAfter holds the Retry-After value in seconds.
type FetchError struct {
	TooManyRequests bool
	RetryAfter      int
	StatusCode      int
}

// Error returns the error message for the FetchError.
func (e *FetchError) Error() string {
	if e.TooManyRequests {
		return fmt.Sprintf("too many requests, retry after %d seconds", e.RetryAfter)
	}
	return fmt.Sprintf("HTTP error: status code %d", e.StatusCode)
}

// NewFetcher creates a new Fetcher with the provided options.
func NewFetcher(opts ...FetcherOption) *Fetcher {
	options := FetcherOptions{
		RatePerSecond: DefaultRatePerSecond,
		Burst:         DefaultBurst,
		BackOffConfig: makeDefaultBackoff(),
		Timeout:       defaultClientTimeout,
		MaxWorkers:    10, // Default to 10 workers
	}

	for _, opt := range opts {
		opt(&options)
	}

	transport := http.DefaultTransport.(*http.Transport).Clone()
	if options.ProxyURL != nil {
		transport.Proxy = http.ProxyURL(options.ProxyURL)
	}

	// Set sensible defaults for transport
	transport.MaxIdleConns = 100
	transport.MaxIdleConnsPerHost = options.MaxWorkers
	transport.MaxConnsPerHost = options.MaxWorkers
	transport.IdleConnTimeout = 90 * time.Second
	transport.TLSHandshakeTimeout = 10 * time.Second

	client := &http.Client{
		Transport: transport,
		Timeout:   options.Timeout,
	}

	return &Fetcher{
		Client:      client,
		RateLimiter: rate.NewLimiter(rate.Limit(options.RatePerSecond), options.Burst),
		BackoffCfg:  options.BackOffConfig,
		Cookie:      options.Cookie,
		MaxWorkers:  options.MaxWorkers,
	}
}

// FetchURLs concurrently fetches the specified URLs and returns a channel to receive the
// FetchResults.
func (f *Fetcher) FetchURLs(ctx context.Context, urls []string) <-chan FetchResult {
	// Use a smaller buffer to reduce memory footprint
	results := make(chan FetchResult, min(len(urls), f.MaxWorkers*2))

	g, ctx := errgroup.WithContext(ctx)

	// Use a semaphore to limit concurrency
	sem := make(chan struct{}, f.MaxWorkers)

	for _, u := range urls {
		u := u // Capture the variable
		g.Go(func() error {
			select {
			case sem <- struct{}{}: // Acquire semaphore
				defer func() { <-sem }() // Release semaphore
			case <-ctx.Done():
				return ctx.Err()
			}

			body, err := f.FetchURL(ctx, u)

			select {
			case results <- FetchResult{Url: u, Body: body, Error: err}:
				return nil
			case <-ctx.Done():
				// Close body if context was canceled to prevent leaks
				if body != nil {
					body.Close()
				}
				return ctx.Err()
			}
		})
	}

	// Close the results channel when all goroutines complete
	go func() {
		g.Wait()
		close(results)
	}()

	return results
}

// FetchURL fetches the specified URL with retries and rate limiting.
func (f *Fetcher) FetchURL(ctx context.Context, url string) (io.ReadCloser, error) {
	var body io.ReadCloser
	var err error
	var retryCounter int

	operation := func() error {
		if retryCounter >= defaultMaxRetryCount {
			return backoff.Permanent(fmt.Errorf("max retry count reached for URL: %s", url))
		}

		err = f.RateLimiter.Wait(ctx) // Use rate limiter
		if err != nil {
			return backoff.Permanent(err) // Context cancellation or rate limiter error
		}

		body, err = f.fetch(ctx, url)
		if err != nil {
			// Only 429 (Too Many Requests) responses are retried
			if fetchErr, ok := err.(*FetchError); ok && fetchErr.TooManyRequests {
				retryCounter++
				return err
			}

			// For other errors, don't retry
			return backoff.Permanent(err)
		}
		return nil
	}

	// Use backoff with a notification hook that could be connected to a logger
	err = backoff.RetryNotify(
		operation,
		f.BackoffCfg,
		func(_ error, _ time.Duration) {
			// Intentionally a no-op; both arguments are ignored
		},
	)

	return body, err
}

// fetch performs the actual HTTP GET request.
func (f *Fetcher) fetch(ctx context.Context, url string) (io.ReadCloser, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}

	req.Header.Set("User-Agent", userAgent)

	// Add cookie if available
	if f.Cookie != nil {
		req.AddCookie(f.Cookie)
	}

	res, err := f.Client.Do(req)
	if err != nil {
		return nil, err
	}

	// Handle non-success status codes
	if res.StatusCode != http.StatusOK {
		// Always close the body for non-200 responses
		defer res.Body.Close()

		if res.StatusCode == http.StatusTooManyRequests {
			retryAfter := defaultRetryAfter
			if retryAfterStr := res.Header.Get("Retry-After"); retryAfterStr != "" {
				if seconds, err := strconv.Atoi(retryAfterStr); err == nil {
					retryAfter = seconds
				}
			}
			return nil, &FetchError{
				TooManyRequests: true,
				RetryAfter:      retryAfter,
				StatusCode:      res.StatusCode,
			}
		}

		return nil, &FetchError{
			StatusCode: res.StatusCode,
		}
	}

	return res.Body, nil
}

// makeDefaultBackoff creates the default exponential backoff configuration.
func makeDefaultBackoff() backoff.BackOff {
	backOffCfg := backoff.NewExponentialBackOff()
	backOffCfg.MaxElapsedTime = defaultMaxElapsedTime
	backOffCfg.MaxInterval = defaultMaxInterval
	backOffCfg.Multiplier = 1.5 // Reduced from 2.0 for more gradual backoff
	return backOffCfg
}

// min returns the smaller of two integers.
func min(a, b int) int { if a < b { return a } return b } ================================================ FILE: lib/fetcher_test.go ================================================ package lib import ( "context" "fmt" "io" "math/rand" "net/http" "net/http/httptest" "net/url" "sync" "sync/atomic" "testing" "time" "github.com/cenkalti/backoff/v4" "github.com/stretchr/testify/assert" "github.com/stretchr/testify/require" "golang.org/x/time/rate" ) // TestNewFetcher tests the creation of a new fetcher with various options func TestNewFetcher(t *testing.T) { t.Run("DefaultOptions", func(t *testing.T) { f := NewFetcher() assert.NotNil(t, f.Client) assert.NotNil(t, f.RateLimiter) assert.NotNil(t, f.BackoffCfg) assert.Nil(t, f.Cookie) assert.Equal(t, 10, f.MaxWorkers) }) t.Run("CustomOptions", func(t *testing.T) { proxyURL, _ := url.Parse("http://proxy.example.com") cookie := &http.Cookie{Name: "test", Value: "value"} customBackoff := backoff.NewConstantBackOff(time.Second) f := NewFetcher( WithRatePerSecond(5), WithBurst(10), WithProxyURL(proxyURL), WithCookie(cookie), WithBackOffConfig(customBackoff), WithTimeout(time.Minute), WithMaxWorkers(20), ) assert.NotNil(t, f.Client) assert.Equal(t, rate.Limit(5), f.RateLimiter.Limit()) assert.Equal(t, 10, f.RateLimiter.Burst()) assert.Equal(t, customBackoff, f.BackoffCfg) assert.Equal(t, cookie, f.Cookie) assert.Equal(t, 20, f.MaxWorkers) assert.Equal(t, time.Minute, f.Client.Timeout) }) } // TestFetchURL tests the FetchURL method func TestFetchURL(t *testing.T) { t.Run("SuccessfulFetch", func(t *testing.T) { // Create a test server server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { assert.Equal(t, "sbstck-dl/0.1", r.Header.Get("User-Agent")) w.WriteHeader(http.StatusOK) w.Write([]byte("response body")) })) defer server.Close() // Create fetcher and fetch the URL f := NewFetcher() ctx := context.Background() body, err := f.FetchURL(ctx, server.URL) // Assert require.NoError(t, err) 
require.NotNil(t, body) defer body.Close() data, err := io.ReadAll(body) require.NoError(t, err) assert.Equal(t, "response body", string(data)) }) t.Run("FetchWithCookie", func(t *testing.T) { cookieReceived := false // Create a test server that checks for cookie server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { cookies := r.Cookies() for _, cookie := range cookies { if cookie.Name == "test" && cookie.Value == "value" { cookieReceived = true break } } w.WriteHeader(http.StatusOK) })) defer server.Close() // Create fetcher with cookie cookie := &http.Cookie{Name: "test", Value: "value"} f := NewFetcher(WithCookie(cookie)) ctx := context.Background() body, err := f.FetchURL(ctx, server.URL) // Assert require.NoError(t, err) require.NotNil(t, body) body.Close() assert.True(t, cookieReceived) }) t.Run("HTTPError", func(t *testing.T) { // Create a test server that returns an error server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusInternalServerError) })) defer server.Close() // Create fetcher and fetch the URL f := NewFetcher() ctx := context.Background() body, err := f.FetchURL(ctx, server.URL) // Assert assert.Error(t, err) assert.Nil(t, body) // Check that the error is of type FetchError fetchErr, ok := err.(*FetchError) assert.True(t, ok) assert.Equal(t, http.StatusInternalServerError, fetchErr.StatusCode) assert.False(t, fetchErr.TooManyRequests) }) t.Run("TooManyRequests", func(t *testing.T) { // Create a test server that returns too many requests server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.Header().Set("Retry-After", "2") w.WriteHeader(http.StatusTooManyRequests) })) defer server.Close() // Create fetcher with a quick backoff for testing backoffCfg := backoff.NewExponentialBackOff() backoffCfg.MaxElapsedTime = 500 * time.Millisecond // Short timeout for test f := NewFetcher(WithBackOffConfig(backoffCfg)) 
ctx := context.Background() body, err := f.FetchURL(ctx, server.URL) // Assert assert.Error(t, err) assert.Nil(t, body) // Check that the error is of type FetchError fetchErr, ok := err.(*FetchError) if !ok { // Could be a permanent error from max retries assert.Contains(t, err.Error(), "max retry count") } else { assert.True(t, fetchErr.TooManyRequests) assert.Equal(t, 2, fetchErr.RetryAfter) } }) t.Run("ContextCancellation", func(t *testing.T) { // Create a test server with a delay server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { time.Sleep(500 * time.Millisecond) w.WriteHeader(http.StatusOK) })) defer server.Close() // Create fetcher f := NewFetcher() // Create context with timeout ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond) defer cancel() // Fetch should be canceled by context body, err := f.FetchURL(ctx, server.URL) // Assert assert.Error(t, err) assert.Nil(t, body) assert.Contains(t, err.Error(), "context") }) } // TestFetchURLs tests the FetchURLs method func TestFetchURLs(t *testing.T) { t.Run("MultipleFetches", func(t *testing.T) { // Track request count var requestCount int32 // Create a test server server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { atomic.AddInt32(&requestCount, 1) w.WriteHeader(http.StatusOK) fmt.Fprintf(w, "response for %s", r.URL.Path) })) defer server.Close() // Create URLs numURLs := 10 urls := make([]string, numURLs) for i := 0; i < numURLs; i++ { urls[i] = fmt.Sprintf("%s/%d", server.URL, i) } // Create fetcher and fetch URLs f := NewFetcher() ctx := context.Background() resultChan := f.FetchURLs(ctx, urls) // Collect results results := make(map[string]string) for result := range resultChan { assert.NoError(t, result.Error) assert.NotNil(t, result.Body) if result.Body != nil { data, err := io.ReadAll(result.Body) result.Body.Close() assert.NoError(t, err) results[result.Url] = string(data) } } // Assert all URLs 
were fetched assert.Equal(t, numURLs, len(results)) assert.Equal(t, int32(numURLs), atomic.LoadInt32(&requestCount)) // Check results for i := 0; i < numURLs; i++ { url := fmt.Sprintf("%s/%d", server.URL, i) expectedResponse := fmt.Sprintf("response for /%d", i) assert.Equal(t, expectedResponse, results[url]) } }) t.Run("RateLimiting", func(t *testing.T) { // Create a test server server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) })) defer server.Close() // Create a lot of URLs numURLs := 20 urls := make([]string, numURLs) for i := 0; i < numURLs; i++ { urls[i] = server.URL } // Create fetcher with low rate f := NewFetcher( WithRatePerSecond(2), WithBurst(1), WithMaxWorkers(5), ) // Time the fetches start := time.Now() ctx := context.Background() resultChan := f.FetchURLs(ctx, urls) // Collect results var count int for result := range resultChan { assert.NoError(t, result.Error) if result.Body != nil { result.Body.Close() } count++ } // Verify count assert.Equal(t, numURLs, count) // Check duration - should be at least 9 seconds for 20 URLs at 2 per second duration := time.Since(start) assert.GreaterOrEqual(t, duration, 9*time.Second) }) t.Run("ConcurrencyLimit", func(t *testing.T) { // Create a mutex to protect access to the concurrent counter var mu sync.Mutex var currentConcurrent, maxConcurrent int // Create a test server with a delay to test concurrency server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { // Increment current concurrent counter mu.Lock() currentConcurrent++ if currentConcurrent > maxConcurrent { maxConcurrent = currentConcurrent } mu.Unlock() // Sleep to maintain concurrency time.Sleep(100 * time.Millisecond) // Decrement counter mu.Lock() currentConcurrent-- mu.Unlock() w.WriteHeader(http.StatusOK) })) defer server.Close() // Create a lot of URLs numURLs := 50 urls := make([]string, numURLs) for i := 0; i < numURLs; i++ { urls[i] = 
server.URL } // Create fetcher with specific worker limit but high rate maxWorkers := 5 f := NewFetcher( WithRatePerSecond(100), // High rate to not be rate-limited WithMaxWorkers(maxWorkers), ) // Fetch URLs ctx := context.Background() resultChan := f.FetchURLs(ctx, urls) // Collect results for result := range resultChan { if result.Body != nil { result.Body.Close() } } // Verify the max concurrency was respected assert.LessOrEqual(t, maxConcurrent, maxWorkers) // We should have reached max workers at some point assert.GreaterOrEqual(t, maxConcurrent, maxWorkers-1) }) t.Run("MixedResponses", func(t *testing.T) { // Create a test server with mixed responses server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { // Extract path to determine response path := r.URL.Path if path == "/success" { w.WriteHeader(http.StatusOK) w.Write([]byte("success")) } else if path == "/error" { w.WriteHeader(http.StatusInternalServerError) } else if path == "/toomany" { w.Header().Set("Retry-After", "1") w.WriteHeader(http.StatusTooManyRequests) } else if path == "/slow" { time.Sleep(300 * time.Millisecond) w.WriteHeader(http.StatusOK) w.Write([]byte("slow")) } else { w.WriteHeader(http.StatusNotFound) } })) defer server.Close() // Create URLs urls := []string{ server.URL + "/success", server.URL + "/error", server.URL + "/toomany", server.URL + "/slow", server.URL + "/notfound", } // Create fetcher with quick backoff for testing backoffCfg := backoff.NewExponentialBackOff() backoffCfg.MaxElapsedTime = 500 * time.Millisecond // Short timeout for test f := NewFetcher( WithBackOffConfig(backoffCfg), WithTimeout(1*time.Second), ) // Fetch URLs ctx := context.Background() resultChan := f.FetchURLs(ctx, urls) // Collect results results := make(map[string]struct { body string error bool }) for result := range resultChan { resultData := struct { body string error bool }{body: "", error: result.Error != nil} if result.Body != nil { data, _ := 
io.ReadAll(result.Body) result.Body.Close() resultData.body = string(data) } results[result.Url] = resultData } // Check results successURL := server.URL + "/success" assert.False(t, results[successURL].error) assert.Equal(t, "success", results[successURL].body) errorURL := server.URL + "/error" assert.True(t, results[errorURL].error) tooManyURL := server.URL + "/toomany" assert.True(t, results[tooManyURL].error) slowURL := server.URL + "/slow" assert.False(t, results[slowURL].error) assert.Equal(t, "slow", results[slowURL].body) notFoundURL := server.URL + "/notfound" assert.True(t, results[notFoundURL].error) }) t.Run("EmptyURLList", func(t *testing.T) { f := NewFetcher() ctx := context.Background() resultChan := f.FetchURLs(ctx, []string{}) // Should receive no results count := 0 for range resultChan { count++ } assert.Equal(t, 0, count) }) t.Run("SingleURL", func(t *testing.T) { // Create a test server server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) w.Write([]byte("single")) })) defer server.Close() f := NewFetcher() ctx := context.Background() resultChan := f.FetchURLs(ctx, []string{server.URL}) // Should receive exactly one result count := 0 for result := range resultChan { count++ assert.NoError(t, result.Error) assert.NotNil(t, result.Body) if result.Body != nil { data, err := io.ReadAll(result.Body) result.Body.Close() assert.NoError(t, err) assert.Equal(t, "single", string(data)) } } assert.Equal(t, 1, count) }) t.Run("ContextCancellationDuringFetch", func(t *testing.T) { // Create a test server with delay server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { time.Sleep(200 * time.Millisecond) w.WriteHeader(http.StatusOK) })) defer server.Close() f := NewFetcher() ctx, cancel := context.WithCancel(context.Background()) // Create multiple URLs urls := []string{server.URL, server.URL, server.URL} resultChan := f.FetchURLs(ctx, urls) // Cancel 
// context after a short delay
		go func() {
			time.Sleep(50 * time.Millisecond)
			cancel()
		}()

		// Collect results
		results := 0
		for result := range resultChan {
			results++
			if result.Body != nil {
				result.Body.Close()
			}
		}

		// Cancellation can cut the run short, so only an upper bound is
		// guaranteed: never more results than URLs
		assert.LessOrEqual(t, results, len(urls))
	})
}

// TestFetchErrors tests the FetchError type
func TestFetchErrors(t *testing.T) {
	t.Run("TooManyRequestsError", func(t *testing.T) {
		err := &FetchError{
			TooManyRequests: true,
			RetryAfter:      30,
			StatusCode:      429,
		}

		assert.Contains(t, err.Error(), "30 seconds")
	})

	t.Run("StatusCodeError", func(t *testing.T) {
		err := &FetchError{
			StatusCode: 404,
		}

		assert.Contains(t, err.Error(), "404")
	})
}

// Integration test with a realistic server that randomly returns errors
func TestIntegrationWithRandomErrors(t *testing.T) {
	// Skip in short test mode
	if testing.Short() {
		t.Skip("Skipping integration test in short mode")
	}

	// Create a test server with random behavior
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Seed a per-request generator from the request path so each URL
		// behaves consistently. The global rand.Seed is shared by concurrent
		// handlers and is deprecated since Go 1.20.
		pathSeed := int64(0)
		for _, c := range r.URL.Path {
			pathSeed += int64(c)
		}
		rng := rand.New(rand.NewSource(pathSeed))

		// Random behavior
		randomVal := rng.Intn(100)
		switch {
		case randomVal < 20:
			// 20% chance of error
			w.WriteHeader(http.StatusInternalServerError)
		case randomVal < 30:
			// 10% chance of too many requests
			w.Header().Set("Retry-After", "1")
			w.WriteHeader(http.StatusTooManyRequests)
		case randomVal < 40:
			// 10% chance of slow response
			time.Sleep(200 * time.Millisecond)
			w.WriteHeader(http.StatusOK)
			w.Write([]byte(fmt.Sprintf("slow response for %s", r.URL.Path)))
		default:
			// 60% chance of success
			w.WriteHeader(http.StatusOK)
			w.Write([]byte(fmt.Sprintf("response for %s", r.URL.Path)))
		}
	}))
	defer server.Close()

	// Create a large number of URLs
	numURLs := 30
	urls := make([]string, numURLs)
	for i := 0; i < numURLs; i++ {
		urls[i] = fmt.Sprintf("%s/path%d",
server.URL, i) } // Create fetcher with retry configuration backoffCfg := backoff.NewExponentialBackOff() backoffCfg.MaxElapsedTime = 5 * time.Second backoffCfg.InitialInterval = 100 * time.Millisecond backoffCfg.MaxInterval = 1 * time.Second f := NewFetcher( WithRatePerSecond(10), WithBurst(5), WithMaxWorkers(8), WithBackOffConfig(backoffCfg), WithTimeout(2*time.Second), ) // Fetch URLs ctx := context.Background() resultChan := f.FetchURLs(ctx, urls) // Collect results successCount := 0 errorCount := 0 for result := range resultChan { if result.Error == nil { successCount++ if result.Body != nil { io.Copy(io.Discard, result.Body) // Read the body result.Body.Close() } } else { errorCount++ } } // Verify we got some successes and some errors t.Logf("Success count: %d, Error count: %d", successCount, errorCount) assert.True(t, successCount > 0) assert.True(t, errorCount > 0) assert.Equal(t, numURLs, successCount+errorCount) } // Benchmarks func BenchmarkFetcher(b *testing.B) { // Create a test server server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { w.WriteHeader(http.StatusOK) w.Write([]byte("benchmark response")) })) defer server.Close() b.Run("SingleFetch", func(b *testing.B) { f := NewFetcher() ctx := context.Background() b.ResetTimer() for i := 0; i < b.N; i++ { body, err := f.FetchURL(ctx, server.URL) if err == nil && body != nil { io.Copy(io.Discard, body) body.Close() } } }) b.Run("ConcurrentFetches", func(b *testing.B) { f := NewFetcher( WithRatePerSecond(100), WithMaxWorkers(20), ) ctx := context.Background() b.ResetTimer() for i := 0; i < b.N; i++ { // Create 10 URLs to fetch concurrently numURLs := 10 urls := make([]string, numURLs) for j := 0; j < numURLs; j++ { urls[j] = server.URL } resultChan := f.FetchURLs(ctx, urls) for result := range resultChan { if result.Body != nil { io.Copy(io.Discard, result.Body) result.Body.Close() } } } }) } ================================================ FILE: lib/files.go 
================================================ package lib import ( "context" "fmt" "io" "net/url" "os" "path/filepath" "regexp" "strings" "time" "github.com/PuerkitoBio/goquery" ) // FileInfo represents information about a downloaded file attachment type FileInfo struct { OriginalURL string LocalPath string Filename string Size int64 Success bool Error error } // FileDownloader handles downloading file attachments from Substack posts type FileDownloader struct { fetcher *Fetcher outputDir string filesDir string fileExtensions []string // allowed file extensions, empty means all } // NewFileDownloader creates a new FileDownloader instance func NewFileDownloader(fetcher *Fetcher, outputDir, filesDir string, extensions []string) *FileDownloader { if fetcher == nil { fetcher = NewFetcher() } return &FileDownloader{ fetcher: fetcher, outputDir: outputDir, filesDir: filesDir, fileExtensions: extensions, } } // FileDownloadResult contains the results of downloading file attachments for a post type FileDownloadResult struct { Files []FileInfo UpdatedHTML string Success int Failed int } // FileElement represents a file attachment element with its download URL and local path info type FileElement struct { DownloadURL string LocalPath string Filename string Success bool } // DownloadFiles downloads all file attachments from a post's HTML content and returns updated HTML func (fd *FileDownloader) DownloadFiles(ctx context.Context, htmlContent string, postSlug string) (*FileDownloadResult, error) { // Parse HTML content doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent)) if err != nil { return nil, fmt.Errorf("failed to parse HTML content: %w", err) } // Extract file attachment elements fileElements, err := fd.extractFileElements(doc) if err != nil { return nil, fmt.Errorf("failed to extract file elements: %w", err) } if len(fileElements) == 0 { return &FileDownloadResult{ Files: []FileInfo{}, UpdatedHTML: htmlContent, Success: 0, Failed: 0, }, nil } // 
Create files directory filesPath := filepath.Join(fd.outputDir, fd.filesDir, postSlug) if err := os.MkdirAll(filesPath, 0755); err != nil { return nil, fmt.Errorf("failed to create files directory: %w", err) } // Download files and build URL mapping var files []FileInfo urlToLocalPath := make(map[string]string) for _, element := range fileElements { // Download the file fileInfo := fd.downloadSingleFile(ctx, element.DownloadURL, filesPath) files = append(files, fileInfo) if fileInfo.Success { urlToLocalPath[element.DownloadURL] = fileInfo.LocalPath } } // Update HTML content with local paths updatedHTML := fd.updateHTMLWithLocalPaths(htmlContent, urlToLocalPath) // Count success/failure successCount := 0 failedCount := 0 for _, file := range files { if file.Success { successCount++ } else { failedCount++ } } return &FileDownloadResult{ Files: files, UpdatedHTML: updatedHTML, Success: successCount, Failed: failedCount, }, nil } // extractFileElements finds all file attachment elements in the HTML using the CSS selector func (fd *FileDownloader) extractFileElements(doc *goquery.Document) ([]FileElement, error) { var elements []FileElement doc.Find(".file-embed-button.wide").Each(func(i int, s *goquery.Selection) { href, exists := s.Attr("href") if !exists || href == "" { return } // Parse and validate URL fileURL, err := url.Parse(href) if err != nil { return } // Make sure it's an absolute URL if !fileURL.IsAbs() { return } // Extract filename from URL filename := fd.extractFilenameFromURL(href) if filename == "" { // Generate filename if we can't extract one filename = fmt.Sprintf("attachment_%d", i+1) } // Check file extension filter if specified if len(fd.fileExtensions) > 0 && !fd.isAllowedExtension(filename) { return } elements = append(elements, FileElement{ DownloadURL: href, Filename: filename, }) }) return elements, nil } // extractFilenameFromURL attempts to extract a filename from a URL func (fd *FileDownloader) extractFilenameFromURL(downloadURL string) 
string { parsed, err := url.Parse(downloadURL) if err != nil { return "" } // Try to get filename from path using URL-safe path handling path := parsed.Path if path != "" && path != "/" { // Use strings.LastIndex to find the last segment in a cross-platform way // This avoids issues with filepath.Base on different operating systems lastSlash := strings.LastIndex(path, "/") if lastSlash >= 0 && lastSlash < len(path)-1 { filename := path[lastSlash+1:] if filename != "" && filename != "." { return filename } } } // Try to get filename from query parameters (common in some download links) if filename := parsed.Query().Get("filename"); filename != "" { return filename } return "" } // isAllowedExtension checks if a filename has an allowed extension func (fd *FileDownloader) isAllowedExtension(filename string) bool { if len(fd.fileExtensions) == 0 { return true // Allow all if no filter specified } ext := strings.ToLower(filepath.Ext(filename)) if ext != "" && ext[0] == '.' { ext = ext[1:] // Remove the dot } for _, allowedExt := range fd.fileExtensions { if strings.ToLower(allowedExt) == ext { return true } } return false } // downloadSingleFile downloads a single file and returns FileInfo func (fd *FileDownloader) downloadSingleFile(ctx context.Context, downloadURL, filesPath string) FileInfo { // Extract filename filename := fd.extractFilenameFromURL(downloadURL) if filename == "" { // Generate a safe filename based on URL filename = fd.generateSafeFilename(downloadURL) } // Ensure filename is safe for filesystem filename = fd.sanitizeFilename(filename) localPath := filepath.Join(filesPath, filename) // Check if file already exists if _, err := os.Stat(localPath); err == nil { return FileInfo{ OriginalURL: downloadURL, LocalPath: localPath, Filename: filename, Size: 0, Success: true, Error: nil, } } // Download the file resp, err := fd.fetcher.FetchURL(ctx, downloadURL) if err != nil { return FileInfo{ OriginalURL: downloadURL, LocalPath: localPath, Filename: 
filename,
			Size:        0,
			Success:     false,
			Error:       err,
		}
	}
	defer resp.Close()

	// Create the file
	file, err := os.Create(localPath)
	if err != nil {
		return FileInfo{
			OriginalURL: downloadURL,
			LocalPath:   localPath,
			Filename:    filename,
			Size:        0,
			Success:     false,
			Error:       err,
		}
	}
	defer file.Close()

	// Copy file contents
	size, err := io.Copy(file, resp)
	if err != nil {
		// Remove the partially downloaded file. Close it first: removing a
		// file that is still open can fail on Windows.
		file.Close()
		os.Remove(localPath)
		return FileInfo{
			OriginalURL: downloadURL,
			LocalPath:   localPath,
			Filename:    filename,
			Size:        0,
			Success:     false,
			Error:       err,
		}
	}

	return FileInfo{
		OriginalURL: downloadURL,
		LocalPath:   localPath,
		Filename:    filename,
		Size:        size,
		Success:     true,
		Error:       nil,
	}
}

// generateSafeFilename generates a fallback filename from a URL
func (fd *FileDownloader) generateSafeFilename(downloadURL string) string {
	// Combine a timestamp with a hex prefix of the URL bytes. The prefix is a
	// plain hex encoding of the first four bytes, not a hash, so URLs that
	// share a scheme share it; the timestamp provides most of the uniqueness.
	timestamp := time.Now().Unix()
	urlHex := fmt.Sprintf("%x", []byte(downloadURL))
	if len(urlHex) > 8 {
		// Guard the slice: an unconditional [:8] panics on inputs shorter than 4 bytes
		urlHex = urlHex[:8]
	}
	return fmt.Sprintf("file_%d_%s", timestamp, urlHex)
}

// sanitizeFilename removes or replaces unsafe characters in filenames
func (fd *FileDownloader) sanitizeFilename(filename string) string {
	// Replace unsafe characters with underscores
	unsafe := regexp.MustCompile(`[<>:"/\\|?*]`)
	safe := unsafe.ReplaceAllString(filename, "_")

	// Remove leading/trailing spaces and dots
	safe = strings.Trim(safe, " .")

	// Ensure it's not empty
	if safe == "" {
		safe = "attachment"
	}

	// Limit length
	if len(safe) > 200 {
		safe = safe[:200]
	}

	return safe
}

// updateHTMLWithLocalPaths updates the HTML content to reference local file paths
func (fd *FileDownloader) updateHTMLWithLocalPaths(htmlContent string, urlToLocalPath map[string]string) string {
	updatedHTML := htmlContent

	for originalURL, localPath := range urlToLocalPath {
		// Convert absolute local path to relative path from the post file location
		relativePath := fd.makeRelativePath(localPath)

		// Replace the href attribute in file-embed-button links
		oldPattern := fmt.Sprintf(`href="%s"`,
regexp.QuoteMeta(originalURL)) newPattern := fmt.Sprintf(`href="%s"`, relativePath) updatedHTML = regexp.MustCompile(oldPattern).ReplaceAllString(updatedHTML, newPattern) // Also handle single quotes oldPatternSingle := fmt.Sprintf(`href='%s'`, regexp.QuoteMeta(originalURL)) newPatternSingle := fmt.Sprintf(`href='%s'`, relativePath) updatedHTML = regexp.MustCompile(oldPatternSingle).ReplaceAllString(updatedHTML, newPatternSingle) } return updatedHTML } // makeRelativePath converts an absolute local path to a relative path from the post location func (fd *FileDownloader) makeRelativePath(localPath string) string { // Get the relative path from the output directory relPath, err := filepath.Rel(fd.outputDir, localPath) if err != nil { // If we can't make it relative, just use the filename return filepath.Base(localPath) } // Convert to forward slashes for web compatibility return filepath.ToSlash(relPath) } ================================================ FILE: lib/files_test.go ================================================ package lib import ( "context" "fmt" "net/http" "net/http/httptest" "os" "path/filepath" "strings" "testing" "time" "github.com/PuerkitoBio/goquery" "github.com/stretchr/testify/assert" "github.com/stretchr/testify/require" ) // Test file data - a simple text file content var testFileData = []byte("This is a test file content for file attachment download testing.") // createTestFileServer creates a test server that serves test files func createTestFileServer() *httptest.Server { return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { path := r.URL.Path switch { case strings.Contains(path, "success"): w.Header().Set("Content-Type", "application/octet-stream") w.Header().Set("Content-Disposition", "attachment; filename=\"test-file.pdf\"") w.WriteHeader(http.StatusOK) w.Write(testFileData) case strings.Contains(path, "document.pdf"): w.Header().Set("Content-Type", "application/pdf") w.WriteHeader(http.StatusOK) 
w.Write(testFileData) case strings.Contains(path, "spreadsheet.xlsx"): w.Header().Set("Content-Type", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet") w.WriteHeader(http.StatusOK) w.Write(testFileData) case strings.Contains(path, "not-found"): w.WriteHeader(http.StatusNotFound) case strings.Contains(path, "server-error"): w.WriteHeader(http.StatusInternalServerError) case strings.Contains(path, "timeout"): // Don't respond to simulate timeout - but add a timeout to prevent hanging select { case <-time.After(5 * time.Second): w.WriteHeader(http.StatusRequestTimeout) } case strings.Contains(path, "with-query"): // Handle URLs with filename in query parameter filename := r.URL.Query().Get("filename") if filename != "" { w.Header().Set("Content-Disposition", fmt.Sprintf("attachment; filename=\"%s\"", filename)) } w.Header().Set("Content-Type", "application/octet-stream") w.WriteHeader(http.StatusOK) w.Write(testFileData) default: w.Header().Set("Content-Type", "application/octet-stream") w.WriteHeader(http.StatusOK) w.Write(testFileData) } })) } // createTestHTMLWithFiles creates HTML content with file attachment links func createTestHTMLWithFiles(baseURL string) string { return fmt.Sprintf(` Test Post with Files

Test Post with File Attachments

📄
Download PDF Document
📊
Download Excel Spreadsheet
Download Report
Missing File
Should not be detected
Should not be detected either
`, baseURL, baseURL, baseURL, baseURL, baseURL, baseURL) } // TestNewFileDownloader tests the creation of FileDownloader func TestNewFileDownloader(t *testing.T) { t.Run("WithFetcher", func(t *testing.T) { fetcher := NewFetcher() extensions := []string{"pdf", "docx"} downloader := NewFileDownloader(fetcher, "/tmp", "files", extensions) assert.Equal(t, fetcher, downloader.fetcher) assert.Equal(t, "/tmp", downloader.outputDir) assert.Equal(t, "files", downloader.filesDir) assert.Equal(t, extensions, downloader.fileExtensions) }) t.Run("WithoutFetcher", func(t *testing.T) { extensions := []string{"xlsx"} downloader := NewFileDownloader(nil, "/tmp", "attachments", extensions) assert.NotNil(t, downloader.fetcher) assert.Equal(t, "/tmp", downloader.outputDir) assert.Equal(t, "attachments", downloader.filesDir) assert.Equal(t, extensions, downloader.fileExtensions) }) t.Run("NoExtensions", func(t *testing.T) { downloader := NewFileDownloader(nil, "/output", "files", nil) assert.NotNil(t, downloader.fetcher) assert.Equal(t, "/output", downloader.outputDir) assert.Equal(t, "files", downloader.filesDir) assert.Nil(t, downloader.fileExtensions) }) } // TestExtractFileElements tests file element extraction from HTML func TestExtractFileElements(t *testing.T) { // Create test server server := createTestFileServer() defer server.Close() t.Run("SuccessfulExtraction", func(t *testing.T) { downloader := NewFileDownloader(nil, "/tmp", "files", nil) htmlContent := createTestHTMLWithFiles(server.URL) doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent)) require.NoError(t, err) elements, err := downloader.extractFileElements(doc) require.NoError(t, err) // Should find 4 valid file elements (only .file-embed-button.wide) assert.Len(t, elements, 4) // Verify URLs expectedURLs := []string{ server.URL + "/document.pdf", server.URL + "/spreadsheet.xlsx", server.URL + "/with-query?filename=report.docx&id=123", server.URL + "/not-found.pdf", } actualURLs := make([]string, 
len(elements)) for i, elem := range elements { actualURLs[i] = elem.DownloadURL } assert.ElementsMatch(t, expectedURLs, actualURLs) }) t.Run("WithExtensionFilter", func(t *testing.T) { // Only allow PDF files downloader := NewFileDownloader(nil, "/tmp", "files", []string{"pdf"}) htmlContent := createTestHTMLWithFiles(server.URL) doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent)) require.NoError(t, err) elements, err := downloader.extractFileElements(doc) require.NoError(t, err) // Should find only 2 PDF files assert.Len(t, elements, 2) for _, elem := range elements { assert.True(t, strings.Contains(elem.DownloadURL, ".pdf")) } }) t.Run("NoFileElements", func(t *testing.T) { downloader := NewFileDownloader(nil, "/tmp", "files", nil) htmlContent := "

No file attachments here

" doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent)) require.NoError(t, err) elements, err := downloader.extractFileElements(doc) require.NoError(t, err) assert.Len(t, elements, 0) }) t.Run("InvalidURLs", func(t *testing.T) { downloader := NewFileDownloader(nil, "/tmp", "files", nil) // HTML with invalid URLs htmlContent := ` Empty href Relative URL Invalid URL ` doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent)) require.NoError(t, err) elements, err := downloader.extractFileElements(doc) require.NoError(t, err) // Should find no valid elements assert.Len(t, elements, 0) }) } // TestExtractFilenameFromURL tests filename extraction from URLs func TestExtractFilenameFromURL(t *testing.T) { downloader := NewFileDownloader(nil, "/tmp", "files", nil) tests := []struct { name string url string expected string }{ { name: "SimpleFilename", url: "https://example.com/document.pdf", expected: "document.pdf", }, { name: "FilenameWithPath", url: "https://example.com/files/reports/annual-report.xlsx", expected: "annual-report.xlsx", }, { name: "FilenameInQueryParam", url: "https://example.com/?filename=my-file.docx&id=123", expected: "my-file.docx", }, { name: "NoFilename", url: "https://example.com/", expected: "", }, { name: "InvalidURL", url: "://invalid-url", expected: "", }, { name: "OnlyPath", url: "https://example.com/download", expected: "download", }, } for _, test := range tests { t.Run(test.name, func(t *testing.T) { result := downloader.extractFilenameFromURL(test.url) assert.Equal(t, test.expected, result) }) } } // TestIsAllowedExtension tests file extension filtering func TestIsAllowedExtension(t *testing.T) { tests := []struct { name string extensions []string filename string expected bool }{ { name: "NoFilter", extensions: nil, filename: "document.pdf", expected: true, }, { name: "EmptyFilter", extensions: []string{}, filename: "document.pdf", expected: true, }, { name: "AllowedExtension", extensions: []string{"pdf", 
"docx"},
			filename:   "document.pdf",
			expected:   true,
		},
		{
			name:       "DisallowedExtension",
			extensions: []string{"pdf", "docx"},
			filename:   "image.jpg",
			expected:   false,
		},
		{
			name:       "CaseInsensitive",
			extensions: []string{"PDF", "DOCX"},
			filename:   "document.pdf",
			expected:   true,
		},
		{
			name:       "NoExtension",
			extensions: []string{"pdf"},
			filename:   "README",
			expected:   false,
		},
		{
			name:       "ExtensionWithDot",
			extensions: []string{".pdf", "docx"},
			filename:   "document.pdf",
			expected:   false, // the filter entry ".pdf" keeps its dot, while the file's extension is compared as "pdf"
		},
	}

	for _, test := range tests {
		t.Run(test.name, func(t *testing.T) {
			downloader := NewFileDownloader(nil, "/tmp", "files", test.extensions)
			result := downloader.isAllowedExtension(test.filename)
			assert.Equal(t, test.expected, result)
		})
	}
}

// TestSanitizeFilename tests filename sanitization
func TestSanitizeFilename(t *testing.T) {
	downloader := NewFileDownloader(nil, "/tmp", "files", nil)

	tests := []struct {
		name     string
		filename string
		expected string
	}{
		{
			name:     "SafeFilename",
			filename: "document.pdf",
			expected: "document.pdf",
		},
		{
			name:     "UnsafeCharacters",
			filename: "my<file>name.pdf",
			expected: "my_file_name.pdf",
		},
		{
			name:     "AllUnsafeCharacters",
			filename: `file<>:"/\|?*.txt`,
			expected: "file_________.txt", // 9 unsafe chars replaced with _
		},
		{
			name:     "LeadingTrailingSpaces",
			filename: " document.pdf ",
			expected: "document.pdf",
		},
		{
			name:     "LeadingTrailingDots",
			filename: "..document.pdf..",
			expected: "document.pdf",
		},
		{
			name:     "EmptyAfterSanitization",
			filename: " ... ", // Should become empty after trimming spaces and dots
			expected: "attachment",
		},
		{
			name:     "VeryLongFilename",
			filename: strings.Repeat("a", 250) + ".pdf",
			expected: strings.Repeat("a", 250)[:200], // Should be truncated to 200 chars total
		},
	}

	for _, test := range tests {
		t.Run(test.name, func(t *testing.T) {
			result := downloader.sanitizeFilename(test.filename)
			assert.Equal(t, test.expected, result)
			assert.LessOrEqual(t, len(result), 200, "Filename should not exceed 200 characters")
		})
	}
}

// TestGenerateSafeFilenameForFiles tests safe filename generation for files
func TestGenerateSafeFilenameForFiles(t *testing.T) {
	downloader := NewFileDownloader(nil, "/tmp", "files", nil)

	// Test that it generates unique filenames (use very different prefixes)
	url1 := "abcdef123456" // Produces a different hex prefix than url2
	url2 := "zyxwvu987654" // Produces a different hex prefix than url1

	filename1 := downloader.generateSafeFilename(url1)
	// The timestamp has one-second resolution, so this short delay does not
	// change it; the names differ because the URL prefixes differ
	time.Sleep(1 * time.Millisecond)
	filename2 := downloader.generateSafeFilename(url2)

	assert.NotEqual(t, filename1, filename2, "Should generate different filenames for different URLs")
	assert.Contains(t, filename1, "file_", "Should contain file_ prefix")
	assert.Contains(t, filename2, "file_", "Should contain file_ prefix")

	// Test with same URL multiple times (should be different due to timestamp)
	time.Sleep(1001 * time.Millisecond) // Ensure different timestamp (at least 1 second difference)
	filename3 := downloader.generateSafeFilename(url1)
	assert.NotEqual(t, filename1, filename3, "Should generate different filenames due to timestamp")
}

// TestDownloadSingleFile tests individual file downloading
func TestDownloadSingleFile(t *testing.T) {
	// Create test server
	server := createTestFileServer()
	defer server.Close()

	// Create temporary directory
	tempDir, err := os.MkdirTemp("", "single-file-test-*")
	require.NoError(t, err)
	defer os.RemoveAll(tempDir)

	downloader := NewFileDownloader(nil, tempDir, "files", nil)
	ctx :=
context.Background() t.Run("SuccessfulDownload", func(t *testing.T) { fileURL := server.URL + "/document.pdf" filesPath := filepath.Join(tempDir, "test-post") // Create the directory first err := os.MkdirAll(filesPath, 0755) require.NoError(t, err) fileInfo := downloader.downloadSingleFile(ctx, fileURL, filesPath) assert.True(t, fileInfo.Success) assert.NoError(t, fileInfo.Error) assert.Equal(t, fileURL, fileInfo.OriginalURL) assert.NotEmpty(t, fileInfo.LocalPath) assert.Equal(t, "document.pdf", fileInfo.Filename) assert.Equal(t, int64(len(testFileData)), fileInfo.Size) // Check file exists _, statErr := os.Stat(fileInfo.LocalPath) assert.NoError(t, statErr) // Check file content data, err := os.ReadFile(fileInfo.LocalPath) assert.NoError(t, err) assert.Equal(t, testFileData, data) }) t.Run("FileAlreadyExists", func(t *testing.T) { fileURL := server.URL + "/existing.pdf" filesPath := filepath.Join(tempDir, "existing-test") // Create the directory and file first err := os.MkdirAll(filesPath, 0755) require.NoError(t, err) existingFile := filepath.Join(filesPath, "existing.pdf") err = os.WriteFile(existingFile, []byte("existing content"), 0644) require.NoError(t, err) fileInfo := downloader.downloadSingleFile(ctx, fileURL, filesPath) assert.True(t, fileInfo.Success) assert.NoError(t, fileInfo.Error) assert.Equal(t, fileURL, fileInfo.OriginalURL) assert.Equal(t, existingFile, fileInfo.LocalPath) // File should still contain original content (not downloaded again) data, err := os.ReadFile(existingFile) assert.NoError(t, err) assert.Equal(t, []byte("existing content"), data) }) t.Run("NotFound", func(t *testing.T) { fileURL := server.URL + "/not-found.pdf" filesPath := filepath.Join(tempDir, "not-found-test") // Create the directory first err := os.MkdirAll(filesPath, 0755) require.NoError(t, err) fileInfo := downloader.downloadSingleFile(ctx, fileURL, filesPath) assert.False(t, fileInfo.Success) assert.Error(t, fileInfo.Error) assert.Equal(t, fileURL, 
fileInfo.OriginalURL) assert.Equal(t, "not-found.pdf", fileInfo.Filename) }) t.Run("ServerError", func(t *testing.T) { fileURL := server.URL + "/server-error.pdf" filesPath := filepath.Join(tempDir, "server-error-test") // Create the directory first err := os.MkdirAll(filesPath, 0755) require.NoError(t, err) fileInfo := downloader.downloadSingleFile(ctx, fileURL, filesPath) assert.False(t, fileInfo.Success) assert.Error(t, fileInfo.Error) }) t.Run("FilenameFromQuery", func(t *testing.T) { fileURL := server.URL + "/with-query?filename=report.docx&id=123" filesPath := filepath.Join(tempDir, "query-test") // Create the directory first err := os.MkdirAll(filesPath, 0755) require.NoError(t, err) fileInfo := downloader.downloadSingleFile(ctx, fileURL, filesPath) assert.True(t, fileInfo.Success) assert.NoError(t, fileInfo.Error) // The filename should come from the path (with-query), not query param since path takes precedence assert.Equal(t, "with-query", fileInfo.Filename) // Check file exists with correct name expectedPath := filepath.Join(filesPath, "with-query") assert.Equal(t, expectedPath, fileInfo.LocalPath) _, statErr := os.Stat(expectedPath) assert.NoError(t, statErr) }) t.Run("FilenameFromPath", func(t *testing.T) { fileURL := server.URL + "/no-filename-in-path" filesPath := filepath.Join(tempDir, "path-test") // Create the directory first err := os.MkdirAll(filesPath, 0755) require.NoError(t, err) fileInfo := downloader.downloadSingleFile(ctx, fileURL, filesPath) assert.True(t, fileInfo.Success) assert.NoError(t, fileInfo.Error) // The filename should come from the path (no-filename-in-path) assert.Equal(t, "no-filename-in-path", fileInfo.Filename) }) t.Run("GeneratedFilename", func(t *testing.T) { // Use a URL with just / to trigger generated filename fileURL := server.URL + "/" filesPath := filepath.Join(tempDir, "generated-test") // Create the directory first err := os.MkdirAll(filesPath, 0755) require.NoError(t, err) fileInfo := 
downloader.downloadSingleFile(ctx, fileURL, filesPath)

		assert.True(t, fileInfo.Success)
		assert.NoError(t, fileInfo.Error)
		// Should use generated filename pattern
		assert.Contains(t, fileInfo.Filename, "file_")
	})
}

// TestMakeRelativePath tests relative path conversion
func TestMakeRelativePath(t *testing.T) {
	downloader := NewFileDownloader(nil, "/output", "files", nil)

	tests := []struct {
		name      string
		localPath string
		expected  string
	}{
		{
			name:      "NormalPath",
			localPath: "/output/files/post/document.pdf",
			expected:  "files/post/document.pdf",
		},
		{
			name:      "WindowsPath",
			localPath: "/output/files/post/report.xlsx",
			expected:  "files/post/report.xlsx",
		},
	}

	for _, test := range tests {
		t.Run(test.name, func(t *testing.T) {
			result := downloader.makeRelativePath(test.localPath)
			assert.Equal(t, test.expected, result)
		})
	}
}

// TestUpdateHTMLWithLocalPathsForFiles tests HTML content updating for files
func TestUpdateHTMLWithLocalPathsForFiles(t *testing.T) {
	downloader := NewFileDownloader(nil, "/output", "files", nil)

	// The anchor markup was lost in extraction; the href values and quoting
	// below are reconstructed from the URL map and the assertions that follow
	originalHTML := `
	<a class="file-embed-button wide" href="https://example.com/document.pdf">PDF Document</a>
	<a class="file-embed-button wide" href='https://example.com/spreadsheet.xlsx'>Excel File</a>
	<a class="file-embed-button wide" href="https://example.com/document.pdf">Same PDF Again</a>
	`

	urlToLocalPath := map[string]string{
		"https://example.com/document.pdf":     filepath.Join("/output", "files", "post", "document.pdf"),
		"https://example.com/spreadsheet.xlsx": filepath.Join("/output", "files", "post", "spreadsheet.xlsx"),
	}

	updatedHTML := downloader.updateHTMLWithLocalPaths(originalHTML, urlToLocalPath)

	// Check that URLs were replaced
	assert.Contains(t, updatedHTML, `href="files/post/document.pdf"`)
	assert.Contains(t, updatedHTML, `href='files/post/spreadsheet.xlsx'`)
	assert.NotContains(t, updatedHTML, "https://example.com/")

	// Check that duplicate URLs were replaced
	assert.Equal(t, 2, strings.Count(updatedHTML, "files/post/document.pdf"))
}

// TestDownloadFiles tests the complete file downloading workflow
func TestDownloadFiles(t *testing.T) {
	// Create test server
	server := createTestFileServer()
	defer server.Close()

	// Create temporary directory
	tempDir, err := os.MkdirTemp("",
"file-download-test-*") require.NoError(t, err) defer os.RemoveAll(tempDir) // Create downloader downloader := NewFileDownloader(nil, tempDir, "files", nil) t.Run("SuccessfulDownload", func(t *testing.T) { htmlContent := createTestHTMLWithFiles(server.URL) ctx := context.Background() result, err := downloader.DownloadFiles(ctx, htmlContent, "test-post") require.NoError(t, err) // Check results assert.Greater(t, result.Success, 0, "Should have successful downloads") assert.Greater(t, result.Failed, 0, "Should have failed downloads (not-found file)") assert.Greater(t, len(result.Files), 0, "Should have file info") // Check that files directory was created filesDir := filepath.Join(tempDir, "files", "test-post") _, err = os.Stat(filesDir) assert.NoError(t, err, "Files directory should exist") // Check that some files were downloaded files, err := os.ReadDir(filesDir) assert.NoError(t, err) assert.Greater(t, len(files), 0, "Should have downloaded files") // Check that HTML was updated assert.NotEqual(t, htmlContent, result.UpdatedHTML, "HTML should be updated") assert.Contains(t, result.UpdatedHTML, "files/test-post/", "HTML should contain local file paths") // Verify specific file was downloaded var pdfFound bool for _, file := range result.Files { if strings.Contains(file.OriginalURL, "document.pdf") && file.Success { pdfFound = true assert.Equal(t, "document.pdf", file.Filename) assert.Greater(t, file.Size, int64(0)) // Verify file content data, err := os.ReadFile(file.LocalPath) assert.NoError(t, err) assert.Equal(t, testFileData, data) } } assert.True(t, pdfFound, "Should have successfully downloaded PDF file") }) t.Run("WithExtensionFilter", func(t *testing.T) { // Only allow PDF files pdfDownloader := NewFileDownloader(nil, tempDir, "pdf-files", []string{"pdf"}) htmlContent := createTestHTMLWithFiles(server.URL) ctx := context.Background() result, err := pdfDownloader.DownloadFiles(ctx, htmlContent, "pdf-test") require.NoError(t, err) // Should only process PDF 
files pdfCount := 0 for _, file := range result.Files { if strings.HasSuffix(file.Filename, ".pdf") { pdfCount++ } } assert.Equal(t, 2, pdfCount, "Should find exactly 2 PDF files") assert.Equal(t, 2, len(result.Files), "Should only process PDF files due to filter") }) t.Run("NoFiles", func(t *testing.T) { htmlContent := "

No file attachments here

" ctx := context.Background() result, err := downloader.DownloadFiles(ctx, htmlContent, "no-files-post") require.NoError(t, err) assert.Equal(t, 0, result.Success) assert.Equal(t, 0, result.Failed) assert.Equal(t, 0, len(result.Files)) assert.Equal(t, htmlContent, result.UpdatedHTML) }) t.Run("EmptyHTML", func(t *testing.T) { emptyHTML := "" ctx := context.Background() result, err := downloader.DownloadFiles(ctx, emptyHTML, "empty-post") require.NoError(t, err) assert.Equal(t, 0, result.Success) assert.Equal(t, 0, result.Failed) assert.Equal(t, 0, len(result.Files)) assert.Equal(t, emptyHTML, result.UpdatedHTML) }) t.Run("InvalidHTML", func(t *testing.T) { invalidHTML := "not valid html <<<" ctx := context.Background() // Should still work with invalid HTML due to goquery's tolerance result, err := downloader.DownloadFiles(ctx, invalidHTML, "invalid-post") require.NoError(t, err) assert.Equal(t, 0, result.Success) assert.Equal(t, 0, result.Failed) assert.Equal(t, 0, len(result.Files)) }) } // TestFileDownloadErrorScenarios tests various error conditions func TestFileDownloadErrorScenarios(t *testing.T) { // Create test server server := createTestFileServer() defer server.Close() // Create temporary directory tempDir, err := os.MkdirTemp("", "error-test-*") require.NoError(t, err) defer os.RemoveAll(tempDir) downloader := NewFileDownloader(nil, tempDir, "files", nil) ctx := context.Background() t.Run("ContextCancellation", func(t *testing.T) { // Create context with immediate cancellation cancelCtx, cancel := context.WithCancel(context.Background()) cancel() // Cancel immediately fileURL := server.URL + "/document.pdf" filesPath := filepath.Join(tempDir, "cancel-test") fileInfo := downloader.downloadSingleFile(cancelCtx, fileURL, filesPath) assert.False(t, fileInfo.Success) assert.Error(t, fileInfo.Error) assert.Contains(t, fileInfo.Error.Error(), "context") }) t.Run("FileSystemError", func(t *testing.T) { // Create a read-only directory to cause file creation to 
fail readOnlyDir := filepath.Join(tempDir, "readonly") err := os.MkdirAll(readOnlyDir, 0755) require.NoError(t, err) // Make directory read-only (may not work on all filesystems) err = os.Chmod(readOnlyDir, 0444) require.NoError(t, err) // Restore permissions for cleanup defer os.Chmod(readOnlyDir, 0755) fileURL := server.URL + "/document.pdf" fileInfo := downloader.downloadSingleFile(ctx, fileURL, readOnlyDir) // This test may pass on some filesystems that ignore permission restrictions // for the same user, so we just verify the attempt was made if fileInfo.Error != nil { assert.False(t, fileInfo.Success) assert.Error(t, fileInfo.Error) } else { // If no error occurred (e.g., on some filesystems), just log it t.Logf("Note: Filesystem doesn't enforce directory permissions as expected") assert.True(t, fileInfo.Success) } }) t.Run("DirectoryCreationError", func(t *testing.T) { // Try to create files directory where a file already exists invalidDir := filepath.Join(tempDir, "invalid-dir") // Create a file with the same name as the directory we'll try to create err := os.WriteFile(invalidDir, []byte("blocking file"), 0644) require.NoError(t, err) invalidDownloader := NewFileDownloader(nil, invalidDir, "files", nil) htmlContent := createTestHTMLWithFiles(server.URL) _, err = invalidDownloader.DownloadFiles(ctx, htmlContent, "blocked-post") assert.Error(t, err) assert.Contains(t, err.Error(), "failed to create files directory") }) } // TestFileDownloadWithRealSubstackHTML tests with realistic Substack HTML structure func TestFileDownloadWithRealSubstackHTML(t *testing.T) { // Create test server server := createTestFileServer() defer server.Close() // Create temporary directory tempDir, err := os.MkdirTemp("", "real-substack-test-*") require.NoError(t, err) defer os.RemoveAll(tempDir) downloader := NewFileDownloader(nil, tempDir, "attachments", nil) // Realistic Substack HTML structure with file embeds realisticHTML := fmt.Sprintf(`

Here's the quarterly report:

And here's the supporting data:

`, server.URL, server.URL) ctx := context.Background() result, err := downloader.DownloadFiles(ctx, realisticHTML, "financial-report") require.NoError(t, err) // Should successfully download both files assert.Equal(t, 2, result.Success) assert.Equal(t, 0, result.Failed) assert.Len(t, result.Files, 2) // Verify HTML was updated assert.Contains(t, result.UpdatedHTML, "attachments/financial-report/quarterly-report.pdf") assert.Contains(t, result.UpdatedHTML, "attachments/financial-report/supporting-data.xlsx") assert.NotContains(t, result.UpdatedHTML, server.URL) // Verify files exist on disk attachmentsDir := filepath.Join(tempDir, "attachments", "financial-report") files, err := os.ReadDir(attachmentsDir) require.NoError(t, err) assert.Len(t, files, 2) // Verify specific files fileNames := []string{files[0].Name(), files[1].Name()} assert.Contains(t, fileNames, "quarterly-report.pdf") assert.Contains(t, fileNames, "supporting-data.xlsx") } // TestExtractorIntegration tests file download integration with the extractor func TestExtractorIntegration(t *testing.T) { // Create test server server := createTestFileServer() defer server.Close() // Create temporary directory tempDir, err := os.MkdirTemp("", "extractor-integration-test-*") require.NoError(t, err) defer os.RemoveAll(tempDir) // Create a mock post with file attachments post := Post{ Id: 123, Slug: "test-post-with-files", Title: "Test Post with File Attachments", BodyHTML: createTestHTMLWithFiles(server.URL), } // Create fetcher for the extractor fetcher := NewFetcher() // Test file download through WriteToFileWithImages outputPath := filepath.Join(tempDir, "test-post.html") filesPath := "attachments" imageDownloadResult, err := post.WriteToFileWithImages( context.Background(), outputPath, "html", false, // addSourceURL false, // downloadImages ImageQualityHigh, // imageQuality "", // imagesDir (not used when downloadImages is false) true, // downloadFiles nil, // fileExtensions (no filter) filesPath, // 
filesDir fetcher, // fetcher ) require.NoError(t, err) require.NotNil(t, imageDownloadResult) // Check that the image result is available (files are not reported in image result) // We'll verify file downloads through the file system // Check that the HTML file was created _, err = os.Stat(outputPath) assert.NoError(t, err, "HTML file should be created") // Check that files directory was created filesDir := filepath.Join(tempDir, filesPath, post.Slug) _, err = os.Stat(filesDir) assert.NoError(t, err, "Files directory should be created") // Check that some files were actually downloaded files, err := os.ReadDir(filesDir) require.NoError(t, err) assert.Greater(t, len(files), 0, "Should have actual downloaded files") // Read the HTML file and verify URLs were replaced htmlContent, err := os.ReadFile(outputPath) require.NoError(t, err) htmlStr := string(htmlContent) assert.Contains(t, htmlStr, fmt.Sprintf("%s/%s/", filesPath, post.Slug), "HTML should contain local file paths") // Check that successfully downloaded files had their URLs replaced assert.Contains(t, htmlStr, "attachments/test-post-with-files/document.pdf", "PDF file URL should be replaced") assert.Contains(t, htmlStr, "attachments/test-post-with-files/spreadsheet.xlsx", "XLSX file URL should be replaced") assert.Contains(t, htmlStr, "attachments/test-post-with-files/with-query", "Query file URL should be replaced") // URLs that weren't downloadable or detectable should remain as original // (not-found.pdf and files that don't match CSS selector) // Verify specific file types were downloaded var pdfFound, xlsxFound bool for _, file := range files { if strings.HasSuffix(file.Name(), ".pdf") { pdfFound = true } if strings.HasSuffix(file.Name(), ".xlsx") { xlsxFound = true } } assert.True(t, pdfFound, "Should have downloaded PDF file") assert.True(t, xlsxFound, "Should have downloaded XLSX file") } // TestExtractorIntegrationWithFiltering tests file download with extension filtering through extractor func 
TestExtractorIntegrationWithFiltering(t *testing.T) { // Create test server server := createTestFileServer() defer server.Close() // Create temporary directory tempDir, err := os.MkdirTemp("", "extractor-filtering-test-*") require.NoError(t, err) defer os.RemoveAll(tempDir) // Create a mock post with file attachments post := Post{ Id: 456, Slug: "filtered-post", Title: "Post with Filtered Files", BodyHTML: createTestHTMLWithFiles(server.URL), } // Create fetcher for the extractor fetcher := NewFetcher() // Test file download with extension filtering (only PDF files) outputPath := filepath.Join(tempDir, "filtered-post.html") filesPath := "documents" imageDownloadResult, err := post.WriteToFileWithImages( context.Background(), outputPath, "html", false, // addSourceURL false, // downloadImages ImageQualityHigh, // imageQuality "", // imagesDir (not used when downloadImages is false) true, // downloadFiles []string{"pdf"}, // fileExtensions - only PDF files filesPath, // filesDir fetcher, // fetcher ) require.NoError(t, err) require.NotNil(t, imageDownloadResult) // Check that the integration worked (files are not reported in image result) // We'll verify file downloads through the file system // Check that files directory was created filesDir := filepath.Join(tempDir, filesPath, post.Slug) _, err = os.Stat(filesDir) assert.NoError(t, err, "Files directory should be created") // Check that only PDF files were downloaded files, err := os.ReadDir(filesDir) require.NoError(t, err) assert.Greater(t, len(files), 0, "Should have downloaded files") // Verify only PDF files were downloaded for _, file := range files { assert.True(t, strings.HasSuffix(file.Name(), ".pdf"), "Only PDF files should be downloaded, found: %s", file.Name()) } // Should be fewer files than the unfiltered test assert.LessOrEqual(t, len(files), 2, "Should have fewer files due to filtering") } // Benchmark tests func BenchmarkExtractFileElements(b *testing.B) { server := createTestFileServer() defer 
server.Close() downloader := NewFileDownloader(nil, "/tmp", "files", nil) htmlContent := createTestHTMLWithFiles(server.URL) doc, _ := goquery.NewDocumentFromReader(strings.NewReader(htmlContent)) b.ResetTimer() for i := 0; i < b.N; i++ { downloader.extractFileElements(doc) } } func BenchmarkSanitizeFilename(b *testing.B) { downloader := NewFileDownloader(nil, "/tmp", "files", nil) filename := "myname/with\\many|bad?chars*.pdf" b.ResetTimer() for i := 0; i < b.N; i++ { downloader.sanitizeFilename(filename) } } ================================================ FILE: lib/images.go ================================================ package lib import ( "context" "encoding/json" "fmt" "io" "net/url" "os" "path/filepath" "regexp" "strconv" "strings" "github.com/PuerkitoBio/goquery" ) // ImageQuality represents the quality level for image downloads type ImageQuality string const ( ImageQualityHigh ImageQuality = "high" // 1456w ImageQualityMedium ImageQuality = "medium" // 848w ImageQualityLow ImageQuality = "low" // 424w ) // ImageInfo contains information about a downloaded image type ImageInfo struct { OriginalURL string LocalPath string Width int Height int Format string Success bool Error error } // ImageDownloader handles downloading and processing images from Substack posts type ImageDownloader struct { fetcher *Fetcher outputDir string imagesDir string imageQuality ImageQuality } // NewImageDownloader creates a new ImageDownloader instance func NewImageDownloader(fetcher *Fetcher, outputDir, imagesDir string, quality ImageQuality) *ImageDownloader { if fetcher == nil { fetcher = NewFetcher() } return &ImageDownloader{ fetcher: fetcher, outputDir: outputDir, imagesDir: imagesDir, imageQuality: quality, } } // ImageDownloadResult contains the results of downloading images for a post type ImageDownloadResult struct { Images []ImageInfo UpdatedHTML string Success int Failed int } // ImageElement represents an image element with all its URLs type ImageElement struct { 
BestURL string // The URL to download (highest quality) AllURLs []string // All URLs that should be replaced with the local path LocalPath string // Local path after download Success bool // Whether download was successful } // DownloadImages downloads all images from a post's HTML content and returns updated HTML func (id *ImageDownloader) DownloadImages(ctx context.Context, htmlContent string, postSlug string) (*ImageDownloadResult, error) { // Parse HTML content doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent)) if err != nil { return nil, fmt.Errorf("failed to parse HTML content: %w", err) } // Extract image elements with all their URLs imageElements, err := id.extractImageElements(doc) if err != nil { return nil, fmt.Errorf("failed to extract image elements: %w", err) } if len(imageElements) == 0 { return &ImageDownloadResult{ Images: []ImageInfo{}, UpdatedHTML: htmlContent, Success: 0, Failed: 0, }, nil } // Create images directory imagesPath := filepath.Join(id.outputDir, id.imagesDir, postSlug) if err := os.MkdirAll(imagesPath, 0755); err != nil { return nil, fmt.Errorf("failed to create images directory: %w", err) } // Download images and build URL mapping var images []ImageInfo urlToLocalPath := make(map[string]string) for _, element := range imageElements { // Download the best quality URL imageInfo := id.downloadSingleImage(ctx, element.BestURL, imagesPath) images = append(images, imageInfo) if imageInfo.Success { // Map ALL URLs for this image element to the same local path for _, url := range element.AllURLs { urlToLocalPath[url] = imageInfo.LocalPath } } } // Update HTML content with local paths updatedHTML := id.updateHTMLWithLocalPaths(htmlContent, urlToLocalPath) // Count success/failure success := 0 failed := 0 for _, img := range images { if img.Success { success++ } else { failed++ } } return &ImageDownloadResult{ Images: images, UpdatedHTML: updatedHTML, Success: success, Failed: failed, }, nil } // extractImageElements 
// extracts image elements with all their URLs from HTML content
func (id *ImageDownloader) extractImageElements(doc *goquery.Document) ([]ImageElement, error) {
	var imageElements []ImageElement
	seenBestURLs := make(map[string]bool)         // To avoid duplicates based on best URL
	allURLsToCollect := make(map[string][]string) // Map from best URL to all URLs that should map to it

	// Find all img tags and collect their URLs
	doc.Find("img").Each(func(i int, s *goquery.Selection) {
		element := id.getImageElementInfo(s)
		if element.BestURL != "" && !seenBestURLs[element.BestURL] {
			allURLsToCollect[element.BestURL] = element.AllURLs
			imageElements = append(imageElements, element)
			seenBestURLs[element.BestURL] = true
		}
	})

	// Also collect URLs from <a> tags that link to images
	doc.Find("a").Each(func(i int, s *goquery.Selection) {
		if href, exists := s.Attr("href"); exists && id.isImageURL(href) {
			// Find the corresponding image element to add this URL to
			for bestURL, urls := range allURLsToCollect {
				if id.isSameImage(href, bestURL) {
					// Add this href URL to the list of URLs to replace
					urlExists := false
					for _, existingURL := range urls {
						if existingURL == href {
							urlExists = true
							break
						}
					}
					if !urlExists {
						allURLsToCollect[bestURL] = append(urls, href)
						// Update the corresponding element in imageElements
						for j, elem := range imageElements {
							if elem.BestURL == bestURL {
								imageElements[j].AllURLs = allURLsToCollect[bestURL]
								break
							}
						}
					}
					break
				}
			}
		}
	})

	// Also collect URLs from <source> tags (in <picture> elements)
	doc.Find("source").Each(func(i int, s *goquery.Selection) {
		if srcset, exists := s.Attr("srcset"); exists {
			srcsetURLs := id.extractAllURLsFromSrcset(srcset)
			for _, srcsetURL := range srcsetURLs {
				if id.isImageURL(srcsetURL) {
					// Find the corresponding image element to add this URL to
					for bestURL, urls := range allURLsToCollect {
						if id.isSameImage(srcsetURL, bestURL) {
							// Add this srcset URL to the list of URLs to replace
							urlExists := false
							for _, existingURL := range urls {
								if existingURL == srcsetURL {
urlExists = true break } } if !urlExists { allURLsToCollect[bestURL] = append(urls, srcsetURL) // Update the corresponding element in imageElements for j, elem := range imageElements { if elem.BestURL == bestURL { imageElements[j].AllURLs = allURLsToCollect[bestURL] break } } } break } } } } } }) return imageElements, nil } // extractImageURLs extracts image URLs from HTML content (kept for backward compatibility with tests) func (id *ImageDownloader) extractImageURLs(doc *goquery.Document) ([]string, error) { var imageURLs []string urlSet := make(map[string]bool) // To avoid duplicates // Find all img tags doc.Find("img").Each(func(i int, s *goquery.Selection) { // Get the best quality URL based on user preference imageURL := id.getBestImageURL(s) if imageURL != "" && !urlSet[imageURL] { imageURLs = append(imageURLs, imageURL) urlSet[imageURL] = true } }) return imageURLs, nil } // getImageElementInfo extracts all URLs and determines the best one for an img element func (id *ImageDownloader) getImageElementInfo(imgElement *goquery.Selection) ImageElement { var allURLs []string urlSet := make(map[string]bool) // To avoid duplicates // Helper function to add unique URLs addURL := func(url string) { if url != "" && !urlSet[url] { allURLs = append(allURLs, url) urlSet[url] = true } } // 1. Get URL from data-attrs JSON (highest priority) if dataAttrs, exists := imgElement.Attr("data-attrs"); exists { var attrs map[string]interface{} if err := json.Unmarshal([]byte(dataAttrs), &attrs); err == nil { if src, ok := attrs["src"].(string); ok && src != "" { addURL(src) } } } // 2. Get URLs from srcset attribute if srcset, exists := imgElement.Attr("srcset"); exists { srcsetURLs := id.extractAllURLsFromSrcset(srcset) for _, url := range srcsetURLs { addURL(url) } } // 3. 
Get URL from src attribute if src, exists := imgElement.Attr("src"); exists { addURL(src) } // Determine the best URL to download bestURL := id.getBestImageURL(imgElement) return ImageElement{ BestURL: bestURL, AllURLs: allURLs, } } // getBestImageURL extracts the best quality image URL from an img element func (id *ImageDownloader) getBestImageURL(imgElement *goquery.Selection) string { // First try to get URL from data-attrs JSON dataAttrs, exists := imgElement.Attr("data-attrs") if exists { var attrs map[string]interface{} if err := json.Unmarshal([]byte(dataAttrs), &attrs); err == nil { if src, ok := attrs["src"].(string); ok && src != "" { return src } } } // Get target width based on quality preference targetWidth := id.getTargetWidth() // Try to get URL from srcset based on quality preference srcset, exists := imgElement.Attr("srcset") if exists { if url := id.extractURLFromSrcset(srcset, targetWidth); url != "" { return url } } // Fallback to src attribute src, exists := imgElement.Attr("src") if exists { return src } return "" } // getTargetWidth returns the target width based on image quality preference func (id *ImageDownloader) getTargetWidth() int { switch id.imageQuality { case ImageQualityHigh: return 1456 case ImageQualityMedium: return 848 case ImageQualityLow: return 424 default: return 1456 } } // extractAllURLsFromSrcset extracts all URLs from a srcset attribute func (id *ImageDownloader) extractAllURLsFromSrcset(srcset string) []string { if srcset == "" { return []string{} // Return empty slice instead of nil } var urls []string // Use the same robust parsing as updateSrcsetAttribute entries := id.parseSrcsetEntries(srcset) for _, entry := range entries { entry = strings.TrimSpace(entry) if entry == "" { continue } // Parse "URL WIDTHw" format parts := strings.Fields(entry) if len(parts) >= 1 { url := parts[0] // Only include if it looks like a valid URL (not a fragment like "f_webp") if url != "" && (strings.HasPrefix(url, "http://") || 
strings.HasPrefix(url, "https://")) { urls = append(urls, url) } } } if urls == nil { return []string{} // Ensure we never return nil } return urls } // extractURLFromSrcset extracts the URL with the target width from a srcset attribute func (id *ImageDownloader) extractURLFromSrcset(srcset string, targetWidth int) string { // Use the robust parsing to handle URLs with commas entries := id.parseSrcsetEntries(srcset) var bestURL string var bestWidth int for _, entry := range entries { entry = strings.TrimSpace(entry) if entry == "" { continue } // Parse "URL WIDTHw" format parts := strings.Fields(entry) if len(parts) >= 2 { url := parts[0] widthStr := strings.TrimSuffix(parts[1], "w") // Only process if it looks like a valid URL if url != "" && (strings.HasPrefix(url, "http://") || strings.HasPrefix(url, "https://")) { if width, err := strconv.Atoi(widthStr); err == nil { // Find the closest width to our target, preferring exact matches or higher if width == targetWidth || (bestURL == "" || (width >= targetWidth && (bestWidth < targetWidth || width < bestWidth)) || (width < targetWidth && bestWidth < targetWidth && width > bestWidth)) { bestURL = url bestWidth = width } } } } } return bestURL } // downloadSingleImage downloads a single image and returns its info func (id *ImageDownloader) downloadSingleImage(ctx context.Context, imageURL, imagesPath string) ImageInfo { imageInfo := ImageInfo{ OriginalURL: imageURL, Success: false, } // Generate safe filename filename, err := id.generateSafeFilename(imageURL) if err != nil { imageInfo.Error = fmt.Errorf("failed to generate filename: %w", err) return imageInfo } localPath := filepath.Join(imagesPath, filename) imageInfo.LocalPath = localPath // Download the image body, err := id.fetcher.FetchURL(ctx, imageURL) if err != nil { imageInfo.Error = fmt.Errorf("failed to fetch image: %w", err) return imageInfo } defer body.Close() // Create the local file file, err := os.Create(localPath) if err != nil { imageInfo.Error = 
fmt.Errorf("failed to create local file: %w", err) return imageInfo } defer file.Close() // Copy image data _, err = io.Copy(file, body) if err != nil { imageInfo.Error = fmt.Errorf("failed to write image data: %w", err) os.Remove(localPath) // Clean up failed file return imageInfo } // Extract image metadata imageInfo.Format = id.getImageFormat(filename) imageInfo.Width, imageInfo.Height = id.extractDimensionsFromURL(imageURL) imageInfo.Success = true return imageInfo } // generateSafeFilename generates a safe filename from an image URL func (id *ImageDownloader) generateSafeFilename(imageURL string) (string, error) { parsedURL, err := url.Parse(imageURL) if err != nil { return "", err } // Extract filename from URL path filename := filepath.Base(parsedURL.Path) // If no valid filename, generate one from URL patterns if filename == "" || filename == "/" || filename == "." { filename = "" // Reset to force fallback logic // Try to extract from the URL patterns if strings.Contains(imageURL, "substack") { // Try to extract the image ID from Substack URLs if match := regexp.MustCompile(`([a-f0-9-]{36})_(\d+x\d+)\.(jpeg|jpg|png|webp)`).FindStringSubmatch(imageURL); len(match) > 0 { filename = fmt.Sprintf("%s_%s.%s", match[1][:8], match[2], match[3]) } } // If still no filename, use default if filename == "" { filename = "image.jpg" } } // Clean filename - remove invalid characters (but preserve structure) // Only replace invalid filesystem characters cleanedFilename := regexp.MustCompile(`[<>:"/\\|?*]`).ReplaceAllString(filename, "_") // Ensure we have a valid filename after cleaning if cleanedFilename == "" || cleanedFilename == "_" || cleanedFilename == "__" { cleanedFilename = "image.jpg" } // Ensure filename is not too long if len(cleanedFilename) > 200 { ext := filepath.Ext(cleanedFilename) name := strings.TrimSuffix(cleanedFilename, ext) if len(ext) < 200 { cleanedFilename = name[:200-len(ext)] + ext } else { cleanedFilename = "image.jpg" } } return 
cleanedFilename, nil } // getImageFormat determines image format from filename func (id *ImageDownloader) getImageFormat(filename string) string { ext := strings.ToLower(filepath.Ext(filename)) switch ext { case ".jpg", ".jpeg": return "jpeg" case ".png": return "png" case ".webp": return "webp" case ".gif": return "gif" default: return "unknown" } } // extractDimensionsFromURL attempts to extract width and height from URL func (id *ImageDownloader) extractDimensionsFromURL(imageURL string) (int, int) { // Look for patterns like "1456x819" or "w_1456,h_819" if match := regexp.MustCompile(`(\d+)x(\d+)`).FindStringSubmatch(imageURL); len(match) >= 3 { width, _ := strconv.Atoi(match[1]) height, _ := strconv.Atoi(match[2]) return width, height } if match := regexp.MustCompile(`w_(\d+)`).FindStringSubmatch(imageURL); len(match) >= 2 { width, _ := strconv.Atoi(match[1]) return width, 0 // Height unknown } return 0, 0 } // updateHTMLWithLocalPaths replaces image URLs in HTML with local paths func (id *ImageDownloader) updateHTMLWithLocalPaths(htmlContent string, urlToLocalPath map[string]string) string { // Parse HTML content doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent)) if err != nil { // Fallback to simple string replacement if parsing fails return id.updateHTMLWithStringReplacement(htmlContent, urlToLocalPath) } // Create URL to relative path mapping urlToRelPath := make(map[string]string) for originalURL, localPath := range urlToLocalPath { // Convert absolute local path to relative path from output directory relPath, err := filepath.Rel(id.outputDir, localPath) if err != nil { relPath = localPath // fallback to absolute path } // Always ensure forward slashes for HTML (web standard) relPath = strings.ReplaceAll(relPath, "\\", "/") urlToRelPath[originalURL] = relPath } // Update img elements doc.Find("img").Each(func(i int, s *goquery.Selection) { // Update src attribute if src, exists := s.Attr("src"); exists { if relPath, found := 
urlToRelPath[src]; found { s.SetAttr("src", relPath) } } // Update srcset attribute if srcset, exists := s.Attr("srcset"); exists { updatedSrcset := id.updateSrcsetAttribute(srcset, urlToRelPath) s.SetAttr("srcset", updatedSrcset) } // Update data-attrs JSON if dataAttrs, exists := s.Attr("data-attrs"); exists { updatedDataAttrs := id.updateDataAttrsJSON(dataAttrs, urlToRelPath) s.SetAttr("data-attrs", updatedDataAttrs) } }) // Update anchor elements with image links doc.Find("a").Each(func(i int, s *goquery.Selection) { if href, exists := s.Attr("href"); exists { if relPath, found := urlToRelPath[href]; found { s.SetAttr("href", relPath) } } }) // Update source elements (in picture tags) doc.Find("source").Each(func(i int, s *goquery.Selection) { if srcset, exists := s.Attr("srcset"); exists { updatedSrcset := id.updateSrcsetAttribute(srcset, urlToRelPath) s.SetAttr("srcset", updatedSrcset) } }) // Get the updated HTML html, err := doc.Html() if err != nil { // Fallback to simple string replacement if HTML generation fails return id.updateHTMLWithStringReplacement(htmlContent, urlToLocalPath) } return html } // updateHTMLWithStringReplacement is the fallback method using simple string replacement func (id *ImageDownloader) updateHTMLWithStringReplacement(htmlContent string, urlToLocalPath map[string]string) string { updatedHTML := htmlContent for originalURL, localPath := range urlToLocalPath { // Convert absolute local path to relative path from output directory relPath, err := filepath.Rel(id.outputDir, localPath) if err != nil { relPath = localPath // fallback to absolute path } // Always ensure forward slashes for HTML (web standard) // Convert any backslashes to forward slashes regardless of platform relPath = strings.ReplaceAll(relPath, "\\", "/") // Replace URL in various contexts updatedHTML = strings.ReplaceAll(updatedHTML, originalURL, relPath) // Also replace URL-encoded versions encodedURL := url.QueryEscape(originalURL) if encodedURL != originalURL { 
updatedHTML = strings.ReplaceAll(updatedHTML, encodedURL, relPath) } } return updatedHTML } // updateSrcsetAttribute updates URLs in a srcset attribute func (id *ImageDownloader) updateSrcsetAttribute(srcset string, urlToRelPath map[string]string) string { if srcset == "" { return srcset } // Parse srcset more carefully to handle URLs with commas entries := id.parseSrcsetEntries(srcset) // Map to track unique local paths and their best width descriptor pathToEntry := make(map[string]string) for _, entry := range entries { entry = strings.TrimSpace(entry) if entry == "" { continue } // Parse "URL WIDTH" format parts := strings.Fields(entry) if len(parts) >= 1 { url := parts[0] // Replace URL if we have a mapping for it if relPath, found := urlToRelPath[url]; found { // Build the new entry with local path var newEntry string if len(parts) >= 2 { // Has width descriptor newEntry = relPath + " " + parts[1] } else { // No width descriptor newEntry = relPath } // Only keep one entry per unique local path // If we already have an entry for this path, keep the one with width descriptor if existingEntry, exists := pathToEntry[relPath]; exists { // Prefer entries with width descriptors if len(parts) >= 2 && !strings.Contains(existingEntry, " ") { pathToEntry[relPath] = newEntry } // If both have width descriptors or both don't, keep the first one } else { pathToEntry[relPath] = newEntry } } else { // URL wasn't mapped, keep original entry pathToEntry[url] = entry } } } // Convert map back to slice, maintaining order as much as possible var updatedEntries []string for _, entry := range entries { entry = strings.TrimSpace(entry) if entry == "" { continue } parts := strings.Fields(entry) if len(parts) >= 1 { url := parts[0] if relPath, found := urlToRelPath[url]; found { // Use the entry from our deduplication map if finalEntry, exists := pathToEntry[relPath]; exists { updatedEntries = append(updatedEntries, finalEntry) delete(pathToEntry, relPath) // Remove to avoid duplicates 
				}
			} else {
				// Original URL, use as-is
				if finalEntry, exists := pathToEntry[url]; exists {
					updatedEntries = append(updatedEntries, finalEntry)
					delete(pathToEntry, url)
				}
			}
		}
	}

	return strings.Join(updatedEntries, ", ")
}

// isImageURL checks if a URL appears to be an image URL (Substack CDN or S3)
func (id *ImageDownloader) isImageURL(url string) bool {
	return strings.Contains(url, "substackcdn.com") ||
		strings.Contains(url, "substack-post-media.s3.amazonaws.com") ||
		strings.Contains(url, "bucketeer-") // Some Substack images use bucketeer URLs
}

// isSameImage checks if two URLs refer to the same image by comparing the core image identifier
func (id *ImageDownloader) isSameImage(url1, url2 string) bool {
	// Extract the UUID pattern from both URLs
	uuidPattern := regexp.MustCompile(`([a-f0-9-]{36})`)

	matches1 := uuidPattern.FindStringSubmatch(url1)
	matches2 := uuidPattern.FindStringSubmatch(url2)

	if len(matches1) > 0 && len(matches2) > 0 {
		return matches1[1] == matches2[1]
	}

	// Fallback: if we can't find UUIDs, check if the URLs contain similar image identifiers
	// This handles cases where the URL structure might vary.
	// An empty identifier is skipped: strings.Contains(s, "") is always true,
	// which would otherwise make unrelated URLs compare as the same image.
	id1, id2 := extractImageID(url1), extractImageID(url2)
	return (id2 != "" && strings.Contains(url1, id2)) ||
		(id1 != "" && strings.Contains(url2, id1))
}

// extractImageID extracts a unique identifier from an image URL
func extractImageID(url string) string {
	// Try to extract UUID first
	if match := regexp.MustCompile(`([a-f0-9-]{36})`).FindStringSubmatch(url); len(match) > 0 {
		return match[1]
	}

	// Fallback to extracting a filename-like pattern
	if match := regexp.MustCompile(`/([^/]+)\.(jpeg|jpg|png|webp|heic|gif)(?:\?|$)`).FindStringSubmatch(url); len(match) > 0 {
		return match[1]
	}

	return ""
}

// parseSrcsetEntries carefully parses srcset entries, handling URLs that contain commas
func (id *ImageDownloader) parseSrcsetEntries(srcset string) []string {
	var entries []string

	// Use regex to find URLs followed by width descriptors
	// This pattern matches: (URL) (WIDTH)w where URL can contain commas
	pattern
:= regexp.MustCompile(`(https?://[^\s]+)\s+(\d+w)`) matches := pattern.FindAllStringSubmatch(srcset, -1) for _, match := range matches { if len(match) >= 3 { url := match[1] width := match[2] entries = append(entries, url+" "+width) } } // If regex parsing didn't find anything, fall back to simple comma splitting // but only for URLs that don't contain commas if len(entries) == 0 { parts := strings.Split(srcset, ",") for _, part := range parts { part = strings.TrimSpace(part) if part != "" { // Only include if it looks like a proper entry (URL + width or just URL) fields := strings.Fields(part) if len(fields) >= 1 && (strings.HasPrefix(fields[0], "http://") || strings.HasPrefix(fields[0], "https://")) { entries = append(entries, part) } } } } return entries } // updateDataAttrsJSON updates URLs in a data-attrs JSON string func (id *ImageDownloader) updateDataAttrsJSON(dataAttrs string, urlToRelPath map[string]string) string { if dataAttrs == "" { return dataAttrs } var attrs map[string]interface{} if err := json.Unmarshal([]byte(dataAttrs), &attrs); err != nil { return dataAttrs // Return original if parsing fails } // Update src field if it exists if src, ok := attrs["src"].(string); ok { if relPath, found := urlToRelPath[src]; found { attrs["src"] = relPath } } // Marshal back to JSON updatedJSON, err := json.Marshal(attrs) if err != nil { return dataAttrs // Return original if marshaling fails } return string(updatedJSON) } ================================================ FILE: lib/images_test.go ================================================ package lib import ( "context" "fmt" "net/http" "net/http/httptest" "net/url" "os" "path/filepath" "strings" "testing" "time" "github.com/PuerkitoBio/goquery" "github.com/stretchr/testify/assert" "github.com/stretchr/testify/require" ) // Test image data - a simple 1x1 PNG var testImageData = []byte{ 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, 0x00, 0x00, 0x00, 0x0D, 0x49, 0x48, 0x44, 0x52, 0x00, 0x00, 0x00, 0x01, 
	0x00, 0x00, 0x00, 0x01, 0x08, 0x06, 0x00, 0x00, 0x00, 0x1F, 0x15, 0xC4,
	0x89, 0x00, 0x00, 0x00, 0x0A, 0x49, 0x44, 0x41, 0x54, 0x78, 0x9C, 0x63, 0x00, 0x01, 0x00, 0x00,
	0x05, 0x00, 0x01, 0x0D, 0x0A, 0x2D, 0xB4, 0x00, 0x00, 0x00, 0x00, 0x49, 0x45, 0x4E, 0x44, 0xAE,
	0x42, 0x60, 0x82,
}

// createTestImageServer creates a test server that serves test images
func createTestImageServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		path := r.URL.Path

		switch {
		case strings.Contains(path, "success"):
			w.Header().Set("Content-Type", "image/png")
			w.WriteHeader(http.StatusOK)
			w.Write(testImageData)
		case strings.Contains(path, "not-found"):
			w.WriteHeader(http.StatusNotFound)
		case strings.Contains(path, "server-error"):
			w.WriteHeader(http.StatusInternalServerError)
		case strings.Contains(path, "timeout"):
			// Don't respond to simulate timeout - but add a timeout to prevent hanging
			<-time.After(5 * time.Second)
			w.WriteHeader(http.StatusRequestTimeout)
		default:
			w.Header().Set("Content-Type", "image/png")
			w.WriteHeader(http.StatusOK)
			w.Write(testImageData)
		}
	}))
}

// createTestHTMLWithImages creates HTML content with various image structures
func createTestHTMLWithImages(baseURL string) string {
	return fmt.Sprintf(` Test Post

Test Post with Images

Here's a simple image:

Simple image
Data attrs image Missing image `, baseURL, baseURL, baseURL, baseURL, baseURL, baseURL, baseURL, baseURL, baseURL, baseURL, baseURL, baseURL) } // TestNewImageDownloader tests the creation of ImageDownloader func TestNewImageDownloader(t *testing.T) { t.Run("WithFetcher", func(t *testing.T) { fetcher := NewFetcher() downloader := NewImageDownloader(fetcher, "/tmp", "images", ImageQualityHigh) assert.Equal(t, fetcher, downloader.fetcher) assert.Equal(t, "/tmp", downloader.outputDir) assert.Equal(t, "images", downloader.imagesDir) assert.Equal(t, ImageQualityHigh, downloader.imageQuality) }) t.Run("WithoutFetcher", func(t *testing.T) { downloader := NewImageDownloader(nil, "/tmp", "images", ImageQualityMedium) assert.NotNil(t, downloader.fetcher) assert.Equal(t, "/tmp", downloader.outputDir) assert.Equal(t, "images", downloader.imagesDir) assert.Equal(t, ImageQualityMedium, downloader.imageQuality) }) } // TestGetTargetWidth tests width calculation for different quality levels func TestGetTargetWidth(t *testing.T) { tests := []struct { quality ImageQuality width int }{ {ImageQualityHigh, 1456}, {ImageQualityMedium, 848}, {ImageQualityLow, 424}, {ImageQuality("invalid"), 1456}, // should default to high } for _, test := range tests { t.Run(string(test.quality), func(t *testing.T) { downloader := NewImageDownloader(nil, "/tmp", "images", test.quality) width := downloader.getTargetWidth() assert.Equal(t, test.width, width) }) } } // TestExtractURLFromSrcset tests srcset URL extraction func TestExtractURLFromSrcset(t *testing.T) { downloader := NewImageDownloader(nil, "/tmp", "images", ImageQualityHigh) tests := []struct { name string srcset string targetWidth int expected string }{ { name: "ExactMatch", srcset: "https://example.com/image-424.jpg 424w, https://example.com/image-848.jpg 848w, https://example.com/image-1456.jpg 1456w", targetWidth: 848, expected: "https://example.com/image-848.jpg", }, { name: "ClosestHigher", srcset: "https://example.com/image-424.jpg 
424w, https://example.com/image-1200.jpg 1200w, https://example.com/image-1456.jpg 1456w", targetWidth: 800, expected: "https://example.com/image-1200.jpg", }, { name: "ClosestLower", srcset: "https://example.com/image-200.jpg 200w, https://example.com/image-400.jpg 400w", targetWidth: 800, expected: "https://example.com/image-400.jpg", }, { name: "SingleEntry", srcset: "https://example.com/single-image.jpg 1024w", targetWidth: 800, expected: "https://example.com/single-image.jpg", }, { name: "EmptySrcset", srcset: "", targetWidth: 800, expected: "", }, } for _, test := range tests { t.Run(test.name, func(t *testing.T) { result := downloader.extractURLFromSrcset(test.srcset, test.targetWidth) assert.Equal(t, test.expected, result) }) } } // TestGenerateSafeFilename tests filename generation func TestGenerateSafeFilename(t *testing.T) { downloader := NewImageDownloader(nil, "/tmp", "images", ImageQualityHigh) tests := []struct { name string url string expected string }{ { name: "SimpleURL", url: "https://example.com/image.jpg", expected: "image.jpg", }, { name: "SubstackPattern", url: "https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd83a175f-d0a1-450a-931f-adf68630630e_5634x2864.jpeg", expected: "d83a175f-d0a1-450a-931f-adf68630630e_5634x2864.jpeg", }, { name: "InvalidCharacters", url: "https://example.com/image:with<bad>chars.png", expected: "image_with_bad_chars.png", }, { name: "NoExtension", url: "https://example.com/imagewithoutextension", expected: "imagewithoutextension", }, { name: "EmptyFilename", url: "https://example.com/", expected: "image.jpg", }, } for _, test := range tests { t.Run(test.name, func(t *testing.T) { result, err := downloader.generateSafeFilename(test.url) assert.NoError(t, err) assert.Equal(t, test.expected, result) }) } } // TestGetImageFormat tests image format detection func TestGetImageFormat(t *testing.T) { downloader := 
NewImageDownloader(nil, "/tmp", "images", ImageQualityHigh) tests := []struct { filename string format string }{ {"image.jpg", "jpeg"}, {"image.jpeg", "jpeg"}, {"image.png", "png"}, {"image.webp", "webp"}, {"image.gif", "gif"}, {"image.JPG", "jpeg"}, {"image.PNG", "png"}, {"image.unknown", "unknown"}, {"image", "unknown"}, } for _, test := range tests { t.Run(test.filename, func(t *testing.T) { result := downloader.getImageFormat(test.filename) assert.Equal(t, test.format, result) }) } } // TestExtractDimensionsFromURL tests dimension extraction from URLs func TestExtractDimensionsFromURL(t *testing.T) { downloader := NewImageDownloader(nil, "/tmp", "images", ImageQualityHigh) tests := []struct { name string url string width int height int }{ { name: "DimensionPattern", url: "https://example.com/image_1920x1080.jpg", width: 1920, height: 1080, }, { name: "WidthOnlyPattern", url: "https://example.com/w_1456,c_limit/image.jpg", width: 1456, height: 0, }, { name: "NoDimensions", url: "https://example.com/image.jpg", width: 0, height: 0, }, { name: "SubstackPattern", url: "https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd83a175f-d0a1-450a-931f-adf68630630e_5634x2864.jpeg", width: 5634, height: 2864, }, } for _, test := range tests { t.Run(test.name, func(t *testing.T) { width, height := downloader.extractDimensionsFromURL(test.url) assert.Equal(t, test.width, width) assert.Equal(t, test.height, height) }) } } // TestDownloadImages tests the complete image downloading workflow func TestDownloadImages(t *testing.T) { // Create test server server := createTestImageServer() defer server.Close() // Create temporary directory tempDir, err := os.MkdirTemp("", "image-download-test-*") require.NoError(t, err) defer os.RemoveAll(tempDir) // Create downloader downloader := NewImageDownloader(nil, tempDir, "images", ImageQualityHigh) t.Run("SuccessfulDownload", func(t 
*testing.T) { htmlContent := createTestHTMLWithImages(server.URL) ctx := context.Background() result, err := downloader.DownloadImages(ctx, htmlContent, "test-post") require.NoError(t, err) // Check results assert.Greater(t, result.Success, 0, "Should have successful downloads") assert.Greater(t, result.Failed, 0, "Should have failed downloads (not-found image)") assert.Greater(t, len(result.Images), 0, "Should have image info") // Check that images directory was created imagesDir := filepath.Join(tempDir, "images", "test-post") _, err = os.Stat(imagesDir) assert.NoError(t, err, "Images directory should exist") // Check that some images were downloaded files, err := os.ReadDir(imagesDir) assert.NoError(t, err) assert.Greater(t, len(files), 0, "Should have downloaded image files") // Check that HTML was updated assert.NotEqual(t, htmlContent, result.UpdatedHTML, "HTML should be updated") assert.Contains(t, result.UpdatedHTML, "images/test-post/", "HTML should contain local image paths") }) t.Run("NoImages", func(t *testing.T) { htmlContent := "

No images here

" ctx := context.Background() result, err := downloader.DownloadImages(ctx, htmlContent, "no-images-post") require.NoError(t, err) assert.Equal(t, 0, result.Success) assert.Equal(t, 0, result.Failed) assert.Equal(t, 0, len(result.Images)) assert.Equal(t, htmlContent, result.UpdatedHTML) }) t.Run("EmptyHTML", func(t *testing.T) { emptyHTML := "" ctx := context.Background() result, err := downloader.DownloadImages(ctx, emptyHTML, "empty-post") require.NoError(t, err) assert.Equal(t, 0, result.Success) assert.Equal(t, 0, result.Failed) assert.Equal(t, 0, len(result.Images)) }) } // TestDownloadSingleImage tests individual image downloading func TestDownloadSingleImage(t *testing.T) { // Create test server server := createTestImageServer() defer server.Close() // Create temporary directory tempDir, err := os.MkdirTemp("", "single-image-test-*") require.NoError(t, err) defer os.RemoveAll(tempDir) downloader := NewImageDownloader(nil, tempDir, "images", ImageQualityHigh) ctx := context.Background() t.Run("SuccessfulDownload", func(t *testing.T) { imageURL := server.URL + "/success.png" imageInfo := downloader.downloadSingleImage(ctx, imageURL, tempDir) assert.True(t, imageInfo.Success) assert.NoError(t, imageInfo.Error) assert.Equal(t, imageURL, imageInfo.OriginalURL) assert.NotEmpty(t, imageInfo.LocalPath) // Check file exists _, err := os.Stat(imageInfo.LocalPath) assert.NoError(t, err) // Check file content data, err := os.ReadFile(imageInfo.LocalPath) assert.NoError(t, err) assert.Equal(t, testImageData, data) }) t.Run("NotFound", func(t *testing.T) { imageURL := server.URL + "/not-found.png" imageInfo := downloader.downloadSingleImage(ctx, imageURL, tempDir) assert.False(t, imageInfo.Success) assert.Error(t, imageInfo.Error) assert.Equal(t, imageURL, imageInfo.OriginalURL) }) t.Run("ServerError", func(t *testing.T) { imageURL := server.URL + "/server-error.png" imageInfo := downloader.downloadSingleImage(ctx, imageURL, tempDir) assert.False(t, imageInfo.Success) 
assert.Error(t, imageInfo.Error) }) } // TestUpdateHTMLWithLocalPaths tests HTML content updating func TestUpdateHTMLWithLocalPaths(t *testing.T) { downloader := NewImageDownloader(nil, "/output", "images", ImageQualityHigh) originalHTML := `Image 1 Image 2 Same image again` urlToLocalPath := map[string]string{ "https://example.com/image1.jpg": filepath.Join("/output", "images", "post", "image1.jpg"), "https://example.com/image2.png": filepath.Join("/output", "images", "post", "image2.png"), } updatedHTML := downloader.updateHTMLWithLocalPaths(originalHTML, urlToLocalPath) // Check that URLs were replaced assert.Contains(t, updatedHTML, `src="images/post/image1.jpg"`) assert.Contains(t, updatedHTML, `src="images/post/image2.png"`) assert.NotContains(t, updatedHTML, "https://example.com/") // Check that duplicate URLs were replaced assert.Equal(t, 2, strings.Count(updatedHTML, "images/post/image1.jpg")) } // Benchmark tests func BenchmarkExtractURLFromSrcset(b *testing.B) { downloader := NewImageDownloader(nil, "/tmp", "images", ImageQualityHigh) srcset := "img-424.jpg 424w, img-848.jpg 848w, img-1272.jpg 1272w, img-1456.jpg 1456w" b.ResetTimer() for i := 0; i < b.N; i++ { downloader.extractURLFromSrcset(srcset, 1456) } } func BenchmarkGenerateSafeFilename(b *testing.B) { downloader := NewImageDownloader(nil, "/tmp", "images", ImageQualityHigh) url := "https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd83a175f-d0a1-450a-931f-adf68630630e_5634x2864.jpeg" b.ResetTimer() for i := 0; i < b.N; i++ { downloader.generateSafeFilename(url) } } // TestWithRealSubstackHTML tests image extraction from actual Substack HTML files func TestWithRealSubstackHTML(t *testing.T) { // Skip test if scraped directory doesn't exist scrapedDir := "../scraped/computerenhance" if _, err := os.Stat(scrapedDir); os.IsNotExist(err) { t.Skip("Scraped directory not found, skipping real 
HTML test") } // Find some sample HTML files files, err := os.ReadDir(scrapedDir) require.NoError(t, err) var htmlFiles []string for _, file := range files { if strings.HasSuffix(file.Name(), ".html") && len(htmlFiles) < 3 { htmlFiles = append(htmlFiles, filepath.Join(scrapedDir, file.Name())) } } if len(htmlFiles) == 0 { t.Skip("No HTML files found in scraped directory") } // Create temporary directory for testing tempDir, err := os.MkdirTemp("", "real-substack-test-*") require.NoError(t, err) defer os.RemoveAll(tempDir) downloader := NewImageDownloader(nil, tempDir, "images", ImageQualityHigh) for _, htmlFile := range htmlFiles { t.Run(filepath.Base(htmlFile), func(t *testing.T) { // Read the HTML file htmlContent, err := os.ReadFile(htmlFile) require.NoError(t, err) // Extract image URLs from the real HTML doc, err := goquery.NewDocumentFromReader(strings.NewReader(string(htmlContent))) require.NoError(t, err) imageURLs, err := downloader.extractImageURLs(doc) require.NoError(t, err) t.Logf("Found %d image URLs in %s", len(imageURLs), filepath.Base(htmlFile)) // Verify we can parse the image URLs and generate filenames for i, imageURL := range imageURLs { if i >= 5 { // Limit to first 5 images for performance break } t.Logf("Image URL %d: %s", i+1, imageURL) // Test filename generation filename, err := downloader.generateSafeFilename(imageURL) assert.NoError(t, err) assert.NotEmpty(t, filename) assert.False(t, strings.Contains(filename, "<"), "Filename should not contain invalid characters") assert.False(t, strings.Contains(filename, ">"), "Filename should not contain invalid characters") // Test dimension extraction width, height := downloader.extractDimensionsFromURL(imageURL) t.Logf(" Dimensions: %dx%d", width, height) // Test URL parsing _, err = url.Parse(imageURL) assert.NoError(t, err, "Image URL should be valid") } // Test HTML update functionality (without actually downloading) if len(imageURLs) > 0 { // Create a mock mapping for URL replacement 
urlToLocalPath := make(map[string]string) for i, imageURL := range imageURLs { if i >= 3 { // Limit for performance break } filename, _ := downloader.generateSafeFilename(imageURL) localPath := filepath.Join(tempDir, "images", "test-post", filename) urlToLocalPath[imageURL] = localPath } updatedHTML := downloader.updateHTMLWithLocalPaths(string(htmlContent), urlToLocalPath) assert.NotEqual(t, string(htmlContent), updatedHTML, "HTML should be updated") // Verify some URLs were replaced for originalURL := range urlToLocalPath { assert.NotContains(t, updatedHTML, originalURL, "Original URL should be replaced") } } }) } } // TestURLReplacementIssue tests that all image URLs are properly replaced in HTML func TestURLReplacementIssue(t *testing.T) { // Create test server server := createTestImageServer() defer server.Close() // Create temporary directory tempDir, err := os.MkdirTemp("", "url-replacement-test-*") require.NoError(t, err) defer os.RemoveAll(tempDir) // Create downloader downloader := NewImageDownloader(nil, tempDir, "images", ImageQualityHigh) // Create HTML with mismatched URLs between src and data-attrs // Use server URLs so downloads will succeed htmlContent := fmt.Sprintf(` Simple image`, server.URL, server.URL, server.URL, server.URL, server.URL, server.URL, server.URL, server.URL) t.Logf("Original HTML:\n%s", htmlContent) // Use the full DownloadImages method which should use the new logic ctx := context.Background() result, err := downloader.DownloadImages(ctx, htmlContent, "test-post") require.NoError(t, err) t.Logf("Download results: Success=%d, Failed=%d", result.Success, result.Failed) t.Logf("Updated HTML:\n%s", result.UpdatedHTML) // Verify that ALL URLs were replaced, not just the ones from data-attrs problemURLs := []string{ fmt.Sprintf("%s/w_1456.jpeg", server.URL), // src attribute fmt.Sprintf("%s/simple-src.jpg", server.URL), // simple src fmt.Sprintf("%s/w_424.jpeg", server.URL), // srcset URLs fmt.Sprintf("%s/w_848.jpeg", server.URL), } 
for _, url := range problemURLs { if strings.Contains(result.UpdatedHTML, url) { t.Errorf("URL should be replaced but still present: %s", url) } } // Verify some images were actually downloaded assert.Greater(t, result.Success, 0, "Should have successful downloads") // Verify local paths are present assert.Contains(t, result.UpdatedHTML, "images/test-post/", "Should contain local image paths") } // TestCommaSeparatedURLRegressionBug tests the specific bug reported in v0.6.0 // where multiple URLs for the same image (in srcset, data-attrs, etc.) would // create comma-separated URL strings in the output instead of clean local paths. // This is a regression test to ensure this specific pattern doesn't break again. func TestCommaSeparatedURLRegressionBug(t *testing.T) { // Create a test server that serves image content server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { // Return a small PNG image for any request w.Header().Set("Content-Type", "image/png") w.WriteHeader(http.StatusOK) // Write minimal PNG data pngData := []byte{0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, 0x00, 0x00, 0x00, 0x0D, 0x49, 0x48, 0x44, 0x52} w.Write(pngData) })) defer server.Close() // Create temporary directory tempDir := t.TempDir() fetcher := NewFetcher() downloader := NewImageDownloader(fetcher, tempDir, "images", ImageQualityHigh) // Create HTML that reproduces the exact bug pattern from the bug report // This simulates real Substack HTML where the same image appears with multiple URL variations // but they all represent the same actual image file and should map to the same local path baseImageID := "4697c31d-2502-48d2-b6c1-11e5ea97536f_2560x2174" // These represent different CDN transformations of the same base image // All should download the same file and map to the same local path originalURL := fmt.Sprintf("%s/substack-post-media.s3.amazonaws.com/public/images/%s.jpeg", server.URL, baseImageID) w1456URL := 
fmt.Sprintf("%s/substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s.jpeg", server.URL, baseImageID) w848URL := fmt.Sprintf("%s/substackcdn.com/image/fetch/w_848,c_limit,f_auto,q_auto:good/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s.jpeg", server.URL, baseImageID) w424URL := fmt.Sprintf("%s/substackcdn.com/image/fetch/w_424,c_limit,f_auto,q_auto:good/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s.jpeg", server.URL, baseImageID) webpURL := fmt.Sprintf("%s/substackcdn.com/image/fetch/f_webp,w_1456,c_limit,q_auto:good/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s.jpeg", server.URL, baseImageID) // Create HTML that matches the structure from the bug report // All these URLs should map to the same local file path htmlContent := fmt.Sprintf(``, originalURL, // href w424URL, w848URL, w1456URL, webpURL, // webp srcset w1456URL, // img src w424URL, w848URL, w1456URL, webpURL, // img srcset originalURL) // data-attrs src t.Logf("Original HTML with potentially problematic URLs:\n%s", htmlContent) // Download images using the full pipeline ctx := context.Background() result, err := downloader.DownloadImages(ctx, htmlContent, "good-ideas") require.NoError(t, err) t.Logf("Download results: Success=%d, Failed=%d", result.Success, result.Failed) t.Logf("Updated HTML:\n%s", result.UpdatedHTML) // THE KEY REGRESSION TEST: Verify no comma-separated URL strings appear // This is the exact bug pattern that was reported commaSeparatedPatterns := []string{ "images/good-ideas/" + baseImageID + ".jpeg,images/good-ideas/", // Should not have comma-separated paths ",f_webp,images/good-ideas/", // Should not have CDN parameters mixed with local paths "images/good-ideas/" + baseImageID + ".jpeg,images/good-ideas/" + baseImageID + ".jpeg", // Repeated paths } for _, pattern := range commaSeparatedPatterns 
{ if strings.Contains(result.UpdatedHTML, pattern) { t.Errorf("REGRESSION BUG DETECTED: Found comma-separated URL pattern in output: %s", pattern) t.Errorf("This indicates the string replacement bug has returned") } } // Verify that all original URLs have been replaced with local paths originalURLs := []string{originalURL, w1456URL, w848URL, w424URL, webpURL} for _, url := range originalURLs { if strings.Contains(result.UpdatedHTML, url) { t.Errorf("Original URL should be replaced but still present: %s", url) } } // Verify clean local paths are present expectedLocalPath := "images/good-ideas/" + baseImageID + ".jpeg" if !strings.Contains(result.UpdatedHTML, expectedLocalPath) { t.Errorf("Expected local path not found: %s", expectedLocalPath) } // Verify srcset entries are clean (no commas except between entries) if strings.Contains(result.UpdatedHTML, `srcset="`) { // Extract srcset content srcsetStart := strings.Index(result.UpdatedHTML, `srcset="`) + 8 srcsetEnd := strings.Index(result.UpdatedHTML[srcsetStart:], `"`) srcsetContent := result.UpdatedHTML[srcsetStart : srcsetStart+srcsetEnd] t.Logf("Extracted srcset: %s", srcsetContent) // Verify srcset has proper format: "path width, path width, ..." 
or just "path" // Should NOT have comma-separated paths without proper structure entries := strings.Split(srcsetContent, ",") for i, entry := range entries { entry = strings.TrimSpace(entry) if entry == "" { continue } parts := strings.Fields(entry) if len(parts) == 0 { t.Errorf("Srcset entry %d is empty after trimming: %s", i, entry) continue } // First part should be a clean local path if !strings.HasPrefix(parts[0], "images/good-ideas/") { t.Errorf("Srcset entry %d doesn't have proper local path: %s", i, parts[0]) } // If there's a second part, it should be a width descriptor if len(parts) >= 2 { if !strings.HasSuffix(parts[1], "w") { t.Errorf("Srcset entry %d has invalid width descriptor: %s", i, parts[1]) } } // Should not have more than 2 parts if len(parts) > 2 { t.Errorf("Srcset entry %d has too many parts (should be 'path' or 'path width'): %s", i, entry) } } } // Verify at least one image was successfully downloaded assert.Greater(t, result.Success, 0, "Should have successful downloads") assert.Equal(t, 0, result.Failed, "Should have no failed downloads") } // TestExtractImageElements tests the new image element extraction with all URLs func TestExtractImageElements(t *testing.T) { downloader := NewImageDownloader(nil, "/tmp", "images", ImageQualityHigh) htmlContent := ` Complete image Simple image Data only ` doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent)) require.NoError(t, err) imageElements, err := downloader.extractImageElements(doc) require.NoError(t, err) // Should find 3 image elements assert.Len(t, imageElements, 3) // First image should have all URLs elem1 := imageElements[0] assert.Equal(t, "https://example.com/data.jpg", elem1.BestURL) // data-attrs has priority expectedURLs1 := []string{ "https://example.com/data.jpg", // from data-attrs "https://example.com/small.jpg", // from srcset "https://example.com/large.jpg", // from srcset "https://example.com/src.jpg", // from src } assert.ElementsMatch(t, expectedURLs1, 
elem1.AllURLs) // Second image should have only src URL elem2 := imageElements[1] assert.Equal(t, "https://example.com/simple.jpg", elem2.BestURL) assert.Equal(t, []string{"https://example.com/simple.jpg"}, elem2.AllURLs) // Third image should have only data-attrs URL elem3 := imageElements[2] assert.Equal(t, "https://example.com/data-only.jpg", elem3.BestURL) assert.Equal(t, []string{"https://example.com/data-only.jpg"}, elem3.AllURLs) } // TestExtractAllURLsFromSrcset tests srcset URL extraction func TestExtractAllURLsFromSrcset(t *testing.T) { downloader := NewImageDownloader(nil, "/tmp", "images", ImageQualityHigh) tests := []struct { name string srcset string expected []string }{ { name: "MultipleSizes", srcset: "https://example.com/img-400.jpg 400w, https://example.com/img-800.jpg 800w, https://example.com/img-1200.jpg 1200w", expected: []string{"https://example.com/img-400.jpg", "https://example.com/img-800.jpg", "https://example.com/img-1200.jpg"}, }, { name: "SingleEntry", srcset: "https://example.com/single.jpg 1024w", expected: []string{"https://example.com/single.jpg"}, }, { name: "ExtraSpaces", srcset: " https://example.com/spaced1.jpg 400w , https://example.com/spaced2.jpg 800w ", expected: []string{"https://example.com/spaced1.jpg", "https://example.com/spaced2.jpg"}, }, { name: "Empty", srcset: "", expected: []string{}, }, } for _, test := range tests { t.Run(test.name, func(t *testing.T) { urls := downloader.extractAllURLsFromSrcset(test.srcset) assert.Equal(t, test.expected, urls) }) } } // TestImageURLParsing tests URL parsing with various Substack image patterns func TestImageURLParsing(t *testing.T) { downloader := NewImageDownloader(nil, "/tmp", "images", ImageQualityHigh) // Real Substack URL patterns from the analysis testURLs := []string{ 
"https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F43e258db-6164-4e47-835f-d11f10847d9d_5616x3744.jpeg", "https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd83a175f-d0a1-450a-931f-adf68630630e_5634x2864.jpeg", "https://substack-post-media.s3.amazonaws.com/public/images/d6ad0fd8-3659-4626-b5db-f81cbcd4c779_779x305.png", } for i, testURL := range testURLs { t.Run(fmt.Sprintf("URL_%d", i+1), func(t *testing.T) { // Test filename generation filename, err := downloader.generateSafeFilename(testURL) assert.NoError(t, err) assert.NotEmpty(t, filename) // Test dimension extraction width, height := downloader.extractDimensionsFromURL(testURL) t.Logf("URL: %s", testURL) t.Logf("Filename: %s", filename) t.Logf("Dimensions: %dx%d", width, height) // URLs should be valid _, err = url.Parse(testURL) assert.NoError(t, err) }) } } // TestImageURLHelperFunctions tests the helper functions added for the bug fix func TestImageURLHelperFunctions(t *testing.T) { downloader := NewImageDownloader(nil, "/tmp", "images", ImageQualityHigh) t.Run("IsImageURL", func(t *testing.T) { tests := []struct { name string url string expected bool }{ {"SubstackCDN", "https://substackcdn.com/image/fetch/w_1456/image.jpg", true}, {"SubstackS3", "https://substack-post-media.s3.amazonaws.com/public/images/test.png", true}, {"Bucketeer", "https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/test.jpeg", true}, {"NotImage", "https://example.com/page.html", false}, {"RegularImage", "https://example.com/image.jpg", false}, // Not Substack } for _, test := range tests { t.Run(test.name, func(t *testing.T) { result := downloader.isImageURL(test.url) assert.Equal(t, test.expected, result) }) } }) t.Run("IsSameImage", func(t *testing.T) { baseUUID := 
"b0ebde87-580d-4dce-bb73-573edf9229ff" tests := []struct { name string url1 string url2 string expected bool }{ { "SameUUID", fmt.Sprintf("https://substackcdn.com/image/fetch/w_1456/%s_1024x1536.heic", baseUUID), fmt.Sprintf("https://substack-post-media.s3.amazonaws.com/public/images/%s_1024x1536.heic", baseUUID), true, }, { "DifferentUUIDs", "https://substackcdn.com/image/fetch/w_1456/aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee_800x600.jpg", "https://substackcdn.com/image/fetch/w_848/ffffffff-gggg-hhhh-iiii-jjjjjjjjjjjj_800x600.jpg", false, }, { "NoUUIDs", "https://example.com/image1.jpg", "https://example.com/image2.jpg", false, }, } for _, test := range tests { t.Run(test.name, func(t *testing.T) { result := downloader.isSameImage(test.url1, test.url2) assert.Equal(t, test.expected, result) }) } }) t.Run("ExtractImageID", func(t *testing.T) { tests := []struct { name string url string expected string }{ { "UUID", "https://substack-post-media.s3.amazonaws.com/public/images/b0ebde87-580d-4dce-bb73-573edf9229ff_1024x1536.heic", "b0ebde87-580d-4dce-bb73-573edf9229ff", }, { "FilenamePattern", "https://example.com/path/to/myimage.jpg", "myimage", }, { "NoPattern", "https://example.com/path/", "", }, } for _, test := range tests { t.Run(test.name, func(t *testing.T) { result := extractImageID(test.url) assert.Equal(t, test.expected, result) }) } }) } // TestExtractImageElementsWithAnchorAndSourceTags tests the bug fix for collecting URLs from and tags func TestExtractImageElementsWithAnchorAndSourceTags(t *testing.T) { downloader := NewImageDownloader(nil, "/tmp", "images", ImageQualityHigh) // This HTML pattern reproduces the exact structure from real Substack posts // where the same image appears in multiple places with different URLs baseUUID := "f35ed9ff-eb9e-4106-a443-45c963ae74cd" originalURL := fmt.Sprintf("https://substack-post-media.s3.amazonaws.com/public/images/%s_1208x793.png", baseUUID) hrefURL := 
fmt.Sprintf("https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s_1208x793.png", baseUUID) w424URL := fmt.Sprintf("https://substackcdn.com/image/fetch/w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s_1208x793.png", baseUUID) w848URL := fmt.Sprintf("https://substackcdn.com/image/fetch/w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s_1208x793.png", baseUUID) w1456URL := fmt.Sprintf("https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s_1208x793.png", baseUUID) htmlContent := fmt.Sprintf(` `, hrefURL, // w424URL, w848URL, w1456URL, // originalURL, // w424URL, w848URL, w1456URL, // originalURL) // data-attrs src t.Logf("Test HTML:\n%s", htmlContent) doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent)) require.NoError(t, err) imageElements, err := downloader.extractImageElements(doc) require.NoError(t, err) // Should find exactly 1 image element (all URLs refer to the same image) assert.Len(t, imageElements, 1, "Should find exactly one image element") elem := imageElements[0] t.Logf("BestURL: %s", elem.BestURL) t.Logf("AllURLs: %v", elem.AllURLs) // Best URL should be from data-attrs (highest priority) assert.Equal(t, originalURL, elem.BestURL) // All URLs should be collected (from img src, img srcset, source srcset, a href, and data-attrs) expectedURLs := []string{ originalURL, // from data-attrs and img src w424URL, // from srcsets w848URL, // from srcsets w1456URL, // from srcsets hrefURL, // from } // Check that all expected URLs are present for _, expectedURL := range expectedURLs { assert.Contains(t, elem.AllURLs, expectedURL, "Should contain URL: %s", 
expectedURL) } // Should not have duplicates urlCounts := make(map[string]int) for _, url := range elem.AllURLs { urlCounts[url]++ } for url, count := range urlCounts { assert.Equal(t, 1, count, "URL should appear exactly once: %s", url) } } // TestHrefAndSourceURLReplacementRegression tests the specific bug where images were downloaded // but and URLs weren't replaced with local paths func TestHrefAndSourceURLReplacementRegression(t *testing.T) { // Create test server server := createTestImageServer() defer server.Close() // Create temporary directory tempDir, err := os.MkdirTemp("", "href-source-regression-test-*") require.NoError(t, err) defer os.RemoveAll(tempDir) // Create downloader downloader := NewImageDownloader(nil, tempDir, "images", ImageQualityHigh) // Create HTML that reproduces the exact bug: // - Images are downloaded successfully // - img src and srcset are replaced correctly // - BUT and still contain original URLs // Using Substack-style URLs so they're detected as image URLs baseUUID := "123e4567-e89b-12d3-a456-426614174000" imageURL := server.URL + "/substack-post-media.s3.amazonaws.com/public/images/" + baseUUID + "_800x600.png" hrefURL := server.URL + "/substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F" + baseUUID + "_1200x900.png" srcsetURL1 := server.URL + "/substackcdn.com/image/fetch/w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F" + baseUUID + "_800x600.png" srcsetURL2 := server.URL + "/substackcdn.com/image/fetch/w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F" + baseUUID + "_800x600.png" htmlContent := fmt.Sprintf(` `, hrefURL, // - THIS was not being replaced in the bug srcsetURL1, srcsetURL2, // - THIS was not being replaced in the bug imageURL, // - this was working srcsetURL1, srcsetURL2) 
// - this was working t.Logf("Original HTML with problematic URLs:\n%s", htmlContent) // Download images using the full pipeline ctx := context.Background() result, err := downloader.DownloadImages(ctx, htmlContent, "regression-test") require.NoError(t, err) t.Logf("Download results: Success=%d, Failed=%d", result.Success, result.Failed) t.Logf("Updated HTML:\n%s", result.UpdatedHTML) // CRITICAL REGRESSION TEST: Verify ALL original URLs are replaced originalURLs := []string{imageURL, hrefURL, srcsetURL1, srcsetURL2} for _, originalURL := range originalURLs { assert.NotContains(t, result.UpdatedHTML, originalURL, "REGRESSION BUG: Original URL should be replaced but still present: %s", originalURL) } // Verify local paths are present assert.Contains(t, result.UpdatedHTML, "images/regression-test/", "Should contain local image directory path") // Verify was replaced with local path assert.Regexp(t, `href="images/regression-test/[^"]*"`, result.UpdatedHTML, "href should point to local path") // Verify was replaced with local paths assert.Contains(t, result.UpdatedHTML, `
`, server.URL, server.URL, server.URL, server.URL, server.URL, server.URL, server.URL) t.Logf("Complex Substack HTML structure:\n%s", htmlContent) // Process the HTML ctx := context.Background() result, err := downloader.DownloadImages(ctx, htmlContent, "complex-test") require.NoError(t, err) t.Logf("Download results: Success=%d, Failed=%d", result.Success, result.Failed) t.Logf("Updated HTML:\n%s", result.UpdatedHTML) // Verify NO original server URLs remain in the output assert.NotContains(t, result.UpdatedHTML, server.URL, "REGRESSION BUG: Original server URLs should be completely replaced") // Verify local paths are present assert.Contains(t, result.UpdatedHTML, "images/complex-test/", "Should contain local image paths") // Verify the href was replaced assert.Contains(t, result.UpdatedHTML, `href="images/complex-test/`, "href should point to local path") // Verify source srcset was replaced assert.Contains(t, result.UpdatedHTML, ` 0 { var archiveErr error switch format { case "html": archiveErr = archive.GenerateHTML(outputFolder) case "md": archiveErr = archive.GenerateMarkdown(outputFolder) case "txt": archiveErr = archive.GenerateText(outputFolder) } } ``` ### 5.2 Format Consistency - **Output Format Matching**: Archive format automatically matches selected post format - **Content Alignment**: Archive styling and structure complement post formatting - **Directory Structure**: Archive placed in root output directory alongside posts ## 6. 
Archive Content Structure

### 6.1 Post Metadata Display
Each archive entry includes:
- **Title**: Clickable link to downloaded post file
- **Publication Date**: Original Substack publication date (formatted: "January 2, 2006")
- **Download Date**: Local download timestamp (formatted: "January 2, 2006 15:04")
- **Description**: Post subtitle (priority) or description (fallback)
- **Cover Image**: Featured post image when available

### 6.2 Content Prioritization
```go
// Description selection logic
description := entry.Post.Subtitle
if description == "" {
    description = entry.Post.Description
}
```

### 6.3 Date Formatting
- **Publication Date**: Human-readable format ("January 2, 2006")
- **Download Date**: Includes time for precise tracking ("January 2, 2006 15:04")
- **Sorting**: Uses RFC3339 format for accurate chronological ordering

## 7. Error Handling Strategy

### 7.1 Archive Generation Errors
- **Directory Creation**: Automatic creation of output directory if missing
- **File Writing**: Graceful handling of permission and disk space issues
- **Format Validation**: Error reporting for unknown or unsupported formats

### 7.2 Metadata Processing
- **Date Parsing**: Fallback to title-based sorting for unparseable dates
- **Missing Fields**: Graceful handling of empty subtitles, descriptions, or cover images
- **Path Generation**: Error handling for invalid file paths or relative path calculation failures

### 7.3 Content Validation
- **Empty Archives**: Skip generation when no entries are present
- **Invalid Entries**: Continue processing valid entries when individual entries have issues
- **HTML Escaping**: Proper escaping of user content in HTML format

## 8. Performance Considerations

### 8.1 Memory Management
- **Incremental Building**: Archive entries added incrementally during download process
- **Efficient Sorting**: In-place sorting using standard library algorithms
- **Content Generation**: String building optimized for each format type

### 8.2 File I/O Optimization
- **Single Write Operations**: Generate complete content before writing to disk
- **Relative Path Caching**: Efficient path calculation using filepath.Rel()
- **Format-Specific Generation**: Only generate requested format to minimize overhead

## 9. Testing Strategy

### 9.1 Unit Tests
```go
// Comprehensive test coverage areas
func TestNewArchive(t *testing.T)
func TestArchive_AddEntry(t *testing.T)
func TestArchive_sortEntries(t *testing.T)
func TestArchive_GenerateHTML(t *testing.T)
func TestArchive_GenerateMarkdown(t *testing.T)
func TestArchive_GenerateText(t *testing.T)
func TestEnhancedPostExtraction(t *testing.T)
```

### 9.2 Integration Tests
```go
func TestArchiveWorkflow(t *testing.T)
func TestCommandFlags(t *testing.T)
func TestArchivePageGeneration(t *testing.T)
```

### 9.3 Test Coverage Areas
- **Data Structure Operations**: Archive creation, entry management, sorting
- **Format Generation**: Content generation for all three formats
- **Error Scenarios**: Invalid dates, missing fields, empty archives
- **Integration**: End-to-end workflows with CLI flag integration
- **Post Enhancement**: Subtitle and cover image extraction functionality

## 10. Security Considerations

### 10.1 Content Security
- **HTML Escaping**: Proper escaping of post titles and descriptions in HTML format
- **Path Validation**: Safe relative path generation preventing directory traversal
- **Input Sanitization**: Clean handling of user-provided post content

### 10.2 File System Security
- **Directory Containment**: Archive files created only in designated output directory
- **Permission Handling**: Graceful handling of file system permission restrictions
- **Path Safety**: Cross-platform safe path generation and validation

## 11. Directory Structure Impact

### 11.1 Output Structure with Archive
```
output/
├── index.html                           # Archive index page
├── 20231201_120000_post-title.html
├── 20231115_090000_another-post.html
├── images/
│   ├── post-title/
│   │   └── image1_1456x819.jpeg
│   └── another-post/
│       └── image2_848x636.png
└── files/
    ├── post-title/
    │   └── document.pdf
    └── another-post/
        └── spreadsheet.xlsx
```

### 11.2 Archive Index Formats
- **HTML**: `index.html` - Styled webpage with embedded CSS
- **Markdown**: `index.md` - Clean markdown for documentation systems
- **Text**: `index.txt` - Plain text for maximum compatibility

## 12. Migration and Rollout

### 12.1 Backward Compatibility
- **Opt-in Feature**: Archive generation only when `--create-archive` flag is used
- **No Breaking Changes**: Existing CLI behavior unchanged when flag not present
- **Format Consistency**: Archive format automatically matches post format selection

### 12.2 Progressive Enhancement
- **Single Post Support**: Build archives incrementally with individual post downloads
- **Bulk Download Integration**: Seamless operation with existing bulk download workflows
- **Feature Combination**: Full compatibility with image and file download features

## 13. Future Enhancements

### 13.1 Potential Extensions
- **Custom Templates**: User-provided HTML/Markdown templates for archive pages
- **Theme Support**: Multiple built-in themes for HTML archive format
- **Pagination**: Support for paginated archives with very large post collections
- **Search Integration**: Client-side search functionality for archive pages

### 13.2 Advanced Features
- **Archive Regeneration**: Rebuild archive from existing downloaded files
- **Multiple Formats**: Generate archive in multiple formats simultaneously
- **RSS Generation**: Create RSS/Atom feeds from archive content
- **Static Site Integration**: Export formats compatible with static site generators

---

**Specification Status**: Implemented v1.0
**Last Updated**: 2025-01-03
**Dependencies**: Existing sbstck-dl codebase (fetcher.go, extractor.go), enhanced Post struct
**Implementation**: Complete with comprehensive test coverage

================================================
FILE: specs/file-attachment-download.md
================================================
# File Attachment Download Feature Specification

## 1. Overview

### 1.1 Purpose
Add support for downloading file attachments from Substack posts alongside the existing text and image download functionality. This feature will enable users to download PDFs, documents, and other files that authors embed in their posts, with local file references updated in the downloaded content.

### 1.2 Success Criteria
- Users can download file attachments from Substack posts using command-line flags
- Downloaded files are organized in a configurable directory structure
- HTML/Markdown content is updated with local file paths
- Optional file extension filtering allows selective downloading
- Integration with existing rate limiting and retry mechanisms
- Comprehensive error handling for network failures and unsupported file types

### 1.3 Scope Boundaries

**In Scope:**
- Detection and extraction of file attachment URLs from Substack HTML
- Download of attachments with appropriate file naming
- Content rewriting to reference local file paths
- File extension filtering capabilities
- Integration with existing fetcher infrastructure
- Support for all common file types (PDF, DOC, TXT, etc.)

**Out of Scope:**
- File preview or content analysis capabilities
- Automatic file conversion between formats
- Virus scanning or security validation of downloaded files
- Selective downloading based on file size limits
- Cloud storage integration for downloaded files

## 2. Technical Architecture

### 2.1 Architecture Alignment
This feature follows the established sbstck-dl patterns:
- **Modular Design**: New `FileDownloader` struct similar to existing `ImageDownloader`
- **Consistent Interface**: Integration with existing CLI flags and output patterns
- **Error Handling**: Leverages existing retry and backoff mechanisms from `Fetcher`
- **Content Rewriting**: Similar approach to image URL replacement in HTML/Markdown

### 2.2 Core Components

#### 2.2.1 FileDownloader Struct
```go
type FileDownloader struct {
    fetcher     *Fetcher
    outputDir   string
    filesDir    string
    allowedExts []string // empty means all extensions allowed
}
```

#### 2.2.2 File Information Structure
```go
type FileInfo struct {
    URL       string
    Filename  string
    Extension string
    Size      string
    Type      string
    LocalPath string
}

type FileDownloadResult struct {
    Files       []FileInfo
    UpdatedHTML string
    Errors      []error
}
```

### 2.3 HTML Parsing Strategy

#### 2.3.1 CSS Selector Target
- **Primary Selector**: `.file-embed-button.wide`
- **Container Selector**: `.file-embed-container-top` (for metadata extraction)

#### 2.3.2 HTML Structure Analysis
Based on the example URL, file attachments follow this structure:

```html
The Stone Boy Cropped 1
207KB ∙ PDF file
Download
```

## 3. Command Line Interface

### 3.1 New CLI Flags
```go
// New flags to add to cmd/download.go
var (
    downloadFiles   bool     // --download-files
    filesDir        string   // --files-dir
    allowedFileExts []string // --file-extensions
)
```

### 3.2 Flag Definitions
| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| `--download-files` | | `false` | Download file attachments locally and update content references |
| `--files-dir` | | `"files"` | Directory name for downloaded files (relative to output directory) |
| `--file-extensions` | | `[]` (all) | Comma-separated list of allowed file extensions (e.g., "pdf,doc,txt") |

### 3.3 Usage Examples
```bash
# Download posts with all file attachments
sbstck-dl download --url https://example.substack.com --download-files

# Download only PDF and DOC files to custom directory
sbstck-dl download --url https://example.substack.com --download-files \
  --file-extensions "pdf,doc" --files-dir "documents"

# Combined with existing features
sbstck-dl download --url https://example.substack.com --download-files \
  --download-images --format md --output ./downloads
```

## 4. Implementation Details

### 4.1 File Detection Algorithm
1. **HTML Parsing**: Use goquery to find all `.file-embed-button.wide` elements
2. **URL Extraction**: Extract `href` attribute from anchor tags
3. **Metadata Extraction**: Parse container for filename, size, and type information
4. **Extension Filtering**: Apply user-specified extension filters if provided

### 4.2 File Naming Strategy
```go
func (fd *FileDownloader) generateSafeFilename(fileInfo FileInfo, index int) string {
    // Priority order for filename:
    // 1. Extract from file-embed-details-h1 if available
    // 2. Parse from URL path
    // 3. Generate from URL hash + extension
    // 4. Fallback: "attachment_."
}
```

### 4.3 Content Rewriting

#### 4.3.1 HTML Content Updates
- Replace `href` attributes in `.file-embed-button.wide` elements
- Maintain original HTML structure while updating file paths
- Handle both absolute and relative path scenarios

#### 4.3.2 Markdown Content Updates
- Convert file embed HTML to Markdown link format: `[filename](local/path)`
- Preserve file metadata information in link text when possible

### 4.4 Directory Structure
```
output_directory/
├── post-title.html
├── images/              # existing images directory
│   └── image1.jpg
└── files/               # new files directory
    ├── document1.pdf
    ├── spreadsheet1.xlsx
    └── archive1.zip
```

## 5. Integration Points

### 5.1 Extractor Integration
```go
// Add to Post struct
type Post struct {
    // ... existing fields
    FileDownloadResult *FileDownloadResult `json:"file_download_result,omitempty"`
}

// New method on Post
func (p *Post) WriteToFileWithAttachments(ctx context.Context, path, format string,
    addSourceURL, downloadImages, downloadFiles bool, imageQuality ImageQuality,
    imagesDir, filesDir string, allowedExts []string, fetcher *Fetcher) (*FileDownloadResult, error)
```

### 5.2 Command Integration
```go
// Update in cmd/download.go init()
downloadCmd.Flags().BoolVar(&downloadFiles, "download-files", false,
    "Download file attachments locally and update content to reference local files")
downloadCmd.Flags().StringVar(&filesDir, "files-dir", "files",
    "Directory name for downloaded files")
downloadCmd.Flags().StringSliceVar(&allowedFileExts, "file-extensions", []string{},
    "Comma-separated list of allowed file extensions (empty = all extensions)")
```

## 6. Error Handling Strategy

### 6.1 Network Error Handling
- **Retry Logic**: Leverage existing `Fetcher` retry mechanisms with exponential backoff
- **Rate Limiting**: Respect existing rate limiting for file downloads
- **Timeout Handling**: Use configurable timeouts for large file downloads

### 6.2 File System Error Handling
- **Directory Creation**: Ensure files directory exists before downloading
- **Permission Errors**: Graceful handling of write permission issues
- **Disk Space**: Basic validation for available disk space

### 6.3 Content Error Handling
- **Invalid URLs**: Skip malformed or inaccessible file URLs
- **Extension Filtering**: Log filtered files for user awareness
- **Partial Failures**: Continue processing other files if individual downloads fail

## 7. Performance Considerations

### 7.1 Concurrent Downloads
- Use Go's `errgroup` pattern consistent with existing image download implementation
- Configurable worker pools to prevent resource exhaustion
- Progress reporting for large file downloads

### 7.2 Memory Management
- Stream large files to disk rather than loading entirely in memory
- Implement file size limits to prevent excessive memory usage
- Clean up temporary files on process interruption

## 8. Testing Strategy

### 8.1 Unit Tests
```go
// Test coverage areas
func TestFileDownloader_ExtractFileElements(t *testing.T)
func TestFileDownloader_GenerateSafeFilename(t *testing.T)
func TestFileDownloader_DownloadSingleFile(t *testing.T)
func TestFileDownloader_UpdateHTMLWithLocalPaths(t *testing.T)
func TestFileDownloader_ExtensionFiltering(t *testing.T)
```

### 8.2 Integration Tests
- **Real Substack Posts**: Test with actual posts containing file attachments
- **Network Conditions**: Test behavior under various network conditions
- **File Type Coverage**: Test common file types (PDF, DOC, XLS, ZIP, etc.)
- **Edge Cases**: Empty responses, malformed HTML, missing files

### 8.3 Performance Tests
- **Large File Handling**: Test download of files >100MB
- **Multiple Files**: Test posts with many attachments
- **Concurrent Processing**: Validate worker pool behavior

## 9. Security Considerations

### 9.1 File Path Security
- **Path Traversal Prevention**: Sanitize filenames to prevent directory traversal attacks
- **Safe Filename Generation**: Remove or escape dangerous characters in filenames
- **Directory Containment**: Ensure all downloads remain within designated directories

### 9.2 Content Validation
- **URL Validation**: Validate file URLs are from expected Substack domains
- **File Type Validation**: Basic MIME type checking for downloaded files
- **Size Limits**: Implement reasonable file size limits to prevent abuse

## 10. Migration and Rollout

### 10.1 Backward Compatibility
- New feature is entirely opt-in via `--download-files` flag
- No changes to existing CLI behavior when flag is not used
- Existing configurations and scripts remain unaffected

### 10.2 Documentation Updates
- Update CLI help text and documentation
- Add usage examples to README
- Document new directory structure and file naming conventions

## 11. Future Enhancements

### 11.1 Potential Extensions
- **File Size Filtering**: Add flags for minimum/maximum file size limits
- **Content Type Detection**: Enhanced MIME type detection and handling
- **Progress Indicators**: Visual progress bars for large downloads
- **Deduplication**: Skip downloading identical files across multiple posts

### 11.2 Advanced Features
- **Selective Downloads**: Interactive mode for choosing which files to download
- **Metadata Preservation**: Store original file metadata in sidecar files
- **Cloud Integration**: Optional upload to cloud storage services

---

**Specification Status**: Draft v1.0
**Last Updated**: 2025-07-31
**Dependencies**: Existing sbstck-dl codebase (fetcher.go, extractor.go, images.go)
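
### Appendix: Illustrative Sketch of Filename Sanitization and Extension Filtering

As a rough illustration of the path traversal prevention in section 9.1 and the extension filtering step in section 4.1, the sketch below shows one possible approach. The helper names `sanitizeFilename` and `isAllowedExtension`, the `"attachment"` fallback, and the set of replaced characters are assumptions made for this example. They are not taken from `lib/files.go` and may differ from the actual implementation.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// sanitizeFilename is a hypothetical helper. It keeps only the final path
// component, so "../" sequences cannot escape the files directory, and it
// replaces characters that are unsafe on common file systems.
func sanitizeFilename(name string) string {
	// Treat Windows separators like "/" so path.Base strips them on every OS.
	base := path.Base(strings.ReplaceAll(name, "\\", "/"))
	if base == "." || base == ".." || base == "/" {
		return "attachment" // assumed fallback name for this sketch
	}
	return strings.Map(func(r rune) rune {
		if r < 0x20 || strings.ContainsRune(`<>:"/\|?*`, r) {
			return '_'
		}
		return r
	}, base)
}

// isAllowedExtension is a hypothetical helper. An empty allow-list means
// every extension is accepted, matching the "empty means all" rule in the spec.
func isAllowedExtension(filename string, allowed []string) bool {
	if len(allowed) == 0 {
		return true
	}
	ext := strings.ToLower(strings.TrimPrefix(path.Ext(filename), "."))
	for _, a := range allowed {
		if strings.ToLower(strings.TrimPrefix(strings.TrimSpace(a), ".")) == ext {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(sanitizeFilename("../../etc/passwd"))
	fmt.Println(isAllowedExtension("Report.PDF", []string{"pdf", "doc"}))
}
```

Using `path.Base` after normalizing backslashes keeps the behavior identical across operating systems, whereas `filepath.Base` only treats `\` as a separator on Windows. A real implementation would likely also verify that the final joined path stays inside the designated files directory.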