[
  {
    "path": ".github/workflows/build-release.yml",
    "content": "name: Manual Build and Release\non:\n  workflow_dispatch:\n    inputs:\n      branch:\n        description: 'Branch to build'\n        required: true\n        default: 'main'\n  release:\n    types: [created]\n\njobs:\n  test:\n    name: Run Tests\n    runs-on: ${{ matrix.os }}\n    strategy:\n      matrix:\n        os: [ubuntu-latest, macos-latest, windows-latest]\n        go-version: [1.24.1]\n    steps:\n      - name: Check out code\n        uses: actions/checkout@v4\n        with:\n          ref: ${{ github.event.inputs.branch || github.ref }}\n        \n      - name: Set up Go\n        uses: actions/setup-go@v4\n        with:\n          go-version: ${{ matrix.go-version }}\n          \n      - name: Run tests\n        run: go test -v -timeout=10m ./...\n\n  build:\n    name: Build\n    needs: test\n    if: success()\n    runs-on: ${{ matrix.os }}\n    strategy:\n      matrix:\n        os: [ubuntu-latest, macos-latest, windows-latest]\n        go-version: [1.24.1]\n        include:\n          - os: ubuntu-latest\n            goos: linux\n            goarch: amd64\n            name: ubuntu\n            extension: \"\"\n          - os: macos-latest\n            goos: darwin\n            goarch: amd64\n            name: mac\n            extension: \"\"\n          - os: windows-latest\n            goos: windows\n            goarch: amd64\n            name: win\n            extension: \".exe\"\n    steps:\n      - name: Check out code\n        uses: actions/checkout@v4\n        with:\n          ref: ${{ github.event.inputs.branch || github.ref }}\n        \n      - name: Set up Go\n        uses: actions/setup-go@v4\n        with:\n          go-version: ${{ matrix.go-version }}\n          \n      - name: Build\n        run: |\n          env GOOS=${{ matrix.goos }} GOARCH=${{ matrix.goarch }} go build -v -o sbstck-dl-${{ matrix.name }}-${{ matrix.goarch }}${{ matrix.extension }}\n          \n      - name: Upload artifact\n        uses: 
actions/upload-artifact@v4\n        with:\n          name: sbstck-dl-${{ matrix.name }}-${{ matrix.goarch }}\n          path: sbstck-dl-${{ matrix.name }}-${{ matrix.goarch }}${{ matrix.extension }}\n          \n  release-upload:\n    name: Attach Artifacts to Release\n    if: github.event_name == 'release'\n    needs: build\n    runs-on: ubuntu-latest\n    permissions:\n      contents: write  # This is needed for release uploads\n    steps:\n      - name: Debug event info\n        run: |\n          echo \"Event name: ${{ github.event_name }}\"\n          echo \"Event type: ${{ github.event.action }}\"\n          echo \"Release tag: ${{ github.event.release.tag_name }}\"\n        \n      - name: Download all artifacts\n        uses: actions/download-artifact@v4\n        with:\n          path: artifacts\n      \n      - name: List artifacts\n        run: find artifacts -type f | sort\n          \n      - name: Upload artifacts to release\n        uses: softprops/action-gh-release@v1\n        with:\n          files: artifacts/**/*\n          # GitHub automatically provides this token\n          token: ${{ github.token }}"
  },
  {
    "path": ".github/workflows/test.yml",
    "content": "name: Run Tests\non:\n  pull_request:\n    branches: [main]\n\njobs:\n  test:\n    name: Run Tests\n    runs-on: ${{ matrix.os }}\n    strategy:\n      matrix:\n        os: [ubuntu-latest, macos-latest, windows-latest]\n        go-version: [1.24.1]\n    steps:\n      - name: Check out code\n        uses: actions/checkout@v4\n        \n      - name: Set up Go\n        uses: actions/setup-go@v4\n        with:\n          go-version: ${{ matrix.go-version }}\n          \n      - name: Run tests\n        run: go test -v ./..."
  },
  {
    "path": ".gitignore",
    "content": "# If you prefer the allow list template instead of the deny list, see community template:\n# https://github.com/github/gitignore/blob/main/community/Golang/Go.AllowList.gitignore\n#\n# Binaries for programs and plugins\n*.exe\n*.exe~\n*.dll\n*.so\n*.dylib\nbin/\n\n# Test binary, built with `go test -c`\n*.test\n\n# Output of the go coverage tool, specifically when used with LiteIDE\n*.out\n\n# Dependency directories (remove the comment below to include it)\n# vendor/\n\n# Go workspace file\ngo.work\n\n# Directory contained scraped content\nscraped/\ntest-download/\n\n# vscode\n.vscode/\n\n# serena\n.serena/cache/"
  },
  {
    "path": ".serena/.gitignore",
    "content": "/cache\n"
  },
  {
    "path": ".serena/memories/code_style_conventions.md",
    "content": "# Code Style and Conventions\n\n## Go Style Guidelines\n- Follows standard Go conventions and formatting\n- Uses `gofmt` for code formatting\n- Package naming: lowercase, single words when possible\n- Function naming: CamelCase for exported, camelCase for unexported\n- Variable naming: camelCase, descriptive names\n\n## Code Organization\n- **Separation of Concerns**: CLI logic in `cmd/`, core business logic in `lib/`\n- **Error Handling**: Explicit error returns, wrapping with context using `fmt.Errorf`\n- **Testing**: Table-driven tests, benchmarks for performance-critical code\n- **Concurrency**: Uses errgroup for managed goroutines, context for cancellation\n\n## Naming Conventions\n- **Structs**: PascalCase (e.g., `FileDownloader`, `ImageInfo`)\n- **Interfaces**: Usually end with -er (e.g., implied by method names)\n- **Constants**: PascalCase for exported, camelCase for unexported\n- **Files**: snake_case for test files (`*_test.go`)\n\n## Function Design Patterns\n- **Constructor Pattern**: `NewXxx()` functions for creating instances\n- **Options Pattern**: Used in fetcher with `FetcherOption` functional options\n- **Context Propagation**: All network operations accept `context.Context`\n- **Resource Management**: Proper `defer` usage for cleanup (file handles, HTTP responses)\n\n## Documentation\n- **Godoc Comments**: All exported functions, types, and constants have comments\n- **README**: Comprehensive usage examples and feature documentation\n- **Code Comments**: Explain complex logic, especially in parsing and URL manipulation"
  },
  {
    "path": ".serena/memories/files_feature_overview.md",
    "content": "# File Attachment Download Feature\n\n## Implementation Overview\nNew feature added in `lib/files.go` that allows downloading file attachments from Substack posts.\n\n## Key Components\n\n### FileDownloader struct\n- Manages file downloads with rate limiting via Fetcher\n- Configurable output directory and file extensions filter\n- Integrates with existing image download workflow\n\n### CSS Selector Detection\n- Uses `.file-embed-button.wide` to find file attachment links\n- Extracts download URLs from `href` attributes\n\n### Core Functions\n- `DownloadFiles()` - Main entry point, returns FileDownloadResult\n- `extractFileElements()` - Finds file links in HTML using CSS selector\n- `downloadSingleFile()` - Downloads individual files with error handling\n- `updateHTMLWithLocalPaths()` - Replaces URLs with local paths\n\n### Features\n- Extension filtering via `--file-extensions` flag\n- Custom output directory via `--files-dir` flag\n- Filename extraction from URLs and query parameters\n- Safe filename sanitization (removes unsafe characters)\n- File existence checking (skip if already downloaded)\n- Relative path conversion for HTML references\n\n## CLI Integration\n- New flags in `cmd/download.go`:\n  - `--download-files` - Enable file downloading\n  - `--file-extensions` - Filter by extensions (comma-separated)\n  - `--files-dir` - Custom files directory name\n\n## Integration with Extractor\n- Extended `WriteToFileWithImages()` to also handle file downloads\n- Unified workflow for both images and files"
  },
  {
    "path": ".serena/memories/project_overview.md",
    "content": "# Project Overview\n\n## Purpose\nsbstck-dl is a Go CLI tool for downloading posts from Substack blogs. It supports downloading individual posts or entire archives, with features for private newsletters (via cookies), rate limiting, and format conversion (HTML/Markdown/Text). The tool also supports downloading images and file attachments locally.\n\n## Tech Stack\n- **Language**: Go 1.20+\n- **CLI Framework**: Cobra (github.com/spf13/cobra)\n- **HTML Parsing**: goquery (github.com/PuerkitoBio/goquery)\n- **HTML to Markdown**: html-to-markdown (github.com/JohannesKaufmann/html-to-markdown)\n- **HTML to Text**: html2text (github.com/k3a/html2text)\n- **Retry Logic**: backoff (github.com/cenkalti/backoff/v4)\n- **Rate Limiting**: golang.org/x/time/rate\n- **Concurrency**: golang.org/x/sync/errgroup\n- **Progress Bar**: progressbar (github.com/schollz/progressbar/v3)\n- **Testing**: testify (github.com/stretchr/testify)\n\n## Repository Structure\n- `main.go`: Entry point\n- `cmd/`: Cobra CLI commands (root.go, download.go, list.go, version.go)\n- `lib/`: Core library components\n  - `fetcher.go`: HTTP client with rate limiting, retries, and cookie support\n  - `extractor.go`: Post extraction and format conversion (HTML→Markdown/Text)\n  - `images.go`: Image downloading and local path management\n  - `files.go`: File attachment downloading and local path management\n- `.github/workflows/`: CI/CD workflows for testing and releases\n- Tests are co-located with source files (e.g., `lib/fetcher_test.go`)"
  },
  {
    "path": ".serena/memories/project_structure.md",
    "content": "# Project Structure - sbstck-dl\n\n## Overview\nGo CLI tool for downloading posts from Substack blogs with support for private newsletters, rate limiting, and format conversion.\n\n## Directory Structure\n```\n├── main.go              # Entry point\n├── cmd/                 # Cobra CLI commands\n│   ├── root.go\n│   ├── download.go      # Main download functionality\n│   ├── list.go\n│   ├── version.go\n│   ├── cmd_test.go      # Command tests\n│   └── integration_test.go\n├── lib/                 # Core library\n│   ├── fetcher.go       # HTTP client with rate limiting/retries\n│   ├── fetcher_test.go  # Comprehensive HTTP client tests\n│   ├── extractor.go     # Post extraction and format conversion\n│   ├── extractor_test.go # Extractor tests\n│   ├── images.go        # Image downloader\n│   ├── images_test.go   # Comprehensive image tests\n│   └── files.go         # NEW: File attachment downloader\n└── go.mod               # Dependencies\n```\n\n## Key Dependencies\n- `github.com/spf13/cobra` - CLI framework\n- `github.com/PuerkitoBio/goquery` - HTML parsing\n- `github.com/stretchr/testify` - Testing framework\n- `github.com/cenkalti/backoff/v4` - Exponential backoff\n- `golang.org/x/time/rate` - Rate limiting"
  },
  {
    "path": ".serena/memories/suggested_commands.md",
    "content": "# Suggested Commands\n\n## Development Commands\n\n### Building\n```bash\ngo build -o sbstck-dl .\n```\n\n### Running\n```bash\ngo run . [command] [flags]\n```\n\n### Testing\n```bash\n# Run all tests\ngo test ./...\n\n# Run tests with verbose output\ngo test -v ./...\n\n# Run tests for specific package\ngo test ./lib\ngo test ./cmd\n```\n\n### Module Management\n```bash\n# Clean up dependencies\ngo mod tidy\n\n# Download dependencies\ngo mod download\n\n# Verify dependencies\ngo mod verify\n```\n\n### Running the CLI Locally\n```bash\n# Download single post\ngo run . download --url https://example.substack.com/p/post-title --output ./downloads\n\n# Download entire archive\ngo run . download --url https://example.substack.com --output ./downloads\n\n# Download with images\ngo run . download --url https://example.substack.com --download-images --output ./downloads\n\n# Download with file attachments\ngo run . download --url https://example.substack.com --download-files --output ./downloads\n\n# Download with both images and files\ngo run . download --url https://example.substack.com --download-images --download-files --output ./downloads\n\n# Test with dry run and verbose output\ngo run . download --url https://example.substack.com --verbose --dry-run\n```\n\n### System Commands (Linux)\n- `rg` (ripgrep) for searching instead of grep\n- Standard Linux commands: `ls`, `cd`, `find`, `git`"
  },
  {
    "path": ".serena/memories/task_completion_checklist.md",
    "content": "# Task Completion Checklist\n\n## After Completing Development Tasks\n\n### Testing\n1. **Run Unit Tests**: `go test ./...`\n2. **Run Integration Tests**: `go test -v ./...` \n3. **Test CLI Commands**: Manual testing with real Substack URLs\n4. **Test Edge Cases**: Error conditions, malformed URLs, network failures\n\n### Code Quality\n1. **Format Code**: `gofmt -w .` (usually handled by editor)\n2. **Lint Code**: Use `golint` or `go vet` if available\n3. **Verify Dependencies**: `go mod tidy && go mod verify`\n\n### Documentation Updates\n1. **Update CLAUDE.md**: Add new features, commands, architectural changes\n2. **Update README.md**: Add usage examples for new features\n3. **Update Help Text**: Ensure CLI help reflects new flags and options\n4. **Update Comments**: Ensure godoc comments are current\n\n### Version Control\n1. **Stage Changes**: `git add` only relevant files\n2. **Commit**: Use conventional commits format\n   - `feat: add new feature`\n   - `fix: resolve bug`\n   - `docs: update documentation`\n   - `test: add tests`\n   - `refactor: improve code structure`\n3. **Clean Up**: Remove any temporary files or test artifacts\n\n### Build Verification\n1. **Build Binary**: `go build -o sbstck-dl .`\n2. **Test Binary**: Run basic commands to ensure it works\n3. **Cross-Platform Check**: Ensure no platform-specific code issues"
  },
  {
    "path": ".serena/memories/testing_patterns.md",
    "content": "# Testing Patterns in sbstck-dl\n\n## Test Structure\n- All tests use `github.com/stretchr/testify` with `assert` and `require`\n- Tests organized in table-driven style where appropriate\n- Each major component has comprehensive test coverage\n\n## Common Patterns\n\n### HTTP Server Tests\n- Use `httptest.NewServer()` for mock servers\n- Test various response scenarios (success, errors, timeouts)\n- Handle concurrent requests and rate limiting\n\n### File I/O Tests\n- Use `os.MkdirTemp()` for temporary directories\n- Always clean up with `defer os.RemoveAll(tempDir)`\n- Test file creation, existence, and content validation\n\n### HTML Parsing Tests\n- Use `goquery.NewDocumentFromReader(strings.NewReader(html))`\n- Test various HTML structures and edge cases\n- Validate URL extraction and replacement\n\n### Error Handling Tests\n- Test both success and failure scenarios\n- Use specific error assertions and error message checking\n- Test context cancellation and timeouts\n\n### Benchmark Tests\n- Include performance benchmarks for critical paths\n- Use `b.ResetTimer()` appropriately\n- Test both single operations and concurrent scenarios\n\n## Test Organization\n- Unit tests for individual functions\n- Integration tests for complete workflows\n- Regression tests for specific bug fixes\n- Real-world data tests (when sample data available)"
  },
  {
    "path": ".serena/project.yml",
    "content": "# language of the project (csharp, python, rust, java, typescript, go, cpp, or ruby)\n#  * For C, use cpp\n#  * For JavaScript, use typescript\n# Special requirements:\n#  * csharp: Requires the presence of a .sln file in the project folder.\nlanguage: go\n\n# whether to use the project's gitignore file to ignore files\n# Added on 2025-04-07\nignore_all_files_in_gitignore: true\n# list of additional paths to ignore\n# same syntax as gitignore, so you can use * and **\n# Was previously called `ignored_dirs`, please update your config if you are using that.\n# Added (renamed)on 2025-04-07\nignored_paths: []\n\n# whether the project is in read-only mode\n# If set to true, all editing tools will be disabled and attempts to use them will result in an error\n# Added on 2025-04-18\nread_only: false\n\n\n# list of tool names to exclude. We recommend not excluding any tools, see the readme for more details.\n# Below is the complete list of tools for convenience.\n# To make sure you have the latest list of tools, and to view their descriptions, \n# execute `uv run scripts/print_tool_overview.py`.\n#\n#  * `activate_project`: Activates a project by name.\n#  * `check_onboarding_performed`: Checks whether project onboarding was already performed.\n#  * `create_text_file`: Creates/overwrites a file in the project directory.\n#  * `delete_lines`: Deletes a range of lines within a file.\n#  * `delete_memory`: Deletes a memory from Serena's project-specific memory store.\n#  * `execute_shell_command`: Executes a shell command.\n#  * `find_referencing_code_snippets`: Finds code snippets in which the symbol at the given location is referenced.\n#  * `find_referencing_symbols`: Finds symbols that reference the symbol at the given location (optionally filtered by type).\n#  * `find_symbol`: Performs a global (or local) search for symbols with/containing a given name/substring (optionally filtered by type).\n#  * `get_current_config`: Prints the current configuration of 
the agent, including the active and available projects, tools, contexts, and modes.\n#  * `get_symbols_overview`: Gets an overview of the top-level symbols defined in a given file or directory.\n#  * `initial_instructions`: Gets the initial instructions for the current project.\n#     Should only be used in settings where the system prompt cannot be set,\n#     e.g. in clients you have no control over, like Claude Desktop.\n#  * `insert_after_symbol`: Inserts content after the end of the definition of a given symbol.\n#  * `insert_at_line`: Inserts content at a given line in a file.\n#  * `insert_before_symbol`: Inserts content before the beginning of the definition of a given symbol.\n#  * `list_dir`: Lists files and directories in the given directory (optionally with recursion).\n#  * `list_memories`: Lists memories in Serena's project-specific memory store.\n#  * `onboarding`: Performs onboarding (identifying the project structure and essential tasks, e.g. for testing or building).\n#  * `prepare_for_new_conversation`: Provides instructions for preparing for a new conversation (in order to continue with the necessary context).\n#  * `read_file`: Reads a file within the project directory.\n#  * `read_memory`: Reads the memory with the given name from Serena's project-specific memory store.\n#  * `remove_project`: Removes a project from the Serena configuration.\n#  * `replace_lines`: Replaces a range of lines within a file with new content.\n#  * `replace_symbol_body`: Replaces the full definition of a symbol.\n#  * `restart_language_server`: Restarts the language server, may be necessary when edits not through Serena happen.\n#  * `search_for_pattern`: Performs a search for a pattern in the project.\n#  * `summarize_changes`: Provides instructions for summarizing the changes made to the codebase.\n#  * `switch_modes`: Activates modes by providing a list of their names\n#  * `think_about_collected_information`: Thinking tool for pondering the completeness of 
collected information.\n#  * `think_about_task_adherence`: Thinking tool for determining whether the agent is still on track with the current task.\n#  * `think_about_whether_you_are_done`: Thinking tool for determining whether the task is truly completed.\n#  * `write_memory`: Writes a named memory (for future reference) to Serena's project-specific memory store.\nexcluded_tools: []\n\n# initial prompt for the project. It will always be given to the LLM upon activating the project\n# (contrary to the memories, which are loaded on demand).\ninitial_prompt: \"\"\n\nproject_name: \"sbstck-dl\"\n"
  },
  {
    "path": "CLAUDE.md",
    "content": "# CLAUDE.md\n\nThis file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.\n\n## Project Overview\nThis is a Go CLI tool for downloading posts from Substack blogs. It supports downloading individual posts or entire archives, with features for private newsletters (via cookies), rate limiting, format conversion (HTML/Markdown/Text), downloading of images and file attachments locally, and creating archive index pages that link all downloaded posts with their metadata.\n\n## Architecture\nThe project follows a standard Go CLI structure:\n- `main.go`: Entry point\n- `cmd/`: Contains Cobra CLI commands (`root.go`, `download.go`, `list.go`, `version.go`)\n- `lib/`: Core library with four main components:\n  - `fetcher.go`: HTTP client with rate limiting, retries, and cookie support\n  - `extractor.go`: Post extraction and format conversion (HTML→Markdown/Text)\n  - `images.go`: Image downloading and local path management\n  - `files.go`: File attachment downloading and local path management\n\n## Build and Development Commands\n\n### Building\n```bash\ngo build -o sbstck-dl .\n```\n\n### Running\n```bash\ngo run . 
[command] [flags]\n```\n\n### Testing\n```bash\ngo test ./...\ngo test ./lib\n```\n\n### Module management\n```bash\ngo mod tidy\ngo mod download\n```\n\n## Key Components\n\n### Fetcher (`lib/fetcher.go`)\n- Handles HTTP requests with exponential backoff retry\n- Rate limiting (default: 2 requests/second)\n- Cookie support for private newsletters\n- Proxy support\n\n### Extractor (`lib/extractor.go`)\n- Parses Substack post JSON from HTML\n- Extracts post metadata including subtitle (.subtitle CSS selector) and cover image (og:image meta tag)\n- Converts HTML to Markdown/Text using external libraries\n- Handles file writing with different formats\n- Provides archive page generation functionality (HTML/Markdown/Text formats)\n- Manages archive entries with automatic sorting by publication date (newest first)\n\n### Image Downloader (`lib/images.go`)\n- Downloads images locally from Substack posts\n- Supports multiple image quality levels (high/medium/low)\n- Handles various Substack CDN URL patterns\n- Updates HTML/Markdown content to reference local image paths\n- Creates organized directory structure for downloaded images\n\n### File Downloader (`lib/files.go`)\n- Downloads file attachments from Substack posts using CSS selector `.file-embed-button.wide`\n- Supports file extension filtering (optional)\n- Creates organized directory structure for downloaded files\n- Updates HTML content to reference local file paths\n- Handles filename sanitization and collision avoidance\n- Integrates with existing image download workflow\n\n### Archive Page Generator (`lib/extractor.go`)\n- Creates index pages linking all downloaded posts with metadata\n- Supports HTML, Markdown, and Text formats matching the selected output format\n- Includes post titles (linked to downloaded files with relative paths)\n- Shows publication dates and download timestamps\n- Displays post descriptions/subtitles and cover images when available\n- Automatically sorts posts by publication date 
(newest first)\n- Generates `index.{format}` in the output directory root\n\n### Commands Structure\nUses Cobra framework:\n- `download`: Main functionality for downloading posts\n- `list`: Lists available posts from a Substack\n- `version`: Shows version information\n\n## Dependencies\n- `github.com/spf13/cobra`: CLI framework\n- `github.com/PuerkitoBio/goquery`: HTML parsing\n- `github.com/JohannesKaufmann/html-to-markdown`: HTML to Markdown conversion\n- `github.com/cenkalti/backoff/v4`: Exponential backoff for retries\n- `golang.org/x/time/rate`: Rate limiting\n- `golang.org/x/sync/errgroup`: Concurrent processing\n\n## Common Development Tasks\n\n### Running the CLI locally\n```bash\ngo run . download --url https://example.substack.com --output ./downloads\n```\n\n### Testing with verbose output\n```bash\ngo run . download --url https://example.substack.com --verbose --dry-run\n```\n\n### Downloading posts with images\n```bash\n# Download posts with high-quality images\ngo run . download --url https://example.substack.com --download-images --image-quality high --output ./downloads\n\n# Download with medium quality images and custom images directory\ngo run . download --url https://example.substack.com --download-images --image-quality medium --images-dir assets --output ./downloads\n\n# Download single post with images in markdown format\ngo run . download --url https://example.substack.com/p/post-title --download-images --format md --output ./downloads\n```\n\n### Downloading posts with file attachments\n```bash\n# Download posts with file attachments\ngo run . download --url https://example.substack.com --download-files --output ./downloads\n\n# Download with specific file extensions only\ngo run . download --url https://example.substack.com --download-files --file-extensions \"pdf,docx,txt\" --output ./downloads\n\n# Download with custom files directory name\ngo run . 
download --url https://example.substack.com --download-files --files-dir attachments --output ./downloads\n\n# Download single post with both images and file attachments\ngo run . download --url https://example.substack.com/p/post-title --download-images --download-files --output ./downloads\n```\n\n### Creating archive index pages\n```bash\n# Download posts and create an archive index page\ngo run . download --url https://example.substack.com --create-archive --output ./downloads\n\n# Download entire archive with archive index in markdown format\ngo run . download --url https://example.substack.com --create-archive --format md --output ./downloads\n\n# Download single post with archive page (useful for building up an archive over time)\ngo run . download --url https://example.substack.com/p/post-title --create-archive --output ./downloads\n\n# Download with all features: images, files, and archive page\ngo run . download --url https://example.substack.com --download-images --download-files --create-archive --output ./downloads\n\n# Download archive with specific format and custom directories\ngo run . download --url https://example.substack.com --create-archive --format html --images-dir assets --files-dir attachments --output ./downloads\n```\n\n### Building for release\n```bash\ngo build -ldflags=\"-s -w\" -o sbstck-dl .\n```"
  },
  {
    "path": "LICENSE",
    "content": "The MIT License (MIT)\n\nCopyright © 2023 Alex Ferrari alex@thealexferrari.com\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in\nall copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\nTHE SOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# Substack Downloader\n\nSimple CLI tool to download one or all the posts from a Substack blog.\n\n## Installation\n\n### Downloading the binary\n\nCheck in the [releases](https://github.com/alexferrari88/sbstck-dl/releases) page for the latest version of the binary for your platform.\nWe provide binaries for Linux, MacOS and Windows.\n\n### Using Go\n\n```bash\ngo install github.com/alexferrari88/sbstck-dl\n```\n\nYour Go bin directory must be in your PATH. You can add it by adding the following line to your `.bashrc` or `.zshrc`:\n\n```bash\nexport PATH=$PATH:$(go env GOPATH)/bin\n```\n\n## Usage\n\n```bash\nUsage:\n  sbstck-dl [command]\n\nAvailable Commands:\n  download    Download individual posts or the entire public archive\n  help        Help about any command\n  list        List the posts of a Substack\n  version     Print the version number of sbstck-dl\n\nFlags:\n      --after string             Download posts published after this date (format: YYYY-MM-DD)\n      --before string            Download posts published before this date (format: YYYY-MM-DD)\n      --cookie_name cookieName   Either substack.sid or connect.sid, based on your cookie (required for private newsletters)\n      --cookie_val string        The substack.sid/connect.sid cookie value (required for private newsletters)\n  -h, --help                     help for sbstck-dl\n  -x, --proxy string             Specify the proxy url\n  -r, --rate int                 Specify the rate of requests per second (default 2)\n  -v, --verbose                  Enable verbose output\n\nUse \"sbstck-dl [command] --help\" for more information about a command.\n```\n\n### Downloading posts\n\nYou can provide the url of a single post or the main url of the Substack you want to download.\n\nBy providing the main URL of a Substack, the downloader will download all the posts of the archive.\n\nWhen downloading the full archive, if the downloader is interrupted, at the next execution it will resume 
the download of the remaining posts.\n\n```bash\nUsage:\n  sbstck-dl download [flags]\n\nFlags:\n      --add-source-url         Add the original post URL at the end of the downloaded file\n      --create-archive         Create an archive index page linking all downloaded posts\n      --download-files         Download file attachments locally and update content to reference local files\n      --download-images        Download images locally and update content to reference local files\n  -d, --dry-run                Enable dry run\n      --file-extensions string Comma-separated list of file extensions to download (e.g., 'pdf,docx,txt'). If empty, downloads all file types\n      --files-dir string       Directory name for downloaded file attachments (default \"files\")\n  -f, --format string          Specify the output format (options: \"html\", \"md\", \"txt\" (default \"html\")\n  -h, --help                   help for download\n      --image-quality string   Image quality to download (options: \"high\", \"medium\", \"low\") (default \"high\")\n      --images-dir string      Directory name for downloaded images (default \"images\")\n  -o, --output string          Specify the download directory (default \".\")\n  -u, --url string             Specify the Substack url\n\nGlobal Flags:\n      --after string    Download posts published after this date (format: YYYY-MM-DD)\n      --before string   Download posts published before this date (format: YYYY-MM-DD)\n      --cookie_name cookieName   Either substack.sid or connect.sid, based on your cookie (required for private newsletters)\n      --cookie_val string        The substack.sid/connect.sid cookie value (required for private newsletters)\n  -x, --proxy string    Specify the proxy url\n  -r, --rate int        Specify the rate of requests per second (default 2)\n  -v, --verbose         Enable verbose output\n```\n\n#### Adding Source URL\n\nIf you use the `--add-source-url` flag, each downloaded file will have the 
following line appended to its content:\n\n`original content: POST_URL`\n\nWhere `POST_URL` is the canonical URL of the downloaded post. For HTML format, this will be wrapped in a small paragraph with a link.\n\n#### Downloading Images\n\nUse the `--download-images` flag to download all images from Substack posts locally. This ensures posts remain accessible even if images are deleted from Substack's CDN.\n\n**Features:**\n- Downloads images at the selected quality level (high/medium/low)\n- Creates organized directory structure: `{output}/images/{post-slug}/`\n- Updates HTML/Markdown content to reference local image paths\n- Handles all Substack image formats and CDN patterns\n- Graceful error handling for individual image failures\n\n**Examples:**\n\n```bash\n# Download posts with high-quality images (default)\nsbstck-dl download --url https://example.substack.com --download-images\n\n# Download with medium quality images\nsbstck-dl download --url https://example.substack.com --download-images --image-quality medium\n\n# Download with custom images directory name\nsbstck-dl download --url https://example.substack.com --download-images --images-dir assets\n\n# Download single post with images in markdown format\nsbstck-dl download --url https://example.substack.com/p/post-title --download-images --format md\n```\n\n**Image Quality Options:**\n- `high`: 1456px width (best quality, larger files)\n- `medium`: 848px width (balanced quality/size)\n- `low`: 424px width (smaller files, mobile-optimized)\n\n**Directory Structure:**\n```\noutput/\n├── 20231201_120000_post-title.html\n└── images/\n    └── post-title/\n        ├── image1_1456x819.jpeg\n        ├── image2_848x636.png\n        └── image3_1272x720.webp\n```\n\n#### Downloading File Attachments\n\nUse the `--download-files` flag to download all file attachments from Substack posts locally. 
This ensures posts remain accessible even if files are removed from Substack's servers.\n\n**Features:**\n- Downloads file attachments using CSS selector `.file-embed-button.wide`\n- Optional file extension filtering (e.g., only PDFs and Word documents)\n- Creates organized directory structure: `{output}/files/{post-slug}/`\n- Updates HTML content to reference local file paths\n- Handles filename sanitization and collision avoidance\n- Graceful error handling for individual file download failures\n\n**Examples:**\n\n```bash\n# Download posts with all file attachments\nsbstck-dl download --url https://example.substack.com --download-files\n\n# Download only specific file types\nsbstck-dl download --url https://example.substack.com --download-files --file-extensions \"pdf,docx,txt\"\n\n# Download with custom files directory name\nsbstck-dl download --url https://example.substack.com --download-files --files-dir attachments\n\n# Download single post with both images and file attachments\nsbstck-dl download --url https://example.substack.com/p/post-title --download-images --download-files --format md\n```\n\n**File Extension Filtering:**\n- Specify extensions without dots: `pdf,docx,txt`\n- Case insensitive matching\n- If no extensions specified, downloads all file types\n\n**Directory Structure with Files:**\n```\noutput/\n├── 20231201_120000_post-title.html\n├── images/\n│   └── post-title/\n│       ├── image1_1456x819.jpeg\n│       └── image2_848x636.png\n└── files/\n    └── post-title/\n        ├── document.pdf\n        ├── spreadsheet.xlsx\n        └── presentation.pptx\n```\n\n#### Creating Archive Index Pages\n\nUse the `--create-archive` flag to generate an organized index page that links all downloaded posts with their metadata. 
The index gives you one place to browse your downloaded Substack archive.\n\n**Features:**\n- Creates an `index.{format}` file matching your selected output format (HTML/Markdown/Text)\n- Links to all downloaded posts using relative file paths\n- Displays post titles, publication dates, and download timestamps\n- Shows post descriptions/subtitles and cover images when available\n- Automatically sorts posts by publication date (newest first)\n- Works with both single-post and bulk downloads\n\n**Examples:**\n\n```bash\n# Download entire archive and create index page\nsbstck-dl download --url https://example.substack.com --create-archive\n\n# Create archive index in Markdown format\nsbstck-dl download --url https://example.substack.com --create-archive --format md\n\n# Build archive over time with single posts\nsbstck-dl download --url https://example.substack.com/p/post-title --create-archive\n\n# Complete download with all features\nsbstck-dl download --url https://example.substack.com --download-images --download-files --create-archive\n\n# Custom directory structure with archive\nsbstck-dl download --url https://example.substack.com --create-archive --images-dir assets --files-dir attachments\n```\n\n**Archive Content Per Post:**\n- **Title**: Clickable link to the downloaded post file\n- **Publication Date**: When the post was originally published on Substack\n- **Download Date**: When you downloaded the post locally\n- **Description**: Post subtitle or description (when available)\n- **Cover Image**: Featured image from the post (when available)\n\n**Archive Format Examples:**\n\n- **HTML Format**: Styled webpage with images, organized post cards, and hover effects\n- **Markdown Format**: Clean markdown with headers, links, and image references\n- **Text Format**: Plain text listing with all metadata for maximum compatibility\n\n**Directory Structure with Archive:**\n```\noutput/\n├── index.html                     # Archive 
index page\n├── 20231201_120000_post-title.html\n├── 20231115_090000_another-post.html\n├── images/\n│   ├── post-title/\n│   │   └── image1_1456x819.jpeg\n│   └── another-post/\n│       └── image2_848x636.png\n└── files/\n    ├── post-title/\n    │   └── document.pdf\n    └── another-post/\n        └── spreadsheet.xlsx\n```\n\n### Listing posts\n\n```bash\nUsage:\n  sbstck-dl list [flags]\n\nFlags:\n  -h, --help         help for list\n  -u, --url string   Specify the Substack url\n\nGlobal Flags:\n      --after string    Download posts published after this date (format: YYYY-MM-DD)\n      --before string   Download posts published before this date (format: YYYY-MM-DD)\n      --cookie_name cookieName   Either substack.sid or connect.sid, based on your cookie (required for private newsletters)\n      --cookie_val string        The substack.sid/connect.sid cookie value (required for private newsletters)\n  -x, --proxy string    Specify the proxy url\n  -r, --rate int        Specify the rate of requests per second (default 2)\n  -v, --verbose         Enable verbose output\n```\n\n### Private Newsletters\n\nIn order to download the full text of private newsletters you need to provide the cookie name and value of your session.\nThe cookie name is either `substack.sid` or `connect.sid`, based on your cookie.\nTo get the cookie value you can use the developer tools of your browser.\nOnce you have the cookie name and value, you can pass them to the downloader using the `--cookie_name` and `--cookie_val` flags.\n\n#### Example\n\n```bash\nsbstck-dl download --url https://example.substack.com --cookie_name substack.sid --cookie_val COOKIE_VALUE\n```\n\n## Thanks\n\n- [wemoveon2](https://github.com/wemoveon2) and [lenzj](https://github.com/lenzj) for the discussion and help implementing the support for private newsletters\n\n## TODO\n\n- [x] Improve retry logic\n- [ ] Implement loading from config file\n- [x] Add support for downloading images\n- [x] Add support for 
downloading file attachments\n- [x] Add archive index page functionality\n- [x] Add tests\n- [x] Add CI\n- [x] Add documentation\n- [x] Add support for private newsletters\n- [x] Implement filtering by date\n- [x] Implement resuming downloads\n"
  },
  {
    "path": "cmd/cmd_test.go",
    "content": "package cmd\n\nimport (\n\t\"net/url\"\n\t\"os\"\n\t\"testing\"\n\n\t\"github.com/alexferrari88/sbstck-dl/lib\"\n\t\"github.com/stretchr/testify/assert\"\n\t\"github.com/stretchr/testify/require\"\n)\n\n// Test parseURL function\nfunc TestParseURL(t *testing.T) {\n\ttests := []struct {\n\t\tname        string\n\t\tinput       string\n\t\texpectError bool\n\t\texpectedURL *url.URL\n\t}{\n\t\t{\n\t\t\tname:        \"valid https URL\",\n\t\t\tinput:       \"https://example.substack.com\",\n\t\t\texpectError: false,\n\t\t\texpectedURL: &url.URL{\n\t\t\t\tScheme: \"https\",\n\t\t\t\tHost:   \"example.substack.com\",\n\t\t\t},\n\t\t},\n\t\t{\n\t\t\tname:        \"valid http URL\",\n\t\t\tinput:       \"http://example.substack.com\",\n\t\t\texpectError: false,\n\t\t\texpectedURL: &url.URL{\n\t\t\t\tScheme: \"http\",\n\t\t\t\tHost:   \"example.substack.com\",\n\t\t\t},\n\t\t},\n\t\t{\n\t\t\tname:        \"URL with path\",\n\t\t\tinput:       \"https://example.substack.com/p/test-post\",\n\t\t\texpectError: false,\n\t\t\texpectedURL: &url.URL{\n\t\t\t\tScheme: \"https\",\n\t\t\t\tHost:   \"example.substack.com\",\n\t\t\t\tPath:   \"/p/test-post\",\n\t\t\t},\n\t\t},\n\t\t{\n\t\t\tname:        \"invalid URL - no scheme\",\n\t\t\tinput:       \"example.substack.com\",\n\t\t\texpectError: true,\n\t\t},\n\t\t{\n\t\t\tname:        \"invalid URL - no host\",\n\t\t\tinput:       \"https://\",\n\t\t\texpectError: true, // parseURL returns nil, nil for this case\n\t\t},\n\t\t{\n\t\t\tname:        \"invalid URL - malformed\",\n\t\t\tinput:       \"not-a-url\",\n\t\t\texpectError: true,\n\t\t},\n\t\t{\n\t\t\tname:        \"empty string\",\n\t\t\tinput:       \"\",\n\t\t\texpectError: true,\n\t\t},\n\t}\n\n\tfor _, tt := range tests {\n\t\tt.Run(tt.name, func(t *testing.T) {\n\t\t\tresult, err := parseURL(tt.input)\n\t\t\t\n\t\t\tif tt.expectError {\n\t\t\t\t// For this specific case, parseURL returns nil, nil which means no error but also no result\n\t\t\t\tif result == 
nil {\n\t\t\t\t\tassert.True(t, true) // This is the expected behavior for invalid URLs\n\t\t\t\t} else {\n\t\t\t\t\tassert.Error(t, err)\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\trequire.NotNil(t, result)\n\t\t\t\tassert.Equal(t, tt.expectedURL.Scheme, result.Scheme)\n\t\t\t\tassert.Equal(t, tt.expectedURL.Host, result.Host)\n\t\t\t\tif tt.expectedURL.Path != \"\" {\n\t\t\t\t\tassert.Equal(t, tt.expectedURL.Path, result.Path)\n\t\t\t\t}\n\t\t\t}\n\t\t})\n\t}\n}\n\n// Test makeDateFilterFunc function\nfunc TestMakeDateFilterFunc(t *testing.T) {\n\ttests := []struct {\n\t\tname       string\n\t\tbeforeDate string\n\t\tafterDate  string\n\t\ttestDates  map[string]bool // date -> expected result\n\t}{\n\t\t{\n\t\t\tname:       \"no filters\",\n\t\t\tbeforeDate: \"\",\n\t\t\tafterDate:  \"\",\n\t\t\ttestDates: map[string]bool{\n\t\t\t\t\"2023-01-01\": true,\n\t\t\t\t\"2023-06-15\": true,\n\t\t\t\t\"2023-12-31\": true,\n\t\t\t},\n\t\t},\n\t\t{\n\t\t\tname:       \"before filter only\",\n\t\t\tbeforeDate: \"2023-06-15\",\n\t\t\tafterDate:  \"\",\n\t\t\ttestDates: map[string]bool{\n\t\t\t\t\"2023-01-01\": true,\n\t\t\t\t\"2023-06-14\": true,\n\t\t\t\t\"2023-06-15\": false,\n\t\t\t\t\"2023-06-16\": false,\n\t\t\t\t\"2023-12-31\": false,\n\t\t\t},\n\t\t},\n\t\t{\n\t\t\tname:       \"after filter only\",\n\t\t\tbeforeDate: \"\",\n\t\t\tafterDate:  \"2023-06-15\",\n\t\t\ttestDates: map[string]bool{\n\t\t\t\t\"2023-01-01\": false,\n\t\t\t\t\"2023-06-14\": false,\n\t\t\t\t\"2023-06-15\": false,\n\t\t\t\t\"2023-06-16\": true,\n\t\t\t\t\"2023-12-31\": true,\n\t\t\t},\n\t\t},\n\t\t{\n\t\t\tname:       \"both filters\",\n\t\t\tbeforeDate: \"2023-12-31\",\n\t\t\tafterDate:  \"2023-01-01\",\n\t\t\ttestDates: map[string]bool{\n\t\t\t\t\"2022-12-31\": false,\n\t\t\t\t\"2023-01-01\": false,\n\t\t\t\t\"2023-06-15\": true,\n\t\t\t\t\"2023-12-30\": true,\n\t\t\t\t\"2023-12-31\": false,\n\t\t\t\t\"2024-01-01\": false,\n\t\t\t},\n\t\t},\n\t}\n\n\tfor _, tt := range 
tests {\n\t\tt.Run(tt.name, func(t *testing.T) {\n\t\t\tfilterFunc := makeDateFilterFunc(tt.beforeDate, tt.afterDate)\n\t\t\t\n\t\t\tif tt.beforeDate == \"\" && tt.afterDate == \"\" {\n\t\t\t\t// No filter should return nil\n\t\t\t\tassert.Nil(t, filterFunc)\n\t\t\t} else {\n\t\t\t\trequire.NotNil(t, filterFunc)\n\t\t\t\t\n\t\t\t\tfor date, expected := range tt.testDates {\n\t\t\t\t\tresult := filterFunc(date)\n\t\t\t\t\tassert.Equal(t, expected, result, \"Date %s should return %v\", date, expected)\n\t\t\t\t}\n\t\t\t}\n\t\t})\n\t}\n}\n\n// Test makePath function\nfunc TestMakePath(t *testing.T) {\n\tpost := lib.Post{\n\t\tPostDate: \"2023-01-01T10:30:00.000Z\", // Use RFC3339 format\n\t\tSlug:     \"test-post\",\n\t}\n\n\ttests := []struct {\n\t\tname         string\n\t\tpost         lib.Post\n\t\toutputFolder string\n\t\tformat       string\n\t\texpected     string\n\t}{\n\t\t{\n\t\t\tname:         \"basic path\",\n\t\t\tpost:         post,\n\t\t\toutputFolder: \"/tmp/downloads\",\n\t\t\tformat:       \"html\",\n\t\t\texpected:     \"/tmp/downloads/20230101_103000_test-post.html\",\n\t\t},\n\t\t{\n\t\t\tname:         \"markdown format\",\n\t\t\tpost:         post,\n\t\t\toutputFolder: \"/tmp/downloads\",\n\t\t\tformat:       \"md\",\n\t\t\texpected:     \"/tmp/downloads/20230101_103000_test-post.md\",\n\t\t},\n\t\t{\n\t\t\tname:         \"text format\",\n\t\t\tpost:         post,\n\t\t\toutputFolder: \"/tmp/downloads\",\n\t\t\tformat:       \"txt\",\n\t\t\texpected:     \"/tmp/downloads/20230101_103000_test-post.txt\",\n\t\t},\n\t\t{\n\t\t\tname:         \"no output folder\",\n\t\t\tpost:         post,\n\t\t\toutputFolder: \"\",\n\t\t\tformat:       \"html\",\n\t\t\texpected:     \"/20230101_103000_test-post.html\",\n\t\t},\n\t}\n\n\tfor _, tt := range tests {\n\t\tt.Run(tt.name, func(t *testing.T) {\n\t\t\tresult := makePath(tt.post, tt.outputFolder, tt.format)\n\t\t\tassert.Equal(t, tt.expected, result)\n\t\t})\n\t}\n}\n\n// Test convertDateTime function\nfunc 
TestConvertDateTime(t *testing.T) {\n\ttests := []struct {\n\t\tname     string\n\t\tinput    string\n\t\texpected string\n\t}{\n\t\t{\n\t\t\tname:     \"basic date\", \n\t\t\tinput:    \"2023-01-01\",\n\t\t\texpected: \"\", // Invalid format, should return empty string\n\t\t},\n\t\t{\n\t\t\tname:     \"date with time\",\n\t\t\tinput:    \"2023-01-01T10:30:00.000Z\",\n\t\t\texpected: \"20230101_103000\",\n\t\t},\n\t\t{\n\t\t\tname:     \"different date format\",\n\t\t\tinput:    \"2023-12-31T23:59:59.999Z\",\n\t\t\texpected: \"20231231_235959\",\n\t\t},\n\t\t{\n\t\t\tname:     \"empty string\",\n\t\t\tinput:    \"\",\n\t\t\texpected: \"\",\n\t\t},\n\t}\n\n\tfor _, tt := range tests {\n\t\tt.Run(tt.name, func(t *testing.T) {\n\t\t\tresult := convertDateTime(tt.input)\n\t\t\tassert.Equal(t, tt.expected, result)\n\t\t})\n\t}\n}\n\n// Test extractSlug function\nfunc TestExtractSlug(t *testing.T) {\n\ttests := []struct {\n\t\tname     string\n\t\tinput    string\n\t\texpected string\n\t}{\n\t\t{\n\t\t\tname:     \"basic substack URL\",\n\t\t\tinput:    \"https://example.substack.com/p/test-post\",\n\t\t\texpected: \"test-post\",\n\t\t},\n\t\t{\n\t\t\tname:     \"URL with query parameters\",\n\t\t\tinput:    \"https://example.substack.com/p/test-post?utm_source=newsletter\",\n\t\t\texpected: \"test-post?utm_source=newsletter\", // extractSlug doesn't handle query params\n\t\t},\n\t\t{\n\t\t\tname:     \"URL with anchor\",\n\t\t\tinput:    \"https://example.substack.com/p/test-post#comments\",\n\t\t\texpected: \"test-post#comments\", // extractSlug doesn't handle anchors\n\t\t},\n\t\t{\n\t\t\tname:     \"URL with trailing slash\",\n\t\t\tinput:    \"https://example.substack.com/p/test-post/\",\n\t\t\texpected: \"\", // extractSlug returns empty string for trailing slash\n\t\t},\n\t\t{\n\t\t\tname:     \"complex slug with dashes\",\n\t\t\tinput:    \"https://example.substack.com/p/this-is-a-very-long-post-title\",\n\t\t\texpected: 
\"this-is-a-very-long-post-title\",\n\t\t},\n\t\t{\n\t\t\tname:     \"no /p/ in URL\",\n\t\t\tinput:    \"https://example.substack.com/test-post\",\n\t\t\texpected: \"test-post\", // extractSlug just returns the last segment\n\t\t},\n\t\t{\n\t\t\tname:     \"empty string\",\n\t\t\tinput:    \"\",\n\t\t\texpected: \"\",\n\t\t},\n\t}\n\n\tfor _, tt := range tests {\n\t\tt.Run(tt.name, func(t *testing.T) {\n\t\t\tresult := extractSlug(tt.input)\n\t\t\tassert.Equal(t, tt.expected, result)\n\t\t})\n\t}\n}\n\n// Test cookieName type\nfunc TestCookieName(t *testing.T) {\n\tt.Run(\"String method\", func(t *testing.T) {\n\t\tcn := cookieName(\"test-cookie\")\n\t\tassert.Equal(t, \"test-cookie\", cn.String())\n\t})\n\n\tt.Run(\"Type method\", func(t *testing.T) {\n\t\tcn := cookieName(\"\")\n\t\tassert.Equal(t, \"cookieName\", cn.Type())\n\t})\n\n\tt.Run(\"Set method - valid values\", func(t *testing.T) {\n\t\tvalidNames := []string{\"substack.sid\", \"connect.sid\"}\n\t\t\n\t\tfor _, name := range validNames {\n\t\t\tcn := cookieName(\"\")\n\t\t\terr := cn.Set(name)\n\t\t\tassert.NoError(t, err)\n\t\t\tassert.Equal(t, name, cn.String())\n\t\t}\n\t})\n\n\tt.Run(\"Set method - invalid values\", func(t *testing.T) {\n\t\tinvalidNames := []string{\"invalid\", \"session\", \"auth\", \"\"}\n\t\t\n\t\tfor _, name := range invalidNames {\n\t\t\tcn := cookieName(\"\")\n\t\t\terr := cn.Set(name)\n\t\t\tassert.Error(t, err)\n\t\t\tassert.Contains(t, err.Error(), \"invalid cookie name\")\n\t\t}\n\t})\n}\n\n// Test that we can create paths and handle files correctly\nfunc TestFileHandling(t *testing.T) {\n\t// Create a temporary directory for testing\n\ttempDir := t.TempDir()\n\t\n\t// Create a test file\n\texistingFile := tempDir + \"/existing.html\"\n\tpost := lib.Post{Title: \"Test\", BodyHTML: \"<p>Test content</p>\"}\n\terr := post.WriteToFile(existingFile, \"html\", false)\n\trequire.NoError(t, err)\n\n\t// Test that file was created successfully\n\t_, err = 
os.Stat(existingFile)\n\tassert.NoError(t, err)\n\t\n\t// Test path creation\n\ttestPost := lib.Post{PostDate: \"2023-01-01T10:30:00.000Z\", Slug: \"test-post\"}\n\tpath := makePath(testPost, tempDir, \"html\")\n\texpectedPath := tempDir + \"/20230101_103000_test-post.html\"\n\tassert.Equal(t, expectedPath, path)\n}\n\n// Test time parsing and formatting\nfunc TestTimeFormatting(t *testing.T) {\n\tt.Run(\"convertDateTime with various formats\", func(t *testing.T) {\n\t\t// Test the actual time parsing logic\n\t\ttestCases := []struct {\n\t\t\tinput    string\n\t\t\texpected string\n\t\t}{\n\t\t\t{\"2023-01-01T10:30:00.000Z\", \"20230101_103000\"},\n\t\t\t{\"2023-01-01T10:30:00Z\", \"20230101_103000\"},\n\t\t\t{\"2023-01-01\", \"\"}, // Invalid format, should return empty string\n\t\t\t{\"2023-12-31T23:59:59.999Z\", \"20231231_235959\"},\n\t\t}\n\n\t\tfor _, tc := range testCases {\n\t\t\tresult := convertDateTime(tc.input)\n\t\t\tassert.Equal(t, tc.expected, result)\n\t\t}\n\t})\n}\n\n// Integration test for date filtering\nfunc TestDateFilteringIntegration(t *testing.T) {\n\tt.Run(\"date filter with actual dates\", func(t *testing.T) {\n\t\t// Test the interaction between date filtering and URL processing\n\t\tbeforeDate := \"2023-06-15\"\n\t\tafterDate := \"2023-01-01\"\n\t\t\n\t\tfilterFunc := makeDateFilterFunc(beforeDate, afterDate)\n\t\trequire.NotNil(t, filterFunc)\n\t\t\n\t\t// Test dates within range\n\t\tassert.True(t, filterFunc(\"2023-03-15\"))\n\t\tassert.True(t, filterFunc(\"2023-06-14\"))\n\t\t\n\t\t// Test dates outside range\n\t\tassert.False(t, filterFunc(\"2022-12-31\"))\n\t\tassert.False(t, filterFunc(\"2023-01-01\"))\n\t\tassert.False(t, filterFunc(\"2023-06-15\"))\n\t\tassert.False(t, filterFunc(\"2023-12-31\"))\n\t})\n}\n\n// Test constants\nfunc TestConstants(t *testing.T) {\n\tt.Run(\"cookie name constants\", func(t *testing.T) {\n\t\tassert.Equal(t, \"substack.sid\", string(substackSid))\n\t\tassert.Equal(t, \"connect.sid\", 
string(connectSid))\n\t})\n}"
  },
  {
    "path": "cmd/download.go",
    "content": "package cmd\n\nimport (\n\t\"fmt\"\n\t\"log\"\n\t\"net/url\"\n\t\"path/filepath\"\n\t\"strings\"\n\t\"time\"\n\n\t\"github.com/alexferrari88/sbstck-dl/lib\"\n\t\"github.com/schollz/progressbar/v3\"\n\t\"github.com/spf13/cobra\"\n)\n\n// downloadCmd represents the download command\nvar (\n\tdownloadUrl    string\n\tformat         string\n\toutputFolder   string\n\tdryRun         bool\n\taddSourceURL   bool\n\tdownloadImages bool\n\timageQuality   string\n\timagesDir      string\n\tdownloadFiles  bool\n\tfileExtensions string\n\tfilesDir       string\n\tcreateArchive  bool\n\tdownloadCmd    = &cobra.Command{\n\t\tUse:   \"download\",\n\t\tShort: \"Download individual posts or the entire public archive\",\n\t\tLong:  `You can provide the url of a single post or the main url of the Substack you want to download.`,\n\t\tRun: func(cmd *cobra.Command, args []string) {\n\t\t\tstartTime := time.Now()\n\t\t\t\n\t\t\t// Create archive instance if flag is set\n\t\t\tvar archive *lib.Archive\n\t\t\tif createArchive {\n\t\t\t\tarchive = lib.NewArchive()\n\t\t\t}\n\n\t\t\t// if url contains \"/p/\", we are downloading a single post\n\t\t\tif strings.Contains(downloadUrl, \"/p/\") {\n\t\t\t\tif verbose {\n\t\t\t\t\tfmt.Printf(\"Downloading post %s\\n\", downloadUrl)\n\t\t\t\t}\n\t\t\t\tif dryRun {\n\t\t\t\t\tfmt.Println(\"Dry run, exiting...\")\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\t\tif (beforeDate != \"\" || afterDate != \"\") && verbose {\n\t\t\t\t\tfmt.Println(\"Warning: --before and --after flags are ignored when downloading a single post\")\n\t\t\t\t}\n\n\t\t\t\tpost, err := extractor.ExtractPost(ctx, downloadUrl)\n\t\t\t\tif err != nil {\n\t\t\t\t\tlog.Fatalln(err)\n\t\t\t\t}\n\t\t\t\tdownloadTime := time.Since(startTime)\n\t\t\t\tif verbose {\n\t\t\t\t\tfmt.Printf(\"Downloaded post %s in %s\\n\", downloadUrl, downloadTime)\n\t\t\t\t}\n\n\t\t\t\tpath := makePath(post, outputFolder, format)\n\t\t\t\tif verbose {\n\t\t\t\t\tfmt.Printf(\"Writing post to file 
%s\\n\", path)\n\t\t\t\t}\n\n\t\t\t\tif downloadImages || downloadFiles {\n\t\t\t\t\timageQualityEnum := lib.ImageQuality(imageQuality)\n\t\t\t\t\t// Parse file extensions if specified\n\t\t\t\t\tvar fileExtensionsSlice []string\n\t\t\t\t\tif fileExtensions != \"\" {\n\t\t\t\t\t\tfileExtensionsSlice = strings.Split(strings.ReplaceAll(fileExtensions, \" \", \"\"), \",\")\n\t\t\t\t\t}\n\t\t\t\t\timageResult, err := post.WriteToFileWithImages(ctx, path, format, addSourceURL, downloadImages, imageQualityEnum, imagesDir, downloadFiles, fileExtensionsSlice, filesDir, fetcher)\n\t\t\t\t\tif err != nil {\n\t\t\t\t\t\tlog.Printf(\"Error writing file %s: %v\\n\", path, err)\n\t\t\t\t\t} else if verbose && imageResult.Success > 0 {\n\t\t\t\t\t\tfmt.Printf(\"Downloaded %d images (%d failed) for post %s\\n\", imageResult.Success, imageResult.Failed, post.Slug)\n\t\t\t\t\t}\n\t\t\t\t} else {\n\t\t\t\t\terr = post.WriteToFile(path, format, addSourceURL)\n\t\t\t\t\tif err != nil {\n\t\t\t\t\t\tlog.Printf(\"Error writing file %s: %v\\n\", path, err)\n\t\t\t\t\t}\n\t\t\t\t}\n\n\t\t\t\t// Add to archive if enabled\n\t\t\t\tif archive != nil {\n\t\t\t\t\tarchive.AddEntry(post, path, startTime)\n\t\t\t\t}\n\n\t\t\t\tif verbose {\n\t\t\t\t\tfmt.Println(\"Done in \", time.Since(startTime))\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\t// we are downloading the entire archive\n\t\t\t\tvar downloadedPostsCount int\n\t\t\t\tdateFilterfunc := makeDateFilterFunc(beforeDate, afterDate)\n\t\t\t\turls, err := extractor.GetAllPostsURLs(ctx, downloadUrl, dateFilterfunc)\n\t\t\t\turlsCount := len(urls)\n\t\t\t\tif err != nil {\n\t\t\t\t\tlog.Fatalln(err)\n\t\t\t\t}\n\t\t\t\tif urlsCount == 0 {\n\t\t\t\t\tif verbose {\n\t\t\t\t\t\tfmt.Println(\"No posts found, exiting...\")\n\t\t\t\t\t}\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\t\tif verbose {\n\t\t\t\t\tfmt.Printf(\"Found %d posts\\n\", urlsCount)\n\t\t\t\t}\n\t\t\t\tif dryRun {\n\t\t\t\t\tfmt.Printf(\"Found %d posts\\n\", urlsCount)\n\t\t\t\t\tfmt.Println(\"Dry 
run, exiting...\")\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\t\turls, err = filterExistingPosts(urls, outputFolder, format)\n\t\t\t\tif err != nil {\n\t\t\t\t\tif verbose {\n\t\t\t\t\t\tfmt.Println(\"Error filtering existing posts:\", err)\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif len(urls) == 0 {\n\t\t\t\t\tif verbose {\n\t\t\t\t\t\tfmt.Println(\"No new posts found, exiting...\")\n\t\t\t\t\t}\n\t\t\t\t\treturn\n\t\t\t\t}\n\t\t\t\tbar := progressbar.NewOptions(len(urls),\n\t\t\t\t\tprogressbar.OptionSetWidth(25),\n\t\t\t\t\tprogressbar.OptionSetDescription(\"downloading\"),\n\t\t\t\t\tprogressbar.OptionShowBytes(true))\n\t\t\t\tfor result := range extractor.ExtractAllPosts(ctx, urls) {\n\t\t\t\t\tselect {\n\t\t\t\t\tcase <-ctx.Done():\n\t\t\t\t\t\tlog.Fatalln(\"context cancelled\")\n\t\t\t\t\tdefault:\n\t\t\t\t\t}\n\t\t\t\t\tif result.Err != nil {\n\t\t\t\t\t\tif verbose {\n\t\t\t\t\t\t\tfmt.Printf(\"Error downloading post %s: %s\\n\", result.Post.CanonicalUrl, result.Err)\n\t\t\t\t\t\t\tfmt.Println(\"Skipping...\")\n\t\t\t\t\t\t}\n\t\t\t\t\t\tcontinue\n\t\t\t\t\t}\n\t\t\t\t\tbar.Add(1)\n\t\t\t\t\tdownloadedPostsCount++\n\t\t\t\t\tif verbose {\n\t\t\t\t\t\tfmt.Printf(\"Downloading post %s\\n\", result.Post.CanonicalUrl)\n\t\t\t\t\t}\n\t\t\t\t\tpost := result.Post\n\n\t\t\t\t\tpath := makePath(post, outputFolder, format)\n\t\t\t\t\tif verbose {\n\t\t\t\t\t\tfmt.Printf(\"Writing post to file %s\\n\", path)\n\t\t\t\t\t}\n\n\t\t\t\t\tif downloadImages || downloadFiles {\n\t\t\t\t\t\timageQualityEnum := lib.ImageQuality(imageQuality)\n\t\t\t\t\t\t// Parse file extensions if specified\n\t\t\t\t\t\tvar fileExtensionsSlice []string\n\t\t\t\t\t\tif fileExtensions != \"\" {\n\t\t\t\t\t\t\tfileExtensionsSlice = strings.Split(strings.ReplaceAll(fileExtensions, \" \", \"\"), \",\")\n\t\t\t\t\t\t}\n\t\t\t\t\t\timageResult, err := post.WriteToFileWithImages(ctx, path, format, addSourceURL, downloadImages, imageQualityEnum, imagesDir, downloadFiles, fileExtensionsSlice, filesDir, 
fetcher)\n\t\t\t\t\t\tif err != nil {\n\t\t\t\t\t\t\tlog.Printf(\"Error writing file %s: %v\\n\", path, err)\n\t\t\t\t\t\t} else if verbose && imageResult.Success > 0 {\n\t\t\t\t\t\t\tfmt.Printf(\"Downloaded %d images (%d failed) for post %s\\n\", imageResult.Success, imageResult.Failed, post.Slug)\n\t\t\t\t\t\t}\n\t\t\t\t\t} else {\n\t\t\t\t\t\terr = post.WriteToFile(path, format, addSourceURL)\n\t\t\t\t\t\tif err != nil {\n\t\t\t\t\t\t\tlog.Printf(\"Error writing file %s: %v\\n\", path, err)\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\n\t\t\t\t\t// Add to archive if enabled; note the entry is added even when the write above failed\n\t\t\t\t\tif archive != nil {\n\t\t\t\t\t\tarchive.AddEntry(post, path, time.Now())\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t\tif verbose {\n\t\t\t\t\tfmt.Println(\"Downloaded\", downloadedPostsCount, \"posts, out of\", len(urls))\n\t\t\t\t\tfmt.Println(\"Done in \", time.Since(startTime))\n\t\t\t\t}\n\t\t\t}\n\n\t\t\t// Generate archive page if enabled\n\t\t\tif archive != nil && len(archive.Entries) > 0 {\n\t\t\t\tif verbose {\n\t\t\t\t\tfmt.Printf(\"Generating archive page in %s format...\\n\", format)\n\t\t\t\t}\n\t\t\t\t\n\t\t\t\tvar archiveErr error\n\t\t\t\tswitch format {\n\t\t\t\tcase \"html\":\n\t\t\t\t\tarchiveErr = archive.GenerateHTML(outputFolder)\n\t\t\t\tcase \"md\":\n\t\t\t\t\tarchiveErr = archive.GenerateMarkdown(outputFolder)\n\t\t\t\tcase \"txt\":\n\t\t\t\t\tarchiveErr = archive.GenerateText(outputFolder)\n\t\t\t\tdefault:\n\t\t\t\t\tarchiveErr = fmt.Errorf(\"unknown format for archive: %s\", format)\n\t\t\t\t}\n\t\t\t\t\n\t\t\t\tif archiveErr != nil {\n\t\t\t\t\tlog.Printf(\"Error generating archive page: %v\\n\", archiveErr)\n\t\t\t\t} else if verbose {\n\t\t\t\t\tfmt.Printf(\"Archive page generated: %s/index.%s\\n\", outputFolder, format)\n\t\t\t\t}\n\t\t\t}\n\t\t},\n\t}\n)\n\nfunc init() {\n\tdownloadCmd.Flags().StringVarP(&downloadUrl, \"url\", \"u\", \"\", \"Specify the Substack url\")\n\tdownloadCmd.Flags().StringVarP(&format, \"format\", \"f\", \"html\", 
\"Specify the output format (options: \\\"html\\\", \\\"md\\\", \\\"txt\\\")\")\n\tdownloadCmd.Flags().StringVarP(&outputFolder, \"output\", \"o\", \".\", \"Specify the download directory\")\n\tdownloadCmd.Flags().BoolVarP(&dryRun, \"dry-run\", \"d\", false, \"Enable dry run\")\n\tdownloadCmd.Flags().BoolVar(&addSourceURL, \"add-source-url\", false, \"Add the original post URL at the end of the downloaded file\")\n\tdownloadCmd.Flags().BoolVar(&downloadImages, \"download-images\", false, \"Download images locally and update content to reference local files\")\n\tdownloadCmd.Flags().StringVar(&imageQuality, \"image-quality\", \"high\", \"Image quality to download (options: \\\"high\\\", \\\"medium\\\", \\\"low\\\")\")\n\tdownloadCmd.Flags().StringVar(&imagesDir, \"images-dir\", \"images\", \"Directory name for downloaded images\")\n\tdownloadCmd.Flags().BoolVar(&downloadFiles, \"download-files\", false, \"Download file attachments locally and update content to reference local files\")\n\tdownloadCmd.Flags().StringVar(&fileExtensions, \"file-extensions\", \"\", \"Comma-separated list of file extensions to download (e.g., 'pdf,docx,txt'). 
If empty, downloads all file types\")\n\tdownloadCmd.Flags().StringVar(&filesDir, \"files-dir\", \"files\", \"Directory name for downloaded file attachments\")\n\tdownloadCmd.Flags().BoolVar(&createArchive, \"create-archive\", false, \"Create an archive index page linking all downloaded posts\")\n\tdownloadCmd.MarkFlagRequired(\"url\")\n}\n\n// convertDateTime converts an RFC3339 timestamp to the YYYYMMDD_HHMMSS form used in file names.\n// It returns an empty string if the input is not valid RFC3339.\nfunc convertDateTime(datetime string) string {\n\t// Parse the datetime string\n\tparsedTime, err := time.Parse(time.RFC3339, datetime)\n\tif err != nil {\n\t\t// Return an empty string if parsing fails\n\t\treturn \"\"\n\t}\n\n\t// Format the datetime to the desired format\n\tformattedDateTime := fmt.Sprintf(\"%d%02d%02d_%02d%02d%02d\",\n\t\tparsedTime.Year(), parsedTime.Month(), parsedTime.Day(),\n\t\tparsedTime.Hour(), parsedTime.Minute(), parsedTime.Second())\n\n\treturn formattedDateTime\n}\n\n// parseURL validates toTest and returns it as a *url.URL.\n// It returns a non-nil error when the input cannot be parsed or lacks a scheme or host.\nfunc parseURL(toTest string) (*url.URL, error) {\n\tif _, err := url.ParseRequestURI(toTest); err != nil {\n\t\treturn nil, err\n\t}\n\n\tu, err := url.Parse(toTest)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\t// Parsing can succeed for inputs such as \"https://\"; report them as errors\n\t// instead of returning a nil URL together with a nil error.\n\tif u.Scheme == \"\" || u.Host == \"\" {\n\t\treturn nil, fmt.Errorf(\"invalid URL %q: missing scheme or host\", toTest)\n\t}\n\n\treturn u, nil\n}\n\nfunc makePath(post lib.Post, outputFolder string, format string) string {\n\treturn fmt.Sprintf(\"%s/%s_%s.%s\", outputFolder, convertDateTime(post.PostDate), post.Slug, format)\n}\n\n// extractSlug extracts the slug from a Substack post URL\n// e.g. 
https://example.substack.com/p/this-is-the-post-title -> this-is-the-post-title\nfunc extractSlug(url string) string {\n\tsplit := strings.Split(url, \"/\")\n\treturn split[len(split)-1]\n}\n\n// filterExistingPosts filters out posts that already exist in the output folder.\n// It looks for files whose name ends with the post slug.\nfunc filterExistingPosts(urls []string, outputFolder string, format string) ([]string, error) {\n\tvar filtered []string\n\tfor _, url := range urls {\n\t\tslug := extractSlug(url)\n\t\tpath := fmt.Sprintf(\"%s/%s_%s.%s\", outputFolder, \"*\", slug, format)\n\t\tmatches, err := filepath.Glob(path)\n\t\tif err != nil {\n\t\t\treturn urls, err\n\t\t}\n\t\tif len(matches) == 0 {\n\t\t\tfiltered = append(filtered, url)\n\t\t}\n\t}\n\treturn filtered, nil\n}\n"
  },
  {
    "path": "cmd/integration_test.go",
    "content": "package cmd\n\nimport (\n\t\"bytes\"\n\t\"context\"\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"net/http\"\n\t\"net/http/httptest\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"strings\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/alexferrari88/sbstck-dl/lib\"\n\t\"github.com/spf13/cobra\"\n\t\"github.com/stretchr/testify/assert\"\n\t\"github.com/stretchr/testify/require\"\n)\n\n// Test command execution in isolation\nfunc TestCommandExecution(t *testing.T) {\n\t// Skip in short test mode\n\tif testing.Short() {\n\t\tt.Skip(\"Skipping integration test in short mode\")\n\t}\n\n\t// Create a mock server that serves a simple post\n\tmockPost := lib.Post{\n\t\tId:           123,\n\t\tTitle:        \"Test Post\",\n\t\tSlug:         \"test-post\",\n\t\tPostDate:     \"2023-01-01\",\n\t\tBodyHTML:     \"<p>This is a test post</p>\",\n\t\tCanonicalUrl: \"https://example.substack.com/p/test-post\",\n\t}\n\n\t// Create sitemap XML\n\tsitemapXML := `<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n  <url>\n    <loc>https://example.substack.com/p/test-post</loc>\n    <lastmod>2023-01-01</lastmod>\n  </url>\n</urlset>`\n\n\t// Create mock HTML with embedded JSON\n\tpostWrapper := lib.PostWrapper{Post: mockPost}\n\tjsonBytes, _ := json.Marshal(postWrapper)\n\tescapedJSON := strings.ReplaceAll(string(jsonBytes), `\"`, `\\\"`)\n\tmockHTML := fmt.Sprintf(`\n<!DOCTYPE html>\n<html>\n<head><title>%s</title></head>\n<body>\n  <script>\n    window._preloads = JSON.parse(\"%s\")\n  </script>\n</body>\n</html>\n`, mockPost.Title, escapedJSON)\n\n\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\tpath := r.URL.Path\n\t\tif path == \"/sitemap.xml\" {\n\t\t\tw.Header().Set(\"Content-Type\", \"application/xml\")\n\t\t\tw.Write([]byte(sitemapXML))\n\t\t} else if path == \"/p/test-post\" {\n\t\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\t\tw.Write([]byte(mockHTML))\n\t\t} 
else {\n\t\t\tw.WriteHeader(http.StatusNotFound)\n\t\t}\n\t}))\n\tdefer server.Close()\n\n\t// Test version command\n\tt.Run(\"version command\", func(t *testing.T) {\n\t\t// Capture stdout\n\t\tvar output bytes.Buffer\n\t\t\n\t\t// Create a command that executes the version logic\n\t\tcmd := &cobra.Command{\n\t\t\tUse: \"test-version\",\n\t\t\tRun: func(cmd *cobra.Command, args []string) {\n\t\t\t\toutput.WriteString(\"sbstck-dl v0.4.0\\n\")\n\t\t\t},\n\t\t}\n\t\t\n\t\terr := cmd.Execute()\n\t\tassert.NoError(t, err)\n\t\tassert.Contains(t, output.String(), \"sbstck-dl v0.4.0\")\n\t})\n\n\t// Test list command\n\tt.Run(\"list command\", func(t *testing.T) {\n\t\t// Reset global variables\n\t\tpubUrl = server.URL\n\t\tverbose = false\n\t\tbeforeDate = \"\"\n\t\tafterDate = \"\"\n\t\t\n\t\t// Initialize fetcher and extractor\n\t\tfetcher = lib.NewFetcher()\n\t\textractor = lib.NewExtractor(fetcher)\n\t\tctx = context.Background()\n\t\t\n\t\t// Create a new command to capture output\n\t\tvar output bytes.Buffer\n\t\tcmd := &cobra.Command{\n\t\t\tUse: \"test-list\",\n\t\t\tRun: func(cmd *cobra.Command, args []string) {\n\t\t\t\t// Simulate list command logic\n\t\t\t\turls, err := extractor.GetAllPostsURLs(ctx, pubUrl, nil)\n\t\t\t\tif err != nil {\n\t\t\t\t\tt.Fatalf(\"Failed to get URLs: %v\", err)\n\t\t\t\t}\n\t\t\t\tfor _, url := range urls {\n\t\t\t\t\toutput.WriteString(url + \"\\n\")\n\t\t\t\t}\n\t\t\t},\n\t\t}\n\t\t\n\t\terr := cmd.Execute()\n\t\tassert.NoError(t, err)\n\t\t\n\t\t// Check that it outputs the post URL\n\t\tassert.Contains(t, output.String(), \"https://example.substack.com/p/test-post\")\n\t})\n\n\t// Test single post download\n\tt.Run(\"single post download\", func(t *testing.T) {\n\t\ttempDir := t.TempDir()\n\t\t\n\t\t// Reset global variables\n\t\tdownloadUrl = server.URL + \"/p/test-post\"\n\t\toutputFolder = tempDir\n\t\tformat = \"html\"\n\t\tdryRun = false\n\t\tverbose = false\n\t\taddSourceURL = false\n\t\t\n\t\t// Initialize fetcher and 
extractor\n\t\tfetcher = lib.NewFetcher()\n\t\textractor = lib.NewExtractor(fetcher)\n\t\tctx = context.Background()\n\t\t\n\t\t// Create a new command\n\t\tcmd := &cobra.Command{\n\t\t\tUse: \"test-download\",\n\t\t\tRun: func(cmd *cobra.Command, args []string) {\n\t\t\t\t// Execute the single post download logic\n\t\t\t\tpost, err := extractor.ExtractPost(ctx, downloadUrl)\n\t\t\t\tif err != nil {\n\t\t\t\t\tt.Fatalf(\"Failed to extract post: %v\", err)\n\t\t\t\t}\n\t\t\t\t\n\t\t\t\t// Write to file\n\t\t\t\tfilePath := makePath(post, outputFolder, format)\n\t\t\t\terr = post.WriteToFile(filePath, format, addSourceURL)\n\t\t\t\tif err != nil {\n\t\t\t\t\tt.Fatalf(\"Failed to write file: %v\", err)\n\t\t\t\t}\n\t\t\t},\n\t\t}\n\t\t\n\t\terr := cmd.Execute()\n\t\tassert.NoError(t, err)\n\t\t\n\t\t// Check that file was created - use the correct expected format\n\t\t// Since mockPost.PostDate is \"2023-01-01\" (not RFC3339), convertDateTime will return \"\"\n\t\texpectedFile := filepath.Join(tempDir, \"_test-post.html\")\n\t\t_, err = os.Stat(expectedFile)\n\t\tassert.NoError(t, err)\n\t\t\n\t\t// Check file content\n\t\tcontent, err := os.ReadFile(expectedFile)\n\t\tassert.NoError(t, err)\n\t\tassert.Contains(t, string(content), \"Test Post\")\n\t\tassert.Contains(t, string(content), \"This is a test post\")\n\t})\n}\n\n// Test command flag parsing\nfunc TestCommandFlags(t *testing.T) {\n\tt.Run(\"root command flags\", func(t *testing.T) {\n\t\t// Test that flags are properly defined\n\t\tcmd := rootCmd\n\t\t\n\t\t// Check persistent flags\n\t\tassert.NotNil(t, cmd.PersistentFlags().Lookup(\"proxy\"))\n\t\tassert.NotNil(t, cmd.PersistentFlags().Lookup(\"verbose\"))\n\t\tassert.NotNil(t, cmd.PersistentFlags().Lookup(\"rate\"))\n\t\tassert.NotNil(t, cmd.PersistentFlags().Lookup(\"cookie_name\"))\n\t\tassert.NotNil(t, cmd.PersistentFlags().Lookup(\"cookie_val\"))\n\t\tassert.NotNil(t, cmd.PersistentFlags().Lookup(\"before\"))\n\t\tassert.NotNil(t, 
cmd.PersistentFlags().Lookup(\"after\"))\n\t})\n\n\tt.Run(\"download command flags\", func(t *testing.T) {\n\t\tcmd := downloadCmd\n\t\t\n\t\t// Check local flags\n\t\tassert.NotNil(t, cmd.Flags().Lookup(\"url\"))\n\t\tassert.NotNil(t, cmd.Flags().Lookup(\"format\"))\n\t\tassert.NotNil(t, cmd.Flags().Lookup(\"output\"))\n\t\tassert.NotNil(t, cmd.Flags().Lookup(\"dry-run\"))\n\t\tassert.NotNil(t, cmd.Flags().Lookup(\"add-source-url\"))\n\t\tassert.NotNil(t, cmd.Flags().Lookup(\"download-images\"))\n\t\tassert.NotNil(t, cmd.Flags().Lookup(\"image-quality\"))\n\t\tassert.NotNil(t, cmd.Flags().Lookup(\"images-dir\"))\n\t\tassert.NotNil(t, cmd.Flags().Lookup(\"download-files\"))\n\t\tassert.NotNil(t, cmd.Flags().Lookup(\"file-extensions\"))\n\t\tassert.NotNil(t, cmd.Flags().Lookup(\"files-dir\"))\n\t\tassert.NotNil(t, cmd.Flags().Lookup(\"create-archive\"))\n\t\t\n\t\t// Test create-archive flag specifically\n\t\tcreateArchiveFlag := cmd.Flags().Lookup(\"create-archive\")\n\t\tassert.Equal(t, \"bool\", createArchiveFlag.Value.Type())\n\t\tassert.Equal(t, \"false\", createArchiveFlag.DefValue)\n\t})\n\n\tt.Run(\"list command flags\", func(t *testing.T) {\n\t\tcmd := listCmd\n\t\t\n\t\t// Check local flags\n\t\tassert.NotNil(t, cmd.Flags().Lookup(\"url\"))\n\t})\n}\n\n// Test command validation\nfunc TestCommandValidation(t *testing.T) {\n\tt.Run(\"invalid proxy URL\", func(t *testing.T) {\n\t\t// Test parseURL with invalid proxy\n\t\t_, err := parseURL(\"invalid-proxy\")\n\t\tassert.Error(t, err)\n\t})\n\n\tt.Run(\"invalid cookie name\", func(t *testing.T) {\n\t\tcn := cookieName(\"\")\n\t\terr := cn.Set(\"invalid-cookie\")\n\t\tassert.Error(t, err)\n\t})\n\n\tt.Run(\"rate validation\", func(t *testing.T) {\n\t\t// Test that rate 0 should fail\n\t\t// This would normally be tested in the PersistentPreRun, but we can test the logic\n\t\tratePerSecond = 0\n\t\tassert.Equal(t, 0, ratePerSecond) // Should be 0 which is invalid\n\t})\n}\n\n// Test error handling\nfunc 
TestErrorHandling(t *testing.T) {\n\tt.Run(\"network error handling\", func(t *testing.T) {\n\t\t// Test with non-existent server\n\t\tfetcher := lib.NewFetcher()\n\t\textractor := lib.NewExtractor(fetcher)\n\t\tctx := context.Background()\n\t\t\n\t\t_, err := extractor.ExtractPost(ctx, \"http://non-existent-server.com/p/test\")\n\t\tassert.Error(t, err)\n\t})\n\n\tt.Run(\"invalid URL format\", func(t *testing.T) {\n\t\t// Test with malformed URL\n\t\t_, err := parseURL(\"://invalid-url\")\n\t\tassert.Error(t, err)\n\t})\n\n\tt.Run(\"file system errors\", func(t *testing.T) {\n\t\t// Test writing to invalid directory\n\t\tpost := lib.Post{\n\t\t\tTitle:    \"Test\",\n\t\t\tBodyHTML: \"<p>Test</p>\",\n\t\t}\n\t\t\n\t\t// Try to write to a file with invalid character (null byte forbidden on both Windows and Unix)\n\t\terr := post.WriteToFile(\"invalid\\x00filename.html\", \"html\", false)\n\t\tassert.Error(t, err)\n\t})\n}\n\n// Test with different configurations\nfunc TestConfigurations(t *testing.T) {\n\tt.Run(\"with proxy configuration\", func(t *testing.T) {\n\t\t// Test that proxy URL parsing works\n\t\tproxyURL := \"http://proxy.example.com:8080\"\n\t\tparsed, err := parseURL(proxyURL)\n\t\tassert.NoError(t, err)\n\t\tassert.Equal(t, \"proxy.example.com:8080\", parsed.Host)\n\t\tassert.Equal(t, \"http\", parsed.Scheme)\n\t})\n\n\tt.Run(\"with cookie configuration\", func(t *testing.T) {\n\t\t// Test cookie creation\n\t\ttests := []struct {\n\t\t\tname      string\n\t\t\tcookieName cookieName\n\t\t\tcookieVal  string\n\t\t\texpected   string\n\t\t}{\n\t\t\t{\n\t\t\t\tname:      \"substack.sid cookie\",\n\t\t\t\tcookieName: substackSid,\n\t\t\t\tcookieVal:  \"test-value\",\n\t\t\t\texpected:   \"substack.sid\",\n\t\t\t},\n\t\t\t{\n\t\t\t\tname:      \"connect.sid cookie\",\n\t\t\t\tcookieName: connectSid,\n\t\t\t\tcookieVal:  \"test-value\",\n\t\t\t\texpected:   \"connect.sid\",\n\t\t\t},\n\t\t}\n\n\t\tfor _, tt := range tests {\n\t\t\tt.Run(tt.name, func(t 
*testing.T) {\n\t\t\t\tassert.Equal(t, tt.expected, tt.cookieName.String())\n\t\t\t})\n\t\t}\n\t})\n\n\tt.Run(\"with rate limiting\", func(t *testing.T) {\n\t\t// Test that different rate limits are handled\n\t\trates := []int{1, 2, 5, 10}\n\t\t\n\t\tfor _, rate := range rates {\n\t\t\tfetcher := lib.NewFetcher(lib.WithRatePerSecond(rate))\n\t\t\tassert.NotNil(t, fetcher)\n\t\t\tassert.Equal(t, rate, int(fetcher.RateLimiter.Limit()))\n\t\t}\n\t})\n}\n\n// Test real-world scenarios\nfunc TestRealWorldScenarios(t *testing.T) {\n\t// Skip in short test mode\n\tif testing.Short() {\n\t\tt.Skip(\"Skipping real-world scenario tests in short mode\")\n\t}\n\n\tt.Run(\"large number of URLs\", func(t *testing.T) {\n\t\t// Test performance with many URLs\n\t\turls := make([]string, 100)\n\t\tfor i := range urls {\n\t\t\turls[i] = fmt.Sprintf(\"https://example.substack.com/p/post-%d\", i)\n\t\t}\n\t\t\n\t\t// Test URL parsing performance\n\t\tstart := time.Now()\n\t\t\n\t\t// Test parsing all URLs\n\t\tvalidUrls := 0\n\t\tfor _, url := range urls {\n\t\t\tif _, err := parseURL(url); err == nil {\n\t\t\t\tvalidUrls++\n\t\t\t}\n\t\t}\n\t\t\n\t\tduration := time.Since(start)\n\t\t\n\t\tassert.Equal(t, len(urls), validUrls) // All should be valid\n\t\tassert.Less(t, duration, 1*time.Second) // Should be fast\n\t})\n\n\tt.Run(\"concurrent processing\", func(t *testing.T) {\n\t\t// Test that writing several posts in a row works correctly.\n\t\t// Note: despite the subtest name, the writes below run sequentially.\n\t\ttempDir := t.TempDir()\n\t\t\n\t\t// Create multiple posts\n\t\tposts := make([]lib.Post, 5)\n\t\tfor i := range posts {\n\t\t\tposts[i] = lib.Post{\n\t\t\t\tTitle:    fmt.Sprintf(\"Post %d\", i),\n\t\t\t\tSlug:     fmt.Sprintf(\"post-%d\", i),\n\t\t\t\tPostDate: \"2023-01-01\",\n\t\t\t\tBodyHTML: fmt.Sprintf(\"<p>Content for post %d</p>\", i),\n\t\t\t}\n\t\t}\n\t\t\n\t\t// Write all posts one after another and time the whole batch\n\t\tstart := time.Now()\n\t\tfor i, post := range posts {\n\t\t\tfilePath := filepath.Join(tempDir, fmt.Sprintf(\"post-%d.html\", 
i))\n\t\t\terr := post.WriteToFile(filePath, \"html\", false)\n\t\t\tassert.NoError(t, err)\n\t\t}\n\t\tduration := time.Since(start)\n\t\t\n\t\t// Verify all files were created\n\t\tfor i := range posts {\n\t\t\tfilePath := filepath.Join(tempDir, fmt.Sprintf(\"post-%d.html\", i))\n\t\t\t_, err := os.Stat(filePath)\n\t\t\tassert.NoError(t, err)\n\t\t}\n\t\t\n\t\tassert.Less(t, duration, 1*time.Second) // Should be fast\n\t})\n}\n\n// Test archive functionality end-to-end\nfunc TestArchiveWorkflow(t *testing.T) {\n\tt.Run(\"single post with archive\", func(t *testing.T) {\n\t\ttempDir := t.TempDir()\n\t\t\n\t\t// Create a mock post with enhanced fields\n\t\tpost := lib.Post{\n\t\t\tId:           123,\n\t\t\tTitle:        \"Test Archive Post\",\n\t\t\tSlug:         \"test-archive-post\",\n\t\t\tPostDate:     \"2023-01-01T10:30:00Z\",\n\t\t\tSubtitle:     \"This is a test subtitle\",\n\t\t\tDescription:  \"Test description\",\n\t\t\tCoverImage:   \"https://example.com/cover.jpg\",\n\t\t\tCanonicalUrl: \"https://example.substack.com/p/test-archive-post\",\n\t\t\tBodyHTML:     \"<p>This is a <strong>test</strong> post for archive functionality.</p>\",\n\t\t}\n\t\t\n\t\t// Simulate the archive workflow\n\t\tarchive := lib.NewArchive()\n\t\t\n\t\t// Write the post to file (similar to what download command does)\n\t\tfilePath := filepath.Join(tempDir, \"20230101_103000_test-archive-post.html\")\n\t\terr := post.WriteToFile(filePath, \"html\", false)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Add entry to archive (similar to what download command does)\n\t\tdownloadTime, _ := time.Parse(time.RFC3339, \"2023-01-10T12:00:00Z\")\n\t\tarchive.AddEntry(post, filePath, downloadTime)\n\t\t\n\t\t// Generate archive in all formats\n\t\terr = archive.GenerateHTML(tempDir)\n\t\trequire.NoError(t, err)\n\t\t\n\t\terr = archive.GenerateMarkdown(tempDir)\n\t\trequire.NoError(t, err)\n\t\t\n\t\terr = archive.GenerateText(tempDir)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Verify all 
archive files were created\n\t\tassert.FileExists(t, filepath.Join(tempDir, \"index.html\"))\n\t\tassert.FileExists(t, filepath.Join(tempDir, \"index.md\"))\n\t\tassert.FileExists(t, filepath.Join(tempDir, \"index.txt\"))\n\t\t\n\t\t// Verify HTML archive content\n\t\thtmlContent, err := os.ReadFile(filepath.Join(tempDir, \"index.html\"))\n\t\trequire.NoError(t, err)\n\t\thtmlStr := string(htmlContent)\n\t\t\n\t\tassert.Contains(t, htmlStr, \"Test Archive Post\")\n\t\tassert.Contains(t, htmlStr, \"This is a test subtitle\")\n\t\tassert.Contains(t, htmlStr, \"https://example.com/cover.jpg\")\n\t\tassert.Contains(t, htmlStr, \"20230101_103000_test-archive-post.html\") // Relative path\n\t\tassert.Contains(t, htmlStr, \"January 1, 2023\") // Formatted date\n\t\t\n\t\t// Verify Markdown archive content\n\t\tmdContent, err := os.ReadFile(filepath.Join(tempDir, \"index.md\"))\n\t\trequire.NoError(t, err)\n\t\tmdStr := string(mdContent)\n\t\t\n\t\tassert.Contains(t, mdStr, \"# Substack Archive\")\n\t\tassert.Contains(t, mdStr, \"## [Test Archive Post](20230101_103000_test-archive-post.html)\")\n\t\tassert.Contains(t, mdStr, \"*This is a test subtitle*\")\n\t\tassert.Contains(t, mdStr, \"![Cover Image](https://example.com/cover.jpg)\")\n\t\t\n\t\t// Verify Text archive content\n\t\ttxtContent, err := os.ReadFile(filepath.Join(tempDir, \"index.txt\"))\n\t\trequire.NoError(t, err)\n\t\ttxtStr := string(txtContent)\n\t\t\n\t\tassert.Contains(t, txtStr, \"SUBSTACK ARCHIVE\")\n\t\tassert.Contains(t, txtStr, \"Title: Test Archive Post\")\n\t\tassert.Contains(t, txtStr, \"File: 20230101_103000_test-archive-post.html\")\n\t\tassert.Contains(t, txtStr, \"Description: This is a test subtitle\")\n\t})\n\n\tt.Run(\"multiple posts with archive\", func(t *testing.T) {\n\t\ttempDir := t.TempDir()\n\t\t\n\t\tarchive := lib.NewArchive()\n\t\tdownloadTime := time.Now()\n\t\t\n\t\t// Create multiple posts with different dates\n\t\tposts := []lib.Post{\n\t\t\t{\n\t\t\t\tId:           
1,\n\t\t\t\tTitle:        \"First Post\",\n\t\t\t\tSlug:         \"first-post\",\n\t\t\t\tPostDate:     \"2023-01-01T10:00:00Z\",\n\t\t\t\tSubtitle:     \"First subtitle\",\n\t\t\t\tCanonicalUrl: \"https://example.substack.com/p/first-post\",\n\t\t\t\tBodyHTML:     \"<p>First post content</p>\",\n\t\t\t},\n\t\t\t{\n\t\t\t\tId:           2,\n\t\t\t\tTitle:        \"Second Post\",\n\t\t\t\tSlug:         \"second-post\", \n\t\t\t\tPostDate:     \"2023-01-02T10:00:00Z\",\n\t\t\t\tDescription:  \"Second description\",\n\t\t\t\tCoverImage:   \"https://example.com/cover2.jpg\",\n\t\t\t\tCanonicalUrl: \"https://example.substack.com/p/second-post\",\n\t\t\t\tBodyHTML:     \"<p>Second post content</p>\",\n\t\t\t},\n\t\t\t{\n\t\t\t\tId:           3,\n\t\t\t\tTitle:        \"Third Post\",\n\t\t\t\tSlug:         \"third-post\",\n\t\t\t\tPostDate:     \"2023-01-03T10:00:00Z\",\n\t\t\t\tSubtitle:     \"Third subtitle\",\n\t\t\t\tCanonicalUrl: \"https://example.substack.com/p/third-post\",\n\t\t\t\tBodyHTML:     \"<p>Third post content</p>\",\n\t\t\t},\n\t\t}\n\t\t\n\t\t// Write posts and add to archive\n\t\tfor i, post := range posts {\n\t\t\tfilePath := filepath.Join(tempDir, fmt.Sprintf(\"post-%d.html\", i+1))\n\t\t\terr := post.WriteToFile(filePath, \"html\", false)\n\t\t\trequire.NoError(t, err)\n\t\t\t\n\t\t\tarchive.AddEntry(post, filePath, downloadTime.Add(time.Duration(i)*time.Hour))\n\t\t}\n\t\t\n\t\t// Generate archive\n\t\terr := archive.GenerateHTML(tempDir)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Verify content ordering (newest first)\n\t\thtmlContent, err := os.ReadFile(filepath.Join(tempDir, \"index.html\"))\n\t\trequire.NoError(t, err)\n\t\thtmlStr := string(htmlContent)\n\t\t\n\t\t// Find positions of post titles to verify ordering\n\t\tthirdPos := strings.Index(htmlStr, \"Third Post\")\n\t\tsecondPos := strings.Index(htmlStr, \"Second Post\")\n\t\tfirstPos := strings.Index(htmlStr, \"First Post\")\n\t\t\n\t\tassert.True(t, thirdPos < secondPos, \"Third Post 
should appear before Second Post\")\n\t\tassert.True(t, secondPos < firstPos, \"Second Post should appear before First Post\")\n\t\t\n\t\t// Verify all posts are included\n\t\tassert.Contains(t, htmlStr, \"First subtitle\")\n\t\tassert.Contains(t, htmlStr, \"Second description\") // Fallback to description\n\t\tassert.Contains(t, htmlStr, \"Third subtitle\")\n\t\tassert.Contains(t, htmlStr, \"https://example.com/cover2.jpg\")\n\t})\n\n\tt.Run(\"archive with different formats\", func(t *testing.T) {\n\t\ttempDir := t.TempDir()\n\t\t\n\t\tpost := lib.Post{\n\t\t\tId:           100,\n\t\t\tTitle:        \"Format Test Post\",\n\t\t\tSlug:         \"format-test-post\",\n\t\t\tPostDate:     \"2023-01-01T10:00:00Z\",\n\t\t\tSubtitle:     \"Testing different formats\",\n\t\t\tCanonicalUrl: \"https://example.substack.com/p/format-test-post\",\n\t\t\tBodyHTML:     \"<p>Testing <strong>different</strong> formats.</p>\",\n\t\t}\n\t\t\n\t\t// Test with different output formats\n\t\tformats := []string{\"html\", \"md\", \"txt\"}\n\t\t\n\t\tfor _, format := range formats {\n\t\t\tt.Run(fmt.Sprintf(\"format_%s\", format), func(t *testing.T) {\n\t\t\t\tformatDir := filepath.Join(tempDir, format)\n\t\t\t\terr := os.MkdirAll(formatDir, 0755)\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\t\n\t\t\t\tarchive := lib.NewArchive()\n\t\t\t\t\n\t\t\t\t// Write post in the specified format\n\t\t\t\tfilePath := filepath.Join(formatDir, fmt.Sprintf(\"post.%s\", format))\n\t\t\t\terr = post.WriteToFile(filePath, format, false)\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\t\n\t\t\t\t// Add to archive and generate\n\t\t\t\tarchive.AddEntry(post, filePath, time.Now())\n\t\t\t\t\n\t\t\t\tswitch format {\n\t\t\t\tcase \"html\":\n\t\t\t\t\terr = archive.GenerateHTML(formatDir)\n\t\t\t\tcase \"md\":\n\t\t\t\t\terr = archive.GenerateMarkdown(formatDir)\n\t\t\t\tcase \"txt\":\n\t\t\t\t\terr = archive.GenerateText(formatDir)\n\t\t\t\t}\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\t\n\t\t\t\t// Verify archive file 
exists\n\t\t\t\tindexPath := filepath.Join(formatDir, fmt.Sprintf(\"index.%s\", format))\n\t\t\t\tassert.FileExists(t, indexPath)\n\t\t\t\t\n\t\t\t\t// Verify content contains the post\n\t\t\t\tcontent, err := os.ReadFile(indexPath)\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\tassert.Contains(t, string(content), \"Format Test Post\")\n\t\t\t\tassert.Contains(t, string(content), \"Testing different formats\")\n\t\t\t})\n\t\t}\n\t})\n}"
  },
  {
    "path": "cmd/list.go",
    "content": "package cmd\n\nimport (\n\t\"fmt\"\n\t\"log\"\n\n\t\"github.com/spf13/cobra\"\n)\n\nvar (\n\t// pubUrl holds the value of the --url flag.\n\tpubUrl string\n\n\t// listCmd represents the list command\n\tlistCmd = &cobra.Command{\n\t\tUse:   \"list\",\n\t\tShort: \"List the posts of a Substack\",\n\t\tLong:  `List the posts of a Substack`,\n\t\tRun: func(cmd *cobra.Command, args []string) {\n\t\t\tparsedURL, err := parseURL(pubUrl)\n\t\t\tif err != nil {\n\t\t\t\tlog.Fatal(err)\n\t\t\t}\n\t\t\tmainWebsite := fmt.Sprintf(\"%s://%s\", parsedURL.Scheme, parsedURL.Host)\n\t\t\tif verbose {\n\t\t\t\tfmt.Printf(\"Main website: %s\\n\", mainWebsite)\n\t\t\t\tfmt.Println(\"Getting all posts URLs...\")\n\t\t\t}\n\t\t\tdateFilterFunc := makeDateFilterFunc(beforeDate, afterDate)\n\t\t\turls, err := extractor.GetAllPostsURLs(ctx, mainWebsite, dateFilterFunc)\n\t\t\tif err != nil {\n\t\t\t\tlog.Fatal(err)\n\t\t\t}\n\t\t\tif verbose {\n\t\t\t\tfmt.Printf(\"Found %d posts.\\n\", len(urls))\n\t\t\t}\n\t\t\tfor _, url := range urls {\n\t\t\t\tfmt.Println(url)\n\t\t\t}\n\t\t},\n\t}\n)\n\nfunc init() {\n\tlistCmd.Flags().StringVarP(&pubUrl, \"url\", \"u\", \"\", \"Specify the Substack url\")\n\t// MarkFlagRequired only fails if the flag does not exist, which would be a programming error.\n\tif err := listCmd.MarkFlagRequired(\"url\"); err != nil {\n\t\tlog.Fatal(err)\n\t}\n}\n"
  },
  {
    "path": "cmd/main.go",
    "content": "package cmd\n"
  },
  {
    "path": "cmd/root.go",
    "content": "package cmd\n\nimport (\n\t\"context\"\n\t\"errors\"\n\t\"log\"\n\t\"net/http\"\n\t\"net/url\"\n\t\"os\"\n\n\t\"github.com/alexferrari88/sbstck-dl/lib\"\n\t\"github.com/spf13/cobra\"\n)\n\n// cookieName is the name of the session cookie used to authenticate with Substack.\n// It implements pflag.Value so that the flag value is validated when it is set.\ntype cookieName string\n\nconst (\n\tsubstackSid cookieName = \"substack.sid\"\n\tconnectSid  cookieName = \"connect.sid\"\n)\n\nfunc (c *cookieName) String() string {\n\treturn string(*c)\n}\n\nfunc (c *cookieName) Set(val string) error {\n\tswitch val {\n\tcase \"substack.sid\", \"connect.sid\":\n\t\t*c = cookieName(val)\n\tdefault:\n\t\treturn errors.New(\"invalid cookie name: must be either substack.sid or connect.sid\")\n\t}\n\treturn nil\n}\n\nfunc (c *cookieName) Type() string {\n\treturn \"cookieName\"\n}\n\nvar (\n\tproxyURL       string\n\tverbose        bool\n\tratePerSecond  int\n\tbeforeDate     string\n\tafterDate      string\n\tidCookieName   cookieName\n\tidCookieVal    string\n\tctx            = context.Background()\n\tparsedProxyURL *url.URL\n\tfetcher        *lib.Fetcher\n\textractor      *lib.Extractor\n\n\t// rootCmd represents the base command when called without any subcommands\n\trootCmd = &cobra.Command{\n\t\tUse:   \"sbstck-dl\",\n\t\tShort: \"Substack Downloader\",\n\t\tLong:  `sbstck-dl is a command line tool for downloading Substack newsletters for archival purposes, offline reading, or data analysis.`,\n\t\tPersistentPreRun: func(cmd *cobra.Command, args []string) {\n\n\t\t\tvar cookie *http.Cookie\n\n\t\t\tif proxyURL != \"\" {\n\t\t\t\tvar err error\n\t\t\t\tparsedProxyURL, err = parseURL(proxyURL)\n\t\t\t\tif err != nil {\n\t\t\t\t\tlog.Fatal(err)\n\t\t\t\t}\n\t\t\t}\n\n\t\t\t// Reject zero and negative rates: the limiter needs a positive value.\n\t\t\tif ratePerSecond <= 0 {\n\t\t\t\tlog.Fatal(\"rate must be greater than 0\")\n\t\t\t}\n\n\t\t\tif idCookieVal != \"\" && idCookieName != \"\" {\n\t\t\t\tif idCookieName == substackSid {\n\t\t\t\t\tcookie = &http.Cookie{\n\t\t\t\t\t\tName:  \"substack.sid\",\n\t\t\t\t\t\tValue: idCookieVal,\n\t\t\t\t\t}\n\t\t\t\t} else if idCookieName == connectSid 
{\n\t\t\t\t\tcookie = &http.Cookie{\n\t\t\t\t\t\tName:  \"connect.sid\",\n\t\t\t\t\t\tValue: idCookieVal,\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\n\t\t\tfetcher = lib.NewFetcher(lib.WithRatePerSecond(ratePerSecond), lib.WithProxyURL(parsedProxyURL), lib.WithCookie(cookie))\n\t\t\textractor = lib.NewExtractor(fetcher)\n\t\t},\n\t}\n)\n\n// Execute adds all child commands to the root command and sets flags appropriately.\n// This is called by main.main(). It only needs to happen once to the rootCmd.\nfunc Execute() {\n\terr := rootCmd.Execute()\n\tif err != nil {\n\t\tos.Exit(1)\n\t}\n}\n\nfunc init() {\n\trootCmd.PersistentFlags().StringVarP(&proxyURL, \"proxy\", \"x\", \"\", \"Specify the proxy url\")\n\trootCmd.PersistentFlags().Var(&idCookieName, \"cookie_name\", \"Either \\\"substack.sid\\\" or \\\"connect.sid\\\", based on the cookie you have (required for private newsletters)\")\n\trootCmd.PersistentFlags().StringVar(&idCookieVal, \"cookie_val\", \"\", \"The substack.sid/connect.sid cookie value (required for private newsletters)\")\n\trootCmd.PersistentFlags().BoolVarP(&verbose, \"verbose\", \"v\", false, \"Enable verbose output\")\n\trootCmd.PersistentFlags().IntVarP(&ratePerSecond, \"rate\", \"r\", lib.DefaultRatePerSecond, \"Specify the rate of requests per second\")\n\trootCmd.PersistentFlags().StringVar(&beforeDate, \"before\", \"\", \"Download posts published before this date (format: YYYY-MM-DD)\")\n\trootCmd.PersistentFlags().StringVar(&afterDate, \"after\", \"\", \"Download posts published after this date (format: YYYY-MM-DD)\")\n\trootCmd.MarkFlagsRequiredTogether(\"cookie_name\", \"cookie_val\")\n\n\trootCmd.AddCommand(downloadCmd)\n\trootCmd.AddCommand(listCmd)\n\trootCmd.AddCommand(versionCmd)\n}\n\nfunc makeDateFilterFunc(beforeDate string, afterDate string) lib.DateFilterFunc {\n\tvar dateFilterFunc lib.DateFilterFunc\n\tif beforeDate != \"\" && afterDate != \"\" {\n\t\tdateFilterFunc = func(date string) bool {\n\t\t\treturn date > afterDate && date 
< beforeDate\n\t\t}\n\t} else if beforeDate != \"\" {\n\t\tdateFilterFunc = func(date string) bool {\n\t\t\treturn date < beforeDate\n\t\t}\n\t} else if afterDate != \"\" {\n\t\tdateFilterFunc = func(date string) bool {\n\t\t\treturn date > afterDate\n\t\t}\n\t}\n\treturn dateFilterFunc\n}\n"
  },
  {
    "path": "cmd/version.go",
    "content": "package cmd\n\nimport (\n\t\"fmt\"\n\n\t\"github.com/spf13/cobra\"\n)\n\n// versionCmd represents the version command\nvar versionCmd = &cobra.Command{\n\tUse:   \"version\",\n\tShort: \"Print the version number of sbstck-dl\",\n\tLong:  `Display the current version of the app.`,\n\tRun: func(cmd *cobra.Command, args []string) {\n\t\tfmt.Println(\"sbstck-dl v0.7\")\n\t},\n}\n\nfunc init() {\n}\n"
  },
  {
    "path": "go.mod",
    "content": "module github.com/alexferrari88/sbstck-dl\n\ngo 1.20\n\nrequire (\n\tgithub.com/JohannesKaufmann/html-to-markdown v1.5.0\n\tgithub.com/PuerkitoBio/goquery v1.8.1\n\tgithub.com/cenkalti/backoff/v4 v4.2.1\n\tgithub.com/k3a/html2text v1.2.1\n\tgithub.com/schollz/progressbar/v3 v3.14.1\n\tgithub.com/spf13/cobra v1.8.0\n\tgithub.com/stretchr/testify v1.8.4\n\tgolang.org/x/sync v0.6.0\n\tgolang.org/x/time v0.5.0\n)\n\nrequire (\n\tgithub.com/andybalholm/cascadia v1.3.2 // indirect\n\tgithub.com/davecgh/go-spew v1.1.1 // indirect\n\tgithub.com/inconshreveable/mousetrap v1.1.0 // indirect\n\tgithub.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db // indirect\n\tgithub.com/pmezard/go-difflib v1.0.0 // indirect\n\tgithub.com/rivo/uniseg v0.4.4 // indirect\n\tgithub.com/spf13/pflag v1.0.5 // indirect\n\tgolang.org/x/net v0.20.0 // indirect\n\tgolang.org/x/sys v0.16.0 // indirect\n\tgolang.org/x/term v0.16.0 // indirect\n\tgopkg.in/yaml.v3 v3.0.1 // indirect\n)\n"
  },
  {
    "path": "go.sum",
    "content": "github.com/JohannesKaufmann/html-to-markdown v1.5.0 h1:cEAcqpxk0hUJOXEVGrgILGW76d1GpyGY7PCnAaWQyAI=\ngithub.com/JohannesKaufmann/html-to-markdown v1.5.0/go.mod h1:QTO/aTyEDukulzu269jY0xiHeAGsNxmuUBo2Q0hPsK8=\ngithub.com/PuerkitoBio/goquery v1.8.1 h1:uQxhNlArOIdbrH1tr0UXwdVFgDcZDrZVdcpygAcwmWM=\ngithub.com/PuerkitoBio/goquery v1.8.1/go.mod h1:Q8ICL1kNUJ2sXGoAhPGUdYDJvgQgHzJsnnd3H7Ho5jQ=\ngithub.com/andybalholm/cascadia v1.3.1/go.mod h1:R4bJ1UQfqADjvDa4P6HZHLh/3OxWWEqc0Sk8XGwHqvA=\ngithub.com/andybalholm/cascadia v1.3.2 h1:3Xi6Dw5lHF15JtdcmAHD3i1+T8plmv7BQ/nsViSLyss=\ngithub.com/andybalholm/cascadia v1.3.2/go.mod h1:7gtRlve5FxPPgIgX36uWBX58OdBsSS6lUvCFb+h7KvU=\ngithub.com/cenkalti/backoff/v4 v4.2.1 h1:y4OZtCnogmCPw98Zjyt5a6+QwPLGkiQsYW5oUqylYbM=\ngithub.com/cenkalti/backoff/v4 v4.2.1/go.mod h1:Y3VNntkOUPxTVeUxJ/G5vcM//AlwfmyYozVcomhLiZE=\ngithub.com/cpuguy83/go-md2man/v2 v2.0.3/go.mod h1:tgQtvFlXSQOSOSIRvRPT7W67SCa46tRHOmNcaadrF8o=\ngithub.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=\ngithub.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=\ngithub.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=\ngithub.com/gopherjs/gopherjs v0.0.0-20181017120253-0766667cb4d1 h1:EGx4pi6eqNxGaHF6qqu48+N2wcFQ5qg5FXgOdqsJ5d8=\ngithub.com/gopherjs/gopherjs v0.0.0-20181017120253-0766667cb4d1/go.mod h1:wJfORRmW1u3UXTncJ5qlYoELFm8eSnnEO6hX4iZ3EWY=\ngithub.com/inconshreveable/mousetrap v1.1.0 h1:wN+x4NVGpMsO7ErUn/mUI3vEoE6Jt13X2s0bqwp9tc8=\ngithub.com/inconshreveable/mousetrap v1.1.0/go.mod h1:vpF70FUmC8bwa3OWnCshd2FqLfsEA9PFc4w1p2J65bw=\ngithub.com/jtolds/gls v4.20.0+incompatible h1:xdiiI2gbIgH/gLH7ADydsJ1uDOEzR8yvV7C0MuV77Wo=\ngithub.com/jtolds/gls v4.20.0+incompatible/go.mod h1:QJZ7F/aHp+rZTRtaJ1ow/lLfFfVYBRgL+9YlvaHOwJU=\ngithub.com/k0kubun/go-ansi v0.0.0-20180517002512-3bf9e2903213/go.mod h1:vNUNkEQ1e29fT/6vq2aBdFsgNPmy8qMdSay1npru+Sw=\ngithub.com/k3a/html2text v1.2.1 
h1:nvnKgBvBR/myqrwfLuiqecUtaK1lB9hGziIJKatNFVY=\ngithub.com/k3a/html2text v1.2.1/go.mod h1:ieEXykM67iT8lTvEWBh6fhpH4B23kB9OMKPdIBmgUqA=\ngithub.com/kr/pretty v0.1.0 h1:L/CwN0zerZDmRFUapSPitk6f+Q3+0za1rQkzVuMiMFI=\ngithub.com/kr/pretty v0.1.0/go.mod h1:dAy3ld7l9f0ibDNOQOHHMYYIIbhfbHSm3C4ZsoJORNo=\ngithub.com/kr/pty v1.1.1/go.mod h1:pFQYn66WHrOpPYNljwOMqo10TkYh1fy3cYio2l3bCsQ=\ngithub.com/kr/text v0.1.0 h1:45sCR5RtlFHMR4UwH9sdQ5TC8v0qDQCHnXt+kaKSTVE=\ngithub.com/kr/text v0.1.0/go.mod h1:4Jbv+DJW3UT/LiOwJeYQe1efqtUx/iVham/4vfdArNI=\ngithub.com/mattn/go-isatty v0.0.20/go.mod h1:W+V8PltTTMOvKvAeJH7IuucS94S2C6jfK/D7dTCTo3Y=\ngithub.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db h1:62I3jR2EmQ4l5rM/4FEfDWcRD+abF5XlKShorW5LRoQ=\ngithub.com/mitchellh/colorstring v0.0.0-20190213212951-d06e56a500db/go.mod h1:l0dey0ia/Uv7NcFFVbCLtqEBQbrT4OCwCSKTEv6enCw=\ngithub.com/pkg/errors v0.8.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=\ngithub.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=\ngithub.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=\ngithub.com/rivo/uniseg v0.4.4 h1:8TfxU8dW6PdqD27gjM8MVNuicgxIjxpm4K7x4jp8sis=\ngithub.com/rivo/uniseg v0.4.4/go.mod h1:FN3SvrM+Zdj16jyLfmOkMNblXMcoc8DfTHruCPUcx88=\ngithub.com/russross/blackfriday/v2 v2.1.0/go.mod h1:+Rmxgy9KzJVeS9/2gXHxylqXiyQDYRxCVz55jmeOWTM=\ngithub.com/schollz/progressbar/v3 v3.14.1 h1:VD+MJPCr4s3wdhTc7OEJ/Z3dAeBzJ7yKH/P4lC5yRTI=\ngithub.com/schollz/progressbar/v3 v3.14.1/go.mod h1:Zc9xXneTzWXF81TGoqL71u0sBPjULtEHYtj/WVgVy8E=\ngithub.com/sebdah/goldie/v2 v2.5.3 h1:9ES/mNN+HNUbNWpVAlrzuZ7jE+Nrczbj8uFRjM7624Y=\ngithub.com/sebdah/goldie/v2 v2.5.3/go.mod h1:oZ9fp0+se1eapSRjfYbsV/0Hqhbuu3bJVvKI/NNtssI=\ngithub.com/sergi/go-diff v1.0.0/go.mod h1:0CfEIISq7TuYL3j771MWULgwwjU+GofnZX9QAmXWZgo=\ngithub.com/sergi/go-diff v1.2.0 h1:XU+rvMAioB0UC3q1MFrIQy4Vo5/4VsRDQQXHsEya6xQ=\ngithub.com/sergi/go-diff v1.2.0/go.mod 
h1:STckp+ISIX8hZLjrqAeVduY0gWCT9IjLuqbuNXdaHfM=\ngithub.com/smartystreets/assertions v0.0.0-20180927180507-b2de0cb4f26d h1:zE9ykElWQ6/NYmHa3jpm/yHnI4xSofP+UP6SpjHcSeM=\ngithub.com/smartystreets/assertions v0.0.0-20180927180507-b2de0cb4f26d/go.mod h1:OnSkiWE9lh6wB0YB77sQom3nweQdgAjqCqsofrRNTgc=\ngithub.com/smartystreets/goconvey v1.6.4 h1:fv0U8FUIMPNf1L9lnHLvLhgicrIVChEkdzIKYqbNC9s=\ngithub.com/smartystreets/goconvey v1.6.4/go.mod h1:syvi0/a8iFYH4r/RixwvyeAJjdLS9QV7WQ/tjFTllLA=\ngithub.com/spf13/cobra v1.8.0 h1:7aJaZx1B85qltLMc546zn58BxxfZdR/W22ej9CFoEf0=\ngithub.com/spf13/cobra v1.8.0/go.mod h1:WXLWApfZ71AjXPya3WOlMsY9yMs7YeiHhFVlvLyhcho=\ngithub.com/spf13/pflag v1.0.5 h1:iy+VFUOCP1a+8yFto/drg2CJ5u0yRoB7fZw3DKv/JXA=\ngithub.com/spf13/pflag v1.0.5/go.mod h1:McXfInJRrz4CZXVZOBLb0bTZqETkiAhM9Iw0y3An2Bg=\ngithub.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=\ngithub.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI=\ngithub.com/stretchr/testify v1.4.0/go.mod h1:j7eGeouHqKxXV5pUuKE4zz7dFj8WfuZ+81PSLYec5m4=\ngithub.com/stretchr/testify v1.8.4 h1:CcVxjf3Q8PM0mHUKJCdn+eZZtm5yQwehR5yeSVQQcUk=\ngithub.com/stretchr/testify v1.8.4/go.mod h1:sz/lmYIOXD/1dqDmKjjqLyZ2RngseejIcXlSw2iwfAo=\ngithub.com/yuin/goldmark v1.4.13/go.mod h1:6yULJ656Px+3vBD8DxQVa3kxgyrAnzto9xy5taEt/CY=\ngithub.com/yuin/goldmark v1.6.0 h1:boZcn2GTjpsynOsC0iJHnBWa4Bi0qzfJjthwauItG68=\ngithub.com/yuin/goldmark v1.6.0/go.mod h1:6yULJ656Px+3vBD8DxQVa3kxgyrAnzto9xy5taEt/CY=\ngolang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=\ngolang.org/x/crypto v0.0.0-20210921155107-089bfa567519/go.mod h1:GvvjBRRGRdwPK5ydBHafDWAxML/pGHZbMvKqRZ5+Abc=\ngolang.org/x/crypto v0.16.0/go.mod h1:gCAAfMLgwOJRpTjQ2zCCt2OcSfYMTeZVSRtQlPC7Nq4=\ngolang.org/x/mod v0.6.0-dev.0.20220419223038-86c51ed26bb4/go.mod h1:jJ57K6gSWd91VN4djpZkiMVwK6gcyfeH4XE8wZrZaV4=\ngolang.org/x/mod v0.8.0/go.mod 
h1:iBbtSCu2XBx23ZKBPSOrRkjjQPZFPuis4dIYUhu/chs=\ngolang.org/x/net v0.0.0-20190311183353-d8887717615a/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg=\ngolang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=\ngolang.org/x/net v0.0.0-20210226172049-e18ecbb05110/go.mod h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg=\ngolang.org/x/net v0.0.0-20210916014120-12bc252f5db8/go.mod h1:9nx3DQGgdP8bBQD5qxJ1jj9UTztislL4KSBs9R2vV5Y=\ngolang.org/x/net v0.0.0-20220722155237-a158d28d115b/go.mod h1:XRhObCWvk6IyKnWLug+ECip1KBveYUHfp+8e9klMJ9c=\ngolang.org/x/net v0.6.0/go.mod h1:2Tu9+aMcznHK/AK1HMvgo6xiTLG5rD5rZLDS+rp2Bjs=\ngolang.org/x/net v0.7.0/go.mod h1:2Tu9+aMcznHK/AK1HMvgo6xiTLG5rD5rZLDS+rp2Bjs=\ngolang.org/x/net v0.9.0/go.mod h1:d48xBJpPfHeWQsugry2m+kC02ZBRGRgulfHnEXEuWns=\ngolang.org/x/net v0.10.0/go.mod h1:0qNGK6F8kojg2nk9dLZ2mShWaEBan6FAoqfSigmmuDg=\ngolang.org/x/net v0.19.0/go.mod h1:CfAk/cbD4CthTvqiEl8NpboMuiuOYsAr/7NOjZJtv1U=\ngolang.org/x/net v0.20.0 h1:aCL9BSgETF1k+blQaYUBx9hJ9LOGP3gAVemcZlf1Kpo=\ngolang.org/x/net v0.20.0/go.mod h1:z8BVo6PvndSri0LbOE3hAn0apkU+1YvI6E70E9jsnvY=\ngolang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=\ngolang.org/x/sync v0.0.0-20220722155255-886fb9371eb4/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=\ngolang.org/x/sync v0.1.0/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=\ngolang.org/x/sync v0.6.0 h1:5BMeUDZ7vkXGfEr1x9B4bRcTH4lpkTkpdh0T/J+qjbQ=\ngolang.org/x/sync v0.6.0/go.mod h1:Czt+wKu1gCyEFDUtn0jG5QVvpJ6rzVqr5aXyt9drQfk=\ngolang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=\ngolang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=\ngolang.org/x/sys v0.0.0-20210423082822-04245dca01da/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=\ngolang.org/x/sys v0.0.0-20210615035016-665e8c7367d1/go.mod 
h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=\ngolang.org/x/sys v0.0.0-20220520151302-bc2c85ada10a/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=\ngolang.org/x/sys v0.0.0-20220722155257-8c9f86f7a55f/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=\ngolang.org/x/sys v0.5.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=\ngolang.org/x/sys v0.6.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=\ngolang.org/x/sys v0.7.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=\ngolang.org/x/sys v0.8.0/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg=\ngolang.org/x/sys v0.14.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=\ngolang.org/x/sys v0.15.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=\ngolang.org/x/sys v0.16.0 h1:xWw16ngr6ZMtmxDyKyIgsE93KNKz5HKmMa3b8ALHidU=\ngolang.org/x/sys v0.16.0/go.mod h1:/VUhepiaJMQUp4+oa/7Zr1D23ma6VTLIYjOOTFZPUcA=\ngolang.org/x/term v0.0.0-20201126162022-7de9c90e9dd1/go.mod h1:bj7SfCRtBDWHUb9snDiAeCFNEtKQo2Wmx5Cou7ajbmo=\ngolang.org/x/term v0.0.0-20210927222741-03fcf44c2211/go.mod h1:jbD1KX2456YbFQfuXm/mYQcufACuNUgVhRMnK/tPxf8=\ngolang.org/x/term v0.5.0/go.mod h1:jMB1sMXY+tzblOD4FWmEbocvup2/aLOaQEp7JmGp78k=\ngolang.org/x/term v0.7.0/go.mod h1:P32HKFT3hSsZrRxla30E9HqToFYAQPCMs/zFMBUFqPY=\ngolang.org/x/term v0.8.0/go.mod h1:xPskH00ivmX89bAKVGSKKtLOWNx2+17Eiy94tnKShWo=\ngolang.org/x/term v0.14.0/go.mod h1:TySc+nGkYR6qt8km8wUhuFRTVSMIX3XPR58y2lC8vww=\ngolang.org/x/term v0.15.0/go.mod h1:BDl952bC7+uMoWR75FIrCDx79TPU9oHkTZ9yRbYOrX0=\ngolang.org/x/term v0.16.0 h1:m+B6fahuftsE9qjo0VWp2FW0mB3MTJvR0BaMQrq0pmE=\ngolang.org/x/term v0.16.0/go.mod h1:yn7UURbUtPyrVJPGPq404EukNFxcm/foM+bV/bfcDsY=\ngolang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=\ngolang.org/x/text v0.3.3/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=\ngolang.org/x/text v0.3.6/go.mod h1:5Zoc/QRtKVWzQhOtBMvqHzDpF6irO9z98xDceosuGiQ=\ngolang.org/x/text v0.3.7/go.mod 
h1:u+2+/6zg+i71rQMx5EYifcz6MCKuco9NR6JIITiCfzQ=\ngolang.org/x/text v0.7.0/go.mod h1:mrYo+phRRbMaCq/xk9113O4dZlRixOauAjOtrjsXDZ8=\ngolang.org/x/text v0.9.0/go.mod h1:e1OnstbJyHTd6l/uOt8jFFHp6TRDWZR/bV3emEE/zU8=\ngolang.org/x/text v0.14.0/go.mod h1:18ZOQIKpY8NJVqYksKHtTdi31H5itFRjB5/qKTNYzSU=\ngolang.org/x/time v0.5.0 h1:o7cqy6amK/52YcAKIPlM3a+Fpj35zvRj2TP+e1xFSfk=\ngolang.org/x/time v0.5.0/go.mod h1:3BpzKBy/shNhVucY/MWOyx10tF3SFh9QdLuxbVysPQM=\ngolang.org/x/tools v0.0.0-20180917221912-90fa682c2a6e/go.mod h1:n7NCudcB/nEzxVGmLbDWY5pfWTLqBcC2KZ6jyYvM4mQ=\ngolang.org/x/tools v0.0.0-20190328211700-ab21143f2384/go.mod h1:LCzVGOaR6xXOjkQ3onu1FJEFr0SW1gC7cKk1uF8kGRs=\ngolang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=\ngolang.org/x/tools v0.1.12/go.mod h1:hNGJHUnrk76NpqgfD5Aqm5Crs+Hm0VOH/i9J2+nxYbc=\ngolang.org/x/tools v0.6.0/go.mod h1:Xwgl3UAJ/d3gWutnCtw505GrjyAbvKui8lOU390QaIU=\ngolang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=\ngopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=\ngopkg.in/check.v1 v1.0.0-20190902080502-41f04d3bba15 h1:YR8cESwS4TdDjEe65xsg0ogRM/Nc3DYOhEAlW+xobZo=\ngopkg.in/check.v1 v1.0.0-20190902080502-41f04d3bba15/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=\ngopkg.in/yaml.v2 v2.2.2/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=\ngopkg.in/yaml.v2 v2.2.4/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=\ngopkg.in/yaml.v2 v2.4.0 h1:D8xgwECY7CYvx+Y2n4sBz93Jn9JRvxdiyyo8CTfuKaY=\ngopkg.in/yaml.v2 v2.4.0/go.mod h1:RDklbk79AGWmwhnvt/jBztapEOGDOx6ZbXqjP6csGnQ=\ngopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=\ngopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=\n"
  },
  {
    "path": "lib/extractor.go",
    "content": "package lib\n\nimport (\n\t\"context\"\n\t\"encoding/json\"\n\t\"errors\"\n\t\"fmt\"\n\t\"net/url\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"sort\"\n\t\"strings\"\n\t\"sync\"\n\t\"time\"\n\n\tmd \"github.com/JohannesKaufmann/html-to-markdown\"\n\t\"github.com/PuerkitoBio/goquery\"\n\t\"github.com/k3a/html2text\"\n)\n\n// RawPost represents a raw Substack post in string format.\ntype RawPost struct {\n\tstr string\n}\n\n// ToPost converts the RawPost to a structured Post object.\nfunc (r *RawPost) ToPost() (Post, error) {\n\tvar wrapper PostWrapper\n\terr := json.Unmarshal([]byte(r.str), &wrapper)\n\tif err != nil {\n\t\treturn Post{}, err\n\t}\n\treturn wrapper.Post, nil\n}\n\n// Post represents a structured Substack post with various fields.\ntype Post struct {\n\tId               int    `json:\"id\"`\n\tPublicationId    int    `json:\"publication_id\"`\n\tType             string `json:\"type\"`\n\tSlug             string `json:\"slug\"`\n\tPostDate         string `json:\"post_date\"`\n\tCanonicalUrl     string `json:\"canonical_url\"`\n\tPreviousPostSlug string `json:\"previous_post_slug\"`\n\tNextPostSlug     string `json:\"next_post_slug\"`\n\tCoverImage       string `json:\"cover_image\"`\n\tDescription      string `json:\"description\"`\n\tSubtitle         string `json:\"subtitle,omitempty\"`\n\tWordCount        int    `json:\"wordcount\"`\n\tTitle            string `json:\"title\"`\n\tBodyHTML         string `json:\"body_html\"`\n}\n\n// Static converter instance to avoid recreating it for each conversion\nvar mdConverter = md.NewConverter(\"\", true, nil)\n\n// ToMD converts the Post's HTML body to Markdown format.\nfunc (p *Post) ToMD(withTitle bool) (string, error) {\n\tif withTitle {\n\t\tbody, err := mdConverter.ConvertString(p.BodyHTML)\n\t\tif err != nil {\n\t\t\treturn \"\", err\n\t\t}\n\t\treturn fmt.Sprintf(\"# %s\\n\\n%s\", p.Title, body), nil\n\t}\n\n\treturn mdConverter.ConvertString(p.BodyHTML)\n}\n\n// ToText converts the Post's 
HTML body to plain text format.\nfunc (p *Post) ToText(withTitle bool) string {\n\tif withTitle {\n\t\treturn p.Title + \"\\n\\n\" + html2text.HTML2Text(p.BodyHTML)\n\t}\n\treturn html2text.HTML2Text(p.BodyHTML)\n}\n\n// ToHTML returns the Post's HTML body as-is or with an optional title header.\nfunc (p *Post) ToHTML(withTitle bool) string {\n\tif withTitle {\n\t\treturn fmt.Sprintf(\"<h1>%s</h1>\\n\\n%s\", p.Title, p.BodyHTML)\n\t}\n\treturn p.BodyHTML\n}\n\n// ToJSON converts the Post to a JSON string.\nfunc (p *Post) ToJSON() (string, error) {\n\tb, err := json.Marshal(p)\n\tif err != nil {\n\t\treturn \"\", err\n\t}\n\treturn string(b), nil\n}\n\n// contentForFormat returns the content of a post in the specified format.\nfunc (p *Post) contentForFormat(format string, withTitle bool) (string, error) {\n\tswitch format {\n\tcase \"html\":\n\t\treturn p.ToHTML(withTitle), nil\n\tcase \"md\":\n\t\treturn p.ToMD(withTitle)\n\tcase \"txt\":\n\t\treturn p.ToText(withTitle), nil\n\tdefault:\n\t\treturn \"\", fmt.Errorf(\"unknown format: %s\", format)\n\t}\n}\n\n// WriteToFile writes the Post's content to a file in the specified format (html, md, or txt).\nfunc (p *Post) WriteToFile(path string, format string, addSourceURL bool) error {\n\tif err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {\n\t\treturn err\n\t}\n\n\tcontent, err := p.contentForFormat(format, true)\n\tif err != nil {\n\t\treturn err\n\t}\n\n\tif addSourceURL && p.CanonicalUrl != \"\" {\n\t\tsourceLine := fmt.Sprintf(\"\\n\\noriginal content: %s\", p.CanonicalUrl) // Add separation\n\n\t\t// Adjust formatting slightly for HTML\n\t\tif format == \"html\" {\n\t\t\tsourceLine = fmt.Sprintf(\"<p style=\\\"margin-top: 2em; font-size: small; color: grey;\\\">original content: <a href=\\\"%s\\\">%s</a></p>\", p.CanonicalUrl, p.CanonicalUrl)\n\t\t}\n\t\tcontent += sourceLine\n\t}\n\n\treturn os.WriteFile(path, []byte(content), 0644)\n}\n\n// WriteToFileWithImages writes the Post's content to a file 
with optional image downloading\nfunc (p *Post) WriteToFileWithImages(ctx context.Context, path string, format string, addSourceURL bool, \n\tdownloadImages bool, imageQuality ImageQuality, imagesDir string, \n\tdownloadFiles bool, fileExtensions []string, filesDir string, fetcher *Fetcher) (*ImageDownloadResult, error) {\n\t\n\tif err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {\n\t\treturn nil, err\n\t}\n\n\tcontent, err := p.contentForFormat(format, true)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\tvar imageResult *ImageDownloadResult\n\n\t// Download images if requested and format supports it\n\tif downloadImages && (format == \"html\" || format == \"md\") {\n\t\toutputDir := filepath.Dir(path)\n\t\timageDownloader := NewImageDownloader(fetcher, outputDir, imagesDir, imageQuality)\n\t\t\n\t\t// Only process HTML content for image downloading\n\t\thtmlContent := content\n\t\tif format == \"md\" {\n\t\t\t// For markdown, we need to work with the original HTML\n\t\t\thtmlContent = p.BodyHTML\n\t\t}\n\t\t\n\t\timageResult, err = imageDownloader.DownloadImages(ctx, htmlContent, p.Slug)\n\t\tif err != nil {\n\t\t\treturn nil, fmt.Errorf(\"failed to download images: %w\", err)\n\t\t}\n\n\t\t// Update content based on format\n\t\tif format == \"html\" {\n\t\t\tcontent = imageResult.UpdatedHTML\n\t\t\t// Re-add title if needed\n\t\t\tif strings.HasPrefix(content, \"<h1>\") {\n\t\t\t\t// Title already included\n\t\t\t} else {\n\t\t\t\tcontent = fmt.Sprintf(\"<h1>%s</h1>\\n\\n%s\", p.Title, imageResult.UpdatedHTML)\n\t\t\t}\n\t\t} else if format == \"md\" {\n\t\t\t// Convert updated HTML to markdown\n\t\t\tupdatedContent, err := mdConverter.ConvertString(imageResult.UpdatedHTML)\n\t\t\tif err != nil {\n\t\t\t\treturn nil, fmt.Errorf(\"failed to convert updated HTML to markdown: %w\", err)\n\t\t\t}\n\t\t\tcontent = fmt.Sprintf(\"# %s\\n\\n%s\", p.Title, updatedContent)\n\t\t}\n\t} else if downloadImages && format == \"txt\" {\n\t\t// For text format, we 
can't embed images, but we can still download them\n\t\toutputDir := filepath.Dir(path)\n\t\timageDownloader := NewImageDownloader(fetcher, outputDir, imagesDir, imageQuality)\n\t\t\n\t\timageResult, err = imageDownloader.DownloadImages(ctx, p.BodyHTML, p.Slug)\n\t\tif err != nil {\n\t\t\treturn nil, fmt.Errorf(\"failed to download images: %w\", err)\n\t\t}\n\t\t// Keep original text content since we can't embed images in text format\n\t}\n\n\t// Download files if requested and format supports it\n\tif downloadFiles && (format == \"html\" || format == \"md\") {\n\t\toutputDir := filepath.Dir(path)\n\t\tfileDownloader := NewFileDownloader(fetcher, outputDir, filesDir, fileExtensions)\n\t\t\n\t\t// Process HTML content for file downloading - use the updated HTML from images if available\n\t\thtmlContent := content\n\t\tif imageResult != nil && imageResult.UpdatedHTML != \"\" {\n\t\t\thtmlContent = imageResult.UpdatedHTML\n\t\t} else if format == \"md\" {\n\t\t\t// For markdown, we need to work with the original HTML\n\t\t\thtmlContent = p.BodyHTML\n\t\t}\n\t\t\n\t\tfileResult, err := fileDownloader.DownloadFiles(ctx, htmlContent, p.Slug)\n\t\tif err != nil {\n\t\t\treturn nil, fmt.Errorf(\"failed to download files: %w\", err)\n\t\t}\n\n\t\t// Update content based on format if files were processed\n\t\tif fileResult.Success > 0 || fileResult.Failed > 0 {\n\t\t\tif format == \"html\" {\n\t\t\t\tcontent = fileResult.UpdatedHTML\n\t\t\t\t// Re-add title if needed\n\t\t\t\tif !strings.HasPrefix(content, \"<h1>\") {\n\t\t\t\t\tcontent = fmt.Sprintf(\"<h1>%s</h1>\\n\\n%s\", p.Title, fileResult.UpdatedHTML)\n\t\t\t\t}\n\t\t\t} else if format == \"md\" {\n\t\t\t\t// Convert updated HTML to markdown\n\t\t\t\tupdatedContent, err := mdConverter.ConvertString(fileResult.UpdatedHTML)\n\t\t\t\tif err != nil {\n\t\t\t\t\treturn nil, fmt.Errorf(\"failed to convert updated HTML to markdown: %w\", err)\n\t\t\t\t}\n\t\t\t\tcontent = fmt.Sprintf(\"# %s\\n\\n%s\", p.Title, 
updatedContent)\n\t\t\t}\n\t\t}\n\t}\n\n\t// Add source URL if requested\n\tif addSourceURL && p.CanonicalUrl != \"\" {\n\t\tsourceLine := fmt.Sprintf(\"\\n\\noriginal content: %s\", p.CanonicalUrl)\n\n\t\t// Adjust formatting slightly for HTML\n\t\tif format == \"html\" {\n\t\t\tsourceLine = fmt.Sprintf(\"<p style=\\\"margin-top: 2em; font-size: small; color: grey;\\\">original content: <a href=\\\"%s\\\">%s</a></p>\", p.CanonicalUrl, p.CanonicalUrl)\n\t\t}\n\t\tcontent += sourceLine\n\t}\n\n\t// Write the file\n\tif err := os.WriteFile(path, []byte(content), 0644); err != nil {\n\t\treturn imageResult, err\n\t}\n\n\t// Return empty result if no image downloading was performed\n\tif imageResult == nil {\n\t\timageResult = &ImageDownloadResult{\n\t\t\tImages:      []ImageInfo{},\n\t\t\tUpdatedHTML: content,\n\t\t\tSuccess:     0,\n\t\t\tFailed:      0,\n\t\t}\n\t}\n\n\treturn imageResult, nil\n}\n\n// PostWrapper wraps a Post object for JSON unmarshaling.\ntype PostWrapper struct {\n\tPost Post `json:\"post\"`\n}\n\n// Extractor is a utility for extracting Substack posts from URLs.\ntype Extractor struct {\n\tfetcher *Fetcher\n}\n\n// ArchiveEntry represents a single entry in the archive page\ntype ArchiveEntry struct {\n\tPost         Post\n\tFilePath     string\n\tDownloadTime time.Time\n}\n\n// Archive represents a collection of posts for the archive page\ntype Archive struct {\n\tEntries []ArchiveEntry\n}\n\n// NewExtractor creates a new Extractor with the provided Fetcher.\n// If the Fetcher is nil, a default Fetcher will be used.\nfunc NewExtractor(f *Fetcher) *Extractor {\n\tif f == nil {\n\t\tf = NewFetcher()\n\t}\n\treturn &Extractor{fetcher: f}\n}\n\n// extractJSONString finds and extracts the JSON data from script content.\n// This optimized version reduces string operations.\nfunc extractJSONString(doc *goquery.Document) (string, error) {\n\tvar jsonString string\n\tvar found bool\n\n\tdoc.Find(\"script\").EachWithBreak(func(i int, s *goquery.Selection) 
bool {\n\t\tcontent := s.Text()\n\t\tif strings.Contains(content, \"window._preloads\") && strings.Contains(content, \"JSON.parse(\") {\n\t\t\tstart := strings.Index(content, \"JSON.parse(\\\"\")\n\t\t\tif start == -1 {\n\t\t\t\treturn true\n\t\t\t}\n\t\t\tstart += len(\"JSON.parse(\\\"\")\n\n\t\t\tend := strings.LastIndex(content, \"\\\")\")\n\t\t\tif end == -1 || start >= end {\n\t\t\t\treturn true\n\t\t\t}\n\n\t\t\tjsonString = content[start:end]\n\t\t\tfound = true\n\t\t\treturn false\n\t\t}\n\t\treturn true\n\t})\n\n\tif !found {\n\t\treturn \"\", errors.New(\"failed to extract JSON string\")\n\t}\n\n\treturn jsonString, nil\n}\n\nfunc (e *Extractor) ExtractPost(ctx context.Context, pageUrl string) (Post, error) {\n\t// fetch page HTML content\n\tbody, err := e.fetcher.FetchURL(ctx, pageUrl)\n\tif err != nil {\n\t\treturn Post{}, fmt.Errorf(\"failed to fetch page: %w\", err)\n\t}\n\tdefer body.Close()\n\n\tdoc, err := goquery.NewDocumentFromReader(body)\n\tif err != nil {\n\t\treturn Post{}, fmt.Errorf(\"failed to parse HTML: %w\", err)\n\t}\n\n\tjsonString, err := extractJSONString(doc)\n\tif err != nil {\n\t\treturn Post{}, fmt.Errorf(\"failed to extract post data: %w\", err)\n\t}\n\n\t// Unescape the JSON string directly\n\tvar rawJSON RawPost\n\terr = json.Unmarshal([]byte(\"\\\"\"+jsonString+\"\\\"\"), &rawJSON.str)\n\tif err != nil {\n\t\treturn Post{}, fmt.Errorf(\"failed to unescape JSON: %w\", err)\n\t}\n\n\t// Convert to a Go object\n\tp, err := rawJSON.ToPost()\n\tif err != nil {\n\t\treturn Post{}, fmt.Errorf(\"failed to parse post data: %w\", err)\n\t}\n\n\t// Extract additional metadata from HTML\n\t// Extract subtitle from .subtitle element\n\tif subtitle := doc.Find(\".subtitle\").First().Text(); subtitle != \"\" {\n\t\tp.Subtitle = strings.TrimSpace(subtitle)\n\t}\n\n\t// Extract cover image from og:image meta tag if not already set\n\tif p.CoverImage == \"\" {\n\t\tif ogImage, exists := 
doc.Find(\"meta[property='og:image']\").Attr(\"content\"); exists && ogImage != \"\" {\n\t\t\tp.CoverImage = ogImage\n\t\t}\n\t}\n\n\treturn p, nil\n}\n\ntype DateFilterFunc func(string) bool\n\nfunc (e *Extractor) GetAllPostsURLs(ctx context.Context, pubUrl string, f DateFilterFunc) ([]string, error) {\n\tu, err := url.Parse(pubUrl)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\tu.Path, err = url.JoinPath(u.Path, \"sitemap.xml\")\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\t// fetch the sitemap of the publication\n\tbody, err := e.fetcher.FetchURL(ctx, u.String())\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\tdefer body.Close()\n\n\t// Parse the XML\n\tdoc, err := goquery.NewDocumentFromReader(body)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\t// Pre-allocate a reasonable size for URLs\n\t// This avoids multiple slice reallocations as we append\n\turls := make([]string, 0, 100)\n\n\tdoc.Find(\"url\").EachWithBreak(func(i int, s *goquery.Selection) bool {\n\t\t// Check if the context has been cancelled\n\t\tselect {\n\t\tcase <-ctx.Done():\n\t\t\treturn false\n\t\tdefault:\n\t\t}\n\n\t\turlSel := s.Find(\"loc\")\n\t\turl := urlSel.Text()\n\t\tif !strings.Contains(url, \"/p/\") {\n\t\t\treturn true\n\t\t}\n\n\t\t// Only find lastmod if we have a filter\n\t\tif f != nil {\n\t\t\tlastmod := s.Find(\"lastmod\").Text()\n\t\t\tif !f(lastmod) {\n\t\t\t\treturn true\n\t\t\t}\n\t\t}\n\n\t\turls = append(urls, url)\n\t\treturn true\n\t})\n\n\treturn urls, nil\n}\n\ntype ExtractResult struct {\n\tPost Post\n\tErr  error\n}\n\n// ExtractAllPosts extracts all posts from the given URLs using a worker pool pattern\n// to limit concurrency and avoid overwhelming system resources.\nfunc (e *Extractor) ExtractAllPosts(ctx context.Context, urls []string) <-chan ExtractResult {\n\tresultCh := make(chan ExtractResult, len(urls))\n\n\tgo func() {\n\t\tdefer close(resultCh)\n\n\t\t// Create a channel for the URLs\n\t\turlCh := make(chan string, len(urls))\n\n\t\t// Fill 
the URL channel\n\t\tfor _, u := range urls {\n\t\t\turlCh <- u\n\t\t}\n\t\tclose(urlCh)\n\n\t\t// Limit concurrency - the number of workers is capped at 10 or the number of URLs, whichever is smaller\n\t\tworkerCount := 10\n\t\tif len(urls) < workerCount {\n\t\t\tworkerCount = len(urls)\n\t\t}\n\n\t\t// Create a WaitGroup to wait for all workers to finish\n\t\tvar wg sync.WaitGroup\n\t\twg.Add(workerCount)\n\n\t\t// Start the workers\n\t\tfor i := 0; i < workerCount; i++ {\n\t\t\tgo func() {\n\t\t\t\tdefer wg.Done()\n\n\t\t\t\tfor url := range urlCh {\n\t\t\t\t\tselect {\n\t\t\t\t\tcase <-ctx.Done():\n\t\t\t\t\t\t// Context cancelled, stop processing\n\t\t\t\t\t\treturn\n\t\t\t\t\tdefault:\n\t\t\t\t\t\tpost, err := e.ExtractPost(ctx, url)\n\t\t\t\t\t\tresultCh <- ExtractResult{Post: post, Err: err}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}()\n\t\t}\n\n\t\t// Wait for all workers to finish\n\t\twg.Wait()\n\t}()\n\n\treturn resultCh\n}\n\n// NewArchive creates a new Archive instance\nfunc NewArchive() *Archive {\n\treturn &Archive{\n\t\tEntries: make([]ArchiveEntry, 0),\n\t}\n}\n\n// AddEntry adds a new entry to the archive, sorted by publication date (newest first)\nfunc (a *Archive) AddEntry(post Post, filePath string, downloadTime time.Time) {\n\tentry := ArchiveEntry{\n\t\tPost:         post,\n\t\tFilePath:     filePath,\n\t\tDownloadTime: downloadTime,\n\t}\n\t\n\ta.Entries = append(a.Entries, entry)\n\ta.sortEntries()\n}\n\n// sortEntries sorts archive entries by publication date (newest first)\nfunc (a *Archive) sortEntries() {\n\tsort.Slice(a.Entries, func(i, j int) bool {\n\t\t// Parse post dates and compare (newest first)\n\t\tdateI, errI := time.Parse(time.RFC3339, a.Entries[i].Post.PostDate)\n\t\tdateJ, errJ := time.Parse(time.RFC3339, a.Entries[j].Post.PostDate)\n\t\t\n\t\tif errI != nil || errJ != nil {\n\t\t\t// If parsing fails, sort by title\n\t\t\treturn a.Entries[i].Post.Title < a.Entries[j].Post.Title\n\t\t}\n\t\t\n\t\treturn dateI.After(dateJ) // newest 
first\n\t})\n}\n\n// GenerateHTML creates an HTML archive page\nfunc (a *Archive) GenerateHTML(outputDir string) error {\n\tarchivePath := filepath.Join(outputDir, \"index.html\")\n\t\n\thtml := `<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n\t<meta charset=\"UTF-8\">\n\t<meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n\t<title>Substack Archive</title>\n\t<style>\n\t\tbody { font-family: Arial, sans-serif; max-width: 800px; margin: 0 auto; padding: 20px; }\n\t\th1 { color: #333; }\n\t\t.post { margin-bottom: 30px; padding: 20px; border: 1px solid #eee; border-radius: 8px; }\n\t\t.post h2 { margin-top: 0; }\n\t\t.post h2 a { text-decoration: none; color: #ff6719; }\n\t\t.post h2 a:hover { text-decoration: underline; }\n\t\t.meta { color: #666; font-size: 14px; margin-bottom: 10px; }\n\t\t.subtitle { color: #777; font-style: italic; margin-bottom: 10px; }\n\t\t.cover-image { max-width: 200px; float: right; margin-left: 15px; }\n\t</style>\n</head>\n<body>\n\t<h1>Substack Archive</h1>\n`\n\n\tfor _, entry := range a.Entries {\n\t\t// Make file path relative from archive directory\n\t\trelPath, _ := filepath.Rel(outputDir, entry.FilePath)\n\t\t\n\t\t// Format publication date\n\t\tpubDate := entry.Post.PostDate\n\t\tif parsedDate, err := time.Parse(time.RFC3339, entry.Post.PostDate); err == nil {\n\t\t\tpubDate = parsedDate.Format(\"January 2, 2006\")\n\t\t}\n\t\t\n\t\t// Format download date\n\t\tdownloadDate := entry.DownloadTime.Format(\"January 2, 2006 15:04\")\n\t\t\n\t\thtml += `\t<div class=\"post\">\n`\n\t\t\n\t\t// Add cover image if available\n\t\tif entry.Post.CoverImage != \"\" {\n\t\t\thtml += fmt.Sprintf(`\t\t<img src=\"%s\" alt=\"Cover\" class=\"cover-image\">\n`, entry.Post.CoverImage)\n\t\t}\n\t\t\n\t\thtml += fmt.Sprintf(`\t\t<h2><a href=\"%s\">%s</a></h2>\n\t\t<div class=\"meta\">Published: %s | Downloaded: %s</div>\n`, relPath, entry.Post.Title, pubDate, downloadDate)\n\t\t\n\t\t// Add subtitle/description\n\t\tdescription 
:= entry.Post.Subtitle\n\t\tif description == \"\" {\n\t\t\tdescription = entry.Post.Description\n\t\t}\n\t\tif description != \"\" {\n\t\t\thtml += fmt.Sprintf(`\t\t<div class=\"subtitle\">%s</div>\n`, description)\n\t\t}\n\t\t\n\t\thtml += `\t</div>\n`\n\t}\n\t\n\thtml += `</body>\n</html>`\n\t\n\treturn os.WriteFile(archivePath, []byte(html), 0644)\n}\n\n// GenerateMarkdown creates a Markdown archive page\nfunc (a *Archive) GenerateMarkdown(outputDir string) error {\n\tarchivePath := filepath.Join(outputDir, \"index.md\")\n\t\n\tcontent := \"# Substack Archive\\n\\n\"\n\t\n\tfor _, entry := range a.Entries {\n\t\t// Make file path relative from archive directory\n\t\trelPath, _ := filepath.Rel(outputDir, entry.FilePath)\n\t\t\n\t\t// Format publication date\n\t\tpubDate := entry.Post.PostDate\n\t\tif parsedDate, err := time.Parse(time.RFC3339, entry.Post.PostDate); err == nil {\n\t\t\tpubDate = parsedDate.Format(\"January 2, 2006\")\n\t\t}\n\t\t\n\t\t// Format download date\n\t\tdownloadDate := entry.DownloadTime.Format(\"January 2, 2006 15:04\")\n\t\t\n\t\tcontent += fmt.Sprintf(\"## [%s](%s)\\n\\n\", entry.Post.Title, relPath)\n\t\tcontent += fmt.Sprintf(\"**Published:** %s | **Downloaded:** %s\\n\\n\", pubDate, downloadDate)\n\t\t\n\t\t// Add cover image if available\n\t\tif entry.Post.CoverImage != \"\" {\n\t\t\tcontent += fmt.Sprintf(\"![Cover Image](%s)\\n\\n\", entry.Post.CoverImage)\n\t\t}\n\t\t\n\t\t// Add subtitle/description\n\t\tdescription := entry.Post.Subtitle\n\t\tif description == \"\" {\n\t\t\tdescription = entry.Post.Description\n\t\t}\n\t\tif description != \"\" {\n\t\t\tcontent += fmt.Sprintf(\"*%s*\\n\\n\", description)\n\t\t}\n\t\t\n\t\tcontent += \"---\\n\\n\"\n\t}\n\t\n\treturn os.WriteFile(archivePath, []byte(content), 0644)\n}\n\n// GenerateText creates a plain text archive page\nfunc (a *Archive) GenerateText(outputDir string) error {\n\tarchivePath := filepath.Join(outputDir, \"index.txt\")\n\t\n\tcontent := \"SUBSTACK 
ARCHIVE\\n================\\n\\n\"\n\t\n\tfor _, entry := range a.Entries {\n\t\t// Make file path relative from archive directory\n\t\trelPath, _ := filepath.Rel(outputDir, entry.FilePath)\n\t\t\n\t\t// Format publication date\n\t\tpubDate := entry.Post.PostDate\n\t\tif parsedDate, err := time.Parse(time.RFC3339, entry.Post.PostDate); err == nil {\n\t\t\tpubDate = parsedDate.Format(\"January 2, 2006\")\n\t\t}\n\t\t\n\t\t// Format download date\n\t\tdownloadDate := entry.DownloadTime.Format(\"January 2, 2006 15:04\")\n\t\t\n\t\tcontent += fmt.Sprintf(\"Title: %s\\n\", entry.Post.Title)\n\t\tcontent += fmt.Sprintf(\"File: %s\\n\", relPath)\n\t\tcontent += fmt.Sprintf(\"Published: %s\\n\", pubDate)\n\t\tcontent += fmt.Sprintf(\"Downloaded: %s\\n\", downloadDate)\n\t\t\n\t\t// Add subtitle/description\n\t\tdescription := entry.Post.Subtitle\n\t\tif description == \"\" {\n\t\t\tdescription = entry.Post.Description\n\t\t}\n\t\tif description != \"\" {\n\t\t\tcontent += fmt.Sprintf(\"Description: %s\\n\", description)\n\t\t}\n\t\t\n\t\tcontent += \"\\n\" + strings.Repeat(\"-\", 50) + \"\\n\\n\"\n\t}\n\t\n\treturn os.WriteFile(archivePath, []byte(content), 0644)\n}\n"
  },
  {
    "path": "lib/extractor_test.go",
    "content": "package lib\n\nimport (\n\t\"context\"\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"net/http\"\n\t\"net/http/httptest\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"strings\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/PuerkitoBio/goquery\"\n\t\"github.com/cenkalti/backoff/v4\"\n\t\"github.com/stretchr/testify/assert\"\n\t\"github.com/stretchr/testify/require\"\n)\n\n// Helper function to create a sample Post for testing\nfunc createSamplePost() Post {\n\treturn Post{\n\t\tId:               123,\n\t\tPublicationId:    456,\n\t\tType:             \"post\",\n\t\tSlug:             \"test-post\",\n\t\tPostDate:         \"2023-01-01\",\n\t\tCanonicalUrl:     \"https://example.substack.com/p/test-post\",\n\t\tPreviousPostSlug: \"previous-post\",\n\t\tNextPostSlug:     \"next-post\",\n\t\tCoverImage:       \"https://example.com/image.jpg\",\n\t\tDescription:      \"Test description\",\n\t\tSubtitle:         \"Test subtitle\",\n\t\tWordCount:        100,\n\t\tTitle:            \"Test Post\",\n\t\tBodyHTML:         \"<p>This is a <strong>test</strong> post.</p>\",\n\t}\n}\n\n// Helper function to create a mock HTML page with embedded JSON\nfunc createMockSubstackHTML(post Post) string {\n\t// Create a wrapper and marshal it to JSON\n\twrapper := PostWrapper{Post: post}\n\tjsonBytes, _ := json.Marshal(wrapper)\n\n\t// Escape quotes for embedding in JavaScript\n\tescapedJSON := strings.ReplaceAll(string(jsonBytes), `\"`, `\\\"`)\n\n\treturn fmt.Sprintf(`\n<!DOCTYPE html>\n<html>\n<head>\n  <title>%s</title>\n</head>\n<body>\n  <div class=\"post\">Some content</div>\n  <script>\n    window._preloads = JSON.parse(\"%s\")\n  </script>\n</body>\n</html>\n`, post.Title, escapedJSON)\n}\n\n// Test RawPost.ToPost\nfunc TestRawPostToPost(t *testing.T) {\n\t// Create a sample post\n\texpectedPost := createSamplePost()\n\n\t// Create a wrapper and marshal it to JSON\n\twrapper := PostWrapper{Post: expectedPost}\n\tjsonBytes, err := 
json.Marshal(wrapper)\n\trequire.NoError(t, err)\n\n\t// Create a RawPost with the JSON string\n\trawPost := RawPost{str: string(jsonBytes)}\n\n\t// Test conversion\n\tactualPost, err := rawPost.ToPost()\n\trequire.NoError(t, err)\n\n\t// Verify the result\n\tassert.Equal(t, expectedPost, actualPost)\n\n\t// Test with invalid JSON\n\tinvalidRawPost := RawPost{str: \"invalid json\"}\n\t_, err = invalidRawPost.ToPost()\n\tassert.Error(t, err)\n}\n\n// Test Post format conversions\nfunc TestPostFormatConversions(t *testing.T) {\n\tpost := createSamplePost()\n\n\tt.Run(\"ToHTML\", func(t *testing.T) {\n\t\thtml := post.ToHTML(true)\n\t\tassert.Contains(t, html, \"<h1>Test Post</h1>\")\n\t\tassert.Contains(t, html, \"<p>This is a <strong>test</strong> post.</p>\")\n\n\t\thtmlNoTitle := post.ToHTML(false)\n\t\tassert.NotContains(t, htmlNoTitle, \"<h1>Test Post</h1>\")\n\t\tassert.Contains(t, htmlNoTitle, \"<p>This is a <strong>test</strong> post.</p>\")\n\t})\n\n\tt.Run(\"ToMD\", func(t *testing.T) {\n\t\tmd, err := post.ToMD(true)\n\t\trequire.NoError(t, err)\n\t\tassert.Contains(t, md, \"# Test Post\")\n\t\tassert.Contains(t, md, \"This is a **test** post.\")\n\n\t\tmdNoTitle, err := post.ToMD(false)\n\t\trequire.NoError(t, err)\n\t\tassert.NotContains(t, mdNoTitle, \"# Test Post\")\n\t\tassert.Contains(t, mdNoTitle, \"This is a **test** post.\")\n\t})\n\n\tt.Run(\"ToText\", func(t *testing.T) {\n\t\ttext := post.ToText(true)\n\t\tassert.Contains(t, text, \"Test Post\")\n\t\tassert.Contains(t, text, \"This is a test post.\")\n\n\t\ttextNoTitle := post.ToText(false)\n\t\tassert.NotContains(t, textNoTitle, \"Test Post\\n\\n\")\n\t\tassert.Contains(t, textNoTitle, \"This is a test post.\")\n\t})\n\n\tt.Run(\"ToJSON\", func(t *testing.T) {\n\t\tjsonStr, err := post.ToJSON()\n\t\trequire.NoError(t, err)\n\t\tassert.Contains(t, jsonStr, `\"id\":123`)\n\t\tassert.Contains(t, jsonStr, `\"title\":\"Test Post\"`)\n\t})\n\n\tt.Run(\"contentForFormat\", func(t *testing.T) 
{\n\t\t// Test valid formats\n\t\tfor _, format := range []string{\"html\", \"md\", \"txt\"} {\n\t\t\tcontent, err := post.contentForFormat(format, true)\n\t\t\tassert.NoError(t, err)\n\t\t\tassert.NotEmpty(t, content)\n\t\t}\n\n\t\t// Test invalid format\n\t\t_, err := post.contentForFormat(\"invalid\", true)\n\t\tassert.Error(t, err)\n\t\tassert.Contains(t, err.Error(), \"unknown format\")\n\t})\n\n\t// Test error handling for format conversions\n\tt.Run(\"ToMD error handling\", func(t *testing.T) {\n\t\t// Create a post with problematic HTML for markdown conversion\n\t\t// Note: html-to-markdown library is quite robust, so we test with extremely malformed HTML\n\t\tproblemPost := createSamplePost()\n\t\tproblemPost.BodyHTML = \"<div><p>Nested without closing</div>\"\n\t\t\n\t\t// This should still work as the library handles most malformed HTML\n\t\t_, err := problemPost.ToMD(true)\n\t\tassert.NoError(t, err) // The library is quite tolerant\n\t})\n\n\tt.Run(\"ToJSON error handling\", func(t *testing.T) {\n\t\t// Create a post that would have issues during JSON marshaling\n\t\t// This is hard to trigger with normal Post struct, but we can test the error path\n\t\tproblemPost := createSamplePost()\n\t\t\n\t\t// Test with valid data (JSON marshaling rarely fails with valid structs)\n\t\tjsonStr, err := problemPost.ToJSON()\n\t\tassert.NoError(t, err)\n\t\tassert.NotEmpty(t, jsonStr)\n\t\t\n\t\t// Verify the JSON is valid\n\t\tvar parsedPost Post\n\t\terr = json.Unmarshal([]byte(jsonStr), &parsedPost)\n\t\tassert.NoError(t, err)\n\t\tassert.Equal(t, problemPost.Id, parsedPost.Id)\n\t\tassert.Equal(t, problemPost.Title, parsedPost.Title)\n\t})\n}\n\n// Test Post.WriteToFile\nfunc TestPostWriteToFile(t *testing.T) {\n\tpost := createSamplePost()\n\ttempDir, err := os.MkdirTemp(\"\", \"post-test-*\")\n\trequire.NoError(t, err)\n\tdefer os.RemoveAll(tempDir)\n\n\tformats := []string{\"html\", \"md\", \"txt\"}\n\n\tfor _, format := range formats {\n\t\tt.Run(format, 
func(t *testing.T) {\n\t\t\tfilePath := filepath.Join(tempDir, fmt.Sprintf(\"test.%s\", format))\n\t\t\terr := post.WriteToFile(filePath, format, false)\n\t\t\trequire.NoError(t, err)\n\n\t\t\t// Verify file exists\n\t\t\tfileInfo, err := os.Stat(filePath)\n\t\t\tassert.NoError(t, err)\n\t\t\tassert.True(t, fileInfo.Size() > 0, \"File should not be empty\")\n\n\t\t\t// Read file content\n\t\t\tcontent, err := os.ReadFile(filePath)\n\t\t\trequire.NoError(t, err)\n\n\t\t\t// Check content based on format\n\t\t\tswitch format {\n\t\t\tcase \"html\":\n\t\t\t\tassert.Contains(t, string(content), \"<h1>Test Post</h1>\")\n\t\t\t\tassert.Contains(t, string(content), \"<p>This is a <strong>test</strong> post.</p>\")\n\t\t\tcase \"md\":\n\t\t\t\tassert.Contains(t, string(content), \"# Test Post\")\n\t\t\t\tassert.Contains(t, string(content), \"This is a **test** post.\")\n\t\t\tcase \"txt\":\n\t\t\t\tassert.Contains(t, string(content), \"Test Post\")\n\t\t\t\tassert.Contains(t, string(content), \"This is a test post.\")\n\t\t\t}\n\t\t})\n\t}\n\n\t// Test writing to a non-existent directory\n\tt.Run(\"creating directory\", func(t *testing.T) {\n\t\tnewDir := filepath.Join(tempDir, \"subdir\", \"nested\")\n\t\tfilePath := filepath.Join(newDir, \"test.html\")\n\t\terr := post.WriteToFile(filePath, \"html\", false)\n\t\tassert.NoError(t, err)\n\n\t\t// Verify directory was created\n\t\t_, err = os.Stat(newDir)\n\t\tassert.NoError(t, err)\n\t})\n\n\t// Test invalid format\n\tt.Run(\"invalid format\", func(t *testing.T) {\n\t\tfilePath := filepath.Join(tempDir, \"test.invalid\")\n\t\terr := post.WriteToFile(filePath, \"invalid\", false)\n\t\tassert.Error(t, err)\n\t\tassert.Contains(t, err.Error(), \"unknown format\")\n\t})\n\n\t// Test with addSourceURL enabled\n\tt.Run(\"with source URL\", func(t *testing.T) {\n\t\tformats := []string{\"html\", \"md\", \"txt\"}\n\t\t\n\t\tfor _, format := range formats {\n\t\t\tt.Run(format, func(t *testing.T) {\n\t\t\t\tfilePath := 
filepath.Join(tempDir, fmt.Sprintf(\"test-with-source.%s\", format))\n\t\t\t\terr := post.WriteToFile(filePath, format, true)\n\t\t\t\trequire.NoError(t, err)\n\n\t\t\t\t// Read file content\n\t\t\t\tcontent, err := os.ReadFile(filePath)\n\t\t\t\trequire.NoError(t, err)\n\t\t\t\tcontentStr := string(content)\n\n\t\t\t\t// Check that source URL is included\n\t\t\t\tassert.Contains(t, contentStr, post.CanonicalUrl)\n\t\t\t\tassert.Contains(t, contentStr, \"original content\")\n\n\t\t\t\t// Check format-specific source URL formatting\n\t\t\t\tif format == \"html\" {\n\t\t\t\t\tassert.Contains(t, contentStr, \"<a href=\")\n\t\t\t\t\tassert.Contains(t, contentStr, \"style=\\\"margin-top: 2em\")\n\t\t\t\t} else {\n\t\t\t\t\tassert.Contains(t, contentStr, fmt.Sprintf(\"original content: %s\", post.CanonicalUrl))\n\t\t\t\t}\n\t\t\t})\n\t\t}\n\t})\n\n\t// Test with addSourceURL but no canonical URL\n\tt.Run(\"with source URL but no canonical URL\", func(t *testing.T) {\n\t\tpostWithoutURL := createSamplePost()\n\t\tpostWithoutURL.CanonicalUrl = \"\"\n\t\t\n\t\tfilePath := filepath.Join(tempDir, \"test-no-url.html\")\n\t\terr := postWithoutURL.WriteToFile(filePath, \"html\", true)\n\t\trequire.NoError(t, err)\n\n\t\t// Read file content\n\t\tcontent, err := os.ReadFile(filePath)\n\t\trequire.NoError(t, err)\n\t\tcontentStr := string(content)\n\n\t\t// Should not contain source URL line\n\t\tassert.NotContains(t, contentStr, \"original content\")\n\t})\n}\n\n// Test extractJSONString function\nfunc TestExtractJSONString(t *testing.T) {\n\tt.Run(\"validHTML\", func(t *testing.T) {\n\t\tpost := createSamplePost()\n\t\thtml := createMockSubstackHTML(post)\n\n\t\tdoc, err := goquery.NewDocumentFromReader(strings.NewReader(html))\n\t\trequire.NoError(t, err)\n\n\t\tjsonString, err := extractJSONString(doc)\n\t\trequire.NoError(t, err)\n\n\t\t// Create a wrapper and marshal to get expected JSON\n\t\twrapper := PostWrapper{Post: post}\n\t\texpectedJSONBytes, _ := 
json.Marshal(wrapper)\n\n\t\t// The expected JSON needs to have escaped quotes to match the actual output\n\t\texpectedJSON := strings.ReplaceAll(string(expectedJSONBytes), `\"`, `\\\"`)\n\t\tassert.Equal(t, expectedJSON, jsonString)\n\t})\n\n\tt.Run(\"invalidHTML\", func(t *testing.T) {\n\t\t// Test HTML without the required script\n\t\tinvalidHTML := `<html><body><p>No script here</p></body></html>`\n\t\tdoc, err := goquery.NewDocumentFromReader(strings.NewReader(invalidHTML))\n\t\trequire.NoError(t, err)\n\n\t\t_, err = extractJSONString(doc)\n\t\tassert.Error(t, err)\n\t\tassert.Contains(t, err.Error(), \"failed to extract JSON string\")\n\t})\n\n\tt.Run(\"malformedScript\", func(t *testing.T) {\n\t\t// Test HTML with malformed script\n\t\tmalformedHTML := `\n\t\t<html><body>\n\t\t<script>\n\t\t  window._preloads = JSON.parse(\"incomplete\n\t\t</script>\n\t\t</body></html>`\n\n\t\tdoc, err := goquery.NewDocumentFromReader(strings.NewReader(malformedHTML))\n\t\trequire.NoError(t, err)\n\n\t\t_, err = extractJSONString(doc)\n\t\tassert.Error(t, err)\n\t})\n}\n\n// Create a real test server that serves mock Substack pages\nfunc createSubstackTestServer() (*httptest.Server, map[string]Post) {\n\tposts := make(map[string]Post)\n\n\t// Create several sample posts\n\tfor i := 1; i <= 5; i++ {\n\t\tpost := createSamplePost()\n\t\tpost.Id = i\n\t\tpost.Title = fmt.Sprintf(\"Test Post %d\", i)\n\t\tpost.Slug = fmt.Sprintf(\"test-post-%d\", i)\n\t\tpost.CanonicalUrl = fmt.Sprintf(\"https://example.substack.com/p/test-post-%d\", i)\n\n\t\tposts[fmt.Sprintf(\"/p/test-post-%d\", i)] = post\n\t}\n\n\t// Create sitemap XML with different dates\n\tsitemapXML := `<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n`\n\t// Create ordered list of posts to ensure deterministic date assignment\n\tdates := []string{\"2023-01-01\", \"2023-01-02\", \"2023-01-03\", \"2023-01-04\", \"2023-01-05\"}\n\tfor i := 1; i <= 5; i++ 
{\n\t\tpost := posts[fmt.Sprintf(\"/p/test-post-%d\", i)]\n\t\tsitemapXML += fmt.Sprintf(`  <url>\n    <loc>https://example.substack.com/p/%s</loc>\n    <lastmod>%s</lastmod>\n  </url>\n`, post.Slug, dates[i-1])\n\t}\n\tsitemapXML += `</urlset>`\n\n\t// Create server\n\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\tpath := r.URL.Path\n\n\t\t// Handle sitemap request\n\t\tif path == \"/sitemap.xml\" {\n\t\t\tw.Header().Set(\"Content-Type\", \"application/xml\")\n\t\t\tw.Write([]byte(sitemapXML))\n\t\t\treturn\n\t\t}\n\n\t\t// Handle post requests\n\t\tpost, exists := posts[path]\n\t\tif exists {\n\t\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\t\tw.Write([]byte(createMockSubstackHTML(post)))\n\t\t\treturn\n\t\t}\n\n\t\t// Handle not found\n\t\tw.WriteHeader(http.StatusNotFound)\n\t}))\n\n\treturn server, posts\n}\n\n// Test Extractor.ExtractPost\nfunc TestExtractorExtractPost(t *testing.T) {\n\t// Create test server\n\tserver, posts := createSubstackTestServer()\n\tdefer server.Close()\n\n\t// Create extractor with default fetcher\n\textractor := NewExtractor(nil)\n\n\t// Test successful extraction\n\tt.Run(\"successfulExtraction\", func(t *testing.T) {\n\t\tctx := context.Background()\n\n\t\tfor path, expectedPost := range posts {\n\t\t\tpostURL := server.URL + path\n\t\t\textractedPost, err := extractor.ExtractPost(ctx, postURL)\n\n\t\t\trequire.NoError(t, err)\n\t\t\tassert.Equal(t, expectedPost.Id, extractedPost.Id)\n\t\t\tassert.Equal(t, expectedPost.Title, extractedPost.Title)\n\t\t\tassert.Equal(t, expectedPost.BodyHTML, extractedPost.BodyHTML)\n\t\t}\n\t})\n\n\t// Test invalid URL\n\tt.Run(\"invalidURL\", func(t *testing.T) {\n\t\tctx := context.Background()\n\t\t_, err := extractor.ExtractPost(ctx, \"invalid-url\")\n\t\tassert.Error(t, err)\n\t})\n\n\t// Test not found\n\tt.Run(\"notFound\", func(t *testing.T) {\n\t\tctx := context.Background()\n\t\t_, err := extractor.ExtractPost(ctx, 
server.URL+\"/p/non-existent\")\n\t\tassert.Error(t, err)\n\t})\n\n\t// Test context cancellation\n\tt.Run(\"contextCancellation\", func(t *testing.T) {\n\t\tctx, cancel := context.WithCancel(context.Background())\n\t\tcancel() // Cancel immediately\n\n\t\t_, err := extractor.ExtractPost(ctx, server.URL+\"/p/test-post-1\")\n\t\tassert.Error(t, err)\n\t\tassert.Contains(t, err.Error(), \"context\")\n\t})\n}\n\n// Test Extractor.GetAllPostsURLs\nfunc TestExtractorGetAllPostsURLs(t *testing.T) {\n\t// Create test server\n\tserver, posts := createSubstackTestServer()\n\tdefer server.Close()\n\n\t// Create extractor\n\textractor := NewExtractor(nil)\n\tctx := context.Background()\n\n\t// Test without filter\n\tt.Run(\"withoutFilter\", func(t *testing.T) {\n\t\turls, err := extractor.GetAllPostsURLs(ctx, server.URL, nil)\n\t\trequire.NoError(t, err)\n\n\t\t// Should find all post URLs\n\t\tassert.Equal(t, len(posts), len(urls))\n\n\t\t// Check each URL is present\n\t\tfor _, post := range posts {\n\t\t\tfound := false\n\t\t\tfor _, url := range urls {\n\t\t\t\tif strings.Contains(url, post.Slug) {\n\t\t\t\t\tfound = true\n\t\t\t\t\tbreak\n\t\t\t\t}\n\t\t\t}\n\t\t\tassert.True(t, found, \"URL for post %s should be present\", post.Slug)\n\t\t}\n\t})\n\n\t// Test with date filter\n\tt.Run(\"withDateFilter\", func(t *testing.T) {\n\t\t// Filter for posts after 2023-01-02 (should get 3 posts: 2023-01-03, 2023-01-04, 2023-01-05)\n\t\tdateFilter := func(date string) bool {\n\t\t\treturn date > \"2023-01-02\"\n\t\t}\n\n\t\turls, err := extractor.GetAllPostsURLs(ctx, server.URL, dateFilter)\n\t\trequire.NoError(t, err)\n\n\t\t// Should get 3 posts (dates 2023-01-03, 2023-01-04, 2023-01-05)\n\t\tassert.Len(t, urls, 3)\n\t\t\n\t\t// Verify the filtered URLs are correct\n\t\tfor _, url := range urls {\n\t\t\t// Should contain test-post-3, test-post-4, or test-post-5\n\t\t\tassert.True(t, strings.Contains(url, \"test-post-3\") || \n\t\t\t\tstrings.Contains(url, \"test-post-4\") || 
\n\t\t\t\tstrings.Contains(url, \"test-post-5\"))\n\t\t}\n\t})\n\n\t// Test with context cancellation\n\tt.Run(\"contextCancellation\", func(t *testing.T) {\n\t\tctx, cancel := context.WithCancel(context.Background())\n\t\tcancel() // Cancel immediately\n\n\t\t_, err := extractor.GetAllPostsURLs(ctx, server.URL, nil)\n\t\tassert.Error(t, err)\n\t})\n\n\t// Test with invalid URL\n\tt.Run(\"invalidURL\", func(t *testing.T) {\n\t\t_, err := extractor.GetAllPostsURLs(ctx, \"invalid-url\", nil)\n\t\tassert.Error(t, err)\n\t})\n}\n\n// Test Extractor.ExtractAllPosts\nfunc TestExtractorExtractAllPosts(t *testing.T) {\n\t// Create test server\n\tserver, posts := createSubstackTestServer()\n\tdefer server.Close()\n\n\t// Create URLs list\n\turls := make([]string, 0, len(posts))\n\tfor path := range posts {\n\t\turls = append(urls, server.URL+path)\n\t}\n\n\t// Create extractor\n\textractor := NewExtractor(nil)\n\tctx := context.Background()\n\n\t// Test successful extraction of all posts\n\tt.Run(\"successfulExtraction\", func(t *testing.T) {\n\t\tresultCh := extractor.ExtractAllPosts(ctx, urls)\n\n\t\t// Collect results\n\t\tresults := make(map[int]Post)\n\t\terrorCount := 0\n\n\t\tfor result := range resultCh {\n\t\t\tif result.Err != nil {\n\t\t\t\terrorCount++\n\t\t\t} else {\n\t\t\t\tresults[result.Post.Id] = result.Post\n\t\t\t}\n\t\t}\n\n\t\t// Verify results\n\t\tassert.Equal(t, 0, errorCount, \"There should be no errors\")\n\t\tassert.Equal(t, len(posts), len(results), \"All posts should be extracted\")\n\n\t\t// Check each post\n\t\tfor _, post := range posts {\n\t\t\textractedPost, exists := results[post.Id]\n\t\t\tassert.True(t, exists, \"Post with ID %d should be extracted\", post.Id)\n\t\t\tif exists {\n\t\t\t\tassert.Equal(t, post.Title, extractedPost.Title)\n\t\t\t\tassert.Equal(t, post.BodyHTML, extractedPost.BodyHTML)\n\t\t\t}\n\t\t}\n\t})\n\n\t// Test with context cancellation\n\tt.Run(\"contextCancellation\", func(t *testing.T) {\n\t\tctx, cancel := 
context.WithCancel(context.Background())\n\n\t\tresultCh := extractor.ExtractAllPosts(ctx, urls)\n\n\t\t// Cancel after receiving first result\n\t\tvar count int\n\t\tvar wg sync.WaitGroup\n\t\twg.Add(1)\n\n\t\tgo func() {\n\t\t\tdefer wg.Done()\n\t\t\tfor result := range resultCh {\n\t\t\t\tif result.Err != nil {\n\t\t\t\t\tcontinue\n\t\t\t\t}\n\t\t\t\tcount++\n\t\t\t\tif count == 1 {\n\t\t\t\t\tcancel()\n\t\t\t\t\t// Add a small delay to ensure cancellation propagates\n\t\t\t\t\ttime.Sleep(100 * time.Millisecond)\n\t\t\t\t\tbreak // Exit loop early after cancelling\n\t\t\t\t}\n\t\t\t}\n\t\t}()\n\n\t\twg.Wait()\n\n\t\t// We should have received at least one result before cancellation\n\t\tassert.GreaterOrEqual(t, count, 1)\n\t\t// Don't assert that count < len(posts) since on fast machines all might complete\n\t})\n\n\t// Test with mixed responses (some successful, some errors)\n\tt.Run(\"mixedResponses\", func(t *testing.T) {\n\t\t// Add some invalid URLs to the list\n\t\tmixedUrls := append([]string{\"invalid-url\", server.URL + \"/p/non-existent\"}, urls...)\n\n\t\tresultCh := extractor.ExtractAllPosts(ctx, mixedUrls)\n\n\t\t// Collect results\n\t\tsuccessCount := 0\n\t\terrorCount := 0\n\n\t\tfor result := range resultCh {\n\t\t\tif result.Err != nil {\n\t\t\t\terrorCount++\n\t\t\t} else {\n\t\t\t\tsuccessCount++\n\t\t\t}\n\t\t}\n\n\t\t// Verify results\n\t\tassert.Equal(t, len(posts), successCount, \"All valid posts should be extracted\")\n\t\tassert.Equal(t, 2, errorCount, \"There should be errors for invalid URLs\")\n\t})\n\n\t// Test worker concurrency limiting\n\tt.Run(\"concurrencyLimit\", func(t *testing.T) {\n\t\t// Create a large number of duplicate URLs to test concurrency\n\t\tmanyUrls := make([]string, 50)\n\t\tfor i := range manyUrls {\n\t\t\tmanyUrls[i] = urls[i%len(urls)]\n\t\t}\n\n\t\t// Create a channel to track concurrent requests\n\t\ttype accessRecord struct {\n\t\t\turl       string\n\t\t\ttimestamp time.Time\n\t\t}\n\n\t\taccessCh := 
make(chan accessRecord, len(manyUrls))\n\n\t\t// Create a test server that records access times\n\t\tconcurrentServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t\taccessCh <- accessRecord{\n\t\t\t\turl:       r.URL.Path,\n\t\t\t\ttimestamp: time.Now(),\n\t\t\t}\n\n\t\t\t// Simulate some processing time\n\t\t\ttime.Sleep(100 * time.Millisecond)\n\n\t\t\t// Serve the same content as the regular server\n\t\t\tpath := r.URL.Path\n\t\t\tpost, exists := posts[path]\n\t\t\tif exists {\n\t\t\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\t\t\tw.Write([]byte(createMockSubstackHTML(post)))\n\t\t\t\treturn\n\t\t\t}\n\n\t\t\tw.WriteHeader(http.StatusNotFound)\n\t\t}))\n\t\tdefer concurrentServer.Close()\n\n\t\t// Replace URLs with concurrent server URLs\n\t\tconcurrentUrls := make([]string, len(manyUrls))\n\t\tfor i, u := range manyUrls {\n\t\t\tpath := strings.TrimPrefix(u, server.URL)\n\t\t\tconcurrentUrls[i] = concurrentServer.URL + path\n\t\t}\n\n\t\t// Create extractor with limited workers\n\t\tcustomFetcher := NewFetcher(WithMaxWorkers(10), WithRatePerSecond(100))\n\t\tconcurrentExtractor := NewExtractor(customFetcher)\n\n\t\t// Start extraction\n\t\tresultCh := concurrentExtractor.ExtractAllPosts(ctx, concurrentUrls)\n\n\t\t// Collect all results to make sure extraction completes\n\t\tvar results []ExtractResult\n\t\tfor result := range resultCh {\n\t\t\tresults = append(results, result)\n\t\t}\n\n\t\t// All requests have finished, so no handler will send again; close the channel to end the range loop below\n\t\tclose(accessCh)\n\n\t\t// Process access records to determine concurrency\n\t\tvar accessRecords []accessRecord\n\t\tfor record := range accessCh {\n\t\t\taccessRecords = append(accessRecords, record)\n\t\t}\n\n\t\t// Sweep the records in arrival order (no explicit sort) to estimate peak concurrency\n\t\tmaxConcurrent := 0\n\t\tactiveTimes := make([]time.Time, 0)\n\n\t\tfor _, record := range accessRecords {\n\t\t\t// Add this request's start time\n\t\t\tactiveTimes = append(activeTimes, 
record.timestamp)\n\n\t\t\t// Expire any requests that would have completed by now\n\t\t\tnewActiveTimes := make([]time.Time, 0)\n\t\t\tfor _, start := range activeTimes {\n\t\t\t\tif start.Add(100 * time.Millisecond).After(record.timestamp) {\n\t\t\t\t\tnewActiveTimes = append(newActiveTimes, start)\n\t\t\t\t}\n\t\t\t}\n\t\t\tactiveTimes = newActiveTimes\n\n\t\t\t// Update max concurrent\n\t\t\tif len(activeTimes) > maxConcurrent {\n\t\t\t\tmaxConcurrent = len(activeTimes)\n\t\t\t}\n\t\t}\n\n\t\t// Verify concurrency was limited appropriately\n\t\t// Note: This test is timing-dependent; the fetcher uses 10 workers, and the bound of 15 leaves slack for jitter\n\t\tassert.LessOrEqual(t, maxConcurrent, 15, \"Concurrency should be limited\")\n\n\t\t// Ensure all requests were processed\n\t\tassert.Equal(t, len(concurrentUrls), len(results))\n\t})\n}\n\n// Test error handling\n\nfunc TestExtractorErrorHandling(t *testing.T) {\n\t// Create a server that simulates various errors\n\tvar requestCount atomic.Int32\n\n\terrorServer := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t// Count every incoming request\n\t\trequestCount.Add(1)\n\t\tpath := r.URL.Path\n\n\t\t// Simulate different errors based on path - order matters here!\n\t\tswitch {\n\t\tcase path == \"/p/normal-post\":\n\t\t\t// Return a valid post\n\t\t\tpost := createSamplePost()\n\t\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\t\tw.Write([]byte(createMockSubstackHTML(post)))\n\t\t\treturn\n\n\t\tcase strings.Contains(path, \"not-found\"):\n\t\t\tw.WriteHeader(http.StatusNotFound)\n\t\t\treturn\n\n\t\tcase strings.Contains(path, \"server-error\"):\n\t\t\tw.WriteHeader(http.StatusInternalServerError)\n\t\t\treturn\n\n\t\tcase strings.Contains(path, \"rate-limit\"):\n\t\t\tw.Header().Set(\"Retry-After\", \"1\")\n\t\t\tw.WriteHeader(http.StatusTooManyRequests)\n\t\t\treturn\n\n\t\tcase strings.Contains(path, \"bad-json\"):\n\t\t\t// Return valid HTML but with malformed JSON\n\t\t\thtml := `\n\t\t\t<!DOCTYPE 
html>\n\t\t\t<html>\n\t\t\t<head><title>Bad JSON</title></head>\n\t\t\t<body>\n\t\t\t  <script>\n\t\t\t\twindow._preloads = JSON.parse(\"{malformed json}\")\n\t\t\t  </script>\n\t\t\t</body>\n\t\t\t</html>`\n\t\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\t\tw.Write([]byte(html))\n\t\t\treturn\n\n\t\tcase strings.Contains(path, \"timeout-post\"):\n\t\t\t// Use a long sleep to ensure timeout - longer than the client timeout\n\t\t\ttime.Sleep(2 * time.Second)\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t\treturn\n\n\t\tdefault:\n\t\t\t// Return a valid post for other paths\n\t\t\tpost := createSamplePost()\n\t\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\t\tw.Write([]byte(createMockSubstackHTML(post)))\n\t\t\treturn\n\t\t}\n\t}))\n\tdefer errorServer.Close()\n\n\t// Create paths for different error scenarios\n\tpaths := []string{\n\t\t\"/p/normal-post\",\n\t\t\"/p/not-found\",\n\t\t\"/p/server-error\",\n\t\t\"/p/rate-limit\",\n\t\t\"/p/bad-json\",\n\t\t\"/p/timeout-post\",\n\t}\n\n\t// Create URLs\n\turls := make([]string, len(paths))\n\tfor i, path := range paths {\n\t\turls[i] = errorServer.URL + path\n\t}\n\n\t// Create extractor with short timeout and limited retries\n\tbackoffCfg := backoff.NewExponentialBackOff()\n\tbackoffCfg.MaxElapsedTime = 1 * time.Second // Short timeout for tests\n\tbackoffCfg.InitialInterval = 100 * time.Millisecond\n\n\tfetcher := NewFetcher(\n\t\tWithTimeout(500*time.Millisecond), // Make timeout shorter than the sleep for timeout test\n\t\tWithBackOffConfig(backoffCfg),\n\t)\n\n\textractor := NewExtractor(fetcher)\n\tctx := context.Background()\n\n\t// Test individual error cases\n\tt.Run(\"NotFound\", func(t *testing.T) {\n\t\t_, err := extractor.ExtractPost(ctx, errorServer.URL+\"/p/not-found\")\n\t\tassert.Error(t, err)\n\t})\n\n\tt.Run(\"ServerError\", func(t *testing.T) {\n\t\t_, err := extractor.ExtractPost(ctx, errorServer.URL+\"/p/server-error\")\n\t\tassert.Error(t, err)\n\t})\n\n\tt.Run(\"RateLimit\", func(t 
*testing.T) {\n\t\t_, err := extractor.ExtractPost(ctx, errorServer.URL+\"/p/rate-limit\")\n\t\tassert.Error(t, err)\n\t})\n\n\tt.Run(\"BadJSON\", func(t *testing.T) {\n\t\t_, err := extractor.ExtractPost(ctx, errorServer.URL+\"/p/bad-json\")\n\t\tassert.Error(t, err)\n\t})\n\n\tt.Run(\"Timeout\", func(t *testing.T) {\n\t\t// Test with a URL that will cause a timeout\n\t\t_, err := extractor.ExtractPost(ctx, errorServer.URL+\"/p/timeout-post\")\n\t\tassert.Error(t, err)\n\t\t// The error may be a context deadline exceeded or a timeout error\n\t})\n\n\t// Test handling multiple URLs with mixed errors\n\tt.Run(\"MixedErrors\", func(t *testing.T) {\n\t\tresultCh := extractor.ExtractAllPosts(ctx, urls)\n\n\t\t// Collect results\n\t\tsuccessCount := 0\n\t\terrorCount := 0\n\n\t\tfor result := range resultCh {\n\t\t\tif result.Err != nil {\n\t\t\t\terrorCount++\n\t\t\t} else {\n\t\t\t\tsuccessCount++\n\t\t\t}\n\t\t}\n\n\t\t// We expect at least one success (the normal post) and several errors\n\t\tassert.GreaterOrEqual(t, successCount, 1)\n\t\tassert.GreaterOrEqual(t, errorCount, 1) // At least one error (likely timeout)\n\t})\n}\n\n// Test enhanced post extraction features (subtitle and cover image)\nfunc TestEnhancedPostExtraction(t *testing.T) {\n\tt.Run(\"SubtitleExtraction\", func(t *testing.T) {\n\t\tpost := createSamplePost()\n\t\tpost.Subtitle = \"\" // Clear subtitle from JSON to test HTML extraction\n\t\t\n\t\t// Create mock HTML with subtitle element\n\t\thtml := fmt.Sprintf(`\n<!DOCTYPE html>\n<html>\n<head>\n  <title>%s</title>\n  <meta property=\"og:image\" content=\"https://example.com/og-image.jpg\">\n</head>\n<body>\n  <div class=\"subtitle\">   This is the subtitle from HTML   </div>\n  <div class=\"post\">Some content</div>\n  <script>\n    window._preloads = JSON.parse(\"%s\")\n  </script>\n</body>\n</html>\n`, post.Title, escapeJSONForJS(post))\n\n\t\t// Create test server\n\t\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, 
r *http.Request) {\n\t\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\t\tw.Write([]byte(html))\n\t\t}))\n\t\tdefer server.Close()\n\n\t\textractor := NewExtractor(nil)\n\t\tctx := context.Background()\n\n\t\textractedPost, err := extractor.ExtractPost(ctx, server.URL)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Verify subtitle was extracted and trimmed\n\t\tassert.Equal(t, \"This is the subtitle from HTML\", extractedPost.Subtitle)\n\t})\n\n\tt.Run(\"CoverImageFromOGTag\", func(t *testing.T) {\n\t\tpost := createSamplePost()\n\t\tpost.CoverImage = \"\" // Clear cover image from JSON to test og:image extraction\n\t\t\n\t\t// Create mock HTML with og:image meta tag\n\t\thtml := fmt.Sprintf(`\n<!DOCTYPE html>\n<html>\n<head>\n  <title>%s</title>\n  <meta property=\"og:image\" content=\"https://example.com/og-cover.jpg\">\n</head>\n<body>\n  <div class=\"post\">Some content</div>\n  <script>\n    window._preloads = JSON.parse(\"%s\")\n  </script>\n</body>\n</html>\n`, post.Title, escapeJSONForJS(post))\n\n\t\t// Create test server\n\t\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\t\tw.Write([]byte(html))\n\t\t}))\n\t\tdefer server.Close()\n\n\t\textractor := NewExtractor(nil)\n\t\tctx := context.Background()\n\n\t\textractedPost, err := extractor.ExtractPost(ctx, server.URL)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Verify cover image was extracted from og:image\n\t\tassert.Equal(t, \"https://example.com/og-cover.jpg\", extractedPost.CoverImage)\n\t})\n\n\tt.Run(\"ExistingCoverImagePreserved\", func(t *testing.T) {\n\t\tpost := createSamplePost()\n\t\tpost.CoverImage = \"https://existing.com/image.jpg\"\n\t\t\n\t\t// Create mock HTML with og:image meta tag (should be ignored)\n\t\thtml := fmt.Sprintf(`\n<!DOCTYPE html>\n<html>\n<head>\n  <title>%s</title>\n  <meta property=\"og:image\" content=\"https://example.com/og-cover.jpg\">\n</head>\n<body>\n  <div 
class=\"post\">Some content</div>\n  <script>\n    window._preloads = JSON.parse(\"%s\")\n  </script>\n</body>\n</html>\n`, post.Title, escapeJSONForJS(post))\n\n\t\t// Create test server\n\t\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\t\tw.Write([]byte(html))\n\t\t}))\n\t\tdefer server.Close()\n\n\t\textractor := NewExtractor(nil)\n\t\tctx := context.Background()\n\n\t\textractedPost, err := extractor.ExtractPost(ctx, server.URL)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Verify existing cover image was preserved (not overwritten by og:image)\n\t\tassert.Equal(t, \"https://existing.com/image.jpg\", extractedPost.CoverImage)\n\t})\n\n\tt.Run(\"NoSubtitleOrCoverImage\", func(t *testing.T) {\n\t\tpost := createSamplePost()\n\t\tpost.Subtitle = \"\"\n\t\tpost.CoverImage = \"\"\n\t\t\n\t\t// Create mock HTML without subtitle or og:image\n\t\thtml := fmt.Sprintf(`\n<!DOCTYPE html>\n<html>\n<head>\n  <title>%s</title>\n</head>\n<body>\n  <div class=\"post\">Some content</div>\n  <script>\n    window._preloads = JSON.parse(\"%s\")\n  </script>\n</body>\n</html>\n`, post.Title, escapeJSONForJS(post))\n\n\t\t// Create test server\n\t\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t\tw.Header().Set(\"Content-Type\", \"text/html\")\n\t\t\tw.Write([]byte(html))\n\t\t}))\n\t\tdefer server.Close()\n\n\t\textractor := NewExtractor(nil)\n\t\tctx := context.Background()\n\n\t\textractedPost, err := extractor.ExtractPost(ctx, server.URL)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Verify empty subtitle and cover image remain empty\n\t\tassert.Empty(t, extractedPost.Subtitle)\n\t\tassert.Empty(t, extractedPost.CoverImage)\n\t})\n}\n\n// Helper function to escape JSON for embedding in JavaScript\nfunc escapeJSONForJS(post Post) string {\n\twrapper := PostWrapper{Post: post}\n\tjsonBytes, _ := json.Marshal(wrapper)\n\treturn 
strings.ReplaceAll(string(jsonBytes), `\"`, `\\\"`)\n}\n\n// Test Archive functionality\nfunc TestArchive(t *testing.T) {\n\tt.Run(\"NewArchive\", func(t *testing.T) {\n\t\tarchive := NewArchive()\n\t\tassert.NotNil(t, archive)\n\t\tassert.NotNil(t, archive.Entries)\n\t\tassert.Len(t, archive.Entries, 0)\n\t})\n\n\tt.Run(\"AddEntry\", func(t *testing.T) {\n\t\tarchive := NewArchive()\n\t\tpost1 := createSamplePost()\n\t\tpost1.PostDate = \"2023-01-01T00:00:00Z\"\n\t\tpost1.Title = \"First Post\"\n\t\t\n\t\tpost2 := createSamplePost()\n\t\tpost2.PostDate = \"2023-01-02T00:00:00Z\"\n\t\tpost2.Title = \"Second Post\"\n\t\t\n\t\tpost3 := createSamplePost()\n\t\tpost3.PostDate = \"2023-01-03T00:00:00Z\"\n\t\tpost3.Title = \"Third Post\"\n\n\t\tdownloadTime := time.Now()\n\t\t\n\t\t// Add entries in random order\n\t\tarchive.AddEntry(post2, \"post2.html\", downloadTime)\n\t\tarchive.AddEntry(post1, \"post1.html\", downloadTime)\n\t\tarchive.AddEntry(post3, \"post3.html\", downloadTime)\n\n\t\t// Verify entries were added and sorted by date (newest first)\n\t\tassert.Len(t, archive.Entries, 3)\n\t\tassert.Equal(t, \"Third Post\", archive.Entries[0].Post.Title) // 2023-01-03 (newest)\n\t\tassert.Equal(t, \"Second Post\", archive.Entries[1].Post.Title) // 2023-01-02\n\t\tassert.Equal(t, \"First Post\", archive.Entries[2].Post.Title) // 2023-01-01 (oldest)\n\t})\n\n\tt.Run(\"SortingWithInvalidDates\", func(t *testing.T) {\n\t\tarchive := NewArchive()\n\t\t\n\t\tpost1 := createSamplePost()\n\t\tpost1.PostDate = \"invalid-date\"\n\t\tpost1.Title = \"A Post\"\n\t\t\n\t\tpost2 := createSamplePost()\n\t\tpost2.PostDate = \"also-invalid\"\n\t\tpost2.Title = \"B Post\"\n\t\t\n\t\tdownloadTime := time.Now()\n\t\t\n\t\tarchive.AddEntry(post2, \"post2.html\", downloadTime)\n\t\tarchive.AddEntry(post1, \"post1.html\", downloadTime)\n\n\t\t// Should sort by title when dates are invalid\n\t\tassert.Len(t, archive.Entries, 2)\n\t\tassert.Equal(t, \"A Post\", archive.Entries[0].Post.Title) 
// Alphabetical order\n\t\tassert.Equal(t, \"B Post\", archive.Entries[1].Post.Title)\n\t})\n\n\tt.Run(\"ArchiveEntryFields\", func(t *testing.T) {\n\t\tarchive := NewArchive()\n\t\tpost := createSamplePost()\n\t\tfilePath := \"/path/to/post.html\"\n\t\tdownloadTime := time.Now()\n\t\t\n\t\tarchive.AddEntry(post, filePath, downloadTime)\n\t\t\n\t\tentry := archive.Entries[0]\n\t\tassert.Equal(t, post, entry.Post)\n\t\tassert.Equal(t, filePath, entry.FilePath)\n\t\tassert.Equal(t, downloadTime, entry.DownloadTime)\n\t})\n}\n\n// Test Archive page generation\nfunc TestArchivePageGeneration(t *testing.T) {\n\t// Helper function to create a test archive\n\tsetupTestArchive := func() (*Archive, string) {\n\t\ttempDir, err := os.MkdirTemp(\"\", \"archive_test\")\n\t\trequire.NoError(t, err)\n\t\t\n\t\tarchive := NewArchive()\n\t\t\n\t\t// Create sample posts with different dates and metadata\n\t\tpost1 := createSamplePost()\n\t\tpost1.PostDate = \"2023-01-01T10:30:00Z\"\n\t\tpost1.Title = \"First Post\"\n\t\tpost1.Subtitle = \"A great first post\"\n\t\tpost1.CoverImage = \"https://example.com/cover1.jpg\"\n\t\t\n\t\tpost2 := createSamplePost()\n\t\tpost2.PostDate = \"2023-01-02T15:45:00Z\" \n\t\tpost2.Title = \"Second Post\"\n\t\tpost2.Subtitle = \"\" // Empty subtitle, should fall back to description\n\t\tpost2.Description = \"This is the description\"\n\t\tpost2.CoverImage = \"\"\n\t\t\n\t\tpost3 := createSamplePost()\n\t\tpost3.PostDate = \"2023-01-03T08:15:00Z\"\n\t\tpost3.Title = \"Third Post\"\n\t\tpost3.Subtitle = \"\"\n\t\tpost3.Description = \"\"\n\t\tpost3.CoverImage = \"https://example.com/cover3.jpg\"\n\t\t\n\t\tdownloadTime, _ := time.Parse(time.RFC3339, \"2023-01-10T12:00:00Z\")\n\t\t\n\t\tarchive.AddEntry(post1, filepath.Join(tempDir, \"post1.html\"), downloadTime)\n\t\tarchive.AddEntry(post2, filepath.Join(tempDir, \"post2.html\"), downloadTime.Add(time.Hour))\n\t\tarchive.AddEntry(post3, filepath.Join(tempDir, \"post3.html\"), 
downloadTime.Add(2*time.Hour))\n\t\t\n\t\treturn archive, tempDir\n\t}\n\n\tt.Run(\"GenerateHTML\", func(t *testing.T) {\n\t\tarchive, tempDir := setupTestArchive()\n\t\tdefer os.RemoveAll(tempDir)\n\t\t\n\t\terr := archive.GenerateHTML(tempDir)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Check file was created\n\t\tindexPath := filepath.Join(tempDir, \"index.html\")\n\t\tassert.FileExists(t, indexPath)\n\t\t\n\t\t// Read and verify content\n\t\tcontent, err := os.ReadFile(indexPath)\n\t\trequire.NoError(t, err)\n\t\thtmlContent := string(content)\n\t\t\n\t\t// Verify HTML structure\n\t\tassert.Contains(t, htmlContent, \"<!DOCTYPE html>\")\n\t\tassert.Contains(t, htmlContent, \"<title>Substack Archive</title>\")\n\t\tassert.Contains(t, htmlContent, \"<h1>Substack Archive</h1>\")\n\t\t\n\t\t// Verify posts are included in correct order (newest first)\n\t\tassert.Contains(t, htmlContent, \"Third Post\") // Should appear first (newest)\n\t\tassert.Contains(t, htmlContent, \"Second Post\")\n\t\tassert.Contains(t, htmlContent, \"First Post\")\n\t\t\n\t\t// Verify relative paths\n\t\tassert.Contains(t, htmlContent, \"post1.html\")\n\t\tassert.Contains(t, htmlContent, \"post2.html\") \n\t\tassert.Contains(t, htmlContent, \"post3.html\")\n\t\t\n\t\t// Verify cover images and descriptions\n\t\tassert.Contains(t, htmlContent, \"https://example.com/cover1.jpg\")\n\t\tassert.Contains(t, htmlContent, \"https://example.com/cover3.jpg\")\n\t\tassert.Contains(t, htmlContent, \"A great first post\") // Subtitle\n\t\tassert.Contains(t, htmlContent, \"This is the description\") // Fallback description\n\t\t\n\t\t// Verify dates are formatted\n\t\tassert.Contains(t, htmlContent, \"January 1, 2023\") // Formatted publication date\n\t\tassert.Contains(t, htmlContent, \"January 10, 2023 12:00\") // Formatted download date\n\t})\n\n\tt.Run(\"GenerateMarkdown\", func(t *testing.T) {\n\t\tarchive, tempDir := setupTestArchive()\n\t\tdefer os.RemoveAll(tempDir)\n\t\t\n\t\terr := 
archive.GenerateMarkdown(tempDir)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Check file was created\n\t\tindexPath := filepath.Join(tempDir, \"index.md\")\n\t\tassert.FileExists(t, indexPath)\n\t\t\n\t\t// Read and verify content\n\t\tcontent, err := os.ReadFile(indexPath)\n\t\trequire.NoError(t, err)\n\t\tmdContent := string(content)\n\t\t\n\t\t// Verify markdown structure\n\t\tassert.Contains(t, mdContent, \"# Substack Archive\\n\\n\")\n\t\tassert.Contains(t, mdContent, \"## [Third Post](post3.html)\") // Newest first\n\t\tassert.Contains(t, mdContent, \"## [Second Post](post2.html)\")\n\t\tassert.Contains(t, mdContent, \"## [First Post](post1.html)\")\n\t\t\n\t\t// Verify metadata format\n\t\tassert.Contains(t, mdContent, \"**Published:** January 1, 2023\")\n\t\tassert.Contains(t, mdContent, \"**Downloaded:** January 10, 2023 12:00\")\n\t\t\n\t\t// Verify cover image markdown syntax\n\t\tassert.Contains(t, mdContent, \"![Cover Image](https://example.com/cover1.jpg)\")\n\t\tassert.Contains(t, mdContent, \"![Cover Image](https://example.com/cover3.jpg)\")\n\t\t\n\t\t// Verify descriptions in italic\n\t\tassert.Contains(t, mdContent, \"*A great first post*\")\n\t\tassert.Contains(t, mdContent, \"*This is the description*\")\n\t\t\n\t\t// Verify separators\n\t\tassert.Contains(t, mdContent, \"---\")\n\t})\n\n\tt.Run(\"GenerateText\", func(t *testing.T) {\n\t\tarchive, tempDir := setupTestArchive()\n\t\tdefer os.RemoveAll(tempDir)\n\t\t\n\t\terr := archive.GenerateText(tempDir)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Check file was created\n\t\tindexPath := filepath.Join(tempDir, \"index.txt\")\n\t\tassert.FileExists(t, indexPath)\n\t\t\n\t\t// Read and verify content\n\t\tcontent, err := os.ReadFile(indexPath)\n\t\trequire.NoError(t, err)\n\t\ttxtContent := string(content)\n\t\t\n\t\t// Verify text structure\n\t\tassert.Contains(t, txtContent, \"SUBSTACK ARCHIVE\\n================\")\n\t\t\n\t\t// Verify post entries (newest first)\n\t\tassert.Contains(t, 
txtContent, \"Title: Third Post\")\n\t\tassert.Contains(t, txtContent, \"Title: Second Post\") \n\t\tassert.Contains(t, txtContent, \"Title: First Post\")\n\t\t\n\t\t// Verify file paths\n\t\tassert.Contains(t, txtContent, \"File: post1.html\")\n\t\tassert.Contains(t, txtContent, \"File: post2.html\")\n\t\tassert.Contains(t, txtContent, \"File: post3.html\")\n\t\t\n\t\t// Verify formatted dates\n\t\tassert.Contains(t, txtContent, \"Published: January 1, 2023\")\n\t\tassert.Contains(t, txtContent, \"Downloaded: January 10, 2023 12:00\")\n\t\t\n\t\t// Verify descriptions\n\t\tassert.Contains(t, txtContent, \"Description: A great first post\")\n\t\tassert.Contains(t, txtContent, \"Description: This is the description\")\n\t\t\n\t\t// Verify separators\n\t\tassert.Contains(t, txtContent, strings.Repeat(\"-\", 50))\n\t})\n\n\tt.Run(\"EmptyArchive\", func(t *testing.T) {\n\t\ttempDir, err := os.MkdirTemp(\"\", \"empty_archive_test\")\n\t\trequire.NoError(t, err)\n\t\tdefer os.RemoveAll(tempDir)\n\t\t\n\t\tarchive := NewArchive()\n\t\t\n\t\t// Test each format with empty archive\n\t\terr = archive.GenerateHTML(tempDir)\n\t\trequire.NoError(t, err)\n\t\t\n\t\terr = archive.GenerateMarkdown(tempDir)\n\t\trequire.NoError(t, err)\n\t\t\n\t\terr = archive.GenerateText(tempDir)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Verify files exist and contain basic headers\n\t\thtmlContent, _ := os.ReadFile(filepath.Join(tempDir, \"index.html\"))\n\t\tassert.Contains(t, string(htmlContent), \"Substack Archive\")\n\t\t\n\t\tmdContent, _ := os.ReadFile(filepath.Join(tempDir, \"index.md\"))\n\t\tassert.Contains(t, string(mdContent), \"# Substack Archive\")\n\t\t\n\t\ttxtContent, _ := os.ReadFile(filepath.Join(tempDir, \"index.txt\"))\n\t\tassert.Contains(t, string(txtContent), \"SUBSTACK ARCHIVE\")\n\t})\n\n\tt.Run(\"FileSystemError\", func(t *testing.T) {\n\t\tarchive := NewArchive()\n\t\tpost := createSamplePost()\n\t\tarchive.AddEntry(post, \"test.html\", time.Now())\n\t\t\n\t\t// Try 
to write to non-existent directory with restricted permissions\n\t\tinvalidDir := \"/non/existent/directory\"\n\t\t\n\t\terr := archive.GenerateHTML(invalidDir)\n\t\tassert.Error(t, err)\n\t\t\n\t\terr = archive.GenerateMarkdown(invalidDir)\n\t\tassert.Error(t, err)\n\t\t\n\t\terr = archive.GenerateText(invalidDir)\n\t\tassert.Error(t, err)\n\t})\n}\n\n// Benchmarks\nfunc BenchmarkExtractor(b *testing.B) {\n\t// Create test server\n\tserver, posts := createSubstackTestServer()\n\tdefer server.Close()\n\n\t// Create URLs\n\turls := make([]string, 0, len(posts))\n\tfor path := range posts {\n\t\turls = append(urls, server.URL+path)\n\t}\n\n\t// Create extractor\n\textractor := NewExtractor(nil)\n\tctx := context.Background()\n\n\t// Benchmark single post extraction\n\tb.Run(\"ExtractPost\", func(b *testing.B) {\n\t\turl := urls[0]\n\t\tb.ResetTimer()\n\n\t\tfor i := 0; i < b.N; i++ {\n\t\t\tpost, err := extractor.ExtractPost(ctx, url)\n\t\t\tif err != nil {\n\t\t\t\tb.Fatal(err)\n\t\t\t}\n\n\t\t\t// Simple check to ensure the compiler doesn't optimize away the result\n\t\t\tif post.Id <= 0 {\n\t\t\t\tb.Fatal(\"Invalid post ID\")\n\t\t\t}\n\t\t}\n\t})\n\n\t// Benchmark format conversions\n\tpost := createSamplePost()\n\n\tb.Run(\"ToHTML\", func(b *testing.B) {\n\t\tfor i := 0; i < b.N; i++ {\n\t\t\thtml := post.ToHTML(true)\n\t\t\tif len(html) == 0 {\n\t\t\t\tb.Fatal(\"Empty HTML\")\n\t\t\t}\n\t\t}\n\t})\n\n\tb.Run(\"ToMD\", func(b *testing.B) {\n\t\tfor i := 0; i < b.N; i++ {\n\t\t\tmd, err := post.ToMD(true)\n\t\t\tif err != nil {\n\t\t\t\tb.Fatal(err)\n\t\t\t}\n\t\t\tif len(md) == 0 {\n\t\t\t\tb.Fatal(\"Empty markdown\")\n\t\t\t}\n\t\t}\n\t})\n\n\tb.Run(\"ToText\", func(b *testing.B) {\n\t\tfor i := 0; i < b.N; i++ {\n\t\t\ttext := post.ToText(true)\n\t\t\tif len(text) == 0 {\n\t\t\t\tb.Fatal(\"Empty text\")\n\t\t\t}\n\t\t}\n\t})\n\n\t// Benchmark extracting all posts\n\tb.Run(\"ExtractAllPosts\", func(b *testing.B) {\n\t\tfor i := 0; i < b.N; i++ {\n\t\t\tresultCh 
:= extractor.ExtractAllPosts(ctx, urls)\n\n\t\t\t// Consume all results\n\t\t\tsuccessCount := 0\n\t\t\tfor result := range resultCh {\n\t\t\t\tif result.Err == nil {\n\t\t\t\t\tsuccessCount++\n\t\t\t\t}\n\t\t\t}\n\n\t\t\tif successCount != len(posts) {\n\t\t\t\tb.Fatalf(\"Expected %d successful extractions, got %d\", len(posts), successCount)\n\t\t\t}\n\t\t}\n\t})\n\n\t// Benchmark with larger number of URLs\n\tb.Run(\"ExtractAllPostsMany\", func(b *testing.B) {\n\t\t// Create many duplicate URLs to test concurrency\n\t\tmanyUrls := make([]string, 50)\n\t\tfor i := range manyUrls {\n\t\t\tmanyUrls[i] = urls[i%len(urls)]\n\t\t}\n\n\t\t// Create extractor with optimized settings for benchmark\n\t\toptimizedFetcher := NewFetcher(\n\t\t\tWithMaxWorkers(20),\n\t\t\tWithRatePerSecond(100),\n\t\t\tWithBurst(50),\n\t\t)\n\n\t\toptimizedExtractor := NewExtractor(optimizedFetcher)\n\n\t\tb.ResetTimer()\n\n\t\tfor i := 0; i < b.N; i++ {\n\t\t\tresultCh := optimizedExtractor.ExtractAllPosts(ctx, manyUrls)\n\n\t\t\t// Consume all results\n\t\t\tsuccessCount := 0\n\t\t\tfor result := range resultCh {\n\t\t\t\tif result.Err == nil {\n\t\t\t\t\tsuccessCount++\n\t\t\t\t}\n\t\t\t}\n\n\t\t\tif successCount < len(manyUrls)-5 { // Allow a few errors\n\t\t\t\tb.Fatalf(\"Too few successful extractions: %d out of %d\", successCount, len(manyUrls))\n\t\t\t}\n\t\t}\n\t})\n}\n"
  },
  {
    "path": "lib/fetcher.go",
    "content": "package lib\n\nimport (\n\t\"context\"\n\t\"fmt\"\n\t\"io\"\n\t\"net/http\"\n\t\"net/url\"\n\t\"strconv\"\n\t\"time\"\n\n\t\"github.com/cenkalti/backoff/v4\"\n\t\"golang.org/x/sync/errgroup\"\n\t\"golang.org/x/time/rate\"\n)\n\n// DefaultRatePerSecond defines the default request rate per second when creating a new Fetcher.\nconst DefaultRatePerSecond = 2\n\n// DefaultBurst defines the default burst size for the rate limiter.\nconst DefaultBurst = 5\n\n// defaultRetryAfter specifies the default value for Retry-After header in case of too many requests.\nconst defaultRetryAfter = 60\n\n// defaultMaxRetryCount defines the default maximum number of retries for a failed URL fetch.\nconst defaultMaxRetryCount = 10\n\n// defaultMaxElapsedTime specifies the default maximum elapsed time for the exponential backoff.\nconst defaultMaxElapsedTime = 10 * time.Minute\n\n// defaultMaxInterval defines the default maximum interval for the exponential backoff.\nconst defaultMaxInterval = 2 * time.Minute\n\n// defaultClientTimeout defines the default timeout for HTTP requests.\nconst defaultClientTimeout = 30 * time.Second\n\n// userAgent specifies the User-Agent header value used in HTTP requests.\nconst userAgent = \"sbstck-dl/0.1\"\n\n// Fetcher represents a URL fetcher with rate limiting and retry mechanisms.\ntype Fetcher struct {\n\tClient      *http.Client\n\tRateLimiter *rate.Limiter\n\tBackoffCfg  backoff.BackOff\n\tCookie      *http.Cookie\n\tMaxWorkers  int\n}\n\n// FetcherOptions holds configurable options for Fetcher.\ntype FetcherOptions struct {\n\tRatePerSecond int\n\tBurst         int\n\tProxyURL      *url.URL\n\tBackOffConfig backoff.BackOff\n\tCookie        *http.Cookie\n\tTimeout       time.Duration\n\tMaxWorkers    int\n}\n\n// FetcherOption defines a function that applies a specific option to FetcherOptions.\ntype FetcherOption func(*FetcherOptions)\n\n// WithRatePerSecond sets the rate per second for the Fetcher.\nfunc WithRatePerSecond(rate 
int) FetcherOption {\n\treturn func(o *FetcherOptions) {\n\t\to.RatePerSecond = rate\n\t}\n}\n\n// WithBurst sets the burst size for the rate limiter.\nfunc WithBurst(burst int) FetcherOption {\n\treturn func(o *FetcherOptions) {\n\t\to.Burst = burst\n\t}\n}\n\n// WithProxyURL sets the proxy URL for the Fetcher.\nfunc WithProxyURL(proxyURL *url.URL) FetcherOption {\n\treturn func(o *FetcherOptions) {\n\t\to.ProxyURL = proxyURL\n\t}\n}\n\n// WithBackOffConfig sets the backoff configuration for the Fetcher.\nfunc WithBackOffConfig(b backoff.BackOff) FetcherOption {\n\treturn func(o *FetcherOptions) {\n\t\to.BackOffConfig = b\n\t}\n}\n\n// WithCookie sets the cookie for the Fetcher.\nfunc WithCookie(cookie *http.Cookie) FetcherOption {\n\treturn func(o *FetcherOptions) {\n\t\tif cookie != nil {\n\t\t\to.Cookie = cookie\n\t\t}\n\t}\n}\n\n// WithTimeout sets the HTTP client timeout.\nfunc WithTimeout(timeout time.Duration) FetcherOption {\n\treturn func(o *FetcherOptions) {\n\t\to.Timeout = timeout\n\t}\n}\n\n// WithMaxWorkers sets the maximum number of concurrent workers.\nfunc WithMaxWorkers(workers int) FetcherOption {\n\treturn func(o *FetcherOptions) {\n\t\to.MaxWorkers = workers\n\t}\n}\n\n// FetchResult represents the result of a URL fetch operation.\ntype FetchResult struct {\n\tUrl   string\n\tBody  io.ReadCloser\n\tError error\n}\n\n// FetchError represents a non-200 HTTP response: either a 429 Too Many Requests\n// carrying a Retry-After value in seconds, or any other error status code.\ntype FetchError struct {\n\tTooManyRequests bool\n\tRetryAfter      int\n\tStatusCode      int\n}\n\n// Error returns the error message for the FetchError.\nfunc (e *FetchError) Error() string {\n\tif e.TooManyRequests {\n\t\treturn fmt.Sprintf(\"too many requests, retry after %d seconds\", e.RetryAfter)\n\t}\n\treturn fmt.Sprintf(\"HTTP error: status code %d\", e.StatusCode)\n}\n\n// NewFetcher creates a new Fetcher with the provided options.\nfunc NewFetcher(opts ...FetcherOption) *Fetcher {\n\toptions := 
FetcherOptions{\n\t\tRatePerSecond: DefaultRatePerSecond,\n\t\tBurst:         DefaultBurst,\n\t\tBackOffConfig: makeDefaultBackoff(),\n\t\tTimeout:       defaultClientTimeout,\n\t\tMaxWorkers:    10, // Default to 10 workers\n\t}\n\n\tfor _, opt := range opts {\n\t\topt(&options)\n\t}\n\n\ttransport := http.DefaultTransport.(*http.Transport).Clone()\n\tif options.ProxyURL != nil {\n\t\ttransport.Proxy = http.ProxyURL(options.ProxyURL)\n\t}\n\n\t// Set sensible defaults for transport\n\ttransport.MaxIdleConns = 100\n\ttransport.MaxIdleConnsPerHost = options.MaxWorkers\n\ttransport.MaxConnsPerHost = options.MaxWorkers\n\ttransport.IdleConnTimeout = 90 * time.Second\n\ttransport.TLSHandshakeTimeout = 10 * time.Second\n\n\tclient := &http.Client{\n\t\tTransport: transport,\n\t\tTimeout:   options.Timeout,\n\t}\n\n\treturn &Fetcher{\n\t\tClient:      client,\n\t\tRateLimiter: rate.NewLimiter(rate.Limit(options.RatePerSecond), options.Burst),\n\t\tBackoffCfg:  options.BackOffConfig,\n\t\tCookie:      options.Cookie,\n\t\tMaxWorkers:  options.MaxWorkers,\n\t}\n}\n\n// FetchURLs concurrently fetches the specified URLs and returns a channel to receive the FetchResults.\nfunc (f *Fetcher) FetchURLs(ctx context.Context, urls []string) <-chan FetchResult {\n\t// Use a smaller buffer to reduce memory footprint\n\tresults := make(chan FetchResult, min(len(urls), f.MaxWorkers*2))\n\n\tg, ctx := errgroup.WithContext(ctx)\n\n\t// Use a semaphore to limit concurrency\n\tsem := make(chan struct{}, f.MaxWorkers)\n\n\tfor _, u := range urls {\n\t\tu := u // Capture the variable\n\t\tg.Go(func() error {\n\t\t\tselect {\n\t\t\tcase sem <- struct{}{}: // Acquire semaphore\n\t\t\t\tdefer func() { <-sem }() // Release semaphore\n\t\t\tcase <-ctx.Done():\n\t\t\t\treturn ctx.Err()\n\t\t\t}\n\n\t\t\tbody, err := f.FetchURL(ctx, u)\n\n\t\t\tselect {\n\t\t\tcase results <- FetchResult{Url: u, Body: body, Error: err}:\n\t\t\t\treturn nil\n\t\t\tcase <-ctx.Done():\n\t\t\t\t// Close body if context 
was canceled to prevent leaks\n\t\t\t\tif body != nil {\n\t\t\t\t\tbody.Close()\n\t\t\t\t}\n\t\t\t\treturn ctx.Err()\n\t\t\t}\n\t\t})\n\t}\n\n\t// Close the results channel when all goroutines complete\n\tgo func() {\n\t\tg.Wait()\n\t\tclose(results)\n\t}()\n\n\treturn results\n}\n\n// FetchURL fetches the specified URL with retries and rate limiting.\nfunc (f *Fetcher) FetchURL(ctx context.Context, url string) (io.ReadCloser, error) {\n\tvar body io.ReadCloser\n\tvar err error\n\tvar retryCounter int\n\n\toperation := func() error {\n\t\tif retryCounter >= defaultMaxRetryCount {\n\t\t\treturn backoff.Permanent(fmt.Errorf(\"max retry count reached for URL: %s\", url))\n\t\t}\n\n\t\terr = f.RateLimiter.Wait(ctx) // Use rate limiter\n\t\tif err != nil {\n\t\t\treturn backoff.Permanent(err) // Context cancellation or rate limiter error\n\t\t}\n\n\t\tbody, err = f.fetch(ctx, url)\n\t\tif err != nil {\n\t\t\t// If it's a fetch error that should be retried\n\t\t\tif fetchErr, ok := err.(*FetchError); ok && fetchErr.TooManyRequests {\n\t\t\t\tretryCounter++\n\t\t\t\treturn err\n\t\t\t}\n\t\t\t// For other errors, don't retry\n\t\t\treturn backoff.Permanent(err)\n\t\t}\n\t\treturn nil\n\t}\n\n\t// Use backoff with notification for logging\n\terr = backoff.RetryNotify(\n\t\toperation,\n\t\tf.BackoffCfg,\n\t\tfunc(err error, d time.Duration) {\n\t\t\t// This could be connected to a logger\n\t\t\t_ = err // Avoid unused variable error\n\t\t},\n\t)\n\n\treturn body, err\n}\n\n// fetch performs the actual HTTP GET request.\nfunc (f *Fetcher) fetch(ctx context.Context, url string) (io.ReadCloser, error) {\n\treq, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\treq.Header.Set(\"User-Agent\", userAgent)\n\n\t// Add cookie if available\n\tif f.Cookie != nil {\n\t\treq.AddCookie(f.Cookie)\n\t}\n\n\tres, err := f.Client.Do(req)\n\tif err != nil {\n\t\treturn nil, err\n\t}\n\n\t// Handle non-success status codes\n\tif 
res.StatusCode != http.StatusOK {\n\t\t// Always close the body for non-200 responses\n\t\tdefer res.Body.Close()\n\n\t\tif res.StatusCode == http.StatusTooManyRequests {\n\t\t\tretryAfter := defaultRetryAfter\n\t\t\tif retryAfterStr := res.Header.Get(\"Retry-After\"); retryAfterStr != \"\" {\n\t\t\t\tif seconds, err := strconv.Atoi(retryAfterStr); err == nil {\n\t\t\t\t\tretryAfter = seconds\n\t\t\t\t}\n\t\t\t}\n\t\t\treturn nil, &FetchError{\n\t\t\t\tTooManyRequests: true,\n\t\t\t\tRetryAfter:      retryAfter,\n\t\t\t\tStatusCode:      res.StatusCode,\n\t\t\t}\n\t\t}\n\n\t\treturn nil, &FetchError{\n\t\t\tStatusCode: res.StatusCode,\n\t\t}\n\t}\n\n\treturn res.Body, nil\n}\n\n// makeDefaultBackoff creates the default exponential backoff configuration.\nfunc makeDefaultBackoff() backoff.BackOff {\n\tbackOffCfg := backoff.NewExponentialBackOff()\n\tbackOffCfg.MaxElapsedTime = defaultMaxElapsedTime\n\tbackOffCfg.MaxInterval = defaultMaxInterval\n\tbackOffCfg.Multiplier = 1.5 // Reduced from 2.0 for more gradual backoff\n\n\treturn backOffCfg\n}\n\n// min returns the smaller of two integers.\nfunc min(a, b int) int {\n\tif a < b {\n\t\treturn a\n\t}\n\treturn b\n}\n"
  },
  {
    "path": "lib/fetcher_test.go",
    "content": "package lib\n\nimport (\n\t\"context\"\n\t\"fmt\"\n\t\"io\"\n\t\"math/rand\"\n\t\"net/http\"\n\t\"net/http/httptest\"\n\t\"net/url\"\n\t\"sync\"\n\t\"sync/atomic\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/cenkalti/backoff/v4\"\n\t\"github.com/stretchr/testify/assert\"\n\t\"github.com/stretchr/testify/require\"\n\t\"golang.org/x/time/rate\"\n)\n\n// TestNewFetcher tests the creation of a new fetcher with various options\nfunc TestNewFetcher(t *testing.T) {\n\tt.Run(\"DefaultOptions\", func(t *testing.T) {\n\t\tf := NewFetcher()\n\t\tassert.NotNil(t, f.Client)\n\t\tassert.NotNil(t, f.RateLimiter)\n\t\tassert.NotNil(t, f.BackoffCfg)\n\t\tassert.Nil(t, f.Cookie)\n\t\tassert.Equal(t, 10, f.MaxWorkers)\n\t})\n\n\tt.Run(\"CustomOptions\", func(t *testing.T) {\n\t\tproxyURL, _ := url.Parse(\"http://proxy.example.com\")\n\t\tcookie := &http.Cookie{Name: \"test\", Value: \"value\"}\n\t\tcustomBackoff := backoff.NewConstantBackOff(time.Second)\n\n\t\tf := NewFetcher(\n\t\t\tWithRatePerSecond(5),\n\t\t\tWithBurst(10),\n\t\t\tWithProxyURL(proxyURL),\n\t\t\tWithCookie(cookie),\n\t\t\tWithBackOffConfig(customBackoff),\n\t\t\tWithTimeout(time.Minute),\n\t\t\tWithMaxWorkers(20),\n\t\t)\n\n\t\tassert.NotNil(t, f.Client)\n\t\tassert.Equal(t, rate.Limit(5), f.RateLimiter.Limit())\n\t\tassert.Equal(t, 10, f.RateLimiter.Burst())\n\t\tassert.Equal(t, customBackoff, f.BackoffCfg)\n\t\tassert.Equal(t, cookie, f.Cookie)\n\t\tassert.Equal(t, 20, f.MaxWorkers)\n\t\tassert.Equal(t, time.Minute, f.Client.Timeout)\n\t})\n}\n\n// TestFetchURL tests the FetchURL method\nfunc TestFetchURL(t *testing.T) {\n\tt.Run(\"SuccessfulFetch\", func(t *testing.T) {\n\t\t// Create a test server\n\t\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t\tassert.Equal(t, \"sbstck-dl/0.1\", r.Header.Get(\"User-Agent\"))\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t\tw.Write([]byte(\"response body\"))\n\t\t}))\n\t\tdefer server.Close()\n\n\t\t// Create 
fetcher and fetch the URL\n\t\tf := NewFetcher()\n\t\tctx := context.Background()\n\t\tbody, err := f.FetchURL(ctx, server.URL)\n\n\t\t// Assert\n\t\trequire.NoError(t, err)\n\t\trequire.NotNil(t, body)\n\t\tdefer body.Close()\n\n\t\tdata, err := io.ReadAll(body)\n\t\trequire.NoError(t, err)\n\t\tassert.Equal(t, \"response body\", string(data))\n\t})\n\n\tt.Run(\"FetchWithCookie\", func(t *testing.T) {\n\t\tcookieReceived := false\n\t\t// Create a test server that checks for cookie\n\t\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t\tcookies := r.Cookies()\n\t\t\tfor _, cookie := range cookies {\n\t\t\t\tif cookie.Name == \"test\" && cookie.Value == \"value\" {\n\t\t\t\t\tcookieReceived = true\n\t\t\t\t\tbreak\n\t\t\t\t}\n\t\t\t}\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t}))\n\t\tdefer server.Close()\n\n\t\t// Create fetcher with cookie\n\t\tcookie := &http.Cookie{Name: \"test\", Value: \"value\"}\n\t\tf := NewFetcher(WithCookie(cookie))\n\t\tctx := context.Background()\n\t\tbody, err := f.FetchURL(ctx, server.URL)\n\n\t\t// Assert\n\t\trequire.NoError(t, err)\n\t\trequire.NotNil(t, body)\n\t\tbody.Close()\n\t\tassert.True(t, cookieReceived)\n\t})\n\n\tt.Run(\"HTTPError\", func(t *testing.T) {\n\t\t// Create a test server that returns an error\n\t\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t\tw.WriteHeader(http.StatusInternalServerError)\n\t\t}))\n\t\tdefer server.Close()\n\n\t\t// Create fetcher and fetch the URL\n\t\tf := NewFetcher()\n\t\tctx := context.Background()\n\t\tbody, err := f.FetchURL(ctx, server.URL)\n\n\t\t// Assert\n\t\tassert.Error(t, err)\n\t\tassert.Nil(t, body)\n\n\t\t// Check that the error is of type FetchError\n\t\tfetchErr, ok := err.(*FetchError)\n\t\tassert.True(t, ok)\n\t\tassert.Equal(t, http.StatusInternalServerError, fetchErr.StatusCode)\n\t\tassert.False(t, fetchErr.TooManyRequests)\n\t})\n\n\tt.Run(\"TooManyRequests\", func(t 
*testing.T) {\n\t\t// Create a test server that returns too many requests\n\t\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t\tw.Header().Set(\"Retry-After\", \"2\")\n\t\t\tw.WriteHeader(http.StatusTooManyRequests)\n\t\t}))\n\t\tdefer server.Close()\n\n\t\t// Create fetcher with a quick backoff for testing\n\t\tbackoffCfg := backoff.NewExponentialBackOff()\n\t\tbackoffCfg.MaxElapsedTime = 500 * time.Millisecond // Short timeout for test\n\t\tf := NewFetcher(WithBackOffConfig(backoffCfg))\n\n\t\tctx := context.Background()\n\t\tbody, err := f.FetchURL(ctx, server.URL)\n\n\t\t// Assert\n\t\tassert.Error(t, err)\n\t\tassert.Nil(t, body)\n\n\t\t// Check that the error is of type FetchError\n\t\tfetchErr, ok := err.(*FetchError)\n\t\tif !ok {\n\t\t\t// Could be a permanent error from max retries\n\t\t\tassert.Contains(t, err.Error(), \"max retry count\")\n\t\t} else {\n\t\t\tassert.True(t, fetchErr.TooManyRequests)\n\t\t\tassert.Equal(t, 2, fetchErr.RetryAfter)\n\t\t}\n\t})\n\n\tt.Run(\"ContextCancellation\", func(t *testing.T) {\n\t\t// Create a test server with a delay\n\t\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t\ttime.Sleep(500 * time.Millisecond)\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t}))\n\t\tdefer server.Close()\n\n\t\t// Create fetcher\n\t\tf := NewFetcher()\n\n\t\t// Create context with timeout\n\t\tctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)\n\t\tdefer cancel()\n\n\t\t// Fetch should be canceled by context\n\t\tbody, err := f.FetchURL(ctx, server.URL)\n\n\t\t// Assert\n\t\tassert.Error(t, err)\n\t\tassert.Nil(t, body)\n\t\tassert.Contains(t, err.Error(), \"context\")\n\t})\n}\n\n// TestFetchURLs tests the FetchURLs method\nfunc TestFetchURLs(t *testing.T) {\n\tt.Run(\"MultipleFetches\", func(t *testing.T) {\n\t\t// Track request count\n\t\tvar requestCount int32\n\n\t\t// Create a test server\n\t\tserver := 
httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t\tatomic.AddInt32(&requestCount, 1)\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t\tfmt.Fprintf(w, \"response for %s\", r.URL.Path)\n\t\t}))\n\t\tdefer server.Close()\n\n\t\t// Create URLs\n\t\tnumURLs := 10\n\t\turls := make([]string, numURLs)\n\t\tfor i := 0; i < numURLs; i++ {\n\t\t\turls[i] = fmt.Sprintf(\"%s/%d\", server.URL, i)\n\t\t}\n\n\t\t// Create fetcher and fetch URLs\n\t\tf := NewFetcher()\n\t\tctx := context.Background()\n\t\tresultChan := f.FetchURLs(ctx, urls)\n\n\t\t// Collect results\n\t\tresults := make(map[string]string)\n\t\tfor result := range resultChan {\n\t\t\tassert.NoError(t, result.Error)\n\t\t\tassert.NotNil(t, result.Body)\n\n\t\t\tif result.Body != nil {\n\t\t\t\tdata, err := io.ReadAll(result.Body)\n\t\t\t\tresult.Body.Close()\n\t\t\t\tassert.NoError(t, err)\n\t\t\t\tresults[result.Url] = string(data)\n\t\t\t}\n\t\t}\n\n\t\t// Assert all URLs were fetched\n\t\tassert.Equal(t, numURLs, len(results))\n\t\tassert.Equal(t, int32(numURLs), atomic.LoadInt32(&requestCount))\n\n\t\t// Check results\n\t\tfor i := 0; i < numURLs; i++ {\n\t\t\turl := fmt.Sprintf(\"%s/%d\", server.URL, i)\n\t\t\texpectedResponse := fmt.Sprintf(\"response for /%d\", i)\n\t\t\tassert.Equal(t, expectedResponse, results[url])\n\t\t}\n\t})\n\n\tt.Run(\"RateLimiting\", func(t *testing.T) {\n\t\t// Create a test server\n\t\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t}))\n\t\tdefer server.Close()\n\n\t\t// Create a lot of URLs\n\t\tnumURLs := 20\n\t\turls := make([]string, numURLs)\n\t\tfor i := 0; i < numURLs; i++ {\n\t\t\turls[i] = server.URL\n\t\t}\n\n\t\t// Create fetcher with low rate\n\t\tf := NewFetcher(\n\t\t\tWithRatePerSecond(2),\n\t\t\tWithBurst(1),\n\t\t\tWithMaxWorkers(5),\n\t\t)\n\n\t\t// Time the fetches\n\t\tstart := time.Now()\n\t\tctx := context.Background()\n\t\tresultChan := 
f.FetchURLs(ctx, urls)\n\n\t\t// Collect results\n\t\tvar count int\n\t\tfor result := range resultChan {\n\t\t\tassert.NoError(t, result.Error)\n\t\t\tif result.Body != nil {\n\t\t\t\tresult.Body.Close()\n\t\t\t}\n\t\t\tcount++\n\t\t}\n\n\t\t// Verify count\n\t\tassert.Equal(t, numURLs, count)\n\n\t\t// Check duration - should be at least 9 seconds for 20 URLs at 2 per second\n\t\tduration := time.Since(start)\n\t\tassert.GreaterOrEqual(t, duration, 9*time.Second)\n\t})\n\n\tt.Run(\"ConcurrencyLimit\", func(t *testing.T) {\n\t\t// Create a mutex to protect access to the concurrent counter\n\t\tvar mu sync.Mutex\n\t\tvar currentConcurrent, maxConcurrent int\n\n\t\t// Create a test server with a delay to test concurrency\n\t\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t\t// Increment current concurrent counter\n\t\t\tmu.Lock()\n\t\t\tcurrentConcurrent++\n\t\t\tif currentConcurrent > maxConcurrent {\n\t\t\t\tmaxConcurrent = currentConcurrent\n\t\t\t}\n\t\t\tmu.Unlock()\n\n\t\t\t// Sleep to maintain concurrency\n\t\t\ttime.Sleep(100 * time.Millisecond)\n\n\t\t\t// Decrement counter\n\t\t\tmu.Lock()\n\t\t\tcurrentConcurrent--\n\t\t\tmu.Unlock()\n\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t}))\n\t\tdefer server.Close()\n\n\t\t// Create a lot of URLs\n\t\tnumURLs := 50\n\t\turls := make([]string, numURLs)\n\t\tfor i := 0; i < numURLs; i++ {\n\t\t\turls[i] = server.URL\n\t\t}\n\n\t\t// Create fetcher with specific worker limit but high rate\n\t\tmaxWorkers := 5\n\t\tf := NewFetcher(\n\t\t\tWithRatePerSecond(100), // High rate to not be rate-limited\n\t\t\tWithMaxWorkers(maxWorkers),\n\t\t)\n\n\t\t// Fetch URLs\n\t\tctx := context.Background()\n\t\tresultChan := f.FetchURLs(ctx, urls)\n\n\t\t// Collect results\n\t\tfor result := range resultChan {\n\t\t\tif result.Body != nil {\n\t\t\t\tresult.Body.Close()\n\t\t\t}\n\t\t}\n\n\t\t// Verify the max concurrency was respected\n\t\tassert.LessOrEqual(t, maxConcurrent, 
maxWorkers)\n\t\t// We should have reached max workers at some point\n\t\tassert.GreaterOrEqual(t, maxConcurrent, maxWorkers-1)\n\t})\n\n\tt.Run(\"MixedResponses\", func(t *testing.T) {\n\t\t// Create a test server with mixed responses\n\t\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t\t// Extract path to determine response\n\t\t\tpath := r.URL.Path\n\t\t\tif path == \"/success\" {\n\t\t\t\tw.WriteHeader(http.StatusOK)\n\t\t\t\tw.Write([]byte(\"success\"))\n\t\t\t} else if path == \"/error\" {\n\t\t\t\tw.WriteHeader(http.StatusInternalServerError)\n\t\t\t} else if path == \"/toomany\" {\n\t\t\t\tw.Header().Set(\"Retry-After\", \"1\")\n\t\t\t\tw.WriteHeader(http.StatusTooManyRequests)\n\t\t\t} else if path == \"/slow\" {\n\t\t\t\ttime.Sleep(300 * time.Millisecond)\n\t\t\t\tw.WriteHeader(http.StatusOK)\n\t\t\t\tw.Write([]byte(\"slow\"))\n\t\t\t} else {\n\t\t\t\tw.WriteHeader(http.StatusNotFound)\n\t\t\t}\n\t\t}))\n\t\tdefer server.Close()\n\n\t\t// Create URLs\n\t\turls := []string{\n\t\t\tserver.URL + \"/success\",\n\t\t\tserver.URL + \"/error\",\n\t\t\tserver.URL + \"/toomany\",\n\t\t\tserver.URL + \"/slow\",\n\t\t\tserver.URL + \"/notfound\",\n\t\t}\n\n\t\t// Create fetcher with quick backoff for testing\n\t\tbackoffCfg := backoff.NewExponentialBackOff()\n\t\tbackoffCfg.MaxElapsedTime = 500 * time.Millisecond // Short timeout for test\n\n\t\tf := NewFetcher(\n\t\t\tWithBackOffConfig(backoffCfg),\n\t\t\tWithTimeout(1*time.Second),\n\t\t)\n\n\t\t// Fetch URLs\n\t\tctx := context.Background()\n\t\tresultChan := f.FetchURLs(ctx, urls)\n\n\t\t// Collect results\n\t\tresults := make(map[string]struct {\n\t\t\tbody  string\n\t\t\terror bool\n\t\t})\n\n\t\tfor result := range resultChan {\n\t\t\tresultData := struct {\n\t\t\t\tbody  string\n\t\t\t\terror bool\n\t\t\t}{body: \"\", error: result.Error != nil}\n\n\t\t\tif result.Body != nil {\n\t\t\t\tdata, _ := 
io.ReadAll(result.Body)\n\t\t\t\tresult.Body.Close()\n\t\t\t\tresultData.body = string(data)\n\t\t\t}\n\n\t\t\tresults[result.Url] = resultData\n\t\t}\n\n\t\t// Check results\n\t\tsuccessURL := server.URL + \"/success\"\n\t\tassert.False(t, results[successURL].error)\n\t\tassert.Equal(t, \"success\", results[successURL].body)\n\n\t\terrorURL := server.URL + \"/error\"\n\t\tassert.True(t, results[errorURL].error)\n\n\t\ttooManyURL := server.URL + \"/toomany\"\n\t\tassert.True(t, results[tooManyURL].error)\n\n\t\tslowURL := server.URL + \"/slow\"\n\t\tassert.False(t, results[slowURL].error)\n\t\tassert.Equal(t, \"slow\", results[slowURL].body)\n\n\t\tnotFoundURL := server.URL + \"/notfound\"\n\t\tassert.True(t, results[notFoundURL].error)\n\t})\n\n\tt.Run(\"EmptyURLList\", func(t *testing.T) {\n\t\tf := NewFetcher()\n\t\tctx := context.Background()\n\t\tresultChan := f.FetchURLs(ctx, []string{})\n\n\t\t// Should receive no results\n\t\tcount := 0\n\t\tfor range resultChan {\n\t\t\tcount++\n\t\t}\n\t\tassert.Equal(t, 0, count)\n\t})\n\n\tt.Run(\"SingleURL\", func(t *testing.T) {\n\t\t// Create a test server\n\t\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t\tw.Write([]byte(\"single\"))\n\t\t}))\n\t\tdefer server.Close()\n\n\t\tf := NewFetcher()\n\t\tctx := context.Background()\n\t\tresultChan := f.FetchURLs(ctx, []string{server.URL})\n\n\t\t// Should receive exactly one result\n\t\tcount := 0\n\t\tfor result := range resultChan {\n\t\t\tcount++\n\t\t\tassert.NoError(t, result.Error)\n\t\t\tassert.NotNil(t, result.Body)\n\t\t\tif result.Body != nil {\n\t\t\t\tdata, err := io.ReadAll(result.Body)\n\t\t\t\tresult.Body.Close()\n\t\t\t\tassert.NoError(t, err)\n\t\t\t\tassert.Equal(t, \"single\", string(data))\n\t\t\t}\n\t\t}\n\t\tassert.Equal(t, 1, count)\n\t})\n\n\tt.Run(\"ContextCancellationDuringFetch\", func(t *testing.T) {\n\t\t// Create a test server with delay\n\t\tserver := 
httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t\ttime.Sleep(200 * time.Millisecond)\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t}))\n\t\tdefer server.Close()\n\n\t\tf := NewFetcher()\n\t\tctx, cancel := context.WithCancel(context.Background())\n\t\t\n\t\t// Create multiple URLs\n\t\turls := []string{server.URL, server.URL, server.URL}\n\t\tresultChan := f.FetchURLs(ctx, urls)\n\n\t\t// Cancel context after a short delay\n\t\tgo func() {\n\t\t\ttime.Sleep(50 * time.Millisecond)\n\t\t\tcancel()\n\t\t}()\n\n\t\t// Collect results\n\t\tresults := 0\n\t\tfor result := range resultChan {\n\t\t\tresults++\n\t\t\tif result.Body != nil {\n\t\t\t\tresult.Body.Close()\n\t\t\t}\n\t\t}\n\n\t\t// Cancellation may drop results, so expect at most len(urls) of them\n\t\tassert.LessOrEqual(t, results, len(urls))\n\t})\n}\n\n// TestFetchErrors tests the FetchError type\nfunc TestFetchErrors(t *testing.T) {\n\tt.Run(\"TooManyRequestsError\", func(t *testing.T) {\n\t\terr := &FetchError{\n\t\t\tTooManyRequests: true,\n\t\t\tRetryAfter:      30,\n\t\t\tStatusCode:      429,\n\t\t}\n\t\tassert.Contains(t, err.Error(), \"30 seconds\")\n\t})\n\n\tt.Run(\"StatusCodeError\", func(t *testing.T) {\n\t\terr := &FetchError{\n\t\t\tStatusCode: 404,\n\t\t}\n\t\tassert.Contains(t, err.Error(), \"404\")\n\t})\n}\n\n// Integration test with a realistic server that randomly returns errors\nfunc TestIntegrationWithRandomErrors(t *testing.T) {\n\t// Skip in short test mode\n\tif testing.Short() {\n\t\tt.Skip(\"Skipping integration test in short mode\")\n\t}\n\n\t// Create a test server with random behavior\n\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t// Seed with request path to get consistent behavior per URL\n\t\tpathSeed := int64(0)\n\t\tfor _, c := range r.URL.Path {\n\t\t\tpathSeed += int64(c)\n\t\t}\n\t\trand.Seed(pathSeed)\n\n\t\t// Random behavior\n\t\trandomVal := rand.Intn(100)\n\t\tswitch {\n\t\tcase 
randomVal < 20:\n\t\t\t// 20% chance of error\n\t\t\tw.WriteHeader(http.StatusInternalServerError)\n\t\tcase randomVal < 30:\n\t\t\t// 10% chance of too many requests\n\t\t\tw.Header().Set(\"Retry-After\", \"1\")\n\t\t\tw.WriteHeader(http.StatusTooManyRequests)\n\t\tcase randomVal < 40:\n\t\t\t// 10% chance of slow response\n\t\t\ttime.Sleep(200 * time.Millisecond)\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t\tw.Write([]byte(fmt.Sprintf(\"slow response for %s\", r.URL.Path)))\n\t\tdefault:\n\t\t\t// 60% chance of success\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t\tw.Write([]byte(fmt.Sprintf(\"response for %s\", r.URL.Path)))\n\t\t}\n\t}))\n\tdefer server.Close()\n\n\t// Create a large number of URLs\n\tnumURLs := 30\n\turls := make([]string, numURLs)\n\tfor i := 0; i < numURLs; i++ {\n\t\turls[i] = fmt.Sprintf(\"%s/path%d\", server.URL, i)\n\t}\n\n\t// Create fetcher with retry configuration\n\tbackoffCfg := backoff.NewExponentialBackOff()\n\tbackoffCfg.MaxElapsedTime = 5 * time.Second\n\tbackoffCfg.InitialInterval = 100 * time.Millisecond\n\tbackoffCfg.MaxInterval = 1 * time.Second\n\n\tf := NewFetcher(\n\t\tWithRatePerSecond(10),\n\t\tWithBurst(5),\n\t\tWithMaxWorkers(8),\n\t\tWithBackOffConfig(backoffCfg),\n\t\tWithTimeout(2*time.Second),\n\t)\n\n\t// Fetch URLs\n\tctx := context.Background()\n\tresultChan := f.FetchURLs(ctx, urls)\n\n\t// Collect results\n\tsuccessCount := 0\n\terrorCount := 0\n\n\tfor result := range resultChan {\n\t\tif result.Error == nil {\n\t\t\tsuccessCount++\n\t\t\tif result.Body != nil {\n\t\t\t\tio.Copy(io.Discard, result.Body) // Read the body\n\t\t\t\tresult.Body.Close()\n\t\t\t}\n\t\t} else {\n\t\t\terrorCount++\n\t\t}\n\t}\n\n\t// Verify we got some successes and some errors\n\tt.Logf(\"Success count: %d, Error count: %d\", successCount, errorCount)\n\tassert.True(t, successCount > 0)\n\tassert.True(t, errorCount > 0)\n\tassert.Equal(t, numURLs, successCount+errorCount)\n}\n\n// Benchmarks\nfunc BenchmarkFetcher(b *testing.B) {\n\t// 
Create a test server\n\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\tw.WriteHeader(http.StatusOK)\n\t\tw.Write([]byte(\"benchmark response\"))\n\t}))\n\tdefer server.Close()\n\n\tb.Run(\"SingleFetch\", func(b *testing.B) {\n\t\tf := NewFetcher()\n\t\tctx := context.Background()\n\n\t\tb.ResetTimer()\n\t\tfor i := 0; i < b.N; i++ {\n\t\t\tbody, err := f.FetchURL(ctx, server.URL)\n\t\t\tif err == nil && body != nil {\n\t\t\t\tio.Copy(io.Discard, body)\n\t\t\t\tbody.Close()\n\t\t\t}\n\t\t}\n\t})\n\n\tb.Run(\"ConcurrentFetches\", func(b *testing.B) {\n\t\tf := NewFetcher(\n\t\t\tWithRatePerSecond(100),\n\t\t\tWithMaxWorkers(20),\n\t\t)\n\t\tctx := context.Background()\n\n\t\tb.ResetTimer()\n\t\tfor i := 0; i < b.N; i++ {\n\t\t\t// Create 10 URLs to fetch concurrently\n\t\t\tnumURLs := 10\n\t\t\turls := make([]string, numURLs)\n\t\t\tfor j := 0; j < numURLs; j++ {\n\t\t\t\turls[j] = server.URL\n\t\t\t}\n\n\t\t\tresultChan := f.FetchURLs(ctx, urls)\n\t\t\tfor result := range resultChan {\n\t\t\t\tif result.Body != nil {\n\t\t\t\t\tio.Copy(io.Discard, result.Body)\n\t\t\t\t\tresult.Body.Close()\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t})\n}\n"
  },
  {
    "path": "lib/files.go",
    "content": "package lib\n\nimport (\n\t\"context\"\n\t\"fmt\"\n\t\"io\"\n\t\"net/url\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"regexp\"\n\t\"strings\"\n\t\"time\"\n\n\t\"github.com/PuerkitoBio/goquery\"\n)\n\n// FileInfo represents information about a downloaded file attachment\ntype FileInfo struct {\n\tOriginalURL string\n\tLocalPath   string\n\tFilename    string\n\tSize        int64\n\tSuccess     bool\n\tError       error\n}\n\n// FileDownloader handles downloading file attachments from Substack posts\ntype FileDownloader struct {\n\tfetcher        *Fetcher\n\toutputDir      string\n\tfilesDir       string\n\tfileExtensions []string // allowed file extensions, empty means all\n}\n\n// NewFileDownloader creates a new FileDownloader instance\nfunc NewFileDownloader(fetcher *Fetcher, outputDir, filesDir string, extensions []string) *FileDownloader {\n\tif fetcher == nil {\n\t\tfetcher = NewFetcher()\n\t}\n\treturn &FileDownloader{\n\t\tfetcher:        fetcher,\n\t\toutputDir:      outputDir,\n\t\tfilesDir:       filesDir,\n\t\tfileExtensions: extensions,\n\t}\n}\n\n// FileDownloadResult contains the results of downloading file attachments for a post\ntype FileDownloadResult struct {\n\tFiles       []FileInfo\n\tUpdatedHTML string\n\tSuccess     int\n\tFailed      int\n}\n\n// FileElement represents a file attachment element with its download URL and local path info\ntype FileElement struct {\n\tDownloadURL string\n\tLocalPath   string\n\tFilename    string\n\tSuccess     bool\n}\n\n// DownloadFiles downloads all file attachments from a post's HTML content and returns updated HTML\nfunc (fd *FileDownloader) DownloadFiles(ctx context.Context, htmlContent string, postSlug string) (*FileDownloadResult, error) {\n\t// Parse HTML content\n\tdoc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent))\n\tif err != nil {\n\t\treturn nil, fmt.Errorf(\"failed to parse HTML content: %w\", err)\n\t}\n\n\t// Extract file attachment elements\n\tfileElements, err 
:= fd.extractFileElements(doc)\n\tif err != nil {\n\t\treturn nil, fmt.Errorf(\"failed to extract file elements: %w\", err)\n\t}\n\n\tif len(fileElements) == 0 {\n\t\treturn &FileDownloadResult{\n\t\t\tFiles:       []FileInfo{},\n\t\t\tUpdatedHTML: htmlContent,\n\t\t\tSuccess:     0,\n\t\t\tFailed:      0,\n\t\t}, nil\n\t}\n\n\t// Create files directory\n\tfilesPath := filepath.Join(fd.outputDir, fd.filesDir, postSlug)\n\tif err := os.MkdirAll(filesPath, 0755); err != nil {\n\t\treturn nil, fmt.Errorf(\"failed to create files directory: %w\", err)\n\t}\n\n\t// Download files and build URL mapping\n\tvar files []FileInfo\n\turlToLocalPath := make(map[string]string)\n\n\tfor _, element := range fileElements {\n\t\t// Download the file\n\t\tfileInfo := fd.downloadSingleFile(ctx, element.DownloadURL, filesPath)\n\t\tfiles = append(files, fileInfo)\n\n\t\tif fileInfo.Success {\n\t\t\turlToLocalPath[element.DownloadURL] = fileInfo.LocalPath\n\t\t}\n\t}\n\n\t// Update HTML content with local paths\n\tupdatedHTML := fd.updateHTMLWithLocalPaths(htmlContent, urlToLocalPath)\n\n\t// Count success/failure\n\tsuccessCount := 0\n\tfailedCount := 0\n\tfor _, file := range files {\n\t\tif file.Success {\n\t\t\tsuccessCount++\n\t\t} else {\n\t\t\tfailedCount++\n\t\t}\n\t}\n\n\treturn &FileDownloadResult{\n\t\tFiles:       files,\n\t\tUpdatedHTML: updatedHTML,\n\t\tSuccess:     successCount,\n\t\tFailed:      failedCount,\n\t}, nil\n}\n\n// extractFileElements finds all file attachment elements in the HTML using the CSS selector\nfunc (fd *FileDownloader) extractFileElements(doc *goquery.Document) ([]FileElement, error) {\n\tvar elements []FileElement\n\n\tdoc.Find(\".file-embed-button.wide\").Each(func(i int, s *goquery.Selection) {\n\t\thref, exists := s.Attr(\"href\")\n\t\tif !exists || href == \"\" {\n\t\t\treturn\n\t\t}\n\n\t\t// Parse and validate URL\n\t\tfileURL, err := url.Parse(href)\n\t\tif err != nil {\n\t\t\treturn\n\t\t}\n\n\t\t// Make sure it's an absolute URL\n\t\tif 
!fileURL.IsAbs() {\n\t\t\treturn\n\t\t}\n\n\t\t// Extract filename from URL\n\t\tfilename := fd.extractFilenameFromURL(href)\n\t\tif filename == \"\" {\n\t\t\t// Generate filename if we can't extract one\n\t\t\tfilename = fmt.Sprintf(\"attachment_%d\", i+1)\n\t\t}\n\n\t\t// Check file extension filter if specified\n\t\tif len(fd.fileExtensions) > 0 && !fd.isAllowedExtension(filename) {\n\t\t\treturn\n\t\t}\n\n\t\telements = append(elements, FileElement{\n\t\t\tDownloadURL: href,\n\t\t\tFilename:    filename,\n\t\t})\n\t})\n\n\treturn elements, nil\n}\n\n// extractFilenameFromURL attempts to extract a filename from a URL\nfunc (fd *FileDownloader) extractFilenameFromURL(downloadURL string) string {\n\tparsed, err := url.Parse(downloadURL)\n\tif err != nil {\n\t\treturn \"\"\n\t}\n\n\t// Try to get filename from path using URL-safe path handling\n\tpath := parsed.Path\n\tif path != \"\" && path != \"/\" {\n\t\t// Use strings.LastIndex to find the last segment in a cross-platform way\n\t\t// This avoids issues with filepath.Base on different operating systems\n\t\tlastSlash := strings.LastIndex(path, \"/\")\n\t\tif lastSlash >= 0 && lastSlash < len(path)-1 {\n\t\t\tfilename := path[lastSlash+1:]\n\t\t\tif filename != \"\" && filename != \".\" {\n\t\t\t\treturn filename\n\t\t\t}\n\t\t}\n\t}\n\n\t// Try to get filename from query parameters (common in some download links)\n\tif filename := parsed.Query().Get(\"filename\"); filename != \"\" {\n\t\treturn filename\n\t}\n\n\treturn \"\"\n}\n\n// isAllowedExtension checks if a filename has an allowed extension\nfunc (fd *FileDownloader) isAllowedExtension(filename string) bool {\n\tif len(fd.fileExtensions) == 0 {\n\t\treturn true // Allow all if no filter specified\n\t}\n\n\text := strings.ToLower(filepath.Ext(filename))\n\tif ext != \"\" && ext[0] == '.' 
{\n\t\text = ext[1:] // Remove the dot\n\t}\n\n\tfor _, allowedExt := range fd.fileExtensions {\n\t\tif strings.ToLower(allowedExt) == ext {\n\t\t\treturn true\n\t\t}\n\t}\n\n\treturn false\n}\n\n// downloadSingleFile downloads a single file and returns FileInfo\nfunc (fd *FileDownloader) downloadSingleFile(ctx context.Context, downloadURL, filesPath string) FileInfo {\n\t// Extract filename\n\tfilename := fd.extractFilenameFromURL(downloadURL)\n\tif filename == \"\" {\n\t\t// Generate a safe filename based on URL\n\t\tfilename = fd.generateSafeFilename(downloadURL)\n\t}\n\n\t// Ensure filename is safe for filesystem\n\tfilename = fd.sanitizeFilename(filename)\n\n\tlocalPath := filepath.Join(filesPath, filename)\n\n\t// Skip the download if the file already exists and report its on-disk size\n\tif info, err := os.Stat(localPath); err == nil {\n\t\treturn FileInfo{\n\t\t\tOriginalURL: downloadURL,\n\t\t\tLocalPath:   localPath,\n\t\t\tFilename:    filename,\n\t\t\tSize:        info.Size(),\n\t\t\tSuccess:     true,\n\t\t\tError:       nil,\n\t\t}\n\t}\n\n\t// Download the file\n\tresp, err := fd.fetcher.FetchURL(ctx, downloadURL)\n\tif err != nil {\n\t\treturn FileInfo{\n\t\t\tOriginalURL: downloadURL,\n\t\t\tLocalPath:   localPath,\n\t\t\tFilename:    filename,\n\t\t\tSize:        0,\n\t\t\tSuccess:     false,\n\t\t\tError:       err,\n\t\t}\n\t}\n\tdefer resp.Close()\n\n\t// Create the file\n\tfile, err := os.Create(localPath)\n\tif err != nil {\n\t\treturn FileInfo{\n\t\t\tOriginalURL: downloadURL,\n\t\t\tLocalPath:   localPath,\n\t\t\tFilename:    filename,\n\t\t\tSize:        0,\n\t\t\tSuccess:     false,\n\t\t\tError:       err,\n\t\t}\n\t}\n\tdefer file.Close()\n\n\t// Copy file contents\n\tsize, err := io.Copy(file, resp)\n\tif err != nil {\n\t\t// Close before removing the partial file; removing an open file fails on Windows\n\t\tfile.Close()\n\t\tos.Remove(localPath)\n\t\treturn FileInfo{\n\t\t\tOriginalURL: downloadURL,\n\t\t\tLocalPath:   localPath,\n\t\t\tFilename:    filename,\n\t\t\tSize:        0,\n\t\t\tSuccess:     false,\n\t\t\tError:       
err,\n\t\t}\n\t}\n\n\treturn FileInfo{\n\t\tOriginalURL: downloadURL,\n\t\tLocalPath:   localPath,\n\t\tFilename:    filename,\n\t\tSize:        size,\n\t\tSuccess:     true,\n\t\tError:       nil,\n\t}\n}\n\n// generateSafeFilename generates a safe filename from a URL\nfunc (fd *FileDownloader) generateSafeFilename(downloadURL string) string {\n\t// Use the timestamp plus a short hex prefix of the URL bytes (not a true hash) to build the filename\n\ttimestamp := time.Now().Unix()\n\turlHex := fmt.Sprintf(\"%x\", []byte(downloadURL))\n\tif len(urlHex) > 8 {\n\t\turlHex = urlHex[:8] // Guard the slice so URLs shorter than 4 bytes cannot panic\n\t}\n\treturn fmt.Sprintf(\"file_%d_%s\", timestamp, urlHex)\n}\n\n// sanitizeFilename removes or replaces unsafe characters in filenames\nfunc (fd *FileDownloader) sanitizeFilename(filename string) string {\n\t// Replace unsafe characters with underscores\n\tunsafe := regexp.MustCompile(`[<>:\"/\\\\|?*]`)\n\tsafe := unsafe.ReplaceAllString(filename, \"_\")\n\t\n\t// Remove leading/trailing spaces and dots\n\tsafe = strings.Trim(safe, \" .\")\n\t\n\t// Ensure it's not empty\n\tif safe == \"\" {\n\t\tsafe = \"attachment\"\n\t}\n\t\n\t// Limit length\n\tif len(safe) > 200 {\n\t\tsafe = safe[:200]\n\t}\n\t\n\treturn safe\n}\n\n// updateHTMLWithLocalPaths updates the HTML content to reference local file paths\nfunc (fd *FileDownloader) updateHTMLWithLocalPaths(htmlContent string, urlToLocalPath map[string]string) string {\n\tupdatedHTML := htmlContent\n\n\tfor originalURL, localPath := range urlToLocalPath {\n\t\t// Convert absolute local path to relative path from the post file location\n\t\trelativePath := fd.makeRelativePath(localPath)\n\t\t\n\t\t// Replace the href attribute in file-embed-button links\n\t\t// (literal replacement, so a $ in the path is not read as a capture-group reference)\n\t\toldPattern := fmt.Sprintf(`href=\"%s\"`, regexp.QuoteMeta(originalURL))\n\t\tnewPattern := fmt.Sprintf(`href=\"%s\"`, relativePath)\n\t\tupdatedHTML = regexp.MustCompile(oldPattern).ReplaceAllLiteralString(updatedHTML, newPattern)\n\t\t\n\t\t// Also handle single quotes\n\t\toldPatternSingle := fmt.Sprintf(`href='%s'`, regexp.QuoteMeta(originalURL))\n\t\tnewPatternSingle := 
fmt.Sprintf(`href='%s'`, relativePath)\n\t\tupdatedHTML = regexp.MustCompile(oldPatternSingle).ReplaceAllLiteralString(updatedHTML, newPatternSingle)\n\t}\n\n\treturn updatedHTML\n}\n\n// makeRelativePath converts an absolute local path to a path relative to the output directory\nfunc (fd *FileDownloader) makeRelativePath(localPath string) string {\n\t// Get the relative path from the output directory\n\trelPath, err := filepath.Rel(fd.outputDir, localPath)\n\tif err != nil {\n\t\t// If we can't make it relative, just use the filename\n\t\treturn filepath.Base(localPath)\n\t}\n\t\n\t// Convert to forward slashes for web compatibility\n\treturn filepath.ToSlash(relPath)\n}"
  },
  {
    "path": "lib/files_test.go",
    "content": "package lib\n\nimport (\n\t\"context\"\n\t\"fmt\"\n\t\"net/http\"\n\t\"net/http/httptest\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"strings\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/PuerkitoBio/goquery\"\n\t\"github.com/stretchr/testify/assert\"\n\t\"github.com/stretchr/testify/require\"\n)\n\n// Test file data - a simple text file content\nvar testFileData = []byte(\"This is a test file content for file attachment download testing.\")\n\n// createTestFileServer creates a test server that serves test files\nfunc createTestFileServer() *httptest.Server {\n\treturn httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\tpath := r.URL.Path\n\t\t\n\t\tswitch {\n\t\tcase strings.Contains(path, \"success\"):\n\t\t\tw.Header().Set(\"Content-Type\", \"application/octet-stream\")\n\t\t\tw.Header().Set(\"Content-Disposition\", \"attachment; filename=\\\"test-file.pdf\\\"\")\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t\tw.Write(testFileData)\n\t\tcase strings.Contains(path, \"document.pdf\"):\n\t\t\tw.Header().Set(\"Content-Type\", \"application/pdf\")\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t\tw.Write(testFileData)\n\t\tcase strings.Contains(path, \"spreadsheet.xlsx\"):\n\t\t\tw.Header().Set(\"Content-Type\", \"application/vnd.openxmlformats-officedocument.spreadsheetml.sheet\")\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t\tw.Write(testFileData)\n\t\tcase strings.Contains(path, \"not-found\"):\n\t\t\tw.WriteHeader(http.StatusNotFound)\n\t\tcase strings.Contains(path, \"server-error\"):\n\t\t\tw.WriteHeader(http.StatusInternalServerError)\n\t\tcase strings.Contains(path, \"timeout\"):\n\t\t\t// Don't respond to simulate timeout - but add a timeout to prevent hanging\n\t\t\tselect {\n\t\t\tcase <-time.After(5 * time.Second):\n\t\t\t\tw.WriteHeader(http.StatusRequestTimeout)\n\t\t\t}\n\t\tcase strings.Contains(path, \"with-query\"):\n\t\t\t// Handle URLs with filename in query parameter\n\t\t\tfilename := 
r.URL.Query().Get(\"filename\")\n\t\t\tif filename != \"\" {\n\t\t\t\tw.Header().Set(\"Content-Disposition\", fmt.Sprintf(\"attachment; filename=\\\"%s\\\"\", filename))\n\t\t\t}\n\t\t\tw.Header().Set(\"Content-Type\", \"application/octet-stream\")\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t\tw.Write(testFileData)\n\t\tdefault:\n\t\t\tw.Header().Set(\"Content-Type\", \"application/octet-stream\")\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t\tw.Write(testFileData)\n\t\t}\n\t}))\n}\n\n// createTestHTMLWithFiles creates HTML content with file attachment links\nfunc createTestHTMLWithFiles(baseURL string) string {\n\treturn fmt.Sprintf(`\n<!DOCTYPE html>\n<html>\n<head><title>Test Post with Files</title></head>\n<body>\n<h1>Test Post with File Attachments</h1>\n\n<!-- Standard file embed button -->\n<div class=\"file-embed-container\">\n  <a class=\"file-embed-button wide\" href=\"%s/document.pdf\" target=\"_blank\">\n    <div class=\"file-embed-icon\">📄</div>\n    <div class=\"file-embed-text\">Download PDF Document</div>\n  </a>\n</div>\n\n<!-- Another file type -->\n<div class=\"file-embed-container\">\n  <a class=\"file-embed-button wide\" href=\"%s/spreadsheet.xlsx\" target=\"_blank\">\n    <div class=\"file-embed-icon\">📊</div>\n    <div class=\"file-embed-text\">Download Excel Spreadsheet</div>\n  </a>\n</div>\n\n<!-- File with query parameters -->\n<div class=\"file-embed-container\">\n  <a class=\"file-embed-button wide\" href=\"%s/with-query?filename=report.docx&id=123\" target=\"_blank\">\n    <div class=\"file-embed-text\">Download Report</div>\n  </a>\n</div>\n\n<!-- Non-existent file for error testing -->\n<div class=\"file-embed-container\">\n  <a class=\"file-embed-button wide\" href=\"%s/not-found.pdf\" target=\"_blank\">\n    <div class=\"file-embed-text\">Missing File</div>\n  </a>\n</div>\n\n<!-- Invalid file link (not a file-embed-button) -->\n<div class=\"other-container\">\n  <a class=\"other-button\" href=\"%s/should-not-be-detected.pdf\" 
target=\"_blank\">\n    Should not be detected\n  </a>\n</div>\n\n<!-- File embed button without wide class -->\n<div class=\"file-embed-container\">\n  <a class=\"file-embed-button\" href=\"%s/should-not-be-detected-2.pdf\" target=\"_blank\">\n    Should not be detected either\n  </a>\n</div>\n\n</body>\n</html>`, \n\t\tbaseURL, baseURL, baseURL, baseURL, baseURL, baseURL)\n}\n\n// TestNewFileDownloader tests the creation of FileDownloader\nfunc TestNewFileDownloader(t *testing.T) {\n\tt.Run(\"WithFetcher\", func(t *testing.T) {\n\t\tfetcher := NewFetcher()\n\t\textensions := []string{\"pdf\", \"docx\"}\n\t\tdownloader := NewFileDownloader(fetcher, \"/tmp\", \"files\", extensions)\n\t\t\n\t\tassert.Equal(t, fetcher, downloader.fetcher)\n\t\tassert.Equal(t, \"/tmp\", downloader.outputDir)\n\t\tassert.Equal(t, \"files\", downloader.filesDir)\n\t\tassert.Equal(t, extensions, downloader.fileExtensions)\n\t})\n\t\n\tt.Run(\"WithoutFetcher\", func(t *testing.T) {\n\t\textensions := []string{\"xlsx\"}\n\t\tdownloader := NewFileDownloader(nil, \"/tmp\", \"attachments\", extensions)\n\t\t\n\t\tassert.NotNil(t, downloader.fetcher)\n\t\tassert.Equal(t, \"/tmp\", downloader.outputDir)\n\t\tassert.Equal(t, \"attachments\", downloader.filesDir)\n\t\tassert.Equal(t, extensions, downloader.fileExtensions)\n\t})\n\t\n\tt.Run(\"NoExtensions\", func(t *testing.T) {\n\t\tdownloader := NewFileDownloader(nil, \"/output\", \"files\", nil)\n\t\t\n\t\tassert.NotNil(t, downloader.fetcher)\n\t\tassert.Equal(t, \"/output\", downloader.outputDir)\n\t\tassert.Equal(t, \"files\", downloader.filesDir)\n\t\tassert.Nil(t, downloader.fileExtensions)\n\t})\n}\n\n// TestExtractFileElements tests file element extraction from HTML\nfunc TestExtractFileElements(t *testing.T) {\n\t// Create test server\n\tserver := createTestFileServer()\n\tdefer server.Close()\n\t\n\tt.Run(\"SuccessfulExtraction\", func(t *testing.T) {\n\t\tdownloader := NewFileDownloader(nil, \"/tmp\", \"files\", nil)\n\t\thtmlContent 
:= createTestHTMLWithFiles(server.URL)\n\t\t\n\t\tdoc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent))\n\t\trequire.NoError(t, err)\n\t\t\n\t\telements, err := downloader.extractFileElements(doc)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Should find 4 valid file elements (only .file-embed-button.wide)\n\t\tassert.Len(t, elements, 4)\n\t\t\n\t\t// Verify URLs\n\t\texpectedURLs := []string{\n\t\t\tserver.URL + \"/document.pdf\",\n\t\t\tserver.URL + \"/spreadsheet.xlsx\",\n\t\t\tserver.URL + \"/with-query?filename=report.docx&id=123\",\n\t\t\tserver.URL + \"/not-found.pdf\",\n\t\t}\n\t\t\n\t\tactualURLs := make([]string, len(elements))\n\t\tfor i, elem := range elements {\n\t\t\tactualURLs[i] = elem.DownloadURL\n\t\t}\n\t\t\n\t\tassert.ElementsMatch(t, expectedURLs, actualURLs)\n\t})\n\t\n\tt.Run(\"WithExtensionFilter\", func(t *testing.T) {\n\t\t// Only allow PDF files\n\t\tdownloader := NewFileDownloader(nil, \"/tmp\", \"files\", []string{\"pdf\"})\n\t\thtmlContent := createTestHTMLWithFiles(server.URL)\n\t\t\n\t\tdoc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent))\n\t\trequire.NoError(t, err)\n\t\t\n\t\telements, err := downloader.extractFileElements(doc)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Should find only 2 PDF files\n\t\tassert.Len(t, elements, 2)\n\t\t\n\t\tfor _, elem := range elements {\n\t\t\tassert.True(t, strings.Contains(elem.DownloadURL, \".pdf\"))\n\t\t}\n\t})\n\t\n\tt.Run(\"NoFileElements\", func(t *testing.T) {\n\t\tdownloader := NewFileDownloader(nil, \"/tmp\", \"files\", nil)\n\t\thtmlContent := \"<html><body><p>No file attachments here</p></body></html>\"\n\t\t\n\t\tdoc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent))\n\t\trequire.NoError(t, err)\n\t\t\n\t\telements, err := downloader.extractFileElements(doc)\n\t\trequire.NoError(t, err)\n\t\t\n\t\tassert.Len(t, elements, 0)\n\t})\n\t\n\tt.Run(\"InvalidURLs\", func(t *testing.T) {\n\t\tdownloader := NewFileDownloader(nil, 
\"/tmp\", \"files\", nil)\n\t\t\n\t\t// HTML with invalid URLs\n\t\thtmlContent := `\n\t\t<a class=\"file-embed-button wide\" href=\"\">Empty href</a>\n\t\t<a class=\"file-embed-button wide\" href=\"not-absolute-url\">Relative URL</a>\n\t\t<a class=\"file-embed-button wide\" href=\"://invalid\">Invalid URL</a>\n\t\t`\n\t\t\n\t\tdoc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent))\n\t\trequire.NoError(t, err)\n\t\t\n\t\telements, err := downloader.extractFileElements(doc)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Should find no valid elements\n\t\tassert.Len(t, elements, 0)\n\t})\n}\n\n// TestExtractFilenameFromURL tests filename extraction from URLs\nfunc TestExtractFilenameFromURL(t *testing.T) {\n\tdownloader := NewFileDownloader(nil, \"/tmp\", \"files\", nil)\n\t\n\ttests := []struct {\n\t\tname     string\n\t\turl      string\n\t\texpected string\n\t}{\n\t\t{\n\t\t\tname:     \"SimpleFilename\",\n\t\t\turl:      \"https://example.com/document.pdf\",\n\t\t\texpected: \"document.pdf\",\n\t\t},\n\t\t{\n\t\t\tname:     \"FilenameWithPath\",\n\t\t\turl:      \"https://example.com/files/reports/annual-report.xlsx\",\n\t\t\texpected: \"annual-report.xlsx\",\n\t\t},\n\t\t{\n\t\t\tname:     \"FilenameInQueryParam\",\n\t\t\turl:      \"https://example.com/?filename=my-file.docx&id=123\",\n\t\t\texpected: \"my-file.docx\",\n\t\t},\n\t\t{\n\t\t\tname:     \"NoFilename\",\n\t\t\turl:      \"https://example.com/\",\n\t\t\texpected: \"\",\n\t\t},\n\t\t{\n\t\t\tname:     \"InvalidURL\",\n\t\t\turl:      \"://invalid-url\",\n\t\t\texpected: \"\",\n\t\t},\n\t\t{\n\t\t\tname:     \"OnlyPath\",\n\t\t\turl:      \"https://example.com/download\",\n\t\t\texpected: \"download\",\n\t\t},\n\t}\n\t\n\tfor _, test := range tests {\n\t\tt.Run(test.name, func(t *testing.T) {\n\t\t\tresult := downloader.extractFilenameFromURL(test.url)\n\t\t\tassert.Equal(t, test.expected, result)\n\t\t})\n\t}\n}\n\n// TestIsAllowedExtension tests file extension filtering\nfunc 
TestIsAllowedExtension(t *testing.T) {\n\ttests := []struct {\n\t\tname          string\n\t\textensions    []string\n\t\tfilename      string\n\t\texpected      bool\n\t}{\n\t\t{\n\t\t\tname:       \"NoFilter\",\n\t\t\textensions: nil,\n\t\t\tfilename:   \"document.pdf\",\n\t\t\texpected:   true,\n\t\t},\n\t\t{\n\t\t\tname:       \"EmptyFilter\",\n\t\t\textensions: []string{},\n\t\t\tfilename:   \"document.pdf\",\n\t\t\texpected:   true,\n\t\t},\n\t\t{\n\t\t\tname:       \"AllowedExtension\",\n\t\t\textensions: []string{\"pdf\", \"docx\"},\n\t\t\tfilename:   \"document.pdf\",\n\t\t\texpected:   true,\n\t\t},\n\t\t{\n\t\t\tname:       \"DisallowedExtension\",\n\t\t\textensions: []string{\"pdf\", \"docx\"},\n\t\t\tfilename:   \"image.jpg\",\n\t\t\texpected:   false,\n\t\t},\n\t\t{\n\t\t\tname:       \"CaseInsensitive\",\n\t\t\textensions: []string{\"PDF\", \"DOCX\"},\n\t\t\tfilename:   \"document.pdf\",\n\t\t\texpected:   true,\n\t\t},\n\t\t{\n\t\t\tname:       \"NoExtension\",\n\t\t\textensions: []string{\"pdf\"},\n\t\t\tfilename:   \"README\",\n\t\t\texpected:   false,\n\t\t},\n\t\t{\n\t\t\tname:       \"ExtensionWithDot\",\n\t\t\textensions: []string{\".pdf\", \"docx\"},\n\t\t\tfilename:   \"document.pdf\",\n\t\t\texpected:   false, // \".pdf\" != \"pdf\" after dot removal\n\t\t},\n\t}\n\t\n\tfor _, test := range tests {\n\t\tt.Run(test.name, func(t *testing.T) {\n\t\t\tdownloader := NewFileDownloader(nil, \"/tmp\", \"files\", test.extensions)\n\t\t\tresult := downloader.isAllowedExtension(test.filename)\n\t\t\tassert.Equal(t, test.expected, result)\n\t\t})\n\t}\n}\n\n// TestSanitizeFilename tests filename sanitization\nfunc TestSanitizeFilename(t *testing.T) {\n\tdownloader := NewFileDownloader(nil, \"/tmp\", \"files\", nil)\n\t\n\ttests := []struct {\n\t\tname     string\n\t\tfilename string\n\t\texpected string\n\t}{\n\t\t{\n\t\t\tname:     \"SafeFilename\",\n\t\t\tfilename: \"document.pdf\",\n\t\t\texpected: \"document.pdf\",\n\t\t},\n\t\t{\n\t\t\tname:     
\"UnsafeCharacters\",\n\t\t\tfilename: \"my<file>name.pdf\",\n\t\t\texpected: \"my_file_name.pdf\",\n\t\t},\n\t\t{\n\t\t\tname:     \"AllUnsafeCharacters\",\n\t\t\tfilename: `file<>:\"/\\|?*.txt`,\n\t\t\texpected: \"file_________.txt\", // 9 unsafe chars replaced with _\n\t\t},\n\t\t{\n\t\t\tname:     \"LeadingTrailingSpaces\",\n\t\t\tfilename: \"  document.pdf  \",\n\t\t\texpected: \"document.pdf\",\n\t\t},\n\t\t{\n\t\t\tname:     \"LeadingTrailingDots\",\n\t\t\tfilename: \"..document.pdf..\",\n\t\t\texpected: \"document.pdf\",\n\t\t},\n\t\t{\n\t\t\tname:     \"EmptyAfterSanitization\",\n\t\t\tfilename: \"   ...   \", // Should become empty after trimming spaces and dots\n\t\t\texpected: \"attachment\",\n\t\t},\n\t\t{\n\t\t\tname:     \"VeryLongFilename\", \n\t\t\tfilename: strings.Repeat(\"a\", 250) + \".pdf\",\n\t\t\texpected: strings.Repeat(\"a\", 250)[:200], // Should be truncated to 200 chars total\n\t\t},\n\t}\n\t\n\tfor _, test := range tests {\n\t\tt.Run(test.name, func(t *testing.T) {\n\t\t\tresult := downloader.sanitizeFilename(test.filename)\n\t\t\tassert.Equal(t, test.expected, result)\n\t\t\tassert.LessOrEqual(t, len(result), 200, \"Filename should not exceed 200 characters\")\n\t\t})\n\t}\n}\n\n// TestGenerateSafeFilenameForFiles tests safe filename generation for files\nfunc TestGenerateSafeFilenameForFiles(t *testing.T) {\n\tdownloader := NewFileDownloader(nil, \"/tmp\", \"files\", nil)\n\t\n\t// Test that it generates unique filenames (use very different prefixes)\n\turl1 := \"abcdef123456\"  // Will produce different hash\n\turl2 := \"zyxwvu987654\" // Will produce different hash\n\t\n\tfilename1 := downloader.generateSafeFilename(url1)\n\ttime.Sleep(1 * time.Millisecond) // Small delay to ensure different timestamp\n\tfilename2 := downloader.generateSafeFilename(url2)\n\t\n\tassert.NotEqual(t, filename1, filename2, \"Should generate different filenames for different URLs\")\n\tassert.Contains(t, filename1, \"file_\", \"Should contain file_ 
prefix\")\n\tassert.Contains(t, filename2, \"file_\", \"Should contain file_ prefix\")\n\t\n\t// Test with same URL multiple times (should be different due to timestamp)\n\ttime.Sleep(1001 * time.Millisecond) // Ensure different timestamp (at least 1 second difference)\n\tfilename3 := downloader.generateSafeFilename(url1)\n\tassert.NotEqual(t, filename1, filename3, \"Should generate different filenames due to timestamp\")\n}\n\n// TestDownloadSingleFile tests individual file downloading\nfunc TestDownloadSingleFile(t *testing.T) {\n\t// Create test server\n\tserver := createTestFileServer()\n\tdefer server.Close()\n\t\n\t// Create temporary directory\n\ttempDir, err := os.MkdirTemp(\"\", \"single-file-test-*\")\n\trequire.NoError(t, err)\n\tdefer os.RemoveAll(tempDir)\n\t\n\tdownloader := NewFileDownloader(nil, tempDir, \"files\", nil)\n\tctx := context.Background()\n\t\n\tt.Run(\"SuccessfulDownload\", func(t *testing.T) {\n\t\tfileURL := server.URL + \"/document.pdf\"\n\t\tfilesPath := filepath.Join(tempDir, \"test-post\")\n\t\t\n\t\t// Create the directory first\n\t\terr := os.MkdirAll(filesPath, 0755)\n\t\trequire.NoError(t, err)\n\t\t\n\t\tfileInfo := downloader.downloadSingleFile(ctx, fileURL, filesPath)\n\t\t\n\t\tassert.True(t, fileInfo.Success)\n\t\tassert.NoError(t, fileInfo.Error)\n\t\tassert.Equal(t, fileURL, fileInfo.OriginalURL)\n\t\tassert.NotEmpty(t, fileInfo.LocalPath)\n\t\tassert.Equal(t, \"document.pdf\", fileInfo.Filename)\n\t\tassert.Equal(t, int64(len(testFileData)), fileInfo.Size)\n\t\t\n\t\t// Check file exists\n\t\t_, statErr := os.Stat(fileInfo.LocalPath)\n\t\tassert.NoError(t, statErr)\n\t\t\n\t\t// Check file content\n\t\tdata, err := os.ReadFile(fileInfo.LocalPath)\n\t\tassert.NoError(t, err)\n\t\tassert.Equal(t, testFileData, data)\n\t})\n\t\n\tt.Run(\"FileAlreadyExists\", func(t *testing.T) {\n\t\tfileURL := server.URL + \"/existing.pdf\"\n\t\tfilesPath := filepath.Join(tempDir, \"existing-test\")\n\t\t\n\t\t// Create the directory and 
file first\n\t\terr := os.MkdirAll(filesPath, 0755)\n\t\trequire.NoError(t, err)\n\t\t\n\t\texistingFile := filepath.Join(filesPath, \"existing.pdf\")\n\t\terr = os.WriteFile(existingFile, []byte(\"existing content\"), 0644)\n\t\trequire.NoError(t, err)\n\t\t\n\t\tfileInfo := downloader.downloadSingleFile(ctx, fileURL, filesPath)\n\t\t\n\t\tassert.True(t, fileInfo.Success)\n\t\tassert.NoError(t, fileInfo.Error)\n\t\tassert.Equal(t, fileURL, fileInfo.OriginalURL)\n\t\tassert.Equal(t, existingFile, fileInfo.LocalPath)\n\t\t\n\t\t// File should still contain original content (not downloaded again)\n\t\tdata, err := os.ReadFile(existingFile)\n\t\tassert.NoError(t, err)\n\t\tassert.Equal(t, []byte(\"existing content\"), data)\n\t})\n\t\n\tt.Run(\"NotFound\", func(t *testing.T) {\n\t\tfileURL := server.URL + \"/not-found.pdf\"\n\t\tfilesPath := filepath.Join(tempDir, \"not-found-test\")\n\t\t\n\t\t// Create the directory first\n\t\terr := os.MkdirAll(filesPath, 0755)\n\t\trequire.NoError(t, err)\n\t\t\n\t\tfileInfo := downloader.downloadSingleFile(ctx, fileURL, filesPath)\n\t\t\n\t\tassert.False(t, fileInfo.Success)\n\t\tassert.Error(t, fileInfo.Error)\n\t\tassert.Equal(t, fileURL, fileInfo.OriginalURL)\n\t\tassert.Equal(t, \"not-found.pdf\", fileInfo.Filename)\n\t})\n\t\n\tt.Run(\"ServerError\", func(t *testing.T) {\n\t\tfileURL := server.URL + \"/server-error.pdf\"\n\t\tfilesPath := filepath.Join(tempDir, \"server-error-test\")\n\t\t\n\t\t// Create the directory first\n\t\terr := os.MkdirAll(filesPath, 0755)\n\t\trequire.NoError(t, err)\n\t\t\n\t\tfileInfo := downloader.downloadSingleFile(ctx, fileURL, filesPath)\n\t\t\n\t\tassert.False(t, fileInfo.Success)\n\t\tassert.Error(t, fileInfo.Error)\n\t})\n\t\n\tt.Run(\"FilenameFromQuery\", func(t *testing.T) {\n\t\tfileURL := server.URL + \"/with-query?filename=report.docx&id=123\"\n\t\tfilesPath := filepath.Join(tempDir, \"query-test\")\n\t\t\n\t\t// Create the directory first\n\t\terr := os.MkdirAll(filesPath, 
0755)\n\t\trequire.NoError(t, err)\n\t\t\n\t\tfileInfo := downloader.downloadSingleFile(ctx, fileURL, filesPath)\n\t\t\n\t\tassert.True(t, fileInfo.Success)\n\t\tassert.NoError(t, fileInfo.Error)\n\t\t// The filename should come from the path (with-query), not query param since path takes precedence\n\t\tassert.Equal(t, \"with-query\", fileInfo.Filename)\n\t\t\n\t\t// Check file exists with correct name\n\t\texpectedPath := filepath.Join(filesPath, \"with-query\")\n\t\tassert.Equal(t, expectedPath, fileInfo.LocalPath)\n\t\t_, statErr := os.Stat(expectedPath)\n\t\tassert.NoError(t, statErr)\n\t})\n\t\n\tt.Run(\"FilenameFromPath\", func(t *testing.T) {\n\t\tfileURL := server.URL + \"/no-filename-in-path\"\n\t\tfilesPath := filepath.Join(tempDir, \"path-test\")\n\t\t\n\t\t// Create the directory first\n\t\terr := os.MkdirAll(filesPath, 0755)\n\t\trequire.NoError(t, err)\n\t\t\n\t\tfileInfo := downloader.downloadSingleFile(ctx, fileURL, filesPath)\n\t\t\n\t\tassert.True(t, fileInfo.Success)\n\t\tassert.NoError(t, fileInfo.Error)\n\t\t// The filename should come from the path (no-filename-in-path)\n\t\tassert.Equal(t, \"no-filename-in-path\", fileInfo.Filename)\n\t})\n\t\n\tt.Run(\"GeneratedFilename\", func(t *testing.T) {\n\t\t// Use a URL with just / to trigger generated filename\n\t\tfileURL := server.URL + \"/\"\n\t\tfilesPath := filepath.Join(tempDir, \"generated-test\")\n\t\t\n\t\t// Create the directory first\n\t\terr := os.MkdirAll(filesPath, 0755)\n\t\trequire.NoError(t, err)\n\t\t\n\t\tfileInfo := downloader.downloadSingleFile(ctx, fileURL, filesPath)\n\t\t\n\t\tassert.True(t, fileInfo.Success)\n\t\tassert.NoError(t, fileInfo.Error)\n\t\t// Should use generated filename pattern\n\t\tassert.Contains(t, fileInfo.Filename, \"file_\")\n\t})\n}\n\n// TestMakeRelativePath tests relative path conversion\nfunc TestMakeRelativePath(t *testing.T) {\n\tdownloader := NewFileDownloader(nil, \"/output\", \"files\", nil)\n\t\n\ttests := []struct {\n\t\tname         
string\n\t\tlocalPath    string\n\t\texpected     string\n\t}{\n\t\t{\n\t\t\tname:      \"NormalPath\",\n\t\t\tlocalPath: \"/output/files/post/document.pdf\",\n\t\t\texpected:  \"files/post/document.pdf\",\n\t\t},\n\t\t{\n\t\t\tname:      \"WindowsPath\",\n\t\t\tlocalPath: \"/output/files/post/report.xlsx\",\n\t\t\texpected:  \"files/post/report.xlsx\",\n\t\t},\n\t}\n\t\n\tfor _, test := range tests {\n\t\tt.Run(test.name, func(t *testing.T) {\n\t\t\tresult := downloader.makeRelativePath(test.localPath)\n\t\t\tassert.Equal(t, test.expected, result)\n\t\t})\n\t}\n}\n\n// TestUpdateHTMLWithLocalPathsForFiles tests HTML content updating for files\nfunc TestUpdateHTMLWithLocalPathsForFiles(t *testing.T) {\n\tdownloader := NewFileDownloader(nil, \"/output\", \"files\", nil)\n\t\n\toriginalHTML := `\n\t<a class=\"file-embed-button wide\" href=\"https://example.com/document.pdf\">PDF Document</a>\n\t<a class=\"file-embed-button wide\" href='https://example.com/spreadsheet.xlsx'>Excel File</a>\n\t<a class=\"file-embed-button wide\" href=\"https://example.com/document.pdf\">Same PDF Again</a>\n\t`\n\t\n\turlToLocalPath := map[string]string{\n\t\t\"https://example.com/document.pdf\":    filepath.Join(\"/output\", \"files\", \"post\", \"document.pdf\"),\n\t\t\"https://example.com/spreadsheet.xlsx\": filepath.Join(\"/output\", \"files\", \"post\", \"spreadsheet.xlsx\"),\n\t}\n\t\n\tupdatedHTML := downloader.updateHTMLWithLocalPaths(originalHTML, urlToLocalPath)\n\t\n\t// Check that URLs were replaced\n\tassert.Contains(t, updatedHTML, `href=\"files/post/document.pdf\"`)\n\tassert.Contains(t, updatedHTML, `href='files/post/spreadsheet.xlsx'`)\n\tassert.NotContains(t, updatedHTML, \"https://example.com/\")\n\t\n\t// Check that duplicate URLs were replaced\n\tassert.Equal(t, 2, strings.Count(updatedHTML, \"files/post/document.pdf\"))\n}\n\n// TestDownloadFiles tests the complete file downloading workflow\nfunc TestDownloadFiles(t *testing.T) {\n\t// Create test server\n\tserver 
:= createTestFileServer()\n\tdefer server.Close()\n\t\n\t// Create temporary directory\n\ttempDir, err := os.MkdirTemp(\"\", \"file-download-test-*\")\n\trequire.NoError(t, err)\n\tdefer os.RemoveAll(tempDir)\n\t\n\t// Create downloader\n\tdownloader := NewFileDownloader(nil, tempDir, \"files\", nil)\n\t\n\tt.Run(\"SuccessfulDownload\", func(t *testing.T) {\n\t\thtmlContent := createTestHTMLWithFiles(server.URL)\n\t\tctx := context.Background()\n\t\t\n\t\tresult, err := downloader.DownloadFiles(ctx, htmlContent, \"test-post\")\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Check results\n\t\tassert.Greater(t, result.Success, 0, \"Should have successful downloads\")\n\t\tassert.Greater(t, result.Failed, 0, \"Should have failed downloads (not-found file)\")\n\t\tassert.Greater(t, len(result.Files), 0, \"Should have file info\")\n\t\t\n\t\t// Check that files directory was created\n\t\tfilesDir := filepath.Join(tempDir, \"files\", \"test-post\")\n\t\t_, err = os.Stat(filesDir)\n\t\tassert.NoError(t, err, \"Files directory should exist\")\n\t\t\n\t\t// Check that some files were downloaded\n\t\tfiles, err := os.ReadDir(filesDir)\n\t\tassert.NoError(t, err)\n\t\tassert.Greater(t, len(files), 0, \"Should have downloaded files\")\n\t\t\n\t\t// Check that HTML was updated\n\t\tassert.NotEqual(t, htmlContent, result.UpdatedHTML, \"HTML should be updated\")\n\t\tassert.Contains(t, result.UpdatedHTML, \"files/test-post/\", \"HTML should contain local file paths\")\n\t\t\n\t\t// Verify specific file was downloaded\n\t\tvar pdfFound bool\n\t\tfor _, file := range result.Files {\n\t\t\tif strings.Contains(file.OriginalURL, \"document.pdf\") && file.Success {\n\t\t\t\tpdfFound = true\n\t\t\t\tassert.Equal(t, \"document.pdf\", file.Filename)\n\t\t\t\tassert.Greater(t, file.Size, int64(0))\n\t\t\t\t\n\t\t\t\t// Verify file content\n\t\t\t\tdata, err := os.ReadFile(file.LocalPath)\n\t\t\t\tassert.NoError(t, err)\n\t\t\t\tassert.Equal(t, testFileData, 
data)\n\t\t\t}\n\t\t}\n\t\tassert.True(t, pdfFound, \"Should have successfully downloaded PDF file\")\n\t})\n\t\n\tt.Run(\"WithExtensionFilter\", func(t *testing.T) {\n\t\t// Only allow PDF files\n\t\tpdfDownloader := NewFileDownloader(nil, tempDir, \"pdf-files\", []string{\"pdf\"})\n\t\thtmlContent := createTestHTMLWithFiles(server.URL)\n\t\tctx := context.Background()\n\t\t\n\t\tresult, err := pdfDownloader.DownloadFiles(ctx, htmlContent, \"pdf-test\")\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Should only process PDF files\n\t\tpdfCount := 0\n\t\tfor _, file := range result.Files {\n\t\t\tif strings.HasSuffix(file.Filename, \".pdf\") {\n\t\t\t\tpdfCount++\n\t\t\t}\n\t\t}\n\t\tassert.Equal(t, 2, pdfCount, \"Should find exactly 2 PDF files\")\n\t\tassert.Equal(t, 2, len(result.Files), \"Should only process PDF files due to filter\")\n\t})\n\t\n\tt.Run(\"NoFiles\", func(t *testing.T) {\n\t\thtmlContent := \"<html><body><p>No file attachments here</p></body></html>\"\n\t\tctx := context.Background()\n\t\t\n\t\tresult, err := downloader.DownloadFiles(ctx, htmlContent, \"no-files-post\")\n\t\trequire.NoError(t, err)\n\t\t\n\t\tassert.Equal(t, 0, result.Success)\n\t\tassert.Equal(t, 0, result.Failed)\n\t\tassert.Equal(t, 0, len(result.Files))\n\t\tassert.Equal(t, htmlContent, result.UpdatedHTML)\n\t})\n\t\n\tt.Run(\"EmptyHTML\", func(t *testing.T) {\n\t\temptyHTML := \"\"\n\t\tctx := context.Background()\n\t\t\n\t\tresult, err := downloader.DownloadFiles(ctx, emptyHTML, \"empty-post\")\n\t\trequire.NoError(t, err)\n\t\t\n\t\tassert.Equal(t, 0, result.Success)\n\t\tassert.Equal(t, 0, result.Failed)\n\t\tassert.Equal(t, 0, len(result.Files))\n\t\tassert.Equal(t, emptyHTML, result.UpdatedHTML)\n\t})\n\t\n\tt.Run(\"InvalidHTML\", func(t *testing.T) {\n\t\tinvalidHTML := \"not valid html <<<\"\n\t\tctx := context.Background()\n\t\t\n\t\t// Should still work with invalid HTML due to goquery's tolerance\n\t\tresult, err := downloader.DownloadFiles(ctx, invalidHTML, 
\"invalid-post\")\n\t\trequire.NoError(t, err)\n\t\t\n\t\tassert.Equal(t, 0, result.Success)\n\t\tassert.Equal(t, 0, result.Failed)\n\t\tassert.Equal(t, 0, len(result.Files))\n\t})\n}\n\n// TestFileDownloadErrorScenarios tests various error conditions\nfunc TestFileDownloadErrorScenarios(t *testing.T) {\n\t// Create test server\n\tserver := createTestFileServer()\n\tdefer server.Close()\n\t\n\t// Create temporary directory\n\ttempDir, err := os.MkdirTemp(\"\", \"error-test-*\")\n\trequire.NoError(t, err)\n\tdefer os.RemoveAll(tempDir)\n\t\n\tdownloader := NewFileDownloader(nil, tempDir, \"files\", nil)\n\tctx := context.Background()\n\t\n\tt.Run(\"ContextCancellation\", func(t *testing.T) {\n\t\t// Create context with immediate cancellation\n\t\tcancelCtx, cancel := context.WithCancel(context.Background())\n\t\tcancel() // Cancel immediately\n\t\t\n\t\tfileURL := server.URL + \"/document.pdf\"\n\t\tfilesPath := filepath.Join(tempDir, \"cancel-test\")\n\t\t\n\t\tfileInfo := downloader.downloadSingleFile(cancelCtx, fileURL, filesPath)\n\t\t\n\t\tassert.False(t, fileInfo.Success)\n\t\tassert.Error(t, fileInfo.Error)\n\t\tassert.Contains(t, fileInfo.Error.Error(), \"context\")\n\t})\n\t\n\tt.Run(\"FileSystemError\", func(t *testing.T) {\n\t\t// Create a read-only directory to cause file creation to fail\n\t\treadOnlyDir := filepath.Join(tempDir, \"readonly\")\n\t\terr := os.MkdirAll(readOnlyDir, 0755)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Make directory read-only (may not work on all filesystems)\n\t\terr = os.Chmod(readOnlyDir, 0444)\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Restore permissions for cleanup\n\t\tdefer os.Chmod(readOnlyDir, 0755)\n\t\t\n\t\tfileURL := server.URL + \"/document.pdf\"\n\t\t\n\t\tfileInfo := downloader.downloadSingleFile(ctx, fileURL, readOnlyDir)\n\t\t\n\t\t// This test may pass on some filesystems that ignore permission restrictions\n\t\t// for the same user, so we just verify the attempt was made\n\t\tif fileInfo.Error != nil 
{\n\t\t\tassert.False(t, fileInfo.Success)\n\t\t\tassert.Error(t, fileInfo.Error)\n\t\t} else {\n\t\t\t// If no error occurred (e.g., on some filesystems), just log it\n\t\t\tt.Logf(\"Note: Filesystem doesn't enforce directory permissions as expected\")\n\t\t\tassert.True(t, fileInfo.Success)\n\t\t}\n\t})\n\t\n\tt.Run(\"DirectoryCreationError\", func(t *testing.T) {\n\t\t// Try to create files directory where a file already exists\n\t\tinvalidDir := filepath.Join(tempDir, \"invalid-dir\")\n\t\t\n\t\t// Create a file with the same name as the directory we'll try to create\n\t\terr := os.WriteFile(invalidDir, []byte(\"blocking file\"), 0644)\n\t\trequire.NoError(t, err)\n\t\t\n\t\tinvalidDownloader := NewFileDownloader(nil, invalidDir, \"files\", nil)\n\t\thtmlContent := createTestHTMLWithFiles(server.URL)\n\t\t\n\t\t_, err = invalidDownloader.DownloadFiles(ctx, htmlContent, \"blocked-post\")\n\t\tassert.Error(t, err)\n\t\tassert.Contains(t, err.Error(), \"failed to create files directory\")\n\t})\n}\n\n// TestFileDownloadWithRealSubstackHTML tests with realistic Substack HTML structure\nfunc TestFileDownloadWithRealSubstackHTML(t *testing.T) {\n\t// Create test server\n\tserver := createTestFileServer()\n\tdefer server.Close()\n\t\n\t// Create temporary directory\n\ttempDir, err := os.MkdirTemp(\"\", \"real-substack-test-*\")\n\trequire.NoError(t, err)\n\tdefer os.RemoveAll(tempDir)\n\t\n\tdownloader := NewFileDownloader(nil, tempDir, \"attachments\", nil)\n\t\n\t// Realistic Substack HTML structure with file embeds\n\trealisticHTML := fmt.Sprintf(`\n\t<div class=\"post-body\">\n\t\t<p>Here's the quarterly report:</p>\n\t\t\n\t\t<div class=\"file-embed-container\">\n\t\t\t<a class=\"file-embed-button wide\" href=\"%s/quarterly-report.pdf\" target=\"_blank\">\n\t\t\t\t<div class=\"file-embed-icon\">\n\t\t\t\t\t<svg>...</svg>\n\t\t\t\t</div>\n\t\t\t\t<div class=\"file-embed-text\">\n\t\t\t\t\t<div class=\"file-embed-title\">Q3 2023 Financial 
Report</div>\n\t\t\t\t\t<div class=\"file-embed-subtitle\">PDF • 2.4 MB</div>\n\t\t\t\t</div>\n\t\t\t</a>\n\t\t</div>\n\t\t\n\t\t<p>And here's the supporting data:</p>\n\t\t\n\t\t<div class=\"file-embed-container\">\n\t\t\t<a class=\"file-embed-button wide\" href=\"%s/supporting-data.xlsx\" target=\"_blank\">\n\t\t\t\t<div class=\"file-embed-icon\">\n\t\t\t\t\t<svg>...</svg>\n\t\t\t\t</div>\n\t\t\t\t<div class=\"file-embed-text\">\n\t\t\t\t\t<div class=\"file-embed-title\">Supporting Data</div>\n\t\t\t\t\t<div class=\"file-embed-subtitle\">Excel • 1.8 MB</div>\n\t\t\t\t</div>\n\t\t\t</a>\n\t\t</div>\n\t</div>\n\t`, server.URL, server.URL)\n\t\n\tctx := context.Background()\n\tresult, err := downloader.DownloadFiles(ctx, realisticHTML, \"financial-report\")\n\trequire.NoError(t, err)\n\t\n\t// Should successfully download both files\n\tassert.Equal(t, 2, result.Success)\n\tassert.Equal(t, 0, result.Failed)\n\tassert.Len(t, result.Files, 2)\n\t\n\t// Verify HTML was updated\n\tassert.Contains(t, result.UpdatedHTML, \"attachments/financial-report/quarterly-report.pdf\")\n\tassert.Contains(t, result.UpdatedHTML, \"attachments/financial-report/supporting-data.xlsx\")\n\tassert.NotContains(t, result.UpdatedHTML, server.URL)\n\t\n\t// Verify files exist on disk\n\tattachmentsDir := filepath.Join(tempDir, \"attachments\", \"financial-report\")\n\tfiles, err := os.ReadDir(attachmentsDir)\n\trequire.NoError(t, err)\n\tassert.Len(t, files, 2)\n\t\n\t// Verify specific files\n\tfileNames := []string{files[0].Name(), files[1].Name()}\n\tassert.Contains(t, fileNames, \"quarterly-report.pdf\")\n\tassert.Contains(t, fileNames, \"supporting-data.xlsx\")\n}\n\n// TestExtractorIntegration tests file download integration with the extractor\nfunc TestExtractorIntegration(t *testing.T) {\n\t// Create test server\n\tserver := createTestFileServer()\n\tdefer server.Close()\n\t\n\t// Create temporary directory\n\ttempDir, err := os.MkdirTemp(\"\", 
\"extractor-integration-test-*\")\n\trequire.NoError(t, err)\n\tdefer os.RemoveAll(tempDir)\n\t\n\t// Create a mock post with file attachments\n\tpost := Post{\n\t\tId:       123,\n\t\tSlug:     \"test-post-with-files\",\n\t\tTitle:    \"Test Post with File Attachments\",\n\t\tBodyHTML: createTestHTMLWithFiles(server.URL),\n\t}\n\t\n\t// Create fetcher for the extractor\n\tfetcher := NewFetcher()\n\t\n\t// Test file download through WriteToFileWithImages\n\toutputPath := filepath.Join(tempDir, \"test-post.html\")\n\tfilesPath := \"attachments\"\n\timageDownloadResult, err := post.WriteToFileWithImages(\n\t\tcontext.Background(),\n\t\toutputPath,\n\t\t\"html\",\n\t\tfalse, // addSourceURL\n\t\tfalse, // downloadImages \n\t\tImageQualityHigh, // imageQuality\n\t\t\"\", // imagesDir (not used when downloadImages is false)\n\t\ttrue,  // downloadFiles\n\t\tnil,   // fileExtensions (no filter)\n\t\tfilesPath, // filesDir\n\t\tfetcher, // fetcher\n\t)\n\t\n\trequire.NoError(t, err)\n\trequire.NotNil(t, imageDownloadResult)\n\t\n\t// Check that the image result is available (files are not reported in image result)\n\t// We'll verify file downloads through the file system\n\t\n\t// Check that the HTML file was created\n\t_, err = os.Stat(outputPath)\n\tassert.NoError(t, err, \"HTML file should be created\")\n\t\n\t// Check that files directory was created\n\tfilesDir := filepath.Join(tempDir, filesPath, post.Slug)\n\t_, err = os.Stat(filesDir)\n\tassert.NoError(t, err, \"Files directory should be created\")\n\t\n\t// Check that some files were actually downloaded\n\tfiles, err := os.ReadDir(filesDir)\n\trequire.NoError(t, err)\n\tassert.Greater(t, len(files), 0, \"Should have actual downloaded files\")\n\t\n\t// Read the HTML file and verify URLs were replaced\n\thtmlContent, err := os.ReadFile(outputPath)\n\trequire.NoError(t, err)\n\t\n\thtmlStr := string(htmlContent)\n\tassert.Contains(t, htmlStr, fmt.Sprintf(\"%s/%s/\", filesPath, post.Slug), \"HTML should contain 
local file paths\")\n\t\n\t// Check that successfully downloaded files had their URLs replaced\n\tassert.Contains(t, htmlStr, \"attachments/test-post-with-files/document.pdf\", \"PDF file URL should be replaced\")\n\tassert.Contains(t, htmlStr, \"attachments/test-post-with-files/spreadsheet.xlsx\", \"XLSX file URL should be replaced\")\n\tassert.Contains(t, htmlStr, \"attachments/test-post-with-files/with-query\", \"Query file URL should be replaced\")\n\t\n\t// URLs that weren't downloadable or detectable should remain as original\n\t// (not-found.pdf and files that don't match CSS selector)\n\t\n\t// Verify specific file types were downloaded\n\tvar pdfFound, xlsxFound bool\n\tfor _, file := range files {\n\t\tif strings.HasSuffix(file.Name(), \".pdf\") {\n\t\t\tpdfFound = true\n\t\t}\n\t\tif strings.HasSuffix(file.Name(), \".xlsx\") {\n\t\t\txlsxFound = true\n\t\t}\n\t}\n\tassert.True(t, pdfFound, \"Should have downloaded PDF file\")\n\tassert.True(t, xlsxFound, \"Should have downloaded XLSX file\")\n}\n\n// TestExtractorIntegrationWithFiltering tests file download with extension filtering through extractor\nfunc TestExtractorIntegrationWithFiltering(t *testing.T) {\n\t// Create test server\n\tserver := createTestFileServer()\n\tdefer server.Close()\n\t\n\t// Create temporary directory\n\ttempDir, err := os.MkdirTemp(\"\", \"extractor-filtering-test-*\")\n\trequire.NoError(t, err)\n\tdefer os.RemoveAll(tempDir)\n\t\n\t// Create a mock post with file attachments\n\tpost := Post{\n\t\tId:       456,\n\t\tSlug:     \"filtered-post\",\n\t\tTitle:    \"Post with Filtered Files\",\n\t\tBodyHTML: createTestHTMLWithFiles(server.URL),\n\t}\n\t\n\t// Create fetcher for the extractor\n\tfetcher := NewFetcher()\n\t\n\t// Test file download with extension filtering (only PDF files)\n\toutputPath := filepath.Join(tempDir, \"filtered-post.html\")\n\tfilesPath := \"documents\"\n\timageDownloadResult, err := 
post.WriteToFileWithImages(\n\t\tcontext.Background(),\n\t\toutputPath,\n\t\t\"html\",\n\t\tfalse, // addSourceURL\n\t\tfalse, // downloadImages \n\t\tImageQualityHigh, // imageQuality\n\t\t\"\", // imagesDir (not used when downloadImages is false)\n\t\ttrue,  // downloadFiles\n\t\t[]string{\"pdf\"}, // fileExtensions - only PDF files\n\t\tfilesPath, // filesDir\n\t\tfetcher, // fetcher\n\t)\n\t\n\trequire.NoError(t, err)\n\trequire.NotNil(t, imageDownloadResult)\n\t\n\t// Check that the integration worked (files are not reported in image result)\n\t// We'll verify file downloads through the file system\n\t\n\t// Check that files directory was created\n\tfilesDir := filepath.Join(tempDir, filesPath, post.Slug)\n\t_, err = os.Stat(filesDir)\n\tassert.NoError(t, err, \"Files directory should be created\")\n\t\n\t// Check that only PDF files were downloaded\n\tfiles, err := os.ReadDir(filesDir)\n\trequire.NoError(t, err)\n\tassert.Greater(t, len(files), 0, \"Should have downloaded files\")\n\t\n\t// Verify only PDF files were downloaded\n\tfor _, file := range files {\n\t\tassert.True(t, strings.HasSuffix(file.Name(), \".pdf\"), \n\t\t\t\"Only PDF files should be downloaded, found: %s\", file.Name())\n\t}\n\t\n\t// Should be fewer files than the unfiltered test\n\tassert.LessOrEqual(t, len(files), 2, \"Should have fewer files due to filtering\")\n}\n\n// Benchmark tests\nfunc BenchmarkExtractFileElements(b *testing.B) {\n\tserver := createTestFileServer()\n\tdefer server.Close()\n\t\n\tdownloader := NewFileDownloader(nil, \"/tmp\", \"files\", nil)\n\thtmlContent := createTestHTMLWithFiles(server.URL)\n\t\n\tdoc, _ := goquery.NewDocumentFromReader(strings.NewReader(htmlContent))\n\t\n\tb.ResetTimer()\n\tfor i := 0; i < b.N; i++ {\n\t\tdownloader.extractFileElements(doc)\n\t}\n}\n\nfunc BenchmarkSanitizeFilename(b *testing.B) {\n\tdownloader := NewFileDownloader(nil, \"/tmp\", \"files\", nil)\n\tfilename := 
\"my<unsafe:file>name/with\\\\many|bad?chars*.pdf\"\n\t\n\tb.ResetTimer()\n\tfor i := 0; i < b.N; i++ {\n\t\tdownloader.sanitizeFilename(filename)\n\t}\n}"
  },
  {
    "path": "lib/images.go",
    "content": "package lib\n\nimport (\n\t\"context\"\n\t\"encoding/json\"\n\t\"fmt\"\n\t\"io\"\n\t\"net/url\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"regexp\"\n\t\"strconv\"\n\t\"strings\"\n\n\t\"github.com/PuerkitoBio/goquery\"\n)\n\n// ImageQuality represents the quality level for image downloads\ntype ImageQuality string\n\nconst (\n\tImageQualityHigh   ImageQuality = \"high\"   // 1456w\n\tImageQualityMedium ImageQuality = \"medium\" // 848w\n\tImageQualityLow    ImageQuality = \"low\"    // 424w\n)\n\n// ImageInfo contains information about a downloaded image\ntype ImageInfo struct {\n\tOriginalURL string\n\tLocalPath   string\n\tWidth       int\n\tHeight      int\n\tFormat      string\n\tSuccess     bool\n\tError       error\n}\n\n// ImageDownloader handles downloading and processing images from Substack posts\ntype ImageDownloader struct {\n\tfetcher      *Fetcher\n\toutputDir    string\n\timagesDir    string\n\timageQuality ImageQuality\n}\n\n// NewImageDownloader creates a new ImageDownloader instance\nfunc NewImageDownloader(fetcher *Fetcher, outputDir, imagesDir string, quality ImageQuality) *ImageDownloader {\n\tif fetcher == nil {\n\t\tfetcher = NewFetcher()\n\t}\n\treturn &ImageDownloader{\n\t\tfetcher:      fetcher,\n\t\toutputDir:    outputDir,\n\t\timagesDir:    imagesDir,\n\t\timageQuality: quality,\n\t}\n}\n\n// ImageDownloadResult contains the results of downloading images for a post\ntype ImageDownloadResult struct {\n\tImages      []ImageInfo\n\tUpdatedHTML string\n\tSuccess     int\n\tFailed      int\n}\n\n// ImageElement represents an image element with all its URLs\ntype ImageElement struct {\n\tBestURL    string   // The URL to download (highest quality)\n\tAllURLs    []string // All URLs that should be replaced with the local path\n\tLocalPath  string   // Local path after download\n\tSuccess    bool     // Whether download was successful\n}\n\n// DownloadImages downloads all images from a post's HTML content and returns updated HTML\nfunc 
(id *ImageDownloader) DownloadImages(ctx context.Context, htmlContent string, postSlug string) (*ImageDownloadResult, error) {\n\t// Parse HTML content\n\tdoc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent))\n\tif err != nil {\n\t\treturn nil, fmt.Errorf(\"failed to parse HTML content: %w\", err)\n\t}\n\n\t// Extract image elements with all their URLs\n\timageElements, err := id.extractImageElements(doc)\n\tif err != nil {\n\t\treturn nil, fmt.Errorf(\"failed to extract image elements: %w\", err)\n\t}\n\n\tif len(imageElements) == 0 {\n\t\treturn &ImageDownloadResult{\n\t\t\tImages:      []ImageInfo{},\n\t\t\tUpdatedHTML: htmlContent,\n\t\t\tSuccess:     0,\n\t\t\tFailed:      0,\n\t\t}, nil\n\t}\n\n\t// Create images directory\n\timagesPath := filepath.Join(id.outputDir, id.imagesDir, postSlug)\n\tif err := os.MkdirAll(imagesPath, 0755); err != nil {\n\t\treturn nil, fmt.Errorf(\"failed to create images directory: %w\", err)\n\t}\n\n\t// Download images and build URL mapping\n\tvar images []ImageInfo\n\turlToLocalPath := make(map[string]string)\n\n\tfor _, element := range imageElements {\n\t\t// Download the best quality URL\n\t\timageInfo := id.downloadSingleImage(ctx, element.BestURL, imagesPath)\n\t\timages = append(images, imageInfo)\n\n\t\tif imageInfo.Success {\n\t\t\t// Map ALL URLs for this image element to the same local path\n\t\t\tfor _, url := range element.AllURLs {\n\t\t\t\turlToLocalPath[url] = imageInfo.LocalPath\n\t\t\t}\n\t\t}\n\t}\n\n\t// Update HTML content with local paths\n\tupdatedHTML := id.updateHTMLWithLocalPaths(htmlContent, urlToLocalPath)\n\n\t// Count success/failure\n\tsuccess := 0\n\tfailed := 0\n\tfor _, img := range images {\n\t\tif img.Success {\n\t\t\tsuccess++\n\t\t} else {\n\t\t\tfailed++\n\t\t}\n\t}\n\n\treturn &ImageDownloadResult{\n\t\tImages:      images,\n\t\tUpdatedHTML: updatedHTML,\n\t\tSuccess:     success,\n\t\tFailed:      failed,\n\t}, nil\n}\n\n// extractImageElements extracts image elements 
with all their URLs from HTML content\nfunc (id *ImageDownloader) extractImageElements(doc *goquery.Document) ([]ImageElement, error) {\n\tvar imageElements []ImageElement\n\tseenBestURLs := make(map[string]bool) // To avoid duplicates based on best URL\n\tallURLsToCollect := make(map[string][]string) // Map from best URL to all URLs that should map to it\n\n\t// Find all img tags and collect their URLs\n\tdoc.Find(\"img\").Each(func(i int, s *goquery.Selection) {\n\t\telement := id.getImageElementInfo(s)\n\t\tif element.BestURL != \"\" && !seenBestURLs[element.BestURL] {\n\t\t\tallURLsToCollect[element.BestURL] = element.AllURLs\n\t\t\timageElements = append(imageElements, element)\n\t\t\tseenBestURLs[element.BestURL] = true\n\t\t}\n\t})\n\n\t// Also collect URLs from <a> tags that link to images\n\tdoc.Find(\"a\").Each(func(i int, s *goquery.Selection) {\n\t\tif href, exists := s.Attr(\"href\"); exists && id.isImageURL(href) {\n\t\t\t// Find the corresponding image element to add this URL to\n\t\t\tfor bestURL, urls := range allURLsToCollect {\n\t\t\t\tif id.isSameImage(href, bestURL) {\n\t\t\t\t\t// Add this href URL to the list of URLs to replace\n\t\t\t\t\turlExists := false\n\t\t\t\t\tfor _, existingURL := range urls {\n\t\t\t\t\t\tif existingURL == href {\n\t\t\t\t\t\t\turlExists = true\n\t\t\t\t\t\t\tbreak\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\tif !urlExists {\n\t\t\t\t\t\tallURLsToCollect[bestURL] = append(urls, href)\n\t\t\t\t\t\t// Update the corresponding element in imageElements\n\t\t\t\t\t\tfor j, elem := range imageElements {\n\t\t\t\t\t\t\tif elem.BestURL == bestURL {\n\t\t\t\t\t\t\t\timageElements[j].AllURLs = allURLsToCollect[bestURL]\n\t\t\t\t\t\t\t\tbreak\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t\tbreak\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t})\n\n\t// Also collect URLs from <source> tags (in <picture> elements)\n\tdoc.Find(\"source\").Each(func(i int, s *goquery.Selection) {\n\t\tif srcset, exists := s.Attr(\"srcset\"); exists 
{\n\t\t\tsrcsetURLs := id.extractAllURLsFromSrcset(srcset)\n\t\t\tfor _, srcsetURL := range srcsetURLs {\n\t\t\t\tif id.isImageURL(srcsetURL) {\n\t\t\t\t\t// Find the corresponding image element to add this URL to\n\t\t\t\t\tfor bestURL, urls := range allURLsToCollect {\n\t\t\t\t\t\tif id.isSameImage(srcsetURL, bestURL) {\n\t\t\t\t\t\t\t// Add this srcset URL to the list of URLs to replace\n\t\t\t\t\t\t\turlExists := false\n\t\t\t\t\t\t\tfor _, existingURL := range urls {\n\t\t\t\t\t\t\t\tif existingURL == srcsetURL {\n\t\t\t\t\t\t\t\t\turlExists = true\n\t\t\t\t\t\t\t\t\tbreak\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tif !urlExists {\n\t\t\t\t\t\t\t\tallURLsToCollect[bestURL] = append(urls, srcsetURL)\n\t\t\t\t\t\t\t\t// Update the corresponding element in imageElements\n\t\t\t\t\t\t\t\tfor j, elem := range imageElements {\n\t\t\t\t\t\t\t\t\tif elem.BestURL == bestURL {\n\t\t\t\t\t\t\t\t\t\timageElements[j].AllURLs = allURLsToCollect[bestURL]\n\t\t\t\t\t\t\t\t\t\tbreak\n\t\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\t}\n\t\t\t\t\t\t\tbreak\n\t\t\t\t\t\t}\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t})\n\n\treturn imageElements, nil\n}\n\n// extractImageURLs extracts image URLs from HTML content (kept for backward compatibility with tests)\nfunc (id *ImageDownloader) extractImageURLs(doc *goquery.Document) ([]string, error) {\n\tvar imageURLs []string\n\turlSet := make(map[string]bool) // To avoid duplicates\n\n\t// Find all img tags\n\tdoc.Find(\"img\").Each(func(i int, s *goquery.Selection) {\n\t\t// Get the best quality URL based on user preference\n\t\timageURL := id.getBestImageURL(s)\n\t\tif imageURL != \"\" && !urlSet[imageURL] {\n\t\t\timageURLs = append(imageURLs, imageURL)\n\t\t\turlSet[imageURL] = true\n\t\t}\n\t})\n\n\treturn imageURLs, nil\n}\n\n// getImageElementInfo extracts all URLs and determines the best one for an img element\nfunc (id *ImageDownloader) getImageElementInfo(imgElement *goquery.Selection) ImageElement {\n\tvar allURLs 
[]string\n\turlSet := make(map[string]bool) // To avoid duplicates\n\t\n\t// Helper function to add unique URLs\n\taddURL := func(url string) {\n\t\tif url != \"\" && !urlSet[url] {\n\t\t\tallURLs = append(allURLs, url)\n\t\t\turlSet[url] = true\n\t\t}\n\t}\n\t\n\t// 1. Get URL from data-attrs JSON (highest priority)\n\tif dataAttrs, exists := imgElement.Attr(\"data-attrs\"); exists {\n\t\tvar attrs map[string]interface{}\n\t\tif err := json.Unmarshal([]byte(dataAttrs), &attrs); err == nil {\n\t\t\tif src, ok := attrs[\"src\"].(string); ok && src != \"\" {\n\t\t\t\taddURL(src)\n\t\t\t}\n\t\t}\n\t}\n\t\n\t// 2. Get URLs from srcset attribute\n\tif srcset, exists := imgElement.Attr(\"srcset\"); exists {\n\t\tsrcsetURLs := id.extractAllURLsFromSrcset(srcset)\n\t\tfor _, url := range srcsetURLs {\n\t\t\taddURL(url)\n\t\t}\n\t}\n\t\n\t// 3. Get URL from src attribute\n\tif src, exists := imgElement.Attr(\"src\"); exists {\n\t\taddURL(src)\n\t}\n\t\n\t// Determine the best URL to download\n\tbestURL := id.getBestImageURL(imgElement)\n\t\n\treturn ImageElement{\n\t\tBestURL: bestURL,\n\t\tAllURLs: allURLs,\n\t}\n}\n\n// getBestImageURL extracts the best quality image URL from an img element\nfunc (id *ImageDownloader) getBestImageURL(imgElement *goquery.Selection) string {\n\t// First try to get URL from data-attrs JSON\n\tdataAttrs, exists := imgElement.Attr(\"data-attrs\")\n\tif exists {\n\t\tvar attrs map[string]interface{}\n\t\tif err := json.Unmarshal([]byte(dataAttrs), &attrs); err == nil {\n\t\t\tif src, ok := attrs[\"src\"].(string); ok && src != \"\" {\n\t\t\t\treturn src\n\t\t\t}\n\t\t}\n\t}\n\n\t// Get target width based on quality preference\n\ttargetWidth := id.getTargetWidth()\n\n\t// Try to get URL from srcset based on quality preference\n\tsrcset, exists := imgElement.Attr(\"srcset\")\n\tif exists {\n\t\tif url := id.extractURLFromSrcset(srcset, targetWidth); url != \"\" {\n\t\t\treturn url\n\t\t}\n\t}\n\n\t// Fallback to src attribute\n\tsrc, exists := 
imgElement.Attr(\"src\")\n\tif exists {\n\t\treturn src\n\t}\n\n\treturn \"\"\n}\n\n// getTargetWidth returns the target width based on image quality preference\nfunc (id *ImageDownloader) getTargetWidth() int {\n\tswitch id.imageQuality {\n\tcase ImageQualityHigh:\n\t\treturn 1456\n\tcase ImageQualityMedium:\n\t\treturn 848\n\tcase ImageQualityLow:\n\t\treturn 424\n\tdefault:\n\t\treturn 1456\n\t}\n}\n\n// extractAllURLsFromSrcset extracts all URLs from a srcset attribute\nfunc (id *ImageDownloader) extractAllURLsFromSrcset(srcset string) []string {\n\tif srcset == \"\" {\n\t\treturn []string{} // Return empty slice instead of nil\n\t}\n\t\n\tvar urls []string\n\t\n\t// Use the same robust parsing as updateSrcsetAttribute\n\tentries := id.parseSrcsetEntries(srcset)\n\t\n\tfor _, entry := range entries {\n\t\tentry = strings.TrimSpace(entry)\n\t\tif entry == \"\" {\n\t\t\tcontinue\n\t\t}\n\t\t\n\t\t// Parse \"URL WIDTHw\" format\n\t\tparts := strings.Fields(entry)\n\t\tif len(parts) >= 1 {\n\t\t\turl := parts[0]\n\t\t\t// Only include if it looks like a valid URL (not a fragment like \"f_webp\")\n\t\t\tif url != \"\" && (strings.HasPrefix(url, \"http://\") || strings.HasPrefix(url, \"https://\")) {\n\t\t\t\turls = append(urls, url)\n\t\t\t}\n\t\t}\n\t}\n\t\n\tif urls == nil {\n\t\treturn []string{} // Ensure we never return nil\n\t}\n\t\n\treturn urls\n}\n\n// extractURLFromSrcset extracts the URL with the target width from a srcset attribute\nfunc (id *ImageDownloader) extractURLFromSrcset(srcset string, targetWidth int) string {\n\t// Use the robust parsing to handle URLs with commas\n\tentries := id.parseSrcsetEntries(srcset)\n\t\n\tvar bestURL string\n\tvar bestWidth int\n\n\tfor _, entry := range entries {\n\t\tentry = strings.TrimSpace(entry)\n\t\tif entry == \"\" {\n\t\t\tcontinue\n\t\t}\n\t\t\n\t\t// Parse \"URL WIDTHw\" format\n\t\tparts := strings.Fields(entry)\n\t\tif len(parts) >= 2 {\n\t\t\turl := parts[0]\n\t\t\twidthStr := 
strings.TrimSuffix(parts[1], \"w\")\n\t\t\t\n\t\t\t// Only process if it looks like a valid URL\n\t\t\tif url != \"\" && (strings.HasPrefix(url, \"http://\") || strings.HasPrefix(url, \"https://\")) {\n\t\t\t\tif width, err := strconv.Atoi(widthStr); err == nil {\n\t\t\t\t\t// Find the closest width to our target, preferring exact matches or higher\n\t\t\t\t\tif width == targetWidth || (bestURL == \"\" || \n\t\t\t\t\t\t(width >= targetWidth && (bestWidth < targetWidth || width < bestWidth)) ||\n\t\t\t\t\t\t(width < targetWidth && bestWidth < targetWidth && width > bestWidth)) {\n\t\t\t\t\t\tbestURL = url\n\t\t\t\t\t\tbestWidth = width\n\t\t\t\t\t}\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\n\treturn bestURL\n}\n\n// downloadSingleImage downloads a single image and returns its info\nfunc (id *ImageDownloader) downloadSingleImage(ctx context.Context, imageURL, imagesPath string) ImageInfo {\n\timageInfo := ImageInfo{\n\t\tOriginalURL: imageURL,\n\t\tSuccess:     false,\n\t}\n\n\t// Generate safe filename\n\tfilename, err := id.generateSafeFilename(imageURL)\n\tif err != nil {\n\t\timageInfo.Error = fmt.Errorf(\"failed to generate filename: %w\", err)\n\t\treturn imageInfo\n\t}\n\n\tlocalPath := filepath.Join(imagesPath, filename)\n\timageInfo.LocalPath = localPath\n\n\t// Download the image\n\tbody, err := id.fetcher.FetchURL(ctx, imageURL)\n\tif err != nil {\n\t\timageInfo.Error = fmt.Errorf(\"failed to fetch image: %w\", err)\n\t\treturn imageInfo\n\t}\n\tdefer body.Close()\n\n\t// Create the local file\n\tfile, err := os.Create(localPath)\n\tif err != nil {\n\t\timageInfo.Error = fmt.Errorf(\"failed to create local file: %w\", err)\n\t\treturn imageInfo\n\t}\n\tdefer file.Close()\n\n\t// Copy image data\n\t_, err = io.Copy(file, body)\n\tif err != nil {\n\t\timageInfo.Error = fmt.Errorf(\"failed to write image data: %w\", err)\n\t\tos.Remove(localPath) // Clean up failed file\n\t\treturn imageInfo\n\t}\n\n\t// Extract image metadata\n\timageInfo.Format = 
id.getImageFormat(filename)\n\timageInfo.Width, imageInfo.Height = id.extractDimensionsFromURL(imageURL)\n\n\timageInfo.Success = true\n\treturn imageInfo\n}\n\n// generateSafeFilename generates a safe filename from an image URL\nfunc (id *ImageDownloader) generateSafeFilename(imageURL string) (string, error) {\n\tparsedURL, err := url.Parse(imageURL)\n\tif err != nil {\n\t\treturn \"\", err\n\t}\n\n\t// Extract filename from URL path\n\tfilename := filepath.Base(parsedURL.Path)\n\t\n\t// If no valid filename, generate one from URL patterns\n\tif filename == \"\" || filename == \"/\" || filename == \".\" {\n\t\tfilename = \"\" // Reset to force fallback logic\n\t\t\n\t\t// Try to extract from the URL patterns\n\t\tif strings.Contains(imageURL, \"substack\") {\n\t\t\t// Try to extract the image ID from Substack URLs\n\t\t\tif match := regexp.MustCompile(`([a-f0-9-]{36})_(\\d+x\\d+)\\.(jpeg|jpg|png|webp)`).FindStringSubmatch(imageURL); len(match) > 0 {\n\t\t\t\tfilename = fmt.Sprintf(\"%s_%s.%s\", match[1][:8], match[2], match[3])\n\t\t\t}\n\t\t}\n\t\t\n\t\t// If still no filename, use default\n\t\tif filename == \"\" {\n\t\t\tfilename = \"image.jpg\"\n\t\t}\n\t}\n\n\t// Clean filename - remove invalid characters (but preserve structure)\n\t// Only replace invalid filesystem characters\n\tcleanedFilename := regexp.MustCompile(`[<>:\"/\\\\|?*]`).ReplaceAllString(filename, \"_\")\n\t\n\t// Ensure we have a valid filename after cleaning\n\tif cleanedFilename == \"\" || cleanedFilename == \"_\" || cleanedFilename == \"__\" {\n\t\tcleanedFilename = \"image.jpg\"\n\t}\n\t\n\t// Ensure filename is not too long\n\tif len(cleanedFilename) > 200 {\n\t\text := filepath.Ext(cleanedFilename)\n\t\tname := strings.TrimSuffix(cleanedFilename, ext)\n\t\tif len(ext) < 200 {\n\t\t\tcleanedFilename = name[:200-len(ext)] + ext\n\t\t} else {\n\t\t\tcleanedFilename = \"image.jpg\"\n\t\t}\n\t}\n\n\treturn cleanedFilename, nil\n}\n\n// getImageFormat determines image format from 
filename\nfunc (id *ImageDownloader) getImageFormat(filename string) string {\n\text := strings.ToLower(filepath.Ext(filename))\n\tswitch ext {\n\tcase \".jpg\", \".jpeg\":\n\t\treturn \"jpeg\"\n\tcase \".png\":\n\t\treturn \"png\"\n\tcase \".webp\":\n\t\treturn \"webp\"\n\tcase \".gif\":\n\t\treturn \"gif\"\n\tdefault:\n\t\treturn \"unknown\"\n\t}\n}\n\n// extractDimensionsFromURL attempts to extract width and height from URL\nfunc (id *ImageDownloader) extractDimensionsFromURL(imageURL string) (int, int) {\n\t// Look for patterns like \"1456x819\" or \"w_1456,h_819\"\n\tif match := regexp.MustCompile(`(\\d+)x(\\d+)`).FindStringSubmatch(imageURL); len(match) >= 3 {\n\t\twidth, _ := strconv.Atoi(match[1])\n\t\theight, _ := strconv.Atoi(match[2])\n\t\treturn width, height\n\t}\n\n\tif match := regexp.MustCompile(`w_(\\d+)`).FindStringSubmatch(imageURL); len(match) >= 2 {\n\t\twidth, _ := strconv.Atoi(match[1])\n\t\treturn width, 0 // Height unknown\n\t}\n\n\treturn 0, 0\n}\n\n// updateHTMLWithLocalPaths replaces image URLs in HTML with local paths\nfunc (id *ImageDownloader) updateHTMLWithLocalPaths(htmlContent string, urlToLocalPath map[string]string) string {\n\t// Parse HTML content\n\tdoc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent))\n\tif err != nil {\n\t\t// Fallback to simple string replacement if parsing fails\n\t\treturn id.updateHTMLWithStringReplacement(htmlContent, urlToLocalPath)\n\t}\n\n\t// Create URL to relative path mapping\n\turlToRelPath := make(map[string]string)\n\tfor originalURL, localPath := range urlToLocalPath {\n\t\t// Convert absolute local path to relative path from output directory\n\t\trelPath, err := filepath.Rel(id.outputDir, localPath)\n\t\tif err != nil {\n\t\t\trelPath = localPath // fallback to absolute path\n\t\t}\n\t\t// Always ensure forward slashes for HTML (web standard)\n\t\trelPath = strings.ReplaceAll(relPath, \"\\\\\", \"/\")\n\t\turlToRelPath[originalURL] = relPath\n\t}\n\n\t// Update img 
elements\n\tdoc.Find(\"img\").Each(func(i int, s *goquery.Selection) {\n\t\t// Update src attribute\n\t\tif src, exists := s.Attr(\"src\"); exists {\n\t\t\tif relPath, found := urlToRelPath[src]; found {\n\t\t\t\ts.SetAttr(\"src\", relPath)\n\t\t\t}\n\t\t}\n\n\t\t// Update srcset attribute\n\t\tif srcset, exists := s.Attr(\"srcset\"); exists {\n\t\t\tupdatedSrcset := id.updateSrcsetAttribute(srcset, urlToRelPath)\n\t\t\ts.SetAttr(\"srcset\", updatedSrcset)\n\t\t}\n\n\t\t// Update data-attrs JSON\n\t\tif dataAttrs, exists := s.Attr(\"data-attrs\"); exists {\n\t\t\tupdatedDataAttrs := id.updateDataAttrsJSON(dataAttrs, urlToRelPath)\n\t\t\ts.SetAttr(\"data-attrs\", updatedDataAttrs)\n\t\t}\n\t})\n\n\t// Update anchor elements with image links\n\tdoc.Find(\"a\").Each(func(i int, s *goquery.Selection) {\n\t\tif href, exists := s.Attr(\"href\"); exists {\n\t\t\tif relPath, found := urlToRelPath[href]; found {\n\t\t\t\ts.SetAttr(\"href\", relPath)\n\t\t\t}\n\t\t}\n\t})\n\n\t// Update source elements (in picture tags)\n\tdoc.Find(\"source\").Each(func(i int, s *goquery.Selection) {\n\t\tif srcset, exists := s.Attr(\"srcset\"); exists {\n\t\t\tupdatedSrcset := id.updateSrcsetAttribute(srcset, urlToRelPath)\n\t\t\ts.SetAttr(\"srcset\", updatedSrcset)\n\t\t}\n\t})\n\n\t// Get the updated HTML\n\thtml, err := doc.Html()\n\tif err != nil {\n\t\t// Fallback to simple string replacement if HTML generation fails\n\t\treturn id.updateHTMLWithStringReplacement(htmlContent, urlToLocalPath)\n\t}\n\n\treturn html\n}\n\n// updateHTMLWithStringReplacement is the fallback method using simple string replacement\nfunc (id *ImageDownloader) updateHTMLWithStringReplacement(htmlContent string, urlToLocalPath map[string]string) string {\n\tupdatedHTML := htmlContent\n\n\tfor originalURL, localPath := range urlToLocalPath {\n\t\t// Convert absolute local path to relative path from output directory\n\t\trelPath, err := filepath.Rel(id.outputDir, localPath)\n\t\tif err != nil {\n\t\t\trelPath = 
localPath // fallback to absolute path\n\t\t}\n\n\t\t// Always ensure forward slashes for HTML (web standard)\n\t\t// Convert any backslashes to forward slashes regardless of platform\n\t\trelPath = strings.ReplaceAll(relPath, \"\\\\\", \"/\")\n\n\t\t// Replace URL in various contexts\n\t\tupdatedHTML = strings.ReplaceAll(updatedHTML, originalURL, relPath)\n\t\t\n\t\t// Also replace URL-encoded versions\n\t\tencodedURL := url.QueryEscape(originalURL)\n\t\tif encodedURL != originalURL {\n\t\t\tupdatedHTML = strings.ReplaceAll(updatedHTML, encodedURL, relPath)\n\t\t}\n\t}\n\n\treturn updatedHTML\n}\n\n// updateSrcsetAttribute updates URLs in a srcset attribute\nfunc (id *ImageDownloader) updateSrcsetAttribute(srcset string, urlToRelPath map[string]string) string {\n\tif srcset == \"\" {\n\t\treturn srcset\n\t}\n\n\t// Parse srcset more carefully to handle URLs with commas\n\tentries := id.parseSrcsetEntries(srcset)\n\t\n\t// Map to track unique local paths and their best width descriptor\n\tpathToEntry := make(map[string]string)\n\t\n\tfor _, entry := range entries {\n\t\tentry = strings.TrimSpace(entry)\n\t\tif entry == \"\" {\n\t\t\tcontinue\n\t\t}\n\n\t\t// Parse \"URL WIDTH\" format\n\t\tparts := strings.Fields(entry)\n\t\tif len(parts) >= 1 {\n\t\t\turl := parts[0]\n\t\t\t// Replace URL if we have a mapping for it\n\t\t\tif relPath, found := urlToRelPath[url]; found {\n\t\t\t\t// Build the new entry with local path\n\t\t\t\tvar newEntry string\n\t\t\t\tif len(parts) >= 2 {\n\t\t\t\t\t// Has width descriptor\n\t\t\t\t\tnewEntry = relPath + \" \" + parts[1]\n\t\t\t\t} else {\n\t\t\t\t\t// No width descriptor\n\t\t\t\t\tnewEntry = relPath\n\t\t\t\t}\n\t\t\t\t\n\t\t\t\t// Only keep one entry per unique local path\n\t\t\t\t// If we already have an entry for this path, keep the one with width descriptor\n\t\t\t\tif existingEntry, exists := pathToEntry[relPath]; exists {\n\t\t\t\t\t// Prefer entries with width descriptors\n\t\t\t\t\tif len(parts) >= 2 && 
!strings.Contains(existingEntry, \" \") {\n\t\t\t\t\t\tpathToEntry[relPath] = newEntry\n\t\t\t\t\t}\n\t\t\t\t\t// If both have width descriptors or both don't, keep the first one\n\t\t\t\t} else {\n\t\t\t\t\tpathToEntry[relPath] = newEntry\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\t// URL wasn't mapped, keep original entry\n\t\t\t\tpathToEntry[url] = entry\n\t\t\t}\n\t\t}\n\t}\n\n\t// Convert map back to slice, maintaining order as much as possible\n\tvar updatedEntries []string\n\tfor _, entry := range entries {\n\t\tentry = strings.TrimSpace(entry)\n\t\tif entry == \"\" {\n\t\t\tcontinue\n\t\t}\n\t\t\n\t\tparts := strings.Fields(entry)\n\t\tif len(parts) >= 1 {\n\t\t\turl := parts[0]\n\t\t\tif relPath, found := urlToRelPath[url]; found {\n\t\t\t\t// Use the entry from our deduplication map\n\t\t\t\tif finalEntry, exists := pathToEntry[relPath]; exists {\n\t\t\t\t\tupdatedEntries = append(updatedEntries, finalEntry)\n\t\t\t\t\tdelete(pathToEntry, relPath) // Remove to avoid duplicates\n\t\t\t\t}\n\t\t\t} else {\n\t\t\t\t// Original URL, use as-is\n\t\t\t\tif finalEntry, exists := pathToEntry[url]; exists {\n\t\t\t\t\tupdatedEntries = append(updatedEntries, finalEntry)\n\t\t\t\t\tdelete(pathToEntry, url)\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\n\treturn strings.Join(updatedEntries, \", \")\n}\n\n// isImageURL checks if a URL appears to be an image URL (Substack CDN or S3)\nfunc (id *ImageDownloader) isImageURL(url string) bool {\n\treturn strings.Contains(url, \"substackcdn.com\") || \n\t\t   strings.Contains(url, \"substack-post-media.s3.amazonaws.com\") ||\n\t\t   strings.Contains(url, \"bucketeer-\") // Some Substack images use bucketeer URLs\n}\n\n// isSameImage checks if two URLs refer to the same image by comparing the core image identifier\nfunc (id *ImageDownloader) isSameImage(url1, url2 string) bool {\n\t// Extract the UUID pattern from both URLs\n\tuuidPattern := regexp.MustCompile(`([a-f0-9-]{36})`)\n\t\n\tmatches1 := uuidPattern.FindStringSubmatch(url1)\n\tmatches2 
:= uuidPattern.FindStringSubmatch(url2) \n\t\n\tif len(matches1) > 0 && len(matches2) > 0 {\n\t\treturn matches1[1] == matches2[1]\n\t}\n\t\n\t// Fallback: if we can't find UUIDs, check if the URLs contain similar image identifiers\n\t// This handles cases where the URL structure might vary\n\t// An empty identifier is skipped because strings.Contains(s, \"\") is always true\n\timageID1, imageID2 := extractImageID(url1), extractImageID(url2)\n\treturn (imageID2 != \"\" && strings.Contains(url1, imageID2)) || (imageID1 != \"\" && strings.Contains(url2, imageID1))\n}\n\n// extractImageID extracts a unique identifier from an image URL\nfunc extractImageID(url string) string {\n\t// Try to extract UUID first\n\tif match := regexp.MustCompile(`([a-f0-9-]{36})`).FindStringSubmatch(url); len(match) > 0 {\n\t\treturn match[1]\n\t}\n\t\n\t// Fallback to extracting a filename-like pattern\n\tif match := regexp.MustCompile(`/([^/]+)\\.(jpeg|jpg|png|webp|heic|gif)(?:\\?|$)`).FindStringSubmatch(url); len(match) > 0 {\n\t\treturn match[1]\n\t}\n\t\n\treturn \"\"\n}\n\n// parseSrcsetEntries carefully parses srcset entries, handling URLs that contain commas\nfunc (id *ImageDownloader) parseSrcsetEntries(srcset string) []string {\n\tvar entries []string\n\t\n\t// Use regex to find URLs followed by width descriptors\n\t// This pattern matches: (URL) (WIDTH)w where URL can contain commas\n\tpattern := regexp.MustCompile(`(https?://[^\\s]+)\\s+(\\d+w)`)\n\tmatches := pattern.FindAllStringSubmatch(srcset, -1)\n\t\n\tfor _, match := range matches {\n\t\tif len(match) >= 3 {\n\t\t\turl := match[1]\n\t\t\twidth := match[2]\n\t\t\tentries = append(entries, url+\" \"+width)\n\t\t}\n\t}\n\t\n\t// If regex parsing didn't find anything, fall back to simple comma splitting\n\t// but only for URLs that don't contain commas\n\tif len(entries) == 0 {\n\t\tparts := strings.Split(srcset, \",\")\n\t\tfor _, part := range parts {\n\t\t\tpart = strings.TrimSpace(part)\n\t\t\tif part != \"\" {\n\t\t\t\t// Only include if it looks like a proper entry (URL + width or just URL)\n\t\t\t\tfields := strings.Fields(part)\n\t\t\t\tif len(fields) >= 1 && 
(strings.HasPrefix(fields[0], \"http://\") || strings.HasPrefix(fields[0], \"https://\")) {\n\t\t\t\t\tentries = append(entries, part)\n\t\t\t\t}\n\t\t\t}\n\t\t}\n\t}\n\t\n\treturn entries\n}\n\n// updateDataAttrsJSON updates URLs in a data-attrs JSON string\nfunc (id *ImageDownloader) updateDataAttrsJSON(dataAttrs string, urlToRelPath map[string]string) string {\n\tif dataAttrs == \"\" {\n\t\treturn dataAttrs\n\t}\n\n\tvar attrs map[string]interface{}\n\tif err := json.Unmarshal([]byte(dataAttrs), &attrs); err != nil {\n\t\treturn dataAttrs // Return original if parsing fails\n\t}\n\n\t// Update src field if it exists\n\tif src, ok := attrs[\"src\"].(string); ok {\n\t\tif relPath, found := urlToRelPath[src]; found {\n\t\t\tattrs[\"src\"] = relPath\n\t\t}\n\t}\n\n\t// Marshal back to JSON\n\tupdatedJSON, err := json.Marshal(attrs)\n\tif err != nil {\n\t\treturn dataAttrs // Return original if marshaling fails\n\t}\n\n\treturn string(updatedJSON)\n}"
  },
  {
    "path": "lib/images_test.go",
    "content": "package lib\n\nimport (\n\t\"context\"\n\t\"fmt\"\n\t\"net/http\"\n\t\"net/http/httptest\"\n\t\"net/url\"\n\t\"os\"\n\t\"path/filepath\"\n\t\"strings\"\n\t\"testing\"\n\t\"time\"\n\n\t\"github.com/PuerkitoBio/goquery\"\n\t\"github.com/stretchr/testify/assert\"\n\t\"github.com/stretchr/testify/require\"\n)\n\n// Test image data - a simple 1x1 PNG\nvar testImageData = []byte{\n\t0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, 0x00, 0x00, 0x00, 0x0D,\n\t0x49, 0x48, 0x44, 0x52, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x01,\n\t0x08, 0x06, 0x00, 0x00, 0x00, 0x1F, 0x15, 0xC4, 0x89, 0x00, 0x00, 0x00,\n\t0x0A, 0x49, 0x44, 0x41, 0x54, 0x78, 0x9C, 0x63, 0x00, 0x01, 0x00, 0x00,\n\t0x05, 0x00, 0x01, 0x0D, 0x0A, 0x2D, 0xB4, 0x00, 0x00, 0x00, 0x00, 0x49,\n\t0x45, 0x4E, 0x44, 0xAE, 0x42, 0x60, 0x82,\n}\n\n// createTestImageServer creates a test server that serves test images\nfunc createTestImageServer() *httptest.Server {\n\treturn httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\tpath := r.URL.Path\n\t\t\n\t\tswitch {\n\t\tcase strings.Contains(path, \"success\"):\n\t\t\tw.Header().Set(\"Content-Type\", \"image/png\")\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t\tw.Write(testImageData)\n\t\tcase strings.Contains(path, \"not-found\"):\n\t\t\tw.WriteHeader(http.StatusNotFound)\n\t\tcase strings.Contains(path, \"server-error\"):\n\t\t\tw.WriteHeader(http.StatusInternalServerError)\n\t\tcase strings.Contains(path, \"timeout\"):\n\t\t\t// Don't respond to simulate timeout - but add a timeout to prevent hanging\n\t\t\tselect {\n\t\t\tcase <-time.After(5 * time.Second):\n\t\t\t\tw.WriteHeader(http.StatusRequestTimeout)\n\t\t\t}\n\t\tdefault:\n\t\t\tw.Header().Set(\"Content-Type\", \"image/png\")\n\t\t\tw.WriteHeader(http.StatusOK)\n\t\t\tw.Write(testImageData)\n\t\t}\n\t}))\n}\n\n// createTestHTMLWithImages creates HTML content with various image structures\nfunc createTestHTMLWithImages(baseURL string) string {\n\treturn 
fmt.Sprintf(`\n<!DOCTYPE html>\n<html>\n<head><title>Test Post</title></head>\n<body>\n<h1>Test Post with Images</h1>\n\n<!-- Simple img tag -->\n<p>Here's a simple image:</p>\n<img src=\"%s/simple-image.png\" alt=\"Simple image\" width=\"200\" height=\"100\">\n\n<!-- Complex Substack-style image with srcset -->\n<div class=\"captioned-image-container\">\n  <figure>\n    <a class=\"image-link is-viewable-img image2\" target=\"_blank\" href=\"%s/fullsize.jpeg\">\n      <div class=\"image2-inset\">\n        <picture>\n          <source type=\"image/webp\" srcset=\"%s/w_424.webp 424w, %s/w_848.webp 848w, %s/w_1456.webp 1456w\">\n          <img src=\"%s/w_1456.jpeg\" \n               srcset=\"%s/w_424.jpeg 424w, %s/w_848.jpeg 848w, %s/w_1456.jpeg 1456w\"\n               data-attrs='{\"src\":\"%s/original.jpeg\",\"width\":1456,\"height\":819,\"type\":\"image/jpeg\",\"bytes\":385174}'\n               alt=\"Complex image\" width=\"1456\" height=\"819\">\n        </picture>\n      </div>\n    </a>\n  </figure>\n</div>\n\n<!-- Image with data-attrs only -->\n<img data-attrs='{\"src\":\"%s/data-attrs-only.png\",\"width\":800,\"height\":600}' alt=\"Data attrs image\">\n\n<!-- Non-existent image for error testing -->\n<img src=\"%s/not-found.png\" alt=\"Missing image\">\n\n</body>\n</html>`, \n\t\tbaseURL, baseURL, baseURL, baseURL, baseURL, baseURL, baseURL, baseURL, \n\t\tbaseURL, baseURL, baseURL, baseURL)\n}\n\n// TestNewImageDownloader tests the creation of ImageDownloader\nfunc TestNewImageDownloader(t *testing.T) {\n\tt.Run(\"WithFetcher\", func(t *testing.T) {\n\t\tfetcher := NewFetcher()\n\t\tdownloader := NewImageDownloader(fetcher, \"/tmp\", \"images\", ImageQualityHigh)\n\t\t\n\t\tassert.Equal(t, fetcher, downloader.fetcher)\n\t\tassert.Equal(t, \"/tmp\", downloader.outputDir)\n\t\tassert.Equal(t, \"images\", downloader.imagesDir)\n\t\tassert.Equal(t, ImageQualityHigh, downloader.imageQuality)\n\t})\n\t\n\tt.Run(\"WithoutFetcher\", func(t *testing.T) 
{\n\t\tdownloader := NewImageDownloader(nil, \"/tmp\", \"images\", ImageQualityMedium)\n\t\t\n\t\tassert.NotNil(t, downloader.fetcher)\n\t\tassert.Equal(t, \"/tmp\", downloader.outputDir)\n\t\tassert.Equal(t, \"images\", downloader.imagesDir)\n\t\tassert.Equal(t, ImageQualityMedium, downloader.imageQuality)\n\t})\n}\n\n// TestGetTargetWidth tests width calculation for different quality levels\nfunc TestGetTargetWidth(t *testing.T) {\n\ttests := []struct {\n\t\tquality ImageQuality\n\t\twidth   int\n\t}{\n\t\t{ImageQualityHigh, 1456},\n\t\t{ImageQualityMedium, 848},\n\t\t{ImageQualityLow, 424},\n\t\t{ImageQuality(\"invalid\"), 1456}, // should default to high\n\t}\n\t\n\tfor _, test := range tests {\n\t\tt.Run(string(test.quality), func(t *testing.T) {\n\t\t\tdownloader := NewImageDownloader(nil, \"/tmp\", \"images\", test.quality)\n\t\t\twidth := downloader.getTargetWidth()\n\t\t\tassert.Equal(t, test.width, width)\n\t\t})\n\t}\n}\n\n// TestExtractURLFromSrcset tests srcset URL extraction\nfunc TestExtractURLFromSrcset(t *testing.T) {\n\tdownloader := NewImageDownloader(nil, \"/tmp\", \"images\", ImageQualityHigh)\n\t\n\ttests := []struct {\n\t\tname       string\n\t\tsrcset     string\n\t\ttargetWidth int\n\t\texpected   string\n\t}{\n\t\t{\n\t\t\tname:        \"ExactMatch\",\n\t\t\tsrcset:      \"https://example.com/image-424.jpg 424w, https://example.com/image-848.jpg 848w, https://example.com/image-1456.jpg 1456w\",\n\t\t\ttargetWidth: 848,\n\t\t\texpected:    \"https://example.com/image-848.jpg\",\n\t\t},\n\t\t{\n\t\t\tname:        \"ClosestHigher\",\n\t\t\tsrcset:      \"https://example.com/image-424.jpg 424w, https://example.com/image-1200.jpg 1200w, https://example.com/image-1456.jpg 1456w\",\n\t\t\ttargetWidth: 800,\n\t\t\texpected:    \"https://example.com/image-1200.jpg\",\n\t\t},\n\t\t{\n\t\t\tname:        \"ClosestLower\",\n\t\t\tsrcset:      \"https://example.com/image-200.jpg 200w, https://example.com/image-400.jpg 400w\",\n\t\t\ttargetWidth: 
800,\n\t\t\texpected:    \"https://example.com/image-400.jpg\",\n\t\t},\n\t\t{\n\t\t\tname:        \"SingleEntry\",\n\t\t\tsrcset:      \"https://example.com/single-image.jpg 1024w\",\n\t\t\ttargetWidth: 800,\n\t\t\texpected:    \"https://example.com/single-image.jpg\",\n\t\t},\n\t\t{\n\t\t\tname:        \"EmptySrcset\",\n\t\t\tsrcset:      \"\",\n\t\t\ttargetWidth: 800,\n\t\t\texpected:    \"\",\n\t\t},\n\t}\n\t\n\tfor _, test := range tests {\n\t\tt.Run(test.name, func(t *testing.T) {\n\t\t\tresult := downloader.extractURLFromSrcset(test.srcset, test.targetWidth)\n\t\t\tassert.Equal(t, test.expected, result)\n\t\t})\n\t}\n}\n\n// TestGenerateSafeFilename tests filename generation\nfunc TestGenerateSafeFilename(t *testing.T) {\n\tdownloader := NewImageDownloader(nil, \"/tmp\", \"images\", ImageQualityHigh)\n\t\n\ttests := []struct {\n\t\tname     string\n\t\turl      string\n\t\texpected string\n\t}{\n\t\t{\n\t\t\tname:     \"SimpleURL\",\n\t\t\turl:      \"https://example.com/image.jpg\",\n\t\t\texpected: \"image.jpg\",\n\t\t},\n\t\t{\n\t\t\tname:     \"SubstackPattern\",\n\t\t\turl:      \"https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd83a175f-d0a1-450a-931f-adf68630630e_5634x2864.jpeg\",\n\t\t\texpected: \"d83a175f-d0a1-450a-931f-adf68630630e_5634x2864.jpeg\",\n\t\t},\n\t\t{\n\t\t\tname:     \"InvalidCharacters\",\n\t\t\turl:      \"https://example.com/image:with<bad>chars.png\",\n\t\t\texpected: \"image_with_bad_chars.png\",\n\t\t},\n\t\t{\n\t\t\tname:     \"NoExtension\",\n\t\t\turl:      \"https://example.com/imagewithoutextension\",\n\t\t\texpected: \"imagewithoutextension\",\n\t\t},\n\t\t{\n\t\t\tname:     \"EmptyFilename\",\n\t\t\turl:      \"https://example.com/\",\n\t\t\texpected: \"image.jpg\",\n\t\t},\n\t}\n\t\n\tfor _, test := range tests {\n\t\tt.Run(test.name, func(t *testing.T) {\n\t\t\tresult, err := 
downloader.generateSafeFilename(test.url)\n\t\t\tassert.NoError(t, err)\n\t\t\tassert.Equal(t, test.expected, result)\n\t\t})\n\t}\n}\n\n// TestGetImageFormat tests image format detection\nfunc TestGetImageFormat(t *testing.T) {\n\tdownloader := NewImageDownloader(nil, \"/tmp\", \"images\", ImageQualityHigh)\n\t\n\ttests := []struct {\n\t\tfilename string\n\t\tformat   string\n\t}{\n\t\t{\"image.jpg\", \"jpeg\"},\n\t\t{\"image.jpeg\", \"jpeg\"},\n\t\t{\"image.png\", \"png\"},\n\t\t{\"image.webp\", \"webp\"},\n\t\t{\"image.gif\", \"gif\"},\n\t\t{\"image.JPG\", \"jpeg\"},\n\t\t{\"image.PNG\", \"png\"},\n\t\t{\"image.unknown\", \"unknown\"},\n\t\t{\"image\", \"unknown\"},\n\t}\n\t\n\tfor _, test := range tests {\n\t\tt.Run(test.filename, func(t *testing.T) {\n\t\t\tresult := downloader.getImageFormat(test.filename)\n\t\t\tassert.Equal(t, test.format, result)\n\t\t})\n\t}\n}\n\n// TestExtractDimensionsFromURL tests dimension extraction from URLs\nfunc TestExtractDimensionsFromURL(t *testing.T) {\n\tdownloader := NewImageDownloader(nil, \"/tmp\", \"images\", ImageQualityHigh)\n\t\n\ttests := []struct {\n\t\tname   string\n\t\turl    string\n\t\twidth  int\n\t\theight int\n\t}{\n\t\t{\n\t\t\tname:   \"DimensionPattern\",\n\t\t\turl:    \"https://example.com/image_1920x1080.jpg\",\n\t\t\twidth:  1920,\n\t\t\theight: 1080,\n\t\t},\n\t\t{\n\t\t\tname:   \"WidthOnlyPattern\",\n\t\t\turl:    \"https://example.com/w_1456,c_limit/image.jpg\",\n\t\t\twidth:  1456,\n\t\t\theight: 0,\n\t\t},\n\t\t{\n\t\t\tname:   \"NoDimensions\",\n\t\t\turl:    \"https://example.com/image.jpg\",\n\t\t\twidth:  0,\n\t\t\theight: 0,\n\t\t},\n\t\t{\n\t\t\tname:   \"SubstackPattern\",\n\t\t\turl:    \"https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd83a175f-d0a1-450a-931f-adf68630630e_5634x2864.jpeg\",\n\t\t\twidth:  5634,\n\t\t\theight: 2864,\n\t\t},\n\t}\n\t\n\tfor _, test := range 
tests {\n\t\tt.Run(test.name, func(t *testing.T) {\n\t\t\twidth, height := downloader.extractDimensionsFromURL(test.url)\n\t\t\tassert.Equal(t, test.width, width)\n\t\t\tassert.Equal(t, test.height, height)\n\t\t})\n\t}\n}\n\n// TestDownloadImages tests the complete image downloading workflow\nfunc TestDownloadImages(t *testing.T) {\n\t// Create test server\n\tserver := createTestImageServer()\n\tdefer server.Close()\n\t\n\t// Create temporary directory\n\ttempDir, err := os.MkdirTemp(\"\", \"image-download-test-*\")\n\trequire.NoError(t, err)\n\tdefer os.RemoveAll(tempDir)\n\t\n\t// Create downloader\n\tdownloader := NewImageDownloader(nil, tempDir, \"images\", ImageQualityHigh)\n\t\n\tt.Run(\"SuccessfulDownload\", func(t *testing.T) {\n\t\thtmlContent := createTestHTMLWithImages(server.URL)\n\t\tctx := context.Background()\n\t\t\n\t\tresult, err := downloader.DownloadImages(ctx, htmlContent, \"test-post\")\n\t\trequire.NoError(t, err)\n\t\t\n\t\t// Check results\n\t\tassert.Greater(t, result.Success, 0, \"Should have successful downloads\")\n\t\tassert.Greater(t, result.Failed, 0, \"Should have failed downloads (not-found image)\")\n\t\tassert.Greater(t, len(result.Images), 0, \"Should have image info\")\n\t\t\n\t\t// Check that images directory was created\n\t\timagesDir := filepath.Join(tempDir, \"images\", \"test-post\")\n\t\t_, err = os.Stat(imagesDir)\n\t\tassert.NoError(t, err, \"Images directory should exist\")\n\t\t\n\t\t// Check that some images were downloaded\n\t\tfiles, err := os.ReadDir(imagesDir)\n\t\tassert.NoError(t, err)\n\t\tassert.Greater(t, len(files), 0, \"Should have downloaded image files\")\n\t\t\n\t\t// Check that HTML was updated\n\t\tassert.NotEqual(t, htmlContent, result.UpdatedHTML, \"HTML should be updated\")\n\t\tassert.Contains(t, result.UpdatedHTML, \"images/test-post/\", \"HTML should contain local image paths\")\n\t})\n\t\n\tt.Run(\"NoImages\", func(t *testing.T) {\n\t\thtmlContent := \"<html><body><p>No images 
here</p></body></html>\"\n\t\tctx := context.Background()\n\t\t\n\t\tresult, err := downloader.DownloadImages(ctx, htmlContent, \"no-images-post\")\n\t\trequire.NoError(t, err)\n\t\t\n\t\tassert.Equal(t, 0, result.Success)\n\t\tassert.Equal(t, 0, result.Failed)\n\t\tassert.Equal(t, 0, len(result.Images))\n\t\tassert.Equal(t, htmlContent, result.UpdatedHTML)\n\t})\n\t\n\tt.Run(\"EmptyHTML\", func(t *testing.T) {\n\t\temptyHTML := \"\"\n\t\tctx := context.Background()\n\t\t\n\t\tresult, err := downloader.DownloadImages(ctx, emptyHTML, \"empty-post\")\n\t\trequire.NoError(t, err)\n\t\t\n\t\tassert.Equal(t, 0, result.Success)\n\t\tassert.Equal(t, 0, result.Failed)\n\t\tassert.Equal(t, 0, len(result.Images))\n\t})\n}\n\n// TestDownloadSingleImage tests individual image downloading\nfunc TestDownloadSingleImage(t *testing.T) {\n\t// Create test server\n\tserver := createTestImageServer()\n\tdefer server.Close()\n\t\n\t// Create temporary directory\n\ttempDir, err := os.MkdirTemp(\"\", \"single-image-test-*\")\n\trequire.NoError(t, err)\n\tdefer os.RemoveAll(tempDir)\n\t\n\tdownloader := NewImageDownloader(nil, tempDir, \"images\", ImageQualityHigh)\n\tctx := context.Background()\n\t\n\tt.Run(\"SuccessfulDownload\", func(t *testing.T) {\n\t\timageURL := server.URL + \"/success.png\"\n\t\timageInfo := downloader.downloadSingleImage(ctx, imageURL, tempDir)\n\t\t\n\t\tassert.True(t, imageInfo.Success)\n\t\tassert.NoError(t, imageInfo.Error)\n\t\tassert.Equal(t, imageURL, imageInfo.OriginalURL)\n\t\tassert.NotEmpty(t, imageInfo.LocalPath)\n\t\t\n\t\t// Check file exists\n\t\t_, err := os.Stat(imageInfo.LocalPath)\n\t\tassert.NoError(t, err)\n\t\t\n\t\t// Check file content\n\t\tdata, err := os.ReadFile(imageInfo.LocalPath)\n\t\tassert.NoError(t, err)\n\t\tassert.Equal(t, testImageData, data)\n\t})\n\t\n\tt.Run(\"NotFound\", func(t *testing.T) {\n\t\timageURL := server.URL + \"/not-found.png\"\n\t\timageInfo := downloader.downloadSingleImage(ctx, imageURL, 
tempDir)\n\t\t\n\t\tassert.False(t, imageInfo.Success)\n\t\tassert.Error(t, imageInfo.Error)\n\t\tassert.Equal(t, imageURL, imageInfo.OriginalURL)\n\t})\n\t\n\tt.Run(\"ServerError\", func(t *testing.T) {\n\t\timageURL := server.URL + \"/server-error.png\"\n\t\timageInfo := downloader.downloadSingleImage(ctx, imageURL, tempDir)\n\t\t\n\t\tassert.False(t, imageInfo.Success)\n\t\tassert.Error(t, imageInfo.Error)\n\t})\n}\n\n// TestUpdateHTMLWithLocalPaths tests HTML content updating\nfunc TestUpdateHTMLWithLocalPaths(t *testing.T) {\n\tdownloader := NewImageDownloader(nil, \"/output\", \"images\", ImageQualityHigh)\n\t\n\toriginalHTML := `<img src=\"https://example.com/image1.jpg\" alt=\"Image 1\">\n<img src=\"https://example.com/image2.png\" alt=\"Image 2\">\n<img src=\"https://example.com/image1.jpg\" alt=\"Same image again\">`\n\t\n\turlToLocalPath := map[string]string{\n\t\t\"https://example.com/image1.jpg\": filepath.Join(\"/output\", \"images\", \"post\", \"image1.jpg\"),\n\t\t\"https://example.com/image2.png\": filepath.Join(\"/output\", \"images\", \"post\", \"image2.png\"),\n\t}\n\t\n\tupdatedHTML := downloader.updateHTMLWithLocalPaths(originalHTML, urlToLocalPath)\n\t\n\t// Check that URLs were replaced\n\tassert.Contains(t, updatedHTML, `src=\"images/post/image1.jpg\"`)\n\tassert.Contains(t, updatedHTML, `src=\"images/post/image2.png\"`)\n\tassert.NotContains(t, updatedHTML, \"https://example.com/\")\n\t\n\t// Check that duplicate URLs were replaced\n\tassert.Equal(t, 2, strings.Count(updatedHTML, \"images/post/image1.jpg\"))\n}\n\n// Benchmark tests\nfunc BenchmarkExtractURLFromSrcset(b *testing.B) {\n\tdownloader := NewImageDownloader(nil, \"/tmp\", \"images\", ImageQualityHigh)\n\tsrcset := \"img-424.jpg 424w, img-848.jpg 848w, img-1272.jpg 1272w, img-1456.jpg 1456w\"\n\t\n\tb.ResetTimer()\n\tfor i := 0; i < b.N; i++ {\n\t\tdownloader.extractURLFromSrcset(srcset, 1456)\n\t}\n}\n\nfunc BenchmarkGenerateSafeFilename(b *testing.B) {\n\tdownloader := 
NewImageDownloader(nil, \"/tmp\", \"images\", ImageQualityHigh)\n\turl := \"https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd83a175f-d0a1-450a-931f-adf68630630e_5634x2864.jpeg\"\n\t\n\tb.ResetTimer()\n\tfor i := 0; i < b.N; i++ {\n\t\tdownloader.generateSafeFilename(url)\n\t}\n}\n\n// TestWithRealSubstackHTML tests image extraction from actual Substack HTML files\nfunc TestWithRealSubstackHTML(t *testing.T) {\n\t// Skip test if scraped directory doesn't exist\n\tscrapedDir := \"../scraped/computerenhance\"\n\tif _, err := os.Stat(scrapedDir); os.IsNotExist(err) {\n\t\tt.Skip(\"Scraped directory not found, skipping real HTML test\")\n\t}\n\t\n\t// Find some sample HTML files\n\tfiles, err := os.ReadDir(scrapedDir)\n\trequire.NoError(t, err)\n\t\n\tvar htmlFiles []string\n\tfor _, file := range files {\n\t\tif strings.HasSuffix(file.Name(), \".html\") && len(htmlFiles) < 3 {\n\t\t\thtmlFiles = append(htmlFiles, filepath.Join(scrapedDir, file.Name()))\n\t\t}\n\t}\n\t\n\tif len(htmlFiles) == 0 {\n\t\tt.Skip(\"No HTML files found in scraped directory\")\n\t}\n\t\n\t// Create temporary directory for testing\n\ttempDir, err := os.MkdirTemp(\"\", \"real-substack-test-*\")\n\trequire.NoError(t, err)\n\tdefer os.RemoveAll(tempDir)\n\t\n\tdownloader := NewImageDownloader(nil, tempDir, \"images\", ImageQualityHigh)\n\t\n\tfor _, htmlFile := range htmlFiles {\n\t\tt.Run(filepath.Base(htmlFile), func(t *testing.T) {\n\t\t\t// Read the HTML file\n\t\t\thtmlContent, err := os.ReadFile(htmlFile)\n\t\t\trequire.NoError(t, err)\n\t\t\t\n\t\t\t// Extract image URLs from the real HTML\n\t\t\tdoc, err := goquery.NewDocumentFromReader(strings.NewReader(string(htmlContent)))\n\t\t\trequire.NoError(t, err)\n\t\t\t\n\t\t\timageURLs, err := downloader.extractImageURLs(doc)\n\t\t\trequire.NoError(t, err)\n\t\t\t\n\t\t\tt.Logf(\"Found %d image URLs in %s\", len(imageURLs), 
filepath.Base(htmlFile))\n\t\t\t\n\t\t\t// Verify we can parse the image URLs and generate filenames\n\t\t\tfor i, imageURL := range imageURLs {\n\t\t\t\tif i >= 5 { // Limit to first 5 images for performance\n\t\t\t\t\tbreak\n\t\t\t\t}\n\t\t\t\t\n\t\t\t\tt.Logf(\"Image URL %d: %s\", i+1, imageURL)\n\t\t\t\t\n\t\t\t\t// Test filename generation\n\t\t\t\tfilename, err := downloader.generateSafeFilename(imageURL)\n\t\t\t\tassert.NoError(t, err)\n\t\t\t\tassert.NotEmpty(t, filename)\n\t\t\t\tassert.False(t, strings.Contains(filename, \"<\"), \"Filename should not contain invalid characters\")\n\t\t\t\tassert.False(t, strings.Contains(filename, \">\"), \"Filename should not contain invalid characters\")\n\t\t\t\t\n\t\t\t\t// Test dimension extraction\n\t\t\t\twidth, height := downloader.extractDimensionsFromURL(imageURL)\n\t\t\t\tt.Logf(\"  Dimensions: %dx%d\", width, height)\n\t\t\t\t\n\t\t\t\t// Test URL parsing\n\t\t\t\t_, err = url.Parse(imageURL)\n\t\t\t\tassert.NoError(t, err, \"Image URL should be valid\")\n\t\t\t}\n\t\t\t\n\t\t\t// Test HTML update functionality (without actually downloading)\n\t\t\tif len(imageURLs) > 0 {\n\t\t\t\t// Create a mock mapping for URL replacement\n\t\t\t\turlToLocalPath := make(map[string]string)\n\t\t\t\tfor i, imageURL := range imageURLs {\n\t\t\t\t\tif i >= 3 { // Limit for performance\n\t\t\t\t\t\tbreak\n\t\t\t\t\t}\n\t\t\t\t\tfilename, _ := downloader.generateSafeFilename(imageURL)\n\t\t\t\t\tlocalPath := filepath.Join(tempDir, \"images\", \"test-post\", filename)\n\t\t\t\t\turlToLocalPath[imageURL] = localPath\n\t\t\t\t}\n\t\t\t\t\n\t\t\t\tupdatedHTML := downloader.updateHTMLWithLocalPaths(string(htmlContent), urlToLocalPath)\n\t\t\t\tassert.NotEqual(t, string(htmlContent), updatedHTML, \"HTML should be updated\")\n\t\t\t\t\n\t\t\t\t// Verify some URLs were replaced\n\t\t\t\tfor originalURL := range urlToLocalPath {\n\t\t\t\t\tassert.NotContains(t, updatedHTML, originalURL, \"Original URL should be 
replaced\")\n\t\t\t\t}\n\t\t\t}\n\t\t})\n\t}\n}\n\n// TestURLReplacementIssue tests that all image URLs are properly replaced in HTML\nfunc TestURLReplacementIssue(t *testing.T) {\n\t// Create test server\n\tserver := createTestImageServer()\n\tdefer server.Close()\n\t\n\t// Create temporary directory\n\ttempDir, err := os.MkdirTemp(\"\", \"url-replacement-test-*\")\n\trequire.NoError(t, err)\n\tdefer os.RemoveAll(tempDir)\n\t\n\t// Create downloader\n\tdownloader := NewImageDownloader(nil, tempDir, \"images\", ImageQualityHigh)\n\t\n\t// Create HTML with mismatched URLs between src and data-attrs\n\t// Use server URLs so downloads will succeed\n\thtmlContent := fmt.Sprintf(`<div class=\"captioned-image-container\">\n  <figure>\n    <a class=\"image-link\" href=\"%s/fullsize.jpeg\">\n      <div class=\"image2-inset\">\n        <picture>\n          <img src=\"%s/w_1456.jpeg\" \n               srcset=\"%s/w_424.jpeg 424w, %s/w_848.jpeg 848w, %s/w_1456.jpeg 1456w\"\n               data-attrs='{\"src\":\"%s/original-high-quality.jpeg\",\"width\":1456,\"height\":819}'\n               alt=\"Test image\" width=\"1456\" height=\"819\">\n        </picture>\n      </div>\n    </a>\n  </figure>\n</div>\n\n<img src=\"%s/simple-src.jpg\" \n     data-attrs='{\"src\":\"%s/data-attrs-src.jpg\",\"width\":800,\"height\":600}' \n     alt=\"Simple image\">`, \n\t\tserver.URL, server.URL, server.URL, server.URL, server.URL, server.URL, server.URL, server.URL)\n\t\n\tt.Logf(\"Original HTML:\\n%s\", htmlContent)\n\t\n\t// Use the full DownloadImages method which should use the new logic\n\tctx := context.Background()\n\tresult, err := downloader.DownloadImages(ctx, htmlContent, \"test-post\")\n\trequire.NoError(t, err)\n\t\n\tt.Logf(\"Download results: Success=%d, Failed=%d\", result.Success, result.Failed)\n\tt.Logf(\"Updated HTML:\\n%s\", result.UpdatedHTML)\n\t\n\t// Verify that ALL URLs were replaced, not just the ones from data-attrs\n\tproblemURLs := 
[]string{\n\t\tfmt.Sprintf(\"%s/w_1456.jpeg\", server.URL),        // src attribute\n\t\tfmt.Sprintf(\"%s/simple-src.jpg\", server.URL),     // simple src\n\t\tfmt.Sprintf(\"%s/w_424.jpeg\", server.URL),         // srcset URLs\n\t\tfmt.Sprintf(\"%s/w_848.jpeg\", server.URL),\n\t}\n\t\n\tfor _, url := range problemURLs {\n\t\tif strings.Contains(result.UpdatedHTML, url) {\n\t\t\tt.Errorf(\"URL should be replaced but still present: %s\", url)\n\t\t}\n\t}\n\t\n\t// Verify some images were actually downloaded\n\tassert.Greater(t, result.Success, 0, \"Should have successful downloads\")\n\t\n\t// Verify local paths are present\n\tassert.Contains(t, result.UpdatedHTML, \"images/test-post/\", \"Should contain local image paths\")\n}\n\n// TestCommaSeparatedURLRegressionBug tests the specific bug reported in v0.6.0\n// where multiple URLs for the same image (in srcset, data-attrs, etc.) would\n// create comma-separated URL strings in the output instead of clean local paths.\n// This is a regression test to ensure this specific pattern doesn't break again.\nfunc TestCommaSeparatedURLRegressionBug(t *testing.T) {\n\t// Create a test server that serves image content\n\tserver := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {\n\t\t// Return a small PNG image for any request\n\t\tw.Header().Set(\"Content-Type\", \"image/png\")\n\t\tw.WriteHeader(http.StatusOK)\n\t\t// Write minimal PNG data\n\t\tpngData := []byte{0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A, 0x00, 0x00, 0x00, 0x0D, 0x49, 0x48, 0x44, 0x52}\n\t\tw.Write(pngData)\n\t}))\n\tdefer server.Close()\n\n\t// Create temporary directory\n\ttempDir := t.TempDir()\n\t\n\tfetcher := NewFetcher()\n\tdownloader := NewImageDownloader(fetcher, tempDir, \"images\", ImageQualityHigh)\n\t\n\t// Create HTML that reproduces the exact bug pattern from the bug report\n\t// This simulates real Substack HTML where the same image appears with multiple URL variations\n\t// but they all represent the 
same actual image file and should map to the same local path\n\tbaseImageID := \"4697c31d-2502-48d2-b6c1-11e5ea97536f_2560x2174\"\n\t\n\t// These represent different CDN transformations of the same base image\n\t// All should download the same file and map to the same local path\n\toriginalURL := fmt.Sprintf(\"%s/substack-post-media.s3.amazonaws.com/public/images/%s.jpeg\", server.URL, baseImageID)\n\tw1456URL := fmt.Sprintf(\"%s/substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s.jpeg\", server.URL, baseImageID)\n\tw848URL := fmt.Sprintf(\"%s/substackcdn.com/image/fetch/w_848,c_limit,f_auto,q_auto:good/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s.jpeg\", server.URL, baseImageID)\n\tw424URL := fmt.Sprintf(\"%s/substackcdn.com/image/fetch/w_424,c_limit,f_auto,q_auto:good/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s.jpeg\", server.URL, baseImageID)\n\twebpURL := fmt.Sprintf(\"%s/substackcdn.com/image/fetch/f_webp,w_1456,c_limit,q_auto:good/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s.jpeg\", server.URL, baseImageID)\n\t\n\t// Create HTML that matches the structure from the bug report\n\t// All these URLs should map to the same local file path\n\thtmlContent := fmt.Sprintf(`<div class=\"captioned-image-container\">\n  <figure>\n    <a class=\"image-link image2 is-viewable-img\" target=\"_blank\" href=\"%s\" data-component-name=\"Image2ToDOM\">\n      <div class=\"image2-inset\">\n        <picture>\n          <source type=\"image/webp\" srcset=\"%s 424w, %s 848w, %s 1272w, %s 1456w\" sizes=\"100vw\">\n          <img src=\"%s\" \n               srcset=\"%s 424w, %s 848w, %s 1272w, %s 1456w\" \n               data-attrs='{\"src\":\"%s\",\"srcNoWatermark\":null,\"fullscreen\":false,\"imageSize\":\"large\",\"height\":1236,\"width\":1456}'\n               class=\"sizing-large\" 
alt=\"Test Image\" title=\"Test Image\" \n               sizes=\"100vw\" fetchpriority=\"high\">\n        </picture>\n      </div>\n    </a>\n  </figure>\n</div>`, \n\t\toriginalURL,  // href\n\t\tw424URL, w848URL, w1456URL, webpURL,  // webp srcset\n\t\tw1456URL,     // img src  \n\t\tw424URL, w848URL, w1456URL, webpURL,  // img srcset\n\t\toriginalURL)  // data-attrs src\n\t\n\tt.Logf(\"Original HTML with potentially problematic URLs:\\n%s\", htmlContent)\n\t\n\t// Download images using the full pipeline\n\tctx := context.Background()\n\tresult, err := downloader.DownloadImages(ctx, htmlContent, \"good-ideas\")\n\trequire.NoError(t, err)\n\t\n\tt.Logf(\"Download results: Success=%d, Failed=%d\", result.Success, result.Failed)\n\tt.Logf(\"Updated HTML:\\n%s\", result.UpdatedHTML)\n\t\n\t// THE KEY REGRESSION TEST: Verify no comma-separated URL strings appear\n\t// This is the exact bug pattern that was reported\n\tcommaSeparatedPatterns := []string{\n\t\t\"images/good-ideas/\" + baseImageID + \".jpeg,images/good-ideas/\",  // Should not have comma-separated paths\n\t\t\",f_webp,images/good-ideas/\",  // Should not have CDN parameters mixed with local paths\n\t\t\"images/good-ideas/\" + baseImageID + \".jpeg,images/good-ideas/\" + baseImageID + \".jpeg\",  // Repeated paths\n\t}\n\t\n\tfor _, pattern := range commaSeparatedPatterns {\n\t\tif strings.Contains(result.UpdatedHTML, pattern) {\n\t\t\tt.Errorf(\"REGRESSION BUG DETECTED: Found comma-separated URL pattern in output: %s\", pattern)\n\t\t\tt.Errorf(\"This indicates the string replacement bug has returned\")\n\t\t}\n\t}\n\t\n\t// Verify that all original URLs have been replaced with local paths\n\toriginalURLs := []string{originalURL, w1456URL, w848URL, w424URL, webpURL}\n\tfor _, url := range originalURLs {\n\t\tif strings.Contains(result.UpdatedHTML, url) {\n\t\t\tt.Errorf(\"Original URL should be replaced but still present: %s\", url)\n\t\t}\n\t}\n\t\n\t// Verify clean local paths are 
present\n\texpectedLocalPath := \"images/good-ideas/\" + baseImageID + \".jpeg\"\n\tif !strings.Contains(result.UpdatedHTML, expectedLocalPath) {\n\t\tt.Errorf(\"Expected local path not found: %s\", expectedLocalPath)\n\t}\n\t\n\t// Verify srcset entries are clean (no commas except between entries)\n\tif strings.Contains(result.UpdatedHTML, `srcset=\"`) {\n\t\t// Extract srcset content\n\t\tsrcsetStart := strings.Index(result.UpdatedHTML, `srcset=\"`) + 8\n\t\tsrcsetEnd := strings.Index(result.UpdatedHTML[srcsetStart:], `\"`)\n\t\tsrcsetContent := result.UpdatedHTML[srcsetStart : srcsetStart+srcsetEnd]\n\t\t\n\t\tt.Logf(\"Extracted srcset: %s\", srcsetContent)\n\t\t\n\t\t// Verify srcset has proper format: \"path width, path width, ...\" or just \"path\"\n\t\t// Should NOT have comma-separated paths without proper structure\n\t\tentries := strings.Split(srcsetContent, \",\")\n\t\tfor i, entry := range entries {\n\t\t\tentry = strings.TrimSpace(entry)\n\t\t\tif entry == \"\" {\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\t\n\t\t\tparts := strings.Fields(entry)\n\t\t\tif len(parts) == 0 {\n\t\t\t\tt.Errorf(\"Srcset entry %d is empty after trimming: %s\", i, entry)\n\t\t\t\tcontinue\n\t\t\t}\n\t\t\t\n\t\t\t// First part should be a clean local path\n\t\t\tif !strings.HasPrefix(parts[0], \"images/good-ideas/\") {\n\t\t\t\tt.Errorf(\"Srcset entry %d doesn't have proper local path: %s\", i, parts[0])\n\t\t\t}\n\t\t\t\n\t\t\t// If there's a second part, it should be a width descriptor\n\t\t\tif len(parts) >= 2 {\n\t\t\t\tif !strings.HasSuffix(parts[1], \"w\") {\n\t\t\t\t\tt.Errorf(\"Srcset entry %d has invalid width descriptor: %s\", i, parts[1])\n\t\t\t\t}\n\t\t\t}\n\t\t\t\n\t\t\t// Should not have more than 2 parts\n\t\t\tif len(parts) > 2 {\n\t\t\t\tt.Errorf(\"Srcset entry %d has too many parts (should be 'path' or 'path width'): %s\", i, entry)\n\t\t\t}\n\t\t}\n\t}\n\t\n\t// Verify at least one image was successfully downloaded\n\tassert.Greater(t, result.Success, 0, \"Should 
have successful downloads\")\n\tassert.Equal(t, 0, result.Failed, \"Should have no failed downloads\")\n}\n\n// TestExtractImageElements tests the new image element extraction with all URLs\nfunc TestExtractImageElements(t *testing.T) {\n\tdownloader := NewImageDownloader(nil, \"/tmp\", \"images\", ImageQualityHigh)\n\t\n\thtmlContent := `\n\t<!-- Image with all attributes -->\n\t<img src=\"https://example.com/src.jpg\" \n\t     srcset=\"https://example.com/small.jpg 400w, https://example.com/large.jpg 800w\"\n\t     data-attrs='{\"src\":\"https://example.com/data.jpg\",\"width\":800,\"height\":600}' \n\t     alt=\"Complete image\">\n\t\n\t<!-- Image with only src -->\n\t<img src=\"https://example.com/simple.jpg\" alt=\"Simple image\">\n\t\n\t<!-- Image with only data-attrs -->\n\t<img data-attrs='{\"src\":\"https://example.com/data-only.jpg\",\"width\":400,\"height\":300}' alt=\"Data only\">\n\t`\n\t\n\tdoc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent))\n\trequire.NoError(t, err)\n\t\n\timageElements, err := downloader.extractImageElements(doc)\n\trequire.NoError(t, err)\n\t\n\t// Should find 3 image elements\n\tassert.Len(t, imageElements, 3)\n\t\n\t// First image should have all URLs\n\telem1 := imageElements[0]\n\tassert.Equal(t, \"https://example.com/data.jpg\", elem1.BestURL) // data-attrs has priority\n\texpectedURLs1 := []string{\n\t\t\"https://example.com/data.jpg\",     // from data-attrs\n\t\t\"https://example.com/small.jpg\",    // from srcset\n\t\t\"https://example.com/large.jpg\",    // from srcset\n\t\t\"https://example.com/src.jpg\",      // from src\n\t}\n\tassert.ElementsMatch(t, expectedURLs1, elem1.AllURLs)\n\t\n\t// Second image should have only src URL\n\telem2 := imageElements[1]\n\tassert.Equal(t, \"https://example.com/simple.jpg\", elem2.BestURL)\n\tassert.Equal(t, []string{\"https://example.com/simple.jpg\"}, elem2.AllURLs)\n\t\n\t// Third image should have only data-attrs URL\n\telem3 := 
imageElements[2]\n\tassert.Equal(t, \"https://example.com/data-only.jpg\", elem3.BestURL)\n\tassert.Equal(t, []string{\"https://example.com/data-only.jpg\"}, elem3.AllURLs)\n}\n\n// TestExtractAllURLsFromSrcset tests srcset URL extraction\nfunc TestExtractAllURLsFromSrcset(t *testing.T) {\n\tdownloader := NewImageDownloader(nil, \"/tmp\", \"images\", ImageQualityHigh)\n\t\n\ttests := []struct {\n\t\tname     string\n\t\tsrcset   string\n\t\texpected []string\n\t}{\n\t\t{\n\t\t\tname:   \"MultipleSizes\",\n\t\t\tsrcset: \"https://example.com/img-400.jpg 400w, https://example.com/img-800.jpg 800w, https://example.com/img-1200.jpg 1200w\",\n\t\t\texpected: []string{\"https://example.com/img-400.jpg\", \"https://example.com/img-800.jpg\", \"https://example.com/img-1200.jpg\"},\n\t\t},\n\t\t{\n\t\t\tname:   \"SingleEntry\",\n\t\t\tsrcset: \"https://example.com/single.jpg 1024w\",\n\t\t\texpected: []string{\"https://example.com/single.jpg\"},\n\t\t},\n\t\t{\n\t\t\tname:   \"ExtraSpaces\",\n\t\t\tsrcset: \"  https://example.com/spaced1.jpg 400w  ,   https://example.com/spaced2.jpg 800w  \",\n\t\t\texpected: []string{\"https://example.com/spaced1.jpg\", \"https://example.com/spaced2.jpg\"},\n\t\t},\n\t\t{\n\t\t\tname:     \"Empty\",\n\t\t\tsrcset:   \"\",\n\t\t\texpected: []string{},\n\t\t},\n\t}\n\t\n\tfor _, test := range tests {\n\t\tt.Run(test.name, func(t *testing.T) {\n\t\t\turls := downloader.extractAllURLsFromSrcset(test.srcset)\n\t\t\tassert.Equal(t, test.expected, urls)\n\t\t})\n\t}\n}\n\n// TestImageURLParsing tests URL parsing with various Substack image patterns\nfunc TestImageURLParsing(t *testing.T) {\n\tdownloader := NewImageDownloader(nil, \"/tmp\", \"images\", ImageQualityHigh)\n\t\n\t// Real Substack URL patterns from the analysis\n\ttestURLs := 
[]string{\n\t\t\"https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F43e258db-6164-4e47-835f-d11f10847d9d_5616x3744.jpeg\",\n\t\t\"https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd83a175f-d0a1-450a-931f-adf68630630e_5634x2864.jpeg\",\n\t\t\"https://substack-post-media.s3.amazonaws.com/public/images/d6ad0fd8-3659-4626-b5db-f81cbcd4c779_779x305.png\",\n\t}\n\t\n\tfor i, testURL := range testURLs {\n\t\tt.Run(fmt.Sprintf(\"URL_%d\", i+1), func(t *testing.T) {\n\t\t\t// Test filename generation\n\t\t\tfilename, err := downloader.generateSafeFilename(testURL)\n\t\t\tassert.NoError(t, err)\n\t\t\tassert.NotEmpty(t, filename)\n\t\t\t\n\t\t\t// Test dimension extraction\n\t\t\twidth, height := downloader.extractDimensionsFromURL(testURL)\n\t\t\tt.Logf(\"URL: %s\", testURL)\n\t\t\tt.Logf(\"Filename: %s\", filename)\n\t\t\tt.Logf(\"Dimensions: %dx%d\", width, height)\n\t\t\t\n\t\t\t// URLs should be valid\n\t\t\t_, err = url.Parse(testURL)\n\t\t\tassert.NoError(t, err)\n\t\t})\n\t}\n}\n\n// TestImageURLHelperFunctions tests the helper functions added for the bug fix\nfunc TestImageURLHelperFunctions(t *testing.T) {\n\tdownloader := NewImageDownloader(nil, \"/tmp\", \"images\", ImageQualityHigh)\n\t\n\tt.Run(\"IsImageURL\", func(t *testing.T) {\n\t\ttests := []struct {\n\t\t\tname     string\n\t\t\turl      string\n\t\t\texpected bool\n\t\t}{\n\t\t\t{\"SubstackCDN\", \"https://substackcdn.com/image/fetch/w_1456/image.jpg\", true},\n\t\t\t{\"SubstackS3\", \"https://substack-post-media.s3.amazonaws.com/public/images/test.png\", true},\n\t\t\t{\"Bucketeer\", \"https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/test.jpeg\", true},\n\t\t\t{\"NotImage\", \"https://example.com/page.html\", 
false},\n\t\t\t{\"RegularImage\", \"https://example.com/image.jpg\", false}, // Not Substack\n\t\t}\n\t\t\n\t\tfor _, test := range tests {\n\t\t\tt.Run(test.name, func(t *testing.T) {\n\t\t\t\tresult := downloader.isImageURL(test.url)\n\t\t\t\tassert.Equal(t, test.expected, result)\n\t\t\t})\n\t\t}\n\t})\n\t\n\tt.Run(\"IsSameImage\", func(t *testing.T) {\n\t\tbaseUUID := \"b0ebde87-580d-4dce-bb73-573edf9229ff\"\n\t\ttests := []struct {\n\t\t\tname     string\n\t\t\turl1     string\n\t\t\turl2     string\n\t\t\texpected bool\n\t\t}{\n\t\t\t{\n\t\t\t\t\"SameUUID\",\n\t\t\t\tfmt.Sprintf(\"https://substackcdn.com/image/fetch/w_1456/%s_1024x1536.heic\", baseUUID),\n\t\t\t\tfmt.Sprintf(\"https://substack-post-media.s3.amazonaws.com/public/images/%s_1024x1536.heic\", baseUUID),\n\t\t\t\ttrue,\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"DifferentUUIDs\",\n\t\t\t\t\"https://substackcdn.com/image/fetch/w_1456/aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee_800x600.jpg\",\n\t\t\t\t\"https://substackcdn.com/image/fetch/w_848/ffffffff-gggg-hhhh-iiii-jjjjjjjjjjjj_800x600.jpg\",\n\t\t\t\tfalse,\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"NoUUIDs\",\n\t\t\t\t\"https://example.com/image1.jpg\",\n\t\t\t\t\"https://example.com/image2.jpg\",\n\t\t\t\tfalse,\n\t\t\t},\n\t\t}\n\t\t\n\t\tfor _, test := range tests {\n\t\t\tt.Run(test.name, func(t *testing.T) {\n\t\t\t\tresult := downloader.isSameImage(test.url1, test.url2)\n\t\t\t\tassert.Equal(t, test.expected, result)\n\t\t\t})\n\t\t}\n\t})\n\t\n\tt.Run(\"ExtractImageID\", func(t *testing.T) {\n\t\ttests := []struct {\n\t\t\tname     string\n\t\t\turl      string\n\t\t\texpected 
string\n\t\t}{\n\t\t\t{\n\t\t\t\t\"UUID\",\n\t\t\t\t\"https://substack-post-media.s3.amazonaws.com/public/images/b0ebde87-580d-4dce-bb73-573edf9229ff_1024x1536.heic\",\n\t\t\t\t\"b0ebde87-580d-4dce-bb73-573edf9229ff\",\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"FilenamePattern\",\n\t\t\t\t\"https://example.com/path/to/myimage.jpg\",\n\t\t\t\t\"myimage\",\n\t\t\t},\n\t\t\t{\n\t\t\t\t\"NoPattern\",\n\t\t\t\t\"https://example.com/path/\",\n\t\t\t\t\"\",\n\t\t\t},\n\t\t}\n\t\t\n\t\tfor _, test := range tests {\n\t\t\tt.Run(test.name, func(t *testing.T) {\n\t\t\t\tresult := extractImageID(test.url)\n\t\t\t\tassert.Equal(t, test.expected, result)\n\t\t\t})\n\t\t}\n\t})\n}\n\n// TestExtractImageElementsWithAnchorAndSourceTags tests the bug fix for collecting URLs from <a> and <source> tags\nfunc TestExtractImageElementsWithAnchorAndSourceTags(t *testing.T) {\n\tdownloader := NewImageDownloader(nil, \"/tmp\", \"images\", ImageQualityHigh)\n\t\n\t// This HTML pattern reproduces the exact structure from real Substack posts\n\t// where the same image appears in multiple places with different URLs\n\tbaseUUID := \"f35ed9ff-eb9e-4106-a443-45c963ae74cd\"\n\toriginalURL := fmt.Sprintf(\"https://substack-post-media.s3.amazonaws.com/public/images/%s_1208x793.png\", baseUUID)\n\threfURL := fmt.Sprintf(\"https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s_1208x793.png\", baseUUID)\n\tw424URL := fmt.Sprintf(\"https://substackcdn.com/image/fetch/w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s_1208x793.png\", baseUUID)\n\tw848URL := fmt.Sprintf(\"https://substackcdn.com/image/fetch/w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s_1208x793.png\", baseUUID)\n\tw1456URL := 
fmt.Sprintf(\"https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2F%s_1208x793.png\", baseUUID)\n\t\n\thtmlContent := fmt.Sprintf(`\n\t<div class=\"captioned-image-container\">\n\t  <figure>\n\t    <a class=\"image-link image2 is-viewable-img\" target=\"_blank\" href=\"%s\" data-component-name=\"Image2ToDOM\">\n\t      <div class=\"image2-inset\">\n\t        <picture>\n\t          <source type=\"image/webp\" srcset=\"%s 424w, %s 848w, %s 1456w\" sizes=\"100vw\"/>\n\t          <img src=\"%s\" \n\t               srcset=\"%s 424w, %s 848w, %s 1456w\" \n\t               data-attrs='{\"src\":\"%s\",\"width\":1208,\"height\":793,\"type\":\"image/png\"}'\n\t               class=\"sizing-normal\" alt=\"\" \n\t               sizes=\"100vw\" fetchpriority=\"high\"/>\n\t        </picture>\n\t      </div>\n\t    </a>\n\t  </figure>\n\t</div>`,\n\t\threfURL,                               // <a href>\n\t\tw424URL, w848URL, w1456URL,            // <source srcset>\n\t\toriginalURL,                           // <img src>\n\t\tw424URL, w848URL, w1456URL,            // <img srcset>\n\t\toriginalURL)                           // data-attrs src\n\t\n\tt.Logf(\"Test HTML:\\n%s\", htmlContent)\n\t\n\tdoc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlContent))\n\trequire.NoError(t, err)\n\t\n\timageElements, err := downloader.extractImageElements(doc)\n\trequire.NoError(t, err)\n\t\n\t// Should find exactly 1 image element (all URLs refer to the same image)\n\tassert.Len(t, imageElements, 1, \"Should find exactly one image element\")\n\t\n\telem := imageElements[0]\n\tt.Logf(\"BestURL: %s\", elem.BestURL)\n\tt.Logf(\"AllURLs: %v\", elem.AllURLs)\n\t\n\t// Best URL should be from data-attrs (highest priority)\n\tassert.Equal(t, originalURL, elem.BestURL)\n\t\n\t// All URLs should be collected (from img src, img srcset, source srcset, a href, and 
data-attrs)\n\texpectedURLs := []string{\n\t\toriginalURL,  // from data-attrs and img src\n\t\tw424URL,      // from srcsets\n\t\tw848URL,      // from srcsets\n\t\tw1456URL,     // from srcsets\n\t\threfURL,      // from <a href>\n\t}\n\t\n\t// Check that all expected URLs are present\n\tfor _, expectedURL := range expectedURLs {\n\t\tassert.Contains(t, elem.AllURLs, expectedURL, \"Should contain URL: %s\", expectedURL)\n\t}\n\t\n\t// Should not have duplicates\n\turlCounts := make(map[string]int)\n\tfor _, url := range elem.AllURLs {\n\t\turlCounts[url]++\n\t}\n\tfor url, count := range urlCounts {\n\t\tassert.Equal(t, 1, count, \"URL should appear exactly once: %s\", url)\n\t}\n}\n\n// TestHrefAndSourceURLReplacementRegression tests the specific bug where images were downloaded \n// but <a href> and <source srcset> URLs weren't replaced with local paths\nfunc TestHrefAndSourceURLReplacementRegression(t *testing.T) {\n\t// Create test server\n\tserver := createTestImageServer()\n\tdefer server.Close()\n\t\n\t// Create temporary directory\n\ttempDir, err := os.MkdirTemp(\"\", \"href-source-regression-test-*\")\n\trequire.NoError(t, err)\n\tdefer os.RemoveAll(tempDir)\n\t\n\t// Create downloader\n\tdownloader := NewImageDownloader(nil, tempDir, \"images\", ImageQualityHigh)\n\t\n\t// Create HTML that reproduces the exact bug:\n\t// - Images are downloaded successfully\n\t// - img src and srcset are replaced correctly\n\t// - BUT <a href> and <source srcset> still contain original URLs\n\t// Using Substack-style URLs so they're detected as image URLs\n\tbaseUUID := \"123e4567-e89b-12d3-a456-426614174000\"\n\timageURL := server.URL + \"/substack-post-media.s3.amazonaws.com/public/images/\" + baseUUID + \"_800x600.png\"\n\threfURL := server.URL + \"/substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F\" + baseUUID + \"_1200x900.png\"\n\tsrcsetURL1 := server.URL + 
\"/substackcdn.com/image/fetch/w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F\" + baseUUID + \"_800x600.png\"\n\tsrcsetURL2 := server.URL + \"/substackcdn.com/image/fetch/w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F\" + baseUUID + \"_800x600.png\"\n\t\n\thtmlContent := fmt.Sprintf(`\n\t<div class=\"captioned-image-container\">\n\t  <figure>\n\t    <a class=\"image-link image2 is-viewable-img\" target=\"_blank\" href=\"%s\">\n\t      <div class=\"image2-inset\">\n\t        <picture>\n\t          <source type=\"image/webp\" srcset=\"%s 424w, %s 848w\" sizes=\"100vw\"/>\n\t          <img src=\"%s\" \n\t               srcset=\"%s 424w, %s 848w\" \n\t               alt=\"Test image\" width=\"800\" height=\"600\"/>\n\t        </picture>\n\t      </div>\n\t    </a>\n\t  </figure>\n\t</div>`,\n\t\threfURL,                     // <a href> - THIS was not being replaced in the bug\n\t\tsrcsetURL1, srcsetURL2,      // <source srcset> - THIS was not being replaced in the bug\n\t\timageURL,                    // <img src> - this was working\n\t\tsrcsetURL1, srcsetURL2)      // <img srcset> - this was working\n\t\n\tt.Logf(\"Original HTML with problematic URLs:\\n%s\", htmlContent)\n\t\n\t// Download images using the full pipeline\n\tctx := context.Background()\n\tresult, err := downloader.DownloadImages(ctx, htmlContent, \"regression-test\")\n\trequire.NoError(t, err)\n\t\n\tt.Logf(\"Download results: Success=%d, Failed=%d\", result.Success, result.Failed)\n\tt.Logf(\"Updated HTML:\\n%s\", result.UpdatedHTML)\n\t\n\t// CRITICAL REGRESSION TEST: Verify ALL original URLs are replaced\n\toriginalURLs := []string{imageURL, hrefURL, srcsetURL1, srcsetURL2}\n\t\n\tfor _, originalURL := range originalURLs {\n\t\tassert.NotContains(t, result.UpdatedHTML, originalURL, \n\t\t\t\"REGRESSION BUG: Original URL should be replaced but 
still present: %s\", originalURL)\n\t}\n\t\n\t// Verify local paths are present  \n\tassert.Contains(t, result.UpdatedHTML, \"images/regression-test/\", \"Should contain local image directory path\")\n\t\n\t// Verify <a href> was replaced with local path\n\tassert.Regexp(t, `href=\"images/regression-test/[^\"]*\"`, result.UpdatedHTML, \"href should point to local path\")\n\t\n\t// Verify <source srcset> was replaced with local paths\n\tassert.Contains(t, result.UpdatedHTML, `<source type=\"image/webp\" srcset=\"images/regression-test/`, \n\t\t\"source srcset should contain local paths\")\n\t\n\t// Verify some images were successfully downloaded\n\tassert.Greater(t, result.Success, 0, \"Should have successful downloads\")\n\t\n\t// Verify image files exist on disk\n\timagesDir := filepath.Join(tempDir, \"images\", \"regression-test\")\n\tfiles, err := os.ReadDir(imagesDir)\n\tassert.NoError(t, err)\n\tassert.Greater(t, len(files), 0, \"Should have downloaded image files to disk\")\n}\n\n// TestComplexSubstackImageStructureRegression tests the full complex Substack image structure\n// that was reported in the original bug, ensuring all image references are properly replaced\nfunc TestComplexSubstackImageStructureRegression(t *testing.T) {\n\t// Create test server\n\tserver := createTestImageServer()\n\tdefer server.Close()\n\t\n\t// Create temporary directory  \n\ttempDir, err := os.MkdirTemp(\"\", \"complex-substack-regression-*\")\n\trequire.NoError(t, err)\n\tdefer os.RemoveAll(tempDir)\n\t\n\t// Create downloader\n\tdownloader := NewImageDownloader(nil, tempDir, \"images\", ImageQualityHigh)\n\t\n\t// This is the exact HTML structure from the bug report, with server URLs\n\thtmlContent := fmt.Sprintf(`<div class=\"captioned-image-container\"><figure><a class=\"image-link image2 is-viewable-img\" target=\"_blank\" 
href=\"%s/substackcdn.com/image/fetch/$s_!7a2j!,f_auto,q_auto:good,fl_progressive:steep/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2Fb0ebde87-580d-4dce-bb73-573edf9229ff_1024x1536.heic\" data-component-name=\"Image2ToDOM\"><div class=\"image2-inset\"><picture><source type=\"image/webp\" srcset=\"%s/substackcdn.com/image/fetch/$s_!7a2j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2Fb0ebde87-580d-4dce-bb73-573edf9229ff_1024x1536.heic 424w, %s/substackcdn.com/image/fetch/$s_!7a2j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2Fb0ebde87-580d-4dce-bb73-573edf9229ff_1024x1536.heic 848w, %s/substackcdn.com/image/fetch/$s_!7a2j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%%3A%%2F%%2Fsubstack-post-media.s3.amazonaws.com%%2Fpublic%%2Fimages%%2Fb0ebde87-580d-4dce-bb73-573edf9229ff_1024x1536.heic 1456w\" sizes=\"100vw\"/><img src=\"%s/substack-post-media.s3.amazonaws.com/public/images/b0ebde87-580d-4dce-bb73-573edf9229ff_1024x1536.heic\" width=\"1024\" height=\"1536\" data-attrs=\"{&#34;src&#34;:&#34;%s/substack-post-media.s3.amazonaws.com/public/images/b0ebde87-580d-4dce-bb73-573edf9229ff_1024x1536.heic&#34;,&#34;width&#34;:1024,&#34;height&#34;:1536}\" class=\"sizing-normal\" alt=\"\" srcset=\"%s/substack-post-media.s3.amazonaws.com/public/images/b0ebde87-580d-4dce-bb73-573edf9229ff_1024x1536.heic 424w\" sizes=\"100vw\" fetchpriority=\"high\"/></picture></div></a></figure></div>`,\n\t\tserver.URL, server.URL, server.URL, server.URL, server.URL, server.URL, server.URL)\n\t\n\tt.Logf(\"Complex Substack HTML structure:\\n%s\", htmlContent)\n\t\n\t// Process the HTML \n\tctx := context.Background()\n\tresult, err := downloader.DownloadImages(ctx, htmlContent, \"complex-test\")\n\trequire.NoError(t, err)\n\t\n\tt.Logf(\"Download results: Success=%d, Failed=%d\", 
result.Success, result.Failed)\n\tt.Logf(\"Updated HTML:\\n%s\", result.UpdatedHTML)\n\t\n\t// Verify NO original server URLs remain in the output\n\tassert.NotContains(t, result.UpdatedHTML, server.URL, \n\t\t\"REGRESSION BUG: Original server URLs should be completely replaced\")\n\t\n\t// Verify local paths are present\n\tassert.Contains(t, result.UpdatedHTML, \"images/complex-test/\", \"Should contain local image paths\")\n\t\n\t// Verify the href was replaced\n\tassert.Contains(t, result.UpdatedHTML, `href=\"images/complex-test/`, \"href should point to local path\")\n\t\n\t// Verify source srcset was replaced  \n\tassert.Contains(t, result.UpdatedHTML, `<source type=\"image/webp\" srcset=\"images/complex-test/`, \n\t\t\"source srcset should contain local paths\")\n\t\n\t// Verify img src was replaced\n\tassert.Contains(t, result.UpdatedHTML, `src=\"images/complex-test/`, \"img src should point to local path\")\n\t\n\t// Verify img srcset was replaced\n\tassert.Regexp(t, `srcset=\"images/complex-test/[^\"]+\\s+424w\"`, result.UpdatedHTML, \n\t\t\"img srcset should contain local paths with width descriptors\")\n\t\n\t// Verify data-attrs was updated (JSON can be reordered and HTML-encoded)\n\tassert.Regexp(t, `&#34;src&#34;:&#34;images/complex-test/[^&]*&#34;`, result.UpdatedHTML, \"data-attrs src should be updated\")\n\t\n\t// Verify at least one image was successfully downloaded\n\tassert.Greater(t, result.Success, 0, \"Should have successful downloads\")\n}"
  },
  {
    "path": "main.go",
    "content": "package main\n\nimport \"github.com/alexferrari88/sbstck-dl/cmd\"\n\nfunc main() {\n\tcmd.Execute()\n}\n"
  },
  {
    "path": "specs/archive-index-page.md",
    "content": "# Archive Index Page Feature Specification\n\n## 1. Overview\n\n### 1.1 Purpose\nAdd support for generating organized index pages that link all downloaded posts with their metadata. This feature enables users to create beautiful, browseable archives of their downloaded Substack content with comprehensive post information and navigation.\n\n### 1.2 Success Criteria\n- Users can generate archive index pages using command-line flags\n- Archive pages are created in matching format (HTML/Markdown/Text) to downloaded posts\n- Index pages display comprehensive post metadata including titles, dates, descriptions, and cover images\n- Posts are automatically sorted by publication date (newest first)\n- Archive pages use relative file paths for maximum portability\n- Integration works seamlessly with both single post and bulk downloads\n- Archive generation includes comprehensive error handling and validation\n\n### 1.3 Scope Boundaries\n**In Scope:**\n- Generation of index pages in HTML, Markdown, and Text formats\n- Extraction and display of post metadata (title, dates, description, cover image)\n- Automatic sorting by publication date with fallback sorting\n- Relative path generation for downloaded post links\n- Integration with existing CLI infrastructure and output patterns\n- Support for both single post downloads and bulk archive downloads\n\n**Out of Scope:**\n- Archive page theming or advanced styling customization\n- Search functionality within archive pages\n- Archive page regeneration from existing files (without re-downloading)\n- Multiple archive page formats in a single run\n- Archive page pagination for very large collections\n\n## 2. 
Technical Architecture\n\n### 2.1 Architecture Alignment\nThis feature follows the established sbstck-dl patterns:\n- **Modular Design**: New `Archive` and `ArchiveEntry` structs in existing extractor.go\n- **Consistent Interface**: Integration with existing CLI flags and format selection\n- **Content Generation**: Similar approach to post content generation with format-specific methods\n- **File Operations**: Consistent with existing file writing patterns and directory structures\n\n### 2.2 Core Components\n\n#### 2.2.1 Archive Data Structures\n```go\ntype ArchiveEntry struct {\n    Post         Post\n    FilePath     string\n    DownloadTime time.Time\n}\n\ntype Archive struct {\n    Entries []ArchiveEntry\n}\n```\n\n#### 2.2.2 Archive Generation Interface\n```go\nfunc NewArchive() *Archive\nfunc (a *Archive) AddEntry(post Post, filePath string, downloadTime time.Time)\nfunc (a *Archive) sortEntries()\nfunc (a *Archive) GenerateHTML(outputDir string) error\nfunc (a *Archive) GenerateMarkdown(outputDir string) error\nfunc (a *Archive) GenerateText(outputDir string) error\n```\n\n### 2.3 Post Metadata Enhancement\n\n#### 2.3.1 Enhanced Post Structure\nExtended the existing `Post` struct with new metadata fields:\n```go\ntype Post struct {\n    // ... existing fields\n    Subtitle string `json:\"subtitle,omitempty\"` // NEW: from .subtitle CSS selector\n    // CoverImage string - enhanced extraction from og:image meta tag\n}\n```\n\n#### 2.3.2 Metadata Extraction Strategy\n- **Subtitle Extraction**: Parse `.subtitle` CSS selector from post HTML\n- **Cover Image Enhancement**: Extract from `og:image` meta property when CoverImage field is empty\n- **Graceful Fallbacks**: Use Description field when Subtitle is not available\n\n## 3. 
Command Line Interface\n\n### 3.1 New CLI Flag\n\n```go\n// New flag added to cmd/download.go\nvar createArchive bool // --create-archive\n```\n\n### 3.2 Flag Definition\n\n| Flag | Short | Default | Description |\n|------|-------|---------|-------------|\n| `--create-archive` | | `false` | Create an archive index page linking all downloaded posts |\n\n### 3.3 Usage Examples\n\n```bash\n# Download entire archive and create index page\nsbstck-dl download --url https://example.substack.com --create-archive\n\n# Create archive index in Markdown format\nsbstck-dl download --url https://example.substack.com --create-archive --format md\n\n# Build archive over time with single posts\nsbstck-dl download --url https://example.substack.com/p/post-title --create-archive\n\n# Complete download with all features\nsbstck-dl download --url https://example.substack.com --download-images --download-files --create-archive\n\n# Custom directory structure with archive\nsbstck-dl download --url https://example.substack.com --create-archive --images-dir assets --files-dir attachments\n```\n\n## 4. Implementation Details\n\n### 4.1 Archive Entry Collection\n\n1. **Initialization**: Create Archive instance when `--create-archive` flag is set\n2. **Entry Collection**: Add entries during both single post and bulk download flows\n3. **Metadata Capture**: Record post details, file path, and download timestamp\n4. 
**Automatic Sorting**: Sort entries by publication date (newest first) on each addition\n\n### 4.2 Archive Generation Formats\n\n#### 4.2.1 HTML Format\n- **Styled Output**: Professional styling with CSS embedded in the HTML\n- **Post Cards**: Each post displayed as a card with image, title, metadata, and description\n- **Responsive Design**: Mobile-friendly layout with flexible containers\n- **Cover Images**: Display cover images with proper scaling and alignment\n- **File**: `index.html` in output directory root\n\n#### 4.2.2 Markdown Format  \n- **Clean Structure**: Headers, links, and metadata in standard Markdown format\n- **Image References**: Cover images included as standard Markdown image syntax\n- **Metadata Formatting**: Bold formatting for dates and consistent structure\n- **File**: `index.md` in output directory root\n\n#### 4.2.3 Text Format\n- **Plain Text**: Maximum compatibility with simple text structure\n- **Clear Separators**: Consistent formatting with horizontal line separators\n- **All Metadata**: Complete information including file paths and descriptions\n- **File**: `index.txt` in output directory root\n\n### 4.3 Sorting Algorithm\n\n```go\nfunc (a *Archive) sortEntries() {\n    sort.Slice(a.Entries, func(i, j int) bool {\n        // Parse post dates and compare (newest first)\n        dateI, errI := time.Parse(time.RFC3339, a.Entries[i].Post.PostDate)\n        dateJ, errJ := time.Parse(time.RFC3339, a.Entries[j].Post.PostDate)\n        \n        if errI != nil || errJ != nil {\n            // If parsing fails, sort by title alphabetically\n            return a.Entries[i].Post.Title < a.Entries[j].Post.Title\n        }\n        \n        return dateI.After(dateJ) // newest first\n    })\n}\n```\n\n### 4.4 File Path Management\n\n- **Relative Paths**: All post links use `filepath.Rel()` for portability\n- **Cross-Platform Compatibility**: Proper path separators for all operating systems\n- **Directory Structure Preservation**: Maintains 
existing file organization patterns\n\n## 5. Integration Points\n\n### 5.1 Download Flow Integration\n\n```go\n// Archive initialization in download command\nvar archive *lib.Archive\nif createArchive {\n    archive = lib.NewArchive()\n}\n\n// Entry collection during download processing\nif archive != nil {\n    archive.AddEntry(post, path, time.Now())\n}\n\n// Archive generation after downloads complete\nif archive != nil && len(archive.Entries) > 0 {\n    var archiveErr error\n    switch format {\n    case \"html\":\n        archiveErr = archive.GenerateHTML(outputFolder)\n    case \"md\":\n        archiveErr = archive.GenerateMarkdown(outputFolder)\n    case \"txt\":\n        archiveErr = archive.GenerateText(outputFolder)\n    default:\n        archiveErr = fmt.Errorf(\"unsupported archive format: %s\", format)\n    }\n    // Report the failure; how the command surfaces it is left to the implementation\n    if archiveErr != nil {\n        log.Printf(\"Error generating archive index: %v\", archiveErr)\n    }\n}\n```\n\n### 5.2 Format Consistency\n\n- **Output Format Matching**: Archive format automatically matches selected post format\n- **Content Alignment**: Archive styling and structure complement post formatting\n- **Directory Structure**: Archive placed in root output directory alongside posts\n\n## 6. Archive Content Structure\n\n### 6.1 Post Metadata Display\n\nEach archive entry includes:\n- **Title**: Clickable link to downloaded post file\n- **Publication Date**: Original Substack publication date (formatted: \"January 2, 2006\")\n- **Download Date**: Local download timestamp (formatted: \"January 2, 2006 15:04\")\n- **Description**: Post subtitle (priority) or description (fallback)\n- **Cover Image**: Featured post image when available\n\n### 6.2 Content Prioritization\n\n```go\n// Description selection logic\ndescription := entry.Post.Subtitle\nif description == \"\" {\n    description = entry.Post.Description\n}\n```\n\n### 6.3 Date Formatting\n\n- **Publication Date**: Human-readable format (\"January 2, 2006\")\n- **Download Date**: Includes time for precise tracking (\"January 2, 2006 15:04\")\n- **Sorting**: Parses the RFC3339 `PostDate` value for accurate chronological ordering\n\n## 7. 
Error Handling Strategy\n\n### 7.1 Archive Generation Errors\n\n- **Directory Creation**: Automatic creation of output directory if missing\n- **File Writing**: Graceful handling of permission and disk space issues\n- **Format Validation**: Error reporting for unknown or unsupported formats\n\n### 7.2 Metadata Processing\n\n- **Date Parsing**: Fallback to title-based sorting for unparseable dates\n- **Missing Fields**: Graceful handling of empty subtitles, descriptions, or cover images\n- **Path Generation**: Error handling for invalid file paths or relative path calculation failures\n\n### 7.3 Content Validation\n\n- **Empty Archives**: Skip generation when no entries are present\n- **Invalid Entries**: Continue processing valid entries when individual entries have issues\n- **HTML Escaping**: Proper escaping of user content in HTML format\n\n## 8. Performance Considerations\n\n### 8.1 Memory Management\n\n- **Incremental Building**: Archive entries added incrementally during the download process\n- **Efficient Sorting**: In-place sorting with `sort.Slice` from the standard library\n- **Content Generation**: String building optimized for each format type\n\n### 8.2 File I/O Optimization\n\n- **Single Write Operations**: Generate complete content before writing to disk\n- **Relative Path Caching**: Efficient path calculation using `filepath.Rel()`\n- **Format-Specific Generation**: Only generate requested format to minimize overhead\n\n## 9. 
Testing Strategy\n\n### 9.1 Unit Tests\n\n```go\n// Comprehensive test coverage areas\nfunc TestNewArchive(t *testing.T)\nfunc TestArchive_AddEntry(t *testing.T)\nfunc TestArchive_sortEntries(t *testing.T)\nfunc TestArchive_GenerateHTML(t *testing.T)\nfunc TestArchive_GenerateMarkdown(t *testing.T)\nfunc TestArchive_GenerateText(t *testing.T)\nfunc TestEnhancedPostExtraction(t *testing.T)\n```\n\n### 9.2 Integration Tests\n\n```go\nfunc TestArchiveWorkflow(t *testing.T)\nfunc TestCommandFlags(t *testing.T)\nfunc TestArchivePageGeneration(t *testing.T)\n```\n\n### 9.3 Test Coverage Areas\n\n- **Data Structure Operations**: Archive creation, entry management, sorting\n- **Format Generation**: Content generation for all three formats\n- **Error Scenarios**: Invalid dates, missing fields, empty archives\n- **Integration**: End-to-end workflows with CLI flag integration\n- **Post Enhancement**: Subtitle and cover image extraction functionality\n\n## 10. Security Considerations\n\n### 10.1 Content Security\n\n- **HTML Escaping**: Proper escaping of post titles and descriptions in HTML format\n- **Path Validation**: Safe relative path generation preventing directory traversal\n- **Input Sanitization**: Clean handling of user-provided post content\n\n### 10.2 File System Security\n\n- **Directory Containment**: Archive files created only in designated output directory\n- **Permission Handling**: Graceful handling of file system permission restrictions\n- **Path Safety**: Cross-platform safe path generation and validation\n\n## 11. 
Directory Structure Impact\n\n### 11.1 Output Structure with Archive\n\n```\noutput/\n├── index.html                    # Archive index page\n├── 20231201_120000_post-title.html\n├── 20231115_090000_another-post.html\n├── images/\n│   ├── post-title/\n│   │   └── image1_1456x819.jpeg\n│   └── another-post/\n│       └── image2_848x636.png\n└── files/\n    ├── post-title/\n    │   └── document.pdf\n    └── another-post/\n        └── spreadsheet.xlsx\n```\n\n### 11.2 Archive Index Formats\n\n- **HTML**: `index.html` - Styled webpage with embedded CSS\n- **Markdown**: `index.md` - Clean markdown for documentation systems\n- **Text**: `index.txt` - Plain text for maximum compatibility\n\n## 12. Migration and Rollout\n\n### 12.1 Backward Compatibility\n\n- **Opt-in Feature**: Archive generation only when `--create-archive` flag is used\n- **No Breaking Changes**: Existing CLI behavior unchanged when flag not present\n- **Format Consistency**: Archive format automatically matches post format selection\n\n### 12.2 Progressive Enhancement\n\n- **Single Post Support**: Build archives incrementally with individual post downloads\n- **Bulk Download Integration**: Seamless operation with existing bulk download workflows\n- **Feature Combination**: Full compatibility with image and file download features\n\n## 13. 
Future Enhancements\n\n### 13.1 Potential Extensions\n\n- **Custom Templates**: User-provided HTML/Markdown templates for archive pages\n- **Theme Support**: Multiple built-in themes for HTML archive format\n- **Pagination**: Support for paginated archives with very large post collections\n- **Search Integration**: Client-side search functionality for archive pages\n\n### 13.2 Advanced Features\n\n- **Archive Regeneration**: Rebuild archive from existing downloaded files\n- **Multiple Formats**: Generate archive in multiple formats simultaneously\n- **RSS Generation**: Create RSS/Atom feeds from archive content\n- **Static Site Integration**: Export formats compatible with static site generators\n\n---\n\n**Specification Status**: Implemented v1.0  \n**Last Updated**: 2025-01-03  \n**Dependencies**: Existing sbstck-dl codebase (fetcher.go, extractor.go), enhanced Post struct  \n**Implementation**: Complete with comprehensive test coverage"
  },
  {
    "path": "specs/file-attachment-download.md",
    "content": "# File Attachment Download Feature Specification\n\n## 1. Overview\n\n### 1.1 Purpose\nAdd support for downloading file attachments from Substack posts alongside the existing text and image download functionality. This feature will enable users to download PDFs, documents, and other files that authors embed in their posts, with local file references updated in the downloaded content.\n\n### 1.2 Success Criteria\n- Users can download file attachments from Substack posts using command-line flags\n- Downloaded files are organized in a configurable directory structure\n- HTML/Markdown content is updated with local file paths\n- Optional file extension filtering allows selective downloading\n- Integration with existing rate limiting and retry mechanisms\n- Comprehensive error handling for network failures and unsupported file types\n\n### 1.3 Scope Boundaries\n**In Scope:**\n- Detection and extraction of file attachment URLs from Substack HTML\n- Download of attachments with appropriate file naming\n- Content rewriting to reference local file paths\n- File extension filtering capabilities\n- Integration with existing fetcher infrastructure\n- Support for all common file types (PDF, DOC, TXT, etc.)\n\n**Out of Scope:**\n- File preview or content analysis capabilities\n- Automatic file conversion between formats\n- Virus scanning or security validation of downloaded files\n- Selective downloading based on file size limits\n- Cloud storage integration for downloaded files\n\n## 2. 
Technical Architecture\n\n### 2.1 Architecture Alignment\nThis feature follows the established sbstck-dl patterns:\n- **Modular Design**: New `FileDownloader` struct similar to existing `ImageDownloader`\n- **Consistent Interface**: Integration with existing CLI flags and output patterns\n- **Error Handling**: Leverages existing retry and backoff mechanisms from `Fetcher`\n- **Content Rewriting**: Similar approach to image URL replacement in HTML/Markdown\n\n### 2.2 Core Components\n\n#### 2.2.1 FileDownloader Struct\n```go\ntype FileDownloader struct {\n    fetcher     *Fetcher\n    outputDir   string\n    filesDir    string\n    allowedExts []string // empty means all extensions allowed\n}\n```\n\n#### 2.2.2 File Information Structure\n```go\ntype FileInfo struct {\n    URL         string\n    Filename    string\n    Extension   string\n    Size        string\n    Type        string\n    LocalPath   string\n}\n\ntype FileDownloadResult struct {\n    Files       []FileInfo\n    UpdatedHTML string\n    Errors      []error\n}\n```\n\n### 2.3 HTML Parsing Strategy\n\n#### 2.3.1 CSS Selector Target\n- **Primary Selector**: `.file-embed-button.wide`\n- **Container Selector**: `.file-embed-container-top` (for metadata extraction)\n\n#### 2.3.2 HTML Structure Analysis\nBased on the example URL, file attachments follow this structure:\n```html\n<div class=\"file-embed-container-top\">\n    <img src=\"...\" class=\"file-embed-thumbnail-default\">\n    <div class=\"file-embed-details\">\n        <div class=\"file-embed-details-h1\">The Stone Boy Cropped 1</div>\n        <div class=\"file-embed-details-h2\">207KB ∙ PDF file</div>\n    </div>\n    <a href=\"https://georgesaunders.substack.com/api/v1/file/...\" \n       class=\"file-embed-button wide\">\n        <span class=\"file-embed-button-text\">Download</span>\n    </a>\n</div>\n```\n\n## 3. 
Command Line Interface\n\n### 3.1 New CLI Flags\n\n```go\n// New flags to add to cmd/download.go\nvar (\n    downloadFiles    bool     // --download-files\n    filesDir         string   // --files-dir  \n    allowedFileExts  []string // --file-extensions\n)\n```\n\n### 3.2 Flag Definitions\n\n| Flag | Short | Default | Description |\n|------|-------|---------|-------------|\n| `--download-files` | | `false` | Download file attachments locally and update content references |\n| `--files-dir` | | `\"files\"` | Directory name for downloaded files (relative to output directory) |\n| `--file-extensions` | | `[]` (all) | Comma-separated list of allowed file extensions (e.g., \"pdf,doc,txt\") |\n\n### 3.3 Usage Examples\n\n```bash\n# Download posts with all file attachments\nsbstck-dl download --url https://example.substack.com --download-files\n\n# Download only PDF and DOC files to custom directory\nsbstck-dl download --url https://example.substack.com --download-files \\\n    --file-extensions \"pdf,doc\" --files-dir \"documents\"\n\n# Combined with existing features\nsbstck-dl download --url https://example.substack.com --download-files \\\n    --download-images --format md --output ./downloads\n```\n\n## 4. Implementation Details\n\n### 4.1 File Detection Algorithm\n\n1. **HTML Parsing**: Use goquery to find all `.file-embed-button.wide` elements\n2. **URL Extraction**: Extract `href` attribute from anchor tags\n3. **Metadata Extraction**: Parse container for filename, size, and type information\n4. **Extension Filtering**: Apply user-specified extension filters if provided\n\n### 4.2 File Naming Strategy\n\n```go\nfunc (fd *FileDownloader) generateSafeFilename(fileInfo FileInfo, index int) string {\n    // Priority order for filename:\n    // 1. Extract from file-embed-details-h1 if available\n    // 2. Parse from URL path\n    // 3. Generate from URL hash + extension\n    // 4. 
Fallback: \"attachment_<index>.<ext>\"\n}\n```\n\n### 4.3 Content Rewriting\n\n#### 4.3.1 HTML Content Updates\n- Replace `href` attributes in `.file-embed-button.wide` elements\n- Maintain original HTML structure while updating file paths\n- Handle both absolute and relative path scenarios\n\n#### 4.3.2 Markdown Content Updates\n- Convert file embed HTML to Markdown link format: `[filename](local/path)`\n- Preserve file metadata information in link text when possible\n\n### 4.4 Directory Structure\n\n```\noutput_directory/\n├── post-title.html\n├── images/           # existing images directory\n│   └── image1.jpg\n└── files/           # new files directory\n    ├── document1.pdf\n    ├── spreadsheet1.xlsx\n    └── archive1.zip\n```\n\n## 5. Integration Points\n\n### 5.1 Extractor Integration\n\n```go\n// Add to Post struct\ntype Post struct {\n    // ... existing fields\n    FileDownloadResult *FileDownloadResult `json:\"file_download_result,omitempty\"`\n}\n\n// New method on Post\nfunc (p *Post) WriteToFileWithAttachments(ctx context.Context, path, format string, \n    addSourceURL, downloadImages, downloadFiles bool, imageQuality ImageQuality, \n    imagesDir, filesDir string, allowedExts []string, fetcher *Fetcher) (*FileDownloadResult, error)\n```\n\n### 5.2 Command Integration\n\n```go\n// Update in cmd/download.go init()\ndownloadCmd.Flags().BoolVar(&downloadFiles, \"download-files\", false, \n    \"Download file attachments locally and update content to reference local files\")\ndownloadCmd.Flags().StringVar(&filesDir, \"files-dir\", \"files\", \n    \"Directory name for downloaded files\")\ndownloadCmd.Flags().StringSliceVar(&allowedFileExts, \"file-extensions\", []string{}, \n    \"Comma-separated list of allowed file extensions (empty = all extensions)\")\n```\n\n## 6. 
Error Handling Strategy\n\n### 6.1 Network Error Handling\n- **Retry Logic**: Leverage existing `Fetcher` retry mechanisms with exponential backoff\n- **Rate Limiting**: Respect existing rate limiting for file downloads\n- **Timeout Handling**: Use configurable timeouts for large file downloads\n\n### 6.2 File System Error Handling\n- **Directory Creation**: Ensure files directory exists before downloading\n- **Permission Errors**: Graceful handling of write permission issues\n- **Disk Space**: Basic validation for available disk space\n\n### 6.3 Content Error Handling\n- **Invalid URLs**: Skip malformed or inaccessible file URLs\n- **Extension Filtering**: Log filtered files for user awareness\n- **Partial Failures**: Continue processing other files if individual downloads fail\n\n## 7. Performance Considerations\n\n### 7.1 Concurrent Downloads\n- Use Go's `errgroup` pattern consistent with existing image download implementation\n- Configurable worker pools to prevent resource exhaustion\n- Progress reporting for large file downloads\n\n### 7.2 Memory Management\n- Stream large files to disk rather than loading entirely in memory\n- Implement file size limits to prevent excessive memory usage\n- Clean up temporary files on process interruption\n\n## 8. 
Testing Strategy\n\n### 8.1 Unit Tests\n\n```go\n// Test coverage areas\nfunc TestFileDownloader_ExtractFileElements(t *testing.T)\nfunc TestFileDownloader_GenerateSafeFilename(t *testing.T)\nfunc TestFileDownloader_DownloadSingleFile(t *testing.T)\nfunc TestFileDownloader_UpdateHTMLWithLocalPaths(t *testing.T)\nfunc TestFileDownloader_ExtensionFiltering(t *testing.T)\n```\n\n### 8.2 Integration Tests\n- **Real Substack Posts**: Test with actual posts containing file attachments\n- **Network Conditions**: Test behavior under various network conditions\n- **File Type Coverage**: Test common file types (PDF, DOC, XLS, ZIP, etc.)\n- **Edge Cases**: Empty responses, malformed HTML, missing files\n\n### 8.3 Performance Tests\n- **Large File Handling**: Test downloads of files larger than 100 MB\n- **Multiple Files**: Test posts with many attachments\n- **Concurrent Processing**: Validate worker pool behavior\n\n## 9. Security Considerations\n\n### 9.1 File Path Security\n- **Path Traversal Prevention**: Sanitize filenames to prevent directory traversal attacks\n- **Safe Filename Generation**: Remove or escape dangerous characters in filenames\n- **Directory Containment**: Ensure all downloads remain within designated directories\n\n### 9.2 Content Validation\n- **URL Validation**: Validate that file URLs come from expected Substack domains\n- **File Type Validation**: Basic MIME type checking for downloaded files\n- **Size Limits**: Implement reasonable file size limits to prevent abuse\n\n## 10. Migration and Rollout\n\n### 10.1 Backward Compatibility\n- The feature is entirely opt-in via the `--download-files` flag\n- No changes to existing CLI behavior when the flag is not used\n- Existing configurations and scripts remain unaffected\n\n### 10.2 Documentation Updates\n- Update CLI help text and documentation\n- Add usage examples to README\n- Document new directory structure and file naming conventions\n\n## 11. 
Future Enhancements\n\n### 11.1 Potential Extensions\n- **File Size Filtering**: Add flags for minimum/maximum file size limits\n- **Content Type Detection**: Enhanced MIME type detection and handling\n- **Progress Indicators**: Visual progress bars for large downloads\n- **Deduplication**: Skip downloading identical files across multiple posts\n\n### 11.2 Advanced Features\n- **Selective Downloads**: Interactive mode for choosing which files to download\n- **Metadata Preservation**: Store original file metadata in sidecar files\n- **Cloud Integration**: Optional upload to cloud storage services\n\n---\n\n**Specification Status**: Draft v1.0  \n**Last Updated**: 2025-07-31  \n**Dependencies**: Existing sbstck-dl codebase (fetcher.go, extractor.go, images.go)"
  }
]