Repository: PromtEngineer/agentic-file-search
Branch: main
Commit: 83c5b4231f44
Files: 59
Total size: 458.6 KB

Directory structure:
agentic-file-search/
├── .github/
│   └── workflows/
│       ├── build.yaml
│       ├── lint.yaml
│       ├── test.yaml
│       └── typecheck.yaml
├── .gitignore
├── .pre-commit-config.yaml
├── .python-version
├── ARCHITECTURE.md
├── CLAUDE.md
├── IMPLEMENTATION_PLAN.md
├── Makefile
├── README.md
├── YOUTUBE_DEMO_TESTS.md
├── data/
│   ├── large_acquisition/
│   │   └── TEST_QUESTIONS.md
│   ├── test_acquisition/
│   │   └── TEST_QUESTIONS.md
│   └── testfile.txt
├── docker/
│   └── docker-compose.yml
├── pyproject.toml
├── scripts/
│   ├── generate_large_docs.py
│   └── generate_test_docs.py
├── src/
│   └── fs_explorer/
│       ├── __init__.py
│       ├── agent.py
│       ├── embeddings.py
│       ├── exploration_trace.py
│       ├── fs.py
│       ├── index_config.py
│       ├── indexing/
│       │   ├── __init__.py
│       │   ├── chunker.py
│       │   ├── metadata.py
│       │   ├── pipeline.py
│       │   └── schema.py
│       ├── main.py
│       ├── models.py
│       ├── search/
│       │   ├── __init__.py
│       │   ├── filters.py
│       │   ├── query.py
│       │   ├── ranker.py
│       │   └── semantic.py
│       ├── server.py
│       ├── storage/
│       │   ├── __init__.py
│       │   ├── base.py
│       │   └── duckdb.py
│       ├── ui.html
│       └── workflow.py
└── tests/
    ├── __init__.py
    ├── conftest.py
    ├── test_agent.py
    ├── test_cli_indexing.py
    ├── test_e2e.py
    ├── test_embeddings.py
    ├── test_exploration_trace.py
    ├── test_fs.py
    ├── test_indexing.py
    ├── test_models.py
    ├── test_search.py
    ├── test_server_search.py
    └── testfiles/
        ├── file1.txt
        ├── file2.md
        └── last/
            └── lastfile.txt

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/workflows/build.yaml
================================================
name: Build

on:
  pull_request:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v6

      - name: Set up Python
        run: uv python install 3.13

      - name: Build package
        run: make build


================================================
FILE: .github/workflows/lint.yaml
================================================
name: Linting

on:
  pull_request:

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v6

      - name: Set up Python
        run: uv python install 3.12

      - name: Run formatter
        shell: bash
        run: make format-check

      - name: Run linter
        shell: bash
        run: make lint


================================================
FILE: .github/workflows/test.yaml
================================================
name: CI Tests - Pull Request

on:
  pull_request:

jobs:
  testing_pr:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12", "3.13"]
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 1

      - name: Install uv
        uses: astral-sh/setup-uv@v6
        with:
          python-version: ${{ matrix.python-version }}
          enable-cache: true

      - name: Run Tests on Main Package
        run: make test
        

================================================
FILE: .github/workflows/typecheck.yaml
================================================
name: Typecheck

on:
  pull_request:

jobs:
  core-typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 1

      - name: Install uv
        uses: astral-sh/setup-uv@v6

      - name: Set up Python
        run: uv python install

      - name: Run Mypy
        run: make typecheck

================================================
FILE: .gitignore
================================================
# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv

# caches
*_cache/

# Environment
.env

# OS files
.DS_Store

================================================
FILE: .pre-commit-config.yaml
================================================
---
default_language_version:
  python: python3

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: check-merge-conflict
      - id: check-symlinks
      - id: check-yaml
      - id: detect-private-key

================================================
FILE: .python-version
================================================
3.13


================================================
FILE: ARCHITECTURE.md
================================================
# FsExplorer Architecture Documentation

## Table of Contents

1. [System Overview](#system-overview)
2. [Component Architecture](#component-architecture)
3. [Core Modules](#core-modules)
4. [Workflow Engine](#workflow-engine)
5. [Agent Decision Loop](#agent-decision-loop)
6. [Document Processing Pipeline](#document-processing-pipeline)
7. [Three-Phase Exploration Strategy](#three-phase-exploration-strategy)
8. [Token Tracking & Cost Estimation](#token-tracking--cost-estimation)
9. [CLI Interface](#cli-interface)
10. [Data Flow](#data-flow)
11. [File Structure](#file-structure)
12. [Extension Points](#extension-points)

---

## System Overview

FsExplorer is an AI-powered filesystem exploration agent that answers questions about documents by intelligently navigating directories, parsing files, and synthesizing information with source citations.

```mermaid
graph TB
    subgraph "User Interface"
        CLI[CLI Interface<br/>typer + rich]
    end

    subgraph "Orchestration Layer"
        WF[Workflow Engine<br/>llama-index-workflows]
        EVT[Event System]
    end

    subgraph "Intelligence Layer"
        AGENT[FsExplorer Agent]
        LLM[Google Gemini 2.0 Flash<br/>Structured JSON Output]
        PROMPT[System Prompt<br/>Three-Phase Strategy]
    end

    subgraph "Tools Layer"
        TOOLS[Tool Registry]
        SCAN[scan_folder<br/>Parallel Scan]
        PREVIEW[preview_file<br/>Quick Preview]
        PARSE[parse_file<br/>Deep Read]
        READ[read<br/>Text Files]
        GREP[grep<br/>Pattern Search]
        GLOB[glob<br/>File Search]
    end

    subgraph "Document Processing"
        DOCLING[Docling<br/>Document Converter]
        CACHE[Document Cache]
    end

    subgraph "Filesystem"
        FS[(Local Filesystem)]
        PDF[PDF Files]
        DOCX[DOCX Files]
        MD[Markdown Files]
        OTHER[Other Formats]
    end

    CLI --> WF
    WF --> EVT
    EVT --> AGENT
    AGENT --> LLM
    AGENT --> PROMPT
    AGENT --> TOOLS
    
    TOOLS --> SCAN
    TOOLS --> PREVIEW
    TOOLS --> PARSE
    TOOLS --> READ
    TOOLS --> GREP
    TOOLS --> GLOB
    
    SCAN --> DOCLING
    PREVIEW --> DOCLING
    PARSE --> DOCLING
    
    DOCLING --> CACHE
    CACHE --> FS
    
    FS --> PDF
    FS --> DOCX
    FS --> MD
    FS --> OTHER

    style LLM fill:#4285f4,color:#fff
    style DOCLING fill:#ff6b6b,color:#fff
    style CACHE fill:#ffd93d,color:#000
    style AGENT fill:#6bcb77,color:#fff
```

---

## Component Architecture

### High-Level Component Diagram

```mermaid
graph LR
    subgraph "Entry Point"
        MAIN[main.py<br/>CLI Entry]
    end

    subgraph "Workflow"
        WORKFLOW[workflow.py<br/>Event Orchestration]
    end

    subgraph "Agent"
        AGENT_MOD[agent.py<br/>AI Decision Making]
    end

    subgraph "Models"
        MODELS[models.py<br/>Pydantic Schemas]
    end

    subgraph "Filesystem"
        FS_MOD[fs.py<br/>File Operations]
    end

    MAIN --> WORKFLOW
    WORKFLOW --> AGENT_MOD
    AGENT_MOD --> MODELS
    AGENT_MOD --> FS_MOD
    WORKFLOW --> MODELS

    style MAIN fill:#e1f5fe
    style WORKFLOW fill:#f3e5f5
    style AGENT_MOD fill:#e8f5e9
    style MODELS fill:#fff3e0
    style FS_MOD fill:#fce4ec
```

### Module Dependencies

```mermaid
graph TD
    subgraph "fs_explorer package"
        INIT[__init__.py<br/>Public API Exports]
        MAIN[main.py]
        WORKFLOW[workflow.py]
        AGENT[agent.py]
        MODELS[models.py]
        FS[fs.py]
    end

    subgraph "External Dependencies"
        TYPER[typer<br/>CLI Framework]
        RICH[rich<br/>Terminal UI]
        WORKFLOWS[llama-index-workflows<br/>Event System]
        GENAI[google-genai<br/>Gemini API]
        PYDANTIC[pydantic<br/>Data Validation]
        DOCLING[docling<br/>Document Parsing]
    end

    INIT --> AGENT
    INIT --> WORKFLOW
    INIT --> MODELS
    
    MAIN --> TYPER
    MAIN --> RICH
    MAIN --> WORKFLOW
    
    WORKFLOW --> WORKFLOWS
    WORKFLOW --> AGENT
    WORKFLOW --> MODELS
    WORKFLOW --> FS
    
    AGENT --> GENAI
    AGENT --> MODELS
    AGENT --> FS
    
    MODELS --> PYDANTIC
    
    FS --> DOCLING

    style GENAI fill:#4285f4,color:#fff
    style DOCLING fill:#ff6b6b,color:#fff
```

---

## Core Modules

### models.py - Data Schemas

Defines the structured output format for the AI agent using Pydantic models.

```mermaid
classDiagram
    class Action {
        +action: ToolCallAction | GoDeeperAction | StopAction | AskHumanAction
        +reason: str
        +to_action_type() ActionType
    }

    class ToolCallAction {
        +tool_name: Tools
        +tool_input: list[ToolCallArg]
        +to_fn_args() dict
    }

    class ToolCallArg {
        +parameter_name: str
        +parameter_value: Any
    }

    class GoDeeperAction {
        +directory: str
    }

    class StopAction {
        +final_result: str
    }

    class AskHumanAction {
        +question: str
    }

    Action --> ToolCallAction
    Action --> GoDeeperAction
    Action --> StopAction
    Action --> AskHumanAction
    ToolCallAction --> ToolCallArg

    note for Action "Main container returned by LLM"
    note for ToolCallAction "Invokes filesystem tools"
    note for StopAction "Contains final answer with citations"
```
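
A minimal Pydantic sketch of this union, following the class diagram (the real `models.py` adds stricter tool-name typing and validation):

```python
from typing import Any, Union

from pydantic import BaseModel

class ToolCallArg(BaseModel):
    parameter_name: str
    parameter_value: Any

class ToolCallAction(BaseModel):
    tool_name: str  # one of the registered tool names
    tool_input: list[ToolCallArg]

    def to_fn_args(self) -> dict[str, Any]:
        # Flatten the argument list into **kwargs for the tool function.
        return {arg.parameter_name: arg.parameter_value for arg in self.tool_input}

class GoDeeperAction(BaseModel):
    directory: str

class StopAction(BaseModel):
    final_result: str

class AskHumanAction(BaseModel):
    question: str

class Action(BaseModel):
    action: Union[ToolCallAction, GoDeeperAction, StopAction, AskHumanAction]
    reason: str
```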

### agent.py - AI Agent

The core intelligence component that interacts with Google Gemini.

```mermaid
classDiagram
    class FsExplorerAgent {
        -_client: GenAIClient
        -_chat_history: list[Content]
        +token_usage: TokenUsage
        +__init__(api_key: str)
        +configure_task(task: str) void
        +take_action() tuple[Action, ActionType]
        +call_tool(tool_name: Tools, tool_input: dict) void
        +reset() void
    }

    class TokenUsage {
        +prompt_tokens: int
        +completion_tokens: int
        +total_tokens: int
        +api_calls: int
        +tool_result_chars: int
        +documents_parsed: int
        +documents_scanned: int
        +add_api_call(prompt_tokens, completion_tokens) void
        +add_tool_result(result, tool_name) void
        +summary() str
    }

    class TOOLS {
        <<dictionary>>
        +read: read_file
        +grep: grep_file_content
        +glob: glob_paths
        +scan_folder: scan_folder
        +preview_file: preview_file
        +parse_file: parse_file
    }

    FsExplorerAgent --> TokenUsage
    FsExplorerAgent --> TOOLS
```

### fs.py - Filesystem Operations

All filesystem and document parsing utilities.

```mermaid
classDiagram
    class FilesystemModule {
        <<module>>
        +SUPPORTED_EXTENSIONS: frozenset
        +DEFAULT_PREVIEW_CHARS: int = 3000
        +DEFAULT_SCAN_PREVIEW_CHARS: int = 1500
        +DEFAULT_MAX_WORKERS: int = 4
    }

    class DocumentCache {
        <<singleton>>
        -_DOCUMENT_CACHE: dict[str, str]
        +clear_document_cache() void
        +_get_cached_or_parse(file_path) str
    }

    class DirectoryOps {
        <<functions>>
        +describe_dir_content(directory) str
        +glob_paths(directory, pattern) str
    }

    class FileOps {
        <<functions>>
        +read_file(file_path) str
        +grep_file_content(file_path, pattern) str
    }

    class DocumentOps {
        <<functions>>
        +preview_file(file_path, max_chars) str
        +parse_file(file_path) str
        +scan_folder(directory, max_workers, preview_chars) str
    }

    FilesystemModule --> DocumentCache
    FilesystemModule --> DirectoryOps
    FilesystemModule --> FileOps
    FilesystemModule --> DocumentOps
    DocumentOps --> DocumentCache
```

---

## Workflow Engine

The workflow engine uses an event-driven architecture based on `llama-index-workflows`.

### Workflow State Machine

```mermaid
stateDiagram-v2
    [*] --> StartExploration: InputEvent(task)
    
    StartExploration --> ToolCall: ToolCallEvent
    StartExploration --> GoDeeper: GoDeeperEvent
    StartExploration --> AskHuman: AskHumanEvent
    StartExploration --> End: StopAction
    
    ToolCall --> ToolCall: ToolCallEvent
    ToolCall --> GoDeeper: GoDeeperEvent
    ToolCall --> AskHuman: AskHumanEvent
    ToolCall --> End: StopAction
    
    GoDeeper --> ToolCall: ToolCallEvent
    GoDeeper --> GoDeeper: GoDeeperEvent
    GoDeeper --> AskHuman: AskHumanEvent
    GoDeeper --> End: StopAction
    
    AskHuman --> WaitForHuman: InputRequiredEvent
    WaitForHuman --> ProcessHumanResponse: HumanAnswerEvent
    ProcessHumanResponse --> ToolCall: ToolCallEvent
    ProcessHumanResponse --> GoDeeper: GoDeeperEvent
    ProcessHumanResponse --> AskHuman: AskHumanEvent
    ProcessHumanResponse --> End: StopAction
    
    End --> [*]: ExplorationEndEvent

    note right of StartExploration
        Initial task processing
        Describes current directory
        Asks LLM for first action
    end note

    note right of ToolCall
        Executes filesystem tool
        Adds result to chat history
        Asks LLM for next action
    end note

    note right of GoDeeper
        Updates current directory
        Describes new directory
        Asks LLM for next action
    end note
```

### Event Types

```mermaid
graph TB
    subgraph "Start Events"
        IE[InputEvent<br/>task: str]
    end

    subgraph "Intermediate Events"
        TCE[ToolCallEvent<br/>tool_name, tool_input, reason]
        GDE[GoDeeperEvent<br/>directory, reason]
        AHE[AskHumanEvent<br/>question, reason]
        HAE[HumanAnswerEvent<br/>response]
    end

    subgraph "End Events"
        EEE[ExplorationEndEvent<br/>final_result, error]
    end

    IE --> TCE
    IE --> GDE
    IE --> AHE
    IE --> EEE

    TCE --> TCE
    TCE --> GDE
    TCE --> AHE
    TCE --> EEE

    GDE --> TCE
    GDE --> GDE
    GDE --> AHE
    GDE --> EEE

    AHE --> HAE
    HAE --> TCE
    HAE --> GDE
    HAE --> AHE
    HAE --> EEE

    style IE fill:#4caf50,color:#fff
    style EEE fill:#f44336,color:#fff
    style TCE fill:#2196f3,color:#fff
    style GDE fill:#9c27b0,color:#fff
    style AHE fill:#ff9800,color:#fff
```
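
To show the wiring in code, a toy workflow in the same style (the import path shown is the classic `llama_index.core.workflow` one; the project depends on the standalone `llama-index-workflows` package, which exposes the same primitives). Steps are routed purely by the event types they accept and return:

```python
from llama_index.core.workflow import Event, StartEvent, StopEvent, Workflow, step

class ToolCallEvent(Event):
    tool_name: str
    reason: str

class ToyWorkflow(Workflow):
    @step
    async def start(self, ev: StartEvent) -> ToolCallEvent:
        # StartEvent carries the kwargs passed to workflow.run(...)
        return ToolCallEvent(tool_name="scan_folder", reason="initial scan")

    @step
    async def run_tool(self, ev: ToolCallEvent) -> StopEvent:
        # Returning StopEvent ends the run; its result is what run() resolves to.
        return StopEvent(result=f"would run {ev.tool_name} ({ev.reason})")
```

Running `await ToyWorkflow().run()` resolves to the `StopEvent` result; `FsExplorerWorkflow` follows the same pattern with the full event set above.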

### Workflow Steps

```mermaid
sequenceDiagram
    participant CLI as CLI (main.py)
    participant WF as Workflow
    participant Agent as FsExplorerAgent
    participant LLM as Gemini API
    participant Tools as Tool Registry
    participant FS as Filesystem

    CLI->>WF: InputEvent(task)
    
    WF->>Agent: configure_task(initial_prompt)
    Agent->>LLM: generate_content(chat_history)
    LLM-->>Agent: Action JSON
    
    alt ToolCallAction
        Agent->>Tools: call_tool(name, args)
        Tools->>FS: execute operation
        FS-->>Tools: result
        Tools-->>Agent: tool result
        Agent->>Agent: add to chat_history
        WF-->>CLI: ToolCallEvent (stream)
        WF->>Agent: configure_task("next action?")
        Note over WF,Agent: Loop continues
    else GoDeeperAction
        WF->>WF: update current_directory
        WF-->>CLI: GoDeeperEvent (stream)
        WF->>Agent: configure_task("next action?")
        Note over WF,Agent: Loop continues
    else AskHumanAction
        WF-->>CLI: AskHumanEvent (stream)
        CLI->>CLI: Wait for user input
        CLI->>WF: HumanAnswerEvent(response)
        WF->>Agent: configure_task(response)
        Note over WF,Agent: Loop continues
    else StopAction
        WF-->>CLI: ExplorationEndEvent(final_result)
    end
```

---

## Agent Decision Loop

### Single Decision Cycle

```mermaid
flowchart TB
    subgraph "Agent.take_action()"
        START([Start]) --> SEND[Send chat_history to Gemini]
        SEND --> RECEIVE[Receive JSON response]
        RECEIVE --> TRACK[Track token usage]
        TRACK --> PARSE[Parse Action from JSON]
        PARSE --> CHECK{Action Type?}
        
        CHECK -->|toolcall| EXEC[Execute Tool]
        EXEC --> RESULT[Get tool result]
        RESULT --> ADD[Add result to chat_history]
        ADD --> RETURN1[Return Action, ActionType]
        
        CHECK -->|godeeper| RETURN2[Return Action, ActionType]
        CHECK -->|askhuman| RETURN3[Return Action, ActionType]
        CHECK -->|stop| RETURN4[Return Action, ActionType]
        
        RETURN1 --> END([End])
        RETURN2 --> END
        RETURN3 --> END
        RETURN4 --> END
    end

    style START fill:#4caf50,color:#fff
    style END fill:#f44336,color:#fff
    style CHECK fill:#ff9800,color:#000
```
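
The same cycle in condensed, self-contained form (the dict shapes and the injected `llm_call` are stand-ins for illustration; the real agent uses the Pydantic `Action` schema and the Gemini client):

```python
import json
from typing import Callable

# Stand-in registry; the real TOOLS dict lives in agent.py.
TOOLS: dict[str, Callable[..., str]] = {
    "glob": lambda directory, pattern: f"(matches for {pattern} in {directory})",
}

def take_action(chat_history: list[dict],
                llm_call: Callable[[list[dict]], str]) -> tuple[dict, str]:
    raw = llm_call(chat_history)      # send the full history to the model
    action = json.loads(raw)          # parse the structured JSON reply
    kind = action["action_type"]
    if kind == "toolcall":
        result = TOOLS[action["tool_name"]](**action["tool_input"])
        chat_history.append({"role": "user", "text": result})  # feed result back
    return action, kind

# Example with a canned model reply:
fake_reply = json.dumps({"action_type": "toolcall", "tool_name": "glob",
                         "tool_input": {"directory": ".", "pattern": "*.pdf"}})
history: list[dict] = []
print(take_action(history, lambda _: fake_reply))
```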

### Chat History Evolution

```mermaid
sequenceDiagram
    participant User
    participant Agent
    participant LLM

    Note over Agent: chat_history = []

    User->>Agent: configure_task("Initial prompt + directory listing")
    Note over Agent: chat_history = [user: initial_prompt]

    Agent->>LLM: generate_content(chat_history)
    LLM-->>Agent: {action: scan_folder, reason: "..."}
    Note over Agent: chat_history = [user: initial_prompt, model: action1]

    Agent->>Agent: Execute scan_folder, add result
    Note over Agent: chat_history = [user: initial_prompt, model: action1, user: tool_result1]

    User->>Agent: configure_task("What's next?")
    Note over Agent: chat_history = [..., user: "What's next?"]

    Agent->>LLM: generate_content(chat_history)
    LLM-->>Agent: {action: parse_file, reason: "..."}
    Note over Agent: chat_history = [..., model: action2]

    Note over Agent: Pattern continues until StopAction
```

---

## Document Processing Pipeline

### Docling Integration

```mermaid
flowchart LR
    subgraph "Input Formats"
        PDF[PDF]
        DOCX[DOCX]
        PPTX[PPTX]
        XLSX[XLSX]
        HTML[HTML]
        MD[Markdown]
    end

    subgraph "Docling"
        DC[DocumentConverter]
        DETECT[Format Detection]
        PIPELINE[Processing Pipeline]
        EXPORT[Markdown Export]
    end

    subgraph "Output"
        MARKDOWN[Markdown Text]
    end

    PDF --> DC
    DOCX --> DC
    PPTX --> DC
    XLSX --> DC
    HTML --> DC
    MD --> DC

    DC --> DETECT
    DETECT --> PIPELINE
    PIPELINE --> EXPORT
    EXPORT --> MARKDOWN

    style DC fill:#ff6b6b,color:#fff
```
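
Programmatically, a conversion boils down to a few lines (assuming `docling` is installed; `report.pdf` is a placeholder path):

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()           # format detection is automatic
result = converter.convert("report.pdf")  # runs the processing pipeline
markdown = result.document.export_to_markdown()
print(markdown[:500])
```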

### Caching Strategy

```mermaid
flowchart TB
    subgraph "Cache Key Generation"
        PATH[file_path] --> ABS[os.path.abspath]
        ABS --> MTIME[os.path.getmtime]
        MTIME --> KEY["cache_key = f'{abs_path}:{mtime}'"]
    end

    subgraph "Cache Lookup"
        KEY --> CHECK{Key in cache?}
        CHECK -->|Yes| HIT[Return cached content]
        CHECK -->|No| MISS[Parse with Docling]
        MISS --> STORE[Store in cache]
        STORE --> RETURN[Return content]
    end

    subgraph "_DOCUMENT_CACHE"
        CACHE[(dict: str → str)]
    end

    HIT --> CACHE
    STORE --> CACHE

    style CACHE fill:#ffd93d,color:#000
```
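
A minimal sketch of this lookup, with the parser injected so the snippet runs standalone (the real `fs.py` calls Docling at the cache-miss step):

```python
import os
from typing import Callable

_DOCUMENT_CACHE: dict[str, str] = {}

def get_cached_or_parse(file_path: str, parse: Callable[[str], str]) -> str:
    abs_path = os.path.abspath(file_path)
    mtime = os.path.getmtime(abs_path)
    cache_key = f"{abs_path}:{mtime}"     # an mtime change invalidates the entry
    if cache_key not in _DOCUMENT_CACHE:  # cache miss: parse and store
        _DOCUMENT_CACHE[cache_key] = parse(abs_path)
    return _DOCUMENT_CACHE[cache_key]
```

Entries keyed under older mtimes simply linger for the life of the process, which matches the in-memory, no-persistence design noted under Security Considerations.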

### Parallel Document Scanning

```mermaid
flowchart TB
    subgraph "scan_folder(directory)"
        START([Start]) --> LIST[List directory files]
        LIST --> FILTER[Filter by SUPPORTED_EXTENSIONS]
        FILTER --> POOL[Create ThreadPoolExecutor<br/>max_workers=4]
        
        subgraph "Parallel Processing"
            POOL --> T1[Thread 1<br/>_preview_single_file]
            POOL --> T2[Thread 2<br/>_preview_single_file]
            POOL --> T3[Thread 3<br/>_preview_single_file]
            POOL --> T4[Thread 4<br/>_preview_single_file]
        end

        T1 --> COLLECT[Collect Results]
        T2 --> COLLECT
        T3 --> COLLECT
        T4 --> COLLECT

        COLLECT --> SORT[Sort by filename]
        SORT --> FORMAT[Format output report]
        FORMAT --> END([Return summary])
    end

    style START fill:#4caf50,color:#fff
    style END fill:#4caf50,color:#fff
    style POOL fill:#2196f3,color:#fff
```
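
A runnable sketch with the same shape (plain-text reads stand in for Docling parsing):

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SUPPORTED_EXTENSIONS = frozenset({".md", ".txt"})  # text-only subset for illustration

def scan_folder(directory: str, max_workers: int = 4, preview_chars: int = 1500) -> str:
    files = [p for p in Path(directory).iterdir()
             if p.suffix.lower() in SUPPORTED_EXTENSIONS]

    def preview_single_file(path: Path) -> tuple[str, str]:
        # The real implementation parses via Docling; plain reads keep this runnable.
        return path.name, path.read_text(errors="ignore")[:preview_chars]

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = sorted(pool.map(preview_single_file, files))  # sort by filename

    return "\n\n".join(f"=== {name} ===\n{text}" for name, text in results)
```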

---

## Three-Phase Exploration Strategy

### Phase Overview

```mermaid
flowchart TB
    subgraph "PHASE 1: Parallel Scan"
        P1_START([User Query]) --> P1_SCAN[scan_folder]
        P1_SCAN --> P1_PREVIEW[Get previews of ALL documents]
        P1_PREVIEW --> P1_CATEGORIZE[Categorize documents]
        
        P1_CATEGORIZE --> REL[RELEVANT<br/>Directly related]
        P1_CATEGORIZE --> MAYBE[MAYBE<br/>Potentially useful]
        P1_CATEGORIZE --> SKIP[SKIP<br/>Not relevant]
    end

    subgraph "PHASE 2: Deep Dive"
        REL --> P2_PARSE[parse_file on RELEVANT docs]
        MAYBE -.->|If needed| P2_PARSE
        P2_PARSE --> P2_EXTRACT[Extract key information]
        P2_EXTRACT --> P2_CROSS{Cross-references<br/>found?}
    end

    subgraph "PHASE 3: Backtracking"
        P2_CROSS -->|Yes| P3_CHECK{Referenced doc<br/>was SKIPPED?}
        P3_CHECK -->|Yes| P3_BACKTRACK[Go back and parse<br/>referenced document]
        P3_BACKTRACK --> P2_EXTRACT
        P3_CHECK -->|No| P3_CONTINUE[Continue analysis]
        P2_CROSS -->|No| P3_CONTINUE
    end

    subgraph "Final Answer"
        P3_CONTINUE --> ANSWER[Generate answer<br/>with citations]
        ANSWER --> SOURCES[List sources consulted]
        SOURCES --> END([Return to user])
    end

    style P1_START fill:#4caf50,color:#fff
    style END fill:#4caf50,color:#fff
    style REL fill:#4caf50,color:#fff
    style MAYBE fill:#ff9800,color:#000
    style SKIP fill:#9e9e9e,color:#fff
    style P3_BACKTRACK fill:#e91e63,color:#fff
```

### Cross-Reference Detection

```mermaid
flowchart LR
    subgraph "Document Content"
        DOC[Parsed Document]
    end

    subgraph "Pattern Matching"
        DOC --> P1["'See Exhibit A/B/C...'"]
        DOC --> P2["'As stated in [Document]...'"]
        DOC --> P3["'Refer to [filename]...'"]
        DOC --> P4["'per Document: [name]'"]
        DOC --> P5["'[Doc #XX]'"]
    end

    subgraph "Action"
        P1 --> FOUND[Cross-reference found]
        P2 --> FOUND
        P3 --> FOUND
        P4 --> FOUND
        P5 --> FOUND
        
        FOUND --> CHECK{Was referenced<br/>doc SKIPPED?}
        CHECK -->|Yes| BACKTRACK[Backtrack and parse]
        CHECK -->|No| CONTINUE[Continue]
    end

    style BACKTRACK fill:#e91e63,color:#fff
```
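
For intuition, hypothetical regexes matching these reference styles (the agent itself spots references by reading parsed text with the LLM, not with fixed rules):

```python
import re

CROSS_REF_PATTERNS = [
    r"[Ss]ee Exhibit [A-Z]",
    r"[Aa]s stated in ([\w .-]+)",
    r"[Rr]efer to ([\w .-]+\.\w+)",
    r"\[Doc #\d+\]",
]

def find_cross_references(text: str) -> list[str]:
    hits: list[str] = []
    for pattern in CROSS_REF_PATTERNS:
        hits.extend(m.group(0) for m in re.finditer(pattern, text))
    return hits

print(find_cross_references("For indemnities, see Exhibit B and [Doc #12]."))
# ['see Exhibit B', '[Doc #12]']
```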

---

## Token Tracking & Cost Estimation

### TokenUsage Class

```mermaid
flowchart TB
    subgraph "Input Tracking"
        API[API Call] --> PROMPT[prompt_token_count]
        API --> COMPLETION[candidates_token_count]
        PROMPT --> ADD_API[add_api_call]
        COMPLETION --> ADD_API
    end

    subgraph "Tool Tracking"
        TOOL[Tool Execution] --> RESULT[result string]
        RESULT --> ADD_TOOL[add_tool_result]
        ADD_TOOL --> CHARS[tool_result_chars += len]
        ADD_TOOL --> PARSED{tool_name?}
        PARSED -->|parse_file| INC_PARSED[documents_parsed++]
        PARSED -->|preview_file| INC_PARSED
        PARSED -->|scan_folder| INC_SCANNED[documents_scanned += count]
    end

    subgraph "Cost Calculation"
        ADD_API --> TOTALS[Update totals]
        TOTALS --> CALC[_calculate_cost]
        CALC --> INPUT_COST["input_cost = prompt_tokens × $0.075/1M"]
        CALC --> OUTPUT_COST["output_cost = completion_tokens × $0.30/1M"]
        INPUT_COST --> TOTAL_COST[total_cost]
        OUTPUT_COST --> TOTAL_COST
    end

    subgraph "Summary Output"
        TOTAL_COST --> SUMMARY[summary]
        CHARS --> SUMMARY
        INC_PARSED --> SUMMARY
        INC_SCANNED --> SUMMARY
    end
```

### Cost Estimation Formula

```mermaid
graph LR
    subgraph "Gemini 2.0 Flash Pricing"
        INPUT["Input: $0.075 / 1M tokens"]
        OUTPUT["Output: $0.30 / 1M tokens"]
    end

    subgraph "Calculation"
        PROMPT[prompt_tokens] --> DIV1[÷ 1,000,000]
        DIV1 --> MULT1[× $0.075]
        MULT1 --> INPUT_COST[Input Cost]

        COMP[completion_tokens] --> DIV2[÷ 1,000,000]
        DIV2 --> MULT2[× $0.30]
        MULT2 --> OUTPUT_COST[Output Cost]

        INPUT_COST --> SUM[+]
        OUTPUT_COST --> SUM
        SUM --> TOTAL[Total Estimated Cost]
    end

    style TOTAL fill:#4caf50,color:#fff
```
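
As straight-line Python:

```python
INPUT_PRICE_PER_M = 0.075   # USD per 1M prompt tokens
OUTPUT_PRICE_PER_M = 0.30   # USD per 1M completion tokens

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    input_cost = prompt_tokens / 1_000_000 * INPUT_PRICE_PER_M
    output_cost = completion_tokens / 1_000_000 * OUTPUT_PRICE_PER_M
    return input_cost + output_cost

print(f"${estimate_cost(25_000, 1_000):.4f}")  # -> $0.0022
```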

---

## CLI Interface

### Output Formatting

```mermaid
flowchart TB
    subgraph "Event Handling"
        EVENT{Event Type}
        
        EVENT -->|ToolCallEvent| TOOL_PANEL[format_tool_panel]
        EVENT -->|GoDeeperEvent| NAV_PANEL[format_navigation_panel]
        EVENT -->|AskHumanEvent| HUMAN_PANEL[Human Input Panel]
        EVENT -->|ExplorationEndEvent| FINAL_PANEL[Final Answer Panel]
    end

    subgraph "Tool Panel Components"
        TOOL_PANEL --> ICON[Tool Icon 📂📖👁️🔍]
        TOOL_PANEL --> STEP[Step Number]
        TOOL_PANEL --> PHASE[Phase Label]
        TOOL_PANEL --> TARGET[Target File/Directory]
        TOOL_PANEL --> REASON[Agent's Reasoning]
    end

    subgraph "Final Panel Components"
        FINAL_PANEL --> ANSWER[Answer with Citations]
        FINAL_PANEL --> SOURCES[Sources Consulted]
    end

    subgraph "Summary Panel"
        SUMMARY[Workflow Summary]
        SUMMARY --> STEPS[Total Steps]
        SUMMARY --> CALLS[API Calls]
        SUMMARY --> DOCS[Documents Scanned/Parsed]
        SUMMARY --> TOKENS[Token Usage]
        SUMMARY --> COST[Estimated Cost]
    end

    FINAL_PANEL --> SUMMARY
```

### Visual Elements

```mermaid
graph TB
    subgraph "Panel Styles"
        TOOL["📂 Tool Call<br/>border: yellow"]
        NAV["📁 Navigation<br/>border: magenta"]
        HUMAN["❓ Human Input<br/>border: red"]
        FINAL["✅ Final Answer<br/>border: green"]
        SUMMARY["📊 Summary<br/>border: blue"]
    end

    subgraph "Tool Icons"
        I1["📂 scan_folder"]
        I2["👁️ preview_file"]
        I3["📖 parse_file"]
        I4["📄 read"]
        I5["🔍 grep"]
        I6["🔎 glob"]
    end

    subgraph "Phase Labels"
        PH1["Phase 1: Parallel Document Scan"]
        PH2["Phase 2: Deep Dive"]
        PH3["Phase 1/2: Quick Preview"]
    end

    style TOOL fill:#ffeb3b,color:#000
    style NAV fill:#e1bee7,color:#000
    style HUMAN fill:#ffcdd2,color:#000
    style FINAL fill:#c8e6c9,color:#000
    style SUMMARY fill:#bbdefb,color:#000
```

---

## Data Flow

### Complete Request Flow

```mermaid
sequenceDiagram
    participant User
    participant CLI as main.py
    participant WF as Workflow
    participant Agent as FsExplorerAgent
    participant LLM as Gemini API
    participant Tools as Tool Registry
    participant Docling
    participant Cache
    participant FS as Filesystem

    User->>CLI: uv run explore --task "..."
    CLI->>CLI: print_workflow_header()
    CLI->>WF: workflow.run(InputEvent)

    loop Until StopAction
        WF->>Agent: configure_task()
        Agent->>LLM: generate_content()
        LLM-->>Agent: Action JSON
        Agent->>Agent: Track tokens

        alt ToolCallAction
            Agent->>Tools: TOOLS[name](**args)
            
            alt Document Tool
                Tools->>Cache: Check cache
                alt Cache Hit
                    Cache-->>Tools: Cached content
                else Cache Miss
                    Cache->>Docling: Convert document
                    Docling->>FS: Read file
                    FS-->>Docling: Raw bytes
                    Docling-->>Cache: Markdown content
                    Cache-->>Tools: Content
                end
            else Filesystem Tool
                Tools->>FS: Execute operation
                FS-->>Tools: Result
            end
            
            Tools-->>Agent: Tool result
            Agent->>Agent: Track tool metrics
            WF-->>CLI: ToolCallEvent
            CLI->>CLI: format_tool_panel()
        else GoDeeperAction
            WF->>WF: Update directory state
            WF-->>CLI: GoDeeperEvent
            CLI->>CLI: format_navigation_panel()
        else AskHumanAction
            WF-->>CLI: AskHumanEvent
            CLI->>User: Display question
            User->>CLI: Enter response
            CLI->>WF: HumanAnswerEvent
        else StopAction
            WF-->>CLI: ExplorationEndEvent
        end
    end

    CLI->>CLI: Display final answer
    CLI->>CLI: print_workflow_summary()
    CLI-->>User: Complete output
```

---

## File Structure

```
fs-explorer/
├── src/
│   └── fs_explorer/
│       ├── __init__.py      # Public API exports
│       ├── main.py          # CLI entry point (typer)
│       ├── workflow.py      # Event-driven workflow orchestration
│       ├── agent.py         # AI agent + Gemini integration
│       ├── models.py        # Pydantic action schemas
│       └── fs.py            # Filesystem + Docling operations
├── tests/
│   ├── conftest.py          # Test fixtures and mocks
│   ├── test_agent.py        # Agent unit tests
│   ├── test_fs.py           # Filesystem function tests
│   ├── test_models.py       # Model tests
│   ├── test_e2e.py          # End-to-end integration tests
│   └── testfiles/           # Test data
├── data/
│   ├── large_acquisition/   # Sample PDF documents
│   └── test_acquisition/    # Test document set
├── scripts/
│   ├── generate_test_docs.py
│   └── generate_large_docs.py
├── pyproject.toml           # Project configuration
├── Makefile                 # Development commands
├── README.md                # User documentation
└── ARCHITECTURE.md          # This file
```

---

## Extension Points

### Adding New Tools

```mermaid
flowchart LR
    subgraph "Step 1: Define Function"
        FUNC[def new_tool(args) -> str]
    end

    subgraph "Step 2: Register Tool"
        TOOLS["TOOLS dict in agent.py"]
        FUNC --> TOOLS
    end

    subgraph "Step 3: Update Types"
        TYPES["Tools TypeAlias in models.py"]
        TOOLS --> TYPES
    end

    subgraph "Step 4: Update Prompt"
        PROMPT["SYSTEM_PROMPT in agent.py"]
        TYPES --> PROMPT
    end

    style FUNC fill:#e3f2fd
    style TOOLS fill:#f3e5f5
    style TYPES fill:#fff3e0
    style PROMPT fill:#e8f5e9
```
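
Concretely, the four steps for a hypothetical `count_lines` tool (all names here are illustrative, not from the codebase):

```python
# Step 1 - fs.py: define the function; every tool returns a string.
def count_lines(file_path: str) -> str:
    with open(file_path, encoding="utf-8", errors="ignore") as fh:
        return f"{file_path}: {sum(1 for _ in fh)} lines"

# Step 2 - agent.py: register it (stand-in dict shown here).
TOOLS = {"count_lines": count_lines}

# Step 3 - models.py: extend the Tools TypeAlias, e.g.
# Tools = Literal["read", "grep", "glob", "scan_folder",
#                 "preview_file", "parse_file", "count_lines"]

# Step 4 - agent.py: describe the tool in SYSTEM_PROMPT so the LLM
# knows when to choose it.
```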

### Adding New Document Formats

```mermaid
flowchart LR
    subgraph "Docling Supported"
        PDF[PDF] --> DOCLING[Docling]
        DOCX[DOCX] --> DOCLING
        PPTX[PPTX] --> DOCLING
        XLSX[XLSX] --> DOCLING
        HTML[HTML] --> DOCLING
        MD[Markdown] --> DOCLING
    end

    subgraph "To Add New Format"
        NEW[New Format] --> CHECK{Docling<br/>supports?}
        CHECK -->|Yes| ADD["Add to SUPPORTED_EXTENSIONS"]
        CHECK -->|No| CUSTOM["Create custom handler<br/>in fs.py"]
    end

    DOCLING --> OUTPUT[Markdown]
    ADD --> OUTPUT
    CUSTOM --> OUTPUT
```

### Customizing the System Prompt

The system prompt in `agent.py` can be modified to:

1. **Add new exploration strategies**
2. **Change citation format**
3. **Adjust categorization criteria**
4. **Add domain-specific instructions**

```python
SYSTEM_PROMPT = """
# Customize this prompt to change agent behavior

## Your custom instructions here
...
"""
```

---

## Performance Characteristics

| Metric | Typical Value | Notes |
|--------|---------------|-------|
| Parallel scan threads | 4 | Configurable via `DEFAULT_MAX_WORKERS` |
| Preview size | 1500 chars | ~1 page of content |
| Full preview size | 3000 chars | ~2-3 pages |
| Document cache | In-memory | Keyed by path + mtime |
| Workflow timeout | 300 seconds | 5 minutes for complex queries |
| API model | gemini-2.0-flash | Fast, cost-effective |

---

## Security Considerations

1. **API Key**: Stored in environment variable `GOOGLE_API_KEY`
2. **Local Processing**: Documents parsed locally via Docling (no cloud upload)
3. **Filesystem Access**: Limited to current working directory
4. **No Persistent Storage**: Document cache is in-memory only

---

*Last updated: 2026-01-03*


================================================
FILE: CLAUDE.md
================================================
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

Agentic File Search is an AI-powered document search agent that explores files dynamically rather than using pre-computed embeddings. It uses a three-phase strategy: parallel scan, deep dive, and backtracking for cross-references. There is also an optional DuckDB-backed indexing pipeline for pre-indexed semantic+metadata retrieval.

**Tech Stack:** Python 3.10+, Google Gemini 3 Flash, LlamaIndex Workflows, Docling (document parsing), DuckDB (indexing), langextract (optional metadata extraction), FastAPI + WebSocket, Typer + Rich CLI.

## Common Commands

```bash
# Install dependencies
uv pip install .
uv pip install -e ".[dev]"  # with dev dependencies

# Run CLI (agentic exploration)
uv run explore --task "What is the purchase price?" --folder data/test_acquisition/

# Run CLI (indexed query - requires prior indexing)
uv run explore index data/test_acquisition/
uv run explore query --task "What is the purchase price?" --folder data/test_acquisition/

# Schema management
uv run explore schema discover data/test_acquisition/
uv run explore schema show data/test_acquisition/

# Run web UI
uv run uvicorn fs_explorer.server:app --host 127.0.0.1 --port 8000

# Run tests
uv run pytest                      # all tests
uv run pytest tests/test_fs.py     # single file
uv run pytest -k "test_name"       # single test

# Lint, format, typecheck (also available via Makefile)
uv run pre-commit run -a           # lint (or: make lint)
uv run ruff check .                # ruff only
uv run ruff format                 # format (or: make format)
uv run ty check src/fs_explorer/   # typecheck (or: make typecheck)
```

Entry points defined in `pyproject.toml`: `explore` → `fs_explorer.main:app`, `explore-ui` → `fs_explorer.server:run_server`.

## Architecture

### Core Flow (Agentic Mode)
```
User Query → Workflow (LlamaIndex) → Agent (Gemini) → Tools → Docling → Filesystem
```

### Core Flow (Indexed Mode)
```
User Query → Workflow → Agent → semantic_search/get_document → DuckDB → Ranked Results
```

### Key Modules (src/fs_explorer/)

- **workflow.py**: Event-driven orchestration using `llama-index-workflows`. Defines `FsExplorerWorkflow` with steps: `start_exploration`, `go_deeper_action`, `tool_call_action`, `receive_human_answer`. Uses singleton agent via `get_agent()`.

- **agent.py**: `FsExplorerAgent` manages Gemini API interaction. Chat history accumulates in `_chat_history`. `take_action()` sends history to LLM, receives structured JSON `Action`, auto-executes tool calls. `TokenUsage` tracks costs. Also contains the `TOOLS` registry (9 tools), `SYSTEM_PROMPT`, and indexed tool functions (`semantic_search`, `get_document`, `list_indexed_documents`). Index context is managed via module-level `set_index_context()`/`clear_index_context()`.

- **models.py**: Pydantic schemas for structured LLM output. `Action` contains one of: `ToolCallAction`, `GoDeeperAction`, `StopAction`, `AskHumanAction`. `Tools` TypeAlias defines all available tool names.

- **fs.py**: Filesystem operations. `scan_folder()` uses ThreadPoolExecutor for parallel document processing. `_DOCUMENT_CACHE` (dict) caches parsed documents keyed by `path:mtime`. Docling converts PDF/DOCX/PPTX/XLSX/HTML/MD to markdown.

- **main.py**: Typer CLI entry point with subcommands: default (agentic explore), `index`, `query`, `schema discover`, `schema show`.

- **server.py**: FastAPI server with WebSocket endpoint `/ws/explore` for real-time streaming.

- **exploration_trace.py**: Records tool call paths and extracts cited sources from final answers for the CLI summary.

### Indexing Subsystem (src/fs_explorer/indexing/)

- **pipeline.py**: `IndexingPipeline` orchestrates document parsing → chunking → metadata extraction → DuckDB upsert. Walks a folder for supported files, delegates to `SmartChunker` and `extract_metadata()`, handles schema resolution and deleted-file cleanup.

- **chunker.py**: `SmartChunker` splits parsed document text into overlapping chunks.

- **schema.py**: `SchemaDiscovery` auto-discovers metadata schemas from a corpus folder (file types, heuristic boolean fields like `mentions_currency`/`mentions_dates`). Optionally includes langextract fields.

- **metadata.py**: `extract_metadata()` produces per-document metadata dicts. Heuristic fields (filename, extension, document_type, currency/date detection) are always available. Optional langextract integration calls the `langextract` library for entity extraction (organizations, people, deal terms, etc.) via configurable profiles.

### Search Subsystem (src/fs_explorer/search/)

- **query.py**: `IndexedQueryEngine` runs parallel semantic (chunk text matching) + metadata (JSON filter) retrieval paths using ThreadPoolExecutor, then merges and ranks via `RankedDocument.combined_score`.

- **filters.py**: `parse_metadata_filters()` parses a human-readable filter DSL (`field=value`, `field>=num`, `field in (a, b)`, `field~substring`) into `MetadataFilter` objects. Validates against allowed schema fields.

- **ranker.py**: `RankedDocument` dataclass with `combined_score` (semantic * 100 + metadata * 10). `rank_documents()` sorts and limits.
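
  A minimal sketch of that weighting (field names other than `combined_score` are assumptions):

  ```python
  from dataclasses import dataclass

  @dataclass
  class RankedDocument:
      path: str
      semantic_score: float   # from chunk-text matching
      metadata_score: float   # from metadata-filter matching

      @property
      def combined_score(self) -> float:
          # Semantic relevance dominates; metadata match mostly breaks ties.
          return self.semantic_score * 100 + self.metadata_score * 10

  def rank_documents(docs: list[RankedDocument], limit: int = 5) -> list[RankedDocument]:
      return sorted(docs, key=lambda d: d.combined_score, reverse=True)[:limit]
  ```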

### Storage Subsystem (src/fs_explorer/storage/)

- **duckdb.py**: `DuckDBStorage` manages four tables: `corpora`, `documents`, `chunks`, `schemas`. Key operations: `upsert_document`, `search_chunks` (keyword-based scoring), `search_documents_by_metadata` (JSON path filtering via `json_extract_string`), schema CRUD. Corpus/doc/chunk IDs are SHA1-based stable hashes.

- **base.py**: `StorageBackend` protocol and shared dataclasses (`DocumentRecord`, `ChunkRecord`, `SchemaRecord`).

### Index Config

- **index_config.py**: `resolve_db_path()` resolves DuckDB path with precedence: CLI `--db-path` > `FS_EXPLORER_DB_PATH` env > `~/.fs_explorer/index.duckdb`.
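
  A sketch of that precedence (the exact signature in `index_config.py` may differ):

  ```python
  import os
  from pathlib import Path

  def resolve_db_path(cli_db_path: str | None = None) -> Path:
      if cli_db_path:                                        # 1. explicit --db-path wins
          return Path(cli_db_path)
      if env_path := os.environ.get("FS_EXPLORER_DB_PATH"):  # 2. then the env var
          return Path(env_path)
      return Path.home() / ".fs_explorer" / "index.duckdb"   # 3. default location
  ```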

### Workflow Event Types
- `InputEvent` → starts exploration
- `ToolCallEvent` → tool execution
- `GoDeeperEvent` → directory navigation
- `AskHumanEvent`/`HumanAnswerEvent` → human interaction
- `ExplorationEndEvent` → completion with `final_result` or `error`

### Adding New Tools
1. Implement function in `fs.py` (filesystem) or `agent.py` (indexed) returning `str`
2. Add to `TOOLS` dict in `agent.py`
3. Add to `Tools` TypeAlias in `models.py`
4. Update `SYSTEM_PROMPT` in `agent.py`
5. Update `TOOL_ICONS` and `PHASE_DESCRIPTIONS` in `main.py`

## Environment

- `GOOGLE_API_KEY` (required) — in `.env` file or environment variable
- `FS_EXPLORER_DB_PATH` (optional) — override default DuckDB location
- `FS_EXPLORER_LANGEXTRACT_MAX_CHARS` (optional) — max chars sent to langextract (default 6000)
- `FS_EXPLORER_LANGEXTRACT_MODEL` (optional) — model for langextract (default `gemini-3-flash-preview`)

## Testing

Tests mock the Gemini client via `MockGenAIClient` in `conftest.py`. Use `reset_agent()` to clear singleton state between tests. The mock always returns a `StopAction` response.

Key test files:
- `test_agent.py` / `test_e2e.py` — agent and workflow integration
- `test_fs.py` — filesystem tools
- `test_indexing.py` / `test_cli_indexing.py` — indexing pipeline and CLI
- `test_search.py` — search/filter/ranking
- `test_exploration_trace.py` — trace and citation extraction

Test documents live in `data/test_acquisition/` and `data/large_acquisition/`. Test fixtures for unit tests are in `tests/testfiles/`.


================================================
FILE: IMPLEMENTATION_PLAN.md
================================================
# Implementation Plan: Hybrid Semantic + Agentic Search (Revised)

## Overview

Add semantic search with optional metadata filtering to `agentic-file-search` without regressing the current agentic workflow.

The revised approach keeps the current CLI and behavior stable first, introduces indexing as opt-in, and only enables auto-detection after compatibility and quality checks pass.

- Storage: DuckDB + `vss` (embedded, local file)
- Embeddings: Gemini embeddings (API-backed)
- Metadata extraction: `langextract` (optional)
- Infrastructure model: no external database service (no Docker/Postgres required)

---

## Goals

1. Preserve existing `explore --task` behavior and UX by default.
2. Add a fast indexed path for large corpora.
3. Support metadata-aware filtering when metadata is available.
4. Keep agentic deep-read and cross-reference behavior available.

## Non-Goals (Initial Release)

1. Replacing the existing agentic strategy entirely.
2. Forcing index usage for all queries.
3. Heuristic/NLP folder extraction from free-form task text.

---

## Current Codebase Constraints to Respect

1. CLI currently has one root command (`explore --task`) and no subcommands.
2. Workflow and server currently use shared/global process state (`os.chdir`, singleton agent).
3. Existing tests assert the current 6-tool model and prompt behavior.

These constraints require a staged rollout to avoid breaking current users.

---

## High-Level Architecture

```text
INDEX TIME
├── Parse documents (Docling)
├── Chunk content (paragraph/sentence-aware)
├── Generate embeddings (provider-configured dimension)
├── [optional] Extract metadata (langextract)
└── Persist in DuckDB (corpus-scoped)

QUERY TIME
├── Retrieve by semantic search
├── [optional] Retrieve by metadata filter
├── Union + rank results
├── Expand via cross-references where needed
└── Agent continues deep exploration using existing tools
```

---

## Data Model (DuckDB)

Use corpus-scoped tables and file freshness fields to prevent collisions and stale indexes.

```sql
-- Install and load extension programmatically
-- INSTALL vss; LOAD vss;

CREATE TABLE IF NOT EXISTS corpora (
    id VARCHAR PRIMARY KEY,
    root_path VARCHAR NOT NULL UNIQUE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS documents (
    id VARCHAR PRIMARY KEY,
    corpus_id VARCHAR NOT NULL REFERENCES corpora(id),
    relative_path VARCHAR NOT NULL,
    absolute_path VARCHAR NOT NULL,
    content VARCHAR NOT NULL,
    metadata JSON NOT NULL DEFAULT '{}',
    file_mtime DOUBLE NOT NULL,
    file_size BIGINT NOT NULL,
    content_sha256 VARCHAR NOT NULL,
    last_indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    is_deleted BOOLEAN DEFAULT FALSE,
    UNIQUE(corpus_id, relative_path)
);

-- EMBEDDING_DIM is configured in code at index creation time.
CREATE TABLE IF NOT EXISTS chunks (
    id VARCHAR PRIMARY KEY,
    doc_id VARCHAR NOT NULL REFERENCES documents(id),
    text VARCHAR NOT NULL,
    embedding FLOAT[${EMBEDDING_DIM}] NOT NULL,
    embedding_dim INTEGER NOT NULL,
    position INTEGER NOT NULL,
    start_char INTEGER NOT NULL,
    end_char INTEGER NOT NULL
);

CREATE TABLE IF NOT EXISTS schemas (
    id INTEGER PRIMARY KEY,
    corpus_id VARCHAR REFERENCES corpora(id),
    name VARCHAR,
    schema_def JSON NOT NULL,
    is_active BOOLEAN DEFAULT FALSE,
    UNIQUE(corpus_id, name)
);

CREATE INDEX IF NOT EXISTS idx_chunks_embedding
ON chunks USING HNSW (embedding) WITH (metric = 'cosine');
```

### Embedding Dimension Rule

`EMBEDDING_DIM` must be a runtime config constant validated at startup. Do not hardcode `1536` across modules.
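
One way to satisfy this rule (a sketch; the `FS_EXPLORER_EMBEDDING_DIM` variable name is an assumption, not part of the plan):

```python
import os

# Single source of truth, read once at startup.
EMBEDDING_DIM = int(os.environ.get("FS_EXPLORER_EMBEDDING_DIM", "1536"))

def validate_embedding(vector: list[float]) -> list[float]:
    # Fail fast instead of writing mismatched vectors into DuckDB.
    if len(vector) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM}-dim embedding, got {len(vector)}")
    return vector
```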

### DB Location

Default: `~/.fs_explorer/index.duckdb`
Override via:
- `FS_EXPLORER_DB_PATH`
- CLI: `--db-path`

---

## CLI Contract and Rollout

### Compatibility Rules (Required)

1. `uv run explore --task "..."` must keep working as-is.
2. Existing non-indexed behavior remains default in initial rollout.
3. New indexed behavior is opt-in first.

### New Commands

```bash
# Index management
uv run explore index <folder>
uv run explore index <folder> --with-metadata
uv run explore index <folder> --schema schema.json

# Indexed query path
uv run explore query --task "..." --folder <folder> [--filter "..."]

# Schema inspection
uv run explore schema --discover <folder>
uv run explore schema --show --folder <folder>

# Existing command (backward-compatible)
uv run explore --task "..." [--folder <folder>] [--use-index]
```

### Folder Resolution (Deterministic)

For commands that need corpus selection:
1. If `--folder` is provided, use it.
2. Else use current working directory (`.`).
3. Do not parse folder intent from natural language task text in v1.

### Auto-Detection Strategy

- v1: explicit `--use-index` only.
- v2: optional auto-detect behind feature flag `FS_EXPLORER_AUTO_INDEX=1`.
- v3: default auto-detect only after parity tests and quality benchmarks pass.

---

## Server and Concurrency Requirements

Before adding indexing/search endpoints:

1. Remove request-level `os.chdir` usage; pass absolute target folder through workflow state.
2. Avoid global singleton agent across concurrent requests; instantiate per workflow run/session.
3. Add per-corpus index lock to avoid concurrent write corruption.
4. Keep read queries concurrent-safe.

---

## Module Structure

```text
src/fs_explorer/
├── storage/
│   ├── __init__.py
│   ├── base.py
│   └── duckdb.py
├── indexing/
│   ├── __init__.py
│   ├── pipeline.py
│   ├── chunker.py
│   ├── metadata.py
│   └── schema.py
├── search/
│   ├── __init__.py
│   ├── query.py
│   ├── semantic.py
│   ├── filters.py
│   └── ranker.py
├── embeddings.py
└── index_config.py
```

---

## Files to Modify

| File | Changes |
|------|---------|
| `src/fs_explorer/agent.py` | Add indexed tools and prompt guidance while keeping existing tools |
| `src/fs_explorer/models.py` | Extend `Tools` type alias |
| `src/fs_explorer/main.py` | Add subcommands + `--folder` + `--use-index` while preserving root command |
| `src/fs_explorer/workflow.py` | Remove global/shared run-state assumptions |
| `src/fs_explorer/fs.py` | Support safe path resolution without cwd mutation |
| `src/fs_explorer/server.py` | Add index/search endpoints and remove `os.chdir` coupling |
| `pyproject.toml` | Add `duckdb`, `langextract` |

---

## Implementation Phases

### Phase 0: Contracts and Safety (New)

1. Freeze CLI compatibility requirements (`explore --task` must remain stable).
2. Define deterministic folder resolution contract.
3. Define per-request state model for workflow/server.
4. Add failing tests for compatibility and concurrency assumptions.

### Phase 1: Storage + Embeddings

5. Implement `storage/base.py` (backend interface).
6. Implement `storage/duckdb.py` with corpus-scoped schema.
7. Implement `embeddings.py` with configurable embedding dimension.
8. Add storage/embedding tests (including dimension validation).

### Phase 2: Indexing Pipeline

9. Implement `indexing/chunker.py`.
10. Implement optional `indexing/metadata.py`.
11. Implement `indexing/schema.py`.
12. Implement `indexing/pipeline.py` with freshness checks (`mtime`, hash, deleted files).
13. Add indexing tests.

### Phase 3: Search Pipeline

14. Implement `search/filters.py`.
15. Implement `search/ranker.py`.
16. Implement `search/query.py` (parallel retrieval + union).
17. Implement cross-reference expansion hooks.
18. Add search tests.

### Phase 4: Agent Integration (Opt-in)

19. Add tools: `semantic_search`, `get_document`, `list_indexed_documents`.
20. Keep existing 6 filesystem tools available.
21. Add indexed prompt guidance without removing current strategy.
22. Add tool-selection tests for indexed and non-indexed paths.

### Phase 5: CLI + Server Integration

23. Add `explore index/query/schema` commands.
24. Add `--folder` and `--use-index` to root command.
25. Integrate indexed path into workflow when explicitly requested.
26. Add `/api/index` and `/api/search` endpoints.
27. Remove `os.chdir` in server workflow path.

### Phase 6: Auto-Detect Rollout (Guarded)

28. Add feature-flagged auto-detect (`FS_EXPLORER_AUTO_INDEX`).
29. Add parity checks between indexed and baseline runs on test corpora.
30. Keep fallback to legacy behavior on index errors.

### Phase 7: Testing and Docs

31. Full integration tests.
32. Backward compatibility tests.
33. Concurrency tests for WebSocket/API usage.
34. Performance benchmarks and docs updates.

---

## Revised Design Decisions

1. **Opt-in First**: indexed retrieval starts behind `--use-index` to avoid regressions.
2. **Deterministic Corpus Selection**: explicit `--folder` or `.` fallback only.
3. **Corpus-Scoped Storage**: avoid global path collisions by namespacing.
4. **Freshness Tracking**: incremental reindex using mtime/hash/deletion markers.
5. **No Global Request State**: remove `os.chdir` and shared singleton pitfalls in server flows.
6. **Configurable Embedding Dimension**: validated at runtime; not hardcoded everywhere.
7. **No External DB Service**: embedded local DB only; the Gemini LLM/embedding APIs remain the only external dependencies.

---

## Verification Steps

```bash
# Baseline safety (must stay green)
uv run pytest tests/test_models.py tests/test_fs.py tests/test_agent.py -v

# Phase 1-3
uv run pytest tests/test_storage.py tests/test_embeddings.py tests/test_search.py -v

# Index build + inspect
uv run explore index data/test_acquisition/
uv run python -c "import duckdb, os; db=duckdb.connect(os.path.expanduser('~/.fs_explorer/index.duckdb')); print(db.execute('SELECT COUNT(*) FROM documents').fetchone())"

# Opt-in indexed execution
uv run explore --task "Search for acquisition terms" --folder data/test_acquisition --use-index

# Compatibility execution (legacy path)
uv run explore --task "Look in data/test_acquisition/. Who is the CTO?"

# CLI checks
uv run explore --help
uv run explore index --help
uv run explore query --help
uv run explore schema --help

# Full suite
uv run pytest tests/ -v
```

---

## Dependencies to Add

```toml
# pyproject.toml
dependencies = [
    # ... existing ...
    "duckdb>=1.0.0",
    "langextract>=1.0.0",
]
```

---

## Critical Files Summary

| Purpose | Path |
|---------|------|
| Storage interface | `src/fs_explorer/storage/base.py` |
| DuckDB backend | `src/fs_explorer/storage/duckdb.py` |
| Embeddings | `src/fs_explorer/embeddings.py` |
| Chunking | `src/fs_explorer/indexing/chunker.py` |
| Metadata extraction | `src/fs_explorer/indexing/metadata.py` |
| Schema discovery | `src/fs_explorer/indexing/schema.py` |
| Indexing pipeline | `src/fs_explorer/indexing/pipeline.py` |
| Query pipeline | `src/fs_explorer/search/query.py` |
| Filter parsing | `src/fs_explorer/search/filters.py` |
| Result ranking | `src/fs_explorer/search/ranker.py` |
| Agent tools/prompt | `src/fs_explorer/agent.py` |
| Tool types | `src/fs_explorer/models.py` |
| CLI commands | `src/fs_explorer/main.py` |
| Workflow safety | `src/fs_explorer/workflow.py` |
| Server safety/endpoints | `src/fs_explorer/server.py` |


================================================
FILE: Makefile
================================================
.PHONY: test lint format format-check typecheck build

all: test lint format typecheck

test:
	$(info ****************** running tests ******************)
	uv run pytest tests

lint:
	$(info ****************** linting ******************)
	uv run pre-commit run -a

format:
	$(info ****************** formatting ******************)
	uv run ruff format

format-check:
	$(info ****************** checking formatting ******************)
	uv run ruff format --check

typecheck:
	$(info ****************** type checking ******************)
	uv run ty check src/fs_explorer/

build:
	$(info ****************** building ******************)
	uv build

================================================
FILE: README.md
================================================
# Agentic File Search

> **Based on**: [run-llama/fs-explorer](https://github.com/run-llama/fs-explorer) — The original CLI agent for filesystem exploration.

An AI-powered document search agent that explores files like a human would — scanning, reasoning, and following cross-references. Unlike traditional RAG systems that rely on pre-computed embeddings, this agent dynamically navigates documents to find answers.

## Why Agentic Search?

Traditional RAG (Retrieval-Augmented Generation) has limitations:
- **Chunks lose context** — Splitting documents destroys relationships between sections
- **Cross-references are invisible** — "See Exhibit B" means nothing to embeddings
- **Similarity ≠ Relevance** — Semantic matching misses logical connections

This system uses a **three-phase strategy**:
1. **Parallel Scan** — Preview all documents in a folder at once
2. **Deep Dive** — Full extraction on relevant documents only
3. **Backtrack** — Follow cross-references to previously skipped documents

## Watch the video
This video explains the architecture of the project and how to run it. 
[![Watch the demo on YouTube](https://img.youtube.com/vi/rMADSuus6jg/maxresdefault.jpg)](https://www.youtube.com/watch?v=rMADSuus6jg)

## Features

- 🔍 **6 Tools**: `scan_folder`, `preview_file`, `parse_file`, `read`, `grep`, `glob`
- 📄 **Document Support**: PDF, DOCX, PPTX, XLSX, HTML, Markdown (via Docling)
- 🤖 **Powered by**: Google Gemini 3 Flash with structured JSON output
- 💰 **Cost Efficient**: ~$0.001 per query with token tracking
- 🌐 **Web UI**: Real-time WebSocket streaming interface
- 📊 **Citations**: Answers include source references

## Installation

```bash
# Clone the repository
git clone https://github.com/PromtEngineer/agentic-file-search.git
cd agentic-file-search

# Install with uv (recommended)
uv pip install .

# Or with pip
pip install .
```

## Configuration

Create a `.env` file in the project root:

```bash
GOOGLE_API_KEY=your_api_key_here
```

Get your API key from [Google AI Studio](https://aistudio.google.com/apikey).

## Usage

### CLI

```bash
# Basic query
uv run explore --task "What is the purchase price in data/test_acquisition/?"

# Multi-document query
uv run explore --task "Look in data/large_acquisition/. What are all the financial terms including adjustments and escrow?"
```

### Web UI

```bash
# Start the server
uv run uvicorn fs_explorer.server:app --host 127.0.0.1 --port 8000

# Open http://127.0.0.1:8000 in your browser
```

The web UI provides:
- Folder browser to select target directory
- Real-time step-by-step execution log
- Final answer with citations
- Token usage and cost statistics

## Architecture

```
User Query
    ↓
┌─────────────────┐
│ Workflow Engine │ ←→ LlamaIndex Workflows (event-driven)
└────────┬────────┘
         ↓
┌─────────────────┐
│     Agent       │ ←→ Gemini 3 Flash (structured JSON)
└────────┬────────┘
         ↓
┌────────────────────────────────────────────────────┐
│ scan_folder │ preview │ parse │ read │ grep │ glob │
└────────────────────────────────────────────────────┘
                    ↓
              Document Parser (Docling - local)
```

See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed diagrams.

## Test Documents

The repo includes test document sets for evaluation:

- `data/test_acquisition/` — 10 interconnected legal documents
- `data/large_acquisition/` — 25 documents with extensive cross-references

Example queries:
```bash
# Simple (single doc)
uv run explore --task "Look in data/test_acquisition/. Who is the CTO?"

# Cross-reference required
uv run explore --task "Look in data/test_acquisition/. What is the adjusted purchase price?"

# Multi-document synthesis
uv run explore --task "Look in data/large_acquisition/. What happens to employees after the acquisition?"
```

## Tech Stack

| Component | Technology |
|-----------|------------|
| LLM | Google Gemini 3 Flash |
| Document Parsing | Docling (local, open-source) |
| Orchestration | LlamaIndex Workflows |
| CLI | Typer + Rich |
| Web Server | FastAPI + WebSocket |
| Package Manager | uv |

## Project Structure

```
src/fs_explorer/
├── agent.py      # Gemini client, token tracking
├── workflow.py   # LlamaIndex workflow engine
├── fs.py         # File tools: scan, parse, grep
├── models.py     # Pydantic models for actions
├── main.py       # CLI entry point
├── server.py     # FastAPI + WebSocket server
└── ui.html       # Single-file web interface
```

## Development

```bash
# Install dev dependencies
uv pip install -e ".[dev]"

# Run tests
uv run pytest

# Lint
uv run ruff check .
```

## License

MIT

## Acknowledgments

- Original concept from [run-llama/fs-explorer](https://github.com/run-llama/fs-explorer)
- Document parsing by [Docling](https://github.com/DS4SD/docling)
- Powered by [Google Gemini](https://deepmind.google/technologies/gemini/)

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=PromtEngineer/agentic-file-search&type=Date)](https://star-history.com/#PromtEngineer/agentic-file-search&Date)


================================================
FILE: YOUTUBE_DEMO_TESTS.md
================================================
# YouTube Demo: FS-Explorer Test Results

## System Overview

- **25 PDF documents** (~93 pages total)
- **63 cross-references** between documents
- **Parallel document scanning** using ThreadPoolExecutor
- **Three-phase exploration**: Scan → Filter → Deep Dive + Backtracking

---

## Test Results Summary

| Test | Question | Docs Parsed | Tokens Used | Cost | Time |
|------|----------|-------------|-------------|------|------|
| 1 | CTO & retention bonus | 1 | 27,376 | $0.0021 | ~30s |
| 2 | Litigation & insurance | 2 | 41,431 | $0.0033 | ~30s |
| 3 | Patents owned | 1 | 27,416 | $0.0022 | ~27s |

---

## TEST 1: Simple Fact Lookup ✅

**Question:**
```bash
uv run explore --task "Look in data/large_acquisition/. Who is the CTO and what is their retention bonus?"
```

**Agent Answer:**
> Dr. Sarah Chen is the CTO and their retention bonus is $1,200,000.

**Verification (from 07_employee_schedule.pdf):**
```
1. Dr. Sarah Chen - CTO - 15 years experience - Retention bonus: $1,200,000
```

| Metric | Value |
|--------|-------|
| Documents Scanned | 26 (parallel) |
| Documents Parsed | 1 |
| Tokens Used | 27,376 |
| Cost | $0.0021 |
| ✅ Accuracy | **100% Correct** |

---

## TEST 2: Cross-Reference Following ✅

**Question:**
```bash
uv run explore --task "Look in data/large_acquisition/. What is the total litigation exposure and is it covered by insurance?"
```

**Agent Answer:**
> The total litigation exposure is between $650,000 and $2,350,000. The Smith v. InnovateTech Solutions, Inc. case, with an exposure of $150,000 - $350,000, is covered by insurance. The DataTech LLC v. InnovateTech Solutions, Inc. case, with an exposure of $500,000 - $2,000,000, does not specify if it is covered by insurance.

**Verification (from 12_litigation_schedule.pdf):**
```
1. Smith v. InnovateTech Solutions, Inc.
   Exposure: $150,000 - $350,000 (covered by insurance)

2. DataTech LLC v. InnovateTech Solutions, Inc.
   Exposure: Preliminary assessment $500,000 - $2,000,000
```

**Cross-Reference Followed:**
- Agent noticed reference to "Legal Opinion Letter" 
- Parsed `24_legal_opinion.pdf` to get complete picture

| Metric | Value |
|--------|-------|
| Documents Scanned | 26 (parallel) |
| Documents Parsed | 2 (litigation + legal opinion) |
| Tokens Used | 41,431 |
| Cost | $0.0033 |
| ✅ Accuracy | **100% Correct** |

---

## TEST 3: Multi-Document IP Analysis ✅

**Question:**
```bash
uv run explore --task "Look in data/large_acquisition/. What patents does the company own and are there any pending applications?"
```

**Agent Answer:**
> The company owns the following patents: 
> - US Patent 10,123,456 - 'Machine Learning System for Predictive Analytics' - Issued 2021
> - US Patent 10,234,567 - 'Distributed Data Processing Architecture' - Issued 2022
> - US Patent 10,345,678 - 'Real-time Anomaly Detection Method' - Issued 2023
> 
> There is one pending application: US Application 17/456,789 - 'Automated Workflow Optimization' - Filed 2024.

**Verification (from 06_ip_schedule.pdf):**
```
US Patent 10,123,456 - 'Machine Learning System for Predictive Analytics' - Issued 2021
US Patent 10,234,567 - 'Distributed Data Processing Architecture' - Issued 2022
US Patent 10,345,678 - 'Real-time Anomaly Detection Method' - Issued 2023
Pending: US Application 17/456,789 - 'Automated Workflow Optimization' - Filed 2024
```

| Metric | Value |
|--------|-------|
| Documents Scanned | 26 (parallel) |
| Documents Parsed | 1 |
| Tokens Used | 27,416 |
| Cost | $0.0022 |
| ✅ Accuracy | **100% Correct** |

---

## Additional Demo Tests

### Purchase Price & Payment Structure
```bash
uv run explore --task "Look in data/large_acquisition/. What is the total purchase price and how is it being paid?"
```
**Expected:** $125M total ($80M cash + $30M stock + $15M escrow)

### Closing Conditions Status
```bash
uv run explore --task "Look in data/large_acquisition/. What are all the conditions that must be satisfied before closing and what is the status of each?"
```
**Expected:** HSR ✅, State filings ✅, MegaCorp consent ✅, GlobalBank pending, Employee retention ✅, Legal opinion ✅, Good standing ordered

### Key Employee Compensation
```bash
uv run explore --task "Look in data/large_acquisition/. List all the key employees and their retention bonuses"
```
**Expected:** 5 employees totaling $3.5M in retention bonuses

---

## Key Architecture Points to Highlight

### 1. Parallel Scanning (scan_folder)
- Scans ALL 26 documents simultaneously using ThreadPoolExecutor
- Takes ~25 seconds for entire folder
- Returns quick preview of each document
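
A minimal sketch of the fan-out, assuming a hypothetical `preview_file` helper in place of the real preview tool:

```python
# Illustrative parallel scan: preview every file in a folder concurrently.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def preview_file(path: Path) -> str:
    """Stand-in for the real preview tool (which handles PDFs via Docling)."""
    return path.read_bytes()[:500].decode("utf-8", errors="ignore")

def scan_folder(folder: str, max_workers: int = 8) -> dict[str, str]:
    paths = sorted(p for p in Path(folder).iterdir() if p.is_file())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        previews = pool.map(preview_file, paths)  # preserves input order
    return {p.name: text for p, text in zip(paths, previews)}
```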

### 2. Smart Filtering
- LLM reviews all previews at once
- Identifies which documents are relevant
- Avoids parsing irrelevant documents
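
A sketch of the single-shot filtering prompt (the wording is an assumption, not the project's actual prompt):

```python
# All previews go to the model in one request; it returns the files worth
# parsing in full, which keeps token spend proportional to what's relevant.
def build_filter_prompt(task: str, previews: dict[str, str]) -> str:
    listing = "\n".join(f"- {name}: {text[:200]}" for name, text in previews.items())
    return (
        f"Task: {task}\n\nDocument previews:\n{listing}\n\n"
        "Return a JSON list of the filenames that must be parsed in full."
    )
```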

### 3. Cross-Reference Discovery
- Agent watches for document references like:
  - "See Document: Legal Opinion Letter"
  - "Per Document: Risk Assessment Memo"
- Automatically follows references (backtracking)
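
A sketch of how such references can be extracted (the regex targets the `See Document:` / `Per Document:` phrasing used in the test corpus):

```python
import re

# Matches "See Document: <title>" and "Per Document: <title>".
REF_PATTERN = re.compile(
    r"(?:See|Per) Document:\s*([A-Za-z0-9 ,.'&\-]+)", re.IGNORECASE
)

def find_references(text: str) -> list[str]:
    return [m.strip() for m in REF_PATTERN.findall(text)]

find_references("Opinion: See Document: Legal Opinion Letter")
# -> ['Legal Opinion Letter']
```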

### 4. Document Caching
- Documents cached after first parse
- Backtracking is free (no re-parsing)
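
In its simplest form this is memoization (a sketch; the real pipeline parses PDFs with Docling rather than reading plain text):

```python
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=None)
def parse_document(path: str) -> str:
    # First call pays the parsing cost; backtracking re-reads hit the cache.
    return Path(path).read_text(errors="ignore")  # placeholder for Docling parse
```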

---

## Cost Analysis

| Scenario | Tokens | Est. Cost |
|----------|--------|-----------|
| Simple query (1 doc) | ~27K | $0.002 |
| Cross-ref query (2-3 docs) | ~40K | $0.003 |
| Complex synthesis (5+ docs) | ~60K | $0.005 |
| All 25 documents parsed | ~150K | $0.012 |

**Key Insight:** Even with 25 documents, costs are minimal because the system only parses what's needed!
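
The rows above are consistent with back-of-envelope math at a blended rate of about $0.08 per million tokens (an assumed figure for illustration; real Gemini pricing splits input and output tokens):

```python
# Back-of-envelope cost estimate at an assumed blended token rate.
BLENDED_RATE_PER_M_TOKENS = 0.08  # assumed $/1M tokens

def est_cost(tokens: int) -> float:
    return tokens / 1_000_000 * BLENDED_RATE_PER_M_TOKENS

print(round(est_cost(27_000), 4))   # 0.0022 -> matches the simple-query row
print(round(est_cost(150_000), 4))  # 0.012  -> matches the all-docs row
```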

---

## Commands to Run Demo

```bash
# Setup
cd /path/to/fs-explorer
export GOOGLE_API_KEY="your-key"

# Run any test
uv run explore --task "Look in data/large_acquisition/. [YOUR QUESTION]"
```

---

## What to Show in Video

1. **The folder scan** - Watch as 26 documents are scanned in parallel
2. **Smart filtering** - Note which documents the agent CHOOSES to parse
3. **Cross-reference following** - Show agent backtracking to referenced docs
4. **Token usage summary** - Highlight the efficiency stats at the end
5. **Verification** - Show the actual PDF content matches the answer



================================================
FILE: data/large_acquisition/TEST_QUESTIONS.md
================================================
# Test Questions for Large Document Set

## Document Overview
- 25 interconnected documents
- Each document 3-6 pages
- Extensive cross-references between documents
- Total content: ~100+ pages

## Test Questions

### Level 1: Single Document (Easy)
```bash
uv run explore --task "Look in data/large_acquisition/. What is the total purchase price?"
uv run explore --task "Look in data/large_acquisition/. Who is the CTO and what is their retention bonus?"
uv run explore --task "Look in data/large_acquisition/. What patents does the company own?"
```

### Level 2: Cross-Reference Required (Medium)
```bash
uv run explore --task "Look in data/large_acquisition/. What customer consents are required and what is their status?"
uv run explore --task "Look in data/large_acquisition/. What is the total litigation exposure and is it covered by insurance?"
uv run explore --task "Look in data/large_acquisition/. How is the purchase price being paid and what are the escrow terms?"
```

### Level 3: Multi-Document Synthesis (Hard)
```bash
uv run explore --task "Look in data/large_acquisition/. What are all the conditions that must be satisfied before closing and what is the status of each?"
uv run explore --task "Look in data/large_acquisition/. Provide a complete picture of MegaCorp's relationship with the company - revenue, contract terms, consent status, and any risks."
uv run explore --task "Look in data/large_acquisition/. What are all the financial terms of this deal including adjustments, escrow, earnouts, and stock?"
```

### Level 4: Deep Cross-Reference (Expert)
```bash
uv run explore --task "Look in data/large_acquisition/. Trace all references to the Legal Opinion Letter - what documents cite it and what opinions does it provide?"
uv run explore --task "Look in data/large_acquisition/. Create a complete picture of IP assets - patents, trademarks, assignments, and any related risks or litigation."
uv run explore --task "Look in data/large_acquisition/. What happens after closing? List all post-closing obligations, their timelines, and related documents."
```


================================================
FILE: data/test_acquisition/TEST_QUESTIONS.md
================================================
# Test Questions for Document Exploration

These questions are designed to test the two-stage document exploration approach with cross-reference discovery.

## Test Scenario

**Context:** TechCorp Industries is acquiring StartupXYZ LLC. There are 10 documents in this folder related to the acquisition.

---

## Question Set 1: Simple (Single Document)

These questions can be answered from a single document:

```bash
# Q1: What is the purchase price?
explore --task "What is the total purchase price for the StartupXYZ acquisition?"

# Q2: When did the NDA get signed?
explore --task "When was the Non-Disclosure Agreement between TechCorp and StartupXYZ signed?"

# Q3: How many patents does StartupXYZ have?
explore --task "How many patents does StartupXYZ own?"
```

**Expected Behavior:**
- Agent should preview documents
- Identify the relevant document quickly
- Parse only that document for the answer

---

## Question Set 2: Medium (2-3 Documents with Cross-References)

These questions require following cross-references:

```bash
# Q4: What risks were identified and how were they addressed?
explore --task "What are the key risks identified in this acquisition and what mitigation measures were put in place?"

# Q5: What's the adjusted purchase price?
explore --task "The original purchase price was $45M. Were there any adjustments? What is the final amount?"

# Q6: What happened with customer consents?
explore --task "Which customers required consent for the acquisition, and was consent obtained from all of them?"
```

**Expected Behavior:**
- Agent previews documents
- Reads Risk Assessment Memo
- Notices references to Financial Adjustments, Customer Consents
- Follows cross-references to get complete picture

---

## Question Set 3: Complex (Multiple Documents, Deep Cross-References)

These questions require synthesizing information from many documents:

```bash
# Q7: Complete IP status
explore --task "Give me a complete picture of StartupXYZ's intellectual property - what do they own, is it properly certified, and are there any pending matters or risks?"

# Q8: Due diligence findings and resolution
explore --task "What did the due diligence process uncover, and how were any issues resolved before closing?"

# Q9: Full timeline and status
explore --task "Create a timeline of this acquisition from NDA signing to closing. What are the key milestones and their status?"

# Q10: Closing readiness
explore --task "Is this acquisition ready to close? What items are complete and what's still pending?"
```

**Expected Behavior:**
- Agent should preview all documents first
- Read the most relevant documents (e.g., Closing Checklist references everything)
- Follow cross-references to IP Certification, Due Diligence, Risk Assessment, etc.
- Synthesize information from 5+ documents

---

## Question Set 4: Adversarial (Tests Cross-Reference Discovery)

These questions specifically test if the agent goes back to previously-skipped documents:

```bash
# Q11: Following exhibit references
explore --task "The Acquisition Agreement mentions 'Exhibit A - Financial Terms'. What are the detailed financial terms?"

# Q12: Understanding document relationships  
explore --task "How does the Legal Opinion Letter relate to other documents in this acquisition?"

# Q13: Hidden connection
explore --task "Is there anything about MegaCorp in these documents? Why are they important to this deal?"
```

**Expected Behavior:**
- Q11: Agent might initially skip Financial Adjustments, but should go back when Acquisition Agreement references Exhibit A
- Q12: Agent should trace both the documents that reference the Legal Opinion and the documents it references
- Q13: MegaCorp is mentioned in Due Diligence, Risk Assessment, and Customer Consents - agent should connect the dots

---

## Scoring Rubric

| Metric | Description |
|--------|-------------|
| **Preview Usage** | Did the agent use `preview_file` before `parse_file`? |
| **Selective Parsing** | Did the agent avoid parsing irrelevant documents? |
| **Cross-Reference Discovery** | Did the agent follow document references? |
| **Backtracking** | Did the agent return to previously-skipped documents when needed? |
| **Answer Completeness** | Was the final answer comprehensive and accurate? |

---

## Running a Test

```bash
export GOOGLE_API_KEY="your-key"
cd /path/to/fs-explorer
uv run explore --task "YOUR QUESTION HERE"
```

Watch for:
1. Which documents get previewed
2. Which documents get fully parsed
3. Whether the agent mentions cross-references
4. Whether the agent goes back to read referenced documents



================================================
FILE: data/testfile.txt
================================================
This is a test.

================================================
FILE: docker/docker-compose.yml
================================================
version: '3.8'

services:
  postgres:
    image: pgvector/pgvector:pg17
    container_name: fs-explorer-db
    environment:
      POSTGRES_USER: ${POSTGRES_USER:-fs_explorer}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-devpassword}
      POSTGRES_DB: ${POSTGRES_DB:-fs_explorer}
    ports:
      - "${POSTGRES_PORT:-5432}:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U fs_explorer -d fs_explorer"]
      interval: 5s
      timeout: 5s
      retries: 5
    restart: unless-stopped

volumes:
  postgres_data:


================================================
FILE: pyproject.toml
================================================
[build-system]
requires = ["uv_build>=0.9.10,<0.10.0"]
build-backend = "uv_build"

[project]
name = "fs-explorer"
version = "0.1.0"
description = "Explore and understand your filesystem better with AI."
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "docling>=2.55.0",
    "duckdb>=1.0.0",
    "fastapi>=0.115.0",
    "google-genai>=1.55.0",
    "langextract>=1.0.0",
    "llama-index-workflows>=2.11.5",
    "python-dotenv>=1.0.0",
    "reportlab>=4.4.7",
    "rich>=13.0.0",
    "typer>=0.12.5,<0.20.0",
    "uvicorn>=0.34.0",
    "websockets>=14.0",
]

[dependency-groups]
dev = [
    "pre-commit>=4.5.0",
    "pytest>=9.0.2",
    "pytest-asyncio>=1.3.0",
    "ruff>=0.14.9",
    "ty>=0.0.1a33",
]

[project.scripts]
explore = "fs_explorer.main:app"
explore-ui = "fs_explorer.server:run_server"


================================================
FILE: scripts/generate_large_docs.py
================================================
#!/usr/bin/env python3
"""
Generate a large set of interconnected legal documents for testing.
Creates 25 documents of roughly 2-6 pages each, with extensive cross-references.
"""

import os
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch

OUTPUT_DIR = "data/large_acquisition"

# Document metadata with cross-references
DOCUMENTS = {
    "01_master_agreement": {
        "title": "MASTER ACQUISITION AGREEMENT",
        "refs": ["02_schedules", "03_exhibits", "04_disclosure_schedules", "05_ancillary_agreements"],
        "pages": 5
    },
    "02_schedules": {
        "title": "SCHEDULES TO ACQUISITION AGREEMENT", 
        "refs": ["01_master_agreement", "06_ip_schedule", "07_employee_schedule", "08_contract_schedule"],
        "pages": 4
    },
    "03_exhibits": {
        "title": "EXHIBITS TO ACQUISITION AGREEMENT",
        "refs": ["01_master_agreement", "09_escrow_agreement", "10_stock_purchase"],
        "pages": 3
    },
    "04_disclosure_schedules": {
        "title": "SELLER DISCLOSURE SCHEDULES",
        "refs": ["01_master_agreement", "11_financial_statements", "12_litigation_schedule"],
        "pages": 5
    },
    "05_ancillary_agreements": {
        "title": "ANCILLARY AGREEMENTS INDEX",
        "refs": ["13_nda", "14_non_compete", "15_consulting_agreement", "16_transition_services"],
        "pages": 2
    },
    "06_ip_schedule": {
        "title": "SCHEDULE 3.12 - INTELLECTUAL PROPERTY",
        "refs": ["01_master_agreement", "17_patent_assignments", "18_trademark_registrations"],
        "pages": 4
    },
    "07_employee_schedule": {
        "title": "SCHEDULE 3.15 - EMPLOYEE MATTERS",
        "refs": ["01_master_agreement", "19_retention_agreements", "20_benefit_plans"],
        "pages": 4
    },
    "08_contract_schedule": {
        "title": "SCHEDULE 3.13 - MATERIAL CONTRACTS",
        "refs": ["01_master_agreement", "21_customer_contracts", "22_vendor_contracts"],
        "pages": 5
    },
    "09_escrow_agreement": {
        "title": "ESCROW AGREEMENT",
        "refs": ["01_master_agreement", "03_exhibits", "11_financial_statements"],
        "pages": 4
    },
    "10_stock_purchase": {
        "title": "STOCK PURCHASE DETAILS - EXHIBIT B",
        "refs": ["01_master_agreement", "11_financial_statements"],
        "pages": 3
    },
    "11_financial_statements": {
        "title": "AUDITED FINANCIAL STATEMENTS",
        "refs": ["04_disclosure_schedules", "23_audit_report"],
        "pages": 6
    },
    "12_litigation_schedule": {
        "title": "SCHEDULE 3.9 - LITIGATION AND CLAIMS",
        "refs": ["04_disclosure_schedules", "24_legal_opinion"],
        "pages": 3
    },
    "13_nda": {
        "title": "NON-DISCLOSURE AGREEMENT",
        "refs": ["01_master_agreement"],
        "pages": 3
    },
    "14_non_compete": {
        "title": "NON-COMPETITION AGREEMENT",
        "refs": ["01_master_agreement", "07_employee_schedule"],
        "pages": 3
    },
    "15_consulting_agreement": {
        "title": "CONSULTING AGREEMENT - FOUNDER",
        "refs": ["01_master_agreement", "07_employee_schedule", "19_retention_agreements"],
        "pages": 4
    },
    "16_transition_services": {
        "title": "TRANSITION SERVICES AGREEMENT",
        "refs": ["01_master_agreement", "25_closing_checklist"],
        "pages": 4
    },
    "17_patent_assignments": {
        "title": "PATENT ASSIGNMENT AGREEMENTS",
        "refs": ["06_ip_schedule", "01_master_agreement"],
        "pages": 3
    },
    "18_trademark_registrations": {
        "title": "TRADEMARK REGISTRATION SCHEDULE",
        "refs": ["06_ip_schedule"],
        "pages": 2
    },
    "19_retention_agreements": {
        "title": "KEY EMPLOYEE RETENTION AGREEMENTS",
        "refs": ["07_employee_schedule", "15_consulting_agreement"],
        "pages": 4
    },
    "20_benefit_plans": {
        "title": "EMPLOYEE BENEFIT PLAN SCHEDULE",
        "refs": ["07_employee_schedule"],
        "pages": 3
    },
    "21_customer_contracts": {
        "title": "MAJOR CUSTOMER CONTRACT SUMMARIES",
        "refs": ["08_contract_schedule", "01_master_agreement"],
        "pages": 5
    },
    "22_vendor_contracts": {
        "title": "MAJOR VENDOR CONTRACT SUMMARIES",
        "refs": ["08_contract_schedule"],
        "pages": 3
    },
    "23_audit_report": {
        "title": "INDEPENDENT AUDITOR'S REPORT",
        "refs": ["11_financial_statements", "04_disclosure_schedules"],
        "pages": 4
    },
    "24_legal_opinion": {
        "title": "LEGAL OPINION LETTER",
        "refs": ["01_master_agreement", "12_litigation_schedule", "06_ip_schedule"],
        "pages": 3
    },
    "25_closing_checklist": {
        "title": "CLOSING CHECKLIST AND CONDITIONS",
        "refs": ["01_master_agreement", "09_escrow_agreement", "16_transition_services", 
                 "17_patent_assignments", "21_customer_contracts"],
        "pages": 4
    }
}

def generate_content(doc_id: str, meta: dict) -> list:
    """Generate realistic legal document content."""
    styles = getSampleStyleSheet()
    title_style = ParagraphStyle('Title', parent=styles['Heading1'], fontSize=16, spaceAfter=20)
    heading_style = ParagraphStyle('Heading', parent=styles['Heading2'], fontSize=12, spaceAfter=10)
    body_style = ParagraphStyle('Body', parent=styles['Normal'], fontSize=10, spaceAfter=8, leading=14)
    
    content = []
    
    # Title
    content.append(Paragraph(meta["title"], title_style))
    content.append(Spacer(1, 0.3*inch))
    
    # Document intro with cross-references
    refs_text = ", ".join([f"Document: {DOCUMENTS[r]['title']}" for r in meta["refs"][:3]])
    intro = f"""
    This document is part of the acquisition transaction between GlobalTech Corporation ("Buyer") 
    and InnovateTech Solutions, Inc. ("Seller") dated as of February 15, 2025. This document should 
    be read in conjunction with {refs_text}, and all other transaction documents.
    """
    content.append(Paragraph(intro.strip(), body_style))
    content.append(Spacer(1, 0.2*inch))
    
    # Generate sections based on document type
    sections = generate_sections(doc_id, meta)
    for section_title, section_content in sections:
        content.append(Paragraph(section_title, heading_style))
        for para in section_content:
            content.append(Paragraph(para, body_style))
        content.append(Spacer(1, 0.15*inch))
    
    return content

def generate_sections(doc_id: str, meta: dict) -> list:
    """Generate document-specific sections with legal content."""
    sections = []
    
    # Add document-specific content
    if "master_agreement" in doc_id:
        sections = [
            ("ARTICLE I - DEFINITIONS", [
                "1.1 'Acquisition' means the purchase by Buyer of all outstanding capital stock of Seller.",
                "1.2 'Purchase Price' means One Hundred Twenty-Five Million Dollars ($125,000,000), subject to adjustments.",
                "1.3 'Closing Date' means April 1, 2025, or such other date as mutually agreed.",
                "1.4 'Material Adverse Effect' means any change that is materially adverse to the business of Seller.",
                "1.5 'Knowledge of Seller' means the actual knowledge of the officers listed in Schedule 1.5.",
            ]),
            ("ARTICLE II - PURCHASE AND SALE", [
                "2.1 Subject to the terms hereof, Seller agrees to sell and Buyer agrees to purchase all Shares.",
                "2.2 The Purchase Price shall be paid as follows: (a) $80,000,000 in cash at Closing; "
                "(b) $30,000,000 in Buyer common stock per Document: Stock Purchase Details - Exhibit B; "
                "(c) $15,000,000 in escrow per Document: Escrow Agreement.",
                "2.3 Purchase Price adjustments are detailed in Document: Audited Financial Statements.",
                "2.4 Working capital target is $8,500,000 as calculated per Schedule 2.4.",
            ]),
            ("ARTICLE III - REPRESENTATIONS AND WARRANTIES", [
                "3.1 Organization. Seller is duly organized under Delaware law.",
                "3.9 Litigation. Except as set forth in Document: Schedule 3.9 - Litigation and Claims, "
                "there are no pending legal proceedings against Seller.",
                "3.12 Intellectual Property. All IP is listed in Document: Schedule 3.12 - Intellectual Property. "
                "Patent assignments are documented in Document: Patent Assignment Agreements.",
                "3.13 Material Contracts. All contracts exceeding $100,000 annually are in Document: Schedule 3.13 - Material Contracts.",
                "3.15 Employees. Employee matters are disclosed in Document: Schedule 3.15 - Employee Matters.",
            ]),
            ("ARTICLE IV - COVENANTS", [
                "4.1 Conduct of Business. Prior to Closing, Seller shall operate in ordinary course.",
                "4.2 Access. Seller shall provide Buyer access to facilities, books, and records.",
                "4.3 Confidentiality. Parties shall comply with Document: Non-Disclosure Agreement.",
                "4.4 Non-Competition. Key employees shall execute Document: Non-Competition Agreement.",
            ]),
            ("ARTICLE V - CONDITIONS TO CLOSING", [
                "5.1 Buyer's conditions: (a) accuracy of representations; (b) material consents obtained; "
                "(c) no Material Adverse Effect; (d) receipt of Document: Legal Opinion Letter.",
                "5.2 Regulatory approvals as specified in Document: Closing Checklist and Conditions.",
                "5.3 Third-party consents from customers in Document: Major Customer Contract Summaries.",
            ]),
        ]
    elif "financial" in doc_id:
        sections = [
            ("BALANCE SHEET", [
                "As of December 31, 2024:",
                "Total Assets: $47,250,000 (Current: $18,500,000; Non-current: $28,750,000)",
                "Total Liabilities: $12,300,000 (Current: $8,200,000; Long-term: $4,100,000)",
                "Stockholders' Equity: $34,950,000",
                "Working Capital: $10,300,000 (above target of $8,500,000 per Document: Master Acquisition Agreement)",
            ]),
            ("INCOME STATEMENT", [
                "For fiscal year ended December 31, 2024:",
                "Total Revenue: $52,400,000 (SaaS: $41,920,000; Professional Services: $10,480,000)",
                "Cost of Revenue: $15,720,000 (Gross Margin: 70%)",
                "Operating Expenses: $28,600,000 (R&D: $12,100,000; S&M: $11,500,000; G&A: $5,000,000)",
                "Operating Income: $8,080,000 (EBITDA: $11,200,000)",
                "Net Income: $6,464,000",
            ]),
            ("REVENUE BREAKDOWN BY CUSTOMER", [
                "Top 5 customers represent 62% of revenue (see Document: Major Customer Contract Summaries):",
                "1. MegaCorp Industries: $12,576,000 (24%) - Contract through 2027",
                "2. GlobalBank Holdings: $8,384,000 (16%) - Renewal pending",
                "3. HealthFirst Systems: $5,240,000 (10%) - Multi-year agreement",
                "4. RetailMax Inc.: $3,668,000 (7%) - Expansion discussion ongoing",
                "5. TechPrime Solutions: $2,620,000 (5%) - New customer 2024",
            ]),
            ("NOTES TO FINANCIAL STATEMENTS", [
                "Note 1: Significant Accounting Policies - Revenue recognized per ASC 606.",
                "Note 2: Deferred Revenue of $4,200,000 represents prepaid annual subscriptions.",
                "Note 3: Contingent liabilities detailed in Document: Schedule 3.9 - Litigation and Claims.",
                "Note 4: Related party transactions with founder disclosed in Document: Consulting Agreement - Founder.",
            ]),
        ]
    elif "ip_schedule" in doc_id or "patent" in doc_id:
        sections = [
            ("PATENTS", [
                "Seller owns or has rights to the following patents:",
                "US Patent 10,123,456 - 'Machine Learning System for Predictive Analytics' - Issued 2021",
                "US Patent 10,234,567 - 'Distributed Data Processing Architecture' - Issued 2022",
                "US Patent 10,345,678 - 'Real-time Anomaly Detection Method' - Issued 2023",
                "Pending: US Application 17/456,789 - 'Automated Workflow Optimization' - Filed 2024",
                "Assignment agreements in Document: Patent Assignment Agreements.",
            ]),
            ("TRADEMARKS", [
                "Registered trademarks (see Document: Trademark Registration Schedule):",
                "INNOVATETECH (word mark) - Reg. No. 5,123,456 - Software services",
                "INNOVATETECH (logo) - Reg. No. 5,234,567 - Software services",
                "DATAFLOW PRO - Reg. No. 5,345,678 - Data analytics software",
            ]),
            ("TRADE SECRETS AND KNOW-HOW", [
                "Seller maintains trade secrets including proprietary algorithms and processes.",
                "All employees have executed invention assignment agreements per Document: Schedule 3.15 - Employee Matters.",
                "Key technical personnel retention addressed in Document: Key Employee Retention Agreements.",
            ]),
        ]
    elif "employee" in doc_id or "retention" in doc_id:
        sections = [
            ("EMPLOYEE CENSUS", [
                "Total Employees: 127 (Full-time: 120; Part-time: 7)",
                "Engineering: 68 employees (Senior: 24; Mid-level: 32; Junior: 12)",
                "Sales & Marketing: 28 employees",
                "Customer Success: 18 employees",
                "G&A: 13 employees",
            ]),
            ("KEY EMPLOYEES", [
                "The following are Key Employees subject to Document: Key Employee Retention Agreements:",
                "1. Dr. Sarah Chen - CTO - 15 years experience - Retention bonus: $1,200,000",
                "2. Michael Rodriguez - VP Engineering - Leads 45-person team - Retention: $800,000",
                "3. Jennifer Walsh - VP Sales - $18M quota achievement - Retention: $600,000",
                "4. David Kim - Principal Architect - Core platform expertise - Retention: $500,000",
                "5. Amanda Foster - VP Customer Success - 95% retention rate - Retention: $400,000",
                "Founder consulting terms in Document: Consulting Agreement - Founder.",
            ]),
            ("BENEFIT PLANS", [
                "Active benefit plans (details in Document: Employee Benefit Plan Schedule):",
                "401(k) Plan - Company match 4% - $2.1M annual cost",
                "Health Insurance - PPO and HMO options - $1.8M annual cost",
                "Stock Option Plan - 2,500,000 shares reserved - 1,800,000 granted",
                "Treatment of equity awards addressed in Document: Master Acquisition Agreement Section 2.6.",
            ]),
        ]
    elif "customer" in doc_id or "contract_schedule" in doc_id:
        sections = [
            ("MATERIAL CUSTOMER CONTRACTS", [
                "Contracts with annual value exceeding $500,000:",
                "",
                "1. MEGACORP INDUSTRIES - Master Services Agreement",
                "   Annual Value: $12,576,000 | Term: Through December 2027",
                "   Change of Control: Consent required (OBTAINED February 8, 2025)",
                "   Renewal Terms: Auto-renew with 90-day notice",
                "",
                "2. GLOBALBANK HOLDINGS - Enterprise License Agreement",
                "   Annual Value: $8,384,000 | Term: Through June 2025",
                "   Change of Control: 60-day notice required (PROVIDED January 15, 2025)",
                "   Renewal: Currently in negotiation for 3-year extension",
                "",
                "3. HEALTHFIRST SYSTEMS - SaaS Subscription Agreement",
                "   Annual Value: $5,240,000 | Term: Through December 2026",
                "   Change of Control: No restrictions",
                "",
                "See Document: Closing Checklist and Conditions for consent status.",
            ]),
            ("CONSENT REQUIREMENTS", [
                "Customer consents required for acquisition (per Document: Master Acquisition Agreement):",
                "- MegaCorp Industries: OBTAINED (see Exhibit A hereto)",
                "- GlobalBank Holdings: NOTICE PROVIDED (awaiting acknowledgment)",
                "- Other customers: No consent required",
                "Risk assessment in Document: Legal Opinion Letter.",
            ]),
        ]
    elif "litigation" in doc_id:
        sections = [
            ("PENDING LITIGATION", [
                "1. Smith v. InnovateTech Solutions, Inc.",
                "   Court: California Superior Court, Santa Clara County",
                "   Claims: Wrongful termination, discrimination",
                "   Status: Discovery phase; trial set for September 2025",
                "   Exposure: $150,000 - $350,000 (covered by insurance)",
                "   Opinion: See Document: Legal Opinion Letter",
                "",
                "2. DataTech LLC v. InnovateTech Solutions, Inc.",
                "   Court: US District Court, Northern District of California",
                "   Claims: Patent infringement (US Patent 9,876,543)",
                "   Status: Motion to dismiss pending; hearing March 2025",
                "   Exposure: Preliminary assessment $500,000 - $2,000,000",
                "   IP validity analysis in Document: Schedule 3.12 - Intellectual Property",
            ]),
            ("THREATENED CLAIMS", [
                "Demand letter received from former contractor re: unpaid invoices ($45,000).",
                "Resolution expected prior to Closing per Document: Closing Checklist and Conditions.",
            ]),
            ("INSURANCE COVERAGE", [
                "D&O Insurance: $5,000,000 limit | Deductible: $50,000",
                "E&O Insurance: $3,000,000 limit | Deductible: $25,000",
                "General Liability: $2,000,000 limit",
            ]),
        ]
    elif "closing" in doc_id:
        sections = [
            ("PRE-CLOSING CONDITIONS", [
                "The following conditions must be satisfied prior to Closing:",
                "",
                "1. REGULATORY APPROVALS",
                "   [X] HSR Filing - Early termination granted February 1, 2025",
                "   [X] State filings - Completed in all required jurisdictions",
                "",
                "2. THIRD-PARTY CONSENTS",
                "   [X] MegaCorp Industries - Obtained February 8, 2025",
                "   [ ] GlobalBank Holdings - Pending (expected by March 15)",
                "   Per Document: Major Customer Contract Summaries",
                "",
                "3. EMPLOYEE MATTERS",
                "   [X] Key employee retention agreements executed",
                "   [X] Founder consulting agreement finalized",
                "   Per Document: Key Employee Retention Agreements",
                "",
                "4. LEGAL DELIVERABLES",
                "   [X] Legal opinion - See Document: Legal Opinion Letter",
                "   [ ] Good standing certificates - Ordered",
            ]),
            ("CLOSING DELIVERABLES", [
                "SELLER DELIVERABLES:",
                "- Stock certificates endorsed in blank",
                "- Officer's certificate re: representations",
                "- Secretary's certificate with resolutions",
                "- IP assignments per Document: Patent Assignment Agreements",
                "- Third-party consents per above",
                "",
                "BUYER DELIVERABLES:",
                "- Cash payment: $80,000,000 by wire transfer",
                "- Stock consideration: 1,500,000 shares per Document: Stock Purchase Details - Exhibit B",
                "- Escrow deposit: $15,000,000 per Document: Escrow Agreement",
            ]),
            ("POST-CLOSING OBLIGATIONS", [
                "1. Transition services per Document: Transition Services Agreement (6 months)",
                "2. Earnout payments per Exhibit C to Document: Master Acquisition Agreement",
                "3. Escrow release schedule per Document: Escrow Agreement",
                "4. Employee benefit plan merger per Document: Employee Benefit Plan Schedule",
            ]),
        ]
    elif "escrow" in doc_id:
        sections = [
            ("ESCROW TERMS", [
                "Escrow Amount: $15,000,000 (12% of Purchase Price)",
                "Escrow Agent: First National Trust Company",
                "Term: 18 months from Closing Date",
                "",
                "Release Schedule:",
                "- 6 months: $5,000,000 released (absent claims)",
                "- 12 months: $5,000,000 released (absent claims)",
                "- 18 months: Remaining balance released",
                "",
                "Claims may be made for breaches of representations in Document: Master Acquisition Agreement.",
            ]),
            ("INDEMNIFICATION", [
                "Indemnification provisions per Article VII of Document: Master Acquisition Agreement:",
                "- Basket: $500,000 (1% of escrow)",
                "- Cap: $15,000,000 (escrow amount) for general reps",
                "- Fundamental reps: Full Purchase Price cap",
                "",
                "Specific indemnities for matters in Document: Schedule 3.9 - Litigation and Claims.",
            ]),
        ]
    elif "legal_opinion" in doc_id:
        sections = [
            ("OPINIONS RENDERED", [
                "Wilson & Associates LLP, counsel to Seller, renders the following opinions:",
                "",
                "1. Seller is a corporation duly organized under Delaware law.",
                "2. Seller has corporate power to execute Document: Master Acquisition Agreement.",
                "3. Transaction documents are valid and enforceable obligations.",
                "4. No conflicts with charter documents or material agreements.",
                "5. Based on review of Document: Schedule 3.9 - Litigation and Claims, pending "
                "litigation does not pose material risk to transaction.",
                "6. IP matters reviewed per Document: Schedule 3.12 - Intellectual Property; "
                "no infringement claims other than disclosed.",
            ]),
            ("QUALIFICATIONS AND ASSUMPTIONS", [
                "This opinion is subject to standard qualifications regarding:",
                "- Bankruptcy and insolvency laws",
                "- Equitable principles",
                "- Public policy considerations",
                "",
                "We have relied upon certificates from officers of Seller and representations "
                "in Document: Seller Disclosure Schedules.",
            ]),
        ]
    elif "audit" in doc_id:
        sections = [
            ("INDEPENDENT AUDITOR'S REPORT", [
                "To the Board of Directors of InnovateTech Solutions, Inc.:",
                "",
                "We have audited the accompanying financial statements, which comprise the "
                "balance sheet as of December 31, 2024, and the related statements of income, "
                "comprehensive income, stockholders' equity, and cash flows for the year then ended.",
                "",
                "OPINION",
                "In our opinion, the financial statements present fairly, in all material respects, "
                "the financial position of InnovateTech Solutions, Inc. as of December 31, 2024, "
                "in accordance with accounting principles generally accepted in the United States.",
            ]),
            ("KEY AUDIT MATTERS", [
                "1. REVENUE RECOGNITION",
                "   SaaS revenue recognized ratably over subscription period per ASC 606.",
                "   Deferred revenue of $4,200,000 verified to customer contracts.",
                "",
                "2. STOCK-BASED COMPENSATION",
                "   Options valued using Black-Scholes model.",
                "   Expense of $2,100,000 recorded in accordance with ASC 718.",
                "",
                "3. CONTINGENCIES",
                "   Litigation matters reviewed with counsel (see Document: Schedule 3.9 - Litigation and Claims).",
                "   Accruals of $350,000 determined to be appropriate.",
            ]),
        ]
    else:
        # Generic sections for other documents
        sections = [
            ("OVERVIEW", [
                f"This {meta['title']} is executed in connection with the acquisition transaction.",
                f"Reference documents: {', '.join([DOCUMENTS[r]['title'] for r in meta['refs'][:2]])}.",
            ]),
            ("TERMS AND CONDITIONS", [
                "Standard terms apply as set forth in the Master Acquisition Agreement.",
                "Amendments require written consent of all parties.",
            ]),
            ("MISCELLANEOUS", [
                "Governing Law: State of Delaware",
                "Dispute Resolution: Arbitration in San Francisco, California",
                "Notices: As specified in Master Acquisition Agreement",
            ]),
        ]
    
    # Add boilerplate to reach target page count
    for i in range(meta["pages"] - 2):
        sections.append((f"SECTION {len(sections) + 1}", [
            f"Additional provisions related to {meta['title']}.",
            "All terms defined in Document: Master Acquisition Agreement apply herein.",
            f"Cross-reference: See {DOCUMENTS[meta['refs'][i % len(meta['refs'])]]['title']} for related provisions.",
            "The parties acknowledge receipt of all schedules and exhibits referenced herein.",
            "This section shall survive the Closing Date as specified in Article VIII of the Master Agreement.",
        ]))
    
    return sections


def create_pdf(doc_id: str, meta: dict, output_dir: str):
    """Create a PDF document."""
    filepath = os.path.join(output_dir, f"{doc_id}.pdf")
    doc = SimpleDocTemplate(filepath, pagesize=letter,
                           topMargin=0.75*inch, bottomMargin=0.75*inch,
                           leftMargin=1*inch, rightMargin=1*inch)
    content = generate_content(doc_id, meta)
    doc.build(content)
    print(f"  Created: {filepath}")


def main():
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    
    print(f"\nGenerating {len(DOCUMENTS)} large documents in {OUTPUT_DIR}/\n")
    
    for doc_id, meta in DOCUMENTS.items():
        create_pdf(doc_id, meta, OUTPUT_DIR)
    
    # Create test questions
    questions_path = os.path.join(OUTPUT_DIR, "TEST_QUESTIONS.md")
    with open(questions_path, "w") as f:
        f.write("""# Test Questions for Large Document Set

## Document Overview
- 25 interconnected documents
- Each document 3-6 pages
- Extensive cross-references between documents
- Total content: ~100+ pages

## Test Questions

### Level 1: Single Document (Easy)
```bash
uv run explore --task "Look in data/large_acquisition/. What is the total purchase price?"
uv run explore --task "Look in data/large_acquisition/. Who is the CTO and what is their retention bonus?"
uv run explore --task "Look in data/large_acquisition/. What patents does the company own?"
```

### Level 2: Cross-Reference Required (Medium)
```bash
uv run explore --task "Look in data/large_acquisition/. What customer consents are required and what is their status?"
uv run explore --task "Look in data/large_acquisition/. What is the total litigation exposure and is it covered by insurance?"
uv run explore --task "Look in data/large_acquisition/. How is the purchase price being paid and what are the escrow terms?"
```

### Level 3: Multi-Document Synthesis (Hard)
```bash
uv run explore --task "Look in data/large_acquisition/. What are all the conditions that must be satisfied before closing and what is the status of each?"
uv run explore --task "Look in data/large_acquisition/. Provide a complete picture of MegaCorp's relationship with the company - revenue, contract terms, consent status, and any risks."
uv run explore --task "Look in data/large_acquisition/. What are all the financial terms of this deal including adjustments, escrow, earnouts, and stock?"
```

### Level 4: Deep Cross-Reference (Expert)
```bash
uv run explore --task "Look in data/large_acquisition/. Trace all references to the Legal Opinion Letter - what documents cite it and what opinions does it provide?"
uv run explore --task "Look in data/large_acquisition/. Create a complete picture of IP assets - patents, trademarks, assignments, and any related risks or litigation."
uv run explore --task "Look in data/large_acquisition/. What happens after closing? List all post-closing obligations, their timelines, and related documents."
```
""")
    print(f"  Created: {questions_path}")
    
    # Summary
    total_pages = sum(m["pages"] for m in DOCUMENTS.values())
    total_refs = sum(len(m["refs"]) for m in DOCUMENTS.values())
    print(f"\n{'='*60}")
    print(f"SUMMARY")
    print(f"{'='*60}")
    print(f"  Documents created: {len(DOCUMENTS)}")
    print(f"  Total pages: ~{total_pages}")
    print(f"  Cross-references: {total_refs}")
    print(f"  Output directory: {OUTPUT_DIR}/")
    print(f"{'='*60}\n")


if __name__ == "__main__":
    main()



================================================
FILE: scripts/generate_test_docs.py
================================================
#!/usr/bin/env python3
"""
Generate test PDF documents for testing the two-stage document exploration approach.

Scenario: TechCorp's acquisition of StartupXYZ
Documents have cross-references to test the agent's ability to follow document relationships.
"""

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch
import os

OUTPUT_DIR = "data/test_acquisition"

DOCUMENTS = {
    "01_acquisition_agreement.pdf": {
        "title": "ACQUISITION AGREEMENT",
        "content": """
        <b>ACQUISITION AGREEMENT</b><br/><br/>
        
        This Acquisition Agreement ("Agreement") is entered into as of January 15, 2025, 
        by and between TechCorp Industries, Inc. ("Buyer") and StartupXYZ LLC ("Seller").<br/><br/>
        
        <b>ARTICLE I - DEFINITIONS</b><br/><br/>
        
        1.1 "Acquisition" means the purchase of all outstanding shares of Seller by Buyer.<br/>
        1.2 "Purchase Price" means $45,000,000 USD as detailed in <b>Exhibit A - Financial Terms</b>.<br/>
        1.3 "Closing Date" means March 1, 2025, subject to conditions in Article IV.<br/>
        1.4 "Employee Matters" shall be governed by <b>Schedule 3 - Employee Transition Plan</b>.<br/><br/>
        
        <b>ARTICLE II - PURCHASE AND SALE</b><br/><br/>
        
        2.1 Subject to the terms and conditions of this Agreement, Seller agrees to sell, 
        and Buyer agrees to purchase, all of the issued and outstanding shares of Seller.<br/><br/>
        
        2.2 The Purchase Price shall be paid as follows:<br/>
        (a) $30,000,000 in cash at Closing<br/>
        (b) $10,000,000 in Buyer's common stock (see <b>Exhibit B - Stock Valuation</b>)<br/>
        (c) $5,000,000 in earnout payments (see <b>Exhibit C - Earnout Terms</b>)<br/><br/>
        
        <b>ARTICLE III - REPRESENTATIONS AND WARRANTIES</b><br/><br/>
        
        3.1 Seller represents and warrants that the financial statements provided in 
        <b>Document: Due Diligence Report</b> are accurate and complete.<br/><br/>
        
        3.2 Seller represents that all intellectual property is properly documented in 
        <b>Schedule 1 - IP Assets</b> and is free of encumbrances as certified in 
        <b>Document: IP Certification Letter</b>.<br/><br/>
        
        3.3 All material contracts are listed in <b>Schedule 2 - Material Contracts</b>.<br/><br/>
        
        <b>ARTICLE IV - CONDITIONS TO CLOSING</b><br/><br/>
        
        4.1 Buyer's obligation to close is subject to:<br/>
        (a) Receipt of regulatory approval as documented in <b>Document: Regulatory Approval Letter</b><br/>
        (b) Completion of due diligence per <b>Document: Due Diligence Report</b><br/>
        (c) No material adverse change as defined in Section 1.5<br/><br/>
        
        4.2 Both parties acknowledge the risks identified in <b>Document: Risk Assessment Memo</b>.<br/><br/>
        
        <b>ARTICLE V - CONFIDENTIALITY</b><br/><br/>
        
        5.1 This Agreement is subject to the terms of the <b>Document: Non-Disclosure Agreement</b> 
        executed between the parties on October 1, 2024.<br/><br/>
        
        IN WITNESS WHEREOF, the parties have executed this Agreement as of the date first above written.<br/><br/>
        
        _________________________<br/>
        TechCorp Industries, Inc.<br/>
        By: James Mitchell, CEO<br/><br/>
        
        _________________________<br/>
        StartupXYZ LLC<br/>
        By: Sarah Chen, Founder & CEO
        """
    },
    
    "02_due_diligence_report.pdf": {
        "title": "DUE DILIGENCE REPORT",
        "content": """
        <b>CONFIDENTIAL DUE DILIGENCE REPORT</b><br/><br/>
        
        <b>Prepared for:</b> TechCorp Industries, Inc.<br/>
        <b>Subject:</b> StartupXYZ LLC<br/>
        <b>Date:</b> December 20, 2024<br/>
        <b>Prepared by:</b> Morrison & Associates, LLP<br/><br/>
        
        <b>EXECUTIVE SUMMARY</b><br/><br/>
        
        This report summarizes our findings from the due diligence investigation of StartupXYZ LLC 
        in connection with the proposed acquisition described in the <b>Document: Acquisition Agreement</b>.<br/><br/>
        
        <b>1. FINANCIAL REVIEW</b><br/><br/>
        
        1.1 Revenue for FY2024: $12.3 million (growth of 45% YoY)<br/>
        1.2 EBITDA: $2.1 million (17% margin)<br/>
        1.3 Cash position: $3.2 million as of November 30, 2024<br/>
        1.4 Outstanding debt: $1.5 million (detailed in <b>Exhibit A - Financial Terms</b> of the Acquisition Agreement)<br/><br/>
        
        <b>KEY FINDING:</b> Financial statements are materially accurate. Minor adjustments 
        recommended as noted in <b>Document: Financial Adjustments Memo</b>.<br/><br/>
        
        <b>2. INTELLECTUAL PROPERTY</b><br/><br/>
        
        2.1 StartupXYZ holds 12 patents related to AI/ML technology<br/>
        2.2 All patents verified as valid per <b>Document: IP Certification Letter</b><br/>
        2.3 No pending litigation affecting IP (confirmed in <b>Document: Legal Opinion Letter</b>)<br/>
        2.4 Full IP inventory in <b>Schedule 1 - IP Assets</b> of the Acquisition Agreement<br/><br/>
        
        <b>3. EMPLOYEE MATTERS</b><br/><br/>
        
        3.1 Total employees: 47 (32 engineering, 8 sales, 7 operations)<br/>
        3.2 Key employee retention risk: HIGH for 5 senior engineers<br/>
        3.3 Retention bonuses recommended per <b>Schedule 3 - Employee Transition Plan</b><br/>
        3.4 No pending employment disputes<br/><br/>
        
        <b>4. MATERIAL CONTRACTS</b><br/><br/>
        
        4.1 23 active customer contracts reviewed (see <b>Schedule 2 - Material Contracts</b>)<br/>
        4.2 3 contracts contain change-of-control provisions requiring consent<br/>
        4.3 Largest customer (MegaCorp) accounts for 28% of revenue - concentration risk noted in 
        <b>Document: Risk Assessment Memo</b><br/><br/>
        
        <b>5. REGULATORY COMPLIANCE</b><br/><br/>
        
        5.1 Company is compliant with all applicable regulations<br/>
        5.2 HSR filing required - timeline in <b>Document: Regulatory Approval Letter</b><br/><br/>
        
        <b>6. RECOMMENDATIONS</b><br/><br/>
        
        Based on our findings, we recommend proceeding with the acquisition subject to:<br/>
        (a) Obtaining customer consents for change-of-control contracts<br/>
        (b) Implementing retention packages for key employees<br/>
        (c) Addressing items in <b>Document: Financial Adjustments Memo</b><br/><br/>
        
        Respectfully submitted,<br/>
        Morrison & Associates, LLP
        """
    },
    
    "03_ip_certification.pdf": {
        "title": "IP CERTIFICATION LETTER",
        "content": """
        <b>INTELLECTUAL PROPERTY CERTIFICATION LETTER</b><br/><br/>
        
        <b>Date:</b> December 15, 2024<br/>
        <b>To:</b> TechCorp Industries, Inc.<br/>
        <b>From:</b> PatentWatch Legal Services<br/>
        <b>Re:</b> IP Certification for StartupXYZ LLC Acquisition<br/><br/>
        
        Dear Mr. Mitchell,<br/><br/>
        
        In connection with the proposed acquisition of StartupXYZ LLC as described in the 
        <b>Document: Acquisition Agreement</b>, we have conducted a comprehensive review of 
        StartupXYZ's intellectual property portfolio.<br/><br/>
        
        <b>CERTIFICATION</b><br/><br/>
        
        We hereby certify the following:<br/><br/>
        
        <b>1. PATENTS</b><br/><br/>
        
        StartupXYZ owns 12 U.S. patents as listed in <b>Schedule 1 - IP Assets</b>:<br/>
        - US Patent 10,123,456: "Neural Network Optimization Method"<br/>
        - US Patent 10,234,567: "Distributed AI Training System"<br/>
        - US Patent 10,345,678: "Real-time Data Processing Pipeline"<br/>
        - [9 additional patents listed in Schedule 1]<br/><br/>
        
        All patents are valid, enforceable, and free of liens or encumbrances.<br/><br/>
        
        <b>2. TRADEMARKS</b><br/><br/>
        
        StartupXYZ owns 3 registered trademarks:<br/>
        - "StartupXYZ" (word mark)<br/>
        - StartupXYZ logo (design mark)<br/>
        - "IntelliFlow" (product name)<br/><br/>
        
        <b>3. TRADE SECRETS</b><br/><br/>
        
        We have reviewed StartupXYZ's trade secret protection protocols. All employees have 
        signed appropriate NDAs. See <b>Document: Non-Disclosure Agreement</b> template.<br/><br/>
        
        <b>4. THIRD-PARTY IP</b><br/><br/>
        
        StartupXYZ uses 47 open-source libraries. License compliance verified - no copyleft 
        contamination issues identified.<br/><br/>
        
        <b>5. PENDING MATTERS</b><br/><br/>
        
        There is one pending patent application (Application No. 17/456,789) for "Advanced 
        Federated Learning System" expected to issue Q2 2025. This is noted in 
        <b>Document: Risk Assessment Memo</b> as a minor risk item.<br/><br/>
        
        <b>6. LITIGATION</b><br/><br/>
        
        No IP-related litigation is pending or threatened. This is confirmed in 
        <b>Document: Legal Opinion Letter</b>.<br/><br/>
        
        This certification is provided in connection with the due diligence process and 
        may be relied upon by TechCorp Industries, Inc.<br/><br/>
        
        Sincerely,<br/>
        PatentWatch Legal Services<br/>
        By: Robert Kim, Patent Attorney
        """
    },
    
    "04_risk_assessment.pdf": {
        "title": "RISK ASSESSMENT MEMO",
        "content": """
        <b>CONFIDENTIAL RISK ASSESSMENT MEMORANDUM</b><br/><br/>
        
        <b>To:</b> TechCorp Board of Directors<br/>
        <b>From:</b> Corporate Development Team<br/>
        <b>Date:</b> December 22, 2024<br/>
        <b>Re:</b> Risk Assessment - StartupXYZ Acquisition<br/><br/>
        
        This memo summarizes key risks identified in connection with the proposed acquisition 
        as documented in the <b>Document: Acquisition Agreement</b>.<br/><br/>
        
        <b>1. HIGH-PRIORITY RISKS</b><br/><br/>
        
        <b>1.1 Customer Concentration (HIGH)</b><br/>
        - MegaCorp represents 28% of StartupXYZ revenue<br/>
        - MegaCorp contract contains change-of-control clause<br/>
        - Mitigation: Obtain consent prior to closing (see <b>Document: Customer Consent Letters</b>)<br/>
        - Impact if materialized: $3.4M annual revenue at risk<br/><br/>
        
        <b>1.2 Key Employee Retention (HIGH)</b><br/>
        - 5 senior engineers critical to product development<br/>
        - 2 have expressed interest in leaving post-acquisition<br/>
        - Mitigation: Retention packages per <b>Schedule 3 - Employee Transition Plan</b><br/>
        - Estimated cost: $2.5M in retention bonuses<br/><br/>
        
        <b>2. MEDIUM-PRIORITY RISKS</b><br/><br/>
        
        <b>2.1 Earnout Structure (MEDIUM)</b><br/>
        - $5M earnout tied to 2025-2026 performance metrics<br/>
        - Metrics defined in <b>Exhibit C - Earnout Terms</b> of the Acquisition Agreement<br/>
        - Risk: Disagreement on metric calculation methodology<br/>
        - Mitigation: Clear definitions in agreement; third-party arbitration clause<br/><br/>
        
        <b>2.2 Integration Costs (MEDIUM)</b><br/>
        - Estimated integration costs: $4.2M over 18 months<br/>
        - Systems integration detailed in <b>Document: Integration Plan</b><br/>
        - Risk: Cost overruns of 20-30% typical in tech acquisitions<br/><br/>
        
        <b>3. LOW-PRIORITY RISKS</b><br/><br/>
        
        <b>3.1 Pending Patent Application (LOW)</b><br/>
        - One patent pending as noted in <b>Document: IP Certification Letter</b><br/>
        - Low risk of rejection based on patent attorney's assessment<br/><br/>
        
        <b>3.2 Regulatory Approval (LOW)</b><br/>
        - HSR filing required but expected to clear without issues<br/>
        - Timeline in <b>Document: Regulatory Approval Letter</b><br/><br/>
        
        <b>4. FINANCIAL IMPACT SUMMARY</b><br/><br/>
        
        Total risk-adjusted impact: $6.2M - $8.7M<br/>
        This is reflected in purchase price negotiations per <b>Document: Financial Adjustments Memo</b><br/><br/>
        
        <b>5. RECOMMENDATION</b><br/><br/>
        
        Despite identified risks, we recommend proceeding with the acquisition. The strategic 
        value of StartupXYZ's AI technology platform justifies the purchase price when 
        accounting for risk mitigation costs. All findings are consistent with 
        <b>Document: Due Diligence Report</b>.<br/><br/>
        
        <b>6. NEXT STEPS</b><br/><br/>
        
        - Finalize customer consent process<br/>
        - Execute retention agreements<br/>
        - Complete regulatory filings<br/>
        - Prepare for closing per <b>Document: Closing Checklist</b>
        """
    },
    
    "05_financial_adjustments.pdf": {
        "title": "FINANCIAL ADJUSTMENTS MEMO",
        "content": """
        <b>FINANCIAL ADJUSTMENTS MEMORANDUM</b><br/><br/>
        
        <b>To:</b> Deal Team<br/>
        <b>From:</b> Finance Department<br/>
        <b>Date:</b> December 23, 2024<br/>
        <b>Re:</b> Purchase Price Adjustments - StartupXYZ Acquisition<br/><br/>
        
        Following our review in connection with the <b>Document: Due Diligence Report</b>, 
        we recommend the following adjustments to the purchase price as set forth in 
        <b>Exhibit A - Financial Terms</b> of the <b>Document: Acquisition Agreement</b>.<br/><br/>
        
        <b>1. WORKING CAPITAL ADJUSTMENT</b><br/><br/>
        
        Target working capital: $1,200,000<br/>
        Estimated closing working capital: $980,000<br/>
        Adjustment: ($220,000)<br/><br/>
        
        <b>2. DEBT ADJUSTMENT</b><br/><br/>
        
        Previously disclosed debt: $1,500,000<br/>
        Additional identified debt: $175,000 (capital lease obligations)<br/>
        Adjustment: ($175,000)<br/><br/>
        
        <b>3. REVENUE RECOGNITION ADJUSTMENT</b><br/><br/>
        
        Deferred revenue requiring restatement: $340,000<br/>
        Impact on EBITDA: ($85,000)<br/>
        Implied value adjustment (at 15x): ($1,275,000)<br/><br/>
        
        <b>4. CONTINGENT LIABILITY RESERVE</b><br/><br/>
        
        As noted in <b>Document: Risk Assessment Memo</b>, we recommend establishing 
        reserves for:<br/>
        - Customer concentration risk: $500,000<br/>
        - Integration contingency: $800,000<br/>
        Total reserve: $1,300,000 (to be held in escrow per <b>Exhibit C - Earnout Terms</b>)<br/><br/>
        
        <b>5. SUMMARY OF ADJUSTMENTS</b><br/><br/>
        
        Original Purchase Price: $45,000,000<br/>
        Working Capital Adjustment: ($220,000)<br/>
        Debt Adjustment: ($175,000)<br/>
        Revenue Recognition: ($1,275,000)<br/>
        <b>Adjusted Purchase Price: $43,330,000</b><br/><br/>
        
        Plus escrow reserve: $1,300,000<br/>
        <b>Total Cash Required at Closing: $44,630,000</b><br/><br/>
        
        <b>6. PAYMENT STRUCTURE</b><br/><br/>
        
        As revised from <b>Document: Acquisition Agreement</b> Section 2.2:<br/>
        (a) Cash at closing: $28,330,000 (adjusted)<br/>
        (b) Stock consideration: $10,000,000 (per <b>Exhibit B - Stock Valuation</b>)<br/>
        (c) Earnout: $5,000,000 (unchanged, per <b>Exhibit C - Earnout Terms</b>)<br/>
        (d) Escrow: $1,300,000 (18-month release schedule)<br/><br/>
        
        These adjustments have been discussed with Seller's representatives and are 
        subject to final negotiation.<br/><br/>
        
        Please refer to <b>Document: Closing Checklist</b> for timeline and requirements.
        """
    },
    
    "06_legal_opinion.pdf": {
        "title": "LEGAL OPINION LETTER",
        "content": """
        <b>LEGAL OPINION LETTER</b><br/><br/>
        
        <b>Date:</b> December 18, 2024<br/><br/>
        
        TechCorp Industries, Inc.<br/>
        500 Technology Drive<br/>
        San Francisco, CA 94105<br/><br/>
        
        <b>Re: Acquisition of StartupXYZ LLC</b><br/><br/>
        
        Ladies and Gentlemen:<br/><br/>
        
        We have acted as legal counsel to StartupXYZ LLC ("Company") in connection with 
        the proposed acquisition by TechCorp Industries, Inc. pursuant to the 
        <b>Document: Acquisition Agreement</b> dated January 15, 2025.<br/><br/>
        
        <b>DOCUMENTS REVIEWED</b><br/><br/>
        
        In connection with this opinion, we have reviewed:<br/>
        1. The Acquisition Agreement and all Exhibits and Schedules<br/>
        2. <b>Document: Due Diligence Report</b> prepared by Morrison & Associates<br/>
        3. <b>Document: IP Certification Letter</b> from PatentWatch Legal Services<br/>
        4. All material contracts listed in <b>Schedule 2 - Material Contracts</b><br/>
        5. Corporate records and organizational documents of the Company<br/>
        6. <b>Document: Non-Disclosure Agreement</b> between the parties<br/><br/>
        
        <b>OPINIONS</b><br/><br/>
        
        Based on our review, we are of the opinion that:<br/><br/>
        
        <b>1. Corporate Status</b><br/>
        The Company is a limited liability company duly organized, validly existing, and 
        in good standing under the laws of Delaware.<br/><br/>
        
        <b>2. Authority</b><br/>
        The Company has full power and authority to execute and deliver the Acquisition 
        Agreement and to consummate the transactions contemplated thereby.<br/><br/>
        
        <b>3. No Conflicts</b><br/>
        The execution and delivery of the Acquisition Agreement does not violate any 
        provision of the Company's organizational documents or any material contract, 
        except for change-of-control provisions noted in <b>Document: Customer Consent Letters</b>.<br/><br/>
        
        <b>4. Litigation</b><br/>
        There is no litigation, arbitration, or governmental proceeding pending or, to 
        our knowledge, threatened against the Company that would have a material adverse 
        effect on the Company or the transactions contemplated by the Acquisition Agreement.<br/><br/>
        
        This opinion confirms the representations in the <b>Document: IP Certification Letter</b> 
        regarding absence of IP litigation.<br/><br/>
        
        <b>5. Regulatory Compliance</b><br/>
        The Company is in material compliance with all applicable laws and regulations. 
        The HSR filing requirements are addressed in <b>Document: Regulatory Approval Letter</b>.<br/><br/>
        
        <b>QUALIFICATIONS</b><br/><br/>
        
        This opinion is subject to the following qualifications:<br/>
        1. We express no opinion on tax matters (see separate tax opinion)<br/>
        2. This opinion is limited to Delaware and federal law<br/>
        3. Certain contracts require third-party consents as noted above<br/><br/>
        
        This opinion is provided solely for your benefit in connection with the 
        transactions contemplated by the Acquisition Agreement.<br/><br/>
        
        Very truly yours,<br/>
        Wilson & Partners LLP<br/>
        By: Jennifer Walsh, Partner
        """
    },
    
    "07_nda.pdf": {
        "title": "NON-DISCLOSURE AGREEMENT",
        "content": """
        <b>MUTUAL NON-DISCLOSURE AGREEMENT</b><br/><br/>
        
        This Mutual Non-Disclosure Agreement ("NDA") is entered into as of October 1, 2024, 
        by and between:<br/><br/>
        
        <b>TechCorp Industries, Inc.</b> ("TechCorp")<br/>
        500 Technology Drive, San Francisco, CA 94105<br/><br/>
        
        and<br/><br/>
        
        <b>StartupXYZ LLC</b> ("StartupXYZ")<br/>
        123 Innovation Way, Palo Alto, CA 94301<br/><br/>
        
        (each a "Party" and collectively the "Parties")<br/><br/>
        
        <b>RECITALS</b><br/><br/>
        
        The Parties wish to explore a potential business relationship, including a possible 
        acquisition of StartupXYZ by TechCorp (the "Purpose"), which is now documented in 
        the <b>Document: Acquisition Agreement</b>.<br/><br/>
        
        <b>1. DEFINITION OF CONFIDENTIAL INFORMATION</b><br/><br/>
        
        "Confidential Information" means any non-public information disclosed by either 
        Party, including but not limited to:<br/>
        - Financial information (as contained in <b>Document: Due Diligence Report</b>)<br/>
        - Technical information (as certified in <b>Document: IP Certification Letter</b>)<br/>
        - Business strategies and plans<br/>
        - Customer and supplier information<br/>
        - Employee information (as detailed in <b>Schedule 3 - Employee Transition Plan</b>)<br/><br/>
        
        <b>2. OBLIGATIONS</b><br/><br/>
        
        Each Party agrees to:<br/>
        (a) Hold Confidential Information in strict confidence<br/>
        (b) Not disclose Confidential Information to third parties without prior written consent<br/>
        (c) Use Confidential Information solely for the Purpose<br/>
        (d) Limit access to Confidential Information to employees with a need to know<br/><br/>
        
        <b>3. TERM</b><br/><br/>
        
        This NDA shall remain in effect for three (3) years from the date first written 
        above, or until superseded by the confidentiality provisions in the 
        <b>Document: Acquisition Agreement</b> Article V.<br/><br/>
        
        <b>4. EXCLUSIONS</b><br/><br/>
        
        Confidential Information does not include information that:<br/>
        (a) Is or becomes publicly available through no fault of the receiving Party<br/>
        (b) Was rightfully in the receiving Party's possession prior to disclosure<br/>
        (c) Is rightfully obtained from a third party without restriction<br/>
        (d) Is independently developed without use of Confidential Information<br/><br/>
        
        <b>5. RETURN OF MATERIALS</b><br/><br/>
        
        Upon request or termination, each Party shall return or destroy all Confidential 
        Information, except as required for legal or regulatory purposes.<br/><br/>
        
        <b>6. NO LICENSE</b><br/><br/>
        
        Nothing in this NDA grants any rights to intellectual property, except as 
        subsequently agreed in the <b>Document: Acquisition Agreement</b> and 
        <b>Schedule 1 - IP Assets</b>.<br/><br/>
        
        IN WITNESS WHEREOF, the Parties have executed this NDA as of the date first above written.<br/><br/>
        
        TechCorp Industries, Inc.<br/>
        By: ______________________<br/>
        Name: James Mitchell<br/>
        Title: CEO<br/><br/>
        
        StartupXYZ LLC<br/>
        By: ______________________<br/>
        Name: Sarah Chen<br/>
        Title: Founder & CEO
        """
    },
    
    "08_regulatory_approval.pdf": {
        "title": "REGULATORY APPROVAL LETTER",
        "content": """
        <b>FEDERAL TRADE COMMISSION</b><br/>
        <b>PREMERGER NOTIFICATION OFFICE</b><br/><br/>
        
        January 28, 2025<br/><br/>
        
        TechCorp Industries, Inc.<br/>
        500 Technology Drive<br/>
        San Francisco, CA 94105<br/><br/>
        
        StartupXYZ LLC<br/>
        123 Innovation Way<br/>
        Palo Alto, CA 94301<br/><br/>
        
        <b>Re: Early Termination of HSR Waiting Period</b><br/>
        <b>Transaction: Acquisition of StartupXYZ LLC by TechCorp Industries, Inc.</b><br/><br/>
        
        Dear Parties:<br/><br/>
        
        This letter confirms that the Federal Trade Commission has granted early 
        termination of the waiting period under the Hart-Scott-Rodino Antitrust 
        Improvements Act of 1976 for the above-referenced transaction.<br/><br/>
        
        <b>FILING DETAILS</b><br/><br/>
        
        Filing Date: January 10, 2025<br/>
        Transaction Value: $45,000,000 (as stated in <b>Document: Acquisition Agreement</b>)<br/>
        HSR Filing Fee: $30,000<br/>
        Early Termination Granted: January 28, 2025<br/><br/>
        
        <b>EFFECT OF EARLY TERMINATION</b><br/><br/>
        
        The parties may now consummate the transaction at any time. This early termination 
        satisfies the condition precedent set forth in Article IV, Section 4.1(a) of the 
        <b>Document: Acquisition Agreement</b>.<br/><br/>
        
        Please note that early termination of the waiting period does not preclude the 
        Commission from taking any action it deems necessary to protect competition.<br/><br/>
        
        <b>NEXT STEPS</b><br/><br/>
        
        Per the <b>Document: Closing Checklist</b>, you may now proceed with the closing 
        scheduled for March 1, 2025, subject to satisfaction of other conditions in the 
        <b>Document: Acquisition Agreement</b>.<br/><br/>
        
        The <b>Document: Risk Assessment Memo</b> correctly identified this as a low-risk 
        item. The market analysis in the <b>Document: Due Diligence Report</b> supported 
        the determination that this transaction does not raise competitive concerns.<br/><br/>
        
        Sincerely,<br/>
        Premerger Notification Office<br/>
        Federal Trade Commission
        """
    },
    
    "09_customer_consents.pdf": {
        "title": "CUSTOMER CONSENT LETTERS",
        "content": """
        <b>CUSTOMER CONSENT STATUS REPORT</b><br/><br/>
        
        <b>Date:</b> February 15, 2025<br/>
        <b>To:</b> Deal Team<br/>
        <b>From:</b> Legal Department<br/>
        <b>Re:</b> Change of Control Consent Status<br/><br/>
        
        As required by <b>Schedule 2 - Material Contracts</b> of the 
        <b>Document: Acquisition Agreement</b>, this memo summarizes the status of 
        customer consents for contracts containing change-of-control provisions.<br/><br/>
        
        <b>CONSENT STATUS SUMMARY</b><br/><br/>
        
        <b>1. MegaCorp Inc. - OBTAINED</b><br/>
        Contract Value: $3.4M annual<br/>
        Consent Received: February 10, 2025<br/>
        Notes: MegaCorp requested meeting with TechCorp leadership; meeting held 2/8/25. 
        Consent granted with no additional conditions. This addresses the primary concern 
        noted in <b>Document: Risk Assessment Memo</b> Section 1.1.<br/><br/>
        
        <b>2. DataFlow Systems - OBTAINED</b><br/>
        Contract Value: $1.2M annual<br/>
        Consent Received: February 5, 2025<br/>
        Notes: Standard consent process. No concerns raised.<br/><br/>
        
        <b>3. CloudTech Partners - PENDING</b><br/>
        Contract Value: $890K annual<br/>
        Status: Consent requested February 1, 2025<br/>
        Expected: February 20, 2025<br/>
        Notes: Legal review in progress at CloudTech. Their counsel has reviewed the 
        <b>Document: Acquisition Agreement</b> and has no objections. Verbal confirmation 
        received; written consent expected shortly.<br/><br/>
        
        <b>IMPACT ANALYSIS</b><br/><br/>
        
        Per <b>Document: Due Diligence Report</b> Section 4, there were 3 contracts 
        requiring consent:<br/>
        - 2 obtained (representing $4.6M annual revenue)<br/>
        - 1 pending (representing $890K annual revenue)<br/><br/>
        
        <b>CLOSING IMPLICATIONS</b><br/><br/>
        
        The <b>Document: Acquisition Agreement</b> Article IV requires "material" customer 
        consents as a closing condition. With MegaCorp consent obtained, this condition 
        is substantially satisfied. The pending CloudTech consent is expected before 
        the March 1 closing date per <b>Document: Closing Checklist</b>.<br/><br/>
        
        <b>ATTACHMENTS</b><br/><br/>
        
        Attached hereto:<br/>
        - Exhibit A: MegaCorp Consent Letter (dated February 10, 2025)<br/>
        - Exhibit B: DataFlow Systems Consent Letter (dated February 5, 2025)<br/>
        - Exhibit C: CloudTech Partners Draft Consent (pending signature)<br/><br/>
        
        <b>RECOMMENDATION</b><br/><br/>
        
        We recommend proceeding with closing preparations. The risk of CloudTech 
        withholding consent is low based on discussions with their counsel. This 
        is consistent with the risk mitigation strategy in <b>Document: Risk Assessment Memo</b>.
        """
    },
    
    "10_closing_checklist.pdf": {
        "title": "CLOSING CHECKLIST",
        "content": """
        <b>CLOSING CHECKLIST</b><br/>
        <b>Acquisition of StartupXYZ LLC by TechCorp Industries, Inc.</b><br/><br/>
        
        <b>Closing Date:</b> March 1, 2025<br/>
        <b>Closing Location:</b> Wilson & Partners LLP, San Francisco<br/><br/>
        
        <b>I. PRE-CLOSING CONDITIONS</b><br/><br/>
        
        <b>A. Regulatory</b><br/>
        [X] HSR Filing submitted - <b>Document: Regulatory Approval Letter</b><br/>
        [X] Early termination received (January 28, 2025)<br/>
        [ ] State regulatory filings (if required)<br/><br/>
        
        <b>B. Third-Party Consents</b><br/>
        [X] MegaCorp consent - <b>Document: Customer Consent Letters</b><br/>
        [X] DataFlow consent - <b>Document: Customer Consent Letters</b><br/>
        [ ] CloudTech consent (expected February 20) - <b>Document: Customer Consent Letters</b><br/><br/>
        
        <b>C. Due Diligence Completion</b><br/>
        [X] Financial due diligence - <b>Document: Due Diligence Report</b><br/>
        [X] Legal due diligence - <b>Document: Legal Opinion Letter</b><br/>
        [X] IP due diligence - <b>Document: IP Certification Letter</b><br/>
        [X] Risk assessment - <b>Document: Risk Assessment Memo</b><br/><br/>
        
        <b>II. CLOSING DOCUMENTS</b><br/><br/>
        
        <b>A. Transaction Documents</b><br/>
        [ ] Executed <b>Document: Acquisition Agreement</b><br/>
        [ ] Bill of Sale<br/>
        [ ] Assignment and Assumption Agreement<br/>
        [ ] IP Assignment Agreement (per <b>Schedule 1 - IP Assets</b>)<br/><br/>
        
        <b>B. Corporate Documents</b><br/>
        [ ] Seller's Certificate of Good Standing<br/>
        [ ] Secretary's Certificate (resolutions, incumbency)<br/>
        [ ] Buyer's Certificate of Good Standing<br/><br/>
        
        <b>C. Financial Documents</b><br/>
        [ ] Closing Statement per <b>Document: Financial Adjustments Memo</b><br/>
        [ ] Wire transfer instructions<br/>
        [ ] Escrow Agreement (per <b>Exhibit C - Earnout Terms</b>)<br/>
        [ ] Stock certificates or book entry (per <b>Exhibit B - Stock Valuation</b>)<br/><br/>
        
        <b>D. Employment Documents</b><br/>
        [ ] Retention agreements per <b>Schedule 3 - Employee Transition Plan</b><br/>
        [ ] Offer letters for key employees<br/>
        [ ] WARN Act compliance (if applicable)<br/><br/>
        
        <b>III. CLOSING FUNDS</b><br/><br/>
        
        Per <b>Document: Financial Adjustments Memo</b>:<br/>
        [ ] Cash payment: $28,330,000<br/>
        [ ] Escrow deposit: $1,300,000<br/>
        [ ] Stock issuance: $10,000,000<br/>
        Total at Closing: $39,630,000<br/><br/>
        
        <b>IV. POST-CLOSING</b><br/><br/>
        
        [ ] File UCC termination statements<br/>
        [ ] Update corporate records<br/>
        [ ] Integration kickoff per <b>Document: Integration Plan</b><br/>
        [ ] Employee communications<br/>
        [ ] Customer notifications<br/>
        [ ] Press release<br/><br/>
        
        <b>V. RESPONSIBLE PARTIES</b><br/><br/>
        
        Buyer's Counsel: Morrison & Associates LLP<br/>
        Seller's Counsel: Wilson & Partners LLP<br/>
        Escrow Agent: First National Trust<br/><br/>
        
        <b>VI. KEY CONTACTS</b><br/><br/>
        
        TechCorp: James Mitchell (CEO), (415) 555-0100<br/>
        StartupXYZ: Sarah Chen (CEO), (650) 555-0200<br/>
        Legal (Buyer): John Morrison, (415) 555-0300<br/>
        Legal (Seller): Jennifer Walsh, (415) 555-0400
        """
    }
}


def create_pdf(filename: str, title: str, content: str):
    """Create a PDF document."""
    filepath = os.path.join(OUTPUT_DIR, filename)
    doc = SimpleDocTemplate(filepath, pagesize=letter,
                           topMargin=1*inch, bottomMargin=1*inch,
                           leftMargin=1*inch, rightMargin=1*inch)
    
    styles = getSampleStyleSheet()
    title_style = ParagraphStyle(
        'CustomTitle',
        parent=styles['Heading1'],
        fontSize=16,
        spaceAfter=30,
        alignment=1  # Center
    )
    body_style = ParagraphStyle(
        'CustomBody',
        parent=styles['Normal'],
        fontSize=11,
        leading=14,
        spaceAfter=12
    )
    
    story = []
    story.append(Paragraph(title, title_style))
    story.append(Spacer(1, 0.5*inch))
    
    # Split content into paragraphs and add them
    paragraphs = content.strip().split('<br/><br/>')
    for para in paragraphs:
        # ReportLab's Paragraph renders <b> and <br/> markup directly.
        story.append(Paragraph(para, body_style))
    
    doc.build(story)
    print(f"Created: {filepath}")


def main():
    # Create output directory
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    
    print(f"\nGenerating {len(DOCUMENTS)} test documents in {OUTPUT_DIR}/\n")
    
    for filename, doc_info in DOCUMENTS.items():
        create_pdf(filename, doc_info["title"], doc_info["content"])
    
    print(f"\n✅ Generated {len(DOCUMENTS)} documents successfully!")
    print(f"\nDocument cross-reference map:")
    print("=" * 60)
    print("""
    Acquisition Agreement (01)
    ├── references: Exhibit A, B, C, Schedule 1-3
    ├── referenced by: ALL other documents
    │
    Due Diligence Report (02)
    ├── references: Acquisition Agreement, IP Cert, Risk Assessment
    ├── referenced by: Legal Opinion, Risk Assessment, Regulatory
    │
    IP Certification (03)
    ├── references: Acquisition Agreement, Schedule 1, NDA
    ├── referenced by: Due Diligence, Legal Opinion
    │
    Risk Assessment (04)
    ├── references: Acquisition Agreement, Due Diligence, IP Cert
    ├── referenced by: Financial Adjustments, Customer Consents
    │
    Financial Adjustments (05)
    ├── references: Due Diligence, Risk Assessment, Acquisition Agreement
    ├── referenced by: Closing Checklist
    │
    Legal Opinion (06)
    ├── references: Acquisition Agreement, Due Diligence, IP Cert, NDA
    ├── referenced by: Closing Checklist
    │
    NDA (07)
    ├── references: Acquisition Agreement, Due Diligence, IP Cert
    ├── referenced by: IP Cert, Legal Opinion
    │
    Regulatory Approval (08)
    ├── references: Acquisition Agreement, Due Diligence, Risk Assessment
    ├── referenced by: Closing Checklist
    │
    Customer Consents (09)
    ├── references: Acquisition Agreement, Risk Assessment, Schedule 2
    ├── referenced by: Closing Checklist
    │
    Closing Checklist (10)
    └── references: ALL documents
    """)


if __name__ == "__main__":
    main()



================================================
FILE: src/fs_explorer/__init__.py
================================================
"""
FsExplorer - AI-powered filesystem exploration agent.

This package provides an intelligent agent that can explore filesystems,
parse documents, and answer questions about their contents using
Google Gemini for decision-making and Docling for document parsing.

Example usage:
    >>> from fs_explorer import FsExplorerAgent, workflow, InputEvent
    >>> agent = FsExplorerAgent()
    >>> # Use with the workflow for full exploration
    >>> result = await workflow.run(start_event=InputEvent(task="Find the purchase price"))
"""

from .agent import FsExplorerAgent, TokenUsage
from .workflow import (
    workflow,
    FsExplorerWorkflow,
    InputEvent,
    ExplorationEndEvent,
    ToolCallEvent,
    GoDeeperEvent,
    AskHumanEvent,
    HumanAnswerEvent,
    get_agent,
    reset_agent,
)
from .models import Action, ActionType, Tools

__all__ = [
    # Agent
    "FsExplorerAgent",
    "TokenUsage",
    # Workflow
    "workflow",
    "FsExplorerWorkflow",
    "InputEvent",
    "ExplorationEndEvent",
    "ToolCallEvent",
    "GoDeeperEvent",
    "AskHumanEvent",
    "HumanAnswerEvent",
    "get_agent",
    "reset_agent",
    # Models
    "Action",
    "ActionType",
    "Tools",
]



================================================
FILE: src/fs_explorer/agent.py
================================================
"""
FsExplorer Agent for filesystem exploration using Google Gemini.

This module contains the agent that interacts with the Gemini AI model
to make decisions about filesystem exploration actions.
"""

import os
import re
from pathlib import Path
from typing import Callable, Any, cast
from dataclasses import dataclass

from dotenv import load_dotenv
from google.genai.types import Content, HttpOptions, Part
from google.genai import Client as GenAIClient

from .models import Action, ActionType, ToolCallAction, Tools
from .fs import (
    read_file,
    grep_file_content,
    glob_paths,
    scan_folder,
    preview_file,
    parse_file,
)
from .embeddings import EmbeddingProvider
from .index_config import resolve_db_path
from .search import (
    IndexedQueryEngine,
    MetadataFilterParseError,
    supported_filter_syntax,
)
from .storage import DuckDBStorage

# Load .env file from project root
_env_path = Path(__file__).parent.parent.parent / ".env"
if _env_path.exists():
    load_dotenv(_env_path)


# =============================================================================
# Token Usage Tracking
# =============================================================================

# Gemini Flash pricing (per million tokens)
GEMINI_FLASH_INPUT_COST_PER_MILLION = 0.075
GEMINI_FLASH_OUTPUT_COST_PER_MILLION = 0.30
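
# Worked example: a session with 100,000 prompt tokens and 10,000 completion
# tokens costs 0.1 * 0.075 + 0.01 * 0.30 = $0.0105 under this pricing.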


@dataclass
class TokenUsage:
    """
    Track token usage and costs across the session.

    Maintains running totals of API calls, token counts, and provides
    cost estimates based on Gemini Flash pricing.
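
    Example (token counts are illustrative):

        >>> usage = TokenUsage()
        >>> usage.add_api_call(prompt_tokens=1000, completion_tokens=250)
        >>> usage.total_tokens
        1250
        >>> usage.api_calls
        1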
    """

    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0
    api_calls: int = 0

    # Track content sizes
    tool_result_chars: int = 0
    documents_parsed: int = 0
    documents_scanned: int = 0

    def add_api_call(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Record token usage from an API call."""
        self.prompt_tokens += prompt_tokens
        self.completion_tokens += completion_tokens
        self.total_tokens += prompt_tokens + completion_tokens
        self.api_calls += 1

    def add_tool_result(self, result: str, tool_name: str) -> None:
        """Record metrics from a tool execution."""
        self.tool_result_chars += len(result)
        if tool_name == "parse_file":
            self.documents_parsed += 1
        elif tool_name == "scan_folder":
            # Count documents in scan result by counting document markers
            self.documents_scanned += result.count("│ [")
        elif tool_name == "preview_file":
            self.documents_parsed += 1

    def _calculate_cost(self) -> tuple[float, float, float]:
        """Calculate estimated costs based on Gemini Flash pricing."""
        input_cost = (
            self.prompt_tokens / 1_000_000
        ) * GEMINI_FLASH_INPUT_COST_PER_MILLION
        output_cost = (
            self.completion_tokens / 1_000_000
        ) * GEMINI_FLASH_OUTPUT_COST_PER_MILLION
        return input_cost, output_cost, input_cost + output_cost

    def summary(self) -> str:
        """Generate a formatted summary of token usage and costs."""
        input_cost, output_cost, total_cost = self._calculate_cost()

        return f"""
═══════════════════════════════════════════════════════════════
                      TOKEN USAGE SUMMARY
═══════════════════════════════════════════════════════════════
  API Calls:           {self.api_calls}
  Prompt Tokens:       {self.prompt_tokens:,}
  Completion Tokens:   {self.completion_tokens:,}
  Total Tokens:        {self.total_tokens:,}
───────────────────────────────────────────────────────────────
  Documents Scanned:   {self.documents_scanned}
  Documents Parsed:    {self.documents_parsed}
  Tool Result Chars:   {self.tool_result_chars:,}
───────────────────────────────────────────────────────────────
  Est. Cost (Gemini Flash):
    Input:  ${input_cost:.4f}
    Output: ${output_cost:.4f}
    Total:  ${total_cost:.4f}
═══════════════════════════════════════════════════════════════
"""


# =============================================================================
# Tool Registry
# =============================================================================


@dataclass(frozen=True)
class IndexContext:
    """Execution context for indexed retrieval tools."""

    root_folder: str
    db_path: str


_INDEX_CONTEXT: IndexContext | None = None
_EMBEDDING_PROVIDER: EmbeddingProvider | None = None
_FIELD_CATALOG_SHOWN: bool = False
_ENABLE_SEMANTIC: bool = False
_ENABLE_METADATA: bool = False


def set_search_flags(
    *, enable_semantic: bool = False, enable_metadata: bool = False
) -> None:
    """Configure which indexed retrieval paths are active."""
    global _ENABLE_SEMANTIC, _ENABLE_METADATA
    _ENABLE_SEMANTIC = enable_semantic
    _ENABLE_METADATA = enable_metadata


def get_search_flags() -> tuple[bool, bool]:
    """Return (enable_semantic, enable_metadata)."""
    return _ENABLE_SEMANTIC, _ENABLE_METADATA


def set_embedding_provider(provider: EmbeddingProvider | None) -> None:
    """Set the embedding provider for vector search in indexed tools."""
    global _EMBEDDING_PROVIDER
    _EMBEDDING_PROVIDER = provider


def set_index_context(folder: str, db_path: str | None = None) -> None:
    """Enable indexed tools for a specific folder corpus."""
    global _INDEX_CONTEXT, _EMBEDDING_PROVIDER
    _INDEX_CONTEXT = IndexContext(
        root_folder=str(Path(folder).resolve()),
        db_path=resolve_db_path(db_path),
    )
    # Auto-create embedding provider if API key available
    if _EMBEDDING_PROVIDER is None:
        try:
            _EMBEDDING_PROVIDER = EmbeddingProvider()
        except ValueError:
            pass


def clear_index_context() -> None:
    """Disable indexed tools for the current process."""
    global _INDEX_CONTEXT, _EMBEDDING_PROVIDER, _FIELD_CATALOG_SHOWN
    global _ENABLE_SEMANTIC, _ENABLE_METADATA
    _INDEX_CONTEXT = None
    _EMBEDDING_PROVIDER = None
    _FIELD_CATALOG_SHOWN = False
    _ENABLE_SEMANTIC = False
    _ENABLE_METADATA = False


def _get_index_storage_and_corpus() -> tuple[
    DuckDBStorage | None, str | None, str | None
]:
    if _INDEX_CONTEXT is None:
        return None, None, "Index context is not configured. Re-run with `--use-index`."

    storage = DuckDBStorage(_INDEX_CONTEXT.db_path)
    corpus_id = storage.get_corpus_id(_INDEX_CONTEXT.root_folder)
    if corpus_id is None:
        return (
            None,
            None,
            f"No index found for folder {_INDEX_CONTEXT.root_folder}. "
            "Run `explore index <folder>` first.",
        )
    return storage, corpus_id, None


def _clean_excerpt(text: str, max_chars: int = 320) -> str:
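    """Collapse whitespace runs and truncate long excerpts with an ellipsis.

    >>> _clean_excerpt("  lots   of   whitespace  ")
    'lots of whitespace'
    """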
    squashed = re.sub(r"\s+", " ", text).strip()
    if len(squashed) <= max_chars:
        return squashed
    return f"{squashed[:max_chars]}..."


def semantic_search(query: str, filters: str | None = None, limit: int = 5) -> str:
    """Search indexed chunks and return ranked excerpts."""
    storage, corpus_id, error = _get_index_storage_and_corpus()
    if error:
        return error
    assert storage is not None and corpus_id is not None

    engine = IndexedQueryEngine(storage, embedding_provider=_EMBEDDING_PROVIDER)
    try:
        hits = engine.search(
            corpus_id=corpus_id,
            query=query,
            filters=filters,
            limit=limit,
            enable_semantic=_ENABLE_SEMANTIC,
            enable_metadata=_ENABLE_METADATA,
        )
    except MetadataFilterParseError as exc:
        return f"Invalid metadata filter: {exc}\n{supported_filter_syntax()}"
    except ValueError as exc:
        return f"Metadata filter error: {exc}"

    if not hits:
        if filters:
            return f"No indexed matches found for query={query!r} with filters={filters!r}."
        return f"No indexed matches found for query: {query!r}"

    lines = [
        "=== INDEXED SEARCH RESULTS ===",
        f"Query: {query}",
    ]
    if filters:
        lines.append(f"Filters: {filters}")
    lines.append("")
    for idx, hit in enumerate(hits, start=1):
        position = hit.position if hit.position is not None else "<metadata>"
        lines.extend(
            [
                f"[{idx}] doc_id: {hit.doc_id}",
                f"    path: {hit.absolute_path}",
                f"    match: {hit.matched_by}",
                f"    chunk_position: {position}",
                f"    semantic_score: {hit.semantic_score}",
                f"    metadata_score: {hit.metadata_score}",
                f"    score: {hit.score:.2f}",
                f"    excerpt: {_clean_excerpt(hit.text)}",
                "",
            ]
        )
    lines.append(
        "Use get_document(doc_id=...) to read full content for the most relevant documents."
    )

    # Include a rich field catalog on the first search so the agent can
    # construct effective metadata filters.
    global _FIELD_CATALOG_SHOWN
    if not _FIELD_CATALOG_SHOWN:
        active_schema = storage.get_active_schema(corpus_id=corpus_id)
        if active_schema is not None:
            schema_fields = active_schema.schema_def.get("fields")
            if isinstance(schema_fields, list) and schema_fields:
                field_names = [
                    str(f["name"])
                    for f in schema_fields
                    if isinstance(f, dict) and isinstance(f.get("name"), str)
                ]
                field_values = storage.get_metadata_field_values(
                    corpus_id=corpus_id,
                    field_names=field_names,
                )
                field_descs: list[str] = []
                for field in schema_fields:
                    if not isinstance(field, dict) or not isinstance(
                        field.get("name"), str
                    ):
                        continue
                    name = field["name"]
                    ftype = field.get("type", "string")
                    desc = field.get("description", "")
                    entry = f"{name} ({ftype})"
                    if desc:
                        entry += f": {desc}"
                    vals = field_values.get(name, [])
                    if ftype == "boolean":
                        entry += " Values: true, false"
                    elif ftype in {"integer", "number"} and vals:
                        nums = []
                        for v in vals:
                            try:
                                nums.append(float(v))
                            except (TypeError, ValueError):
                                pass
                        if nums:
                            entry += f" Range: {min(nums):.6g}-{max(nums):.6g}"
                    elif vals:
                        if "enum" in field:
                            entry += f" Values: {field['enum']}"
                        else:
                            entry += f" Values: {', '.join(repr(v) for v in vals)}"
                    elif "enum" in field:
                        entry += f" Values: {field['enum']}"
                    field_descs.append(entry)
                if field_descs:
                    lines.append("")
                    lines.append(
                        "Available filter fields for semantic_search(filters=...):"
                    )
                    for desc in field_descs:
                        lines.append(f"  - {desc}")
                _FIELD_CATALOG_SHOWN = True

    return "\n".join(lines)


def get_document(doc_id: str) -> str:
    """Return full document content by id from the active index context."""
    storage, _, error = _get_index_storage_and_corpus()
    if error:
        return error
    assert storage is not None

    document = storage.get_document(doc_id=doc_id)
    if document is None:
        return f"No indexed document found for doc_id={doc_id!r}"
    if document["is_deleted"]:
        return f"Document {doc_id} is marked as deleted in the index."

    return (
        f"=== DOCUMENT {doc_id} ===\n"
        f"Path: {document['absolute_path']}\n\n"
        f"{document['content']}"
    )


def list_indexed_documents() -> str:
    """List indexed documents for the active corpus."""
    storage, corpus_id, error = _get_index_storage_and_corpus()
    if error:
        return error
    assert storage is not None and corpus_id is not None

    documents = storage.list_documents(corpus_id=corpus_id, include_deleted=False)
    if not documents:
        return "No indexed documents found for the active corpus."

    lines = ["=== INDEXED DOCUMENTS ==="]
    for idx, document in enumerate(documents, start=1):
        lines.append(
            f"[{idx}] doc_id={document['id']} path={document['absolute_path']}"
        )
    lines.append("")
    lines.append("Use semantic_search(...) to find relevant doc_ids.")
    return "\n".join(lines)


TOOLS: dict[Tools, Callable[..., str]] = {
    "read": read_file,
    "grep": grep_file_content,
    "glob": glob_paths,
    "scan_folder": scan_folder,
    "preview_file": preview_file,
    "parse_file": parse_file,
    "semantic_search": semantic_search,
    "get_document": get_document,
    "list_indexed_documents": list_indexed_documents,
}
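

# Example wiring for the indexed tools (a minimal sketch; the folder path is
# hypothetical and assumes `explore index <folder>` has already built an index):
#
#     set_index_context("./data/test_acquisition")
#     set_search_flags(enable_semantic=True, enable_metadata=True)
#     print(semantic_search("What is the purchase price?", limit=3))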


# =============================================================================
# System Prompt
# =============================================================================

SYSTEM_PROMPT = """
You are FsExplorer, an AI agent that explores filesystems to answer user questions about documents.

## Available Tools

| Tool | Purpose | Parameters |
|------|---------|------------|
| `scan_folder` | **PARALLEL SCAN** - Scan ALL documents in a folder at once | `directory` |
| `preview_file` | Quick preview of a single document (~first page) | `file_path` |
| `parse_file` | **DEEP READ** - Full content of a document | `file_path` |
| `read` | Read a plain text file | `file_path` |
| `grep` | Search for a pattern in a file | `file_path`, `pattern` |
| `glob` | Find files matching a pattern | `directory`, `pattern` |
| `semantic_search` | Search indexed chunks and metadata-filtered docs, then union/rank results | `query`, `filters`, `limit` |
| `get_document` | Read full indexed document by document id | `doc_id` |
| `list_indexed_documents` | List indexed documents for active corpus | none |

## Indexed Retrieval Strategy

When indexed tools are available:
1. Start with `semantic_search` to quickly find relevant documents.
2. Use `get_document` for the top candidate doc IDs.
3. If indexed tools report index is unavailable, fall back to filesystem tools (`scan_folder`, `parse_file`, etc.).

Filter syntax for `semantic_search(filters=...)`:
- `field=value`
- `field!=value`
- `field>=number`, `field<=number`, `field>number`, `field<number`
- `field in (a, b, c)`
- `field~substring`
- combine conditions with comma or `and`
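
Example (field names are illustrative and corpus-dependent):
`semantic_search(query="earnout terms", filters="doc_type=memo, year>=2024", limit=5)`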

## Three-Phase Document Exploration Strategy

### PHASE 1: Parallel Scan (Use `scan_folder`)
When you encounter a folder with documents:
1. Use `scan_folder` to scan ALL documents in parallel
2. This gives you a quick preview of every document at once
3. In your **reason**, explicitly list your document categorization:
   - **RELEVANT**: Documents clearly related to the query (list them)
   - **MAYBE**: Documents that might be relevant (list them)
   - **SKIP**: Documents not relevant (list them)

### PHASE 2: Deep Dive (Use `parse_file`)
1. Use `parse_file` on documents marked RELEVANT
2. In your **reason**, explain what key information you found
3. **WATCH FOR CROSS-REFERENCES** - look for mentions like:
   - "See Exhibit A/B/C..."
   - "As stated in the [Document Name]..."
   - "Refer to [filename]..."
   - Document numbers, exhibit labels, or file names
4. In your **reason**, note any cross-references you discovered

### PHASE 3: Backtracking (Revisit if Cross-Referenced)
**CRITICAL**: If a document you're reading references another document that you SKIPPED:
1. In your **reason**, explain: "Found cross-reference to [document] - need to backtrack"
2. Use `preview_file` or `parse_file` to read the referenced document
3. Continue this until all relevant cross-references are resolved

## Providing Detailed Reasoning

Your `reason` field is displayed to the user, so make it informative:
- After scanning: List which documents you're categorizing as RELEVANT/MAYBE/SKIP and why
- After parsing: Summarize key findings and any cross-references discovered
- When backtracking: Explain which reference led you back to a skipped document

## CRITICAL: Citation Requirements for Final Answers

When providing your final answer, you MUST include citations for ALL factual claims:

### Citation Format
Use inline citations in this format: `[Source: filename, Section/Page]`

Example:
> The total purchase price is $125,000,000 [Source: 01_master_agreement.pdf, Section 2.1], 
> consisting of $80M cash [Source: 01_master_agreement.pdf, Section 2.1(a)], 
> $30M in stock [Source: 10_stock_purchase.pdf, Section 1], and 
> $15M in escrow [Source: 09_escrow_agreement.pdf, Section 2].

### Citation Rules
1. **Every factual claim needs a citation** - dates, numbers, names, terms, etc.
2. **Be specific** - include section numbers, article numbers, or page references when available
3. **Use the actual filename** - not paraphrased names
4. **Multiple sources** - if information comes from multiple documents, cite all of them

### Final Answer Structure
Your final answer should:
1. **Start with a direct answer** to the user's question
2. **Provide details** with inline citations
3. **End with a Sources section** listing all documents consulted:

```
## Sources Consulted
- 01_master_agreement.pdf - Main acquisition terms
- 10_stock_purchase.pdf - Stock component details  
- 09_escrow_agreement.pdf - Escrow terms and release schedule
```

## Example Workflow

```
User asks: "What is the purchase price?"

1. scan_folder("./documents/")
   Reason: "Scanned 10 documents. Categorizing:
   - RELEVANT: purchase_agreement.pdf (mentions 'Purchase Price' in preview)
   - RELEVANT: financial_terms.pdf (contains pricing tables)
   - MAYBE: exhibits.pdf (referenced by other docs)
   - SKIP: employee_handbook.pdf, hr_policies.pdf (unrelated to pricing)"

2. parse_file("purchase_agreement.pdf")
   Reason: "Found purchase price of $50M in Section 2.1. Document references 
   'Exhibit B for price adjustments' - need to check exhibits.pdf next."

3. parse_file("exhibits.pdf")  [BACKTRACKING]
   Reason: "Backtracking to exhibits.pdf because purchase_agreement.pdf 
   referenced it for adjustment details. Found working capital adjustment 
   formula in Exhibit B."

4. STOP with final answer including citations:
   "The purchase price is $50,000,000 [Source: purchase_agreement.pdf, Section 2.1], 
   subject to working capital adjustments [Source: exhibits.pdf, Exhibit B]..."
```
"""


def _build_system_prompt(enable_semantic: bool, enable_metadata: bool) -> str:
    """Build a system prompt with retrieval-path guidance appended."""
    if enable_semantic and enable_metadata:
        hint = (
            "\n\n## Retrieval: Semantic + Metadata\n"
            "An index is available. Start with `semantic_search` using optional "
            "`filters` for best results, then use filesystem tools for deep dives."
        )
    elif enable_semantic:
        hint = (
            "\n\n## Retrieval: Semantic Only\n"
            "An index is available. Use `semantic_search` WITHOUT the `filters` "
            "parameter for similarity search, then use filesystem tools for details."
        )
    elif enable_metadata:
        hint = (
            "\n\n## Retrieval: Metadata Only\n"
            "An index is available. Use `semantic_search` with the `filters=` "
            "parameter for metadata filtering, then use filesystem tools for details."
        )
    else:
        return SYSTEM_PROMPT
    return SYSTEM_PROMPT + hint


# =============================================================================
# Agent Implementation
# =============================================================================


class FsExplorerAgent:
    """
    AI agent for exploring filesystems using Google Gemini.

    The agent maintains a conversation history with the LLM and uses
    structured JSON output to make decisions about which actions to take.

    Attributes:
        token_usage: Tracks API call statistics and costs.
    """

    def __init__(self, api_key: str | None = None) -> None:
        """
        Initialize the agent with Google API credentials.

        Args:
            api_key: Google API key. If not provided, reads from
                     GOOGLE_API_KEY environment variable.

        Raises:
            ValueError: If no API key is available.
        """
        if api_key is None:
            api_key = os.getenv("GOOGLE_API_KEY")
        if api_key is None:
            raise ValueError(
                "GOOGLE_API_KEY not found within the current environment: "
                "please export it or provide it to the class constructor."
            )

        self._client = GenAIClient(
            api_key=api_key,
            http_options=HttpOptions(api_version="v1beta"),
        )
        self._chat_history: list[Content] = []
        self.token_usage = TokenUsage()

    def configure_task(self, task: str) -> None:
        """
        Add a task message to the conversation history.

        Args:
            task: The task or context to add to the conversation.
        """
        self._chat_history.append(
            Content(role="user", parts=[Part.from_text(text=task)])
        )

    async def take_action(self) -> tuple[Action, ActionType] | None:
        """
        Request the next action from the AI model.

        Sends the current conversation history to Gemini and receives
        a structured JSON response indicating the next action to take.

        Returns:
            A tuple of (Action, ActionType) if successful, None otherwise.
        """
        response = await self._client.aio.models.generate_content(
            model="gemini-3-flash-preview",
            contents=self._chat_history,  # type: ignore
            config={
                "system_instruction": _build_system_prompt(_ENABLE_SEMANTIC, _ENABLE_METADATA),
                "response_mime_type": "application/json",
                "response_schema": Action,
            },
        )

        # Track token usage from response metadata
        if response.usage_metadata:
            self.token_usage.add_api_call(
                prompt_tokens=response.usage_metadata.prompt_token_count or 0,
                completion_tokens=response.usage_metadata.candidates_token_count or 0,
            )

        if response.candidates is not None:
            if response.candidates[0].content is not None:
                self._chat_history.append(response.candidates[0].content)
            if response.text is not None:
                action = Action.model_validate_json(response.text)
                if action.to_action_type() == "toolcall":
                    toolcall = cast(ToolCallAction, action.action)
                    self.call_tool(
                        tool_name=toolcall.tool_name,
                        tool_input=toolcall.to_fn_args(),
                    )
                return action, action.to_action_type()

        return None

    def call_tool(self, tool_name: Tools, tool_input: dict[str, Any]) -> None:
        """
        Execute a tool and add the result to the conversation history.

        Args:
            tool_name: Name of the tool to execute.
            tool_input: Dictionary of arguments to pass to the tool.
        """
        try:
            result = TOOLS[tool_name](**tool_input)
        except Exception as e:
            result = (
                f"An error occurred while calling tool {tool_name} "
                f"with {tool_input}: {e}"
            )

        # Track tool result sizes
        self.token_usage.add_tool_result(result, tool_name)

        self._chat_history.append(
            Content(
                role="user",
                parts=[
                    Part.from_text(text=f"Tool result for {tool_name}:\n\n{result}")
                ],
            )
        )

    def reset(self) -> None:
        """Reset the agent's conversation history and token tracking."""
        self._chat_history.clear()
        self.token_usage = TokenUsage()


================================================
FILE: src/fs_explorer/embeddings.py
================================================
"""
Embedding provider for vector-based semantic search.

Wraps the Google GenAI embedding API for batch and single-query embedding
with configurable model, dimensions, and batch size.
"""

from __future__ import annotations

import os
from typing import Any

from google.genai import Client as GenAIClient


_DEFAULT_MODEL = "gemini-embedding-001"
_DEFAULT_DIM = 768
_DEFAULT_BATCH_SIZE = 50


class EmbeddingProvider:
    """Generate text embeddings via Google GenAI."""

    def __init__(
        self,
        *,
        api_key: str | None = None,
        model: str | None = None,
        dim: int | None = None,
        batch_size: int | None = None,
        client: Any | None = None,
    ) -> None:
        self.model = model or os.getenv("FS_EXPLORER_EMBEDDING_MODEL", _DEFAULT_MODEL)
        self.dim = dim or int(os.getenv("FS_EXPLORER_EMBEDDING_DIM", str(_DEFAULT_DIM)))
        self.batch_size = batch_size or int(
            os.getenv("FS_EXPLORER_EMBEDDING_BATCH_SIZE", str(_DEFAULT_BATCH_SIZE))
        )

        if client is not None:
            self._client = client
        else:
            resolved_key = api_key or os.getenv("GOOGLE_API_KEY")
            if resolved_key is None:
                raise ValueError(
                    "GOOGLE_API_KEY not found. "
                    "Provide api_key or set the environment variable."
                )
            self._client = GenAIClient(api_key=resolved_key)

    def embed_texts(
        self,
        texts: list[str],
        *,
        task_type: str = "RETRIEVAL_DOCUMENT",
    ) -> list[list[float]]:
        """Embed a list of texts in batches.

        Returns a list of embedding vectors in the same order as *texts*.
        """
        all_embeddings: list[list[float]] = []
        for start in range(0, len(texts), self.batch_size):
            batch = texts[start : start + self.batch_size]
            result = self._client.models.embed_content(
                model=self.model,
                contents=batch,
                config={
                    "task_type": task_type,
                    "output_dimensionality": self.dim,
                },
            )
            for emb in result.embeddings:
                all_embeddings.append(list(emb.values))
        return all_embeddings

    def embed_query(self, query: str) -> list[float]:
        """Embed a single query text for retrieval."""
        result = self._client.models.embed_content(
            model=self.model,
            contents=[query],
            config={
                "task_type": "RETRIEVAL_QUERY",
                "output_dimensionality": self.dim,
            },
        )
        return list(result.embeddings[0].values)
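

# Usage sketch (requires GOOGLE_API_KEY; model, dim, and batch size fall back
# to the defaults above, and the texts here are only illustrative):
#
#     provider = EmbeddingProvider()
#     doc_vecs = provider.embed_texts(["purchase price", "earnout terms"])
#     query_vec = provider.embed_query("What is the purchase price?")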


================================================
FILE: src/fs_explorer/exploration_trace.py
================================================
"""
Helpers for recording exploration path and referenced files.
"""

from __future__ import annotations

import os
import re
from dataclasses import dataclass, field
from typing import Any


FILE_TOOLS: frozenset[str] = frozenset({"read", "grep", "preview_file", "parse_file"})

# Matches citations like: [Source: filename.pdf, Section 2.1]
SOURCE_CITATION_RE = re.compile(r"\[Source:\s*([^,\]]+)")


def normalize_path(path: str, root_directory: str) -> str:
    """Return an absolute path using root_directory for relative inputs."""
    if os.path.isabs(path):
        return os.path.abspath(path)
    return os.path.abspath(os.path.join(root_directory, path))


def extract_cited_sources(final_result: str | None) -> list[str]:
    """Extract source labels from final answer citations while preserving order."""
    if not final_result:
        return []

    seen: set[str] = set()
    ordered_sources: list[str] = []

    for raw_source in SOURCE_CITATION_RE.findall(final_result):
        source = raw_source.strip()
        if source and source not in seen:
            seen.add(source)
            ordered_sources.append(source)

    return ordered_sources


@dataclass
class ExplorationTrace:
    """
    Collects a step-by-step path and files referenced by tool calls.

    Paths are normalized to absolute paths to make replay/debugging easier.
    """

    root_directory: str
    step_path: list[str] = field(default_factory=list)
    referenced_documents: set[str] = field(default_factory=set)

    def record_tool_call(
        self,
        *,
        step_number: int,
        tool_name: str,
        tool_input: dict[str, Any],
        resolved_document_path: str | None = None,
    ) -> None:
        """Record a tool call in the exploration path."""
        path_entries: list[str] = []

        directory = tool_input.get("directory")
        if isinstance(directory, str) and directory:
            path_entries.append(f"directory={normalize_path(directory, self.root_directory)}")

        file_path = tool_input.get("file_path")
        if isinstance(file_path, str) and file_path:
            normalized_file_path = normalize_path(file_path, self.root_directory)
            path_entries.append(f"file={normalized_file_path}")
            if tool_name in FILE_TOOLS:
                self.referenced_documents.add(normalized_file_path)

        if resolved_document_path:
            normalized_doc_path = normalize_path(resolved_document_path, self.root_directory)
            path_entries.append(f"document={normalized_doc_path}")
            self.referenced_documents.add(normalized_doc_path)

        parameters = ", ".join(path_entries) if path_entries else "no-path-args"
        self.step_path.append(f"{step_number}. tool:{tool_name} ({parameters})")

    def record_go_deeper(self, *, step_number: int, directory: str) -> None:
        """Record a directory navigation event in the exploration path."""
        resolved_dir = normalize_path(directory, self.root_directory)
        self.step_path.append(f"{step_number}. godeeper (directory={resolved_dir})")

    def sorted_documents(self) -> list[str]:
        """Return a sorted list of referenced documents."""
        return sorted(self.referenced_documents)
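

# Usage sketch (paths are illustrative; the result shown assumes a POSIX
# filesystem, since paths are normalized to absolute form):
#
#     trace = ExplorationTrace(root_directory="/corpus")
#     trace.record_tool_call(
#         step_number=1,
#         tool_name="parse_file",
#         tool_input={"file_path": "01_agreement.pdf"},
#     )
#     trace.sorted_documents()  # -> ["/corpus/01_agreement.pdf"]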


================================================
FILE: src/fs_explorer/fs.py
================================================
"""
Filesystem utilities for the FsExplorer agent.

This module provides functions for reading, searching, and parsing files
in the filesystem, including support for complex document formats via Docling.
"""

import os
import re
import glob as glob_module
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

from docling.document_converter import DocumentConverter


# =============================================================================
# Configuration Constants
# =============================================================================

# Supported document extensions for parsing
SUPPORTED_EXTENSIONS: frozenset[str] = frozenset({
    ".pdf", ".docx", ".doc", ".pptx", ".xlsx", ".html", ".md"
})

# Preview settings
DEFAULT_PREVIEW_CHARS = 3000  # Characters for single file preview (~2 pages)
DEFAULT_SCAN_PREVIEW_CHARS = 1500  # Characters for folder scan preview (~1 page)
MAX_PREVIEW_LINES = 30  # Maximum lines to show in scan results

# Parallel processing settings
DEFAULT_MAX_WORKERS = 4  # Thread pool size for parallel document scanning


# =============================================================================
# Document Cache
# =============================================================================

# Cache for parsed documents to avoid re-parsing
_DOCUMENT_CACHE: dict[str, str] = {}


def clear_document_cache() -> None:
    """Clear the document cache. Useful for testing or memory management."""
    _DOCUMENT_CACHE.clear()


def _get_cached_or_parse(file_path: str) -> str:
    """
    Get document content from cache or parse it.
    
    Uses file modification time in cache key to invalidate stale entries.
    
    Args:
        file_path: Path to the document file.
    
    Returns:
        The document content as markdown.
    
    Raises:
        Exception: If the document cannot be parsed.
    """
    abs_path = os.path.abspath(file_path)
    cache_key = f"{abs_path}:{os.path.getmtime(abs_path)}"
    
    if cache_key not in _DOCUMENT_CACHE:
        converter = DocumentConverter()
        result = converter.convert(file_path)
        _DOCUMENT_CACHE[cache_key] = result.document.export_to_markdown()
    
    return _DOCUMENT_CACHE[cache_key]
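

# Note on cache semantics: because the mtime is part of the key, editing a
# file causes the next call to re-parse it, but the stale entry is not
# evicted -- it simply stops being hit. Long-running processes that churn
# through many edited documents should call clear_document_cache()
# periodically to bound memory use.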


# =============================================================================
# Directory Operations
# =============================================================================

def describe_dir_content(directory: str) -> str:
    """
    Describe the contents of a directory.
    
    Lists all files and subdirectories in the given directory path.
    
    Args:
        directory: Path to the directory to describe.
    
    Returns:
        A formatted string describing the directory contents,
        or an error message if the directory doesn't exist.
    """
    if not os.path.exists(directory) or not os.path.isdir(directory):
        return f"No such directory: {directory}"
    
    children = os.listdir(directory)
    if not children:
        return f"Directory {directory} is empty"
    
    files = []
    directories = []
    
    for child in children:
        fullpath = os.path.join(directory, child)
        if os.path.isfile(fullpath):
            files.append(fullpath)
        else:
            directories.append(fullpath)
    
    description = f"Content of {directory}\n"
    description += "FILES:\n- " + "\n- ".join(files)
    
    if not directories:
        description += "\nThis folder does not have any sub-folders"
    else:
        description += "\nSUBFOLDERS:\n- " + "\n- ".join(directories)
    
    return description


# =============================================================================
# Basic File Operations
# =============================================================================

def read_file(file_path: str) -> str:
    """
    Read the contents of a text file.
    
    Args:
        file_path: Path to the file to read.
    
    Returns:
        The file contents, or an error message if the file doesn't exist.
    """
    if not os.path.exists(file_path) or not os.path.isfile(file_path):
        return f"No such file: {file_path}"
    
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()


def grep_file_content(file_path: str, pattern: str) -> str:
    """
    Search for a regex pattern in a file.
    
    Args:
        file_path: Path to the file to search.
        pattern: Regular expression pattern to search for.
    
    Returns:
        A formatted string with matches, "No matches found",
        or an error message if the file doesn't exist.
    """
    if not os.path.exists(file_path) or not os.path.isfile(file_path):
        return f"No such file: {file_path}"
    
    with open(file_path, "r", encoding="utf-8") as f:
        content = f.read()
    
    regex = re.compile(pattern=pattern, flags=re.MULTILINE)
    # finditer + group(0) keeps full-match strings even when the pattern
    # contains groups (findall would return tuples there, breaking the
    # join below).
    matches = [match.group(0) for match in regex.finditer(content)]
    
    if matches:
        return f"MATCHES for {pattern} in {file_path}:\n\n- " + "\n- ".join(matches)
    return "No matches found"


def glob_paths(directory: str, pattern: str) -> str:
    """
    Find files matching a glob pattern in a directory.
    
    Args:
        directory: Path to the directory to search in.
        pattern: Glob pattern to match (e.g., "*.txt", "**/*.pdf").
    
    Returns:
        A formatted string with matching paths, "No matches found",
        or an error message if the directory doesn't exist.
    """
    if not os.path.exists(directory) or not os.path.isdir(directory):
        return f"No such directory: {directory}"
    
    # Use pathlib for cleaner path handling; recursive=True makes "**"
    # patterns work as advertised in the docstring.
    search_path = Path(directory) / pattern
    matches = glob_module.glob(str(search_path), recursive=True)
    
    if matches:
        return f"MATCHES for {pattern} in {directory}:\n\n- " + "\n- ".join(matches)
    return "No matches found"


# =============================================================================
# Document Parsing Operations
# =============================================================================

def preview_file(file_path: str, max_chars: int = DEFAULT_PREVIEW_CHARS) -> str:
    """
    Get a quick preview of a document file.
    
    Reads only the first portion of the document content for initial
    relevance assessment before doing a full parse.
    
    Args:
        file_path: Path to the document file.
        max_chars: Maximum characters to return (default: 3000, ~2-3 pages).
    
    Returns:
        A preview of the document content, or an error message.
    """
    if not os.path.exists(file_path) or not os.path.isfile(file_path):
        return f"No such file: {file_path}"

    ext = os.path.splitext(file_path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return (
            f"Unsupported file extension: {ext}. "
            f"Supported: {', '.join(sorted(SUPPORTED_EXTENSIONS))}"
        )

    try:
        full_content = _get_cached_or_parse(file_path)
        preview = full_content[:max_chars]
        
        total_len = len(full_content)
        if total_len > max_chars:
            preview += (
                f"\n\n[... PREVIEW TRUNCATED. Full document has {total_len:,} "
                f"characters. Use parse_file() to read the complete document ...]"
            )
        
        return f"=== PREVIEW of {file_path} ===\n\n{preview}"
    except Exception as e:
        return f"Error previewing {file_path}: {e}"


def parse_file(file_path: str) -> str:
    """
    Parse and return the complete content of a document file.
    
    Use this after preview_file() confirms the document is relevant,
    or when you need to find cross-references to other documents.
    
    Supported formats: PDF, DOCX, DOC, PPTX, XLSX, HTML, MD.
    
    Args:
        file_path: Path to the document file.
    
    Returns:
        The complete document content as markdown, or an error message.
    """
    if not os.path.exists(file_path) or not os.path.isfile(file_path):
        return f"No such file: {file_path}"

    ext = os.path.splitext(file_path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return (
            f"Unsupported file extension: {ext}. "
            f"Supported: {', '.join(sorted(SUPPORTED_EXTENSIONS))}"
        )

    try:
        return _get_cached_or_parse(file_path)
    except Exception as e:
        return f"Error parsing {file_path}: {e}"


# =============================================================================
# Parallel Document Scanning
# =============================================================================

def _preview_single_file(file_path: str, preview_chars: int) -> dict:
    """
    Helper to preview a single file for parallel processing.
    
    Args:
        file_path: Path to the document file.
        preview_chars: Number of characters to include in preview.
    
    Returns:
        A dictionary with file info and preview content.
    """
    filename = os.path.basename(file_path)
    try:
        content = _get_cached_or_parse(file_path)
        preview = content[:preview_chars]
        return {
            "file": file_path,
            "filename": filename,
            "preview": preview,
            "total_chars": len(content),
            "status": "success"
        }
    except Exception as e:
        return {
            "file": file_path,
            "filename": filename,
            "preview": "",
            "total_chars": 0,
            "status": f"error: {e}"
        }


def scan_folder(
    directory: str,
    max_workers: int = DEFAULT_MAX_WORKERS,
    preview_chars: int = DEFAULT_SCAN_PREVIEW_CHARS,
) -> str:
    """
    Scan all documents in a folder in parallel and return quick previews.
    
    This is the FIRST step when exploring a folder with multiple documents.
    It efficiently processes all documents at once so you can assess relevance
    before doing deep dives into specific files.
    
    Args:
        directory: Path to the folder to scan.
        max_workers: Number of parallel workers (default: 4).
        preview_chars: Characters to preview per file (default: 1500, ~1 page).
    
    Returns:
        A formatted summary of all documents with their previews.
    """
    if not os.path.exists(directory) or not os.path.isdir(directory):
        return f"No such directory: {directory}"
    
    # Find all supported document files
    doc_files = []
    for item in os.listdir(directory):
        item_path = os.path.join(directory, item)
        if os.path.isfile(item_path):
            ext = os.path.splitext(item)[1].lower()
            if ext in SUPPORTED_EXTENSIONS:
                doc_files.append(item_path)
    
    if not doc_files:
        return (
            f"No supported documents found in {directory}. "
            f"Supported extensions: {', '.join(sorted(SUPPORTED_EXTENSIONS))}"
        )
    
    # Scan all documents in parallel
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_file = {
            executor.submit(_preview_single_file, f, preview_chars): f 
            for f in doc_files
        }
        for future in as_completed(future_to_file):
            results.append(future.result())
    
    # Sort by filename for consistent ordering
    results.sort(key=lambda x: x["filename"])
    
    # Build the summary report
    output = []
    output.append("═══════════════════════════════════════════════════════════════")
    output.append(f"  PARALLEL DOCUMENT SCAN: {directory}")
    output.append(f"  Found {len(results)} documents")
    output.append("═══════════════════════════════════════════════════════════════")
    output.append("")
    
    for i, result in enumerate(results, 1):
        output.append("┌─────────────────────────────────────────────────────────────")
        output.append(f"│ [{i}/{len(results)}] {result['filename']}")
        output.append(f"│ Path: {result['file']}")
        output.append(f"│ Status: {result['status']} | Total size: {result['total_chars']:,} chars")
        output.append("├─────────────────────────────────────────────────────────────")
        
        if result['status'] == 'success' and result['preview']:
            # Indent the preview content
            preview_lines = result['preview'].split('\n')
            for line in preview_lines[:MAX_PREVIEW_LINES]:
                output.append(f"│ {line}")
            if len(preview_lines) > MAX_PREVIEW_LINES:
                output.append("│ ... (preview truncated)")
        else:
            output.append("│ [No preview available]")
        
        output.append("└─────────────────────────────────────────────────────────────")
        output.append("")
    
    output.append("═══════════════════════════════════════════════════════════════")
    output.append("  NEXT STEPS:")
    output.append("  1. Assess which documents are RELEVANT to the user's query")
    output.append("  2. Use parse_file() for DEEP DIVE into relevant documents")
    output.append("  3. Watch for cross-references to other docs (may need backtracking)")
    output.append("═══════════════════════════════════════════════════════════════")
    
    return "\n".join(output)


================================================
FILE: src/fs_explorer/index_config.py
================================================
"""
Configuration helpers for local index storage.
"""

from __future__ import annotations

import os
from pathlib import Path


DEFAULT_DB_PATH = "~/.fs_explorer/index.duckdb"
ENV_DB_PATH = "FS_EXPLORER_DB_PATH"


def resolve_db_path(override_path: str | None = None) -> str:
    """
    Resolve the DuckDB path from CLI override, env var, or default.

    Precedence:
    1) explicit override_path
    2) FS_EXPLORER_DB_PATH
    3) default path
    """
    raw_path = override_path or os.getenv(ENV_DB_PATH) or DEFAULT_DB_PATH
    resolved = Path(raw_path).expanduser().resolve()
    resolved.parent.mkdir(parents=True, exist_ok=True)
    return str(resolved)
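

# --- Example precedence sketch (hypothetical paths). Note that
# resolve_db_path() also creates the parent directory of whichever path wins.
if __name__ == "__main__":
    print(resolve_db_path("/tmp/custom.duckdb"))   # 1) explicit override wins
    os.environ[ENV_DB_PATH] = "~/indexes/demo.duckdb"
    print(resolve_db_path())                       # 2) then the env var
    del os.environ[ENV_DB_PATH]
    print(resolve_db_path())                       # 3) then the default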


================================================
FILE: src/fs_explorer/indexing/__init__.py
================================================
"""Indexing components for FsExplorer."""

from .chunker import SmartChunker, TextChunk
from .pipeline import IndexingPipeline, IndexingResult
from .schema import SchemaDiscovery

__all__ = [
    "SmartChunker",
    "TextChunk",
    "IndexingPipeline",
    "IndexingResult",
    "SchemaDiscovery",
]


================================================
FILE: src/fs_explorer/indexing/chunker.py
================================================
"""
Chunking utilities for indexing document content.
"""

from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TextChunk:
    """A content chunk with source offsets."""

    text: str
    position: int
    start_char: int
    end_char: int


class SmartChunker:
    """
    Paragraph-aware chunker with overlap.

    This implementation is char-based to keep it deterministic and lightweight.
    """

    def __init__(self, chunk_size: int = 1500, overlap: int = 150) -> None:
        if chunk_size <= 0:
            raise ValueError("chunk_size must be > 0")
        if overlap < 0:
            raise ValueError("overlap must be >= 0")
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")

        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk_text(self, text: str) -> list[TextChunk]:
        """
        Split text into chunks while preferring paragraph boundaries.
        """
        normalized = text.strip()
        if not normalized:
            return []

        chunks: list[TextChunk] = []
        start = 0
        position = 0
        total = len(normalized)

        while start < total:
            tentative_end = min(start + self.chunk_size, total)
            end = tentative_end

            if tentative_end < total:
                boundary = normalized.rfind("\n\n", start + (self.chunk_size // 2), tentative_end)
                if boundary != -1:
                    end = boundary + 2

            chunk_text = normalized[start:end].strip()
            if chunk_text:
                chunks.append(
                    TextChunk(
                        text=chunk_text,
                        position=position,
                        start_char=start,
                        end_char=end,
                    )
                )
                position += 1

            if end >= total:
                break
            start = max(0, end - self.overlap)

        return chunks
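

# --- Example usage sketch: chunk a small text and inspect the source offsets.
# Tiny chunk_size/overlap values are used here only to keep the output short.
if __name__ == "__main__":
    chunker = SmartChunker(chunk_size=80, overlap=10)
    sample = (
        "First paragraph about the deal terms.\n\n"
        "Second paragraph about escrow and earnout provisions.\n\n"
        "Third paragraph with the closing conditions."
    )
    for chunk in chunker.chunk_text(sample):
        print(chunk.position, chunk.start_char, chunk.end_char, repr(chunk.text[:30]))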


================================================
FILE: src/fs_explorer/indexing/metadata.py
================================================
"""
Metadata extraction helpers for indexed documents.
"""

from __future__ import annotations

import copy
import json
import os
import re
from collections import defaultdict
from pathlib import Path
from typing import Any


_CURRENCY_RE = re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?")
_DATE_RE = re.compile(
    r"\b(?:\d{4}-\d{2}-\d{2}|"
    r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|sept|oct|nov|dec)[a-z]*\s+\d{1,2},\s+\d{4})\b",
    flags=re.IGNORECASE,
)
_DOC_TYPE_TOKEN_RE = re.compile(r"[a-z0-9]+")
_DOC_TYPE_STOPWORDS: set[str] = {
    "the",
    "and",
    "for",
    "with",
    "from",
    "copy",
    "draft",
    "final",
    "version",
    "v1",
    "v2",
    "v3",
    "new",
    "old",
    "tmp",
    "temp",
}

_LANGEXTRACT_PROMPT_DESCRIPTION = (
    "Extract key transaction metadata from legal and deal documents. "
    "Use extraction classes: organization, person, money, date, deal_term. "
    "Use exact spans from the source text and avoid paraphrasing."
)

_VALID_METADATA_FIELD_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")
_VALID_FIELD_TYPES: set[str] = {"string", "integer", "number", "boolean"}
_VALID_RUNTIME_FIELDS: set[str] = {"enabled", "extraction_count", "entity_classes"}
_FIELD_MODE_ALIASES: dict[str, str] = {
    "csv": "values",
    "list": "values",
    "joined": "values",
    "join": "values",
    "values": "values",
    "count": "count",
    "exists": "exists",
    "contains": "contains",
    "contains_any": "contains",
}

_DEFAULT_LANGEXTRACT_PROFILE: dict[str, Any] = {
    "name": "default_langextract",
    "description": "Default metadata extraction profile for legal and deal-style documents.",
    "prompt_description": _LANGEXTRACT_PROMPT_DESCRIPTION,
    "fields": [
        {
            "name": "lx_enabled",
            "type": "boolean",
            "required": False,
            "description": "Whether langextract metadata extraction succeeded.",
            "source": "runtime",
            "runtime": "enabled",
        },
        {
            "name": "lx_extraction_count",
            "type": "integer",
            "required": False,
            "description": "Number of langextract entities extracted from the document.",
            "source": "runtime",
            "runtime": "extraction_count",
        },
        {
            "name": "lx_entity_classes",
            "type": "string",
            "required": False,
            "description": "Comma-separated extraction classes returned by langextract.",
            "source": "runtime",
            "runtime": "entity_classes",
        },
        {
            "name": "lx_organizations",
            "type": "string",
            "required": False,
            "description": "Comma-separated organization names extracted by langextract.",
            "source": "entities",
            "source_classes": ["organization", "company", "party"],
            "mode": "values",
        },
        {
            "name": "lx_people",
            "type": "string",
            "required": False,
            "description": "Comma-separated person names extracted by langextract.",
            "source": "entities",
            "source_classes": ["person", "individual", "executive"],
            "mode": "values",
        },
        {
            "name": "lx_deal_terms",
            "type": "string",
            "required": False,
            "description": "Comma-separated deal terms extracted by langextract.",
            "source": "entities",
            "source_classes": ["deal_term", "term", "provision"],
            "mode": "values",
        },
        {
            "name": "lx_money_mentions",
            "type": "integer",
            "required": False,
            "description": "Count of monetary amount entities from langextract.",
            "source": "entities",
            "source_classes": ["money", "amount", "currency"],
            "mode": "count",
        },
        {
            "name": "lx_date_mentions",
            "type": "integer",
            "required": False,
            "description": "Count of date entities from langextract.",
            "source": "entities",
            "source_classes": ["date"],
            "mode": "count",
        },
        {
            "name": "lx_has_earnout",
            "type": "boolean",
            "required": False,
            "description": "Whether extracted deal terms indicate an earnout.",
            "source": "entities",
            "source_classes": ["deal_term", "term", "provision"],
            "mode": "contains",
            "contains_any": ["earnout"],
        },
        {
            "name": "lx_has_escrow",
            "type": "boolean",
            "required": False,
            "description": "Whether extracted deal terms indicate escrow.",
            "source": "entities",
            "source_classes": ["deal_term", "term", "provision"],
            "mode": "contains",
            "contains_any": ["escrow"],
        },
    ],
}


_AUTO_PROFILE_PROMPT_TEMPLATE = (
    "You are a metadata schema designer. Analyze the document samples below and generate "
    "a langextract metadata extraction profile tailored to this corpus.\n\n"
    "Return a JSON object with these keys:\n"
    '- "name": a short descriptive profile name (string)\n'
    '- "description": one-sentence description of the profile (string)\n'
    '- "prompt_description": instruction text for the extraction model (string)\n'
    '- "fields": array of field definitions\n\n'
    "Each field object must have:\n"
    '- "name": valid identifier starting with "lx_" (letters, digits, underscores)\n'
    '- "type": one of "string", "integer", "number", "boolean"\n'
    '- "description": what this field captures\n'
    '- "source": "entities"\n'
    '- "source_classes": array of entity class names to aggregate (e.g. ["organization", "company"])\n'
    '- "mode": one of "values" (comma-joined text), "count" (integer count), "exists" (boolean), '
    '"contains" (boolean, requires "contains_any")\n'
    '- "contains_any": (only when mode is "contains") array of lowercase terms to match\n\n'
    "Valid entity source classes include (but are not limited to): organization, company, party, "
    "person, individual, executive, money, amount, currency, date, deal_term, term, provision, "
    "location, product, technology, regulation, clause, obligation.\n\n"
    "### Example profile for legal/M&A documents\n"
    "```json\n"
    '{"name": "legal_ma", "description": "Metadata extraction for legal and M&A deal documents.", '
    '"prompt_description": "Extract key transaction metadata from legal and deal documents.", '
    '"fields": ['
    '{"name": "lx_organizations", "type": "string", "description": "Organization names.", '
    '"source": "entities", "source_classes": ["organization", "company", "party"], "mode": "values"}, '
    '{"name": "lx_money_mentions", "type": "integer", "description": "Count of monetary amounts.", '
    '"source": "entities", "source_classes": ["money", "amount"], "mode": "count"}, '
    '{"name": "lx_has_escrow", "type": "boolean", "description": "Whether escrow terms are present.", '
    '"source": "entities", "source_classes": ["deal_term", "provision"], "mode": "contains", '
    '"contains_any": ["escrow"]}'
    "]}\n"
    "```\n\n"
    "### Example profile for technical/research documents\n"
    "```json\n"
    '{"name": "tech_research", "description": "Metadata extraction for technical and research documents.", '
    '"prompt_description": "Extract key entities from technical and research documents.", '
    '"fields": ['
    '{"name": "lx_technologies", "type": "string", "description": "Technology names.", '
    '"source": "entities", "source_classes": ["technology", "product"], "mode": "values"}, '
    '{"name": "lx_people", "type": "string", "description": "Person names.", '
    '"source": "entities", "source_classes": ["person", "individual"], "mode": "values"}, '
    '{"name": "lx_org_count", "type": "integer", "description": "Number of organizations mentioned.", '
    '"source": "entities", "source_classes": ["organization", "company"], "mode": "count"}'
    "]}\n"
    "```\n\n"
    "### Document samples from the corpus\n\n"
    "SAMPLES_PLACEHOLDER\n\n"
    "Generate a profile with 4-8 entity fields (do NOT include runtime fields). "
    "Return ONLY the JSON object, no markdown fencing."
)


def _get_genai_client(api_key: str) -> Any:
    """Instantiate a Google GenAI client. Separated for test patching."""
    from google.genai import Client as _GenAIClient

    return _GenAIClient(api_key=api_key)


def auto_discover_profile(
    folder: str,
    *,
    sample_count: int = 3,
    model_id: str | None = None,
) -> dict[str, Any]:
    """Use an LLM to generate a langextract profile tailored to the corpus.

    Falls back to the default hardcoded profile on any failure.
    """
    from .schema import _iter_supported_files

    files = _iter_supported_files(folder)
    if not files:
        return default_langextract_profile()

    # Sample files evenly
    n = min(sample_count, len(files))
    step = max(1, len(files) // n)
    sampled = [files[i * step] for i in range(n)]
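    # e.g. 10 files with sample_count=3 -> step=3 -> indices 0, 3, 6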

    # Parse and truncate
    from ..fs import parse_file

    snippets: list[str] = []
    for file_path in sampled:
        try:
            text = parse_file(file_path)
            snippets.append(
                f"--- {Path(file_path).name} ---\n{text[:2000]}"
            )
        except Exception:
            continue

    if not snippets:
        return default_langextract_profile()

    api_key = os.getenv("GOOGLE_API_KEY")
    if not api_key:
        return default_langextract_profile()

    effective_model = model_id or os.getenv(
        "FS_EXPLORER_PROFILE_MODEL", "gemini-2.0-flash"
    )

    try:
        client = _get_genai_client(api_key=api_key)
        prompt = _AUTO_PROFILE_PROMPT_TEMPLATE.replace(
            "SAMPLES_PLACEHOLDER", "\n\n".join(snippets)
        )
        response = client.models.generate_content(
            model=effective_model,
            contents=prompt,
        )
        raw_text = (response.text or "").strip()
        # Strip markdown fencing if present
        if raw_text.startswith("```"):
            raw_text = re.sub(r"^```[a-z]*\n?", "", raw_text)
            raw_text = re.sub(r"\n?```$", "", raw_text).strip()
        profile = json.loads(raw_text)
        # Add runtime fields that are always present
        runtime_fields = [
            f for f in _DEFAULT_LANGEXTRACT_PROFILE["fields"] if f.get("source") == "runtime"
        ]
        existing_names = {
            str(f.get("name")) for f in profile.get("fields", []) if isinstance(f, dict)
        }
        for rf in runtime_fields:
            if rf["name"] not in existing_names:
                profile.setdefault("fields", []).insert(0, copy.deepcopy(rf))
        return normalize_langextract_profile(profile)
    except Exception:
        return default_langextract_profile()
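

# Example usage sketch: auto_discover_profile() is safe to call without
# credentials, because every failure path (no files, no GOOGLE_API_KEY, bad
# LLM output) falls back to default_langextract_profile(). The folder below
# is hypothetical.
#
#     profile = auto_discover_profile("/data/deals", sample_count=2)
#     print(profile["name"], [f["name"] for f in profile["fields"]])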


def infer_document_type(file_path: str) -> str:
    """Infer a generic document type from filename tokens."""
    stem = Path(file_path).stem.lower()
    tokens = [token for token in _DOC_TYPE_TOKEN_RE.findall(stem) if token]
    filtered = [
        token
        for token in tokens
        if not token.isdigit() and len(token) > 2 and token not in _DOC_TYPE_STOPWORDS
    ]
    if filtered:
        return filtered[-1]
    if tokens:
        return tokens[-1]
    return "document"


def default_langextract_profile() -> dict[str, Any]:
    """Return a mutable copy of the built-in metadata profile."""
    return copy.deepcopy(_DEFAULT_LANGEXTRACT_PROFILE)


def normalize_langextract_profile(profile: dict[str, Any] | None) -> dict[str, Any]:
    """
    Validate and normalize user-provided langextract profile configuration.

    Expected shape:
    - prompt_description: str (optional)
    - max_chars: int (optional)
    - fields: list[{
        name: str,
        type: string|integer|number|boolean,
        description: str (optional),
        required: bool (optional),
        source: runtime|entities (default entities),
        runtime: enabled|extraction_count|entity_classes (runtime source only),
        source_class: str (entities source),
        source_classes: list[str] (entities source),
        mode: values|count|exists|contains (entities source),
        contains_any: list[str] (contains mode),
      }]
    """
    raw = default_langextract_profile() if profile is None else copy.deepcopy(profile)
    if not isinstance(raw, dict):
        raise ValueError("Metadata profile must be a JSON object.")

    prompt = raw.get("prompt_description")
    if prompt is None:
        prompt_description = _LANGEXTRACT_PROMPT_DESCRIPTION
    elif isinstance(prompt, str) and prompt.strip():
        prompt_description = prompt.strip()
    else:
        raise ValueError(
            "Metadata profile field 'prompt_description' must be a non-empty string."
        )

    max_chars: int | None = None
    if "max_chars" in raw:
        max_chars = _safe_positive_int(
            raw.get("max_chars"),
            minimum=500,
            field_name="max_chars",
        )

    raw_fields = raw.get("fields")
    if not isinstance(raw_fields, list) or not raw_fields:
        raise ValueError("Metadata profile must include a non-empty 'fields' array.")

    normalized_fields: list[dict[str, Any]] = []
    seen_names: set[str] = set()
    for idx, raw_field in enumerate(raw_fields):
        if not isinstance(raw_field, dict):
            raise ValueError(f"Metadata field at index {idx} must be an object.")

        name_obj = raw_field.get("name")
        if not isinstance(name_obj, str) or not name_obj.strip():
            raise ValueError(
                f"Metadata field at index {idx} is missing a valid 'name'."
            )
        name = name_obj.strip()
        if not _VALID_METADATA_FIELD_NAME_RE.match(name):
            raise ValueError(
                f"Invalid metadata field name '{name}'. "
                "Use letters, numbers, and underscores."
            )
        if name in seen_names:
            raise ValueError(f"Duplicate metadata field name '{name}'.")
        seen_names.add(name)

        field_type = str(raw_field.get("type", "string")).strip().lower()
        if field_type not in _VALID_FIELD_TYPES:
            allowed_types = ", ".join(sorted(_VALID_FIELD_TYPES))
            raise ValueError(
                f"Metadata field '{name}' has invalid type '{field_type}'. "
                f"Allowed types: {allowed_types}."
            )

        description_obj = raw_field.get("description")
        description = (
            description_obj.strip()
            if isinstance(description_obj, str) and description_obj.strip()
            else f"Metadata field '{name}'."
        )
        required = bool(raw_field.get("required", False))

        source = str(raw_field.get("source", "entities")).strip().lower()
        if source not in {"runtime", "entities"}:
            raise ValueError(
                f"Metadata field '{name}' has invalid source '{source}'. "
                "Use 'runtime' or 'entities'."
            )

        normalized: dict[str, Any] = {
            "name": name,
            "type": field_type,
            "required": required,
            "description": description,
            "source": source,
        }

        if source == "runtime":
            runtime = str(raw_field.get("runtime", "")).strip().lower()
            if runtime not in _VALID_RUNTIME_FIELDS:
                allowed_runtime = ", ".join(sorted(_VALID_RUNTIME_FIELDS))
                raise ValueError(
                    f"Metadata field '{name}' has invalid runtime '{runtime}'. "
                    f"Allowed runtime fields: {allowed_runtime}."
                )

================================================
SYMBOL INDEX (377 symbols across 32 files)
================================================

FILE: scripts/generate_large_docs.py
  function generate_content (line 145) | def generate_content(doc_id: str, meta: dict) -> list:
  function generate_sections (line 178) | def generate_sections(doc_id: str, meta: dict) -> list:
  function create_pdf (line 507) | def create_pdf(doc_id: str, meta: dict, output_dir: str):
  function main (line 518) | def main():

FILE: scripts/generate_test_docs.py
  function create_pdf (line 709) | def create_pdf(filename: str, title: str, content: str):
  function main (line 746) | def main():

FILE: src/fs_explorer/agent.py
  class TokenUsage (line 52) | class TokenUsage:
    method add_api_call (line 70) | def add_api_call(self, prompt_tokens: int, completion_tokens: int) -> ...
    method add_tool_result (line 77) | def add_tool_result(self, result: str, tool_name: str) -> None:
    method _calculate_cost (line 88) | def _calculate_cost(self) -> tuple[float, float, float]:
    method summary (line 98) | def summary(self) -> str:
  class IndexContext (line 129) | class IndexContext:
  function set_search_flags (line 143) | def set_search_flags(
  function get_search_flags (line 152) | def get_search_flags() -> tuple[bool, bool]:
  function set_embedding_provider (line 157) | def set_embedding_provider(provider: EmbeddingProvider | None) -> None:
  function set_index_context (line 163) | def set_index_context(folder: str, db_path: str | None = None) -> None:
  function clear_index_context (line 178) | def clear_index_context() -> None:
  function _get_index_storage_and_corpus (line 189) | def _get_index_storage_and_corpus() -> tuple[
  function _clean_excerpt (line 207) | def _clean_excerpt(text: str, max_chars: int = 320) -> str:
  function semantic_search (line 214) | def semantic_search(query: str, filters: str | None = None, limit: int =...
  function get_document (line 328) | def get_document(doc_id: str) -> str:
  function list_indexed_documents (line 348) | def list_indexed_documents() -> str:
  function _build_system_prompt (line 511) | def _build_system_prompt(enable_semantic: bool, enable_metadata: bool) -...
  class FsExplorerAgent (line 541) | class FsExplorerAgent:
    method __init__ (line 552) | def __init__(self, api_key: str | None = None) -> None:
    method configure_task (line 578) | def configure_task(self, task: str) -> None:
    method take_action (line 589) | async def take_action(self) -> tuple[Action, ActionType] | None:
    method call_tool (line 631) | def call_tool(self, tool_name: Tools, tool_input: dict[str, Any]) -> N...
    method reset (line 659) | def reset(self) -> None:

FILE: src/fs_explorer/embeddings.py
  class EmbeddingProvider (line 21) | class EmbeddingProvider:
    method __init__ (line 24) | def __init__(
    method embed_texts (line 50) | def embed_texts(
    method embed_query (line 75) | def embed_query(self, query: str) -> list[float]:

FILE: src/fs_explorer/exploration_trace.py
  function normalize_path (line 19) | def normalize_path(path: str, root_directory: str) -> str:
  function extract_cited_sources (line 26) | def extract_cited_sources(final_result: str | None) -> list[str]:
  class ExplorationTrace (line 44) | class ExplorationTrace:
    method record_tool_call (line 55) | def record_tool_call(
    method record_go_deeper (line 85) | def record_go_deeper(self, *, step_number: int, directory: str) -> None:
    method sorted_documents (line 90) | def sorted_documents(self) -> list[str]:

FILE: src/fs_explorer/fs.py
  function clear_document_cache (line 43) | def clear_document_cache() -> None:
  function _get_cached_or_parse (line 48) | def _get_cached_or_parse(file_path: str) -> str:
  function describe_dir_content (line 78) | def describe_dir_content(directory: str) -> str:
  function read_file (line 123) | def read_file(file_path: str) -> str:
  function grep_file_content (line 140) | def grep_file_content(file_path: str, pattern: str) -> str:
  function glob_paths (line 166) | def glob_paths(directory: str, pattern: str) -> str:
  function preview_file (line 194) | def preview_file(file_path: str, max_chars: int = DEFAULT_PREVIEW_CHARS)...
  function parse_file (line 234) | def parse_file(file_path: str) -> str:
  function _preview_single_file (line 269) | def _preview_single_file(file_path: str, preview_chars: int) -> dict:
  function scan_folder (line 301) | def scan_folder(

FILE: src/fs_explorer/index_config.py
  function resolve_db_path (line 15) | def resolve_db_path(override_path: str | None = None) -> str:

FILE: src/fs_explorer/indexing/chunker.py
  class TextChunk (line 11) | class TextChunk:
  class SmartChunker (line 20) | class SmartChunker:
    method __init__ (line 27) | def __init__(self, chunk_size: int = 1500, overlap: int = 150) -> None:
    method chunk_text (line 38) | def chunk_text(self, text: str) -> list[TextChunk]:

FILE: src/fs_explorer/indexing/metadata.py
  function _get_genai_client (line 215) | def _get_genai_client(api_key: str) -> Any:
  function auto_discover_profile (line 222) | def auto_discover_profile(
  function infer_document_type (line 297) | def infer_document_type(file_path: str) -> str:
  function default_langextract_profile (line 313) | def default_langextract_profile() -> dict[str, Any]:
  function normalize_langextract_profile (line 318) | def normalize_langextract_profile(profile: dict[str, Any] | None) -> dic...
  function langextract_schema_fields (line 464) | def langextract_schema_fields(
  function langextract_field_names (line 482) | def langextract_field_names(profile: dict[str, Any] | None = None) -> se...
  function ensure_langextract_schema_fields (line 487) | def ensure_langextract_schema_fields(
  function extract_metadata (line 530) | def extract_metadata(
  function _extract_langextract_metadata (line 593) | def _extract_langextract_metadata(
  function _schema_profile_if_present (line 660) | def _schema_profile_if_present(schema_def: dict[str, Any] | None) -> dic...
  function _resolve_langextract_profile (line 669) | def _resolve_langextract_profile(
  function _normalize_source_classes (line 679) | def _normalize_source_classes(raw_field: dict[str, Any]) -> list[str]:
  function _normalize_field_mode (line 701) | def _normalize_field_mode(mode_obj: Any, *, field_type: str) -> str:
  function _normalize_contains_any (line 720) | def _normalize_contains_any(
  function _profile_defaults (line 745) | def _profile_defaults(profile: dict[str, Any]) -> dict[str, Any]:
  function _default_field_value (line 752) | def _default_field_value(field: dict[str, Any]) -> Any:
  function _aggregate_profile_metadata (line 773) | def _aggregate_profile_metadata(
  function _runtime_field_value (line 820) | def _runtime_field_value(
  function _entity_field_value (line 837) | def _entity_field_value(*, field: dict[str, Any], matched_values: list[s...
  function _coerce_field_value (line 851) | def _coerce_field_value(*, value: Any, field_type: str) -> Any:
  function _langextract_examples (line 873) | def _langextract_examples(lx: Any) -> list[Any]:
  function _dedupe_preserve_order (line 927) | def _dedupe_preserve_order(values: list[str], *, max_items: int = 16) ->...
  function _safe_positive_int (line 944) | def _safe_positive_int(value: Any, *, minimum: int, field_name: str) -> ...
  function _safe_int_env (line 958) | def _safe_int_env(name: str, *, default: int, minimum: int) -> int:

FILE: src/fs_explorer/indexing/pipeline.py
  class IndexingResult (line 34) | class IndexingResult:
  class IndexingPipeline (line 47) | class IndexingPipeline:
    method __init__ (line 50) | def __init__(
    method index_folder (line 62) | def index_folder(
    method _extract_metadata_batch (line 185) | def _extract_metadata_batch(
    method _resolve_schema (line 218) | def _resolve_schema(
    method _augment_schema_for_langextract (line 284) | def _augment_schema_for_langextract(
    method _schema_metadata_profile (line 331) | def _schema_metadata_profile(
    method _schema_field_names (line 342) | def _schema_field_names(schema_def: dict[str, Any]) -> set[str]:
    method _generate_and_store_embeddings (line 354) | def _generate_and_store_embeddings(
    method _iter_supported_files (line 381) | def _iter_supported_files(root: str) -> list[str]:
    method _sha256 (line 392) | def _sha256(content: str) -> str:
    method _is_parse_error (line 396) | def _is_parse_error(content: str) -> bool:

FILE: src/fs_explorer/indexing/schema.py
  function _iter_supported_files (line 20) | def _iter_supported_files(folder: str) -> list[str]:
  class SchemaDiscovery (line 32) | class SchemaDiscovery:
    method discover_from_folder (line 35) | def discover_from_folder(

FILE: src/fs_explorer/main.py
  function _load_metadata_profile (line 71) | def _load_metadata_profile(path_value: str | None) -> dict[str, Any] | N...
  function format_tool_panel (line 88) | def format_tool_panel(event: ToolCallEvent, step_number: int) -> Panel:
  function format_navigation_panel (line 136) | def format_navigation_panel(event: GoDeeperEvent, step_number: int) -> P...
  function print_workflow_header (line 155) | def print_workflow_header(console: Console, task: str, folder: str) -> N...
  function print_workflow_summary (line 178) | def print_workflow_summary(
  function run_workflow (line 257) | async def run_workflow(
  function main (line 447) | def main(
  function index_command (line 510) | def index_command(
  function query_command (line 622) | def query_command(
  function schema_discover_command (line 649) | def schema_discover_command(
  function schema_show_command (line 746) | def schema_show_command(

FILE: src/fs_explorer/models.py
  class StopAction (line 37) | class StopAction(BaseModel):
  class AskHumanAction (line 50) | class AskHumanAction(BaseModel):
  class GoDeeperAction (line 63) | class GoDeeperAction(BaseModel):
  class ToolCallArg (line 76) | class ToolCallArg(BaseModel):
  class ToolCallAction (line 91) | class ToolCallAction(BaseModel):
    method to_fn_args (line 106) | def to_fn_args(self) -> dict[str, Any]:
  class Action (line 116) | class Action(BaseModel):
    method to_action_type (line 132) | def to_action_type(self) -> ActionType:

FILE: src/fs_explorer/search/filters.py
  class MetadataFilter (line 16) | class MetadataFilter:
    method to_storage_dict (line 23) | def to_storage_dict(self) -> dict[str, Any]:
  class MetadataFilterParseError (line 31) | class MetadataFilterParseError(ValueError):
  function supported_filter_syntax (line 39) | def supported_filter_syntax() -> str:
  function parse_metadata_filters (line 49) | def parse_metadata_filters(
  function _parse_condition (line 65) | def _parse_condition(condition: str, *, allowed_fields: set[str] | None)...
  function _validate_field (line 109) | def _validate_field(field: str, *, allowed_fields: set[str] | None) -> N...
  function _split_conditions (line 119) | def _split_conditions(raw: str) -> list[str]:
  function _flush_part (line 185) | def _flush_part(parts: list[str], current: list[str]) -> None:
  function _parse_list_value (line 192) | def _parse_list_value(raw_value: str) -> list[str | bool | int | float]:
  function _parse_scalar_value (line 206) | def _parse_scalar_value(raw_value: str) -> str | bool | int | float:

FILE: src/fs_explorer/search/query.py
  class SearchHit (line 18) | class SearchHit:
  class IndexedQueryEngine (line 32) | class IndexedQueryEngine:
    method __init__ (line 35) | def __init__(
    method search (line 43) | def search(
    method _parse_filters (line 108) | def _parse_filters(
    method _allowed_filter_fields (line 116) | def _allowed_filter_fields(self, *, corpus_id: str) -> set[str] | None:
    method _search_parallel (line 131) | def _search_parallel(
    method _semantic_query (line 157) | def _semantic_query(
    method _metadata_query (line 181) | def _metadata_query(
    method _acquire_query_storage (line 198) | def _acquire_query_storage(self) -> tuple[StorageBackend, Callable[[],...
    method _merge_and_rank (line 210) | def _merge_and_rank(

FILE: src/fs_explorer/search/ranker.py
  class RankedDocument (line 11) | class RankedDocument:
    method combined_score (line 23) | def combined_score(self) -> float:
    method matched_by (line 29) | def matched_by(self) -> str:
  function rank_documents (line 37) | def rank_documents(

FILE: src/fs_explorer/search/semantic.py
  class SemanticSearchEngine (line 16) | class SemanticSearchEngine:
    method __init__ (line 19) | def __init__(
    method search (line 27) | def search(

FILE: src/fs_explorer/server.py
  function _get_corpus_lock (line 40) | def _get_corpus_lock(folder: str) -> asyncio.Lock:
  class TaskRequest (line 48) | class TaskRequest(BaseModel):
  class IndexRequest (line 57) | class IndexRequest(BaseModel):
  class AutoProfileRequest (line 69) | class AutoProfileRequest(BaseModel):
  class SearchRequest (line 75) | class SearchRequest(BaseModel):
  function get_ui (line 86) | async def get_ui():
  function list_folders (line 97) | async def list_folders(path: str = "."):
  function index_status (line 134) | async def index_status(folder: str, db_path: str | None = None):
  function generate_auto_profile (line 192) | async def generate_auto_profile(request: AutoProfileRequest):
  function build_index (line 208) | async def build_index(request: IndexRequest):
  function search_index (line 264) | async def search_index(request: SearchRequest):
  function websocket_explore (line 321) | async def websocket_explore(websocket: WebSocket):
  function run_server (line 522) | def run_server(host: str = "127.0.0.1", port: int = 8000):

FILE: src/fs_explorer/storage/base.py
  class ChunkRecord (line 12) | class ChunkRecord:
  class DocumentRecord (line 25) | class DocumentRecord:
  class SchemaRecord (line 40) | class SchemaRecord:
  class StorageBackend (line 51) | class StorageBackend(Protocol):
    method initialize (line 54) | def initialize(self) -> None:
    method get_or_create_corpus (line 57) | def get_or_create_corpus(self, root_path: str) -> str:
    method get_corpus_id (line 60) | def get_corpus_id(self, root_path: str) -> str | None:
    method upsert_document (line 63) | def upsert_document(
    method mark_deleted_missing_documents (line 68) | def mark_deleted_missing_documents(
    method list_documents (line 76) | def list_documents(
    method count_chunks (line 84) | def count_chunks(self, *, corpus_id: str) -> int:
    method search_chunks (line 87) | def search_chunks(
    method search_documents_by_metadata (line 96) | def search_documents_by_metadata(
    method get_document (line 105) | def get_document(self, *, doc_id: str) -> dict[str, Any] | None:
    method save_schema (line 108) | def save_schema(
    method list_schemas (line 118) | def list_schemas(self, *, corpus_id: str) -> list[SchemaRecord]:
    method get_schema_by_name (line 121) | def get_schema_by_name(self, *, corpus_id: str, name: str) -> SchemaRe...
    method get_active_schema (line 124) | def get_active_schema(self, *, corpus_id: str) -> SchemaRecord | None:
    method store_chunk_embeddings (line 127) | def store_chunk_embeddings(
    method search_chunks_semantic (line 135) | def search_chunks_semantic(
    method get_metadata_field_values (line 144) | def get_metadata_field_values(
    method has_embeddings (line 153) | def has_embeddings(self, *, corpus_id: str) -> bool:

FILE: src/fs_explorer/storage/duckdb.py
  function _stable_id (line 18) | def _stable_id(prefix: str, value: str) -> str:
  function _query_terms (line 23) | def _query_terms(query: str, max_terms: int = 8) -> list[str]:
  class DuckDBStorage (line 37) | class DuckDBStorage:
    method __init__ (line 40) | def __init__(
    method close (line 59) | def close(self) -> None:
    method initialize (line 63) | def initialize(self) -> None:
    method _try_load_vss (line 126) | def _try_load_vss(self) -> None:
    method get_or_create_corpus (line 135) | def get_or_create_corpus(self, root_path: str) -> str:
    method get_corpus_id (line 154) | def get_corpus_id(self, root_path: str) -> str | None:
    method upsert_document (line 164) | def upsert_document(
    method mark_deleted_missing_documents (line 228) | def mark_deleted_missing_documents(
    method list_documents (line 268) | def list_documents(
    method count_chunks (line 299) | def count_chunks(self, *, corpus_id: str) -> int:
    method search_chunks (line 311) | def search_chunks(
    method search_documents_by_metadata (line 364) | def search_documents_by_metadata(
    method get_document (line 415) | def get_document(self, *, doc_id: str) -> dict[str, Any] | None:
    method save_schema (line 438) | def save_schema(
    method list_schemas (line 471) | def list_schemas(self, *, corpus_id: str) -> list[SchemaRecord]:
    method get_schema_by_name (line 483) | def get_schema_by_name(self, *, corpus_id: str, name: str) -> SchemaRe...
    method get_active_schema (line 497) | def get_active_schema(self, *, corpus_id: str) -> SchemaRecord | None:
    method make_document_id (line 513) | def make_document_id(corpus_id: str, relative_path: str) -> str:
    method make_chunk_id (line 517) | def make_chunk_id(
    method _row_to_schema_record (line 523) | def _row_to_schema_record(row: tuple[Any, ...]) -> SchemaRecord:
    method store_chunk_embeddings (line 533) | def store_chunk_embeddings(
    method search_chunks_semantic (line 554) | def search_chunks_semantic(
    method get_metadata_field_values (line 594) | def get_metadata_field_values(
    method has_embeddings (line 619) | def has_embeddings(self, *, corpus_id: str) -> bool:
    method create_hnsw_index (line 627) | def create_hnsw_index(self, *, corpus_id: str) -> bool:
    method _metadata_clause (line 649) | def _metadata_clause(

FILE: src/fs_explorer/workflow.py
  function get_agent (line 34) | def get_agent() -> FsExplorerAgent:
  function reset_agent (line 43) | def reset_agent() -> None:
  class WorkflowState (line 48) | class WorkflowState(BaseModel):
  class InputEvent (line 59) | class InputEvent(StartEvent):
  class GoDeeperEvent (line 69) | class GoDeeperEvent(Event):
  class ToolCallEvent (line 76) | class ToolCallEvent(Event):
  class AskHumanEvent (line 84) | class AskHumanEvent(InputRequiredEvent):
  class HumanAnswerEvent (line 91) | class HumanAnswerEvent(HumanResponseEvent):
  class ExplorationEndEvent (line 97) | class ExplorationEndEvent(StopEvent):
  function _handle_action_result (line 108) | def _handle_action_result(
  function _process_agent_action (line 153) | async def _process_agent_action(
  class FsExplorerWorkflow (line 185) | class FsExplorerWorkflow(Workflow):
    method start_exploration (line 197) | async def start_exploration(
    method go_deeper_action (line 244) | async def go_deeper_action(
    method receive_human_answer (line 264) | async def receive_human_answer(
    method tool_call_action (line 282) | async def tool_call_action(

FILE: tests/conftest.py
  class MockModels (line 19) | class MockModels:
    method generate_content (line 22) | async def generate_content(self, *args, **kwargs) -> GenerateContentRe...
  class MockAio (line 50) | class MockAio:
    method models (line 54) | def models(self) -> MockModels:
  class MockGenAIClient (line 59) | class MockGenAIClient:
    method __init__ (line 66) | def __init__(self, api_key: str, http_options: HttpOptions) -> None:
    method aio (line 71) | def aio(self) -> MockAio:

FILE: tests/test_agent.py
  class TestAgentInitialization (line 23) | class TestAgentInitialization:
    method test_agent_init_with_env_key (line 27) | def test_agent_init_with_env_key(self) -> None:
    method test_agent_init_with_explicit_key (line 34) | def test_agent_init_with_explicit_key(self) -> None:
    method test_agent_init_without_key_raises (line 39) | def test_agent_init_without_key_raises(self) -> None:
  class TestAgentConfiguration (line 51) | class TestAgentConfiguration:
    method test_configure_task_adds_to_history (line 55) | def test_configure_task_adds_to_history(self) -> None:
    method test_multiple_configure_task_calls (line 65) | def test_multiple_configure_task_calls(self) -> None:
  class TestAgentActions (line 76) | class TestAgentActions:
    method test_take_action_returns_action (line 81) | async def test_take_action_returns_action(self) -> None:
    method test_reset_clears_history (line 101) | def test_reset_clears_history(self) -> None:
  class TestTokenUsage (line 113) | class TestTokenUsage:
    method test_add_api_call (line 116) | def test_add_api_call(self) -> None:
    method test_add_tool_result_parse_file (line 126) | def test_add_tool_result_parse_file(self) -> None:
    method test_add_tool_result_scan_folder (line 134) | def test_add_tool_result_scan_folder(self) -> None:
    method test_summary_format (line 143) | def test_summary_format(self) -> None:
  class TestSystemPrompt (line 156) | class TestSystemPrompt:
    method test_system_prompt_contains_tools (line 159) | def test_system_prompt_contains_tools(self) -> None:
    method test_system_prompt_contains_strategy (line 168) | def test_system_prompt_contains_strategy(self) -> None:
    method test_system_prompt_contains_index_tools (line 174) | def test_system_prompt_contains_index_tools(self) -> None:
  class TestSearchFlags (line 181) | class TestSearchFlags:
    method setup_method (line 184) | def setup_method(self) -> None:
    method teardown_method (line 187) | def teardown_method(self) -> None:
    method test_set_and_get_search_flags (line 190) | def test_set_and_get_search_flags(self) -> None:
    method test_clear_index_context_resets_flags (line 197) | def test_clear_index_context_resets_flags(self) -> None:
    method test_build_system_prompt_no_index (line 202) | def test_build_system_prompt_no_index(self) -> None:
    method test_build_system_prompt_semantic_only (line 206) | def test_build_system_prompt_semantic_only(self) -> None:
    method test_build_system_prompt_metadata_only (line 211) | def test_build_system_prompt_metadata_only(self) -> None:
    method test_build_system_prompt_both (line 216) | def test_build_system_prompt_both(self) -> None:
    method test_all_tools_always_available (line 221) | def test_all_tools_always_available(self) -> None:

FILE: tests/test_cli_indexing.py
  function test_root_task_mode_remains_compatible (line 11) | def test_root_task_mode_remains_compatible(tmp_path: Path, monkeypatch) ...
  function test_query_command_enables_index_mode (line 40) | def test_query_command_enables_index_mode(tmp_path: Path, monkeypatch) -...
  function test_index_and_schema_commands (line 78) | def test_index_and_schema_commands(tmp_path: Path, monkeypatch) -> None:
  function test_index_command_with_metadata_forces_schema_discovery (line 109) | def test_index_command_with_metadata_forces_schema_discovery(
  function test_index_command_with_metadata_profile_path (line 161) | def test_index_command_with_metadata_profile_path(
  function test_index_command_with_embeddings_flag (line 232) | def test_index_command_with_embeddings_flag(
  function test_auto_index_env_var_enables_use_index (line 277) | def test_auto_index_env_var_enables_use_index(
  function test_auto_index_env_var_silent_fallback (line 316) | def test_auto_index_env_var_silent_fallback(

FILE: tests/test_e2e.py
  function test_e2e (line 14) | async def test_e2e() -> None:

FILE: tests/test_embeddings.py
  class _FakeEmbedding (line 20) | class _FakeEmbedding:
  class _FakeEmbedResult (line 25) | class _FakeEmbedResult:
  class _FakeModels (line 29) | class _FakeModels:
    method __init__ (line 32) | def __init__(self) -> None:
    method embed_content (line 35) | def embed_content(
  class _FakeClient (line 47) | class _FakeClient:
    method __init__ (line 48) | def __init__(self) -> None:
  function test_embed_texts_returns_correct_count (line 57) | def test_embed_texts_returns_correct_count() -> None:
  function test_embed_texts_uses_document_task_type (line 67) | def test_embed_texts_uses_document_task_type() -> None:
  function test_embed_query_uses_query_task_type (line 77) | def test_embed_query_uses_query_task_type() -> None:
  function test_embed_texts_batching (line 88) | def test_embed_texts_batching() -> None:
  function test_env_overrides (line 103) | def test_env_overrides(monkeypatch) -> None:
  function test_missing_api_key_raises (line 121) | def test_missing_api_key_raises(monkeypatch) -> None:
  function test_real_embedding_api (line 136) | def test_real_embedding_api() -> None:

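The _Fake* classes above follow the standard fake-client pattern: a stand-in for the Google GenAI client records every embed_content call so that batching and task-type behaviour can be asserted without network access. A self-contained sketch under that assumption; the recorded-call shape and the keyword signature here are illustrative, not the repository's exact code.

from dataclasses import dataclass

@dataclass
class _FakeEmbedding:
    values: list[float]

@dataclass
class _FakeEmbedResult:
    embeddings: list["_FakeEmbedding"]

class _FakeModels:
    def __init__(self) -> None:
        self.calls: list[dict] = []  # every request is recorded for assertions

    def embed_content(self, *, model: str, contents: list[str], config=None):
        self.calls.append({"model": model, "contents": contents, "config": config})
        return _FakeEmbedResult([_FakeEmbedding([0.0, 1.0]) for _ in contents])

class _FakeClient:
    def __init__(self) -> None:
        self.models = _FakeModels()

def test_batches_of_at_most_100() -> None:
    client = _FakeClient()
    texts = [f"doc {i}" for i in range(250)]
    for i in range(0, len(texts), 100):
        client.models.embed_content(model="fake-model", contents=texts[i : i + 100])
    assert [len(c["contents"]) for c in client.models.calls] == [100, 100, 50]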
FILE: tests/test_exploration_trace.py
  function test_normalize_path_relative (line 12) | def test_normalize_path_relative() -> None:
  function test_normalize_path_absolute (line 17) | def test_normalize_path_absolute() -> None:
  function test_trace_records_steps_and_documents (line 22) | def test_trace_records_steps_and_documents() -> None:
  function test_trace_records_resolved_document_paths (line 47) | def test_trace_records_resolved_document_paths() -> None:
  function test_extract_cited_sources_ordered_unique (line 61) | def test_extract_cited_sources_ordered_unique() -> None:

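The first two trace tests check path normalization for relative and absolute inputs. A minimal sketch of what such a helper might look like; normalize_path here is a hypothetical mirror of the real helper in fs_explorer.exploration_trace.

import os

def normalize_path(path: str, root: str) -> str:
    # absolute paths are made relative to the exploration root;
    # relative paths are simply cleaned up
    if os.path.isabs(path):
        return os.path.relpath(path, root)
    return os.path.normpath(path)

def test_normalize_path_relative() -> None:
    assert normalize_path("docs/./a.txt", "/corpus") == os.path.join("docs", "a.txt")

def test_normalize_path_absolute() -> None:
    assert normalize_path("/corpus/docs/a.txt", "/corpus") == os.path.join("docs", "a.txt")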
FILE: tests/test_fs.py
  class TestDescribeDirContent (line 21) | class TestDescribeDirContent:
    method test_valid_directory (line 24) | def test_valid_directory(self) -> None:
    method test_nonexistent_directory (line 32) | def test_nonexistent_directory(self) -> None:
    method test_directory_without_subfolders (line 37) | def test_directory_without_subfolders(self) -> None:
  class TestReadFile (line 45) | class TestReadFile:
    method test_valid_file (line 48) | def test_valid_file(self) -> None:
    method test_nonexistent_file (line 53) | def test_nonexistent_file(self) -> None:
  class TestGrepFileContent (line 59) | class TestGrepFileContent:
    method test_pattern_match (line 62) | def test_pattern_match(self) -> None:
    method test_no_match (line 68) | def test_no_match(self) -> None:
    method test_nonexistent_file (line 73) | def test_nonexistent_file(self) -> None:
  class TestGlobPaths (line 79) | class TestGlobPaths:
    method test_pattern_match (line 82) | def test_pattern_match(self) -> None:
    method test_no_match (line 89) | def test_no_match(self) -> None:
    method test_nonexistent_directory (line 94) | def test_nonexistent_directory(self) -> None:
  class TestDocumentParsing (line 100) | class TestDocumentParsing:
    method setup_method (line 103) | def setup_method(self) -> None:
    method test_parse_file_nonexistent (line 107) | def test_parse_file_nonexistent(self) -> None:
    method test_parse_file_unsupported_extension (line 112) | def test_parse_file_unsupported_extension(self) -> None:
    method test_preview_file_nonexistent (line 117) | def test_preview_file_nonexistent(self) -> None:
    method test_preview_file_unsupported_extension (line 122) | def test_preview_file_unsupported_extension(self) -> None:
    method test_parse_file_pdf (line 131) | def test_parse_file_pdf(self) -> None:
    method test_preview_file_pdf (line 144) | def test_preview_file_pdf(self) -> None:
  class TestScanFolder (line 154) | class TestScanFolder:
    method setup_method (line 157) | def setup_method(self) -> None:
    method test_nonexistent_directory (line 161) | def test_nonexistent_directory(self) -> None:
    method test_empty_directory (line 166) | def test_empty_directory(self) -> None:
    method test_scan_folder_with_documents (line 178) | def test_scan_folder_with_documents(self) -> None:
  class TestSupportedExtensions (line 186) | class TestSupportedExtensions:
    method test_supported_extensions_is_frozenset (line 189) | def test_supported_extensions_is_frozenset(self) -> None:
    method test_common_extensions_supported (line 193) | def test_common_extensions_supported(self) -> None:

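TestGrepFileContent above covers the three obvious cases: a match, no match, and a missing file. A sketch of a grep-style helper and two of those cases, assuming the helper returns matching lines and an empty list for missing files (the real function's return shape and error handling may differ):

import re
from pathlib import Path

def grep_file_content(path: str, pattern: str) -> list[str]:
    # hypothetical stand-in for fs.grep_file_content
    p = Path(path)
    if not p.is_file():
        return []
    return [line for line in p.read_text().splitlines() if re.search(pattern, line)]

def test_pattern_match(tmp_path: Path) -> None:
    f = tmp_path / "notes.txt"
    f.write_text("alpha\nbeta\nalphabet\n")
    assert grep_file_content(str(f), r"alpha") == ["alpha", "alphabet"]

def test_nonexistent_file() -> None:
    assert grep_file_content("does/not/exist.txt", r"alpha") == []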
FILE: tests/test_indexing.py
  function test_smart_chunker_overlap (line 19) | def test_smart_chunker_overlap() -> None:
  function test_schema_discovery_from_folder (line 30) | def test_schema_discovery_from_folder(tmp_path: Path) -> None:
  function test_schema_discovery_with_langextract_fields (line 50) | def test_schema_discovery_with_langextract_fields(tmp_path: Path, monkey...
  function test_schema_discovery_with_custom_metadata_profile (line 74) | def test_schema_discovery_with_custom_metadata_profile(tmp_path: Path) -...
  function test_indexing_pipeline_indexes_and_marks_deleted (line 108) | def test_indexing_pipeline_indexes_and_marks_deleted(
  function test_indexing_pipeline_with_langextract_metadata (line 176) | def test_indexing_pipeline_with_langextract_metadata(
  function test_indexing_pipeline_reuses_saved_metadata_profile (line 241) | def test_indexing_pipeline_reuses_saved_metadata_profile(
  function test_auto_discover_profile_with_mock_llm (line 319) | def test_auto_discover_profile_with_mock_llm(
  function test_auto_discover_profile_falls_back_on_error (line 382) | def test_auto_discover_profile_falls_back_on_error(
  function test_auto_discover_profile_falls_back_without_api_key (line 411) | def test_auto_discover_profile_falls_back_without_api_key(
  function test_schema_discovery_uses_auto_profile_when_no_explicit_profile (line 435) | def test_schema_discovery_uses_auto_profile_when_no_explicit_profile(
  class _FakeEmbedding (line 495) | class _FakeEmbedding:
  class _FakeEmbedResult (line 500) | class _FakeEmbedResult:
  class _FakeEmbedModels (line 504) | class _FakeEmbedModels:
    method embed_content (line 505) | def embed_content(
  class _FakeEmbedClient (line 516) | class _FakeEmbedClient:
    method __init__ (line 517) | def __init__(self) -> None:
  function test_indexing_pipeline_with_embeddings (line 526) | def test_indexing_pipeline_with_embeddings(
  function test_indexing_pipeline_without_embeddings (line 553) | def test_indexing_pipeline_without_embeddings(
  function test_embedding_cascade_on_reindex (line 577) | def test_embedding_cascade_on_reindex(
  function test_extract_metadata_batch_returns_correct_metadata (line 612) | def test_extract_metadata_batch_returns_correct_metadata(
  function test_extract_metadata_batch_parallel_is_faster_than_sequential (line 659) | def test_extract_metadata_batch_parallel_is_faster_than_sequential(
  function test_parallel_and_sequential_produce_same_results (line 710) | def test_parallel_and_sequential_produce_same_results(

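test_smart_chunker_overlap at the top of this file verifies the chunking invariant that consecutive chunks share an overlap region. A toy illustration of that invariant with a fixed-size character chunker (SmartChunker itself is structure-aware; this only shows the property under test):

def chunk_text(text: str, size: int, overlap: int) -> list[str]:
    # fixed-size windows with `overlap` characters shared between neighbours
    step = size - overlap
    return [text[i : i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def test_chunk_overlap() -> None:
    text = "abcdefghijklmnopqrstuvwxyz"
    chunks = chunk_text(text, size=10, overlap=3)
    for prev, nxt in zip(chunks, chunks[1:]):
        assert prev[-3:] == nxt[:3]  # the tail of one chunk opens the next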
FILE: tests/test_models.py
  function test_tool_call_action_to_tool_args (line 10) | def test_tool_call_action_to_tool_args() -> None:
  function test_action_to_action_type (line 24) | def test_action_to_action_type() -> None:

FILE: tests/test_search.py
  function test_parse_metadata_filters_supports_scalar_and_list_values (line 22) | def test_parse_metadata_filters_supports_scalar_and_list_values() -> None:
  function test_parse_metadata_filters_rejects_unknown_schema_fields (line 40) | def test_parse_metadata_filters_rejects_unknown_schema_fields() -> None:
  function test_indexed_query_engine_unions_semantic_and_metadata_results (line 48) | def test_indexed_query_engine_unions_semantic_and_metadata_results(
  class _SlowStorage (line 86) | class _SlowStorage:
    method search_chunks (line 87) | def search_chunks(self, *, corpus_id: str, query: str, limit: int = 5)...
    method search_documents_by_metadata (line 100) | def search_documents_by_metadata(self, *, corpus_id: str, filters, lim...
    method get_active_schema (line 112) | def get_active_schema(self, *, corpus_id: str):  # noqa: ARG002
  function test_indexed_query_engine_executes_semantic_and_metadata_in_parallel (line 116) | def test_indexed_query_engine_executes_semantic_and_metadata_in_parallel...
  function test_search_enable_semantic_false_returns_only_metadata (line 132) | def test_search_enable_semantic_false_returns_only_metadata() -> None:
  function test_search_enable_metadata_false_returns_only_semantic (line 148) | def test_search_enable_metadata_false_returns_only_semantic() -> None:
  function test_search_both_disabled_returns_empty (line 164) | def test_search_both_disabled_returns_empty() -> None:
  class _FakeEmbedding (line 186) | class _FakeEmbedding:
  class _FakeEmbedResult (line 191) | class _FakeEmbedResult:
  class _FakeEmbedModels (line 195) | class _FakeEmbedModels:
    method embed_content (line 196) | def embed_content(
  class _FakeEmbedClient (line 208) | class _FakeEmbedClient:
    method __init__ (line 209) | def __init__(self) -> None:
  function test_vector_search_with_pre_stored_embeddings (line 218) | def test_vector_search_with_pre_stored_embeddings(
  function test_keyword_fallback_when_no_embeddings (line 254) | def test_keyword_fallback_when_no_embeddings(
  function test_get_metadata_field_values_returns_distinct_values (line 287) | def test_get_metadata_field_values_returns_distinct_values(
  function test_get_metadata_field_values_empty_corpus (line 319) | def test_get_metadata_field_values_empty_corpus(tmp_path: Path) -> None:
  function test_get_metadata_field_values_respects_max_distinct (line 330) | def test_get_metadata_field_values_respects_max_distinct(
  function test_semantic_search_includes_field_catalog_on_first_call (line 358) | def test_semantic_search_includes_field_catalog_on_first_call(
  function test_float_scoring_in_ranked_documents (line 393) | def test_float_scoring_in_ranked_documents() -> None:

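_SlowStorage and test_indexed_query_engine_executes_semantic_and_metadata_in_parallel rely on a timing trick worth spelling out: each storage call sleeps, so if both retrieval legs finish in less than the sum of the sleeps, they must have run concurrently. A self-contained sketch with simplified signatures (the real storage methods take keyword arguments such as corpus_id):

import time
from concurrent.futures import ThreadPoolExecutor

class _SlowStorage:
    # each backend call sleeps, so a serial run would take ~0.4s
    def search_chunks(self, query: str) -> list[str]:
        time.sleep(0.2)
        return [f"chunk hit for {query}"]

    def search_documents_by_metadata(self, filters: dict) -> list[str]:
        time.sleep(0.2)
        return [f"doc hit for {filters}"]

def run_query(storage: _SlowStorage, query: str, filters: dict) -> list[str]:
    # fan both retrieval legs out to threads, then union the results
    with ThreadPoolExecutor(max_workers=2) as pool:
        sem = pool.submit(storage.search_chunks, query)
        meta = pool.submit(storage.search_documents_by_metadata, filters)
        return sem.result() + meta.result()

def test_legs_run_in_parallel() -> None:
    start = time.perf_counter()
    hits = run_query(_SlowStorage(), "merger", {"doc_type": "contract"})
    elapsed = time.perf_counter() - start
    assert len(hits) == 2
    assert elapsed < 0.35  # well under the ~0.4s a serial run would need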
FILE: tests/test_server_search.py
  function indexed_corpus (line 18) | def indexed_corpus(tmp_path: Path, monkeypatch):
  function test_search_endpoint_returns_hits (line 37) | def test_search_endpoint_returns_hits(indexed_corpus) -> None:
  function test_search_endpoint_with_filters (line 57) | def test_search_endpoint_with_filters(indexed_corpus) -> None:
  function test_search_endpoint_missing_index (line 76) | def test_search_endpoint_missing_index(tmp_path: Path) -> None:
  function test_search_endpoint_invalid_folder (line 94) | def test_search_endpoint_invalid_folder() -> None:
  function test_index_status_not_indexed (line 112) | def test_index_status_not_indexed(tmp_path: Path) -> None:
  function test_index_status_after_indexing (line 128) | def test_index_status_after_indexing(indexed_corpus) -> None:
  function test_index_status_includes_schema_fields (line 146) | def test_index_status_includes_schema_fields(indexed_corpus) -> None:
  function test_auto_profile_endpoint (line 168) | def test_auto_profile_endpoint(tmp_path: Path) -> None:
  function test_auto_profile_invalid_folder (line 207) | def test_auto_profile_invalid_folder() -> None:
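These endpoint tests exercise the /api/search and /api/index routes through an HTTP client rather than by calling handlers directly. A minimal sketch of the approach using FastAPI's TestClient; the stub endpoint and its response shape are illustrative, not the server's actual contract.

from fastapi import FastAPI
from fastapi.testclient import TestClient

app = FastAPI()

@app.post("/api/search")
def search(payload: dict) -> dict:
    # stand-in handler; the real one queries the DuckDB index
    return {"query": payload.get("query"), "hits": [{"path": "docs/a.txt", "score": 1.0}]}

def test_search_endpoint_returns_hits() -> None:
    client = TestClient(app)
    resp = client.post("/api/search", json={"query": "merger terms"})
    assert resp.status_code == 200
    assert resp.json()["hits"]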
Condensed preview — 59 files, each entry showing path, character count, and a truncated content snippet.
[
  {
    "path": ".github/workflows/build.yaml",
    "chars": 301,
    "preview": "name: Build\n\non:\n  pull_request:\n\njobs:\n  build:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4"
  },
  {
    "path": ".github/workflows/lint.yaml",
    "chars": 398,
    "preview": "name: Linting\n\non:\n  pull_request:\n\njobs:\n  lint:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v"
  },
  {
    "path": ".github/workflows/test.yaml",
    "chars": 500,
    "preview": "name: CI Tests - Pull Request\n\non:\n  pull_request:\n\njobs:\n  testing_pr:\n    runs-on: ubuntu-latest\n    strategy:\n      m"
  },
  {
    "path": ".github/workflows/typecheck.yaml",
    "chars": 346,
    "preview": "name: Typecheck\n\non:\n  pull_request:\n\njobs:\n  core-typecheck:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: action"
  },
  {
    "path": ".gitignore",
    "chars": 169,
    "preview": "# Python-generated files\n__pycache__/\n*.py[oc]\nbuild/\ndist/\nwheels/\n*.egg-info\n\n# Virtual environments\n.venv\n\n# caches\n*"
  },
  {
    "path": ".pre-commit-config.yaml",
    "chars": 253,
    "preview": "---\ndefault_language_version:\n  python: python3\n\nrepos:\n  - repo: https://github.com/pre-commit/pre-commit-hooks\n    rev"
  },
  {
    "path": ".python-version",
    "chars": 5,
    "preview": "3.13\n"
  },
  {
    "path": "ARCHITECTURE.md",
    "chars": 27206,
    "preview": "# FsExplorer Architecture Documentation\n\n## Table of Contents\n\n1. [System Overview](#system-overview)\n2. [Component Arch"
  },
  {
    "path": "CLAUDE.md",
    "chars": 7472,
    "preview": "# CLAUDE.md\n\nThis file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.\n\n## "
  },
  {
    "path": "IMPLEMENTATION_PLAN.md",
    "chars": 11012,
    "preview": "# Implementation Plan: Hybrid Semantic + Agentic Search (Revised)\n\n## Overview\n\nAdd semantic search with optional metada"
  },
  {
    "path": "Makefile",
    "chars": 641,
    "preview": ".PHONY: test lint format format-check typecheck build\n\nall: test lint format typecheck\n\ntest:\n\t$(info ******************"
  },
  {
    "path": "README.md",
    "chars": 5032,
    "preview": "# Agentic File Search\n\n> **Based on**: [run-llama/fs-explorer](https://github.com/run-llama/fs-explorer) — The original "
  },
  {
    "path": "YOUTUBE_DEMO_TESTS.md",
    "chars": 6005,
    "preview": "# YouTube Demo: FS-Explorer Test Results\n\n## System Overview\n\n- **25 PDF documents** (~93 pages total)\n- **63 cross-refe"
  },
  {
    "path": "data/large_acquisition/TEST_QUESTIONS.md",
    "chars": 2088,
    "preview": "# Test Questions for Large Document Set\n\n## Document Overview\n- 25 interconnected documents\n- Each document 3-6 pages\n- "
  },
  {
    "path": "data/test_acquisition/TEST_QUESTIONS.md",
    "chars": 4577,
    "preview": "# Test Questions for Document Exploration\n\nThese questions are designed to test the two-stage document exploration appro"
  },
  {
    "path": "data/testfile.txt",
    "chars": 15,
    "preview": "This is a test."
  },
  {
    "path": "docker/docker-compose.yml",
    "chars": 645,
    "preview": "version: '3.8'\n\nservices:\n  postgres:\n    image: pgvector/pgvector:pg17\n    container_name: fs-explorer-db\n    environme"
  },
  {
    "path": "pyproject.toml",
    "chars": 820,
    "preview": "[build-system]\nrequires = [\"uv_build>=0.9.10,<0.10.0\"]\nbuild-backend = \"uv_build\"\n\n[project]\nname = \"fs-explorer\"\nversio"
  },
  {
    "path": "scripts/generate_large_docs.py",
    "chars": 29617,
    "preview": "#!/usr/bin/env python3\n\"\"\"\nGenerate a large set of interconnected legal documents for testing.\nCreates 25 documents, eac"
  },
  {
    "path": "scripts/generate_test_docs.py",
    "chars": 35625,
    "preview": "#!/usr/bin/env python3\n\"\"\"\nGenerate test PDF documents for testing the two-stage document exploration approach.\n\nScenari"
  },
  {
    "path": "src/fs_explorer/__init__.py",
    "chars": 1184,
    "preview": "\"\"\"\nFsExplorer - AI-powered filesystem exploration agent.\n\nThis package provides an intelligent agent that can explore f"
  },
  {
    "path": "src/fs_explorer/agent.py",
    "chars": 24477,
    "preview": "\"\"\"\nFsExplorer Agent for filesystem exploration using Google Gemini.\n\nThis module contains the agent that interacts with"
  },
  {
    "path": "src/fs_explorer/embeddings.py",
    "chars": 2708,
    "preview": "\"\"\"\nEmbedding provider for vector-based semantic search.\n\nWraps the Google GenAI embedding API for batch and single-quer"
  },
  {
    "path": "src/fs_explorer/exploration_trace.py",
    "chars": 3248,
    "preview": "\"\"\"\nHelpers for recording exploration path and referenced files.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport os\nimpo"
  },
  {
    "path": "src/fs_explorer/fs.py",
    "chars": 13176,
    "preview": "\"\"\"\nFilesystem utilities for the FsExplorer agent.\n\nThis module provides functions for reading, searching, and parsing f"
  },
  {
    "path": "src/fs_explorer/index_config.py",
    "chars": 662,
    "preview": "\"\"\"\nConfiguration helpers for local index storage.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport os\nfrom pathlib impor"
  },
  {
    "path": "src/fs_explorer/indexing/__init__.py",
    "chars": 300,
    "preview": "\"\"\"Indexing components for FsExplorer.\"\"\"\n\nfrom .chunker import SmartChunker, TextChunk\nfrom .pipeline import IndexingPi"
  },
  {
    "path": "src/fs_explorer/indexing/chunker.py",
    "chars": 2045,
    "preview": "\"\"\"\nChunking utilities for indexing document content.\n\"\"\"\n\nfrom __future__ import annotations\n\nfrom dataclasses import d"
  },
  {
    "path": "src/fs_explorer/indexing/metadata.py",
    "chars": 33223,
    "preview": "\"\"\"\nMetadata extraction helpers for indexed documents.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport copy\nimport json\n"
  },
  {
    "path": "src/fs_explorer/indexing/pipeline.py",
    "chars": 13659,
    "preview": "\"\"\"\nIndexing pipeline orchestration.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport hashlib\nimport json\nimport os\nfrom "
  },
  {
    "path": "src/fs_explorer/indexing/schema.py",
    "chars": 3522,
    "preview": "\"\"\"\nSchema discovery utilities.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport os\nfrom pathlib import Path\nfrom typing "
  },
  {
    "path": "src/fs_explorer/main.py",
    "chars": 26587,
    "preview": "\"\"\"\nCLI entry point for the FsExplorer agent.\n\nProvides a command-line interface for running filesystem exploration task"
  },
  {
    "path": "src/fs_explorer/models.py",
    "chars": 3934,
    "preview": "\"\"\"\nPydantic models for FsExplorer agent actions.\n\nThis module defines the structured data models used to represent\nthe "
  },
  {
    "path": "src/fs_explorer/search/__init__.py",
    "chars": 563,
    "preview": "\"\"\"Search helpers for indexed corpora.\"\"\"\n\nfrom .filters import (\n    MetadataFilter,\n    MetadataFilterParseError,\n    "
  },
  {
    "path": "src/fs_explorer/search/filters.py",
    "chars": 6491,
    "preview": "\"\"\"\nMetadata filter parsing helpers.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport re\nfrom dataclasses import dataclas"
  },
  {
    "path": "src/fs_explorer/search/query.py",
    "chars": 9306,
    "preview": "\"\"\"\nIndexed query helpers for agent tools.\n\"\"\"\n\nfrom __future__ import annotations\n\nfrom concurrent.futures import Threa"
  },
  {
    "path": "src/fs_explorer/search/ranker.py",
    "chars": 1359,
    "preview": "\"\"\"\nRanking helpers for merging retrieval result sets.\n\"\"\"\n\nfrom __future__ import annotations\n\nfrom dataclasses import "
  },
  {
    "path": "src/fs_explorer/search/semantic.py",
    "chars": 1069,
    "preview": "\"\"\"\nVector-based semantic search engine.\n\nEmbeds a query and searches chunk embeddings via cosine similarity,\nfalling ba"
  },
  {
    "path": "src/fs_explorer/server.py",
    "chars": 18761,
    "preview": "\"\"\"\nFastAPI server for FsExplorer web UI.\n\nProvides a WebSocket endpoint for real-time workflow streaming\nand serves the"
  },
  {
    "path": "src/fs_explorer/storage/__init__.py",
    "chars": 278,
    "preview": "\"\"\"Storage backends for FsExplorer indexing.\"\"\"\n\nfrom .base import ChunkRecord, DocumentRecord, SchemaRecord, StorageBac"
  },
  {
    "path": "src/fs_explorer/storage/base.py",
    "chars": 4056,
    "preview": "\"\"\"\nStorage interfaces and data models for index persistence.\n\"\"\"\n\nfrom __future__ import annotations\n\nfrom dataclasses "
  },
  {
    "path": "src/fs_explorer/storage/duckdb.py",
    "chars": 24008,
    "preview": "\"\"\"\nDuckDB storage backend for index persistence.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport hashlib\nimport json\nim"
  },
  {
    "path": "src/fs_explorer/ui.html",
    "chars": 53005,
    "preview": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"UTF-8\">\n    <meta name=\"viewport\" content=\"width=device-width"
  },
  {
    "path": "src/fs_explorer/workflow.py",
    "chars": 9648,
    "preview": "\"\"\"\nWorkflow orchestration for the FsExplorer agent.\n\nThis module defines the event-driven workflow that coordinates the"
  },
  {
    "path": "tests/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "tests/conftest.py",
    "chars": 2124,
    "preview": "\"\"\"\nPytest fixtures and mocks for FsExplorer tests.\n\nProvides mock implementations of the Google GenAI client for unit t"
  },
  {
    "path": "tests/test_agent.py",
    "chars": 8567,
    "preview": "\"\"\"Tests for the FsExplorerAgent class.\"\"\"\n\nimport pytest\nimport os\n\nfrom unittest.mock import patch\nfrom google.genai i"
  },
  {
    "path": "tests/test_cli_indexing.py",
    "chars": 10327,
    "preview": "\"\"\"CLI tests for indexing and schema commands.\"\"\"\n\nfrom pathlib import Path\n\nimport fs_explorer.indexing.pipeline as pip"
  },
  {
    "path": "tests/test_e2e.py",
    "chars": 1006,
    "preview": "import pytest\nimport os\n\nfrom workflows.testing import WorkflowTestRunner\n\nSKIP_IF, SKIP_REASON = (\n    os.getenv(\"GOOGL"
  },
  {
    "path": "tests/test_embeddings.py",
    "chars": 4402,
    "preview": "\"\"\"Tests for the embedding provider.\"\"\"\n\nfrom __future__ import annotations\n\nimport os\nfrom dataclasses import dataclass"
  },
  {
    "path": "tests/test_exploration_trace.py",
    "chars": 2097,
    "preview": "\"\"\"Tests for exploration trace helpers.\"\"\"\n\nimport os\n\nfrom fs_explorer.exploration_trace import (\n    ExplorationTrace,"
  },
  {
    "path": "tests/test_fs.py",
    "chars": 7514,
    "preview": "\"\"\"Tests for filesystem utility functions.\"\"\"\n\nimport pytest\nimport os\nimport tempfile\nfrom pathlib import Path\n\nfrom fs"
  },
  {
    "path": "tests/test_indexing.py",
    "chars": 23725,
    "preview": "\"\"\"Tests for indexing and schema components.\"\"\"\n\nimport json\nimport time\nfrom dataclasses import dataclass\nfrom pathlib "
  },
  {
    "path": "tests/test_models.py",
    "chars": 1280,
    "preview": "from fs_explorer.models import (\n    ToolCallAction,\n    Action,\n    ToolCallArg,\n    GoDeeperAction,\n    StopAction,\n)\n"
  },
  {
    "path": "tests/test_search.py",
    "chars": 12440,
    "preview": "\"\"\"Tests for search filtering and merged retrieval ranking.\"\"\"\n\nfrom __future__ import annotations\n\nimport time\nfrom dat"
  },
  {
    "path": "tests/test_server_search.py",
    "chars": 6075,
    "preview": "\"\"\"Tests for the /api/search and /api/index REST endpoints.\"\"\"\n\nfrom __future__ import annotations\n\nfrom pathlib import "
  },
  {
    "path": "tests/testfiles/file1.txt",
    "chars": 14,
    "preview": "this is a test"
  },
  {
    "path": "tests/testfiles/file2.md",
    "chars": 17,
    "preview": "# this is a test!"
  },
  {
    "path": "tests/testfiles/last/lastfile.txt",
    "chars": 5,
    "preview": "hello"
  }
]
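Among the previews above, src/fs_explorer/search/semantic.py describes the retrieval core: embed the query, score stored chunk embeddings by cosine similarity, and fall back to keyword search when no embeddings exist. A minimal sketch of the cosine-ranking step; this is pure illustration, and rank_chunks is a hypothetical name rather than the module's API.

import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_chunks(query_vec: list[float], chunks: list[tuple[str, list[float]]]):
    # chunks are (chunk_id, embedding) pairs; highest similarity first
    scored = [(cid, cosine(query_vec, emb)) for cid, emb in chunks]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# e.g. rank_chunks([1.0, 0.0], [("a", [0.9, 0.1]), ("b", [0.0, 1.0])])
# puts chunk "a" first; when no embeddings are stored, the engine falls
# back to keyword search per the semantic.py docstring (not shown here).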

About this extraction

This page contains the full source code of the PromtEngineer/agentic-file-search GitHub repository, extracted and formatted as plain text for AI agents and large language models (LLMs). The extraction covers 59 files (458.6 KB, roughly 106.1k tokens) plus a symbol index of 377 functions, classes, methods, constants, and types. The output can be fed to OpenClaw, Claude, ChatGPT, Cursor, Windsurf, or any other AI tool that accepts text input.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
