Repository: PromtEngineer/agentic-file-search Branch: main Commit: 83c5b4231f44 Files: 59 Total size: 458.6 KB Directory structure: gitextract_mqv4xk8i/ ├── .github/ │ └── workflows/ │ ├── build.yaml │ ├── lint.yaml │ ├── test.yaml │ └── typecheck.yaml ├── .gitignore ├── .pre-commit-config.yaml ├── .python-version ├── ARCHITECTURE.md ├── CLAUDE.md ├── IMPLEMENTATION_PLAN.md ├── Makefile ├── README.md ├── YOUTUBE_DEMO_TESTS.md ├── data/ │ ├── large_acquisition/ │ │ └── TEST_QUESTIONS.md │ ├── test_acquisition/ │ │ └── TEST_QUESTIONS.md │ └── testfile.txt ├── docker/ │ └── docker-compose.yml ├── pyproject.toml ├── scripts/ │ ├── generate_large_docs.py │ └── generate_test_docs.py ├── src/ │ └── fs_explorer/ │ ├── __init__.py │ ├── agent.py │ ├── embeddings.py │ ├── exploration_trace.py │ ├── fs.py │ ├── index_config.py │ ├── indexing/ │ │ ├── __init__.py │ │ ├── chunker.py │ │ ├── metadata.py │ │ ├── pipeline.py │ │ └── schema.py │ ├── main.py │ ├── models.py │ ├── search/ │ │ ├── __init__.py │ │ ├── filters.py │ │ ├── query.py │ │ ├── ranker.py │ │ └── semantic.py │ ├── server.py │ ├── storage/ │ │ ├── __init__.py │ │ ├── base.py │ │ └── duckdb.py │ ├── ui.html │ └── workflow.py └── tests/ ├── __init__.py ├── conftest.py ├── test_agent.py ├── test_cli_indexing.py ├── test_e2e.py ├── test_embeddings.py ├── test_exploration_trace.py ├── test_fs.py ├── test_indexing.py ├── test_models.py ├── test_search.py ├── test_server_search.py └── testfiles/ ├── file1.txt ├── file2.md └── last/ └── lastfile.txt ================================================ FILE CONTENTS ================================================ ================================================ FILE: .github/workflows/build.yaml ================================================ name: Build on: pull_request: jobs: build: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Install uv uses: astral-sh/setup-uv@v6 - name: Set up Python run: uv python install 3.13 - name: Build package run: make build ================================================ FILE: .github/workflows/lint.yaml ================================================ name: Linting on: pull_request: jobs: lint: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Install uv uses: astral-sh/setup-uv@v6 - name: Set up Python run: uv python install 3.12 - name: Run formatter shell: bash run: make format-check - name: Run linter shell: bash run: make lint ================================================ FILE: .github/workflows/test.yaml ================================================ name: CI Tests - Pull Request on: pull_request: jobs: testing_pr: runs-on: ubuntu-latest strategy: matrix: python-version: ["3.10", "3.11", "3.12", "3.13"] steps: - uses: actions/checkout@v4 with: fetch-depth: 1 - name: Install uv uses: astral-sh/setup-uv@v6 with: python-version: ${{ matrix.python-version }} enable-cache: true - name: Run Tests on Main Package run: make test ================================================ FILE: .github/workflows/typecheck.yaml ================================================ name: Typecheck on: pull_request: jobs: core-typecheck: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 with: fetch-depth: 1 - name: Install uv uses: astral-sh/setup-uv@v6 - name: Set up Python run: uv python install - name: Run Mypy run: make typecheck ================================================ FILE: .gitignore ================================================ # Python-generated files __pycache__/ *.py[oc] build/ dist/ wheels/ *.egg-info # Virtual 
environments .venv # caches *_cache/ # Environment .env # OS files .DS_Store ================================================ FILE: .pre-commit-config.yaml ================================================ --- default_language_version: python: python3 repos: - repo: https://github.com/pre-commit/pre-commit-hooks rev: v4.5.0 hooks: - id: check-merge-conflict - id: check-symlinks - id: check-yaml - id: detect-private-key ================================================ FILE: .python-version ================================================ 3.13 ================================================ FILE: ARCHITECTURE.md ================================================ # FsExplorer Architecture Documentation ## Table of Contents 1. [System Overview](#system-overview) 2. [Component Architecture](#component-architecture) 3. [Core Modules](#core-modules) 4. [Workflow Engine](#workflow-engine) 5. [Agent Decision Loop](#agent-decision-loop) 6. [Document Processing Pipeline](#document-processing-pipeline) 7. [Three-Phase Exploration Strategy](#three-phase-exploration-strategy) 8. [Token Tracking & Cost Estimation](#token-tracking--cost-estimation) 9. [CLI Interface](#cli-interface) 10. [Data Flow](#data-flow) 11. [File Structure](#file-structure) 12. [Extension Points](#extension-points) --- ## System Overview FsExplorer is an AI-powered filesystem exploration agent that answers questions about documents by intelligently navigating directories, parsing files, and synthesizing information with source citations. ```mermaid graph TB subgraph "User Interface" CLI[CLI Interface
typer + rich] end subgraph "Orchestration Layer" WF[Workflow Engine
llama-index-workflows] EVT[Event System] end subgraph "Intelligence Layer" AGENT[FsExplorer Agent] LLM[Google Gemini 2.0 Flash
Structured JSON Output] PROMPT[System Prompt
Three-Phase Strategy] end subgraph "Tools Layer" TOOLS[Tool Registry] SCAN[scan_folder
Parallel Scan] PREVIEW[preview_file
Quick Preview] PARSE[parse_file
Deep Read] READ[read
Text Files] GREP[grep
Pattern Search] GLOB[glob
File Search] end subgraph "Document Processing" DOCLING[Docling
Document Converter] CACHE[Document Cache] end subgraph "Filesystem" FS[(Local Filesystem)] PDF[PDF Files] DOCX[DOCX Files] MD[Markdown Files] OTHER[Other Formats] end CLI --> WF WF --> EVT EVT --> AGENT AGENT --> LLM AGENT --> PROMPT AGENT --> TOOLS TOOLS --> SCAN TOOLS --> PREVIEW TOOLS --> PARSE TOOLS --> READ TOOLS --> GREP TOOLS --> GLOB SCAN --> DOCLING PREVIEW --> DOCLING PARSE --> DOCLING DOCLING --> CACHE CACHE --> FS FS --> PDF FS --> DOCX FS --> MD FS --> OTHER style LLM fill:#4285f4,color:#fff style DOCLING fill:#ff6b6b,color:#fff style CACHE fill:#ffd93d,color:#000 style AGENT fill:#6bcb77,color:#fff ``` --- ## Component Architecture ### High-Level Component Diagram ```mermaid graph LR subgraph "Entry Point" MAIN[main.py
CLI Entry] end subgraph "Workflow" WORKFLOW[workflow.py
Event Orchestration] end subgraph "Agent" AGENT_MOD[agent.py
AI Decision Making] end subgraph "Models" MODELS[models.py
Pydantic Schemas] end subgraph "Filesystem" FS_MOD[fs.py
File Operations] end MAIN --> WORKFLOW WORKFLOW --> AGENT_MOD AGENT_MOD --> MODELS AGENT_MOD --> FS_MOD WORKFLOW --> MODELS style MAIN fill:#e1f5fe style WORKFLOW fill:#f3e5f5 style AGENT_MOD fill:#e8f5e9 style MODELS fill:#fff3e0 style FS_MOD fill:#fce4ec ``` ### Module Dependencies ```mermaid graph TD subgraph "fs_explorer package" INIT[__init__.py
Public API Exports] MAIN[main.py] WORKFLOW[workflow.py] AGENT[agent.py] MODELS[models.py] FS[fs.py] end subgraph "External Dependencies" TYPER[typer
CLI Framework] RICH[rich
Terminal UI] WORKFLOWS[llama-index-workflows
Event System] GENAI[google-genai
Gemini API] PYDANTIC[pydantic
Data Validation] DOCLING[docling
Document Parsing] end INIT --> AGENT INIT --> WORKFLOW INIT --> MODELS MAIN --> TYPER MAIN --> RICH MAIN --> WORKFLOW WORKFLOW --> WORKFLOWS WORKFLOW --> AGENT WORKFLOW --> MODELS WORKFLOW --> FS AGENT --> GENAI AGENT --> MODELS AGENT --> FS MODELS --> PYDANTIC FS --> DOCLING style GENAI fill:#4285f4,color:#fff style DOCLING fill:#ff6b6b,color:#fff ``` --- ## Core Modules ### models.py - Data Schemas Defines the structured output format for the AI agent using Pydantic models. ```mermaid classDiagram class Action { +action: ToolCallAction | GoDeeperAction | StopAction | AskHumanAction +reason: str +to_action_type() ActionType } class ToolCallAction { +tool_name: Tools +tool_input: list[ToolCallArg] +to_fn_args() dict } class ToolCallArg { +parameter_name: str +parameter_value: Any } class GoDeeperAction { +directory: str } class StopAction { +final_result: str } class AskHumanAction { +question: str } Action --> ToolCallAction Action --> GoDeeperAction Action --> StopAction Action --> AskHumanAction ToolCallAction --> ToolCallArg note for Action "Main container returned by LLM" note for ToolCallAction "Invokes filesystem tools" note for StopAction "Contains final answer with citations" ``` ### agent.py - AI Agent The core intelligence component that interacts with Google Gemini. ```mermaid classDiagram class FsExplorerAgent { -_client: GenAIClient -_chat_history: list[Content] +token_usage: TokenUsage +__init__(api_key: str) +configure_task(task: str) void +take_action() tuple[Action, ActionType] +call_tool(tool_name: Tools, tool_input: dict) void +reset() void } class TokenUsage { +prompt_tokens: int +completion_tokens: int +total_tokens: int +api_calls: int +tool_result_chars: int +documents_parsed: int +documents_scanned: int +add_api_call(prompt_tokens, completion_tokens) void +add_tool_result(result, tool_name) void +summary() str } class TOOLS { <> +read: read_file +grep: grep_file_content +glob: glob_paths +scan_folder: scan_folder +preview_file: preview_file +parse_file: parse_file } FsExplorerAgent --> TokenUsage FsExplorerAgent --> TOOLS ``` ### fs.py - Filesystem Operations All filesystem and document parsing utilities. ```mermaid classDiagram class FilesystemModule { <> +SUPPORTED_EXTENSIONS: frozenset +DEFAULT_PREVIEW_CHARS: int = 3000 +DEFAULT_SCAN_PREVIEW_CHARS: int = 1500 +DEFAULT_MAX_WORKERS: int = 4 } class DocumentCache { <> -_DOCUMENT_CACHE: dict[str, str] +clear_document_cache() void +_get_cached_or_parse(file_path) str } class DirectoryOps { <> +describe_dir_content(directory) str +glob_paths(directory, pattern) str } class FileOps { <> +read_file(file_path) str +grep_file_content(file_path, pattern) str } class DocumentOps { <> +preview_file(file_path, max_chars) str +parse_file(file_path) str +scan_folder(directory, max_workers, preview_chars) str } FilesystemModule --> DocumentCache FilesystemModule --> DirectoryOps FilesystemModule --> FileOps FilesystemModule --> DocumentOps DocumentOps --> DocumentCache ``` --- ## Workflow Engine The workflow engine uses an event-driven architecture based on `llama-index-workflows`. 
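The custom events that flow through this architecture are small typed payloads. As a rough illustration (field names are taken from the Event Types diagram below; the concrete types are assumptions, and the real classes subclass the workflow library's `Event` base rather than plain `BaseModel`):

```python
# Illustrative sketch only: event payloads as implied by the Event Types
# diagram. The actual classes derive from the llama-index-workflows Event base.
from pydantic import BaseModel


class InputEvent(BaseModel):
    task: str


class ToolCallEvent(BaseModel):
    tool_name: str
    tool_input: dict  # assumed shape; the real field may be a typed list
    reason: str


class GoDeeperEvent(BaseModel):
    directory: str
    reason: str


class AskHumanEvent(BaseModel):
    question: str
    reason: str


class HumanAnswerEvent(BaseModel):
    response: str


class ExplorationEndEvent(BaseModel):
    final_result: str | None = None
    error: str | None = None
```

Each workflow step consumes one of these events and emits the next, which is what lets the engine stream progress to the CLI while the exploration runs.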
### Workflow State Machine ```mermaid stateDiagram-v2 [*] --> StartExploration: InputEvent(task) StartExploration --> ToolCall: ToolCallEvent StartExploration --> GoDeeper: GoDeeperEvent StartExploration --> AskHuman: AskHumanEvent StartExploration --> End: StopAction ToolCall --> ToolCall: ToolCallEvent ToolCall --> GoDeeper: GoDeeperEvent ToolCall --> AskHuman: AskHumanEvent ToolCall --> End: StopAction GoDeeper --> ToolCall: ToolCallEvent GoDeeper --> GoDeeper: GoDeeperEvent GoDeeper --> AskHuman: AskHumanEvent GoDeeper --> End: StopAction AskHuman --> WaitForHuman: InputRequiredEvent WaitForHuman --> ProcessHumanResponse: HumanAnswerEvent ProcessHumanResponse --> ToolCall: ToolCallEvent ProcessHumanResponse --> GoDeeper: GoDeeperEvent ProcessHumanResponse --> AskHuman: AskHumanEvent ProcessHumanResponse --> End: StopAction End --> [*]: ExplorationEndEvent note right of StartExploration Initial task processing Describes current directory Asks LLM for first action end note note right of ToolCall Executes filesystem tool Adds result to chat history Asks LLM for next action end note note right of GoDeeper Updates current directory Describes new directory Asks LLM for next action end note ``` ### Event Types ```mermaid graph TB subgraph "Start Events" IE[InputEvent
task: str] end subgraph "Intermediate Events" TCE[ToolCallEvent
tool_name, tool_input, reason] GDE[GoDeeperEvent
directory, reason] AHE[AskHumanEvent
question, reason] HAE[HumanAnswerEvent
response] end subgraph "End Events" EEE[ExplorationEndEvent
final_result, error] end IE --> TCE IE --> GDE IE --> AHE IE --> EEE TCE --> TCE TCE --> GDE TCE --> AHE TCE --> EEE GDE --> TCE GDE --> GDE GDE --> AHE GDE --> EEE AHE --> HAE HAE --> TCE HAE --> GDE HAE --> AHE HAE --> EEE style IE fill:#4caf50,color:#fff style EEE fill:#f44336,color:#fff style TCE fill:#2196f3,color:#fff style GDE fill:#9c27b0,color:#fff style AHE fill:#ff9800,color:#fff ``` ### Workflow Steps ```mermaid sequenceDiagram participant CLI as CLI (main.py) participant WF as Workflow participant Agent as FsExplorerAgent participant LLM as Gemini API participant Tools as Tool Registry participant FS as Filesystem CLI->>WF: InputEvent(task) WF->>Agent: configure_task(initial_prompt) Agent->>LLM: generate_content(chat_history) LLM-->>Agent: Action JSON alt ToolCallAction Agent->>Tools: call_tool(name, args) Tools->>FS: execute operation FS-->>Tools: result Tools-->>Agent: tool result Agent->>Agent: add to chat_history WF-->>CLI: ToolCallEvent (stream) WF->>Agent: configure_task("next action?") Note over WF,Agent: Loop continues else GoDeeperAction WF->>WF: update current_directory WF-->>CLI: GoDeeperEvent (stream) WF->>Agent: configure_task("next action?") Note over WF,Agent: Loop continues else AskHumanAction WF-->>CLI: AskHumanEvent (stream) CLI->>CLI: Wait for user input CLI->>WF: HumanAnswerEvent(response) WF->>Agent: configure_task(response) Note over WF,Agent: Loop continues else StopAction WF-->>CLI: ExplorationEndEvent(final_result) end ``` --- ## Agent Decision Loop ### Single Decision Cycle ```mermaid flowchart TB subgraph "Agent.take_action()" START([Start]) --> SEND[Send chat_history to Gemini] SEND --> RECEIVE[Receive JSON response] RECEIVE --> TRACK[Track token usage] TRACK --> PARSE[Parse Action from JSON] PARSE --> CHECK{Action Type?} CHECK -->|toolcall| EXEC[Execute Tool] EXEC --> RESULT[Get tool result] RESULT --> ADD[Add result to chat_history] ADD --> RETURN1[Return Action, ActionType] CHECK -->|godeeper| RETURN2[Return Action, ActionType] CHECK -->|askhuman| RETURN3[Return Action, ActionType] CHECK -->|stop| RETURN4[Return Action, ActionType] RETURN1 --> END([End]) RETURN2 --> END RETURN3 --> END RETURN4 --> END end style START fill:#4caf50,color:#fff style END fill:#f44336,color:#fff style CHECK fill:#ff9800,color:#000 ``` ### Chat History Evolution ```mermaid sequenceDiagram participant User participant Agent participant LLM Note over Agent: chat_history = [] User->>Agent: configure_task("Initial prompt + directory listing") Note over Agent: chat_history = [user: initial_prompt] Agent->>LLM: generate_content(chat_history) LLM-->>Agent: {action: scan_folder, reason: "..."} Note over Agent: chat_history = [user: initial_prompt, model: action1] Agent->>Agent: Execute scan_folder, add result Note over Agent: chat_history = [user: initial_prompt, model: action1, user: tool_result1] User->>Agent: configure_task("What's next?") Note over Agent: chat_history = [..., user: "What's next?"] Agent->>LLM: generate_content(chat_history) LLM-->>Agent: {action: parse_file, reason: "..."} Note over Agent: chat_history = [..., model: action2] Note over Agent: Pattern continues until StopAction ``` --- ## Document Processing Pipeline ### Docling Integration ```mermaid flowchart LR subgraph "Input Formats" PDF[PDF] DOCX[DOCX] PPTX[PPTX] XLSX[XLSX] HTML[HTML] MD[Markdown] end subgraph "Docling" DC[DocumentConverter] DETECT[Format Detection] PIPELINE[Processing Pipeline] EXPORT[Markdown Export] end subgraph "Output" MARKDOWN[Markdown Text] end PDF --> DC DOCX --> DC PPTX --> 
DC XLSX --> DC HTML --> DC MD --> DC DC --> DETECT DETECT --> PIPELINE PIPELINE --> EXPORT EXPORT --> MARKDOWN style DC fill:#ff6b6b,color:#fff ``` ### Caching Strategy ```mermaid flowchart TB subgraph "Cache Key Generation" PATH[file_path] --> ABS[os.path.abspath] ABS --> MTIME[os.path.getmtime] MTIME --> KEY["cache_key = f'{abs_path}:{mtime}'"] end subgraph "Cache Lookup" KEY --> CHECK{Key in cache?} CHECK -->|Yes| HIT[Return cached content] CHECK -->|No| MISS[Parse with Docling] MISS --> STORE[Store in cache] STORE --> RETURN[Return content] end subgraph "_DOCUMENT_CACHE" CACHE[(dict: str → str)] end HIT --> CACHE STORE --> CACHE style CACHE fill:#ffd93d,color:#000 ``` ### Parallel Document Scanning ```mermaid flowchart TB subgraph "scan_folder(directory)" START([Start]) --> LIST[List directory files] LIST --> FILTER[Filter by SUPPORTED_EXTENSIONS] FILTER --> POOL[Create ThreadPoolExecutor
max_workers=4] subgraph "Parallel Processing" POOL --> T1[Thread 1
_preview_single_file] POOL --> T2[Thread 2
_preview_single_file] POOL --> T3[Thread 3
_preview_single_file] POOL --> T4[Thread 4
_preview_single_file] end T1 --> COLLECT[Collect Results] T2 --> COLLECT T3 --> COLLECT T4 --> COLLECT COLLECT --> SORT[Sort by filename] SORT --> FORMAT[Format output report] FORMAT --> END([Return summary]) end style START fill:#4caf50,color:#fff style END fill:#4caf50,color:#fff style POOL fill:#2196f3,color:#fff ``` --- ## Three-Phase Exploration Strategy ### Phase Overview ```mermaid flowchart TB subgraph "PHASE 1: Parallel Scan" P1_START([User Query]) --> P1_SCAN[scan_folder] P1_SCAN --> P1_PREVIEW[Get previews of ALL documents] P1_PREVIEW --> P1_CATEGORIZE[Categorize documents] P1_CATEGORIZE --> REL[RELEVANT
Directly related] P1_CATEGORIZE --> MAYBE[MAYBE
Potentially useful] P1_CATEGORIZE --> SKIP[SKIP
Not relevant] end subgraph "PHASE 2: Deep Dive" REL --> P2_PARSE[parse_file on RELEVANT docs] MAYBE -.->|If needed| P2_PARSE P2_PARSE --> P2_EXTRACT[Extract key information] P2_EXTRACT --> P2_CROSS{Cross-references
found?} end subgraph "PHASE 3: Backtracking" P2_CROSS -->|Yes| P3_CHECK{Referenced doc
was SKIPPED?} P3_CHECK -->|Yes| P3_BACKTRACK[Go back and parse
referenced document] P3_BACKTRACK --> P2_EXTRACT P3_CHECK -->|No| P3_CONTINUE[Continue analysis] P2_CROSS -->|No| P3_CONTINUE end subgraph "Final Answer" P3_CONTINUE --> ANSWER[Generate answer
with citations] ANSWER --> SOURCES[List sources consulted] SOURCES --> END([Return to user]) end style P1_START fill:#4caf50,color:#fff style END fill:#4caf50,color:#fff style REL fill:#4caf50,color:#fff style MAYBE fill:#ff9800,color:#000 style SKIP fill:#9e9e9e,color:#fff style P3_BACKTRACK fill:#e91e63,color:#fff ``` ### Cross-Reference Detection ```mermaid flowchart LR subgraph "Document Content" DOC[Parsed Document] end subgraph "Pattern Matching" DOC --> P1["'See Exhibit A/B/C...'"] DOC --> P2["'As stated in [Document]...'"] DOC --> P3["'Refer to [filename]...'"] DOC --> P4["'per Document: [name]'"] DOC --> P5["'[Doc #XX]'"] end subgraph "Action" P1 --> FOUND[Cross-reference found] P2 --> FOUND P3 --> FOUND P4 --> FOUND P5 --> FOUND FOUND --> CHECK{Was referenced
doc SKIPPED?} CHECK -->|Yes| BACKTRACK[Backtrack and parse] CHECK -->|No| CONTINUE[Continue] end style BACKTRACK fill:#e91e63,color:#fff ``` --- ## Token Tracking & Cost Estimation ### TokenUsage Class ```mermaid flowchart TB subgraph "Input Tracking" API[API Call] --> PROMPT[prompt_token_count] API --> COMPLETION[candidates_token_count] PROMPT --> ADD_API[add_api_call] COMPLETION --> ADD_API end subgraph "Tool Tracking" TOOL[Tool Execution] --> RESULT[result string] RESULT --> ADD_TOOL[add_tool_result] ADD_TOOL --> CHARS[tool_result_chars += len] ADD_TOOL --> PARSED{tool_name?} PARSED -->|parse_file| INC_PARSED[documents_parsed++] PARSED -->|preview_file| INC_PARSED PARSED -->|scan_folder| INC_SCANNED[documents_scanned += count] end subgraph "Cost Calculation" ADD_API --> TOTALS[Update totals] TOTALS --> CALC[_calculate_cost] CALC --> INPUT_COST["input_cost = prompt_tokens × $0.075/1M"] CALC --> OUTPUT_COST["output_cost = completion_tokens × $0.30/1M"] INPUT_COST --> TOTAL_COST[total_cost] OUTPUT_COST --> TOTAL_COST end subgraph "Summary Output" TOTAL_COST --> SUMMARY[summary] CHARS --> SUMMARY INC_PARSED --> SUMMARY INC_SCANNED --> SUMMARY end ``` ### Cost Estimation Formula ```mermaid graph LR subgraph "Gemini 2.0 Flash Pricing" INPUT["Input: $0.075 / 1M tokens"] OUTPUT["Output: $0.30 / 1M tokens"] end subgraph "Calculation" PROMPT[prompt_tokens] --> DIV1[÷ 1,000,000] DIV1 --> MULT1[× $0.075] MULT1 --> INPUT_COST[Input Cost] COMP[completion_tokens] --> DIV2[÷ 1,000,000] DIV2 --> MULT2[× $0.30] MULT2 --> OUTPUT_COST[Output Cost] INPUT_COST --> SUM[+] OUTPUT_COST --> SUM SUM --> TOTAL[Total Estimated Cost] end style TOTAL fill:#4caf50,color:#fff ``` --- ## CLI Interface ### Output Formatting ```mermaid flowchart TB subgraph "Event Handling" EVENT{Event Type} EVENT -->|ToolCallEvent| TOOL_PANEL[format_tool_panel] EVENT -->|GoDeeperEvent| NAV_PANEL[format_navigation_panel] EVENT -->|AskHumanEvent| HUMAN_PANEL[Human Input Panel] EVENT -->|ExplorationEndEvent| FINAL_PANEL[Final Answer Panel] end subgraph "Tool Panel Components" TOOL_PANEL --> ICON[Tool Icon 📂📖👁️🔍] TOOL_PANEL --> STEP[Step Number] TOOL_PANEL --> PHASE[Phase Label] TOOL_PANEL --> TARGET[Target File/Directory] TOOL_PANEL --> REASON[Agent's Reasoning] end subgraph "Final Panel Components" FINAL_PANEL --> ANSWER[Answer with Citations] FINAL_PANEL --> SOURCES[Sources Consulted] end subgraph "Summary Panel" SUMMARY[Workflow Summary] SUMMARY --> STEPS[Total Steps] SUMMARY --> CALLS[API Calls] SUMMARY --> DOCS[Documents Scanned/Parsed] SUMMARY --> TOKENS[Token Usage] SUMMARY --> COST[Estimated Cost] end FINAL_PANEL --> SUMMARY ``` ### Visual Elements ```mermaid graph TB subgraph "Panel Styles" TOOL["📂 Tool Call
border: yellow"] NAV["📁 Navigation
border: magenta"] HUMAN["❓ Human Input
border: red"] FINAL["✅ Final Answer
border: green"] SUMMARY["📊 Summary
border: blue"] end subgraph "Tool Icons" I1["📂 scan_folder"] I2["👁️ preview_file"] I3["📖 parse_file"] I4["📄 read"] I5["🔍 grep"] I6["🔎 glob"] end subgraph "Phase Labels" PH1["Phase 1: Parallel Document Scan"] PH2["Phase 2: Deep Dive"] PH3["Phase 1/2: Quick Preview"] end style TOOL fill:#ffeb3b,color:#000 style NAV fill:#e1bee7,color:#000 style HUMAN fill:#ffcdd2,color:#000 style FINAL fill:#c8e6c9,color:#000 style SUMMARY fill:#bbdefb,color:#000 ``` --- ## Data Flow ### Complete Request Flow ```mermaid sequenceDiagram participant User participant CLI as main.py participant WF as Workflow participant Agent as FsExplorerAgent participant LLM as Gemini API participant Tools as Tool Registry participant Docling participant Cache participant FS as Filesystem User->>CLI: uv run explore --task "..." CLI->>CLI: print_workflow_header() CLI->>WF: workflow.run(InputEvent) loop Until StopAction WF->>Agent: configure_task() Agent->>LLM: generate_content() LLM-->>Agent: Action JSON Agent->>Agent: Track tokens alt ToolCallAction Agent->>Tools: TOOLS[name](**args) alt Document Tool Tools->>Cache: Check cache alt Cache Hit Cache-->>Tools: Cached content else Cache Miss Cache->>Docling: Convert document Docling->>FS: Read file FS-->>Docling: Raw bytes Docling-->>Cache: Markdown content Cache-->>Tools: Content end else Filesystem Tool Tools->>FS: Execute operation FS-->>Tools: Result end Tools-->>Agent: Tool result Agent->>Agent: Track tool metrics WF-->>CLI: ToolCallEvent CLI->>CLI: format_tool_panel() else GoDeeperAction WF->>WF: Update directory state WF-->>CLI: GoDeeperEvent CLI->>CLI: format_navigation_panel() else AskHumanAction WF-->>CLI: AskHumanEvent CLI->>User: Display question User->>CLI: Enter response CLI->>WF: HumanAnswerEvent else StopAction WF-->>CLI: ExplorationEndEvent end end CLI->>CLI: Display final answer CLI->>CLI: print_workflow_summary() CLI-->>User: Complete output ``` --- ## File Structure ``` fs-explorer/ ├── src/ │ └── fs_explorer/ │ ├── __init__.py # Public API exports │ ├── main.py # CLI entry point (typer) │ ├── workflow.py # Event-driven workflow orchestration │ ├── agent.py # AI agent + Gemini integration │ ├── models.py # Pydantic action schemas │ └── fs.py # Filesystem + Docling operations ├── tests/ │ ├── conftest.py # Test fixtures and mocks │ ├── test_agent.py # Agent unit tests │ ├── test_fs.py # Filesystem function tests │ ├── test_models.py # Model tests │ ├── test_e2e.py # End-to-end integration tests │ └── testfiles/ # Test data ├── data/ │ ├── large_acquisition/ # Sample PDF documents │ └── test_acquisition/ # Test document set ├── scripts/ │ ├── generate_test_docs.py │ └── generate_large_docs.py ├── pyproject.toml # Project configuration ├── Makefile # Development commands ├── README.md # User documentation └── ARCHITECTURE.md # This file ``` --- ## Extension Points ### Adding New Tools ```mermaid flowchart LR subgraph "Step 1: Define Function" FUNC[def new_tool(args) -> str] end subgraph "Step 2: Register Tool" TOOLS["TOOLS dict in agent.py"] FUNC --> TOOLS end subgraph "Step 3: Update Types" TYPES["Tools TypeAlias in models.py"] TOOLS --> TYPES end subgraph "Step 4: Update Prompt" PROMPT["SYSTEM_PROMPT in agent.py"] TYPES --> PROMPT end style FUNC fill:#e3f2fd style TOOLS fill:#f3e5f5 style TYPES fill:#fff3e0 style PROMPT fill:#e8f5e9 ``` ### Adding New Document Formats ```mermaid flowchart LR subgraph "Docling Supported" PDF[PDF] --> DOCLING[Docling] DOCX[DOCX] --> DOCLING PPTX[PPTX] --> DOCLING XLSX[XLSX] --> DOCLING HTML[HTML] --> DOCLING MD[Markdown] --> 
DOCLING end subgraph "To Add New Format" NEW[New Format] --> CHECK{Docling
supports?} CHECK -->|Yes| ADD["Add to SUPPORTED_EXTENSIONS"] CHECK -->|No| CUSTOM["Create custom handler
in fs.py"] end DOCLING --> OUTPUT[Markdown] ADD --> OUTPUT CUSTOM --> OUTPUT ``` ### Customizing the System Prompt The system prompt in `agent.py` can be modified to: 1. **Add new exploration strategies** 2. **Change citation format** 3. **Adjust categorization criteria** 4. **Add domain-specific instructions** ```python SYSTEM_PROMPT = """ # Customize this prompt to change agent behavior ## Your custom instructions here ... """ ``` --- ## Performance Characteristics | Metric | Typical Value | Notes | |--------|---------------|-------| | Parallel scan threads | 4 | Configurable via `DEFAULT_MAX_WORKERS` | | Preview size | 1500 chars | ~1 page of content | | Full preview size | 3000 chars | ~2-3 pages | | Document cache | In-memory | Keyed by path + mtime | | Workflow timeout | 300 seconds | 5 minutes for complex queries | | API model | gemini-2.0-flash | Fast, cost-effective | --- ## Security Considerations 1. **API Key**: Stored in environment variable `GOOGLE_API_KEY` 2. **Local Processing**: Documents parsed locally via Docling (no cloud upload) 3. **Filesystem Access**: Limited to current working directory 4. **No Persistent Storage**: Document cache is in-memory only --- *Last updated: 2026-01-03* ================================================ FILE: CLAUDE.md ================================================ # CLAUDE.md This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. ## Project Overview Agentic File Search is an AI-powered document search agent that explores files dynamically rather than using pre-computed embeddings. It uses a three-phase strategy: parallel scan, deep dive, and backtracking for cross-references. There is also an optional DuckDB-backed indexing pipeline for pre-indexed semantic+metadata retrieval. **Tech Stack:** Python 3.10+, Google Gemini 3 Flash, LlamaIndex Workflows, Docling (document parsing), DuckDB (indexing), langextract (optional metadata extraction), FastAPI + WebSocket, Typer + Rich CLI. ## Common Commands ```bash # Install dependencies uv pip install . uv pip install -e ".[dev]" # with dev dependencies # Run CLI (agentic exploration) uv run explore --task "What is the purchase price?" --folder data/test_acquisition/ # Run CLI (indexed query - requires prior indexing) uv run explore index data/test_acquisition/ uv run explore query --task "What is the purchase price?" --folder data/test_acquisition/ # Schema management uv run explore schema discover data/test_acquisition/ uv run explore schema show data/test_acquisition/ # Run web UI uv run uvicorn fs_explorer.server:app --host 127.0.0.1 --port 8000 # Run tests uv run pytest # all tests uv run pytest tests/test_fs.py # single file uv run pytest -k "test_name" # single test # Lint, format, typecheck (also available via Makefile) uv run pre-commit run -a # lint (or: make lint) uv run ruff check . # ruff only uv run ruff format # format (or: make format) uv run ty check src/fs_explorer/ # typecheck (or: make typecheck) ``` Entry points defined in `pyproject.toml`: `explore` → `fs_explorer.main:app`, `explore-ui` → `fs_explorer.server:run_server`. ## Architecture ### Core Flow (Agentic Mode) ``` User Query → Workflow (LlamaIndex) → Agent (Gemini) → Tools → Docling → Filesystem ``` ### Core Flow (Indexed Mode) ``` User Query → Workflow → Agent → semantic_search/get_document → DuckDB → Ranked Results ``` ### Key Modules (src/fs_explorer/) - **workflow.py**: Event-driven orchestration using `llama-index-workflows`. 
Defines `FsExplorerWorkflow` with steps: `start_exploration`, `go_deeper_action`, `tool_call_action`, `receive_human_answer`. Uses singleton agent via `get_agent()`. - **agent.py**: `FsExplorerAgent` manages Gemini API interaction. Chat history accumulates in `_chat_history`. `take_action()` sends history to LLM, receives structured JSON `Action`, auto-executes tool calls. `TokenUsage` tracks costs. Also contains the `TOOLS` registry (9 tools), `SYSTEM_PROMPT`, and indexed tool functions (`semantic_search`, `get_document`, `list_indexed_documents`). Index context is managed via module-level `set_index_context()`/`clear_index_context()`. - **models.py**: Pydantic schemas for structured LLM output. `Action` contains one of: `ToolCallAction`, `GoDeeperAction`, `StopAction`, `AskHumanAction`. `Tools` TypeAlias defines all available tool names. - **fs.py**: Filesystem operations. `scan_folder()` uses ThreadPoolExecutor for parallel document processing. `_DOCUMENT_CACHE` (dict) caches parsed documents keyed by `path:mtime`. Docling converts PDF/DOCX/PPTX/XLSX/HTML/MD to markdown. - **main.py**: Typer CLI entry point with subcommands: default (agentic explore), `index`, `query`, `schema discover`, `schema show`. - **server.py**: FastAPI server with WebSocket endpoint `/ws/explore` for real-time streaming. - **exploration_trace.py**: Records tool call paths and extracts cited sources from final answers for the CLI summary. ### Indexing Subsystem (src/fs_explorer/indexing/) - **pipeline.py**: `IndexingPipeline` orchestrates document parsing → chunking → metadata extraction → DuckDB upsert. Walks a folder for supported files, delegates to `SmartChunker` and `extract_metadata()`, handles schema resolution and deleted-file cleanup. - **chunker.py**: `SmartChunker` splits parsed document text into overlapping chunks. - **schema.py**: `SchemaDiscovery` auto-discovers metadata schemas from a corpus folder (file types, heuristic boolean fields like `mentions_currency`/`mentions_dates`). Optionally includes langextract fields. - **metadata.py**: `extract_metadata()` produces per-document metadata dicts. Heuristic fields (filename, extension, document_type, currency/date detection) are always available. Optional langextract integration calls the `langextract` library for entity extraction (organizations, people, deal terms, etc.) via configurable profiles. ### Search Subsystem (src/fs_explorer/search/) - **query.py**: `IndexedQueryEngine` runs parallel semantic (chunk text matching) + metadata (JSON filter) retrieval paths using ThreadPoolExecutor, then merges and ranks via `RankedDocument.combined_score`. - **filters.py**: `parse_metadata_filters()` parses a human-readable filter DSL (`field=value`, `field>=num`, `field in (a, b)`, `field~substring`) into `MetadataFilter` objects. Validates against allowed schema fields. - **ranker.py**: `RankedDocument` dataclass with `combined_score` (semantic * 100 + metadata * 10). `rank_documents()` sorts and limits. ### Storage Subsystem (src/fs_explorer/storage/) - **duckdb.py**: `DuckDBStorage` manages four tables: `corpora`, `documents`, `chunks`, `schemas`. Key operations: `upsert_document`, `search_chunks` (keyword-based scoring), `search_documents_by_metadata` (JSON path filtering via `json_extract_string`), schema CRUD. Corpus/doc/chunk IDs are SHA1-based stable hashes. - **base.py**: `StorageBackend` protocol and shared dataclasses (`DocumentRecord`, `ChunkRecord`, `SchemaRecord`). 
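To make the storage behavior above concrete, here is a hypothetical sketch of two details called out in the bullets: SHA1-based stable IDs and JSON-path metadata filtering. The function name, ID inputs, example paths, and SQL shape are assumptions for illustration, not the actual `DuckDBStorage` code.

```python
# Hypothetical sketch, assuming an already-built index at index.duckdb.
import hashlib

import duckdb


def stable_id(*parts: str) -> str:
    # Deterministic ID: the same corpus/document/chunk always hashes to the
    # same value, so re-indexing upserts rows instead of duplicating them.
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


corpus_id = stable_id("/abs/path/to/corpus")
doc_id = stable_id(corpus_id, "contracts/acquisition_agreement.pdf")

con = duckdb.connect("index.duckdb")
# json_extract_string reads a field out of the documents.metadata JSON column,
# which is how metadata filtering can work without a fixed column schema.
rows = con.execute(
    """
    SELECT id, relative_path
    FROM documents
    WHERE corpus_id = ?
      AND json_extract_string(metadata, '$.document_type') = ?
    """,
    [corpus_id, "contract"],
).fetchall()
```

Stable hashing is also what keeps incremental re-indexing cheap: unchanged files map onto existing rows, so only stale or deleted entries need attention.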
### Index Config - **index_config.py**: `resolve_db_path()` resolves DuckDB path with precedence: CLI `--db-path` > `FS_EXPLORER_DB_PATH` env > `~/.fs_explorer/index.duckdb`. ### Workflow Event Types - `InputEvent` → starts exploration - `ToolCallEvent` → tool execution - `GoDeeperEvent` → directory navigation - `AskHumanEvent`/`HumanAnswerEvent` → human interaction - `ExplorationEndEvent` → completion with `final_result` or `error` ### Adding New Tools 1. Implement function in `fs.py` (filesystem) or `agent.py` (indexed) returning `str` 2. Add to `TOOLS` dict in `agent.py` 3. Add to `Tools` TypeAlias in `models.py` 4. Update `SYSTEM_PROMPT` in `agent.py` 5. Update `TOOL_ICONS` and `PHASE_DESCRIPTIONS` in `main.py` ## Environment - `GOOGLE_API_KEY` (required) — in `.env` file or environment variable - `FS_EXPLORER_DB_PATH` (optional) — override default DuckDB location - `FS_EXPLORER_LANGEXTRACT_MAX_CHARS` (optional) — max chars sent to langextract (default 6000) - `FS_EXPLORER_LANGEXTRACT_MODEL` (optional) — model for langextract (default `gemini-3-flash-preview`) ## Testing Tests mock the Gemini client via `MockGenAIClient` in `conftest.py`. Use `reset_agent()` to clear singleton state between tests. The mock always returns a `StopAction` response. Key test files: - `test_agent.py` / `test_e2e.py` — agent and workflow integration - `test_fs.py` — filesystem tools - `test_indexing.py` / `test_cli_indexing.py` — indexing pipeline and CLI - `test_search.py` — search/filter/ranking - `test_exploration_trace.py` — trace and citation extraction Test documents live in `data/test_acquisition/` and `data/large_acquisition/`. Test fixtures for unit tests are in `tests/testfiles/`. ================================================ FILE: IMPLEMENTATION_PLAN.md ================================================ # Implementation Plan: Hybrid Semantic + Agentic Search (Revised) ## Overview Add semantic search with optional metadata filtering to `agentic-file-search` without regressing the current agentic workflow. The revised approach keeps the current CLI and behavior stable first, introduces indexing as opt-in, and only enables auto-detection after compatibility and quality checks pass. - Storage: DuckDB + `vss` (embedded, local file) - Embeddings: Gemini embeddings (API-backed) - Metadata extraction: `langextract` (optional) - Infrastructure model: no external database service (no Docker/Postgres required) --- ## Goals 1. Preserve existing `explore --task` behavior and UX by default. 2. Add a fast indexed path for large corpora. 3. Support metadata-aware filtering when metadata is available. 4. Keep agentic deep-read and cross-reference behavior available. ## Non-Goals (Initial Release) 1. Replacing the existing agentic strategy entirely. 2. Forcing index usage for all queries. 3. Heuristic/NLP folder extraction from free-form task text. --- ## Current Codebase Constraints to Respect 1. CLI currently has one root command (`explore --task`) and no subcommands. 2. Workflow and server currently use shared/global process state (`os.chdir`, singleton agent). 3. Existing tests assert the current 6-tool model and prompt behavior. These constraints require a staged rollout to avoid breaking current users. 
--- ## High-Level Architecture ```text INDEX TIME ├── Parse documents (Docling) ├── Chunk content (paragraph/sentence-aware) ├── Generate embeddings (provider-configured dimension) ├── [optional] Extract metadata (langextract) └── Persist in DuckDB (corpus-scoped) QUERY TIME ├── Retrieve by semantic search ├── [optional] Retrieve by metadata filter ├── Union + rank results ├── Expand via cross-references where needed └── Agent continues deep exploration using existing tools ``` --- ## Data Model (DuckDB) Use corpus-scoped tables and file freshness fields to prevent collisions and stale indexes. ```sql -- Install and load extension programmatically -- INSTALL vss; LOAD vss; CREATE TABLE IF NOT EXISTS corpora ( id VARCHAR PRIMARY KEY, root_path VARCHAR NOT NULL UNIQUE, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ); CREATE TABLE IF NOT EXISTS documents ( id VARCHAR PRIMARY KEY, corpus_id VARCHAR NOT NULL REFERENCES corpora(id), relative_path VARCHAR NOT NULL, absolute_path VARCHAR NOT NULL, content VARCHAR NOT NULL, metadata JSON NOT NULL DEFAULT '{}', file_mtime DOUBLE NOT NULL, file_size BIGINT NOT NULL, content_sha256 VARCHAR NOT NULL, last_indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, is_deleted BOOLEAN DEFAULT FALSE, UNIQUE(corpus_id, relative_path) ); -- EMBEDDING_DIM is configured in code at index creation time. CREATE TABLE IF NOT EXISTS chunks ( id VARCHAR PRIMARY KEY, doc_id VARCHAR NOT NULL REFERENCES documents(id), text VARCHAR NOT NULL, embedding FLOAT[${EMBEDDING_DIM}] NOT NULL, embedding_dim INTEGER NOT NULL, position INTEGER NOT NULL, start_char INTEGER NOT NULL, end_char INTEGER NOT NULL ); CREATE TABLE IF NOT EXISTS schemas ( id INTEGER PRIMARY KEY, corpus_id VARCHAR REFERENCES corpora(id), name VARCHAR, schema_def JSON NOT NULL, is_active BOOLEAN DEFAULT FALSE, UNIQUE(corpus_id, name) ); CREATE INDEX IF NOT EXISTS idx_chunks_embedding ON chunks USING HNSW (embedding) WITH (metric = 'cosine'); ``` ### Embedding Dimension Rule `EMBEDDING_DIM` must be a runtime config constant validated at startup. Do not hardcode `1536` across modules. ### DB Location Default: `~/.fs_explorer/index.duckdb` Override via: - `FS_EXPLORER_DB_PATH` - CLI: `--db-path` --- ## CLI Contract and Rollout ### Compatibility Rules (Required) 1. `uv run explore --task "..."` must keep working as-is. 2. Existing non-indexed behavior remains default in initial rollout. 3. New indexed behavior is opt-in first. ### New Commands ```bash # Index management uv run explore index uv run explore index --with-metadata uv run explore index --schema schema.json # Indexed query path uv run explore query --task "..." --folder [--filter "..."] # Schema inspection uv run explore schema --discover uv run explore schema --show --folder # Existing command (backward-compatible) uv run explore --task "..." [--folder ] [--use-index] ``` ### Folder Resolution (Deterministic) For commands that need corpus selection: 1. If `--folder` is provided, use it. 2. Else use current working directory (`.`). 3. Do not parse folder intent from natural language task text in v1. ### Auto-Detection Strategy - v1: explicit `--use-index` only. - v2: optional auto-detect behind feature flag `FS_EXPLORER_AUTO_INDEX=1`. - v3: default auto-detect only after parity tests and quality benchmarks pass. --- ## Server and Concurrency Requirements Before adding indexing/search endpoints: 1. Remove request-level `os.chdir` usage; pass absolute target folder through workflow state. 2. 
Avoid global singleton agent across concurrent requests; instantiate per workflow run/session. 3. Add per-corpus index lock to avoid concurrent write corruption. 4. Keep read queries concurrent-safe. --- ## Module Structure ```text src/fs_explorer/ ├── storage/ │ ├── __init__.py │ ├── base.py │ └── duckdb.py ├── indexing/ │ ├── __init__.py │ ├── pipeline.py │ ├── chunker.py │ ├── metadata.py │ └── schema.py ├── search/ │ ├── __init__.py │ ├── query.py │ ├── semantic.py │ ├── filters.py │ └── ranker.py ├── embeddings.py └── index_config.py ``` --- ## Files to Modify | File | Changes | |------|---------| | `src/fs_explorer/agent.py` | Add indexed tools and prompt guidance while keeping existing tools | | `src/fs_explorer/models.py` | Extend `Tools` type alias | | `src/fs_explorer/main.py` | Add subcommands + `--folder` + `--use-index` while preserving root command | | `src/fs_explorer/workflow.py` | Remove global/shared run-state assumptions | | `src/fs_explorer/fs.py` | Support safe path resolution without cwd mutation | | `src/fs_explorer/server.py` | Add index/search endpoints and remove `os.chdir` coupling | | `pyproject.toml` | Add `duckdb`, `langextract` | --- ## Implementation Phases ### Phase 0: Contracts and Safety (New) 1. Freeze CLI compatibility requirements (`explore --task` must remain stable). 2. Define deterministic folder resolution contract. 3. Define per-request state model for workflow/server. 4. Add failing tests for compatibility and concurrency assumptions. ### Phase 1: Storage + Embeddings 5. Implement `storage/base.py` (backend interface). 6. Implement `storage/duckdb.py` with corpus-scoped schema. 7. Implement `embeddings.py` with configurable embedding dimension. 8. Add storage/embedding tests (including dimension validation). ### Phase 2: Indexing Pipeline 9. Implement `indexing/chunker.py`. 10. Implement optional `indexing/metadata.py`. 11. Implement `indexing/schema.py`. 12. Implement `indexing/pipeline.py` with freshness checks (`mtime`, hash, deleted files). 13. Add indexing tests. ### Phase 3: Search Pipeline 14. Implement `search/filters.py`. 15. Implement `search/ranker.py`. 16. Implement `search/query.py` (parallel retrieval + union). 17. Implement cross-reference expansion hooks. 18. Add search tests. ### Phase 4: Agent Integration (Opt-in) 19. Add tools: `semantic_search`, `get_document`, `list_indexed_documents`. 20. Keep existing 6 filesystem tools available. 21. Add indexed prompt guidance without removing current strategy. 22. Add tool-selection tests for indexed and non-indexed paths. ### Phase 5: CLI + Server Integration 23. Add `explore index/query/schema` commands. 24. Add `--folder` and `--use-index` to root command. 25. Integrate indexed path into workflow when explicitly requested. 26. Add `/api/index` and `/api/search` endpoints. 27. Remove `os.chdir` in server workflow path. ### Phase 6: Auto-Detect Rollout (Guarded) 28. Add feature-flagged auto-detect (`FS_EXPLORER_AUTO_INDEX`). 29. Add parity checks between indexed and baseline runs on test corpora. 30. Keep fallback to legacy behavior on index errors. ### Phase 7: Testing and Docs 31. Full integration tests. 32. Backward compatibility tests. 33. Concurrency tests for WebSocket/API usage. 34. Performance benchmarks and docs updates. --- ## Revised Design Decisions 1. **Opt-in First**: indexed retrieval starts behind `--use-index` to avoid regressions. 2. **Deterministic Corpus Selection**: explicit `--folder` or `.` fallback only. 3. 
**Corpus-Scoped Storage**: avoid global path collisions by namespacing. 4. **Freshness Tracking**: incremental reindex using mtime/hash/deletion markers. 5. **No Global Request State**: remove `os.chdir` and shared singleton pitfalls in server flows. 6. **Configurable Embedding Dimension**: validated at runtime; not hardcoded everywhere. 7. **No External DB Service**: embedded local DB only; APIs are still external dependencies. --- ## Verification Steps ```bash # Baseline safety (must stay green) uv run pytest tests/test_models.py tests/test_fs.py tests/test_agent.py -v # Phase 1-3 uv run pytest tests/test_storage.py tests/test_embeddings.py tests/test_search.py -v # Index build + inspect uv run explore index data/test_acquisition/ uv run python -c "import duckdb; db=duckdb.connect('~/.fs_explorer/index.duckdb'); print(db.execute('SELECT COUNT(*) FROM documents').fetchone())" # Opt-in indexed execution uv run explore --task "Search for acquisition terms" --folder data/test_acquisition --use-index # Compatibility execution (legacy path) uv run explore --task "Look in data/test_acquisition/. Who is the CTO?" # CLI checks uv run explore --help uv run explore index --help uv run explore query --help uv run explore schema --help # Full suite uv run pytest tests/ -v ``` --- ## Dependencies to Add ```toml # pyproject.toml dependencies = [ # ... existing ... "duckdb>=1.0.0", "langextract>=1.0.0", ] ``` --- ## Critical Files Summary | Purpose | Path | |---------|------| | Storage interface | `src/fs_explorer/storage/base.py` | | DuckDB backend | `src/fs_explorer/storage/duckdb.py` | | Embeddings | `src/fs_explorer/embeddings.py` | | Chunking | `src/fs_explorer/indexing/chunker.py` | | Metadata extraction | `src/fs_explorer/indexing/metadata.py` | | Schema discovery | `src/fs_explorer/indexing/schema.py` | | Indexing pipeline | `src/fs_explorer/indexing/pipeline.py` | | Query pipeline | `src/fs_explorer/search/query.py` | | Filter parsing | `src/fs_explorer/search/filters.py` | | Result ranking | `src/fs_explorer/search/ranker.py` | | Agent tools/prompt | `src/fs_explorer/agent.py` | | Tool types | `src/fs_explorer/models.py` | | CLI commands | `src/fs_explorer/main.py` | | Workflow safety | `src/fs_explorer/workflow.py` | | Server safety/endpoints | `src/fs_explorer/server.py` | ================================================ FILE: Makefile ================================================ .PHONY: test lint format format-check typecheck build all: test lint format typecheck test: $(info ****************** running tests ******************) uv run pytest tests lint: $(info ****************** linting ******************) uv run pre-commit run -a format: $(info ****************** formatting ******************) uv run ruff format format-check: $(info ****************** checking formatting ******************) uv run ruff format --check typecheck: $(info ****************** type checking ******************) uv run ty check src/fs_explorer/ build: $(info ****************** building ******************) uv build ================================================ FILE: README.md ================================================ # Agentic File Search > **Based on**: [run-llama/fs-explorer](https://github.com/run-llama/fs-explorer) — The original CLI agent for filesystem exploration. An AI-powered document search agent that explores files like a human would — scanning, reasoning, and following cross-references. 
Unlike traditional RAG systems that rely on pre-computed embeddings, this agent dynamically navigates documents to find answers. ## Why Agentic Search? Traditional RAG (Retrieval-Augmented Generation) has limitations: - **Chunks lose context** — Splitting documents destroys relationships between sections - **Cross-references are invisible** — "See Exhibit B" means nothing to embeddings - **Similarity ≠ Relevance** — Semantic matching misses logical connections This system uses a **three-phase strategy**: 1. **Parallel Scan** — Preview all documents in a folder at once 2. **Deep Dive** — Full extraction on relevant documents only 3. **Backtrack** — Follow cross-references to previously skipped documents ## Watch the video This video explains the architecture of the project and how to run it. [![Watch the demo on YouTube](https://img.youtube.com/vi/rMADSuus6jg/maxresdefault.jpg)](https://www.youtube.com/watch?v=rMADSuus6jg) ## Features - 🔍 **6 Tools**: `scan_folder`, `preview_file`, `parse_file`, `read`, `grep`, `glob` - 📄 **Document Support**: PDF, DOCX, PPTX, XLSX, HTML, Markdown (via Docling) - 🤖 **Powered by**: Google Gemini 3 Flash with structured JSON output - 💰 **Cost Efficient**: ~$0.001 per query with token tracking - 🌐 **Web UI**: Real-time WebSocket streaming interface - 📊 **Citations**: Answers include source references ## Installation ```bash # Clone the repository git clone https://github.com/PromtEngineer/agentic-file-search.git cd agentic-file-search # Install with uv (recommended) uv pip install . # Or with pip pip install . ``` ## Configuration Create a `.env` file in the project root: ```bash GOOGLE_API_KEY=your_api_key_here ``` Get your API key from [Google AI Studio](https://aistudio.google.com/apikey). ## Usage ### CLI ```bash # Basic query uv run explore --task "What is the purchase price in data/test_acquisition/?" # Multi-document query uv run explore --task "Look in data/large_acquisition/. What are all the financial terms including adjustments and escrow?" ``` ### Web UI ```bash # Start the server uv run uvicorn fs_explorer.server:app --host 127.0.0.1 --port 8000 # Open http://127.0.0.1:8000 in your browser ``` The web UI provides: - Folder browser to select target directory - Real-time step-by-step execution log - Final answer with citations - Token usage and cost statistics ## Architecture ``` User Query ↓ ┌─────────────────┐ │ Workflow Engine │ ←→ LlamaIndex Workflows (event-driven) └────────┬────────┘ ↓ ┌─────────────────┐ │ Agent │ ←→ Gemini 3 Flash (structured JSON) └────────┬────────┘ ↓ ┌─────────────────────────────────────────┐ │ scan_folder │ preview │ parse │ read │ grep │ glob │ └─────────────────────────────────────────┘ ↓ Document Parser (Docling - local) ``` See [ARCHITECTURE.md](ARCHITECTURE.md) for detailed diagrams. ## Test Documents The repo includes test document sets for evaluation: - `data/test_acquisition/` — 10 interconnected legal documents - `data/large_acquisition/` — 25 documents with extensive cross-references Example queries: ```bash # Simple (single doc) uv run explore --task "Look in data/test_acquisition/. Who is the CTO?" # Cross-reference required uv run explore --task "Look in data/test_acquisition/. What is the adjusted purchase price?" # Multi-document synthesis uv run explore --task "Look in data/large_acquisition/. What happens to employees after the acquisition?" 
``` ## Tech Stack | Component | Technology | |-----------|------------| | LLM | Google Gemini 3 Flash | | Document Parsing | Docling (local, open-source) | | Orchestration | LlamaIndex Workflows | | CLI | Typer + Rich | | Web Server | FastAPI + WebSocket | | Package Manager | uv | ## Project Structure ``` src/fs_explorer/ ├── agent.py # Gemini client, token tracking ├── workflow.py # LlamaIndex workflow engine ├── fs.py # File tools: scan, parse, grep ├── models.py # Pydantic models for actions ├── main.py # CLI entry point ├── server.py # FastAPI + WebSocket server └── ui.html # Single-file web interface ``` ## Development ```bash # Install dev dependencies uv pip install -e ".[dev]" # Run tests uv run pytest # Lint uv run ruff check . ``` ## License MIT ## Acknowledgments - Original concept from [run-llama/fs-explorer](https://github.com/run-llama/fs-explorer) - Document parsing by [Docling](https://github.com/DS4SD/docling) - Powered by [Google Gemini](https://deepmind.google/technologies/gemini/) ## Star History [![Star History Chart](https://api.star-history.com/svg?repos=PromtEngineer/agentic-file-search&type=Date)](https://star-history.com/#PromtEngineer/agentic-file-search&Date) ================================================ FILE: YOUTUBE_DEMO_TESTS.md ================================================ # YouTube Demo: FS-Explorer Test Results ## System Overview - **25 PDF documents** (~93 pages total) - **63 cross-references** between documents - **Parallel document scanning** using ThreadPoolExecutor - **Three-phase exploration**: Scan → Filter → Deep Dive + Backtracking --- ## Test Results Summary | Test | Question | Docs Parsed | Tokens Used | Cost | Time | |------|----------|-------------|-------------|------|------| | 1 | CTO & retention bonus | 1 | 27,376 | $0.0021 | ~30s | | 2 | Litigation & insurance | 2 | 41,431 | $0.0033 | ~30s | | 3 | Patents owned | 1 | 27,416 | $0.0022 | ~27s | --- ## TEST 1: Simple Fact Lookup ✅ **Question:** ```bash uv run explore --task "Look in data/large_acquisition/. Who is the CTO and what is their retention bonus?" ``` **Agent Answer:** > Dr. Sarah Chen is the CTO and their retention bonus is $1,200,000. **Verification (from 07_employee_schedule.pdf):** ``` 1. Dr. Sarah Chen - CTO - 15 years experience - Retention bonus: $1,200,000 ``` | Metric | Value | |--------|-------| | Documents Scanned | 26 (parallel) | | Documents Parsed | 1 | | Tokens Used | 27,376 | | Cost | $0.0021 | | ✅ Accuracy | **100% Correct** | --- ## TEST 2: Cross-Reference Following ✅ **Question:** ```bash uv run explore --task "Look in data/large_acquisition/. What is the total litigation exposure and is it covered by insurance?" ``` **Agent Answer:** > The total litigation exposure is between $650,000 and $2,350,000. The Smith v. InnovateTech Solutions, Inc. case, with an exposure of $150,000 - $350,000, is covered by insurance. The DataTech LLC v. InnovateTech Solutions, Inc. case, with an exposure of $500,000 - $2,000,000, does not specify if it is covered by insurance. **Verification (from 12_litigation_schedule.pdf):** ``` 1. Smith v. InnovateTech Solutions, Inc. Exposure: $150,000 - $350,000 (covered by insurance) 2. DataTech LLC v. InnovateTech Solutions, Inc. 
Exposure: Preliminary assessment $500,000 - $2,000,000 ``` **Cross-Reference Followed:** - Agent noticed reference to "Legal Opinion Letter" - Parsed `24_legal_opinion.pdf` to get complete picture | Metric | Value | |--------|-------| | Documents Scanned | 26 (parallel) | | Documents Parsed | 2 (litigation + legal opinion) | | Tokens Used | 41,431 | | Cost | $0.0033 | | ✅ Accuracy | **100% Correct** | --- ## TEST 3: Multi-Document IP Analysis ✅ **Question:** ```bash uv run explore --task "Look in data/large_acquisition/. What patents does the company own and are there any pending applications?" ``` **Agent Answer:** > The company owns the following patents: > - US Patent 10,123,456 - 'Machine Learning System for Predictive Analytics' - Issued 2021 > - US Patent 10,234,567 - 'Distributed Data Processing Architecture' - Issued 2022 > - US Patent 10,345,678 - 'Real-time Anomaly Detection Method' - Issued 2023 > > There is one pending application: US Application 17/456,789 - 'Automated Workflow Optimization' - Filed 2024. **Verification (from 06_ip_schedule.pdf):** ``` US Patent 10,123,456 - 'Machine Learning System for Predictive Analytics' - Issued 2021 US Patent 10,234,567 - 'Distributed Data Processing Architecture' - Issued 2022 US Patent 10,345,678 - 'Real-time Anomaly Detection Method' - Issued 2023 Pending: US Application 17/456,789 - 'Automated Workflow Optimization' - Filed 2024 ``` | Metric | Value | |--------|-------| | Documents Scanned | 26 (parallel) | | Documents Parsed | 1 | | Tokens Used | 27,416 | | Cost | $0.0022 | | ✅ Accuracy | **100% Correct** | --- ## Additional Demo Tests ### Purchase Price & Payment Structure ```bash uv run explore --task "Look in data/large_acquisition/. What is the total purchase price and how is it being paid?" ``` **Expected:** $125M total ($80M cash + $30M stock + $15M escrow) ### Closing Conditions Status ```bash uv run explore --task "Look in data/large_acquisition/. What are all the conditions that must be satisfied before closing and what is the status of each?" ``` **Expected:** HSR ✅, State filings ✅, MegaCorp consent ✅, GlobalBank pending, Employee retention ✅, Legal opinion ✅, Good standing ordered ### Key Employee Compensation ```bash uv run explore --task "Look in data/large_acquisition/. List all the key employees and their retention bonuses" ``` **Expected:** 5 employees totaling $3.5M in retention bonuses --- ## Key Architecture Points to Highlight ### 1. Parallel Scanning (scan_folder) - Scans ALL 26 documents simultaneously using ThreadPoolExecutor - Takes ~25 seconds for entire folder - Returns quick preview of each document ### 2. Smart Filtering - LLM reviews all previews at once - Identifies which documents are relevant - Avoids parsing irrelevant documents ### 3. Cross-Reference Discovery - Agent watches for document references like: - "See Document: Legal Opinion Letter" - "Per Document: Risk Assessment Memo" - Automatically follows references (backtracking) ### 4. Document Caching - Documents cached after first parse - Backtracking is free (no re-parsing) --- ## Cost Analysis | Scenario | Tokens | Est. Cost | |----------|--------|-----------| | Simple query (1 doc) | ~27K | $0.002 | | Cross-ref query (2-3 docs) | ~40K | $0.003 | | Complex synthesis (5+ docs) | ~60K | $0.005 | | All 25 documents parsed | ~150K | $0.012 | **Key Insight:** Even with 25 documents, costs are minimal because the system only parses what's needed! 
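The cost column follows directly from the Gemini 2.0 Flash rates quoted in ARCHITECTURE.md ($0.075 per 1M input tokens, $0.30 per 1M output tokens). A quick back-of-the-envelope check, assuming the reported token totals are dominated by input tokens:

```python
# Rough sanity check of the cost figures above using the rates from
# ARCHITECTURE.md; assumes token totals are mostly input (prompt) tokens.
INPUT_RATE = 0.075 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.30 / 1_000_000   # USD per output token


def estimate_cost(prompt_tokens: int, completion_tokens: int = 0) -> float:
    return prompt_tokens * INPUT_RATE + completion_tokens * OUTPUT_RATE


# Test 1 reported ~27,376 tokens -> about $0.0021, matching the table.
print(f"${estimate_cost(27_376):.4f}")
```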
--- ## Commands to Run Demo ```bash # Setup cd /path/to/fs-explorer export GOOGLE_API_KEY="your-key" # Run any test uv run explore --task "Look in data/large_acquisition/. [YOUR QUESTION]" ``` --- ## What to Show in Video 1. **The folder scan** - Watch as 26 documents are scanned in parallel 2. **Smart filtering** - Note which documents the agent CHOOSES to parse 3. **Cross-reference following** - Show agent backtracking to referenced docs 4. **Token usage summary** - Highlight the efficiency stats at the end 5. **Verification** - Show the actual PDF content matches the answer ================================================ FILE: data/large_acquisition/TEST_QUESTIONS.md ================================================ # Test Questions for Large Document Set ## Document Overview - 25 interconnected documents - Each document 3-6 pages - Extensive cross-references between documents - Total content: ~100+ pages ## Test Questions ### Level 1: Single Document (Easy) ```bash uv run explore --task "Look in data/large_acquisition/. What is the total purchase price?" uv run explore --task "Look in data/large_acquisition/. Who is the CTO and what is their retention bonus?" uv run explore --task "Look in data/large_acquisition/. What patents does the company own?" ``` ### Level 2: Cross-Reference Required (Medium) ```bash uv run explore --task "Look in data/large_acquisition/. What customer consents are required and what is their status?" uv run explore --task "Look in data/large_acquisition/. What is the total litigation exposure and is it covered by insurance?" uv run explore --task "Look in data/large_acquisition/. How is the purchase price being paid and what are the escrow terms?" ``` ### Level 3: Multi-Document Synthesis (Hard) ```bash uv run explore --task "Look in data/large_acquisition/. What are all the conditions that must be satisfied before closing and what is the status of each?" uv run explore --task "Look in data/large_acquisition/. Provide a complete picture of MegaCorp's relationship with the company - revenue, contract terms, consent status, and any risks." uv run explore --task "Look in data/large_acquisition/. What are all the financial terms of this deal including adjustments, escrow, earnouts, and stock?" ``` ### Level 4: Deep Cross-Reference (Expert) ```bash uv run explore --task "Look in data/large_acquisition/. Trace all references to the Legal Opinion Letter - what documents cite it and what opinions does it provide?" uv run explore --task "Look in data/large_acquisition/. Create a complete picture of IP assets - patents, trademarks, assignments, and any related risks or litigation." uv run explore --task "Look in data/large_acquisition/. What happens after closing? List all post-closing obligations, their timelines, and related documents." ``` ================================================ FILE: data/test_acquisition/TEST_QUESTIONS.md ================================================ # Test Questions for Document Exploration These questions are designed to test the two-stage document exploration approach with cross-reference discovery. ## Test Scenario **Context:** TechCorp Industries is acquiring StartupXYZ LLC. There are 10 documents in this folder related to the acquisition. --- ## Question Set 1: Simple (Single Document) These questions can be answered from a single document: ```bash # Q1: What is the purchase price? explore --task "What is the total purchase price for the StartupXYZ acquisition?" # Q2: When did the NDA get signed? 
explore --task "When was the Non-Disclosure Agreement between TechCorp and StartupXYZ signed?" # Q3: How many patents does StartupXYZ have? explore --task "How many patents does StartupXYZ own?" ``` **Expected Behavior:** - Agent should preview documents - Identify the relevant document quickly - Parse only that document for the answer --- ## Question Set 2: Medium (2-3 Documents with Cross-References) These questions require following cross-references: ```bash # Q4: What risks were identified and how were they addressed? explore --task "What are the key risks identified in this acquisition and what mitigation measures were put in place?" # Q5: What's the adjusted purchase price? explore --task "The original purchase price was $45M. Were there any adjustments? What is the final amount?" # Q6: What happened with customer consents? explore --task "Which customers required consent for the acquisition, and was consent obtained from all of them?" ``` **Expected Behavior:** - Agent previews documents - Reads Risk Assessment Memo - Notices references to Financial Adjustments, Customer Consents - Follows cross-references to get complete picture --- ## Question Set 3: Complex (Multiple Documents, Deep Cross-References) These questions require synthesizing information from many documents: ```bash # Q7: Complete IP status explore --task "Give me a complete picture of StartupXYZ's intellectual property - what do they own, is it properly certified, and are there any pending matters or risks?" # Q8: Due diligence findings and resolution explore --task "What did the due diligence process uncover, and how were any issues resolved before closing?" # Q9: Full timeline and status explore --task "Create a timeline of this acquisition from NDA signing to closing. What are the key milestones and their status?" # Q10: Closing readiness explore --task "Is this acquisition ready to close? What items are complete and what's still pending?" ``` **Expected Behavior:** - Agent should preview all documents first - Read the most relevant documents (e.g., Closing Checklist references everything) - Follow cross-references to IP Certification, Due Diligence, Risk Assessment, etc. - Synthesize information from 5+ documents --- ## Question Set 4: Adversarial (Tests Cross-Reference Discovery) These questions specifically test if the agent goes back to previously-skipped documents: ```bash # Q11: Following exhibit references explore --task "The Acquisition Agreement mentions 'Exhibit A - Financial Terms'. What are the detailed financial terms?" # Q12: Understanding document relationships explore --task "How does the Legal Opinion Letter relate to other documents in this acquisition?" # Q13: Hidden connection explore --task "Is there anything about MegaCorp in these documents? Why are they important to this deal?" ``` **Expected Behavior:** - Q11: Agent might initially skip Financial Adjustments, but should go back when Acquisition Agreement references Exhibit A - Q12: Agent should trace all documents referenced BY and FROM the Legal Opinion - Q13: MegaCorp is mentioned in Due Diligence, Risk Assessment, and Customer Consents - agent should connect the dots --- ## Scoring Rubric | Metric | Description | |--------|-------------| | **Preview Usage** | Did the agent use `preview_file` before `parse_file`? | | **Selective Parsing** | Did the agent avoid parsing irrelevant documents? | | **Cross-Reference Discovery** | Did the agent follow document references? 
| | **Backtracking** | Did the agent return to previously-skipped documents when needed? | | **Answer Completeness** | Was the final answer comprehensive and accurate? | --- ## Running a Test ```bash export GOOGLE_API_KEY="your-key" cd /path/to/fs-explorer uv run explore --task "YOUR QUESTION HERE" ``` Watch for: 1. Which documents get previewed 2. Which documents get fully parsed 3. Whether the agent mentions cross-references 4. Whether the agent goes back to read referenced documents ================================================ FILE: data/testfile.txt ================================================ This is a test. ================================================ FILE: docker/docker-compose.yml ================================================ version: '3.8' services: postgres: image: pgvector/pgvector:pg17 container_name: fs-explorer-db environment: POSTGRES_USER: ${POSTGRES_USER:-fs_explorer} POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-devpassword} POSTGRES_DB: ${POSTGRES_DB:-fs_explorer} ports: - "${POSTGRES_PORT:-5432}:5432" volumes: - postgres_data:/var/lib/postgresql/data - ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro healthcheck: test: ["CMD-SHELL", "pg_isready -U fs_explorer -d fs_explorer"] interval: 5s timeout: 5s retries: 5 restart: unless-stopped volumes: postgres_data: ================================================ FILE: pyproject.toml ================================================ [build-system] requires = ["uv_build>=0.9.10,<0.10.0"] build-backend = "uv_build" [project] name = "fs-explorer" version = "0.1.0" description = "Explore and understand your filesystem better with AI." readme = "README.md" requires-python = ">=3.10" dependencies = [ "docling>=2.55.0", "duckdb>=1.0.0", "fastapi>=0.115.0", "google-genai>=1.55.0", "langextract>=1.0.0", "llama-index-workflows>=2.11.5", "python-dotenv>=1.0.0", "reportlab>=4.4.7", "rich>=13.0.0", "typer>=0.12.5,<0.20.0", "uvicorn>=0.34.0", "websockets>=14.0", ] [dependency-groups] dev = [ "pre-commit>=4.5.0", "pytest>=9.0.2", "pytest-asyncio>=1.3.0", "ruff>=0.14.9", "ty>=0.0.1a33", ] [project.scripts] explore = "fs_explorer.main:app" explore-ui = "fs_explorer.server:run_server" ================================================ FILE: scripts/generate_large_docs.py ================================================ #!/usr/bin/env python3 """ Generate a large set of interconnected legal documents for testing. Creates 25 documents, each 3-5 pages, with extensive cross-references. 
""" import os from reportlab.lib.pagesizes import letter from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle from reportlab.lib.units import inch OUTPUT_DIR = "data/large_acquisition" # Document metadata with cross-references DOCUMENTS = { "01_master_agreement": { "title": "MASTER ACQUISITION AGREEMENT", "refs": ["02_schedules", "03_exhibits", "04_disclosure_schedules", "05_ancillary_agreements"], "pages": 5 }, "02_schedules": { "title": "SCHEDULES TO ACQUISITION AGREEMENT", "refs": ["01_master_agreement", "06_ip_schedule", "07_employee_schedule", "08_contract_schedule"], "pages": 4 }, "03_exhibits": { "title": "EXHIBITS TO ACQUISITION AGREEMENT", "refs": ["01_master_agreement", "09_escrow_agreement", "10_stock_purchase"], "pages": 3 }, "04_disclosure_schedules": { "title": "SELLER DISCLOSURE SCHEDULES", "refs": ["01_master_agreement", "11_financial_statements", "12_litigation_schedule"], "pages": 5 }, "05_ancillary_agreements": { "title": "ANCILLARY AGREEMENTS INDEX", "refs": ["13_nda", "14_non_compete", "15_consulting_agreement", "16_transition_services"], "pages": 2 }, "06_ip_schedule": { "title": "SCHEDULE 3.12 - INTELLECTUAL PROPERTY", "refs": ["01_master_agreement", "17_patent_assignments", "18_trademark_registrations"], "pages": 4 }, "07_employee_schedule": { "title": "SCHEDULE 3.15 - EMPLOYEE MATTERS", "refs": ["01_master_agreement", "19_retention_agreements", "20_benefit_plans"], "pages": 4 }, "08_contract_schedule": { "title": "SCHEDULE 3.13 - MATERIAL CONTRACTS", "refs": ["01_master_agreement", "21_customer_contracts", "22_vendor_contracts"], "pages": 5 }, "09_escrow_agreement": { "title": "ESCROW AGREEMENT", "refs": ["01_master_agreement", "03_exhibits", "11_financial_statements"], "pages": 4 }, "10_stock_purchase": { "title": "STOCK PURCHASE DETAILS - EXHIBIT B", "refs": ["01_master_agreement", "11_financial_statements"], "pages": 3 }, "11_financial_statements": { "title": "AUDITED FINANCIAL STATEMENTS", "refs": ["04_disclosure_schedules", "23_audit_report"], "pages": 6 }, "12_litigation_schedule": { "title": "SCHEDULE 3.9 - LITIGATION AND CLAIMS", "refs": ["04_disclosure_schedules", "24_legal_opinion"], "pages": 3 }, "13_nda": { "title": "NON-DISCLOSURE AGREEMENT", "refs": ["01_master_agreement"], "pages": 3 }, "14_non_compete": { "title": "NON-COMPETITION AGREEMENT", "refs": ["01_master_agreement", "07_employee_schedule"], "pages": 3 }, "15_consulting_agreement": { "title": "CONSULTING AGREEMENT - FOUNDER", "refs": ["01_master_agreement", "07_employee_schedule", "19_retention_agreements"], "pages": 4 }, "16_transition_services": { "title": "TRANSITION SERVICES AGREEMENT", "refs": ["01_master_agreement", "25_closing_checklist"], "pages": 4 }, "17_patent_assignments": { "title": "PATENT ASSIGNMENT AGREEMENTS", "refs": ["06_ip_schedule", "01_master_agreement"], "pages": 3 }, "18_trademark_registrations": { "title": "TRADEMARK REGISTRATION SCHEDULE", "refs": ["06_ip_schedule"], "pages": 2 }, "19_retention_agreements": { "title": "KEY EMPLOYEE RETENTION AGREEMENTS", "refs": ["07_employee_schedule", "15_consulting_agreement"], "pages": 4 }, "20_benefit_plans": { "title": "EMPLOYEE BENEFIT PLAN SCHEDULE", "refs": ["07_employee_schedule"], "pages": 3 }, "21_customer_contracts": { "title": "MAJOR CUSTOMER CONTRACT SUMMARIES", "refs": ["08_contract_schedule", "01_master_agreement"], "pages": 5 }, "22_vendor_contracts": { "title": "MAJOR VENDOR CONTRACT SUMMARIES", "refs": 
["08_contract_schedule"], "pages": 3 }, "23_audit_report": { "title": "INDEPENDENT AUDITOR'S REPORT", "refs": ["11_financial_statements", "04_disclosure_schedules"], "pages": 4 }, "24_legal_opinion": { "title": "LEGAL OPINION LETTER", "refs": ["01_master_agreement", "12_litigation_schedule", "06_ip_schedule"], "pages": 3 }, "25_closing_checklist": { "title": "CLOSING CHECKLIST AND CONDITIONS", "refs": ["01_master_agreement", "09_escrow_agreement", "16_transition_services", "17_patent_assignments", "21_customer_contracts"], "pages": 4 } } def generate_content(doc_id: str, meta: dict) -> list: """Generate realistic legal document content.""" styles = getSampleStyleSheet() title_style = ParagraphStyle('Title', parent=styles['Heading1'], fontSize=16, spaceAfter=20) heading_style = ParagraphStyle('Heading', parent=styles['Heading2'], fontSize=12, spaceAfter=10) body_style = ParagraphStyle('Body', parent=styles['Normal'], fontSize=10, spaceAfter=8, leading=14) content = [] # Title content.append(Paragraph(meta["title"], title_style)) content.append(Spacer(1, 0.3*inch)) # Document intro with cross-references refs_text = ", ".join([f"Document: {DOCUMENTS[r]['title']}" for r in meta["refs"][:3]]) intro = f""" This document is part of the acquisition transaction between GlobalTech Corporation ("Buyer") and InnovateTech Solutions, Inc. ("Seller") dated as of February 15, 2025. This document should be read in conjunction with {refs_text}, and all other transaction documents. """ content.append(Paragraph(intro.strip(), body_style)) content.append(Spacer(1, 0.2*inch)) # Generate sections based on document type sections = generate_sections(doc_id, meta) for section_title, section_content in sections: content.append(Paragraph(section_title, heading_style)) for para in section_content: content.append(Paragraph(para, body_style)) content.append(Spacer(1, 0.15*inch)) return content def generate_sections(doc_id: str, meta: dict) -> list: """Generate document-specific sections with legal content.""" sections = [] # Add document-specific content if "master_agreement" in doc_id: sections = [ ("ARTICLE I - DEFINITIONS", [ "1.1 'Acquisition' means the purchase by Buyer of all outstanding capital stock of Seller.", "1.2 'Purchase Price' means One Hundred Twenty-Five Million Dollars ($125,000,000), subject to adjustments.", "1.3 'Closing Date' means April 1, 2025, or such other date as mutually agreed.", "1.4 'Material Adverse Effect' means any change that is materially adverse to the business of Seller.", "1.5 'Knowledge of Seller' means the actual knowledge of the officers listed in Schedule 1.5.", ]), ("ARTICLE II - PURCHASE AND SALE", [ "2.1 Subject to the terms hereof, Seller agrees to sell and Buyer agrees to purchase all Shares.", "2.2 The Purchase Price shall be paid as follows: (a) $80,000,000 in cash at Closing; " "(b) $30,000,000 in Buyer common stock per Document: Stock Purchase Details - Exhibit B; " "(c) $15,000,000 in escrow per Document: Escrow Agreement.", "2.3 Purchase Price adjustments are detailed in Document: Audited Financial Statements.", "2.4 Working capital target is $8,500,000 as calculated per Schedule 2.4.", ]), ("ARTICLE III - REPRESENTATIONS AND WARRANTIES", [ "3.1 Organization. Seller is duly organized under Delaware law.", "3.9 Litigation. Except as set forth in Document: Schedule 3.9 - Litigation and Claims, " "there are no pending legal proceedings against Seller.", "3.12 Intellectual Property. All IP is listed in Document: Schedule 3.12 - Intellectual Property. 
" "Patent assignments are documented in Document: Patent Assignment Agreements.", "3.13 Material Contracts. All contracts exceeding $100,000 annually are in Document: Schedule 3.13 - Material Contracts.", "3.15 Employees. Employee matters are disclosed in Document: Schedule 3.15 - Employee Matters.", ]), ("ARTICLE IV - COVENANTS", [ "4.1 Conduct of Business. Prior to Closing, Seller shall operate in ordinary course.", "4.2 Access. Seller shall provide Buyer access to facilities, books, and records.", "4.3 Confidentiality. Parties shall comply with Document: Non-Disclosure Agreement.", "4.4 Non-Competition. Key employees shall execute Document: Non-Competition Agreement.", ]), ("ARTICLE V - CONDITIONS TO CLOSING", [ "5.1 Buyer's conditions: (a) accuracy of representations; (b) material consents obtained; " "(c) no Material Adverse Effect; (d) receipt of Document: Legal Opinion Letter.", "5.2 Regulatory approvals as specified in Document: Closing Checklist and Conditions.", "5.3 Third-party consents from customers in Document: Major Customer Contract Summaries.", ]), ] elif "financial" in doc_id: sections = [ ("BALANCE SHEET", [ "As of December 31, 2024:", "Total Assets: $47,250,000 (Current: $18,500,000; Non-current: $28,750,000)", "Total Liabilities: $12,300,000 (Current: $8,200,000; Long-term: $4,100,000)", "Stockholders' Equity: $34,950,000", "Working Capital: $10,300,000 (above target of $8,500,000 per Document: Master Acquisition Agreement)", ]), ("INCOME STATEMENT", [ "For fiscal year ended December 31, 2024:", "Total Revenue: $52,400,000 (SaaS: $41,920,000; Professional Services: $10,480,000)", "Cost of Revenue: $15,720,000 (Gross Margin: 70%)", "Operating Expenses: $28,600,000 (R&D: $12,100,000; S&M: $11,500,000; G&A: $5,000,000)", "Operating Income: $8,080,000 (EBITDA: $11,200,000)", "Net Income: $6,464,000", ]), ("REVENUE BREAKDOWN BY CUSTOMER", [ "Top 5 customers represent 62% of revenue (see Document: Major Customer Contract Summaries):", "1. MegaCorp Industries: $12,576,000 (24%) - Contract through 2027", "2. GlobalBank Holdings: $8,384,000 (16%) - Renewal pending", "3. HealthFirst Systems: $5,240,000 (10%) - Multi-year agreement", "4. RetailMax Inc.: $3,668,000 (7%) - Expansion discussion ongoing", "5. TechPrime Solutions: $2,620,000 (5%) - New customer 2024", ]), ("NOTES TO FINANCIAL STATEMENTS", [ "Note 1: Significant Accounting Policies - Revenue recognized per ASC 606.", "Note 2: Deferred Revenue of $4,200,000 represents prepaid annual subscriptions.", "Note 3: Contingent liabilities detailed in Document: Schedule 3.9 - Litigation and Claims.", "Note 4: Related party transactions with founder disclosed in Document: Consulting Agreement - Founder.", ]), ] elif "ip_schedule" in doc_id or "patent" in doc_id: sections = [ ("PATENTS", [ "Seller owns or has rights to the following patents:", "US Patent 10,123,456 - 'Machine Learning System for Predictive Analytics' - Issued 2021", "US Patent 10,234,567 - 'Distributed Data Processing Architecture' - Issued 2022", "US Patent 10,345,678 - 'Real-time Anomaly Detection Method' - Issued 2023", "Pending: US Application 17/456,789 - 'Automated Workflow Optimization' - Filed 2024", "Assignment agreements in Document: Patent Assignment Agreements.", ]), ("TRADEMARKS", [ "Registered trademarks (see Document: Trademark Registration Schedule):", "INNOVATETECH (word mark) - Reg. No. 5,123,456 - Software services", "INNOVATETECH (logo) - Reg. No. 5,234,567 - Software services", "DATAFLOW PRO - Reg. No. 
5,345,678 - Data analytics software", ]), ("TRADE SECRETS AND KNOW-HOW", [ "Seller maintains trade secrets including proprietary algorithms and processes.", "All employees have executed invention assignment agreements per Document: Schedule 3.15 - Employee Matters.", "Key technical personnel retention addressed in Document: Key Employee Retention Agreements.", ]), ] elif "employee" in doc_id or "retention" in doc_id: sections = [ ("EMPLOYEE CENSUS", [ "Total Employees: 127 (Full-time: 120; Part-time: 7)", "Engineering: 68 employees (Senior: 24; Mid-level: 32; Junior: 12)", "Sales & Marketing: 28 employees", "Customer Success: 18 employees", "G&A: 13 employees", ]), ("KEY EMPLOYEES", [ "The following are Key Employees subject to Document: Key Employee Retention Agreements:", "1. Dr. Sarah Chen - CTO - 15 years experience - Retention bonus: $1,200,000", "2. Michael Rodriguez - VP Engineering - Leads 45-person team - Retention: $800,000", "3. Jennifer Walsh - VP Sales - $18M quota achievement - Retention: $600,000", "4. David Kim - Principal Architect - Core platform expertise - Retention: $500,000", "5. Amanda Foster - VP Customer Success - 95% retention rate - Retention: $400,000", "Founder consulting terms in Document: Consulting Agreement - Founder.", ]), ("BENEFIT PLANS", [ "Active benefit plans (details in Document: Employee Benefit Plan Schedule):", "401(k) Plan - Company match 4% - $2.1M annual cost", "Health Insurance - PPO and HMO options - $1.8M annual cost", "Stock Option Plan - 2,500,000 shares reserved - 1,800,000 granted", "Treatment of equity awards addressed in Document: Master Acquisition Agreement Section 2.6.", ]), ] elif "customer" in doc_id or "contract_schedule" in doc_id: sections = [ ("MATERIAL CUSTOMER CONTRACTS", [ "Contracts with annual value exceeding $500,000:", "", "1. MEGACORP INDUSTRIES - Master Services Agreement", " Annual Value: $12,576,000 | Term: Through December 2027", " Change of Control: Consent required (OBTAINED February 8, 2025)", " Renewal Terms: Auto-renew with 90-day notice", "", "2. GLOBALBANK HOLDINGS - Enterprise License Agreement", " Annual Value: $8,384,000 | Term: Through June 2025", " Change of Control: 60-day notice required (PROVIDED January 15, 2025)", " Renewal: Currently in negotiation for 3-year extension", "", "3. HEALTHFIRST SYSTEMS - SaaS Subscription Agreement", " Annual Value: $5,240,000 | Term: Through December 2026", " Change of Control: No restrictions", "", "See Document: Closing Checklist and Conditions for consent status.", ]), ("CONSENT REQUIREMENTS", [ "Customer consents required for acquisition (per Document: Master Acquisition Agreement):", "- MegaCorp Industries: OBTAINED (see Exhibit A hereto)", "- GlobalBank Holdings: NOTICE PROVIDED (awaiting acknowledgment)", "- Other customers: No consent required", "Risk assessment in Document: Legal Opinion Letter.", ]), ] elif "litigation" in doc_id: sections = [ ("PENDING LITIGATION", [ "1. Smith v. InnovateTech Solutions, Inc.", " Court: California Superior Court, Santa Clara County", " Claims: Wrongful termination, discrimination", " Status: Discovery phase; trial set for September 2025", " Exposure: $150,000 - $350,000 (covered by insurance)", " Opinion: See Document: Legal Opinion Letter", "", "2. DataTech LLC v. 
InnovateTech Solutions, Inc.", " Court: US District Court, Northern District of California", " Claims: Patent infringement (US Patent 9,876,543)", " Status: Motion to dismiss pending; hearing March 2025", " Exposure: Preliminary assessment $500,000 - $2,000,000", " IP validity analysis in Document: Schedule 3.12 - Intellectual Property", ]), ("THREATENED CLAIMS", [ "Demand letter received from former contractor re: unpaid invoices ($45,000).", "Resolution expected prior to Closing per Document: Closing Checklist and Conditions.", ]), ("INSURANCE COVERAGE", [ "D&O Insurance: $5,000,000 limit | Deductible: $50,000", "E&O Insurance: $3,000,000 limit | Deductible: $25,000", "General Liability: $2,000,000 limit", ]), ] elif "closing" in doc_id: sections = [ ("PRE-CLOSING CONDITIONS", [ "The following conditions must be satisfied prior to Closing:", "", "1. REGULATORY APPROVALS", " [X] HSR Filing - Early termination granted February 1, 2025", " [X] State filings - Completed in all required jurisdictions", "", "2. THIRD-PARTY CONSENTS", " [X] MegaCorp Industries - Obtained February 8, 2025", " [ ] GlobalBank Holdings - Pending (expected by March 15)", " Per Document: Major Customer Contract Summaries", "", "3. EMPLOYEE MATTERS", " [X] Key employee retention agreements executed", " [X] Founder consulting agreement finalized", " Per Document: Key Employee Retention Agreements", "", "4. LEGAL DELIVERABLES", " [X] Legal opinion - See Document: Legal Opinion Letter", " [ ] Good standing certificates - Ordered", ]), ("CLOSING DELIVERABLES", [ "SELLER DELIVERABLES:", "- Stock certificates endorsed in blank", "- Officer's certificate re: representations", "- Secretary's certificate with resolutions", "- IP assignments per Document: Patent Assignment Agreements", "- Third-party consents per above", "", "BUYER DELIVERABLES:", "- Cash payment: $80,000,000 by wire transfer", "- Stock consideration: 1,500,000 shares per Document: Stock Purchase Details - Exhibit B", "- Escrow deposit: $15,000,000 per Document: Escrow Agreement", ]), ("POST-CLOSING OBLIGATIONS", [ "1. Transition services per Document: Transition Services Agreement (6 months)", "2. Earnout payments per Exhibit C to Document: Master Acquisition Agreement", "3. Escrow release schedule per Document: Escrow Agreement", "4. Employee benefit plan merger per Document: Employee Benefit Plan Schedule", ]), ] elif "escrow" in doc_id: sections = [ ("ESCROW TERMS", [ "Escrow Amount: $15,000,000 (12% of Purchase Price)", "Escrow Agent: First National Trust Company", "Term: 18 months from Closing Date", "", "Release Schedule:", "- 6 months: $5,000,000 released (absent claims)", "- 12 months: $5,000,000 released (absent claims)", "- 18 months: Remaining balance released", "", "Claims may be made for breaches of representations in Document: Master Acquisition Agreement.", ]), ("INDEMNIFICATION", [ "Indemnification provisions per Article VII of Document: Master Acquisition Agreement:", "- Basket: $500,000 (1% of escrow)", "- Cap: $15,000,000 (escrow amount) for general reps", "- Fundamental reps: Full Purchase Price cap", "", "Specific indemnities for matters in Document: Schedule 3.9 - Litigation and Claims.", ]), ] elif "legal_opinion" in doc_id: sections = [ ("OPINIONS RENDERED", [ "Wilson & Associates LLP, counsel to Seller, renders the following opinions:", "", "1. Seller is a corporation duly organized under Delaware law.", "2. Seller has corporate power to execute Document: Master Acquisition Agreement.", "3. 
Transaction documents are valid and enforceable obligations.", "4. No conflicts with charter documents or material agreements.", "5. Based on review of Document: Schedule 3.9 - Litigation and Claims, pending " "litigation does not pose material risk to transaction.", "6. IP matters reviewed per Document: Schedule 3.12 - Intellectual Property; " "no infringement claims other than disclosed.", ]), ("QUALIFICATIONS AND ASSUMPTIONS", [ "This opinion is subject to standard qualifications regarding:", "- Bankruptcy and insolvency laws", "- Equitable principles", "- Public policy considerations", "", "We have relied upon certificates from officers of Seller and representations " "in Document: Seller Disclosure Schedules.", ]), ] elif "audit" in doc_id: sections = [ ("INDEPENDENT AUDITOR'S REPORT", [ "To the Board of Directors of InnovateTech Solutions, Inc.:", "", "We have audited the accompanying financial statements, which comprise the " "balance sheet as of December 31, 2024, and the related statements of income, " "comprehensive income, stockholders' equity, and cash flows for the year then ended.", "", "OPINION", "In our opinion, the financial statements present fairly, in all material respects, " "the financial position of InnovateTech Solutions, Inc. as of December 31, 2024, " "in accordance with accounting principles generally accepted in the United States.", ]), ("KEY AUDIT MATTERS", [ "1. REVENUE RECOGNITION", " SaaS revenue recognized ratably over subscription period per ASC 606.", " Deferred revenue of $4,200,000 verified to customer contracts.", "", "2. STOCK-BASED COMPENSATION", " Options valued using Black-Scholes model.", " Expense of $2,100,000 recorded in accordance with ASC 718.", "", "3. CONTINGENCIES", " Litigation matters reviewed with counsel (see Document: Schedule 3.9 - Litigation and Claims).", " Accruals of $350,000 determined to be appropriate.", ]), ] else: # Generic sections for other documents sections = [ ("OVERVIEW", [ f"This {meta['title']} is executed in connection with the acquisition transaction.", f"Reference documents: {', '.join([DOCUMENTS[r]['title'] for r in meta['refs'][:2]])}.", ]), ("TERMS AND CONDITIONS", [ "Standard terms apply as set forth in the Master Acquisition Agreement.", "Amendments require written consent of all parties.", ]), ("MISCELLANEOUS", [ "Governing Law: State of Delaware", "Dispute Resolution: Arbitration in San Francisco, California", "Notices: As specified in Master Acquisition Agreement", ]), ] # Add boilerplate to reach target page count for i in range(meta["pages"] - 2): sections.append((f"SECTION {len(sections) + 1}", [ f"Additional provisions related to {meta['title']}.", "All terms defined in Document: Master Acquisition Agreement apply herein.", f"Cross-reference: See {DOCUMENTS[meta['refs'][i % len(meta['refs'])]]['title']} for related provisions.", "The parties acknowledge receipt of all schedules and exhibits referenced herein.", "This section shall survive the Closing Date as specified in Article VIII of the Master Agreement.", ])) return sections def create_pdf(doc_id: str, meta: dict, output_dir: str): """Create a PDF document.""" filepath = os.path.join(output_dir, f"{doc_id}.pdf") doc = SimpleDocTemplate(filepath, pagesize=letter, topMargin=0.75*inch, bottomMargin=0.75*inch, leftMargin=1*inch, rightMargin=1*inch) content = generate_content(doc_id, meta) doc.build(content) print(f" Created: {filepath}") def main(): os.makedirs(OUTPUT_DIR, exist_ok=True) print(f"\nGenerating {len(DOCUMENTS)} large documents in 
{OUTPUT_DIR}/\n") for doc_id, meta in DOCUMENTS.items(): create_pdf(doc_id, meta, OUTPUT_DIR) # Create test questions questions_path = os.path.join(OUTPUT_DIR, "TEST_QUESTIONS.md") with open(questions_path, "w") as f: f.write("""# Test Questions for Large Document Set ## Document Overview - 25 interconnected documents - Each document 3-6 pages - Extensive cross-references between documents - Total content: ~100+ pages ## Test Questions ### Level 1: Single Document (Easy) ```bash uv run explore --task "Look in data/large_acquisition/. What is the total purchase price?" uv run explore --task "Look in data/large_acquisition/. Who is the CTO and what is their retention bonus?" uv run explore --task "Look in data/large_acquisition/. What patents does the company own?" ``` ### Level 2: Cross-Reference Required (Medium) ```bash uv run explore --task "Look in data/large_acquisition/. What customer consents are required and what is their status?" uv run explore --task "Look in data/large_acquisition/. What is the total litigation exposure and is it covered by insurance?" uv run explore --task "Look in data/large_acquisition/. How is the purchase price being paid and what are the escrow terms?" ``` ### Level 3: Multi-Document Synthesis (Hard) ```bash uv run explore --task "Look in data/large_acquisition/. What are all the conditions that must be satisfied before closing and what is the status of each?" uv run explore --task "Look in data/large_acquisition/. Provide a complete picture of MegaCorp's relationship with the company - revenue, contract terms, consent status, and any risks." uv run explore --task "Look in data/large_acquisition/. What are all the financial terms of this deal including adjustments, escrow, earnouts, and stock?" ``` ### Level 4: Deep Cross-Reference (Expert) ```bash uv run explore --task "Look in data/large_acquisition/. Trace all references to the Legal Opinion Letter - what documents cite it and what opinions does it provide?" uv run explore --task "Look in data/large_acquisition/. Create a complete picture of IP assets - patents, trademarks, assignments, and any related risks or litigation." uv run explore --task "Look in data/large_acquisition/. What happens after closing? List all post-closing obligations, their timelines, and related documents." ``` """) print(f" Created: {questions_path}") # Summary total_pages = sum(m["pages"] for m in DOCUMENTS.values()) total_refs = sum(len(m["refs"]) for m in DOCUMENTS.values()) print(f"\n{'='*60}") print(f"SUMMARY") print(f"{'='*60}") print(f" Documents created: {len(DOCUMENTS)}") print(f" Total pages: ~{total_pages}") print(f" Cross-references: {total_refs}") print(f" Output directory: {OUTPUT_DIR}/") print(f"{'='*60}\n") if __name__ == "__main__": main() ================================================ FILE: scripts/generate_test_docs.py ================================================ #!/usr/bin/env python3 """ Generate test PDF documents for testing the two-stage document exploration approach. Scenario: TechCorp's acquisition of StartupXYZ Documents have cross-references to test the agent's ability to follow document relationships. """ from reportlab.lib.pagesizes import letter from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle from reportlab.lib.units import inch import os OUTPUT_DIR = "data/test_acquisition" DOCUMENTS = { "01_acquisition_agreement.pdf": { "title": "ACQUISITION AGREEMENT", "content": """ ACQUISITION AGREEMENT

This Acquisition Agreement ("Agreement") is entered into as of January 15, 2025, by and between TechCorp Industries, Inc. ("Buyer") and StartupXYZ LLC ("Seller").

ARTICLE I - DEFINITIONS

1.1 "Acquisition" means the purchase of all outstanding shares of Seller by Buyer.
1.2 "Purchase Price" means $45,000,000 USD as detailed in Exhibit A - Financial Terms.
1.3 "Closing Date" means March 1, 2025, subject to conditions in Article IV.
1.4 "Employee Matters" shall be governed by Schedule 3 - Employee Transition Plan.

ARTICLE II - PURCHASE AND SALE

2.1 Subject to the terms and conditions of this Agreement, Seller agrees to sell, and Buyer agrees to purchase, all of the issued and outstanding shares of Seller.

2.2 The Purchase Price shall be paid as follows:
(a) $30,000,000 in cash at Closing
(b) $10,000,000 in Buyer's common stock (see Exhibit B - Stock Valuation)
(c) $5,000,000 in earnout payments (see Exhibit C - Earnout Terms)

ARTICLE III - REPRESENTATIONS AND WARRANTIES

3.1 Seller represents and warrants that the financial statements provided in Document: Due Diligence Report are accurate and complete.

3.2 Seller represents that all intellectual property is properly documented in Schedule 1 - IP Assets and is free of encumbrances as certified in Document: IP Certification Letter.

3.3 All material contracts are listed in Schedule 2 - Material Contracts.

ARTICLE IV - CONDITIONS TO CLOSING

4.1 Buyer's obligation to close is subject to:
(a) Receipt of regulatory approval as documented in Document: Regulatory Approval Letter
(b) Completion of due diligence per Document: Due Diligence Report
(c) No material adverse change as defined in Section 1.5

4.2 Both parties acknowledge the risks identified in Document: Risk Assessment Memo.

ARTICLE V - CONFIDENTIALITY

5.1 This Agreement is subject to the terms of the Document: Non-Disclosure Agreement executed between the parties on October 1, 2024.

IN WITNESS WHEREOF, the parties have executed this Agreement as of the date first above written.

_________________________
TechCorp Industries, Inc.
By: James Mitchell, CEO

_________________________
StartupXYZ LLC
By: Sarah Chen, Founder & CEO """ }, "02_due_diligence_report.pdf": { "title": "DUE DILIGENCE REPORT", "content": """ CONFIDENTIAL DUE DILIGENCE REPORT

Prepared for: TechCorp Industries, Inc.
Subject: StartupXYZ LLC
Date: December 20, 2024
Prepared by: Morrison & Associates, LLP

EXECUTIVE SUMMARY

This report summarizes our findings from the due diligence investigation of StartupXYZ LLC in connection with the proposed acquisition described in the Document: Acquisition Agreement.

1. FINANCIAL REVIEW

1.1 Revenue for FY2024: $12.3 million (growth of 45% YoY)
1.2 EBITDA: $2.1 million (17% margin)
1.3 Cash position: $3.2 million as of November 30, 2024
1.4 Outstanding debt: $1.5 million (detailed in Exhibit A - Financial Terms of the Acquisition Agreement)

KEY FINDING: Financial statements are materially accurate. Minor adjustments recommended as noted in Document: Financial Adjustments Memo.

2. INTELLECTUAL PROPERTY

2.1 StartupXYZ holds 12 patents related to AI/ML technology
2.2 All patents verified as valid per Document: IP Certification Letter
2.3 No pending litigation affecting IP (confirmed in Document: Legal Opinion Letter)
2.4 Full IP inventory in Schedule 1 - IP Assets of the Acquisition Agreement

3. EMPLOYEE MATTERS

3.1 Total employees: 47 (32 engineering, 8 sales, 7 operations)
3.2 Key employee retention risk: HIGH for 5 senior engineers
3.3 Retention bonuses recommended per Schedule 3 - Employee Transition Plan
3.4 No pending employment disputes

4. MATERIAL CONTRACTS

4.1 23 active customer contracts reviewed (see Schedule 2 - Material Contracts)
4.2 3 contracts contain change-of-control provisions requiring consent
4.3 Largest customer (MegaCorp) accounts for 28% of revenue - concentration risk noted in Document: Risk Assessment Memo

5. REGULATORY COMPLIANCE

5.1 Company is compliant with all applicable regulations
5.2 HSR filing required - timeline in Document: Regulatory Approval Letter

6. RECOMMENDATIONS

Based on our findings, we recommend proceeding with the acquisition subject to:
(a) Obtaining customer consents for change-of-control contracts
(b) Implementing retention packages for key employees
(c) Addressing items in Document: Financial Adjustments Memo

Respectfully submitted,
Morrison & Associates, LLP """ }, "03_ip_certification.pdf": { "title": "IP CERTIFICATION LETTER", "content": """ INTELLECTUAL PROPERTY CERTIFICATION LETTER

Date: December 15, 2024
To: TechCorp Industries, Inc.
From: PatentWatch Legal Services
Re: IP Certification for StartupXYZ LLC Acquisition

Dear Mr. Mitchell,

In connection with the proposed acquisition of StartupXYZ LLC as described in the Document: Acquisition Agreement, we have conducted a comprehensive review of StartupXYZ's intellectual property portfolio.

CERTIFICATION

We hereby certify the following:

1. PATENTS

StartupXYZ owns 12 U.S. patents as listed in Schedule 1 - IP Assets:
- US Patent 10,123,456: "Neural Network Optimization Method"
- US Patent 10,234,567: "Distributed AI Training System"
- US Patent 10,345,678: "Real-time Data Processing Pipeline"
- [9 additional patents listed in Schedule 1]

All patents are valid, enforceable, and free of liens or encumbrances.

2. TRADEMARKS

StartupXYZ owns 3 registered trademarks:
- "StartupXYZ" (word mark)
- StartupXYZ logo (design mark)
- "IntelliFlow" (product name)

3. TRADE SECRETS

We have reviewed StartupXYZ's trade secret protection protocols. All employees have signed appropriate NDAs. See Document: Non-Disclosure Agreement template.

4. THIRD-PARTY IP

StartupXYZ uses 47 open-source libraries. License compliance verified - no copyleft contamination issues identified.

5. PENDING MATTERS

There is one pending patent application (Application No. 17/456,789) for "Advanced Federated Learning System" expected to issue Q2 2025. This is noted in Document: Risk Assessment Memo as a minor risk item.

6. LITIGATION

No IP-related litigation is pending or threatened. This is confirmed in Document: Legal Opinion Letter.

This certification is provided in connection with the due diligence process and may be relied upon by TechCorp Industries, Inc.

Sincerely,
PatentWatch Legal Services
By: Robert Kim, Patent Attorney """ }, "04_risk_assessment.pdf": { "title": "RISK ASSESSMENT MEMO", "content": """ CONFIDENTIAL RISK ASSESSMENT MEMORANDUM

To: TechCorp Board of Directors
From: Corporate Development Team
Date: December 22, 2024
Re: Risk Assessment - StartupXYZ Acquisition

This memo summarizes key risks identified in connection with the proposed acquisition as documented in the Document: Acquisition Agreement.

1. HIGH-PRIORITY RISKS

1.1 Customer Concentration (HIGH)
- MegaCorp represents 28% of StartupXYZ revenue
- MegaCorp contract contains change-of-control clause
- Mitigation: Obtain consent prior to closing (see Document: Customer Consent Letters)
- Impact if materialized: $3.4M annual revenue at risk

1.2 Key Employee Retention (HIGH)
- 5 senior engineers critical to product development
- 2 have expressed interest in leaving post-acquisition
- Mitigation: Retention packages per Schedule 3 - Employee Transition Plan
- Estimated cost: $2.5M in retention bonuses

2. MEDIUM-PRIORITY RISKS

2.1 Earnout Structure (MEDIUM)
- $5M earnout tied to 2025-2026 performance metrics
- Metrics defined in Exhibit C - Earnout Terms of the Acquisition Agreement
- Risk: Disagreement on metric calculation methodology
- Mitigation: Clear definitions in agreement; third-party arbitration clause

2.2 Integration Costs (MEDIUM)
- Estimated integration costs: $4.2M over 18 months
- Systems integration detailed in Document: Integration Plan
- Risk: Cost overruns of 20-30% typical in tech acquisitions

3. LOW-PRIORITY RISKS

3.1 Pending Patent Application (LOW)
- One patent pending as noted in Document: IP Certification Letter
- Low risk of rejection based on patent attorney's assessment

3.2 Regulatory Approval (LOW)
- HSR filing required but expected to clear without issues
- Timeline in Document: Regulatory Approval Letter

4. FINANCIAL IMPACT SUMMARY

Total risk-adjusted impact: $6.2M - $8.7M
This is reflected in purchase price negotiations per Document: Financial Adjustments Memo

5. RECOMMENDATION

Despite identified risks, we recommend proceeding with the acquisition. The strategic value of StartupXYZ's AI technology platform justifies the purchase price when accounting for risk mitigation costs. All findings are consistent with Document: Due Diligence Report.

6. NEXT STEPS

- Finalize customer consent process
- Execute retention agreements
- Complete regulatory filings
- Prepare for closing per Document: Closing Checklist """ }, "05_financial_adjustments.pdf": { "title": "FINANCIAL ADJUSTMENTS MEMO", "content": """ FINANCIAL ADJUSTMENTS MEMORANDUM

To: Deal Team
From: Finance Department
Date: December 23, 2024
Re: Purchase Price Adjustments - StartupXYZ Acquisition

Following our review in connection with the Document: Due Diligence Report, we recommend the following adjustments to the purchase price as set forth in Exhibit A - Financial Terms of the Document: Acquisition Agreement.

1. WORKING CAPITAL ADJUSTMENT

Target working capital: $1,200,000
Estimated closing working capital: $980,000
Adjustment: ($220,000)

2. DEBT ADJUSTMENT

Previously disclosed debt: $1,500,000
Additional identified debt: $175,000 (capital lease obligations)
Adjustment: ($175,000)

3. REVENUE RECOGNITION ADJUSTMENT

Deferred revenue requiring restatement: $340,000
Impact on EBITDA: ($85,000)
Implied value adjustment (at 15x): ($1,275,000)

4. CONTINGENT LIABILITY RESERVE

As noted in Document: Risk Assessment Memo, we recommend establishing reserves for:
- Customer concentration risk: $500,000
- Integration contingency: $800,000
Total reserve: $1,300,000 (to be held in escrow per Exhibit C - Earnout Terms)

5. SUMMARY OF ADJUSTMENTS

Original Purchase Price: $45,000,000
Working Capital Adjustment: ($220,000)
Debt Adjustment: ($175,000)
Revenue Recognition: ($1,275,000)
Adjusted Purchase Price: $43,330,000

Plus escrow reserve: $1,300,000
Total Cash Required at Closing: $44,630,000

6. PAYMENT STRUCTURE

As revised from Document: Acquisition Agreement Section 2.2:
(a) Cash at closing: $28,330,000 (adjusted)
(b) Stock consideration: $10,000,000 (per Exhibit B - Stock Valuation)
(c) Earnout: $5,000,000 (unchanged, per Exhibit C - Earnout Terms)
(d) Escrow: $1,300,000 (18-month release schedule)

These adjustments have been discussed with Seller's representatives and are subject to final negotiation.

Please refer to Document: Closing Checklist for timeline and requirements. """ }, "06_legal_opinion.pdf": { "title": "LEGAL OPINION LETTER", "content": """ LEGAL OPINION LETTER

Date: December 18, 2024

TechCorp Industries, Inc.
500 Technology Drive
San Francisco, CA 94105

Re: Acquisition of StartupXYZ LLC

Ladies and Gentlemen:

We have acted as legal counsel to StartupXYZ LLC ("Company") in connection with the proposed acquisition by TechCorp Industries, Inc. pursuant to the Document: Acquisition Agreement dated January 15, 2025.

DOCUMENTS REVIEWED

In connection with this opinion, we have reviewed:
1. The Acquisition Agreement and all Exhibits and Schedules
2. Document: Due Diligence Report prepared by Morrison & Associates
3. Document: IP Certification Letter from PatentWatch Legal Services
4. All material contracts listed in Schedule 2 - Material Contracts
5. Corporate records and organizational documents of the Company
6. Document: Non-Disclosure Agreement between the parties

OPINIONS

Based on our review, we are of the opinion that:

1. Corporate Status
The Company is a limited liability company duly organized, validly existing, and in good standing under the laws of Delaware.

2. Authority
The Company has full power and authority to execute and deliver the Acquisition Agreement and to consummate the transactions contemplated thereby.

3. No Conflicts
The execution and delivery of the Acquisition Agreement does not violate any provision of the Company's organizational documents or any material contract, except for change-of-control provisions noted in Document: Customer Consent Letters.

4. Litigation
There is no litigation, arbitration, or governmental proceeding pending or, to our knowledge, threatened against the Company that would have a material adverse effect on the Company or the transactions contemplated by the Acquisition Agreement.

This opinion confirms the representations in the Document: IP Certification Letter regarding absence of IP litigation.

5. Regulatory Compliance
The Company is in material compliance with all applicable laws and regulations. The HSR filing requirements are addressed in Document: Regulatory Approval Letter.

QUALIFICATIONS

This opinion is subject to the following qualifications:
1. We express no opinion on tax matters (see separate tax opinion)
2. This opinion is limited to Delaware and federal law
3. Certain contracts require third-party consents as noted above

This opinion is provided solely for your benefit in connection with the transactions contemplated by the Acquisition Agreement.

Very truly yours,
Wilson & Partners LLP
By: Jennifer Walsh, Partner """ }, "07_nda.pdf": { "title": "NON-DISCLOSURE AGREEMENT", "content": """ MUTUAL NON-DISCLOSURE AGREEMENT

This Mutual Non-Disclosure Agreement ("NDA") is entered into as of October 1, 2024, by and between:

TechCorp Industries, Inc. ("TechCorp")
500 Technology Drive, San Francisco, CA 94105

and

StartupXYZ LLC ("StartupXYZ")
123 Innovation Way, Palo Alto, CA 94301

(each a "Party" and collectively the "Parties")

RECITALS

The Parties wish to explore a potential business relationship, including a possible acquisition of StartupXYZ by TechCorp (the "Purpose"), which is now documented in the Document: Acquisition Agreement.

1. DEFINITION OF CONFIDENTIAL INFORMATION

"Confidential Information" means any non-public information disclosed by either Party, including but not limited to:
- Financial information (as contained in Document: Due Diligence Report)
- Technical information (as certified in Document: IP Certification Letter)
- Business strategies and plans
- Customer and supplier information
- Employee information (as detailed in Schedule 3 - Employee Transition Plan)

2. OBLIGATIONS

Each Party agrees to:
(a) Hold Confidential Information in strict confidence
(b) Not disclose Confidential Information to third parties without prior written consent
(c) Use Confidential Information solely for the Purpose
(d) Limit access to Confidential Information to employees with a need to know

3. TERM

This NDA shall remain in effect for three (3) years from the date first written above, or until superseded by the confidentiality provisions in the Document: Acquisition Agreement Article V.

4. EXCLUSIONS

Confidential Information does not include information that:
(a) Is or becomes publicly available through no fault of the receiving Party
(b) Was rightfully in the receiving Party's possession prior to disclosure
(c) Is rightfully obtained from a third party without restriction
(d) Is independently developed without use of Confidential Information

5. RETURN OF MATERIALS

Upon request or termination, each Party shall return or destroy all Confidential Information, except as required for legal or regulatory purposes.

6. NO LICENSE

Nothing in this NDA grants any rights to intellectual property, except as subsequently agreed in the Document: Acquisition Agreement and Schedule 1 - IP Assets.

IN WITNESS WHEREOF, the Parties have executed this NDA as of the date first above written.

TechCorp Industries, Inc.
By: ______________________
Name: James Mitchell
Title: CEO

StartupXYZ LLC
By: ______________________
Name: Sarah Chen
Title: Founder & CEO """ }, "08_regulatory_approval.pdf": { "title": "REGULATORY APPROVAL LETTER", "content": """ FEDERAL TRADE COMMISSION
PREMERGER NOTIFICATION OFFICE

January 28, 2025

TechCorp Industries, Inc.
500 Technology Drive
San Francisco, CA 94105

StartupXYZ LLC
123 Innovation Way
Palo Alto, CA 94301

Re: Early Termination of HSR Waiting Period
Transaction: Acquisition of StartupXYZ LLC by TechCorp Industries, Inc.

Dear Parties:

This letter confirms that the Federal Trade Commission has granted early termination of the waiting period under the Hart-Scott-Rodino Antitrust Improvements Act of 1976 for the above-referenced transaction.

FILING DETAILS

Filing Date: January 10, 2025
Transaction Value: $45,000,000 (as stated in Document: Acquisition Agreement)
HSR Filing Fee: $30,000
Early Termination Granted: January 28, 2025

EFFECT OF EARLY TERMINATION

The parties may now consummate the transaction at any time. This early termination satisfies the condition precedent set forth in Article IV, Section 4.1(a) of the Document: Acquisition Agreement.

Please note that early termination of the waiting period does not preclude the Commission from taking any action it deems necessary to protect competition.

NEXT STEPS

Per the Document: Closing Checklist, you may now proceed with the closing scheduled for March 1, 2025, subject to satisfaction of other conditions in the Document: Acquisition Agreement.

The Document: Risk Assessment Memo correctly identified this as a low-risk item. The market analysis in the Document: Due Diligence Report supported the determination that this transaction does not raise competitive concerns.

Sincerely,
Premerger Notification Office
Federal Trade Commission """ }, "09_customer_consents.pdf": { "title": "CUSTOMER CONSENT LETTERS", "content": """ CUSTOMER CONSENT STATUS REPORT

Date: February 15, 2025
To: Deal Team
From: Legal Department
Re: Change of Control Consent Status

As required by Schedule 2 - Material Contracts of the Document: Acquisition Agreement, this memo summarizes the status of customer consents for contracts containing change-of-control provisions.

CONSENT STATUS SUMMARY

1. MegaCorp Inc. - OBTAINED
Contract Value: $3.4M annual
Consent Received: February 10, 2025
Notes: MegaCorp requested meeting with TechCorp leadership; meeting held 2/8/25. Consent granted with no additional conditions. This addresses the primary concern noted in Document: Risk Assessment Memo Section 1.1.

2. DataFlow Systems - OBTAINED
Contract Value: $1.2M annual
Consent Received: February 5, 2025
Notes: Standard consent process. No concerns raised.

3. CloudTech Partners - PENDING
Contract Value: $890K annual
Status: Consent requested February 1, 2025
Expected: February 20, 2025
Notes: Legal review in progress at CloudTech. Their counsel has reviewed the Document: Acquisition Agreement and has no objections. Verbal confirmation received; written consent expected shortly.

IMPACT ANALYSIS

Per Document: Due Diligence Report Section 4, there were 3 contracts requiring consent:
- 2 obtained (representing $4.6M annual revenue)
- 1 pending (representing $890K annual revenue)

CLOSING IMPLICATIONS

The Document: Acquisition Agreement Article IV requires "material" customer consents as a closing condition. With MegaCorp consent obtained, this condition is substantially satisfied. The pending CloudTech consent is expected before the March 1 closing date per Document: Closing Checklist.

ATTACHMENTS

Attached hereto:
- Exhibit A: MegaCorp Consent Letter (dated February 10, 2025)
- Exhibit B: DataFlow Systems Consent Letter (dated February 5, 2025)
- Exhibit C: CloudTech Partners Draft Consent (pending signature)

RECOMMENDATION

We recommend proceeding with closing preparations. The risk of CloudTech withholding consent is low based on discussions with their counsel. This is consistent with the risk mitigation strategy in Document: Risk Assessment Memo. """ }, "10_closing_checklist.pdf": { "title": "CLOSING CHECKLIST", "content": """ CLOSING CHECKLIST
Acquisition of StartupXYZ LLC by TechCorp Industries, Inc.

Closing Date: March 1, 2025
Closing Location: Wilson & Partners LLP, San Francisco

I. PRE-CLOSING CONDITIONS

A. Regulatory
[X] HSR Filing submitted - Document: Regulatory Approval Letter
[X] Early termination received (January 28, 2025)
[ ] State regulatory filings (if required)

B. Third-Party Consents
[X] MegaCorp consent - Document: Customer Consent Letters
[X] DataFlow consent - Document: Customer Consent Letters
[ ] CloudTech consent (expected February 20) - Document: Customer Consent Letters

C. Due Diligence Completion
[X] Financial due diligence - Document: Due Diligence Report
[X] Legal due diligence - Document: Legal Opinion Letter
[X] IP due diligence - Document: IP Certification Letter
[X] Risk assessment - Document: Risk Assessment Memo

II. CLOSING DOCUMENTS

A. Transaction Documents
[ ] Executed Document: Acquisition Agreement
[ ] Bill of Sale
[ ] Assignment and Assumption Agreement
[ ] IP Assignment Agreement (per Schedule 1 - IP Assets)

B. Corporate Documents
[ ] Seller's Certificate of Good Standing
[ ] Secretary's Certificate (resolutions, incumbency)
[ ] Buyer's Certificate of Good Standing

C. Financial Documents
[ ] Closing Statement per Document: Financial Adjustments Memo
[ ] Wire transfer instructions
[ ] Escrow Agreement (per Exhibit C - Earnout Terms)
[ ] Stock certificates or book entry (per Exhibit B - Stock Valuation)

D. Employment Documents
[ ] Retention agreements per Schedule 3 - Employee Transition Plan
[ ] Offer letters for key employees
[ ] WARN Act compliance (if applicable)

III. CLOSING FUNDS

Per Document: Financial Adjustments Memo:
[ ] Cash payment: $28,330,000
[ ] Escrow deposit: $1,300,000
[ ] Stock issuance: $10,000,000
Total at Closing: $39,630,000

IV. POST-CLOSING

[ ] File UCC termination statements
[ ] Update corporate records
[ ] Integration kickoff per Document: Integration Plan
[ ] Employee communications
[ ] Customer notifications
[ ] Press release

V. RESPONSIBLE PARTIES

Buyer's Counsel: Morrison & Associates LLP
Seller's Counsel: Wilson & Partners LLP
Escrow Agent: First National Trust

VI. KEY CONTACTS

TechCorp: James Mitchell (CEO), (415) 555-0100
StartupXYZ: Sarah Chen (CEO), (650) 555-0200
Legal (Buyer): John Morrison, (415) 555-0300
Legal (Seller): Jennifer Walsh, (415) 555-0400 """ } } def create_pdf(filename: str, title: str, content: str): """Create a PDF document.""" filepath = os.path.join(OUTPUT_DIR, filename) doc = SimpleDocTemplate(filepath, pagesize=letter, topMargin=1*inch, bottomMargin=1*inch, leftMargin=1*inch, rightMargin=1*inch) styles = getSampleStyleSheet() title_style = ParagraphStyle( 'CustomTitle', parent=styles['Heading1'], fontSize=16, spaceAfter=30, alignment=1 # Center ) body_style = ParagraphStyle( 'CustomBody', parent=styles['Normal'], fontSize=11, leading=14, spaceAfter=12 ) story = [] story.append(Paragraph(title, title_style)) story.append(Spacer(1, 0.5*inch)) # Split content into paragraphs and add them paragraphs = content.strip().split('
\n\n') for para in paragraphs: para = para.replace('\n', '\n
') story.append(Paragraph(para, body_style)) doc.build(story) print(f"Created: {filepath}") def main(): # Create output directory os.makedirs(OUTPUT_DIR, exist_ok=True) print(f"\nGenerating {len(DOCUMENTS)} test documents in {OUTPUT_DIR}/\n") for filename, doc_info in DOCUMENTS.items(): create_pdf(filename, doc_info["title"], doc_info["content"]) print(f"\n✅ Generated {len(DOCUMENTS)} documents successfully!") print(f"\nDocument cross-reference map:") print("=" * 60) print(""" Acquisition Agreement (01) ├── references: Exhibit A, B, C, Schedule 1-3 ├── referenced by: ALL other documents │ Due Diligence Report (02) ├── references: Acquisition Agreement, IP Cert, Risk Assessment ├── referenced by: Legal Opinion, Risk Assessment, Regulatory │ IP Certification (03) ├── references: Acquisition Agreement, Schedule 1, NDA ├── referenced by: Due Diligence, Legal Opinion │ Risk Assessment (04) ├── references: Acquisition Agreement, Due Diligence, IP Cert ├── referenced by: Financial Adjustments, Customer Consents │ Financial Adjustments (05) ├── references: Due Diligence, Risk Assessment, Acquisition Agreement ├── referenced by: Closing Checklist │ Legal Opinion (06) ├── references: Acquisition Agreement, Due Diligence, IP Cert, NDA ├── referenced by: Closing Checklist │ NDA (07) ├── references: Acquisition Agreement, Due Diligence, IP Cert ├── referenced by: IP Cert, Legal Opinion │ Regulatory Approval (08) ├── references: Acquisition Agreement, Due Diligence, Risk Assessment ├── referenced by: Closing Checklist │ Customer Consents (09) ├── references: Acquisition Agreement, Risk Assessment, Schedule 2 ├── referenced by: Closing Checklist │ Closing Checklist (10) └── references: ALL documents """) if __name__ == "__main__": main() ================================================ FILE: src/fs_explorer/__init__.py ================================================ """ FsExplorer - AI-powered filesystem exploration agent. This package provides an intelligent agent that can explore filesystems, parse documents, and answer questions about their contents using Google Gemini for decision-making and Docling for document parsing. Example usage: >>> from fs_explorer import FsExplorerAgent, workflow >>> agent = FsExplorerAgent() >>> # Use with the workflow for full exploration >>> result = await workflow.run(start_event=InputEvent(task="Find the purchase price")) """ from .agent import FsExplorerAgent, TokenUsage from .workflow import ( workflow, FsExplorerWorkflow, InputEvent, ExplorationEndEvent, ToolCallEvent, GoDeeperEvent, AskHumanEvent, HumanAnswerEvent, get_agent, reset_agent, ) from .models import Action, ActionType, Tools __all__ = [ # Agent "FsExplorerAgent", "TokenUsage", # Workflow "workflow", "FsExplorerWorkflow", "InputEvent", "ExplorationEndEvent", "ToolCallEvent", "GoDeeperEvent", "AskHumanEvent", "HumanAnswerEvent", "get_agent", "reset_agent", # Models "Action", "ActionType", "Tools", ] ================================================ FILE: src/fs_explorer/agent.py ================================================ """ FsExplorer Agent for filesystem exploration using Google Gemini. This module contains the agent that interacts with the Gemini AI model to make decisions about filesystem exploration actions. 
""" import os import re from pathlib import Path from typing import Callable, Any, cast from dataclasses import dataclass from dotenv import load_dotenv from google.genai.types import Content, HttpOptions, Part from google.genai import Client as GenAIClient from .models import Action, ActionType, ToolCallAction, Tools from .fs import ( read_file, grep_file_content, glob_paths, scan_folder, preview_file, parse_file, ) from .embeddings import EmbeddingProvider from .index_config import resolve_db_path from .search import ( IndexedQueryEngine, MetadataFilterParseError, supported_filter_syntax, ) from .storage import DuckDBStorage # Load .env file from project root _env_path = Path(__file__).parent.parent.parent / ".env" if _env_path.exists(): load_dotenv(_env_path) # ============================================================================= # Token Usage Tracking # ============================================================================= # Gemini Flash pricing (per million tokens) GEMINI_FLASH_INPUT_COST_PER_MILLION = 0.075 GEMINI_FLASH_OUTPUT_COST_PER_MILLION = 0.30 @dataclass class TokenUsage: """ Track token usage and costs across the session. Maintains running totals of API calls, token counts, and provides cost estimates based on Gemini Flash pricing. """ prompt_tokens: int = 0 completion_tokens: int = 0 total_tokens: int = 0 api_calls: int = 0 # Track content sizes tool_result_chars: int = 0 documents_parsed: int = 0 documents_scanned: int = 0 def add_api_call(self, prompt_tokens: int, completion_tokens: int) -> None: """Record token usage from an API call.""" self.prompt_tokens += prompt_tokens self.completion_tokens += completion_tokens self.total_tokens += prompt_tokens + completion_tokens self.api_calls += 1 def add_tool_result(self, result: str, tool_name: str) -> None: """Record metrics from a tool execution.""" self.tool_result_chars += len(result) if tool_name == "parse_file": self.documents_parsed += 1 elif tool_name == "scan_folder": # Count documents in scan result by counting document markers self.documents_scanned += result.count("│ [") elif tool_name == "preview_file": self.documents_parsed += 1 def _calculate_cost(self) -> tuple[float, float, float]: """Calculate estimated costs based on Gemini Flash pricing.""" input_cost = ( self.prompt_tokens / 1_000_000 ) * GEMINI_FLASH_INPUT_COST_PER_MILLION output_cost = ( self.completion_tokens / 1_000_000 ) * GEMINI_FLASH_OUTPUT_COST_PER_MILLION return input_cost, output_cost, input_cost + output_cost def summary(self) -> str: """Generate a formatted summary of token usage and costs.""" input_cost, output_cost, total_cost = self._calculate_cost() return f""" ═══════════════════════════════════════════════════════════════ TOKEN USAGE SUMMARY ═══════════════════════════════════════════════════════════════ API Calls: {self.api_calls} Prompt Tokens: {self.prompt_tokens:,} Completion Tokens: {self.completion_tokens:,} Total Tokens: {self.total_tokens:,} ─────────────────────────────────────────────────────────────── Documents Scanned: {self.documents_scanned} Documents Parsed: {self.documents_parsed} Tool Result Chars: {self.tool_result_chars:,} ─────────────────────────────────────────────────────────────── Est. 
Cost (Gemini Flash): Input: ${input_cost:.4f} Output: ${output_cost:.4f} Total: ${total_cost:.4f} ═══════════════════════════════════════════════════════════════ """ # ============================================================================= # Tool Registry # ============================================================================= @dataclass(frozen=True) class IndexContext: """Execution context for indexed retrieval tools.""" root_folder: str db_path: str _INDEX_CONTEXT: IndexContext | None = None _EMBEDDING_PROVIDER: EmbeddingProvider | None = None _FIELD_CATALOG_SHOWN: bool = False _ENABLE_SEMANTIC: bool = False _ENABLE_METADATA: bool = False def set_search_flags( *, enable_semantic: bool = False, enable_metadata: bool = False ) -> None: """Configure which indexed retrieval paths are active.""" global _ENABLE_SEMANTIC, _ENABLE_METADATA _ENABLE_SEMANTIC = enable_semantic _ENABLE_METADATA = enable_metadata def get_search_flags() -> tuple[bool, bool]: """Return (enable_semantic, enable_metadata).""" return _ENABLE_SEMANTIC, _ENABLE_METADATA def set_embedding_provider(provider: EmbeddingProvider | None) -> None: """Set the embedding provider for vector search in indexed tools.""" global _EMBEDDING_PROVIDER _EMBEDDING_PROVIDER = provider def set_index_context(folder: str, db_path: str | None = None) -> None: """Enable indexed tools for a specific folder corpus.""" global _INDEX_CONTEXT, _EMBEDDING_PROVIDER _INDEX_CONTEXT = IndexContext( root_folder=str(Path(folder).resolve()), db_path=resolve_db_path(db_path), ) # Auto-create embedding provider if API key available if _EMBEDDING_PROVIDER is None: try: _EMBEDDING_PROVIDER = EmbeddingProvider() except ValueError: pass def clear_index_context() -> None: """Disable indexed tools for the current process.""" global _INDEX_CONTEXT, _EMBEDDING_PROVIDER, _FIELD_CATALOG_SHOWN global _ENABLE_SEMANTIC, _ENABLE_METADATA _INDEX_CONTEXT = None _EMBEDDING_PROVIDER = None _FIELD_CATALOG_SHOWN = False _ENABLE_SEMANTIC = False _ENABLE_METADATA = False def _get_index_storage_and_corpus() -> tuple[ DuckDBStorage | None, str | None, str | None ]: if _INDEX_CONTEXT is None: return None, None, "Index context is not configured. Re-run with `--use-index`." storage = DuckDBStorage(_INDEX_CONTEXT.db_path) corpus_id = storage.get_corpus_id(_INDEX_CONTEXT.root_folder) if corpus_id is None: return ( None, None, f"No index found for folder {_INDEX_CONTEXT.root_folder}. " "Run `explore index ` first.", ) return storage, corpus_id, None def _clean_excerpt(text: str, max_chars: int = 320) -> str: squashed = re.sub(r"\s+", " ", text).strip() if len(squashed) <= max_chars: return squashed return f"{squashed[:max_chars]}..." def semantic_search(query: str, filters: str | None = None, limit: int = 5) -> str: """Search indexed chunks and return ranked excerpts.""" storage, corpus_id, error = _get_index_storage_and_corpus() if error: return error assert storage is not None and corpus_id is not None engine = IndexedQueryEngine(storage, embedding_provider=_EMBEDDING_PROVIDER) try: hits = engine.search( corpus_id=corpus_id, query=query, filters=filters, limit=limit, enable_semantic=_ENABLE_SEMANTIC, enable_metadata=_ENABLE_METADATA, ) except MetadataFilterParseError as exc: return f"Invalid metadata filter: {exc}\n{supported_filter_syntax()}" except ValueError as exc: return f"Metadata filter error: {exc}" if not hits: if filters: return f"No indexed matches found for query={query!r} with filters={filters!r}." 
return f"No indexed matches found for query: {query!r}" lines = [ "=== INDEXED SEARCH RESULTS ===", f"Query: {query}", ] if filters: lines.append(f"Filters: {filters}") lines.append("") for idx, hit in enumerate(hits, start=1): position = hit.position if hit.position is not None else "" lines.extend( [ f"[{idx}] doc_id: {hit.doc_id}", f" path: {hit.absolute_path}", f" match: {hit.matched_by}", f" chunk_position: {position}", f" semantic_score: {hit.semantic_score}", f" metadata_score: {hit.metadata_score}", f" score: {hit.score:.2f}", f" excerpt: {_clean_excerpt(hit.text)}", "", ] ) lines.append( "Use get_document(doc_id=...) to read full content for the most relevant documents." ) # Include a rich field catalog on the first search so the agent can # construct effective metadata filters. global _FIELD_CATALOG_SHOWN if not _FIELD_CATALOG_SHOWN: active_schema = storage.get_active_schema(corpus_id=corpus_id) if active_schema is not None: schema_fields = active_schema.schema_def.get("fields") if isinstance(schema_fields, list) and schema_fields: field_names = [ str(f["name"]) for f in schema_fields if isinstance(f, dict) and isinstance(f.get("name"), str) ] field_values = storage.get_metadata_field_values( corpus_id=corpus_id, field_names=field_names, ) field_descs: list[str] = [] for field in schema_fields: if not isinstance(field, dict) or not isinstance( field.get("name"), str ): continue name = field["name"] ftype = field.get("type", "string") desc = field.get("description", "") entry = f"{name} ({ftype})" if desc: entry += f": {desc}" vals = field_values.get(name, []) if ftype == "boolean": entry += " Values: true, false" elif ftype in {"integer", "number"} and vals: nums = [] for v in vals: try: nums.append(float(v)) except (TypeError, ValueError): pass if nums: entry += f" Range: {min(nums):.6g}-{max(nums):.6g}" elif vals: if "enum" in field: entry += f" Values: {field['enum']}" else: entry += f" Values: {', '.join(repr(v) for v in vals)}" elif "enum" in field: entry += f" Values: {field['enum']}" field_descs.append(entry) if field_descs: lines.append("") lines.append( "Available filter fields for semantic_search(filters=...):" ) for desc in field_descs: lines.append(f" - {desc}") _FIELD_CATALOG_SHOWN = True return "\n".join(lines) def get_document(doc_id: str) -> str: """Return full document content by id from the active index context.""" storage, _, error = _get_index_storage_and_corpus() if error: return error assert storage is not None document = storage.get_document(doc_id=doc_id) if document is None: return f"No indexed document found for doc_id={doc_id!r}" if document["is_deleted"]: return f"Document {doc_id} is marked as deleted in the index." return ( f"=== DOCUMENT {doc_id} ===\n" f"Path: {document['absolute_path']}\n\n" f"{document['content']}" ) def list_indexed_documents() -> str: """List indexed documents for the active corpus.""" storage, corpus_id, error = _get_index_storage_and_corpus() if error: return error assert storage is not None and corpus_id is not None documents = storage.list_documents(corpus_id=corpus_id, include_deleted=False) if not documents: return "No indexed documents found for the active corpus." lines = ["=== INDEXED DOCUMENTS ==="] for idx, document in enumerate(documents, start=1): lines.append( f"[{idx}] doc_id={document['id']} path={document['absolute_path']}" ) lines.append("") lines.append("Use semantic_search(...) 
to find relevant doc_ids.") return "\n".join(lines) TOOLS: dict[Tools, Callable[..., str]] = { "read": read_file, "grep": grep_file_content, "glob": glob_paths, "scan_folder": scan_folder, "preview_file": preview_file, "parse_file": parse_file, "semantic_search": semantic_search, "get_document": get_document, "list_indexed_documents": list_indexed_documents, } # ============================================================================= # System Prompt # ============================================================================= SYSTEM_PROMPT = """ You are FsExplorer, an AI agent that explores filesystems to answer user questions about documents. ## Available Tools | Tool | Purpose | Parameters | |------|---------|------------| | `scan_folder` | **PARALLEL SCAN** - Scan ALL documents in a folder at once | `directory` | | `preview_file` | Quick preview of a single document (~first page) | `file_path` | | `parse_file` | **DEEP READ** - Full content of a document | `file_path` | | `read` | Read a plain text file | `file_path` | | `grep` | Search for a pattern in a file | `file_path`, `pattern` | | `glob` | Find files matching a pattern | `directory`, `pattern` | | `semantic_search` | Search indexed chunks and metadata-filtered docs, then union/rank results | `query`, `filters`, `limit` | | `get_document` | Read full indexed document by document id | `doc_id` | | `list_indexed_documents` | List indexed documents for active corpus | none | ## Indexed Retrieval Strategy When indexed tools are available: 1. Start with `semantic_search` to quickly find relevant documents. 2. Use `get_document` for the top candidate doc IDs. 3. If indexed tools report index is unavailable, fall back to filesystem tools (`scan_folder`, `parse_file`, etc.). Filter syntax for `semantic_search(filters=...)`: - `field=value` - `field!=value` - `field>=number`, `field<=number`, `field>number`, `field The total purchase price is $125,000,000 [Source: 01_master_agreement.pdf, Section 2.1], > consisting of $80M cash [Source: 01_master_agreement.pdf, Section 2.1(a)], > $30M in stock [Source: 10_stock_purchase.pdf, Section 1], and > $15M in escrow [Source: 09_escrow_agreement.pdf, Section 2]. ### Citation Rules 1. **Every factual claim needs a citation** - dates, numbers, names, terms, etc. 2. **Be specific** - include section numbers, article numbers, or page references when available 3. **Use the actual filename** - not paraphrased names 4. **Multiple sources** - if information comes from multiple documents, cite all of them ### Final Answer Structure Your final answer should: 1. **Start with a direct answer** to the user's question 2. **Provide details** with inline citations 3. **End with a Sources section** listing all documents consulted: ``` ## Sources Consulted - 01_master_agreement.pdf - Main acquisition terms - 10_stock_purchase.pdf - Stock component details - 09_escrow_agreement.pdf - Escrow terms and release schedule ``` ## Example Workflow ``` User asks: "What is the purchase price?" 1. scan_folder("./documents/") Reason: "Scanned 10 documents. Categorizing: - RELEVANT: purchase_agreement.pdf (mentions 'Purchase Price' in preview) - RELEVANT: financial_terms.pdf (contains pricing tables) - MAYBE: exhibits.pdf (referenced by other docs) - SKIP: employee_handbook.pdf, hr_policies.pdf (unrelated to pricing)" 2. parse_file("purchase_agreement.pdf") Reason: "Found purchase price of $50M in Section 2.1. Document references 'Exhibit B for price adjustments' - need to check exhibits.pdf next." 3. 
parse_file("exhibits.pdf") [BACKTRACKING] Reason: "Backtracking to exhibits.pdf because purchase_agreement.pdf referenced it for adjustment details. Found working capital adjustment formula in Exhibit B." 4. STOP with final answer including citations: "The purchase price is $50,000,000 [Source: purchase_agreement.pdf, Section 2.1], subject to working capital adjustments [Source: exhibits.pdf, Exhibit B]..." ``` """ def _build_system_prompt(enable_semantic: bool, enable_metadata: bool) -> str: """Build a system prompt with retrieval-path guidance appended.""" if enable_semantic and enable_metadata: hint = ( "\n\n## Retrieval: Semantic + Metadata\n" "An index is available. Start with `semantic_search` using optional " "`filters` for best results, then use filesystem tools for deep dives." ) elif enable_semantic: hint = ( "\n\n## Retrieval: Semantic Only\n" "An index is available. Use `semantic_search` WITHOUT the `filters` " "parameter for similarity search, then use filesystem tools for details." ) elif enable_metadata: hint = ( "\n\n## Retrieval: Metadata Only\n" "An index is available. Use `semantic_search` with the `filters=` " "parameter for metadata filtering, then use filesystem tools for details." ) else: return SYSTEM_PROMPT return SYSTEM_PROMPT + hint # ============================================================================= # Agent Implementation # ============================================================================= class FsExplorerAgent: """ AI agent for exploring filesystems using Google Gemini. The agent maintains a conversation history with the LLM and uses structured JSON output to make decisions about which actions to take. Attributes: token_usage: Tracks API call statistics and costs. """ def __init__(self, api_key: str | None = None) -> None: """ Initialize the agent with Google API credentials. Args: api_key: Google API key. If not provided, reads from GOOGLE_API_KEY environment variable. Raises: ValueError: If no API key is available. """ if api_key is None: api_key = os.getenv("GOOGLE_API_KEY") if api_key is None: raise ValueError( "GOOGLE_API_KEY not found within the current environment: " "please export it or provide it to the class constructor." ) self._client = GenAIClient( api_key=api_key, http_options=HttpOptions(api_version="v1beta"), ) self._chat_history: list[Content] = [] self.token_usage = TokenUsage() def configure_task(self, task: str) -> None: """ Add a task message to the conversation history. Args: task: The task or context to add to the conversation. """ self._chat_history.append( Content(role="user", parts=[Part.from_text(text=task)]) ) async def take_action(self) -> tuple[Action, ActionType] | None: """ Request the next action from the AI model. Sends the current conversation history to Gemini and receives a structured JSON response indicating the next action to take. Returns: A tuple of (Action, ActionType) if successful, None otherwise. 
""" response = await self._client.aio.models.generate_content( model="gemini-3-flash-preview", contents=self._chat_history, # type: ignore config={ "system_instruction": _build_system_prompt(_ENABLE_SEMANTIC, _ENABLE_METADATA), "response_mime_type": "application/json", "response_schema": Action, }, ) # Track token usage from response metadata if response.usage_metadata: self.token_usage.add_api_call( prompt_tokens=response.usage_metadata.prompt_token_count or 0, completion_tokens=response.usage_metadata.candidates_token_count or 0, ) if response.candidates is not None: if response.candidates[0].content is not None: self._chat_history.append(response.candidates[0].content) if response.text is not None: action = Action.model_validate_json(response.text) if action.to_action_type() == "toolcall": toolcall = cast(ToolCallAction, action.action) self.call_tool( tool_name=toolcall.tool_name, tool_input=toolcall.to_fn_args(), ) return action, action.to_action_type() return None def call_tool(self, tool_name: Tools, tool_input: dict[str, Any]) -> None: """ Execute a tool and add the result to the conversation history. Args: tool_name: Name of the tool to execute. tool_input: Dictionary of arguments to pass to the tool. """ try: result = TOOLS[tool_name](**tool_input) except Exception as e: result = ( f"An error occurred while calling tool {tool_name} " f"with {tool_input}: {e}" ) # Track tool result sizes self.token_usage.add_tool_result(result, tool_name) self._chat_history.append( Content( role="user", parts=[ Part.from_text(text=f"Tool result for {tool_name}:\n\n{result}") ], ) ) def reset(self) -> None: """Reset the agent's conversation history and token tracking.""" self._chat_history.clear() self.token_usage = TokenUsage() ================================================ FILE: src/fs_explorer/embeddings.py ================================================ """ Embedding provider for vector-based semantic search. Wraps the Google GenAI embedding API for batch and single-query embedding with configurable model, dimensions, and batch size. """ from __future__ import annotations import os from typing import Any from google.genai import Client as GenAIClient _DEFAULT_MODEL = "gemini-embedding-001" _DEFAULT_DIM = 768 _DEFAULT_BATCH_SIZE = 50 class EmbeddingProvider: """Generate text embeddings via Google GenAI.""" def __init__( self, *, api_key: str | None = None, model: str | None = None, dim: int | None = None, batch_size: int | None = None, client: Any | None = None, ) -> None: self.model = model or os.getenv("FS_EXPLORER_EMBEDDING_MODEL", _DEFAULT_MODEL) self.dim = dim or int(os.getenv("FS_EXPLORER_EMBEDDING_DIM", str(_DEFAULT_DIM))) self.batch_size = batch_size or int( os.getenv("FS_EXPLORER_EMBEDDING_BATCH_SIZE", str(_DEFAULT_BATCH_SIZE)) ) if client is not None: self._client = client else: resolved_key = api_key or os.getenv("GOOGLE_API_KEY") if resolved_key is None: raise ValueError( "GOOGLE_API_KEY not found. " "Provide api_key or set the environment variable." ) self._client = GenAIClient(api_key=resolved_key) def embed_texts( self, texts: list[str], *, task_type: str = "RETRIEVAL_DOCUMENT", ) -> list[list[float]]: """Embed a list of texts in batches. Returns a list of embedding vectors in the same order as *texts*. 
""" all_embeddings: list[list[float]] = [] for start in range(0, len(texts), self.batch_size): batch = texts[start : start + self.batch_size] result = self._client.models.embed_content( model=self.model, contents=batch, config={ "task_type": task_type, "output_dimensionality": self.dim, }, ) for emb in result.embeddings: all_embeddings.append(list(emb.values)) return all_embeddings def embed_query(self, query: str) -> list[float]: """Embed a single query text for retrieval.""" result = self._client.models.embed_content( model=self.model, contents=[query], config={ "task_type": "RETRIEVAL_QUERY", "output_dimensionality": self.dim, }, ) return list(result.embeddings[0].values) ================================================ FILE: src/fs_explorer/exploration_trace.py ================================================ """ Helpers for recording exploration path and referenced files. """ from __future__ import annotations import os import re from dataclasses import dataclass, field from typing import Any FILE_TOOLS: frozenset[str] = frozenset({"read", "grep", "preview_file", "parse_file"}) # Matches citations like: [Source: filename.pdf, Section 2.1] SOURCE_CITATION_RE = re.compile(r"\[Source:\s*([^,\]]+)") def normalize_path(path: str, root_directory: str) -> str: """Return an absolute path using root_directory for relative inputs.""" if os.path.isabs(path): return os.path.abspath(path) return os.path.abspath(os.path.join(root_directory, path)) def extract_cited_sources(final_result: str | None) -> list[str]: """Extract source labels from final answer citations while preserving order.""" if not final_result: return [] seen: set[str] = set() ordered_sources: list[str] = [] for raw_source in SOURCE_CITATION_RE.findall(final_result): source = raw_source.strip() if source and source not in seen: seen.add(source) ordered_sources.append(source) return ordered_sources @dataclass class ExplorationTrace: """ Collects a step-by-step path and files referenced by tool calls. Paths are normalized to absolute paths to make replay/debugging easier. """ root_directory: str step_path: list[str] = field(default_factory=list) referenced_documents: set[str] = field(default_factory=set) def record_tool_call( self, *, step_number: int, tool_name: str, tool_input: dict[str, Any], resolved_document_path: str | None = None, ) -> None: """Record a tool call in the exploration path.""" path_entries: list[str] = [] directory = tool_input.get("directory") if isinstance(directory, str) and directory: path_entries.append(f"directory={normalize_path(directory, self.root_directory)}") file_path = tool_input.get("file_path") if isinstance(file_path, str) and file_path: normalized_file_path = normalize_path(file_path, self.root_directory) path_entries.append(f"file={normalized_file_path}") if tool_name in FILE_TOOLS: self.referenced_documents.add(normalized_file_path) if resolved_document_path: normalized_doc_path = normalize_path(resolved_document_path, self.root_directory) path_entries.append(f"document={normalized_doc_path}") self.referenced_documents.add(normalized_doc_path) parameters = ", ".join(path_entries) if path_entries else "no-path-args" self.step_path.append(f"{step_number}. tool:{tool_name} ({parameters})") def record_go_deeper(self, *, step_number: int, directory: str) -> None: """Record a directory navigation event in the exploration path.""" resolved_dir = normalize_path(directory, self.root_directory) self.step_path.append(f"{step_number}. 
godeeper (directory={resolved_dir})") def sorted_documents(self) -> list[str]: """Return a sorted list of referenced documents.""" return sorted(self.referenced_documents) ================================================ FILE: src/fs_explorer/fs.py ================================================ """ Filesystem utilities for the FsExplorer agent. This module provides functions for reading, searching, and parsing files in the filesystem, including support for complex document formats via Docling. """ import os import re import glob as glob_module from concurrent.futures import ThreadPoolExecutor, as_completed from pathlib import Path from docling.document_converter import DocumentConverter # ============================================================================= # Configuration Constants # ============================================================================= # Supported document extensions for parsing SUPPORTED_EXTENSIONS: frozenset[str] = frozenset({ ".pdf", ".docx", ".doc", ".pptx", ".xlsx", ".html", ".md" }) # Preview settings DEFAULT_PREVIEW_CHARS = 3000 # Characters for single file preview (~2-3 pages) DEFAULT_SCAN_PREVIEW_CHARS = 1500 # Characters for folder scan preview (~1 page) MAX_PREVIEW_LINES = 30 # Maximum lines to show in scan results # Parallel processing settings DEFAULT_MAX_WORKERS = 4 # Thread pool size for parallel document scanning # ============================================================================= # Document Cache # ============================================================================= # Cache for parsed documents to avoid re-parsing _DOCUMENT_CACHE: dict[str, str] = {} def clear_document_cache() -> None: """Clear the document cache. Useful for testing or memory management.""" _DOCUMENT_CACHE.clear() def _get_cached_or_parse(file_path: str) -> str: """ Get document content from cache or parse it. Uses file modification time in cache key to invalidate stale entries. Args: file_path: Path to the document file. Returns: The document content as markdown. Raises: Exception: If the document cannot be parsed. """ abs_path = os.path.abspath(file_path) cache_key = f"{abs_path}:{os.path.getmtime(abs_path)}" if cache_key not in _DOCUMENT_CACHE: converter = DocumentConverter() result = converter.convert(file_path) _DOCUMENT_CACHE[cache_key] = result.document.export_to_markdown() return _DOCUMENT_CACHE[cache_key] # ============================================================================= # Directory Operations # ============================================================================= def describe_dir_content(directory: str) -> str: """ Describe the contents of a directory. Lists all files and subdirectories in the given directory path. Args: directory: Path to the directory to describe. Returns: A formatted string describing the directory contents, or an error message if the directory doesn't exist. 
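Example return value (illustrative paths):

    Content of ./docs
    FILES:
    - ./docs/overview.md
    SUBFOLDERS:
    - ./docs/archive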
""" if not os.path.exists(directory) or not os.path.isdir(directory): return f"No such directory: {directory}" children = os.listdir(directory) if not children: return f"Directory {directory} is empty" files = [] directories = [] for child in children: fullpath = os.path.join(directory, child) if os.path.isfile(fullpath): files.append(fullpath) else: directories.append(fullpath) description = f"Content of {directory}\n" description += "FILES:\n- " + "\n- ".join(files) if not directories: description += "\nThis folder does not have any sub-folders" else: description += "\nSUBFOLDERS:\n- " + "\n- ".join(directories) return description # ============================================================================= # Basic File Operations # ============================================================================= def read_file(file_path: str) -> str: """ Read the contents of a text file. Args: file_path: Path to the file to read. Returns: The file contents, or an error message if the file doesn't exist. """ if not os.path.exists(file_path) or not os.path.isfile(file_path): return f"No such file: {file_path}" with open(file_path, "r") as f: return f.read() def grep_file_content(file_path: str, pattern: str) -> str: """ Search for a regex pattern in a file. Args: file_path: Path to the file to search. pattern: Regular expression pattern to search for. Returns: A formatted string with matches, "No matches found", or an error message if the file doesn't exist. """ if not os.path.exists(file_path) or not os.path.isfile(file_path): return f"No such file: {file_path}" with open(file_path, "r") as f: content = f.read() regex = re.compile(pattern=pattern, flags=re.MULTILINE) matches = regex.findall(content) if matches: return f"MATCHES for {pattern} in {file_path}:\n\n- " + "\n- ".join(matches) return "No matches found" def glob_paths(directory: str, pattern: str) -> str: """ Find files matching a glob pattern in a directory. Args: directory: Path to the directory to search in. pattern: Glob pattern to match (e.g., "*.txt", "**/*.pdf"). Returns: A formatted string with matching paths, "No matches found", or an error message if the directory doesn't exist. """ if not os.path.exists(directory) or not os.path.isdir(directory): return f"No such directory: {directory}" # Use pathlib for cleaner path handling search_path = Path(directory) / pattern matches = glob_module.glob(str(search_path)) if matches: return f"MATCHES for {pattern} in {directory}:\n\n- " + "\n- ".join(matches) return "No matches found" # ============================================================================= # Document Parsing Operations # ============================================================================= def preview_file(file_path: str, max_chars: int = DEFAULT_PREVIEW_CHARS) -> str: """ Get a quick preview of a document file. Reads only the first portion of the document content for initial relevance assessment before doing a full parse. Args: file_path: Path to the document file. max_chars: Maximum characters to return (default: 3000, ~2-3 pages). Returns: A preview of the document content, or an error message. """ if not os.path.exists(file_path) or not os.path.isfile(file_path): return f"No such file: {file_path}" ext = os.path.splitext(file_path)[1].lower() if ext not in SUPPORTED_EXTENSIONS: return ( f"Unsupported file extension: {ext}. 
" f"Supported: {', '.join(sorted(SUPPORTED_EXTENSIONS))}" ) try: full_content = _get_cached_or_parse(file_path) preview = full_content[:max_chars] total_len = len(full_content) if total_len > max_chars: preview += ( f"\n\n[... PREVIEW TRUNCATED. Full document has {total_len:,} " f"characters. Use parse_file() to read the complete document ...]" ) return f"=== PREVIEW of {file_path} ===\n\n{preview}" except Exception as e: return f"Error previewing {file_path}: {e}" def parse_file(file_path: str) -> str: """ Parse and return the complete content of a document file. Use this after preview_file() confirms the document is relevant, or when you need to find cross-references to other documents. Supported formats: PDF, DOCX, DOC, PPTX, XLSX, HTML, MD. Args: file_path: Path to the document file. Returns: The complete document content as markdown, or an error message. """ if not os.path.exists(file_path) or not os.path.isfile(file_path): return f"No such file: {file_path}" ext = os.path.splitext(file_path)[1].lower() if ext not in SUPPORTED_EXTENSIONS: return ( f"Unsupported file extension: {ext}. " f"Supported: {', '.join(sorted(SUPPORTED_EXTENSIONS))}" ) try: return _get_cached_or_parse(file_path) except Exception as e: return f"Error parsing {file_path}: {e}" # ============================================================================= # Parallel Document Scanning # ============================================================================= def _preview_single_file(file_path: str, preview_chars: int) -> dict: """ Helper to preview a single file for parallel processing. Args: file_path: Path to the document file. preview_chars: Number of characters to include in preview. Returns: A dictionary with file info and preview content. """ filename = os.path.basename(file_path) try: content = _get_cached_or_parse(file_path) preview = content[:preview_chars] return { "file": file_path, "filename": filename, "preview": preview, "total_chars": len(content), "status": "success" } except Exception as e: return { "file": file_path, "filename": filename, "preview": "", "total_chars": 0, "status": f"error: {e}" } def scan_folder( directory: str, max_workers: int = DEFAULT_MAX_WORKERS, preview_chars: int = DEFAULT_SCAN_PREVIEW_CHARS, ) -> str: """ Scan all documents in a folder in parallel and return quick previews. This is the FIRST step when exploring a folder with multiple documents. It efficiently processes all documents at once so you can assess relevance before doing deep dives into specific files. Args: directory: Path to the folder to scan. max_workers: Number of parallel workers (default: 4). preview_chars: Characters to preview per file (default: 1500, ~1 page). Returns: A formatted summary of all documents with their previews. """ if not os.path.exists(directory) or not os.path.isdir(directory): return f"No such directory: {directory}" # Find all supported document files doc_files = [] for item in os.listdir(directory): item_path = os.path.join(directory, item) if os.path.isfile(item_path): ext = os.path.splitext(item)[1].lower() if ext in SUPPORTED_EXTENSIONS: doc_files.append(item_path) if not doc_files: return ( f"No supported documents found in {directory}. 
" f"Supported extensions: {', '.join(sorted(SUPPORTED_EXTENSIONS))}" ) # Scan all documents in parallel results = [] with ThreadPoolExecutor(max_workers=max_workers) as executor: future_to_file = { executor.submit(_preview_single_file, f, preview_chars): f for f in doc_files } for future in as_completed(future_to_file): results.append(future.result()) # Sort by filename for consistent ordering results.sort(key=lambda x: x["filename"]) # Build the summary report output = [] output.append("═══════════════════════════════════════════════════════════════") output.append(f" PARALLEL DOCUMENT SCAN: {directory}") output.append(f" Found {len(results)} documents") output.append("═══════════════════════════════════════════════════════════════") output.append("") for i, result in enumerate(results, 1): output.append("┌─────────────────────────────────────────────────────────────") output.append(f"│ [{i}/{len(results)}] {result['filename']}") output.append(f"│ Path: {result['file']}") output.append(f"│ Status: {result['status']} | Total size: {result['total_chars']:,} chars") output.append("├─────────────────────────────────────────────────────────────") if result['status'] == 'success' and result['preview']: # Indent the preview content preview_lines = result['preview'].split('\n') for line in preview_lines[:MAX_PREVIEW_LINES]: output.append(f"│ {line}") if len(preview_lines) > MAX_PREVIEW_LINES: output.append("│ ... (preview truncated)") else: output.append("│ [No preview available]") output.append("└─────────────────────────────────────────────────────────────") output.append("") output.append("═══════════════════════════════════════════════════════════════") output.append(" NEXT STEPS:") output.append(" 1. Assess which documents are RELEVANT to the user's query") output.append(" 2. Use parse_file() for DEEP DIVE into relevant documents") output.append(" 3. Watch for cross-references to other docs (may need backtracking)") output.append("═══════════════════════════════════════════════════════════════") return "\n".join(output) ================================================ FILE: src/fs_explorer/index_config.py ================================================ """ Configuration helpers for local index storage. """ from __future__ import annotations import os from pathlib import Path DEFAULT_DB_PATH = "~/.fs_explorer/index.duckdb" ENV_DB_PATH = "FS_EXPLORER_DB_PATH" def resolve_db_path(override_path: str | None = None) -> str: """ Resolve the DuckDB path from CLI override, env var, or default. Precedence: 1) explicit override_path 2) FS_EXPLORER_DB_PATH 3) default path """ raw_path = override_path or os.getenv(ENV_DB_PATH) or DEFAULT_DB_PATH resolved = Path(raw_path).expanduser().resolve() resolved.parent.mkdir(parents=True, exist_ok=True) return str(resolved) ================================================ FILE: src/fs_explorer/indexing/__init__.py ================================================ """Indexing components for FsExplorer.""" from .chunker import SmartChunker, TextChunk from .pipeline import IndexingPipeline, IndexingResult from .schema import SchemaDiscovery __all__ = [ "SmartChunker", "TextChunk", "IndexingPipeline", "IndexingResult", "SchemaDiscovery", ] ================================================ FILE: src/fs_explorer/indexing/chunker.py ================================================ """ Chunking utilities for indexing document content. 
""" from __future__ import annotations from dataclasses import dataclass @dataclass(frozen=True) class TextChunk: """A content chunk with source offsets.""" text: str position: int start_char: int end_char: int class SmartChunker: """ Paragraph-aware chunker with overlap. This implementation is char-based to keep it deterministic and lightweight. """ def __init__(self, chunk_size: int = 1500, overlap: int = 150) -> None: if chunk_size <= 0: raise ValueError("chunk_size must be > 0") if overlap < 0: raise ValueError("overlap must be >= 0") if overlap >= chunk_size: raise ValueError("overlap must be smaller than chunk_size") self.chunk_size = chunk_size self.overlap = overlap def chunk_text(self, text: str) -> list[TextChunk]: """ Split text into chunks while preferring paragraph boundaries. """ normalized = text.strip() if not normalized: return [] chunks: list[TextChunk] = [] start = 0 position = 0 total = len(normalized) while start < total: tentative_end = min(start + self.chunk_size, total) end = tentative_end if tentative_end < total: boundary = normalized.rfind("\n\n", start + (self.chunk_size // 2), tentative_end) if boundary != -1: end = boundary + 2 chunk_text = normalized[start:end].strip() if chunk_text: chunks.append( TextChunk( text=chunk_text, position=position, start_char=start, end_char=end, ) ) position += 1 if end >= total: break start = max(0, end - self.overlap) return chunks ================================================ FILE: src/fs_explorer/indexing/metadata.py ================================================ """ Metadata extraction helpers for indexed documents. """ from __future__ import annotations import copy import json import os import re from collections import defaultdict from pathlib import Path from typing import Any _CURRENCY_RE = re.compile(r"\$\s?\d[\d,]*(?:\.\d+)?") _DATE_RE = re.compile( r"\b(?:\d{4}-\d{2}-\d{2}|" r"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|sept|oct|nov|dec)[a-z]*\s+\d{1,2},\s+\d{4})\b", flags=re.IGNORECASE, ) _DOC_TYPE_TOKEN_RE = re.compile(r"[a-z0-9]+") _DOC_TYPE_STOPWORDS: set[str] = { "the", "and", "for", "with", "from", "copy", "draft", "final", "version", "v1", "v2", "v3", "new", "old", "tmp", "temp", } _LANGEXTRACT_PROMPT_DESCRIPTION = ( "Extract key transaction metadata from legal and deal documents. " "Use extraction classes: organization, person, money, date, deal_term. " "Use exact spans from the source text and avoid paraphrasing." 
) _VALID_METADATA_FIELD_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$") _VALID_FIELD_TYPES: set[str] = {"string", "integer", "number", "boolean"} _VALID_RUNTIME_FIELDS: set[str] = {"enabled", "extraction_count", "entity_classes"} _FIELD_MODE_ALIASES: dict[str, str] = { "csv": "values", "list": "values", "joined": "values", "join": "values", "values": "values", "count": "count", "exists": "exists", "contains": "contains", "contains_any": "contains", } _DEFAULT_LANGEXTRACT_PROFILE: dict[str, Any] = { "name": "default_langextract", "description": "Default metadata extraction profile for legal and deal-style documents.", "prompt_description": _LANGEXTRACT_PROMPT_DESCRIPTION, "fields": [ { "name": "lx_enabled", "type": "boolean", "required": False, "description": "Whether langextract metadata extraction succeeded.", "source": "runtime", "runtime": "enabled", }, { "name": "lx_extraction_count", "type": "integer", "required": False, "description": "Number of langextract entities extracted from the document.", "source": "runtime", "runtime": "extraction_count", }, { "name": "lx_entity_classes", "type": "string", "required": False, "description": "Comma-separated extraction classes returned by langextract.", "source": "runtime", "runtime": "entity_classes", }, { "name": "lx_organizations", "type": "string", "required": False, "description": "Comma-separated organization names extracted by langextract.", "source": "entities", "source_classes": ["organization", "company", "party"], "mode": "values", }, { "name": "lx_people", "type": "string", "required": False, "description": "Comma-separated person names extracted by langextract.", "source": "entities", "source_classes": ["person", "individual", "executive"], "mode": "values", }, { "name": "lx_deal_terms", "type": "string", "required": False, "description": "Comma-separated deal terms extracted by langextract.", "source": "entities", "source_classes": ["deal_term", "term", "provision"], "mode": "values", }, { "name": "lx_money_mentions", "type": "integer", "required": False, "description": "Count of monetary amount entities from langextract.", "source": "entities", "source_classes": ["money", "amount", "currency"], "mode": "count", }, { "name": "lx_date_mentions", "type": "integer", "required": False, "description": "Count of date entities from langextract.", "source": "entities", "source_classes": ["date"], "mode": "count", }, { "name": "lx_has_earnout", "type": "boolean", "required": False, "description": "Whether extracted deal terms indicate an earnout.", "source": "entities", "source_classes": ["deal_term", "term", "provision"], "mode": "contains", "contains_any": ["earnout"], }, { "name": "lx_has_escrow", "type": "boolean", "required": False, "description": "Whether extracted deal terms indicate escrow.", "source": "entities", "source_classes": ["deal_term", "term", "provision"], "mode": "contains", "contains_any": ["escrow"], }, ], } _AUTO_PROFILE_PROMPT_TEMPLATE = ( "You are a metadata schema designer. 
Analyze the document samples below and generate " "a langextract metadata extraction profile tailored to this corpus.\n\n" "Return a JSON object with these keys:\n" '- "name": a short descriptive profile name (string)\n' '- "description": one-sentence description of the profile (string)\n' '- "prompt_description": instruction text for the extraction model (string)\n' '- "fields": array of field definitions\n\n' "Each field object must have:\n" '- "name": valid identifier starting with "lx_" (letters, digits, underscores)\n' '- "type": one of "string", "integer", "number", "boolean"\n' '- "description": what this field captures\n' '- "source": "entities"\n' '- "source_classes": array of entity class names to aggregate (e.g. ["organization", "company"])\n' '- "mode": one of "values" (comma-joined text), "count" (integer count), "exists" (boolean), ' '"contains" (boolean, requires "contains_any")\n' '- "contains_any": (only when mode is "contains") array of lowercase terms to match\n\n' "Valid entity source classes include (but are not limited to): organization, company, party, " "person, individual, executive, money, amount, currency, date, deal_term, term, provision, " "location, product, technology, regulation, clause, obligation.\n\n" "### Example profile for legal/M&A documents\n" "```json\n" '{"name": "legal_ma", "description": "Metadata extraction for legal and M&A deal documents.", ' '"prompt_description": "Extract key transaction metadata from legal and deal documents.", ' '"fields": [' '{"name": "lx_organizations", "type": "string", "description": "Organization names.", ' '"source": "entities", "source_classes": ["organization", "company", "party"], "mode": "values"}, ' '{"name": "lx_money_mentions", "type": "integer", "description": "Count of monetary amounts.", ' '"source": "entities", "source_classes": ["money", "amount"], "mode": "count"}, ' '{"name": "lx_has_escrow", "type": "boolean", "description": "Whether escrow terms are present.", ' '"source": "entities", "source_classes": ["deal_term", "provision"], "mode": "contains", ' '"contains_any": ["escrow"]}' "]}\n" "```\n\n" "### Example profile for technical/research documents\n" "```json\n" '{"name": "tech_research", "description": "Metadata extraction for technical and research documents.", ' '"prompt_description": "Extract key entities from technical and research documents.", ' '"fields": [' '{"name": "lx_technologies", "type": "string", "description": "Technology names.", ' '"source": "entities", "source_classes": ["technology", "product"], "mode": "values"}, ' '{"name": "lx_people", "type": "string", "description": "Person names.", ' '"source": "entities", "source_classes": ["person", "individual"], "mode": "values"}, ' '{"name": "lx_org_count", "type": "integer", "description": "Number of organizations mentioned.", ' '"source": "entities", "source_classes": ["organization", "company"], "mode": "count"}' "]}\n" "```\n\n" "### Document samples from the corpus\n\n" "SAMPLES_PLACEHOLDER\n\n" "Generate a profile with 4-8 entity fields (do NOT include runtime fields). " "Return ONLY the JSON object, no markdown fencing." ) def _get_genai_client(api_key: str) -> Any: """Instantiate a Google GenAI client. Separated for test patching.""" from google.genai import Client as _GenAIClient return _GenAIClient(api_key=api_key) def auto_discover_profile( folder: str, *, sample_count: int = 3, model_id: str | None = None, ) -> dict[str, Any]: """Use an LLM to generate a langextract profile tailored to the corpus. 
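When generation succeeds, the returned dict has the same shape accepted by normalize_langextract_profile, for example (illustrative; requires GOOGLE_API_KEY and a folder of supported documents):

>>> profile = auto_discover_profile("./documents", sample_count=3)
>>> [field["name"] for field in profile["fields"]]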
Falls back to the default hardcoded profile on any failure. """ from .schema import _iter_supported_files files = _iter_supported_files(folder) if not files: return default_langextract_profile() # Sample files evenly n = min(sample_count, len(files)) step = max(1, len(files) // n) sampled = [files[i * step] for i in range(n)] # Parse and truncate from ..fs import parse_file snippets: list[str] = [] for file_path in sampled: try: text = parse_file(file_path) snippets.append( f"--- {Path(file_path).name} ---\n{text[:2000]}" ) except Exception: continue if not snippets: return default_langextract_profile() api_key = os.getenv("GOOGLE_API_KEY") if not api_key: return default_langextract_profile() effective_model = model_id or os.getenv( "FS_EXPLORER_PROFILE_MODEL", "gemini-2.0-flash" ) try: client = _get_genai_client(api_key=api_key) prompt = _AUTO_PROFILE_PROMPT_TEMPLATE.replace( "SAMPLES_PLACEHOLDER", "\n\n".join(snippets) ) response = client.models.generate_content( model=effective_model, contents=prompt, ) raw_text = (response.text or "").strip() # Strip markdown fencing if present if raw_text.startswith("```"): raw_text = re.sub(r"^```[a-z]*\n?", "", raw_text) raw_text = re.sub(r"\n?```$", "", raw_text).strip() profile = json.loads(raw_text) # Add runtime fields that are always present runtime_fields = [ f for f in _DEFAULT_LANGEXTRACT_PROFILE["fields"] if f.get("source") == "runtime" ] existing_names = { str(f.get("name")) for f in profile.get("fields", []) if isinstance(f, dict) } for rf in runtime_fields: if rf["name"] not in existing_names: profile.setdefault("fields", []).insert(0, copy.deepcopy(rf)) return normalize_langextract_profile(profile) except Exception: return default_langextract_profile() def infer_document_type(file_path: str) -> str: """Infer a generic document type from filename tokens.""" stem = Path(file_path).stem.lower() tokens = [token for token in _DOC_TYPE_TOKEN_RE.findall(stem) if token] filtered = [ token for token in tokens if not token.isdigit() and len(token) > 2 and token not in _DOC_TYPE_STOPWORDS ] if filtered: return filtered[-1] if tokens: return tokens[-1] return "document" def default_langextract_profile() -> dict[str, Any]: """Return a mutable copy of the built-in metadata profile.""" return copy.deepcopy(_DEFAULT_LANGEXTRACT_PROFILE) def normalize_langextract_profile(profile: dict[str, Any] | None) -> dict[str, Any]: """ Validate and normalize user-provided langextract profile configuration. Expected shape: - prompt_description: str (optional) - max_chars: int (optional) - fields: list[{ name: str, type: string|integer|number|boolean, description: str (optional), required: bool (optional), source: runtime|entities (default entities), runtime: enabled|extraction_count|entity_classes (runtime source only), source_class: str (entities source), source_classes: list[str] (entities source), mode: values|count|exists|contains (entities source), contains_any: list[str] (contains mode), }] """ raw = default_langextract_profile() if profile is None else copy.deepcopy(profile) if not isinstance(raw, dict): raise ValueError("Metadata profile must be a JSON object.") prompt = raw.get("prompt_description") if prompt is None: prompt_description = _LANGEXTRACT_PROMPT_DESCRIPTION elif isinstance(prompt, str) and prompt.strip(): prompt_description = prompt.strip() else: raise ValueError( "Metadata profile field 'prompt_description' must be a non-empty string." 
) max_chars: int | None = None if "max_chars" in raw: max_chars = _safe_positive_int( raw.get("max_chars"), minimum=500, field_name="max_chars", ) raw_fields = raw.get("fields") if not isinstance(raw_fields, list) or not raw_fields: raise ValueError("Metadata profile must include a non-empty 'fields' array.") normalized_fields: list[dict[str, Any]] = [] seen_names: set[str] = set() for idx, raw_field in enumerate(raw_fields): if not isinstance(raw_field, dict): raise ValueError(f"Metadata field at index {idx} must be an object.") name_obj = raw_field.get("name") if not isinstance(name_obj, str) or not name_obj.strip(): raise ValueError( f"Metadata field at index {idx} is missing a valid 'name'." ) name = name_obj.strip() if not _VALID_METADATA_FIELD_NAME_RE.match(name): raise ValueError( f"Invalid metadata field name '{name}'. " "Use letters, numbers, and underscores." ) if name in seen_names: raise ValueError(f"Duplicate metadata field name '{name}'.") seen_names.add(name) field_type = str(raw_field.get("type", "string")).strip().lower() if field_type not in _VALID_FIELD_TYPES: allowed_types = ", ".join(sorted(_VALID_FIELD_TYPES)) raise ValueError( f"Metadata field '{name}' has invalid type '{field_type}'. " f"Allowed types: {allowed_types}." ) description_obj = raw_field.get("description") description = ( description_obj.strip() if isinstance(description_obj, str) and description_obj.strip() else f"Metadata field '{name}'." ) required = bool(raw_field.get("required", False)) source = str(raw_field.get("source", "entities")).strip().lower() if source not in {"runtime", "entities"}: raise ValueError( f"Metadata field '{name}' has invalid source '{source}'. " "Use 'runtime' or 'entities'." ) normalized: dict[str, Any] = { "name": name, "type": field_type, "required": required, "description": description, "source": source, } if source == "runtime": runtime = str(raw_field.get("runtime", "")).strip().lower() if runtime not in _VALID_RUNTIME_FIELDS: allowed_runtime = ", ".join(sorted(_VALID_RUNTIME_FIELDS)) raise ValueError( f"Metadata field '{name}' has invalid runtime source '{runtime}'. " f"Allowed runtime values: {allowed_runtime}." ) normalized["runtime"] = runtime normalized["mode"] = "runtime" normalized["source_classes"] = [] normalized["contains_any"] = [] normalized_fields.append(normalized) continue source_classes = _normalize_source_classes(raw_field) if not source_classes: raise ValueError( f"Metadata field '{name}' requires 'source_class' or " "'source_classes' for entity extraction." 
) requested_mode = raw_field.get("mode") mode = _normalize_field_mode(requested_mode, field_type=field_type) contains_any = _normalize_contains_any( raw_field.get("contains_any"), mode=mode, field_name=name, ) normalized["source_classes"] = source_classes normalized["mode"] = mode normalized["contains_any"] = contains_any normalized_fields.append(normalized) normalized_profile: dict[str, Any] = { "name": str(raw.get("name", "langextract_profile")), "description": str( raw.get("description", "User-defined langextract metadata profile.") ), "prompt_description": prompt_description, "fields": normalized_fields, } if max_chars is not None: normalized_profile["max_chars"] = max_chars return normalized_profile def langextract_schema_fields( profile: dict[str, Any] | None = None, ) -> list[dict[str, Any]]: """Return schema field definitions for langextract metadata.""" normalized = normalize_langextract_profile(profile) fields: list[dict[str, Any]] = [] for field in normalized["fields"]: fields.append( { "name": field["name"], "type": field["type"], "required": bool(field.get("required", False)), "description": str(field.get("description", "")), } ) return fields def langextract_field_names(profile: dict[str, Any] | None = None) -> set[str]: """Return field names used by langextract metadata extraction.""" return {field["name"] for field in langextract_schema_fields(profile)} def ensure_langextract_schema_fields( schema_def: dict[str, Any], profile: dict[str, Any] | None = None, ) -> tuple[dict[str, Any], bool]: """Ensure schema contains langextract field definitions.""" normalized_profile = normalize_langextract_profile( profile if profile is not None else _schema_profile_if_present(schema_def) ) required_fields = langextract_schema_fields(normalized_profile) fields_obj = schema_def.get("fields") fields: list[dict[str, Any]] if isinstance(fields_obj, list): fields = [dict(field) for field in fields_obj if isinstance(field, dict)] else: fields = [] existing_names = { str(field.get("name")) for field in fields if isinstance(field.get("name"), str) } updated = list(fields) changed = False for field in required_fields: if field["name"] in existing_names: continue updated.append(dict(field)) changed = True merged = dict(schema_def) if changed: merged["fields"] = updated existing_profile = _schema_profile_if_present(schema_def) if profile is not None or existing_profile is not None: if existing_profile != normalized_profile: merged["metadata_profile"] = normalized_profile changed = True elif "metadata_profile" in schema_def: merged["metadata_profile"] = existing_profile return merged, changed def extract_metadata( *, file_path: str, root_path: str, content: str, schema_def: dict[str, Any] | None = None, with_langextract: bool = False, langextract_model_id: str | None = None, langextract_profile: dict[str, Any] | None = None, ) -> dict[str, Any]: """ Build metadata used for filtering and schema-aware indexing. If a schema is provided with a `fields` list, only those keys are emitted. 
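Example (illustrative; the file must exist on disk because size and mtime come from os.stat, and parsed_markdown stands in for the parsed document text):

>>> meta = extract_metadata(
...     file_path="/corpus/01_master_agreement.pdf",
...     root_path="/corpus",
...     content=parsed_markdown,
... )
>>> meta["document_type"], meta["mentions_currency"]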
""" absolute_path = str(Path(file_path).resolve()) relative_path = os.path.relpath(absolute_path, str(Path(root_path).resolve())) extension = Path(file_path).suffix.lower() stat = os.stat(file_path) metadata: dict[str, Any] = { "filename": Path(file_path).name, "relative_path": relative_path, "extension": extension, "document_type": infer_document_type(file_path), "file_size_bytes": int(stat.st_size), "file_mtime": float(stat.st_mtime), "mentions_currency": bool(_CURRENCY_RE.search(content)), "mentions_dates": bool(_DATE_RE.search(content)), } if with_langextract: resolved_profile = _resolve_langextract_profile( schema_def=schema_def, profile_override=langextract_profile, ) metadata.update( _extract_langextract_metadata( content=content, model_id=langextract_model_id, profile=resolved_profile, ) ) if not schema_def: return metadata fields = schema_def.get("fields") if not isinstance(fields, list): return metadata allowed: set[str] = set() for field in fields: if isinstance(field, dict): name = field.get("name") if isinstance(name, str): allowed.add(name) if not allowed: return metadata return {k: v for k, v in metadata.items() if k in allowed} def _extract_langextract_metadata( *, content: str, model_id: str | None = None, profile: dict[str, Any] | None = None, ) -> dict[str, Any]: normalized_profile = normalize_langextract_profile(profile) defaults = _profile_defaults(normalized_profile) api_key = ( os.getenv("LANGEXTRACT_API_KEY") or os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY") ) if not api_key: return defaults try: import langextract as lx # type: ignore[import-not-found] except Exception: return defaults profile_max_chars_obj = normalized_profile.get("max_chars") profile_max_chars = ( _safe_positive_int( profile_max_chars_obj, minimum=500, field_name="max_chars", ) if profile_max_chars_obj is not None else None ) max_chars = profile_max_chars or _safe_int_env( "FS_EXPLORER_LANGEXTRACT_MAX_CHARS", default=6000, minimum=500, ) snippet = content[:max_chars] if not snippet.strip(): return defaults effective_model_id = model_id or os.getenv( "FS_EXPLORER_LANGEXTRACT_MODEL", "gemini-3-flash-preview", ) try: result = lx.extract( text_or_documents=snippet, prompt_description=str(normalized_profile["prompt_description"]), examples=_langextract_examples(lx), model_id=effective_model_id, api_key=api_key, max_char_buffer=min(1200, max_chars), show_progress=False, prompt_validation_level=lx.prompt_validation.PromptValidationLevel.OFF, ) except Exception: return defaults extractions = list(result.extractions or []) return _aggregate_profile_metadata( normalized_profile=normalized_profile, extractions=extractions, enabled=True, ) def _schema_profile_if_present(schema_def: dict[str, Any] | None) -> dict[str, Any] | None: if not schema_def: return None metadata_profile = schema_def.get("metadata_profile") if isinstance(metadata_profile, dict): return metadata_profile return None def _resolve_langextract_profile( *, schema_def: dict[str, Any] | None, profile_override: dict[str, Any] | None, ) -> dict[str, Any] | None: if profile_override is not None: return profile_override return _schema_profile_if_present(schema_def) def _normalize_source_classes(raw_field: dict[str, Any]) -> list[str]: classes: list[str] = [] single = raw_field.get("source_class") if isinstance(single, str) and single.strip(): classes.append(single.strip().lower()) multi = raw_field.get("source_classes") if isinstance(multi, list): for item in multi: if isinstance(item, str) and item.strip(): 
classes.append(item.strip().lower()) seen: set[str] = set() deduped: list[str] = [] for class_name in classes: if class_name in seen: continue seen.add(class_name) deduped.append(class_name) return deduped def _normalize_field_mode(mode_obj: Any, *, field_type: str) -> str: if isinstance(mode_obj, str) and mode_obj.strip(): requested = mode_obj.strip().lower() normalized = _FIELD_MODE_ALIASES.get(requested) if normalized is None: allowed = ", ".join(sorted(set(_FIELD_MODE_ALIASES.values()))) raise ValueError( f"Unsupported metadata field mode '{requested}'. " f"Allowed modes: {allowed}." ) return normalized if field_type == "boolean": return "exists" if field_type in {"integer", "number"}: return "count" return "values" def _normalize_contains_any( contains_obj: Any, *, mode: str, field_name: str, ) -> list[str]: if mode != "contains": return [] if not isinstance(contains_obj, list) or not contains_obj: raise ValueError( f"Metadata field '{field_name}' with mode 'contains' " "requires 'contains_any' list." ) terms: list[str] = [] for term in contains_obj: if isinstance(term, str) and term.strip(): terms.append(term.strip().lower()) if not terms: raise ValueError( f"Metadata field '{field_name}' with mode 'contains' " "has no valid 'contains_any' terms." ) return terms def _profile_defaults(profile: dict[str, Any]) -> dict[str, Any]: defaults: dict[str, Any] = {} for field in profile["fields"]: defaults[field["name"]] = _default_field_value(field) return defaults def _default_field_value(field: dict[str, Any]) -> Any: source = str(field.get("source", "entities")) runtime = str(field.get("runtime", "")) if source == "runtime": if runtime == "enabled": return False if runtime == "extraction_count": return 0 if runtime == "entity_classes": return "" field_type = str(field.get("type", "string")) if field_type == "boolean": return False if field_type == "integer": return 0 if field_type == "number": return 0.0 return "" def _aggregate_profile_metadata( *, normalized_profile: dict[str, Any], extractions: list[Any], enabled: bool, ) -> dict[str, Any]: classes: set[str] = set() by_class: dict[str, list[str]] = defaultdict(list) for extraction in extractions: extraction_class = str(getattr(extraction, "extraction_class", "")).strip().lower() extraction_text = str(getattr(extraction, "extraction_text", "")).strip() if not extraction_class: continue classes.add(extraction_class) if extraction_text: by_class[extraction_class].append(extraction_text) metadata: dict[str, Any] = {} for field in normalized_profile["fields"]: name = str(field["name"]) source = str(field["source"]) if source == "runtime": value = _runtime_field_value( field=field, enabled=enabled, extraction_count=len(extractions), classes=classes, ) metadata[name] = _coerce_field_value( value=value, field_type=str(field["type"]), ) continue matched_values: list[str] = [] for extraction_class in field["source_classes"]: matched_values.extend(by_class.get(extraction_class, [])) value = _entity_field_value(field=field, matched_values=matched_values) metadata[name] = _coerce_field_value(value=value, field_type=str(field["type"])) defaults = _profile_defaults(normalized_profile) for key, default_value in defaults.items(): metadata.setdefault(key, default_value) return metadata def _runtime_field_value( *, field: dict[str, Any], enabled: bool, extraction_count: int, classes: set[str], ) -> Any: runtime = str(field.get("runtime", "")) if runtime == "enabled": return enabled if runtime == "extraction_count": return extraction_count if runtime == 
"entity_classes": return ", ".join(sorted(classes)) return _default_field_value(field) def _entity_field_value(*, field: dict[str, Any], matched_values: list[str]) -> Any: mode = str(field.get("mode", "values")) if mode == "count": return len(matched_values) if mode == "exists": return bool(matched_values) if mode == "contains": terms = [str(term).lower() for term in field.get("contains_any", [])] lowered_values = [value.lower() for value in matched_values] return any(term in value for term in terms for value in lowered_values) deduped = _dedupe_preserve_order(matched_values) return ", ".join(deduped) def _coerce_field_value(*, value: Any, field_type: str) -> Any: if field_type == "boolean": return bool(value) if field_type == "integer": if isinstance(value, bool): return int(value) try: return int(value) except (TypeError, ValueError): return 0 if field_type == "number": if isinstance(value, bool): return float(int(value)) try: return float(value) except (TypeError, ValueError): return 0.0 if value is None: return "" return str(value) def _langextract_examples(lx: Any) -> list[Any]: return [ lx.data.ExampleData( text=( "TechCorp Industries will pay $45,000,000 in cash consideration, " "with a $1,500,000 escrow reserve and a $5,000,000 earnout to " "acquire StartupXYZ LLC. CTO Dr. Sarah Chen signed on January 15, 2025." ), extractions=[ lx.data.Extraction( extraction_class="organization", extraction_text="TechCorp Industries", ), lx.data.Extraction( extraction_class="organization", extraction_text="StartupXYZ LLC", ), lx.data.Extraction( extraction_class="money", extraction_text="$45,000,000", ), lx.data.Extraction( extraction_class="money", extraction_text="$1,500,000", ), lx.data.Extraction( extraction_class="money", extraction_text="$5,000,000", ), lx.data.Extraction( extraction_class="deal_term", extraction_text="cash consideration", ), lx.data.Extraction( extraction_class="deal_term", extraction_text="escrow reserve", ), lx.data.Extraction( extraction_class="deal_term", extraction_text="earnout", ), lx.data.Extraction( extraction_class="person", extraction_text="Dr. Sarah Chen", ), lx.data.Extraction( extraction_class="date", extraction_text="January 15, 2025", ), ], ) ] def _dedupe_preserve_order(values: list[str], *, max_items: int = 16) -> list[str]: seen: set[str] = set() deduped: list[str] = [] for value in values: key = value.strip() if not key: continue lower = key.lower() if lower in seen: continue seen.add(lower) deduped.append(key) if len(deduped) >= max_items: break return deduped def _safe_positive_int(value: Any, *, minimum: int, field_name: str) -> int: try: integer = int(value) except (TypeError, ValueError) as exc: raise ValueError( f"Metadata profile field '{field_name}' must be an integer." ) from exc if integer < minimum: raise ValueError( f"Metadata profile field '{field_name}' must be >= {minimum}." ) return integer def _safe_int_env(name: str, *, default: int, minimum: int) -> int: raw = os.getenv(name) if raw is None: return default try: value = int(raw) except ValueError: return default return value if value >= minimum else minimum ================================================ FILE: src/fs_explorer/indexing/pipeline.py ================================================ """ Indexing pipeline orchestration. 
""" from __future__ import annotations import hashlib import json import os from concurrent.futures import ThreadPoolExecutor from dataclasses import dataclass from pathlib import Path from typing import Any from .chunker import SmartChunker from .metadata import ( ensure_langextract_schema_fields, extract_metadata, langextract_field_names, ) from .schema import SchemaDiscovery from ..embeddings import EmbeddingProvider from ..fs import SUPPORTED_EXTENSIONS, parse_file from ..storage import ChunkRecord, DocumentRecord, DuckDBStorage, StorageBackend _PARSE_ERROR_PREFIXES: tuple[str, ...] = ( "Error parsing ", "Unsupported file extension", "No such file:", ) @dataclass(frozen=True) class IndexingResult: """Summary output for an indexing run.""" corpus_id: str indexed_files: int skipped_files: int deleted_files: int chunks_written: int active_documents: int schema_used: str | None embeddings_written: int = 0 class IndexingPipeline: """Build and update corpus indexes from filesystem documents.""" def __init__( self, storage: StorageBackend, chunker: SmartChunker | None = None, embedding_provider: EmbeddingProvider | None = None, max_workers: int = 4, ) -> None: self.storage = storage self.chunker = chunker or SmartChunker() self.embedding_provider = embedding_provider self._max_workers = max_workers def index_folder( self, folder: str, *, discover_schema: bool = False, schema_name: str | None = None, with_metadata: bool = False, metadata_profile: dict[str, Any] | None = None, ) -> IndexingResult: root = str(Path(folder).resolve()) if not os.path.exists(root) or not os.path.isdir(root): raise ValueError(f"No such directory: {root}") effective_with_metadata = with_metadata or metadata_profile is not None corpus_id = self.storage.get_or_create_corpus(root) schema_def, selected_schema_name = self._resolve_schema( corpus_id=corpus_id, root=root, discover_schema=discover_schema, schema_name=schema_name, with_metadata=effective_with_metadata, metadata_profile=metadata_profile, ) effective_profile = metadata_profile or self._schema_metadata_profile( schema_def ) # Pass 1: Parse all documents parsed_docs: list[tuple[str, str, str]] = [] # (file_path, relative_path, content) skipped_files = 0 active_paths: set[str] = set() for file_path in self._iter_supported_files(root): relative_path = os.path.relpath(file_path, root) active_paths.add(relative_path) content = parse_file(file_path) if self._is_parse_error(content): skipped_files += 1 continue parsed_docs.append((file_path, relative_path, content)) # Parallel metadata extraction across documents metadata_map = self._extract_metadata_batch( parsed_docs=parsed_docs, root_path=root, schema_def=schema_def, with_langextract=effective_with_metadata, langextract_profile=effective_profile, ) # Pass 2: Chunk + upsert (sequential, DB writes) indexed_files = 0 chunks_written = 0 all_chunk_records: list[ChunkRecord] = [] for file_path, relative_path, content in parsed_docs: chunks = self.chunker.chunk_text(content) metadata = metadata_map[relative_path] metadata_json = json.dumps(metadata, sort_keys=True) stat = os.stat(file_path) doc_id = DuckDBStorage.make_document_id(corpus_id, relative_path) doc_record = DocumentRecord( id=doc_id, corpus_id=corpus_id, relative_path=relative_path, absolute_path=str(Path(file_path).resolve()), content=content, metadata_json=metadata_json, file_mtime=float(stat.st_mtime), file_size=int(stat.st_size), content_sha256=self._sha256(content), ) chunk_records: list[ChunkRecord] = [] for chunk in chunks: chunk_records.append( 
ChunkRecord( id=DuckDBStorage.make_chunk_id( doc_id, chunk.position, chunk.start_char, chunk.end_char, ), doc_id=doc_id, text=chunk.text, position=chunk.position, start_char=chunk.start_char, end_char=chunk.end_char, ) ) self.storage.upsert_document(doc_record, chunk_records) all_chunk_records.extend(chunk_records) indexed_files += 1 chunks_written += len(chunk_records) deleted_files = self.storage.mark_deleted_missing_documents( corpus_id=corpus_id, active_relative_paths=active_paths, ) active_documents = len( self.storage.list_documents(corpus_id=corpus_id, include_deleted=False) ) embeddings_written = self._generate_and_store_embeddings( corpus_id=corpus_id, all_chunk_records=all_chunk_records, ) return IndexingResult( corpus_id=corpus_id, indexed_files=indexed_files, skipped_files=skipped_files, deleted_files=deleted_files, chunks_written=chunks_written, active_documents=active_documents, schema_used=selected_schema_name, embeddings_written=embeddings_written, ) def _extract_metadata_batch( self, *, parsed_docs: list[tuple[str, str, str]], root_path: str, schema_def: dict[str, Any] | None, with_langextract: bool, langextract_profile: dict[str, Any] | None, ) -> dict[str, dict[str, Any]]: """Extract metadata for all documents in parallel using a thread pool.""" def _extract_one(item: tuple[str, str, str]) -> tuple[str, dict[str, Any]]: file_path, relative_path, content = item metadata = extract_metadata( file_path=file_path, root_path=root_path, content=content, schema_def=schema_def, with_langextract=with_langextract, langextract_profile=langextract_profile, ) return relative_path, metadata result: dict[str, dict[str, Any]] = {} if not parsed_docs: return result with ThreadPoolExecutor(max_workers=self._max_workers) as executor: for relative_path, metadata in executor.map(_extract_one, parsed_docs): result[relative_path] = metadata return result def _resolve_schema( self, *, corpus_id: str, root: str, discover_schema: bool, schema_name: str | None, with_metadata: bool, metadata_profile: dict[str, Any] | None, ) -> tuple[dict[str, Any] | None, str | None]: if discover_schema: schema_def = SchemaDiscovery().discover_from_folder( root, with_langextract=with_metadata, metadata_profile=metadata_profile, ) discovered_name = str(schema_def.get("name", f"auto_{Path(root).name}")) self.storage.save_schema( corpus_id=corpus_id, name=discovered_name, schema_def=schema_def, is_active=True, ) return schema_def, discovered_name if schema_name: schema = self.storage.get_schema_by_name( corpus_id=corpus_id, name=schema_name ) if schema is None: raise ValueError(f"Schema '{schema_name}' not found for corpus {root}") if with_metadata: return self._augment_schema_for_langextract( corpus_id=corpus_id, schema_name=schema.name, schema_def=schema.schema_def, metadata_profile=metadata_profile, ) return schema.schema_def, schema.name active = self.storage.get_active_schema(corpus_id=corpus_id) if active is None: if with_metadata: schema_def = SchemaDiscovery().discover_from_folder( root, with_langextract=True, metadata_profile=metadata_profile, ) discovered_name = str(schema_def.get("name", f"auto_{Path(root).name}")) self.storage.save_schema( corpus_id=corpus_id, name=discovered_name, schema_def=schema_def, is_active=True, ) return schema_def, discovered_name return None, None if with_metadata: return self._augment_schema_for_langextract( corpus_id=corpus_id, schema_name=active.name, schema_def=active.schema_def, metadata_profile=metadata_profile, ) return active.schema_def, active.name def 
_augment_schema_for_langextract( self, *, corpus_id: str, schema_name: str, schema_def: dict[str, Any], metadata_profile: dict[str, Any] | None, ) -> tuple[dict[str, Any], str]: effective_profile = metadata_profile or self._schema_metadata_profile( schema_def ) existing_field_names = self._schema_field_names(schema_def) required = langextract_field_names(effective_profile) if required.issubset(existing_field_names): if metadata_profile is None and ( effective_profile is None or self._schema_metadata_profile(schema_def) is not None ): return schema_def, schema_name augmented_with_profile, changed = ensure_langextract_schema_fields( schema_def, effective_profile, ) if not changed: return schema_def, schema_name self.storage.save_schema( corpus_id=corpus_id, name=schema_name, schema_def=augmented_with_profile, is_active=True, ) return augmented_with_profile, schema_name augmented_schema, _ = ensure_langextract_schema_fields( schema_def, effective_profile, ) self.storage.save_schema( corpus_id=corpus_id, name=schema_name, schema_def=augmented_schema, is_active=True, ) return augmented_schema, schema_name @staticmethod def _schema_metadata_profile( schema_def: dict[str, Any] | None, ) -> dict[str, Any] | None: if not schema_def: return None profile = schema_def.get("metadata_profile") if isinstance(profile, dict): return profile return None @staticmethod def _schema_field_names(schema_def: dict[str, Any]) -> set[str]: fields = schema_def.get("fields") if not isinstance(fields, list): return set() names: set[str] = set() for field in fields: if isinstance(field, dict): name = field.get("name") if isinstance(name, str): names.add(name) return names def _generate_and_store_embeddings( self, *, corpus_id: str, all_chunk_records: list[ChunkRecord], ) -> int: """Embed chunk texts and store in the database. Returns count written.""" if self.embedding_provider is None or not all_chunk_records: return 0 texts = [cr.text for cr in all_chunk_records] embeddings = self.embedding_provider.embed_texts(texts) pairs: list[tuple[str, list[float]]] = [ (cr.id, emb) for cr, emb in zip(all_chunk_records, embeddings) ] written = self.storage.store_chunk_embeddings( corpus_id=corpus_id, chunk_embeddings=pairs, ) if isinstance(self.storage, DuckDBStorage): self.storage.create_hnsw_index(corpus_id=corpus_id) return written @staticmethod def _iter_supported_files(root: str) -> list[str]: files: list[str] = [] for current_root, _, filenames in os.walk(root): for filename in filenames: ext = Path(filename).suffix.lower() if ext in SUPPORTED_EXTENSIONS: files.append(str(Path(current_root) / filename)) files.sort() return files @staticmethod def _sha256(content: str) -> str: return hashlib.sha256(content.encode("utf-8")).hexdigest() @staticmethod def _is_parse_error(content: str) -> bool: return content.startswith(_PARSE_ERROR_PREFIXES) ================================================ FILE: src/fs_explorer/indexing/schema.py ================================================ """ Schema discovery utilities. 
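Illustrative usage sketch (the folder path is a placeholder):

    from fs_explorer.indexing import SchemaDiscovery

    schema = SchemaDiscovery().discover_from_folder("./data/test_acquisition")
    print(schema["name"], [field["name"] for field in schema["fields"]])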
""" from __future__ import annotations import os from pathlib import Path from typing import Any from .metadata import ( auto_discover_profile, infer_document_type, langextract_schema_fields, normalize_langextract_profile, ) from ..fs import SUPPORTED_EXTENSIONS def _iter_supported_files(folder: str) -> list[str]: root = Path(folder).resolve() files: list[str] = [] for current_root, _, filenames in os.walk(root): for filename in filenames: ext = Path(filename).suffix.lower() if ext in SUPPORTED_EXTENSIONS: files.append(str(Path(current_root) / filename)) files.sort() return files class SchemaDiscovery: """Auto-discover a lightweight metadata schema from a corpus.""" def discover_from_folder( self, folder: str, *, with_langextract: bool = False, metadata_profile: dict[str, Any] | None = None, ) -> dict[str, Any]: files = _iter_supported_files(folder) document_types = sorted({infer_document_type(path) for path in files}) corpus_name = Path(folder).resolve().name or "corpus" fields: list[dict[str, Any]] = [ { "name": "filename", "type": "string", "required": True, "description": "Document filename.", }, { "name": "relative_path", "type": "string", "required": True, "description": "Path relative to corpus root.", }, { "name": "extension", "type": "string", "required": True, "description": "File extension.", }, { "name": "document_type", "type": "string", "required": True, "description": "Inferred document category.", "enum": document_types or ["other"], }, { "name": "file_size_bytes", "type": "integer", "required": True, "description": "File size in bytes.", }, { "name": "file_mtime", "type": "number", "required": True, "description": "File modification timestamp (epoch seconds).", }, { "name": "mentions_currency", "type": "boolean", "required": True, "description": "Whether text appears to contain currency amounts.", }, { "name": "mentions_dates", "type": "boolean", "required": True, "description": "Whether text appears to contain date patterns.", }, ] schema: dict[str, Any] = { "name": f"auto_{corpus_name}", "description": "Auto-discovered schema for document-level metadata filtering.", "fields": fields, } if with_langextract: if metadata_profile is None: effective_profile = auto_discover_profile(folder) else: effective_profile = normalize_langextract_profile(metadata_profile) fields.extend(langextract_schema_fields(effective_profile)) schema["metadata_profile"] = effective_profile return schema ================================================ FILE: src/fs_explorer/main.py ================================================ """ CLI entry point for the FsExplorer agent. Provides a command-line interface for running filesystem exploration tasks with rich, detailed output showing each step of the workflow. 
""" import json import asyncio import os from datetime import datetime from pathlib import Path from typer import Typer, Option, Argument, Context, BadParameter, Exit from typing import Annotated, Any from rich.markdown import Markdown from rich.panel import Panel from rich.console import Console from rich.table import Table from rich.text import Text from .embeddings import EmbeddingProvider from .index_config import resolve_db_path from .indexing import IndexingPipeline, SchemaDiscovery from .storage import DuckDBStorage from .agent import set_index_context, clear_index_context from .workflow import ( workflow, InputEvent, ToolCallEvent, GoDeeperEvent, AskHumanEvent, HumanAnswerEvent, get_agent, reset_agent, ) from .exploration_trace import ExplorationTrace, extract_cited_sources app = Typer() schema_app = Typer(help="Manage metadata schemas for indexed corpora.") app.add_typer(schema_app, name="schema") # Tool icons for visual distinction TOOL_ICONS = { "scan_folder": "📂", "preview_file": "👁️", "parse_file": "📖", "read": "📄", "grep": "🔍", "glob": "🔎", "semantic_search": "🧠", "get_document": "📚", "list_indexed_documents": "🗂️", } # Phase detection based on tool usage PHASE_DESCRIPTIONS = { "scan_folder": ("Phase 1", "Parallel Document Scan", "cyan"), "preview_file": ("Phase 1/2", "Quick Preview", "cyan"), "parse_file": ("Phase 2", "Deep Dive", "green"), "read": ("Reading", "Text File", "blue"), "grep": ("Searching", "Pattern Match", "yellow"), "glob": ("Finding", "File Search", "yellow"), "semantic_search": ("Indexed", "Semantic Retrieval", "magenta"), "get_document": ("Indexed", "Document Fetch", "green"), "list_indexed_documents": ("Indexed", "Corpus Listing", "blue"), } def _load_metadata_profile(path_value: str | None) -> dict[str, Any] | None: if path_value is None: return None resolved = Path(path_value).expanduser().resolve() if not resolved.exists() or not resolved.is_file(): raise BadParameter(f"Metadata profile file not found: {resolved}") try: payload = json.loads(resolved.read_text()) except json.JSONDecodeError as exc: raise BadParameter( f"Metadata profile file is not valid JSON: {resolved}" ) from exc if not isinstance(payload, dict): raise BadParameter("Metadata profile JSON must be an object.") return payload def format_tool_panel(event: ToolCallEvent, step_number: int) -> Panel: """Create a richly formatted panel for a tool call event.""" tool_name = event.tool_name icon = TOOL_ICONS.get(tool_name, "🔧") phase_info = PHASE_DESCRIPTIONS.get(tool_name, ("Action", "Tool Call", "yellow")) phase_label, phase_desc, color = phase_info # Build the content lines = [] # Tool and target info if "directory" in event.tool_input: target = event.tool_input["directory"] lines.append(f"**Target Directory:** `{target}`") elif "file_path" in event.tool_input: target = event.tool_input["file_path"] lines.append(f"**Target File:** `{target}`") # Additional parameters other_params = { k: v for k, v in event.tool_input.items() if k not in ("directory", "file_path") } if other_params: lines.append(f"**Parameters:** `{json.dumps(other_params)}`") lines.append("") lines.append("---") lines.append("") # Reasoning (this is the key part for visibility) lines.append("**Agent's Reasoning:**") lines.append("") lines.append(event.reason) content = "\n".join(lines) # Create title with step number and phase title = f"{icon} Step {step_number}: {tool_name} [{phase_label}: {phase_desc}]" return Panel( Markdown(content), title=title, title_align="left", border_style=f"bold {color}", padding=(1, 2), ) def 
format_navigation_panel(event: GoDeeperEvent, step_number: int) -> Panel: """Create a panel for directory navigation events.""" content = f"""**Navigating to:** `{event.directory}` --- **Agent's Reasoning:** {event.reason} """ return Panel( Markdown(content), title=f"📁 Step {step_number}: Navigate to Directory", title_align="left", border_style="bold magenta", padding=(1, 2), ) def print_workflow_header(console: Console, task: str, folder: str) -> None: """Print a header showing the task being executed.""" console.print() header = Table.grid(padding=(0, 2)) header.add_column(style="bold cyan", justify="right") header.add_column() header.add_row("🤖 FsExplorer Agent", "") header.add_row("📋 Task:", task) header.add_row("📁 Folder:", folder) header.add_row("🕐 Started:", datetime.now().strftime("%Y-%m-%d %H:%M:%S")) console.print( Panel( header, border_style="bold blue", title="Starting Exploration", title_align="left", ) ) console.print() def print_workflow_summary( console: Console, agent, step_count: int, trace: ExplorationTrace, cited_sources: list[str], ) -> None: """Print a summary of the workflow execution.""" usage = agent.token_usage # Create summary table summary = Table.grid(padding=(0, 2)) summary.add_column(style="bold", justify="right") summary.add_column() summary.add_row("Total Steps:", str(step_count)) summary.add_row("API Calls:", str(usage.api_calls)) summary.add_row("Documents Scanned:", str(usage.documents_scanned)) summary.add_row("Documents Parsed:", str(usage.documents_parsed)) summary.add_row("", "") summary.add_row("Prompt Tokens:", f"{usage.prompt_tokens:,}") summary.add_row("Completion Tokens:", f"{usage.completion_tokens:,}") summary.add_row("Total Tokens:", f"{usage.total_tokens:,}") summary.add_row("", "") # Cost calculation input_cost, output_cost, total_cost = usage._calculate_cost() summary.add_row("Est. Input Cost:", f"${input_cost:.4f}") summary.add_row("Est. Output Cost:", f"${output_cost:.4f}") summary.add_row("Est. Total Cost:", f"${total_cost:.4f}") console.print() console.print( Panel( summary, title="📊 Workflow Summary", title_align="left", border_style="bold blue", ) ) if trace.step_path: path_markdown = "\n".join(f"- `{entry}`" for entry in trace.step_path) console.print() console.print( Panel( Markdown(path_markdown), title="🧭 Exploration Path", title_align="left", border_style="bold cyan", ) ) referenced_documents = trace.sorted_documents() if referenced_documents: docs_markdown = "\n".join(f"- `{doc}`" for doc in referenced_documents) console.print() console.print( Panel( Markdown(docs_markdown), title="📚 Referenced Documents (Tool Calls)", title_align="left", border_style="bold green", ) ) if cited_sources: sources_markdown = "\n".join(f"- `{source}`" for source in cited_sources) console.print() console.print( Panel( Markdown(sources_markdown), title="🔖 Cited Sources (Final Answer)", title_align="left", border_style="bold yellow", ) ) async def run_workflow( task: str, folder: str = ".", *, use_index: bool = False, db_path: str | None = None, ) -> None: """ Execute the exploration workflow with detailed step-by-step output. Args: task: The user's task/question to answer. 
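        folder: Folder to explore; defaults to the current directory.
        use_index: Route retrieval through indexed tools (requires a prior `explore index` run).
        db_path: Optional path to the DuckDB index file.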
""" console = Console() resolved_folder = os.path.abspath(folder) if not os.path.exists(resolved_folder) or not os.path.isdir(resolved_folder): console.print( Panel( Text(f"No such directory: {resolved_folder}", style="bold red"), title="❌ Error", title_align="left", border_style="bold red", ) ) return resolved_db_path: str | None = None index_storage: DuckDBStorage | None = None if use_index: resolved_db_path = resolve_db_path(db_path) storage = DuckDBStorage(resolved_db_path) corpus_id = storage.get_corpus_id(resolved_folder) if corpus_id is None: console.print( Panel( Text( "No index found for this folder. " "Run `explore index ` first.", style="bold red", ), title="❌ Missing Index", title_align="left", border_style="bold red", ) ) return index_storage = storage set_index_context(resolved_folder, resolved_db_path) else: clear_index_context() try: # Reset agent for fresh state reset_agent() # Print header print_workflow_header(console, task, resolved_folder) trace = ExplorationTrace(root_directory=resolved_folder) step_number = 0 handler = workflow.run( start_event=InputEvent( task=task, folder=resolved_folder, use_index=use_index, ) ) with console.status(status="[bold cyan]🔄 Analyzing task...") as status: async for event in handler.stream_events(): if isinstance(event, ToolCallEvent): step_number += 1 resolved_document_path: str | None = None if event.tool_name == "get_document": doc_id = event.tool_input.get("doc_id") if ( index_storage is not None and isinstance(doc_id, str) and doc_id ): document = index_storage.get_document(doc_id=doc_id) if document and not document["is_deleted"]: resolved_document_path = str(document["absolute_path"]) trace.record_tool_call( step_number=step_number, tool_name=event.tool_name, tool_input=event.tool_input, resolved_document_path=resolved_document_path, ) # Update status based on tool icon = TOOL_ICONS.get(event.tool_name, "🔧") if event.tool_name == "scan_folder": status.update( f"[bold cyan]{icon} Scanning documents in parallel..." ) elif event.tool_name == "parse_file": status.update( f"[bold green]{icon} Reading document in detail..." ) elif event.tool_name == "preview_file": status.update(f"[bold cyan]{icon} Quick preview of document...") elif event.tool_name == "semantic_search": status.update(f"[bold magenta]{icon} Searching index...") elif event.tool_name == "get_document": status.update(f"[bold green]{icon} Reading indexed document...") elif event.tool_name == "list_indexed_documents": status.update(f"[bold blue]{icon} Listing indexed documents...") else: status.update( f"[bold yellow]{icon} Executing {event.tool_name}..." 
) # Print the detailed panel panel = format_tool_panel(event, step_number) console.print(panel) console.print() status.update("[bold cyan]🔄 Processing results...") elif isinstance(event, GoDeeperEvent): step_number += 1 trace.record_go_deeper( step_number=step_number, directory=event.directory ) panel = format_navigation_panel(event, step_number) console.print(panel) console.print() status.update("[bold cyan]🔄 Exploring directory...") elif isinstance(event, AskHumanEvent): status.stop() console.print() # Create a nice prompt panel question_panel = Panel( Markdown( f"**Question:** {event.question}\n\n**Why I'm asking:** {event.reason}" ), title="❓ Human Input Required", title_align="left", border_style="bold red", ) console.print(question_panel) answer = console.input("[bold cyan]Your answer:[/] ") while answer.strip() == "": console.print("[bold red]Please provide an answer.[/]") answer = console.input("[bold cyan]Your answer:[/] ") handler.ctx.send_event(HumanAnswerEvent(response=answer.strip())) console.print() status.start() status.update("[bold cyan]🔄 Processing your response...") # Get final result result = await handler status.update("[bold green]✨ Preparing final answer...") await asyncio.sleep(0.1) status.stop() # Print final result with prominent styling console.print() if result.final_result: final_panel = Panel( Markdown(result.final_result), title="✅ Final Answer", title_align="left", border_style="bold green", padding=(1, 2), ) console.print(final_panel) elif result.error: error_panel = Panel( Text(result.error, style="bold red"), title="❌ Error", title_align="left", border_style="bold red", ) console.print(error_panel) # Print workflow summary agent = get_agent() cited_sources = extract_cited_sources(result.final_result) print_workflow_summary(console, agent, step_number, trace, cited_sources) finally: clear_index_context() @app.callback(invoke_without_command=True) def main( ctx: Context, task: Annotated[ str | None, Option( "--task", "-t", help="Task that the FsExplorer Agent has to perform while exploring the current directory.", ), ] = None, folder: Annotated[ str, Option( "--folder", "-f", help="Folder to explore. Defaults to the current directory.", ), ] = ".", use_index: Annotated[ bool, Option( "--use-index", help="Use indexed retrieval tools for this run (requires prior indexing).", ), ] = False, db_path: Annotated[ str | None, Option("--db-path", help="Path to DuckDB index file."), ] = None, ) -> None: """ Explore documents with an agent, build indexes, and manage schema metadata. Backward-compatible mode: - `explore --task "..." 
[--folder ...]` """ if ctx.invoked_subcommand is not None: return if task is None or not task.strip(): raise BadParameter("`--task` is required unless you run a subcommand.") effective_use_index = use_index if ( not effective_use_index and os.getenv("FS_EXPLORER_AUTO_INDEX", "").strip() == "1" ): try: resolved_folder = os.path.abspath(folder) resolved_db = resolve_db_path(db_path) storage = DuckDBStorage(resolved_db, read_only=True, initialize=False) if storage.get_corpus_id(resolved_folder) is not None: effective_use_index = True storage.close() except Exception: pass asyncio.run( run_workflow(task, folder, use_index=effective_use_index, db_path=db_path) ) @app.command("index") def index_command( folder: Annotated[ str, Argument(help="Folder to index recursively."), ] = ".", db_path: Annotated[ str | None, Option("--db-path", help="Path to DuckDB index file."), ] = None, discover_schema: Annotated[ bool, Option( "--discover-schema", help="Auto-discover metadata schema and set it active for this corpus.", ), ] = False, schema_name: Annotated[ str | None, Option("--schema-name", help="Use an existing stored schema by name."), ] = None, with_metadata: Annotated[ bool, Option( "--with-metadata", help=( "Enable langextract metadata extraction (requires API key). " "Also enables schema discovery if not explicitly requested." ), ), ] = False, metadata_profile_path: Annotated[ str | None, Option( "--metadata-profile", help=( "Path to JSON profile defining dynamic langextract metadata fields " "and prompt. Implies --with-metadata." ), ), ] = None, with_embeddings: Annotated[ bool, Option( "--with-embeddings", help="Generate vector embeddings for indexed chunks (requires GOOGLE_API_KEY).", ), ] = False, ) -> None: """Build or refresh an index for a folder.""" console = Console() resolved_db_path = resolve_db_path(db_path) storage = DuckDBStorage(resolved_db_path) embedding_provider: EmbeddingProvider | None = None if with_embeddings: try: embedding_provider = EmbeddingProvider() except ValueError as exc: raise BadParameter(str(exc)) from exc pipeline = IndexingPipeline( storage=storage, embedding_provider=embedding_provider, ) metadata_profile = _load_metadata_profile(metadata_profile_path) effective_with_metadata = with_metadata or metadata_profile is not None if effective_with_metadata and metadata_profile is None: console.print( "[bold cyan]🔍 Analyzing corpus to generate metadata profile...[/]" ) try: effective_discover_schema = discover_schema or effective_with_metadata result = pipeline.index_folder( folder, discover_schema=effective_discover_schema, schema_name=schema_name, with_metadata=effective_with_metadata, metadata_profile=metadata_profile, ) except ValueError as exc: raise BadParameter(str(exc)) from exc summary = Table.grid(padding=(0, 2)) summary.add_column(style="bold", justify="right") summary.add_column() summary.add_row("DB Path:", resolved_db_path) summary.add_row("Corpus ID:", result.corpus_id) summary.add_row("Indexed Files:", str(result.indexed_files)) summary.add_row("Skipped Files:", str(result.skipped_files)) summary.add_row("Deleted Files:", str(result.deleted_files)) summary.add_row("Chunks Written:", str(result.chunks_written)) summary.add_row("Active Documents:", str(result.active_documents)) summary.add_row("Embeddings Written:", str(result.embeddings_written)) summary.add_row("Schema Used:", result.schema_used or "") summary.add_row( "Metadata Mode:", "langextract" if effective_with_metadata else "heuristic", ) if metadata_profile_path: profile_label = 
str(Path(metadata_profile_path).expanduser().resolve()) elif effective_with_metadata: profile_label = "" else: profile_label = "" summary.add_row("Metadata Profile:", profile_label) console.print(Panel(summary, title="📦 Index Complete", border_style="bold green")) @app.command("query") def query_command( task: Annotated[ str, Option( "--task", "-t", help="Question to answer using indexed retrieval tools.", ), ], folder: Annotated[ str, Option( "--folder", "-f", help="Folder whose index should be queried.", ), ] = ".", db_path: Annotated[ str | None, Option("--db-path", help="Path to DuckDB index file."), ] = None, ) -> None: """Run the agent with indexed retrieval enabled.""" asyncio.run(run_workflow(task, folder, use_index=True, db_path=db_path)) @schema_app.command("discover") def schema_discover_command( folder: Annotated[ str, Argument(help="Folder to inspect for schema discovery."), ] = ".", db_path: Annotated[ str | None, Option("--db-path", help="Path to DuckDB index file."), ] = None, name: Annotated[ str | None, Option("--name", help="Override discovered schema name."), ] = None, activate: Annotated[ bool, Option( "--activate/--no-activate", help="Set schema as active for the corpus.", ), ] = True, with_metadata: Annotated[ bool, Option( "--with-metadata", help="Include langextract metadata fields in discovered schema.", ), ] = False, metadata_profile_path: Annotated[ str | None, Option( "--metadata-profile", help=( "Path to JSON profile defining dynamic langextract metadata fields " "and prompt. Implies --with-metadata." ), ), ] = None, ) -> None: """Auto-discover and store a metadata schema for a folder.""" console = Console() resolved_folder = str(os.path.abspath(folder)) if not os.path.isdir(resolved_folder): raise BadParameter(f"No such directory: {resolved_folder}") resolved_db_path = resolve_db_path(db_path) storage = DuckDBStorage(resolved_db_path) corpus_id = storage.get_or_create_corpus(resolved_folder) metadata_profile = _load_metadata_profile(metadata_profile_path) effective_with_metadata = with_metadata or metadata_profile is not None if effective_with_metadata and metadata_profile is None: console.print( "[bold cyan]🔍 Analyzing corpus to generate metadata profile...[/]" ) discovery = SchemaDiscovery() discovered = discovery.discover_from_folder( resolved_folder, with_langextract=effective_with_metadata, metadata_profile=metadata_profile, ) schema_name = name or str( discovered.get("name", f"auto_{os.path.basename(resolved_folder)}") ) discovered["name"] = schema_name schema_id = storage.save_schema( corpus_id=corpus_id, name=schema_name, schema_def=discovered, is_active=activate, ) output = Table.grid(padding=(0, 2)) output.add_column(style="bold", justify="right") output.add_column() output.add_row("DB Path:", resolved_db_path) output.add_row("Corpus ID:", corpus_id) output.add_row("Schema ID:", schema_id) output.add_row("Schema Name:", schema_name) output.add_row("Active:", str(activate)) output.add_row("Field Count:", str(len(discovered.get("fields", [])))) output.add_row( "Metadata Mode:", "langextract" if effective_with_metadata else "heuristic" ) if metadata_profile_path: profile_label = str(Path(metadata_profile_path).expanduser().resolve()) elif effective_with_metadata: profile_label = "" else: profile_label = "" output.add_row("Metadata Profile:", profile_label) console.print(Panel(output, title="🧩 Schema Saved", border_style="bold cyan")) console.print_json(json.dumps(discovered, indent=2)) @schema_app.command("show") def schema_show_command( folder: 
Annotated[ str, Argument(help="Folder whose schemas should be listed."), ] = ".", db_path: Annotated[ str | None, Option("--db-path", help="Path to DuckDB index file."), ] = None, ) -> None: """Show saved schemas for a folder's corpus.""" console = Console() resolved_folder = str(os.path.abspath(folder)) resolved_db_path = resolve_db_path(db_path) storage = DuckDBStorage(resolved_db_path) corpus_id = storage.get_corpus_id(resolved_folder) if corpus_id is None: console.print( Panel( f"No corpus found for folder: {resolved_folder}\nRun `explore index {resolved_folder}` first.", title="⚠️ No Corpus", border_style="bold yellow", ) ) raise Exit(code=1) schemas = storage.list_schemas(corpus_id=corpus_id) if not schemas: console.print( Panel( f"No schemas saved for corpus: {corpus_id}", title="⚠️ No Schemas", border_style="bold yellow", ) ) raise Exit(code=1) table = Table(title=f"Schemas for {resolved_folder}") table.add_column("Name") table.add_column("Active") table.add_column("Created At") table.add_column("Field Count") for schema in schemas: table.add_row( schema.name, "yes" if schema.is_active else "no", schema.created_at, str(len(schema.schema_def.get("fields", []))), ) console.print(table) ================================================ FILE: src/fs_explorer/models.py ================================================ """ Pydantic models for FsExplorer agent actions. This module defines the structured data models used to represent the actions the agent can take during filesystem exploration. """ from pydantic import BaseModel, Field from typing import TypeAlias, Literal, Any # ============================================================================= # Type Aliases # ============================================================================= Tools: TypeAlias = Literal[ "read", "grep", "glob", "scan_folder", "preview_file", "parse_file", "semantic_search", "get_document", "list_indexed_documents", ] """Available tool names that the agent can invoke.""" ActionType: TypeAlias = Literal["stop", "godeeper", "toolcall", "askhuman"] """Types of actions the agent can take.""" # ============================================================================= # Action Models # ============================================================================= class StopAction(BaseModel): """ Action indicating the task is complete. Used when the agent has gathered enough information to provide a final answer to the user's query. """ final_result: str = Field( description="Final result of the operation with the answer to the user's query" ) class AskHumanAction(BaseModel): """ Action requesting clarification from the user. Used when the agent needs additional information or context to proceed with the task. """ question: str = Field( description="Clarification question to ask the user" ) class GoDeeperAction(BaseModel): """ Action to navigate into a subdirectory. Used when the agent needs to explore a subdirectory to find relevant files. """ directory: str = Field( description="Path to the directory to navigate into" ) class ToolCallArg(BaseModel): """ A single argument for a tool call. Represents a parameter name-value pair to pass to a tool. """ parameter_name: str = Field( description="Name of the parameter" ) parameter_value: Any = Field( description="Value for the parameter" ) class ToolCallAction(BaseModel): """ Action to invoke a filesystem tool. Used when the agent needs to read files, search for patterns, or parse documents to gather information. 
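    Illustrative example (the grep parameter names shown are assumptions for illustration, not the actual tool signature):

        action = ToolCallAction(
            tool_name="grep",
            tool_input=[
                ToolCallArg(parameter_name="pattern", parameter_value="earnout"),
                ToolCallArg(parameter_name="file_path", parameter_value="contract.md"),
            ],
        )
        action.to_fn_args()  # -> {"pattern": "earnout", "file_path": "contract.md"}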
""" tool_name: Tools = Field( description="Name of the tool to invoke" ) tool_input: list[ToolCallArg] = Field( description="Arguments to pass to the tool" ) def to_fn_args(self) -> dict[str, Any]: """ Convert tool input to a dictionary for function calls. Returns: Dictionary mapping parameter names to values. """ return {arg.parameter_name: arg.parameter_value for arg in self.tool_input} class Action(BaseModel): """ Container for an agent action with reasoning. Wraps any of the specific action types (stop, go deeper, tool call, ask human) along with the agent's explanation for why this action was chosen. """ action: ToolCallAction | GoDeeperAction | StopAction | AskHumanAction = Field( description="The specific action to take" ) reason: str = Field( description="Explanation for why this action was chosen" ) def to_action_type(self) -> ActionType: """ Get the type of this action. Returns: The action type string: "toolcall", "godeeper", "askhuman", or "stop". """ if isinstance(self.action, ToolCallAction): return "toolcall" elif isinstance(self.action, GoDeeperAction): return "godeeper" elif isinstance(self.action, AskHumanAction): return "askhuman" else: return "stop" ================================================ FILE: src/fs_explorer/search/__init__.py ================================================ """Search helpers for indexed corpora.""" from .filters import ( MetadataFilter, MetadataFilterParseError, parse_metadata_filters, supported_filter_syntax, ) from .query import IndexedQueryEngine, SearchHit from .ranker import RankedDocument, rank_documents from .semantic import SemanticSearchEngine __all__ = [ "MetadataFilter", "MetadataFilterParseError", "parse_metadata_filters", "supported_filter_syntax", "IndexedQueryEngine", "SearchHit", "RankedDocument", "rank_documents", "SemanticSearchEngine", ] ================================================ FILE: src/fs_explorer/search/filters.py ================================================ """ Metadata filter parsing helpers. 
""" from __future__ import annotations import re from dataclasses import dataclass from typing import Any, Literal FilterOperator = Literal["eq", "ne", "gt", "gte", "lt", "lte", "in", "contains"] @dataclass(frozen=True) class MetadataFilter: """Normalized metadata filter condition.""" field: str operator: FilterOperator value: str | bool | int | float | list[str | bool | int | float] def to_storage_dict(self) -> dict[str, Any]: return { "field": self.field, "operator": self.operator, "value": self.value, } class MetadataFilterParseError(ValueError): """Raised when metadata filter syntax is invalid.""" _FIELD_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$") _NUMBER_RE = re.compile(r"^-?\d+(?:\.\d+)?$") def supported_filter_syntax() -> str: """Return a short help text for filter syntax.""" return ( "Supported filter syntax: " "`field=value`, `field!=value`, `field>=number`, `field<=number`, " "`field>number`, `field list[MetadataFilter]: """Parse a raw filter string into normalized metadata conditions.""" if raw_filters is None or not raw_filters.strip(): return [] conditions = _split_conditions(raw_filters) parsed: list[MetadataFilter] = [] for condition in conditions: parsed.append(_parse_condition(condition, allowed_fields=allowed_fields)) return parsed def _parse_condition(condition: str, *, allowed_fields: set[str] | None) -> MetadataFilter: text = condition.strip() if not text: raise MetadataFilterParseError("Empty filter condition.") in_match = re.match(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s+in\s+(.+)\s*$", text, flags=re.IGNORECASE) if in_match: field = in_match.group(1) _validate_field(field, allowed_fields=allowed_fields) values = _parse_list_value(in_match.group(2)) if not values: raise MetadataFilterParseError(f"`in` filter has no values: {text!r}") return MetadataFilter(field=field, operator="in", value=values) op_match = re.match(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*(<=|>=|!=|=|<|>|~|:)\s*(.+)\s*$", text) if not op_match: raise MetadataFilterParseError(f"Invalid filter syntax: {text!r}") field = op_match.group(1) operator_symbol = op_match.group(2) raw_value = op_match.group(3) _validate_field(field, allowed_fields=allowed_fields) value = _parse_scalar_value(raw_value) operator_map: dict[str, FilterOperator] = { "=": "eq", ":": "eq", "!=": "ne", ">": "gt", ">=": "gte", "<": "lt", "<=": "lte", "~": "contains", } operator = operator_map[operator_symbol] if operator in {"gt", "gte", "lt", "lte"} and not isinstance(value, (int, float)): raise MetadataFilterParseError( f"Operator `{operator_symbol}` requires a numeric value: {text!r}" ) return MetadataFilter(field=field, operator=operator, value=value) def _validate_field(field: str, *, allowed_fields: set[str] | None) -> None: if not _FIELD_RE.match(field): raise MetadataFilterParseError(f"Invalid field name: {field!r}") if allowed_fields is not None and field not in allowed_fields: allowed = ", ".join(sorted(allowed_fields)) if allowed_fields else "" raise MetadataFilterParseError( f"Unknown metadata field {field!r}. 
Allowed fields: {allowed}" ) def _split_conditions(raw: str) -> list[str]: parts: list[str] = [] current: list[str] = [] quote: str | None = None paren_depth = 0 bracket_depth = 0 i = 0 while i < len(raw): ch = raw[i] if quote is not None: current.append(ch) if ch == quote: quote = None i += 1 continue if ch in {"'", '"'}: quote = ch current.append(ch) i += 1 continue if ch == "(": paren_depth += 1 current.append(ch) i += 1 continue if ch == ")": paren_depth = max(paren_depth - 1, 0) current.append(ch) i += 1 continue if ch == "[": bracket_depth += 1 current.append(ch) i += 1 continue if ch == "]": bracket_depth = max(bracket_depth - 1, 0) current.append(ch) i += 1 continue if paren_depth == 0 and bracket_depth == 0 and ch == ",": _flush_part(parts, current) i += 1 continue if ( paren_depth == 0 and bracket_depth == 0 and raw[i : i + 3].lower() == "and" and (i == 0 or raw[i - 1].isspace()) and (i + 3 == len(raw) or raw[i + 3].isspace()) ): _flush_part(parts, current) i += 3 continue current.append(ch) i += 1 _flush_part(parts, current) return parts def _flush_part(parts: list[str], current: list[str]) -> None: text = "".join(current).strip() if text: parts.append(text) current.clear() def _parse_list_value(raw_value: str) -> list[str | bool | int | float]: text = raw_value.strip() if text.startswith("(") and text.endswith(")"): text = text[1:-1] elif text.startswith("[") and text.endswith("]"): text = text[1:-1] if not text.strip(): return [] items = _split_conditions(text) return [_parse_scalar_value(item) for item in items] def _parse_scalar_value(raw_value: str) -> str | bool | int | float: text = raw_value.strip() if not text: raise MetadataFilterParseError("Missing filter value.") if (text.startswith("'") and text.endswith("'")) or ( text.startswith('"') and text.endswith('"') ): return text[1:-1] lower = text.lower() if lower == "true": return True if lower == "false": return False if _NUMBER_RE.match(text): if "." in text: return float(text) return int(text) return text ================================================ FILE: src/fs_explorer/search/query.py ================================================ """ Indexed query helpers for agent tools. 
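Illustrative sketch (assumes the folder was indexed beforehand; the DB path is a placeholder):

    from fs_explorer.search import IndexedQueryEngine
    from fs_explorer.storage import DuckDBStorage

    storage = DuckDBStorage("fs_explorer.duckdb", read_only=True, initialize=False)
    corpus_id = storage.get_corpus_id("/absolute/path/to/corpus")  # not None under the assumption above
    engine = IndexedQueryEngine(storage)
    hits = engine.search(
        corpus_id=corpus_id,
        query="escrow reserve",
        filters="document_type=contract",
        limit=3,
    )
    [(hit.relative_path, hit.matched_by, round(hit.score, 2)) for hit in hits]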
""" from __future__ import annotations from concurrent.futures import ThreadPoolExecutor from dataclasses import dataclass from typing import Any, Callable from ..embeddings import EmbeddingProvider from ..storage import DuckDBStorage, StorageBackend from .filters import MetadataFilter, parse_metadata_filters from .ranker import RankedDocument, rank_documents @dataclass(frozen=True) class SearchHit: """Ranked document hit from indexed retrieval.""" doc_id: str relative_path: str absolute_path: str position: int | None text: str semantic_score: float metadata_score: int score: float matched_by: str class IndexedQueryEngine: """Parallel retrieval engine for semantic + metadata query paths.""" def __init__( self, storage: StorageBackend, embedding_provider: EmbeddingProvider | None = None, ) -> None: self.storage = storage self.embedding_provider = embedding_provider def search( self, *, corpus_id: str, query: str, filters: str | None = None, limit: int = 5, enable_semantic: bool = True, enable_metadata: bool = True, ) -> list[SearchHit]: normalized_limit = max(limit, 1) parsed_filters = self._parse_filters(corpus_id=corpus_id, filters=filters) semantic_limit = max(normalized_limit * 4, normalized_limit) metadata_limit = max(normalized_limit * 4, normalized_limit) run_semantic = enable_semantic run_metadata = enable_metadata and bool(parsed_filters) semantic_rows: list[dict[str, Any]] metadata_rows: list[dict[str, Any]] if run_semantic and run_metadata: semantic_rows, metadata_rows = self._search_parallel( corpus_id=corpus_id, query=query, metadata_filters=parsed_filters, semantic_limit=semantic_limit, metadata_limit=metadata_limit, ) elif run_semantic: semantic_rows = self._semantic_query( corpus_id=corpus_id, query=query, limit=semantic_limit, ) metadata_rows = [] elif run_metadata: semantic_rows = [] metadata_rows = self._metadata_query( corpus_id=corpus_id, metadata_filters=parsed_filters, limit=metadata_limit, ) else: semantic_rows, metadata_rows = [], [] ranked = self._merge_and_rank( semantic_rows=semantic_rows, metadata_rows=metadata_rows, limit=normalized_limit, ) return [ SearchHit( doc_id=doc.doc_id, relative_path=doc.relative_path, absolute_path=doc.absolute_path, position=doc.position, text=doc.text, semantic_score=doc.semantic_score, metadata_score=doc.metadata_score, score=doc.combined_score, matched_by=doc.matched_by, ) for doc in ranked ] def _parse_filters( self, *, corpus_id: str, filters: str | None ) -> list[MetadataFilter]: if filters is None or not filters.strip(): return [] allowed_fields = self._allowed_filter_fields(corpus_id=corpus_id) return parse_metadata_filters(filters, allowed_fields=allowed_fields) def _allowed_filter_fields(self, *, corpus_id: str) -> set[str] | None: active_schema = self.storage.get_active_schema(corpus_id=corpus_id) if active_schema is None: return None fields = active_schema.schema_def.get("fields") if not isinstance(fields, list): return None allowed: set[str] = set() for field in fields: if isinstance(field, dict): name = field.get("name") if isinstance(name, str): allowed.add(name) return allowed if allowed else None def _search_parallel( self, *, corpus_id: str, query: str, metadata_filters: list[MetadataFilter], semantic_limit: int, metadata_limit: int, ) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]: with ThreadPoolExecutor(max_workers=2) as executor: semantic_future = executor.submit( self._semantic_query, corpus_id=corpus_id, query=query, limit=semantic_limit, ) metadata_future = executor.submit( self._metadata_query, 
corpus_id=corpus_id, metadata_filters=metadata_filters, limit=metadata_limit, ) semantic_rows = semantic_future.result() metadata_rows = metadata_future.result() return semantic_rows, metadata_rows def _semantic_query( self, *, corpus_id: str, query: str, limit: int, ) -> list[dict[str, Any]]: scoped_storage, cleanup = self._acquire_query_storage() try: if self.embedding_provider is not None and scoped_storage.has_embeddings( corpus_id=corpus_id ): query_embedding = self.embedding_provider.embed_query(query) return scoped_storage.search_chunks_semantic( corpus_id=corpus_id, query_embedding=query_embedding, limit=limit, ) return scoped_storage.search_chunks( corpus_id=corpus_id, query=query, limit=limit ) finally: cleanup() def _metadata_query( self, *, corpus_id: str, metadata_filters: list[MetadataFilter], limit: int, ) -> list[dict[str, Any]]: scoped_storage, cleanup = self._acquire_query_storage() try: return scoped_storage.search_documents_by_metadata( corpus_id=corpus_id, filters=[flt.to_storage_dict() for flt in metadata_filters], limit=limit, ) finally: cleanup() def _acquire_query_storage(self) -> tuple[StorageBackend, Callable[[], None]]: if isinstance(self.storage, DuckDBStorage): clone = DuckDBStorage( self.storage.db_path, read_only=self.storage.read_only, initialize=False, embedding_dim=self.storage.embedding_dim, ) return clone, clone.close return self.storage, lambda: None @staticmethod def _merge_and_rank( *, semantic_rows: list[dict[str, Any]], metadata_rows: list[dict[str, Any]], limit: int, ) -> list[RankedDocument]: merged: dict[str, dict[str, Any]] = {} for row in semantic_rows: doc_id = str(row["doc_id"]) score = float(row["score"]) position = int(row["position"]) entry = merged.setdefault( doc_id, { "doc_id": doc_id, "relative_path": str(row["relative_path"]), "absolute_path": str(row["absolute_path"]), "position": position, "text": str(row["text"]), "semantic_score": 0.0, "metadata_score": 0, }, ) if score > float(entry["semantic_score"]): entry["semantic_score"] = score entry["position"] = position entry["text"] = str(row["text"]) for row in metadata_rows: doc_id = str(row["doc_id"]) entry = merged.setdefault( doc_id, { "doc_id": doc_id, "relative_path": str(row["relative_path"]), "absolute_path": str(row["absolute_path"]), "position": None, "text": str(row.get("preview_text", "")), "semantic_score": 0.0, "metadata_score": 0, }, ) entry["metadata_score"] = max( int(entry["metadata_score"]), int(row.get("metadata_score", 1)), ) if not entry["text"]: entry["text"] = str(row.get("preview_text", "")) documents = [ RankedDocument( doc_id=str(entry["doc_id"]), relative_path=str(entry["relative_path"]), absolute_path=str(entry["absolute_path"]), position=int(entry["position"]) if entry["position"] is not None else None, text=str(entry["text"]), semantic_score=float(entry["semantic_score"]), metadata_score=int(entry["metadata_score"]), ) for entry in merged.values() ] return rank_documents(documents, limit=limit) ================================================ FILE: src/fs_explorer/search/ranker.py ================================================ """ Ranking helpers for merging retrieval result sets. 
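Illustrative example:

    doc = RankedDocument(
        doc_id="doc-1",
        relative_path="contract.md",
        absolute_path="/corpus/contract.md",
        position=0,
        text="...",
        semantic_score=0.82,
        metadata_score=1,
    )
    doc.combined_score  # 0.82 * 100 + 1 * 10 == 92.0
    doc.matched_by      # "semantic+metadata"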
""" from __future__ import annotations from dataclasses import dataclass @dataclass(frozen=True) class RankedDocument: """Merged retrieval candidate for a document.""" doc_id: str relative_path: str absolute_path: str position: int | None text: str semantic_score: float metadata_score: int @property def combined_score(self) -> float: # Semantic scores dominate ordering; metadata score boosts ties and # metadata-only matches into the candidate set. return float(self.semantic_score * 100 + self.metadata_score * 10) @property def matched_by(self) -> str: if self.semantic_score > 0 and self.metadata_score > 0: return "semantic+metadata" if self.semantic_score > 0: return "semantic" return "metadata" def rank_documents( documents: list[RankedDocument], *, limit: int ) -> list[RankedDocument]: """Sort merged retrieval results and apply limit.""" ordered = sorted( documents, key=lambda doc: ( -doc.combined_score, -doc.semantic_score, -doc.metadata_score, doc.position if doc.position is not None else 10**9, doc.relative_path, ), ) return ordered[: max(limit, 1)] ================================================ FILE: src/fs_explorer/search/semantic.py ================================================ """ Vector-based semantic search engine. Embeds a query and searches chunk embeddings via cosine similarity, falling back to keyword matching when embeddings are unavailable. """ from __future__ import annotations from typing import Any from ..embeddings import EmbeddingProvider from ..storage import StorageBackend class SemanticSearchEngine: """Embed a query and search stored chunk embeddings.""" def __init__( self, storage: StorageBackend, embedding_provider: EmbeddingProvider, ) -> None: self.storage = storage self.embedding_provider = embedding_provider def search( self, *, corpus_id: str, query: str, limit: int = 5, ) -> list[dict[str, Any]]: """Return ranked chunk hits using vector cosine similarity.""" query_embedding = self.embedding_provider.embed_query(query) return self.storage.search_chunks_semantic( corpus_id=corpus_id, query_embedding=query_embedding, limit=limit, ) ================================================ FILE: src/fs_explorer/server.py ================================================ """ FastAPI server for FsExplorer web UI. Provides a WebSocket endpoint for real-time workflow streaming and serves the single-page HTML interface. 
""" import asyncio from pathlib import Path from typing import Any from fastapi import FastAPI, WebSocket, WebSocketDisconnect from fastapi.responses import HTMLResponse, JSONResponse from pydantic import BaseModel from .agent import clear_index_context, set_index_context, set_search_flags from .embeddings import EmbeddingProvider from .exploration_trace import ExplorationTrace, extract_cited_sources from .index_config import resolve_db_path from .indexing import IndexingPipeline from .indexing.metadata import auto_discover_profile from .search import IndexedQueryEngine from .storage import DuckDBStorage from .workflow import ( AskHumanEvent, GoDeeperEvent, HumanAnswerEvent, InputEvent, ToolCallEvent, get_agent, reset_agent, workflow, ) app = FastAPI(title="FsExplorer", description="AI-powered filesystem exploration") _corpus_locks: dict[str, asyncio.Lock] = {} def _get_corpus_lock(folder: str) -> asyncio.Lock: """Return a per-folder asyncio lock, creating one if needed.""" normalized = str(Path(folder).resolve()) if normalized not in _corpus_locks: _corpus_locks[normalized] = asyncio.Lock() return _corpus_locks[normalized] class TaskRequest(BaseModel): """Request model for task submission.""" task: str folder: str = "." use_index: bool = False db_path: str | None = None class IndexRequest(BaseModel): """Request model for index build/refresh.""" folder: str = "." db_path: str | None = None discover_schema: bool = True schema_name: str | None = None with_metadata: bool = False metadata_profile: dict[str, Any] | None = None with_embeddings: bool = False class AutoProfileRequest(BaseModel): """Request model for auto-profile generation.""" folder: str = "." class SearchRequest(BaseModel): """Request model for search queries.""" corpus_folder: str query: str filters: str | None = None limit: int = 5 db_path: str | None = None @app.get("/", response_class=HTMLResponse) async def get_ui(): """Serve the main UI HTML file.""" html_path = Path(__file__).parent / "ui.html" if html_path.exists(): return HTMLResponse( content=html_path.read_text(encoding="utf-8"), status_code=200 ) return HTMLResponse(content="
UI not found
", status_code=404) @app.get("/api/folders") async def list_folders(path: str = "."): """ List folders in the given path. Returns list of folder names and current path info. """ try: base_path = Path(path).resolve() if not base_path.exists(): return JSONResponse({"error": "Path not found"}, status_code=404) if not base_path.is_dir(): return JSONResponse({"error": "Not a directory"}, status_code=400) # Get folders (non-hidden) folders = sorted( [ f.name for f in base_path.iterdir() if f.is_dir() and not f.name.startswith(".") ] ) # Get parent path (if not at root) parent = str(base_path.parent) if base_path != base_path.parent else None return { "current": str(base_path), "parent": parent, "folders": folders, "files_count": len([f for f in base_path.iterdir() if f.is_file()]), } except PermissionError: return JSONResponse({"error": "Permission denied"}, status_code=403) except Exception as e: return JSONResponse({"error": str(e)}, status_code=500) @app.get("/api/index/status") async def index_status(folder: str, db_path: str | None = None): """Check whether a folder has been indexed and return status details.""" try: folder_path = Path(folder).resolve() if not folder_path.exists() or not folder_path.is_dir(): return {"indexed": False} resolved_db_path = resolve_db_path(db_path) if not Path(resolved_db_path).exists(): return {"indexed": False} try: storage = DuckDBStorage(resolved_db_path, read_only=True, initialize=False) except Exception: return {"indexed": False} try: corpus_id = storage.get_corpus_id(str(folder_path)) if corpus_id is None: storage.close() return {"indexed": False} docs = storage.list_documents(corpus_id=corpus_id, include_deleted=False) active_schema = storage.get_active_schema(corpus_id=corpus_id) has_embeddings = storage.has_embeddings(corpus_id=corpus_id) schema_name: str | None = None has_metadata = False schema_fields: list[str] = [] if active_schema is not None: schema_name = active_schema.name has_metadata = ( active_schema.schema_def.get("metadata_profile") is not None ) fields_def = active_schema.schema_def.get("fields") if isinstance(fields_def, list): for f in fields_def: if isinstance(f, dict) and isinstance(f.get("name"), str): schema_fields.append(f["name"]) storage.close() return { "indexed": True, "corpus_id": corpus_id, "document_count": len(docs), "schema_name": schema_name, "has_metadata": has_metadata, "has_embeddings": has_embeddings, "schema_fields": schema_fields, } except Exception: storage.close() return {"indexed": False} except Exception: return {"indexed": False} @app.post("/api/index/auto-profile") async def generate_auto_profile(request: AutoProfileRequest): """Generate an auto-discovered metadata profile for preview/editing.""" try: folder_path = Path(request.folder).resolve() if not folder_path.exists() or not folder_path.is_dir(): return JSONResponse( {"error": f"Invalid folder: {request.folder}"}, status_code=400 ) profile = await asyncio.to_thread(auto_discover_profile, str(folder_path)) return {"profile": profile} except Exception as exc: return JSONResponse({"error": str(exc)}, status_code=500) @app.post("/api/index") async def build_index(request: IndexRequest): """Build or refresh the index for a selected folder.""" try: folder_path = Path(request.folder).resolve() if not folder_path.exists(): return JSONResponse({"error": "Path not found"}, status_code=404) if not folder_path.is_dir(): return JSONResponse({"error": "Not a directory"}, status_code=400) lock = _get_corpus_lock(str(folder_path)) async with lock: resolved_db_path = 
resolve_db_path(request.db_path) embedding_provider: EmbeddingProvider | None = None if request.with_embeddings: try: embedding_provider = EmbeddingProvider() except ValueError: embedding_provider = None pipeline = IndexingPipeline( storage=DuckDBStorage(resolved_db_path), embedding_provider=embedding_provider, ) effective_with_metadata = ( request.with_metadata or request.metadata_profile is not None ) discover_schema = request.discover_schema or effective_with_metadata result = pipeline.index_folder( str(folder_path), discover_schema=discover_schema, schema_name=request.schema_name, with_metadata=effective_with_metadata, metadata_profile=request.metadata_profile, ) return { "db_path": resolved_db_path, "folder": str(folder_path), "corpus_id": result.corpus_id, "indexed_files": result.indexed_files, "skipped_files": result.skipped_files, "deleted_files": result.deleted_files, "chunks_written": result.chunks_written, "active_documents": result.active_documents, "schema_used": result.schema_used, "embeddings_written": result.embeddings_written, "metadata_mode": "langextract" if effective_with_metadata else "heuristic", } except ValueError as exc: return JSONResponse({"error": str(exc)}, status_code=400) except PermissionError: return JSONResponse({"error": "Permission denied"}, status_code=403) except Exception as exc: return JSONResponse({"error": str(exc)}, status_code=500) @app.post("/api/search") async def search_index(request: SearchRequest): """Search an indexed corpus and return ranked hits.""" try: folder_path = Path(request.corpus_folder).resolve() if not folder_path.exists() or not folder_path.is_dir(): return JSONResponse( {"error": f"Invalid folder: {request.corpus_folder}"}, status_code=400 ) resolved_db_path = resolve_db_path(request.db_path) storage = DuckDBStorage(resolved_db_path, read_only=True, initialize=False) corpus_id = storage.get_corpus_id(str(folder_path)) if corpus_id is None: storage.close() return JSONResponse( {"error": "No index found for this folder."}, status_code=404 ) embedding_provider: EmbeddingProvider | None = None if storage.has_embeddings(corpus_id=corpus_id): try: embedding_provider = EmbeddingProvider() except ValueError: pass engine = IndexedQueryEngine(storage, embedding_provider=embedding_provider) hits = engine.search( corpus_id=corpus_id, query=request.query, filters=request.filters, limit=request.limit, ) storage.close() return { "corpus_folder": str(folder_path), "query": request.query, "hits": [ { "doc_id": hit.doc_id, "relative_path": hit.relative_path, "absolute_path": hit.absolute_path, "position": hit.position, "text": hit.text, "semantic_score": hit.semantic_score, "metadata_score": hit.metadata_score, "score": hit.score, "matched_by": hit.matched_by, } for hit in hits ], } except Exception as exc: return JSONResponse({"error": str(exc)}, status_code=500) @app.websocket("/ws/explore") async def websocket_explore(websocket: WebSocket): """ WebSocket endpoint for real-time exploration streaming. Protocol: 1. Client sends: {"task": "user question"} 2. Server streams events: {"type": "...", "data": {...}} 3. 
Final event: {"type": "complete", "data": {...}} """ await websocket.accept() try: # Receive the task data = await websocket.receive_json() task = data.get("task", "") folder = data.get("folder", ".") use_index = bool(data.get("use_index", False)) db_path = data.get("db_path") enable_semantic = bool(data.get("enable_semantic", False)) enable_metadata = bool(data.get("enable_metadata", False)) index_storage: DuckDBStorage | None = None if not task: await websocket.send_json( {"type": "error", "data": {"message": "No task provided"}} ) return # Validate folder folder_path = Path(folder).resolve() if not folder_path.exists() or not folder_path.is_dir(): await websocket.send_json( {"type": "error", "data": {"message": f"Invalid folder: {folder}"}} ) return clear_index_context() if use_index: resolved_db_path = resolve_db_path( db_path if isinstance(db_path, str) else None ) storage = DuckDBStorage(resolved_db_path) corpus_id = storage.get_corpus_id(str(folder_path)) if corpus_id is None: await websocket.send_json( { "type": "error", "data": { "message": ( "No index found for the selected folder. " "Run `explore index ` first." ) }, } ) return index_storage = storage set_index_context(str(folder_path), resolved_db_path) set_search_flags( enable_semantic=enable_semantic and use_index, enable_metadata=enable_metadata and use_index, ) trace = ExplorationTrace(root_directory=str(folder_path)) # Reset agent for fresh state reset_agent() # Send start event await websocket.send_json( { "type": "start", "data": { "task": task, "folder": str(folder_path), "use_index": use_index, }, } ) # Run the workflow step_number = 0 handler = workflow.run( start_event=InputEvent( task=task, folder=str(folder_path), use_index=use_index, enable_semantic=enable_semantic and use_index, enable_metadata=enable_metadata and use_index, ) ) async for event in handler.stream_events(): if isinstance(event, ToolCallEvent): step_number += 1 resolved_document_path: str | None = None if event.tool_name == "get_document": doc_id = event.tool_input.get("doc_id") if index_storage is not None and isinstance(doc_id, str) and doc_id: document = index_storage.get_document(doc_id=doc_id) if document and not document["is_deleted"]: resolved_document_path = str(document["absolute_path"]) trace.record_tool_call( step_number=step_number, tool_name=event.tool_name, tool_input=event.tool_input, resolved_document_path=resolved_document_path, ) await websocket.send_json( { "type": "tool_call", "data": { "step": step_number, "tool_name": event.tool_name, "tool_input": event.tool_input, "reason": event.reason, }, } ) elif isinstance(event, GoDeeperEvent): step_number += 1 trace.record_go_deeper( step_number=step_number, directory=event.directory ) await websocket.send_json( { "type": "go_deeper", "data": { "step": step_number, "directory": event.directory, "reason": event.reason, }, } ) elif isinstance(event, AskHumanEvent): step_number += 1 await websocket.send_json( { "type": "ask_human", "data": { "step": step_number, "question": event.question, "reason": event.reason, }, } ) # Wait for human response response_data = await websocket.receive_json() if response_data.get("type") == "human_response": handler.ctx.send_event( HumanAnswerEvent(response=response_data.get("response", "")) ) # Get final result result = await handler cited_sources = extract_cited_sources(result.final_result) # Get token usage agent = get_agent() usage = agent.token_usage input_cost, output_cost, total_cost = usage._calculate_cost() await websocket.send_json( { "type": 
"complete", "data": { "final_result": result.final_result, "error": result.error, "stats": { "steps": step_number, "api_calls": usage.api_calls, "documents_scanned": usage.documents_scanned, "documents_parsed": usage.documents_parsed, "prompt_tokens": usage.prompt_tokens, "completion_tokens": usage.completion_tokens, "total_tokens": usage.total_tokens, "tool_result_chars": usage.tool_result_chars, "estimated_cost": round(total_cost, 6), }, "trace": { "step_path": trace.step_path, "referenced_documents": trace.sorted_documents(), "cited_sources": cited_sources, }, }, } ) except WebSocketDisconnect: pass except Exception as e: await websocket.send_json({"type": "error", "data": {"message": str(e)}}) finally: set_search_flags(enable_semantic=False, enable_metadata=False) clear_index_context() def run_server(host: str = "127.0.0.1", port: int = 8000): """Run the FastAPI server.""" import uvicorn uvicorn.run(app, host=host, port=port) if __name__ == "__main__": run_server() ================================================ FILE: src/fs_explorer/storage/__init__.py ================================================ """Storage backends for FsExplorer indexing.""" from .base import ChunkRecord, DocumentRecord, SchemaRecord, StorageBackend from .duckdb import DuckDBStorage __all__ = [ "ChunkRecord", "DocumentRecord", "SchemaRecord", "StorageBackend", "DuckDBStorage", ] ================================================ FILE: src/fs_explorer/storage/base.py ================================================ """ Storage interfaces and data models for index persistence. """ from __future__ import annotations from dataclasses import dataclass from typing import Any, Protocol @dataclass(frozen=True) class ChunkRecord: """A text chunk stored for a document.""" id: str doc_id: str text: str position: int start_char: int end_char: int embedding: list[float] | None = None @dataclass(frozen=True) class DocumentRecord: """A normalized document record for indexing.""" id: str corpus_id: str relative_path: str absolute_path: str content: str metadata_json: str file_mtime: float file_size: int content_sha256: str @dataclass(frozen=True) class SchemaRecord: """A stored schema entry.""" id: str corpus_id: str name: str schema_def: dict[str, Any] is_active: bool created_at: str class StorageBackend(Protocol): """Protocol for persistence operations used by indexing and schema workflows.""" def initialize(self) -> None: """Initialize required tables/indexes.""" def get_or_create_corpus(self, root_path: str) -> str: """Return corpus id for a root path, creating if needed.""" def get_corpus_id(self, root_path: str) -> str | None: """Return corpus id for a root path if present.""" def upsert_document( self, document: DocumentRecord, chunks: list[ChunkRecord] ) -> None: """Insert or update a document and replace its chunks.""" def mark_deleted_missing_documents( self, *, corpus_id: str, active_relative_paths: set[str], ) -> int: """Mark documents deleted when not present in the latest index run.""" def list_documents( self, *, corpus_id: str, include_deleted: bool = False, ) -> list[dict[str, Any]]: """List documents for a corpus.""" def count_chunks(self, *, corpus_id: str) -> int: """Count chunks for active documents in a corpus.""" def search_chunks( self, *, corpus_id: str, query: str, limit: int = 5, ) -> list[dict[str, Any]]: """Search indexed chunks and return ranked matches.""" def search_documents_by_metadata( self, *, corpus_id: str, filters: list[dict[str, Any]], limit: int = 20, ) -> list[dict[str, Any]]: """Search 
indexed documents by metadata filters.""" def get_document(self, *, doc_id: str) -> dict[str, Any] | None: """Get a document by id.""" def save_schema( self, *, corpus_id: str, name: str, schema_def: dict[str, Any], is_active: bool = True, ) -> str: """Create or update a schema entry.""" def list_schemas(self, *, corpus_id: str) -> list[SchemaRecord]: """List all schemas for a corpus.""" def get_schema_by_name(self, *, corpus_id: str, name: str) -> SchemaRecord | None: """Fetch a schema by name.""" def get_active_schema(self, *, corpus_id: str) -> SchemaRecord | None: """Fetch active schema for a corpus if present.""" def store_chunk_embeddings( self, *, corpus_id: str, chunk_embeddings: list[tuple[str, list[float]]], ) -> int: """Bulk-store (chunk_id, embedding) pairs. Return count written.""" def search_chunks_semantic( self, *, corpus_id: str, query_embedding: list[float], limit: int = 5, ) -> list[dict[str, Any]]: """Search chunks by cosine similarity against a query embedding.""" def get_metadata_field_values( self, *, corpus_id: str, field_names: list[str], max_distinct: int = 10, ) -> dict[str, list[str]]: """Return up to *max_distinct* distinct non-empty values per metadata field.""" def has_embeddings(self, *, corpus_id: str) -> bool: """Return True if the corpus has stored embeddings.""" ================================================ FILE: src/fs_explorer/storage/duckdb.py ================================================ """ DuckDB storage backend for index persistence. """ from __future__ import annotations import hashlib import json import re from pathlib import Path from typing import Any import duckdb from .base import ChunkRecord, DocumentRecord, SchemaRecord def _stable_id(prefix: str, value: str) -> str: digest = hashlib.sha1(value.encode("utf-8")).hexdigest() return f"{prefix}_{digest}" def _query_terms(query: str, max_terms: int = 8) -> list[str]: terms = re.findall(r"[a-zA-Z0-9_]{3,}", query.lower()) unique_terms: list[str] = [] for term in terms: if term not in unique_terms: unique_terms.append(term) if len(unique_terms) >= max_terms: break if unique_terms: return unique_terms fallback = query.strip().lower() return [fallback] if fallback else [] class DuckDBStorage: """DuckDB-backed persistence for corpora, documents, chunks, and schemas.""" def __init__( self, db_path: str, *, read_only: bool = False, initialize: bool = True, embedding_dim: int = 768, ) -> None: self.db_path = str(Path(db_path).expanduser().resolve()) self.read_only = read_only self.embedding_dim = embedding_dim Path(self.db_path).parent.mkdir(parents=True, exist_ok=True) self._conn = duckdb.connect(self.db_path, read_only=read_only) self._vss_available = False if initialize and not read_only: self.initialize() if not read_only: self._try_load_vss() def close(self) -> None: """Close the underlying DuckDB connection.""" self._conn.close() def initialize(self) -> None: self._conn.execute( """ CREATE TABLE IF NOT EXISTS corpora ( id VARCHAR PRIMARY KEY, root_path VARCHAR NOT NULL UNIQUE, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ); """ ) self._conn.execute( """ CREATE TABLE IF NOT EXISTS documents ( id VARCHAR PRIMARY KEY, corpus_id VARCHAR NOT NULL REFERENCES corpora(id), relative_path VARCHAR NOT NULL, absolute_path VARCHAR NOT NULL, content VARCHAR NOT NULL, metadata_json VARCHAR NOT NULL DEFAULT '{}', file_mtime DOUBLE NOT NULL, file_size BIGINT NOT NULL, content_sha256 VARCHAR NOT NULL, last_indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, is_deleted BOOLEAN DEFAULT FALSE, UNIQUE(corpus_id, 
relative_path) ); """ ) self._conn.execute( """ CREATE TABLE IF NOT EXISTS chunks ( id VARCHAR PRIMARY KEY, doc_id VARCHAR NOT NULL REFERENCES documents(id), text VARCHAR NOT NULL, position INTEGER NOT NULL, start_char INTEGER NOT NULL, end_char INTEGER NOT NULL ); """ ) self._conn.execute( """ CREATE TABLE IF NOT EXISTS schemas ( id VARCHAR PRIMARY KEY, corpus_id VARCHAR NOT NULL REFERENCES corpora(id), name VARCHAR NOT NULL, schema_def VARCHAR NOT NULL, is_active BOOLEAN DEFAULT FALSE, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, UNIQUE(corpus_id, name) ); """ ) self._conn.execute( f""" CREATE TABLE IF NOT EXISTS chunk_embeddings ( chunk_id VARCHAR PRIMARY KEY REFERENCES chunks(id), corpus_id VARCHAR NOT NULL, embedding FLOAT[{self.embedding_dim}] NOT NULL ); """ ) def _try_load_vss(self) -> None: """Attempt to install and load the vss extension for HNSW acceleration.""" try: self._conn.execute("INSTALL vss") self._conn.execute("LOAD vss") self._vss_available = True except Exception: self._vss_available = False def get_or_create_corpus(self, root_path: str) -> str: normalized = str(Path(root_path).resolve()) corpus_id = _stable_id("corpus", normalized) self._conn.execute( """ INSERT INTO corpora (id, root_path) VALUES (?, ?) ON CONFLICT(root_path) DO NOTHING """, [corpus_id, normalized], ) row = self._conn.execute( "SELECT id FROM corpora WHERE root_path = ?", [normalized], ).fetchone() if row is None: raise RuntimeError(f"Failed to create corpus for path: {normalized}") return str(row[0]) def get_corpus_id(self, root_path: str) -> str | None: normalized = str(Path(root_path).resolve()) row = self._conn.execute( "SELECT id FROM corpora WHERE root_path = ?", [normalized], ).fetchone() if row is None: return None return str(row[0]) def upsert_document( self, document: DocumentRecord, chunks: list[ChunkRecord] ) -> None: # Cascade-delete embeddings for old chunks, then remove old chunks. self._conn.execute( """ DELETE FROM chunk_embeddings WHERE chunk_id IN (SELECT id FROM chunks WHERE doc_id = ?) """, [document.id], ) self._conn.execute("DELETE FROM chunks WHERE doc_id = ?", [document.id]) self._conn.execute( """ INSERT INTO documents ( id, corpus_id, relative_path, absolute_path, content, metadata_json, file_mtime, file_size, content_sha256, is_deleted ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, FALSE) ON CONFLICT(id) DO UPDATE SET corpus_id = excluded.corpus_id, relative_path = excluded.relative_path, absolute_path = excluded.absolute_path, content = excluded.content, metadata_json = excluded.metadata_json, file_mtime = excluded.file_mtime, file_size = excluded.file_size, content_sha256 = excluded.content_sha256, last_indexed_at = now(), is_deleted = FALSE """, [ document.id, document.corpus_id, document.relative_path, document.absolute_path, document.content, document.metadata_json, document.file_mtime, document.file_size, document.content_sha256, ], ) if chunks: self._conn.executemany( """ INSERT INTO chunks (id, doc_id, text, position, start_char, end_char) VALUES (?, ?, ?, ?, ?, ?) """, [ ( chunk.id, chunk.doc_id, chunk.text, chunk.position, chunk.start_char, chunk.end_char, ) for chunk in chunks ], ) def mark_deleted_missing_documents( self, *, corpus_id: str, active_relative_paths: set[str], ) -> int: if not active_relative_paths: self._conn.execute( """ UPDATE documents SET is_deleted = TRUE WHERE corpus_id = ? 
AND is_deleted = FALSE """, [corpus_id], ) else: placeholders = ", ".join(["?"] * len(active_relative_paths)) params: list[Any] = [corpus_id] params.extend(sorted(active_relative_paths)) self._conn.execute( f""" UPDATE documents SET is_deleted = TRUE WHERE corpus_id = ? AND is_deleted = FALSE AND relative_path NOT IN ({placeholders}) """, params, ) row = self._conn.execute( """ SELECT COUNT(*) FROM documents WHERE corpus_id = ? AND is_deleted = TRUE """, [corpus_id], ).fetchone() return int(row[0]) if row else 0 def list_documents( self, *, corpus_id: str, include_deleted: bool = False, ) -> list[dict[str, Any]]: sql = """ SELECT id, relative_path, absolute_path, file_size, file_mtime, is_deleted FROM documents WHERE corpus_id = ? """ params: list[Any] = [corpus_id] if not include_deleted: sql += " AND is_deleted = FALSE" sql += " ORDER BY relative_path" rows = self._conn.execute(sql, params).fetchall() results: list[dict[str, Any]] = [] for row in rows: results.append( { "id": str(row[0]), "relative_path": str(row[1]), "absolute_path": str(row[2]), "file_size": int(row[3]), "file_mtime": float(row[4]), "is_deleted": bool(row[5]), } ) return results def count_chunks(self, *, corpus_id: str) -> int: row = self._conn.execute( """ SELECT COUNT(*) FROM chunks c JOIN documents d ON d.id = c.doc_id WHERE d.corpus_id = ? AND d.is_deleted = FALSE """, [corpus_id], ).fetchone() return int(row[0]) if row else 0 def search_chunks( self, *, corpus_id: str, query: str, limit: int = 5, ) -> list[dict[str, Any]]: terms = _query_terms(query) if not terms: return [] score_expr = " + ".join( ["CASE WHEN lower(c.text) LIKE '%' || ? || '%' THEN 1 ELSE 0 END"] * len(terms) ) sql = f""" SELECT * FROM ( SELECT d.id AS doc_id, d.relative_path, d.absolute_path, c.position, c.text, ({score_expr}) AS score FROM chunks c JOIN documents d ON d.id = c.doc_id WHERE d.corpus_id = ? AND d.is_deleted = FALSE ) ranked WHERE score > 0 ORDER BY score DESC, relative_path ASC, position ASC LIMIT ? """ params: list[Any] = [] params.extend(terms) params.append(corpus_id) params.append(limit) rows = self._conn.execute(sql, params).fetchall() results: list[dict[str, Any]] = [] for row in rows: results.append( { "doc_id": str(row[0]), "relative_path": str(row[1]), "absolute_path": str(row[2]), "position": int(row[3]), "text": str(row[4]), "score": int(row[5]), } ) return results def search_documents_by_metadata( self, *, corpus_id: str, filters: list[dict[str, Any]], limit: int = 20, ) -> list[dict[str, Any]]: if not filters: return [] sql = """ SELECT d.id, d.relative_path, d.absolute_path, substring(d.content, 1, 320) AS preview_text FROM documents d WHERE d.corpus_id = ? AND d.is_deleted = FALSE """ params: list[Any] = [corpus_id] for flt in filters: field = str(flt["field"]) operator = str(flt["operator"]) value = flt["value"] clause, clause_params = self._metadata_clause( field=field, operator=operator, value=value, ) sql += f"\n AND {clause}" params.extend(clause_params) sql += "\nORDER BY d.relative_path ASC\nLIMIT ?" 
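        # At this point `sql` carries one AND-ed predicate per filter, built by
        # _metadata_clause() (eq/ne, gt/gte/lt/lte, contains, in), plus a
        # deterministic ORDER BY relative_path / LIMIT suffix; `params` holds the
        # bind values in the same order the placeholders appear.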
params.append(limit) rows = self._conn.execute(sql, params).fetchall() metadata_score = len(filters) results: list[dict[str, Any]] = [] for row in rows: results.append( { "doc_id": str(row[0]), "relative_path": str(row[1]), "absolute_path": str(row[2]), "preview_text": str(row[3]), "metadata_score": metadata_score, } ) return results def get_document(self, *, doc_id: str) -> dict[str, Any] | None: row = self._conn.execute( """ SELECT id, corpus_id, relative_path, absolute_path, content, metadata_json, is_deleted FROM documents WHERE id = ? LIMIT 1 """, [doc_id], ).fetchone() if row is None: return None return { "id": str(row[0]), "corpus_id": str(row[1]), "relative_path": str(row[2]), "absolute_path": str(row[3]), "content": str(row[4]), "metadata_json": str(row[5]), "is_deleted": bool(row[6]), } def save_schema( self, *, corpus_id: str, name: str, schema_def: dict[str, Any], is_active: bool = True, ) -> str: schema_id = _stable_id("schema", f"{corpus_id}:{name}") if is_active: self._conn.execute( "UPDATE schemas SET is_active = FALSE WHERE corpus_id = ?", [corpus_id], ) self._conn.execute( """ INSERT INTO schemas (id, corpus_id, name, schema_def, is_active) VALUES (?, ?, ?, ?, ?) ON CONFLICT(corpus_id, name) DO UPDATE SET schema_def = excluded.schema_def, is_active = excluded.is_active """, [ schema_id, corpus_id, name, json.dumps(schema_def, sort_keys=True), is_active, ], ) return schema_id def list_schemas(self, *, corpus_id: str) -> list[SchemaRecord]: rows = self._conn.execute( """ SELECT id, corpus_id, name, schema_def, is_active, created_at FROM schemas WHERE corpus_id = ? ORDER BY created_at DESC, name ASC """, [corpus_id], ).fetchall() return [self._row_to_schema_record(row) for row in rows] def get_schema_by_name(self, *, corpus_id: str, name: str) -> SchemaRecord | None: row = self._conn.execute( """ SELECT id, corpus_id, name, schema_def, is_active, created_at FROM schemas WHERE corpus_id = ? AND name = ? LIMIT 1 """, [corpus_id, name], ).fetchone() if row is None: return None return self._row_to_schema_record(row) def get_active_schema(self, *, corpus_id: str) -> SchemaRecord | None: row = self._conn.execute( """ SELECT id, corpus_id, name, schema_def, is_active, created_at FROM schemas WHERE corpus_id = ? AND is_active = TRUE ORDER BY created_at DESC LIMIT 1 """, [corpus_id], ).fetchone() if row is None: return None return self._row_to_schema_record(row) @staticmethod def make_document_id(corpus_id: str, relative_path: str) -> str: return _stable_id("doc", f"{corpus_id}:{relative_path}") @staticmethod def make_chunk_id( doc_id: str, position: int, start_char: int, end_char: int ) -> str: return _stable_id("chunk", f"{doc_id}:{position}:{start_char}:{end_char}") @staticmethod def _row_to_schema_record(row: tuple[Any, ...]) -> SchemaRecord: return SchemaRecord( id=str(row[0]), corpus_id=str(row[1]), name=str(row[2]), schema_def=json.loads(str(row[3])), is_active=bool(row[4]), created_at=str(row[5]), ) def store_chunk_embeddings( self, *, corpus_id: str, chunk_embeddings: list[tuple[str, list[float]]], ) -> int: """Bulk-store (chunk_id, embedding) pairs. Return count written.""" if not chunk_embeddings: return 0 self._conn.executemany( """ INSERT INTO chunk_embeddings (chunk_id, corpus_id, embedding) VALUES (?, ?, ?) 
ON CONFLICT(chunk_id) DO UPDATE SET corpus_id = excluded.corpus_id, embedding = excluded.embedding """, [(cid, corpus_id, emb) for cid, emb in chunk_embeddings], ) return len(chunk_embeddings) def search_chunks_semantic( self, *, corpus_id: str, query_embedding: list[float], limit: int = 5, ) -> list[dict[str, Any]]: """Search chunks by cosine similarity against a query embedding.""" sql = """ SELECT d.id AS doc_id, d.relative_path, d.absolute_path, c.position, c.text, array_cosine_similarity(ce.embedding, ?::FLOAT[{dim}]) AS score FROM chunk_embeddings ce JOIN chunks c ON c.id = ce.chunk_id JOIN documents d ON d.id = c.doc_id WHERE ce.corpus_id = ? AND d.is_deleted = FALSE ORDER BY score DESC LIMIT ? """.format(dim=self.embedding_dim) rows = self._conn.execute(sql, [query_embedding, corpus_id, limit]).fetchall() results: list[dict[str, Any]] = [] for row in rows: results.append( { "doc_id": str(row[0]), "relative_path": str(row[1]), "absolute_path": str(row[2]), "position": int(row[3]), "text": str(row[4]), "score": float(row[5]), } ) return results def get_metadata_field_values( self, *, corpus_id: str, field_names: list[str], max_distinct: int = 10, ) -> dict[str, list[str]]: """Return up to *max_distinct* distinct non-empty values per metadata field.""" result: dict[str, list[str]] = {} for field in field_names: rows = self._conn.execute( """ SELECT DISTINCT json_extract_string(d.metadata_json, ?) AS val FROM documents d WHERE d.corpus_id = ? AND d.is_deleted = FALSE AND val IS NOT NULL AND val != '' LIMIT ? """, [f"$.{field}", corpus_id, max_distinct], ).fetchall() result[field] = [str(row[0]) for row in rows] return result def has_embeddings(self, *, corpus_id: str) -> bool: """Return True if the corpus has stored embeddings.""" row = self._conn.execute( "SELECT COUNT(*) FROM chunk_embeddings WHERE corpus_id = ?", [corpus_id], ).fetchone() return bool(row and int(row[0]) > 0) def create_hnsw_index(self, *, corpus_id: str) -> bool: """Create an HNSW index on chunk embeddings if vss is available. Returns True if the index was created, False otherwise. """ if not self._vss_available: return False try: index_name = f"hnsw_{corpus_id.replace('-', '_')}" self._conn.execute( f""" CREATE INDEX IF NOT EXISTS {index_name} ON chunk_embeddings USING HNSW (embedding) WITH (metric = 'cosine') """ ) return True except Exception: return False @staticmethod def _metadata_clause( *, field: str, operator: str, value: Any, ) -> tuple[str, list[Any]]: json_expr = "json_extract_string(d.metadata_json, ?)" json_path = f"$.{field}" if operator in {"eq", "ne"}: comparator = "=" if operator == "eq" else "<>" if isinstance(value, bool): return ( f"lower(coalesce({json_expr}, '')) {comparator} ?", [json_path, "true" if value else "false"], ) if isinstance(value, (int, float)): return ( f"try_cast({json_expr} AS DOUBLE) {comparator} ?", [json_path, float(value)], ) return ( f"lower(coalesce({json_expr}, '')) {comparator} lower(?)", [json_path, str(value)], ) if operator in {"gt", "gte", "lt", "lte"}: if not isinstance(value, (int, float)): raise ValueError( f"Metadata operator {operator!r} requires numeric value for field {field!r}." ) comparator_map = { "gt": ">", "gte": ">=", "lt": "<", "lte": "<=", } comparator = comparator_map[operator] return ( f"try_cast({json_expr} AS DOUBLE) {comparator} ?", [json_path, float(value)], ) if operator == "contains": return ( f"lower(coalesce({json_expr}, '')) LIKE '%' || lower(?) 
|| '%'", [json_path, str(value)], ) if operator == "in": if not isinstance(value, list) or not value: raise ValueError( f"Metadata `in` filter for field {field!r} has no values." ) if all(isinstance(item, bool) for item in value): placeholders = ", ".join(["?"] * len(value)) return ( f"lower(coalesce({json_expr}, '')) IN ({placeholders})", [ json_path, *["true" if bool(item) else "false" for item in value], ], ) if all( isinstance(item, (int, float)) and not isinstance(item, bool) for item in value ): placeholders = ", ".join(["?"] * len(value)) return ( f"try_cast({json_expr} AS DOUBLE) IN ({placeholders})", [json_path, *[float(item) for item in value]], ) placeholders = ", ".join(["?"] * len(value)) return ( f"lower(coalesce({json_expr}, '')) IN ({placeholders})", [json_path, *[str(item).lower() for item in value]], ) raise ValueError(f"Unsupported metadata operator: {operator!r}") ================================================ FILE: src/fs_explorer/ui.html ================================================ fs-explorer

[ui.html markup stripped during extraction. Recoverable text: header "fs-explorer" with version "v0.1.0" and status "Ready"; panels for Target Folder (default "."), Query, and Retrieval; an Execution Log showing "Awaiting query..." / "Enter a question to begin document exploration"; a Response panel showing "No results yet" / "Results with citations will appear here"; footer "Powered by Gemini 3 Flash · Documents parsed with Docling".]
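The stripped page above drives the `/ws/explore` WebSocket endpoint defined in `server.py`. A minimal headless client speaking the same protocol might look like the sketch below; it assumes the third-party `websockets` package and a server running on the default 127.0.0.1:8000, and is illustrative rather than part of the repository.

```python
import asyncio
import json

import websockets  # assumed third-party client library, not a repo dependency


async def explore(task: str, folder: str = ".") -> dict:
    """Submit one task to /ws/explore and return the final complete/error event."""
    async with websockets.connect("ws://127.0.0.1:8000/ws/explore") as ws:
        await ws.send(json.dumps({"task": task, "folder": folder, "use_index": False}))
        while True:
            event = json.loads(await ws.recv())
            if event["type"] == "ask_human":
                # The server blocks until it receives a human_response message.
                answer = input(event["data"]["question"] + " ")
                await ws.send(json.dumps({"type": "human_response", "response": answer}))
            elif event["type"] in {"complete", "error"}:
                return event
            # start / tool_call / go_deeper events are progress updates only.


if __name__ == "__main__":
    final = asyncio.run(explore("What is the purchase price?", folder="data/test_acquisition"))
    print(json.dumps(final, indent=2))
```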
================================================ FILE: src/fs_explorer/workflow.py ================================================ """ Workflow orchestration for the FsExplorer agent. This module defines the event-driven workflow that coordinates the agent's exploration of the filesystem, handling tool calls, directory navigation, and human interaction. """ import contextvars import os from workflows import Workflow, Context, step from workflows.events import ( StartEvent, StopEvent, Event, InputRequiredEvent, HumanResponseEvent, ) from workflows.resource import Resource from pydantic import BaseModel from typing import Annotated, cast, Any from .agent import FsExplorerAgent from .models import GoDeeperAction, ToolCallAction, StopAction, AskHumanAction, Action from .fs import describe_dir_content # Per-asyncio-task agent storage — each WebSocket connection gets its own. _AGENT_VAR: contextvars.ContextVar[FsExplorerAgent | None] = contextvars.ContextVar( "_AGENT_VAR", default=None ) def get_agent() -> FsExplorerAgent: """Get or create the agent instance for the current context.""" agent = _AGENT_VAR.get() if agent is None: agent = FsExplorerAgent() _AGENT_VAR.set(agent) return agent def reset_agent() -> None: """Reset the agent instance for the current context.""" _AGENT_VAR.set(None) class WorkflowState(BaseModel): """State maintained throughout the workflow execution.""" initial_task: str = "" root_directory: str = "." current_directory: str = "." use_index: bool = False enable_semantic: bool = False enable_metadata: bool = False class InputEvent(StartEvent): """Initial event containing the user's task.""" task: str folder: str = "." use_index: bool = False enable_semantic: bool = False enable_metadata: bool = False class GoDeeperEvent(Event): """Event triggered when navigating into a subdirectory.""" directory: str reason: str class ToolCallEvent(Event): """Event triggered when executing a tool.""" tool_name: str tool_input: dict[str, Any] reason: str class AskHumanEvent(InputRequiredEvent): """Event triggered when human input is required.""" question: str reason: str class HumanAnswerEvent(HumanResponseEvent): """Event containing the human's response.""" response: str class ExplorationEndEvent(StopEvent): """Event signaling the end of exploration.""" final_result: str | None = None error: str | None = None # Type alias for the union of possible workflow events WorkflowEvent = ExplorationEndEvent | GoDeeperEvent | ToolCallEvent | AskHumanEvent def _handle_action_result( action: Action, action_type: str, ctx: Context[WorkflowState], ) -> WorkflowEvent: """ Convert an action result into the appropriate workflow event. This helper extracts the common logic for handling agent action results, reducing code duplication across workflow steps. 
Args: action: The action returned by the agent action_type: The type of action ("godeeper", "toolcall", "askhuman", "stop") ctx: The workflow context for state updates and event streaming Returns: The appropriate workflow event based on the action type """ if action_type == "godeeper": godeeper = cast(GoDeeperAction, action.action) event = GoDeeperEvent(directory=godeeper.directory, reason=action.reason) ctx.write_event_to_stream(event) return event elif action_type == "toolcall": toolcall = cast(ToolCallAction, action.action) event = ToolCallEvent( tool_name=toolcall.tool_name, tool_input=toolcall.to_fn_args(), reason=action.reason, ) ctx.write_event_to_stream(event) return event elif action_type == "askhuman": askhuman = cast(AskHumanAction, action.action) # InputRequiredEvent is written to the stream by default return AskHumanEvent(question=askhuman.question, reason=action.reason) else: # stop stopaction = cast(StopAction, action.action) return ExplorationEndEvent(final_result=stopaction.final_result) async def _process_agent_action( agent: FsExplorerAgent, ctx: Context[WorkflowState], update_directory: bool = False, ) -> WorkflowEvent: """ Process the agent's next action and return the appropriate event. Args: agent: The agent instance ctx: The workflow context update_directory: Whether to update the current directory on godeeper action Returns: The appropriate workflow event """ result = await agent.take_action() if result is None: return ExplorationEndEvent(error="Could not produce action to take") action, action_type = result # Update directory state if needed for godeeper actions if update_directory and action_type == "godeeper": godeeper = cast(GoDeeperAction, action.action) async with ctx.store.edit_state() as state: state.current_directory = godeeper.directory return _handle_action_result(action, action_type, ctx) class FsExplorerWorkflow(Workflow): """ Event-driven workflow for filesystem exploration. Coordinates the agent's actions through a series of steps: - start_exploration: Initial task processing - go_deeper_action: Directory navigation - tool_call_action: Tool execution - receive_human_answer: Human interaction handling """ @step async def start_exploration( self, ev: InputEvent, ctx: Context[WorkflowState], agent: Annotated[FsExplorerAgent, Resource(get_agent)], ) -> WorkflowEvent: """Initialize exploration with the user's task.""" root_directory = os.path.abspath(ev.folder) if not os.path.exists(root_directory) or not os.path.isdir(root_directory): return ExplorationEndEvent(error=f"No such directory: {root_directory}") async with ctx.store.edit_state() as state: state.initial_task = ev.task state.root_directory = root_directory state.current_directory = root_directory state.use_index = ev.use_index state.enable_semantic = ev.enable_semantic state.enable_metadata = ev.enable_metadata dirdescription = describe_dir_content(root_directory) if ev.enable_semantic and ev.enable_metadata: index_hint = ( "An index is available. Start with `semantic_search` (with optional " "filters) for fast retrieval, then use filesystem tools for deep dives." ) elif ev.enable_semantic: index_hint = ( "An index is available. Use `semantic_search` (no filters) for " "similarity search, then use filesystem tools for details." ) elif ev.enable_metadata: index_hint = ( "An index is available. Use `semantic_search` with metadata " "filters, then use filesystem tools for details." ) else: index_hint = "Prefer absolute paths from the directory listing when calling tools." 
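        # Seed the agent's first turn with the root-directory listing, the user's
        # task, and the retrieval hint chosen above (index-aware vs. filesystem-only).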
agent.configure_task( f"Given that the current directory ('{root_directory}') looks like this:\n\n" f"```text\n{dirdescription}\n```\n\n" f"And that the user is giving you this task: '{ev.task}', " f"what action should you take first? {index_hint}" ) return await _process_agent_action(agent, ctx, update_directory=True) @step async def go_deeper_action( self, ev: GoDeeperEvent, ctx: Context[WorkflowState], agent: Annotated[FsExplorerAgent, Resource(get_agent)], ) -> WorkflowEvent: """Handle navigation into a subdirectory.""" state = await ctx.store.get_state() dirdescription = describe_dir_content(state.current_directory) agent.configure_task( f"Given that the current directory ('{state.current_directory}') " f"looks like this:\n\n```text\n{dirdescription}\n```\n\n" f"And that the user is giving you this task: '{state.initial_task}', " f"what action should you take next?" ) return await _process_agent_action(agent, ctx, update_directory=True) @step async def receive_human_answer( self, ev: HumanAnswerEvent, ctx: Context[WorkflowState], agent: Annotated[FsExplorerAgent, Resource(get_agent)], ) -> WorkflowEvent: """Process the human's response to a question.""" state = await ctx.store.get_state() agent.configure_task( f"Human response to your question: {ev.response}\n\n" f"Based on it, proceed with your exploration based on the " f"original task: {state.initial_task}" ) return await _process_agent_action(agent, ctx, update_directory=True) @step async def tool_call_action( self, ev: ToolCallEvent, ctx: Context[WorkflowState], agent: Annotated[FsExplorerAgent, Resource(get_agent)], ) -> WorkflowEvent: """Process the result of a tool call.""" agent.configure_task( "Given the result from the tool call you just performed, " "what action should you take next?" ) return await _process_agent_action(agent, ctx, update_directory=True) # Workflow timeout for complex multi-document analysis (5 minutes) WORKFLOW_TIMEOUT_SECONDS = 300 workflow = FsExplorerWorkflow(timeout=WORKFLOW_TIMEOUT_SECONDS) ================================================ FILE: tests/__init__.py ================================================ ================================================ FILE: tests/conftest.py ================================================ """ Pytest fixtures and mocks for FsExplorer tests. Provides mock implementations of the Google GenAI client for unit testing without making actual API calls. """ from google.genai.types import ( HttpOptions, Content, GenerateContentResponse, Candidate, Part, GenerateContentResponseUsageMetadata, ) from fs_explorer.models import StopAction, Action class MockModels: """Mock implementation of the GenAI models interface.""" async def generate_content(self, *args, **kwargs) -> GenerateContentResponse: """Return a mock response with a stop action.""" return GenerateContentResponse( candidates=[ Candidate( content=Content( role="model", parts=[ Part.from_text( text=Action( action=StopAction( final_result="this is a final result" ), reason="I am done", ).model_dump_json() ) ], ) ) ], usage_metadata=GenerateContentResponseUsageMetadata( prompt_token_count=100, candidates_token_count=50, total_token_count=150, ), ) class MockAio: """Mock implementation of the async GenAI interface.""" @property def models(self) -> MockModels: """Return mock models interface.""" return MockModels() class MockGenAIClient: """ Mock implementation of the Google GenAI client. Provides predictable responses for testing without API calls. 
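    Every generate_content call returns a single StopAction ("this is a final
    result"), so mocked workflow runs terminate after one step.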
""" def __init__(self, api_key: str, http_options: HttpOptions) -> None: """Initialize mock client (ignores parameters).""" pass @property def aio(self) -> MockAio: """Return mock async interface.""" return MockAio() ================================================ FILE: tests/test_agent.py ================================================ """Tests for the FsExplorerAgent class.""" import pytest import os from unittest.mock import patch from google.genai import Client as GenAIClient from google.genai.types import HttpOptions from fs_explorer.agent import ( FsExplorerAgent, SYSTEM_PROMPT, TokenUsage, _build_system_prompt, set_search_flags, get_search_flags, clear_index_context, ) from fs_explorer.models import Action, StopAction from .conftest import MockGenAIClient class TestAgentInitialization: """Tests for agent initialization.""" @patch.dict(os.environ, {"GOOGLE_API_KEY": "test-api-key"}) def test_agent_init_with_env_key(self) -> None: """Test agent initialization with API key from environment.""" agent = FsExplorerAgent() assert isinstance(agent._client, GenAIClient) assert len(agent._chat_history) == 0 # No system prompt in history assert isinstance(agent.token_usage, TokenUsage) def test_agent_init_with_explicit_key(self) -> None: """Test agent initialization with explicit API key.""" agent = FsExplorerAgent(api_key="explicit-test-key") assert isinstance(agent._client, GenAIClient) def test_agent_init_without_key_raises(self) -> None: """Test that initialization without API key raises ValueError.""" # Ensure no key in environment env = os.environ.copy() if "GOOGLE_API_KEY" in env: del env["GOOGLE_API_KEY"] with patch.dict(os.environ, env, clear=True): with pytest.raises(ValueError, match="GOOGLE_API_KEY not found"): FsExplorerAgent() class TestAgentConfiguration: """Tests for agent task configuration.""" @patch.dict(os.environ, {"GOOGLE_API_KEY": "test-api-key"}) def test_configure_task_adds_to_history(self) -> None: """Test that configure_task adds message to chat history.""" agent = FsExplorerAgent() agent.configure_task("this is a task") assert len(agent._chat_history) == 1 assert agent._chat_history[0].role == "user" assert agent._chat_history[0].parts[0].text == "this is a task" @patch.dict(os.environ, {"GOOGLE_API_KEY": "test-api-key"}) def test_multiple_configure_task_calls(self) -> None: """Test that multiple configure_task calls accumulate.""" agent = FsExplorerAgent() agent.configure_task("task 1") agent.configure_task("task 2") assert len(agent._chat_history) == 2 assert agent._chat_history[0].parts[0].text == "task 1" assert agent._chat_history[1].parts[0].text == "task 2" class TestAgentActions: """Tests for agent action handling.""" @pytest.mark.asyncio @patch.dict(os.environ, {"GOOGLE_API_KEY": "test-api-key"}) async def test_take_action_returns_action(self) -> None: """Test that take_action returns an action from the model.""" agent = FsExplorerAgent() agent.configure_task("this is a task") agent._client = MockGenAIClient( api_key="test", http_options=HttpOptions(api_version="v1beta") ) result = await agent.take_action() assert result is not None action, action_type = result assert isinstance(action, Action) assert isinstance(action.action, StopAction) assert action.action.final_result == "this is a final result" assert action.reason == "I am done" assert action_type == "stop" @patch.dict(os.environ, {"GOOGLE_API_KEY": "test-api-key"}) def test_reset_clears_history(self) -> None: """Test that reset clears chat history and token usage.""" agent = FsExplorerAgent() 
agent.configure_task("task 1") agent.token_usage.api_calls = 5 agent.reset() assert len(agent._chat_history) == 0 assert agent.token_usage.api_calls == 0 class TestTokenUsage: """Tests for TokenUsage tracking.""" def test_add_api_call(self) -> None: """Test adding API call metrics.""" usage = TokenUsage() usage.add_api_call(100, 50) assert usage.prompt_tokens == 100 assert usage.completion_tokens == 50 assert usage.total_tokens == 150 assert usage.api_calls == 1 def test_add_tool_result_parse_file(self) -> None: """Test tracking parse_file tool usage.""" usage = TokenUsage() usage.add_tool_result("document content here", "parse_file") assert usage.documents_parsed == 1 assert usage.tool_result_chars == len("document content here") def test_add_tool_result_scan_folder(self) -> None: """Test tracking scan_folder tool usage.""" usage = TokenUsage() # Simulating scan output with document markers result = "│ [1/3] doc1.pdf\n│ [2/3] doc2.pdf\n│ [3/3] doc3.pdf" usage.add_tool_result(result, "scan_folder") assert usage.documents_scanned == 3 def test_summary_format(self) -> None: """Test that summary produces formatted output.""" usage = TokenUsage() usage.add_api_call(1000, 500) summary = usage.summary() assert "TOKEN USAGE SUMMARY" in summary assert "1,000" in summary # Formatted prompt tokens assert "API Calls:" in summary assert "Est. Cost" in summary class TestSystemPrompt: """Tests for system prompt configuration.""" def test_system_prompt_contains_tools(self) -> None: """Test that system prompt documents all tools.""" assert "scan_folder" in SYSTEM_PROMPT assert "preview_file" in SYSTEM_PROMPT assert "parse_file" in SYSTEM_PROMPT assert "read" in SYSTEM_PROMPT assert "grep" in SYSTEM_PROMPT assert "glob" in SYSTEM_PROMPT def test_system_prompt_contains_strategy(self) -> None: """Test that system prompt includes exploration strategy.""" assert "Three-Phase" in SYSTEM_PROMPT or "PHASE" in SYSTEM_PROMPT assert "Parallel Scan" in SYSTEM_PROMPT or "PARALLEL" in SYSTEM_PROMPT assert "Backtracking" in SYSTEM_PROMPT or "BACKTRACK" in SYSTEM_PROMPT def test_system_prompt_contains_index_tools(self) -> None: """Test that system prompt documents index-aware tools.""" assert "semantic_search" in SYSTEM_PROMPT assert "get_document" in SYSTEM_PROMPT assert "list_indexed_documents" in SYSTEM_PROMPT class TestSearchFlags: """Tests for search flag state and dynamic system prompt.""" def setup_method(self) -> None: clear_index_context() def teardown_method(self) -> None: clear_index_context() def test_set_and_get_search_flags(self) -> None: assert get_search_flags() == (False, False) set_search_flags(enable_semantic=True, enable_metadata=False) assert get_search_flags() == (True, False) set_search_flags(enable_semantic=False, enable_metadata=False) assert get_search_flags() == (False, False) def test_clear_index_context_resets_flags(self) -> None: set_search_flags(enable_semantic=True, enable_metadata=True) clear_index_context() assert get_search_flags() == (False, False) def test_build_system_prompt_no_index(self) -> None: prompt = _build_system_prompt(False, False) assert prompt == SYSTEM_PROMPT def test_build_system_prompt_semantic_only(self) -> None: prompt = _build_system_prompt(True, False) assert "Semantic Only" in prompt assert "WITHOUT the `filters`" in prompt def test_build_system_prompt_metadata_only(self) -> None: prompt = _build_system_prompt(False, True) assert "Metadata Only" in prompt assert "metadata filtering" in prompt def test_build_system_prompt_both(self) -> None: prompt = 
_build_system_prompt(True, True) assert "Semantic + Metadata" in prompt @patch.dict(os.environ, {"GOOGLE_API_KEY": "test-api-key"}) def test_all_tools_always_available(self) -> None: """Filesystem and indexed tools are never blocked.""" set_search_flags(enable_semantic=False, enable_metadata=False) agent = FsExplorerAgent() agent.configure_task("test") agent.call_tool("glob", {"directory": "/tmp", "pattern": "*.md"}) last = agent._chat_history[-1] assert "not available" not in last.parts[0].text ================================================ FILE: tests/test_cli_indexing.py ================================================ """CLI tests for indexing and schema commands.""" from pathlib import Path import fs_explorer.indexing.pipeline as pipeline_module import fs_explorer.main as main_module from fs_explorer.storage import DuckDBStorage from typer.testing import CliRunner def test_root_task_mode_remains_compatible(tmp_path: Path, monkeypatch) -> None: called: dict[str, object] = {} async def fake_run_workflow( task: str, folder: str = ".", *, use_index: bool = False, db_path: str | None = None, ) -> None: called["task"] = task called["folder"] = folder called["use_index"] = use_index called["db_path"] = db_path monkeypatch.setattr(main_module, "run_workflow", fake_run_workflow) runner = CliRunner() result = runner.invoke( main_module.app, ["--task", "who is the CTO?", "--folder", str(tmp_path)], ) assert result.exit_code == 0 assert called["task"] == "who is the CTO?" assert called["folder"] == str(tmp_path) assert called["use_index"] is False def test_query_command_enables_index_mode(tmp_path: Path, monkeypatch) -> None: called: dict[str, object] = {} async def fake_run_workflow( task: str, folder: str = ".", *, use_index: bool = False, db_path: str | None = None, ) -> None: called["task"] = task called["folder"] = folder called["use_index"] = use_index called["db_path"] = db_path monkeypatch.setattr(main_module, "run_workflow", fake_run_workflow) runner = CliRunner() result = runner.invoke( main_module.app, [ "query", "--task", "purchase price?", "--folder", str(tmp_path), "--db-path", "tmp.duckdb", ], ) assert result.exit_code == 0 assert called["task"] == "purchase price?" assert called["folder"] == str(tmp_path) assert called["use_index"] is True assert called["db_path"] == "tmp.duckdb" def test_index_and_schema_commands(tmp_path: Path, monkeypatch) -> None: corpus = tmp_path / "corpus" corpus.mkdir() (corpus / "agreement.md").write_text("Purchase price is $10.") (corpus / "risk_report.md").write_text("Risk summary here.") # Replace Docling path with plain text read for this unit test. 
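    # parse_file is resolved through the pipeline module's namespace at call time,
    # so patching pipeline_module.parse_file is enough to bypass Docling here.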
monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = tmp_path / "index.duckdb" runner = CliRunner() index_result = runner.invoke( main_module.app, ["index", str(corpus), "--db-path", str(db_path), "--discover-schema"], ) assert index_result.exit_code == 0 assert "Index Complete" in index_result.stdout show_result = runner.invoke( main_module.app, ["schema", "show", str(corpus), "--db-path", str(db_path)], ) assert show_result.exit_code == 0 assert "auto_corpus" in show_result.stdout def test_index_command_with_metadata_forces_schema_discovery( tmp_path: Path, monkeypatch, ) -> None: called: dict[str, object] = {} class FakePipeline: def __init__(self, storage, embedding_provider=None) -> None: # noqa: ANN001 called["storage_type"] = type(storage).__name__ def index_folder( self, folder: str, *, discover_schema: bool = False, schema_name: str | None = None, with_metadata: bool = False, metadata_profile: dict | None = None, ): called["folder"] = folder called["discover_schema"] = discover_schema called["schema_name"] = schema_name called["with_metadata"] = with_metadata called["metadata_profile"] = metadata_profile return pipeline_module.IndexingResult( corpus_id="corpus_123", indexed_files=1, skipped_files=0, deleted_files=0, chunks_written=1, active_documents=1, schema_used="auto_corpus", ) monkeypatch.setattr(main_module, "IndexingPipeline", FakePipeline) db_path = tmp_path / "index.duckdb" corpus = tmp_path / "corpus" corpus.mkdir() runner = CliRunner() result = runner.invoke( main_module.app, ["index", str(corpus), "--db-path", str(db_path), "--with-metadata"], ) assert result.exit_code == 0 assert called["with_metadata"] is True assert called["discover_schema"] is True assert called["metadata_profile"] is None def test_index_command_with_metadata_profile_path( tmp_path: Path, monkeypatch, ) -> None: called: dict[str, object] = {} class FakePipeline: def __init__(self, storage, embedding_provider=None) -> None: # noqa: ANN001 called["storage_type"] = type(storage).__name__ def index_folder( self, folder: str, *, discover_schema: bool = False, schema_name: str | None = None, with_metadata: bool = False, metadata_profile: dict | None = None, ): called["folder"] = folder called["discover_schema"] = discover_schema called["schema_name"] = schema_name called["with_metadata"] = with_metadata called["metadata_profile"] = metadata_profile return pipeline_module.IndexingResult( corpus_id="corpus_123", indexed_files=1, skipped_files=0, deleted_files=0, chunks_written=1, active_documents=1, schema_used="auto_corpus", ) monkeypatch.setattr(main_module, "IndexingPipeline", FakePipeline) db_path = tmp_path / "index.duckdb" corpus = tmp_path / "corpus" corpus.mkdir() metadata_profile_path = tmp_path / "profile.json" metadata_profile_path.write_text( ( "{" '"prompt_description": "Extract organizations.", ' '"fields": [' '{"name": "org_names", "type": "string", "source_class": "organization"}' "]" "}" ) ) runner = CliRunner() result = runner.invoke( main_module.app, [ "index", str(corpus), "--db-path", str(db_path), "--metadata-profile", str(metadata_profile_path), ], ) assert result.exit_code == 0 assert called["with_metadata"] is True assert called["discover_schema"] is True assert isinstance(called["metadata_profile"], dict) assert called["metadata_profile"]["fields"][0]["name"] == "org_names" def test_index_command_with_embeddings_flag( tmp_path: Path, monkeypatch, ) -> None: """--with-embeddings creates an EmbeddingProvider and passes it to 
the pipeline.""" calls: dict[str, object] = {} class FakePipeline: def __init__(self, storage, embedding_provider=None) -> None: # noqa: ANN001 calls["has_embedding_provider"] = embedding_provider is not None def index_folder(self, folder, **kwargs): # noqa: ANN001, ANN003 return pipeline_module.IndexingResult( corpus_id="corpus_123", indexed_files=1, skipped_files=0, deleted_files=0, chunks_written=1, active_documents=1, schema_used=None, embeddings_written=5, ) class FakeEmbeddingProvider: def __init__(self, **kwargs): # noqa: ANN003 pass monkeypatch.setattr(main_module, "IndexingPipeline", FakePipeline) monkeypatch.setattr(main_module, "EmbeddingProvider", FakeEmbeddingProvider) db_path = tmp_path / "index.duckdb" corpus = tmp_path / "corpus" corpus.mkdir() runner = CliRunner() result = runner.invoke( main_module.app, ["index", str(corpus), "--db-path", str(db_path), "--with-embeddings"], ) assert result.exit_code == 0 assert calls["has_embedding_provider"] is True assert "Embeddings Written" in result.stdout def test_auto_index_env_var_enables_use_index( tmp_path: Path, monkeypatch, ) -> None: """FS_EXPLORER_AUTO_INDEX=1 auto-enables --use-index when index exists.""" called: dict[str, object] = {} async def fake_run_workflow( task: str, folder: str = ".", *, use_index: bool = False, db_path: str | None = None, ) -> None: called["use_index"] = use_index monkeypatch.setattr(main_module, "run_workflow", fake_run_workflow) monkeypatch.setenv("FS_EXPLORER_AUTO_INDEX", "1") # Create a real DuckDB with a corpus entry so auto-index detection works. corpus = tmp_path / "corpus" corpus.mkdir() db_path = tmp_path / "index.duckdb" storage = DuckDBStorage(str(db_path)) storage.get_or_create_corpus(str(corpus.resolve())) storage.close() monkeypatch.setenv("FS_EXPLORER_DB_PATH", str(db_path)) runner = CliRunner() result = runner.invoke( main_module.app, ["--task", "test question", "--folder", str(corpus)], ) assert result.exit_code == 0 assert called["use_index"] is True def test_auto_index_env_var_silent_fallback( tmp_path: Path, monkeypatch, ) -> None: """FS_EXPLORER_AUTO_INDEX=1 gracefully falls back when no index exists.""" called: dict[str, object] = {} async def fake_run_workflow( task: str, folder: str = ".", *, use_index: bool = False, db_path: str | None = None, ) -> None: called["use_index"] = use_index monkeypatch.setattr(main_module, "run_workflow", fake_run_workflow) monkeypatch.setenv("FS_EXPLORER_AUTO_INDEX", "1") corpus = tmp_path / "empty_corpus" corpus.mkdir() runner = CliRunner() result = runner.invoke( main_module.app, ["--task", "test question", "--folder", str(corpus)], ) assert result.exit_code == 0 assert called["use_index"] is False ================================================ FILE: tests/test_e2e.py ================================================ import pytest import os from workflows.testing import WorkflowTestRunner SKIP_IF, SKIP_REASON = ( os.getenv("GOOGLE_API_KEY") is None, "GOOGLE_API_KEY not available", ) @pytest.mark.asyncio @pytest.mark.skipif(condition=SKIP_IF, reason=SKIP_REASON) async def test_e2e() -> None: from fs_explorer.workflow import ( workflow, InputEvent, ExplorationEndEvent, ToolCallEvent, GoDeeperEvent, ) start_event = InputEvent( task="Starting from the current directory, individuate the python file responsible for file system operations and explain what it does" ) runner = WorkflowTestRunner(workflow=workflow) result = await runner.run(start_event=start_event) assert isinstance(result.result, ExplorationEndEvent) assert result.result.error is 
None assert result.result.final_result is not None assert len(result.collected) > 1 assert ToolCallEvent in result.event_types or GoDeeperEvent in result.event_types ================================================ FILE: tests/test_embeddings.py ================================================ """Tests for the embedding provider.""" from __future__ import annotations import os from dataclasses import dataclass from typing import Any import pytest from fs_explorer.embeddings import EmbeddingProvider # --------------------------------------------------------------------------- # Mock helpers # --------------------------------------------------------------------------- @dataclass class _FakeEmbedding: values: list[float] @dataclass class _FakeEmbedResult: embeddings: list[_FakeEmbedding] class _FakeModels: """Records calls and returns deterministic embeddings.""" def __init__(self) -> None: self.calls: list[dict[str, Any]] = [] def embed_content( self, *, model: str, contents: list[str], config: dict ) -> _FakeEmbedResult: self.calls.append({"model": model, "contents": contents, "config": config}) dim = config.get("output_dimensionality", 768) return _FakeEmbedResult( embeddings=[ _FakeEmbedding(values=[float(i)] * dim) for i in range(len(contents)) ] ) class _FakeClient: def __init__(self) -> None: self.models = _FakeModels() # --------------------------------------------------------------------------- # Unit tests (mock-based, no API key needed) # --------------------------------------------------------------------------- def test_embed_texts_returns_correct_count() -> None: client = _FakeClient() provider = EmbeddingProvider(client=client, dim=4, batch_size=50) embeddings = provider.embed_texts(["hello", "world"]) assert len(embeddings) == 2 assert len(embeddings[0]) == 4 def test_embed_texts_uses_document_task_type() -> None: client = _FakeClient() provider = EmbeddingProvider(client=client, dim=4) provider.embed_texts(["test"]) call = client.models.calls[0] assert call["config"]["task_type"] == "RETRIEVAL_DOCUMENT" def test_embed_query_uses_query_task_type() -> None: client = _FakeClient() provider = EmbeddingProvider(client=client, dim=4) result = provider.embed_query("search query") assert len(result) == 4 call = client.models.calls[0] assert call["config"]["task_type"] == "RETRIEVAL_QUERY" def test_embed_texts_batching() -> None: client = _FakeClient() provider = EmbeddingProvider(client=client, dim=4, batch_size=3) texts = [f"text_{i}" for i in range(7)] embeddings = provider.embed_texts(texts) assert len(embeddings) == 7 # 7 texts with batch_size=3 → 3 API calls (3+3+1) assert len(client.models.calls) == 3 assert len(client.models.calls[0]["contents"]) == 3 assert len(client.models.calls[1]["contents"]) == 3 assert len(client.models.calls[2]["contents"]) == 1 def test_env_overrides(monkeypatch) -> None: client = _FakeClient() monkeypatch.setenv("FS_EXPLORER_EMBEDDING_MODEL", "custom-model-001") monkeypatch.setenv("FS_EXPLORER_EMBEDDING_DIM", "256") monkeypatch.setenv("FS_EXPLORER_EMBEDDING_BATCH_SIZE", "10") provider = EmbeddingProvider(client=client) assert provider.model == "custom-model-001" assert provider.dim == 256 assert provider.batch_size == 10 provider.embed_texts(["test"]) call = client.models.calls[0] assert call["model"] == "custom-model-001" assert call["config"]["output_dimensionality"] == 256 def test_missing_api_key_raises(monkeypatch) -> None: monkeypatch.delenv("GOOGLE_API_KEY", raising=False) with pytest.raises(ValueError, match="GOOGLE_API_KEY"): 
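# With neither an injected client nor GOOGLE_API_KEY in the environment, provider
# construction is expected to fail with a ValueError naming the missing key.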
EmbeddingProvider(api_key=None, client=None) # --------------------------------------------------------------------------- # Real API integration test (skipped unless GOOGLE_API_KEY is set) # --------------------------------------------------------------------------- @pytest.mark.skipif( not os.getenv("GOOGLE_API_KEY"), reason="GOOGLE_API_KEY not set — skipping real embedding test", ) def test_real_embedding_api() -> None: provider = EmbeddingProvider(dim=128) texts = ["The purchase price is $45 million.", "Risk assessment summary."] embeddings = provider.embed_texts(texts) assert len(embeddings) == 2 assert len(embeddings[0]) == 128 assert all(isinstance(v, float) for v in embeddings[0]) query_emb = provider.embed_query("purchase price") assert len(query_emb) == 128 ================================================ FILE: tests/test_exploration_trace.py ================================================ """Tests for exploration trace helpers.""" import os from fs_explorer.exploration_trace import ( ExplorationTrace, extract_cited_sources, normalize_path, ) def test_normalize_path_relative() -> None: root = "/tmp/project" assert normalize_path("docs/file.pdf", root) == os.path.abspath("/tmp/project/docs/file.pdf") def test_normalize_path_absolute() -> None: root = "/tmp/project" assert normalize_path("/var/data/file.pdf", root) == os.path.abspath("/var/data/file.pdf") def test_trace_records_steps_and_documents() -> None: trace = ExplorationTrace(root_directory="/tmp/project") trace.record_tool_call( step_number=1, tool_name="scan_folder", tool_input={"directory": "docs"}, ) trace.record_tool_call( step_number=2, tool_name="parse_file", tool_input={"file_path": "docs/contract.pdf"}, ) trace.record_go_deeper(step_number=3, directory="docs/subdir") assert len(trace.step_path) == 3 assert "tool:scan_folder" in trace.step_path[0] assert "tool:parse_file" in trace.step_path[1] assert "godeeper" in trace.step_path[2] referenced = trace.sorted_documents() assert len(referenced) == 1 assert referenced[0].endswith("docs/contract.pdf") def test_trace_records_resolved_document_paths() -> None: trace = ExplorationTrace(root_directory="/tmp/project") trace.record_tool_call( step_number=1, tool_name="get_document", tool_input={"doc_id": "doc_123"}, resolved_document_path="/tmp/project/docs/indexed.pdf", ) assert "document=/tmp/project/docs/indexed.pdf" in trace.step_path[0] assert trace.sorted_documents() == ["/tmp/project/docs/indexed.pdf"] def test_extract_cited_sources_ordered_unique() -> None: final_result = ( "Price is $10M [Source: agreement.pdf, Section 2.1]. " "Escrow is $1M [Source: escrow.pdf, Section 3]. " "Reconfirmed [Source: agreement.pdf, Section 2.1]." 
) assert extract_cited_sources(final_result) == ["agreement.pdf", "escrow.pdf"] ================================================ FILE: tests/test_fs.py ================================================ """Tests for filesystem utility functions.""" import pytest import os import tempfile from pathlib import Path from fs_explorer.fs import ( describe_dir_content, read_file, grep_file_content, glob_paths, parse_file, preview_file, scan_folder, clear_document_cache, SUPPORTED_EXTENSIONS, ) class TestDescribeDirContent: """Tests for describe_dir_content function.""" def test_valid_directory(self) -> None: """Test describing a valid directory with files and subfolders.""" description = describe_dir_content("tests/testfiles") assert "Content of tests/testfiles" in description assert "tests/testfiles/file1.txt" in description assert "tests/testfiles/file2.md" in description assert "tests/testfiles/last" in description def test_nonexistent_directory(self) -> None: """Test describing a directory that doesn't exist.""" description = describe_dir_content("tests/testfile") assert description == "No such directory: tests/testfile" def test_directory_without_subfolders(self) -> None: """Test describing a directory that has no subdirectories.""" description = describe_dir_content("tests/testfiles/last") assert "Content of tests/testfiles/last" in description assert "tests/testfiles/last/lastfile.txt" in description assert "This folder does not have any sub-folders" in description class TestReadFile: """Tests for read_file function.""" def test_valid_file(self) -> None: """Test reading a valid text file.""" content = read_file("tests/testfiles/file1.txt") assert content.strip() == "this is a test" def test_nonexistent_file(self) -> None: """Test reading a file that doesn't exist.""" content = read_file("tests/testfiles/file2.txt") assert content == "No such file: tests/testfiles/file2.txt" class TestGrepFileContent: """Tests for grep_file_content function.""" def test_pattern_match(self) -> None: """Test searching for a pattern that exists.""" result = grep_file_content("tests/testfiles/file2.md", r"(are|is) a test") assert "MATCHES for (are|is) a test" in result assert "is" in result def test_no_match(self) -> None: """Test searching for a pattern that doesn't exist.""" result = grep_file_content("tests/testfiles/last/lastfile.txt", r"test") assert result == "No matches found" def test_nonexistent_file(self) -> None: """Test searching in a file that doesn't exist.""" result = grep_file_content("tests/testfiles/file2.txt", r"test") assert result == "No such file: tests/testfiles/file2.txt" class TestGlobPaths: """Tests for glob_paths function.""" def test_pattern_match(self) -> None: """Test finding files that match a glob pattern.""" result = glob_paths("tests/testfiles", "file?.*") assert "MATCHES for file?.* in tests/testfiles" in result assert "file1.txt" in result assert "file2.md" in result def test_no_match(self) -> None: """Test a pattern that matches nothing.""" result = glob_paths("tests/testfiles", "nonexistent*") assert result == "No matches found" def test_nonexistent_directory(self) -> None: """Test glob in a directory that doesn't exist.""" result = glob_paths("tests/nonexistent", "*.txt") assert result == "No such directory: tests/nonexistent" class TestDocumentParsing: """Tests for document parsing functions (parse_file, preview_file).""" def setup_method(self) -> None: """Clear cache before each test.""" clear_document_cache() def test_parse_file_nonexistent(self) -> None: """Test parsing 
a file that doesn't exist.""" content = parse_file("data/nonexistent.pdf") assert content == "No such file: data/nonexistent.pdf" def test_parse_file_unsupported_extension(self) -> None: """Test parsing a file with unsupported extension.""" content = parse_file("tests/testfiles/file1.txt") assert "Unsupported file extension: .txt" in content def test_preview_file_nonexistent(self) -> None: """Test previewing a file that doesn't exist.""" content = preview_file("data/nonexistent.pdf") assert content == "No such file: data/nonexistent.pdf" def test_preview_file_unsupported_extension(self) -> None: """Test previewing a file with unsupported extension.""" content = preview_file("tests/testfiles/file1.txt") assert "Unsupported file extension: .txt" in content @pytest.mark.skipif( not os.path.exists("data/large_acquisition"), reason="Test documents not generated" ) def test_parse_file_pdf(self) -> None: """Test parsing an actual PDF file.""" # Use one of the generated test PDFs pdf_files = list(Path("data/large_acquisition").glob("*.pdf")) if pdf_files: content = parse_file(str(pdf_files[0])) assert len(content) > 0 assert "Error" not in content @pytest.mark.skipif( not os.path.exists("data/large_acquisition"), reason="Test documents not generated" ) def test_preview_file_pdf(self) -> None: """Test previewing an actual PDF file.""" pdf_files = list(Path("data/large_acquisition").glob("*.pdf")) if pdf_files: content = preview_file(str(pdf_files[0]), max_chars=500) assert "=== PREVIEW of" in content # Preview should be limited assert len(content) < 2000 # Preview + header + truncation message class TestScanFolder: """Tests for scan_folder function.""" def setup_method(self) -> None: """Clear cache before each test.""" clear_document_cache() def test_nonexistent_directory(self) -> None: """Test scanning a directory that doesn't exist.""" result = scan_folder("nonexistent/path") assert result == "No such directory: nonexistent/path" def test_empty_directory(self) -> None: """Test scanning a directory with no supported documents.""" with tempfile.TemporaryDirectory() as tmpdir: # Create a non-document file Path(tmpdir, "test.txt").write_text("hello") result = scan_folder(tmpdir) assert "No supported documents found" in result @pytest.mark.skipif( not os.path.exists("data/large_acquisition"), reason="Test documents not generated" ) def test_scan_folder_with_documents(self) -> None: """Test scanning a folder with actual documents.""" result = scan_folder("data/large_acquisition", max_workers=2) assert "PARALLEL DOCUMENT SCAN" in result assert "Found" in result assert "documents" in result class TestSupportedExtensions: """Tests for supported extensions configuration.""" def test_supported_extensions_is_frozenset(self) -> None: """Verify SUPPORTED_EXTENSIONS is immutable.""" assert isinstance(SUPPORTED_EXTENSIONS, frozenset) def test_common_extensions_supported(self) -> None: """Verify common document extensions are supported.""" assert ".pdf" in SUPPORTED_EXTENSIONS assert ".docx" in SUPPORTED_EXTENSIONS assert ".md" in SUPPORTED_EXTENSIONS ================================================ FILE: tests/test_indexing.py ================================================ """Tests for indexing and schema components.""" import json import time from dataclasses import dataclass from pathlib import Path from unittest.mock import MagicMock, patch import fs_explorer.indexing.metadata as metadata_module import fs_explorer.indexing.pipeline as pipeline_module from fs_explorer.embeddings import EmbeddingProvider from 
fs_explorer.indexing.chunker import SmartChunker from fs_explorer.indexing.metadata import auto_discover_profile, normalize_langextract_profile from fs_explorer.indexing.pipeline import IndexingPipeline from fs_explorer.indexing.schema import SchemaDiscovery from fs_explorer.storage import DuckDBStorage def test_smart_chunker_overlap() -> None: text = "A" * 2500 chunker = SmartChunker(chunk_size=1000, overlap=100) chunks = chunker.chunk_text(text) assert len(chunks) == 3 assert chunks[1].start_char == chunks[0].end_char - 100 assert chunks[2].start_char == chunks[1].end_char - 100 def test_schema_discovery_from_folder(tmp_path: Path) -> None: folder = tmp_path / "corpus" folder.mkdir() (folder / "01_master_agreement.md").write_text("# agreement\nprice: $10") (folder / "04_risk_report.md").write_text("# report\nrisk summary") schema = SchemaDiscovery().discover_from_folder(str(folder)) fields = schema["fields"] field_names = {field["name"] for field in fields} assert "document_type" in field_names assert "mentions_currency" in field_names document_type_field = next( field for field in fields if field["name"] == "document_type" ) assert "agreement" in document_type_field["enum"] assert "report" in document_type_field["enum"] def test_schema_discovery_with_langextract_fields(tmp_path: Path, monkeypatch) -> None: folder = tmp_path / "corpus" folder.mkdir() (folder / "agreement.md").write_text("Purchase price with escrow and earnout.") # Mock auto_discover_profile to return the default profile so this test # stays deterministic (auto-discovery would call the real LLM). from fs_explorer.indexing.metadata import default_langextract_profile monkeypatch.setattr( "fs_explorer.indexing.schema.auto_discover_profile", lambda folder, **kwargs: default_langextract_profile(), ) schema = SchemaDiscovery().discover_from_folder( str(folder), with_langextract=True, ) field_names = {field["name"] for field in schema["fields"]} assert "lx_enabled" in field_names assert "lx_has_earnout" in field_names assert "lx_money_mentions" in field_names def test_schema_discovery_with_custom_metadata_profile(tmp_path: Path) -> None: folder = tmp_path / "corpus" folder.mkdir() (folder / "notes.md").write_text("Acme Corp retained Jane Doe for diligence.") profile = { "prompt_description": "Extract organizations and people.", "fields": [ { "name": "org_names", "type": "string", "source_class": "organization", "mode": "values", }, { "name": "person_count", "type": "integer", "source_class": "person", "mode": "count", }, ], } schema = SchemaDiscovery().discover_from_folder( str(folder), with_langextract=True, metadata_profile=profile, ) field_names = {field["name"] for field in schema["fields"]} assert "org_names" in field_names assert "person_count" in field_names assert isinstance(schema.get("metadata_profile"), dict) def test_indexing_pipeline_indexes_and_marks_deleted( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() first = corpus / "a_agreement.md" second = corpus / "b_schedule.md" first.write_text("Purchase price is $45,000,000.\n\nSection 1.2") second.write_text("Schedule details.\n\nEffective Date: January 1, 2026") # Avoid Docling in this unit test; treat markdown as plain text. 
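# The lambda below simply returns the raw file text, so these unit tests exercise
# the indexing pipeline without invoking the real document parser.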
monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = tmp_path / "index.duckdb" storage = DuckDBStorage(str(db_path)) pipeline = IndexingPipeline(storage=storage) first_result = pipeline.index_folder(str(corpus), discover_schema=True) assert first_result.indexed_files == 2 assert first_result.skipped_files == 0 assert first_result.active_documents == 2 assert first_result.schema_used is not None assert storage.count_chunks(corpus_id=first_result.corpus_id) > 0 hits = storage.search_chunks( corpus_id=first_result.corpus_id, query="purchase price", limit=3, ) assert hits top_doc = storage.get_document(doc_id=hits[0]["doc_id"]) assert top_doc is not None assert "Purchase price" in top_doc["content"] metadata_hits = storage.search_documents_by_metadata( corpus_id=first_result.corpus_id, filters=[ { "field": "document_type", "operator": "eq", "value": "agreement", } ], limit=5, ) assert metadata_hits assert any(hit["relative_path"] == "a_agreement.md" for hit in metadata_hits) assert all(hit["relative_path"] != "b_schedule.md" for hit in metadata_hits) second.unlink() second_result = pipeline.index_folder(str(corpus)) assert second_result.indexed_files == 1 assert second_result.active_documents == 1 all_docs = storage.list_documents( corpus_id=first_result.corpus_id, include_deleted=True, ) deleted_paths = {doc["relative_path"] for doc in all_docs if doc["is_deleted"]} assert "b_schedule.md" in deleted_paths def test_indexing_pipeline_with_langextract_metadata( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() doc_path = corpus / "agreement.md" doc_path.write_text("Purchase price and escrow details.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) # Use the default profile so the schema includes the expected fields from fs_explorer.indexing.metadata import default_langextract_profile monkeypatch.setattr( "fs_explorer.indexing.schema.auto_discover_profile", lambda folder, **kwargs: default_langextract_profile(), ) monkeypatch.setattr( metadata_module, "_extract_langextract_metadata", lambda **_: { "lx_enabled": True, "lx_extraction_count": 3, "lx_entity_classes": "deal_term,organization", "lx_organizations": "TechCorp Industries", "lx_people": "", "lx_deal_terms": "escrow reserve", "lx_money_mentions": 1, "lx_date_mentions": 0, "lx_has_earnout": False, "lx_has_escrow": True, }, ) storage = DuckDBStorage(str(tmp_path / "index.duckdb")) pipeline = IndexingPipeline(storage=storage) result = pipeline.index_folder( str(corpus), discover_schema=True, with_metadata=True, ) assert result.indexed_files == 1 assert result.schema_used is not None docs = storage.list_documents(corpus_id=result.corpus_id, include_deleted=False) assert len(docs) == 1 stored = storage.get_document(doc_id=docs[0]["id"]) assert stored is not None metadata = json.loads(stored["metadata_json"]) assert metadata["lx_enabled"] is True assert metadata["lx_has_escrow"] is True hits = storage.search_documents_by_metadata( corpus_id=result.corpus_id, filters=[{"field": "lx_has_escrow", "operator": "eq", "value": True}], limit=5, ) assert hits assert hits[0]["relative_path"] == "agreement.md" def test_indexing_pipeline_reuses_saved_metadata_profile( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() doc_path = corpus / "custom.md" doc_path.write_text("Acme Corp and Jane Doe signed terms.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: 
Path(file_path).read_text(), ) seen_profiles: list[dict[str, object] | None] = [] def fake_extract(**kwargs): # noqa: ANN003 seen_profiles.append(kwargs.get("profile")) return { "org_names": "Acme Corp", "person_present": True, } monkeypatch.setattr(metadata_module, "_extract_langextract_metadata", fake_extract) custom_profile = { "prompt_description": "Extract organizations and people.", "fields": [ { "name": "org_names", "type": "string", "source_class": "organization", "mode": "values", }, { "name": "person_present", "type": "boolean", "source_class": "person", "mode": "exists", }, ], } storage = DuckDBStorage(str(tmp_path / "index.duckdb")) pipeline = IndexingPipeline(storage=storage) first_result = pipeline.index_folder( str(corpus), discover_schema=True, with_metadata=True, metadata_profile=custom_profile, ) assert first_result.indexed_files == 1 assert seen_profiles and isinstance(seen_profiles[0], dict) second_result = pipeline.index_folder( str(corpus), with_metadata=True, ) assert second_result.indexed_files == 1 assert len(seen_profiles) >= 2 latest_profile = seen_profiles[-1] assert isinstance(latest_profile, dict) fields_obj = latest_profile.get("fields") assert isinstance(fields_obj, list) second_fields = { str(field["name"]) for field in fields_obj if isinstance(field, dict) and isinstance(field.get("name"), str) } assert {"org_names", "person_present"}.issubset(second_fields) # --------------------------------------------------------------------------- # Auto-profile generation tests # --------------------------------------------------------------------------- def test_auto_discover_profile_with_mock_llm( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "contract.md").write_text("TechCorp acquires StartupXYZ for $10M.") (corpus / "report.md").write_text("Quarterly revenue report for FY2025.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) monkeypatch.setenv("GOOGLE_API_KEY", "fake-key") llm_response_json = json.dumps( { "name": "test_auto", "description": "Auto-generated test profile.", "prompt_description": "Extract key metadata from documents.", "fields": [ { "name": "lx_organizations", "type": "string", "description": "Organization names.", "source": "entities", "source_classes": ["organization", "company"], "mode": "values", }, { "name": "lx_money_count", "type": "integer", "description": "Count of monetary amounts.", "source": "entities", "source_classes": ["money"], "mode": "count", }, ], } ) mock_response = MagicMock() mock_response.text = llm_response_json mock_client_instance = MagicMock() mock_client_instance.models.generate_content.return_value = mock_response with patch( "fs_explorer.indexing.metadata._get_genai_client", return_value=mock_client_instance, ): profile = auto_discover_profile(str(corpus)) # Should pass validation normalized = normalize_langextract_profile(profile) field_names = {f["name"] for f in normalized["fields"]} assert "lx_organizations" in field_names assert "lx_money_count" in field_names # Runtime fields should have been added automatically assert "lx_enabled" in field_names def test_auto_discover_profile_falls_back_on_error( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "file.md").write_text("Some content.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) monkeypatch.setenv("GOOGLE_API_KEY", "fake-key") with patch( 
"fs_explorer.indexing.metadata._get_genai_client", side_effect=RuntimeError("API down"), ): profile = auto_discover_profile(str(corpus)) # Should return default profile default_names = { f["name"] for f in metadata_module._DEFAULT_LANGEXTRACT_PROFILE["fields"] } got_names = {f["name"] for f in profile["fields"]} assert default_names == got_names def test_auto_discover_profile_falls_back_without_api_key( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "file.md").write_text("Some content.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) monkeypatch.delenv("GOOGLE_API_KEY", raising=False) profile = auto_discover_profile(str(corpus)) default_names = { f["name"] for f in metadata_module._DEFAULT_LANGEXTRACT_PROFILE["fields"] } got_names = {f["name"] for f in profile["fields"]} assert default_names == got_names def test_schema_discovery_uses_auto_profile_when_no_explicit_profile( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "contract.md").write_text("Agreement terms.") # Capture what auto_discover_profile returns (mock it) auto_profile = { "name": "auto_test", "description": "Auto-generated.", "prompt_description": "Extract metadata.", "fields": [ { "name": "lx_enabled", "type": "boolean", "required": False, "description": "Whether langextract succeeded.", "source": "runtime", "runtime": "enabled", "mode": "runtime", "source_classes": [], "contains_any": [], }, { "name": "lx_orgs", "type": "string", "required": False, "description": "Organizations.", "source": "entities", "source_classes": ["organization"], "mode": "values", "contains_any": [], }, ], } monkeypatch.setattr( "fs_explorer.indexing.schema.auto_discover_profile", lambda folder, **kwargs: auto_profile, ) schema = SchemaDiscovery().discover_from_folder( str(corpus), with_langextract=True, metadata_profile=None, ) field_names = {f["name"] for f in schema["fields"]} assert "lx_orgs" in field_names assert "lx_enabled" in field_names assert schema.get("metadata_profile") == auto_profile # --------------------------------------------------------------------------- # Mock embedding helpers for indexing tests # --------------------------------------------------------------------------- @dataclass class _FakeEmbedding: values: list[float] @dataclass class _FakeEmbedResult: embeddings: list[_FakeEmbedding] class _FakeEmbedModels: def embed_content( self, *, model: str, contents: list[str], config: dict ) -> _FakeEmbedResult: dim = config.get("output_dimensionality", 4) return _FakeEmbedResult( embeddings=[ _FakeEmbedding(values=[0.1 * i] * dim) for i in range(len(contents)) ] ) class _FakeEmbedClient: def __init__(self) -> None: self.models = _FakeEmbedModels() # --------------------------------------------------------------------------- # Embedding indexing tests # --------------------------------------------------------------------------- def test_indexing_pipeline_with_embeddings( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "agreement.md").write_text("Purchase price is $45,000,000.") (corpus / "report.md").write_text("Risk register summary.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = str(tmp_path / "index.duckdb") storage = DuckDBStorage(db_path, embedding_dim=4) provider = EmbeddingProvider(client=_FakeEmbedClient(), dim=4) pipeline = IndexingPipeline(storage=storage, 
embedding_provider=provider) result = pipeline.index_folder(str(corpus), discover_schema=True) assert result.indexed_files == 2 assert result.embeddings_written > 0 assert storage.has_embeddings(corpus_id=result.corpus_id) def test_indexing_pipeline_without_embeddings( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "agreement.md").write_text("Purchase price.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = str(tmp_path / "index.duckdb") storage = DuckDBStorage(db_path) pipeline = IndexingPipeline(storage=storage) result = pipeline.index_folder(str(corpus), discover_schema=True) assert result.embeddings_written == 0 assert not storage.has_embeddings(corpus_id=result.corpus_id) def test_embedding_cascade_on_reindex( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() doc = corpus / "agreement.md" doc.write_text("Purchase price is $45,000,000.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = str(tmp_path / "index.duckdb") storage = DuckDBStorage(db_path, embedding_dim=4) provider = EmbeddingProvider(client=_FakeEmbedClient(), dim=4) pipeline = IndexingPipeline(storage=storage, embedding_provider=provider) first = pipeline.index_folder(str(corpus), discover_schema=True) assert first.embeddings_written > 0 # Update document and re-index; old embeddings should be replaced doc.write_text("Updated purchase price is $50,000,000.") second = pipeline.index_folder(str(corpus)) assert second.embeddings_written > 0 assert storage.has_embeddings(corpus_id=second.corpus_id) # --------------------------------------------------------------------------- # Parallel metadata extraction tests # --------------------------------------------------------------------------- def test_extract_metadata_batch_returns_correct_metadata( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "agreement.md").write_text("Purchase price is $45,000,000.") (corpus / "report.md").write_text("Risk register summary.") (corpus / "schedule.md").write_text("Effective Date: January 1, 2026") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) storage = DuckDBStorage(str(tmp_path / "index.duckdb")) pipeline = IndexingPipeline(storage=storage, max_workers=2) root = str(corpus) parsed_docs = [] import os for f in sorted(corpus.iterdir()): content = f.read_text() rel = os.path.relpath(str(f), root) parsed_docs.append((str(f), rel, content)) metadata_map = pipeline._extract_metadata_batch( parsed_docs=parsed_docs, root_path=root, schema_def=None, with_langextract=False, langextract_profile=None, ) assert len(metadata_map) == 3 assert "agreement.md" in metadata_map assert "report.md" in metadata_map assert "schedule.md" in metadata_map # Check heuristic metadata assert metadata_map["agreement.md"]["mentions_currency"] is True assert metadata_map["schedule.md"]["mentions_dates"] is True assert metadata_map["report.md"]["document_type"] == "report" def test_extract_metadata_batch_parallel_is_faster_than_sequential( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() for i in range(6): (corpus / f"doc_{i}.md").write_text(f"Document {i} content. 
Price is ${i}00.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) delay = 0.1 original_extract = metadata_module.extract_metadata def slow_extract(**kwargs): time.sleep(delay) return original_extract(**kwargs) monkeypatch.setattr(pipeline_module, "extract_metadata", slow_extract) storage = DuckDBStorage(str(tmp_path / "index.duckdb")) pipeline = IndexingPipeline(storage=storage, max_workers=6) root = str(corpus) parsed_docs = [] import os for f in sorted(corpus.iterdir()): content = f.read_text() rel = os.path.relpath(str(f), root) parsed_docs.append((str(f), rel, content)) start = time.monotonic() metadata_map = pipeline._extract_metadata_batch( parsed_docs=parsed_docs, root_path=root, schema_def=None, with_langextract=False, langextract_profile=None, ) elapsed = time.monotonic() - start assert len(metadata_map) == 6 # 6 docs * 0.1s each = 0.6s sequential; parallel should finish in < 0.4s assert elapsed < 0.4, f"Parallel extraction too slow: {elapsed:.2f}s" def test_parallel_and_sequential_produce_same_results( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "a.md").write_text("Purchase price is $45,000,000.") (corpus / "b.md").write_text("Effective Date: January 1, 2026. Risk summary.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) storage = DuckDBStorage(str(tmp_path / "index.duckdb")) root = str(corpus) parsed_docs = [] import os for f in sorted(corpus.iterdir()): content = f.read_text() rel = os.path.relpath(str(f), root) parsed_docs.append((str(f), rel, content)) # Sequential (max_workers=1) pipeline_seq = IndexingPipeline(storage=storage, max_workers=1) map_seq = pipeline_seq._extract_metadata_batch( parsed_docs=parsed_docs, root_path=root, schema_def=None, with_langextract=False, langextract_profile=None, ) # Parallel (max_workers=4) pipeline_par = IndexingPipeline(storage=storage, max_workers=4) map_par = pipeline_par._extract_metadata_batch( parsed_docs=parsed_docs, root_path=root, schema_def=None, with_langextract=False, langextract_profile=None, ) assert map_seq.keys() == map_par.keys() for key in map_seq: assert map_seq[key] == map_par[key], f"Mismatch for {key}" ================================================ FILE: tests/test_models.py ================================================ from fs_explorer.models import ( ToolCallAction, Action, ToolCallArg, GoDeeperAction, StopAction, ) def test_tool_call_action_to_tool_args() -> None: tool_call_action = ToolCallAction( tool_name="glob", tool_input=[ ToolCallArg(parameter_name="directory", parameter_value="tests/testfiles"), ToolCallArg(parameter_name="pattern", parameter_value="file?.*"), ], ) assert tool_call_action.to_fn_args() == { "directory": "tests/testfiles", "pattern": "file?.*", } def test_action_to_action_type() -> None: action = Action( action=ToolCallAction( tool_name="glob", tool_input=[ ToolCallArg( parameter_name="directory", parameter_value="tests/testfiles" ), ToolCallArg(parameter_name="pattern", parameter_value="file?.*"), ], ), reason="", ) assert action.to_action_type() == "toolcall" action = Action(action=GoDeeperAction(directory="tests/testfiles/last"), reason="") assert action.to_action_type() == "godeeper" action = Action(action=StopAction(final_result="hello"), reason="") assert action.to_action_type() == "stop" ================================================ FILE: tests/test_search.py ================================================ """Tests for 
search filtering and merged retrieval ranking.""" from __future__ import annotations import time from dataclasses import dataclass from pathlib import Path import fs_explorer.indexing.pipeline as pipeline_module import pytest from fs_explorer.embeddings import EmbeddingProvider from fs_explorer.indexing.pipeline import IndexingPipeline from fs_explorer.search import ( IndexedQueryEngine, MetadataFilterParseError, parse_metadata_filters, ) from fs_explorer.storage import DuckDBStorage def test_parse_metadata_filters_supports_scalar_and_list_values() -> None: parsed = parse_metadata_filters( "document_type=agreement and mentions_currency=true, file_size_bytes>=100, " "document_type in (agreement, report)" ) assert len(parsed) == 4 assert parsed[0].field == "document_type" assert parsed[0].operator == "eq" assert parsed[0].value == "agreement" assert parsed[1].field == "mentions_currency" assert parsed[1].value is True assert parsed[2].operator == "gte" assert parsed[2].value == 100 assert parsed[3].operator == "in" assert parsed[3].value == ["agreement", "report"] def test_parse_metadata_filters_rejects_unknown_schema_fields() -> None: with pytest.raises(MetadataFilterParseError): parse_metadata_filters( "owner=finance", allowed_fields={"document_type", "mentions_currency"}, ) def test_indexed_query_engine_unions_semantic_and_metadata_results( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "a_agreement.md").write_text("Purchase price is $45,000,000.") (corpus / "b_report.md").write_text( "Risk register and litigation exposure summary." ) monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = tmp_path / "index.duckdb" storage = DuckDBStorage(str(db_path)) result = IndexingPipeline(storage=storage).index_folder( str(corpus), discover_schema=True ) engine = IndexedQueryEngine(storage) hits = engine.search( corpus_id=result.corpus_id, query="purchase price", filters="document_type=report", limit=5, ) by_path = {hit.relative_path: hit for hit in hits} assert "a_agreement.md" in by_path assert "b_report.md" in by_path assert by_path["a_agreement.md"].semantic_score > 0 assert by_path["b_report.md"].metadata_score > 0 class _SlowStorage: def search_chunks(self, *, corpus_id: str, query: str, limit: int = 5): # noqa: ARG002 time.sleep(0.3) return [ { "doc_id": "doc_semantic", "relative_path": "a.md", "absolute_path": "/tmp/a.md", "position": 0, "text": "semantic hit", "score": 3, } ] def search_documents_by_metadata(self, *, corpus_id: str, filters, limit: int = 20): # noqa: ARG002 time.sleep(0.3) return [ { "doc_id": "doc_metadata", "relative_path": "b.md", "absolute_path": "/tmp/b.md", "preview_text": "metadata hit", "metadata_score": 1, } ] def get_active_schema(self, *, corpus_id: str): # noqa: ARG002 return None def test_indexed_query_engine_executes_semantic_and_metadata_in_parallel() -> None: engine = IndexedQueryEngine(_SlowStorage()) start = time.perf_counter() hits = engine.search( corpus_id="corpus_test", query="test", filters="document_type=agreement", limit=5, ) elapsed = time.perf_counter() - start assert elapsed < 0.58 assert {hit.doc_id for hit in hits} == {"doc_semantic", "doc_metadata"} def test_search_enable_semantic_false_returns_only_metadata() -> None: """When enable_semantic=False, only metadata results are returned.""" engine = IndexedQueryEngine(_SlowStorage()) hits = engine.search( corpus_id="corpus_test", query="test", filters="document_type=agreement", limit=5, 
enable_semantic=False, ) assert len(hits) == 1 assert hits[0].doc_id == "doc_metadata" def test_search_enable_metadata_false_returns_only_semantic() -> None: """When enable_metadata=False, only semantic results are returned.""" engine = IndexedQueryEngine(_SlowStorage()) hits = engine.search( corpus_id="corpus_test", query="test", filters="document_type=agreement", limit=5, enable_metadata=False, ) assert len(hits) == 1 assert hits[0].doc_id == "doc_semantic" def test_search_both_disabled_returns_empty() -> None: """When both enable_semantic and enable_metadata are False, no results.""" engine = IndexedQueryEngine(_SlowStorage()) hits = engine.search( corpus_id="corpus_test", query="test", filters="document_type=agreement", limit=5, enable_semantic=False, enable_metadata=False, ) assert hits == [] # --------------------------------------------------------------------------- # Mock embedding helpers # --------------------------------------------------------------------------- @dataclass class _FakeEmbedding: values: list[float] @dataclass class _FakeEmbedResult: embeddings: list[_FakeEmbedding] class _FakeEmbedModels: def embed_content( self, *, model: str, contents: list[str], config: dict ) -> _FakeEmbedResult: dim = config.get("output_dimensionality", 4) return _FakeEmbedResult( embeddings=[ _FakeEmbedding(values=[0.1 * (i + 1)] * dim) for i in range(len(contents)) ] ) class _FakeEmbedClient: def __init__(self) -> None: self.models = _FakeEmbedModels() # --------------------------------------------------------------------------- # Vector search tests # --------------------------------------------------------------------------- def test_vector_search_with_pre_stored_embeddings( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "agreement.md").write_text("Purchase price is $45,000,000.") (corpus / "report.md").write_text("Risk register and litigation exposure summary.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = str(tmp_path / "index.duckdb") storage = DuckDBStorage(db_path, embedding_dim=4) provider = EmbeddingProvider(client=_FakeEmbedClient(), dim=4) pipeline = IndexingPipeline(storage=storage, embedding_provider=provider) result = pipeline.index_folder(str(corpus), discover_schema=True) assert result.embeddings_written > 0 engine = IndexedQueryEngine(storage, embedding_provider=provider) hits = engine.search( corpus_id=result.corpus_id, query="purchase price", limit=5, ) assert len(hits) > 0 # All hits should have float semantic scores from cosine similarity for hit in hits: assert isinstance(hit.semantic_score, float) def test_keyword_fallback_when_no_embeddings( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "agreement.md").write_text("Purchase price is $45,000,000.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = str(tmp_path / "index.duckdb") storage = DuckDBStorage(db_path) IndexingPipeline(storage=storage).index_folder(str(corpus), discover_schema=True) # Create engine with embedding provider but no embeddings stored provider = EmbeddingProvider(client=_FakeEmbedClient(), dim=4) engine = IndexedQueryEngine(storage, embedding_provider=provider) result_corpus_id = storage.get_corpus_id(str(Path(corpus).resolve())) assert result_corpus_id is not None hits = engine.search( corpus_id=result_corpus_id, query="purchase price", limit=5, ) # Should still return results via 
keyword fallback assert len(hits) > 0 def test_get_metadata_field_values_returns_distinct_values( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "a_agreement.md").write_text("Purchase price is $45,000,000.") (corpus / "b_report.md").write_text("Risk register summary.") (corpus / "c_agreement.md").write_text("Escrow details for the deal.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = tmp_path / "index.duckdb" storage = DuckDBStorage(str(db_path)) result = IndexingPipeline(storage=storage).index_folder( str(corpus), discover_schema=True ) values = storage.get_metadata_field_values( corpus_id=result.corpus_id, field_names=["document_type", "mentions_currency"], ) assert "document_type" in values assert "agreement" in values["document_type"] assert "report" in values["document_type"] assert "mentions_currency" in values def test_get_metadata_field_values_empty_corpus(tmp_path: Path) -> None: db_path = tmp_path / "index.duckdb" storage = DuckDBStorage(str(db_path)) corpus_id = storage.get_or_create_corpus(str(tmp_path / "empty")) values = storage.get_metadata_field_values( corpus_id=corpus_id, field_names=["document_type"], ) assert values == {"document_type": []} def test_get_metadata_field_values_respects_max_distinct( tmp_path: Path, monkeypatch, ) -> None: corpus = tmp_path / "docs" corpus.mkdir() for i in range(5): (corpus / f"doc_{i:02d}_type{i}.md").write_text(f"Content {i}") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) storage = DuckDBStorage(str(tmp_path / "index.duckdb")) result = IndexingPipeline(storage=storage).index_folder( str(corpus), discover_schema=True ) values = storage.get_metadata_field_values( corpus_id=result.corpus_id, field_names=["document_type"], max_distinct=2, ) assert len(values["document_type"]) <= 2 def test_semantic_search_includes_field_catalog_on_first_call( tmp_path: Path, monkeypatch, ) -> None: import fs_explorer.agent as agent_module corpus = tmp_path / "docs" corpus.mkdir() (corpus / "a_agreement.md").write_text("Purchase price is $45,000,000.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = str(tmp_path / "index.duckdb") storage = DuckDBStorage(db_path) IndexingPipeline(storage=storage).index_folder( str(corpus), discover_schema=True ) agent_module.set_index_context(str(corpus), db_path) agent_module.set_search_flags(enable_semantic=True, enable_metadata=True) try: first = agent_module.semantic_search("purchase price") assert "Available filter fields" in first assert "document_type" in first second = agent_module.semantic_search("purchase price") assert "Available filter fields" not in second finally: agent_module.clear_index_context() def test_float_scoring_in_ranked_documents() -> None: from fs_explorer.search.ranker import RankedDocument, rank_documents docs = [ RankedDocument( doc_id="d1", relative_path="a.md", absolute_path="/a.md", position=0, text="doc 1", semantic_score=0.95, metadata_score=1, ), RankedDocument( doc_id="d2", relative_path="b.md", absolute_path="/b.md", position=0, text="doc 2", semantic_score=0.5, metadata_score=2, ), ] ranked = rank_documents(docs, limit=2) assert ranked[0].doc_id == "d1" assert ranked[0].combined_score > ranked[1].combined_score ================================================ FILE: tests/test_server_search.py ================================================ """Tests for the 
/api/search and /api/index REST endpoints.""" from __future__ import annotations from pathlib import Path from unittest.mock import patch import fs_explorer.indexing.pipeline as pipeline_module import pytest from fastapi.testclient import TestClient from fs_explorer.indexing.pipeline import IndexingPipeline from fs_explorer.server import app from fs_explorer.storage import DuckDBStorage @pytest.fixture() def indexed_corpus(tmp_path: Path, monkeypatch): """Create a small indexed corpus and return (folder, db_path).""" corpus = tmp_path / "docs" corpus.mkdir() (corpus / "agreement.md").write_text("Purchase price is $45,000,000.") (corpus / "report.md").write_text("Risk register and litigation exposure summary.") monkeypatch.setattr( pipeline_module, "parse_file", lambda file_path: Path(file_path).read_text(), ) db_path = str(tmp_path / "index.duckdb") storage = DuckDBStorage(db_path) IndexingPipeline(storage=storage).index_folder(str(corpus), discover_schema=True) return str(corpus), db_path def test_search_endpoint_returns_hits(indexed_corpus) -> None: corpus_folder, db_path = indexed_corpus client = TestClient(app) response = client.post( "/api/search", json={ "corpus_folder": corpus_folder, "query": "purchase price", "db_path": db_path, }, ) assert response.status_code == 200 data = response.json() assert "hits" in data assert len(data["hits"]) > 0 assert data["hits"][0]["semantic_score"] > 0 def test_search_endpoint_with_filters(indexed_corpus) -> None: corpus_folder, db_path = indexed_corpus client = TestClient(app) response = client.post( "/api/search", json={ "corpus_folder": corpus_folder, "query": "litigation", "filters": "document_type=report", "db_path": db_path, }, ) assert response.status_code == 200 data = response.json() assert "hits" in data def test_search_endpoint_missing_index(tmp_path: Path) -> None: corpus = tmp_path / "empty" corpus.mkdir() db_path = str(tmp_path / "nonexistent.duckdb") client = TestClient(app) response = client.post( "/api/search", json={ "corpus_folder": str(corpus), "query": "test", "db_path": db_path, }, ) assert response.status_code in (404, 500) def test_search_endpoint_invalid_folder() -> None: client = TestClient(app) response = client.post( "/api/search", json={ "corpus_folder": "/nonexistent/path/abc123", "query": "test", }, ) assert response.status_code == 400 # --------------------------------------------------------------------------- # /api/index/status tests # --------------------------------------------------------------------------- def test_index_status_not_indexed(tmp_path: Path) -> None: corpus = tmp_path / "empty_folder" corpus.mkdir() db_path = str(tmp_path / "nonexistent.duckdb") client = TestClient(app) response = client.get( "/api/index/status", params={"folder": str(corpus), "db_path": db_path}, ) assert response.status_code == 200 data = response.json() assert data["indexed"] is False def test_index_status_after_indexing(indexed_corpus) -> None: corpus_folder, db_path = indexed_corpus client = TestClient(app) response = client.get( "/api/index/status", params={"folder": corpus_folder, "db_path": db_path}, ) assert response.status_code == 200 data = response.json() assert data["indexed"] is True assert data["document_count"] == 2 assert data["schema_name"] is not None assert isinstance(data["has_metadata"], bool) assert isinstance(data["has_embeddings"], bool) def test_index_status_includes_schema_fields(indexed_corpus) -> None: corpus_folder, db_path = indexed_corpus client = TestClient(app) response = client.get( 
"/api/index/status", params={"folder": corpus_folder, "db_path": db_path}, ) assert response.status_code == 200 data = response.json() assert "schema_fields" in data assert isinstance(data["schema_fields"], list) assert len(data["schema_fields"]) > 0 assert "document_type" in data["schema_fields"] # --------------------------------------------------------------------------- # /api/index/auto-profile tests # --------------------------------------------------------------------------- def test_auto_profile_endpoint(tmp_path: Path) -> None: corpus = tmp_path / "docs" corpus.mkdir() (corpus / "contract.md").write_text("TechCorp acquires StartupXYZ for $10M.") fake_profile = { "name": "test_auto", "description": "Auto-generated.", "prompt_description": "Extract metadata.", "fields": [ { "name": "lx_organizations", "type": "string", "description": "Org names.", "source": "entities", "source_classes": ["organization"], "mode": "values", } ], } client = TestClient(app) with patch( "fs_explorer.server.auto_discover_profile", return_value=fake_profile, ): response = client.post( "/api/index/auto-profile", json={"folder": str(corpus)}, ) assert response.status_code == 200 data = response.json() assert "profile" in data assert data["profile"]["name"] == "test_auto" field_names = {f["name"] for f in data["profile"]["fields"]} assert "lx_organizations" in field_names def test_auto_profile_invalid_folder() -> None: client = TestClient(app) response = client.post( "/api/index/auto-profile", json={"folder": "/nonexistent/path/abc123"}, ) assert response.status_code == 400 ================================================ FILE: tests/testfiles/file1.txt ================================================ this is a test ================================================ FILE: tests/testfiles/file2.md ================================================ # this is a test! ================================================ FILE: tests/testfiles/last/lastfile.txt ================================================ hello