[
  {
    "path": ".github/workflows/build.yaml",
    "content": "name: Build\n\non:\n  pull_request:\n\njobs:\n  build:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n\n      - name: Install uv\n        uses: astral-sh/setup-uv@v6\n\n      - name: Set up Python\n        run: uv python install 3.13\n\n      - name: Build package\n        run: make build\n"
  },
  {
    "path": ".github/workflows/lint.yaml",
    "content": "name: Linting\n\non:\n  pull_request:\n\njobs:\n  lint:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n\n      - name: Install uv\n        uses: astral-sh/setup-uv@v6\n\n      - name: Set up Python\n        run: uv python install 3.12\n\n      - name: Run formatter\n        shell: bash\n        run: make format-check\n\n      - name: Run linter\n        shell: bash\n        run: make lint\n"
  },
  {
    "path": ".github/workflows/test.yaml",
    "content": "name: CI Tests - Pull Request\n\non:\n  pull_request:\n\njobs:\n  testing_pr:\n    runs-on: ubuntu-latest\n    strategy:\n      matrix:\n        python-version: [\"3.10\", \"3.11\", \"3.12\", \"3.13\"]\n    steps:\n      - uses: actions/checkout@v4\n        with:\n          fetch-depth: 1\n\n      - name: Install uv\n        uses: astral-sh/setup-uv@v6\n        with:\n          python-version: ${{ matrix.python-version }}\n          enable-cache: true\n\n      - name: Run Tests on Main Package\n        run: make test\n"
  },
  {
    "path": ".github/workflows/typecheck.yaml",
    "content": "name: Typecheck\n\non:\n  pull_request:\n\njobs:\n  core-typecheck:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n        with:\n          fetch-depth: 1\n\n      - name: Install uv\n        uses: astral-sh/setup-uv@v6\n\n      - name: Set up Python\n        run: uv python install\n\n      - name: Run Mypy\n        run: make typecheck\n"
  },
  {
    "path": ".gitignore",
    "content": "# Python-generated files\n__pycache__/\n*.py[oc]\nbuild/\ndist/\nwheels/\n*.egg-info\n\n# Virtual environments\n.venv\n\n# caches\n*_cache/\n\n# Environment\n.env\n\n# OS files\n.DS_Store"
  },
  {
    "path": ".pre-commit-config.yaml",
    "content": "---\ndefault_language_version:\n  python: python3\n\nrepos:\n  - repo: https://github.com/pre-commit/pre-commit-hooks\n    rev: v4.5.0\n    hooks:\n      - id: check-merge-conflict\n      - id: check-symlinks\n      - id: check-yaml\n      - id: detect-private-key"
  },
  {
    "path": ".python-version",
    "content": "3.13\n"
  },
  {
    "path": "ARCHITECTURE.md",
    "content": "# FsExplorer Architecture Documentation\n\n## Table of Contents\n\n1. [System Overview](#system-overview)\n2. [Component Architecture](#component-architecture)\n3. [Core Modules](#core-modules)\n4. [Workflow Engine](#workflow-engine)\n5. [Agent Decision Loop](#agent-decision-loop)\n6. [Document Processing Pipeline](#document-processing-pipeline)\n7. [Three-Phase Exploration Strategy](#three-phase-exploration-strategy)\n8. [Token Tracking & Cost Estimation](#token-tracking--cost-estimation)\n9. [CLI Interface](#cli-interface)\n10. [Data Flow](#data-flow)\n11. [File Structure](#file-structure)\n12. [Extension Points](#extension-points)\n\n---\n\n## System Overview\n\nFsExplorer is an AI-powered filesystem exploration agent that answers questions about documents by intelligently navigating directories, parsing files, and synthesizing information with source citations.\n\n```mermaid\ngraph TB\n    subgraph \"User Interface\"\n        CLI[CLI Interface<br/>typer + rich]\n    end\n\n    subgraph \"Orchestration Layer\"\n        WF[Workflow Engine<br/>llama-index-workflows]\n        EVT[Event System]\n    end\n\n    subgraph \"Intelligence Layer\"\n        AGENT[FsExplorer Agent]\n        LLM[Google Gemini 2.0 Flash<br/>Structured JSON Output]\n        PROMPT[System Prompt<br/>Three-Phase Strategy]\n    end\n\n    subgraph \"Tools Layer\"\n        TOOLS[Tool Registry]\n        SCAN[scan_folder<br/>Parallel Scan]\n        PREVIEW[preview_file<br/>Quick Preview]\n        PARSE[parse_file<br/>Deep Read]\n        READ[read<br/>Text Files]\n        GREP[grep<br/>Pattern Search]\n        GLOB[glob<br/>File Search]\n    end\n\n    subgraph \"Document Processing\"\n        DOCLING[Docling<br/>Document Converter]\n        CACHE[Document Cache]\n    end\n\n    subgraph \"Filesystem\"\n        FS[(Local Filesystem)]\n        PDF[PDF Files]\n        DOCX[DOCX Files]\n        MD[Markdown Files]\n        OTHER[Other Formats]\n    end\n\n    CLI --> WF\n    WF --> EVT\n  
  EVT --> AGENT\n    AGENT --> LLM\n    AGENT --> PROMPT\n    AGENT --> TOOLS\n    \n    TOOLS --> SCAN\n    TOOLS --> PREVIEW\n    TOOLS --> PARSE\n    TOOLS --> READ\n    TOOLS --> GREP\n    TOOLS --> GLOB\n    \n    SCAN --> DOCLING\n    PREVIEW --> DOCLING\n    PARSE --> DOCLING\n    \n    DOCLING --> CACHE\n    CACHE --> FS\n    \n    FS --> PDF\n    FS --> DOCX\n    FS --> MD\n    FS --> OTHER\n\n    style LLM fill:#4285f4,color:#fff\n    style DOCLING fill:#ff6b6b,color:#fff\n    style CACHE fill:#ffd93d,color:#000\n    style AGENT fill:#6bcb77,color:#fff\n```\n\n---\n\n## Component Architecture\n\n### High-Level Component Diagram\n\n```mermaid\ngraph LR\n    subgraph \"Entry Point\"\n        MAIN[main.py<br/>CLI Entry]\n    end\n\n    subgraph \"Workflow\"\n        WORKFLOW[workflow.py<br/>Event Orchestration]\n    end\n\n    subgraph \"Agent\"\n        AGENT_MOD[agent.py<br/>AI Decision Making]\n    end\n\n    subgraph \"Models\"\n        MODELS[models.py<br/>Pydantic Schemas]\n    end\n\n    subgraph \"Filesystem\"\n        FS_MOD[fs.py<br/>File Operations]\n    end\n\n    MAIN --> WORKFLOW\n    WORKFLOW --> AGENT_MOD\n    AGENT_MOD --> MODELS\n    AGENT_MOD --> FS_MOD\n    WORKFLOW --> MODELS\n\n    style MAIN fill:#e1f5fe\n    style WORKFLOW fill:#f3e5f5\n    style AGENT_MOD fill:#e8f5e9\n    style MODELS fill:#fff3e0\n    style FS_MOD fill:#fce4ec\n```\n\n### Module Dependencies\n\n```mermaid\ngraph TD\n    subgraph \"fs_explorer package\"\n        INIT[__init__.py<br/>Public API Exports]\n        MAIN[main.py]\n        WORKFLOW[workflow.py]\n        AGENT[agent.py]\n        MODELS[models.py]\n        FS[fs.py]\n    end\n\n    subgraph \"External Dependencies\"\n        TYPER[typer<br/>CLI Framework]\n        RICH[rich<br/>Terminal UI]\n        WORKFLOWS[llama-index-workflows<br/>Event System]\n        GENAI[google-genai<br/>Gemini API]\n        PYDANTIC[pydantic<br/>Data Validation]\n        DOCLING[docling<br/>Document Parsing]\n    end\n\n    INIT 
--> AGENT\n    INIT --> WORKFLOW\n    INIT --> MODELS\n    \n    MAIN --> TYPER\n    MAIN --> RICH\n    MAIN --> WORKFLOW\n    \n    WORKFLOW --> WORKFLOWS\n    WORKFLOW --> AGENT\n    WORKFLOW --> MODELS\n    WORKFLOW --> FS\n    \n    AGENT --> GENAI\n    AGENT --> MODELS\n    AGENT --> FS\n    \n    MODELS --> PYDANTIC\n    \n    FS --> DOCLING\n\n    style GENAI fill:#4285f4,color:#fff\n    style DOCLING fill:#ff6b6b,color:#fff\n```\n\n---\n\n## Core Modules\n\n### models.py - Data Schemas\n\nDefines the structured output format for the AI agent using Pydantic models.\n\n```mermaid\nclassDiagram\n    class Action {\n        +action: ToolCallAction | GoDeeperAction | StopAction | AskHumanAction\n        +reason: str\n        +to_action_type() ActionType\n    }\n\n    class ToolCallAction {\n        +tool_name: Tools\n        +tool_input: list[ToolCallArg]\n        +to_fn_args() dict\n    }\n\n    class ToolCallArg {\n        +parameter_name: str\n        +parameter_value: Any\n    }\n\n    class GoDeeperAction {\n        +directory: str\n    }\n\n    class StopAction {\n        +final_result: str\n    }\n\n    class AskHumanAction {\n        +question: str\n    }\n\n    Action --> ToolCallAction\n    Action --> GoDeeperAction\n    Action --> StopAction\n    Action --> AskHumanAction\n    ToolCallAction --> ToolCallArg\n\n    note for Action \"Main container returned by LLM\"\n    note for ToolCallAction \"Invokes filesystem tools\"\n    note for StopAction \"Contains final answer with citations\"\n```\n\n### agent.py - AI Agent\n\nThe core intelligence component that interacts with Google Gemini.\n\n```mermaid\nclassDiagram\n    class FsExplorerAgent {\n        -_client: GenAIClient\n        -_chat_history: list[Content]\n        +token_usage: TokenUsage\n        +__init__(api_key: str)\n        +configure_task(task: str) void\n        +take_action() tuple[Action, ActionType]\n        +call_tool(tool_name: Tools, tool_input: dict) void\n        +reset() void\n   
 }\n\n    class TokenUsage {\n        +prompt_tokens: int\n        +completion_tokens: int\n        +total_tokens: int\n        +api_calls: int\n        +tool_result_chars: int\n        +documents_parsed: int\n        +documents_scanned: int\n        +add_api_call(prompt_tokens, completion_tokens) void\n        +add_tool_result(result, tool_name) void\n        +summary() str\n    }\n\n    class TOOLS {\n        <<dictionary>>\n        +read: read_file\n        +grep: grep_file_content\n        +glob: glob_paths\n        +scan_folder: scan_folder\n        +preview_file: preview_file\n        +parse_file: parse_file\n    }\n\n    FsExplorerAgent --> TokenUsage\n    FsExplorerAgent --> TOOLS\n```\n\n### fs.py - Filesystem Operations\n\nAll filesystem and document parsing utilities.\n\n```mermaid\nclassDiagram\n    class FilesystemModule {\n        <<module>>\n        +SUPPORTED_EXTENSIONS: frozenset\n        +DEFAULT_PREVIEW_CHARS: int = 3000\n        +DEFAULT_SCAN_PREVIEW_CHARS: int = 1500\n        +DEFAULT_MAX_WORKERS: int = 4\n    }\n\n    class DocumentCache {\n        <<singleton>>\n        -_DOCUMENT_CACHE: dict[str, str]\n        +clear_document_cache() void\n        +_get_cached_or_parse(file_path) str\n    }\n\n    class DirectoryOps {\n        <<functions>>\n        +describe_dir_content(directory) str\n        +glob_paths(directory, pattern) str\n    }\n\n    class FileOps {\n        <<functions>>\n        +read_file(file_path) str\n        +grep_file_content(file_path, pattern) str\n    }\n\n    class DocumentOps {\n        <<functions>>\n        +preview_file(file_path, max_chars) str\n        +parse_file(file_path) str\n        +scan_folder(directory, max_workers, preview_chars) str\n    }\n\n    FilesystemModule --> DocumentCache\n    FilesystemModule --> DirectoryOps\n    FilesystemModule --> FileOps\n    FilesystemModule --> DocumentOps\n    DocumentOps --> DocumentCache\n```\n\n---\n\n## Workflow Engine\n\nThe workflow engine uses an event-driven 
architecture based on `llama-index-workflows`.\n\n### Workflow State Machine\n\n```mermaid\nstateDiagram-v2\n    [*] --> StartExploration: InputEvent(task)\n    \n    StartExploration --> ToolCall: ToolCallEvent\n    StartExploration --> GoDeeper: GoDeeperEvent\n    StartExploration --> AskHuman: AskHumanEvent\n    StartExploration --> End: StopAction\n    \n    ToolCall --> ToolCall: ToolCallEvent\n    ToolCall --> GoDeeper: GoDeeperEvent\n    ToolCall --> AskHuman: AskHumanEvent\n    ToolCall --> End: StopAction\n    \n    GoDeeper --> ToolCall: ToolCallEvent\n    GoDeeper --> GoDeeper: GoDeeperEvent\n    GoDeeper --> AskHuman: AskHumanEvent\n    GoDeeper --> End: StopAction\n    \n    AskHuman --> WaitForHuman: InputRequiredEvent\n    WaitForHuman --> ProcessHumanResponse: HumanAnswerEvent\n    ProcessHumanResponse --> ToolCall: ToolCallEvent\n    ProcessHumanResponse --> GoDeeper: GoDeeperEvent\n    ProcessHumanResponse --> AskHuman: AskHumanEvent\n    ProcessHumanResponse --> End: StopAction\n    \n    End --> [*]: ExplorationEndEvent\n\n    note right of StartExploration\n        Initial task processing\n        Describes current directory\n        Asks LLM for first action\n    end note\n\n    note right of ToolCall\n        Executes filesystem tool\n        Adds result to chat history\n        Asks LLM for next action\n    end note\n\n    note right of GoDeeper\n        Updates current directory\n        Describes new directory\n        Asks LLM for next action\n    end note\n```\n\n### Event Types\n\n```mermaid\ngraph TB\n    subgraph \"Start Events\"\n        IE[InputEvent<br/>task: str]\n    end\n\n    subgraph \"Intermediate Events\"\n        TCE[ToolCallEvent<br/>tool_name, tool_input, reason]\n        GDE[GoDeeperEvent<br/>directory, reason]\n        AHE[AskHumanEvent<br/>question, reason]\n        HAE[HumanAnswerEvent<br/>response]\n    end\n\n    subgraph \"End Events\"\n        EEE[ExplorationEndEvent<br/>final_result, error]\n    end\n\n    IE --> 
TCE\n    IE --> GDE\n    IE --> AHE\n    IE --> EEE\n\n    TCE --> TCE\n    TCE --> GDE\n    TCE --> AHE\n    TCE --> EEE\n\n    GDE --> TCE\n    GDE --> GDE\n    GDE --> AHE\n    GDE --> EEE\n\n    AHE --> HAE\n    HAE --> TCE\n    HAE --> GDE\n    HAE --> AHE\n    HAE --> EEE\n\n    style IE fill:#4caf50,color:#fff\n    style EEE fill:#f44336,color:#fff\n    style TCE fill:#2196f3,color:#fff\n    style GDE fill:#9c27b0,color:#fff\n    style AHE fill:#ff9800,color:#fff\n```\n\n### Workflow Steps\n\n```mermaid\nsequenceDiagram\n    participant CLI as CLI (main.py)\n    participant WF as Workflow\n    participant Agent as FsExplorerAgent\n    participant LLM as Gemini API\n    participant Tools as Tool Registry\n    participant FS as Filesystem\n\n    CLI->>WF: InputEvent(task)\n    \n    WF->>Agent: configure_task(initial_prompt)\n    Agent->>LLM: generate_content(chat_history)\n    LLM-->>Agent: Action JSON\n    \n    alt ToolCallAction\n        Agent->>Tools: call_tool(name, args)\n        Tools->>FS: execute operation\n        FS-->>Tools: result\n        Tools-->>Agent: tool result\n        Agent->>Agent: add to chat_history\n        WF-->>CLI: ToolCallEvent (stream)\n        WF->>Agent: configure_task(\"next action?\")\n        Note over WF,Agent: Loop continues\n    else GoDeeperAction\n        WF->>WF: update current_directory\n        WF-->>CLI: GoDeeperEvent (stream)\n        WF->>Agent: configure_task(\"next action?\")\n        Note over WF,Agent: Loop continues\n    else AskHumanAction\n        WF-->>CLI: AskHumanEvent (stream)\n        CLI->>CLI: Wait for user input\n        CLI->>WF: HumanAnswerEvent(response)\n        WF->>Agent: configure_task(response)\n        Note over WF,Agent: Loop continues\n    else StopAction\n        WF-->>CLI: ExplorationEndEvent(final_result)\n    end\n```\n\n---\n\n## Agent Decision Loop\n\n### Single Decision Cycle\n\n```mermaid\nflowchart TB\n    subgraph \"Agent.take_action()\"\n        START([Start]) --> SEND[Send 
chat_history to Gemini]\n        SEND --> RECEIVE[Receive JSON response]\n        RECEIVE --> TRACK[Track token usage]\n        TRACK --> PARSE[Parse Action from JSON]\n        PARSE --> CHECK{Action Type?}\n        \n        CHECK -->|toolcall| EXEC[Execute Tool]\n        EXEC --> RESULT[Get tool result]\n        RESULT --> ADD[Add result to chat_history]\n        ADD --> RETURN1[Return Action, ActionType]\n        \n        CHECK -->|godeeper| RETURN2[Return Action, ActionType]\n        CHECK -->|askhuman| RETURN3[Return Action, ActionType]\n        CHECK -->|stop| RETURN4[Return Action, ActionType]\n        \n        RETURN1 --> END([End])\n        RETURN2 --> END\n        RETURN3 --> END\n        RETURN4 --> END\n    end\n\n    style START fill:#4caf50,color:#fff\n    style END fill:#f44336,color:#fff\n    style CHECK fill:#ff9800,color:#000\n```\n\n### Chat History Evolution\n\n```mermaid\nsequenceDiagram\n    participant User\n    participant Agent\n    participant LLM\n\n    Note over Agent: chat_history = []\n\n    User->>Agent: configure_task(\"Initial prompt + directory listing\")\n    Note over Agent: chat_history = [user: initial_prompt]\n\n    Agent->>LLM: generate_content(chat_history)\n    LLM-->>Agent: {action: scan_folder, reason: \"...\"}\n    Note over Agent: chat_history = [user: initial_prompt, model: action1]\n\n    Agent->>Agent: Execute scan_folder, add result\n    Note over Agent: chat_history = [user: initial_prompt, model: action1, user: tool_result1]\n\n    User->>Agent: configure_task(\"What's next?\")\n    Note over Agent: chat_history = [..., user: \"What's next?\"]\n\n    Agent->>LLM: generate_content(chat_history)\n    LLM-->>Agent: {action: parse_file, reason: \"...\"}\n    Note over Agent: chat_history = [..., model: action2]\n\n    Note over Agent: Pattern continues until StopAction\n```\n\n---\n\n## Document Processing Pipeline\n\n### Docling Integration\n\n```mermaid\nflowchart LR\n    subgraph \"Input Formats\"\n        
PDF[PDF]\n        DOCX[DOCX]\n        PPTX[PPTX]\n        XLSX[XLSX]\n        HTML[HTML]\n        MD[Markdown]\n    end\n\n    subgraph \"Docling\"\n        DC[DocumentConverter]\n        DETECT[Format Detection]\n        PIPELINE[Processing Pipeline]\n        EXPORT[Markdown Export]\n    end\n\n    subgraph \"Output\"\n        MARKDOWN[Markdown Text]\n    end\n\n    PDF --> DC\n    DOCX --> DC\n    PPTX --> DC\n    XLSX --> DC\n    HTML --> DC\n    MD --> DC\n\n    DC --> DETECT\n    DETECT --> PIPELINE\n    PIPELINE --> EXPORT\n    EXPORT --> MARKDOWN\n\n    style DC fill:#ff6b6b,color:#fff\n```\n\n### Caching Strategy\n\n```mermaid\nflowchart TB\n    subgraph \"Cache Key Generation\"\n        PATH[file_path] --> ABS[os.path.abspath]\n        ABS --> MTIME[os.path.getmtime]\n        MTIME --> KEY[\"cache_key = f'{abs_path}:{mtime}'\"]\n    end\n\n    subgraph \"Cache Lookup\"\n        KEY --> CHECK{Key in cache?}\n        CHECK -->|Yes| HIT[Return cached content]\n        CHECK -->|No| MISS[Parse with Docling]\n        MISS --> STORE[Store in cache]\n        STORE --> RETURN[Return content]\n    end\n\n    subgraph \"_DOCUMENT_CACHE\"\n        CACHE[(dict: str → str)]\n    end\n\n    HIT --> CACHE\n    STORE --> CACHE\n\n    style CACHE fill:#ffd93d,color:#000\n```\n\n### Parallel Document Scanning\n\n```mermaid\nflowchart TB\n    subgraph \"scan_folder(directory)\"\n        START([Start]) --> LIST[List directory files]\n        LIST --> FILTER[Filter by SUPPORTED_EXTENSIONS]\n        FILTER --> POOL[Create ThreadPoolExecutor<br/>max_workers=4]\n        \n        subgraph \"Parallel Processing\"\n            POOL --> T1[Thread 1<br/>_preview_single_file]\n            POOL --> T2[Thread 2<br/>_preview_single_file]\n            POOL --> T3[Thread 3<br/>_preview_single_file]\n            POOL --> T4[Thread 4<br/>_preview_single_file]\n        end\n\n        T1 --> COLLECT[Collect Results]\n        T2 --> COLLECT\n        T3 --> COLLECT\n        T4 --> COLLECT\n\n    
    COLLECT --> SORT[Sort by filename]\n        SORT --> FORMAT[Format output report]\n        FORMAT --> END([Return summary])\n    end\n\n    style START fill:#4caf50,color:#fff\n    style END fill:#4caf50,color:#fff\n    style POOL fill:#2196f3,color:#fff\n```\n\n---\n\n## Three-Phase Exploration Strategy\n\n### Phase Overview\n\n```mermaid\nflowchart TB\n    subgraph \"PHASE 1: Parallel Scan\"\n        P1_START([User Query]) --> P1_SCAN[scan_folder]\n        P1_SCAN --> P1_PREVIEW[Get previews of ALL documents]\n        P1_PREVIEW --> P1_CATEGORIZE[Categorize documents]\n        \n        P1_CATEGORIZE --> REL[RELEVANT<br/>Directly related]\n        P1_CATEGORIZE --> MAYBE[MAYBE<br/>Potentially useful]\n        P1_CATEGORIZE --> SKIP[SKIP<br/>Not relevant]\n    end\n\n    subgraph \"PHASE 2: Deep Dive\"\n        REL --> P2_PARSE[parse_file on RELEVANT docs]\n        MAYBE -.->|If needed| P2_PARSE\n        P2_PARSE --> P2_EXTRACT[Extract key information]\n        P2_EXTRACT --> P2_CROSS{Cross-references<br/>found?}\n    end\n\n    subgraph \"PHASE 3: Backtracking\"\n        P2_CROSS -->|Yes| P3_CHECK{Referenced doc<br/>was SKIPPED?}\n        P3_CHECK -->|Yes| P3_BACKTRACK[Go back and parse<br/>referenced document]\n        P3_BACKTRACK --> P2_EXTRACT\n        P3_CHECK -->|No| P3_CONTINUE[Continue analysis]\n        P2_CROSS -->|No| P3_CONTINUE\n    end\n\n    subgraph \"Final Answer\"\n        P3_CONTINUE --> ANSWER[Generate answer<br/>with citations]\n        ANSWER --> SOURCES[List sources consulted]\n        SOURCES --> END([Return to user])\n    end\n\n    style P1_START fill:#4caf50,color:#fff\n    style END fill:#4caf50,color:#fff\n    style REL fill:#4caf50,color:#fff\n    style MAYBE fill:#ff9800,color:#000\n    style SKIP fill:#9e9e9e,color:#fff\n    style P3_BACKTRACK fill:#e91e63,color:#fff\n```\n\n### Cross-Reference Detection\n\n```mermaid\nflowchart LR\n    subgraph \"Document Content\"\n        DOC[Parsed Document]\n    end\n\n    subgraph 
\"Pattern Matching\"\n        DOC --> P1[\"'See Exhibit A/B/C...'\"]\n        DOC --> P2[\"'As stated in [Document]...'\"]\n        DOC --> P3[\"'Refer to [filename]...'\"]\n        DOC --> P4[\"'per Document: [name]'\"]\n        DOC --> P5[\"'[Doc #XX]'\"]\n    end\n\n    subgraph \"Action\"\n        P1 --> FOUND[Cross-reference found]\n        P2 --> FOUND\n        P3 --> FOUND\n        P4 --> FOUND\n        P5 --> FOUND\n        \n        FOUND --> CHECK{Was referenced<br/>doc SKIPPED?}\n        CHECK -->|Yes| BACKTRACK[Backtrack and parse]\n        CHECK -->|No| CONTINUE[Continue]\n    end\n\n    style BACKTRACK fill:#e91e63,color:#fff\n```\n\n---\n\n## Token Tracking & Cost Estimation\n\n### TokenUsage Class\n\n```mermaid\nflowchart TB\n    subgraph \"Input Tracking\"\n        API[API Call] --> PROMPT[prompt_token_count]\n        API --> COMPLETION[candidates_token_count]\n        PROMPT --> ADD_API[add_api_call]\n        COMPLETION --> ADD_API\n    end\n\n    subgraph \"Tool Tracking\"\n        TOOL[Tool Execution] --> RESULT[result string]\n        RESULT --> ADD_TOOL[add_tool_result]\n        ADD_TOOL --> CHARS[tool_result_chars += len]\n        ADD_TOOL --> PARSED{tool_name?}\n        PARSED -->|parse_file| INC_PARSED[documents_parsed++]\n        PARSED -->|preview_file| INC_PARSED\n        PARSED -->|scan_folder| INC_SCANNED[documents_scanned += count]\n    end\n\n    subgraph \"Cost Calculation\"\n        ADD_API --> TOTALS[Update totals]\n        TOTALS --> CALC[_calculate_cost]\n        CALC --> INPUT_COST[\"input_cost = prompt_tokens × $0.075/1M\"]\n        CALC --> OUTPUT_COST[\"output_cost = completion_tokens × $0.30/1M\"]\n        INPUT_COST --> TOTAL_COST[total_cost]\n        OUTPUT_COST --> TOTAL_COST\n    end\n\n    subgraph \"Summary Output\"\n        TOTAL_COST --> SUMMARY[summary]\n        CHARS --> SUMMARY\n        INC_PARSED --> SUMMARY\n        INC_SCANNED --> SUMMARY\n    end\n```\n\n### Cost Estimation Formula\n\n```mermaid\ngraph LR\n   
 subgraph \"Gemini 2.0 Flash Pricing\"\n        INPUT[\"Input: $0.075 / 1M tokens\"]\n        OUTPUT[\"Output: $0.30 / 1M tokens\"]\n    end\n\n    subgraph \"Calculation\"\n        PROMPT[prompt_tokens] --> DIV1[÷ 1,000,000]\n        DIV1 --> MULT1[× $0.075]\n        MULT1 --> INPUT_COST[Input Cost]\n\n        COMP[completion_tokens] --> DIV2[÷ 1,000,000]\n        DIV2 --> MULT2[× $0.30]\n        MULT2 --> OUTPUT_COST[Output Cost]\n\n        INPUT_COST --> SUM[+]\n        OUTPUT_COST --> SUM\n        SUM --> TOTAL[Total Estimated Cost]\n    end\n\n    style TOTAL fill:#4caf50,color:#fff\n```\n\n---\n\n## CLI Interface\n\n### Output Formatting\n\n```mermaid\nflowchart TB\n    subgraph \"Event Handling\"\n        EVENT{Event Type}\n        \n        EVENT -->|ToolCallEvent| TOOL_PANEL[format_tool_panel]\n        EVENT -->|GoDeeperEvent| NAV_PANEL[format_navigation_panel]\n        EVENT -->|AskHumanEvent| HUMAN_PANEL[Human Input Panel]\n        EVENT -->|ExplorationEndEvent| FINAL_PANEL[Final Answer Panel]\n    end\n\n    subgraph \"Tool Panel Components\"\n        TOOL_PANEL --> ICON[Tool Icon 📂📖👁️🔍]\n        TOOL_PANEL --> STEP[Step Number]\n        TOOL_PANEL --> PHASE[Phase Label]\n        TOOL_PANEL --> TARGET[Target File/Directory]\n        TOOL_PANEL --> REASON[Agent's Reasoning]\n    end\n\n    subgraph \"Final Panel Components\"\n        FINAL_PANEL --> ANSWER[Answer with Citations]\n        FINAL_PANEL --> SOURCES[Sources Consulted]\n    end\n\n    subgraph \"Summary Panel\"\n        SUMMARY[Workflow Summary]\n        SUMMARY --> STEPS[Total Steps]\n        SUMMARY --> CALLS[API Calls]\n        SUMMARY --> DOCS[Documents Scanned/Parsed]\n        SUMMARY --> TOKENS[Token Usage]\n        SUMMARY --> COST[Estimated Cost]\n    end\n\n    FINAL_PANEL --> SUMMARY\n```\n\n### Visual Elements\n\n```mermaid\ngraph TB\n    subgraph \"Panel Styles\"\n        TOOL[\"📂 Tool Call<br/>border: yellow\"]\n        NAV[\"📁 Navigation<br/>border: magenta\"]\n        HUMAN[\"❓ 
Human Input<br/>border: red\"]\n        FINAL[\"✅ Final Answer<br/>border: green\"]\n        SUMMARY[\"📊 Summary<br/>border: blue\"]\n    end\n\n    subgraph \"Tool Icons\"\n        I1[\"📂 scan_folder\"]\n        I2[\"👁️ preview_file\"]\n        I3[\"📖 parse_file\"]\n        I4[\"📄 read\"]\n        I5[\"🔍 grep\"]\n        I6[\"🔎 glob\"]\n    end\n\n    subgraph \"Phase Labels\"\n        PH1[\"Phase 1: Parallel Document Scan\"]\n        PH2[\"Phase 2: Deep Dive\"]\n        PH3[\"Phase 1/2: Quick Preview\"]\n    end\n\n    style TOOL fill:#ffeb3b,color:#000\n    style NAV fill:#e1bee7,color:#000\n    style HUMAN fill:#ffcdd2,color:#000\n    style FINAL fill:#c8e6c9,color:#000\n    style SUMMARY fill:#bbdefb,color:#000\n```\n\n---\n\n## Data Flow\n\n### Complete Request Flow\n\n```mermaid\nsequenceDiagram\n    participant User\n    participant CLI as main.py\n    participant WF as Workflow\n    participant Agent as FsExplorerAgent\n    participant LLM as Gemini API\n    participant Tools as Tool Registry\n    participant Docling\n    participant Cache\n    participant FS as Filesystem\n\n    User->>CLI: uv run explore --task \"...\"\n    CLI->>CLI: print_workflow_header()\n    CLI->>WF: workflow.run(InputEvent)\n\n    loop Until StopAction\n        WF->>Agent: configure_task()\n        Agent->>LLM: generate_content()\n        LLM-->>Agent: Action JSON\n        Agent->>Agent: Track tokens\n\n        alt ToolCallAction\n            Agent->>Tools: TOOLS[name](**args)\n            \n            alt Document Tool\n                Tools->>Cache: Check cache\n                alt Cache Hit\n                    Cache-->>Tools: Cached content\n                else Cache Miss\n                    Cache->>Docling: Convert document\n                    Docling->>FS: Read file\n                    FS-->>Docling: Raw bytes\n                    Docling-->>Cache: Markdown content\n                    Cache-->>Tools: Content\n                end\n            else Filesystem Tool\n      
          Tools->>FS: Execute operation\n                FS-->>Tools: Result\n            end\n            \n            Tools-->>Agent: Tool result\n            Agent->>Agent: Track tool metrics\n            WF-->>CLI: ToolCallEvent\n            CLI->>CLI: format_tool_panel()\n        else GoDeeperAction\n            WF->>WF: Update directory state\n            WF-->>CLI: GoDeeperEvent\n            CLI->>CLI: format_navigation_panel()\n        else AskHumanAction\n            WF-->>CLI: AskHumanEvent\n            CLI->>User: Display question\n            User->>CLI: Enter response\n            CLI->>WF: HumanAnswerEvent\n        else StopAction\n            WF-->>CLI: ExplorationEndEvent\n        end\n    end\n\n    CLI->>CLI: Display final answer\n    CLI->>CLI: print_workflow_summary()\n    CLI-->>User: Complete output\n```\n\n---\n\n## File Structure\n\n```\nfs-explorer/\n├── src/\n│   └── fs_explorer/\n│       ├── __init__.py      # Public API exports\n│       ├── main.py          # CLI entry point (typer)\n│       ├── workflow.py      # Event-driven workflow orchestration\n│       ├── agent.py         # AI agent + Gemini integration\n│       ├── models.py        # Pydantic action schemas\n│       └── fs.py            # Filesystem + Docling operations\n├── tests/\n│   ├── conftest.py          # Test fixtures and mocks\n│   ├── test_agent.py        # Agent unit tests\n│   ├── test_fs.py           # Filesystem function tests\n│   ├── test_models.py       # Model tests\n│   ├── test_e2e.py          # End-to-end integration tests\n│   └── testfiles/           # Test data\n├── data/\n│   ├── large_acquisition/   # Sample PDF documents\n│   └── test_acquisition/    # Test document set\n├── scripts/\n│   ├── generate_test_docs.py\n│   └── generate_large_docs.py\n├── pyproject.toml           # Project configuration\n├── Makefile                 # Development commands\n├── README.md                # User documentation\n└── ARCHITECTURE.md          # This 
 file\n```\n\n---\n\n## Extension Points\n\n### Adding New Tools\n\n```mermaid\nflowchart LR\n    subgraph \"Step 1: Define Function\"\n        FUNC[\"def new_tool(args) -> str\"]\n    end\n\n    subgraph \"Step 2: Register Tool\"\n        TOOLS[\"TOOLS dict in agent.py\"]\n        FUNC --> TOOLS\n    end\n\n    subgraph \"Step 3: Update Types\"\n        TYPES[\"Tools TypeAlias in models.py\"]\n        TOOLS --> TYPES\n    end\n\n    subgraph \"Step 4: Update Prompt\"\n        PROMPT[\"SYSTEM_PROMPT in agent.py\"]\n        TYPES --> PROMPT\n    end\n\n    style FUNC fill:#e3f2fd\n    style TOOLS fill:#f3e5f5\n    style TYPES fill:#fff3e0\n    style PROMPT fill:#e8f5e9\n```\n\n### Adding New Document Formats\n\n```mermaid\nflowchart LR\n    subgraph \"Docling Supported\"\n        PDF[PDF] --> DOCLING[Docling]\n        DOCX[DOCX] --> DOCLING\n        PPTX[PPTX] --> DOCLING\n        XLSX[XLSX] --> DOCLING\n        HTML[HTML] --> DOCLING\n        MD[Markdown] --> DOCLING\n    end\n\n    subgraph \"To Add New Format\"\n        NEW[New Format] --> CHECK{Docling<br/>supports?}\n        CHECK -->|Yes| ADD[\"Add to SUPPORTED_EXTENSIONS\"]\n        CHECK -->|No| CUSTOM[\"Create custom handler<br/>in fs.py\"]\n    end\n\n    DOCLING --> OUTPUT[Markdown]\n    ADD --> OUTPUT\n    CUSTOM --> OUTPUT\n```\n\n### Customizing the System Prompt\n\nThe system prompt in `agent.py` can be modified to:\n\n1. **Add new exploration strategies**\n2. **Change citation format**\n3. **Adjust categorization criteria**\n4. **Add domain-specific instructions**\n\n```python\nSYSTEM_PROMPT = \"\"\"\n# Customize this prompt to change agent behavior\n\n## Your custom instructions here\n...\n\"\"\"\n```\n\n---\n\n## Performance Characteristics\n\n| Metric | Typical Value | Notes |\n|--------|---------------|-------|\n| Parallel scan threads | 4 | Configurable via `DEFAULT_MAX_WORKERS` |\n| Preview size | 1500 chars | ~1 page of content |\n| Full preview size | 3000 chars | ~2-3 pages |\n| Document cache | In-memory | Keyed by path + mtime |\n| Workflow timeout | 300 seconds | 5 minutes for complex queries |\n| API model | gemini-2.0-flash | Fast, cost-effective |\n\n---\n\n## Security Considerations\n\n1. **API Key**: Stored in environment variable `GOOGLE_API_KEY`\n2. **Local Processing**: Documents parsed locally via Docling (no cloud upload)\n3. **Filesystem Access**: Limited to current working directory\n4. **No Persistent Storage**: Document cache is in-memory only\n\n---\n\n*Last updated: 2026-01-03*\n"
  },
  {
    "path": "CLAUDE.md",
    "content": "# CLAUDE.md\n\nThis file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.\n\n## Project Overview\n\nAgentic File Search is an AI-powered document search agent that explores files dynamically rather than using pre-computed embeddings. It uses a three-phase strategy: parallel scan, deep dive, and backtracking for cross-references. There is also an optional DuckDB-backed indexing pipeline for pre-indexed semantic+metadata retrieval.\n\n**Tech Stack:** Python 3.10+, Google Gemini 3 Flash, LlamaIndex Workflows, Docling (document parsing), DuckDB (indexing), langextract (optional metadata extraction), FastAPI + WebSocket, Typer + Rich CLI.\n\n## Common Commands\n\n```bash\n# Install dependencies\nuv pip install .\nuv pip install -e \".[dev]\"  # with dev dependencies\n\n# Run CLI (agentic exploration)\nuv run explore --task \"What is the purchase price?\" --folder data/test_acquisition/\n\n# Run CLI (indexed query - requires prior indexing)\nuv run explore index data/test_acquisition/\nuv run explore query --task \"What is the purchase price?\" --folder data/test_acquisition/\n\n# Schema management\nuv run explore schema discover data/test_acquisition/\nuv run explore schema show data/test_acquisition/\n\n# Run web UI\nuv run uvicorn fs_explorer.server:app --host 127.0.0.1 --port 8000\n\n# Run tests\nuv run pytest                      # all tests\nuv run pytest tests/test_fs.py     # single file\nuv run pytest -k \"test_name\"       # single test\n\n# Lint, format, typecheck (also available via Makefile)\nuv run pre-commit run -a           # lint (or: make lint)\nuv run ruff check .                
# ruff only\nuv run ruff format                 # format (or: make format)\nuv run ty check src/fs_explorer/   # typecheck (or: make typecheck)\n```\n\nEntry points defined in `pyproject.toml`: `explore` → `fs_explorer.main:app`, `explore-ui` → `fs_explorer.server:run_server`.\n\n## Architecture\n\n### Core Flow (Agentic Mode)\n```\nUser Query → Workflow (LlamaIndex) → Agent (Gemini) → Tools → Docling → Filesystem\n```\n\n### Core Flow (Indexed Mode)\n```\nUser Query → Workflow → Agent → semantic_search/get_document → DuckDB → Ranked Results\n```\n\n### Key Modules (src/fs_explorer/)\n\n- **workflow.py**: Event-driven orchestration using `llama-index-workflows`. Defines `FsExplorerWorkflow` with steps: `start_exploration`, `go_deeper_action`, `tool_call_action`, `receive_human_answer`. Uses singleton agent via `get_agent()`.\n\n- **agent.py**: `FsExplorerAgent` manages Gemini API interaction. Chat history accumulates in `_chat_history`. `take_action()` sends history to LLM, receives structured JSON `Action`, auto-executes tool calls. `TokenUsage` tracks costs. Also contains the `TOOLS` registry (9 tools), `SYSTEM_PROMPT`, and indexed tool functions (`semantic_search`, `get_document`, `list_indexed_documents`). Index context is managed via module-level `set_index_context()`/`clear_index_context()`.\n\n- **models.py**: Pydantic schemas for structured LLM output. `Action` contains one of: `ToolCallAction`, `GoDeeperAction`, `StopAction`, `AskHumanAction`. `Tools` TypeAlias defines all available tool names.\n\n- **fs.py**: Filesystem operations. `scan_folder()` uses ThreadPoolExecutor for parallel document processing. `_DOCUMENT_CACHE` (dict) caches parsed documents keyed by `path:mtime`. 
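The mtime-keyed caching described here can be sketched roughly as follows (an illustrative stand-in, not the actual `fs.py` code; `parse` is a placeholder for the Docling conversion step):

```python
import os

_DOCUMENT_CACHE: dict[str, str] = {}

def get_cached(path: str, parse) -> str:
    # Key on path plus mtime so a modified file misses the cache and is
    # re-parsed, while unchanged files are served from memory.
    key = f"{path}:{os.path.getmtime(path)}"
    if key not in _DOCUMENT_CACHE:
        _DOCUMENT_CACHE[key] = parse(path)
    return _DOCUMENT_CACHE[key]
```
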
Docling converts PDF/DOCX/PPTX/XLSX/HTML/MD to markdown.\n\n- **main.py**: Typer CLI entry point with subcommands: default (agentic explore), `index`, `query`, `schema discover`, `schema show`.\n\n- **server.py**: FastAPI server with WebSocket endpoint `/ws/explore` for real-time streaming.\n\n- **exploration_trace.py**: Records tool call paths and extracts cited sources from final answers for the CLI summary.\n\n### Indexing Subsystem (src/fs_explorer/indexing/)\n\n- **pipeline.py**: `IndexingPipeline` orchestrates document parsing → chunking → metadata extraction → DuckDB upsert. Walks a folder for supported files, delegates to `SmartChunker` and `extract_metadata()`, handles schema resolution and deleted-file cleanup.\n\n- **chunker.py**: `SmartChunker` splits parsed document text into overlapping chunks.\n\n- **schema.py**: `SchemaDiscovery` auto-discovers metadata schemas from a corpus folder (file types, heuristic boolean fields like `mentions_currency`/`mentions_dates`). Optionally includes langextract fields.\n\n- **metadata.py**: `extract_metadata()` produces per-document metadata dicts. Heuristic fields (filename, extension, document_type, currency/date detection) are always available. Optional langextract integration calls the `langextract` library for entity extraction (organizations, people, deal terms, etc.) via configurable profiles.\n\n### Search Subsystem (src/fs_explorer/search/)\n\n- **query.py**: `IndexedQueryEngine` runs parallel semantic (chunk text matching) + metadata (JSON filter) retrieval paths using ThreadPoolExecutor, then merges and ranks via `RankedDocument.combined_score`.\n\n- **filters.py**: `parse_metadata_filters()` parses a human-readable filter DSL (`field=value`, `field>=num`, `field in (a, b)`, `field~substring`) into `MetadataFilter` objects. Validates against allowed schema fields.\n\n- **ranker.py**: `RankedDocument` dataclass with `combined_score` (semantic * 100 + metadata * 10). 
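The scoring rule can be illustrated with a small sketch (field names follow the prose here and may not match the real dataclass exactly):

```python
from dataclasses import dataclass

@dataclass
class RankedDocument:
    path: str
    semantic_score: float   # chunk-text match score
    metadata_score: float   # metadata filter match score

    @property
    def combined_score(self) -> float:
        # Semantic matches dominate; metadata matches act as a tiebreaker.
        return self.semantic_score * 100 + self.metadata_score * 10

def rank_documents(docs, limit=10):
    return sorted(docs, key=lambda d: d.combined_score, reverse=True)[:limit]
```
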
`rank_documents()` sorts and limits.\n\n### Storage Subsystem (src/fs_explorer/storage/)\n\n- **duckdb.py**: `DuckDBStorage` manages four tables: `corpora`, `documents`, `chunks`, `schemas`. Key operations: `upsert_document`, `search_chunks` (keyword-based scoring), `search_documents_by_metadata` (JSON path filtering via `json_extract_string`), schema CRUD. Corpus/doc/chunk IDs are SHA1-based stable hashes.\n\n- **base.py**: `StorageBackend` protocol and shared dataclasses (`DocumentRecord`, `ChunkRecord`, `SchemaRecord`).\n\n### Index Config\n\n- **index_config.py**: `resolve_db_path()` resolves DuckDB path with precedence: CLI `--db-path` > `FS_EXPLORER_DB_PATH` env > `~/.fs_explorer/index.duckdb`.\n\n### Workflow Event Types\n- `InputEvent` → starts exploration\n- `ToolCallEvent` → tool execution\n- `GoDeeperEvent` → directory navigation\n- `AskHumanEvent`/`HumanAnswerEvent` → human interaction\n- `ExplorationEndEvent` → completion with `final_result` or `error`\n\n### Adding New Tools\n1. Implement function in `fs.py` (filesystem) or `agent.py` (indexed) returning `str`\n2. Add to `TOOLS` dict in `agent.py`\n3. Add to `Tools` TypeAlias in `models.py`\n4. Update `SYSTEM_PROMPT` in `agent.py`\n5. Update `TOOL_ICONS` and `PHASE_DESCRIPTIONS` in `main.py`\n\n## Environment\n\n- `GOOGLE_API_KEY` (required) — in `.env` file or environment variable\n- `FS_EXPLORER_DB_PATH` (optional) — override default DuckDB location\n- `FS_EXPLORER_LANGEXTRACT_MAX_CHARS` (optional) — max chars sent to langextract (default 6000)\n- `FS_EXPLORER_LANGEXTRACT_MODEL` (optional) — model for langextract (default `gemini-3-flash-preview`)\n\n## Testing\n\nTests mock the Gemini client via `MockGenAIClient` in `conftest.py`. Use `reset_agent()` to clear singleton state between tests. 
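The singleton-reset pattern those tests depend on looks roughly like this (illustrative stand-ins; the real `get_agent()`/`reset_agent()` live in the package):

```python
_agent = None

def get_agent():
    # Lazily construct the process-wide agent (stand-in for FsExplorerAgent).
    global _agent
    if _agent is None:
        _agent = object()
    return _agent

def reset_agent():
    # Called between tests so each test starts from a fresh agent.
    global _agent
    _agent = None
```
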
The mock always returns a `StopAction` response.\n\nKey test files:\n- `test_agent.py` / `test_e2e.py` — agent and workflow integration\n- `test_fs.py` — filesystem tools\n- `test_indexing.py` / `test_cli_indexing.py` — indexing pipeline and CLI\n- `test_search.py` — search/filter/ranking\n- `test_exploration_trace.py` — trace and citation extraction\n\nTest documents live in `data/test_acquisition/` and `data/large_acquisition/`. Test fixtures for unit tests are in `tests/testfiles/`.\n"
  },
  {
    "path": "IMPLEMENTATION_PLAN.md",
    "content": "# Implementation Plan: Hybrid Semantic + Agentic Search (Revised)\n\n## Overview\n\nAdd semantic search with optional metadata filtering to `agentic-file-search` without regressing the current agentic workflow.\n\nThe revised approach keeps the current CLI and behavior stable first, introduces indexing as opt-in, and only enables auto-detection after compatibility and quality checks pass.\n\n- Storage: DuckDB + `vss` (embedded, local file)\n- Embeddings: Gemini embeddings (API-backed)\n- Metadata extraction: `langextract` (optional)\n- Infrastructure model: no external database service (no Docker/Postgres required)\n\n---\n\n## Goals\n\n1. Preserve existing `explore --task` behavior and UX by default.\n2. Add a fast indexed path for large corpora.\n3. Support metadata-aware filtering when metadata is available.\n4. Keep agentic deep-read and cross-reference behavior available.\n\n## Non-Goals (Initial Release)\n\n1. Replacing the existing agentic strategy entirely.\n2. Forcing index usage for all queries.\n3. Heuristic/NLP folder extraction from free-form task text.\n\n---\n\n## Current Codebase Constraints to Respect\n\n1. CLI currently has one root command (`explore --task`) and no subcommands.\n2. Workflow and server currently use shared/global process state (`os.chdir`, singleton agent).\n3. 
Existing tests assert the current 6-tool model and prompt behavior.\n\nThese constraints require a staged rollout to avoid breaking current users.\n\n---\n\n## High-Level Architecture\n\n```text\nINDEX TIME\n├── Parse documents (Docling)\n├── Chunk content (paragraph/sentence-aware)\n├── Generate embeddings (provider-configured dimension)\n├── [optional] Extract metadata (langextract)\n└── Persist in DuckDB (corpus-scoped)\n\nQUERY TIME\n├── Retrieve by semantic search\n├── [optional] Retrieve by metadata filter\n├── Union + rank results\n├── Expand via cross-references where needed\n└── Agent continues deep exploration using existing tools\n```\n\n---\n\n## Data Model (DuckDB)\n\nUse corpus-scoped tables and file freshness fields to prevent collisions and stale indexes.\n\n```sql\n-- Install and load extension programmatically\n-- INSTALL vss; LOAD vss;\n\nCREATE TABLE IF NOT EXISTS corpora (\n    id VARCHAR PRIMARY KEY,\n    root_path VARCHAR NOT NULL UNIQUE,\n    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP\n);\n\nCREATE TABLE IF NOT EXISTS documents (\n    id VARCHAR PRIMARY KEY,\n    corpus_id VARCHAR NOT NULL REFERENCES corpora(id),\n    relative_path VARCHAR NOT NULL,\n    absolute_path VARCHAR NOT NULL,\n    content VARCHAR NOT NULL,\n    metadata JSON NOT NULL DEFAULT '{}',\n    file_mtime DOUBLE NOT NULL,\n    file_size BIGINT NOT NULL,\n    content_sha256 VARCHAR NOT NULL,\n    last_indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,\n    is_deleted BOOLEAN DEFAULT FALSE,\n    UNIQUE(corpus_id, relative_path)\n);\n\n-- EMBEDDING_DIM is configured in code at index creation time.\nCREATE TABLE IF NOT EXISTS chunks (\n    id VARCHAR PRIMARY KEY,\n    doc_id VARCHAR NOT NULL REFERENCES documents(id),\n    text VARCHAR NOT NULL,\n    embedding FLOAT[${EMBEDDING_DIM}] NOT NULL,\n    embedding_dim INTEGER NOT NULL,\n    position INTEGER NOT NULL,\n    start_char INTEGER NOT NULL,\n    end_char INTEGER NOT NULL\n);\n\nCREATE TABLE IF NOT EXISTS schemas (\n    id 
INTEGER PRIMARY KEY,\n    corpus_id VARCHAR REFERENCES corpora(id),\n    name VARCHAR,\n    schema_def JSON NOT NULL,\n    is_active BOOLEAN DEFAULT FALSE,\n    UNIQUE(corpus_id, name)\n);\n\nCREATE INDEX IF NOT EXISTS idx_chunks_embedding\nON chunks USING HNSW (embedding) WITH (metric = 'cosine');\n```\n\n### Embedding Dimension Rule\n\n`EMBEDDING_DIM` must be a runtime config constant validated at startup. Do not hardcode `1536` across modules.\n\n### DB Location\n\nDefault: `~/.fs_explorer/index.duckdb`\nOverride via:\n- `FS_EXPLORER_DB_PATH`\n- CLI: `--db-path`\n\n---\n\n## CLI Contract and Rollout\n\n### Compatibility Rules (Required)\n\n1. `uv run explore --task \"...\"` must keep working as-is.\n2. Existing non-indexed behavior remains default in initial rollout.\n3. New indexed behavior is opt-in first.\n\n### New Commands\n\n```bash\n# Index management\nuv run explore index <folder>\nuv run explore index <folder> --with-metadata\nuv run explore index <folder> --schema schema.json\n\n# Indexed query path\nuv run explore query --task \"...\" --folder <folder> [--filter \"...\"]\n\n# Schema inspection\nuv run explore schema --discover <folder>\nuv run explore schema --show --folder <folder>\n\n# Existing command (backward-compatible)\nuv run explore --task \"...\" [--folder <folder>] [--use-index]\n```\n\n### Folder Resolution (Deterministic)\n\nFor commands that need corpus selection:\n1. If `--folder` is provided, use it.\n2. Else use current working directory (`.`).\n3. Do not parse folder intent from natural language task text in v1.\n\n### Auto-Detection Strategy\n\n- v1: explicit `--use-index` only.\n- v2: optional auto-detect behind feature flag `FS_EXPLORER_AUTO_INDEX=1`.\n- v3: default auto-detect only after parity tests and quality benchmarks pass.\n\n---\n\n## Server and Concurrency Requirements\n\nBefore adding indexing/search endpoints:\n\n1. Remove request-level `os.chdir` usage; pass absolute target folder through workflow state.\n2. 
Avoid global singleton agent across concurrent requests; instantiate per workflow run/session.\n3. Add per-corpus index lock to avoid concurrent write corruption.\n4. Keep read queries concurrent-safe.\n\n---\n\n## Module Structure\n\n```text\nsrc/fs_explorer/\n├── storage/\n│   ├── __init__.py\n│   ├── base.py\n│   └── duckdb.py\n├── indexing/\n│   ├── __init__.py\n│   ├── pipeline.py\n│   ├── chunker.py\n│   ├── metadata.py\n│   └── schema.py\n├── search/\n│   ├── __init__.py\n│   ├── query.py\n│   ├── semantic.py\n│   ├── filters.py\n│   └── ranker.py\n├── embeddings.py\n└── index_config.py\n```\n\n---\n\n## Files to Modify\n\n| File | Changes |\n|------|---------|\n| `src/fs_explorer/agent.py` | Add indexed tools and prompt guidance while keeping existing tools |\n| `src/fs_explorer/models.py` | Extend `Tools` type alias |\n| `src/fs_explorer/main.py` | Add subcommands + `--folder` + `--use-index` while preserving root command |\n| `src/fs_explorer/workflow.py` | Remove global/shared run-state assumptions |\n| `src/fs_explorer/fs.py` | Support safe path resolution without cwd mutation |\n| `src/fs_explorer/server.py` | Add index/search endpoints and remove `os.chdir` coupling |\n| `pyproject.toml` | Add `duckdb`, `langextract` |\n\n---\n\n## Implementation Phases\n\n### Phase 0: Contracts and Safety (New)\n\n1. Freeze CLI compatibility requirements (`explore --task` must remain stable).\n2. Define deterministic folder resolution contract.\n3. Define per-request state model for workflow/server.\n4. Add failing tests for compatibility and concurrency assumptions.\n\n### Phase 1: Storage + Embeddings\n\n5. Implement `storage/base.py` (backend interface).\n6. Implement `storage/duckdb.py` with corpus-scoped schema.\n7. Implement `embeddings.py` with configurable embedding dimension.\n8. Add storage/embedding tests (including dimension validation).\n\n### Phase 2: Indexing Pipeline\n\n9. Implement `indexing/chunker.py`.\n10. 
Implement optional `indexing/metadata.py`.\n11. Implement `indexing/schema.py`.\n12. Implement `indexing/pipeline.py` with freshness checks (`mtime`, hash, deleted files).\n13. Add indexing tests.\n\n### Phase 3: Search Pipeline\n\n14. Implement `search/filters.py`.\n15. Implement `search/ranker.py`.\n16. Implement `search/query.py` (parallel retrieval + union).\n17. Implement cross-reference expansion hooks.\n18. Add search tests.\n\n### Phase 4: Agent Integration (Opt-in)\n\n19. Add tools: `semantic_search`, `get_document`, `list_indexed_documents`.\n20. Keep existing 6 filesystem tools available.\n21. Add indexed prompt guidance without removing current strategy.\n22. Add tool-selection tests for indexed and non-indexed paths.\n\n### Phase 5: CLI + Server Integration\n\n23. Add `explore index/query/schema` commands.\n24. Add `--folder` and `--use-index` to root command.\n25. Integrate indexed path into workflow when explicitly requested.\n26. Add `/api/index` and `/api/search` endpoints.\n27. Remove `os.chdir` in server workflow path.\n\n### Phase 6: Auto-Detect Rollout (Guarded)\n\n28. Add feature-flagged auto-detect (`FS_EXPLORER_AUTO_INDEX`).\n29. Add parity checks between indexed and baseline runs on test corpora.\n30. Keep fallback to legacy behavior on index errors.\n\n### Phase 7: Testing and Docs\n\n31. Full integration tests.\n32. Backward compatibility tests.\n33. Concurrency tests for WebSocket/API usage.\n34. Performance benchmarks and docs updates.\n\n---\n\n## Revised Design Decisions\n\n1. **Opt-in First**: indexed retrieval starts behind `--use-index` to avoid regressions.\n2. **Deterministic Corpus Selection**: explicit `--folder` or `.` fallback only.\n3. **Corpus-Scoped Storage**: avoid global path collisions by namespacing.\n4. **Freshness Tracking**: incremental reindex using mtime/hash/deletion markers.\n5. **No Global Request State**: remove `os.chdir` and shared singleton pitfalls in server flows.\n6. 
**Configurable Embedding Dimension**: validated at runtime; not hardcoded everywhere.\n7. **No External DB Service**: embedded local DB only; APIs are still external dependencies.\n\n---\n\n## Verification Steps\n\n```bash\n# Baseline safety (must stay green)\nuv run pytest tests/test_models.py tests/test_fs.py tests/test_agent.py -v\n\n# Phase 1-3\nuv run pytest tests/test_storage.py tests/test_embeddings.py tests/test_search.py -v\n\n# Index build + inspect (note: duckdb.connect does not expand '~' itself)\nuv run explore index data/test_acquisition/\nuv run python -c \"import duckdb, os; db=duckdb.connect(os.path.expanduser('~/.fs_explorer/index.duckdb')); print(db.execute('SELECT COUNT(*) FROM documents').fetchone())\"\n\n# Opt-in indexed execution\nuv run explore --task \"Search for acquisition terms\" --folder data/test_acquisition --use-index\n\n# Compatibility execution (legacy path)\nuv run explore --task \"Look in data/test_acquisition/. Who is the CTO?\"\n\n# CLI checks\nuv run explore --help\nuv run explore index --help\nuv run explore query --help\nuv run explore schema --help\n\n# Full suite\nuv run pytest tests/ -v\n```\n\n---\n\n## Dependencies to Add\n\n```toml\n# pyproject.toml\ndependencies = [\n    # ... 
existing ...\n    \"duckdb>=1.0.0\",\n    \"langextract>=1.0.0\",\n]\n```\n\n---\n\n## Critical Files Summary\n\n| Purpose | Path |\n|---------|------|\n| Storage interface | `src/fs_explorer/storage/base.py` |\n| DuckDB backend | `src/fs_explorer/storage/duckdb.py` |\n| Embeddings | `src/fs_explorer/embeddings.py` |\n| Chunking | `src/fs_explorer/indexing/chunker.py` |\n| Metadata extraction | `src/fs_explorer/indexing/metadata.py` |\n| Schema discovery | `src/fs_explorer/indexing/schema.py` |\n| Indexing pipeline | `src/fs_explorer/indexing/pipeline.py` |\n| Query pipeline | `src/fs_explorer/search/query.py` |\n| Filter parsing | `src/fs_explorer/search/filters.py` |\n| Result ranking | `src/fs_explorer/search/ranker.py` |\n| Agent tools/prompt | `src/fs_explorer/agent.py` |\n| Tool types | `src/fs_explorer/models.py` |\n| CLI commands | `src/fs_explorer/main.py` |\n| Workflow safety | `src/fs_explorer/workflow.py` |\n| Server safety/endpoints | `src/fs_explorer/server.py` |\n"
  },
  {
    "path": "Makefile",
    "content": ".PHONY: test lint format format-check typecheck build\n\nall: test lint format typecheck\n\ntest:\n\t$(info ****************** running tests ******************)\n\tuv run pytest tests\n\nlint:\n\t$(info ****************** linting ******************)\n\tuv run pre-commit run -a\n\nformat:\n\t$(info ****************** formatting ******************)\n\tuv run ruff format\n\nformat-check:\n\t$(info ****************** checking formatting ******************)\n\tuv run ruff format --check\n\ntypecheck:\n\t$(info ****************** type checking ******************)\n\tuv run ty check src/fs_explorer/\n\nbuild:\n\t$(info ****************** building ******************)\n\tuv build"
  },
  {
    "path": "README.md",
    "content": "# Agentic File Search\n\n> **Based on**: [run-llama/fs-explorer](https://github.com/run-llama/fs-explorer) — The original CLI agent for filesystem exploration.\n\nAn AI-powered document search agent that explores files like a human would — scanning, reasoning, and following cross-references. Unlike traditional RAG systems that rely on pre-computed embeddings, this agent dynamically navigates documents to find answers.\n\n## Why Agentic Search?\n\nTraditional RAG (Retrieval-Augmented Generation) has limitations:\n- **Chunks lose context** — Splitting documents destroys relationships between sections\n- **Cross-references are invisible** — \"See Exhibit B\" means nothing to embeddings\n- **Similarity ≠ Relevance** — Semantic matching misses logical connections\n\nThis system uses a **three-phase strategy**:\n1. **Parallel Scan** — Preview all documents in a folder at once\n2. **Deep Dive** — Full extraction on relevant documents only\n3. **Backtrack** — Follow cross-references to previously skipped documents\n\n## Watch the video\nThis video explains the architecture of the project and how to run it. 
\n[![Watch the demo on YouTube](https://img.youtube.com/vi/rMADSuus6jg/maxresdefault.jpg)](https://www.youtube.com/watch?v=rMADSuus6jg)\n\n## Features\n\n- 🔍 **6 Tools**: `scan_folder`, `preview_file`, `parse_file`, `read`, `grep`, `glob`\n- 📄 **Document Support**: PDF, DOCX, PPTX, XLSX, HTML, Markdown (via Docling)\n- 🤖 **Powered by**: Google Gemini 3 Flash with structured JSON output\n- 💰 **Cost Efficient**: ~$0.001 per query with token tracking\n- 🌐 **Web UI**: Real-time WebSocket streaming interface\n- 📊 **Citations**: Answers include source references\n\n## Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/PromtEngineer/agentic-file-search.git\ncd agentic-file-search\n\n# Install with uv (recommended)\nuv pip install .\n\n# Or with pip\npip install .\n```\n\n## Configuration\n\nCreate a `.env` file in the project root:\n\n```bash\nGOOGLE_API_KEY=your_api_key_here\n```\n\nGet your API key from [Google AI Studio](https://aistudio.google.com/apikey).\n\n## Usage\n\n### CLI\n\n```bash\n# Basic query\nuv run explore --task \"What is the purchase price in data/test_acquisition/?\"\n\n# Multi-document query\nuv run explore --task \"Look in data/large_acquisition/. 
What are all the financial terms including adjustments and escrow?\"\n```\n\n### Web UI\n\n```bash\n# Start the server\nuv run uvicorn fs_explorer.server:app --host 127.0.0.1 --port 8000\n\n# Open http://127.0.0.1:8000 in your browser\n```\n\nThe web UI provides:\n- Folder browser to select target directory\n- Real-time step-by-step execution log\n- Final answer with citations\n- Token usage and cost statistics\n\n## Architecture\n\n```\nUser Query\n    ↓\n┌─────────────────┐\n│ Workflow Engine │ ←→ LlamaIndex Workflows (event-driven)\n└────────┬────────┘\n         ↓\n┌─────────────────┐\n│     Agent       │ ←→ Gemini 3 Flash (structured JSON)\n└────────┬────────┘\n         ↓\n┌────────────────────────────────────────────────────┐\n│ scan_folder │ preview │ parse │ read │ grep │ glob │\n└────────────────────────────────────────────────────┘\n                          ↓\n          Document Parser (Docling - local)\n```\n\nSee [ARCHITECTURE.md](ARCHITECTURE.md) for detailed diagrams.\n\n## Test Documents\n\nThe repo includes test document sets for evaluation:\n\n- `data/test_acquisition/` — 10 interconnected legal documents\n- `data/large_acquisition/` — 25 documents with extensive cross-references\n\nExample queries:\n```bash\n# Simple (single doc)\nuv run explore --task \"Look in data/test_acquisition/. Who is the CTO?\"\n\n# Cross-reference required\nuv run explore --task \"Look in data/test_acquisition/. What is the adjusted purchase price?\"\n\n# Multi-document synthesis\nuv run explore --task \"Look in data/large_acquisition/. 
What happens to employees after the acquisition?\"\n```\n\n## Tech Stack\n\n| Component | Technology |\n|-----------|------------|\n| LLM | Google Gemini 3 Flash |\n| Document Parsing | Docling (local, open-source) |\n| Orchestration | LlamaIndex Workflows |\n| CLI | Typer + Rich |\n| Web Server | FastAPI + WebSocket |\n| Package Manager | uv |\n\n## Project Structure\n\n```\nsrc/fs_explorer/\n├── agent.py      # Gemini client, token tracking\n├── workflow.py   # LlamaIndex workflow engine\n├── fs.py         # File tools: scan, parse, grep\n├── models.py     # Pydantic models for actions\n├── main.py       # CLI entry point\n├── server.py     # FastAPI + WebSocket server\n└── ui.html       # Single-file web interface\n```\n\n## Development\n\n```bash\n# Install dev dependencies\nuv pip install -e \".[dev]\"\n\n# Run tests\nuv run pytest\n\n# Lint\nuv run ruff check .\n```\n\n## License\n\nMIT\n\n## Acknowledgments\n\n- Original concept from [run-llama/fs-explorer](https://github.com/run-llama/fs-explorer)\n- Document parsing by [Docling](https://github.com/DS4SD/docling)\n- Powered by [Google Gemini](https://deepmind.google/technologies/gemini/)\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=PromtEngineer/agentic-file-search&type=Date)](https://star-history.com/#PromtEngineer/agentic-file-search&Date)\n"
  },
  {
    "path": "YOUTUBE_DEMO_TESTS.md",
    "content": "# YouTube Demo: FS-Explorer Test Results\n\n## System Overview\n\n- **25 PDF documents** (~93 pages total)\n- **63 cross-references** between documents\n- **Parallel document scanning** using ThreadPoolExecutor\n- **Three-phase exploration**: Scan → Filter → Deep Dive + Backtracking\n\n---\n\n## Test Results Summary\n\n| Test | Question | Docs Parsed | Tokens Used | Cost | Time |\n|------|----------|-------------|-------------|------|------|\n| 1 | CTO & retention bonus | 1 | 27,376 | $0.0021 | ~30s |\n| 2 | Litigation & insurance | 2 | 41,431 | $0.0033 | ~30s |\n| 3 | Patents owned | 1 | 27,416 | $0.0022 | ~27s |\n\n---\n\n## TEST 1: Simple Fact Lookup ✅\n\n**Question:**\n```bash\nuv run explore --task \"Look in data/large_acquisition/. Who is the CTO and what is their retention bonus?\"\n```\n\n**Agent Answer:**\n> Dr. Sarah Chen is the CTO and their retention bonus is $1,200,000.\n\n**Verification (from 07_employee_schedule.pdf):**\n```\n1. Dr. Sarah Chen - CTO - 15 years experience - Retention bonus: $1,200,000\n```\n\n| Metric | Value |\n|--------|-------|\n| Documents Scanned | 26 (parallel) |\n| Documents Parsed | 1 |\n| Tokens Used | 27,376 |\n| Cost | $0.0021 |\n| ✅ Accuracy | **100% Correct** |\n\n---\n\n## TEST 2: Cross-Reference Following ✅\n\n**Question:**\n```bash\nuv run explore --task \"Look in data/large_acquisition/. What is the total litigation exposure and is it covered by insurance?\"\n```\n\n**Agent Answer:**\n> The total litigation exposure is between $650,000 and $2,350,000. The Smith v. InnovateTech Solutions, Inc. case, with an exposure of $150,000 - $350,000, is covered by insurance. The DataTech LLC v. InnovateTech Solutions, Inc. case, with an exposure of $500,000 - $2,000,000, does not specify if it is covered by insurance.\n\n**Verification (from 12_litigation_schedule.pdf):**\n```\n1. Smith v. InnovateTech Solutions, Inc.\n   Exposure: $150,000 - $350,000 (covered by insurance)\n\n2. DataTech LLC v. 
InnovateTech Solutions, Inc.\n   Exposure: Preliminary assessment $500,000 - $2,000,000\n```\n\n**Cross-Reference Followed:**\n- Agent noticed reference to \"Legal Opinion Letter\" \n- Parsed `24_legal_opinion.pdf` to get complete picture\n\n| Metric | Value |\n|--------|-------|\n| Documents Scanned | 26 (parallel) |\n| Documents Parsed | 2 (litigation + legal opinion) |\n| Tokens Used | 41,431 |\n| Cost | $0.0033 |\n| ✅ Accuracy | **100% Correct** |\n\n---\n\n## TEST 3: Multi-Document IP Analysis ✅\n\n**Question:**\n```bash\nuv run explore --task \"Look in data/large_acquisition/. What patents does the company own and are there any pending applications?\"\n```\n\n**Agent Answer:**\n> The company owns the following patents: \n> - US Patent 10,123,456 - 'Machine Learning System for Predictive Analytics' - Issued 2021\n> - US Patent 10,234,567 - 'Distributed Data Processing Architecture' - Issued 2022\n> - US Patent 10,345,678 - 'Real-time Anomaly Detection Method' - Issued 2023\n> \n> There is one pending application: US Application 17/456,789 - 'Automated Workflow Optimization' - Filed 2024.\n\n**Verification (from 06_ip_schedule.pdf):**\n```\nUS Patent 10,123,456 - 'Machine Learning System for Predictive Analytics' - Issued 2021\nUS Patent 10,234,567 - 'Distributed Data Processing Architecture' - Issued 2022\nUS Patent 10,345,678 - 'Real-time Anomaly Detection Method' - Issued 2023\nPending: US Application 17/456,789 - 'Automated Workflow Optimization' - Filed 2024\n```\n\n| Metric | Value |\n|--------|-------|\n| Documents Scanned | 26 (parallel) |\n| Documents Parsed | 1 |\n| Tokens Used | 27,416 |\n| Cost | $0.0022 |\n| ✅ Accuracy | **100% Correct** |\n\n---\n\n## Additional Demo Tests\n\n### Purchase Price & Payment Structure\n```bash\nuv run explore --task \"Look in data/large_acquisition/. 
What is the total purchase price and how is it being paid?\"\n```\n**Expected:** $125M total ($80M cash + $30M stock + $15M escrow)\n\n### Closing Conditions Status\n```bash\nuv run explore --task \"Look in data/large_acquisition/. What are all the conditions that must be satisfied before closing and what is the status of each?\"\n```\n**Expected:** HSR ✅, State filings ✅, MegaCorp consent ✅, GlobalBank pending, Employee retention ✅, Legal opinion ✅, Good standing ordered\n\n### Key Employee Compensation\n```bash\nuv run explore --task \"Look in data/large_acquisition/. List all the key employees and their retention bonuses\"\n```\n**Expected:** 5 employees totaling $3.5M in retention bonuses\n\n---\n\n## Key Architecture Points to Highlight\n\n### 1. Parallel Scanning (scan_folder)\n- Scans ALL 26 documents simultaneously using ThreadPoolExecutor\n- Takes ~25 seconds for entire folder\n- Returns quick preview of each document\n\n### 2. Smart Filtering\n- LLM reviews all previews at once\n- Identifies which documents are relevant\n- Avoids parsing irrelevant documents\n\n### 3. Cross-Reference Discovery\n- Agent watches for document references like:\n  - \"See Document: Legal Opinion Letter\"\n  - \"Per Document: Risk Assessment Memo\"\n- Automatically follows references (backtracking)\n\n### 4. Document Caching\n- Documents cached after first parse\n- Backtracking is free (no re-parsing)\n\n---\n\n## Cost Analysis\n\n| Scenario | Tokens | Est. Cost |\n|----------|--------|-----------|\n| Simple query (1 doc) | ~27K | $0.002 |\n| Cross-ref query (2-3 docs) | ~40K | $0.003 |\n| Complex synthesis (5+ docs) | ~60K | $0.005 |\n| All 25 documents parsed | ~150K | $0.012 |\n\n**Key Insight:** Even with 25 documents, costs are minimal because the system only parses what's needed!\n\n---\n\n## Commands to Run Demo\n\n```bash\n# Setup\ncd /path/to/fs-explorer\nexport GOOGLE_API_KEY=\"your-key\"\n\n# Run any test\nuv run explore --task \"Look in data/large_acquisition/. 
[YOUR QUESTION]\"\n```\n\n---\n\n## What to Show in Video\n\n1. **The folder scan** - Watch as 26 documents are scanned in parallel\n2. **Smart filtering** - Note which documents the agent CHOOSES to parse\n3. **Cross-reference following** - Show agent backtracking to referenced docs\n4. **Token usage summary** - Highlight the efficiency stats at the end\n5. **Verification** - Show the actual PDF content matches the answer\n\n"
  },
  {
    "path": "data/large_acquisition/TEST_QUESTIONS.md",
    "content": "# Test Questions for Large Document Set\n\n## Document Overview\n- 25 interconnected documents\n- Each document 3-6 pages\n- Extensive cross-references between documents\n- Total content: ~100+ pages\n\n## Test Questions\n\n### Level 1: Single Document (Easy)\n```bash\nuv run explore --task \"Look in data/large_acquisition/. What is the total purchase price?\"\nuv run explore --task \"Look in data/large_acquisition/. Who is the CTO and what is their retention bonus?\"\nuv run explore --task \"Look in data/large_acquisition/. What patents does the company own?\"\n```\n\n### Level 2: Cross-Reference Required (Medium)\n```bash\nuv run explore --task \"Look in data/large_acquisition/. What customer consents are required and what is their status?\"\nuv run explore --task \"Look in data/large_acquisition/. What is the total litigation exposure and is it covered by insurance?\"\nuv run explore --task \"Look in data/large_acquisition/. How is the purchase price being paid and what are the escrow terms?\"\n```\n\n### Level 3: Multi-Document Synthesis (Hard)\n```bash\nuv run explore --task \"Look in data/large_acquisition/. What are all the conditions that must be satisfied before closing and what is the status of each?\"\nuv run explore --task \"Look in data/large_acquisition/. Provide a complete picture of MegaCorp's relationship with the company - revenue, contract terms, consent status, and any risks.\"\nuv run explore --task \"Look in data/large_acquisition/. What are all the financial terms of this deal including adjustments, escrow, earnouts, and stock?\"\n```\n\n### Level 4: Deep Cross-Reference (Expert)\n```bash\nuv run explore --task \"Look in data/large_acquisition/. Trace all references to the Legal Opinion Letter - what documents cite it and what opinions does it provide?\"\nuv run explore --task \"Look in data/large_acquisition/. 
Create a complete picture of IP assets - patents, trademarks, assignments, and any related risks or litigation.\"\nuv run explore --task \"Look in data/large_acquisition/. What happens after closing? List all post-closing obligations, their timelines, and related documents.\"\n```\n"
  },
  {
    "path": "data/test_acquisition/TEST_QUESTIONS.md",
    "content": "# Test Questions for Document Exploration\n\nThese questions are designed to test the two-stage document exploration approach with cross-reference discovery.\n\n## Test Scenario\n\n**Context:** TechCorp Industries is acquiring StartupXYZ LLC. There are 10 documents in this folder related to the acquisition.\n\n---\n\n## Question Set 1: Simple (Single Document)\n\nThese questions can be answered from a single document:\n\n```bash\n# Q1: What is the purchase price?\nexplore --task \"What is the total purchase price for the StartupXYZ acquisition?\"\n\n# Q2: When did the NDA get signed?\nexplore --task \"When was the Non-Disclosure Agreement between TechCorp and StartupXYZ signed?\"\n\n# Q3: How many patents does StartupXYZ have?\nexplore --task \"How many patents does StartupXYZ own?\"\n```\n\n**Expected Behavior:**\n- Agent should preview documents\n- Identify the relevant document quickly\n- Parse only that document for the answer\n\n---\n\n## Question Set 2: Medium (2-3 Documents with Cross-References)\n\nThese questions require following cross-references:\n\n```bash\n# Q4: What risks were identified and how were they addressed?\nexplore --task \"What are the key risks identified in this acquisition and what mitigation measures were put in place?\"\n\n# Q5: What's the adjusted purchase price?\nexplore --task \"The original purchase price was $45M. Were there any adjustments? 
What is the final amount?\"\n\n# Q6: What happened with customer consents?\nexplore --task \"Which customers required consent for the acquisition, and was consent obtained from all of them?\"\n```\n\n**Expected Behavior:**\n- Agent previews documents\n- Reads Risk Assessment Memo\n- Notices references to Financial Adjustments, Customer Consents\n- Follows cross-references to get complete picture\n\n---\n\n## Question Set 3: Complex (Multiple Documents, Deep Cross-References)\n\nThese questions require synthesizing information from many documents:\n\n```bash\n# Q7: Complete IP status\nexplore --task \"Give me a complete picture of StartupXYZ's intellectual property - what do they own, is it properly certified, and are there any pending matters or risks?\"\n\n# Q8: Due diligence findings and resolution\nexplore --task \"What did the due diligence process uncover, and how were any issues resolved before closing?\"\n\n# Q9: Full timeline and status\nexplore --task \"Create a timeline of this acquisition from NDA signing to closing. What are the key milestones and their status?\"\n\n# Q10: Closing readiness\nexplore --task \"Is this acquisition ready to close? What items are complete and what's still pending?\"\n```\n\n**Expected Behavior:**\n- Agent should preview all documents first\n- Read the most relevant documents (e.g., Closing Checklist references everything)\n- Follow cross-references to IP Certification, Due Diligence, Risk Assessment, etc.\n- Synthesize information from 5+ documents\n\n---\n\n## Question Set 4: Adversarial (Tests Cross-Reference Discovery)\n\nThese questions specifically test if the agent goes back to previously-skipped documents:\n\n```bash\n# Q11: Following exhibit references\nexplore --task \"The Acquisition Agreement mentions 'Exhibit A - Financial Terms'. 
What are the detailed financial terms?\"\n\n# Q12: Understanding document relationships  \nexplore --task \"How does the Legal Opinion Letter relate to other documents in this acquisition?\"\n\n# Q13: Hidden connection\nexplore --task \"Is there anything about MegaCorp in these documents? Why are they important to this deal?\"\n```\n\n**Expected Behavior:**\n- Q11: Agent might initially skip Financial Adjustments, but should go back when Acquisition Agreement references Exhibit A\n- Q12: Agent should trace both the documents that the Legal Opinion references and the documents that reference it\n- Q13: MegaCorp is mentioned in Due Diligence, Risk Assessment, and Customer Consents - agent should connect the dots\n\n---\n\n## Scoring Rubric\n\n| Metric | Description |\n|--------|-------------|\n| **Preview Usage** | Did the agent use `preview_file` before `parse_file`? |\n| **Selective Parsing** | Did the agent avoid parsing irrelevant documents? |\n| **Cross-Reference Discovery** | Did the agent follow document references? |\n| **Backtracking** | Did the agent return to previously-skipped documents when needed? |\n| **Answer Completeness** | Was the final answer comprehensive and accurate? |\n\n---\n\n## Running a Test\n\n```bash\nexport GOOGLE_API_KEY=\"your-key\"\ncd /path/to/fs-explorer\nuv run explore --task \"YOUR QUESTION HERE\"\n```\n\nWatch for:\n1. Which documents get previewed\n2. Which documents get fully parsed\n3. Whether the agent mentions cross-references\n4. Whether the agent goes back to read referenced documents\n\n"
  },
  {
    "path": "data/testfile.txt",
    "content": "This is a test."
  },
  {
    "path": "docker/docker-compose.yml",
    "content": "version: '3.8'\n\nservices:\n  postgres:\n    image: pgvector/pgvector:pg17\n    container_name: fs-explorer-db\n    environment:\n      POSTGRES_USER: ${POSTGRES_USER:-fs_explorer}\n      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD:-devpassword}\n      POSTGRES_DB: ${POSTGRES_DB:-fs_explorer}\n    ports:\n      - \"${POSTGRES_PORT:-5432}:5432\"\n    volumes:\n      - postgres_data:/var/lib/postgresql/data\n      - ./init.sql:/docker-entrypoint-initdb.d/init.sql:ro\n    healthcheck:\n      test: [\"CMD-SHELL\", \"pg_isready -U fs_explorer -d fs_explorer\"]\n      interval: 5s\n      timeout: 5s\n      retries: 5\n    restart: unless-stopped\n\nvolumes:\n  postgres_data:\n"
  },
  {
    "path": "pyproject.toml",
    "content": "[build-system]\nrequires = [\"uv_build>=0.9.10,<0.10.0\"]\nbuild-backend = \"uv_build\"\n\n[project]\nname = \"fs-explorer\"\nversion = \"0.1.0\"\ndescription = \"Explore and understand your filesystem better with AI.\"\nreadme = \"README.md\"\nrequires-python = \">=3.10\"\ndependencies = [\n    \"docling>=2.55.0\",\n    \"duckdb>=1.0.0\",\n    \"fastapi>=0.115.0\",\n    \"google-genai>=1.55.0\",\n    \"langextract>=1.0.0\",\n    \"llama-index-workflows>=2.11.5\",\n    \"python-dotenv>=1.0.0\",\n    \"reportlab>=4.4.7\",\n    \"rich>=13.0.0\",\n    \"typer>=0.12.5,<0.20.0\",\n    \"uvicorn>=0.34.0\",\n    \"websockets>=14.0\",\n]\n\n[dependency-groups]\ndev = [\n    \"pre-commit>=4.5.0\",\n    \"pytest>=9.0.2\",\n    \"pytest-asyncio>=1.3.0\",\n    \"ruff>=0.14.9\",\n    \"ty>=0.0.1a33\",\n]\n\n[project.scripts]\nexplore = \"fs_explorer.main:app\"\nexplore-ui = \"fs_explorer.server:run_server\"\n"
  },
  {
    "path": "scripts/generate_large_docs.py",
    "content": "#!/usr/bin/env python3\n\"\"\"\nGenerate a large set of interconnected legal documents for testing.\nCreates 25 documents, each 3-5 pages, with extensive cross-references.\n\"\"\"\n\nimport os\nfrom reportlab.lib.pagesizes import letter\nfrom reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak\nfrom reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle\nfrom reportlab.lib.units import inch\n\nOUTPUT_DIR = \"data/large_acquisition\"\n\n# Document metadata with cross-references\nDOCUMENTS = {\n    \"01_master_agreement\": {\n        \"title\": \"MASTER ACQUISITION AGREEMENT\",\n        \"refs\": [\"02_schedules\", \"03_exhibits\", \"04_disclosure_schedules\", \"05_ancillary_agreements\"],\n        \"pages\": 5\n    },\n    \"02_schedules\": {\n        \"title\": \"SCHEDULES TO ACQUISITION AGREEMENT\", \n        \"refs\": [\"01_master_agreement\", \"06_ip_schedule\", \"07_employee_schedule\", \"08_contract_schedule\"],\n        \"pages\": 4\n    },\n    \"03_exhibits\": {\n        \"title\": \"EXHIBITS TO ACQUISITION AGREEMENT\",\n        \"refs\": [\"01_master_agreement\", \"09_escrow_agreement\", \"10_stock_purchase\"],\n        \"pages\": 3\n    },\n    \"04_disclosure_schedules\": {\n        \"title\": \"SELLER DISCLOSURE SCHEDULES\",\n        \"refs\": [\"01_master_agreement\", \"11_financial_statements\", \"12_litigation_schedule\"],\n        \"pages\": 5\n    },\n    \"05_ancillary_agreements\": {\n        \"title\": \"ANCILLARY AGREEMENTS INDEX\",\n        \"refs\": [\"13_nda\", \"14_non_compete\", \"15_consulting_agreement\", \"16_transition_services\"],\n        \"pages\": 2\n    },\n    \"06_ip_schedule\": {\n        \"title\": \"SCHEDULE 3.12 - INTELLECTUAL PROPERTY\",\n        \"refs\": [\"01_master_agreement\", \"17_patent_assignments\", \"18_trademark_registrations\"],\n        \"pages\": 4\n    },\n    \"07_employee_schedule\": {\n        \"title\": \"SCHEDULE 3.15 - EMPLOYEE MATTERS\",\n        \"refs\": 
[\"01_master_agreement\", \"19_retention_agreements\", \"20_benefit_plans\"],\n        \"pages\": 4\n    },\n    \"08_contract_schedule\": {\n        \"title\": \"SCHEDULE 3.13 - MATERIAL CONTRACTS\",\n        \"refs\": [\"01_master_agreement\", \"21_customer_contracts\", \"22_vendor_contracts\"],\n        \"pages\": 5\n    },\n    \"09_escrow_agreement\": {\n        \"title\": \"ESCROW AGREEMENT\",\n        \"refs\": [\"01_master_agreement\", \"03_exhibits\", \"11_financial_statements\"],\n        \"pages\": 4\n    },\n    \"10_stock_purchase\": {\n        \"title\": \"STOCK PURCHASE DETAILS - EXHIBIT B\",\n        \"refs\": [\"01_master_agreement\", \"11_financial_statements\"],\n        \"pages\": 3\n    },\n    \"11_financial_statements\": {\n        \"title\": \"AUDITED FINANCIAL STATEMENTS\",\n        \"refs\": [\"04_disclosure_schedules\", \"23_audit_report\"],\n        \"pages\": 6\n    },\n    \"12_litigation_schedule\": {\n        \"title\": \"SCHEDULE 3.9 - LITIGATION AND CLAIMS\",\n        \"refs\": [\"04_disclosure_schedules\", \"24_legal_opinion\"],\n        \"pages\": 3\n    },\n    \"13_nda\": {\n        \"title\": \"NON-DISCLOSURE AGREEMENT\",\n        \"refs\": [\"01_master_agreement\"],\n        \"pages\": 3\n    },\n    \"14_non_compete\": {\n        \"title\": \"NON-COMPETITION AGREEMENT\",\n        \"refs\": [\"01_master_agreement\", \"07_employee_schedule\"],\n        \"pages\": 3\n    },\n    \"15_consulting_agreement\": {\n        \"title\": \"CONSULTING AGREEMENT - FOUNDER\",\n        \"refs\": [\"01_master_agreement\", \"07_employee_schedule\", \"19_retention_agreements\"],\n        \"pages\": 4\n    },\n    \"16_transition_services\": {\n        \"title\": \"TRANSITION SERVICES AGREEMENT\",\n        \"refs\": [\"01_master_agreement\", \"25_closing_checklist\"],\n        \"pages\": 4\n    },\n    \"17_patent_assignments\": {\n        \"title\": \"PATENT ASSIGNMENT AGREEMENTS\",\n        \"refs\": [\"06_ip_schedule\", 
\"01_master_agreement\"],\n        \"pages\": 3\n    },\n    \"18_trademark_registrations\": {\n        \"title\": \"TRADEMARK REGISTRATION SCHEDULE\",\n        \"refs\": [\"06_ip_schedule\"],\n        \"pages\": 2\n    },\n    \"19_retention_agreements\": {\n        \"title\": \"KEY EMPLOYEE RETENTION AGREEMENTS\",\n        \"refs\": [\"07_employee_schedule\", \"15_consulting_agreement\"],\n        \"pages\": 4\n    },\n    \"20_benefit_plans\": {\n        \"title\": \"EMPLOYEE BENEFIT PLAN SCHEDULE\",\n        \"refs\": [\"07_employee_schedule\"],\n        \"pages\": 3\n    },\n    \"21_customer_contracts\": {\n        \"title\": \"MAJOR CUSTOMER CONTRACT SUMMARIES\",\n        \"refs\": [\"08_contract_schedule\", \"01_master_agreement\"],\n        \"pages\": 5\n    },\n    \"22_vendor_contracts\": {\n        \"title\": \"MAJOR VENDOR CONTRACT SUMMARIES\",\n        \"refs\": [\"08_contract_schedule\"],\n        \"pages\": 3\n    },\n    \"23_audit_report\": {\n        \"title\": \"INDEPENDENT AUDITOR'S REPORT\",\n        \"refs\": [\"11_financial_statements\", \"04_disclosure_schedules\"],\n        \"pages\": 4\n    },\n    \"24_legal_opinion\": {\n        \"title\": \"LEGAL OPINION LETTER\",\n        \"refs\": [\"01_master_agreement\", \"12_litigation_schedule\", \"06_ip_schedule\"],\n        \"pages\": 3\n    },\n    \"25_closing_checklist\": {\n        \"title\": \"CLOSING CHECKLIST AND CONDITIONS\",\n        \"refs\": [\"01_master_agreement\", \"09_escrow_agreement\", \"16_transition_services\", \n                 \"17_patent_assignments\", \"21_customer_contracts\"],\n        \"pages\": 4\n    }\n}\n\ndef generate_content(doc_id: str, meta: dict) -> list:\n    \"\"\"Generate realistic legal document content.\"\"\"\n    styles = getSampleStyleSheet()\n    title_style = ParagraphStyle('Title', parent=styles['Heading1'], fontSize=16, spaceAfter=20)\n    heading_style = ParagraphStyle('Heading', parent=styles['Heading2'], fontSize=12, spaceAfter=10)\n    
body_style = ParagraphStyle('Body', parent=styles['Normal'], fontSize=10, spaceAfter=8, leading=14)\n    \n    content = []\n    \n    # Title\n    content.append(Paragraph(meta[\"title\"], title_style))\n    content.append(Spacer(1, 0.3*inch))\n    \n    # Document intro with cross-references\n    refs_text = \", \".join([f\"Document: {DOCUMENTS[r]['title']}\" for r in meta[\"refs\"][:3]])\n    intro = f\"\"\"\n    This document is part of the acquisition transaction between GlobalTech Corporation (\"Buyer\") \n    and InnovateTech Solutions, Inc. (\"Seller\") dated as of February 15, 2025. This document should \n    be read in conjunction with {refs_text}, and all other transaction documents.\n    \"\"\"\n    content.append(Paragraph(intro.strip(), body_style))\n    content.append(Spacer(1, 0.2*inch))\n    \n    # Generate sections based on document type\n    sections = generate_sections(doc_id, meta)\n    for section_title, section_content in sections:\n        content.append(Paragraph(section_title, heading_style))\n        for para in section_content:\n            content.append(Paragraph(para, body_style))\n        content.append(Spacer(1, 0.15*inch))\n    \n    return content\n\ndef generate_sections(doc_id: str, meta: dict) -> list:\n    \"\"\"Generate document-specific sections with legal content.\"\"\"\n    sections = []\n    \n    # Add document-specific content\n    if \"master_agreement\" in doc_id:\n        sections = [\n            (\"ARTICLE I - DEFINITIONS\", [\n                \"1.1 'Acquisition' means the purchase by Buyer of all outstanding capital stock of Seller.\",\n                \"1.2 'Purchase Price' means One Hundred Twenty-Five Million Dollars ($125,000,000), subject to adjustments.\",\n                \"1.3 'Closing Date' means April 1, 2025, or such other date as mutually agreed.\",\n                \"1.4 'Material Adverse Effect' means any change that is materially adverse to the business of Seller.\",\n                \"1.5 
'Knowledge of Seller' means the actual knowledge of the officers listed in Schedule 1.5.\",\n            ]),\n            (\"ARTICLE II - PURCHASE AND SALE\", [\n                \"2.1 Subject to the terms hereof, Seller agrees to sell and Buyer agrees to purchase all Shares.\",\n                \"2.2 The Purchase Price shall be paid as follows: (a) $80,000,000 in cash at Closing; \"\n                \"(b) $30,000,000 in Buyer common stock per Document: Stock Purchase Details - Exhibit B; \"\n                \"(c) $15,000,000 in escrow per Document: Escrow Agreement.\",\n                \"2.3 Purchase Price adjustments are detailed in Document: Audited Financial Statements.\",\n                \"2.4 Working capital target is $8,500,000 as calculated per Schedule 2.4.\",\n            ]),\n            (\"ARTICLE III - REPRESENTATIONS AND WARRANTIES\", [\n                \"3.1 Organization. Seller is duly organized under Delaware law.\",\n                \"3.9 Litigation. Except as set forth in Document: Schedule 3.9 - Litigation and Claims, \"\n                \"there are no pending legal proceedings against Seller.\",\n                \"3.12 Intellectual Property. All IP is listed in Document: Schedule 3.12 - Intellectual Property. \"\n                \"Patent assignments are documented in Document: Patent Assignment Agreements.\",\n                \"3.13 Material Contracts. All contracts exceeding $100,000 annually are in Document: Schedule 3.13 - Material Contracts.\",\n                \"3.15 Employees. Employee matters are disclosed in Document: Schedule 3.15 - Employee Matters.\",\n            ]),\n            (\"ARTICLE IV - COVENANTS\", [\n                \"4.1 Conduct of Business. Prior to Closing, Seller shall operate in ordinary course.\",\n                \"4.2 Access. Seller shall provide Buyer access to facilities, books, and records.\",\n                \"4.3 Confidentiality. 
Parties shall comply with Document: Non-Disclosure Agreement.\",\n                \"4.4 Non-Competition. Key employees shall execute Document: Non-Competition Agreement.\",\n            ]),\n            (\"ARTICLE V - CONDITIONS TO CLOSING\", [\n                \"5.1 Buyer's conditions: (a) accuracy of representations; (b) material consents obtained; \"\n                \"(c) no Material Adverse Effect; (d) receipt of Document: Legal Opinion Letter.\",\n                \"5.2 Regulatory approvals as specified in Document: Closing Checklist and Conditions.\",\n                \"5.3 Third-party consents from customers in Document: Major Customer Contract Summaries.\",\n            ]),\n        ]\n    elif \"financial\" in doc_id:\n        sections = [\n            (\"BALANCE SHEET\", [\n                \"As of December 31, 2024:\",\n                \"Total Assets: $47,250,000 (Current: $18,500,000; Non-current: $28,750,000)\",\n                \"Total Liabilities: $12,300,000 (Current: $8,200,000; Long-term: $4,100,000)\",\n                \"Stockholders' Equity: $34,950,000\",\n                \"Working Capital: $10,300,000 (above target of $8,500,000 per Document: Master Acquisition Agreement)\",\n            ]),\n            (\"INCOME STATEMENT\", [\n                \"For fiscal year ended December 31, 2024:\",\n                \"Total Revenue: $52,400,000 (SaaS: $41,920,000; Professional Services: $10,480,000)\",\n                \"Cost of Revenue: $15,720,000 (Gross Margin: 70%)\",\n                \"Operating Expenses: $28,600,000 (R&D: $12,100,000; S&M: $11,500,000; G&A: $5,000,000)\",\n                \"Operating Income: $8,080,000 (EBITDA: $11,200,000)\",\n                \"Net Income: $6,464,000\",\n            ]),\n            (\"REVENUE BREAKDOWN BY CUSTOMER\", [\n                \"Top 5 customers represent 62% of revenue (see Document: Major Customer Contract Summaries):\",\n                \"1. 
MegaCorp Industries: $12,576,000 (24%) - Contract through 2027\",\n                \"2. GlobalBank Holdings: $8,384,000 (16%) - Renewal pending\",\n                \"3. HealthFirst Systems: $5,240,000 (10%) - Multi-year agreement\",\n                \"4. RetailMax Inc.: $3,668,000 (7%) - Expansion discussion ongoing\",\n                \"5. TechPrime Solutions: $2,620,000 (5%) - New customer 2024\",\n            ]),\n            (\"NOTES TO FINANCIAL STATEMENTS\", [\n                \"Note 1: Significant Accounting Policies - Revenue recognized per ASC 606.\",\n                \"Note 2: Deferred Revenue of $4,200,000 represents prepaid annual subscriptions.\",\n                \"Note 3: Contingent liabilities detailed in Document: Schedule 3.9 - Litigation and Claims.\",\n                \"Note 4: Related party transactions with founder disclosed in Document: Consulting Agreement - Founder.\",\n            ]),\n        ]\n    elif \"ip_schedule\" in doc_id or \"patent\" in doc_id:\n        sections = [\n            (\"PATENTS\", [\n                \"Seller owns or has rights to the following patents:\",\n                \"US Patent 10,123,456 - 'Machine Learning System for Predictive Analytics' - Issued 2021\",\n                \"US Patent 10,234,567 - 'Distributed Data Processing Architecture' - Issued 2022\",\n                \"US Patent 10,345,678 - 'Real-time Anomaly Detection Method' - Issued 2023\",\n                \"Pending: US Application 17/456,789 - 'Automated Workflow Optimization' - Filed 2024\",\n                \"Assignment agreements in Document: Patent Assignment Agreements.\",\n            ]),\n            (\"TRADEMARKS\", [\n                \"Registered trademarks (see Document: Trademark Registration Schedule):\",\n                \"INNOVATETECH (word mark) - Reg. No. 5,123,456 - Software services\",\n                \"INNOVATETECH (logo) - Reg. No. 5,234,567 - Software services\",\n                \"DATAFLOW PRO - Reg. No. 
5,345,678 - Data analytics software\",\n            ]),\n            (\"TRADE SECRETS AND KNOW-HOW\", [\n                \"Seller maintains trade secrets including proprietary algorithms and processes.\",\n                \"All employees have executed invention assignment agreements per Document: Schedule 3.15 - Employee Matters.\",\n                \"Key technical personnel retention addressed in Document: Key Employee Retention Agreements.\",\n            ]),\n        ]\n    elif \"employee\" in doc_id or \"retention\" in doc_id:\n        sections = [\n            (\"EMPLOYEE CENSUS\", [\n                \"Total Employees: 127 (Full-time: 120; Part-time: 7)\",\n                \"Engineering: 68 employees (Senior: 24; Mid-level: 32; Junior: 12)\",\n                \"Sales & Marketing: 28 employees\",\n                \"Customer Success: 18 employees\",\n                \"G&A: 13 employees\",\n            ]),\n            (\"KEY EMPLOYEES\", [\n                \"The following are Key Employees subject to Document: Key Employee Retention Agreements:\",\n                \"1. Dr. Sarah Chen - CTO - 15 years experience - Retention bonus: $1,200,000\",\n                \"2. Michael Rodriguez - VP Engineering - Leads 45-person team - Retention: $800,000\",\n                \"3. Jennifer Walsh - VP Sales - $18M quota achievement - Retention: $600,000\",\n                \"4. David Kim - Principal Architect - Core platform expertise - Retention: $500,000\",\n                \"5. 
Amanda Foster - VP Customer Success - 95% retention rate - Retention: $400,000\",\n                \"Founder consulting terms in Document: Consulting Agreement - Founder.\",\n            ]),\n            (\"BENEFIT PLANS\", [\n                \"Active benefit plans (details in Document: Employee Benefit Plan Schedule):\",\n                \"401(k) Plan - Company match 4% - $2.1M annual cost\",\n                \"Health Insurance - PPO and HMO options - $1.8M annual cost\",\n                \"Stock Option Plan - 2,500,000 shares reserved - 1,800,000 granted\",\n                \"Treatment of equity awards addressed in Document: Master Acquisition Agreement Section 2.6.\",\n            ]),\n        ]\n    elif \"customer\" in doc_id or \"contract_schedule\" in doc_id:\n        sections = [\n            (\"MATERIAL CUSTOMER CONTRACTS\", [\n                \"Contracts with annual value exceeding $500,000:\",\n                \"\",\n                \"1. MEGACORP INDUSTRIES - Master Services Agreement\",\n                \"   Annual Value: $12,576,000 | Term: Through December 2027\",\n                \"   Change of Control: Consent required (OBTAINED February 8, 2025)\",\n                \"   Renewal Terms: Auto-renew with 90-day notice\",\n                \"\",\n                \"2. GLOBALBANK HOLDINGS - Enterprise License Agreement\",\n                \"   Annual Value: $8,384,000 | Term: Through June 2025\",\n                \"   Change of Control: 60-day notice required (PROVIDED January 15, 2025)\",\n                \"   Renewal: Currently in negotiation for 3-year extension\",\n                \"\",\n                \"3. 
HEALTHFIRST SYSTEMS - SaaS Subscription Agreement\",\n                \"   Annual Value: $5,240,000 | Term: Through December 2026\",\n                \"   Change of Control: No restrictions\",\n                \"\",\n                \"See Document: Closing Checklist and Conditions for consent status.\",\n            ]),\n            (\"CONSENT REQUIREMENTS\", [\n                \"Customer consents required for acquisition (per Document: Master Acquisition Agreement):\",\n                \"- MegaCorp Industries: OBTAINED (see Exhibit A hereto)\",\n                \"- GlobalBank Holdings: NOTICE PROVIDED (awaiting acknowledgment)\",\n                \"- Other customers: No consent required\",\n                \"Risk assessment in Document: Legal Opinion Letter.\",\n            ]),\n        ]\n    elif \"litigation\" in doc_id:\n        sections = [\n            (\"PENDING LITIGATION\", [\n                \"1. Smith v. InnovateTech Solutions, Inc.\",\n                \"   Court: California Superior Court, Santa Clara County\",\n                \"   Claims: Wrongful termination, discrimination\",\n                \"   Status: Discovery phase; trial set for September 2025\",\n                \"   Exposure: $150,000 - $350,000 (covered by insurance)\",\n                \"   Opinion: See Document: Legal Opinion Letter\",\n                \"\",\n                \"2. DataTech LLC v. 
InnovateTech Solutions, Inc.\",\n                \"   Court: US District Court, Northern District of California\",\n                \"   Claims: Patent infringement (US Patent 9,876,543)\",\n                \"   Status: Motion to dismiss pending; hearing March 2025\",\n                \"   Exposure: Preliminary assessment $500,000 - $2,000,000\",\n                \"   IP validity analysis in Document: Schedule 3.12 - Intellectual Property\",\n            ]),\n            (\"THREATENED CLAIMS\", [\n                \"Demand letter received from former contractor re: unpaid invoices ($45,000).\",\n                \"Resolution expected prior to Closing per Document: Closing Checklist and Conditions.\",\n            ]),\n            (\"INSURANCE COVERAGE\", [\n                \"D&O Insurance: $5,000,000 limit | Deductible: $50,000\",\n                \"E&O Insurance: $3,000,000 limit | Deductible: $25,000\",\n                \"General Liability: $2,000,000 limit\",\n            ]),\n        ]\n    elif \"closing\" in doc_id:\n        sections = [\n            (\"PRE-CLOSING CONDITIONS\", [\n                \"The following conditions must be satisfied prior to Closing:\",\n                \"\",\n                \"1. REGULATORY APPROVALS\",\n                \"   [X] HSR Filing - Early termination granted February 1, 2025\",\n                \"   [X] State filings - Completed in all required jurisdictions\",\n                \"\",\n                \"2. THIRD-PARTY CONSENTS\",\n                \"   [X] MegaCorp Industries - Obtained February 8, 2025\",\n                \"   [ ] GlobalBank Holdings - Pending (expected by March 15)\",\n                \"   Per Document: Major Customer Contract Summaries\",\n                \"\",\n                \"3. 
EMPLOYEE MATTERS\",\n                \"   [X] Key employee retention agreements executed\",\n                \"   [X] Founder consulting agreement finalized\",\n                \"   Per Document: Key Employee Retention Agreements\",\n                \"\",\n                \"4. LEGAL DELIVERABLES\",\n                \"   [X] Legal opinion - See Document: Legal Opinion Letter\",\n                \"   [ ] Good standing certificates - Ordered\",\n            ]),\n            (\"CLOSING DELIVERABLES\", [\n                \"SELLER DELIVERABLES:\",\n                \"- Stock certificates endorsed in blank\",\n                \"- Officer's certificate re: representations\",\n                \"- Secretary's certificate with resolutions\",\n                \"- IP assignments per Document: Patent Assignment Agreements\",\n                \"- Third-party consents per above\",\n                \"\",\n                \"BUYER DELIVERABLES:\",\n                \"- Cash payment: $80,000,000 by wire transfer\",\n                \"- Stock consideration: 1,500,000 shares per Document: Stock Purchase Details - Exhibit B\",\n                \"- Escrow deposit: $15,000,000 per Document: Escrow Agreement\",\n            ]),\n            (\"POST-CLOSING OBLIGATIONS\", [\n                \"1. Transition services per Document: Transition Services Agreement (6 months)\",\n                \"2. Earnout payments per Exhibit C to Document: Master Acquisition Agreement\",\n                \"3. Escrow release schedule per Document: Escrow Agreement\",\n                \"4. 
Employee benefit plan merger per Document: Employee Benefit Plan Schedule\",\n            ]),\n        ]\n    elif \"escrow\" in doc_id:\n        sections = [\n            (\"ESCROW TERMS\", [\n                \"Escrow Amount: $15,000,000 (12% of Purchase Price)\",\n                \"Escrow Agent: First National Trust Company\",\n                \"Term: 18 months from Closing Date\",\n                \"\",\n                \"Release Schedule:\",\n                \"- 6 months: $5,000,000 released (absent claims)\",\n                \"- 12 months: $5,000,000 released (absent claims)\",\n                \"- 18 months: Remaining balance released\",\n                \"\",\n                \"Claims may be made for breaches of representations in Document: Master Acquisition Agreement.\",\n            ]),\n            (\"INDEMNIFICATION\", [\n                \"Indemnification provisions per Article VII of Document: Master Acquisition Agreement:\",\n                \"- Basket: $500,000 (1% of escrow)\",\n                \"- Cap: $15,000,000 (escrow amount) for general reps\",\n                \"- Fundamental reps: Full Purchase Price cap\",\n                \"\",\n                \"Specific indemnities for matters in Document: Schedule 3.9 - Litigation and Claims.\",\n            ]),\n        ]\n    elif \"legal_opinion\" in doc_id:\n        sections = [\n            (\"OPINIONS RENDERED\", [\n                \"Wilson & Associates LLP, counsel to Seller, renders the following opinions:\",\n                \"\",\n                \"1. Seller is a corporation duly organized under Delaware law.\",\n                \"2. Seller has corporate power to execute Document: Master Acquisition Agreement.\",\n                \"3. Transaction documents are valid and enforceable obligations.\",\n                \"4. No conflicts with charter documents or material agreements.\",\n                \"5. 
Based on review of Document: Schedule 3.9 - Litigation and Claims, pending \"\n                \"litigation does not pose material risk to transaction.\",\n                \"6. IP matters reviewed per Document: Schedule 3.12 - Intellectual Property; \"\n                \"no infringement claims other than disclosed.\",\n            ]),\n            (\"QUALIFICATIONS AND ASSUMPTIONS\", [\n                \"This opinion is subject to standard qualifications regarding:\",\n                \"- Bankruptcy and insolvency laws\",\n                \"- Equitable principles\",\n                \"- Public policy considerations\",\n                \"\",\n                \"We have relied upon certificates from officers of Seller and representations \"\n                \"in Document: Seller Disclosure Schedules.\",\n            ]),\n        ]\n    elif \"audit\" in doc_id:\n        sections = [\n            (\"INDEPENDENT AUDITOR'S REPORT\", [\n                \"To the Board of Directors of InnovateTech Solutions, Inc.:\",\n                \"\",\n                \"We have audited the accompanying financial statements, which comprise the \"\n                \"balance sheet as of December 31, 2024, and the related statements of income, \"\n                \"comprehensive income, stockholders' equity, and cash flows for the year then ended.\",\n                \"\",\n                \"OPINION\",\n                \"In our opinion, the financial statements present fairly, in all material respects, \"\n                \"the financial position of InnovateTech Solutions, Inc. as of December 31, 2024, \"\n                \"in accordance with accounting principles generally accepted in the United States.\",\n            ]),\n            (\"KEY AUDIT MATTERS\", [\n                \"1. 
REVENUE RECOGNITION\",\n                \"   SaaS revenue recognized ratably over subscription period per ASC 606.\",\n                \"   Deferred revenue of $4,200,000 verified to customer contracts.\",\n                \"\",\n                \"2. STOCK-BASED COMPENSATION\",\n                \"   Options valued using Black-Scholes model.\",\n                \"   Expense of $2,100,000 recorded in accordance with ASC 718.\",\n                \"\",\n                \"3. CONTINGENCIES\",\n                \"   Litigation matters reviewed with counsel (see Document: Schedule 3.9 - Litigation and Claims).\",\n                \"   Accruals of $350,000 determined to be appropriate.\",\n            ]),\n        ]\n    else:\n        # Generic sections for other documents\n        sections = [\n            (\"OVERVIEW\", [\n                f\"This {meta['title']} is executed in connection with the acquisition transaction.\",\n                f\"Reference documents: {', '.join([DOCUMENTS[r]['title'] for r in meta['refs'][:2]])}.\",\n            ]),\n            (\"TERMS AND CONDITIONS\", [\n                \"Standard terms apply as set forth in the Master Acquisition Agreement.\",\n                \"Amendments require written consent of all parties.\",\n            ]),\n            (\"MISCELLANEOUS\", [\n                \"Governing Law: State of Delaware\",\n                \"Dispute Resolution: Arbitration in San Francisco, California\",\n                \"Notices: As specified in Master Acquisition Agreement\",\n            ]),\n        ]\n    \n    # Add boilerplate to reach target page count\n    for i in range(meta[\"pages\"] - 2):\n        sections.append((f\"SECTION {len(sections) + 1}\", [\n            f\"Additional provisions related to {meta['title']}.\",\n            \"All terms defined in Document: Master Acquisition Agreement apply herein.\",\n            f\"Cross-reference: See {DOCUMENTS[meta['refs'][i % len(meta['refs'])]]['title']} for related 
provisions.\",\n            \"The parties acknowledge receipt of all schedules and exhibits referenced herein.\",\n            \"This section shall survive the Closing Date as specified in Article VIII of the Master Agreement.\",\n        ]))\n    \n    return sections\n\n\ndef create_pdf(doc_id: str, meta: dict, output_dir: str):\n    \"\"\"Create a PDF document.\"\"\"\n    filepath = os.path.join(output_dir, f\"{doc_id}.pdf\")\n    doc = SimpleDocTemplate(filepath, pagesize=letter,\n                           topMargin=0.75*inch, bottomMargin=0.75*inch,\n                           leftMargin=1*inch, rightMargin=1*inch)\n    content = generate_content(doc_id, meta)\n    doc.build(content)\n    print(f\"  Created: {filepath}\")\n\n\ndef main():\n    os.makedirs(OUTPUT_DIR, exist_ok=True)\n    \n    print(f\"\\nGenerating {len(DOCUMENTS)} large documents in {OUTPUT_DIR}/\\n\")\n    \n    for doc_id, meta in DOCUMENTS.items():\n        create_pdf(doc_id, meta, OUTPUT_DIR)\n    \n    # Create test questions\n    questions_path = os.path.join(OUTPUT_DIR, \"TEST_QUESTIONS.md\")\n    with open(questions_path, \"w\") as f:\n        f.write(\"\"\"# Test Questions for Large Document Set\n\n## Document Overview\n- 25 interconnected documents\n- Each document 3-6 pages\n- Extensive cross-references between documents\n- Total content: ~100+ pages\n\n## Test Questions\n\n### Level 1: Single Document (Easy)\n```bash\nuv run explore --task \"Look in data/large_acquisition/. What is the total purchase price?\"\nuv run explore --task \"Look in data/large_acquisition/. Who is the CTO and what is their retention bonus?\"\nuv run explore --task \"Look in data/large_acquisition/. What patents does the company own?\"\n```\n\n### Level 2: Cross-Reference Required (Medium)\n```bash\nuv run explore --task \"Look in data/large_acquisition/. What customer consents are required and what is their status?\"\nuv run explore --task \"Look in data/large_acquisition/. 
What is the total litigation exposure and is it covered by insurance?\"\nuv run explore --task \"Look in data/large_acquisition/. How is the purchase price being paid and what are the escrow terms?\"\n```\n\n### Level 3: Multi-Document Synthesis (Hard)\n```bash\nuv run explore --task \"Look in data/large_acquisition/. What are all the conditions that must be satisfied before closing and what is the status of each?\"\nuv run explore --task \"Look in data/large_acquisition/. Provide a complete picture of MegaCorp's relationship with the company - revenue, contract terms, consent status, and any risks.\"\nuv run explore --task \"Look in data/large_acquisition/. What are all the financial terms of this deal including adjustments, escrow, earnouts, and stock?\"\n```\n\n### Level 4: Deep Cross-Reference (Expert)\n```bash\nuv run explore --task \"Look in data/large_acquisition/. Trace all references to the Legal Opinion Letter - what documents cite it and what opinions does it provide?\"\nuv run explore --task \"Look in data/large_acquisition/. Create a complete picture of IP assets - patents, trademarks, assignments, and any related risks or litigation.\"\nuv run explore --task \"Look in data/large_acquisition/. What happens after closing? List all post-closing obligations, their timelines, and related documents.\"\n```\n\"\"\")\n    print(f\"  Created: {questions_path}\")\n    \n    # Summary\n    total_pages = sum(m[\"pages\"] for m in DOCUMENTS.values())\n    total_refs = sum(len(m[\"refs\"]) for m in DOCUMENTS.values())\n    print(f\"\\n{'='*60}\")\n    print(\"SUMMARY\")\n    print(f\"{'='*60}\")\n    print(f\"  Documents created: {len(DOCUMENTS)}\")\n    print(f\"  Total pages: ~{total_pages}\")\n    print(f\"  Cross-references: {total_refs}\")\n    print(f\"  Output directory: {OUTPUT_DIR}/\")\n    print(f\"{'='*60}\\n\")\n\n\nif __name__ == \"__main__\":\n    main()\n\n"
  },
  {
    "path": "scripts/generate_test_docs.py",
    "content": "#!/usr/bin/env python3\n\"\"\"\nGenerate test PDF documents for testing the two-stage document exploration approach.\n\nScenario: TechCorp's acquisition of StartupXYZ\nDocuments have cross-references to test the agent's ability to follow document relationships.\n\"\"\"\n\nfrom reportlab.lib.pagesizes import letter\nfrom reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak\nfrom reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle\nfrom reportlab.lib.units import inch\nimport os\n\nOUTPUT_DIR = \"data/test_acquisition\"\n\nDOCUMENTS = {\n    \"01_acquisition_agreement.pdf\": {\n        \"title\": \"ACQUISITION AGREEMENT\",\n        \"content\": \"\"\"\n        <b>ACQUISITION AGREEMENT</b><br/><br/>\n        \n        This Acquisition Agreement (\"Agreement\") is entered into as of January 15, 2025, \n        by and between TechCorp Industries, Inc. (\"Buyer\") and StartupXYZ LLC (\"Seller\").<br/><br/>\n        \n        <b>ARTICLE I - DEFINITIONS</b><br/><br/>\n        \n        1.1 \"Acquisition\" means the purchase of all outstanding shares of Seller by Buyer.<br/>\n        1.2 \"Purchase Price\" means $45,000,000 USD as detailed in <b>Exhibit A - Financial Terms</b>.<br/>\n        1.3 \"Closing Date\" means March 1, 2025, subject to conditions in Article IV.<br/>\n        1.4 \"Employee Matters\" shall be governed by <b>Schedule 3 - Employee Transition Plan</b>.<br/><br/>\n        \n        <b>ARTICLE II - PURCHASE AND SALE</b><br/><br/>\n        \n        2.1 Subject to the terms and conditions of this Agreement, Seller agrees to sell, \n        and Buyer agrees to purchase, all of the issued and outstanding shares of Seller.<br/><br/>\n        \n        2.2 The Purchase Price shall be paid as follows:<br/>\n        (a) $30,000,000 in cash at Closing<br/>\n        (b) $10,000,000 in Buyer's common stock (see <b>Exhibit B - Stock Valuation</b>)<br/>\n        (c) $5,000,000 in earnout payments (see <b>Exhibit C - 
Earnout Terms</b>)<br/><br/>\n        \n        <b>ARTICLE III - REPRESENTATIONS AND WARRANTIES</b><br/><br/>\n        \n        3.1 Seller represents and warrants that the financial statements provided in \n        <b>Document: Due Diligence Report</b> are accurate and complete.<br/><br/>\n        \n        3.2 Seller represents that all intellectual property is properly documented in \n        <b>Schedule 1 - IP Assets</b> and is free of encumbrances as certified in \n        <b>Document: IP Certification Letter</b>.<br/><br/>\n        \n        3.3 All material contracts are listed in <b>Schedule 2 - Material Contracts</b>.<br/><br/>\n        \n        <b>ARTICLE IV - CONDITIONS TO CLOSING</b><br/><br/>\n        \n        4.1 Buyer's obligation to close is subject to:<br/>\n        (a) Receipt of regulatory approval as documented in <b>Document: Regulatory Approval Letter</b><br/>\n        (b) Completion of due diligence per <b>Document: Due Diligence Report</b><br/>\n        (c) No material adverse change as defined in Section 1.5<br/><br/>\n        \n        4.2 Both parties acknowledge the risks identified in <b>Document: Risk Assessment Memo</b>.<br/><br/>\n        \n        <b>ARTICLE V - CONFIDENTIALITY</b><br/><br/>\n        \n        5.1 This Agreement is subject to the terms of the <b>Document: Non-Disclosure Agreement</b> \n        executed between the parties on October 1, 2024.<br/><br/>\n        \n        IN WITNESS WHEREOF, the parties have executed this Agreement as of the date first above written.<br/><br/>\n        \n        _________________________<br/>\n        TechCorp Industries, Inc.<br/>\n        By: James Mitchell, CEO<br/><br/>\n        \n        _________________________<br/>\n        StartupXYZ LLC<br/>\n        By: Sarah Chen, Founder & CEO\n        \"\"\"\n    },\n    \n    \"02_due_diligence_report.pdf\": {\n        \"title\": \"DUE DILIGENCE REPORT\",\n        \"content\": \"\"\"\n        <b>CONFIDENTIAL DUE DILIGENCE 
REPORT</b><br/><br/>\n        \n        <b>Prepared for:</b> TechCorp Industries, Inc.<br/>\n        <b>Subject:</b> StartupXYZ LLC<br/>\n        <b>Date:</b> December 20, 2024<br/>\n        <b>Prepared by:</b> Morrison & Associates, LLP<br/><br/>\n        \n        <b>EXECUTIVE SUMMARY</b><br/><br/>\n        \n        This report summarizes our findings from the due diligence investigation of StartupXYZ LLC \n        in connection with the proposed acquisition described in the <b>Document: Acquisition Agreement</b>.<br/><br/>\n        \n        <b>1. FINANCIAL REVIEW</b><br/><br/>\n        \n        1.1 Revenue for FY2024: $12.3 million (growth of 45% YoY)<br/>\n        1.2 EBITDA: $2.1 million (17% margin)<br/>\n        1.3 Cash position: $3.2 million as of November 30, 2024<br/>\n        1.4 Outstanding debt: $1.5 million (detailed in <b>Exhibit A - Financial Terms</b> of the Acquisition Agreement)<br/><br/>\n        \n        <b>KEY FINDING:</b> Financial statements are materially accurate. Minor adjustments \n        recommended as noted in <b>Document: Financial Adjustments Memo</b>.<br/><br/>\n        \n        <b>2. INTELLECTUAL PROPERTY</b><br/><br/>\n        \n        2.1 StartupXYZ holds 12 patents related to AI/ML technology<br/>\n        2.2 All patents verified as valid per <b>Document: IP Certification Letter</b><br/>\n        2.3 No pending litigation affecting IP (confirmed in <b>Document: Legal Opinion Letter</b>)<br/>\n        2.4 Full IP inventory in <b>Schedule 1 - IP Assets</b> of the Acquisition Agreement<br/><br/>\n        \n        <b>3. EMPLOYEE MATTERS</b><br/><br/>\n        \n        3.1 Total employees: 47 (32 engineering, 8 sales, 7 operations)<br/>\n        3.2 Key employee retention risk: HIGH for 5 senior engineers<br/>\n        3.3 Retention bonuses recommended per <b>Schedule 3 - Employee Transition Plan</b><br/>\n        3.4 No pending employment disputes<br/><br/>\n        \n        <b>4. 
MATERIAL CONTRACTS</b><br/><br/>\n        \n        4.1 23 active customer contracts reviewed (see <b>Schedule 2 - Material Contracts</b>)<br/>\n        4.2 3 contracts contain change-of-control provisions requiring consent<br/>\n        4.3 Largest customer (MegaCorp) accounts for 28% of revenue - concentration risk noted in \n        <b>Document: Risk Assessment Memo</b><br/><br/>\n        \n        <b>5. REGULATORY COMPLIANCE</b><br/><br/>\n        \n        5.1 Company is compliant with all applicable regulations<br/>\n        5.2 HSR filing required - timeline in <b>Document: Regulatory Approval Letter</b><br/><br/>\n        \n        <b>6. RECOMMENDATIONS</b><br/><br/>\n        \n        Based on our findings, we recommend proceeding with the acquisition subject to:<br/>\n        (a) Obtaining customer consents for change-of-control contracts<br/>\n        (b) Implementing retention packages for key employees<br/>\n        (c) Addressing items in <b>Document: Financial Adjustments Memo</b><br/><br/>\n        \n        Respectfully submitted,<br/>\n        Morrison & Associates, LLP\n        \"\"\"\n    },\n    \n    \"03_ip_certification.pdf\": {\n        \"title\": \"IP CERTIFICATION LETTER\",\n        \"content\": \"\"\"\n        <b>INTELLECTUAL PROPERTY CERTIFICATION LETTER</b><br/><br/>\n        \n        <b>Date:</b> December 15, 2024<br/>\n        <b>To:</b> TechCorp Industries, Inc.<br/>\n        <b>From:</b> PatentWatch Legal Services<br/>\n        <b>Re:</b> IP Certification for StartupXYZ LLC Acquisition<br/><br/>\n        \n        Dear Mr. 
Mitchell,<br/><br/>\n        \n        In connection with the proposed acquisition of StartupXYZ LLC as described in the \n        <b>Document: Acquisition Agreement</b>, we have conducted a comprehensive review of \n        StartupXYZ's intellectual property portfolio.<br/><br/>\n        \n        <b>CERTIFICATION</b><br/><br/>\n        \n        We hereby certify the following:<br/><br/>\n        \n        <b>1. PATENTS</b><br/><br/>\n        \n        StartupXYZ owns 12 U.S. patents as listed in <b>Schedule 1 - IP Assets</b>:<br/>\n        - US Patent 10,123,456: \"Neural Network Optimization Method\"<br/>\n        - US Patent 10,234,567: \"Distributed AI Training System\"<br/>\n        - US Patent 10,345,678: \"Real-time Data Processing Pipeline\"<br/>\n        - [9 additional patents listed in Schedule 1]<br/><br/>\n        \n        All patents are valid, enforceable, and free of liens or encumbrances.<br/><br/>\n        \n        <b>2. TRADEMARKS</b><br/><br/>\n        \n        StartupXYZ owns 3 registered trademarks:<br/>\n        - \"StartupXYZ\" (word mark)<br/>\n        - StartupXYZ logo (design mark)<br/>\n        - \"IntelliFlow\" (product name)<br/><br/>\n        \n        <b>3. TRADE SECRETS</b><br/><br/>\n        \n        We have reviewed StartupXYZ's trade secret protection protocols. All employees have \n        signed appropriate NDAs. See <b>Document: Non-Disclosure Agreement</b> template.<br/><br/>\n        \n        <b>4. THIRD-PARTY IP</b><br/><br/>\n        \n        StartupXYZ uses 47 open-source libraries. License compliance verified - no copyleft \n        contamination issues identified.<br/><br/>\n        \n        <b>5. PENDING MATTERS</b><br/><br/>\n        \n        There is one pending patent application (Application No. 17/456,789) for \"Advanced \n        Federated Learning System\" expected to issue Q2 2025. This is noted in \n        <b>Document: Risk Assessment Memo</b> as a minor risk item.<br/><br/>\n        \n        <b>6. 
LITIGATION</b><br/><br/>\n        \n        No IP-related litigation is pending or threatened. This is confirmed in \n        <b>Document: Legal Opinion Letter</b>.<br/><br/>\n        \n        This certification is provided in connection with the due diligence process and \n        may be relied upon by TechCorp Industries, Inc.<br/><br/>\n        \n        Sincerely,<br/>\n        PatentWatch Legal Services<br/>\n        By: Robert Kim, Patent Attorney\n        \"\"\"\n    },\n    \n    \"04_risk_assessment.pdf\": {\n        \"title\": \"RISK ASSESSMENT MEMO\",\n        \"content\": \"\"\"\n        <b>CONFIDENTIAL RISK ASSESSMENT MEMORANDUM</b><br/><br/>\n        \n        <b>To:</b> TechCorp Board of Directors<br/>\n        <b>From:</b> Corporate Development Team<br/>\n        <b>Date:</b> December 22, 2024<br/>\n        <b>Re:</b> Risk Assessment - StartupXYZ Acquisition<br/><br/>\n        \n        This memo summarizes key risks identified in connection with the proposed acquisition \n        as documented in the <b>Document: Acquisition Agreement</b>.<br/><br/>\n        \n        <b>1. HIGH-PRIORITY RISKS</b><br/><br/>\n        \n        <b>1.1 Customer Concentration (HIGH)</b><br/>\n        - MegaCorp represents 28% of StartupXYZ revenue<br/>\n        - MegaCorp contract contains change-of-control clause<br/>\n        - Mitigation: Obtain consent prior to closing (see <b>Document: Customer Consent Letters</b>)<br/>\n        - Impact if materialized: $3.4M annual revenue at risk<br/><br/>\n        \n        <b>1.2 Key Employee Retention (HIGH)</b><br/>\n        - 5 senior engineers critical to product development<br/>\n        - 2 have expressed interest in leaving post-acquisition<br/>\n        - Mitigation: Retention packages per <b>Schedule 3 - Employee Transition Plan</b><br/>\n        - Estimated cost: $2.5M in retention bonuses<br/><br/>\n        \n        <b>2. 
MEDIUM-PRIORITY RISKS</b><br/><br/>\n        \n        <b>2.1 Earnout Structure (MEDIUM)</b><br/>\n        - $5M earnout tied to 2025-2026 performance metrics<br/>\n        - Metrics defined in <b>Exhibit C - Earnout Terms</b> of the Acquisition Agreement<br/>\n        - Risk: Disagreement on metric calculation methodology<br/>\n        - Mitigation: Clear definitions in agreement; third-party arbitration clause<br/><br/>\n        \n        <b>2.2 Integration Costs (MEDIUM)</b><br/>\n        - Estimated integration costs: $4.2M over 18 months<br/>\n        - Systems integration detailed in <b>Document: Integration Plan</b><br/>\n        - Risk: Cost overruns of 20-30% typical in tech acquisitions<br/><br/>\n        \n        <b>3. LOW-PRIORITY RISKS</b><br/><br/>\n        \n        <b>3.1 Pending Patent Application (LOW)</b><br/>\n        - One patent pending as noted in <b>Document: IP Certification Letter</b><br/>\n        - Low risk of rejection based on patent attorney's assessment<br/><br/>\n        \n        <b>3.2 Regulatory Approval (LOW)</b><br/>\n        - HSR filing required but expected to clear without issues<br/>\n        - Timeline in <b>Document: Regulatory Approval Letter</b><br/><br/>\n        \n        <b>4. FINANCIAL IMPACT SUMMARY</b><br/><br/>\n        \n        Total risk-adjusted impact: $6.2M - $8.7M<br/>\n        This is reflected in purchase price negotiations per <b>Document: Financial Adjustments Memo</b><br/><br/>\n        \n        <b>5. RECOMMENDATION</b><br/><br/>\n        \n        Despite identified risks, we recommend proceeding with the acquisition. The strategic \n        value of StartupXYZ's AI technology platform justifies the purchase price when \n        accounting for risk mitigation costs. All findings are consistent with \n        <b>Document: Due Diligence Report</b>.<br/><br/>\n        \n        <b>6. 
NEXT STEPS</b><br/><br/>\n        \n        - Finalize customer consent process<br/>\n        - Execute retention agreements<br/>\n        - Complete regulatory filings<br/>\n        - Prepare for closing per <b>Document: Closing Checklist</b>\n        \"\"\"\n    },\n    \n    \"05_financial_adjustments.pdf\": {\n        \"title\": \"FINANCIAL ADJUSTMENTS MEMO\",\n        \"content\": \"\"\"\n        <b>FINANCIAL ADJUSTMENTS MEMORANDUM</b><br/><br/>\n        \n        <b>To:</b> Deal Team<br/>\n        <b>From:</b> Finance Department<br/>\n        <b>Date:</b> December 23, 2024<br/>\n        <b>Re:</b> Purchase Price Adjustments - StartupXYZ Acquisition<br/><br/>\n        \n        Following our review in connection with the <b>Document: Due Diligence Report</b>, \n        we recommend the following adjustments to the purchase price as set forth in \n        <b>Exhibit A - Financial Terms</b> of the <b>Document: Acquisition Agreement</b>.<br/><br/>\n        \n        <b>1. WORKING CAPITAL ADJUSTMENT</b><br/><br/>\n        \n        Target working capital: $1,200,000<br/>\n        Estimated closing working capital: $980,000<br/>\n        Adjustment: ($220,000)<br/><br/>\n        \n        <b>2. DEBT ADJUSTMENT</b><br/><br/>\n        \n        Previously disclosed debt: $1,500,000<br/>\n        Additional identified debt: $175,000 (capital lease obligations)<br/>\n        Adjustment: ($175,000)<br/><br/>\n        \n        <b>3. REVENUE RECOGNITION ADJUSTMENT</b><br/><br/>\n        \n        Deferred revenue requiring restatement: $340,000<br/>\n        Impact on EBITDA: ($85,000)<br/>\n        Implied value adjustment (at 15x): ($1,275,000)<br/><br/>\n        \n        <b>4. 
CONTINGENT LIABILITY RESERVE</b><br/><br/>\n        \n        As noted in <b>Document: Risk Assessment Memo</b>, we recommend establishing \n        reserves for:<br/>\n        - Customer concentration risk: $500,000<br/>\n        - Integration contingency: $800,000<br/>\n        Total reserve: $1,300,000 (to be held in escrow per <b>Exhibit C - Earnout Terms</b>)<br/><br/>\n        \n        <b>5. SUMMARY OF ADJUSTMENTS</b><br/><br/>\n        \n        Original Purchase Price: $45,000,000<br/>\n        Working Capital Adjustment: ($220,000)<br/>\n        Debt Adjustment: ($175,000)<br/>\n        Revenue Recognition: ($1,275,000)<br/>\n        <b>Adjusted Purchase Price: $43,330,000</b><br/><br/>\n        \n        Plus escrow reserve: $1,300,000<br/>\n        <b>Total Consideration (Including Escrow): $44,630,000</b><br/><br/>\n        \n        <b>6. PAYMENT STRUCTURE</b><br/><br/>\n        \n        As revised from <b>Document: Acquisition Agreement</b> Section 2.2:<br/>\n        (a) Cash at closing: $28,330,000 (adjusted)<br/>\n        (b) Stock consideration: $10,000,000 (per <b>Exhibit B - Stock Valuation</b>)<br/>\n        (c) Earnout: $5,000,000 (unchanged, per <b>Exhibit C - Earnout Terms</b>)<br/>\n        (d) Escrow: $1,300,000 (18-month release schedule)<br/><br/>\n        \n        These adjustments have been discussed with Seller's representatives and are \n        subject to final negotiation.<br/><br/>\n        \n        Please refer to <b>Document: Closing Checklist</b> for timeline and requirements.\n        \"\"\"\n    },\n    \n    \"06_legal_opinion.pdf\": {\n        \"title\": \"LEGAL OPINION LETTER\",\n        \"content\": \"\"\"\n        <b>LEGAL OPINION LETTER</b><br/><br/>\n        \n        <b>Date:</b> December 18, 2024<br/><br/>\n        \n        TechCorp Industries, Inc.<br/>\n        500 Technology Drive<br/>\n        San Francisco, CA 94105<br/><br/>\n        \n        <b>Re: Acquisition of StartupXYZ LLC</b><br/><br/>\n        \n        
Ladies and Gentlemen:<br/><br/>\n        \n        We have acted as legal counsel to StartupXYZ LLC (\"Company\") in connection with \n        the proposed acquisition by TechCorp Industries, Inc. pursuant to the \n        <b>Document: Acquisition Agreement</b> dated January 15, 2025.<br/><br/>\n        \n        <b>DOCUMENTS REVIEWED</b><br/><br/>\n        \n        In connection with this opinion, we have reviewed:<br/>\n        1. The Acquisition Agreement and all Exhibits and Schedules<br/>\n        2. <b>Document: Due Diligence Report</b> prepared by Morrison & Associates<br/>\n        3. <b>Document: IP Certification Letter</b> from PatentWatch Legal Services<br/>\n        4. All material contracts listed in <b>Schedule 2 - Material Contracts</b><br/>\n        5. Corporate records and organizational documents of the Company<br/>\n        6. <b>Document: Non-Disclosure Agreement</b> between the parties<br/><br/>\n        \n        <b>OPINIONS</b><br/><br/>\n        \n        Based on our review, we are of the opinion that:<br/><br/>\n        \n        <b>1. Corporate Status</b><br/>\n        The Company is a limited liability company duly organized, validly existing, and \n        in good standing under the laws of Delaware.<br/><br/>\n        \n        <b>2. Authority</b><br/>\n        The Company has full power and authority to execute and deliver the Acquisition \n        Agreement and to consummate the transactions contemplated thereby.<br/><br/>\n        \n        <b>3. No Conflicts</b><br/>\n        The execution and delivery of the Acquisition Agreement does not violate any \n        provision of the Company's organizational documents or any material contract, \n        except for change-of-control provisions noted in <b>Document: Customer Consent Letters</b>.<br/><br/>\n        \n        <b>4. 
Litigation</b><br/>\n        There is no litigation, arbitration, or governmental proceeding pending or, to \n        our knowledge, threatened against the Company that would have a material adverse \n        effect on the Company or the transactions contemplated by the Acquisition Agreement.<br/><br/>\n        \n        This opinion confirms the representations in the <b>Document: IP Certification Letter</b> \n        regarding absence of IP litigation.<br/><br/>\n        \n        <b>5. Regulatory Compliance</b><br/>\n        The Company is in material compliance with all applicable laws and regulations. \n        The HSR filing requirements are addressed in <b>Document: Regulatory Approval Letter</b>.<br/><br/>\n        \n        <b>QUALIFICATIONS</b><br/><br/>\n        \n        This opinion is subject to the following qualifications:<br/>\n        1. We express no opinion on tax matters (see separate tax opinion)<br/>\n        2. This opinion is limited to Delaware and federal law<br/>\n        3. 
Certain contracts require third-party consents as noted above<br/><br/>\n        \n        This opinion is provided solely for your benefit in connection with the \n        transactions contemplated by the Acquisition Agreement.<br/><br/>\n        \n        Very truly yours,<br/>\n        Wilson & Partners LLP<br/>\n        By: Jennifer Walsh, Partner\n        \"\"\"\n    },\n    \n    \"07_nda.pdf\": {\n        \"title\": \"NON-DISCLOSURE AGREEMENT\",\n        \"content\": \"\"\"\n        <b>MUTUAL NON-DISCLOSURE AGREEMENT</b><br/><br/>\n        \n        This Mutual Non-Disclosure Agreement (\"NDA\") is entered into as of October 1, 2024, \n        by and between:<br/><br/>\n        \n        <b>TechCorp Industries, Inc.</b> (\"TechCorp\")<br/>\n        500 Technology Drive, San Francisco, CA 94105<br/><br/>\n        \n        and<br/><br/>\n        \n        <b>StartupXYZ LLC</b> (\"StartupXYZ\")<br/>\n        123 Innovation Way, Palo Alto, CA 94301<br/><br/>\n        \n        (each a \"Party\" and collectively the \"Parties\")<br/><br/>\n        \n        <b>RECITALS</b><br/><br/>\n        \n        The Parties wish to explore a potential business relationship, including a possible \n        acquisition of StartupXYZ by TechCorp (the \"Purpose\"), which is now documented in \n        the <b>Document: Acquisition Agreement</b>.<br/><br/>\n        \n        <b>1. 
DEFINITION OF CONFIDENTIAL INFORMATION</b><br/><br/>\n        \n        \"Confidential Information\" means any non-public information disclosed by either \n        Party, including but not limited to:<br/>\n        - Financial information (as contained in <b>Document: Due Diligence Report</b>)<br/>\n        - Technical information (as certified in <b>Document: IP Certification Letter</b>)<br/>\n        - Business strategies and plans<br/>\n        - Customer and supplier information<br/>\n        - Employee information (as detailed in <b>Schedule 3 - Employee Transition Plan</b>)<br/><br/>\n        \n        <b>2. OBLIGATIONS</b><br/><br/>\n        \n        Each Party agrees to:<br/>\n        (a) Hold Confidential Information in strict confidence<br/>\n        (b) Not disclose Confidential Information to third parties without prior written consent<br/>\n        (c) Use Confidential Information solely for the Purpose<br/>\n        (d) Limit access to Confidential Information to employees with a need to know<br/><br/>\n        \n        <b>3. TERM</b><br/><br/>\n        \n        This NDA shall remain in effect for three (3) years from the date first written \n        above, or until superseded by the confidentiality provisions in the \n        <b>Document: Acquisition Agreement</b> Article V.<br/><br/>\n        \n        <b>4. EXCLUSIONS</b><br/><br/>\n        \n        Confidential Information does not include information that:<br/>\n        (a) Is or becomes publicly available through no fault of the receiving Party<br/>\n        (b) Was rightfully in the receiving Party's possession prior to disclosure<br/>\n        (c) Is rightfully obtained from a third party without restriction<br/>\n        (d) Is independently developed without use of Confidential Information<br/><br/>\n        \n        <b>5. 
RETURN OF MATERIALS</b><br/><br/>\n        \n        Upon request or termination, each Party shall return or destroy all Confidential \n        Information, except as required for legal or regulatory purposes.<br/><br/>\n        \n        <b>6. NO LICENSE</b><br/><br/>\n        \n        Nothing in this NDA grants any rights to intellectual property, except as \n        subsequently agreed in the <b>Document: Acquisition Agreement</b> and \n        <b>Schedule 1 - IP Assets</b>.<br/><br/>\n        \n        IN WITNESS WHEREOF, the Parties have executed this NDA as of the date first above written.<br/><br/>\n        \n        TechCorp Industries, Inc.<br/>\n        By: ______________________<br/>\n        Name: James Mitchell<br/>\n        Title: CEO<br/><br/>\n        \n        StartupXYZ LLC<br/>\n        By: ______________________<br/>\n        Name: Sarah Chen<br/>\n        Title: Founder & CEO\n        \"\"\"\n    },\n    \n    \"08_regulatory_approval.pdf\": {\n        \"title\": \"REGULATORY APPROVAL LETTER\",\n        \"content\": \"\"\"\n        <b>FEDERAL TRADE COMMISSION</b><br/>\n        <b>PREMERGER NOTIFICATION OFFICE</b><br/><br/>\n        \n        January 28, 2025<br/><br/>\n        \n        TechCorp Industries, Inc.<br/>\n        500 Technology Drive<br/>\n        San Francisco, CA 94105<br/><br/>\n        \n        StartupXYZ LLC<br/>\n        123 Innovation Way<br/>\n        Palo Alto, CA 94301<br/><br/>\n        \n        <b>Re: Early Termination of HSR Waiting Period</b><br/>\n        <b>Transaction: Acquisition of StartupXYZ LLC by TechCorp Industries, Inc.</b><br/><br/>\n        \n        Dear Parties:<br/><br/>\n        \n        This letter confirms that the Federal Trade Commission has granted early \n        termination of the waiting period under the Hart-Scott-Rodino Antitrust \n        Improvements Act of 1976 for the above-referenced transaction.<br/><br/>\n        \n        <b>FILING DETAILS</b><br/><br/>\n        \n        Filing 
Date: January 10, 2025<br/>\n        Transaction Value: $45,000,000 (as stated in <b>Document: Acquisition Agreement</b>)<br/>\n        HSR Filing Fee: $30,000<br/>\n        Early Termination Granted: January 28, 2025<br/><br/>\n        \n        <b>EFFECT OF EARLY TERMINATION</b><br/><br/>\n        \n        The parties may now consummate the transaction at any time. This early termination \n        satisfies the condition precedent set forth in Article IV, Section 4.1(a) of the \n        <b>Document: Acquisition Agreement</b>.<br/><br/>\n        \n        Please note that early termination of the waiting period does not preclude the \n        Commission from taking any action it deems necessary to protect competition.<br/><br/>\n        \n        <b>NEXT STEPS</b><br/><br/>\n        \n        Per the <b>Document: Closing Checklist</b>, you may now proceed with the closing \n        scheduled for March 1, 2025, subject to satisfaction of other conditions in the \n        <b>Document: Acquisition Agreement</b>.<br/><br/>\n        \n        The <b>Document: Risk Assessment Memo</b> correctly identified this as a low-risk \n        item. 
The market analysis in the <b>Document: Due Diligence Report</b> supported \n        the determination that this transaction does not raise competitive concerns.<br/><br/>\n        \n        Sincerely,<br/>\n        Premerger Notification Office<br/>\n        Federal Trade Commission\n        \"\"\"\n    },\n    \n    \"09_customer_consents.pdf\": {\n        \"title\": \"CUSTOMER CONSENT LETTERS\",\n        \"content\": \"\"\"\n        <b>CUSTOMER CONSENT STATUS REPORT</b><br/><br/>\n        \n        <b>Date:</b> February 15, 2025<br/>\n        <b>To:</b> Deal Team<br/>\n        <b>From:</b> Legal Department<br/>\n        <b>Re:</b> Change of Control Consent Status<br/><br/>\n        \n        As required by <b>Schedule 2 - Material Contracts</b> of the \n        <b>Document: Acquisition Agreement</b>, this memo summarizes the status of \n        customer consents for contracts containing change-of-control provisions.<br/><br/>\n        \n        <b>CONSENT STATUS SUMMARY</b><br/><br/>\n        \n        <b>1. MegaCorp Inc. - OBTAINED</b><br/>\n        Contract Value: $3.4M annual<br/>\n        Consent Received: February 10, 2025<br/>\n        Notes: MegaCorp requested meeting with TechCorp leadership; meeting held 2/8/25. \n        Consent granted with no additional conditions. This addresses the primary concern \n        noted in <b>Document: Risk Assessment Memo</b> Section 1.1.<br/><br/>\n        \n        <b>2. DataFlow Systems - OBTAINED</b><br/>\n        Contract Value: $1.2M annual<br/>\n        Consent Received: February 5, 2025<br/>\n        Notes: Standard consent process. No concerns raised.<br/><br/>\n        \n        <b>3. CloudTech Partners - PENDING</b><br/>\n        Contract Value: $890K annual<br/>\n        Status: Consent requested February 1, 2025<br/>\n        Expected: February 20, 2025<br/>\n        Notes: Legal review in progress at CloudTech. 
Their counsel has reviewed the \n        <b>Document: Acquisition Agreement</b> and has no objections. Verbal confirmation \n        received; written consent expected shortly.<br/><br/>\n        \n        <b>IMPACT ANALYSIS</b><br/><br/>\n        \n        Per <b>Document: Due Diligence Report</b> Section 4, there were 3 contracts \n        requiring consent:<br/>\n        - 2 obtained (representing $4.6M annual revenue)<br/>\n        - 1 pending (representing $890K annual revenue)<br/><br/>\n        \n        <b>CLOSING IMPLICATIONS</b><br/><br/>\n        \n        The <b>Document: Acquisition Agreement</b> Article IV requires \"material\" customer \n        consents as a closing condition. With MegaCorp consent obtained, this condition \n        is substantially satisfied. The pending CloudTech consent is expected before \n        the March 1 closing date per <b>Document: Closing Checklist</b>.<br/><br/>\n        \n        <b>ATTACHMENTS</b><br/><br/>\n        \n        Attached hereto:<br/>\n        - Exhibit A: MegaCorp Consent Letter (dated February 10, 2025)<br/>\n        - Exhibit B: DataFlow Systems Consent Letter (dated February 5, 2025)<br/>\n        - Exhibit C: CloudTech Partners Draft Consent (pending signature)<br/><br/>\n        \n        <b>RECOMMENDATION</b><br/><br/>\n        \n        We recommend proceeding with closing preparations. The risk of CloudTech \n        withholding consent is low based on discussions with their counsel. 
This \n        is consistent with the risk mitigation strategy in <b>Document: Risk Assessment Memo</b>.\n        \"\"\"\n    },\n    \n    \"10_closing_checklist.pdf\": {\n        \"title\": \"CLOSING CHECKLIST\",\n        \"content\": \"\"\"\n        <b>CLOSING CHECKLIST</b><br/>\n        <b>Acquisition of StartupXYZ LLC by TechCorp Industries, Inc.</b><br/><br/>\n        \n        <b>Closing Date:</b> March 1, 2025<br/>\n        <b>Closing Location:</b> Wilson & Partners LLP, San Francisco<br/><br/>\n        \n        <b>I. PRE-CLOSING CONDITIONS</b><br/><br/>\n        \n        <b>A. Regulatory</b><br/>\n        [X] HSR Filing submitted - <b>Document: Regulatory Approval Letter</b><br/>\n        [X] Early termination received (January 28, 2025)<br/>\n        [ ] State regulatory filings (if required)<br/><br/>\n        \n        <b>B. Third-Party Consents</b><br/>\n        [X] MegaCorp consent - <b>Document: Customer Consent Letters</b><br/>\n        [X] DataFlow consent - <b>Document: Customer Consent Letters</b><br/>\n        [ ] CloudTech consent (expected February 20) - <b>Document: Customer Consent Letters</b><br/><br/>\n        \n        <b>C. Due Diligence Completion</b><br/>\n        [X] Financial due diligence - <b>Document: Due Diligence Report</b><br/>\n        [X] Legal due diligence - <b>Document: Legal Opinion Letter</b><br/>\n        [X] IP due diligence - <b>Document: IP Certification Letter</b><br/>\n        [X] Risk assessment - <b>Document: Risk Assessment Memo</b><br/><br/>\n        \n        <b>II. CLOSING DOCUMENTS</b><br/><br/>\n        \n        <b>A. Transaction Documents</b><br/>\n        [ ] Executed <b>Document: Acquisition Agreement</b><br/>\n        [ ] Bill of Sale<br/>\n        [ ] Assignment and Assumption Agreement<br/>\n        [ ] IP Assignment Agreement (per <b>Schedule 1 - IP Assets</b>)<br/><br/>\n        \n        <b>B. 
Corporate Documents</b><br/>\n        [ ] Seller's Certificate of Good Standing<br/>\n        [ ] Secretary's Certificate (resolutions, incumbency)<br/>\n        [ ] Buyer's Certificate of Good Standing<br/><br/>\n        \n        <b>C. Financial Documents</b><br/>\n        [ ] Closing Statement per <b>Document: Financial Adjustments Memo</b><br/>\n        [ ] Wire transfer instructions<br/>\n        [ ] Escrow Agreement (per <b>Exhibit C - Earnout Terms</b>)<br/>\n        [ ] Stock certificates or book entry (per <b>Exhibit B - Stock Valuation</b>)<br/><br/>\n        \n        <b>D. Employment Documents</b><br/>\n        [ ] Retention agreements per <b>Schedule 3 - Employee Transition Plan</b><br/>\n        [ ] Offer letters for key employees<br/>\n        [ ] WARN Act compliance (if applicable)<br/><br/>\n        \n        <b>III. CLOSING FUNDS</b><br/><br/>\n        \n        Per <b>Document: Financial Adjustments Memo</b>:<br/>\n        [ ] Cash payment: $28,330,000<br/>\n        [ ] Escrow deposit: $1,300,000<br/>\n        [ ] Stock issuance: $10,000,000<br/>\n        Total at Closing: $39,630,000<br/><br/>\n        \n        <b>IV. POST-CLOSING</b><br/><br/>\n        \n        [ ] File UCC termination statements<br/>\n        [ ] Update corporate records<br/>\n        [ ] Integration kickoff per <b>Document: Integration Plan</b><br/>\n        [ ] Employee communications<br/>\n        [ ] Customer notifications<br/>\n        [ ] Press release<br/><br/>\n        \n        <b>V. RESPONSIBLE PARTIES</b><br/><br/>\n        \n        Buyer's Counsel: Morrison & Associates LLP<br/>\n        Seller's Counsel: Wilson & Partners LLP<br/>\n        Escrow Agent: First National Trust<br/><br/>\n        \n        <b>VI. 
KEY CONTACTS</b><br/><br/>\n        \n        TechCorp: James Mitchell (CEO), (415) 555-0100<br/>\n        StartupXYZ: Sarah Chen (CEO), (650) 555-0200<br/>\n        Legal (Buyer): John Morrison, (415) 555-0300<br/>\n        Legal (Seller): Jennifer Walsh, (415) 555-0400\n        \"\"\"\n    }\n}\n\n\ndef create_pdf(filename: str, title: str, content: str):\n    \"\"\"Create a PDF document.\"\"\"\n    filepath = os.path.join(OUTPUT_DIR, filename)\n    doc = SimpleDocTemplate(filepath, pagesize=letter,\n                           topMargin=1*inch, bottomMargin=1*inch,\n                           leftMargin=1*inch, rightMargin=1*inch)\n    \n    styles = getSampleStyleSheet()\n    title_style = ParagraphStyle(\n        'CustomTitle',\n        parent=styles['Heading1'],\n        fontSize=16,\n        spaceAfter=30,\n        alignment=1  # Center\n    )\n    body_style = ParagraphStyle(\n        'CustomBody',\n        parent=styles['Normal'],\n        fontSize=11,\n        leading=14,\n        spaceAfter=12\n    )\n    \n    story = []\n    story.append(Paragraph(title, title_style))\n    story.append(Spacer(1, 0.5*inch))\n    \n    # Split content into paragraphs on double line breaks; reportlab's\n    # Paragraph renders the remaining single <br/> tags as line breaks.\n    paragraphs = content.strip().split('<br/><br/>')\n    for para in paragraphs:\n        story.append(Paragraph(para, body_style))\n    \n    doc.build(story)\n    print(f\"Created: {filepath}\")\n\n\ndef main():\n    # Create output directory\n    os.makedirs(OUTPUT_DIR, exist_ok=True)\n    \n    print(f\"\\nGenerating {len(DOCUMENTS)} test documents in {OUTPUT_DIR}/\\n\")\n    \n    for filename, doc_info in DOCUMENTS.items():\n        create_pdf(filename, doc_info[\"title\"], doc_info[\"content\"])\n    \n    print(f\"\\n✅ Generated {len(DOCUMENTS)} documents successfully!\")\n    print(\"\\nDocument cross-reference map:\")\n    print(\"=\" * 60)\n    print(\"\"\"\n    Acquisition Agreement (01)\n    ├── references: Exhibit A, B, C, Schedule 
1-3\n    ├── referenced by: ALL other documents\n    │\n    Due Diligence Report (02)\n    ├── references: Acquisition Agreement, IP Cert, Risk Assessment\n    ├── referenced by: Legal Opinion, Risk Assessment, Regulatory\n    │\n    IP Certification (03)\n    ├── references: Acquisition Agreement, Schedule 1, NDA\n    ├── referenced by: Due Diligence, Legal Opinion\n    │\n    Risk Assessment (04)\n    ├── references: Acquisition Agreement, Due Diligence, IP Cert\n    ├── referenced by: Financial Adjustments, Customer Consents\n    │\n    Financial Adjustments (05)\n    ├── references: Due Diligence, Risk Assessment, Acquisition Agreement\n    ├── referenced by: Closing Checklist\n    │\n    Legal Opinion (06)\n    ├── references: Acquisition Agreement, Due Diligence, IP Cert, NDA\n    ├── referenced by: Closing Checklist\n    │\n    NDA (07)\n    ├── references: Acquisition Agreement, Due Diligence, IP Cert\n    ├── referenced by: IP Cert, Legal Opinion\n    │\n    Regulatory Approval (08)\n    ├── references: Acquisition Agreement, Due Diligence, Risk Assessment\n    ├── referenced by: Closing Checklist\n    │\n    Customer Consents (09)\n    ├── references: Acquisition Agreement, Risk Assessment, Schedule 2\n    ├── referenced by: Closing Checklist\n    │\n    Closing Checklist (10)\n    └── references: ALL documents\n    \"\"\")\n\n\nif __name__ == \"__main__\":\n    main()\n\n"
  },
  {
    "path": "src/fs_explorer/__init__.py",
    "content": "\"\"\"\nFsExplorer - AI-powered filesystem exploration agent.\n\nThis package provides an intelligent agent that can explore filesystems,\nparse documents, and answer questions about their contents using\nGoogle Gemini for decision-making and Docling for document parsing.\n\nExample usage:\n    >>> from fs_explorer import FsExplorerAgent, InputEvent, workflow\n    >>> agent = FsExplorerAgent()\n    >>> # Use with the workflow for full exploration\n    >>> result = await workflow.run(start_event=InputEvent(task=\"Find the purchase price\"))\n\"\"\"\n\nfrom .agent import FsExplorerAgent, TokenUsage\nfrom .workflow import (\n    workflow,\n    FsExplorerWorkflow,\n    InputEvent,\n    ExplorationEndEvent,\n    ToolCallEvent,\n    GoDeeperEvent,\n    AskHumanEvent,\n    HumanAnswerEvent,\n    get_agent,\n    reset_agent,\n)\nfrom .models import Action, ActionType, Tools\n\n__all__ = [\n    # Agent\n    \"FsExplorerAgent\",\n    \"TokenUsage\",\n    # Workflow\n    \"workflow\",\n    \"FsExplorerWorkflow\",\n    \"InputEvent\",\n    \"ExplorationEndEvent\",\n    \"ToolCallEvent\",\n    \"GoDeeperEvent\",\n    \"AskHumanEvent\",\n    \"HumanAnswerEvent\",\n    \"get_agent\",\n    \"reset_agent\",\n    # Models\n    \"Action\",\n    \"ActionType\",\n    \"Tools\",\n]\n\n"
  },
  {
    "path": "src/fs_explorer/agent.py",
    "content": "\"\"\"\nFsExplorer Agent for filesystem exploration using Google Gemini.\n\nThis module contains the agent that interacts with the Gemini AI model\nto make decisions about filesystem exploration actions.\n\"\"\"\n\nimport os\nimport re\nfrom pathlib import Path\nfrom typing import Callable, Any, cast\nfrom dataclasses import dataclass\n\nfrom dotenv import load_dotenv\nfrom google.genai.types import Content, HttpOptions, Part\nfrom google.genai import Client as GenAIClient\n\nfrom .models import Action, ActionType, ToolCallAction, Tools\nfrom .fs import (\n    read_file,\n    grep_file_content,\n    glob_paths,\n    scan_folder,\n    preview_file,\n    parse_file,\n)\nfrom .embeddings import EmbeddingProvider\nfrom .index_config import resolve_db_path\nfrom .search import (\n    IndexedQueryEngine,\n    MetadataFilterParseError,\n    supported_filter_syntax,\n)\nfrom .storage import DuckDBStorage\n\n# Load .env file from project root\n_env_path = Path(__file__).parent.parent.parent / \".env\"\nif _env_path.exists():\n    load_dotenv(_env_path)\n\n\n# =============================================================================\n# Token Usage Tracking\n# =============================================================================\n\n# Gemini Flash pricing (per million tokens)\nGEMINI_FLASH_INPUT_COST_PER_MILLION = 0.075\nGEMINI_FLASH_OUTPUT_COST_PER_MILLION = 0.30\n\n\n@dataclass\nclass TokenUsage:\n    \"\"\"\n    Track token usage and costs across the session.\n\n    Maintains running totals of API calls, token counts, and provides\n    cost estimates based on Gemini Flash pricing.\n    \"\"\"\n\n    prompt_tokens: int = 0\n    completion_tokens: int = 0\n    total_tokens: int = 0\n    api_calls: int = 0\n\n    # Track content sizes\n    tool_result_chars: int = 0\n    documents_parsed: int = 0\n    documents_scanned: int = 0\n\n    def add_api_call(self, prompt_tokens: int, completion_tokens: int) -> None:\n        \"\"\"Record token usage from 
an API call.\"\"\"\n        self.prompt_tokens += prompt_tokens\n        self.completion_tokens += completion_tokens\n        self.total_tokens += prompt_tokens + completion_tokens\n        self.api_calls += 1\n\n    def add_tool_result(self, result: str, tool_name: str) -> None:\n        \"\"\"Record metrics from a tool execution.\"\"\"\n        self.tool_result_chars += len(result)\n        if tool_name == \"parse_file\":\n            self.documents_parsed += 1\n        elif tool_name == \"scan_folder\":\n            # Count documents in scan result by counting document markers\n            self.documents_scanned += result.count(\"│ [\")\n        elif tool_name == \"preview_file\":\n            self.documents_parsed += 1\n\n    def _calculate_cost(self) -> tuple[float, float, float]:\n        \"\"\"Calculate estimated costs based on Gemini Flash pricing.\"\"\"\n        input_cost = (\n            self.prompt_tokens / 1_000_000\n        ) * GEMINI_FLASH_INPUT_COST_PER_MILLION\n        output_cost = (\n            self.completion_tokens / 1_000_000\n        ) * GEMINI_FLASH_OUTPUT_COST_PER_MILLION\n        return input_cost, output_cost, input_cost + output_cost\n\n    def summary(self) -> str:\n        \"\"\"Generate a formatted summary of token usage and costs.\"\"\"\n        input_cost, output_cost, total_cost = self._calculate_cost()\n\n        return f\"\"\"\n═══════════════════════════════════════════════════════════════\n                      TOKEN USAGE SUMMARY\n═══════════════════════════════════════════════════════════════\n  API Calls:           {self.api_calls}\n  Prompt Tokens:       {self.prompt_tokens:,}\n  Completion Tokens:   {self.completion_tokens:,}\n  Total Tokens:        {self.total_tokens:,}\n───────────────────────────────────────────────────────────────\n  Documents Scanned:   {self.documents_scanned}\n  Documents Parsed:    {self.documents_parsed}\n  Tool Result Chars:   
{self.tool_result_chars:,}\n───────────────────────────────────────────────────────────────\n  Est. Cost (Gemini Flash):\n    Input:  ${input_cost:.4f}\n    Output: ${output_cost:.4f}\n    Total:  ${total_cost:.4f}\n═══════════════════════════════════════════════════════════════\n\"\"\"\n\n\n# =============================================================================\n# Tool Registry\n# =============================================================================\n\n\n@dataclass(frozen=True)\nclass IndexContext:\n    \"\"\"Execution context for indexed retrieval tools.\"\"\"\n\n    root_folder: str\n    db_path: str\n\n\n_INDEX_CONTEXT: IndexContext | None = None\n_EMBEDDING_PROVIDER: EmbeddingProvider | None = None\n_FIELD_CATALOG_SHOWN: bool = False\n_ENABLE_SEMANTIC: bool = False\n_ENABLE_METADATA: bool = False\n\n\ndef set_search_flags(\n    *, enable_semantic: bool = False, enable_metadata: bool = False\n) -> None:\n    \"\"\"Configure which indexed retrieval paths are active.\"\"\"\n    global _ENABLE_SEMANTIC, _ENABLE_METADATA\n    _ENABLE_SEMANTIC = enable_semantic\n    _ENABLE_METADATA = enable_metadata\n\n\ndef get_search_flags() -> tuple[bool, bool]:\n    \"\"\"Return (enable_semantic, enable_metadata).\"\"\"\n    return _ENABLE_SEMANTIC, _ENABLE_METADATA\n\n\ndef set_embedding_provider(provider: EmbeddingProvider | None) -> None:\n    \"\"\"Set the embedding provider for vector search in indexed tools.\"\"\"\n    global _EMBEDDING_PROVIDER\n    _EMBEDDING_PROVIDER = provider\n\n\ndef set_index_context(folder: str, db_path: str | None = None) -> None:\n    \"\"\"Enable indexed tools for a specific folder corpus.\"\"\"\n    global _INDEX_CONTEXT, _EMBEDDING_PROVIDER\n    _INDEX_CONTEXT = IndexContext(\n        root_folder=str(Path(folder).resolve()),\n        db_path=resolve_db_path(db_path),\n    )\n    # Auto-create embedding provider if API key available\n    if _EMBEDDING_PROVIDER is None:\n        try:\n            _EMBEDDING_PROVIDER = 
EmbeddingProvider()\n        except ValueError:\n            pass\n\n\ndef clear_index_context() -> None:\n    \"\"\"Disable indexed tools for the current process.\"\"\"\n    global _INDEX_CONTEXT, _EMBEDDING_PROVIDER, _FIELD_CATALOG_SHOWN\n    global _ENABLE_SEMANTIC, _ENABLE_METADATA\n    _INDEX_CONTEXT = None\n    _EMBEDDING_PROVIDER = None\n    _FIELD_CATALOG_SHOWN = False\n    _ENABLE_SEMANTIC = False\n    _ENABLE_METADATA = False\n\n\ndef _get_index_storage_and_corpus() -> tuple[\n    DuckDBStorage | None, str | None, str | None\n]:\n    if _INDEX_CONTEXT is None:\n        return None, None, \"Index context is not configured. Re-run with `--use-index`.\"\n\n    storage = DuckDBStorage(_INDEX_CONTEXT.db_path)\n    corpus_id = storage.get_corpus_id(_INDEX_CONTEXT.root_folder)\n    if corpus_id is None:\n        return (\n            None,\n            None,\n            f\"No index found for folder {_INDEX_CONTEXT.root_folder}. \"\n            \"Run `explore index <folder>` first.\",\n        )\n    return storage, corpus_id, None\n\n\ndef _clean_excerpt(text: str, max_chars: int = 320) -> str:\n    squashed = re.sub(r\"\\s+\", \" \", text).strip()\n    if len(squashed) <= max_chars:\n        return squashed\n    return f\"{squashed[:max_chars]}...\"\n\n\ndef semantic_search(query: str, filters: str | None = None, limit: int = 5) -> str:\n    \"\"\"Search indexed chunks and return ranked excerpts.\"\"\"\n    storage, corpus_id, error = _get_index_storage_and_corpus()\n    if error:\n        return error\n    assert storage is not None and corpus_id is not None\n\n    engine = IndexedQueryEngine(storage, embedding_provider=_EMBEDDING_PROVIDER)\n    try:\n        hits = engine.search(\n            corpus_id=corpus_id,\n            query=query,\n            filters=filters,\n            limit=limit,\n            enable_semantic=_ENABLE_SEMANTIC,\n            enable_metadata=_ENABLE_METADATA,\n        )\n    except MetadataFilterParseError as exc:\n        return 
f\"Invalid metadata filter: {exc}\\n{supported_filter_syntax()}\"\n    except ValueError as exc:\n        return f\"Metadata filter error: {exc}\"\n\n    if not hits:\n        if filters:\n            return f\"No indexed matches found for query={query!r} with filters={filters!r}.\"\n        return f\"No indexed matches found for query: {query!r}\"\n\n    lines = [\n        \"=== INDEXED SEARCH RESULTS ===\",\n        f\"Query: {query}\",\n    ]\n    if filters:\n        lines.append(f\"Filters: {filters}\")\n    lines.append(\"\")\n    for idx, hit in enumerate(hits, start=1):\n        position = hit.position if hit.position is not None else \"<metadata>\"\n        lines.extend(\n            [\n                f\"[{idx}] doc_id: {hit.doc_id}\",\n                f\"    path: {hit.absolute_path}\",\n                f\"    match: {hit.matched_by}\",\n                f\"    chunk_position: {position}\",\n                f\"    semantic_score: {hit.semantic_score}\",\n                f\"    metadata_score: {hit.metadata_score}\",\n                f\"    score: {hit.score:.2f}\",\n                f\"    excerpt: {_clean_excerpt(hit.text)}\",\n                \"\",\n            ]\n        )\n    lines.append(\n        \"Use get_document(doc_id=...) 
to read full content for the most relevant documents.\"\n    )\n\n    # Include a rich field catalog on the first search so the agent can\n    # construct effective metadata filters.\n    global _FIELD_CATALOG_SHOWN\n    if not _FIELD_CATALOG_SHOWN:\n        active_schema = storage.get_active_schema(corpus_id=corpus_id)\n        if active_schema is not None:\n            schema_fields = active_schema.schema_def.get(\"fields\")\n            if isinstance(schema_fields, list) and schema_fields:\n                field_names = [\n                    str(f[\"name\"])\n                    for f in schema_fields\n                    if isinstance(f, dict) and isinstance(f.get(\"name\"), str)\n                ]\n                field_values = storage.get_metadata_field_values(\n                    corpus_id=corpus_id,\n                    field_names=field_names,\n                )\n                field_descs: list[str] = []\n                for field in schema_fields:\n                    if not isinstance(field, dict) or not isinstance(\n                        field.get(\"name\"), str\n                    ):\n                        continue\n                    name = field[\"name\"]\n                    ftype = field.get(\"type\", \"string\")\n                    desc = field.get(\"description\", \"\")\n                    entry = f\"{name} ({ftype})\"\n                    if desc:\n                        entry += f\": {desc}\"\n                    vals = field_values.get(name, [])\n                    if ftype == \"boolean\":\n                        entry += \" Values: true, false\"\n                    elif ftype in {\"integer\", \"number\"} and vals:\n                        nums = []\n                        for v in vals:\n                            try:\n                                nums.append(float(v))\n                            except (TypeError, ValueError):\n                                pass\n                        if nums:\n                    
        entry += f\" Range: {min(nums):.6g}-{max(nums):.6g}\"\n                    elif vals:\n                        if \"enum\" in field:\n                            entry += f\" Values: {field['enum']}\"\n                        else:\n                            entry += f\" Values: {', '.join(repr(v) for v in vals)}\"\n                    elif \"enum\" in field:\n                        entry += f\" Values: {field['enum']}\"\n                    field_descs.append(entry)\n                if field_descs:\n                    lines.append(\"\")\n                    lines.append(\n                        \"Available filter fields for semantic_search(filters=...):\"\n                    )\n                    for desc in field_descs:\n                        lines.append(f\"  - {desc}\")\n                _FIELD_CATALOG_SHOWN = True\n\n    return \"\\n\".join(lines)\n\n\ndef get_document(doc_id: str) -> str:\n    \"\"\"Return full document content by id from the active index context.\"\"\"\n    storage, _, error = _get_index_storage_and_corpus()\n    if error:\n        return error\n    assert storage is not None\n\n    document = storage.get_document(doc_id=doc_id)\n    if document is None:\n        return f\"No indexed document found for doc_id={doc_id!r}\"\n    if document[\"is_deleted\"]:\n        return f\"Document {doc_id} is marked as deleted in the index.\"\n\n    return (\n        f\"=== DOCUMENT {doc_id} ===\\n\"\n        f\"Path: {document['absolute_path']}\\n\\n\"\n        f\"{document['content']}\"\n    )\n\n\ndef list_indexed_documents() -> str:\n    \"\"\"List indexed documents for the active corpus.\"\"\"\n    storage, corpus_id, error = _get_index_storage_and_corpus()\n    if error:\n        return error\n    assert storage is not None and corpus_id is not None\n\n    documents = storage.list_documents(corpus_id=corpus_id, include_deleted=False)\n    if not documents:\n        return \"No indexed documents found for the active corpus.\"\n\n    
lines = [\"=== INDEXED DOCUMENTS ===\"]\n    for idx, document in enumerate(documents, start=1):\n        lines.append(\n            f\"[{idx}] doc_id={document['id']} path={document['absolute_path']}\"\n        )\n    lines.append(\"\")\n    lines.append(\"Use semantic_search(...) to find relevant doc_ids.\")\n    return \"\\n\".join(lines)\n\n\nTOOLS: dict[Tools, Callable[..., str]] = {\n    \"read\": read_file,\n    \"grep\": grep_file_content,\n    \"glob\": glob_paths,\n    \"scan_folder\": scan_folder,\n    \"preview_file\": preview_file,\n    \"parse_file\": parse_file,\n    \"semantic_search\": semantic_search,\n    \"get_document\": get_document,\n    \"list_indexed_documents\": list_indexed_documents,\n}\n\n\n# =============================================================================\n# System Prompt\n# =============================================================================\n\nSYSTEM_PROMPT = \"\"\"\nYou are FsExplorer, an AI agent that explores filesystems to answer user questions about documents.\n\n## Available Tools\n\n| Tool | Purpose | Parameters |\n|------|---------|------------|\n| `scan_folder` | **PARALLEL SCAN** - Scan ALL documents in a folder at once | `directory` |\n| `preview_file` | Quick preview of a single document (~first page) | `file_path` |\n| `parse_file` | **DEEP READ** - Full content of a document | `file_path` |\n| `read` | Read a plain text file | `file_path` |\n| `grep` | Search for a pattern in a file | `file_path`, `pattern` |\n| `glob` | Find files matching a pattern | `directory`, `pattern` |\n| `semantic_search` | Search indexed chunks and metadata-filtered docs, then union/rank results | `query`, `filters`, `limit` |\n| `get_document` | Read full indexed document by document id | `doc_id` |\n| `list_indexed_documents` | List indexed documents for active corpus | none |\n\n## Indexed Retrieval Strategy\n\nWhen indexed tools are available:\n1. Start with `semantic_search` to quickly find relevant documents.\n2. 
Use `get_document` for the top candidate doc IDs.\n3. If indexed tools report index is unavailable, fall back to filesystem tools (`scan_folder`, `parse_file`, etc.).\n\nFilter syntax for `semantic_search(filters=...)`:\n- `field=value`\n- `field!=value`\n- `field>=number`, `field<=number`, `field>number`, `field<number`\n- `field in (a, b, c)`\n- `field~substring`\n- combine conditions with comma or `and`\n\n## Three-Phase Document Exploration Strategy\n\n### PHASE 1: Parallel Scan (Use `scan_folder`)\nWhen you encounter a folder with documents:\n1. Use `scan_folder` to scan ALL documents in parallel\n2. This gives you a quick preview of every document at once\n3. In your **reason**, explicitly list your document categorization:\n   - **RELEVANT**: Documents clearly related to the query (list them)\n   - **MAYBE**: Documents that might be relevant (list them)\n   - **SKIP**: Documents not relevant (list them)\n\n### PHASE 2: Deep Dive (Use `parse_file`)\n1. Use `parse_file` on documents marked RELEVANT\n2. In your **reason**, explain what key information you found\n3. **WATCH FOR CROSS-REFERENCES** - look for mentions like:\n   - \"See Exhibit A/B/C...\"\n   - \"As stated in the [Document Name]...\"\n   - \"Refer to [filename]...\"\n   - Document numbers, exhibit labels, or file names\n4. In your **reason**, note any cross-references you discovered\n\n### PHASE 3: Backtracking (Revisit if Cross-Referenced)\n**CRITICAL**: If a document you're reading references another document that you SKIPPED:\n1. In your **reason**, explain: \"Found cross-reference to [document] - need to backtrack\"\n2. Use `preview_file` or `parse_file` to read the referenced document\n3. 
Continue this until all relevant cross-references are resolved\n\n## Providing Detailed Reasoning\n\nYour `reason` field is displayed to the user, so make it informative:\n- After scanning: List which documents you're categorizing as RELEVANT/MAYBE/SKIP and why\n- After parsing: Summarize key findings and any cross-references discovered\n- When backtracking: Explain which reference led you back to a skipped document\n\n## CRITICAL: Citation Requirements for Final Answers\n\nWhen providing your final answer, you MUST include citations for ALL factual claims:\n\n### Citation Format\nUse inline citations in this format: `[Source: filename, Section/Page]`\n\nExample:\n> The total purchase price is $125,000,000 [Source: 01_master_agreement.pdf, Section 2.1], \n> consisting of $80M cash [Source: 01_master_agreement.pdf, Section 2.1(a)], \n> $30M in stock [Source: 10_stock_purchase.pdf, Section 1], and \n> $15M in escrow [Source: 09_escrow_agreement.pdf, Section 2].\n\n### Citation Rules\n1. **Every factual claim needs a citation** - dates, numbers, names, terms, etc.\n2. **Be specific** - include section numbers, article numbers, or page references when available\n3. **Use the actual filename** - not paraphrased names\n4. **Multiple sources** - if information comes from multiple documents, cite all of them\n\n### Final Answer Structure\nYour final answer should:\n1. **Start with a direct answer** to the user's question\n2. **Provide details** with inline citations\n3. **End with a Sources section** listing all documents consulted:\n\n```\n## Sources Consulted\n- 01_master_agreement.pdf - Main acquisition terms\n- 10_stock_purchase.pdf - Stock component details  \n- 09_escrow_agreement.pdf - Escrow terms and release schedule\n```\n\n## Example Workflow\n\n```\nUser asks: \"What is the purchase price?\"\n\n1. scan_folder(\"./documents/\")\n   Reason: \"Scanned 10 documents. 
Categorizing:\n   - RELEVANT: purchase_agreement.pdf (mentions 'Purchase Price' in preview)\n   - RELEVANT: financial_terms.pdf (contains pricing tables)\n   - MAYBE: exhibits.pdf (referenced by other docs)\n   - SKIP: employee_handbook.pdf, hr_policies.pdf (unrelated to pricing)\"\n\n2. parse_file(\"purchase_agreement.pdf\")\n   Reason: \"Found purchase price of $50M in Section 2.1. Document references \n   'Exhibit B for price adjustments' - need to check exhibits.pdf next.\"\n\n3. parse_file(\"exhibits.pdf\")  [BACKTRACKING]\n   Reason: \"Backtracking to exhibits.pdf because purchase_agreement.pdf \n   referenced it for adjustment details. Found working capital adjustment \n   formula in Exhibit B.\"\n\n4. STOP with final answer including citations:\n   \"The purchase price is $50,000,000 [Source: purchase_agreement.pdf, Section 2.1], \n   subject to working capital adjustments [Source: exhibits.pdf, Exhibit B]...\"\n```\n\"\"\"\n\ndef _build_system_prompt(enable_semantic: bool, enable_metadata: bool) -> str:\n    \"\"\"Build a system prompt with retrieval-path guidance appended.\"\"\"\n    if enable_semantic and enable_metadata:\n        hint = (\n            \"\\n\\n## Retrieval: Semantic + Metadata\\n\"\n            \"An index is available. Start with `semantic_search` using optional \"\n            \"`filters` for best results, then use filesystem tools for deep dives.\"\n        )\n    elif enable_semantic:\n        hint = (\n            \"\\n\\n## Retrieval: Semantic Only\\n\"\n            \"An index is available. Use `semantic_search` WITHOUT the `filters` \"\n            \"parameter for similarity search, then use filesystem tools for details.\"\n        )\n    elif enable_metadata:\n        hint = (\n            \"\\n\\n## Retrieval: Metadata Only\\n\"\n            \"An index is available. 
Use `semantic_search` with the `filters=` \"\n            \"parameter for metadata filtering, then use filesystem tools for details.\"\n        )\n    else:\n        return SYSTEM_PROMPT\n    return SYSTEM_PROMPT + hint\n\n\n# =============================================================================\n# Agent Implementation\n# =============================================================================\n\n\nclass FsExplorerAgent:\n    \"\"\"\n    AI agent for exploring filesystems using Google Gemini.\n\n    The agent maintains a conversation history with the LLM and uses\n    structured JSON output to make decisions about which actions to take.\n\n    Attributes:\n        token_usage: Tracks API call statistics and costs.\n    \"\"\"\n\n    def __init__(self, api_key: str | None = None) -> None:\n        \"\"\"\n        Initialize the agent with Google API credentials.\n\n        Args:\n            api_key: Google API key. If not provided, reads from\n                     GOOGLE_API_KEY environment variable.\n\n        Raises:\n            ValueError: If no API key is available.\n        \"\"\"\n        if api_key is None:\n            api_key = os.getenv(\"GOOGLE_API_KEY\")\n        if api_key is None:\n            raise ValueError(\n                \"GOOGLE_API_KEY not found within the current environment: \"\n                \"please export it or provide it to the class constructor.\"\n            )\n\n        self._client = GenAIClient(\n            api_key=api_key,\n            http_options=HttpOptions(api_version=\"v1beta\"),\n        )\n        self._chat_history: list[Content] = []\n        self.token_usage = TokenUsage()\n\n    def configure_task(self, task: str) -> None:\n        \"\"\"\n        Add a task message to the conversation history.\n\n        Args:\n            task: The task or context to add to the conversation.\n        \"\"\"\n        self._chat_history.append(\n            Content(role=\"user\", parts=[Part.from_text(text=task)])\n    
    )\n\n    async def take_action(self) -> tuple[Action, ActionType] | None:\n        \"\"\"\n        Request the next action from the AI model.\n\n        Sends the current conversation history to Gemini and receives\n        a structured JSON response indicating the next action to take.\n\n        Returns:\n            A tuple of (Action, ActionType) if successful, None otherwise.\n        \"\"\"\n        response = await self._client.aio.models.generate_content(\n            model=\"gemini-3-flash-preview\",\n            contents=self._chat_history,  # type: ignore\n            config={\n                \"system_instruction\": _build_system_prompt(_ENABLE_SEMANTIC, _ENABLE_METADATA),\n                \"response_mime_type\": \"application/json\",\n                \"response_schema\": Action,\n            },\n        )\n\n        # Track token usage from response metadata\n        if response.usage_metadata:\n            self.token_usage.add_api_call(\n                prompt_tokens=response.usage_metadata.prompt_token_count or 0,\n                completion_tokens=response.usage_metadata.candidates_token_count or 0,\n            )\n\n        if response.candidates is not None:\n            if response.candidates[0].content is not None:\n                self._chat_history.append(response.candidates[0].content)\n            if response.text is not None:\n                action = Action.model_validate_json(response.text)\n                if action.to_action_type() == \"toolcall\":\n                    toolcall = cast(ToolCallAction, action.action)\n                    self.call_tool(\n                        tool_name=toolcall.tool_name,\n                        tool_input=toolcall.to_fn_args(),\n                    )\n                return action, action.to_action_type()\n\n        return None\n\n    def call_tool(self, tool_name: Tools, tool_input: dict[str, Any]) -> None:\n        \"\"\"\n        Execute a tool and add the result to the conversation history.\n\n 
       Args:\n            tool_name: Name of the tool to execute.\n            tool_input: Dictionary of arguments to pass to the tool.\n        \"\"\"\n        try:\n            result = TOOLS[tool_name](**tool_input)\n        except Exception as e:\n            result = (\n                f\"An error occurred while calling tool {tool_name} \"\n                f\"with {tool_input}: {e}\"\n            )\n\n        # Track tool result sizes\n        self.token_usage.add_tool_result(result, tool_name)\n\n        self._chat_history.append(\n            Content(\n                role=\"user\",\n                parts=[\n                    Part.from_text(text=f\"Tool result for {tool_name}:\\n\\n{result}\")\n                ],\n            )\n        )\n\n    def reset(self) -> None:\n        \"\"\"Reset the agent's conversation history and token tracking.\"\"\"\n        self._chat_history.clear()\n        self.token_usage = TokenUsage()\n"
  },
  {
    "path": "src/fs_explorer/embeddings.py",
    "content": "\"\"\"\nEmbedding provider for vector-based semantic search.\n\nWraps the Google GenAI embedding API for batch and single-query embedding\nwith configurable model, dimensions, and batch size.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport os\nfrom typing import Any\n\nfrom google.genai import Client as GenAIClient\n\n\n_DEFAULT_MODEL = \"gemini-embedding-001\"\n_DEFAULT_DIM = 768\n_DEFAULT_BATCH_SIZE = 50\n\n\nclass EmbeddingProvider:\n    \"\"\"Generate text embeddings via Google GenAI.\"\"\"\n\n    def __init__(\n        self,\n        *,\n        api_key: str | None = None,\n        model: str | None = None,\n        dim: int | None = None,\n        batch_size: int | None = None,\n        client: Any | None = None,\n    ) -> None:\n        self.model = model or os.getenv(\"FS_EXPLORER_EMBEDDING_MODEL\", _DEFAULT_MODEL)\n        self.dim = dim or int(os.getenv(\"FS_EXPLORER_EMBEDDING_DIM\", str(_DEFAULT_DIM)))\n        self.batch_size = batch_size or int(\n            os.getenv(\"FS_EXPLORER_EMBEDDING_BATCH_SIZE\", str(_DEFAULT_BATCH_SIZE))\n        )\n\n        if client is not None:\n            self._client = client\n        else:\n            resolved_key = api_key or os.getenv(\"GOOGLE_API_KEY\")\n            if resolved_key is None:\n                raise ValueError(\n                    \"GOOGLE_API_KEY not found. 
\"\n                    \"Provide api_key or set the environment variable.\"\n                )\n            self._client = GenAIClient(api_key=resolved_key)\n\n    def embed_texts(\n        self,\n        texts: list[str],\n        *,\n        task_type: str = \"RETRIEVAL_DOCUMENT\",\n    ) -> list[list[float]]:\n        \"\"\"Embed a list of texts in batches.\n\n        Returns a list of embedding vectors in the same order as *texts*.\n        \"\"\"\n        all_embeddings: list[list[float]] = []\n        for start in range(0, len(texts), self.batch_size):\n            batch = texts[start : start + self.batch_size]\n            result = self._client.models.embed_content(\n                model=self.model,\n                contents=batch,\n                config={\n                    \"task_type\": task_type,\n                    \"output_dimensionality\": self.dim,\n                },\n            )\n            for emb in result.embeddings:\n                all_embeddings.append(list(emb.values))\n        return all_embeddings\n\n    def embed_query(self, query: str) -> list[float]:\n        \"\"\"Embed a single query text for retrieval.\"\"\"\n        result = self._client.models.embed_content(\n            model=self.model,\n            contents=[query],\n            config={\n                \"task_type\": \"RETRIEVAL_QUERY\",\n                \"output_dimensionality\": self.dim,\n            },\n        )\n        return list(result.embeddings[0].values)\n"
  },
  {
    "path": "src/fs_explorer/exploration_trace.py",
    "content": "\"\"\"\nHelpers for recording exploration path and referenced files.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport os\nimport re\nfrom dataclasses import dataclass, field\nfrom typing import Any\n\n\nFILE_TOOLS: frozenset[str] = frozenset({\"read\", \"grep\", \"preview_file\", \"parse_file\"})\n\n# Matches citations like: [Source: filename.pdf, Section 2.1]\nSOURCE_CITATION_RE = re.compile(r\"\\[Source:\\s*([^,\\]]+)\")\n\n\ndef normalize_path(path: str, root_directory: str) -> str:\n    \"\"\"Return an absolute path using root_directory for relative inputs.\"\"\"\n    if os.path.isabs(path):\n        return os.path.abspath(path)\n    return os.path.abspath(os.path.join(root_directory, path))\n\n\ndef extract_cited_sources(final_result: str | None) -> list[str]:\n    \"\"\"Extract source labels from final answer citations while preserving order.\"\"\"\n    if not final_result:\n        return []\n\n    seen: set[str] = set()\n    ordered_sources: list[str] = []\n\n    for raw_source in SOURCE_CITATION_RE.findall(final_result):\n        source = raw_source.strip()\n        if source and source not in seen:\n            seen.add(source)\n            ordered_sources.append(source)\n\n    return ordered_sources\n\n\n@dataclass\nclass ExplorationTrace:\n    \"\"\"\n    Collects a step-by-step path and files referenced by tool calls.\n\n    Paths are normalized to absolute paths to make replay/debugging easier.\n    \"\"\"\n\n    root_directory: str\n    step_path: list[str] = field(default_factory=list)\n    referenced_documents: set[str] = field(default_factory=set)\n\n    def record_tool_call(\n        self,\n        *,\n        step_number: int,\n        tool_name: str,\n        tool_input: dict[str, Any],\n        resolved_document_path: str | None = None,\n    ) -> None:\n        \"\"\"Record a tool call in the exploration path.\"\"\"\n        path_entries: list[str] = []\n\n        directory = tool_input.get(\"directory\")\n        if 
isinstance(directory, str) and directory:\n            path_entries.append(f\"directory={normalize_path(directory, self.root_directory)}\")\n\n        file_path = tool_input.get(\"file_path\")\n        if isinstance(file_path, str) and file_path:\n            normalized_file_path = normalize_path(file_path, self.root_directory)\n            path_entries.append(f\"file={normalized_file_path}\")\n            if tool_name in FILE_TOOLS:\n                self.referenced_documents.add(normalized_file_path)\n\n        if resolved_document_path:\n            normalized_doc_path = normalize_path(resolved_document_path, self.root_directory)\n            path_entries.append(f\"document={normalized_doc_path}\")\n            self.referenced_documents.add(normalized_doc_path)\n\n        parameters = \", \".join(path_entries) if path_entries else \"no-path-args\"\n        self.step_path.append(f\"{step_number}. tool:{tool_name} ({parameters})\")\n\n    def record_go_deeper(self, *, step_number: int, directory: str) -> None:\n        \"\"\"Record a directory navigation event in the exploration path.\"\"\"\n        resolved_dir = normalize_path(directory, self.root_directory)\n        self.step_path.append(f\"{step_number}. godeeper (directory={resolved_dir})\")\n\n    def sorted_documents(self) -> list[str]:\n        \"\"\"Return a sorted list of referenced documents.\"\"\"\n        return sorted(self.referenced_documents)\n"
  },
  {
    "path": "src/fs_explorer/fs.py",
    "content": "\"\"\"\nFilesystem utilities for the FsExplorer agent.\n\nThis module provides functions for reading, searching, and parsing files\nin the filesystem, including support for complex document formats via Docling.\n\"\"\"\n\nimport os\nimport re\nimport glob as glob_module\nfrom concurrent.futures import ThreadPoolExecutor, as_completed\nfrom pathlib import Path\n\nfrom docling.document_converter import DocumentConverter\n\n\n# =============================================================================\n# Configuration Constants\n# =============================================================================\n\n# Supported document extensions for parsing\nSUPPORTED_EXTENSIONS: frozenset[str] = frozenset({\n    \".pdf\", \".docx\", \".doc\", \".pptx\", \".xlsx\", \".html\", \".md\"\n})\n\n# Preview settings\nDEFAULT_PREVIEW_CHARS = 3000  # Characters for single file preview (~2-3 pages)\nDEFAULT_SCAN_PREVIEW_CHARS = 1500  # Characters for folder scan preview (~1 page)\nMAX_PREVIEW_LINES = 30  # Maximum lines to show in scan results\n\n# Parallel processing settings\nDEFAULT_MAX_WORKERS = 4  # Thread pool size for parallel document scanning\n\n\n# =============================================================================\n# Document Cache\n# =============================================================================\n\n# Cache for parsed documents to avoid re-parsing\n_DOCUMENT_CACHE: dict[str, str] = {}\n\n\ndef clear_document_cache() -> None:\n    \"\"\"Clear the document cache. 
Useful for testing or memory management.\"\"\"\n    _DOCUMENT_CACHE.clear()\n\n\ndef _get_cached_or_parse(file_path: str) -> str:\n    \"\"\"\n    Get document content from cache or parse it.\n    \n    Uses file modification time in cache key to invalidate stale entries.\n    \n    Args:\n        file_path: Path to the document file.\n    \n    Returns:\n        The document content as markdown.\n    \n    Raises:\n        Exception: If the document cannot be parsed.\n    \"\"\"\n    abs_path = os.path.abspath(file_path)\n    cache_key = f\"{abs_path}:{os.path.getmtime(abs_path)}\"\n    \n    if cache_key not in _DOCUMENT_CACHE:\n        converter = DocumentConverter()\n        result = converter.convert(file_path)\n        _DOCUMENT_CACHE[cache_key] = result.document.export_to_markdown()\n    \n    return _DOCUMENT_CACHE[cache_key]\n\n\n# =============================================================================\n# Directory Operations\n# =============================================================================\n\ndef describe_dir_content(directory: str) -> str:\n    \"\"\"\n    Describe the contents of a directory.\n    \n    Lists all files and subdirectories in the given directory path.\n    \n    Args:\n        directory: Path to the directory to describe.\n    \n    Returns:\n        A formatted string describing the directory contents,\n        or an error message if the directory doesn't exist.\n    \"\"\"\n    if not os.path.exists(directory) or not os.path.isdir(directory):\n        return f\"No such directory: {directory}\"\n    \n    children = os.listdir(directory)\n    if not children:\n        return f\"Directory {directory} is empty\"\n    \n    files = []\n    directories = []\n    \n    for child in children:\n        fullpath = os.path.join(directory, child)\n        if os.path.isfile(fullpath):\n            files.append(fullpath)\n        else:\n            directories.append(fullpath)\n    \n    description = f\"Content of 
{directory}\\n\"\n    if files:\n        description += \"FILES:\\n- \" + \"\\n- \".join(files)\n    else:\n        description += \"This folder does not contain any files\"\n    \n    if not directories:\n        description += \"\\nThis folder does not have any sub-folders\"\n    else:\n        description += \"\\nSUBFOLDERS:\\n- \" + \"\\n- \".join(directories)\n    \n    return description\n\n\n# =============================================================================\n# Basic File Operations\n# =============================================================================\n\ndef read_file(file_path: str) -> str:\n    \"\"\"\n    Read the contents of a text file.\n    \n    Args:\n        file_path: Path to the file to read.\n    \n    Returns:\n        The file contents, or an error message if the file doesn't exist.\n    \"\"\"\n    if not os.path.exists(file_path) or not os.path.isfile(file_path):\n        return f\"No such file: {file_path}\"\n    \n    with open(file_path, \"r\") as f:\n        return f.read()\n\n\ndef grep_file_content(file_path: str, pattern: str) -> str:\n    \"\"\"\n    Search for a regex pattern in a file.\n    \n    Args:\n        file_path: Path to the file to search.\n        pattern: Regular expression pattern to search for.\n    \n    Returns:\n        A formatted string with matches, \"No matches found\",\n        or an error message if the file doesn't exist.\n    \"\"\"\n    if not os.path.exists(file_path) or not os.path.isfile(file_path):\n        return f\"No such file: {file_path}\"\n    \n    with open(file_path, \"r\") as f:\n        content = f.read()\n    \n    try:\n        regex = re.compile(pattern=pattern, flags=re.MULTILINE)\n    except re.error as e:\n        return f\"Invalid regex pattern: {e}\"\n    \n    # finditer/group(0) always yields full-match strings; findall would\n    # return capture-group tuples for patterns that contain groups,\n    # breaking the join below.\n    matches = [m.group(0) for m in regex.finditer(content)]\n    \n    if matches:\n        return f\"MATCHES for {pattern} in {file_path}:\\n\\n- \" + \"\\n- \".join(matches)\n    return \"No matches found\"\n\n\ndef glob_paths(directory: str, pattern: str) -> str:\n    \"\"\"\n    Find files matching a glob pattern in a directory.\n    \n    Args:\n        directory: Path to the directory to search in.\n        
pattern: Glob pattern to match (e.g., \"*.txt\", \"**/*.pdf\").\n    \n    Returns:\n        A formatted string with matching paths, \"No matches found\",\n        or an error message if the directory doesn't exist.\n    \"\"\"\n    if not os.path.exists(directory) or not os.path.isdir(directory):\n        return f\"No such directory: {directory}\"\n    \n    # Use pathlib for cleaner path handling; recursive=True is required for\n    # \"**\" patterns to match across subdirectories, as the docstring promises.\n    search_path = Path(directory) / pattern\n    matches = glob_module.glob(str(search_path), recursive=True)\n    \n    if matches:\n        return f\"MATCHES for {pattern} in {directory}:\\n\\n- \" + \"\\n- \".join(matches)\n    return \"No matches found\"\n\n\n# =============================================================================\n# Document Parsing Operations\n# =============================================================================\n\ndef preview_file(file_path: str, max_chars: int = DEFAULT_PREVIEW_CHARS) -> str:\n    \"\"\"\n    Get a quick preview of a document file.\n    \n    Reads only the first portion of the document content for initial\n    relevance assessment before doing a full parse.\n    \n    Args:\n        file_path: Path to the document file.\n        max_chars: Maximum characters to return (default: 3000, ~2-3 pages).\n    \n    Returns:\n        A preview of the document content, or an error message.\n    \"\"\"\n    if not os.path.exists(file_path) or not os.path.isfile(file_path):\n        return f\"No such file: {file_path}\"\n\n    ext = os.path.splitext(file_path)[1].lower()\n    if ext not in SUPPORTED_EXTENSIONS:\n        return (\n            f\"Unsupported file extension: {ext}. \"\n            f\"Supported: {', '.join(sorted(SUPPORTED_EXTENSIONS))}\"\n        )\n\n    try:\n        full_content = _get_cached_or_parse(file_path)\n        preview = full_content[:max_chars]\n        \n        total_len = len(full_content)\n        if total_len > max_chars:\n            preview += (\n                f\"\\n\\n[... PREVIEW TRUNCATED. 
Full document has {total_len:,} \"\n                f\"characters. Use parse_file() to read the complete document ...]\"\n            )\n        \n        return f\"=== PREVIEW of {file_path} ===\\n\\n{preview}\"\n    except Exception as e:\n        return f\"Error previewing {file_path}: {e}\"\n\n\ndef parse_file(file_path: str) -> str:\n    \"\"\"\n    Parse and return the complete content of a document file.\n    \n    Use this after preview_file() confirms the document is relevant,\n    or when you need to find cross-references to other documents.\n    \n    Supported formats: PDF, DOCX, DOC, PPTX, XLSX, HTML, MD.\n    \n    Args:\n        file_path: Path to the document file.\n    \n    Returns:\n        The complete document content as markdown, or an error message.\n    \"\"\"\n    if not os.path.exists(file_path) or not os.path.isfile(file_path):\n        return f\"No such file: {file_path}\"\n\n    ext = os.path.splitext(file_path)[1].lower()\n    if ext not in SUPPORTED_EXTENSIONS:\n        return (\n            f\"Unsupported file extension: {ext}. 
\"\n            f\"Supported: {', '.join(sorted(SUPPORTED_EXTENSIONS))}\"\n        )\n\n    try:\n        return _get_cached_or_parse(file_path)\n    except Exception as e:\n        return f\"Error parsing {file_path}: {e}\"\n\n\n# =============================================================================\n# Parallel Document Scanning\n# =============================================================================\n\ndef _preview_single_file(file_path: str, preview_chars: int) -> dict:\n    \"\"\"\n    Helper to preview a single file for parallel processing.\n    \n    Args:\n        file_path: Path to the document file.\n        preview_chars: Number of characters to include in preview.\n    \n    Returns:\n        A dictionary with file info and preview content.\n    \"\"\"\n    filename = os.path.basename(file_path)\n    try:\n        content = _get_cached_or_parse(file_path)\n        preview = content[:preview_chars]\n        return {\n            \"file\": file_path,\n            \"filename\": filename,\n            \"preview\": preview,\n            \"total_chars\": len(content),\n            \"status\": \"success\"\n        }\n    except Exception as e:\n        return {\n            \"file\": file_path,\n            \"filename\": filename,\n            \"preview\": \"\",\n            \"total_chars\": 0,\n            \"status\": f\"error: {e}\"\n        }\n\n\ndef scan_folder(\n    directory: str,\n    max_workers: int = DEFAULT_MAX_WORKERS,\n    preview_chars: int = DEFAULT_SCAN_PREVIEW_CHARS,\n) -> str:\n    \"\"\"\n    Scan all documents in a folder in parallel and return quick previews.\n    \n    This is the FIRST step when exploring a folder with multiple documents.\n    It efficiently processes all documents at once so you can assess relevance\n    before doing deep dives into specific files.\n    \n    Args:\n        directory: Path to the folder to scan.\n        max_workers: Number of parallel workers (default: 4).\n        preview_chars: 
Characters to preview per file (default: 1500, ~1 page).\n    \n    Returns:\n        A formatted summary of all documents with their previews.\n    \"\"\"\n    if not os.path.exists(directory) or not os.path.isdir(directory):\n        return f\"No such directory: {directory}\"\n    \n    # Find all supported document files\n    doc_files = []\n    for item in os.listdir(directory):\n        item_path = os.path.join(directory, item)\n        if os.path.isfile(item_path):\n            ext = os.path.splitext(item)[1].lower()\n            if ext in SUPPORTED_EXTENSIONS:\n                doc_files.append(item_path)\n    \n    if not doc_files:\n        return (\n            f\"No supported documents found in {directory}. \"\n            f\"Supported extensions: {', '.join(sorted(SUPPORTED_EXTENSIONS))}\"\n        )\n    \n    # Scan all documents in parallel\n    results = []\n    with ThreadPoolExecutor(max_workers=max_workers) as executor:\n        future_to_file = {\n            executor.submit(_preview_single_file, f, preview_chars): f \n            for f in doc_files\n        }\n        for future in as_completed(future_to_file):\n            results.append(future.result())\n    \n    # Sort by filename for consistent ordering\n    results.sort(key=lambda x: x[\"filename\"])\n    \n    # Build the summary report\n    output = []\n    output.append(\"═══════════════════════════════════════════════════════════════\")\n    output.append(f\"  PARALLEL DOCUMENT SCAN: {directory}\")\n    output.append(f\"  Found {len(results)} documents\")\n    output.append(\"═══════════════════════════════════════════════════════════════\")\n    output.append(\"\")\n    \n    for i, result in enumerate(results, 1):\n        output.append(\"┌─────────────────────────────────────────────────────────────\")\n        output.append(f\"│ [{i}/{len(results)}] {result['filename']}\")\n        output.append(f\"│ Path: {result['file']}\")\n        output.append(f\"│ Status: {result['status']} | 
Total size: {result['total_chars']:,} chars\")\n        output.append(\"├─────────────────────────────────────────────────────────────\")\n        \n        if result['status'] == 'success' and result['preview']:\n            # Indent the preview content\n            preview_lines = result['preview'].split('\\n')\n            for line in preview_lines[:MAX_PREVIEW_LINES]:\n                output.append(f\"│ {line}\")\n            if len(preview_lines) > MAX_PREVIEW_LINES:\n                output.append(\"│ ... (preview truncated)\")\n        else:\n            output.append(\"│ [No preview available]\")\n        \n        output.append(\"└─────────────────────────────────────────────────────────────\")\n        output.append(\"\")\n    \n    output.append(\"═══════════════════════════════════════════════════════════════\")\n    output.append(\"  NEXT STEPS:\")\n    output.append(\"  1. Assess which documents are RELEVANT to the user's query\")\n    output.append(\"  2. Use parse_file() for DEEP DIVE into relevant documents\")\n    output.append(\"  3. Watch for cross-references to other docs (may need backtracking)\")\n    output.append(\"═══════════════════════════════════════════════════════════════\")\n    \n    return \"\\n\".join(output)\n"
  },
  {
    "path": "src/fs_explorer/index_config.py",
    "content": "\"\"\"\nConfiguration helpers for local index storage.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport os\nfrom pathlib import Path\n\n\nDEFAULT_DB_PATH = \"~/.fs_explorer/index.duckdb\"\nENV_DB_PATH = \"FS_EXPLORER_DB_PATH\"\n\n\ndef resolve_db_path(override_path: str | None = None) -> str:\n    \"\"\"\n    Resolve the DuckDB path from CLI override, env var, or default.\n\n    Precedence:\n    1) explicit override_path\n    2) FS_EXPLORER_DB_PATH\n    3) default path\n    \"\"\"\n    raw_path = override_path or os.getenv(ENV_DB_PATH) or DEFAULT_DB_PATH\n    resolved = Path(raw_path).expanduser().resolve()\n    resolved.parent.mkdir(parents=True, exist_ok=True)\n    return str(resolved)\n"
  },
  {
    "path": "src/fs_explorer/indexing/__init__.py",
    "content": "\"\"\"Indexing components for FsExplorer.\"\"\"\n\nfrom .chunker import SmartChunker, TextChunk\nfrom .pipeline import IndexingPipeline, IndexingResult\nfrom .schema import SchemaDiscovery\n\n__all__ = [\n    \"SmartChunker\",\n    \"TextChunk\",\n    \"IndexingPipeline\",\n    \"IndexingResult\",\n    \"SchemaDiscovery\",\n]\n"
  },
  {
    "path": "src/fs_explorer/indexing/chunker.py",
    "content": "\"\"\"\nChunking utilities for indexing document content.\n\"\"\"\n\nfrom __future__ import annotations\n\nfrom dataclasses import dataclass\n\n\n@dataclass(frozen=True)\nclass TextChunk:\n    \"\"\"A content chunk with source offsets.\"\"\"\n\n    text: str\n    position: int\n    start_char: int\n    end_char: int\n\n\nclass SmartChunker:\n    \"\"\"\n    Paragraph-aware chunker with overlap.\n\n    This implementation is char-based to keep it deterministic and lightweight.\n    \"\"\"\n\n    def __init__(self, chunk_size: int = 1500, overlap: int = 150) -> None:\n        if chunk_size <= 0:\n            raise ValueError(\"chunk_size must be > 0\")\n        if overlap < 0:\n            raise ValueError(\"overlap must be >= 0\")\n        if overlap >= chunk_size:\n            raise ValueError(\"overlap must be smaller than chunk_size\")\n\n        self.chunk_size = chunk_size\n        self.overlap = overlap\n\n    def chunk_text(self, text: str) -> list[TextChunk]:\n        \"\"\"\n        Split text into chunks while preferring paragraph boundaries.\n        \"\"\"\n        normalized = text.strip()\n        if not normalized:\n            return []\n\n        chunks: list[TextChunk] = []\n        start = 0\n        position = 0\n        total = len(normalized)\n\n        while start < total:\n            tentative_end = min(start + self.chunk_size, total)\n            end = tentative_end\n\n            if tentative_end < total:\n                boundary = normalized.rfind(\"\\n\\n\", start + (self.chunk_size // 2), tentative_end)\n                if boundary != -1:\n                    end = boundary + 2\n\n            chunk_text = normalized[start:end].strip()\n            if chunk_text:\n                chunks.append(\n                    TextChunk(\n                        text=chunk_text,\n                        position=position,\n                        start_char=start,\n                        end_char=end,\n                    )\n       
         )\n                position += 1\n\n            if end >= total:\n                break\n            # Advance at least one character past the previous start so a\n            # large overlap value can never move the window backwards into\n            # an infinite loop when a paragraph boundary lands early.\n            start = max(start + 1, end - self.overlap)\n\n        return chunks\n"
  },
  {
    "path": "src/fs_explorer/indexing/metadata.py",
    "content": "\"\"\"\nMetadata extraction helpers for indexed documents.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport copy\nimport json\nimport os\nimport re\nfrom collections import defaultdict\nfrom pathlib import Path\nfrom typing import Any\n\n\n_CURRENCY_RE = re.compile(r\"\\$\\s?\\d[\\d,]*(?:\\.\\d+)?\")\n_DATE_RE = re.compile(\n    r\"\\b(?:\\d{4}-\\d{2}-\\d{2}|\"\n    r\"(?:jan|feb|mar|apr|may|jun|jul|aug|sep|sept|oct|nov|dec)[a-z]*\\s+\\d{1,2},\\s+\\d{4})\\b\",\n    flags=re.IGNORECASE,\n)\n_DOC_TYPE_TOKEN_RE = re.compile(r\"[a-z0-9]+\")\n_DOC_TYPE_STOPWORDS: set[str] = {\n    \"the\",\n    \"and\",\n    \"for\",\n    \"with\",\n    \"from\",\n    \"copy\",\n    \"draft\",\n    \"final\",\n    \"version\",\n    \"v1\",\n    \"v2\",\n    \"v3\",\n    \"new\",\n    \"old\",\n    \"tmp\",\n    \"temp\",\n}\n\n_LANGEXTRACT_PROMPT_DESCRIPTION = (\n    \"Extract key transaction metadata from legal and deal documents. \"\n    \"Use extraction classes: organization, person, money, date, deal_term. 
\"\n    \"Use exact spans from the source text and avoid paraphrasing.\"\n)\n\n_VALID_METADATA_FIELD_NAME_RE = re.compile(r\"^[A-Za-z][A-Za-z0-9_]*$\")\n_VALID_FIELD_TYPES: set[str] = {\"string\", \"integer\", \"number\", \"boolean\"}\n_VALID_RUNTIME_FIELDS: set[str] = {\"enabled\", \"extraction_count\", \"entity_classes\"}\n_FIELD_MODE_ALIASES: dict[str, str] = {\n    \"csv\": \"values\",\n    \"list\": \"values\",\n    \"joined\": \"values\",\n    \"join\": \"values\",\n    \"values\": \"values\",\n    \"count\": \"count\",\n    \"exists\": \"exists\",\n    \"contains\": \"contains\",\n    \"contains_any\": \"contains\",\n}\n\n_DEFAULT_LANGEXTRACT_PROFILE: dict[str, Any] = {\n    \"name\": \"default_langextract\",\n    \"description\": \"Default metadata extraction profile for legal and deal-style documents.\",\n    \"prompt_description\": _LANGEXTRACT_PROMPT_DESCRIPTION,\n    \"fields\": [\n        {\n            \"name\": \"lx_enabled\",\n            \"type\": \"boolean\",\n            \"required\": False,\n            \"description\": \"Whether langextract metadata extraction succeeded.\",\n            \"source\": \"runtime\",\n            \"runtime\": \"enabled\",\n        },\n        {\n            \"name\": \"lx_extraction_count\",\n            \"type\": \"integer\",\n            \"required\": False,\n            \"description\": \"Number of langextract entities extracted from the document.\",\n            \"source\": \"runtime\",\n            \"runtime\": \"extraction_count\",\n        },\n        {\n            \"name\": \"lx_entity_classes\",\n            \"type\": \"string\",\n            \"required\": False,\n            \"description\": \"Comma-separated extraction classes returned by langextract.\",\n            \"source\": \"runtime\",\n            \"runtime\": \"entity_classes\",\n        },\n        {\n            \"name\": \"lx_organizations\",\n            \"type\": \"string\",\n            \"required\": False,\n            \"description\": 
\"Comma-separated organization names extracted by langextract.\",\n            \"source\": \"entities\",\n            \"source_classes\": [\"organization\", \"company\", \"party\"],\n            \"mode\": \"values\",\n        },\n        {\n            \"name\": \"lx_people\",\n            \"type\": \"string\",\n            \"required\": False,\n            \"description\": \"Comma-separated person names extracted by langextract.\",\n            \"source\": \"entities\",\n            \"source_classes\": [\"person\", \"individual\", \"executive\"],\n            \"mode\": \"values\",\n        },\n        {\n            \"name\": \"lx_deal_terms\",\n            \"type\": \"string\",\n            \"required\": False,\n            \"description\": \"Comma-separated deal terms extracted by langextract.\",\n            \"source\": \"entities\",\n            \"source_classes\": [\"deal_term\", \"term\", \"provision\"],\n            \"mode\": \"values\",\n        },\n        {\n            \"name\": \"lx_money_mentions\",\n            \"type\": \"integer\",\n            \"required\": False,\n            \"description\": \"Count of monetary amount entities from langextract.\",\n            \"source\": \"entities\",\n            \"source_classes\": [\"money\", \"amount\", \"currency\"],\n            \"mode\": \"count\",\n        },\n        {\n            \"name\": \"lx_date_mentions\",\n            \"type\": \"integer\",\n            \"required\": False,\n            \"description\": \"Count of date entities from langextract.\",\n            \"source\": \"entities\",\n            \"source_classes\": [\"date\"],\n            \"mode\": \"count\",\n        },\n        {\n            \"name\": \"lx_has_earnout\",\n            \"type\": \"boolean\",\n            \"required\": False,\n            \"description\": \"Whether extracted deal terms indicate an earnout.\",\n            \"source\": \"entities\",\n            \"source_classes\": [\"deal_term\", \"term\", \"provision\"],\n 
           \"mode\": \"contains\",\n            \"contains_any\": [\"earnout\"],\n        },\n        {\n            \"name\": \"lx_has_escrow\",\n            \"type\": \"boolean\",\n            \"required\": False,\n            \"description\": \"Whether extracted deal terms indicate escrow.\",\n            \"source\": \"entities\",\n            \"source_classes\": [\"deal_term\", \"term\", \"provision\"],\n            \"mode\": \"contains\",\n            \"contains_any\": [\"escrow\"],\n        },\n    ],\n}\n\n\n_AUTO_PROFILE_PROMPT_TEMPLATE = (\n    \"You are a metadata schema designer. Analyze the document samples below and generate \"\n    \"a langextract metadata extraction profile tailored to this corpus.\\n\\n\"\n    \"Return a JSON object with these keys:\\n\"\n    '- \"name\": a short descriptive profile name (string)\\n'\n    '- \"description\": one-sentence description of the profile (string)\\n'\n    '- \"prompt_description\": instruction text for the extraction model (string)\\n'\n    '- \"fields\": array of field definitions\\n\\n'\n    \"Each field object must have:\\n\"\n    '- \"name\": valid identifier starting with \"lx_\" (letters, digits, underscores)\\n'\n    '- \"type\": one of \"string\", \"integer\", \"number\", \"boolean\"\\n'\n    '- \"description\": what this field captures\\n'\n    '- \"source\": \"entities\"\\n'\n    '- \"source_classes\": array of entity class names to aggregate (e.g. 
[\"organization\", \"company\"])\\n'\n    '- \"mode\": one of \"values\" (comma-joined text), \"count\" (integer count), \"exists\" (boolean), '\n    '\"contains\" (boolean, requires \"contains_any\")\\n'\n    '- \"contains_any\": (only when mode is \"contains\") array of lowercase terms to match\\n\\n'\n    \"Valid entity source classes include (but are not limited to): organization, company, party, \"\n    \"person, individual, executive, money, amount, currency, date, deal_term, term, provision, \"\n    \"location, product, technology, regulation, clause, obligation.\\n\\n\"\n    \"### Example profile for legal/M&A documents\\n\"\n    \"```json\\n\"\n    '{\"name\": \"legal_ma\", \"description\": \"Metadata extraction for legal and M&A deal documents.\", '\n    '\"prompt_description\": \"Extract key transaction metadata from legal and deal documents.\", '\n    '\"fields\": ['\n    '{\"name\": \"lx_organizations\", \"type\": \"string\", \"description\": \"Organization names.\", '\n    '\"source\": \"entities\", \"source_classes\": [\"organization\", \"company\", \"party\"], \"mode\": \"values\"}, '\n    '{\"name\": \"lx_money_mentions\", \"type\": \"integer\", \"description\": \"Count of monetary amounts.\", '\n    '\"source\": \"entities\", \"source_classes\": [\"money\", \"amount\"], \"mode\": \"count\"}, '\n    '{\"name\": \"lx_has_escrow\", \"type\": \"boolean\", \"description\": \"Whether escrow terms are present.\", '\n    '\"source\": \"entities\", \"source_classes\": [\"deal_term\", \"provision\"], \"mode\": \"contains\", '\n    '\"contains_any\": [\"escrow\"]}'\n    \"]}\\n\"\n    \"```\\n\\n\"\n    \"### Example profile for technical/research documents\\n\"\n    \"```json\\n\"\n    '{\"name\": \"tech_research\", \"description\": \"Metadata extraction for technical and research documents.\", '\n    '\"prompt_description\": \"Extract key entities from technical and research documents.\", '\n    '\"fields\": ['\n    '{\"name\": \"lx_technologies\", 
\"type\": \"string\", \"description\": \"Technology names.\", '\n    '\"source\": \"entities\", \"source_classes\": [\"technology\", \"product\"], \"mode\": \"values\"}, '\n    '{\"name\": \"lx_people\", \"type\": \"string\", \"description\": \"Person names.\", '\n    '\"source\": \"entities\", \"source_classes\": [\"person\", \"individual\"], \"mode\": \"values\"}, '\n    '{\"name\": \"lx_org_count\", \"type\": \"integer\", \"description\": \"Number of organizations mentioned.\", '\n    '\"source\": \"entities\", \"source_classes\": [\"organization\", \"company\"], \"mode\": \"count\"}'\n    \"]}\\n\"\n    \"```\\n\\n\"\n    \"### Document samples from the corpus\\n\\n\"\n    \"SAMPLES_PLACEHOLDER\\n\\n\"\n    \"Generate a profile with 4-8 entity fields (do NOT include runtime fields). \"\n    \"Return ONLY the JSON object, no markdown fencing.\"\n)\n\n\ndef _get_genai_client(api_key: str) -> Any:\n    \"\"\"Instantiate a Google GenAI client. Separated for test patching.\"\"\"\n    from google.genai import Client as _GenAIClient\n\n    return _GenAIClient(api_key=api_key)\n\n\ndef auto_discover_profile(\n    folder: str,\n    *,\n    sample_count: int = 3,\n    model_id: str | None = None,\n) -> dict[str, Any]:\n    \"\"\"Use an LLM to generate a langextract profile tailored to the corpus.\n\n    Falls back to the default hardcoded profile on any failure.\n    \"\"\"\n    from .schema import _iter_supported_files\n\n    files = _iter_supported_files(folder)\n    if not files:\n        return default_langextract_profile()\n\n    # Sample files evenly\n    n = min(sample_count, len(files))\n    step = max(1, len(files) // n)\n    sampled = [files[i * step] for i in range(n)]\n\n    # Parse and truncate\n    from ..fs import parse_file\n\n    snippets: list[str] = []\n    for file_path in sampled:\n        try:\n            text = parse_file(file_path)\n            snippets.append(\n                f\"--- {Path(file_path).name} ---\\n{text[:2000]}\"\n            )\n  
      except Exception:\n            continue\n\n    if not snippets:\n        return default_langextract_profile()\n\n    api_key = os.getenv(\"GOOGLE_API_KEY\")\n    if not api_key:\n        return default_langextract_profile()\n\n    effective_model = model_id or os.getenv(\n        \"FS_EXPLORER_PROFILE_MODEL\", \"gemini-2.0-flash\"\n    )\n\n    try:\n        client = _get_genai_client(api_key=api_key)\n        prompt = _AUTO_PROFILE_PROMPT_TEMPLATE.replace(\n            \"SAMPLES_PLACEHOLDER\", \"\\n\\n\".join(snippets)\n        )\n        response = client.models.generate_content(\n            model=effective_model,\n            contents=prompt,\n        )\n        raw_text = (response.text or \"\").strip()\n        # Strip markdown fencing if present\n        if raw_text.startswith(\"```\"):\n            raw_text = re.sub(r\"^```[a-z]*\\n?\", \"\", raw_text)\n            raw_text = re.sub(r\"\\n?```$\", \"\", raw_text).strip()\n        profile = json.loads(raw_text)\n        # Add runtime fields that are always present\n        runtime_fields = [\n            f for f in _DEFAULT_LANGEXTRACT_PROFILE[\"fields\"] if f.get(\"source\") == \"runtime\"\n        ]\n        existing_names = {\n            str(f.get(\"name\")) for f in profile.get(\"fields\", []) if isinstance(f, dict)\n        }\n        for rf in runtime_fields:\n            if rf[\"name\"] not in existing_names:\n                profile.setdefault(\"fields\", []).insert(0, copy.deepcopy(rf))\n        return normalize_langextract_profile(profile)\n    except Exception:\n        return default_langextract_profile()\n\n\ndef infer_document_type(file_path: str) -> str:\n    \"\"\"Infer a generic document type from filename tokens.\"\"\"\n    stem = Path(file_path).stem.lower()\n    tokens = [token for token in _DOC_TYPE_TOKEN_RE.findall(stem) if token]\n    filtered = [\n        token\n        for token in tokens\n        if not token.isdigit() and len(token) > 2 and token not in _DOC_TYPE_STOPWORDS\n 
   ]\n    if filtered:\n        return filtered[-1]\n    if tokens:\n        return tokens[-1]\n    return \"document\"\n\n\ndef default_langextract_profile() -> dict[str, Any]:\n    \"\"\"Return a mutable copy of the built-in metadata profile.\"\"\"\n    return copy.deepcopy(_DEFAULT_LANGEXTRACT_PROFILE)\n\n\ndef normalize_langextract_profile(profile: dict[str, Any] | None) -> dict[str, Any]:\n    \"\"\"\n    Validate and normalize user-provided langextract profile configuration.\n\n    Expected shape:\n    - prompt_description: str (optional)\n    - max_chars: int (optional)\n    - fields: list[{\n        name: str,\n        type: string|integer|number|boolean,\n        description: str (optional),\n        required: bool (optional),\n        source: runtime|entities (default entities),\n        runtime: enabled|extraction_count|entity_classes (runtime source only),\n        source_class: str (entities source),\n        source_classes: list[str] (entities source),\n        mode: values|count|exists|contains (entities source),\n        contains_any: list[str] (contains mode),\n      }]\n    \"\"\"\n    raw = default_langextract_profile() if profile is None else copy.deepcopy(profile)\n    if not isinstance(raw, dict):\n        raise ValueError(\"Metadata profile must be a JSON object.\")\n\n    prompt = raw.get(\"prompt_description\")\n    if prompt is None:\n        prompt_description = _LANGEXTRACT_PROMPT_DESCRIPTION\n    elif isinstance(prompt, str) and prompt.strip():\n        prompt_description = prompt.strip()\n    else:\n        raise ValueError(\n            \"Metadata profile field 'prompt_description' must be a non-empty string.\"\n        )\n\n    max_chars: int | None = None\n    if \"max_chars\" in raw:\n        max_chars = _safe_positive_int(\n            raw.get(\"max_chars\"),\n            minimum=500,\n            field_name=\"max_chars\",\n        )\n\n    raw_fields = raw.get(\"fields\")\n    if not isinstance(raw_fields, list) or not 
raw_fields:\n        raise ValueError(\"Metadata profile must include a non-empty 'fields' array.\")\n\n    normalized_fields: list[dict[str, Any]] = []\n    seen_names: set[str] = set()\n    for idx, raw_field in enumerate(raw_fields):\n        if not isinstance(raw_field, dict):\n            raise ValueError(f\"Metadata field at index {idx} must be an object.\")\n\n        name_obj = raw_field.get(\"name\")\n        if not isinstance(name_obj, str) or not name_obj.strip():\n            raise ValueError(\n                f\"Metadata field at index {idx} is missing a valid 'name'.\"\n            )\n        name = name_obj.strip()\n        if not _VALID_METADATA_FIELD_NAME_RE.match(name):\n            raise ValueError(\n                f\"Invalid metadata field name '{name}'. \"\n                \"Use letters, numbers, and underscores.\"\n            )\n        if name in seen_names:\n            raise ValueError(f\"Duplicate metadata field name '{name}'.\")\n        seen_names.add(name)\n\n        field_type = str(raw_field.get(\"type\", \"string\")).strip().lower()\n        if field_type not in _VALID_FIELD_TYPES:\n            allowed_types = \", \".join(sorted(_VALID_FIELD_TYPES))\n            raise ValueError(\n                f\"Metadata field '{name}' has invalid type '{field_type}'. \"\n                f\"Allowed types: {allowed_types}.\"\n            )\n\n        description_obj = raw_field.get(\"description\")\n        description = (\n            description_obj.strip()\n            if isinstance(description_obj, str) and description_obj.strip()\n            else f\"Metadata field '{name}'.\"\n        )\n        required = bool(raw_field.get(\"required\", False))\n\n        source = str(raw_field.get(\"source\", \"entities\")).strip().lower()\n        if source not in {\"runtime\", \"entities\"}:\n            raise ValueError(\n                f\"Metadata field '{name}' has invalid source '{source}'. 
\"\n                \"Use 'runtime' or 'entities'.\"\n            )\n\n        normalized: dict[str, Any] = {\n            \"name\": name,\n            \"type\": field_type,\n            \"required\": required,\n            \"description\": description,\n            \"source\": source,\n        }\n\n        if source == \"runtime\":\n            runtime = str(raw_field.get(\"runtime\", \"\")).strip().lower()\n            if runtime not in _VALID_RUNTIME_FIELDS:\n                allowed_runtime = \", \".join(sorted(_VALID_RUNTIME_FIELDS))\n                raise ValueError(\n                    f\"Metadata field '{name}' has invalid runtime source '{runtime}'. \"\n                    f\"Allowed runtime values: {allowed_runtime}.\"\n                )\n            normalized[\"runtime\"] = runtime\n            normalized[\"mode\"] = \"runtime\"\n            normalized[\"source_classes\"] = []\n            normalized[\"contains_any\"] = []\n            normalized_fields.append(normalized)\n            continue\n\n        source_classes = _normalize_source_classes(raw_field)\n        if not source_classes:\n            raise ValueError(\n                f\"Metadata field '{name}' requires 'source_class' or \"\n                \"'source_classes' for entity extraction.\"\n            )\n\n        requested_mode = raw_field.get(\"mode\")\n        mode = _normalize_field_mode(requested_mode, field_type=field_type)\n        contains_any = _normalize_contains_any(\n            raw_field.get(\"contains_any\"),\n            mode=mode,\n            field_name=name,\n        )\n\n        normalized[\"source_classes\"] = source_classes\n        normalized[\"mode\"] = mode\n        normalized[\"contains_any\"] = contains_any\n        normalized_fields.append(normalized)\n\n    normalized_profile: dict[str, Any] = {\n        \"name\": str(raw.get(\"name\", \"langextract_profile\")),\n        \"description\": str(\n            raw.get(\"description\", \"User-defined langextract 
metadata profile.\")\n        ),\n        \"prompt_description\": prompt_description,\n        \"fields\": normalized_fields,\n    }\n    if max_chars is not None:\n        normalized_profile[\"max_chars\"] = max_chars\n    return normalized_profile\n\n\ndef langextract_schema_fields(\n    profile: dict[str, Any] | None = None,\n) -> list[dict[str, Any]]:\n    \"\"\"Return schema field definitions for langextract metadata.\"\"\"\n    normalized = normalize_langextract_profile(profile)\n    fields: list[dict[str, Any]] = []\n    for field in normalized[\"fields\"]:\n        fields.append(\n            {\n                \"name\": field[\"name\"],\n                \"type\": field[\"type\"],\n                \"required\": bool(field.get(\"required\", False)),\n                \"description\": str(field.get(\"description\", \"\")),\n            }\n        )\n    return fields\n\n\ndef langextract_field_names(profile: dict[str, Any] | None = None) -> set[str]:\n    \"\"\"Return field names used by langextract metadata extraction.\"\"\"\n    return {field[\"name\"] for field in langextract_schema_fields(profile)}\n\n\ndef ensure_langextract_schema_fields(\n    schema_def: dict[str, Any],\n    profile: dict[str, Any] | None = None,\n) -> tuple[dict[str, Any], bool]:\n    \"\"\"Ensure schema contains langextract field definitions.\"\"\"\n    normalized_profile = normalize_langextract_profile(\n        profile if profile is not None else _schema_profile_if_present(schema_def)\n    )\n    required_fields = langextract_schema_fields(normalized_profile)\n\n    fields_obj = schema_def.get(\"fields\")\n    fields: list[dict[str, Any]]\n    if isinstance(fields_obj, list):\n        fields = [dict(field) for field in fields_obj if isinstance(field, dict)]\n    else:\n        fields = []\n\n    existing_names = {\n        str(field.get(\"name\")) for field in fields if isinstance(field.get(\"name\"), str)\n    }\n    updated = list(fields)\n    changed = False\n    for field in 
required_fields:\n        if field[\"name\"] in existing_names:\n            continue\n        updated.append(dict(field))\n        changed = True\n\n    merged = dict(schema_def)\n    if changed:\n        merged[\"fields\"] = updated\n\n    existing_profile = _schema_profile_if_present(schema_def)\n    if profile is not None or existing_profile is not None:\n        if existing_profile != normalized_profile:\n            merged[\"metadata_profile\"] = normalized_profile\n            changed = True\n        elif \"metadata_profile\" in schema_def:\n            merged[\"metadata_profile\"] = existing_profile\n\n    return merged, changed\n\n\ndef extract_metadata(\n    *,\n    file_path: str,\n    root_path: str,\n    content: str,\n    schema_def: dict[str, Any] | None = None,\n    with_langextract: bool = False,\n    langextract_model_id: str | None = None,\n    langextract_profile: dict[str, Any] | None = None,\n) -> dict[str, Any]:\n    \"\"\"\n    Build metadata used for filtering and schema-aware indexing.\n\n    If a schema is provided with a `fields` list, only those keys are emitted.\n    \"\"\"\n    absolute_path = str(Path(file_path).resolve())\n    relative_path = os.path.relpath(absolute_path, str(Path(root_path).resolve()))\n    extension = Path(file_path).suffix.lower()\n\n    stat = os.stat(file_path)\n    metadata: dict[str, Any] = {\n        \"filename\": Path(file_path).name,\n        \"relative_path\": relative_path,\n        \"extension\": extension,\n        \"document_type\": infer_document_type(file_path),\n        \"file_size_bytes\": int(stat.st_size),\n        \"file_mtime\": float(stat.st_mtime),\n        \"mentions_currency\": bool(_CURRENCY_RE.search(content)),\n        \"mentions_dates\": bool(_DATE_RE.search(content)),\n    }\n    if with_langextract:\n        resolved_profile = _resolve_langextract_profile(\n            schema_def=schema_def,\n            profile_override=langextract_profile,\n        )\n        metadata.update(\n    
        _extract_langextract_metadata(\n                content=content,\n                model_id=langextract_model_id,\n                profile=resolved_profile,\n            )\n        )\n\n    if not schema_def:\n        return metadata\n\n    fields = schema_def.get(\"fields\")\n    if not isinstance(fields, list):\n        return metadata\n\n    allowed: set[str] = set()\n    for field in fields:\n        if isinstance(field, dict):\n            name = field.get(\"name\")\n            if isinstance(name, str):\n                allowed.add(name)\n\n    if not allowed:\n        return metadata\n\n    return {k: v for k, v in metadata.items() if k in allowed}\n\n\ndef _extract_langextract_metadata(\n    *,\n    content: str,\n    model_id: str | None = None,\n    profile: dict[str, Any] | None = None,\n) -> dict[str, Any]:\n    normalized_profile = normalize_langextract_profile(profile)\n    defaults = _profile_defaults(normalized_profile)\n\n    api_key = (\n        os.getenv(\"LANGEXTRACT_API_KEY\")\n        or os.getenv(\"GEMINI_API_KEY\")\n        or os.getenv(\"GOOGLE_API_KEY\")\n    )\n    if not api_key:\n        return defaults\n\n    try:\n        import langextract as lx  # type: ignore[import-not-found]\n    except Exception:\n        return defaults\n\n    profile_max_chars_obj = normalized_profile.get(\"max_chars\")\n    profile_max_chars = (\n        _safe_positive_int(\n            profile_max_chars_obj,\n            minimum=500,\n            field_name=\"max_chars\",\n        )\n        if profile_max_chars_obj is not None\n        else None\n    )\n    max_chars = profile_max_chars or _safe_int_env(\n        \"FS_EXPLORER_LANGEXTRACT_MAX_CHARS\",\n        default=6000,\n        minimum=500,\n    )\n    snippet = content[:max_chars]\n    if not snippet.strip():\n        return defaults\n\n    effective_model_id = model_id or os.getenv(\n        \"FS_EXPLORER_LANGEXTRACT_MODEL\",\n        \"gemini-3-flash-preview\",\n    )\n    try:\n        
result = lx.extract(\n            text_or_documents=snippet,\n            prompt_description=str(normalized_profile[\"prompt_description\"]),\n            examples=_langextract_examples(lx),\n            model_id=effective_model_id,\n            api_key=api_key,\n            max_char_buffer=min(1200, max_chars),\n            show_progress=False,\n            prompt_validation_level=lx.prompt_validation.PromptValidationLevel.OFF,\n        )\n    except Exception:\n        return defaults\n\n    extractions = list(result.extractions or [])\n    return _aggregate_profile_metadata(\n        normalized_profile=normalized_profile,\n        extractions=extractions,\n        enabled=True,\n    )\n\n\ndef _schema_profile_if_present(schema_def: dict[str, Any] | None) -> dict[str, Any] | None:\n    if not schema_def:\n        return None\n    metadata_profile = schema_def.get(\"metadata_profile\")\n    if isinstance(metadata_profile, dict):\n        return metadata_profile\n    return None\n\n\ndef _resolve_langextract_profile(\n    *,\n    schema_def: dict[str, Any] | None,\n    profile_override: dict[str, Any] | None,\n) -> dict[str, Any] | None:\n    if profile_override is not None:\n        return profile_override\n    return _schema_profile_if_present(schema_def)\n\n\ndef _normalize_source_classes(raw_field: dict[str, Any]) -> list[str]:\n    classes: list[str] = []\n    single = raw_field.get(\"source_class\")\n    if isinstance(single, str) and single.strip():\n        classes.append(single.strip().lower())\n\n    multi = raw_field.get(\"source_classes\")\n    if isinstance(multi, list):\n        for item in multi:\n            if isinstance(item, str) and item.strip():\n                classes.append(item.strip().lower())\n\n    seen: set[str] = set()\n    deduped: list[str] = []\n    for class_name in classes:\n        if class_name in seen:\n            continue\n        seen.add(class_name)\n        deduped.append(class_name)\n    return deduped\n\n\ndef 
_normalize_field_mode(mode_obj: Any, *, field_type: str) -> str:\n    if isinstance(mode_obj, str) and mode_obj.strip():\n        requested = mode_obj.strip().lower()\n        normalized = _FIELD_MODE_ALIASES.get(requested)\n        if normalized is None:\n            allowed = \", \".join(sorted(set(_FIELD_MODE_ALIASES.values())))\n            raise ValueError(\n                f\"Unsupported metadata field mode '{requested}'. \"\n                f\"Allowed modes: {allowed}.\"\n            )\n        return normalized\n\n    if field_type == \"boolean\":\n        return \"exists\"\n    if field_type in {\"integer\", \"number\"}:\n        return \"count\"\n    return \"values\"\n\n\ndef _normalize_contains_any(\n    contains_obj: Any,\n    *,\n    mode: str,\n    field_name: str,\n) -> list[str]:\n    if mode != \"contains\":\n        return []\n    if not isinstance(contains_obj, list) or not contains_obj:\n        raise ValueError(\n            f\"Metadata field '{field_name}' with mode 'contains' \"\n            \"requires 'contains_any' list.\"\n        )\n    terms: list[str] = []\n    for term in contains_obj:\n        if isinstance(term, str) and term.strip():\n            terms.append(term.strip().lower())\n    if not terms:\n        raise ValueError(\n            f\"Metadata field '{field_name}' with mode 'contains' \"\n            \"has no valid 'contains_any' terms.\"\n        )\n    return terms\n\n\ndef _profile_defaults(profile: dict[str, Any]) -> dict[str, Any]:\n    defaults: dict[str, Any] = {}\n    for field in profile[\"fields\"]:\n        defaults[field[\"name\"]] = _default_field_value(field)\n    return defaults\n\n\ndef _default_field_value(field: dict[str, Any]) -> Any:\n    source = str(field.get(\"source\", \"entities\"))\n    runtime = str(field.get(\"runtime\", \"\"))\n    if source == \"runtime\":\n        if runtime == \"enabled\":\n            return False\n        if runtime == \"extraction_count\":\n            return 0\n        if 
runtime == \"entity_classes\":\n            return \"\"\n\n    field_type = str(field.get(\"type\", \"string\"))\n    if field_type == \"boolean\":\n        return False\n    if field_type == \"integer\":\n        return 0\n    if field_type == \"number\":\n        return 0.0\n    return \"\"\n\n\ndef _aggregate_profile_metadata(\n    *,\n    normalized_profile: dict[str, Any],\n    extractions: list[Any],\n    enabled: bool,\n) -> dict[str, Any]:\n    classes: set[str] = set()\n    by_class: dict[str, list[str]] = defaultdict(list)\n\n    for extraction in extractions:\n        extraction_class = str(getattr(extraction, \"extraction_class\", \"\")).strip().lower()\n        extraction_text = str(getattr(extraction, \"extraction_text\", \"\")).strip()\n        if not extraction_class:\n            continue\n        classes.add(extraction_class)\n        if extraction_text:\n            by_class[extraction_class].append(extraction_text)\n\n    metadata: dict[str, Any] = {}\n    for field in normalized_profile[\"fields\"]:\n        name = str(field[\"name\"])\n        source = str(field[\"source\"])\n        if source == \"runtime\":\n            value = _runtime_field_value(\n                field=field,\n                enabled=enabled,\n                extraction_count=len(extractions),\n                classes=classes,\n            )\n            metadata[name] = _coerce_field_value(\n                value=value,\n                field_type=str(field[\"type\"]),\n            )\n            continue\n\n        matched_values: list[str] = []\n        for extraction_class in field[\"source_classes\"]:\n            matched_values.extend(by_class.get(extraction_class, []))\n        value = _entity_field_value(field=field, matched_values=matched_values)\n        metadata[name] = _coerce_field_value(value=value, field_type=str(field[\"type\"]))\n\n    defaults = _profile_defaults(normalized_profile)\n    for key, default_value in defaults.items():\n        
metadata.setdefault(key, default_value)\n    return metadata\n\n\ndef _runtime_field_value(\n    *,\n    field: dict[str, Any],\n    enabled: bool,\n    extraction_count: int,\n    classes: set[str],\n) -> Any:\n    runtime = str(field.get(\"runtime\", \"\"))\n    if runtime == \"enabled\":\n        return enabled\n    if runtime == \"extraction_count\":\n        return extraction_count\n    if runtime == \"entity_classes\":\n        return \", \".join(sorted(classes))\n    return _default_field_value(field)\n\n\ndef _entity_field_value(*, field: dict[str, Any], matched_values: list[str]) -> Any:\n    mode = str(field.get(\"mode\", \"values\"))\n    if mode == \"count\":\n        return len(matched_values)\n    if mode == \"exists\":\n        return bool(matched_values)\n    if mode == \"contains\":\n        terms = [str(term).lower() for term in field.get(\"contains_any\", [])]\n        lowered_values = [value.lower() for value in matched_values]\n        return any(term in value for term in terms for value in lowered_values)\n    deduped = _dedupe_preserve_order(matched_values)\n    return \", \".join(deduped)\n\n\ndef _coerce_field_value(*, value: Any, field_type: str) -> Any:\n    if field_type == \"boolean\":\n        return bool(value)\n    if field_type == \"integer\":\n        if isinstance(value, bool):\n            return int(value)\n        try:\n            return int(value)\n        except (TypeError, ValueError):\n            return 0\n    if field_type == \"number\":\n        if isinstance(value, bool):\n            return float(int(value))\n        try:\n            return float(value)\n        except (TypeError, ValueError):\n            return 0.0\n    if value is None:\n        return \"\"\n    return str(value)\n\n\ndef _langextract_examples(lx: Any) -> list[Any]:\n    return [\n        lx.data.ExampleData(\n            text=(\n                \"TechCorp Industries will pay $45,000,000 in cash consideration, \"\n                \"with a 
$1,500,000 escrow reserve and a $5,000,000 earnout to \"\n                \"acquire StartupXYZ LLC. CTO Dr. Sarah Chen signed on January 15, 2025.\"\n            ),\n            extractions=[\n                lx.data.Extraction(\n                    extraction_class=\"organization\",\n                    extraction_text=\"TechCorp Industries\",\n                ),\n                lx.data.Extraction(\n                    extraction_class=\"organization\",\n                    extraction_text=\"StartupXYZ LLC\",\n                ),\n                lx.data.Extraction(\n                    extraction_class=\"money\",\n                    extraction_text=\"$45,000,000\",\n                ),\n                lx.data.Extraction(\n                    extraction_class=\"money\",\n                    extraction_text=\"$1,500,000\",\n                ),\n                lx.data.Extraction(\n                    extraction_class=\"money\",\n                    extraction_text=\"$5,000,000\",\n                ),\n                lx.data.Extraction(\n                    extraction_class=\"deal_term\",\n                    extraction_text=\"cash consideration\",\n                ),\n                lx.data.Extraction(\n                    extraction_class=\"deal_term\",\n                    extraction_text=\"escrow reserve\",\n                ),\n                lx.data.Extraction(\n                    extraction_class=\"deal_term\",\n                    extraction_text=\"earnout\",\n                ),\n                lx.data.Extraction(\n                    extraction_class=\"person\",\n                    extraction_text=\"Dr. 
Sarah Chen\",\n                ),\n                lx.data.Extraction(\n                    extraction_class=\"date\",\n                    extraction_text=\"January 15, 2025\",\n                ),\n            ],\n        )\n    ]\n\n\ndef _dedupe_preserve_order(values: list[str], *, max_items: int = 16) -> list[str]:\n    seen: set[str] = set()\n    deduped: list[str] = []\n    for value in values:\n        key = value.strip()\n        if not key:\n            continue\n        lower = key.lower()\n        if lower in seen:\n            continue\n        seen.add(lower)\n        deduped.append(key)\n        if len(deduped) >= max_items:\n            break\n    return deduped\n\n\ndef _safe_positive_int(value: Any, *, minimum: int, field_name: str) -> int:\n    try:\n        integer = int(value)\n    except (TypeError, ValueError) as exc:\n        raise ValueError(\n            f\"Metadata profile field '{field_name}' must be an integer.\"\n        ) from exc\n    if integer < minimum:\n        raise ValueError(\n            f\"Metadata profile field '{field_name}' must be >= {minimum}.\"\n        )\n    return integer\n\n\ndef _safe_int_env(name: str, *, default: int, minimum: int) -> int:\n    raw = os.getenv(name)\n    if raw is None:\n        return default\n    try:\n        value = int(raw)\n    except ValueError:\n        return default\n    return value if value >= minimum else minimum\n"
  },
  {
    "path": "src/fs_explorer/indexing/pipeline.py",
    "content": "\"\"\"\nIndexing pipeline orchestration.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport hashlib\nimport json\nimport os\nfrom concurrent.futures import ThreadPoolExecutor\nfrom dataclasses import dataclass\nfrom pathlib import Path\nfrom typing import Any\n\nfrom .chunker import SmartChunker\nfrom .metadata import (\n    ensure_langextract_schema_fields,\n    extract_metadata,\n    langextract_field_names,\n)\nfrom .schema import SchemaDiscovery\nfrom ..embeddings import EmbeddingProvider\nfrom ..fs import SUPPORTED_EXTENSIONS, parse_file\nfrom ..storage import ChunkRecord, DocumentRecord, DuckDBStorage, StorageBackend\n\n_PARSE_ERROR_PREFIXES: tuple[str, ...] = (\n    \"Error parsing \",\n    \"Unsupported file extension\",\n    \"No such file:\",\n)\n\n\n@dataclass(frozen=True)\nclass IndexingResult:\n    \"\"\"Summary output for an indexing run.\"\"\"\n\n    corpus_id: str\n    indexed_files: int\n    skipped_files: int\n    deleted_files: int\n    chunks_written: int\n    active_documents: int\n    schema_used: str | None\n    embeddings_written: int = 0\n\n\nclass IndexingPipeline:\n    \"\"\"Build and update corpus indexes from filesystem documents.\"\"\"\n\n    def __init__(\n        self,\n        storage: StorageBackend,\n        chunker: SmartChunker | None = None,\n        embedding_provider: EmbeddingProvider | None = None,\n        max_workers: int = 4,\n    ) -> None:\n        self.storage = storage\n        self.chunker = chunker or SmartChunker()\n        self.embedding_provider = embedding_provider\n        self._max_workers = max_workers\n\n    def index_folder(\n        self,\n        folder: str,\n        *,\n        discover_schema: bool = False,\n        schema_name: str | None = None,\n        with_metadata: bool = False,\n        metadata_profile: dict[str, Any] | None = None,\n    ) -> IndexingResult:\n        root = str(Path(folder).resolve())\n        if not os.path.exists(root) or not os.path.isdir(root):\n       
     raise ValueError(f\"No such directory: {root}\")\n\n        effective_with_metadata = with_metadata or metadata_profile is not None\n        corpus_id = self.storage.get_or_create_corpus(root)\n        schema_def, selected_schema_name = self._resolve_schema(\n            corpus_id=corpus_id,\n            root=root,\n            discover_schema=discover_schema,\n            schema_name=schema_name,\n            with_metadata=effective_with_metadata,\n            metadata_profile=metadata_profile,\n        )\n        effective_profile = metadata_profile or self._schema_metadata_profile(\n            schema_def\n        )\n\n        # Pass 1: Parse all documents\n        parsed_docs: list[tuple[str, str, str]] = []  # (file_path, relative_path, content)\n        skipped_files = 0\n        active_paths: set[str] = set()\n\n        for file_path in self._iter_supported_files(root):\n            relative_path = os.path.relpath(file_path, root)\n            active_paths.add(relative_path)\n\n            content = parse_file(file_path)\n            if self._is_parse_error(content):\n                skipped_files += 1\n                continue\n\n            parsed_docs.append((file_path, relative_path, content))\n\n        # Parallel metadata extraction across documents\n        metadata_map = self._extract_metadata_batch(\n            parsed_docs=parsed_docs,\n            root_path=root,\n            schema_def=schema_def,\n            with_langextract=effective_with_metadata,\n            langextract_profile=effective_profile,\n        )\n\n        # Pass 2: Chunk + upsert (sequential, DB writes)\n        indexed_files = 0\n        chunks_written = 0\n        all_chunk_records: list[ChunkRecord] = []\n\n        for file_path, relative_path, content in parsed_docs:\n            chunks = self.chunker.chunk_text(content)\n            metadata = metadata_map[relative_path]\n            metadata_json = json.dumps(metadata, sort_keys=True)\n\n            stat = 
os.stat(file_path)\n            doc_id = DuckDBStorage.make_document_id(corpus_id, relative_path)\n            doc_record = DocumentRecord(\n                id=doc_id,\n                corpus_id=corpus_id,\n                relative_path=relative_path,\n                absolute_path=str(Path(file_path).resolve()),\n                content=content,\n                metadata_json=metadata_json,\n                file_mtime=float(stat.st_mtime),\n                file_size=int(stat.st_size),\n                content_sha256=self._sha256(content),\n            )\n\n            chunk_records: list[ChunkRecord] = []\n            for chunk in chunks:\n                chunk_records.append(\n                    ChunkRecord(\n                        id=DuckDBStorage.make_chunk_id(\n                            doc_id,\n                            chunk.position,\n                            chunk.start_char,\n                            chunk.end_char,\n                        ),\n                        doc_id=doc_id,\n                        text=chunk.text,\n                        position=chunk.position,\n                        start_char=chunk.start_char,\n                        end_char=chunk.end_char,\n                    )\n                )\n\n            self.storage.upsert_document(doc_record, chunk_records)\n            all_chunk_records.extend(chunk_records)\n            indexed_files += 1\n            chunks_written += len(chunk_records)\n\n        deleted_files = self.storage.mark_deleted_missing_documents(\n            corpus_id=corpus_id,\n            active_relative_paths=active_paths,\n        )\n        active_documents = len(\n            self.storage.list_documents(corpus_id=corpus_id, include_deleted=False)\n        )\n\n        embeddings_written = self._generate_and_store_embeddings(\n            corpus_id=corpus_id,\n            all_chunk_records=all_chunk_records,\n        )\n\n        return IndexingResult(\n            corpus_id=corpus_id,\n        
    indexed_files=indexed_files,\n            skipped_files=skipped_files,\n            deleted_files=deleted_files,\n            chunks_written=chunks_written,\n            active_documents=active_documents,\n            schema_used=selected_schema_name,\n            embeddings_written=embeddings_written,\n        )\n\n    def _extract_metadata_batch(\n        self,\n        *,\n        parsed_docs: list[tuple[str, str, str]],\n        root_path: str,\n        schema_def: dict[str, Any] | None,\n        with_langextract: bool,\n        langextract_profile: dict[str, Any] | None,\n    ) -> dict[str, dict[str, Any]]:\n        \"\"\"Extract metadata for all documents in parallel using a thread pool.\"\"\"\n\n        def _extract_one(item: tuple[str, str, str]) -> tuple[str, dict[str, Any]]:\n            file_path, relative_path, content = item\n            metadata = extract_metadata(\n                file_path=file_path,\n                root_path=root_path,\n                content=content,\n                schema_def=schema_def,\n                with_langextract=with_langextract,\n                langextract_profile=langextract_profile,\n            )\n            return relative_path, metadata\n\n        result: dict[str, dict[str, Any]] = {}\n        if not parsed_docs:\n            return result\n\n        with ThreadPoolExecutor(max_workers=self._max_workers) as executor:\n            for relative_path, metadata in executor.map(_extract_one, parsed_docs):\n                result[relative_path] = metadata\n\n        return result\n\n    def _resolve_schema(\n        self,\n        *,\n        corpus_id: str,\n        root: str,\n        discover_schema: bool,\n        schema_name: str | None,\n        with_metadata: bool,\n        metadata_profile: dict[str, Any] | None,\n    ) -> tuple[dict[str, Any] | None, str | None]:\n        if discover_schema:\n            schema_def = SchemaDiscovery().discover_from_folder(\n                root,\n                
with_langextract=with_metadata,\n                metadata_profile=metadata_profile,\n            )\n            discovered_name = str(schema_def.get(\"name\", f\"auto_{Path(root).name}\"))\n            self.storage.save_schema(\n                corpus_id=corpus_id,\n                name=discovered_name,\n                schema_def=schema_def,\n                is_active=True,\n            )\n            return schema_def, discovered_name\n\n        if schema_name:\n            schema = self.storage.get_schema_by_name(\n                corpus_id=corpus_id, name=schema_name\n            )\n            if schema is None:\n                raise ValueError(f\"Schema '{schema_name}' not found for corpus {root}\")\n            if with_metadata:\n                return self._augment_schema_for_langextract(\n                    corpus_id=corpus_id,\n                    schema_name=schema.name,\n                    schema_def=schema.schema_def,\n                    metadata_profile=metadata_profile,\n                )\n            return schema.schema_def, schema.name\n\n        active = self.storage.get_active_schema(corpus_id=corpus_id)\n        if active is None:\n            if with_metadata:\n                schema_def = SchemaDiscovery().discover_from_folder(\n                    root,\n                    with_langextract=True,\n                    metadata_profile=metadata_profile,\n                )\n                discovered_name = str(schema_def.get(\"name\", f\"auto_{Path(root).name}\"))\n                self.storage.save_schema(\n                    corpus_id=corpus_id,\n                    name=discovered_name,\n                    schema_def=schema_def,\n                    is_active=True,\n                )\n                return schema_def, discovered_name\n            return None, None\n        if with_metadata:\n            return self._augment_schema_for_langextract(\n                corpus_id=corpus_id,\n                schema_name=active.name,\n        
        schema_def=active.schema_def,\n                metadata_profile=metadata_profile,\n            )\n        return active.schema_def, active.name\n\n    def _augment_schema_for_langextract(\n        self,\n        *,\n        corpus_id: str,\n        schema_name: str,\n        schema_def: dict[str, Any],\n        metadata_profile: dict[str, Any] | None,\n    ) -> tuple[dict[str, Any], str]:\n        effective_profile = metadata_profile or self._schema_metadata_profile(\n            schema_def\n        )\n        existing_field_names = self._schema_field_names(schema_def)\n        required = langextract_field_names(effective_profile)\n        if required.issubset(existing_field_names):\n            if metadata_profile is None and (\n                effective_profile is None\n                or self._schema_metadata_profile(schema_def) is not None\n            ):\n                return schema_def, schema_name\n\n            augmented_with_profile, changed = ensure_langextract_schema_fields(\n                schema_def,\n                effective_profile,\n            )\n            if not changed:\n                return schema_def, schema_name\n            self.storage.save_schema(\n                corpus_id=corpus_id,\n                name=schema_name,\n                schema_def=augmented_with_profile,\n                is_active=True,\n            )\n            return augmented_with_profile, schema_name\n\n        augmented_schema, _ = ensure_langextract_schema_fields(\n            schema_def,\n            effective_profile,\n        )\n        self.storage.save_schema(\n            corpus_id=corpus_id,\n            name=schema_name,\n            schema_def=augmented_schema,\n            is_active=True,\n        )\n        return augmented_schema, schema_name\n\n    @staticmethod\n    def _schema_metadata_profile(\n        schema_def: dict[str, Any] | None,\n    ) -> dict[str, Any] | None:\n        if not schema_def:\n            return None\n        
profile = schema_def.get(\"metadata_profile\")\n        if isinstance(profile, dict):\n            return profile\n        return None\n\n    @staticmethod\n    def _schema_field_names(schema_def: dict[str, Any]) -> set[str]:\n        fields = schema_def.get(\"fields\")\n        if not isinstance(fields, list):\n            return set()\n        names: set[str] = set()\n        for field in fields:\n            if isinstance(field, dict):\n                name = field.get(\"name\")\n                if isinstance(name, str):\n                    names.add(name)\n        return names\n\n    def _generate_and_store_embeddings(\n        self,\n        *,\n        corpus_id: str,\n        all_chunk_records: list[ChunkRecord],\n    ) -> int:\n        \"\"\"Embed chunk texts and store in the database. Returns count written.\"\"\"\n        if self.embedding_provider is None or not all_chunk_records:\n            return 0\n\n        texts = [cr.text for cr in all_chunk_records]\n        embeddings = self.embedding_provider.embed_texts(texts)\n\n        pairs: list[tuple[str, list[float]]] = [\n            (cr.id, emb) for cr, emb in zip(all_chunk_records, embeddings)\n        ]\n        written = self.storage.store_chunk_embeddings(\n            corpus_id=corpus_id,\n            chunk_embeddings=pairs,\n        )\n\n        if isinstance(self.storage, DuckDBStorage):\n            self.storage.create_hnsw_index(corpus_id=corpus_id)\n\n        return written\n\n    @staticmethod\n    def _iter_supported_files(root: str) -> list[str]:\n        files: list[str] = []\n        for current_root, _, filenames in os.walk(root):\n            for filename in filenames:\n                ext = Path(filename).suffix.lower()\n                if ext in SUPPORTED_EXTENSIONS:\n                    files.append(str(Path(current_root) / filename))\n        files.sort()\n        return files\n\n    @staticmethod\n    def _sha256(content: str) -> str:\n        return 
hashlib.sha256(content.encode(\"utf-8\")).hexdigest()\n\n    @staticmethod\n    def _is_parse_error(content: str) -> bool:\n        return content.startswith(_PARSE_ERROR_PREFIXES)\n"
  },
  {
    "path": "src/fs_explorer/indexing/schema.py",
    "content": "\"\"\"\nSchema discovery utilities.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport os\nfrom pathlib import Path\nfrom typing import Any\n\nfrom .metadata import (\n    auto_discover_profile,\n    infer_document_type,\n    langextract_schema_fields,\n    normalize_langextract_profile,\n)\nfrom ..fs import SUPPORTED_EXTENSIONS\n\n\ndef _iter_supported_files(folder: str) -> list[str]:\n    root = Path(folder).resolve()\n    files: list[str] = []\n    for current_root, _, filenames in os.walk(root):\n        for filename in filenames:\n            ext = Path(filename).suffix.lower()\n            if ext in SUPPORTED_EXTENSIONS:\n                files.append(str(Path(current_root) / filename))\n    files.sort()\n    return files\n\n\nclass SchemaDiscovery:\n    \"\"\"Auto-discover a lightweight metadata schema from a corpus.\"\"\"\n\n    def discover_from_folder(\n        self,\n        folder: str,\n        *,\n        with_langextract: bool = False,\n        metadata_profile: dict[str, Any] | None = None,\n    ) -> dict[str, Any]:\n        files = _iter_supported_files(folder)\n        document_types = sorted({infer_document_type(path) for path in files})\n        corpus_name = Path(folder).resolve().name or \"corpus\"\n\n        fields: list[dict[str, Any]] = [\n            {\n                \"name\": \"filename\",\n                \"type\": \"string\",\n                \"required\": True,\n                \"description\": \"Document filename.\",\n            },\n            {\n                \"name\": \"relative_path\",\n                \"type\": \"string\",\n                \"required\": True,\n                \"description\": \"Path relative to corpus root.\",\n            },\n            {\n                \"name\": \"extension\",\n                \"type\": \"string\",\n                \"required\": True,\n                \"description\": \"File extension.\",\n            },\n            {\n                \"name\": \"document_type\",\n  
              \"type\": \"string\",\n                \"required\": True,\n                \"description\": \"Inferred document category.\",\n                \"enum\": document_types or [\"other\"],\n            },\n            {\n                \"name\": \"file_size_bytes\",\n                \"type\": \"integer\",\n                \"required\": True,\n                \"description\": \"File size in bytes.\",\n            },\n            {\n                \"name\": \"file_mtime\",\n                \"type\": \"number\",\n                \"required\": True,\n                \"description\": \"File modification timestamp (epoch seconds).\",\n            },\n            {\n                \"name\": \"mentions_currency\",\n                \"type\": \"boolean\",\n                \"required\": True,\n                \"description\": \"Whether text appears to contain currency amounts.\",\n            },\n            {\n                \"name\": \"mentions_dates\",\n                \"type\": \"boolean\",\n                \"required\": True,\n                \"description\": \"Whether text appears to contain date patterns.\",\n            },\n        ]\n        schema: dict[str, Any] = {\n            \"name\": f\"auto_{corpus_name}\",\n            \"description\": \"Auto-discovered schema for document-level metadata filtering.\",\n            \"fields\": fields,\n        }\n        if with_langextract:\n            if metadata_profile is None:\n                effective_profile = auto_discover_profile(folder)\n            else:\n                effective_profile = normalize_langextract_profile(metadata_profile)\n            fields.extend(langextract_schema_fields(effective_profile))\n            schema[\"metadata_profile\"] = effective_profile\n        return schema\n"
  },
  {
    "path": "src/fs_explorer/main.py",
    "content": "\"\"\"\nCLI entry point for the FsExplorer agent.\n\nProvides a command-line interface for running filesystem exploration tasks\nwith rich, detailed output showing each step of the workflow.\n\"\"\"\n\nimport json\nimport asyncio\nimport os\nfrom datetime import datetime\nfrom pathlib import Path\n\nfrom typer import Typer, Option, Argument, Context, BadParameter, Exit\nfrom typing import Annotated, Any\nfrom rich.markdown import Markdown\nfrom rich.panel import Panel\nfrom rich.console import Console\nfrom rich.table import Table\nfrom rich.text import Text\n\nfrom .embeddings import EmbeddingProvider\nfrom .index_config import resolve_db_path\nfrom .indexing import IndexingPipeline, SchemaDiscovery\nfrom .storage import DuckDBStorage\nfrom .agent import set_index_context, clear_index_context\nfrom .workflow import (\n    workflow,\n    InputEvent,\n    ToolCallEvent,\n    GoDeeperEvent,\n    AskHumanEvent,\n    HumanAnswerEvent,\n    get_agent,\n    reset_agent,\n)\nfrom .exploration_trace import ExplorationTrace, extract_cited_sources\n\napp = Typer()\nschema_app = Typer(help=\"Manage metadata schemas for indexed corpora.\")\napp.add_typer(schema_app, name=\"schema\")\n\n\n# Tool icons for visual distinction\nTOOL_ICONS = {\n    \"scan_folder\": \"📂\",\n    \"preview_file\": \"👁️\",\n    \"parse_file\": \"📖\",\n    \"read\": \"📄\",\n    \"grep\": \"🔍\",\n    \"glob\": \"🔎\",\n    \"semantic_search\": \"🧠\",\n    \"get_document\": \"📚\",\n    \"list_indexed_documents\": \"🗂️\",\n}\n\n# Phase detection based on tool usage\nPHASE_DESCRIPTIONS = {\n    \"scan_folder\": (\"Phase 1\", \"Parallel Document Scan\", \"cyan\"),\n    \"preview_file\": (\"Phase 1/2\", \"Quick Preview\", \"cyan\"),\n    \"parse_file\": (\"Phase 2\", \"Deep Dive\", \"green\"),\n    \"read\": (\"Reading\", \"Text File\", \"blue\"),\n    \"grep\": (\"Searching\", \"Pattern Match\", \"yellow\"),\n    \"glob\": (\"Finding\", \"File Search\", \"yellow\"),\n    \"semantic_search\": 
(\"Indexed\", \"Semantic Retrieval\", \"magenta\"),\n    \"get_document\": (\"Indexed\", \"Document Fetch\", \"green\"),\n    \"list_indexed_documents\": (\"Indexed\", \"Corpus Listing\", \"blue\"),\n}\n\n\ndef _load_metadata_profile(path_value: str | None) -> dict[str, Any] | None:\n    if path_value is None:\n        return None\n    resolved = Path(path_value).expanduser().resolve()\n    if not resolved.exists() or not resolved.is_file():\n        raise BadParameter(f\"Metadata profile file not found: {resolved}\")\n    try:\n        payload = json.loads(resolved.read_text())\n    except json.JSONDecodeError as exc:\n        raise BadParameter(\n            f\"Metadata profile file is not valid JSON: {resolved}\"\n        ) from exc\n    if not isinstance(payload, dict):\n        raise BadParameter(\"Metadata profile JSON must be an object.\")\n    return payload\n\n\ndef format_tool_panel(event: ToolCallEvent, step_number: int) -> Panel:\n    \"\"\"Create a richly formatted panel for a tool call event.\"\"\"\n    tool_name = event.tool_name\n    icon = TOOL_ICONS.get(tool_name, \"🔧\")\n    phase_info = PHASE_DESCRIPTIONS.get(tool_name, (\"Action\", \"Tool Call\", \"yellow\"))\n    phase_label, phase_desc, color = phase_info\n\n    # Build the content\n    lines = []\n\n    # Tool and target info\n    if \"directory\" in event.tool_input:\n        target = event.tool_input[\"directory\"]\n        lines.append(f\"**Target Directory:** `{target}`\")\n    elif \"file_path\" in event.tool_input:\n        target = event.tool_input[\"file_path\"]\n        lines.append(f\"**Target File:** `{target}`\")\n\n    # Additional parameters\n    other_params = {\n        k: v for k, v in event.tool_input.items() if k not in (\"directory\", \"file_path\")\n    }\n    if other_params:\n        lines.append(f\"**Parameters:** `{json.dumps(other_params)}`\")\n\n    lines.append(\"\")\n    lines.append(\"---\")\n    lines.append(\"\")\n\n    # Reasoning (this is the key part for 
visibility)\n    lines.append(\"**Agent's Reasoning:**\")\n    lines.append(\"\")\n    lines.append(event.reason)\n\n    content = \"\\n\".join(lines)\n\n    # Create title with step number and phase\n    title = f\"{icon} Step {step_number}: {tool_name} [{phase_label}: {phase_desc}]\"\n\n    return Panel(\n        Markdown(content),\n        title=title,\n        title_align=\"left\",\n        border_style=f\"bold {color}\",\n        padding=(1, 2),\n    )\n\n\ndef format_navigation_panel(event: GoDeeperEvent, step_number: int) -> Panel:\n    \"\"\"Create a panel for directory navigation events.\"\"\"\n    content = f\"\"\"**Navigating to:** `{event.directory}`\n\n---\n\n**Agent's Reasoning:**\n\n{event.reason}\n\"\"\"\n    return Panel(\n        Markdown(content),\n        title=f\"📁 Step {step_number}: Navigate to Directory\",\n        title_align=\"left\",\n        border_style=\"bold magenta\",\n        padding=(1, 2),\n    )\n\n\ndef print_workflow_header(console: Console, task: str, folder: str) -> None:\n    \"\"\"Print a header showing the task being executed.\"\"\"\n    console.print()\n    header = Table.grid(padding=(0, 2))\n    header.add_column(style=\"bold cyan\", justify=\"right\")\n    header.add_column()\n\n    header.add_row(\"🤖 FsExplorer Agent\", \"\")\n    header.add_row(\"📋 Task:\", task)\n    header.add_row(\"📁 Folder:\", folder)\n    header.add_row(\"🕐 Started:\", datetime.now().strftime(\"%Y-%m-%d %H:%M:%S\"))\n\n    console.print(\n        Panel(\n            header,\n            border_style=\"bold blue\",\n            title=\"Starting Exploration\",\n            title_align=\"left\",\n        )\n    )\n    console.print()\n\n\ndef print_workflow_summary(\n    console: Console,\n    agent,\n    step_count: int,\n    trace: ExplorationTrace,\n    cited_sources: list[str],\n) -> None:\n    \"\"\"Print a summary of the workflow execution.\"\"\"\n    usage = agent.token_usage\n\n    # Create summary table\n    summary = 
Table.grid(padding=(0, 2))\n    summary.add_column(style=\"bold\", justify=\"right\")\n    summary.add_column()\n\n    summary.add_row(\"Total Steps:\", str(step_count))\n    summary.add_row(\"API Calls:\", str(usage.api_calls))\n    summary.add_row(\"Documents Scanned:\", str(usage.documents_scanned))\n    summary.add_row(\"Documents Parsed:\", str(usage.documents_parsed))\n    summary.add_row(\"\", \"\")\n    summary.add_row(\"Prompt Tokens:\", f\"{usage.prompt_tokens:,}\")\n    summary.add_row(\"Completion Tokens:\", f\"{usage.completion_tokens:,}\")\n    summary.add_row(\"Total Tokens:\", f\"{usage.total_tokens:,}\")\n    summary.add_row(\"\", \"\")\n\n    # Cost calculation\n    input_cost, output_cost, total_cost = usage._calculate_cost()\n    summary.add_row(\"Est. Input Cost:\", f\"${input_cost:.4f}\")\n    summary.add_row(\"Est. Output Cost:\", f\"${output_cost:.4f}\")\n    summary.add_row(\"Est. Total Cost:\", f\"${total_cost:.4f}\")\n\n    console.print()\n    console.print(\n        Panel(\n            summary,\n            title=\"📊 Workflow Summary\",\n            title_align=\"left\",\n            border_style=\"bold blue\",\n        )\n    )\n\n    if trace.step_path:\n        path_markdown = \"\\n\".join(f\"- `{entry}`\" for entry in trace.step_path)\n        console.print()\n        console.print(\n            Panel(\n                Markdown(path_markdown),\n                title=\"🧭 Exploration Path\",\n                title_align=\"left\",\n                border_style=\"bold cyan\",\n            )\n        )\n\n    referenced_documents = trace.sorted_documents()\n    if referenced_documents:\n        docs_markdown = \"\\n\".join(f\"- `{doc}`\" for doc in referenced_documents)\n        console.print()\n        console.print(\n            Panel(\n                Markdown(docs_markdown),\n                title=\"📚 Referenced Documents (Tool Calls)\",\n                title_align=\"left\",\n                border_style=\"bold green\",\n            
)\n        )\n\n    if cited_sources:\n        sources_markdown = \"\\n\".join(f\"- `{source}`\" for source in cited_sources)\n        console.print()\n        console.print(\n            Panel(\n                Markdown(sources_markdown),\n                title=\"🔖 Cited Sources (Final Answer)\",\n                title_align=\"left\",\n                border_style=\"bold yellow\",\n            )\n        )\n\n\nasync def run_workflow(\n    task: str,\n    folder: str = \".\",\n    *,\n    use_index: bool = False,\n    db_path: str | None = None,\n) -> None:\n    \"\"\"\n    Execute the exploration workflow with detailed step-by-step output.\n\n    Args:\n        task: The user's task/question to answer.\n        folder: Folder to explore. Defaults to the current directory.\n        use_index: Use indexed retrieval tools (requires a prior `explore index` run).\n        db_path: Optional path to the DuckDB index file.\n    \"\"\"\n    console = Console()\n    resolved_folder = os.path.abspath(folder)\n    if not os.path.exists(resolved_folder) or not os.path.isdir(resolved_folder):\n        console.print(\n            Panel(\n                Text(f\"No such directory: {resolved_folder}\", style=\"bold red\"),\n                title=\"❌ Error\",\n                title_align=\"left\",\n                border_style=\"bold red\",\n            )\n        )\n        return\n\n    resolved_db_path: str | None = None\n    index_storage: DuckDBStorage | None = None\n    if use_index:\n        resolved_db_path = resolve_db_path(db_path)\n        storage = DuckDBStorage(resolved_db_path)\n        corpus_id = storage.get_corpus_id(resolved_folder)\n        if corpus_id is None:\n            console.print(\n                Panel(\n                    Text(\n                        \"No index found for this folder. 
\"\n                        \"Run `explore index <folder>` first.\",\n                        style=\"bold red\",\n                    ),\n                    title=\"❌ Missing Index\",\n                    title_align=\"left\",\n                    border_style=\"bold red\",\n                )\n            )\n            return\n        index_storage = storage\n        set_index_context(resolved_folder, resolved_db_path)\n    else:\n        clear_index_context()\n\n    try:\n        # Reset agent for fresh state\n        reset_agent()\n\n        # Print header\n        print_workflow_header(console, task, resolved_folder)\n        trace = ExplorationTrace(root_directory=resolved_folder)\n\n        step_number = 0\n        handler = workflow.run(\n            start_event=InputEvent(\n                task=task,\n                folder=resolved_folder,\n                use_index=use_index,\n            )\n        )\n\n        with console.status(status=\"[bold cyan]🔄 Analyzing task...\") as status:\n            async for event in handler.stream_events():\n                if isinstance(event, ToolCallEvent):\n                    step_number += 1\n                    resolved_document_path: str | None = None\n                    if event.tool_name == \"get_document\":\n                        doc_id = event.tool_input.get(\"doc_id\")\n                        if (\n                            index_storage is not None\n                            and isinstance(doc_id, str)\n                            and doc_id\n                        ):\n                            document = index_storage.get_document(doc_id=doc_id)\n                            if document and not document[\"is_deleted\"]:\n                                resolved_document_path = str(document[\"absolute_path\"])\n\n                    trace.record_tool_call(\n                        step_number=step_number,\n                        tool_name=event.tool_name,\n                        
tool_input=event.tool_input,\n                        resolved_document_path=resolved_document_path,\n                    )\n\n                    # Update status based on tool\n                    icon = TOOL_ICONS.get(event.tool_name, \"🔧\")\n                    if event.tool_name == \"scan_folder\":\n                        status.update(\n                            f\"[bold cyan]{icon} Scanning documents in parallel...\"\n                        )\n                    elif event.tool_name == \"parse_file\":\n                        status.update(\n                            f\"[bold green]{icon} Reading document in detail...\"\n                        )\n                    elif event.tool_name == \"preview_file\":\n                        status.update(f\"[bold cyan]{icon} Quick preview of document...\")\n                    elif event.tool_name == \"semantic_search\":\n                        status.update(f\"[bold magenta]{icon} Searching index...\")\n                    elif event.tool_name == \"get_document\":\n                        status.update(f\"[bold green]{icon} Reading indexed document...\")\n                    elif event.tool_name == \"list_indexed_documents\":\n                        status.update(f\"[bold blue]{icon} Listing indexed documents...\")\n                    else:\n                        status.update(\n                            f\"[bold yellow]{icon} Executing {event.tool_name}...\"\n                        )\n\n                    # Print the detailed panel\n                    panel = format_tool_panel(event, step_number)\n                    console.print(panel)\n                    console.print()\n\n                    status.update(\"[bold cyan]🔄 Processing results...\")\n                elif isinstance(event, GoDeeperEvent):\n                    step_number += 1\n                    trace.record_go_deeper(\n                        step_number=step_number, directory=event.directory\n                    )\n               
     panel = format_navigation_panel(event, step_number)\n                    console.print(panel)\n                    console.print()\n                    status.update(\"[bold cyan]🔄 Exploring directory...\")\n\n                elif isinstance(event, AskHumanEvent):\n                    status.stop()\n                    console.print()\n\n                    # Create a nice prompt panel\n                    question_panel = Panel(\n                        Markdown(\n                            f\"**Question:** {event.question}\\n\\n**Why I'm asking:** {event.reason}\"\n                        ),\n                        title=\"❓ Human Input Required\",\n                        title_align=\"left\",\n                        border_style=\"bold red\",\n                    )\n                    console.print(question_panel)\n\n                    answer = console.input(\"[bold cyan]Your answer:[/] \")\n                    while answer.strip() == \"\":\n                        console.print(\"[bold red]Please provide an answer.[/]\")\n                        answer = console.input(\"[bold cyan]Your answer:[/] \")\n\n                    handler.ctx.send_event(HumanAnswerEvent(response=answer.strip()))\n                    console.print()\n                    status.start()\n                    status.update(\"[bold cyan]🔄 Processing your response...\")\n\n            # Get final result\n            result = await handler\n            status.update(\"[bold green]✨ Preparing final answer...\")\n            await asyncio.sleep(0.1)\n            status.stop()\n\n        # Print final result with prominent styling\n        console.print()\n        if result.final_result:\n            final_panel = Panel(\n                Markdown(result.final_result),\n                title=\"✅ Final Answer\",\n                title_align=\"left\",\n                border_style=\"bold green\",\n                padding=(1, 2),\n            )\n            console.print(final_panel)\n     
   elif result.error:\n            error_panel = Panel(\n                Text(result.error, style=\"bold red\"),\n                title=\"❌ Error\",\n                title_align=\"left\",\n                border_style=\"bold red\",\n            )\n            console.print(error_panel)\n\n        # Print workflow summary\n        agent = get_agent()\n        cited_sources = extract_cited_sources(result.final_result)\n        print_workflow_summary(console, agent, step_number, trace, cited_sources)\n    finally:\n        clear_index_context()\n\n\n@app.callback(invoke_without_command=True)\ndef main(\n    ctx: Context,\n    task: Annotated[\n        str | None,\n        Option(\n            \"--task\",\n            \"-t\",\n            help=\"Task that the FsExplorer Agent has to perform while exploring the current directory.\",\n        ),\n    ] = None,\n    folder: Annotated[\n        str,\n        Option(\n            \"--folder\",\n            \"-f\",\n            help=\"Folder to explore. 
Defaults to the current directory.\",\n        ),\n    ] = \".\",\n    use_index: Annotated[\n        bool,\n        Option(\n            \"--use-index\",\n            help=\"Use indexed retrieval tools for this run (requires prior indexing).\",\n        ),\n    ] = False,\n    db_path: Annotated[\n        str | None,\n        Option(\"--db-path\", help=\"Path to DuckDB index file.\"),\n    ] = None,\n) -> None:\n    \"\"\"\n    Explore documents with an agent, build indexes, and manage schema metadata.\n\n    Backward-compatible mode:\n    - `explore --task \"...\" [--folder ...]`\n    \"\"\"\n    if ctx.invoked_subcommand is not None:\n        return\n\n    if task is None or not task.strip():\n        raise BadParameter(\"`--task` is required unless you run a subcommand.\")\n\n    effective_use_index = use_index\n    if (\n        not effective_use_index\n        and os.getenv(\"FS_EXPLORER_AUTO_INDEX\", \"\").strip() == \"1\"\n    ):\n        try:\n            resolved_folder = os.path.abspath(folder)\n            resolved_db = resolve_db_path(db_path)\n            storage = DuckDBStorage(resolved_db, read_only=True, initialize=False)\n            if storage.get_corpus_id(resolved_folder) is not None:\n                effective_use_index = True\n            storage.close()\n        except Exception:\n            pass\n\n    asyncio.run(\n        run_workflow(task, folder, use_index=effective_use_index, db_path=db_path)\n    )\n\n\n@app.command(\"index\")\ndef index_command(\n    folder: Annotated[\n        str,\n        Argument(help=\"Folder to index recursively.\"),\n    ] = \".\",\n    db_path: Annotated[\n        str | None,\n        Option(\"--db-path\", help=\"Path to DuckDB index file.\"),\n    ] = None,\n    discover_schema: Annotated[\n        bool,\n        Option(\n            \"--discover-schema\",\n            help=\"Auto-discover metadata schema and set it active for this corpus.\",\n        ),\n    ] = False,\n    schema_name: Annotated[\n        
str | None,\n        Option(\"--schema-name\", help=\"Use an existing stored schema by name.\"),\n    ] = None,\n    with_metadata: Annotated[\n        bool,\n        Option(\n            \"--with-metadata\",\n            help=(\n                \"Enable langextract metadata extraction (requires API key). \"\n                \"Also enables schema discovery if not explicitly requested.\"\n            ),\n        ),\n    ] = False,\n    metadata_profile_path: Annotated[\n        str | None,\n        Option(\n            \"--metadata-profile\",\n            help=(\n                \"Path to JSON profile defining dynamic langextract metadata fields \"\n                \"and prompt. Implies --with-metadata.\"\n            ),\n        ),\n    ] = None,\n    with_embeddings: Annotated[\n        bool,\n        Option(\n            \"--with-embeddings\",\n            help=\"Generate vector embeddings for indexed chunks (requires GOOGLE_API_KEY).\",\n        ),\n    ] = False,\n) -> None:\n    \"\"\"Build or refresh an index for a folder.\"\"\"\n    console = Console()\n    resolved_db_path = resolve_db_path(db_path)\n    storage = DuckDBStorage(resolved_db_path)\n\n    embedding_provider: EmbeddingProvider | None = None\n    if with_embeddings:\n        try:\n            embedding_provider = EmbeddingProvider()\n        except ValueError as exc:\n            raise BadParameter(str(exc)) from exc\n\n    pipeline = IndexingPipeline(\n        storage=storage,\n        embedding_provider=embedding_provider,\n    )\n    metadata_profile = _load_metadata_profile(metadata_profile_path)\n    effective_with_metadata = with_metadata or metadata_profile is not None\n\n    if effective_with_metadata and metadata_profile is None:\n        console.print(\n            \"[bold cyan]🔍 Analyzing corpus to generate metadata profile...[/]\"\n        )\n\n    try:\n        effective_discover_schema = discover_schema or effective_with_metadata\n        result = pipeline.index_folder(\n           
 folder,\n            discover_schema=effective_discover_schema,\n            schema_name=schema_name,\n            with_metadata=effective_with_metadata,\n            metadata_profile=metadata_profile,\n        )\n    except ValueError as exc:\n        raise BadParameter(str(exc)) from exc\n\n    summary = Table.grid(padding=(0, 2))\n    summary.add_column(style=\"bold\", justify=\"right\")\n    summary.add_column()\n    summary.add_row(\"DB Path:\", resolved_db_path)\n    summary.add_row(\"Corpus ID:\", result.corpus_id)\n    summary.add_row(\"Indexed Files:\", str(result.indexed_files))\n    summary.add_row(\"Skipped Files:\", str(result.skipped_files))\n    summary.add_row(\"Deleted Files:\", str(result.deleted_files))\n    summary.add_row(\"Chunks Written:\", str(result.chunks_written))\n    summary.add_row(\"Active Documents:\", str(result.active_documents))\n    summary.add_row(\"Embeddings Written:\", str(result.embeddings_written))\n    summary.add_row(\"Schema Used:\", result.schema_used or \"<none>\")\n    summary.add_row(\n        \"Metadata Mode:\",\n        \"langextract\" if effective_with_metadata else \"heuristic\",\n    )\n    if metadata_profile_path:\n        profile_label = str(Path(metadata_profile_path).expanduser().resolve())\n    elif effective_with_metadata:\n        profile_label = \"<auto-discovered>\"\n    else:\n        profile_label = \"<none>\"\n    summary.add_row(\"Metadata Profile:\", profile_label)\n\n    console.print(Panel(summary, title=\"📦 Index Complete\", border_style=\"bold green\"))\n\n\n@app.command(\"query\")\ndef query_command(\n    task: Annotated[\n        str,\n        Option(\n            \"--task\",\n            \"-t\",\n            help=\"Question to answer using indexed retrieval tools.\",\n        ),\n    ],\n    folder: Annotated[\n        str,\n        Option(\n            \"--folder\",\n            \"-f\",\n            help=\"Folder whose index should be queried.\",\n        ),\n    ] = \".\",\n    db_path: 
Annotated[\n        str | None,\n        Option(\"--db-path\", help=\"Path to DuckDB index file.\"),\n    ] = None,\n) -> None:\n    \"\"\"Run the agent with indexed retrieval enabled.\"\"\"\n    asyncio.run(run_workflow(task, folder, use_index=True, db_path=db_path))\n\n\n@schema_app.command(\"discover\")\ndef schema_discover_command(\n    folder: Annotated[\n        str,\n        Argument(help=\"Folder to inspect for schema discovery.\"),\n    ] = \".\",\n    db_path: Annotated[\n        str | None,\n        Option(\"--db-path\", help=\"Path to DuckDB index file.\"),\n    ] = None,\n    name: Annotated[\n        str | None,\n        Option(\"--name\", help=\"Override discovered schema name.\"),\n    ] = None,\n    activate: Annotated[\n        bool,\n        Option(\n            \"--activate/--no-activate\",\n            help=\"Set schema as active for the corpus.\",\n        ),\n    ] = True,\n    with_metadata: Annotated[\n        bool,\n        Option(\n            \"--with-metadata\",\n            help=\"Include langextract metadata fields in discovered schema.\",\n        ),\n    ] = False,\n    metadata_profile_path: Annotated[\n        str | None,\n        Option(\n            \"--metadata-profile\",\n            help=(\n                \"Path to JSON profile defining dynamic langextract metadata fields \"\n                \"and prompt. 
Implies --with-metadata.\"\n            ),\n        ),\n    ] = None,\n) -> None:\n    \"\"\"Auto-discover and store a metadata schema for a folder.\"\"\"\n    console = Console()\n    resolved_folder = str(os.path.abspath(folder))\n    if not os.path.isdir(resolved_folder):\n        raise BadParameter(f\"No such directory: {resolved_folder}\")\n\n    resolved_db_path = resolve_db_path(db_path)\n    storage = DuckDBStorage(resolved_db_path)\n    corpus_id = storage.get_or_create_corpus(resolved_folder)\n    metadata_profile = _load_metadata_profile(metadata_profile_path)\n    effective_with_metadata = with_metadata or metadata_profile is not None\n\n    if effective_with_metadata and metadata_profile is None:\n        console.print(\n            \"[bold cyan]🔍 Analyzing corpus to generate metadata profile...[/]\"\n        )\n\n    discovery = SchemaDiscovery()\n    discovered = discovery.discover_from_folder(\n        resolved_folder,\n        with_langextract=effective_with_metadata,\n        metadata_profile=metadata_profile,\n    )\n    schema_name = name or str(\n        discovered.get(\"name\", f\"auto_{os.path.basename(resolved_folder)}\")\n    )\n    discovered[\"name\"] = schema_name\n    schema_id = storage.save_schema(\n        corpus_id=corpus_id,\n        name=schema_name,\n        schema_def=discovered,\n        is_active=activate,\n    )\n\n    output = Table.grid(padding=(0, 2))\n    output.add_column(style=\"bold\", justify=\"right\")\n    output.add_column()\n    output.add_row(\"DB Path:\", resolved_db_path)\n    output.add_row(\"Corpus ID:\", corpus_id)\n    output.add_row(\"Schema ID:\", schema_id)\n    output.add_row(\"Schema Name:\", schema_name)\n    output.add_row(\"Active:\", str(activate))\n    output.add_row(\"Field Count:\", str(len(discovered.get(\"fields\", []))))\n    output.add_row(\n        \"Metadata Mode:\", \"langextract\" if effective_with_metadata else \"heuristic\"\n    )\n    if metadata_profile_path:\n        profile_label = 
str(Path(metadata_profile_path).expanduser().resolve())\n    elif effective_with_metadata:\n        profile_label = \"<auto-discovered>\"\n    else:\n        profile_label = \"<none>\"\n    output.add_row(\"Metadata Profile:\", profile_label)\n\n    console.print(Panel(output, title=\"🧩 Schema Saved\", border_style=\"bold cyan\"))\n    console.print_json(json.dumps(discovered, indent=2))\n\n\n@schema_app.command(\"show\")\ndef schema_show_command(\n    folder: Annotated[\n        str,\n        Argument(help=\"Folder whose schemas should be listed.\"),\n    ] = \".\",\n    db_path: Annotated[\n        str | None,\n        Option(\"--db-path\", help=\"Path to DuckDB index file.\"),\n    ] = None,\n) -> None:\n    \"\"\"Show saved schemas for a folder's corpus.\"\"\"\n    console = Console()\n    resolved_folder = str(os.path.abspath(folder))\n    resolved_db_path = resolve_db_path(db_path)\n    storage = DuckDBStorage(resolved_db_path)\n\n    corpus_id = storage.get_corpus_id(resolved_folder)\n    if corpus_id is None:\n        console.print(\n            Panel(\n                f\"No corpus found for folder: {resolved_folder}\\nRun `explore index {resolved_folder}` first.\",\n                title=\"⚠️ No Corpus\",\n                border_style=\"bold yellow\",\n            )\n        )\n        raise Exit(code=1)\n\n    schemas = storage.list_schemas(corpus_id=corpus_id)\n    if not schemas:\n        console.print(\n            Panel(\n                f\"No schemas saved for corpus: {corpus_id}\",\n                title=\"⚠️ No Schemas\",\n                border_style=\"bold yellow\",\n            )\n        )\n        raise Exit(code=1)\n\n    table = Table(title=f\"Schemas for {resolved_folder}\")\n    table.add_column(\"Name\")\n    table.add_column(\"Active\")\n    table.add_column(\"Created At\")\n    table.add_column(\"Field Count\")\n\n    for schema in schemas:\n        table.add_row(\n            schema.name,\n            \"yes\" if schema.is_active else 
\"no\",\n            schema.created_at,\n            str(len(schema.schema_def.get(\"fields\", []))),\n        )\n\n    console.print(table)\n"
  },
  {
    "path": "src/fs_explorer/models.py",
    "content": "\"\"\"\nPydantic models for FsExplorer agent actions.\n\nThis module defines the structured data models used to represent\nthe actions the agent can take during filesystem exploration.\n\"\"\"\n\nfrom pydantic import BaseModel, Field\nfrom typing import TypeAlias, Literal, Any\n\n\n# =============================================================================\n# Type Aliases\n# =============================================================================\n\nTools: TypeAlias = Literal[\n    \"read\",\n    \"grep\",\n    \"glob\",\n    \"scan_folder\",\n    \"preview_file\",\n    \"parse_file\",\n    \"semantic_search\",\n    \"get_document\",\n    \"list_indexed_documents\",\n]\n\"\"\"Available tool names that the agent can invoke.\"\"\"\n\nActionType: TypeAlias = Literal[\"stop\", \"godeeper\", \"toolcall\", \"askhuman\"]\n\"\"\"Types of actions the agent can take.\"\"\"\n\n\n# =============================================================================\n# Action Models\n# =============================================================================\n\nclass StopAction(BaseModel):\n    \"\"\"\n    Action indicating the task is complete.\n    \n    Used when the agent has gathered enough information to provide\n    a final answer to the user's query.\n    \"\"\"\n    \n    final_result: str = Field(\n        description=\"Final result of the operation with the answer to the user's query\"\n    )\n\n\nclass AskHumanAction(BaseModel):\n    \"\"\"\n    Action requesting clarification from the user.\n    \n    Used when the agent needs additional information or context\n    to proceed with the task.\n    \"\"\"\n    \n    question: str = Field(\n        description=\"Clarification question to ask the user\"\n    )\n\n\nclass GoDeeperAction(BaseModel):\n    \"\"\"\n    Action to navigate into a subdirectory.\n    \n    Used when the agent needs to explore a subdirectory\n    to find relevant files.\n    \"\"\"\n    \n    directory: str = Field(\n        
description=\"Path to the directory to navigate into\"\n    )\n\n\nclass ToolCallArg(BaseModel):\n    \"\"\"\n    A single argument for a tool call.\n    \n    Represents a parameter name-value pair to pass to a tool.\n    \"\"\"\n    \n    parameter_name: str = Field(\n        description=\"Name of the parameter\"\n    )\n    parameter_value: Any = Field(\n        description=\"Value for the parameter\"\n    )\n\n\nclass ToolCallAction(BaseModel):\n    \"\"\"\n    Action to invoke a filesystem tool.\n    \n    Used when the agent needs to read files, search for patterns,\n    or parse documents to gather information.\n    \"\"\"\n    \n    tool_name: Tools = Field(\n        description=\"Name of the tool to invoke\"\n    )\n    tool_input: list[ToolCallArg] = Field(\n        description=\"Arguments to pass to the tool\"\n    )\n\n    def to_fn_args(self) -> dict[str, Any]:\n        \"\"\"\n        Convert tool input to a dictionary for function calls.\n        \n        Returns:\n            Dictionary mapping parameter names to values.\n        \"\"\"\n        return {arg.parameter_name: arg.parameter_value for arg in self.tool_input}\n\n\nclass Action(BaseModel):\n    \"\"\"\n    Container for an agent action with reasoning.\n    \n    Wraps any of the specific action types (stop, go deeper,\n    tool call, ask human) along with the agent's explanation\n    for why this action was chosen.\n    \"\"\"\n    \n    action: ToolCallAction | GoDeeperAction | StopAction | AskHumanAction = Field(\n        description=\"The specific action to take\"\n    )\n    reason: str = Field(\n        description=\"Explanation for why this action was chosen\"\n    )\n\n    def to_action_type(self) -> ActionType:\n        \"\"\"\n        Get the type of this action.\n        \n        Returns:\n            The action type string: \"toolcall\", \"godeeper\", \"askhuman\", or \"stop\".\n        \"\"\"\n        if isinstance(self.action, ToolCallAction):\n            return 
\"toolcall\"\n        elif isinstance(self.action, GoDeeperAction):\n            return \"godeeper\"\n        elif isinstance(self.action, AskHumanAction):\n            return \"askhuman\"\n        else:\n            return \"stop\"\n"
  },
  {
    "path": "src/fs_explorer/search/__init__.py",
    "content": "\"\"\"Search helpers for indexed corpora.\"\"\"\n\nfrom .filters import (\n    MetadataFilter,\n    MetadataFilterParseError,\n    parse_metadata_filters,\n    supported_filter_syntax,\n)\nfrom .query import IndexedQueryEngine, SearchHit\nfrom .ranker import RankedDocument, rank_documents\nfrom .semantic import SemanticSearchEngine\n\n__all__ = [\n    \"MetadataFilter\",\n    \"MetadataFilterParseError\",\n    \"parse_metadata_filters\",\n    \"supported_filter_syntax\",\n    \"IndexedQueryEngine\",\n    \"SearchHit\",\n    \"RankedDocument\",\n    \"rank_documents\",\n    \"SemanticSearchEngine\",\n]\n"
  },
  {
    "path": "src/fs_explorer/search/filters.py",
    "content": "\"\"\"\nMetadata filter parsing helpers.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport re\nfrom dataclasses import dataclass\nfrom typing import Any, Literal\n\n\nFilterOperator = Literal[\"eq\", \"ne\", \"gt\", \"gte\", \"lt\", \"lte\", \"in\", \"contains\"]\n\n\n@dataclass(frozen=True)\nclass MetadataFilter:\n    \"\"\"Normalized metadata filter condition.\"\"\"\n\n    field: str\n    operator: FilterOperator\n    value: str | bool | int | float | list[str | bool | int | float]\n\n    def to_storage_dict(self) -> dict[str, Any]:\n        return {\n            \"field\": self.field,\n            \"operator\": self.operator,\n            \"value\": self.value,\n        }\n\n\nclass MetadataFilterParseError(ValueError):\n    \"\"\"Raised when metadata filter syntax is invalid.\"\"\"\n\n\n_FIELD_RE = re.compile(r\"^[A-Za-z_][A-Za-z0-9_]*$\")\n_NUMBER_RE = re.compile(r\"^-?\\d+(?:\\.\\d+)?$\")\n\n\ndef supported_filter_syntax() -> str:\n    \"\"\"Return a short help text for filter syntax.\"\"\"\n    return (\n        \"Supported filter syntax: \"\n        \"`field=value`, `field!=value`, `field>=number`, `field<=number`, \"\n        \"`field>number`, `field<number`, `field in (a, b, c)`, `field~substring`; \"\n        \"combine with comma or `and`.\"\n    )\n\n\ndef parse_metadata_filters(\n    raw_filters: str | None,\n    *,\n    allowed_fields: set[str] | None = None,\n) -> list[MetadataFilter]:\n    \"\"\"Parse a raw filter string into normalized metadata conditions.\"\"\"\n    if raw_filters is None or not raw_filters.strip():\n        return []\n\n    conditions = _split_conditions(raw_filters)\n    parsed: list[MetadataFilter] = []\n    for condition in conditions:\n        parsed.append(_parse_condition(condition, allowed_fields=allowed_fields))\n    return parsed\n\n\ndef _parse_condition(condition: str, *, allowed_fields: set[str] | None) -> MetadataFilter:\n    text = condition.strip()\n    if not text:\n        raise 
MetadataFilterParseError(\"Empty filter condition.\")\n\n    in_match = re.match(r\"^\\s*([A-Za-z_][A-Za-z0-9_]*)\\s+in\\s+(.+)\\s*$\", text, flags=re.IGNORECASE)\n    if in_match:\n        field = in_match.group(1)\n        _validate_field(field, allowed_fields=allowed_fields)\n        values = _parse_list_value(in_match.group(2))\n        if not values:\n            raise MetadataFilterParseError(f\"`in` filter has no values: {text!r}\")\n        return MetadataFilter(field=field, operator=\"in\", value=values)\n\n    op_match = re.match(r\"^\\s*([A-Za-z_][A-Za-z0-9_]*)\\s*(<=|>=|!=|=|<|>|~|:)\\s*(.+)\\s*$\", text)\n    if not op_match:\n        raise MetadataFilterParseError(f\"Invalid filter syntax: {text!r}\")\n\n    field = op_match.group(1)\n    operator_symbol = op_match.group(2)\n    raw_value = op_match.group(3)\n    _validate_field(field, allowed_fields=allowed_fields)\n    value = _parse_scalar_value(raw_value)\n\n    operator_map: dict[str, FilterOperator] = {\n        \"=\": \"eq\",\n        \":\": \"eq\",\n        \"!=\": \"ne\",\n        \">\": \"gt\",\n        \">=\": \"gte\",\n        \"<\": \"lt\",\n        \"<=\": \"lte\",\n        \"~\": \"contains\",\n    }\n    operator = operator_map[operator_symbol]\n\n    if operator in {\"gt\", \"gte\", \"lt\", \"lte\"} and not isinstance(value, (int, float)):\n        raise MetadataFilterParseError(\n            f\"Operator `{operator_symbol}` requires a numeric value: {text!r}\"\n        )\n\n    return MetadataFilter(field=field, operator=operator, value=value)\n\n\ndef _validate_field(field: str, *, allowed_fields: set[str] | None) -> None:\n    if not _FIELD_RE.match(field):\n        raise MetadataFilterParseError(f\"Invalid field name: {field!r}\")\n    if allowed_fields is not None and field not in allowed_fields:\n        allowed = \", \".join(sorted(allowed_fields)) if allowed_fields else \"<none>\"\n        raise MetadataFilterParseError(\n            f\"Unknown metadata field {field!r}. 
Allowed fields: {allowed}\"\n        )\n\n\ndef _split_conditions(raw: str) -> list[str]:\n    parts: list[str] = []\n    current: list[str] = []\n    quote: str | None = None\n    paren_depth = 0\n    bracket_depth = 0\n    i = 0\n    while i < len(raw):\n        ch = raw[i]\n\n        if quote is not None:\n            current.append(ch)\n            if ch == quote:\n                quote = None\n            i += 1\n            continue\n\n        if ch in {\"'\", '\"'}:\n            quote = ch\n            current.append(ch)\n            i += 1\n            continue\n        if ch == \"(\":\n            paren_depth += 1\n            current.append(ch)\n            i += 1\n            continue\n        if ch == \")\":\n            paren_depth = max(paren_depth - 1, 0)\n            current.append(ch)\n            i += 1\n            continue\n        if ch == \"[\":\n            bracket_depth += 1\n            current.append(ch)\n            i += 1\n            continue\n        if ch == \"]\":\n            bracket_depth = max(bracket_depth - 1, 0)\n            current.append(ch)\n            i += 1\n            continue\n\n        if paren_depth == 0 and bracket_depth == 0 and ch == \",\":\n            _flush_part(parts, current)\n            i += 1\n            continue\n\n        if (\n            paren_depth == 0\n            and bracket_depth == 0\n            and raw[i : i + 3].lower() == \"and\"\n            and (i == 0 or raw[i - 1].isspace())\n            and (i + 3 == len(raw) or raw[i + 3].isspace())\n        ):\n            _flush_part(parts, current)\n            i += 3\n            continue\n\n        current.append(ch)\n        i += 1\n\n    _flush_part(parts, current)\n    return parts\n\n\ndef _flush_part(parts: list[str], current: list[str]) -> None:\n    text = \"\".join(current).strip()\n    if text:\n        parts.append(text)\n    current.clear()\n\n\ndef _parse_list_value(raw_value: str) -> list[str | bool | int | float]:\n    text = 
raw_value.strip()\n    if text.startswith(\"(\") and text.endswith(\")\"):\n        text = text[1:-1]\n    elif text.startswith(\"[\") and text.endswith(\"]\"):\n        text = text[1:-1]\n\n    if not text.strip():\n        return []\n\n    items = _split_conditions(text)\n    return [_parse_scalar_value(item) for item in items]\n\n\ndef _parse_scalar_value(raw_value: str) -> str | bool | int | float:\n    text = raw_value.strip()\n    if not text:\n        raise MetadataFilterParseError(\"Missing filter value.\")\n\n    if (text.startswith(\"'\") and text.endswith(\"'\")) or (\n        text.startswith('\"') and text.endswith('\"')\n    ):\n        return text[1:-1]\n\n    lower = text.lower()\n    if lower == \"true\":\n        return True\n    if lower == \"false\":\n        return False\n    if _NUMBER_RE.match(text):\n        if \".\" in text:\n            return float(text)\n        return int(text)\n    return text\n"
  },
  {
    "path": "src/fs_explorer/search/query.py",
    "content": "\"\"\"\nIndexed query helpers for agent tools.\n\"\"\"\n\nfrom __future__ import annotations\n\nfrom concurrent.futures import ThreadPoolExecutor\nfrom dataclasses import dataclass\nfrom typing import Any, Callable\n\nfrom ..embeddings import EmbeddingProvider\nfrom ..storage import DuckDBStorage, StorageBackend\nfrom .filters import MetadataFilter, parse_metadata_filters\nfrom .ranker import RankedDocument, rank_documents\n\n\n@dataclass(frozen=True)\nclass SearchHit:\n    \"\"\"Ranked document hit from indexed retrieval.\"\"\"\n\n    doc_id: str\n    relative_path: str\n    absolute_path: str\n    position: int | None\n    text: str\n    semantic_score: float\n    metadata_score: int\n    score: float\n    matched_by: str\n\n\nclass IndexedQueryEngine:\n    \"\"\"Parallel retrieval engine for semantic + metadata query paths.\"\"\"\n\n    def __init__(\n        self,\n        storage: StorageBackend,\n        embedding_provider: EmbeddingProvider | None = None,\n    ) -> None:\n        self.storage = storage\n        self.embedding_provider = embedding_provider\n\n    def search(\n        self,\n        *,\n        corpus_id: str,\n        query: str,\n        filters: str | None = None,\n        limit: int = 5,\n        enable_semantic: bool = True,\n        enable_metadata: bool = True,\n    ) -> list[SearchHit]:\n        normalized_limit = max(limit, 1)\n        parsed_filters = self._parse_filters(corpus_id=corpus_id, filters=filters)\n        semantic_limit = max(normalized_limit * 4, normalized_limit)\n        metadata_limit = max(normalized_limit * 4, normalized_limit)\n\n        run_semantic = enable_semantic\n        run_metadata = enable_metadata and bool(parsed_filters)\n\n        semantic_rows: list[dict[str, Any]]\n        metadata_rows: list[dict[str, Any]]\n        if run_semantic and run_metadata:\n            semantic_rows, metadata_rows = self._search_parallel(\n                corpus_id=corpus_id,\n                query=query,\n  
              metadata_filters=parsed_filters,\n                semantic_limit=semantic_limit,\n                metadata_limit=metadata_limit,\n            )\n        elif run_semantic:\n            semantic_rows = self._semantic_query(\n                corpus_id=corpus_id,\n                query=query,\n                limit=semantic_limit,\n            )\n            metadata_rows = []\n        elif run_metadata:\n            semantic_rows = []\n            metadata_rows = self._metadata_query(\n                corpus_id=corpus_id,\n                metadata_filters=parsed_filters,\n                limit=metadata_limit,\n            )\n        else:\n            semantic_rows, metadata_rows = [], []\n\n        ranked = self._merge_and_rank(\n            semantic_rows=semantic_rows,\n            metadata_rows=metadata_rows,\n            limit=normalized_limit,\n        )\n        return [\n            SearchHit(\n                doc_id=doc.doc_id,\n                relative_path=doc.relative_path,\n                absolute_path=doc.absolute_path,\n                position=doc.position,\n                text=doc.text,\n                semantic_score=doc.semantic_score,\n                metadata_score=doc.metadata_score,\n                score=doc.combined_score,\n                matched_by=doc.matched_by,\n            )\n            for doc in ranked\n        ]\n\n    def _parse_filters(\n        self, *, corpus_id: str, filters: str | None\n    ) -> list[MetadataFilter]:\n        if filters is None or not filters.strip():\n            return []\n        allowed_fields = self._allowed_filter_fields(corpus_id=corpus_id)\n        return parse_metadata_filters(filters, allowed_fields=allowed_fields)\n\n    def _allowed_filter_fields(self, *, corpus_id: str) -> set[str] | None:\n        active_schema = self.storage.get_active_schema(corpus_id=corpus_id)\n        if active_schema is None:\n            return None\n        fields = 
active_schema.schema_def.get(\"fields\")\n        if not isinstance(fields, list):\n            return None\n        allowed: set[str] = set()\n        for field in fields:\n            if isinstance(field, dict):\n                name = field.get(\"name\")\n                if isinstance(name, str):\n                    allowed.add(name)\n        return allowed if allowed else None\n\n    def _search_parallel(\n        self,\n        *,\n        corpus_id: str,\n        query: str,\n        metadata_filters: list[MetadataFilter],\n        semantic_limit: int,\n        metadata_limit: int,\n    ) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:\n        with ThreadPoolExecutor(max_workers=2) as executor:\n            semantic_future = executor.submit(\n                self._semantic_query,\n                corpus_id=corpus_id,\n                query=query,\n                limit=semantic_limit,\n            )\n            metadata_future = executor.submit(\n                self._metadata_query,\n                corpus_id=corpus_id,\n                metadata_filters=metadata_filters,\n                limit=metadata_limit,\n            )\n            semantic_rows = semantic_future.result()\n            metadata_rows = metadata_future.result()\n        return semantic_rows, metadata_rows\n\n    def _semantic_query(\n        self,\n        *,\n        corpus_id: str,\n        query: str,\n        limit: int,\n    ) -> list[dict[str, Any]]:\n        scoped_storage, cleanup = self._acquire_query_storage()\n        try:\n            if self.embedding_provider is not None and scoped_storage.has_embeddings(\n                corpus_id=corpus_id\n            ):\n                query_embedding = self.embedding_provider.embed_query(query)\n                return scoped_storage.search_chunks_semantic(\n                    corpus_id=corpus_id,\n                    query_embedding=query_embedding,\n                    limit=limit,\n                )\n            return 
scoped_storage.search_chunks(\n                corpus_id=corpus_id, query=query, limit=limit\n            )\n        finally:\n            cleanup()\n\n    def _metadata_query(\n        self,\n        *,\n        corpus_id: str,\n        metadata_filters: list[MetadataFilter],\n        limit: int,\n    ) -> list[dict[str, Any]]:\n        scoped_storage, cleanup = self._acquire_query_storage()\n        try:\n            return scoped_storage.search_documents_by_metadata(\n                corpus_id=corpus_id,\n                filters=[flt.to_storage_dict() for flt in metadata_filters],\n                limit=limit,\n            )\n        finally:\n            cleanup()\n\n    def _acquire_query_storage(self) -> tuple[StorageBackend, Callable[[], None]]:\n        if isinstance(self.storage, DuckDBStorage):\n            clone = DuckDBStorage(\n                self.storage.db_path,\n                read_only=self.storage.read_only,\n                initialize=False,\n                embedding_dim=self.storage.embedding_dim,\n            )\n            return clone, clone.close\n        return self.storage, lambda: None\n\n    @staticmethod\n    def _merge_and_rank(\n        *,\n        semantic_rows: list[dict[str, Any]],\n        metadata_rows: list[dict[str, Any]],\n        limit: int,\n    ) -> list[RankedDocument]:\n        merged: dict[str, dict[str, Any]] = {}\n\n        for row in semantic_rows:\n            doc_id = str(row[\"doc_id\"])\n            score = float(row[\"score\"])\n            position = int(row[\"position\"])\n            entry = merged.setdefault(\n                doc_id,\n                {\n                    \"doc_id\": doc_id,\n                    \"relative_path\": str(row[\"relative_path\"]),\n                    \"absolute_path\": str(row[\"absolute_path\"]),\n                    \"position\": position,\n                    \"text\": str(row[\"text\"]),\n                    \"semantic_score\": 0.0,\n                    
\"metadata_score\": 0,\n                },\n            )\n            if score > float(entry[\"semantic_score\"]):\n                entry[\"semantic_score\"] = score\n                entry[\"position\"] = position\n                entry[\"text\"] = str(row[\"text\"])\n\n        for row in metadata_rows:\n            doc_id = str(row[\"doc_id\"])\n            entry = merged.setdefault(\n                doc_id,\n                {\n                    \"doc_id\": doc_id,\n                    \"relative_path\": str(row[\"relative_path\"]),\n                    \"absolute_path\": str(row[\"absolute_path\"]),\n                    \"position\": None,\n                    \"text\": str(row.get(\"preview_text\", \"\")),\n                    \"semantic_score\": 0.0,\n                    \"metadata_score\": 0,\n                },\n            )\n            entry[\"metadata_score\"] = max(\n                int(entry[\"metadata_score\"]),\n                int(row.get(\"metadata_score\", 1)),\n            )\n            if not entry[\"text\"]:\n                entry[\"text\"] = str(row.get(\"preview_text\", \"\"))\n\n        documents = [\n            RankedDocument(\n                doc_id=str(entry[\"doc_id\"]),\n                relative_path=str(entry[\"relative_path\"]),\n                absolute_path=str(entry[\"absolute_path\"]),\n                position=int(entry[\"position\"])\n                if entry[\"position\"] is not None\n                else None,\n                text=str(entry[\"text\"]),\n                semantic_score=float(entry[\"semantic_score\"]),\n                metadata_score=int(entry[\"metadata_score\"]),\n            )\n            for entry in merged.values()\n        ]\n        return rank_documents(documents, limit=limit)\n"
  },
  {
    "path": "src/fs_explorer/search/ranker.py",
    "content": "\"\"\"\nRanking helpers for merging retrieval result sets.\n\"\"\"\n\nfrom __future__ import annotations\n\nfrom dataclasses import dataclass\n\n\n@dataclass(frozen=True)\nclass RankedDocument:\n    \"\"\"Merged retrieval candidate for a document.\"\"\"\n\n    doc_id: str\n    relative_path: str\n    absolute_path: str\n    position: int | None\n    text: str\n    semantic_score: float\n    metadata_score: int\n\n    @property\n    def combined_score(self) -> float:\n        # Semantic scores dominate ordering; metadata score boosts ties and\n        # metadata-only matches into the candidate set.\n        return float(self.semantic_score * 100 + self.metadata_score * 10)\n\n    @property\n    def matched_by(self) -> str:\n        if self.semantic_score > 0 and self.metadata_score > 0:\n            return \"semantic+metadata\"\n        if self.semantic_score > 0:\n            return \"semantic\"\n        return \"metadata\"\n\n\ndef rank_documents(\n    documents: list[RankedDocument], *, limit: int\n) -> list[RankedDocument]:\n    \"\"\"Sort merged retrieval results and apply limit.\"\"\"\n    ordered = sorted(\n        documents,\n        key=lambda doc: (\n            -doc.combined_score,\n            -doc.semantic_score,\n            -doc.metadata_score,\n            doc.position if doc.position is not None else 10**9,\n            doc.relative_path,\n        ),\n    )\n    return ordered[: max(limit, 1)]\n"
  },
  {
    "path": "src/fs_explorer/search/semantic.py",
    "content": "\"\"\"\nVector-based semantic search engine.\n\nEmbeds a query and searches chunk embeddings via cosine similarity,\nfalling back to keyword matching when embeddings are unavailable.\n\"\"\"\n\nfrom __future__ import annotations\n\nfrom typing import Any\n\nfrom ..embeddings import EmbeddingProvider\nfrom ..storage import StorageBackend\n\n\nclass SemanticSearchEngine:\n    \"\"\"Embed a query and search stored chunk embeddings.\"\"\"\n\n    def __init__(\n        self,\n        storage: StorageBackend,\n        embedding_provider: EmbeddingProvider,\n    ) -> None:\n        self.storage = storage\n        self.embedding_provider = embedding_provider\n\n    def search(\n        self,\n        *,\n        corpus_id: str,\n        query: str,\n        limit: int = 5,\n    ) -> list[dict[str, Any]]:\n        \"\"\"Return ranked chunk hits using vector cosine similarity.\"\"\"\n        query_embedding = self.embedding_provider.embed_query(query)\n        return self.storage.search_chunks_semantic(\n            corpus_id=corpus_id,\n            query_embedding=query_embedding,\n            limit=limit,\n        )\n"
  },
  {
    "path": "src/fs_explorer/server.py",
    "content": "\"\"\"\nFastAPI server for FsExplorer web UI.\n\nProvides a WebSocket endpoint for real-time workflow streaming\nand serves the single-page HTML interface.\n\"\"\"\n\nimport asyncio\nfrom pathlib import Path\nfrom typing import Any\n\nfrom fastapi import FastAPI, WebSocket, WebSocketDisconnect\nfrom fastapi.responses import HTMLResponse, JSONResponse\nfrom pydantic import BaseModel\n\nfrom .agent import clear_index_context, set_index_context, set_search_flags\nfrom .embeddings import EmbeddingProvider\nfrom .exploration_trace import ExplorationTrace, extract_cited_sources\nfrom .index_config import resolve_db_path\nfrom .indexing import IndexingPipeline\nfrom .indexing.metadata import auto_discover_profile\nfrom .search import IndexedQueryEngine\nfrom .storage import DuckDBStorage\nfrom .workflow import (\n    AskHumanEvent,\n    GoDeeperEvent,\n    HumanAnswerEvent,\n    InputEvent,\n    ToolCallEvent,\n    get_agent,\n    reset_agent,\n    workflow,\n)\n\napp = FastAPI(title=\"FsExplorer\", description=\"AI-powered filesystem exploration\")\n\n_corpus_locks: dict[str, asyncio.Lock] = {}\n\n\ndef _get_corpus_lock(folder: str) -> asyncio.Lock:\n    \"\"\"Return a per-folder asyncio lock, creating one if needed.\"\"\"\n    normalized = str(Path(folder).resolve())\n    if normalized not in _corpus_locks:\n        _corpus_locks[normalized] = asyncio.Lock()\n    return _corpus_locks[normalized]\n\n\nclass TaskRequest(BaseModel):\n    \"\"\"Request model for task submission.\"\"\"\n\n    task: str\n    folder: str = \".\"\n    use_index: bool = False\n    db_path: str | None = None\n\n\nclass IndexRequest(BaseModel):\n    \"\"\"Request model for index build/refresh.\"\"\"\n\n    folder: str = \".\"\n    db_path: str | None = None\n    discover_schema: bool = True\n    schema_name: str | None = None\n    with_metadata: bool = False\n    metadata_profile: dict[str, Any] | None = None\n    with_embeddings: bool = False\n\n\nclass 
AutoProfileRequest(BaseModel):\n    \"\"\"Request model for auto-profile generation.\"\"\"\n\n    folder: str = \".\"\n\n\nclass SearchRequest(BaseModel):\n    \"\"\"Request model for search queries.\"\"\"\n\n    corpus_folder: str\n    query: str\n    filters: str | None = None\n    limit: int = 5\n    db_path: str | None = None\n\n\n@app.get(\"/\", response_class=HTMLResponse)\nasync def get_ui():\n    \"\"\"Serve the main UI HTML file.\"\"\"\n    html_path = Path(__file__).parent / \"ui.html\"\n    if html_path.exists():\n        return HTMLResponse(\n            content=html_path.read_text(encoding=\"utf-8\"), status_code=200\n        )\n    return HTMLResponse(content=\"<h1>UI not found</h1>\", status_code=404)\n\n\n@app.get(\"/api/folders\")\nasync def list_folders(path: str = \".\"):\n    \"\"\"\n    List folders in the given path.\n    Returns list of folder names and current path info.\n    \"\"\"\n    try:\n        base_path = Path(path).resolve()\n        if not base_path.exists():\n            return JSONResponse({\"error\": \"Path not found\"}, status_code=404)\n        if not base_path.is_dir():\n            return JSONResponse({\"error\": \"Not a directory\"}, status_code=400)\n\n        # Get folders (non-hidden)\n        folders = sorted(\n            [\n                f.name\n                for f in base_path.iterdir()\n                if f.is_dir() and not f.name.startswith(\".\")\n            ]\n        )\n\n        # Get parent path (if not at root)\n        parent = str(base_path.parent) if base_path != base_path.parent else None\n\n        return {\n            \"current\": str(base_path),\n            \"parent\": parent,\n            \"folders\": folders,\n            \"files_count\": len([f for f in base_path.iterdir() if f.is_file()]),\n        }\n    except PermissionError:\n        return JSONResponse({\"error\": \"Permission denied\"}, status_code=403)\n    except Exception as e:\n        return JSONResponse({\"error\": str(e)}, 
status_code=500)\n\n\n@app.get(\"/api/index/status\")\nasync def index_status(folder: str, db_path: str | None = None):\n    \"\"\"Check whether a folder has been indexed and return status details.\"\"\"\n    try:\n        folder_path = Path(folder).resolve()\n        if not folder_path.exists() or not folder_path.is_dir():\n            return {\"indexed\": False}\n\n        resolved_db_path = resolve_db_path(db_path)\n        if not Path(resolved_db_path).exists():\n            return {\"indexed\": False}\n\n        try:\n            storage = DuckDBStorage(resolved_db_path, read_only=True, initialize=False)\n        except Exception:\n            return {\"indexed\": False}\n\n        try:\n            corpus_id = storage.get_corpus_id(str(folder_path))\n            if corpus_id is None:\n                storage.close()\n                return {\"indexed\": False}\n\n            docs = storage.list_documents(corpus_id=corpus_id, include_deleted=False)\n            active_schema = storage.get_active_schema(corpus_id=corpus_id)\n            has_embeddings = storage.has_embeddings(corpus_id=corpus_id)\n\n            schema_name: str | None = None\n            has_metadata = False\n            schema_fields: list[str] = []\n            if active_schema is not None:\n                schema_name = active_schema.name\n                has_metadata = (\n                    active_schema.schema_def.get(\"metadata_profile\") is not None\n                )\n                fields_def = active_schema.schema_def.get(\"fields\")\n                if isinstance(fields_def, list):\n                    for f in fields_def:\n                        if isinstance(f, dict) and isinstance(f.get(\"name\"), str):\n                            schema_fields.append(f[\"name\"])\n\n            storage.close()\n            return {\n                \"indexed\": True,\n                \"corpus_id\": corpus_id,\n                \"document_count\": len(docs),\n                \"schema_name\": 
schema_name,\n                \"has_metadata\": has_metadata,\n                \"has_embeddings\": has_embeddings,\n                \"schema_fields\": schema_fields,\n            }\n        except Exception:\n            storage.close()\n            return {\"indexed\": False}\n    except Exception:\n        return {\"indexed\": False}\n\n\n@app.post(\"/api/index/auto-profile\")\nasync def generate_auto_profile(request: AutoProfileRequest):\n    \"\"\"Generate an auto-discovered metadata profile for preview/editing.\"\"\"\n    try:\n        folder_path = Path(request.folder).resolve()\n        if not folder_path.exists() or not folder_path.is_dir():\n            return JSONResponse(\n                {\"error\": f\"Invalid folder: {request.folder}\"}, status_code=400\n            )\n\n        profile = await asyncio.to_thread(auto_discover_profile, str(folder_path))\n        return {\"profile\": profile}\n    except Exception as exc:\n        return JSONResponse({\"error\": str(exc)}, status_code=500)\n\n\n@app.post(\"/api/index\")\nasync def build_index(request: IndexRequest):\n    \"\"\"Build or refresh the index for a selected folder.\"\"\"\n    try:\n        folder_path = Path(request.folder).resolve()\n        if not folder_path.exists():\n            return JSONResponse({\"error\": \"Path not found\"}, status_code=404)\n        if not folder_path.is_dir():\n            return JSONResponse({\"error\": \"Not a directory\"}, status_code=400)\n\n        lock = _get_corpus_lock(str(folder_path))\n        async with lock:\n            resolved_db_path = resolve_db_path(request.db_path)\n            embedding_provider: EmbeddingProvider | None = None\n            if request.with_embeddings:\n                try:\n                    embedding_provider = EmbeddingProvider()\n                except ValueError:\n                    embedding_provider = None\n            pipeline = IndexingPipeline(\n                storage=DuckDBStorage(resolved_db_path),\n              
  embedding_provider=embedding_provider,\n            )\n            effective_with_metadata = (\n                request.with_metadata or request.metadata_profile is not None\n            )\n            discover_schema = request.discover_schema or effective_with_metadata\n            result = pipeline.index_folder(\n                str(folder_path),\n                discover_schema=discover_schema,\n                schema_name=request.schema_name,\n                with_metadata=effective_with_metadata,\n                metadata_profile=request.metadata_profile,\n            )\n\n        return {\n            \"db_path\": resolved_db_path,\n            \"folder\": str(folder_path),\n            \"corpus_id\": result.corpus_id,\n            \"indexed_files\": result.indexed_files,\n            \"skipped_files\": result.skipped_files,\n            \"deleted_files\": result.deleted_files,\n            \"chunks_written\": result.chunks_written,\n            \"active_documents\": result.active_documents,\n            \"schema_used\": result.schema_used,\n            \"embeddings_written\": result.embeddings_written,\n            \"metadata_mode\": \"langextract\" if effective_with_metadata else \"heuristic\",\n        }\n    except ValueError as exc:\n        return JSONResponse({\"error\": str(exc)}, status_code=400)\n    except PermissionError:\n        return JSONResponse({\"error\": \"Permission denied\"}, status_code=403)\n    except Exception as exc:\n        return JSONResponse({\"error\": str(exc)}, status_code=500)\n\n\n@app.post(\"/api/search\")\nasync def search_index(request: SearchRequest):\n    \"\"\"Search an indexed corpus and return ranked hits.\"\"\"\n    try:\n        folder_path = Path(request.corpus_folder).resolve()\n        if not folder_path.exists() or not folder_path.is_dir():\n            return JSONResponse(\n                {\"error\": f\"Invalid folder: {request.corpus_folder}\"}, status_code=400\n            )\n\n        resolved_db_path = 
resolve_db_path(request.db_path)\n        storage = DuckDBStorage(resolved_db_path, read_only=True, initialize=False)\n        corpus_id = storage.get_corpus_id(str(folder_path))\n        if corpus_id is None:\n            storage.close()\n            return JSONResponse(\n                {\"error\": \"No index found for this folder.\"}, status_code=404\n            )\n\n        embedding_provider: EmbeddingProvider | None = None\n        if storage.has_embeddings(corpus_id=corpus_id):\n            try:\n                embedding_provider = EmbeddingProvider()\n            except ValueError:\n                pass\n\n        engine = IndexedQueryEngine(storage, embedding_provider=embedding_provider)\n        hits = engine.search(\n            corpus_id=corpus_id,\n            query=request.query,\n            filters=request.filters,\n            limit=request.limit,\n        )\n        storage.close()\n\n        return {\n            \"corpus_folder\": str(folder_path),\n            \"query\": request.query,\n            \"hits\": [\n                {\n                    \"doc_id\": hit.doc_id,\n                    \"relative_path\": hit.relative_path,\n                    \"absolute_path\": hit.absolute_path,\n                    \"position\": hit.position,\n                    \"text\": hit.text,\n                    \"semantic_score\": hit.semantic_score,\n                    \"metadata_score\": hit.metadata_score,\n                    \"score\": hit.score,\n                    \"matched_by\": hit.matched_by,\n                }\n                for hit in hits\n            ],\n        }\n    except Exception as exc:\n        return JSONResponse({\"error\": str(exc)}, status_code=500)\n\n\n@app.websocket(\"/ws/explore\")\nasync def websocket_explore(websocket: WebSocket):\n    \"\"\"\n    WebSocket endpoint for real-time exploration streaming.\n\n    Protocol:\n    1. Client sends: {\"task\": \"user question\"}\n    2. 
Server streams events: {\"type\": \"...\", \"data\": {...}}\n    3. Final event: {\"type\": \"complete\", \"data\": {...}}\n    \"\"\"\n    await websocket.accept()\n\n    try:\n        # Receive the task\n        data = await websocket.receive_json()\n        task = data.get(\"task\", \"\")\n        folder = data.get(\"folder\", \".\")\n        use_index = bool(data.get(\"use_index\", False))\n        db_path = data.get(\"db_path\")\n        enable_semantic = bool(data.get(\"enable_semantic\", False))\n        enable_metadata = bool(data.get(\"enable_metadata\", False))\n        index_storage: DuckDBStorage | None = None\n\n        if not task:\n            await websocket.send_json(\n                {\"type\": \"error\", \"data\": {\"message\": \"No task provided\"}}\n            )\n            return\n\n        # Validate folder\n        folder_path = Path(folder).resolve()\n        if not folder_path.exists() or not folder_path.is_dir():\n            await websocket.send_json(\n                {\"type\": \"error\", \"data\": {\"message\": f\"Invalid folder: {folder}\"}}\n            )\n            return\n\n        clear_index_context()\n        if use_index:\n            resolved_db_path = resolve_db_path(\n                db_path if isinstance(db_path, str) else None\n            )\n            storage = DuckDBStorage(resolved_db_path)\n            corpus_id = storage.get_corpus_id(str(folder_path))\n            if corpus_id is None:\n                await websocket.send_json(\n                    {\n                        \"type\": \"error\",\n                        \"data\": {\n                            \"message\": (\n                                \"No index found for the selected folder. 
\"\n                                \"Run \`explore index <folder>\` first.\"\n                            )\n                        },\n                    }\n                )\n                storage.close()\n                return\n            index_storage = storage\n            set_index_context(str(folder_path), resolved_db_path)\n\n        set_search_flags(\n            enable_semantic=enable_semantic and use_index,\n            enable_metadata=enable_metadata and use_index,\n        )\n\n        trace = ExplorationTrace(root_directory=str(folder_path))\n\n        # Reset agent for fresh state\n        reset_agent()\n\n        # Send start event\n        await websocket.send_json(\n            {\n                \"type\": \"start\",\n                \"data\": {\n                    \"task\": task,\n                    \"folder\": str(folder_path),\n                    \"use_index\": use_index,\n                },\n            }\n        )\n\n        # Run the workflow\n        step_number = 0\n        handler = workflow.run(\n            start_event=InputEvent(\n                task=task,\n                folder=str(folder_path),\n                use_index=use_index,\n                enable_semantic=enable_semantic and use_index,\n                enable_metadata=enable_metadata and use_index,\n            )\n        )\n\n        async for event in handler.stream_events():\n            if isinstance(event, ToolCallEvent):\n                step_number += 1\n                resolved_document_path: str | None = None\n                if event.tool_name == \"get_document\":\n                    doc_id = event.tool_input.get(\"doc_id\")\n                    if index_storage is not None and isinstance(doc_id, str) and doc_id:\n                        document = index_storage.get_document(doc_id=doc_id)\n                        if document and not document[\"is_deleted\"]:\n                            resolved_document_path = str(document[\"absolute_path\"])\n                
trace.record_tool_call(\n                    step_number=step_number,\n                    tool_name=event.tool_name,\n                    tool_input=event.tool_input,\n                    resolved_document_path=resolved_document_path,\n                )\n                await websocket.send_json(\n                    {\n                        \"type\": \"tool_call\",\n                        \"data\": {\n                            \"step\": step_number,\n                            \"tool_name\": event.tool_name,\n                            \"tool_input\": event.tool_input,\n                            \"reason\": event.reason,\n                        },\n                    }\n                )\n\n            elif isinstance(event, GoDeeperEvent):\n                step_number += 1\n                trace.record_go_deeper(\n                    step_number=step_number, directory=event.directory\n                )\n                await websocket.send_json(\n                    {\n                        \"type\": \"go_deeper\",\n                        \"data\": {\n                            \"step\": step_number,\n                            \"directory\": event.directory,\n                            \"reason\": event.reason,\n                        },\n                    }\n                )\n\n            elif isinstance(event, AskHumanEvent):\n                step_number += 1\n                await websocket.send_json(\n                    {\n                        \"type\": \"ask_human\",\n                        \"data\": {\n                            \"step\": step_number,\n                            \"question\": event.question,\n                            \"reason\": event.reason,\n                        },\n                    }\n                )\n\n                # Wait for human response\n                response_data = await websocket.receive_json()\n                if response_data.get(\"type\") == \"human_response\":\n                   
 handler.ctx.send_event(\n                        HumanAnswerEvent(response=response_data.get(\"response\", \"\"))\n                    )\n\n        # Get final result\n        result = await handler\n        cited_sources = extract_cited_sources(result.final_result)\n\n        # Get token usage\n        agent = get_agent()\n        usage = agent.token_usage\n        input_cost, output_cost, total_cost = usage._calculate_cost()\n\n        await websocket.send_json(\n            {\n                \"type\": \"complete\",\n                \"data\": {\n                    \"final_result\": result.final_result,\n                    \"error\": result.error,\n                    \"stats\": {\n                        \"steps\": step_number,\n                        \"api_calls\": usage.api_calls,\n                        \"documents_scanned\": usage.documents_scanned,\n                        \"documents_parsed\": usage.documents_parsed,\n                        \"prompt_tokens\": usage.prompt_tokens,\n                        \"completion_tokens\": usage.completion_tokens,\n                        \"total_tokens\": usage.total_tokens,\n                        \"tool_result_chars\": usage.tool_result_chars,\n                        \"estimated_cost\": round(total_cost, 6),\n                    },\n                    \"trace\": {\n                        \"step_path\": trace.step_path,\n                        \"referenced_documents\": trace.sorted_documents(),\n                        \"cited_sources\": cited_sources,\n                    },\n                },\n            }\n        )\n\n    except WebSocketDisconnect:\n        pass\n    except Exception as e:\n        await websocket.send_json({\"type\": \"error\", \"data\": {\"message\": str(e)}})\n    finally:\n        set_search_flags(enable_semantic=False, enable_metadata=False)\n        clear_index_context()\n\n\ndef run_server(host: str = \"127.0.0.1\", port: int = 8000):\n    \"\"\"Run the FastAPI 
server.\"\"\"\n    import uvicorn\n\n    uvicorn.run(app, host=host, port=port)\n\n\nif __name__ == \"__main__\":\n    run_server()\n"
  },
  {
    "path": "src/fs_explorer/storage/__init__.py",
    "content": "\"\"\"Storage backends for FsExplorer indexing.\"\"\"\n\nfrom .base import ChunkRecord, DocumentRecord, SchemaRecord, StorageBackend\nfrom .duckdb import DuckDBStorage\n\n__all__ = [\n    \"ChunkRecord\",\n    \"DocumentRecord\",\n    \"SchemaRecord\",\n    \"StorageBackend\",\n    \"DuckDBStorage\",\n]\n"
  },
  {
    "path": "src/fs_explorer/storage/base.py",
    "content": "\"\"\"\nStorage interfaces and data models for index persistence.\n\"\"\"\n\nfrom __future__ import annotations\n\nfrom dataclasses import dataclass\nfrom typing import Any, Protocol\n\n\n@dataclass(frozen=True)\nclass ChunkRecord:\n    \"\"\"A text chunk stored for a document.\"\"\"\n\n    id: str\n    doc_id: str\n    text: str\n    position: int\n    start_char: int\n    end_char: int\n    embedding: list[float] | None = None\n\n\n@dataclass(frozen=True)\nclass DocumentRecord:\n    \"\"\"A normalized document record for indexing.\"\"\"\n\n    id: str\n    corpus_id: str\n    relative_path: str\n    absolute_path: str\n    content: str\n    metadata_json: str\n    file_mtime: float\n    file_size: int\n    content_sha256: str\n\n\n@dataclass(frozen=True)\nclass SchemaRecord:\n    \"\"\"A stored schema entry.\"\"\"\n\n    id: str\n    corpus_id: str\n    name: str\n    schema_def: dict[str, Any]\n    is_active: bool\n    created_at: str\n\n\nclass StorageBackend(Protocol):\n    \"\"\"Protocol for persistence operations used by indexing and schema workflows.\"\"\"\n\n    def initialize(self) -> None:\n        \"\"\"Initialize required tables/indexes.\"\"\"\n\n    def get_or_create_corpus(self, root_path: str) -> str:\n        \"\"\"Return corpus id for a root path, creating if needed.\"\"\"\n\n    def get_corpus_id(self, root_path: str) -> str | None:\n        \"\"\"Return corpus id for a root path if present.\"\"\"\n\n    def upsert_document(\n        self, document: DocumentRecord, chunks: list[ChunkRecord]\n    ) -> None:\n        \"\"\"Insert or update a document and replace its chunks.\"\"\"\n\n    def mark_deleted_missing_documents(\n        self,\n        *,\n        corpus_id: str,\n        active_relative_paths: set[str],\n    ) -> int:\n        \"\"\"Mark documents deleted when not present in the latest index run.\"\"\"\n\n    def list_documents(\n        self,\n        *,\n        corpus_id: str,\n        include_deleted: bool = False,\n  
  ) -> list[dict[str, Any]]:\n        \"\"\"List documents for a corpus.\"\"\"\n\n    def count_chunks(self, *, corpus_id: str) -> int:\n        \"\"\"Count chunks for active documents in a corpus.\"\"\"\n\n    def search_chunks(\n        self,\n        *,\n        corpus_id: str,\n        query: str,\n        limit: int = 5,\n    ) -> list[dict[str, Any]]:\n        \"\"\"Search indexed chunks and return ranked matches.\"\"\"\n\n    def search_documents_by_metadata(\n        self,\n        *,\n        corpus_id: str,\n        filters: list[dict[str, Any]],\n        limit: int = 20,\n    ) -> list[dict[str, Any]]:\n        \"\"\"Search indexed documents by metadata filters.\"\"\"\n\n    def get_document(self, *, doc_id: str) -> dict[str, Any] | None:\n        \"\"\"Get a document by id.\"\"\"\n\n    def save_schema(\n        self,\n        *,\n        corpus_id: str,\n        name: str,\n        schema_def: dict[str, Any],\n        is_active: bool = True,\n    ) -> str:\n        \"\"\"Create or update a schema entry.\"\"\"\n\n    def list_schemas(self, *, corpus_id: str) -> list[SchemaRecord]:\n        \"\"\"List all schemas for a corpus.\"\"\"\n\n    def get_schema_by_name(self, *, corpus_id: str, name: str) -> SchemaRecord | None:\n        \"\"\"Fetch a schema by name.\"\"\"\n\n    def get_active_schema(self, *, corpus_id: str) -> SchemaRecord | None:\n        \"\"\"Fetch active schema for a corpus if present.\"\"\"\n\n    def store_chunk_embeddings(\n        self,\n        *,\n        corpus_id: str,\n        chunk_embeddings: list[tuple[str, list[float]]],\n    ) -> int:\n        \"\"\"Bulk-store (chunk_id, embedding) pairs. 
Return count written.\"\"\"\n\n    def search_chunks_semantic(\n        self,\n        *,\n        corpus_id: str,\n        query_embedding: list[float],\n        limit: int = 5,\n    ) -> list[dict[str, Any]]:\n        \"\"\"Search chunks by cosine similarity against a query embedding.\"\"\"\n\n    def get_metadata_field_values(\n        self,\n        *,\n        corpus_id: str,\n        field_names: list[str],\n        max_distinct: int = 10,\n    ) -> dict[str, list[str]]:\n        \"\"\"Return up to *max_distinct* distinct non-empty values per metadata field.\"\"\"\n\n    def has_embeddings(self, *, corpus_id: str) -> bool:\n        \"\"\"Return True if the corpus has stored embeddings.\"\"\"\n"
  },
  {
    "path": "src/fs_explorer/storage/duckdb.py",
    "content": "\"\"\"\nDuckDB storage backend for index persistence.\n\"\"\"\n\nfrom __future__ import annotations\n\nimport hashlib\nimport json\nimport re\nfrom pathlib import Path\nfrom typing import Any\n\nimport duckdb\n\nfrom .base import ChunkRecord, DocumentRecord, SchemaRecord\n\n\ndef _stable_id(prefix: str, value: str) -> str:\n    digest = hashlib.sha1(value.encode(\"utf-8\")).hexdigest()\n    return f\"{prefix}_{digest}\"\n\n\ndef _query_terms(query: str, max_terms: int = 8) -> list[str]:\n    terms = re.findall(r\"[a-zA-Z0-9_]{3,}\", query.lower())\n    unique_terms: list[str] = []\n    for term in terms:\n        if term not in unique_terms:\n            unique_terms.append(term)\n        if len(unique_terms) >= max_terms:\n            break\n    if unique_terms:\n        return unique_terms\n    fallback = query.strip().lower()\n    return [fallback] if fallback else []\n\n\nclass DuckDBStorage:\n    \"\"\"DuckDB-backed persistence for corpora, documents, chunks, and schemas.\"\"\"\n\n    def __init__(\n        self,\n        db_path: str,\n        *,\n        read_only: bool = False,\n        initialize: bool = True,\n        embedding_dim: int = 768,\n    ) -> None:\n        self.db_path = str(Path(db_path).expanduser().resolve())\n        self.read_only = read_only\n        self.embedding_dim = embedding_dim\n        Path(self.db_path).parent.mkdir(parents=True, exist_ok=True)\n        self._conn = duckdb.connect(self.db_path, read_only=read_only)\n        self._vss_available = False\n        if initialize and not read_only:\n            self.initialize()\n        if not read_only:\n            self._try_load_vss()\n\n    def close(self) -> None:\n        \"\"\"Close the underlying DuckDB connection.\"\"\"\n        self._conn.close()\n\n    def initialize(self) -> None:\n        self._conn.execute(\n            \"\"\"\n            CREATE TABLE IF NOT EXISTS corpora (\n                id VARCHAR PRIMARY KEY,\n                root_path VARCHAR NOT 
NULL UNIQUE,\n                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP\n            );\n            \"\"\"\n        )\n        self._conn.execute(\n            \"\"\"\n            CREATE TABLE IF NOT EXISTS documents (\n                id VARCHAR PRIMARY KEY,\n                corpus_id VARCHAR NOT NULL REFERENCES corpora(id),\n                relative_path VARCHAR NOT NULL,\n                absolute_path VARCHAR NOT NULL,\n                content VARCHAR NOT NULL,\n                metadata_json VARCHAR NOT NULL DEFAULT '{}',\n                file_mtime DOUBLE NOT NULL,\n                file_size BIGINT NOT NULL,\n                content_sha256 VARCHAR NOT NULL,\n                last_indexed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,\n                is_deleted BOOLEAN DEFAULT FALSE,\n                UNIQUE(corpus_id, relative_path)\n            );\n            \"\"\"\n        )\n        self._conn.execute(\n            \"\"\"\n            CREATE TABLE IF NOT EXISTS chunks (\n                id VARCHAR PRIMARY KEY,\n                doc_id VARCHAR NOT NULL REFERENCES documents(id),\n                text VARCHAR NOT NULL,\n                position INTEGER NOT NULL,\n                start_char INTEGER NOT NULL,\n                end_char INTEGER NOT NULL\n            );\n            \"\"\"\n        )\n        self._conn.execute(\n            \"\"\"\n            CREATE TABLE IF NOT EXISTS schemas (\n                id VARCHAR PRIMARY KEY,\n                corpus_id VARCHAR NOT NULL REFERENCES corpora(id),\n                name VARCHAR NOT NULL,\n                schema_def VARCHAR NOT NULL,\n                is_active BOOLEAN DEFAULT FALSE,\n                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,\n                UNIQUE(corpus_id, name)\n            );\n            \"\"\"\n        )\n        self._conn.execute(\n            f\"\"\"\n            CREATE TABLE IF NOT EXISTS chunk_embeddings (\n                chunk_id VARCHAR PRIMARY KEY REFERENCES chunks(id),\n    
            corpus_id VARCHAR NOT NULL,\n                embedding FLOAT[{self.embedding_dim}] NOT NULL\n            );\n            \"\"\"\n        )\n\n    def _try_load_vss(self) -> None:\n        \"\"\"Attempt to install and load the vss extension for HNSW acceleration.\"\"\"\n        try:\n            self._conn.execute(\"INSTALL vss\")\n            self._conn.execute(\"LOAD vss\")\n            self._vss_available = True\n        except Exception:\n            self._vss_available = False\n\n    def get_or_create_corpus(self, root_path: str) -> str:\n        normalized = str(Path(root_path).resolve())\n        corpus_id = _stable_id(\"corpus\", normalized)\n        self._conn.execute(\n            \"\"\"\n            INSERT INTO corpora (id, root_path)\n            VALUES (?, ?)\n            ON CONFLICT(root_path) DO NOTHING\n            \"\"\",\n            [corpus_id, normalized],\n        )\n        row = self._conn.execute(\n            \"SELECT id FROM corpora WHERE root_path = ?\",\n            [normalized],\n        ).fetchone()\n        if row is None:\n            raise RuntimeError(f\"Failed to create corpus for path: {normalized}\")\n        return str(row[0])\n\n    def get_corpus_id(self, root_path: str) -> str | None:\n        normalized = str(Path(root_path).resolve())\n        row = self._conn.execute(\n            \"SELECT id FROM corpora WHERE root_path = ?\",\n            [normalized],\n        ).fetchone()\n        if row is None:\n            return None\n        return str(row[0])\n\n    def upsert_document(\n        self, document: DocumentRecord, chunks: list[ChunkRecord]\n    ) -> None:\n        # Cascade-delete embeddings for old chunks, then remove old chunks.\n        self._conn.execute(\n            \"\"\"\n            DELETE FROM chunk_embeddings\n            WHERE chunk_id IN (SELECT id FROM chunks WHERE doc_id = ?)\n            \"\"\",\n            [document.id],\n        )\n        self._conn.execute(\"DELETE FROM chunks WHERE 
doc_id = ?\", [document.id])\n\n        self._conn.execute(\n            \"\"\"\n            INSERT INTO documents (\n                id, corpus_id, relative_path, absolute_path, content, metadata_json,\n                file_mtime, file_size, content_sha256, is_deleted\n            )\n            VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, FALSE)\n            ON CONFLICT(id) DO UPDATE SET\n                corpus_id = excluded.corpus_id,\n                relative_path = excluded.relative_path,\n                absolute_path = excluded.absolute_path,\n                content = excluded.content,\n                metadata_json = excluded.metadata_json,\n                file_mtime = excluded.file_mtime,\n                file_size = excluded.file_size,\n                content_sha256 = excluded.content_sha256,\n                last_indexed_at = now(),\n                is_deleted = FALSE\n            \"\"\",\n            [\n                document.id,\n                document.corpus_id,\n                document.relative_path,\n                document.absolute_path,\n                document.content,\n                document.metadata_json,\n                document.file_mtime,\n                document.file_size,\n                document.content_sha256,\n            ],\n        )\n\n        if chunks:\n            self._conn.executemany(\n                \"\"\"\n                INSERT INTO chunks (id, doc_id, text, position, start_char, end_char)\n                VALUES (?, ?, ?, ?, ?, ?)\n                \"\"\",\n                [\n                    (\n                        chunk.id,\n                        chunk.doc_id,\n                        chunk.text,\n                        chunk.position,\n                        chunk.start_char,\n                        chunk.end_char,\n                    )\n                    for chunk in chunks\n                ],\n            )\n\n    def mark_deleted_missing_documents(\n        self,\n        *,\n        corpus_id: 
str,\n        active_relative_paths: set[str],\n    ) -> int:\n        \"\"\"Mark documents whose relative_path is absent from *active_relative_paths* as deleted.\n\n        Returns the total number of deleted documents in the corpus, not only\n        those newly marked by this call.\n        \"\"\"\n        if not active_relative_paths:\n            self._conn.execute(\n                \"\"\"\n                UPDATE documents\n                SET is_deleted = TRUE\n                WHERE corpus_id = ? AND is_deleted = FALSE\n                \"\"\",\n                [corpus_id],\n            )\n        else:\n            placeholders = \", \".join([\"?\"] * len(active_relative_paths))\n            params: list[Any] = [corpus_id]\n            params.extend(sorted(active_relative_paths))\n            self._conn.execute(\n                f\"\"\"\n                UPDATE documents\n                SET is_deleted = TRUE\n                WHERE corpus_id = ?\n                  AND is_deleted = FALSE\n                  AND relative_path NOT IN ({placeholders})\n                \"\"\",\n                params,\n            )\n\n        row = self._conn.execute(\n            \"\"\"\n            SELECT COUNT(*)\n            FROM documents\n            WHERE corpus_id = ? 
AND is_deleted = TRUE\n            \"\"\",\n            [corpus_id],\n        ).fetchone()\n        return int(row[0]) if row else 0\n\n    def list_documents(\n        self,\n        *,\n        corpus_id: str,\n        include_deleted: bool = False,\n    ) -> list[dict[str, Any]]:\n        sql = \"\"\"\n            SELECT id, relative_path, absolute_path, file_size, file_mtime, is_deleted\n            FROM documents\n            WHERE corpus_id = ?\n        \"\"\"\n        params: list[Any] = [corpus_id]\n        if not include_deleted:\n            sql += \" AND is_deleted = FALSE\"\n        sql += \" ORDER BY relative_path\"\n\n        rows = self._conn.execute(sql, params).fetchall()\n        results: list[dict[str, Any]] = []\n        for row in rows:\n            results.append(\n                {\n                    \"id\": str(row[0]),\n                    \"relative_path\": str(row[1]),\n                    \"absolute_path\": str(row[2]),\n                    \"file_size\": int(row[3]),\n                    \"file_mtime\": float(row[4]),\n                    \"is_deleted\": bool(row[5]),\n                }\n            )\n        return results\n\n    def count_chunks(self, *, corpus_id: str) -> int:\n        row = self._conn.execute(\n            \"\"\"\n            SELECT COUNT(*)\n            FROM chunks c\n            JOIN documents d ON d.id = c.doc_id\n            WHERE d.corpus_id = ? AND d.is_deleted = FALSE\n            \"\"\",\n            [corpus_id],\n        ).fetchone()\n        return int(row[0]) if row else 0\n\n    def search_chunks(\n        self,\n        *,\n        corpus_id: str,\n        query: str,\n        limit: int = 5,\n    ) -> list[dict[str, Any]]:\n        terms = _query_terms(query)\n        if not terms:\n            return []\n\n        score_expr = \" + \".join(\n            [\"CASE WHEN lower(c.text) LIKE '%' || ? 
|| '%' THEN 1 ELSE 0 END\"]\n            * len(terms)\n        )\n        sql = f\"\"\"\n            SELECT * FROM (\n                SELECT\n                    d.id AS doc_id,\n                    d.relative_path,\n                    d.absolute_path,\n                    c.position,\n                    c.text,\n                    ({score_expr}) AS score\n                FROM chunks c\n                JOIN documents d ON d.id = c.doc_id\n                WHERE d.corpus_id = ?\n                  AND d.is_deleted = FALSE\n            ) ranked\n            WHERE score > 0\n            ORDER BY score DESC, relative_path ASC, position ASC\n            LIMIT ?\n        \"\"\"\n        params: list[Any] = []\n        params.extend(terms)\n        params.append(corpus_id)\n        params.append(limit)\n        rows = self._conn.execute(sql, params).fetchall()\n\n        results: list[dict[str, Any]] = []\n        for row in rows:\n            results.append(\n                {\n                    \"doc_id\": str(row[0]),\n                    \"relative_path\": str(row[1]),\n                    \"absolute_path\": str(row[2]),\n                    \"position\": int(row[3]),\n                    \"text\": str(row[4]),\n                    \"score\": int(row[5]),\n                }\n            )\n        return results\n\n    def search_documents_by_metadata(\n        self,\n        *,\n        corpus_id: str,\n        filters: list[dict[str, Any]],\n        limit: int = 20,\n    ) -> list[dict[str, Any]]:\n        if not filters:\n            return []\n\n        sql = \"\"\"\n            SELECT\n                d.id,\n                d.relative_path,\n                d.absolute_path,\n                substring(d.content, 1, 320) AS preview_text\n            FROM documents d\n            WHERE d.corpus_id = ?\n              AND d.is_deleted = FALSE\n        \"\"\"\n        params: list[Any] = [corpus_id]\n\n        for flt in filters:\n            field = 
str(flt[\"field\"])\n            operator = str(flt[\"operator\"])\n            value = flt[\"value\"]\n            clause, clause_params = self._metadata_clause(\n                field=field,\n                operator=operator,\n                value=value,\n            )\n            sql += f\"\\n  AND {clause}\"\n            params.extend(clause_params)\n\n        sql += \"\\nORDER BY d.relative_path ASC\\nLIMIT ?\"\n        params.append(limit)\n        rows = self._conn.execute(sql, params).fetchall()\n        metadata_score = len(filters)\n        results: list[dict[str, Any]] = []\n        for row in rows:\n            results.append(\n                {\n                    \"doc_id\": str(row[0]),\n                    \"relative_path\": str(row[1]),\n                    \"absolute_path\": str(row[2]),\n                    \"preview_text\": str(row[3]),\n                    \"metadata_score\": metadata_score,\n                }\n            )\n        return results\n\n    def get_document(self, *, doc_id: str) -> dict[str, Any] | None:\n        row = self._conn.execute(\n            \"\"\"\n            SELECT\n                id, corpus_id, relative_path, absolute_path, content, metadata_json, is_deleted\n            FROM documents\n            WHERE id = ?\n            LIMIT 1\n            \"\"\",\n            [doc_id],\n        ).fetchone()\n        if row is None:\n            return None\n        return {\n            \"id\": str(row[0]),\n            \"corpus_id\": str(row[1]),\n            \"relative_path\": str(row[2]),\n            \"absolute_path\": str(row[3]),\n            \"content\": str(row[4]),\n            \"metadata_json\": str(row[5]),\n            \"is_deleted\": bool(row[6]),\n        }\n\n    def save_schema(\n        self,\n        *,\n        corpus_id: str,\n        name: str,\n        schema_def: dict[str, Any],\n        is_active: bool = True,\n    ) -> str:\n        schema_id = _stable_id(\"schema\", f\"{corpus_id}:{name}\")\n     
   if is_active:\n            self._conn.execute(\n                \"UPDATE schemas SET is_active = FALSE WHERE corpus_id = ?\",\n                [corpus_id],\n            )\n\n        self._conn.execute(\n            \"\"\"\n            INSERT INTO schemas (id, corpus_id, name, schema_def, is_active)\n            VALUES (?, ?, ?, ?, ?)\n            ON CONFLICT(corpus_id, name) DO UPDATE SET\n                schema_def = excluded.schema_def,\n                is_active = excluded.is_active\n            \"\"\",\n            [\n                schema_id,\n                corpus_id,\n                name,\n                json.dumps(schema_def, sort_keys=True),\n                is_active,\n            ],\n        )\n        return schema_id\n\n    def list_schemas(self, *, corpus_id: str) -> list[SchemaRecord]:\n        rows = self._conn.execute(\n            \"\"\"\n            SELECT id, corpus_id, name, schema_def, is_active, created_at\n            FROM schemas\n            WHERE corpus_id = ?\n            ORDER BY created_at DESC, name ASC\n            \"\"\",\n            [corpus_id],\n        ).fetchall()\n        return [self._row_to_schema_record(row) for row in rows]\n\n    def get_schema_by_name(self, *, corpus_id: str, name: str) -> SchemaRecord | None:\n        row = self._conn.execute(\n            \"\"\"\n            SELECT id, corpus_id, name, schema_def, is_active, created_at\n            FROM schemas\n            WHERE corpus_id = ? AND name = ?\n            LIMIT 1\n            \"\"\",\n            [corpus_id, name],\n        ).fetchone()\n        if row is None:\n            return None\n        return self._row_to_schema_record(row)\n\n    def get_active_schema(self, *, corpus_id: str) -> SchemaRecord | None:\n        row = self._conn.execute(\n            \"\"\"\n            SELECT id, corpus_id, name, schema_def, is_active, created_at\n            FROM schemas\n            WHERE corpus_id = ? 
AND is_active = TRUE\n            ORDER BY created_at DESC\n            LIMIT 1\n            \"\"\",\n            [corpus_id],\n        ).fetchone()\n        if row is None:\n            return None\n        return self._row_to_schema_record(row)\n\n    @staticmethod\n    def make_document_id(corpus_id: str, relative_path: str) -> str:\n        return _stable_id(\"doc\", f\"{corpus_id}:{relative_path}\")\n\n    @staticmethod\n    def make_chunk_id(\n        doc_id: str, position: int, start_char: int, end_char: int\n    ) -> str:\n        return _stable_id(\"chunk\", f\"{doc_id}:{position}:{start_char}:{end_char}\")\n\n    @staticmethod\n    def _row_to_schema_record(row: tuple[Any, ...]) -> SchemaRecord:\n        return SchemaRecord(\n            id=str(row[0]),\n            corpus_id=str(row[1]),\n            name=str(row[2]),\n            schema_def=json.loads(str(row[3])),\n            is_active=bool(row[4]),\n            created_at=str(row[5]),\n        )\n\n    def store_chunk_embeddings(\n        self,\n        *,\n        corpus_id: str,\n        chunk_embeddings: list[tuple[str, list[float]]],\n    ) -> int:\n        \"\"\"Bulk-store (chunk_id, embedding) pairs. 
Return count written.\"\"\"\n        if not chunk_embeddings:\n            return 0\n        self._conn.executemany(\n            \"\"\"\n            INSERT INTO chunk_embeddings (chunk_id, corpus_id, embedding)\n            VALUES (?, ?, ?)\n            ON CONFLICT(chunk_id) DO UPDATE SET\n                corpus_id = excluded.corpus_id,\n                embedding = excluded.embedding\n            \"\"\",\n            [(cid, corpus_id, emb) for cid, emb in chunk_embeddings],\n        )\n        return len(chunk_embeddings)\n\n    def search_chunks_semantic(\n        self,\n        *,\n        corpus_id: str,\n        query_embedding: list[float],\n        limit: int = 5,\n    ) -> list[dict[str, Any]]:\n        \"\"\"Search chunks by cosine similarity against a query embedding.\"\"\"\n        sql = \"\"\"\n            SELECT\n                d.id AS doc_id,\n                d.relative_path,\n                d.absolute_path,\n                c.position,\n                c.text,\n                array_cosine_similarity(ce.embedding, ?::FLOAT[{dim}]) AS score\n            FROM chunk_embeddings ce\n            JOIN chunks c ON c.id = ce.chunk_id\n            JOIN documents d ON d.id = c.doc_id\n            WHERE ce.corpus_id = ?\n              AND d.is_deleted = FALSE\n            ORDER BY score DESC\n            LIMIT ?\n        \"\"\".format(dim=self.embedding_dim)\n        rows = self._conn.execute(sql, [query_embedding, corpus_id, limit]).fetchall()\n\n        results: list[dict[str, Any]] = []\n        for row in rows:\n            results.append(\n                {\n                    \"doc_id\": str(row[0]),\n                    \"relative_path\": str(row[1]),\n                    \"absolute_path\": str(row[2]),\n                    \"position\": int(row[3]),\n                    \"text\": str(row[4]),\n                    \"score\": float(row[5]),\n                }\n            )\n        return results\n\n    def get_metadata_field_values(\n        self,\n   
     *,\n        corpus_id: str,\n        field_names: list[str],\n        max_distinct: int = 10,\n    ) -> dict[str, list[str]]:\n        \"\"\"Return up to *max_distinct* distinct non-empty values per metadata field.\"\"\"\n        result: dict[str, list[str]] = {}\n        for field in field_names:\n            rows = self._conn.execute(\n                \"\"\"\n                SELECT DISTINCT json_extract_string(d.metadata_json, ?) AS val\n                FROM documents d\n                WHERE d.corpus_id = ?\n                  AND d.is_deleted = FALSE\n                  AND val IS NOT NULL\n                  AND val != ''\n                LIMIT ?\n                \"\"\",\n                [f\"$.{field}\", corpus_id, max_distinct],\n            ).fetchall()\n            result[field] = [str(row[0]) for row in rows]\n        return result\n\n    def has_embeddings(self, *, corpus_id: str) -> bool:\n        \"\"\"Return True if the corpus has stored embeddings.\"\"\"\n        row = self._conn.execute(\n            \"SELECT COUNT(*) FROM chunk_embeddings WHERE corpus_id = ?\",\n            [corpus_id],\n        ).fetchone()\n        return bool(row and int(row[0]) > 0)\n\n    def create_hnsw_index(self, *, corpus_id: str) -> bool:\n        \"\"\"Create an HNSW index on chunk embeddings if vss is available.\n\n        Returns True if the index was created, False otherwise.\n        \"\"\"\n        if not self._vss_available:\n            return False\n        try:\n            index_name = f\"hnsw_{corpus_id.replace('-', '_')}\"\n            self._conn.execute(\n                f\"\"\"\n                CREATE INDEX IF NOT EXISTS {index_name}\n                ON chunk_embeddings\n                USING HNSW (embedding)\n                WITH (metric = 'cosine')\n                \"\"\"\n            )\n            return True\n        except Exception:\n            return False\n\n    @staticmethod\n    def _metadata_clause(\n        *,\n        field: str,\n        
operator: str,\n        value: Any,\n    ) -> tuple[str, list[Any]]:\n        json_expr = \"json_extract_string(d.metadata_json, ?)\"\n        json_path = f\"$.{field}\"\n\n        if operator in {\"eq\", \"ne\"}:\n            comparator = \"=\" if operator == \"eq\" else \"<>\"\n            if isinstance(value, bool):\n                return (\n                    f\"lower(coalesce({json_expr}, '')) {comparator} ?\",\n                    [json_path, \"true\" if value else \"false\"],\n                )\n            if isinstance(value, (int, float)):\n                return (\n                    f\"try_cast({json_expr} AS DOUBLE) {comparator} ?\",\n                    [json_path, float(value)],\n                )\n            return (\n                f\"lower(coalesce({json_expr}, '')) {comparator} lower(?)\",\n                [json_path, str(value)],\n            )\n\n        if operator in {\"gt\", \"gte\", \"lt\", \"lte\"}:\n            if not isinstance(value, (int, float)):\n                raise ValueError(\n                    f\"Metadata operator {operator!r} requires numeric value for field {field!r}.\"\n                )\n            comparator_map = {\n                \"gt\": \">\",\n                \"gte\": \">=\",\n                \"lt\": \"<\",\n                \"lte\": \"<=\",\n            }\n            comparator = comparator_map[operator]\n            return (\n                f\"try_cast({json_expr} AS DOUBLE) {comparator} ?\",\n                [json_path, float(value)],\n            )\n\n        if operator == \"contains\":\n            return (\n                f\"lower(coalesce({json_expr}, '')) LIKE '%' || lower(?) 
|| '%'\",\n                [json_path, str(value)],\n            )\n\n        if operator == \"in\":\n            if not isinstance(value, list) or not value:\n                raise ValueError(\n                    f\"Metadata `in` filter for field {field!r} has no values.\"\n                )\n\n            if all(isinstance(item, bool) for item in value):\n                placeholders = \", \".join([\"?\"] * len(value))\n                return (\n                    f\"lower(coalesce({json_expr}, '')) IN ({placeholders})\",\n                    [\n                        json_path,\n                        *[\"true\" if bool(item) else \"false\" for item in value],\n                    ],\n                )\n\n            if all(\n                isinstance(item, (int, float)) and not isinstance(item, bool)\n                for item in value\n            ):\n                placeholders = \", \".join([\"?\"] * len(value))\n                return (\n                    f\"try_cast({json_expr} AS DOUBLE) IN ({placeholders})\",\n                    [json_path, *[float(item) for item in value]],\n                )\n\n            placeholders = \", \".join([\"?\"] * len(value))\n            return (\n                f\"lower(coalesce({json_expr}, '')) IN ({placeholders})\",\n                [json_path, *[str(item).lower() for item in value]],\n            )\n\n        raise ValueError(f\"Unsupported metadata operator: {operator!r}\")\n"
  },
  {
    "path": "src/fs_explorer/ui.html",
    "content": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"UTF-8\">\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n    <title>fs-explorer</title>\n    <link rel=\"preconnect\" href=\"https://fonts.googleapis.com\">\n    <link rel=\"preconnect\" href=\"https://fonts.gstatic.com\" crossorigin>\n    <link href=\"https://fonts.googleapis.com/css2?family=IBM+Plex+Mono:wght@400;500;600;700&family=Instrument+Serif:ital@0;1&display=swap\" rel=\"stylesheet\">\n    <style>\n        :root {\n            --bg: #f4f1eb;\n            --bg-alt: #ebe7df;\n            --ink: #1a1a1a;\n            --ink-light: #4a4a4a;\n            --ink-muted: #8a8a8a;\n            --accent: #c45d3a;\n            --accent-light: #e8d5ce;\n            --success: #2d6a4f;\n            --border: #d4d0c8;\n            --shadow: rgba(0,0,0,0.08);\n        }\n\n        * {\n            margin: 0;\n            padding: 0;\n            box-sizing: border-box;\n        }\n\n        ::selection {\n            background: var(--accent);\n            color: var(--bg);\n        }\n\n        body {\n            font-family: 'IBM Plex Mono', monospace;\n            background: var(--bg);\n            color: var(--ink);\n            min-height: 100vh;\n            font-size: 14px;\n            line-height: 1.6;\n        }\n\n        /* Layout */\n        .page {\n            max-width: 1400px;\n            margin: 0 auto;\n            padding: 40px;\n        }\n\n        /* Header */\n        .masthead {\n            display: flex;\n            justify-content: space-between;\n            align-items: flex-end;\n            border-bottom: 2px solid var(--ink);\n            padding-bottom: 20px;\n            margin-bottom: 40px;\n        }\n\n        .title-block {\n            display: flex;\n            align-items: baseline;\n            gap: 16px;\n        }\n\n        .site-title {\n            font-family: 'Instrument Serif', serif;\n            
font-size: 42px;\n            font-weight: 400;\n            letter-spacing: -1px;\n        }\n\n        .version {\n            font-size: 11px;\n            color: var(--ink-muted);\n            text-transform: uppercase;\n            letter-spacing: 1px;\n        }\n\n        .status-indicator {\n            display: flex;\n            align-items: center;\n            gap: 8px;\n            font-size: 11px;\n            text-transform: uppercase;\n            letter-spacing: 1px;\n        }\n\n        .status-dot {\n            width: 8px;\n            height: 8px;\n            border-radius: 50%;\n            background: var(--ink-muted);\n        }\n\n        .status-dot.active {\n            background: var(--success);\n        }\n\n        .status-dot.error {\n            background: var(--accent);\n        }\n\n        /* Folder Section */\n        .folder-section {\n            margin-bottom: 20px;\n        }\n\n        .folder-row {\n            display: flex;\n        }\n\n        .folder-display {\n            flex: 1;\n            display: flex;\n            justify-content: space-between;\n            align-items: center;\n            background: var(--bg-alt);\n            border: 2px solid var(--border);\n            padding: 12px 16px;\n        }\n\n        .folder-path {\n            font-size: 13px;\n            color: var(--ink-light);\n            overflow: hidden;\n            text-overflow: ellipsis;\n            white-space: nowrap;\n            max-width: calc(100% - 80px);\n        }\n\n        .folder-btn {\n            background: transparent;\n            border: 1px solid var(--ink);\n            padding: 6px 16px;\n            font-family: inherit;\n            font-size: 11px;\n            text-transform: uppercase;\n            letter-spacing: 1px;\n            cursor: pointer;\n            transition: all 0.15s;\n        }\n\n        .folder-btn:hover {\n            background: var(--ink);\n            color: var(--bg);\n        
}\n\n        /* Query Section */\n        .query-section {\n            margin-bottom: 40px;\n        }\n\n        .query-label {\n            font-size: 11px;\n            text-transform: uppercase;\n            letter-spacing: 2px;\n            color: var(--ink-muted);\n            margin-bottom: 12px;\n        }\n\n        .query-row {\n            display: flex;\n            gap: 0;\n        }\n\n        .query-input {\n            flex: 1;\n            background: var(--bg);\n            border: 2px solid var(--ink);\n            border-right: none;\n            padding: 16px 20px;\n            font-family: inherit;\n            font-size: 16px;\n            color: var(--ink);\n        }\n\n        .query-input:focus {\n            outline: none;\n            background: #fff;\n        }\n\n        .query-input::placeholder {\n            color: var(--ink-muted);\n            font-style: italic;\n        }\n\n        .query-btn {\n            background: var(--ink);\n            color: var(--bg);\n            border: 2px solid var(--ink);\n            padding: 16px 32px;\n            font-family: inherit;\n            font-size: 12px;\n            text-transform: uppercase;\n            letter-spacing: 2px;\n            cursor: pointer;\n            transition: all 0.15s;\n        }\n\n        .query-btn:hover:not(:disabled) {\n            background: var(--accent);\n            border-color: var(--accent);\n        }\n\n        .query-btn:disabled {\n            opacity: 0.4;\n            cursor: not-allowed;\n        }\n\n        /* Index Badge */\n        .index-badge {\n            display: inline-flex;\n            align-items: center;\n            gap: 6px;\n            padding: 4px 12px;\n            font-size: 11px;\n            letter-spacing: 0.5px;\n            border: 1px solid var(--border);\n            background: var(--bg);\n            cursor: pointer;\n            transition: all 0.15s;\n            white-space: nowrap;\n        }\n\n        
.index-badge:hover {\n            border-color: var(--ink);\n        }\n\n        .badge-dot {\n            width: 7px;\n            height: 7px;\n            border-radius: 50%;\n            background: var(--ink-muted);\n        }\n\n        .badge-dot.indexed {\n            background: var(--success);\n        }\n\n        /* Search Mode Toggles */\n        .search-mode-section {\n            margin-top: 14px;\n            display: flex;\n            align-items: center;\n            gap: 14px;\n        }\n\n        .search-mode-label {\n            font-size: 11px;\n            text-transform: uppercase;\n            letter-spacing: 2px;\n            color: var(--ink-muted);\n            white-space: nowrap;\n        }\n\n        .search-mode-options {\n            display: flex;\n            gap: 0;\n            border: 1px solid var(--border);\n        }\n\n        .search-mode-options label {\n            display: flex;\n            align-items: center;\n            gap: 5px;\n            padding: 7px 14px;\n            font-size: 12px;\n            color: var(--ink-light);\n            cursor: pointer;\n            border-right: 1px solid var(--border);\n            transition: all 0.15s;\n            user-select: none;\n        }\n\n        .search-mode-options label:last-child {\n            border-right: none;\n        }\n\n        .search-mode-options label:has(input:checked) {\n            background: var(--ink);\n            color: var(--bg);\n        }\n\n        .search-mode-options label.disabled {\n            opacity: 0.35;\n            cursor: not-allowed;\n        }\n\n        .search-mode-options label.disabled input {\n            pointer-events: none;\n        }\n\n        .search-mode-options input[type=\"checkbox\"] {\n            display: none;\n        }\n\n        /* Indexing Modal */\n        .indexing-modal {\n            max-width: 620px;\n        }\n\n        .indexing-modal .modal-section {\n            margin-bottom: 20px;\n       
 }\n\n        .indexing-modal .schema-editor textarea {\n            width: 100%;\n            border: 1px solid var(--border);\n            background: var(--bg-alt);\n            color: var(--ink);\n            font-family: 'IBM Plex Mono', monospace;\n            font-size: 12px;\n            padding: 12px;\n            resize: vertical;\n            min-height: 160px;\n        }\n\n        .indexing-modal .schema-editor textarea:focus {\n            outline: none;\n            border-color: var(--ink-light);\n        }\n\n        .indexing-summary {\n            font-size: 13px;\n        }\n\n        .indexing-summary dl {\n            display: grid;\n            grid-template-columns: auto 1fr;\n            gap: 6px 16px;\n        }\n\n        .indexing-summary dt {\n            color: var(--ink-muted);\n            font-size: 12px;\n            text-transform: uppercase;\n            letter-spacing: 0.5px;\n        }\n\n        .indexing-summary dd {\n            font-weight: 600;\n        }\n\n        .panel-btn {\n            background: transparent;\n            border: 1px solid var(--ink);\n            padding: 7px 16px;\n            font-family: inherit;\n            font-size: 11px;\n            text-transform: uppercase;\n            letter-spacing: 1px;\n            cursor: pointer;\n            transition: all 0.15s;\n        }\n\n        .panel-btn:hover {\n            background: var(--ink);\n            color: var(--bg);\n        }\n\n        .panel-btn:disabled {\n            opacity: 0.4;\n            cursor: not-allowed;\n        }\n\n        .panel-btn.primary {\n            background: var(--ink);\n            color: var(--bg);\n        }\n\n        .panel-btn.primary:hover {\n            background: var(--accent);\n            border-color: var(--accent);\n        }\n\n        .modal-actions {\n            display: flex;\n            gap: 12px;\n            margin-top: 20px;\n        }\n\n        .modal-actions .modal-btn {\n            
flex: 1;\n        }\n\n        .embed-toggle-row {\n            display: flex;\n            align-items: center;\n            gap: 8px;\n            font-size: 12px;\n            color: var(--ink-light);\n            margin-top: 14px;\n        }\n\n        .embed-toggle-row input[type=\"checkbox\"] {\n            accent-color: var(--accent);\n        }\n\n        .profile-label {\n            font-size: 11px;\n            text-transform: uppercase;\n            letter-spacing: 1px;\n            color: var(--ink-muted);\n            margin-bottom: 8px;\n        }\n\n        .profile-input {\n            width: 100%;\n            border: 1px solid var(--border);\n            background: var(--bg);\n            color: var(--ink);\n            font-family: inherit;\n            font-size: 12px;\n            padding: 10px 12px;\n            resize: vertical;\n            min-height: 64px;\n            margin-bottom: 12px;\n        }\n\n        .profile-input:focus {\n            outline: none;\n            border-color: var(--ink-light);\n        }\n\n        .radio-group {\n            display: flex;\n            gap: 20px;\n            margin-bottom: 12px;\n        }\n\n        .radio-group label {\n            display: flex;\n            align-items: center;\n            gap: 6px;\n            font-size: 12px;\n            color: var(--ink-light);\n            cursor: pointer;\n        }\n\n        .radio-group input[type=\"radio\"] {\n            accent-color: var(--accent);\n        }\n\n        .panel-spinner {\n            display: inline-block;\n            width: 14px;\n            height: 14px;\n            border: 2px solid var(--border);\n            border-top-color: var(--accent);\n            border-radius: 50%;\n            animation: spin 0.6s linear infinite;\n            vertical-align: middle;\n            margin-right: 6px;\n        }\n\n        @keyframes spin {\n            to { transform: rotate(360deg); }\n        }\n\n        
.index-panel-message {\n            font-size: 12px;\n            color: var(--ink-muted);\n            font-style: italic;\n            margin-top: 10px;\n        }\n\n        /* Main Grid */\n        .main-grid {\n            display: grid;\n            grid-template-columns: 1fr 1fr;\n            gap: 40px;\n        }\n\n        @media (max-width: 1000px) {\n            .main-grid {\n                grid-template-columns: 1fr;\n            }\n        }\n\n        /* Sections */\n        .section {\n            min-height: 400px;\n        }\n\n        .section-header {\n            display: flex;\n            justify-content: space-between;\n            align-items: baseline;\n            border-bottom: 1px solid var(--border);\n            padding-bottom: 12px;\n            margin-bottom: 20px;\n        }\n\n        .section-title {\n            font-size: 11px;\n            text-transform: uppercase;\n            letter-spacing: 2px;\n            color: var(--ink-muted);\n        }\n\n        .section-meta {\n            font-size: 11px;\n            color: var(--ink-muted);\n        }\n\n        /* Steps */\n        .steps-list {\n            display: flex;\n            flex-direction: column;\n            gap: 2px;\n        }\n\n        .step {\n            background: var(--bg-alt);\n            border-left: 3px solid var(--ink-muted);\n            padding: 16px 20px;\n            animation: appear 0.2s ease;\n        }\n\n        @keyframes appear {\n            from { opacity: 0; transform: translateX(-10px); }\n            to { opacity: 1; transform: translateX(0); }\n        }\n\n        .step.scan { border-left-color: #0969da; }\n        .step.parse { border-left-color: var(--success); }\n        .step.preview { border-left-color: #8250df; }\n        .step.search { border-left-color: var(--accent); }\n        .step.navigate { border-left-color: #bf8700; }\n\n        .step-header {\n            display: flex;\n            justify-content: 
space-between;\n            align-items: baseline;\n            margin-bottom: 8px;\n        }\n\n        .step-id {\n            font-weight: 600;\n            font-size: 12px;\n        }\n\n        .step-tool {\n            font-size: 11px;\n            text-transform: uppercase;\n            letter-spacing: 1px;\n            color: var(--ink-light);\n        }\n\n        .step-target {\n            font-size: 13px;\n            color: var(--ink-light);\n            margin-bottom: 8px;\n            word-break: break-all;\n        }\n\n        .step-target::before {\n            content: '→ ';\n            color: var(--ink-muted);\n        }\n\n        .step-reason {\n            font-size: 12px;\n            color: var(--ink-muted);\n            font-style: italic;\n            line-height: 1.5;\n        }\n\n        /* Empty State */\n        .empty-state {\n            display: flex;\n            flex-direction: column;\n            align-items: center;\n            justify-content: center;\n            height: 300px;\n            color: var(--ink-muted);\n            text-align: center;\n        }\n\n        .empty-state .prompt {\n            font-family: 'Instrument Serif', serif;\n            font-size: 24px;\n            font-style: italic;\n            margin-bottom: 8px;\n            color: var(--ink-light);\n        }\n\n        .empty-state .hint {\n            font-size: 12px;\n        }\n\n        /* Result */\n        .result-content {\n            line-height: 1.8;\n        }\n\n        .result-text {\n            font-size: 15px;\n            white-space: pre-wrap;\n        }\n\n        .citation {\n            background: var(--accent-light);\n            color: var(--accent);\n            padding: 2px 6px;\n            font-size: 11px;\n            border-radius: 2px;\n        }\n\n        /* Sources */\n        .sources {\n            margin-top: 30px;\n            padding-top: 20px;\n            border-top: 1px solid var(--border);\n        
}\n\n        .sources-title {\n            font-size: 11px;\n            text-transform: uppercase;\n            letter-spacing: 2px;\n            color: var(--ink-muted);\n            margin-bottom: 12px;\n        }\n\n        .source-item {\n            font-size: 12px;\n            color: var(--ink-light);\n            padding: 4px 0;\n            display: flex;\n            align-items: baseline;\n            gap: 8px;\n        }\n\n        .source-item::before {\n            content: '●';\n            color: var(--accent);\n            font-size: 8px;\n        }\n\n        /* Stats Bar */\n        .stats-bar {\n            margin-top: 40px;\n            padding-top: 20px;\n            border-top: 2px solid var(--ink);\n            display: grid;\n            grid-template-columns: repeat(6, 1fr);\n            gap: 20px;\n        }\n\n        @media (max-width: 768px) {\n            .stats-bar {\n                grid-template-columns: repeat(3, 1fr);\n            }\n        }\n\n        .stat {\n            text-align: center;\n        }\n\n        .stat-value {\n            font-size: 24px;\n            font-weight: 700;\n            font-family: 'Instrument Serif', serif;\n        }\n\n        .stat-label {\n            font-size: 10px;\n            text-transform: uppercase;\n            letter-spacing: 1px;\n            color: var(--ink-muted);\n            margin-top: 4px;\n        }\n\n        /* Progress */\n        .progress-bar {\n            height: 3px;\n            background: var(--border);\n            margin-top: 40px;\n            overflow: hidden;\n        }\n\n        .progress-fill {\n            height: 100%;\n            background: var(--accent);\n            width: 0%;\n            transition: width 0.3s;\n        }\n\n        .progress-bar.active .progress-fill {\n            animation: indeterminate 1.5s infinite;\n        }\n\n        @keyframes indeterminate {\n            0% { transform: translateX(-100%); width: 30%; }\n            
50% { width: 50%; }\n            100% { transform: translateX(400%); width: 30%; }\n        }\n\n        /* Loading */\n        .loading-text {\n            font-style: italic;\n            color: var(--ink-muted);\n            animation: blink 1s infinite;\n        }\n\n        @keyframes blink {\n            0%, 100% { opacity: 1; }\n            50% { opacity: 0.5; }\n        }\n\n        /* Human Modal */\n        .modal-overlay {\n            position: fixed;\n            inset: 0;\n            background: rgba(244, 241, 235, 0.95);\n            display: flex;\n            align-items: center;\n            justify-content: center;\n            z-index: 100;\n            opacity: 0;\n            visibility: hidden;\n            transition: all 0.2s;\n        }\n\n        .modal-overlay.active {\n            opacity: 1;\n            visibility: visible;\n        }\n\n        .modal {\n            background: var(--bg);\n            border: 2px solid var(--ink);\n            padding: 40px;\n            max-width: 500px;\n            width: 90%;\n            box-shadow: 8px 8px 0 var(--ink);\n        }\n\n        .modal-title {\n            font-family: 'Instrument Serif', serif;\n            font-size: 24px;\n            margin-bottom: 20px;\n        }\n\n        .modal-question {\n            background: var(--bg-alt);\n            padding: 16px;\n            margin-bottom: 20px;\n            font-size: 14px;\n        }\n\n        .modal-input {\n            width: 100%;\n            border: 2px solid var(--ink);\n            padding: 12px 16px;\n            font-family: inherit;\n            font-size: 14px;\n            resize: vertical;\n            min-height: 80px;\n            margin-bottom: 20px;\n        }\n\n        .modal-input:focus {\n            outline: none;\n        }\n\n        .modal-btn {\n            background: var(--ink);\n            color: var(--bg);\n            border: none;\n            padding: 12px 24px;\n            font-family: 
inherit;\n            font-size: 12px;\n            text-transform: uppercase;\n            letter-spacing: 2px;\n            cursor: pointer;\n            width: 100%;\n        }\n\n        .modal-btn:hover {\n            background: var(--accent);\n        }\n\n        .modal-btn.secondary {\n            background: transparent;\n            border: 2px solid var(--ink);\n            color: var(--ink);\n        }\n\n        .modal-btn.secondary:hover {\n            background: var(--ink);\n            color: var(--bg);\n        }\n\n        /* Folder Modal */\n        .folder-modal {\n            max-width: 600px;\n        }\n\n        .folder-nav {\n            display: flex;\n            align-items: center;\n            gap: 12px;\n            margin-bottom: 16px;\n            padding-bottom: 12px;\n            border-bottom: 1px solid var(--border);\n        }\n\n        .folder-nav-btn {\n            background: transparent;\n            border: 1px solid var(--ink);\n            padding: 6px 12px;\n            font-family: inherit;\n            font-size: 11px;\n            cursor: pointer;\n        }\n\n        .folder-nav-btn:hover {\n            background: var(--ink);\n            color: var(--bg);\n        }\n\n        .folder-nav-btn:disabled {\n            opacity: 0.3;\n            cursor: not-allowed;\n        }\n\n        .folder-current {\n            font-size: 12px;\n            color: var(--ink-light);\n            overflow: hidden;\n            text-overflow: ellipsis;\n            white-space: nowrap;\n        }\n\n        .folder-list {\n            background: var(--bg-alt);\n            border: 1px solid var(--border);\n            padding: 8px;\n            max-height: 300px;\n            margin-bottom: 20px;\n        }\n\n        .folder-item {\n            display: flex;\n            align-items: center;\n            gap: 8px;\n            padding: 10px 12px;\n            cursor: pointer;\n            border-bottom: 1px solid 
var(--border);\n            transition: background 0.1s;\n        }\n\n        .folder-item:last-child {\n            border-bottom: none;\n        }\n\n        .folder-item:hover {\n            background: var(--bg);\n        }\n\n        .folder-item.selected {\n            background: var(--accent-light);\n        }\n\n        .folder-icon {\n            color: var(--accent);\n            font-weight: bold;\n        }\n\n        .folder-name {\n            font-size: 13px;\n        }\n\n        .folder-actions {\n            display: flex;\n            gap: 12px;\n        }\n\n        .folder-actions .modal-btn {\n            flex: 1;\n        }\n\n        .folder-empty {\n            padding: 20px;\n            text-align: center;\n            color: var(--ink-muted);\n            font-style: italic;\n        }\n\n        .folder-info {\n            font-size: 11px;\n            color: var(--ink-muted);\n            margin-top: 4px;\n        }\n\n        /* Scrollbar */\n        .scrollable {\n            max-height: 500px;\n            overflow-y: auto;\n        }\n\n        .scrollable::-webkit-scrollbar {\n            width: 8px;\n        }\n\n        .scrollable::-webkit-scrollbar-track {\n            background: var(--bg-alt);\n        }\n\n        .scrollable::-webkit-scrollbar-thumb {\n            background: var(--border);\n        }\n\n        .scrollable::-webkit-scrollbar-thumb:hover {\n            background: var(--ink-muted);\n        }\n\n        /* Footer */\n        .footer {\n            margin-top: 60px;\n            padding-top: 20px;\n            border-top: 1px solid var(--border);\n            font-size: 11px;\n            color: var(--ink-muted);\n            text-align: center;\n        }\n    </style>\n</head>\n<body>\n    <div class=\"page\">\n        <!-- Masthead -->\n        <header class=\"masthead\">\n            <div class=\"title-block\">\n                <h1 class=\"site-title\">fs-explorer</h1>\n                <span 
class=\"version\">v0.1.0</span>\n            </div>\n            <div class=\"status-indicator\">\n                <div class=\"status-dot\" id=\"statusDot\"></div>\n                <span id=\"statusText\">Ready</span>\n            </div>\n        </header>\n\n        <!-- Folder Selector -->\n        <section class=\"folder-section\">\n            <div class=\"query-label\">Target Folder</div>\n            <div class=\"folder-row\">\n                <div class=\"folder-display\" id=\"folderDisplay\">\n                    <span class=\"folder-path\" id=\"currentPath\">.</span>\n                    <span class=\"index-badge\" id=\"indexBadge\" onclick=\"openIndexingModal()\" style=\"display:none;\">\n                        <span class=\"badge-dot\" id=\"badgeDot\"></span>\n                        <span id=\"badgeText\">Not Indexed</span>\n                    </span>\n                    <button class=\"folder-btn\" onclick=\"openFolderPicker()\">Browse</button>\n                </div>\n            </div>\n        </section>\n\n        <!-- Query -->\n        <section class=\"query-section\">\n            <div class=\"query-label\">Query</div>\n            <div class=\"query-row\">\n                <input\n                    type=\"text\"\n                    class=\"query-input\"\n                    id=\"queryInput\"\n                    placeholder=\"What would you like to know about your documents?\"\n                    autocomplete=\"off\"\n                >\n                <button class=\"query-btn\" id=\"queryBtn\" onclick=\"startExploration()\">\n                    Execute\n                </button>\n            </div>\n            <div class=\"search-mode-section\" id=\"searchModeSection\">\n                <div class=\"search-mode-label\">Retrieval</div>\n                <div class=\"search-mode-options\">\n                    <label><input type=\"checkbox\" value=\"agentic\" checked disabled><span>Agentic</span></label>\n                    <label 
class=\"disabled\" id=\"smSemantic\"><input type=\"checkbox\" id=\"cbSemantic\" value=\"semantic\" disabled><span>Semantic</span></label>\n                    <label class=\"disabled\" id=\"smMetadata\"><input type=\"checkbox\" id=\"cbMetadata\" value=\"metadata\" disabled><span>Metadata</span></label>\n                </div>\n            </div>\n        </section>\n\n        <!-- Progress -->\n        <div class=\"progress-bar\" id=\"progressBar\">\n            <div class=\"progress-fill\"></div>\n        </div>\n\n        <!-- Main Grid -->\n        <div class=\"main-grid\">\n            <!-- Steps -->\n            <section class=\"section\">\n                <div class=\"section-header\">\n                    <span class=\"section-title\">Execution Log</span>\n                    <span class=\"section-meta\" id=\"stepCount\">—</span>\n                </div>\n                <div class=\"scrollable\">\n                    <div class=\"steps-list\" id=\"stepsList\">\n                        <div class=\"empty-state\">\n                            <div class=\"prompt\">Awaiting query...</div>\n                            <div class=\"hint\">Enter a question to begin document exploration</div>\n                        </div>\n                    </div>\n                </div>\n            </section>\n\n            <!-- Result -->\n            <section class=\"section\">\n                <div class=\"section-header\">\n                    <span class=\"section-title\">Response</span>\n                </div>\n                <div class=\"result-content scrollable\" id=\"resultContent\">\n                    <div class=\"empty-state\">\n                        <div class=\"prompt\">No results yet</div>\n                        <div class=\"hint\">Results with citations will appear here</div>\n                    </div>\n                </div>\n            </section>\n        </div>\n\n        <!-- Stats -->\n        <div class=\"stats-bar\" id=\"statsBar\" 
style=\"display: none;\">\n            <div class=\"stat\">\n                <div class=\"stat-value\" id=\"statSteps\">0</div>\n                <div class=\"stat-label\">Steps</div>\n            </div>\n            <div class=\"stat\">\n                <div class=\"stat-value\" id=\"statScanned\">0</div>\n                <div class=\"stat-label\">Scanned</div>\n            </div>\n            <div class=\"stat\">\n                <div class=\"stat-value\" id=\"statParsed\">0</div>\n                <div class=\"stat-label\">Parsed</div>\n            </div>\n            <div class=\"stat\">\n                <div class=\"stat-value\" id=\"statCalls\">0</div>\n                <div class=\"stat-label\">API Calls</div>\n            </div>\n            <div class=\"stat\">\n                <div class=\"stat-value\" id=\"statTokens\">0</div>\n                <div class=\"stat-label\">Tokens</div>\n            </div>\n            <div class=\"stat\">\n                <div class=\"stat-value\" id=\"statCost\">$0</div>\n                <div class=\"stat-label\">Est. 
Cost</div>\n            </div>\n        </div>\n\n        <!-- Footer -->\n        <footer class=\"footer\">\n            Powered by Gemini 3 Flash · Documents parsed with Docling\n        </footer>\n    </div>\n\n    <!-- Human Modal -->\n    <div class=\"modal-overlay\" id=\"humanModal\">\n        <div class=\"modal\">\n            <div class=\"modal-title\">Input Required</div>\n            <div class=\"modal-question\" id=\"modalQuestion\"></div>\n            <textarea class=\"modal-input\" id=\"modalInput\" placeholder=\"Your response...\"></textarea>\n            <button class=\"modal-btn\" onclick=\"submitHumanResponse()\">Submit</button>\n        </div>\n    </div>\n\n    <!-- Folder Picker Modal -->\n    <div class=\"modal-overlay\" id=\"folderModal\">\n        <div class=\"modal folder-modal\">\n            <div class=\"modal-title\">Select Folder</div>\n            <div class=\"folder-nav\">\n                <button class=\"folder-nav-btn\" id=\"folderUpBtn\" onclick=\"navigateUp()\">↑ Parent</button>\n                <span class=\"folder-current\" id=\"folderModalPath\">.</span>\n            </div>\n            <div class=\"folder-list scrollable\" id=\"folderList\">\n                <div class=\"loading-text\">Loading...</div>\n            </div>\n            <div class=\"folder-actions\">\n                <button class=\"modal-btn secondary\" onclick=\"closeFolderPicker()\">Cancel</button>\n                <button class=\"modal-btn\" onclick=\"selectCurrentFolder()\">Select This Folder</button>\n            </div>\n        </div>\n    </div>\n\n    <!-- Indexing Config Modal -->\n    <div class=\"modal-overlay\" id=\"indexingModal\">\n        <div class=\"modal indexing-modal\">\n            <div id=\"indexingModalContent\"></div>\n        </div>\n    </div>\n\n    <script>\n        // State\n        let ws = null;\n        let stepCount = 0;\n        let isRunning = false;\n        let currentFolder = '.';\n        let browsingPath = '.';\n        
let indexStatus = null;\n\n        // Tool styles\n        const toolStyles = {\n            scan_folder: 'scan',\n            preview_file: 'preview',\n            parse_file: 'parse',\n            read: 'preview',\n            grep: 'search',\n            glob: 'search',\n        };\n\n        // Elements\n        const queryInput = document.getElementById('queryInput');\n        const queryBtn = document.getElementById('queryBtn');\n        const stepsList = document.getElementById('stepsList');\n        const stepCountEl = document.getElementById('stepCount');\n        const resultContent = document.getElementById('resultContent');\n        const statsBar = document.getElementById('statsBar');\n        const progressBar = document.getElementById('progressBar');\n        const statusDot = document.getElementById('statusDot');\n        const statusText = document.getElementById('statusText');\n        const humanModal = document.getElementById('humanModal');\n        const folderModal = document.getElementById('folderModal');\n        const folderList = document.getElementById('folderList');\n        const folderModalPath = document.getElementById('folderModalPath');\n        const folderUpBtn = document.getElementById('folderUpBtn');\n        const currentPathEl = document.getElementById('currentPath');\n        const indexingModal = document.getElementById('indexingModal');\n        const indexingModalContent = document.getElementById('indexingModalContent');\n\n        // Enter key\n        queryInput.addEventListener('keypress', (e) => {\n            if (e.key === 'Enter' && !isRunning) startExploration();\n        });\n\n        // ========== Index Badge + Status ==========\n\n        async function checkIndexStatus(folder) {\n            try {\n                const res = await fetch(`/api/index/status?folder=${encodeURIComponent(folder)}`);\n                indexStatus = await res.json();\n            } catch (e) {\n                indexStatus = { indexed: 
false };\n            }\n            updateIndexBadge();\n            updateSearchModeAvailability();\n        }\n\n        function updateIndexBadge() {\n            const badge = document.getElementById('indexBadge');\n            const dot = document.getElementById('badgeDot');\n            const text = document.getElementById('badgeText');\n            badge.style.display = 'inline-flex';\n\n            if (indexStatus && indexStatus.indexed) {\n                dot.className = 'badge-dot indexed';\n                text.textContent = `Indexed (${indexStatus.document_count} docs)`;\n            } else {\n                dot.className = 'badge-dot';\n                text.textContent = 'Not Indexed';\n            }\n        }\n\n        function updateSearchModeAvailability() {\n            const isIndexed = indexStatus && indexStatus.indexed;\n\n            const pairs = [\n                ['smSemantic', 'cbSemantic', isIndexed],\n                ['smMetadata', 'cbMetadata', isIndexed],\n            ];\n\n            for (const [labelId, cbId, enabled] of pairs) {\n                const label = document.getElementById(labelId);\n                const cb = document.getElementById(cbId);\n                if (enabled) {\n                    label.classList.remove('disabled');\n                    cb.disabled = false;\n                } else {\n                    label.classList.add('disabled');\n                    cb.disabled = true;\n                    cb.checked = false;\n                }\n            }\n        }\n\n        // ========== Indexing Config Modal ==========\n\n        function openIndexingModal() {\n            showIndexingConfigView();\n            indexingModal.classList.add('active');\n        }\n\n        function closeIndexingModal() {\n            indexingModal.classList.remove('active');\n        }\n\n        function showIndexingConfigView() {\n            const isIndexed = indexStatus && indexStatus.indexed;\n            const statusLine 
= isIndexed\n                ? `Currently indexed: <strong>${indexStatus.document_count}</strong> documents` +\n                  (indexStatus.schema_name ? ` &middot; Schema: ${esc(indexStatus.schema_name)}` : '')\n                : 'No index found for this folder.';\n\n            indexingModalContent.innerHTML = `\n                <div class=\"modal-title\">${isIndexed ? 'Re-index Folder' : 'Index Folder'}</div>\n                <div style=\"font-size:12px;color:var(--ink-light);margin-bottom:20px;\">${statusLine}</div>\n\n                <div class=\"modal-section\">\n                    <div class=\"profile-label\">Schema Source</div>\n                    <div class=\"radio-group\">\n                        <label><input type=\"radio\" name=\"modalSchemaSource\" value=\"auto\" checked> Auto-discover</label>\n                        <label><input type=\"radio\" name=\"modalSchemaSource\" value=\"custom\"> Custom JSON</label>\n                    </div>\n                    <div id=\"modalSchemaArea\">\n                        <button class=\"panel-btn\" onclick=\"generateSchemaForModal()\" id=\"modalGenBtn\">Generate Schema</button>\n                    </div>\n                </div>\n\n                <div class=\"embed-toggle-row\">\n                    <input type=\"checkbox\" id=\"modalEmbedToggle\">\n                    <span>Generate embeddings</span>\n                </div>\n\n                <div class=\"modal-actions\">\n                    <button class=\"modal-btn secondary\" onclick=\"closeIndexingModal()\">Cancel</button>\n                    <button class=\"modal-btn\" onclick=\"startIndexingFromModal()\" id=\"modalStartBtn\">Start Indexing</button>\n                </div>\n            `;\n\n            // Wire up schema source radio toggle\n            indexingModalContent.querySelectorAll('input[name=\"modalSchemaSource\"]').forEach(r => {\n                r.addEventListener('change', () => {\n                    const area = 
document.getElementById('modalSchemaArea');\n                    if (r.value === 'auto' && r.checked) {\n                        area.innerHTML = '<button class=\"panel-btn\" onclick=\"generateSchemaForModal()\" id=\"modalGenBtn\">Generate Schema</button>';\n                    } else if (r.value === 'custom' && r.checked) {\n                        area.innerHTML = `\n                            <div class=\"profile-label\" style=\"margin-top:8px\">Schema JSON</div>\n                            <div class=\"schema-editor\">\n                                <textarea id=\"modalSchemaEditor\">{\n  \"name\": \"custom\",\n  \"fields\": [\n    {\"name\": \"document_type\", \"type\": \"string\", \"description\": \"Type of document.\"},\n    {\"name\": \"mentions_currency\", \"type\": \"boolean\", \"description\": \"Contains monetary values.\"}\n  ]\n}</textarea>\n                            </div>\n                        `;\n                    }\n                });\n            });\n        }\n\n        async function generateSchemaForModal() {\n            const area = document.getElementById('modalSchemaArea');\n            area.innerHTML = '<span class=\"panel-spinner\"></span> Discovering schema...';\n\n            try {\n                const res = await fetch('/api/index/auto-profile', {\n                    method: 'POST',\n                    headers: { 'Content-Type': 'application/json' },\n                    body: JSON.stringify({ folder: currentFolder }),\n                });\n                const data = await res.json();\n                if (data.error) {\n                    area.innerHTML = `<div style=\"color:var(--accent);font-size:12px;margin-bottom:8px\">Error: ${esc(data.error)}</div>\n                        <button class=\"panel-btn\" onclick=\"generateSchemaForModal()\" id=\"modalGenBtn\">Retry</button>`;\n                    return;\n                }\n                const schemaJson = JSON.stringify(data.profile, null, 2);\n                
area.innerHTML = `\n                    <div class=\"profile-label\" style=\"margin-top:8px\">Schema JSON (editable)</div>\n                    <div class=\"schema-editor\">\n                        <textarea id=\"modalSchemaEditor\">${esc(schemaJson)}</textarea>\n                    </div>\n                    <button class=\"panel-btn\" onclick=\"generateSchemaForModal()\" style=\"margin-top:8px\">Regenerate</button>\n                `;\n            } catch (e) {\n                area.innerHTML = `<div style=\"color:var(--accent);font-size:12px;margin-bottom:8px\">Error: ${esc(e.message)}</div>\n                    <button class=\"panel-btn\" onclick=\"generateSchemaForModal()\" id=\"modalGenBtn\">Retry</button>`;\n            }\n        }\n\n        async function startIndexingFromModal() {\n            if (isRunning) return;\n\n            // Parse schema from editor if present\n            let metadataProfile = null;\n            let withMetadata = false;\n            const editorEl = document.getElementById('modalSchemaEditor');\n            if (editorEl) {\n                const raw = editorEl.value.trim();\n                if (raw) {\n                    try {\n                        metadataProfile = JSON.parse(raw);\n                        withMetadata = true;\n                    } catch (e) {\n                        alert('Invalid JSON in schema editor: ' + e.message);\n                        return;\n                    }\n                }\n            }\n\n            const embedEl = document.getElementById('modalEmbedToggle');\n            const withEmbeddings = embedEl ? 
embedEl.checked : false;\n\n            isRunning = true;\n            updateStatus('Indexing', 'active');\n            progressBar.classList.add('active');\n\n            // Show progress view in modal\n            indexingModalContent.innerHTML = `\n                <div class=\"modal-title\">Indexing in Progress</div>\n                <div style=\"text-align:center;padding:40px 0;\">\n                    <span class=\"panel-spinner\" style=\"width:24px;height:24px;border-width:3px;\"></span>\n                    <div style=\"margin-top:16px;font-size:13px;color:var(--ink-light);\">\n                        Parsing documents and building search index...\n                    </div>\n                </div>\n            `;\n\n            try {\n                const response = await fetch('/api/index', {\n                    method: 'POST',\n                    headers: { 'Content-Type': 'application/json' },\n                    body: JSON.stringify({\n                        folder: currentFolder,\n                        discover_schema: true,\n                        with_metadata: withMetadata,\n                        metadata_profile: metadataProfile,\n                        with_embeddings: withEmbeddings,\n                    }),\n                });\n                const payload = await response.json();\n                if (!response.ok || payload.error) {\n                    throw new Error(payload.error || `Indexing failed (${response.status})`);\n                }\n                // Refresh status\n                await checkIndexStatus(currentFolder);\n                showIndexingSummary(payload);\n                updateStatus('Indexed', '');\n            } catch (e) {\n                indexingModalContent.innerHTML = `\n                    <div class=\"modal-title\">Indexing Failed</div>\n                    <div style=\"color:var(--accent);margin-bottom:20px;font-size:13px;\">${esc(e.message)}</div>\n                    <button class=\"modal-btn\" 
onclick=\"closeIndexingModal()\">Close</button>\n                `;\n                updateStatus('Error', 'error');\n            }\n            isRunning = false;\n            queryBtn.disabled = false;\n            progressBar.classList.remove('active');\n        }\n\n        function showIndexingSummary(result) {\n            indexingModalContent.innerHTML = `\n                <div class=\"modal-title\">Indexing Complete</div>\n                <div class=\"indexing-summary\">\n                    <dl>\n                        <dt>Documents indexed</dt>\n                        <dd>${result.indexed_files || 0}</dd>\n                        <dt>Active documents</dt>\n                        <dd>${result.active_documents || 0}</dd>\n                        <dt>Chunks written</dt>\n                        <dd>${result.chunks_written || 0}</dd>\n                        <dt>Skipped files</dt>\n                        <dd>${result.skipped_files || 0}</dd>\n                        <dt>Deleted files</dt>\n                        <dd>${result.deleted_files || 0}</dd>\n                        <dt>Schema</dt>\n                        <dd>${esc(result.schema_used || 'none')}</dd>\n                        <dt>Metadata mode</dt>\n                        <dd>${esc(result.metadata_mode || 'heuristic')}</dd>\n                        <dt>Embeddings written</dt>\n                        <dd>${result.embeddings_written || 0}</dd>\n                    </dl>\n                </div>\n                <div class=\"modal-actions\">\n                    <button class=\"modal-btn\" onclick=\"closeIndexingModal()\">Done</button>\n                </div>\n            `;\n        }\n\n        // ========== Start Exploration ==========\n\n        async function startExploration() {\n            const query = queryInput.value.trim();\n            if (!query || isRunning) return;\n\n            if (ws) { ws.close(); ws = null; }\n\n            isRunning = true;\n            stepCount = 0;\n        
    queryBtn.disabled = true;\n\n            stepsList.innerHTML = '<div class=\"empty-state\"><div class=\"prompt\">Processing...</div><div class=\"hint\">Starting exploration</div></div>';\n            resultContent.innerHTML = '<div class=\"loading-text\">Processing...</div>';\n            statsBar.style.display = 'none';\n            progressBar.classList.add('active');\n            stepCountEl.textContent = '—';\n\n            const enableSemantic = document.getElementById('cbSemantic').checked;\n            const enableMetadata = document.getElementById('cbMetadata').checked;\n            const useIndex = (enableSemantic || enableMetadata) && indexStatus && indexStatus.indexed;\n            updateStatus('Connecting', 'active');\n\n            const wsProtocol = window.location.protocol === 'https:' ? 'wss' : 'ws';\n            const wsUrl = `${wsProtocol}://${window.location.host}/ws/explore`;\n            ws = new WebSocket(wsUrl);\n\n            ws.onopen = () => {\n                updateStatus('Executing', 'active');\n                ws.send(JSON.stringify({\n                    task: query,\n                    folder: currentFolder,\n                    use_index: useIndex,\n                    enable_semantic: enableSemantic,\n                    enable_metadata: enableMetadata,\n                }));\n            };\n\n            ws.onmessage = (e) => handleMessage(JSON.parse(e.data));\n            ws.onerror = () => {\n                showError('Connection failed. 
Is the server running?');\n                finish();\n            };\n            ws.onclose = () => finish();\n        }\n\n        // Handle message\n        function handleMessage(msg) {\n            switch (msg.type) {\n                case 'tool_call':\n                    addStep(msg.data);\n                    break;\n                case 'go_deeper':\n                    addNavStep(msg.data);\n                    break;\n                case 'ask_human':\n                    showHumanModal(msg.data);\n                    break;\n                case 'complete':\n                    showResult(msg.data);\n                    break;\n                case 'error':\n                    showError(msg.data.message);\n                    break;\n            }\n        }\n\n        // Add step\n        function addStep(data) {\n            if (data.step === 1 && stepsList.querySelector('.empty-state')) {\n                stepsList.innerHTML = '';\n            }\n\n            stepCount = data.step;\n            stepCountEl.textContent = `${stepCount} ${stepCount === 1 ? 'step' : 'steps'}`;\n\n            const style = toolStyles[data.tool_name] || 'preview';\n            const target = data.tool_input.directory || data.tool_input.file_path || '';\n\n            const html = `\n                <div class=\"step ${style}\">\n                    <div class=\"step-header\">\n                        <span class=\"step-id\">#${data.step}</span>\n                        <span class=\"step-tool\">${data.tool_name}</span>\n                    </div>\n                    ${target ? 
`<div class=\"step-target\">${esc(target)}</div>` : ''}\n                    <div class=\"step-reason\">${esc(data.reason)}</div>\n                </div>\n            `;\n            stepsList.insertAdjacentHTML('beforeend', html);\n            stepsList.scrollTop = stepsList.scrollHeight;\n        }\n\n        // Add nav step\n        function addNavStep(data) {\n            if (data.step === 1 && stepsList.querySelector('.empty-state')) {\n                stepsList.innerHTML = '';\n            }\n\n            stepCount = data.step;\n            stepCountEl.textContent = `${stepCount} ${stepCount === 1 ? 'step' : 'steps'}`;\n\n            const html = `\n                <div class=\"step navigate\">\n                    <div class=\"step-header\">\n                        <span class=\"step-id\">#${data.step}</span>\n                        <span class=\"step-tool\">navigate</span>\n                    </div>\n                    <div class=\"step-target\">${esc(data.directory)}</div>\n                    <div class=\"step-reason\">${esc(data.reason)}</div>\n                </div>\n            `;\n            stepsList.insertAdjacentHTML('beforeend', html);\n            stepsList.scrollTop = stepsList.scrollHeight;\n        }\n\n        // Show human modal\n        function showHumanModal(data) {\n            document.getElementById('modalQuestion').textContent = data.question;\n            document.getElementById('modalInput').value = '';\n            humanModal.classList.add('active');\n            updateStatus('Awaiting input', 'active');\n        }\n\n        // Submit human response\n        function submitHumanResponse() {\n            const response = document.getElementById('modalInput').value.trim();\n            if (!response) return;\n            humanModal.classList.remove('active');\n            ws.send(JSON.stringify({ type: 'human_response', response }));\n            updateStatus('Executing', 'active');\n        }\n\n        // Show result\n       
 function showResult(data) {\n            if (data.error) {\n                showError(data.error);\n                return;\n            }\n\n            // Escape the whole result first; the citation spans below are the only\n            // markup intentionally injected into innerHTML.\n            let text = esc(data.final_result || 'No result');\n            text = text.replace(/\\[Source:[^\\]]+\\]/g, m => `<span class=\"citation\">${m}</span>`);\n\n            resultContent.innerHTML = `<div class=\"result-text\">${text}</div>`;\n\n            const s = data.stats;\n            if (s) {\n                statsBar.style.display = 'grid';\n                document.getElementById('statSteps').textContent = s.steps;\n                document.getElementById('statScanned').textContent = s.documents_scanned;\n                document.getElementById('statParsed').textContent = s.documents_parsed;\n                document.getElementById('statCalls').textContent = s.api_calls;\n                document.getElementById('statTokens').textContent = formatNum(s.total_tokens);\n                document.getElementById('statCost').textContent = '$' + s.estimated_cost.toFixed(4);\n            }\n\n            updateStatus('Complete', '');\n        }\n\n        // Show error\n        function showError(msg) {\n            resultContent.innerHTML = `<div class=\"empty-state\"><div class=\"prompt\">Error</div><div class=\"hint\">${esc(msg)}</div></div>`;\n            updateStatus('Error', 'error');\n        }\n\n        // Finish\n        function finish() {\n            isRunning = false;\n            queryBtn.disabled = false;\n            progressBar.classList.remove('active');\n            if (ws) { ws.close(); ws = null; }\n        }\n\n        // Update status\n        function updateStatus(text, state) {\n            statusText.textContent = text;\n            statusDot.className = 'status-dot' + (state ? 
' ' + state : '');\n        }\n\n        // Escape HTML\n        function esc(s) {\n            const d = document.createElement('div');\n            d.textContent = s;\n            return d.innerHTML;\n        }\n\n        // Format number\n        function formatNum(n) {\n            if (n >= 1e6) return (n/1e6).toFixed(1) + 'M';\n            if (n >= 1e3) return (n/1e3).toFixed(1) + 'K';\n            return n;\n        }\n\n        // ========== Folder Picker ==========\n\n        async function openFolderPicker() {\n            folderModal.classList.add('active');\n            browsingPath = currentFolder;\n            await loadFolders(browsingPath);\n        }\n\n        function closeFolderPicker() {\n            folderModal.classList.remove('active');\n        }\n\n        async function loadFolders(path) {\n            folderList.innerHTML = '<div class=\"loading-text\">Loading...</div>';\n            try {\n                const res = await fetch(`/api/folders?path=${encodeURIComponent(path)}`);\n                const data = await res.json();\n\n                if (data.error) {\n                    folderList.innerHTML = `<div class=\"folder-empty\">${esc(data.error)}</div>`;\n                    return;\n                }\n\n                browsingPath = data.current;\n                folderModalPath.textContent = data.current;\n                folderUpBtn.disabled = !data.parent;\n\n                if (data.folders.length === 0) {\n                    folderList.innerHTML = `\n                        <div class=\"folder-empty\">No subfolders</div>\n                        <div class=\"folder-info\" style=\"text-align:center\">${data.files_count} file(s) in this folder</div>\n                    `;\n                } else {\n                    // Percent-encode names so quotes cannot break out of the\n                    // inline onclick handler (esc() does not escape quotes).\n                    folderList.innerHTML = data.folders.map(name => `\n                        <div class=\"folder-item\" onclick=\"navigateToFolder(decodeURIComponent('${encodeURIComponent(name).replace(/'/g, '%27')}'))\">\n                            <span 
class=\"folder-icon\">▸</span>\n                            <span class=\"folder-name\">${esc(name)}</span>\n                        </div>\n                    `).join('');\n                    folderList.innerHTML += `<div class=\"folder-info\">${data.files_count} file(s) in this folder</div>`;\n                }\n            } catch (e) {\n                folderList.innerHTML = `<div class=\"folder-empty\">Error: ${esc(e.message)}</div>`;\n            }\n        }\n\n        async function navigateToFolder(name) {\n            const newPath = browsingPath === '.' ? name : `${browsingPath}/${name}`;\n            await loadFolders(newPath);\n        }\n\n        async function navigateUp() {\n            const res = await fetch(`/api/folders?path=${encodeURIComponent(browsingPath)}`);\n            const data = await res.json();\n            if (data.parent) {\n                await loadFolders(data.parent);\n            }\n        }\n\n        function selectCurrentFolder() {\n            currentFolder = browsingPath;\n            currentPathEl.textContent = currentFolder;\n            closeFolderPicker();\n            checkIndexStatus(currentFolder);\n        }\n\n        // Initialize with current directory\n        (async function init() {\n            try {\n                const res = await fetch('/api/folders?path=.');\n                const data = await res.json();\n                if (data.current) {\n                    currentFolder = data.current;\n                    currentPathEl.textContent = currentFolder;\n                }\n            } catch (e) {\n                console.error('Failed to get initial folder:', e);\n            }\n            checkIndexStatus(currentFolder);\n        })();\n    </script>\n</body>\n</html>\n"
  },
  {
    "path": "src/fs_explorer/workflow.py",
    "content": "\"\"\"\nWorkflow orchestration for the FsExplorer agent.\n\nThis module defines the event-driven workflow that coordinates the agent's\nexploration of the filesystem, handling tool calls, directory navigation,\nand human interaction.\n\"\"\"\n\nimport contextvars\nimport os\n\nfrom workflows import Workflow, Context, step\nfrom workflows.events import (\n    StartEvent,\n    StopEvent,\n    Event,\n    InputRequiredEvent,\n    HumanResponseEvent,\n)\nfrom workflows.resource import Resource\nfrom pydantic import BaseModel\nfrom typing import Annotated, cast, Any\n\nfrom .agent import FsExplorerAgent\nfrom .models import GoDeeperAction, ToolCallAction, StopAction, AskHumanAction, Action\nfrom .fs import describe_dir_content\n\n# Per-asyncio-task agent storage — each WebSocket connection gets its own.\n_AGENT_VAR: contextvars.ContextVar[FsExplorerAgent | None] = contextvars.ContextVar(\n    \"_AGENT_VAR\", default=None\n)\n\n\ndef get_agent() -> FsExplorerAgent:\n    \"\"\"Get or create the agent instance for the current context.\"\"\"\n    agent = _AGENT_VAR.get()\n    if agent is None:\n        agent = FsExplorerAgent()\n        _AGENT_VAR.set(agent)\n    return agent\n\n\ndef reset_agent() -> None:\n    \"\"\"Reset the agent instance for the current context.\"\"\"\n    _AGENT_VAR.set(None)\n\n\nclass WorkflowState(BaseModel):\n    \"\"\"State maintained throughout the workflow execution.\"\"\"\n\n    initial_task: str = \"\"\n    root_directory: str = \".\"\n    current_directory: str = \".\"\n    use_index: bool = False\n    enable_semantic: bool = False\n    enable_metadata: bool = False\n\n\nclass InputEvent(StartEvent):\n    \"\"\"Initial event containing the user's task.\"\"\"\n\n    task: str\n    folder: str = \".\"\n    use_index: bool = False\n    enable_semantic: bool = False\n    enable_metadata: bool = False\n\n\nclass GoDeeperEvent(Event):\n    \"\"\"Event triggered when navigating into a subdirectory.\"\"\"\n\n    directory: str\n    
reason: str\n\n\nclass ToolCallEvent(Event):\n    \"\"\"Event triggered when executing a tool.\"\"\"\n\n    tool_name: str\n    tool_input: dict[str, Any]\n    reason: str\n\n\nclass AskHumanEvent(InputRequiredEvent):\n    \"\"\"Event triggered when human input is required.\"\"\"\n\n    question: str\n    reason: str\n\n\nclass HumanAnswerEvent(HumanResponseEvent):\n    \"\"\"Event containing the human's response.\"\"\"\n\n    response: str\n\n\nclass ExplorationEndEvent(StopEvent):\n    \"\"\"Event signaling the end of exploration.\"\"\"\n\n    final_result: str | None = None\n    error: str | None = None\n\n\n# Type alias for the union of possible workflow events\nWorkflowEvent = ExplorationEndEvent | GoDeeperEvent | ToolCallEvent | AskHumanEvent\n\n\ndef _handle_action_result(\n    action: Action,\n    action_type: str,\n    ctx: Context[WorkflowState],\n) -> WorkflowEvent:\n    \"\"\"\n    Convert an action result into the appropriate workflow event.\n\n    This helper extracts the common logic for handling agent action results,\n    reducing code duplication across workflow steps.\n\n    Args:\n        action: The action returned by the agent\n        action_type: The type of action (\"godeeper\", \"toolcall\", \"askhuman\", \"stop\")\n        ctx: The workflow context for state updates and event streaming\n\n    Returns:\n        The appropriate workflow event based on the action type\n    \"\"\"\n    if action_type == \"godeeper\":\n        godeeper = cast(GoDeeperAction, action.action)\n        event = GoDeeperEvent(directory=godeeper.directory, reason=action.reason)\n        ctx.write_event_to_stream(event)\n        return event\n\n    elif action_type == \"toolcall\":\n        toolcall = cast(ToolCallAction, action.action)\n        event = ToolCallEvent(\n            tool_name=toolcall.tool_name,\n            tool_input=toolcall.to_fn_args(),\n            reason=action.reason,\n        )\n        ctx.write_event_to_stream(event)\n        return event\n\n  
  elif action_type == \"askhuman\":\n        askhuman = cast(AskHumanAction, action.action)\n        # InputRequiredEvent is written to the stream by default\n        return AskHumanEvent(question=askhuman.question, reason=action.reason)\n\n    else:  # stop\n        stopaction = cast(StopAction, action.action)\n        return ExplorationEndEvent(final_result=stopaction.final_result)\n\n\nasync def _process_agent_action(\n    agent: FsExplorerAgent,\n    ctx: Context[WorkflowState],\n    update_directory: bool = False,\n) -> WorkflowEvent:\n    \"\"\"\n    Process the agent's next action and return the appropriate event.\n\n    Args:\n        agent: The agent instance\n        ctx: The workflow context\n        update_directory: Whether to update the current directory on godeeper action\n\n    Returns:\n        The appropriate workflow event\n    \"\"\"\n    result = await agent.take_action()\n\n    if result is None:\n        return ExplorationEndEvent(error=\"Could not produce action to take\")\n\n    action, action_type = result\n\n    # Update directory state if needed for godeeper actions\n    if update_directory and action_type == \"godeeper\":\n        godeeper = cast(GoDeeperAction, action.action)\n        async with ctx.store.edit_state() as state:\n            state.current_directory = godeeper.directory\n\n    return _handle_action_result(action, action_type, ctx)\n\n\nclass FsExplorerWorkflow(Workflow):\n    \"\"\"\n    Event-driven workflow for filesystem exploration.\n\n    Coordinates the agent's actions through a series of steps:\n    - start_exploration: Initial task processing\n    - go_deeper_action: Directory navigation\n    - tool_call_action: Tool execution\n    - receive_human_answer: Human interaction handling\n    \"\"\"\n\n    @step\n    async def start_exploration(\n        self,\n        ev: InputEvent,\n        ctx: Context[WorkflowState],\n        agent: Annotated[FsExplorerAgent, Resource(get_agent)],\n    ) -> WorkflowEvent:\n        
\"\"\"Initialize exploration with the user's task.\"\"\"\n        root_directory = os.path.abspath(ev.folder)\n        if not os.path.exists(root_directory) or not os.path.isdir(root_directory):\n            return ExplorationEndEvent(error=f\"No such directory: {root_directory}\")\n\n        async with ctx.store.edit_state() as state:\n            state.initial_task = ev.task\n            state.root_directory = root_directory\n            state.current_directory = root_directory\n            state.use_index = ev.use_index\n            state.enable_semantic = ev.enable_semantic\n            state.enable_metadata = ev.enable_metadata\n\n        dirdescription = describe_dir_content(root_directory)\n        if ev.enable_semantic and ev.enable_metadata:\n            index_hint = (\n                \"An index is available. Start with `semantic_search` (with optional \"\n                \"filters) for fast retrieval, then use filesystem tools for deep dives.\"\n            )\n        elif ev.enable_semantic:\n            index_hint = (\n                \"An index is available. Use `semantic_search` (no filters) for \"\n                \"similarity search, then use filesystem tools for details.\"\n            )\n        elif ev.enable_metadata:\n            index_hint = (\n                \"An index is available. Use `semantic_search` with metadata \"\n                \"filters, then use filesystem tools for details.\"\n            )\n        else:\n            index_hint = \"Prefer absolute paths from the directory listing when calling tools.\"\n        agent.configure_task(\n            f\"Given that the current directory ('{root_directory}') looks like this:\\n\\n\"\n            f\"```text\\n{dirdescription}\\n```\\n\\n\"\n            f\"And that the user is giving you this task: '{ev.task}', \"\n            f\"what action should you take first? 
 {index_hint}\"\n        )\n\n        return await _process_agent_action(agent, ctx, update_directory=True)\n\n    @step\n    async def go_deeper_action(\n        self,\n        ev: GoDeeperEvent,\n        ctx: Context[WorkflowState],\n        agent: Annotated[FsExplorerAgent, Resource(get_agent)],\n    ) -> WorkflowEvent:\n        \"\"\"Handle navigation into a subdirectory.\"\"\"\n        state = await ctx.store.get_state()\n        dirdescription = describe_dir_content(state.current_directory)\n\n        agent.configure_task(\n            f\"Given that the current directory ('{state.current_directory}') \"\n            f\"looks like this:\\n\\n```text\\n{dirdescription}\\n```\\n\\n\"\n            f\"And that the user is giving you this task: '{state.initial_task}', \"\n            f\"what action should you take next?\"\n        )\n\n        return await _process_agent_action(agent, ctx, update_directory=True)\n\n    @step\n    async def receive_human_answer(\n        self,\n        ev: HumanAnswerEvent,\n        ctx: Context[WorkflowState],\n        agent: Annotated[FsExplorerAgent, Resource(get_agent)],\n    ) -> WorkflowEvent:\n        \"\"\"Process the human's response to a question.\"\"\"\n        state = await ctx.store.get_state()\n\n        agent.configure_task(\n            f\"Human response to your question: {ev.response}\\n\\n\"\n            f\"Taking it into account, continue your exploration of the \"\n            f\"original task: {state.initial_task}\"\n        )\n\n        return await _process_agent_action(agent, ctx, update_directory=True)\n\n    @step\n    async def tool_call_action(\n        self,\n        ev: ToolCallEvent,\n        ctx: Context[WorkflowState],\n        agent: Annotated[FsExplorerAgent, Resource(get_agent)],\n    ) -> WorkflowEvent:\n        \"\"\"Process the result of a tool call.\"\"\"\n        agent.configure_task(\n            \"Given the result from the tool call you just performed, \"\n            \"what action should you 
take next?\"\n        )\n\n        return await _process_agent_action(agent, ctx, update_directory=True)\n\n\n# Workflow timeout for complex multi-document analysis (5 minutes)\nWORKFLOW_TIMEOUT_SECONDS = 300\n\nworkflow = FsExplorerWorkflow(timeout=WORKFLOW_TIMEOUT_SECONDS)\n"
  },
  {
    "path": "tests/__init__.py",
    "content": ""
  },
  {
    "path": "tests/conftest.py",
    "content": "\"\"\"\nPytest fixtures and mocks for FsExplorer tests.\n\nProvides mock implementations of the Google GenAI client for unit testing\nwithout making actual API calls.\n\"\"\"\n\nfrom google.genai.types import (\n    HttpOptions,\n    Content,\n    GenerateContentResponse,\n    Candidate,\n    Part,\n    GenerateContentResponseUsageMetadata,\n)\nfrom fs_explorer.models import StopAction, Action\n\n\nclass MockModels:\n    \"\"\"Mock implementation of the GenAI models interface.\"\"\"\n    \n    async def generate_content(self, *args, **kwargs) -> GenerateContentResponse:\n        \"\"\"Return a mock response with a stop action.\"\"\"\n        return GenerateContentResponse(\n            candidates=[\n                Candidate(\n                    content=Content(\n                        role=\"model\",\n                        parts=[\n                            Part.from_text(\n                                text=Action(\n                                    action=StopAction(\n                                        final_result=\"this is a final result\"\n                                    ),\n                                    reason=\"I am done\",\n                                ).model_dump_json()\n                            )\n                        ],\n                    )\n                )\n            ],\n            usage_metadata=GenerateContentResponseUsageMetadata(\n                prompt_token_count=100,\n                candidates_token_count=50,\n                total_token_count=150,\n            ),\n        )\n\n\nclass MockAio:\n    \"\"\"Mock implementation of the async GenAI interface.\"\"\"\n    \n    @property\n    def models(self) -> MockModels:\n        \"\"\"Return mock models interface.\"\"\"\n        return MockModels()\n\n\nclass MockGenAIClient:\n    \"\"\"\n    Mock implementation of the Google GenAI client.\n    \n    Provides predictable responses for testing without API calls.\n    \"\"\"\n    \n    def 
__init__(self, api_key: str, http_options: HttpOptions) -> None:\n        \"\"\"Initialize mock client (ignores parameters).\"\"\"\n        pass\n\n    @property\n    def aio(self) -> MockAio:\n        \"\"\"Return mock async interface.\"\"\"\n        return MockAio()\n"
  },
  {
    "path": "tests/test_agent.py",
    "content": "\"\"\"Tests for the FsExplorerAgent class.\"\"\"\n\nimport pytest\nimport os\n\nfrom unittest.mock import patch\nfrom google.genai import Client as GenAIClient\nfrom google.genai.types import HttpOptions\n\nfrom fs_explorer.agent import (\n    FsExplorerAgent,\n    SYSTEM_PROMPT,\n    TokenUsage,\n    _build_system_prompt,\n    set_search_flags,\n    get_search_flags,\n    clear_index_context,\n)\nfrom fs_explorer.models import Action, StopAction\nfrom .conftest import MockGenAIClient\n\n\nclass TestAgentInitialization:\n    \"\"\"Tests for agent initialization.\"\"\"\n    \n    @patch.dict(os.environ, {\"GOOGLE_API_KEY\": \"test-api-key\"})\n    def test_agent_init_with_env_key(self) -> None:\n        \"\"\"Test agent initialization with API key from environment.\"\"\"\n        agent = FsExplorerAgent()\n        assert isinstance(agent._client, GenAIClient)\n        assert len(agent._chat_history) == 0  # No system prompt in history\n        assert isinstance(agent.token_usage, TokenUsage)\n\n    def test_agent_init_with_explicit_key(self) -> None:\n        \"\"\"Test agent initialization with explicit API key.\"\"\"\n        agent = FsExplorerAgent(api_key=\"explicit-test-key\")\n        assert isinstance(agent._client, GenAIClient)\n\n    def test_agent_init_without_key_raises(self) -> None:\n        \"\"\"Test that initialization without API key raises ValueError.\"\"\"\n        # Ensure no key in environment\n        env = os.environ.copy()\n        if \"GOOGLE_API_KEY\" in env:\n            del env[\"GOOGLE_API_KEY\"]\n        \n        with patch.dict(os.environ, env, clear=True):\n            with pytest.raises(ValueError, match=\"GOOGLE_API_KEY not found\"):\n                FsExplorerAgent()\n\n\nclass TestAgentConfiguration:\n    \"\"\"Tests for agent task configuration.\"\"\"\n    \n    @patch.dict(os.environ, {\"GOOGLE_API_KEY\": \"test-api-key\"})\n    def test_configure_task_adds_to_history(self) -> None:\n        \"\"\"Test that 
configure_task adds message to chat history.\"\"\"\n        agent = FsExplorerAgent()\n        agent.configure_task(\"this is a task\")\n        \n        assert len(agent._chat_history) == 1\n        assert agent._chat_history[0].role == \"user\"\n        assert agent._chat_history[0].parts[0].text == \"this is a task\"\n\n    @patch.dict(os.environ, {\"GOOGLE_API_KEY\": \"test-api-key\"})\n    def test_multiple_configure_task_calls(self) -> None:\n        \"\"\"Test that multiple configure_task calls accumulate.\"\"\"\n        agent = FsExplorerAgent()\n        agent.configure_task(\"task 1\")\n        agent.configure_task(\"task 2\")\n        \n        assert len(agent._chat_history) == 2\n        assert agent._chat_history[0].parts[0].text == \"task 1\"\n        assert agent._chat_history[1].parts[0].text == \"task 2\"\n\n\nclass TestAgentActions:\n    \"\"\"Tests for agent action handling.\"\"\"\n    \n    @pytest.mark.asyncio\n    @patch.dict(os.environ, {\"GOOGLE_API_KEY\": \"test-api-key\"})\n    async def test_take_action_returns_action(self) -> None:\n        \"\"\"Test that take_action returns an action from the model.\"\"\"\n        agent = FsExplorerAgent()\n        agent.configure_task(\"this is a task\")\n        agent._client = MockGenAIClient(\n            api_key=\"test\", \n            http_options=HttpOptions(api_version=\"v1beta\")\n        )\n        \n        result = await agent.take_action()\n        \n        assert result is not None\n        action, action_type = result\n        assert isinstance(action, Action)\n        assert isinstance(action.action, StopAction)\n        assert action.action.final_result == \"this is a final result\"\n        assert action.reason == \"I am done\"\n        assert action_type == \"stop\"\n\n    @patch.dict(os.environ, {\"GOOGLE_API_KEY\": \"test-api-key\"})\n    def test_reset_clears_history(self) -> None:\n        \"\"\"Test that reset clears chat history and token usage.\"\"\"\n        agent = 
FsExplorerAgent()\n        agent.configure_task(\"task 1\")\n        agent.token_usage.api_calls = 5\n        \n        agent.reset()\n        \n        assert len(agent._chat_history) == 0\n        assert agent.token_usage.api_calls == 0\n\n\nclass TestTokenUsage:\n    \"\"\"Tests for TokenUsage tracking.\"\"\"\n    \n    def test_add_api_call(self) -> None:\n        \"\"\"Test adding API call metrics.\"\"\"\n        usage = TokenUsage()\n        usage.add_api_call(100, 50)\n        \n        assert usage.prompt_tokens == 100\n        assert usage.completion_tokens == 50\n        assert usage.total_tokens == 150\n        assert usage.api_calls == 1\n\n    def test_add_tool_result_parse_file(self) -> None:\n        \"\"\"Test tracking parse_file tool usage.\"\"\"\n        usage = TokenUsage()\n        usage.add_tool_result(\"document content here\", \"parse_file\")\n        \n        assert usage.documents_parsed == 1\n        assert usage.tool_result_chars == len(\"document content here\")\n\n    def test_add_tool_result_scan_folder(self) -> None:\n        \"\"\"Test tracking scan_folder tool usage.\"\"\"\n        usage = TokenUsage()\n        # Simulating scan output with document markers\n        result = \"│ [1/3] doc1.pdf\\n│ [2/3] doc2.pdf\\n│ [3/3] doc3.pdf\"\n        usage.add_tool_result(result, \"scan_folder\")\n        \n        assert usage.documents_scanned == 3\n\n    def test_summary_format(self) -> None:\n        \"\"\"Test that summary produces formatted output.\"\"\"\n        usage = TokenUsage()\n        usage.add_api_call(1000, 500)\n        \n        summary = usage.summary()\n        \n        assert \"TOKEN USAGE SUMMARY\" in summary\n        assert \"1,000\" in summary  # Formatted prompt tokens\n        assert \"API Calls:\" in summary\n        assert \"Est. 
Cost\" in summary\n\n\nclass TestSystemPrompt:\n    \"\"\"Tests for system prompt configuration.\"\"\"\n    \n    def test_system_prompt_contains_tools(self) -> None:\n        \"\"\"Test that system prompt documents all tools.\"\"\"\n        assert \"scan_folder\" in SYSTEM_PROMPT\n        assert \"preview_file\" in SYSTEM_PROMPT\n        assert \"parse_file\" in SYSTEM_PROMPT\n        assert \"read\" in SYSTEM_PROMPT\n        assert \"grep\" in SYSTEM_PROMPT\n        assert \"glob\" in SYSTEM_PROMPT\n\n    def test_system_prompt_contains_strategy(self) -> None:\n        \"\"\"Test that system prompt includes exploration strategy.\"\"\"\n        assert \"Three-Phase\" in SYSTEM_PROMPT or \"PHASE\" in SYSTEM_PROMPT\n        assert \"Parallel Scan\" in SYSTEM_PROMPT or \"PARALLEL\" in SYSTEM_PROMPT\n        assert \"Backtracking\" in SYSTEM_PROMPT or \"BACKTRACK\" in SYSTEM_PROMPT\n\n    def test_system_prompt_contains_index_tools(self) -> None:\n        \"\"\"Test that system prompt documents index-aware tools.\"\"\"\n        assert \"semantic_search\" in SYSTEM_PROMPT\n        assert \"get_document\" in SYSTEM_PROMPT\n        assert \"list_indexed_documents\" in SYSTEM_PROMPT\n\n\nclass TestSearchFlags:\n    \"\"\"Tests for search flag state and dynamic system prompt.\"\"\"\n\n    def setup_method(self) -> None:\n        clear_index_context()\n\n    def teardown_method(self) -> None:\n        clear_index_context()\n\n    def test_set_and_get_search_flags(self) -> None:\n        assert get_search_flags() == (False, False)\n        set_search_flags(enable_semantic=True, enable_metadata=False)\n        assert get_search_flags() == (True, False)\n        set_search_flags(enable_semantic=False, enable_metadata=False)\n        assert get_search_flags() == (False, False)\n\n    def test_clear_index_context_resets_flags(self) -> None:\n        set_search_flags(enable_semantic=True, enable_metadata=True)\n        clear_index_context()\n        assert get_search_flags() == 
(False, False)\n\n    def test_build_system_prompt_no_index(self) -> None:\n        prompt = _build_system_prompt(False, False)\n        assert prompt == SYSTEM_PROMPT\n\n    def test_build_system_prompt_semantic_only(self) -> None:\n        prompt = _build_system_prompt(True, False)\n        assert \"Semantic Only\" in prompt\n        assert \"WITHOUT the `filters`\" in prompt\n\n    def test_build_system_prompt_metadata_only(self) -> None:\n        prompt = _build_system_prompt(False, True)\n        assert \"Metadata Only\" in prompt\n        assert \"metadata filtering\" in prompt\n\n    def test_build_system_prompt_both(self) -> None:\n        prompt = _build_system_prompt(True, True)\n        assert \"Semantic + Metadata\" in prompt\n\n    @patch.dict(os.environ, {\"GOOGLE_API_KEY\": \"test-api-key\"})\n    def test_all_tools_always_available(self) -> None:\n        \"\"\"Filesystem and indexed tools are never blocked.\"\"\"\n        set_search_flags(enable_semantic=False, enable_metadata=False)\n        agent = FsExplorerAgent()\n        agent.configure_task(\"test\")\n        agent.call_tool(\"glob\", {\"directory\": \"/tmp\", \"pattern\": \"*.md\"})\n\n        last = agent._chat_history[-1]\n        assert \"not available\" not in last.parts[0].text\n"
  },
  {
    "path": "tests/test_cli_indexing.py",
    "content": "\"\"\"CLI tests for indexing and schema commands.\"\"\"\n\nfrom pathlib import Path\n\nimport fs_explorer.indexing.pipeline as pipeline_module\nimport fs_explorer.main as main_module\nfrom fs_explorer.storage import DuckDBStorage\nfrom typer.testing import CliRunner\n\n\ndef test_root_task_mode_remains_compatible(tmp_path: Path, monkeypatch) -> None:\n    called: dict[str, object] = {}\n\n    async def fake_run_workflow(\n        task: str,\n        folder: str = \".\",\n        *,\n        use_index: bool = False,\n        db_path: str | None = None,\n    ) -> None:\n        called[\"task\"] = task\n        called[\"folder\"] = folder\n        called[\"use_index\"] = use_index\n        called[\"db_path\"] = db_path\n\n    monkeypatch.setattr(main_module, \"run_workflow\", fake_run_workflow)\n\n    runner = CliRunner()\n    result = runner.invoke(\n        main_module.app,\n        [\"--task\", \"who is the CTO?\", \"--folder\", str(tmp_path)],\n    )\n\n    assert result.exit_code == 0\n    assert called[\"task\"] == \"who is the CTO?\"\n    assert called[\"folder\"] == str(tmp_path)\n    assert called[\"use_index\"] is False\n\n\ndef test_query_command_enables_index_mode(tmp_path: Path, monkeypatch) -> None:\n    called: dict[str, object] = {}\n\n    async def fake_run_workflow(\n        task: str,\n        folder: str = \".\",\n        *,\n        use_index: bool = False,\n        db_path: str | None = None,\n    ) -> None:\n        called[\"task\"] = task\n        called[\"folder\"] = folder\n        called[\"use_index\"] = use_index\n        called[\"db_path\"] = db_path\n\n    monkeypatch.setattr(main_module, \"run_workflow\", fake_run_workflow)\n\n    runner = CliRunner()\n    result = runner.invoke(\n        main_module.app,\n        [\n            \"query\",\n            \"--task\",\n            \"purchase price?\",\n            \"--folder\",\n            str(tmp_path),\n            \"--db-path\",\n            \"tmp.duckdb\",\n        ],\n  
  )\n\n    assert result.exit_code == 0\n    assert called[\"task\"] == \"purchase price?\"\n    assert called[\"folder\"] == str(tmp_path)\n    assert called[\"use_index\"] is True\n    assert called[\"db_path\"] == \"tmp.duckdb\"\n\n\ndef test_index_and_schema_commands(tmp_path: Path, monkeypatch) -> None:\n    corpus = tmp_path / \"corpus\"\n    corpus.mkdir()\n    (corpus / \"agreement.md\").write_text(\"Purchase price is $10.\")\n    (corpus / \"risk_report.md\").write_text(\"Risk summary here.\")\n\n    # Replace Docling path with plain text read for this unit test.\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n\n    db_path = tmp_path / \"index.duckdb\"\n    runner = CliRunner()\n\n    index_result = runner.invoke(\n        main_module.app,\n        [\"index\", str(corpus), \"--db-path\", str(db_path), \"--discover-schema\"],\n    )\n    assert index_result.exit_code == 0\n    assert \"Index Complete\" in index_result.stdout\n\n    show_result = runner.invoke(\n        main_module.app,\n        [\"schema\", \"show\", str(corpus), \"--db-path\", str(db_path)],\n    )\n    assert show_result.exit_code == 0\n    assert \"auto_corpus\" in show_result.stdout\n\n\ndef test_index_command_with_metadata_forces_schema_discovery(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    called: dict[str, object] = {}\n\n    class FakePipeline:\n        def __init__(self, storage, embedding_provider=None) -> None:  # noqa: ANN001\n            called[\"storage_type\"] = type(storage).__name__\n\n        def index_folder(\n            self,\n            folder: str,\n            *,\n            discover_schema: bool = False,\n            schema_name: str | None = None,\n            with_metadata: bool = False,\n            metadata_profile: dict | None = None,\n        ):\n            called[\"folder\"] = folder\n            called[\"discover_schema\"] = discover_schema\n          
  called[\"schema_name\"] = schema_name\n            called[\"with_metadata\"] = with_metadata\n            called[\"metadata_profile\"] = metadata_profile\n            return pipeline_module.IndexingResult(\n                corpus_id=\"corpus_123\",\n                indexed_files=1,\n                skipped_files=0,\n                deleted_files=0,\n                chunks_written=1,\n                active_documents=1,\n                schema_used=\"auto_corpus\",\n            )\n\n    monkeypatch.setattr(main_module, \"IndexingPipeline\", FakePipeline)\n\n    db_path = tmp_path / \"index.duckdb\"\n    corpus = tmp_path / \"corpus\"\n    corpus.mkdir()\n\n    runner = CliRunner()\n    result = runner.invoke(\n        main_module.app,\n        [\"index\", str(corpus), \"--db-path\", str(db_path), \"--with-metadata\"],\n    )\n\n    assert result.exit_code == 0\n    assert called[\"with_metadata\"] is True\n    assert called[\"discover_schema\"] is True\n    assert called[\"metadata_profile\"] is None\n\n\ndef test_index_command_with_metadata_profile_path(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    called: dict[str, object] = {}\n\n    class FakePipeline:\n        def __init__(self, storage, embedding_provider=None) -> None:  # noqa: ANN001\n            called[\"storage_type\"] = type(storage).__name__\n\n        def index_folder(\n            self,\n            folder: str,\n            *,\n            discover_schema: bool = False,\n            schema_name: str | None = None,\n            with_metadata: bool = False,\n            metadata_profile: dict | None = None,\n        ):\n            called[\"folder\"] = folder\n            called[\"discover_schema\"] = discover_schema\n            called[\"schema_name\"] = schema_name\n            called[\"with_metadata\"] = with_metadata\n            called[\"metadata_profile\"] = metadata_profile\n            return pipeline_module.IndexingResult(\n                corpus_id=\"corpus_123\",\n                
indexed_files=1,\n                skipped_files=0,\n                deleted_files=0,\n                chunks_written=1,\n                active_documents=1,\n                schema_used=\"auto_corpus\",\n            )\n\n    monkeypatch.setattr(main_module, \"IndexingPipeline\", FakePipeline)\n\n    db_path = tmp_path / \"index.duckdb\"\n    corpus = tmp_path / \"corpus\"\n    corpus.mkdir()\n    metadata_profile_path = tmp_path / \"profile.json\"\n    metadata_profile_path.write_text(\n        (\n            \"{\"\n            '\"prompt_description\": \"Extract organizations.\", '\n            '\"fields\": ['\n            '{\"name\": \"org_names\", \"type\": \"string\", \"source_class\": \"organization\"}'\n            \"]\"\n            \"}\"\n        )\n    )\n\n    runner = CliRunner()\n    result = runner.invoke(\n        main_module.app,\n        [\n            \"index\",\n            str(corpus),\n            \"--db-path\",\n            str(db_path),\n            \"--metadata-profile\",\n            str(metadata_profile_path),\n        ],\n    )\n\n    assert result.exit_code == 0\n    assert called[\"with_metadata\"] is True\n    assert called[\"discover_schema\"] is True\n    assert isinstance(called[\"metadata_profile\"], dict)\n    assert called[\"metadata_profile\"][\"fields\"][0][\"name\"] == \"org_names\"\n\n\ndef test_index_command_with_embeddings_flag(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    \"\"\"--with-embeddings creates an EmbeddingProvider and passes it to the pipeline.\"\"\"\n    calls: dict[str, object] = {}\n\n    class FakePipeline:\n        def __init__(self, storage, embedding_provider=None) -> None:  # noqa: ANN001\n            calls[\"has_embedding_provider\"] = embedding_provider is not None\n\n        def index_folder(self, folder, **kwargs):  # noqa: ANN001, ANN003\n            return pipeline_module.IndexingResult(\n                corpus_id=\"corpus_123\",\n                indexed_files=1,\n                
skipped_files=0,\n                deleted_files=0,\n                chunks_written=1,\n                active_documents=1,\n                schema_used=None,\n                embeddings_written=5,\n            )\n\n    class FakeEmbeddingProvider:\n        def __init__(self, **kwargs):  # noqa: ANN003\n            pass\n\n    monkeypatch.setattr(main_module, \"IndexingPipeline\", FakePipeline)\n    monkeypatch.setattr(main_module, \"EmbeddingProvider\", FakeEmbeddingProvider)\n\n    db_path = tmp_path / \"index.duckdb\"\n    corpus = tmp_path / \"corpus\"\n    corpus.mkdir()\n\n    runner = CliRunner()\n    result = runner.invoke(\n        main_module.app,\n        [\"index\", str(corpus), \"--db-path\", str(db_path), \"--with-embeddings\"],\n    )\n\n    assert result.exit_code == 0\n    assert calls[\"has_embedding_provider\"] is True\n    assert \"Embeddings Written\" in result.stdout\n\n\ndef test_auto_index_env_var_enables_use_index(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    \"\"\"FS_EXPLORER_AUTO_INDEX=1 auto-enables --use-index when index exists.\"\"\"\n    called: dict[str, object] = {}\n\n    async def fake_run_workflow(\n        task: str,\n        folder: str = \".\",\n        *,\n        use_index: bool = False,\n        db_path: str | None = None,\n    ) -> None:\n        called[\"use_index\"] = use_index\n\n    monkeypatch.setattr(main_module, \"run_workflow\", fake_run_workflow)\n    monkeypatch.setenv(\"FS_EXPLORER_AUTO_INDEX\", \"1\")\n\n    # Create a real DuckDB with a corpus entry so auto-index detection works.\n    corpus = tmp_path / \"corpus\"\n    corpus.mkdir()\n    db_path = tmp_path / \"index.duckdb\"\n    storage = DuckDBStorage(str(db_path))\n    storage.get_or_create_corpus(str(corpus.resolve()))\n    storage.close()\n\n    monkeypatch.setenv(\"FS_EXPLORER_DB_PATH\", str(db_path))\n\n    runner = CliRunner()\n    result = runner.invoke(\n        main_module.app,\n        [\"--task\", \"test question\", \"--folder\", 
str(corpus)],\n    )\n\n    assert result.exit_code == 0\n    assert called[\"use_index\"] is True\n\n\ndef test_auto_index_env_var_silent_fallback(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    \"\"\"FS_EXPLORER_AUTO_INDEX=1 gracefully falls back when no index exists.\"\"\"\n    called: dict[str, object] = {}\n\n    async def fake_run_workflow(\n        task: str,\n        folder: str = \".\",\n        *,\n        use_index: bool = False,\n        db_path: str | None = None,\n    ) -> None:\n        called[\"use_index\"] = use_index\n\n    monkeypatch.setattr(main_module, \"run_workflow\", fake_run_workflow)\n    monkeypatch.setenv(\"FS_EXPLORER_AUTO_INDEX\", \"1\")\n\n    corpus = tmp_path / \"empty_corpus\"\n    corpus.mkdir()\n\n    runner = CliRunner()\n    result = runner.invoke(\n        main_module.app,\n        [\"--task\", \"test question\", \"--folder\", str(corpus)],\n    )\n\n    assert result.exit_code == 0\n    assert called[\"use_index\"] is False\n"
  },
  {
    "path": "tests/test_e2e.py",
    "content": "import pytest\nimport os\n\nfrom workflows.testing import WorkflowTestRunner\n\nSKIP_IF, SKIP_REASON = (\n    os.getenv(\"GOOGLE_API_KEY\") is None,\n    \"GOOGLE_API_KEY not available\",\n)\n\n\n@pytest.mark.asyncio\n@pytest.mark.skipif(condition=SKIP_IF, reason=SKIP_REASON)\nasync def test_e2e() -> None:\n    from fs_explorer.workflow import (\n        workflow,\n        InputEvent,\n        ExplorationEndEvent,\n        ToolCallEvent,\n        GoDeeperEvent,\n    )\n\n    start_event = InputEvent(\n        task=\"Starting from the current directory, individuate the python file responsible for file system operations and explain what it does\"\n    )\n    runner = WorkflowTestRunner(workflow=workflow)\n    result = await runner.run(start_event=start_event)\n    assert isinstance(result.result, ExplorationEndEvent)\n    assert result.result.error is None\n    assert result.result.final_result is not None\n    assert len(result.collected) > 1\n    assert ToolCallEvent in result.event_types or GoDeeperEvent in result.event_types\n"
  },
  {
    "path": "tests/test_embeddings.py",
    "content": "\"\"\"Tests for the embedding provider.\"\"\"\n\nfrom __future__ import annotations\n\nimport os\nfrom dataclasses import dataclass\nfrom typing import Any\n\nimport pytest\n\nfrom fs_explorer.embeddings import EmbeddingProvider\n\n\n# ---------------------------------------------------------------------------\n# Mock helpers\n# ---------------------------------------------------------------------------\n\n\n@dataclass\nclass _FakeEmbedding:\n    values: list[float]\n\n\n@dataclass\nclass _FakeEmbedResult:\n    embeddings: list[_FakeEmbedding]\n\n\nclass _FakeModels:\n    \"\"\"Records calls and returns deterministic embeddings.\"\"\"\n\n    def __init__(self) -> None:\n        self.calls: list[dict[str, Any]] = []\n\n    def embed_content(\n        self, *, model: str, contents: list[str], config: dict\n    ) -> _FakeEmbedResult:\n        self.calls.append({\"model\": model, \"contents\": contents, \"config\": config})\n        dim = config.get(\"output_dimensionality\", 768)\n        return _FakeEmbedResult(\n            embeddings=[\n                _FakeEmbedding(values=[float(i)] * dim) for i in range(len(contents))\n            ]\n        )\n\n\nclass _FakeClient:\n    def __init__(self) -> None:\n        self.models = _FakeModels()\n\n\n# ---------------------------------------------------------------------------\n# Unit tests (mock-based, no API key needed)\n# ---------------------------------------------------------------------------\n\n\ndef test_embed_texts_returns_correct_count() -> None:\n    client = _FakeClient()\n    provider = EmbeddingProvider(client=client, dim=4, batch_size=50)\n\n    embeddings = provider.embed_texts([\"hello\", \"world\"])\n\n    assert len(embeddings) == 2\n    assert len(embeddings[0]) == 4\n\n\ndef test_embed_texts_uses_document_task_type() -> None:\n    client = _FakeClient()\n    provider = EmbeddingProvider(client=client, dim=4)\n\n    provider.embed_texts([\"test\"])\n\n    call = 
client.models.calls[0]\n    assert call[\"config\"][\"task_type\"] == \"RETRIEVAL_DOCUMENT\"\n\n\ndef test_embed_query_uses_query_task_type() -> None:\n    client = _FakeClient()\n    provider = EmbeddingProvider(client=client, dim=4)\n\n    result = provider.embed_query(\"search query\")\n\n    assert len(result) == 4\n    call = client.models.calls[0]\n    assert call[\"config\"][\"task_type\"] == \"RETRIEVAL_QUERY\"\n\n\ndef test_embed_texts_batching() -> None:\n    client = _FakeClient()\n    provider = EmbeddingProvider(client=client, dim=4, batch_size=3)\n\n    texts = [f\"text_{i}\" for i in range(7)]\n    embeddings = provider.embed_texts(texts)\n\n    assert len(embeddings) == 7\n    # 7 texts with batch_size=3 → 3 API calls (3+3+1)\n    assert len(client.models.calls) == 3\n    assert len(client.models.calls[0][\"contents\"]) == 3\n    assert len(client.models.calls[1][\"contents\"]) == 3\n    assert len(client.models.calls[2][\"contents\"]) == 1\n\n\ndef test_env_overrides(monkeypatch) -> None:\n    client = _FakeClient()\n    monkeypatch.setenv(\"FS_EXPLORER_EMBEDDING_MODEL\", \"custom-model-001\")\n    monkeypatch.setenv(\"FS_EXPLORER_EMBEDDING_DIM\", \"256\")\n    monkeypatch.setenv(\"FS_EXPLORER_EMBEDDING_BATCH_SIZE\", \"10\")\n\n    provider = EmbeddingProvider(client=client)\n\n    assert provider.model == \"custom-model-001\"\n    assert provider.dim == 256\n    assert provider.batch_size == 10\n\n    provider.embed_texts([\"test\"])\n    call = client.models.calls[0]\n    assert call[\"model\"] == \"custom-model-001\"\n    assert call[\"config\"][\"output_dimensionality\"] == 256\n\n\ndef test_missing_api_key_raises(monkeypatch) -> None:\n    monkeypatch.delenv(\"GOOGLE_API_KEY\", raising=False)\n    with pytest.raises(ValueError, match=\"GOOGLE_API_KEY\"):\n        EmbeddingProvider(api_key=None, client=None)\n\n\n# ---------------------------------------------------------------------------\n# Real API integration test (skipped unless 
GOOGLE_API_KEY is set)\n# ---------------------------------------------------------------------------\n\n\n@pytest.mark.skipif(\n    not os.getenv(\"GOOGLE_API_KEY\"),\n    reason=\"GOOGLE_API_KEY not set — skipping real embedding test\",\n)\ndef test_real_embedding_api() -> None:\n    provider = EmbeddingProvider(dim=128)\n\n    texts = [\"The purchase price is $45 million.\", \"Risk assessment summary.\"]\n    embeddings = provider.embed_texts(texts)\n\n    assert len(embeddings) == 2\n    assert len(embeddings[0]) == 128\n    assert all(isinstance(v, float) for v in embeddings[0])\n\n    query_emb = provider.embed_query(\"purchase price\")\n    assert len(query_emb) == 128\n"
  },
  {
    "path": "tests/test_exploration_trace.py",
    "content": "\"\"\"Tests for exploration trace helpers.\"\"\"\n\nimport os\n\nfrom fs_explorer.exploration_trace import (\n    ExplorationTrace,\n    extract_cited_sources,\n    normalize_path,\n)\n\n\ndef test_normalize_path_relative() -> None:\n    root = \"/tmp/project\"\n    assert normalize_path(\"docs/file.pdf\", root) == os.path.abspath(\"/tmp/project/docs/file.pdf\")\n\n\ndef test_normalize_path_absolute() -> None:\n    root = \"/tmp/project\"\n    assert normalize_path(\"/var/data/file.pdf\", root) == os.path.abspath(\"/var/data/file.pdf\")\n\n\ndef test_trace_records_steps_and_documents() -> None:\n    trace = ExplorationTrace(root_directory=\"/tmp/project\")\n\n    trace.record_tool_call(\n        step_number=1,\n        tool_name=\"scan_folder\",\n        tool_input={\"directory\": \"docs\"},\n    )\n    trace.record_tool_call(\n        step_number=2,\n        tool_name=\"parse_file\",\n        tool_input={\"file_path\": \"docs/contract.pdf\"},\n    )\n    trace.record_go_deeper(step_number=3, directory=\"docs/subdir\")\n\n    assert len(trace.step_path) == 3\n    assert \"tool:scan_folder\" in trace.step_path[0]\n    assert \"tool:parse_file\" in trace.step_path[1]\n    assert \"godeeper\" in trace.step_path[2]\n\n    referenced = trace.sorted_documents()\n    assert len(referenced) == 1\n    assert referenced[0].endswith(\"docs/contract.pdf\")\n\n\ndef test_trace_records_resolved_document_paths() -> None:\n    trace = ExplorationTrace(root_directory=\"/tmp/project\")\n\n    trace.record_tool_call(\n        step_number=1,\n        tool_name=\"get_document\",\n        tool_input={\"doc_id\": \"doc_123\"},\n        resolved_document_path=\"/tmp/project/docs/indexed.pdf\",\n    )\n\n    assert \"document=/tmp/project/docs/indexed.pdf\" in trace.step_path[0]\n    assert trace.sorted_documents() == [\"/tmp/project/docs/indexed.pdf\"]\n\n\ndef test_extract_cited_sources_ordered_unique() -> None:\n    final_result = (\n        \"Price is $10M [Source: 
agreement.pdf, Section 2.1]. \"\n        \"Escrow is $1M [Source: escrow.pdf, Section 3]. \"\n        \"Reconfirmed [Source: agreement.pdf, Section 2.1].\"\n    )\n    assert extract_cited_sources(final_result) == [\"agreement.pdf\", \"escrow.pdf\"]\n"
  },
  {
    "path": "tests/test_fs.py",
    "content": "\"\"\"Tests for filesystem utility functions.\"\"\"\n\nimport pytest\nimport os\nimport tempfile\nfrom pathlib import Path\n\nfrom fs_explorer.fs import (\n    describe_dir_content,\n    read_file,\n    grep_file_content,\n    glob_paths,\n    parse_file,\n    preview_file,\n    scan_folder,\n    clear_document_cache,\n    SUPPORTED_EXTENSIONS,\n)\n\n\nclass TestDescribeDirContent:\n    \"\"\"Tests for describe_dir_content function.\"\"\"\n    \n    def test_valid_directory(self) -> None:\n        \"\"\"Test describing a valid directory with files and subfolders.\"\"\"\n        description = describe_dir_content(\"tests/testfiles\")\n        assert \"Content of tests/testfiles\" in description\n        assert \"tests/testfiles/file1.txt\" in description\n        assert \"tests/testfiles/file2.md\" in description\n        assert \"tests/testfiles/last\" in description\n\n    def test_nonexistent_directory(self) -> None:\n        \"\"\"Test describing a directory that doesn't exist.\"\"\"\n        description = describe_dir_content(\"tests/testfile\")\n        assert description == \"No such directory: tests/testfile\"\n\n    def test_directory_without_subfolders(self) -> None:\n        \"\"\"Test describing a directory that has no subdirectories.\"\"\"\n        description = describe_dir_content(\"tests/testfiles/last\")\n        assert \"Content of tests/testfiles/last\" in description\n        assert \"tests/testfiles/last/lastfile.txt\" in description\n        assert \"This folder does not have any sub-folders\" in description\n\n\nclass TestReadFile:\n    \"\"\"Tests for read_file function.\"\"\"\n    \n    def test_valid_file(self) -> None:\n        \"\"\"Test reading a valid text file.\"\"\"\n        content = read_file(\"tests/testfiles/file1.txt\")\n        assert content.strip() == \"this is a test\"\n\n    def test_nonexistent_file(self) -> None:\n        \"\"\"Test reading a file that doesn't exist.\"\"\"\n        content = 
read_file(\"tests/testfiles/file2.txt\")\n        assert content == \"No such file: tests/testfiles/file2.txt\"\n\n\nclass TestGrepFileContent:\n    \"\"\"Tests for grep_file_content function.\"\"\"\n    \n    def test_pattern_match(self) -> None:\n        \"\"\"Test searching for a pattern that exists.\"\"\"\n        result = grep_file_content(\"tests/testfiles/file2.md\", r\"(are|is) a test\")\n        assert \"MATCHES for (are|is) a test\" in result\n        assert \"is\" in result\n\n    def test_no_match(self) -> None:\n        \"\"\"Test searching for a pattern that doesn't exist.\"\"\"\n        result = grep_file_content(\"tests/testfiles/last/lastfile.txt\", r\"test\")\n        assert result == \"No matches found\"\n\n    def test_nonexistent_file(self) -> None:\n        \"\"\"Test searching in a file that doesn't exist.\"\"\"\n        result = grep_file_content(\"tests/testfiles/file2.txt\", r\"test\")\n        assert result == \"No such file: tests/testfiles/file2.txt\"\n\n\nclass TestGlobPaths:\n    \"\"\"Tests for glob_paths function.\"\"\"\n    \n    def test_pattern_match(self) -> None:\n        \"\"\"Test finding files that match a glob pattern.\"\"\"\n        result = glob_paths(\"tests/testfiles\", \"file?.*\")\n        assert \"MATCHES for file?.* in tests/testfiles\" in result\n        assert \"file1.txt\" in result\n        assert \"file2.md\" in result\n\n    def test_no_match(self) -> None:\n        \"\"\"Test a pattern that matches nothing.\"\"\"\n        result = glob_paths(\"tests/testfiles\", \"nonexistent*\")\n        assert result == \"No matches found\"\n\n    def test_nonexistent_directory(self) -> None:\n        \"\"\"Test glob in a directory that doesn't exist.\"\"\"\n        result = glob_paths(\"tests/nonexistent\", \"*.txt\")\n        assert result == \"No such directory: tests/nonexistent\"\n\n\nclass TestDocumentParsing:\n    \"\"\"Tests for document parsing functions (parse_file, preview_file).\"\"\"\n    \n    def 
setup_method(self) -> None:\n        \"\"\"Clear cache before each test.\"\"\"\n        clear_document_cache()\n\n    def test_parse_file_nonexistent(self) -> None:\n        \"\"\"Test parsing a file that doesn't exist.\"\"\"\n        content = parse_file(\"data/nonexistent.pdf\")\n        assert content == \"No such file: data/nonexistent.pdf\"\n\n    def test_parse_file_unsupported_extension(self) -> None:\n        \"\"\"Test parsing a file with unsupported extension.\"\"\"\n        content = parse_file(\"tests/testfiles/file1.txt\")\n        assert \"Unsupported file extension: .txt\" in content\n\n    def test_preview_file_nonexistent(self) -> None:\n        \"\"\"Test previewing a file that doesn't exist.\"\"\"\n        content = preview_file(\"data/nonexistent.pdf\")\n        assert content == \"No such file: data/nonexistent.pdf\"\n\n    def test_preview_file_unsupported_extension(self) -> None:\n        \"\"\"Test previewing a file with unsupported extension.\"\"\"\n        content = preview_file(\"tests/testfiles/file1.txt\")\n        assert \"Unsupported file extension: .txt\" in content\n\n    @pytest.mark.skipif(\n        not os.path.exists(\"data/large_acquisition\"),\n        reason=\"Test documents not generated\"\n    )\n    def test_parse_file_pdf(self) -> None:\n        \"\"\"Test parsing an actual PDF file.\"\"\"\n        # Use one of the generated test PDFs\n        pdf_files = list(Path(\"data/large_acquisition\").glob(\"*.pdf\"))\n        if pdf_files:\n            content = parse_file(str(pdf_files[0]))\n            assert len(content) > 0\n            assert \"Error\" not in content\n\n    @pytest.mark.skipif(\n        not os.path.exists(\"data/large_acquisition\"),\n        reason=\"Test documents not generated\"\n    )\n    def test_preview_file_pdf(self) -> None:\n        \"\"\"Test previewing an actual PDF file.\"\"\"\n        pdf_files = list(Path(\"data/large_acquisition\").glob(\"*.pdf\"))\n        if pdf_files:\n            content = 
preview_file(str(pdf_files[0]), max_chars=500)\n            assert \"=== PREVIEW of\" in content\n            # Preview should be limited\n            assert len(content) < 2000  # Preview + header + truncation message\n\n\nclass TestScanFolder:\n    \"\"\"Tests for scan_folder function.\"\"\"\n    \n    def setup_method(self) -> None:\n        \"\"\"Clear cache before each test.\"\"\"\n        clear_document_cache()\n\n    def test_nonexistent_directory(self) -> None:\n        \"\"\"Test scanning a directory that doesn't exist.\"\"\"\n        result = scan_folder(\"nonexistent/path\")\n        assert result == \"No such directory: nonexistent/path\"\n\n    def test_empty_directory(self) -> None:\n        \"\"\"Test scanning a directory with no supported documents.\"\"\"\n        with tempfile.TemporaryDirectory() as tmpdir:\n            # Create a non-document file\n            Path(tmpdir, \"test.txt\").write_text(\"hello\")\n            result = scan_folder(tmpdir)\n            assert \"No supported documents found\" in result\n\n    @pytest.mark.skipif(\n        not os.path.exists(\"data/large_acquisition\"),\n        reason=\"Test documents not generated\"\n    )\n    def test_scan_folder_with_documents(self) -> None:\n        \"\"\"Test scanning a folder with actual documents.\"\"\"\n        result = scan_folder(\"data/large_acquisition\", max_workers=2)\n        assert \"PARALLEL DOCUMENT SCAN\" in result\n        assert \"Found\" in result\n        assert \"documents\" in result\n\n\nclass TestSupportedExtensions:\n    \"\"\"Tests for supported extensions configuration.\"\"\"\n    \n    def test_supported_extensions_is_frozenset(self) -> None:\n        \"\"\"Verify SUPPORTED_EXTENSIONS is immutable.\"\"\"\n        assert isinstance(SUPPORTED_EXTENSIONS, frozenset)\n    \n    def test_common_extensions_supported(self) -> None:\n        \"\"\"Verify common document extensions are supported.\"\"\"\n        assert \".pdf\" in SUPPORTED_EXTENSIONS\n        
assert \".docx\" in SUPPORTED_EXTENSIONS\n        assert \".md\" in SUPPORTED_EXTENSIONS\n"
  },
  {
    "path": "tests/test_indexing.py",
    "content": "\"\"\"Tests for indexing and schema components.\"\"\"\n\nimport json\nimport time\nfrom dataclasses import dataclass\nfrom pathlib import Path\nfrom unittest.mock import MagicMock, patch\n\nimport fs_explorer.indexing.metadata as metadata_module\nimport fs_explorer.indexing.pipeline as pipeline_module\nfrom fs_explorer.embeddings import EmbeddingProvider\nfrom fs_explorer.indexing.chunker import SmartChunker\nfrom fs_explorer.indexing.metadata import auto_discover_profile, normalize_langextract_profile\nfrom fs_explorer.indexing.pipeline import IndexingPipeline\nfrom fs_explorer.indexing.schema import SchemaDiscovery\nfrom fs_explorer.storage import DuckDBStorage\n\n\ndef test_smart_chunker_overlap() -> None:\n    text = \"A\" * 2500\n    chunker = SmartChunker(chunk_size=1000, overlap=100)\n\n    chunks = chunker.chunk_text(text)\n\n    assert len(chunks) == 3\n    assert chunks[1].start_char == chunks[0].end_char - 100\n    assert chunks[2].start_char == chunks[1].end_char - 100\n\n\ndef test_schema_discovery_from_folder(tmp_path: Path) -> None:\n    folder = tmp_path / \"corpus\"\n    folder.mkdir()\n    (folder / \"01_master_agreement.md\").write_text(\"# agreement\\nprice: $10\")\n    (folder / \"04_risk_report.md\").write_text(\"# report\\nrisk summary\")\n\n    schema = SchemaDiscovery().discover_from_folder(str(folder))\n\n    fields = schema[\"fields\"]\n    field_names = {field[\"name\"] for field in fields}\n    assert \"document_type\" in field_names\n    assert \"mentions_currency\" in field_names\n\n    document_type_field = next(\n        field for field in fields if field[\"name\"] == \"document_type\"\n    )\n    assert \"agreement\" in document_type_field[\"enum\"]\n    assert \"report\" in document_type_field[\"enum\"]\n\n\ndef test_schema_discovery_with_langextract_fields(tmp_path: Path, monkeypatch) -> None:\n    folder = tmp_path / \"corpus\"\n    folder.mkdir()\n    (folder / \"agreement.md\").write_text(\"Purchase price with 
escrow and earnout.\")\n\n    # Mock auto_discover_profile to return the default profile so this test\n    # stays deterministic (auto-discovery would call the real LLM).\n    from fs_explorer.indexing.metadata import default_langextract_profile\n\n    monkeypatch.setattr(\n        \"fs_explorer.indexing.schema.auto_discover_profile\",\n        lambda folder, **kwargs: default_langextract_profile(),\n    )\n\n    schema = SchemaDiscovery().discover_from_folder(\n        str(folder),\n        with_langextract=True,\n    )\n    field_names = {field[\"name\"] for field in schema[\"fields\"]}\n    assert \"lx_enabled\" in field_names\n    assert \"lx_has_earnout\" in field_names\n    assert \"lx_money_mentions\" in field_names\n\n\ndef test_schema_discovery_with_custom_metadata_profile(tmp_path: Path) -> None:\n    folder = tmp_path / \"corpus\"\n    folder.mkdir()\n    (folder / \"notes.md\").write_text(\"Acme Corp retained Jane Doe for diligence.\")\n\n    profile = {\n        \"prompt_description\": \"Extract organizations and people.\",\n        \"fields\": [\n            {\n                \"name\": \"org_names\",\n                \"type\": \"string\",\n                \"source_class\": \"organization\",\n                \"mode\": \"values\",\n            },\n            {\n                \"name\": \"person_count\",\n                \"type\": \"integer\",\n                \"source_class\": \"person\",\n                \"mode\": \"count\",\n            },\n        ],\n    }\n\n    schema = SchemaDiscovery().discover_from_folder(\n        str(folder),\n        with_langextract=True,\n        metadata_profile=profile,\n    )\n    field_names = {field[\"name\"] for field in schema[\"fields\"]}\n    assert \"org_names\" in field_names\n    assert \"person_count\" in field_names\n    assert isinstance(schema.get(\"metadata_profile\"), dict)\n\n\ndef test_indexing_pipeline_indexes_and_marks_deleted(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = 
tmp_path / \"docs\"\n    corpus.mkdir()\n    first = corpus / \"a_agreement.md\"\n    second = corpus / \"b_schedule.md\"\n    first.write_text(\"Purchase price is $45,000,000.\\n\\nSection 1.2\")\n    second.write_text(\"Schedule details.\\n\\nEffective Date: January 1, 2026\")\n\n    # Avoid Docling in this unit test; treat markdown as plain text.\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n\n    db_path = tmp_path / \"index.duckdb\"\n    storage = DuckDBStorage(str(db_path))\n    pipeline = IndexingPipeline(storage=storage)\n\n    first_result = pipeline.index_folder(str(corpus), discover_schema=True)\n    assert first_result.indexed_files == 2\n    assert first_result.skipped_files == 0\n    assert first_result.active_documents == 2\n    assert first_result.schema_used is not None\n    assert storage.count_chunks(corpus_id=first_result.corpus_id) > 0\n\n    hits = storage.search_chunks(\n        corpus_id=first_result.corpus_id,\n        query=\"purchase price\",\n        limit=3,\n    )\n    assert hits\n    top_doc = storage.get_document(doc_id=hits[0][\"doc_id\"])\n    assert top_doc is not None\n    assert \"Purchase price\" in top_doc[\"content\"]\n\n    metadata_hits = storage.search_documents_by_metadata(\n        corpus_id=first_result.corpus_id,\n        filters=[\n            {\n                \"field\": \"document_type\",\n                \"operator\": \"eq\",\n                \"value\": \"agreement\",\n            }\n        ],\n        limit=5,\n    )\n    assert metadata_hits\n    assert any(hit[\"relative_path\"] == \"a_agreement.md\" for hit in metadata_hits)\n    assert all(hit[\"relative_path\"] != \"b_schedule.md\" for hit in metadata_hits)\n\n    second.unlink()\n\n    second_result = pipeline.index_folder(str(corpus))\n    assert second_result.indexed_files == 1\n    assert second_result.active_documents == 1\n\n    all_docs = 
storage.list_documents(\n        corpus_id=first_result.corpus_id,\n        include_deleted=True,\n    )\n    deleted_paths = {doc[\"relative_path\"] for doc in all_docs if doc[\"is_deleted\"]}\n    assert \"b_schedule.md\" in deleted_paths\n\n\ndef test_indexing_pipeline_with_langextract_metadata(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    doc_path = corpus / \"agreement.md\"\n    doc_path.write_text(\"Purchase price and escrow details.\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n    # Use the default profile so the schema includes the expected fields\n    from fs_explorer.indexing.metadata import default_langextract_profile\n\n    monkeypatch.setattr(\n        \"fs_explorer.indexing.schema.auto_discover_profile\",\n        lambda folder, **kwargs: default_langextract_profile(),\n    )\n    monkeypatch.setattr(\n        metadata_module,\n        \"_extract_langextract_metadata\",\n        lambda **_: {\n            \"lx_enabled\": True,\n            \"lx_extraction_count\": 3,\n            \"lx_entity_classes\": \"deal_term,organization\",\n            \"lx_organizations\": \"TechCorp Industries\",\n            \"lx_people\": \"\",\n            \"lx_deal_terms\": \"escrow reserve\",\n            \"lx_money_mentions\": 1,\n            \"lx_date_mentions\": 0,\n            \"lx_has_earnout\": False,\n            \"lx_has_escrow\": True,\n        },\n    )\n\n    storage = DuckDBStorage(str(tmp_path / \"index.duckdb\"))\n    pipeline = IndexingPipeline(storage=storage)\n    result = pipeline.index_folder(\n        str(corpus),\n        discover_schema=True,\n        with_metadata=True,\n    )\n    assert result.indexed_files == 1\n    assert result.schema_used is not None\n\n    docs = storage.list_documents(corpus_id=result.corpus_id, include_deleted=False)\n    assert len(docs) == 1\n    stored = 
storage.get_document(doc_id=docs[0][\"id\"])\n    assert stored is not None\n    metadata = json.loads(stored[\"metadata_json\"])\n    assert metadata[\"lx_enabled\"] is True\n    assert metadata[\"lx_has_escrow\"] is True\n\n    hits = storage.search_documents_by_metadata(\n        corpus_id=result.corpus_id,\n        filters=[{\"field\": \"lx_has_escrow\", \"operator\": \"eq\", \"value\": True}],\n        limit=5,\n    )\n    assert hits\n    assert hits[0][\"relative_path\"] == \"agreement.md\"\n\n\ndef test_indexing_pipeline_reuses_saved_metadata_profile(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    doc_path = corpus / \"custom.md\"\n    doc_path.write_text(\"Acme Corp and Jane Doe signed terms.\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n\n    seen_profiles: list[dict[str, object] | None] = []\n\n    def fake_extract(**kwargs):  # noqa: ANN003\n        seen_profiles.append(kwargs.get(\"profile\"))\n        return {\n            \"org_names\": \"Acme Corp\",\n            \"person_present\": True,\n        }\n\n    monkeypatch.setattr(metadata_module, \"_extract_langextract_metadata\", fake_extract)\n\n    custom_profile = {\n        \"prompt_description\": \"Extract organizations and people.\",\n        \"fields\": [\n            {\n                \"name\": \"org_names\",\n                \"type\": \"string\",\n                \"source_class\": \"organization\",\n                \"mode\": \"values\",\n            },\n            {\n                \"name\": \"person_present\",\n                \"type\": \"boolean\",\n                \"source_class\": \"person\",\n                \"mode\": \"exists\",\n            },\n        ],\n    }\n\n    storage = DuckDBStorage(str(tmp_path / \"index.duckdb\"))\n    pipeline = IndexingPipeline(storage=storage)\n    first_result = pipeline.index_folder(\n      
  str(corpus),\n        discover_schema=True,\n        with_metadata=True,\n        metadata_profile=custom_profile,\n    )\n    assert first_result.indexed_files == 1\n    assert seen_profiles and isinstance(seen_profiles[0], dict)\n\n    second_result = pipeline.index_folder(\n        str(corpus),\n        with_metadata=True,\n    )\n    assert second_result.indexed_files == 1\n    assert len(seen_profiles) >= 2\n    latest_profile = seen_profiles[-1]\n    assert isinstance(latest_profile, dict)\n    fields_obj = latest_profile.get(\"fields\")\n    assert isinstance(fields_obj, list)\n    second_fields = {\n        str(field[\"name\"])\n        for field in fields_obj\n        if isinstance(field, dict) and isinstance(field.get(\"name\"), str)\n    }\n    assert {\"org_names\", \"person_present\"}.issubset(second_fields)\n\n\n# ---------------------------------------------------------------------------\n# Auto-profile generation tests\n# ---------------------------------------------------------------------------\n\n\ndef test_auto_discover_profile_with_mock_llm(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    (corpus / \"contract.md\").write_text(\"TechCorp acquires StartupXYZ for $10M.\")\n    (corpus / \"report.md\").write_text(\"Quarterly revenue report for FY2025.\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n    monkeypatch.setenv(\"GOOGLE_API_KEY\", \"fake-key\")\n\n    llm_response_json = json.dumps(\n        {\n            \"name\": \"test_auto\",\n            \"description\": \"Auto-generated test profile.\",\n            \"prompt_description\": \"Extract key metadata from documents.\",\n            \"fields\": [\n                {\n                    \"name\": \"lx_organizations\",\n                    \"type\": \"string\",\n                    \"description\": \"Organization names.\",\n       
             \"source\": \"entities\",\n                    \"source_classes\": [\"organization\", \"company\"],\n                    \"mode\": \"values\",\n                },\n                {\n                    \"name\": \"lx_money_count\",\n                    \"type\": \"integer\",\n                    \"description\": \"Count of monetary amounts.\",\n                    \"source\": \"entities\",\n                    \"source_classes\": [\"money\"],\n                    \"mode\": \"count\",\n                },\n            ],\n        }\n    )\n\n    mock_response = MagicMock()\n    mock_response.text = llm_response_json\n\n    mock_client_instance = MagicMock()\n    mock_client_instance.models.generate_content.return_value = mock_response\n\n    with patch(\n        \"fs_explorer.indexing.metadata._get_genai_client\",\n        return_value=mock_client_instance,\n    ):\n        profile = auto_discover_profile(str(corpus))\n\n    # Should pass validation\n    normalized = normalize_langextract_profile(profile)\n    field_names = {f[\"name\"] for f in normalized[\"fields\"]}\n    assert \"lx_organizations\" in field_names\n    assert \"lx_money_count\" in field_names\n    # Runtime fields should have been added automatically\n    assert \"lx_enabled\" in field_names\n\n\ndef test_auto_discover_profile_falls_back_on_error(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    (corpus / \"file.md\").write_text(\"Some content.\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n    monkeypatch.setenv(\"GOOGLE_API_KEY\", \"fake-key\")\n\n    with patch(\n        \"fs_explorer.indexing.metadata._get_genai_client\",\n        side_effect=RuntimeError(\"API down\"),\n    ):\n        profile = auto_discover_profile(str(corpus))\n\n    # Should return default profile\n    default_names = {\n        f[\"name\"] for f in 
metadata_module._DEFAULT_LANGEXTRACT_PROFILE[\"fields\"]\n    }\n    got_names = {f[\"name\"] for f in profile[\"fields\"]}\n    assert default_names == got_names\n\n\ndef test_auto_discover_profile_falls_back_without_api_key(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    (corpus / \"file.md\").write_text(\"Some content.\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n    monkeypatch.delenv(\"GOOGLE_API_KEY\", raising=False)\n\n    profile = auto_discover_profile(str(corpus))\n\n    default_names = {\n        f[\"name\"] for f in metadata_module._DEFAULT_LANGEXTRACT_PROFILE[\"fields\"]\n    }\n    got_names = {f[\"name\"] for f in profile[\"fields\"]}\n    assert default_names == got_names\n\n\ndef test_schema_discovery_uses_auto_profile_when_no_explicit_profile(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    (corpus / \"contract.md\").write_text(\"Agreement terms.\")\n\n    # Capture what auto_discover_profile returns (mock it)\n    auto_profile = {\n        \"name\": \"auto_test\",\n        \"description\": \"Auto-generated.\",\n        \"prompt_description\": \"Extract metadata.\",\n        \"fields\": [\n            {\n                \"name\": \"lx_enabled\",\n                \"type\": \"boolean\",\n                \"required\": False,\n                \"description\": \"Whether langextract succeeded.\",\n                \"source\": \"runtime\",\n                \"runtime\": \"enabled\",\n                \"mode\": \"runtime\",\n                \"source_classes\": [],\n                \"contains_any\": [],\n            },\n            {\n                \"name\": \"lx_orgs\",\n                \"type\": \"string\",\n                \"required\": False,\n                \"description\": \"Organizations.\",\n                \"source\": 
\"entities\",\n                \"source_classes\": [\"organization\"],\n                \"mode\": \"values\",\n                \"contains_any\": [],\n            },\n        ],\n    }\n\n    monkeypatch.setattr(\n        \"fs_explorer.indexing.schema.auto_discover_profile\",\n        lambda folder, **kwargs: auto_profile,\n    )\n\n    schema = SchemaDiscovery().discover_from_folder(\n        str(corpus),\n        with_langextract=True,\n        metadata_profile=None,\n    )\n    field_names = {f[\"name\"] for f in schema[\"fields\"]}\n    assert \"lx_orgs\" in field_names\n    assert \"lx_enabled\" in field_names\n    assert schema.get(\"metadata_profile\") == auto_profile\n\n\n# ---------------------------------------------------------------------------\n# Mock embedding helpers for indexing tests\n# ---------------------------------------------------------------------------\n\n\n@dataclass\nclass _FakeEmbedding:\n    values: list[float]\n\n\n@dataclass\nclass _FakeEmbedResult:\n    embeddings: list[_FakeEmbedding]\n\n\nclass _FakeEmbedModels:\n    def embed_content(\n        self, *, model: str, contents: list[str], config: dict\n    ) -> _FakeEmbedResult:\n        dim = config.get(\"output_dimensionality\", 4)\n        return _FakeEmbedResult(\n            embeddings=[\n                _FakeEmbedding(values=[0.1 * i] * dim) for i in range(len(contents))\n            ]\n        )\n\n\nclass _FakeEmbedClient:\n    def __init__(self) -> None:\n        self.models = _FakeEmbedModels()\n\n\n# ---------------------------------------------------------------------------\n# Embedding indexing tests\n# ---------------------------------------------------------------------------\n\n\ndef test_indexing_pipeline_with_embeddings(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    (corpus / \"agreement.md\").write_text(\"Purchase price is $45,000,000.\")\n    (corpus / \"report.md\").write_text(\"Risk register 
summary.\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n\n    db_path = str(tmp_path / \"index.duckdb\")\n    storage = DuckDBStorage(db_path, embedding_dim=4)\n    provider = EmbeddingProvider(client=_FakeEmbedClient(), dim=4)\n    pipeline = IndexingPipeline(storage=storage, embedding_provider=provider)\n\n    result = pipeline.index_folder(str(corpus), discover_schema=True)\n\n    assert result.indexed_files == 2\n    assert result.embeddings_written > 0\n    assert storage.has_embeddings(corpus_id=result.corpus_id)\n\n\ndef test_indexing_pipeline_without_embeddings(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    (corpus / \"agreement.md\").write_text(\"Purchase price.\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n\n    db_path = str(tmp_path / \"index.duckdb\")\n    storage = DuckDBStorage(db_path)\n    pipeline = IndexingPipeline(storage=storage)\n\n    result = pipeline.index_folder(str(corpus), discover_schema=True)\n\n    assert result.embeddings_written == 0\n    assert not storage.has_embeddings(corpus_id=result.corpus_id)\n\n\ndef test_embedding_cascade_on_reindex(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    doc = corpus / \"agreement.md\"\n    doc.write_text(\"Purchase price is $45,000,000.\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n\n    db_path = str(tmp_path / \"index.duckdb\")\n    storage = DuckDBStorage(db_path, embedding_dim=4)\n    provider = EmbeddingProvider(client=_FakeEmbedClient(), dim=4)\n    pipeline = IndexingPipeline(storage=storage, embedding_provider=provider)\n\n    first = pipeline.index_folder(str(corpus), 
discover_schema=True)\n    assert first.embeddings_written > 0\n\n    # Update document and re-index; old embeddings should be replaced\n    doc.write_text(\"Updated purchase price is $50,000,000.\")\n    second = pipeline.index_folder(str(corpus))\n    assert second.embeddings_written > 0\n    assert storage.has_embeddings(corpus_id=second.corpus_id)\n\n\n# ---------------------------------------------------------------------------\n# Parallel metadata extraction tests\n# ---------------------------------------------------------------------------\n\n\ndef test_extract_metadata_batch_returns_correct_metadata(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    (corpus / \"agreement.md\").write_text(\"Purchase price is $45,000,000.\")\n    (corpus / \"report.md\").write_text(\"Risk register summary.\")\n    (corpus / \"schedule.md\").write_text(\"Effective Date: January 1, 2026\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n\n    storage = DuckDBStorage(str(tmp_path / \"index.duckdb\"))\n    pipeline = IndexingPipeline(storage=storage, max_workers=2)\n\n    root = str(corpus)\n    parsed_docs = []\n    import os\n\n    for f in sorted(corpus.iterdir()):\n        content = f.read_text()\n        rel = os.path.relpath(str(f), root)\n        parsed_docs.append((str(f), rel, content))\n\n    metadata_map = pipeline._extract_metadata_batch(\n        parsed_docs=parsed_docs,\n        root_path=root,\n        schema_def=None,\n        with_langextract=False,\n        langextract_profile=None,\n    )\n\n    assert len(metadata_map) == 3\n    assert \"agreement.md\" in metadata_map\n    assert \"report.md\" in metadata_map\n    assert \"schedule.md\" in metadata_map\n\n    # Check heuristic metadata\n    assert metadata_map[\"agreement.md\"][\"mentions_currency\"] is True\n    assert 
metadata_map[\"schedule.md\"][\"mentions_dates\"] is True\n    assert metadata_map[\"report.md\"][\"document_type\"] == \"report\"\n\n\ndef test_extract_metadata_batch_parallel_is_faster_than_sequential(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    for i in range(6):\n        (corpus / f\"doc_{i}.md\").write_text(f\"Document {i} content. Price is ${i}00.\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n\n    delay = 0.1\n    original_extract = metadata_module.extract_metadata\n\n    def slow_extract(**kwargs):\n        time.sleep(delay)\n        return original_extract(**kwargs)\n\n    monkeypatch.setattr(pipeline_module, \"extract_metadata\", slow_extract)\n\n    storage = DuckDBStorage(str(tmp_path / \"index.duckdb\"))\n    pipeline = IndexingPipeline(storage=storage, max_workers=6)\n\n    root = str(corpus)\n    parsed_docs = []\n    import os\n\n    for f in sorted(corpus.iterdir()):\n        content = f.read_text()\n        rel = os.path.relpath(str(f), root)\n        parsed_docs.append((str(f), rel, content))\n\n    start = time.monotonic()\n    metadata_map = pipeline._extract_metadata_batch(\n        parsed_docs=parsed_docs,\n        root_path=root,\n        schema_def=None,\n        with_langextract=False,\n        langextract_profile=None,\n    )\n    elapsed = time.monotonic() - start\n\n    assert len(metadata_map) == 6\n    # 6 docs * 0.1s each = 0.6s sequential; parallel should finish in < 0.4s\n    assert elapsed < 0.4, f\"Parallel extraction too slow: {elapsed:.2f}s\"\n\n\ndef test_parallel_and_sequential_produce_same_results(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    (corpus / \"a.md\").write_text(\"Purchase price is $45,000,000.\")\n    (corpus / \"b.md\").write_text(\"Effective Date: January 1, 2026. 
Risk summary.\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n\n    storage = DuckDBStorage(str(tmp_path / \"index.duckdb\"))\n\n    root = str(corpus)\n    parsed_docs = []\n    import os\n\n    for f in sorted(corpus.iterdir()):\n        content = f.read_text()\n        rel = os.path.relpath(str(f), root)\n        parsed_docs.append((str(f), rel, content))\n\n    # Sequential (max_workers=1)\n    pipeline_seq = IndexingPipeline(storage=storage, max_workers=1)\n    map_seq = pipeline_seq._extract_metadata_batch(\n        parsed_docs=parsed_docs,\n        root_path=root,\n        schema_def=None,\n        with_langextract=False,\n        langextract_profile=None,\n    )\n\n    # Parallel (max_workers=4)\n    pipeline_par = IndexingPipeline(storage=storage, max_workers=4)\n    map_par = pipeline_par._extract_metadata_batch(\n        parsed_docs=parsed_docs,\n        root_path=root,\n        schema_def=None,\n        with_langextract=False,\n        langextract_profile=None,\n    )\n\n    assert map_seq.keys() == map_par.keys()\n    for key in map_seq:\n        assert map_seq[key] == map_par[key], f\"Mismatch for {key}\"\n"
  },
  {
    "path": "tests/test_models.py",
    "content": "from fs_explorer.models import (\n    ToolCallAction,\n    Action,\n    ToolCallArg,\n    GoDeeperAction,\n    StopAction,\n)\n\n\ndef test_tool_call_action_to_tool_args() -> None:\n    tool_call_action = ToolCallAction(\n        tool_name=\"glob\",\n        tool_input=[\n            ToolCallArg(parameter_name=\"directory\", parameter_value=\"tests/testfiles\"),\n            ToolCallArg(parameter_name=\"pattern\", parameter_value=\"file?.*\"),\n        ],\n    )\n    assert tool_call_action.to_fn_args() == {\n        \"directory\": \"tests/testfiles\",\n        \"pattern\": \"file?.*\",\n    }\n\n\ndef test_action_to_action_type() -> None:\n    action = Action(\n        action=ToolCallAction(\n            tool_name=\"glob\",\n            tool_input=[\n                ToolCallArg(\n                    parameter_name=\"directory\", parameter_value=\"tests/testfiles\"\n                ),\n                ToolCallArg(parameter_name=\"pattern\", parameter_value=\"file?.*\"),\n            ],\n        ),\n        reason=\"\",\n    )\n    assert action.to_action_type() == \"toolcall\"\n    action = Action(action=GoDeeperAction(directory=\"tests/testfiles/last\"), reason=\"\")\n    assert action.to_action_type() == \"godeeper\"\n    action = Action(action=StopAction(final_result=\"hello\"), reason=\"\")\n    assert action.to_action_type() == \"stop\"\n"
  },
  {
    "path": "tests/test_search.py",
    "content": "\"\"\"Tests for search filtering and merged retrieval ranking.\"\"\"\n\nfrom __future__ import annotations\n\nimport time\nfrom dataclasses import dataclass\nfrom pathlib import Path\n\nimport fs_explorer.indexing.pipeline as pipeline_module\nimport pytest\n\nfrom fs_explorer.embeddings import EmbeddingProvider\nfrom fs_explorer.indexing.pipeline import IndexingPipeline\nfrom fs_explorer.search import (\n    IndexedQueryEngine,\n    MetadataFilterParseError,\n    parse_metadata_filters,\n)\nfrom fs_explorer.storage import DuckDBStorage\n\n\ndef test_parse_metadata_filters_supports_scalar_and_list_values() -> None:\n    parsed = parse_metadata_filters(\n        \"document_type=agreement and mentions_currency=true, file_size_bytes>=100, \"\n        \"document_type in (agreement, report)\"\n    )\n\n    assert len(parsed) == 4\n    assert parsed[0].field == \"document_type\"\n    assert parsed[0].operator == \"eq\"\n    assert parsed[0].value == \"agreement\"\n    assert parsed[1].field == \"mentions_currency\"\n    assert parsed[1].value is True\n    assert parsed[2].operator == \"gte\"\n    assert parsed[2].value == 100\n    assert parsed[3].operator == \"in\"\n    assert parsed[3].value == [\"agreement\", \"report\"]\n\n\ndef test_parse_metadata_filters_rejects_unknown_schema_fields() -> None:\n    with pytest.raises(MetadataFilterParseError):\n        parse_metadata_filters(\n            \"owner=finance\",\n            allowed_fields={\"document_type\", \"mentions_currency\"},\n        )\n\n\ndef test_indexed_query_engine_unions_semantic_and_metadata_results(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    (corpus / \"a_agreement.md\").write_text(\"Purchase price is $45,000,000.\")\n    (corpus / \"b_report.md\").write_text(\n        \"Risk register and litigation exposure summary.\"\n    )\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda 
file_path: Path(file_path).read_text(),\n    )\n\n    db_path = tmp_path / \"index.duckdb\"\n    storage = DuckDBStorage(str(db_path))\n    result = IndexingPipeline(storage=storage).index_folder(\n        str(corpus), discover_schema=True\n    )\n    engine = IndexedQueryEngine(storage)\n\n    hits = engine.search(\n        corpus_id=result.corpus_id,\n        query=\"purchase price\",\n        filters=\"document_type=report\",\n        limit=5,\n    )\n\n    by_path = {hit.relative_path: hit for hit in hits}\n    assert \"a_agreement.md\" in by_path\n    assert \"b_report.md\" in by_path\n    assert by_path[\"a_agreement.md\"].semantic_score > 0\n    assert by_path[\"b_report.md\"].metadata_score > 0\n\n\nclass _SlowStorage:\n    def search_chunks(self, *, corpus_id: str, query: str, limit: int = 5):  # noqa: ARG002\n        time.sleep(0.3)\n        return [\n            {\n                \"doc_id\": \"doc_semantic\",\n                \"relative_path\": \"a.md\",\n                \"absolute_path\": \"/tmp/a.md\",\n                \"position\": 0,\n                \"text\": \"semantic hit\",\n                \"score\": 3,\n            }\n        ]\n\n    def search_documents_by_metadata(self, *, corpus_id: str, filters, limit: int = 20):  # noqa: ARG002\n        time.sleep(0.3)\n        return [\n            {\n                \"doc_id\": \"doc_metadata\",\n                \"relative_path\": \"b.md\",\n                \"absolute_path\": \"/tmp/b.md\",\n                \"preview_text\": \"metadata hit\",\n                \"metadata_score\": 1,\n            }\n        ]\n\n    def get_active_schema(self, *, corpus_id: str):  # noqa: ARG002\n        return None\n\n\ndef test_indexed_query_engine_executes_semantic_and_metadata_in_parallel() -> None:\n    engine = IndexedQueryEngine(_SlowStorage())\n\n    start = time.perf_counter()\n    hits = engine.search(\n        corpus_id=\"corpus_test\",\n        query=\"test\",\n        filters=\"document_type=agreement\",\n   
     limit=5,\n    )\n    elapsed = time.perf_counter() - start\n\n    assert elapsed < 0.58\n    assert {hit.doc_id for hit in hits} == {\"doc_semantic\", \"doc_metadata\"}\n\n\ndef test_search_enable_semantic_false_returns_only_metadata() -> None:\n    \"\"\"When enable_semantic=False, only metadata results are returned.\"\"\"\n    engine = IndexedQueryEngine(_SlowStorage())\n\n    hits = engine.search(\n        corpus_id=\"corpus_test\",\n        query=\"test\",\n        filters=\"document_type=agreement\",\n        limit=5,\n        enable_semantic=False,\n    )\n\n    assert len(hits) == 1\n    assert hits[0].doc_id == \"doc_metadata\"\n\n\ndef test_search_enable_metadata_false_returns_only_semantic() -> None:\n    \"\"\"When enable_metadata=False, only semantic results are returned.\"\"\"\n    engine = IndexedQueryEngine(_SlowStorage())\n\n    hits = engine.search(\n        corpus_id=\"corpus_test\",\n        query=\"test\",\n        filters=\"document_type=agreement\",\n        limit=5,\n        enable_metadata=False,\n    )\n\n    assert len(hits) == 1\n    assert hits[0].doc_id == \"doc_semantic\"\n\n\ndef test_search_both_disabled_returns_empty() -> None:\n    \"\"\"When both enable_semantic and enable_metadata are False, no results.\"\"\"\n    engine = IndexedQueryEngine(_SlowStorage())\n\n    hits = engine.search(\n        corpus_id=\"corpus_test\",\n        query=\"test\",\n        filters=\"document_type=agreement\",\n        limit=5,\n        enable_semantic=False,\n        enable_metadata=False,\n    )\n\n    assert hits == []\n\n\n# ---------------------------------------------------------------------------\n# Mock embedding helpers\n# ---------------------------------------------------------------------------\n\n\n@dataclass\nclass _FakeEmbedding:\n    values: list[float]\n\n\n@dataclass\nclass _FakeEmbedResult:\n    embeddings: list[_FakeEmbedding]\n\n\nclass _FakeEmbedModels:\n    def embed_content(\n        self, *, model: str, contents: 
list[str], config: dict\n    ) -> _FakeEmbedResult:\n        dim = config.get(\"output_dimensionality\", 4)\n        return _FakeEmbedResult(\n            embeddings=[\n                _FakeEmbedding(values=[0.1 * (i + 1)] * dim)\n                for i in range(len(contents))\n            ]\n        )\n\n\nclass _FakeEmbedClient:\n    def __init__(self) -> None:\n        self.models = _FakeEmbedModels()\n\n\n# ---------------------------------------------------------------------------\n# Vector search tests\n# ---------------------------------------------------------------------------\n\n\ndef test_vector_search_with_pre_stored_embeddings(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    (corpus / \"agreement.md\").write_text(\"Purchase price is $45,000,000.\")\n    (corpus / \"report.md\").write_text(\"Risk register and litigation exposure summary.\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n\n    db_path = str(tmp_path / \"index.duckdb\")\n    storage = DuckDBStorage(db_path, embedding_dim=4)\n    provider = EmbeddingProvider(client=_FakeEmbedClient(), dim=4)\n    pipeline = IndexingPipeline(storage=storage, embedding_provider=provider)\n\n    result = pipeline.index_folder(str(corpus), discover_schema=True)\n    assert result.embeddings_written > 0\n\n    engine = IndexedQueryEngine(storage, embedding_provider=provider)\n    hits = engine.search(\n        corpus_id=result.corpus_id,\n        query=\"purchase price\",\n        limit=5,\n    )\n\n    assert len(hits) > 0\n    # All hits should have float semantic scores from cosine similarity\n    for hit in hits:\n        assert isinstance(hit.semantic_score, float)\n\n\ndef test_keyword_fallback_when_no_embeddings(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    (corpus / 
\"agreement.md\").write_text(\"Purchase price is $45,000,000.\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n\n    db_path = str(tmp_path / \"index.duckdb\")\n    storage = DuckDBStorage(db_path)\n    IndexingPipeline(storage=storage).index_folder(str(corpus), discover_schema=True)\n\n    # Create engine with embedding provider but no embeddings stored\n    provider = EmbeddingProvider(client=_FakeEmbedClient(), dim=4)\n    engine = IndexedQueryEngine(storage, embedding_provider=provider)\n    result_corpus_id = storage.get_corpus_id(str(Path(corpus).resolve()))\n    assert result_corpus_id is not None\n\n    hits = engine.search(\n        corpus_id=result_corpus_id,\n        query=\"purchase price\",\n        limit=5,\n    )\n    # Should still return results via keyword fallback\n    assert len(hits) > 0\n\n\ndef test_get_metadata_field_values_returns_distinct_values(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    (corpus / \"a_agreement.md\").write_text(\"Purchase price is $45,000,000.\")\n    (corpus / \"b_report.md\").write_text(\"Risk register summary.\")\n    (corpus / \"c_agreement.md\").write_text(\"Escrow details for the deal.\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n\n    db_path = tmp_path / \"index.duckdb\"\n    storage = DuckDBStorage(str(db_path))\n    result = IndexingPipeline(storage=storage).index_folder(\n        str(corpus), discover_schema=True\n    )\n\n    values = storage.get_metadata_field_values(\n        corpus_id=result.corpus_id,\n        field_names=[\"document_type\", \"mentions_currency\"],\n    )\n    assert \"document_type\" in values\n    assert \"agreement\" in values[\"document_type\"]\n    assert \"report\" in values[\"document_type\"]\n    assert 
\"mentions_currency\" in values\n\n\ndef test_get_metadata_field_values_empty_corpus(tmp_path: Path) -> None:\n    db_path = tmp_path / \"index.duckdb\"\n    storage = DuckDBStorage(str(db_path))\n    corpus_id = storage.get_or_create_corpus(str(tmp_path / \"empty\"))\n    values = storage.get_metadata_field_values(\n        corpus_id=corpus_id,\n        field_names=[\"document_type\"],\n    )\n    assert values == {\"document_type\": []}\n\n\ndef test_get_metadata_field_values_respects_max_distinct(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    for i in range(5):\n        (corpus / f\"doc_{i:02d}_type{i}.md\").write_text(f\"Content {i}\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n\n    storage = DuckDBStorage(str(tmp_path / \"index.duckdb\"))\n    result = IndexingPipeline(storage=storage).index_folder(\n        str(corpus), discover_schema=True\n    )\n\n    values = storage.get_metadata_field_values(\n        corpus_id=result.corpus_id,\n        field_names=[\"document_type\"],\n        max_distinct=2,\n    )\n    assert len(values[\"document_type\"]) <= 2\n\n\ndef test_semantic_search_includes_field_catalog_on_first_call(\n    tmp_path: Path,\n    monkeypatch,\n) -> None:\n    import fs_explorer.agent as agent_module\n\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    (corpus / \"a_agreement.md\").write_text(\"Purchase price is $45,000,000.\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n\n    db_path = str(tmp_path / \"index.duckdb\")\n    storage = DuckDBStorage(db_path)\n    IndexingPipeline(storage=storage).index_folder(\n        str(corpus), discover_schema=True\n    )\n\n    agent_module.set_index_context(str(corpus), db_path)\n    agent_module.set_search_flags(enable_semantic=True, 
enable_metadata=True)\n    try:\n        first = agent_module.semantic_search(\"purchase price\")\n        assert \"Available filter fields\" in first\n        assert \"document_type\" in first\n\n        second = agent_module.semantic_search(\"purchase price\")\n        assert \"Available filter fields\" not in second\n    finally:\n        agent_module.clear_index_context()\n\n\ndef test_float_scoring_in_ranked_documents() -> None:\n    from fs_explorer.search.ranker import RankedDocument, rank_documents\n\n    docs = [\n        RankedDocument(\n            doc_id=\"d1\",\n            relative_path=\"a.md\",\n            absolute_path=\"/a.md\",\n            position=0,\n            text=\"doc 1\",\n            semantic_score=0.95,\n            metadata_score=1,\n        ),\n        RankedDocument(\n            doc_id=\"d2\",\n            relative_path=\"b.md\",\n            absolute_path=\"/b.md\",\n            position=0,\n            text=\"doc 2\",\n            semantic_score=0.5,\n            metadata_score=2,\n        ),\n    ]\n    ranked = rank_documents(docs, limit=2)\n    assert ranked[0].doc_id == \"d1\"\n    assert ranked[0].combined_score > ranked[1].combined_score\n"
  },
  {
    "path": "tests/test_server_search.py",
    "content": "\"\"\"Tests for the /api/search and /api/index REST endpoints.\"\"\"\n\nfrom __future__ import annotations\n\nfrom pathlib import Path\nfrom unittest.mock import patch\n\nimport fs_explorer.indexing.pipeline as pipeline_module\nimport pytest\nfrom fastapi.testclient import TestClient\n\nfrom fs_explorer.indexing.pipeline import IndexingPipeline\nfrom fs_explorer.server import app\nfrom fs_explorer.storage import DuckDBStorage\n\n\n@pytest.fixture()\ndef indexed_corpus(tmp_path: Path, monkeypatch):\n    \"\"\"Create a small indexed corpus and return (folder, db_path).\"\"\"\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    (corpus / \"agreement.md\").write_text(\"Purchase price is $45,000,000.\")\n    (corpus / \"report.md\").write_text(\"Risk register and litigation exposure summary.\")\n\n    monkeypatch.setattr(\n        pipeline_module,\n        \"parse_file\",\n        lambda file_path: Path(file_path).read_text(),\n    )\n\n    db_path = str(tmp_path / \"index.duckdb\")\n    storage = DuckDBStorage(db_path)\n    IndexingPipeline(storage=storage).index_folder(str(corpus), discover_schema=True)\n    return str(corpus), db_path\n\n\ndef test_search_endpoint_returns_hits(indexed_corpus) -> None:\n    corpus_folder, db_path = indexed_corpus\n    client = TestClient(app)\n\n    response = client.post(\n        \"/api/search\",\n        json={\n            \"corpus_folder\": corpus_folder,\n            \"query\": \"purchase price\",\n            \"db_path\": db_path,\n        },\n    )\n\n    assert response.status_code == 200\n    data = response.json()\n    assert \"hits\" in data\n    assert len(data[\"hits\"]) > 0\n    assert data[\"hits\"][0][\"semantic_score\"] > 0\n\n\ndef test_search_endpoint_with_filters(indexed_corpus) -> None:\n    corpus_folder, db_path = indexed_corpus\n    client = TestClient(app)\n\n    response = client.post(\n        \"/api/search\",\n        json={\n            \"corpus_folder\": corpus_folder,\n            
\"query\": \"litigation\",\n            \"filters\": \"document_type=report\",\n            \"db_path\": db_path,\n        },\n    )\n\n    assert response.status_code == 200\n    data = response.json()\n    assert \"hits\" in data\n\n\ndef test_search_endpoint_missing_index(tmp_path: Path) -> None:\n    corpus = tmp_path / \"empty\"\n    corpus.mkdir()\n    db_path = str(tmp_path / \"nonexistent.duckdb\")\n\n    client = TestClient(app)\n    response = client.post(\n        \"/api/search\",\n        json={\n            \"corpus_folder\": str(corpus),\n            \"query\": \"test\",\n            \"db_path\": db_path,\n        },\n    )\n\n    assert response.status_code in (404, 500)\n\n\ndef test_search_endpoint_invalid_folder() -> None:\n    client = TestClient(app)\n    response = client.post(\n        \"/api/search\",\n        json={\n            \"corpus_folder\": \"/nonexistent/path/abc123\",\n            \"query\": \"test\",\n        },\n    )\n\n    assert response.status_code == 400\n\n\n# ---------------------------------------------------------------------------\n# /api/index/status tests\n# ---------------------------------------------------------------------------\n\n\ndef test_index_status_not_indexed(tmp_path: Path) -> None:\n    corpus = tmp_path / \"empty_folder\"\n    corpus.mkdir()\n    db_path = str(tmp_path / \"nonexistent.duckdb\")\n\n    client = TestClient(app)\n    response = client.get(\n        \"/api/index/status\",\n        params={\"folder\": str(corpus), \"db_path\": db_path},\n    )\n\n    assert response.status_code == 200\n    data = response.json()\n    assert data[\"indexed\"] is False\n\n\ndef test_index_status_after_indexing(indexed_corpus) -> None:\n    corpus_folder, db_path = indexed_corpus\n    client = TestClient(app)\n\n    response = client.get(\n        \"/api/index/status\",\n        params={\"folder\": corpus_folder, \"db_path\": db_path},\n    )\n\n    assert response.status_code == 200\n    data = 
response.json()\n    assert data[\"indexed\"] is True\n    assert data[\"document_count\"] == 2\n    assert data[\"schema_name\"] is not None\n    assert isinstance(data[\"has_metadata\"], bool)\n    assert isinstance(data[\"has_embeddings\"], bool)\n\n\ndef test_index_status_includes_schema_fields(indexed_corpus) -> None:\n    corpus_folder, db_path = indexed_corpus\n    client = TestClient(app)\n\n    response = client.get(\n        \"/api/index/status\",\n        params={\"folder\": corpus_folder, \"db_path\": db_path},\n    )\n\n    assert response.status_code == 200\n    data = response.json()\n    assert \"schema_fields\" in data\n    assert isinstance(data[\"schema_fields\"], list)\n    assert len(data[\"schema_fields\"]) > 0\n    assert \"document_type\" in data[\"schema_fields\"]\n\n\n# ---------------------------------------------------------------------------\n# /api/index/auto-profile tests\n# ---------------------------------------------------------------------------\n\n\ndef test_auto_profile_endpoint(tmp_path: Path) -> None:\n    corpus = tmp_path / \"docs\"\n    corpus.mkdir()\n    (corpus / \"contract.md\").write_text(\"TechCorp acquires StartupXYZ for $10M.\")\n\n    fake_profile = {\n        \"name\": \"test_auto\",\n        \"description\": \"Auto-generated.\",\n        \"prompt_description\": \"Extract metadata.\",\n        \"fields\": [\n            {\n                \"name\": \"lx_organizations\",\n                \"type\": \"string\",\n                \"description\": \"Org names.\",\n                \"source\": \"entities\",\n                \"source_classes\": [\"organization\"],\n                \"mode\": \"values\",\n            }\n        ],\n    }\n\n    client = TestClient(app)\n    with patch(\n        \"fs_explorer.server.auto_discover_profile\",\n        return_value=fake_profile,\n    ):\n        response = client.post(\n            \"/api/index/auto-profile\",\n            json={\"folder\": str(corpus)},\n        )\n\n    assert 
response.status_code == 200\n    data = response.json()\n    assert \"profile\" in data\n    assert data[\"profile\"][\"name\"] == \"test_auto\"\n    field_names = {f[\"name\"] for f in data[\"profile\"][\"fields\"]}\n    assert \"lx_organizations\" in field_names\n\n\ndef test_auto_profile_invalid_folder() -> None:\n    client = TestClient(app)\n    response = client.post(\n        \"/api/index/auto-profile\",\n        json={\"folder\": \"/nonexistent/path/abc123\"},\n    )\n\n    assert response.status_code == 400\n"
  },
  {
    "path": "tests/testfiles/file1.txt",
    "content": "this is a test"
  },
  {
    "path": "tests/testfiles/file2.md",
    "content": "# this is a test!"
  },
  {
    "path": "tests/testfiles/last/lastfile.txt",
    "content": "hello"
  }
]